linux-btrfs.vger.kernel.org archive mirror
* [PATCH 00/10] btrfs: make lseek and fiemap much more efficient
@ 2022-09-01 13:18 fdmanana
  2022-09-01 13:18 ` [PATCH 01/10] btrfs: allow hole and data seeking to be interruptible fdmanana
                   ` (12 more replies)
  0 siblings, 13 replies; 53+ messages in thread
From: fdmanana @ 2022-09-01 13:18 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

We often get reports of fiemap and hole/data seeking (lseek) being too slow
on btrfs, or even unusable in some cases due to being extremely slow.

Some recent reports for fiemap:

    https://lore.kernel.org/linux-btrfs/21dd32c6-f1f9-f44a-466a-e18fdc6788a7@virtuozzo.com/
    https://lore.kernel.org/linux-btrfs/Ysace25wh5BbLd5f@atmark-techno.com/

For lseek (LSF/MM from 2017):

   https://lwn.net/Articles/718805/

Both are slow due to a very high algorithmic complexity which scales badly
with the number of extents in a file and the height of the subvolume and
extent b+trees.

Using Pavel's test case (the first Link tag for fiemap), which uses files
with many 4K extents and holes before and after each extent (kind of a worst
case scenario), the speedup is several orders of magnitude (for the 1G file,
from ~225 seconds down to ~0.1 seconds).
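
For reference, a similar measurement can be taken directly from C by timing
the fiemap ioctl, similar in spirit to running filefrag on the file. This is
just a sketch to illustrate what is being measured, not part of the series,
and the default file path is only an example. With fm_extent_count set to 0
the ioctl only counts extents, but it still walks all of them:

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <time.h>
    #include <unistd.h>
    #include <linux/fs.h>
    #include <linux/fiemap.h>

    int main(int argc, char **argv)
    {
        const char *path = argc > 1 ? argv[1] : "/mnt/test/file"; /* example */
        int fd = open(path, O_RDONLY);

        if (fd < 0) {
            perror("open");
            return 1;
        }

        struct fiemap fm;
        memset(&fm, 0, sizeof(fm));
        fm.fm_start = 0;
        fm.fm_length = FIEMAP_MAX_OFFSET;  /* whole file */
        fm.fm_extent_count = 0;            /* count only, no extent records */

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        if (ioctl(fd, FS_IOC_FIEMAP, &fm) < 0) {
            perror("FS_IOC_FIEMAP");
            return 1;
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ms = (t1.tv_sec - t0.tv_sec) * 1000.0 +
                    (t1.tv_nsec - t0.tv_nsec) / 1000000.0;
        printf("%u extents, fiemap took %.1f milliseconds\n",
               fm.fm_mapped_extents, ms);
        close(fd);
        return 0;
    }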

Finally, the new algorithm for fiemap also ends up fixing a bug in the
current one. The current code relies on extent maps to report extents, and
since extent maps can be merged, we may report 2 different extents as a
single one that is not shared even though one of them is shared (or the
other way around). More details on this in patches 9/10 and 10/10.

Patches 1/10 and 2/10 are for lseek, introducing some code that will later
be used by fiemap too (patch 10/10). More details in the changelogs.

There are a few more things that can be done to speed up fiemap and lseek,
but I'll leave those other optimizations I have in mind for some other time.

Filipe Manana (10):
  btrfs: allow hole and data seeking to be interruptible
  btrfs: make hole and data seeking a lot more efficient
  btrfs: remove check for impossible block start for an extent map at fiemap
  btrfs: remove zero length check when entering fiemap
  btrfs: properly flush delalloc when entering fiemap
  btrfs: allow fiemap to be interruptible
  btrfs: rename btrfs_check_shared() to a more descriptive name
  btrfs: speedup checking for extent sharedness during fiemap
  btrfs: skip unnecessary extent buffer sharedness checks during fiemap
  btrfs: make fiemap more efficient and accurate reporting extent sharedness

 fs/btrfs/backref.c     | 153 ++++++++-
 fs/btrfs/backref.h     |  20 +-
 fs/btrfs/ctree.h       |  22 +-
 fs/btrfs/extent-tree.c |  10 +-
 fs/btrfs/extent_io.c   | 703 ++++++++++++++++++++++++++++-------------
 fs/btrfs/file.c        | 439 +++++++++++++++++++++++--
 fs/btrfs/inode.c       | 146 ++-------
 7 files changed, 1111 insertions(+), 382 deletions(-)

-- 
2.35.1


^ permalink raw reply	[flat|nested] 53+ messages in thread

* [PATCH 01/10] btrfs: allow hole and data seeking to be interruptible
  2022-09-01 13:18 [PATCH 00/10] btrfs: make lseek and fiemap much more efficient fdmanana
@ 2022-09-01 13:18 ` fdmanana
  2022-09-01 13:58   ` Josef Bacik
  2022-09-01 21:49   ` Qu Wenruo
  2022-09-01 13:18 ` [PATCH 02/10] btrfs: make hole and data seeking a lot more efficient fdmanana
                   ` (11 subsequent siblings)
  12 siblings, 2 replies; 53+ messages in thread
From: fdmanana @ 2022-09-01 13:18 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

Doing hole or data seeking on a file with a very large number of extents
can take a long time, and we have reports of it being too slow (such as
at LSFMM from 2017, see the Link below). So make it interruptible.

Link: https://lwn.net/Articles/718805/
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/file.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 0a76ae8b8e96..96f444ad0951 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -3652,6 +3652,10 @@ static loff_t find_desired_extent(struct btrfs_inode *inode, loff_t offset,
 		start = em->start + em->len;
 		free_extent_map(em);
 		em = NULL;
+		if (fatal_signal_pending(current)) {
+			ret = -EINTR;
+			break;
+		}
 		cond_resched();
 	}
 	free_extent_map(em);
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 02/10] btrfs: make hole and data seeking a lot more efficient
  2022-09-01 13:18 [PATCH 00/10] btrfs: make lseek and fiemap much more efficient fdmanana
  2022-09-01 13:18 ` [PATCH 01/10] btrfs: allow hole and data seeking to be interruptible fdmanana
@ 2022-09-01 13:18 ` fdmanana
  2022-09-01 14:03   ` Josef Bacik
                     ` (2 more replies)
  2022-09-01 13:18 ` [PATCH 03/10] btrfs: remove check for impossible block start for an extent map at fiemap fdmanana
                   ` (10 subsequent siblings)
  12 siblings, 3 replies; 53+ messages in thread
From: fdmanana @ 2022-09-01 13:18 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

The current implementation of hole and data seeking for llseek does not
scale well with respect to the number of extents and the distance between
the start offset and the next hole or extent. This is due to its very high
algorithmic complexity. We also often get reports of btrfs' hole and data
seeking (llseek) being too slow, such as at 2017's LSFMM (see the Link
tag at the bottom).

In order to better understand it, let's consider the case where the start
offset is 0, we are seeking for a hole and the file size is 16G. Between
file offset 0 and the first hole in the file there are 100K extents - this
is common for large files, especially if we have compression enabled, since
the maximum extent size is limited to 128K. The steps taken by the main
loop of the current algorithm are the following:

1) We start by calling btrfs_get_extent_fiemap(), for file offset 0, which
   calls btrfs_get_extent(). This will first lookup for an extent map in
   the inode's extent map tree (a red black tree). If the extent map is
   not loaded in memory, then it will do a lookup for the corresponding
   file extent item in the subvolume's b+tree, create an extent map based
   on the contents of the file extent item and then add the extent map to
   the extent map tree of the inode;

2) The second iteration calls btrfs_get_extent_fiemap() again, this time
   with a start offset matching the end offset of the previous extent.
   Again, btrfs_get_extent() will first search the extent map tree, and
   if it doesn't find an extent map there, it will again search in the
   b+tree of the subvolume for a matching file extent item, build an
   extent map based on the file extent item, and add the extent map to
   the extent map tree of the inode;

3) This repeats over and over until we find the first hole (when seeking
   for holes) or until we find the first extent (when seeking for data).

   If no extent maps are loaded in memory, then on each iteration we do
   1 extent map tree search, 1 b+tree search, plus
   1 more extent map tree traversal to insert an extent map - plus we
   allocate memory for the extent map.

   On each iteration we are growing the size of the extent map tree,
   making each future search slower, and also visiting the same b+tree
   leaves over and over again - taking into account that with the default
   leaf size of 16K we can fit more than 200 file extent items in a leaf - so
   we can visit the same b+tree leaf 200+ times, on each visit walking
   down a path from the root to the leaf.

So it's easy to see that what we have now doesn't scale well. Also, it
loads an extent map for every file extent item into memory, which is not
efficient - we should add extent maps only when doing IO (writing or
reading file data).

This change implements a new algorithm which scales much better, and
works like this:

1) We iterate over the subvolume's b+tree, visiting each leaf that has
   file extent items once and only once;

2) For any file extent items found, that don't represent holes or prealloc
   extents, it will not search the extent map tree - there's no need at
   all for that - an extent map is just an in-memory representation of a
   file extent item;

3) When a hole is found, or a prealloc extent, it will check if there's
   delalloc for its range. For this it will search for EXTENT_DELALLOC
   bits in the inode's io tree and check the extent map tree - this is
   for accounting for unflushed delalloc and for flushed delalloc (the
   period between running delalloc and ordered extent completion),
   respectively. This is similar to what the current implementation does
   when it finds a hole or prealloc extent, but without creating extent
   maps and adding them to the extent map tree in case they are not
   loaded in memory;

4) It never allocates extent maps, or adds extent maps to the inode's
   extent map tree. This not only saves memory and time (from the tree
   insertions and allocations), but also eliminates the possibility of
   -ENOMEM due to allocating too many extent maps.

Part of this new code will also be used later for fiemap (which suffers
from similar scalability problems).

The following test example can be used to quickly measure the efficiency
before and after this patch:

    $ cat test-seek-hole.sh
    #!/bin/bash

    DEV=/dev/sdi
    MNT=/mnt/sdi

    mkfs.btrfs -f $DEV

    mount -o compress=lzo $DEV $MNT

    # 16G file -> 131073 compressed extents.
    xfs_io -f -c "pwrite -S 0xab -b 1M 0 16G" $MNT/foobar

    # Leave a 1M hole at file offset 15G.
    xfs_io -c "fpunch 15G 1M" $MNT/foobar

    # Unmount and mount again, so that we can test when there's no
    # metadata cached in memory.
    umount $MNT
    mount -o compress=lzo $DEV $MNT

    # Test seeking for hole from offset 0 (hole is at offset 15G).

    start=$(date +%s%N)
    xfs_io -c "seek -h 0" $MNT/foobar
    end=$(date +%s%N)
    dur=$(( (end - start) / 1000000 ))
    echo "Took $dur milliseconds to seek first hole (metadata not cached)"
    echo

    start=$(date +%s%N)
    xfs_io -c "seek -h 0" $MNT/foobar
    end=$(date +%s%N)
    dur=$(( (end - start) / 1000000 ))
    echo "Took $dur milliseconds to seek first hole (metadata cached)"
    echo

    umount $MNT

Before this change:

    $ ./test-seek-hole.sh
    (...)
    Whence	Result
    HOLE	16106127360
    Took 176 milliseconds to seek first hole (metadata not cached)

    Whence	Result
    HOLE	16106127360
    Took 17 milliseconds to seek first hole (metadata cached)

After this change:

    $ ./test-seek-hole.sh
    (...)
    Whence	Result
    HOLE	16106127360
    Took 43 milliseconds to seek first hole (metadata not cached)

    Whence	Result
    HOLE	16106127360
    Took 13 milliseconds to seek first hole (metadata cached)

That's about 4X faster when no metadata is cached and about 30% faster
when all metadata is cached.

In practice the differences may often be significantly higher, either due
to a higher number of extents in a file or because the subvolume's b+tree
is much bigger than in this example, where we only have one file.
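
For reference, the same first-hole measurement can also be done without
xfs_io, by timing lseek(2) with SEEK_HOLE from a small C program. This is
just a sketch; the default path matches the file created by the script
above and is only an example:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        const char *path = argc > 1 ? argv[1] : "/mnt/sdi/foobar"; /* example */
        int fd = open(path, O_RDONLY);

        if (fd < 0) {
            perror("open");
            return 1;
        }

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        /* Same operation as "xfs_io -c 'seek -h 0'" in the script above. */
        off_t hole = lseek(fd, 0, SEEK_HOLE);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        if (hole < 0) {
            perror("lseek");
            return 1;
        }

        double ms = (t1.tv_sec - t0.tv_sec) * 1000.0 +
                    (t1.tv_nsec - t0.tv_nsec) / 1000000.0;
        printf("first hole at %lld, lseek(SEEK_HOLE) took %.2f milliseconds\n",
               (long long)hole, ms);
        close(fd);
        return 0;
    }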

Link: https://lwn.net/Articles/718805/
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/file.c | 437 ++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 406 insertions(+), 31 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 96f444ad0951..b292a8ada3a4 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -3601,22 +3601,281 @@ static long btrfs_fallocate(struct file *file, int mode,
 	return ret;
 }
 
+/*
+ * Helper for have_delalloc_in_range(). Find a subrange in a given range that
+ * has unflushed and/or flushing delalloc. There might be other adjacent
+ * subranges after the one it found, so have_delalloc_in_range() keeps looping
+ * while it gets adjacent subranges, and merging them together.
+ */
+static bool find_delalloc_subrange(struct btrfs_inode *inode, u64 start, u64 end,
+				   u64 *delalloc_start_ret, u64 *delalloc_end_ret)
+{
+	const u64 len = end + 1 - start;
+	struct extent_map_tree *em_tree = &inode->extent_tree;
+	struct extent_map *em;
+	u64 em_end;
+	u64 delalloc_len;
+
+	/*
+	 * Search the io tree first for EXTENT_DELALLOC. If we find any, it
+	 * means we have delalloc (dirty pages) for which writeback has not
+	 * started yet.
+	 */
+	*delalloc_start_ret = start;
+	delalloc_len = count_range_bits(&inode->io_tree, delalloc_start_ret, end,
+					len, EXTENT_DELALLOC, 1);
+	/*
+	 * If delalloc was found then *delalloc_start_ret has a sector size
+	 * aligned value (rounded down).
+	 */
+	if (delalloc_len > 0)
+		*delalloc_end_ret = *delalloc_start_ret + delalloc_len - 1;
+
+	/*
+	 * Now also check if there's any extent map in the range that does not
+	 * map to a hole or prealloc extent. We do this because:
+	 *
+	 * 1) When delalloc is flushed, the file range is locked, we clear the
+	 *    EXTENT_DELALLOC bit from the io tree and create an extent map for
+	 *    an allocated extent. So we might just have been called after
+	 *    delalloc is flushed and before the ordered extent completes and
+	 *    inserts the new file extent item in the subvolume's btree;
+	 *
+	 * 2) We may have an extent map created by flushing delalloc for a
+	 *    subrange that starts before the subrange we found marked with
+	 *    EXTENT_DELALLOC in the io tree.
+	 */
+	read_lock(&em_tree->lock);
+	em = lookup_extent_mapping(em_tree, start, len);
+	read_unlock(&em_tree->lock);
+
+	/* extent_map_end() returns a non-inclusive end offset. */
+	em_end = em ? extent_map_end(em) : 0;
+
+	/*
+	 * If we have a hole/prealloc extent map, check the next one if this one
+	 * ends before our range's end.
+	 */
+	if (em && (em->block_start == EXTENT_MAP_HOLE ||
+		   test_bit(EXTENT_FLAG_PREALLOC, &em->flags)) && em_end < end) {
+		struct extent_map *next_em;
+
+		read_lock(&em_tree->lock);
+		next_em = lookup_extent_mapping(em_tree, em_end, len - em_end);
+		read_unlock(&em_tree->lock);
+
+		free_extent_map(em);
+		em_end = next_em ? extent_map_end(next_em) : 0;
+		em = next_em;
+	}
+
+	if (em && (em->block_start == EXTENT_MAP_HOLE ||
+		   test_bit(EXTENT_FLAG_PREALLOC, &em->flags))) {
+		free_extent_map(em);
+		em = NULL;
+	}
+
+	/*
+	 * No extent map or one for a hole or prealloc extent. Use the delalloc
+	 * range we found in the io tree if we have one.
+	 */
+	if (!em)
+		return (delalloc_len > 0);
+
+	/*
+	 * We don't have any range as EXTENT_DELALLOC in the io tree, so the
+	 * extent map is the only subrange representing delalloc.
+	 */
+	if (delalloc_len == 0) {
+		*delalloc_start_ret = em->start;
+		*delalloc_end_ret = min(end, em_end - 1);
+		free_extent_map(em);
+		return true;
+	}
+
+	/*
+	 * The extent map represents a delalloc range that starts before the
+	 * delalloc range we found in the io tree.
+	 */
+	if (em->start < *delalloc_start_ret) {
+		*delalloc_start_ret = em->start;
+		/*
+		 * If the ranges are adjacent, return a combined range.
+		 * Otherwise return the extent map's range.
+		 */
+		if (em_end < *delalloc_start_ret)
+			*delalloc_end_ret = min(end, em_end - 1);
+
+		free_extent_map(em);
+		return true;
+	}
+
+	/*
+	 * The extent map starts after the delalloc range we found in the io
+	 * tree. If it's adjacent, return a combined range, otherwise return
+	 * the range found in the io tree.
+	 */
+	if (*delalloc_end_ret + 1 == em->start)
+		*delalloc_end_ret = min(end, em_end - 1);
+
+	free_extent_map(em);
+	return true;
+}
+
+/*
+ * Check if there's delalloc in a given range.
+ *
+ * @inode:               The inode.
+ * @start:               The start offset of the range. It does not need to be
+ *                       sector size aligned.
+ * @end:                 The end offset (inclusive value) of the search range.
+ *                       It does not need to be sector size aligned.
+ * @delalloc_start_ret:  Output argument, set to the start offset of the
+ *                       subrange found with delalloc (may not be sector size
+ *                       aligned).
+ * @delalloc_end_ret:    Output argument, set to the end offset (inclusive value)
+ *                       of the subrange found with delalloc.
+ *
+ * Returns true if a subrange with delalloc is found within the given range, and
+ * if so it sets @delalloc_start_ret and @delalloc_end_ret with the start and
+ * end offsets of the subrange.
+ */
+static bool have_delalloc_in_range(struct btrfs_inode *inode, u64 start, u64 end,
+				   u64 *delalloc_start_ret, u64 *delalloc_end_ret)
+{
+	u64 cur_offset = round_down(start, inode->root->fs_info->sectorsize);
+	u64 prev_delalloc_end = 0;
+	bool ret = false;
+
+	while (cur_offset < end) {
+		u64 delalloc_start;
+		u64 delalloc_end;
+		bool delalloc;
+
+		delalloc = find_delalloc_subrange(inode, cur_offset, end,
+						  &delalloc_start,
+						  &delalloc_end);
+		if (!delalloc)
+			break;
+
+		if (prev_delalloc_end == 0) {
+			/* First subrange found. */
+			*delalloc_start_ret = max(delalloc_start, start);
+			*delalloc_end_ret = delalloc_end;
+			ret = true;
+		} else if (delalloc_start == prev_delalloc_end + 1) {
+			/* Subrange adjacent to the previous one, merge them. */
+			*delalloc_end_ret = delalloc_end;
+		} else {
+			/* Subrange not adjacent to the previous one, exit. */
+			break;
+		}
+
+		prev_delalloc_end = delalloc_end;
+		cur_offset = delalloc_end + 1;
+		cond_resched();
+	}
+
+	return ret;
+}
+
+/*
+ * Check if there's a hole or delalloc range in a range representing a hole (or
+ * prealloc extent) found in the inode's subvolume btree.
+ *
+ * @inode:      The inode.
+ * @whence:     Seek mode (SEEK_DATA or SEEK_HOLE).
+ * @start:      Start offset of the hole region. It does not need to be sector
+ *              size aligned.
+ * @end:        End offset (inclusive value) of the hole region. It does not
+ *              need to be sector size aligned.
+ * @start_ret:  Return parameter, used to set the start of the subrange in the
+ *              hole that matches the search criteria (seek mode), if such
+ *              subrange is found (return value of the function is true).
+ *              The value returned here may not be sector size aligned.
+ *
+ * Returns true if a subrange matching the given seek mode is found, and if one
+ * is found, it updates @start_ret with the start of the subrange.
+ */
+static bool find_desired_extent_in_hole(struct btrfs_inode *inode, int whence,
+					u64 start, u64 end, u64 *start_ret)
+{
+	u64 delalloc_start;
+	u64 delalloc_end;
+	bool delalloc;
+
+	delalloc = have_delalloc_in_range(inode, start, end, &delalloc_start,
+					  &delalloc_end);
+	if (delalloc && whence == SEEK_DATA) {
+		*start_ret = delalloc_start;
+		return true;
+	}
+
+	if (delalloc && whence == SEEK_HOLE) {
+		/*
+		 * We found delalloc but it starts after our start offset. So we
+		 * have a hole between our start offset and the delalloc start.
+		 */
+		if (start < delalloc_start) {
+			*start_ret = start;
+			return true;
+		}
+		/*
+		 * Delalloc range starts at our start offset.
+		 * If the delalloc range's length is smaller than our range,
+		 * then it means we have a hole that starts where the delalloc
+		 * subrange ends.
+		 */
+		if (delalloc_end < end) {
+			*start_ret = delalloc_end + 1;
+			return true;
+		}
+
+		/* There's delalloc for the whole range. */
+		return false;
+	}
+
+	if (!delalloc && whence == SEEK_HOLE) {
+		*start_ret = start;
+		return true;
+	}
+
+	/*
+	 * No delalloc in the range and we are seeking for data. The caller has
+	 * to iterate to the next extent item in the subvolume btree.
+	 */
+	return false;
+}
+
 static loff_t find_desired_extent(struct btrfs_inode *inode, loff_t offset,
 				  int whence)
 {
 	struct btrfs_fs_info *fs_info = inode->root->fs_info;
-	struct extent_map *em = NULL;
 	struct extent_state *cached_state = NULL;
-	loff_t i_size = inode->vfs_inode.i_size;
+	const loff_t i_size = i_size_read(&inode->vfs_inode);
+	const u64 ino = btrfs_ino(inode);
+	struct btrfs_root *root = inode->root;
+	struct btrfs_path *path;
+	struct btrfs_key key;
+	u64 last_extent_end;
 	u64 lockstart;
 	u64 lockend;
 	u64 start;
-	u64 len;
-	int ret = 0;
+	int ret;
+	bool found = false;
 
 	if (i_size == 0 || offset >= i_size)
 		return -ENXIO;
 
+	/*
+	 * Quick path. If the inode has no prealloc extents and its number of
+	 * bytes used matches its i_size, then it can not have holes.
+	 */
+	if (whence == SEEK_HOLE &&
+	    !(inode->flags & BTRFS_INODE_PREALLOC) &&
+	    inode_get_bytes(&inode->vfs_inode) == i_size)
+		return i_size;
+
 	/*
 	 * offset can be negative, in this case we start finding DATA/HOLE from
 	 * the very start of the file.
@@ -3628,49 +3887,165 @@ static loff_t find_desired_extent(struct btrfs_inode *inode, loff_t offset,
 	if (lockend <= lockstart)
 		lockend = lockstart + fs_info->sectorsize;
 	lockend--;
-	len = lockend - lockstart + 1;
+
+	path = btrfs_alloc_path();
+	if (!path)
+		return -ENOMEM;
+	path->reada = READA_FORWARD;
+
+	key.objectid = ino;
+	key.type = BTRFS_EXTENT_DATA_KEY;
+	key.offset = start;
+
+	last_extent_end = lockstart;
 
 	lock_extent_bits(&inode->io_tree, lockstart, lockend, &cached_state);
 
+	ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
+	if (ret < 0) {
+		goto out;
+	} else if (ret > 0 && path->slots[0] > 0) {
+		btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0] - 1);
+		if (key.objectid == ino && key.type == BTRFS_EXTENT_DATA_KEY)
+			path->slots[0]--;
+	}
+
 	while (start < i_size) {
-		em = btrfs_get_extent_fiemap(inode, start, len);
-		if (IS_ERR(em)) {
-			ret = PTR_ERR(em);
-			em = NULL;
-			break;
+		struct extent_buffer *leaf = path->nodes[0];
+		struct btrfs_file_extent_item *extent;
+		u64 extent_end;
+
+		if (path->slots[0] >= btrfs_header_nritems(leaf)) {
+			ret = btrfs_next_leaf(root, path);
+			if (ret < 0)
+				goto out;
+			else if (ret > 0)
+				break;
+
+			leaf = path->nodes[0];
 		}
 
-		if (whence == SEEK_HOLE &&
-		    (em->block_start == EXTENT_MAP_HOLE ||
-		     test_bit(EXTENT_FLAG_PREALLOC, &em->flags)))
-			break;
-		else if (whence == SEEK_DATA &&
-			   (em->block_start != EXTENT_MAP_HOLE &&
-			    !test_bit(EXTENT_FLAG_PREALLOC, &em->flags)))
+		btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
+		if (key.objectid != ino || key.type != BTRFS_EXTENT_DATA_KEY)
 			break;
 
-		start = em->start + em->len;
-		free_extent_map(em);
-		em = NULL;
+		extent_end = btrfs_file_extent_end(path);
+
+		/*
+		 * In the first iteration we may have a slot that points to an
+		 * extent that ends before our start offset, so skip it.
+		 */
+		if (extent_end <= start) {
+			path->slots[0]++;
+			continue;
+		}
+
+		/* We have an implicit hole, NO_HOLES feature is likely set. */
+		if (last_extent_end < key.offset) {
+			u64 search_start = last_extent_end;
+			u64 found_start;
+
+			/*
+			 * First iteration, @start matches @offset and it's
+			 * within the hole.
+			 */
+			if (start == offset)
+				search_start = offset;
+
+			found = find_desired_extent_in_hole(inode, whence,
+							    search_start,
+							    key.offset - 1,
+							    &found_start);
+			if (found) {
+				start = found_start;
+				break;
+			}
+			/*
+			 * Didn't find data or a hole (due to delalloc) in the
+			 * implicit hole range, so need to analyze the extent.
+			 */
+		}
+
+		extent = btrfs_item_ptr(leaf, path->slots[0],
+					struct btrfs_file_extent_item);
+
+		if (btrfs_file_extent_disk_bytenr(leaf, extent) == 0 ||
+		    btrfs_file_extent_type(leaf, extent) ==
+		    BTRFS_FILE_EXTENT_PREALLOC) {
+			/*
+			 * Explicit hole or prealloc extent, search for delalloc.
+			 * A prealloc extent is treated like a hole.
+			 */
+			u64 search_start = key.offset;
+			u64 found_start;
+
+			/*
+			 * First iteration, @start matches @offset and it's
+			 * within the hole.
+			 */
+			if (start == offset)
+				search_start = offset;
+
+			found = find_desired_extent_in_hole(inode, whence,
+							    search_start,
+							    extent_end - 1,
+							    &found_start);
+			if (found) {
+				start = found_start;
+				break;
+			}
+			/*
+			 * Didn't find data or a hole (due to delalloc) in the
+			 * implicit hole range, so need to analyze the next
+			 * extent item.
+			 */
+		} else {
+			/*
+			 * Found a regular or inline extent.
+			 * If we are seeking for data, adjust the start offset
+			 * and stop, we're done.
+			 */
+			if (whence == SEEK_DATA) {
+				start = max_t(u64, key.offset, offset);
+				found = true;
+				break;
+			}
+			/*
+			 * Else, we are seeking for a hole, check the next file
+			 * extent item.
+			 */
+		}
+
+		start = extent_end;
+		last_extent_end = extent_end;
+		path->slots[0]++;
 		if (fatal_signal_pending(current)) {
 			ret = -EINTR;
-			break;
+			goto out;
 		}
 		cond_resched();
 	}
-	free_extent_map(em);
+
+	/* We have an implicit hole from the last extent found up to i_size. */
+	if (!found && start < i_size) {
+		found = find_desired_extent_in_hole(inode, whence, start,
+						    i_size - 1, &start);
+		if (!found)
+			start = i_size;
+	}
+
+out:
 	unlock_extent_cached(&inode->io_tree, lockstart, lockend,
 			     &cached_state);
-	if (ret) {
-		offset = ret;
-	} else {
-		if (whence == SEEK_DATA && start >= i_size)
-			offset = -ENXIO;
-		else
-			offset = min_t(loff_t, start, i_size);
-	}
+	btrfs_free_path(path);
+
+	if (ret < 0)
+		return ret;
+
+	if (whence == SEEK_DATA && start >= i_size)
+		return -ENXIO;
 
-	return offset;
+	return min_t(loff_t, start, i_size);
 }
 
 static loff_t btrfs_file_llseek(struct file *file, loff_t offset, int whence)
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 03/10] btrfs: remove check for impossible block start for an extent map at fiemap
  2022-09-01 13:18 [PATCH 00/10] btrfs: make lseek and fiemap much more efficient fdmanana
  2022-09-01 13:18 ` [PATCH 01/10] btrfs: allow hole and data seeking to be interruptible fdmanana
  2022-09-01 13:18 ` [PATCH 02/10] btrfs: make hole and data seeking a lot more efficient fdmanana
@ 2022-09-01 13:18 ` fdmanana
  2022-09-01 14:03   ` Josef Bacik
  2022-09-01 22:19   ` Qu Wenruo
  2022-09-01 13:18 ` [PATCH 04/10] btrfs: remove zero length check when entering fiemap fdmanana
                   ` (9 subsequent siblings)
  12 siblings, 2 replies; 53+ messages in thread
From: fdmanana @ 2022-09-01 13:18 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

During fiemap we are testing if an extent map has a block start with a
value of EXTENT_MAP_LAST_BYTE, but that is never set on an extent map,
and never was according to git history. So remove that useless check.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/extent_io.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index f57a3e91fc2c..ceb7dfe8d6dc 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -5642,10 +5642,7 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
 		if (off >= max)
 			end = 1;
 
-		if (em->block_start == EXTENT_MAP_LAST_BYTE) {
-			end = 1;
-			flags |= FIEMAP_EXTENT_LAST;
-		} else if (em->block_start == EXTENT_MAP_INLINE) {
+		if (em->block_start == EXTENT_MAP_INLINE) {
 			flags |= (FIEMAP_EXTENT_DATA_INLINE |
 				  FIEMAP_EXTENT_NOT_ALIGNED);
 		} else if (em->block_start == EXTENT_MAP_DELALLOC) {
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 04/10] btrfs: remove zero length check when entering fiemap
  2022-09-01 13:18 [PATCH 00/10] btrfs: make lseek and fiemap much more efficient fdmanana
                   ` (2 preceding siblings ...)
  2022-09-01 13:18 ` [PATCH 03/10] btrfs: remove check for impossible block start for an extent map at fiemap fdmanana
@ 2022-09-01 13:18 ` fdmanana
  2022-09-01 14:04   ` Josef Bacik
  2022-09-01 22:24   ` Qu Wenruo
  2022-09-01 13:18 ` [PATCH 05/10] btrfs: properly flush delalloc " fdmanana
                   ` (8 subsequent siblings)
  12 siblings, 2 replies; 53+ messages in thread
From: fdmanana @ 2022-09-01 13:18 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

There's no point in checking for a 0 length at extent_fiemap(): before
calling it, btrfs_fiemap() calls fiemap_prep(), which already checks for a
zero length and returns the same -EINVAL error. So remove the pointless
check.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/extent_io.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index ceb7dfe8d6dc..6e2143b6fba3 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -5526,9 +5526,6 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
 	u64 em_len = 0;
 	u64 em_end = 0;
 
-	if (len == 0)
-		return -EINVAL;
-
 	path = btrfs_alloc_path();
 	if (!path)
 		return -ENOMEM;
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 05/10] btrfs: properly flush delalloc when entering fiemap
  2022-09-01 13:18 [PATCH 00/10] btrfs: make lseek and fiemap much more efficient fdmanana
                   ` (3 preceding siblings ...)
  2022-09-01 13:18 ` [PATCH 04/10] btrfs: remove zero length check when entering fiemap fdmanana
@ 2022-09-01 13:18 ` fdmanana
  2022-09-01 14:06   ` Josef Bacik
  2022-09-01 22:38   ` Qu Wenruo
  2022-09-01 13:18 ` [PATCH 06/10] btrfs: allow fiemap to be interruptible fdmanana
                   ` (7 subsequent siblings)
  12 siblings, 2 replies; 53+ messages in thread
From: fdmanana @ 2022-09-01 13:18 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

If the flag FIEMAP_FLAG_SYNC is passed to fiemap, it means all delalloc
should be flushed and writeback completed. We call the generic helper
fiemap_prep(), which does a filemap_write_and_wait() in case that flag is
given, however that is not enough if we have compression, because a single
filemap_fdatawrite_range() only starts compression (in an async thread) and
therefore returns before the compression is done and writeback has started.

So make btrfs_fiemap() actually wait for all writeback to start and
complete if FIEMAP_FLAG_SYNC is set. We start and wait for writeback
on the whole possible file range, from 0 to LLONG_MAX, because that is
what the generic code at fiemap_prep() does.
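
For context, this is how userspace ends up requesting this behaviour:
setting FIEMAP_FLAG_SYNC in the fiemap ioctl's fm_flags asks the filesystem
to flush delalloc before reporting extents. A minimal sketch, where the
file path is only an example:

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <linux/fs.h>
    #include <linux/fiemap.h>

    int main(int argc, char **argv)
    {
        const char *path = argc > 1 ? argv[1] : "/mnt/test/file"; /* example */
        int fd = open(path, O_RDONLY);

        if (fd < 0) {
            perror("open");
            return 1;
        }

        /* Room for the fiemap header plus 32 extent records. */
        size_t sz = sizeof(struct fiemap) + 32 * sizeof(struct fiemap_extent);
        struct fiemap *fm = calloc(1, sz);

        if (!fm) {
            perror("calloc");
            return 1;
        }

        fm->fm_start = 0;
        fm->fm_length = FIEMAP_MAX_OFFSET;
        fm->fm_flags = FIEMAP_FLAG_SYNC;  /* ask for delalloc to be flushed */
        fm->fm_extent_count = 32;

        if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0) {
            perror("FS_IOC_FIEMAP");
            return 1;
        }
        printf("%u extents mapped in the first batch\n", fm->fm_mapped_extents);

        free(fm);
        close(fd);
        return 0;
    }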

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/inode.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 493623c81535..2c7d31990777 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8258,6 +8258,26 @@ static int btrfs_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
 	if (ret)
 		return ret;
 
+	/*
+	 * fiemap_prep() called filemap_write_and_wait() for the whole possible
+	 * file range (0 to LLONG_MAX), but that is not enough if we have
+	 * compression enabled. The first filemap_fdatawrite_range() only kicks
+	 * in the compression of data (in an async thread) and will return
+	 * before the compression is done and writeback is started. A second
+	 * filemap_fdatawrite_range() is needed to wait for the compression to
+	 * complete and writeback to start. Without this, our user is very
+	 * likely to get stale results, because the extents and extent maps for
+	 * delalloc regions are only allocated when writeback starts.
+	 */
+	if (fieinfo->fi_flags & FIEMAP_FLAG_SYNC) {
+		ret = btrfs_fdatawrite_range(inode, 0, LLONG_MAX);
+		if (ret)
+			return ret;
+		ret = filemap_fdatawait_range(inode->i_mapping, 0, LLONG_MAX);
+		if (ret)
+			return ret;
+	}
+
 	return extent_fiemap(BTRFS_I(inode), fieinfo, start, len);
 }
 
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 06/10] btrfs: allow fiemap to be interruptible
  2022-09-01 13:18 [PATCH 00/10] btrfs: make lseek and fiemap much more efficient fdmanana
                   ` (4 preceding siblings ...)
  2022-09-01 13:18 ` [PATCH 05/10] btrfs: properly flush delalloc " fdmanana
@ 2022-09-01 13:18 ` fdmanana
  2022-09-01 14:07   ` Josef Bacik
  2022-09-01 22:42   ` Qu Wenruo
  2022-09-01 13:18 ` [PATCH 07/10] btrfs: rename btrfs_check_shared() to a more descriptive name fdmanana
                   ` (6 subsequent siblings)
  12 siblings, 2 replies; 53+ messages in thread
From: fdmanana @ 2022-09-01 13:18 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

Doing fiemap on a file with a very large number of extents can take a very
long time, and we have reports of it being too slow (two recent examples
in the Link tags below), so make it interruptible.

Link: https://lore.kernel.org/linux-btrfs/21dd32c6-f1f9-f44a-466a-e18fdc6788a7@virtuozzo.com/
Link: https://lore.kernel.org/linux-btrfs/Ysace25wh5BbLd5f@atmark-techno.com/
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/extent_io.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 6e2143b6fba3..1260038eb47d 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -5694,6 +5694,11 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
 				ret = 0;
 			goto out_free;
 		}
+
+		if (fatal_signal_pending(current)) {
+			ret = -EINTR;
+			goto out_free;
+		}
 	}
 out_free:
 	if (!ret)
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 07/10] btrfs: rename btrfs_check_shared() to a more descriptive name
  2022-09-01 13:18 [PATCH 00/10] btrfs: make lseek and fiemap much more efficient fdmanana
                   ` (5 preceding siblings ...)
  2022-09-01 13:18 ` [PATCH 06/10] btrfs: allow fiemap to be interruptible fdmanana
@ 2022-09-01 13:18 ` fdmanana
  2022-09-01 14:08   ` Josef Bacik
  2022-09-01 22:45   ` Qu Wenruo
  2022-09-01 13:18 ` [PATCH 08/10] btrfs: speedup checking for extent sharedness during fiemap fdmanana
                   ` (5 subsequent siblings)
  12 siblings, 2 replies; 53+ messages in thread
From: fdmanana @ 2022-09-01 13:18 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

The function btrfs_check_shared() is supposed to be used to check if a
data extent is shared, but its name is too generic and may easily cause
confusion, in the sense that it may seem usable for metadata extents too.

So rename it to btrfs_is_data_extent_shared(), which will also make it
less confusing after the next change that adds a backref lookup cache for
the b+tree nodes that lead to the leaf that contains the file extent item
that points to the target data extent.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/backref.c   | 8 ++++----
 fs/btrfs/backref.h   | 4 ++--
 fs/btrfs/extent_io.c | 5 +++--
 3 files changed, 9 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
index d385357e19b6..e2ac10a695b6 100644
--- a/fs/btrfs/backref.c
+++ b/fs/btrfs/backref.c
@@ -1512,7 +1512,7 @@ int btrfs_find_all_roots(struct btrfs_trans_handle *trans,
 }
 
 /**
- * Check if an extent is shared or not
+ * Check if a data extent is shared or not.
  *
  * @root:   root inode belongs to
  * @inum:   inode number of the inode whose extent we are checking
@@ -1520,7 +1520,7 @@ int btrfs_find_all_roots(struct btrfs_trans_handle *trans,
  * @roots:  list of roots this extent is shared among
  * @tmp:    temporary list used for iteration
  *
- * btrfs_check_shared uses the backref walking code but will short
+ * btrfs_is_data_extent_shared uses the backref walking code but will short
  * circuit as soon as it finds a root or inode that doesn't match the
  * one passed in. This provides a significant performance benefit for
  * callers (such as fiemap) which want to know whether the extent is
@@ -1531,8 +1531,8 @@ int btrfs_find_all_roots(struct btrfs_trans_handle *trans,
  *
  * Return: 0 if extent is not shared, 1 if it is shared, < 0 on error.
  */
-int btrfs_check_shared(struct btrfs_root *root, u64 inum, u64 bytenr,
-		struct ulist *roots, struct ulist *tmp)
+int btrfs_is_data_extent_shared(struct btrfs_root *root, u64 inum, u64 bytenr,
+				struct ulist *roots, struct ulist *tmp)
 {
 	struct btrfs_fs_info *fs_info = root->fs_info;
 	struct btrfs_trans_handle *trans;
diff --git a/fs/btrfs/backref.h b/fs/btrfs/backref.h
index 2759de7d324c..08354394b1bb 100644
--- a/fs/btrfs/backref.h
+++ b/fs/btrfs/backref.h
@@ -62,8 +62,8 @@ int btrfs_find_one_extref(struct btrfs_root *root, u64 inode_objectid,
 			  u64 start_off, struct btrfs_path *path,
 			  struct btrfs_inode_extref **ret_extref,
 			  u64 *found_off);
-int btrfs_check_shared(struct btrfs_root *root, u64 inum, u64 bytenr,
-		struct ulist *roots, struct ulist *tmp_ulist);
+int btrfs_is_data_extent_shared(struct btrfs_root *root, u64 inum, u64 bytenr,
+				struct ulist *roots, struct ulist *tmp);
 
 int __init btrfs_prelim_ref_init(void);
 void __cold btrfs_prelim_ref_exit(void);
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 1260038eb47d..a47710516ecf 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -5656,8 +5656,9 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
 			 * then we're just getting a count and we can skip the
 			 * lookup stuff.
 			 */
-			ret = btrfs_check_shared(root, btrfs_ino(inode),
-						 bytenr, roots, tmp_ulist);
+			ret = btrfs_is_data_extent_shared(root, btrfs_ino(inode),
+							  bytenr, roots,
+							  tmp_ulist);
 			if (ret < 0)
 				goto out_free;
 			if (ret)
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 08/10] btrfs: speedup checking for extent sharedness during fiemap
  2022-09-01 13:18 [PATCH 00/10] btrfs: make lseek and fiemap much more efficient fdmanana
                   ` (6 preceding siblings ...)
  2022-09-01 13:18 ` [PATCH 07/10] btrfs: rename btrfs_check_shared() to a more descriptive name fdmanana
@ 2022-09-01 13:18 ` fdmanana
  2022-09-01 14:23   ` Josef Bacik
  2022-09-01 22:50   ` Qu Wenruo
  2022-09-01 13:18 ` [PATCH 09/10] btrfs: skip unnecessary extent buffer sharedness checks " fdmanana
                   ` (4 subsequent siblings)
  12 siblings, 2 replies; 53+ messages in thread
From: fdmanana @ 2022-09-01 13:18 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

One of the most expensive tasks performed during fiemap is to check if
an extent is shared. This task has two major steps:

1) Check if the data extent is shared. This implies checking the extent
   item in the extent tree, checking delayed references, etc. If we
   find the data extent is directly shared, we terminate immediately;

2) If the data extent is not directly shared (its extent item has a
   refcount of 1), then it may be shared if we have snapshots that share
   subtrees of the inode's subvolume b+tree. So we check if the leaf
   containing the file extent item is shared, then its parent node, then
   the parent node of the parent node, etc, until we reach the root node
   or we find one of them is shared - in which case we stop immediately.

During fiemap we process the extents of a file from left to right, from
file offset 0 to EOF. This means that we iterate b+tree leaves from left
to right, which has the implication that we keep repeating that second step
above several times for the same b+tree path of the inode's subvolume
b+tree.

For example, if we have two file extent items in leaf X, and the path to
leaf X is A -> B -> C -> X, then when we try to determine if the data
extent referenced by the first extent item is shared, we check if the data
extent is shared - if it's not, then we check if leaf X is shared, if not,
then we check if node C is shared, if not, then check if node B is shared,
if not then check if node A is shared. When we move to the next file
extent item, after determining the data extent is not shared, we repeat
the checks for X, C, B and A - doing all the expensive searches in the
extent tree, delayed refs, etc. If we have thousands of file extents, then
we keep repeating the sharedness checks for the same paths over and over.

On a file that has no shared extents or only a small portion, it's easy
to see that this scales terribly with the number of extents in the file
and the sizes of the extent and subvolume b+trees.

This change eliminates the repeated sharedness check on extent buffers
by caching the results of the last path used. The results can be used as
long as no snapshots were created since they were cached (for not shared
extent buffers) or no roots were dropped since they were cached (for
shared extent buffers). This greatly reduces the time spent by fiemap for
files with thousands of extents and/or large extent and subvolume b+trees.
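
As a reference, the following is a small, self-contained userspace model of
the caching rules implemented by lookup_backref_shared_cache() and
store_backref_shared_cache() in the diff below. The names cache_lookup(),
cache_store() and MAX_LEVEL are illustrative only, but the validity rules
(a "not shared" result is invalidated by a new snapshot, a "shared" result
by a root drop) mirror the patch:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define MAX_LEVEL 8  /* stands in for BTRFS_MAX_LEVEL */

    struct cache_entry {
        uint64_t bytenr;    /* extent buffer the cached result refers to */
        uint64_t gen;       /* generation the result was cached at */
        bool is_shared;
    };

    struct shared_cache {
        struct cache_entry entries[MAX_LEVEL];  /* one slot per tree level */
    };

    /*
     * A cached "not shared" result is valid only while no new snapshot was
     * created; a cached "shared" result only while no root was dropped.
     */
    static bool cache_lookup(const struct shared_cache *cache, int level,
                             uint64_t bytenr, uint64_t last_snapshot_gen,
                             uint64_t last_root_drop_gen, bool *is_shared)
    {
        const struct cache_entry *e = &cache->entries[level];

        if (e->bytenr != bytenr)
            return false;
        if (!e->is_shared && e->gen != last_snapshot_gen)
            return false;
        if (e->is_shared && e->gen != last_root_drop_gen)
            return false;
        *is_shared = e->is_shared;
        return true;
    }

    static void cache_store(struct shared_cache *cache, int level,
                            uint64_t bytenr, bool is_shared,
                            uint64_t last_snapshot_gen,
                            uint64_t last_root_drop_gen)
    {
        struct cache_entry *e = &cache->entries[level];

        e->bytenr = bytenr;
        e->is_shared = is_shared;
        e->gen = is_shared ? last_root_drop_gen : last_snapshot_gen;
        /* A shared node implies everything below it on the path is shared. */
        if (is_shared) {
            for (int i = 0; i < level; i++) {
                cache->entries[i].is_shared = true;
                cache->entries[i].gen = e->gen;
            }
        }
    }

    int main(void)
    {
        struct shared_cache cache = { 0 };
        bool shared;

        /* Miss: nothing cached yet for the leaf (level 0) at bytenr 4096. */
        printf("%d\n", cache_lookup(&cache, 0, 4096, 100, 90, &shared));

        /* Cache "not shared" and hit it while no new snapshot was taken. */
        cache_store(&cache, 0, 4096, false, 100, 90);
        printf("%d\n", cache_lookup(&cache, 0, 4096, 100, 90, &shared));

        /* A new snapshot bumps the last snapshot generation: stale result. */
        printf("%d\n", cache_lookup(&cache, 0, 4096, 101, 90, &shared));

        return 0;
    }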

Example performance test:

    $ cat fiemap-perf-test.sh
    #!/bin/bash

    DEV=/dev/sdi
    MNT=/mnt/sdi

    mkfs.btrfs -f $DEV
    mount -o compress=lzo $DEV $MNT

    # 40G gives 327680 128K file extents (due to compression).
    xfs_io -f -c "pwrite -S 0xab -b 1M 0 40G" $MNT/foobar

    umount $MNT
    mount -o compress=lzo $DEV $MNT

    start=$(date +%s%N)
    filefrag $MNT/foobar
    end=$(date +%s%N)
    dur=$(( (end - start) / 1000000 ))
    echo "fiemap took $dur milliseconds (metadata not cached)"

    start=$(date +%s%N)
    filefrag $MNT/foobar
    end=$(date +%s%N)
    dur=$(( (end - start) / 1000000 ))
    echo "fiemap took $dur milliseconds (metadata cached)"

    umount $MNT

Before this patch:

    $ ./fiemap-perf-test.sh
    (...)
    /mnt/sdi/foobar: 327680 extents found
    fiemap took 3597 milliseconds (metadata not cached)
    /mnt/sdi/foobar: 327680 extents found
    fiemap took 2107 milliseconds (metadata cached)

After this patch:

    $ ./fiemap-perf-test.sh
    (...)
    /mnt/sdi/foobar: 327680 extents found
    fiemap took 1646 milliseconds (metadata not cached)
    /mnt/sdi/foobar: 327680 extents found
    fiemap took 698 milliseconds (metadata cached)

That's about 2.2x faster when no metadata is cached, and about 3x faster
when all metadata is cached. On a real filesystem with many other files,
data, directories, etc, the b+trees will be 2 or 3 levels higher, so this
optimization will have an even higher impact.

Reports of slow fiemap show up often; the two Link tags below refer to two
recent such reports. This patch, together with the next ones in the series,
is meant to address that.

Link: https://lore.kernel.org/linux-btrfs/21dd32c6-f1f9-f44a-466a-e18fdc6788a7@virtuozzo.com/
Link: https://lore.kernel.org/linux-btrfs/Ysace25wh5BbLd5f@atmark-techno.com/
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/backref.c     | 122 ++++++++++++++++++++++++++++++++++++++++-
 fs/btrfs/backref.h     |  17 +++++-
 fs/btrfs/ctree.h       |  18 ++++++
 fs/btrfs/extent-tree.c |  10 +++-
 fs/btrfs/extent_io.c   |  11 ++--
 5 files changed, 170 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
index e2ac10a695b6..40b48abb6978 100644
--- a/fs/btrfs/backref.c
+++ b/fs/btrfs/backref.c
@@ -1511,6 +1511,105 @@ int btrfs_find_all_roots(struct btrfs_trans_handle *trans,
 	return ret;
 }
 
+/*
+ * The caller has joined a transaction or is holding a read lock on the
+ * fs_info->commit_root_sem semaphore, so no need to worry about the root's last
+ * snapshot field changing while updating or checking the cache.
+ */
+static bool lookup_backref_shared_cache(struct btrfs_backref_shared_cache *cache,
+					struct btrfs_root *root,
+					u64 bytenr, int level, bool *is_shared)
+{
+	struct btrfs_backref_shared_cache_entry *entry;
+
+	if (WARN_ON_ONCE(level >= BTRFS_MAX_LEVEL))
+		return false;
+
+	/*
+	 * Level -1 is used for the data extent, which is not reliable to cache
+	 * because its reference count can increase or decrease without us
+	 * realizing. We cache results only for extent buffers that lead from
+	 * the root node down to the leaf with the file extent item.
+	 */
+	ASSERT(level >= 0);
+
+	entry = &cache->entries[level];
+
+	/* Unused cache entry or being used for some other extent buffer. */
+	if (entry->bytenr != bytenr)
+		return false;
+
+	/*
+	 * We cached a false result, but the last snapshot generation of the
+	 * root changed, so we now have a snapshot. Don't trust the result.
+	 */
+	if (!entry->is_shared &&
+	    entry->gen != btrfs_root_last_snapshot(&root->root_item))
+		return false;
+
+	/*
+	 * If we cached a true result and the last generation used for dropping
+	 * a root changed, we can not trust the result, because the dropped root
+	 * could be a snapshot sharing this extent buffer.
+	 */
+	if (entry->is_shared &&
+	    entry->gen != btrfs_get_last_root_drop_gen(root->fs_info))
+		return false;
+
+	*is_shared = entry->is_shared;
+
+	return true;
+}
+
+/*
+ * The caller has joined a transaction or is holding a read lock on the
+ * fs_info->commit_root_sem semaphore, so no need to worry about the root's last
+ * snapshot field changing while updating or checking the cache.
+ */
+static void store_backref_shared_cache(struct btrfs_backref_shared_cache *cache,
+				       struct btrfs_root *root,
+				       u64 bytenr, int level, bool is_shared)
+{
+	struct btrfs_backref_shared_cache_entry *entry;
+	u64 gen;
+
+	if (WARN_ON_ONCE(level >= BTRFS_MAX_LEVEL))
+		return;
+
+	/*
+	 * Level -1 is used for the data extent, which is not reliable to cache
+	 * because its reference count can increase or decrease without us
+	 * realizing. We cache results only for extent buffers that lead from
+	 * the root node down to the leaf with the file extent item.
+	 */
+	ASSERT(level >= 0);
+
+	if (is_shared)
+		gen = btrfs_get_last_root_drop_gen(root->fs_info);
+	else
+		gen = btrfs_root_last_snapshot(&root->root_item);
+
+	entry = &cache->entries[level];
+	entry->bytenr = bytenr;
+	entry->is_shared = is_shared;
+	entry->gen = gen;
+
+	/*
+	 * If we found an extent buffer is shared, set the cache result for all
+	 * extent buffers below it to true. As nodes in the path are COWed,
+	 * their sharedness is moved to their children, and if a leaf is COWed,
+	 * then the sharedness of a data extent becomes direct, the refcount of
+	 * data extent is increased in the extent item at the extent tree.
+	 */
+	if (is_shared) {
+		for (int i = 0; i < level; i++) {
+			entry = &cache->entries[i];
+			entry->is_shared = is_shared;
+			entry->gen = gen;
+		}
+	}
+}
+
 /**
  * Check if a data extent is shared or not.
  *
@@ -1519,6 +1618,7 @@ int btrfs_find_all_roots(struct btrfs_trans_handle *trans,
  * @bytenr: logical bytenr of the extent we are checking
  * @roots:  list of roots this extent is shared among
  * @tmp:    temporary list used for iteration
+ * @cache:  a backref lookup result cache
  *
  * btrfs_is_data_extent_shared uses the backref walking code but will short
  * circuit as soon as it finds a root or inode that doesn't match the
@@ -1532,7 +1632,8 @@ int btrfs_find_all_roots(struct btrfs_trans_handle *trans,
  * Return: 0 if extent is not shared, 1 if it is shared, < 0 on error.
  */
 int btrfs_is_data_extent_shared(struct btrfs_root *root, u64 inum, u64 bytenr,
-				struct ulist *roots, struct ulist *tmp)
+				struct ulist *roots, struct ulist *tmp,
+				struct btrfs_backref_shared_cache *cache)
 {
 	struct btrfs_fs_info *fs_info = root->fs_info;
 	struct btrfs_trans_handle *trans;
@@ -1545,6 +1646,7 @@ int btrfs_is_data_extent_shared(struct btrfs_root *root, u64 inum, u64 bytenr,
 		.inum = inum,
 		.share_count = 0,
 	};
+	int level;
 
 	ulist_init(roots);
 	ulist_init(tmp);
@@ -1561,22 +1663,40 @@ int btrfs_is_data_extent_shared(struct btrfs_root *root, u64 inum, u64 bytenr,
 		btrfs_get_tree_mod_seq(fs_info, &elem);
 	}
 
+	/* -1 means we are in the bytenr of the data extent. */
+	level = -1;
 	ULIST_ITER_INIT(&uiter);
 	while (1) {
+		bool is_shared;
+		bool cached;
+
 		ret = find_parent_nodes(trans, fs_info, bytenr, elem.seq, tmp,
 					roots, NULL, &shared, false);
 		if (ret == BACKREF_FOUND_SHARED) {
 			/* this is the only condition under which we return 1 */
 			ret = 1;
+			if (level >= 0)
+				store_backref_shared_cache(cache, root, bytenr,
+							   level, true);
 			break;
 		}
 		if (ret < 0 && ret != -ENOENT)
 			break;
 		ret = 0;
+		if (level >= 0)
+			store_backref_shared_cache(cache, root, bytenr,
+						   level, false);
 		node = ulist_next(tmp, &uiter);
 		if (!node)
 			break;
 		bytenr = node->val;
+		level++;
+		cached = lookup_backref_shared_cache(cache, root, bytenr, level,
+						     &is_shared);
+		if (cached) {
+			ret = is_shared ? 1 : 0;
+			break;
+		}
 		shared.share_count = 0;
 		cond_resched();
 	}
diff --git a/fs/btrfs/backref.h b/fs/btrfs/backref.h
index 08354394b1bb..797ba5371d55 100644
--- a/fs/btrfs/backref.h
+++ b/fs/btrfs/backref.h
@@ -17,6 +17,20 @@ struct inode_fs_paths {
 	struct btrfs_data_container	*fspath;
 };
 
+struct btrfs_backref_shared_cache_entry {
+	u64 bytenr;
+	u64 gen;
+	bool is_shared;
+};
+
+struct btrfs_backref_shared_cache {
+	/*
+	 * A path from a root to a leaf that has a file extent item pointing to
+	 * a given data extent should never exceed the maximum b+tree height.
+	 */
+	struct btrfs_backref_shared_cache_entry entries[BTRFS_MAX_LEVEL];
+};
+
 typedef int (iterate_extent_inodes_t)(u64 inum, u64 offset, u64 root,
 		void *ctx);
 
@@ -63,7 +77,8 @@ int btrfs_find_one_extref(struct btrfs_root *root, u64 inode_objectid,
 			  struct btrfs_inode_extref **ret_extref,
 			  u64 *found_off);
 int btrfs_is_data_extent_shared(struct btrfs_root *root, u64 inum, u64 bytenr,
-				struct ulist *roots, struct ulist *tmp);
+				struct ulist *roots, struct ulist *tmp,
+				struct btrfs_backref_shared_cache *cache);
 
 int __init btrfs_prelim_ref_init(void);
 void __cold btrfs_prelim_ref_exit(void);
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 3dc30f5e6fd0..f7fe7f633eb5 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1095,6 +1095,13 @@ struct btrfs_fs_info {
 	/* Updates are not protected by any lock */
 	struct btrfs_commit_stats commit_stats;
 
+	/*
+	 * Last generation where we dropped a non-relocation root.
+	 * Use btrfs_set_last_root_drop_gen() and btrfs_get_last_root_drop_gen()
+	 * to change it and to read it, respectively.
+	 */
+	u64 last_root_drop_gen;
+
 	/*
 	 * Annotations for transaction events (structures are empty when
 	 * compiled without lockdep).
@@ -1119,6 +1126,17 @@ struct btrfs_fs_info {
 #endif
 };
 
+static inline void btrfs_set_last_root_drop_gen(struct btrfs_fs_info *fs_info,
+						u64 gen)
+{
+	WRITE_ONCE(fs_info->last_root_drop_gen, gen);
+}
+
+static inline u64 btrfs_get_last_root_drop_gen(const struct btrfs_fs_info *fs_info)
+{
+	return READ_ONCE(fs_info->last_root_drop_gen);
+}
+
 static inline struct btrfs_fs_info *btrfs_sb(struct super_block *sb)
 {
 	return sb->s_fs_info;
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index bcd0e72cded3..9818285dface 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -5635,6 +5635,8 @@ static noinline int walk_up_tree(struct btrfs_trans_handle *trans,
  */
 int btrfs_drop_snapshot(struct btrfs_root *root, int update_ref, int for_reloc)
 {
+	const bool is_reloc_root = (root->root_key.objectid ==
+				    BTRFS_TREE_RELOC_OBJECTID);
 	struct btrfs_fs_info *fs_info = root->fs_info;
 	struct btrfs_path *path;
 	struct btrfs_trans_handle *trans;
@@ -5794,6 +5796,9 @@ int btrfs_drop_snapshot(struct btrfs_root *root, int update_ref, int for_reloc)
 				goto out_end_trans;
 			}
 
+			if (!is_reloc_root)
+				btrfs_set_last_root_drop_gen(fs_info, trans->transid);
+
 			btrfs_end_transaction_throttle(trans);
 			if (!for_reloc && btrfs_need_cleaner_sleep(fs_info)) {
 				btrfs_debug(fs_info,
@@ -5828,7 +5833,7 @@ int btrfs_drop_snapshot(struct btrfs_root *root, int update_ref, int for_reloc)
 		goto out_end_trans;
 	}
 
-	if (root->root_key.objectid != BTRFS_TREE_RELOC_OBJECTID) {
+	if (!is_reloc_root) {
 		ret = btrfs_find_root(tree_root, &root->root_key, path,
 				      NULL, NULL);
 		if (ret < 0) {
@@ -5860,6 +5865,9 @@ int btrfs_drop_snapshot(struct btrfs_root *root, int update_ref, int for_reloc)
 		btrfs_put_root(root);
 	root_dropped = true;
 out_end_trans:
+	if (!is_reloc_root)
+		btrfs_set_last_root_drop_gen(fs_info, trans->transid);
+
 	btrfs_end_transaction_throttle(trans);
 out_free:
 	kfree(wc);
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index a47710516ecf..781436cc373c 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -5519,6 +5519,7 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
 	struct btrfs_path *path;
 	struct btrfs_root *root = inode->root;
 	struct fiemap_cache cache = { 0 };
+	struct btrfs_backref_shared_cache *backref_cache;
 	struct ulist *roots;
 	struct ulist *tmp_ulist;
 	int end = 0;
@@ -5526,13 +5527,11 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
 	u64 em_len = 0;
 	u64 em_end = 0;
 
+	backref_cache = kzalloc(sizeof(*backref_cache), GFP_KERNEL);
 	path = btrfs_alloc_path();
-	if (!path)
-		return -ENOMEM;
-
 	roots = ulist_alloc(GFP_KERNEL);
 	tmp_ulist = ulist_alloc(GFP_KERNEL);
-	if (!roots || !tmp_ulist) {
+	if (!backref_cache || !path || !roots || !tmp_ulist) {
 		ret = -ENOMEM;
 		goto out_free_ulist;
 	}
@@ -5658,7 +5657,8 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
 			 */
 			ret = btrfs_is_data_extent_shared(root, btrfs_ino(inode),
 							  bytenr, roots,
-							  tmp_ulist);
+							  tmp_ulist,
+							  backref_cache);
 			if (ret < 0)
 				goto out_free;
 			if (ret)
@@ -5710,6 +5710,7 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
 			     &cached_state);
 
 out_free_ulist:
+	kfree(backref_cache);
 	btrfs_free_path(path);
 	ulist_free(roots);
 	ulist_free(tmp_ulist);
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 09/10] btrfs: skip unnecessary extent buffer sharedness checks during fiemap
  2022-09-01 13:18 [PATCH 00/10] btrfs: make lseek and fiemap much more efficient fdmanana
                   ` (7 preceding siblings ...)
  2022-09-01 13:18 ` [PATCH 08/10] btrfs: speedup checking for extent sharedness during fiemap fdmanana
@ 2022-09-01 13:18 ` fdmanana
  2022-09-01 14:26   ` Josef Bacik
  2022-09-01 23:01   ` Qu Wenruo
  2022-09-01 13:18 ` [PATCH 10/10] btrfs: make fiemap more efficient and accurate reporting extent sharedness fdmanana
                   ` (3 subsequent siblings)
  12 siblings, 2 replies; 53+ messages in thread
From: fdmanana @ 2022-09-01 13:18 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

During fiemap, for each file extent we find, we must check if it's shared
or not. The sharedness check starts by verifying if the extent is directly
shared (its refcount in the extent tree is > 1), and if it is not directly
shared, then we check whether every node in the subvolume b+tree leading
from the root to the leaf that has the file extent item (in reverse order)
is shared (through snapshots).

However this second step is not needed if our extent was created in a
transaction more recent than the last transaction where a snapshot of the
inode's root happened, because it can't be shared indirectly (through
shared subtrees) without a snapshot created in a more recent transaction.

So grab the generation of the extent from the extent map and pass it to
btrfs_is_data_extent_shared(), which will skip this second phase when the
generation is more recent than the root's last snapshot value. Note that
we skip this optimization if the extent map is the result of merging 2
or more extent maps, because in this case its generation is the maximum
of the generations of all merged extent maps.
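
A small, self-contained model of that shortcut (the helper name below is
hypothetical; the rule is the generation comparison described above, with 0
meaning the generation is unknown):

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /*
     * An extent created in a transaction more recent than the root's last
     * snapshot cannot be reachable through any snapshot, so only the direct
     * refcount check on the data extent is needed for it.
     */
    static bool may_be_shared_via_snapshots(uint64_t extent_gen,
                                            uint64_t last_snapshot_gen)
    {
        /* Generation 0 means unknown (e.g. a merged extent map): be safe. */
        if (extent_gen == 0)
            return true;
        return extent_gen <= last_snapshot_gen;
    }

    int main(void)
    {
        /* Extent written in transaction 105, last snapshot in 100: skip walk. */
        printf("%d\n", may_be_shared_via_snapshots(105, 100));
        /* Extent written in transaction 95, last snapshot in 100: must walk. */
        printf("%d\n", may_be_shared_via_snapshots(95, 100));
        return 0;
    }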

The fact that we use extent maps and that they can be merged despite the
underlying extents being distinct (different file extent items in the
subvolume b+tree and different extent items in the extent b+tree), can
result in some bugs when reporting shared extents. But this is a problem
of the current implementation of fiemap relying on extent maps.
One example where we get incorrect results is:

    $ cat fiemap-bug.sh
    #!/bin/bash

    DEV=/dev/sdj
    MNT=/mnt/sdj

    mkfs.btrfs -f $DEV
    mount $DEV $MNT

    # Create a file with two 256K extents.
    # Since there is no other write activity, they will be contiguous,
    # and their extent maps merged, despite having two distinct extents.
    xfs_io -f -c "pwrite -S 0xab 0 256K" \
              -c "fsync" \
              -c "pwrite -S 0xcd 256K 256K" \
              -c "fsync" \
              $MNT/foo

    # Now clone only the second extent into another file.
    xfs_io -f -c "reflink $MNT/foo 256K 0 256K" $MNT/bar

    # Filefrag will report a single 512K extent, and say it's not shared.
    echo
    filefrag -v $MNT/foo

    umount $MNT

Running the reproducer:

    $ ./fiemap-bug.sh
    wrote 262144/262144 bytes at offset 0
    256 KiB, 64 ops; 0.0038 sec (65.479 MiB/sec and 16762.7030 ops/sec)
    wrote 262144/262144 bytes at offset 262144
    256 KiB, 64 ops; 0.0040 sec (61.125 MiB/sec and 15647.9218 ops/sec)
    linked 262144/262144 bytes at offset 0
    256 KiB, 1 ops; 0.0002 sec (1.034 GiB/sec and 4237.2881 ops/sec)

    Filesystem type is: 9123683e
    File size of /mnt/sdj/foo is 524288 (128 blocks of 4096 bytes)
     ext:     logical_offset:        physical_offset: length:   expected: flags:
       0:        0..     127:       3328..      3455:    128:             last,eof
    /mnt/sdj/foo: 1 extent found

We end up reporting that we have a single 512K extent that is not shared,
however we have two 256K extents, and the second one is shared. Changing
the reproducer to clone the first extent instead into file 'bar' makes us
report a single 512K extent that is shared, which is also incorrect since
we have two 256K extents and only the first one is shared.

This is a problem that existed before this change, and remains after this
change, as it can't be easily fixed. The next patch in the series reworks
fiemap to primarily use file extent items instead of extent maps (except
for checking for delalloc ranges), with the goal of improving its
scalability and performance, but it also ends up fixing this particular
bug caused by extent map merging.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/backref.c   | 27 +++++++++++++++++++++------
 fs/btrfs/backref.h   |  1 +
 fs/btrfs/extent_io.c | 18 ++++++++++++++++--
 3 files changed, 38 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
index 40b48abb6978..bf4ca4a82550 100644
--- a/fs/btrfs/backref.c
+++ b/fs/btrfs/backref.c
@@ -1613,12 +1613,14 @@ static void store_backref_shared_cache(struct btrfs_backref_shared_cache *cache,
 /**
  * Check if a data extent is shared or not.
  *
- * @root:   root inode belongs to
- * @inum:   inode number of the inode whose extent we are checking
- * @bytenr: logical bytenr of the extent we are checking
- * @roots:  list of roots this extent is shared among
- * @tmp:    temporary list used for iteration
- * @cache:  a backref lookup result cache
+ * @root:        The root the inode belongs to.
+ * @inum:        Number of the inode whose extent we are checking.
+ * @bytenr:      Logical bytenr of the extent we are checking.
+ * @extent_gen:  Generation of the extent (file extent item) or 0 if it is
+ *               not known.
+ * @roots:       List of roots this extent is shared among.
+ * @tmp:         Temporary list used for iteration.
+ * @cache:       A backref lookup result cache.
  *
  * btrfs_is_data_extent_shared uses the backref walking code but will short
  * circuit as soon as it finds a root or inode that doesn't match the
@@ -1632,6 +1634,7 @@ static void store_backref_shared_cache(struct btrfs_backref_shared_cache *cache,
  * Return: 0 if extent is not shared, 1 if it is shared, < 0 on error.
  */
 int btrfs_is_data_extent_shared(struct btrfs_root *root, u64 inum, u64 bytenr,
+				u64 extent_gen,
 				struct ulist *roots, struct ulist *tmp,
 				struct btrfs_backref_shared_cache *cache)
 {
@@ -1683,6 +1686,18 @@ int btrfs_is_data_extent_shared(struct btrfs_root *root, u64 inum, u64 bytenr,
 		if (ret < 0 && ret != -ENOENT)
 			break;
 		ret = 0;
+		/*
+		 * If our data extent is not shared through reflinks and it was
+		 * created in a generation after the last one used to create a
+		 * snapshot of the inode's root, then it can not be shared
+		 * indirectly through subtrees, as that can only happen with
+		 * snapshots. In this case bail out, no need to check for the
+		 * sharedness of extent buffers.
+		 */
+		if (level == -1 &&
+		    extent_gen > btrfs_root_last_snapshot(&root->root_item))
+			break;
+
 		if (level >= 0)
 			store_backref_shared_cache(cache, root, bytenr,
 						   level, false);
diff --git a/fs/btrfs/backref.h b/fs/btrfs/backref.h
index 797ba5371d55..7d18b5ac71dd 100644
--- a/fs/btrfs/backref.h
+++ b/fs/btrfs/backref.h
@@ -77,6 +77,7 @@ int btrfs_find_one_extref(struct btrfs_root *root, u64 inode_objectid,
 			  struct btrfs_inode_extref **ret_extref,
 			  u64 *found_off);
 int btrfs_is_data_extent_shared(struct btrfs_root *root, u64 inum, u64 bytenr,
+				u64 extent_gen,
 				struct ulist *roots, struct ulist *tmp,
 				struct btrfs_backref_shared_cache *cache);
 
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 781436cc373c..0e3fa9b08aaf 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -5645,9 +5645,23 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
 			flags |= (FIEMAP_EXTENT_DELALLOC |
 				  FIEMAP_EXTENT_UNKNOWN);
 		} else if (fieinfo->fi_extents_max) {
+			u64 extent_gen;
 			u64 bytenr = em->block_start -
 				(em->start - em->orig_start);
 
+			/*
+			 * If two extent maps are merged, then their generation
+			 * is set to the maximum between their generations.
+			 * Otherwise its generation matches the one we have in
+			 * corresponding file extent item. If we have a merged
+			 * extent map, don't use its generation to speedup the
+			 * sharedness check below.
+			 */
+			if (test_bit(EXTENT_FLAG_MERGED, &em->flags))
+				extent_gen = 0;
+			else
+				extent_gen = em->generation;
+
 			/*
 			 * As btrfs supports shared space, this information
 			 * can be exported to userspace tools via
@@ -5656,8 +5670,8 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
 			 * lookup stuff.
 			 */
 			ret = btrfs_is_data_extent_shared(root, btrfs_ino(inode),
-							  bytenr, roots,
-							  tmp_ulist,
+							  bytenr, extent_gen,
+							  roots, tmp_ulist,
 							  backref_cache);
 			if (ret < 0)
 				goto out_free;
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 10/10] btrfs: make fiemap more efficient and accurate reporting extent sharedness
  2022-09-01 13:18 [PATCH 00/10] btrfs: make lseek and fiemap much more efficient fdmanana
                   ` (8 preceding siblings ...)
  2022-09-01 13:18 ` [PATCH 09/10] btrfs: skip unnecessary extent buffer sharedness checks " fdmanana
@ 2022-09-01 13:18 ` fdmanana
  2022-09-01 14:35   ` Josef Bacik
  2022-09-01 23:27   ` Qu Wenruo
  2022-09-02  0:53 ` [PATCH 00/10] btrfs: make lseek and fiemap much more efficient Wang Yugui
                   ` (2 subsequent siblings)
  12 siblings, 2 replies; 53+ messages in thread
From: fdmanana @ 2022-09-01 13:18 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

The current fiemap implementation does not scale very well with the number
of extents a file has. This is both because the main algorithm to find out
the extents has a high algorithmic complexity and because for each extent
we have to check if it's shared. This second part, checking if an extent
is shared, is significantly improved by the two previous patches in this
patchset, while the first part is improved by this specific patch. Every
now and then we get reports from users mentioning fiemap is too slow or
even unusable for files with a very large number of extents, such as the
two recent reports referred to by the Link tags at the bottom of this
change log.

To understand why the part of finding which extents a file has is very
inefficient, consider the example of doing a full ranged fiemap against
a file that has over 100K extents (normal for example for a file with
more than 10G of data and using compression, which limits the extent size
to 128K). When we enter fiemap at extent_fiemap(), the following happens:

1) Before entering the main loop, we call get_extent_skip_holes() to get
   the first extent map. This leads us to btrfs_get_extent_fiemap(), which
   in turn calls btrfs_get_extent(), to find the first extent map that
   covers the file range [0, LLONG_MAX).

   btrfs_get_extent() will first search the inode's extent map tree, to
   see if we have an extent map there that covers the range. If it does
   not find one, then it will search the inode's subvolume b+tree for a
   fitting file extent item. After finding the file extent item, it will
   allocate an extent map, fill it in with information extracted from the
   file extent item, and add it to the inode's extent map tree (which
   requires a search for insertion in the tree).

2) Then we enter the main loop at extent_fiemap(), emit the details of
   the extent, and call again get_extent_skip_holes(), with a start
   offset matching the end of the extent map we previously processed.

   We end up at btrfs_get_extent() again, which will search the extent map
   tree and then the subvolume b+tree for a file extent item if it could
   not find an extent map in the extent map tree. We allocate an extent map,
   fill it in with the details in the file extent item, and then insert
   it into the extent map tree (yet another search in this tree).

3) The second step is repeated over and over, until we have processed the
   whole file range. Each iteration ends at btrfs_get_extent(), which
   does a red black tree search on the extent map tree, then searches the
   subvolume b+tree, allocates an extent map and then does another search
   in the extent map tree in order to insert the extent map.

   In the best case scenario we have all the extent maps already in the
   extent map tree, so for each extent we do a single search on a red black
   tree, giving a total complexity of O(n log n).

   In the worst scenario we don't have any extent map already loaded in
   the extent map tree, or have very few already there. In this case the
   complexity is much higher since we do:

   - A red black tree search on the extent map tree, which has O(log n)
     complexity, initially very fast since the tree is empty or very
     small, but as we end up allocating extent maps and adding them to
     the tree when we don't find them there, each subsequent search on
     the tree gets slower, since it's getting bigger and bigger after
     each iteration.

   - A search on the subvolume b+tree, also O(log n) complexity, but it
     has items for all inodes in the subvolume, not just items for our
     inode. Plus on a filesystem with concurrent operations on other
     inodes, we can block doing the search due to lock contention on
     b+tree nodes/leaves.

   - Allocate an extent map - this can block, and can also fail if we
     are under serious memory pressure.

   - Do another search on the extent maps red black tree, with the goal
     of inserting the extent map we just allocated. Again, after every
     iteration this tree is getting bigger by 1 element, so after many
     iterations the searches are slower and slower.

   - We will not need the allocated extent map anymore, so it's pointless
     to add it to the extent map tree. It's just wasting time and memory.

   In short we end up searching the extent map tree multiple times, on a
   tree that is growing bigger and bigger after each iteration. And
   besides that we visit the same leaf of the subvolume b+tree many times,
   since a leaf with the default size of 16K can easily have more than 200
   file extent items.

This is very inefficient overall. This patch changes the algorithm to
instead iterate over the subvolume b+tree, visiting each leaf only once,
and only searching in the extent map tree for file ranges that have holes
or prealloc extents, in order to figure out if we have delalloc there.
It will never allocate an extent map and add it to the extent map tree.
This is very similar to what was previously done for the lseek's hole and
data seeking features.
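
As a rough outline, the new main loop of extent_fiemap() does the
following (a simplified sketch of the code added in the diff below, with
error handling, inline extents and the trailing delalloc/EOF handling
omitted):

    ret = fiemap_search_slot(inode, path, lockstart);
    while (prev_extent_end < lockend) {
            btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
            if (key.objectid != ino || key.type != BTRFS_EXTENT_DATA_KEY)
                    break;
            extent_end = btrfs_file_extent_end(path);
            if (prev_extent_end < key.offset)
                    /* Implicit hole (NO_HOLES): delalloc lookups only. */
                    fiemap_process_hole(..., prev_extent_end,
                                        min(key.offset, lockend) - 1);
            if (<hole or prealloc extent>)
                    fiemap_process_hole(..., key.offset, extent_end - 1);
            else
                    /* Regular extent: emit it straight from the item. */
                    emit_fiemap_extent(fieinfo, &cache, key.offset,
                                       disk_bytenr + extent_offset,
                                       extent_len, flags);
            prev_extent_end = extent_end;
            /* Move to the next item, cloning the next leaf if needed. */
            fiemap_next_leaf_item(inode, path);
    }
    emit_last_fiemap_cache(fieinfo, &cache);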

Also, the current implementation relying on extent maps for figuring out
which extents we have is not correct. This is because extent maps can be
merged even if they represent different extents - we do this to minimize
memory utilization and keep extent map trees smaller. For example if we
have two extents that are contiguous on disk, once we load the two extent
maps, they get merged into a single one - however if only one of the
extents is shared, we end up reporting both as shared or both as not
shared, which is incorrect.

This reproducer triggers that bug:

    $ cat fiemap-bug.sh
    #!/bin/bash

    DEV=/dev/sdj
    MNT=/mnt/sdj

    mkfs.btrfs -f $DEV
    mount $DEV $MNT

    # Create a file with two 256K extents.
    # Since there is no other write activity, they will be contiguous,
    # and their extent maps merged, despite having two distinct extents.
    xfs_io -f -c "pwrite -S 0xab 0 256K" \
              -c "fsync" \
              -c "pwrite -S 0xcd 256K 256K" \
              -c "fsync" \
              $MNT/foo

    # Now clone only the second extent into another file.
    xfs_io -f -c "reflink $MNT/foo 256K 0 256K" $MNT/bar

    # Filefrag will report a single 512K extent, and say it's not shared.
    echo
    filefrag -v $MNT/foo

    umount $MNT

Running the reproducer:

    $ ./fiemap-bug.sh
    wrote 262144/262144 bytes at offset 0
    256 KiB, 64 ops; 0.0038 sec (65.479 MiB/sec and 16762.7030 ops/sec)
    wrote 262144/262144 bytes at offset 262144
    256 KiB, 64 ops; 0.0040 sec (61.125 MiB/sec and 15647.9218 ops/sec)
    linked 262144/262144 bytes at offset 0
    256 KiB, 1 ops; 0.0002 sec (1.034 GiB/sec and 4237.2881 ops/sec)

    Filesystem type is: 9123683e
    File size of /mnt/sdj/foo is 524288 (128 blocks of 4096 bytes)
     ext:     logical_offset:        physical_offset: length:   expected: flags:
       0:        0..     127:       3328..      3455:    128:             last,eof
    /mnt/sdj/foo: 1 extent found

We end up reporting that we have a single 512K extent that is not shared,
however we have two 256K extents, and the second one is shared. Changing
the reproducer to clone the first extent instead into file 'bar' makes us
report a single 512K extent that is shared, which is also incorrect since
we have two 256K extents and only the first one is shared.

This patch is part of a larger patchset that is comprised of the following
patches:

    btrfs: allow hole and data seeking to be interruptible
    btrfs: make hole and data seeking a lot more efficient
    btrfs: remove check for impossible block start for an extent map at fiemap
    btrfs: remove zero length check when entering fiemap
    btrfs: properly flush delalloc when entering fiemap
    btrfs: allow fiemap to be interruptible
    btrfs: rename btrfs_check_shared() to a more descriptive name
    btrfs: speedup checking for extent sharedness during fiemap
    btrfs: skip unnecessary extent buffer sharedness checks during fiemap
    btrfs: make fiemap more efficient and accurate reporting extent sharedness

The patchset was tested on a machine running a non-debug kernel (Debian's
default config), comparing the tests below on a branch without the
patchset versus the same branch with the whole patchset applied.

The following test for a large compressed file without holes:

    $ cat fiemap-perf-test.sh
    #!/bin/bash

    DEV=/dev/sdi
    MNT=/mnt/sdi

    mkfs.btrfs -f $DEV
    mount -o compress=lzo $DEV $MNT

    # 40G gives 327680 128K file extents (due to compression).
    xfs_io -f -c "pwrite -S 0xab -b 1M 0 20G" $MNT/foobar

    umount $MNT
    mount -o compress=lzo $DEV $MNT

    start=$(date +%s%N)
    filefrag $MNT/foobar
    end=$(date +%s%N)
    dur=$(( (end - start) / 1000000 ))
    echo "fiemap took $dur milliseconds (metadata not cached)"

    start=$(date +%s%N)
    filefrag $MNT/foobar
    end=$(date +%s%N)
    dur=$(( (end - start) / 1000000 ))
    echo "fiemap took $dur milliseconds (metadata cached)"

    umount $MNT

Before patchset:

    $ ./fiemap-perf-test.sh
    (...)
    /mnt/sdi/foobar: 327680 extents found
    fiemap took 3597 milliseconds (metadata not cached)
    /mnt/sdi/foobar: 327680 extents found
    fiemap took 2107 milliseconds (metadata cached)

After patchset:

    $ ./fiemap-perf-test.sh
    (...)
    /mnt/sdi/foobar: 327680 extents found
    fiemap took 1214 milliseconds (metadata not cached)
    /mnt/sdi/foobar: 327680 extents found
    fiemap took 684 milliseconds (metadata cached)

That's a speedup of about 3x for both cases (no metadata cached and all
metadata cached).

The test provided by Pavel (first Link tag at the bottom), which uses
files with a large number of holes, was also used to measure the gains,
and it consists of a small C program and a shell script to invoke it.
The C program is the following:

    $ cat pavels-test.c
    #include <stdio.h>
    #include <unistd.h>
    #include <stdlib.h>
    #include <fcntl.h>

    #include <sys/stat.h>
    #include <sys/time.h>
    #include <sys/ioctl.h>

    #include <linux/fs.h>
    #include <linux/fiemap.h>

    #define FILE_INTERVAL (1<<13) /* 8Kb */

    long long interval(struct timeval t1, struct timeval t2)
    {
        long long val = 0;
        val += (t2.tv_usec - t1.tv_usec);
        val += (t2.tv_sec - t1.tv_sec) * 1000 * 1000;
        return val;
    }

    int main(int argc, char **argv)
    {
        struct fiemap fiemap = {};
        struct timeval t1, t2;
        char data = 'a';
        struct stat st;
        int fd, off, file_size = FILE_INTERVAL;

        if (argc != 3 && argc != 2) {
                printf("usage: %s <path> [size]\n", argv[0]);
                return 1;
        }

        if (argc == 3)
                file_size = atoi(argv[2]);
        if (file_size < FILE_INTERVAL)
                file_size = FILE_INTERVAL;
        file_size -= file_size % FILE_INTERVAL;

        fd = open(argv[1], O_RDWR | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        for (off = 0; off < file_size; off += FILE_INTERVAL) {
            if (pwrite(fd, &data, 1, off) != 1) {
                perror("pwrite");
                close(fd);
                return 1;
            }
        }

        if (ftruncate(fd, file_size)) {
            perror("ftruncate");
            close(fd);
            return 1;
        }

        if (fstat(fd, &st) < 0) {
            perror("fstat");
            close(fd);
            return 1;
        }

        printf("size: %ld\n", st.st_size);
        printf("actual size: %ld\n", st.st_blocks * 512);

        fiemap.fm_length = FIEMAP_MAX_OFFSET;
        gettimeofday(&t1, NULL);
        if (ioctl(fd, FS_IOC_FIEMAP, &fiemap) < 0) {
            perror("fiemap");
            close(fd);
            return 1;
        }
        gettimeofday(&t2, NULL);

        printf("fiemap: fm_mapped_extents = %d\n",
               fiemap.fm_mapped_extents);
        printf("time = %lld us\n", interval(t1, t2));

        close(fd);
        return 0;
    }

    $ gcc -o pavels-test pavels-test.c

And the wrapper shell script:

    $ cat fiemap-pavels-test.sh
    #!/bin/bash

    DEV=/dev/sdi
    MNT=/mnt/sdi

    mkfs.btrfs -f -O no-holes $DEV
    mount $DEV $MNT

    echo
    echo "*********** 256M ***********"
    echo

    ./pavels-test $MNT/testfile $((1 << 28))
    echo
    ./pavels-test $MNT/testfile $((1 << 28))

    echo
    echo "*********** 512M ***********"
    echo

    ./pavels-test $MNT/testfile $((1 << 29))
    echo
    ./pavels-test $MNT/testfile $((1 << 29))

    echo
    echo "*********** 1G ***********"
    echo

    ./pavels-test $MNT/testfile $((1 << 30))
    echo
    ./pavels-test $MNT/testfile $((1 << 30))

    umount $MNT

Running his reproducer before applying the patchset:

    *********** 256M ***********

    size: 268435456
    actual size: 134217728
    fiemap: fm_mapped_extents = 32768
    time = 4003133 us

    size: 268435456
    actual size: 134217728
    fiemap: fm_mapped_extents = 32768
    time = 4895330 us

    *********** 512M ***********

    size: 536870912
    actual size: 268435456
    fiemap: fm_mapped_extents = 65536
    time = 30123675 us

    size: 536870912
    actual size: 268435456
    fiemap: fm_mapped_extents = 65536
    time = 33450934 us

    *********** 1G ***********

    size: 1073741824
    actual size: 536870912
    fiemap: fm_mapped_extents = 131072
    time = 224924074 us

    size: 1073741824
    actual size: 536870912
    fiemap: fm_mapped_extents = 131072
    time = 217239242 us

Running it after applying the patchset:

    *********** 256M ***********

    size: 268435456
    actual size: 134217728
    fiemap: fm_mapped_extents = 32768
    time = 29475 us

    size: 268435456
    actual size: 134217728
    fiemap: fm_mapped_extents = 32768
    time = 29307 us

    *********** 512M ***********

    size: 536870912
    actual size: 268435456
    fiemap: fm_mapped_extents = 65536
    time = 58996 us

    size: 536870912
    actual size: 268435456
    fiemap: fm_mapped_extents = 65536
    time = 59115 us

    *********** 1G ***********

    size: 1073741824
    actual size: 536870912
    fiemap: fm_mapped_extents = 116251
    time = 124141 us

    size: 1073741824
    actual size: 536870912
    fiemap: fm_mapped_extents = 131072
    time = 119387 us

The speedup is massive, both on the first fiemap call and on the second
one as well, as his test creates files with many holes and small extents
(every extent follows a hole and precedes another hole).

For the 256M file we go from 4 seconds down to 29 milliseconds in the
first run, and then from 4.9 seconds down to 29 milliseconds again in the
second run, a speedup of 138x and 169x, respectively.

For the 512M file we go from 30.1 seconds down to 59 milliseconds in the
first run, and then from 33.5 seconds down to 59 milliseconds again in the
second run, a speedup of 510x and 568x, respectively.

For the 1G file, we go from 225 seconds down to 124 milliseconds in the
first run, and then from 217 seconds down to 119 milliseconds in the
second run, a speedup of 1815x and 1824x, respectively.

Reported-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Link: https://lore.kernel.org/linux-btrfs/21dd32c6-f1f9-f44a-466a-e18fdc6788a7@virtuozzo.com/
Reported-by: Dominique MARTINET <dominique.martinet@atmark-techno.com>
Link: https://lore.kernel.org/linux-btrfs/Ysace25wh5BbLd5f@atmark-techno.com/
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/ctree.h     |   4 +-
 fs/btrfs/extent_io.c | 714 +++++++++++++++++++++++++++++--------------
 fs/btrfs/file.c      |  16 +-
 fs/btrfs/inode.c     | 140 +--------
 4 files changed, 506 insertions(+), 368 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index f7fe7f633eb5..7b266f9dc8b4 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3402,8 +3402,6 @@ unsigned int btrfs_verify_data_csum(struct btrfs_bio *bbio,
 				    u64 start, u64 end);
 int btrfs_check_data_csum(struct inode *inode, struct btrfs_bio *bbio,
 			  u32 bio_offset, struct page *page, u32 pgoff);
-struct extent_map *btrfs_get_extent_fiemap(struct btrfs_inode *inode,
-					   u64 start, u64 len);
 noinline int can_nocow_extent(struct inode *inode, u64 offset, u64 *len,
 			      u64 *orig_start, u64 *orig_block_len,
 			      u64 *ram_bytes, bool strict);
@@ -3583,6 +3581,8 @@ int btrfs_fdatawrite_range(struct inode *inode, loff_t start, loff_t end);
 int btrfs_check_nocow_lock(struct btrfs_inode *inode, loff_t pos,
 			   size_t *write_bytes);
 void btrfs_check_nocow_unlock(struct btrfs_inode *inode);
+bool btrfs_find_delalloc_in_range(struct btrfs_inode *inode, u64 start, u64 end,
+				  u64 *delalloc_start_ret, u64 *delalloc_end_ret);
 
 /* tree-defrag.c */
 int btrfs_defrag_leaves(struct btrfs_trans_handle *trans,
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 0e3fa9b08aaf..50bb2182e795 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -5353,42 +5353,6 @@ int try_release_extent_mapping(struct page *page, gfp_t mask)
 	return try_release_extent_state(tree, page, mask);
 }
 
-/*
- * helper function for fiemap, which doesn't want to see any holes.
- * This maps until we find something past 'last'
- */
-static struct extent_map *get_extent_skip_holes(struct btrfs_inode *inode,
-						u64 offset, u64 last)
-{
-	u64 sectorsize = btrfs_inode_sectorsize(inode);
-	struct extent_map *em;
-	u64 len;
-
-	if (offset >= last)
-		return NULL;
-
-	while (1) {
-		len = last - offset;
-		if (len == 0)
-			break;
-		len = ALIGN(len, sectorsize);
-		em = btrfs_get_extent_fiemap(inode, offset, len);
-		if (IS_ERR(em))
-			return em;
-
-		/* if this isn't a hole return it */
-		if (em->block_start != EXTENT_MAP_HOLE)
-			return em;
-
-		/* this is a hole, advance to the next extent */
-		offset = extent_map_end(em);
-		free_extent_map(em);
-		if (offset >= last)
-			break;
-	}
-	return NULL;
-}
-
 /*
  * To cache previous fiemap extent
  *
@@ -5418,6 +5382,9 @@ static int emit_fiemap_extent(struct fiemap_extent_info *fieinfo,
 {
 	int ret = 0;
 
+	/* Set at the end of extent_fiemap(). */
+	ASSERT((flags & FIEMAP_EXTENT_LAST) == 0);
+
 	if (!cache->cached)
 		goto assign;
 
@@ -5446,11 +5413,10 @@ static int emit_fiemap_extent(struct fiemap_extent_info *fieinfo,
 	 */
 	if (cache->offset + cache->len  == offset &&
 	    cache->phys + cache->len == phys  &&
-	    (cache->flags & ~FIEMAP_EXTENT_LAST) ==
-			(flags & ~FIEMAP_EXTENT_LAST)) {
+	    cache->flags == flags) {
 		cache->len += len;
 		cache->flags |= flags;
-		goto try_submit_last;
+		return 0;
 	}
 
 	/* Not mergeable, need to submit cached one */
@@ -5465,13 +5431,8 @@ static int emit_fiemap_extent(struct fiemap_extent_info *fieinfo,
 	cache->phys = phys;
 	cache->len = len;
 	cache->flags = flags;
-try_submit_last:
-	if (cache->flags & FIEMAP_EXTENT_LAST) {
-		ret = fiemap_fill_next_extent(fieinfo, cache->offset,
-				cache->phys, cache->len, cache->flags);
-		cache->cached = false;
-	}
-	return ret;
+
+	return 0;
 }
 
 /*
@@ -5501,229 +5462,534 @@ static int emit_last_fiemap_cache(struct fiemap_extent_info *fieinfo,
 	return ret;
 }
 
-int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
-		  u64 start, u64 len)
+static int fiemap_next_leaf_item(struct btrfs_inode *inode,
+				 struct btrfs_path *path)
 {
-	int ret = 0;
-	u64 off;
-	u64 max = start + len;
-	u32 flags = 0;
-	u32 found_type;
-	u64 last;
-	u64 last_for_get_extent = 0;
-	u64 disko = 0;
-	u64 isize = i_size_read(&inode->vfs_inode);
-	struct btrfs_key found_key;
-	struct extent_map *em = NULL;
-	struct extent_state *cached_state = NULL;
-	struct btrfs_path *path;
+	struct extent_buffer *clone;
+	struct btrfs_key key;
+	int slot;
+	int ret;
+
+	path->slots[0]++;
+	if (path->slots[0] < btrfs_header_nritems(path->nodes[0]))
+		return 0;
+
+	ret = btrfs_next_leaf(inode->root, path);
+	if (ret != 0)
+		return ret;
+
+	/*
+	 * Don't bother with cloning if there are no more file extent items for
+	 * our inode.
+	 */
+	btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
+	if (key.objectid != btrfs_ino(inode) || key.type != BTRFS_EXTENT_DATA_KEY)
+		return 1;
+
+	/* See the comment at fiemap_search_slot() about why we clone. */
+	clone = btrfs_clone_extent_buffer(path->nodes[0]);
+	if (!clone)
+		return -ENOMEM;
+
+	slot = path->slots[0];
+	btrfs_release_path(path);
+	path->nodes[0] = clone;
+	path->slots[0] = slot;
+
+	return 0;
+}
+
+/*
+ * Search for the first file extent item that starts at a given file offset or
+ * the one that starts immediately before that offset.
+ * Returns: 0 on success, < 0 on error, 1 if not found.
+ */
+static int fiemap_search_slot(struct btrfs_inode *inode, struct btrfs_path *path,
+			      u64 file_offset)
+{
+	const u64 ino = btrfs_ino(inode);
 	struct btrfs_root *root = inode->root;
-	struct fiemap_cache cache = { 0 };
-	struct btrfs_backref_shared_cache *backref_cache;
-	struct ulist *roots;
-	struct ulist *tmp_ulist;
-	int end = 0;
-	u64 em_start = 0;
-	u64 em_len = 0;
-	u64 em_end = 0;
+	struct extent_buffer *clone;
+	struct btrfs_key key;
+	int slot;
+	int ret;
 
-	backref_cache = kzalloc(sizeof(*backref_cache), GFP_KERNEL);
-	path = btrfs_alloc_path();
-	roots = ulist_alloc(GFP_KERNEL);
-	tmp_ulist = ulist_alloc(GFP_KERNEL);
-	if (!backref_cache || !path || !roots || !tmp_ulist) {
-		ret = -ENOMEM;
-		goto out_free_ulist;
+	key.objectid = ino;
+	key.type = BTRFS_EXTENT_DATA_KEY;
+	key.offset = file_offset;
+
+	ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
+	if (ret < 0)
+		return ret;
+
+	if (ret > 0 && path->slots[0] > 0) {
+		btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0] - 1);
+		if (key.objectid == ino && key.type == BTRFS_EXTENT_DATA_KEY)
+			path->slots[0]--;
+	}
+
+	if (path->slots[0] >= btrfs_header_nritems(path->nodes[0])) {
+		ret = btrfs_next_leaf(root, path);
+		if (ret != 0)
+			return ret;
+
+		btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
+		if (key.objectid != ino || key.type != BTRFS_EXTENT_DATA_KEY)
+			return 1;
 	}
 
 	/*
-	 * We can't initialize that to 'start' as this could miss extents due
-	 * to extent item merging
+	 * We clone the leaf and use it during fiemap. This is because while
+	 * using the leaf we do expensive things like checking if an extent is
+	 * shared, which can take a long time. In order to prevent blocking
+	 * other tasks for too long, we use a clone of the leaf. We have locked
+	 * the file range in the inode's io tree, so we know none of our file
+	 * extent items can change. This way we avoid blocking other tasks that
+	 * want to insert items for other inodes in the same leaf or b+tree
+	 * rebalance operations (triggered for example when someone is trying
+	 * to push items into this leaf when trying to insert an item in a
+	 * neighbour leaf).
+	 * We also need the private clone because holding a read lock on an
+	 * extent buffer of the subvolume's b+tree will make lockdep unhappy
+	 * when we call fiemap_fill_next_extent(), because that may cause a page
+	 * fault when filling the user space buffer with fiemap data.
 	 */
-	off = 0;
-	start = round_down(start, btrfs_inode_sectorsize(inode));
-	len = round_up(max, btrfs_inode_sectorsize(inode)) - start;
+	clone = btrfs_clone_extent_buffer(path->nodes[0]);
+	if (!clone)
+		return -ENOMEM;
+
+	slot = path->slots[0];
+	btrfs_release_path(path);
+	path->nodes[0] = clone;
+	path->slots[0] = slot;
+
+	return 0;
+}
+
+/*
+ * Process a range which is a hole or a prealloc extent in the inode's subvolume
+ * btree. If @disk_bytenr is 0, we are dealing with a hole, otherwise a prealloc
+ * extent. The end offset (@end) is inclusive.
+ */
+static int fiemap_process_hole(struct btrfs_inode *inode,
+			       struct fiemap_extent_info *fieinfo,
+			       struct fiemap_cache *cache,
+			       struct btrfs_backref_shared_cache *backref_cache,
+			       u64 disk_bytenr, u64 extent_offset,
+			       u64 extent_gen,
+			       struct ulist *roots, struct ulist *tmp_ulist,
+			       u64 start, u64 end)
+{
+	const u64 i_size = i_size_read(&inode->vfs_inode);
+	const u64 ino = btrfs_ino(inode);
+	u64 cur_offset = start;
+	u64 last_delalloc_end = 0;
+	u32 prealloc_flags = FIEMAP_EXTENT_UNWRITTEN;
+	bool checked_extent_shared = false;
+	int ret;
 
 	/*
-	 * lookup the last file extent.  We're not using i_size here
-	 * because there might be preallocation past i_size
+	 * There can be no delalloc past i_size, so don't waste time looking for
+	 * it beyond i_size.
 	 */
-	ret = btrfs_lookup_file_extent(NULL, root, path, btrfs_ino(inode), -1,
-				       0);
-	if (ret < 0) {
-		goto out_free_ulist;
-	} else {
-		WARN_ON(!ret);
-		if (ret == 1)
-			ret = 0;
-	}
+	while (cur_offset < end && cur_offset < i_size) {
+		u64 delalloc_start;
+		u64 delalloc_end;
+		u64 prealloc_start;
+		u64 prealloc_len = 0;
+		bool delalloc;
+
+		delalloc = btrfs_find_delalloc_in_range(inode, cur_offset, end,
+							&delalloc_start,
+							&delalloc_end);
+		if (!delalloc)
+			break;
 
-	path->slots[0]--;
-	btrfs_item_key_to_cpu(path->nodes[0], &found_key, path->slots[0]);
-	found_type = found_key.type;
-
-	/* No extents, but there might be delalloc bits */
-	if (found_key.objectid != btrfs_ino(inode) ||
-	    found_type != BTRFS_EXTENT_DATA_KEY) {
-		/* have to trust i_size as the end */
-		last = (u64)-1;
-		last_for_get_extent = isize;
-	} else {
 		/*
-		 * remember the start of the last extent.  There are a
-		 * bunch of different factors that go into the length of the
-		 * extent, so its much less complex to remember where it started
+		 * If this is a prealloc extent we have to report every section
+		 * of it that has no delalloc.
 		 */
-		last = found_key.offset;
-		last_for_get_extent = last + 1;
+		if (disk_bytenr != 0) {
+			if (last_delalloc_end == 0) {
+				prealloc_start = start;
+				prealloc_len = delalloc_start - start;
+			} else {
+				prealloc_start = last_delalloc_end + 1;
+				prealloc_len = delalloc_start - prealloc_start;
+			}
+		}
+
+		if (prealloc_len > 0) {
+			if (!checked_extent_shared && fieinfo->fi_extents_max) {
+				ret = btrfs_is_data_extent_shared(inode->root,
+							  ino, disk_bytenr,
+							  extent_gen, roots,
+							  tmp_ulist,
+							  backref_cache);
+				if (ret < 0)
+					return ret;
+				else if (ret > 0)
+					prealloc_flags |= FIEMAP_EXTENT_SHARED;
+
+				checked_extent_shared = true;
+			}
+			ret = emit_fiemap_extent(fieinfo, cache, prealloc_start,
+						 disk_bytenr + extent_offset,
+						 prealloc_len, prealloc_flags);
+			if (ret)
+				return ret;
+			extent_offset += prealloc_len;
+		}
+
+		ret = emit_fiemap_extent(fieinfo, cache, delalloc_start, 0,
+					 delalloc_end + 1 - delalloc_start,
+					 FIEMAP_EXTENT_DELALLOC |
+					 FIEMAP_EXTENT_UNKNOWN);
+		if (ret)
+			return ret;
+
+		last_delalloc_end = delalloc_end;
+		cur_offset = delalloc_end + 1;
+		extent_offset += cur_offset - delalloc_start;
+		cond_resched();
+	}
+
+	/*
+	 * Either we found no delalloc for the whole prealloc extent or we have
+	 * a prealloc extent that spans i_size or starts at or after i_size.
+	 */
+	if (disk_bytenr != 0 && last_delalloc_end < end) {
+		u64 prealloc_start;
+		u64 prealloc_len;
+
+		if (last_delalloc_end == 0) {
+			prealloc_start = start;
+			prealloc_len = end + 1 - start;
+		} else {
+			prealloc_start = last_delalloc_end + 1;
+			prealloc_len = end + 1 - prealloc_start;
+		}
+
+		if (!checked_extent_shared && fieinfo->fi_extents_max) {
+			ret = btrfs_is_data_extent_shared(inode->root,
+							  ino, disk_bytenr,
+							  extent_gen, roots,
+							  tmp_ulist,
+							  backref_cache);
+			if (ret < 0)
+				return ret;
+			else if (ret > 0)
+				prealloc_flags |= FIEMAP_EXTENT_SHARED;
+		}
+		ret = emit_fiemap_extent(fieinfo, cache, prealloc_start,
+					 disk_bytenr + extent_offset,
+					 prealloc_len, prealloc_flags);
+		if (ret)
+			return ret;
+	}
+
+	return 0;
+}
+
+static int fiemap_find_last_extent_offset(struct btrfs_inode *inode,
+					  struct btrfs_path *path,
+					  u64 *last_extent_end_ret)
+{
+	const u64 ino = btrfs_ino(inode);
+	struct btrfs_root *root = inode->root;
+	struct extent_buffer *leaf;
+	struct btrfs_file_extent_item *ei;
+	struct btrfs_key key;
+	u64 disk_bytenr;
+	int ret;
+
+	/*
+	 * Lookup the last file extent. We're not using i_size here because
+	 * there might be preallocation past i_size.
+	 */
+	ret = btrfs_lookup_file_extent(NULL, root, path, ino, (u64)-1, 0);
+	/* There can't be a file extent item at offset (u64)-1 */
+	ASSERT(ret != 0);
+	if (ret < 0)
+		return ret;
+
+	/*
+	 * For a non-existing key, btrfs_search_slot() always leaves us at a
+	 * slot > 0, except if the btree is empty, which is impossible because
+	 * at least it has the inode item for this inode and all the items for
+	 * the root inode 256.
+	 */
+	ASSERT(path->slots[0] > 0);
+	path->slots[0]--;
+	leaf = path->nodes[0];
+	btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
+	if (key.objectid != ino || key.type != BTRFS_EXTENT_DATA_KEY) {
+		/* No file extent items in the subvolume tree. */
+		*last_extent_end_ret = 0;
+		return 0;
 	}
-	btrfs_release_path(path);
 
 	/*
-	 * we might have some extents allocated but more delalloc past those
-	 * extents.  so, we trust isize unless the start of the last extent is
-	 * beyond isize
+	 * For an inline extent, the disk_bytenr is where inline data starts at,
+	 * so first check if we have an inline extent item before checking if we
+	 * have an implicit hole (disk_bytenr == 0).
 	 */
-	if (last < isize) {
-		last = (u64)-1;
-		last_for_get_extent = isize;
+	ei = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_file_extent_item);
+	if (btrfs_file_extent_type(leaf, ei) == BTRFS_FILE_EXTENT_INLINE) {
+		*last_extent_end_ret = btrfs_file_extent_end(path);
+		return 0;
 	}
 
-	lock_extent_bits(&inode->io_tree, start, start + len - 1,
-			 &cached_state);
+	/*
+	 * Find the last file extent item that is not a hole (when NO_HOLES is
+	 * not enabled). This should take at most 2 iterations in the worst
+	 * case: we have one hole file extent item at slot 0 of a leaf and
+	 * another hole file extent item as the last item in the previous leaf.
+	 * This is because we merge file extent items that represent holes.
+	 */
+	disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, ei);
+	while (disk_bytenr == 0) {
+		ret = btrfs_previous_item(root, path, ino, BTRFS_EXTENT_DATA_KEY);
+		if (ret < 0) {
+			return ret;
+		} else if (ret > 0) {
+			/* No file extent items that are not holes. */
+			*last_extent_end_ret = 0;
+			return 0;
+		}
+		leaf = path->nodes[0];
+		ei = btrfs_item_ptr(leaf, path->slots[0],
+				    struct btrfs_file_extent_item);
+		disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, ei);
+	}
 
-	em = get_extent_skip_holes(inode, start, last_for_get_extent);
-	if (!em)
-		goto out;
-	if (IS_ERR(em)) {
-		ret = PTR_ERR(em);
+	*last_extent_end_ret = btrfs_file_extent_end(path);
+	return 0;
+}
+
+int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
+		  u64 start, u64 len)
+{
+	const u64 ino = btrfs_ino(inode);
+	struct extent_state *cached_state = NULL;
+	struct btrfs_path *path;
+	struct btrfs_root *root = inode->root;
+	struct fiemap_cache cache = { 0 };
+	struct btrfs_backref_shared_cache *backref_cache;
+	struct ulist *roots;
+	struct ulist *tmp_ulist;
+	u64 last_extent_end;
+	u64 prev_extent_end;
+	u64 lockstart;
+	u64 lockend;
+	bool stopped = false;
+	int ret;
+
+	backref_cache = kzalloc(sizeof(*backref_cache), GFP_KERNEL);
+	path = btrfs_alloc_path();
+	roots = ulist_alloc(GFP_KERNEL);
+	tmp_ulist = ulist_alloc(GFP_KERNEL);
+	if (!backref_cache || !path || !roots || !tmp_ulist) {
+		ret = -ENOMEM;
 		goto out;
 	}
 
-	while (!end) {
-		u64 offset_in_extent = 0;
+	lockstart = round_down(start, btrfs_inode_sectorsize(inode));
+	lockend = round_up(start + len, btrfs_inode_sectorsize(inode));
+	prev_extent_end = lockstart;
 
-		/* break if the extent we found is outside the range */
-		if (em->start >= max || extent_map_end(em) < off)
-			break;
+	lock_extent_bits(&inode->io_tree, lockstart, lockend, &cached_state);
 
-		/*
-		 * get_extent may return an extent that starts before our
-		 * requested range.  We have to make sure the ranges
-		 * we return to fiemap always move forward and don't
-		 * overlap, so adjust the offsets here
-		 */
-		em_start = max(em->start, off);
+	ret = fiemap_find_last_extent_offset(inode, path, &last_extent_end);
+	if (ret < 0)
+		goto out_unlock;
+	btrfs_release_path(path);
 
+	path->reada = READA_FORWARD;
+	ret = fiemap_search_slot(inode, path, lockstart);
+	if (ret < 0) {
+		goto out_unlock;
+	} else if (ret > 0) {
 		/*
-		 * record the offset from the start of the extent
-		 * for adjusting the disk offset below.  Only do this if the
-		 * extent isn't compressed since our in ram offset may be past
-		 * what we have actually allocated on disk.
+		 * No file extent item found, but we may have delalloc between
+		 * the current offset and i_size. So check for that.
 		 */
-		if (!test_bit(EXTENT_FLAG_COMPRESSED, &em->flags))
-			offset_in_extent = em_start - em->start;
-		em_end = extent_map_end(em);
-		em_len = em_end - em_start;
-		flags = 0;
-		if (em->block_start < EXTENT_MAP_LAST_BYTE)
-			disko = em->block_start + offset_in_extent;
-		else
-			disko = 0;
+		ret = 0;
+		goto check_eof_delalloc;
+	}
+
+	while (prev_extent_end < lockend) {
+		struct extent_buffer *leaf = path->nodes[0];
+		struct btrfs_file_extent_item *ei;
+		struct btrfs_key key;
+		u64 extent_end;
+		u64 extent_len;
+		u64 extent_offset = 0;
+		u64 extent_gen;
+		u64 disk_bytenr = 0;
+		u64 flags = 0;
+		int extent_type;
+		u8 compression;
+
+		btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
+		if (key.objectid != ino || key.type != BTRFS_EXTENT_DATA_KEY)
+			break;
+
+		extent_end = btrfs_file_extent_end(path);
 
 		/*
-		 * bump off for our next call to get_extent
+		 * The first iteration can leave us at an extent item that ends
+		 * before our range's start. Move to the next item.
 		 */
-		off = extent_map_end(em);
-		if (off >= max)
-			end = 1;
-
-		if (em->block_start == EXTENT_MAP_INLINE) {
-			flags |= (FIEMAP_EXTENT_DATA_INLINE |
-				  FIEMAP_EXTENT_NOT_ALIGNED);
-		} else if (em->block_start == EXTENT_MAP_DELALLOC) {
-			flags |= (FIEMAP_EXTENT_DELALLOC |
-				  FIEMAP_EXTENT_UNKNOWN);
-		} else if (fieinfo->fi_extents_max) {
-			u64 extent_gen;
-			u64 bytenr = em->block_start -
-				(em->start - em->orig_start);
+		if (extent_end <= lockstart)
+			goto next_item;
 
-			/*
-			 * If two extent maps are merged, then their generation
-			 * is set to the maximum between their generations.
-			 * Otherwise its generation matches the one we have in
-			 * corresponding file extent item. If we have a merged
-			 * extent map, don't use its generation to speedup the
-			 * sharedness check below.
-			 */
-			if (test_bit(EXTENT_FLAG_MERGED, &em->flags))
-				extent_gen = 0;
-			else
-				extent_gen = em->generation;
+		/* We have an implicit hole (NO_HOLES feature enabled). */
+		if (prev_extent_end < key.offset) {
+			const u64 range_end = min(key.offset, lockend) - 1;
 
-			/*
-			 * As btrfs supports shared space, this information
-			 * can be exported to userspace tools via
-			 * flag FIEMAP_EXTENT_SHARED.  If fi_extents_max == 0
-			 * then we're just getting a count and we can skip the
-			 * lookup stuff.
-			 */
-			ret = btrfs_is_data_extent_shared(root, btrfs_ino(inode),
-							  bytenr, extent_gen,
-							  roots, tmp_ulist,
-							  backref_cache);
-			if (ret < 0)
-				goto out_free;
-			if (ret)
-				flags |= FIEMAP_EXTENT_SHARED;
-			ret = 0;
-		}
-		if (test_bit(EXTENT_FLAG_COMPRESSED, &em->flags))
-			flags |= FIEMAP_EXTENT_ENCODED;
-		if (test_bit(EXTENT_FLAG_PREALLOC, &em->flags))
-			flags |= FIEMAP_EXTENT_UNWRITTEN;
+			ret = fiemap_process_hole(inode, fieinfo, &cache,
+						  backref_cache, 0, 0, 0,
+						  roots, tmp_ulist,
+						  prev_extent_end, range_end);
+			if (ret < 0) {
+				goto out_unlock;
+			} else if (ret > 0) {
+				/* fiemap_fill_next_extent() told us to stop. */
+				stopped = true;
+				break;
+			}
 
-		free_extent_map(em);
-		em = NULL;
-		if ((em_start >= last) || em_len == (u64)-1 ||
-		   (last == (u64)-1 && isize <= em_end)) {
-			flags |= FIEMAP_EXTENT_LAST;
-			end = 1;
+			/* We've reached the end of the fiemap range, stop. */
+			if (key.offset >= lockend) {
+				stopped = true;
+				break;
+			}
 		}
 
-		/* now scan forward to see if this is really the last extent. */
-		em = get_extent_skip_holes(inode, off, last_for_get_extent);
-		if (IS_ERR(em)) {
-			ret = PTR_ERR(em);
-			goto out;
+		extent_len = extent_end - key.offset;
+		ei = btrfs_item_ptr(leaf, path->slots[0],
+				    struct btrfs_file_extent_item);
+		compression = btrfs_file_extent_compression(leaf, ei);
+		extent_type = btrfs_file_extent_type(leaf, ei);
+		extent_gen = btrfs_file_extent_generation(leaf, ei);
+
+		if (extent_type != BTRFS_FILE_EXTENT_INLINE) {
+			disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, ei);
+			if (compression == BTRFS_COMPRESS_NONE)
+				extent_offset = btrfs_file_extent_offset(leaf, ei);
 		}
-		if (!em) {
-			flags |= FIEMAP_EXTENT_LAST;
-			end = 1;
+
+		if (compression != BTRFS_COMPRESS_NONE)
+			flags |= FIEMAP_EXTENT_ENCODED;
+
+		if (extent_type == BTRFS_FILE_EXTENT_INLINE) {
+			flags |= FIEMAP_EXTENT_DATA_INLINE;
+			flags |= FIEMAP_EXTENT_NOT_ALIGNED;
+			ret = emit_fiemap_extent(fieinfo, &cache, key.offset, 0,
+						 extent_len, flags);
+		} else if (extent_type == BTRFS_FILE_EXTENT_PREALLOC) {
+			ret = fiemap_process_hole(inode, fieinfo, &cache,
+						  backref_cache,
+						  disk_bytenr, extent_offset,
+						  extent_gen, roots, tmp_ulist,
+						  key.offset, extent_end - 1);
+		} else if (disk_bytenr == 0) {
+			/* We have an explicit hole. */
+			ret = fiemap_process_hole(inode, fieinfo, &cache,
+						  backref_cache, 0, 0, 0,
+						  roots, tmp_ulist,
+						  key.offset, extent_end - 1);
+		} else {
+			/* We have a regular extent. */
+			if (fieinfo->fi_extents_max) {
+				ret = btrfs_is_data_extent_shared(root, ino,
+								  disk_bytenr,
+								  extent_gen,
+								  roots,
+								  tmp_ulist,
+								  backref_cache);
+				if (ret < 0)
+					goto out_unlock;
+				else if (ret > 0)
+					flags |= FIEMAP_EXTENT_SHARED;
+			}
+
+			ret = emit_fiemap_extent(fieinfo, &cache, key.offset,
+						 disk_bytenr + extent_offset,
+						 extent_len, flags);
 		}
-		ret = emit_fiemap_extent(fieinfo, &cache, em_start, disko,
-					   em_len, flags);
-		if (ret) {
-			if (ret == 1)
-				ret = 0;
-			goto out_free;
+
+		if (ret < 0) {
+			goto out_unlock;
+		} else if (ret > 0) {
+			/* fiemap_fill_next_extent() told us to stop. */
+			stopped = true;
+			break;
 		}
 
+		prev_extent_end = extent_end;
+next_item:
 		if (fatal_signal_pending(current)) {
 			ret = -EINTR;
-			goto out_free;
+			goto out_unlock;
 		}
+
+		ret = fiemap_next_leaf_item(inode, path);
+		if (ret < 0) {
+			goto out_unlock;
+		} else if (ret > 0) {
+			/* No more file extent items for this inode. */
+			break;
+		}
+		cond_resched();
 	}
-out_free:
-	if (!ret)
-		ret = emit_last_fiemap_cache(fieinfo, &cache);
-	free_extent_map(em);
-out:
-	unlock_extent_cached(&inode->io_tree, start, start + len - 1,
-			     &cached_state);
 
-out_free_ulist:
+check_eof_delalloc:
+	/*
+	 * Release (and free) the path before emitting any final entries to
+	 * fiemap_fill_next_extent() to keep lockdep happy. This is because
+	 * once we find no more file extent items exist, we may have a
+	 * non-cloned leaf, and fiemap_fill_next_extent() can trigger page
+	 * faults when copying data to the user space buffer.
+	 */
+	btrfs_free_path(path);
+	path = NULL;
+
+	if (!stopped && prev_extent_end < lockend) {
+		ret = fiemap_process_hole(inode, fieinfo, &cache, backref_cache,
+					  0, 0, 0, roots, tmp_ulist,
+					  prev_extent_end, lockend - 1);
+		if (ret < 0)
+			goto out_unlock;
+		prev_extent_end = lockend;
+	}
+
+	if (cache.cached && cache.offset + cache.len >= last_extent_end) {
+		const u64 i_size = i_size_read(&inode->vfs_inode);
+
+		if (prev_extent_end < i_size) {
+			u64 delalloc_start;
+			u64 delalloc_end;
+			bool delalloc;
+
+			delalloc = btrfs_find_delalloc_in_range(inode,
+								prev_extent_end,
+								i_size - 1,
+								&delalloc_start,
+								&delalloc_end);
+			if (!delalloc)
+				cache.flags |= FIEMAP_EXTENT_LAST;
+		} else {
+			cache.flags |= FIEMAP_EXTENT_LAST;
+		}
+	}
+
+	ret = emit_last_fiemap_cache(fieinfo, &cache);
+
+out_unlock:
+	unlock_extent_cached(&inode->io_tree, lockstart, lockend, &cached_state);
+out:
 	kfree(backref_cache);
 	btrfs_free_path(path);
 	ulist_free(roots);
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index b292a8ada3a4..636b3ec46184 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -3602,10 +3602,10 @@ static long btrfs_fallocate(struct file *file, int mode,
 }
 
 /*
- * Helper for have_delalloc_in_range(). Find a subrange in a given range that
- * has unflushed and/or flushing delalloc. There might be other adjacent
- * subranges after the one it found, so have_delalloc_in_range() keeps looping
- * while it gets adjacent subranges, and merging them together.
+ * Helper for btrfs_find_delalloc_in_range(). Find a subrange in a given range
+ * that has unflushed and/or flushing delalloc. There might be other adjacent
+ * subranges after the one it found, so btrfs_find_delalloc_in_range() keeps
+ * looping while it gets adjacent subranges, and merging them together.
  */
 static bool find_delalloc_subrange(struct btrfs_inode *inode, u64 start, u64 end,
 				   u64 *delalloc_start_ret, u64 *delalloc_end_ret)
@@ -3740,8 +3740,8 @@ static bool find_delalloc_subrange(struct btrfs_inode *inode, u64 start, u64 end
  * if so it sets @delalloc_start_ret and @delalloc_end_ret with the start and
  * end offsets of the subrange.
  */
-static bool have_delalloc_in_range(struct btrfs_inode *inode, u64 start, u64 end,
-				   u64 *delalloc_start_ret, u64 *delalloc_end_ret)
+bool btrfs_find_delalloc_in_range(struct btrfs_inode *inode, u64 start, u64 end,
+				  u64 *delalloc_start_ret, u64 *delalloc_end_ret)
 {
 	u64 cur_offset = round_down(start, inode->root->fs_info->sectorsize);
 	u64 prev_delalloc_end = 0;
@@ -3804,8 +3804,8 @@ static bool find_desired_extent_in_hole(struct btrfs_inode *inode, int whence,
 	u64 delalloc_end;
 	bool delalloc;
 
-	delalloc = have_delalloc_in_range(inode, start, end, &delalloc_start,
-					  &delalloc_end);
+	delalloc = btrfs_find_delalloc_in_range(inode, start, end,
+						&delalloc_start, &delalloc_end);
 	if (delalloc && whence == SEEK_DATA) {
 		*start_ret = delalloc_start;
 		return true;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 2c7d31990777..8be1e021513a 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7064,133 +7064,6 @@ struct extent_map *btrfs_get_extent(struct btrfs_inode *inode,
 	return em;
 }
 
-struct extent_map *btrfs_get_extent_fiemap(struct btrfs_inode *inode,
-					   u64 start, u64 len)
-{
-	struct extent_map *em;
-	struct extent_map *hole_em = NULL;
-	u64 delalloc_start = start;
-	u64 end;
-	u64 delalloc_len;
-	u64 delalloc_end;
-	int err = 0;
-
-	em = btrfs_get_extent(inode, NULL, 0, start, len);
-	if (IS_ERR(em))
-		return em;
-	/*
-	 * If our em maps to:
-	 * - a hole or
-	 * - a pre-alloc extent,
-	 * there might actually be delalloc bytes behind it.
-	 */
-	if (em->block_start != EXTENT_MAP_HOLE &&
-	    !test_bit(EXTENT_FLAG_PREALLOC, &em->flags))
-		return em;
-	else
-		hole_em = em;
-
-	/* check to see if we've wrapped (len == -1 or similar) */
-	end = start + len;
-	if (end < start)
-		end = (u64)-1;
-	else
-		end -= 1;
-
-	em = NULL;
-
-	/* ok, we didn't find anything, lets look for delalloc */
-	delalloc_len = count_range_bits(&inode->io_tree, &delalloc_start,
-				 end, len, EXTENT_DELALLOC, 1);
-	delalloc_end = delalloc_start + delalloc_len;
-	if (delalloc_end < delalloc_start)
-		delalloc_end = (u64)-1;
-
-	/*
-	 * We didn't find anything useful, return the original results from
-	 * get_extent()
-	 */
-	if (delalloc_start > end || delalloc_end <= start) {
-		em = hole_em;
-		hole_em = NULL;
-		goto out;
-	}
-
-	/*
-	 * Adjust the delalloc_start to make sure it doesn't go backwards from
-	 * the start they passed in
-	 */
-	delalloc_start = max(start, delalloc_start);
-	delalloc_len = delalloc_end - delalloc_start;
-
-	if (delalloc_len > 0) {
-		u64 hole_start;
-		u64 hole_len;
-		const u64 hole_end = extent_map_end(hole_em);
-
-		em = alloc_extent_map();
-		if (!em) {
-			err = -ENOMEM;
-			goto out;
-		}
-
-		ASSERT(hole_em);
-		/*
-		 * When btrfs_get_extent can't find anything it returns one
-		 * huge hole
-		 *
-		 * Make sure what it found really fits our range, and adjust to
-		 * make sure it is based on the start from the caller
-		 */
-		if (hole_end <= start || hole_em->start > end) {
-		       free_extent_map(hole_em);
-		       hole_em = NULL;
-		} else {
-		       hole_start = max(hole_em->start, start);
-		       hole_len = hole_end - hole_start;
-		}
-
-		if (hole_em && delalloc_start > hole_start) {
-			/*
-			 * Our hole starts before our delalloc, so we have to
-			 * return just the parts of the hole that go until the
-			 * delalloc starts
-			 */
-			em->len = min(hole_len, delalloc_start - hole_start);
-			em->start = hole_start;
-			em->orig_start = hole_start;
-			/*
-			 * Don't adjust block start at all, it is fixed at
-			 * EXTENT_MAP_HOLE
-			 */
-			em->block_start = hole_em->block_start;
-			em->block_len = hole_len;
-			if (test_bit(EXTENT_FLAG_PREALLOC, &hole_em->flags))
-				set_bit(EXTENT_FLAG_PREALLOC, &em->flags);
-		} else {
-			/*
-			 * Hole is out of passed range or it starts after
-			 * delalloc range
-			 */
-			em->start = delalloc_start;
-			em->len = delalloc_len;
-			em->orig_start = delalloc_start;
-			em->block_start = EXTENT_MAP_DELALLOC;
-			em->block_len = delalloc_len;
-		}
-	} else {
-		return hole_em;
-	}
-out:
-
-	free_extent_map(hole_em);
-	if (err) {
-		free_extent_map(em);
-		return ERR_PTR(err);
-	}
-	return em;
-}
-
 static struct extent_map *btrfs_create_dio_extent(struct btrfs_inode *inode,
 						  const u64 start,
 						  const u64 len,
@@ -8265,15 +8138,14 @@ static int btrfs_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
 	 * in the compression of data (in an async thread) and will return
 	 * before the compression is done and writeback is started. A second
 	 * filemap_fdatawrite_range() is needed to wait for the compression to
-	 * complete and writeback to start. Without this, our user is very
-	 * likely to get stale results, because the extents and extent maps for
-	 * delalloc regions are only allocated when writeback starts.
+	 * complete and writeback to start. We also need to wait for ordered
+	 * extents to complete, because our fiemap implementation uses mainly
+	 * file extent items to list the extents, searching for extent maps
+	 * only for file ranges with holes or prealloc extents to figure out
+	 * if we have delalloc in those ranges.
 	 */
 	if (fieinfo->fi_flags & FIEMAP_FLAG_SYNC) {
-		ret = btrfs_fdatawrite_range(inode, 0, LLONG_MAX);
-		if (ret)
-			return ret;
-		ret = filemap_fdatawait_range(inode->i_mapping, 0, LLONG_MAX);
+		ret = btrfs_wait_ordered_range(inode, 0, LLONG_MAX);
 		if (ret)
 			return ret;
 	}
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* Re: [PATCH 01/10] btrfs: allow hole and data seeking to be interruptible
  2022-09-01 13:18 ` [PATCH 01/10] btrfs: allow hole and data seeking to be interruptible fdmanana
@ 2022-09-01 13:58   ` Josef Bacik
  2022-09-01 21:49   ` Qu Wenruo
  1 sibling, 0 replies; 53+ messages in thread
From: Josef Bacik @ 2022-09-01 13:58 UTC (permalink / raw)
  To: fdmanana; +Cc: linux-btrfs

On Thu, Sep 01, 2022 at 02:18:21PM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> Doing hole or data seeking on a file with a very large number of extents
> can take a long time, and we have reports of it being too slow (such as
> at LSFMM from 2017, see the Link below). So make it interruptible.
> 
> Link: https://lwn.net/Articles/718805/
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 02/10] btrfs: make hole and data seeking a lot more efficient
  2022-09-01 13:18 ` [PATCH 02/10] btrfs: make hole and data seeking a lot more efficient fdmanana
@ 2022-09-01 14:03   ` Josef Bacik
  2022-09-01 15:00     ` Filipe Manana
  2022-09-01 22:18   ` Qu Wenruo
  2022-09-11 22:12   ` Qu Wenruo
  2 siblings, 1 reply; 53+ messages in thread
From: Josef Bacik @ 2022-09-01 14:03 UTC (permalink / raw)
  To: fdmanana; +Cc: linux-btrfs

On Thu, Sep 01, 2022 at 02:18:22PM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> The current implementation of hole and data seeking for llseek does not
> scale well in regards to the number of extents and the distance between
> the start offset and the next hole or extent. This is due to a very high
> algorithmic complexity. Often we also get reports of btrfs' hole and data
> seeking (llseek) being too slow, such as at 2017's LSFMM (see the Link
> tag at the bottom).
> 
> In order to better understand it, lets consider the case where the start
> offset is 0, we are seeking for a hole and the file size is 16G. Between
> file offset 0 and the first hole in the file there are 100K extents - this
> is common for large files, specially if we have compression enabled, since
> the maximum extent size is limited to 128K. The steps take by the main
> loop of the current algorithm are the following:
> 
> 1) We start by calling btrfs_get_extent_fiemap(), for file offset 0, which
>    calls btrfs_get_extent(). This will first lookup for an extent map in
>    the inode's extent map tree (a red black tree). If the extent map is
>    not loaded in memory, then it will do a lookup for the corresponding
>    file extent item in the subvolume's b+tree, create an extent map based
>    on the contents of the file extent item and then add the extent map to
>    the extent map tree of the inode;
> 
> 2) The second iteration calls btrfs_get_extent_fiemap() again, this time
>    with a start offset matching the end offset of the previous extent.
>    Again, btrfs_get_extent() will first search the extent map tree, and
>    if it doesn't find an extent map there, it will again search in the
>    b+tree of the subvolume for a matching file extent item, build an
>    extent map based on the file extent item, and add the extent map to
>    the extent map tree of the inode;
> 
> 3) This repeats over and over until we find the first hole (when seeking
>    for holes) or until we find the first extent (when seeking for data).
> 
>    If there are no extent maps loaded in memory, then on
>    each iteration we do 1 extent map tree search, 1 b+tree search, plus
>    1 more extent map tree traversal to insert an extent map - plus we
>    allocate memory for the extent map.
> 
>    On each iteration we are growing the size of the extent map tree,
>    making each future search slower, and also visiting the same b+tree
>    leaves over and over again - taking into account that with the default leaf
>    size of 16K we can fit more than 200 file extent items in a leaf - so
>    we can visit the same b+tree leaf 200+ times, on each visit walking
>    down a path from the root to the leaf.
> 
> So it's easy to see that what we have now doesn't scale well. Also, it
> loads an extent map for every file extent item into memory, which is not
> efficient - we should add extent maps only when doing IO (writing or
> reading file data).
> 
> This change implements a new algorithm which scales much better, and
> works like this:
> 
> 1) We iterate over the subvolume's b+tree, visiting each leaf that has
>    file extent items once and only once;
> 
> 2) For any file extent item found that doesn't represent a hole or prealloc
>    extent, it will not search the extent map tree - there's no need at
>    all for that - an extent map is just an in-memory representation of a
>    file extent item;
> 
> 3) When a hole is found, or a prealloc extent, it will check if there's
>    delalloc for its range. For this it will search for EXTENT_DELALLOC
>    bits in the inode's io tree and check the extent map tree - this is
>    for accounting for unflushed delalloc and for flushed delalloc (the
>    period between running delalloc and ordered extent completion),
>    respectively. This is similar to what the current implementation does
>    when it finds a hole or prealloc extent, but without creating extent
>    maps and adding them to the extent map tree in case they are not
>    loaded in memory;
> 
> 4) It never allocates extent maps, or adds extent maps to the inode's
>    extent map tree. This not only saves memory and time (from the tree
>    insertions and allocations), but also eliminates the possibility of
>    -ENOMEM due to allocating too many extent maps.
> 
> Part of this new code will also be used later for fiemap (which also
> suffers similar scalability problems).
> 
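For readers who want to poke at this by hand: from user space, hole and
data seeking is plain lseek(2) with SEEK_HOLE/SEEK_DATA, which is what the
xfs_io "seek -h 0" calls in the script below boil down to. A minimal
standalone sketch (illustration only, not part of the patch; on glibc,
SEEK_HOLE needs _GNU_SOURCE):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        int fd;
        off_t hole;

        if (argc != 2) {
            fprintf(stderr, "usage: %s <file>\n", argv[0]);
            return 1;
        }

        fd = open(argv[1], O_RDONLY);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        /* Find the first hole at or after offset 0. */
        hole = lseek(fd, 0, SEEK_HOLE);
        if (hole == (off_t)-1)
            perror("lseek");
        else
            printf("first hole at offset %lld\n", (long long)hole);

        close(fd);
        return 0;
    }
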
> The following test example can be used to quickly measure the efficiency
> before and after this patch:
> 
>     $ cat test-seek-hole.sh
>     #!/bin/bash
> 
>     DEV=/dev/sdi
>     MNT=/mnt/sdi
> 
>     mkfs.btrfs -f $DEV
> 
>     mount -o compress=lzo $DEV $MNT
> 
>     # 16G file -> 131073 compressed extents.
>     xfs_io -f -c "pwrite -S 0xab -b 1M 0 16G" $MNT/foobar
> 
>     # Leave a 1M hole at file offset 15G.
>     xfs_io -c "fpunch 15G 1M" $MNT/foobar
> 
>     # Unmount and mount again, so that we can test when there's no
>     # metadata cached in memory.
>     umount $MNT
>     mount -o compress=lzo $DEV $MNT
> 
>     # Test seeking for hole from offset 0 (hole is at offset 15G).
> 
>     start=$(date +%s%N)
>     xfs_io -c "seek -h 0" $MNT/foobar
>     end=$(date +%s%N)
>     dur=$(( (end - start) / 1000000 ))
>     echo "Took $dur milliseconds to seek first hole (metadata not cached)"
>     echo
> 
>     start=$(date +%s%N)
>     xfs_io -c "seek -h 0" $MNT/foobar
>     end=$(date +%s%N)
>     dur=$(( (end - start) / 1000000 ))
>     echo "Took $dur milliseconds to seek first hole (metadata cached)"
>     echo
> 
>     umount $MNT
> 
> Before this change:
> 
>     $ ./test-seek-hole.sh
>     (...)
>     Whence	Result
>     HOLE	16106127360
>     Took 176 milliseconds to seek first hole (metadata not cached)
> 
>     Whence	Result
>     HOLE	16106127360
>     Took 17 milliseconds to seek first hole (metadata cached)
> 
> After this change:
> 
>     $ ./test-seek-hole.sh
>     (...)
>     Whence	Result
>     HOLE	16106127360
>     Took 43 milliseconds to seek first hole (metadata not cached)
> 
>     Whence	Result
>     HOLE	16106127360
>     Took 13 milliseconds to seek first hole (metadata cached)
> 
> That's about 4X faster when no metadata is cached and about 30% faster
> when all metadata is cached.
> 
> In practice the differences may often be significantly higher, either due
> to a higher number of extents in a file or because the subvolume's b+tree
> is much bigger than in this example, where we only have one file.
> 
> Link: https://lwn.net/Articles/718805/
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
>  fs/btrfs/file.c | 437 ++++++++++++++++++++++++++++++++++++++++++++----
>  1 file changed, 406 insertions(+), 31 deletions(-)
> 
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 96f444ad0951..b292a8ada3a4 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -3601,22 +3601,281 @@ static long btrfs_fallocate(struct file *file, int mode,
>  	return ret;
>  }
>  
> +/*
> + * Helper for have_delalloc_in_range(). Find a subrange in a given range that
> + * has unflushed and/or flushing delalloc. There might be other adjacent
> + * subranges after the one it found, so have_delalloc_in_range() keeps looping
> + * while it gets adjacent subranges, and merging them together.
> + */
> +static bool find_delalloc_subrange(struct btrfs_inode *inode, u64 start, u64 end,
> +				   u64 *delalloc_start_ret, u64 *delalloc_end_ret)
> +{
> +	const u64 len = end + 1 - start;
> +	struct extent_map_tree *em_tree = &inode->extent_tree;
> +	struct extent_map *em;
> +	u64 em_end;
> +	u64 delalloc_len;
> +
> +	/*
> +	 * Search the io tree first for EXTENT_DELALLOC. If we find any, it
> +	 * means we have delalloc (dirty pages) for which writeback has not
> +	 * started yet.
> +	 */
> +	*delalloc_start_ret = start;
> +	delalloc_len = count_range_bits(&inode->io_tree, delalloc_start_ret, end,
> +					len, EXTENT_DELALLOC, 1);
> +	/*
> +	 * If delalloc was found then *delalloc_start_ret has a sector size
> +	 * aligned value (rounded down).
> +	 */
> +	if (delalloc_len > 0)
> +		*delalloc_end_ret = *delalloc_start_ret + delalloc_len - 1;
> +
> +	/*
> +	 * Now also check if there's any extent map in the range that does not
> +	 * map to a hole or prealloc extent. We do this because:
> +	 *
> +	 * 1) When delalloc is flushed, the file range is locked, we clear the
> +	 *    EXTENT_DELALLOC bit from the io tree and create an extent map for
> +	 *    an allocated extent. So we might just have been called after
> +	 *    delalloc is flushed and before the ordered extent completes and
> +	 *    inserts the new file extent item in the subvolume's btree;
> +	 *
> +	 * 2) We may have an extent map created by flushing delalloc for a
> +	 *    subrange that starts before the subrange we found marked with
> +	 *    EXTENT_DELALLOC in the io tree.
> +	 */
> +	read_lock(&em_tree->lock);
> +	em = lookup_extent_mapping(em_tree, start, len);
> +	read_unlock(&em_tree->lock);
> +
> +	/* extent_map_end() returns a non-inclusive end offset. */
> +	em_end = em ? extent_map_end(em) : 0;
> +
> +	/*
> +	 * If we have a hole/prealloc extent map, check the next one if this one
> +	 * ends before our range's end.
> +	 */
> +	if (em && (em->block_start == EXTENT_MAP_HOLE ||
> +		   test_bit(EXTENT_FLAG_PREALLOC, &em->flags)) && em_end < end) {
> +		struct extent_map *next_em;
> +
> +		read_lock(&em_tree->lock);
> +		next_em = lookup_extent_mapping(em_tree, em_end, len - em_end);
> +		read_unlock(&em_tree->lock);
> +
> +		free_extent_map(em);
> +		em_end = next_em ? extent_map_end(next_em) : 0;
> +		em = next_em;
> +	}
> +
> +	if (em && (em->block_start == EXTENT_MAP_HOLE ||
> +		   test_bit(EXTENT_FLAG_PREALLOC, &em->flags))) {
> +		free_extent_map(em);
> +		em = NULL;
> +	}
> +
> +	/*
> +	 * No extent map or one for a hole or prealloc extent. Use the delalloc
> +	 * range we found in the io tree if we have one.
> +	 */
> +	if (!em)
> +		return (delalloc_len > 0);
> +

You can move this after the lookup, and then remove the if (em && parts above.
Then all you need to do in the second if statement is return (delalloc_len > 0);
Thanks,

Josef

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 03/10] btrfs: remove check for impossible block start for an extent map at fiemap
  2022-09-01 13:18 ` [PATCH 03/10] btrfs: remove check for impossible block start for an extent map at fiemap fdmanana
@ 2022-09-01 14:03   ` Josef Bacik
  2022-09-01 22:19   ` Qu Wenruo
  1 sibling, 0 replies; 53+ messages in thread
From: Josef Bacik @ 2022-09-01 14:03 UTC (permalink / raw)
  To: fdmanana; +Cc: linux-btrfs

On Thu, Sep 01, 2022 at 02:18:23PM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> During fiemap we are testing if an extent map has a block start with a
> value of EXTENT_MAP_LAST_BYTE, but that is never set on an extent map,
> and never was according to git history. So remove that useless check.
> 
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 04/10] btrfs: remove zero length check when entering fiemap
  2022-09-01 13:18 ` [PATCH 04/10] btrfs: remove zero length check when entering fiemap fdmanana
@ 2022-09-01 14:04   ` Josef Bacik
  2022-09-01 22:24   ` Qu Wenruo
  1 sibling, 0 replies; 53+ messages in thread
From: Josef Bacik @ 2022-09-01 14:04 UTC (permalink / raw)
  To: fdmanana; +Cc: linux-btrfs

On Thu, Sep 01, 2022 at 02:18:24PM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> There's no point in checking for a 0 length at extent_fiemap(), as before
> calling it, we called fiemap_prep() at btrfs_fiemap(), which already
> checks for a zero length and returns the same -EINVAL error. So remove
> the pointless check.
> 
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 05/10] btrfs: properly flush delalloc when entering fiemap
  2022-09-01 13:18 ` [PATCH 05/10] btrfs: properly flush delalloc " fdmanana
@ 2022-09-01 14:06   ` Josef Bacik
  2022-09-01 22:38   ` Qu Wenruo
  1 sibling, 0 replies; 53+ messages in thread
From: Josef Bacik @ 2022-09-01 14:06 UTC (permalink / raw)
  To: fdmanana; +Cc: linux-btrfs

On Thu, Sep 01, 2022 at 02:18:25PM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> If the flag FIEMAP_FLAG_SYNC is passed to fiemap, it means all delalloc
> should be flushed and writeback complete. We call the generic helper
> fiemap_prep() which does a filemap_write_and_wait() in case that flag is
> given, however that is not enough if we have compression. Because a
> single filemap_fdatawrite_range() only starts compression (in an async
> thread) and therefore returns before the compression is done and writeback
> is started.
> 
> So make btrfs_fiemap() actually wait for all writeback to start and
> complete if FIEMAP_FLAG_SYNC is set. We start and wait for writeback
> on the whole possible file range, from 0 to LLONG_MAX, because that is
> what the generic code at fiemap_prep() does.
> 
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
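
For illustration only (not from the patch): the flag as seen from user
space. A minimal sketch that asks the kernel to flush and wait before
mapping extents, roughly what "filefrag -s" requests:

    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/fs.h>
    #include <linux/fiemap.h>

    int main(int argc, char **argv)
    {
        struct fiemap fm = {};
        int fd;

        if (argc != 2) {
            fprintf(stderr, "usage: %s <file>\n", argv[0]);
            return 1;
        }
        fd = open(argv[1], O_RDONLY);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        /* Flush delalloc and wait for writeback before mapping extents. */
        fm.fm_flags = FIEMAP_FLAG_SYNC;
        fm.fm_length = FIEMAP_MAX_OFFSET;
        fm.fm_extent_count = 0; /* only count extents, don't copy them out */

        if (ioctl(fd, FS_IOC_FIEMAP, &fm) < 0) {
            perror("fiemap");
            close(fd);
            return 1;
        }
        printf("mapped extents: %u\n", fm.fm_mapped_extents);
        close(fd);
        return 0;
    }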

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 06/10] btrfs: allow fiemap to be interruptible
  2022-09-01 13:18 ` [PATCH 06/10] btrfs: allow fiemap to be interruptible fdmanana
@ 2022-09-01 14:07   ` Josef Bacik
  2022-09-01 22:42   ` Qu Wenruo
  1 sibling, 0 replies; 53+ messages in thread
From: Josef Bacik @ 2022-09-01 14:07 UTC (permalink / raw)
  To: fdmanana; +Cc: linux-btrfs

On Thu, Sep 01, 2022 at 02:18:26PM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> Doing fiemap on a file with a very large number of extents can take a very
> long time, and we have reports of it being too slow (two recent examples
> in the Link tags below), so make it interruptible.
> 
> Link: https://lore.kernel.org/linux-btrfs/21dd32c6-f1f9-f44a-466a-e18fdc6788a7@virtuozzo.com/
> Link: https://lore.kernel.org/linux-btrfs/Ysace25wh5BbLd5f@atmark-techno.com/
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 07/10] btrfs: rename btrfs_check_shared() to a more descriptive name
  2022-09-01 13:18 ` [PATCH 07/10] btrfs: rename btrfs_check_shared() to a more descriptive name fdmanana
@ 2022-09-01 14:08   ` Josef Bacik
  2022-09-01 22:45   ` Qu Wenruo
  1 sibling, 0 replies; 53+ messages in thread
From: Josef Bacik @ 2022-09-01 14:08 UTC (permalink / raw)
  To: fdmanana; +Cc: linux-btrfs

On Thu, Sep 01, 2022 at 02:18:27PM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> The function btrfs_check_shared() is supposed to be used to check if a
> data extent is shared, but its name is too generic and may easily cause
> confusion in the sense that it may be used for metadata extents.
> 
> So rename it to btrfs_is_data_extent_shared(), which will also make it
> less confusing after the next change that adds a backref lookup cache for
> the b+tree nodes that lead to the leaf that contains the file extent item
> that points to the target data extent.
> 
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 08/10] btrfs: speedup checking for extent sharedness during fiemap
  2022-09-01 13:18 ` [PATCH 08/10] btrfs: speedup checking for extent sharedness during fiemap fdmanana
@ 2022-09-01 14:23   ` Josef Bacik
  2022-09-01 22:50   ` Qu Wenruo
  1 sibling, 0 replies; 53+ messages in thread
From: Josef Bacik @ 2022-09-01 14:23 UTC (permalink / raw)
  To: fdmanana; +Cc: linux-btrfs

On Thu, Sep 01, 2022 at 02:18:28PM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> One of the most expensive tasks performed during fiemap is to check if
> an extent is shared. This task has two major steps:
> 
> 1) Check if the data extent is shared. This implies checking the extent
>    item in the extent tree, checking delayed references, etc. If we
>    find the data extent is directly shared, we terminate immediately;
> 
> 2) If the data extent is not directly shared (its extent item has a
>    refcount of 1), then it may be shared if we have snapshots that share
>    subtrees of the inode's subvolume b+tree. So we check if the leaf
>    containing the file extent item is shared, then its parent node, then
>    the parent node of the parent node, etc, until we reach the root node
>    or we find one of them is shared - in which case we stop immediately.
> 
> During fiemap we process the extents of a file from left to right, from
> file offset 0 to eof. This means that we iterate b+tree leaves from left
> to right, with the implication that we keep repeating that second step
> above several times for the same b+tree path of the inode's subvolume
> b+tree.
> 
> For example, if we have two file extent items in leaf X, and the path to
> leaf X is A -> B -> C -> X, then when we try to determine if the data
> extent referenced by the first extent item is shared, we check if the data
> extent is shared - if it's not, then we check if leaf X is shared, if not,
> then we check if node C is shared, if not, then check if node B is shared,
> if not then check if node A is shared. When we move to the next file
> extent item, after determining the data extent is not shared, we repeat
> the checks for X, C, B and A - doing all the expensive searches in the
> extent tree, delayed refs, etc. If we have thousands of file extents, then
> we keep repeating the sharedness checks for the same paths over and over.
> 
> On a file that has no shared extents or only a small portion, it's easy
> to see that this scales terribly with the number of extents in the file
> and the sizes of the extent and subvolume b+trees.
> 
> This change eliminates the repeated sharedness check on extent buffers
> by caching the results of the last path used. The results can be used as
> long as no snapshots were created since they were cached (for not shared
> extent buffers) or no roots were dropped since they were cached (for
> shared extent buffers). This greatly reduces the time spent by fiemap for
> files with thousands of extents and/or large extent and subvolume b+trees.
> 
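As a standalone model of the caching idea (a simplification, not the
kernel code: the real cache keeps one entry per b+tree level and uses
separate invalidation rules for shared and non-shared results), the core
is just "remember the last answer per level and only trust it while the
invalidation counter has not moved":

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define MAX_LEVEL 8

    struct shared_cache_entry {
        uint64_t bytenr;   /* node this entry is about */
        uint64_t stamp;    /* counter value when the result was cached */
        bool is_shared;
        bool valid;
    };

    struct shared_cache {
        struct shared_cache_entry entries[MAX_LEVEL];
        const uint64_t *counter; /* bumped when a snapshot is created or a root dropped */
    };

    static bool cache_lookup(const struct shared_cache *c, int level,
                             uint64_t bytenr, bool *is_shared)
    {
        const struct shared_cache_entry *e = &c->entries[level];

        /* Only trust an entry for the same node and an unchanged counter. */
        if (!e->valid || e->bytenr != bytenr || e->stamp != *c->counter)
            return false;
        *is_shared = e->is_shared;
        return true;
    }

    static void cache_store(struct shared_cache *c, int level,
                            uint64_t bytenr, bool is_shared)
    {
        struct shared_cache_entry *e = &c->entries[level];

        e->bytenr = bytenr;
        e->is_shared = is_shared;
        e->stamp = *c->counter;
        e->valid = true;
    }

    int main(void)
    {
        uint64_t snapshot_counter = 0;
        struct shared_cache cache = { .counter = &snapshot_counter };
        bool shared;

        cache_store(&cache, 0, 12345, false);
        printf("hit before snapshot: %d\n",
               cache_lookup(&cache, 0, 12345, &shared));
        snapshot_counter++; /* a snapshot was created, results are stale */
        printf("hit after snapshot:  %d\n",
               cache_lookup(&cache, 0, 12345, &shared));
        return 0;
    }
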
> Example performance test:
> 
>     $ cat fiemap-perf-test.sh
>     #!/bin/bash
> 
>     DEV=/dev/sdi
>     MNT=/mnt/sdi
> 
>     mkfs.btrfs -f $DEV
>     mount -o compress=lzo $DEV $MNT
> 
>     # 40G gives 327680 128K file extents (due to compression).
>     xfs_io -f -c "pwrite -S 0xab -b 1M 0 40G" $MNT/foobar
> 
>     umount $MNT
>     mount -o compress=lzo $DEV $MNT
> 
>     start=$(date +%s%N)
>     filefrag $MNT/foobar
>     end=$(date +%s%N)
>     dur=$(( (end - start) / 1000000 ))
>     echo "fiemap took $dur milliseconds (metadata not cached)"
> 
>     start=$(date +%s%N)
>     filefrag $MNT/foobar
>     end=$(date +%s%N)
>     dur=$(( (end - start) / 1000000 ))
>     echo "fiemap took $dur milliseconds (metadata cached)"
> 
>     umount $MNT
> 
> Before this patch:
> 
>     $ ./fiemap-perf-test.sh
>     (...)
>     /mnt/sdi/foobar: 327680 extents found
>     fiemap took 3597 milliseconds (metadata not cached)
>     /mnt/sdi/foobar: 327680 extents found
>     fiemap took 2107 milliseconds (metadata cached)
> 
> After this patch:
> 
>     $ ./fiemap-perf-test.sh
>     (...)
>     /mnt/sdi/foobar: 327680 extents found
>     fiemap took 1646 milliseconds (metadata not cached)
>     /mnt/sdi/foobar: 327680 extents found
>     fiemap took 698 milliseconds (metadata cached)
> 
> That's about 2.2x faster when no metadata is cached, and about 3x faster
> when all metadata is cached. On a real filesystem with many other files,
> data, directories, etc, the b+trees will be 2 or 3 levels higher,
> therefore this optimization will have a higher impact.
> 
> Several reports of a slow fiemap show up often; the two Link tags below
> refer to two recent reports of such slowness. This patch, together with
> the next ones in the series, is meant to address that.
> 
> Link: https://lore.kernel.org/linux-btrfs/21dd32c6-f1f9-f44a-466a-e18fdc6788a7@virtuozzo.com/
> Link: https://lore.kernel.org/linux-btrfs/Ysace25wh5BbLd5f@atmark-techno.com/
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 09/10] btrfs: skip unnecessary extent buffer sharedness checks during fiemap
  2022-09-01 13:18 ` [PATCH 09/10] btrfs: skip unnecessary extent buffer sharedness checks " fdmanana
@ 2022-09-01 14:26   ` Josef Bacik
  2022-09-01 23:01   ` Qu Wenruo
  1 sibling, 0 replies; 53+ messages in thread
From: Josef Bacik @ 2022-09-01 14:26 UTC (permalink / raw)
  To: fdmanana; +Cc: linux-btrfs

On Thu, Sep 01, 2022 at 02:18:29PM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> During fiemap, for each file extent we find, we must check if it's shared
> or not. The sharedness check starts by verifying if the extent is directly
> shared (its refcount in the extent tree is > 1), and if it is not directly
> shared, then we will check if every node in the subvolume b+tree leading
> from the root to the leaf that has the file extent item (in reverse order),
> is shared (through snapshots).
> 
> However this second step is not needed if our extent was created in a
> transaction more recent than the last transaction where a snapshot of the
> inode's root happened, because it can't be shared indirectly (through
> shared subtrees) without a snapshot created in a more recent transaction.
> 
> So grab the generation of the extent from the extent map and pass it to
> btrfs_is_data_extent_shared(), which will skip this second phase when the
> generation is more recent than the root's last snapshot value. Note that
> we skip this optimization if the extent map is the result of merging 2
> or more extent maps, because in this case its generation is the maximum
> of the generations of all merged extent maps.
> 
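Reduced to its essence, the shortcut is a single comparison. A standalone
sketch (not the kernel code; treating a generation of 0 as "unknown" here
is just an assumption of this sketch):

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /*
     * If the data extent was created in a transaction newer than the last
     * snapshot of the root, no snapshot can be sharing it through a shared
     * subtree, so the expensive per-node walk can be skipped.
     */
    static bool need_path_sharedness_walk(uint64_t extent_gen,
                                          uint64_t last_snapshot_gen)
    {
        if (extent_gen == 0)            /* unknown (e.g. merged extent maps) */
            return true;
        return extent_gen <= last_snapshot_gen;
    }

    int main(void)
    {
        printf("%d\n", need_path_sharedness_walk(100, 90)); /* 0: skip walk */
        printf("%d\n", need_path_sharedness_walk(80, 90));  /* 1: must walk */
        return 0;
    }
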
> The fact that we use extent maps and they can be merged despite the
> underlying extents being distinct (different file extent items in the
> subvolume b+tree and different extent items in the extent b+tree), can
> result in some bugs when reporting shared extents. But this is a problem
> of the current implementation of fiemap relying on extent maps.
> One example where we get incorrect results is:
> 
>     $ cat fiemap-bug.sh
>     #!/bin/bash
> 
>     DEV=/dev/sdj
>     MNT=/mnt/sdj
> 
>     mkfs.btrfs -f $DEV
>     mount $DEV $MNT
> 
>     # Create a file with two 256K extents.
>     # Since there is no other write activity, they will be contiguous,
>     # and their extent maps merged, despite having two distinct extents.
>     xfs_io -f -c "pwrite -S 0xab 0 256K" \
>               -c "fsync" \
>               -c "pwrite -S 0xcd 256K 256K" \
>               -c "fsync" \
>               $MNT/foo
> 
>     # Now clone only the second extent into another file.
>     xfs_io -f -c "reflink $MNT/foo 256K 0 256K" $MNT/bar
> 
>     # Filefrag will report a single 512K extent, and say it's not shared.
>     echo
>     filefrag -v $MNT/foo
> 
>     umount $MNT
> 
> Running the reproducer:
> 
>     $ ./fiemap-bug.sh
>     wrote 262144/262144 bytes at offset 0
>     256 KiB, 64 ops; 0.0038 sec (65.479 MiB/sec and 16762.7030 ops/sec)
>     wrote 262144/262144 bytes at offset 262144
>     256 KiB, 64 ops; 0.0040 sec (61.125 MiB/sec and 15647.9218 ops/sec)
>     linked 262144/262144 bytes at offset 0
>     256 KiB, 1 ops; 0.0002 sec (1.034 GiB/sec and 4237.2881 ops/sec)
> 
>     Filesystem type is: 9123683e
>     File size of /mnt/sdj/foo is 524288 (128 blocks of 4096 bytes)
>      ext:     logical_offset:        physical_offset: length:   expected: flags:
>        0:        0..     127:       3328..      3455:    128:             last,eof
>     /mnt/sdj/foo: 1 extent found
> 
> We end up reporting that we have a single 512K extent that is not shared, however
> we have two 256K extents, and the second one is shared. Changing the
> reproducer to clone instead the first extent into file 'bar', makes us
> report a single 512K extent that is shared, which is also incorrect since
> we have two 256K extents and only the first one is shared.
> 
> This is a problem that existed before this change, and remains after this
> change, as it can't be easily fixed. The next patch in the series reworks
> fiemap to primarily use file extent items instead of extent maps (except
> for checking for delalloc ranges), with the goal of improving its
> scalability and performance, but it also ends up fixing this particular
> bug caused by extent map merging.
> 
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 10/10] btrfs: make fiemap more efficient and accurate reporting extent sharedness
  2022-09-01 13:18 ` [PATCH 10/10] btrfs: make fiemap more efficient and accurate reporting extent sharedness fdmanana
@ 2022-09-01 14:35   ` Josef Bacik
  2022-09-01 15:04     ` Filipe Manana
  2022-09-01 23:27   ` Qu Wenruo
  1 sibling, 1 reply; 53+ messages in thread
From: Josef Bacik @ 2022-09-01 14:35 UTC (permalink / raw)
  To: fdmanana; +Cc: linux-btrfs

On Thu, Sep 01, 2022 at 02:18:30PM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> The current fiemap implementation does not scale very well with the number
> of extents a file has. This is both because the main algorithm to find out
> the extents has a high algorithmic complexity and because for each extent
> we have to check if it's shared. This second part, checking if an extent
> is shared, is significantly improved by the two previous patches in this
> patchset, while the first part is improved by this specific patch. Every
> now and then we get reports from users mentioning fiemap is too slow or
> even unusable for files with a very large number of extents, such as the
> two recent reports referred to by the Link tags at the bottom of this
> change log.
> 
> To understand why the part of finding which extents a file has is very
> inefficient, consider the example of doing a full ranged fiemap against
> a file that has over 100K extents (normal for example for a file with
> more than 10G of data and using compression, which limits the extent size
> to 128K). When we enter fiemap at extent_fiemap(), the following happens:
> 
> 1) Before entering the main loop, we call get_extent_skip_holes() to get
>    the first extent map. This leads us to btrfs_get_extent_fiemap(), which
>    in turn calls btrfs_get_extent(), to find the first extent map that
>    covers the file range [0, LLONG_MAX).
> 
>    btrfs_get_extent() will first search the inode's extent map tree, to
>    see if we have an extent map there that covers the range. If it does
>    not find one, then it will search the inode's subvolume b+tree for a
>    fitting file extent item. After finding the file extent item, it will
>    allocate an extent map, fill it in with information extracted from the
>    file extent item, and add it to the inode's extent map tree (which
>    requires a search for insertion in the tree).
> 
> 2) Then we enter the main loop at extent_fiemap(), emit the details of
>    the extent, and call again get_extent_skip_holes(), with a start
>    offset matching the end of the extent map we previously processed.
> 
>    We end up at btrfs_get_extent() again, which will search the extent map tree
>    and then search the subvolume b+tree for a file extent item if we could
>    not find an extent map in the extent tree. We allocate an extent map,
>    fill it in with the details in the file extent item, and then insert
>    it into the extent map tree (yet another search in this tree).
> 
> 3) The second step is repeated over and over, until we have processed the
>    whole file range. Each iteration ends at btrfs_get_extent(), which
>    does a red black tree search on the extent map tree, then searches the
>    subvolume b+tree, allocates an extent map and then does another search
>    in the extent map tree in order to insert the extent map.
> 
>    In the best scenario we have all the extent maps already in the extent
>    map tree, and so for each extent we do a single search on a red black tree,
>    so we have a complexity of O(n log n).
> 
>    In the worst scenario we don't have any extent map already loaded in
>    the extent map tree, or have very few already there. In this case the
>    complexity is much higher since we do:
> 
>    - A red black tree search on the extent map tree, which has O(log n)
>      complexity, initially very fast since the tree is empty or very
>      small, but as we end up allocating extent maps and adding them to
>      the tree when we don't find them there, each subsequent search on
>      the tree gets slower, since it's getting bigger and bigger after
>      each iteration.
> 
>    - A search on the subvolume b+tree, also O(log n) complexity, but it
>      has items for all inodes in the subvolume, not just items for our
>      inode. Plus on a filesystem with concurrent operations on other
>      inodes, we can block doing the search due to lock contention on
>      b+tree nodes/leaves.
> 
>    - Allocate an extent map - this can block, and can also fail if we
>      are under serious memory pressure.
> 
>    - Do another search on the extent maps red black tree, with the goal
>      of inserting the extent map we just allocated. Again, after every
>      iteration this tree is getting bigger by 1 element, so after many
>      iterations the searches are slower and slower.
> 
>    - We will not need the allocated extent map anymore, so it's pointless
>      to add it to the extent map tree. It's just wasting time and memory.
> 
>    In short we end up searching the extent map tree multiple times, on a
>    tree that is growing bigger and bigger after each iteration. And
>    besides that we visit the same leaf of the subvolume b+tree many times,
>    since a leaf with the default size of 16K can easily have more than 200
>    file extent items.
> 
> This is very inefficient overall. This patch changes the algorithm to
> instead iterate over the subvolume b+tree, visiting each leaf only once,
> and only searching in the extent map tree for file ranges that have holes
> or prealloc extents, in order to figure out if we have delalloc there.
> It will never allocate an extent map and add it to the extent map tree.
> This is very similar to what was previously done for the lseek's hole and
> data seeking features.
> 
> Also, the current implementation relying on extent maps for figuring out
> which extents we have is not correct. This is because extent maps can be
> merged even if they represent different extents - we do this to minimize
> memory utilization and keep extent map trees smaller. For example if we
> have two extents that are contiguous on disk, once we load the two extent
> maps, they get merged into a single one - however if only one of the
> extents is shared, we end up reporting both as shared or both as not
> shared, which is incorrect.
> 
> This reproducer triggers that bug:
> 
>     $ cat fiemap-bug.sh
>     #!/bin/bash
> 
>     DEV=/dev/sdj
>     MNT=/mnt/sdj
> 
>     mkfs.btrfs -f $DEV
>     mount $DEV $MNT
> 
>     # Create a file with two 256K extents.
>     # Since there is no other write activity, they will be contiguous,
>     # and their extent maps merged, despite having two distinct extents.
>     xfs_io -f -c "pwrite -S 0xab 0 256K" \
>               -c "fsync" \
>               -c "pwrite -S 0xcd 256K 256K" \
>               -c "fsync" \
>               $MNT/foo
> 
>     # Now clone only the second extent into another file.
>     xfs_io -f -c "reflink $MNT/foo 256K 0 256K" $MNT/bar
> 
>     # Filefrag will report a single 512K extent, and say it's not shared.
>     echo
>     filefrag -v $MNT/foo
> 
>     umount $MNT
> 
> Running the reproducer:
> 
>     $ ./fiemap-bug.sh
>     wrote 262144/262144 bytes at offset 0
>     256 KiB, 64 ops; 0.0038 sec (65.479 MiB/sec and 16762.7030 ops/sec)
>     wrote 262144/262144 bytes at offset 262144
>     256 KiB, 64 ops; 0.0040 sec (61.125 MiB/sec and 15647.9218 ops/sec)
>     linked 262144/262144 bytes at offset 0
>     256 KiB, 1 ops; 0.0002 sec (1.034 GiB/sec and 4237.2881 ops/sec)
> 
>     Filesystem type is: 9123683e
>     File size of /mnt/sdj/foo is 524288 (128 blocks of 4096 bytes)
>      ext:     logical_offset:        physical_offset: length:   expected: flags:
>        0:        0..     127:       3328..      3455:    128:             last,eof
>     /mnt/sdj/foo: 1 extent found
> 
> We end up reporting that we have a single 512K extent that is not shared, however
> we have two 256K extents, and the second one is shared. Changing the
> reproducer to clone instead the first extent into file 'bar', makes us
> report a single 512K extent that is shared, which is also incorrect since
> we have two 256K extents and only the first one is shared.
> 
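To see the reported per-extent sharedness directly (the same information
filefrag -v prints in its flags column), a minimal standalone fiemap
caller can be used; illustration only, not part of the patch:

    #include <stdio.h>
    #include <stdlib.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/fs.h>
    #include <linux/fiemap.h>

    int main(int argc, char **argv)
    {
        struct fiemap *fm;
        unsigned int i;
        int fd;

        if (argc != 2) {
            fprintf(stderr, "usage: %s <file>\n", argv[0]);
            return 1;
        }
        fd = open(argv[1], O_RDONLY);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        /* Room for up to 32 extents, plenty for the reproducer above. */
        fm = calloc(1, sizeof(*fm) + 32 * sizeof(struct fiemap_extent));
        if (!fm) {
            close(fd);
            return 1;
        }
        fm->fm_length = FIEMAP_MAX_OFFSET;
        fm->fm_extent_count = 32;

        if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0) {
            perror("fiemap");
            free(fm);
            close(fd);
            return 1;
        }

        for (i = 0; i < fm->fm_mapped_extents; i++) {
            const struct fiemap_extent *fe = &fm->fm_extents[i];

            printf("logical %llu len %llu shared %d\n",
                   (unsigned long long)fe->fe_logical,
                   (unsigned long long)fe->fe_length,
                   !!(fe->fe_flags & FIEMAP_EXTENT_SHARED));
        }
        free(fm);
        close(fd);
        return 0;
    }
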
> This patch is part of a larger patchset that is comprised of the following
> patches:
> 
>     btrfs: allow hole and data seeking to be interruptible
>     btrfs: make hole and data seeking a lot more efficient
>     btrfs: remove check for impossible block start for an extent map at fiemap
>     btrfs: remove zero length check when entering fiemap
>     btrfs: properly flush delalloc when entering fiemap
>     btrfs: allow fiemap to be interruptible
>     btrfs: rename btrfs_check_shared() to a more descriptive name
>     btrfs: speedup checking for extent sharedness during fiemap
>     btrfs: skip unnecessary extent buffer sharedness checks during fiemap
>     btrfs: make fiemap more efficient and accurate reporting extent sharedness
> 
> The patchset was tested on a machine running a non-debug kernel (Debian's
> default config), comparing the tests below on a branch without the
> patchset versus the same branch with the whole patchset applied.
> 
> The following test for a large compressed file without holes:
> 
>     $ cat fiemap-perf-test.sh
>     #!/bin/bash
> 
>     DEV=/dev/sdi
>     MNT=/mnt/sdi
> 
>     mkfs.btrfs -f $DEV
>     mount -o compress=lzo $DEV $MNT
> 
>     # 40G gives 327680 128K file extents (due to compression).
>     xfs_io -f -c "pwrite -S 0xab -b 1M 0 20G" $MNT/foobar
> 
>     umount $MNT
>     mount -o compress=lzo $DEV $MNT
> 
>     start=$(date +%s%N)
>     filefrag $MNT/foobar
>     end=$(date +%s%N)
>     dur=$(( (end - start) / 1000000 ))
>     echo "fiemap took $dur milliseconds (metadata not cached)"
> 
>     start=$(date +%s%N)
>     filefrag $MNT/foobar
>     end=$(date +%s%N)
>     dur=$(( (end - start) / 1000000 ))
>     echo "fiemap took $dur milliseconds (metadata cached)"
> 
>     umount $MNT
> 
> Before patchset:
> 
>     $ ./fiemap-perf-test.sh
>     (...)
>     /mnt/sdi/foobar: 327680 extents found
>     fiemap took 3597 milliseconds (metadata not cached)
>     /mnt/sdi/foobar: 327680 extents found
>     fiemap took 2107 milliseconds (metadata cached)
> 
> After patchset:
> 
>     $ ./fiemap-perf-test.sh
>     (...)
>     /mnt/sdi/foobar: 327680 extents found
>     fiemap took 1214 milliseconds (metadata not cached)
>     /mnt/sdi/foobar: 327680 extents found
>     fiemap took 684 milliseconds (metadata cached)
> 
> That's a speedup of about 3x for both cases (no metadata cached and all
> metadata cached).
> 
> The test provided by Pavel (first Link tag at the bottom), which uses
> files with a large number of holes, was also used to measure the gains,
> and it consists of a small C program and a shell script to invoke it.
> The C program is the following:
> 
>     $ cat pavels-test.c
>     #include <stdio.h>
>     #include <unistd.h>
>     #include <stdlib.h>
>     #include <fcntl.h>
> 
>     #include <sys/stat.h>
>     #include <sys/time.h>
>     #include <sys/ioctl.h>
> 
>     #include <linux/fs.h>
>     #include <linux/fiemap.h>
> 
>     #define FILE_INTERVAL (1<<13) /* 8Kb */
> 
>     long long interval(struct timeval t1, struct timeval t2)
>     {
>         long long val = 0;
>         val += (t2.tv_usec - t1.tv_usec);
>         val += (t2.tv_sec - t1.tv_sec) * 1000 * 1000;
>         return val;
>     }
> 
>     int main(int argc, char **argv)
>     {
>         struct fiemap fiemap = {};
>         struct timeval t1, t2;
>         char data = 'a';
>         struct stat st;
>         int fd, off, file_size = FILE_INTERVAL;
> 
>         if (argc != 3 && argc != 2) {
>                 printf("usage: %s <path> [size]\n", argv[0]);
>                 return 1;
>         }
> 
>         if (argc == 3)
>                 file_size = atoi(argv[2]);
>         if (file_size < FILE_INTERVAL)
>                 file_size = FILE_INTERVAL;
>         file_size -= file_size % FILE_INTERVAL;
> 
>         fd = open(argv[1], O_RDWR | O_CREAT | O_TRUNC, 0644);
>         if (fd < 0) {
>             perror("open");
>             return 1;
>         }
> 
>         for (off = 0; off < file_size; off += FILE_INTERVAL) {
>             if (pwrite(fd, &data, 1, off) != 1) {
>                 perror("pwrite");
>                 close(fd);
>                 return 1;
>             }
>         }
> 
>         if (ftruncate(fd, file_size)) {
>             perror("ftruncate");
>             close(fd);
>             return 1;
>         }
> 
>         if (fstat(fd, &st) < 0) {
>             perror("fstat");
>             close(fd);
>             return 1;
>         }
> 
>         printf("size: %ld\n", st.st_size);
>         printf("actual size: %ld\n", st.st_blocks * 512);
> 
>         fiemap.fm_length = FIEMAP_MAX_OFFSET;
>         gettimeofday(&t1, NULL);
>         if (ioctl(fd, FS_IOC_FIEMAP, &fiemap) < 0) {
>             perror("fiemap");
>             close(fd);
>             return 1;
>         }
>         gettimeofday(&t2, NULL);
> 
>         printf("fiemap: fm_mapped_extents = %d\n",
>                fiemap.fm_mapped_extents);
>         printf("time = %lld us\n", interval(t1, t2));
> 
>         close(fd);
>         return 0;
>     }
> 
>     $ gcc -o pavels-test pavels-test.c
> 
> And the wrapper shell script:
> 
>     $ cat fiemap-pavels-test.sh
> 
>     #!/bin/bash
> 
>     DEV=/dev/sdi
>     MNT=/mnt/sdi
> 
>     mkfs.btrfs -f -O no-holes $DEV
>     mount $DEV $MNT
> 
>     echo
>     echo "*********** 256M ***********"
>     echo
> 
>     ./pavels-test $MNT/testfile $((1 << 28))
>     echo
>     ./pavels-test $MNT/testfile $((1 << 28))
> 
>     echo
>     echo "*********** 512M ***********"
>     echo
> 
>     ./pavels-test $MNT/testfile $((1 << 29))
>     echo
>     ./pavels-test $MNT/testfile $((1 << 29))
> 
>     echo
>     echo "*********** 1G ***********"
>     echo
> 
>     ./pavels-test $MNT/testfile $((1 << 30))
>     echo
>     ./pavels-test $MNT/testfile $((1 << 30))
> 
>     umount $MNT
> 
> Running his reproducer before applying the patchset:
> 
>     *********** 256M ***********
> 
>     size: 268435456
>     actual size: 134217728
>     fiemap: fm_mapped_extents = 32768
>     time = 4003133 us
> 
>     size: 268435456
>     actual size: 134217728
>     fiemap: fm_mapped_extents = 32768
>     time = 4895330 us
> 
>     *********** 512M ***********
> 
>     size: 536870912
>     actual size: 268435456
>     fiemap: fm_mapped_extents = 65536
>     time = 30123675 us
> 
>     size: 536870912
>     actual size: 268435456
>     fiemap: fm_mapped_extents = 65536
>     time = 33450934 us
> 
>     *********** 1G ***********
> 
>     size: 1073741824
>     actual size: 536870912
>     fiemap: fm_mapped_extents = 131072
>     time = 224924074 us
> 
>     size: 1073741824
>     actual size: 536870912
>     fiemap: fm_mapped_extents = 131072
>     time = 217239242 us
> 
> Running it after applying the patchset:
> 
>     *********** 256M ***********
> 
>     size: 268435456
>     actual size: 134217728
>     fiemap: fm_mapped_extents = 32768
>     time = 29475 us
> 
>     size: 268435456
>     actual size: 134217728
>     fiemap: fm_mapped_extents = 32768
>     time = 29307 us
> 
>     *********** 512M ***********
> 
>     size: 536870912
>     actual size: 268435456
>     fiemap: fm_mapped_extents = 65536
>     time = 58996 us
> 
>     size: 536870912
>     actual size: 268435456
>     fiemap: fm_mapped_extents = 65536
>     time = 59115 us
> 
>     *********** 1G ***********
> 
>     size: 1073741824
>     actual size: 536870912
>     fiemap: fm_mapped_extents = 116251
>     time = 124141 us
> 
>     size: 1073741824
>     actual size: 536870912
>     fiemap: fm_mapped_extents = 131072
>     time = 119387 us
> 
> The speedup is massive, both on the first fiemap call and on the second
> one as well, as his test creates files with many holes and small extents
> (every extent follows a hole and precedes another hole).
> 
> For the 256M file we go from 4 seconds down to 29 milliseconds in the
> first run, and then from 4.9 seconds down to 29 milliseconds again in the
> second run, a speedup of 138x and 169x, respectively.
> 
> For the 512M file we go from 30.1 seconds down to 59 milliseconds in the
> first run, and then from 33.5 seconds down to 59 milliseconds again in the
> second run, a speedup of 510x and 568x, respectively.
> 
> For the 1G file, we go from 225 seconds down to 124 milliseconds in the
> first run, and then from 217 seconds down to 119 milliseconds in the
> second run, a speedup of 1815x and 1824x, respectively.
> 
> Reported-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
> Link: https://lore.kernel.org/linux-btrfs/21dd32c6-f1f9-f44a-466a-e18fdc6788a7@virtuozzo.com/
> Reported-by: Dominique MARTINET <dominique.martinet@atmark-techno.com>
> Link: https://lore.kernel.org/linux-btrfs/Ysace25wh5BbLd5f@atmark-techno.com/
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
>  fs/btrfs/ctree.h     |   4 +-
>  fs/btrfs/extent_io.c | 714 +++++++++++++++++++++++++++++--------------
>  fs/btrfs/file.c      |  16 +-
>  fs/btrfs/inode.c     | 140 +--------
>  4 files changed, 506 insertions(+), 368 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index f7fe7f633eb5..7b266f9dc8b4 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -3402,8 +3402,6 @@ unsigned int btrfs_verify_data_csum(struct btrfs_bio *bbio,
>  				    u64 start, u64 end);
>  int btrfs_check_data_csum(struct inode *inode, struct btrfs_bio *bbio,
>  			  u32 bio_offset, struct page *page, u32 pgoff);
> -struct extent_map *btrfs_get_extent_fiemap(struct btrfs_inode *inode,
> -					   u64 start, u64 len);
>  noinline int can_nocow_extent(struct inode *inode, u64 offset, u64 *len,
>  			      u64 *orig_start, u64 *orig_block_len,
>  			      u64 *ram_bytes, bool strict);
> @@ -3583,6 +3581,8 @@ int btrfs_fdatawrite_range(struct inode *inode, loff_t start, loff_t end);
>  int btrfs_check_nocow_lock(struct btrfs_inode *inode, loff_t pos,
>  			   size_t *write_bytes);
>  void btrfs_check_nocow_unlock(struct btrfs_inode *inode);
> +bool btrfs_find_delalloc_in_range(struct btrfs_inode *inode, u64 start, u64 end,
> +				  u64 *delalloc_start_ret, u64 *delalloc_end_ret);
>  
>  /* tree-defrag.c */
>  int btrfs_defrag_leaves(struct btrfs_trans_handle *trans,
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 0e3fa9b08aaf..50bb2182e795 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -5353,42 +5353,6 @@ int try_release_extent_mapping(struct page *page, gfp_t mask)
>  	return try_release_extent_state(tree, page, mask);
>  }
>  
> -/*
> - * helper function for fiemap, which doesn't want to see any holes.
> - * This maps until we find something past 'last'
> - */
> -static struct extent_map *get_extent_skip_holes(struct btrfs_inode *inode,
> -						u64 offset, u64 last)
> -{
> -	u64 sectorsize = btrfs_inode_sectorsize(inode);
> -	struct extent_map *em;
> -	u64 len;
> -
> -	if (offset >= last)
> -		return NULL;
> -
> -	while (1) {
> -		len = last - offset;
> -		if (len == 0)
> -			break;
> -		len = ALIGN(len, sectorsize);
> -		em = btrfs_get_extent_fiemap(inode, offset, len);
> -		if (IS_ERR(em))
> -			return em;
> -
> -		/* if this isn't a hole return it */
> -		if (em->block_start != EXTENT_MAP_HOLE)
> -			return em;
> -
> -		/* this is a hole, advance to the next extent */
> -		offset = extent_map_end(em);
> -		free_extent_map(em);
> -		if (offset >= last)
> -			break;
> -	}
> -	return NULL;
> -}
> -
>  /*
>   * To cache previous fiemap extent
>   *
> @@ -5418,6 +5382,9 @@ static int emit_fiemap_extent(struct fiemap_extent_info *fieinfo,
>  {
>  	int ret = 0;
>  
> +	/* Set at the end of extent_fiemap(). */
> +	ASSERT((flags & FIEMAP_EXTENT_LAST) == 0);
> +
>  	if (!cache->cached)
>  		goto assign;
>  
> @@ -5446,11 +5413,10 @@ static int emit_fiemap_extent(struct fiemap_extent_info *fieinfo,
>  	 */
>  	if (cache->offset + cache->len  == offset &&
>  	    cache->phys + cache->len == phys  &&
> -	    (cache->flags & ~FIEMAP_EXTENT_LAST) ==
> -			(flags & ~FIEMAP_EXTENT_LAST)) {
> +	    cache->flags == flags) {
>  		cache->len += len;
>  		cache->flags |= flags;
> -		goto try_submit_last;
> +		return 0;
>  	}
>  
>  	/* Not mergeable, need to submit cached one */
> @@ -5465,13 +5431,8 @@ static int emit_fiemap_extent(struct fiemap_extent_info *fieinfo,
>  	cache->phys = phys;
>  	cache->len = len;
>  	cache->flags = flags;
> -try_submit_last:
> -	if (cache->flags & FIEMAP_EXTENT_LAST) {
> -		ret = fiemap_fill_next_extent(fieinfo, cache->offset,
> -				cache->phys, cache->len, cache->flags);
> -		cache->cached = false;
> -	}
> -	return ret;
> +
> +	return 0;
>  }
>  
>  /*
> @@ -5501,229 +5462,534 @@ static int emit_last_fiemap_cache(struct fiemap_extent_info *fieinfo,
>  	return ret;
>  }
>  
> -int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
> -		  u64 start, u64 len)
> +static int fiemap_next_leaf_item(struct btrfs_inode *inode,
> +				 struct btrfs_path *path)
>  {
> -	int ret = 0;
> -	u64 off;
> -	u64 max = start + len;
> -	u32 flags = 0;
> -	u32 found_type;
> -	u64 last;
> -	u64 last_for_get_extent = 0;
> -	u64 disko = 0;
> -	u64 isize = i_size_read(&inode->vfs_inode);
> -	struct btrfs_key found_key;
> -	struct extent_map *em = NULL;
> -	struct extent_state *cached_state = NULL;
> -	struct btrfs_path *path;
> +	struct extent_buffer *clone;
> +	struct btrfs_key key;
> +	int slot;
> +	int ret;
> +
> +	path->slots[0]++;
> +	if (path->slots[0] < btrfs_header_nritems(path->nodes[0]))
> +		return 0;
> +
> +	ret = btrfs_next_leaf(inode->root, path);
> +	if (ret != 0)
> +		return ret;
> +
> +	/*
> +	 * Don't bother with cloning if there are no more file extent items for
> +	 * our inode.
> +	 */
> +	btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
> +	if (key.objectid != btrfs_ino(inode) || key.type != BTRFS_EXTENT_DATA_KEY)
> +		return 1;
> +
> +	/* See the comment at fiemap_search_slot() about why we clone. */
> +	clone = btrfs_clone_extent_buffer(path->nodes[0]);
> +	if (!clone)
> +		return -ENOMEM;
> +
> +	slot = path->slots[0];
> +	btrfs_release_path(path);
> +	path->nodes[0] = clone;
> +	path->slots[0] = slot;
> +
> +	return 0;
> +}
> +
> +/*
> + * Search for the first file extent item that starts at a given file offset or
> + * the one that starts immediately before that offset.
> + * Returns: 0 on success, < 0 on error, 1 if not found.
> + */
> +static int fiemap_search_slot(struct btrfs_inode *inode, struct btrfs_path *path,
> +			      u64 file_offset)
> +{
> +	const u64 ino = btrfs_ino(inode);
>  	struct btrfs_root *root = inode->root;
> -	struct fiemap_cache cache = { 0 };
> -	struct btrfs_backref_shared_cache *backref_cache;
> -	struct ulist *roots;
> -	struct ulist *tmp_ulist;
> -	int end = 0;
> -	u64 em_start = 0;
> -	u64 em_len = 0;
> -	u64 em_end = 0;
> +	struct extent_buffer *clone;
> +	struct btrfs_key key;
> +	int slot;
> +	int ret;
>  
> -	backref_cache = kzalloc(sizeof(*backref_cache), GFP_KERNEL);
> -	path = btrfs_alloc_path();
> -	roots = ulist_alloc(GFP_KERNEL);
> -	tmp_ulist = ulist_alloc(GFP_KERNEL);
> -	if (!backref_cache || !path || !roots || !tmp_ulist) {
> -		ret = -ENOMEM;
> -		goto out_free_ulist;
> +	key.objectid = ino;
> +	key.type = BTRFS_EXTENT_DATA_KEY;
> +	key.offset = file_offset;
> +
> +	ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
> +	if (ret < 0)
> +		return ret;
> +
> +	if (ret > 0 && path->slots[0] > 0) {
> +		btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0] - 1);
> +		if (key.objectid == ino && key.type == BTRFS_EXTENT_DATA_KEY)
> +			path->slots[0]--;
> +	}
> +
> +	if (path->slots[0] >= btrfs_header_nritems(path->nodes[0])) {
> +		ret = btrfs_next_leaf(root, path);
> +		if (ret != 0)
> +			return ret;
> +
> +		btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
> +		if (key.objectid != ino || key.type != BTRFS_EXTENT_DATA_KEY)
> +			return 1;
>  	}
>  
>  	/*
> -	 * We can't initialize that to 'start' as this could miss extents due
> -	 * to extent item merging
> +	 * We clone the leaf and use it during fiemap. This is because while
> +	 * using the leaf we do expensive things like checking if an extent is
> +	 * shared, which can take a long time. In order to prevent blocking
> +	 * other tasks for too long, we use a clone of the leaf. We have locked
> +	 * the file range in the inode's io tree, so we know none of our file
> +	 * extent items can change. This way we avoid blocking other tasks that
> +	 * want to insert items for other inodes in the same leaf or b+tree
> +	 * rebalance operations (triggered for example when someone is trying
> +	 * to push items into this leaf when trying to insert an item in a
> +	 * neighbour leaf).
> +	 * We also need the private clone because holding a read lock on an
> +	 * extent buffer of the subvolume's b+tree will make lockdep unhappy
> +	 * when we call fiemap_fill_next_extent(), because that may cause a page
> +	 * fault when filling the user space buffer with fiemap data.
>  	 */
> -	off = 0;
> -	start = round_down(start, btrfs_inode_sectorsize(inode));
> -	len = round_up(max, btrfs_inode_sectorsize(inode)) - start;
> +	clone = btrfs_clone_extent_buffer(path->nodes[0]);
> +	if (!clone)
> +		return -ENOMEM;
> +
> +	slot = path->slots[0];
> +	btrfs_release_path(path);
> +	path->nodes[0] = clone;
> +	path->slots[0] = slot;
> +
> +	return 0;
> +}
> +
> +/*
> + * Process a range which is a hole or a prealloc extent in the inode's subvolume
> + * btree. If @disk_bytenr is 0, we are dealing with a hole, otherwise a prealloc
> + * extent. The end offset (@end) is inclusive.
> + */
> +static int fiemap_process_hole(struct btrfs_inode *inode,
> +			       struct fiemap_extent_info *fieinfo,
> +			       struct fiemap_cache *cache,
> +			       struct btrfs_backref_shared_cache *backref_cache,
> +			       u64 disk_bytenr, u64 extent_offset,
> +			       u64 extent_gen,
> +			       struct ulist *roots, struct ulist *tmp_ulist,
> +			       u64 start, u64 end)
> +{
> +	const u64 i_size = i_size_read(&inode->vfs_inode);
> +	const u64 ino = btrfs_ino(inode);
> +	u64 cur_offset = start;
> +	u64 last_delalloc_end = 0;
> +	u32 prealloc_flags = FIEMAP_EXTENT_UNWRITTEN;
> +	bool checked_extent_shared = false;
> +	int ret;
>  
>  	/*
> -	 * lookup the last file extent.  We're not using i_size here
> -	 * because there might be preallocation past i_size
> +	 * There can be no delalloc past i_size, so don't waste time looking for
> +	 * it beyond i_size.
>  	 */
> -	ret = btrfs_lookup_file_extent(NULL, root, path, btrfs_ino(inode), -1,
> -				       0);
> -	if (ret < 0) {
> -		goto out_free_ulist;
> -	} else {
> -		WARN_ON(!ret);
> -		if (ret == 1)
> -			ret = 0;
> -	}
> +	while (cur_offset < end && cur_offset < i_size) {
> +		u64 delalloc_start;
> +		u64 delalloc_end;
> +		u64 prealloc_start;
> +		u64 prealloc_len = 0;
> +		bool delalloc;
> +
> +		delalloc = btrfs_find_delalloc_in_range(inode, cur_offset, end,
> +							&delalloc_start,
> +							&delalloc_end);
> +		if (!delalloc)
> +			break;
>  
> -	path->slots[0]--;
> -	btrfs_item_key_to_cpu(path->nodes[0], &found_key, path->slots[0]);
> -	found_type = found_key.type;
> -
> -	/* No extents, but there might be delalloc bits */
> -	if (found_key.objectid != btrfs_ino(inode) ||
> -	    found_type != BTRFS_EXTENT_DATA_KEY) {
> -		/* have to trust i_size as the end */
> -		last = (u64)-1;
> -		last_for_get_extent = isize;
> -	} else {
>  		/*
> -		 * remember the start of the last extent.  There are a
> -		 * bunch of different factors that go into the length of the
> -		 * extent, so its much less complex to remember where it started
> +		 * If this is a prealloc extent we have to report every section
> +		 * of it that has no delalloc.
>  		 */
> -		last = found_key.offset;
> -		last_for_get_extent = last + 1;
> +		if (disk_bytenr != 0) {
> +			if (last_delalloc_end == 0) {
> +				prealloc_start = start;
> +				prealloc_len = delalloc_start - start;
> +			} else {
> +				prealloc_start = last_delalloc_end + 1;
> +				prealloc_len = delalloc_start - prealloc_start;
> +			}
> +		}
> +
> +		if (prealloc_len > 0) {
> +			if (!checked_extent_shared && fieinfo->fi_extents_max) {
> +				ret = btrfs_is_data_extent_shared(inode->root,
> +							  ino, disk_bytenr,
> +							  extent_gen, roots,
> +							  tmp_ulist,
> +							  backref_cache);
> +				if (ret < 0)
> +					return ret;
> +				else if (ret > 0)
> +					prealloc_flags |= FIEMAP_EXTENT_SHARED;
> +
> +				checked_extent_shared = true;
> +			}
> +			ret = emit_fiemap_extent(fieinfo, cache, prealloc_start,
> +						 disk_bytenr + extent_offset,
> +						 prealloc_len, prealloc_flags);
> +			if (ret)
> +				return ret;
> +			extent_offset += prealloc_len;
> +		}
> +
> +		ret = emit_fiemap_extent(fieinfo, cache, delalloc_start, 0,
> +					 delalloc_end + 1 - delalloc_start,
> +					 FIEMAP_EXTENT_DELALLOC |
> +					 FIEMAP_EXTENT_UNKNOWN);
> +		if (ret)
> +			return ret;
> +
> +		last_delalloc_end = delalloc_end;
> +		cur_offset = delalloc_end + 1;
> +		extent_offset += cur_offset - delalloc_start;
> +		cond_resched();
> +	}
> +
> +	/*
> +	 * Either we found no delalloc for the whole prealloc extent or we have
> +	 * a prealloc extent that spans i_size or starts at or after i_size.
> +	 */
> +	if (disk_bytenr != 0 && last_delalloc_end < end) {
> +		u64 prealloc_start;
> +		u64 prealloc_len;
> +
> +		if (last_delalloc_end == 0) {
> +			prealloc_start = start;
> +			prealloc_len = end + 1 - start;
> +		} else {
> +			prealloc_start = last_delalloc_end + 1;
> +			prealloc_len = end + 1 - prealloc_start;
> +		}
> +
> +		if (!checked_extent_shared && fieinfo->fi_extents_max) {
> +			ret = btrfs_is_data_extent_shared(inode->root,
> +							  ino, disk_bytenr,
> +							  extent_gen, roots,
> +							  tmp_ulist,
> +							  backref_cache);
> +			if (ret < 0)
> +				return ret;
> +			else if (ret > 0)
> +				prealloc_flags |= FIEMAP_EXTENT_SHARED;
> +		}
> +		ret = emit_fiemap_extent(fieinfo, cache, prealloc_start,
> +					 disk_bytenr + extent_offset,
> +					 prealloc_len, prealloc_flags);
> +		if (ret)
> +			return ret;
> +	}
> +
> +	return 0;
> +}
> +
> +static int fiemap_find_last_extent_offset(struct btrfs_inode *inode,
> +					  struct btrfs_path *path,
> +					  u64 *last_extent_end_ret)
> +{
> +	const u64 ino = btrfs_ino(inode);
> +	struct btrfs_root *root = inode->root;
> +	struct extent_buffer *leaf;
> +	struct btrfs_file_extent_item *ei;
> +	struct btrfs_key key;
> +	u64 disk_bytenr;
> +	int ret;
> +
> +	/*
> +	 * Lookup the last file extent. We're not using i_size here because
> +	 * there might be preallocation past i_size.
> +	 */
> +	ret = btrfs_lookup_file_extent(NULL, root, path, ino, (u64)-1, 0);
> +	/* There can't be a file extent item at offset (u64)-1 */
> +	ASSERT(ret != 0);
> +	if (ret < 0)
> +		return ret;
> +
> +	/*
> +	 * For a non-existing key, btrfs_search_slot() always leaves us at a
> +	 * slot > 0, except if the btree is empty, which is impossible because
> +	 * at least it has the inode item for this inode and all the items for
> +	 * the root inode 256.
> +	 */
> +	ASSERT(path->slots[0] > 0);
> +	path->slots[0]--;
> +	leaf = path->nodes[0];
> +	btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
> +	if (key.objectid != ino || key.type != BTRFS_EXTENT_DATA_KEY) {
> +		/* No file extent items in the subvolume tree. */
> +		*last_extent_end_ret = 0;
> +		return 0;
>  	}
> -	btrfs_release_path(path);
>  
>  	/*
> -	 * we might have some extents allocated but more delalloc past those
> -	 * extents.  so, we trust isize unless the start of the last extent is
> -	 * beyond isize
> +	 * For an inline extent, the disk_bytenr is where inline data starts at,
> +	 * so first check if we have an inline extent item before checking if we
> +	 * have an implicit hole (disk_bytenr == 0).
>  	 */
> -	if (last < isize) {
> -		last = (u64)-1;
> -		last_for_get_extent = isize;
> +	ei = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_file_extent_item);
> +	if (btrfs_file_extent_type(leaf, ei) == BTRFS_FILE_EXTENT_INLINE) {
> +		*last_extent_end_ret = btrfs_file_extent_end(path);
> +		return 0;
>  	}
>  
> -	lock_extent_bits(&inode->io_tree, start, start + len - 1,
> -			 &cached_state);
> +	/*
> +	 * Find the last file extent item that is not a hole (when NO_HOLES is
> +	 * not enabled). This should take at most 2 iterations in the worst
> +	 * case: we have one hole file extent item at slot 0 of a leaf and
> +	 * another hole file extent item as the last item in the previous leaf.
> +	 * This is because we merge file extent items that represent holes.
> +	 */
> +	disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, ei);
> +	while (disk_bytenr == 0) {
> +		ret = btrfs_previous_item(root, path, ino, BTRFS_EXTENT_DATA_KEY);
> +		if (ret < 0) {
> +			return ret;
> +		} else if (ret > 0) {
> +			/* No file extent items that are not holes. */
> +			*last_extent_end_ret = 0;
> +			return 0;
> +		}
> +		leaf = path->nodes[0];
> +		ei = btrfs_item_ptr(leaf, path->slots[0],
> +				    struct btrfs_file_extent_item);
> +		disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, ei);
> +	}
>  
> -	em = get_extent_skip_holes(inode, start, last_for_get_extent);
> -	if (!em)
> -		goto out;
> -	if (IS_ERR(em)) {
> -		ret = PTR_ERR(em);
> +	*last_extent_end_ret = btrfs_file_extent_end(path);
> +	return 0;
> +}
> +
> +int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
> +		  u64 start, u64 len)
> +{
> +	const u64 ino = btrfs_ino(inode);
> +	struct extent_state *cached_state = NULL;
> +	struct btrfs_path *path;
> +	struct btrfs_root *root = inode->root;
> +	struct fiemap_cache cache = { 0 };
> +	struct btrfs_backref_shared_cache *backref_cache;
> +	struct ulist *roots;
> +	struct ulist *tmp_ulist;
> +	u64 last_extent_end;
> +	u64 prev_extent_end;
> +	u64 lockstart;
> +	u64 lockend;
> +	bool stopped = false;
> +	int ret;
> +
> +	backref_cache = kzalloc(sizeof(*backref_cache), GFP_KERNEL);
> +	path = btrfs_alloc_path();
> +	roots = ulist_alloc(GFP_KERNEL);
> +	tmp_ulist = ulist_alloc(GFP_KERNEL);
> +	if (!backref_cache || !path || !roots || !tmp_ulist) {
> +		ret = -ENOMEM;
>  		goto out;
>  	}
>  
> -	while (!end) {
> -		u64 offset_in_extent = 0;
> +	lockstart = round_down(start, btrfs_inode_sectorsize(inode));
> +	lockend = round_up(start + len, btrfs_inode_sectorsize(inode));
> +	prev_extent_end = lockstart;
>  
> -		/* break if the extent we found is outside the range */
> -		if (em->start >= max || extent_map_end(em) < off)
> -			break;
> +	lock_extent_bits(&inode->io_tree, lockstart, lockend, &cached_state);
>  
> -		/*
> -		 * get_extent may return an extent that starts before our
> -		 * requested range.  We have to make sure the ranges
> -		 * we return to fiemap always move forward and don't
> -		 * overlap, so adjust the offsets here
> -		 */
> -		em_start = max(em->start, off);
> +	ret = fiemap_find_last_extent_offset(inode, path, &last_extent_end);
> +	if (ret < 0)
> +		goto out_unlock;
> +	btrfs_release_path(path);
>  
> +	path->reada = READA_FORWARD;
> +	ret = fiemap_search_slot(inode, path, lockstart);
> +	if (ret < 0) {
> +		goto out_unlock;
> +	} else if (ret > 0) {
>  		/*
> -		 * record the offset from the start of the extent
> -		 * for adjusting the disk offset below.  Only do this if the
> -		 * extent isn't compressed since our in ram offset may be past
> -		 * what we have actually allocated on disk.
> +		 * No file extent item found, but we may have delalloc between
> +		 * the current offset and i_size. So check for that.
>  		 */
> -		if (!test_bit(EXTENT_FLAG_COMPRESSED, &em->flags))
> -			offset_in_extent = em_start - em->start;
> -		em_end = extent_map_end(em);
> -		em_len = em_end - em_start;
> -		flags = 0;
> -		if (em->block_start < EXTENT_MAP_LAST_BYTE)
> -			disko = em->block_start + offset_in_extent;
> -		else
> -			disko = 0;
> +		ret = 0;
> +		goto check_eof_delalloc;
> +	}
> +
> +	while (prev_extent_end < lockend) {
> +		struct extent_buffer *leaf = path->nodes[0];
> +		struct btrfs_file_extent_item *ei;
> +		struct btrfs_key key;
> +		u64 extent_end;
> +		u64 extent_len;
> +		u64 extent_offset = 0;
> +		u64 extent_gen;
> +		u64 disk_bytenr = 0;
> +		u64 flags = 0;
> +		int extent_type;
> +		u8 compression;
> +
> +		btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
> +		if (key.objectid != ino || key.type != BTRFS_EXTENT_DATA_KEY)
> +			break;
> +
> +		extent_end = btrfs_file_extent_end(path);
>  
>  		/*
> -		 * bump off for our next call to get_extent
> +		 * The first iteration can leave us at an extent item that ends
> +		 * before our range's start. Move to the next item.
>  		 */
> -		off = extent_map_end(em);
> -		if (off >= max)
> -			end = 1;
> -
> -		if (em->block_start == EXTENT_MAP_INLINE) {
> -			flags |= (FIEMAP_EXTENT_DATA_INLINE |
> -				  FIEMAP_EXTENT_NOT_ALIGNED);
> -		} else if (em->block_start == EXTENT_MAP_DELALLOC) {
> -			flags |= (FIEMAP_EXTENT_DELALLOC |
> -				  FIEMAP_EXTENT_UNKNOWN);
> -		} else if (fieinfo->fi_extents_max) {
> -			u64 extent_gen;
> -			u64 bytenr = em->block_start -
> -				(em->start - em->orig_start);
> +		if (extent_end <= lockstart)
> +			goto next_item;
>  
> -			/*
> -			 * If two extent maps are merged, then their generation
> -			 * is set to the maximum between their generations.
> -			 * Otherwise its generation matches the one we have in
> -			 * corresponding file extent item. If we have a merged
> -			 * extent map, don't use its generation to speedup the
> -			 * sharedness check below.
> -			 */
> -			if (test_bit(EXTENT_FLAG_MERGED, &em->flags))
> -				extent_gen = 0;
> -			else
> -				extent_gen = em->generation;
> +		/* We have an implicit hole (NO_HOLES feature enabled). */
> +		if (prev_extent_end < key.offset) {
> +			const u64 range_end = min(key.offset, lockend) - 1;
>  
> -			/*
> -			 * As btrfs supports shared space, this information
> -			 * can be exported to userspace tools via
> -			 * flag FIEMAP_EXTENT_SHARED.  If fi_extents_max == 0
> -			 * then we're just getting a count and we can skip the
> -			 * lookup stuff.
> -			 */
> -			ret = btrfs_is_data_extent_shared(root, btrfs_ino(inode),
> -							  bytenr, extent_gen,
> -							  roots, tmp_ulist,
> -							  backref_cache);
> -			if (ret < 0)
> -				goto out_free;
> -			if (ret)
> -				flags |= FIEMAP_EXTENT_SHARED;
> -			ret = 0;
> -		}
> -		if (test_bit(EXTENT_FLAG_COMPRESSED, &em->flags))
> -			flags |= FIEMAP_EXTENT_ENCODED;
> -		if (test_bit(EXTENT_FLAG_PREALLOC, &em->flags))
> -			flags |= FIEMAP_EXTENT_UNWRITTEN;
> +			ret = fiemap_process_hole(inode, fieinfo, &cache,
> +						  backref_cache, 0, 0, 0,
> +						  roots, tmp_ulist,
> +						  prev_extent_end, range_end);
> +			if (ret < 0) {
> +				goto out_unlock;
> +			} else if (ret > 0) {
> +				/* fiemap_fill_next_extent() told us to stop. */
> +				stopped = true;
> +				break;
> +			}
>  
> -		free_extent_map(em);
> -		em = NULL;
> -		if ((em_start >= last) || em_len == (u64)-1 ||
> -		   (last == (u64)-1 && isize <= em_end)) {
> -			flags |= FIEMAP_EXTENT_LAST;
> -			end = 1;
> +			/* We've reached the end of the fiemap range, stop. */
> +			if (key.offset >= lockend) {
> +				stopped = true;
> +				break;
> +			}
>  		}
>  
> -		/* now scan forward to see if this is really the last extent. */
> -		em = get_extent_skip_holes(inode, off, last_for_get_extent);
> -		if (IS_ERR(em)) {
> -			ret = PTR_ERR(em);
> -			goto out;
> +		extent_len = extent_end - key.offset;
> +		ei = btrfs_item_ptr(leaf, path->slots[0],
> +				    struct btrfs_file_extent_item);
> +		compression = btrfs_file_extent_compression(leaf, ei);
> +		extent_type = btrfs_file_extent_type(leaf, ei);
> +		extent_gen = btrfs_file_extent_generation(leaf, ei);
> +
> +		if (extent_type != BTRFS_FILE_EXTENT_INLINE) {
> +			disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, ei);
> +			if (compression == BTRFS_COMPRESS_NONE)
> +				extent_offset = btrfs_file_extent_offset(leaf, ei);
>  		}
> -		if (!em) {
> -			flags |= FIEMAP_EXTENT_LAST;
> -			end = 1;
> +
> +		if (compression != BTRFS_COMPRESS_NONE)
> +			flags |= FIEMAP_EXTENT_ENCODED;
> +
> +		if (extent_type == BTRFS_FILE_EXTENT_INLINE) {
> +			flags |= FIEMAP_EXTENT_DATA_INLINE;
> +			flags |= FIEMAP_EXTENT_NOT_ALIGNED;
> +			ret = emit_fiemap_extent(fieinfo, &cache, key.offset, 0,
> +						 extent_len, flags);
> +		} else if (extent_type == BTRFS_FILE_EXTENT_PREALLOC) {
> +			ret = fiemap_process_hole(inode, fieinfo, &cache,
> +						  backref_cache,
> +						  disk_bytenr, extent_offset,
> +						  extent_gen, roots, tmp_ulist,
> +						  key.offset, extent_end - 1);
> +		} else if (disk_bytenr == 0) {
> +			/* We have an explicit hole. */
> +			ret = fiemap_process_hole(inode, fieinfo, &cache,
> +						  backref_cache, 0, 0, 0,
> +						  roots, tmp_ulist,
> +						  key.offset, extent_end - 1);
> +		} else {
> +			/* We have a regular extent. */
> +			if (fieinfo->fi_extents_max) {
> +				ret = btrfs_is_data_extent_shared(root, ino,
> +								  disk_bytenr,
> +								  extent_gen,
> +								  roots,
> +								  tmp_ulist,
> +								  backref_cache);
> +				if (ret < 0)
> +					goto out_unlock;
> +				else if (ret > 0)
> +					flags |= FIEMAP_EXTENT_SHARED;
> +			}
> +
> +			ret = emit_fiemap_extent(fieinfo, &cache, key.offset,
> +						 disk_bytenr + extent_offset,
> +						 extent_len, flags);
>  		}
> -		ret = emit_fiemap_extent(fieinfo, &cache, em_start, disko,
> -					   em_len, flags);
> -		if (ret) {
> -			if (ret == 1)
> -				ret = 0;
> -			goto out_free;
> +
> +		if (ret < 0) {
> +			goto out_unlock;
> +		} else if (ret > 0) {
> +			/* fiemap_fill_next_extent() told us to stop. */
> +			stopped = true;
> +			break;
>  		}
>  
> +		prev_extent_end = extent_end;
> +next_item:
>  		if (fatal_signal_pending(current)) {
>  			ret = -EINTR;
> -			goto out_free;
> +			goto out_unlock;
>  		}
> +
> +		ret = fiemap_next_leaf_item(inode, path);
> +		if (ret < 0) {
> +			goto out_unlock;
> +		} else if (ret > 0) {
> +			/* No more file extent items for this inode. */
> +			break;
> +		}
> +		cond_resched();
>  	}
> -out_free:
> -	if (!ret)
> -		ret = emit_last_fiemap_cache(fieinfo, &cache);
> -	free_extent_map(em);
> -out:
> -	unlock_extent_cached(&inode->io_tree, start, start + len - 1,
> -			     &cached_state);
>  
> -out_free_ulist:
> +check_eof_delalloc:
> +	/*
> +	 * Release (and free) the path before emitting any final entries to
> +	 * fiemap_fill_next_extent() to keep lockdep happy. This is because
> +	 * once we find no more file extent items exist, we may have a
> +	 * non-cloned leaf, and fiemap_fill_next_extent() can trigger page
> +	 * faults when copying data to the user space buffer.
> +	 */
> +	btrfs_free_path(path);
> +	path = NULL;
> +
> +	if (!stopped && prev_extent_end < lockend) {
> +		ret = fiemap_process_hole(inode, fieinfo, &cache, backref_cache,
> +					  0, 0, 0, roots, tmp_ulist,
> +					  prev_extent_end, lockend - 1);
> +		if (ret < 0)
> +			goto out_unlock;
> +		prev_extent_end = lockend;
> +	}
> +
> +	if (cache.cached && cache.offset + cache.len >= last_extent_end) {
> +		const u64 i_size = i_size_read(&inode->vfs_inode);
> +
> +		if (prev_extent_end < i_size) {
> +			u64 delalloc_start;
> +			u64 delalloc_end;
> +			bool delalloc;
> +
> +			delalloc = btrfs_find_delalloc_in_range(inode,
> +								prev_extent_end,
> +								i_size - 1,
> +								&delalloc_start,
> +								&delalloc_end);
> +			if (!delalloc)
> +				cache.flags |= FIEMAP_EXTENT_LAST;
> +		} else {
> +			cache.flags |= FIEMAP_EXTENT_LAST;
> +		}
> +	}
> +
> +	ret = emit_last_fiemap_cache(fieinfo, &cache);
> +
> +out_unlock:
> +	unlock_extent_cached(&inode->io_tree, lockstart, lockend, &cached_state);
> +out:
>  	kfree(backref_cache);
>  	btrfs_free_path(path);
>  	ulist_free(roots);
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index b292a8ada3a4..636b3ec46184 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -3602,10 +3602,10 @@ static long btrfs_fallocate(struct file *file, int mode,
>  }
>  
>  /*
> - * Helper for have_delalloc_in_range(). Find a subrange in a given range that
> - * has unflushed and/or flushing delalloc. There might be other adjacent
> - * subranges after the one it found, so have_delalloc_in_range() keeps looping
> - * while it gets adjacent subranges, and merging them together.
> + * Helper for btrfs_find_delalloc_in_range(). Find a subrange in a given range
> + * that has unflushed and/or flushing delalloc. There might be other adjacent
> + * subranges after the one it found, so btrfs_find_delalloc_in_range() keeps
> + * looping while it gets adjacent subranges, and merging them together.
>   */
>  static bool find_delalloc_subrange(struct btrfs_inode *inode, u64 start, u64 end,
>  				   u64 *delalloc_start_ret, u64 *delalloc_end_ret)
> @@ -3740,8 +3740,8 @@ static bool find_delalloc_subrange(struct btrfs_inode *inode, u64 start, u64 end
>   * if so it sets @delalloc_start_ret and @delalloc_end_ret with the start and
>   * end offsets of the subrange.
>   */
> -static bool have_delalloc_in_range(struct btrfs_inode *inode, u64 start, u64 end,
> -				   u64 *delalloc_start_ret, u64 *delalloc_end_ret)
> +bool btrfs_find_delalloc_in_range(struct btrfs_inode *inode, u64 start, u64 end,
> +				  u64 *delalloc_start_ret, u64 *delalloc_end_ret)
>  {
>  	u64 cur_offset = round_down(start, inode->root->fs_info->sectorsize);
>  	u64 prev_delalloc_end = 0;
> @@ -3804,8 +3804,8 @@ static bool find_desired_extent_in_hole(struct btrfs_inode *inode, int whence,
>  	u64 delalloc_end;
>  	bool delalloc;
>  
> -	delalloc = have_delalloc_in_range(inode, start, end, &delalloc_start,
> -					  &delalloc_end);
> +	delalloc = btrfs_find_delalloc_in_range(inode, start, end,
> +						&delalloc_start, &delalloc_end);
>  	if (delalloc && whence == SEEK_DATA) {
>  		*start_ret = delalloc_start;
>  		return true;
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 2c7d31990777..8be1e021513a 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -7064,133 +7064,6 @@ struct extent_map *btrfs_get_extent(struct btrfs_inode *inode,
>  	return em;
>  }
>  
> -struct extent_map *btrfs_get_extent_fiemap(struct btrfs_inode *inode,
> -					   u64 start, u64 len)
> -{
> -	struct extent_map *em;
> -	struct extent_map *hole_em = NULL;
> -	u64 delalloc_start = start;
> -	u64 end;
> -	u64 delalloc_len;
> -	u64 delalloc_end;
> -	int err = 0;
> -
> -	em = btrfs_get_extent(inode, NULL, 0, start, len);
> -	if (IS_ERR(em))
> -		return em;
> -	/*
> -	 * If our em maps to:
> -	 * - a hole or
> -	 * - a pre-alloc extent,
> -	 * there might actually be delalloc bytes behind it.
> -	 */
> -	if (em->block_start != EXTENT_MAP_HOLE &&
> -	    !test_bit(EXTENT_FLAG_PREALLOC, &em->flags))
> -		return em;
> -	else
> -		hole_em = em;
> -
> -	/* check to see if we've wrapped (len == -1 or similar) */
> -	end = start + len;
> -	if (end < start)
> -		end = (u64)-1;
> -	else
> -		end -= 1;
> -
> -	em = NULL;
> -
> -	/* ok, we didn't find anything, lets look for delalloc */
> -	delalloc_len = count_range_bits(&inode->io_tree, &delalloc_start,
> -				 end, len, EXTENT_DELALLOC, 1);
> -	delalloc_end = delalloc_start + delalloc_len;
> -	if (delalloc_end < delalloc_start)
> -		delalloc_end = (u64)-1;
> -
> -	/*
> -	 * We didn't find anything useful, return the original results from
> -	 * get_extent()
> -	 */
> -	if (delalloc_start > end || delalloc_end <= start) {
> -		em = hole_em;
> -		hole_em = NULL;
> -		goto out;
> -	}
> -
> -	/*
> -	 * Adjust the delalloc_start to make sure it doesn't go backwards from
> -	 * the start they passed in
> -	 */
> -	delalloc_start = max(start, delalloc_start);
> -	delalloc_len = delalloc_end - delalloc_start;
> -
> -	if (delalloc_len > 0) {
> -		u64 hole_start;
> -		u64 hole_len;
> -		const u64 hole_end = extent_map_end(hole_em);
> -
> -		em = alloc_extent_map();
> -		if (!em) {
> -			err = -ENOMEM;
> -			goto out;
> -		}
> -
> -		ASSERT(hole_em);
> -		/*
> -		 * When btrfs_get_extent can't find anything it returns one
> -		 * huge hole
> -		 *
> -		 * Make sure what it found really fits our range, and adjust to
> -		 * make sure it is based on the start from the caller
> -		 */
> -		if (hole_end <= start || hole_em->start > end) {
> -		       free_extent_map(hole_em);
> -		       hole_em = NULL;
> -		} else {
> -		       hole_start = max(hole_em->start, start);
> -		       hole_len = hole_end - hole_start;
> -		}
> -
> -		if (hole_em && delalloc_start > hole_start) {
> -			/*
> -			 * Our hole starts before our delalloc, so we have to
> -			 * return just the parts of the hole that go until the
> -			 * delalloc starts
> -			 */
> -			em->len = min(hole_len, delalloc_start - hole_start);
> -			em->start = hole_start;
> -			em->orig_start = hole_start;
> -			/*
> -			 * Don't adjust block start at all, it is fixed at
> -			 * EXTENT_MAP_HOLE
> -			 */
> -			em->block_start = hole_em->block_start;
> -			em->block_len = hole_len;
> -			if (test_bit(EXTENT_FLAG_PREALLOC, &hole_em->flags))
> -				set_bit(EXTENT_FLAG_PREALLOC, &em->flags);
> -		} else {
> -			/*
> -			 * Hole is out of passed range or it starts after
> -			 * delalloc range
> -			 */
> -			em->start = delalloc_start;
> -			em->len = delalloc_len;
> -			em->orig_start = delalloc_start;
> -			em->block_start = EXTENT_MAP_DELALLOC;
> -			em->block_len = delalloc_len;
> -		}
> -	} else {
> -		return hole_em;
> -	}
> -out:
> -
> -	free_extent_map(hole_em);
> -	if (err) {
> -		free_extent_map(em);
> -		return ERR_PTR(err);
> -	}
> -	return em;
> -}
> -
>  static struct extent_map *btrfs_create_dio_extent(struct btrfs_inode *inode,
>  						  const u64 start,
>  						  const u64 len,
> @@ -8265,15 +8138,14 @@ static int btrfs_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
>  	 * in the compression of data (in an async thread) and will return
>  	 * before the compression is done and writeback is started. A second
>  	 * filemap_fdatawrite_range() is needed to wait for the compression to
> -	 * complete and writeback to start. Without this, our user is very
> -	 * likely to get stale results, because the extents and extent maps for
> -	 * delalloc regions are only allocated when writeback starts.
> +	 * complete and writeback to start. We also need to wait for ordered
> +	 * extents to complete, because our fiemap implementation uses mainly
> +	 * file extent items to list the extents, searching for extent maps
> +	 * only for file ranges with holes or prealloc extents to figure out
> +	 * if we have delalloc in those ranges.
>  	 */
>  	if (fieinfo->fi_flags & FIEMAP_FLAG_SYNC) {
> -		ret = btrfs_fdatawrite_range(inode, 0, LLONG_MAX);
> -		if (ret)
> -			return ret;
> -		ret = filemap_fdatawait_range(inode->i_mapping, 0, LLONG_MAX);
> +		ret = btrfs_wait_ordered_range(inode, 0, LLONG_MAX);
>  		if (ret)
>  			return ret;
>  	}

Hmm this bit should be in "btrfs: properly flush delalloc when entering fiemap"
instead.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 53+ messages in thread
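
For reference, the FIEMAP_FLAG_SYNC behaviour touched by the last hunk above
can be exercised from user space with the standard FIEMAP ioctl. This is a
minimal sketch, not part of the patchset; it only asks the kernel for the
extent count (fm_extent_count == 0 means "count only"), and FIEMAP_FLAG_SYNC
is what makes btrfs flush delalloc before mapping:

    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/fs.h>
    #include <linux/fiemap.h>

    int main(int argc, char **argv)
    {
        struct fiemap fm = { 0 };
        int fd;

        if (argc != 2) {
            fprintf(stderr, "usage: %s <file>\n", argv[0]);
            return 1;
        }

        fd = open(argv[1], O_RDONLY);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        /*
         * fm_extent_count == 0: only count extents, don't copy them out.
         * FIEMAP_FLAG_SYNC: flush dirty data before mapping, which on
         * btrfs goes through the code changed in the hunk quoted above.
         */
        fm.fm_length = FIEMAP_MAX_OFFSET;
        fm.fm_flags = FIEMAP_FLAG_SYNC;

        if (ioctl(fd, FS_IOC_FIEMAP, &fm) < 0) {
            perror("fiemap");
            close(fd);
            return 1;
        }

        printf("extents: %u\n", fm.fm_mapped_extents);
        close(fd);
        return 0;
    }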

* Re: [PATCH 02/10] btrfs: make hole and data seeking a lot more efficient
  2022-09-01 14:03   ` Josef Bacik
@ 2022-09-01 15:00     ` Filipe Manana
  2022-09-02 13:26       ` Josef Bacik
  0 siblings, 1 reply; 53+ messages in thread
From: Filipe Manana @ 2022-09-01 15:00 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs

On Thu, Sep 01, 2022 at 10:03:03AM -0400, Josef Bacik wrote:
> On Thu, Sep 01, 2022 at 02:18:22PM +0100, fdmanana@kernel.org wrote:
> > From: Filipe Manana <fdmanana@suse.com>
> > 
> > The current implementation of hole and data seeking for llseek does not
> > scale well in regards to the number of extents and the distance between
> > the start offset and the next hole or extent. This is due to a very high
> > algorithmic complexity. Often we also get reports of btrfs' hole and data
> > seeking (llseek) being too slow, such as at 2017's LSFMM (see the Link
> > tag at the bottom).
> > 
> > In order to better understand it, let's consider the case where the start
> > offset is 0, we are seeking for a hole and the file size is 16G. Between
> > file offset 0 and the first hole in the file there are 100K extents - this
> > is common for large files, especially if we have compression enabled, since
> > the maximum extent size is limited to 128K. The steps taken by the main
> > loop of the current algorithm are the following:
> > 
> > 1) We start by calling btrfs_get_extent_fiemap(), for file offset 0, which
> >    calls btrfs_get_extent(). This will first lookup for an extent map in
> >    the inode's extent map tree (a red black tree). If the extent map is
> >    not loaded in memory, then it will do a lookup for the corresponding
> >    file extent item in the subvolume's b+tree, create an extent map based
> >    on the contents of the file extent item and then add the extent map to
> >    the extent map tree of the inode;
> > 
> > 2) The second iteration calls btrfs_get_extent_fiemap() again, this time
> >    with a start offset matching the end offset of the previous extent.
> >    Again, btrfs_get_extent() will first search the extent map tree, and
> >    if it doesn't find an extent map there, it will again search in the
> >    b+tree of the subvolume for a matching file extent item, build an
> >    extent map based on the file extent item, and add the extent map
> >    to the extent map tree of the inode;
> > 
> > 3) This repeats over and over until we find the first hole (when seeking
> >    for holes) or until we find the first extent (when seeking for data).
> > 
> >    If there are no extent maps loaded in memory, then on
> >    each iteration we do 1 extent map tree search, 1 b+tree search, plus
> >    1 more extent map tree traversal to insert an extent map - plus we
> >    allocate memory for the extent map.
> > 
> >    On each iteration we are growing the size of the extent map tree,
> >    making each future search slower, and also visiting the same b+tree
> >    leaves over and over again - taking into account with the default leaf
> >    size of 16K we can fit more than 200 file extent items in a leaf - so
> >    we can visit the same b+tree leaf 200+ times, on each visit walking
> >    down a path from the root to the leaf.
> > 
> > So it's easy to see that what we have now doesn't scale well. Also, it
> > loads an extent map for every file extent item into memory, which is not
> > efficient - we should add extents maps only when doing IO (writing or
> > reading file data).
> > 
> > This change implements a new algorithm which scales much better, and
> > works like this:
> > 
> > 1) We iterate over the subvolume's b+tree, visiting each leaf that has
> >    file extent items once and only once;
> > 
> > 2) For any file extent items found that don't represent holes or prealloc
> >    extents, it will not search the extent map tree - there's no need at
> >    all for that - an extent map is just an in-memory representation of a
> >    file extent item;
> > 
> > 3) When a hole is found, or a prealloc extent, it will check if there's
> >    delalloc for its range. For this it will search for EXTENT_DELALLOC
> >    bits in the inode's io tree and check the extent map tree - this is
> >    for accounting for unflushed delalloc and for flushed delalloc (the
> >    period between running delalloc and ordered extent completion),
> >    respectively. This is similar to what the current implementation does
> >    when it finds a hole or prealloc extent, but without creating extent
> >    maps and adding them to the extent map tree in case they are not
> >    loaded in memory;
> > 
> > 4) It never allocates extent maps, or adds extent maps to the inode's
> >    extent map tree. This not only saves memory and time (from the tree
> >    insertions and allocations), but also eliminates the possibility of
> >    -ENOMEM due to allocating too many extent maps.
> > 
> > Part of this new code will also be used later for fiemap (which also
> > suffers similar scalability problems).
> > 
> > The following test example can be used to quickly measure the efficiency
> > before and after this patch:
> > 
> >     $ cat test-seek-hole.sh
> >     #!/bin/bash
> > 
> >     DEV=/dev/sdi
> >     MNT=/mnt/sdi
> > 
> >     mkfs.btrfs -f $DEV
> > 
> >     mount -o compress=lzo $DEV $MNT
> > 
> >     # 16G file -> 131073 compressed extents.
> >     xfs_io -f -c "pwrite -S 0xab -b 1M 0 16G" $MNT/foobar
> > 
> >     # Leave a 1M hole at file offset 15G.
> >     xfs_io -c "fpunch 15G 1M" $MNT/foobar
> > 
> >     # Unmount and mount again, so that we can test when there's no
> >     # metadata cached in memory.
> >     umount $MNT
> >     mount -o compress=lzo $DEV $MNT
> > 
> >     # Test seeking for hole from offset 0 (hole is at offset 15G).
> > 
> >     start=$(date +%s%N)
> >     xfs_io -c "seek -h 0" $MNT/foobar
> >     end=$(date +%s%N)
> >     dur=$(( (end - start) / 1000000 ))
> >     echo "Took $dur milliseconds to seek first hole (metadata not cached)"
> >     echo
> > 
> >     start=$(date +%s%N)
> >     xfs_io -c "seek -h 0" $MNT/foobar
> >     end=$(date +%s%N)
> >     dur=$(( (end - start) / 1000000 ))
> >     echo "Took $dur milliseconds to seek first hole (metadata cached)"
> >     echo
> > 
> >     umount $MNT
> > 
> > Before this change:
> > 
> >     $ ./test-seek-hole.sh
> >     (...)
> >     Whence	Result
> >     HOLE	16106127360
> >     Took 176 milliseconds to seek first hole (metadata not cached)
> > 
> >     Whence	Result
> >     HOLE	16106127360
> >     Took 17 milliseconds to seek first hole (metadata cached)
> > 
> > After this change:
> > 
> >     $ ./test-seek-hole.sh
> >     (...)
> >     Whence	Result
> >     HOLE	16106127360
> >     Took 43 milliseconds to seek first hole (metadata not cached)
> > 
> >     Whence	Result
> >     HOLE	16106127360
> >     Took 13 milliseconds to seek first hole (metadata cached)
> > 
> > That's about 4X faster when no metadata is cached and about 30% faster
> > when all metadata is cached.
> > 
> > In practice the differences may often be significantly higher, either due
> > to a higher number of extents in a file or because the subvolume's b+tree
> > is much bigger than in this example, where we only have one file.
> > 
> > Link: https://lwn.net/Articles/718805/
> > Signed-off-by: Filipe Manana <fdmanana@suse.com>
> > ---
> >  fs/btrfs/file.c | 437 ++++++++++++++++++++++++++++++++++++++++++++----
> >  1 file changed, 406 insertions(+), 31 deletions(-)
> > 
> > diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> > index 96f444ad0951..b292a8ada3a4 100644
> > --- a/fs/btrfs/file.c
> > +++ b/fs/btrfs/file.c
> > @@ -3601,22 +3601,281 @@ static long btrfs_fallocate(struct file *file, int mode,
> >  	return ret;
> >  }
> >  
> > +/*
> > + * Helper for have_delalloc_in_range(). Find a subrange in a given range that
> > + * has unflushed and/or flushing delalloc. There might be other adjacent
> > + * subranges after the one it found, so have_delalloc_in_range() keeps looping
> > + * while it gets adjacent subranges, and merging them together.
> > + */
> > +static bool find_delalloc_subrange(struct btrfs_inode *inode, u64 start, u64 end,
> > +				   u64 *delalloc_start_ret, u64 *delalloc_end_ret)
> > +{
> > +	const u64 len = end + 1 - start;
> > +	struct extent_map_tree *em_tree = &inode->extent_tree;
> > +	struct extent_map *em;
> > +	u64 em_end;
> > +	u64 delalloc_len;
> > +
> > +	/*
> > +	 * Search the io tree first for EXTENT_DELALLOC. If we find any, it
> > +	 * means we have delalloc (dirty pages) for which writeback has not
> > +	 * started yet.
> > +	 */
> > +	*delalloc_start_ret = start;
> > +	delalloc_len = count_range_bits(&inode->io_tree, delalloc_start_ret, end,
> > +					len, EXTENT_DELALLOC, 1);
> > +	/*
> > +	 * If delalloc was found then *delalloc_start_ret has a sector size
> > +	 * aligned value (rounded down).
> > +	 */
> > +	if (delalloc_len > 0)
> > +		*delalloc_end_ret = *delalloc_start_ret + delalloc_len - 1;
> > +
> > +	/*
> > +	 * Now also check if there's any extent map in the range that does not
> > +	 * map to a hole or prealloc extent. We do this because:
> > +	 *
> > +	 * 1) When delalloc is flushed, the file range is locked, we clear the
> > +	 *    EXTENT_DELALLOC bit from the io tree and create an extent map for
> > +	 *    an allocated extent. So we might just have been called after
> > +	 *    delalloc is flushed and before the ordered extent completes and
> > +	 *    inserts the new file extent item in the subvolume's btree;
> > +	 *
> > +	 * 2) We may have an extent map created by flushing delalloc for a
> > +	 *    subrange that starts before the subrange we found marked with
> > +	 *    EXTENT_DELALLOC in the io tree.
> > +	 */
> > +	read_lock(&em_tree->lock);
> > +	em = lookup_extent_mapping(em_tree, start, len);
> > +	read_unlock(&em_tree->lock);
> > +
> > +	/* extent_map_end() returns a non-inclusive end offset. */
> > +	em_end = em ? extent_map_end(em) : 0;
> > +
> > +	/*
> > +	 * If we have a hole/prealloc extent map, check the next one if this one
> > +	 * ends before our range's end.
> > +	 */
> > +	if (em && (em->block_start == EXTENT_MAP_HOLE ||
> > +		   test_bit(EXTENT_FLAG_PREALLOC, &em->flags)) && em_end < end) {
> > +		struct extent_map *next_em;
> > +
> > +		read_lock(&em_tree->lock);
> > +		next_em = lookup_extent_mapping(em_tree, em_end, len - em_end);
> > +		read_unlock(&em_tree->lock);
> > +
> > +		free_extent_map(em);
> > +		em_end = next_em ? extent_map_end(next_em) : 0;
> > +		em = next_em;
> > +	}
> > +
> > +	if (em && (em->block_start == EXTENT_MAP_HOLE ||
> > +		   test_bit(EXTENT_FLAG_PREALLOC, &em->flags))) {
> > +		free_extent_map(em);
> > +		em = NULL;
> > +	}
> > +
> > +	/*
> > +	 * No extent map or one for a hole or prealloc extent. Use the delalloc
> > +	 * range we found in the io tree if we have one.
> > +	 */
> > +	if (!em)
> > +		return (delalloc_len > 0);
> > +
> 
> You can move this after the lookup, and then remove the if (em && parts above.
> Then all you need to do is in the second if statement return (delalloc_len > 0);

Nope, it won't work by doing just that.

Moving that if statement would require all the following changes,
which to me don't seem to provide any benefit, aesthetically or
otherwise:

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 636b3ec46184..05037b8950d5 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -3649,39 +3649,39 @@ static bool find_delalloc_subrange(struct btrfs_inode *inode, u64 start, u64 end
 	em = lookup_extent_mapping(em_tree, start, len);
 	read_unlock(&em_tree->lock);
 
+	if (!em)
+		return (delalloc_len > 0);
+
 	/* extent_map_end() returns a non-inclusive end offset. */
-	em_end = em ? extent_map_end(em) : 0;
+	em_end = extent_map_end(em);
 
 	/*
 	 * If we have a hole/prealloc extent map, check the next one if this one
 	 * ends before our range's end.
 	 */
-	if (em && (em->block_start == EXTENT_MAP_HOLE ||
-		   test_bit(EXTENT_FLAG_PREALLOC, &em->flags)) && em_end < end) {
+	if (em->block_start == EXTENT_MAP_HOLE ||
+	    test_bit(EXTENT_FLAG_PREALLOC, &em->flags)) {
 		struct extent_map *next_em;
 
+		free_extent_map(em);
+
+		if (em_end >= end)
+			return (delalloc_len > 0);
+
 		read_lock(&em_tree->lock);
 		next_em = lookup_extent_mapping(em_tree, em_end, len - em_end);
 		read_unlock(&em_tree->lock);
 
-		free_extent_map(em);
-		em_end = next_em ? extent_map_end(next_em) : 0;
-		em = next_em;
-	}
+		if (!next_em || next_em->block_start == EXTENT_MAP_HOLE ||
+		    test_bit(EXTENT_FLAG_PREALLOC, &next_em->flags)) {
+			free_extent_map(next_em);
+			return (delalloc_len > 0);
+		}
 
-	if (em && (em->block_start == EXTENT_MAP_HOLE ||
-		   test_bit(EXTENT_FLAG_PREALLOC, &em->flags))) {
-		free_extent_map(em);
-		em = NULL;
+		em = next_em;
+		em_end = extent_map_end(em);
 	}
 
-	/*
-	 * No extent map or one for a hole or prealloc extent. Use the delalloc
-	 * range we found in the io tree if we have one.
-	 */
-	if (!em)
-		return (delalloc_len > 0);
-
 	/*
 	 * We don't have any range as EXTENT_DELALLOC in the io tree, so the
 	 * extent map is the only subrange representing delalloc.


> Thanks,
> 
> Josef

^ permalink raw reply related	[flat|nested] 53+ messages in thread
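
For completeness, a minimal sketch of the user-facing lseek() interface
(SEEK_DATA/SEEK_HOLE) whose btrfs implementation the patch above reworks.
It is not part of the patchset; the default path is just an example
mirroring the mount point used in the test scripts:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        const char *path = (argc > 1) ? argv[1] : "/mnt/sdi/foobar";
        off_t off = 0;
        int fd;

        fd = open(path, O_RDONLY);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        /* Walk the whole file, printing every data segment between holes. */
        for (;;) {
            off_t data = lseek(fd, off, SEEK_DATA);
            off_t hole;

            /* ENXIO means there is no data at or after 'off'. */
            if (data < 0)
                break;

            hole = lseek(fd, data, SEEK_HOLE);
            if (hole < 0) {
                perror("lseek");
                break;
            }

            printf("data: [%lld, %lld)\n", (long long)data, (long long)hole);
            off = hole;
        }

        close(fd);
        return 0;
    }

Each SEEK_DATA/SEEK_HOLE pair goes through one hole/data seek in btrfs, so
the timing comparison done with xfs_io "seek -h" in the commit message can
be reproduced with this program on a file with many extents.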

* Re: [PATCH 10/10] btrfs: make fiemap more efficient and accurate reporting extent sharedness
  2022-09-01 14:35   ` Josef Bacik
@ 2022-09-01 15:04     ` Filipe Manana
  2022-09-02 13:25       ` Josef Bacik
  0 siblings, 1 reply; 53+ messages in thread
From: Filipe Manana @ 2022-09-01 15:04 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs

On Thu, Sep 1, 2022 at 3:35 PM Josef Bacik <josef@toxicpanda.com> wrote:
>
> On Thu, Sep 01, 2022 at 02:18:30PM +0100, fdmanana@kernel.org wrote:
> > From: Filipe Manana <fdmanana@suse.com>
> >
> > The current fiemap implementation does not scale very well with the number
> > of extents a file has. This is both because the main algorithm to find out
> > the extents has a high algorithmic complexity and because for each extent
> > we have to check if it's shared. This second part, checking if an extent
> > is shared, is significantly improved by the two previous patches in this
> > patchset, while the first part is improved by this specific patch. Every
> > now and then we get reports from users mentioning fiemap is too slow or
> > even unusable for files with a very large number of extents, such as the
> > two recent reports referred to by the Link tags at the bottom of this
> > change log.
> >
> > To understand why the part of finding which extents a file has is very
> > inefficient, consider the example of doing a full ranged fiemap against
> > a file that has over 100K extents (common, for example, for a file with
> > more than 10G of data and using compression, which limits the extent size
> > to 128K). When we enter fiemap at extent_fiemap(), the following happens:
> >
> > 1) Before entering the main loop, we call get_extent_skip_holes() to get
> >    the first extent map. This leads us to btrfs_get_extent_fiemap(), which
> >    in turn calls btrfs_get_extent(), to find the first extent map that
> >    covers the file range [0, LLONG_MAX).
> >
> >    btrfs_get_extent() will first search the inode's extent map tree, to
> >    see if we have an extent map there that covers the range. If it does
> >    not find one, then it will search the inode's subvolume b+tree for a
> >    fitting file extent item. After finding the file extent item, it will
> >    allocate an extent map, fill it in with information extracted from the
> >    file extent item, and add it to the inode's extent map tree (which
> >    requires a search for insertion in the tree).
> >
> > 2) Then we enter the main loop at extent_fiemap(), emit the details of
> >    the extent, and call again get_extent_skip_holes(), with a start
> >    offset matching the end of the extent map we previously processed.
> >
> >    We end up at btrfs_get_extent() again, which will search the extent map
> >    tree and then search the subvolume b+tree for a file extent item if we
> >    could not find an extent map there. We allocate an extent map,
> >    fill it in with the details in the file extent item, and then insert
> >    it into the extent map tree (yet another search in this tree).
> >
> > 3) The second step is repeated over and over, until we have processed the
> >    whole file range. Each iteration ends at btrfs_get_extent(), which
> >    does a red black tree search on the extent map tree, then searches the
> >    subvolume b+tree, allocates an extent map and then does another search
> >    in the extent map tree in order to insert the extent map.
> >
> >    In the best scenario we have all the extent maps already in the extent
> >    map tree, so for each extent we do a single search on a red black tree,
> >    so we have a complexity of O(n log n).
> >
> >    In the worst scenario we don't have any extent map already loaded in
> >    the extent map tree, or have very few already there. In this case the
> >    complexity is much higher since we do:
> >
> >    - A red black tree search on the extent map tree, which has O(log n)
> >      complexity, initially very fast since the tree is empty or very
> >      small, but as we end up allocating extent maps and adding them to
> >      the tree when we don't find them there, each subsequent search on
> >      the tree gets slower, since it's getting bigger and bigger after
> >      each iteration.
> >
> >    - A search on the subvolume b+tree, also O(log n) complexity, but it
> >      has items for all inodes in the subvolume, not just items for our
> >      inode. Plus on a filesystem with concurrent operations on other
> >      inodes, we can block doing the search due to lock contention on
> >      b+tree nodes/leaves.
> >
> >    - Allocate an extent map - this can block, and can also fail if we
> >      are under serious memory pressure.
> >
> >    - Do another search on the extent maps red black tree, with the goal
> >      of inserting the extent map we just allocated. Again, after every
> >      iteration this tree is getting bigger by 1 element, so after many
> >      iterations the searches are slower and slower.
> >
> >    - We will not need the allocated extent map anymore, so it's pointless
> >      to add it to the extent map tree. It's just wasting time and memory.
> >
> >    In short we end up searching the extent map tree multiple times, on a
> >    tree that is growing bigger and bigger after each iteration. And
> >    besides that we visit the same leaf of the subvolume b+tree many times,
> >    since a leaf with the default size of 16K can easily have more than 200
> >    file extent items.
> >
> > This is very inefficient overall. This patch changes the algorithm to
> > instead iterate over the subvolume b+tree, visiting each leaf only once,
> > and only searching in the extent map tree for file ranges that have holes
> > or prealloc extents, in order to figure out if we have delalloc there.
> > It will never allocate an extent map and add it to the extent map tree.
> > This is very similar to what was previously done for the lseek's hole and
> > data seeking features.
> >
> > Also, the current implementation relying on extent maps for figuring out
> > which extents we have is not correct. This is because extent maps can be
> > merged even if they represent different extents - we do this to minimize
> > memory utilization and keep extent map trees smaller. For example if we
> > have two extents that are contiguous on disk, once we load the two extent
> > maps, they get merged into a single one - however if only one of the
> > extents is shared, we end up reporting both as shared or both as not
> > shared, which is incorrect.
> >
> > This reproducer triggers that bug:
> >
> >     $ cat fiemap-bug.sh
> >     #!/bin/bash
> >
> >     DEV=/dev/sdj
> >     MNT=/mnt/sdj
> >
> >     mkfs.btrfs -f $DEV
> >     mount $DEV $MNT
> >
> >     # Create a file with two 256K extents.
> >     # Since there is no other write activity, they will be contiguous,
> >     # and their extent maps merged, despite having two distinct extents.
> >     xfs_io -f -c "pwrite -S 0xab 0 256K" \
> >               -c "fsync" \
> >               -c "pwrite -S 0xcd 256K 256K" \
> >               -c "fsync" \
> >               $MNT/foo
> >
> >     # Now clone only the second extent into another file.
> >     xfs_io -f -c "reflink $MNT/foo 256K 0 256K" $MNT/bar
> >
> >     # Filefrag will report a single 512K extent, and say it's not shared.
> >     echo
> >     filefrag -v $MNT/foo
> >
> >     umount $MNT
> >
> > Running the reproducer:
> >
> >     $ ./fiemap-bug.sh
> >     wrote 262144/262144 bytes at offset 0
> >     256 KiB, 64 ops; 0.0038 sec (65.479 MiB/sec and 16762.7030 ops/sec)
> >     wrote 262144/262144 bytes at offset 262144
> >     256 KiB, 64 ops; 0.0040 sec (61.125 MiB/sec and 15647.9218 ops/sec)
> >     linked 262144/262144 bytes at offset 0
> >     256 KiB, 1 ops; 0.0002 sec (1.034 GiB/sec and 4237.2881 ops/sec)
> >
> >     Filesystem type is: 9123683e
> >     File size of /mnt/sdj/foo is 524288 (128 blocks of 4096 bytes)
> >      ext:     logical_offset:        physical_offset: length:   expected: flags:
> >        0:        0..     127:       3328..      3455:    128:             last,eof
> >     /mnt/sdj/foo: 1 extent found
> >
> > We end up reporting that we have a single 512K extent that is not shared,
> > however we have two 256K extents, and the second one is shared. Changing
> > the reproducer to clone the first extent instead into file 'bar' makes us
> > report a single 512K extent that is shared, which is also incorrect since
> > we have two 256K extents and only the first one is shared.
> >
> > This patch is part of a larger patchset that is comprised of the following
> > patches:
> >
> >     btrfs: allow hole and data seeking to be interruptible
> >     btrfs: make hole and data seeking a lot more efficient
> >     btrfs: remove check for impossible block start for an extent map at fiemap
> >     btrfs: remove zero length check when entering fiemap
> >     btrfs: properly flush delalloc when entering fiemap
> >     btrfs: allow fiemap to be interruptible
> >     btrfs: rename btrfs_check_shared() to a more descriptive name
> >     btrfs: speedup checking for extent sharedness during fiemap
> >     btrfs: skip unnecessary extent buffer sharedness checks during fiemap
> >     btrfs: make fiemap more efficient and accurate reporting extent sharedness
> >
> > The patchset was tested on a machine running a non-debug kernel (Debian's
> > default config) and compared the tests below on a branch without the
> > patchset versus the same branch with the whole patchset applied.
> >
> > The following test for a large compressed file without holes:
> >
> >     $ cat fiemap-perf-test.sh
> >     #!/bin/bash
> >
> >     DEV=/dev/sdi
> >     MNT=/mnt/sdi
> >
> >     mkfs.btrfs -f $DEV
> >     mount -o compress=lzo $DEV $MNT
> >
> >     # 40G gives 327680 128K file extents (due to compression).
> >     xfs_io -f -c "pwrite -S 0xab -b 1M 0 20G" $MNT/foobar
> >
> >     umount $MNT
> >     mount -o compress=lzo $DEV $MNT
> >
> >     start=$(date +%s%N)
> >     filefrag $MNT/foobar
> >     end=$(date +%s%N)
> >     dur=$(( (end - start) / 1000000 ))
> >     echo "fiemap took $dur milliseconds (metadata not cached)"
> >
> >     start=$(date +%s%N)
> >     filefrag $MNT/foobar
> >     end=$(date +%s%N)
> >     dur=$(( (end - start) / 1000000 ))
> >     echo "fiemap took $dur milliseconds (metadata cached)"
> >
> >     umount $MNT
> >
> > Before patchset:
> >
> >     $ ./fiemap-perf-test.sh
> >     (...)
> >     /mnt/sdi/foobar: 327680 extents found
> >     fiemap took 3597 milliseconds (metadata not cached)
> >     /mnt/sdi/foobar: 327680 extents found
> >     fiemap took 2107 milliseconds (metadata cached)
> >
> > After patchset:
> >
> >     $ ./fiemap-perf-test.sh
> >     (...)
> >     /mnt/sdi/foobar: 327680 extents found
> >     fiemap took 1214 milliseconds (metadata not cached)
> >     /mnt/sdi/foobar: 327680 extents found
> >     fiemap took 684 milliseconds (metadata cached)
> >
> > That's a speedup of about 3x for both cases (no metadata cached and all
> > metadata cached).
> >
> > The test provided by Pavel (first Link tag at the bottom), which uses
> > files with a large number of holes, was also used to measure the gains,
> > and it consists of a small C program and a shell script to invoke it.
> > The C program is the following:
> >
> >     $ cat pavels-test.c
> >     #include <stdio.h>
> >     #include <unistd.h>
> >     #include <stdlib.h>
> >     #include <fcntl.h>
> >
> >     #include <sys/stat.h>
> >     #include <sys/time.h>
> >     #include <sys/ioctl.h>
> >
> >     #include <linux/fs.h>
> >     #include <linux/fiemap.h>
> >
> >     #define FILE_INTERVAL (1<<13) /* 8Kb */
> >
> >     long long interval(struct timeval t1, struct timeval t2)
> >     {
> >         long long val = 0;
> >         val += (t2.tv_usec - t1.tv_usec);
> >         val += (t2.tv_sec - t1.tv_sec) * 1000 * 1000;
> >         return val;
> >     }
> >
> >     int main(int argc, char **argv)
> >     {
> >         struct fiemap fiemap = {};
> >         struct timeval t1, t2;
> >         char data = 'a';
> >         struct stat st;
> >         int fd, off, file_size = FILE_INTERVAL;
> >
> >         if (argc != 3 && argc != 2) {
> >                 printf("usage: %s <path> [size]\n", argv[0]);
> >                 return 1;
> >         }
> >
> >         if (argc == 3)
> >                 file_size = atoi(argv[2]);
> >         if (file_size < FILE_INTERVAL)
> >                 file_size = FILE_INTERVAL;
> >         file_size -= file_size % FILE_INTERVAL;
> >
> >         fd = open(argv[1], O_RDWR | O_CREAT | O_TRUNC, 0644);
> >         if (fd < 0) {
> >             perror("open");
> >             return 1;
> >         }
> >
> >         for (off = 0; off < file_size; off += FILE_INTERVAL) {
> >             if (pwrite(fd, &data, 1, off) != 1) {
> >                 perror("pwrite");
> >                 close(fd);
> >                 return 1;
> >             }
> >         }
> >
> >         if (ftruncate(fd, file_size)) {
> >             perror("ftruncate");
> >             close(fd);
> >             return 1;
> >         }
> >
> >         if (fstat(fd, &st) < 0) {
> >             perror("fstat");
> >             close(fd);
> >             return 1;
> >         }
> >
> >         printf("size: %ld\n", st.st_size);
> >         printf("actual size: %ld\n", st.st_blocks * 512);
> >
> >         fiemap.fm_length = FIEMAP_MAX_OFFSET;
> >         gettimeofday(&t1, NULL);
> >         if (ioctl(fd, FS_IOC_FIEMAP, &fiemap) < 0) {
> >             perror("fiemap");
> >             close(fd);
> >             return 1;
> >         }
> >         gettimeofday(&t2, NULL);
> >
> >         printf("fiemap: fm_mapped_extents = %d\n",
> >                fiemap.fm_mapped_extents);
> >         printf("time = %lld us\n", interval(t1, t2));
> >
> >         close(fd);
> >         return 0;
> >     }
> >
> >     $ gcc -o pavels-test pavels-test.c
> >
> > And the wrapper shell script:
> >
> >     $ cat fiemap-pavels-test.sh
> >
> >     #!/bin/bash
> >
> >     DEV=/dev/sdi
> >     MNT=/mnt/sdi
> >
> >     mkfs.btrfs -f -O no-holes $DEV
> >     mount $DEV $MNT
> >
> >     echo
> >     echo "*********** 256M ***********"
> >     echo
> >
> >     ./pavels-test $MNT/testfile $((1 << 28))
> >     echo
> >     ./pavels-test $MNT/testfile $((1 << 28))
> >
> >     echo
> >     echo "*********** 512M ***********"
> >     echo
> >
> >     ./pavels-test $MNT/testfile $((1 << 29))
> >     echo
> >     ./pavels-test $MNT/testfile $((1 << 29))
> >
> >     echo
> >     echo "*********** 1G ***********"
> >     echo
> >
> >     ./pavels-test $MNT/testfile $((1 << 30))
> >     echo
> >     ./pavels-test $MNT/testfile $((1 << 30))
> >
> >     umount $MNT
> >
> > Running his reproducer before applying the patchset:
> >
> >     *********** 256M ***********
> >
> >     size: 268435456
> >     actual size: 134217728
> >     fiemap: fm_mapped_extents = 32768
> >     time = 4003133 us
> >
> >     size: 268435456
> >     actual size: 134217728
> >     fiemap: fm_mapped_extents = 32768
> >     time = 4895330 us
> >
> >     *********** 512M ***********
> >
> >     size: 536870912
> >     actual size: 268435456
> >     fiemap: fm_mapped_extents = 65536
> >     time = 30123675 us
> >
> >     size: 536870912
> >     actual size: 268435456
> >     fiemap: fm_mapped_extents = 65536
> >     time = 33450934 us
> >
> >     *********** 1G ***********
> >
> >     size: 1073741824
> >     actual size: 536870912
> >     fiemap: fm_mapped_extents = 131072
> >     time = 224924074 us
> >
> >     size: 1073741824
> >     actual size: 536870912
> >     fiemap: fm_mapped_extents = 131072
> >     time = 217239242 us
> >
> > Running it after applying the patchset:
> >
> >     *********** 256M ***********
> >
> >     size: 268435456
> >     actual size: 134217728
> >     fiemap: fm_mapped_extents = 32768
> >     time = 29475 us
> >
> >     size: 268435456
> >     actual size: 134217728
> >     fiemap: fm_mapped_extents = 32768
> >     time = 29307 us
> >
> >     *********** 512M ***********
> >
> >     size: 536870912
> >     actual size: 268435456
> >     fiemap: fm_mapped_extents = 65536
> >     time = 58996 us
> >
> >     size: 536870912
> >     actual size: 268435456
> >     fiemap: fm_mapped_extents = 65536
> >     time = 59115 us
> >
> >     *********** 1G ***********
> >
> >     size: 1073741824
> >     actual size: 536870912
> >     fiemap: fm_mapped_extents = 116251
> >     time = 124141 us
> >
> >     size: 1073741824
> >     actual size: 536870912
> >     fiemap: fm_mapped_extents = 131072
> >     time = 119387 us
> >
> > The speedup is massive, both on the first fiemap call and on the second
> > one as well, as his test creates files with many holes and small extents
> > (every extent follows a hole and precedes another hole).
> >
> > For the 256M file we go from 4 seconds down to 29 milliseconds in the
> > first run, and then from 4.9 seconds down to 29 milliseconds again in the
> > second run, a speedup of 138x and 169x, respectively.
> >
> > For the 512M file we go from 30.1 seconds down to 59 milliseconds in the
> > first run, and then from 33.5 seconds down to 59 milliseconds again in the
> > second run, a speedup of 510x and 568x, respectively.
> >
> > For the 1G file, we go from 225 seconds down to 124 milliseconds in the
> > first run, and then from 217 seconds down to 119 milliseconds in the
> > second run, a speedup of 1815x and 1824x, respectively.
> >
> > Reported-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
> > Link: https://lore.kernel.org/linux-btrfs/21dd32c6-f1f9-f44a-466a-e18fdc6788a7@virtuozzo.com/
> > Reported-by: Dominique MARTINET <dominique.martinet@atmark-techno.com>
> > Link: https://lore.kernel.org/linux-btrfs/Ysace25wh5BbLd5f@atmark-techno.com/
> > Signed-off-by: Filipe Manana <fdmanana@suse.com>
> > ---
> >  fs/btrfs/ctree.h     |   4 +-
> >  fs/btrfs/extent_io.c | 714 +++++++++++++++++++++++++++++--------------
> >  fs/btrfs/file.c      |  16 +-
> >  fs/btrfs/inode.c     | 140 +--------
> >  4 files changed, 506 insertions(+), 368 deletions(-)
> >
> > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> > index f7fe7f633eb5..7b266f9dc8b4 100644
> > --- a/fs/btrfs/ctree.h
> > +++ b/fs/btrfs/ctree.h
> > @@ -3402,8 +3402,6 @@ unsigned int btrfs_verify_data_csum(struct btrfs_bio *bbio,
> >                                   u64 start, u64 end);
> >  int btrfs_check_data_csum(struct inode *inode, struct btrfs_bio *bbio,
> >                         u32 bio_offset, struct page *page, u32 pgoff);
> > -struct extent_map *btrfs_get_extent_fiemap(struct btrfs_inode *inode,
> > -                                        u64 start, u64 len);
> >  noinline int can_nocow_extent(struct inode *inode, u64 offset, u64 *len,
> >                             u64 *orig_start, u64 *orig_block_len,
> >                             u64 *ram_bytes, bool strict);
> > @@ -3583,6 +3581,8 @@ int btrfs_fdatawrite_range(struct inode *inode, loff_t start, loff_t end);
> >  int btrfs_check_nocow_lock(struct btrfs_inode *inode, loff_t pos,
> >                          size_t *write_bytes);
> >  void btrfs_check_nocow_unlock(struct btrfs_inode *inode);
> > +bool btrfs_find_delalloc_in_range(struct btrfs_inode *inode, u64 start, u64 end,
> > +                               u64 *delalloc_start_ret, u64 *delalloc_end_ret);
> >
> >  /* tree-defrag.c */
> >  int btrfs_defrag_leaves(struct btrfs_trans_handle *trans,
> > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> > index 0e3fa9b08aaf..50bb2182e795 100644
> > --- a/fs/btrfs/extent_io.c
> > +++ b/fs/btrfs/extent_io.c
> > @@ -5353,42 +5353,6 @@ int try_release_extent_mapping(struct page *page, gfp_t mask)
> >       return try_release_extent_state(tree, page, mask);
> >  }
> >
> > -/*
> > - * helper function for fiemap, which doesn't want to see any holes.
> > - * This maps until we find something past 'last'
> > - */
> > -static struct extent_map *get_extent_skip_holes(struct btrfs_inode *inode,
> > -                                             u64 offset, u64 last)
> > -{
> > -     u64 sectorsize = btrfs_inode_sectorsize(inode);
> > -     struct extent_map *em;
> > -     u64 len;
> > -
> > -     if (offset >= last)
> > -             return NULL;
> > -
> > -     while (1) {
> > -             len = last - offset;
> > -             if (len == 0)
> > -                     break;
> > -             len = ALIGN(len, sectorsize);
> > -             em = btrfs_get_extent_fiemap(inode, offset, len);
> > -             if (IS_ERR(em))
> > -                     return em;
> > -
> > -             /* if this isn't a hole return it */
> > -             if (em->block_start != EXTENT_MAP_HOLE)
> > -                     return em;
> > -
> > -             /* this is a hole, advance to the next extent */
> > -             offset = extent_map_end(em);
> > -             free_extent_map(em);
> > -             if (offset >= last)
> > -                     break;
> > -     }
> > -     return NULL;
> > -}
> > -
> >  /*
> >   * To cache previous fiemap extent
> >   *
> > @@ -5418,6 +5382,9 @@ static int emit_fiemap_extent(struct fiemap_extent_info *fieinfo,
> >  {
> >       int ret = 0;
> >
> > +     /* Set at the end of extent_fiemap(). */
> > +     ASSERT((flags & FIEMAP_EXTENT_LAST) == 0);
> > +
> >       if (!cache->cached)
> >               goto assign;
> >
> > @@ -5446,11 +5413,10 @@ static int emit_fiemap_extent(struct fiemap_extent_info *fieinfo,
> >        */
> >       if (cache->offset + cache->len  == offset &&
> >           cache->phys + cache->len == phys  &&
> > -         (cache->flags & ~FIEMAP_EXTENT_LAST) ==
> > -                     (flags & ~FIEMAP_EXTENT_LAST)) {
> > +         cache->flags == flags) {
> >               cache->len += len;
> >               cache->flags |= flags;
> > -             goto try_submit_last;
> > +             return 0;
> >       }
> >
> >       /* Not mergeable, need to submit cached one */
> > @@ -5465,13 +5431,8 @@ static int emit_fiemap_extent(struct fiemap_extent_info *fieinfo,
> >       cache->phys = phys;
> >       cache->len = len;
> >       cache->flags = flags;
> > -try_submit_last:
> > -     if (cache->flags & FIEMAP_EXTENT_LAST) {
> > -             ret = fiemap_fill_next_extent(fieinfo, cache->offset,
> > -                             cache->phys, cache->len, cache->flags);
> > -             cache->cached = false;
> > -     }
> > -     return ret;
> > +
> > +     return 0;
> >  }
> >
> >  /*
> > @@ -5501,229 +5462,534 @@ static int emit_last_fiemap_cache(struct fiemap_extent_info *fieinfo,
> >       return ret;
> >  }
> >
> > -int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
> > -               u64 start, u64 len)
> > +static int fiemap_next_leaf_item(struct btrfs_inode *inode,
> > +                              struct btrfs_path *path)
> >  {
> > -     int ret = 0;
> > -     u64 off;
> > -     u64 max = start + len;
> > -     u32 flags = 0;
> > -     u32 found_type;
> > -     u64 last;
> > -     u64 last_for_get_extent = 0;
> > -     u64 disko = 0;
> > -     u64 isize = i_size_read(&inode->vfs_inode);
> > -     struct btrfs_key found_key;
> > -     struct extent_map *em = NULL;
> > -     struct extent_state *cached_state = NULL;
> > -     struct btrfs_path *path;
> > +     struct extent_buffer *clone;
> > +     struct btrfs_key key;
> > +     int slot;
> > +     int ret;
> > +
> > +     path->slots[0]++;
> > +     if (path->slots[0] < btrfs_header_nritems(path->nodes[0]))
> > +             return 0;
> > +
> > +     ret = btrfs_next_leaf(inode->root, path);
> > +     if (ret != 0)
> > +             return ret;
> > +
> > +     /*
> > +      * Don't bother with cloning if there are no more file extent items for
> > +      * our inode.
> > +      */
> > +     btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
> > +     if (key.objectid != btrfs_ino(inode) || key.type != BTRFS_EXTENT_DATA_KEY)
> > +             return 1;
> > +
> > +     /* See the comment at fiemap_search_slot() about why we clone. */
> > +     clone = btrfs_clone_extent_buffer(path->nodes[0]);
> > +     if (!clone)
> > +             return -ENOMEM;
> > +
> > +     slot = path->slots[0];
> > +     btrfs_release_path(path);
> > +     path->nodes[0] = clone;
> > +     path->slots[0] = slot;
> > +
> > +     return 0;
> > +}
> > +
> > +/*
> > + * Search for the first file extent item that starts at a given file offset or
> > + * the one that starts immediately before that offset.
> > + * Returns: 0 on success, < 0 on error, 1 if not found.
> > + */
> > +static int fiemap_search_slot(struct btrfs_inode *inode, struct btrfs_path *path,
> > +                           u64 file_offset)
> > +{
> > +     const u64 ino = btrfs_ino(inode);
> >       struct btrfs_root *root = inode->root;
> > -     struct fiemap_cache cache = { 0 };
> > -     struct btrfs_backref_shared_cache *backref_cache;
> > -     struct ulist *roots;
> > -     struct ulist *tmp_ulist;
> > -     int end = 0;
> > -     u64 em_start = 0;
> > -     u64 em_len = 0;
> > -     u64 em_end = 0;
> > +     struct extent_buffer *clone;
> > +     struct btrfs_key key;
> > +     int slot;
> > +     int ret;
> >
> > -     backref_cache = kzalloc(sizeof(*backref_cache), GFP_KERNEL);
> > -     path = btrfs_alloc_path();
> > -     roots = ulist_alloc(GFP_KERNEL);
> > -     tmp_ulist = ulist_alloc(GFP_KERNEL);
> > -     if (!backref_cache || !path || !roots || !tmp_ulist) {
> > -             ret = -ENOMEM;
> > -             goto out_free_ulist;
> > +     key.objectid = ino;
> > +     key.type = BTRFS_EXTENT_DATA_KEY;
> > +     key.offset = file_offset;
> > +
> > +     ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
> > +     if (ret < 0)
> > +             return ret;
> > +
> > +     if (ret > 0 && path->slots[0] > 0) {
> > +             btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0] - 1);
> > +             if (key.objectid == ino && key.type == BTRFS_EXTENT_DATA_KEY)
> > +                     path->slots[0]--;
> > +     }
> > +
> > +     if (path->slots[0] >= btrfs_header_nritems(path->nodes[0])) {
> > +             ret = btrfs_next_leaf(root, path);
> > +             if (ret != 0)
> > +                     return ret;
> > +
> > +             btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
> > +             if (key.objectid != ino || key.type != BTRFS_EXTENT_DATA_KEY)
> > +                     return 1;
> >       }
> >
> >       /*
> > -      * We can't initialize that to 'start' as this could miss extents due
> > -      * to extent item merging
> > +      * We clone the leaf and use it during fiemap. This is because while
> > +      * using the leaf we do expensive things like checking if an extent is
> > +      * shared, which can take a long time. In order to prevent blocking
> > +      * other tasks for too long, we use a clone of the leaf. We have locked
> > +      * the file range in the inode's io tree, so we know none of our file
> > +      * extent items can change. This way we avoid blocking other tasks that
> > +      * want to insert items for other inodes in the same leaf or b+tree
> > +      * rebalance operations (triggered for example when someone is trying
> > +      * to push items into this leaf when trying to insert an item in a
> > +      * neighbour leaf).
> > +      * We also need the private clone because holding a read lock on an
> > +      * extent buffer of the subvolume's b+tree will make lockdep unhappy
> > +      * when we call fiemap_fill_next_extent(), because that may cause a page
> > +      * fault when filling the user space buffer with fiemap data.
> >        */
> > -     off = 0;
> > -     start = round_down(start, btrfs_inode_sectorsize(inode));
> > -     len = round_up(max, btrfs_inode_sectorsize(inode)) - start;
> > +     clone = btrfs_clone_extent_buffer(path->nodes[0]);
> > +     if (!clone)
> > +             return -ENOMEM;
> > +
> > +     slot = path->slots[0];
> > +     btrfs_release_path(path);
> > +     path->nodes[0] = clone;
> > +     path->slots[0] = slot;
> > +
> > +     return 0;
> > +}
> > +
> > +/*
> > + * Process a range which is a hole or a prealloc extent in the inode's subvolume
> > + * btree. If @disk_bytenr is 0, we are dealing with a hole, otherwise a prealloc
> > + * extent. The end offset (@end) is inclusive.
> > + */
> > +static int fiemap_process_hole(struct btrfs_inode *inode,
> > +                            struct fiemap_extent_info *fieinfo,
> > +                            struct fiemap_cache *cache,
> > +                            struct btrfs_backref_shared_cache *backref_cache,
> > +                            u64 disk_bytenr, u64 extent_offset,
> > +                            u64 extent_gen,
> > +                            struct ulist *roots, struct ulist *tmp_ulist,
> > +                            u64 start, u64 end)
> > +{
> > +     const u64 i_size = i_size_read(&inode->vfs_inode);
> > +     const u64 ino = btrfs_ino(inode);
> > +     u64 cur_offset = start;
> > +     u64 last_delalloc_end = 0;
> > +     u32 prealloc_flags = FIEMAP_EXTENT_UNWRITTEN;
> > +     bool checked_extent_shared = false;
> > +     int ret;
> >
> >       /*
> > -      * lookup the last file extent.  We're not using i_size here
> > -      * because there might be preallocation past i_size
> > +      * There can be no delalloc past i_size, so don't waste time looking for
> > +      * it beyond i_size.
> >        */
> > -     ret = btrfs_lookup_file_extent(NULL, root, path, btrfs_ino(inode), -1,
> > -                                    0);
> > -     if (ret < 0) {
> > -             goto out_free_ulist;
> > -     } else {
> > -             WARN_ON(!ret);
> > -             if (ret == 1)
> > -                     ret = 0;
> > -     }
> > +     while (cur_offset < end && cur_offset < i_size) {
> > +             u64 delalloc_start;
> > +             u64 delalloc_end;
> > +             u64 prealloc_start;
> > +             u64 prealloc_len = 0;
> > +             bool delalloc;
> > +
> > +             delalloc = btrfs_find_delalloc_in_range(inode, cur_offset, end,
> > +                                                     &delalloc_start,
> > +                                                     &delalloc_end);
> > +             if (!delalloc)
> > +                     break;
> >
> > -     path->slots[0]--;
> > -     btrfs_item_key_to_cpu(path->nodes[0], &found_key, path->slots[0]);
> > -     found_type = found_key.type;
> > -
> > -     /* No extents, but there might be delalloc bits */
> > -     if (found_key.objectid != btrfs_ino(inode) ||
> > -         found_type != BTRFS_EXTENT_DATA_KEY) {
> > -             /* have to trust i_size as the end */
> > -             last = (u64)-1;
> > -             last_for_get_extent = isize;
> > -     } else {
> >               /*
> > -              * remember the start of the last extent.  There are a
> > -              * bunch of different factors that go into the length of the
> > -              * extent, so its much less complex to remember where it started
> > +              * If this is a prealloc extent we have to report every section
> > +              * of it that has no delalloc.
> >                */
> > -             last = found_key.offset;
> > -             last_for_get_extent = last + 1;
> > +             if (disk_bytenr != 0) {
> > +                     if (last_delalloc_end == 0) {
> > +                             prealloc_start = start;
> > +                             prealloc_len = delalloc_start - start;
> > +                     } else {
> > +                             prealloc_start = last_delalloc_end + 1;
> > +                             prealloc_len = delalloc_start - prealloc_start;
> > +                     }
> > +             }
> > +
> > +             if (prealloc_len > 0) {
> > +                     if (!checked_extent_shared && fieinfo->fi_extents_max) {
> > +                             ret = btrfs_is_data_extent_shared(inode->root,
> > +                                                       ino, disk_bytenr,
> > +                                                       extent_gen, roots,
> > +                                                       tmp_ulist,
> > +                                                       backref_cache);
> > +                             if (ret < 0)
> > +                                     return ret;
> > +                             else if (ret > 0)
> > +                                     prealloc_flags |= FIEMAP_EXTENT_SHARED;
> > +
> > +                             checked_extent_shared = true;
> > +                     }
> > +                     ret = emit_fiemap_extent(fieinfo, cache, prealloc_start,
> > +                                              disk_bytenr + extent_offset,
> > +                                              prealloc_len, prealloc_flags);
> > +                     if (ret)
> > +                             return ret;
> > +                     extent_offset += prealloc_len;
> > +             }
> > +
> > +             ret = emit_fiemap_extent(fieinfo, cache, delalloc_start, 0,
> > +                                      delalloc_end + 1 - delalloc_start,
> > +                                      FIEMAP_EXTENT_DELALLOC |
> > +                                      FIEMAP_EXTENT_UNKNOWN);
> > +             if (ret)
> > +                     return ret;
> > +
> > +             last_delalloc_end = delalloc_end;
> > +             cur_offset = delalloc_end + 1;
> > +             extent_offset += cur_offset - delalloc_start;
> > +             cond_resched();
> > +     }
> > +
> > +     /*
> > +      * Either we found no delalloc for the whole prealloc extent or we have
> > +      * a prealloc extent that spans i_size or starts at or after i_size.
> > +      */
> > +     if (disk_bytenr != 0 && last_delalloc_end < end) {
> > +             u64 prealloc_start;
> > +             u64 prealloc_len;
> > +
> > +             if (last_delalloc_end == 0) {
> > +                     prealloc_start = start;
> > +                     prealloc_len = end + 1 - start;
> > +             } else {
> > +                     prealloc_start = last_delalloc_end + 1;
> > +                     prealloc_len = end + 1 - prealloc_start;
> > +             }
> > +
> > +             if (!checked_extent_shared && fieinfo->fi_extents_max) {
> > +                     ret = btrfs_is_data_extent_shared(inode->root,
> > +                                                       ino, disk_bytenr,
> > +                                                       extent_gen, roots,
> > +                                                       tmp_ulist,
> > +                                                       backref_cache);
> > +                     if (ret < 0)
> > +                             return ret;
> > +                     else if (ret > 0)
> > +                             prealloc_flags |= FIEMAP_EXTENT_SHARED;
> > +             }
> > +             ret = emit_fiemap_extent(fieinfo, cache, prealloc_start,
> > +                                      disk_bytenr + extent_offset,
> > +                                      prealloc_len, prealloc_flags);
> > +             if (ret)
> > +                     return ret;
> > +     }
> > +
> > +     return 0;
> > +}
> > +
> > +static int fiemap_find_last_extent_offset(struct btrfs_inode *inode,
> > +                                       struct btrfs_path *path,
> > +                                       u64 *last_extent_end_ret)
> > +{
> > +     const u64 ino = btrfs_ino(inode);
> > +     struct btrfs_root *root = inode->root;
> > +     struct extent_buffer *leaf;
> > +     struct btrfs_file_extent_item *ei;
> > +     struct btrfs_key key;
> > +     u64 disk_bytenr;
> > +     int ret;
> > +
> > +     /*
> > +      * Lookup the last file extent. We're not using i_size here because
> > +      * there might be preallocation past i_size.
> > +      */
> > +     ret = btrfs_lookup_file_extent(NULL, root, path, ino, (u64)-1, 0);
> > +     /* There can't be a file extent item at offset (u64)-1 */
> > +     ASSERT(ret != 0);
> > +     if (ret < 0)
> > +             return ret;
> > +
> > +     /*
> > +      * For a non-existing key, btrfs_search_slot() always leaves us at a
> > +      * slot > 0, except if the btree is empty, which is impossible because
> > +      * at least it has the inode item for this inode and all the items for
> > +      * the root inode 256.
> > +      */
> > +     ASSERT(path->slots[0] > 0);
> > +     path->slots[0]--;
> > +     leaf = path->nodes[0];
> > +     btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
> > +     if (key.objectid != ino || key.type != BTRFS_EXTENT_DATA_KEY) {
> > +             /* No file extent items in the subvolume tree. */
> > +             *last_extent_end_ret = 0;
> > +             return 0;
> >       }
> > -     btrfs_release_path(path);
> >
> >       /*
> > -      * we might have some extents allocated but more delalloc past those
> > -      * extents.  so, we trust isize unless the start of the last extent is
> > -      * beyond isize
> > +      * For an inline extent, the disk_bytenr is where inline data starts at,
> > +      * so first check if we have an inline extent item before checking if we
> > +      * have an implicit hole (disk_bytenr == 0).
> >        */
> > -     if (last < isize) {
> > -             last = (u64)-1;
> > -             last_for_get_extent = isize;
> > +     ei = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_file_extent_item);
> > +     if (btrfs_file_extent_type(leaf, ei) == BTRFS_FILE_EXTENT_INLINE) {
> > +             *last_extent_end_ret = btrfs_file_extent_end(path);
> > +             return 0;
> >       }
> >
> > -     lock_extent_bits(&inode->io_tree, start, start + len - 1,
> > -                      &cached_state);
> > +     /*
> > +      * Find the last file extent item that is not a hole (when NO_HOLES is
> > +      * not enabled). This should take at most 2 iterations in the worst
> > +      * case: we have one hole file extent item at slot 0 of a leaf and
> > +      * another hole file extent item as the last item in the previous leaf.
> > +      * This is because we merge file extent items that represent holes.
> > +      */
> > +     disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, ei);
> > +     while (disk_bytenr == 0) {
> > +             ret = btrfs_previous_item(root, path, ino, BTRFS_EXTENT_DATA_KEY);
> > +             if (ret < 0) {
> > +                     return ret;
> > +             } else if (ret > 0) {
> > +                     /* No file extent items that are not holes. */
> > +                     *last_extent_end_ret = 0;
> > +                     return 0;
> > +             }
> > +             leaf = path->nodes[0];
> > +             ei = btrfs_item_ptr(leaf, path->slots[0],
> > +                                 struct btrfs_file_extent_item);
> > +             disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, ei);
> > +     }
> >
> > -     em = get_extent_skip_holes(inode, start, last_for_get_extent);
> > -     if (!em)
> > -             goto out;
> > -     if (IS_ERR(em)) {
> > -             ret = PTR_ERR(em);
> > +     *last_extent_end_ret = btrfs_file_extent_end(path);
> > +     return 0;
> > +}
> > +
> > +int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
> > +               u64 start, u64 len)
> > +{
> > +     const u64 ino = btrfs_ino(inode);
> > +     struct extent_state *cached_state = NULL;
> > +     struct btrfs_path *path;
> > +     struct btrfs_root *root = inode->root;
> > +     struct fiemap_cache cache = { 0 };
> > +     struct btrfs_backref_shared_cache *backref_cache;
> > +     struct ulist *roots;
> > +     struct ulist *tmp_ulist;
> > +     u64 last_extent_end;
> > +     u64 prev_extent_end;
> > +     u64 lockstart;
> > +     u64 lockend;
> > +     bool stopped = false;
> > +     int ret;
> > +
> > +     backref_cache = kzalloc(sizeof(*backref_cache), GFP_KERNEL);
> > +     path = btrfs_alloc_path();
> > +     roots = ulist_alloc(GFP_KERNEL);
> > +     tmp_ulist = ulist_alloc(GFP_KERNEL);
> > +     if (!backref_cache || !path || !roots || !tmp_ulist) {
> > +             ret = -ENOMEM;
> >               goto out;
> >       }
> >
> > -     while (!end) {
> > -             u64 offset_in_extent = 0;
> > +     lockstart = round_down(start, btrfs_inode_sectorsize(inode));
> > +     lockend = round_up(start + len, btrfs_inode_sectorsize(inode));
> > +     prev_extent_end = lockstart;
> >
> > -             /* break if the extent we found is outside the range */
> > -             if (em->start >= max || extent_map_end(em) < off)
> > -                     break;
> > +     lock_extent_bits(&inode->io_tree, lockstart, lockend, &cached_state);
> >
> > -             /*
> > -              * get_extent may return an extent that starts before our
> > -              * requested range.  We have to make sure the ranges
> > -              * we return to fiemap always move forward and don't
> > -              * overlap, so adjust the offsets here
> > -              */
> > -             em_start = max(em->start, off);
> > +     ret = fiemap_find_last_extent_offset(inode, path, &last_extent_end);
> > +     if (ret < 0)
> > +             goto out_unlock;
> > +     btrfs_release_path(path);
> >
> > +     path->reada = READA_FORWARD;
> > +     ret = fiemap_search_slot(inode, path, lockstart);
> > +     if (ret < 0) {
> > +             goto out_unlock;
> > +     } else if (ret > 0) {
> >               /*
> > -              * record the offset from the start of the extent
> > -              * for adjusting the disk offset below.  Only do this if the
> > -              * extent isn't compressed since our in ram offset may be past
> > -              * what we have actually allocated on disk.
> > +              * No file extent item found, but we may have delalloc between
> > +              * the current offset and i_size. So check for that.
> >                */
> > -             if (!test_bit(EXTENT_FLAG_COMPRESSED, &em->flags))
> > -                     offset_in_extent = em_start - em->start;
> > -             em_end = extent_map_end(em);
> > -             em_len = em_end - em_start;
> > -             flags = 0;
> > -             if (em->block_start < EXTENT_MAP_LAST_BYTE)
> > -                     disko = em->block_start + offset_in_extent;
> > -             else
> > -                     disko = 0;
> > +             ret = 0;
> > +             goto check_eof_delalloc;
> > +     }
> > +
> > +     while (prev_extent_end < lockend) {
> > +             struct extent_buffer *leaf = path->nodes[0];
> > +             struct btrfs_file_extent_item *ei;
> > +             struct btrfs_key key;
> > +             u64 extent_end;
> > +             u64 extent_len;
> > +             u64 extent_offset = 0;
> > +             u64 extent_gen;
> > +             u64 disk_bytenr = 0;
> > +             u64 flags = 0;
> > +             int extent_type;
> > +             u8 compression;
> > +
> > +             btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
> > +             if (key.objectid != ino || key.type != BTRFS_EXTENT_DATA_KEY)
> > +                     break;
> > +
> > +             extent_end = btrfs_file_extent_end(path);
> >
> >               /*
> > -              * bump off for our next call to get_extent
> > +              * The first iteration can leave us at an extent item that ends
> > +              * before our range's start. Move to the next item.
> >                */
> > -             off = extent_map_end(em);
> > -             if (off >= max)
> > -                     end = 1;
> > -
> > -             if (em->block_start == EXTENT_MAP_INLINE) {
> > -                     flags |= (FIEMAP_EXTENT_DATA_INLINE |
> > -                               FIEMAP_EXTENT_NOT_ALIGNED);
> > -             } else if (em->block_start == EXTENT_MAP_DELALLOC) {
> > -                     flags |= (FIEMAP_EXTENT_DELALLOC |
> > -                               FIEMAP_EXTENT_UNKNOWN);
> > -             } else if (fieinfo->fi_extents_max) {
> > -                     u64 extent_gen;
> > -                     u64 bytenr = em->block_start -
> > -                             (em->start - em->orig_start);
> > +             if (extent_end <= lockstart)
> > +                     goto next_item;
> >
> > -                     /*
> > -                      * If two extent maps are merged, then their generation
> > -                      * is set to the maximum between their generations.
> > -                      * Otherwise its generation matches the one we have in
> > -                      * corresponding file extent item. If we have a merged
> > -                      * extent map, don't use its generation to speedup the
> > -                      * sharedness check below.
> > -                      */
> > -                     if (test_bit(EXTENT_FLAG_MERGED, &em->flags))
> > -                             extent_gen = 0;
> > -                     else
> > -                             extent_gen = em->generation;
> > +             /* We have in implicit hole (NO_HOLES feature enabled). */
> > +             if (prev_extent_end < key.offset) {
> > +                     const u64 range_end = min(key.offset, lockend) - 1;
> >
> > -                     /*
> > -                      * As btrfs supports shared space, this information
> > -                      * can be exported to userspace tools via
> > -                      * flag FIEMAP_EXTENT_SHARED.  If fi_extents_max == 0
> > -                      * then we're just getting a count and we can skip the
> > -                      * lookup stuff.
> > -                      */
> > -                     ret = btrfs_is_data_extent_shared(root, btrfs_ino(inode),
> > -                                                       bytenr, extent_gen,
> > -                                                       roots, tmp_ulist,
> > -                                                       backref_cache);
> > -                     if (ret < 0)
> > -                             goto out_free;
> > -                     if (ret)
> > -                             flags |= FIEMAP_EXTENT_SHARED;
> > -                     ret = 0;
> > -             }
> > -             if (test_bit(EXTENT_FLAG_COMPRESSED, &em->flags))
> > -                     flags |= FIEMAP_EXTENT_ENCODED;
> > -             if (test_bit(EXTENT_FLAG_PREALLOC, &em->flags))
> > -                     flags |= FIEMAP_EXTENT_UNWRITTEN;
> > +                     ret = fiemap_process_hole(inode, fieinfo, &cache,
> > +                                               backref_cache, 0, 0, 0,
> > +                                               roots, tmp_ulist,
> > +                                               prev_extent_end, range_end);
> > +                     if (ret < 0) {
> > +                             goto out_unlock;
> > +                     } else if (ret > 0) {
> > +                             /* fiemap_fill_next_extent() told us to stop. */
> > +                             stopped = true;
> > +                             break;
> > +                     }
> >
> > -             free_extent_map(em);
> > -             em = NULL;
> > -             if ((em_start >= last) || em_len == (u64)-1 ||
> > -                (last == (u64)-1 && isize <= em_end)) {
> > -                     flags |= FIEMAP_EXTENT_LAST;
> > -                     end = 1;
> > +                     /* We've reached the end of the fiemap range, stop. */
> > +                     if (key.offset >= lockend) {
> > +                             stopped = true;
> > +                             break;
> > +                     }
> >               }
> >
> > -             /* now scan forward to see if this is really the last extent. */
> > -             em = get_extent_skip_holes(inode, off, last_for_get_extent);
> > -             if (IS_ERR(em)) {
> > -                     ret = PTR_ERR(em);
> > -                     goto out;
> > +             extent_len = extent_end - key.offset;
> > +             ei = btrfs_item_ptr(leaf, path->slots[0],
> > +                                 struct btrfs_file_extent_item);
> > +             compression = btrfs_file_extent_compression(leaf, ei);
> > +             extent_type = btrfs_file_extent_type(leaf, ei);
> > +             extent_gen = btrfs_file_extent_generation(leaf, ei);
> > +
> > +             if (extent_type != BTRFS_FILE_EXTENT_INLINE) {
> > +                     disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, ei);
> > +                     if (compression == BTRFS_COMPRESS_NONE)
> > +                             extent_offset = btrfs_file_extent_offset(leaf, ei);
> >               }
> > -             if (!em) {
> > -                     flags |= FIEMAP_EXTENT_LAST;
> > -                     end = 1;
> > +
> > +             if (compression != BTRFS_COMPRESS_NONE)
> > +                     flags |= FIEMAP_EXTENT_ENCODED;
> > +
> > +             if (extent_type == BTRFS_FILE_EXTENT_INLINE) {
> > +                     flags |= FIEMAP_EXTENT_DATA_INLINE;
> > +                     flags |= FIEMAP_EXTENT_NOT_ALIGNED;
> > +                     ret = emit_fiemap_extent(fieinfo, &cache, key.offset, 0,
> > +                                              extent_len, flags);
> > +             } else if (extent_type == BTRFS_FILE_EXTENT_PREALLOC) {
> > +                     ret = fiemap_process_hole(inode, fieinfo, &cache,
> > +                                               backref_cache,
> > +                                               disk_bytenr, extent_offset,
> > +                                               extent_gen, roots, tmp_ulist,
> > +                                               key.offset, extent_end - 1);
> > +             } else if (disk_bytenr == 0) {
> > +                     /* We have an explicit hole. */
> > +                     ret = fiemap_process_hole(inode, fieinfo, &cache,
> > +                                               backref_cache, 0, 0, 0,
> > +                                               roots, tmp_ulist,
> > +                                               key.offset, extent_end - 1);
> > +             } else {
> > +                     /* We have a regular extent. */
> > +                     if (fieinfo->fi_extents_max) {
> > +                             ret = btrfs_is_data_extent_shared(root, ino,
> > +                                                               disk_bytenr,
> > +                                                               extent_gen,
> > +                                                               roots,
> > +                                                               tmp_ulist,
> > +                                                               backref_cache);
> > +                             if (ret < 0)
> > +                                     goto out_unlock;
> > +                             else if (ret > 0)
> > +                                     flags |= FIEMAP_EXTENT_SHARED;
> > +                     }
> > +
> > +                     ret = emit_fiemap_extent(fieinfo, &cache, key.offset,
> > +                                              disk_bytenr + extent_offset,
> > +                                              extent_len, flags);
> >               }
> > -             ret = emit_fiemap_extent(fieinfo, &cache, em_start, disko,
> > -                                        em_len, flags);
> > -             if (ret) {
> > -                     if (ret == 1)
> > -                             ret = 0;
> > -                     goto out_free;
> > +
> > +             if (ret < 0) {
> > +                     goto out_unlock;
> > +             } else if (ret > 0) {
> > +                     /* fiemap_fill_next_extent() told us to stop. */
> > +                     stopped = true;
> > +                     break;
> >               }
> >
> > +             prev_extent_end = extent_end;
> > +next_item:
> >               if (fatal_signal_pending(current)) {
> >                       ret = -EINTR;
> > -                     goto out_free;
> > +                     goto out_unlock;
> >               }
> > +
> > +             ret = fiemap_next_leaf_item(inode, path);
> > +             if (ret < 0) {
> > +                     goto out_unlock;
> > +             } else if (ret > 0) {
> > +                     /* No more file extent items for this inode. */
> > +                     break;
> > +             }
> > +             cond_resched();
> >       }
> > -out_free:
> > -     if (!ret)
> > -             ret = emit_last_fiemap_cache(fieinfo, &cache);
> > -     free_extent_map(em);
> > -out:
> > -     unlock_extent_cached(&inode->io_tree, start, start + len - 1,
> > -                          &cached_state);
> >
> > -out_free_ulist:
> > +check_eof_delalloc:
> > +     /*
> > +      * Release (and free) the path before emitting any final entries to
> > +      * fiemap_fill_next_extent() to keep lockdep happy. This is because
> > +      * once we find no more file extent items exist, we may have a
> > +      * non-cloned leaf, and fiemap_fill_next_extent() can trigger page
> > +      * faults when copying data to the user space buffer.
> > +      */
> > +     btrfs_free_path(path);
> > +     path = NULL;
> > +
> > +     if (!stopped && prev_extent_end < lockend) {
> > +             ret = fiemap_process_hole(inode, fieinfo, &cache, backref_cache,
> > +                                       0, 0, 0, roots, tmp_ulist,
> > +                                       prev_extent_end, lockend - 1);
> > +             if (ret < 0)
> > +                     goto out_unlock;
> > +             prev_extent_end = lockend;
> > +     }
> > +
> > +     if (cache.cached && cache.offset + cache.len >= last_extent_end) {
> > +             const u64 i_size = i_size_read(&inode->vfs_inode);
> > +
> > +             if (prev_extent_end < i_size) {
> > +                     u64 delalloc_start;
> > +                     u64 delalloc_end;
> > +                     bool delalloc;
> > +
> > +                     delalloc = btrfs_find_delalloc_in_range(inode,
> > +                                                             prev_extent_end,
> > +                                                             i_size - 1,
> > +                                                             &delalloc_start,
> > +                                                             &delalloc_end);
> > +                     if (!delalloc)
> > +                             cache.flags |= FIEMAP_EXTENT_LAST;
> > +             } else {
> > +                     cache.flags |= FIEMAP_EXTENT_LAST;
> > +             }
> > +     }
> > +
> > +     ret = emit_last_fiemap_cache(fieinfo, &cache);
> > +
> > +out_unlock:
> > +     unlock_extent_cached(&inode->io_tree, lockstart, lockend, &cached_state);
> > +out:
> >       kfree(backref_cache);
> >       btrfs_free_path(path);
> >       ulist_free(roots);
> > diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> > index b292a8ada3a4..636b3ec46184 100644
> > --- a/fs/btrfs/file.c
> > +++ b/fs/btrfs/file.c
> > @@ -3602,10 +3602,10 @@ static long btrfs_fallocate(struct file *file, int mode,
> >  }
> >
> >  /*
> > - * Helper for have_delalloc_in_range(). Find a subrange in a given range that
> > - * has unflushed and/or flushing delalloc. There might be other adjacent
> > - * subranges after the one it found, so have_delalloc_in_range() keeps looping
> > - * while it gets adjacent subranges, and merging them together.
> > + * Helper for btrfs_find_delalloc_in_range(). Find a subrange in a given range
> > + * that has unflushed and/or flushing delalloc. There might be other adjacent
> > + * subranges after the one it found, so btrfs_find_delalloc_in_range() keeps
> > + * looping while it gets adjacent subranges, and merging them together.
> >   */
> >  static bool find_delalloc_subrange(struct btrfs_inode *inode, u64 start, u64 end,
> >                                  u64 *delalloc_start_ret, u64 *delalloc_end_ret)
> > @@ -3740,8 +3740,8 @@ static bool find_delalloc_subrange(struct btrfs_inode *inode, u64 start, u64 end
> >   * if so it sets @delalloc_start_ret and @delalloc_end_ret with the start and
> >   * end offsets of the subrange.
> >   */
> > -static bool have_delalloc_in_range(struct btrfs_inode *inode, u64 start, u64 end,
> > -                                u64 *delalloc_start_ret, u64 *delalloc_end_ret)
> > +bool btrfs_find_delalloc_in_range(struct btrfs_inode *inode, u64 start, u64 end,
> > +                               u64 *delalloc_start_ret, u64 *delalloc_end_ret)
> >  {
> >       u64 cur_offset = round_down(start, inode->root->fs_info->sectorsize);
> >       u64 prev_delalloc_end = 0;
> > @@ -3804,8 +3804,8 @@ static bool find_desired_extent_in_hole(struct btrfs_inode *inode, int whence,
> >       u64 delalloc_end;
> >       bool delalloc;
> >
> > -     delalloc = have_delalloc_in_range(inode, start, end, &delalloc_start,
> > -                                       &delalloc_end);
> > +     delalloc = btrfs_find_delalloc_in_range(inode, start, end,
> > +                                             &delalloc_start, &delalloc_end);
> >       if (delalloc && whence == SEEK_DATA) {
> >               *start_ret = delalloc_start;
> >               return true;
> > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> > index 2c7d31990777..8be1e021513a 100644
> > --- a/fs/btrfs/inode.c
> > +++ b/fs/btrfs/inode.c
> > @@ -7064,133 +7064,6 @@ struct extent_map *btrfs_get_extent(struct btrfs_inode *inode,
> >       return em;
> >  }
> >
> > -struct extent_map *btrfs_get_extent_fiemap(struct btrfs_inode *inode,
> > -                                        u64 start, u64 len)
> > -{
> > -     struct extent_map *em;
> > -     struct extent_map *hole_em = NULL;
> > -     u64 delalloc_start = start;
> > -     u64 end;
> > -     u64 delalloc_len;
> > -     u64 delalloc_end;
> > -     int err = 0;
> > -
> > -     em = btrfs_get_extent(inode, NULL, 0, start, len);
> > -     if (IS_ERR(em))
> > -             return em;
> > -     /*
> > -      * If our em maps to:
> > -      * - a hole or
> > -      * - a pre-alloc extent,
> > -      * there might actually be delalloc bytes behind it.
> > -      */
> > -     if (em->block_start != EXTENT_MAP_HOLE &&
> > -         !test_bit(EXTENT_FLAG_PREALLOC, &em->flags))
> > -             return em;
> > -     else
> > -             hole_em = em;
> > -
> > -     /* check to see if we've wrapped (len == -1 or similar) */
> > -     end = start + len;
> > -     if (end < start)
> > -             end = (u64)-1;
> > -     else
> > -             end -= 1;
> > -
> > -     em = NULL;
> > -
> > -     /* ok, we didn't find anything, lets look for delalloc */
> > -     delalloc_len = count_range_bits(&inode->io_tree, &delalloc_start,
> > -                              end, len, EXTENT_DELALLOC, 1);
> > -     delalloc_end = delalloc_start + delalloc_len;
> > -     if (delalloc_end < delalloc_start)
> > -             delalloc_end = (u64)-1;
> > -
> > -     /*
> > -      * We didn't find anything useful, return the original results from
> > -      * get_extent()
> > -      */
> > -     if (delalloc_start > end || delalloc_end <= start) {
> > -             em = hole_em;
> > -             hole_em = NULL;
> > -             goto out;
> > -     }
> > -
> > -     /*
> > -      * Adjust the delalloc_start to make sure it doesn't go backwards from
> > -      * the start they passed in
> > -      */
> > -     delalloc_start = max(start, delalloc_start);
> > -     delalloc_len = delalloc_end - delalloc_start;
> > -
> > -     if (delalloc_len > 0) {
> > -             u64 hole_start;
> > -             u64 hole_len;
> > -             const u64 hole_end = extent_map_end(hole_em);
> > -
> > -             em = alloc_extent_map();
> > -             if (!em) {
> > -                     err = -ENOMEM;
> > -                     goto out;
> > -             }
> > -
> > -             ASSERT(hole_em);
> > -             /*
> > -              * When btrfs_get_extent can't find anything it returns one
> > -              * huge hole
> > -              *
> > -              * Make sure what it found really fits our range, and adjust to
> > -              * make sure it is based on the start from the caller
> > -              */
> > -             if (hole_end <= start || hole_em->start > end) {
> > -                    free_extent_map(hole_em);
> > -                    hole_em = NULL;
> > -             } else {
> > -                    hole_start = max(hole_em->start, start);
> > -                    hole_len = hole_end - hole_start;
> > -             }
> > -
> > -             if (hole_em && delalloc_start > hole_start) {
> > -                     /*
> > -                      * Our hole starts before our delalloc, so we have to
> > -                      * return just the parts of the hole that go until the
> > -                      * delalloc starts
> > -                      */
> > -                     em->len = min(hole_len, delalloc_start - hole_start);
> > -                     em->start = hole_start;
> > -                     em->orig_start = hole_start;
> > -                     /*
> > -                      * Don't adjust block start at all, it is fixed at
> > -                      * EXTENT_MAP_HOLE
> > -                      */
> > -                     em->block_start = hole_em->block_start;
> > -                     em->block_len = hole_len;
> > -                     if (test_bit(EXTENT_FLAG_PREALLOC, &hole_em->flags))
> > -                             set_bit(EXTENT_FLAG_PREALLOC, &em->flags);
> > -             } else {
> > -                     /*
> > -                      * Hole is out of passed range or it starts after
> > -                      * delalloc range
> > -                      */
> > -                     em->start = delalloc_start;
> > -                     em->len = delalloc_len;
> > -                     em->orig_start = delalloc_start;
> > -                     em->block_start = EXTENT_MAP_DELALLOC;
> > -                     em->block_len = delalloc_len;
> > -             }
> > -     } else {
> > -             return hole_em;
> > -     }
> > -out:
> > -
> > -     free_extent_map(hole_em);
> > -     if (err) {
> > -             free_extent_map(em);
> > -             return ERR_PTR(err);
> > -     }
> > -     return em;
> > -}
> > -
> >  static struct extent_map *btrfs_create_dio_extent(struct btrfs_inode *inode,
> >                                                 const u64 start,
> >                                                 const u64 len,
> > @@ -8265,15 +8138,14 @@ static int btrfs_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
> >        * in the compression of data (in an async thread) and will return
> >        * before the compression is done and writeback is started. A second
> >        * filemap_fdatawrite_range() is needed to wait for the compression to
> > -      * complete and writeback to start. Without this, our user is very
> > -      * likely to get stale results, because the extents and extent maps for
> > -      * delalloc regions are only allocated when writeback starts.
> > +      * complete and writeback to start. We also need to wait for ordered
> > +      * extents to complete, because our fiemap implementation uses mainly
> > +      * file extent items to list the extents, searching for extent maps
> > +      * only for file ranges with holes or prealloc extents to figure out
> > +      * if we have delalloc in those ranges.
> >        */
> >       if (fieinfo->fi_flags & FIEMAP_FLAG_SYNC) {
> > -             ret = btrfs_fdatawrite_range(inode, 0, LLONG_MAX);
> > -             if (ret)
> > -                     return ret;
> > -             ret = filemap_fdatawait_range(inode->i_mapping, 0, LLONG_MAX);
> > +             ret = btrfs_wait_ordered_range(inode, 0, LLONG_MAX);
> >               if (ret)
> >                       return ret;
> >       }
>
> Hmm this bit should be in "btrfs: properly flush delalloc when entering fiemap"
> instead.  Thanks,

Nope, the change is done here for a good reason: before this change we
only needed to wait for writeback to complete (actually just for it to
start, which is when the new extent maps are created), so that's why
that other patch only waits for writeback to complete, just like the
generic code does.

After this change we need to wait for ordered extents to complete,
since we now use the file extent items to get extent information for
fiemap - that's why the change is in this patch.
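
For reference only, and not something this patch adds: the path under
discussion is what userspace triggers by setting FIEMAP_FLAG_SYNC in the
FIEMAP ioctl. A minimal caller sketch (error handling trimmed; the
extent count of 128 is arbitrary, the names come from <linux/fiemap.h>
and <linux/fs.h>):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

int main(int argc, char **argv)
{
	const unsigned int max_extents = 128;	/* arbitrary for the example */
	struct fiemap *fm;
	unsigned int i;
	int fd;

	if (argc != 2)
		return 1;
	fd = open(argv[1], O_RDONLY);
	if (fd < 0)
		return 1;

	fm = calloc(1, sizeof(*fm) + max_extents * sizeof(struct fiemap_extent));
	if (!fm)
		return 1;
	fm->fm_start = 0;
	fm->fm_length = FIEMAP_MAX_OFFSET;	/* map the whole file */
	fm->fm_flags = FIEMAP_FLAG_SYNC;	/* flush delalloc before mapping */
	fm->fm_extent_count = max_extents;

	if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0)
		return 1;

	for (i = 0; i < fm->fm_mapped_extents; i++)
		printf("logical %llu physical %llu len %llu flags 0x%x\n",
		       (unsigned long long)fm->fm_extents[i].fe_logical,
		       (unsigned long long)fm->fm_extents[i].fe_physical,
		       (unsigned long long)fm->fm_extents[i].fe_length,
		       fm->fm_extents[i].fe_flags);

	return 0;
}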

Thanks.

>
> Josef

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 01/10] btrfs: allow hole and data seeking to be interruptible
  2022-09-01 13:18 ` [PATCH 01/10] btrfs: allow hole and data seeking to be interruptible fdmanana
  2022-09-01 13:58   ` Josef Bacik
@ 2022-09-01 21:49   ` Qu Wenruo
  1 sibling, 0 replies; 53+ messages in thread
From: Qu Wenruo @ 2022-09-01 21:49 UTC (permalink / raw)
  To: fdmanana, linux-btrfs



On 2022/9/1 21:18, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
>
> Doing hole or data seeking on a file with a very large number of extents
> can take a long time, and we have reports of it being too slow (such as
> at LSFMM from 2017, see the Link below). So make it interruptible.
>
> Link: https://lwn.net/Articles/718805/
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

Reviewed-by: Qu Wenruo <wqu@suse.com>

Thanks,
Qu
> ---
>   fs/btrfs/file.c | 4 ++++
>   1 file changed, 4 insertions(+)
>
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 0a76ae8b8e96..96f444ad0951 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -3652,6 +3652,10 @@ static loff_t find_desired_extent(struct btrfs_inode *inode, loff_t offset,
>   		start = em->start + em->len;
>   		free_extent_map(em);
>   		em = NULL;
> +		if (fatal_signal_pending(current)) {
> +			ret = -EINTR;
> +			break;
> +		}
>   		cond_resched();
>   	}
>   	free_extent_map(em);

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 02/10] btrfs: make hole and data seeking a lot more efficient
  2022-09-01 13:18 ` [PATCH 02/10] btrfs: make hole and data seeking a lot more efficient fdmanana
  2022-09-01 14:03   ` Josef Bacik
@ 2022-09-01 22:18   ` Qu Wenruo
  2022-09-02  8:36     ` Filipe Manana
  2022-09-11 22:12   ` Qu Wenruo
  2 siblings, 1 reply; 53+ messages in thread
From: Qu Wenruo @ 2022-09-01 22:18 UTC (permalink / raw)
  To: fdmanana, linux-btrfs



On 2022/9/1 21:18, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
>
> The current implementation of hole and data seeking for llseek does not
> scale well in regards to the number of extents and the distance between
> the start offset and the next hole or extent. This is due to a very high
> algorithmic complexity. Often we also get reports of btrfs' hole and data
> seeking (llseek) being too slow, such as at 2017's LSFMM (see the Link
> tag at the bottom).
>
> In order to better understand it, let's consider the case where the start
> offset is 0, we are seeking for a hole and the file size is 16G. Between
> file offset 0 and the first hole in the file there are 100K extents - this
> is common for large files, especially if we have compression enabled, since
> the maximum extent size is limited to 128K. The steps taken by the main
> loop of the current algorithm are the following:
>
> 1) We start by calling btrfs_get_extent_fiemap(), for file offset 0, which
>     calls btrfs_get_extent(). This will first lookup for an extent map in
>     the inode's extent map tree (a red black tree). If the extent map is
>     not loaded in memory, then it will do a lookup for the corresponding
>     file extent item in the subvolume's b+tree, create an extent map based
>     on the contents of the file extent item and then add the extent map to
>     the extent map tree of the inode;
>
> 2) The second iteration calls btrfs_get_extent_fiemap() again, this time
>     with a start offset matching the end offset of the previous extent.
>     Again, btrfs_get_extent() will first search the extent map tree, and
>     if it doesn't find an extent map there, it will again search in the
>     b+tree of the subvolume for a matching file extent item, build an
>     extent map based on the file extent item, and add the extent map to
>     to the extent map tree of the inode;
>
> 3) This repeats over and over until we find the first hole (when seeking
>     for holes) or until we find the first extent (when seeking for data).
>
>     If there are no extent maps loaded in memory, then on
>     each iteration we do 1 extent map tree search, 1 b+tree search, plus
>     1 more extent map tree traversal to insert an extent map - plus we
>     allocate memory for the extent map.

I'm a little interested in whether we have other workloads that rely
heavily on extent map tree searches?

If so, would it make sense to load a batch of file extent items into the
extent map tree in one go?


Another thing, not really related to the patchset: since extent maps
don't get freed up unless the inode is evicted or truncated, I'm
wondering whether it would be a problem for heavily fragmented files to
take up too much memory just for the extent map tree?

Would we need a way to drop extent maps in the future?

>
>     On each iteration we are growing the size of the extent map tree,
>     making each future search slower, and also visiting the same b+tree
>     leaves over and over again - taking into account with the default leaf
>     size of 16K we can fit more than 200 file extent items in a leaf - so
>     we can visit the same b+tree leaf 200+ times, on each visit walking
>     down a path from the root to the leaf.
>
> So it's easy to see that what we have now doesn't scale well. Also, it
> loads an extent map for every file extent item into memory, which is not
> efficient - we should add extent maps only when doing IO (writing or
> reading file data).
>
> This change implements a new algorithm which scales much better, and
> works like this:
>
> 1) We iterate over the subvolume's b+tree, visiting each leaf that has
>     file extent items once and only once;
>
> 2) For any file extent items found, that don't represent holes or prealloc
>     extents, it will not search the extent map tree - there's no need at
>     all for that - an extent map is just an in-memory representation of a
>     file extent item;
>
> 3) When a hole is found, or a prealloc extent, it will check if there's
>     delalloc for its range. For this it will search for EXTENT_DELALLOC
>     bits in the inode's io tree and check the extent map tree - this is
>     for accounting for unflushed delalloc and for flushed delalloc (the
>     period between running delalloc and ordered extent completion),
>     respectively. This is similar to what the current implementation does
>     when it finds a hole or prealloc extent, but without creating extent
>     maps and adding them to the extent map tree in case they are not
>     loaded in memory;

Would it be possible, before starting the subvolume tree search, to just
run all delalloc of that target inode and prevent new writes, so we can
forget about the delalloc situation completely?
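
Just to make the first half of that concrete, a hypothetical sketch (not
something this patch does) of flushing the inode's delalloc before the
tree search, reusing a flush-and-wait helper btrfs already has:

	/*
	 * Hypothetical: flush this inode's delalloc and wait for the
	 * resulting ordered extents before the subvolume tree search, so
	 * that every written range is already represented by a file
	 * extent item.
	 */
	ret = btrfs_wait_ordered_range(&inode->vfs_inode, lockstart,
				       lockend + 1 - lockstart);
	if (ret)
		return ret;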

>
> 4) It never allocates extent maps, or adds extent maps to the inode's
>     extent map tree. This not only saves memory and time (from the tree
>     insertions and allocations), but also eliminates the possibility of
>     -ENOMEM due to allocating too many extent maps.
>
> Part of this new code will also be used later for fiemap (which also
> suffers similar scalability problems).
>
> The following test example can be used to quickly measure the efficiency
> before and after this patch:
>
>      $ cat test-seek-hole.sh
>      #!/bin/bash
>
>      DEV=/dev/sdi
>      MNT=/mnt/sdi
>
>      mkfs.btrfs -f $DEV
>
>      mount -o compress=lzo $DEV $MNT
>
>      # 16G file -> 131073 compressed extents.
>      xfs_io -f -c "pwrite -S 0xab -b 1M 0 16G" $MNT/foobar
>
>      # Leave a 1M hole at file offset 15G.
>      xfs_io -c "fpunch 15G 1M" $MNT/foobar
>
>      # Unmount and mount again, so that we can test when there's no
>      # metadata cached in memory.
>      umount $MNT
>      mount -o compress=lzo $DEV $MNT
>
>      # Test seeking for hole from offset 0 (hole is at offset 15G).
>
>      start=$(date +%s%N)
>      xfs_io -c "seek -h 0" $MNT/foobar
>      end=$(date +%s%N)
>      dur=$(( (end - start) / 1000000 ))
>      echo "Took $dur milliseconds to seek first hole (metadata not cached)"
>      echo
>
>      start=$(date +%s%N)
>      xfs_io -c "seek -h 0" $MNT/foobar
>      end=$(date +%s%N)
>      dur=$(( (end - start) / 1000000 ))
>      echo "Took $dur milliseconds to seek first hole (metadata cached)"
>      echo
>
>      umount $MNT
>
> Before this change:
>
>      $ ./test-seek-hole.sh
>      (...)
>      Whence	Result
>      HOLE	16106127360
>      Took 176 milliseconds to seek first hole (metadata not cached)
>
>      Whence	Result
>      HOLE	16106127360
>      Took 17 milliseconds to seek first hole (metadata cached)
>
> After this change:
>
>      $ ./test-seek-hole.sh
>      (...)
>      Whence	Result
>      HOLE	16106127360
>      Took 43 milliseconds to seek first hole (metadata not cached)
>
>      Whence	Result
>      HOLE	16106127360
>      Took 13 milliseconds to seek first hole (metadata cached)
>
> That's about 4X faster when no metadata is cached and about 30% faster
> when all metadata is cached.
>
> In practice the differences may often be significantly higher, either due
> to a higher number of extents in a file or because the subvolume's b+tree
> is much bigger than in this example, where we only have one file.
>
> Link: https://lwn.net/Articles/718805/
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
>   fs/btrfs/file.c | 437 ++++++++++++++++++++++++++++++++++++++++++++----
>   1 file changed, 406 insertions(+), 31 deletions(-)
>
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 96f444ad0951..b292a8ada3a4 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -3601,22 +3601,281 @@ static long btrfs_fallocate(struct file *file, int mode,
>   	return ret;
>   }
>
> +/*
> + * Helper for have_delalloc_in_range(). Find a subrange in a given range that
> + * has unflushed and/or flushing delalloc. There might be other adjacent
> + * subranges after the one it found, so have_delalloc_in_range() keeps looping
> + * while it gets adjacent subranges, and merging them together.
> + */
> +static bool find_delalloc_subrange(struct btrfs_inode *inode, u64 start, u64 end,
> +				   u64 *delalloc_start_ret, u64 *delalloc_end_ret)
> +{
> +	const u64 len = end + 1 - start;
> +	struct extent_map_tree *em_tree = &inode->extent_tree;
> +	struct extent_map *em;
> +	u64 em_end;
> +	u64 delalloc_len;
> +
> +	/*
> +	 * Search the io tree first for EXTENT_DELALLOC. If we find any, it
> +	 * means we have delalloc (dirty pages) for which writeback has not
> +	 * started yet.
> +	 */
> +	*delalloc_start_ret = start;
> +	delalloc_len = count_range_bits(&inode->io_tree, delalloc_start_ret, end,
> +					len, EXTENT_DELALLOC, 1);
> +	/*
> +	 * If delalloc was found then *delalloc_start_ret has a sector size
> +	 * aligned value (rounded down).
> +	 */
> +	if (delalloc_len > 0)
> +		*delalloc_end_ret = *delalloc_start_ret + delalloc_len - 1;
> +
> +	/*
> +	 * Now also check if there's any extent map in the range that does not
> +	 * map to a hole or prealloc extent. We do this because:
> +	 *
> +	 * 1) When delalloc is flushed, the file range is locked, we clear the
> +	 *    EXTENT_DELALLOC bit from the io tree and create an extent map for
> +	 *    an allocated extent. So we might just have been called after
> +	 *    delalloc is flushed and before the ordered extent completes and
> +	 *    inserts the new file extent item in the subvolume's btree;
> +	 *
> +	 * 2) We may have an extent map created by flushing delalloc for a
> +	 *    subrange that starts before the subrange we found marked with
> +	 *    EXTENT_DELALLOC in the io tree.
> +	 */
> +	read_lock(&em_tree->lock);
> +	em = lookup_extent_mapping(em_tree, start, len);
> +	read_unlock(&em_tree->lock);
> +
> +	/* extent_map_end() returns a non-inclusive end offset. */
> +	em_end = em ? extent_map_end(em) : 0;
> +
> +	/*
> +	 * If we have a hole/prealloc extent map, check the next one if this one
> +	 * ends before our range's end.
> +	 */
> +	if (em && (em->block_start == EXTENT_MAP_HOLE ||
> +		   test_bit(EXTENT_FLAG_PREALLOC, &em->flags)) && em_end < end) {
> +		struct extent_map *next_em;
> +
> +		read_lock(&em_tree->lock);
> +		next_em = lookup_extent_mapping(em_tree, em_end, len - em_end);
> +		read_unlock(&em_tree->lock);
> +
> +		free_extent_map(em);
> +		em_end = next_em ? extent_map_end(next_em) : 0;
> +		em = next_em;
> +	}
> +
> +	if (em && (em->block_start == EXTENT_MAP_HOLE ||
> +		   test_bit(EXTENT_FLAG_PREALLOC, &em->flags))) {
> +		free_extent_map(em);
> +		em = NULL;
> +	}
> +
> +	/*
> +	 * No extent map or one for a hole or prealloc extent. Use the delalloc
> +	 * range we found in the io tree if we have one.
> +	 */
> +	if (!em)
> +		return (delalloc_len > 0);
> +
> +	/*
> +	 * We don't have any range as EXTENT_DELALLOC in the io tree, so the
> +	 * extent map is the only subrange representing delalloc.
> +	 */
> +	if (delalloc_len == 0) {
> +		*delalloc_start_ret = em->start;
> +		*delalloc_end_ret = min(end, em_end - 1);
> +		free_extent_map(em);
> +		return true;
> +	}
> +
> +	/*
> +	 * The extent map represents a delalloc range that starts before the
> +	 * delalloc range we found in the io tree.
> +	 */
> +	if (em->start < *delalloc_start_ret) {
> +		*delalloc_start_ret = em->start;
> +		/*
> +		 * If the ranges are adjacent, return a combined range.
> +		 * Otherwise return the extent map's range.
> +		 */
> +		if (em_end < *delalloc_start_ret)
> +			*delalloc_end_ret = min(end, em_end - 1);
> +
> +		free_extent_map(em);
> +		return true;
> +	}
> +
> +	/*
> +	 * The extent map starts after the delalloc range we found in the io
> +	 * tree. If it's adjacent, return a combined range, otherwise return
> +	 * the range found in the io tree.
> +	 */
> +	if (*delalloc_end_ret + 1 == em->start)
> +		*delalloc_end_ret = min(end, em_end - 1);
> +
> +	free_extent_map(em);
> +	return true;
> +}
> +
> +/*
> + * Check if there's delalloc in a given range.
> + *
> + * @inode:               The inode.
> + * @start:               The start offset of the range. It does not need to be
> + *                       sector size aligned.
> + * @end:                 The end offset (inclusive value) of the search range.
> + *                       It does not need to be sector size aligned.
> + * @delalloc_start_ret:  Output argument, set to the start offset of the
> + *                       subrange found with delalloc (may not be sector size
> + *                       aligned).
> + * @delalloc_end_ret:    Output argument, set to he end offset (inclusive value)
> + *                       of the subrange found with delalloc.
> + *
> + * Returns true if a subrange with delalloc is found within the given range, and
> + * if so it sets @delalloc_start_ret and @delalloc_end_ret with the start and
> + * end offsets of the subrange.
> + */
> +static bool have_delalloc_in_range(struct btrfs_inode *inode, u64 start, u64 end,
> +				   u64 *delalloc_start_ret, u64 *delalloc_end_ret)
> +{
> +	u64 cur_offset = round_down(start, inode->root->fs_info->sectorsize);
> +	u64 prev_delalloc_end = 0;
> +	bool ret = false;
> +
> +	while (cur_offset < end) {
> +		u64 delalloc_start;
> +		u64 delalloc_end;
> +		bool delalloc;
> +
> +		delalloc = find_delalloc_subrange(inode, cur_offset, end,
> +						  &delalloc_start,
> +						  &delalloc_end);
> +		if (!delalloc)
> +			break;
> +
> +		if (prev_delalloc_end == 0) {
> +			/* First subrange found. */
> +			*delalloc_start_ret = max(delalloc_start, start);
> +			*delalloc_end_ret = delalloc_end;
> +			ret = true;
> +		} else if (delalloc_start == prev_delalloc_end + 1) {
> +			/* Subrange adjacent to the previous one, merge them. */
> +			*delalloc_end_ret = delalloc_end;
> +		} else {
> +			/* Subrange not adjacent to the previous one, exit. */
> +			break;
> +		}
> +
> +		prev_delalloc_end = delalloc_end;
> +		cur_offset = delalloc_end + 1;
> +		cond_resched();
> +	}
> +
> +	return ret;
> +}
> +
> +/*
> + * Check if there's a hole or delalloc range in a range representing a hole (or
> + * prealloc extent) found in the inode's subvolume btree.
> + *
> + * @inode:      The inode.
> + * @whence:     Seek mode (SEEK_DATA or SEEK_HOLE).
> + * @start:      Start offset of the hole region. It does not need to be sector
> + *              size aligned.
> + * @end:        End offset (inclusive value) of the hole region. It does not
> + *              need to be sector size aligned.
> + * @start_ret:  Return parameter, used to set the start of the subrange in the
> + *              hole that matches the search criteria (seek mode), if such
> + *              subrange is found (return value of the function is true).
> + *              The value returned here may not be sector size aligned.
> + *
> + * Returns true if a subrange matching the given seek mode is found, and if one
> + * is found, it updates @start_ret with the start of the subrange.
> + */
> +static bool find_desired_extent_in_hole(struct btrfs_inode *inode, int whence,
> +					u64 start, u64 end, u64 *start_ret)
> +{
> +	u64 delalloc_start;
> +	u64 delalloc_end;
> +	bool delalloc;
> +
> +	delalloc = have_delalloc_in_range(inode, start, end, &delalloc_start,
> +					  &delalloc_end);
> +	if (delalloc && whence == SEEK_DATA) {
> +		*start_ret = delalloc_start;
> +		return true;
> +	}
> +
> +	if (delalloc && whence == SEEK_HOLE) {
> +		/*
> +		 * We found delalloc but it starts after our start offset. So we
> +		 * have a hole between our start offset and the delalloc start.
> +		 */
> +		if (start < delalloc_start) {
> +			*start_ret = start;
> +			return true;
> +		}
> +		/*
> +		 * Delalloc range starts at our start offset.
> +		 * If the delalloc range's length is smaller than our range,
> +		 * then it means we have a hole that starts where the delalloc
> +		 * subrange ends.
> +		 */
> +		if (delalloc_end < end) {
> +			*start_ret = delalloc_end + 1;
> +			return true;
> +		}
> +
> +		/* There's delalloc for the whole range. */
> +		return false;
> +	}
> +
> +	if (!delalloc && whence == SEEK_HOLE) {
> +		*start_ret = start;
> +		return true;
> +	}
> +
> +	/*
> +	 * No delalloc in the range and we are seeking for data. The caller has
> +	 * to iterate to the next extent item in the subvolume btree.
> +	 */
> +	return false;
> +}
> +
>   static loff_t find_desired_extent(struct btrfs_inode *inode, loff_t offset,
>   				  int whence)
>   {
>   	struct btrfs_fs_info *fs_info = inode->root->fs_info;
> -	struct extent_map *em = NULL;
>   	struct extent_state *cached_state = NULL;
> -	loff_t i_size = inode->vfs_inode.i_size;
> +	const loff_t i_size = i_size_read(&inode->vfs_inode);
> +	const u64 ino = btrfs_ino(inode);
> +	struct btrfs_root *root = inode->root;
> +	struct btrfs_path *path;
> +	struct btrfs_key key;
> +	u64 last_extent_end;
>   	u64 lockstart;
>   	u64 lockend;
>   	u64 start;
> -	u64 len;
> -	int ret = 0;
> +	int ret;
> +	bool found = false;
>
>   	if (i_size == 0 || offset >= i_size)
>   		return -ENXIO;
>
> +	/*
> +	 * Quick path. If the inode has no prealloc extents and its number of
> +	 * bytes used matches its i_size, then it can not have holes.
> +	 */
> +	if (whence == SEEK_HOLE &&
> +	    !(inode->flags & BTRFS_INODE_PREALLOC) &&
> +	    inode_get_bytes(&inode->vfs_inode) == i_size)
> +		return i_size;
> +

Would we need a counterpart quick path for the all-holes case?

Thanks,
Qu
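
For illustration only, such a counterpart could roughly look like the
sketch below (untested; reading delalloc_bytes without taking inode->lock
is racy and inline extents would need double checking, so this is just to
make the question concrete):

	/*
	 * Hypothetical "all holes" quick path: no bytes on disk, no
	 * prealloc extents and no delalloc means there is no data to
	 * seek to.
	 */
	if (whence == SEEK_DATA &&
	    !(inode->flags & BTRFS_INODE_PREALLOC) &&
	    inode_get_bytes(&inode->vfs_inode) == 0 &&
	    inode->delalloc_bytes == 0)
		return -ENXIO;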

>   	/*
>   	 * offset can be negative, in this case we start finding DATA/HOLE from
>   	 * the very start of the file.
> @@ -3628,49 +3887,165 @@ static loff_t find_desired_extent(struct btrfs_inode *inode, loff_t offset,
>   	if (lockend <= lockstart)
>   		lockend = lockstart + fs_info->sectorsize;
>   	lockend--;
> -	len = lockend - lockstart + 1;
> +
> +	path = btrfs_alloc_path();
> +	if (!path)
> +		return -ENOMEM;
> +	path->reada = READA_FORWARD;
> +
> +	key.objectid = ino;
> +	key.type = BTRFS_EXTENT_DATA_KEY;
> +	key.offset = start;
> +
> +	last_extent_end = lockstart;
>
>   	lock_extent_bits(&inode->io_tree, lockstart, lockend, &cached_state);
>
> +	ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
> +	if (ret < 0) {
> +		goto out;
> +	} else if (ret > 0 && path->slots[0] > 0) {
> +		btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0] - 1);
> +		if (key.objectid == ino && key.type == BTRFS_EXTENT_DATA_KEY)
> +			path->slots[0]--;
> +	}
> +
>   	while (start < i_size) {
> -		em = btrfs_get_extent_fiemap(inode, start, len);
> -		if (IS_ERR(em)) {
> -			ret = PTR_ERR(em);
> -			em = NULL;
> -			break;
> +		struct extent_buffer *leaf = path->nodes[0];
> +		struct btrfs_file_extent_item *extent;
> +		u64 extent_end;
> +
> +		if (path->slots[0] >= btrfs_header_nritems(leaf)) {
> +			ret = btrfs_next_leaf(root, path);
> +			if (ret < 0)
> +				goto out;
> +			else if (ret > 0)
> +				break;
> +
> +			leaf = path->nodes[0];
>   		}
>
> -		if (whence == SEEK_HOLE &&
> -		    (em->block_start == EXTENT_MAP_HOLE ||
> -		     test_bit(EXTENT_FLAG_PREALLOC, &em->flags)))
> -			break;
> -		else if (whence == SEEK_DATA &&
> -			   (em->block_start != EXTENT_MAP_HOLE &&
> -			    !test_bit(EXTENT_FLAG_PREALLOC, &em->flags)))
> +		btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
> +		if (key.objectid != ino || key.type != BTRFS_EXTENT_DATA_KEY)
>   			break;
>
> -		start = em->start + em->len;
> -		free_extent_map(em);
> -		em = NULL;
> +		extent_end = btrfs_file_extent_end(path);
> +
> +		/*
> +		 * In the first iteration we may have a slot that points to an
> +		 * extent that ends before our start offset, so skip it.
> +		 */
> +		if (extent_end <= start) {
> +			path->slots[0]++;
> +			continue;
> +		}
> +
> +		/* We have an implicit hole, NO_HOLES feature is likely set. */
> +		if (last_extent_end < key.offset) {
> +			u64 search_start = last_extent_end;
> +			u64 found_start;
> +
> +			/*
> +			 * First iteration, @start matches @offset and it's
> +			 * within the hole.
> +			 */
> +			if (start == offset)
> +				search_start = offset;
> +
> +			found = find_desired_extent_in_hole(inode, whence,
> +							    search_start,
> +							    key.offset - 1,
> +							    &found_start);
> +			if (found) {
> +				start = found_start;
> +				break;
> +			}
> +			/*
> +			 * Didn't find data or a hole (due to delalloc) in the
> +			 * implicit hole range, so need to analyze the extent.
> +			 */
> +		}
> +
> +		extent = btrfs_item_ptr(leaf, path->slots[0],
> +					struct btrfs_file_extent_item);
> +
> +		if (btrfs_file_extent_disk_bytenr(leaf, extent) == 0 ||
> +		    btrfs_file_extent_type(leaf, extent) ==
> +		    BTRFS_FILE_EXTENT_PREALLOC) {
> +			/*
> +			 * Explicit hole or prealloc extent, search for delalloc.
> +			 * A prealloc extent is treated like a hole.
> +			 */
> +			u64 search_start = key.offset;
> +			u64 found_start;
> +
> +			/*
> +			 * First iteration, @start matches @offset and it's
> +			 * within the hole.
> +			 */
> +			if (start == offset)
> +				search_start = offset;
> +
> +			found = find_desired_extent_in_hole(inode, whence,
> +							    search_start,
> +							    extent_end - 1,
> +							    &found_start);
> +			if (found) {
> +				start = found_start;
> +				break;
> +			}
> +			/*
> +			 * Didn't find data or a hole (due to delalloc) in the
> +			 * implicit hole range, so need to analyze the next
> +			 * extent item.
> +			 */
> +		} else {
> +			/*
> +			 * Found a regular or inline extent.
> +			 * If we are seeking for data, adjust the start offset
> +			 * and stop, we're done.
> +			 */
> +			if (whence == SEEK_DATA) {
> +				start = max_t(u64, key.offset, offset);
> +				found = true;
> +				break;
> +			}
> +			/*
> +			 * Else, we are seeking for a hole, check the next file
> +			 * extent item.
> +			 */
> +		}
> +
> +		start = extent_end;
> +		last_extent_end = extent_end;
> +		path->slots[0]++;
>   		if (fatal_signal_pending(current)) {
>   			ret = -EINTR;
> -			break;
> +			goto out;
>   		}
>   		cond_resched();
>   	}
> -	free_extent_map(em);
> +
> +	/* We have an implicit hole from the last extent found up to i_size. */
> +	if (!found && start < i_size) {
> +		found = find_desired_extent_in_hole(inode, whence, start,
> +						    i_size - 1, &start);
> +		if (!found)
> +			start = i_size;
> +	}
> +
> +out:
>   	unlock_extent_cached(&inode->io_tree, lockstart, lockend,
>   			     &cached_state);
> -	if (ret) {
> -		offset = ret;
> -	} else {
> -		if (whence == SEEK_DATA && start >= i_size)
> -			offset = -ENXIO;
> -		else
> -			offset = min_t(loff_t, start, i_size);
> -	}
> +	btrfs_free_path(path);
> +
> +	if (ret < 0)
> +		return ret;
> +
> +	if (whence == SEEK_DATA && start >= i_size)
> +		return -ENXIO;
>
> -	return offset;
> +	return min_t(loff_t, start, i_size);
>   }
>
>   static loff_t btrfs_file_llseek(struct file *file, loff_t offset, int whence)
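
For reference, a minimal user space walker for the interface this function
serves (assumes a Linux system with SEEK_DATA/SEEK_HOLE support; not part
of the patch):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <unistd.h>
    #include <fcntl.h>

    int main(int argc, char **argv)
    {
        off_t off = 0, data, hole;
        int fd;

        if (argc != 2) {
            fprintf(stderr, "usage: %s <file>\n", argv[0]);
            return 1;
        }
        fd = open(argv[1], O_RDONLY);
        if (fd < 0) {
            perror("open");
            return 1;
        }
        /*
         * Print every data region of the file. On btrfs these lseek()
         * calls end up in find_desired_extent() above.
         */
        while ((data = lseek(fd, off, SEEK_DATA)) >= 0) {
            hole = lseek(fd, data, SEEK_HOLE);
            if (hole < 0) {
                perror("lseek SEEK_HOLE");
                break;
            }
            printf("data: %lld..%lld\n", (long long)data, (long long)hole - 1);
            off = hole;
        }
        /* SEEK_DATA fails with ENXIO once there is no data past 'off'. */
        close(fd);
        return 0;
    }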

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 03/10] btrfs: remove check for impossible block start for an extent map at fiemap
  2022-09-01 13:18 ` [PATCH 03/10] btrfs: remove check for impossible block start for an extent map at fiemap fdmanana
  2022-09-01 14:03   ` Josef Bacik
@ 2022-09-01 22:19   ` Qu Wenruo
  1 sibling, 0 replies; 53+ messages in thread
From: Qu Wenruo @ 2022-09-01 22:19 UTC (permalink / raw)
  To: fdmanana, linux-btrfs



On 2022/9/1 21:18, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
>
> During fiemap we are testing if an extent map has a block start with a
> value of EXTENT_MAP_LAST_BYTE, but that is never set on an extent map,
> and never was according to git history. So remove that useless check.
>
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>

Thanks,
Qu
> ---
>   fs/btrfs/extent_io.c | 5 +----
>   1 file changed, 1 insertion(+), 4 deletions(-)
>
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index f57a3e91fc2c..ceb7dfe8d6dc 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -5642,10 +5642,7 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
>   		if (off >= max)
>   			end = 1;
>
> -		if (em->block_start == EXTENT_MAP_LAST_BYTE) {
> -			end = 1;
> -			flags |= FIEMAP_EXTENT_LAST;
> -		} else if (em->block_start == EXTENT_MAP_INLINE) {
> +		if (em->block_start == EXTENT_MAP_INLINE) {
>   			flags |= (FIEMAP_EXTENT_DATA_INLINE |
>   				  FIEMAP_EXTENT_NOT_ALIGNED);
>   		} else if (em->block_start == EXTENT_MAP_DELALLOC) {

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 04/10] btrfs: remove zero length check when entering fiemap
  2022-09-01 13:18 ` [PATCH 04/10] btrfs: remove zero length check when entering fiemap fdmanana
  2022-09-01 14:04   ` Josef Bacik
@ 2022-09-01 22:24   ` Qu Wenruo
  1 sibling, 0 replies; 53+ messages in thread
From: Qu Wenruo @ 2022-09-01 22:24 UTC (permalink / raw)
  To: fdmanana, linux-btrfs



On 2022/9/1 21:18, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
>
> There's no point to check for a 0 length at extent_fiemap(), as before
> calling it, we called fiemap_prep() at btrfs_fiemap(), which already
> checks for a zero length and returns the same -EINVAL error. So remove
> the pointless check.
>
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

Reviewed-by: Qu Wenruo <wqu@suse.com>

Thanks,
Qu
> ---
>   fs/btrfs/extent_io.c | 3 ---
>   1 file changed, 3 deletions(-)
>
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index ceb7dfe8d6dc..6e2143b6fba3 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -5526,9 +5526,6 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
>   	u64 em_len = 0;
>   	u64 em_end = 0;
>
> -	if (len == 0)
> -		return -EINVAL;
> -
>   	path = btrfs_alloc_path();
>   	if (!path)
>   		return -ENOMEM;

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 05/10] btrfs: properly flush delalloc when entering fiemap
  2022-09-01 13:18 ` [PATCH 05/10] btrfs: properly flush delalloc " fdmanana
  2022-09-01 14:06   ` Josef Bacik
@ 2022-09-01 22:38   ` Qu Wenruo
  1 sibling, 0 replies; 53+ messages in thread
From: Qu Wenruo @ 2022-09-01 22:38 UTC (permalink / raw)
  To: fdmanana, linux-btrfs



On 2022/9/1 21:18, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
>
> If the flag FIEMAP_FLAG_SYNC is passed to fiemap, it means all delalloc
> should be flushed and writeback complete. We call the generic helper
> fiemap_prep() which does a filemap_write_and_wait() in case that flag is
> given, however that is not enough if we have compression. Because a
> single filemap_fdatawrite_range() only starts compression (in an async
> thread) and therefore returns before the compression is done and writeback
> is started.
>
> So make btrfs_fiemap(), actually wait for all writeback to start and
> complete if FIEMAP_FLAG_SYNC is set. We start and wait for writeback
> on the whole possible file range, from 0 to LLONG_MAX, because that is
> what the generic code at fiemap_prep() does.
>
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

Reviewed-by: Qu Wenruo <wqu@suse.com>

Thanks,
Qu
> ---
>   fs/btrfs/inode.c | 20 ++++++++++++++++++++
>   1 file changed, 20 insertions(+)
>
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 493623c81535..2c7d31990777 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -8258,6 +8258,26 @@ static int btrfs_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
>   	if (ret)
>   		return ret;
>
> +	/*
> +	 * fiemap_prep() called filemap_write_and_wait() for the whole possible
> +	 * file range (0 to LLONG_MAX), but that is not enough if we have
> +	 * compression enabled. The first filemap_fdatawrite_range() only kicks
> +	 * in the compression of data (in an async thread) and will return
> +	 * before the compression is done and writeback is started. A second
> +	 * filemap_fdatawrite_range() is needed to wait for the compression to
> +	 * complete and writeback to start. Without this, our user is very
> +	 * likely to get stale results, because the extents and extent maps for
> +	 * delalloc regions are only allocated when writeback starts.
> +	 */
> +	if (fieinfo->fi_flags & FIEMAP_FLAG_SYNC) {
> +		ret = btrfs_fdatawrite_range(inode, 0, LLONG_MAX);
> +		if (ret)
> +			return ret;
> +		ret = filemap_fdatawait_range(inode->i_mapping, 0, LLONG_MAX);
> +		if (ret)
> +			return ret;
> +	}
> +
>   	return extent_fiemap(BTRFS_I(inode), fieinfo, start, len);
>   }
>
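
For reference, FIEMAP_FLAG_SYNC is what user space sets when it wants up to
date results; a minimal count-only caller could look like this (untested
sketch, the file path is made up):

    #include <stdio.h>
    #include <string.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/fs.h>
    #include <linux/fiemap.h>

    int main(void)
    {
        struct fiemap fm;
        int fd = open("/mnt/sdj/foo", O_RDONLY);

        if (fd < 0) {
            perror("open");
            return 1;
        }
        memset(&fm, 0, sizeof(fm));
        fm.fm_length = FIEMAP_MAX_OFFSET;
        /* Ask the kernel to flush delalloc and wait for writeback first. */
        fm.fm_flags = FIEMAP_FLAG_SYNC;
        /* fm_extent_count is 0, so only the number of extents is returned. */
        if (ioctl(fd, FS_IOC_FIEMAP, &fm) < 0) {
            perror("fiemap");
            close(fd);
            return 1;
        }
        printf("extents: %u\n", fm.fm_mapped_extents);
        close(fd);
        return 0;
    }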

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 06/10] btrfs: allow fiemap to be interruptible
  2022-09-01 13:18 ` [PATCH 06/10] btrfs: allow fiemap to be interruptible fdmanana
  2022-09-01 14:07   ` Josef Bacik
@ 2022-09-01 22:42   ` Qu Wenruo
  2022-09-02  8:38     ` Filipe Manana
  1 sibling, 1 reply; 53+ messages in thread
From: Qu Wenruo @ 2022-09-01 22:42 UTC (permalink / raw)
  To: fdmanana, linux-btrfs



On 2022/9/1 21:18, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
>
> Doing fiemap on a file with a very large number of extents can take a very
> long time, and we have reports of it being too slow (two recent examples
> in the Link tags below), so make it interruptible.
>
> Link: https://lore.kernel.org/linux-btrfs/21dd32c6-f1f9-f44a-466a-e18fdc6788a7@virtuozzo.com/
> Link: https://lore.kernel.org/linux-btrfs/Ysace25wh5BbLd5f@atmark-techno.com/
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

Reviewed-by: Qu Wenruo <wqu@suse.com>

Just one small question unrelated to the patch itself.

Would it be possible to introduce a new flag to skip the SHARED flag
check, to further speed up the fiemap operation in btrfs?

Thanks,
Qu
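
As a side note on that: with the current interface the sharedness lookups
only run when the caller actually asks for extent details (the
fi_extents_max check in the fiemap code quoted later in this thread), so a
count-only call already avoids them. A caller that does want the details
and the SHARED flag looks roughly like this (untested sketch, only the
first 256 extents are fetched for brevity):

    #include <stdio.h>
    #include <stdlib.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/fs.h>
    #include <linux/fiemap.h>

    #define NUM_EXTENTS 256

    int main(int argc, char **argv)
    {
        struct fiemap *fm;
        unsigned int i;
        int fd;

        if (argc != 2) {
            fprintf(stderr, "usage: %s <file>\n", argv[0]);
            return 1;
        }
        fd = open(argv[1], O_RDONLY);
        if (fd < 0) {
            perror("open");
            return 1;
        }
        fm = calloc(1, sizeof(*fm) + NUM_EXTENTS * sizeof(struct fiemap_extent));
        if (!fm) {
            close(fd);
            return 1;
        }
        fm->fm_length = FIEMAP_MAX_OFFSET;
        /* Asking for extent details is what triggers the sharedness checks. */
        fm->fm_extent_count = NUM_EXTENTS;
        if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0) {
            perror("fiemap");
            free(fm);
            close(fd);
            return 1;
        }
        for (i = 0; i < fm->fm_mapped_extents; i++)
            printf("extent %u: %s\n", i,
                   (fm->fm_extents[i].fe_flags & FIEMAP_EXTENT_SHARED) ?
                   "shared" : "not shared");
        free(fm);
        close(fd);
        return 0;
    }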
> ---
>   fs/btrfs/extent_io.c | 5 +++++
>   1 file changed, 5 insertions(+)
>
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 6e2143b6fba3..1260038eb47d 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -5694,6 +5694,11 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
>   				ret = 0;
>   			goto out_free;
>   		}
> +
> +		if (fatal_signal_pending(current)) {
> +			ret = -EINTR;
> +			goto out_free;
> +		}
>   	}
>   out_free:
>   	if (!ret)

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 07/10] btrfs: rename btrfs_check_shared() to a more descriptive name
  2022-09-01 13:18 ` [PATCH 07/10] btrfs: rename btrfs_check_shared() to a more descriptive name fdmanana
  2022-09-01 14:08   ` Josef Bacik
@ 2022-09-01 22:45   ` Qu Wenruo
  1 sibling, 0 replies; 53+ messages in thread
From: Qu Wenruo @ 2022-09-01 22:45 UTC (permalink / raw)
  To: fdmanana, linux-btrfs



On 2022/9/1 21:18, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
>
> The function btrfs_check_shared() is supposed to be used to check if a
> data extent is shared, but its name is too generic, may easily cause
> confusion in the sense that it may be used for metadata extents.
>
> So rename it to btrfs_is_data_extent_shared(), which will also make it
> less confusing after the next change that adds a backref lookup cache for
> the b+tree nodes that lead to the leaf that contains the file extent item
> that points to the target data extent.
>
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

Reviewed-by: Qu Wenruo <wqu@suse.com>

Thanks,
Qu

> ---
>   fs/btrfs/backref.c   | 8 ++++----
>   fs/btrfs/backref.h   | 4 ++--
>   fs/btrfs/extent_io.c | 5 +++--
>   3 files changed, 9 insertions(+), 8 deletions(-)
>
> diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
> index d385357e19b6..e2ac10a695b6 100644
> --- a/fs/btrfs/backref.c
> +++ b/fs/btrfs/backref.c
> @@ -1512,7 +1512,7 @@ int btrfs_find_all_roots(struct btrfs_trans_handle *trans,
>   }
>
>   /**
> - * Check if an extent is shared or not
> + * Check if a data extent is shared or not.
>    *
>    * @root:   root inode belongs to
>    * @inum:   inode number of the inode whose extent we are checking
> @@ -1520,7 +1520,7 @@ int btrfs_find_all_roots(struct btrfs_trans_handle *trans,
>    * @roots:  list of roots this extent is shared among
>    * @tmp:    temporary list used for iteration
>    *
> - * btrfs_check_shared uses the backref walking code but will short
> + * btrfs_is_data_extent_shared uses the backref walking code but will short
>    * circuit as soon as it finds a root or inode that doesn't match the
>    * one passed in. This provides a significant performance benefit for
>    * callers (such as fiemap) which want to know whether the extent is
> @@ -1531,8 +1531,8 @@ int btrfs_find_all_roots(struct btrfs_trans_handle *trans,
>    *
>    * Return: 0 if extent is not shared, 1 if it is shared, < 0 on error.
>    */
> -int btrfs_check_shared(struct btrfs_root *root, u64 inum, u64 bytenr,
> -		struct ulist *roots, struct ulist *tmp)
> +int btrfs_is_data_extent_shared(struct btrfs_root *root, u64 inum, u64 bytenr,
> +				struct ulist *roots, struct ulist *tmp)
>   {
>   	struct btrfs_fs_info *fs_info = root->fs_info;
>   	struct btrfs_trans_handle *trans;
> diff --git a/fs/btrfs/backref.h b/fs/btrfs/backref.h
> index 2759de7d324c..08354394b1bb 100644
> --- a/fs/btrfs/backref.h
> +++ b/fs/btrfs/backref.h
> @@ -62,8 +62,8 @@ int btrfs_find_one_extref(struct btrfs_root *root, u64 inode_objectid,
>   			  u64 start_off, struct btrfs_path *path,
>   			  struct btrfs_inode_extref **ret_extref,
>   			  u64 *found_off);
> -int btrfs_check_shared(struct btrfs_root *root, u64 inum, u64 bytenr,
> -		struct ulist *roots, struct ulist *tmp_ulist);
> +int btrfs_is_data_extent_shared(struct btrfs_root *root, u64 inum, u64 bytenr,
> +				struct ulist *roots, struct ulist *tmp);
>
>   int __init btrfs_prelim_ref_init(void);
>   void __cold btrfs_prelim_ref_exit(void);
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 1260038eb47d..a47710516ecf 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -5656,8 +5656,9 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
>   			 * then we're just getting a count and we can skip the
>   			 * lookup stuff.
>   			 */
> -			ret = btrfs_check_shared(root, btrfs_ino(inode),
> -						 bytenr, roots, tmp_ulist);
> +			ret = btrfs_is_data_extent_shared(root, btrfs_ino(inode),
> +							  bytenr, roots,
> +							  tmp_ulist);
>   			if (ret < 0)
>   				goto out_free;
>   			if (ret)

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 08/10] btrfs: speedup checking for extent sharedness during fiemap
  2022-09-01 13:18 ` [PATCH 08/10] btrfs: speedup checking for extent sharedness during fiemap fdmanana
  2022-09-01 14:23   ` Josef Bacik
@ 2022-09-01 22:50   ` Qu Wenruo
  2022-09-02  8:46     ` Filipe Manana
  1 sibling, 1 reply; 53+ messages in thread
From: Qu Wenruo @ 2022-09-01 22:50 UTC (permalink / raw)
  To: fdmanana, linux-btrfs



On 2022/9/1 21:18, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
>
> One of the most expensive tasks performed during fiemap is to check if
> an extent is shared. This task has two major steps:
>
> 1) Check if the data extent is shared. This implies checking the extent
>     item in the extent tree, checking delayed references, etc. If we
>     find the data extent is directly shared, we terminate immediately;
>
> 2) If the data extent is not directly shared (its extent item has a
>     refcount of 1), then it may be shared if we have snapshots that share
>     subtrees of the inode's subvolume b+tree. So we check if the leaf
>     containing the file extent item is shared, then its parent node, then
>     the parent node of the parent node, etc, until we reach the root node
>     or we find one of them is shared - in which case we stop immediately.
>
> During fiemap we process the extents of a file from left to right, from
> file offset 0 to eof. This means that we iterate b+tree leaves from left
> to right, and has the implication that we keep repeating that second step
> above several times for the same b+tree path of the inode's subvolume
> b+tree.
>
> For example, if we have two file extent items in leaf X, and the path to
> leaf X is A -> B -> C -> X, then when we try to determine if the data
> extent referenced by the first extent item is shared, we check if the data
> extent is shared - if it's not, then we check if leaf X is shared, if not,
> then we check if node C is shared, if not, then check if node B is shared,
> if not than check if node A is shared. When we move to the next file
> extent item, after determining the data extent is not shared, we repeat
> the checks for X, C, B and A - doing all the expensive searches in the
> extent tree, delayed refs, etc. If we have thousands of file extents, then
> we keep repeating the sharedness checks for the same paths over and over.
>
> On a file that has no shared extents or only a small portion, it's easy
> to see that this scales terribly with the number of extents in the file
> and the sizes of the extent and subvolume b+trees.
>
> This change eliminates the repeated sharedness check on extent buffers
> by caching the results of the last path used. The results can be used as
> long as no snapshots were created since they were cached (for not shared
> extent buffers) or no roots were dropped since they were cached (for
> shared extent buffers). This greatly reduces the time spent by fiemap for
> files with thousands of extents and/or large extent and subvolume b+trees.

This sounds pretty much like what the existing btrfs_backref_cache is doing.

It stores a map to speed up the backref lookup.

But a quick search didn't hit things like btrfs_backref_edge() or
btrfs_backref_cache().

Would it be possible to reuse the existing facility to do the same thing?

Thanks,
Qu
>
> Example performance test:
>
>      $ cat fiemap-perf-test.sh
>      #!/bin/bash
>
>      DEV=/dev/sdi
>      MNT=/mnt/sdi
>
>      mkfs.btrfs -f $DEV
>      mount -o compress=lzo $DEV $MNT
>
>      # 40G gives 327680 128K file extents (due to compression).
>      xfs_io -f -c "pwrite -S 0xab -b 1M 0 40G" $MNT/foobar
>
>      umount $MNT
>      mount -o compress=lzo $DEV $MNT
>
>      start=$(date +%s%N)
>      filefrag $MNT/foobar
>      end=$(date +%s%N)
>      dur=$(( (end - start) / 1000000 ))
>      echo "fiemap took $dur milliseconds (metadata not cached)"
>
>      start=$(date +%s%N)
>      filefrag $MNT/foobar
>      end=$(date +%s%N)
>      dur=$(( (end - start) / 1000000 ))
>      echo "fiemap took $dur milliseconds (metadata cached)"
>
>      umount $MNT
>
> Before this patch:
>
>      $ ./fiemap-perf-test.sh
>      (...)
>      /mnt/sdi/foobar: 327680 extents found
>      fiemap took 3597 milliseconds (metadata not cached)
>      /mnt/sdi/foobar: 327680 extents found
>      fiemap took 2107 milliseconds (metadata cached)
>
> After this patch:
>
>      $ ./fiemap-perf-test.sh
>      (...)
>      /mnt/sdi/foobar: 327680 extents found
>      fiemap took 1646 milliseconds (metadata not cached)
>      /mnt/sdi/foobar: 327680 extents found
>      fiemap took 698 milliseconds (metadata cached)
>
> That's about 2.2x faster when no metadata is cached, and about 3x faster
> when all metadata is cached. On a real filesystem with many other files,
> data, directories, etc, the b+trees will be 2 or 3 levels higher,
> therefore this optimization will have a higher impact.
>
> Several reports of a slow fiemap show up often, the two Link tags below
> refer to two recent reports of such slowness. This patch, together with
> the next ones in the series, is meant to address that.
>
> Link: https://lore.kernel.org/linux-btrfs/21dd32c6-f1f9-f44a-466a-e18fdc6788a7@virtuozzo.com/
> Link: https://lore.kernel.org/linux-btrfs/Ysace25wh5BbLd5f@atmark-techno.com/
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
>   fs/btrfs/backref.c     | 122 ++++++++++++++++++++++++++++++++++++++++-
>   fs/btrfs/backref.h     |  17 +++++-
>   fs/btrfs/ctree.h       |  18 ++++++
>   fs/btrfs/extent-tree.c |  10 +++-
>   fs/btrfs/extent_io.c   |  11 ++--
>   5 files changed, 170 insertions(+), 8 deletions(-)
>
> diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
> index e2ac10a695b6..40b48abb6978 100644
> --- a/fs/btrfs/backref.c
> +++ b/fs/btrfs/backref.c
> @@ -1511,6 +1511,105 @@ int btrfs_find_all_roots(struct btrfs_trans_handle *trans,
>   	return ret;
>   }
>
> +/*
> + * The caller has joined a transaction or is holding a read lock on the
> + * fs_info->commit_root_sem semaphore, so no need to worry about the root's last
> + * snapshot field changing while updating or checking the cache.
> + */
> +static bool lookup_backref_shared_cache(struct btrfs_backref_shared_cache *cache,
> +					struct btrfs_root *root,
> +					u64 bytenr, int level, bool *is_shared)
> +{
> +	struct btrfs_backref_shared_cache_entry *entry;
> +
> +	if (WARN_ON_ONCE(level >= BTRFS_MAX_LEVEL))
> +		return false;
> +
> +	/*
> +	 * Level -1 is used for the data extent, which is not reliable to cache
> +	 * because its reference count can increase or decrease without us
> +	 * realizing. We cache results only for extent buffers that lead from
> +	 * the root node down to the leaf with the file extent item.
> +	 */
> +	ASSERT(level >= 0);
> +
> +	entry = &cache->entries[level];
> +
> +	/* Unused cache entry or being used for some other extent buffer. */
> +	if (entry->bytenr != bytenr)
> +		return false;
> +
> +	/*
> +	 * We cached a false result, but the last snapshot generation of the
> +	 * root changed, so we now have a snapshot. Don't trust the result.
> +	 */
> +	if (!entry->is_shared &&
> +	    entry->gen != btrfs_root_last_snapshot(&root->root_item))
> +		return false;
> +
> +	/*
> +	 * If we cached a true result and the last generation used for dropping
> +	 * a root changed, we can not trust the result, because the dropped root
> +	 * could be a snapshot sharing this extent buffer.
> +	 */
> +	if (entry->is_shared &&
> +	    entry->gen != btrfs_get_last_root_drop_gen(root->fs_info))
> +		return false;
> +
> +	*is_shared = entry->is_shared;
> +
> +	return true;
> +}
> +
> +/*
> + * The caller has joined a transaction or is holding a read lock on the
> + * fs_info->commit_root_sem semaphore, so no need to worry about the root's last
> + * snapshot field changing while updating or checking the cache.
> + */
> +static void store_backref_shared_cache(struct btrfs_backref_shared_cache *cache,
> +				       struct btrfs_root *root,
> +				       u64 bytenr, int level, bool is_shared)
> +{
> +	struct btrfs_backref_shared_cache_entry *entry;
> +	u64 gen;
> +
> +	if (WARN_ON_ONCE(level >= BTRFS_MAX_LEVEL))
> +		return;
> +
> +	/*
> +	 * Level -1 is used for the data extent, which is not reliable to cache
> +	 * because its reference count can increase or decrease without us
> +	 * realizing. We cache results only for extent buffers that lead from
> +	 * the root node down to the leaf with the file extent item.
> +	 */
> +	ASSERT(level >= 0);
> +
> +	if (is_shared)
> +		gen = btrfs_get_last_root_drop_gen(root->fs_info);
> +	else
> +		gen = btrfs_root_last_snapshot(&root->root_item);
> +
> +	entry = &cache->entries[level];
> +	entry->bytenr = bytenr;
> +	entry->is_shared = is_shared;
> +	entry->gen = gen;
> +
> +	/*
> +	 * If we found an extent buffer is shared, set the cache result for all
> +	 * extent buffers below it to true. As nodes in the path are COWed,
> +	 * their sharedness is moved to their children, and if a leaf is COWed,
> +	 * then the sharedness of a data extent becomes direct, the refcount of
> +	 * data extent is increased in the extent item at the extent tree.
> +	 */
> +	if (is_shared) {
> +		for (int i = 0; i < level; i++) {
> +			entry = &cache->entries[i];
> +			entry->is_shared = is_shared;
> +			entry->gen = gen;
> +		}
> +	}
> +}
> +
>   /**
>    * Check if a data extent is shared or not.
>    *
> @@ -1519,6 +1618,7 @@ int btrfs_find_all_roots(struct btrfs_trans_handle *trans,
>    * @bytenr: logical bytenr of the extent we are checking
>    * @roots:  list of roots this extent is shared among
>    * @tmp:    temporary list used for iteration
> + * @cache:  a backref lookup result cache
>    *
>    * btrfs_is_data_extent_shared uses the backref walking code but will short
>    * circuit as soon as it finds a root or inode that doesn't match the
> @@ -1532,7 +1632,8 @@ int btrfs_find_all_roots(struct btrfs_trans_handle *trans,
>    * Return: 0 if extent is not shared, 1 if it is shared, < 0 on error.
>    */
>   int btrfs_is_data_extent_shared(struct btrfs_root *root, u64 inum, u64 bytenr,
> -				struct ulist *roots, struct ulist *tmp)
> +				struct ulist *roots, struct ulist *tmp,
> +				struct btrfs_backref_shared_cache *cache)
>   {
>   	struct btrfs_fs_info *fs_info = root->fs_info;
>   	struct btrfs_trans_handle *trans;
> @@ -1545,6 +1646,7 @@ int btrfs_is_data_extent_shared(struct btrfs_root *root, u64 inum, u64 bytenr,
>   		.inum = inum,
>   		.share_count = 0,
>   	};
> +	int level;
>
>   	ulist_init(roots);
>   	ulist_init(tmp);
> @@ -1561,22 +1663,40 @@ int btrfs_is_data_extent_shared(struct btrfs_root *root, u64 inum, u64 bytenr,
>   		btrfs_get_tree_mod_seq(fs_info, &elem);
>   	}
>
> +	/* -1 means we are in the bytenr of the data extent. */
> +	level = -1;
>   	ULIST_ITER_INIT(&uiter);
>   	while (1) {
> +		bool is_shared;
> +		bool cached;
> +
>   		ret = find_parent_nodes(trans, fs_info, bytenr, elem.seq, tmp,
>   					roots, NULL, &shared, false);
>   		if (ret == BACKREF_FOUND_SHARED) {
>   			/* this is the only condition under which we return 1 */
>   			ret = 1;
> +			if (level >= 0)
> +				store_backref_shared_cache(cache, root, bytenr,
> +							   level, true);
>   			break;
>   		}
>   		if (ret < 0 && ret != -ENOENT)
>   			break;
>   		ret = 0;
> +		if (level >= 0)
> +			store_backref_shared_cache(cache, root, bytenr,
> +						   level, false);
>   		node = ulist_next(tmp, &uiter);
>   		if (!node)
>   			break;
>   		bytenr = node->val;
> +		level++;
> +		cached = lookup_backref_shared_cache(cache, root, bytenr, level,
> +						     &is_shared);
> +		if (cached) {
> +			ret = is_shared ? 1 : 0;
> +			break;
> +		}
>   		shared.share_count = 0;
>   		cond_resched();
>   	}
> diff --git a/fs/btrfs/backref.h b/fs/btrfs/backref.h
> index 08354394b1bb..797ba5371d55 100644
> --- a/fs/btrfs/backref.h
> +++ b/fs/btrfs/backref.h
> @@ -17,6 +17,20 @@ struct inode_fs_paths {
>   	struct btrfs_data_container	*fspath;
>   };
>
> +struct btrfs_backref_shared_cache_entry {
> +	u64 bytenr;
> +	u64 gen;
> +	bool is_shared;
> +};
> +
> +struct btrfs_backref_shared_cache {
> +	/*
> +	 * A path from a root to a leaf that has a file extent item pointing to
> +	 * a given data extent should never exceed the maximum b+tree height.
> +	 */
> +	struct btrfs_backref_shared_cache_entry entries[BTRFS_MAX_LEVEL];
> +};
> +
>   typedef int (iterate_extent_inodes_t)(u64 inum, u64 offset, u64 root,
>   		void *ctx);
>
> @@ -63,7 +77,8 @@ int btrfs_find_one_extref(struct btrfs_root *root, u64 inode_objectid,
>   			  struct btrfs_inode_extref **ret_extref,
>   			  u64 *found_off);
>   int btrfs_is_data_extent_shared(struct btrfs_root *root, u64 inum, u64 bytenr,
> -				struct ulist *roots, struct ulist *tmp);
> +				struct ulist *roots, struct ulist *tmp,
> +				struct btrfs_backref_shared_cache *cache);
>
>   int __init btrfs_prelim_ref_init(void);
>   void __cold btrfs_prelim_ref_exit(void);
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 3dc30f5e6fd0..f7fe7f633eb5 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -1095,6 +1095,13 @@ struct btrfs_fs_info {
>   	/* Updates are not protected by any lock */
>   	struct btrfs_commit_stats commit_stats;
>
> +	/*
> +	 * Last generation where we dropped a non-relocation root.
> +	 * Use btrfs_set_last_root_drop_gen() and btrfs_get_last_root_drop_gen()
> +	 * to change it and to read it, respectively.
> +	 */
> +	u64 last_root_drop_gen;
> +
>   	/*
>   	 * Annotations for transaction events (structures are empty when
>   	 * compiled without lockdep).
> @@ -1119,6 +1126,17 @@ struct btrfs_fs_info {
>   #endif
>   };
>
> +static inline void btrfs_set_last_root_drop_gen(struct btrfs_fs_info *fs_info,
> +						u64 gen)
> +{
> +	WRITE_ONCE(fs_info->last_root_drop_gen, gen);
> +}
> +
> +static inline u64 btrfs_get_last_root_drop_gen(const struct btrfs_fs_info *fs_info)
> +{
> +	return READ_ONCE(fs_info->last_root_drop_gen);
> +}
> +
>   static inline struct btrfs_fs_info *btrfs_sb(struct super_block *sb)
>   {
>   	return sb->s_fs_info;
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index bcd0e72cded3..9818285dface 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -5635,6 +5635,8 @@ static noinline int walk_up_tree(struct btrfs_trans_handle *trans,
>    */
>   int btrfs_drop_snapshot(struct btrfs_root *root, int update_ref, int for_reloc)
>   {
> +	const bool is_reloc_root = (root->root_key.objectid ==
> +				    BTRFS_TREE_RELOC_OBJECTID);
>   	struct btrfs_fs_info *fs_info = root->fs_info;
>   	struct btrfs_path *path;
>   	struct btrfs_trans_handle *trans;
> @@ -5794,6 +5796,9 @@ int btrfs_drop_snapshot(struct btrfs_root *root, int update_ref, int for_reloc)
>   				goto out_end_trans;
>   			}
>
> +			if (!is_reloc_root)
> +				btrfs_set_last_root_drop_gen(fs_info, trans->transid);
> +
>   			btrfs_end_transaction_throttle(trans);
>   			if (!for_reloc && btrfs_need_cleaner_sleep(fs_info)) {
>   				btrfs_debug(fs_info,
> @@ -5828,7 +5833,7 @@ int btrfs_drop_snapshot(struct btrfs_root *root, int update_ref, int for_reloc)
>   		goto out_end_trans;
>   	}
>
> -	if (root->root_key.objectid != BTRFS_TREE_RELOC_OBJECTID) {
> +	if (!is_reloc_root) {
>   		ret = btrfs_find_root(tree_root, &root->root_key, path,
>   				      NULL, NULL);
>   		if (ret < 0) {
> @@ -5860,6 +5865,9 @@ int btrfs_drop_snapshot(struct btrfs_root *root, int update_ref, int for_reloc)
>   		btrfs_put_root(root);
>   	root_dropped = true;
>   out_end_trans:
> +	if (!is_reloc_root)
> +		btrfs_set_last_root_drop_gen(fs_info, trans->transid);
> +
>   	btrfs_end_transaction_throttle(trans);
>   out_free:
>   	kfree(wc);
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index a47710516ecf..781436cc373c 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -5519,6 +5519,7 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
>   	struct btrfs_path *path;
>   	struct btrfs_root *root = inode->root;
>   	struct fiemap_cache cache = { 0 };
> +	struct btrfs_backref_shared_cache *backref_cache;
>   	struct ulist *roots;
>   	struct ulist *tmp_ulist;
>   	int end = 0;
> @@ -5526,13 +5527,11 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
>   	u64 em_len = 0;
>   	u64 em_end = 0;
>
> +	backref_cache = kzalloc(sizeof(*backref_cache), GFP_KERNEL);
>   	path = btrfs_alloc_path();
> -	if (!path)
> -		return -ENOMEM;
> -
>   	roots = ulist_alloc(GFP_KERNEL);
>   	tmp_ulist = ulist_alloc(GFP_KERNEL);
> -	if (!roots || !tmp_ulist) {
> +	if (!backref_cache || !path || !roots || !tmp_ulist) {
>   		ret = -ENOMEM;
>   		goto out_free_ulist;
>   	}
> @@ -5658,7 +5657,8 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
>   			 */
>   			ret = btrfs_is_data_extent_shared(root, btrfs_ino(inode),
>   							  bytenr, roots,
> -							  tmp_ulist);
> +							  tmp_ulist,
> +							  backref_cache);
>   			if (ret < 0)
>   				goto out_free;
>   			if (ret)
> @@ -5710,6 +5710,7 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
>   			     &cached_state);
>
>   out_free_ulist:
> +	kfree(backref_cache);
>   	btrfs_free_path(path);
>   	ulist_free(roots);
>   	ulist_free(tmp_ulist);

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 09/10] btrfs: skip unnecessary extent buffer sharedness checks during fiemap
  2022-09-01 13:18 ` [PATCH 09/10] btrfs: skip unnecessary extent buffer sharedness checks " fdmanana
  2022-09-01 14:26   ` Josef Bacik
@ 2022-09-01 23:01   ` Qu Wenruo
  1 sibling, 0 replies; 53+ messages in thread
From: Qu Wenruo @ 2022-09-01 23:01 UTC (permalink / raw)
  To: fdmanana, linux-btrfs



On 2022/9/1 21:18, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
>
> During fiemap, for each file extent we find, we must check if it's shared
> or not. The sharedness check starts by verifying if the extent is directly
> shared (its refcount in the extent tree is > 1), and if it is not directly
> shared, then we will check if every node in the subvolume b+tree leading
> from the root to the leaf that has the file extent item (in reverse order),
> is shared (through snapshots).
>
> However this second step is not needed if our extent was created in a
> transaction more recent than the last transaction where a snapshot of the
> inode's root happened, because it can't be shared indirectly (through
> shared subtrees) without a snapshot created in a more recent transaction.

This is pretty awesome!

>
> So grab the generation of the extent from the extent map and pass it to
> btrfs_is_data_extent_shared(), which will skip this second phase when the
> generation is more recent than the root's last snapshot value. Note that
> we skip this optimization if the extent map is the result of merging 2
> or more extent maps, because in this case its generation is the maximum
> of the generations of all merged extent maps.

And this pitfall is also taken into consideration, even better.

>
> The fact the we use extent maps and they can be merged despite the
> underlying extents being distinct (different file extent items in the
> subvolume b+tree and different extent items in the extent b+tree), can
> result in some bugs when reporting shared extents. But this is a problem
> of the current implementation of fiemap relying on extent maps.
> One example where we get incorrect results is:
>
>      $ cat fiemap-bug.sh
>      #!/bin/bash
>
>      DEV=/dev/sdj
>      MNT=/mnt/sdj
>
>      mkfs.btrfs -f $DEV
>      mount $DEV $MNT
>
>      # Create a file with two 256K extents.
>      # Since there is no other write activity, they will be contiguous,
>      # and their extent maps merged, despite having two distinct extents.
>      xfs_io -f -c "pwrite -S 0xab 0 256K" \
>                -c "fsync" \
>                -c "pwrite -S 0xcd 256K 256K" \
>                -c "fsync" \
>                $MNT/foo
>
>      # Now clone only the second extent into another file.
>      xfs_io -f -c "reflink $MNT/foo 256K 0 256K" $MNT/bar
>
>      # Filefrag will report a single 512K extent, and say it's not shared.
>      echo
>      filefrag -v $MNT/foo
>
>      umount $MNT
>
> Running the reproducer:
>
>      $ ./fiemap-bug.sh
>      wrote 262144/262144 bytes at offset 0
>      256 KiB, 64 ops; 0.0038 sec (65.479 MiB/sec and 16762.7030 ops/sec)
>      wrote 262144/262144 bytes at offset 262144
>      256 KiB, 64 ops; 0.0040 sec (61.125 MiB/sec and 15647.9218 ops/sec)
>      linked 262144/262144 bytes at offset 0
>      256 KiB, 1 ops; 0.0002 sec (1.034 GiB/sec and 4237.2881 ops/sec)
>
>      Filesystem type is: 9123683e
>      File size of /mnt/sdj/foo is 524288 (128 blocks of 4096 bytes)
>       ext:     logical_offset:        physical_offset: length:   expected: flags:
>         0:        0..     127:       3328..      3455:    128:             last,eof
>      /mnt/sdj/foo: 1 extent found
>
> We end up reporting that we have a single 512K extent that is not shared, however
> we have two 256K extents, and the second one is shared. Changing the
> reproducer to clone instead the first extent into file 'bar', makes us
> report a single 512K extent that is shared, which is also incorrect since
> we have two 256K extents and only the first one is shared.
>
> This is a problem that existed before this change, and remains after this
> change, as it can't be easily fixed. The next patch in the series reworks
> fiemap to primarily use file extent items instead of extent maps (except
> for checking for delalloc ranges), with the goal of improving its
> scalability and performance, but it also ends up fixing this particular
> bug caused by extent map merging.
>
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

Reviewed-by: Qu Wenruo <wqu@suse.com>

Thanks,
Qu
> ---
>   fs/btrfs/backref.c   | 27 +++++++++++++++++++++------
>   fs/btrfs/backref.h   |  1 +
>   fs/btrfs/extent_io.c | 18 ++++++++++++++++--
>   3 files changed, 38 insertions(+), 8 deletions(-)
>
> diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
> index 40b48abb6978..bf4ca4a82550 100644
> --- a/fs/btrfs/backref.c
> +++ b/fs/btrfs/backref.c
> @@ -1613,12 +1613,14 @@ static void store_backref_shared_cache(struct btrfs_backref_shared_cache *cache,
>   /**
>    * Check if a data extent is shared or not.
>    *
> - * @root:   root inode belongs to
> - * @inum:   inode number of the inode whose extent we are checking
> - * @bytenr: logical bytenr of the extent we are checking
> - * @roots:  list of roots this extent is shared among
> - * @tmp:    temporary list used for iteration
> - * @cache:  a backref lookup result cache
> + * @root:        The root the inode belongs to.
> + * @inum:        Number of the inode whose extent we are checking.
> + * @bytenr:      Logical bytenr of the extent we are checking.
> + * @extent_gen:  Generation of the extent (file extent item) or 0 if it is
> + *               not known.
> + * @roots:       List of roots this extent is shared among.
> + * @tmp:         Temporary list used for iteration.
> + * @cache:       A backref lookup result cache.
>    *
>    * btrfs_is_data_extent_shared uses the backref walking code but will short
>    * circuit as soon as it finds a root or inode that doesn't match the
> @@ -1632,6 +1634,7 @@ static void store_backref_shared_cache(struct btrfs_backref_shared_cache *cache,
>    * Return: 0 if extent is not shared, 1 if it is shared, < 0 on error.
>    */
>   int btrfs_is_data_extent_shared(struct btrfs_root *root, u64 inum, u64 bytenr,
> +				u64 extent_gen,
>   				struct ulist *roots, struct ulist *tmp,
>   				struct btrfs_backref_shared_cache *cache)
>   {
> @@ -1683,6 +1686,18 @@ int btrfs_is_data_extent_shared(struct btrfs_root *root, u64 inum, u64 bytenr,
>   		if (ret < 0 && ret != -ENOENT)
>   			break;
>   		ret = 0;
> +		/*
> +		 * If our data extent is not shared through reflinks and it was
> +		 * created in a generation after the last one used to create a
> +		 * snapshot of the inode's root, then it can not be shared
> +		 * indirectly through subtrees, as that can only happen with
> +		 * snapshots. In this case bail out, no need to check for the
> +		 * sharedness of extent buffers.
> +		 */
> +		if (level == -1 &&
> +		    extent_gen > btrfs_root_last_snapshot(&root->root_item))
> +			break;
> +
>   		if (level >= 0)
>   			store_backref_shared_cache(cache, root, bytenr,
>   						   level, false);
> diff --git a/fs/btrfs/backref.h b/fs/btrfs/backref.h
> index 797ba5371d55..7d18b5ac71dd 100644
> --- a/fs/btrfs/backref.h
> +++ b/fs/btrfs/backref.h
> @@ -77,6 +77,7 @@ int btrfs_find_one_extref(struct btrfs_root *root, u64 inode_objectid,
>   			  struct btrfs_inode_extref **ret_extref,
>   			  u64 *found_off);
>   int btrfs_is_data_extent_shared(struct btrfs_root *root, u64 inum, u64 bytenr,
> +				u64 extent_gen,
>   				struct ulist *roots, struct ulist *tmp,
>   				struct btrfs_backref_shared_cache *cache);
>
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 781436cc373c..0e3fa9b08aaf 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -5645,9 +5645,23 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
>   			flags |= (FIEMAP_EXTENT_DELALLOC |
>   				  FIEMAP_EXTENT_UNKNOWN);
>   		} else if (fieinfo->fi_extents_max) {
> +			u64 extent_gen;
>   			u64 bytenr = em->block_start -
>   				(em->start - em->orig_start);
>
> +			/*
> +			 * If two extent maps are merged, then their generation
> +			 * is set to the maximum between their generations.
> +			 * Otherwise its generation matches the one we have in
> +			 * corresponding file extent item. If we have a merged
> +			 * extent map, don't use its generation to speedup the
> +			 * sharedness check below.
> +			 */
> +			if (test_bit(EXTENT_FLAG_MERGED, &em->flags))
> +				extent_gen = 0;
> +			else
> +				extent_gen = em->generation;
> +
>   			/*
>   			 * As btrfs supports shared space, this information
>   			 * can be exported to userspace tools via
> @@ -5656,8 +5670,8 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
>   			 * lookup stuff.
>   			 */
>   			ret = btrfs_is_data_extent_shared(root, btrfs_ino(inode),
> -							  bytenr, roots,
> -							  tmp_ulist,
> +							  bytenr, extent_gen,
> +							  roots, tmp_ulist,
>   							  backref_cache);
>   			if (ret < 0)
>   				goto out_free;

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 10/10] btrfs: make fiemap more efficient and accurate reporting extent sharedness
  2022-09-01 13:18 ` [PATCH 10/10] btrfs: make fiemap more efficient and accurate reporting extent sharedness fdmanana
  2022-09-01 14:35   ` Josef Bacik
@ 2022-09-01 23:27   ` Qu Wenruo
  2022-09-02  8:59     ` Filipe Manana
  1 sibling, 1 reply; 53+ messages in thread
From: Qu Wenruo @ 2022-09-01 23:27 UTC (permalink / raw)
  To: fdmanana, linux-btrfs



On 2022/9/1 21:18, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
>
> The current fiemap implementation does not scale very well with the number
> of extents a file has. This is both because the main algorithm to find out
> the extents has a high algorithmic complexity and because for each extent
> we have to check if it's shared. This second part, checking if an extent
> is shared, is significantly improved by the two previous patches in this
> patchset, while the first part is improved by this specific patch. Every
> now and then we get reports from users mentioning fiemap is too slow or
> even unusable for files with a very large number of extents, such as the
> two recent reports referred to by the Link tags at the bottom of this
> change log.
>
> To understand why the part of finding which extents a file has is very
> inefficient, consider the example of doing a full ranged fiemap against
> a file that has over 100K extents (normal for example for a file with
> more than 10G of data and using compression, which limits the extent size
> to 128K). When we enter fiemap at extent_fiemap(), the following happens:
>
> 1) Before entering the main loop, we call get_extent_skip_holes() to get
>     the first extent map. This leads us to btrfs_get_extent_fiemap(), which
>     in turn calls btrfs_get_extent(), to find the first extent map that
>     covers the file range [0, LLONG_MAX).
>
>     btrfs_get_extent() will first search the inode's extent map tree, to
>     see if we have an extent map there that covers the range. If it does
>     not find one, then it will search the inode's subvolume b+tree for a
>     fitting file extent item. After finding the file extent item, it will
>     allocate an extent map, fill it in with information extracted from the
>     file extent item, and add it to the inode's extent map tree (which
>     requires a search for insertion in the tree).
>
> 2) Then we enter the main loop at extent_fiemap(), emit the details of
>     the extent, and call again get_extent_skip_holes(), with a start
>     offset matching the end of the extent map we previously processed.
>
>     We end up at btrfs_get_extent() again, will search the extent map tree
>     and then search the subvolume b+tree for a file extent item if we could
>     not find an extent map in the extent tree. We allocate an extent map,
>     fill it in with the details in the file extent item, and then insert
>     it into the extent map tree (yet another search in this tree).
>
> 3) The second step is repeated over and over, until we have processed the
>     whole file range. Each iteration ends at btrfs_get_extent(), which
>     does a red black tree search on the extent map tree, then searches the
>     subvolume b+tree, allocates an extent map and then does another search
>     in the extent map tree in order to insert the extent map.
>
>     In the best scenario we have all the extent maps already in the extent
>     tree, and so for each extent we do a single search on a red black tree,
>     so we have a complexity of O(n log n).
>
>     In the worst scenario we don't have any extent map already loaded in
>     the extent map tree, or have very few already there. In this case the
>     complexity is much higher since we do:
>
>     - A red black tree search on the extent map tree, which has O(log n)
>       complexity, initially very fast since the tree is empty or very
>       small, but as we end up allocating extent maps and adding them to
>       the tree when we don't find them there, each subsequent search on
>       the tree gets slower, since it's getting bigger and bigger after
>       each iteration.
>
>     - A search on the subvolume b+tree, also O(log n) complexity, but it
>       has items for all inodes in the subvolume, not just items for our
>       inode. Plus on a filesystem with concurrent operations on other
>       inodes, we can block doing the search due to lock contention on
>       b+tree nodes/leaves.
>
>     - Allocate an extent map - this can block, and can also fail if we
>       are under serious memory pressure.
>
>     - Do another search on the extent maps red black tree, with the goal
>       of inserting the extent map we just allocated. Again, after every
>       iteration this tree is getting bigger by 1 element, so after many
>       iterations the searches are slower and slower.
>
>     - We will not need the allocated extent map anymore, so it's pointless
>       to add it to the extent map tree. It's just wasting time and memory.
>
>     In short we end up searching the extent map tree multiple times, on a
>     tree that is growing bigger and bigger after each iteration. And
>     besides that we visit the same leaf of the subvolume b+tree many times,
>     since a leaf with the default size of 16K can easily have more than 200
>     file extent items.
>
> This is very inefficient overall. This patch changes the algorithm to
> instead iterate over the subvolume b+tree, visiting each leaf only once,
> and only searching in the extent map tree for file ranges that have holes
> or prealloc extents, in order to figure out if we have delalloc there.
> It will never allocate an extent map and add it to the extent map tree.
> This is very similar to what was previously done for the lseek's hole and
> data seeking features.
>
> Also, the current implementation relying on extent maps for figuring out
> which extents we have is not correct. This is because extent maps can be
> merged even if they represent different extents - we do this to minimize
> memory utilization and keep extent map trees smaller. For example if we
> have two extents that are contiguous on disk, once we load the two extent
> maps, they get merged into a single one - however if only one of the
> extents is shared, we end up reporting both as shared or both as not
> shared, which is incorrect.

Is there any other major usage for extent map now?

I can only think of read, which uses extent map to grab the logical
bytenr of the real extent.

In that case, the SHARED flag doesn't make much sense anyway. Can we do
a cleanup for those flags, since fiemap/lseek no longer rely on extent
maps?

>
> This reproducer triggers that bug:
>
>      $ cat fiemap-bug.sh
>      #!/bin/bash
>
>      DEV=/dev/sdj
>      MNT=/mnt/sdj
>
>      mkfs.btrfs -f $DEV
>      mount $DEV $MNT
>
>      # Create a file with two 256K extents.
>      # Since there is no other write activity, they will be contiguous,
>      # and their extent maps merged, despite having two distinct extents.
>      xfs_io -f -c "pwrite -S 0xab 0 256K" \
>                -c "fsync" \
>                -c "pwrite -S 0xcd 256K 256K" \
>                -c "fsync" \
>                $MNT/foo
>
>      # Now clone only the second extent into another file.
>      xfs_io -f -c "reflink $MNT/foo 256K 0 256K" $MNT/bar
>
>      # Filefrag will report a single 512K extent, and say it's not shared.
>      echo
>      filefrag -v $MNT/foo
>
>      umount $MNT
>
> Running the reproducer:
>
>      $ ./fiemap-bug.sh
>      wrote 262144/262144 bytes at offset 0
>      256 KiB, 64 ops; 0.0038 sec (65.479 MiB/sec and 16762.7030 ops/sec)
>      wrote 262144/262144 bytes at offset 262144
>      256 KiB, 64 ops; 0.0040 sec (61.125 MiB/sec and 15647.9218 ops/sec)
>      linked 262144/262144 bytes at offset 0
>      256 KiB, 1 ops; 0.0002 sec (1.034 GiB/sec and 4237.2881 ops/sec)
>
>      Filesystem type is: 9123683e
>      File size of /mnt/sdj/foo is 524288 (128 blocks of 4096 bytes)
>       ext:     logical_offset:        physical_offset: length:   expected: flags:
>         0:        0..     127:       3328..      3455:    128:             last,eof
>      /mnt/sdj/foo: 1 extent found
>
> We end up reporting that we have a single 512K extent that is not shared, however
> we have two 256K extents, and the second one is shared. Changing the
> reproducer to clone instead the first extent into file 'bar', makes us
> report a single 512K extent that is shared, which is also incorrect since
> we have two 256K extents and only the first one is shared.
>
> This patch is part of a larger patchset that is comprised of the following
> patches:
>
>      btrfs: allow hole and data seeking to be interruptible
>      btrfs: make hole and data seeking a lot more efficient
>      btrfs: remove check for impossible block start for an extent map at fiemap
>      btrfs: remove zero length check when entering fiemap
>      btrfs: properly flush delalloc when entering fiemap
>      btrfs: allow fiemap to be interruptible
>      btrfs: rename btrfs_check_shared() to a more descriptive name
>      btrfs: speedup checking for extent sharedness during fiemap
>      btrfs: skip unnecessary extent buffer sharedness checks during fiemap
>      btrfs: make fiemap more efficient and accurate reporting extent sharedness
>
> The patchset was tested on a machine running a non-debug kernel (Debian's
> default config) and compared the tests below on a branch without the
> patchset versus the same branch with the whole patchset applied.
>
> The following test for a large compressed file without holes:
>
>      $ cat fiemap-perf-test.sh
>      #!/bin/bash
>
>      DEV=/dev/sdi
>      MNT=/mnt/sdi
>
>      mkfs.btrfs -f $DEV
>      mount -o compress=lzo $DEV $MNT
>
>      # 40G gives 327680 128K file extents (due to compression).
>      xfs_io -f -c "pwrite -S 0xab -b 1M 0 40G" $MNT/foobar
>
>      umount $MNT
>      mount -o compress=lzo $DEV $MNT
>
>      start=$(date +%s%N)
>      filefrag $MNT/foobar
>      end=$(date +%s%N)
>      dur=$(( (end - start) / 1000000 ))
>      echo "fiemap took $dur milliseconds (metadata not cached)"
>
>      start=$(date +%s%N)
>      filefrag $MNT/foobar
>      end=$(date +%s%N)
>      dur=$(( (end - start) / 1000000 ))
>      echo "fiemap took $dur milliseconds (metadata cached)"
>
>      umount $MNT
>
> Before patchset:
>
>      $ ./fiemap-perf-test.sh
>      (...)
>      /mnt/sdi/foobar: 327680 extents found
>      fiemap took 3597 milliseconds (metadata not cached)
>      /mnt/sdi/foobar: 327680 extents found
>      fiemap took 2107 milliseconds (metadata cached)
>
> After patchset:
>
>      $ ./fiemap-perf-test.sh
>      (...)
>      /mnt/sdi/foobar: 327680 extents found
>      fiemap took 1214 milliseconds (metadata not cached)
>      /mnt/sdi/foobar: 327680 extents found
>      fiemap took 684 milliseconds (metadata cached)
>
> That's a speedup of about 3x for both cases (no metadata cached and all
> metadata cached).
>
> The test provided by Pavel (first Link tag at the bottom), which uses
> files with a large number of holes, was also used to measure the gains,
> and it consists of a small C program and a shell script to invoke it.
> The C program is the following:
>
>      $ cat pavels-test.c
>      #include <stdio.h>
>      #include <unistd.h>
>      #include <stdlib.h>
>      #include <fcntl.h>
>
>      #include <sys/stat.h>
>      #include <sys/time.h>
>      #include <sys/ioctl.h>
>
>      #include <linux/fs.h>
>      #include <linux/fiemap.h>
>
>      #define FILE_INTERVAL (1<<13) /* 8Kb */
>
>      long long interval(struct timeval t1, struct timeval t2)
>      {
>          long long val = 0;
>          val += (t2.tv_usec - t1.tv_usec);
>          val += (t2.tv_sec - t1.tv_sec) * 1000 * 1000;
>          return val;
>      }
>
>      int main(int argc, char **argv)
>      {
>          struct fiemap fiemap = {};
>          struct timeval t1, t2;
>          char data = 'a';
>          struct stat st;
>          int fd, off, file_size = FILE_INTERVAL;
>
>          if (argc != 3 && argc != 2) {
>                  printf("usage: %s <path> [size]\n", argv[0]);
>                  return 1;
>          }
>
>          if (argc == 3)
>                  file_size = atoi(argv[2]);
>          if (file_size < FILE_INTERVAL)
>                  file_size = FILE_INTERVAL;
>          file_size -= file_size % FILE_INTERVAL;
>
>          fd = open(argv[1], O_RDWR | O_CREAT | O_TRUNC, 0644);
>          if (fd < 0) {
>              perror("open");
>              return 1;
>          }
>
>          for (off = 0; off < file_size; off += FILE_INTERVAL) {
>              if (pwrite(fd, &data, 1, off) != 1) {
>                  perror("pwrite");
>                  close(fd);
>                  return 1;
>              }
>          }
>
>          if (ftruncate(fd, file_size)) {
>              perror("ftruncate");
>              close(fd);
>              return 1;
>          }
>
>          if (fstat(fd, &st) < 0) {
>              perror("fstat");
>              close(fd);
>              return 1;
>          }
>
>          printf("size: %ld\n", st.st_size);
>          printf("actual size: %ld\n", st.st_blocks * 512);
>
>          fiemap.fm_length = FIEMAP_MAX_OFFSET;
>          gettimeofday(&t1, NULL);
>          if (ioctl(fd, FS_IOC_FIEMAP, &fiemap) < 0) {
>              perror("fiemap");
>              close(fd);
>              return 1;
>          }
>          gettimeofday(&t2, NULL);
>
>          printf("fiemap: fm_mapped_extents = %d\n",
>                 fiemap.fm_mapped_extents);
>          printf("time = %lld us\n", interval(t1, t2));
>
>          close(fd);
>          return 0;
>      }
>
>      $ gcc -o pavels-test pavels-test.c
>
> And the wrapper shell script:
>
>      $ cat fiemap-pavels-test.sh
>
>      #!/bin/bash
>
>      DEV=/dev/sdi
>      MNT=/mnt/sdi
>
>      mkfs.btrfs -f -O no-holes $DEV
>      mount $DEV $MNT
>
>      echo
>      echo "*********** 256M ***********"
>      echo
>
>      ./pavels-test $MNT/testfile $((1 << 28))
>      echo
>      ./pavels-test $MNT/testfile $((1 << 28))
>
>      echo
>      echo "*********** 512M ***********"
>      echo
>
>      ./pavels-test $MNT/testfile $((1 << 29))
>      echo
>      ./pavels-test $MNT/testfile $((1 << 29))
>
>      echo
>      echo "*********** 1G ***********"
>      echo
>
>      ./pavels-test $MNT/testfile $((1 << 30))
>      echo
>      ./pavels-test $MNT/testfile $((1 << 30))
>
>      umount $MNT
>
> Running his reproducer before applying the patchset:
>
>      *********** 256M ***********
>
>      size: 268435456
>      actual size: 134217728
>      fiemap: fm_mapped_extents = 32768
>      time = 4003133 us
>
>      size: 268435456
>      actual size: 134217728
>      fiemap: fm_mapped_extents = 32768
>      time = 4895330 us
>
>      *********** 512M ***********
>
>      size: 536870912
>      actual size: 268435456
>      fiemap: fm_mapped_extents = 65536
>      time = 30123675 us
>
>      size: 536870912
>      actual size: 268435456
>      fiemap: fm_mapped_extents = 65536
>      time = 33450934 us
>
>      *********** 1G ***********
>
>      size: 1073741824
>      actual size: 536870912
>      fiemap: fm_mapped_extents = 131072
>      time = 224924074 us
>
>      size: 1073741824
>      actual size: 536870912
>      fiemap: fm_mapped_extents = 131072
>      time = 217239242 us
>
> Running it after applying the patchset:
>
>      *********** 256M ***********
>
>      size: 268435456
>      actual size: 134217728
>      fiemap: fm_mapped_extents = 32768
>      time = 29475 us
>
>      size: 268435456
>      actual size: 134217728
>      fiemap: fm_mapped_extents = 32768
>      time = 29307 us
>
>      *********** 512M ***********
>
>      size: 536870912
>      actual size: 268435456
>      fiemap: fm_mapped_extents = 65536
>      time = 58996 us
>
>      size: 536870912
>      actual size: 268435456
>      fiemap: fm_mapped_extents = 65536
>      time = 59115 us
>
>      *********** 1G ***********
>
>      size: 1073741824
>      actual size: 536870912
>      fiemap: fm_mapped_extents = 116251
>      time = 124141 us
>
>      size: 1073741824
>      actual size: 536870912
>      fiemap: fm_mapped_extents = 131072
>      time = 119387 us
>
> The speedup is massive, both on the first fiemap call and on the second
> one, as his test creates files with many holes and small extents
> (every extent follows a hole and precedes another hole).
>
> For the 256M file we go from 4 seconds down to 29 milliseconds in the
> first run, and then from 4.9 seconds down to 29 milliseconds again in the
> second run, a speedup of 138x and 169x, respectively.
>
> For the 512M file we go from 30.1 seconds down to 59 milliseconds in the
> first run, and then from 33.5 seconds down to 59 milliseconds again in the
> second run, a speedup of 510x and 568x, respectively.
>
> For the 1G file, we go from 225 seconds down to 124 milliseconds in the
> first run, and then from 217 seconds down to 119 milliseconds in the
> second run, a speedup of 1815x and 1824x, respectively.
>
> Reported-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
> Link: https://lore.kernel.org/linux-btrfs/21dd32c6-f1f9-f44a-466a-e18fdc6788a7@virtuozzo.com/
> Reported-by: Dominique MARTINET <dominique.martinet@atmark-techno.com>
> Link: https://lore.kernel.org/linux-btrfs/Ysace25wh5BbLd5f@atmark-techno.com/
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
>   fs/btrfs/ctree.h     |   4 +-
>   fs/btrfs/extent_io.c | 714 +++++++++++++++++++++++++++++--------------
>   fs/btrfs/file.c      |  16 +-
>   fs/btrfs/inode.c     | 140 +--------
>   4 files changed, 506 insertions(+), 368 deletions(-)
>
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index f7fe7f633eb5..7b266f9dc8b4 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -3402,8 +3402,6 @@ unsigned int btrfs_verify_data_csum(struct btrfs_bio *bbio,
>   				    u64 start, u64 end);
>   int btrfs_check_data_csum(struct inode *inode, struct btrfs_bio *bbio,
>   			  u32 bio_offset, struct page *page, u32 pgoff);
> -struct extent_map *btrfs_get_extent_fiemap(struct btrfs_inode *inode,
> -					   u64 start, u64 len);
>   noinline int can_nocow_extent(struct inode *inode, u64 offset, u64 *len,
>   			      u64 *orig_start, u64 *orig_block_len,
>   			      u64 *ram_bytes, bool strict);
> @@ -3583,6 +3581,8 @@ int btrfs_fdatawrite_range(struct inode *inode, loff_t start, loff_t end);
>   int btrfs_check_nocow_lock(struct btrfs_inode *inode, loff_t pos,
>   			   size_t *write_bytes);
>   void btrfs_check_nocow_unlock(struct btrfs_inode *inode);
> +bool btrfs_find_delalloc_in_range(struct btrfs_inode *inode, u64 start, u64 end,
> +				  u64 *delalloc_start_ret, u64 *delalloc_end_ret);
>
>   /* tree-defrag.c */
>   int btrfs_defrag_leaves(struct btrfs_trans_handle *trans,
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 0e3fa9b08aaf..50bb2182e795 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -5353,42 +5353,6 @@ int try_release_extent_mapping(struct page *page, gfp_t mask)
>   	return try_release_extent_state(tree, page, mask);
>   }
>
> -/*
> - * helper function for fiemap, which doesn't want to see any holes.
> - * This maps until we find something past 'last'
> - */
> -static struct extent_map *get_extent_skip_holes(struct btrfs_inode *inode,
> -						u64 offset, u64 last)
> -{
> -	u64 sectorsize = btrfs_inode_sectorsize(inode);
> -	struct extent_map *em;
> -	u64 len;
> -
> -	if (offset >= last)
> -		return NULL;
> -
> -	while (1) {
> -		len = last - offset;
> -		if (len == 0)
> -			break;
> -		len = ALIGN(len, sectorsize);
> -		em = btrfs_get_extent_fiemap(inode, offset, len);
> -		if (IS_ERR(em))
> -			return em;
> -
> -		/* if this isn't a hole return it */
> -		if (em->block_start != EXTENT_MAP_HOLE)
> -			return em;
> -
> -		/* this is a hole, advance to the next extent */
> -		offset = extent_map_end(em);
> -		free_extent_map(em);
> -		if (offset >= last)
> -			break;
> -	}
> -	return NULL;
> -}
> -
>   /*
>    * To cache previous fiemap extent
>    *
> @@ -5418,6 +5382,9 @@ static int emit_fiemap_extent(struct fiemap_extent_info *fieinfo,
>   {
>   	int ret = 0;
>
> +	/* Set at the end of extent_fiemap(). */
> +	ASSERT((flags & FIEMAP_EXTENT_LAST) == 0);
> +
>   	if (!cache->cached)
>   		goto assign;
>
> @@ -5446,11 +5413,10 @@ static int emit_fiemap_extent(struct fiemap_extent_info *fieinfo,
>   	 */
>   	if (cache->offset + cache->len  == offset &&
>   	    cache->phys + cache->len == phys  &&
> -	    (cache->flags & ~FIEMAP_EXTENT_LAST) ==
> -			(flags & ~FIEMAP_EXTENT_LAST)) {
> +	    cache->flags == flags) {
>   		cache->len += len;
>   		cache->flags |= flags;
> -		goto try_submit_last;
> +		return 0;
>   	}
>
>   	/* Not mergeable, need to submit cached one */
> @@ -5465,13 +5431,8 @@ static int emit_fiemap_extent(struct fiemap_extent_info *fieinfo,
>   	cache->phys = phys;
>   	cache->len = len;
>   	cache->flags = flags;
> -try_submit_last:
> -	if (cache->flags & FIEMAP_EXTENT_LAST) {
> -		ret = fiemap_fill_next_extent(fieinfo, cache->offset,
> -				cache->phys, cache->len, cache->flags);
> -		cache->cached = false;
> -	}
> -	return ret;
> +
> +	return 0;
>   }
>
>   /*
> @@ -5501,229 +5462,534 @@ static int emit_last_fiemap_cache(struct fiemap_extent_info *fieinfo,
>   	return ret;
>   }
>
> -int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
> -		  u64 start, u64 len)
> +static int fiemap_next_leaf_item(struct btrfs_inode *inode,
> +				 struct btrfs_path *path)
>   {
> -	int ret = 0;
> -	u64 off;
> -	u64 max = start + len;
> -	u32 flags = 0;
> -	u32 found_type;
> -	u64 last;
> -	u64 last_for_get_extent = 0;
> -	u64 disko = 0;
> -	u64 isize = i_size_read(&inode->vfs_inode);
> -	struct btrfs_key found_key;
> -	struct extent_map *em = NULL;
> -	struct extent_state *cached_state = NULL;
> -	struct btrfs_path *path;
> +	struct extent_buffer *clone;
> +	struct btrfs_key key;
> +	int slot;
> +	int ret;
> +
> +	path->slots[0]++;
> +	if (path->slots[0] < btrfs_header_nritems(path->nodes[0]))
> +		return 0;
> +
> +	ret = btrfs_next_leaf(inode->root, path);
> +	if (ret != 0)
> +		return ret;
> +
> +	/*
> +	 * Don't bother with cloning if there are no more file extent items for
> +	 * our inode.
> +	 */
> +	btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
> +	if (key.objectid != btrfs_ino(inode) || key.type != BTRFS_EXTENT_DATA_KEY)
> +		return 1;
> +
> +	/* See the comment at fiemap_search_slot() about why we clone. */
> +	clone = btrfs_clone_extent_buffer(path->nodes[0]);
> +	if (!clone)
> +		return -ENOMEM;
> +
> +	slot = path->slots[0];
> +	btrfs_release_path(path);
> +	path->nodes[0] = clone;
> +	path->slots[0] = slot;
> +
> +	return 0;
> +}
> +
> +/*
> + * Search for the first file extent item that starts at a given file offset or
> + * the one that starts immediately before that offset.
> + * Returns: 0 on success, < 0 on error, 1 if not found.
> + */
> +static int fiemap_search_slot(struct btrfs_inode *inode, struct btrfs_path *path,
> +			      u64 file_offset)
> +{
> +	const u64 ino = btrfs_ino(inode);
>   	struct btrfs_root *root = inode->root;
> -	struct fiemap_cache cache = { 0 };
> -	struct btrfs_backref_shared_cache *backref_cache;
> -	struct ulist *roots;
> -	struct ulist *tmp_ulist;
> -	int end = 0;
> -	u64 em_start = 0;
> -	u64 em_len = 0;
> -	u64 em_end = 0;
> +	struct extent_buffer *clone;
> +	struct btrfs_key key;
> +	int slot;
> +	int ret;
>
> -	backref_cache = kzalloc(sizeof(*backref_cache), GFP_KERNEL);
> -	path = btrfs_alloc_path();
> -	roots = ulist_alloc(GFP_KERNEL);
> -	tmp_ulist = ulist_alloc(GFP_KERNEL);
> -	if (!backref_cache || !path || !roots || !tmp_ulist) {
> -		ret = -ENOMEM;
> -		goto out_free_ulist;
> +	key.objectid = ino;
> +	key.type = BTRFS_EXTENT_DATA_KEY;
> +	key.offset = file_offset;
> +
> +	ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
> +	if (ret < 0)
> +		return ret;
> +
> +	if (ret > 0 && path->slots[0] > 0) {
> +		btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0] - 1);
> +		if (key.objectid == ino && key.type == BTRFS_EXTENT_DATA_KEY)
> +			path->slots[0]--;
> +	}
> +
> +	if (path->slots[0] >= btrfs_header_nritems(path->nodes[0])) {
> +		ret = btrfs_next_leaf(root, path);
> +		if (ret != 0)
> +			return ret;
> +
> +		btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
> +		if (key.objectid != ino || key.type != BTRFS_EXTENT_DATA_KEY)
> +			return 1;
>   	}
>
>   	/*
> -	 * We can't initialize that to 'start' as this could miss extents due
> -	 * to extent item merging
> +	 * We clone the leaf and use it during fiemap. This is because while
> +	 * using the leaf we do expensive things like checking if an extent is
> +	 * shared, which can take a long time. In order to prevent blocking
> +	 * other tasks for too long, we use a clone of the leaf. We have locked
> +	 * the file range in the inode's io tree, so we know none of our file
> +	 * extent items can change. This way we avoid blocking other tasks that
> +	 * want to insert items for other inodes in the same leaf or b+tree
> +	 * rebalance operations (triggered for example when someone is trying
> +	 * to push items into this leaf when trying to insert an item in a
> +	 * neighbour leaf).
> +	 * We also need the private clone because holding a read lock on an
> +	 * extent buffer of the subvolume's b+tree will make lockdep unhappy
> +	 * when we call fiemap_fill_next_extent(), because that may cause a page
> +	 * fault when filling the user space buffer with fiemap data.
>   	 */
> -	off = 0;
> -	start = round_down(start, btrfs_inode_sectorsize(inode));
> -	len = round_up(max, btrfs_inode_sectorsize(inode)) - start;
> +	clone = btrfs_clone_extent_buffer(path->nodes[0]);
> +	if (!clone)
> +		return -ENOMEM;
> +
> +	slot = path->slots[0];
> +	btrfs_release_path(path);
> +	path->nodes[0] = clone;
> +	path->slots[0] = slot;

Although this is correct, it still looks a little tricky.

We rely on btrfs_release_path() to release all tree blocks in the
subvolume tree, including unlocking them, so path->locks[0] is also 0,
meaning the next time we call btrfs_release_path() we won't try to
unlock the cloned eb.

But I'd say it's still pretty tricky, and unfortunately I don't have any
better alternative.
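
To make the invariant explicit, here is a commented restatement of the
hunk above (same code as in the patch, only the comments are new, and they
reflect my reading of it):

    slot = path->slots[0];
    /*
     * Releases the references and read locks on every real extent buffer
     * in the path, which also clears path->locks[0].
     */
    btrfs_release_path(path);
    /*
     * The cloned extent buffer is private to this task and was never
     * locked, so with path->locks[0] == 0 a later btrfs_release_path()
     * or btrfs_free_path() only drops the clone's reference and never
     * attempts to unlock it.
     */
    path->nodes[0] = clone;
    path->slots[0] = slot;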

> +
> +	return 0;
> +}
> +
> +/*
> + * Process a range which is a hole or a prealloc extent in the inode's subvolume
> + * btree. If @disk_bytenr is 0, we are dealing with a hole, otherwise a prealloc
> + * extent. The end offset (@end) is inclusive.
> + */
> +static int fiemap_process_hole(struct btrfs_inode *inode,

Does the name still make sense, given that we're handling both hole and
prealloc ranges?


And I always find the delalloc search a big pain during lseek/fiemap.

I guess that, unless certain flags are used, there is a hard requirement
to report delalloc ranges?

> +			       struct fiemap_extent_info *fieinfo,
> +			       struct fiemap_cache *cache,
> +			       struct btrfs_backref_shared_cache *backref_cache,
> +			       u64 disk_bytenr, u64 extent_offset,
> +			       u64 extent_gen,
> +			       struct ulist *roots, struct ulist *tmp_ulist,
> +			       u64 start, u64 end)
> +{
> +	const u64 i_size = i_size_read(&inode->vfs_inode);
> +	const u64 ino = btrfs_ino(inode);
> +	u64 cur_offset = start;
> +	u64 last_delalloc_end = 0;
> +	u32 prealloc_flags = FIEMAP_EXTENT_UNWRITTEN;
> +	bool checked_extent_shared = false;
> +	int ret;
>
>   	/*
> -	 * lookup the last file extent.  We're not using i_size here
> -	 * because there might be preallocation past i_size
> +	 * There can be no delalloc past i_size, so don't waste time looking for
> +	 * it beyond i_size.
>   	 */
> -	ret = btrfs_lookup_file_extent(NULL, root, path, btrfs_ino(inode), -1,
> -				       0);
> -	if (ret < 0) {
> -		goto out_free_ulist;
> -	} else {
> -		WARN_ON(!ret);
> -		if (ret == 1)
> -			ret = 0;
> -	}
> +	while (cur_offset < end && cur_offset < i_size) {
> +		u64 delalloc_start;
> +		u64 delalloc_end;
> +		u64 prealloc_start;
> +		u64 prealloc_len = 0;
> +		bool delalloc;
> +
> +		delalloc = btrfs_find_delalloc_in_range(inode, cur_offset, end,
> +							&delalloc_start,
> +							&delalloc_end);
> +		if (!delalloc)
> +			break;
>
> -	path->slots[0]--;
> -	btrfs_item_key_to_cpu(path->nodes[0], &found_key, path->slots[0]);
> -	found_type = found_key.type;
> -
> -	/* No extents, but there might be delalloc bits */
> -	if (found_key.objectid != btrfs_ino(inode) ||
> -	    found_type != BTRFS_EXTENT_DATA_KEY) {
> -		/* have to trust i_size as the end */
> -		last = (u64)-1;
> -		last_for_get_extent = isize;
> -	} else {
>   		/*
> -		 * remember the start of the last extent.  There are a
> -		 * bunch of different factors that go into the length of the
> -		 * extent, so its much less complex to remember where it started
> +		 * If this is a prealloc extent we have to report every section
> +		 * of it that has no delalloc.
>   		 */
> -		last = found_key.offset;
> -		last_for_get_extent = last + 1;
> +		if (disk_bytenr != 0) {
> +			if (last_delalloc_end == 0) {
> +				prealloc_start = start;
> +				prealloc_len = delalloc_start - start;
> +			} else {
> +				prealloc_start = last_delalloc_end + 1;
> +				prealloc_len = delalloc_start - prealloc_start;
> +			}
> +		}
> +
> +		if (prealloc_len > 0) {
> +			if (!checked_extent_shared && fieinfo->fi_extents_max) {
> +				ret = btrfs_is_data_extent_shared(inode->root,
> +							  ino, disk_bytenr,
> +							  extent_gen, roots,
> +							  tmp_ulist,
> +							  backref_cache);
> +				if (ret < 0)
> +					return ret;
> +				else if (ret > 0)
> +					prealloc_flags |= FIEMAP_EXTENT_SHARED;
> +
> +				checked_extent_shared = true;
> +			}
> +			ret = emit_fiemap_extent(fieinfo, cache, prealloc_start,
> +						 disk_bytenr + extent_offset,
> +						 prealloc_len, prealloc_flags);
> +			if (ret)
> +				return ret;
> +			extent_offset += prealloc_len;
> +		}
> +
> +		ret = emit_fiemap_extent(fieinfo, cache, delalloc_start, 0,
> +					 delalloc_end + 1 - delalloc_start,
> +					 FIEMAP_EXTENT_DELALLOC |
> +					 FIEMAP_EXTENT_UNKNOWN);
> +		if (ret)
> +			return ret;
> +
> +		last_delalloc_end = delalloc_end;
> +		cur_offset = delalloc_end + 1;
> +		extent_offset += cur_offset - delalloc_start;
> +		cond_resched();
> +	}
> +
> +	/*
> +	 * Either we found no delalloc for the whole prealloc extent or we have
> +	 * a prealloc extent that spans i_size or starts at or after i_size.
> +	 */
> +	if (disk_bytenr != 0 && last_delalloc_end < end) {
> +		u64 prealloc_start;
> +		u64 prealloc_len;
> +
> +		if (last_delalloc_end == 0) {
> +			prealloc_start = start;
> +			prealloc_len = end + 1 - start;
> +		} else {
> +			prealloc_start = last_delalloc_end + 1;
> +			prealloc_len = end + 1 - prealloc_start;
> +		}
> +
> +		if (!checked_extent_shared && fieinfo->fi_extents_max) {
> +			ret = btrfs_is_data_extent_shared(inode->root,
> +							  ino, disk_bytenr,
> +							  extent_gen, roots,
> +							  tmp_ulist,
> +							  backref_cache);
> +			if (ret < 0)
> +				return ret;
> +			else if (ret > 0)
> +				prealloc_flags |= FIEMAP_EXTENT_SHARED;
> +		}
> +		ret = emit_fiemap_extent(fieinfo, cache, prealloc_start,
> +					 disk_bytenr + extent_offset,
> +					 prealloc_len, prealloc_flags);
> +		if (ret)
> +			return ret;
> +	}
> +
> +	return 0;
> +}
> +
> +static int fiemap_find_last_extent_offset(struct btrfs_inode *inode,
> +					  struct btrfs_path *path,
> +					  u64 *last_extent_end_ret)
> +{
> +	const u64 ino = btrfs_ino(inode);
> +	struct btrfs_root *root = inode->root;
> +	struct extent_buffer *leaf;
> +	struct btrfs_file_extent_item *ei;
> +	struct btrfs_key key;
> +	u64 disk_bytenr;
> +	int ret;
> +
> +	/*
> +	 * Lookup the last file extent. We're not using i_size here because
> +	 * there might be preallocation past i_size.
> +	 */

I'm wondering how this could happen?

Normally, if we're truncating an inode, the extents starting after
round_up(i_size, sectorsize) should be dropped.

Otherwise, if we later enlarge the inode, we may hit old extents and read
out stale data instead of the expected zeros.

Thus searching using round_up(i_size, sectorsize) should still let us
reach the slot after the last file extent.

Or did I miss something?

Thanks,
Qu
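
As an aside, prealloc extents beyond i_size are easy to create with
fallocate() using FALLOC_FL_KEEP_SIZE (xfs_io's "falloc -k"), which
allocates space without moving i_size. A quick illustration, with
hypothetical device/mount paths not taken from the patch:

    $ xfs_io -f -c "pwrite 0 4K" -c "falloc -k 4K 1M" /mnt/sdi/f
    $ stat -c %s /mnt/sdi/f
    4096

The file's i_size stays at 4K while a prealloc extent covers the range
[4K, 4K + 1M).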

> +	ret = btrfs_lookup_file_extent(NULL, root, path, ino, (u64)-1, 0);
> +	/* There can't be a file extent item at offset (u64)-1 */
> +	ASSERT(ret != 0);
> +	if (ret < 0)
> +		return ret;
> +
> +	/*
> +	 * For a non-existing key, btrfs_search_slot() always leaves us at a
> +	 * slot > 0, except if the btree is empty, which is impossible because
> +	 * at least it has the inode item for this inode and all the items for
> +	 * the root inode 256.
> +	 */
> +	ASSERT(path->slots[0] > 0);
> +	path->slots[0]--;
> +	leaf = path->nodes[0];
> +	btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
> +	if (key.objectid != ino || key.type != BTRFS_EXTENT_DATA_KEY) {
> +		/* No file extent items in the subvolume tree. */
> +		*last_extent_end_ret = 0;
> +		return 0;
>   	}
> -	btrfs_release_path(path);
>
>   	/*
> -	 * we might have some extents allocated but more delalloc past those
> -	 * extents.  so, we trust isize unless the start of the last extent is
> -	 * beyond isize
> +	 * For an inline extent, the disk_bytenr is where inline data starts at,
> +	 * so first check if we have an inline extent item before checking if we
> +	 * have an implicit hole (disk_bytenr == 0).
>   	 */
> -	if (last < isize) {
> -		last = (u64)-1;
> -		last_for_get_extent = isize;
> +	ei = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_file_extent_item);
> +	if (btrfs_file_extent_type(leaf, ei) == BTRFS_FILE_EXTENT_INLINE) {
> +		*last_extent_end_ret = btrfs_file_extent_end(path);
> +		return 0;
>   	}
>
> -	lock_extent_bits(&inode->io_tree, start, start + len - 1,
> -			 &cached_state);
> +	/*
> +	 * Find the last file extent item that is not a hole (when NO_HOLES is
> +	 * not enabled). This should take at most 2 iterations in the worst
> +	 * case: we have one hole file extent item at slot 0 of a leaf and
> +	 * another hole file extent item as the last item in the previous leaf.
> +	 * This is because we merge file extent items that represent holes.
> +	 */
> +	disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, ei);
> +	while (disk_bytenr == 0) {
> +		ret = btrfs_previous_item(root, path, ino, BTRFS_EXTENT_DATA_KEY);
> +		if (ret < 0) {
> +			return ret;
> +		} else if (ret > 0) {
> +			/* No file extent items that are not holes. */
> +			*last_extent_end_ret = 0;
> +			return 0;
> +		}
> +		leaf = path->nodes[0];
> +		ei = btrfs_item_ptr(leaf, path->slots[0],
> +				    struct btrfs_file_extent_item);
> +		disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, ei);
> +	}
>
> -	em = get_extent_skip_holes(inode, start, last_for_get_extent);
> -	if (!em)
> -		goto out;
> -	if (IS_ERR(em)) {
> -		ret = PTR_ERR(em);
> +	*last_extent_end_ret = btrfs_file_extent_end(path);
> +	return 0;
> +}
> +
> +int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
> +		  u64 start, u64 len)
> +{
> +	const u64 ino = btrfs_ino(inode);
> +	struct extent_state *cached_state = NULL;
> +	struct btrfs_path *path;
> +	struct btrfs_root *root = inode->root;
> +	struct fiemap_cache cache = { 0 };
> +	struct btrfs_backref_shared_cache *backref_cache;
> +	struct ulist *roots;
> +	struct ulist *tmp_ulist;
> +	u64 last_extent_end;
> +	u64 prev_extent_end;
> +	u64 lockstart;
> +	u64 lockend;
> +	bool stopped = false;
> +	int ret;
> +
> +	backref_cache = kzalloc(sizeof(*backref_cache), GFP_KERNEL);
> +	path = btrfs_alloc_path();
> +	roots = ulist_alloc(GFP_KERNEL);
> +	tmp_ulist = ulist_alloc(GFP_KERNEL);
> +	if (!backref_cache || !path || !roots || !tmp_ulist) {
> +		ret = -ENOMEM;
>   		goto out;
>   	}
>
> -	while (!end) {
> -		u64 offset_in_extent = 0;
> +	lockstart = round_down(start, btrfs_inode_sectorsize(inode));
> +	lockend = round_up(start + len, btrfs_inode_sectorsize(inode));
> +	prev_extent_end = lockstart;
>
> -		/* break if the extent we found is outside the range */
> -		if (em->start >= max || extent_map_end(em) < off)
> -			break;
> +	lock_extent_bits(&inode->io_tree, lockstart, lockend, &cached_state);
>
> -		/*
> -		 * get_extent may return an extent that starts before our
> -		 * requested range.  We have to make sure the ranges
> -		 * we return to fiemap always move forward and don't
> -		 * overlap, so adjust the offsets here
> -		 */
> -		em_start = max(em->start, off);
> +	ret = fiemap_find_last_extent_offset(inode, path, &last_extent_end);
> +	if (ret < 0)
> +		goto out_unlock;
> +	btrfs_release_path(path);
>
> +	path->reada = READA_FORWARD;
> +	ret = fiemap_search_slot(inode, path, lockstart);
> +	if (ret < 0) {
> +		goto out_unlock;
> +	} else if (ret > 0) {
>   		/*
> -		 * record the offset from the start of the extent
> -		 * for adjusting the disk offset below.  Only do this if the
> -		 * extent isn't compressed since our in ram offset may be past
> -		 * what we have actually allocated on disk.
> +		 * No file extent item found, but we may have delalloc between
> +		 * the current offset and i_size. So check for that.
>   		 */
> -		if (!test_bit(EXTENT_FLAG_COMPRESSED, &em->flags))
> -			offset_in_extent = em_start - em->start;
> -		em_end = extent_map_end(em);
> -		em_len = em_end - em_start;
> -		flags = 0;
> -		if (em->block_start < EXTENT_MAP_LAST_BYTE)
> -			disko = em->block_start + offset_in_extent;
> -		else
> -			disko = 0;
> +		ret = 0;
> +		goto check_eof_delalloc;
> +	}
> +
> +	while (prev_extent_end < lockend) {
> +		struct extent_buffer *leaf = path->nodes[0];
> +		struct btrfs_file_extent_item *ei;
> +		struct btrfs_key key;
> +		u64 extent_end;
> +		u64 extent_len;
> +		u64 extent_offset = 0;
> +		u64 extent_gen;
> +		u64 disk_bytenr = 0;
> +		u64 flags = 0;
> +		int extent_type;
> +		u8 compression;
> +
> +		btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
> +		if (key.objectid != ino || key.type != BTRFS_EXTENT_DATA_KEY)
> +			break;
> +
> +		extent_end = btrfs_file_extent_end(path);
>
>   		/*
> -		 * bump off for our next call to get_extent
> +		 * The first iteration can leave us at an extent item that ends
> +		 * before our range's start. Move to the next item.
>   		 */
> -		off = extent_map_end(em);
> -		if (off >= max)
> -			end = 1;
> -
> -		if (em->block_start == EXTENT_MAP_INLINE) {
> -			flags |= (FIEMAP_EXTENT_DATA_INLINE |
> -				  FIEMAP_EXTENT_NOT_ALIGNED);
> -		} else if (em->block_start == EXTENT_MAP_DELALLOC) {
> -			flags |= (FIEMAP_EXTENT_DELALLOC |
> -				  FIEMAP_EXTENT_UNKNOWN);
> -		} else if (fieinfo->fi_extents_max) {
> -			u64 extent_gen;
> -			u64 bytenr = em->block_start -
> -				(em->start - em->orig_start);
> +		if (extent_end <= lockstart)
> +			goto next_item;
>
> -			/*
> -			 * If two extent maps are merged, then their generation
> -			 * is set to the maximum between their generations.
> -			 * Otherwise its generation matches the one we have in
> -			 * corresponding file extent item. If we have a merged
> -			 * extent map, don't use its generation to speedup the
> -			 * sharedness check below.
> -			 */
> -			if (test_bit(EXTENT_FLAG_MERGED, &em->flags))
> -				extent_gen = 0;
> -			else
> -				extent_gen = em->generation;
> +		/* We have an implicit hole (NO_HOLES feature enabled). */
> +		if (prev_extent_end < key.offset) {
> +			const u64 range_end = min(key.offset, lockend) - 1;
>
> -			/*
> -			 * As btrfs supports shared space, this information
> -			 * can be exported to userspace tools via
> -			 * flag FIEMAP_EXTENT_SHARED.  If fi_extents_max == 0
> -			 * then we're just getting a count and we can skip the
> -			 * lookup stuff.
> -			 */
> -			ret = btrfs_is_data_extent_shared(root, btrfs_ino(inode),
> -							  bytenr, extent_gen,
> -							  roots, tmp_ulist,
> -							  backref_cache);
> -			if (ret < 0)
> -				goto out_free;
> -			if (ret)
> -				flags |= FIEMAP_EXTENT_SHARED;
> -			ret = 0;
> -		}
> -		if (test_bit(EXTENT_FLAG_COMPRESSED, &em->flags))
> -			flags |= FIEMAP_EXTENT_ENCODED;
> -		if (test_bit(EXTENT_FLAG_PREALLOC, &em->flags))
> -			flags |= FIEMAP_EXTENT_UNWRITTEN;
> +			ret = fiemap_process_hole(inode, fieinfo, &cache,
> +						  backref_cache, 0, 0, 0,
> +						  roots, tmp_ulist,
> +						  prev_extent_end, range_end);
> +			if (ret < 0) {
> +				goto out_unlock;
> +			} else if (ret > 0) {
> +				/* fiemap_fill_next_extent() told us to stop. */
> +				stopped = true;
> +				break;
> +			}
>
> -		free_extent_map(em);
> -		em = NULL;
> -		if ((em_start >= last) || em_len == (u64)-1 ||
> -		   (last == (u64)-1 && isize <= em_end)) {
> -			flags |= FIEMAP_EXTENT_LAST;
> -			end = 1;
> +			/* We've reached the end of the fiemap range, stop. */
> +			if (key.offset >= lockend) {
> +				stopped = true;
> +				break;
> +			}
>   		}
>
> -		/* now scan forward to see if this is really the last extent. */
> -		em = get_extent_skip_holes(inode, off, last_for_get_extent);
> -		if (IS_ERR(em)) {
> -			ret = PTR_ERR(em);
> -			goto out;
> +		extent_len = extent_end - key.offset;
> +		ei = btrfs_item_ptr(leaf, path->slots[0],
> +				    struct btrfs_file_extent_item);
> +		compression = btrfs_file_extent_compression(leaf, ei);
> +		extent_type = btrfs_file_extent_type(leaf, ei);
> +		extent_gen = btrfs_file_extent_generation(leaf, ei);
> +
> +		if (extent_type != BTRFS_FILE_EXTENT_INLINE) {
> +			disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, ei);
> +			if (compression == BTRFS_COMPRESS_NONE)
> +				extent_offset = btrfs_file_extent_offset(leaf, ei);
>   		}
> -		if (!em) {
> -			flags |= FIEMAP_EXTENT_LAST;
> -			end = 1;
> +
> +		if (compression != BTRFS_COMPRESS_NONE)
> +			flags |= FIEMAP_EXTENT_ENCODED;
> +
> +		if (extent_type == BTRFS_FILE_EXTENT_INLINE) {
> +			flags |= FIEMAP_EXTENT_DATA_INLINE;
> +			flags |= FIEMAP_EXTENT_NOT_ALIGNED;
> +			ret = emit_fiemap_extent(fieinfo, &cache, key.offset, 0,
> +						 extent_len, flags);
> +		} else if (extent_type == BTRFS_FILE_EXTENT_PREALLOC) {
> +			ret = fiemap_process_hole(inode, fieinfo, &cache,
> +						  backref_cache,
> +						  disk_bytenr, extent_offset,
> +						  extent_gen, roots, tmp_ulist,
> +						  key.offset, extent_end - 1);
> +		} else if (disk_bytenr == 0) {
> +			/* We have an explicit hole. */
> +			ret = fiemap_process_hole(inode, fieinfo, &cache,
> +						  backref_cache, 0, 0, 0,
> +						  roots, tmp_ulist,
> +						  key.offset, extent_end - 1);
> +		} else {
> +			/* We have a regular extent. */
> +			if (fieinfo->fi_extents_max) {
> +				ret = btrfs_is_data_extent_shared(root, ino,
> +								  disk_bytenr,
> +								  extent_gen,
> +								  roots,
> +								  tmp_ulist,
> +								  backref_cache);
> +				if (ret < 0)
> +					goto out_unlock;
> +				else if (ret > 0)
> +					flags |= FIEMAP_EXTENT_SHARED;
> +			}
> +
> +			ret = emit_fiemap_extent(fieinfo, &cache, key.offset,
> +						 disk_bytenr + extent_offset,
> +						 extent_len, flags);
>   		}
> -		ret = emit_fiemap_extent(fieinfo, &cache, em_start, disko,
> -					   em_len, flags);
> -		if (ret) {
> -			if (ret == 1)
> -				ret = 0;
> -			goto out_free;
> +
> +		if (ret < 0) {
> +			goto out_unlock;
> +		} else if (ret > 0) {
> +			/* fiemap_fill_next_extent() told us to stop. */
> +			stopped = true;
> +			break;
>   		}
>
> +		prev_extent_end = extent_end;
> +next_item:
>   		if (fatal_signal_pending(current)) {
>   			ret = -EINTR;
> -			goto out_free;
> +			goto out_unlock;
>   		}
> +
> +		ret = fiemap_next_leaf_item(inode, path);
> +		if (ret < 0) {
> +			goto out_unlock;
> +		} else if (ret > 0) {
> +			/* No more file extent items for this inode. */
> +			break;
> +		}
> +		cond_resched();
>   	}
> -out_free:
> -	if (!ret)
> -		ret = emit_last_fiemap_cache(fieinfo, &cache);
> -	free_extent_map(em);
> -out:
> -	unlock_extent_cached(&inode->io_tree, start, start + len - 1,
> -			     &cached_state);
>
> -out_free_ulist:
> +check_eof_delalloc:
> +	/*
> +	 * Release (and free) the path before emitting any final entries to
> +	 * fiemap_fill_next_extent() to keep lockdep happy. This is because
> +	 * once we find no more file extent items exist, we may have a
> +	 * non-cloned leaf, and fiemap_fill_next_extent() can trigger page
> +	 * faults when copying data to the user space buffer.
> +	 */
> +	btrfs_free_path(path);
> +	path = NULL;
> +
> +	if (!stopped && prev_extent_end < lockend) {
> +		ret = fiemap_process_hole(inode, fieinfo, &cache, backref_cache,
> +					  0, 0, 0, roots, tmp_ulist,
> +					  prev_extent_end, lockend - 1);
> +		if (ret < 0)
> +			goto out_unlock;
> +		prev_extent_end = lockend;
> +	}
> +
> +	if (cache.cached && cache.offset + cache.len >= last_extent_end) {
> +		const u64 i_size = i_size_read(&inode->vfs_inode);
> +
> +		if (prev_extent_end < i_size) {
> +			u64 delalloc_start;
> +			u64 delalloc_end;
> +			bool delalloc;
> +
> +			delalloc = btrfs_find_delalloc_in_range(inode,
> +								prev_extent_end,
> +								i_size - 1,
> +								&delalloc_start,
> +								&delalloc_end);
> +			if (!delalloc)
> +				cache.flags |= FIEMAP_EXTENT_LAST;
> +		} else {
> +			cache.flags |= FIEMAP_EXTENT_LAST;
> +		}
> +	}
> +
> +	ret = emit_last_fiemap_cache(fieinfo, &cache);
> +
> +out_unlock:
> +	unlock_extent_cached(&inode->io_tree, lockstart, lockend, &cached_state);
> +out:
>   	kfree(backref_cache);
>   	btrfs_free_path(path);
>   	ulist_free(roots);
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index b292a8ada3a4..636b3ec46184 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -3602,10 +3602,10 @@ static long btrfs_fallocate(struct file *file, int mode,
>   }
>
>   /*
> - * Helper for have_delalloc_in_range(). Find a subrange in a given range that
> - * has unflushed and/or flushing delalloc. There might be other adjacent
> - * subranges after the one it found, so have_delalloc_in_range() keeps looping
> - * while it gets adjacent subranges, and merging them together.
> + * Helper for btrfs_find_delalloc_in_range(). Find a subrange in a given range
> + * that has unflushed and/or flushing delalloc. There might be other adjacent
> + * subranges after the one it found, so btrfs_find_delalloc_in_range() keeps
> + * looping while it gets adjacent subranges, and merging them together.
>    */
>   static bool find_delalloc_subrange(struct btrfs_inode *inode, u64 start, u64 end,
>   				   u64 *delalloc_start_ret, u64 *delalloc_end_ret)
> @@ -3740,8 +3740,8 @@ static bool find_delalloc_subrange(struct btrfs_inode *inode, u64 start, u64 end
>    * if so it sets @delalloc_start_ret and @delalloc_end_ret with the start and
>    * end offsets of the subrange.
>    */
> -static bool have_delalloc_in_range(struct btrfs_inode *inode, u64 start, u64 end,
> -				   u64 *delalloc_start_ret, u64 *delalloc_end_ret)
> +bool btrfs_find_delalloc_in_range(struct btrfs_inode *inode, u64 start, u64 end,
> +				  u64 *delalloc_start_ret, u64 *delalloc_end_ret)
>   {
>   	u64 cur_offset = round_down(start, inode->root->fs_info->sectorsize);
>   	u64 prev_delalloc_end = 0;
> @@ -3804,8 +3804,8 @@ static bool find_desired_extent_in_hole(struct btrfs_inode *inode, int whence,
>   	u64 delalloc_end;
>   	bool delalloc;
>
> -	delalloc = have_delalloc_in_range(inode, start, end, &delalloc_start,
> -					  &delalloc_end);
> +	delalloc = btrfs_find_delalloc_in_range(inode, start, end,
> +						&delalloc_start, &delalloc_end);
>   	if (delalloc && whence == SEEK_DATA) {
>   		*start_ret = delalloc_start;
>   		return true;
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 2c7d31990777..8be1e021513a 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -7064,133 +7064,6 @@ struct extent_map *btrfs_get_extent(struct btrfs_inode *inode,
>   	return em;
>   }
>
> -struct extent_map *btrfs_get_extent_fiemap(struct btrfs_inode *inode,
> -					   u64 start, u64 len)
> -{
> -	struct extent_map *em;
> -	struct extent_map *hole_em = NULL;
> -	u64 delalloc_start = start;
> -	u64 end;
> -	u64 delalloc_len;
> -	u64 delalloc_end;
> -	int err = 0;
> -
> -	em = btrfs_get_extent(inode, NULL, 0, start, len);
> -	if (IS_ERR(em))
> -		return em;
> -	/*
> -	 * If our em maps to:
> -	 * - a hole or
> -	 * - a pre-alloc extent,
> -	 * there might actually be delalloc bytes behind it.
> -	 */
> -	if (em->block_start != EXTENT_MAP_HOLE &&
> -	    !test_bit(EXTENT_FLAG_PREALLOC, &em->flags))
> -		return em;
> -	else
> -		hole_em = em;
> -
> -	/* check to see if we've wrapped (len == -1 or similar) */
> -	end = start + len;
> -	if (end < start)
> -		end = (u64)-1;
> -	else
> -		end -= 1;
> -
> -	em = NULL;
> -
> -	/* ok, we didn't find anything, lets look for delalloc */
> -	delalloc_len = count_range_bits(&inode->io_tree, &delalloc_start,
> -				 end, len, EXTENT_DELALLOC, 1);
> -	delalloc_end = delalloc_start + delalloc_len;
> -	if (delalloc_end < delalloc_start)
> -		delalloc_end = (u64)-1;
> -
> -	/*
> -	 * We didn't find anything useful, return the original results from
> -	 * get_extent()
> -	 */
> -	if (delalloc_start > end || delalloc_end <= start) {
> -		em = hole_em;
> -		hole_em = NULL;
> -		goto out;
> -	}
> -
> -	/*
> -	 * Adjust the delalloc_start to make sure it doesn't go backwards from
> -	 * the start they passed in
> -	 */
> -	delalloc_start = max(start, delalloc_start);
> -	delalloc_len = delalloc_end - delalloc_start;
> -
> -	if (delalloc_len > 0) {
> -		u64 hole_start;
> -		u64 hole_len;
> -		const u64 hole_end = extent_map_end(hole_em);
> -
> -		em = alloc_extent_map();
> -		if (!em) {
> -			err = -ENOMEM;
> -			goto out;
> -		}
> -
> -		ASSERT(hole_em);
> -		/*
> -		 * When btrfs_get_extent can't find anything it returns one
> -		 * huge hole
> -		 *
> -		 * Make sure what it found really fits our range, and adjust to
> -		 * make sure it is based on the start from the caller
> -		 */
> -		if (hole_end <= start || hole_em->start > end) {
> -		       free_extent_map(hole_em);
> -		       hole_em = NULL;
> -		} else {
> -		       hole_start = max(hole_em->start, start);
> -		       hole_len = hole_end - hole_start;
> -		}
> -
> -		if (hole_em && delalloc_start > hole_start) {
> -			/*
> -			 * Our hole starts before our delalloc, so we have to
> -			 * return just the parts of the hole that go until the
> -			 * delalloc starts
> -			 */
> -			em->len = min(hole_len, delalloc_start - hole_start);
> -			em->start = hole_start;
> -			em->orig_start = hole_start;
> -			/*
> -			 * Don't adjust block start at all, it is fixed at
> -			 * EXTENT_MAP_HOLE
> -			 */
> -			em->block_start = hole_em->block_start;
> -			em->block_len = hole_len;
> -			if (test_bit(EXTENT_FLAG_PREALLOC, &hole_em->flags))
> -				set_bit(EXTENT_FLAG_PREALLOC, &em->flags);
> -		} else {
> -			/*
> -			 * Hole is out of passed range or it starts after
> -			 * delalloc range
> -			 */
> -			em->start = delalloc_start;
> -			em->len = delalloc_len;
> -			em->orig_start = delalloc_start;
> -			em->block_start = EXTENT_MAP_DELALLOC;
> -			em->block_len = delalloc_len;
> -		}
> -	} else {
> -		return hole_em;
> -	}
> -out:
> -
> -	free_extent_map(hole_em);
> -	if (err) {
> -		free_extent_map(em);
> -		return ERR_PTR(err);
> -	}
> -	return em;
> -}
> -
>   static struct extent_map *btrfs_create_dio_extent(struct btrfs_inode *inode,
>   						  const u64 start,
>   						  const u64 len,
> @@ -8265,15 +8138,14 @@ static int btrfs_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
>   	 * in the compression of data (in an async thread) and will return
>   	 * before the compression is done and writeback is started. A second
>   	 * filemap_fdatawrite_range() is needed to wait for the compression to
> -	 * complete and writeback to start. Without this, our user is very
> -	 * likely to get stale results, because the extents and extent maps for
> -	 * delalloc regions are only allocated when writeback starts.
> +	 * complete and writeback to start. We also need to wait for ordered
> +	 * extents to complete, because our fiemap implementation uses mainly
> +	 * file extent items to list the extents, searching for extent maps
> +	 * only for file ranges with holes or prealloc extents to figure out
> +	 * if we have delalloc in those ranges.
>   	 */
>   	if (fieinfo->fi_flags & FIEMAP_FLAG_SYNC) {
> -		ret = btrfs_fdatawrite_range(inode, 0, LLONG_MAX);
> -		if (ret)
> -			return ret;
> -		ret = filemap_fdatawait_range(inode->i_mapping, 0, LLONG_MAX);
> +		ret = btrfs_wait_ordered_range(inode, 0, LLONG_MAX);
>   		if (ret)
>   			return ret;
>   	}

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 00/10] btrfs: make lseek and fiemap much more efficient
  2022-09-01 13:18 [PATCH 00/10] btrfs: make lseek and fiemap much more efficient fdmanana
                   ` (9 preceding siblings ...)
  2022-09-01 13:18 ` [PATCH 10/10] btrfs: make fiemap more efficient and accurate reporting extent sharedness fdmanana
@ 2022-09-02  0:53 ` Wang Yugui
  2022-09-02  8:24   ` Filipe Manana
  2022-09-06 16:20 ` David Sterba
  2022-09-07  9:12 ` Christoph Hellwig
  12 siblings, 1 reply; 53+ messages in thread
From: Wang Yugui @ 2022-09-02  0:53 UTC (permalink / raw)
  To: fdmanana; +Cc: linux-btrfs

Hi,

> From: Filipe Manana <fdmanana@suse.com>
> 
> We often get reports of fiemap and hole/data seeking (lseek) being too slow
> on btrfs, or even unusable in some cases due to being extremely slow.
> 
> Some recent reports for fiemap:
> 
>     https://lore.kernel.org/linux-btrfs/21dd32c6-f1f9-f44a-466a-e18fdc6788a7@virtuozzo.com/
>     https://lore.kernel.org/linux-btrfs/Ysace25wh5BbLd5f@atmark-techno.com/
> 
> For lseek (LSF/MM from 2017):
> 
>    https://lwn.net/Articles/718805/
> 
> Basically both are slow due to very high algorithmic complexity which
> scales badly with the number of extents in a file and the height of
> subvolume and extent b+trees.
> 
> Using Pavel's test case (first Link tag for fiemap), which uses files with
> many 4K extents and holes before and after each extent (kind of a worst
> case scenario), the speedup is of several orders of magnitude (for the 1G
> file, from ~225 seconds down to ~0.1 seconds).
> 
> Finally the new algorithm for fiemap also ends up solving a bug with the
> current algorithm. This happens because we are currently relying on extent
> maps to report extents, which can be merged, and this may cause us to
> report 2 different extents as a single one that is not shared but one of
> them is shared (or the other way around). More details on this on patches
> 9/10 and 10/10.
> 
> Patches 1/10 and 2/10 are for lseek, introducing some code that will later
> be used by fiemap too (patch 10/10). More details in the changelogs.
> 
> There are a few more things that can be done to speedup fiemap and lseek,
> but I'll leave those other optimizations I have in mind for some other time.
> 
> Filipe Manana (10):
>   btrfs: allow hole and data seeking to be interruptible
>   btrfs: make hole and data seeking a lot more efficient
>   btrfs: remove check for impossible block start for an extent map at fiemap
>   btrfs: remove zero length check when entering fiemap
>   btrfs: properly flush delalloc when entering fiemap
>   btrfs: allow fiemap to be interruptible
>   btrfs: rename btrfs_check_shared() to a more descriptive name
>   btrfs: speedup checking for extent sharedness during fiemap
>   btrfs: skip unnecessary extent buffer sharedness checks during fiemap
>   btrfs: make fiemap more efficient and accurate reporting extent sharedness
> 
>  fs/btrfs/backref.c     | 153 ++++++++-
>  fs/btrfs/backref.h     |  20 +-
>  fs/btrfs/ctree.h       |  22 +-
>  fs/btrfs/extent-tree.c |  10 +-
>  fs/btrfs/extent_io.c   | 703 ++++++++++++++++++++++++++++-------------
>  fs/btrfs/file.c        | 439 +++++++++++++++++++++++--
>  fs/btrfs/inode.c       | 146 ++-------
>  7 files changed, 1111 insertions(+), 382 deletions(-)


An infinite loop happens when the 10 patches are applied to 6.0-rc3.

A file is created by 'pavels-test.c' of [PATCH 10/10], and then
'/bin/cp /mnt/test/file1 /dev/null' triggers an infinite loop.

'sysrq -l' output:

[ 1437.765228] Call Trace:
[ 1437.765228]  <TASK>
[ 1437.765228]  set_extent_bit+0x33d/0x6e0 [btrfs]
[ 1437.765228]  lock_extent_bits+0x64/0xa0 [btrfs]
[ 1437.765228]  btrfs_file_llseek+0x192/0x5b0 [btrfs]
[ 1437.765228]  ksys_lseek+0x64/0xb0
[ 1437.765228]  do_syscall_64+0x58/0x80
[ 1437.765228]  ? syscall_exit_to_user_mode+0x12/0x30
[ 1437.765228]  ? do_syscall_64+0x67/0x80
[ 1437.765228]  ? do_syscall_64+0x67/0x80
[ 1437.765228]  ? exc_page_fault+0x64/0x140
[ 1437.765228]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[ 1437.765228] RIP: 0033:0x7f5a263441bb

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2022/09/02



^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 00/10] btrfs: make lseek and fiemap much more efficient
  2022-09-02  0:53 ` [PATCH 00/10] btrfs: make lseek and fiemap much more efficient Wang Yugui
@ 2022-09-02  8:24   ` Filipe Manana
  2022-09-02 11:41     ` Wang Yugui
  2022-09-02 11:45     ` Filipe Manana
  0 siblings, 2 replies; 53+ messages in thread
From: Filipe Manana @ 2022-09-02  8:24 UTC (permalink / raw)
  To: Wang Yugui; +Cc: linux-btrfs

On Fri, Sep 2, 2022 at 2:09 AM Wang Yugui <wangyugui@e16-tech.com> wrote:
>
> Hi,
>
> > From: Filipe Manana <fdmanana@suse.com>
> >
> > We often get reports of fiemap and hole/data seeking (lseek) being too slow
> > on btrfs, or even unusable in some cases due to being extremely slow.
> >
> > Some recent reports for fiemap:
> >
> >     https://lore.kernel.org/linux-btrfs/21dd32c6-f1f9-f44a-466a-e18fdc6788a7@virtuozzo.com/
> >     https://lore.kernel.org/linux-btrfs/Ysace25wh5BbLd5f@atmark-techno.com/
> >
> > For lseek (LSF/MM from 2017):
> >
> >    https://lwn.net/Articles/718805/
> >
> > Basically both are slow due to very high algorithmic complexity which
> > scales badly with the number of extents in a file and the height of
> > subvolume and extent b+trees.
> >
> > Using Pavel's test case (first Link tag for fiemap), which uses files with
> > many 4K extents and holes before and after each extent (kind of a worst
> > case scenario), the speedup is of several orders of magnitude (for the 1G
> > file, from ~225 seconds down to ~0.1 seconds).
> >
> > Finally the new algorithm for fiemap also ends up solving a bug with the
> > current algorithm. This happens because we are currently relying on extent
> > maps to report extents, which can be merged, and this may cause us to
> > report 2 different extents as a single one that is not shared but one of
> > them is shared (or the other way around). More details on this on patches
> > 9/10 and 10/10.
> >
> > Patches 1/10 and 2/10 are for lseek, introducing some code that will later
> > be used by fiemap too (patch 10/10). More details in the changelogs.
> >
> > There are a few more things that can be done to speedup fiemap and lseek,
> > but I'll leave those other optimizations I have in mind for some other time.
> >
> > Filipe Manana (10):
> >   btrfs: allow hole and data seeking to be interruptible
> >   btrfs: make hole and data seeking a lot more efficient
> >   btrfs: remove check for impossible block start for an extent map at fiemap
> >   btrfs: remove zero length check when entering fiemap
> >   btrfs: properly flush delalloc when entering fiemap
> >   btrfs: allow fiemap to be interruptible
> >   btrfs: rename btrfs_check_shared() to a more descriptive name
> >   btrfs: speedup checking for extent sharedness during fiemap
> >   btrfs: skip unnecessary extent buffer sharedness checks during fiemap
> >   btrfs: make fiemap more efficient and accurate reporting extent sharedness
> >
> >  fs/btrfs/backref.c     | 153 ++++++++-
> >  fs/btrfs/backref.h     |  20 +-
> >  fs/btrfs/ctree.h       |  22 +-
> >  fs/btrfs/extent-tree.c |  10 +-
> >  fs/btrfs/extent_io.c   | 703 ++++++++++++++++++++++++++++-------------
> >  fs/btrfs/file.c        | 439 +++++++++++++++++++++++--
> >  fs/btrfs/inode.c       | 146 ++-------
> >  7 files changed, 1111 insertions(+), 382 deletions(-)
>
>
> An infinite loop happens when the 10 patches are applied to 6.0-rc3.

Nope, it's not an infinite loop, and it happens as well before the patchset.
The reason is that the files created by the test are very sparse and
have small extents.
They are full of 4K extents surrounded by 8K holes.

So anyone doing hole seeking advances 8K on every lseek call.
If you strace the cp process, with

strace -p <cp pid>

You'll see something like this filling your terminal:

(...)
lseek(3, 18808832, SEEK_SET)            = 18808832
write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
4096) = 4096
read(3, "a\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
4096) = 4096
write(4, "a\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
4096) = 4096
lseek(3, 18817024, SEEK_SET)            = 18817024
write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
4096) = 4096
read(3, "a\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
4096) = 4096
write(4, "a\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
4096) = 4096
lseek(3, 18825216, SEEK_SET)            = 18825216
write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
4096) = 4096
read(3, "a\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
4096) = 4096
write(4, "a\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
4096) = 4096
lseek(3, 18833408, SEEK_SET)            = 18833408
write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
4096) = 4096
read(3, "a\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
4096) = 4096
write(4, "a\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
4096) = 4096
lseek(3, 18841600, SEEK_SET)            = 18841600
write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
4096) = 4096
read(3, "a\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
4096) = 4096
write(4, "a\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
4096) = 4096
lseek(3, 18849792, SEEK_SET)            = 18849792
write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
4096) = 4096
read(3, "a\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
4096) = 4096
write(4, "a\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
4096) = 4096
lseek(3, 18857984, SEEK_SET)            = 18857984
(...)

It takes a long time, but it finishes. If you look closely, the difference
between each return value is exactly 8K.

That happens both before and after the patchset.
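
For illustration, a minimal sketch of a sparse-aware copy loop of that
kind (one SEEK_DATA plus one SEEK_HOLE call per data extent; this is only
an approximation of the pattern, not cp's actual implementation):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        char buf[4096];
        off_t data, hole, pos = 0;
        int in, out;

        if (argc != 3) {
            fprintf(stderr, "usage: %s <src> <dst>\n", argv[0]);
            return 1;
        }
        in = open(argv[1], O_RDONLY);
        out = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (in < 0 || out < 0) {
            perror("open");
            return 1;
        }
        /* One SEEK_DATA and one SEEK_HOLE call per data extent. */
        while ((data = lseek(in, pos, SEEK_DATA)) >= 0) {
            hole = lseek(in, data, SEEK_HOLE);
            if (hole < 0)
                break;
            /* Copy only the data extent, skipping the hole entirely. */
            while (data < hole) {
                size_t want = sizeof(buf);
                ssize_t n;

                if ((off_t)want > hole - data)
                    want = hole - data;
                n = pread(in, buf, want, data);
                if (n <= 0)
                    break;
                if (pwrite(out, buf, n, data) != n)
                    break;
                data += n;
            }
            pos = hole;
        }
        close(in);
        close(out);
        return 0;
    }

For a file made of small extents each followed by a hole, every iteration
of the outer loop advances the offset by one extent-plus-hole step, which
is consistent with the ~8K strides seen in the strace output above.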

Thanks.


>
> A file is created by 'pavels-test.c' of [PATCH 10/10], and then
> '/bin/cp /mnt/test/file1 /dev/null' triggers an infinite loop.
>
> 'sysrq -l' output:
>
> [ 1437.765228] Call Trace:
> [ 1437.765228]  <TASK>
> [ 1437.765228]  set_extent_bit+0x33d/0x6e0 [btrfs]
> [ 1437.765228]  lock_extent_bits+0x64/0xa0 [btrfs]
> [ 1437.765228]  btrfs_file_llseek+0x192/0x5b0 [btrfs]
> [ 1437.765228]  ksys_lseek+0x64/0xb0
> [ 1437.765228]  do_syscall_64+0x58/0x80
> [ 1437.765228]  ? syscall_exit_to_user_mode+0x12/0x30
> [ 1437.765228]  ? do_syscall_64+0x67/0x80
> [ 1437.765228]  ? do_syscall_64+0x67/0x80
> [ 1437.765228]  ? exc_page_fault+0x64/0x140
> [ 1437.765228]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
> [ 1437.765228] RIP: 0033:0x7f5a263441bb
>
> Best Regards
> Wang Yugui (wangyugui@e16-tech.com)
> 2022/09/02
>
>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 02/10] btrfs: make hole and data seeking a lot more efficient
  2022-09-01 22:18   ` Qu Wenruo
@ 2022-09-02  8:36     ` Filipe Manana
  0 siblings, 0 replies; 53+ messages in thread
From: Filipe Manana @ 2022-09-02  8:36 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Thu, Sep 1, 2022 at 11:18 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
>
> On 2022/9/1 21:18, fdmanana@kernel.org wrote:
> > From: Filipe Manana <fdmanana@suse.com>
> >
> > The current implementation of hole and data seeking for llseek does not
> > scale well in regards to the number of extents and the distance between
> > the start offset and the next hole or extent. This is due to a very high
> > algorithmic complexity. Often we also get reports of btrfs' hole and data
> > seeking (llseek) being too slow, such as at 2017's LSFMM (see the Link
> > tag at the bottom).
> >
> > In order to better understand it, let's consider the case where the start
> > offset is 0, we are seeking for a hole and the file size is 16G. Between
> > file offset 0 and the first hole in the file there are 100K extents - this
> > is common for large files, especially if we have compression enabled, since
> > the maximum extent size is limited to 128K. The steps taken by the main
> > loop of the current algorithm are the following:
> >
> > 1) We start by calling btrfs_get_extent_fiemap(), for file offset 0, which
> >     calls btrfs_get_extent(). This will first lookup for an extent map in
> >     the inode's extent map tree (a red black tree). If the extent map is
> >     not loaded in memory, then it will do a lookup for the corresponding
> >     file extent item in the subvolume's b+tree, create an extent map based
> >     on the contents of the file extent item and then add the extent map to
> >     the extent map tree of the inode;
> >
> > 2) The second iteration calls btrfs_get_extent_fiemap() again, this time
> >     with a start offset matching the end offset of the previous extent.
> >     Again, btrfs_get_extent() will first search the extent map tree, and
> >     if it doesn't find an extent map there, it will again search in the
> >     b+tree of the subvolume for a matching file extent item, build an
> >     extent map based on the file extent item, and add the extent map
> >     to the extent map tree of the inode;
> >
> > 3) This repeats over and over until we find the first hole (when seeking
> >     for holes) or until we find the first extent (when seeking for data).
> >
> >     If there are no extent maps loaded in memory, then on
> >     each iteration we do 1 extent map tree search, 1 b+tree search, plus
> >     1 more extent map tree traversal to insert an extent map - plus we
> >     allocate memory for the extent map.
>
> I'm a little interested in whether we have other workloads which rely
> heavily on extent map tree searches?
>
> If so, would it make sense to load a batch of file extent items into
> the extent map tree in one go?

It depends on the workload...
I don't recall any other case, plus in the case of lseek and fiemap we
don't need
to load the extent maps at all.

>
>
> Another thing, not that related to the patchset: since extent maps
> don't really get freed up unless the inode is evicted/truncated, I'm

It also gets freed on page/folio release.

> wondering whether it would be a problem for heavily fragmented files to
> take up too much memory just for the extent map tree?
>
> Would we need a way to drop extent maps in the future?

Extent maps are necessary for an efficient fsync for example.

There's a problem with extent maps that I want to address, but that's
unrelated to lseek and fiemap; it's quite a separate issue.

>
> >
> >     On each iteration we are growing the size of the extent map tree,
> >     making each future search slower, and also visiting the same b+tree
> >     leaves over and over again - taking into account that with the default leaf
> >     size of 16K we can fit more than 200 file extent items in a leaf - so
> >     we can visit the same b+tree leaf 200+ times, on each visit walking
> >     down a path from the root to the leaf.
> >
> > So it's easy to see that what we have now doesn't scale well. Also, it
> > loads an extent map for every file extent item into memory, which is not
> > efficient - we should add extent maps only when doing IO (writing or
> > reading file data).
> >
> > This change implements a new algorithm which scales much better, and
> > works like this:
> >
> > 1) We iterate over the subvolume's b+tree, visiting each leaf that has
> >     file extent items once and only once;
> >
> > 2) For any file extent items found that don't represent holes or prealloc
> >     extents, it will not search the extent map tree - there's no need at
> >     all for that - an extent map is just an in-memory representation of a
> >     file extent item;
> >
> > 3) When a hole is found, or a prealloc extent, it will check if there's
> >     delalloc for its range. For this it will search for EXTENT_DELALLOC
> >     bits in the inode's io tree and check the extent map tree - this is
> >     for accounting for unflushed delalloc and for flushed delalloc (the
> >     period between running delalloc and ordered extent completion),
> >     respectively. This is similar to what the current implementation does
> >     when it finds a hole or prealloc extent, but without creating extent
> >     maps and adding them to the extent map tree in case they are not
> >     loaded in memory;
>
> Would it be possible that, before we start the subvolume tree search,
> just run all delalloc of that target inode and prevent new writes, so we
> can forget about the delalloc situation completely?

That would be a significant behavioural change.
ext4 and xfs don't seem to do it, and given that lseek is widely used (by cp
for example), making such a change would possibly result in people reporting
extra latency. I understand your point of view of reducing/simplifying the
code, but from a user's perspective I don't see a justification to flush
delalloc and wait for IO (and ordered extents) to complete.

>
> >
> > 4) It never allocates extent maps, or adds extent maps to the inode's
> >     extent map tree. This not only saves memory and time (from the tree
> >     insertions and allocations), but also eliminates the possibility of
> >     -ENOMEM due to allocating too many extent maps.
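
For point 1 above, and taking the 131073 extents test file used below as an
example, with the default 16K leaves holding 200+ file extent items this
means visiting roughly 650 b+tree leaves, each one a single time, instead of
doing 131073 full root to leaf searches.
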
> >
> > Part of this new code will also be used later for fiemap (which also
> > suffers similar scalability problems).
> >
> > The following test example can be used to quickly measure the efficiency
> > before and after this patch:
> >
> >      $ cat test-seek-hole.sh
> >      #!/bin/bash
> >
> >      DEV=/dev/sdi
> >      MNT=/mnt/sdi
> >
> >      mkfs.btrfs -f $DEV
> >
> >      mount -o compress=lzo $DEV $MNT
> >
> >      # 16G file -> 131073 compressed extents.
> >      xfs_io -f -c "pwrite -S 0xab -b 1M 0 16G" $MNT/foobar
> >
> >      # Leave a 1M hole at file offset 15G.
> >      xfs_io -c "fpunch 15G 1M" $MNT/foobar
> >
> >      # Unmount and mount again, so that we can test when there's no
> >      # metadata cached in memory.
> >      umount $MNT
> >      mount -o compress=lzo $DEV $MNT
> >
> >      # Test seeking for hole from offset 0 (hole is at offset 15G).
> >
> >      start=$(date +%s%N)
> >      xfs_io -c "seek -h 0" $MNT/foobar
> >      end=$(date +%s%N)
> >      dur=$(( (end - start) / 1000000 ))
> >      echo "Took $dur milliseconds to seek first hole (metadata not cached)"
> >      echo
> >
> >      start=$(date +%s%N)
> >      xfs_io -c "seek -h 0" $MNT/foobar
> >      end=$(date +%s%N)
> >      dur=$(( (end - start) / 1000000 ))
> >      echo "Took $dur milliseconds to seek first hole (metadata cached)"
> >      echo
> >
> >      umount $MNT
> >
> > Before this change:
> >
> >      $ ./test-seek-hole.sh
> >      (...)
> >      Whence   Result
> >      HOLE     16106127360
> >      Took 176 milliseconds to seek first hole (metadata not cached)
> >
> >      Whence   Result
> >      HOLE     16106127360
> >      Took 17 milliseconds to seek first hole (metadata cached)
> >
> > After this change:
> >
> >      $ ./test-seek-hole.sh
> >      (...)
> >      Whence   Result
> >      HOLE     16106127360
> >      Took 43 milliseconds to seek first hole (metadata not cached)
> >
> >      Whence   Result
> >      HOLE     16106127360
> >      Took 13 milliseconds to seek first hole (metadata cached)
> >
> > That's about 4X faster when no metadata is cached and about 30% faster
> > when all metadata is cached.
> >
> > In practice the differences may often be significantly higher, either due
> > to a higher number of extents in a file or because the subvolume's b+tree
> > is much bigger than in this example, where we only have one file.
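
The same measurement can also be done without xfs_io, with a few lines of C
using lseek(2) and SEEK_HOLE - a minimal sketch, with error handling kept to
a minimum (on glibc, SEEK_HOLE requires _GNU_SOURCE):

    /* Print the offset of the first hole from offset 0, equivalent to
     * running `xfs_io -c "seek -h 0" <file>`. */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
            off_t hole;
            int fd;

            if (argc != 2)
                    return 1;
            fd = open(argv[1], O_RDONLY);
            if (fd < 0) {
                    perror("open");
                    return 1;
            }
            /* Ask for the first hole at or after offset 0. */
            hole = lseek(fd, 0, SEEK_HOLE);
            if (hole == (off_t)-1)
                    perror("lseek");
            else
                    printf("HOLE %lld\n", (long long)hole);
            close(fd);
            return 0;
    }

Running it under time, with and without cached metadata, gives the same kind
of comparison as the script above.
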
> >
> > Link: https://lwn.net/Articles/718805/
> > Signed-off-by: Filipe Manana <fdmanana@suse.com>
> > ---
> >   fs/btrfs/file.c | 437 ++++++++++++++++++++++++++++++++++++++++++++----
> >   1 file changed, 406 insertions(+), 31 deletions(-)
> >
> > diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> > index 96f444ad0951..b292a8ada3a4 100644
> > --- a/fs/btrfs/file.c
> > +++ b/fs/btrfs/file.c
> > @@ -3601,22 +3601,281 @@ static long btrfs_fallocate(struct file *file, int mode,
> >       return ret;
> >   }
> >
> > +/*
> > + * Helper for have_delalloc_in_range(). Find a subrange in a given range that
> > + * has unflushed and/or flushing delalloc. There might be other adjacent
> > + * subranges after the one it found, so have_delalloc_in_range() keeps looping
> > + * while it gets adjacent subranges, and merging them together.
> > + */
> > +static bool find_delalloc_subrange(struct btrfs_inode *inode, u64 start, u64 end,
> > +                                u64 *delalloc_start_ret, u64 *delalloc_end_ret)
> > +{
> > +     const u64 len = end + 1 - start;
> > +     struct extent_map_tree *em_tree = &inode->extent_tree;
> > +     struct extent_map *em;
> > +     u64 em_end;
> > +     u64 delalloc_len;
> > +
> > +     /*
> > +      * Search the io tree first for EXTENT_DELALLOC. If we find any, it
> > +      * means we have delalloc (dirty pages) for which writeback has not
> > +      * started yet.
> > +      */
> > +     *delalloc_start_ret = start;
> > +     delalloc_len = count_range_bits(&inode->io_tree, delalloc_start_ret, end,
> > +                                     len, EXTENT_DELALLOC, 1);
> > +     /*
> > +      * If delalloc was found then *delalloc_start_ret has a sector size
> > +      * aligned value (rounded down).
> > +      */
> > +     if (delalloc_len > 0)
> > +             *delalloc_end_ret = *delalloc_start_ret + delalloc_len - 1;
> > +
> > +     /*
> > +      * Now also check if there's any extent map in the range that does not
> > +      * map to a hole or prealloc extent. We do this because:
> > +      *
> > +      * 1) When delalloc is flushed, the file range is locked, we clear the
> > +      *    EXTENT_DELALLOC bit from the io tree and create an extent map for
> > +      *    an allocated extent. So we might just have been called after
> > +      *    delalloc is flushed and before the ordered extent completes and
> > +      *    inserts the new file extent item in the subvolume's btree;
> > +      *
> > +      * 2) We may have an extent map created by flushing delalloc for a
> > +      *    subrange that starts before the subrange we found marked with
> > +      *    EXTENT_DELALLOC in the io tree.
> > +      */
> > +     read_lock(&em_tree->lock);
> > +     em = lookup_extent_mapping(em_tree, start, len);
> > +     read_unlock(&em_tree->lock);
> > +
> > +     /* extent_map_end() returns a non-inclusive end offset. */
> > +     em_end = em ? extent_map_end(em) : 0;
> > +
> > +     /*
> > +      * If we have a hole/prealloc extent map, check the next one if this one
> > +      * ends before our range's end.
> > +      */
> > +     if (em && (em->block_start == EXTENT_MAP_HOLE ||
> > +                test_bit(EXTENT_FLAG_PREALLOC, &em->flags)) && em_end < end) {
> > +             struct extent_map *next_em;
> > +
> > +             read_lock(&em_tree->lock);
> > +             next_em = lookup_extent_mapping(em_tree, em_end, len - em_end);
> > +             read_unlock(&em_tree->lock);
> > +
> > +             free_extent_map(em);
> > +             em_end = next_em ? extent_map_end(next_em) : 0;
> > +             em = next_em;
> > +     }
> > +
> > +     if (em && (em->block_start == EXTENT_MAP_HOLE ||
> > +                test_bit(EXTENT_FLAG_PREALLOC, &em->flags))) {
> > +             free_extent_map(em);
> > +             em = NULL;
> > +     }
> > +
> > +     /*
> > +      * No extent map or one for a hole or prealloc extent. Use the delalloc
> > +      * range we found in the io tree if we have one.
> > +      */
> > +     if (!em)
> > +             return (delalloc_len > 0);
> > +
> > +     /*
> > +      * We don't have any range as EXTENT_DELALLOC in the io tree, so the
> > +      * extent map is the only subrange representing delalloc.
> > +      */
> > +     if (delalloc_len == 0) {
> > +             *delalloc_start_ret = em->start;
> > +             *delalloc_end_ret = min(end, em_end - 1);
> > +             free_extent_map(em);
> > +             return true;
> > +     }
> > +
> > +     /*
> > +      * The extent map represents a delalloc range that starts before the
> > +      * delalloc range we found in the io tree.
> > +      */
> > +     if (em->start < *delalloc_start_ret) {
> > +             *delalloc_start_ret = em->start;
> > +             /*
> > +              * If the ranges are adjacent, return a combined range.
> > +              * Otherwise return the extent map's range.
> > +              */
> > +             if (em_end < *delalloc_start_ret)
> > +                     *delalloc_end_ret = min(end, em_end - 1);
> > +
> > +             free_extent_map(em);
> > +             return true;
> > +     }
> > +
> > +     /*
> > +      * The extent map starts after the delalloc range we found in the io
> > +      * tree. If it's adjacent, return a combined range, otherwise return
> > +      * the range found in the io tree.
> > +      */
> > +     if (*delalloc_end_ret + 1 == em->start)
> > +             *delalloc_end_ret = min(end, em_end - 1);
> > +
> > +     free_extent_map(em);
> > +     return true;
> > +}
> > +
> > +/*
> > + * Check if there's delalloc in a given range.
> > + *
> > + * @inode:               The inode.
> > + * @start:               The start offset of the range. It does not need to be
> > + *                       sector size aligned.
> > + * @end:                 The end offset (inclusive value) of the search range.
> > + *                       It does not need to be sector size aligned.
> > + * @delalloc_start_ret:  Output argument, set to the start offset of the
> > + *                       subrange found with delalloc (may not be sector size
> > + *                       aligned).
> > + * @delalloc_end_ret:    Output argument, set to the end offset (inclusive value)
> > + *                       of the subrange found with delalloc.
> > + *
> > + * Returns true if a subrange with delalloc is found within the given range, and
> > + * if so it sets @delalloc_start_ret and @delalloc_end_ret with the start and
> > + * end offsets of the subrange.
> > + */
> > +static bool have_delalloc_in_range(struct btrfs_inode *inode, u64 start, u64 end,
> > +                                u64 *delalloc_start_ret, u64 *delalloc_end_ret)
> > +{
> > +     u64 cur_offset = round_down(start, inode->root->fs_info->sectorsize);
> > +     u64 prev_delalloc_end = 0;
> > +     bool ret = false;
> > +
> > +     while (cur_offset < end) {
> > +             u64 delalloc_start;
> > +             u64 delalloc_end;
> > +             bool delalloc;
> > +
> > +             delalloc = find_delalloc_subrange(inode, cur_offset, end,
> > +                                               &delalloc_start,
> > +                                               &delalloc_end);
> > +             if (!delalloc)
> > +                     break;
> > +
> > +             if (prev_delalloc_end == 0) {
> > +                     /* First subrange found. */
> > +                     *delalloc_start_ret = max(delalloc_start, start);
> > +                     *delalloc_end_ret = delalloc_end;
> > +                     ret = true;
> > +             } else if (delalloc_start == prev_delalloc_end + 1) {
> > +                     /* Subrange adjacent to the previous one, merge them. */
> > +                     *delalloc_end_ret = delalloc_end;
> > +             } else {
> > +                     /* Subrange not adjacent to the previous one, exit. */
> > +                     break;
> > +             }
> > +
> > +             prev_delalloc_end = delalloc_end;
> > +             cur_offset = delalloc_end + 1;
> > +             cond_resched();
> > +     }
> > +
> > +     return ret;
> > +}
> > +
> > +/*
> > + * Check if there's a hole or delalloc range in a range representing a hole (or
> > + * prealloc extent) found in the inode's subvolume btree.
> > + *
> > + * @inode:      The inode.
> > + * @whence:     Seek mode (SEEK_DATA or SEEK_HOLE).
> > + * @start:      Start offset of the hole region. It does not need to be sector
> > + *              size aligned.
> > + * @end:        End offset (inclusive value) of the hole region. It does not
> > + *              need to be sector size aligned.
> > + * @start_ret:  Return parameter, used to set the start of the subrange in the
> > + *              hole that matches the search criteria (seek mode), if such
> > + *              subrange is found (return value of the function is true).
> > + *              The value returned here may not be sector size aligned.
> > + *
> > + * Returns true if a subrange matching the given seek mode is found, and if one
> > + * is found, it updates @start_ret with the start of the subrange.
> > + */
> > +static bool find_desired_extent_in_hole(struct btrfs_inode *inode, int whence,
> > +                                     u64 start, u64 end, u64 *start_ret)
> > +{
> > +     u64 delalloc_start;
> > +     u64 delalloc_end;
> > +     bool delalloc;
> > +
> > +     delalloc = have_delalloc_in_range(inode, start, end, &delalloc_start,
> > +                                       &delalloc_end);
> > +     if (delalloc && whence == SEEK_DATA) {
> > +             *start_ret = delalloc_start;
> > +             return true;
> > +     }
> > +
> > +     if (delalloc && whence == SEEK_HOLE) {
> > +             /*
> > +              * We found delalloc but it starts after our start offset. So we
> > +              * have a hole between our start offset and the delalloc start.
> > +              */
> > +             if (start < delalloc_start) {
> > +                     *start_ret = start;
> > +                     return true;
> > +             }
> > +             /*
> > +              * Delalloc range starts at our start offset.
> > +              * If the delalloc range's length is smaller than our range,
> > +              * then it means we have a hole that starts where the delalloc
> > +              * subrange ends.
> > +              */
> > +             if (delalloc_end < end) {
> > +                     *start_ret = delalloc_end + 1;
> > +                     return true;
> > +             }
> > +
> > +             /* There's delalloc for the whole range. */
> > +             return false;
> > +     }
> > +
> > +     if (!delalloc && whence == SEEK_HOLE) {
> > +             *start_ret = start;
> > +             return true;
> > +     }
> > +
> > +     /*
> > +      * No delalloc in the range and we are seeking for data. The caller has
> > +      * to iterate to the next extent item in the subvolume btree.
> > +      */
> > +     return false;
> > +}
> > +
> >   static loff_t find_desired_extent(struct btrfs_inode *inode, loff_t offset,
> >                                 int whence)
> >   {
> >       struct btrfs_fs_info *fs_info = inode->root->fs_info;
> > -     struct extent_map *em = NULL;
> >       struct extent_state *cached_state = NULL;
> > -     loff_t i_size = inode->vfs_inode.i_size;
> > +     const loff_t i_size = i_size_read(&inode->vfs_inode);
> > +     const u64 ino = btrfs_ino(inode);
> > +     struct btrfs_root *root = inode->root;
> > +     struct btrfs_path *path;
> > +     struct btrfs_key key;
> > +     u64 last_extent_end;
> >       u64 lockstart;
> >       u64 lockend;
> >       u64 start;
> > -     u64 len;
> > -     int ret = 0;
> > +     int ret;
> > +     bool found = false;
> >
> >       if (i_size == 0 || offset >= i_size)
> >               return -ENXIO;
> >
> > +     /*
> > +      * Quick path. If the inode has no prealloc extents and its number of
> > +      * bytes used matches its i_size, then it can not have holes.
> > +      */
> > +     if (whence == SEEK_HOLE &&
> > +         !(inode->flags & BTRFS_INODE_PREALLOC) &&
> > +         inode_get_bytes(&inode->vfs_inode) == i_size)
> > +             return i_size;
> > +
>
> Would we need a counterpart quick path for all-holes files?

I suppose you mean a file with an i_size > 0 and no extents at all.
That is already fast... the first btrfs_search_slot() will not find
any extent item and we finish very quickly.

Thanks.

>
> Thanks,
> Qu
>
> >       /*
> >        * offset can be negative, in this case we start finding DATA/HOLE from
> >        * the very start of the file.
> > @@ -3628,49 +3887,165 @@ static loff_t find_desired_extent(struct btrfs_inode *inode, loff_t offset,
> >       if (lockend <= lockstart)
> >               lockend = lockstart + fs_info->sectorsize;
> >       lockend--;
> > -     len = lockend - lockstart + 1;
> > +
> > +     path = btrfs_alloc_path();
> > +     if (!path)
> > +             return -ENOMEM;
> > +     path->reada = READA_FORWARD;
> > +
> > +     key.objectid = ino;
> > +     key.type = BTRFS_EXTENT_DATA_KEY;
> > +     key.offset = start;
> > +
> > +     last_extent_end = lockstart;
> >
> >       lock_extent_bits(&inode->io_tree, lockstart, lockend, &cached_state);
> >
> > +     ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
> > +     if (ret < 0) {
> > +             goto out;
> > +     } else if (ret > 0 && path->slots[0] > 0) {
> > +             btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0] - 1);
> > +             if (key.objectid == ino && key.type == BTRFS_EXTENT_DATA_KEY)
> > +                     path->slots[0]--;
> > +     }
> > +
> >       while (start < i_size) {
> > -             em = btrfs_get_extent_fiemap(inode, start, len);
> > -             if (IS_ERR(em)) {
> > -                     ret = PTR_ERR(em);
> > -                     em = NULL;
> > -                     break;
> > +             struct extent_buffer *leaf = path->nodes[0];
> > +             struct btrfs_file_extent_item *extent;
> > +             u64 extent_end;
> > +
> > +             if (path->slots[0] >= btrfs_header_nritems(leaf)) {
> > +                     ret = btrfs_next_leaf(root, path);
> > +                     if (ret < 0)
> > +                             goto out;
> > +                     else if (ret > 0)
> > +                             break;
> > +
> > +                     leaf = path->nodes[0];
> >               }
> >
> > -             if (whence == SEEK_HOLE &&
> > -                 (em->block_start == EXTENT_MAP_HOLE ||
> > -                  test_bit(EXTENT_FLAG_PREALLOC, &em->flags)))
> > -                     break;
> > -             else if (whence == SEEK_DATA &&
> > -                        (em->block_start != EXTENT_MAP_HOLE &&
> > -                         !test_bit(EXTENT_FLAG_PREALLOC, &em->flags)))
> > +             btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
> > +             if (key.objectid != ino || key.type != BTRFS_EXTENT_DATA_KEY)
> >                       break;
> >
> > -             start = em->start + em->len;
> > -             free_extent_map(em);
> > -             em = NULL;
> > +             extent_end = btrfs_file_extent_end(path);
> > +
> > +             /*
> > +              * In the first iteration we may have a slot that points to an
> > +              * extent that ends before our start offset, so skip it.
> > +              */
> > +             if (extent_end <= start) {
> > +                     path->slots[0]++;
> > +                     continue;
> > +             }
> > +
> > +             /* We have an implicit hole, NO_HOLES feature is likely set. */
> > +             if (last_extent_end < key.offset) {
> > +                     u64 search_start = last_extent_end;
> > +                     u64 found_start;
> > +
> > +                     /*
> > +                      * First iteration, @start matches @offset and it's
> > +                      * within the hole.
> > +                      */
> > +                     if (start == offset)
> > +                             search_start = offset;
> > +
> > +                     found = find_desired_extent_in_hole(inode, whence,
> > +                                                         search_start,
> > +                                                         key.offset - 1,
> > +                                                         &found_start);
> > +                     if (found) {
> > +                             start = found_start;
> > +                             break;
> > +                     }
> > +                     /*
> > +                      * Didn't find data or a hole (due to delalloc) in the
> > +                      * implicit hole range, so need to analyze the extent.
> > +                      */
> > +             }
> > +
> > +             extent = btrfs_item_ptr(leaf, path->slots[0],
> > +                                     struct btrfs_file_extent_item);
> > +
> > +             if (btrfs_file_extent_disk_bytenr(leaf, extent) == 0 ||
> > +                 btrfs_file_extent_type(leaf, extent) ==
> > +                 BTRFS_FILE_EXTENT_PREALLOC) {
> > +                     /*
> > +                      * Explicit hole or prealloc extent, search for delalloc.
> > +                      * A prealloc extent is treated like a hole.
> > +                      */
> > +                     u64 search_start = key.offset;
> > +                     u64 found_start;
> > +
> > +                     /*
> > +                      * First iteration, @start matches @offset and it's
> > +                      * within the hole.
> > +                      */
> > +                     if (start == offset)
> > +                             search_start = offset;
> > +
> > +                     found = find_desired_extent_in_hole(inode, whence,
> > +                                                         search_start,
> > +                                                         extent_end - 1,
> > +                                                         &found_start);
> > +                     if (found) {
> > +                             start = found_start;
> > +                             break;
> > +                     }
> > +                     /*
> > +                      * Didn't find data or a hole (due to delalloc) in the
> > +                      * implicit hole range, so need to analyze the next
> > +                      * extent item.
> > +                      */
> > +             } else {
> > +                     /*
> > +                      * Found a regular or inline extent.
> > +                      * If we are seeking for data, adjust the start offset
> > +                      * and stop, we're done.
> > +                      */
> > +                     if (whence == SEEK_DATA) {
> > +                             start = max_t(u64, key.offset, offset);
> > +                             found = true;
> > +                             break;
> > +                     }
> > +                     /*
> > +                      * Else, we are seeking for a hole, check the next file
> > +                      * extent item.
> > +                      */
> > +             }
> > +
> > +             start = extent_end;
> > +             last_extent_end = extent_end;
> > +             path->slots[0]++;
> >               if (fatal_signal_pending(current)) {
> >                       ret = -EINTR;
> > -                     break;
> > +                     goto out;
> >               }
> >               cond_resched();
> >       }
> > -     free_extent_map(em);
> > +
> > +     /* We have an implicit hole from the last extent found up to i_size. */
> > +     if (!found && start < i_size) {
> > +             found = find_desired_extent_in_hole(inode, whence, start,
> > +                                                 i_size - 1, &start);
> > +             if (!found)
> > +                     start = i_size;
> > +     }
> > +
> > +out:
> >       unlock_extent_cached(&inode->io_tree, lockstart, lockend,
> >                            &cached_state);
> > -     if (ret) {
> > -             offset = ret;
> > -     } else {
> > -             if (whence == SEEK_DATA && start >= i_size)
> > -                     offset = -ENXIO;
> > -             else
> > -                     offset = min_t(loff_t, start, i_size);
> > -     }
> > +     btrfs_free_path(path);
> > +
> > +     if (ret < 0)
> > +             return ret;
> > +
> > +     if (whence == SEEK_DATA && start >= i_size)
> > +             return -ENXIO;
> >
> > -     return offset;
> > +     return min_t(loff_t, start, i_size);
> >   }
> >
> >   static loff_t btrfs_file_llseek(struct file *file, loff_t offset, int whence)

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 06/10] btrfs: allow fiemap to be interruptible
  2022-09-01 22:42   ` Qu Wenruo
@ 2022-09-02  8:38     ` Filipe Manana
  0 siblings, 0 replies; 53+ messages in thread
From: Filipe Manana @ 2022-09-02  8:38 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Thu, Sep 1, 2022 at 11:42 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
>
> On 2022/9/1 21:18, fdmanana@kernel.org wrote:
> > From: Filipe Manana <fdmanana@suse.com>
> >
> > Doing fiemap on a file with a very large number of extents can take a very
> > long time, and we have reports of it being too slow (two recent examples
> > in the Link tags below), so make it interruptible.
> >
> > Link: https://lore.kernel.org/linux-btrfs/21dd32c6-f1f9-f44a-466a-e18fdc6788a7@virtuozzo.com/
> > Link: https://lore.kernel.org/linux-btrfs/Ysace25wh5BbLd5f@atmark-techno.com/
> > Signed-off-by: Filipe Manana <fdmanana@suse.com>
>
> Reviewed-by: Qu Wenruo <wqu@suse.com>
>
> Just one small question unrelated to the patch itself.
>
> Would it be possible that introducing a new flag to skip the SHARED flag
> check could further speed up the fiemap operation in btrfs?

Maybe, but given that fiemap is not btrfs specific, that's an interface
change that would have to be discussed with all filesystem people.
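
For reference, the sharedness information is currently exposed to user space
per extent, through the fe_flags field of each fiemap_extent returned by the
FIEMAP ioctl (this is what filefrag uses under the hood). A minimal sketch,
with error handling kept to a minimum:

    /* Walk a file's extents with the FIEMAP ioctl and print which ones
     * the filesystem reports as shared. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/fs.h>
    #include <linux/fiemap.h>

    int main(int argc, char **argv)
    {
            const unsigned int max_extents = 128;
            unsigned long long start = 0;
            struct fiemap *fm;
            int fd, last = 0;

            if (argc != 2)
                    return 1;
            fd = open(argv[1], O_RDONLY);
            if (fd < 0)
                    return 1;
            fm = calloc(1, sizeof(*fm) +
                        max_extents * sizeof(struct fiemap_extent));
            if (!fm)
                    return 1;

            while (!last) {
                    unsigned int i;

                    fm->fm_start = start;
                    fm->fm_length = ~0ULL;
                    fm->fm_flags = 0;
                    fm->fm_extent_count = max_extents;
                    fm->fm_mapped_extents = 0;
                    if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0)
                            return 1;
                    if (fm->fm_mapped_extents == 0)
                            break;
                    for (i = 0; i < fm->fm_mapped_extents; i++) {
                            struct fiemap_extent *fe = &fm->fm_extents[i];

                            printf("logical %llu len %llu%s\n",
                                   (unsigned long long)fe->fe_logical,
                                   (unsigned long long)fe->fe_length,
                                   (fe->fe_flags & FIEMAP_EXTENT_SHARED) ?
                                   " (shared)" : "");
                            start = fe->fe_logical + fe->fe_length;
                            if (fe->fe_flags & FIEMAP_EXTENT_LAST)
                                    last = 1;
                    }
            }
            close(fd);
            free(fm);
            return 0;
    }

A flag to skip the sharedness check would presumably be a new fm_flags bit,
which is why it would need agreement across filesystems.
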

>
> Thanks,
> Qu
> > ---
> >   fs/btrfs/extent_io.c | 5 +++++
> >   1 file changed, 5 insertions(+)
> >
> > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> > index 6e2143b6fba3..1260038eb47d 100644
> > --- a/fs/btrfs/extent_io.c
> > +++ b/fs/btrfs/extent_io.c
> > @@ -5694,6 +5694,11 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
> >                               ret = 0;
> >                       goto out_free;
> >               }
> > +
> > +             if (fatal_signal_pending(current)) {
> > +                     ret = -EINTR;
> > +                     goto out_free;
> > +             }
> >       }
> >   out_free:
> >       if (!ret)

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 08/10] btrfs: speedup checking for extent sharedness during fiemap
  2022-09-01 22:50   ` Qu Wenruo
@ 2022-09-02  8:46     ` Filipe Manana
  0 siblings, 0 replies; 53+ messages in thread
From: Filipe Manana @ 2022-09-02  8:46 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Thu, Sep 1, 2022 at 11:50 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
>
> On 2022/9/1 21:18, fdmanana@kernel.org wrote:
> > From: Filipe Manana <fdmanana@suse.com>
> >
> > One of the most expensive tasks performed during fiemap is to check if
> > an extent is shared. This task has two major steps:
> >
> > 1) Check if the data extent is shared. This implies checking the extent
> >     item in the extent tree, checking delayed references, etc. If we
> >     find the data extent is directly shared, we terminate immediately;
> >
> > 2) If the data extent is not directly shared (its extent item has a
> >     refcount of 1), then it may be shared if we have snapshots that share
> >     subtrees of the inode's subvolume b+tree. So we check if the leaf
> >     containing the file extent item is shared, then its parent node, then
> >     the parent node of the parent node, etc, until we reach the root node
> >     or we find one of them is shared - in which case we stop immediately.
> >
> > During fiemap we process the extents of a file from left to right, from
> > file offset 0 to eof. This means that we iterate b+tree leaves from left
> > to right, which has the implication that we keep repeating that second step
> > above several times for the same b+tree path of the inode's subvolume
> > b+tree.
> >
> > For example, if we have two file extent items in leaf X, and the path to
> > leaf X is A -> B -> C -> X, then when we try to determine if the data
> > extent referenced by the first extent item is shared, we check if the data
> > extent is shared - if it's not, then we check if leaf X is shared, if not,
> > then we check if node C is shared, if not, then check if node B is shared,
> > if not than check if node A is shared. When we move to the next file
> > extent item, after determining the data extent is not shared, we repeat
> > the checks for X, C, B and A - doing all the expensive searches in the
> > extent tree, delayed refs, etc. If we have thousands of file extents, then
> > we keep repeating the sharedness checks for the same paths over and over.
> >
> > On a file that has no shared extents or only a small portion, it's easy
> > to see that this scales terribly with the number of extents in the file
> > and the sizes of the extent and subvolume b+trees.
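
To make that concrete: with the default 16K leaves holding 200+ file extent
items each, and a path A -> B -> C -> X of 4 extent buffers, the current code
does around 200 * 4 = 800 extent buffer sharedness checks just for the items
in leaf X, when 4 checks (one per extent buffer in the path) would be enough.
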
> >
> > This change eliminates the repeated sharedness check on extent buffers
> > by caching the results of the last path used. The results can be used as
> > long as no snapshots were created since they were cached (for not shared
> > extent buffers) or no roots were dropped since they were cached (for
> > shared extent buffers). This greatly reduces the time spent by fiemap for
> > files with thousands of extents and/or large extent and subvolume b+trees.
>
> This sounds pretty much like what the existing btrfs_backref_cache is doing.
>
> It stores a map to speedup the backref lookup.
>
> But a quick search didn't hit things like btrfs_backref_edge() or
> btrfs_backref_cache().
>
> Would it be possible to reuse the existing facility to do the same thing?

Nope, the existing facility is heavy and is meant to collect all
backreferences for an extent. It stores the backreferences in an rb tree,
has to allocate memory for them, etc.

btrfs_check_shared() (now renamed to btrfs_is_data_extent_shared()) is a
much simpler thing than collecting backreferences - it just stops once it
finds an extent is shared.

The cache I introduced is equally simple and the best fit for it: it has a
fixed size, there's no need to collect backreferences, allocate memory for
them, add them to a red black tree, etc. All it needs is a single path that
doesn't change very often, and when it does it's just one extent buffer at
a time. Also, it allows us to short circuit on 'not shared' results right
at level 0, which is very common and makes a huge difference.
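
Size wise it's also negligible: the cache is an array of BTRFS_MAX_LEVEL (8)
entries, each holding a bytenr, a generation and a boolean, so roughly 200
bytes per fiemap call, and the only allocation is the cache structure itself.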

Thanks.

>
> Thanks,
> Qu
> >
> > Example performance test:
> >
> >      $ cat fiemap-perf-test.sh
> >      #!/bin/bash
> >
> >      DEV=/dev/sdi
> >      MNT=/mnt/sdi
> >
> >      mkfs.btrfs -f $DEV
> >      mount -o compress=lzo $DEV $MNT
> >
> >      # 40G gives 327680 128K file extents (due to compression).
> >      xfs_io -f -c "pwrite -S 0xab -b 1M 0 40G" $MNT/foobar
> >
> >      umount $MNT
> >      mount -o compress=lzo $DEV $MNT
> >
> >      start=$(date +%s%N)
> >      filefrag $MNT/foobar
> >      end=$(date +%s%N)
> >      dur=$(( (end - start) / 1000000 ))
> >      echo "fiemap took $dur milliseconds (metadata not cached)"
> >
> >      start=$(date +%s%N)
> >      filefrag $MNT/foobar
> >      end=$(date +%s%N)
> >      dur=$(( (end - start) / 1000000 ))
> >      echo "fiemap took $dur milliseconds (metadata cached)"
> >
> >      umount $MNT
> >
> > Before this patch:
> >
> >      $ ./fiemap-perf-test.sh
> >      (...)
> >      /mnt/sdi/foobar: 327680 extents found
> >      fiemap took 3597 milliseconds (metadata not cached)
> >      /mnt/sdi/foobar: 327680 extents found
> >      fiemap took 2107 milliseconds (metadata cached)
> >
> > After this patch:
> >
> >      $ ./fiemap-perf-test.sh
> >      (...)
> >      /mnt/sdi/foobar: 327680 extents found
> >      fiemap took 1646 milliseconds (metadata not cached)
> >      /mnt/sdi/foobar: 327680 extents found
> >      fiemap took 698 milliseconds (metadata cached)
> >
> > That's about 2.2x faster when no metadata is cached, and about 3x faster
> > when all metadata is cached. On a real filesystem with many other files,
> > data, directories, etc, the b+trees will be 2 or 3 levels higher,
> > therefore this optimization will have a higher impact.
> >
> > Several reports of a slow fiemap show up often, the two Link tags below
> > refer to two recent reports of such slowness. This patch, together with
> > the next ones in the series, is meant to address that.
> >
> > Link: https://lore.kernel.org/linux-btrfs/21dd32c6-f1f9-f44a-466a-e18fdc6788a7@virtuozzo.com/
> > Link: https://lore.kernel.org/linux-btrfs/Ysace25wh5BbLd5f@atmark-techno.com/
> > Signed-off-by: Filipe Manana <fdmanana@suse.com>
> > ---
> >   fs/btrfs/backref.c     | 122 ++++++++++++++++++++++++++++++++++++++++-
> >   fs/btrfs/backref.h     |  17 +++++-
> >   fs/btrfs/ctree.h       |  18 ++++++
> >   fs/btrfs/extent-tree.c |  10 +++-
> >   fs/btrfs/extent_io.c   |  11 ++--
> >   5 files changed, 170 insertions(+), 8 deletions(-)
> >
> > diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
> > index e2ac10a695b6..40b48abb6978 100644
> > --- a/fs/btrfs/backref.c
> > +++ b/fs/btrfs/backref.c
> > @@ -1511,6 +1511,105 @@ int btrfs_find_all_roots(struct btrfs_trans_handle *trans,
> >       return ret;
> >   }
> >
> > +/*
> > + * The caller has joined a transaction or is holding a read lock on the
> > + * fs_info->commit_root_sem semaphore, so no need to worry about the root's last
> > + * snapshot field changing while updating or checking the cache.
> > + */
> > +static bool lookup_backref_shared_cache(struct btrfs_backref_shared_cache *cache,
> > +                                     struct btrfs_root *root,
> > +                                     u64 bytenr, int level, bool *is_shared)
> > +{
> > +     struct btrfs_backref_shared_cache_entry *entry;
> > +
> > +     if (WARN_ON_ONCE(level >= BTRFS_MAX_LEVEL))
> > +             return false;
> > +
> > +     /*
> > +      * Level -1 is used for the data extent, which is not reliable to cache
> > +      * because its reference count can increase or decrease without us
> > +      * realizing. We cache results only for extent buffers that lead from
> > +      * the root node down to the leaf with the file extent item.
> > +      */
> > +     ASSERT(level >= 0);
> > +
> > +     entry = &cache->entries[level];
> > +
> > +     /* Unused cache entry or being used for some other extent buffer. */
> > +     if (entry->bytenr != bytenr)
> > +             return false;
> > +
> > +     /*
> > +      * We cached a false result, but the last snapshot generation of the
> > +      * root changed, so we now have a snapshot. Don't trust the result.
> > +      */
> > +     if (!entry->is_shared &&
> > +         entry->gen != btrfs_root_last_snapshot(&root->root_item))
> > +             return false;
> > +
> > +     /*
> > +      * If we cached a true result and the last generation used for dropping
> > +      * a root changed, we can not trust the result, because the dropped root
> > +      * could be a snapshot sharing this extent buffer.
> > +      */
> > +     if (entry->is_shared &&
> > +         entry->gen != btrfs_get_last_root_drop_gen(root->fs_info))
> > +             return false;
> > +
> > +     *is_shared = entry->is_shared;
> > +
> > +     return true;
> > +}
> > +
> > +/*
> > + * The caller has joined a transaction or is holding a read lock on the
> > + * fs_info->commit_root_sem semaphore, so no need to worry about the root's last
> > + * snapshot field changing while updating or checking the cache.
> > + */
> > +static void store_backref_shared_cache(struct btrfs_backref_shared_cache *cache,
> > +                                    struct btrfs_root *root,
> > +                                    u64 bytenr, int level, bool is_shared)
> > +{
> > +     struct btrfs_backref_shared_cache_entry *entry;
> > +     u64 gen;
> > +
> > +     if (WARN_ON_ONCE(level >= BTRFS_MAX_LEVEL))
> > +             return;
> > +
> > +     /*
> > +      * Level -1 is used for the data extent, which is not reliable to cache
> > +      * because its reference count can increase or decrease without us
> > +      * realizing. We cache results only for extent buffers that lead from
> > +      * the root node down to the leaf with the file extent item.
> > +      */
> > +     ASSERT(level >= 0);
> > +
> > +     if (is_shared)
> > +             gen = btrfs_get_last_root_drop_gen(root->fs_info);
> > +     else
> > +             gen = btrfs_root_last_snapshot(&root->root_item);
> > +
> > +     entry = &cache->entries[level];
> > +     entry->bytenr = bytenr;
> > +     entry->is_shared = is_shared;
> > +     entry->gen = gen;
> > +
> > +     /*
> > +      * If we found an extent buffer is shared, set the cache result for all
> > +      * extent buffers below it to true. As nodes in the path are COWed,
> > +      * their sharedness is moved to their children, and if a leaf is COWed,
> > +      * then the sharedness of a data extent becomes direct, i.e. the refcount
> > +      * of the data extent is increased in the extent item in the extent tree.
> > +      */
> > +     if (is_shared) {
> > +             for (int i = 0; i < level; i++) {
> > +                     entry = &cache->entries[i];
> > +                     entry->is_shared = is_shared;
> > +                     entry->gen = gen;
> > +             }
> > +     }
> > +}
> > +
> >   /**
> >    * Check if a data extent is shared or not.
> >    *
> > @@ -1519,6 +1618,7 @@ int btrfs_find_all_roots(struct btrfs_trans_handle *trans,
> >    * @bytenr: logical bytenr of the extent we are checking
> >    * @roots:  list of roots this extent is shared among
> >    * @tmp:    temporary list used for iteration
> > + * @cache:  a backref lookup result cache
> >    *
> >    * btrfs_is_data_extent_shared uses the backref walking code but will short
> >    * circuit as soon as it finds a root or inode that doesn't match the
> > @@ -1532,7 +1632,8 @@ int btrfs_find_all_roots(struct btrfs_trans_handle *trans,
> >    * Return: 0 if extent is not shared, 1 if it is shared, < 0 on error.
> >    */
> >   int btrfs_is_data_extent_shared(struct btrfs_root *root, u64 inum, u64 bytenr,
> > -                             struct ulist *roots, struct ulist *tmp)
> > +                             struct ulist *roots, struct ulist *tmp,
> > +                             struct btrfs_backref_shared_cache *cache)
> >   {
> >       struct btrfs_fs_info *fs_info = root->fs_info;
> >       struct btrfs_trans_handle *trans;
> > @@ -1545,6 +1646,7 @@ int btrfs_is_data_extent_shared(struct btrfs_root *root, u64 inum, u64 bytenr,
> >               .inum = inum,
> >               .share_count = 0,
> >       };
> > +     int level;
> >
> >       ulist_init(roots);
> >       ulist_init(tmp);
> > @@ -1561,22 +1663,40 @@ int btrfs_is_data_extent_shared(struct btrfs_root *root, u64 inum, u64 bytenr,
> >               btrfs_get_tree_mod_seq(fs_info, &elem);
> >       }
> >
> > +     /* -1 means we are in the bytenr of the data extent. */
> > +     level = -1;
> >       ULIST_ITER_INIT(&uiter);
> >       while (1) {
> > +             bool is_shared;
> > +             bool cached;
> > +
> >               ret = find_parent_nodes(trans, fs_info, bytenr, elem.seq, tmp,
> >                                       roots, NULL, &shared, false);
> >               if (ret == BACKREF_FOUND_SHARED) {
> >                       /* this is the only condition under which we return 1 */
> >                       ret = 1;
> > +                     if (level >= 0)
> > +                             store_backref_shared_cache(cache, root, bytenr,
> > +                                                        level, true);
> >                       break;
> >               }
> >               if (ret < 0 && ret != -ENOENT)
> >                       break;
> >               ret = 0;
> > +             if (level >= 0)
> > +                     store_backref_shared_cache(cache, root, bytenr,
> > +                                                level, false);
> >               node = ulist_next(tmp, &uiter);
> >               if (!node)
> >                       break;
> >               bytenr = node->val;
> > +             level++;
> > +             cached = lookup_backref_shared_cache(cache, root, bytenr, level,
> > +                                                  &is_shared);
> > +             if (cached) {
> > +                     ret = is_shared ? 1 : 0;
> > +                     break;
> > +             }
> >               shared.share_count = 0;
> >               cond_resched();
> >       }
> > diff --git a/fs/btrfs/backref.h b/fs/btrfs/backref.h
> > index 08354394b1bb..797ba5371d55 100644
> > --- a/fs/btrfs/backref.h
> > +++ b/fs/btrfs/backref.h
> > @@ -17,6 +17,20 @@ struct inode_fs_paths {
> >       struct btrfs_data_container     *fspath;
> >   };
> >
> > +struct btrfs_backref_shared_cache_entry {
> > +     u64 bytenr;
> > +     u64 gen;
> > +     bool is_shared;
> > +};
> > +
> > +struct btrfs_backref_shared_cache {
> > +     /*
> > +      * A path from a root to a leaf that has a file extent item pointing to
> > +      * a given data extent should never exceed the maximum b+tree height.
> > +      */
> > +     struct btrfs_backref_shared_cache_entry entries[BTRFS_MAX_LEVEL];
> > +};
> > +
> >   typedef int (iterate_extent_inodes_t)(u64 inum, u64 offset, u64 root,
> >               void *ctx);
> >
> > @@ -63,7 +77,8 @@ int btrfs_find_one_extref(struct btrfs_root *root, u64 inode_objectid,
> >                         struct btrfs_inode_extref **ret_extref,
> >                         u64 *found_off);
> >   int btrfs_is_data_extent_shared(struct btrfs_root *root, u64 inum, u64 bytenr,
> > -                             struct ulist *roots, struct ulist *tmp);
> > +                             struct ulist *roots, struct ulist *tmp,
> > +                             struct btrfs_backref_shared_cache *cache);
> >
> >   int __init btrfs_prelim_ref_init(void);
> >   void __cold btrfs_prelim_ref_exit(void);
> > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> > index 3dc30f5e6fd0..f7fe7f633eb5 100644
> > --- a/fs/btrfs/ctree.h
> > +++ b/fs/btrfs/ctree.h
> > @@ -1095,6 +1095,13 @@ struct btrfs_fs_info {
> >       /* Updates are not protected by any lock */
> >       struct btrfs_commit_stats commit_stats;
> >
> > +     /*
> > +      * Last generation where we dropped a non-relocation root.
> > +      * Use btrfs_set_last_root_drop_gen() and btrfs_get_last_root_drop_gen()
> > +      * to change it and to read it, respectively.
> > +      */
> > +     u64 last_root_drop_gen;
> > +
> >       /*
> >        * Annotations for transaction events (structures are empty when
> >        * compiled without lockdep).
> > @@ -1119,6 +1126,17 @@ struct btrfs_fs_info {
> >   #endif
> >   };
> >
> > +static inline void btrfs_set_last_root_drop_gen(struct btrfs_fs_info *fs_info,
> > +                                             u64 gen)
> > +{
> > +     WRITE_ONCE(fs_info->last_root_drop_gen, gen);
> > +}
> > +
> > +static inline u64 btrfs_get_last_root_drop_gen(const struct btrfs_fs_info *fs_info)
> > +{
> > +     return READ_ONCE(fs_info->last_root_drop_gen);
> > +}
> > +
> >   static inline struct btrfs_fs_info *btrfs_sb(struct super_block *sb)
> >   {
> >       return sb->s_fs_info;
> > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> > index bcd0e72cded3..9818285dface 100644
> > --- a/fs/btrfs/extent-tree.c
> > +++ b/fs/btrfs/extent-tree.c
> > @@ -5635,6 +5635,8 @@ static noinline int walk_up_tree(struct btrfs_trans_handle *trans,
> >    */
> >   int btrfs_drop_snapshot(struct btrfs_root *root, int update_ref, int for_reloc)
> >   {
> > +     const bool is_reloc_root = (root->root_key.objectid ==
> > +                                 BTRFS_TREE_RELOC_OBJECTID);
> >       struct btrfs_fs_info *fs_info = root->fs_info;
> >       struct btrfs_path *path;
> >       struct btrfs_trans_handle *trans;
> > @@ -5794,6 +5796,9 @@ int btrfs_drop_snapshot(struct btrfs_root *root, int update_ref, int for_reloc)
> >                               goto out_end_trans;
> >                       }
> >
> > +                     if (!is_reloc_root)
> > +                             btrfs_set_last_root_drop_gen(fs_info, trans->transid);
> > +
> >                       btrfs_end_transaction_throttle(trans);
> >                       if (!for_reloc && btrfs_need_cleaner_sleep(fs_info)) {
> >                               btrfs_debug(fs_info,
> > @@ -5828,7 +5833,7 @@ int btrfs_drop_snapshot(struct btrfs_root *root, int update_ref, int for_reloc)
> >               goto out_end_trans;
> >       }
> >
> > -     if (root->root_key.objectid != BTRFS_TREE_RELOC_OBJECTID) {
> > +     if (!is_reloc_root) {
> >               ret = btrfs_find_root(tree_root, &root->root_key, path,
> >                                     NULL, NULL);
> >               if (ret < 0) {
> > @@ -5860,6 +5865,9 @@ int btrfs_drop_snapshot(struct btrfs_root *root, int update_ref, int for_reloc)
> >               btrfs_put_root(root);
> >       root_dropped = true;
> >   out_end_trans:
> > +     if (!is_reloc_root)
> > +             btrfs_set_last_root_drop_gen(fs_info, trans->transid);
> > +
> >       btrfs_end_transaction_throttle(trans);
> >   out_free:
> >       kfree(wc);
> > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> > index a47710516ecf..781436cc373c 100644
> > --- a/fs/btrfs/extent_io.c
> > +++ b/fs/btrfs/extent_io.c
> > @@ -5519,6 +5519,7 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
> >       struct btrfs_path *path;
> >       struct btrfs_root *root = inode->root;
> >       struct fiemap_cache cache = { 0 };
> > +     struct btrfs_backref_shared_cache *backref_cache;
> >       struct ulist *roots;
> >       struct ulist *tmp_ulist;
> >       int end = 0;
> > @@ -5526,13 +5527,11 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
> >       u64 em_len = 0;
> >       u64 em_end = 0;
> >
> > +     backref_cache = kzalloc(sizeof(*backref_cache), GFP_KERNEL);
> >       path = btrfs_alloc_path();
> > -     if (!path)
> > -             return -ENOMEM;
> > -
> >       roots = ulist_alloc(GFP_KERNEL);
> >       tmp_ulist = ulist_alloc(GFP_KERNEL);
> > -     if (!roots || !tmp_ulist) {
> > +     if (!backref_cache || !path || !roots || !tmp_ulist) {
> >               ret = -ENOMEM;
> >               goto out_free_ulist;
> >       }
> > @@ -5658,7 +5657,8 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
> >                        */
> >                       ret = btrfs_is_data_extent_shared(root, btrfs_ino(inode),
> >                                                         bytenr, roots,
> > -                                                       tmp_ulist);
> > +                                                       tmp_ulist,
> > +                                                       backref_cache);
> >                       if (ret < 0)
> >                               goto out_free;
> >                       if (ret)
> > @@ -5710,6 +5710,7 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
> >                            &cached_state);
> >
> >   out_free_ulist:
> > +     kfree(backref_cache);
> >       btrfs_free_path(path);
> >       ulist_free(roots);
> >       ulist_free(tmp_ulist);

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 10/10] btrfs: make fiemap more efficient and accurate reporting extent sharedness
  2022-09-01 23:27   ` Qu Wenruo
@ 2022-09-02  8:59     ` Filipe Manana
  2022-09-02  9:34       ` Qu Wenruo
  0 siblings, 1 reply; 53+ messages in thread
From: Filipe Manana @ 2022-09-02  8:59 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Fri, Sep 2, 2022 at 12:27 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
>
> On 2022/9/1 21:18, fdmanana@kernel.org wrote:
> > From: Filipe Manana <fdmanana@suse.com>
> >
> > The current fiemap implementation does not scale very well with the number
> > of extents a file has. This is both because the main algorithm to find out
> > the extents has a high algorithmic complexity and because for each extent
> > we have to check if it's shared. This second part, checking if an extent
> > is shared, is significantly improved by the two previous patches in this
> > patchset, while the first part is improved by this specific patch. Every
> > now and then we get reports from users mentioning fiemap is too slow or
> > even unusable for files with a very large number of extents, such as the
> > two recent reports referred to by the Link tags at the bottom of this
> > change log.
> >
> > To understand why the part of finding which extents a file has is very
> > inefficient, consider the example of doing a full ranged fiemap against
> > a file that has over 100K extents (normal for example for a file with
> > more than 10G of data and using compression, which limits the extent size
> > to 128K). When we enter fiemap at extent_fiemap(), the following happens:
> >
> > 1) Before entering the main loop, we call get_extent_skip_holes() to get
> >     the first extent map. This leads us to btrfs_get_extent_fiemap(), which
> >     in turn calls btrfs_get_extent(), to find the first extent map that
> >     covers the file range [0, LLONG_MAX).
> >
> >     btrfs_get_extent() will first search the inode's extent map tree, to
> >     see if we have an extent map there that covers the range. If it does
> >     not find one, then it will search the inode's subvolume b+tree for a
> >     fitting file extent item. After finding the file extent item, it will
> >     allocate an extent map, fill it in with information extracted from the
> >     file extent item, and add it to the inode's extent map tree (which
> >     requires a search for insertion in the tree).
> >
> > 2) Then we enter the main loop at extent_fiemap(), emit the details of
> >     the extent, and call again get_extent_skip_holes(), with a start
> >     offset matching the end of the extent map we previously processed.
> >
> >     We end up at btrfs_get_extent() again, will search the extent map tree
> >     and then search the subvolume b+tree for a file extent item if we could
> >     not find an extent map in the extent tree. We allocate an extent map,
> >     fill it in with the details in the file extent item, and then insert
> >     it into the extent map tree (yet another search in this tree).
> >
> > 3) The second step is repeated over and over, until we have processed the
> >     whole file range. Each iteration ends at btrfs_get_extent(), which
> >     does a red black tree search on the extent map tree, then searches the
> >     subvolume b+tree, allocates an extent map and then does another search
> >     in the extent map tree in order to insert the extent map.
> >
> >     In the best scenario we have all the extent maps already in the extent
> >     tree, and so for each extent we do a single search on a red black tree,
> >     so we have a complexity of O(n log n).
> >
> >     In the worst scenario we don't have any extent map already loaded in
> >     the extent map tree, or have very few already there. In this case the
> >     complexity is much higher since we do:
> >
> >     - A red black tree search on the extent map tree, which has O(log n)
> >       complexity, initially very fast since the tree is empty or very
> >       small, but as we end up allocating extent maps and adding them to
> >       the tree when we don't find them there, each subsequent search on
> >       the tree gets slower, since it's getting bigger and bigger after
> >       each iteration.
> >
> >     - A search on the subvolume b+tree, also O(log n) complexity, but it
> >       has items for all inodes in the subvolume, not just items for our
> >       inode. Plus on a filesystem with concurrent operations on other
> >       inodes, we can block doing the search due to lock contention on
> >       b+tree nodes/leaves.
> >
> >     - Allocate an extent map - this can block, and can also fail if we
> >       are under serious memory pressure.
> >
> >     - Do another search on the extent maps red black tree, with the goal
> >       of inserting the extent map we just allocated. Again, after every
> >       iteration this tree is getting bigger by 1 element, so after many
> >       iterations the searches are slower and slower.
> >
> >     - We will not need the allocated extent map anymore, so it's pointless
> >       to add it to the extent map tree. It's just wasting time and memory.
> >
> >     In short we end up searching the extent map tree multiple times, on a
> >     tree that is growing bigger and bigger after each iteration. And
> >     besides that we visit the same leaf of the subvolume b+tree many times,
> >     since a leaf with the default size of 16K can easily have more than 200
> >     file extent items.
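
Putting numbers on the worst case for the 327680 extents file used in the
test below: that is 327680 subvolume b+tree searches, about 2 * 327680 extent
map tree operations (a lookup plus an insertion per extent), and 327680
extent map allocations - on the order of a million tree traversals for a
single fiemap call.
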
> >
> > This is very inefficient overall. This patch changes the algorithm to
> > instead iterate over the subvolume b+tree, visiting each leaf only once,
> > and only searching in the extent map tree for file ranges that have holes
> > or prealloc extents, in order to figure out if we have delalloc there.
> > It will never allocate an extent map and add it to the extent map tree.
> > This is very similar to what was previously done for the lseek's hole and
> > data seeking features.
> >
> > Also, the current implementation relying on extent maps for figuring out
> > which extents we have is not correct. This is because extent maps can be
> > merged even if they represent different extents - we do this to minimize
> > memory utilization and keep extent map trees smaller. For example if we
> > have two extents that are contiguous on disk, once we load the two extent
> > maps, they get merged into a single one - however if only one of the
> > extents is shared, we end up reporting both as shared or both as not
> > shared, which is incorrect.
>
> Is there any other major usage for extent map now?

For fsync at least it's very important.
Also for reads it's nice to not have to go to the b+tree.
If someone reads several pages of an extent, being able to get it directly
from the extent map tree is faster than having to go to the b+tree for every
read. Extent map trees are per inode, but subvolume b+trees can have a lot
of concurrent write and read access.

>
> I can only think of read, which uses the extent map to grab the logical
> bytenr of the real extent.
>
> In that case, the SHARED flag doesn't make much sense anyway; can we do
> a cleanup for those flags, since fiemap/lseek no longer rely on extent
> maps?

I don't get it. What SHARED flag are you talking about? And which "flags", where?
We have nothing specific for lseek/fiemap in the extent maps, so I don't
understand.

>
> >
> > This reproducer triggers that bug:
> >
> >      $ cat fiemap-bug.sh
> >      #!/bin/bash
> >
> >      DEV=/dev/sdj
> >      MNT=/mnt/sdj
> >
> >      mkfs.btrfs -f $DEV
> >      mount $DEV $MNT
> >
> >      # Create a file with two 256K extents.
> >      # Since there is no other write activity, they will be contiguous,
> >      # and their extent maps merged, despite having two distinct extents.
> >      xfs_io -f -c "pwrite -S 0xab 0 256K" \
> >                -c "fsync" \
> >                -c "pwrite -S 0xcd 256K 256K" \
> >                -c "fsync" \
> >                $MNT/foo
> >
> >      # Now clone only the second extent into another file.
> >      xfs_io -f -c "reflink $MNT/foo 256K 0 256K" $MNT/bar
> >
> >      # Filefrag will report a single 512K extent, and say it's not shared.
> >      echo
> >      filefrag -v $MNT/foo
> >
> >      umount $MNT
> >
> > Running the reproducer:
> >
> >      $ ./fiemap-bug.sh
> >      wrote 262144/262144 bytes at offset 0
> >      256 KiB, 64 ops; 0.0038 sec (65.479 MiB/sec and 16762.7030 ops/sec)
> >      wrote 262144/262144 bytes at offset 262144
> >      256 KiB, 64 ops; 0.0040 sec (61.125 MiB/sec and 15647.9218 ops/sec)
> >      linked 262144/262144 bytes at offset 0
> >      256 KiB, 1 ops; 0.0002 sec (1.034 GiB/sec and 4237.2881 ops/sec)
> >
> >      Filesystem type is: 9123683e
> >      File size of /mnt/sdj/foo is 524288 (128 blocks of 4096 bytes)
> >       ext:     logical_offset:        physical_offset: length:   expected: flags:
> >         0:        0..     127:       3328..      3455:    128:             last,eof
> >      /mnt/sdj/foo: 1 extent found
> >
> > We end up reporting that we have a single 512K extent that is not shared,
> > however we have two 256K extents, and the second one is shared. Changing
> > the reproducer to clone the first extent instead into file 'bar' makes us
> > report a single 512K extent that is shared, which is also incorrect since
> > we have two 256K extents and only the first one is shared.
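
(Side note: for that other incorrect case, only the reflink line of the
reproducer needs to change so that it clones the first extent instead,
e.g.:

     xfs_io -f -c "reflink $MNT/foo 0 0 256K" $MNT/bar

using the same $MNT/foo created by the script above.)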
> >
> > This patch is part of a larger patchset that is comprised of the following
> > patches:
> >
> >      btrfs: allow hole and data seeking to be interruptible
> >      btrfs: make hole and data seeking a lot more efficient
> >      btrfs: remove check for impossible block start for an extent map at fiemap
> >      btrfs: remove zero length check when entering fiemap
> >      btrfs: properly flush delalloc when entering fiemap
> >      btrfs: allow fiemap to be interruptible
> >      btrfs: rename btrfs_check_shared() to a more descriptive name
> >      btrfs: speedup checking for extent sharedness during fiemap
> >      btrfs: skip unnecessary extent buffer sharedness checks during fiemap
> >      btrfs: make fiemap more efficient and accurate reporting extent sharedness
> >
> > The patchset was tested on a machine running a non-debug kernel (Debian's
> > default config), and the tests below compare a branch without the
> > patchset against the same branch with the whole patchset applied.
> >
> > The following test for a large compressed file without holes:
> >
> >      $ cat fiemap-perf-test.sh
> >      #!/bin/bash
> >
> >      DEV=/dev/sdi
> >      MNT=/mnt/sdi
> >
> >      mkfs.btrfs -f $DEV
> >      mount -o compress=lzo $DEV $MNT
> >
> >      # 40G gives 327680 128K file extents (due to compression).
> >      xfs_io -f -c "pwrite -S 0xab -b 1M 0 40G" $MNT/foobar
> >
> >      umount $MNT
> >      mount -o compress=lzo $DEV $MNT
> >
> >      start=$(date +%s%N)
> >      filefrag $MNT/foobar
> >      end=$(date +%s%N)
> >      dur=$(( (end - start) / 1000000 ))
> >      echo "fiemap took $dur milliseconds (metadata not cached)"
> >
> >      start=$(date +%s%N)
> >      filefrag $MNT/foobar
> >      end=$(date +%s%N)
> >      dur=$(( (end - start) / 1000000 ))
> >      echo "fiemap took $dur milliseconds (metadata cached)"
> >
> >      umount $MNT
> >
> > Before patchset:
> >
> >      $ ./fiemap-perf-test.sh
> >      (...)
> >      /mnt/sdi/foobar: 327680 extents found
> >      fiemap took 3597 milliseconds (metadata not cached)
> >      /mnt/sdi/foobar: 327680 extents found
> >      fiemap took 2107 milliseconds (metadata cached)
> >
> > After patchset:
> >
> >      $ ./fiemap-perf-test.sh
> >      (...)
> >      /mnt/sdi/foobar: 327680 extents found
> >      fiemap took 1214 milliseconds (metadata not cached)
> >      /mnt/sdi/foobar: 327680 extents found
> >      fiemap took 684 milliseconds (metadata cached)
> >
> > That's a speedup of about 3x for both cases (no metadata cached and all
> > metadata cached).
> >
> > The test provided by Pavel (first Link tag at the bottom), which uses
> > files with a large number of holes, was also used to measure the gains,
> > and it consists of a small C program and a shell script to invoke it.
> > The C program is the following:
> >
> >      $ cat pavels-test.c
> >      #include <stdio.h>
> >      #include <unistd.h>
> >      #include <stdlib.h>
> >      #include <fcntl.h>
> >
> >      #include <sys/stat.h>
> >      #include <sys/time.h>
> >      #include <sys/ioctl.h>
> >
> >      #include <linux/fs.h>
> >      #include <linux/fiemap.h>
> >
> >      #define FILE_INTERVAL (1<<13) /* 8Kb */
> >
> >      long long interval(struct timeval t1, struct timeval t2)
> >      {
> >          long long val = 0;
> >          val += (t2.tv_usec - t1.tv_usec);
> >          val += (t2.tv_sec - t1.tv_sec) * 1000 * 1000;
> >          return val;
> >      }
> >
> >      int main(int argc, char **argv)
> >      {
> >          struct fiemap fiemap = {};
> >          struct timeval t1, t2;
> >          char data = 'a';
> >          struct stat st;
> >          int fd, off, file_size = FILE_INTERVAL;
> >
> >          if (argc != 3 && argc != 2) {
> >                  printf("usage: %s <path> [size]\n", argv[0]);
> >                  return 1;
> >          }
> >
> >          if (argc == 3)
> >                  file_size = atoi(argv[2]);
> >          if (file_size < FILE_INTERVAL)
> >                  file_size = FILE_INTERVAL;
> >          file_size -= file_size % FILE_INTERVAL;
> >
> >          fd = open(argv[1], O_RDWR | O_CREAT | O_TRUNC, 0644);
> >          if (fd < 0) {
> >              perror("open");
> >              return 1;
> >          }
> >
> >          for (off = 0; off < file_size; off += FILE_INTERVAL) {
> >              if (pwrite(fd, &data, 1, off) != 1) {
> >                  perror("pwrite");
> >                  close(fd);
> >                  return 1;
> >              }
> >          }
> >
> >          if (ftruncate(fd, file_size)) {
> >              perror("ftruncate");
> >              close(fd);
> >              return 1;
> >          }
> >
> >          if (fstat(fd, &st) < 0) {
> >              perror("fstat");
> >              close(fd);
> >              return 1;
> >          }
> >
> >          printf("size: %ld\n", st.st_size);
> >          printf("actual size: %ld\n", st.st_blocks * 512);
> >
> >          fiemap.fm_length = FIEMAP_MAX_OFFSET;
> >          gettimeofday(&t1, NULL);
> >          if (ioctl(fd, FS_IOC_FIEMAP, &fiemap) < 0) {
> >              perror("fiemap");
> >              close(fd);
> >              return 1;
> >          }
> >          gettimeofday(&t2, NULL);
> >
> >          printf("fiemap: fm_mapped_extents = %d\n",
> >                 fiemap.fm_mapped_extents);
> >          printf("time = %lld us\n", interval(t1, t2));
> >
> >          close(fd);
> >          return 0;
> >      }
> >
> >      $ gcc -o pavels-test pavels-test.c
> >
> > And the wrapper shell script:
> >
> >      $ cat fiemap-pavels-test.sh
> >
> >      #!/bin/bash
> >
> >      DEV=/dev/sdi
> >      MNT=/mnt/sdi
> >
> >      mkfs.btrfs -f -O no-holes $DEV
> >      mount $DEV $MNT
> >
> >      echo
> >      echo "*********** 256M ***********"
> >      echo
> >
> >      ./pavels-test $MNT/testfile $((1 << 28))
> >      echo
> >      ./pavels-test $MNT/testfile $((1 << 28))
> >
> >      echo
> >      echo "*********** 512M ***********"
> >      echo
> >
> >      ./pavels-test $MNT/testfile $((1 << 29))
> >      echo
> >      ./pavels-test $MNT/testfile $((1 << 29))
> >
> >      echo
> >      echo "*********** 1G ***********"
> >      echo
> >
> >      ./pavels-test $MNT/testfile $((1 << 30))
> >      echo
> >      ./pavels-test $MNT/testfile $((1 << 30))
> >
> >      umount $MNT
> >
> > Running his reproducer before applying the patchset:
> >
> >      *********** 256M ***********
> >
> >      size: 268435456
> >      actual size: 134217728
> >      fiemap: fm_mapped_extents = 32768
> >      time = 4003133 us
> >
> >      size: 268435456
> >      actual size: 134217728
> >      fiemap: fm_mapped_extents = 32768
> >      time = 4895330 us
> >
> >      *********** 512M ***********
> >
> >      size: 536870912
> >      actual size: 268435456
> >      fiemap: fm_mapped_extents = 65536
> >      time = 30123675 us
> >
> >      size: 536870912
> >      actual size: 268435456
> >      fiemap: fm_mapped_extents = 65536
> >      time = 33450934 us
> >
> >      *********** 1G ***********
> >
> >      size: 1073741824
> >      actual size: 536870912
> >      fiemap: fm_mapped_extents = 131072
> >      time = 224924074 us
> >
> >      size: 1073741824
> >      actual size: 536870912
> >      fiemap: fm_mapped_extents = 131072
> >      time = 217239242 us
> >
> > Running it after applying the patchset:
> >
> >      *********** 256M ***********
> >
> >      size: 268435456
> >      actual size: 134217728
> >      fiemap: fm_mapped_extents = 32768
> >      time = 29475 us
> >
> >      size: 268435456
> >      actual size: 134217728
> >      fiemap: fm_mapped_extents = 32768
> >      time = 29307 us
> >
> >      *********** 512M ***********
> >
> >      size: 536870912
> >      actual size: 268435456
> >      fiemap: fm_mapped_extents = 65536
> >      time = 58996 us
> >
> >      size: 536870912
> >      actual size: 268435456
> >      fiemap: fm_mapped_extents = 65536
> >      time = 59115 us
> >
> >      *********** 1G ***********
> >
> >      size: 1073741824
> >      actual size: 536870912
> >      fiemap: fm_mapped_extents = 116251
> >      time = 124141 us
> >
> >      size: 1073741824
> >      actual size: 536870912
> >      fiemap: fm_mapped_extents = 131072
> >      time = 119387 us
> >
> > The speedup is massive, both on the first fiemap call and on the second
> > one as well, as his test creates files with many holes and small extents
> > (every extent follows a hole and precedes another hole).
> >
> > For the 256M file we go from 4 seconds down to 29 milliseconds in the
> > first run, and then from 4.9 seconds down to 29 milliseconds again in the
> > second run, a speedup of 138x and 169x, respectively.
> >
> > For the 512M file we go from 30.1 seconds down to 59 milliseconds in the
> > first run, and then from 33.5 seconds down to 59 milliseconds again in the
> > second run, a speedup of 510x and 568x, respectively.
> >
> > For the 1G file, we go from 225 seconds down to 124 milliseconds in the
> > first run, and then from 217 seconds down to 119 milliseconds in the
> > second run, a speedup of 1815x and 1824x, respectively.
> >
> > Reported-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
> > Link: https://lore.kernel.org/linux-btrfs/21dd32c6-f1f9-f44a-466a-e18fdc6788a7@virtuozzo.com/
> > Reported-by: Dominique MARTINET <dominique.martinet@atmark-techno.com>
> > Link: https://lore.kernel.org/linux-btrfs/Ysace25wh5BbLd5f@atmark-techno.com/
> > Signed-off-by: Filipe Manana <fdmanana@suse.com>
> > ---
> >   fs/btrfs/ctree.h     |   4 +-
> >   fs/btrfs/extent_io.c | 714 +++++++++++++++++++++++++++++--------------
> >   fs/btrfs/file.c      |  16 +-
> >   fs/btrfs/inode.c     | 140 +--------
> >   4 files changed, 506 insertions(+), 368 deletions(-)
> >
> > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> > index f7fe7f633eb5..7b266f9dc8b4 100644
> > --- a/fs/btrfs/ctree.h
> > +++ b/fs/btrfs/ctree.h
> > @@ -3402,8 +3402,6 @@ unsigned int btrfs_verify_data_csum(struct btrfs_bio *bbio,
> >                                   u64 start, u64 end);
> >   int btrfs_check_data_csum(struct inode *inode, struct btrfs_bio *bbio,
> >                         u32 bio_offset, struct page *page, u32 pgoff);
> > -struct extent_map *btrfs_get_extent_fiemap(struct btrfs_inode *inode,
> > -                                        u64 start, u64 len);
> >   noinline int can_nocow_extent(struct inode *inode, u64 offset, u64 *len,
> >                             u64 *orig_start, u64 *orig_block_len,
> >                             u64 *ram_bytes, bool strict);
> > @@ -3583,6 +3581,8 @@ int btrfs_fdatawrite_range(struct inode *inode, loff_t start, loff_t end);
> >   int btrfs_check_nocow_lock(struct btrfs_inode *inode, loff_t pos,
> >                          size_t *write_bytes);
> >   void btrfs_check_nocow_unlock(struct btrfs_inode *inode);
> > +bool btrfs_find_delalloc_in_range(struct btrfs_inode *inode, u64 start, u64 end,
> > +                               u64 *delalloc_start_ret, u64 *delalloc_end_ret);
> >
> >   /* tree-defrag.c */
> >   int btrfs_defrag_leaves(struct btrfs_trans_handle *trans,
> > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> > index 0e3fa9b08aaf..50bb2182e795 100644
> > --- a/fs/btrfs/extent_io.c
> > +++ b/fs/btrfs/extent_io.c
> > @@ -5353,42 +5353,6 @@ int try_release_extent_mapping(struct page *page, gfp_t mask)
> >       return try_release_extent_state(tree, page, mask);
> >   }
> >
> > -/*
> > - * helper function for fiemap, which doesn't want to see any holes.
> > - * This maps until we find something past 'last'
> > - */
> > -static struct extent_map *get_extent_skip_holes(struct btrfs_inode *inode,
> > -                                             u64 offset, u64 last)
> > -{
> > -     u64 sectorsize = btrfs_inode_sectorsize(inode);
> > -     struct extent_map *em;
> > -     u64 len;
> > -
> > -     if (offset >= last)
> > -             return NULL;
> > -
> > -     while (1) {
> > -             len = last - offset;
> > -             if (len == 0)
> > -                     break;
> > -             len = ALIGN(len, sectorsize);
> > -             em = btrfs_get_extent_fiemap(inode, offset, len);
> > -             if (IS_ERR(em))
> > -                     return em;
> > -
> > -             /* if this isn't a hole return it */
> > -             if (em->block_start != EXTENT_MAP_HOLE)
> > -                     return em;
> > -
> > -             /* this is a hole, advance to the next extent */
> > -             offset = extent_map_end(em);
> > -             free_extent_map(em);
> > -             if (offset >= last)
> > -                     break;
> > -     }
> > -     return NULL;
> > -}
> > -
> >   /*
> >    * To cache previous fiemap extent
> >    *
> > @@ -5418,6 +5382,9 @@ static int emit_fiemap_extent(struct fiemap_extent_info *fieinfo,
> >   {
> >       int ret = 0;
> >
> > +     /* Set at the end of extent_fiemap(). */
> > +     ASSERT((flags & FIEMAP_EXTENT_LAST) == 0);
> > +
> >       if (!cache->cached)
> >               goto assign;
> >
> > @@ -5446,11 +5413,10 @@ static int emit_fiemap_extent(struct fiemap_extent_info *fieinfo,
> >        */
> >       if (cache->offset + cache->len  == offset &&
> >           cache->phys + cache->len == phys  &&
> > -         (cache->flags & ~FIEMAP_EXTENT_LAST) ==
> > -                     (flags & ~FIEMAP_EXTENT_LAST)) {
> > +         cache->flags == flags) {
> >               cache->len += len;
> >               cache->flags |= flags;
> > -             goto try_submit_last;
> > +             return 0;
> >       }
> >
> >       /* Not mergeable, need to submit cached one */
> > @@ -5465,13 +5431,8 @@ static int emit_fiemap_extent(struct fiemap_extent_info *fieinfo,
> >       cache->phys = phys;
> >       cache->len = len;
> >       cache->flags = flags;
> > -try_submit_last:
> > -     if (cache->flags & FIEMAP_EXTENT_LAST) {
> > -             ret = fiemap_fill_next_extent(fieinfo, cache->offset,
> > -                             cache->phys, cache->len, cache->flags);
> > -             cache->cached = false;
> > -     }
> > -     return ret;
> > +
> > +     return 0;
> >   }
> >
> >   /*
> > @@ -5501,229 +5462,534 @@ static int emit_last_fiemap_cache(struct fiemap_extent_info *fieinfo,
> >       return ret;
> >   }
> >
> > -int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
> > -               u64 start, u64 len)
> > +static int fiemap_next_leaf_item(struct btrfs_inode *inode,
> > +                              struct btrfs_path *path)
> >   {
> > -     int ret = 0;
> > -     u64 off;
> > -     u64 max = start + len;
> > -     u32 flags = 0;
> > -     u32 found_type;
> > -     u64 last;
> > -     u64 last_for_get_extent = 0;
> > -     u64 disko = 0;
> > -     u64 isize = i_size_read(&inode->vfs_inode);
> > -     struct btrfs_key found_key;
> > -     struct extent_map *em = NULL;
> > -     struct extent_state *cached_state = NULL;
> > -     struct btrfs_path *path;
> > +     struct extent_buffer *clone;
> > +     struct btrfs_key key;
> > +     int slot;
> > +     int ret;
> > +
> > +     path->slots[0]++;
> > +     if (path->slots[0] < btrfs_header_nritems(path->nodes[0]))
> > +             return 0;
> > +
> > +     ret = btrfs_next_leaf(inode->root, path);
> > +     if (ret != 0)
> > +             return ret;
> > +
> > +     /*
> > +      * Don't bother with cloning if there are no more file extent items for
> > +      * our inode.
> > +      */
> > +     btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
> > +     if (key.objectid != btrfs_ino(inode) || key.type != BTRFS_EXTENT_DATA_KEY)
> > +             return 1;
> > +
> > +     /* See the comment at fiemap_search_slot() about why we clone. */
> > +     clone = btrfs_clone_extent_buffer(path->nodes[0]);
> > +     if (!clone)
> > +             return -ENOMEM;
> > +
> > +     slot = path->slots[0];
> > +     btrfs_release_path(path);
> > +     path->nodes[0] = clone;
> > +     path->slots[0] = slot;
> > +
> > +     return 0;
> > +}
> > +
> > +/*
> > + * Search for the first file extent item that starts at a given file offset or
> > + * the one that starts immediately before that offset.
> > + * Returns: 0 on success, < 0 on error, 1 if not found.
> > + */
> > +static int fiemap_search_slot(struct btrfs_inode *inode, struct btrfs_path *path,
> > +                           u64 file_offset)
> > +{
> > +     const u64 ino = btrfs_ino(inode);
> >       struct btrfs_root *root = inode->root;
> > -     struct fiemap_cache cache = { 0 };
> > -     struct btrfs_backref_shared_cache *backref_cache;
> > -     struct ulist *roots;
> > -     struct ulist *tmp_ulist;
> > -     int end = 0;
> > -     u64 em_start = 0;
> > -     u64 em_len = 0;
> > -     u64 em_end = 0;
> > +     struct extent_buffer *clone;
> > +     struct btrfs_key key;
> > +     int slot;
> > +     int ret;
> >
> > -     backref_cache = kzalloc(sizeof(*backref_cache), GFP_KERNEL);
> > -     path = btrfs_alloc_path();
> > -     roots = ulist_alloc(GFP_KERNEL);
> > -     tmp_ulist = ulist_alloc(GFP_KERNEL);
> > -     if (!backref_cache || !path || !roots || !tmp_ulist) {
> > -             ret = -ENOMEM;
> > -             goto out_free_ulist;
> > +     key.objectid = ino;
> > +     key.type = BTRFS_EXTENT_DATA_KEY;
> > +     key.offset = file_offset;
> > +
> > +     ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
> > +     if (ret < 0)
> > +             return ret;
> > +
> > +     if (ret > 0 && path->slots[0] > 0) {
> > +             btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0] - 1);
> > +             if (key.objectid == ino && key.type == BTRFS_EXTENT_DATA_KEY)
> > +                     path->slots[0]--;
> > +     }
> > +
> > +     if (path->slots[0] >= btrfs_header_nritems(path->nodes[0])) {
> > +             ret = btrfs_next_leaf(root, path);
> > +             if (ret != 0)
> > +                     return ret;
> > +
> > +             btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
> > +             if (key.objectid != ino || key.type != BTRFS_EXTENT_DATA_KEY)
> > +                     return 1;
> >       }
> >
> >       /*
> > -      * We can't initialize that to 'start' as this could miss extents due
> > -      * to extent item merging
> > +      * We clone the leaf and use it during fiemap. This is because while
> > +      * using the leaf we do expensive things like checking if an extent is
> > +      * shared, which can take a long time. In order to prevent blocking
> > +      * other tasks for too long, we use a clone of the leaf. We have locked
> > +      * the file range in the inode's io tree, so we know none of our file
> > +      * extent items can change. This way we avoid blocking other tasks that
> > +      * want to insert items for other inodes in the same leaf or b+tree
> > +      * rebalance operations (triggered for example when someone is trying
> > +      * to push items into this leaf when trying to insert an item in a
> > +      * neighbour leaf).
> > +      * We also need the private clone because holding a read lock on an
> > +      * extent buffer of the subvolume's b+tree will make lockdep unhappy
> > +      * when we call fiemap_fill_next_extent(), because that may cause a page
> > +      * fault when filling the user space buffer with fiemap data.
> >        */
> > -     off = 0;
> > -     start = round_down(start, btrfs_inode_sectorsize(inode));
> > -     len = round_up(max, btrfs_inode_sectorsize(inode)) - start;
> > +     clone = btrfs_clone_extent_buffer(path->nodes[0]);
> > +     if (!clone)
> > +             return -ENOMEM;
> > +
> > +     slot = path->slots[0];
> > +     btrfs_release_path(path);
> > +     path->nodes[0] = clone;
> > +     path->slots[0] = slot;
>
> Although this is correct, it still looks a little tricky.
>
> We rely on btrfs_release_path() to release all tree blocks in the
> subvolume tree, including unlocking the tree blocks, thus path->locks[0]
> is also 0, meaning next time we call btrfs_release_path() we won't try
> to unlock the cloned eb.

We're not taking any lock on the cloned extent buffer. It's not needed,
it's private to the task.

>
> But I'd say it's still pretty tricky, and unfortunately I don't have any
> better alternative.
>
> > +
> > +     return 0;
> > +}
> > +
> > +/*
> > + * Process a range which is a hole or a prealloc extent in the inode's subvolume
> > + * btree. If @disk_bytenr is 0, we are dealing with a hole, otherwise a prealloc
> > + * extent. The end offset (@end) is inclusive.
> > + */
> > +static int fiemap_process_hole(struct btrfs_inode *inode,
>
> Does the name still make sense as we're handling both hole and prealloc
> ranges?

I chose that name because hole and prealloc are treated the same way.
Sure, I could name it fiemap_process_hole_or_prealloc() or something like
that, but I decided to keep the name shorter and make both cases explicit
in the comments and code.

The old code did the same, get_extent_skip_holes() skipped holes and
prealloc extents without delalloc.

>
>
> And I always find the delalloc search a big pain during lseek/fiemap.
>
> I guess, except when using certain flags, there is some hard requirement
> for delalloc range reporting?

Yes. Delalloc is not meant to be flushed for fiemap unless
FIEMAP_FLAG_SYNC is given by the user.
For lseek it's just not needed, but that was already mentioned /
discussed in patch 2/10.
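
A quick way to exercise that path from the command line, just as an
illustration (assuming a filefrag build that supports the -s option, which
makes it pass FIEMAP_FLAG_SYNC), is:

    # Flush delalloc before mapping, then report the extents.
    filefrag -s -v /path/to/file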

>
> > +                            struct fiemap_extent_info *fieinfo,
> > +                            struct fiemap_cache *cache,
> > +                            struct btrfs_backref_shared_cache *backref_cache,
> > +                            u64 disk_bytenr, u64 extent_offset,
> > +                            u64 extent_gen,
> > +                            struct ulist *roots, struct ulist *tmp_ulist,
> > +                            u64 start, u64 end)
> > +{
> > +     const u64 i_size = i_size_read(&inode->vfs_inode);
> > +     const u64 ino = btrfs_ino(inode);
> > +     u64 cur_offset = start;
> > +     u64 last_delalloc_end = 0;
> > +     u32 prealloc_flags = FIEMAP_EXTENT_UNWRITTEN;
> > +     bool checked_extent_shared = false;
> > +     int ret;
> >
> >       /*
> > -      * lookup the last file extent.  We're not using i_size here
> > -      * because there might be preallocation past i_size
> > +      * There can be no delalloc past i_size, so don't waste time looking for
> > +      * it beyond i_size.
> >        */
> > -     ret = btrfs_lookup_file_extent(NULL, root, path, btrfs_ino(inode), -1,
> > -                                    0);
> > -     if (ret < 0) {
> > -             goto out_free_ulist;
> > -     } else {
> > -             WARN_ON(!ret);
> > -             if (ret == 1)
> > -                     ret = 0;
> > -     }
> > +     while (cur_offset < end && cur_offset < i_size) {
> > +             u64 delalloc_start;
> > +             u64 delalloc_end;
> > +             u64 prealloc_start;
> > +             u64 prealloc_len = 0;
> > +             bool delalloc;
> > +
> > +             delalloc = btrfs_find_delalloc_in_range(inode, cur_offset, end,
> > +                                                     &delalloc_start,
> > +                                                     &delalloc_end);
> > +             if (!delalloc)
> > +                     break;
> >
> > -     path->slots[0]--;
> > -     btrfs_item_key_to_cpu(path->nodes[0], &found_key, path->slots[0]);
> > -     found_type = found_key.type;
> > -
> > -     /* No extents, but there might be delalloc bits */
> > -     if (found_key.objectid != btrfs_ino(inode) ||
> > -         found_type != BTRFS_EXTENT_DATA_KEY) {
> > -             /* have to trust i_size as the end */
> > -             last = (u64)-1;
> > -             last_for_get_extent = isize;
> > -     } else {
> >               /*
> > -              * remember the start of the last extent.  There are a
> > -              * bunch of different factors that go into the length of the
> > -              * extent, so its much less complex to remember where it started
> > +              * If this is a prealloc extent we have to report every section
> > +              * of it that has no delalloc.
> >                */
> > -             last = found_key.offset;
> > -             last_for_get_extent = last + 1;
> > +             if (disk_bytenr != 0) {
> > +                     if (last_delalloc_end == 0) {
> > +                             prealloc_start = start;
> > +                             prealloc_len = delalloc_start - start;
> > +                     } else {
> > +                             prealloc_start = last_delalloc_end + 1;
> > +                             prealloc_len = delalloc_start - prealloc_start;
> > +                     }
> > +             }
> > +
> > +             if (prealloc_len > 0) {
> > +                     if (!checked_extent_shared && fieinfo->fi_extents_max) {
> > +                             ret = btrfs_is_data_extent_shared(inode->root,
> > +                                                       ino, disk_bytenr,
> > +                                                       extent_gen, roots,
> > +                                                       tmp_ulist,
> > +                                                       backref_cache);
> > +                             if (ret < 0)
> > +                                     return ret;
> > +                             else if (ret > 0)
> > +                                     prealloc_flags |= FIEMAP_EXTENT_SHARED;
> > +
> > +                             checked_extent_shared = true;
> > +                     }
> > +                     ret = emit_fiemap_extent(fieinfo, cache, prealloc_start,
> > +                                              disk_bytenr + extent_offset,
> > +                                              prealloc_len, prealloc_flags);
> > +                     if (ret)
> > +                             return ret;
> > +                     extent_offset += prealloc_len;
> > +             }
> > +
> > +             ret = emit_fiemap_extent(fieinfo, cache, delalloc_start, 0,
> > +                                      delalloc_end + 1 - delalloc_start,
> > +                                      FIEMAP_EXTENT_DELALLOC |
> > +                                      FIEMAP_EXTENT_UNKNOWN);
> > +             if (ret)
> > +                     return ret;
> > +
> > +             last_delalloc_end = delalloc_end;
> > +             cur_offset = delalloc_end + 1;
> > +             extent_offset += cur_offset - delalloc_start;
> > +             cond_resched();
> > +     }
> > +
> > +     /*
> > +      * Either we found no delalloc for the whole prealloc extent or we have
> > +      * a prealloc extent that spans i_size or starts at or after i_size.
> > +      */
> > +     if (disk_bytenr != 0 && last_delalloc_end < end) {
> > +             u64 prealloc_start;
> > +             u64 prealloc_len;
> > +
> > +             if (last_delalloc_end == 0) {
> > +                     prealloc_start = start;
> > +                     prealloc_len = end + 1 - start;
> > +             } else {
> > +                     prealloc_start = last_delalloc_end + 1;
> > +                     prealloc_len = end + 1 - prealloc_start;
> > +             }
> > +
> > +             if (!checked_extent_shared && fieinfo->fi_extents_max) {
> > +                     ret = btrfs_is_data_extent_shared(inode->root,
> > +                                                       ino, disk_bytenr,
> > +                                                       extent_gen, roots,
> > +                                                       tmp_ulist,
> > +                                                       backref_cache);
> > +                     if (ret < 0)
> > +                             return ret;
> > +                     else if (ret > 0)
> > +                             prealloc_flags |= FIEMAP_EXTENT_SHARED;
> > +             }
> > +             ret = emit_fiemap_extent(fieinfo, cache, prealloc_start,
> > +                                      disk_bytenr + extent_offset,
> > +                                      prealloc_len, prealloc_flags);
> > +             if (ret)
> > +                     return ret;
> > +     }
> > +
> > +     return 0;
> > +}
> > +
> > +static int fiemap_find_last_extent_offset(struct btrfs_inode *inode,
> > +                                       struct btrfs_path *path,
> > +                                       u64 *last_extent_end_ret)
> > +{
> > +     const u64 ino = btrfs_ino(inode);
> > +     struct btrfs_root *root = inode->root;
> > +     struct extent_buffer *leaf;
> > +     struct btrfs_file_extent_item *ei;
> > +     struct btrfs_key key;
> > +     u64 disk_bytenr;
> > +     int ret;
> > +
> > +     /*
> > +      * Lookup the last file extent. We're not using i_size here because
> > +      * there might be preallocation past i_size.
> > +      */
>
> I'm wondering how could this happen?

You can fallocate an extent at or after i_size.

>
> Normally if we're truncating an inode, the extents starting after
> round_up(i_size, sectorsize) should be dropped.

It has nothing to do with truncate, just fallocate.
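
For example (illustration only, the path is made up), a keep-size fallocate
leaves a prealloc extent entirely beyond i_size:

    # i_size stays at 4K, but a prealloc file extent item exists at 1M.
    xfs_io -f -c "pwrite 0 4K" -c "falloc -k 1M 1M" /mnt/test/file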

>
> Or if we later enlarge the inode, we may hit old extents and read out
> some stale data other than expected zeros.
>
> Thus searching using round_up(i_size, sectorsize) should still let us
> reach the slot after the last file extent.
>
> Or did I miss something?

Yes, it's about prealloc extents at or after i_size.

Thanks.

>
> Thanks,
> Qu
>
> > +     ret = btrfs_lookup_file_extent(NULL, root, path, ino, (u64)-1, 0);
> > +     /* There can't be a file extent item at offset (u64)-1 */
> > +     ASSERT(ret != 0);
> > +     if (ret < 0)
> > +             return ret;
> > +
> > +     /*
> > +      * For a non-existing key, btrfs_search_slot() always leaves us at a
> > +      * slot > 0, except if the btree is empty, which is impossible because
> > +      * at least it has the inode item for this inode and all the items for
> > +      * the root inode 256.
> > +      */
> > +     ASSERT(path->slots[0] > 0);
> > +     path->slots[0]--;
> > +     leaf = path->nodes[0];
> > +     btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
> > +     if (key.objectid != ino || key.type != BTRFS_EXTENT_DATA_KEY) {
> > +             /* No file extent items in the subvolume tree. */
> > +             *last_extent_end_ret = 0;
> > +             return 0;
> >       }
> > -     btrfs_release_path(path);
> >
> >       /*
> > -      * we might have some extents allocated but more delalloc past those
> > -      * extents.  so, we trust isize unless the start of the last extent is
> > -      * beyond isize
> > +      * For an inline extent, the disk_bytenr is where inline data starts at,
> > +      * so first check if we have an inline extent item before checking if we
> > +      * have an implicit hole (disk_bytenr == 0).
> >        */
> > -     if (last < isize) {
> > -             last = (u64)-1;
> > -             last_for_get_extent = isize;
> > +     ei = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_file_extent_item);
> > +     if (btrfs_file_extent_type(leaf, ei) == BTRFS_FILE_EXTENT_INLINE) {
> > +             *last_extent_end_ret = btrfs_file_extent_end(path);
> > +             return 0;
> >       }
> >
> > -     lock_extent_bits(&inode->io_tree, start, start + len - 1,
> > -                      &cached_state);
> > +     /*
> > +      * Find the last file extent item that is not a hole (when NO_HOLES is
> > +      * not enabled). This should take at most 2 iterations in the worst
> > +      * case: we have one hole file extent item at slot 0 of a leaf and
> > +      * another hole file extent item as the last item in the previous leaf.
> > +      * This is because we merge file extent items that represent holes.
> > +      */
> > +     disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, ei);
> > +     while (disk_bytenr == 0) {
> > +             ret = btrfs_previous_item(root, path, ino, BTRFS_EXTENT_DATA_KEY);
> > +             if (ret < 0) {
> > +                     return ret;
> > +             } else if (ret > 0) {
> > +                     /* No file extent items that are not holes. */
> > +                     *last_extent_end_ret = 0;
> > +                     return 0;
> > +             }
> > +             leaf = path->nodes[0];
> > +             ei = btrfs_item_ptr(leaf, path->slots[0],
> > +                                 struct btrfs_file_extent_item);
> > +             disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, ei);
> > +     }
> >
> > -     em = get_extent_skip_holes(inode, start, last_for_get_extent);
> > -     if (!em)
> > -             goto out;
> > -     if (IS_ERR(em)) {
> > -             ret = PTR_ERR(em);
> > +     *last_extent_end_ret = btrfs_file_extent_end(path);
> > +     return 0;
> > +}
> > +
> > +int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
> > +               u64 start, u64 len)
> > +{
> > +     const u64 ino = btrfs_ino(inode);
> > +     struct extent_state *cached_state = NULL;
> > +     struct btrfs_path *path;
> > +     struct btrfs_root *root = inode->root;
> > +     struct fiemap_cache cache = { 0 };
> > +     struct btrfs_backref_shared_cache *backref_cache;
> > +     struct ulist *roots;
> > +     struct ulist *tmp_ulist;
> > +     u64 last_extent_end;
> > +     u64 prev_extent_end;
> > +     u64 lockstart;
> > +     u64 lockend;
> > +     bool stopped = false;
> > +     int ret;
> > +
> > +     backref_cache = kzalloc(sizeof(*backref_cache), GFP_KERNEL);
> > +     path = btrfs_alloc_path();
> > +     roots = ulist_alloc(GFP_KERNEL);
> > +     tmp_ulist = ulist_alloc(GFP_KERNEL);
> > +     if (!backref_cache || !path || !roots || !tmp_ulist) {
> > +             ret = -ENOMEM;
> >               goto out;
> >       }
> >
> > -     while (!end) {
> > -             u64 offset_in_extent = 0;
> > +     lockstart = round_down(start, btrfs_inode_sectorsize(inode));
> > +     lockend = round_up(start + len, btrfs_inode_sectorsize(inode));
> > +     prev_extent_end = lockstart;
> >
> > -             /* break if the extent we found is outside the range */
> > -             if (em->start >= max || extent_map_end(em) < off)
> > -                     break;
> > +     lock_extent_bits(&inode->io_tree, lockstart, lockend, &cached_state);
> >
> > -             /*
> > -              * get_extent may return an extent that starts before our
> > -              * requested range.  We have to make sure the ranges
> > -              * we return to fiemap always move forward and don't
> > -              * overlap, so adjust the offsets here
> > -              */
> > -             em_start = max(em->start, off);
> > +     ret = fiemap_find_last_extent_offset(inode, path, &last_extent_end);
> > +     if (ret < 0)
> > +             goto out_unlock;
> > +     btrfs_release_path(path);
> >
> > +     path->reada = READA_FORWARD;
> > +     ret = fiemap_search_slot(inode, path, lockstart);
> > +     if (ret < 0) {
> > +             goto out_unlock;
> > +     } else if (ret > 0) {
> >               /*
> > -              * record the offset from the start of the extent
> > -              * for adjusting the disk offset below.  Only do this if the
> > -              * extent isn't compressed since our in ram offset may be past
> > -              * what we have actually allocated on disk.
> > +              * No file extent item found, but we may have delalloc between
> > +              * the current offset and i_size. So check for that.
> >                */
> > -             if (!test_bit(EXTENT_FLAG_COMPRESSED, &em->flags))
> > -                     offset_in_extent = em_start - em->start;
> > -             em_end = extent_map_end(em);
> > -             em_len = em_end - em_start;
> > -             flags = 0;
> > -             if (em->block_start < EXTENT_MAP_LAST_BYTE)
> > -                     disko = em->block_start + offset_in_extent;
> > -             else
> > -                     disko = 0;
> > +             ret = 0;
> > +             goto check_eof_delalloc;
> > +     }
> > +
> > +     while (prev_extent_end < lockend) {
> > +             struct extent_buffer *leaf = path->nodes[0];
> > +             struct btrfs_file_extent_item *ei;
> > +             struct btrfs_key key;
> > +             u64 extent_end;
> > +             u64 extent_len;
> > +             u64 extent_offset = 0;
> > +             u64 extent_gen;
> > +             u64 disk_bytenr = 0;
> > +             u64 flags = 0;
> > +             int extent_type;
> > +             u8 compression;
> > +
> > +             btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
> > +             if (key.objectid != ino || key.type != BTRFS_EXTENT_DATA_KEY)
> > +                     break;
> > +
> > +             extent_end = btrfs_file_extent_end(path);
> >
> >               /*
> > -              * bump off for our next call to get_extent
> > +              * The first iteration can leave us at an extent item that ends
> > +              * before our range's start. Move to the next item.
> >                */
> > -             off = extent_map_end(em);
> > -             if (off >= max)
> > -                     end = 1;
> > -
> > -             if (em->block_start == EXTENT_MAP_INLINE) {
> > -                     flags |= (FIEMAP_EXTENT_DATA_INLINE |
> > -                               FIEMAP_EXTENT_NOT_ALIGNED);
> > -             } else if (em->block_start == EXTENT_MAP_DELALLOC) {
> > -                     flags |= (FIEMAP_EXTENT_DELALLOC |
> > -                               FIEMAP_EXTENT_UNKNOWN);
> > -             } else if (fieinfo->fi_extents_max) {
> > -                     u64 extent_gen;
> > -                     u64 bytenr = em->block_start -
> > -                             (em->start - em->orig_start);
> > +             if (extent_end <= lockstart)
> > +                     goto next_item;
> >
> > -                     /*
> > -                      * If two extent maps are merged, then their generation
> > -                      * is set to the maximum between their generations.
> > -                      * Otherwise its generation matches the one we have in
> > -                      * corresponding file extent item. If we have a merged
> > -                      * extent map, don't use its generation to speedup the
> > -                      * sharedness check below.
> > -                      */
> > -                     if (test_bit(EXTENT_FLAG_MERGED, &em->flags))
> > -                             extent_gen = 0;
> > -                     else
> > -                             extent_gen = em->generation;
> > +             /* We have an implicit hole (NO_HOLES feature enabled). */
> > +             if (prev_extent_end < key.offset) {
> > +                     const u64 range_end = min(key.offset, lockend) - 1;
> >
> > -                     /*
> > -                      * As btrfs supports shared space, this information
> > -                      * can be exported to userspace tools via
> > -                      * flag FIEMAP_EXTENT_SHARED.  If fi_extents_max == 0
> > -                      * then we're just getting a count and we can skip the
> > -                      * lookup stuff.
> > -                      */
> > -                     ret = btrfs_is_data_extent_shared(root, btrfs_ino(inode),
> > -                                                       bytenr, extent_gen,
> > -                                                       roots, tmp_ulist,
> > -                                                       backref_cache);
> > -                     if (ret < 0)
> > -                             goto out_free;
> > -                     if (ret)
> > -                             flags |= FIEMAP_EXTENT_SHARED;
> > -                     ret = 0;
> > -             }
> > -             if (test_bit(EXTENT_FLAG_COMPRESSED, &em->flags))
> > -                     flags |= FIEMAP_EXTENT_ENCODED;
> > -             if (test_bit(EXTENT_FLAG_PREALLOC, &em->flags))
> > -                     flags |= FIEMAP_EXTENT_UNWRITTEN;
> > +                     ret = fiemap_process_hole(inode, fieinfo, &cache,
> > +                                               backref_cache, 0, 0, 0,
> > +                                               roots, tmp_ulist,
> > +                                               prev_extent_end, range_end);
> > +                     if (ret < 0) {
> > +                             goto out_unlock;
> > +                     } else if (ret > 0) {
> > +                             /* fiemap_fill_next_extent() told us to stop. */
> > +                             stopped = true;
> > +                             break;
> > +                     }
> >
> > -             free_extent_map(em);
> > -             em = NULL;
> > -             if ((em_start >= last) || em_len == (u64)-1 ||
> > -                (last == (u64)-1 && isize <= em_end)) {
> > -                     flags |= FIEMAP_EXTENT_LAST;
> > -                     end = 1;
> > +                     /* We've reached the end of the fiemap range, stop. */
> > +                     if (key.offset >= lockend) {
> > +                             stopped = true;
> > +                             break;
> > +                     }
> >               }
> >
> > -             /* now scan forward to see if this is really the last extent. */
> > -             em = get_extent_skip_holes(inode, off, last_for_get_extent);
> > -             if (IS_ERR(em)) {
> > -                     ret = PTR_ERR(em);
> > -                     goto out;
> > +             extent_len = extent_end - key.offset;
> > +             ei = btrfs_item_ptr(leaf, path->slots[0],
> > +                                 struct btrfs_file_extent_item);
> > +             compression = btrfs_file_extent_compression(leaf, ei);
> > +             extent_type = btrfs_file_extent_type(leaf, ei);
> > +             extent_gen = btrfs_file_extent_generation(leaf, ei);
> > +
> > +             if (extent_type != BTRFS_FILE_EXTENT_INLINE) {
> > +                     disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, ei);
> > +                     if (compression == BTRFS_COMPRESS_NONE)
> > +                             extent_offset = btrfs_file_extent_offset(leaf, ei);
> >               }
> > -             if (!em) {
> > -                     flags |= FIEMAP_EXTENT_LAST;
> > -                     end = 1;
> > +
> > +             if (compression != BTRFS_COMPRESS_NONE)
> > +                     flags |= FIEMAP_EXTENT_ENCODED;
> > +
> > +             if (extent_type == BTRFS_FILE_EXTENT_INLINE) {
> > +                     flags |= FIEMAP_EXTENT_DATA_INLINE;
> > +                     flags |= FIEMAP_EXTENT_NOT_ALIGNED;
> > +                     ret = emit_fiemap_extent(fieinfo, &cache, key.offset, 0,
> > +                                              extent_len, flags);
> > +             } else if (extent_type == BTRFS_FILE_EXTENT_PREALLOC) {
> > +                     ret = fiemap_process_hole(inode, fieinfo, &cache,
> > +                                               backref_cache,
> > +                                               disk_bytenr, extent_offset,
> > +                                               extent_gen, roots, tmp_ulist,
> > +                                               key.offset, extent_end - 1);
> > +             } else if (disk_bytenr == 0) {
> > +                     /* We have an explicit hole. */
> > +                     ret = fiemap_process_hole(inode, fieinfo, &cache,
> > +                                               backref_cache, 0, 0, 0,
> > +                                               roots, tmp_ulist,
> > +                                               key.offset, extent_end - 1);
> > +             } else {
> > +                     /* We have a regular extent. */
> > +                     if (fieinfo->fi_extents_max) {
> > +                             ret = btrfs_is_data_extent_shared(root, ino,
> > +                                                               disk_bytenr,
> > +                                                               extent_gen,
> > +                                                               roots,
> > +                                                               tmp_ulist,
> > +                                                               backref_cache);
> > +                             if (ret < 0)
> > +                                     goto out_unlock;
> > +                             else if (ret > 0)
> > +                                     flags |= FIEMAP_EXTENT_SHARED;
> > +                     }
> > +
> > +                     ret = emit_fiemap_extent(fieinfo, &cache, key.offset,
> > +                                              disk_bytenr + extent_offset,
> > +                                              extent_len, flags);
> >               }
> > -             ret = emit_fiemap_extent(fieinfo, &cache, em_start, disko,
> > -                                        em_len, flags);
> > -             if (ret) {
> > -                     if (ret == 1)
> > -                             ret = 0;
> > -                     goto out_free;
> > +
> > +             if (ret < 0) {
> > +                     goto out_unlock;
> > +             } else if (ret > 0) {
> > +                     /* fiemap_fill_next_extent() told us to stop. */
> > +                     stopped = true;
> > +                     break;
> >               }
> >
> > +             prev_extent_end = extent_end;
> > +next_item:
> >               if (fatal_signal_pending(current)) {
> >                       ret = -EINTR;
> > -                     goto out_free;
> > +                     goto out_unlock;
> >               }
> > +
> > +             ret = fiemap_next_leaf_item(inode, path);
> > +             if (ret < 0) {
> > +                     goto out_unlock;
> > +             } else if (ret > 0) {
> > +                     /* No more file extent items for this inode. */
> > +                     break;
> > +             }
> > +             cond_resched();
> >       }
> > -out_free:
> > -     if (!ret)
> > -             ret = emit_last_fiemap_cache(fieinfo, &cache);
> > -     free_extent_map(em);
> > -out:
> > -     unlock_extent_cached(&inode->io_tree, start, start + len - 1,
> > -                          &cached_state);
> >
> > -out_free_ulist:
> > +check_eof_delalloc:
> > +     /*
> > +      * Release (and free) the path before emitting any final entries to
> > +      * fiemap_fill_next_extent() to keep lockdep happy. This is because
> > +      * once we find no more file extent items exist, we may have a
> > +      * non-cloned leaf, and fiemap_fill_next_extent() can trigger page
> > +      * faults when copying data to the user space buffer.
> > +      */
> > +     btrfs_free_path(path);
> > +     path = NULL;
> > +
> > +     if (!stopped && prev_extent_end < lockend) {
> > +             ret = fiemap_process_hole(inode, fieinfo, &cache, backref_cache,
> > +                                       0, 0, 0, roots, tmp_ulist,
> > +                                       prev_extent_end, lockend - 1);
> > +             if (ret < 0)
> > +                     goto out_unlock;
> > +             prev_extent_end = lockend;
> > +     }
> > +
> > +     if (cache.cached && cache.offset + cache.len >= last_extent_end) {
> > +             const u64 i_size = i_size_read(&inode->vfs_inode);
> > +
> > +             if (prev_extent_end < i_size) {
> > +                     u64 delalloc_start;
> > +                     u64 delalloc_end;
> > +                     bool delalloc;
> > +
> > +                     delalloc = btrfs_find_delalloc_in_range(inode,
> > +                                                             prev_extent_end,
> > +                                                             i_size - 1,
> > +                                                             &delalloc_start,
> > +                                                             &delalloc_end);
> > +                     if (!delalloc)
> > +                             cache.flags |= FIEMAP_EXTENT_LAST;
> > +             } else {
> > +                     cache.flags |= FIEMAP_EXTENT_LAST;
> > +             }
> > +     }
> > +
> > +     ret = emit_last_fiemap_cache(fieinfo, &cache);
> > +
> > +out_unlock:
> > +     unlock_extent_cached(&inode->io_tree, lockstart, lockend, &cached_state);
> > +out:
> >       kfree(backref_cache);
> >       btrfs_free_path(path);
> >       ulist_free(roots);
> > diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> > index b292a8ada3a4..636b3ec46184 100644
> > --- a/fs/btrfs/file.c
> > +++ b/fs/btrfs/file.c
> > @@ -3602,10 +3602,10 @@ static long btrfs_fallocate(struct file *file, int mode,
> >   }
> >
> >   /*
> > - * Helper for have_delalloc_in_range(). Find a subrange in a given range that
> > - * has unflushed and/or flushing delalloc. There might be other adjacent
> > - * subranges after the one it found, so have_delalloc_in_range() keeps looping
> > - * while it gets adjacent subranges, and merging them together.
> > + * Helper for btrfs_find_delalloc_in_range(). Find a subrange in a given range
> > + * that has unflushed and/or flushing delalloc. There might be other adjacent
> > + * subranges after the one it found, so btrfs_find_delalloc_in_range() keeps
> > + * looping while it gets adjacent subranges, and merging them together.
> >    */
> >   static bool find_delalloc_subrange(struct btrfs_inode *inode, u64 start, u64 end,
> >                                  u64 *delalloc_start_ret, u64 *delalloc_end_ret)
> > @@ -3740,8 +3740,8 @@ static bool find_delalloc_subrange(struct btrfs_inode *inode, u64 start, u64 end
> >    * if so it sets @delalloc_start_ret and @delalloc_end_ret with the start and
> >    * end offsets of the subrange.
> >    */
> > -static bool have_delalloc_in_range(struct btrfs_inode *inode, u64 start, u64 end,
> > -                                u64 *delalloc_start_ret, u64 *delalloc_end_ret)
> > +bool btrfs_find_delalloc_in_range(struct btrfs_inode *inode, u64 start, u64 end,
> > +                               u64 *delalloc_start_ret, u64 *delalloc_end_ret)
> >   {
> >       u64 cur_offset = round_down(start, inode->root->fs_info->sectorsize);
> >       u64 prev_delalloc_end = 0;
> > @@ -3804,8 +3804,8 @@ static bool find_desired_extent_in_hole(struct btrfs_inode *inode, int whence,
> >       u64 delalloc_end;
> >       bool delalloc;
> >
> > -     delalloc = have_delalloc_in_range(inode, start, end, &delalloc_start,
> > -                                       &delalloc_end);
> > +     delalloc = btrfs_find_delalloc_in_range(inode, start, end,
> > +                                             &delalloc_start, &delalloc_end);
> >       if (delalloc && whence == SEEK_DATA) {
> >               *start_ret = delalloc_start;
> >               return true;
> > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> > index 2c7d31990777..8be1e021513a 100644
> > --- a/fs/btrfs/inode.c
> > +++ b/fs/btrfs/inode.c
> > @@ -7064,133 +7064,6 @@ struct extent_map *btrfs_get_extent(struct btrfs_inode *inode,
> >       return em;
> >   }
> >
> > -struct extent_map *btrfs_get_extent_fiemap(struct btrfs_inode *inode,
> > -                                        u64 start, u64 len)
> > -{
> > -     struct extent_map *em;
> > -     struct extent_map *hole_em = NULL;
> > -     u64 delalloc_start = start;
> > -     u64 end;
> > -     u64 delalloc_len;
> > -     u64 delalloc_end;
> > -     int err = 0;
> > -
> > -     em = btrfs_get_extent(inode, NULL, 0, start, len);
> > -     if (IS_ERR(em))
> > -             return em;
> > -     /*
> > -      * If our em maps to:
> > -      * - a hole or
> > -      * - a pre-alloc extent,
> > -      * there might actually be delalloc bytes behind it.
> > -      */
> > -     if (em->block_start != EXTENT_MAP_HOLE &&
> > -         !test_bit(EXTENT_FLAG_PREALLOC, &em->flags))
> > -             return em;
> > -     else
> > -             hole_em = em;
> > -
> > -     /* check to see if we've wrapped (len == -1 or similar) */
> > -     end = start + len;
> > -     if (end < start)
> > -             end = (u64)-1;
> > -     else
> > -             end -= 1;
> > -
> > -     em = NULL;
> > -
> > -     /* ok, we didn't find anything, lets look for delalloc */
> > -     delalloc_len = count_range_bits(&inode->io_tree, &delalloc_start,
> > -                              end, len, EXTENT_DELALLOC, 1);
> > -     delalloc_end = delalloc_start + delalloc_len;
> > -     if (delalloc_end < delalloc_start)
> > -             delalloc_end = (u64)-1;
> > -
> > -     /*
> > -      * We didn't find anything useful, return the original results from
> > -      * get_extent()
> > -      */
> > -     if (delalloc_start > end || delalloc_end <= start) {
> > -             em = hole_em;
> > -             hole_em = NULL;
> > -             goto out;
> > -     }
> > -
> > -     /*
> > -      * Adjust the delalloc_start to make sure it doesn't go backwards from
> > -      * the start they passed in
> > -      */
> > -     delalloc_start = max(start, delalloc_start);
> > -     delalloc_len = delalloc_end - delalloc_start;
> > -
> > -     if (delalloc_len > 0) {
> > -             u64 hole_start;
> > -             u64 hole_len;
> > -             const u64 hole_end = extent_map_end(hole_em);
> > -
> > -             em = alloc_extent_map();
> > -             if (!em) {
> > -                     err = -ENOMEM;
> > -                     goto out;
> > -             }
> > -
> > -             ASSERT(hole_em);
> > -             /*
> > -              * When btrfs_get_extent can't find anything it returns one
> > -              * huge hole
> > -              *
> > -              * Make sure what it found really fits our range, and adjust to
> > -              * make sure it is based on the start from the caller
> > -              */
> > -             if (hole_end <= start || hole_em->start > end) {
> > -                    free_extent_map(hole_em);
> > -                    hole_em = NULL;
> > -             } else {
> > -                    hole_start = max(hole_em->start, start);
> > -                    hole_len = hole_end - hole_start;
> > -             }
> > -
> > -             if (hole_em && delalloc_start > hole_start) {
> > -                     /*
> > -                      * Our hole starts before our delalloc, so we have to
> > -                      * return just the parts of the hole that go until the
> > -                      * delalloc starts
> > -                      */
> > -                     em->len = min(hole_len, delalloc_start - hole_start);
> > -                     em->start = hole_start;
> > -                     em->orig_start = hole_start;
> > -                     /*
> > -                      * Don't adjust block start at all, it is fixed at
> > -                      * EXTENT_MAP_HOLE
> > -                      */
> > -                     em->block_start = hole_em->block_start;
> > -                     em->block_len = hole_len;
> > -                     if (test_bit(EXTENT_FLAG_PREALLOC, &hole_em->flags))
> > -                             set_bit(EXTENT_FLAG_PREALLOC, &em->flags);
> > -             } else {
> > -                     /*
> > -                      * Hole is out of passed range or it starts after
> > -                      * delalloc range
> > -                      */
> > -                     em->start = delalloc_start;
> > -                     em->len = delalloc_len;
> > -                     em->orig_start = delalloc_start;
> > -                     em->block_start = EXTENT_MAP_DELALLOC;
> > -                     em->block_len = delalloc_len;
> > -             }
> > -     } else {
> > -             return hole_em;
> > -     }
> > -out:
> > -
> > -     free_extent_map(hole_em);
> > -     if (err) {
> > -             free_extent_map(em);
> > -             return ERR_PTR(err);
> > -     }
> > -     return em;
> > -}
> > -
> >   static struct extent_map *btrfs_create_dio_extent(struct btrfs_inode *inode,
> >                                                 const u64 start,
> >                                                 const u64 len,
> > @@ -8265,15 +8138,14 @@ static int btrfs_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
> >        * in the compression of data (in an async thread) and will return
> >        * before the compression is done and writeback is started. A second
> >        * filemap_fdatawrite_range() is needed to wait for the compression to
> > -      * complete and writeback to start. Without this, our user is very
> > -      * likely to get stale results, because the extents and extent maps for
> > -      * delalloc regions are only allocated when writeback starts.
> > +      * complete and writeback to start. We also need to wait for ordered
> > +      * extents to complete, because our fiemap implementation uses mainly
> > +      * file extent items to list the extents, searching for extent maps
> > +      * only for file ranges with holes or prealloc extents to figure out
> > +      * if we have delalloc in those ranges.
> >        */
> >       if (fieinfo->fi_flags & FIEMAP_FLAG_SYNC) {
> > -             ret = btrfs_fdatawrite_range(inode, 0, LLONG_MAX);
> > -             if (ret)
> > -                     return ret;
> > -             ret = filemap_fdatawait_range(inode->i_mapping, 0, LLONG_MAX);
> > +             ret = btrfs_wait_ordered_range(inode, 0, LLONG_MAX);
> >               if (ret)
> >                       return ret;
> >       }

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 10/10] btrfs: make fiemap more efficient and accurate reporting extent sharedness
  2022-09-02  8:59     ` Filipe Manana
@ 2022-09-02  9:34       ` Qu Wenruo
  2022-09-02  9:41         ` Filipe Manana
  0 siblings, 1 reply; 53+ messages in thread
From: Qu Wenruo @ 2022-09-02  9:34 UTC (permalink / raw)
  To: Filipe Manana, Qu Wenruo; +Cc: linux-btrfs



On 2022/9/2 16:59, Filipe Manana wrote:
> On Fri, Sep 2, 2022 at 12:27 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
[...]
>>
>> Is there any other major usage for extent map now?
> 
> For fsync at least it's very important.

OK, I forgot that fsync relies on that to determine if an extent needs
to be logged.

> Also for reads it's nice to not have to go to the b+tree.
> If someone reads several pages of an extent, being able to get it directly
> from the extent map tree is faster than having to go to the btree for
> every read.
> Extent map trees are per inode, but subvolume b+trees can have a lot of
> concurrent write and read access.
> 
>>
>> I can only think of read, which uses extent map to grab the logical
>> bytenr of the real extent.
>>
>> In that case, the SHARED flag doesn't make much sense anyway, can we do
>> a cleanup for those flags? Since fiemap/lseek no longer relies on extent
>> map anymore.
> 
> I don't get it. What SHARED flag are you talking about? And which "flags", where?
> We have nothing specific for lseek/fiemap in the extent maps, so I
> don't understand.

Nevermind, I got confused and thought there would be one SHARED flag for
extent map, but that's totally wrong...

> 
>>
[...]
>>
>> Although this is correct, it still looks a little tricky.
>>
>> We rely on btrfs_release_path() to release all tree blocks in the
>> subvolume tree, including unlocking the tree blocks, thus path->locks[0]
>> is also 0, meaning next time we call btrfs_release_path() we won't try
>> to unlock the cloned eb.
> 
> We're not taking any lock on the cloned extent buffer. It's not
> needed, it's private
> to the task.

Yep, that's completely fine, just looks a little tricky since we're 
going to release that path twice, and that's expected.

> 
>>
>> But I'd say it's still pretty tricky, and unfortunately I don't have any
>> better alternative.
>>
>>> +
>>> +     return 0;
>>> +}
>>> +
>>> +/*
>>> + * Process a range which is a hole or a prealloc extent in the inode's subvolume
>>> + * btree. If @disk_bytenr is 0, we are dealing with a hole, otherwise a prealloc
>>> + * extent. The end offset (@end) is inclusive.
>>> + */
>>> +static int fiemap_process_hole(struct btrfs_inode *inode,
>>
>> Does the name still make sense as we're handling both hole and prealloc
>> range?
> 
> I chose that name because hole and prealloc are treated the same way.
> Sure, I could name it fiemap_process_hole_or_prealloc() or something
> like that, but
> I decided to keep the name shorter, make them explicit in the comments and code.
> 
> The old code did the same, get_extent_skip_holes() skipped holes and
> prealloc extents without delalloc.
> 
>>
>>
>> And I always find the delalloc search a big pain during lseek/fiemap.
>>
>> I guess, except when using certain flags, there is some hard requirement for
>> delalloc range reporting?
> 
> Yes. Delalloc is not meant to be flushed for fiemap unless
> FIEMAP_FLAG_SYNC is given by the user.

Would it be possible to let btrfs always flush the delalloc range, no 
matter if FIEMAP_FLAG_SYNC is specified or not?

I really want to avoid the whole delalloc search thing if possible.

Although I'd guess such behavior would be against the fiemap 
requirement, and the extra writeback may greatly slow down the fiemap 
itself for large files with tons of delalloc, so I don't really expect
this to happen.
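
(For context, a minimal and hedged user-space sketch of how a caller opts into
that flush today, by passing FIEMAP_FLAG_SYNC to the FIEMAP ioctl; the file
name, extent count and trimmed error handling are made-up simplifications,
not anything from this patch set.)

/*
 * Sketch only: map up to 16 extents of an arbitrary file, asking the
 * kernel to flush delalloc first via FIEMAP_FLAG_SYNC.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

int main(void)
{
        struct fiemap *fm;
        unsigned int i;
        int fd = open("somefile", O_RDONLY);

        if (fd < 0)
                return 1;

        fm = calloc(1, sizeof(*fm) + 16 * sizeof(struct fiemap_extent));
        if (!fm)
                return 1;

        fm->fm_start = 0;
        fm->fm_length = FIEMAP_MAX_OFFSET;
        fm->fm_flags = FIEMAP_FLAG_SYNC;   /* flush delalloc before mapping */
        fm->fm_extent_count = 16;

        if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0)
                return 1;

        for (i = 0; i < fm->fm_mapped_extents; i++)
                printf("logical %llu len %llu flags 0x%x\n",
                       (unsigned long long)fm->fm_extents[i].fe_logical,
                       (unsigned long long)fm->fm_extents[i].fe_length,
                       fm->fm_extents[i].fe_flags);
        return 0;
}

Without FIEMAP_FLAG_SYNC the kernel simply reports whatever is currently
there, delalloc included.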

> For lseek it's just not needed, but that was already mentioned /
> discussed in patch 2/10.
> 
[...]
>>> +     /*
>>> +      * Lookup the last file extent. We're not using i_size here because
>>> +      * there might be preallocation past i_size.
>>> +      */
>>
>> I'm wondering how this could happen?
> 
> You can fallocate an extent at or after i_size.
> 
>>
>> Normally if we're truncating an inode, the extents starting after
>> round_up(i_size, sectorsize) should be dropped.
> 
> It has nothing to do with truncate, just fallocate.
> 
>>
>> Or if we later enlarge the inode, we may hit old extents and read out
>> some stale data other than expected zeros.
>>
>> Thus searching using round_up(i_size, sectorsize) should still let us
>> reach the slot after the last file extent.
>>
>> Or did I miss something?
> 
> Yes, it's about prealloc extents at or after i_size.

Did you mean fallocate using the keep_size flag?

Then that explains the whole reason.
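
(For anyone following along, a tiny sketch of how such an extent ends up past
i_size: fallocate(2) with FALLOC_FL_KEEP_SIZE allocates the range but leaves
i_size untouched. The file name, offset and length are arbitrary examples,
not taken from the patches.)

/*
 * Sketch: preallocate 1 MiB at offset 4 MiB without growing i_size,
 * leaving a prealloc file extent item beyond i_size.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
        int fd = open("testfile", O_RDWR | O_CREAT, 0644);

        if (fd < 0)
                return 1;

        /* i_size stays 0, but a prealloc extent covers [4 MiB, 5 MiB). */
        if (fallocate(fd, FALLOC_FL_KEEP_SIZE, 4 * 1024 * 1024, 1024 * 1024))
                return 1;

        close(fd);
        return 0;
}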

Thanks,
Qu

> 
> Thanks.
> 
>>
>> Thanks,
>> Qu
>>
>>> +     ret = btrfs_lookup_file_extent(NULL, root, path, ino, (u64)-1, 0);
>>> +     /* There can't be a file extent item at offset (u64)-1 */
>>> +     ASSERT(ret != 0);
>>> +     if (ret < 0)
>>> +             return ret;
>>> +
>>> +     /*
>>> +      * For a non-existing key, btrfs_search_slot() always leaves us at a
>>> +      * slot > 0, except if the btree is empty, which is impossible because
>>> +      * at least it has the inode item for this inode and all the items for
>>> +      * the root inode 256.
>>> +      */
>>> +     ASSERT(path->slots[0] > 0);
>>> +     path->slots[0]--;
>>> +     leaf = path->nodes[0];
>>> +     btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
>>> +     if (key.objectid != ino || key.type != BTRFS_EXTENT_DATA_KEY) {
>>> +             /* No file extent items in the subvolume tree. */
>>> +             *last_extent_end_ret = 0;
>>> +             return 0;
>>>        }
>>> -     btrfs_release_path(path);
>>>
>>>        /*
>>> -      * we might have some extents allocated but more delalloc past those
>>> -      * extents.  so, we trust isize unless the start of the last extent is
>>> -      * beyond isize
>>> +      * For an inline extent, the disk_bytenr is where inline data starts at,
>>> +      * so first check if we have an inline extent item before checking if we
>>> +      * have an implicit hole (disk_bytenr == 0).
>>>         */
>>> -     if (last < isize) {
>>> -             last = (u64)-1;
>>> -             last_for_get_extent = isize;
>>> +     ei = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_file_extent_item);
>>> +     if (btrfs_file_extent_type(leaf, ei) == BTRFS_FILE_EXTENT_INLINE) {
>>> +             *last_extent_end_ret = btrfs_file_extent_end(path);
>>> +             return 0;
>>>        }
>>>
>>> -     lock_extent_bits(&inode->io_tree, start, start + len - 1,
>>> -                      &cached_state);
>>> +     /*
>>> +      * Find the last file extent item that is not a hole (when NO_HOLES is
>>> +      * not enabled). This should take at most 2 iterations in the worst
>>> +      * case: we have one hole file extent item at slot 0 of a leaf and
>>> +      * another hole file extent item as the last item in the previous leaf.
>>> +      * This is because we merge file extent items that represent holes.
>>> +      */
>>> +     disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, ei);
>>> +     while (disk_bytenr == 0) {
>>> +             ret = btrfs_previous_item(root, path, ino, BTRFS_EXTENT_DATA_KEY);
>>> +             if (ret < 0) {
>>> +                     return ret;
>>> +             } else if (ret > 0) {
>>> +                     /* No file extent items that are not holes. */
>>> +                     *last_extent_end_ret = 0;
>>> +                     return 0;
>>> +             }
>>> +             leaf = path->nodes[0];
>>> +             ei = btrfs_item_ptr(leaf, path->slots[0],
>>> +                                 struct btrfs_file_extent_item);
>>> +             disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, ei);
>>> +     }
>>>
>>> -     em = get_extent_skip_holes(inode, start, last_for_get_extent);
>>> -     if (!em)
>>> -             goto out;
>>> -     if (IS_ERR(em)) {
>>> -             ret = PTR_ERR(em);
>>> +     *last_extent_end_ret = btrfs_file_extent_end(path);
>>> +     return 0;
>>> +}
>>> +
>>> +int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
>>> +               u64 start, u64 len)
>>> +{
>>> +     const u64 ino = btrfs_ino(inode);
>>> +     struct extent_state *cached_state = NULL;
>>> +     struct btrfs_path *path;
>>> +     struct btrfs_root *root = inode->root;
>>> +     struct fiemap_cache cache = { 0 };
>>> +     struct btrfs_backref_shared_cache *backref_cache;
>>> +     struct ulist *roots;
>>> +     struct ulist *tmp_ulist;
>>> +     u64 last_extent_end;
>>> +     u64 prev_extent_end;
>>> +     u64 lockstart;
>>> +     u64 lockend;
>>> +     bool stopped = false;
>>> +     int ret;
>>> +
>>> +     backref_cache = kzalloc(sizeof(*backref_cache), GFP_KERNEL);
>>> +     path = btrfs_alloc_path();
>>> +     roots = ulist_alloc(GFP_KERNEL);
>>> +     tmp_ulist = ulist_alloc(GFP_KERNEL);
>>> +     if (!backref_cache || !path || !roots || !tmp_ulist) {
>>> +             ret = -ENOMEM;
>>>                goto out;
>>>        }
>>>
>>> -     while (!end) {
>>> -             u64 offset_in_extent = 0;
>>> +     lockstart = round_down(start, btrfs_inode_sectorsize(inode));
>>> +     lockend = round_up(start + len, btrfs_inode_sectorsize(inode));
>>> +     prev_extent_end = lockstart;
>>>
>>> -             /* break if the extent we found is outside the range */
>>> -             if (em->start >= max || extent_map_end(em) < off)
>>> -                     break;
>>> +     lock_extent_bits(&inode->io_tree, lockstart, lockend, &cached_state);
>>>
>>> -             /*
>>> -              * get_extent may return an extent that starts before our
>>> -              * requested range.  We have to make sure the ranges
>>> -              * we return to fiemap always move forward and don't
>>> -              * overlap, so adjust the offsets here
>>> -              */
>>> -             em_start = max(em->start, off);
>>> +     ret = fiemap_find_last_extent_offset(inode, path, &last_extent_end);
>>> +     if (ret < 0)
>>> +             goto out_unlock;
>>> +     btrfs_release_path(path);
>>>
>>> +     path->reada = READA_FORWARD;
>>> +     ret = fiemap_search_slot(inode, path, lockstart);
>>> +     if (ret < 0) {
>>> +             goto out_unlock;
>>> +     } else if (ret > 0) {
>>>                /*
>>> -              * record the offset from the start of the extent
>>> -              * for adjusting the disk offset below.  Only do this if the
>>> -              * extent isn't compressed since our in ram offset may be past
>>> -              * what we have actually allocated on disk.
>>> +              * No file extent item found, but we may have delalloc between
>>> +              * the current offset and i_size. So check for that.
>>>                 */
>>> -             if (!test_bit(EXTENT_FLAG_COMPRESSED, &em->flags))
>>> -                     offset_in_extent = em_start - em->start;
>>> -             em_end = extent_map_end(em);
>>> -             em_len = em_end - em_start;
>>> -             flags = 0;
>>> -             if (em->block_start < EXTENT_MAP_LAST_BYTE)
>>> -                     disko = em->block_start + offset_in_extent;
>>> -             else
>>> -                     disko = 0;
>>> +             ret = 0;
>>> +             goto check_eof_delalloc;
>>> +     }
>>> +
>>> +     while (prev_extent_end < lockend) {
>>> +             struct extent_buffer *leaf = path->nodes[0];
>>> +             struct btrfs_file_extent_item *ei;
>>> +             struct btrfs_key key;
>>> +             u64 extent_end;
>>> +             u64 extent_len;
>>> +             u64 extent_offset = 0;
>>> +             u64 extent_gen;
>>> +             u64 disk_bytenr = 0;
>>> +             u64 flags = 0;
>>> +             int extent_type;
>>> +             u8 compression;
>>> +
>>> +             btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
>>> +             if (key.objectid != ino || key.type != BTRFS_EXTENT_DATA_KEY)
>>> +                     break;
>>> +
>>> +             extent_end = btrfs_file_extent_end(path);
>>>
>>>                /*
>>> -              * bump off for our next call to get_extent
>>> +              * The first iteration can leave us at an extent item that ends
>>> +              * before our range's start. Move to the next item.
>>>                 */
>>> -             off = extent_map_end(em);
>>> -             if (off >= max)
>>> -                     end = 1;
>>> -
>>> -             if (em->block_start == EXTENT_MAP_INLINE) {
>>> -                     flags |= (FIEMAP_EXTENT_DATA_INLINE |
>>> -                               FIEMAP_EXTENT_NOT_ALIGNED);
>>> -             } else if (em->block_start == EXTENT_MAP_DELALLOC) {
>>> -                     flags |= (FIEMAP_EXTENT_DELALLOC |
>>> -                               FIEMAP_EXTENT_UNKNOWN);
>>> -             } else if (fieinfo->fi_extents_max) {
>>> -                     u64 extent_gen;
>>> -                     u64 bytenr = em->block_start -
>>> -                             (em->start - em->orig_start);
>>> +             if (extent_end <= lockstart)
>>> +                     goto next_item;
>>>
>>> -                     /*
>>> -                      * If two extent maps are merged, then their generation
>>> -                      * is set to the maximum between their generations.
>>> -                      * Otherwise its generation matches the one we have in
>>> -                      * corresponding file extent item. If we have a merged
>>> -                      * extent map, don't use its generation to speedup the
>>> -                      * sharedness check below.
>>> -                      */
>>> -                     if (test_bit(EXTENT_FLAG_MERGED, &em->flags))
>>> -                             extent_gen = 0;
>>> -                     else
>>> -                             extent_gen = em->generation;
>>> +             /* We have an implicit hole (NO_HOLES feature enabled). */
>>> +             if (prev_extent_end < key.offset) {
>>> +                     const u64 range_end = min(key.offset, lockend) - 1;
>>>
>>> -                     /*
>>> -                      * As btrfs supports shared space, this information
>>> -                      * can be exported to userspace tools via
>>> -                      * flag FIEMAP_EXTENT_SHARED.  If fi_extents_max == 0
>>> -                      * then we're just getting a count and we can skip the
>>> -                      * lookup stuff.
>>> -                      */
>>> -                     ret = btrfs_is_data_extent_shared(root, btrfs_ino(inode),
>>> -                                                       bytenr, extent_gen,
>>> -                                                       roots, tmp_ulist,
>>> -                                                       backref_cache);
>>> -                     if (ret < 0)
>>> -                             goto out_free;
>>> -                     if (ret)
>>> -                             flags |= FIEMAP_EXTENT_SHARED;
>>> -                     ret = 0;
>>> -             }
>>> -             if (test_bit(EXTENT_FLAG_COMPRESSED, &em->flags))
>>> -                     flags |= FIEMAP_EXTENT_ENCODED;
>>> -             if (test_bit(EXTENT_FLAG_PREALLOC, &em->flags))
>>> -                     flags |= FIEMAP_EXTENT_UNWRITTEN;
>>> +                     ret = fiemap_process_hole(inode, fieinfo, &cache,
>>> +                                               backref_cache, 0, 0, 0,
>>> +                                               roots, tmp_ulist,
>>> +                                               prev_extent_end, range_end);
>>> +                     if (ret < 0) {
>>> +                             goto out_unlock;
>>> +                     } else if (ret > 0) {
>>> +                             /* fiemap_fill_next_extent() told us to stop. */
>>> +                             stopped = true;
>>> +                             break;
>>> +                     }
>>>
>>> -             free_extent_map(em);
>>> -             em = NULL;
>>> -             if ((em_start >= last) || em_len == (u64)-1 ||
>>> -                (last == (u64)-1 && isize <= em_end)) {
>>> -                     flags |= FIEMAP_EXTENT_LAST;
>>> -                     end = 1;
>>> +                     /* We've reached the end of the fiemap range, stop. */
>>> +                     if (key.offset >= lockend) {
>>> +                             stopped = true;
>>> +                             break;
>>> +                     }
>>>                }
>>>
>>> -             /* now scan forward to see if this is really the last extent. */
>>> -             em = get_extent_skip_holes(inode, off, last_for_get_extent);
>>> -             if (IS_ERR(em)) {
>>> -                     ret = PTR_ERR(em);
>>> -                     goto out;
>>> +             extent_len = extent_end - key.offset;
>>> +             ei = btrfs_item_ptr(leaf, path->slots[0],
>>> +                                 struct btrfs_file_extent_item);
>>> +             compression = btrfs_file_extent_compression(leaf, ei);
>>> +             extent_type = btrfs_file_extent_type(leaf, ei);
>>> +             extent_gen = btrfs_file_extent_generation(leaf, ei);
>>> +
>>> +             if (extent_type != BTRFS_FILE_EXTENT_INLINE) {
>>> +                     disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, ei);
>>> +                     if (compression == BTRFS_COMPRESS_NONE)
>>> +                             extent_offset = btrfs_file_extent_offset(leaf, ei);
>>>                }
>>> -             if (!em) {
>>> -                     flags |= FIEMAP_EXTENT_LAST;
>>> -                     end = 1;
>>> +
>>> +             if (compression != BTRFS_COMPRESS_NONE)
>>> +                     flags |= FIEMAP_EXTENT_ENCODED;
>>> +
>>> +             if (extent_type == BTRFS_FILE_EXTENT_INLINE) {
>>> +                     flags |= FIEMAP_EXTENT_DATA_INLINE;
>>> +                     flags |= FIEMAP_EXTENT_NOT_ALIGNED;
>>> +                     ret = emit_fiemap_extent(fieinfo, &cache, key.offset, 0,
>>> +                                              extent_len, flags);
>>> +             } else if (extent_type == BTRFS_FILE_EXTENT_PREALLOC) {
>>> +                     ret = fiemap_process_hole(inode, fieinfo, &cache,
>>> +                                               backref_cache,
>>> +                                               disk_bytenr, extent_offset,
>>> +                                               extent_gen, roots, tmp_ulist,
>>> +                                               key.offset, extent_end - 1);
>>> +             } else if (disk_bytenr == 0) {
>>> +                     /* We have an explicit hole. */
>>> +                     ret = fiemap_process_hole(inode, fieinfo, &cache,
>>> +                                               backref_cache, 0, 0, 0,
>>> +                                               roots, tmp_ulist,
>>> +                                               key.offset, extent_end - 1);
>>> +             } else {
>>> +                     /* We have a regular extent. */
>>> +                     if (fieinfo->fi_extents_max) {
>>> +                             ret = btrfs_is_data_extent_shared(root, ino,
>>> +                                                               disk_bytenr,
>>> +                                                               extent_gen,
>>> +                                                               roots,
>>> +                                                               tmp_ulist,
>>> +                                                               backref_cache);
>>> +                             if (ret < 0)
>>> +                                     goto out_unlock;
>>> +                             else if (ret > 0)
>>> +                                     flags |= FIEMAP_EXTENT_SHARED;
>>> +                     }
>>> +
>>> +                     ret = emit_fiemap_extent(fieinfo, &cache, key.offset,
>>> +                                              disk_bytenr + extent_offset,
>>> +                                              extent_len, flags);
>>>                }
>>> -             ret = emit_fiemap_extent(fieinfo, &cache, em_start, disko,
>>> -                                        em_len, flags);
>>> -             if (ret) {
>>> -                     if (ret == 1)
>>> -                             ret = 0;
>>> -                     goto out_free;
>>> +
>>> +             if (ret < 0) {
>>> +                     goto out_unlock;
>>> +             } else if (ret > 0) {
>>> +                     /* fiemap_fill_next_extent() told us to stop. */
>>> +                     stopped = true;
>>> +                     break;
>>>                }
>>>
>>> +             prev_extent_end = extent_end;
>>> +next_item:
>>>                if (fatal_signal_pending(current)) {
>>>                        ret = -EINTR;
>>> -                     goto out_free;
>>> +                     goto out_unlock;
>>>                }
>>> +
>>> +             ret = fiemap_next_leaf_item(inode, path);
>>> +             if (ret < 0) {
>>> +                     goto out_unlock;
>>> +             } else if (ret > 0) {
>>> +                     /* No more file extent items for this inode. */
>>> +                     break;
>>> +             }
>>> +             cond_resched();
>>>        }
>>> -out_free:
>>> -     if (!ret)
>>> -             ret = emit_last_fiemap_cache(fieinfo, &cache);
>>> -     free_extent_map(em);
>>> -out:
>>> -     unlock_extent_cached(&inode->io_tree, start, start + len - 1,
>>> -                          &cached_state);
>>>
>>> -out_free_ulist:
>>> +check_eof_delalloc:
>>> +     /*
>>> +      * Release (and free) the path before emitting any final entries to
>>> +      * fiemap_fill_next_extent() to keep lockdep happy. This is because
>>> +      * once we find no more file extent items exist, we may have a
>>> +      * non-cloned leaf, and fiemap_fill_next_extent() can trigger page
>>> +      * faults when copying data to the user space buffer.
>>> +      */
>>> +     btrfs_free_path(path);
>>> +     path = NULL;
>>> +
>>> +     if (!stopped && prev_extent_end < lockend) {
>>> +             ret = fiemap_process_hole(inode, fieinfo, &cache, backref_cache,
>>> +                                       0, 0, 0, roots, tmp_ulist,
>>> +                                       prev_extent_end, lockend - 1);
>>> +             if (ret < 0)
>>> +                     goto out_unlock;
>>> +             prev_extent_end = lockend;
>>> +     }
>>> +
>>> +     if (cache.cached && cache.offset + cache.len >= last_extent_end) {
>>> +             const u64 i_size = i_size_read(&inode->vfs_inode);
>>> +
>>> +             if (prev_extent_end < i_size) {
>>> +                     u64 delalloc_start;
>>> +                     u64 delalloc_end;
>>> +                     bool delalloc;
>>> +
>>> +                     delalloc = btrfs_find_delalloc_in_range(inode,
>>> +                                                             prev_extent_end,
>>> +                                                             i_size - 1,
>>> +                                                             &delalloc_start,
>>> +                                                             &delalloc_end);
>>> +                     if (!delalloc)
>>> +                             cache.flags |= FIEMAP_EXTENT_LAST;
>>> +             } else {
>>> +                     cache.flags |= FIEMAP_EXTENT_LAST;
>>> +             }
>>> +     }
>>> +
>>> +     ret = emit_last_fiemap_cache(fieinfo, &cache);
>>> +
>>> +out_unlock:
>>> +     unlock_extent_cached(&inode->io_tree, lockstart, lockend, &cached_state);
>>> +out:
>>>        kfree(backref_cache);
>>>        btrfs_free_path(path);
>>>        ulist_free(roots);
>>> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
>>> index b292a8ada3a4..636b3ec46184 100644
>>> --- a/fs/btrfs/file.c
>>> +++ b/fs/btrfs/file.c
>>> @@ -3602,10 +3602,10 @@ static long btrfs_fallocate(struct file *file, int mode,
>>>    }
>>>
>>>    /*
>>> - * Helper for have_delalloc_in_range(). Find a subrange in a given range that
>>> - * has unflushed and/or flushing delalloc. There might be other adjacent
>>> - * subranges after the one it found, so have_delalloc_in_range() keeps looping
>>> - * while it gets adjacent subranges, and merging them together.
>>> + * Helper for btrfs_find_delalloc_in_range(). Find a subrange in a given range
>>> + * that has unflushed and/or flushing delalloc. There might be other adjacent
>>> + * subranges after the one it found, so btrfs_find_delalloc_in_range() keeps
>>> + * looping while it gets adjacent subranges, and merging them together.
>>>     */
>>>    static bool find_delalloc_subrange(struct btrfs_inode *inode, u64 start, u64 end,
>>>                                   u64 *delalloc_start_ret, u64 *delalloc_end_ret)
>>> @@ -3740,8 +3740,8 @@ static bool find_delalloc_subrange(struct btrfs_inode *inode, u64 start, u64 end
>>>     * if so it sets @delalloc_start_ret and @delalloc_end_ret with the start and
>>>     * end offsets of the subrange.
>>>     */
>>> -static bool have_delalloc_in_range(struct btrfs_inode *inode, u64 start, u64 end,
>>> -                                u64 *delalloc_start_ret, u64 *delalloc_end_ret)
>>> +bool btrfs_find_delalloc_in_range(struct btrfs_inode *inode, u64 start, u64 end,
>>> +                               u64 *delalloc_start_ret, u64 *delalloc_end_ret)
>>>    {
>>>        u64 cur_offset = round_down(start, inode->root->fs_info->sectorsize);
>>>        u64 prev_delalloc_end = 0;
>>> @@ -3804,8 +3804,8 @@ static bool find_desired_extent_in_hole(struct btrfs_inode *inode, int whence,
>>>        u64 delalloc_end;
>>>        bool delalloc;
>>>
>>> -     delalloc = have_delalloc_in_range(inode, start, end, &delalloc_start,
>>> -                                       &delalloc_end);
>>> +     delalloc = btrfs_find_delalloc_in_range(inode, start, end,
>>> +                                             &delalloc_start, &delalloc_end);
>>>        if (delalloc && whence == SEEK_DATA) {
>>>                *start_ret = delalloc_start;
>>>                return true;
>>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
>>> index 2c7d31990777..8be1e021513a 100644
>>> --- a/fs/btrfs/inode.c
>>> +++ b/fs/btrfs/inode.c
>>> @@ -7064,133 +7064,6 @@ struct extent_map *btrfs_get_extent(struct btrfs_inode *inode,
>>>        return em;
>>>    }
>>>
>>> -struct extent_map *btrfs_get_extent_fiemap(struct btrfs_inode *inode,
>>> -                                        u64 start, u64 len)
>>> -{
>>> -     struct extent_map *em;
>>> -     struct extent_map *hole_em = NULL;
>>> -     u64 delalloc_start = start;
>>> -     u64 end;
>>> -     u64 delalloc_len;
>>> -     u64 delalloc_end;
>>> -     int err = 0;
>>> -
>>> -     em = btrfs_get_extent(inode, NULL, 0, start, len);
>>> -     if (IS_ERR(em))
>>> -             return em;
>>> -     /*
>>> -      * If our em maps to:
>>> -      * - a hole or
>>> -      * - a pre-alloc extent,
>>> -      * there might actually be delalloc bytes behind it.
>>> -      */
>>> -     if (em->block_start != EXTENT_MAP_HOLE &&
>>> -         !test_bit(EXTENT_FLAG_PREALLOC, &em->flags))
>>> -             return em;
>>> -     else
>>> -             hole_em = em;
>>> -
>>> -     /* check to see if we've wrapped (len == -1 or similar) */
>>> -     end = start + len;
>>> -     if (end < start)
>>> -             end = (u64)-1;
>>> -     else
>>> -             end -= 1;
>>> -
>>> -     em = NULL;
>>> -
>>> -     /* ok, we didn't find anything, lets look for delalloc */
>>> -     delalloc_len = count_range_bits(&inode->io_tree, &delalloc_start,
>>> -                              end, len, EXTENT_DELALLOC, 1);
>>> -     delalloc_end = delalloc_start + delalloc_len;
>>> -     if (delalloc_end < delalloc_start)
>>> -             delalloc_end = (u64)-1;
>>> -
>>> -     /*
>>> -      * We didn't find anything useful, return the original results from
>>> -      * get_extent()
>>> -      */
>>> -     if (delalloc_start > end || delalloc_end <= start) {
>>> -             em = hole_em;
>>> -             hole_em = NULL;
>>> -             goto out;
>>> -     }
>>> -
>>> -     /*
>>> -      * Adjust the delalloc_start to make sure it doesn't go backwards from
>>> -      * the start they passed in
>>> -      */
>>> -     delalloc_start = max(start, delalloc_start);
>>> -     delalloc_len = delalloc_end - delalloc_start;
>>> -
>>> -     if (delalloc_len > 0) {
>>> -             u64 hole_start;
>>> -             u64 hole_len;
>>> -             const u64 hole_end = extent_map_end(hole_em);
>>> -
>>> -             em = alloc_extent_map();
>>> -             if (!em) {
>>> -                     err = -ENOMEM;
>>> -                     goto out;
>>> -             }
>>> -
>>> -             ASSERT(hole_em);
>>> -             /*
>>> -              * When btrfs_get_extent can't find anything it returns one
>>> -              * huge hole
>>> -              *
>>> -              * Make sure what it found really fits our range, and adjust to
>>> -              * make sure it is based on the start from the caller
>>> -              */
>>> -             if (hole_end <= start || hole_em->start > end) {
>>> -                    free_extent_map(hole_em);
>>> -                    hole_em = NULL;
>>> -             } else {
>>> -                    hole_start = max(hole_em->start, start);
>>> -                    hole_len = hole_end - hole_start;
>>> -             }
>>> -
>>> -             if (hole_em && delalloc_start > hole_start) {
>>> -                     /*
>>> -                      * Our hole starts before our delalloc, so we have to
>>> -                      * return just the parts of the hole that go until the
>>> -                      * delalloc starts
>>> -                      */
>>> -                     em->len = min(hole_len, delalloc_start - hole_start);
>>> -                     em->start = hole_start;
>>> -                     em->orig_start = hole_start;
>>> -                     /*
>>> -                      * Don't adjust block start at all, it is fixed at
>>> -                      * EXTENT_MAP_HOLE
>>> -                      */
>>> -                     em->block_start = hole_em->block_start;
>>> -                     em->block_len = hole_len;
>>> -                     if (test_bit(EXTENT_FLAG_PREALLOC, &hole_em->flags))
>>> -                             set_bit(EXTENT_FLAG_PREALLOC, &em->flags);
>>> -             } else {
>>> -                     /*
>>> -                      * Hole is out of passed range or it starts after
>>> -                      * delalloc range
>>> -                      */
>>> -                     em->start = delalloc_start;
>>> -                     em->len = delalloc_len;
>>> -                     em->orig_start = delalloc_start;
>>> -                     em->block_start = EXTENT_MAP_DELALLOC;
>>> -                     em->block_len = delalloc_len;
>>> -             }
>>> -     } else {
>>> -             return hole_em;
>>> -     }
>>> -out:
>>> -
>>> -     free_extent_map(hole_em);
>>> -     if (err) {
>>> -             free_extent_map(em);
>>> -             return ERR_PTR(err);
>>> -     }
>>> -     return em;
>>> -}
>>> -
>>>    static struct extent_map *btrfs_create_dio_extent(struct btrfs_inode *inode,
>>>                                                  const u64 start,
>>>                                                  const u64 len,
>>> @@ -8265,15 +8138,14 @@ static int btrfs_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
>>>         * in the compression of data (in an async thread) and will return
>>>         * before the compression is done and writeback is started. A second
>>>         * filemap_fdatawrite_range() is needed to wait for the compression to
>>> -      * complete and writeback to start. Without this, our user is very
>>> -      * likely to get stale results, because the extents and extent maps for
>>> -      * delalloc regions are only allocated when writeback starts.
>>> +      * complete and writeback to start. We also need to wait for ordered
>>> +      * extents to complete, because our fiemap implementation uses mainly
>>> +      * file extent items to list the extents, searching for extent maps
>>> +      * only for file ranges with holes or prealloc extents to figure out
>>> +      * if we have delalloc in those ranges.
>>>         */
>>>        if (fieinfo->fi_flags & FIEMAP_FLAG_SYNC) {
>>> -             ret = btrfs_fdatawrite_range(inode, 0, LLONG_MAX);
>>> -             if (ret)
>>> -                     return ret;
>>> -             ret = filemap_fdatawait_range(inode->i_mapping, 0, LLONG_MAX);
>>> +             ret = btrfs_wait_ordered_range(inode, 0, LLONG_MAX);
>>>                if (ret)
>>>                        return ret;
>>>        }

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 10/10] btrfs: make fiemap more efficient and accurate reporting extent sharedness
  2022-09-02  9:34       ` Qu Wenruo
@ 2022-09-02  9:41         ` Filipe Manana
  2022-09-02  9:50           ` Qu Wenruo
  0 siblings, 1 reply; 53+ messages in thread
From: Filipe Manana @ 2022-09-02  9:41 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Qu Wenruo, linux-btrfs

On Fri, Sep 2, 2022 at 10:35 AM Qu Wenruo <wqu@suse.com> wrote:
>
>
>
> On 2022/9/2 16:59, Filipe Manana wrote:
> > On Fri, Sep 2, 2022 at 12:27 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> [...]
> >>
> >> Is there any other major usage for extent map now?
> >
> > For fsync at least it's very important.
>
> OK, I forgot that fsync relies on that to determine if an extent needs
> to be logged.
>
> > Also for reads it's nice to not have to go to the b+tree.
> > If someone reads several pages of an extent, being able to get it directly
> > from the extent map tree is faster than having to go to the btree for
> > every read.
> > Extent map trees are per inode, but subvolume b+trees can have a lot of
> > concurrent write and read access.
> >
> >>
> >> I can only think of read, which uses extent map to grab the logical
> >> bytenr of the real extent.
> >>
> >> In that case, the SHARED flag doesn't make much sense anyway, can we do
> >> a cleanup for those flags? Since fiemap/lseek no longer relies on extent
> >> map anymore.
> >
> > I don't get it. What SHARED flag are you talking about? And which "flags", where?
> > We have nothing specific for lseek/fiemap in the extent maps, so I
> > don't understand.
>
> Nevermind, I got confused and thought there would be one SHARED flag for
> extent map, but that's totally wrong...
>
> >
> >>
> [...]
> >>
> >> Although this is correct, it still looks a little tricky.
> >>
> >> We rely on btrfs_release_path() to release all tree blocks in the
> >> subvolume tree, including unlocking the tree blocks, thus path->locks[0]
> >> is also 0, meaning next time we call btrfs_release_path() we won't try
> >> to unlock the cloned eb.
> >
> > We're not taking any lock on the cloned extent buffer. It's not
> > needed, it's private
> > to the task.
>
> Yep, that's completely fine, just looks a little tricky since we're
> going to release that path twice, and that's expected.

Hum?
It's twice but for different extent buffers.

>
> >
> >>
> >> But I'd say it's still pretty tricky, and unfortunately I don't have any
> >> better alternative.
> >>
> >>> +
> >>> +     return 0;
> >>> +}
> >>> +
> >>> +/*
> >>> + * Process a range which is a hole or a prealloc extent in the inode's subvolume
> >>> + * btree. If @disk_bytenr is 0, we are dealing with a hole, otherwise a prealloc
> >>> + * extent. The end offset (@end) is inclusive.
> >>> + */
> >>> +static int fiemap_process_hole(struct btrfs_inode *inode,
> >>
> >> Does the name still make sense as we're handling both hole and prealloc
> >> range?
> >
> > I chose that name because hole and prealloc are treated the same way.
> > Sure, I could name it fiemap_process_hole_or_prealloc() or something
> > like that, but
> > I decided to keep the name shorter, make them explicit in the comments and code.
> >
> > The old code did the same, get_extent_skip_holes() skipped holes and
> > prealloc extents without delalloc.
> >
> >>
> >>
> >> And I always find the delalloc search a big pain during lseek/fiemap.
> >>
> >> I guess, except when using certain flags, there is some hard requirement for
> >> delalloc range reporting?
> >
> > Yes. Delalloc is not meant to be flushed for fiemap unless
> > FIEMAP_FLAG_SYNC is given by the user.
>
> Would it be possible to let btrfs always flush the delalloc range, no
> matter if FIEMAP_FLAG_SYNC is specified or not?
>
> I really want to avoid the whole delalloc search thing if possible.
>
> Although I'd guess such behavior would be against the fiemap
> requirement, and the extra writeback may greatly slow down the fiemap
> itself for large files with tons of delalloc, so I don't really expect
> this to happen.

No, doing such a change is a bad idea.
It changes the semantics and expected behaviour.

My goal here is to preserve all semantics and behaviour, but make it
more efficient.

Even if we were all to decide to do that, it should be done
separately - but I don't think that is correct anyway, since
fiemap can be used to detect delalloc, and probably there are users
relying on it for that.
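
(As a hedged illustration of that use case - nothing from this patch set - a
caller that already has a buffer filled by the FIEMAP ioctl can spot delalloc
from the per-extent flags; the helper below is a made-up name.)

/*
 * Sketch: given a fiemap buffer already filled by FS_IOC_FIEMAP (done
 * without FIEMAP_FLAG_SYNC, so delalloc is reported rather than
 * flushed), print the ranges that are still delalloc.
 */
#include <stdio.h>
#include <linux/fiemap.h>

void report_delalloc(const struct fiemap *fm)
{
        unsigned int i;

        for (i = 0; i < fm->fm_mapped_extents; i++) {
                const struct fiemap_extent *fe = &fm->fm_extents[i];

                if (fe->fe_flags & FIEMAP_EXTENT_DELALLOC)
                        printf("delalloc: logical %llu len %llu\n",
                               (unsigned long long)fe->fe_logical,
                               (unsigned long long)fe->fe_length);
        }
}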


>
> > For lseek it's just not needed, but that was already mentioned /
> > discussed in patch 2/10.
> >
> [...]
> >>> +     /*
> >>> +      * Lookup the last file extent. We're not using i_size here because
> >>> +      * there might be preallocation past i_size.
> >>> +      */
> >>
> >> I'm wondering how this could happen?
> >
> > You can fallocate an extent at or after i_size.
> >
> >>
> >> Normally if we're truncating an inode, the extents starting after
> >> round_up(i_size, sectorsize) should be dropped.
> >
> > It has nothing to do with truncate, just fallocate.
> >
> >>
> >> Or if we later enlarge the inode, we may hit old extents and read out
> >> some stale data other than expected zeros.
> >>
> >> Thus searching using round_up(i_size, sectorsize) should still let us
> >> reach the slot after the last file extent.
> >>
> >> Or did I miss something?
> >
> > Yes, it's about prealloc extents at or after i_size.
>
> Did you mean fallocate using the keep_size flag?

Yes. Otherwise the extent wouldn't end up beyond i_size.

>
> Then that explains the whole reason.

Great!
Thanks.

>
> Thanks,
> Qu
>
> >
> > Thanks.
> >
> >>
> >> Thanks,
> >> Qu
> >>
> >>> +     ret = btrfs_lookup_file_extent(NULL, root, path, ino, (u64)-1, 0);
> >>> +     /* There can't be a file extent item at offset (u64)-1 */
> >>> +     ASSERT(ret != 0);
> >>> +     if (ret < 0)
> >>> +             return ret;
> >>> +
> >>> +     /*
> >>> +      * For a non-existing key, btrfs_search_slot() always leaves us at a
> >>> +      * slot > 0, except if the btree is empty, which is impossible because
> >>> +      * at least it has the inode item for this inode and all the items for
> >>> +      * the root inode 256.
> >>> +      */
> >>> +     ASSERT(path->slots[0] > 0);
> >>> +     path->slots[0]--;
> >>> +     leaf = path->nodes[0];
> >>> +     btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
> >>> +     if (key.objectid != ino || key.type != BTRFS_EXTENT_DATA_KEY) {
> >>> +             /* No file extent items in the subvolume tree. */
> >>> +             *last_extent_end_ret = 0;
> >>> +             return 0;
> >>>        }
> >>> -     btrfs_release_path(path);
> >>>
> >>>        /*
> >>> -      * we might have some extents allocated but more delalloc past those
> >>> -      * extents.  so, we trust isize unless the start of the last extent is
> >>> -      * beyond isize
> >>> +      * For an inline extent, the disk_bytenr is where inline data starts at,
> >>> +      * so first check if we have an inline extent item before checking if we
> >>> +      * have an implicit hole (disk_bytenr == 0).
> >>>         */
> >>> -     if (last < isize) {
> >>> -             last = (u64)-1;
> >>> -             last_for_get_extent = isize;
> >>> +     ei = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_file_extent_item);
> >>> +     if (btrfs_file_extent_type(leaf, ei) == BTRFS_FILE_EXTENT_INLINE) {
> >>> +             *last_extent_end_ret = btrfs_file_extent_end(path);
> >>> +             return 0;
> >>>        }
> >>>
> >>> -     lock_extent_bits(&inode->io_tree, start, start + len - 1,
> >>> -                      &cached_state);
> >>> +     /*
> >>> +      * Find the last file extent item that is not a hole (when NO_HOLES is
> >>> +      * not enabled). This should take at most 2 iterations in the worst
> >>> +      * case: we have one hole file extent item at slot 0 of a leaf and
> >>> +      * another hole file extent item as the last item in the previous leaf.
> >>> +      * This is because we merge file extent items that represent holes.
> >>> +      */
> >>> +     disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, ei);
> >>> +     while (disk_bytenr == 0) {
> >>> +             ret = btrfs_previous_item(root, path, ino, BTRFS_EXTENT_DATA_KEY);
> >>> +             if (ret < 0) {
> >>> +                     return ret;
> >>> +             } else if (ret > 0) {
> >>> +                     /* No file extent items that are not holes. */
> >>> +                     *last_extent_end_ret = 0;
> >>> +                     return 0;
> >>> +             }
> >>> +             leaf = path->nodes[0];
> >>> +             ei = btrfs_item_ptr(leaf, path->slots[0],
> >>> +                                 struct btrfs_file_extent_item);
> >>> +             disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, ei);
> >>> +     }
> >>>
> >>> -     em = get_extent_skip_holes(inode, start, last_for_get_extent);
> >>> -     if (!em)
> >>> -             goto out;
> >>> -     if (IS_ERR(em)) {
> >>> -             ret = PTR_ERR(em);
> >>> +     *last_extent_end_ret = btrfs_file_extent_end(path);
> >>> +     return 0;
> >>> +}
> >>> +
> >>> +int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
> >>> +               u64 start, u64 len)
> >>> +{
> >>> +     const u64 ino = btrfs_ino(inode);
> >>> +     struct extent_state *cached_state = NULL;
> >>> +     struct btrfs_path *path;
> >>> +     struct btrfs_root *root = inode->root;
> >>> +     struct fiemap_cache cache = { 0 };
> >>> +     struct btrfs_backref_shared_cache *backref_cache;
> >>> +     struct ulist *roots;
> >>> +     struct ulist *tmp_ulist;
> >>> +     u64 last_extent_end;
> >>> +     u64 prev_extent_end;
> >>> +     u64 lockstart;
> >>> +     u64 lockend;
> >>> +     bool stopped = false;
> >>> +     int ret;
> >>> +
> >>> +     backref_cache = kzalloc(sizeof(*backref_cache), GFP_KERNEL);
> >>> +     path = btrfs_alloc_path();
> >>> +     roots = ulist_alloc(GFP_KERNEL);
> >>> +     tmp_ulist = ulist_alloc(GFP_KERNEL);
> >>> +     if (!backref_cache || !path || !roots || !tmp_ulist) {
> >>> +             ret = -ENOMEM;
> >>>                goto out;
> >>>        }
> >>>
> >>> -     while (!end) {
> >>> -             u64 offset_in_extent = 0;
> >>> +     lockstart = round_down(start, btrfs_inode_sectorsize(inode));
> >>> +     lockend = round_up(start + len, btrfs_inode_sectorsize(inode));
> >>> +     prev_extent_end = lockstart;
> >>>
> >>> -             /* break if the extent we found is outside the range */
> >>> -             if (em->start >= max || extent_map_end(em) < off)
> >>> -                     break;
> >>> +     lock_extent_bits(&inode->io_tree, lockstart, lockend, &cached_state);
> >>>
> >>> -             /*
> >>> -              * get_extent may return an extent that starts before our
> >>> -              * requested range.  We have to make sure the ranges
> >>> -              * we return to fiemap always move forward and don't
> >>> -              * overlap, so adjust the offsets here
> >>> -              */
> >>> -             em_start = max(em->start, off);
> >>> +     ret = fiemap_find_last_extent_offset(inode, path, &last_extent_end);
> >>> +     if (ret < 0)
> >>> +             goto out_unlock;
> >>> +     btrfs_release_path(path);
> >>>
> >>> +     path->reada = READA_FORWARD;
> >>> +     ret = fiemap_search_slot(inode, path, lockstart);
> >>> +     if (ret < 0) {
> >>> +             goto out_unlock;
> >>> +     } else if (ret > 0) {
> >>>                /*
> >>> -              * record the offset from the start of the extent
> >>> -              * for adjusting the disk offset below.  Only do this if the
> >>> -              * extent isn't compressed since our in ram offset may be past
> >>> -              * what we have actually allocated on disk.
> >>> +              * No file extent item found, but we may have delalloc between
> >>> +              * the current offset and i_size. So check for that.
> >>>                 */
> >>> -             if (!test_bit(EXTENT_FLAG_COMPRESSED, &em->flags))
> >>> -                     offset_in_extent = em_start - em->start;
> >>> -             em_end = extent_map_end(em);
> >>> -             em_len = em_end - em_start;
> >>> -             flags = 0;
> >>> -             if (em->block_start < EXTENT_MAP_LAST_BYTE)
> >>> -                     disko = em->block_start + offset_in_extent;
> >>> -             else
> >>> -                     disko = 0;
> >>> +             ret = 0;
> >>> +             goto check_eof_delalloc;
> >>> +     }
> >>> +
> >>> +     while (prev_extent_end < lockend) {
> >>> +             struct extent_buffer *leaf = path->nodes[0];
> >>> +             struct btrfs_file_extent_item *ei;
> >>> +             struct btrfs_key key;
> >>> +             u64 extent_end;
> >>> +             u64 extent_len;
> >>> +             u64 extent_offset = 0;
> >>> +             u64 extent_gen;
> >>> +             u64 disk_bytenr = 0;
> >>> +             u64 flags = 0;
> >>> +             int extent_type;
> >>> +             u8 compression;
> >>> +
> >>> +             btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
> >>> +             if (key.objectid != ino || key.type != BTRFS_EXTENT_DATA_KEY)
> >>> +                     break;
> >>> +
> >>> +             extent_end = btrfs_file_extent_end(path);
> >>>
> >>>                /*
> >>> -              * bump off for our next call to get_extent
> >>> +              * The first iteration can leave us at an extent item that ends
> >>> +              * before our range's start. Move to the next item.
> >>>                 */
> >>> -             off = extent_map_end(em);
> >>> -             if (off >= max)
> >>> -                     end = 1;
> >>> -
> >>> -             if (em->block_start == EXTENT_MAP_INLINE) {
> >>> -                     flags |= (FIEMAP_EXTENT_DATA_INLINE |
> >>> -                               FIEMAP_EXTENT_NOT_ALIGNED);
> >>> -             } else if (em->block_start == EXTENT_MAP_DELALLOC) {
> >>> -                     flags |= (FIEMAP_EXTENT_DELALLOC |
> >>> -                               FIEMAP_EXTENT_UNKNOWN);
> >>> -             } else if (fieinfo->fi_extents_max) {
> >>> -                     u64 extent_gen;
> >>> -                     u64 bytenr = em->block_start -
> >>> -                             (em->start - em->orig_start);
> >>> +             if (extent_end <= lockstart)
> >>> +                     goto next_item;
> >>>
> >>> -                     /*
> >>> -                      * If two extent maps are merged, then their generation
> >>> -                      * is set to the maximum between their generations.
> >>> -                      * Otherwise its generation matches the one we have in
> >>> -                      * corresponding file extent item. If we have a merged
> >>> -                      * extent map, don't use its generation to speedup the
> >>> -                      * sharedness check below.
> >>> -                      */
> >>> -                     if (test_bit(EXTENT_FLAG_MERGED, &em->flags))
> >>> -                             extent_gen = 0;
> >>> -                     else
> >>> -                             extent_gen = em->generation;
> >>> +             /* We have in implicit hole (NO_HOLES feature enabled). */
> >>> +             /* We have an implicit hole (NO_HOLES feature enabled). */
> >>> +                     const u64 range_end = min(key.offset, lockend) - 1;
> >>>
> >>> -                     /*
> >>> -                      * As btrfs supports shared space, this information
> >>> -                      * can be exported to userspace tools via
> >>> -                      * flag FIEMAP_EXTENT_SHARED.  If fi_extents_max == 0
> >>> -                      * then we're just getting a count and we can skip the
> >>> -                      * lookup stuff.
> >>> -                      */
> >>> -                     ret = btrfs_is_data_extent_shared(root, btrfs_ino(inode),
> >>> -                                                       bytenr, extent_gen,
> >>> -                                                       roots, tmp_ulist,
> >>> -                                                       backref_cache);
> >>> -                     if (ret < 0)
> >>> -                             goto out_free;
> >>> -                     if (ret)
> >>> -                             flags |= FIEMAP_EXTENT_SHARED;
> >>> -                     ret = 0;
> >>> -             }
> >>> -             if (test_bit(EXTENT_FLAG_COMPRESSED, &em->flags))
> >>> -                     flags |= FIEMAP_EXTENT_ENCODED;
> >>> -             if (test_bit(EXTENT_FLAG_PREALLOC, &em->flags))
> >>> -                     flags |= FIEMAP_EXTENT_UNWRITTEN;
> >>> +                     ret = fiemap_process_hole(inode, fieinfo, &cache,
> >>> +                                               backref_cache, 0, 0, 0,
> >>> +                                               roots, tmp_ulist,
> >>> +                                               prev_extent_end, range_end);
> >>> +                     if (ret < 0) {
> >>> +                             goto out_unlock;
> >>> +                     } else if (ret > 0) {
> >>> +                             /* fiemap_fill_next_extent() told us to stop. */
> >>> +                             stopped = true;
> >>> +                             break;
> >>> +                     }
> >>>
> >>> -             free_extent_map(em);
> >>> -             em = NULL;
> >>> -             if ((em_start >= last) || em_len == (u64)-1 ||
> >>> -                (last == (u64)-1 && isize <= em_end)) {
> >>> -                     flags |= FIEMAP_EXTENT_LAST;
> >>> -                     end = 1;
> >>> +                     /* We've reached the end of the fiemap range, stop. */
> >>> +                     if (key.offset >= lockend) {
> >>> +                             stopped = true;
> >>> +                             break;
> >>> +                     }
> >>>                }
> >>>
> >>> -             /* now scan forward to see if this is really the last extent. */
> >>> -             em = get_extent_skip_holes(inode, off, last_for_get_extent);
> >>> -             if (IS_ERR(em)) {
> >>> -                     ret = PTR_ERR(em);
> >>> -                     goto out;
> >>> +             extent_len = extent_end - key.offset;
> >>> +             ei = btrfs_item_ptr(leaf, path->slots[0],
> >>> +                                 struct btrfs_file_extent_item);
> >>> +             compression = btrfs_file_extent_compression(leaf, ei);
> >>> +             extent_type = btrfs_file_extent_type(leaf, ei);
> >>> +             extent_gen = btrfs_file_extent_generation(leaf, ei);
> >>> +
> >>> +             if (extent_type != BTRFS_FILE_EXTENT_INLINE) {
> >>> +                     disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, ei);
> >>> +                     if (compression == BTRFS_COMPRESS_NONE)
> >>> +                             extent_offset = btrfs_file_extent_offset(leaf, ei);
> >>>                }
> >>> -             if (!em) {
> >>> -                     flags |= FIEMAP_EXTENT_LAST;
> >>> -                     end = 1;
> >>> +
> >>> +             if (compression != BTRFS_COMPRESS_NONE)
> >>> +                     flags |= FIEMAP_EXTENT_ENCODED;
> >>> +
> >>> +             if (extent_type == BTRFS_FILE_EXTENT_INLINE) {
> >>> +                     flags |= FIEMAP_EXTENT_DATA_INLINE;
> >>> +                     flags |= FIEMAP_EXTENT_NOT_ALIGNED;
> >>> +                     ret = emit_fiemap_extent(fieinfo, &cache, key.offset, 0,
> >>> +                                              extent_len, flags);
> >>> +             } else if (extent_type == BTRFS_FILE_EXTENT_PREALLOC) {
> >>> +                     ret = fiemap_process_hole(inode, fieinfo, &cache,
> >>> +                                               backref_cache,
> >>> +                                               disk_bytenr, extent_offset,
> >>> +                                               extent_gen, roots, tmp_ulist,
> >>> +                                               key.offset, extent_end - 1);
> >>> +             } else if (disk_bytenr == 0) {
> >>> +                     /* We have an explicit hole. */
> >>> +                     ret = fiemap_process_hole(inode, fieinfo, &cache,
> >>> +                                               backref_cache, 0, 0, 0,
> >>> +                                               roots, tmp_ulist,
> >>> +                                               key.offset, extent_end - 1);
> >>> +             } else {
> >>> +                     /* We have a regular extent. */
> >>> +                     if (fieinfo->fi_extents_max) {
> >>> +                             ret = btrfs_is_data_extent_shared(root, ino,
> >>> +                                                               disk_bytenr,
> >>> +                                                               extent_gen,
> >>> +                                                               roots,
> >>> +                                                               tmp_ulist,
> >>> +                                                               backref_cache);
> >>> +                             if (ret < 0)
> >>> +                                     goto out_unlock;
> >>> +                             else if (ret > 0)
> >>> +                                     flags |= FIEMAP_EXTENT_SHARED;
> >>> +                     }
> >>> +
> >>> +                     ret = emit_fiemap_extent(fieinfo, &cache, key.offset,
> >>> +                                              disk_bytenr + extent_offset,
> >>> +                                              extent_len, flags);
> >>>                }
> >>> -             ret = emit_fiemap_extent(fieinfo, &cache, em_start, disko,
> >>> -                                        em_len, flags);
> >>> -             if (ret) {
> >>> -                     if (ret == 1)
> >>> -                             ret = 0;
> >>> -                     goto out_free;
> >>> +
> >>> +             if (ret < 0) {
> >>> +                     goto out_unlock;
> >>> +             } else if (ret > 0) {
> >>> +                     /* fiemap_fill_next_extent() told us to stop. */
> >>> +                     stopped = true;
> >>> +                     break;
> >>>                }
> >>>
> >>> +             prev_extent_end = extent_end;
> >>> +next_item:
> >>>                if (fatal_signal_pending(current)) {
> >>>                        ret = -EINTR;
> >>> -                     goto out_free;
> >>> +                     goto out_unlock;
> >>>                }
> >>> +
> >>> +             ret = fiemap_next_leaf_item(inode, path);
> >>> +             if (ret < 0) {
> >>> +                     goto out_unlock;
> >>> +             } else if (ret > 0) {
> >>> +                     /* No more file extent items for this inode. */
> >>> +                     break;
> >>> +             }
> >>> +             cond_resched();
> >>>        }
> >>> -out_free:
> >>> -     if (!ret)
> >>> -             ret = emit_last_fiemap_cache(fieinfo, &cache);
> >>> -     free_extent_map(em);
> >>> -out:
> >>> -     unlock_extent_cached(&inode->io_tree, start, start + len - 1,
> >>> -                          &cached_state);
> >>>
> >>> -out_free_ulist:
> >>> +check_eof_delalloc:
> >>> +     /*
> >>> +      * Release (and free) the path before emitting any final entries to
> >>> +      * fiemap_fill_next_extent() to keep lockdep happy. This is because
> >>> +      * once we find no more file extent items exist, we may have a
> >>> +      * non-cloned leaf, and fiemap_fill_next_extent() can trigger page
> >>> +      * faults when copying data to the user space buffer.
> >>> +      */
> >>> +     btrfs_free_path(path);
> >>> +     path = NULL;
> >>> +
> >>> +     if (!stopped && prev_extent_end < lockend) {
> >>> +             ret = fiemap_process_hole(inode, fieinfo, &cache, backref_cache,
> >>> +                                       0, 0, 0, roots, tmp_ulist,
> >>> +                                       prev_extent_end, lockend - 1);
> >>> +             if (ret < 0)
> >>> +                     goto out_unlock;
> >>> +             prev_extent_end = lockend;
> >>> +     }
> >>> +
> >>> +     if (cache.cached && cache.offset + cache.len >= last_extent_end) {
> >>> +             const u64 i_size = i_size_read(&inode->vfs_inode);
> >>> +
> >>> +             if (prev_extent_end < i_size) {
> >>> +                     u64 delalloc_start;
> >>> +                     u64 delalloc_end;
> >>> +                     bool delalloc;
> >>> +
> >>> +                     delalloc = btrfs_find_delalloc_in_range(inode,
> >>> +                                                             prev_extent_end,
> >>> +                                                             i_size - 1,
> >>> +                                                             &delalloc_start,
> >>> +                                                             &delalloc_end);
> >>> +                     if (!delalloc)
> >>> +                             cache.flags |= FIEMAP_EXTENT_LAST;
> >>> +             } else {
> >>> +                     cache.flags |= FIEMAP_EXTENT_LAST;
> >>> +             }
> >>> +     }
> >>> +
> >>> +     ret = emit_last_fiemap_cache(fieinfo, &cache);
> >>> +
> >>> +out_unlock:
> >>> +     unlock_extent_cached(&inode->io_tree, lockstart, lockend, &cached_state);
> >>> +out:
> >>>        kfree(backref_cache);
> >>>        btrfs_free_path(path);
> >>>        ulist_free(roots);
> >>> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> >>> index b292a8ada3a4..636b3ec46184 100644
> >>> --- a/fs/btrfs/file.c
> >>> +++ b/fs/btrfs/file.c
> >>> @@ -3602,10 +3602,10 @@ static long btrfs_fallocate(struct file *file, int mode,
> >>>    }
> >>>
> >>>    /*
> >>> - * Helper for have_delalloc_in_range(). Find a subrange in a given range that
> >>> - * has unflushed and/or flushing delalloc. There might be other adjacent
> >>> - * subranges after the one it found, so have_delalloc_in_range() keeps looping
> >>> - * while it gets adjacent subranges, and merging them together.
> >>> + * Helper for btrfs_find_delalloc_in_range(). Find a subrange in a given range
> >>> + * that has unflushed and/or flushing delalloc. There might be other adjacent
> >>> + * subranges after the one it found, so btrfs_find_delalloc_in_range() keeps
> >>> + * looping while it gets adjacent subranges, and merging them together.
> >>>     */
> >>>    static bool find_delalloc_subrange(struct btrfs_inode *inode, u64 start, u64 end,
> >>>                                   u64 *delalloc_start_ret, u64 *delalloc_end_ret)
> >>> @@ -3740,8 +3740,8 @@ static bool find_delalloc_subrange(struct btrfs_inode *inode, u64 start, u64 end
> >>>     * if so it sets @delalloc_start_ret and @delalloc_end_ret with the start and
> >>>     * end offsets of the subrange.
> >>>     */
> >>> -static bool have_delalloc_in_range(struct btrfs_inode *inode, u64 start, u64 end,
> >>> -                                u64 *delalloc_start_ret, u64 *delalloc_end_ret)
> >>> +bool btrfs_find_delalloc_in_range(struct btrfs_inode *inode, u64 start, u64 end,
> >>> +                               u64 *delalloc_start_ret, u64 *delalloc_end_ret)
> >>>    {
> >>>        u64 cur_offset = round_down(start, inode->root->fs_info->sectorsize);
> >>>        u64 prev_delalloc_end = 0;
> >>> @@ -3804,8 +3804,8 @@ static bool find_desired_extent_in_hole(struct btrfs_inode *inode, int whence,
> >>>        u64 delalloc_end;
> >>>        bool delalloc;
> >>>
> >>> -     delalloc = have_delalloc_in_range(inode, start, end, &delalloc_start,
> >>> -                                       &delalloc_end);
> >>> +     delalloc = btrfs_find_delalloc_in_range(inode, start, end,
> >>> +                                             &delalloc_start, &delalloc_end);
> >>>        if (delalloc && whence == SEEK_DATA) {
> >>>                *start_ret = delalloc_start;
> >>>                return true;
> >>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> >>> index 2c7d31990777..8be1e021513a 100644
> >>> --- a/fs/btrfs/inode.c
> >>> +++ b/fs/btrfs/inode.c
> >>> @@ -7064,133 +7064,6 @@ struct extent_map *btrfs_get_extent(struct btrfs_inode *inode,
> >>>        return em;
> >>>    }
> >>>
> >>> -struct extent_map *btrfs_get_extent_fiemap(struct btrfs_inode *inode,
> >>> -                                        u64 start, u64 len)
> >>> -{
> >>> -     struct extent_map *em;
> >>> -     struct extent_map *hole_em = NULL;
> >>> -     u64 delalloc_start = start;
> >>> -     u64 end;
> >>> -     u64 delalloc_len;
> >>> -     u64 delalloc_end;
> >>> -     int err = 0;
> >>> -
> >>> -     em = btrfs_get_extent(inode, NULL, 0, start, len);
> >>> -     if (IS_ERR(em))
> >>> -             return em;
> >>> -     /*
> >>> -      * If our em maps to:
> >>> -      * - a hole or
> >>> -      * - a pre-alloc extent,
> >>> -      * there might actually be delalloc bytes behind it.
> >>> -      */
> >>> -     if (em->block_start != EXTENT_MAP_HOLE &&
> >>> -         !test_bit(EXTENT_FLAG_PREALLOC, &em->flags))
> >>> -             return em;
> >>> -     else
> >>> -             hole_em = em;
> >>> -
> >>> -     /* check to see if we've wrapped (len == -1 or similar) */
> >>> -     end = start + len;
> >>> -     if (end < start)
> >>> -             end = (u64)-1;
> >>> -     else
> >>> -             end -= 1;
> >>> -
> >>> -     em = NULL;
> >>> -
> >>> -     /* ok, we didn't find anything, lets look for delalloc */
> >>> -     delalloc_len = count_range_bits(&inode->io_tree, &delalloc_start,
> >>> -                              end, len, EXTENT_DELALLOC, 1);
> >>> -     delalloc_end = delalloc_start + delalloc_len;
> >>> -     if (delalloc_end < delalloc_start)
> >>> -             delalloc_end = (u64)-1;
> >>> -
> >>> -     /*
> >>> -      * We didn't find anything useful, return the original results from
> >>> -      * get_extent()
> >>> -      */
> >>> -     if (delalloc_start > end || delalloc_end <= start) {
> >>> -             em = hole_em;
> >>> -             hole_em = NULL;
> >>> -             goto out;
> >>> -     }
> >>> -
> >>> -     /*
> >>> -      * Adjust the delalloc_start to make sure it doesn't go backwards from
> >>> -      * the start they passed in
> >>> -      */
> >>> -     delalloc_start = max(start, delalloc_start);
> >>> -     delalloc_len = delalloc_end - delalloc_start;
> >>> -
> >>> -     if (delalloc_len > 0) {
> >>> -             u64 hole_start;
> >>> -             u64 hole_len;
> >>> -             const u64 hole_end = extent_map_end(hole_em);
> >>> -
> >>> -             em = alloc_extent_map();
> >>> -             if (!em) {
> >>> -                     err = -ENOMEM;
> >>> -                     goto out;
> >>> -             }
> >>> -
> >>> -             ASSERT(hole_em);
> >>> -             /*
> >>> -              * When btrfs_get_extent can't find anything it returns one
> >>> -              * huge hole
> >>> -              *
> >>> -              * Make sure what it found really fits our range, and adjust to
> >>> -              * make sure it is based on the start from the caller
> >>> -              */
> >>> -             if (hole_end <= start || hole_em->start > end) {
> >>> -                    free_extent_map(hole_em);
> >>> -                    hole_em = NULL;
> >>> -             } else {
> >>> -                    hole_start = max(hole_em->start, start);
> >>> -                    hole_len = hole_end - hole_start;
> >>> -             }
> >>> -
> >>> -             if (hole_em && delalloc_start > hole_start) {
> >>> -                     /*
> >>> -                      * Our hole starts before our delalloc, so we have to
> >>> -                      * return just the parts of the hole that go until the
> >>> -                      * delalloc starts
> >>> -                      */
> >>> -                     em->len = min(hole_len, delalloc_start - hole_start);
> >>> -                     em->start = hole_start;
> >>> -                     em->orig_start = hole_start;
> >>> -                     /*
> >>> -                      * Don't adjust block start at all, it is fixed at
> >>> -                      * EXTENT_MAP_HOLE
> >>> -                      */
> >>> -                     em->block_start = hole_em->block_start;
> >>> -                     em->block_len = hole_len;
> >>> -                     if (test_bit(EXTENT_FLAG_PREALLOC, &hole_em->flags))
> >>> -                             set_bit(EXTENT_FLAG_PREALLOC, &em->flags);
> >>> -             } else {
> >>> -                     /*
> >>> -                      * Hole is out of passed range or it starts after
> >>> -                      * delalloc range
> >>> -                      */
> >>> -                     em->start = delalloc_start;
> >>> -                     em->len = delalloc_len;
> >>> -                     em->orig_start = delalloc_start;
> >>> -                     em->block_start = EXTENT_MAP_DELALLOC;
> >>> -                     em->block_len = delalloc_len;
> >>> -             }
> >>> -     } else {
> >>> -             return hole_em;
> >>> -     }
> >>> -out:
> >>> -
> >>> -     free_extent_map(hole_em);
> >>> -     if (err) {
> >>> -             free_extent_map(em);
> >>> -             return ERR_PTR(err);
> >>> -     }
> >>> -     return em;
> >>> -}
> >>> -
> >>>    static struct extent_map *btrfs_create_dio_extent(struct btrfs_inode *inode,
> >>>                                                  const u64 start,
> >>>                                                  const u64 len,
> >>> @@ -8265,15 +8138,14 @@ static int btrfs_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
> >>>         * in the compression of data (in an async thread) and will return
> >>>         * before the compression is done and writeback is started. A second
> >>>         * filemap_fdatawrite_range() is needed to wait for the compression to
> >>> -      * complete and writeback to start. Without this, our user is very
> >>> -      * likely to get stale results, because the extents and extent maps for
> >>> -      * delalloc regions are only allocated when writeback starts.
> >>> +      * complete and writeback to start. We also need to wait for ordered
> >>> +      * extents to complete, because our fiemap implementation uses mainly
> >>> +      * file extent items to list the extents, searching for extent maps
> >>> +      * only for file ranges with holes or prealloc extents to figure out
> >>> +      * if we have delalloc in those ranges.
> >>>         */
> >>>        if (fieinfo->fi_flags & FIEMAP_FLAG_SYNC) {
> >>> -             ret = btrfs_fdatawrite_range(inode, 0, LLONG_MAX);
> >>> -             if (ret)
> >>> -                     return ret;
> >>> -             ret = filemap_fdatawait_range(inode->i_mapping, 0, LLONG_MAX);
> >>> +             ret = btrfs_wait_ordered_range(inode, 0, LLONG_MAX);
> >>>                if (ret)
> >>>                        return ret;
> >>>        }

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 10/10] btrfs: make fiemap more efficient and accurate reporting extent sharedness
  2022-09-02  9:41         ` Filipe Manana
@ 2022-09-02  9:50           ` Qu Wenruo
  0 siblings, 0 replies; 53+ messages in thread
From: Qu Wenruo @ 2022-09-02  9:50 UTC (permalink / raw)
  To: Filipe Manana; +Cc: Qu Wenruo, linux-btrfs



On 2022/9/2 17:41, Filipe Manana wrote:
> On Fri, Sep 2, 2022 at 10:35 AM Qu Wenruo <wqu@suse.com> wrote:
>>
>>
>>
>> On 2022/9/2 16:59, Filipe Manana wrote:
>>> On Fri, Sep 2, 2022 at 12:27 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>> [...]
>>>>
>>>> Is there any other major usage for extent map now?
>>>
>>> For fsync at least is very important.
>>
>> OK, forgot that fsync is relying on that to determine if an extent needs
>> to be logged.
>>
>>> Also for reads it's nice to not have to go to the b+tree.
>>> If someone reads several pages of an extent, being able to get it directly
>>> from the extent map tree is faster than having to go to the btree for
>>> every read.
>>> Extent map trees are per inode, but subvolume b+trees can have a lot of
>>> concurrent write and read access.
>>>
>>>>
>>>> I can only think of read, which uses extent map to grab the logical
>>>> bytenr of the real extent.
>>>>
>>>> In that case, the SHARED flag doesn't make much sense anyway, can we do
>>>> a cleanup for those flags? Since fiemap/lseek no longer relies on extent
>>>> map anymore.
>>>
>>> I don't get it. What SHARED flag are talking about? And which "flags", where?
>>> We have nothing specific for lseek/fiemap in the extent maps, so I
>>> don't understand.
>>
>> Nevermind, I got confused and think there would be one SHARED flag for
>> extent map, but that's totally wrong...
>>
>>>
>>>>
>> [...]
>>>>
>>>> Although this is correct, it still looks a little tricky.
>>>>
>>>> We rely on btrfs_release_path() to release all tree blocks in the
>>>> subvolume tree, including unlocking the tree blocks, thus path->locks[0]
>>>> is also 0, meaning next time we call btrfs_release_path() we won't try
>>>> to unlock the cloned eb.
>>>
>>> We're not taking any lock on the cloned extent buffer. It's not
>>> needed, it's private
>>> to the task.
>>
>> Yep, that's completely fine, just looks a little tricky since we're
>> going to release that path twice, and that's expected.
> 
> Hum?
> It's twice but for different extent buffers.

Yep, it's just different from our regular call pattern,
btrfs_search_slot() -> btrfs_release_path()/btrfs_free_path(), which
doesn't involve a double btrfs_release_path().

But since you have enough comments, and the whole trick is hidden behind
fiemap_search_slot()/fiemap_next_leaf_item(), I guess it's fine.
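
(For reference, the pattern being discussed boils down to something like
the sketch below. The helper name is made up for illustration; in the
patch the real logic lives in fiemap_search_slot()/fiemap_next_leaf_item().)

/* Sketch only: swap the locked leaf for a private clone of it. */
static int clone_leaf_into_path(struct btrfs_path *path)
{
	struct extent_buffer *clone;
	int slot = path->slots[0];

	clone = btrfs_clone_extent_buffer(path->nodes[0]);
	if (!clone)
		return -ENOMEM;

	/* Drop the references and locks on the real tree blocks. */
	btrfs_release_path(path);

	/*
	 * Point the path at the clone. The clone is private to this task,
	 * so no lock is taken and path->locks[0] stays 0, meaning a later
	 * btrfs_release_path() (or btrfs_free_path()) only drops the
	 * clone's reference.
	 */
	path->nodes[0] = clone;
	path->slots[0] = slot;

	return 0;
}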

Thanks,
Qu

> 
>>
>>>
>>>>
>>>> But I'd say it's still pretty tricky, and unfortunately I don't have any
>>>> better alternative.
>>>>
>>>>> +
>>>>> +     return 0;
>>>>> +}
>>>>> +
>>>>> +/*
>>>>> + * Process a range which is a hole or a prealloc extent in the inode's subvolume
>>>>> + * btree. If @disk_bytenr is 0, we are dealing with a hole, otherwise a prealloc
>>>>> + * extent. The end offset (@end) is inclusive.
>>>>> + */
>>>>> +static int fiemap_process_hole(struct btrfs_inode *inode,
>>>>
>>>> Does the name still make sense as we're handling both hole and prealloc
>>>> range?
>>>
>>> I chose that name because hole and prealloc are treated the same way.
>>> Sure, I could name it fiemap_process_hole_or_prealloc() or something
>>> like that, but
>>> I decided to keep the name shorter, make them explicit in the comments and code.
>>>
>>> The old code did the same, get_extent_skip_holes() skipped holes and
>>> prealloc extents without delalloc.
>>>
>>>>
>>>>
>>>> And I always find the delalloc search a big pain during lseek/fiemap.
>>>>
>>>> I guess except using certain flags, there is some hard requirement for
>>>> delalloc range reporting?
>>>
>>> Yes. Delalloc is not meant to be flushed for fiemap unless
>>> FIEMAP_FLAG_SYNC is given by the user.
>>
>> Would it be possible to let btrfs always flush the delalloc range, no
>> matter if FIEMAP_FLAG_SYNC is specified or not?
>>
>> I really want to avoid the whole delalloc search thing if possible.
>>
>> Although I'd guess such behavior would be against the fiemap
>> requirement, and the extra writeback may greatly slow down the fiemap
>> itself for large files with tons of delalloc, so not really expect this
>> to happen.
> 
> No, doing such a change is a bad idea.
> It changes the semantics and expected behaviour.
> 
> My goal here is to preserve all semantics and behaviour, but make it
> more efficient.
> 
> Even if we were all to decide to do that, that should be done
> separately - but I don't think that is correct anyway,
> fiemap can be used to detect delalloc, and probably there are users
> using it for that.
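
As a concrete illustration of that use case, here is a minimal userspace
sketch (simplified, with minimal error handling) that maps a file without
FIEMAP_FLAG_SYNC and reports which extents are still delalloc:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

int main(int argc, char **argv)
{
	struct fiemap *fm;
	unsigned int i;
	int fd;

	if (argc != 2)
		return 1;
	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	fm = calloc(1, sizeof(*fm) + 256 * sizeof(struct fiemap_extent));
	if (!fm)
		return 1;
	fm->fm_start = 0;
	fm->fm_length = ~0ULL;	/* whole file */
	fm->fm_flags = 0;	/* set FIEMAP_FLAG_SYNC here to flush first */
	fm->fm_extent_count = 256;

	if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0) {
		perror("FS_IOC_FIEMAP");
		return 1;
	}

	for (i = 0; i < fm->fm_mapped_extents; i++) {
		const struct fiemap_extent *fe = &fm->fm_extents[i];

		printf("logical %llu len %llu%s\n",
		       (unsigned long long)fe->fe_logical,
		       (unsigned long long)fe->fe_length,
		       (fe->fe_flags & FIEMAP_EXTENT_DELALLOC) ?
				" (delalloc, not yet written back)" : "");
	}

	free(fm);
	close(fd);
	return 0;
}

Delalloc ranges come back flagged with FIEMAP_EXTENT_DELALLOC (together
with FIEMAP_EXTENT_UNKNOWN, as the old code above also set), which is
exactly the information that would be lost if btrfs flushed
unconditionally.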
> 
> 
>>
>>> For lseek it's just not needed, but that was already mentioned /
>>> discussed in patch 2/10.
>>>
>> [...]
>>>>> +     /*
>>>>> +      * Lookup the last file extent. We're not using i_size here because
>>>>> +      * there might be preallocation past i_size.
>>>>> +      */
>>>>
>>>> I'm wondering how could this happen?
>>>
>>> You can fallocate an extent at or after i_size.
>>>
>>>>
>>>> Normally if we're truncating an inode, the extents starting after
>>>> round_up(i_size, sectorsize) should be dropped.
>>>
>>> It has nothing to do with truncate, just fallocate.
>>>
>>>>
>>>> Or if we later enlarge the inode, we may hit old extents and read out
>>>> some stale data other than expected zeros.
>>>>
>>>> Thus searching using round_up(i_size, sectorsize) should still let us to
>>>> reach the slot after the last file extent.
>>>>
>>>> Or did I miss something?
>>>
>>> Yes, it's about prealloc extents at or after i_size.
>>
>> Did you mean falloc using keep_size flag?
> 
> Yes. Otherwise the extent wouldn't end up beyond i_size.
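
For anyone unfamiliar with that corner case, a minimal sketch of how such
a prealloc extent ends up beyond i_size (roughly what xfs_io's "falloc -k"
does):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <linux/falloc.h>

int main(int argc, char **argv)
{
	int fd;

	if (argc != 2)
		return 1;
	fd = open(argv[1], O_WRONLY | O_CREAT, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	/*
	 * Preallocate 1 MiB at offset 0 of an empty file without updating
	 * i_size: i_size stays 0, but a prealloc file extent item now
	 * exists past it, which is why the patch looks up the last file
	 * extent item instead of trusting i_size.
	 */
	if (fallocate(fd, FALLOC_FL_KEEP_SIZE, 0, 1 << 20) < 0)
		perror("fallocate");
	close(fd);
	return 0;
}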
> 
>>
>> Then that explains the whole reason.
> 
> Great!
> Thanks.
> 
>>
>> Thanks,
>> Qu
>>
>>>
>>> Thanks.
>>>
>>>>
>>>> Thanks,
>>>> Qu
>>>>
>>>>> +     ret = btrfs_lookup_file_extent(NULL, root, path, ino, (u64)-1, 0);
>>>>> +     /* There can't be a file extent item at offset (u64)-1 */
>>>>> +     ASSERT(ret != 0);
>>>>> +     if (ret < 0)
>>>>> +             return ret;
>>>>> +
>>>>> +     /*
>>>>> +      * For a non-existing key, btrfs_search_slot() always leaves us at a
>>>>> +      * slot > 0, except if the btree is empty, which is impossible because
>>>>> +      * at least it has the inode item for this inode and all the items for
>>>>> +      * the root inode 256.
>>>>> +      */
>>>>> +     ASSERT(path->slots[0] > 0);
>>>>> +     path->slots[0]--;
>>>>> +     leaf = path->nodes[0];
>>>>> +     btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
>>>>> +     if (key.objectid != ino || key.type != BTRFS_EXTENT_DATA_KEY) {
>>>>> +             /* No file extent items in the subvolume tree. */
>>>>> +             *last_extent_end_ret = 0;
>>>>> +             return 0;
>>>>>         }
>>>>> -     btrfs_release_path(path);
>>>>>
>>>>>         /*
>>>>> -      * we might have some extents allocated but more delalloc past those
>>>>> -      * extents.  so, we trust isize unless the start of the last extent is
>>>>> -      * beyond isize
>>>>> +      * For an inline extent, the disk_bytenr is where inline data starts at,
>>>>> +      * so first check if we have an inline extent item before checking if we
>>>>> +      * have an implicit hole (disk_bytenr == 0).
>>>>>          */
>>>>> -     if (last < isize) {
>>>>> -             last = (u64)-1;
>>>>> -             last_for_get_extent = isize;
>>>>> +     ei = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_file_extent_item);
>>>>> +     if (btrfs_file_extent_type(leaf, ei) == BTRFS_FILE_EXTENT_INLINE) {
>>>>> +             *last_extent_end_ret = btrfs_file_extent_end(path);
>>>>> +             return 0;
>>>>>         }
>>>>>
>>>>> -     lock_extent_bits(&inode->io_tree, start, start + len - 1,
>>>>> -                      &cached_state);
>>>>> +     /*
>>>>> +      * Find the last file extent item that is not a hole (when NO_HOLES is
>>>>> +      * not enabled). This should take at most 2 iterations in the worst
>>>>> +      * case: we have one hole file extent item at slot 0 of a leaf and
>>>>> +      * another hole file extent item as the last item in the previous leaf.
>>>>> +      * This is because we merge file extent items that represent holes.
>>>>> +      */
>>>>> +     disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, ei);
>>>>> +     while (disk_bytenr == 0) {
>>>>> +             ret = btrfs_previous_item(root, path, ino, BTRFS_EXTENT_DATA_KEY);
>>>>> +             if (ret < 0) {
>>>>> +                     return ret;
>>>>> +             } else if (ret > 0) {
>>>>> +                     /* No file extent items that are not holes. */
>>>>> +                     *last_extent_end_ret = 0;
>>>>> +                     return 0;
>>>>> +             }
>>>>> +             leaf = path->nodes[0];
>>>>> +             ei = btrfs_item_ptr(leaf, path->slots[0],
>>>>> +                                 struct btrfs_file_extent_item);
>>>>> +             disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, ei);
>>>>> +     }
>>>>>
>>>>> -     em = get_extent_skip_holes(inode, start, last_for_get_extent);
>>>>> -     if (!em)
>>>>> -             goto out;
>>>>> -     if (IS_ERR(em)) {
>>>>> -             ret = PTR_ERR(em);
>>>>> +     *last_extent_end_ret = btrfs_file_extent_end(path);
>>>>> +     return 0;
>>>>> +}
>>>>> +
>>>>> +int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
>>>>> +               u64 start, u64 len)
>>>>> +{
>>>>> +     const u64 ino = btrfs_ino(inode);
>>>>> +     struct extent_state *cached_state = NULL;
>>>>> +     struct btrfs_path *path;
>>>>> +     struct btrfs_root *root = inode->root;
>>>>> +     struct fiemap_cache cache = { 0 };
>>>>> +     struct btrfs_backref_shared_cache *backref_cache;
>>>>> +     struct ulist *roots;
>>>>> +     struct ulist *tmp_ulist;
>>>>> +     u64 last_extent_end;
>>>>> +     u64 prev_extent_end;
>>>>> +     u64 lockstart;
>>>>> +     u64 lockend;
>>>>> +     bool stopped = false;
>>>>> +     int ret;
>>>>> +
>>>>> +     backref_cache = kzalloc(sizeof(*backref_cache), GFP_KERNEL);
>>>>> +     path = btrfs_alloc_path();
>>>>> +     roots = ulist_alloc(GFP_KERNEL);
>>>>> +     tmp_ulist = ulist_alloc(GFP_KERNEL);
>>>>> +     if (!backref_cache || !path || !roots || !tmp_ulist) {
>>>>> +             ret = -ENOMEM;
>>>>>                 goto out;
>>>>>         }
>>>>>
>>>>> -     while (!end) {
>>>>> -             u64 offset_in_extent = 0;
>>>>> +     lockstart = round_down(start, btrfs_inode_sectorsize(inode));
>>>>> +     lockend = round_up(start + len, btrfs_inode_sectorsize(inode));
>>>>> +     prev_extent_end = lockstart;
>>>>>
>>>>> -             /* break if the extent we found is outside the range */
>>>>> -             if (em->start >= max || extent_map_end(em) < off)
>>>>> -                     break;
>>>>> +     lock_extent_bits(&inode->io_tree, lockstart, lockend, &cached_state);
>>>>>
>>>>> -             /*
>>>>> -              * get_extent may return an extent that starts before our
>>>>> -              * requested range.  We have to make sure the ranges
>>>>> -              * we return to fiemap always move forward and don't
>>>>> -              * overlap, so adjust the offsets here
>>>>> -              */
>>>>> -             em_start = max(em->start, off);
>>>>> +     ret = fiemap_find_last_extent_offset(inode, path, &last_extent_end);
>>>>> +     if (ret < 0)
>>>>> +             goto out_unlock;
>>>>> +     btrfs_release_path(path);
>>>>>
>>>>> +     path->reada = READA_FORWARD;
>>>>> +     ret = fiemap_search_slot(inode, path, lockstart);
>>>>> +     if (ret < 0) {
>>>>> +             goto out_unlock;
>>>>> +     } else if (ret > 0) {
>>>>>                 /*
>>>>> -              * record the offset from the start of the extent
>>>>> -              * for adjusting the disk offset below.  Only do this if the
>>>>> -              * extent isn't compressed since our in ram offset may be past
>>>>> -              * what we have actually allocated on disk.
>>>>> +              * No file extent item found, but we may have delalloc between
>>>>> +              * the current offset and i_size. So check for that.
>>>>>                  */
>>>>> -             if (!test_bit(EXTENT_FLAG_COMPRESSED, &em->flags))
>>>>> -                     offset_in_extent = em_start - em->start;
>>>>> -             em_end = extent_map_end(em);
>>>>> -             em_len = em_end - em_start;
>>>>> -             flags = 0;
>>>>> -             if (em->block_start < EXTENT_MAP_LAST_BYTE)
>>>>> -                     disko = em->block_start + offset_in_extent;
>>>>> -             else
>>>>> -                     disko = 0;
>>>>> +             ret = 0;
>>>>> +             goto check_eof_delalloc;
>>>>> +     }
>>>>> +
>>>>> +     while (prev_extent_end < lockend) {
>>>>> +             struct extent_buffer *leaf = path->nodes[0];
>>>>> +             struct btrfs_file_extent_item *ei;
>>>>> +             struct btrfs_key key;
>>>>> +             u64 extent_end;
>>>>> +             u64 extent_len;
>>>>> +             u64 extent_offset = 0;
>>>>> +             u64 extent_gen;
>>>>> +             u64 disk_bytenr = 0;
>>>>> +             u64 flags = 0;
>>>>> +             int extent_type;
>>>>> +             u8 compression;
>>>>> +
>>>>> +             btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
>>>>> +             if (key.objectid != ino || key.type != BTRFS_EXTENT_DATA_KEY)
>>>>> +                     break;
>>>>> +
>>>>> +             extent_end = btrfs_file_extent_end(path);
>>>>>
>>>>>                 /*
>>>>> -              * bump off for our next call to get_extent
>>>>> +              * The first iteration can leave us at an extent item that ends
>>>>> +              * before our range's start. Move to the next item.
>>>>>                  */
>>>>> -             off = extent_map_end(em);
>>>>> -             if (off >= max)
>>>>> -                     end = 1;
>>>>> -
>>>>> -             if (em->block_start == EXTENT_MAP_INLINE) {
>>>>> -                     flags |= (FIEMAP_EXTENT_DATA_INLINE |
>>>>> -                               FIEMAP_EXTENT_NOT_ALIGNED);
>>>>> -             } else if (em->block_start == EXTENT_MAP_DELALLOC) {
>>>>> -                     flags |= (FIEMAP_EXTENT_DELALLOC |
>>>>> -                               FIEMAP_EXTENT_UNKNOWN);
>>>>> -             } else if (fieinfo->fi_extents_max) {
>>>>> -                     u64 extent_gen;
>>>>> -                     u64 bytenr = em->block_start -
>>>>> -                             (em->start - em->orig_start);
>>>>> +             if (extent_end <= lockstart)
>>>>> +                     goto next_item;
>>>>>
>>>>> -                     /*
>>>>> -                      * If two extent maps are merged, then their generation
>>>>> -                      * is set to the maximum between their generations.
>>>>> -                      * Otherwise its generation matches the one we have in
>>>>> -                      * corresponding file extent item. If we have a merged
>>>>> -                      * extent map, don't use its generation to speedup the
>>>>> -                      * sharedness check below.
>>>>> -                      */
>>>>> -                     if (test_bit(EXTENT_FLAG_MERGED, &em->flags))
>>>>> -                             extent_gen = 0;
>>>>> -                     else
>>>>> -                             extent_gen = em->generation;
>>>>> +             /* We have in implicit hole (NO_HOLES feature enabled). */
>>>>> +             if (prev_extent_end < key.offset) {
>>>>> +                     const u64 range_end = min(key.offset, lockend) - 1;
>>>>>
>>>>> -                     /*
>>>>> -                      * As btrfs supports shared space, this information
>>>>> -                      * can be exported to userspace tools via
>>>>> -                      * flag FIEMAP_EXTENT_SHARED.  If fi_extents_max == 0
>>>>> -                      * then we're just getting a count and we can skip the
>>>>> -                      * lookup stuff.
>>>>> -                      */
>>>>> -                     ret = btrfs_is_data_extent_shared(root, btrfs_ino(inode),
>>>>> -                                                       bytenr, extent_gen,
>>>>> -                                                       roots, tmp_ulist,
>>>>> -                                                       backref_cache);
>>>>> -                     if (ret < 0)
>>>>> -                             goto out_free;
>>>>> -                     if (ret)
>>>>> -                             flags |= FIEMAP_EXTENT_SHARED;
>>>>> -                     ret = 0;
>>>>> -             }
>>>>> -             if (test_bit(EXTENT_FLAG_COMPRESSED, &em->flags))
>>>>> -                     flags |= FIEMAP_EXTENT_ENCODED;
>>>>> -             if (test_bit(EXTENT_FLAG_PREALLOC, &em->flags))
>>>>> -                     flags |= FIEMAP_EXTENT_UNWRITTEN;
>>>>> +                     ret = fiemap_process_hole(inode, fieinfo, &cache,
>>>>> +                                               backref_cache, 0, 0, 0,
>>>>> +                                               roots, tmp_ulist,
>>>>> +                                               prev_extent_end, range_end);
>>>>> +                     if (ret < 0) {
>>>>> +                             goto out_unlock;
>>>>> +                     } else if (ret > 0) {
>>>>> +                             /* fiemap_fill_next_extent() told us to stop. */
>>>>> +                             stopped = true;
>>>>> +                             break;
>>>>> +                     }
>>>>>
>>>>> -             free_extent_map(em);
>>>>> -             em = NULL;
>>>>> -             if ((em_start >= last) || em_len == (u64)-1 ||
>>>>> -                (last == (u64)-1 && isize <= em_end)) {
>>>>> -                     flags |= FIEMAP_EXTENT_LAST;
>>>>> -                     end = 1;
>>>>> +                     /* We've reached the end of the fiemap range, stop. */
>>>>> +                     if (key.offset >= lockend) {
>>>>> +                             stopped = true;
>>>>> +                             break;
>>>>> +                     }
>>>>>                 }
>>>>>
>>>>> -             /* now scan forward to see if this is really the last extent. */
>>>>> -             em = get_extent_skip_holes(inode, off, last_for_get_extent);
>>>>> -             if (IS_ERR(em)) {
>>>>> -                     ret = PTR_ERR(em);
>>>>> -                     goto out;
>>>>> +             extent_len = extent_end - key.offset;
>>>>> +             ei = btrfs_item_ptr(leaf, path->slots[0],
>>>>> +                                 struct btrfs_file_extent_item);
>>>>> +             compression = btrfs_file_extent_compression(leaf, ei);
>>>>> +             extent_type = btrfs_file_extent_type(leaf, ei);
>>>>> +             extent_gen = btrfs_file_extent_generation(leaf, ei);
>>>>> +
>>>>> +             if (extent_type != BTRFS_FILE_EXTENT_INLINE) {
>>>>> +                     disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, ei);
>>>>> +                     if (compression == BTRFS_COMPRESS_NONE)
>>>>> +                             extent_offset = btrfs_file_extent_offset(leaf, ei);
>>>>>                 }
>>>>> -             if (!em) {
>>>>> -                     flags |= FIEMAP_EXTENT_LAST;
>>>>> -                     end = 1;
>>>>> +
>>>>> +             if (compression != BTRFS_COMPRESS_NONE)
>>>>> +                     flags |= FIEMAP_EXTENT_ENCODED;
>>>>> +
>>>>> +             if (extent_type == BTRFS_FILE_EXTENT_INLINE) {
>>>>> +                     flags |= FIEMAP_EXTENT_DATA_INLINE;
>>>>> +                     flags |= FIEMAP_EXTENT_NOT_ALIGNED;
>>>>> +                     ret = emit_fiemap_extent(fieinfo, &cache, key.offset, 0,
>>>>> +                                              extent_len, flags);
>>>>> +             } else if (extent_type == BTRFS_FILE_EXTENT_PREALLOC) {
>>>>> +                     ret = fiemap_process_hole(inode, fieinfo, &cache,
>>>>> +                                               backref_cache,
>>>>> +                                               disk_bytenr, extent_offset,
>>>>> +                                               extent_gen, roots, tmp_ulist,
>>>>> +                                               key.offset, extent_end - 1);
>>>>> +             } else if (disk_bytenr == 0) {
>>>>> +                     /* We have an explicit hole. */
>>>>> +                     ret = fiemap_process_hole(inode, fieinfo, &cache,
>>>>> +                                               backref_cache, 0, 0, 0,
>>>>> +                                               roots, tmp_ulist,
>>>>> +                                               key.offset, extent_end - 1);
>>>>> +             } else {
>>>>> +                     /* We have a regular extent. */
>>>>> +                     if (fieinfo->fi_extents_max) {
>>>>> +                             ret = btrfs_is_data_extent_shared(root, ino,
>>>>> +                                                               disk_bytenr,
>>>>> +                                                               extent_gen,
>>>>> +                                                               roots,
>>>>> +                                                               tmp_ulist,
>>>>> +                                                               backref_cache);
>>>>> +                             if (ret < 0)
>>>>> +                                     goto out_unlock;
>>>>> +                             else if (ret > 0)
>>>>> +                                     flags |= FIEMAP_EXTENT_SHARED;
>>>>> +                     }
>>>>> +
>>>>> +                     ret = emit_fiemap_extent(fieinfo, &cache, key.offset,
>>>>> +                                              disk_bytenr + extent_offset,
>>>>> +                                              extent_len, flags);
>>>>>                 }
>>>>> -             ret = emit_fiemap_extent(fieinfo, &cache, em_start, disko,
>>>>> -                                        em_len, flags);
>>>>> -             if (ret) {
>>>>> -                     if (ret == 1)
>>>>> -                             ret = 0;
>>>>> -                     goto out_free;
>>>>> +
>>>>> +             if (ret < 0) {
>>>>> +                     goto out_unlock;
>>>>> +             } else if (ret > 0) {
>>>>> +                     /* fiemap_fill_next_extent() told us to stop. */
>>>>> +                     stopped = true;
>>>>> +                     break;
>>>>>                 }
>>>>>
>>>>> +             prev_extent_end = extent_end;
>>>>> +next_item:
>>>>>                 if (fatal_signal_pending(current)) {
>>>>>                         ret = -EINTR;
>>>>> -                     goto out_free;
>>>>> +                     goto out_unlock;
>>>>>                 }
>>>>> +
>>>>> +             ret = fiemap_next_leaf_item(inode, path);
>>>>> +             if (ret < 0) {
>>>>> +                     goto out_unlock;
>>>>> +             } else if (ret > 0) {
>>>>> +                     /* No more file extent items for this inode. */
>>>>> +                     break;
>>>>> +             }
>>>>> +             cond_resched();
>>>>>         }
>>>>> -out_free:
>>>>> -     if (!ret)
>>>>> -             ret = emit_last_fiemap_cache(fieinfo, &cache);
>>>>> -     free_extent_map(em);
>>>>> -out:
>>>>> -     unlock_extent_cached(&inode->io_tree, start, start + len - 1,
>>>>> -                          &cached_state);
>>>>>
>>>>> -out_free_ulist:
>>>>> +check_eof_delalloc:
>>>>> +     /*
>>>>> +      * Release (and free) the path before emitting any final entries to
>>>>> +      * fiemap_fill_next_extent() to keep lockdep happy. This is because
>>>>> +      * once we find no more file extent items exist, we may have a
>>>>> +      * non-cloned leaf, and fiemap_fill_next_extent() can trigger page
>>>>> +      * faults when copying data to the user space buffer.
>>>>> +      */
>>>>> +     btrfs_free_path(path);
>>>>> +     path = NULL;
>>>>> +
>>>>> +     if (!stopped && prev_extent_end < lockend) {
>>>>> +             ret = fiemap_process_hole(inode, fieinfo, &cache, backref_cache,
>>>>> +                                       0, 0, 0, roots, tmp_ulist,
>>>>> +                                       prev_extent_end, lockend - 1);
>>>>> +             if (ret < 0)
>>>>> +                     goto out_unlock;
>>>>> +             prev_extent_end = lockend;
>>>>> +     }
>>>>> +
>>>>> +     if (cache.cached && cache.offset + cache.len >= last_extent_end) {
>>>>> +             const u64 i_size = i_size_read(&inode->vfs_inode);
>>>>> +
>>>>> +             if (prev_extent_end < i_size) {
>>>>> +                     u64 delalloc_start;
>>>>> +                     u64 delalloc_end;
>>>>> +                     bool delalloc;
>>>>> +
>>>>> +                     delalloc = btrfs_find_delalloc_in_range(inode,
>>>>> +                                                             prev_extent_end,
>>>>> +                                                             i_size - 1,
>>>>> +                                                             &delalloc_start,
>>>>> +                                                             &delalloc_end);
>>>>> +                     if (!delalloc)
>>>>> +                             cache.flags |= FIEMAP_EXTENT_LAST;
>>>>> +             } else {
>>>>> +                     cache.flags |= FIEMAP_EXTENT_LAST;
>>>>> +             }
>>>>> +     }
>>>>> +
>>>>> +     ret = emit_last_fiemap_cache(fieinfo, &cache);
>>>>> +
>>>>> +out_unlock:
>>>>> +     unlock_extent_cached(&inode->io_tree, lockstart, lockend, &cached_state);
>>>>> +out:
>>>>>         kfree(backref_cache);
>>>>>         btrfs_free_path(path);
>>>>>         ulist_free(roots);
>>>>> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
>>>>> index b292a8ada3a4..636b3ec46184 100644
>>>>> --- a/fs/btrfs/file.c
>>>>> +++ b/fs/btrfs/file.c
>>>>> @@ -3602,10 +3602,10 @@ static long btrfs_fallocate(struct file *file, int mode,
>>>>>     }
>>>>>
>>>>>     /*
>>>>> - * Helper for have_delalloc_in_range(). Find a subrange in a given range that
>>>>> - * has unflushed and/or flushing delalloc. There might be other adjacent
>>>>> - * subranges after the one it found, so have_delalloc_in_range() keeps looping
>>>>> - * while it gets adjacent subranges, and merging them together.
>>>>> + * Helper for btrfs_find_delalloc_in_range(). Find a subrange in a given range
>>>>> + * that has unflushed and/or flushing delalloc. There might be other adjacent
>>>>> + * subranges after the one it found, so btrfs_find_delalloc_in_range() keeps
>>>>> + * looping while it gets adjacent subranges, and merging them together.
>>>>>      */
>>>>>     static bool find_delalloc_subrange(struct btrfs_inode *inode, u64 start, u64 end,
>>>>>                                    u64 *delalloc_start_ret, u64 *delalloc_end_ret)
>>>>> @@ -3740,8 +3740,8 @@ static bool find_delalloc_subrange(struct btrfs_inode *inode, u64 start, u64 end
>>>>>      * if so it sets @delalloc_start_ret and @delalloc_end_ret with the start and
>>>>>      * end offsets of the subrange.
>>>>>      */
>>>>> -static bool have_delalloc_in_range(struct btrfs_inode *inode, u64 start, u64 end,
>>>>> -                                u64 *delalloc_start_ret, u64 *delalloc_end_ret)
>>>>> +bool btrfs_find_delalloc_in_range(struct btrfs_inode *inode, u64 start, u64 end,
>>>>> +                               u64 *delalloc_start_ret, u64 *delalloc_end_ret)
>>>>>     {
>>>>>         u64 cur_offset = round_down(start, inode->root->fs_info->sectorsize);
>>>>>         u64 prev_delalloc_end = 0;
>>>>> @@ -3804,8 +3804,8 @@ static bool find_desired_extent_in_hole(struct btrfs_inode *inode, int whence,
>>>>>         u64 delalloc_end;
>>>>>         bool delalloc;
>>>>>
>>>>> -     delalloc = have_delalloc_in_range(inode, start, end, &delalloc_start,
>>>>> -                                       &delalloc_end);
>>>>> +     delalloc = btrfs_find_delalloc_in_range(inode, start, end,
>>>>> +                                             &delalloc_start, &delalloc_end);
>>>>>         if (delalloc && whence == SEEK_DATA) {
>>>>>                 *start_ret = delalloc_start;
>>>>>                 return true;
>>>>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
>>>>> index 2c7d31990777..8be1e021513a 100644
>>>>> --- a/fs/btrfs/inode.c
>>>>> +++ b/fs/btrfs/inode.c
>>>>> @@ -7064,133 +7064,6 @@ struct extent_map *btrfs_get_extent(struct btrfs_inode *inode,
>>>>>         return em;
>>>>>     }
>>>>>
>>>>> -struct extent_map *btrfs_get_extent_fiemap(struct btrfs_inode *inode,
>>>>> -                                        u64 start, u64 len)
>>>>> -{
>>>>> -     struct extent_map *em;
>>>>> -     struct extent_map *hole_em = NULL;
>>>>> -     u64 delalloc_start = start;
>>>>> -     u64 end;
>>>>> -     u64 delalloc_len;
>>>>> -     u64 delalloc_end;
>>>>> -     int err = 0;
>>>>> -
>>>>> -     em = btrfs_get_extent(inode, NULL, 0, start, len);
>>>>> -     if (IS_ERR(em))
>>>>> -             return em;
>>>>> -     /*
>>>>> -      * If our em maps to:
>>>>> -      * - a hole or
>>>>> -      * - a pre-alloc extent,
>>>>> -      * there might actually be delalloc bytes behind it.
>>>>> -      */
>>>>> -     if (em->block_start != EXTENT_MAP_HOLE &&
>>>>> -         !test_bit(EXTENT_FLAG_PREALLOC, &em->flags))
>>>>> -             return em;
>>>>> -     else
>>>>> -             hole_em = em;
>>>>> -
>>>>> -     /* check to see if we've wrapped (len == -1 or similar) */
>>>>> -     end = start + len;
>>>>> -     if (end < start)
>>>>> -             end = (u64)-1;
>>>>> -     else
>>>>> -             end -= 1;
>>>>> -
>>>>> -     em = NULL;
>>>>> -
>>>>> -     /* ok, we didn't find anything, lets look for delalloc */
>>>>> -     delalloc_len = count_range_bits(&inode->io_tree, &delalloc_start,
>>>>> -                              end, len, EXTENT_DELALLOC, 1);
>>>>> -     delalloc_end = delalloc_start + delalloc_len;
>>>>> -     if (delalloc_end < delalloc_start)
>>>>> -             delalloc_end = (u64)-1;
>>>>> -
>>>>> -     /*
>>>>> -      * We didn't find anything useful, return the original results from
>>>>> -      * get_extent()
>>>>> -      */
>>>>> -     if (delalloc_start > end || delalloc_end <= start) {
>>>>> -             em = hole_em;
>>>>> -             hole_em = NULL;
>>>>> -             goto out;
>>>>> -     }
>>>>> -
>>>>> -     /*
>>>>> -      * Adjust the delalloc_start to make sure it doesn't go backwards from
>>>>> -      * the start they passed in
>>>>> -      */
>>>>> -     delalloc_start = max(start, delalloc_start);
>>>>> -     delalloc_len = delalloc_end - delalloc_start;
>>>>> -
>>>>> -     if (delalloc_len > 0) {
>>>>> -             u64 hole_start;
>>>>> -             u64 hole_len;
>>>>> -             const u64 hole_end = extent_map_end(hole_em);
>>>>> -
>>>>> -             em = alloc_extent_map();
>>>>> -             if (!em) {
>>>>> -                     err = -ENOMEM;
>>>>> -                     goto out;
>>>>> -             }
>>>>> -
>>>>> -             ASSERT(hole_em);
>>>>> -             /*
>>>>> -              * When btrfs_get_extent can't find anything it returns one
>>>>> -              * huge hole
>>>>> -              *
>>>>> -              * Make sure what it found really fits our range, and adjust to
>>>>> -              * make sure it is based on the start from the caller
>>>>> -              */
>>>>> -             if (hole_end <= start || hole_em->start > end) {
>>>>> -                    free_extent_map(hole_em);
>>>>> -                    hole_em = NULL;
>>>>> -             } else {
>>>>> -                    hole_start = max(hole_em->start, start);
>>>>> -                    hole_len = hole_end - hole_start;
>>>>> -             }
>>>>> -
>>>>> -             if (hole_em && delalloc_start > hole_start) {
>>>>> -                     /*
>>>>> -                      * Our hole starts before our delalloc, so we have to
>>>>> -                      * return just the parts of the hole that go until the
>>>>> -                      * delalloc starts
>>>>> -                      */
>>>>> -                     em->len = min(hole_len, delalloc_start - hole_start);
>>>>> -                     em->start = hole_start;
>>>>> -                     em->orig_start = hole_start;
>>>>> -                     /*
>>>>> -                      * Don't adjust block start at all, it is fixed at
>>>>> -                      * EXTENT_MAP_HOLE
>>>>> -                      */
>>>>> -                     em->block_start = hole_em->block_start;
>>>>> -                     em->block_len = hole_len;
>>>>> -                     if (test_bit(EXTENT_FLAG_PREALLOC, &hole_em->flags))
>>>>> -                             set_bit(EXTENT_FLAG_PREALLOC, &em->flags);
>>>>> -             } else {
>>>>> -                     /*
>>>>> -                      * Hole is out of passed range or it starts after
>>>>> -                      * delalloc range
>>>>> -                      */
>>>>> -                     em->start = delalloc_start;
>>>>> -                     em->len = delalloc_len;
>>>>> -                     em->orig_start = delalloc_start;
>>>>> -                     em->block_start = EXTENT_MAP_DELALLOC;
>>>>> -                     em->block_len = delalloc_len;
>>>>> -             }
>>>>> -     } else {
>>>>> -             return hole_em;
>>>>> -     }
>>>>> -out:
>>>>> -
>>>>> -     free_extent_map(hole_em);
>>>>> -     if (err) {
>>>>> -             free_extent_map(em);
>>>>> -             return ERR_PTR(err);
>>>>> -     }
>>>>> -     return em;
>>>>> -}
>>>>> -
>>>>>     static struct extent_map *btrfs_create_dio_extent(struct btrfs_inode *inode,
>>>>>                                                   const u64 start,
>>>>>                                                   const u64 len,
>>>>> @@ -8265,15 +8138,14 @@ static int btrfs_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
>>>>>          * in the compression of data (in an async thread) and will return
>>>>>          * before the compression is done and writeback is started. A second
>>>>>          * filemap_fdatawrite_range() is needed to wait for the compression to
>>>>> -      * complete and writeback to start. Without this, our user is very
>>>>> -      * likely to get stale results, because the extents and extent maps for
>>>>> -      * delalloc regions are only allocated when writeback starts.
>>>>> +      * complete and writeback to start. We also need to wait for ordered
>>>>> +      * extents to complete, because our fiemap implementation uses mainly
>>>>> +      * file extent items to list the extents, searching for extent maps
>>>>> +      * only for file ranges with holes or prealloc extents to figure out
>>>>> +      * if we have delalloc in those ranges.
>>>>>          */
>>>>>         if (fieinfo->fi_flags & FIEMAP_FLAG_SYNC) {
>>>>> -             ret = btrfs_fdatawrite_range(inode, 0, LLONG_MAX);
>>>>> -             if (ret)
>>>>> -                     return ret;
>>>>> -             ret = filemap_fdatawait_range(inode->i_mapping, 0, LLONG_MAX);
>>>>> +             ret = btrfs_wait_ordered_range(inode, 0, LLONG_MAX);
>>>>>                 if (ret)
>>>>>                         return ret;
>>>>>         }

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 00/10] btrfs: make lseek and fiemap much more efficient
  2022-09-02  8:24   ` Filipe Manana
@ 2022-09-02 11:41     ` Wang Yugui
  2022-09-02 11:45     ` Filipe Manana
  1 sibling, 0 replies; 53+ messages in thread
From: Wang Yugui @ 2022-09-02 11:41 UTC (permalink / raw)
  To: linux-btrfs

Hi,

> On Fri, Sep 2, 2022 at 2:09 AM Wang Yugui <wangyugui@e16-tech.com> wrote:
> >
> > Hi,
> >
> > > From: Filipe Manana <fdmanana@suse.com>
> > >
> > > We often get reports of fiemap and hole/data seeking (lseek) being too slow
> > > on btrfs, or even unusable in some cases due to being extremely slow.
> > >
> > > Some recent reports for fiemap:
> > >
> > >     https://lore.kernel.org/linux-btrfs/21dd32c6-f1f9-f44a-466a-e18fdc6788a7@virtuozzo.com/
> > >     https://lore.kernel.org/linux-btrfs/Ysace25wh5BbLd5f@atmark-techno.com/
> > >
> > > For lseek (LSF/MM from 2017):
> > >
> > >    https://lwn.net/Articles/718805/
> > >
> > > Basically both are slow due to very high algorithmic complexity which
> > > scales badly with the number of extents in a file and the heigth of
> > > subvolume and extent b+trees.
> > >
> > > Using Pavel's test case (first Link tag for fiemap), which uses files with
> > > many 4K extents and holes before and after each extent (kind of a worst
> > > case scenario), the speedup is of several orders of magnitude (for the 1G
> > > file, from ~225 seconds down to ~0.1 seconds).
> > >
> > > Finally the new algorithm for fiemap also ends up solving a bug with the
> > > current algorithm. This happens because we are currently relying on extent
> > > maps to report extents, which can be merged, and this may cause us to
> > > report 2 different extents as a single one that is not shared but one of
> > > them is shared (or the other way around). More details on this on patches
> > > 9/10 and 10/10.
> > >
> > > Patches 1/10 and 2/10 are for lseek, introducing some code that will later
> > > be used by fiemap too (patch 10/10). More details in the changelogs.
> > >
> > > There are a few more things that can be done to speedup fiemap and lseek,
> > > but I'll leave those other optimizations I have in mind for some other time.
> > >
> > > Filipe Manana (10):
> > >   btrfs: allow hole and data seeking to be interruptible
> > >   btrfs: make hole and data seeking a lot more efficient
> > >   btrfs: remove check for impossible block start for an extent map at fiemap
> > >   btrfs: remove zero length check when entering fiemap
> > >   btrfs: properly flush delalloc when entering fiemap
> > >   btrfs: allow fiemap to be interruptible
> > >   btrfs: rename btrfs_check_shared() to a more descriptive name
> > >   btrfs: speedup checking for extent sharedness during fiemap
> > >   btrfs: skip unnecessary extent buffer sharedness checks during fiemap
> > >   btrfs: make fiemap more efficient and accurate reporting extent sharedness
> > >
> > >  fs/btrfs/backref.c     | 153 ++++++++-
> > >  fs/btrfs/backref.h     |  20 +-
> > >  fs/btrfs/ctree.h       |  22 +-
> > >  fs/btrfs/extent-tree.c |  10 +-
> > >  fs/btrfs/extent_io.c   | 703 ++++++++++++++++++++++++++++-------------
> > >  fs/btrfs/file.c        | 439 +++++++++++++++++++++++--
> > >  fs/btrfs/inode.c       | 146 ++-------
> > >  7 files changed, 1111 insertions(+), 382 deletions(-)
> >
> >
> > An infinite loop happen when the 10 pathes applied to 6.0-rc3.
> 
> Nop, it's not an infinite loop, and it happens as well before the patchset.
> The reason is that the files created by the test are very sparse and
> with small extents.
> It's full of 4K extents surrounded by 8K holes.
> 
> So any one doing hole seeking, advances 8K on every lseek call.
> If you strace the cp process, with
> 
> strace -p <cp pid>
> 
> You'll see something like this filling your terminal:
> 
> (...)
> lseek(3, 18808832, SEEK_SET)            = 18808832
> write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
> 4096) = 4096
> read(3, "a\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
> 4096) = 4096
...
> lseek(3, 18857984, SEEK_SET)            = 18857984
> (...)
> 
> It takes a long time, but it finishes. If you notice the difference
> between each return
> value is exactly 8K.
> 
> That happens both before and after the patchset.

Yes. It takes a long time, but it finishes.
Thanks for the tip about 'strace -p <cp pid>'.

More tests show that the performance depends on whether the data is
cached.

When data is not cached (echo 3 >/proc/sys/vm/drop_caches),
'/bin/cp /mnt/test/file1 /dev/null' take 97.37s.

When data is cached (/bin/cp again),
'/bin/cp /mnt/test/file1 /dev/null' take 2056.53s.

/mnt/test/file1 is 512M created by that producer.

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2022/09/02


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 00/10] btrfs: make lseek and fiemap much more efficient
  2022-09-02  8:24   ` Filipe Manana
  2022-09-02 11:41     ` Wang Yugui
@ 2022-09-02 11:45     ` Filipe Manana
  2022-09-05 14:39       ` Filipe Manana
  1 sibling, 1 reply; 53+ messages in thread
From: Filipe Manana @ 2022-09-02 11:45 UTC (permalink / raw)
  To: Wang Yugui; +Cc: linux-btrfs

On Fri, Sep 2, 2022 at 9:24 AM Filipe Manana <fdmanana@kernel.org> wrote:
>
> On Fri, Sep 2, 2022 at 2:09 AM Wang Yugui <wangyugui@e16-tech.com> wrote:
> >
> > Hi,
> >
> > > From: Filipe Manana <fdmanana@suse.com>
> > >
> > > We often get reports of fiemap and hole/data seeking (lseek) being too slow
> > > on btrfs, or even unusable in some cases due to being extremely slow.
> > >
> > > Some recent reports for fiemap:
> > >
> > >     https://lore.kernel.org/linux-btrfs/21dd32c6-f1f9-f44a-466a-e18fdc6788a7@virtuozzo.com/
> > >     https://lore.kernel.org/linux-btrfs/Ysace25wh5BbLd5f@atmark-techno.com/
> > >
> > > For lseek (LSF/MM from 2017):
> > >
> > >    https://lwn.net/Articles/718805/
> > >
> > > Basically both are slow due to very high algorithmic complexity which
> > > scales badly with the number of extents in a file and the heigth of
> > > subvolume and extent b+trees.
> > >
> > > Using Pavel's test case (first Link tag for fiemap), which uses files with
> > > many 4K extents and holes before and after each extent (kind of a worst
> > > case scenario), the speedup is of several orders of magnitude (for the 1G
> > > file, from ~225 seconds down to ~0.1 seconds).
> > >
> > > Finally the new algorithm for fiemap also ends up solving a bug with the
> > > current algorithm. This happens because we are currently relying on extent
> > > maps to report extents, which can be merged, and this may cause us to
> > > report 2 different extents as a single one that is not shared but one of
> > > them is shared (or the other way around). More details on this on patches
> > > 9/10 and 10/10.
> > >
> > > Patches 1/10 and 2/10 are for lseek, introducing some code that will later
> > > be used by fiemap too (patch 10/10). More details in the changelogs.
> > >
> > > There are a few more things that can be done to speedup fiemap and lseek,
> > > but I'll leave those other optimizations I have in mind for some other time.
> > >
> > > Filipe Manana (10):
> > >   btrfs: allow hole and data seeking to be interruptible
> > >   btrfs: make hole and data seeking a lot more efficient
> > >   btrfs: remove check for impossible block start for an extent map at fiemap
> > >   btrfs: remove zero length check when entering fiemap
> > >   btrfs: properly flush delalloc when entering fiemap
> > >   btrfs: allow fiemap to be interruptible
> > >   btrfs: rename btrfs_check_shared() to a more descriptive name
> > >   btrfs: speedup checking for extent sharedness during fiemap
> > >   btrfs: skip unnecessary extent buffer sharedness checks during fiemap
> > >   btrfs: make fiemap more efficient and accurate reporting extent sharedness
> > >
> > >  fs/btrfs/backref.c     | 153 ++++++++-
> > >  fs/btrfs/backref.h     |  20 +-
> > >  fs/btrfs/ctree.h       |  22 +-
> > >  fs/btrfs/extent-tree.c |  10 +-
> > >  fs/btrfs/extent_io.c   | 703 ++++++++++++++++++++++++++++-------------
> > >  fs/btrfs/file.c        | 439 +++++++++++++++++++++++--
> > >  fs/btrfs/inode.c       | 146 ++-------
> > >  7 files changed, 1111 insertions(+), 382 deletions(-)
> >
> >
> > An infinite loop happen when the 10 pathes applied to 6.0-rc3.
>
> Nop, it's not an infinite loop, and it happens as well before the patchset.
> The reason is that the files created by the test are very sparse and
> with small extents.
> It's full of 4K extents surrounded by 8K holes.
>
> So any one doing hole seeking, advances 8K on every lseek call.
> If you strace the cp process, with
>
> strace -p <cp pid>
>
> You'll see something like this filling your terminal:
>
> (...)
> lseek(3, 18808832, SEEK_SET)            = 18808832
> write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
> 4096) = 4096
> read(3, "a\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
> 4096) = 4096
> write(4, "a\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
> 4096) = 4096
> lseek(3, 18817024, SEEK_SET)            = 18817024
> write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
> 4096) = 4096
> read(3, "a\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
> 4096) = 4096
> write(4, "a\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
> 4096) = 4096
> lseek(3, 18825216, SEEK_SET)            = 18825216
> write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
> 4096) = 4096
> read(3, "a\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
> 4096) = 4096
> write(4, "a\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
> 4096) = 4096
> lseek(3, 18833408, SEEK_SET)            = 18833408
> write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
> 4096) = 4096
> read(3, "a\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
> 4096) = 4096
> write(4, "a\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
> 4096) = 4096
> lseek(3, 18841600, SEEK_SET)            = 18841600
> write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
> 4096) = 4096
> read(3, "a\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
> 4096) = 4096
> write(4, "a\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
> 4096) = 4096
> lseek(3, 18849792, SEEK_SET)            = 18849792
> write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
> 4096) = 4096
> read(3, "a\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
> 4096) = 4096
> write(4, "a\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
> 4096) = 4096
> lseek(3, 18857984, SEEK_SET)            = 18857984
> (...)
>
> It takes a long time, but it finishes. If you notice, the difference
> between each return value is exactly 8K.
>
> That happens both before and after the patchset.
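
For context, the hole skipping part of such a copy boils down to a loop like
the one below (a minimal sketch of the SEEK_DATA/SEEK_HOLE pattern, not cp's
actual code, and with most error handling omitted):

    #define _GNU_SOURCE      /* for SEEK_DATA and SEEK_HOLE */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        char buf[4096];
        off_t data = 0;
        int src, dst;

        if (argc != 3) {
            fprintf(stderr, "usage: %s <src> <dst>\n", argv[0]);
            return 1;
        }
        src = open(argv[1], O_RDONLY);
        dst = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (src < 0 || dst < 0) {
            perror("open");
            return 1;
        }

        for (;;) {
            off_t hole;

            /* Find the start of the next data region (fails at EOF). */
            data = lseek(src, data, SEEK_DATA);
            if (data < 0)
                break;
            /* Find where this data region ends (the next hole or EOF). */
            hole = lseek(src, data, SEEK_HOLE);
            if (hole < 0)
                break;
            /* Copy only the data region; the hole after it is skipped. */
            while (data < hole) {
                ssize_t n = pread(src, buf, sizeof(buf), data);

                if (n <= 0 || pwrite(dst, buf, n, data) != n)
                    goto out;
                data += n;
            }
        }
    out:
        /* Keep the source size so a trailing hole is preserved. */
        ftruncate(dst, lseek(src, 0, SEEK_END));
        close(src);
        close(dst);
        return 0;
    }

With files like the ones from the test, each SEEK_DATA/SEEK_HOLE pair only
advances 8K, which is why the strace output shows so many lseek calls.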

Btw, on a release (non-debug) kernel this is what I get before and
after the patchset.

Before patchset:

root 12:05:51 /home/fdmanana/scripts/other_perf/fiemap > umount
/dev/sdi ; mkfs.btrfs -f /dev/sdi ; mount /dev/sdi /mnt/sdi
root 12:06:47 /home/fdmanana/scripts/other_perf/fiemap > ./pavels-test
/mnt/sdi/foobar $((1 << 30)) && time cp /mnt/sdi/foobar /dev/null
size: 1073741824
actual size: 536870912
fiemap: fm_mapped_extents = 131072
time = 256243106 us

real 5m50.026s
user 0m0.232s
sys 5m48.698s


After patchset:

root 12:32:44 /home/fdmanana/scripts/other_perf/fiemap > umount
/dev/sdi ; mkfs.btrfs -f /dev/sdi ; mount /dev/sdi /mnt/sdi
root 12:33:01 /home/fdmanana/scripts/other_perf/fiemap > ./pavels-test
/mnt/sdi/foobar $((1 << 30)) && time cp /mnt/sdi/foobar /dev/null
size: 1073741824
actual size: 536870912
fiemap: fm_mapped_extents = 129941
time = 134062 us

real 0m57.606s
user 0m0.185s
sys 0m57.375s


Not as fast as ext4 yet, which takes ~1.5 seconds, but it's getting much better.
What's causing cp to be slow are the multiple ranged fiemap calls it does.
cp also does a lot of lseek calls to detect and skip holes, but the
total time spent on it is almost insignificant when compared to
fiemap.
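
To illustrate what a ranged fiemap walk looks like, here is a rough sketch
(not what cp literally does; the 32M range length and the 128 extent buffer
are arbitrary values picked for the example):

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <sys/stat.h>
    #include <linux/fs.h>
    #include <linux/fiemap.h>

    #define NUM_EXTENTS 128
    #define RANGE_LEN   (32ULL << 20)   /* query 32M of the file per call */

    int main(int argc, char **argv)
    {
        const size_t fm_size = sizeof(struct fiemap) +
                               NUM_EXTENTS * sizeof(struct fiemap_extent);
        unsigned long long total = 0;
        struct fiemap *fm;
        struct stat st;
        __u64 pos = 0;
        int fd, last = 0;

        if (argc != 2) {
            fprintf(stderr, "usage: %s <file>\n", argv[0]);
            return 1;
        }
        fd = open(argv[1], O_RDONLY);
        if (fd < 0 || fstat(fd, &st) < 0) {
            perror("open/fstat");
            return 1;
        }
        fm = malloc(fm_size);
        if (!fm)
            return 1;

        /* Issue one ranged fiemap call per iteration until we hit EOF. */
        while (!last && pos < (__u64)st.st_size) {
            unsigned int i;

            memset(fm, 0, fm_size);
            fm->fm_start = pos;
            fm->fm_length = RANGE_LEN;
            fm->fm_extent_count = NUM_EXTENTS;

            if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0) {
                perror("fiemap");
                break;
            }
            if (fm->fm_mapped_extents == 0) {
                /* No extents in this range (a hole), skip to the next one. */
                pos += RANGE_LEN;
                continue;
            }
            for (i = 0; i < fm->fm_mapped_extents; i++) {
                struct fiemap_extent *fe = &fm->fm_extents[i];

                total++;
                pos = fe->fe_logical + fe->fe_length;
                if (fe->fe_flags & FIEMAP_EXTENT_LAST)
                    last = 1;
            }
        }
        printf("extents seen: %llu\n", total);
        free(fm);
        close(fd);
        return 0;
    }

Each iteration is a separate FS_IOC_FIEMAP call over a small file range,
so for files with hundreds of thousands of extents the per-call overhead
adds up quickly.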

I'll work on making fiemap more efficient, but that will come in a
separate patch or patches that build upon this patchset. I prefer to
keep things incremental and avoid having too many fiemap changes in a
single kernel release.

>
> Thanks.
>
>
> >
> > A file is created by 'pavels-test.c' of [PATCH 10/10], and then
> > '/bin/cp /mnt/test/file1 /dev/null' will trigger an infinite loop.
> >
> > 'sysrq -l' output:
> >
> > [ 1437.765228] Call Trace:
> > [ 1437.765228]  <TASK>
> > [ 1437.765228]  set_extent_bit+0x33d/0x6e0 [btrfs]
> > [ 1437.765228]  lock_extent_bits+0x64/0xa0 [btrfs]
> > [ 1437.765228]  btrfs_file_llseek+0x192/0x5b0 [btrfs]
> > [ 1437.765228]  ksys_lseek+0x64/0xb0
> > [ 1437.765228]  do_syscall_64+0x58/0x80
> > [ 1437.765228]  ? syscall_exit_to_user_mode+0x12/0x30
> > [ 1437.765228]  ? do_syscall_64+0x67/0x80
> > [ 1437.765228]  ? do_syscall_64+0x67/0x80
> > [ 1437.765228]  ? exc_page_fault+0x64/0x140
> > [ 1437.765228]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
> > [ 1437.765228] RIP: 0033:0x7f5a263441bb
> >
> > Best Regards
> > Wang Yugui (wangyugui@e16-tech.com)
> > 2022/09/02
> >
> >

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 10/10] btrfs: make fiemap more efficient and accurate reporting extent sharedness
  2022-09-01 15:04     ` Filipe Manana
@ 2022-09-02 13:25       ` Josef Bacik
  0 siblings, 0 replies; 53+ messages in thread
From: Josef Bacik @ 2022-09-02 13:25 UTC (permalink / raw)
  To: Filipe Manana; +Cc: linux-btrfs

On Thu, Sep 01, 2022 at 04:04:17PM +0100, Filipe Manana wrote:
> On Thu, Sep 1, 2022 at 3:35 PM Josef Bacik <josef@toxicpanda.com> wrote:
> >
> > On Thu, Sep 01, 2022 at 02:18:30PM +0100, fdmanana@kernel.org wrote:
> > > From: Filipe Manana <fdmanana@suse.com>
> > >
> > > The current fiemap implementation does not scale very well with the number
> > > of extents a file has. This is both because the main algorithm to find out
> > > the extents has a high algorithmic complexity and because for each extent
> > > we have to check if it's shared. This second part, checking if an extent
> > > is shared, is significantly improved by the two previous patches in this
> > > patchset, while the first part is improved by this specific patch. Every
> > > now and then we get reports from users mentioning fiemap is too slow or
> > > even unusable for files with a very large number of extents, such as the
> > > two recent reports referred to by the Link tags at the bottom of this
> > > change log.
> > >
> > > To understand why the part of finding which extents a file has is very
> > > inefficient, consider the example of doing a full ranged fiemap against
> > > a file that has over 100K extents (normal for example for a file with
> > > more than 10G of data and using compression, which limits the extent size
> > > to 128K). When we enter fiemap at extent_fiemap(), the following happens:
> > >
> > > 1) Before entering the main loop, we call get_extent_skip_holes() to get
> > >    the first extent map. This leads us to btrfs_get_extent_fiemap(), which
> > >    in turn calls btrfs_get_extent(), to find the first extent map that
> > >    covers the file range [0, LLONG_MAX).
> > >
> > >    btrfs_get_extent() will first search the inode's extent map tree, to
> > >    see if we have an extent map there that covers the range. If it does
> > >    not find one, then it will search the inode's subvolume b+tree for a
> > >    fitting file extent item. After finding the file extent item, it will
> > >    allocate an extent map, fill it in with information extracted from the
> > >    file extent item, and add it to the inode's extent map tree (which
> > >    requires a search for insertion in the tree).
> > >
> > > 2) Then we enter the main loop at extent_fiemap(), emit the details of
> > >    the extent, and call again get_extent_skip_holes(), with a start
> > >    offset matching the end of the extent map we previously processed.
> > >
> > >    We end up at btrfs_get_extent() again, will search the extent map tree
> > >    and then search the subvolume b+tree for a file extent item if we could
> > >    not find an extent map in the extent tree. We allocate an extent map,
> > >    fill it in with the details in the file extent item, and then insert
> > >    it into the extent map tree (yet another search in this tree).
> > >
> > > 3) The second step is repeated over and over, until we have processed the
> > >    whole file range. Each iteration ends at btrfs_get_extent(), which
> > >    does a red black tree search on the extent map tree, then searches the
> > >    subvolume b+tree, allocates an extent map and then does another search
> > >    in the extent map tree in order to insert the extent map.
> > >
> > >    In the best scenario we have all the extent maps already in the extent
> > >    tree, and so for each extent we do a single search on a red black tree,
> > >    so we have a complexity of O(n log n).
> > >
> > >    In the worst scenario we don't have any extent map already loaded in
> > >    the extent map tree, or have very few already there. In this case the
> > >    complexity is much higher since we do:
> > >
> > >    - A red black tree search on the extent map tree, which has O(log n)
> > >      complexity, initially very fast since the tree is empty or very
> > >      small, but as we end up allocating extent maps and adding them to
> > >      the tree when we don't find them there, each subsequent search on
> > >      the tree gets slower, since it's getting bigger and bigger after
> > >      each iteration.
> > >
> > >    - A search on the subvolume b+tree, also O(log n) complexity, but it
> > >      has items for all inodes in the subvolume, not just items for our
> > >      inode. Plus on a filesystem with concurrent operations on other
> > >      inodes, we can block doing the search due to lock contention on
> > >      b+tree nodes/leaves.
> > >
> > >    - Allocate an extent map - this can block, and can also fail if we
> > >      are under serious memory pressure.
> > >
> > >    - Do another search on the extent maps red black tree, with the goal
> > >      of inserting the extent map we just allocated. Again, after every
> > >      iteration this tree is getting bigger by 1 element, so after many
> > >      iterations the searches are slower and slower.
> > >
> > >    - We will not need the allocated extent map anymore, so it's pointless
> > >      to add it to the extent map tree. It's just wasting time and memory.
> > >
> > >    In short we end up searching the extent map tree multiple times, on a
> > >    tree that is growing bigger and bigger after each iteration. And
> > >    besides that we visit the same leaf of the subvolume b+tree many times,
> > >    since a leaf with the default size of 16K can easily have more than 200
> > >    file extent items.
> > >
> > > This is very inefficient overall. This patch changes the algorithm to
> > > instead iterate over the subvolume b+tree, visiting each leaf only once,
> > > and only searching in the extent map tree for file ranges that have holes
> > > or prealloc extents, in order to figure out if we have delalloc there.
> > > It will never allocate an extent map and add it to the extent map tree.
> > > This is very similar to what was previously done for the lseek's hole and
> > > data seeking features.
> > >
> > > Also, the current implementation relying on extent maps for figuring out
> > > which extents we have is not correct. This is because extent maps can be
> > > merged even if they represent different extents - we do this to minimize
> > > memory utilization and keep extent map trees smaller. For example if we
> > > have two extents that are contiguous on disk, once we load the two extent
> > > maps, they get merged into a single one - however if only one of the
> > > extents is shared, we end up reporting both as shared or both as not
> > > shared, which is incorrect.
> > >
> > > This reproducer triggers that bug:
> > >
> > >     $ cat fiemap-bug.sh
> > >     #!/bin/bash
> > >
> > >     DEV=/dev/sdj
> > >     MNT=/mnt/sdj
> > >
> > >     mkfs.btrfs -f $DEV
> > >     mount $DEV $MNT
> > >
> > >     # Create a file with two 256K extents.
> > >     # Since there is no other write activity, they will be contiguous,
> > >     # and their extent maps merged, despite having two distinct extents.
> > >     xfs_io -f -c "pwrite -S 0xab 0 256K" \
> > >               -c "fsync" \
> > >               -c "pwrite -S 0xcd 256K 256K" \
> > >               -c "fsync" \
> > >               $MNT/foo
> > >
> > >     # Now clone only the second extent into another file.
> > >     xfs_io -f -c "reflink $MNT/foo 256K 0 256K" $MNT/bar
> > >
> > >     # Filefrag will report a single 512K extent, and say it's not shared.
> > >     echo
> > >     filefrag -v $MNT/foo
> > >
> > >     umount $MNT
> > >
> > > Running the reproducer:
> > >
> > >     $ ./fiemap-bug.sh
> > >     wrote 262144/262144 bytes at offset 0
> > >     256 KiB, 64 ops; 0.0038 sec (65.479 MiB/sec and 16762.7030 ops/sec)
> > >     wrote 262144/262144 bytes at offset 262144
> > >     256 KiB, 64 ops; 0.0040 sec (61.125 MiB/sec and 15647.9218 ops/sec)
> > >     linked 262144/262144 bytes at offset 0
> > >     256 KiB, 1 ops; 0.0002 sec (1.034 GiB/sec and 4237.2881 ops/sec)
> > >
> > >     Filesystem type is: 9123683e
> > >     File size of /mnt/sdj/foo is 524288 (128 blocks of 4096 bytes)
> > >      ext:     logical_offset:        physical_offset: length:   expected: flags:
> > >        0:        0..     127:       3328..      3455:    128:             last,eof
> > >     /mnt/sdj/foo: 1 extent found
> > >
> > > We end up reporting that we have a single 512K extent that is not shared, however
> > > we have two 256K extents, and the second one is shared. Changing the
> > > reproducer to clone instead the first extent into file 'bar', makes us
> > > report a single 512K extent that is shared, which is also incorrect since
> > > we have two 256K extents and only the first one is shared.
> > >
> > > This patch is part of a larger patchset that is comprised of the following
> > > patches:
> > >
> > >     btrfs: allow hole and data seeking to be interruptible
> > >     btrfs: make hole and data seeking a lot more efficient
> > >     btrfs: remove check for impossible block start for an extent map at fiemap
> > >     btrfs: remove zero length check when entering fiemap
> > >     btrfs: properly flush delalloc when entering fiemap
> > >     btrfs: allow fiemap to be interruptible
> > >     btrfs: rename btrfs_check_shared() to a more descriptive name
> > >     btrfs: speedup checking for extent sharedness during fiemap
> > >     btrfs: skip unnecessary extent buffer sharedness checks during fiemap
> > >     btrfs: make fiemap more efficient and accurate reporting extent sharedness
> > >
> > > The patchset was tested on a machine running a non-debug kernel (Debian's
> > > default config) and compared the tests below on a branch without the
> > > patchset versus the same branch with the whole patchset applied.
> > >
> > > The following test for a large compressed file without holes:
> > >
> > >     $ cat fiemap-perf-test.sh
> > >     #!/bin/bash
> > >
> > >     DEV=/dev/sdi
> > >     MNT=/mnt/sdi
> > >
> > >     mkfs.btrfs -f $DEV
> > >     mount -o compress=lzo $DEV $MNT
> > >
> > >     # 40G gives 327680 128K file extents (due to compression).
> > >     xfs_io -f -c "pwrite -S 0xab -b 1M 0 40G" $MNT/foobar
> > >
> > >     umount $MNT
> > >     mount -o compress=lzo $DEV $MNT
> > >
> > >     start=$(date +%s%N)
> > >     filefrag $MNT/foobar
> > >     end=$(date +%s%N)
> > >     dur=$(( (end - start) / 1000000 ))
> > >     echo "fiemap took $dur milliseconds (metadata not cached)"
> > >
> > >     start=$(date +%s%N)
> > >     filefrag $MNT/foobar
> > >     end=$(date +%s%N)
> > >     dur=$(( (end - start) / 1000000 ))
> > >     echo "fiemap took $dur milliseconds (metadata cached)"
> > >
> > >     umount $MNT
> > >
> > > Before patchset:
> > >
> > >     $ ./fiemap-perf-test.sh
> > >     (...)
> > >     /mnt/sdi/foobar: 327680 extents found
> > >     fiemap took 3597 milliseconds (metadata not cached)
> > >     /mnt/sdi/foobar: 327680 extents found
> > >     fiemap took 2107 milliseconds (metadata cached)
> > >
> > > After patchset:
> > >
> > >     $ ./fiemap-perf-test.sh
> > >     (...)
> > >     /mnt/sdi/foobar: 327680 extents found
> > >     fiemap took 1214 milliseconds (metadata not cached)
> > >     /mnt/sdi/foobar: 327680 extents found
> > >     fiemap took 684 milliseconds (metadata cached)
> > >
> > > That's a speedup of about 3x for both cases (no metadata cached and all
> > > metadata cached).
> > >
> > > The test provided by Pavel (first Link tag at the bottom), which uses
> > > files with a large number of holes, was also used to measure the gains,
> > > and it consists of a small C program and a shell script to invoke it.
> > > The C program is the following:
> > >
> > >     $ cat pavels-test.c
> > >     #include <stdio.h>
> > >     #include <unistd.h>
> > >     #include <stdlib.h>
> > >     #include <fcntl.h>
> > >
> > >     #include <sys/stat.h>
> > >     #include <sys/time.h>
> > >     #include <sys/ioctl.h>
> > >
> > >     #include <linux/fs.h>
> > >     #include <linux/fiemap.h>
> > >
> > >     #define FILE_INTERVAL (1<<13) /* 8Kb */
> > >
> > >     long long interval(struct timeval t1, struct timeval t2)
> > >     {
> > >         long long val = 0;
> > >         val += (t2.tv_usec - t1.tv_usec);
> > >         val += (t2.tv_sec - t1.tv_sec) * 1000 * 1000;
> > >         return val;
> > >     }
> > >
> > >     int main(int argc, char **argv)
> > >     {
> > >         struct fiemap fiemap = {};
> > >         struct timeval t1, t2;
> > >         char data = 'a';
> > >         struct stat st;
> > >         int fd, off, file_size = FILE_INTERVAL;
> > >
> > >         if (argc != 3 && argc != 2) {
> > >                 printf("usage: %s <path> [size]\n", argv[0]);
> > >                 return 1;
> > >         }
> > >
> > >         if (argc == 3)
> > >                 file_size = atoi(argv[2]);
> > >         if (file_size < FILE_INTERVAL)
> > >                 file_size = FILE_INTERVAL;
> > >         file_size -= file_size % FILE_INTERVAL;
> > >
> > >         fd = open(argv[1], O_RDWR | O_CREAT | O_TRUNC, 0644);
> > >         if (fd < 0) {
> > >             perror("open");
> > >             return 1;
> > >         }
> > >
> > >         for (off = 0; off < file_size; off += FILE_INTERVAL) {
> > >             if (pwrite(fd, &data, 1, off) != 1) {
> > >                 perror("pwrite");
> > >                 close(fd);
> > >                 return 1;
> > >             }
> > >         }
> > >
> > >         if (ftruncate(fd, file_size)) {
> > >             perror("ftruncate");
> > >             close(fd);
> > >             return 1;
> > >         }
> > >
> > >         if (fstat(fd, &st) < 0) {
> > >             perror("fstat");
> > >             close(fd);
> > >             return 1;
> > >         }
> > >
> > >         printf("size: %ld\n", st.st_size);
> > >         printf("actual size: %ld\n", st.st_blocks * 512);
> > >
> > >         fiemap.fm_length = FIEMAP_MAX_OFFSET;
> > >         gettimeofday(&t1, NULL);
> > >         if (ioctl(fd, FS_IOC_FIEMAP, &fiemap) < 0) {
> > >             perror("fiemap");
> > >             close(fd);
> > >             return 1;
> > >         }
> > >         gettimeofday(&t2, NULL);
> > >
> > >         printf("fiemap: fm_mapped_extents = %d\n",
> > >                fiemap.fm_mapped_extents);
> > >         printf("time = %lld us\n", interval(t1, t2));
> > >
> > >         close(fd);
> > >         return 0;
> > >     }
> > >
> > >     $ gcc -o pavels-test pavels-test.c
> > >
> > > And the wrapper shell script:
> > >
> > >     $ cat fiemap-pavels-test.sh
> > >
> > >     #!/bin/bash
> > >
> > >     DEV=/dev/sdi
> > >     MNT=/mnt/sdi
> > >
> > >     mkfs.btrfs -f -O no-holes $DEV
> > >     mount $DEV $MNT
> > >
> > >     echo
> > >     echo "*********** 256M ***********"
> > >     echo
> > >
> > >     ./pavels-test $MNT/testfile $((1 << 28))
> > >     echo
> > >     ./pavels-test $MNT/testfile $((1 << 28))
> > >
> > >     echo
> > >     echo "*********** 512M ***********"
> > >     echo
> > >
> > >     ./pavels-test $MNT/testfile $((1 << 29))
> > >     echo
> > >     ./pavels-test $MNT/testfile $((1 << 29))
> > >
> > >     echo
> > >     echo "*********** 1G ***********"
> > >     echo
> > >
> > >     ./pavels-test $MNT/testfile $((1 << 30))
> > >     echo
> > >     ./pavels-test $MNT/testfile $((1 << 30))
> > >
> > >     umount $MNT
> > >
> > > Running his reproducer before applying the patchset:
> > >
> > >     *********** 256M ***********
> > >
> > >     size: 268435456
> > >     actual size: 134217728
> > >     fiemap: fm_mapped_extents = 32768
> > >     time = 4003133 us
> > >
> > >     size: 268435456
> > >     actual size: 134217728
> > >     fiemap: fm_mapped_extents = 32768
> > >     time = 4895330 us
> > >
> > >     *********** 512M ***********
> > >
> > >     size: 536870912
> > >     actual size: 268435456
> > >     fiemap: fm_mapped_extents = 65536
> > >     time = 30123675 us
> > >
> > >     size: 536870912
> > >     actual size: 268435456
> > >     fiemap: fm_mapped_extents = 65536
> > >     time = 33450934 us
> > >
> > >     *********** 1G ***********
> > >
> > >     size: 1073741824
> > >     actual size: 536870912
> > >     fiemap: fm_mapped_extents = 131072
> > >     time = 224924074 us
> > >
> > >     size: 1073741824
> > >     actual size: 536870912
> > >     fiemap: fm_mapped_extents = 131072
> > >     time = 217239242 us
> > >
> > > Running it after applying the patchset:
> > >
> > >     *********** 256M ***********
> > >
> > >     size: 268435456
> > >     actual size: 134217728
> > >     fiemap: fm_mapped_extents = 32768
> > >     time = 29475 us
> > >
> > >     size: 268435456
> > >     actual size: 134217728
> > >     fiemap: fm_mapped_extents = 32768
> > >     time = 29307 us
> > >
> > >     *********** 512M ***********
> > >
> > >     size: 536870912
> > >     actual size: 268435456
> > >     fiemap: fm_mapped_extents = 65536
> > >     time = 58996 us
> > >
> > >     size: 536870912
> > >     actual size: 268435456
> > >     fiemap: fm_mapped_extents = 65536
> > >     time = 59115 us
> > >
> > >     *********** 1G ***********
> > >
> > >     size: 1073741824
> > >     actual size: 536870912
> > >     fiemap: fm_mapped_extents = 116251
> > >     time = 124141 us
> > >
> > >     size: 1073741824
> > >     actual size: 536870912
> > >     fiemap: fm_mapped_extents = 131072
> > >     time = 119387 us
> > >
> > > The speedup is massive, both on the first fiemap call and on the second
> > > one as well, as his test creates files with many holes and small extents
> > > (every extent follows a hole and precedes another hole).
> > >
> > > For the 256M file we go from 4 seconds down to 29 milliseconds in the
> > > first run, and then from 4.9 seconds down to 29 milliseconds again in the
> > > second run, a speedup of 138x and 169x, respectively.
> > >
> > > For the 512M file we go from 30.1 seconds down to 59 milliseconds in the
> > > first run, and then from 33.5 seconds down to 59 milliseconds again in the
> > > second run, a speedup of 510x and 568x, respectively.
> > >
> > > For the 1G file, we go from 225 seconds down to 124 milliseconds in the
> > > first run, and then from 217 seconds down to 119 milliseconds in the
> > > second run, a speedup of 1815x and 1824x, respectively.
> > >
> > > Reported-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
> > > Link: https://lore.kernel.org/linux-btrfs/21dd32c6-f1f9-f44a-466a-e18fdc6788a7@virtuozzo.com/
> > > Reported-by: Dominique MARTINET <dominique.martinet@atmark-techno.com>
> > > Link: https://lore.kernel.org/linux-btrfs/Ysace25wh5BbLd5f@atmark-techno.com/
> > > Signed-off-by: Filipe Manana <fdmanana@suse.com>
> > > ---
> > >  fs/btrfs/ctree.h     |   4 +-
> > >  fs/btrfs/extent_io.c | 714 +++++++++++++++++++++++++++++--------------
> > >  fs/btrfs/file.c      |  16 +-
> > >  fs/btrfs/inode.c     | 140 +--------
> > >  4 files changed, 506 insertions(+), 368 deletions(-)
> > >
> > > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> > > index f7fe7f633eb5..7b266f9dc8b4 100644
> > > --- a/fs/btrfs/ctree.h
> > > +++ b/fs/btrfs/ctree.h
> > > @@ -3402,8 +3402,6 @@ unsigned int btrfs_verify_data_csum(struct btrfs_bio *bbio,
> > >                                   u64 start, u64 end);
> > >  int btrfs_check_data_csum(struct inode *inode, struct btrfs_bio *bbio,
> > >                         u32 bio_offset, struct page *page, u32 pgoff);
> > > -struct extent_map *btrfs_get_extent_fiemap(struct btrfs_inode *inode,
> > > -                                        u64 start, u64 len);
> > >  noinline int can_nocow_extent(struct inode *inode, u64 offset, u64 *len,
> > >                             u64 *orig_start, u64 *orig_block_len,
> > >                             u64 *ram_bytes, bool strict);
> > > @@ -3583,6 +3581,8 @@ int btrfs_fdatawrite_range(struct inode *inode, loff_t start, loff_t end);
> > >  int btrfs_check_nocow_lock(struct btrfs_inode *inode, loff_t pos,
> > >                          size_t *write_bytes);
> > >  void btrfs_check_nocow_unlock(struct btrfs_inode *inode);
> > > +bool btrfs_find_delalloc_in_range(struct btrfs_inode *inode, u64 start, u64 end,
> > > +                               u64 *delalloc_start_ret, u64 *delalloc_end_ret);
> > >
> > >  /* tree-defrag.c */
> > >  int btrfs_defrag_leaves(struct btrfs_trans_handle *trans,
> > > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> > > index 0e3fa9b08aaf..50bb2182e795 100644
> > > --- a/fs/btrfs/extent_io.c
> > > +++ b/fs/btrfs/extent_io.c
> > > @@ -5353,42 +5353,6 @@ int try_release_extent_mapping(struct page *page, gfp_t mask)
> > >       return try_release_extent_state(tree, page, mask);
> > >  }
> > >
> > > -/*
> > > - * helper function for fiemap, which doesn't want to see any holes.
> > > - * This maps until we find something past 'last'
> > > - */
> > > -static struct extent_map *get_extent_skip_holes(struct btrfs_inode *inode,
> > > -                                             u64 offset, u64 last)
> > > -{
> > > -     u64 sectorsize = btrfs_inode_sectorsize(inode);
> > > -     struct extent_map *em;
> > > -     u64 len;
> > > -
> > > -     if (offset >= last)
> > > -             return NULL;
> > > -
> > > -     while (1) {
> > > -             len = last - offset;
> > > -             if (len == 0)
> > > -                     break;
> > > -             len = ALIGN(len, sectorsize);
> > > -             em = btrfs_get_extent_fiemap(inode, offset, len);
> > > -             if (IS_ERR(em))
> > > -                     return em;
> > > -
> > > -             /* if this isn't a hole return it */
> > > -             if (em->block_start != EXTENT_MAP_HOLE)
> > > -                     return em;
> > > -
> > > -             /* this is a hole, advance to the next extent */
> > > -             offset = extent_map_end(em);
> > > -             free_extent_map(em);
> > > -             if (offset >= last)
> > > -                     break;
> > > -     }
> > > -     return NULL;
> > > -}
> > > -
> > >  /*
> > >   * To cache previous fiemap extent
> > >   *
> > > @@ -5418,6 +5382,9 @@ static int emit_fiemap_extent(struct fiemap_extent_info *fieinfo,
> > >  {
> > >       int ret = 0;
> > >
> > > +     /* Set at the end of extent_fiemap(). */
> > > +     ASSERT((flags & FIEMAP_EXTENT_LAST) == 0);
> > > +
> > >       if (!cache->cached)
> > >               goto assign;
> > >
> > > @@ -5446,11 +5413,10 @@ static int emit_fiemap_extent(struct fiemap_extent_info *fieinfo,
> > >        */
> > >       if (cache->offset + cache->len  == offset &&
> > >           cache->phys + cache->len == phys  &&
> > > -         (cache->flags & ~FIEMAP_EXTENT_LAST) ==
> > > -                     (flags & ~FIEMAP_EXTENT_LAST)) {
> > > +         cache->flags == flags) {
> > >               cache->len += len;
> > >               cache->flags |= flags;
> > > -             goto try_submit_last;
> > > +             return 0;
> > >       }
> > >
> > >       /* Not mergeable, need to submit cached one */
> > > @@ -5465,13 +5431,8 @@ static int emit_fiemap_extent(struct fiemap_extent_info *fieinfo,
> > >       cache->phys = phys;
> > >       cache->len = len;
> > >       cache->flags = flags;
> > > -try_submit_last:
> > > -     if (cache->flags & FIEMAP_EXTENT_LAST) {
> > > -             ret = fiemap_fill_next_extent(fieinfo, cache->offset,
> > > -                             cache->phys, cache->len, cache->flags);
> > > -             cache->cached = false;
> > > -     }
> > > -     return ret;
> > > +
> > > +     return 0;
> > >  }
> > >
> > >  /*
> > > @@ -5501,229 +5462,534 @@ static int emit_last_fiemap_cache(struct fiemap_extent_info *fieinfo,
> > >       return ret;
> > >  }
> > >
> > > -int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
> > > -               u64 start, u64 len)
> > > +static int fiemap_next_leaf_item(struct btrfs_inode *inode,
> > > +                              struct btrfs_path *path)
> > >  {
> > > -     int ret = 0;
> > > -     u64 off;
> > > -     u64 max = start + len;
> > > -     u32 flags = 0;
> > > -     u32 found_type;
> > > -     u64 last;
> > > -     u64 last_for_get_extent = 0;
> > > -     u64 disko = 0;
> > > -     u64 isize = i_size_read(&inode->vfs_inode);
> > > -     struct btrfs_key found_key;
> > > -     struct extent_map *em = NULL;
> > > -     struct extent_state *cached_state = NULL;
> > > -     struct btrfs_path *path;
> > > +     struct extent_buffer *clone;
> > > +     struct btrfs_key key;
> > > +     int slot;
> > > +     int ret;
> > > +
> > > +     path->slots[0]++;
> > > +     if (path->slots[0] < btrfs_header_nritems(path->nodes[0]))
> > > +             return 0;
> > > +
> > > +     ret = btrfs_next_leaf(inode->root, path);
> > > +     if (ret != 0)
> > > +             return ret;
> > > +
> > > +     /*
> > > +      * Don't bother with cloning if there are no more file extent items for
> > > +      * our inode.
> > > +      */
> > > +     btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
> > > +     if (key.objectid != btrfs_ino(inode) || key.type != BTRFS_EXTENT_DATA_KEY)
> > > +             return 1;
> > > +
> > > +     /* See the comment at fiemap_search_slot() about why we clone. */
> > > +     clone = btrfs_clone_extent_buffer(path->nodes[0]);
> > > +     if (!clone)
> > > +             return -ENOMEM;
> > > +
> > > +     slot = path->slots[0];
> > > +     btrfs_release_path(path);
> > > +     path->nodes[0] = clone;
> > > +     path->slots[0] = slot;
> > > +
> > > +     return 0;
> > > +}
> > > +
> > > +/*
> > > + * Search for the first file extent item that starts at a given file offset or
> > > + * the one that starts immediately before that offset.
> > > + * Returns: 0 on success, < 0 on error, 1 if not found.
> > > + */
> > > +static int fiemap_search_slot(struct btrfs_inode *inode, struct btrfs_path *path,
> > > +                           u64 file_offset)
> > > +{
> > > +     const u64 ino = btrfs_ino(inode);
> > >       struct btrfs_root *root = inode->root;
> > > -     struct fiemap_cache cache = { 0 };
> > > -     struct btrfs_backref_shared_cache *backref_cache;
> > > -     struct ulist *roots;
> > > -     struct ulist *tmp_ulist;
> > > -     int end = 0;
> > > -     u64 em_start = 0;
> > > -     u64 em_len = 0;
> > > -     u64 em_end = 0;
> > > +     struct extent_buffer *clone;
> > > +     struct btrfs_key key;
> > > +     int slot;
> > > +     int ret;
> > >
> > > -     backref_cache = kzalloc(sizeof(*backref_cache), GFP_KERNEL);
> > > -     path = btrfs_alloc_path();
> > > -     roots = ulist_alloc(GFP_KERNEL);
> > > -     tmp_ulist = ulist_alloc(GFP_KERNEL);
> > > -     if (!backref_cache || !path || !roots || !tmp_ulist) {
> > > -             ret = -ENOMEM;
> > > -             goto out_free_ulist;
> > > +     key.objectid = ino;
> > > +     key.type = BTRFS_EXTENT_DATA_KEY;
> > > +     key.offset = file_offset;
> > > +
> > > +     ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
> > > +     if (ret < 0)
> > > +             return ret;
> > > +
> > > +     if (ret > 0 && path->slots[0] > 0) {
> > > +             btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0] - 1);
> > > +             if (key.objectid == ino && key.type == BTRFS_EXTENT_DATA_KEY)
> > > +                     path->slots[0]--;
> > > +     }
> > > +
> > > +     if (path->slots[0] >= btrfs_header_nritems(path->nodes[0])) {
> > > +             ret = btrfs_next_leaf(root, path);
> > > +             if (ret != 0)
> > > +                     return ret;
> > > +
> > > +             btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
> > > +             if (key.objectid != ino || key.type != BTRFS_EXTENT_DATA_KEY)
> > > +                     return 1;
> > >       }
> > >
> > >       /*
> > > -      * We can't initialize that to 'start' as this could miss extents due
> > > -      * to extent item merging
> > > +      * We clone the leaf and use it during fiemap. This is because while
> > > +      * using the leaf we do expensive things like checking if an extent is
> > > +      * shared, which can take a long time. In order to prevent blocking
> > > +      * other tasks for too long, we use a clone of the leaf. We have locked
> > > +      * the file range in the inode's io tree, so we know none of our file
> > > +      * extent items can change. This way we avoid blocking other tasks that
> > > +      * want to insert items for other inodes in the same leaf or b+tree
> > > +      * rebalance operations (triggered for example when someone is trying
> > > +      * to push items into this leaf when trying to insert an item in a
> > > +      * neighbour leaf).
> > > +      * We also need the private clone because holding a read lock on an
> > > +      * extent buffer of the subvolume's b+tree will make lockdep unhappy
> > > +      * when we call fiemap_fill_next_extent(), because that may cause a page
> > > +      * fault when filling the user space buffer with fiemap data.
> > >        */
> > > -     off = 0;
> > > -     start = round_down(start, btrfs_inode_sectorsize(inode));
> > > -     len = round_up(max, btrfs_inode_sectorsize(inode)) - start;
> > > +     clone = btrfs_clone_extent_buffer(path->nodes[0]);
> > > +     if (!clone)
> > > +             return -ENOMEM;
> > > +
> > > +     slot = path->slots[0];
> > > +     btrfs_release_path(path);
> > > +     path->nodes[0] = clone;
> > > +     path->slots[0] = slot;
> > > +
> > > +     return 0;
> > > +}
> > > +
> > > +/*
> > > + * Process a range which is a hole or a prealloc extent in the inode's subvolume
> > > + * btree. If @disk_bytenr is 0, we are dealing with a hole, otherwise a prealloc
> > > + * extent. The end offset (@end) is inclusive.
> > > + */
> > > +static int fiemap_process_hole(struct btrfs_inode *inode,
> > > +                            struct fiemap_extent_info *fieinfo,
> > > +                            struct fiemap_cache *cache,
> > > +                            struct btrfs_backref_shared_cache *backref_cache,
> > > +                            u64 disk_bytenr, u64 extent_offset,
> > > +                            u64 extent_gen,
> > > +                            struct ulist *roots, struct ulist *tmp_ulist,
> > > +                            u64 start, u64 end)
> > > +{
> > > +     const u64 i_size = i_size_read(&inode->vfs_inode);
> > > +     const u64 ino = btrfs_ino(inode);
> > > +     u64 cur_offset = start;
> > > +     u64 last_delalloc_end = 0;
> > > +     u32 prealloc_flags = FIEMAP_EXTENT_UNWRITTEN;
> > > +     bool checked_extent_shared = false;
> > > +     int ret;
> > >
> > >       /*
> > > -      * lookup the last file extent.  We're not using i_size here
> > > -      * because there might be preallocation past i_size
> > > +      * There can be no delalloc past i_size, so don't waste time looking for
> > > +      * it beyond i_size.
> > >        */
> > > -     ret = btrfs_lookup_file_extent(NULL, root, path, btrfs_ino(inode), -1,
> > > -                                    0);
> > > -     if (ret < 0) {
> > > -             goto out_free_ulist;
> > > -     } else {
> > > -             WARN_ON(!ret);
> > > -             if (ret == 1)
> > > -                     ret = 0;
> > > -     }
> > > +     while (cur_offset < end && cur_offset < i_size) {
> > > +             u64 delalloc_start;
> > > +             u64 delalloc_end;
> > > +             u64 prealloc_start;
> > > +             u64 prealloc_len = 0;
> > > +             bool delalloc;
> > > +
> > > +             delalloc = btrfs_find_delalloc_in_range(inode, cur_offset, end,
> > > +                                                     &delalloc_start,
> > > +                                                     &delalloc_end);
> > > +             if (!delalloc)
> > > +                     break;
> > >
> > > -     path->slots[0]--;
> > > -     btrfs_item_key_to_cpu(path->nodes[0], &found_key, path->slots[0]);
> > > -     found_type = found_key.type;
> > > -
> > > -     /* No extents, but there might be delalloc bits */
> > > -     if (found_key.objectid != btrfs_ino(inode) ||
> > > -         found_type != BTRFS_EXTENT_DATA_KEY) {
> > > -             /* have to trust i_size as the end */
> > > -             last = (u64)-1;
> > > -             last_for_get_extent = isize;
> > > -     } else {
> > >               /*
> > > -              * remember the start of the last extent.  There are a
> > > -              * bunch of different factors that go into the length of the
> > > -              * extent, so its much less complex to remember where it started
> > > +              * If this is a prealloc extent we have to report every section
> > > +              * of it that has no delalloc.
> > >                */
> > > -             last = found_key.offset;
> > > -             last_for_get_extent = last + 1;
> > > +             if (disk_bytenr != 0) {
> > > +                     if (last_delalloc_end == 0) {
> > > +                             prealloc_start = start;
> > > +                             prealloc_len = delalloc_start - start;
> > > +                     } else {
> > > +                             prealloc_start = last_delalloc_end + 1;
> > > +                             prealloc_len = delalloc_start - prealloc_start;
> > > +                     }
> > > +             }
> > > +
> > > +             if (prealloc_len > 0) {
> > > +                     if (!checked_extent_shared && fieinfo->fi_extents_max) {
> > > +                             ret = btrfs_is_data_extent_shared(inode->root,
> > > +                                                       ino, disk_bytenr,
> > > +                                                       extent_gen, roots,
> > > +                                                       tmp_ulist,
> > > +                                                       backref_cache);
> > > +                             if (ret < 0)
> > > +                                     return ret;
> > > +                             else if (ret > 0)
> > > +                                     prealloc_flags |= FIEMAP_EXTENT_SHARED;
> > > +
> > > +                             checked_extent_shared = true;
> > > +                     }
> > > +                     ret = emit_fiemap_extent(fieinfo, cache, prealloc_start,
> > > +                                              disk_bytenr + extent_offset,
> > > +                                              prealloc_len, prealloc_flags);
> > > +                     if (ret)
> > > +                             return ret;
> > > +                     extent_offset += prealloc_len;
> > > +             }
> > > +
> > > +             ret = emit_fiemap_extent(fieinfo, cache, delalloc_start, 0,
> > > +                                      delalloc_end + 1 - delalloc_start,
> > > +                                      FIEMAP_EXTENT_DELALLOC |
> > > +                                      FIEMAP_EXTENT_UNKNOWN);
> > > +             if (ret)
> > > +                     return ret;
> > > +
> > > +             last_delalloc_end = delalloc_end;
> > > +             cur_offset = delalloc_end + 1;
> > > +             extent_offset += cur_offset - delalloc_start;
> > > +             cond_resched();
> > > +     }
> > > +
> > > +     /*
> > > +      * Either we found no delalloc for the whole prealloc extent or we have
> > > +      * a prealloc extent that spans i_size or starts at or after i_size.
> > > +      */
> > > +     if (disk_bytenr != 0 && last_delalloc_end < end) {
> > > +             u64 prealloc_start;
> > > +             u64 prealloc_len;
> > > +
> > > +             if (last_delalloc_end == 0) {
> > > +                     prealloc_start = start;
> > > +                     prealloc_len = end + 1 - start;
> > > +             } else {
> > > +                     prealloc_start = last_delalloc_end + 1;
> > > +                     prealloc_len = end + 1 - prealloc_start;
> > > +             }
> > > +
> > > +             if (!checked_extent_shared && fieinfo->fi_extents_max) {
> > > +                     ret = btrfs_is_data_extent_shared(inode->root,
> > > +                                                       ino, disk_bytenr,
> > > +                                                       extent_gen, roots,
> > > +                                                       tmp_ulist,
> > > +                                                       backref_cache);
> > > +                     if (ret < 0)
> > > +                             return ret;
> > > +                     else if (ret > 0)
> > > +                             prealloc_flags |= FIEMAP_EXTENT_SHARED;
> > > +             }
> > > +             ret = emit_fiemap_extent(fieinfo, cache, prealloc_start,
> > > +                                      disk_bytenr + extent_offset,
> > > +                                      prealloc_len, prealloc_flags);
> > > +             if (ret)
> > > +                     return ret;
> > > +     }
> > > +
> > > +     return 0;
> > > +}
> > > +
> > > +static int fiemap_find_last_extent_offset(struct btrfs_inode *inode,
> > > +                                       struct btrfs_path *path,
> > > +                                       u64 *last_extent_end_ret)
> > > +{
> > > +     const u64 ino = btrfs_ino(inode);
> > > +     struct btrfs_root *root = inode->root;
> > > +     struct extent_buffer *leaf;
> > > +     struct btrfs_file_extent_item *ei;
> > > +     struct btrfs_key key;
> > > +     u64 disk_bytenr;
> > > +     int ret;
> > > +
> > > +     /*
> > > +      * Lookup the last file extent. We're not using i_size here because
> > > +      * there might be preallocation past i_size.
> > > +      */
> > > +     ret = btrfs_lookup_file_extent(NULL, root, path, ino, (u64)-1, 0);
> > > +     /* There can't be a file extent item at offset (u64)-1 */
> > > +     ASSERT(ret != 0);
> > > +     if (ret < 0)
> > > +             return ret;
> > > +
> > > +     /*
> > > +      * For a non-existing key, btrfs_search_slot() always leaves us at a
> > > +      * slot > 0, except if the btree is empty, which is impossible because
> > > +      * at least it has the inode item for this inode and all the items for
> > > +      * the root inode 256.
> > > +      */
> > > +     ASSERT(path->slots[0] > 0);
> > > +     path->slots[0]--;
> > > +     leaf = path->nodes[0];
> > > +     btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
> > > +     if (key.objectid != ino || key.type != BTRFS_EXTENT_DATA_KEY) {
> > > +             /* No file extent items in the subvolume tree. */
> > > +             *last_extent_end_ret = 0;
> > > +             return 0;
> > >       }
> > > -     btrfs_release_path(path);
> > >
> > >       /*
> > > -      * we might have some extents allocated but more delalloc past those
> > > -      * extents.  so, we trust isize unless the start of the last extent is
> > > -      * beyond isize
> > > +      * For an inline extent, the disk_bytenr is where inline data starts at,
> > > +      * so first check if we have an inline extent item before checking if we
> > > +      * have an implicit hole (disk_bytenr == 0).
> > >        */
> > > -     if (last < isize) {
> > > -             last = (u64)-1;
> > > -             last_for_get_extent = isize;
> > > +     ei = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_file_extent_item);
> > > +     if (btrfs_file_extent_type(leaf, ei) == BTRFS_FILE_EXTENT_INLINE) {
> > > +             *last_extent_end_ret = btrfs_file_extent_end(path);
> > > +             return 0;
> > >       }
> > >
> > > -     lock_extent_bits(&inode->io_tree, start, start + len - 1,
> > > -                      &cached_state);
> > > +     /*
> > > +      * Find the last file extent item that is not a hole (when NO_HOLES is
> > > +      * not enabled). This should take at most 2 iterations in the worst
> > > +      * case: we have one hole file extent item at slot 0 of a leaf and
> > > +      * another hole file extent item as the last item in the previous leaf.
> > > +      * This is because we merge file extent items that represent holes.
> > > +      */
> > > +     disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, ei);
> > > +     while (disk_bytenr == 0) {
> > > +             ret = btrfs_previous_item(root, path, ino, BTRFS_EXTENT_DATA_KEY);
> > > +             if (ret < 0) {
> > > +                     return ret;
> > > +             } else if (ret > 0) {
> > > +                     /* No file extent items that are not holes. */
> > > +                     *last_extent_end_ret = 0;
> > > +                     return 0;
> > > +             }
> > > +             leaf = path->nodes[0];
> > > +             ei = btrfs_item_ptr(leaf, path->slots[0],
> > > +                                 struct btrfs_file_extent_item);
> > > +             disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, ei);
> > > +     }
> > >
> > > -     em = get_extent_skip_holes(inode, start, last_for_get_extent);
> > > -     if (!em)
> > > -             goto out;
> > > -     if (IS_ERR(em)) {
> > > -             ret = PTR_ERR(em);
> > > +     *last_extent_end_ret = btrfs_file_extent_end(path);
> > > +     return 0;
> > > +}
> > > +
> > > +int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
> > > +               u64 start, u64 len)
> > > +{
> > > +     const u64 ino = btrfs_ino(inode);
> > > +     struct extent_state *cached_state = NULL;
> > > +     struct btrfs_path *path;
> > > +     struct btrfs_root *root = inode->root;
> > > +     struct fiemap_cache cache = { 0 };
> > > +     struct btrfs_backref_shared_cache *backref_cache;
> > > +     struct ulist *roots;
> > > +     struct ulist *tmp_ulist;
> > > +     u64 last_extent_end;
> > > +     u64 prev_extent_end;
> > > +     u64 lockstart;
> > > +     u64 lockend;
> > > +     bool stopped = false;
> > > +     int ret;
> > > +
> > > +     backref_cache = kzalloc(sizeof(*backref_cache), GFP_KERNEL);
> > > +     path = btrfs_alloc_path();
> > > +     roots = ulist_alloc(GFP_KERNEL);
> > > +     tmp_ulist = ulist_alloc(GFP_KERNEL);
> > > +     if (!backref_cache || !path || !roots || !tmp_ulist) {
> > > +             ret = -ENOMEM;
> > >               goto out;
> > >       }
> > >
> > > -     while (!end) {
> > > -             u64 offset_in_extent = 0;
> > > +     lockstart = round_down(start, btrfs_inode_sectorsize(inode));
> > > +     lockend = round_up(start + len, btrfs_inode_sectorsize(inode));
> > > +     prev_extent_end = lockstart;
> > >
> > > -             /* break if the extent we found is outside the range */
> > > -             if (em->start >= max || extent_map_end(em) < off)
> > > -                     break;
> > > +     lock_extent_bits(&inode->io_tree, lockstart, lockend, &cached_state);
> > >
> > > -             /*
> > > -              * get_extent may return an extent that starts before our
> > > -              * requested range.  We have to make sure the ranges
> > > -              * we return to fiemap always move forward and don't
> > > -              * overlap, so adjust the offsets here
> > > -              */
> > > -             em_start = max(em->start, off);
> > > +     ret = fiemap_find_last_extent_offset(inode, path, &last_extent_end);
> > > +     if (ret < 0)
> > > +             goto out_unlock;
> > > +     btrfs_release_path(path);
> > >
> > > +     path->reada = READA_FORWARD;
> > > +     ret = fiemap_search_slot(inode, path, lockstart);
> > > +     if (ret < 0) {
> > > +             goto out_unlock;
> > > +     } else if (ret > 0) {
> > >               /*
> > > -              * record the offset from the start of the extent
> > > -              * for adjusting the disk offset below.  Only do this if the
> > > -              * extent isn't compressed since our in ram offset may be past
> > > -              * what we have actually allocated on disk.
> > > +              * No file extent item found, but we may have delalloc between
> > > +              * the current offset and i_size. So check for that.
> > >                */
> > > -             if (!test_bit(EXTENT_FLAG_COMPRESSED, &em->flags))
> > > -                     offset_in_extent = em_start - em->start;
> > > -             em_end = extent_map_end(em);
> > > -             em_len = em_end - em_start;
> > > -             flags = 0;
> > > -             if (em->block_start < EXTENT_MAP_LAST_BYTE)
> > > -                     disko = em->block_start + offset_in_extent;
> > > -             else
> > > -                     disko = 0;
> > > +             ret = 0;
> > > +             goto check_eof_delalloc;
> > > +     }
> > > +
> > > +     while (prev_extent_end < lockend) {
> > > +             struct extent_buffer *leaf = path->nodes[0];
> > > +             struct btrfs_file_extent_item *ei;
> > > +             struct btrfs_key key;
> > > +             u64 extent_end;
> > > +             u64 extent_len;
> > > +             u64 extent_offset = 0;
> > > +             u64 extent_gen;
> > > +             u64 disk_bytenr = 0;
> > > +             u64 flags = 0;
> > > +             int extent_type;
> > > +             u8 compression;
> > > +
> > > +             btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
> > > +             if (key.objectid != ino || key.type != BTRFS_EXTENT_DATA_KEY)
> > > +                     break;
> > > +
> > > +             extent_end = btrfs_file_extent_end(path);
> > >
> > >               /*
> > > -              * bump off for our next call to get_extent
> > > +              * The first iteration can leave us at an extent item that ends
> > > +              * before our range's start. Move to the next item.
> > >                */
> > > -             off = extent_map_end(em);
> > > -             if (off >= max)
> > > -                     end = 1;
> > > -
> > > -             if (em->block_start == EXTENT_MAP_INLINE) {
> > > -                     flags |= (FIEMAP_EXTENT_DATA_INLINE |
> > > -                               FIEMAP_EXTENT_NOT_ALIGNED);
> > > -             } else if (em->block_start == EXTENT_MAP_DELALLOC) {
> > > -                     flags |= (FIEMAP_EXTENT_DELALLOC |
> > > -                               FIEMAP_EXTENT_UNKNOWN);
> > > -             } else if (fieinfo->fi_extents_max) {
> > > -                     u64 extent_gen;
> > > -                     u64 bytenr = em->block_start -
> > > -                             (em->start - em->orig_start);
> > > +             if (extent_end <= lockstart)
> > > +                     goto next_item;
> > >
> > > -                     /*
> > > -                      * If two extent maps are merged, then their generation
> > > -                      * is set to the maximum between their generations.
> > > -                      * Otherwise its generation matches the one we have in
> > > -                      * corresponding file extent item. If we have a merged
> > > -                      * extent map, don't use its generation to speedup the
> > > -                      * sharedness check below.
> > > -                      */
> > > -                     if (test_bit(EXTENT_FLAG_MERGED, &em->flags))
> > > -                             extent_gen = 0;
> > > -                     else
> > > -                             extent_gen = em->generation;
> > > +             /* We have an implicit hole (NO_HOLES feature enabled). */
> > > +             if (prev_extent_end < key.offset) {
> > > +                     const u64 range_end = min(key.offset, lockend) - 1;
> > >
> > > -                     /*
> > > -                      * As btrfs supports shared space, this information
> > > -                      * can be exported to userspace tools via
> > > -                      * flag FIEMAP_EXTENT_SHARED.  If fi_extents_max == 0
> > > -                      * then we're just getting a count and we can skip the
> > > -                      * lookup stuff.
> > > -                      */
> > > -                     ret = btrfs_is_data_extent_shared(root, btrfs_ino(inode),
> > > -                                                       bytenr, extent_gen,
> > > -                                                       roots, tmp_ulist,
> > > -                                                       backref_cache);
> > > -                     if (ret < 0)
> > > -                             goto out_free;
> > > -                     if (ret)
> > > -                             flags |= FIEMAP_EXTENT_SHARED;
> > > -                     ret = 0;
> > > -             }
> > > -             if (test_bit(EXTENT_FLAG_COMPRESSED, &em->flags))
> > > -                     flags |= FIEMAP_EXTENT_ENCODED;
> > > -             if (test_bit(EXTENT_FLAG_PREALLOC, &em->flags))
> > > -                     flags |= FIEMAP_EXTENT_UNWRITTEN;
> > > +                     ret = fiemap_process_hole(inode, fieinfo, &cache,
> > > +                                               backref_cache, 0, 0, 0,
> > > +                                               roots, tmp_ulist,
> > > +                                               prev_extent_end, range_end);
> > > +                     if (ret < 0) {
> > > +                             goto out_unlock;
> > > +                     } else if (ret > 0) {
> > > +                             /* fiemap_fill_next_extent() told us to stop. */
> > > +                             stopped = true;
> > > +                             break;
> > > +                     }
> > >
> > > -             free_extent_map(em);
> > > -             em = NULL;
> > > -             if ((em_start >= last) || em_len == (u64)-1 ||
> > > -                (last == (u64)-1 && isize <= em_end)) {
> > > -                     flags |= FIEMAP_EXTENT_LAST;
> > > -                     end = 1;
> > > +                     /* We've reached the end of the fiemap range, stop. */
> > > +                     if (key.offset >= lockend) {
> > > +                             stopped = true;
> > > +                             break;
> > > +                     }
> > >               }
> > >
> > > -             /* now scan forward to see if this is really the last extent. */
> > > -             em = get_extent_skip_holes(inode, off, last_for_get_extent);
> > > -             if (IS_ERR(em)) {
> > > -                     ret = PTR_ERR(em);
> > > -                     goto out;
> > > +             extent_len = extent_end - key.offset;
> > > +             ei = btrfs_item_ptr(leaf, path->slots[0],
> > > +                                 struct btrfs_file_extent_item);
> > > +             compression = btrfs_file_extent_compression(leaf, ei);
> > > +             extent_type = btrfs_file_extent_type(leaf, ei);
> > > +             extent_gen = btrfs_file_extent_generation(leaf, ei);
> > > +
> > > +             if (extent_type != BTRFS_FILE_EXTENT_INLINE) {
> > > +                     disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, ei);
> > > +                     if (compression == BTRFS_COMPRESS_NONE)
> > > +                             extent_offset = btrfs_file_extent_offset(leaf, ei);
> > >               }
> > > -             if (!em) {
> > > -                     flags |= FIEMAP_EXTENT_LAST;
> > > -                     end = 1;
> > > +
> > > +             if (compression != BTRFS_COMPRESS_NONE)
> > > +                     flags |= FIEMAP_EXTENT_ENCODED;
> > > +
> > > +             if (extent_type == BTRFS_FILE_EXTENT_INLINE) {
> > > +                     flags |= FIEMAP_EXTENT_DATA_INLINE;
> > > +                     flags |= FIEMAP_EXTENT_NOT_ALIGNED;
> > > +                     ret = emit_fiemap_extent(fieinfo, &cache, key.offset, 0,
> > > +                                              extent_len, flags);
> > > +             } else if (extent_type == BTRFS_FILE_EXTENT_PREALLOC) {
> > > +                     ret = fiemap_process_hole(inode, fieinfo, &cache,
> > > +                                               backref_cache,
> > > +                                               disk_bytenr, extent_offset,
> > > +                                               extent_gen, roots, tmp_ulist,
> > > +                                               key.offset, extent_end - 1);
> > > +             } else if (disk_bytenr == 0) {
> > > +                     /* We have an explicit hole. */
> > > +                     ret = fiemap_process_hole(inode, fieinfo, &cache,
> > > +                                               backref_cache, 0, 0, 0,
> > > +                                               roots, tmp_ulist,
> > > +                                               key.offset, extent_end - 1);
> > > +             } else {
> > > +                     /* We have a regular extent. */
> > > +                     if (fieinfo->fi_extents_max) {
> > > +                             ret = btrfs_is_data_extent_shared(root, ino,
> > > +                                                               disk_bytenr,
> > > +                                                               extent_gen,
> > > +                                                               roots,
> > > +                                                               tmp_ulist,
> > > +                                                               backref_cache);
> > > +                             if (ret < 0)
> > > +                                     goto out_unlock;
> > > +                             else if (ret > 0)
> > > +                                     flags |= FIEMAP_EXTENT_SHARED;
> > > +                     }
> > > +
> > > +                     ret = emit_fiemap_extent(fieinfo, &cache, key.offset,
> > > +                                              disk_bytenr + extent_offset,
> > > +                                              extent_len, flags);
> > >               }
> > > -             ret = emit_fiemap_extent(fieinfo, &cache, em_start, disko,
> > > -                                        em_len, flags);
> > > -             if (ret) {
> > > -                     if (ret == 1)
> > > -                             ret = 0;
> > > -                     goto out_free;
> > > +
> > > +             if (ret < 0) {
> > > +                     goto out_unlock;
> > > +             } else if (ret > 0) {
> > > +                     /* fiemap_fill_next_extent() told us to stop. */
> > > +                     stopped = true;
> > > +                     break;
> > >               }
> > >
> > > +             prev_extent_end = extent_end;
> > > +next_item:
> > >               if (fatal_signal_pending(current)) {
> > >                       ret = -EINTR;
> > > -                     goto out_free;
> > > +                     goto out_unlock;
> > >               }
> > > +
> > > +             ret = fiemap_next_leaf_item(inode, path);
> > > +             if (ret < 0) {
> > > +                     goto out_unlock;
> > > +             } else if (ret > 0) {
> > > +                     /* No more file extent items for this inode. */
> > > +                     break;
> > > +             }
> > > +             cond_resched();
> > >       }
> > > -out_free:
> > > -     if (!ret)
> > > -             ret = emit_last_fiemap_cache(fieinfo, &cache);
> > > -     free_extent_map(em);
> > > -out:
> > > -     unlock_extent_cached(&inode->io_tree, start, start + len - 1,
> > > -                          &cached_state);
> > >
> > > -out_free_ulist:
> > > +check_eof_delalloc:
> > > +     /*
> > > +      * Release (and free) the path before emitting any final entries to
> > > +      * fiemap_fill_next_extent() to keep lockdep happy. This is because
> > > +      * once we find no more file extent items exist, we may have a
> > > +      * non-cloned leaf, and fiemap_fill_next_extent() can trigger page
> > > +      * faults when copying data to the user space buffer.
> > > +      */
> > > +     btrfs_free_path(path);
> > > +     path = NULL;
> > > +
> > > +     if (!stopped && prev_extent_end < lockend) {
> > > +             ret = fiemap_process_hole(inode, fieinfo, &cache, backref_cache,
> > > +                                       0, 0, 0, roots, tmp_ulist,
> > > +                                       prev_extent_end, lockend - 1);
> > > +             if (ret < 0)
> > > +                     goto out_unlock;
> > > +             prev_extent_end = lockend;
> > > +     }
> > > +
> > > +     if (cache.cached && cache.offset + cache.len >= last_extent_end) {
> > > +             const u64 i_size = i_size_read(&inode->vfs_inode);
> > > +
> > > +             if (prev_extent_end < i_size) {
> > > +                     u64 delalloc_start;
> > > +                     u64 delalloc_end;
> > > +                     bool delalloc;
> > > +
> > > +                     delalloc = btrfs_find_delalloc_in_range(inode,
> > > +                                                             prev_extent_end,
> > > +                                                             i_size - 1,
> > > +                                                             &delalloc_start,
> > > +                                                             &delalloc_end);
> > > +                     if (!delalloc)
> > > +                             cache.flags |= FIEMAP_EXTENT_LAST;
> > > +             } else {
> > > +                     cache.flags |= FIEMAP_EXTENT_LAST;
> > > +             }
> > > +     }
> > > +
> > > +     ret = emit_last_fiemap_cache(fieinfo, &cache);
> > > +
> > > +out_unlock:
> > > +     unlock_extent_cached(&inode->io_tree, lockstart, lockend, &cached_state);
> > > +out:
> > >       kfree(backref_cache);
> > >       btrfs_free_path(path);
> > >       ulist_free(roots);
> > > diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> > > index b292a8ada3a4..636b3ec46184 100644
> > > --- a/fs/btrfs/file.c
> > > +++ b/fs/btrfs/file.c
> > > @@ -3602,10 +3602,10 @@ static long btrfs_fallocate(struct file *file, int mode,
> > >  }
> > >
> > >  /*
> > > - * Helper for have_delalloc_in_range(). Find a subrange in a given range that
> > > - * has unflushed and/or flushing delalloc. There might be other adjacent
> > > - * subranges after the one it found, so have_delalloc_in_range() keeps looping
> > > - * while it gets adjacent subranges, and merging them together.
> > > + * Helper for btrfs_find_delalloc_in_range(). Find a subrange in a given range
> > > + * that has unflushed and/or flushing delalloc. There might be other adjacent
> > > + * subranges after the one it found, so btrfs_find_delalloc_in_range() keeps
> > > + * looping while it gets adjacent subranges, and merging them together.
> > >   */
> > >  static bool find_delalloc_subrange(struct btrfs_inode *inode, u64 start, u64 end,
> > >                                  u64 *delalloc_start_ret, u64 *delalloc_end_ret)
> > > @@ -3740,8 +3740,8 @@ static bool find_delalloc_subrange(struct btrfs_inode *inode, u64 start, u64 end
> > >   * if so it sets @delalloc_start_ret and @delalloc_end_ret with the start and
> > >   * end offsets of the subrange.
> > >   */
> > > -static bool have_delalloc_in_range(struct btrfs_inode *inode, u64 start, u64 end,
> > > -                                u64 *delalloc_start_ret, u64 *delalloc_end_ret)
> > > +bool btrfs_find_delalloc_in_range(struct btrfs_inode *inode, u64 start, u64 end,
> > > +                               u64 *delalloc_start_ret, u64 *delalloc_end_ret)
> > >  {
> > >       u64 cur_offset = round_down(start, inode->root->fs_info->sectorsize);
> > >       u64 prev_delalloc_end = 0;
> > > @@ -3804,8 +3804,8 @@ static bool find_desired_extent_in_hole(struct btrfs_inode *inode, int whence,
> > >       u64 delalloc_end;
> > >       bool delalloc;
> > >
> > > -     delalloc = have_delalloc_in_range(inode, start, end, &delalloc_start,
> > > -                                       &delalloc_end);
> > > +     delalloc = btrfs_find_delalloc_in_range(inode, start, end,
> > > +                                             &delalloc_start, &delalloc_end);
> > >       if (delalloc && whence == SEEK_DATA) {
> > >               *start_ret = delalloc_start;
> > >               return true;
> > > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> > > index 2c7d31990777..8be1e021513a 100644
> > > --- a/fs/btrfs/inode.c
> > > +++ b/fs/btrfs/inode.c
> > > @@ -7064,133 +7064,6 @@ struct extent_map *btrfs_get_extent(struct btrfs_inode *inode,
> > >       return em;
> > >  }
> > >
> > > -struct extent_map *btrfs_get_extent_fiemap(struct btrfs_inode *inode,
> > > -                                        u64 start, u64 len)
> > > -{
> > > -     struct extent_map *em;
> > > -     struct extent_map *hole_em = NULL;
> > > -     u64 delalloc_start = start;
> > > -     u64 end;
> > > -     u64 delalloc_len;
> > > -     u64 delalloc_end;
> > > -     int err = 0;
> > > -
> > > -     em = btrfs_get_extent(inode, NULL, 0, start, len);
> > > -     if (IS_ERR(em))
> > > -             return em;
> > > -     /*
> > > -      * If our em maps to:
> > > -      * - a hole or
> > > -      * - a pre-alloc extent,
> > > -      * there might actually be delalloc bytes behind it.
> > > -      */
> > > -     if (em->block_start != EXTENT_MAP_HOLE &&
> > > -         !test_bit(EXTENT_FLAG_PREALLOC, &em->flags))
> > > -             return em;
> > > -     else
> > > -             hole_em = em;
> > > -
> > > -     /* check to see if we've wrapped (len == -1 or similar) */
> > > -     end = start + len;
> > > -     if (end < start)
> > > -             end = (u64)-1;
> > > -     else
> > > -             end -= 1;
> > > -
> > > -     em = NULL;
> > > -
> > > -     /* ok, we didn't find anything, lets look for delalloc */
> > > -     delalloc_len = count_range_bits(&inode->io_tree, &delalloc_start,
> > > -                              end, len, EXTENT_DELALLOC, 1);
> > > -     delalloc_end = delalloc_start + delalloc_len;
> > > -     if (delalloc_end < delalloc_start)
> > > -             delalloc_end = (u64)-1;
> > > -
> > > -     /*
> > > -      * We didn't find anything useful, return the original results from
> > > -      * get_extent()
> > > -      */
> > > -     if (delalloc_start > end || delalloc_end <= start) {
> > > -             em = hole_em;
> > > -             hole_em = NULL;
> > > -             goto out;
> > > -     }
> > > -
> > > -     /*
> > > -      * Adjust the delalloc_start to make sure it doesn't go backwards from
> > > -      * the start they passed in
> > > -      */
> > > -     delalloc_start = max(start, delalloc_start);
> > > -     delalloc_len = delalloc_end - delalloc_start;
> > > -
> > > -     if (delalloc_len > 0) {
> > > -             u64 hole_start;
> > > -             u64 hole_len;
> > > -             const u64 hole_end = extent_map_end(hole_em);
> > > -
> > > -             em = alloc_extent_map();
> > > -             if (!em) {
> > > -                     err = -ENOMEM;
> > > -                     goto out;
> > > -             }
> > > -
> > > -             ASSERT(hole_em);
> > > -             /*
> > > -              * When btrfs_get_extent can't find anything it returns one
> > > -              * huge hole
> > > -              *
> > > -              * Make sure what it found really fits our range, and adjust to
> > > -              * make sure it is based on the start from the caller
> > > -              */
> > > -             if (hole_end <= start || hole_em->start > end) {
> > > -                    free_extent_map(hole_em);
> > > -                    hole_em = NULL;
> > > -             } else {
> > > -                    hole_start = max(hole_em->start, start);
> > > -                    hole_len = hole_end - hole_start;
> > > -             }
> > > -
> > > -             if (hole_em && delalloc_start > hole_start) {
> > > -                     /*
> > > -                      * Our hole starts before our delalloc, so we have to
> > > -                      * return just the parts of the hole that go until the
> > > -                      * delalloc starts
> > > -                      */
> > > -                     em->len = min(hole_len, delalloc_start - hole_start);
> > > -                     em->start = hole_start;
> > > -                     em->orig_start = hole_start;
> > > -                     /*
> > > -                      * Don't adjust block start at all, it is fixed at
> > > -                      * EXTENT_MAP_HOLE
> > > -                      */
> > > -                     em->block_start = hole_em->block_start;
> > > -                     em->block_len = hole_len;
> > > -                     if (test_bit(EXTENT_FLAG_PREALLOC, &hole_em->flags))
> > > -                             set_bit(EXTENT_FLAG_PREALLOC, &em->flags);
> > > -             } else {
> > > -                     /*
> > > -                      * Hole is out of passed range or it starts after
> > > -                      * delalloc range
> > > -                      */
> > > -                     em->start = delalloc_start;
> > > -                     em->len = delalloc_len;
> > > -                     em->orig_start = delalloc_start;
> > > -                     em->block_start = EXTENT_MAP_DELALLOC;
> > > -                     em->block_len = delalloc_len;
> > > -             }
> > > -     } else {
> > > -             return hole_em;
> > > -     }
> > > -out:
> > > -
> > > -     free_extent_map(hole_em);
> > > -     if (err) {
> > > -             free_extent_map(em);
> > > -             return ERR_PTR(err);
> > > -     }
> > > -     return em;
> > > -}
> > > -
> > >  static struct extent_map *btrfs_create_dio_extent(struct btrfs_inode *inode,
> > >                                                 const u64 start,
> > >                                                 const u64 len,
> > > @@ -8265,15 +8138,14 @@ static int btrfs_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
> > >        * in the compression of data (in an async thread) and will return
> > >        * before the compression is done and writeback is started. A second
> > >        * filemap_fdatawrite_range() is needed to wait for the compression to
> > > -      * complete and writeback to start. Without this, our user is very
> > > -      * likely to get stale results, because the extents and extent maps for
> > > -      * delalloc regions are only allocated when writeback starts.
> > > +      * complete and writeback to start. We also need to wait for ordered
> > > +      * extents to complete, because our fiemap implementation uses mainly
> > > +      * file extent items to list the extents, searching for extent maps
> > > +      * only for file ranges with holes or prealloc extents to figure out
> > > +      * if we have delalloc in those ranges.
> > >        */
> > >       if (fieinfo->fi_flags & FIEMAP_FLAG_SYNC) {
> > > -             ret = btrfs_fdatawrite_range(inode, 0, LLONG_MAX);
> > > -             if (ret)
> > > -                     return ret;
> > > -             ret = filemap_fdatawait_range(inode->i_mapping, 0, LLONG_MAX);
> > > +             ret = btrfs_wait_ordered_range(inode, 0, LLONG_MAX);
> > >               if (ret)
> > >                       return ret;
> > >       }
> >
> > Hmm this bit should be in "btrfs: properly flush delalloc when entering fiemap"
> > instead.  Thanks,
> 
> Nope, the change is done here for a good reason: before this change, we
> only needed to wait for writeback to complete (actually just to start and
> create the new extent maps), so that's why that other patch only waits
> for writeback to complete, just like the generic code.
> 
> After this change we need to wait for ordered extents to complete, since
> we use the file extent items to get extent information for fiemap -
> that's why that change is in this patch.
>

Got it, you can add

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef 

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 02/10] btrfs: make hole and data seeking a lot more efficient
  2022-09-01 15:00     ` Filipe Manana
@ 2022-09-02 13:26       ` Josef Bacik
  0 siblings, 0 replies; 53+ messages in thread
From: Josef Bacik @ 2022-09-02 13:26 UTC (permalink / raw)
  To: Filipe Manana; +Cc: linux-btrfs

On Thu, Sep 01, 2022 at 04:00:48PM +0100, Filipe Manana wrote:
> On Thu, Sep 01, 2022 at 10:03:03AM -0400, Josef Bacik wrote:
> > On Thu, Sep 01, 2022 at 02:18:22PM +0100, fdmanana@kernel.org wrote:
> > > From: Filipe Manana <fdmanana@suse.com>
> > > 
> > > The current implementation of hole and data seeking for llseek does not
> > > scale well in regards to the number of extents and the distance between
> > > the start offset and the next hole or extent. This is due to a very high
> > > algorithmic complexity. Often we also get reports of btrfs' hole and data
> > > seeking (llseek) being too slow, such as at 2017's LSFMM (see the Link
> > > tag at the bottom).
> > > 
> > > In order to better understand it, let's consider the case where the start
> > > offset is 0, we are seeking for a hole and the file size is 16G. Between
> > > file offset 0 and the first hole in the file there are 100K extents - this
> > > is common for large files, especially if we have compression enabled, since
> > > the maximum extent size is limited to 128K. The steps taken by the main
> > > loop of the current algorithm are the following:
> > > 
> > > 1) We start by calling btrfs_get_extent_fiemap(), for file offset 0, which
> > >    calls btrfs_get_extent(). This will first lookup for an extent map in
> > >    the inode's extent map tree (a red black tree). If the extent map is
> > >    not loaded in memory, then it will do a lookup for the corresponding
> > >    file extent item in the subvolume's b+tree, create an extent map based
> > >    on the contents of the file extent item and then add the extent map to
> > >    the extent map tree of the inode;
> > > 
> > > 2) The second iteration calls btrfs_get_extent_fiemap() again, this time
> > >    with a start offset matching the end offset of the previous extent.
> > >    Again, btrfs_get_extent() will first search the extent map tree, and
> > >    if it doesn't find an extent map there, it will again search in the
> > >    b+tree of the subvolume for a matching file extent item, build an
> > >    extent map based on the file extent item, and add the extent map to
> > >    extent map based on the file extent item, and add the extent map to
> > >    the extent map tree of the inode;
> > > 3) This repeats over and over until we find the first hole (when seeking
> > >    for holes) or until we find the first extent (when seeking for data).
> > > 
> > >    If there are no extent maps loaded in memory, then on
> > >    each iteration we do 1 extent map tree search, 1 b+tree search, plus
> > >    1 more extent map tree traversal to insert an extent map - plus we
> > >    allocate memory for the extent map.
> > > 
> > >    On each iteration we are growing the size of the extent map tree,
> > >    making each future search slower, and also visiting the same b+tree
> > >    leaves over and over again - taking into account that with the default leaf
> > >    size of 16K we can fit more than 200 file extent items in a leaf - so
> > >    we can visit the same b+tree leaf 200+ times, on each visit walking
> > >    down a path from the root to the leaf.
> > > 
> > > So it's easy to see that what we have now doesn't scale well. Also, it
> > > loads an extent map for every file extent item into memory, which is not
> > > efficient - we should add extent maps only when doing IO (writing or
> > > reading file data).
> > > 
> > > This change implements a new algorithm which scales much better, and
> > > works like this:
> > > 
> > > 1) We iterate over the subvolume's b+tree, visiting each leaf that has
> > >    file extent items once and only once;
> > > 
> > > 2) For any file extent items found, that don't represent holes or prealloc
> > >    extents, it will not search the extent map tree - there's no need at
> > >    all for that - an extent map is just an in-memory representation of a
> > >    file extent item;
> > > 
> > > 3) When a hole is found, or a prealloc extent, it will check if there's
> > >    delalloc for its range. For this it will search for EXTENT_DELALLOC
> > >    bits in the inode's io tree and check the extent map tree - this is
> > >    for accounting for unflushed delalloc and for flushed delalloc (the
> > >    period between running delalloc and ordered extent completion),
> > >    respectively. This is similar to what the current implementation does
> > >    when it finds a hole or prealloc extent, but without creating extent
> > >    maps and adding them to the extent map tree in case they are not
> > >    loaded in memory;
> > > 
> > > 4) It never allocates extent maps, or adds extent maps to the inode's
> > >    extent map tree. This not only saves memory and time (from the tree
> > >    insertions and allocations), but also eliminates the possibility of
> > >    -ENOMEM due to allocating too many extent maps.
> > > 
> > > Part of this new code will also be used later for fiemap (which also
> > > suffers similar scalability problems).
> > > 
> > > The following test example can be used to quickly measure the efficiency
> > > before and after this patch:
> > > 
> > >     $ cat test-seek-hole.sh
> > >     #!/bin/bash
> > > 
> > >     DEV=/dev/sdi
> > >     MNT=/mnt/sdi
> > > 
> > >     mkfs.btrfs -f $DEV
> > > 
> > >     mount -o compress=lzo $DEV $MNT
> > > 
> > >     # 16G file -> 131073 compressed extents.
> > >     xfs_io -f -c "pwrite -S 0xab -b 1M 0 16G" $MNT/foobar
> > > 
> > >     # Leave a 1M hole at file offset 15G.
> > >     xfs_io -c "fpunch 15G 1M" $MNT/foobar
> > > 
> > >     # Unmount and mount again, so that we can test when there's no
> > >     # metadata cached in memory.
> > >     umount $MNT
> > >     mount -o compress=lzo $DEV $MNT
> > > 
> > >     # Test seeking for hole from offset 0 (hole is at offset 15G).
> > > 
> > >     start=$(date +%s%N)
> > >     xfs_io -c "seek -h 0" $MNT/foobar
> > >     end=$(date +%s%N)
> > >     dur=$(( (end - start) / 1000000 ))
> > >     echo "Took $dur milliseconds to seek first hole (metadata not cached)"
> > >     echo
> > > 
> > >     start=$(date +%s%N)
> > >     xfs_io -c "seek -h 0" $MNT/foobar
> > >     end=$(date +%s%N)
> > >     dur=$(( (end - start) / 1000000 ))
> > >     echo "Took $dur milliseconds to seek first hole (metadata cached)"
> > >     echo
> > > 
> > >     umount $MNT
> > > 
> > > Before this change:
> > > 
> > >     $ ./test-seek-hole.sh
> > >     (...)
> > >     Whence	Result
> > >     HOLE	16106127360
> > >     Took 176 milliseconds to seek first hole (metadata not cached)
> > > 
> > >     Whence	Result
> > >     HOLE	16106127360
> > >     Took 17 milliseconds to seek first hole (metadata cached)
> > > 
> > > After this change:
> > > 
> > >     $ ./test-seek-hole.sh
> > >     (...)
> > >     Whence	Result
> > >     HOLE	16106127360
> > >     Took 43 milliseconds to seek first hole (metadata not cached)
> > > 
> > >     Whence	Result
> > >     HOLE	16106127360
> > >     Took 13 milliseconds to seek first hole (metadata cached)
> > > 
> > > That's about 4X faster when no metadata is cached and about 30% faster
> > > when all metadata is cached.
> > > 
> > > In practice the differences may often be significantly higher, either due
> > > to a higher number of extents in a file or because the subvolume's b+tree
> > > is much bigger than in this example, where we only have one file.
> > > 
> > > Link: https://lwn.net/Articles/718805/
> > > Signed-off-by: Filipe Manana <fdmanana@suse.com>
> > > ---
> > >  fs/btrfs/file.c | 437 ++++++++++++++++++++++++++++++++++++++++++++----
> > >  1 file changed, 406 insertions(+), 31 deletions(-)
> > > 
> > > diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> > > index 96f444ad0951..b292a8ada3a4 100644
> > > --- a/fs/btrfs/file.c
> > > +++ b/fs/btrfs/file.c
> > > @@ -3601,22 +3601,281 @@ static long btrfs_fallocate(struct file *file, int mode,
> > >  	return ret;
> > >  }
> > >  
> > > +/*
> > > + * Helper for have_delalloc_in_range(). Find a subrange in a given range that
> > > + * has unflushed and/or flushing delalloc. There might be other adjacent
> > > + * subranges after the one it found, so have_delalloc_in_range() keeps looping
> > > + * while it gets adjacent subranges, and merging them together.
> > > + */
> > > +static bool find_delalloc_subrange(struct btrfs_inode *inode, u64 start, u64 end,
> > > +				   u64 *delalloc_start_ret, u64 *delalloc_end_ret)
> > > +{
> > > +	const u64 len = end + 1 - start;
> > > +	struct extent_map_tree *em_tree = &inode->extent_tree;
> > > +	struct extent_map *em;
> > > +	u64 em_end;
> > > +	u64 delalloc_len;
> > > +
> > > +	/*
> > > +	 * Search the io tree first for EXTENT_DELALLOC. If we find any, it
> > > +	 * means we have delalloc (dirty pages) for which writeback has not
> > > +	 * started yet.
> > > +	 */
> > > +	*delalloc_start_ret = start;
> > > +	delalloc_len = count_range_bits(&inode->io_tree, delalloc_start_ret, end,
> > > +					len, EXTENT_DELALLOC, 1);
> > > +	/*
> > > +	 * If delalloc was found then *delalloc_start_ret has a sector size
> > > +	 * aligned value (rounded down).
> > > +	 */
> > > +	if (delalloc_len > 0)
> > > +		*delalloc_end_ret = *delalloc_start_ret + delalloc_len - 1;
> > > +
> > > +	/*
> > > +	 * Now also check if there's any extent map in the range that does not
> > > +	 * map to a hole or prealloc extent. We do this because:
> > > +	 *
> > > +	 * 1) When delalloc is flushed, the file range is locked, we clear the
> > > +	 *    EXTENT_DELALLOC bit from the io tree and create an extent map for
> > > +	 *    an allocated extent. So we might just have been called after
> > > +	 *    delalloc is flushed and before the ordered extent completes and
> > > +	 *    inserts the new file extent item in the subvolume's btree;
> > > +	 *
> > > +	 * 2) We may have an extent map created by flushing delalloc for a
> > > +	 *    subrange that starts before the subrange we found marked with
> > > +	 *    EXTENT_DELALLOC in the io tree.
> > > +	 */
> > > +	read_lock(&em_tree->lock);
> > > +	em = lookup_extent_mapping(em_tree, start, len);
> > > +	read_unlock(&em_tree->lock);
> > > +
> > > +	/* extent_map_end() returns a non-inclusive end offset. */
> > > +	em_end = em ? extent_map_end(em) : 0;
> > > +
> > > +	/*
> > > +	 * If we have a hole/prealloc extent map, check the next one if this one
> > > +	 * ends before our range's end.
> > > +	 */
> > > +	if (em && (em->block_start == EXTENT_MAP_HOLE ||
> > > +		   test_bit(EXTENT_FLAG_PREALLOC, &em->flags)) && em_end < end) {
> > > +		struct extent_map *next_em;
> > > +
> > > +		read_lock(&em_tree->lock);
> > > +		next_em = lookup_extent_mapping(em_tree, em_end, len - em_end);
> > > +		read_unlock(&em_tree->lock);
> > > +
> > > +		free_extent_map(em);
> > > +		em_end = next_em ? extent_map_end(next_em) : 0;
> > > +		em = next_em;
> > > +	}
> > > +
> > > +	if (em && (em->block_start == EXTENT_MAP_HOLE ||
> > > +		   test_bit(EXTENT_FLAG_PREALLOC, &em->flags))) {
> > > +		free_extent_map(em);
> > > +		em = NULL;
> > > +	}
> > > +
> > > +	/*
> > > +	 * No extent map or one for a hole or prealloc extent. Use the delalloc
> > > +	 * range we found in the io tree if we have one.
> > > +	 */
> > > +	if (!em)
> > > +		return (delalloc_len > 0);
> > > +
> > 
> > You can move this after the lookup, and then remove the if (em && ...) parts above.
> > Then all you need to do in the second if statement is return (delalloc_len > 0);
> 
> Nope, it won't work by doing just that.
> 
> Moving that if statement would require all the following changes, which
> to me don't seem to provide any benefit, aesthetically or otherwise:
> 

Ah yeah I see, in that case you can add

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 00/10] btrfs: make lseek and fiemap much more efficient
  2022-09-02 11:45     ` Filipe Manana
@ 2022-09-05 14:39       ` Filipe Manana
  0 siblings, 0 replies; 53+ messages in thread
From: Filipe Manana @ 2022-09-05 14:39 UTC (permalink / raw)
  To: Wang Yugui; +Cc: linux-btrfs

On Fri, Sep 2, 2022 at 12:45 PM Filipe Manana <fdmanana@kernel.org> wrote:
>
> On Fri, Sep 2, 2022 at 9:24 AM Filipe Manana <fdmanana@kernel.org> wrote:
> >
> > On Fri, Sep 2, 2022 at 2:09 AM Wang Yugui <wangyugui@e16-tech.com> wrote:
> > >
> > > Hi,
> > >
> > > > From: Filipe Manana <fdmanana@suse.com>
> > > >
> > > > We often get reports of fiemap and hole/data seeking (lseek) being too slow
> > > > on btrfs, or even unusable in some cases due to being extremely slow.
> > > >
> > > > Some recent reports for fiemap:
> > > >
> > > >     https://lore.kernel.org/linux-btrfs/21dd32c6-f1f9-f44a-466a-e18fdc6788a7@virtuozzo.com/
> > > >     https://lore.kernel.org/linux-btrfs/Ysace25wh5BbLd5f@atmark-techno.com/
> > > >
> > > > For lseek (LSF/MM from 2017):
> > > >
> > > >    https://lwn.net/Articles/718805/
> > > >
> > > > Basically both are slow due to very high algorithmic complexity which
> > > > scales badly with the number of extents in a file and the height of
> > > > subvolume and extent b+trees.
> > > >
> > > > Using Pavel's test case (first Link tag for fiemap), which uses files with
> > > > many 4K extents and holes before and after each extent (kind of a worst
> > > > case scenario), the speedup is of several orders of magnitude (for the 1G
> > > > file, from ~225 seconds down to ~0.1 seconds).
> > > >
> > > > Finally the new algorithm for fiemap also ends up solving a bug with the
> > > > current algorithm. This happens because we are currently relying on extent
> > > > maps to report extents, which can be merged, and this may cause us to
> > > > report 2 different extents as a single one that is not shared but one of
> > > > them is shared (or the other way around). More details on this on patches
> > > > 9/10 and 10/10.
> > > >
> > > > Patches 1/10 and 2/10 are for lseek, introducing some code that will later
> > > > be used by fiemap too (patch 10/10). More details in the changelogs.
> > > >
> > > > There are a few more things that can be done to speedup fiemap and lseek,
> > > > but I'll leave those other optimizations I have in mind for some other time.
> > > >
> > > > Filipe Manana (10):
> > > >   btrfs: allow hole and data seeking to be interruptible
> > > >   btrfs: make hole and data seeking a lot more efficient
> > > >   btrfs: remove check for impossible block start for an extent map at fiemap
> > > >   btrfs: remove zero length check when entering fiemap
> > > >   btrfs: properly flush delalloc when entering fiemap
> > > >   btrfs: allow fiemap to be interruptible
> > > >   btrfs: rename btrfs_check_shared() to a more descriptive name
> > > >   btrfs: speedup checking for extent sharedness during fiemap
> > > >   btrfs: skip unnecessary extent buffer sharedness checks during fiemap
> > > >   btrfs: make fiemap more efficient and accurate reporting extent sharedness
> > > >
> > > >  fs/btrfs/backref.c     | 153 ++++++++-
> > > >  fs/btrfs/backref.h     |  20 +-
> > > >  fs/btrfs/ctree.h       |  22 +-
> > > >  fs/btrfs/extent-tree.c |  10 +-
> > > >  fs/btrfs/extent_io.c   | 703 ++++++++++++++++++++++++++++-------------
> > > >  fs/btrfs/file.c        | 439 +++++++++++++++++++++++--
> > > >  fs/btrfs/inode.c       | 146 ++-------
> > > >  7 files changed, 1111 insertions(+), 382 deletions(-)
> > >
> > >
> > > An infinite loop happens when the 10 patches are applied to 6.0-rc3.
> >
> > Nope, it's not an infinite loop, and it happens as well before the patchset.
> > The reason is that the file created by the test is very sparse, with small
> > extents. It's full of 4K extents surrounded by 8K holes.
> >
> > So anyone doing hole seeking advances 8K on every lseek call.
> > If you strace the cp process, with
> >
> > strace -p <cp pid>
> >
> > You'll see something like this filling your terminal:
> >
> > (...)
> > lseek(3, 18808832, SEEK_SET)            = 18808832
> > write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
> > 4096) = 4096
> > read(3, "a\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
> > 4096) = 4096
> > write(4, "a\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
> > 4096) = 4096
> > lseek(3, 18817024, SEEK_SET)            = 18817024
> > write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
> > 4096) = 4096
> > read(3, "a\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
> > 4096) = 4096
> > write(4, "a\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
> > 4096) = 4096
> > lseek(3, 18825216, SEEK_SET)            = 18825216
> > write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
> > 4096) = 4096
> > read(3, "a\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
> > 4096) = 4096
> > write(4, "a\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
> > 4096) = 4096
> > lseek(3, 18833408, SEEK_SET)            = 18833408
> > write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
> > 4096) = 4096
> > read(3, "a\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
> > 4096) = 4096
> > write(4, "a\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
> > 4096) = 4096
> > lseek(3, 18841600, SEEK_SET)            = 18841600
> > write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
> > 4096) = 4096
> > read(3, "a\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
> > 4096) = 4096
> > write(4, "a\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
> > 4096) = 4096
> > lseek(3, 18849792, SEEK_SET)            = 18849792
> > write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
> > 4096) = 4096
> > read(3, "a\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
> > 4096) = 4096
> > write(4, "a\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
> > 4096) = 4096
> > lseek(3, 18857984, SEEK_SET)            = 18857984
> > (...)
> >
> > It takes a long time, but it finishes. If you notice, the difference
> > between each return value is exactly 8K.
> >
> > That happens both before and after the patchset.
>
> Btw, on a release (non-debug) kernel this is what I get before and
> after the patchset.
>
> Before patchset:
>
> root 12:05:51 /home/fdmanana/scripts/other_perf/fiemap > umount
> /dev/sdi ; mkfs.btrfs -f /dev/sdi ; mount /dev/sdi /mnt/sdi
> root 12:06:47 /home/fdmanana/scripts/other_perf/fiemap > ./pavels-test
> /mnt/sdi/foobar $((1 << 30)) && time cp /mnt/sdi/foobar /dev/null
> size: 1073741824
> actual size: 536870912
> fiemap: fm_mapped_extents = 131072
> time = 256243106 us
>
> real 5m50.026s
> user 0m0.232s
> sys 5m48.698s
>
>
> After patchset:
>
> root 12:32:44 /home/fdmanana/scripts/other_perf/fiemap > umount
> /dev/sdi ; mkfs.btrfs -f /dev/sdi ; mount /dev/sdi /mnt/sdi
> root 12:33:01 /home/fdmanana/scripts/other_perf/fiemap > ./pavels-test
> /mnt/sdi/foobar $((1 << 30)) && time cp /mnt/sdi/foobar /dev/null
> size: 1073741824
> actual size: 536870912
> fiemap: fm_mapped_extents = 129941
> time = 134062 us
>
> real 0m57.606s
> user 0m0.185s
> sys 0m57.375s
>
>
> Not as fast as ext4 yet, which takes ~1.5 seconds, but it's getting much better.
> What's causing cp to be slow are the multiple ranged fiemap calls it does.
> cp also does a lot of lseek calls to detect and skip holes, but the
> total time spent on it is almost insignificant when compared to
> fiemap.

For the cp case, part of the slowness could be mitigated by having cp
call fiemap with a larger buffer, as its buffer always holds only 72
extents:

(....)
ioctl(3, FS_IOC_FIEMAP, {fm_start=1055780864,
fm_length=18446744072653770751, fm_flags=FIEMAP_FLAG_SYNC,
fm_extent_count=72} => {fm_flags=FIEMAP_FLAG_SYNC,
fm_mapped_extents=72, ...}) = 0 <0.013190>
ioctl(3, FS_IOC_FIEMAP, {fm_start=1056370688,
fm_length=18446744072653180927, fm_flags=FIEMAP_FLAG_SYNC,
fm_extent_count=72} => {fm_flags=FIEMAP_FLAG_SYNC,
fm_mapped_extents=72, ...}) = 0 <0.012889>
ioctl(3, FS_IOC_FIEMAP, {fm_start=1056960512,
fm_length=18446744072652591103, fm_flags=FIEMAP_FLAG_SYNC,
fm_extent_count=72} => {fm_flags=FIEMAP_FLAG_SYNC,
fm_mapped_extents=72, ...}) = 0 <0.012541>
(....)

The filefrag utility, for example, uses a buffer for 292 extents.
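
Just to illustrate what a larger buffer means in practice, here's a rough
sketch (not coreutils or filefrag code; the 292-entry buffer size is only
borrowed from the filefrag comparison above) of walking all of a file's
extents with FS_IOC_FIEMAP:

/*
 * Illustrative only: walk all extents of a file with a 292-entry buffer.
 * A real tool would also check the calloc() result and handle errors
 * more carefully.
 */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

int main(int argc, char **argv)
{
    const unsigned int nr = 292;
    unsigned long long start = 0;
    struct fiemap *fm;
    int fd, done = 0;

    if (argc != 2)
        return 1;
    fd = open(argv[1], O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    fm = calloc(1, sizeof(*fm) + nr * sizeof(struct fiemap_extent));

    while (!done) {
        unsigned int i;

        fm->fm_start = start;
        fm->fm_length = FIEMAP_MAX_OFFSET - start;
        fm->fm_flags = FIEMAP_FLAG_SYNC;
        fm->fm_extent_count = nr;
        if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0) {
            perror("FS_IOC_FIEMAP");
            break;
        }
        if (fm->fm_mapped_extents == 0)
            break;
        for (i = 0; i < fm->fm_mapped_extents; i++) {
            const struct fiemap_extent *fe = &fm->fm_extents[i];

            printf("logical %llu len %llu flags 0x%x\n",
                   (unsigned long long)fe->fe_logical,
                   (unsigned long long)fe->fe_length,
                   fe->fe_flags);
            /* The next call resumes right after the last extent seen. */
            start = fe->fe_logical + fe->fe_length;
            if (fe->fe_flags & FIEMAP_EXTENT_LAST)
                done = 1;
        }
    }

    free(fm);
    close(fd);
    return 0;
}

With 292 extents per call, the ~130K extents of the 1G test file above
would need roughly 450 ioctl calls instead of roughly 1800 with a
72-entry buffer.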

But that doesn't matter much anymore, because last year the cp code
that uses fiemap was removed:

https://github.com/coreutils/coreutils/commit/26eccf6c98696c50f4416ba2967edc8676870716

So now it uses lseek's SEEK_HOLE to detect holes.

The lseek calls from the traces pasted before corresponded to lseek
SEEK_SET, used to set the current offset on the source file to the offset
of each extent returned by fiemap, therefore skipping holes.

That's a lot more efficient than using fiemap to detect and skip holes.
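
For reference, here's a minimal sketch of that SEEK_DATA/SEEK_HOLE
pattern (not the actual coreutils code; it only prints the data regions
and omits the copying and most error handling):

#define _GNU_SOURCE
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
    struct stat st;
    off_t data, hole = 0;
    int fd;

    if (argc != 2)
        return 1;
    fd = open(argv[1], O_RDONLY);
    if (fd < 0 || fstat(fd, &st) < 0) {
        perror(argv[1]);
        return 1;
    }

    while (hole < st.st_size) {
        /* Find the start of the next data region, skipping any hole. */
        data = lseek(fd, hole, SEEK_DATA);
        if (data < 0)
            break;    /* No more data, the rest of the file is a hole. */
        /* Find where that data region ends (EOF counts as a hole). */
        hole = lseek(fd, data, SEEK_HOLE);
        if (hole < 0)
            hole = st.st_size;
        printf("data region: [%lld, %lld)\n",
               (long long)data, (long long)hole);
        /* A copy tool would read/write the [data, hole) range here. */
    }

    close(fd);
    return 0;
}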

>
> I'll work on making fiemap more efficient, but that will come in some
> other separate patch or patches that will build upon this patchset.
> I like to make things more incremental and avoid having too many
> changes in a single kernel release for fiemap.
>
> >
> > Thanks.
> >
> >
> > >
> > > A file is created by 'pavels-test.c' of [PATCH 10/10], and then
> > > '/bin/cp /mnt/test/file1 /dev/null' will trigger an infinite loop.
> > >
> > > 'sysrq -l' output:
> > >
> > > [ 1437.765228] Call Trace:
> > > [ 1437.765228]  <TASK>
> > > [ 1437.765228]  set_extent_bit+0x33d/0x6e0 [btrfs]
> > > [ 1437.765228]  lock_extent_bits+0x64/0xa0 [btrfs]
> > > [ 1437.765228]  btrfs_file_llseek+0x192/0x5b0 [btrfs]
> > > [ 1437.765228]  ksys_lseek+0x64/0xb0
> > > [ 1437.765228]  do_syscall_64+0x58/0x80
> > > [ 1437.765228]  ? syscall_exit_to_user_mode+0x12/0x30
> > > [ 1437.765228]  ? do_syscall_64+0x67/0x80
> > > [ 1437.765228]  ? do_syscall_64+0x67/0x80
> > > [ 1437.765228]  ? exc_page_fault+0x64/0x140
> > > [ 1437.765228]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
> > > [ 1437.765228] RIP: 0033:0x7f5a263441bb
> > >
> > > Best Regards
> > > Wang Yugui (wangyugui@e16-tech.com)
> > > 2022/09/02
> > >
> > >

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 00/10] btrfs: make lseek and fiemap much more efficient
  2022-09-01 13:18 [PATCH 00/10] btrfs: make lseek and fiemap much more efficient fdmanana
                   ` (10 preceding siblings ...)
  2022-09-02  0:53 ` [PATCH 00/10] btrfs: make lseek and fiemap much more efficient Wang Yugui
@ 2022-09-06 16:20 ` David Sterba
  2022-09-06 17:13   ` Filipe Manana
  2022-09-07  9:12 ` Christoph Hellwig
  12 siblings, 1 reply; 53+ messages in thread
From: David Sterba @ 2022-09-06 16:20 UTC (permalink / raw)
  To: fdmanana; +Cc: linux-btrfs

On Thu, Sep 01, 2022 at 02:18:20PM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> We often get reports of fiemap and hole/data seeking (lseek) being too slow
> on btrfs, or even unusable in some cases due to being extremely slow.
> 
> Some recent reports for fiemap:
> 
>     https://lore.kernel.org/linux-btrfs/21dd32c6-f1f9-f44a-466a-e18fdc6788a7@virtuozzo.com/
>     https://lore.kernel.org/linux-btrfs/Ysace25wh5BbLd5f@atmark-techno.com/
> 
> For lseek (LSF/MM from 2017):
> 
>    https://lwn.net/Articles/718805/
> 
> Basically both are slow due to very high algorithmic complexity which
> scales badly with the number of extents in a file and the height of
> subvolume and extent b+trees.
> 
> Using Pavel's test case (first Link tag for fiemap), which uses files with
> many 4K extents and holes before and after each extent (kind of a worst
> case scenario), the speedup is of several orders of magnitude (for the 1G
> file, from ~225 seconds down to ~0.1 seconds).
> 
> Finally the new algorithm for fiemap also ends up solving a bug with the
> current algorithm. This happens because we are currently relying on extent
> maps to report extents, which can be merged, and this may cause us to
> report 2 different extents as a single one that is not shared but one of
> them is shared (or the other way around). More details on this on patches
> 9/10 and 10/10.
> 
> Patches 1/10 and 2/10 are for lseek, introducing some code that will later
> be used by fiemap too (patch 10/10). More details in the changelogs.

The speedup is unbelievable, thank you very much!

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 00/10] btrfs: make lseek and fiemap much more efficient
  2022-09-06 16:20 ` David Sterba
@ 2022-09-06 17:13   ` Filipe Manana
  0 siblings, 0 replies; 53+ messages in thread
From: Filipe Manana @ 2022-09-06 17:13 UTC (permalink / raw)
  To: David Sterba; +Cc: linux-btrfs

On Tue, Sep 06, 2022 at 06:20:06PM +0200, David Sterba wrote:
> On Thu, Sep 01, 2022 at 02:18:20PM +0100, fdmanana@kernel.org wrote:
> > From: Filipe Manana <fdmanana@suse.com>
> > 
> > We often get reports of fiemap and hole/data seeking (lseek) being too slow
> > on btrfs, or even unusable in some cases due to being extremely slow.
> > 
> > Some recent reports for fiemap:
> > 
> >     https://lore.kernel.org/linux-btrfs/21dd32c6-f1f9-f44a-466a-e18fdc6788a7@virtuozzo.com/
> >     https://lore.kernel.org/linux-btrfs/Ysace25wh5BbLd5f@atmark-techno.com/
> > 
> > For lseek (LSF/MM from 2017):
> > 
> >    https://lwn.net/Articles/718805/
> > 
> > Basically both are slow due to very high algorithmic complexity which
> > scales badly with the number of extents in a file and the height of
> > subvolume and extent b+trees.
> > 
> > Using Pavel's test case (first Link tag for fiemap), which uses files with
> > many 4K extents and holes before and after each extent (kind of a worst
> > case scenario), the speedup is of several orders of magnitude (for the 1G
> > file, from ~225 seconds down to ~0.1 seconds).
> > 
> > Finally the new algorithm for fiemap also ends up solving a bug with the
> > current algorithm. This happens because we are currently relying on extent
> > maps to report extents, which can be merged, and this may cause us to
> > report 2 different extents as a single one that is not shared but one of
> > them is shared (or the other way around). More details on this on patches
> > 9/10 and 10/10.
> > 
> > Patches 1/10 and 2/10 are for lseek, introducing some code that will later
> > be used by fiemap too (patch 10/10). More details in the changelogs.
> 
> The speedup is unbelievable, thank you very much!

I left something for cleanup in patch 10/10:

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 50bb2182e795..62a643020e10 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -5408,14 +5408,12 @@ static int emit_fiemap_extent(struct fiemap_extent_info *fieinfo,
         *    So truly compressed (physical size smaller than logical size)
         *    extents won't get merged with each other
         *
-        * 3) Share same flags except FIEMAP_EXTENT_LAST
-        *    So regular extent won't get merged with prealloc extent
+        * 3) Share same flags
         */
        if (cache->offset + cache->len  == offset &&
            cache->phys + cache->len == phys  &&
            cache->flags == flags) {
                cache->len += len;
-               cache->flags |= flags;
                return 0;
        }
 

Can you fold this up into 10/10, or do you want me to resend or send it separately?
Thanks.

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* Re: [PATCH 00/10] btrfs: make lseek and fiemap much more efficient
  2022-09-01 13:18 [PATCH 00/10] btrfs: make lseek and fiemap much more efficient fdmanana
                   ` (11 preceding siblings ...)
  2022-09-06 16:20 ` David Sterba
@ 2022-09-07  9:12 ` Christoph Hellwig
  2022-09-07  9:47   ` Filipe Manana
  12 siblings, 1 reply; 53+ messages in thread
From: Christoph Hellwig @ 2022-09-07  9:12 UTC (permalink / raw)
  To: fdmanana; +Cc: linux-btrfs

On a related question:  have you looked into using iomap for fiemap and
seek in btrfs?  This won't remove the need for fixing some of the
underlying algorithmic complexity, but it should allow shedding some
boilerplate code and reusing bits used by other disk-based file systems.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 00/10] btrfs: make lseek and fiemap much more efficient
  2022-09-07  9:12 ` Christoph Hellwig
@ 2022-09-07  9:47   ` Filipe Manana
  0 siblings, 0 replies; 53+ messages in thread
From: Filipe Manana @ 2022-09-07  9:47 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-btrfs

On Wed, Sep 7, 2022 at 10:13 AM Christoph Hellwig <hch@infradead.org> wrote:
>
> On a related question:  have you looked into using iomap for fiemap and
> seek in btrfs?  This won't remove the need for fixing some of the
> underlying algorithmic complexity, but it should allow shedding some
> boilerplate code and reusing bits used by other disk-based file systems.

Yes, I took a brief look.
But that would be something to do separately - right now I want to address
user complaints about fiemap/lseek being too slow (and a bug in fiemap
where the shared flag is missing or incorrect).

So yes, it's left for some time later, probably not before I work on the
next set of changes to further improve fiemap's performance.

One thing I noticed is that iomap always sets FIEMAP_EXTENT_LAST on the
last processed extent. On btrfs we only set the flag if it's really the
last extent in the file, so the results differ from iomap when we have a
ranged fiemap whose range ends before the last extent in the file. That
seems like a bug in iomap. It seems it can be worked around by returning
an artificial hole in the last iteration, but that seems odd.
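
If it helps to see it from user space, here's a minimal sketch (the file
path, the 1 MiB range and the 32-entry buffer are arbitrary illustrative
values) that does a ranged fiemap ending before EOF and checks whether the
last returned extent carries FIEMAP_EXTENT_LAST:

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

int main(int argc, char **argv)
{
    struct fiemap *fm;
    int fd;

    if (argc != 2)
        return 1;
    fd = open(argv[1], O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    fm = calloc(1, sizeof(*fm) + 32 * sizeof(struct fiemap_extent));
    fm->fm_start = 0;
    fm->fm_length = 1024 * 1024;    /* range ends before EOF on purpose */
    fm->fm_extent_count = 32;       /* a real caller loops if this fills up */

    if (ioctl(fd, FS_IOC_FIEMAP, fm) == 0 && fm->fm_mapped_extents > 0) {
        const struct fiemap_extent *last =
            &fm->fm_extents[fm->fm_mapped_extents - 1];

        printf("last extent in range has FIEMAP_EXTENT_LAST: %s\n",
               (last->fe_flags & FIEMAP_EXTENT_LAST) ? "yes" : "no");
    }

    free(fm);
    close(fd);
    return 0;
}

Assuming the behaviour described above, an iomap-based filesystem would
print "yes" here for a file with extents beyond the 1 MiB mark, while
btrfs prints "no".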

Looking again at iomap/fiemap.c, it doesn't seem like it would help reduce
code or make anything simpler; it does almost nothing except combine the
delalloc flags.

Thanks.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 02/10] btrfs: make hole and data seeking a lot more efficient
  2022-09-01 13:18 ` [PATCH 02/10] btrfs: make hole and data seeking a lot more efficient fdmanana
  2022-09-01 14:03   ` Josef Bacik
  2022-09-01 22:18   ` Qu Wenruo
@ 2022-09-11 22:12   ` Qu Wenruo
  2022-09-12  8:38     ` Filipe Manana
  2 siblings, 1 reply; 53+ messages in thread
From: Qu Wenruo @ 2022-09-11 22:12 UTC (permalink / raw)
  To: fdmanana, linux-btrfs



On 2022/9/1 21:18, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
>
> The current implementation of hole and data seeking for llseek does not
> scale well in regards to the number of extents and the distance between
> the start offset and the next hole or extent. This is due to a very high
> algorithmic complexity. Often we also get reports of btrfs' hole and data
> seeking (llseek) being too slow, such as at 2017's LSFMM (see the Link
> tag at the bottom).
>
> In order to better understand it, let's consider the case where the start
> offset is 0, we are seeking for a hole and the file size is 16G. Between
> file offset 0 and the first hole in the file there are 100K extents - this
> is common for large files, especially if we have compression enabled, since
> the maximum extent size is limited to 128K. The steps taken by the main
> loop of the current algorithm are the following:
>
> 1) We start by calling btrfs_get_extent_fiemap(), for file offset 0, which
>     calls btrfs_get_extent(). This will first lookup for an extent map in
>     the inode's extent map tree (a red black tree). If the extent map is
>     not loaded in memory, then it will do a lookup for the corresponding
>     file extent item in the subvolume's b+tree, create an extent map based
>     on the contents of the file extent item and then add the extent map to
>     the extent map tree of the inode;
>
> 2) The second iteration calls btrfs_get_extent_fiemap() again, this time
>     with a start offset matching the end offset of the previous extent.
>     Again, btrfs_get_extent() will first search the extent map tree, and
>     if it doesn't find an extent map there, it will again search in the
>     b+tree of the subvolume for a matching file extent item, build an
>     extent map based on the file extent item, and add the extent map to
>     the extent map tree of the inode;

One small question, unrelated to the whole series since the series mostly
no longer uses the extent map tree.

I'm wondering if we should add all nearby extent maps in a batch, rather
than one by one, in btrfs_get_extent()?
(E.g. if we find what we need and there are more file extent items in the
same leaf, we can just continue adding them all to the cache.)

However, I'm not 100% sure whether it's a good idea, as the extent maps
take up memory and there is no way to free them except on truncation or
eviction. And we have very limited usage for extent maps (other than
reads and the log tree?)

Thanks,
Qu

>
> 3) This repeats over and over until we find the first hole (when seeking
>     for holes) or until we find the first extent (when seeking for data).
>
>     If there are no extent maps loaded in memory, then on
>     each iteration we do 1 extent map tree search, 1 b+tree search, plus
>     1 more extent map tree traversal to insert an extent map - plus we
>     allocate memory for the extent map.
>
>     On each iteration we are growing the size of the extent map tree,
>     making each future search slower, and also visiting the same b+tree
>     leaves over and over again - taking into account that with the default leaf
>     size of 16K we can fit more than 200 file extent items in a leaf - so
>     we can visit the same b+tree leaf 200+ times, on each visit walking
>     down a path from the root to the leaf.
>
> So it's easy to see that what we have now doesn't scale well. Also, it
> loads an extent map for every file extent item into memory, which is not
> efficient - we should add extent maps only when doing IO (writing or
> reading file data).
>
> This change implements a new algorithm which scales much better, and
> works like this:
>
> 1) We iterate over the subvolume's b+tree, visiting each leaf that has
>     file extent items once and only once;
>
> 2) For any file extent items found, that don't represent holes or prealloc
>     extents, it will not search the extent map tree - there's no need at
>     all for that - an extent map is just an in-memory representation of a
>     file extent item;
>
> 3) When a hole is found, or a prealloc extent, it will check if there's
>     delalloc for its range. For this it will search for EXTENT_DELALLOC
>     bits in the inode's io tree and check the extent map tree - this is
>     for accounting for unflushed delalloc and for flushed delalloc (the
>     period between running delalloc and ordered extent completion),
>     respectively. This is similar to what the current implementation does
>     when it finds a hole or prealloc extent, but without creating extent
>     maps and adding them to the extent map tree in case they are not
>     loaded in memory;
>
> 4) It never allocates extent maps, or adds extent maps to the inode's
>     extent map tree. This not only saves memory and time (from the tree
>     insertions and allocations), but also eliminates the possibility of
>     -ENOMEM due to allocating too many extent maps.
>
> Part of this new code will also be used later for fiemap (which also
> suffers similar scalability problems).
>
> The following test example can be used to quickly measure the efficiency
> before and after this patch:
>
>      $ cat test-seek-hole.sh
>      #!/bin/bash
>
>      DEV=/dev/sdi
>      MNT=/mnt/sdi
>
>      mkfs.btrfs -f $DEV
>
>      mount -o compress=lzo $DEV $MNT
>
>      # 16G file -> 131073 compressed extents.
>      xfs_io -f -c "pwrite -S 0xab -b 1M 0 16G" $MNT/foobar
>
>      # Leave a 1M hole at file offset 15G.
>      xfs_io -c "fpunch 15G 1M" $MNT/foobar
>
>      # Unmount and mount again, so that we can test when there's no
>      # metadata cached in memory.
>      umount $MNT
>      mount -o compress=lzo $DEV $MNT
>
>      # Test seeking for hole from offset 0 (hole is at offset 15G).
>
>      start=$(date +%s%N)
>      xfs_io -c "seek -h 0" $MNT/foobar
>      end=$(date +%s%N)
>      dur=$(( (end - start) / 1000000 ))
>      echo "Took $dur milliseconds to seek first hole (metadata not cached)"
>      echo
>
>      start=$(date +%s%N)
>      xfs_io -c "seek -h 0" $MNT/foobar
>      end=$(date +%s%N)
>      dur=$(( (end - start) / 1000000 ))
>      echo "Took $dur milliseconds to seek first hole (metadata cached)"
>      echo
>
>      umount $MNT
>
> Before this change:
>
>      $ ./test-seek-hole.sh
>      (...)
>      Whence	Result
>      HOLE	16106127360
>      Took 176 milliseconds to seek first hole (metadata not cached)
>
>      Whence	Result
>      HOLE	16106127360
>      Took 17 milliseconds to seek first hole (metadata cached)
>
> After this change:
>
>      $ ./test-seek-hole.sh
>      (...)
>      Whence	Result
>      HOLE	16106127360
>      Took 43 milliseconds to seek first hole (metadata not cached)
>
>      Whence	Result
>      HOLE	16106127360
>      Took 13 milliseconds to seek first hole (metadata cached)
>
> That's about 4X faster when no metadata is cached and about 30% faster
> when all metadata is cached.
>
> In practice the differences may often be significantly higher, either due
> to a higher number of extents in a file or because the subvolume's b+tree
> is much bigger than in this example, where we only have one file.
>
> Link: https://lwn.net/Articles/718805/
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
>   fs/btrfs/file.c | 437 ++++++++++++++++++++++++++++++++++++++++++++----
>   1 file changed, 406 insertions(+), 31 deletions(-)
>
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 96f444ad0951..b292a8ada3a4 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -3601,22 +3601,281 @@ static long btrfs_fallocate(struct file *file, int mode,
>   	return ret;
>   }
>
> +/*
> + * Helper for have_delalloc_in_range(). Find a subrange in a given range that
> + * has unflushed and/or flushing delalloc. There might be other adjacent
> + * subranges after the one it found, so have_delalloc_in_range() keeps looping
> + * while it gets adjacent subranges, and merging them together.
> + */
> +static bool find_delalloc_subrange(struct btrfs_inode *inode, u64 start, u64 end,
> +				   u64 *delalloc_start_ret, u64 *delalloc_end_ret)
> +{
> +	const u64 len = end + 1 - start;
> +	struct extent_map_tree *em_tree = &inode->extent_tree;
> +	struct extent_map *em;
> +	u64 em_end;
> +	u64 delalloc_len;
> +
> +	/*
> +	 * Search the io tree first for EXTENT_DELALLOC. If we find any, it
> +	 * means we have delalloc (dirty pages) for which writeback has not
> +	 * started yet.
> +	 */
> +	*delalloc_start_ret = start;
> +	delalloc_len = count_range_bits(&inode->io_tree, delalloc_start_ret, end,
> +					len, EXTENT_DELALLOC, 1);
> +	/*
> +	 * If delalloc was found then *delalloc_start_ret has a sector size
> +	 * aligned value (rounded down).
> +	 */
> +	if (delalloc_len > 0)
> +		*delalloc_end_ret = *delalloc_start_ret + delalloc_len - 1;
> +
> +	/*
> +	 * Now also check if there's any extent map in the range that does not
> +	 * map to a hole or prealloc extent. We do this because:
> +	 *
> +	 * 1) When delalloc is flushed, the file range is locked, we clear the
> +	 *    EXTENT_DELALLOC bit from the io tree and create an extent map for
> +	 *    an allocated extent. So we might just have been called after
> +	 *    delalloc is flushed and before the ordered extent completes and
> +	 *    inserts the new file extent item in the subvolume's btree;
> +	 *
> +	 * 2) We may have an extent map created by flushing delalloc for a
> +	 *    subrange that starts before the subrange we found marked with
> +	 *    EXTENT_DELALLOC in the io tree.
> +	 */
> +	read_lock(&em_tree->lock);
> +	em = lookup_extent_mapping(em_tree, start, len);
> +	read_unlock(&em_tree->lock);
> +
> +	/* extent_map_end() returns a non-inclusive end offset. */
> +	em_end = em ? extent_map_end(em) : 0;
> +
> +	/*
> +	 * If we have a hole/prealloc extent map, check the next one if this one
> +	 * ends before our range's end.
> +	 */
> +	if (em && (em->block_start == EXTENT_MAP_HOLE ||
> +		   test_bit(EXTENT_FLAG_PREALLOC, &em->flags)) && em_end < end) {
> +		struct extent_map *next_em;
> +
> +		read_lock(&em_tree->lock);
> +		next_em = lookup_extent_mapping(em_tree, em_end, len - em_end);
> +		read_unlock(&em_tree->lock);
> +
> +		free_extent_map(em);
> +		em_end = next_em ? extent_map_end(next_em) : 0;
> +		em = next_em;
> +	}
> +
> +	if (em && (em->block_start == EXTENT_MAP_HOLE ||
> +		   test_bit(EXTENT_FLAG_PREALLOC, &em->flags))) {
> +		free_extent_map(em);
> +		em = NULL;
> +	}
> +
> +	/*
> +	 * No extent map or one for a hole or prealloc extent. Use the delalloc
> +	 * range we found in the io tree if we have one.
> +	 */
> +	if (!em)
> +		return (delalloc_len > 0);
> +
> +	/*
> +	 * We don't have any range as EXTENT_DELALLOC in the io tree, so the
> +	 * extent map is the only subrange representing delalloc.
> +	 */
> +	if (delalloc_len == 0) {
> +		*delalloc_start_ret = em->start;
> +		*delalloc_end_ret = min(end, em_end - 1);
> +		free_extent_map(em);
> +		return true;
> +	}
> +
> +	/*
> +	 * The extent map represents a delalloc range that starts before the
> +	 * delalloc range we found in the io tree.
> +	 */
> +	if (em->start < *delalloc_start_ret) {
> +		*delalloc_start_ret = em->start;
> +		/*
> +		 * If the ranges are adjacent, return a combined range.
> +		 * Otherwise return the extent map's range.
> +		 */
> +		if (em_end < *delalloc_start_ret)
> +			*delalloc_end_ret = min(end, em_end - 1);
> +
> +		free_extent_map(em);
> +		return true;
> +	}
> +
> +	/*
> +	 * The extent map starts after the delalloc range we found in the io
> +	 * tree. If it's adjacent, return a combined range, otherwise return
> +	 * the range found in the io tree.
> +	 */
> +	if (*delalloc_end_ret + 1 == em->start)
> +		*delalloc_end_ret = min(end, em_end - 1);
> +
> +	free_extent_map(em);
> +	return true;
> +}
> +
> +/*
> + * Check if there's delalloc in a given range.
> + *
> + * @inode:               The inode.
> + * @start:               The start offset of the range. It does not need to be
> + *                       sector size aligned.
> + * @end:                 The end offset (inclusive value) of the search range.
> + *                       It does not need to be sector size aligned.
> + * @delalloc_start_ret:  Output argument, set to the start offset of the
> + *                       subrange found with delalloc (may not be sector size
> + *                       aligned).
> + * @delalloc_end_ret:    Output argument, set to the end offset (inclusive value)
> + *                       of the subrange found with delalloc.
> + *
> + * Returns true if a subrange with delalloc is found within the given range, and
> + * if so it sets @delalloc_start_ret and @delalloc_end_ret with the start and
> + * end offsets of the subrange.
> + */
> +static bool have_delalloc_in_range(struct btrfs_inode *inode, u64 start, u64 end,
> +				   u64 *delalloc_start_ret, u64 *delalloc_end_ret)
> +{
> +	u64 cur_offset = round_down(start, inode->root->fs_info->sectorsize);
> +	u64 prev_delalloc_end = 0;
> +	bool ret = false;
> +
> +	while (cur_offset < end) {
> +		u64 delalloc_start;
> +		u64 delalloc_end;
> +		bool delalloc;
> +
> +		delalloc = find_delalloc_subrange(inode, cur_offset, end,
> +						  &delalloc_start,
> +						  &delalloc_end);
> +		if (!delalloc)
> +			break;
> +
> +		if (prev_delalloc_end == 0) {
> +			/* First subrange found. */
> +			*delalloc_start_ret = max(delalloc_start, start);
> +			*delalloc_end_ret = delalloc_end;
> +			ret = true;
> +		} else if (delalloc_start == prev_delalloc_end + 1) {
> +			/* Subrange adjacent to the previous one, merge them. */
> +			*delalloc_end_ret = delalloc_end;
> +		} else {
> +			/* Subrange not adjacent to the previous one, exit. */
> +			break;
> +		}
> +
> +		prev_delalloc_end = delalloc_end;
> +		cur_offset = delalloc_end + 1;
> +		cond_resched();
> +	}
> +
> +	return ret;
> +}
> +
> +/*
> + * Check if there's a hole or delalloc range in a range representing a hole (or
> + * prealloc extent) found in the inode's subvolume btree.
> + *
> + * @inode:      The inode.
> + * @whence:     Seek mode (SEEK_DATA or SEEK_HOLE).
> + * @start:      Start offset of the hole region. It does not need to be sector
> + *              size aligned.
> + * @end:        End offset (inclusive value) of the hole region. It does not
> + *              need to be sector size aligned.
> + * @start_ret:  Return parameter, used to set the start of the subrange in the
> + *              hole that matches the search criteria (seek mode), if such
> + *              subrange is found (return value of the function is true).
> + *              The value returned here may not be sector size aligned.
> + *
> + * Returns true if a subrange matching the given seek mode is found, and if one
> + * is found, it updates @start_ret with the start of the subrange.
> + */
> +static bool find_desired_extent_in_hole(struct btrfs_inode *inode, int whence,
> +					u64 start, u64 end, u64 *start_ret)
> +{
> +	u64 delalloc_start;
> +	u64 delalloc_end;
> +	bool delalloc;
> +
> +	delalloc = have_delalloc_in_range(inode, start, end, &delalloc_start,
> +					  &delalloc_end);
> +	if (delalloc && whence == SEEK_DATA) {
> +		*start_ret = delalloc_start;
> +		return true;
> +	}
> +
> +	if (delalloc && whence == SEEK_HOLE) {
> +		/*
> +		 * We found delalloc but it starts after our start offset. So we
> +		 * have a hole between our start offset and the delalloc start.
> +		 */
> +		if (start < delalloc_start) {
> +			*start_ret = start;
> +			return true;
> +		}
> +		/*
> +		 * Delalloc range starts at our start offset.
> +		 * If the delalloc range's length is smaller than our range,
> +		 * then it means we have a hole that starts where the delalloc
> +		 * subrange ends.
> +		 */
> +		if (delalloc_end < end) {
> +			*start_ret = delalloc_end + 1;
> +			return true;
> +		}
> +
> +		/* There's delalloc for the whole range. */
> +		return false;
> +	}
> +
> +	if (!delalloc && whence == SEEK_HOLE) {
> +		*start_ret = start;
> +		return true;
> +	}
> +
> +	/*
> +	 * No delalloc in the range and we are seeking for data. The caller has
> +	 * to iterate to the next extent item in the subvolume btree.
> +	 */
> +	return false;
> +}
> +
>   static loff_t find_desired_extent(struct btrfs_inode *inode, loff_t offset,
>   				  int whence)
>   {
>   	struct btrfs_fs_info *fs_info = inode->root->fs_info;
> -	struct extent_map *em = NULL;
>   	struct extent_state *cached_state = NULL;
> -	loff_t i_size = inode->vfs_inode.i_size;
> +	const loff_t i_size = i_size_read(&inode->vfs_inode);
> +	const u64 ino = btrfs_ino(inode);
> +	struct btrfs_root *root = inode->root;
> +	struct btrfs_path *path;
> +	struct btrfs_key key;
> +	u64 last_extent_end;
>   	u64 lockstart;
>   	u64 lockend;
>   	u64 start;
> -	u64 len;
> -	int ret = 0;
> +	int ret;
> +	bool found = false;
>
>   	if (i_size == 0 || offset >= i_size)
>   		return -ENXIO;
>
> +	/*
> +	 * Quick path. If the inode has no prealloc extents and its number of
> +	 * bytes used matches its i_size, then it can not have holes.
> +	 */
> +	if (whence == SEEK_HOLE &&
> +	    !(inode->flags & BTRFS_INODE_PREALLOC) &&
> +	    inode_get_bytes(&inode->vfs_inode) == i_size)
> +		return i_size;
> +
>   	/*
>   	 * offset can be negative, in this case we start finding DATA/HOLE from
>   	 * the very start of the file.
> @@ -3628,49 +3887,165 @@ static loff_t find_desired_extent(struct btrfs_inode *inode, loff_t offset,
>   	if (lockend <= lockstart)
>   		lockend = lockstart + fs_info->sectorsize;
>   	lockend--;
> -	len = lockend - lockstart + 1;
> +
> +	path = btrfs_alloc_path();
> +	if (!path)
> +		return -ENOMEM;
> +	path->reada = READA_FORWARD;
> +
> +	key.objectid = ino;
> +	key.type = BTRFS_EXTENT_DATA_KEY;
> +	key.offset = start;
> +
> +	last_extent_end = lockstart;
>
>   	lock_extent_bits(&inode->io_tree, lockstart, lockend, &cached_state);
>
> +	ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
> +	if (ret < 0) {
> +		goto out;
> +	} else if (ret > 0 && path->slots[0] > 0) {
> +		btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0] - 1);
> +		if (key.objectid == ino && key.type == BTRFS_EXTENT_DATA_KEY)
> +			path->slots[0]--;
> +	}
> +
>   	while (start < i_size) {
> -		em = btrfs_get_extent_fiemap(inode, start, len);
> -		if (IS_ERR(em)) {
> -			ret = PTR_ERR(em);
> -			em = NULL;
> -			break;
> +		struct extent_buffer *leaf = path->nodes[0];
> +		struct btrfs_file_extent_item *extent;
> +		u64 extent_end;
> +
> +		if (path->slots[0] >= btrfs_header_nritems(leaf)) {
> +			ret = btrfs_next_leaf(root, path);
> +			if (ret < 0)
> +				goto out;
> +			else if (ret > 0)
> +				break;
> +
> +			leaf = path->nodes[0];
>   		}
>
> -		if (whence == SEEK_HOLE &&
> -		    (em->block_start == EXTENT_MAP_HOLE ||
> -		     test_bit(EXTENT_FLAG_PREALLOC, &em->flags)))
> -			break;
> -		else if (whence == SEEK_DATA &&
> -			   (em->block_start != EXTENT_MAP_HOLE &&
> -			    !test_bit(EXTENT_FLAG_PREALLOC, &em->flags)))
> +		btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
> +		if (key.objectid != ino || key.type != BTRFS_EXTENT_DATA_KEY)
>   			break;
>
> -		start = em->start + em->len;
> -		free_extent_map(em);
> -		em = NULL;
> +		extent_end = btrfs_file_extent_end(path);
> +
> +		/*
> +		 * In the first iteration we may have a slot that points to an
> +		 * extent that ends before our start offset, so skip it.
> +		 */
> +		if (extent_end <= start) {
> +			path->slots[0]++;
> +			continue;
> +		}
> +
> +		/* We have an implicit hole, NO_HOLES feature is likely set. */
> +		if (last_extent_end < key.offset) {
> +			u64 search_start = last_extent_end;
> +			u64 found_start;
> +
> +			/*
> +			 * First iteration, @start matches @offset and it's
> +			 * within the hole.
> +			 */
> +			if (start == offset)
> +				search_start = offset;
> +
> +			found = find_desired_extent_in_hole(inode, whence,
> +							    search_start,
> +							    key.offset - 1,
> +							    &found_start);
> +			if (found) {
> +				start = found_start;
> +				break;
> +			}
> +			/*
> +			 * Didn't find data or a hole (due to delalloc) in the
> +			 * implicit hole range, so need to analyze the extent.
> +			 */
> +		}
> +
> +		extent = btrfs_item_ptr(leaf, path->slots[0],
> +					struct btrfs_file_extent_item);
> +
> +		if (btrfs_file_extent_disk_bytenr(leaf, extent) == 0 ||
> +		    btrfs_file_extent_type(leaf, extent) ==
> +		    BTRFS_FILE_EXTENT_PREALLOC) {
> +			/*
> +			 * Explicit hole or prealloc extent, search for delalloc.
> +			 * A prealloc extent is treated like a hole.
> +			 */
> +			u64 search_start = key.offset;
> +			u64 found_start;
> +
> +			/*
> +			 * First iteration, @start matches @offset and it's
> +			 * within the hole.
> +			 */
> +			if (start == offset)
> +				search_start = offset;
> +
> +			found = find_desired_extent_in_hole(inode, whence,
> +							    search_start,
> +							    extent_end - 1,
> +							    &found_start);
> +			if (found) {
> +				start = found_start;
> +				break;
> +			}
> +			/*
> +			 * Didn't find data or a hole (due to delalloc) in the
> +			 * hole or prealloc range, so need to analyze the next
> +			 * extent item.
> +			 */
> +		} else {
> +			/*
> +			 * Found a regular or inline extent.
> +			 * If we are seeking for data, adjust the start offset
> +			 * and stop, we're done.
> +			 */
> +			if (whence == SEEK_DATA) {
> +				start = max_t(u64, key.offset, offset);
> +				found = true;
> +				break;
> +			}
> +			/*
> +			 * Else, we are seeking for a hole, check the next file
> +			 * extent item.
> +			 */
> +		}
> +
> +		start = extent_end;
> +		last_extent_end = extent_end;
> +		path->slots[0]++;
>   		if (fatal_signal_pending(current)) {
>   			ret = -EINTR;
> -			break;
> +			goto out;
>   		}
>   		cond_resched();
>   	}
> -	free_extent_map(em);
> +
> +	/* We have an implicit hole from the last extent found up to i_size. */
> +	if (!found && start < i_size) {
> +		found = find_desired_extent_in_hole(inode, whence, start,
> +						    i_size - 1, &start);
> +		if (!found)
> +			start = i_size;
> +	}
> +
> +out:
>   	unlock_extent_cached(&inode->io_tree, lockstart, lockend,
>   			     &cached_state);
> -	if (ret) {
> -		offset = ret;
> -	} else {
> -		if (whence == SEEK_DATA && start >= i_size)
> -			offset = -ENXIO;
> -		else
> -			offset = min_t(loff_t, start, i_size);
> -	}
> +	btrfs_free_path(path);
> +
> +	if (ret < 0)
> +		return ret;
> +
> +	if (whence == SEEK_DATA && start >= i_size)
> +		return -ENXIO;
>
> -	return offset;
> +	return min_t(loff_t, start, i_size);
>   }
>
>   static loff_t btrfs_file_llseek(struct file *file, loff_t offset, int whence)

^ permalink raw reply	[flat|nested] 53+ messages in thread
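
As an aside for readers of the archive: find_desired_extent() above backs the
SEEK_DATA/SEEK_HOLE cases of btrfs_file_llseek(), so the xfs_io "seek" commands
used in the test scripts in this thread boil down to plain lseek(2) calls. A
minimal userspace sketch of that interface (illustrative only, not part of the
series; the file path is just an example):

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	const char *path = argc > 1 ? argv[1] : "/mnt/sdi/foobar";
	int fd = open(path, O_RDONLY);
	off_t data, hole;

	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* First data region at or after offset 0 (ENXIO if there is none). */
	data = lseek(fd, 0, SEEK_DATA);
	if (data >= 0)
		printf("first data at %lld\n", (long long)data);
	else if (errno == ENXIO)
		printf("no data from offset 0\n");

	/* First hole at or after offset 0 (i_size if the file has no holes). */
	hole = lseek(fd, 0, SEEK_HOLE);
	if (hole >= 0)
		printf("first hole at %lld\n", (long long)hole);

	close(fd);
	return 0;
}

xfs_io's "seek -d 0" and "seek -h 0" report the same offsets; that is what the
timings quoted in this thread measure.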

* Re: [PATCH 02/10] btrfs: make hole and data seeking a lot more efficient
  2022-09-11 22:12   ` Qu Wenruo
@ 2022-09-12  8:38     ` Filipe Manana
  0 siblings, 0 replies; 53+ messages in thread
From: Filipe Manana @ 2022-09-12  8:38 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Sun, Sep 11, 2022 at 11:12 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
>
> On 2022/9/1 21:18, fdmanana@kernel.org wrote:
> > From: Filipe Manana <fdmanana@suse.com>
> >
> > The current implementation of hole and data seeking for llseek does not
> > scale well in regards to the number of extents and the distance between
> > the start offset and the next hole or extent. This is due to a very high
> > algorithmic complexity. Often we also get reports of btrfs' hole and data
> > seeking (llseek) being too slow, such as at 2017's LSFMM (see the Link
> > tag at the bottom).
> >
> > In order to better understand it, let's consider the case where the start
> > offset is 0, we are seeking for a hole and the file size is 16G. Between
> > file offset 0 and the first hole in the file there are 100K extents - this
> > is common for large files, especially if we have compression enabled, since
> > the maximum extent size is limited to 128K. The steps taken by the main
> > loop of the current algorithm are the following:
> >
> > 1) We start by calling btrfs_get_extent_fiemap(), for file offset 0, which
> >     calls btrfs_get_extent(). This will first look up an extent map in
> >     the inode's extent map tree (a red black tree). If the extent map is
> >     not loaded in memory, then it will do a lookup for the corresponding
> >     file extent item in the subvolume's b+tree, create an extent map based
> >     on the contents of the file extent item and then add the extent map to
> >     the extent map tree of the inode;
> >
> > 2) The second iteration calls btrfs_get_extent_fiemap() again, this time
> >     with a start offset matching the end offset of the previous extent.
> >     Again, btrfs_get_extent() will first search the extent map tree, and
> >     if it doesn't find an extent map there, it will again search in the
> >     b+tree of the subvolume for a matching file extent item, build an
> >     extent map based on the file extent item, and add the extent map to
> >     the extent map tree of the inode;
>
> One small question, unrelated to the whole series since the series
> mostly no longer utilizes the extent map tree.
>
> I'm wondering if we should add all nearby extent maps in a batch, rather
> than one by one, in btrfs_get_extent()?
> (E.g. if we find what we need and there are more file extent items in the
> same leaf, we could just continue adding them all into the cache.)
>
> However I'm not 100% sure if it's a good idea, as extent maps take up
> memory and there is no way to free them except through truncation or
> eviction. And we have very limited usage for extent maps (other than
> reads and the log tree?)

This seems like basically the same thing you asked before at:

https://lore.kernel.org/linux-btrfs/cover.1662022922.git.fdmanana@suse.com/T/#md0595933858b548804fb329b4ae216ae6d76fa9a

My answer is the same I gave you back then.

I don't see anywhere in all of btrfs where we would benefit
from bulk loading extent maps.
If something needs to load and use a large number of extent maps in a
short period, then it's better off checking and using
file extent items directly, rather than creating and using extent maps.
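
To make that concrete, here is a rough sketch of what iterating file extent
items directly looks like, using the same helpers as the patch quoted below
(locking, readahead and error handling are trimmed; this is just an
illustration, not code from the series):

static int walk_file_extent_items(struct btrfs_inode *inode, u64 start)
{
	struct btrfs_root *root = inode->root;
	const u64 ino = btrfs_ino(inode);
	struct btrfs_path *path;
	struct btrfs_key key;
	int ret;

	path = btrfs_alloc_path();
	if (!path)
		return -ENOMEM;

	key.objectid = ino;
	key.type = BTRFS_EXTENT_DATA_KEY;
	key.offset = start;

	ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
	if (ret < 0)
		goto out;

	while (1) {
		struct extent_buffer *leaf = path->nodes[0];
		struct btrfs_file_extent_item *fi;

		/* Move to the next leaf once the current one is exhausted. */
		if (path->slots[0] >= btrfs_header_nritems(leaf)) {
			ret = btrfs_next_leaf(root, path);
			if (ret < 0)
				goto out;
			if (ret > 0)
				break;
			leaf = path->nodes[0];
		}

		btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
		if (key.objectid != ino || key.type != BTRFS_EXTENT_DATA_KEY)
			break;

		fi = btrfs_item_ptr(leaf, path->slots[0],
				    struct btrfs_file_extent_item);
		/*
		 * Everything needed is right here in the leaf: the item's key
		 * offset, btrfs_file_extent_end(path), the extent type, etc.
		 * No extent map is allocated or inserted anywhere.
		 */
		(void)fi;

		path->slots[0]++;
	}
	ret = 0;
out:
	btrfs_free_path(path);
	return ret;
}

The point is that each leaf is visited once and everything the caller needs is
already in it, so caching the items as extent maps buys nothing for a one-off
scan like lseek or fiemap.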

If you think there's some place that could benefit from it, then try
it and measure it.
As this is all unrelated to lseek, fiemap and this patchset, I'd
suggest moving this discussion elsewhere.

Thanks.

>
> Thanks,
> Qu
>
> >
> > 3) This repeats over and over until we find the first hole (when seeking
> >     for holes) or until we find the first extent (when seeking for data).
> >
> >     If there are no extent maps loaded in memory, then on
> >     each iteration we do 1 extent map tree search, 1 b+tree search, plus
> >     1 more extent map tree traversal to insert an extent map - plus we
> >     allocate memory for the extent map.
> >
> >     On each iteration we are growing the size of the extent map tree,
> >     making each future search slower, and also visiting the same b+tree
> >     leaves over and over again - taking into account that with the default leaf
> >     size of 16K we can fit more than 200 file extent items in a leaf - so
> >     we can visit the same b+tree leaf 200+ times, on each visit walking
> >     down a path from the root to the leaf.
> >
> > So it's easy to see that what we have now doesn't scale well. Also, it
> > loads an extent map for every file extent item into memory, which is not
> > efficient - we should add extent maps only when doing IO (writing or
> > reading file data).
> >
> > This change implements a new algorithm which scales much better, and
> > works like this:
> >
> > 1) We iterate over the subvolume's b+tree, visiting each leaf that has
> >     file extent items once and only once;
> >
> > 2) For any file extent items found that don't represent holes or prealloc
> >     extents, it will not search the extent map tree - there's no need at
> >     all for that - an extent map is just an in-memory representation of a
> >     file extent item;
> >
> > 3) When a hole is found, or a prealloc extent, it will check if there's
> >     delalloc for its range. For this it will search for EXTENT_DELALLOC
> >     bits in the inode's io tree and check the extent map tree - this is
> >     for accounting for unflushed delalloc and for flushed delalloc (the
> >     period between running delalloc and ordered extent completion),
> >     respectively. This is similar to what the current implementation does
> >     when it finds a hole or prealloc extent, but without creating extent
> >     maps and adding them to the extent map tree in case they are not
> >     loaded in memory;
> >
> > 4) It never allocates extent maps, or adds extent maps to the inode's
> >     extent map tree. This not only saves memory and time (from the tree
> >     insertions and allocations), but also eliminates the possibility of
> >     -ENOMEM due to allocating too many extent maps.
> >
> > Part of this new code will also be used later for fiemap (which also
> > suffers similar scalability problems).
> >
> > The following test example can be used to quickly measure the efficiency
> > before and after this patch:
> >
> >      $ cat test-seek-hole.sh
> >      #!/bin/bash
> >
> >      DEV=/dev/sdi
> >      MNT=/mnt/sdi
> >
> >      mkfs.btrfs -f $DEV
> >
> >      mount -o compress=lzo $DEV $MNT
> >
> >      # 16G file -> 131073 compressed extents.
> >      xfs_io -f -c "pwrite -S 0xab -b 1M 0 16G" $MNT/foobar
> >
> >      # Leave a 1M hole at file offset 15G.
> >      xfs_io -c "fpunch 15G 1M" $MNT/foobar
> >
> >      # Unmount and mount again, so that we can test when there's no
> >      # metadata cached in memory.
> >      umount $MNT
> >      mount -o compress=lzo $DEV $MNT
> >
> >      # Test seeking for hole from offset 0 (hole is at offset 15G).
> >
> >      start=$(date +%s%N)
> >      xfs_io -c "seek -h 0" $MNT/foobar
> >      end=$(date +%s%N)
> >      dur=$(( (end - start) / 1000000 ))
> >      echo "Took $dur milliseconds to seek first hole (metadata not cached)"
> >      echo
> >
> >      start=$(date +%s%N)
> >      xfs_io -c "seek -h 0" $MNT/foobar
> >      end=$(date +%s%N)
> >      dur=$(( (end - start) / 1000000 ))
> >      echo "Took $dur milliseconds to seek first hole (metadata cached)"
> >      echo
> >
> >      umount $MNT
> >
> > Before this change:
> >
> >      $ ./test-seek-hole.sh
> >      (...)
> >      Whence   Result
> >      HOLE     16106127360
> >      Took 176 milliseconds to seek first hole (metadata not cached)
> >
> >      Whence   Result
> >      HOLE     16106127360
> >      Took 17 milliseconds to seek first hole (metadata cached)
> >
> > After this change:
> >
> >      $ ./test-seek-hole.sh
> >      (...)
> >      Whence   Result
> >      HOLE     16106127360
> >      Took 43 milliseconds to seek first hole (metadata not cached)
> >
> >      Whence   Result
> >      HOLE     16106127360
> >      Took 13 milliseconds to seek first hole (metadata cached)
> >
> > That's about 4X faster when no metadata is cached and about 30% faster
> > when all metadata is cached.
> >
> > In practice the differences may often be significantly higher, either due
> > to a higher number of extents in a file or because the subvolume's b+tree
> > is much bigger than in this example, where we only have one file.
> >
> > Link: https://lwn.net/Articles/718805/
> > Signed-off-by: Filipe Manana <fdmanana@suse.com>
> > ---
> >   fs/btrfs/file.c | 437 ++++++++++++++++++++++++++++++++++++++++++++----
> >   1 file changed, 406 insertions(+), 31 deletions(-)
> >
> > diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> > index 96f444ad0951..b292a8ada3a4 100644
> > --- a/fs/btrfs/file.c
> > +++ b/fs/btrfs/file.c
> > @@ -3601,22 +3601,281 @@ static long btrfs_fallocate(struct file *file, int mode,
> >       return ret;
> >   }
> >
> > +/*
> > + * Helper for have_delalloc_in_range(). Find a subrange in a given range that
> > + * has unflushed and/or flushing delalloc. There might be other adjacent
> > + * subranges after the one it found, so have_delalloc_in_range() keeps looping
> > + * while it gets adjacent subranges, merging them together.
> > + */
> > +static bool find_delalloc_subrange(struct btrfs_inode *inode, u64 start, u64 end,
> > +                                u64 *delalloc_start_ret, u64 *delalloc_end_ret)
> > +{
> > +     const u64 len = end + 1 - start;
> > +     struct extent_map_tree *em_tree = &inode->extent_tree;
> > +     struct extent_map *em;
> > +     u64 em_end;
> > +     u64 delalloc_len;
> > +
> > +     /*
> > +      * Search the io tree first for EXTENT_DELALLOC. If we find any, it
> > +      * means we have delalloc (dirty pages) for which writeback has not
> > +      * started yet.
> > +      */
> > +     *delalloc_start_ret = start;
> > +     delalloc_len = count_range_bits(&inode->io_tree, delalloc_start_ret, end,
> > +                                     len, EXTENT_DELALLOC, 1);
> > +     /*
> > +      * If delalloc was found then *delalloc_start_ret has a sector size
> > +      * aligned value (rounded down).
> > +      */
> > +     if (delalloc_len > 0)
> > +             *delalloc_end_ret = *delalloc_start_ret + delalloc_len - 1;
> > +
> > +     /*
> > +      * Now also check if there's any extent map in the range that does not
> > +      * map to a hole or prealloc extent. We do this because:
> > +      *
> > +      * 1) When delalloc is flushed, the file range is locked, we clear the
> > +      *    EXTENT_DELALLOC bit from the io tree and create an extent map for
> > +      *    an allocated extent. So we might just have been called after
> > +      *    delalloc is flushed and before the ordered extent completes and
> > +      *    inserts the new file extent item in the subvolume's btree;
> > +      *
> > +      * 2) We may have an extent map created by flushing delalloc for a
> > +      *    subrange that starts before the subrange we found marked with
> > +      *    EXTENT_DELALLOC in the io tree.
> > +      */
> > +     read_lock(&em_tree->lock);
> > +     em = lookup_extent_mapping(em_tree, start, len);
> > +     read_unlock(&em_tree->lock);
> > +
> > +     /* extent_map_end() returns a non-inclusive end offset. */
> > +     em_end = em ? extent_map_end(em) : 0;
> > +
> > +     /*
> > +      * If we have a hole/prealloc extent map, check the next one if this one
> > +      * ends before our range's end.
> > +      */
> > +     if (em && (em->block_start == EXTENT_MAP_HOLE ||
> > +                test_bit(EXTENT_FLAG_PREALLOC, &em->flags)) && em_end < end) {
> > +             struct extent_map *next_em;
> > +
> > +             read_lock(&em_tree->lock);
> > +             next_em = lookup_extent_mapping(em_tree, em_end, len - em_end);
> > +             read_unlock(&em_tree->lock);
> > +
> > +             free_extent_map(em);
> > +             em_end = next_em ? extent_map_end(next_em) : 0;
> > +             em = next_em;
> > +     }
> > +
> > +     if (em && (em->block_start == EXTENT_MAP_HOLE ||
> > +                test_bit(EXTENT_FLAG_PREALLOC, &em->flags))) {
> > +             free_extent_map(em);
> > +             em = NULL;
> > +     }
> > +
> > +     /*
> > +      * No extent map or one for a hole or prealloc extent. Use the delalloc
> > +      * range we found in the io tree if we have one.
> > +      */
> > +     if (!em)
> > +             return (delalloc_len > 0);
> > +
> > +     /*
> > +      * We don't have any range as EXTENT_DELALLOC in the io tree, so the
> > +      * extent map is the only subrange representing delalloc.
> > +      */
> > +     if (delalloc_len == 0) {
> > +             *delalloc_start_ret = em->start;
> > +             *delalloc_end_ret = min(end, em_end - 1);
> > +             free_extent_map(em);
> > +             return true;
> > +     }
> > +
> > +     /*
> > +      * The extent map represents a delalloc range that starts before the
> > +      * delalloc range we found in the io tree.
> > +      */
> > +     if (em->start < *delalloc_start_ret) {
> > +             *delalloc_start_ret = em->start;
> > +             /*
> > +              * If the ranges are adjacent, return a combined range.
> > +              * Otherwise return the extent map's range.
> > +              */
> > +             if (em_end < *delalloc_start_ret)
> > +                     *delalloc_end_ret = min(end, em_end - 1);
> > +
> > +             free_extent_map(em);
> > +             return true;
> > +     }
> > +
> > +     /*
> > +      * The extent map starts after the delalloc range we found in the io
> > +      * tree. If it's adjacent, return a combined range, otherwise return
> > +      * the range found in the io tree.
> > +      */
> > +     if (*delalloc_end_ret + 1 == em->start)
> > +             *delalloc_end_ret = min(end, em_end - 1);
> > +
> > +     free_extent_map(em);
> > +     return true;
> > +}
> > +
> > +/*
> > + * Check if there's delalloc in a given range.
> > + *
> > + * @inode:               The inode.
> > + * @start:               The start offset of the range. It does not need to be
> > + *                       sector size aligned.
> > + * @end:                 The end offset (inclusive value) of the search range.
> > + *                       It does not need to be sector size aligned.
> > + * @delalloc_start_ret:  Output argument, set to the start offset of the
> > + *                       subrange found with delalloc (may not be sector size
> > + *                       aligned).
> > + * @delalloc_end_ret:    Output argument, set to the end offset (inclusive value)
> > + *                       of the subrange found with delalloc.
> > + *
> > + * Returns true if a subrange with delalloc is found within the given range, and
> > + * if so it sets @delalloc_start_ret and @delalloc_end_ret with the start and
> > + * end offsets of the subrange.
> > + */
> > +static bool have_delalloc_in_range(struct btrfs_inode *inode, u64 start, u64 end,
> > +                                u64 *delalloc_start_ret, u64 *delalloc_end_ret)
> > +{
> > +     u64 cur_offset = round_down(start, inode->root->fs_info->sectorsize);
> > +     u64 prev_delalloc_end = 0;
> > +     bool ret = false;
> > +
> > +     while (cur_offset < end) {
> > +             u64 delalloc_start;
> > +             u64 delalloc_end;
> > +             bool delalloc;
> > +
> > +             delalloc = find_delalloc_subrange(inode, cur_offset, end,
> > +                                               &delalloc_start,
> > +                                               &delalloc_end);
> > +             if (!delalloc)
> > +                     break;
> > +
> > +             if (prev_delalloc_end == 0) {
> > +                     /* First subrange found. */
> > +                     *delalloc_start_ret = max(delalloc_start, start);
> > +                     *delalloc_end_ret = delalloc_end;
> > +                     ret = true;
> > +             } else if (delalloc_start == prev_delalloc_end + 1) {
> > +                     /* Subrange adjacent to the previous one, merge them. */
> > +                     *delalloc_end_ret = delalloc_end;
> > +             } else {
> > +                     /* Subrange not adjacent to the previous one, exit. */
> > +                     break;
> > +             }
> > +
> > +             prev_delalloc_end = delalloc_end;
> > +             cur_offset = delalloc_end + 1;
> > +             cond_resched();
> > +     }
> > +
> > +     return ret;
> > +}
> > +
> > +/*
> > + * Check if there's a hole or delalloc range in a range representing a hole (or
> > + * prealloc extent) found in the inode's subvolume btree.
> > + *
> > + * @inode:      The inode.
> > + * @whence:     Seek mode (SEEK_DATA or SEEK_HOLE).
> > + * @start:      Start offset of the hole region. It does not need to be sector
> > + *              size aligned.
> > + * @end:        End offset (inclusive value) of the hole region. It does not
> > + *              need to be sector size aligned.
> > + * @start_ret:  Return parameter, used to set the start of the subrange in the
> > + *              hole that matches the search criteria (seek mode), if such
> > + *              subrange is found (return value of the function is true).
> > + *              The value returned here may not be sector size aligned.
> > + *
> > + * Returns true if a subrange matching the given seek mode is found, and if one
> > + * is found, it updates @start_ret with the start of the subrange.
> > + */
> > +static bool find_desired_extent_in_hole(struct btrfs_inode *inode, int whence,
> > +                                     u64 start, u64 end, u64 *start_ret)
> > +{
> > +     u64 delalloc_start;
> > +     u64 delalloc_end;
> > +     bool delalloc;
> > +
> > +     delalloc = have_delalloc_in_range(inode, start, end, &delalloc_start,
> > +                                       &delalloc_end);
> > +     if (delalloc && whence == SEEK_DATA) {
> > +             *start_ret = delalloc_start;
> > +             return true;
> > +     }
> > +
> > +     if (delalloc && whence == SEEK_HOLE) {
> > +             /*
> > +              * We found delalloc but it starts after our start offset. So we
> > +              * have a hole between our start offset and the delalloc start.
> > +              */
> > +             if (start < delalloc_start) {
> > +                     *start_ret = start;
> > +                     return true;
> > +             }
> > +             /*
> > +              * Delalloc range starts at our start offset.
> > +              * If the delalloc range's length is smaller than our range,
> > +              * then it means we have a hole that starts where the delalloc
> > +              * subrange ends.
> > +              */
> > +             if (delalloc_end < end) {
> > +                     *start_ret = delalloc_end + 1;
> > +                     return true;
> > +             }
> > +
> > +             /* There's delalloc for the whole range. */
> > +             return false;
> > +     }
> > +
> > +     if (!delalloc && whence == SEEK_HOLE) {
> > +             *start_ret = start;
> > +             return true;
> > +     }
> > +
> > +     /*
> > +      * No delalloc in the range and we are seeking for data. The caller has
> > +      * to iterate to the next extent item in the subvolume btree.
> > +      */
> > +     return false;
> > +}
> > +
> >   static loff_t find_desired_extent(struct btrfs_inode *inode, loff_t offset,
> >                                 int whence)
> >   {
> >       struct btrfs_fs_info *fs_info = inode->root->fs_info;
> > -     struct extent_map *em = NULL;
> >       struct extent_state *cached_state = NULL;
> > -     loff_t i_size = inode->vfs_inode.i_size;
> > +     const loff_t i_size = i_size_read(&inode->vfs_inode);
> > +     const u64 ino = btrfs_ino(inode);
> > +     struct btrfs_root *root = inode->root;
> > +     struct btrfs_path *path;
> > +     struct btrfs_key key;
> > +     u64 last_extent_end;
> >       u64 lockstart;
> >       u64 lockend;
> >       u64 start;
> > -     u64 len;
> > -     int ret = 0;
> > +     int ret;
> > +     bool found = false;
> >
> >       if (i_size == 0 || offset >= i_size)
> >               return -ENXIO;
> >
> > +     /*
> > +      * Quick path. If the inode has no prealloc extents and its number of
> > +      * bytes used matches its i_size, then it can not have holes.
> > +      */
> > +     if (whence == SEEK_HOLE &&
> > +         !(inode->flags & BTRFS_INODE_PREALLOC) &&
> > +         inode_get_bytes(&inode->vfs_inode) == i_size)
> > +             return i_size;
> > +
> >       /*
> >        * offset can be negative, in this case we start finding DATA/HOLE from
> >        * the very start of the file.
> > @@ -3628,49 +3887,165 @@ static loff_t find_desired_extent(struct btrfs_inode *inode, loff_t offset,
> >       if (lockend <= lockstart)
> >               lockend = lockstart + fs_info->sectorsize;
> >       lockend--;
> > -     len = lockend - lockstart + 1;
> > +
> > +     path = btrfs_alloc_path();
> > +     if (!path)
> > +             return -ENOMEM;
> > +     path->reada = READA_FORWARD;
> > +
> > +     key.objectid = ino;
> > +     key.type = BTRFS_EXTENT_DATA_KEY;
> > +     key.offset = start;
> > +
> > +     last_extent_end = lockstart;
> >
> >       lock_extent_bits(&inode->io_tree, lockstart, lockend, &cached_state);
> >
> > +     ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
> > +     if (ret < 0) {
> > +             goto out;
> > +     } else if (ret > 0 && path->slots[0] > 0) {
> > +             btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0] - 1);
> > +             if (key.objectid == ino && key.type == BTRFS_EXTENT_DATA_KEY)
> > +                     path->slots[0]--;
> > +     }
> > +
> >       while (start < i_size) {
> > -             em = btrfs_get_extent_fiemap(inode, start, len);
> > -             if (IS_ERR(em)) {
> > -                     ret = PTR_ERR(em);
> > -                     em = NULL;
> > -                     break;
> > +             struct extent_buffer *leaf = path->nodes[0];
> > +             struct btrfs_file_extent_item *extent;
> > +             u64 extent_end;
> > +
> > +             if (path->slots[0] >= btrfs_header_nritems(leaf)) {
> > +                     ret = btrfs_next_leaf(root, path);
> > +                     if (ret < 0)
> > +                             goto out;
> > +                     else if (ret > 0)
> > +                             break;
> > +
> > +                     leaf = path->nodes[0];
> >               }
> >
> > -             if (whence == SEEK_HOLE &&
> > -                 (em->block_start == EXTENT_MAP_HOLE ||
> > -                  test_bit(EXTENT_FLAG_PREALLOC, &em->flags)))
> > -                     break;
> > -             else if (whence == SEEK_DATA &&
> > -                        (em->block_start != EXTENT_MAP_HOLE &&
> > -                         !test_bit(EXTENT_FLAG_PREALLOC, &em->flags)))
> > +             btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
> > +             if (key.objectid != ino || key.type != BTRFS_EXTENT_DATA_KEY)
> >                       break;
> >
> > -             start = em->start + em->len;
> > -             free_extent_map(em);
> > -             em = NULL;
> > +             extent_end = btrfs_file_extent_end(path);
> > +
> > +             /*
> > +              * In the first iteration we may have a slot that points to an
> > +              * extent that ends before our start offset, so skip it.
> > +              */
> > +             if (extent_end <= start) {
> > +                     path->slots[0]++;
> > +                     continue;
> > +             }
> > +
> > +             /* We have an implicit hole, NO_HOLES feature is likely set. */
> > +             if (last_extent_end < key.offset) {
> > +                     u64 search_start = last_extent_end;
> > +                     u64 found_start;
> > +
> > +                     /*
> > +                      * First iteration, @start matches @offset and it's
> > +                      * within the hole.
> > +                      */
> > +                     if (start == offset)
> > +                             search_start = offset;
> > +
> > +                     found = find_desired_extent_in_hole(inode, whence,
> > +                                                         search_start,
> > +                                                         key.offset - 1,
> > +                                                         &found_start);
> > +                     if (found) {
> > +                             start = found_start;
> > +                             break;
> > +                     }
> > +                     /*
> > +                      * Didn't find data or a hole (due to delalloc) in the
> > +                      * implicit hole range, so need to analyze the extent.
> > +                      */
> > +             }
> > +
> > +             extent = btrfs_item_ptr(leaf, path->slots[0],
> > +                                     struct btrfs_file_extent_item);
> > +
> > +             if (btrfs_file_extent_disk_bytenr(leaf, extent) == 0 ||
> > +                 btrfs_file_extent_type(leaf, extent) ==
> > +                 BTRFS_FILE_EXTENT_PREALLOC) {
> > +                     /*
> > +                      * Explicit hole or prealloc extent, search for delalloc.
> > +                      * A prealloc extent is treated like a hole.
> > +                      */
> > +                     u64 search_start = key.offset;
> > +                     u64 found_start;
> > +
> > +                     /*
> > +                      * First iteration, @start matches @offset and it's
> > +                      * within the hole.
> > +                      */
> > +                     if (start == offset)
> > +                             search_start = offset;
> > +
> > +                     found = find_desired_extent_in_hole(inode, whence,
> > +                                                         search_start,
> > +                                                         extent_end - 1,
> > +                                                         &found_start);
> > +                     if (found) {
> > +                             start = found_start;
> > +                             break;
> > +                     }
> > +                     /*
> > +                      * Didn't find data or a hole (due to delalloc) in the
> > +                      * hole or prealloc range, so need to analyze the next
> > +                      * extent item.
> > +                      */
> > +             } else {
> > +                     /*
> > +                      * Found a regular or inline extent.
> > +                      * If we are seeking for data, adjust the start offset
> > +                      * and stop, we're done.
> > +                      */
> > +                     if (whence == SEEK_DATA) {
> > +                             start = max_t(u64, key.offset, offset);
> > +                             found = true;
> > +                             break;
> > +                     }
> > +                     /*
> > +                      * Else, we are seeking for a hole, check the next file
> > +                      * extent item.
> > +                      */
> > +             }
> > +
> > +             start = extent_end;
> > +             last_extent_end = extent_end;
> > +             path->slots[0]++;
> >               if (fatal_signal_pending(current)) {
> >                       ret = -EINTR;
> > -                     break;
> > +                     goto out;
> >               }
> >               cond_resched();
> >       }
> > -     free_extent_map(em);
> > +
> > +     /* We have an implicit hole from the last extent found up to i_size. */
> > +     if (!found && start < i_size) {
> > +             found = find_desired_extent_in_hole(inode, whence, start,
> > +                                                 i_size - 1, &start);
> > +             if (!found)
> > +                     start = i_size;
> > +     }
> > +
> > +out:
> >       unlock_extent_cached(&inode->io_tree, lockstart, lockend,
> >                            &cached_state);
> > -     if (ret) {
> > -             offset = ret;
> > -     } else {
> > -             if (whence == SEEK_DATA && start >= i_size)
> > -                     offset = -ENXIO;
> > -             else
> > -                     offset = min_t(loff_t, start, i_size);
> > -     }
> > +     btrfs_free_path(path);
> > +
> > +     if (ret < 0)
> > +             return ret;
> > +
> > +     if (whence == SEEK_DATA && start >= i_size)
> > +             return -ENXIO;
> >
> > -     return offset;
> > +     return min_t(loff_t, start, i_size);
> >   }
> >
> >   static loff_t btrfs_file_llseek(struct file *file, loff_t offset, int whence)

^ permalink raw reply	[flat|nested] 53+ messages in thread

end of thread, other threads:[~2022-09-12  8:38 UTC | newest]

Thread overview: 53+ messages
2022-09-01 13:18 [PATCH 00/10] btrfs: make lseek and fiemap much more efficient fdmanana
2022-09-01 13:18 ` [PATCH 01/10] btrfs: allow hole and data seeking to be interruptible fdmanana
2022-09-01 13:58   ` Josef Bacik
2022-09-01 21:49   ` Qu Wenruo
2022-09-01 13:18 ` [PATCH 02/10] btrfs: make hole and data seeking a lot more efficient fdmanana
2022-09-01 14:03   ` Josef Bacik
2022-09-01 15:00     ` Filipe Manana
2022-09-02 13:26       ` Josef Bacik
2022-09-01 22:18   ` Qu Wenruo
2022-09-02  8:36     ` Filipe Manana
2022-09-11 22:12   ` Qu Wenruo
2022-09-12  8:38     ` Filipe Manana
2022-09-01 13:18 ` [PATCH 03/10] btrfs: remove check for impossible block start for an extent map at fiemap fdmanana
2022-09-01 14:03   ` Josef Bacik
2022-09-01 22:19   ` Qu Wenruo
2022-09-01 13:18 ` [PATCH 04/10] btrfs: remove zero length check when entering fiemap fdmanana
2022-09-01 14:04   ` Josef Bacik
2022-09-01 22:24   ` Qu Wenruo
2022-09-01 13:18 ` [PATCH 05/10] btrfs: properly flush delalloc " fdmanana
2022-09-01 14:06   ` Josef Bacik
2022-09-01 22:38   ` Qu Wenruo
2022-09-01 13:18 ` [PATCH 06/10] btrfs: allow fiemap to be interruptible fdmanana
2022-09-01 14:07   ` Josef Bacik
2022-09-01 22:42   ` Qu Wenruo
2022-09-02  8:38     ` Filipe Manana
2022-09-01 13:18 ` [PATCH 07/10] btrfs: rename btrfs_check_shared() to a more descriptive name fdmanana
2022-09-01 14:08   ` Josef Bacik
2022-09-01 22:45   ` Qu Wenruo
2022-09-01 13:18 ` [PATCH 08/10] btrfs: speedup checking for extent sharedness during fiemap fdmanana
2022-09-01 14:23   ` Josef Bacik
2022-09-01 22:50   ` Qu Wenruo
2022-09-02  8:46     ` Filipe Manana
2022-09-01 13:18 ` [PATCH 09/10] btrfs: skip unnecessary extent buffer sharedness checks " fdmanana
2022-09-01 14:26   ` Josef Bacik
2022-09-01 23:01   ` Qu Wenruo
2022-09-01 13:18 ` [PATCH 10/10] btrfs: make fiemap more efficient and accurate reporting extent sharedness fdmanana
2022-09-01 14:35   ` Josef Bacik
2022-09-01 15:04     ` Filipe Manana
2022-09-02 13:25       ` Josef Bacik
2022-09-01 23:27   ` Qu Wenruo
2022-09-02  8:59     ` Filipe Manana
2022-09-02  9:34       ` Qu Wenruo
2022-09-02  9:41         ` Filipe Manana
2022-09-02  9:50           ` Qu Wenruo
2022-09-02  0:53 ` [PATCH 00/10] btrfs: make lseek and fiemap much more efficient Wang Yugui
2022-09-02  8:24   ` Filipe Manana
2022-09-02 11:41     ` Wang Yugui
2022-09-02 11:45     ` Filipe Manana
2022-09-05 14:39       ` Filipe Manana
2022-09-06 16:20 ` David Sterba
2022-09-06 17:13   ` Filipe Manana
2022-09-07  9:12 ` Christoph Hellwig
2022-09-07  9:47   ` Filipe Manana
