From: Josef Bacik <josef@toxicpanda.com>
To: fdmanana@kernel.org
Cc: linux-btrfs@vger.kernel.org
Subject: Re: [PATCH 10/10] btrfs: make fiemap more efficient and accurate reporting extent sharedness
Date: Thu, 1 Sep 2022 10:35:34 -0400 [thread overview]
Message-ID: <YxDDNvsbrlxoaJxh@localhost.localdomain> (raw)
In-Reply-To: <afad772c28ed5fc236785df6f3c43282d5c12534.1662022922.git.fdmanana@suse.com>
On Thu, Sep 01, 2022 at 02:18:30PM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
>
> The current fiemap implementation does not scale very well with the number
> of extents a file has. This is both because the main algorithm to find out
> the extents has a high algorithmic complexity and because for each extent
> we have to check if it's shared. This second part, checking if an extent
> is shared, is significantly improved by the two previous patches in this
> patchset, while the first part is improved by this specific patch. Every
> now and then we get reports from users mentioning fiemap is too slow or
> even unusable for files with a very large number of extents, such as the
> two recent reports referred to by the Link tags at the bottom of this
> change log.
>
> To understand why the part of finding which extents a file has is very
> inneficient, consider the example of doing a full ranged fiemap against
> a file that has over 100K extents (normal for example for a file with
> more than 10G of data and using compression, which limits the extent size
> to 128K). When we enter fiemap at extent_fiemap(), the following happens:
>
> 1) Before entering the main loop, we call get_extent_skip_holes() to get
> the first extent map. This leads us to btrfs_get_extent_fiemap(), which
> in turn calls btrfs_get_extent(), to find the first extent map that
> covers the file range [0, LLONG_MAX).
>
> btrfs_get_extent() will first search the inode's extent map tree, to
> see if we have an extent map there that covers the range. If it does
> not find one, then it will search the inode's subvolume b+tree for a
> fitting file extent item. After finding the file extent item, it will
> allocate an extent map, fill it in with information extracted from the
> file extent item, and add it to the inode's extent map tree (which
> requires a search for insertion in the tree).
>
> 2) Then we enter the main loop at extent_fiemap(), emit the details of
> the extent, and call again get_extent_skip_holes(), with a start
> offset matching the end of the extent map we previously processed.
>
> We end up at btrfs_get_extent() again, will search the extent map tree
> and then search the subvolume b+tree for a file extent item if we could
> not find an extent map in the extent tree. We allocate an extent map,
> fill it in with the details in the file extent item, and then insert
> it into the extent map tree (yet another search in this tree).
>
> 3) The second step is repeated over and over, until we have processed the
> whole file range. Each iteration ends at btrfs_get_extent(), which
> does a red black tree search on the extent map tree, then searches the
> subvolume b+tree, allocates an extent map and then does another search
> in the extent map tree in order to insert the extent map.
>
> In the best scenario we have all the extent maps already in the extent
> tree, and so for each extent we do a single search on a red black tree,
> so we have a complexity of O(n log n).
>
> In the worst scenario we don't have any extent map already loaded in
> the extent map tree, or have very few already there. In this case the
> complexity is much higher since we do:
>
> - A red black tree search on the extent map tree, which has O(log n)
> complexity, initially very fast since the tree is empty or very
> small, but as we end up allocating extent maps and adding them to
> the tree when we don't find them there, each subsequent search on
> the tree gets slower, since it's getting bigger and bigger after
> each iteration.
>
> - A search on the subvolume b+tree, also O(log n) complexity, but it
> has items for all inodes in the subvolume, not just items for our
> inode. Plus on a filesystem with concurrent operations on other
> inodes, we can block doing the search due to lock contention on
> b+tree nodes/leaves.
>
> - Allocate an extent map - this can block, and can also fail if we
> are under serious memory pressure.
>
> - Do another search on the extent maps red black tree, with the goal
> of inserting the extent map we just allocated. Again, after every
> iteration this tree is getting bigger by 1 element, so after many
> iterations the searches are slower and slower.
>
> - We will not need the allocated extent map anymore, so it's pointless
> to add it to the extent map tree. It's just wasting time and memory.
>
> In short we end up searching the extent map tree multiple times, on a
> tree that is growing bigger and bigger after each iteration. And
> besides that we visit the same leaf of the subvolume b+tree many times,
> since a leaf with the default size of 16K can easily have more than 200
> file extent items.
>
> This is very inneficient overall. This patch changes the algorithm to
> instead iterate over the subvolume b+tree, visiting each leaf only once,
> and only searching in the extent map tree for file ranges that have holes
> or prealloc extents, in order to figure out if we have delalloc there.
> It will never allocate an extent map and add it to the extent map tree.
> This is very similar to what was previously done for the lseek's hole and
> data seeking features.
>
> Also, the current implementation relying on extent maps for figuring out
> which extents we have is not correct. This is because extent maps can be
> merged even if they represent different extents - we do this to minimize
> memory utilization and keep extent map trees smaller. For example if we
> have two extents that are contiguous on disk, once we load the two extent
> maps, they get merged into a single one - however if only one of the
> extents is shared, we end up reporting both as shared or both as not
> shared, which is incorrect.
>
> This reproducer triggers that bug:
>
> $ cat fiemap-bug.sh
> #!/bin/bash
>
> DEV=/dev/sdj
> MNT=/mnt/sdj
>
> mkfs.btrfs -f $DEV
> mount $DEV $MNT
>
> # Create a file with two 256K extents.
> # Since there is no other write activity, they will be contiguous,
> # and their extent maps merged, despite having two distinct extents.
> xfs_io -f -c "pwrite -S 0xab 0 256K" \
> -c "fsync" \
> -c "pwrite -S 0xcd 256K 256K" \
> -c "fsync" \
> $MNT/foo
>
> # Now clone only the second extent into another file.
> xfs_io -f -c "reflink $MNT/foo 256K 0 256K" $MNT/bar
>
> # Filefrag will report a single 512K extent, and say it's not shared.
> echo
> filefrag -v $MNT/foo
>
> umount $MNT
>
> Running the reproducer:
>
> $ ./fiemap-bug.sh
> wrote 262144/262144 bytes at offset 0
> 256 KiB, 64 ops; 0.0038 sec (65.479 MiB/sec and 16762.7030 ops/sec)
> wrote 262144/262144 bytes at offset 262144
> 256 KiB, 64 ops; 0.0040 sec (61.125 MiB/sec and 15647.9218 ops/sec)
> linked 262144/262144 bytes at offset 0
> 256 KiB, 1 ops; 0.0002 sec (1.034 GiB/sec and 4237.2881 ops/sec)
>
> Filesystem type is: 9123683e
> File size of /mnt/sdj/foo is 524288 (128 blocks of 4096 bytes)
> ext: logical_offset: physical_offset: length: expected: flags:
> 0: 0.. 127: 3328.. 3455: 128: last,eof
> /mnt/sdj/foo: 1 extent found
>
> We end up reporting that we have a single 512K that is not shared, however
> we have two 256K extents, and the second one is shared. Changing the
> reproducer to clone instead the first extent into file 'bar', makes us
> report a single 512K extent that is shared, which is algo incorrect since
> we have two 256K extents and only the first one is shared.
>
> This patch is part of a larger patchset that is comprised of the following
> patches:
>
> btrfs: allow hole and data seeking to be interruptible
> btrfs: make hole and data seeking a lot more efficient
> btrfs: remove check for impossible block start for an extent map at fiemap
> btrfs: remove zero length check when entering fiemap
> btrfs: properly flush delalloc when entering fiemap
> btrfs: allow fiemap to be interruptible
> btrfs: rename btrfs_check_shared() to a more descriptive name
> btrfs: speedup checking for extent sharedness during fiemap
> btrfs: skip unnecessary extent buffer sharedness checks during fiemap
> btrfs: make fiemap more efficient and accurate reporting extent sharedness
>
> The patchset was tested on a machine running a non-debug kernel (Debian's
> default config) and compared the tests below on a branch without the
> patchset versus the same branch with the whole patchset applied.
>
> The following test for a large compressed file without holes:
>
> $ cat fiemap-perf-test.sh
> #!/bin/bash
>
> DEV=/dev/sdi
> MNT=/mnt/sdi
>
> mkfs.btrfs -f $DEV
> mount -o compress=lzo $DEV $MNT
>
> # 40G gives 327680 128K file extents (due to compression).
> xfs_io -f -c "pwrite -S 0xab -b 1M 0 20G" $MNT/foobar
>
> umount $MNT
> mount -o compress=lzo $DEV $MNT
>
> start=$(date +%s%N)
> filefrag $MNT/foobar
> end=$(date +%s%N)
> dur=$(( (end - start) / 1000000 ))
> echo "fiemap took $dur milliseconds (metadata not cached)"
>
> start=$(date +%s%N)
> filefrag $MNT/foobar
> end=$(date +%s%N)
> dur=$(( (end - start) / 1000000 ))
> echo "fiemap took $dur milliseconds (metadata cached)"
>
> umount $MNT
>
> Before patchset:
>
> $ ./fiemap-perf-test.sh
> (...)
> /mnt/sdi/foobar: 327680 extents found
> fiemap took 3597 milliseconds (metadata not cached)
> /mnt/sdi/foobar: 327680 extents found
> fiemap took 2107 milliseconds (metadata cached)
>
> After patchset:
>
> $ ./fiemap-perf-test.sh
> (...)
> /mnt/sdi/foobar: 327680 extents found
> fiemap took 1214 milliseconds (metadata not cached)
> /mnt/sdi/foobar: 327680 extents found
> fiemap took 684 milliseconds (metadata cached)
>
> That's a speedup of about 3x for both cases (no metadata cached and all
> metadata cached).
>
> The test provided by Pavel (first Link tag at the bottom), which uses
> files with a large number of holes, was also used to measure the gains,
> and it consists on a small C program and a shell script to invoke it.
> The C program is the following:
>
> $ cat pavels-test.c
> #include <stdio.h>
> #include <unistd.h>
> #include <stdlib.h>
> #include <fcntl.h>
>
> #include <sys/stat.h>
> #include <sys/time.h>
> #include <sys/ioctl.h>
>
> #include <linux/fs.h>
> #include <linux/fiemap.h>
>
> #define FILE_INTERVAL (1<<13) /* 8Kb */
>
> long long interval(struct timeval t1, struct timeval t2)
> {
> long long val = 0;
> val += (t2.tv_usec - t1.tv_usec);
> val += (t2.tv_sec - t1.tv_sec) * 1000 * 1000;
> return val;
> }
>
> int main(int argc, char **argv)
> {
> struct fiemap fiemap = {};
> struct timeval t1, t2;
> char data = 'a';
> struct stat st;
> int fd, off, file_size = FILE_INTERVAL;
>
> if (argc != 3 && argc != 2) {
> printf("usage: %s <path> [size]\n", argv[0]);
> return 1;
> }
>
> if (argc == 3)
> file_size = atoi(argv[2]);
> if (file_size < FILE_INTERVAL)
> file_size = FILE_INTERVAL;
> file_size -= file_size % FILE_INTERVAL;
>
> fd = open(argv[1], O_RDWR | O_CREAT | O_TRUNC, 0644);
> if (fd < 0) {
> perror("open");
> return 1;
> }
>
> for (off = 0; off < file_size; off += FILE_INTERVAL) {
> if (pwrite(fd, &data, 1, off) != 1) {
> perror("pwrite");
> close(fd);
> return 1;
> }
> }
>
> if (ftruncate(fd, file_size)) {
> perror("ftruncate");
> close(fd);
> return 1;
> }
>
> if (fstat(fd, &st) < 0) {
> perror("fstat");
> close(fd);
> return 1;
> }
>
> printf("size: %ld\n", st.st_size);
> printf("actual size: %ld\n", st.st_blocks * 512);
>
> fiemap.fm_length = FIEMAP_MAX_OFFSET;
> gettimeofday(&t1, NULL);
> if (ioctl(fd, FS_IOC_FIEMAP, &fiemap) < 0) {
> perror("fiemap");
> close(fd);
> return 1;
> }
> gettimeofday(&t2, NULL);
>
> printf("fiemap: fm_mapped_extents = %d\n",
> fiemap.fm_mapped_extents);
> printf("time = %lld us\n", interval(t1, t2));
>
> close(fd);
> return 0;
> }
>
> $ gcc -o pavels_test pavels_test.c
>
> And the wrapper shell script:
>
> $ cat fiemap-pavels-test.sh
>
> #!/bin/bash
>
> DEV=/dev/sdi
> MNT=/mnt/sdi
>
> mkfs.btrfs -f -O no-holes $DEV
> mount $DEV $MNT
>
> echo
> echo "*********** 256M ***********"
> echo
>
> ./pavels-test $MNT/testfile $((1 << 28))
> echo
> ./pavels-test $MNT/testfile $((1 << 28))
>
> echo
> echo "*********** 512M ***********"
> echo
>
> ./pavels-test $MNT/testfile $((1 << 29))
> echo
> ./pavels-test $MNT/testfile $((1 << 29))
>
> echo
> echo "*********** 1G ***********"
> echo
>
> ./pavels-test $MNT/testfile $((1 << 30))
> echo
> ./pavels-test $MNT/testfile $((1 << 30))
>
> umount $MNT
>
> Running his reproducer before applying the patchset:
>
> *********** 256M ***********
>
> size: 268435456
> actual size: 134217728
> fiemap: fm_mapped_extents = 32768
> time = 4003133 us
>
> size: 268435456
> actual size: 134217728
> fiemap: fm_mapped_extents = 32768
> time = 4895330 us
>
> *********** 512M ***********
>
> size: 536870912
> actual size: 268435456
> fiemap: fm_mapped_extents = 65536
> time = 30123675 us
>
> size: 536870912
> actual size: 268435456
> fiemap: fm_mapped_extents = 65536
> time = 33450934 us
>
> *********** 1G ***********
>
> size: 1073741824
> actual size: 536870912
> fiemap: fm_mapped_extents = 131072
> time = 224924074 us
>
> size: 1073741824
> actual size: 536870912
> fiemap: fm_mapped_extents = 131072
> time = 217239242 us
>
> Running it after applying the patchset:
>
> *********** 256M ***********
>
> size: 268435456
> actual size: 134217728
> fiemap: fm_mapped_extents = 32768
> time = 29475 us
>
> size: 268435456
> actual size: 134217728
> fiemap: fm_mapped_extents = 32768
> time = 29307 us
>
> *********** 512M ***********
>
> size: 536870912
> actual size: 268435456
> fiemap: fm_mapped_extents = 65536
> time = 58996 us
>
> size: 536870912
> actual size: 268435456
> fiemap: fm_mapped_extents = 65536
> time = 59115 us
>
> *********** 1G ***********
>
> size: 1073741824
> actual size: 536870912
> fiemap: fm_mapped_extents = 116251
> time = 124141 us
>
> size: 1073741824
> actual size: 536870912
> fiemap: fm_mapped_extents = 131072
> time = 119387 us
>
> The speedup is massive, both on the first fiemap call and on the second
> one as well, as his test creates files with many holes and small extents
> (every extent follows a hole and precedes another hole).
>
> For the 256M file we go from 4 seconds down to 29 milliseconds in the
> first run, and then from 4.9 seconds down to 29 milliseconds again in the
> second run, a speedup of 138x and 169x, respectively.
>
> For the 512M file we go from 30.1 seconds down to 59 milliseconds in the
> first run, and then from 33.5 seconds down to 59 milliseconds again in the
> second run, a speedup of 510x and 568x, respectively.
>
> For the 1G file, we go from 225 seconds down to 124 milliseconds in the
> first run, and then from 217 seconds down to 119 milliseconds in the
> second run, a speedup of 1815x and 1824x, respectively.
>
> Reported-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
> Link: https://lore.kernel.org/linux-btrfs/21dd32c6-f1f9-f44a-466a-e18fdc6788a7@virtuozzo.com/
> Reported-by: Dominique MARTINET <dominique.martinet@atmark-techno.com>
> Link: https://lore.kernel.org/linux-btrfs/Ysace25wh5BbLd5f@atmark-techno.com/
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
> fs/btrfs/ctree.h | 4 +-
> fs/btrfs/extent_io.c | 714 +++++++++++++++++++++++++++++--------------
> fs/btrfs/file.c | 16 +-
> fs/btrfs/inode.c | 140 +--------
> 4 files changed, 506 insertions(+), 368 deletions(-)
>
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index f7fe7f633eb5..7b266f9dc8b4 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -3402,8 +3402,6 @@ unsigned int btrfs_verify_data_csum(struct btrfs_bio *bbio,
> u64 start, u64 end);
> int btrfs_check_data_csum(struct inode *inode, struct btrfs_bio *bbio,
> u32 bio_offset, struct page *page, u32 pgoff);
> -struct extent_map *btrfs_get_extent_fiemap(struct btrfs_inode *inode,
> - u64 start, u64 len);
> noinline int can_nocow_extent(struct inode *inode, u64 offset, u64 *len,
> u64 *orig_start, u64 *orig_block_len,
> u64 *ram_bytes, bool strict);
> @@ -3583,6 +3581,8 @@ int btrfs_fdatawrite_range(struct inode *inode, loff_t start, loff_t end);
> int btrfs_check_nocow_lock(struct btrfs_inode *inode, loff_t pos,
> size_t *write_bytes);
> void btrfs_check_nocow_unlock(struct btrfs_inode *inode);
> +bool btrfs_find_delalloc_in_range(struct btrfs_inode *inode, u64 start, u64 end,
> + u64 *delalloc_start_ret, u64 *delalloc_end_ret);
>
> /* tree-defrag.c */
> int btrfs_defrag_leaves(struct btrfs_trans_handle *trans,
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 0e3fa9b08aaf..50bb2182e795 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -5353,42 +5353,6 @@ int try_release_extent_mapping(struct page *page, gfp_t mask)
> return try_release_extent_state(tree, page, mask);
> }
>
> -/*
> - * helper function for fiemap, which doesn't want to see any holes.
> - * This maps until we find something past 'last'
> - */
> -static struct extent_map *get_extent_skip_holes(struct btrfs_inode *inode,
> - u64 offset, u64 last)
> -{
> - u64 sectorsize = btrfs_inode_sectorsize(inode);
> - struct extent_map *em;
> - u64 len;
> -
> - if (offset >= last)
> - return NULL;
> -
> - while (1) {
> - len = last - offset;
> - if (len == 0)
> - break;
> - len = ALIGN(len, sectorsize);
> - em = btrfs_get_extent_fiemap(inode, offset, len);
> - if (IS_ERR(em))
> - return em;
> -
> - /* if this isn't a hole return it */
> - if (em->block_start != EXTENT_MAP_HOLE)
> - return em;
> -
> - /* this is a hole, advance to the next extent */
> - offset = extent_map_end(em);
> - free_extent_map(em);
> - if (offset >= last)
> - break;
> - }
> - return NULL;
> -}
> -
> /*
> * To cache previous fiemap extent
> *
> @@ -5418,6 +5382,9 @@ static int emit_fiemap_extent(struct fiemap_extent_info *fieinfo,
> {
> int ret = 0;
>
> + /* Set at the end of extent_fiemap(). */
> + ASSERT((flags & FIEMAP_EXTENT_LAST) == 0);
> +
> if (!cache->cached)
> goto assign;
>
> @@ -5446,11 +5413,10 @@ static int emit_fiemap_extent(struct fiemap_extent_info *fieinfo,
> */
> if (cache->offset + cache->len == offset &&
> cache->phys + cache->len == phys &&
> - (cache->flags & ~FIEMAP_EXTENT_LAST) ==
> - (flags & ~FIEMAP_EXTENT_LAST)) {
> + cache->flags == flags) {
> cache->len += len;
> cache->flags |= flags;
> - goto try_submit_last;
> + return 0;
> }
>
> /* Not mergeable, need to submit cached one */
> @@ -5465,13 +5431,8 @@ static int emit_fiemap_extent(struct fiemap_extent_info *fieinfo,
> cache->phys = phys;
> cache->len = len;
> cache->flags = flags;
> -try_submit_last:
> - if (cache->flags & FIEMAP_EXTENT_LAST) {
> - ret = fiemap_fill_next_extent(fieinfo, cache->offset,
> - cache->phys, cache->len, cache->flags);
> - cache->cached = false;
> - }
> - return ret;
> +
> + return 0;
> }
>
> /*
> @@ -5501,229 +5462,534 @@ static int emit_last_fiemap_cache(struct fiemap_extent_info *fieinfo,
> return ret;
> }
>
> -int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
> - u64 start, u64 len)
> +static int fiemap_next_leaf_item(struct btrfs_inode *inode,
> + struct btrfs_path *path)
> {
> - int ret = 0;
> - u64 off;
> - u64 max = start + len;
> - u32 flags = 0;
> - u32 found_type;
> - u64 last;
> - u64 last_for_get_extent = 0;
> - u64 disko = 0;
> - u64 isize = i_size_read(&inode->vfs_inode);
> - struct btrfs_key found_key;
> - struct extent_map *em = NULL;
> - struct extent_state *cached_state = NULL;
> - struct btrfs_path *path;
> + struct extent_buffer *clone;
> + struct btrfs_key key;
> + int slot;
> + int ret;
> +
> + path->slots[0]++;
> + if (path->slots[0] < btrfs_header_nritems(path->nodes[0]))
> + return 0;
> +
> + ret = btrfs_next_leaf(inode->root, path);
> + if (ret != 0)
> + return ret;
> +
> + /*
> + * Don't bother with cloning if there are no more file extent items for
> + * our inode.
> + */
> + btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
> + if (key.objectid != btrfs_ino(inode) || key.type != BTRFS_EXTENT_DATA_KEY)
> + return 1;
> +
> + /* See the comment at fiemap_search_slot() about why we clone. */
> + clone = btrfs_clone_extent_buffer(path->nodes[0]);
> + if (!clone)
> + return -ENOMEM;
> +
> + slot = path->slots[0];
> + btrfs_release_path(path);
> + path->nodes[0] = clone;
> + path->slots[0] = slot;
> +
> + return 0;
> +}
> +
> +/*
> + * Search for the first file extent item that starts at a given file offset or
> + * the one that starts immediately before that offset.
> + * Returns: 0 on success, < 0 on error, 1 if not found.
> + */
> +static int fiemap_search_slot(struct btrfs_inode *inode, struct btrfs_path *path,
> + u64 file_offset)
> +{
> + const u64 ino = btrfs_ino(inode);
> struct btrfs_root *root = inode->root;
> - struct fiemap_cache cache = { 0 };
> - struct btrfs_backref_shared_cache *backref_cache;
> - struct ulist *roots;
> - struct ulist *tmp_ulist;
> - int end = 0;
> - u64 em_start = 0;
> - u64 em_len = 0;
> - u64 em_end = 0;
> + struct extent_buffer *clone;
> + struct btrfs_key key;
> + int slot;
> + int ret;
>
> - backref_cache = kzalloc(sizeof(*backref_cache), GFP_KERNEL);
> - path = btrfs_alloc_path();
> - roots = ulist_alloc(GFP_KERNEL);
> - tmp_ulist = ulist_alloc(GFP_KERNEL);
> - if (!backref_cache || !path || !roots || !tmp_ulist) {
> - ret = -ENOMEM;
> - goto out_free_ulist;
> + key.objectid = ino;
> + key.type = BTRFS_EXTENT_DATA_KEY;
> + key.offset = file_offset;
> +
> + ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
> + if (ret < 0)
> + return ret;
> +
> + if (ret > 0 && path->slots[0] > 0) {
> + btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0] - 1);
> + if (key.objectid == ino && key.type == BTRFS_EXTENT_DATA_KEY)
> + path->slots[0]--;
> + }
> +
> + if (path->slots[0] >= btrfs_header_nritems(path->nodes[0])) {
> + ret = btrfs_next_leaf(root, path);
> + if (ret != 0)
> + return ret;
> +
> + btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
> + if (key.objectid != ino || key.type != BTRFS_EXTENT_DATA_KEY)
> + return 1;
> }
>
> /*
> - * We can't initialize that to 'start' as this could miss extents due
> - * to extent item merging
> + * We clone the leaf and use it during fiemap. This is because while
> + * using the leaf we do expensive things like checking if an extent is
> + * shared, which can take a long time. In order to prevent blocking
> + * other tasks for too long, we use a clone of the leaf. We have locked
> + * the file range in the inode's io tree, so we know none of our file
> + * extent items can change. This way we avoid blocking other tasks that
> + * want to insert items for other inodes in the same leaf or b+tree
> + * rebalance operations (triggered for example when someone is trying
> + * to push items into this leaf when trying to insert an item in a
> + * neighbour leaf).
> + * We also need the private clone because holding a read lock on an
> + * extent buffer of the subvolume's b+tree will make lockdep unhappy
> + * when we call fiemap_fill_next_extent(), because that may cause a page
> + * fault when filling the user space buffer with fiemap data.
> */
> - off = 0;
> - start = round_down(start, btrfs_inode_sectorsize(inode));
> - len = round_up(max, btrfs_inode_sectorsize(inode)) - start;
> + clone = btrfs_clone_extent_buffer(path->nodes[0]);
> + if (!clone)
> + return -ENOMEM;
> +
> + slot = path->slots[0];
> + btrfs_release_path(path);
> + path->nodes[0] = clone;
> + path->slots[0] = slot;
> +
> + return 0;
> +}
> +
> +/*
> + * Process a range which is a hole or a prealloc extent in the inode's subvolume
> + * btree. If @disk_bytenr is 0, we are dealing with a hole, otherwise a prealloc
> + * extent. The end offset (@end) is inclusive.
> + */
> +static int fiemap_process_hole(struct btrfs_inode *inode,
> + struct fiemap_extent_info *fieinfo,
> + struct fiemap_cache *cache,
> + struct btrfs_backref_shared_cache *backref_cache,
> + u64 disk_bytenr, u64 extent_offset,
> + u64 extent_gen,
> + struct ulist *roots, struct ulist *tmp_ulist,
> + u64 start, u64 end)
> +{
> + const u64 i_size = i_size_read(&inode->vfs_inode);
> + const u64 ino = btrfs_ino(inode);
> + u64 cur_offset = start;
> + u64 last_delalloc_end = 0;
> + u32 prealloc_flags = FIEMAP_EXTENT_UNWRITTEN;
> + bool checked_extent_shared = false;
> + int ret;
>
> /*
> - * lookup the last file extent. We're not using i_size here
> - * because there might be preallocation past i_size
> + * There can be no delalloc past i_size, so don't waste time looking for
> + * it beyond i_size.
> */
> - ret = btrfs_lookup_file_extent(NULL, root, path, btrfs_ino(inode), -1,
> - 0);
> - if (ret < 0) {
> - goto out_free_ulist;
> - } else {
> - WARN_ON(!ret);
> - if (ret == 1)
> - ret = 0;
> - }
> + while (cur_offset < end && cur_offset < i_size) {
> + u64 delalloc_start;
> + u64 delalloc_end;
> + u64 prealloc_start;
> + u64 prealloc_len = 0;
> + bool delalloc;
> +
> + delalloc = btrfs_find_delalloc_in_range(inode, cur_offset, end,
> + &delalloc_start,
> + &delalloc_end);
> + if (!delalloc)
> + break;
>
> - path->slots[0]--;
> - btrfs_item_key_to_cpu(path->nodes[0], &found_key, path->slots[0]);
> - found_type = found_key.type;
> -
> - /* No extents, but there might be delalloc bits */
> - if (found_key.objectid != btrfs_ino(inode) ||
> - found_type != BTRFS_EXTENT_DATA_KEY) {
> - /* have to trust i_size as the end */
> - last = (u64)-1;
> - last_for_get_extent = isize;
> - } else {
> /*
> - * remember the start of the last extent. There are a
> - * bunch of different factors that go into the length of the
> - * extent, so its much less complex to remember where it started
> + * If this is a prealloc extent we have to report every section
> + * of it that has no delalloc.
> */
> - last = found_key.offset;
> - last_for_get_extent = last + 1;
> + if (disk_bytenr != 0) {
> + if (last_delalloc_end == 0) {
> + prealloc_start = start;
> + prealloc_len = delalloc_start - start;
> + } else {
> + prealloc_start = last_delalloc_end + 1;
> + prealloc_len = delalloc_start - prealloc_start;
> + }
> + }
> +
> + if (prealloc_len > 0) {
> + if (!checked_extent_shared && fieinfo->fi_extents_max) {
> + ret = btrfs_is_data_extent_shared(inode->root,
> + ino, disk_bytenr,
> + extent_gen, roots,
> + tmp_ulist,
> + backref_cache);
> + if (ret < 0)
> + return ret;
> + else if (ret > 0)
> + prealloc_flags |= FIEMAP_EXTENT_SHARED;
> +
> + checked_extent_shared = true;
> + }
> + ret = emit_fiemap_extent(fieinfo, cache, prealloc_start,
> + disk_bytenr + extent_offset,
> + prealloc_len, prealloc_flags);
> + if (ret)
> + return ret;
> + extent_offset += prealloc_len;
> + }
> +
> + ret = emit_fiemap_extent(fieinfo, cache, delalloc_start, 0,
> + delalloc_end + 1 - delalloc_start,
> + FIEMAP_EXTENT_DELALLOC |
> + FIEMAP_EXTENT_UNKNOWN);
> + if (ret)
> + return ret;
> +
> + last_delalloc_end = delalloc_end;
> + cur_offset = delalloc_end + 1;
> + extent_offset += cur_offset - delalloc_start;
> + cond_resched();
> + }
> +
> + /*
> + * Either we found no delalloc for the whole prealloc extent or we have
> + * a prealloc extent that spans i_size or starts at or after i_size.
> + */
> + if (disk_bytenr != 0 && last_delalloc_end < end) {
> + u64 prealloc_start;
> + u64 prealloc_len;
> +
> + if (last_delalloc_end == 0) {
> + prealloc_start = start;
> + prealloc_len = end + 1 - start;
> + } else {
> + prealloc_start = last_delalloc_end + 1;
> + prealloc_len = end + 1 - prealloc_start;
> + }
> +
> + if (!checked_extent_shared && fieinfo->fi_extents_max) {
> + ret = btrfs_is_data_extent_shared(inode->root,
> + ino, disk_bytenr,
> + extent_gen, roots,
> + tmp_ulist,
> + backref_cache);
> + if (ret < 0)
> + return ret;
> + else if (ret > 0)
> + prealloc_flags |= FIEMAP_EXTENT_SHARED;
> + }
> + ret = emit_fiemap_extent(fieinfo, cache, prealloc_start,
> + disk_bytenr + extent_offset,
> + prealloc_len, prealloc_flags);
> + if (ret)
> + return ret;
> + }
> +
> + return 0;
> +}
> +
> +static int fiemap_find_last_extent_offset(struct btrfs_inode *inode,
> + struct btrfs_path *path,
> + u64 *last_extent_end_ret)
> +{
> + const u64 ino = btrfs_ino(inode);
> + struct btrfs_root *root = inode->root;
> + struct extent_buffer *leaf;
> + struct btrfs_file_extent_item *ei;
> + struct btrfs_key key;
> + u64 disk_bytenr;
> + int ret;
> +
> + /*
> + * Lookup the last file extent. We're not using i_size here because
> + * there might be preallocation past i_size.
> + */
> + ret = btrfs_lookup_file_extent(NULL, root, path, ino, (u64)-1, 0);
> + /* There can't be a file extent item at offset (u64)-1 */
> + ASSERT(ret != 0);
> + if (ret < 0)
> + return ret;
> +
> + /*
> + * For a non-existing key, btrfs_search_slot() always leaves us at a
> + * slot > 0, except if the btree is empty, which is impossible because
> + * at least it has the inode item for this inode and all the items for
> + * the root inode 256.
> + */
> + ASSERT(path->slots[0] > 0);
> + path->slots[0]--;
> + leaf = path->nodes[0];
> + btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
> + if (key.objectid != ino || key.type != BTRFS_EXTENT_DATA_KEY) {
> + /* No file extent items in the subvolume tree. */
> + *last_extent_end_ret = 0;
> + return 0;
> }
> - btrfs_release_path(path);
>
> /*
> - * we might have some extents allocated but more delalloc past those
> - * extents. so, we trust isize unless the start of the last extent is
> - * beyond isize
> + * For an inline extent, the disk_bytenr is where inline data starts at,
> + * so first check if we have an inline extent item before checking if we
> + * have an implicit hole (disk_bytenr == 0).
> */
> - if (last < isize) {
> - last = (u64)-1;
> - last_for_get_extent = isize;
> + ei = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_file_extent_item);
> + if (btrfs_file_extent_type(leaf, ei) == BTRFS_FILE_EXTENT_INLINE) {
> + *last_extent_end_ret = btrfs_file_extent_end(path);
> + return 0;
> }
>
> - lock_extent_bits(&inode->io_tree, start, start + len - 1,
> - &cached_state);
> + /*
> + * Find the last file extent item that is not a hole (when NO_HOLES is
> + * not enabled). This should take at most 2 iterations in the worst
> + * case: we have one hole file extent item at slot 0 of a leaf and
> + * another hole file extent item as the last item in the previous leaf.
> + * This is because we merge file extent items that represent holes.
> + */
> + disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, ei);
> + while (disk_bytenr == 0) {
> + ret = btrfs_previous_item(root, path, ino, BTRFS_EXTENT_DATA_KEY);
> + if (ret < 0) {
> + return ret;
> + } else if (ret > 0) {
> + /* No file extent items that are not holes. */
> + *last_extent_end_ret = 0;
> + return 0;
> + }
> + leaf = path->nodes[0];
> + ei = btrfs_item_ptr(leaf, path->slots[0],
> + struct btrfs_file_extent_item);
> + disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, ei);
> + }
>
> - em = get_extent_skip_holes(inode, start, last_for_get_extent);
> - if (!em)
> - goto out;
> - if (IS_ERR(em)) {
> - ret = PTR_ERR(em);
> + *last_extent_end_ret = btrfs_file_extent_end(path);
> + return 0;
> +}
> +
> +int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
> + u64 start, u64 len)
> +{
> + const u64 ino = btrfs_ino(inode);
> + struct extent_state *cached_state = NULL;
> + struct btrfs_path *path;
> + struct btrfs_root *root = inode->root;
> + struct fiemap_cache cache = { 0 };
> + struct btrfs_backref_shared_cache *backref_cache;
> + struct ulist *roots;
> + struct ulist *tmp_ulist;
> + u64 last_extent_end;
> + u64 prev_extent_end;
> + u64 lockstart;
> + u64 lockend;
> + bool stopped = false;
> + int ret;
> +
> + backref_cache = kzalloc(sizeof(*backref_cache), GFP_KERNEL);
> + path = btrfs_alloc_path();
> + roots = ulist_alloc(GFP_KERNEL);
> + tmp_ulist = ulist_alloc(GFP_KERNEL);
> + if (!backref_cache || !path || !roots || !tmp_ulist) {
> + ret = -ENOMEM;
> goto out;
> }
>
> - while (!end) {
> - u64 offset_in_extent = 0;
> + lockstart = round_down(start, btrfs_inode_sectorsize(inode));
> + lockend = round_up(start + len, btrfs_inode_sectorsize(inode));
> + prev_extent_end = lockstart;
>
> - /* break if the extent we found is outside the range */
> - if (em->start >= max || extent_map_end(em) < off)
> - break;
> + lock_extent_bits(&inode->io_tree, lockstart, lockend, &cached_state);
>
> - /*
> - * get_extent may return an extent that starts before our
> - * requested range. We have to make sure the ranges
> - * we return to fiemap always move forward and don't
> - * overlap, so adjust the offsets here
> - */
> - em_start = max(em->start, off);
> + ret = fiemap_find_last_extent_offset(inode, path, &last_extent_end);
> + if (ret < 0)
> + goto out_unlock;
> + btrfs_release_path(path);
>
> + path->reada = READA_FORWARD;
> + ret = fiemap_search_slot(inode, path, lockstart);
> + if (ret < 0) {
> + goto out_unlock;
> + } else if (ret > 0) {
> /*
> - * record the offset from the start of the extent
> - * for adjusting the disk offset below. Only do this if the
> - * extent isn't compressed since our in ram offset may be past
> - * what we have actually allocated on disk.
> + * No file extent item found, but we may have delalloc between
> + * the current offset and i_size. So check for that.
> */
> - if (!test_bit(EXTENT_FLAG_COMPRESSED, &em->flags))
> - offset_in_extent = em_start - em->start;
> - em_end = extent_map_end(em);
> - em_len = em_end - em_start;
> - flags = 0;
> - if (em->block_start < EXTENT_MAP_LAST_BYTE)
> - disko = em->block_start + offset_in_extent;
> - else
> - disko = 0;
> + ret = 0;
> + goto check_eof_delalloc;
> + }
> +
> + while (prev_extent_end < lockend) {
> + struct extent_buffer *leaf = path->nodes[0];
> + struct btrfs_file_extent_item *ei;
> + struct btrfs_key key;
> + u64 extent_end;
> + u64 extent_len;
> + u64 extent_offset = 0;
> + u64 extent_gen;
> + u64 disk_bytenr = 0;
> + u64 flags = 0;
> + int extent_type;
> + u8 compression;
> +
> + btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
> + if (key.objectid != ino || key.type != BTRFS_EXTENT_DATA_KEY)
> + break;
> +
> + extent_end = btrfs_file_extent_end(path);
>
> /*
> - * bump off for our next call to get_extent
> + * The first iteration can leave us at an extent item that ends
> + * before our range's start. Move to the next item.
> */
> - off = extent_map_end(em);
> - if (off >= max)
> - end = 1;
> -
> - if (em->block_start == EXTENT_MAP_INLINE) {
> - flags |= (FIEMAP_EXTENT_DATA_INLINE |
> - FIEMAP_EXTENT_NOT_ALIGNED);
> - } else if (em->block_start == EXTENT_MAP_DELALLOC) {
> - flags |= (FIEMAP_EXTENT_DELALLOC |
> - FIEMAP_EXTENT_UNKNOWN);
> - } else if (fieinfo->fi_extents_max) {
> - u64 extent_gen;
> - u64 bytenr = em->block_start -
> - (em->start - em->orig_start);
> + if (extent_end <= lockstart)
> + goto next_item;
>
> - /*
> - * If two extent maps are merged, then their generation
> - * is set to the maximum between their generations.
> - * Otherwise its generation matches the one we have in
> - * corresponding file extent item. If we have a merged
> - * extent map, don't use its generation to speedup the
> - * sharedness check below.
> - */
> - if (test_bit(EXTENT_FLAG_MERGED, &em->flags))
> - extent_gen = 0;
> - else
> - extent_gen = em->generation;
> + /* We have in implicit hole (NO_HOLES feature enabled). */
> + if (prev_extent_end < key.offset) {
> + const u64 range_end = min(key.offset, lockend) - 1;
>
> - /*
> - * As btrfs supports shared space, this information
> - * can be exported to userspace tools via
> - * flag FIEMAP_EXTENT_SHARED. If fi_extents_max == 0
> - * then we're just getting a count and we can skip the
> - * lookup stuff.
> - */
> - ret = btrfs_is_data_extent_shared(root, btrfs_ino(inode),
> - bytenr, extent_gen,
> - roots, tmp_ulist,
> - backref_cache);
> - if (ret < 0)
> - goto out_free;
> - if (ret)
> - flags |= FIEMAP_EXTENT_SHARED;
> - ret = 0;
> - }
> - if (test_bit(EXTENT_FLAG_COMPRESSED, &em->flags))
> - flags |= FIEMAP_EXTENT_ENCODED;
> - if (test_bit(EXTENT_FLAG_PREALLOC, &em->flags))
> - flags |= FIEMAP_EXTENT_UNWRITTEN;
> + ret = fiemap_process_hole(inode, fieinfo, &cache,
> + backref_cache, 0, 0, 0,
> + roots, tmp_ulist,
> + prev_extent_end, range_end);
> + if (ret < 0) {
> + goto out_unlock;
> + } else if (ret > 0) {
> + /* fiemap_fill_next_extent() told us to stop. */
> + stopped = true;
> + break;
> + }
>
> - free_extent_map(em);
> - em = NULL;
> - if ((em_start >= last) || em_len == (u64)-1 ||
> - (last == (u64)-1 && isize <= em_end)) {
> - flags |= FIEMAP_EXTENT_LAST;
> - end = 1;
> + /* We've reached the end of the fiemap range, stop. */
> + if (key.offset >= lockend) {
> + stopped = true;
> + break;
> + }
> }
>
> - /* now scan forward to see if this is really the last extent. */
> - em = get_extent_skip_holes(inode, off, last_for_get_extent);
> - if (IS_ERR(em)) {
> - ret = PTR_ERR(em);
> - goto out;
> + extent_len = extent_end - key.offset;
> + ei = btrfs_item_ptr(leaf, path->slots[0],
> + struct btrfs_file_extent_item);
> + compression = btrfs_file_extent_compression(leaf, ei);
> + extent_type = btrfs_file_extent_type(leaf, ei);
> + extent_gen = btrfs_file_extent_generation(leaf, ei);
> +
> + if (extent_type != BTRFS_FILE_EXTENT_INLINE) {
> + disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, ei);
> + if (compression == BTRFS_COMPRESS_NONE)
> + extent_offset = btrfs_file_extent_offset(leaf, ei);
> }
> - if (!em) {
> - flags |= FIEMAP_EXTENT_LAST;
> - end = 1;
> +
> + if (compression != BTRFS_COMPRESS_NONE)
> + flags |= FIEMAP_EXTENT_ENCODED;
> +
> + if (extent_type == BTRFS_FILE_EXTENT_INLINE) {
> + flags |= FIEMAP_EXTENT_DATA_INLINE;
> + flags |= FIEMAP_EXTENT_NOT_ALIGNED;
> + ret = emit_fiemap_extent(fieinfo, &cache, key.offset, 0,
> + extent_len, flags);
> + } else if (extent_type == BTRFS_FILE_EXTENT_PREALLOC) {
> + ret = fiemap_process_hole(inode, fieinfo, &cache,
> + backref_cache,
> + disk_bytenr, extent_offset,
> + extent_gen, roots, tmp_ulist,
> + key.offset, extent_end - 1);
> + } else if (disk_bytenr == 0) {
> + /* We have an explicit hole. */
> + ret = fiemap_process_hole(inode, fieinfo, &cache,
> + backref_cache, 0, 0, 0,
> + roots, tmp_ulist,
> + key.offset, extent_end - 1);
> + } else {
> + /* We have a regular extent. */
> + if (fieinfo->fi_extents_max) {
> + ret = btrfs_is_data_extent_shared(root, ino,
> + disk_bytenr,
> + extent_gen,
> + roots,
> + tmp_ulist,
> + backref_cache);
> + if (ret < 0)
> + goto out_unlock;
> + else if (ret > 0)
> + flags |= FIEMAP_EXTENT_SHARED;
> + }
> +
> + ret = emit_fiemap_extent(fieinfo, &cache, key.offset,
> + disk_bytenr + extent_offset,
> + extent_len, flags);
> }
> - ret = emit_fiemap_extent(fieinfo, &cache, em_start, disko,
> - em_len, flags);
> - if (ret) {
> - if (ret == 1)
> - ret = 0;
> - goto out_free;
> +
> + if (ret < 0) {
> + goto out_unlock;
> + } else if (ret > 0) {
> + /* fiemap_fill_next_extent() told us to stop. */
> + stopped = true;
> + break;
> }
>
> + prev_extent_end = extent_end;
> +next_item:
> if (fatal_signal_pending(current)) {
> ret = -EINTR;
> - goto out_free;
> + goto out_unlock;
> }
> +
> + ret = fiemap_next_leaf_item(inode, path);
> + if (ret < 0) {
> + goto out_unlock;
> + } else if (ret > 0) {
> + /* No more file extent items for this inode. */
> + break;
> + }
> + cond_resched();
> }
> -out_free:
> - if (!ret)
> - ret = emit_last_fiemap_cache(fieinfo, &cache);
> - free_extent_map(em);
> -out:
> - unlock_extent_cached(&inode->io_tree, start, start + len - 1,
> - &cached_state);
>
> -out_free_ulist:
> +check_eof_delalloc:
> + /*
> + * Release (and free) the path before emitting any final entries to
> + * fiemap_fill_next_extent() to keep lockdep happy. This is because
> + * once we find no more file extent items exist, we may have a
> + * non-cloned leaf, and fiemap_fill_next_extent() can trigger page
> + * faults when copying data to the user space buffer.
> + */
> + btrfs_free_path(path);
> + path = NULL;
> +
> + if (!stopped && prev_extent_end < lockend) {
> + ret = fiemap_process_hole(inode, fieinfo, &cache, backref_cache,
> + 0, 0, 0, roots, tmp_ulist,
> + prev_extent_end, lockend - 1);
> + if (ret < 0)
> + goto out_unlock;
> + prev_extent_end = lockend;
> + }
> +
> + if (cache.cached && cache.offset + cache.len >= last_extent_end) {
> + const u64 i_size = i_size_read(&inode->vfs_inode);
> +
> + if (prev_extent_end < i_size) {
> + u64 delalloc_start;
> + u64 delalloc_end;
> + bool delalloc;
> +
> + delalloc = btrfs_find_delalloc_in_range(inode,
> + prev_extent_end,
> + i_size - 1,
> + &delalloc_start,
> + &delalloc_end);
> + if (!delalloc)
> + cache.flags |= FIEMAP_EXTENT_LAST;
> + } else {
> + cache.flags |= FIEMAP_EXTENT_LAST;
> + }
> + }
> +
> + ret = emit_last_fiemap_cache(fieinfo, &cache);
> +
> +out_unlock:
> + unlock_extent_cached(&inode->io_tree, lockstart, lockend, &cached_state);
> +out:
> kfree(backref_cache);
> btrfs_free_path(path);
> ulist_free(roots);
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index b292a8ada3a4..636b3ec46184 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -3602,10 +3602,10 @@ static long btrfs_fallocate(struct file *file, int mode,
> }
>
> /*
> - * Helper for have_delalloc_in_range(). Find a subrange in a given range that
> - * has unflushed and/or flushing delalloc. There might be other adjacent
> - * subranges after the one it found, so have_delalloc_in_range() keeps looping
> - * while it gets adjacent subranges, and merging them together.
> + * Helper for btrfs_find_delalloc_in_range(). Find a subrange in a given range
> + * that has unflushed and/or flushing delalloc. There might be other adjacent
> + * subranges after the one it found, so btrfs_find_delalloc_in_range() keeps
> + * looping while it gets adjacent subranges, and merging them together.
> */
> static bool find_delalloc_subrange(struct btrfs_inode *inode, u64 start, u64 end,
> u64 *delalloc_start_ret, u64 *delalloc_end_ret)
> @@ -3740,8 +3740,8 @@ static bool find_delalloc_subrange(struct btrfs_inode *inode, u64 start, u64 end
> * if so it sets @delalloc_start_ret and @delalloc_end_ret with the start and
> * end offsets of the subrange.
> */
> -static bool have_delalloc_in_range(struct btrfs_inode *inode, u64 start, u64 end,
> - u64 *delalloc_start_ret, u64 *delalloc_end_ret)
> +bool btrfs_find_delalloc_in_range(struct btrfs_inode *inode, u64 start, u64 end,
> + u64 *delalloc_start_ret, u64 *delalloc_end_ret)
> {
> u64 cur_offset = round_down(start, inode->root->fs_info->sectorsize);
> u64 prev_delalloc_end = 0;
> @@ -3804,8 +3804,8 @@ static bool find_desired_extent_in_hole(struct btrfs_inode *inode, int whence,
> u64 delalloc_end;
> bool delalloc;
>
> - delalloc = have_delalloc_in_range(inode, start, end, &delalloc_start,
> - &delalloc_end);
> + delalloc = btrfs_find_delalloc_in_range(inode, start, end,
> + &delalloc_start, &delalloc_end);
> if (delalloc && whence == SEEK_DATA) {
> *start_ret = delalloc_start;
> return true;
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 2c7d31990777..8be1e021513a 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -7064,133 +7064,6 @@ struct extent_map *btrfs_get_extent(struct btrfs_inode *inode,
> return em;
> }
>
> -struct extent_map *btrfs_get_extent_fiemap(struct btrfs_inode *inode,
> - u64 start, u64 len)
> -{
> - struct extent_map *em;
> - struct extent_map *hole_em = NULL;
> - u64 delalloc_start = start;
> - u64 end;
> - u64 delalloc_len;
> - u64 delalloc_end;
> - int err = 0;
> -
> - em = btrfs_get_extent(inode, NULL, 0, start, len);
> - if (IS_ERR(em))
> - return em;
> - /*
> - * If our em maps to:
> - * - a hole or
> - * - a pre-alloc extent,
> - * there might actually be delalloc bytes behind it.
> - */
> - if (em->block_start != EXTENT_MAP_HOLE &&
> - !test_bit(EXTENT_FLAG_PREALLOC, &em->flags))
> - return em;
> - else
> - hole_em = em;
> -
> - /* check to see if we've wrapped (len == -1 or similar) */
> - end = start + len;
> - if (end < start)
> - end = (u64)-1;
> - else
> - end -= 1;
> -
> - em = NULL;
> -
> - /* ok, we didn't find anything, lets look for delalloc */
> - delalloc_len = count_range_bits(&inode->io_tree, &delalloc_start,
> - end, len, EXTENT_DELALLOC, 1);
> - delalloc_end = delalloc_start + delalloc_len;
> - if (delalloc_end < delalloc_start)
> - delalloc_end = (u64)-1;
> -
> - /*
> - * We didn't find anything useful, return the original results from
> - * get_extent()
> - */
> - if (delalloc_start > end || delalloc_end <= start) {
> - em = hole_em;
> - hole_em = NULL;
> - goto out;
> - }
> -
> - /*
> - * Adjust the delalloc_start to make sure it doesn't go backwards from
> - * the start they passed in
> - */
> - delalloc_start = max(start, delalloc_start);
> - delalloc_len = delalloc_end - delalloc_start;
> -
> - if (delalloc_len > 0) {
> - u64 hole_start;
> - u64 hole_len;
> - const u64 hole_end = extent_map_end(hole_em);
> -
> - em = alloc_extent_map();
> - if (!em) {
> - err = -ENOMEM;
> - goto out;
> - }
> -
> - ASSERT(hole_em);
> - /*
> - * When btrfs_get_extent can't find anything it returns one
> - * huge hole
> - *
> - * Make sure what it found really fits our range, and adjust to
> - * make sure it is based on the start from the caller
> - */
> - if (hole_end <= start || hole_em->start > end) {
> - free_extent_map(hole_em);
> - hole_em = NULL;
> - } else {
> - hole_start = max(hole_em->start, start);
> - hole_len = hole_end - hole_start;
> - }
> -
> - if (hole_em && delalloc_start > hole_start) {
> - /*
> - * Our hole starts before our delalloc, so we have to
> - * return just the parts of the hole that go until the
> - * delalloc starts
> - */
> - em->len = min(hole_len, delalloc_start - hole_start);
> - em->start = hole_start;
> - em->orig_start = hole_start;
> - /*
> - * Don't adjust block start at all, it is fixed at
> - * EXTENT_MAP_HOLE
> - */
> - em->block_start = hole_em->block_start;
> - em->block_len = hole_len;
> - if (test_bit(EXTENT_FLAG_PREALLOC, &hole_em->flags))
> - set_bit(EXTENT_FLAG_PREALLOC, &em->flags);
> - } else {
> - /*
> - * Hole is out of passed range or it starts after
> - * delalloc range
> - */
> - em->start = delalloc_start;
> - em->len = delalloc_len;
> - em->orig_start = delalloc_start;
> - em->block_start = EXTENT_MAP_DELALLOC;
> - em->block_len = delalloc_len;
> - }
> - } else {
> - return hole_em;
> - }
> -out:
> -
> - free_extent_map(hole_em);
> - if (err) {
> - free_extent_map(em);
> - return ERR_PTR(err);
> - }
> - return em;
> -}
> -
> static struct extent_map *btrfs_create_dio_extent(struct btrfs_inode *inode,
> const u64 start,
> const u64 len,
> @@ -8265,15 +8138,14 @@ static int btrfs_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
> * in the compression of data (in an async thread) and will return
> * before the compression is done and writeback is started. A second
> * filemap_fdatawrite_range() is needed to wait for the compression to
> - * complete and writeback to start. Without this, our user is very
> - * likely to get stale results, because the extents and extent maps for
> - * delalloc regions are only allocated when writeback starts.
> + * complete and writeback to start. We also need to wait for ordered
> + * extents to complete, because our fiemap implementation uses mainly
> + * file extent items to list the extents, searching for extent maps
> + * only for file ranges with holes or prealloc extents to figure out
> + * if we have delalloc in those ranges.
> */
> if (fieinfo->fi_flags & FIEMAP_FLAG_SYNC) {
> - ret = btrfs_fdatawrite_range(inode, 0, LLONG_MAX);
> - if (ret)
> - return ret;
> - ret = filemap_fdatawait_range(inode->i_mapping, 0, LLONG_MAX);
> + ret = btrfs_wait_ordered_range(inode, 0, LLONG_MAX);
> if (ret)
> return ret;
> }
Hmm this bit should be in "btrfs: properly flush delalloc when entering fiemap"
instead. Thanks,
Josef
next prev parent reply other threads:[~2022-09-01 14:35 UTC|newest]
Thread overview: 53+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-09-01 13:18 [PATCH 00/10] btrfs: make lseek and fiemap much more efficient fdmanana
2022-09-01 13:18 ` [PATCH 01/10] btrfs: allow hole and data seeking to be interruptible fdmanana
2022-09-01 13:58 ` Josef Bacik
2022-09-01 21:49 ` Qu Wenruo
2022-09-01 13:18 ` [PATCH 02/10] btrfs: make hole and data seeking a lot more efficient fdmanana
2022-09-01 14:03 ` Josef Bacik
2022-09-01 15:00 ` Filipe Manana
2022-09-02 13:26 ` Josef Bacik
2022-09-01 22:18 ` Qu Wenruo
2022-09-02 8:36 ` Filipe Manana
2022-09-11 22:12 ` Qu Wenruo
2022-09-12 8:38 ` Filipe Manana
2022-09-01 13:18 ` [PATCH 03/10] btrfs: remove check for impossible block start for an extent map at fiemap fdmanana
2022-09-01 14:03 ` Josef Bacik
2022-09-01 22:19 ` Qu Wenruo
2022-09-01 13:18 ` [PATCH 04/10] btrfs: remove zero length check when entering fiemap fdmanana
2022-09-01 14:04 ` Josef Bacik
2022-09-01 22:24 ` Qu Wenruo
2022-09-01 13:18 ` [PATCH 05/10] btrfs: properly flush delalloc " fdmanana
2022-09-01 14:06 ` Josef Bacik
2022-09-01 22:38 ` Qu Wenruo
2022-09-01 13:18 ` [PATCH 06/10] btrfs: allow fiemap to be interruptible fdmanana
2022-09-01 14:07 ` Josef Bacik
2022-09-01 22:42 ` Qu Wenruo
2022-09-02 8:38 ` Filipe Manana
2022-09-01 13:18 ` [PATCH 07/10] btrfs: rename btrfs_check_shared() to a more descriptive name fdmanana
2022-09-01 14:08 ` Josef Bacik
2022-09-01 22:45 ` Qu Wenruo
2022-09-01 13:18 ` [PATCH 08/10] btrfs: speedup checking for extent sharedness during fiemap fdmanana
2022-09-01 14:23 ` Josef Bacik
2022-09-01 22:50 ` Qu Wenruo
2022-09-02 8:46 ` Filipe Manana
2022-09-01 13:18 ` [PATCH 09/10] btrfs: skip unnecessary extent buffer sharedness checks " fdmanana
2022-09-01 14:26 ` Josef Bacik
2022-09-01 23:01 ` Qu Wenruo
2022-09-01 13:18 ` [PATCH 10/10] btrfs: make fiemap more efficient and accurate reporting extent sharedness fdmanana
2022-09-01 14:35 ` Josef Bacik [this message]
2022-09-01 15:04 ` Filipe Manana
2022-09-02 13:25 ` Josef Bacik
2022-09-01 23:27 ` Qu Wenruo
2022-09-02 8:59 ` Filipe Manana
2022-09-02 9:34 ` Qu Wenruo
2022-09-02 9:41 ` Filipe Manana
2022-09-02 9:50 ` Qu Wenruo
2022-09-02 0:53 ` [PATCH 00/10] btrfs: make lseek and fiemap much more efficient Wang Yugui
2022-09-02 8:24 ` Filipe Manana
2022-09-02 11:41 ` Wang Yugui
2022-09-02 11:45 ` Filipe Manana
2022-09-05 14:39 ` Filipe Manana
2022-09-06 16:20 ` David Sterba
2022-09-06 17:13 ` Filipe Manana
2022-09-07 9:12 ` Christoph Hellwig
2022-09-07 9:47 ` Filipe Manana
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=YxDDNvsbrlxoaJxh@localhost.localdomain \
--to=josef@toxicpanda.com \
--cc=fdmanana@kernel.org \
--cc=linux-btrfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).