Re: [PATCH 10/10] btrfs: make fiemap more efficient and accurate reporting extent sharedness

From: Qu Wenruo <wqu@suse.com>
To: Filipe Manana <fdmanana@kernel.org>
Cc: Qu Wenruo <quwenruo.btrfs@gmx.com>, linux-btrfs@vger.kernel.org
Subject: Re: [PATCH 10/10] btrfs: make fiemap more efficient and accurate reporting extent sharedness
Date: Fri, 2 Sep 2022 17:50:01 +0800	[thread overview]
Message-ID: <ab72088a-6c69-71c1-3427-7cb8f83911cf@suse.com> (raw)
In-Reply-To: <CAL3q7H7LVe-6qOrSLAcuRijp9PshM3xGo4Bjq6kSvedehNGkEQ@mail.gmail.com>

On 2022/9/2 17:41, Filipe Manana wrote:
> On Fri, Sep 2, 2022 at 10:35 AM Qu Wenruo <wqu@suse.com> wrote:
>>
>>
>>
>> On 2022/9/2 16:59, Filipe Manana wrote:
>>> On Fri, Sep 2, 2022 at 12:27 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>> [...]
>>>>
>>>> Is there any other major usage for extent map now?
>>>
>>> For fsync at least is very important.
>>
>> OK, forgot that fsync is relying on that to determine if an extent needs
>> to be logged.
>>
>>> Also for reads it's nice to not have to go to the b+tree.
>>> If someone reads several pages of an extent, being able to get it directly
>>> from the extent map tree is faster than having to go to the btree for
>>> every read.
>>> Extent map trees are per inode, but subvolume b+trees can have a lot of
>>> concurrent write and read access.
>>>
>>>>
>>>> I can only think of read, which uses extent map to grab the logical
>>>> bytenr of the real extent.
>>>>
>>>> In that case, the SHARED flag doesn't make much sense anyway, can we do
>>>> a cleanup for those flags? Since fiemap/lseek no longer relies on extent
>>>> map anymore.
>>>
>>> I don't get it. What SHARED flag are talking about? And which "flags", where?
>>> We have nothing specific for lseek/fiemap in the extent maps, so I
>>> don't understand.
>>
>> Nevermind, I got confused and think there would be one SHARED flag for
>> extent map, but that's totally wrong...
>>
>>>
>>>>
>> [...]
>>>>
>>>> Although this is correct, it still looks a little tricky.
>>>>
>>>> We rely on btrfs_release_path() to release all tree blocks in the
>>>> subvolume tree, including unlocking the tree blocks, thus path->locks[0]
>>>> is also 0, meaning next time we call btrfs_release_path() we won't try
>>>> to unlock the cloned eb.
>>>
>>> We're not taking any lock on the cloned extent buffer. It's not
>>> needed, it's private
>>> to the task.
>>
>> Yep, that's completely fine, just looks a little tricky since we're
>> going to release that path twice, and that's expected.
> 
> Hum?
> It's twice but for different extent buffers.

Yep, it's just different from our regular call procedure, 
btrfs_search_slot() -> btrfs_release_path()/btrfs_free_path(), without 
doing double btrfs_release_path().

But since you have enough comment, and the whole trick is hidden behind 
fiemap_search_slot()/fiemap_next_leaf_item(), I guess it's fine.

Thanks,
Qu

> 
>>
>>>
>>>>
>>>> But I'd say it's still pretty tricky, and unfortunately I don't have any
>>>> better alternative.
>>>>
>>>>> +
>>>>> +     return 0;
>>>>> +}
>>>>> +
>>>>> +/*
>>>>> + * Process a range which is a hole or a prealloc extent in the inode's subvolume
>>>>> + * btree. If @disk_bytenr is 0, we are dealing with a hole, otherwise a prealloc
>>>>> + * extent. The end offset (@end) is inclusive.
>>>>> + */
>>>>> +static int fiemap_process_hole(struct btrfs_inode *inode,
>>>>
>>>> Does the name still make sense as we're handling both hole and prealloc
>>>> range?
>>>
>>> I chose that name because hole and prealloc are treated the same way.
>>> Sure, I could name it fiemap_process_hole_or_prealloc() or something
>>> like that, but
>>> I decided to keep the name shorter, make them explicit in the comments and code.
>>>
>>> The old code did the same, get_extent_skip_holes() skipped holes and
>>> prealloc extents without delalloc.
>>>
>>>>
>>>>
>>>> And I always find the delalloc search a big pain during lseek/fiemap.
>>>>
>>>> I guess except using certain flags, there is some hard requirement for
>>>> delalloc range reporting?
>>>
>>> Yes. Delalloc is not meant to be flushed for fiemap unless
>>> FIEMAP_FLAG_SYNC is given by the user.
>>
>> Would it be possible to let btrfs always flush the delalloc range, no
>> matter if FIEMAP_FLAG_SYNC is specified or not?
>>
>> I really want to avoid the whole delalloc search thing if possible.
>>
>> Although I'd guess such behavior would be against the fiemap
>> requirement, and the extra writeback may greatly slow down the fiemap
>> itself for large files with tons of delalloc, so not really expect this
>> to happen.
> 
> No, doing such a change is a bad idea.
> It changes the semantics and expected behaviour.
> 
> My goal here is to preserve all semantics and behaviour, but make it
> more efficient.
> 
> Even if we were all to decide to do that, that should be done
> separately - but I don't think that is correct anyway,
> fiemap can be used to detect delalloc, and probably there are users
> using it for that.
> 
> 
>>
>>> For lseek it's just not needed, but that was already mentioned /
>>> discussed in patch 2/10.
>>>
>> [...]
>>>>> +     /*
>>>>> +      * Lookup the last file extent. We're not using i_size here because
>>>>> +      * there might be preallocation past i_size.
>>>>> +      */
>>>>
>>>> I'm wondering how could this happen?
>>>
>>> You can fallocate an extent at or after i_size.
>>>
>>>>
>>>> Normally if we're truncating an inode, the extents starting after
>>>> round_up(i_size, sectorsize) should be dropped.
>>>
>>> It has nothing to do with truncate, just fallocate.
>>>
>>>>
>>>> Or if we later enlarge the inode, we may hit old extents and read out
>>>> some stale data other than expected zeros.
>>>>
>>>> Thus searching using round_up(i_size, sectorsize) should still let us to
>>>> reach the slot after the last file extent.
>>>>
>>>> Or did I miss something?
>>>
>>> Yes, it's about prealloc extents at or after i_size.
>>
>> Did you mean falloc using keep_size flag?
> 
> Yes. Otherwise the extent wouldn't end up beyond i_size.
> 
>>
>> Then that explains the whole reason.
> 
> Great!
> Thanks.
> 
>>
>> Thanks,
>> Qu
>>
>>>
>>> Thanks.
>>>
>>>>
>>>> Thanks,
>>>> Qu
>>>>
>>>>> +     ret = btrfs_lookup_file_extent(NULL, root, path, ino, (u64)-1, 0);
>>>>> +     /* There can't be a file extent item at offset (u64)-1 */
>>>>> +     ASSERT(ret != 0);
>>>>> +     if (ret < 0)
>>>>> +             return ret;
>>>>> +
>>>>> +     /*
>>>>> +      * For a non-existing key, btrfs_search_slot() always leaves us at a
>>>>> +      * slot > 0, except if the btree is empty, which is impossible because
>>>>> +      * at least it has the inode item for this inode and all the items for
>>>>> +      * the root inode 256.
>>>>> +      */
>>>>> +     ASSERT(path->slots[0] > 0);
>>>>> +     path->slots[0]--;
>>>>> +     leaf = path->nodes[0];
>>>>> +     btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
>>>>> +     if (key.objectid != ino || key.type != BTRFS_EXTENT_DATA_KEY) {
>>>>> +             /* No file extent items in the subvolume tree. */
>>>>> +             *last_extent_end_ret = 0;
>>>>> +             return 0;
>>>>>         }
>>>>> -     btrfs_release_path(path);
>>>>>
>>>>>         /*
>>>>> -      * we might have some extents allocated but more delalloc past those
>>>>> -      * extents.  so, we trust isize unless the start of the last extent is
>>>>> -      * beyond isize
>>>>> +      * For an inline extent, the disk_bytenr is where inline data starts at,
>>>>> +      * so first check if we have an inline extent item before checking if we
>>>>> +      * have an implicit hole (disk_bytenr == 0).
>>>>>          */
>>>>> -     if (last < isize) {
>>>>> -             last = (u64)-1;
>>>>> -             last_for_get_extent = isize;
>>>>> +     ei = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_file_extent_item);
>>>>> +     if (btrfs_file_extent_type(leaf, ei) == BTRFS_FILE_EXTENT_INLINE) {
>>>>> +             *last_extent_end_ret = btrfs_file_extent_end(path);
>>>>> +             return 0;
>>>>>         }
>>>>>
>>>>> -     lock_extent_bits(&inode->io_tree, start, start + len - 1,
>>>>> -                      &cached_state);
>>>>> +     /*
>>>>> +      * Find the last file extent item that is not a hole (when NO_HOLES is
>>>>> +      * not enabled). This should take at most 2 iterations in the worst
>>>>> +      * case: we have one hole file extent item at slot 0 of a leaf and
>>>>> +      * another hole file extent item as the last item in the previous leaf.
>>>>> +      * This is because we merge file extent items that represent holes.
>>>>> +      */
>>>>> +     disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, ei);
>>>>> +     while (disk_bytenr == 0) {
>>>>> +             ret = btrfs_previous_item(root, path, ino, BTRFS_EXTENT_DATA_KEY);
>>>>> +             if (ret < 0) {
>>>>> +                     return ret;
>>>>> +             } else if (ret > 0) {
>>>>> +                     /* No file extent items that are not holes. */
>>>>> +                     *last_extent_end_ret = 0;
>>>>> +                     return 0;
>>>>> +             }
>>>>> +             leaf = path->nodes[0];
>>>>> +             ei = btrfs_item_ptr(leaf, path->slots[0],
>>>>> +                                 struct btrfs_file_extent_item);
>>>>> +             disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, ei);
>>>>> +     }
>>>>>
>>>>> -     em = get_extent_skip_holes(inode, start, last_for_get_extent);
>>>>> -     if (!em)
>>>>> -             goto out;
>>>>> -     if (IS_ERR(em)) {
>>>>> -             ret = PTR_ERR(em);
>>>>> +     *last_extent_end_ret = btrfs_file_extent_end(path);
>>>>> +     return 0;
>>>>> +}
>>>>> +
>>>>> +int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
>>>>> +               u64 start, u64 len)
>>>>> +{
>>>>> +     const u64 ino = btrfs_ino(inode);
>>>>> +     struct extent_state *cached_state = NULL;
>>>>> +     struct btrfs_path *path;
>>>>> +     struct btrfs_root *root = inode->root;
>>>>> +     struct fiemap_cache cache = { 0 };
>>>>> +     struct btrfs_backref_shared_cache *backref_cache;
>>>>> +     struct ulist *roots;
>>>>> +     struct ulist *tmp_ulist;
>>>>> +     u64 last_extent_end;
>>>>> +     u64 prev_extent_end;
>>>>> +     u64 lockstart;
>>>>> +     u64 lockend;
>>>>> +     bool stopped = false;
>>>>> +     int ret;
>>>>> +
>>>>> +     backref_cache = kzalloc(sizeof(*backref_cache), GFP_KERNEL);
>>>>> +     path = btrfs_alloc_path();
>>>>> +     roots = ulist_alloc(GFP_KERNEL);
>>>>> +     tmp_ulist = ulist_alloc(GFP_KERNEL);
>>>>> +     if (!backref_cache || !path || !roots || !tmp_ulist) {
>>>>> +             ret = -ENOMEM;
>>>>>                 goto out;
>>>>>         }
>>>>>
>>>>> -     while (!end) {
>>>>> -             u64 offset_in_extent = 0;
>>>>> +     lockstart = round_down(start, btrfs_inode_sectorsize(inode));
>>>>> +     lockend = round_up(start + len, btrfs_inode_sectorsize(inode));
>>>>> +     prev_extent_end = lockstart;
>>>>>
>>>>> -             /* break if the extent we found is outside the range */
>>>>> -             if (em->start >= max || extent_map_end(em) < off)
>>>>> -                     break;
>>>>> +     lock_extent_bits(&inode->io_tree, lockstart, lockend, &cached_state);
>>>>>
>>>>> -             /*
>>>>> -              * get_extent may return an extent that starts before our
>>>>> -              * requested range.  We have to make sure the ranges
>>>>> -              * we return to fiemap always move forward and don't
>>>>> -              * overlap, so adjust the offsets here
>>>>> -              */
>>>>> -             em_start = max(em->start, off);
>>>>> +     ret = fiemap_find_last_extent_offset(inode, path, &last_extent_end);
>>>>> +     if (ret < 0)
>>>>> +             goto out_unlock;
>>>>> +     btrfs_release_path(path);
>>>>>
>>>>> +     path->reada = READA_FORWARD;
>>>>> +     ret = fiemap_search_slot(inode, path, lockstart);
>>>>> +     if (ret < 0) {
>>>>> +             goto out_unlock;
>>>>> +     } else if (ret > 0) {
>>>>>                 /*
>>>>> -              * record the offset from the start of the extent
>>>>> -              * for adjusting the disk offset below.  Only do this if the
>>>>> -              * extent isn't compressed since our in ram offset may be past
>>>>> -              * what we have actually allocated on disk.
>>>>> +              * No file extent item found, but we may have delalloc between
>>>>> +              * the current offset and i_size. So check for that.
>>>>>                  */
>>>>> -             if (!test_bit(EXTENT_FLAG_COMPRESSED, &em->flags))
>>>>> -                     offset_in_extent = em_start - em->start;
>>>>> -             em_end = extent_map_end(em);
>>>>> -             em_len = em_end - em_start;
>>>>> -             flags = 0;
>>>>> -             if (em->block_start < EXTENT_MAP_LAST_BYTE)
>>>>> -                     disko = em->block_start + offset_in_extent;
>>>>> -             else
>>>>> -                     disko = 0;
>>>>> +             ret = 0;
>>>>> +             goto check_eof_delalloc;
>>>>> +     }
>>>>> +
>>>>> +     while (prev_extent_end < lockend) {
>>>>> +             struct extent_buffer *leaf = path->nodes[0];
>>>>> +             struct btrfs_file_extent_item *ei;
>>>>> +             struct btrfs_key key;
>>>>> +             u64 extent_end;
>>>>> +             u64 extent_len;
>>>>> +             u64 extent_offset = 0;
>>>>> +             u64 extent_gen;
>>>>> +             u64 disk_bytenr = 0;
>>>>> +             u64 flags = 0;
>>>>> +             int extent_type;
>>>>> +             u8 compression;
>>>>> +
>>>>> +             btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
>>>>> +             if (key.objectid != ino || key.type != BTRFS_EXTENT_DATA_KEY)
>>>>> +                     break;
>>>>> +
>>>>> +             extent_end = btrfs_file_extent_end(path);
>>>>>
>>>>>                 /*
>>>>> -              * bump off for our next call to get_extent
>>>>> +              * The first iteration can leave us at an extent item that ends
>>>>> +              * before our range's start. Move to the next item.
>>>>>                  */
>>>>> -             off = extent_map_end(em);
>>>>> -             if (off >= max)
>>>>> -                     end = 1;
>>>>> -
>>>>> -             if (em->block_start == EXTENT_MAP_INLINE) {
>>>>> -                     flags |= (FIEMAP_EXTENT_DATA_INLINE |
>>>>> -                               FIEMAP_EXTENT_NOT_ALIGNED);
>>>>> -             } else if (em->block_start == EXTENT_MAP_DELALLOC) {
>>>>> -                     flags |= (FIEMAP_EXTENT_DELALLOC |
>>>>> -                               FIEMAP_EXTENT_UNKNOWN);
>>>>> -             } else if (fieinfo->fi_extents_max) {
>>>>> -                     u64 extent_gen;
>>>>> -                     u64 bytenr = em->block_start -
>>>>> -                             (em->start - em->orig_start);
>>>>> +             if (extent_end <= lockstart)
>>>>> +                     goto next_item;
>>>>>
>>>>> -                     /*
>>>>> -                      * If two extent maps are merged, then their generation
>>>>> -                      * is set to the maximum between their generations.
>>>>> -                      * Otherwise its generation matches the one we have in
>>>>> -                      * corresponding file extent item. If we have a merged
>>>>> -                      * extent map, don't use its generation to speedup the
>>>>> -                      * sharedness check below.
>>>>> -                      */
>>>>> -                     if (test_bit(EXTENT_FLAG_MERGED, &em->flags))
>>>>> -                             extent_gen = 0;
>>>>> -                     else
>>>>> -                             extent_gen = em->generation;
>>>>> +             /* We have in implicit hole (NO_HOLES feature enabled). */
>>>>> +             if (prev_extent_end < key.offset) {
>>>>> +                     const u64 range_end = min(key.offset, lockend) - 1;
>>>>>
>>>>> -                     /*
>>>>> -                      * As btrfs supports shared space, this information
>>>>> -                      * can be exported to userspace tools via
>>>>> -                      * flag FIEMAP_EXTENT_SHARED.  If fi_extents_max == 0
>>>>> -                      * then we're just getting a count and we can skip the
>>>>> -                      * lookup stuff.
>>>>> -                      */
>>>>> -                     ret = btrfs_is_data_extent_shared(root, btrfs_ino(inode),
>>>>> -                                                       bytenr, extent_gen,
>>>>> -                                                       roots, tmp_ulist,
>>>>> -                                                       backref_cache);
>>>>> -                     if (ret < 0)
>>>>> -                             goto out_free;
>>>>> -                     if (ret)
>>>>> -                             flags |= FIEMAP_EXTENT_SHARED;
>>>>> -                     ret = 0;
>>>>> -             }
>>>>> -             if (test_bit(EXTENT_FLAG_COMPRESSED, &em->flags))
>>>>> -                     flags |= FIEMAP_EXTENT_ENCODED;
>>>>> -             if (test_bit(EXTENT_FLAG_PREALLOC, &em->flags))
>>>>> -                     flags |= FIEMAP_EXTENT_UNWRITTEN;
>>>>> +                     ret = fiemap_process_hole(inode, fieinfo, &cache,
>>>>> +                                               backref_cache, 0, 0, 0,
>>>>> +                                               roots, tmp_ulist,
>>>>> +                                               prev_extent_end, range_end);
>>>>> +                     if (ret < 0) {
>>>>> +                             goto out_unlock;
>>>>> +                     } else if (ret > 0) {
>>>>> +                             /* fiemap_fill_next_extent() told us to stop. */
>>>>> +                             stopped = true;
>>>>> +                             break;
>>>>> +                     }
>>>>>
>>>>> -             free_extent_map(em);
>>>>> -             em = NULL;
>>>>> -             if ((em_start >= last) || em_len == (u64)-1 ||
>>>>> -                (last == (u64)-1 && isize <= em_end)) {
>>>>> -                     flags |= FIEMAP_EXTENT_LAST;
>>>>> -                     end = 1;
>>>>> +                     /* We've reached the end of the fiemap range, stop. */
>>>>> +                     if (key.offset >= lockend) {
>>>>> +                             stopped = true;
>>>>> +                             break;
>>>>> +                     }
>>>>>                 }
>>>>>
>>>>> -             /* now scan forward to see if this is really the last extent. */
>>>>> -             em = get_extent_skip_holes(inode, off, last_for_get_extent);
>>>>> -             if (IS_ERR(em)) {
>>>>> -                     ret = PTR_ERR(em);
>>>>> -                     goto out;
>>>>> +             extent_len = extent_end - key.offset;
>>>>> +             ei = btrfs_item_ptr(leaf, path->slots[0],
>>>>> +                                 struct btrfs_file_extent_item);
>>>>> +             compression = btrfs_file_extent_compression(leaf, ei);
>>>>> +             extent_type = btrfs_file_extent_type(leaf, ei);
>>>>> +             extent_gen = btrfs_file_extent_generation(leaf, ei);
>>>>> +
>>>>> +             if (extent_type != BTRFS_FILE_EXTENT_INLINE) {
>>>>> +                     disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, ei);
>>>>> +                     if (compression == BTRFS_COMPRESS_NONE)
>>>>> +                             extent_offset = btrfs_file_extent_offset(leaf, ei);
>>>>>                 }
>>>>> -             if (!em) {
>>>>> -                     flags |= FIEMAP_EXTENT_LAST;
>>>>> -                     end = 1;
>>>>> +
>>>>> +             if (compression != BTRFS_COMPRESS_NONE)
>>>>> +                     flags |= FIEMAP_EXTENT_ENCODED;
>>>>> +
>>>>> +             if (extent_type == BTRFS_FILE_EXTENT_INLINE) {
>>>>> +                     flags |= FIEMAP_EXTENT_DATA_INLINE;
>>>>> +                     flags |= FIEMAP_EXTENT_NOT_ALIGNED;
>>>>> +                     ret = emit_fiemap_extent(fieinfo, &cache, key.offset, 0,
>>>>> +                                              extent_len, flags);
>>>>> +             } else if (extent_type == BTRFS_FILE_EXTENT_PREALLOC) {
>>>>> +                     ret = fiemap_process_hole(inode, fieinfo, &cache,
>>>>> +                                               backref_cache,
>>>>> +                                               disk_bytenr, extent_offset,
>>>>> +                                               extent_gen, roots, tmp_ulist,
>>>>> +                                               key.offset, extent_end - 1);
>>>>> +             } else if (disk_bytenr == 0) {
>>>>> +                     /* We have an explicit hole. */
>>>>> +                     ret = fiemap_process_hole(inode, fieinfo, &cache,
>>>>> +                                               backref_cache, 0, 0, 0,
>>>>> +                                               roots, tmp_ulist,
>>>>> +                                               key.offset, extent_end - 1);
>>>>> +             } else {
>>>>> +                     /* We have a regular extent. */
>>>>> +                     if (fieinfo->fi_extents_max) {
>>>>> +                             ret = btrfs_is_data_extent_shared(root, ino,
>>>>> +                                                               disk_bytenr,
>>>>> +                                                               extent_gen,
>>>>> +                                                               roots,
>>>>> +                                                               tmp_ulist,
>>>>> +                                                               backref_cache);
>>>>> +                             if (ret < 0)
>>>>> +                                     goto out_unlock;
>>>>> +                             else if (ret > 0)
>>>>> +                                     flags |= FIEMAP_EXTENT_SHARED;
>>>>> +                     }
>>>>> +
>>>>> +                     ret = emit_fiemap_extent(fieinfo, &cache, key.offset,
>>>>> +                                              disk_bytenr + extent_offset,
>>>>> +                                              extent_len, flags);
>>>>>                 }
>>>>> -             ret = emit_fiemap_extent(fieinfo, &cache, em_start, disko,
>>>>> -                                        em_len, flags);
>>>>> -             if (ret) {
>>>>> -                     if (ret == 1)
>>>>> -                             ret = 0;
>>>>> -                     goto out_free;
>>>>> +
>>>>> +             if (ret < 0) {
>>>>> +                     goto out_unlock;
>>>>> +             } else if (ret > 0) {
>>>>> +                     /* fiemap_fill_next_extent() told us to stop. */
>>>>> +                     stopped = true;
>>>>> +                     break;
>>>>>                 }
>>>>>
>>>>> +             prev_extent_end = extent_end;
>>>>> +next_item:
>>>>>                 if (fatal_signal_pending(current)) {
>>>>>                         ret = -EINTR;
>>>>> -                     goto out_free;
>>>>> +                     goto out_unlock;
>>>>>                 }
>>>>> +
>>>>> +             ret = fiemap_next_leaf_item(inode, path);
>>>>> +             if (ret < 0) {
>>>>> +                     goto out_unlock;
>>>>> +             } else if (ret > 0) {
>>>>> +                     /* No more file extent items for this inode. */
>>>>> +                     break;
>>>>> +             }
>>>>> +             cond_resched();
>>>>>         }
>>>>> -out_free:
>>>>> -     if (!ret)
>>>>> -             ret = emit_last_fiemap_cache(fieinfo, &cache);
>>>>> -     free_extent_map(em);
>>>>> -out:
>>>>> -     unlock_extent_cached(&inode->io_tree, start, start + len - 1,
>>>>> -                          &cached_state);
>>>>>
>>>>> -out_free_ulist:
>>>>> +check_eof_delalloc:
>>>>> +     /*
>>>>> +      * Release (and free) the path before emitting any final entries to
>>>>> +      * fiemap_fill_next_extent() to keep lockdep happy. This is because
>>>>> +      * once we find no more file extent items exist, we may have a
>>>>> +      * non-cloned leaf, and fiemap_fill_next_extent() can trigger page
>>>>> +      * faults when copying data to the user space buffer.
>>>>> +      */
>>>>> +     btrfs_free_path(path);
>>>>> +     path = NULL;
>>>>> +
>>>>> +     if (!stopped && prev_extent_end < lockend) {
>>>>> +             ret = fiemap_process_hole(inode, fieinfo, &cache, backref_cache,
>>>>> +                                       0, 0, 0, roots, tmp_ulist,
>>>>> +                                       prev_extent_end, lockend - 1);
>>>>> +             if (ret < 0)
>>>>> +                     goto out_unlock;
>>>>> +             prev_extent_end = lockend;
>>>>> +     }
>>>>> +
>>>>> +     if (cache.cached && cache.offset + cache.len >= last_extent_end) {
>>>>> +             const u64 i_size = i_size_read(&inode->vfs_inode);
>>>>> +
>>>>> +             if (prev_extent_end < i_size) {
>>>>> +                     u64 delalloc_start;
>>>>> +                     u64 delalloc_end;
>>>>> +                     bool delalloc;
>>>>> +
>>>>> +                     delalloc = btrfs_find_delalloc_in_range(inode,
>>>>> +                                                             prev_extent_end,
>>>>> +                                                             i_size - 1,
>>>>> +                                                             &delalloc_start,
>>>>> +                                                             &delalloc_end);
>>>>> +                     if (!delalloc)
>>>>> +                             cache.flags |= FIEMAP_EXTENT_LAST;
>>>>> +             } else {
>>>>> +                     cache.flags |= FIEMAP_EXTENT_LAST;
>>>>> +             }
>>>>> +     }
>>>>> +
>>>>> +     ret = emit_last_fiemap_cache(fieinfo, &cache);
>>>>> +
>>>>> +out_unlock:
>>>>> +     unlock_extent_cached(&inode->io_tree, lockstart, lockend, &cached_state);
>>>>> +out:
>>>>>         kfree(backref_cache);
>>>>>         btrfs_free_path(path);
>>>>>         ulist_free(roots);
>>>>> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
>>>>> index b292a8ada3a4..636b3ec46184 100644
>>>>> --- a/fs/btrfs/file.c
>>>>> +++ b/fs/btrfs/file.c
>>>>> @@ -3602,10 +3602,10 @@ static long btrfs_fallocate(struct file *file, int mode,
>>>>>     }
>>>>>
>>>>>     /*
>>>>> - * Helper for have_delalloc_in_range(). Find a subrange in a given range that
>>>>> - * has unflushed and/or flushing delalloc. There might be other adjacent
>>>>> - * subranges after the one it found, so have_delalloc_in_range() keeps looping
>>>>> - * while it gets adjacent subranges, and merging them together.
>>>>> + * Helper for btrfs_find_delalloc_in_range(). Find a subrange in a given range
>>>>> + * that has unflushed and/or flushing delalloc. There might be other adjacent
>>>>> + * subranges after the one it found, so btrfs_find_delalloc_in_range() keeps
>>>>> + * looping while it gets adjacent subranges, and merging them together.
>>>>>      */
>>>>>     static bool find_delalloc_subrange(struct btrfs_inode *inode, u64 start, u64 end,
>>>>>                                    u64 *delalloc_start_ret, u64 *delalloc_end_ret)
>>>>> @@ -3740,8 +3740,8 @@ static bool find_delalloc_subrange(struct btrfs_inode *inode, u64 start, u64 end
>>>>>      * if so it sets @delalloc_start_ret and @delalloc_end_ret with the start and
>>>>>      * end offsets of the subrange.
>>>>>      */
>>>>> -static bool have_delalloc_in_range(struct btrfs_inode *inode, u64 start, u64 end,
>>>>> -                                u64 *delalloc_start_ret, u64 *delalloc_end_ret)
>>>>> +bool btrfs_find_delalloc_in_range(struct btrfs_inode *inode, u64 start, u64 end,
>>>>> +                               u64 *delalloc_start_ret, u64 *delalloc_end_ret)
>>>>>     {
>>>>>         u64 cur_offset = round_down(start, inode->root->fs_info->sectorsize);
>>>>>         u64 prev_delalloc_end = 0;
>>>>> @@ -3804,8 +3804,8 @@ static bool find_desired_extent_in_hole(struct btrfs_inode *inode, int whence,
>>>>>         u64 delalloc_end;
>>>>>         bool delalloc;
>>>>>
>>>>> -     delalloc = have_delalloc_in_range(inode, start, end, &delalloc_start,
>>>>> -                                       &delalloc_end);
>>>>> +     delalloc = btrfs_find_delalloc_in_range(inode, start, end,
>>>>> +                                             &delalloc_start, &delalloc_end);
>>>>>         if (delalloc && whence == SEEK_DATA) {
>>>>>                 *start_ret = delalloc_start;
>>>>>                 return true;
>>>>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
>>>>> index 2c7d31990777..8be1e021513a 100644
>>>>> --- a/fs/btrfs/inode.c
>>>>> +++ b/fs/btrfs/inode.c
>>>>> @@ -7064,133 +7064,6 @@ struct extent_map *btrfs_get_extent(struct btrfs_inode *inode,
>>>>>         return em;
>>>>>     }
>>>>>
>>>>> -struct extent_map *btrfs_get_extent_fiemap(struct btrfs_inode *inode,
>>>>> -                                        u64 start, u64 len)
>>>>> -{
>>>>> -     struct extent_map *em;
>>>>> -     struct extent_map *hole_em = NULL;
>>>>> -     u64 delalloc_start = start;
>>>>> -     u64 end;
>>>>> -     u64 delalloc_len;
>>>>> -     u64 delalloc_end;
>>>>> -     int err = 0;
>>>>> -
>>>>> -     em = btrfs_get_extent(inode, NULL, 0, start, len);
>>>>> -     if (IS_ERR(em))
>>>>> -             return em;
>>>>> -     /*
>>>>> -      * If our em maps to:
>>>>> -      * - a hole or
>>>>> -      * - a pre-alloc extent,
>>>>> -      * there might actually be delalloc bytes behind it.
>>>>> -      */
>>>>> -     if (em->block_start != EXTENT_MAP_HOLE &&
>>>>> -         !test_bit(EXTENT_FLAG_PREALLOC, &em->flags))
>>>>> -             return em;
>>>>> -     else
>>>>> -             hole_em = em;
>>>>> -
>>>>> -     /* check to see if we've wrapped (len == -1 or similar) */
>>>>> -     end = start + len;
>>>>> -     if (end < start)
>>>>> -             end = (u64)-1;
>>>>> -     else
>>>>> -             end -= 1;
>>>>> -
>>>>> -     em = NULL;
>>>>> -
>>>>> -     /* ok, we didn't find anything, lets look for delalloc */
>>>>> -     delalloc_len = count_range_bits(&inode->io_tree, &delalloc_start,
>>>>> -                              end, len, EXTENT_DELALLOC, 1);
>>>>> -     delalloc_end = delalloc_start + delalloc_len;
>>>>> -     if (delalloc_end < delalloc_start)
>>>>> -             delalloc_end = (u64)-1;
>>>>> -
>>>>> -     /*
>>>>> -      * We didn't find anything useful, return the original results from
>>>>> -      * get_extent()
>>>>> -      */
>>>>> -     if (delalloc_start > end || delalloc_end <= start) {
>>>>> -             em = hole_em;
>>>>> -             hole_em = NULL;
>>>>> -             goto out;
>>>>> -     }
>>>>> -
>>>>> -     /*
>>>>> -      * Adjust the delalloc_start to make sure it doesn't go backwards from
>>>>> -      * the start they passed in
>>>>> -      */
>>>>> -     delalloc_start = max(start, delalloc_start);
>>>>> -     delalloc_len = delalloc_end - delalloc_start;
>>>>> -
>>>>> -     if (delalloc_len > 0) {
>>>>> -             u64 hole_start;
>>>>> -             u64 hole_len;
>>>>> -             const u64 hole_end = extent_map_end(hole_em);
>>>>> -
>>>>> -             em = alloc_extent_map();
>>>>> -             if (!em) {
>>>>> -                     err = -ENOMEM;
>>>>> -                     goto out;
>>>>> -             }
>>>>> -
>>>>> -             ASSERT(hole_em);
>>>>> -             /*
>>>>> -              * When btrfs_get_extent can't find anything it returns one
>>>>> -              * huge hole
>>>>> -              *
>>>>> -              * Make sure what it found really fits our range, and adjust to
>>>>> -              * make sure it is based on the start from the caller
>>>>> -              */
>>>>> -             if (hole_end <= start || hole_em->start > end) {
>>>>> -                    free_extent_map(hole_em);
>>>>> -                    hole_em = NULL;
>>>>> -             } else {
>>>>> -                    hole_start = max(hole_em->start, start);
>>>>> -                    hole_len = hole_end - hole_start;
>>>>> -             }
>>>>> -
>>>>> -             if (hole_em && delalloc_start > hole_start) {
>>>>> -                     /*
>>>>> -                      * Our hole starts before our delalloc, so we have to
>>>>> -                      * return just the parts of the hole that go until the
>>>>> -                      * delalloc starts
>>>>> -                      */
>>>>> -                     em->len = min(hole_len, delalloc_start - hole_start);
>>>>> -                     em->start = hole_start;
>>>>> -                     em->orig_start = hole_start;
>>>>> -                     /*
>>>>> -                      * Don't adjust block start at all, it is fixed at
>>>>> -                      * EXTENT_MAP_HOLE
>>>>> -                      */
>>>>> -                     em->block_start = hole_em->block_start;
>>>>> -                     em->block_len = hole_len;
>>>>> -                     if (test_bit(EXTENT_FLAG_PREALLOC, &hole_em->flags))
>>>>> -                             set_bit(EXTENT_FLAG_PREALLOC, &em->flags);
>>>>> -             } else {
>>>>> -                     /*
>>>>> -                      * Hole is out of passed range or it starts after
>>>>> -                      * delalloc range
>>>>> -                      */
>>>>> -                     em->start = delalloc_start;
>>>>> -                     em->len = delalloc_len;
>>>>> -                     em->orig_start = delalloc_start;
>>>>> -                     em->block_start = EXTENT_MAP_DELALLOC;
>>>>> -                     em->block_len = delalloc_len;
>>>>> -             }
>>>>> -     } else {
>>>>> -             return hole_em;
>>>>> -     }
>>>>> -out:
>>>>> -
>>>>> -     free_extent_map(hole_em);
>>>>> -     if (err) {
>>>>> -             free_extent_map(em);
>>>>> -             return ERR_PTR(err);
>>>>> -     }
>>>>> -     return em;
>>>>> -}
>>>>> -
>>>>>     static struct extent_map *btrfs_create_dio_extent(struct btrfs_inode *inode,
>>>>>                                                   const u64 start,
>>>>>                                                   const u64 len,
>>>>> @@ -8265,15 +8138,14 @@ static int btrfs_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
>>>>>          * in the compression of data (in an async thread) and will return
>>>>>          * before the compression is done and writeback is started. A second
>>>>>          * filemap_fdatawrite_range() is needed to wait for the compression to
>>>>> -      * complete and writeback to start. Without this, our user is very
>>>>> -      * likely to get stale results, because the extents and extent maps for
>>>>> -      * delalloc regions are only allocated when writeback starts.
>>>>> +      * complete and writeback to start. We also need to wait for ordered
>>>>> +      * extents to complete, because our fiemap implementation uses mainly
>>>>> +      * file extent items to list the extents, searching for extent maps
>>>>> +      * only for file ranges with holes or prealloc extents to figure out
>>>>> +      * if we have delalloc in those ranges.
>>>>>          */
>>>>>         if (fieinfo->fi_flags & FIEMAP_FLAG_SYNC) {
>>>>> -             ret = btrfs_fdatawrite_range(inode, 0, LLONG_MAX);
>>>>> -             if (ret)
>>>>> -                     return ret;
>>>>> -             ret = filemap_fdatawait_range(inode->i_mapping, 0, LLONG_MAX);
>>>>> +             ret = btrfs_wait_ordered_range(inode, 0, LLONG_MAX);
>>>>>                 if (ret)
>>>>>                         return ret;
>>>>>         }