From: Josef Bacik <josef@toxicpanda.com>
To: fdmanana@kernel.org
Cc: linux-btrfs@vger.kernel.org
Subject: Re: [PATCH 02/10] btrfs: make hole and data seeking a lot more efficient
Date: Thu, 1 Sep 2022 10:03:03 -0400 [thread overview]
Message-ID: <YxC7l6cL2GtilEf3@localhost.localdomain> (raw)
In-Reply-To: <246cd5358b28e3e11a96fe2abd0a4a34840cdb85.1662022922.git.fdmanana@suse.com>
On Thu, Sep 01, 2022 at 02:18:22PM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
>
> The current implementation of hole and data seeking for llseek does not
> scale well with regard to the number of extents and the distance between
> the start offset and the next hole or extent. This is due to a very high
> algorithmic complexity. Often we also get reports of btrfs' hole and data
> seeking (llseek) being too slow, such as at 2017's LSFMM (see the Link
> tag at the bottom).
>
> In order to better understand it, let's consider the case where the start
> offset is 0, we are seeking for a hole and the file size is 16G. Between
> file offset 0 and the first hole in the file there are 100K extents - this
> is common for large files, especially if we have compression enabled, since
> the maximum extent size is limited to 128K. The steps taken by the main
> loop of the current algorithm are the following:
>
> 1) We start by calling btrfs_get_extent_fiemap(), for file offset 0, which
> calls btrfs_get_extent(). This will first look up an extent map in
> the inode's extent map tree (a red black tree). If the extent map is
> not loaded in memory, then it will do a lookup for the corresponding
> file extent item in the subvolume's b+tree, create an extent map based
> on the contents of the file extent item and then add the extent map to
> the extent map tree of the inode;
>
> 2) The second iteration calls btrfs_get_extent_fiemap() again, this time
> with a start offset matching the end offset of the previous extent.
> Again, btrfs_get_extent() will first search the extent map tree, and
> if it doesn't find an extent map there, it will again search in the
> b+tree of the subvolume for a matching file extent item, build an
> extent map based on the file extent item, and add the extent map to
> the extent map tree of the inode;
>
> 3) This repeats over and over until we find the first hole (when seeking
> for holes) or until we find the first extent (when seeking for data).
>
> If there are no extent maps loaded in memory, then on each iteration we
> do 1 extent map tree search, 1 b+tree search, plus 1 more extent map
> tree traversal to insert an extent map - plus we allocate memory for
> the extent map.
>
> On each iteration we are growing the size of the extent map tree,
> making each future search slower, and also visiting the same b+tree
> leaves over and over again - taking into account that with the default leaf
> size of 16K we can fit more than 200 file extent items in a leaf - so
> we can visit the same b+tree leaf 200+ times, on each visit walking
> down a path from the root to the leaf.
>
> So it's easy to see that what we have now doesn't scale well. Also, it
> loads an extent map for every file extent item into memory, which is not
> efficient - we should add extent maps only when doing IO (writing or
> reading file data).
>
> This change implements a new algorithm which scales much better, and
> works like this:
>
> 1) We iterate over the subvolume's b+tree, visiting each leaf that has
> file extent items once and only once;
>
> 2) For any file extent items found that don't represent holes or prealloc
> extents, it will not search the extent map tree - there's no need at
> all for that - an extent map is just an in-memory representation of a
> file extent item;
>
> 3) When a hole is found, or a prealloc extent, it will check if there's
> delalloc for its range. For this it will search for EXTENT_DELALLOC
> bits in the inode's io tree and check the extent map tree - this is
> for accounting for unflushed delalloc and for flushed delalloc (the
> period between running delalloc and ordered extent completion),
> respectively. This is similar to what the current implementation does
> when it finds a hole or prealloc extent, but without creating extent
> maps and adding them to the extent map tree in case they are not
> loaded in memory;
>
> 4) It never allocates extent maps, or adds extent maps to the inode's
> extent map tree. This not only saves memory and time (from the tree
> insertions and allocations), but also eliminates the possibility of
> -ENOMEM due to allocating too many extent maps.
>
> Part of this new code will also be used later for fiemap (which also
> suffers similar scalability problems).
>
> The following test example can be used to quickly measure the efficiency
> before and after this patch:
>
> $ cat test-seek-hole.sh
> #!/bin/bash
>
> DEV=/dev/sdi
> MNT=/mnt/sdi
>
> mkfs.btrfs -f $DEV
>
> mount -o compress=lzo $DEV $MNT
>
> # 16G file -> 131073 compressed extents.
> xfs_io -f -c "pwrite -S 0xab -b 1M 0 16G" $MNT/foobar
>
> # Leave a 1M hole at file offset 15G.
> xfs_io -c "fpunch 15G 1M" $MNT/foobar
>
> # Unmount and mount again, so that we can test when there's no
> # metadata cached in memory.
> umount $MNT
> mount -o compress=lzo $DEV $MNT
>
> # Test seeking for hole from offset 0 (hole is at offset 15G).
>
> start=$(date +%s%N)
> xfs_io -c "seek -h 0" $MNT/foobar
> end=$(date +%s%N)
> dur=$(( (end - start) / 1000000 ))
> echo "Took $dur milliseconds to seek first hole (metadata not cached)"
> echo
>
> start=$(date +%s%N)
> xfs_io -c "seek -h 0" $MNT/foobar
> end=$(date +%s%N)
> dur=$(( (end - start) / 1000000 ))
> echo "Took $dur milliseconds to seek first hole (metadata cached)"
> echo
>
> umount $MNT
>
> Before this change:
>
> $ ./test-seek-hole.sh
> (...)
> Whence Result
> HOLE 16106127360
> Took 176 milliseconds to seek first hole (metadata not cached)
>
> Whence Result
> HOLE 16106127360
> Took 17 milliseconds to seek first hole (metadata cached)
>
> After this change:
>
> $ ./test-seek-hole.sh
> (...)
> Whence Result
> HOLE 16106127360
> Took 43 milliseconds to seek first hole (metadata not cached)
>
> Whence Result
> HOLE 16106127360
> Took 13 milliseconds to seek first hole (metadata cached)
>
> That's about 4X faster when no metadata is cached and about 30% faster
> when all metadata is cached.
>
> In practice the differences may often be significantly higher, either due
> to a higher number of extents in a file or because the subvolume's b+tree
> is much bigger than in this example, where we only have one file.
>
> Link: https://lwn.net/Articles/718805/
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
> fs/btrfs/file.c | 437 ++++++++++++++++++++++++++++++++++++++++++++----
> 1 file changed, 406 insertions(+), 31 deletions(-)
>
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 96f444ad0951..b292a8ada3a4 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -3601,22 +3601,281 @@ static long btrfs_fallocate(struct file *file, int mode,
> return ret;
> }
>
> +/*
> + * Helper for have_delalloc_in_range(). Find a subrange in a given range that
> + * has unflushed and/or flushing delalloc. There might be other adjacent
> + * subranges after the one it found, so have_delalloc_in_range() keeps looping
> + * while it gets adjacent subranges, and merging them together.
> + */
> +static bool find_delalloc_subrange(struct btrfs_inode *inode, u64 start, u64 end,
> + u64 *delalloc_start_ret, u64 *delalloc_end_ret)
> +{
> + const u64 len = end + 1 - start;
> + struct extent_map_tree *em_tree = &inode->extent_tree;
> + struct extent_map *em;
> + u64 em_end;
> + u64 delalloc_len;
> +
> + /*
> + * Search the io tree first for EXTENT_DELALLOC. If we find any, it
> + * means we have delalloc (dirty pages) for which writeback has not
> + * started yet.
> + */
> + *delalloc_start_ret = start;
> + delalloc_len = count_range_bits(&inode->io_tree, delalloc_start_ret, end,
> + len, EXTENT_DELALLOC, 1);
> + /*
> + * If delalloc was found then *delalloc_start_ret has a sector size
> + * aligned value (rounded down).
> + */
> + if (delalloc_len > 0)
> + *delalloc_end_ret = *delalloc_start_ret + delalloc_len - 1;
> +
> + /*
> + * Now also check if there's any extent map in the range that does not
> + * map to a hole or prealloc extent. We do this because:
> + *
> + * 1) When delalloc is flushed, the file range is locked, we clear the
> + * EXTENT_DELALLOC bit from the io tree and create an extent map for
> + * an allocated extent. So we might just have been called after
> + * delalloc is flushed and before the ordered extent completes and
> + * inserts the new file extent item in the subvolume's btree;
> + *
> + * 2) We may have an extent map created by flushing delalloc for a
> + * subrange that starts before the subrange we found marked with
> + * EXTENT_DELALLOC in the io tree.
> + */
> + read_lock(&em_tree->lock);
> + em = lookup_extent_mapping(em_tree, start, len);
> + read_unlock(&em_tree->lock);
> +
> + /* extent_map_end() returns a non-inclusive end offset. */
> + em_end = em ? extent_map_end(em) : 0;
> +
> + /*
> + * If we have a hole/prealloc extent map, check the next one if this one
> + * ends before our range's end.
> + */
> + if (em && (em->block_start == EXTENT_MAP_HOLE ||
> + test_bit(EXTENT_FLAG_PREALLOC, &em->flags)) && em_end < end) {
> + struct extent_map *next_em;
> +
> + read_lock(&em_tree->lock);
> + next_em = lookup_extent_mapping(em_tree, em_end, len - em_end);
> + read_unlock(&em_tree->lock);
> +
> + free_extent_map(em);
> + em_end = next_em ? extent_map_end(next_em) : 0;
> + em = next_em;
> + }
> +
> + if (em && (em->block_start == EXTENT_MAP_HOLE ||
> + test_bit(EXTENT_FLAG_PREALLOC, &em->flags))) {
> + free_extent_map(em);
> + em = NULL;
> + }
> +
> + /*
> + * No extent map or one for a hole or prealloc extent. Use the delalloc
> + * range we found in the io tree if we have one.
> + */
> + if (!em)
> + return (delalloc_len > 0);
> +
You can move this check right after the lookup, and then remove the "if (em &&"
parts above. Then all you need to do in the second if statement is
return (delalloc_len > 0);
Thanks,
Josef