linux-btrfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Qu Wenruo <quwenruo.btrfs@gmx.com>
To: fdmanana@kernel.org, linux-btrfs@vger.kernel.org
Subject: Re: [PATCH 09/10] btrfs: skip unnecessary extent buffer sharedness checks during fiemap
Date: Fri, 2 Sep 2022 07:01:11 +0800	[thread overview]
Message-ID: <97ffaadc-4b8f-96bb-6ed4-6857c13919c2@gmx.com> (raw)
In-Reply-To: <d80f75e12d0212da59cbcccac2eddd506c8998af.1662022922.git.fdmanana@suse.com>



On 2022/9/1 21:18, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
>
> During fiemap, for each file extent we find, we must check if it's shared
> or not. The sharedness check starts by verifying if the extent is directly
> shared (its refcount in the extent tree is > 1), and if it is not directly
> shared, then we will check if every node in the subvolume b+tree leading
> from the root to the leaf that has the file extent item (in reverse order),
> is shared (through snapshots).
>
> However this second step is not needed if our extent was created in a
> transaction more recent than the last transaction where a snapshot of the
> inode's root happened, because it can't be shared indirectly (through
> shared subtrees) without a snapshot created in a more recent transaction.

This is a pretty awesome!

>
> So grab the generation of the extent from the extent map and pass it to
> btrfs_is_data_extent_shared(), which will skip this second phase when the
> generation is more recent than the root's last snapshot value. Note that
> we skip this optimization if the extent map is the result of merging 2
> or more extent maps, because in this case its generation is the maximum
> of the generations of all merged extent maps.

And this pitfall also taken into consideration, even better.

>
> The fact the we use extent maps and they can be merged despite the
> underlying extents being distinct (different file extent items in the
> subvolume b+tree and different extent items in the extent b+tree), can
> result in some bugs when reporting shared extents. But this is a problem
> of the current implementation of fiemap relying on extent maps.
> One example where we get incorrect results is:
>
>      $ cat fiemap-bug.sh
>      #!/bin/bash
>
>      DEV=/dev/sdj
>      MNT=/mnt/sdj
>
>      mkfs.btrfs -f $DEV
>      mount $DEV $MNT
>
>      # Create a file with two 256K extents.
>      # Since there is no other write activity, they will be contiguous,
>      # and their extent maps merged, despite having two distinct extents.
>      xfs_io -f -c "pwrite -S 0xab 0 256K" \
>                -c "fsync" \
>                -c "pwrite -S 0xcd 256K 256K" \
>                -c "fsync" \
>                $MNT/foo
>
>      # Now clone only the second extent into another file.
>      xfs_io -f -c "reflink $MNT/foo 256K 0 256K" $MNT/bar
>
>      # Filefrag will report a single 512K extent, and say it's not shared.
>      echo
>      filefrag -v $MNT/foo
>
>      umount $MNT
>
> Running the reproducer:
>
>      $ ./fiemap-bug.sh
>      wrote 262144/262144 bytes at offset 0
>      256 KiB, 64 ops; 0.0038 sec (65.479 MiB/sec and 16762.7030 ops/sec)
>      wrote 262144/262144 bytes at offset 262144
>      256 KiB, 64 ops; 0.0040 sec (61.125 MiB/sec and 15647.9218 ops/sec)
>      linked 262144/262144 bytes at offset 0
>      256 KiB, 1 ops; 0.0002 sec (1.034 GiB/sec and 4237.2881 ops/sec)
>
>      Filesystem type is: 9123683e
>      File size of /mnt/sdj/foo is 524288 (128 blocks of 4096 bytes)
>       ext:     logical_offset:        physical_offset: length:   expected: flags:
>         0:        0..     127:       3328..      3455:    128:             last,eof
>      /mnt/sdj/foo: 1 extent found
>
> We end up reporting that we have a single 512K that is not shared, however
> we have two 256K extents, and the second one is shared. Changing the
> reproducer to clone instead the first extent into file 'bar', makes us
> report a single 512K extent that is shared, which is algo incorrect since
> we have two 256K extents and only the first one is shared.
>
> This is z problem that existed before this change, and remains after this
> change, as it can't be easily fixed. The next patch in the series reworks
> fiemap to primarily use file extent items instead of extent maps (except
> for checking for delalloc ranges), with the goal of improving its
> scalability and performance, but it also ends up fixing this particular
> bug caused by extent map merging.
>
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

Reviewed-by: Qu Wenruo <wqu@suse.com>

Thanks,
Qu
> ---
>   fs/btrfs/backref.c   | 27 +++++++++++++++++++++------
>   fs/btrfs/backref.h   |  1 +
>   fs/btrfs/extent_io.c | 18 ++++++++++++++++--
>   3 files changed, 38 insertions(+), 8 deletions(-)
>
> diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
> index 40b48abb6978..bf4ca4a82550 100644
> --- a/fs/btrfs/backref.c
> +++ b/fs/btrfs/backref.c
> @@ -1613,12 +1613,14 @@ static void store_backref_shared_cache(struct btrfs_backref_shared_cache *cache,
>   /**
>    * Check if a data extent is shared or not.
>    *
> - * @root:   root inode belongs to
> - * @inum:   inode number of the inode whose extent we are checking
> - * @bytenr: logical bytenr of the extent we are checking
> - * @roots:  list of roots this extent is shared among
> - * @tmp:    temporary list used for iteration
> - * @cache:  a backref lookup result cache
> + * @root:        The root the inode belongs to.
> + * @inum:        Number of the inode whose extent we are checking.
> + * @bytenr:      Logical bytenr of the extent we are checking.
> + * @extent_gen:  Generation of the extent (file extent item) or 0 if it is
> + *               not known.
> + * @roots:       List of roots this extent is shared among.
> + * @tmp:         Temporary list used for iteration.
> + * @cache:       A backref lookup result cache.
>    *
>    * btrfs_is_data_extent_shared uses the backref walking code but will short
>    * circuit as soon as it finds a root or inode that doesn't match the
> @@ -1632,6 +1634,7 @@ static void store_backref_shared_cache(struct btrfs_backref_shared_cache *cache,
>    * Return: 0 if extent is not shared, 1 if it is shared, < 0 on error.
>    */
>   int btrfs_is_data_extent_shared(struct btrfs_root *root, u64 inum, u64 bytenr,
> +				u64 extent_gen,
>   				struct ulist *roots, struct ulist *tmp,
>   				struct btrfs_backref_shared_cache *cache)
>   {
> @@ -1683,6 +1686,18 @@ int btrfs_is_data_extent_shared(struct btrfs_root *root, u64 inum, u64 bytenr,
>   		if (ret < 0 && ret != -ENOENT)
>   			break;
>   		ret = 0;
> +		/*
> +		 * If our data extent is not shared through reflinks and it was
> +		 * created in a generation after the last one used to create a
> +		 * snapshot of the inode's root, then it can not be shared
> +		 * indirectly through subtrees, as that can only happen with
> +		 * snapshots. In this case bail out, no need to check for the
> +		 * sharedness of extent buffers.
> +		 */
> +		if (level == -1 &&
> +		    extent_gen > btrfs_root_last_snapshot(&root->root_item))
> +			break;
> +
>   		if (level >= 0)
>   			store_backref_shared_cache(cache, root, bytenr,
>   						   level, false);
> diff --git a/fs/btrfs/backref.h b/fs/btrfs/backref.h
> index 797ba5371d55..7d18b5ac71dd 100644
> --- a/fs/btrfs/backref.h
> +++ b/fs/btrfs/backref.h
> @@ -77,6 +77,7 @@ int btrfs_find_one_extref(struct btrfs_root *root, u64 inode_objectid,
>   			  struct btrfs_inode_extref **ret_extref,
>   			  u64 *found_off);
>   int btrfs_is_data_extent_shared(struct btrfs_root *root, u64 inum, u64 bytenr,
> +				u64 extent_gen,
>   				struct ulist *roots, struct ulist *tmp,
>   				struct btrfs_backref_shared_cache *cache);
>
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 781436cc373c..0e3fa9b08aaf 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -5645,9 +5645,23 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
>   			flags |= (FIEMAP_EXTENT_DELALLOC |
>   				  FIEMAP_EXTENT_UNKNOWN);
>   		} else if (fieinfo->fi_extents_max) {
> +			u64 extent_gen;
>   			u64 bytenr = em->block_start -
>   				(em->start - em->orig_start);
>
> +			/*
> +			 * If two extent maps are merged, then their generation
> +			 * is set to the maximum between their generations.
> +			 * Otherwise its generation matches the one we have in
> +			 * corresponding file extent item. If we have a merged
> +			 * extent map, don't use its generation to speedup the
> +			 * sharedness check below.
> +			 */
> +			if (test_bit(EXTENT_FLAG_MERGED, &em->flags))
> +				extent_gen = 0;
> +			else
> +				extent_gen = em->generation;
> +
>   			/*
>   			 * As btrfs supports shared space, this information
>   			 * can be exported to userspace tools via
> @@ -5656,8 +5670,8 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
>   			 * lookup stuff.
>   			 */
>   			ret = btrfs_is_data_extent_shared(root, btrfs_ino(inode),
> -							  bytenr, roots,
> -							  tmp_ulist,
> +							  bytenr, extent_gen,
> +							  roots, tmp_ulist,
>   							  backref_cache);
>   			if (ret < 0)
>   				goto out_free;

  parent reply	other threads:[~2022-09-01 23:01 UTC|newest]

Thread overview: 53+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-09-01 13:18 [PATCH 00/10] btrfs: make lseek and fiemap much more efficient fdmanana
2022-09-01 13:18 ` [PATCH 01/10] btrfs: allow hole and data seeking to be interruptible fdmanana
2022-09-01 13:58   ` Josef Bacik
2022-09-01 21:49   ` Qu Wenruo
2022-09-01 13:18 ` [PATCH 02/10] btrfs: make hole and data seeking a lot more efficient fdmanana
2022-09-01 14:03   ` Josef Bacik
2022-09-01 15:00     ` Filipe Manana
2022-09-02 13:26       ` Josef Bacik
2022-09-01 22:18   ` Qu Wenruo
2022-09-02  8:36     ` Filipe Manana
2022-09-11 22:12   ` Qu Wenruo
2022-09-12  8:38     ` Filipe Manana
2022-09-01 13:18 ` [PATCH 03/10] btrfs: remove check for impossible block start for an extent map at fiemap fdmanana
2022-09-01 14:03   ` Josef Bacik
2022-09-01 22:19   ` Qu Wenruo
2022-09-01 13:18 ` [PATCH 04/10] btrfs: remove zero length check when entering fiemap fdmanana
2022-09-01 14:04   ` Josef Bacik
2022-09-01 22:24   ` Qu Wenruo
2022-09-01 13:18 ` [PATCH 05/10] btrfs: properly flush delalloc " fdmanana
2022-09-01 14:06   ` Josef Bacik
2022-09-01 22:38   ` Qu Wenruo
2022-09-01 13:18 ` [PATCH 06/10] btrfs: allow fiemap to be interruptible fdmanana
2022-09-01 14:07   ` Josef Bacik
2022-09-01 22:42   ` Qu Wenruo
2022-09-02  8:38     ` Filipe Manana
2022-09-01 13:18 ` [PATCH 07/10] btrfs: rename btrfs_check_shared() to a more descriptive name fdmanana
2022-09-01 14:08   ` Josef Bacik
2022-09-01 22:45   ` Qu Wenruo
2022-09-01 13:18 ` [PATCH 08/10] btrfs: speedup checking for extent sharedness during fiemap fdmanana
2022-09-01 14:23   ` Josef Bacik
2022-09-01 22:50   ` Qu Wenruo
2022-09-02  8:46     ` Filipe Manana
2022-09-01 13:18 ` [PATCH 09/10] btrfs: skip unnecessary extent buffer sharedness checks " fdmanana
2022-09-01 14:26   ` Josef Bacik
2022-09-01 23:01   ` Qu Wenruo [this message]
2022-09-01 13:18 ` [PATCH 10/10] btrfs: make fiemap more efficient and accurate reporting extent sharedness fdmanana
2022-09-01 14:35   ` Josef Bacik
2022-09-01 15:04     ` Filipe Manana
2022-09-02 13:25       ` Josef Bacik
2022-09-01 23:27   ` Qu Wenruo
2022-09-02  8:59     ` Filipe Manana
2022-09-02  9:34       ` Qu Wenruo
2022-09-02  9:41         ` Filipe Manana
2022-09-02  9:50           ` Qu Wenruo
2022-09-02  0:53 ` [PATCH 00/10] btrfs: make lseek and fiemap much more efficient Wang Yugui
2022-09-02  8:24   ` Filipe Manana
2022-09-02 11:41     ` Wang Yugui
2022-09-02 11:45     ` Filipe Manana
2022-09-05 14:39       ` Filipe Manana
2022-09-06 16:20 ` David Sterba
2022-09-06 17:13   ` Filipe Manana
2022-09-07  9:12 ` Christoph Hellwig
2022-09-07  9:47   ` Filipe Manana

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=97ffaadc-4b8f-96bb-6ed4-6857c13919c2@gmx.com \
    --to=quwenruo.btrfs@gmx.com \
    --cc=fdmanana@kernel.org \
    --cc=linux-btrfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).