[PATCH 08/10] btrfs: speedup checking for extent sharedness during fiemap

From: fdmanana@kernel.org
To: linux-btrfs@vger.kernel.org
Subject: [PATCH 08/10] btrfs: speedup checking for extent sharedness during fiemap
Date: Thu,  1 Sep 2022 14:18:28 +0100	[thread overview]
Message-ID: <5e696c29b65f6558b8012596aa513101ed04a21a.1662022922.git.fdmanana@suse.com> (raw)
In-Reply-To: <cover.1662022922.git.fdmanana@suse.com>

From: Filipe Manana <fdmanana@suse.com>

One of the most expensive tasks performed during fiemap is to check if
an extent is shared. This task has two major steps:

1) Check if the data extent is shared. This implies checking the extent
   item in the extent tree, checking delayed references, etc. If we
   find the data extent is directly shared, we terminate immediately;

2) If the data extent is not directly shared (its extent item has a
   refcount of 1), then it may be shared if we have snapshots that share
   subtrees of the inode's subvolume b+tree. So we check if the leaf
   containing the file extent item is shared, then its parent node, then
   the parent node of the parent node, etc, until we reach the root node
   or we find one of them is shared - in which case we stop immediately.

During fiemap we process the extents of a file from left to right, from
file offset 0 to eof. This means that we iterate b+tree leaves from left
to right, and has the implication that we keep repeating that second step
above several times for the same b+tree path of the inode's subvolume
b+tree.

For example, if we have two file extent items in leaf X, and the path to
leaf X is A -> B -> C -> X, then when we try to determine if the data
extent referenced by the first extent item is shared, we check if the data
extent is shared - if it's not, then we check if leaf X is shared, if not,
then we check if node C is shared, if not, then check if node B is shared,
if not than check if node A is shared. When we move to the next file
extent item, after determining the data extent is not shared, we repeat
the checks for X, C, B and A - doing all the expensive searches in the
extent tree, delayed refs, etc. If we have thousands of tile extents, then
we keep repeating the sharedness checks for the same paths over and over.

On a file that has no shared extents or only a small portion, it's easy
to see that this scales terribly with the number of extents in the file
and the sizes of the extent and subvolume b+trees.

This change eliminates the repeated sharedness check on extent buffers
by caching the results of the last path used. The results can be used as
long as no snapshots were created since they were cached (for not shared
extent buffers) or no roots were dropped since they were cached (for
shared extent buffers). This greatly reduces the time spent by fiemap for
files with thousands of extents and/or large extent and subvolume b+trees.

Example performance test:

    $ cat fiemap-perf-test.sh
    #!/bin/bash

    DEV=/dev/sdi
    MNT=/mnt/sdi

    mkfs.btrfs -f $DEV
    mount -o compress=lzo $DEV $MNT

    # 40G gives 327680 128K file extents (due to compression).
    xfs_io -f -c "pwrite -S 0xab -b 1M 0 40G" $MNT/foobar

    umount $MNT
    mount -o compress=lzo $DEV $MNT

    start=$(date +%s%N)
    filefrag $MNT/foobar
    end=$(date +%s%N)
    dur=$(( (end - start) / 1000000 ))
    echo "fiemap took $dur milliseconds (metadata not cached)"

    start=$(date +%s%N)
    filefrag $MNT/foobar
    end=$(date +%s%N)
    dur=$(( (end - start) / 1000000 ))
    echo "fiemap took $dur milliseconds (metadata cached)"

    umount $MNT

Before this patch:

    $ ./fiemap-perf-test.sh
    (...)
    /mnt/sdi/foobar: 327680 extents found
    fiemap took 3597 milliseconds (metadata not cached)
    /mnt/sdi/foobar: 327680 extents found
    fiemap took 2107 milliseconds (metadata cached)

After this patch:

    $ ./fiemap-perf-test.sh
    (...)
    /mnt/sdi/foobar: 327680 extents found
    fiemap took 1646 milliseconds (metadata not cached)
    /mnt/sdi/foobar: 327680 extents found
    fiemap took 698 milliseconds (metadata cached)

That's about 2.2x faster when no metadata is cached, and about 3x faster
when all metadata is cached. On a real filesystem with many other files,
data, directories, etc, the b+trees will be 2 or 3 levels higher,
therefore this optimization will have a higher impact.

Several reports of a slow fiemap show up often, the two Link tags below
refer to two recent reports of such slowness. This patch, together with
the next ones in the series, is meant to address that.

Link: https://lore.kernel.org/linux-btrfs/21dd32c6-f1f9-f44a-466a-e18fdc6788a7@virtuozzo.com/
Link: https://lore.kernel.org/linux-btrfs/Ysace25wh5BbLd5f@atmark-techno.com/
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/backref.c     | 122 ++++++++++++++++++++++++++++++++++++++++-
 fs/btrfs/backref.h     |  17 +++++-
 fs/btrfs/ctree.h       |  18 ++++++
 fs/btrfs/extent-tree.c |  10 +++-
 fs/btrfs/extent_io.c   |  11 ++--
 5 files changed, 170 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
index e2ac10a695b6..40b48abb6978 100644
--- a/fs/btrfs/backref.c
+++ b/fs/btrfs/backref.c
@@ -1511,6 +1511,105 @@ int btrfs_find_all_roots(struct btrfs_trans_handle *trans,
 	return ret;
 }
 
+/*
+ * The caller has joined a transaction or is holding a read lock on the
+ * fs_info->commit_root_sem semaphore, so no need to worry about the root's last
+ * snapshot field changing while updating or checking the cache.
+ */
+static bool lookup_backref_shared_cache(struct btrfs_backref_shared_cache *cache,
+					struct btrfs_root *root,
+					u64 bytenr, int level, bool *is_shared)
+{
+	struct btrfs_backref_shared_cache_entry *entry;
+
+	if (WARN_ON_ONCE(level >= BTRFS_MAX_LEVEL))
+		return false;
+
+	/*
+	 * Level -1 is used for the data extent, which is not reliable to cache
+	 * because its reference count can increase or decrease without us
+	 * realizing. We cache results only for extent buffers that lead from
+	 * the root node down to the leaf with the file extent item.
+	 */
+	ASSERT(level >= 0);
+
+	entry = &cache->entries[level];
+
+	/* Unused cache entry or being used for some other extent buffer. */
+	if (entry->bytenr != bytenr)
+		return false;
+
+	/*
+	 * We cached a false result, but the last snapshot generation of the
+	 * root changed, so we now have a snapshot. Don't trust the result.
+	 */
+	if (!entry->is_shared &&
+	    entry->gen != btrfs_root_last_snapshot(&root->root_item))
+		return false;
+
+	/*
+	 * If we cached a true result and the last generation used for dropping
+	 * a root changed, we can not trust the result, because the dropped root
+	 * could be a snapshot sharing this extent buffer.
+	 */
+	if (entry->is_shared &&
+	    entry->gen != btrfs_get_last_root_drop_gen(root->fs_info))
+		return false;
+
+	*is_shared = entry->is_shared;
+
+	return true;
+}
+
+/*
+ * The caller has joined a transaction or is holding a read lock on the
+ * fs_info->commit_root_sem semaphore, so no need to worry about the root's last
+ * snapshot field changing while updating or checking the cache.
+ */
+static void store_backref_shared_cache(struct btrfs_backref_shared_cache *cache,
+				       struct btrfs_root *root,
+				       u64 bytenr, int level, bool is_shared)
+{
+	struct btrfs_backref_shared_cache_entry *entry;
+	u64 gen;
+
+	if (WARN_ON_ONCE(level >= BTRFS_MAX_LEVEL))
+		return;
+
+	/*
+	 * Level -1 is used for the data extent, which is not reliable to cache
+	 * because its reference count can increase or decrease without us
+	 * realizing. We cache results only for extent buffers that lead from
+	 * the root node down to the leaf with the file extent item.
+	 */
+	ASSERT(level >= 0);
+
+	if (is_shared)
+		gen = btrfs_get_last_root_drop_gen(root->fs_info);
+	else
+		gen = btrfs_root_last_snapshot(&root->root_item);
+
+	entry = &cache->entries[level];
+	entry->bytenr = bytenr;
+	entry->is_shared = is_shared;
+	entry->gen = gen;
+
+	/*
+	 * If we found an extent buffer is shared, set the cache result for all
+	 * extent buffers below it to true. As nodes in the path are COWed,
+	 * their sharedness is moved to their children, and if a leaf is COWed,
+	 * then the sharedness of a data extent becomes direct, the refcount of
+	 * data extent is increased in the extent item at the extent tree.
+	 */
+	if (is_shared) {
+		for (int i = 0; i < level; i++) {
+			entry = &cache->entries[i];
+			entry->is_shared = is_shared;
+			entry->gen = gen;
+		}
+	}
+}
+
 /**
  * Check if a data extent is shared or not.
  *
@@ -1519,6 +1618,7 @@ int btrfs_find_all_roots(struct btrfs_trans_handle *trans,
  * @bytenr: logical bytenr of the extent we are checking
  * @roots:  list of roots this extent is shared among
  * @tmp:    temporary list used for iteration
+ * @cache:  a backref lookup result cache
  *
  * btrfs_is_data_extent_shared uses the backref walking code but will short
  * circuit as soon as it finds a root or inode that doesn't match the
@@ -1532,7 +1632,8 @@ int btrfs_find_all_roots(struct btrfs_trans_handle *trans,
  * Return: 0 if extent is not shared, 1 if it is shared, < 0 on error.
  */
 int btrfs_is_data_extent_shared(struct btrfs_root *root, u64 inum, u64 bytenr,
-				struct ulist *roots, struct ulist *tmp)
+				struct ulist *roots, struct ulist *tmp,
+				struct btrfs_backref_shared_cache *cache)
 {
 	struct btrfs_fs_info *fs_info = root->fs_info;
 	struct btrfs_trans_handle *trans;
@@ -1545,6 +1646,7 @@ int btrfs_is_data_extent_shared(struct btrfs_root *root, u64 inum, u64 bytenr,
 		.inum = inum,
 		.share_count = 0,
 	};
+	int level;
 
 	ulist_init(roots);
 	ulist_init(tmp);
@@ -1561,22 +1663,40 @@ int btrfs_is_data_extent_shared(struct btrfs_root *root, u64 inum, u64 bytenr,
 		btrfs_get_tree_mod_seq(fs_info, &elem);
 	}
 
+	/* -1 means we are in the bytenr of the data extent. */
+	level = -1;
 	ULIST_ITER_INIT(&uiter);
 	while (1) {
+		bool is_shared;
+		bool cached;
+
 		ret = find_parent_nodes(trans, fs_info, bytenr, elem.seq, tmp,
 					roots, NULL, &shared, false);
 		if (ret == BACKREF_FOUND_SHARED) {
 			/* this is the only condition under which we return 1 */
 			ret = 1;
+			if (level >= 0)
+				store_backref_shared_cache(cache, root, bytenr,
+							   level, true);
 			break;
 		}
 		if (ret < 0 && ret != -ENOENT)
 			break;
 		ret = 0;
+		if (level >= 0)
+			store_backref_shared_cache(cache, root, bytenr,
+						   level, false);
 		node = ulist_next(tmp, &uiter);
 		if (!node)
 			break;
 		bytenr = node->val;
+		level++;
+		cached = lookup_backref_shared_cache(cache, root, bytenr, level,
+						     &is_shared);
+		if (cached) {
+			ret = is_shared ? 1 : 0;
+			break;
+		}
 		shared.share_count = 0;
 		cond_resched();
 	}
diff --git a/fs/btrfs/backref.h b/fs/btrfs/backref.h
index 08354394b1bb..797ba5371d55 100644
--- a/fs/btrfs/backref.h
+++ b/fs/btrfs/backref.h
@@ -17,6 +17,20 @@ struct inode_fs_paths {
 	struct btrfs_data_container	*fspath;
 };
 
+struct btrfs_backref_shared_cache_entry {
+	u64 bytenr;
+	u64 gen;
+	bool is_shared;
+};
+
+struct btrfs_backref_shared_cache {
+	/*
+	 * A path from a root to a leaf that has a file extent item pointing to
+	 * a given data extent should never exceed the maximum b+tree heigth.
+	 */
+	struct btrfs_backref_shared_cache_entry entries[BTRFS_MAX_LEVEL];
+};
+
 typedef int (iterate_extent_inodes_t)(u64 inum, u64 offset, u64 root,
 		void *ctx);
 
@@ -63,7 +77,8 @@ int btrfs_find_one_extref(struct btrfs_root *root, u64 inode_objectid,
 			  struct btrfs_inode_extref **ret_extref,
 			  u64 *found_off);
 int btrfs_is_data_extent_shared(struct btrfs_root *root, u64 inum, u64 bytenr,
-				struct ulist *roots, struct ulist *tmp);
+				struct ulist *roots, struct ulist *tmp,
+				struct btrfs_backref_shared_cache *cache);
 
 int __init btrfs_prelim_ref_init(void);
 void __cold btrfs_prelim_ref_exit(void);
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 3dc30f5e6fd0..f7fe7f633eb5 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1095,6 +1095,13 @@ struct btrfs_fs_info {
 	/* Updates are not protected by any lock */
 	struct btrfs_commit_stats commit_stats;
 
+	/*
+	 * Last generation where we dropped a non-relocation root.
+	 * Use btrfs_set_last_root_drop_gen() and btrfs_get_last_root_drop_gen()
+	 * to change it and to read it, respectively.
+	 */
+	u64 last_root_drop_gen;
+
 	/*
 	 * Annotations for transaction events (structures are empty when
 	 * compiled without lockdep).
@@ -1119,6 +1126,17 @@ struct btrfs_fs_info {
 #endif
 };
 
+static inline void btrfs_set_last_root_drop_gen(struct btrfs_fs_info *fs_info,
+						u64 gen)
+{
+	WRITE_ONCE(fs_info->last_root_drop_gen, gen);
+}
+
+static inline u64 btrfs_get_last_root_drop_gen(const struct btrfs_fs_info *fs_info)
+{
+	return READ_ONCE(fs_info->last_root_drop_gen);
+}
+
 static inline struct btrfs_fs_info *btrfs_sb(struct super_block *sb)
 {
 	return sb->s_fs_info;
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index bcd0e72cded3..9818285dface 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -5635,6 +5635,8 @@ static noinline int walk_up_tree(struct btrfs_trans_handle *trans,
  */
 int btrfs_drop_snapshot(struct btrfs_root *root, int update_ref, int for_reloc)
 {
+	const bool is_reloc_root = (root->root_key.objectid ==
+				    BTRFS_TREE_RELOC_OBJECTID);
 	struct btrfs_fs_info *fs_info = root->fs_info;
 	struct btrfs_path *path;
 	struct btrfs_trans_handle *trans;
@@ -5794,6 +5796,9 @@ int btrfs_drop_snapshot(struct btrfs_root *root, int update_ref, int for_reloc)
 				goto out_end_trans;
 			}
 
+			if (!is_reloc_root)
+				btrfs_set_last_root_drop_gen(fs_info, trans->transid);
+
 			btrfs_end_transaction_throttle(trans);
 			if (!for_reloc && btrfs_need_cleaner_sleep(fs_info)) {
 				btrfs_debug(fs_info,
@@ -5828,7 +5833,7 @@ int btrfs_drop_snapshot(struct btrfs_root *root, int update_ref, int for_reloc)
 		goto out_end_trans;
 	}
 
-	if (root->root_key.objectid != BTRFS_TREE_RELOC_OBJECTID) {
+	if (!is_reloc_root) {
 		ret = btrfs_find_root(tree_root, &root->root_key, path,
 				      NULL, NULL);
 		if (ret < 0) {
@@ -5860,6 +5865,9 @@ int btrfs_drop_snapshot(struct btrfs_root *root, int update_ref, int for_reloc)
 		btrfs_put_root(root);
 	root_dropped = true;
 out_end_trans:
+	if (!is_reloc_root)
+		btrfs_set_last_root_drop_gen(fs_info, trans->transid);
+
 	btrfs_end_transaction_throttle(trans);
 out_free:
 	kfree(wc);
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index a47710516ecf..781436cc373c 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -5519,6 +5519,7 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
 	struct btrfs_path *path;
 	struct btrfs_root *root = inode->root;
 	struct fiemap_cache cache = { 0 };
+	struct btrfs_backref_shared_cache *backref_cache;
 	struct ulist *roots;
 	struct ulist *tmp_ulist;
 	int end = 0;
@@ -5526,13 +5527,11 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
 	u64 em_len = 0;
 	u64 em_end = 0;
 
+	backref_cache = kzalloc(sizeof(*backref_cache), GFP_KERNEL);
 	path = btrfs_alloc_path();
-	if (!path)
-		return -ENOMEM;
-
 	roots = ulist_alloc(GFP_KERNEL);
 	tmp_ulist = ulist_alloc(GFP_KERNEL);
-	if (!roots || !tmp_ulist) {
+	if (!backref_cache || !path || !roots || !tmp_ulist) {
 		ret = -ENOMEM;
 		goto out_free_ulist;
 	}
@@ -5658,7 +5657,8 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
 			 */
 			ret = btrfs_is_data_extent_shared(root, btrfs_ino(inode),
 							  bytenr, roots,
-							  tmp_ulist);
+							  tmp_ulist,
+							  backref_cache);
 			if (ret < 0)
 				goto out_free;
 			if (ret)
@@ -5710,6 +5710,7 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
 			     &cached_state);
 
 out_free_ulist:
+	kfree(backref_cache);
 	btrfs_free_path(path);
 	ulist_free(roots);
 	ulist_free(tmp_ulist);
-- 
2.35.1