* [PATCH 00/11] btrfs: add a shrinker for extent maps
@ 2024-04-10 11:28 fdmanana
From: fdmanana @ 2024-04-10 11:28 UTC
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

Currently we don't limit the amount of extent maps we can have for inodes
from a subvolume tree, which can result in excessive use of memory and, in
some cases, in running into OOM situations. This was reported some time ago
by a user and it's especially easy to trigger with direct IO.

The shrinker itself is patch 9/11; what comes before it is simple
preparatory work and the rest adds trace events. More details are in the
change logs.
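
To make the shape of such a shrinker concrete before going through the
patches, below is a minimal illustrative sketch of the super_operations
hooks a filesystem can use to let the VM count and reclaim cached objects.
This is not the actual 9/11 implementation (see that patch for the real
thing): the btrfs_free_extent_maps() helper is a hypothetical name, while
btrfs_sb() and the evictable_extent_maps counter appear in patch 8/11.

  /* Illustrative sketch only, not the actual implementation. */
  static long btrfs_nr_cached_objects(struct super_block *sb,
                                      struct shrink_control *sc)
  {
          struct btrfs_fs_info *fs_info = btrfs_sb(sb);

          /* Approximate count of extent maps that could be dropped. */
          return percpu_counter_sum_positive(&fs_info->evictable_extent_maps);
  }

  static long btrfs_free_cached_objects(struct super_block *sb,
                                        struct shrink_control *sc)
  {
          /*
           * Hypothetical helper: walk inodes of fs trees and drop
           * unpinned, non-logged extent maps, up to sc->nr_to_scan.
           */
          return btrfs_free_extent_maps(btrfs_sb(sb), sc->nr_to_scan);
  }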

Filipe Manana (11):
  btrfs: pass an inode to btrfs_add_extent_mapping()
  btrfs: tests: error out on unexpected extent map reference count
  btrfs: simplify add_extent_mapping() by removing pointless label
  btrfs: pass the extent map tree's inode to add_extent_mapping()
  btrfs: pass the extent map tree's inode to clear_em_logging()
  btrfs: pass the extent map tree's inode to remove_extent_mapping()
  btrfs: pass the extent map tree's inode to replace_extent_mapping()
  btrfs: add a global per cpu counter to track number of used extent maps
  btrfs: add a shrinker for extent maps
  btrfs: update comment for btrfs_set_inode_full_sync() about locking
  btrfs: add tracepoints for extent map shrinker events

 fs/btrfs/btrfs_inode.h            |   8 +-
 fs/btrfs/disk-io.c                |   5 +
 fs/btrfs/extent_io.c              |   2 +-
 fs/btrfs/extent_map.c             | 340 +++++++++++++++++++++++++-----
 fs/btrfs/extent_map.h             |   9 +-
 fs/btrfs/fs.h                     |   4 +
 fs/btrfs/inode.c                  |   2 +-
 fs/btrfs/tests/extent-map-tests.c | 216 ++++++++++---------
 fs/btrfs/tree-log.c               |   4 +-
 include/trace/events/btrfs.h      |  92 ++++++++
 10 files changed, 524 insertions(+), 158 deletions(-)

-- 
2.43.0


* [PATCH 01/11] btrfs: pass an inode to btrfs_add_extent_mapping()
From: fdmanana @ 2024-04-10 11:28 UTC
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

Instead of passing fs_info and extent map tree arguments to
btrfs_add_extent_mapping(), we can pass just an inode, as extent maps
are always inserted in the extent map tree of an inode, and the fs_info
can be extracted from the inode (inode->root->fs_info). The only exception
is in the self tests where we allocate an extent map tree and then use it
to insert/update/remove extent maps. However, the tests can be changed to
use a test inode and then use the inode's extent map tree.

So change btrfs_add_extent_mapping() to have an inode as an argument
instead of a fs_info and an extent map tree. This reduces the number of
parameters and will also be needed for an upcoming change.
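
From a caller's point of view the change looks like this (a sketch based
on the btrfs_get_extent() call site updated below):

  /* Before: redundant arguments that the callee can derive itself. */
  ret = btrfs_add_extent_mapping(fs_info, em_tree, &em, start, len);

  /* After: the inode provides both the tree and the fs_info. */
  ret = btrfs_add_extent_mapping(inode, &em, start, len);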

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/extent_map.c             |  14 +--
 fs/btrfs/extent_map.h             |   3 +-
 fs/btrfs/inode.c                  |   2 +-
 fs/btrfs/tests/extent-map-tests.c | 174 +++++++++++++++---------------
 4 files changed, 95 insertions(+), 98 deletions(-)

diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
index 471654cb65b0..840be23d2c0a 100644
--- a/fs/btrfs/extent_map.c
+++ b/fs/btrfs/extent_map.c
@@ -546,10 +546,9 @@ static noinline int merge_extent_mapping(struct extent_map_tree *em_tree,
 }
 
 /*
- * Add extent mapping into em_tree.
+ * Add extent mapping into an inode's extent map tree.
  *
- * @fs_info:  the filesystem
- * @em_tree:  extent tree into which we want to insert the extent mapping
+ * @inode:    target inode
  * @em_in:    extent we are inserting
  * @start:    start of the logical range btrfs_get_extent() is requesting
  * @len:      length of the logical range btrfs_get_extent() is requesting
@@ -557,8 +556,8 @@ static noinline int merge_extent_mapping(struct extent_map_tree *em_tree,
  * Note that @em_in's range may be different from [start, start+len),
  * but they must be overlapped.
  *
- * Insert @em_in into @em_tree. In case there is an overlapping range, handle
- * the -EEXIST by either:
+ * Insert @em_in into the inode's extent map tree. In case there is an
+ * overlapping range, handle the -EEXIST by either:
  * a) Returning the existing extent in @em_in if @start is within the
  *    existing em.
  * b) Merge the existing extent with @em_in passed in.
@@ -566,12 +565,13 @@ static noinline int merge_extent_mapping(struct extent_map_tree *em_tree,
  * Return 0 on success, otherwise -EEXIST.
  *
  */
-int btrfs_add_extent_mapping(struct btrfs_fs_info *fs_info,
-			     struct extent_map_tree *em_tree,
+int btrfs_add_extent_mapping(struct btrfs_inode *inode,
 			     struct extent_map **em_in, u64 start, u64 len)
 {
 	int ret;
 	struct extent_map *em = *em_in;
+	struct extent_map_tree *em_tree = &inode->extent_tree;
+	struct btrfs_fs_info *fs_info = inode->root->fs_info;
 
 	/*
 	 * Tree-checker should have rejected any inline extent with non-zero
diff --git a/fs/btrfs/extent_map.h b/fs/btrfs/extent_map.h
index 10e9491865c9..f287ab46e368 100644
--- a/fs/btrfs/extent_map.h
+++ b/fs/btrfs/extent_map.h
@@ -132,8 +132,7 @@ int unpin_extent_cache(struct btrfs_inode *inode, u64 start, u64 len, u64 gen);
 void clear_em_logging(struct extent_map_tree *tree, struct extent_map *em);
 struct extent_map *search_extent_mapping(struct extent_map_tree *tree,
 					 u64 start, u64 len);
-int btrfs_add_extent_mapping(struct btrfs_fs_info *fs_info,
-			     struct extent_map_tree *em_tree,
+int btrfs_add_extent_mapping(struct btrfs_inode *inode,
 			     struct extent_map **em_in, u64 start, u64 len);
 void btrfs_drop_extent_map_range(struct btrfs_inode *inode,
 				 u64 start, u64 end,
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 94ac20e62e13..27888810e6ac 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6992,7 +6992,7 @@ struct extent_map *btrfs_get_extent(struct btrfs_inode *inode,
 	}
 
 	write_lock(&em_tree->lock);
-	ret = btrfs_add_extent_mapping(fs_info, em_tree, &em, start, len);
+	ret = btrfs_add_extent_mapping(inode, &em, start, len);
 	write_unlock(&em_tree->lock);
 out:
 	btrfs_free_path(path);
diff --git a/fs/btrfs/tests/extent-map-tests.c b/fs/btrfs/tests/extent-map-tests.c
index 253cce7ffecf..96089c4c38a5 100644
--- a/fs/btrfs/tests/extent-map-tests.c
+++ b/fs/btrfs/tests/extent-map-tests.c
@@ -53,9 +53,9 @@ static void free_extent_map_tree(struct extent_map_tree *em_tree)
  *                                    ->add_extent_mapping(0, 16K)
  *                                    -> #handle -EEXIST
  */
-static int test_case_1(struct btrfs_fs_info *fs_info,
-		struct extent_map_tree *em_tree)
+static int test_case_1(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
 {
+	struct extent_map_tree *em_tree = &inode->extent_tree;
 	struct extent_map *em;
 	u64 start = 0;
 	u64 len = SZ_8K;
@@ -73,7 +73,7 @@ static int test_case_1(struct btrfs_fs_info *fs_info,
 	em->block_start = 0;
 	em->block_len = SZ_16K;
 	write_lock(&em_tree->lock);
-	ret = btrfs_add_extent_mapping(fs_info, em_tree, &em, em->start, em->len);
+	ret = btrfs_add_extent_mapping(inode, &em, em->start, em->len);
 	write_unlock(&em_tree->lock);
 	if (ret < 0) {
 		test_err("cannot add extent range [0, 16K)");
@@ -94,7 +94,7 @@ static int test_case_1(struct btrfs_fs_info *fs_info,
 	em->block_start = SZ_32K; /* avoid merging */
 	em->block_len = SZ_4K;
 	write_lock(&em_tree->lock);
-	ret = btrfs_add_extent_mapping(fs_info, em_tree, &em, em->start, em->len);
+	ret = btrfs_add_extent_mapping(inode, &em, em->start, em->len);
 	write_unlock(&em_tree->lock);
 	if (ret < 0) {
 		test_err("cannot add extent range [16K, 20K)");
@@ -115,7 +115,7 @@ static int test_case_1(struct btrfs_fs_info *fs_info,
 	em->block_start = start;
 	em->block_len = len;
 	write_lock(&em_tree->lock);
-	ret = btrfs_add_extent_mapping(fs_info, em_tree, &em, em->start, em->len);
+	ret = btrfs_add_extent_mapping(inode, &em, em->start, em->len);
 	write_unlock(&em_tree->lock);
 	if (ret) {
 		test_err("case1 [%llu %llu]: ret %d", start, start + len, ret);
@@ -148,9 +148,9 @@ static int test_case_1(struct btrfs_fs_info *fs_info,
  * Reading the inline ending up with EEXIST, ie. read an inline
  * extent and discard page cache and read it again.
  */
-static int test_case_2(struct btrfs_fs_info *fs_info,
-		struct extent_map_tree *em_tree)
+static int test_case_2(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
 {
+	struct extent_map_tree *em_tree = &inode->extent_tree;
 	struct extent_map *em;
 	int ret;
 
@@ -166,7 +166,7 @@ static int test_case_2(struct btrfs_fs_info *fs_info,
 	em->block_start = EXTENT_MAP_INLINE;
 	em->block_len = (u64)-1;
 	write_lock(&em_tree->lock);
-	ret = btrfs_add_extent_mapping(fs_info, em_tree, &em, em->start, em->len);
+	ret = btrfs_add_extent_mapping(inode, &em, em->start, em->len);
 	write_unlock(&em_tree->lock);
 	if (ret < 0) {
 		test_err("cannot add extent range [0, 1K)");
@@ -187,7 +187,7 @@ static int test_case_2(struct btrfs_fs_info *fs_info,
 	em->block_start = SZ_4K;
 	em->block_len = SZ_4K;
 	write_lock(&em_tree->lock);
-	ret = btrfs_add_extent_mapping(fs_info, em_tree, &em, em->start, em->len);
+	ret = btrfs_add_extent_mapping(inode, &em, em->start, em->len);
 	write_unlock(&em_tree->lock);
 	if (ret < 0) {
 		test_err("cannot add extent range [4K, 8K)");
@@ -208,7 +208,7 @@ static int test_case_2(struct btrfs_fs_info *fs_info,
 	em->block_start = EXTENT_MAP_INLINE;
 	em->block_len = (u64)-1;
 	write_lock(&em_tree->lock);
-	ret = btrfs_add_extent_mapping(fs_info, em_tree, &em, em->start, em->len);
+	ret = btrfs_add_extent_mapping(inode, &em, em->start, em->len);
 	write_unlock(&em_tree->lock);
 	if (ret) {
 		test_err("case2 [0 1K]: ret %d", ret);
@@ -235,8 +235,9 @@ static int test_case_2(struct btrfs_fs_info *fs_info,
 }
 
 static int __test_case_3(struct btrfs_fs_info *fs_info,
-		struct extent_map_tree *em_tree, u64 start)
+			 struct btrfs_inode *inode, u64 start)
 {
+	struct extent_map_tree *em_tree = &inode->extent_tree;
 	struct extent_map *em;
 	u64 len = SZ_4K;
 	int ret;
@@ -253,7 +254,7 @@ static int __test_case_3(struct btrfs_fs_info *fs_info,
 	em->block_start = SZ_4K;
 	em->block_len = SZ_4K;
 	write_lock(&em_tree->lock);
-	ret = btrfs_add_extent_mapping(fs_info, em_tree, &em, em->start, em->len);
+	ret = btrfs_add_extent_mapping(inode, &em, em->start, em->len);
 	write_unlock(&em_tree->lock);
 	if (ret < 0) {
 		test_err("cannot add extent range [4K, 8K)");
@@ -274,7 +275,7 @@ static int __test_case_3(struct btrfs_fs_info *fs_info,
 	em->block_start = 0;
 	em->block_len = SZ_16K;
 	write_lock(&em_tree->lock);
-	ret = btrfs_add_extent_mapping(fs_info, em_tree, &em, start, len);
+	ret = btrfs_add_extent_mapping(inode, &em, start, len);
 	write_unlock(&em_tree->lock);
 	if (ret) {
 		test_err("case3 [%llu %llu): ret %d",
@@ -322,25 +323,25 @@ static int __test_case_3(struct btrfs_fs_info *fs_info,
  *   -> add_extent_mapping()
  *                            -> add_extent_mapping()
  */
-static int test_case_3(struct btrfs_fs_info *fs_info,
-		struct extent_map_tree *em_tree)
+static int test_case_3(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
 {
 	int ret;
 
-	ret = __test_case_3(fs_info, em_tree, 0);
+	ret = __test_case_3(fs_info, inode, 0);
 	if (ret)
 		return ret;
-	ret = __test_case_3(fs_info, em_tree, SZ_8K);
+	ret = __test_case_3(fs_info, inode, SZ_8K);
 	if (ret)
 		return ret;
-	ret = __test_case_3(fs_info, em_tree, (12 * SZ_1K));
+	ret = __test_case_3(fs_info, inode, (12 * SZ_1K));
 
 	return ret;
 }
 
 static int __test_case_4(struct btrfs_fs_info *fs_info,
-		struct extent_map_tree *em_tree, u64 start)
+			 struct btrfs_inode *inode, u64 start)
 {
+	struct extent_map_tree *em_tree = &inode->extent_tree;
 	struct extent_map *em;
 	u64 len = SZ_4K;
 	int ret;
@@ -357,7 +358,7 @@ static int __test_case_4(struct btrfs_fs_info *fs_info,
 	em->block_start = 0;
 	em->block_len = SZ_8K;
 	write_lock(&em_tree->lock);
-	ret = btrfs_add_extent_mapping(fs_info, em_tree, &em, em->start, em->len);
+	ret = btrfs_add_extent_mapping(inode, &em, em->start, em->len);
 	write_unlock(&em_tree->lock);
 	if (ret < 0) {
 		test_err("cannot add extent range [0, 8K)");
@@ -378,7 +379,7 @@ static int __test_case_4(struct btrfs_fs_info *fs_info,
 	em->block_start = SZ_16K; /* avoid merging */
 	em->block_len = 24 * SZ_1K;
 	write_lock(&em_tree->lock);
-	ret = btrfs_add_extent_mapping(fs_info, em_tree, &em, em->start, em->len);
+	ret = btrfs_add_extent_mapping(inode, &em, em->start, em->len);
 	write_unlock(&em_tree->lock);
 	if (ret < 0) {
 		test_err("cannot add extent range [8K, 32K)");
@@ -398,7 +399,7 @@ static int __test_case_4(struct btrfs_fs_info *fs_info,
 	em->block_start = 0;
 	em->block_len = SZ_32K;
 	write_lock(&em_tree->lock);
-	ret = btrfs_add_extent_mapping(fs_info, em_tree, &em, start, len);
+	ret = btrfs_add_extent_mapping(inode, &em, start, len);
 	write_unlock(&em_tree->lock);
 	if (ret) {
 		test_err("case4 [%llu %llu): ret %d",
@@ -450,23 +451,22 @@ static int __test_case_4(struct btrfs_fs_info *fs_info,
  *                                             # handle -EEXIST when adding
  *                                             # [0, 32K)
  */
-static int test_case_4(struct btrfs_fs_info *fs_info,
-		struct extent_map_tree *em_tree)
+static int test_case_4(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
 {
 	int ret;
 
-	ret = __test_case_4(fs_info, em_tree, 0);
+	ret = __test_case_4(fs_info, inode, 0);
 	if (ret)
 		return ret;
-	ret = __test_case_4(fs_info, em_tree, SZ_4K);
+	ret = __test_case_4(fs_info, inode, SZ_4K);
 
 	return ret;
 }
 
-static int add_compressed_extent(struct btrfs_fs_info *fs_info,
-				 struct extent_map_tree *em_tree,
+static int add_compressed_extent(struct btrfs_inode *inode,
 				 u64 start, u64 len, u64 block_start)
 {
+	struct extent_map_tree *em_tree = &inode->extent_tree;
 	struct extent_map *em;
 	int ret;
 
@@ -482,7 +482,7 @@ static int add_compressed_extent(struct btrfs_fs_info *fs_info,
 	em->block_len = SZ_4K;
 	em->flags |= EXTENT_FLAG_COMPRESS_ZLIB;
 	write_lock(&em_tree->lock);
-	ret = btrfs_add_extent_mapping(fs_info, em_tree, &em, em->start, em->len);
+	ret = btrfs_add_extent_mapping(inode, &em, em->start, em->len);
 	write_unlock(&em_tree->lock);
 	free_extent_map(em);
 	if (ret < 0) {
@@ -588,53 +588,43 @@ static int validate_range(struct extent_map_tree *em_tree, int index)
  * They'll have the EXTENT_FLAG_COMPRESSED flag set to keep the em tree from
  * merging the em's.
  */
-static int test_case_5(struct btrfs_fs_info *fs_info)
+static int test_case_5(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
 {
-	struct extent_map_tree *em_tree;
-	struct inode *inode;
 	u64 start, end;
 	int ret;
 
 	test_msg("Running btrfs_drop_extent_map_range tests");
 
-	inode = btrfs_new_test_inode();
-	if (!inode) {
-		test_std_err(TEST_ALLOC_INODE);
-		return -ENOMEM;
-	}
-
-	em_tree = &BTRFS_I(inode)->extent_tree;
-
 	/* [0, 12k) */
-	ret = add_compressed_extent(fs_info, em_tree, 0, SZ_4K * 3, 0);
+	ret = add_compressed_extent(inode, 0, SZ_4K * 3, 0);
 	if (ret) {
 		test_err("cannot add extent range [0, 12K)");
 		goto out;
 	}
 
 	/* [12k, 24k) */
-	ret = add_compressed_extent(fs_info, em_tree, SZ_4K * 3, SZ_4K * 3, SZ_4K);
+	ret = add_compressed_extent(inode, SZ_4K * 3, SZ_4K * 3, SZ_4K);
 	if (ret) {
 		test_err("cannot add extent range [12k, 24k)");
 		goto out;
 	}
 
 	/* [24k, 36k) */
-	ret = add_compressed_extent(fs_info, em_tree, SZ_4K * 6, SZ_4K * 3, SZ_8K);
+	ret = add_compressed_extent(inode, SZ_4K * 6, SZ_4K * 3, SZ_8K);
 	if (ret) {
 		test_err("cannot add extent range [12k, 24k)");
 		goto out;
 	}
 
 	/* [36k, 40k) */
-	ret = add_compressed_extent(fs_info, em_tree, SZ_32K + SZ_4K, SZ_4K, SZ_4K * 3);
+	ret = add_compressed_extent(inode, SZ_32K + SZ_4K, SZ_4K, SZ_4K * 3);
 	if (ret) {
 		test_err("cannot add extent range [12k, 24k)");
 		goto out;
 	}
 
 	/* [40k, 64k) */
-	ret = add_compressed_extent(fs_info, em_tree, SZ_4K * 10, SZ_4K * 6, SZ_16K);
+	ret = add_compressed_extent(inode, SZ_4K * 10, SZ_4K * 6, SZ_16K);
 	if (ret) {
 		test_err("cannot add extent range [12k, 24k)");
 		goto out;
@@ -643,36 +633,36 @@ static int test_case_5(struct btrfs_fs_info *fs_info)
 	/* Drop [8k, 12k) */
 	start = SZ_8K;
 	end = (3 * SZ_4K) - 1;
-	btrfs_drop_extent_map_range(BTRFS_I(inode), start, end, false);
-	ret = validate_range(&BTRFS_I(inode)->extent_tree, 0);
+	btrfs_drop_extent_map_range(inode, start, end, false);
+	ret = validate_range(&inode->extent_tree, 0);
 	if (ret)
 		goto out;
 
 	/* Drop [12k, 20k) */
 	start = SZ_4K * 3;
 	end = SZ_16K + SZ_4K - 1;
-	btrfs_drop_extent_map_range(BTRFS_I(inode), start, end, false);
-	ret = validate_range(&BTRFS_I(inode)->extent_tree, 1);
+	btrfs_drop_extent_map_range(inode, start, end, false);
+	ret = validate_range(&inode->extent_tree, 1);
 	if (ret)
 		goto out;
 
 	/* Drop [28k, 32k) */
 	start = SZ_32K - SZ_4K;
 	end = SZ_32K - 1;
-	btrfs_drop_extent_map_range(BTRFS_I(inode), start, end, false);
-	ret = validate_range(&BTRFS_I(inode)->extent_tree, 2);
+	btrfs_drop_extent_map_range(inode, start, end, false);
+	ret = validate_range(&inode->extent_tree, 2);
 	if (ret)
 		goto out;
 
 	/* Drop [32k, 64k) */
 	start = SZ_32K;
 	end = SZ_64K - 1;
-	btrfs_drop_extent_map_range(BTRFS_I(inode), start, end, false);
-	ret = validate_range(&BTRFS_I(inode)->extent_tree, 3);
+	btrfs_drop_extent_map_range(inode, start, end, false);
+	ret = validate_range(&inode->extent_tree, 3);
 	if (ret)
 		goto out;
 out:
-	iput(inode);
+	free_extent_map_tree(&inode->extent_tree);
 	return ret;
 }
 
@@ -681,23 +671,25 @@ static int test_case_5(struct btrfs_fs_info *fs_info)
  * for areas between two existing ems.  Validate it doesn't do this when there
  * are two unmerged em's side by side.
  */
-static int test_case_6(struct btrfs_fs_info *fs_info, struct extent_map_tree *em_tree)
+static int test_case_6(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
 {
+	struct extent_map_tree *em_tree = &inode->extent_tree;
 	struct extent_map *em = NULL;
 	int ret;
 
-	ret = add_compressed_extent(fs_info, em_tree, 0, SZ_4K, 0);
+	ret = add_compressed_extent(inode, 0, SZ_4K, 0);
 	if (ret)
 		goto out;
 
-	ret = add_compressed_extent(fs_info, em_tree, SZ_4K, SZ_4K, 0);
+	ret = add_compressed_extent(inode, SZ_4K, SZ_4K, 0);
 	if (ret)
 		goto out;
 
 	em = alloc_extent_map();
 	if (!em) {
 		test_std_err(TEST_ALLOC_EXTENT_MAP);
-		return -ENOMEM;
+		ret = -ENOMEM;
+		goto out;
 	}
 
 	em->start = SZ_4K;
@@ -705,7 +697,7 @@ static int test_case_6(struct btrfs_fs_info *fs_info, struct extent_map_tree *em
 	em->block_start = SZ_16K;
 	em->block_len = SZ_16K;
 	write_lock(&em_tree->lock);
-	ret = btrfs_add_extent_mapping(fs_info, em_tree, &em, 0, SZ_8K);
+	ret = btrfs_add_extent_mapping(inode, &em, 0, SZ_8K);
 	write_unlock(&em_tree->lock);
 
 	if (ret != 0) {
@@ -734,28 +726,19 @@ static int test_case_6(struct btrfs_fs_info *fs_info, struct extent_map_tree *em
  * true would mess up the start/end calculations and subsequent splits would be
  * incorrect.
  */
-static int test_case_7(struct btrfs_fs_info *fs_info)
+static int test_case_7(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
 {
-	struct extent_map_tree *em_tree;
+	struct extent_map_tree *em_tree = &inode->extent_tree;
 	struct extent_map *em;
-	struct inode *inode;
 	int ret;
+	int ret2;
 
 	test_msg("Running btrfs_drop_extent_cache with pinned");
 
-	inode = btrfs_new_test_inode();
-	if (!inode) {
-		test_std_err(TEST_ALLOC_INODE);
-		return -ENOMEM;
-	}
-
-	em_tree = &BTRFS_I(inode)->extent_tree;
-
 	em = alloc_extent_map();
 	if (!em) {
 		test_std_err(TEST_ALLOC_EXTENT_MAP);
-		ret = -ENOMEM;
-		goto out;
+		return -ENOMEM;
 	}
 
 	/* [0, 16K), pinned */
@@ -765,7 +748,7 @@ static int test_case_7(struct btrfs_fs_info *fs_info)
 	em->block_len = SZ_4K;
 	em->flags |= EXTENT_FLAG_PINNED;
 	write_lock(&em_tree->lock);
-	ret = btrfs_add_extent_mapping(fs_info, em_tree, &em, em->start, em->len);
+	ret = btrfs_add_extent_mapping(inode, &em, em->start, em->len);
 	write_unlock(&em_tree->lock);
 	if (ret < 0) {
 		test_err("couldn't add extent map");
@@ -786,7 +769,7 @@ static int test_case_7(struct btrfs_fs_info *fs_info)
 	em->block_start = SZ_32K;
 	em->block_len = SZ_16K;
 	write_lock(&em_tree->lock);
-	ret = btrfs_add_extent_mapping(fs_info, em_tree, &em, em->start, em->len);
+	ret = btrfs_add_extent_mapping(inode, &em, em->start, em->len);
 	write_unlock(&em_tree->lock);
 	if (ret < 0) {
 		test_err("couldn't add extent map");
@@ -798,7 +781,7 @@ static int test_case_7(struct btrfs_fs_info *fs_info)
 	 * Drop [0, 36K) This should skip the [0, 4K) extent and then split the
 	 * [32K, 48K) extent.
 	 */
-	btrfs_drop_extent_map_range(BTRFS_I(inode), 0, (36 * SZ_1K) - 1, true);
+	btrfs_drop_extent_map_range(inode, 0, (36 * SZ_1K) - 1, true);
 
 	/* Make sure our extent maps look sane. */
 	ret = -EINVAL;
@@ -860,7 +843,11 @@ static int test_case_7(struct btrfs_fs_info *fs_info)
 	ret = 0;
 out:
 	free_extent_map(em);
-	iput(inode);
+	/* Unpin our extent to prevent warning when removing it below. */
+	ret2 = unpin_extent_cache(inode, 0, SZ_16K, 0);
+	if (ret == 0)
+		ret = ret2;
+	free_extent_map_tree(em_tree);
 	return ret;
 }
 
@@ -954,7 +941,8 @@ static int test_rmap_block(struct btrfs_fs_info *fs_info,
 int btrfs_test_extent_map(void)
 {
 	struct btrfs_fs_info *fs_info = NULL;
-	struct extent_map_tree *em_tree;
+	struct inode *inode;
+	struct btrfs_root *root = NULL;
 	int ret = 0, i;
 	struct rmap_test_vector rmap_tests[] = {
 		{
@@ -1003,33 +991,42 @@ int btrfs_test_extent_map(void)
 		return -ENOMEM;
 	}
 
-	em_tree = kzalloc(sizeof(*em_tree), GFP_KERNEL);
-	if (!em_tree) {
+	inode = btrfs_new_test_inode();
+	if (!inode) {
+		test_std_err(TEST_ALLOC_INODE);
 		ret = -ENOMEM;
 		goto out;
 	}
 
-	extent_map_tree_init(em_tree);
+	root = btrfs_alloc_dummy_root(fs_info);
+	if (IS_ERR(root)) {
+		test_std_err(TEST_ALLOC_ROOT);
+		ret = PTR_ERR(root);
+		root = NULL;
+		goto out;
+	}
 
-	ret = test_case_1(fs_info, em_tree);
+	BTRFS_I(inode)->root = root;
+
+	ret = test_case_1(fs_info, BTRFS_I(inode));
 	if (ret)
 		goto out;
-	ret = test_case_2(fs_info, em_tree);
+	ret = test_case_2(fs_info, BTRFS_I(inode));
 	if (ret)
 		goto out;
-	ret = test_case_3(fs_info, em_tree);
+	ret = test_case_3(fs_info, BTRFS_I(inode));
 	if (ret)
 		goto out;
-	ret = test_case_4(fs_info, em_tree);
+	ret = test_case_4(fs_info, BTRFS_I(inode));
 	if (ret)
 		goto out;
-	ret = test_case_5(fs_info);
+	ret = test_case_5(fs_info, BTRFS_I(inode));
 	if (ret)
 		goto out;
-	ret = test_case_6(fs_info, em_tree);
+	ret = test_case_6(fs_info, BTRFS_I(inode));
 	if (ret)
 		goto out;
-	ret = test_case_7(fs_info);
+	ret = test_case_7(fs_info, BTRFS_I(inode));
 	if (ret)
 		goto out;
 
@@ -1041,7 +1038,8 @@ int btrfs_test_extent_map(void)
 	}
 
 out:
-	kfree(em_tree);
+	iput(inode);
+	btrfs_free_dummy_root(root);
 	btrfs_free_dummy_fs_info(fs_info);
 
 	return ret;
-- 
2.43.0


* [PATCH 02/11] btrfs: tests: error out on unexpected extent map reference count
From: fdmanana @ 2024-04-10 11:28 UTC
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

In the extent map self tests, when freeing all extent maps from a test
extent map tree, we do not expect to find any extent map with a reference
count different from 1 (the tree's own reference). If we find any, we just
log a message but don't fail the test, which makes it very easy to miss a
bug/regression - no one reads the test messages unless a test fails. So
change the behaviour to make a test fail if we find an extent map in the
tree with a reference count different from 1. Make the failure happen only
after removing all extent maps, so that we don't leak memory.
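
The idiom used for this throughout the tests is to always run the cleanup
and remember only the first error seen (a sketch of the pattern added by
this patch):

  int ret;   /* result of the test logic */
  int ret2;  /* result of freeing the tree, which can now fail */
  ...
  out:
          ret2 = free_extent_map_tree(em_tree);
          /* Cleanup always runs, so nothing leaks; the test's own error,
           * if any, takes precedence over a refcount mismatch found
           * while freeing. */
          if (ret == 0)
                  ret = ret2;
          return ret;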

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/tests/extent-map-tests.c | 43 +++++++++++++++++++++++++------
 1 file changed, 35 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/tests/extent-map-tests.c b/fs/btrfs/tests/extent-map-tests.c
index 96089c4c38a5..9e9cb591c0f1 100644
--- a/fs/btrfs/tests/extent-map-tests.c
+++ b/fs/btrfs/tests/extent-map-tests.c
@@ -11,10 +11,11 @@
 #include "../disk-io.h"
 #include "../block-group.h"
 
-static void free_extent_map_tree(struct extent_map_tree *em_tree)
+static int free_extent_map_tree(struct extent_map_tree *em_tree)
 {
 	struct extent_map *em;
 	struct rb_node *node;
+	int ret = 0;
 
 	write_lock(&em_tree->lock);
 	while (!RB_EMPTY_ROOT(&em_tree->map.rb_root)) {
@@ -24,6 +25,7 @@ static void free_extent_map_tree(struct extent_map_tree *em_tree)
 
 #ifdef CONFIG_BTRFS_DEBUG
 		if (refcount_read(&em->refs) != 1) {
+			ret = -EINVAL;
 			test_err(
 "em leak: em (start %llu len %llu block_start %llu block_len %llu) refs %d",
 				 em->start, em->len, em->block_start,
@@ -35,6 +37,8 @@ static void free_extent_map_tree(struct extent_map_tree *em_tree)
 		free_extent_map(em);
 	}
 	write_unlock(&em_tree->lock);
+
+	return ret;
 }
 
 /*
@@ -60,6 +64,7 @@ static int test_case_1(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
 	u64 start = 0;
 	u64 len = SZ_8K;
 	int ret;
+	int ret2;
 
 	em = alloc_extent_map();
 	if (!em) {
@@ -137,7 +142,9 @@ static int test_case_1(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
 	}
 	free_extent_map(em);
 out:
-	free_extent_map_tree(em_tree);
+	ret2 = free_extent_map_tree(em_tree);
+	if (ret == 0)
+		ret = ret2;
 
 	return ret;
 }
@@ -153,6 +160,7 @@ static int test_case_2(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
 	struct extent_map_tree *em_tree = &inode->extent_tree;
 	struct extent_map *em;
 	int ret;
+	int ret2;
 
 	em = alloc_extent_map();
 	if (!em) {
@@ -229,7 +237,9 @@ static int test_case_2(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
 	}
 	free_extent_map(em);
 out:
-	free_extent_map_tree(em_tree);
+	ret2 = free_extent_map_tree(em_tree);
+	if (ret == 0)
+		ret = ret2;
 
 	return ret;
 }
@@ -241,6 +251,7 @@ static int __test_case_3(struct btrfs_fs_info *fs_info,
 	struct extent_map *em;
 	u64 len = SZ_4K;
 	int ret;
+	int ret2;
 
 	em = alloc_extent_map();
 	if (!em) {
@@ -302,7 +313,9 @@ static int __test_case_3(struct btrfs_fs_info *fs_info,
 	}
 	free_extent_map(em);
 out:
-	free_extent_map_tree(em_tree);
+	ret2 = free_extent_map_tree(em_tree);
+	if (ret == 0)
+		ret = ret2;
 
 	return ret;
 }
@@ -345,6 +358,7 @@ static int __test_case_4(struct btrfs_fs_info *fs_info,
 	struct extent_map *em;
 	u64 len = SZ_4K;
 	int ret;
+	int ret2;
 
 	em = alloc_extent_map();
 	if (!em) {
@@ -421,7 +435,9 @@ static int __test_case_4(struct btrfs_fs_info *fs_info,
 	}
 	free_extent_map(em);
 out:
-	free_extent_map_tree(em_tree);
+	ret2 = free_extent_map_tree(em_tree);
+	if (ret == 0)
+		ret = ret2;
 
 	return ret;
 }
@@ -592,6 +608,7 @@ static int test_case_5(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
 {
 	u64 start, end;
 	int ret;
+	int ret2;
 
 	test_msg("Running btrfs_drop_extent_map_range tests");
 
@@ -662,7 +679,10 @@ static int test_case_5(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
 	if (ret)
 		goto out;
 out:
-	free_extent_map_tree(&inode->extent_tree);
+	ret2 = free_extent_map_tree(&inode->extent_tree);
+	if (ret == 0)
+		ret = ret2;
+
 	return ret;
 }
 
@@ -676,6 +696,7 @@ static int test_case_6(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
 	struct extent_map_tree *em_tree = &inode->extent_tree;
 	struct extent_map *em = NULL;
 	int ret;
+	int ret2;
 
 	ret = add_compressed_extent(inode, 0, SZ_4K, 0);
 	if (ret)
@@ -717,7 +738,10 @@ static int test_case_6(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
 	ret = 0;
 out:
 	free_extent_map(em);
-	free_extent_map_tree(em_tree);
+	ret2 = free_extent_map_tree(em_tree);
+	if (ret == 0)
+		ret = ret2;
+
 	return ret;
 }
 
@@ -847,7 +871,10 @@ static int test_case_7(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
 	ret2 = unpin_extent_cache(inode, 0, SZ_16K, 0);
 	if (ret == 0)
 		ret = ret2;
-	free_extent_map_tree(em_tree);
+	ret2 = free_extent_map_tree(em_tree);
+	if (ret == 0)
+		ret = ret2;
+
 	return ret;
 }
 
-- 
2.43.0


* [PATCH 03/11] btrfs: simplify add_extent_mapping() by removing pointless label
From: fdmanana @ 2024-04-10 11:28 UTC
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

The add_extent_mapping() function is short and trivial, so there's no need
to have a label for a quick exit in case of an error, especially since
there's no error handling needed - we just return the error. So remove the
label and return directly.

Also, while at it, remove the redundant initialization of 'ret', as that
may help avoid some warnings from clang tools such as the one
reported/fixed by commit 966de47ff0c9 ("btrfs: remove redundant
initialization of variables in log_new_ancestors").
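
In miniature, this is the dead store that such static analyzers flag
(a sketch, using the code changed below):

  int ret = 0;                       /* dead store: value never read... */

  ret = tree_insert(&tree->map, em); /* ...because it's overwritten here */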

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/extent_map.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
index 840be23d2c0a..d125d5ab9b1d 100644
--- a/fs/btrfs/extent_map.c
+++ b/fs/btrfs/extent_map.c
@@ -370,17 +370,17 @@ static inline void setup_extent_mapping(struct extent_map_tree *tree,
 static int add_extent_mapping(struct extent_map_tree *tree,
 			      struct extent_map *em, int modified)
 {
-	int ret = 0;
+	int ret;
 
 	lockdep_assert_held_write(&tree->lock);
 
 	ret = tree_insert(&tree->map, em);
 	if (ret)
-		goto out;
+		return ret;
 
 	setup_extent_mapping(tree, em, modified);
-out:
-	return ret;
+
+	return 0;
 }
 
 static struct extent_map *
-- 
2.43.0


* [PATCH 04/11] btrfs: pass the extent map tree's inode to add_extent_mapping()
From: fdmanana @ 2024-04-10 11:28 UTC
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

Extent maps are always added to an inode's extent map tree, so there's no
need to pass the extent map tree explicitly to add_extent_mapping().

In order to facilitate an upcoming change that adds a shrinker for extent
maps, change add_extent_mapping() to receive the inode instead of its
extent map tree.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/extent_map.c | 34 ++++++++++++++++------------------
 1 file changed, 16 insertions(+), 18 deletions(-)

diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
index d125d5ab9b1d..d0e0c4e5415e 100644
--- a/fs/btrfs/extent_map.c
+++ b/fs/btrfs/extent_map.c
@@ -355,21 +355,22 @@ static inline void setup_extent_mapping(struct extent_map_tree *tree,
 }
 
 /*
- * Add new extent map to the extent tree
+ * Add a new extent map to an inode's extent map tree.
  *
- * @tree:	tree to insert new map in
+ * @inode:	the target inode
  * @em:		map to insert
  * @modified:	indicate whether the given @em should be added to the
  *	        modified list, which indicates the extent needs to be logged
  *
- * Insert @em into @tree or perform a simple forward/backward merge with
- * existing mappings.  The extent_map struct passed in will be inserted
- * into the tree directly, with an additional reference taken, or a
- * reference dropped if the merge attempt was successful.
+ * Insert @em into the @inode's extent map tree or perform a simple
+ * forward/backward merge with existing mappings.  The extent_map struct passed
+ * in will be inserted into the tree directly, with an additional reference
+ * taken, or a reference dropped if the merge attempt was successful.
  */
-static int add_extent_mapping(struct extent_map_tree *tree,
+static int add_extent_mapping(struct btrfs_inode *inode,
 			      struct extent_map *em, int modified)
 {
+	struct extent_map_tree *tree = &inode->extent_tree;
 	int ret;
 
 	lockdep_assert_held_write(&tree->lock);
@@ -508,7 +509,7 @@ static struct extent_map *prev_extent_map(struct extent_map *em)
  * and an extent that you want to insert, deal with overlap and insert
  * the best fitted new extent into the tree.
  */
-static noinline int merge_extent_mapping(struct extent_map_tree *em_tree,
+static noinline int merge_extent_mapping(struct btrfs_inode *inode,
 					 struct extent_map *existing,
 					 struct extent_map *em,
 					 u64 map_start)
@@ -542,7 +543,7 @@ static noinline int merge_extent_mapping(struct extent_map_tree *em_tree,
 		em->block_start += start_diff;
 		em->block_len = em->len;
 	}
-	return add_extent_mapping(em_tree, em, 0);
+	return add_extent_mapping(inode, em, 0);
 }
 
 /*
@@ -570,7 +571,6 @@ int btrfs_add_extent_mapping(struct btrfs_inode *inode,
 {
 	int ret;
 	struct extent_map *em = *em_in;
-	struct extent_map_tree *em_tree = &inode->extent_tree;
 	struct btrfs_fs_info *fs_info = inode->root->fs_info;
 
 	/*
@@ -580,7 +580,7 @@ int btrfs_add_extent_mapping(struct btrfs_inode *inode,
 	if (em->block_start == EXTENT_MAP_INLINE)
 		ASSERT(em->start == 0);
 
-	ret = add_extent_mapping(em_tree, em, 0);
+	ret = add_extent_mapping(inode, em, 0);
 	/* it is possible that someone inserted the extent into the tree
 	 * while we had the lock dropped.  It is also possible that
 	 * an overlapping map exists in the tree
@@ -588,7 +588,7 @@ int btrfs_add_extent_mapping(struct btrfs_inode *inode,
 	if (ret == -EEXIST) {
 		struct extent_map *existing;
 
-		existing = search_extent_mapping(em_tree, start, len);
+		existing = search_extent_mapping(&inode->extent_tree, start, len);
 
 		trace_btrfs_handle_em_exist(fs_info, existing, em, start, len);
 
@@ -609,8 +609,7 @@ int btrfs_add_extent_mapping(struct btrfs_inode *inode,
 			 * The existing extent map is the one nearest to
 			 * the [start, start + len) range which overlaps
 			 */
-			ret = merge_extent_mapping(em_tree, existing,
-						   em, start);
+			ret = merge_extent_mapping(inode, existing, em, start);
 			if (WARN_ON(ret)) {
 				free_extent_map(em);
 				*em_in = NULL;
@@ -818,8 +817,7 @@ void btrfs_drop_extent_map_range(struct btrfs_inode *inode, u64 start, u64 end,
 			} else {
 				int ret;
 
-				ret = add_extent_mapping(em_tree, split,
-							 modified);
+				ret = add_extent_mapping(inode, split, modified);
 				/* Logic error, shouldn't happen. */
 				ASSERT(ret == 0);
 				if (WARN_ON(ret != 0) && modified)
@@ -909,7 +907,7 @@ int btrfs_replace_extent_map_range(struct btrfs_inode *inode,
 	do {
 		btrfs_drop_extent_map_range(inode, new_em->start, end, false);
 		write_lock(&tree->lock);
-		ret = add_extent_mapping(tree, new_em, modified);
+		ret = add_extent_mapping(inode, new_em, modified);
 		write_unlock(&tree->lock);
 	} while (ret == -EEXIST);
 
@@ -990,7 +988,7 @@ int split_extent_map(struct btrfs_inode *inode, u64 start, u64 len, u64 pre,
 	split_mid->ram_bytes = split_mid->len;
 	split_mid->flags = flags;
 	split_mid->generation = em->generation;
-	add_extent_mapping(em_tree, split_mid, 1);
+	add_extent_mapping(inode, split_mid, 1);
 
 	/* Once for us */
 	free_extent_map(em);
-- 
2.43.0


* [PATCH 05/11] btrfs: pass the extent map tree's inode to clear_em_logging()
From: fdmanana @ 2024-04-10 11:28 UTC
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

Extent maps are always associated to an inode's extent map tree, so
there's no need to pass the extent map tree explicitly to
clear_em_logging().

In order to facilitate an upcoming change that adds a shrinker for extent
maps, change clear_em_logging() to receive the inode instead of its extent
map tree.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/extent_map.c | 4 +++-
 fs/btrfs/extent_map.h | 2 +-
 fs/btrfs/tree-log.c   | 4 ++--
 3 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
index d0e0c4e5415e..7cda78d11d75 100644
--- a/fs/btrfs/extent_map.c
+++ b/fs/btrfs/extent_map.c
@@ -331,8 +331,10 @@ int unpin_extent_cache(struct btrfs_inode *inode, u64 start, u64 len, u64 gen)
 
 }
 
-void clear_em_logging(struct extent_map_tree *tree, struct extent_map *em)
+void clear_em_logging(struct btrfs_inode *inode, struct extent_map *em)
 {
+	struct extent_map_tree *tree = &inode->extent_tree;
+
 	lockdep_assert_held_write(&tree->lock);
 
 	em->flags &= ~EXTENT_FLAG_LOGGING;
diff --git a/fs/btrfs/extent_map.h b/fs/btrfs/extent_map.h
index f287ab46e368..732fc8d7e534 100644
--- a/fs/btrfs/extent_map.h
+++ b/fs/btrfs/extent_map.h
@@ -129,7 +129,7 @@ void free_extent_map(struct extent_map *em);
 int __init extent_map_init(void);
 void __cold extent_map_exit(void);
 int unpin_extent_cache(struct btrfs_inode *inode, u64 start, u64 len, u64 gen);
-void clear_em_logging(struct extent_map_tree *tree, struct extent_map *em);
+void clear_em_logging(struct btrfs_inode *inode, struct extent_map *em);
 struct extent_map *search_extent_mapping(struct extent_map_tree *tree,
 					 u64 start, u64 len);
 int btrfs_add_extent_mapping(struct btrfs_inode *inode,
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index d9777649e170..4a4fca841510 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -4945,7 +4945,7 @@ static int btrfs_log_changed_extents(struct btrfs_trans_handle *trans,
 		 * private list.
 		 */
 		if (ret) {
-			clear_em_logging(tree, em);
+			clear_em_logging(inode, em);
 			free_extent_map(em);
 			continue;
 		}
@@ -4954,7 +4954,7 @@ static int btrfs_log_changed_extents(struct btrfs_trans_handle *trans,
 
 		ret = log_one_extent(trans, inode, em, path, ctx);
 		write_lock(&tree->lock);
-		clear_em_logging(tree, em);
+		clear_em_logging(inode, em);
 		free_extent_map(em);
 	}
 	WARN_ON(!list_empty(&extents));
-- 
2.43.0


* [PATCH 06/11] btrfs: pass the extent map tree's inode to remove_extent_mapping()
From: fdmanana @ 2024-04-10 11:28 UTC
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

Extent maps are always associated to an inode's extent map tree, so
there's no need to pass the extent map tree explicitly to
remove_extent_mapping().

In order to facilitate an upcoming change that adds a shrinker for extent
maps, change remove_extent_mapping() to receive the inode instead of its
extent map tree.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/extent_io.c              |  2 +-
 fs/btrfs/extent_map.c             | 22 +++++++++++++---------
 fs/btrfs/extent_map.h             |  2 +-
 fs/btrfs/tests/extent-map-tests.c | 19 ++++++++++---------
 4 files changed, 25 insertions(+), 20 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index d90330f26827..1b236fc3f411 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2457,7 +2457,7 @@ int try_release_extent_mapping(struct page *page, gfp_t mask)
 			 * hurts the fsync performance for workloads with a data
 			 * size that exceeds or is close to the system's memory).
 			 */
-			remove_extent_mapping(map, em);
+			remove_extent_mapping(btrfs_inode, em);
 			/* once for the rb tree */
 			free_extent_map(em);
 next:
diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
index 7cda78d11d75..289669763965 100644
--- a/fs/btrfs/extent_map.c
+++ b/fs/btrfs/extent_map.c
@@ -449,16 +449,18 @@ struct extent_map *search_extent_mapping(struct extent_map_tree *tree,
 }
 
 /*
- * Remove an extent_map from the extent tree.
+ * Remove an extent_map from its inode's extent tree.
  *
- * @tree:	extent tree to remove from
+ * @inode:	the inode the extent map belongs to
  * @em:		extent map being removed
  *
- * Remove @em from @tree.  No reference counts are dropped, and no checks
- * are done to see if the range is in use.
+ * Remove @em from the extent tree of @inode.  No reference counts are dropped,
+ * and no checks are done to see if the range is in use.
  */
-void remove_extent_mapping(struct extent_map_tree *tree, struct extent_map *em)
+void remove_extent_mapping(struct btrfs_inode *inode, struct extent_map *em)
 {
+	struct extent_map_tree *tree = &inode->extent_tree;
+
 	lockdep_assert_held_write(&tree->lock);
 
 	WARN_ON(em->flags & EXTENT_FLAG_PINNED);
@@ -633,8 +635,10 @@ int btrfs_add_extent_mapping(struct btrfs_inode *inode,
  * if needed. This avoids searching the tree, from the root down to the first
  * extent map, before each deletion.
  */
-static void drop_all_extent_maps_fast(struct extent_map_tree *tree)
+static void drop_all_extent_maps_fast(struct btrfs_inode *inode)
 {
+	struct extent_map_tree *tree = &inode->extent_tree;
+
 	write_lock(&tree->lock);
 	while (!RB_EMPTY_ROOT(&tree->map.rb_root)) {
 		struct extent_map *em;
@@ -643,7 +647,7 @@ static void drop_all_extent_maps_fast(struct extent_map_tree *tree)
 		node = rb_first_cached(&tree->map);
 		em = rb_entry(node, struct extent_map, rb_node);
 		em->flags &= ~(EXTENT_FLAG_PINNED | EXTENT_FLAG_LOGGING);
-		remove_extent_mapping(tree, em);
+		remove_extent_mapping(inode, em);
 		free_extent_map(em);
 		cond_resched_rwlock_write(&tree->lock);
 	}
@@ -676,7 +680,7 @@ void btrfs_drop_extent_map_range(struct btrfs_inode *inode, u64 start, u64 end,
 	WARN_ON(end < start);
 	if (end == (u64)-1) {
 		if (start == 0 && !skip_pinned) {
-			drop_all_extent_maps_fast(em_tree);
+			drop_all_extent_maps_fast(inode);
 			return;
 		}
 		len = (u64)-1;
@@ -854,7 +858,7 @@ void btrfs_drop_extent_map_range(struct btrfs_inode *inode, u64 start, u64 end,
 				ASSERT(!split);
 				btrfs_set_inode_full_sync(inode);
 			}
-			remove_extent_mapping(em_tree, em);
+			remove_extent_mapping(inode, em);
 		}
 
 		/*
diff --git a/fs/btrfs/extent_map.h b/fs/btrfs/extent_map.h
index 732fc8d7e534..c3707461ff62 100644
--- a/fs/btrfs/extent_map.h
+++ b/fs/btrfs/extent_map.h
@@ -120,7 +120,7 @@ static inline u64 extent_map_end(const struct extent_map *em)
 void extent_map_tree_init(struct extent_map_tree *tree);
 struct extent_map *lookup_extent_mapping(struct extent_map_tree *tree,
 					 u64 start, u64 len);
-void remove_extent_mapping(struct extent_map_tree *tree, struct extent_map *em);
+void remove_extent_mapping(struct btrfs_inode *inode, struct extent_map *em);
 int split_extent_map(struct btrfs_inode *inode, u64 start, u64 len, u64 pre,
 		     u64 new_logical);
 
diff --git a/fs/btrfs/tests/extent-map-tests.c b/fs/btrfs/tests/extent-map-tests.c
index 9e9cb591c0f1..db6fb1a2c78f 100644
--- a/fs/btrfs/tests/extent-map-tests.c
+++ b/fs/btrfs/tests/extent-map-tests.c
@@ -11,8 +11,9 @@
 #include "../disk-io.h"
 #include "../block-group.h"
 
-static int free_extent_map_tree(struct extent_map_tree *em_tree)
+static int free_extent_map_tree(struct btrfs_inode *inode)
 {
+	struct extent_map_tree *em_tree = &inode->extent_tree;
 	struct extent_map *em;
 	struct rb_node *node;
 	int ret = 0;
@@ -21,7 +22,7 @@ static int free_extent_map_tree(struct extent_map_tree *em_tree)
 	while (!RB_EMPTY_ROOT(&em_tree->map.rb_root)) {
 		node = rb_first_cached(&em_tree->map);
 		em = rb_entry(node, struct extent_map, rb_node);
-		remove_extent_mapping(em_tree, em);
+		remove_extent_mapping(inode, em);
 
 #ifdef CONFIG_BTRFS_DEBUG
 		if (refcount_read(&em->refs) != 1) {
@@ -142,7 +143,7 @@ static int test_case_1(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
 	}
 	free_extent_map(em);
 out:
-	ret2 = free_extent_map_tree(em_tree);
+	ret2 = free_extent_map_tree(inode);
 	if (ret == 0)
 		ret = ret2;
 
@@ -237,7 +238,7 @@ static int test_case_2(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
 	}
 	free_extent_map(em);
 out:
-	ret2 = free_extent_map_tree(em_tree);
+	ret2 = free_extent_map_tree(inode);
 	if (ret == 0)
 		ret = ret2;
 
@@ -313,7 +314,7 @@ static int __test_case_3(struct btrfs_fs_info *fs_info,
 	}
 	free_extent_map(em);
 out:
-	ret2 = free_extent_map_tree(em_tree);
+	ret2 = free_extent_map_tree(inode);
 	if (ret == 0)
 		ret = ret2;
 
@@ -435,7 +436,7 @@ static int __test_case_4(struct btrfs_fs_info *fs_info,
 	}
 	free_extent_map(em);
 out:
-	ret2 = free_extent_map_tree(em_tree);
+	ret2 = free_extent_map_tree(inode);
 	if (ret == 0)
 		ret = ret2;
 
@@ -679,7 +680,7 @@ static int test_case_5(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
 	if (ret)
 		goto out;
 out:
-	ret2 = free_extent_map_tree(&inode->extent_tree);
+	ret2 = free_extent_map_tree(inode);
 	if (ret == 0)
 		ret = ret2;
 
@@ -738,7 +739,7 @@ static int test_case_6(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
 	ret = 0;
 out:
 	free_extent_map(em);
-	ret2 = free_extent_map_tree(em_tree);
+	ret2 = free_extent_map_tree(inode);
 	if (ret == 0)
 		ret = ret2;
 
@@ -871,7 +872,7 @@ static int test_case_7(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
 	ret2 = unpin_extent_cache(inode, 0, SZ_16K, 0);
 	if (ret == 0)
 		ret = ret2;
-	ret2 = free_extent_map_tree(em_tree);
+	ret2 = free_extent_map_tree(inode);
 	if (ret == 0)
 		ret = ret2;
 
-- 
2.43.0


* [PATCH 07/11] btrfs: pass the extent map tree's inode to replace_extent_mapping()
From: fdmanana @ 2024-04-10 11:28 UTC
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

Extent maps are always associated to an inode's extent map tree, so
there's no need to pass the extent map tree explicitly to
replace_extent_mapping().

In order to facilitate an upcoming change that adds a shrinker for extent
maps, change replace_extent_mapping() to receive the inode instead of its
extent map tree.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/extent_map.c | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
index 289669763965..15817b842c24 100644
--- a/fs/btrfs/extent_map.c
+++ b/fs/btrfs/extent_map.c
@@ -470,11 +470,13 @@ void remove_extent_mapping(struct btrfs_inode *inode, struct extent_map *em)
 	RB_CLEAR_NODE(&em->rb_node);
 }
 
-static void replace_extent_mapping(struct extent_map_tree *tree,
+static void replace_extent_mapping(struct btrfs_inode *inode,
 				   struct extent_map *cur,
 				   struct extent_map *new,
 				   int modified)
 {
+	struct extent_map_tree *tree = &inode->extent_tree;
+
 	lockdep_assert_held_write(&tree->lock);
 
 	WARN_ON(cur->flags & EXTENT_FLAG_PINNED);
@@ -777,7 +779,7 @@ void btrfs_drop_extent_map_range(struct btrfs_inode *inode, u64 start, u64 end,
 
 			split->generation = gen;
 			split->flags = flags;
-			replace_extent_mapping(em_tree, em, split, modified);
+			replace_extent_mapping(inode, em, split, modified);
 			free_extent_map(split);
 			split = split2;
 			split2 = NULL;
@@ -818,8 +820,7 @@ void btrfs_drop_extent_map_range(struct btrfs_inode *inode, u64 start, u64 end,
 			}
 
 			if (extent_map_in_tree(em)) {
-				replace_extent_mapping(em_tree, em, split,
-						       modified);
+				replace_extent_mapping(inode, em, split, modified);
 			} else {
 				int ret;
 
@@ -977,7 +978,7 @@ int split_extent_map(struct btrfs_inode *inode, u64 start, u64 len, u64 pre,
 	split_pre->flags = flags;
 	split_pre->generation = em->generation;
 
-	replace_extent_mapping(em_tree, em, split_pre, 1);
+	replace_extent_mapping(inode, em, split_pre, 1);
 
 	/*
 	 * Now we only have an extent_map at:
-- 
2.43.0


* [PATCH 08/11] btrfs: add a global per cpu counter to track number of used extent maps
From: fdmanana @ 2024-04-10 11:28 UTC
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

Add a per cpu counter that tracks the total number of extent maps that are
in extent trees of inodes that belong to fs trees. This is going to be
used in an upcoming change that adds a shrinker for extent maps. Only
extent maps for fs trees are considered, because for special trees such as
the data relocation tree we don't want to evict their extent maps, which
are critical for the relocation to work, and since those are limited, it's
not a concern to have them in memory during the relocation of a block
group. Another case is extent maps for free space cache inodes, which must
always remain in memory, but those are limited (there's only one per free
space cache inode, which means one per block group).
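
As a sketch of why a per cpu counter fits here: updates happen on hot
paths (every extent map added to or removed from an fs tree inode) and
only touch a CPU-local slot, avoiding cache-line bouncing, while the exact
total is only needed occasionally, where summing all the per CPU slots is
acceptable:

  /* Hot path: cheap, CPU-local update (as done in the patch below). */
  if (!btrfs_is_testing(fs_info) && is_fstree(btrfs_root_id(root)))
          percpu_counter_inc(&fs_info->evictable_extent_maps);

  /* Cold path (e.g. the upcoming shrinker): exact sum across CPUs. */
  s64 nr = percpu_counter_sum_positive(&fs_info->evictable_extent_maps);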

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/disk-io.c    |  6 ++++++
 fs/btrfs/extent_map.c | 38 +++++++++++++++++++++++++++-----------
 fs/btrfs/fs.h         |  2 ++
 3 files changed, 35 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 0474e9b6d302..3c2d35b2062e 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1269,6 +1269,8 @@ void btrfs_free_fs_info(struct btrfs_fs_info *fs_info)
 	percpu_counter_destroy(&fs_info->dirty_metadata_bytes);
 	percpu_counter_destroy(&fs_info->delalloc_bytes);
 	percpu_counter_destroy(&fs_info->ordered_bytes);
+	ASSERT(percpu_counter_sum_positive(&fs_info->evictable_extent_maps) == 0);
+	percpu_counter_destroy(&fs_info->evictable_extent_maps);
 	percpu_counter_destroy(&fs_info->dev_replace.bio_counter);
 	btrfs_free_csum_hash(fs_info);
 	btrfs_free_stripe_hash_table(fs_info);
@@ -2848,6 +2850,10 @@ static int init_mount_fs_info(struct btrfs_fs_info *fs_info, struct super_block
 	if (ret)
 		return ret;
 
+	ret = percpu_counter_init(&fs_info->evictable_extent_maps, 0, GFP_KERNEL);
+	if (ret)
+		return ret;
+
 	ret = percpu_counter_init(&fs_info->dirty_metadata_bytes, 0, GFP_KERNEL);
 	if (ret)
 		return ret;
diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
index 15817b842c24..2fcf28148a81 100644
--- a/fs/btrfs/extent_map.c
+++ b/fs/btrfs/extent_map.c
@@ -76,6 +76,14 @@ static u64 range_end(u64 start, u64 len)
 	return start + len;
 }
 
+static void dec_evictable_extent_maps(struct btrfs_inode *inode)
+{
+	struct btrfs_fs_info *fs_info = inode->root->fs_info;
+
+	if (!btrfs_is_testing(fs_info) && is_fstree(btrfs_root_id(inode->root)))
+		percpu_counter_dec(&fs_info->evictable_extent_maps);
+}
+
 static int tree_insert(struct rb_root_cached *root, struct extent_map *em)
 {
 	struct rb_node **p = &root->rb_root.rb_node;
@@ -223,8 +231,9 @@ static bool mergeable_maps(const struct extent_map *prev, const struct extent_ma
 	return next->block_start == prev->block_start;
 }
 
-static void try_merge_map(struct extent_map_tree *tree, struct extent_map *em)
+static void try_merge_map(struct btrfs_inode *inode, struct extent_map *em)
 {
+	struct extent_map_tree *tree = &inode->extent_tree;
 	struct extent_map *merge = NULL;
 	struct rb_node *rb;
 
@@ -258,6 +267,7 @@ static void try_merge_map(struct extent_map_tree *tree, struct extent_map *em)
 			rb_erase_cached(&merge->rb_node, &tree->map);
 			RB_CLEAR_NODE(&merge->rb_node);
 			free_extent_map(merge);
+			dec_evictable_extent_maps(inode);
 		}
 	}
 
@@ -272,6 +282,7 @@ static void try_merge_map(struct extent_map_tree *tree, struct extent_map *em)
 		em->generation = max(em->generation, merge->generation);
 		em->flags |= EXTENT_FLAG_MERGED;
 		free_extent_map(merge);
+		dec_evictable_extent_maps(inode);
 	}
 }
 
@@ -322,7 +333,7 @@ int unpin_extent_cache(struct btrfs_inode *inode, u64 start, u64 len, u64 gen)
 	em->generation = gen;
 	em->flags &= ~EXTENT_FLAG_PINNED;
 
-	try_merge_map(tree, em);
+	try_merge_map(inode, em);
 
 out:
 	write_unlock(&tree->lock);
@@ -333,16 +344,14 @@ int unpin_extent_cache(struct btrfs_inode *inode, u64 start, u64 len, u64 gen)
 
 void clear_em_logging(struct btrfs_inode *inode, struct extent_map *em)
 {
-	struct extent_map_tree *tree = &inode->extent_tree;
-
-	lockdep_assert_held_write(&tree->lock);
+	lockdep_assert_held_write(&inode->extent_tree.lock);
 
 	em->flags &= ~EXTENT_FLAG_LOGGING;
 	if (extent_map_in_tree(em))
-		try_merge_map(tree, em);
+		try_merge_map(inode, em);
 }
 
-static inline void setup_extent_mapping(struct extent_map_tree *tree,
+static inline void setup_extent_mapping(struct btrfs_inode *inode,
 					struct extent_map *em,
 					int modified)
 {
@@ -351,9 +360,9 @@ static inline void setup_extent_mapping(struct extent_map_tree *tree,
 	ASSERT(list_empty(&em->list));
 
 	if (modified)
-		list_add(&em->list, &tree->modified_extents);
+		list_add(&em->list, &inode->extent_tree.modified_extents);
 	else
-		try_merge_map(tree, em);
+		try_merge_map(inode, em);
 }
 
 /*
@@ -373,6 +382,8 @@ static int add_extent_mapping(struct btrfs_inode *inode,
 			      struct extent_map *em, int modified)
 {
 	struct extent_map_tree *tree = &inode->extent_tree;
+	struct btrfs_root *root = inode->root;
+	struct btrfs_fs_info *fs_info = root->fs_info;
 	int ret;
 
 	lockdep_assert_held_write(&tree->lock);
@@ -381,7 +392,10 @@ static int add_extent_mapping(struct btrfs_inode *inode,
 	if (ret)
 		return ret;
 
-	setup_extent_mapping(tree, em, modified);
+	setup_extent_mapping(inode, em, modified);
+
+	if (!btrfs_is_testing(fs_info) && is_fstree(btrfs_root_id(root)))
+		percpu_counter_inc(&fs_info->evictable_extent_maps);
 
 	return 0;
 }
@@ -468,6 +482,8 @@ void remove_extent_mapping(struct btrfs_inode *inode, struct extent_map *em)
 	if (!(em->flags & EXTENT_FLAG_LOGGING))
 		list_del_init(&em->list);
 	RB_CLEAR_NODE(&em->rb_node);
+
+	dec_evictable_extent_maps(inode);
 }
 
 static void replace_extent_mapping(struct btrfs_inode *inode,
@@ -486,7 +502,7 @@ static void replace_extent_mapping(struct btrfs_inode *inode,
 	rb_replace_node_cached(&cur->rb_node, &new->rb_node, &tree->map);
 	RB_CLEAR_NODE(&cur->rb_node);
 
-	setup_extent_mapping(tree, new, modified);
+	setup_extent_mapping(inode, new, modified);
 }
 
 static struct extent_map *next_extent_map(const struct extent_map *em)
diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
index 93f5c57ea4e3..534d30dafe32 100644
--- a/fs/btrfs/fs.h
+++ b/fs/btrfs/fs.h
@@ -630,6 +630,8 @@ struct btrfs_fs_info {
 	s32 dirty_metadata_batch;
 	s32 delalloc_batch;
 
+	struct percpu_counter evictable_extent_maps;
+
 	/* Protected by 'trans_lock'. */
 	struct list_head dirty_cowonly_roots;
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH 09/11] btrfs: add a shrinker for extent maps
  2024-04-10 11:28 [PATCH 00/11] btrfs: add a shrinker for extent maps fdmanana
                   ` (7 preceding siblings ...)
  2024-04-10 11:28 ` [PATCH 08/11] btrfs: add a global per cpu counter to track number of used extent maps fdmanana
@ 2024-04-10 11:28 ` fdmanana
  2024-04-11  5:58   ` Qu Wenruo
  2024-04-10 11:28 ` [PATCH 10/11] btrfs: update comment for btrfs_set_inode_full_sync() about locking fdmanana
                   ` (4 subsequent siblings)
  13 siblings, 1 reply; 64+ messages in thread
From: fdmanana @ 2024-04-10 11:28 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

Extent maps are used either to represent existing file extent items, or to
represent new extents that are going to be written and the respective file
extent items are created when the ordered extent completes.

We currently don't have any limit for how many extent maps we can have,
neither per inode nor globally. Most of the time this is not too noticeable
because extent maps are removed in the following situations:

1) When evicting an inode;

2) When releasing folios (pages) through the btrfs_release_folio() address
   space operation callback.

   However we won't release extent maps in the folio range if the folio is
   either dirty or under writeback, or if the inode's i_size is less than
   or equal to 16M (see try_release_extent_mapping()). This 16M i_size
   constraint was added back in 2008 with commit 70dec8079d78 ("Btrfs:
   extent_io and extent_state optimizations"), but there's no explanation
   about why we have it or why the 16M value was chosen.

This means that for buffered IO we can reach an OOM situation due to too
many extent maps if either of the following happens:

1) There's a set of tasks constantly doing IO on many files with a size
   not larger than 16M, especially if they keep the files open for very
   long periods, therefore preventing inode eviction.

   This requires a really high number of such files, and having many non
   mergeable extent maps (due to random 4K writes for example) and a
   machine with very little memory;

2) There's a set of tasks constantly doing random write IO (therefore
   creating many non mergeable extent maps) on files and keeping them
   open for long periods of time, so inode eviction doesn't happen and
   there's always a lot of dirty pages or pages under writeback,
   preventing btrfs_release_folio() from releasing the respective extent
   maps.

This second case was actually reported in the thread pointed to by the
Link tag below, and it requires a very large file under heavy IO and a
machine with a very small amount of RAM, which is probably hard to hit in
practice in a real world use case.

However when using direct IO this is not so hard to hit, because the page
cache is not used and therefore btrfs_release_folio() is never called,
which means extent maps are dropped only when evicting the inode. That
means that if we have tasks that keep a file descriptor open and keep
doing IO on a very large file (or files), we can exhaust memory due to an
unbounded amount of extent maps. This is especially easy to trigger if we
have a huge file with millions of small extents whose extent maps are not
mergeable (non contiguous offsets and disk locations).
This was reported in that thread with the following fio test:

   $ cat test.sh
   #!/bin/bash

   DEV=/dev/sdj
   MNT=/mnt/sdj
   MOUNT_OPTIONS="-o ssd"
   MKFS_OPTIONS=""

   cat <<EOF > /tmp/fio-job.ini
   [global]
   name=fio-rand-write
   filename=$MNT/fio-rand-write
   rw=randwrite
   bs=4K
   direct=1
   numjobs=16
   fallocate=none
   time_based
   runtime=90000

   [file1]
   size=300G
   ioengine=libaio
   iodepth=16

   EOF

   umount $MNT &> /dev/null
   mkfs.btrfs -f $MKFS_OPTIONS $DEV
   mount $MOUNT_OPTIONS $DEV $MNT

   fio /tmp/fio-job.ini
   umount $MNT

Monitoring the btrfs_extent_map slab while running the test with:

   $ watch -d -n 1 'cat /sys/kernel/slab/btrfs_extent_map/objects \
                        /sys/kernel/slab/btrfs_extent_map/total_objects'

shows the number of active and total extent maps skyrocketing to tens of
millions, and on systems with a small amount of memory it's easy and quick
to get into an OOM situation, as reported in that thread.

So to avoid this issue, add a shrinker that will remove extent maps, as
long as they are not pinned, and that takes proper care with any concurrent
fsync to avoid missing extents (setting the full sync flag while in the
middle of a fast fsync). This shrinker is similar to the one ext4 uses
for its extent_status structure, which is analogous to btrfs' extent_map
structure.

Link: https://lore.kernel.org/linux-btrfs/13f94633dcf04d29aaf1f0a43d42c55e@amazon.com/
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/disk-io.c    |   7 +-
 fs/btrfs/extent_map.c | 200 ++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/extent_map.h |   2 +
 fs/btrfs/fs.h         |   2 +
 4 files changed, 207 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 3c2d35b2062e..8bb295eaf3d7 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1266,11 +1266,10 @@ static void free_global_roots(struct btrfs_fs_info *fs_info)
 
 void btrfs_free_fs_info(struct btrfs_fs_info *fs_info)
 {
+	btrfs_unregister_extent_map_shrinker(fs_info);
 	percpu_counter_destroy(&fs_info->dirty_metadata_bytes);
 	percpu_counter_destroy(&fs_info->delalloc_bytes);
 	percpu_counter_destroy(&fs_info->ordered_bytes);
-	ASSERT(percpu_counter_sum_positive(&fs_info->evictable_extent_maps) == 0);
-	percpu_counter_destroy(&fs_info->evictable_extent_maps);
 	percpu_counter_destroy(&fs_info->dev_replace.bio_counter);
 	btrfs_free_csum_hash(fs_info);
 	btrfs_free_stripe_hash_table(fs_info);
@@ -2846,11 +2845,11 @@ static int init_mount_fs_info(struct btrfs_fs_info *fs_info, struct super_block
 	sb->s_blocksize = BTRFS_BDEV_BLOCKSIZE;
 	sb->s_blocksize_bits = blksize_bits(BTRFS_BDEV_BLOCKSIZE);
 
-	ret = percpu_counter_init(&fs_info->ordered_bytes, 0, GFP_KERNEL);
+	ret = btrfs_register_extent_map_shrinker(fs_info);
 	if (ret)
 		return ret;
 
-	ret = percpu_counter_init(&fs_info->evictable_extent_maps, 0, GFP_KERNEL);
+	ret = percpu_counter_init(&fs_info->ordered_bytes, 0, GFP_KERNEL);
 	if (ret)
 		return ret;
 
diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
index 2fcf28148a81..fa755921442d 100644
--- a/fs/btrfs/extent_map.c
+++ b/fs/btrfs/extent_map.c
@@ -8,6 +8,7 @@
 #include "extent_map.h"
 #include "compression.h"
 #include "btrfs_inode.h"
+#include "disk-io.h"
 
 
 static struct kmem_cache *extent_map_cache;
@@ -1026,3 +1027,202 @@ int split_extent_map(struct btrfs_inode *inode, u64 start, u64 len, u64 pre,
 	free_extent_map(split_pre);
 	return ret;
 }
+
+static unsigned long btrfs_scan_inode(struct btrfs_inode *inode,
+				      unsigned long *scanned,
+				      unsigned long nr_to_scan)
+{
+	struct extent_map_tree *tree = &inode->extent_tree;
+	unsigned long nr_dropped = 0;
+	struct rb_node *node;
+
+	/*
+	 * Take the mmap lock so that we serialize with the inode logging phase
+	 * of fsync because we may need to set the full sync flag on the inode,
+	 * in case we have to remove extent maps in the tree's list of modified
+	 * extents. If we set the full sync flag in the inode while an fsync is
+	 * in progress, we may risk missing new extents because before the flag
+	 * is set, fsync decides to only wait for writeback to complete and then
+	 * during inode logging it sees the flag set and uses the subvolume tree
+	 * to find new extents, which may not be there yet because ordered
+	 * extents haven't completed yet.
+	 */
+	down_read(&inode->i_mmap_lock);
+	write_lock(&tree->lock);
+	node = rb_first_cached(&tree->map);
+	while (node) {
+		struct extent_map *em;
+
+		em = rb_entry(node, struct extent_map, rb_node);
+		node = rb_next(node);
+		(*scanned)++;
+
+		if (em->flags & EXTENT_FLAG_PINNED)
+			goto next;
+
+		if (!list_empty(&em->list))
+			btrfs_set_inode_full_sync(inode);
+
+		remove_extent_mapping(inode, em);
+		/* Drop the reference for the tree. */
+		free_extent_map(em);
+		nr_dropped++;
+next:
+		if (*scanned >= nr_to_scan)
+			break;
+
+		/*
+		 * Restart if we had to resched, and any extent maps that were
+		 * pinned before may have become unpinned after we released the
+		 * lock and took it again.
+		 */
+		if (cond_resched_rwlock_write(&tree->lock))
+			node = rb_first_cached(&tree->map);
+	}
+	write_unlock(&tree->lock);
+	up_read(&inode->i_mmap_lock);
+
+	return nr_dropped;
+}
+
+static unsigned long btrfs_scan_root(struct btrfs_root *root,
+				     unsigned long *scanned,
+				     unsigned long nr_to_scan)
+{
+	unsigned long nr_dropped = 0;
+	u64 ino = 0;
+
+	while (*scanned < nr_to_scan) {
+		struct rb_node *node;
+		struct rb_node *prev = NULL;
+		struct btrfs_inode *inode;
+		bool stop_search = true;
+
+		spin_lock(&root->inode_lock);
+		node = root->inode_tree.rb_node;
+
+		while (node) {
+			prev = node;
+			inode = rb_entry(node, struct btrfs_inode, rb_node);
+			if (ino < btrfs_ino(inode))
+				node = node->rb_left;
+			else if (ino > btrfs_ino(inode))
+				node = node->rb_right;
+			else
+				break;
+		}
+
+		if (!node) {
+			while (prev) {
+				inode = rb_entry(prev, struct btrfs_inode, rb_node);
+				if (ino <= btrfs_ino(inode)) {
+					node = prev;
+					break;
+				}
+				prev = rb_next(prev);
+			}
+		}
+
+		while (node) {
+			inode = rb_entry(node, struct btrfs_inode, rb_node);
+			ino = btrfs_ino(inode) + 1;
+			if (igrab(&inode->vfs_inode)) {
+				spin_unlock(&root->inode_lock);
+				stop_search = false;
+
+				nr_dropped += btrfs_scan_inode(inode, scanned,
+							       nr_to_scan);
+				iput(&inode->vfs_inode);
+				cond_resched();
+				break;
+			}
+			node = rb_next(node);
+		}
+
+		if (stop_search) {
+			spin_unlock(&root->inode_lock);
+			break;
+		}
+	}
+
+	return nr_dropped;
+}
+
+static unsigned long btrfs_extent_maps_scan(struct shrinker *shrinker,
+					    struct shrink_control *sc)
+{
+	struct btrfs_fs_info *fs_info = shrinker->private_data;
+	unsigned long nr_dropped = 0;
+	unsigned long scanned = 0;
+	u64 next_root_id = 0;
+
+	while (scanned < sc->nr_to_scan) {
+		struct btrfs_root *root;
+		unsigned long count;
+
+		spin_lock(&fs_info->fs_roots_radix_lock);
+		count = radix_tree_gang_lookup(&fs_info->fs_roots_radix,
+					       (void **)&root, next_root_id, 1);
+		if (count == 0) {
+			spin_unlock(&fs_info->fs_roots_radix_lock);
+			break;
+		}
+		next_root_id = btrfs_root_id(root) + 1;
+		root = btrfs_grab_root(root);
+		spin_unlock(&fs_info->fs_roots_radix_lock);
+
+		if (!root)
+			continue;
+
+		if (is_fstree(btrfs_root_id(root)))
+			nr_dropped += btrfs_scan_root(root, &scanned, sc->nr_to_scan);
+
+		btrfs_put_root(root);
+	}
+
+	return nr_dropped;
+}
+
+static unsigned long btrfs_extent_maps_count(struct shrinker *shrinker,
+					     struct shrink_control *sc)
+{
+	struct btrfs_fs_info *fs_info = shrinker->private_data;
+	const s64 total = percpu_counter_sum_positive(&fs_info->evictable_extent_maps);
+
+	/* The unsigned long type is 32 bits on 32 bits platforms. */
+#if BITS_PER_LONG == 32
+	if (total > ULONG_MAX)
+		return ULONG_MAX;
+#endif
+	return total;
+}
+
+int btrfs_register_extent_map_shrinker(struct btrfs_fs_info *fs_info)
+{
+	int ret;
+
+	ret = percpu_counter_init(&fs_info->evictable_extent_maps, 0, GFP_KERNEL);
+	if (ret)
+		return ret;
+
+	fs_info->extent_map_shrinker = shrinker_alloc(0, "em-btrfs:%s", fs_info->sb->s_id);
+	if (!fs_info->extent_map_shrinker) {
+		percpu_counter_destroy(&fs_info->evictable_extent_maps);
+		return -ENOMEM;
+	}
+
+	fs_info->extent_map_shrinker->scan_objects = btrfs_extent_maps_scan;
+	fs_info->extent_map_shrinker->count_objects = btrfs_extent_maps_count;
+	fs_info->extent_map_shrinker->private_data = fs_info;
+
+	shrinker_register(fs_info->extent_map_shrinker);
+
+	return 0;
+}
+
+void btrfs_unregister_extent_map_shrinker(struct btrfs_fs_info *fs_info)
+{
+	shrinker_free(fs_info->extent_map_shrinker);
+	ASSERT(percpu_counter_sum_positive(&fs_info->evictable_extent_maps) == 0);
+	percpu_counter_destroy(&fs_info->evictable_extent_maps);
+}
diff --git a/fs/btrfs/extent_map.h b/fs/btrfs/extent_map.h
index c3707461ff62..8a6be2f7a0e2 100644
--- a/fs/btrfs/extent_map.h
+++ b/fs/btrfs/extent_map.h
@@ -140,5 +140,7 @@ void btrfs_drop_extent_map_range(struct btrfs_inode *inode,
 int btrfs_replace_extent_map_range(struct btrfs_inode *inode,
 				   struct extent_map *new_em,
 				   bool modified);
+int btrfs_register_extent_map_shrinker(struct btrfs_fs_info *fs_info);
+void btrfs_unregister_extent_map_shrinker(struct btrfs_fs_info *fs_info);
 
 #endif
diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
index 534d30dafe32..f1414814bd69 100644
--- a/fs/btrfs/fs.h
+++ b/fs/btrfs/fs.h
@@ -857,6 +857,8 @@ struct btrfs_fs_info {
 	struct lockdep_map btrfs_trans_pending_ordered_map;
 	struct lockdep_map btrfs_ordered_extent_map;
 
+	struct shrinker *extent_map_shrinker;
+
 #ifdef CONFIG_BTRFS_FS_REF_VERIFY
 	spinlock_t ref_verify_lock;
 	struct rb_root block_tree;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH 10/11] btrfs: update comment for btrfs_set_inode_full_sync() about locking
  2024-04-10 11:28 [PATCH 00/11] btrfs: add a shrinker for extent maps fdmanana
                   ` (8 preceding siblings ...)
  2024-04-10 11:28 ` [PATCH 09/11] btrfs: add a shrinker for " fdmanana
@ 2024-04-10 11:28 ` fdmanana
  2024-04-10 11:28 ` [PATCH 11/11] btrfs: add tracepoints for extent map shrinker events fdmanana
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 64+ messages in thread
From: fdmanana @ 2024-04-10 11:28 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

Nowadays we have a lock used to synchronize mmap writes with reflink and
fsync operations (struct btrfs_inode::i_mmap_lock), so update the comment
for btrfs_set_inode_full_sync() to mention that it can also be called
while holding that mmap lock. Besides being a valid alternative to the
inode's VFS lock, the extent map shrinker already uses that mmap lock
instead.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/btrfs_inode.h | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 100020ca4658..ce01709e372f 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -381,9 +381,11 @@ static inline void btrfs_set_inode_last_sub_trans(struct btrfs_inode *inode)
 }
 
 /*
- * Should be called while holding the inode's VFS lock in exclusive mode or in a
- * context where no one else can access the inode concurrently (during inode
- * creation or when loading an inode from disk).
+ * Should be called while holding the inode's VFS lock in exclusive mode, or
+ * while holding the inode's mmap lock (struct btrfs_inode::i_mmap_lock) in
+ * either shared or exclusive mode, or in a context where no one else can access
+ * the inode concurrently (during inode creation or when loading an inode from
+ * disk).
  */
 static inline void btrfs_set_inode_full_sync(struct btrfs_inode *inode)
 {
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH 11/11] btrfs: add tracepoints for extent map shrinker events
  2024-04-10 11:28 [PATCH 00/11] btrfs: add a shrinker for extent maps fdmanana
                   ` (9 preceding siblings ...)
  2024-04-10 11:28 ` [PATCH 10/11] btrfs: update comment for btrfs_set_inode_full_sync() about locking fdmanana
@ 2024-04-10 11:28 ` fdmanana
  2024-04-11  5:25 ` [PATCH 00/11] btrfs: add a shrinker for extent maps Qu Wenruo
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 64+ messages in thread
From: fdmanana @ 2024-04-10 11:28 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

Add some tracepoints for the extent map shrinker to help debug and analyse
its main events. These have proved useful during development of the
shrinker.
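
For example, all of them can be enabled through tracefs like this (assuming
tracefs is mounted at /sys/kernel/tracing):

   $ cd /sys/kernel/tracing
   $ echo 1 > events/btrfs/btrfs_extent_map_shrinker_count/enable
   $ echo 1 > events/btrfs/btrfs_extent_map_shrinker_scan_enter/enable
   $ echo 1 > events/btrfs/btrfs_extent_map_shrinker_scan_exit/enable
   $ echo 1 > events/btrfs/btrfs_extent_map_shrinker_remove_em/enable
   $ cat trace_pipe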

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/extent_map.c        | 15 ++++++
 include/trace/events/btrfs.h | 92 ++++++++++++++++++++++++++++++++++++
 2 files changed, 107 insertions(+)

diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
index fa755921442d..2be5324085fe 100644
--- a/fs/btrfs/extent_map.c
+++ b/fs/btrfs/extent_map.c
@@ -1064,6 +1064,7 @@ static unsigned long btrfs_scan_inode(struct btrfs_inode *inode,
 			btrfs_set_inode_full_sync(inode);
 
 		remove_extent_mapping(inode, em);
+		trace_btrfs_extent_map_shrinker_remove_em(inode, em);
 		/* Drop the reference for the tree. */
 		free_extent_map(em);
 		nr_dropped++;
@@ -1156,6 +1157,12 @@ static unsigned long btrfs_extent_maps_scan(struct shrinker *shrinker,
 	unsigned long scanned = 0;
 	u64 next_root_id = 0;
 
+	if (trace_btrfs_extent_map_shrinker_scan_enter_enabled()) {
+		s64 nr = percpu_counter_sum_positive(&fs_info->evictable_extent_maps);
+
+		trace_btrfs_extent_map_shrinker_scan_enter(fs_info, sc->nr_to_scan, nr);
+	}
+
 	while (scanned < sc->nr_to_scan) {
 		struct btrfs_root *root;
 		unsigned long count;
@@ -1180,6 +1187,12 @@ static unsigned long btrfs_extent_maps_scan(struct shrinker *shrinker,
 		btrfs_put_root(root);
 	}
 
+	if (trace_btrfs_extent_map_shrinker_scan_exit_enabled()) {
+		s64 nr = percpu_counter_sum_positive(&fs_info->evictable_extent_maps);
+
+		trace_btrfs_extent_map_shrinker_scan_exit(fs_info, nr_dropped, nr);
+	}
+
 	return nr_dropped;
 }
 
@@ -1189,6 +1202,8 @@ static unsigned long btrfs_extent_maps_count(struct shrinker *shrinker,
 	struct btrfs_fs_info *fs_info = shrinker->private_data;
 	const s64 total = percpu_counter_sum_positive(&fs_info->evictable_extent_maps);
 
+	trace_btrfs_extent_map_shrinker_count(fs_info, sc->nr_to_scan, total);
+
 	/* The unsigned long type is 32 bits on 32 bits platforms. */
 #if BITS_PER_LONG == 32
 	if (total > ULONG_MAX)
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index 766cfd48386c..ba49efa2bc74 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -2551,6 +2551,98 @@ TRACE_EVENT(btrfs_get_raid_extent_offset,
 			__entry->devid)
 );
 
+TRACE_EVENT(btrfs_extent_map_shrinker_count,
+
+	TP_PROTO(const struct btrfs_fs_info *fs_info, u64 nr_to_scan, u64 nr),
+
+	TP_ARGS(fs_info, nr_to_scan, nr),
+
+	TP_STRUCT__entry_btrfs(
+		__field(	u64,	nr_to_scan	)
+		__field(	u64,	nr		)
+	),
+
+	TP_fast_assign_btrfs(fs_info,
+		__entry->nr_to_scan	= nr_to_scan;
+		__entry->nr		= nr;
+	),
+
+	TP_printk_btrfs("nr_to_scan=%llu nr=%llu",
+			__entry->nr_to_scan, __entry->nr)
+);
+
+TRACE_EVENT(btrfs_extent_map_shrinker_scan_enter,
+
+	TP_PROTO(const struct btrfs_fs_info *fs_info, u64 nr_to_scan, u64 nr),
+
+	TP_ARGS(fs_info, nr_to_scan, nr),
+
+	TP_STRUCT__entry_btrfs(
+		__field(	u64,	nr_to_scan	)
+		__field(	u64,	nr		)
+	),
+
+	TP_fast_assign_btrfs(fs_info,
+		__entry->nr_to_scan	= nr_to_scan;
+		__entry->nr		= nr;
+	),
+
+	TP_printk_btrfs("nr_to_scan=%llu nr=%llu",
+			__entry->nr_to_scan, __entry->nr)
+);
+
+TRACE_EVENT(btrfs_extent_map_shrinker_scan_exit,
+
+	TP_PROTO(const struct btrfs_fs_info *fs_info, u64 nr_dropped, u64 nr),
+
+	TP_ARGS(fs_info, nr_dropped, nr),
+
+	TP_STRUCT__entry_btrfs(
+		__field(	u64,	nr_dropped	)
+		__field(	u64,	nr		)
+	),
+
+	TP_fast_assign_btrfs(fs_info,
+		__entry->nr_dropped	= nr_dropped;
+		__entry->nr		= nr;
+	),
+
+	TP_printk_btrfs("nr_dropped=%llu nr=%llu",
+			__entry->nr_dropped, __entry->nr)
+);
+
+TRACE_EVENT(btrfs_extent_map_shrinker_remove_em,
+
+	TP_PROTO(const struct btrfs_inode *inode, const struct extent_map *em),
+
+	TP_ARGS(inode, em),
+
+	TP_STRUCT__entry_btrfs(
+		__field(	u64,	ino		)
+		__field(	u64,	root_id		)
+		__field(	u64,	start		)
+		__field(	u64,	len		)
+		__field(	u64,	block_start	)
+		__field(	u32,	flags		)
+	),
+
+	TP_fast_assign_btrfs(inode->root->fs_info,
+		__entry->ino		= btrfs_ino(inode);
+		__entry->root_id	= inode->root->root_key.objectid;
+		__entry->start		= em->start;
+		__entry->len		= em->len;
+		__entry->block_start	= em->block_start;
+		__entry->flags		= em->flags;
+	),
+
+	TP_printk_btrfs(
+"ino=%llu root=%llu(%s) start=%llu len=%llu block_start=%llu(%s) flags=%s",
+			__entry->ino, show_root_type(__entry->root_id),
+			__entry->start, __entry->len,
+			show_map_type(__entry->block_start),
+			show_map_flags(__entry->flags))
+);
+
 #endif /* _TRACE_BTRFS_H */
 
 /* This part must be outside protection */
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* Re: [PATCH 00/11] btrfs: add a shrinker for extent maps
  2024-04-10 11:28 [PATCH 00/11] btrfs: add a shrinker for extent maps fdmanana
                   ` (10 preceding siblings ...)
  2024-04-10 11:28 ` [PATCH 11/11] btrfs: add tracepoints for extent map shrinker events fdmanana
@ 2024-04-11  5:25 ` Qu Wenruo
  2024-04-11 16:18 ` [PATCH v2 00/15] " fdmanana
  2024-04-16 13:08 ` [PATCH v3 00/10] btrfs: add a shrinker for extent maps fdmanana
  13 siblings, 0 replies; 64+ messages in thread
From: Qu Wenruo @ 2024-04-11  5:25 UTC (permalink / raw)
  To: fdmanana, linux-btrfs



On 2024/4/10 20:58, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
>
> Currently we don't limit the amount of extent maps we can have for inodes
> from a subvolume tree, which can result in excessive use of memory and in
> some cases in running into OOM situations. This was reported some time ago
> by a user and it's specially easier to trigger with direct IO.
>
> The shrinker itself is patch 9/11, what comes before is simple preparatory
> work and the rest just trace events. More details in the change logs.
>
> Filipe Manana (11):
>    btrfs: pass an inode to btrfs_add_extent_mapping()
>    btrfs: tests: error out on unexpected extent map reference count
>    btrfs: simplify add_extent_mapping() by removing pointless label
>    btrfs: pass the extent map tree's inode to add_extent_mapping()
>    btrfs: pass the extent map tree's inode to clear_em_logging()
>    btrfs: pass the extent map tree's inode to remove_extent_mapping()
>    btrfs: pass the extent map tree's inode to replace_extent_mapping()

Those preparations are all fine even as an independent patchset.

Reviewed-by: Qu Wenruo <wqu@suse.com>

>    btrfs: add a global per cpu counter to track number of used extent maps
>    btrfs: add a shrinker for extent maps

Unfortunately I'm not familiar enough with logged/pinned extent maps yet.
Thus no comprehensive review for the shrinker implementation for now.

Thanks,
Qu

>    btrfs: update comment for btrfs_set_inode_full_sync() about locking
>    btrfs: add tracepoints for extent map shrinker events
>
>   fs/btrfs/btrfs_inode.h            |   8 +-
>   fs/btrfs/disk-io.c                |   5 +
>   fs/btrfs/extent_io.c              |   2 +-
>   fs/btrfs/extent_map.c             | 340 +++++++++++++++++++++++++-----
>   fs/btrfs/extent_map.h             |   9 +-
>   fs/btrfs/fs.h                     |   4 +
>   fs/btrfs/inode.c                  |   2 +-
>   fs/btrfs/tests/extent-map-tests.c | 216 ++++++++++---------
>   fs/btrfs/tree-log.c               |   4 +-
>   include/trace/events/btrfs.h      |  92 ++++++++
>   10 files changed, 524 insertions(+), 158 deletions(-)
>

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 08/11] btrfs: add a global per cpu counter to track number of used extent maps
  2024-04-10 11:28 ` [PATCH 08/11] btrfs: add a global per cpu counter to track number of used extent maps fdmanana
@ 2024-04-11  5:39   ` Qu Wenruo
  2024-04-11 10:09     ` Filipe Manana
  0 siblings, 1 reply; 64+ messages in thread
From: Qu Wenruo @ 2024-04-11  5:39 UTC (permalink / raw)
  To: fdmanana, linux-btrfs



On 2024/4/10 20:58, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
>
> Add a per cpu counter that tracks the total number of extent maps that are
> in extent trees of inodes that belong to fs trees. This is going to be
> used in an upcoming change that adds a shrinker for extent maps. Only
> extent maps for fs trees are considered, because for special trees such as
> the data relocation tree we don't want to evict their extent maps which
> are critical for the relocation to work, and since those are limited, it's
> not a concern to have them in memory during the relocation of a block
> group. Another case are extent maps for free space cache inodes, which
> must always remain in memory, but those are limited (there's only one per
> free space cache inode, which means one per block group).
>
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
[...]
> --- a/fs/btrfs/extent_map.c
> +++ b/fs/btrfs/extent_map.c
> @@ -76,6 +76,14 @@ static u64 range_end(u64 start, u64 len)
>   	return start + len;
>   }
>
> +static void dec_evictable_extent_maps(struct btrfs_inode *inode)
> +{
> +	struct btrfs_fs_info *fs_info = inode->root->fs_info;
> +
> +	if (!btrfs_is_testing(fs_info) && is_fstree(btrfs_root_id(inode->root)))
> +		percpu_counter_dec(&fs_info->evictable_extent_maps);
> +}
> +
>   static int tree_insert(struct rb_root_cached *root, struct extent_map *em)
>   {
>   	struct rb_node **p = &root->rb_root.rb_node;
> @@ -223,8 +231,9 @@ static bool mergeable_maps(const struct extent_map *prev, const struct extent_ma
>   	return next->block_start == prev->block_start;
>   }
>
> -static void try_merge_map(struct extent_map_tree *tree, struct extent_map *em)
> +static void try_merge_map(struct btrfs_inode *inode, struct extent_map *em)

Maybe it would be a little easier to read if the extent_map_tree to
btrfs_inode conversion happened in a dedicated patch, just like all the
previous ones?

Otherwise the introduction of the per-cpu counter looks good to me.

Thanks,
Qu

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 09/11] btrfs: add a shrinker for extent maps
  2024-04-10 11:28 ` [PATCH 09/11] btrfs: add a shrinker for " fdmanana
@ 2024-04-11  5:58   ` Qu Wenruo
  2024-04-11 10:15     ` Filipe Manana
  0 siblings, 1 reply; 64+ messages in thread
From: Qu Wenruo @ 2024-04-11  5:58 UTC (permalink / raw)
  To: fdmanana, linux-btrfs



On 2024/4/10 20:58, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
>
[...]
> +
> +static unsigned long btrfs_scan_root(struct btrfs_root *root,
> +				     unsigned long *scanned,
> +				     unsigned long nr_to_scan)
> +{
> +	unsigned long nr_dropped = 0;
> +	u64 ino = 0;
> +
> +	while (*scanned < nr_to_scan) {
> +		struct rb_node *node;
> +		struct rb_node *prev = NULL;
> +		struct btrfs_inode *inode;
> +		bool stop_search = true;
> +
> +		spin_lock(&root->inode_lock);
> +		node = root->inode_tree.rb_node;
> +
> +		while (node) {
> +			prev = node;
> +			inode = rb_entry(node, struct btrfs_inode, rb_node);
> +			if (ino < btrfs_ino(inode))
> +				node = node->rb_left;
> +			else if (ino > btrfs_ino(inode))
> +				node = node->rb_right;
> +			else
> +				break;
> +		}
> +
> +		if (!node) {
> +			while (prev) {
> +				inode = rb_entry(prev, struct btrfs_inode, rb_node);
> +				if (ino <= btrfs_ino(inode)) {
> +					node = prev;
> +					break;
> +				}
> +				prev = rb_next(prev);
> +			}
> +		}

The "while (node) {}" loop and above "if (!node) {}" is to locate the
first inode after @ino (which is the last scanned inode number).

Maybe extract them into a helper, with some name like
"find_next_inode_to_scan()" could be a little easier to read?

> +
> +		while (node) {
> +			inode = rb_entry(node, struct btrfs_inode, rb_node);
> +			ino = btrfs_ino(inode) + 1;
> +			if (igrab(&inode->vfs_inode)) {
> +				spin_unlock(&root->inode_lock);
> +				stop_search = false;
> +
> +				nr_dropped += btrfs_scan_inode(inode, scanned,
> +							       nr_to_scan);
> +				iput(&inode->vfs_inode);
> +				cond_resched();
> +				break;
> +			}
> +			node = rb_next(node);
> +		}
> +
> +		if (stop_search) {
> +			spin_unlock(&root->inode_lock);
> +			break;
> +		}
> +	}
> +
> +	return nr_dropped;
> +}
> +
> +static unsigned long btrfs_extent_maps_scan(struct shrinker *shrinker,
> +					    struct shrink_control *sc)
> +{
> +	struct btrfs_fs_info *fs_info = shrinker->private_data;
> +	unsigned long nr_dropped = 0;
> +	unsigned long scanned = 0;
> +	u64 next_root_id = 0;
> +
> +	while (scanned < sc->nr_to_scan) {
> +		struct btrfs_root *root;
> +		unsigned long count;
> +
> +		spin_lock(&fs_info->fs_roots_radix_lock);
> +		count = radix_tree_gang_lookup(&fs_info->fs_roots_radix,
> +					       (void **)&root, next_root_id, 1);
> +		if (count == 0) {
> +			spin_unlock(&fs_info->fs_roots_radix_lock);
> +			break;
> +		}
> +		next_root_id = btrfs_root_id(root) + 1;
> +		root = btrfs_grab_root(root);
> +		spin_unlock(&fs_info->fs_roots_radix_lock);
> +
> +		if (!root)
> +			continue;
> +
> +		if (is_fstree(btrfs_root_id(root)))
> +			nr_dropped += btrfs_scan_root(root, &scanned, sc->nr_to_scan);
> +
> +		btrfs_put_root(root);
> +	}
> +
> +	return nr_dropped;
> +}
> +
> +static unsigned long btrfs_extent_maps_count(struct shrinker *shrinker,
> +					     struct shrink_control *sc)
> +{
> +	struct btrfs_fs_info *fs_info = shrinker->private_data;
> +	const s64 total = percpu_counter_sum_positive(&fs_info->evictable_extent_maps);
> +
> +	/* The unsigned long type is 32 bits on 32 bits platforms. */
> +#if BITS_PER_LONG == 32
> +	if (total > ULONG_MAX)
> +		return ULONG_MAX;
> +#endif

Can this be a simple min_t(unsigned long, total, ULONG_MAX)?

Another question is, since total is s64, wouldn't any negative number go
to ULONG_MAX directly on 32 bit systems?

And since the function is just a shrink hook, I'm not sure what would
happen if we returned ULONG_MAX for negative values.

Otherwise the idea looks pretty good, it's just that I'm not qualified to
give a good review.

Thanks,
Qu

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 08/11] btrfs: add a global per cpu counter to track number of used extent maps
  2024-04-11  5:39   ` Qu Wenruo
@ 2024-04-11 10:09     ` Filipe Manana
  0 siblings, 0 replies; 64+ messages in thread
From: Filipe Manana @ 2024-04-11 10:09 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Thu, Apr 11, 2024 at 6:39 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
>
> On 2024/4/10 20:58, fdmanana@kernel.org wrote:
> > From: Filipe Manana <fdmanana@suse.com>
> >
> [...]
> > --- a/fs/btrfs/extent_map.c
> > +++ b/fs/btrfs/extent_map.c
> > @@ -76,6 +76,14 @@ static u64 range_end(u64 start, u64 len)
> >       return start + len;
> >   }
> >
> > +static void dec_evictable_extent_maps(struct btrfs_inode *inode)
> > +{
> > +     struct btrfs_fs_info *fs_info = inode->root->fs_info;
> > +
> > +     if (!btrfs_is_testing(fs_info) && is_fstree(btrfs_root_id(inode->root)))
> > +             percpu_counter_dec(&fs_info->evictable_extent_maps);
> > +}
> > +
> >   static int tree_insert(struct rb_root_cached *root, struct extent_map *em)
> >   {
> >       struct rb_node **p = &root->rb_root.rb_node;
> > @@ -223,8 +231,9 @@ static bool mergeable_maps(const struct extent_map *prev, const struct extent_ma
> >       return next->block_start == prev->block_start;
> >   }
> >
> > -static void try_merge_map(struct extent_map_tree *tree, struct extent_map *em)
> > +static void try_merge_map(struct btrfs_inode *inode, struct extent_map *em)
>
> Maybe it would be a little easier to read if the extent_map_tree to
> btrfs_inode conversion happened in a dedicated patch, just like all the
> previous ones?

Sure, I can do that. But it's such a small and trivial change that I
didn't think it would make review harder, especially since the patch is
very small.

>
> Otherwise the introduction of the per-cpu counter looks good to me.
>
> Thanks,
> Qu

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 09/11] btrfs: add a shrinker for extent maps
  2024-04-11  5:58   ` Qu Wenruo
@ 2024-04-11 10:15     ` Filipe Manana
  0 siblings, 0 replies; 64+ messages in thread
From: Filipe Manana @ 2024-04-11 10:15 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Thu, Apr 11, 2024 at 6:58 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
>
> On 2024/4/10 20:58, fdmanana@kernel.org wrote:
> > From: Filipe Manana <fdmanana@suse.com>
> >
> > Extent maps are used either to represent existing file extent items, or to
> > represent new extents that are going to be written and the respective file
> > extent items are created when the ordered extent completes.
> >
> > We currently don't have any limit for how many extent maps we can have,
> > neither per inode nor globally. Most of the time this not too noticeable
> > because extent maps are removed in the following situations:
> >
> > 1) When evicting an inode;
> >
> > 2) When releasing folios (pages) through the btrfs_release_folio() address
> >     space operation callback.
> >
> >     However we won't release extent maps in the folio range if the folio is
> >     either dirty or under writeback or if the inode's i_size is less than
> >     or equals to 16M (see try_release_extent_mapping(). This 16M i_size
> >     constraint was added back in 2008 with commit 70dec8079d78 ("Btrfs:
> >     extent_io and extent_state optimizations"), but there's no explanation
> >     about why we have it or why the 16M value.
> >
> > This means that for buffered IO we can reach an OOM situation due to too
> > many extent maps if either of the following happens:
> >
> > 1) There's a set of tasks constantly doing IO on many files with a size
> >     not larger than 16M, specially if they keep the files open for very
> >     long periods, therefore preventing inode eviction.
> >
> >     This requires a really high number of such files, and having many non
> >     mergeable extent maps (due to random 4K writes for example) and a
> >     machine with very little memory;
> >
> > 2) There's a set tasks constantly doing random write IO (therefore
> >     creating many non mergeable extent maps) on files and keeping them
> >     open for long periods of time, so inode eviction doesn't happen and
> >     there's always a lot of dirty pages or pages under writeback,
> >     preventing btrfs_release_folio() from releasing the respective extent
> >     maps.
> >
> > This second case was actually reported in the thread pointed by the Link
> > tag below, and it requires a very large file under heavy IO and a machine
> > with very little amount of RAM, which is probably hard to happen in
> > practice in a real world use case.
> >
> > However when using direct IO this is not so hard to happen, because the
> > page cache is not used, and therefore btrfs_release_folio() is never
> > called. Which means extent maps are dropped only when evicting the inode,
> > and that means that if we have tasks that keep a file descriptor open and
> > keep doing IO on a very large file (or files), we can exhaust memory due
> > to an unbounded amount of extent maps. This is especially easy to happen
> > if we have a huge file with millions of small extents and their extent
> > maps are not mergeable (non contiguous offsets and disk locations).
> > This was reported in that thread with the following fio test:
> >
> >     $ cat test.sh
> >     #!/bin/bash
> >
> >     DEV=/dev/sdj
> >     MNT=/mnt/sdj
> >     MOUNT_OPTIONS="-o ssd"
> >     MKFS_OPTIONS=""
> >
> >     cat <<EOF > /tmp/fio-job.ini
> >     [global]
> >     name=fio-rand-write
> >     filename=$MNT/fio-rand-write
> >     rw=randwrite
> >     bs=4K
> >     direct=1
> >     numjobs=16
> >     fallocate=none
> >     time_based
> >     runtime=90000
> >
> >     [file1]
> >     size=300G
> >     ioengine=libaio
> >     iodepth=16
> >
> >     EOF
> >
> >     umount $MNT &> /dev/null
> >     mkfs.btrfs -f $MKFS_OPTIONS $DEV
> >     mount $MOUNT_OPTIONS $DEV $MNT
> >
> >     fio /tmp/fio-job.ini
> >     umount $MNT
> >
> > Monitoring the btrfs_extent_map slab while running the test with:
> >
> >     $ watch -d -n 1 'cat /sys/kernel/slab/btrfs_extent_map/objects \
> >                          /sys/kernel/slab/btrfs_extent_map/total_objects'
> >
> > Shows the number of active and total extent maps skyrocketing to tens of
> > millions, and on systems with a short amount of memory it's easy and quick
> > to get into an OOM situation, as reported in that thread.
> >
> > So to avoid this issue add a shrinker that will remove extents maps, as
> > long as they are not pinned, and takes proper care with any concurrent
> > fsync to avoid missing extents (setting the full sync flag while in the
> > middle of a fast fsync). This shrinker is similar to the one ext4 uses
> > for its extent_status structure, which is analogous to btrfs' extent_map
> > structure.
> >
> > Link: https://lore.kernel.org/linux-btrfs/13f94633dcf04d29aaf1f0a43d42c55e@amazon.com/
> > Signed-off-by: Filipe Manana <fdmanana@suse.com>
> [...]
> > +
> > +static unsigned long btrfs_scan_root(struct btrfs_root *root,
> > +                                  unsigned long *scanned,
> > +                                  unsigned long nr_to_scan)
> > +{
> > +     unsigned long nr_dropped = 0;
> > +     u64 ino = 0;
> > +
> > +     while (*scanned < nr_to_scan) {
> > +             struct rb_node *node;
> > +             struct rb_node *prev = NULL;
> > +             struct btrfs_inode *inode;
> > +             bool stop_search = true;
> > +
> > +             spin_lock(&root->inode_lock);
> > +             node = root->inode_tree.rb_node;
> > +
> > +             while (node) {
> > +                     prev = node;
> > +                     inode = rb_entry(node, struct btrfs_inode, rb_node);
> > +                     if (ino < btrfs_ino(inode))
> > +                             node = node->rb_left;
> > +                     else if (ino > btrfs_ino(inode))
> > +                             node = node->rb_right;
> > +                     else
> > +                             break;
> > +             }
> > +
> > +             if (!node) {
> > +                     while (prev) {
> > +                             inode = rb_entry(prev, struct btrfs_inode, rb_node);
> > +                             if (ino <= btrfs_ino(inode)) {
> > +                                     node = prev;
> > +                                     break;
> > +                             }
> > +                             prev = rb_next(prev);
> > +                     }
> > +             }
>
> The "while (node) {}" loop and above "if (!node) {}" is to locate the
> first inode after @ino (which is the last scanned inode number).
>
> Maybe extract them into a helper, with some name like
> "find_next_inode_to_scan()" could be a little easier to read?

Sure, I can do that.

>
> > +
> > +             while (node) {
> > +                     inode = rb_entry(node, struct btrfs_inode, rb_node);
> > +                     ino = btrfs_ino(inode) + 1;
> > +                     if (igrab(&inode->vfs_inode)) {
> > +                             spin_unlock(&root->inode_lock);
> > +                             stop_search = false;
> > +
> > +                             nr_dropped += btrfs_scan_inode(inode, scanned,
> > +                                                            nr_to_scan);
> > +                             iput(&inode->vfs_inode);
> > +                             cond_resched();
> > +                             break;
> > +                     }
> > +                     node = rb_next(node);
> > +             }
> > +
> > +             if (stop_search) {
> > +                     spin_unlock(&root->inode_lock);
> > +                     break;
> > +             }
> > +     }
> > +
> > +     return nr_dropped;
> > +}
> > +
> > +static unsigned long btrfs_extent_maps_scan(struct shrinker *shrinker,
> > +                                         struct shrink_control *sc)
> > +{
> > +     struct btrfs_fs_info *fs_info = shrinker->private_data;
> > +     unsigned long nr_dropped = 0;
> > +     unsigned long scanned = 0;
> > +     u64 next_root_id = 0;
> > +
> > +     while (scanned < sc->nr_to_scan) {
> > +             struct btrfs_root *root;
> > +             unsigned long count;
> > +
> > +             spin_lock(&fs_info->fs_roots_radix_lock);
> > +             count = radix_tree_gang_lookup(&fs_info->fs_roots_radix,
> > +                                            (void **)&root, next_root_id, 1);
> > +             if (count == 0) {
> > +                     spin_unlock(&fs_info->fs_roots_radix_lock);
> > +                     break;
> > +             }
> > +             next_root_id = btrfs_root_id(root) + 1;
> > +             root = btrfs_grab_root(root);
> > +             spin_unlock(&fs_info->fs_roots_radix_lock);
> > +
> > +             if (!root)
> > +                     continue;
> > +
> > +             if (is_fstree(btrfs_root_id(root)))
> > +                     nr_dropped += btrfs_scan_root(root, &scanned, sc->nr_to_scan);
> > +
> > +             btrfs_put_root(root);
> > +     }
> > +
> > +     return nr_dropped;
> > +}
> > +
> > +static unsigned long btrfs_extent_maps_count(struct shrinker *shrinker,
> > +                                          struct shrink_control *sc)
> > +{
> > +     struct btrfs_fs_info *fs_info = shrinker->private_data;
> > +     const s64 total = percpu_counter_sum_positive(&fs_info->evictable_extent_maps);
> > +
> > +     /* The unsigned long type is 32 bits on 32 bits platforms. */
> > +#if BITS_PER_LONG == 32
> > +     if (total > ULONG_MAX)
> > +             return ULONG_MAX;
> > +#endif
>
> Can this be a simple min_t(unsigned long, total, ULONG_MAX)?

Why? Either the current if or a min should be easy for anyone to understand.

I'm actually removing this altogether, as on 32 bits systems we can't
have more than ULONG_MAX extent maps allocated anyway, even if they
could be 1 byte long each.
Ext4 ignores that in the shrinker for its extent_status structure for
example, which makes sense for that same reason.
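
That is, for the next version the count callback should end up roughly as
simple as this:

static unsigned long btrfs_extent_maps_count(struct shrinker *shrinker,
					     struct shrink_control *sc)
{
	struct btrfs_fs_info *fs_info = shrinker->private_data;

	/* Truncation on 32 bits systems is fine, see the reasoning above. */
	return percpu_counter_sum_positive(&fs_info->evictable_extent_maps);
}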

>
> Another question is, since total is s64, wouldn't any negative number go
> to ULONG_MAX directly on 32 bit systems?

How could you have a negative number?
percpu_counter_sum_positive() guarantees a non-negative number is returned.

>
> And since the function is just a shrink hook, I'm not sure what would
> happen if we returned ULONG_MAX for negative values.

Nothing. When scanning we stop iteration once we don't find more
roots/inodes/extent maps.

>
> Otherwise the idea looks pretty good, it's just that I'm not qualified to
> give a good review.
>
> Thanks,
> Qu

^ permalink raw reply	[flat|nested] 64+ messages in thread

* [PATCH v2 00/15] btrfs: add a shrinker for extent maps
  2024-04-10 11:28 [PATCH 00/11] btrfs: add a shrinker for extent maps fdmanana
                   ` (11 preceding siblings ...)
  2024-04-11  5:25 ` [PATCH 00/11] btrfs: add a shrinker for extent maps Qu Wenruo
@ 2024-04-11 16:18 ` fdmanana
  2024-04-11 16:18   ` [PATCH v2 01/15] btrfs: pass an inode to btrfs_add_extent_mapping() fdmanana
                     ` (14 more replies)
  2024-04-16 13:08 ` [PATCH v3 00/10] btrfs: add a shrinker for extent maps fdmanana
  13 siblings, 15 replies; 64+ messages in thread
From: fdmanana @ 2024-04-11 16:18 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

Currently we don't limit the amount of extent maps we can have for inodes
from a subvolume tree, which can result in excessive use of memory and in
some cases in running into OOM situations. This was reported some time ago
by a user and it's especially easy to trigger with direct IO.

The shrinker itself is patch 13/15; what comes before is simple preparatory
work and the rest are just trace events. More details in the change logs.

V2: Split patch 09/11 into 3.
    Added two patches to export and use a helper to find an inode in a root.
    Updated patch 13/15 to use the helper for finding the next inode and
    removed the #ifdef for the 32 bits case, which is irrelevant as on 32
    bits systems we can't ever have more than ULONG_MAX extent maps
    allocated.

Filipe Manana (15):
  btrfs: pass an inode to btrfs_add_extent_mapping()
  btrfs: tests: error out on unexpected extent map reference count
  btrfs: simplify add_extent_mapping() by removing pointless label
  btrfs: pass the extent map tree's inode to add_extent_mapping()
  btrfs: pass the extent map tree's inode to clear_em_logging()
  btrfs: pass the extent map tree's inode to remove_extent_mapping()
  btrfs: pass the extent map tree's inode to replace_extent_mapping()
  btrfs: pass the extent map tree's inode to setup_extent_mapping()
  btrfs: pass the extent map tree's inode to try_merge_map()
  btrfs: add a global per cpu counter to track number of used extent maps
  btrfs: export find_next_inode() as btrfs_find_first_inode()
  btrfs: use btrfs_find_first_inode() at btrfs_prune_dentries()
  btrfs: add a shrinker for extent maps
  btrfs: update comment for btrfs_set_inode_full_sync() about locking
  btrfs: add tracepoints for extent map shrinker events

 fs/btrfs/btrfs_inode.h            |   9 +-
 fs/btrfs/disk-io.c                |   5 +
 fs/btrfs/extent_io.c              |   2 +-
 fs/btrfs/extent_map.c             | 297 ++++++++++++++++++++++++------
 fs/btrfs/extent_map.h             |   9 +-
 fs/btrfs/fs.h                     |   4 +
 fs/btrfs/inode.c                  | 126 +++++++------
 fs/btrfs/relocation.c             | 105 +++--------
 fs/btrfs/tests/extent-map-tests.c | 216 ++++++++++++----------
 fs/btrfs/tree-log.c               |   4 +-
 include/trace/events/btrfs.h      |  92 +++++++++
 11 files changed, 579 insertions(+), 290 deletions(-)

-- 
2.43.0


^ permalink raw reply	[flat|nested] 64+ messages in thread

* [PATCH v2 01/15] btrfs: pass an inode to btrfs_add_extent_mapping()
  2024-04-11 16:18 ` [PATCH v2 00/15] " fdmanana
@ 2024-04-11 16:18   ` fdmanana
  2024-04-11 16:18   ` [PATCH v2 02/15] btrfs: tests: error out on unexpected extent map reference count fdmanana
                     ` (13 subsequent siblings)
  14 siblings, 0 replies; 64+ messages in thread
From: fdmanana @ 2024-04-11 16:18 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

Instead of passing fs_info and extent map tree arguments to
btrfs_add_extent_mapping(), we can pass an inode, as extent maps are
always inserted in the extent map tree of an inode and the fs_info can
be extracted from the inode (inode->root->fs_info). The only exception
is in the self tests, where we allocate an extent map tree and then use
it to insert/update/remove extent maps. However, the tests can be
changed to use a test inode and then use the inode's extent map tree.

So change btrfs_add_extent_mapping() to take an inode as an argument
instead of an fs_info and an extent map tree. This reduces the number of
parameters and will also be needed for an upcoming change.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/extent_map.c             |  14 +--
 fs/btrfs/extent_map.h             |   3 +-
 fs/btrfs/inode.c                  |   2 +-
 fs/btrfs/tests/extent-map-tests.c | 174 +++++++++++++++---------------
 4 files changed, 95 insertions(+), 98 deletions(-)

diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
index 471654cb65b0..840be23d2c0a 100644
--- a/fs/btrfs/extent_map.c
+++ b/fs/btrfs/extent_map.c
@@ -546,10 +546,9 @@ static noinline int merge_extent_mapping(struct extent_map_tree *em_tree,
 }
 
 /*
- * Add extent mapping into em_tree.
+ * Add extent mapping into an inode's extent map tree.
  *
- * @fs_info:  the filesystem
- * @em_tree:  extent tree into which we want to insert the extent mapping
+ * @inode:    target inode
  * @em_in:    extent we are inserting
  * @start:    start of the logical range btrfs_get_extent() is requesting
  * @len:      length of the logical range btrfs_get_extent() is requesting
@@ -557,8 +556,8 @@ static noinline int merge_extent_mapping(struct extent_map_tree *em_tree,
  * Note that @em_in's range may be different from [start, start+len),
  * but they must be overlapped.
  *
- * Insert @em_in into @em_tree. In case there is an overlapping range, handle
- * the -EEXIST by either:
+ * Insert @em_in into the inode's extent map tree. In case there is an
+ * overlapping range, handle the -EEXIST by either:
  * a) Returning the existing extent in @em_in if @start is within the
  *    existing em.
  * b) Merge the existing extent with @em_in passed in.
@@ -566,12 +565,13 @@ static noinline int merge_extent_mapping(struct extent_map_tree *em_tree,
  * Return 0 on success, otherwise -EEXIST.
  *
  */
-int btrfs_add_extent_mapping(struct btrfs_fs_info *fs_info,
-			     struct extent_map_tree *em_tree,
+int btrfs_add_extent_mapping(struct btrfs_inode *inode,
 			     struct extent_map **em_in, u64 start, u64 len)
 {
 	int ret;
 	struct extent_map *em = *em_in;
+	struct extent_map_tree *em_tree = &inode->extent_tree;
+	struct btrfs_fs_info *fs_info = inode->root->fs_info;
 
 	/*
 	 * Tree-checker should have rejected any inline extent with non-zero
diff --git a/fs/btrfs/extent_map.h b/fs/btrfs/extent_map.h
index 10e9491865c9..f287ab46e368 100644
--- a/fs/btrfs/extent_map.h
+++ b/fs/btrfs/extent_map.h
@@ -132,8 +132,7 @@ int unpin_extent_cache(struct btrfs_inode *inode, u64 start, u64 len, u64 gen);
 void clear_em_logging(struct extent_map_tree *tree, struct extent_map *em);
 struct extent_map *search_extent_mapping(struct extent_map_tree *tree,
 					 u64 start, u64 len);
-int btrfs_add_extent_mapping(struct btrfs_fs_info *fs_info,
-			     struct extent_map_tree *em_tree,
+int btrfs_add_extent_mapping(struct btrfs_inode *inode,
 			     struct extent_map **em_in, u64 start, u64 len);
 void btrfs_drop_extent_map_range(struct btrfs_inode *inode,
 				 u64 start, u64 end,
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 1a1a4b6d33ed..d4539b4b8148 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6992,7 +6992,7 @@ struct extent_map *btrfs_get_extent(struct btrfs_inode *inode,
 	}
 
 	write_lock(&em_tree->lock);
-	ret = btrfs_add_extent_mapping(fs_info, em_tree, &em, start, len);
+	ret = btrfs_add_extent_mapping(inode, &em, start, len);
 	write_unlock(&em_tree->lock);
 out:
 	btrfs_free_path(path);
diff --git a/fs/btrfs/tests/extent-map-tests.c b/fs/btrfs/tests/extent-map-tests.c
index 253cce7ffecf..96089c4c38a5 100644
--- a/fs/btrfs/tests/extent-map-tests.c
+++ b/fs/btrfs/tests/extent-map-tests.c
@@ -53,9 +53,9 @@ static void free_extent_map_tree(struct extent_map_tree *em_tree)
  *                                    ->add_extent_mapping(0, 16K)
  *                                    -> #handle -EEXIST
  */
-static int test_case_1(struct btrfs_fs_info *fs_info,
-		struct extent_map_tree *em_tree)
+static int test_case_1(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
 {
+	struct extent_map_tree *em_tree = &inode->extent_tree;
 	struct extent_map *em;
 	u64 start = 0;
 	u64 len = SZ_8K;
@@ -73,7 +73,7 @@ static int test_case_1(struct btrfs_fs_info *fs_info,
 	em->block_start = 0;
 	em->block_len = SZ_16K;
 	write_lock(&em_tree->lock);
-	ret = btrfs_add_extent_mapping(fs_info, em_tree, &em, em->start, em->len);
+	ret = btrfs_add_extent_mapping(inode, &em, em->start, em->len);
 	write_unlock(&em_tree->lock);
 	if (ret < 0) {
 		test_err("cannot add extent range [0, 16K)");
@@ -94,7 +94,7 @@ static int test_case_1(struct btrfs_fs_info *fs_info,
 	em->block_start = SZ_32K; /* avoid merging */
 	em->block_len = SZ_4K;
 	write_lock(&em_tree->lock);
-	ret = btrfs_add_extent_mapping(fs_info, em_tree, &em, em->start, em->len);
+	ret = btrfs_add_extent_mapping(inode, &em, em->start, em->len);
 	write_unlock(&em_tree->lock);
 	if (ret < 0) {
 		test_err("cannot add extent range [16K, 20K)");
@@ -115,7 +115,7 @@ static int test_case_1(struct btrfs_fs_info *fs_info,
 	em->block_start = start;
 	em->block_len = len;
 	write_lock(&em_tree->lock);
-	ret = btrfs_add_extent_mapping(fs_info, em_tree, &em, em->start, em->len);
+	ret = btrfs_add_extent_mapping(inode, &em, em->start, em->len);
 	write_unlock(&em_tree->lock);
 	if (ret) {
 		test_err("case1 [%llu %llu]: ret %d", start, start + len, ret);
@@ -148,9 +148,9 @@ static int test_case_1(struct btrfs_fs_info *fs_info,
  * Reading the inline ending up with EEXIST, ie. read an inline
  * extent and discard page cache and read it again.
  */
-static int test_case_2(struct btrfs_fs_info *fs_info,
-		struct extent_map_tree *em_tree)
+static int test_case_2(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
 {
+	struct extent_map_tree *em_tree = &inode->extent_tree;
 	struct extent_map *em;
 	int ret;
 
@@ -166,7 +166,7 @@ static int test_case_2(struct btrfs_fs_info *fs_info,
 	em->block_start = EXTENT_MAP_INLINE;
 	em->block_len = (u64)-1;
 	write_lock(&em_tree->lock);
-	ret = btrfs_add_extent_mapping(fs_info, em_tree, &em, em->start, em->len);
+	ret = btrfs_add_extent_mapping(inode, &em, em->start, em->len);
 	write_unlock(&em_tree->lock);
 	if (ret < 0) {
 		test_err("cannot add extent range [0, 1K)");
@@ -187,7 +187,7 @@ static int test_case_2(struct btrfs_fs_info *fs_info,
 	em->block_start = SZ_4K;
 	em->block_len = SZ_4K;
 	write_lock(&em_tree->lock);
-	ret = btrfs_add_extent_mapping(fs_info, em_tree, &em, em->start, em->len);
+	ret = btrfs_add_extent_mapping(inode, &em, em->start, em->len);
 	write_unlock(&em_tree->lock);
 	if (ret < 0) {
 		test_err("cannot add extent range [4K, 8K)");
@@ -208,7 +208,7 @@ static int test_case_2(struct btrfs_fs_info *fs_info,
 	em->block_start = EXTENT_MAP_INLINE;
 	em->block_len = (u64)-1;
 	write_lock(&em_tree->lock);
-	ret = btrfs_add_extent_mapping(fs_info, em_tree, &em, em->start, em->len);
+	ret = btrfs_add_extent_mapping(inode, &em, em->start, em->len);
 	write_unlock(&em_tree->lock);
 	if (ret) {
 		test_err("case2 [0 1K]: ret %d", ret);
@@ -235,8 +235,9 @@ static int test_case_2(struct btrfs_fs_info *fs_info,
 }
 
 static int __test_case_3(struct btrfs_fs_info *fs_info,
-		struct extent_map_tree *em_tree, u64 start)
+			 struct btrfs_inode *inode, u64 start)
 {
+	struct extent_map_tree *em_tree = &inode->extent_tree;
 	struct extent_map *em;
 	u64 len = SZ_4K;
 	int ret;
@@ -253,7 +254,7 @@ static int __test_case_3(struct btrfs_fs_info *fs_info,
 	em->block_start = SZ_4K;
 	em->block_len = SZ_4K;
 	write_lock(&em_tree->lock);
-	ret = btrfs_add_extent_mapping(fs_info, em_tree, &em, em->start, em->len);
+	ret = btrfs_add_extent_mapping(inode, &em, em->start, em->len);
 	write_unlock(&em_tree->lock);
 	if (ret < 0) {
 		test_err("cannot add extent range [4K, 8K)");
@@ -274,7 +275,7 @@ static int __test_case_3(struct btrfs_fs_info *fs_info,
 	em->block_start = 0;
 	em->block_len = SZ_16K;
 	write_lock(&em_tree->lock);
-	ret = btrfs_add_extent_mapping(fs_info, em_tree, &em, start, len);
+	ret = btrfs_add_extent_mapping(inode, &em, start, len);
 	write_unlock(&em_tree->lock);
 	if (ret) {
 		test_err("case3 [%llu %llu): ret %d",
@@ -322,25 +323,25 @@ static int __test_case_3(struct btrfs_fs_info *fs_info,
  *   -> add_extent_mapping()
  *                            -> add_extent_mapping()
  */
-static int test_case_3(struct btrfs_fs_info *fs_info,
-		struct extent_map_tree *em_tree)
+static int test_case_3(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
 {
 	int ret;
 
-	ret = __test_case_3(fs_info, em_tree, 0);
+	ret = __test_case_3(fs_info, inode, 0);
 	if (ret)
 		return ret;
-	ret = __test_case_3(fs_info, em_tree, SZ_8K);
+	ret = __test_case_3(fs_info, inode, SZ_8K);
 	if (ret)
 		return ret;
-	ret = __test_case_3(fs_info, em_tree, (12 * SZ_1K));
+	ret = __test_case_3(fs_info, inode, (12 * SZ_1K));
 
 	return ret;
 }
 
 static int __test_case_4(struct btrfs_fs_info *fs_info,
-		struct extent_map_tree *em_tree, u64 start)
+			 struct btrfs_inode *inode, u64 start)
 {
+	struct extent_map_tree *em_tree = &inode->extent_tree;
 	struct extent_map *em;
 	u64 len = SZ_4K;
 	int ret;
@@ -357,7 +358,7 @@ static int __test_case_4(struct btrfs_fs_info *fs_info,
 	em->block_start = 0;
 	em->block_len = SZ_8K;
 	write_lock(&em_tree->lock);
-	ret = btrfs_add_extent_mapping(fs_info, em_tree, &em, em->start, em->len);
+	ret = btrfs_add_extent_mapping(inode, &em, em->start, em->len);
 	write_unlock(&em_tree->lock);
 	if (ret < 0) {
 		test_err("cannot add extent range [0, 8K)");
@@ -378,7 +379,7 @@ static int __test_case_4(struct btrfs_fs_info *fs_info,
 	em->block_start = SZ_16K; /* avoid merging */
 	em->block_len = 24 * SZ_1K;
 	write_lock(&em_tree->lock);
-	ret = btrfs_add_extent_mapping(fs_info, em_tree, &em, em->start, em->len);
+	ret = btrfs_add_extent_mapping(inode, &em, em->start, em->len);
 	write_unlock(&em_tree->lock);
 	if (ret < 0) {
 		test_err("cannot add extent range [8K, 32K)");
@@ -398,7 +399,7 @@ static int __test_case_4(struct btrfs_fs_info *fs_info,
 	em->block_start = 0;
 	em->block_len = SZ_32K;
 	write_lock(&em_tree->lock);
-	ret = btrfs_add_extent_mapping(fs_info, em_tree, &em, start, len);
+	ret = btrfs_add_extent_mapping(inode, &em, start, len);
 	write_unlock(&em_tree->lock);
 	if (ret) {
 		test_err("case4 [%llu %llu): ret %d",
@@ -450,23 +451,22 @@ static int __test_case_4(struct btrfs_fs_info *fs_info,
  *                                             # handle -EEXIST when adding
  *                                             # [0, 32K)
  */
-static int test_case_4(struct btrfs_fs_info *fs_info,
-		struct extent_map_tree *em_tree)
+static int test_case_4(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
 {
 	int ret;
 
-	ret = __test_case_4(fs_info, em_tree, 0);
+	ret = __test_case_4(fs_info, inode, 0);
 	if (ret)
 		return ret;
-	ret = __test_case_4(fs_info, em_tree, SZ_4K);
+	ret = __test_case_4(fs_info, inode, SZ_4K);
 
 	return ret;
 }
 
-static int add_compressed_extent(struct btrfs_fs_info *fs_info,
-				 struct extent_map_tree *em_tree,
+static int add_compressed_extent(struct btrfs_inode *inode,
 				 u64 start, u64 len, u64 block_start)
 {
+	struct extent_map_tree *em_tree = &inode->extent_tree;
 	struct extent_map *em;
 	int ret;
 
@@ -482,7 +482,7 @@ static int add_compressed_extent(struct btrfs_fs_info *fs_info,
 	em->block_len = SZ_4K;
 	em->flags |= EXTENT_FLAG_COMPRESS_ZLIB;
 	write_lock(&em_tree->lock);
-	ret = btrfs_add_extent_mapping(fs_info, em_tree, &em, em->start, em->len);
+	ret = btrfs_add_extent_mapping(inode, &em, em->start, em->len);
 	write_unlock(&em_tree->lock);
 	free_extent_map(em);
 	if (ret < 0) {
@@ -588,53 +588,43 @@ static int validate_range(struct extent_map_tree *em_tree, int index)
  * They'll have the EXTENT_FLAG_COMPRESSED flag set to keep the em tree from
  * merging the em's.
  */
-static int test_case_5(struct btrfs_fs_info *fs_info)
+static int test_case_5(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
 {
-	struct extent_map_tree *em_tree;
-	struct inode *inode;
 	u64 start, end;
 	int ret;
 
 	test_msg("Running btrfs_drop_extent_map_range tests");
 
-	inode = btrfs_new_test_inode();
-	if (!inode) {
-		test_std_err(TEST_ALLOC_INODE);
-		return -ENOMEM;
-	}
-
-	em_tree = &BTRFS_I(inode)->extent_tree;
-
 	/* [0, 12k) */
-	ret = add_compressed_extent(fs_info, em_tree, 0, SZ_4K * 3, 0);
+	ret = add_compressed_extent(inode, 0, SZ_4K * 3, 0);
 	if (ret) {
 		test_err("cannot add extent range [0, 12K)");
 		goto out;
 	}
 
 	/* [12k, 24k) */
-	ret = add_compressed_extent(fs_info, em_tree, SZ_4K * 3, SZ_4K * 3, SZ_4K);
+	ret = add_compressed_extent(inode, SZ_4K * 3, SZ_4K * 3, SZ_4K);
 	if (ret) {
 		test_err("cannot add extent range [12k, 24k)");
 		goto out;
 	}
 
 	/* [24k, 36k) */
-	ret = add_compressed_extent(fs_info, em_tree, SZ_4K * 6, SZ_4K * 3, SZ_8K);
+	ret = add_compressed_extent(inode, SZ_4K * 6, SZ_4K * 3, SZ_8K);
 	if (ret) {
 		test_err("cannot add extent range [12k, 24k)");
 		goto out;
 	}
 
 	/* [36k, 40k) */
-	ret = add_compressed_extent(fs_info, em_tree, SZ_32K + SZ_4K, SZ_4K, SZ_4K * 3);
+	ret = add_compressed_extent(inode, SZ_32K + SZ_4K, SZ_4K, SZ_4K * 3);
 	if (ret) {
 		test_err("cannot add extent range [12k, 24k)");
 		goto out;
 	}
 
 	/* [40k, 64k) */
-	ret = add_compressed_extent(fs_info, em_tree, SZ_4K * 10, SZ_4K * 6, SZ_16K);
+	ret = add_compressed_extent(inode, SZ_4K * 10, SZ_4K * 6, SZ_16K);
 	if (ret) {
 		test_err("cannot add extent range [12k, 24k)");
 		goto out;
@@ -643,36 +633,36 @@ static int test_case_5(struct btrfs_fs_info *fs_info)
 	/* Drop [8k, 12k) */
 	start = SZ_8K;
 	end = (3 * SZ_4K) - 1;
-	btrfs_drop_extent_map_range(BTRFS_I(inode), start, end, false);
-	ret = validate_range(&BTRFS_I(inode)->extent_tree, 0);
+	btrfs_drop_extent_map_range(inode, start, end, false);
+	ret = validate_range(&inode->extent_tree, 0);
 	if (ret)
 		goto out;
 
 	/* Drop [12k, 20k) */
 	start = SZ_4K * 3;
 	end = SZ_16K + SZ_4K - 1;
-	btrfs_drop_extent_map_range(BTRFS_I(inode), start, end, false);
-	ret = validate_range(&BTRFS_I(inode)->extent_tree, 1);
+	btrfs_drop_extent_map_range(inode, start, end, false);
+	ret = validate_range(&inode->extent_tree, 1);
 	if (ret)
 		goto out;
 
 	/* Drop [28k, 32k) */
 	start = SZ_32K - SZ_4K;
 	end = SZ_32K - 1;
-	btrfs_drop_extent_map_range(BTRFS_I(inode), start, end, false);
-	ret = validate_range(&BTRFS_I(inode)->extent_tree, 2);
+	btrfs_drop_extent_map_range(inode, start, end, false);
+	ret = validate_range(&inode->extent_tree, 2);
 	if (ret)
 		goto out;
 
 	/* Drop [32k, 64k) */
 	start = SZ_32K;
 	end = SZ_64K - 1;
-	btrfs_drop_extent_map_range(BTRFS_I(inode), start, end, false);
-	ret = validate_range(&BTRFS_I(inode)->extent_tree, 3);
+	btrfs_drop_extent_map_range(inode, start, end, false);
+	ret = validate_range(&inode->extent_tree, 3);
 	if (ret)
 		goto out;
 out:
-	iput(inode);
+	free_extent_map_tree(&inode->extent_tree);
 	return ret;
 }
 
@@ -681,23 +671,25 @@ static int test_case_5(struct btrfs_fs_info *fs_info)
  * for areas between two existing ems.  Validate it doesn't do this when there
  * are two unmerged em's side by side.
  */
-static int test_case_6(struct btrfs_fs_info *fs_info, struct extent_map_tree *em_tree)
+static int test_case_6(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
 {
+	struct extent_map_tree *em_tree = &inode->extent_tree;
 	struct extent_map *em = NULL;
 	int ret;
 
-	ret = add_compressed_extent(fs_info, em_tree, 0, SZ_4K, 0);
+	ret = add_compressed_extent(inode, 0, SZ_4K, 0);
 	if (ret)
 		goto out;
 
-	ret = add_compressed_extent(fs_info, em_tree, SZ_4K, SZ_4K, 0);
+	ret = add_compressed_extent(inode, SZ_4K, SZ_4K, 0);
 	if (ret)
 		goto out;
 
 	em = alloc_extent_map();
 	if (!em) {
 		test_std_err(TEST_ALLOC_EXTENT_MAP);
-		return -ENOMEM;
+		ret = -ENOMEM;
+		goto out;
 	}
 
 	em->start = SZ_4K;
@@ -705,7 +697,7 @@ static int test_case_6(struct btrfs_fs_info *fs_info, struct extent_map_tree *em
 	em->block_start = SZ_16K;
 	em->block_len = SZ_16K;
 	write_lock(&em_tree->lock);
-	ret = btrfs_add_extent_mapping(fs_info, em_tree, &em, 0, SZ_8K);
+	ret = btrfs_add_extent_mapping(inode, &em, 0, SZ_8K);
 	write_unlock(&em_tree->lock);
 
 	if (ret != 0) {
@@ -734,28 +726,19 @@ static int test_case_6(struct btrfs_fs_info *fs_info, struct extent_map_tree *em
  * true would mess up the start/end calculations and subsequent splits would be
  * incorrect.
  */
-static int test_case_7(struct btrfs_fs_info *fs_info)
+static int test_case_7(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
 {
-	struct extent_map_tree *em_tree;
+	struct extent_map_tree *em_tree = &inode->extent_tree;
 	struct extent_map *em;
-	struct inode *inode;
 	int ret;
+	int ret2;
 
 	test_msg("Running btrfs_drop_extent_cache with pinned");
 
-	inode = btrfs_new_test_inode();
-	if (!inode) {
-		test_std_err(TEST_ALLOC_INODE);
-		return -ENOMEM;
-	}
-
-	em_tree = &BTRFS_I(inode)->extent_tree;
-
 	em = alloc_extent_map();
 	if (!em) {
 		test_std_err(TEST_ALLOC_EXTENT_MAP);
-		ret = -ENOMEM;
-		goto out;
+		return -ENOMEM;
 	}
 
 	/* [0, 16K), pinned */
@@ -765,7 +748,7 @@ static int test_case_7(struct btrfs_fs_info *fs_info)
 	em->block_len = SZ_4K;
 	em->flags |= EXTENT_FLAG_PINNED;
 	write_lock(&em_tree->lock);
-	ret = btrfs_add_extent_mapping(fs_info, em_tree, &em, em->start, em->len);
+	ret = btrfs_add_extent_mapping(inode, &em, em->start, em->len);
 	write_unlock(&em_tree->lock);
 	if (ret < 0) {
 		test_err("couldn't add extent map");
@@ -786,7 +769,7 @@ static int test_case_7(struct btrfs_fs_info *fs_info)
 	em->block_start = SZ_32K;
 	em->block_len = SZ_16K;
 	write_lock(&em_tree->lock);
-	ret = btrfs_add_extent_mapping(fs_info, em_tree, &em, em->start, em->len);
+	ret = btrfs_add_extent_mapping(inode, &em, em->start, em->len);
 	write_unlock(&em_tree->lock);
 	if (ret < 0) {
 		test_err("couldn't add extent map");
@@ -798,7 +781,7 @@ static int test_case_7(struct btrfs_fs_info *fs_info)
 	 * Drop [0, 36K) This should skip the [0, 4K) extent and then split the
 	 * [32K, 48K) extent.
 	 */
-	btrfs_drop_extent_map_range(BTRFS_I(inode), 0, (36 * SZ_1K) - 1, true);
+	btrfs_drop_extent_map_range(inode, 0, (36 * SZ_1K) - 1, true);
 
 	/* Make sure our extent maps look sane. */
 	ret = -EINVAL;
@@ -860,7 +843,11 @@ static int test_case_7(struct btrfs_fs_info *fs_info)
 	ret = 0;
 out:
 	free_extent_map(em);
-	iput(inode);
+	/* Unpin our extent to prevent warning when removing it below. */
+	ret2 = unpin_extent_cache(inode, 0, SZ_16K, 0);
+	if (ret == 0)
+		ret = ret2;
+	free_extent_map_tree(em_tree);
 	return ret;
 }
 
@@ -954,7 +941,8 @@ static int test_rmap_block(struct btrfs_fs_info *fs_info,
 int btrfs_test_extent_map(void)
 {
 	struct btrfs_fs_info *fs_info = NULL;
-	struct extent_map_tree *em_tree;
+	struct inode *inode;
+	struct btrfs_root *root = NULL;
 	int ret = 0, i;
 	struct rmap_test_vector rmap_tests[] = {
 		{
@@ -1003,33 +991,42 @@ int btrfs_test_extent_map(void)
 		return -ENOMEM;
 	}
 
-	em_tree = kzalloc(sizeof(*em_tree), GFP_KERNEL);
-	if (!em_tree) {
+	inode = btrfs_new_test_inode();
+	if (!inode) {
+		test_std_err(TEST_ALLOC_INODE);
 		ret = -ENOMEM;
 		goto out;
 	}
 
-	extent_map_tree_init(em_tree);
+	root = btrfs_alloc_dummy_root(fs_info);
+	if (IS_ERR(root)) {
+		test_std_err(TEST_ALLOC_ROOT);
+		ret = PTR_ERR(root);
+		root = NULL;
+		goto out;
+	}
 
-	ret = test_case_1(fs_info, em_tree);
+	BTRFS_I(inode)->root = root;
+
+	ret = test_case_1(fs_info, BTRFS_I(inode));
 	if (ret)
 		goto out;
-	ret = test_case_2(fs_info, em_tree);
+	ret = test_case_2(fs_info, BTRFS_I(inode));
 	if (ret)
 		goto out;
-	ret = test_case_3(fs_info, em_tree);
+	ret = test_case_3(fs_info, BTRFS_I(inode));
 	if (ret)
 		goto out;
-	ret = test_case_4(fs_info, em_tree);
+	ret = test_case_4(fs_info, BTRFS_I(inode));
 	if (ret)
 		goto out;
-	ret = test_case_5(fs_info);
+	ret = test_case_5(fs_info, BTRFS_I(inode));
 	if (ret)
 		goto out;
-	ret = test_case_6(fs_info, em_tree);
+	ret = test_case_6(fs_info, BTRFS_I(inode));
 	if (ret)
 		goto out;
-	ret = test_case_7(fs_info);
+	ret = test_case_7(fs_info, BTRFS_I(inode));
 	if (ret)
 		goto out;
 
@@ -1041,7 +1038,8 @@ int btrfs_test_extent_map(void)
 	}
 
 out:
-	kfree(em_tree);
+	iput(inode);
+	btrfs_free_dummy_root(root);
 	btrfs_free_dummy_fs_info(fs_info);
 
 	return ret;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH v2 02/15] btrfs: tests: error out on unexpected extent map reference count
  2024-04-11 16:18 ` [PATCH v2 00/15] " fdmanana
  2024-04-11 16:18   ` [PATCH v2 01/15] btrfs: pass an inode to btrfs_add_extent_mapping() fdmanana
@ 2024-04-11 16:18   ` fdmanana
  2024-04-11 16:18   ` [PATCH v2 03/15] btrfs: simplify add_extent_mapping() by removing pointless label fdmanana
                     ` (12 subsequent siblings)
  14 siblings, 0 replies; 64+ messages in thread
From: fdmanana @ 2024-04-11 16:18 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

In the extent map self tests, when freeing all extent maps from a test
extent map tree, we do not expect to find any extent map with a
reference count different from 1 (the tree reference). If we find any,
we just log a message but don't fail the test, which makes it very easy
to miss a bug/regression, since no one reads the test messages unless a
test fails. So change the behaviour to make a test fail if we find an
extent map in the tree with a reference count different from 1. Make the
failure happen only after removing all extent maps, so that we don't
leak memory.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/tests/extent-map-tests.c | 43 +++++++++++++++++++++++++------
 1 file changed, 35 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/tests/extent-map-tests.c b/fs/btrfs/tests/extent-map-tests.c
index 96089c4c38a5..9e9cb591c0f1 100644
--- a/fs/btrfs/tests/extent-map-tests.c
+++ b/fs/btrfs/tests/extent-map-tests.c
@@ -11,10 +11,11 @@
 #include "../disk-io.h"
 #include "../block-group.h"
 
-static void free_extent_map_tree(struct extent_map_tree *em_tree)
+static int free_extent_map_tree(struct extent_map_tree *em_tree)
 {
 	struct extent_map *em;
 	struct rb_node *node;
+	int ret = 0;
 
 	write_lock(&em_tree->lock);
 	while (!RB_EMPTY_ROOT(&em_tree->map.rb_root)) {
@@ -24,6 +25,7 @@ static void free_extent_map_tree(struct extent_map_tree *em_tree)
 
 #ifdef CONFIG_BTRFS_DEBUG
 		if (refcount_read(&em->refs) != 1) {
+			ret = -EINVAL;
 			test_err(
 "em leak: em (start %llu len %llu block_start %llu block_len %llu) refs %d",
 				 em->start, em->len, em->block_start,
@@ -35,6 +37,8 @@ static void free_extent_map_tree(struct extent_map_tree *em_tree)
 		free_extent_map(em);
 	}
 	write_unlock(&em_tree->lock);
+
+	return ret;
 }
 
 /*
@@ -60,6 +64,7 @@ static int test_case_1(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
 	u64 start = 0;
 	u64 len = SZ_8K;
 	int ret;
+	int ret2;
 
 	em = alloc_extent_map();
 	if (!em) {
@@ -137,7 +142,9 @@ static int test_case_1(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
 	}
 	free_extent_map(em);
 out:
-	free_extent_map_tree(em_tree);
+	ret2 = free_extent_map_tree(em_tree);
+	if (ret == 0)
+		ret = ret2;
 
 	return ret;
 }
@@ -153,6 +160,7 @@ static int test_case_2(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
 	struct extent_map_tree *em_tree = &inode->extent_tree;
 	struct extent_map *em;
 	int ret;
+	int ret2;
 
 	em = alloc_extent_map();
 	if (!em) {
@@ -229,7 +237,9 @@ static int test_case_2(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
 	}
 	free_extent_map(em);
 out:
-	free_extent_map_tree(em_tree);
+	ret2 = free_extent_map_tree(em_tree);
+	if (ret == 0)
+		ret = ret2;
 
 	return ret;
 }
@@ -241,6 +251,7 @@ static int __test_case_3(struct btrfs_fs_info *fs_info,
 	struct extent_map *em;
 	u64 len = SZ_4K;
 	int ret;
+	int ret2;
 
 	em = alloc_extent_map();
 	if (!em) {
@@ -302,7 +313,9 @@ static int __test_case_3(struct btrfs_fs_info *fs_info,
 	}
 	free_extent_map(em);
 out:
-	free_extent_map_tree(em_tree);
+	ret2 = free_extent_map_tree(em_tree);
+	if (ret == 0)
+		ret = ret2;
 
 	return ret;
 }
@@ -345,6 +358,7 @@ static int __test_case_4(struct btrfs_fs_info *fs_info,
 	struct extent_map *em;
 	u64 len = SZ_4K;
 	int ret;
+	int ret2;
 
 	em = alloc_extent_map();
 	if (!em) {
@@ -421,7 +435,9 @@ static int __test_case_4(struct btrfs_fs_info *fs_info,
 	}
 	free_extent_map(em);
 out:
-	free_extent_map_tree(em_tree);
+	ret2 = free_extent_map_tree(em_tree);
+	if (ret == 0)
+		ret = ret2;
 
 	return ret;
 }
@@ -592,6 +608,7 @@ static int test_case_5(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
 {
 	u64 start, end;
 	int ret;
+	int ret2;
 
 	test_msg("Running btrfs_drop_extent_map_range tests");
 
@@ -662,7 +679,10 @@ static int test_case_5(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
 	if (ret)
 		goto out;
 out:
-	free_extent_map_tree(&inode->extent_tree);
+	ret2 = free_extent_map_tree(&inode->extent_tree);
+	if (ret == 0)
+		ret = ret2;
+
 	return ret;
 }
 
@@ -676,6 +696,7 @@ static int test_case_6(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
 	struct extent_map_tree *em_tree = &inode->extent_tree;
 	struct extent_map *em = NULL;
 	int ret;
+	int ret2;
 
 	ret = add_compressed_extent(inode, 0, SZ_4K, 0);
 	if (ret)
@@ -717,7 +738,10 @@ static int test_case_6(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
 	ret = 0;
 out:
 	free_extent_map(em);
-	free_extent_map_tree(em_tree);
+	ret2 = free_extent_map_tree(em_tree);
+	if (ret == 0)
+		ret = ret2;
+
 	return ret;
 }
 
@@ -847,7 +871,10 @@ static int test_case_7(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
 	ret2 = unpin_extent_cache(inode, 0, SZ_16K, 0);
 	if (ret == 0)
 		ret = ret2;
-	free_extent_map_tree(em_tree);
+	ret2 = free_extent_map_tree(em_tree);
+	if (ret == 0)
+		ret = ret2;
+
 	return ret;
 }
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH v2 03/15] btrfs: simplify add_extent_mapping() by removing pointless label
  2024-04-11 16:18 ` [PATCH v2 00/15] " fdmanana
  2024-04-11 16:18   ` [PATCH v2 01/15] btrfs: pass an inode to btrfs_add_extent_mapping() fdmanana
  2024-04-11 16:18   ` [PATCH v2 02/15] btrfs: tests: error out on unexpected extent map reference count fdmanana
@ 2024-04-11 16:18   ` fdmanana
  2024-04-11 16:18   ` [PATCH v2 04/15] btrfs: pass the extent map tree's inode to add_extent_mapping() fdmanana
                     ` (11 subsequent siblings)
  14 siblings, 0 replies; 64+ messages in thread
From: fdmanana @ 2024-04-11 16:18 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

The add_extent_mapping() function is short and trivial, so there's no need
to have a label for a quick exit in case of an error, especially since
there's no error handling needed; we just need to return the error. So
remove that label and return directly.

Also, while at it, remove the redundant initialization of 'ret', as that
may help avoid some warnings from clang tools such as the one
reported/fixed by commit 966de47ff0c9 ("btrfs: remove redundant
initialization of variables in log_new_ancestors").

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/extent_map.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
index 840be23d2c0a..d125d5ab9b1d 100644
--- a/fs/btrfs/extent_map.c
+++ b/fs/btrfs/extent_map.c
@@ -370,17 +370,17 @@ static inline void setup_extent_mapping(struct extent_map_tree *tree,
 static int add_extent_mapping(struct extent_map_tree *tree,
 			      struct extent_map *em, int modified)
 {
-	int ret = 0;
+	int ret;
 
 	lockdep_assert_held_write(&tree->lock);
 
 	ret = tree_insert(&tree->map, em);
 	if (ret)
-		goto out;
+		return ret;
 
 	setup_extent_mapping(tree, em, modified);
-out:
-	return ret;
+
+	return 0;
 }
 
 static struct extent_map *
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH v2 04/15] btrfs: pass the extent map tree's inode to add_extent_mapping()
  2024-04-11 16:18 ` [PATCH v2 00/15] " fdmanana
                     ` (2 preceding siblings ...)
  2024-04-11 16:18   ` [PATCH v2 03/15] btrfs: simplify add_extent_mapping() by removing pointless label fdmanana
@ 2024-04-11 16:18   ` fdmanana
  2024-04-11 16:18   ` [PATCH v2 05/15] btrfs: pass the extent map tree's inode to clear_em_logging() fdmanana
                     ` (10 subsequent siblings)
  14 siblings, 0 replies; 64+ messages in thread
From: fdmanana @ 2024-04-11 16:18 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

Extent maps are always added to an inode's extent map tree, so there's no
need to pass the extent map tree explicitly to add_extent_mapping().

In order to facilitate an upcoming change that adds a shrinker for extent
maps, change add_extent_mapping() to receive the inode instead of its
extent map tree.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/extent_map.c | 34 ++++++++++++++++------------------
 1 file changed, 16 insertions(+), 18 deletions(-)

diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
index d125d5ab9b1d..d0e0c4e5415e 100644
--- a/fs/btrfs/extent_map.c
+++ b/fs/btrfs/extent_map.c
@@ -355,21 +355,22 @@ static inline void setup_extent_mapping(struct extent_map_tree *tree,
 }
 
 /*
- * Add new extent map to the extent tree
+ * Add a new extent map to an inode's extent map tree.
  *
- * @tree:	tree to insert new map in
+ * @inode:	the target inode
  * @em:		map to insert
  * @modified:	indicate whether the given @em should be added to the
  *	        modified list, which indicates the extent needs to be logged
  *
- * Insert @em into @tree or perform a simple forward/backward merge with
- * existing mappings.  The extent_map struct passed in will be inserted
- * into the tree directly, with an additional reference taken, or a
- * reference dropped if the merge attempt was successful.
+ * Insert @em into the @inode's extent map tree or perform a simple
+ * forward/backward merge with existing mappings.  The extent_map struct passed
+ * in will be inserted into the tree directly, with an additional reference
+ * taken, or a reference dropped if the merge attempt was successful.
  */
-static int add_extent_mapping(struct extent_map_tree *tree,
+static int add_extent_mapping(struct btrfs_inode *inode,
 			      struct extent_map *em, int modified)
 {
+	struct extent_map_tree *tree = &inode->extent_tree;
 	int ret;
 
 	lockdep_assert_held_write(&tree->lock);
@@ -508,7 +509,7 @@ static struct extent_map *prev_extent_map(struct extent_map *em)
  * and an extent that you want to insert, deal with overlap and insert
  * the best fitted new extent into the tree.
  */
-static noinline int merge_extent_mapping(struct extent_map_tree *em_tree,
+static noinline int merge_extent_mapping(struct btrfs_inode *inode,
 					 struct extent_map *existing,
 					 struct extent_map *em,
 					 u64 map_start)
@@ -542,7 +543,7 @@ static noinline int merge_extent_mapping(struct extent_map_tree *em_tree,
 		em->block_start += start_diff;
 		em->block_len = em->len;
 	}
-	return add_extent_mapping(em_tree, em, 0);
+	return add_extent_mapping(inode, em, 0);
 }
 
 /*
@@ -570,7 +571,6 @@ int btrfs_add_extent_mapping(struct btrfs_inode *inode,
 {
 	int ret;
 	struct extent_map *em = *em_in;
-	struct extent_map_tree *em_tree = &inode->extent_tree;
 	struct btrfs_fs_info *fs_info = inode->root->fs_info;
 
 	/*
@@ -580,7 +580,7 @@ int btrfs_add_extent_mapping(struct btrfs_inode *inode,
 	if (em->block_start == EXTENT_MAP_INLINE)
 		ASSERT(em->start == 0);
 
-	ret = add_extent_mapping(em_tree, em, 0);
+	ret = add_extent_mapping(inode, em, 0);
 	/* it is possible that someone inserted the extent into the tree
 	 * while we had the lock dropped.  It is also possible that
 	 * an overlapping map exists in the tree
@@ -588,7 +588,7 @@ int btrfs_add_extent_mapping(struct btrfs_inode *inode,
 	if (ret == -EEXIST) {
 		struct extent_map *existing;
 
-		existing = search_extent_mapping(em_tree, start, len);
+		existing = search_extent_mapping(&inode->extent_tree, start, len);
 
 		trace_btrfs_handle_em_exist(fs_info, existing, em, start, len);
 
@@ -609,8 +609,7 @@ int btrfs_add_extent_mapping(struct btrfs_inode *inode,
 			 * The existing extent map is the one nearest to
 			 * the [start, start + len) range which overlaps
 			 */
-			ret = merge_extent_mapping(em_tree, existing,
-						   em, start);
+			ret = merge_extent_mapping(inode, existing, em, start);
 			if (WARN_ON(ret)) {
 				free_extent_map(em);
 				*em_in = NULL;
@@ -818,8 +817,7 @@ void btrfs_drop_extent_map_range(struct btrfs_inode *inode, u64 start, u64 end,
 			} else {
 				int ret;
 
-				ret = add_extent_mapping(em_tree, split,
-							 modified);
+				ret = add_extent_mapping(inode, split, modified);
 				/* Logic error, shouldn't happen. */
 				ASSERT(ret == 0);
 				if (WARN_ON(ret != 0) && modified)
@@ -909,7 +907,7 @@ int btrfs_replace_extent_map_range(struct btrfs_inode *inode,
 	do {
 		btrfs_drop_extent_map_range(inode, new_em->start, end, false);
 		write_lock(&tree->lock);
-		ret = add_extent_mapping(tree, new_em, modified);
+		ret = add_extent_mapping(inode, new_em, modified);
 		write_unlock(&tree->lock);
 	} while (ret == -EEXIST);
 
@@ -990,7 +988,7 @@ int split_extent_map(struct btrfs_inode *inode, u64 start, u64 len, u64 pre,
 	split_mid->ram_bytes = split_mid->len;
 	split_mid->flags = flags;
 	split_mid->generation = em->generation;
-	add_extent_mapping(em_tree, split_mid, 1);
+	add_extent_mapping(inode, split_mid, 1);
 
 	/* Once for us */
 	free_extent_map(em);
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH v2 05/15] btrfs: pass the extent map tree's inode to clear_em_logging()
  2024-04-11 16:18 ` [PATCH v2 00/15] " fdmanana
                     ` (3 preceding siblings ...)
  2024-04-11 16:18   ` [PATCH v2 04/15] btrfs: pass the extent map tree's inode to add_extent_mapping() fdmanana
@ 2024-04-11 16:18   ` fdmanana
  2024-04-11 16:19   ` [PATCH v2 06/15] btrfs: pass the extent map tree's inode to remove_extent_mapping() fdmanana
                     ` (9 subsequent siblings)
  14 siblings, 0 replies; 64+ messages in thread
From: fdmanana @ 2024-04-11 16:18 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

Extent maps are always associated with an inode's extent map tree, so
there's no need to pass the extent map tree explicitly to
clear_em_logging().

In order to facilitate an upcoming change that adds a shrinker for extent
maps, change clear_em_logging() to receive the inode instead of its extent
map tree.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/extent_map.c | 4 +++-
 fs/btrfs/extent_map.h | 2 +-
 fs/btrfs/tree-log.c   | 4 ++--
 3 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
index d0e0c4e5415e..7cda78d11d75 100644
--- a/fs/btrfs/extent_map.c
+++ b/fs/btrfs/extent_map.c
@@ -331,8 +331,10 @@ int unpin_extent_cache(struct btrfs_inode *inode, u64 start, u64 len, u64 gen)
 
 }
 
-void clear_em_logging(struct extent_map_tree *tree, struct extent_map *em)
+void clear_em_logging(struct btrfs_inode *inode, struct extent_map *em)
 {
+	struct extent_map_tree *tree = &inode->extent_tree;
+
 	lockdep_assert_held_write(&tree->lock);
 
 	em->flags &= ~EXTENT_FLAG_LOGGING;
diff --git a/fs/btrfs/extent_map.h b/fs/btrfs/extent_map.h
index f287ab46e368..732fc8d7e534 100644
--- a/fs/btrfs/extent_map.h
+++ b/fs/btrfs/extent_map.h
@@ -129,7 +129,7 @@ void free_extent_map(struct extent_map *em);
 int __init extent_map_init(void);
 void __cold extent_map_exit(void);
 int unpin_extent_cache(struct btrfs_inode *inode, u64 start, u64 len, u64 gen);
-void clear_em_logging(struct extent_map_tree *tree, struct extent_map *em);
+void clear_em_logging(struct btrfs_inode *inode, struct extent_map *em);
 struct extent_map *search_extent_mapping(struct extent_map_tree *tree,
 					 u64 start, u64 len);
 int btrfs_add_extent_mapping(struct btrfs_inode *inode,
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index d9777649e170..4a4fca841510 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -4945,7 +4945,7 @@ static int btrfs_log_changed_extents(struct btrfs_trans_handle *trans,
 		 * private list.
 		 */
 		if (ret) {
-			clear_em_logging(tree, em);
+			clear_em_logging(inode, em);
 			free_extent_map(em);
 			continue;
 		}
@@ -4954,7 +4954,7 @@ static int btrfs_log_changed_extents(struct btrfs_trans_handle *trans,
 
 		ret = log_one_extent(trans, inode, em, path, ctx);
 		write_lock(&tree->lock);
-		clear_em_logging(tree, em);
+		clear_em_logging(inode, em);
 		free_extent_map(em);
 	}
 	WARN_ON(!list_empty(&extents));
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH v2 06/15] btrfs: pass the extent map tree's inode to remove_extent_mapping()
  2024-04-11 16:18 ` [PATCH v2 00/15] " fdmanana
                     ` (4 preceding siblings ...)
  2024-04-11 16:18   ` [PATCH v2 05/15] btrfs: pass the extent map tree's inode to clear_em_logging() fdmanana
@ 2024-04-11 16:19   ` fdmanana
  2024-04-11 16:19   ` [PATCH v2 07/15] btrfs: pass the extent map tree's inode to replace_extent_mapping() fdmanana
                     ` (8 subsequent siblings)
  14 siblings, 0 replies; 64+ messages in thread
From: fdmanana @ 2024-04-11 16:19 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

Extent maps are always associated with an inode's extent map tree, so
there's no need to pass the extent map tree explicitly to
remove_extent_mapping().

In order to facilitate an upcoming change that adds a shrinker for extent
maps, change remove_extent_mapping() to receive the inode instead of its
extent map tree.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/extent_io.c              |  2 +-
 fs/btrfs/extent_map.c             | 22 +++++++++++++---------
 fs/btrfs/extent_map.h             |  2 +-
 fs/btrfs/tests/extent-map-tests.c | 19 ++++++++++---------
 4 files changed, 25 insertions(+), 20 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index d90330f26827..1b236fc3f411 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2457,7 +2457,7 @@ int try_release_extent_mapping(struct page *page, gfp_t mask)
 			 * hurts the fsync performance for workloads with a data
 			 * size that exceeds or is close to the system's memory).
 			 */
-			remove_extent_mapping(map, em);
+			remove_extent_mapping(btrfs_inode, em);
 			/* once for the rb tree */
 			free_extent_map(em);
 next:
diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
index 7cda78d11d75..289669763965 100644
--- a/fs/btrfs/extent_map.c
+++ b/fs/btrfs/extent_map.c
@@ -449,16 +449,18 @@ struct extent_map *search_extent_mapping(struct extent_map_tree *tree,
 }
 
 /*
- * Remove an extent_map from the extent tree.
+ * Remove an extent_map from its inode's extent tree.
  *
- * @tree:	extent tree to remove from
+ * @inode:	the inode the extent map belongs to
  * @em:		extent map being removed
  *
- * Remove @em from @tree.  No reference counts are dropped, and no checks
- * are done to see if the range is in use.
+ * Remove @em from the extent tree of @inode.  No reference counts are dropped,
+ * and no checks are done to see if the range is in use.
  */
-void remove_extent_mapping(struct extent_map_tree *tree, struct extent_map *em)
+void remove_extent_mapping(struct btrfs_inode *inode, struct extent_map *em)
 {
+	struct extent_map_tree *tree = &inode->extent_tree;
+
 	lockdep_assert_held_write(&tree->lock);
 
 	WARN_ON(em->flags & EXTENT_FLAG_PINNED);
@@ -633,8 +635,10 @@ int btrfs_add_extent_mapping(struct btrfs_inode *inode,
  * if needed. This avoids searching the tree, from the root down to the first
  * extent map, before each deletion.
  */
-static void drop_all_extent_maps_fast(struct extent_map_tree *tree)
+static void drop_all_extent_maps_fast(struct btrfs_inode *inode)
 {
+	struct extent_map_tree *tree = &inode->extent_tree;
+
 	write_lock(&tree->lock);
 	while (!RB_EMPTY_ROOT(&tree->map.rb_root)) {
 		struct extent_map *em;
@@ -643,7 +647,7 @@ static void drop_all_extent_maps_fast(struct extent_map_tree *tree)
 		node = rb_first_cached(&tree->map);
 		em = rb_entry(node, struct extent_map, rb_node);
 		em->flags &= ~(EXTENT_FLAG_PINNED | EXTENT_FLAG_LOGGING);
-		remove_extent_mapping(tree, em);
+		remove_extent_mapping(inode, em);
 		free_extent_map(em);
 		cond_resched_rwlock_write(&tree->lock);
 	}
@@ -676,7 +680,7 @@ void btrfs_drop_extent_map_range(struct btrfs_inode *inode, u64 start, u64 end,
 	WARN_ON(end < start);
 	if (end == (u64)-1) {
 		if (start == 0 && !skip_pinned) {
-			drop_all_extent_maps_fast(em_tree);
+			drop_all_extent_maps_fast(inode);
 			return;
 		}
 		len = (u64)-1;
@@ -854,7 +858,7 @@ void btrfs_drop_extent_map_range(struct btrfs_inode *inode, u64 start, u64 end,
 				ASSERT(!split);
 				btrfs_set_inode_full_sync(inode);
 			}
-			remove_extent_mapping(em_tree, em);
+			remove_extent_mapping(inode, em);
 		}
 
 		/*
diff --git a/fs/btrfs/extent_map.h b/fs/btrfs/extent_map.h
index 732fc8d7e534..c3707461ff62 100644
--- a/fs/btrfs/extent_map.h
+++ b/fs/btrfs/extent_map.h
@@ -120,7 +120,7 @@ static inline u64 extent_map_end(const struct extent_map *em)
 void extent_map_tree_init(struct extent_map_tree *tree);
 struct extent_map *lookup_extent_mapping(struct extent_map_tree *tree,
 					 u64 start, u64 len);
-void remove_extent_mapping(struct extent_map_tree *tree, struct extent_map *em);
+void remove_extent_mapping(struct btrfs_inode *inode, struct extent_map *em);
 int split_extent_map(struct btrfs_inode *inode, u64 start, u64 len, u64 pre,
 		     u64 new_logical);
 
diff --git a/fs/btrfs/tests/extent-map-tests.c b/fs/btrfs/tests/extent-map-tests.c
index 9e9cb591c0f1..db6fb1a2c78f 100644
--- a/fs/btrfs/tests/extent-map-tests.c
+++ b/fs/btrfs/tests/extent-map-tests.c
@@ -11,8 +11,9 @@
 #include "../disk-io.h"
 #include "../block-group.h"
 
-static int free_extent_map_tree(struct extent_map_tree *em_tree)
+static int free_extent_map_tree(struct btrfs_inode *inode)
 {
+	struct extent_map_tree *em_tree = &inode->extent_tree;
 	struct extent_map *em;
 	struct rb_node *node;
 	int ret = 0;
@@ -21,7 +22,7 @@ static int free_extent_map_tree(struct extent_map_tree *em_tree)
 	while (!RB_EMPTY_ROOT(&em_tree->map.rb_root)) {
 		node = rb_first_cached(&em_tree->map);
 		em = rb_entry(node, struct extent_map, rb_node);
-		remove_extent_mapping(em_tree, em);
+		remove_extent_mapping(inode, em);
 
 #ifdef CONFIG_BTRFS_DEBUG
 		if (refcount_read(&em->refs) != 1) {
@@ -142,7 +143,7 @@ static int test_case_1(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
 	}
 	free_extent_map(em);
 out:
-	ret2 = free_extent_map_tree(em_tree);
+	ret2 = free_extent_map_tree(inode);
 	if (ret == 0)
 		ret = ret2;
 
@@ -237,7 +238,7 @@ static int test_case_2(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
 	}
 	free_extent_map(em);
 out:
-	ret2 = free_extent_map_tree(em_tree);
+	ret2 = free_extent_map_tree(inode);
 	if (ret == 0)
 		ret = ret2;
 
@@ -313,7 +314,7 @@ static int __test_case_3(struct btrfs_fs_info *fs_info,
 	}
 	free_extent_map(em);
 out:
-	ret2 = free_extent_map_tree(em_tree);
+	ret2 = free_extent_map_tree(inode);
 	if (ret == 0)
 		ret = ret2;
 
@@ -435,7 +436,7 @@ static int __test_case_4(struct btrfs_fs_info *fs_info,
 	}
 	free_extent_map(em);
 out:
-	ret2 = free_extent_map_tree(em_tree);
+	ret2 = free_extent_map_tree(inode);
 	if (ret == 0)
 		ret = ret2;
 
@@ -679,7 +680,7 @@ static int test_case_5(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
 	if (ret)
 		goto out;
 out:
-	ret2 = free_extent_map_tree(&inode->extent_tree);
+	ret2 = free_extent_map_tree(inode);
 	if (ret == 0)
 		ret = ret2;
 
@@ -738,7 +739,7 @@ static int test_case_6(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
 	ret = 0;
 out:
 	free_extent_map(em);
-	ret2 = free_extent_map_tree(em_tree);
+	ret2 = free_extent_map_tree(inode);
 	if (ret == 0)
 		ret = ret2;
 
@@ -871,7 +872,7 @@ static int test_case_7(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
 	ret2 = unpin_extent_cache(inode, 0, SZ_16K, 0);
 	if (ret == 0)
 		ret = ret2;
-	ret2 = free_extent_map_tree(em_tree);
+	ret2 = free_extent_map_tree(inode);
 	if (ret == 0)
 		ret = ret2;
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH v2 07/15] btrfs: pass the extent map tree's inode to replace_extent_mapping()
  2024-04-11 16:18 ` [PATCH v2 00/15] " fdmanana
                     ` (5 preceding siblings ...)
  2024-04-11 16:19   ` [PATCH v2 06/15] btrfs: pass the extent map tree's inode to remove_extent_mapping() fdmanana
@ 2024-04-11 16:19   ` fdmanana
  2024-04-11 16:19   ` [PATCH v2 08/15] btrfs: pass the extent map tree's inode to setup_extent_mapping() fdmanana
                     ` (7 subsequent siblings)
  14 siblings, 0 replies; 64+ messages in thread
From: fdmanana @ 2024-04-11 16:19 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

Extent maps are always associated with an inode's extent map tree, so
there's no need to pass the extent map tree explicitly to
replace_extent_mapping().

In order to facilitate an upcoming change that adds a shrinker for extent
maps, change replace_extent_mapping() to receive the inode instead of its
extent map tree.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/extent_map.c | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
index 289669763965..15817b842c24 100644
--- a/fs/btrfs/extent_map.c
+++ b/fs/btrfs/extent_map.c
@@ -470,11 +470,13 @@ void remove_extent_mapping(struct btrfs_inode *inode, struct extent_map *em)
 	RB_CLEAR_NODE(&em->rb_node);
 }
 
-static void replace_extent_mapping(struct extent_map_tree *tree,
+static void replace_extent_mapping(struct btrfs_inode *inode,
 				   struct extent_map *cur,
 				   struct extent_map *new,
 				   int modified)
 {
+	struct extent_map_tree *tree = &inode->extent_tree;
+
 	lockdep_assert_held_write(&tree->lock);
 
 	WARN_ON(cur->flags & EXTENT_FLAG_PINNED);
@@ -777,7 +779,7 @@ void btrfs_drop_extent_map_range(struct btrfs_inode *inode, u64 start, u64 end,
 
 			split->generation = gen;
 			split->flags = flags;
-			replace_extent_mapping(em_tree, em, split, modified);
+			replace_extent_mapping(inode, em, split, modified);
 			free_extent_map(split);
 			split = split2;
 			split2 = NULL;
@@ -818,8 +820,7 @@ void btrfs_drop_extent_map_range(struct btrfs_inode *inode, u64 start, u64 end,
 			}
 
 			if (extent_map_in_tree(em)) {
-				replace_extent_mapping(em_tree, em, split,
-						       modified);
+				replace_extent_mapping(inode, em, split, modified);
 			} else {
 				int ret;
 
@@ -977,7 +978,7 @@ int split_extent_map(struct btrfs_inode *inode, u64 start, u64 len, u64 pre,
 	split_pre->flags = flags;
 	split_pre->generation = em->generation;
 
-	replace_extent_mapping(em_tree, em, split_pre, 1);
+	replace_extent_mapping(inode, em, split_pre, 1);
 
 	/*
 	 * Now we only have an extent_map at:
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH v2 08/15] btrfs: pass the extent map tree's inode to setup_extent_mapping()
  2024-04-11 16:18 ` [PATCH v2 00/15] " fdmanana
                     ` (6 preceding siblings ...)
  2024-04-11 16:19   ` [PATCH v2 07/15] btrfs: pass the extent map tree's inode to replace_extent_mapping() fdmanana
@ 2024-04-11 16:19   ` fdmanana
  2024-04-11 23:25     ` Qu Wenruo
  2024-04-11 16:19   ` [PATCH v2 09/15] btrfs: pass the extent map tree's inode to try_merge_map() fdmanana
                     ` (6 subsequent siblings)
  14 siblings, 1 reply; 64+ messages in thread
From: fdmanana @ 2024-04-11 16:19 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

Extent maps are always associated with an inode's extent map tree, so
there's no need to pass the extent map tree explicitly to
setup_extent_mapping().

In order to facilitate an upcoming change that adds a shrinker for extent
maps, change setup_extent_mapping() to receive the inode instead of its
extent map tree.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/extent_map.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
index 15817b842c24..2753bf2964cb 100644
--- a/fs/btrfs/extent_map.c
+++ b/fs/btrfs/extent_map.c
@@ -342,7 +342,7 @@ void clear_em_logging(struct btrfs_inode *inode, struct extent_map *em)
 		try_merge_map(tree, em);
 }
 
-static inline void setup_extent_mapping(struct extent_map_tree *tree,
+static inline void setup_extent_mapping(struct btrfs_inode *inode,
 					struct extent_map *em,
 					int modified)
 {
@@ -351,9 +351,9 @@ static inline void setup_extent_mapping(struct extent_map_tree *tree,
 	ASSERT(list_empty(&em->list));
 
 	if (modified)
-		list_add(&em->list, &tree->modified_extents);
+		list_add(&em->list, &inode->extent_tree.modified_extents);
 	else
-		try_merge_map(tree, em);
+		try_merge_map(&inode->extent_tree, em);
 }
 
 /*
@@ -381,7 +381,7 @@ static int add_extent_mapping(struct btrfs_inode *inode,
 	if (ret)
 		return ret;
 
-	setup_extent_mapping(tree, em, modified);
+	setup_extent_mapping(inode, em, modified);
 
 	return 0;
 }
@@ -486,7 +486,7 @@ static void replace_extent_mapping(struct btrfs_inode *inode,
 	rb_replace_node_cached(&cur->rb_node, &new->rb_node, &tree->map);
 	RB_CLEAR_NODE(&cur->rb_node);
 
-	setup_extent_mapping(tree, new, modified);
+	setup_extent_mapping(inode, new, modified);
 }
 
 static struct extent_map *next_extent_map(const struct extent_map *em)
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH v2 09/15] btrfs: pass the extent map tree's inode to try_merge_map()
  2024-04-11 16:18 ` [PATCH v2 00/15] " fdmanana
                     ` (7 preceding siblings ...)
  2024-04-11 16:19   ` [PATCH v2 08/15] btrfs: pass the extent map tree's inode to setup_extent_mapping() fdmanana
@ 2024-04-11 16:19   ` fdmanana
  2024-04-11 23:25     ` Qu Wenruo
  2024-04-11 16:19   ` [PATCH v2 10/15] btrfs: add a global per cpu counter to track number of used extent maps fdmanana
                     ` (5 subsequent siblings)
  14 siblings, 1 reply; 64+ messages in thread
From: fdmanana @ 2024-04-11 16:19 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

Extent maps are always associated with an inode's extent map tree, so
there's no need to pass the extent map tree explicitly to try_merge_map().

In order to facilitate an upcoming change that adds a shrinker for extent
maps, change try_merge_map() to receive the inode instead of its extent
map tree.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/extent_map.c | 13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
index 2753bf2964cb..97a8e0484415 100644
--- a/fs/btrfs/extent_map.c
+++ b/fs/btrfs/extent_map.c
@@ -223,8 +223,9 @@ static bool mergeable_maps(const struct extent_map *prev, const struct extent_ma
 	return next->block_start == prev->block_start;
 }
 
-static void try_merge_map(struct extent_map_tree *tree, struct extent_map *em)
+static void try_merge_map(struct btrfs_inode *inode, struct extent_map *em)
 {
+	struct extent_map_tree *tree = &inode->extent_tree;
 	struct extent_map *merge = NULL;
 	struct rb_node *rb;
 
@@ -322,7 +323,7 @@ int unpin_extent_cache(struct btrfs_inode *inode, u64 start, u64 len, u64 gen)
 	em->generation = gen;
 	em->flags &= ~EXTENT_FLAG_PINNED;
 
-	try_merge_map(tree, em);
+	try_merge_map(inode, em);
 
 out:
 	write_unlock(&tree->lock);
@@ -333,13 +334,11 @@ int unpin_extent_cache(struct btrfs_inode *inode, u64 start, u64 len, u64 gen)
 
 void clear_em_logging(struct btrfs_inode *inode, struct extent_map *em)
 {
-	struct extent_map_tree *tree = &inode->extent_tree;
-
-	lockdep_assert_held_write(&tree->lock);
+	lockdep_assert_held_write(&inode->extent_tree.lock);
 
 	em->flags &= ~EXTENT_FLAG_LOGGING;
 	if (extent_map_in_tree(em))
-		try_merge_map(tree, em);
+		try_merge_map(inode, em);
 }
 
 static inline void setup_extent_mapping(struct btrfs_inode *inode,
@@ -353,7 +352,7 @@ static inline void setup_extent_mapping(struct btrfs_inode *inode,
 	if (modified)
 		list_add(&em->list, &inode->extent_tree.modified_extents);
 	else
-		try_merge_map(&inode->extent_tree, em);
+		try_merge_map(inode, em);
 }
 
 /*
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH v2 10/15] btrfs: add a global per cpu counter to track number of used extent maps
  2024-04-11 16:18 ` [PATCH v2 00/15] " fdmanana
                     ` (8 preceding siblings ...)
  2024-04-11 16:19   ` [PATCH v2 09/15] btrfs: pass the extent map tree's inode to try_merge_map() fdmanana
@ 2024-04-11 16:19   ` fdmanana
  2024-04-15  2:47     ` kernel test robot
  2024-04-11 16:19   ` [PATCH v2 11/15] btrfs: export find_next_inode() as btrfs_find_first_inode() fdmanana
                     ` (4 subsequent siblings)
  14 siblings, 1 reply; 64+ messages in thread
From: fdmanana @ 2024-04-11 16:19 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

Add a per cpu counter that tracks the total number of extent maps that are
in extent trees of inodes that belong to fs trees. This is going to be
used in an upcoming change that adds a shrinker for extent maps. Only
extent maps for fs trees are considered, because for special trees such as
the data relocation tree we don't want to evict their extent maps, which
are critical for the relocation to work, and since those are limited, it's
not a concern to have them in memory during the relocation of a block
group. Another case is extent maps for free space cache inodes, which
must always remain in memory, but those are limited (there's only one per
free space cache inode, which means one per block group).
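
As a quick illustration of why a per cpu counter fits here, below is a
sketch of the generic percpu_counter API (not code from this patch):
updates are cheap and mostly touch only per cpu data, while an exact sum
is more expensive and only needed occasionally:

  struct percpu_counter counter;
  s64 total, approx;
  int ret;

  ret = percpu_counter_init(&counter, 0, GFP_KERNEL);
  if (ret)
          return ret;

  percpu_counter_inc(&counter);  /* fast path, per cpu in the common case */
  percpu_counter_dec(&counter);

  /* Exact value, iterates all CPUs - use sparingly. */
  total = percpu_counter_sum_positive(&counter);

  /* Cheap approximation, may lag behind by the per cpu batch size. */
  approx = percpu_counter_read_positive(&counter);

  percpu_counter_destroy(&counter);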

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/disk-io.c    |  6 ++++++
 fs/btrfs/extent_map.c | 17 +++++++++++++++++
 fs/btrfs/fs.h         |  2 ++
 3 files changed, 25 insertions(+)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 0474e9b6d302..3c2d35b2062e 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1269,6 +1269,8 @@ void btrfs_free_fs_info(struct btrfs_fs_info *fs_info)
 	percpu_counter_destroy(&fs_info->dirty_metadata_bytes);
 	percpu_counter_destroy(&fs_info->delalloc_bytes);
 	percpu_counter_destroy(&fs_info->ordered_bytes);
+	ASSERT(percpu_counter_sum_positive(&fs_info->evictable_extent_maps) == 0);
+	percpu_counter_destroy(&fs_info->evictable_extent_maps);
 	percpu_counter_destroy(&fs_info->dev_replace.bio_counter);
 	btrfs_free_csum_hash(fs_info);
 	btrfs_free_stripe_hash_table(fs_info);
@@ -2848,6 +2850,10 @@ static int init_mount_fs_info(struct btrfs_fs_info *fs_info, struct super_block
 	if (ret)
 		return ret;
 
+	ret = percpu_counter_init(&fs_info->evictable_extent_maps, 0, GFP_KERNEL);
+	if (ret)
+		return ret;
+
 	ret = percpu_counter_init(&fs_info->dirty_metadata_bytes, 0, GFP_KERNEL);
 	if (ret)
 		return ret;
diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
index 97a8e0484415..2fcf28148a81 100644
--- a/fs/btrfs/extent_map.c
+++ b/fs/btrfs/extent_map.c
@@ -76,6 +76,14 @@ static u64 range_end(u64 start, u64 len)
 	return start + len;
 }
 
+static void dec_evictable_extent_maps(struct btrfs_inode *inode)
+{
+	struct btrfs_fs_info *fs_info = inode->root->fs_info;
+
+	if (!btrfs_is_testing(fs_info) && is_fstree(btrfs_root_id(inode->root)))
+		percpu_counter_dec(&fs_info->evictable_extent_maps);
+}
+
 static int tree_insert(struct rb_root_cached *root, struct extent_map *em)
 {
 	struct rb_node **p = &root->rb_root.rb_node;
@@ -259,6 +267,7 @@ static void try_merge_map(struct btrfs_inode *inode, struct extent_map *em)
 			rb_erase_cached(&merge->rb_node, &tree->map);
 			RB_CLEAR_NODE(&merge->rb_node);
 			free_extent_map(merge);
+			dec_evictable_extent_maps(inode);
 		}
 	}
 
@@ -273,6 +282,7 @@ static void try_merge_map(struct btrfs_inode *inode, struct extent_map *em)
 		em->generation = max(em->generation, merge->generation);
 		em->flags |= EXTENT_FLAG_MERGED;
 		free_extent_map(merge);
+		dec_evictable_extent_maps(inode);
 	}
 }
 
@@ -372,6 +382,8 @@ static int add_extent_mapping(struct btrfs_inode *inode,
 			      struct extent_map *em, int modified)
 {
 	struct extent_map_tree *tree = &inode->extent_tree;
+	struct btrfs_root *root = inode->root;
+	struct btrfs_fs_info *fs_info = root->fs_info;
 	int ret;
 
 	lockdep_assert_held_write(&tree->lock);
@@ -382,6 +394,9 @@ static int add_extent_mapping(struct btrfs_inode *inode,
 
 	setup_extent_mapping(inode, em, modified);
 
+	if (!btrfs_is_testing(fs_info) && is_fstree(btrfs_root_id(root)))
+		percpu_counter_inc(&fs_info->evictable_extent_maps);
+
 	return 0;
 }
 
@@ -467,6 +482,8 @@ void remove_extent_mapping(struct btrfs_inode *inode, struct extent_map *em)
 	if (!(em->flags & EXTENT_FLAG_LOGGING))
 		list_del_init(&em->list);
 	RB_CLEAR_NODE(&em->rb_node);
+
+	dec_evictable_extent_maps(inode);
 }
 
 static void replace_extent_mapping(struct btrfs_inode *inode,
diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
index 93f5c57ea4e3..534d30dafe32 100644
--- a/fs/btrfs/fs.h
+++ b/fs/btrfs/fs.h
@@ -630,6 +630,8 @@ struct btrfs_fs_info {
 	s32 dirty_metadata_batch;
 	s32 delalloc_batch;
 
+	struct percpu_counter evictable_extent_maps;
+
 	/* Protected by 'trans_lock'. */
 	struct list_head dirty_cowonly_roots;
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH v2 11/15] btrfs: export find_next_inode() as btrfs_find_first_inode()
  2024-04-11 16:18 ` [PATCH v2 00/15] " fdmanana
                     ` (9 preceding siblings ...)
  2024-04-11 16:19   ` [PATCH v2 10/15] btrfs: add a global per cpu counter to track number of used extent maps fdmanana
@ 2024-04-11 16:19   ` fdmanana
  2024-04-11 23:14     ` Qu Wenruo
  2024-04-11 16:19   ` [PATCH v2 12/15] btrfs: use btrfs_find_first_inode() at btrfs_prune_dentries() fdmanana
                     ` (3 subsequent siblings)
  14 siblings, 1 reply; 64+ messages in thread
From: fdmanana @ 2024-04-11 16:19 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

Export the relocation private helper find_next_inode() to inode.c, as this
same logic is also used at btrfs_prune_dentries() and will be used by an
upcoming change that adds an extent map shrinker. The next patch will
change btrfs_prune_dentries() to use this helper.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/btrfs_inode.h |   1 +
 fs/btrfs/inode.c       |  58 +++++++++++++++++++++++
 fs/btrfs/relocation.c  | 105 ++++++++++-------------------------------
 3 files changed, 84 insertions(+), 80 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index ed8bd15aa3e2..9a87ada7fe52 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -543,6 +543,7 @@ ssize_t btrfs_dio_read(struct kiocb *iocb, struct iov_iter *iter,
 		       size_t done_before);
 struct iomap_dio *btrfs_dio_write(struct kiocb *iocb, struct iov_iter *iter,
 				  size_t done_before);
+struct btrfs_inode *btrfs_find_first_inode(struct btrfs_root *root, u64 min_ino);
 
 extern const struct dentry_operations btrfs_dentry_operations;
 
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index d4539b4b8148..9dc41334c3a3 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -10827,6 +10827,64 @@ void btrfs_assert_inode_range_clean(struct btrfs_inode *inode, u64 start, u64 en
 	ASSERT(ordered == NULL);
 }
 
+/*
+ * Find the first inode with a minimum number.
+ *
+ * @root:	The root to search for.
+ * @min_ino:	The minimum inode number.
+ *
+ * Find the first inode in the @root with a number >= @min_ino and return it.
+ * Returns NULL if no such inode found.
+ */
+struct btrfs_inode *btrfs_find_first_inode(struct btrfs_root *root, u64 min_ino)
+{
+	struct rb_node *node;
+	struct rb_node *prev = NULL;
+	struct btrfs_inode *inode;
+
+	spin_lock(&root->inode_lock);
+again:
+	node = root->inode_tree.rb_node;
+	while (node) {
+		prev = node;
+		inode = rb_entry(node, struct btrfs_inode, rb_node);
+		if (min_ino < btrfs_ino(inode))
+			node = node->rb_left;
+		else if (min_ino > btrfs_ino(inode))
+			node = node->rb_right;
+		else
+			break;
+	}
+
+	if (!node) {
+		while (prev) {
+			inode = rb_entry(prev, struct btrfs_inode, rb_node);
+			if (min_ino <= btrfs_ino(inode)) {
+				node = prev;
+				break;
+			}
+			prev = rb_next(prev);
+		}
+	}
+
+	while (node) {
+		inode = rb_entry(node, struct btrfs_inode, rb_node);
+		if (igrab(&inode->vfs_inode)) {
+			spin_unlock(&root->inode_lock);
+			return inode;
+		}
+
+		min_ino = btrfs_ino(inode) + 1;
+		if (cond_resched_lock(&root->inode_lock))
+			goto again;
+
+		node = rb_next(node);
+	}
+	spin_unlock(&root->inode_lock);
+
+	return NULL;
+}
+
 static const struct inode_operations btrfs_dir_inode_operations = {
 	.getattr	= btrfs_getattr,
 	.lookup		= btrfs_lookup,
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 5c9ef6717f84..5b19b41f64a2 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -951,60 +951,6 @@ int btrfs_update_reloc_root(struct btrfs_trans_handle *trans,
 	return ret;
 }
 
-/*
- * helper to find first cached inode with inode number >= objectid
- * in a subvolume
- */
-static struct inode *find_next_inode(struct btrfs_root *root, u64 objectid)
-{
-	struct rb_node *node;
-	struct rb_node *prev;
-	struct btrfs_inode *entry;
-	struct inode *inode;
-
-	spin_lock(&root->inode_lock);
-again:
-	node = root->inode_tree.rb_node;
-	prev = NULL;
-	while (node) {
-		prev = node;
-		entry = rb_entry(node, struct btrfs_inode, rb_node);
-
-		if (objectid < btrfs_ino(entry))
-			node = node->rb_left;
-		else if (objectid > btrfs_ino(entry))
-			node = node->rb_right;
-		else
-			break;
-	}
-	if (!node) {
-		while (prev) {
-			entry = rb_entry(prev, struct btrfs_inode, rb_node);
-			if (objectid <= btrfs_ino(entry)) {
-				node = prev;
-				break;
-			}
-			prev = rb_next(prev);
-		}
-	}
-	while (node) {
-		entry = rb_entry(node, struct btrfs_inode, rb_node);
-		inode = igrab(&entry->vfs_inode);
-		if (inode) {
-			spin_unlock(&root->inode_lock);
-			return inode;
-		}
-
-		objectid = btrfs_ino(entry) + 1;
-		if (cond_resched_lock(&root->inode_lock))
-			goto again;
-
-		node = rb_next(node);
-	}
-	spin_unlock(&root->inode_lock);
-	return NULL;
-}
-
 /*
  * get new location of data
  */
@@ -1065,7 +1011,7 @@ int replace_file_extents(struct btrfs_trans_handle *trans,
 	struct btrfs_fs_info *fs_info = root->fs_info;
 	struct btrfs_key key;
 	struct btrfs_file_extent_item *fi;
-	struct inode *inode = NULL;
+	struct btrfs_inode *inode = NULL;
 	u64 parent;
 	u64 bytenr;
 	u64 new_bytenr = 0;
@@ -1112,13 +1058,13 @@ int replace_file_extents(struct btrfs_trans_handle *trans,
 		 */
 		if (root->root_key.objectid != BTRFS_TREE_RELOC_OBJECTID) {
 			if (first) {
-				inode = find_next_inode(root, key.objectid);
+				inode = btrfs_find_first_inode(root, key.objectid);
 				first = 0;
-			} else if (inode && btrfs_ino(BTRFS_I(inode)) < key.objectid) {
-				btrfs_add_delayed_iput(BTRFS_I(inode));
-				inode = find_next_inode(root, key.objectid);
+			} else if (inode && btrfs_ino(inode) < key.objectid) {
+				btrfs_add_delayed_iput(inode);
+				inode = btrfs_find_first_inode(root, key.objectid);
 			}
-			if (inode && btrfs_ino(BTRFS_I(inode)) == key.objectid) {
+			if (inode && btrfs_ino(inode) == key.objectid) {
 				struct extent_state *cached_state = NULL;
 
 				end = key.offset +
@@ -1128,21 +1074,19 @@ int replace_file_extents(struct btrfs_trans_handle *trans,
 				WARN_ON(!IS_ALIGNED(end, fs_info->sectorsize));
 				end--;
 				/* Take mmap lock to serialize with reflinks. */
-				if (!down_read_trylock(&BTRFS_I(inode)->i_mmap_lock))
+				if (!down_read_trylock(&inode->i_mmap_lock))
 					continue;
-				ret = try_lock_extent(&BTRFS_I(inode)->io_tree,
-						      key.offset, end,
-						      &cached_state);
+				ret = try_lock_extent(&inode->io_tree, key.offset,
+						      end, &cached_state);
 				if (!ret) {
-					up_read(&BTRFS_I(inode)->i_mmap_lock);
+					up_read(&inode->i_mmap_lock);
 					continue;
 				}
 
-				btrfs_drop_extent_map_range(BTRFS_I(inode),
-							    key.offset, end, true);
-				unlock_extent(&BTRFS_I(inode)->io_tree,
-					      key.offset, end, &cached_state);
-				up_read(&BTRFS_I(inode)->i_mmap_lock);
+				btrfs_drop_extent_map_range(inode, key.offset, end, true);
+				unlock_extent(&inode->io_tree, key.offset, end,
+					      &cached_state);
+				up_read(&inode->i_mmap_lock);
 			}
 		}
 
@@ -1185,7 +1129,7 @@ int replace_file_extents(struct btrfs_trans_handle *trans,
 	if (dirty)
 		btrfs_mark_buffer_dirty(trans, leaf);
 	if (inode)
-		btrfs_add_delayed_iput(BTRFS_I(inode));
+		btrfs_add_delayed_iput(inode);
 	return ret;
 }
 
@@ -1527,7 +1471,7 @@ static int invalidate_extent_cache(struct btrfs_root *root,
 				   const struct btrfs_key *max_key)
 {
 	struct btrfs_fs_info *fs_info = root->fs_info;
-	struct inode *inode = NULL;
+	struct btrfs_inode *inode = NULL;
 	u64 objectid;
 	u64 start, end;
 	u64 ino;
@@ -1537,23 +1481,24 @@ static int invalidate_extent_cache(struct btrfs_root *root,
 		struct extent_state *cached_state = NULL;
 
 		cond_resched();
-		iput(inode);
+		if (inode)
+			iput(&inode->vfs_inode);
 
 		if (objectid > max_key->objectid)
 			break;
 
-		inode = find_next_inode(root, objectid);
+		inode = btrfs_find_first_inode(root, objectid);
 		if (!inode)
 			break;
-		ino = btrfs_ino(BTRFS_I(inode));
+		ino = btrfs_ino(inode);
 
 		if (ino > max_key->objectid) {
-			iput(inode);
+			iput(&inode->vfs_inode);
 			break;
 		}
 
 		objectid = ino + 1;
-		if (!S_ISREG(inode->i_mode))
+		if (!S_ISREG(inode->vfs_inode.i_mode))
 			continue;
 
 		if (unlikely(min_key->objectid == ino)) {
@@ -1586,9 +1531,9 @@ static int invalidate_extent_cache(struct btrfs_root *root,
 		}
 
 		/* the lock_extent waits for read_folio to complete */
-		lock_extent(&BTRFS_I(inode)->io_tree, start, end, &cached_state);
-		btrfs_drop_extent_map_range(BTRFS_I(inode), start, end, true);
-		unlock_extent(&BTRFS_I(inode)->io_tree, start, end, &cached_state);
+		lock_extent(&inode->io_tree, start, end, &cached_state);
+		btrfs_drop_extent_map_range(inode, start, end, true);
+		unlock_extent(&inode->io_tree, start, end, &cached_state);
 	}
 	return 0;
 }
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH v2 12/15] btrfs: use btrfs_find_first_inode() at btrfs_prune_dentries()
  2024-04-11 16:18 ` [PATCH v2 00/15] " fdmanana
                     ` (10 preceding siblings ...)
  2024-04-11 16:19   ` [PATCH v2 11/15] btrfs: export find_next_inode() as btrfs_find_first_inode() fdmanana
@ 2024-04-11 16:19   ` fdmanana
  2024-04-11 23:15     ` Qu Wenruo
  2024-04-11 16:19   ` [PATCH v2 13/15] btrfs: add a shrinker for extent maps fdmanana
                     ` (2 subsequent siblings)
  14 siblings, 1 reply; 64+ messages in thread
From: fdmanana @ 2024-04-11 16:19 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

Currently btrfs_prune_dentries() open codes the search for the first inode in
a root with a minimum inode number. Remove that code and make it use the
helper btrfs_find_first_inode() for that task.
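
With this helper, iterating over all cached inodes of a root reduces to
the following pattern (an illustrative sketch distilled from the diff
below, not extra code added by the patch):

   struct btrfs_inode *inode;
   u64 min_ino = 0;

   while ((inode = btrfs_find_first_inode(root, min_ino)) != NULL) {
   	min_ino = btrfs_ino(inode) + 1;
   	/* ... operate on the inode ... */
   	iput(&inode->vfs_inode);
   	cond_resched();
   }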

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/inode.c | 66 ++++++++++--------------------------------------
 1 file changed, 14 insertions(+), 52 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 9dc41334c3a3..2dae4e975e80 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4436,64 +4436,26 @@ static noinline int may_destroy_subvol(struct btrfs_root *root)
 static void btrfs_prune_dentries(struct btrfs_root *root)
 {
 	struct btrfs_fs_info *fs_info = root->fs_info;
-	struct rb_node *node;
-	struct rb_node *prev;
-	struct btrfs_inode *entry;
-	struct inode *inode;
-	u64 objectid = 0;
+	struct btrfs_inode *inode;
+	u64 min_ino = 0;
 
 	if (!BTRFS_FS_ERROR(fs_info))
 		WARN_ON(btrfs_root_refs(&root->root_item) != 0);
 
-	spin_lock(&root->inode_lock);
-again:
-	node = root->inode_tree.rb_node;
-	prev = NULL;
-	while (node) {
-		prev = node;
-		entry = rb_entry(node, struct btrfs_inode, rb_node);
-
-		if (objectid < btrfs_ino(entry))
-			node = node->rb_left;
-		else if (objectid > btrfs_ino(entry))
-			node = node->rb_right;
-		else
-			break;
-	}
-	if (!node) {
-		while (prev) {
-			entry = rb_entry(prev, struct btrfs_inode, rb_node);
-			if (objectid <= btrfs_ino(entry)) {
-				node = prev;
-				break;
-			}
-			prev = rb_next(prev);
-		}
-	}
-	while (node) {
-		entry = rb_entry(node, struct btrfs_inode, rb_node);
-		objectid = btrfs_ino(entry) + 1;
-		inode = igrab(&entry->vfs_inode);
-		if (inode) {
-			spin_unlock(&root->inode_lock);
-			if (atomic_read(&inode->i_count) > 1)
-				d_prune_aliases(inode);
-			/*
-			 * btrfs_drop_inode will have it removed from the inode
-			 * cache when its usage count hits zero.
-			 */
-			iput(inode);
-			cond_resched();
-			spin_lock(&root->inode_lock);
-			goto again;
-		}
-
-		if (cond_resched_lock(&root->inode_lock))
-			goto again;
+	inode = btrfs_find_first_inode(root, min_ino);
+	while (inode) {
+		if (atomic_read(&inode->vfs_inode.i_count) > 1)
+			d_prune_aliases(&inode->vfs_inode);
 
-		node = rb_next(node);
+		min_ino = btrfs_ino(inode) + 1;
+		/*
+		 * btrfs_drop_inode() will have it removed from the inode
+		 * cache when its usage count hits zero.
+		 */
+		iput(&inode->vfs_inode);
+		cond_resched();
+		inode = btrfs_find_first_inode(root, min_ino);
 	}
-	spin_unlock(&root->inode_lock);
 }
 
 int btrfs_delete_subvolume(struct btrfs_inode *dir, struct dentry *dentry)
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH v2 13/15] btrfs: add a shrinker for extent maps
  2024-04-11 16:18 ` [PATCH v2 00/15] " fdmanana
                     ` (11 preceding siblings ...)
  2024-04-11 16:19   ` [PATCH v2 12/15] btrfs: use btrfs_find_first_inode() at btrfs_prune_dentries() fdmanana
@ 2024-04-11 16:19   ` fdmanana
  2024-04-12 20:06     ` Josef Bacik
  2024-04-11 16:19   ` [PATCH v2 14/15] btrfs: update comment for btrfs_set_inode_full_sync() about locking fdmanana
  2024-04-11 16:19   ` [PATCH v2 15/15] btrfs: add tracepoints for extent map shrinker events fdmanana
  14 siblings, 1 reply; 64+ messages in thread
From: fdmanana @ 2024-04-11 16:19 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

Extent maps are used either to represent existing file extent items, or to
represent new extents that are going to be written and the respective file
extent items are created when the ordered extent completes.

We currently don't have any limit for how many extent maps we can have,
neither per inode nor globally. Most of the time this is not too noticeable
because extent maps are removed in the following situations:

1) When evicting an inode;

2) When releasing folios (pages) through the btrfs_release_folio() address
   space operation callback.

   However we won't release extent maps in the folio range if the folio is
   either dirty or under writeback, or if the inode's i_size is less than
   or equal to 16M (see try_release_extent_mapping(), roughly sketched
   below). This 16M i_size constraint was added back in 2008 with commit
   70dec8079d78 ("Btrfs: extent_io and extent_state optimizations"), but
   there's no explanation about why we have it or why the 16M value.
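
   An illustrative paraphrase of that gate in try_release_extent_mapping()
   (the exact code differs between kernel versions, so treat this as a
   sketch rather than the real implementation):

      if (gfpflags_allow_blocking(mask) && inode->i_size > SZ_16M) {
      	/* ... walk the range and drop unpinned extent maps ... */
      }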

This means that for buffered IO we can reach an OOM situation due to too
many extent maps if either of the following happens:

1) There's a set of tasks constantly doing IO on many files with a size
   not larger than 16M, especially if they keep the files open for very
   long periods, therefore preventing inode eviction.

   This requires a really high number of such files, and having many non
   mergeable extent maps (due to random 4K writes for example) and a
   machine with very little memory;

2) There's a set of tasks constantly doing random write IO (therefore
   creating many non mergeable extent maps) on files and keeping them
   open for long periods of time, so inode eviction doesn't happen and
   there's always a lot of dirty pages or pages under writeback,
   preventing btrfs_release_folio() from releasing the respective extent
   maps.

This second case was actually reported in the thread pointed to by the
Link tag below, and it requires a very large file under heavy IO and a
machine with very little RAM, which is probably hard to hit in practice
in a real world use case.

However when using direct IO this is not so hard to trigger, because the
page cache is not used, and therefore btrfs_release_folio() is never
called. Which means extent maps are dropped only when evicting the inode,
and that means that if we have tasks that keep a file descriptor open and
keep doing IO on a very large file (or files), we can exhaust memory due
to an unbounded amount of extent maps. This is especially easy to trigger
if we have a huge file with millions of small extents and their extent
maps are not mergeable (non contiguous offsets and disk locations).
This was reported in that thread with the following fio test:

   $ cat test.sh
   #!/bin/bash

   DEV=/dev/sdj
   MNT=/mnt/sdj
   MOUNT_OPTIONS="-o ssd"
   MKFS_OPTIONS=""

   cat <<EOF > /tmp/fio-job.ini
   [global]
   name=fio-rand-write
   filename=$MNT/fio-rand-write
   rw=randwrite
   bs=4K
   direct=1
   numjobs=16
   fallocate=none
   time_based
   runtime=90000

   [file1]
   size=300G
   ioengine=libaio
   iodepth=16

   EOF

   umount $MNT &> /dev/null
   mkfs.btrfs -f $MKFS_OPTIONS $DEV
   mount $MOUNT_OPTIONS $DEV $MNT

   fio /tmp/fio-job.ini
   umount $MNT

Monitoring the btrfs_extent_map slab while running the test with:

   $ watch -d -n 1 'cat /sys/kernel/slab/btrfs_extent_map/objects \
                        /sys/kernel/slab/btrfs_extent_map/total_objects'

Shows the number of active and total extent maps skyrocketing to tens of
millions, and on systems with a small amount of memory it's easy and quick
to get into an OOM situation, as reported in that thread.

So to avoid this issue add a shrinker that will remove extent maps, as
long as they are not pinned, taking proper care with any concurrent
fsync to avoid missing extents (setting the full sync flag while in the
middle of a fast fsync). This shrinker is similar to the one ext4 uses
for its extent_status structure, which is analogous to btrfs' extent_map
structure.

Link: https://lore.kernel.org/linux-btrfs/13f94633dcf04d29aaf1f0a43d42c55e@amazon.com/
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/disk-io.c    |   7 +-
 fs/btrfs/extent_map.c | 156 ++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/extent_map.h |   2 +
 fs/btrfs/fs.h         |   2 +
 4 files changed, 163 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 3c2d35b2062e..8bb295eaf3d7 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1266,11 +1266,10 @@ static void free_global_roots(struct btrfs_fs_info *fs_info)
 
 void btrfs_free_fs_info(struct btrfs_fs_info *fs_info)
 {
+	btrfs_unregister_extent_map_shrinker(fs_info);
 	percpu_counter_destroy(&fs_info->dirty_metadata_bytes);
 	percpu_counter_destroy(&fs_info->delalloc_bytes);
 	percpu_counter_destroy(&fs_info->ordered_bytes);
-	ASSERT(percpu_counter_sum_positive(&fs_info->evictable_extent_maps) == 0);
-	percpu_counter_destroy(&fs_info->evictable_extent_maps);
 	percpu_counter_destroy(&fs_info->dev_replace.bio_counter);
 	btrfs_free_csum_hash(fs_info);
 	btrfs_free_stripe_hash_table(fs_info);
@@ -2846,11 +2845,11 @@ static int init_mount_fs_info(struct btrfs_fs_info *fs_info, struct super_block
 	sb->s_blocksize = BTRFS_BDEV_BLOCKSIZE;
 	sb->s_blocksize_bits = blksize_bits(BTRFS_BDEV_BLOCKSIZE);
 
-	ret = percpu_counter_init(&fs_info->ordered_bytes, 0, GFP_KERNEL);
+	ret = btrfs_register_extent_map_shrinker(fs_info);
 	if (ret)
 		return ret;
 
-	ret = percpu_counter_init(&fs_info->evictable_extent_maps, 0, GFP_KERNEL);
+	ret = percpu_counter_init(&fs_info->ordered_bytes, 0, GFP_KERNEL);
 	if (ret)
 		return ret;
 
diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
index 2fcf28148a81..9791acda5b57 100644
--- a/fs/btrfs/extent_map.c
+++ b/fs/btrfs/extent_map.c
@@ -8,6 +8,7 @@
 #include "extent_map.h"
 #include "compression.h"
 #include "btrfs_inode.h"
+#include "disk-io.h"
 
 
 static struct kmem_cache *extent_map_cache;
@@ -1026,3 +1027,158 @@ int split_extent_map(struct btrfs_inode *inode, u64 start, u64 len, u64 pre,
 	free_extent_map(split_pre);
 	return ret;
 }
+
+static unsigned long btrfs_scan_inode(struct btrfs_inode *inode,
+				      unsigned long *scanned,
+				      unsigned long nr_to_scan)
+{
+	struct extent_map_tree *tree = &inode->extent_tree;
+	unsigned long nr_dropped = 0;
+	struct rb_node *node;
+
+	/*
+	 * Take the mmap lock so that we serialize with the inode logging phase
+	 * of fsync because we may need to set the full sync flag on the inode,
+	 * in case we have to remove extent maps in the tree's list of modified
+	 * extents. If we set the full sync flag in the inode while an fsync is
+	 * in progress, we may risk missing new extents because before the flag
+	 * is set, fsync decides to only wait for writeback to complete and then
+	 * during inode logging it sees the flag set and uses the subvolume tree
+	 * to find new extents, which may not be there yet because ordered
+	 * extents haven't completed yet.
+	 */
+	down_read(&inode->i_mmap_lock);
+	write_lock(&tree->lock);
+	node = rb_first_cached(&tree->map);
+	while (node) {
+		struct extent_map *em;
+
+		em = rb_entry(node, struct extent_map, rb_node);
+		node = rb_next(node);
+		(*scanned)++;
+
+		if (em->flags & EXTENT_FLAG_PINNED)
+			goto next;
+
+		if (!list_empty(&em->list))
+			btrfs_set_inode_full_sync(inode);
+
+		remove_extent_mapping(inode, em);
+		/* Drop the reference for the tree. */
+		free_extent_map(em);
+		nr_dropped++;
+next:
+		if (*scanned >= nr_to_scan)
+			break;
+
+		/*
+		 * Restart if we had to resched, and any extent maps that were
+		 * pinned before may have become unpinned after we released the
+		 * lock and took it again.
+		 */
+		if (cond_resched_rwlock_write(&tree->lock))
+			node = rb_first_cached(&tree->map);
+	}
+	write_unlock(&tree->lock);
+	up_read(&inode->i_mmap_lock);
+
+	return nr_dropped;
+}
+
+static unsigned long btrfs_scan_root(struct btrfs_root *root,
+				     unsigned long *scanned,
+				     unsigned long nr_to_scan)
+{
+	struct btrfs_inode *inode;
+	unsigned long nr_dropped = 0;
+	u64 min_ino = 0;
+
+	inode = btrfs_find_first_inode(root, min_ino);
+	while (inode) {
+		nr_dropped += btrfs_scan_inode(inode, scanned, nr_to_scan);
+
+		min_ino = btrfs_ino(inode) + 1;
+		iput(&inode->vfs_inode);
+
+		if (*scanned >= nr_to_scan)
+			break;
+
+		cond_resched();
+		inode = btrfs_find_first_inode(root, min_ino);
+	}
+
+	return nr_dropped;
+}
+
+static unsigned long btrfs_extent_maps_scan(struct shrinker *shrinker,
+					    struct shrink_control *sc)
+{
+	struct btrfs_fs_info *fs_info = shrinker->private_data;
+	unsigned long nr_dropped = 0;
+	unsigned long scanned = 0;
+	u64 next_root_id = 0;
+
+	while (scanned < sc->nr_to_scan) {
+		struct btrfs_root *root;
+		unsigned long count;
+
+		spin_lock(&fs_info->fs_roots_radix_lock);
+		count = radix_tree_gang_lookup(&fs_info->fs_roots_radix,
+					       (void **)&root, next_root_id, 1);
+		if (count == 0) {
+			spin_unlock(&fs_info->fs_roots_radix_lock);
+			break;
+		}
+		next_root_id = btrfs_root_id(root) + 1;
+		root = btrfs_grab_root(root);
+		spin_unlock(&fs_info->fs_roots_radix_lock);
+
+		if (!root)
+			continue;
+
+		if (is_fstree(btrfs_root_id(root)))
+			nr_dropped += btrfs_scan_root(root, &scanned, sc->nr_to_scan);
+
+		btrfs_put_root(root);
+	}
+
+	return nr_dropped;
+}
+
+static unsigned long btrfs_extent_maps_count(struct shrinker *shrinker,
+					     struct shrink_control *sc)
+{
+	struct btrfs_fs_info *fs_info = shrinker->private_data;
+
+	return percpu_counter_sum_positive(&fs_info->evictable_extent_maps);
+}
+
+int btrfs_register_extent_map_shrinker(struct btrfs_fs_info *fs_info)
+{
+	int ret;
+
+	ret = percpu_counter_init(&fs_info->evictable_extent_maps, 0, GFP_KERNEL);
+	if (ret)
+		return ret;
+
+	fs_info->extent_map_shrinker = shrinker_alloc(0, "em-btrfs:%s", fs_info->sb->s_id);
+	if (!fs_info->extent_map_shrinker) {
+		percpu_counter_destroy(&fs_info->evictable_extent_maps);
+		return -ENOMEM;
+	}
+
+	fs_info->extent_map_shrinker->scan_objects = btrfs_extent_maps_scan;
+	fs_info->extent_map_shrinker->count_objects = btrfs_extent_maps_count;
+	fs_info->extent_map_shrinker->private_data = fs_info;
+
+	shrinker_register(fs_info->extent_map_shrinker);
+
+	return 0;
+}
+
+void btrfs_unregister_extent_map_shrinker(struct btrfs_fs_info *fs_info)
+{
+	shrinker_free(fs_info->extent_map_shrinker);
+	ASSERT(percpu_counter_sum_positive(&fs_info->evictable_extent_maps) == 0);
+	percpu_counter_destroy(&fs_info->evictable_extent_maps);
+}
diff --git a/fs/btrfs/extent_map.h b/fs/btrfs/extent_map.h
index c3707461ff62..8a6be2f7a0e2 100644
--- a/fs/btrfs/extent_map.h
+++ b/fs/btrfs/extent_map.h
@@ -140,5 +140,7 @@ void btrfs_drop_extent_map_range(struct btrfs_inode *inode,
 int btrfs_replace_extent_map_range(struct btrfs_inode *inode,
 				   struct extent_map *new_em,
 				   bool modified);
+int btrfs_register_extent_map_shrinker(struct btrfs_fs_info *fs_info);
+void btrfs_unregister_extent_map_shrinker(struct btrfs_fs_info *fs_info);
 
 #endif
diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
index 534d30dafe32..f1414814bd69 100644
--- a/fs/btrfs/fs.h
+++ b/fs/btrfs/fs.h
@@ -857,6 +857,8 @@ struct btrfs_fs_info {
 	struct lockdep_map btrfs_trans_pending_ordered_map;
 	struct lockdep_map btrfs_ordered_extent_map;
 
+	struct shrinker *extent_map_shrinker;
+
 #ifdef CONFIG_BTRFS_FS_REF_VERIFY
 	spinlock_t ref_verify_lock;
 	struct rb_root block_tree;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH v2 14/15] btrfs: update comment for btrfs_set_inode_full_sync() about locking
  2024-04-11 16:18 ` [PATCH v2 00/15] " fdmanana
                     ` (12 preceding siblings ...)
  2024-04-11 16:19   ` [PATCH v2 13/15] btrfs: add a shrinker for extent maps fdmanana
@ 2024-04-11 16:19   ` fdmanana
  2024-04-11 16:19   ` [PATCH v2 15/15] btrfs: add tracepoints for extent map shrinker events fdmanana
  14 siblings, 0 replies; 64+ messages in thread
From: fdmanana @ 2024-04-11 16:19 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

Nowadays we have a lock used to synchronize mmap writes with reflink and
fsync operations (struct btrfs_inode::i_mmap_lock), so update the comment
for btrfs_set_inode_full_sync() to mention that it can also be called
while holding that mmap lock. Besides being a valid alternative to the
inode's VFS lock, the mmap lock is also what the extent map shrinker
already uses.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/btrfs_inode.h | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 9a87ada7fe52..91c994b569f3 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -381,9 +381,11 @@ static inline void btrfs_set_inode_last_sub_trans(struct btrfs_inode *inode)
 }
 
 /*
- * Should be called while holding the inode's VFS lock in exclusive mode or in a
- * context where no one else can access the inode concurrently (during inode
- * creation or when loading an inode from disk).
+ * Should be called while holding the inode's VFS lock in exclusive mode, or
+ * while holding the inode's mmap lock (struct btrfs_inode::i_mmap_lock) in
+ * either shared or exclusive mode, or in a context where no one else can access
+ * the inode concurrently (during inode creation or when loading an inode from
+ * disk).
  */
 static inline void btrfs_set_inode_full_sync(struct btrfs_inode *inode)
 {
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH v2 15/15] btrfs: add tracepoints for extent map shrinker events
  2024-04-11 16:18 ` [PATCH v2 00/15] " fdmanana
                     ` (13 preceding siblings ...)
  2024-04-11 16:19   ` [PATCH v2 14/15] btrfs: update comment for btrfs_set_inode_full_sync() about locking fdmanana
@ 2024-04-11 16:19   ` fdmanana
  14 siblings, 0 replies; 64+ messages in thread
From: fdmanana @ 2024-04-11 16:19 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

Add some tracepoints for the extent map shrinker to help debug and analyse
its main events. These have proved useful during development of the shrinker.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/extent_map.c        | 18 ++++++-
 include/trace/events/btrfs.h | 92 ++++++++++++++++++++++++++++++++++++
 2 files changed, 109 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
index 9791acda5b57..5c40efe39e70 100644
--- a/fs/btrfs/extent_map.c
+++ b/fs/btrfs/extent_map.c
@@ -1064,6 +1064,7 @@ static unsigned long btrfs_scan_inode(struct btrfs_inode *inode,
 			btrfs_set_inode_full_sync(inode);
 
 		remove_extent_mapping(inode, em);
+		trace_btrfs_extent_map_shrinker_remove_em(inode, em);
 		/* Drop the reference for the tree. */
 		free_extent_map(em);
 		nr_dropped++;
@@ -1118,6 +1119,12 @@ static unsigned long btrfs_extent_maps_scan(struct shrinker *shrinker,
 	unsigned long scanned = 0;
 	u64 next_root_id = 0;
 
+	if (trace_btrfs_extent_map_shrinker_scan_enter_enabled()) {
+		s64 nr = percpu_counter_sum_positive(&fs_info->evictable_extent_maps);
+
+		trace_btrfs_extent_map_shrinker_scan_enter(fs_info, sc->nr_to_scan, nr);
+	}
+
 	while (scanned < sc->nr_to_scan) {
 		struct btrfs_root *root;
 		unsigned long count;
@@ -1142,6 +1149,12 @@ static unsigned long btrfs_extent_maps_scan(struct shrinker *shrinker,
 		btrfs_put_root(root);
 	}
 
+	if (trace_btrfs_extent_map_shrinker_scan_exit_enabled()) {
+		s64 nr = percpu_counter_sum_positive(&fs_info->evictable_extent_maps);
+
+		trace_btrfs_extent_map_shrinker_scan_exit(fs_info, nr_dropped, nr);
+	}
+
 	return nr_dropped;
 }
 
@@ -1149,8 +1162,11 @@ static unsigned long btrfs_extent_maps_count(struct shrinker *shrinker,
 					     struct shrink_control *sc)
 {
 	struct btrfs_fs_info *fs_info = shrinker->private_data;
+	const s64 nr = percpu_counter_sum_positive(&fs_info->evictable_extent_maps);
+
+	trace_btrfs_extent_map_shrinker_count(fs_info, sc->nr_to_scan, nr);
 
-	return percpu_counter_sum_positive(&fs_info->evictable_extent_maps);
+	return nr;
 }
 
 int btrfs_register_extent_map_shrinker(struct btrfs_fs_info *fs_info)
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index 766cfd48386c..ba49efa2bc74 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -2551,6 +2551,98 @@ TRACE_EVENT(btrfs_get_raid_extent_offset,
 			__entry->devid)
 );
 
+TRACE_EVENT(btrfs_extent_map_shrinker_count,
+
+	TP_PROTO(const struct btrfs_fs_info *fs_info, u64 nr_to_scan, u64 nr),
+
+	TP_ARGS(fs_info, nr_to_scan, nr),
+
+	TP_STRUCT__entry_btrfs(
+		__field(	u64,	nr_to_scan	)
+		__field(	u64,	nr		)
+	),
+
+	TP_fast_assign_btrfs(fs_info,
+		__entry->nr_to_scan	= nr_to_scan;
+		__entry->nr		= nr;
+	),
+
+	TP_printk_btrfs("nr_to_scan=%llu nr=%llu",
+			__entry->nr_to_scan, __entry->nr)
+);
+
+TRACE_EVENT(btrfs_extent_map_shrinker_scan_enter,
+
+	TP_PROTO(const struct btrfs_fs_info *fs_info, u64 nr_to_scan, u64 nr),
+
+	TP_ARGS(fs_info, nr_to_scan, nr),
+
+	TP_STRUCT__entry_btrfs(
+		__field(	u64,	nr_to_scan	)
+		__field(	u64,	nr		)
+	),
+
+	TP_fast_assign_btrfs(fs_info,
+		__entry->nr_to_scan	= nr_to_scan;
+		__entry->nr		= nr;
+	),
+
+	TP_printk_btrfs("nr_to_scan=%llu nr=%llu",
+			__entry->nr_to_scan, __entry->nr)
+);
+
+TRACE_EVENT(btrfs_extent_map_shrinker_scan_exit,
+
+	TP_PROTO(const struct btrfs_fs_info *fs_info, u64 nr_dropped, u64 nr),
+
+	TP_ARGS(fs_info, nr_dropped, nr),
+
+	TP_STRUCT__entry_btrfs(
+		__field(	u64,	nr_dropped	)
+		__field(	u64,	nr		)
+	),
+
+	TP_fast_assign_btrfs(fs_info,
+		__entry->nr_dropped	= nr_dropped;
+		__entry->nr		= nr;
+	),
+
+	TP_printk_btrfs("nr_dropped=%llu nr=%llu",
+			__entry->nr_dropped, __entry->nr)
+);
+
+TRACE_EVENT(btrfs_extent_map_shrinker_remove_em,
+
+	TP_PROTO(const struct btrfs_inode *inode, const struct extent_map *em),
+
+	TP_ARGS(inode, em),
+
+	TP_STRUCT__entry_btrfs(
+		__field(	u64,	ino		)
+		__field(	u64,	root_id		)
+		__field(	u64,	start		)
+		__field(	u64,	len		)
+		__field(	u64,	block_start	)
+		__field(	u32,	flags		)
+	),
+
+	TP_fast_assign_btrfs(inode->root->fs_info,
+		__entry->ino		= btrfs_ino(inode);
+		__entry->root_id	= inode->root->root_key.objectid;
+		__entry->start		= em->start;
+		__entry->len		= em->len;
+		__entry->block_start	= em->block_start;
+		__entry->flags		= em->flags;
+	),
+
+	TP_printk_btrfs(
+"ino=%llu root=%llu(%s) start=%llu len=%llu block_start=%llu(%s) flags=%s",
+			__entry->ino, show_root_type(__entry->root_id),
+			__entry->start, __entry->len,
+			show_map_type(__entry->block_start),
+			show_map_flags(__entry->flags))
+);
+
 #endif /* _TRACE_BTRFS_H */
 
 /* This part must be outside protection */
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 11/15] btrfs: export find_next_inode() as btrfs_find_first_inode()
  2024-04-11 16:19   ` [PATCH v2 11/15] btrfs: export find_next_inode() as btrfs_find_first_inode() fdmanana
@ 2024-04-11 23:14     ` Qu Wenruo
  0 siblings, 0 replies; 64+ messages in thread
From: Qu Wenruo @ 2024-04-11 23:14 UTC (permalink / raw)
  To: fdmanana, linux-btrfs



On 2024/4/12 01:49, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
>
> Export the relocation private helper find_next_inode() to inode.c, as this
> same logic is also used at btrfs_prune_dentries() and will be used by an
> upcoming change that adds an extent map shrinker. The next patch will
> change btrfs_prune_dentries() to use this helper.
>
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

Reviewed-by: Qu Wenruo <wqu@suse.com>

Thanks,
Qu
> ---
>   fs/btrfs/btrfs_inode.h |   1 +
>   fs/btrfs/inode.c       |  58 +++++++++++++++++++++++
>   fs/btrfs/relocation.c  | 105 ++++++++++-------------------------------
>   3 files changed, 84 insertions(+), 80 deletions(-)
>
> diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
> index ed8bd15aa3e2..9a87ada7fe52 100644
> --- a/fs/btrfs/btrfs_inode.h
> +++ b/fs/btrfs/btrfs_inode.h
> @@ -543,6 +543,7 @@ ssize_t btrfs_dio_read(struct kiocb *iocb, struct iov_iter *iter,
>   		       size_t done_before);
>   struct iomap_dio *btrfs_dio_write(struct kiocb *iocb, struct iov_iter *iter,
>   				  size_t done_before);
> +struct btrfs_inode *btrfs_find_first_inode(struct btrfs_root *root, u64 min_ino);
>
>   extern const struct dentry_operations btrfs_dentry_operations;
>
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index d4539b4b8148..9dc41334c3a3 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -10827,6 +10827,64 @@ void btrfs_assert_inode_range_clean(struct btrfs_inode *inode, u64 start, u64 en
>   	ASSERT(ordered == NULL);
>   }
>
> +/*
> + * Find the first inode with a minimum number.
> + *
> + * @root:	The root to search for.
> + * @min_ino:	The minimum inode number.
> + *
> + * Find the first inode in the @root with a number >= @min_ino and return it.
> + * Returns NULL if no such inode found.
> + */
> +struct btrfs_inode *btrfs_find_first_inode(struct btrfs_root *root, u64 min_ino)
> +{
> +	struct rb_node *node;
> +	struct rb_node *prev = NULL;
> +	struct btrfs_inode *inode;
> +
> +	spin_lock(&root->inode_lock);
> +again:
> +	node = root->inode_tree.rb_node;
> +	while (node) {
> +		prev = node;
> +		inode = rb_entry(node, struct btrfs_inode, rb_node);
> +		if (min_ino < btrfs_ino(inode))
> +			node = node->rb_left;
> +		else if (min_ino > btrfs_ino(inode))
> +			node = node->rb_right;
> +		else
> +			break;
> +	}
> +
> +	if (!node) {
> +		while (prev) {
> +			inode = rb_entry(prev, struct btrfs_inode, rb_node);
> +			if (min_ino <= btrfs_ino(inode)) {
> +				node = prev;
> +				break;
> +			}
> +			prev = rb_next(prev);
> +		}
> +	}
> +
> +	while (node) {
> +		inode = rb_entry(node, struct btrfs_inode, rb_node);
> +		if (igrab(&inode->vfs_inode)) {
> +			spin_unlock(&root->inode_lock);
> +			return inode;
> +		}
> +
> +		min_ino = btrfs_ino(inode) + 1;
> +		if (cond_resched_lock(&root->inode_lock))
> +			goto again;
> +
> +		node = rb_next(node);
> +	}
> +	spin_unlock(&root->inode_lock);
> +
> +	return NULL;
> +}
> +
>   static const struct inode_operations btrfs_dir_inode_operations = {
>   	.getattr	= btrfs_getattr,
>   	.lookup		= btrfs_lookup,
> diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
> index 5c9ef6717f84..5b19b41f64a2 100644
> --- a/fs/btrfs/relocation.c
> +++ b/fs/btrfs/relocation.c
> @@ -951,60 +951,6 @@ int btrfs_update_reloc_root(struct btrfs_trans_handle *trans,
>   	return ret;
>   }
>
> -/*
> - * helper to find first cached inode with inode number >= objectid
> - * in a subvolume
> - */
> -static struct inode *find_next_inode(struct btrfs_root *root, u64 objectid)
> -{
> -	struct rb_node *node;
> -	struct rb_node *prev;
> -	struct btrfs_inode *entry;
> -	struct inode *inode;
> -
> -	spin_lock(&root->inode_lock);
> -again:
> -	node = root->inode_tree.rb_node;
> -	prev = NULL;
> -	while (node) {
> -		prev = node;
> -		entry = rb_entry(node, struct btrfs_inode, rb_node);
> -
> -		if (objectid < btrfs_ino(entry))
> -			node = node->rb_left;
> -		else if (objectid > btrfs_ino(entry))
> -			node = node->rb_right;
> -		else
> -			break;
> -	}
> -	if (!node) {
> -		while (prev) {
> -			entry = rb_entry(prev, struct btrfs_inode, rb_node);
> -			if (objectid <= btrfs_ino(entry)) {
> -				node = prev;
> -				break;
> -			}
> -			prev = rb_next(prev);
> -		}
> -	}
> -	while (node) {
> -		entry = rb_entry(node, struct btrfs_inode, rb_node);
> -		inode = igrab(&entry->vfs_inode);
> -		if (inode) {
> -			spin_unlock(&root->inode_lock);
> -			return inode;
> -		}
> -
> -		objectid = btrfs_ino(entry) + 1;
> -		if (cond_resched_lock(&root->inode_lock))
> -			goto again;
> -
> -		node = rb_next(node);
> -	}
> -	spin_unlock(&root->inode_lock);
> -	return NULL;
> -}
> -
>   /*
>    * get new location of data
>    */
> @@ -1065,7 +1011,7 @@ int replace_file_extents(struct btrfs_trans_handle *trans,
>   	struct btrfs_fs_info *fs_info = root->fs_info;
>   	struct btrfs_key key;
>   	struct btrfs_file_extent_item *fi;
> -	struct inode *inode = NULL;
> +	struct btrfs_inode *inode = NULL;
>   	u64 parent;
>   	u64 bytenr;
>   	u64 new_bytenr = 0;
> @@ -1112,13 +1058,13 @@ int replace_file_extents(struct btrfs_trans_handle *trans,
>   		 */
>   		if (root->root_key.objectid != BTRFS_TREE_RELOC_OBJECTID) {
>   			if (first) {
> -				inode = find_next_inode(root, key.objectid);
> +				inode = btrfs_find_first_inode(root, key.objectid);
>   				first = 0;
> -			} else if (inode && btrfs_ino(BTRFS_I(inode)) < key.objectid) {
> -				btrfs_add_delayed_iput(BTRFS_I(inode));
> -				inode = find_next_inode(root, key.objectid);
> +			} else if (inode && btrfs_ino(inode) < key.objectid) {
> +				btrfs_add_delayed_iput(inode);
> +				inode = btrfs_find_first_inode(root, key.objectid);
>   			}
> -			if (inode && btrfs_ino(BTRFS_I(inode)) == key.objectid) {
> +			if (inode && btrfs_ino(inode) == key.objectid) {
>   				struct extent_state *cached_state = NULL;
>
>   				end = key.offset +
> @@ -1128,21 +1074,19 @@ int replace_file_extents(struct btrfs_trans_handle *trans,
>   				WARN_ON(!IS_ALIGNED(end, fs_info->sectorsize));
>   				end--;
>   				/* Take mmap lock to serialize with reflinks. */
> -				if (!down_read_trylock(&BTRFS_I(inode)->i_mmap_lock))
> +				if (!down_read_trylock(&inode->i_mmap_lock))
>   					continue;
> -				ret = try_lock_extent(&BTRFS_I(inode)->io_tree,
> -						      key.offset, end,
> -						      &cached_state);
> +				ret = try_lock_extent(&inode->io_tree, key.offset,
> +						      end, &cached_state);
>   				if (!ret) {
> -					up_read(&BTRFS_I(inode)->i_mmap_lock);
> +					up_read(&inode->i_mmap_lock);
>   					continue;
>   				}
>
> -				btrfs_drop_extent_map_range(BTRFS_I(inode),
> -							    key.offset, end, true);
> -				unlock_extent(&BTRFS_I(inode)->io_tree,
> -					      key.offset, end, &cached_state);
> -				up_read(&BTRFS_I(inode)->i_mmap_lock);
> +				btrfs_drop_extent_map_range(inode, key.offset, end, true);
> +				unlock_extent(&inode->io_tree, key.offset, end,
> +					      &cached_state);
> +				up_read(&inode->i_mmap_lock);
>   			}
>   		}
>
> @@ -1185,7 +1129,7 @@ int replace_file_extents(struct btrfs_trans_handle *trans,
>   	if (dirty)
>   		btrfs_mark_buffer_dirty(trans, leaf);
>   	if (inode)
> -		btrfs_add_delayed_iput(BTRFS_I(inode));
> +		btrfs_add_delayed_iput(inode);
>   	return ret;
>   }
>
> @@ -1527,7 +1471,7 @@ static int invalidate_extent_cache(struct btrfs_root *root,
>   				   const struct btrfs_key *max_key)
>   {
>   	struct btrfs_fs_info *fs_info = root->fs_info;
> -	struct inode *inode = NULL;
> +	struct btrfs_inode *inode = NULL;
>   	u64 objectid;
>   	u64 start, end;
>   	u64 ino;
> @@ -1537,23 +1481,24 @@ static int invalidate_extent_cache(struct btrfs_root *root,
>   		struct extent_state *cached_state = NULL;
>
>   		cond_resched();
> -		iput(inode);
> +		if (inode)
> +			iput(&inode->vfs_inode);
>
>   		if (objectid > max_key->objectid)
>   			break;
>
> -		inode = find_next_inode(root, objectid);
> +		inode = btrfs_find_first_inode(root, objectid);
>   		if (!inode)
>   			break;
> -		ino = btrfs_ino(BTRFS_I(inode));
> +		ino = btrfs_ino(inode);
>
>   		if (ino > max_key->objectid) {
> -			iput(inode);
> +			iput(&inode->vfs_inode);
>   			break;
>   		}
>
>   		objectid = ino + 1;
> -		if (!S_ISREG(inode->i_mode))
> +		if (!S_ISREG(inode->vfs_inode.i_mode))
>   			continue;
>
>   		if (unlikely(min_key->objectid == ino)) {
> @@ -1586,9 +1531,9 @@ static int invalidate_extent_cache(struct btrfs_root *root,
>   		}
>
>   		/* the lock_extent waits for read_folio to complete */
> -		lock_extent(&BTRFS_I(inode)->io_tree, start, end, &cached_state);
> -		btrfs_drop_extent_map_range(BTRFS_I(inode), start, end, true);
> -		unlock_extent(&BTRFS_I(inode)->io_tree, start, end, &cached_state);
> +		lock_extent(&inode->io_tree, start, end, &cached_state);
> +		btrfs_drop_extent_map_range(inode, start, end, true);
> +		unlock_extent(&inode->io_tree, start, end, &cached_state);
>   	}
>   	return 0;
>   }

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 12/15] btrfs: use btrfs_find_first_inode() at btrfs_prune_dentries()
  2024-04-11 16:19   ` [PATCH v2 12/15] btrfs: use btrfs_find_first_inode() at btrfs_prune_dentries() fdmanana
@ 2024-04-11 23:15     ` Qu Wenruo
  0 siblings, 0 replies; 64+ messages in thread
From: Qu Wenruo @ 2024-04-11 23:15 UTC (permalink / raw)
  To: fdmanana, linux-btrfs



On 2024/4/12 01:49, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
>
> Currently btrfs_prune_dentries() open codes the search for the first inode in
> a root with a minimum inode number. Remove that code and make it use the
> helper btrfs_find_first_inode() for that task.
>
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

Reviewed-by: Qu Wenruo <wqu@suse.com>

Thanks,
Qu

> ---
>   fs/btrfs/inode.c | 66 ++++++++++--------------------------------------
>   1 file changed, 14 insertions(+), 52 deletions(-)
>
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 9dc41334c3a3..2dae4e975e80 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -4436,64 +4436,26 @@ static noinline int may_destroy_subvol(struct btrfs_root *root)
>   static void btrfs_prune_dentries(struct btrfs_root *root)
>   {
>   	struct btrfs_fs_info *fs_info = root->fs_info;
> -	struct rb_node *node;
> -	struct rb_node *prev;
> -	struct btrfs_inode *entry;
> -	struct inode *inode;
> -	u64 objectid = 0;
> +	struct btrfs_inode *inode;
> +	u64 min_ino = 0;
>
>   	if (!BTRFS_FS_ERROR(fs_info))
>   		WARN_ON(btrfs_root_refs(&root->root_item) != 0);
>
> -	spin_lock(&root->inode_lock);
> -again:
> -	node = root->inode_tree.rb_node;
> -	prev = NULL;
> -	while (node) {
> -		prev = node;
> -		entry = rb_entry(node, struct btrfs_inode, rb_node);
> -
> -		if (objectid < btrfs_ino(entry))
> -			node = node->rb_left;
> -		else if (objectid > btrfs_ino(entry))
> -			node = node->rb_right;
> -		else
> -			break;
> -	}
> -	if (!node) {
> -		while (prev) {
> -			entry = rb_entry(prev, struct btrfs_inode, rb_node);
> -			if (objectid <= btrfs_ino(entry)) {
> -				node = prev;
> -				break;
> -			}
> -			prev = rb_next(prev);
> -		}
> -	}
> -	while (node) {
> -		entry = rb_entry(node, struct btrfs_inode, rb_node);
> -		objectid = btrfs_ino(entry) + 1;
> -		inode = igrab(&entry->vfs_inode);
> -		if (inode) {
> -			spin_unlock(&root->inode_lock);
> -			if (atomic_read(&inode->i_count) > 1)
> -				d_prune_aliases(inode);
> -			/*
> -			 * btrfs_drop_inode will have it removed from the inode
> -			 * cache when its usage count hits zero.
> -			 */
> -			iput(inode);
> -			cond_resched();
> -			spin_lock(&root->inode_lock);
> -			goto again;
> -		}
> -
> -		if (cond_resched_lock(&root->inode_lock))
> -			goto again;
> +	inode = btrfs_find_first_inode(root, min_ino);
> +	while (inode) {
> +		if (atomic_read(&inode->vfs_inode.i_count) > 1)
> +			d_prune_aliases(&inode->vfs_inode);
>
> -		node = rb_next(node);
> +		min_ino = btrfs_ino(inode) + 1;
> +		/*
> +		 * btrfs_drop_inode() will have it removed from the inode
> +		 * cache when its usage count hits zero.
> +		 */
> +		iput(&inode->vfs_inode);
> +		cond_resched();
> +		inode = btrfs_find_first_inode(root, min_ino);
>   	}
> -	spin_unlock(&root->inode_lock);
>   }
>
>   int btrfs_delete_subvolume(struct btrfs_inode *dir, struct dentry *dentry)

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 08/15] btrfs: pass the extent map tree's inode to setup_extent_mapping()
  2024-04-11 16:19   ` [PATCH v2 08/15] btrfs: pass the extent map tree's inode to setup_extent_mapping() fdmanana
@ 2024-04-11 23:25     ` Qu Wenruo
  0 siblings, 0 replies; 64+ messages in thread
From: Qu Wenruo @ 2024-04-11 23:25 UTC (permalink / raw)
  To: fdmanana, linux-btrfs



On 2024/4/12 01:49, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> Extent maps are always associated with an inode's extent map tree, so
> there's no need to pass the extent map tree explicitly to
> setup_extent_mapping().
> 
> In order to facilitate an upcoming change that adds a shrinker for extent
> maps, change setup_extent_mapping() to receive the inode instead of its
> extent map tree.
> 
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>

Thanks,
Qu
> ---
>   fs/btrfs/extent_map.c | 10 +++++-----
>   1 file changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
> index 15817b842c24..2753bf2964cb 100644
> --- a/fs/btrfs/extent_map.c
> +++ b/fs/btrfs/extent_map.c
> @@ -342,7 +342,7 @@ void clear_em_logging(struct btrfs_inode *inode, struct extent_map *em)
>   		try_merge_map(tree, em);
>   }
>   
> -static inline void setup_extent_mapping(struct extent_map_tree *tree,
> +static inline void setup_extent_mapping(struct btrfs_inode *inode,
>   					struct extent_map *em,
>   					int modified)
>   {
> @@ -351,9 +351,9 @@ static inline void setup_extent_mapping(struct extent_map_tree *tree,
>   	ASSERT(list_empty(&em->list));
>   
>   	if (modified)
> -		list_add(&em->list, &tree->modified_extents);
> +		list_add(&em->list, &inode->extent_tree.modified_extents);
>   	else
> -		try_merge_map(tree, em);
> +		try_merge_map(&inode->extent_tree, em);
>   }
>   
>   /*
> @@ -381,7 +381,7 @@ static int add_extent_mapping(struct btrfs_inode *inode,
>   	if (ret)
>   		return ret;
>   
> -	setup_extent_mapping(tree, em, modified);
> +	setup_extent_mapping(inode, em, modified);
>   
>   	return 0;
>   }
> @@ -486,7 +486,7 @@ static void replace_extent_mapping(struct btrfs_inode *inode,
>   	rb_replace_node_cached(&cur->rb_node, &new->rb_node, &tree->map);
>   	RB_CLEAR_NODE(&cur->rb_node);
>   
> -	setup_extent_mapping(tree, new, modified);
> +	setup_extent_mapping(inode, new, modified);
>   }
>   
>   static struct extent_map *next_extent_map(const struct extent_map *em)

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 09/15] btrfs: pass the extent map tree's inode to try_merge_map()
  2024-04-11 16:19   ` [PATCH v2 09/15] btrfs: pass the extent map tree's inode to try_merge_map() fdmanana
@ 2024-04-11 23:25     ` Qu Wenruo
  0 siblings, 0 replies; 64+ messages in thread
From: Qu Wenruo @ 2024-04-11 23:25 UTC (permalink / raw)
  To: fdmanana, linux-btrfs



On 2024/4/12 01:49, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> Extent maps are always associated with an inode's extent map tree, so
> there's no need to pass the extent map tree explicitly to try_merge_map().
> 
> In order to facilitate an upcoming change that adds a shrinker for extent
> maps, change try_merge_map() to receive the inode instead of its extent
> map tree.
> 
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

Reviewed-by: Qu Wenruo <wqu@suse.com>

Thanks,
Qu
> ---
>   fs/btrfs/extent_map.c | 13 ++++++-------
>   1 file changed, 6 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
> index 2753bf2964cb..97a8e0484415 100644
> --- a/fs/btrfs/extent_map.c
> +++ b/fs/btrfs/extent_map.c
> @@ -223,8 +223,9 @@ static bool mergeable_maps(const struct extent_map *prev, const struct extent_ma
>   	return next->block_start == prev->block_start;
>   }
>   
> -static void try_merge_map(struct extent_map_tree *tree, struct extent_map *em)
> +static void try_merge_map(struct btrfs_inode *inode, struct extent_map *em)
>   {
> +	struct extent_map_tree *tree = &inode->extent_tree;
>   	struct extent_map *merge = NULL;
>   	struct rb_node *rb;
>   
> @@ -322,7 +323,7 @@ int unpin_extent_cache(struct btrfs_inode *inode, u64 start, u64 len, u64 gen)
>   	em->generation = gen;
>   	em->flags &= ~EXTENT_FLAG_PINNED;
>   
> -	try_merge_map(tree, em);
> +	try_merge_map(inode, em);
>   
>   out:
>   	write_unlock(&tree->lock);
> @@ -333,13 +334,11 @@ int unpin_extent_cache(struct btrfs_inode *inode, u64 start, u64 len, u64 gen)
>   
>   void clear_em_logging(struct btrfs_inode *inode, struct extent_map *em)
>   {
> -	struct extent_map_tree *tree = &inode->extent_tree;
> -
> -	lockdep_assert_held_write(&tree->lock);
> +	lockdep_assert_held_write(&inode->extent_tree.lock);
>   
>   	em->flags &= ~EXTENT_FLAG_LOGGING;
>   	if (extent_map_in_tree(em))
> -		try_merge_map(tree, em);
> +		try_merge_map(inode, em);
>   }
>   
>   static inline void setup_extent_mapping(struct btrfs_inode *inode,
> @@ -353,7 +352,7 @@ static inline void setup_extent_mapping(struct btrfs_inode *inode,
>   	if (modified)
>   		list_add(&em->list, &inode->extent_tree.modified_extents);
>   	else
> -		try_merge_map(&inode->extent_tree, em);
> +		try_merge_map(inode, em);
>   }
>   
>   /*

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 13/15] btrfs: add a shrinker for extent maps
  2024-04-11 16:19   ` [PATCH v2 13/15] btrfs: add a shrinker for extent maps fdmanana
@ 2024-04-12 20:06     ` Josef Bacik
  2024-04-13 11:07       ` Filipe Manana
  0 siblings, 1 reply; 64+ messages in thread
From: Josef Bacik @ 2024-04-12 20:06 UTC (permalink / raw)
  To: fdmanana; +Cc: linux-btrfs

On Thu, Apr 11, 2024 at 05:19:07PM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> Extent maps are used either to represent existing file extent items, or to
> represent new extents that are going to be written and the respective file
> extent items are created when the ordered extent completes.
> 
> We currently don't have any limit for how many extent maps we can have,
> neither per inode nor globally. Most of the time this is not too noticeable
> because extent maps are removed in the following situations:
> 
> 1) When evicting an inode;
> 
> 2) When releasing folios (pages) through the btrfs_release_folio() address
>    space operation callback.
> 
>    However we won't release extent maps in the folio range if the folio is
>    either dirty or under writeback, or if the inode's i_size is less than
>    or equal to 16M (see try_release_extent_mapping()). This 16M i_size
>    constraint was added back in 2008 with commit 70dec8079d78 ("Btrfs:
>    extent_io and extent_state optimizations"), but there's no explanation
>    about why we have it or why the 16M value.
> 
> This means that for buffered IO we can reach an OOM situation due to too
> many extent maps if either of the following happens:
> 
> 1) There's a set of tasks constantly doing IO on many files with a size
>    not larger than 16M, especially if they keep the files open for very
>    long periods, therefore preventing inode eviction.
> 
>    This requires a really high number of such files, and having many non
>    mergeable extent maps (due to random 4K writes for example) and a
>    machine with very little memory;
> 
> 2) There's a set of tasks constantly doing random write IO (therefore
>    creating many non mergeable extent maps) on files and keeping them
>    open for long periods of time, so inode eviction doesn't happen and
>    there's always a lot of dirty pages or pages under writeback,
>    preventing btrfs_release_folio() from releasing the respective extent
>    maps.
> 
> This second case was actually reported in the thread pointed to by the
> Link tag below, and it requires a very large file under heavy IO and a
> machine with very little RAM, which is probably hard to hit in practice
> in a real world use case.
> 
> However when using direct IO this is not so hard to trigger, because the
> page cache is not used, and therefore btrfs_release_folio() is never
> called. Which means extent maps are dropped only when evicting the inode,
> and that means that if we have tasks that keep a file descriptor open and
> keep doing IO on a very large file (or files), we can exhaust memory due
> to an unbounded amount of extent maps. This is especially easy to trigger
> if we have a huge file with millions of small extents and their extent
> maps are not mergeable (non contiguous offsets and disk locations).
> This was reported in that thread with the following fio test:
> 
>    $ cat test.sh
>    #!/bin/bash
> 
>    DEV=/dev/sdj
>    MNT=/mnt/sdj
>    MOUNT_OPTIONS="-o ssd"
>    MKFS_OPTIONS=""
> 
>    cat <<EOF > /tmp/fio-job.ini
>    [global]
>    name=fio-rand-write
>    filename=$MNT/fio-rand-write
>    rw=randwrite
>    bs=4K
>    direct=1
>    numjobs=16
>    fallocate=none
>    time_based
>    runtime=90000
> 
>    [file1]
>    size=300G
>    ioengine=libaio
>    iodepth=16
> 
>    EOF
> 
>    umount $MNT &> /dev/null
>    mkfs.btrfs -f $MKFS_OPTIONS $DEV
>    mount $MOUNT_OPTIONS $DEV $MNT
> 
>    fio /tmp/fio-job.ini
>    umount $MNT
> 
> Monitoring the btrfs_extent_map slab while running the test with:
> 
>    $ watch -d -n 1 'cat /sys/kernel/slab/btrfs_extent_map/objects \
>                         /sys/kernel/slab/btrfs_extent_map/total_objects'
> 
> Shows the number of active and total extent maps skyrocketing to tens of
> millions, and on systems with a small amount of memory it's easy and quick
> to get into an OOM situation, as reported in that thread.
> 
> So to avoid this issue add a shrinker that will remove extent maps, as
> long as they are not pinned, taking proper care with any concurrent
> fsync to avoid missing extents (setting the full sync flag while in the
> middle of a fast fsync). This shrinker is similar to the one ext4 uses
> for its extent_status structure, which is analogous to btrfs' extent_map
> structure.
> 
> Link: https://lore.kernel.org/linux-btrfs/13f94633dcf04d29aaf1f0a43d42c55e@amazon.com/
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

I don't like this for a few reasons

1. We're always starting with the first root and the first inode.  We're just
   going to constantly screw that first inode over and over again.
2. I really, really hate our inode rb-tree, I want to reduce its use, not add
   more users.  It would be nice if we could just utilize ->s_inode_lru instead
   for this, which would also give us the nice advantage of not having to think
   about order since it's already in LRU order.
3. We're registering our own shrinker without a proper LRU setup.  I think it
   would make sense if we wanted to have a LRU for our extent maps, but I think
   that's not a great idea.  We could get the same benefit by adding our own
   ->nr_cached_objects() and ->free_cached_objects(), I think that's a better
   approach no matter what other changes you make instead of registering our own
   shrinker.

I wholeheartedly agree with the concept; it just needs some tweaks to be
fairer and cleaner.  The rest of the code is fine, you can add

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

to the rest of it.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 13/15] btrfs: add a shrinker for extent maps
  2024-04-12 20:06     ` Josef Bacik
@ 2024-04-13 11:07       ` Filipe Manana
  2024-04-14 10:38         ` Filipe Manana
  2024-04-14 13:02         ` Josef Bacik
  0 siblings, 2 replies; 64+ messages in thread
From: Filipe Manana @ 2024-04-13 11:07 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs

On Fri, Apr 12, 2024 at 9:07 PM Josef Bacik <josef@toxicpanda.com> wrote:
>
> On Thu, Apr 11, 2024 at 05:19:07PM +0100, fdmanana@kernel.org wrote:
> > From: Filipe Manana <fdmanana@suse.com>
> >
> > [ ... changelog snipped, quoted in full earlier in the thread ... ]
>
> I don't like this for a few reasons
>
> 1. We're always starting with the first root and the first inode.  We're just
>    going to constantly screw that first inode over and over again.
> 2. I really, really hate our inode rb-tree; I want to reduce its use, not add
>    more users.  It would be nice if we could just utilize ->s_inode_lru instead
>    for this, which would also give us the nice advantage of not having to think
>    about order since it's already in LRU order.
> 3. We're registering our own shrinker without a proper LRU setup.  I think it
>    would make sense if we wanted to have a LRU for our extent maps, but I think
>    that's not a great idea.  We could get the same benefit by adding our own
>    ->nr_cached_objects() and ->free_cached_objects(), I think that's a better
>    approach no matter what other changes you make instead of registering our own
>    shrinker.

Ok, so some comments about all that.

Using sb->s_inode_lru, which is of type struct list_lru, is complicated.
I had considered that some time ago, and here are the problems with it:

1) List iteration, done with list_lru_walk() (or one of its variants),
is done while holding the lru list's spinlock.

This means we can't remove all extent maps in one go as that will
take too much time if we have millions of them for example, and make
other tasks spin for too long on the lock.

We will also need to take the inode's mmap semaphore which is a
blocking operation, so we can't do it while under the spinlock.

Sure, the lru list iteration's callback (the "isolate" argument of
type list_lru_walk_cb) can release that lru list spinlock, do the
extent map iteration and removal, then lock it again and return
LRU_RETRY.
But that will restart the search from the first element in the lru
list. This means we can be iterating over the same inodes over and
over.
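
To illustrate the pattern (a rough sketch, untested, assuming the
current list_lru_walk_cb signature, and btrfs_prune_extent_maps() is a
made-up placeholder for dropping an inode's unpinned extent maps):

    static enum lru_status btrfs_inode_lru_isolate(struct list_head *item,
                    struct list_lru_one *list, spinlock_t *lock, void *cb_arg)
    {
            struct inode *inode = container_of(item, struct inode, i_lru);

            /*
             * We can't take the inode's mmap semaphore while holding the
             * lru list's spinlock, so drop the lock, prune the extent
             * maps and ask for a retry...
             */
            spin_unlock(lock);
            btrfs_prune_extent_maps(BTRFS_I(inode)); /* made-up helper */
            spin_lock(lock);
            /* ...but LRU_RETRY restarts the walk from the list's head. */
            return LRU_RETRY;
    }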

2) You may hate the inode rb-tree, but we don't have another way to
iterate over the inodes that is efficient at searching for inodes
starting at a particular number.

3)  ->nr_cached_objects() and ->free_cached_objects() of the super
operations are a good alternative to registering our own shrinker.
I went with our own shrinker because it's what ext4 is doing and it's
simple. Changing to the super operations is fairly simple and I
embrace it.
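
Something along these lines (a sketch only, the counter and the helper
doing the actual work are made-up placeholders):

    static long btrfs_nr_cached_objects(struct super_block *sb,
                                        struct shrink_control *sc)
    {
            /* Placeholder for however we end up counting extent maps. */
            return percpu_counter_sum_positive(&btrfs_sb(sb)->extent_map_count);
    }

    static long btrfs_free_cached_objects(struct super_block *sb,
                                          struct shrink_control *sc)
    {
            /* Made-up helper that walks inodes and drops extent maps. */
            return btrfs_free_extent_maps(btrfs_sb(sb), sc->nr_to_scan);
    }

    static const struct super_operations btrfs_super_ops = {
            /* ... existing callbacks ... */
            .nr_cached_objects   = btrfs_nr_cached_objects,
            .free_cached_objects = btrfs_free_cached_objects,
    };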

4) That concern about LRU is not as relevant as you think.

Look at fs/super.c:super_cache_scan() (for when we define our
->nr_cached_objects() and ->free_cached_objects() callbacks).

Every time ->free_cached_objects() is called we do it for a number of
items we returned with ->nr_cached_objects().
That means we end up going over all inodes, so going in LRU order or any
other order ends up being irrelevant.

The same goes for the shrinker solution.

Sure you can argue that in the time between calling
->nr_cached_objects() and calling ->free_cached_objects() more extent
maps may have been created.
But that's not a big time window, not enough to add that many extent
maps to make any practical difference.
Plus when the shrinking is performed it's because we are under heavy
memory pressure and removing extent maps won't have that much impact
anyway, since page cache was freed amongst other things that have more
impact on performance than extent maps.

This is probably why ext4 is following the same approach for its
extent_status structures as I did here (well, I followed their way of
doing things).
Those structures are in a rb tree of the inode, just like we do with
our extent maps.
And the inodes are in a regular linked list (struct
ext4_sb_info::s_es_list) and added with list_add_tail to that list -
they're never moved within the list (no LRU).
When the scan callback of the shrinker is invoked, it always iterates
the inodes in that list order - not LRU or anything "fancy".


So I'm okay with moving to an implementation based on
->nr_cached_objects() and ->free_cached_objects(), that's
simple.
But iterating the inodes will have to be with the rb-tree, as we don't
have anything else that allows us to iterate and start at any given
number.
I can make it remember the last scanned inode (and root) so that
the next time the shrinking happens it will start after that last
scanned inode.

And adding our own lru implementation wouldn't make much difference
because of all that, and would also have 2 downsides:

1) We would need a struct list_head (16 bytes) plus a pointer to the
inode (so that we can remove an extent map from the rb tree) added to
the extent_map structure, so 24 bytes of overhead (see the sketch after
this list).

2) All the maintenance of the list, adding to it, removing from it,
moving elements, etc, all requires locking and overhead from lock
contention (using the struct list_lru API or our own).
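
To make the overhead in 1) concrete, on 64 bits it would be something
like this:

    struct extent_map {
            /* ... existing fields ... */
            struct list_head lru_link;   /* 16 bytes for the lru list */
            struct btrfs_inode *inode;   /* 8 bytes to get back to the rb tree */
    };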

So my proposal is to do this:

1) Keep using the rb tree to search for inodes - this is abstracted by
a helper function used in 2 other places.
2) Remember the last scanned inode/root, and on every scan start after
that inode (a rough sketch follows below).
3) Use ->nr_cached_objects() and ->free_cached_objects()
instead of registering a shrinker.
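
Roughly what I have in mind for 1) and 2), with all names made up and
ignoring locking and reference counting for brevity:

    static long btrfs_free_extent_maps(struct btrfs_fs_info *fs_info,
                                       long nr_to_scan)
    {
            long nr_dropped = 0;

            while (nr_dropped < nr_to_scan) {
                    struct btrfs_root *root;
                    struct btrfs_inode *inode;

                    /* Resume from the root where the last scan stopped. */
                    root = btrfs_get_root_from(fs_info,
                                            fs_info->em_shrinker_last_root);
                    if (!root)
                            break;  /* No more roots, done for this cycle. */

                    /* Rb-tree based search, starting after the last inode. */
                    inode = btrfs_find_first_inode(root,
                                         fs_info->em_shrinker_last_ino + 1);
                    if (!inode) {
                            /* Root exhausted, move to the next one. */
                            fs_info->em_shrinker_last_root++;
                            fs_info->em_shrinker_last_ino = 0;
                            continue;
                    }

                    nr_dropped += drop_unpinned_extent_maps(inode);
                    fs_info->em_shrinker_last_ino = btrfs_ino(inode);
            }
            return nr_dropped;
    }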

What do you think?
Thanks.

> [ ... quoted sign-off trimmed ... ]

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 13/15] btrfs: add a shrinker for extent maps
  2024-04-13 11:07       ` Filipe Manana
@ 2024-04-14 10:38         ` Filipe Manana
  2024-04-14 13:02         ` Josef Bacik
  1 sibling, 0 replies; 64+ messages in thread
From: Filipe Manana @ 2024-04-14 10:38 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs

On Sat, Apr 13, 2024 at 12:07 PM Filipe Manana <fdmanana@kernel.org> wrote:
>
> On Fri, Apr 12, 2024 at 9:07 PM Josef Bacik <josef@toxicpanda.com> wrote:
> >
> > On Thu, Apr 11, 2024 at 05:19:07PM +0100, fdmanana@kernel.org wrote:
> > > From: Filipe Manana <fdmanana@suse.com>
> > >
> > > [ ... changelog and review comments snipped, quoted in full above ... ]
>
> Ok, so some comments about all that.
>
> Using sb->s_inode_lru, which is of type struct list_lru, is complicated.
> I had considered that some time ago, and here are the problems with it:
>
> 1) List iteration, done with list_lru_walk() (or one of its variants),
> is done while holding the lru list's spinlock.
>
> This means we can't remove all extent maps in one go as that will
> take too much time if we have millions of them for example, and make
> other tasks spin for too long on the lock.
>
> We will also need to take the inode's mmap semaphore which is a
> blocking operation, so we can't do it while under the spinlock.
>
> Sure, the lru list iteration's callback (the "isolate" argument of
> type list_lru_walk_cb) can release that lru list spinlock, do the
> extent map iteration and removal, then lock it again and return
> LRU_RETRY.
> But that will restart the search from the first element in the lru
> list. This means we can be iterating over the same inodes over and
> over.
>
> 2) You may hate the inode rb-tree, but we don't have another way to
> iterate over the inodes that is efficient at searching for inodes
> starting at a particular number.
>
> 3)  ->nr_cached_objects() and ->free_cached_objects() of the super
> operations are a good alternative to registering our own shrinker.
> I went with our own shrinker because it's what ext4 is doing and it's
> simple. Changing to the super operations is fairly simple and I
> embrace it.
>
> 4) That concern about LRU is not as relevant as you think.
>
> Look at fs/super.c:super_cache_scan() (for when we define our
> ->nr_cached_objects() and ->free_cached_objects() callbacks).
>
> Every time ->free_cached_objects() is called we do it for a number of
> items we returned with ->nr_cached_objects().
> That means we end up going over all inodes, so going in LRU order or any
> other order ends up being irrelevant.
>
> The same goes for the shrinker solution.
>
> Sure you can argue that in the time between calling
> ->nr_cached_objects() and calling ->free_cached_objects() more extent
> maps may have been created.
> But that's not a big time window, not enough to add that many extent
> maps to make any practical difference.
> Plus when the shrinking is performed it's because we are under heavy
> memory pressure and removing extent maps won't have that much impact
> anyway, since page cache was freed amongst other things that have more
> impact on performance than extent maps.
>
> This is probably why ext4 is following the same approach for its
> extent_status structures as I did here (well, I followed their way of
> doing things).
> Those structures are in a rb tree of the inode, just like we do with
> our extent maps.
> And the inodes are in a regular linked list (struct
> ext4_sb_info::s_es_list) and added with list_add_tail to that list -
> they're never moved within the list (no LRU).
> When the scan callback of the shrinker is invoked, it always iterates
> the inodes in that list order - not LRU or anything "fancy".

Another example is xfs, which implements ->nr_cached_objects() and
->free_cached_objects().
From xfs_fs_free_cached_objects() we end up at xfs_icwalk_ag(), where it
walks inodes in the order they are in a radix tree, not in LRU order.
It does, however, keep track of the last index processed so that the
next time it is called, it starts from that index, but definitely not
in LRU order.

> [ ... rest of the quoted reply trimmed, quoted in full above ... ]

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 13/15] btrfs: add a shrinker for extent maps
  2024-04-13 11:07       ` Filipe Manana
  2024-04-14 10:38         ` Filipe Manana
@ 2024-04-14 13:02         ` Josef Bacik
  2024-04-15 11:24           ` Filipe Manana
  1 sibling, 1 reply; 64+ messages in thread
From: Josef Bacik @ 2024-04-14 13:02 UTC (permalink / raw)
  To: Filipe Manana; +Cc: linux-btrfs

On Sat, Apr 13, 2024 at 12:07:30PM +0100, Filipe Manana wrote:
> On Fri, Apr 12, 2024 at 9:07 PM Josef Bacik <josef@toxicpanda.com> wrote:
> >
> > On Thu, Apr 11, 2024 at 05:19:07PM +0100, fdmanana@kernel.org wrote:
> > > From: Filipe Manana <fdmanana@suse.com>
> > >
> > > [ ... changelog and review comments snipped, quoted in full above ... ]
> 
> Ok, so some comments about all that.
> 
> > Using sb->s_inode_lru, which is of type struct list_lru, is complicated.
> I had considered that some time ago, and here are the problems with it:
> 
> 1) List iteration, done with list_lru_walk() (or one of its variants),
> is done while holding the lru list's spinlock.
> 
> > This means we can't remove all extent maps in one go as that will
> take too much time if we have millions of them for example, and make
> other tasks spin for too long on the lock.
> 
> We will also need to take the inode's mmap semaphore which is a
> blocking operation, so we can't do it while under the spinlock.
> 
> Sure, the lru list iteration's callback (the "isolate" argument of
> type list_lru_walk_cb) can release that lru list spinlock, do the
> extent map iteration and removal, then lock it again and return
> LRU_RETRY.
> But that will restart the search from the first element in the lru
> list. This means we can be iterating over the same inodes over and
> over.

Agreed, but at least we'll be visiting each inode in LRU order.  If the
first inode hasn't changed position then it hasn't been accessed in a
while.

We absolutely would need to use the isolate mechanism.  I think we would
just add a counter to the extent_io_tree to see how many extent states
there are, and we skip anything that's already been drained.  Then we can
do our normal locking
thing when we walk through the isolate list.
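
Inside the isolate callback that could be something like this (the
counter in the io tree is hypothetical):

    /* Skip inodes whose io tree has already been drained. */
    if (atomic64_read(&BTRFS_I(inode)->io_tree.nr_extent_states) == 0)
            return LRU_SKIP;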

> 
> 2) You may hate the inode rb-tree, but we don't have another way to
> iterate over the inodes that is efficient at searching for inodes
> starting at a particular number.
> 
> [ ... rest of the reply snipped, quoted in full above ... ]
> 
> So my proposal is to do this:
> 
> 1) Keep using the rb tree to search for inodes - this is abstracted by
> a helper function used in 2 other places.
> 2) Remember the last scanned inode/root, and on every scan start after
> that inode.
> 3) Use ->nr_cached_objects() and ->free_cached_objects()
> instead of registering a shrinker.
>

I'm still looking for a way to delete our rb_tree in the future; it causes us
headaches with a lot of parallel file creates, but honestly other things are
much worse than the rb-tree right now.

This is fine with me; we're not getting rid of it anytime soon.  As long as we
have a way to remember where we were, that's good enough for now and in
keeping with all the other existing reclaim implementations.  Thanks,

Josef 

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 10/15] btrfs: add a global per cpu counter to track number of used extent maps
  2024-04-11 16:19   ` [PATCH v2 10/15] btrfs: add a global per cpu counter to track number of used extent maps fdmanana
@ 2024-04-15  2:47     ` kernel test robot
  0 siblings, 0 replies; 64+ messages in thread
From: kernel test robot @ 2024-04-15  2:47 UTC (permalink / raw)
  To: fdmanana; +Cc: oe-lkp, lkp, linux-btrfs, oliver.sang



Hello,

kernel test robot noticed "EIP:__percpu_counter_sum" on:

commit: b1c708ad024484a86a493ac9b1c94ba55ed9aec5 ("[PATCH v2 10/15] btrfs: add a global per cpu counter to track number of used extent maps")
url: https://github.com/intel-lab-lkp/linux/commits/fdmanana-kernel-org/btrfs-pass-an-inode-to-btrfs_add_extent_mapping/20240412-015132
base: https://git.kernel.org/cgit/linux/kernel/git/kdave/linux.git for-next
patch link: https://lore.kernel.org/all/c52817a5221e712a7b3cb3686496eed82d9e04ce.1712837044.git.fdmanana@suse.com/
patch subject: [PATCH v2 10/15] btrfs: add a global per cpu counter to track number of used extent maps

in testcase: boot

compiler: clang-17
test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G

(please refer to attached dmesg/kmsg for entire log/backtrace)


+---------------------------------------------------+------------+------------+
|                                                   | a4dd4472f7 | b1c708ad02 |
+---------------------------------------------------+------------+------------+
| INFO:trying_to_register_non-static_key            | 0          | 6          |
| BUG:unable_to_handle_page_fault_for_address       | 0          | 6          |
| Oops:#[##]                                        | 0          | 6          |
| EIP:__percpu_counter_sum                          | 0          | 6          |
| Kernel_panic-not_syncing:Fatal_exception          | 0          | 6          |
+---------------------------------------------------+------------+------------+


If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202404151034.cd36195c-oliver.sang@intel.com


[  201.010606][  T168] INFO: trying to register non-static key.
[  201.011499][  T168] The code is fine but needs lockdep annotation, or maybe
[  201.012269][  T168] you didn't initialize this object before use?
[  201.012951][  T168] turning off the locking correctness validator.
[  201.013663][  T168] CPU: 1 PID: 168 Comm: mount Tainted: G        W        N 6.9.0-rc3-00121-gb1c708ad0244 #1
[  201.014742][  T168] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
[  201.015907][  T168] Call Trace:
[ 201.016332][ T168] dump_stack_lvl (lib/dump_stack.c:116) 
[ 201.016878][ T168] dump_stack (lib/dump_stack.c:123) 
[ 201.017392][ T168] assign_lock_key (kernel/locking/lockdep.c:?) 
[ 201.017966][ T168] register_lock_class (kernel/locking/lockdep.c:?) 
[ 201.018604][ T168] __lock_acquire (kernel/locking/lockdep.c:5014) 
[ 201.019210][ T168] ? __slab_free (mm/slub.c:4099) 
[ 201.019799][ T168] ? kern_path (fs/namei.c:2625) 
[ 201.020345][ T168] ? kmem_cache_free (mm/slub.c:4233 mm/slub.c:4281 mm/slub.c:4344) 
[ 201.020984][ T168] ? kern_path (fs/namei.c:2625) 
[ 201.021555][ T168] ? kern_path (fs/namei.c:2625) 
[ 201.022098][ T168] ? kern_path (fs/namei.c:2625) 
[ 201.022623][ T168] ? lock_release (kernel/locking/lockdep.c:5244) 
[ 201.023186][ T168] lock_acquire (kernel/locking/lockdep.c:5754) 
[ 201.023827][ T168] ? __percpu_counter_sum (lib/percpu_counter.c:?) 
[ 201.024440][ T168] ? mutex_unlock (kernel/locking/mutex.c:549) 
[ 201.024984][ T168] _raw_spin_lock_irqsave (include/linux/spinlock_api_smp.h:110 kernel/locking/spinlock.c:162) 
[ 201.025594][ T168] ? __percpu_counter_sum (lib/percpu_counter.c:?) 
[ 201.026203][ T168] __percpu_counter_sum (lib/percpu_counter.c:?) 
[ 201.026800][ T168] ? debug_mutex_init (include/linux/lockdep.h:135 include/linux/lockdep.h:142 kernel/locking/mutex-debug.c:87) 
[ 201.027371][ T168] btrfs_free_fs_info (fs/btrfs/disk-io.c:1272) 
[ 201.027986][ T168] btrfs_free_fs_context (fs/btrfs/super.c:2103) 
[ 201.028585][ T168] put_fs_context (fs/fs_context.c:522) 
[ 201.029136][ T168] btrfs_get_tree (include/linux/err.h:61 fs/btrfs/super.c:2051 fs/btrfs/super.c:2085) 
[ 201.029698][ T168] ? btrfs_parse_param (include/linux/fs_parser.h:73 fs/btrfs/super.c:272) 
[ 201.030293][ T168] ? security_capable (security/security.c:1036) 
[ 201.030869][ T168] vfs_get_tree (fs/super.c:1780) 
[ 201.031399][ T168] do_new_mount (fs/namespace.c:3352) 
[ 201.031950][ T168] ? security_capable (security/security.c:1036) 
[ 201.032538][ T168] path_mount (fs/namespace.c:3679) 
[ 201.033065][ T168] __ia32_sys_mount (fs/namespace.c:3692 fs/namespace.c:3898 fs/namespace.c:3875 fs/namespace.c:3875) 
[ 201.033643][ T168] do_int80_syscall_32 (arch/x86/entry/common.c:?) 
[ 201.034492][ T168] ? syscall_exit_to_user_mode (kernel/entry/common.c:221) 
[ 201.036999][ T168] ? do_int80_syscall_32 (arch/x86/entry/common.c:278) 
[ 201.037611][ T168] ? do_int80_syscall_32 (arch/x86/entry/common.c:278) 
[ 201.038214][ T168] ? do_int80_syscall_32 (arch/x86/entry/common.c:278) 
[ 201.038814][ T168] ? do_int80_syscall_32 (arch/x86/entry/common.c:278) 
[ 201.039445][ T168] ? irqentry_exit (kernel/entry/common.c:367) 
[ 201.040004][ T168] ? exc_page_fault (arch/x86/mm/fault.c:1567) 
[ 201.040579][ T168] entry_INT80_32 (arch/x86/entry/entry_32.S:944) 
[  201.041121][  T168] EIP: 0xb7dedd4e
[ 201.041595][ T168] Code: 90 66 90 66 90 66 90 66 90 66 90 90 57 56 53 8b 7c 24 20 8b 74 24 1c 8b 54 24 18 8b 4c 24 14 8b 5c 24 10 b8 15 00 00 00 cd 80 <5b> 5e 5f 3d 01 f0 ff ff 0f 83 64 7a f3 ff c3 66 90 90 57 56 53 8b
All code
========
   0:	90                   	nop
   1:	66 90                	xchg   %ax,%ax
   3:	66 90                	xchg   %ax,%ax
   5:	66 90                	xchg   %ax,%ax
   7:	66 90                	xchg   %ax,%ax
   9:	66 90                	xchg   %ax,%ax
   b:	90                   	nop
   c:	57                   	push   %rdi
   d:	56                   	push   %rsi
   e:	53                   	push   %rbx
   f:	8b 7c 24 20          	mov    0x20(%rsp),%edi
  13:	8b 74 24 1c          	mov    0x1c(%rsp),%esi
  17:	8b 54 24 18          	mov    0x18(%rsp),%edx
  1b:	8b 4c 24 14          	mov    0x14(%rsp),%ecx
  1f:	8b 5c 24 10          	mov    0x10(%rsp),%ebx
  23:	b8 15 00 00 00       	mov    $0x15,%eax
  28:	cd 80                	int    $0x80
  2a:*	5b                   	pop    %rbx		<-- trapping instruction
  2b:	5e                   	pop    %rsi
  2c:	5f                   	pop    %rdi
  2d:	3d 01 f0 ff ff       	cmp    $0xfffff001,%eax
  32:	0f 83 64 7a f3 ff    	jae    0xfffffffffff37a9c
  38:	c3                   	ret
  39:	66 90                	xchg   %ax,%ax
  3b:	90                   	nop
  3c:	57                   	push   %rdi
  3d:	56                   	push   %rsi
  3e:	53                   	push   %rbx
  3f:	8b                   	.byte 0x8b

Code starting with the faulting instruction
===========================================
   0:	5b                   	pop    %rbx
   1:	5e                   	pop    %rsi
   2:	5f                   	pop    %rdi
   3:	3d 01 f0 ff ff       	cmp    $0xfffff001,%eax
   8:	0f 83 64 7a f3 ff    	jae    0xfffffffffff37a72
   e:	c3                   	ret
   f:	66 90                	xchg   %ax,%ax
  11:	90                   	nop
  12:	57                   	push   %rdi
  13:	56                   	push   %rsi
  14:	53                   	push   %rbx
  15:	8b                   	.byte 0x8b
[  201.043677][  T168] EAX: ffffffda EBX: 004f14ca ECX: 004f14df EDX: 0210a960
[  201.044439][  T168] ESI: 00008000 EDI: 00000000 EBP: b7fa3474 ESP: bfbe8ad0
[  201.045227][  T168] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00000292
[  201.046161][  T168] BUG: unable to handle page fault for address: 28a5a000
[  201.046968][  T168] #PF: supervisor read access in kernel mode
[  201.047675][  T168] #PF: error_code(0x0000) - not-present page
[  201.048368][  T168] *pdpt = 000000001a10a001 *pde = 0000000000000000
[  201.049105][  T168] Oops: 0000 [#1] SMP
[  201.049606][  T168] CPU: 1 PID: 168 Comm: mount Tainted: G        W        N 6.9.0-rc3-00121-gb1c708ad0244 #1
[  201.050803][  T168] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
[ 201.052100][ T168] EIP: __percpu_counter_sum (lib/percpu_counter.c:147) 
[ 201.052770][ T168] Code: 90 90 90 89 d0 d3 e8 d3 e0 25 ff 00 00 00 74 2e 89 45 e8 f3 0f bc 4d e8 83 f9 07 77 21 8b 45 f0 8b 40 34 8b 34 8d f4 b5 9c c3 <8b> 04 06 89 c6 c1 fe 1f 01 c7 11 f3 83 f9 07 8d 49 01 75 c5 8b 45
All code
========
   0:	90                   	nop
   1:	90                   	nop
   2:	90                   	nop
   3:	89 d0                	mov    %edx,%eax
   5:	d3 e8                	shr    %cl,%eax
   7:	d3 e0                	shl    %cl,%eax
   9:	25 ff 00 00 00       	and    $0xff,%eax
   e:	74 2e                	je     0x3e
  10:	89 45 e8             	mov    %eax,-0x18(%rbp)
  13:	f3 0f bc 4d e8       	tzcnt  -0x18(%rbp),%ecx
  18:	83 f9 07             	cmp    $0x7,%ecx
  1b:	77 21                	ja     0x3e
  1d:	8b 45 f0             	mov    -0x10(%rbp),%eax
  20:	8b 40 34             	mov    0x34(%rax),%eax
  23:	8b 34 8d f4 b5 9c c3 	mov    -0x3c634a0c(,%rcx,4),%esi
  2a:*	8b 04 06             	mov    (%rsi,%rax,1),%eax		<-- trapping instruction
  2d:	89 c6                	mov    %eax,%esi
  2f:	c1 fe 1f             	sar    $0x1f,%esi
  32:	01 c7                	add    %eax,%edi
  34:	11 f3                	adc    %esi,%ebx
  36:	83 f9 07             	cmp    $0x7,%ecx
  39:	8d 49 01             	lea    0x1(%rcx),%ecx
  3c:	75 c5                	jne    0x3
  3e:	8b                   	.byte 0x8b
  3f:	45                   	rex.RB

Code starting with the faulting instruction
===========================================
   0:	8b 04 06             	mov    (%rsi,%rax,1),%eax
   3:	89 c6                	mov    %eax,%esi
   5:	c1 fe 1f             	sar    $0x1f,%esi
   8:	01 c7                	add    %eax,%edi
   a:	11 f3                	adc    %esi,%ebx
   c:	83 f9 07             	cmp    $0x7,%ecx
   f:	8d 49 01             	lea    0x1(%rcx),%ecx
  12:	75 c5                	jne    0xffffffffffffffd9
  14:	8b                   	.byte 0x8b
  15:	45                   	rex.RB
[  201.054977][  T168] EAX: 00000000 EBX: 00000000 ECX: 00000000 EDX: 00000003
[  201.058373][  T168] ESI: 28a5a000 EDI: 00000000 EBP: ee833d88 ESP: ee833d70
[  201.059245][  T168] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 EFLAGS: 00010097
[  201.060208][  T168] CR0: 80050033 CR2: 28a5a000 CR3: 0502cfa0 CR4: 000406b0
[  201.061046][  T168] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[  201.061893][  T168] DR6: fffe0ff0 DR7: 00000400
[  201.062503][  T168] Call Trace:
[ 201.062978][ T168] ? __die_body (arch/x86/kernel/dumpstack.c:478 arch/x86/kernel/dumpstack.c:420) 
[ 201.067652][ T168] ? __die (arch/x86/kernel/dumpstack.c:434) 
[ 201.068186][ T168] ? page_fault_oops (arch/x86/mm/fault.c:709) 
[ 201.068788][ T168] ? __slab_free (mm/slub.c:4099) 
[ 201.069398][ T168] ? kernelmode_fixup_or_oops (arch/x86/mm/fault.c:767) 
[ 201.070107][ T168] ? __bad_area_nosemaphore (arch/x86/mm/fault.c:814) 
[ 201.070815][ T168] ? kern_path (fs/namei.c:2625) 
[ 201.071402][ T168] ? bad_area_nosemaphore (arch/x86/mm/fault.c:863) 
[ 201.072062][ T168] ? do_user_addr_fault (arch/x86/mm/fault.c:?) 
[ 201.072729][ T168] ? exc_page_fault (arch/x86/include/asm/irqflags.h:19 arch/x86/include/asm/irqflags.h:67 arch/x86/include/asm/irqflags.h:127 arch/x86/mm/fault.c:1513 arch/x86/mm/fault.c:1563) 
[ 201.073359][ T168] ? pvclock_clocksource_read_nowd (arch/x86/mm/fault.c:1518) 
[ 201.074135][ T168] ? handle_exception (arch/x86/entry/entry_32.S:1054) 
[ 201.074815][ T168] ? atomic_dec_and_mutex_lock (kernel/locking/mutex.c:548 kernel/locking/mutex.c:1153) 
[ 201.075553][ T168] ? rwsem_down_write_slowpath (include/trace/events/lock.h:95) 
[ 201.076301][ T168] ? pvclock_clocksource_read_nowd (arch/x86/mm/fault.c:1518) 
[ 201.077099][ T168] ? __percpu_counter_sum (lib/percpu_counter.c:147) 
[ 201.077771][ T168] ? rwsem_down_write_slowpath (include/trace/events/lock.h:95) 
[ 201.078512][ T168] ? pvclock_clocksource_read_nowd (arch/x86/mm/fault.c:1518) 
[ 201.079292][ T168] ? __percpu_counter_sum (lib/percpu_counter.c:147) 
[ 201.079968][ T168] btrfs_free_fs_info (fs/btrfs/disk-io.c:1272) 
[ 201.080615][ T168] btrfs_free_fs_context (fs/btrfs/super.c:2103) 
[ 201.081282][ T168] put_fs_context (fs/fs_context.c:522) 
[ 201.081910][ T168] btrfs_get_tree (include/linux/err.h:61 fs/btrfs/super.c:2051 fs/btrfs/super.c:2085) 
[ 201.082549][ T168] ? btrfs_parse_param (include/linux/fs_parser.h:73 fs/btrfs/super.c:272) 
[ 201.083205][ T168] ? security_capable (security/security.c:1036) 
[ 201.083857][ T168] vfs_get_tree (fs/super.c:1780) 
[ 201.084444][ T168] do_new_mount (fs/namespace.c:3352) 
[ 201.085054][ T168] ? security_capable (security/security.c:1036) 
[ 201.085690][ T168] path_mount (fs/namespace.c:3679) 
[ 201.086281][ T168] __ia32_sys_mount (fs/namespace.c:3692 fs/namespace.c:3898 fs/namespace.c:3875 fs/namespace.c:3875) 
[ 201.089463][ T168] do_int80_syscall_32 (arch/x86/entry/common.c:?) 
[ 201.090154][ T168] ? syscall_exit_to_user_mode (kernel/entry/common.c:221) 
[ 201.090895][ T168] ? do_int80_syscall_32 (arch/x86/entry/common.c:278) 
[ 201.091578][ T168] ? do_int80_syscall_32 (arch/x86/entry/common.c:278) 
[ 201.092250][ T168] ? do_int80_syscall_32 (arch/x86/entry/common.c:278) 
[ 201.092939][ T168] ? do_int80_syscall_32 (arch/x86/entry/common.c:278) 
[ 201.093621][ T168] ? irqentry_exit (kernel/entry/common.c:367) 
[ 201.094251][ T168] ? exc_page_fault (arch/x86/mm/fault.c:1567) 
[ 201.094913][ T168] entry_INT80_32 (arch/x86/entry/entry_32.S:944) 
[  201.095560][  T168] EIP: 0xb7dedd4e
[ 201.096066][ T168] Code: 90 66 90 66 90 66 90 66 90 66 90 90 57 56 53 8b 7c 24 20 8b 74 24 1c 8b 54 24 18 8b 4c 24 14 8b 5c 24 10 b8 15 00 00 00 cd 80 <5b> 5e 5f 3d 01 f0 ff ff 0f 83 64 7a f3 ff c3 66 90 90 57 56 53 8b
All code
========
   0:	90                   	nop
   1:	66 90                	xchg   %ax,%ax
   3:	66 90                	xchg   %ax,%ax
   5:	66 90                	xchg   %ax,%ax
   7:	66 90                	xchg   %ax,%ax
   9:	66 90                	xchg   %ax,%ax
   b:	90                   	nop
   c:	57                   	push   %rdi
   d:	56                   	push   %rsi
   e:	53                   	push   %rbx
   f:	8b 7c 24 20          	mov    0x20(%rsp),%edi
  13:	8b 74 24 1c          	mov    0x1c(%rsp),%esi
  17:	8b 54 24 18          	mov    0x18(%rsp),%edx
  1b:	8b 4c 24 14          	mov    0x14(%rsp),%ecx
  1f:	8b 5c 24 10          	mov    0x10(%rsp),%ebx
  23:	b8 15 00 00 00       	mov    $0x15,%eax
  28:	cd 80                	int    $0x80
  2a:*	5b                   	pop    %rbx		<-- trapping instruction
  2b:	5e                   	pop    %rsi
  2c:	5f                   	pop    %rdi
  2d:	3d 01 f0 ff ff       	cmp    $0xfffff001,%eax
  32:	0f 83 64 7a f3 ff    	jae    0xfffffffffff37a9c
  38:	c3                   	ret
  39:	66 90                	xchg   %ax,%ax
  3b:	90                   	nop
  3c:	57                   	push   %rdi
  3d:	56                   	push   %rsi
  3e:	53                   	push   %rbx
  3f:	8b                   	.byte 0x8b

Code starting with the faulting instruction
===========================================
   0:	5b                   	pop    %rbx
   1:	5e                   	pop    %rsi
   2:	5f                   	pop    %rdi
   3:	3d 01 f0 ff ff       	cmp    $0xfffff001,%eax
   8:	0f 83 64 7a f3 ff    	jae    0xfffffffffff37a72
   e:	c3                   	ret
   f:	66 90                	xchg   %ax,%ax
  11:	90                   	nop
  12:	57                   	push   %rdi
  13:	56                   	push   %rsi
  14:	53                   	push   %rbx
  15:	8b                   	.byte 0x8b


The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20240415/202404151034.cd36195c-oliver.sang@intel.com



-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 13/15] btrfs: add a shrinker for extent maps
  2024-04-14 13:02         ` Josef Bacik
@ 2024-04-15 11:24           ` Filipe Manana
  0 siblings, 0 replies; 64+ messages in thread
From: Filipe Manana @ 2024-04-15 11:24 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs

On Sun, Apr 14, 2024 at 2:02 PM Josef Bacik <josef@toxicpanda.com> wrote:
>
> On Sat, Apr 13, 2024 at 12:07:30PM +0100, Filipe Manana wrote:
> > On Fri, Apr 12, 2024 at 9:07 PM Josef Bacik <josef@toxicpanda.com> wrote:
> > >
> > > On Thu, Apr 11, 2024 at 05:19:07PM +0100, fdmanana@kernel.org wrote:
> > > > From: Filipe Manana <fdmanana@suse.com>
> > > >
> > > > Extent maps are used either to represent existing file extent items, or to
> > > > represent new extents that are going to be written and the respective file
> > > > extent items are created when the ordered extent completes.
> > > >
> > > > We currently don't have any limit for how many extent maps we can have,
> > > > neither per inode nor globally. Most of the time this not too noticeable
> > > > because extent maps are removed in the following situations:
> > > >
> > > > 1) When evicting an inode;
> > > >
> > > > 2) When releasing folios (pages) through the btrfs_release_folio() address
> > > >    space operation callback.
> > > >
> > > >    However we won't release extent maps in the folio range if the folio is
> > > >    either dirty or under writeback or if the inode's i_size is less than
> > > >    or equals to 16M (see try_release_extent_mapping(). This 16M i_size
> > > >    constraint was added back in 2008 with commit 70dec8079d78 ("Btrfs:
> > > >    extent_io and extent_state optimizations"), but there's no explanation
> > > >    about why we have it or why the 16M value.
> > > >
> > > > This means that for buffered IO we can reach an OOM situation due to too
> > > > many extent maps if either of the following happens:
> > > >
> > > > 1) There's a set of tasks constantly doing IO on many files with a size
> > > >    not larger than 16M, specially if they keep the files open for very
> > > >    long periods, therefore preventing inode eviction.
> > > >
> > > >    This requires a really high number of such files, and having many non
> > > >    mergeable extent maps (due to random 4K writes for example) and a
> > > >    machine with very little memory;
> > > >
> > > > 2) There's a set tasks constantly doing random write IO (therefore
> > > >    creating many non mergeable extent maps) on files and keeping them
> > > >    open for long periods of time, so inode eviction doesn't happen and
> > > >    there's always a lot of dirty pages or pages under writeback,
> > > >    preventing btrfs_release_folio() from releasing the respective extent
> > > >    maps.
> > > >
> > > > This second case was actually reported in the thread pointed by the Link
> > > > tag below, and it requires a very large file under heavy IO and a machine
> > > > with very little amount of RAM, which is probably hard to happen in
> > > > practice in a real world use case.
> > > >
> > > > However when using direct IO this is not so hard to happen, because the
> > > > page cache is not used, and therefore btrfs_release_folio() is never
> > > > called. Which means extent maps are dropped only when evicting the inode,
> > > > and that means that if we have tasks that keep a file descriptor open and
> > > > keep doing IO on a very large file (or files), we can exhaust memory due
> > > > to an unbounded amount of extent maps. This is especially easy to happen
> > > > if we have a huge file with millions of small extents and their extent
> > > > maps are not mergeable (non contiguous offsets and disk locations).
> > > > This was reported in that thread with the following fio test:
> > > >
> > > >    $ cat test.sh
> > > >    #!/bin/bash
> > > >
> > > >    DEV=/dev/sdj
> > > >    MNT=/mnt/sdj
> > > >    MOUNT_OPTIONS="-o ssd"
> > > >    MKFS_OPTIONS=""
> > > >
> > > >    cat <<EOF > /tmp/fio-job.ini
> > > >    [global]
> > > >    name=fio-rand-write
> > > >    filename=$MNT/fio-rand-write
> > > >    rw=randwrite
> > > >    bs=4K
> > > >    direct=1
> > > >    numjobs=16
> > > >    fallocate=none
> > > >    time_based
> > > >    runtime=90000
> > > >
> > > >    [file1]
> > > >    size=300G
> > > >    ioengine=libaio
> > > >    iodepth=16
> > > >
> > > >    EOF
> > > >
> > > >    umount $MNT &> /dev/null
> > > >    mkfs.btrfs -f $MKFS_OPTIONS $DEV
> > > >    mount $MOUNT_OPTIONS $DEV $MNT
> > > >
> > > >    fio /tmp/fio-job.ini
> > > >    umount $MNT
> > > >
> > > > Monitoring the btrfs_extent_map slab while running the test with:
> > > >
> > > >    $ watch -d -n 1 'cat /sys/kernel/slab/btrfs_extent_map/objects \
> > > >                         /sys/kernel/slab/btrfs_extent_map/total_objects'
> > > >
> > > > Shows the number of active and total extent maps skyrocketing to tens of
> > > > millions, and on systems with a short amount of memory it's easy and quick
> > > > to get into an OOM situation, as reported in that thread.
> > > >
> > > > So to avoid this issue add a shrinker that will remove extents maps, as
> > > > long as they are not pinned, and takes proper care with any concurrent
> > > > fsync to avoid missing extents (setting the full sync flag while in the
> > > > middle of a fast fsync). This shrinker is similar to the one ext4 uses
> > > > for its extent_status structure, which is analogous to btrfs' extent_map
> > > > structure.
> > > >
> > > > Link: https://lore.kernel.org/linux-btrfs/13f94633dcf04d29aaf1f0a43d42c55e@amazon.com/
> > > > Signed-off-by: Filipe Manana <fdmanana@suse.com>
> > >
> > > I don't like this for a few reasons
> > >
> > > 1. We're always starting with the first root and the first inode.  We're just
> > >    going to constantly screw that first inode over and over again.
> > > 2. I really, really hate our inode rb-tree, I want to reduce it's use, not add
> > >    more users.  It would be nice if we could just utilize ->s_inodes_lru instead
> > >    for this, which would also give us the nice advantage of not having to think
> > >    about order since it's already in LRU order.
> > > 3. We're registering our own shrinker without a proper LRU setup.  I think it
> > >    would make sense if we wanted to have a LRU for our extent maps, but I think
> > >    that's not a great idea.  We could get the same benefit by adding our own
> > >    ->nr_cached_objects() and ->free_cached_objects(); I think that's a better
> > >    approach no matter what other changes you make instead of registering our own
> > >    shrinker.
> >
> > Ok, so some comments about all that.
> >
> > Using sb->s_inode_lru, which has the struct list_lru type, is complicated.
> > I had considered that some time ago, and here are the problems with it:
> >
> > 1) List iteration, done with list_lru_walk() (or one of its variants),
> > is done while holding the lru list's spinlock.
> >
> > This means we can't remove all extent maps in one go, as that would
> > take too much time if we have millions of them for example, and would
> > make other tasks spin for too long on the lock.
> >
> > We will also need to take the inode's mmap semaphore, which is a
> > blocking operation, so we can't do it while under the spinlock.
> >
> > Sure, the lru list iteration's callback (the "isolate" argument of
> > type list_lru_walk_cb) can release that lru list spinlock, do the
> > extent map iteration and removal, then lock it again and return
> > LRU_RETRY.
> > But that will restart the search from the first element in the lru
> > list. This means we can be iterating over the same inodes over and
> > over.
>
> Agreed, but at least we'll be visiting each inode in LRU order.  If the
> first inode hasn't changed position then it hasn't been accessed in a
> while.
>
> We absolutely would need to use the isolate mechanism; I think we would just add
> a counter to the extent_io_tree to see how many extent states there are, and we
> skip anything that's already been drained.

Looking at the extent io tree is irrelevant for many scenarios, and
LRU is often a very poor choice.

For example, consider an application that keeps a file descriptor open
for a very long period, like a database server (I've worked in the
past on one that did that).
Suppose it's using direct IO, does some random reads here and there,
and then caches the data in its own memory (not the page cache).
The extent io tree will be empty, yet we'll keep around the extent maps
for those ranges, which the application will not use for a very long
period, maybe hours or more.

Despite being the most recently used inode, it's the one with the most
useless extent maps.

The same goes for writes: the application will cache the data in
memory, so having the extent map around is useless for it.
That holds even if it overwrites those same ranges in the near future,
since for COW writes we create a new extent map and drop the old one,
and for NOCOW writes we always check file extent items in the subvolume
tree, not using any extent maps in the range. That is, the extent maps
aren't needed for the writes.

And same as before, the inode will be the most recently used.

What ext4 and xfs are doing is just fine.
There's no need to complicate something that is only triggered when the
system is already under memory pressure and everything is already slow
because of that pressure.
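
For context, the code in fs/super.c that drives both the generic LRU
lists and the ->nr_cached_objects()/->free_cached_objects() callbacks
is roughly the following (a simplified sketch written from memory, not
verbatim kernel code, with locking details omitted):

   static unsigned long super_cache_scan(struct shrinker *shrink,
                                         struct shrink_control *sc)
   {
           struct super_block *sb = shrink->private_data;
           long fs_objects = 0, total_objects, dentries, inodes;
           unsigned long freed;

           if (sb->s_op->nr_cached_objects)
                   fs_objects = sb->s_op->nr_cached_objects(sb, sc);

           inodes = list_lru_shrink_count(&sb->s_inode_lru, sc);
           dentries = list_lru_shrink_count(&sb->s_dentry_lru, sc);
           total_objects = dentries + inodes + fs_objects + 1;

           /* Proportion the scan between the caches. */
           dentries = mult_frac(sc->nr_to_scan, dentries, total_objects);
           inodes = mult_frac(sc->nr_to_scan, inodes, total_objects);
           fs_objects = mult_frac(sc->nr_to_scan, fs_objects, total_objects);

           /* Prune the dcache and icache first, then the fs objects. */
           sc->nr_to_scan = dentries + 1;
           freed = prune_dcache_sb(sb, sc);
           sc->nr_to_scan = inodes + 1;
           freed += prune_icache_sb(sb, sc);

           if (fs_objects) {
                   sc->nr_to_scan = fs_objects + 1;
                   freed += sb->s_op->free_cached_objects(sb, sc);
           }

           return freed;
   }

The scan is proportioned between the dcache, the icache and the fs
specific objects, so under sustained memory pressure the filesystem
ends up being asked to scan everything anyway, no matter what internal
order it uses.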

> Then we can do our normal locking
> thing when we walk through the isolate list.
>
> >
> > 2) You may hate the inode rb-tree, but we don't have another way to
> > iterate over the inodes and efficiently search for inodes starting
> > at a particular number.
> >
> > 3)  ->nr_cached_objects() and ->free_cached_objects() of the super
> > operations are a good alternative to registering our own shrinker.
> > I went with our own shrinker because it's what ext4 is doing and it's
> > simple. Changing to the super operations is fairly simple and I
> > embrace it.
> >
> > 4) That concern about LRU is not really that relevant as you think.
> >
> > Look at fs/super.c:super_cache_scan() (for when we define our
> > ->nr_cached_objects() and ->free_cached_objects() callbacks).
> >
> > Every time ->free_cached_objects() is called, we do it for the number of
> > items we returned with ->nr_cached_objects().
> > That means we go over all inodes, and so going in LRU order or any
> > other order ends up being irrelevant.
> >
> > The same goes for the shrinker solution.
> >
> > Sure, you can argue that in the time between calling
> > ->nr_cached_objects() and calling ->free_cached_objects() more extent
> > maps may have been created.
> > But that's not a big time window, not enough to add that many extent
> > maps to make any practical difference.
> > Plus when the shrinking is performed it's because we are under heavy
> > memory pressure and removing extent maps won't have that much impact
> > anyway, since page cache was freed amongst other things that have more
> > impact on performance than extent maps.
> >
> > This is probably why ext4 is following the same approach for its
> > extent_status structures as I did here (well, I followed their way of
> > doing things).
> > Those structures are in a rb tree of the inode, just like we do with
> > our extent maps.
> > And the inodes are in a regular linked list (struct
> > ext4_sb_info::s_es_list) and added with list_add_tail to that list -
> > they're never moved within the list (no LRU).
> > When the scan callback of the shrinker is invoked, it always iterates
> > the inodes in that list order - not LRU or anything "fancy".
> >
> >
> > So I'm okay with moving to an implementation based on
> > ->nr_cached_objects() and ->free_cached_objects(), as that's
> > simple.
> > But iterating the inodes will have to be with the rb-tree, as we don't
> > have anything else that allows us to iterate and start at any given
> > number.
> > I can make it remember the last scanned inode (and root) so that
> > the next time the shrinking happens it will start after that last
> > scanned inode.
> >
> > And adding our own lru implementation wouldn't make much difference
> > because of all that, and would also have 2 downsides to it:
> >
> > 1) We would need a struct list_head (16 bytes) plus a pointer to the
> > inode (so that we can remove an extent map from the rb tree) added to
> > the extent_map structure, so 24 bytes of overhead.
> >
> > 2) All the maintenance of the list, adding to it, removing from it,
> > moving elements, etc, all requires locking and overhead from lock
> > contention (using the struct list_lru API or our own).
> >
> > So my proposal is to do this:
> >
> > 1) Keep using the rb tree to search for inodes - this is abstracted by
> > a helper function used in 2 other places.
> > 2) Remember the last scanned inode/root, and on every scan start after
> > that inode.
> > 3) Use ->nr_cached_objects() and ->free_cached_objects()
> > instead of registering a shrinker.
> >
>
> I'm still looking for a way to delete our rb_tree in the future, as it causes us
> headaches with a lot of parallel file creates, but honestly other things are
> much worse than the rbtree right now.

I get it, I thought some time ago about replacing it with an xarray or
maple tree.
The rb tree gets really deep after 100k inodes, plus there's all the
cache line pulling every time we navigate the tree for a search or
insertion.

On the other hand it's embedded in the inode, and unlike an xarray or
maple tree, it doesn't require extra memory allocations for insertions
or dealing with allocation failures.
I will give it a try sometime after finishing the extent map shrinker.
After those preparatory patches it's all hidden in a helper function,
making the switch to some other data structure easy.
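
Just to make the tradeoff concrete, a conversion could look roughly
like the following (a hypothetical sketch, the inodes_xa field and the
helper names are invented, this is not actual btrfs code):

   /* Hypothetical sketch, not actual btrfs code. */
   static int btrfs_track_inode(struct btrfs_root *root,
                                struct btrfs_inode *inode)
   {
           /*
            * Unlike the rb_node embedded in struct btrfs_inode,
            * xa_insert() may need to allocate internal nodes, so it can
            * fail with -ENOMEM and every caller would have to handle
            * that.
            */
           return xa_insert(&root->inodes_xa, btrfs_ino(inode), inode,
                            GFP_NOFS);
   }

   static struct btrfs_inode *btrfs_next_tracked_inode(struct btrfs_root *root,
                                                       u64 min_ino)
   {
           unsigned long index = min_ino;

           /* Find the first present entry with an index >= min_ino. */
           return xa_find(&root->inodes_xa, &index, ULONG_MAX, XA_PRESENT);
   }

The lookup side maps naturally to what the shrinker needs (resume from
a given inode number); it's mostly the allocation failure handling on
insertion that makes the switch not completely trivial.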

Thanks.

>
> This is fine with me, we're not getting rid of it anytime soon.  As long as we
> have a way to remember where we were, that's good enough for now and in
> keeping with all other existing reclaim implementations.  Thanks,
>
> Josef

^ permalink raw reply	[flat|nested] 64+ messages in thread

* [PATCH v3 00/10] btrfs: add a shrinker for extent maps
  2024-04-10 11:28 [PATCH 00/11] btrfs: add a shrinker for extent maps fdmanana
                   ` (12 preceding siblings ...)
  2024-04-11 16:18 ` [PATCH v2 00/15] " fdmanana
@ 2024-04-16 13:08 ` fdmanana
  2024-04-16 13:08   ` [PATCH v3 01/10] btrfs: pass the extent map tree's inode to add_extent_mapping() fdmanana
                     ` (9 more replies)
  13 siblings, 10 replies; 64+ messages in thread
From: fdmanana @ 2024-04-16 13:08 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

Currently we don't limit the amount of extent maps we can have for inodes
from a subvolume tree, which can result in excessive use of memory and in
some cases in running into OOM situations. This was reported some time ago
by a user and it's especially easy to trigger with direct IO.

The shrinker itself is patch 08/10; what comes before is simple preparatory
work and the rest is just trace events. More details in the change logs.

V3: Removed some preparatory patches that are already in for-next.

    Updated patch 07/10 to avoid a lockdep warning due to an attempt to read
    the percpu counter when freeing fs_info, if during the open_ctree()
    path we had an error before initializing the counter. Reported by the
    Intel test robot.

    Updated the shrinker patch (08/10) to use the nr_cached_objects and
    free_cached_objects callbacks of struct super_operations.
    Made the shrinker remember the last inode and root it processed, so
    that it starts from there the next time it runs.
    Also avoided a deadlock when trying to lock the mmap lock, by using a
    down_read_trylock() and adding a comment about it.
    Avoided setting the inode full sync flag if an extent in the list of
    modified extents is from an older generation, whose transaction was
    already committed and therefore the file extent item was persisted.

V2: Split patch 09/11 into 3.
    Added two patches to export and use a helper to find an inode in a root.
    Updated patch 13/15 to use the helper for finding the next inode, and
    removed the #ifdef for the 32 bits case, which is irrelevant as on 32
    bits systems we can't ever have more than ULONG_MAX extent maps allocated.

Filipe Manana (10):
  btrfs: pass the extent map tree's inode to add_extent_mapping()
  btrfs: pass the extent map tree's inode to clear_em_logging()
  btrfs: pass the extent map tree's inode to remove_extent_mapping()
  btrfs: pass the extent map tree's inode to replace_extent_mapping()
  btrfs: pass the extent map tree's inode to setup_extent_mapping()
  btrfs: pass the extent map tree's inode to try_merge_map()
  btrfs: add a global per cpu counter to track number of used extent maps
  btrfs: add a shrinker for extent maps
  btrfs: update comment for btrfs_set_inode_full_sync() about locking
  btrfs: add tracepoints for extent map shrinker events

 fs/btrfs/btrfs_inode.h            |   8 +-
 fs/btrfs/disk-io.c                |   9 +
 fs/btrfs/extent_io.c              |   2 +-
 fs/btrfs/extent_map.c             | 277 +++++++++++++++++++++++++-----
 fs/btrfs/extent_map.h             |   5 +-
 fs/btrfs/fs.h                     |   4 +
 fs/btrfs/super.c                  |  20 +++
 fs/btrfs/tests/extent-map-tests.c |  19 +-
 fs/btrfs/tree-log.c               |   4 +-
 include/trace/events/btrfs.h      |  99 +++++++++++
 10 files changed, 388 insertions(+), 59 deletions(-)

-- 
2.43.0


^ permalink raw reply	[flat|nested] 64+ messages in thread

* [PATCH v3 01/10] btrfs: pass the extent map tree's inode to add_extent_mapping()
  2024-04-16 13:08 ` [PATCH v3 00/10] btrfs: add a shrinker for extent maps fdmanana
@ 2024-04-16 13:08   ` fdmanana
  2024-04-17 11:08     ` Johannes Thumshirn
  2024-04-16 13:08   ` [PATCH v3 02/10] btrfs: pass the extent map tree's inode to clear_em_logging() fdmanana
                     ` (8 subsequent siblings)
  9 siblings, 1 reply; 64+ messages in thread
From: fdmanana @ 2024-04-16 13:08 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

Extent maps are always added to an inode's extent map tree, so there's no
need to pass the extent map tree explicitly to add_extent_mapping().

In order to facilitate an upcoming change that adds a shrinker for extent
maps, change add_extent_mapping() to receive the inode instead of its
extent map tree.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/extent_map.c | 34 ++++++++++++++++------------------
 1 file changed, 16 insertions(+), 18 deletions(-)

diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
index d125d5ab9b1d..d0e0c4e5415e 100644
--- a/fs/btrfs/extent_map.c
+++ b/fs/btrfs/extent_map.c
@@ -355,21 +355,22 @@ static inline void setup_extent_mapping(struct extent_map_tree *tree,
 }
 
 /*
- * Add new extent map to the extent tree
+ * Add a new extent map to an inode's extent map tree.
  *
- * @tree:	tree to insert new map in
+ * @inode:	the target inode
  * @em:		map to insert
  * @modified:	indicate whether the given @em should be added to the
  *	        modified list, which indicates the extent needs to be logged
  *
- * Insert @em into @tree or perform a simple forward/backward merge with
- * existing mappings.  The extent_map struct passed in will be inserted
- * into the tree directly, with an additional reference taken, or a
- * reference dropped if the merge attempt was successful.
+ * Insert @em into the @inode's extent map tree or perform a simple
+ * forward/backward merge with existing mappings.  The extent_map struct passed
+ * in will be inserted into the tree directly, with an additional reference
+ * taken, or a reference dropped if the merge attempt was successful.
  */
-static int add_extent_mapping(struct extent_map_tree *tree,
+static int add_extent_mapping(struct btrfs_inode *inode,
 			      struct extent_map *em, int modified)
 {
+	struct extent_map_tree *tree = &inode->extent_tree;
 	int ret;
 
 	lockdep_assert_held_write(&tree->lock);
@@ -508,7 +509,7 @@ static struct extent_map *prev_extent_map(struct extent_map *em)
  * and an extent that you want to insert, deal with overlap and insert
  * the best fitted new extent into the tree.
  */
-static noinline int merge_extent_mapping(struct extent_map_tree *em_tree,
+static noinline int merge_extent_mapping(struct btrfs_inode *inode,
 					 struct extent_map *existing,
 					 struct extent_map *em,
 					 u64 map_start)
@@ -542,7 +543,7 @@ static noinline int merge_extent_mapping(struct extent_map_tree *em_tree,
 		em->block_start += start_diff;
 		em->block_len = em->len;
 	}
-	return add_extent_mapping(em_tree, em, 0);
+	return add_extent_mapping(inode, em, 0);
 }
 
 /*
@@ -570,7 +571,6 @@ int btrfs_add_extent_mapping(struct btrfs_inode *inode,
 {
 	int ret;
 	struct extent_map *em = *em_in;
-	struct extent_map_tree *em_tree = &inode->extent_tree;
 	struct btrfs_fs_info *fs_info = inode->root->fs_info;
 
 	/*
@@ -580,7 +580,7 @@ int btrfs_add_extent_mapping(struct btrfs_inode *inode,
 	if (em->block_start == EXTENT_MAP_INLINE)
 		ASSERT(em->start == 0);
 
-	ret = add_extent_mapping(em_tree, em, 0);
+	ret = add_extent_mapping(inode, em, 0);
 	/* it is possible that someone inserted the extent into the tree
 	 * while we had the lock dropped.  It is also possible that
 	 * an overlapping map exists in the tree
@@ -588,7 +588,7 @@ int btrfs_add_extent_mapping(struct btrfs_inode *inode,
 	if (ret == -EEXIST) {
 		struct extent_map *existing;
 
-		existing = search_extent_mapping(em_tree, start, len);
+		existing = search_extent_mapping(&inode->extent_tree, start, len);
 
 		trace_btrfs_handle_em_exist(fs_info, existing, em, start, len);
 
@@ -609,8 +609,7 @@ int btrfs_add_extent_mapping(struct btrfs_inode *inode,
 			 * The existing extent map is the one nearest to
 			 * the [start, start + len) range which overlaps
 			 */
-			ret = merge_extent_mapping(em_tree, existing,
-						   em, start);
+			ret = merge_extent_mapping(inode, existing, em, start);
 			if (WARN_ON(ret)) {
 				free_extent_map(em);
 				*em_in = NULL;
@@ -818,8 +817,7 @@ void btrfs_drop_extent_map_range(struct btrfs_inode *inode, u64 start, u64 end,
 			} else {
 				int ret;
 
-				ret = add_extent_mapping(em_tree, split,
-							 modified);
+				ret = add_extent_mapping(inode, split, modified);
 				/* Logic error, shouldn't happen. */
 				ASSERT(ret == 0);
 				if (WARN_ON(ret != 0) && modified)
@@ -909,7 +907,7 @@ int btrfs_replace_extent_map_range(struct btrfs_inode *inode,
 	do {
 		btrfs_drop_extent_map_range(inode, new_em->start, end, false);
 		write_lock(&tree->lock);
-		ret = add_extent_mapping(tree, new_em, modified);
+		ret = add_extent_mapping(inode, new_em, modified);
 		write_unlock(&tree->lock);
 	} while (ret == -EEXIST);
 
@@ -990,7 +988,7 @@ int split_extent_map(struct btrfs_inode *inode, u64 start, u64 len, u64 pre,
 	split_mid->ram_bytes = split_mid->len;
 	split_mid->flags = flags;
 	split_mid->generation = em->generation;
-	add_extent_mapping(em_tree, split_mid, 1);
+	add_extent_mapping(inode, split_mid, 1);
 
 	/* Once for us */
 	free_extent_map(em);
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH v3 02/10] btrfs: pass the extent map tree's inode to clear_em_logging()
  2024-04-16 13:08 ` [PATCH v3 00/10] btrfs: add a shrinker for extent maps fdmanana
  2024-04-16 13:08   ` [PATCH v3 01/10] btrfs: pass the extent map tree's inode to add_extent_mapping() fdmanana
@ 2024-04-16 13:08   ` fdmanana
  2024-04-17 11:09     ` Johannes Thumshirn
  2024-04-16 13:08   ` [PATCH v3 03/10] btrfs: pass the extent map tree's inode to remove_extent_mapping() fdmanana
                     ` (7 subsequent siblings)
  9 siblings, 1 reply; 64+ messages in thread
From: fdmanana @ 2024-04-16 13:08 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

Extent maps are always associated to an inode's extent map tree, so
there's no need to pass the extent map tree explicitly to
clear_em_logging().

In order to facilitate an upcoming change that adds a shrinker for extent
maps, change clear_em_logging() to receive the inode instead of its extent
map tree.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/extent_map.c | 4 +++-
 fs/btrfs/extent_map.h | 2 +-
 fs/btrfs/tree-log.c   | 4 ++--
 3 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
index d0e0c4e5415e..7cda78d11d75 100644
--- a/fs/btrfs/extent_map.c
+++ b/fs/btrfs/extent_map.c
@@ -331,8 +331,10 @@ int unpin_extent_cache(struct btrfs_inode *inode, u64 start, u64 len, u64 gen)
 
 }
 
-void clear_em_logging(struct extent_map_tree *tree, struct extent_map *em)
+void clear_em_logging(struct btrfs_inode *inode, struct extent_map *em)
 {
+	struct extent_map_tree *tree = &inode->extent_tree;
+
 	lockdep_assert_held_write(&tree->lock);
 
 	em->flags &= ~EXTENT_FLAG_LOGGING;
diff --git a/fs/btrfs/extent_map.h b/fs/btrfs/extent_map.h
index f287ab46e368..732fc8d7e534 100644
--- a/fs/btrfs/extent_map.h
+++ b/fs/btrfs/extent_map.h
@@ -129,7 +129,7 @@ void free_extent_map(struct extent_map *em);
 int __init extent_map_init(void);
 void __cold extent_map_exit(void);
 int unpin_extent_cache(struct btrfs_inode *inode, u64 start, u64 len, u64 gen);
-void clear_em_logging(struct extent_map_tree *tree, struct extent_map *em);
+void clear_em_logging(struct btrfs_inode *inode, struct extent_map *em);
 struct extent_map *search_extent_mapping(struct extent_map_tree *tree,
 					 u64 start, u64 len);
 int btrfs_add_extent_mapping(struct btrfs_inode *inode,
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 1c7efb7c2160..765fbbe3ee30 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -4949,7 +4949,7 @@ static int btrfs_log_changed_extents(struct btrfs_trans_handle *trans,
 		 * private list.
 		 */
 		if (ret) {
-			clear_em_logging(tree, em);
+			clear_em_logging(inode, em);
 			free_extent_map(em);
 			continue;
 		}
@@ -4958,7 +4958,7 @@ static int btrfs_log_changed_extents(struct btrfs_trans_handle *trans,
 
 		ret = log_one_extent(trans, inode, em, path, ctx);
 		write_lock(&tree->lock);
-		clear_em_logging(tree, em);
+		clear_em_logging(inode, em);
 		free_extent_map(em);
 	}
 	WARN_ON(!list_empty(&extents));
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH v3 03/10] btrfs: pass the extent map tree's inode to remove_extent_mapping()
  2024-04-16 13:08 ` [PATCH v3 00/10] btrfs: add a shrinker for extent maps fdmanana
  2024-04-16 13:08   ` [PATCH v3 01/10] btrfs: pass the extent map tree's inode to add_extent_mapping() fdmanana
  2024-04-16 13:08   ` [PATCH v3 02/10] btrfs: pass the extent map tree's inode to clear_em_logging() fdmanana
@ 2024-04-16 13:08   ` fdmanana
  2024-04-17 11:10     ` Johannes Thumshirn
  2024-04-16 13:08   ` [PATCH v3 04/10] btrfs: pass the extent map tree's inode to replace_extent_mapping() fdmanana
                     ` (6 subsequent siblings)
  9 siblings, 1 reply; 64+ messages in thread
From: fdmanana @ 2024-04-16 13:08 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

Extent maps are always associated to an inode's extent map tree, so
there's no need to pass the extent map tree explicitly to
remove_extent_mapping().

In order to facilitate an upcoming change that adds a shrinker for extent
maps, change remove_extent_mapping() to receive the inode instead of its
extent map tree.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/extent_io.c              |  2 +-
 fs/btrfs/extent_map.c             | 22 +++++++++++++---------
 fs/btrfs/extent_map.h             |  2 +-
 fs/btrfs/tests/extent-map-tests.c | 19 ++++++++++---------
 4 files changed, 25 insertions(+), 20 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 91122817f137..7b10f47d8f83 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2457,7 +2457,7 @@ int try_release_extent_mapping(struct page *page, gfp_t mask)
 			 * hurts the fsync performance for workloads with a data
 			 * size that exceeds or is close to the system's memory).
 			 */
-			remove_extent_mapping(map, em);
+			remove_extent_mapping(btrfs_inode, em);
 			/* once for the rb tree */
 			free_extent_map(em);
 next:
diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
index 7cda78d11d75..289669763965 100644
--- a/fs/btrfs/extent_map.c
+++ b/fs/btrfs/extent_map.c
@@ -449,16 +449,18 @@ struct extent_map *search_extent_mapping(struct extent_map_tree *tree,
 }
 
 /*
- * Remove an extent_map from the extent tree.
+ * Remove an extent_map from its inode's extent tree.
  *
- * @tree:	extent tree to remove from
+ * @inode:	the inode the extent map belongs to
  * @em:		extent map being removed
  *
- * Remove @em from @tree.  No reference counts are dropped, and no checks
- * are done to see if the range is in use.
+ * Remove @em from the extent tree of @inode.  No reference counts are dropped,
+ * and no checks are done to see if the range is in use.
  */
-void remove_extent_mapping(struct extent_map_tree *tree, struct extent_map *em)
+void remove_extent_mapping(struct btrfs_inode *inode, struct extent_map *em)
 {
+	struct extent_map_tree *tree = &inode->extent_tree;
+
 	lockdep_assert_held_write(&tree->lock);
 
 	WARN_ON(em->flags & EXTENT_FLAG_PINNED);
@@ -633,8 +635,10 @@ int btrfs_add_extent_mapping(struct btrfs_inode *inode,
  * if needed. This avoids searching the tree, from the root down to the first
  * extent map, before each deletion.
  */
-static void drop_all_extent_maps_fast(struct extent_map_tree *tree)
+static void drop_all_extent_maps_fast(struct btrfs_inode *inode)
 {
+	struct extent_map_tree *tree = &inode->extent_tree;
+
 	write_lock(&tree->lock);
 	while (!RB_EMPTY_ROOT(&tree->map.rb_root)) {
 		struct extent_map *em;
@@ -643,7 +647,7 @@ static void drop_all_extent_maps_fast(struct extent_map_tree *tree)
 		node = rb_first_cached(&tree->map);
 		em = rb_entry(node, struct extent_map, rb_node);
 		em->flags &= ~(EXTENT_FLAG_PINNED | EXTENT_FLAG_LOGGING);
-		remove_extent_mapping(tree, em);
+		remove_extent_mapping(inode, em);
 		free_extent_map(em);
 		cond_resched_rwlock_write(&tree->lock);
 	}
@@ -676,7 +680,7 @@ void btrfs_drop_extent_map_range(struct btrfs_inode *inode, u64 start, u64 end,
 	WARN_ON(end < start);
 	if (end == (u64)-1) {
 		if (start == 0 && !skip_pinned) {
-			drop_all_extent_maps_fast(em_tree);
+			drop_all_extent_maps_fast(inode);
 			return;
 		}
 		len = (u64)-1;
@@ -854,7 +858,7 @@ void btrfs_drop_extent_map_range(struct btrfs_inode *inode, u64 start, u64 end,
 				ASSERT(!split);
 				btrfs_set_inode_full_sync(inode);
 			}
-			remove_extent_mapping(em_tree, em);
+			remove_extent_mapping(inode, em);
 		}
 
 		/*
diff --git a/fs/btrfs/extent_map.h b/fs/btrfs/extent_map.h
index 732fc8d7e534..c3707461ff62 100644
--- a/fs/btrfs/extent_map.h
+++ b/fs/btrfs/extent_map.h
@@ -120,7 +120,7 @@ static inline u64 extent_map_end(const struct extent_map *em)
 void extent_map_tree_init(struct extent_map_tree *tree);
 struct extent_map *lookup_extent_mapping(struct extent_map_tree *tree,
 					 u64 start, u64 len);
-void remove_extent_mapping(struct extent_map_tree *tree, struct extent_map *em);
+void remove_extent_mapping(struct btrfs_inode *inode, struct extent_map *em);
 int split_extent_map(struct btrfs_inode *inode, u64 start, u64 len, u64 pre,
 		     u64 new_logical);
 
diff --git a/fs/btrfs/tests/extent-map-tests.c b/fs/btrfs/tests/extent-map-tests.c
index 9e9cb591c0f1..db6fb1a2c78f 100644
--- a/fs/btrfs/tests/extent-map-tests.c
+++ b/fs/btrfs/tests/extent-map-tests.c
@@ -11,8 +11,9 @@
 #include "../disk-io.h"
 #include "../block-group.h"
 
-static int free_extent_map_tree(struct extent_map_tree *em_tree)
+static int free_extent_map_tree(struct btrfs_inode *inode)
 {
+	struct extent_map_tree *em_tree = &inode->extent_tree;
 	struct extent_map *em;
 	struct rb_node *node;
 	int ret = 0;
@@ -21,7 +22,7 @@ static int free_extent_map_tree(struct extent_map_tree *em_tree)
 	while (!RB_EMPTY_ROOT(&em_tree->map.rb_root)) {
 		node = rb_first_cached(&em_tree->map);
 		em = rb_entry(node, struct extent_map, rb_node);
-		remove_extent_mapping(em_tree, em);
+		remove_extent_mapping(inode, em);
 
 #ifdef CONFIG_BTRFS_DEBUG
 		if (refcount_read(&em->refs) != 1) {
@@ -142,7 +143,7 @@ static int test_case_1(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
 	}
 	free_extent_map(em);
 out:
-	ret2 = free_extent_map_tree(em_tree);
+	ret2 = free_extent_map_tree(inode);
 	if (ret == 0)
 		ret = ret2;
 
@@ -237,7 +238,7 @@ static int test_case_2(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
 	}
 	free_extent_map(em);
 out:
-	ret2 = free_extent_map_tree(em_tree);
+	ret2 = free_extent_map_tree(inode);
 	if (ret == 0)
 		ret = ret2;
 
@@ -313,7 +314,7 @@ static int __test_case_3(struct btrfs_fs_info *fs_info,
 	}
 	free_extent_map(em);
 out:
-	ret2 = free_extent_map_tree(em_tree);
+	ret2 = free_extent_map_tree(inode);
 	if (ret == 0)
 		ret = ret2;
 
@@ -435,7 +436,7 @@ static int __test_case_4(struct btrfs_fs_info *fs_info,
 	}
 	free_extent_map(em);
 out:
-	ret2 = free_extent_map_tree(em_tree);
+	ret2 = free_extent_map_tree(inode);
 	if (ret == 0)
 		ret = ret2;
 
@@ -679,7 +680,7 @@ static int test_case_5(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
 	if (ret)
 		goto out;
 out:
-	ret2 = free_extent_map_tree(&inode->extent_tree);
+	ret2 = free_extent_map_tree(inode);
 	if (ret == 0)
 		ret = ret2;
 
@@ -738,7 +739,7 @@ static int test_case_6(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
 	ret = 0;
 out:
 	free_extent_map(em);
-	ret2 = free_extent_map_tree(em_tree);
+	ret2 = free_extent_map_tree(inode);
 	if (ret == 0)
 		ret = ret2;
 
@@ -871,7 +872,7 @@ static int test_case_7(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
 	ret2 = unpin_extent_cache(inode, 0, SZ_16K, 0);
 	if (ret == 0)
 		ret = ret2;
-	ret2 = free_extent_map_tree(em_tree);
+	ret2 = free_extent_map_tree(inode);
 	if (ret == 0)
 		ret = ret2;
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH v3 04/10] btrfs: pass the extent map tree's inode to replace_extent_mapping()
  2024-04-16 13:08 ` [PATCH v3 00/10] btrfs: add a shrinker for extent maps fdmanana
                     ` (2 preceding siblings ...)
  2024-04-16 13:08   ` [PATCH v3 03/10] btrfs: pass the extent map tree's inode to remove_extent_mapping() fdmanana
@ 2024-04-16 13:08   ` fdmanana
  2024-04-17 11:11     ` Johannes Thumshirn
  2024-04-16 13:08   ` [PATCH v3 05/10] btrfs: pass the extent map tree's inode to setup_extent_mapping() fdmanana
                     ` (5 subsequent siblings)
  9 siblings, 1 reply; 64+ messages in thread
From: fdmanana @ 2024-04-16 13:08 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

Extent maps are always associated to an inode's extent map tree, so
there's no need to pass the extent map tree explicitly to
replace_extent_mapping().

In order to facilitate an upcoming change that adds a shrinker for extent
maps, change replace_extent_mapping() to receive the inode instead of its
extent map tree.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/extent_map.c | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
index 289669763965..15817b842c24 100644
--- a/fs/btrfs/extent_map.c
+++ b/fs/btrfs/extent_map.c
@@ -470,11 +470,13 @@ void remove_extent_mapping(struct btrfs_inode *inode, struct extent_map *em)
 	RB_CLEAR_NODE(&em->rb_node);
 }
 
-static void replace_extent_mapping(struct extent_map_tree *tree,
+static void replace_extent_mapping(struct btrfs_inode *inode,
 				   struct extent_map *cur,
 				   struct extent_map *new,
 				   int modified)
 {
+	struct extent_map_tree *tree = &inode->extent_tree;
+
 	lockdep_assert_held_write(&tree->lock);
 
 	WARN_ON(cur->flags & EXTENT_FLAG_PINNED);
@@ -777,7 +779,7 @@ void btrfs_drop_extent_map_range(struct btrfs_inode *inode, u64 start, u64 end,
 
 			split->generation = gen;
 			split->flags = flags;
-			replace_extent_mapping(em_tree, em, split, modified);
+			replace_extent_mapping(inode, em, split, modified);
 			free_extent_map(split);
 			split = split2;
 			split2 = NULL;
@@ -818,8 +820,7 @@ void btrfs_drop_extent_map_range(struct btrfs_inode *inode, u64 start, u64 end,
 			}
 
 			if (extent_map_in_tree(em)) {
-				replace_extent_mapping(em_tree, em, split,
-						       modified);
+				replace_extent_mapping(inode, em, split, modified);
 			} else {
 				int ret;
 
@@ -977,7 +978,7 @@ int split_extent_map(struct btrfs_inode *inode, u64 start, u64 len, u64 pre,
 	split_pre->flags = flags;
 	split_pre->generation = em->generation;
 
-	replace_extent_mapping(em_tree, em, split_pre, 1);
+	replace_extent_mapping(inode, em, split_pre, 1);
 
 	/*
 	 * Now we only have an extent_map at:
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH v3 05/10] btrfs: pass the extent map tree's inode to setup_extent_mapping()
  2024-04-16 13:08 ` [PATCH v3 00/10] btrfs: add a shrinker for extent maps fdmanana
                     ` (3 preceding siblings ...)
  2024-04-16 13:08   ` [PATCH v3 04/10] btrfs: pass the extent map tree's inode to replace_extent_mapping() fdmanana
@ 2024-04-16 13:08   ` fdmanana
  2024-04-17 11:11     ` Johannes Thumshirn
  2024-04-16 13:08   ` [PATCH v3 06/10] btrfs: pass the extent map tree's inode to try_merge_map() fdmanana
                     ` (4 subsequent siblings)
  9 siblings, 1 reply; 64+ messages in thread
From: fdmanana @ 2024-04-16 13:08 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

Extent maps are always associated to an inode's extent map tree, so
there's no need to pass the extent map tree explicitly to
setup_extent_mapping().

In order to facilitate an upcoming change that adds a shrinker for extent
maps, change setup_extent_mapping() to receive the inode instead of its
extent map tree.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/extent_map.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
index 15817b842c24..2753bf2964cb 100644
--- a/fs/btrfs/extent_map.c
+++ b/fs/btrfs/extent_map.c
@@ -342,7 +342,7 @@ void clear_em_logging(struct btrfs_inode *inode, struct extent_map *em)
 		try_merge_map(tree, em);
 }
 
-static inline void setup_extent_mapping(struct extent_map_tree *tree,
+static inline void setup_extent_mapping(struct btrfs_inode *inode,
 					struct extent_map *em,
 					int modified)
 {
@@ -351,9 +351,9 @@ static inline void setup_extent_mapping(struct extent_map_tree *tree,
 	ASSERT(list_empty(&em->list));
 
 	if (modified)
-		list_add(&em->list, &tree->modified_extents);
+		list_add(&em->list, &inode->extent_tree.modified_extents);
 	else
-		try_merge_map(tree, em);
+		try_merge_map(&inode->extent_tree, em);
 }
 
 /*
@@ -381,7 +381,7 @@ static int add_extent_mapping(struct btrfs_inode *inode,
 	if (ret)
 		return ret;
 
-	setup_extent_mapping(tree, em, modified);
+	setup_extent_mapping(inode, em, modified);
 
 	return 0;
 }
@@ -486,7 +486,7 @@ static void replace_extent_mapping(struct btrfs_inode *inode,
 	rb_replace_node_cached(&cur->rb_node, &new->rb_node, &tree->map);
 	RB_CLEAR_NODE(&cur->rb_node);
 
-	setup_extent_mapping(tree, new, modified);
+	setup_extent_mapping(inode, new, modified);
 }
 
 static struct extent_map *next_extent_map(const struct extent_map *em)
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH v3 06/10] btrfs: pass the extent map tree's inode to try_merge_map()
  2024-04-16 13:08 ` [PATCH v3 00/10] btrfs: add a shrinker for extent maps fdmanana
                     ` (4 preceding siblings ...)
  2024-04-16 13:08   ` [PATCH v3 05/10] btrfs: pass the extent map tree's inode to setup_extent_mapping() fdmanana
@ 2024-04-16 13:08   ` fdmanana
  2024-04-17 11:12     ` Johannes Thumshirn
  2024-04-16 13:08   ` [PATCH v3 07/10] btrfs: add a global per cpu counter to track number of used extent maps fdmanana
                     ` (3 subsequent siblings)
  9 siblings, 1 reply; 64+ messages in thread
From: fdmanana @ 2024-04-16 13:08 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

Extent maps are always associated to an inode's extent map tree, so
there's no need to pass the extent map tree explicitly to try_merge_map().

In order to facilitate an upcoming change that adds a shrinker for extent
maps, change try_merge_map() to receive the inode instead of its extent
map tree.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/extent_map.c | 13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
index 2753bf2964cb..97a8e0484415 100644
--- a/fs/btrfs/extent_map.c
+++ b/fs/btrfs/extent_map.c
@@ -223,8 +223,9 @@ static bool mergeable_maps(const struct extent_map *prev, const struct extent_ma
 	return next->block_start == prev->block_start;
 }
 
-static void try_merge_map(struct extent_map_tree *tree, struct extent_map *em)
+static void try_merge_map(struct btrfs_inode *inode, struct extent_map *em)
 {
+	struct extent_map_tree *tree = &inode->extent_tree;
 	struct extent_map *merge = NULL;
 	struct rb_node *rb;
 
@@ -322,7 +323,7 @@ int unpin_extent_cache(struct btrfs_inode *inode, u64 start, u64 len, u64 gen)
 	em->generation = gen;
 	em->flags &= ~EXTENT_FLAG_PINNED;
 
-	try_merge_map(tree, em);
+	try_merge_map(inode, em);
 
 out:
 	write_unlock(&tree->lock);
@@ -333,13 +334,11 @@ int unpin_extent_cache(struct btrfs_inode *inode, u64 start, u64 len, u64 gen)
 
 void clear_em_logging(struct btrfs_inode *inode, struct extent_map *em)
 {
-	struct extent_map_tree *tree = &inode->extent_tree;
-
-	lockdep_assert_held_write(&tree->lock);
+	lockdep_assert_held_write(&inode->extent_tree.lock);
 
 	em->flags &= ~EXTENT_FLAG_LOGGING;
 	if (extent_map_in_tree(em))
-		try_merge_map(tree, em);
+		try_merge_map(inode, em);
 }
 
 static inline void setup_extent_mapping(struct btrfs_inode *inode,
@@ -353,7 +352,7 @@ static inline void setup_extent_mapping(struct btrfs_inode *inode,
 	if (modified)
 		list_add(&em->list, &inode->extent_tree.modified_extents);
 	else
-		try_merge_map(&inode->extent_tree, em);
+		try_merge_map(inode, em);
 }
 
 /*
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH v3 07/10] btrfs: add a global per cpu counter to track number of used extent maps
  2024-04-16 13:08 ` [PATCH v3 00/10] btrfs: add a shrinker for extent maps fdmanana
                     ` (5 preceding siblings ...)
  2024-04-16 13:08   ` [PATCH v3 06/10] btrfs: pass the extent map tree's inode to try_merge_map() fdmanana
@ 2024-04-16 13:08   ` fdmanana
  2024-04-17 11:14     ` Johannes Thumshirn
  2024-04-16 13:08   ` [PATCH v3 08/10] btrfs: add a shrinker for " fdmanana
                     ` (2 subsequent siblings)
  9 siblings, 1 reply; 64+ messages in thread
From: fdmanana @ 2024-04-16 13:08 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

Add a per cpu counter that tracks the total number of extent maps that are
in extent trees of inodes that belong to fs trees. This is going to be
used in an upcoming change that adds a shrinker for extent maps. Only
extent maps for fs trees are considered, because for special trees such as
the data relocation tree we don't want to evict their extent maps, which
are critical for the relocation to work, and since those are limited, it's
not a concern to have them in memory during the relocation of a block
group. Another case is extent maps for free space cache inodes, which
must always remain in memory, but those are limited (there's only one per
free space cache inode, which means one per block group).

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/disk-io.c    |  9 +++++++++
 fs/btrfs/extent_map.c | 17 +++++++++++++++++
 fs/btrfs/fs.h         |  2 ++
 3 files changed, 28 insertions(+)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 0474e9b6d302..bc9d1b48011e 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1266,9 +1266,14 @@ static void free_global_roots(struct btrfs_fs_info *fs_info)
 
 void btrfs_free_fs_info(struct btrfs_fs_info *fs_info)
 {
+	struct percpu_counter *em_counter = &fs_info->evictable_extent_maps;
+
 	percpu_counter_destroy(&fs_info->dirty_metadata_bytes);
 	percpu_counter_destroy(&fs_info->delalloc_bytes);
 	percpu_counter_destroy(&fs_info->ordered_bytes);
+	if (percpu_counter_initialized(em_counter))
+		ASSERT(percpu_counter_sum_positive(em_counter) == 0);
+	percpu_counter_destroy(em_counter);
 	percpu_counter_destroy(&fs_info->dev_replace.bio_counter);
 	btrfs_free_csum_hash(fs_info);
 	btrfs_free_stripe_hash_table(fs_info);
@@ -2848,6 +2853,10 @@ static int init_mount_fs_info(struct btrfs_fs_info *fs_info, struct super_block
 	if (ret)
 		return ret;
 
+	ret = percpu_counter_init(&fs_info->evictable_extent_maps, 0, GFP_KERNEL);
+	if (ret)
+		return ret;
+
 	ret = percpu_counter_init(&fs_info->dirty_metadata_bytes, 0, GFP_KERNEL);
 	if (ret)
 		return ret;
diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
index 97a8e0484415..2fcf28148a81 100644
--- a/fs/btrfs/extent_map.c
+++ b/fs/btrfs/extent_map.c
@@ -76,6 +76,14 @@ static u64 range_end(u64 start, u64 len)
 	return start + len;
 }
 
+static void dec_evictable_extent_maps(struct btrfs_inode *inode)
+{
+	struct btrfs_fs_info *fs_info = inode->root->fs_info;
+
+	if (!btrfs_is_testing(fs_info) && is_fstree(btrfs_root_id(inode->root)))
+		percpu_counter_dec(&fs_info->evictable_extent_maps);
+}
+
 static int tree_insert(struct rb_root_cached *root, struct extent_map *em)
 {
 	struct rb_node **p = &root->rb_root.rb_node;
@@ -259,6 +267,7 @@ static void try_merge_map(struct btrfs_inode *inode, struct extent_map *em)
 			rb_erase_cached(&merge->rb_node, &tree->map);
 			RB_CLEAR_NODE(&merge->rb_node);
 			free_extent_map(merge);
+			dec_evictable_extent_maps(inode);
 		}
 	}
 
@@ -273,6 +282,7 @@ static void try_merge_map(struct btrfs_inode *inode, struct extent_map *em)
 		em->generation = max(em->generation, merge->generation);
 		em->flags |= EXTENT_FLAG_MERGED;
 		free_extent_map(merge);
+		dec_evictable_extent_maps(inode);
 	}
 }
 
@@ -372,6 +382,8 @@ static int add_extent_mapping(struct btrfs_inode *inode,
 			      struct extent_map *em, int modified)
 {
 	struct extent_map_tree *tree = &inode->extent_tree;
+	struct btrfs_root *root = inode->root;
+	struct btrfs_fs_info *fs_info = root->fs_info;
 	int ret;
 
 	lockdep_assert_held_write(&tree->lock);
@@ -382,6 +394,9 @@ static int add_extent_mapping(struct btrfs_inode *inode,
 
 	setup_extent_mapping(inode, em, modified);
 
+	if (!btrfs_is_testing(fs_info) && is_fstree(btrfs_root_id(root)))
+		percpu_counter_inc(&fs_info->evictable_extent_maps);
+
 	return 0;
 }
 
@@ -467,6 +482,8 @@ void remove_extent_mapping(struct btrfs_inode *inode, struct extent_map *em)
 	if (!(em->flags & EXTENT_FLAG_LOGGING))
 		list_del_init(&em->list);
 	RB_CLEAR_NODE(&em->rb_node);
+
+	dec_evictable_extent_maps(inode);
 }
 
 static void replace_extent_mapping(struct btrfs_inode *inode,
diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
index 93f5c57ea4e3..534d30dafe32 100644
--- a/fs/btrfs/fs.h
+++ b/fs/btrfs/fs.h
@@ -630,6 +630,8 @@ struct btrfs_fs_info {
 	s32 dirty_metadata_batch;
 	s32 delalloc_batch;
 
+	struct percpu_counter evictable_extent_maps;
+
 	/* Protected by 'trans_lock'. */
 	struct list_head dirty_cowonly_roots;
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH v3 08/10] btrfs: add a shrinker for extent maps
  2024-04-16 13:08 ` [PATCH v3 00/10] btrfs: add a shrinker for extent maps fdmanana
                     ` (6 preceding siblings ...)
  2024-04-16 13:08   ` [PATCH v3 07/10] btrfs: add a global per cpu counter to track number of used extent maps fdmanana
@ 2024-04-16 13:08   ` fdmanana
  2024-04-16 17:25     ` Josef Bacik
  2024-04-16 13:08   ` [PATCH v3 09/10] btrfs: update comment for btrfs_set_inode_full_sync() about locking fdmanana
  2024-04-16 13:08   ` [PATCH v3 10/10] btrfs: add tracepoints for extent map shrinker events fdmanana
  9 siblings, 1 reply; 64+ messages in thread
From: fdmanana @ 2024-04-16 13:08 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

Extent maps are used either to represent existing file extent items, or to
represent new extents that are going to be written and the respective file
extent items are created when the ordered extent completes.

We currently don't have any limit for how many extent maps we can have,
neither per inode nor globally. Most of the time this is not too noticeable
because extent maps are removed in the following situations:

1) When evicting an inode;

2) When releasing folios (pages) through the btrfs_release_folio() address
   space operation callback.

   However we won't release extent maps in the folio range if the folio is
   either dirty or under writeback, or if the inode's i_size is less than
   or equal to 16M (see try_release_extent_mapping()). This 16M i_size
   constraint was added back in 2008 with commit 70dec8079d78 ("Btrfs:
   extent_io and extent_state optimizations"), but there's no explanation
   about why we have it or why the value is 16M.

This means that for buffered IO we can reach an OOM situation due to too
many extent maps if either of the following happens:

1) There's a set of tasks constantly doing IO on many files with a size
   not larger than 16M, especially if they keep the files open for very
   long periods, therefore preventing inode eviction.

   This requires a really high number of such files, many non mergeable
   extent maps (due to random 4K writes for example) and a machine with
   very little memory;

2) There's a set of tasks constantly doing random write IO (therefore
   creating many non mergeable extent maps) on files and keeping them
   open for long periods of time, so inode eviction doesn't happen and
   there's always a lot of dirty pages or pages under writeback,
   preventing btrfs_release_folio() from releasing the respective extent
   maps.

This second case was actually reported in the thread pointed by the Link
tag below, and it requires a very large file under heavy IO and a machine
with very little RAM, which is probably hard to hit in practice in a real
world use case.

However when using direct IO this is not so hard to trigger, because the
page cache is not used and therefore btrfs_release_folio() is never
called. This means extent maps are dropped only when evicting the inode,
and that means that if we have tasks that keep a file descriptor open and
keep doing IO on a very large file (or files), we can exhaust memory due
to an unbounded amount of extent maps. This is especially easy to trigger
if we have a huge file with millions of small extents and their extent
maps are not mergeable (non contiguous offsets and disk locations).
This was reported in that thread with the following fio test:

   $ cat test.sh
   #!/bin/bash

   DEV=/dev/sdj
   MNT=/mnt/sdj
   MOUNT_OPTIONS="-o ssd"
   MKFS_OPTIONS=""

   cat <<EOF > /tmp/fio-job.ini
   [global]
   name=fio-rand-write
   filename=$MNT/fio-rand-write
   rw=randwrite
   bs=4K
   direct=1
   numjobs=16
   fallocate=none
   time_based
   runtime=90000

   [file1]
   size=300G
   ioengine=libaio
   iodepth=16

   EOF

   umount $MNT &> /dev/null
   mkfs.btrfs -f $MKFS_OPTIONS $DEV
   mount $MOUNT_OPTIONS $DEV $MNT

   fio /tmp/fio-job.ini
   umount $MNT

Monitoring the btrfs_extent_map slab while running the test with:

   $ watch -d -n 1 'cat /sys/kernel/slab/btrfs_extent_map/objects \
                        /sys/kernel/slab/btrfs_extent_map/total_objects'

Shows the number of active and total extent maps skyrocketing to tens of
millions, and on systems with a small amount of memory it's easy and quick
to get into an OOM situation, as reported in that thread.

So to avoid this issue add a shrinker that removes extent maps, as
long as they are not pinned, and that takes proper care with any concurrent
fsync to avoid missing extents (by setting the full sync flag while in the
middle of a fast fsync). This shrinker is triggered through the callbacks
nr_cached_objects and free_cached_objects of struct super_operations.

The shrinker iterates over all roots and over all inodes of each
root, and keeps track of the last scanned root and inode, so that the
next time it runs it starts from that root and from the next inode.
This is similar to what xfs does for its inode reclaim (it implements
those callbacks and cycles through inodes, starting from where it ended
last time).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/extent_map.c | 159 ++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/extent_map.h |   1 +
 fs/btrfs/fs.h         |   2 +
 fs/btrfs/super.c      |  17 +++++
 4 files changed, 179 insertions(+)

diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
index 2fcf28148a81..b638d87db500 100644
--- a/fs/btrfs/extent_map.c
+++ b/fs/btrfs/extent_map.c
@@ -8,6 +8,7 @@
 #include "extent_map.h"
 #include "compression.h"
 #include "btrfs_inode.h"
+#include "disk-io.h"
 
 
 static struct kmem_cache *extent_map_cache;
@@ -1026,3 +1027,161 @@ int split_extent_map(struct btrfs_inode *inode, u64 start, u64 len, u64 pre,
 	free_extent_map(split_pre);
 	return ret;
 }
+
+static long btrfs_scan_inode(struct btrfs_inode *inode, long *scanned, long nr_to_scan)
+{
+	const u64 cur_fs_gen = btrfs_get_fs_generation(inode->root->fs_info);
+	struct extent_map_tree *tree = &inode->extent_tree;
+	long nr_dropped = 0;
+	struct rb_node *node;
+
+	/*
+	 * Take the mmap lock so that we serialize with the inode logging phase
+	 * of fsync because we may need to set the full sync flag on the inode,
+	 * in case we have to remove extent maps in the tree's list of modified
+	 * extents. If we set the full sync flag in the inode while an fsync is
+	 * in progress, we may risk missing new extents because before the flag
+	 * is set, fsync decides to only wait for writeback to complete and then
+	 * during inode logging it sees the flag set and uses the subvolume tree
+	 * to find new extents, which may not be there yet because ordered
+	 * extents haven't completed yet.
+	 *
+	 * We also do a try lock because otherwise we could deadlock. This is
+	 * because the shrinker for this filesystem may be invoked while we are
+	 * in a path that is holding the mmap lock in write mode. For example in
+	 * a reflink operation while COWing an extent buffer, when allocating
+	 * pages for a new extent buffer and under memory pressure, the shrinker
+	 * may be invoked, and therefore we would deadlock by attempting to read
+	 * lock the mmap lock while we are holding already a write lock on it.
+	 */
+	if (!down_read_trylock(&inode->i_mmap_lock))
+		return 0;
+
+	write_lock(&tree->lock);
+	node = rb_first_cached(&tree->map);
+	while (node) {
+		struct extent_map *em;
+
+		em = rb_entry(node, struct extent_map, rb_node);
+		node = rb_next(node);
+		(*scanned)++;
+
+		if (em->flags & EXTENT_FLAG_PINNED)
+			goto next;
+
+		/*
+		 * If the extent map is in the list of modified extents (new)
+		 * and its generation is the same (or is greater than) the
+		 * current fs generation, it means it was not yet persisted so
+		 * we have to set the full sync flag so that the next fsync
+		 * will not miss it.
+		 */
+		if (!list_empty(&em->list) && em->generation >= cur_fs_gen)
+			btrfs_set_inode_full_sync(inode);
+
+		remove_extent_mapping(inode, em);
+		/* Drop the reference for the tree. */
+		free_extent_map(em);
+		nr_dropped++;
+next:
+		if (*scanned >= nr_to_scan)
+			break;
+
+		/*
+		 * Restart if we had to resched, and any extent maps that were
+		 * pinned before may have become unpinned after we released the
+		 * lock and took it again.
+		 */
+		if (cond_resched_rwlock_write(&tree->lock))
+			node = rb_first_cached(&tree->map);
+	}
+	write_unlock(&tree->lock);
+	up_read(&inode->i_mmap_lock);
+
+	return nr_dropped;
+}
+
+static long btrfs_scan_root(struct btrfs_root *root, long *scanned, long nr_to_scan)
+{
+	struct btrfs_fs_info *fs_info = root->fs_info;
+	struct btrfs_inode *inode;
+	long nr_dropped = 0;
+	u64 min_ino = fs_info->extent_map_shrinker_last_ino + 1;
+
+	inode = btrfs_find_first_inode(root, min_ino);
+	while (inode) {
+		nr_dropped += btrfs_scan_inode(inode, scanned, nr_to_scan);
+
+		min_ino = btrfs_ino(inode) + 1;
+		fs_info->extent_map_shrinker_last_ino = btrfs_ino(inode);
+		iput(&inode->vfs_inode);
+
+		if (*scanned >= nr_to_scan)
+			break;
+
+		cond_resched();
+		inode = btrfs_find_first_inode(root, min_ino);
+	}
+
+	if (inode) {
+		/*
+		 * There are still inodes in this root or we happened to process
+		 * the last one and reached the scan limit. In either case set
+		 * the current root to this one, so we'll resume from the next
+		 * inode if there is one or we will find out this was the last
+		 * one and move to the next root.
+		 */
+		fs_info->extent_map_shrinker_last_root = btrfs_root_id(root);
+	} else {
+		/*
+		 * No more inodes in this root, set extent_map_shrinker_last_ino to 0 so
+		 * that when processing the next root we start from its first inode.
+		 */
+		fs_info->extent_map_shrinker_last_ino = 0;
+		fs_info->extent_map_shrinker_last_root = btrfs_root_id(root) + 1;
+	}
+
+	return nr_dropped;
+}
+
+long btrfs_free_extent_maps(struct btrfs_fs_info *fs_info, long nr_to_scan)
+{
+	const u64 start_root_id = fs_info->extent_map_shrinker_last_root;
+	u64 next_root_id = start_root_id;
+	bool cycled = false;
+	long nr_dropped = 0;
+	long scanned = 0;
+
+	while (scanned < nr_to_scan) {
+		struct btrfs_root *root;
+		unsigned long count;
+
+		spin_lock(&fs_info->fs_roots_radix_lock);
+		count = radix_tree_gang_lookup(&fs_info->fs_roots_radix,
+					       (void **)&root, next_root_id, 1);
+		if (count == 0) {
+			spin_unlock(&fs_info->fs_roots_radix_lock);
+			if (start_root_id > 0 && !cycled) {
+				next_root_id = 0;
+				fs_info->extent_map_shrinker_last_root = 0;
+				fs_info->extent_map_shrinker_last_ino = 0;
+				cycled = true;
+				continue;
+			}
+			break;
+		}
+		next_root_id = btrfs_root_id(root) + 1;
+		root = btrfs_grab_root(root);
+		spin_unlock(&fs_info->fs_roots_radix_lock);
+
+		if (!root)
+			continue;
+
+		if (is_fstree(btrfs_root_id(root)))
+			nr_dropped += btrfs_scan_root(root, &scanned, nr_to_scan);
+
+		btrfs_put_root(root);
+	}
+
+	return nr_dropped;
+}
diff --git a/fs/btrfs/extent_map.h b/fs/btrfs/extent_map.h
index c3707461ff62..8306a8f294d0 100644
--- a/fs/btrfs/extent_map.h
+++ b/fs/btrfs/extent_map.h
@@ -140,5 +140,6 @@ void btrfs_drop_extent_map_range(struct btrfs_inode *inode,
 int btrfs_replace_extent_map_range(struct btrfs_inode *inode,
 				   struct extent_map *new_em,
 				   bool modified);
+long btrfs_free_extent_maps(struct btrfs_fs_info *fs_info, long nr_to_scan);
 
 #endif
diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
index 534d30dafe32..9711c9c0e78f 100644
--- a/fs/btrfs/fs.h
+++ b/fs/btrfs/fs.h
@@ -631,6 +631,8 @@ struct btrfs_fs_info {
 	s32 delalloc_batch;
 
 	struct percpu_counter evictable_extent_maps;
+	u64 extent_map_shrinker_last_root;
+	u64 extent_map_shrinker_last_ino;
 
 	/* Protected by 'trans_lock'. */
 	struct list_head dirty_cowonly_roots;
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 7e44ccaf348f..a3877e65d3b5 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -2374,6 +2374,21 @@ static int btrfs_show_devname(struct seq_file *m, struct dentry *root)
 	return 0;
 }
 
+static long btrfs_nr_cached_objects(struct super_block *sb, struct shrink_control *sc)
+{
+	struct btrfs_fs_info *fs_info = btrfs_sb(sb);
+
+	return percpu_counter_sum_positive(&fs_info->evictable_extent_maps);
+}
+
+static long btrfs_free_cached_objects(struct super_block *sb, struct shrink_control *sc)
+{
+	const long nr_to_scan = min_t(unsigned long, LONG_MAX, sc->nr_to_scan);
+	struct btrfs_fs_info *fs_info = btrfs_sb(sb);
+
+	return btrfs_free_extent_maps(fs_info, nr_to_scan);
+}
+
 static const struct super_operations btrfs_super_ops = {
 	.drop_inode	= btrfs_drop_inode,
 	.evict_inode	= btrfs_evict_inode,
@@ -2387,6 +2402,8 @@ static const struct super_operations btrfs_super_ops = {
 	.statfs		= btrfs_statfs,
 	.freeze_fs	= btrfs_freeze,
 	.unfreeze_fs	= btrfs_unfreeze,
+	.nr_cached_objects = btrfs_nr_cached_objects,
+	.free_cached_objects = btrfs_free_cached_objects,
 };
 
 static const struct file_operations btrfs_ctl_fops = {
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH v3 09/10] btrfs: update comment for btrfs_set_inode_full_sync() about locking
  2024-04-16 13:08 ` [PATCH v3 00/10] btrfs: add a shrinker for extent maps fdmanana
                     ` (7 preceding siblings ...)
  2024-04-16 13:08   ` [PATCH v3 08/10] btrfs: add a shrinker for " fdmanana
@ 2024-04-16 13:08   ` fdmanana
  2024-04-16 13:08   ` [PATCH v3 10/10] btrfs: add tracepoints for extent map shrinker events fdmanana
  9 siblings, 0 replies; 64+ messages in thread
From: fdmanana @ 2024-04-16 13:08 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

Nowadays we have a lock used to synchronize mmap writes with reflink and
fsync operations (struct btrfs_inode::i_mmap_lock), so update the comment
for btrfs_set_inode_full_sync() to mention that it can also be called
while holding that mmap lock. Besides being a valid alternative to the
inode's VFS lock, we already have the extent map shrinker using that mmap
lock instead.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/btrfs_inode.h | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 9a87ada7fe52..91c994b569f3 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -381,9 +381,11 @@ static inline void btrfs_set_inode_last_sub_trans(struct btrfs_inode *inode)
 }
 
 /*
- * Should be called while holding the inode's VFS lock in exclusive mode or in a
- * context where no one else can access the inode concurrently (during inode
- * creation or when loading an inode from disk).
+ * Should be called while holding the inode's VFS lock in exclusive mode, or
+ * while holding the inode's mmap lock (struct btrfs_inode::i_mmap_lock) in
+ * either shared or exclusive mode, or in a context where no one else can access
+ * the inode concurrently (during inode creation or when loading an inode from
+ * disk).
  */
 static inline void btrfs_set_inode_full_sync(struct btrfs_inode *inode)
 {
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 64+ messages in thread
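
As a minimal illustration of the rule in the updated comment, either of the
following contexts is valid. The example_* callers are hypothetical; only
btrfs_set_inode_full_sync() and the i_mmap_lock field come from the actual
code:

   /* Hypothetical callers showing the two allowed locking contexts. */
   static void example_with_vfs_lock(struct btrfs_inode *inode)
   {
           /* The inode's VFS lock, taken in exclusive mode. */
           inode_lock(&inode->vfs_inode);
           btrfs_set_inode_full_sync(inode);
           inode_unlock(&inode->vfs_inode);
   }

   static void example_with_mmap_lock(struct btrfs_inode *inode)
   {
           /* The mmap lock; shared mode is enough, exclusive also works. */
           down_read(&inode->i_mmap_lock);
           btrfs_set_inode_full_sync(inode);
           up_read(&inode->i_mmap_lock);
   }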

* [PATCH v3 10/10] btrfs: add tracepoints for extent map shrinker events
  2024-04-16 13:08 ` [PATCH v3 00/10] btrfs: add a shrinker for extent maps fdmanana
                     ` (8 preceding siblings ...)
  2024-04-16 13:08   ` [PATCH v3 09/10] btrfs: update comment for btrfs_set_inode_full_sync() about locking fdmanana
@ 2024-04-16 13:08   ` fdmanana
  2024-04-16 17:26     ` Josef Bacik
  2024-04-17 11:20     ` Johannes Thumshirn
  9 siblings, 2 replies; 64+ messages in thread
From: fdmanana @ 2024-04-16 13:08 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

Add some tracepoints for the extent map shrinker to help debug and analyse
its main events. These have proved useful during development of the shrinker.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/extent_map.c        | 13 +++++
 fs/btrfs/super.c             |  5 +-
 include/trace/events/btrfs.h | 99 ++++++++++++++++++++++++++++++++++++
 3 files changed, 116 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
index b638d87db500..2967b3d78399 100644
--- a/fs/btrfs/extent_map.c
+++ b/fs/btrfs/extent_map.c
@@ -1080,6 +1080,7 @@ static long btrfs_scan_inode(struct btrfs_inode *inode, long *scanned, long nr_t
 			btrfs_set_inode_full_sync(inode);
 
 		remove_extent_mapping(inode, em);
+		trace_btrfs_extent_map_shrinker_remove_em(inode, em);
 		/* Drop the reference for the tree. */
 		free_extent_map(em);
 		nr_dropped++;
@@ -1152,6 +1153,12 @@ long btrfs_free_extent_maps(struct btrfs_fs_info *fs_info, long nr_to_scan)
 	long nr_dropped = 0;
 	long scanned = 0;
 
+	if (trace_btrfs_extent_map_shrinker_scan_enter_enabled()) {
+		s64 nr = percpu_counter_sum_positive(&fs_info->evictable_extent_maps);
+
+		trace_btrfs_extent_map_shrinker_scan_enter(fs_info, nr_to_scan, nr);
+	}
+
 	while (scanned < nr_to_scan) {
 		struct btrfs_root *root;
 		unsigned long count;
@@ -1183,5 +1190,11 @@ long btrfs_free_extent_maps(struct btrfs_fs_info *fs_info, long nr_to_scan)
 		btrfs_put_root(root);
 	}
 
+	if (trace_btrfs_extent_map_shrinker_scan_exit_enabled()) {
+		s64 nr = percpu_counter_sum_positive(&fs_info->evictable_extent_maps);
+
+		trace_btrfs_extent_map_shrinker_scan_exit(fs_info, nr_dropped, nr);
+	}
+
 	return nr_dropped;
 }
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index a3877e65d3b5..670914835d8a 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -2377,8 +2377,11 @@ static int btrfs_show_devname(struct seq_file *m, struct dentry *root)
 static long btrfs_nr_cached_objects(struct super_block *sb, struct shrink_control *sc)
 {
 	struct btrfs_fs_info *fs_info = btrfs_sb(sb);
+	const s64 nr = percpu_counter_sum_positive(&fs_info->evictable_extent_maps);
 
-	return percpu_counter_sum_positive(&fs_info->evictable_extent_maps);
+	trace_btrfs_extent_map_shrinker_count(fs_info, nr);
+
+	return nr;
 }
 
 static long btrfs_free_cached_objects(struct super_block *sb, struct shrink_control *sc)
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index 8f2497603cb5..d2d94d7c3fb5 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -2537,6 +2537,105 @@ TRACE_EVENT(btrfs_get_raid_extent_offset,
 			__entry->devid)
 );
 
+TRACE_EVENT(btrfs_extent_map_shrinker_count,
+
+	TP_PROTO(const struct btrfs_fs_info *fs_info, long nr),
+
+	TP_ARGS(fs_info, nr),
+
+	TP_STRUCT__entry_btrfs(
+		__field(	long,	nr	)
+	),
+
+	TP_fast_assign_btrfs(fs_info,
+		__entry->nr		= nr;
+	),
+
+	TP_printk_btrfs("nr=%ld", __entry->nr)
+);
+
+TRACE_EVENT(btrfs_extent_map_shrinker_scan_enter,
+
+	TP_PROTO(const struct btrfs_fs_info *fs_info, long nr_to_scan, long nr),
+
+	TP_ARGS(fs_info, nr_to_scan, nr),
+
+	TP_STRUCT__entry_btrfs(
+		__field(	long,	nr_to_scan	)
+		__field(	long,	nr		)
+		__field(	u64,	last_root_id	)
+		__field(	u64,	last_ino	)
+	),
+
+	TP_fast_assign_btrfs(fs_info,
+		__entry->nr_to_scan	= nr_to_scan;
+		__entry->nr		= nr;
+		__entry->last_root_id	= fs_info->extent_map_shrinker_last_root;
+		__entry->last_ino	= fs_info->extent_map_shrinker_last_ino;
+	),
+
+	TP_printk_btrfs("nr_to_scan=%ld nr=%ld last_root=%llu(%s) last_ino=%llu",
+			__entry->nr_to_scan, __entry->nr,
+			show_root_type(__entry->last_root_id), __entry->last_ino)
+);
+
+TRACE_EVENT(btrfs_extent_map_shrinker_scan_exit,
+
+	TP_PROTO(const struct btrfs_fs_info *fs_info, long nr_dropped, long nr),
+
+	TP_ARGS(fs_info, nr_dropped, nr),
+
+	TP_STRUCT__entry_btrfs(
+		__field(	long,	nr_dropped	)
+		__field(	long,	nr		)
+		__field(	u64,	last_root_id	)
+		__field(	u64,	last_ino	)
+	),
+
+	TP_fast_assign_btrfs(fs_info,
+		__entry->nr_dropped	= nr_dropped;
+		__entry->nr		= nr;
+		__entry->last_root_id	= fs_info->extent_map_shrinker_last_root;
+		__entry->last_ino	= fs_info->extent_map_shrinker_last_ino;
+	),
+
+	TP_printk_btrfs("nr_dropped=%ld nr=%ld last_root=%llu(%s) last_ino=%llu",
+			__entry->nr_dropped, __entry->nr,
+			show_root_type(__entry->last_root_id), __entry->last_ino)
+);
+
+TRACE_EVENT(btrfs_extent_map_shrinker_remove_em,
+
+	TP_PROTO(const struct btrfs_inode *inode, const struct extent_map *em),
+
+	TP_ARGS(inode, em),
+
+	TP_STRUCT__entry_btrfs(
+		__field(	u64,	ino		)
+		__field(	u64,	root_id		)
+		__field(	u64,	start		)
+		__field(	u64,	len		)
+		__field(	u64,	block_start	)
+		__field(	u32,	flags		)
+	),
+
+	TP_fast_assign_btrfs(inode->root->fs_info,
+		__entry->ino		= btrfs_ino(inode);
+		__entry->root_id	= inode->root->root_key.objectid;
+		__entry->start		= em->start;
+		__entry->len		= em->len;
+		__entry->block_start	= em->block_start;
+		__entry->flags		= em->flags;
+	),
+
+	TP_printk_btrfs(
+"ino=%llu root=%llu(%s) start=%llu len=%llu block_start=%llu(%s) flags=%s",
+			__entry->ino, show_root_type(__entry->root_id),
+			__entry->start, __entry->len,
+			show_map_type(__entry->block_start),
+			show_map_flags(__entry->flags))
+);
+
 #endif /* _TRACE_BTRFS_H */
 
 /* This part must be outside protection */
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 64+ messages in thread
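
Once applied, these events show up alongside the other btrfs tracepoints
(normally under /sys/kernel/tracing/events/btrfs/). Also worth noting is the
trace_<event>_enabled() guard used in btrfs_free_extent_maps() above:
percpu_counter_sum_positive() walks all possible CPUs, so the sum is only
worth computing when the event is actually enabled. The pattern, annotated:

   /*
    * trace_<event>_enabled() is generated by the TRACE_EVENT machinery and
    * compiles down to a static branch, so with tracing off this costs little
    * more than a patched jump and the expensive percpu sum is skipped.
    */
   if (trace_btrfs_extent_map_shrinker_scan_enter_enabled()) {
           s64 nr = percpu_counter_sum_positive(&fs_info->evictable_extent_maps);

           trace_btrfs_extent_map_shrinker_scan_enter(fs_info, nr_to_scan, nr);
   }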

* Re: [PATCH v3 08/10] btrfs: add a shrinker for extent maps
  2024-04-16 13:08   ` [PATCH v3 08/10] btrfs: add a shrinker for " fdmanana
@ 2024-04-16 17:25     ` Josef Bacik
  0 siblings, 0 replies; 64+ messages in thread
From: Josef Bacik @ 2024-04-16 17:25 UTC (permalink / raw)
  To: fdmanana; +Cc: linux-btrfs

On Tue, Apr 16, 2024 at 02:08:10PM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> Extent maps are used either to represent existing file extent items, or to
> represent new extents that are going to be written, with the respective file
> extent items created when the ordered extent completes.
> 
> We currently don't have any limit for how many extent maps we can have,
> neither per inode nor globally. Most of the time this is not too noticeable
> because extent maps are removed in the following situations:
> 
> 1) When evicting an inode;
> 
> 2) When releasing folios (pages) through the btrfs_release_folio() address
>    space operation callback.
> 
>    However we won't release extent maps in the folio range if the folio is
>    either dirty or under writeback or if the inode's i_size is less than
>    or equal to 16M (see try_release_extent_mapping()). This 16M i_size
>    constraint was added back in 2008 with commit 70dec8079d78 ("Btrfs:
>    extent_io and extent_state optimizations"), but there's no explanation
>    about why we have it or why the 16M value.
> 
> This means that for buffered IO we can reach an OOM situation due to too
> many extent maps if either of the following happens:
> 
> 1) There's a set of tasks constantly doing IO on many files with a size
>    not larger than 16M, especially if they keep the files open for very
>    long periods, therefore preventing inode eviction.
> 
>    This requires a really high number of such files, and having many non
>    mergeable extent maps (due to random 4K writes for example) and a
>    machine with very little memory;
> 
> 2) There's a set of tasks constantly doing random write IO (therefore
>    creating many non mergeable extent maps) on files and keeping them
>    open for long periods of time, so inode eviction doesn't happen and
>    there are always a lot of dirty pages or pages under writeback,
>    preventing btrfs_release_folio() from releasing the respective extent
>    maps.
> 
> This second case was actually reported in the thread pointed by the Link
> tag below, and it requires a very large file under heavy IO and a machine
> with a very small amount of RAM, which is unlikely to happen in practice
> in a real-world use case.
> 
> However when using direct IO this is not so hard to trigger, because the
> page cache is not used, and therefore btrfs_release_folio() is never
> called. Which means extent maps are dropped only when evicting the inode,
> and that means that if we have tasks that keep a file descriptor open and
> keep doing IO on a very large file (or files), we can exhaust memory due
> to an unbounded amount of extent maps. This is especially easy to trigger
> if we have a huge file with millions of small extents and their extent
> maps are not mergeable (non contiguous offsets and disk locations).
> This was reported in that thread with the following fio test:
> 
>    $ cat test.sh
>    #!/bin/bash
> 
>    DEV=/dev/sdj
>    MNT=/mnt/sdj
>    MOUNT_OPTIONS="-o ssd"
>    MKFS_OPTIONS=""
> 
>    cat <<EOF > /tmp/fio-job.ini
>    [global]
>    name=fio-rand-write
>    filename=$MNT/fio-rand-write
>    rw=randwrite
>    bs=4K
>    direct=1
>    numjobs=16
>    fallocate=none
>    time_based
>    runtime=90000
> 
>    [file1]
>    size=300G
>    ioengine=libaio
>    iodepth=16
> 
>    EOF
> 
>    umount $MNT &> /dev/null
>    mkfs.btrfs -f $MKFS_OPTIONS $DEV
>    mount $MOUNT_OPTIONS $DEV $MNT
> 
>    fio /tmp/fio-job.ini
>    umount $MNT
> 
> Monitoring the btrfs_extent_map slab while running the test with:
> 
>    $ watch -d -n 1 'cat /sys/kernel/slab/btrfs_extent_map/objects \
>                         /sys/kernel/slab/btrfs_extent_map/total_objects'
> 
> Shows the number of active and total extent maps skyrocketing to tens of
> millions, and on systems with a small amount of memory it's easy and quick
> to get into an OOM situation, as reported in that thread.
> 
> So to avoid this issue add a shrinker that will remove extent maps, as
> long as they are not pinned, and that takes proper care with any concurrent
> fsync to avoid missing extents (setting the full sync flag while in the
> middle of a fast fsync). This shrinker is triggered through the callbacks
> nr_cached_objects and free_cached_objects of struct super_operations.
> 
> The shrinker iterates over all roots and over all inodes of each
> root, and keeps track of the last scanned root and inode, so that the
> next time it runs, it starts from that root and from the next inode.
> This is similar to what xfs does for its inode reclaim (it implements
> those callbacks and cycles through inodes, starting from where it ended
> the last time).
> 
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

This is great, thanks Filipe!

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 64+ messages in thread
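
The resume-cursor scheme described in the last paragraph of the quoted
message boils down to the pattern sketched below. This is a rough sketch, not
the patch's code: example_next_root() and example_scan_root() are hypothetical
stand-ins, and only the two fs_info cursor fields come from the series:

   static long example_free_extent_maps(struct btrfs_fs_info *fs_info,
                                        long nr_to_scan)
   {
           long nr_dropped = 0;
           long scanned = 0;

           while (scanned < nr_to_scan) {
                   struct btrfs_root *root;

                   /* Find the first root at or after the saved root cursor. */
                   root = example_next_root(fs_info,
                                   fs_info->extent_map_shrinker_last_root);
                   if (!root) {
                           /* Whole fs scanned: wrap around on the next call. */
                           fs_info->extent_map_shrinker_last_root = 0;
                           fs_info->extent_map_shrinker_last_ino = 0;
                           break;
                   }

                   /*
                    * Scan this root's inodes starting at the saved inode
                    * cursor, dropping unpinned extent maps and advancing
                    * both cursors as it goes.
                    */
                   nr_dropped += example_scan_root(root, &scanned, nr_to_scan);
                   btrfs_put_root(root);
           }

           return nr_dropped;
   }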

* Re: [PATCH v3 10/10] btrfs: add tracepoints for extent map shrinker events
  2024-04-16 13:08   ` [PATCH v3 10/10] btrfs: add tracepoints for extent map shrinker events fdmanana
@ 2024-04-16 17:26     ` Josef Bacik
  2024-04-17 11:20     ` Johannes Thumshirn
  1 sibling, 0 replies; 64+ messages in thread
From: Josef Bacik @ 2024-04-16 17:26 UTC (permalink / raw)
  To: fdmanana; +Cc: linux-btrfs

On Tue, Apr 16, 2024 at 02:08:12PM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> Add some tracepoints for the extent map shrinker to help debug and analyse
> its main events. These have proved useful during development of the shrinker.
> 
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v3 01/10] btrfs: pass the extent map tree's inode to add_extent_mapping()
  2024-04-16 13:08   ` [PATCH v3 01/10] btrfs: pass the extent map tree's inode to add_extent_mapping() fdmanana
@ 2024-04-17 11:08     ` Johannes Thumshirn
  0 siblings, 0 replies; 64+ messages in thread
From: Johannes Thumshirn @ 2024-04-17 11:08 UTC (permalink / raw)
  To: fdmanana, linux-btrfs

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v3 02/10] btrfs: pass the extent map tree's inode to clear_em_logging()
  2024-04-16 13:08   ` [PATCH v3 02/10] btrfs: pass the extent map tree's inode to clear_em_logging() fdmanana
@ 2024-04-17 11:09     ` Johannes Thumshirn
  0 siblings, 0 replies; 64+ messages in thread
From: Johannes Thumshirn @ 2024-04-17 11:09 UTC (permalink / raw)
  To: fdmanana, linux-btrfs

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v3 03/10] btrfs: pass the extent map tree's inode to remove_extent_mapping()
  2024-04-16 13:08   ` [PATCH v3 03/10] btrfs: pass the extent map tree's inode to remove_extent_mapping() fdmanana
@ 2024-04-17 11:10     ` Johannes Thumshirn
  0 siblings, 0 replies; 64+ messages in thread
From: Johannes Thumshirn @ 2024-04-17 11:10 UTC (permalink / raw)
  To: fdmanana, linux-btrfs

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v3 04/10] btrfs: pass the extent map tree's inode to replace_extent_mapping()
  2024-04-16 13:08   ` [PATCH v3 04/10] btrfs: pass the extent map tree's inode to replace_extent_mapping() fdmanana
@ 2024-04-17 11:11     ` Johannes Thumshirn
  0 siblings, 0 replies; 64+ messages in thread
From: Johannes Thumshirn @ 2024-04-17 11:11 UTC (permalink / raw)
  To: fdmanana, linux-btrfs

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v3 05/10] btrfs: pass the extent map tree's inode to setup_extent_mapping()
  2024-04-16 13:08   ` [PATCH v3 05/10] btrfs: pass the extent map tree's inode to setup_extent_mapping() fdmanana
@ 2024-04-17 11:11     ` Johannes Thumshirn
  0 siblings, 0 replies; 64+ messages in thread
From: Johannes Thumshirn @ 2024-04-17 11:11 UTC (permalink / raw)
  To: fdmanana, linux-btrfs

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v3 06/10] btrfs: pass the extent map tree's inode to try_merge_map()
  2024-04-16 13:08   ` [PATCH v3 06/10] btrfs: pass the extent map tree's inode to try_merge_map() fdmanana
@ 2024-04-17 11:12     ` Johannes Thumshirn
  0 siblings, 0 replies; 64+ messages in thread
From: Johannes Thumshirn @ 2024-04-17 11:12 UTC (permalink / raw)
  To: fdmanana, linux-btrfs

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v3 07/10] btrfs: add a global per cpu counter to track number of used extent maps
  2024-04-16 13:08   ` [PATCH v3 07/10] btrfs: add a global per cpu counter to track number of used extent maps fdmanana
@ 2024-04-17 11:14     ` Johannes Thumshirn
  0 siblings, 0 replies; 64+ messages in thread
From: Johannes Thumshirn @ 2024-04-17 11:14 UTC (permalink / raw)
  To: fdmanana, linux-btrfs

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v3 10/10] btrfs: add tracepoints for extent map shrinker events
  2024-04-16 13:08   ` [PATCH v3 10/10] btrfs: add tracepoints for extent map shrinker events fdmanana
  2024-04-16 17:26     ` Josef Bacik
@ 2024-04-17 11:20     ` Johannes Thumshirn
  1 sibling, 0 replies; 64+ messages in thread
From: Johannes Thumshirn @ 2024-04-17 11:20 UTC (permalink / raw)
  To: fdmanana, linux-btrfs

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

^ permalink raw reply	[flat|nested] 64+ messages in thread

end of thread, other threads:[~2024-04-17 11:20 UTC | newest]

Thread overview: 64+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-04-10 11:28 [PATCH 00/11] btrfs: add a shrinker for extent maps fdmanana
2024-04-10 11:28 ` [PATCH 01/11] btrfs: pass an inode to btrfs_add_extent_mapping() fdmanana
2024-04-10 11:28 ` [PATCH 02/11] btrfs: tests: error out on unexpected extent map reference count fdmanana
2024-04-10 11:28 ` [PATCH 03/11] btrfs: simplify add_extent_mapping() by removing pointless label fdmanana
2024-04-10 11:28 ` [PATCH 04/11] btrfs: pass the extent map tree's inode to add_extent_mapping() fdmanana
2024-04-10 11:28 ` [PATCH 05/11] btrfs: pass the extent map tree's inode to clear_em_logging() fdmanana
2024-04-10 11:28 ` [PATCH 06/11] btrfs: pass the extent map tree's inode to remove_extent_mapping() fdmanana
2024-04-10 11:28 ` [PATCH 07/11] btrfs: pass the extent map tree's inode to replace_extent_mapping() fdmanana
2024-04-10 11:28 ` [PATCH 08/11] btrfs: add a global per cpu counter to track number of used extent maps fdmanana
2024-04-11  5:39   ` Qu Wenruo
2024-04-11 10:09     ` Filipe Manana
2024-04-10 11:28 ` [PATCH 09/11] btrfs: add a shrinker for " fdmanana
2024-04-11  5:58   ` Qu Wenruo
2024-04-11 10:15     ` Filipe Manana
2024-04-10 11:28 ` [PATCH 10/11] btrfs: update comment for btrfs_set_inode_full_sync() about locking fdmanana
2024-04-10 11:28 ` [PATCH 11/11] btrfs: add tracepoints for extent map shrinker events fdmanana
2024-04-11  5:25 ` [PATCH 00/11] btrfs: add a shrinker for extent maps Qu Wenruo
2024-04-11 16:18 ` [PATCH v2 00/15] " fdmanana
2024-04-11 16:18   ` [PATCH v2 01/15] btrfs: pass an inode to btrfs_add_extent_mapping() fdmanana
2024-04-11 16:18   ` [PATCH v2 02/15] btrfs: tests: error out on unexpected extent map reference count fdmanana
2024-04-11 16:18   ` [PATCH v2 03/15] btrfs: simplify add_extent_mapping() by removing pointless label fdmanana
2024-04-11 16:18   ` [PATCH v2 04/15] btrfs: pass the extent map tree's inode to add_extent_mapping() fdmanana
2024-04-11 16:18   ` [PATCH v2 05/15] btrfs: pass the extent map tree's inode to clear_em_logging() fdmanana
2024-04-11 16:19   ` [PATCH v2 06/15] btrfs: pass the extent map tree's inode to remove_extent_mapping() fdmanana
2024-04-11 16:19   ` [PATCH v2 07/15] btrfs: pass the extent map tree's inode to replace_extent_mapping() fdmanana
2024-04-11 16:19   ` [PATCH v2 08/15] btrfs: pass the extent map tree's inode to setup_extent_mapping() fdmanana
2024-04-11 23:25     ` Qu Wenruo
2024-04-11 16:19   ` [PATCH v2 09/15] btrfs: pass the extent map tree's inode to try_merge_map() fdmanana
2024-04-11 23:25     ` Qu Wenruo
2024-04-11 16:19   ` [PATCH v2 10/15] btrfs: add a global per cpu counter to track number of used extent maps fdmanana
2024-04-15  2:47     ` kernel test robot
2024-04-11 16:19   ` [PATCH v2 11/15] btrfs: export find_next_inode() as btrfs_find_first_inode() fdmanana
2024-04-11 23:14     ` Qu Wenruo
2024-04-11 16:19   ` [PATCH v2 12/15] btrfs: use btrfs_find_first_inode() at btrfs_prune_dentries() fdmanana
2024-04-11 23:15     ` Qu Wenruo
2024-04-11 16:19   ` [PATCH v2 13/15] btrfs: add a shrinker for extent maps fdmanana
2024-04-12 20:06     ` Josef Bacik
2024-04-13 11:07       ` Filipe Manana
2024-04-14 10:38         ` Filipe Manana
2024-04-14 13:02         ` Josef Bacik
2024-04-15 11:24           ` Filipe Manana
2024-04-11 16:19   ` [PATCH v2 14/15] btrfs: update comment for btrfs_set_inode_full_sync() about locking fdmanana
2024-04-11 16:19   ` [PATCH v2 15/15] btrfs: add tracepoints for extent map shrinker events fdmanana
2024-04-16 13:08 ` [PATCH v3 00/10] btrfs: add a shrinker for extent maps fdmanana
2024-04-16 13:08   ` [PATCH v3 01/10] btrfs: pass the extent map tree's inode to add_extent_mapping() fdmanana
2024-04-17 11:08     ` Johannes Thumshirn
2024-04-16 13:08   ` [PATCH v3 02/10] btrfs: pass the extent map tree's inode to clear_em_logging() fdmanana
2024-04-17 11:09     ` Johannes Thumshirn
2024-04-16 13:08   ` [PATCH v3 03/10] btrfs: pass the extent map tree's inode to remove_extent_mapping() fdmanana
2024-04-17 11:10     ` Johannes Thumshirn
2024-04-16 13:08   ` [PATCH v3 04/10] btrfs: pass the extent map tree's inode to replace_extent_mapping() fdmanana
2024-04-17 11:11     ` Johannes Thumshirn
2024-04-16 13:08   ` [PATCH v3 05/10] btrfs: pass the extent map tree's inode to setup_extent_mapping() fdmanana
2024-04-17 11:11     ` Johannes Thumshirn
2024-04-16 13:08   ` [PATCH v3 06/10] btrfs: pass the extent map tree's inode to try_merge_map() fdmanana
2024-04-17 11:12     ` Johannes Thumshirn
2024-04-16 13:08   ` [PATCH v3 07/10] btrfs: add a global per cpu counter to track number of used extent maps fdmanana
2024-04-17 11:14     ` Johannes Thumshirn
2024-04-16 13:08   ` [PATCH v3 08/10] btrfs: add a shrinker for " fdmanana
2024-04-16 17:25     ` Josef Bacik
2024-04-16 13:08   ` [PATCH v3 09/10] btrfs: update comment for btrfs_set_inode_full_sync() about locking fdmanana
2024-04-16 13:08   ` [PATCH v3 10/10] btrfs: add tracepoints for extent map shrinker events fdmanana
2024-04-16 17:26     ` Josef Bacik
2024-04-17 11:20     ` Johannes Thumshirn

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.