* [RFC PATCH v2 0/9] ext4: add free-space extent based allocator
@ 2020-08-21  1:55 Harshad Shirwadkar
  2020-08-21  1:55 ` [RFC PATCH v2 1/9] ext4: add handling for extended mount options Harshad Shirwadkar
                   ` (8 more replies)
  0 siblings, 9 replies; 14+ messages in thread
From: Harshad Shirwadkar @ 2020-08-21  1:55 UTC (permalink / raw)
  To: linux-ext4; +Cc: tytso, lyx1209, Harshad Shirwadkar, Harshad Shirwadkar

ext4: add free-space extent based allocator

This patchset introduces a new multiblock allocator for the ext4 file
system, based on an enhanced version of the free space management
mechanism originally proposed by Kadekodi S. and Jain S. in their
paper "Taking Linux Filesystems to the Space Age: Space Maps in Ext4"
[1].

High Level Design
=================

This allocator maintains free space information about the file system
in per-flexible-block-group trees. These trees are constructed from
the block bitmaps stored on disk, so this feature does not require any
on-disk format change. For every flexible block group, we maintain
individual free space nodes in two trees, one sorted by flex-bg offset
and the other sorted by length. This gives us two benefits: i) In the
allocation path, our search time complexity is O(log(number of free
space regions in the flex-bg)). ii) In the free path, we can merge
adjacent nodes in the same time complexity, thereby efficiently
keeping the trees small.
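
To make the shape of this concrete, each free extent is represented by
a single node linked into both trees at once, roughly like the sketch
below (see patch 3/9 for the actual definitions; the field types here
are our best reading of the code, not authoritative):

	struct ext4_frsp_node {
		struct rb_node	frsp_node;	/* linkage in offset-sorted tree */
		struct rb_node	frsp_len_node;	/* linkage in length-sorted tree */
		ext4_grpblk_t	frsp_offset;	/* start, as an offset in the flex-bg */
		ext4_grpblk_t	frsp_len;	/* extent length in blocks */
	};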

Along with the flexible block group level trees, we also introduce a
top-level meta-tree that indexes the individual flex-bg trees. This
tree is sorted by the length of the largest extent found in each
flex-bg. The key advantage of this tree is that, during an allocation
request, the allocator can consult it and jump directly (in
O(log(number of flex-bgs))) to a flexible block group which
_definitely_ has at least one (the largest) extent that can satisfy
the request. If no such flexible block group exists, the allocator can
immediately drop the CR level.

In order to preserve parallel allocation / free performance, the
allocator only *tries* to consult this tree. It does so by calling
read_trylock(), and if the meta-tree is busy, the allocator continues
its linear search until it is able to grab a read lock on the tree.
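
Conceptually, the consultation looks like the sketch below. The
meta-tree itself is introduced later in the series, so the lock and
helper names here (frsp_meta_lock, ext4_mb_frsp_meta_lookup) are
illustrative assumptions rather than the actual interface:

	if (read_trylock(&sbi->frsp_meta_lock)) {
		/* O(log(#flex-bgs)): find a flex-bg whose largest extent fits */
		tree = ext4_mb_frsp_meta_lookup(sbi, ac->ac_g_ex.fe_len);
		read_unlock(&sbi->frsp_meta_lock);
		if (!tree)
			return -ENOSPC;	/* no flex-bg can satisfy this CR */
	}
	/* else: meta-tree busy, fall back to the linear flex-bg scan */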

Memory Footprint
================

In a less fragmented file system, the memory footprint of the new
allocator is significantly lower than that of the buddy bitmaps. The
memory footprint of the freespace tree allocator can be checked by
running "cat /sys/fs/ext4/<dev>/frsp_tree_usage". For an almost newly
created 100G disk, the total usage is only ~10 KB. However, as the
fragmentation level increases, the memory footprint of this allocator
may increase. The memory usage of the freespace tree allocator can be
controlled using the sysfs tunable "mb_frsp_max_mem". Once the memory
threshold is reached, the allocator starts evicting freespace trees
from memory in LRU fashion. However, we don't remove a tree's entry
from the meta-tree. This allows the allocator to efficiently
reconstruct only the relevant trees from on-disk bitmaps under high
memory pressure. As we show in the evaluation section, the freespace
tree allocator still manages to outperform the buddy allocator in
memory crunch situations. The default value of mb_frsp_max_mem is
max(memory needed for buddy, maximum memory needed for one tree in the
worst case). Accounting for the maximum memory needed for one tree
allows us to keep at least one tree in memory even in the worst case,
which avoids thrashing.
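
A back-of-the-envelope version of that default, as a sketch only (the
helper buddy_mem_needed() and the exact node size are assumptions, not
the patchset's code):

	/*
	 * Worst case: free space alternates block by block, so one
	 * flex-bg needs at most one tree node per two blocks.
	 */
	blocks_per_flex = ext4_flex_bg_size(sbi) * sbi->s_blocks_per_group;
	worst_one_tree  = (blocks_per_flex / 2) * sizeof(struct ext4_frsp_node);
	mb_frsp_max_mem = max(buddy_mem_needed(sb), worst_one_tree);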

Caching Tree Metadata
=====================

In our experiments, we found that caching tree metadata (the largest
available extent in a tree) in the meta-tree significantly boosts
allocation performance. However, if the allocator waits for the cache
to fill up before starting to serve allocation requests, allocation
performance on large disks may degrade severely. Thus, it is important
to tune the tree caching behavior according to the underlying block
device. This patchset provides four cache aggression levels, which can
be specified via the mount-time parameter "frsp_cache_aggression" (a
mount example follows the table). The levels mean the following:

|-------+------------------------------------------------------------------|
| Level | Meaning                                                          |
|-------+------------------------------------------------------------------|
|     0 | Try to avoid caching as much as possible. In this mode           |
|       | the allocator tries hard to serve the request from the already   |
|       | cached trees.                                                    |
|-------+------------------------------------------------------------------|
|     1 | Avoid caching at CR=0. Otherwise, cache trees on every other     |
|       | allocation request. (default)                                    |
|-------+------------------------------------------------------------------|
|     2 | Cache trees on every other allocation request.                   |
|-------+------------------------------------------------------------------|
|     3 | Aggressive caching. In this mode the allocator aggressively      |
|       | caches uncached trees, even if the request can be fulfilled      |
|       | by one of the available trees. Using this mode, the allocator    |
|       | ends up caching trees quickly and thereby is able to make better |
|       | allocation decisions sooner.                                     |
|-------+------------------------------------------------------------------|
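
For example, to mount with aggressive caching (the freespace_tree
option itself is added in patch 3/9):

	# mount -t ext4 -o freespace_tree,frsp_cache_aggression=3 /dev/sdX /mnt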

Evaluation
==========

This feature did not introduce any regressions in the quick group of
the ext4 xfstests. We created a _highly_ fragmented 60T disk with over
490K block groups and over 30K flexible block groups. We measured the
write performance of a small first write (10MB) and a larger second
write (100MB) using both the buddy allocator and the freespace tree
allocator. The memory limit was turned off for the first four free
space tree configurations.

First Write Performance
-----------------------
|--------------------------------------+---------+---------+-----------+--------|
| Allocator                            | RAM     | # trees | Perf      | Time   |
|                                      | usage   |  cached |           |        |
|--------------------------------------+---------+---------+-----------+--------|
| Buddy Allocator (v5.7)               | -       |       - | 6.8 KB/s  | 25m51s |
|--------------------------------------+---------+---------+-----------+--------|
| With --prefetch_block_bitmap         | -       |       - | 28.1 KB/s | 6m14s  |
|--------------------------------------+---------+---------+-----------+--------|
| Free space tree allocator at level 0 | 43.5 MB |     201 | 2.6 MB/s  | 0m8s   |
|--------------------------------------+---------+---------+-----------+--------|
| Free space tree allocator at level 1 | 193 MB  |     874 | 1.5 MB/s  | 0m47s  |
|--------------------------------------+---------+---------+-----------+--------|
| Free space tree allocator at level 2 | 3.6 GB  |   16476 | 21.4 KB/s | 8m14s  |
|--------------------------------------+---------+---------+-----------+--------|
| Free space tree allocator at level 3 | 6.8 GB  |   30720 | 9.1 KB/s  | 19m22s |
|--------------------------------------+---------+---------+-----------+--------|
| Free space tree allocator at level 3 | 1023 MB |   30720 | 9.3 KB/s  | 18m53s |
| ( With 1G Limit)                     |         |         |           |        |
|--------------------------------------+---------+---------+-----------+--------|

Second Write Performance
------------------------
|--------------------------------------+---------+---------+-----------+-------|
| Allocator                            | RAM     | # trees | Perf      | Time  |
|                                      | usage   |  cached |           |       |
|--------------------------------------+---------+---------+-----------+-------|
| Buddy Allocator (v5.7)               | -       |       - | 499 KB/s  | 3m30s |
|--------------------------------------+---------+---------+-----------+-------|
| With --prefetch_block_bitmap         | -       |       - | 185 KB/s  | 9m26s |
|--------------------------------------+---------+---------+-----------+-------|
| Free space tree allocator at level 0 | 48.7 MB |     226 | 48.2 MB/s | 6s    |
|--------------------------------------+---------+---------+-----------+-------|
| Free space tree allocator at level 1 | 221 MB  |    1007 | 26.8 MB/s | 8s    |
|--------------------------------------+---------+---------+-----------+-------|
| Free space tree allocator at level 2 | 6.8 GB  |   30720 | 178 KB/s  | 9m54s |
|--------------------------------------+---------+---------+-----------+-------|
| Free space tree allocator at level 3 | 6.8 GB  |   30720 | 79 MB/s   | 5s    |
|--------------------------------------+---------+---------+-----------+-------|
| Free space tree allocator at level 3 | 1023 MB |   30720 | 77.1 MB/s | 7s    |
| ( With 1G Limit)                     |         |         |           |       |
|--------------------------------------+---------+---------+-----------+-------|

We verified that parallel write performance on a newly created disk is
not significantly different between the buddy allocator and the
freespace tree allocator. This patchset applies on top of the block
bitmap prefetch patches.

Signed-off-by: Yuexin Li <lyx1209@gmail.com>
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>

[1] https://www.kernel.org/doc/ols/2010/ols2010-pages-121-132.pdf

Changes Since V1:

- Rebased the patchset on top of ext4 dev branch
- Converted the patches to RFC

Harshad Shirwadkar (8):
  ext4: add handling for extended mount options
  ext4: rename ext4_mb_load_buddy to ext4_mb_load_allocator
  ext4: add prefetching support for freespace trees
  ext4: add freespace tree optimizations
  ext4: add memory usage tracker for freespace trees
  ext4: add LRU eviction for free space trees
  ext4: add tracepoints for free space trees
  ext4: add freespace trees documentation in code

Yuexin Li (1):
  ext4: add freespace tree allocator

 fs/ext4/ext4.h              |  117 +++
 fs/ext4/mballoc.c           | 1541 +++++++++++++++++++++++++++++++++--
 fs/ext4/mballoc.h           |   67 +-
 fs/ext4/resize.c            |    8 +
 fs/ext4/super.c             |   60 +-
 fs/ext4/sysfs.c             |   11 +
 include/trace/events/ext4.h |  112 +++
 7 files changed, 1821 insertions(+), 95 deletions(-)

-- 
2.28.0.297.g1956fa8f8d-goog



* [RFC PATCH v2 1/9] ext4: add handling for extended mount options
  2020-08-21  1:55 [RFC PATCH v2 0/9] ext4: add free-space extent based allocator Harshad Shirwadkar
@ 2020-08-21  1:55 ` Harshad Shirwadkar
  2020-08-21  1:55 ` [RFC PATCH v2 2/9] ext4: rename ext4_mb_load_buddy to ext4_mb_load_allocator Harshad Shirwadkar
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 14+ messages in thread
From: Harshad Shirwadkar @ 2020-08-21  1:55 UTC (permalink / raw)
  To: linux-ext4; +Cc: tytso, lyx1209, Harshad Shirwadkar, Andreas Dilger

From: Harshad Shirwadkar <harshadshirwadkar@gmail.com>

We are running out of mount option bits. Add handling for using
s_mount_opt2. This patch was originally submitted as part of the fast
commit patch series; it is resent here with this patch series too.
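
As an illustration, an option stored in s_mount_opt2 carries the new
MOPT_2 flag in its mount_opts entry, roughly like the hypothetical
entry below (the real user of MOPT_2 in this series is the
freespace_tree option added in patch 3/9):

	{Opt_freespace_tree, EXT4_MOUNT2_FREESPACE_TREE,
	 MOPT_SET | MOPT_2 | MOPT_EXT4_ONLY},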

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
---
 fs/ext4/super.c | 16 ++++++++++++----
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 13bdddc081e0..656eccf6fc9b 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1738,6 +1738,7 @@ static int clear_qf_name(struct super_block *sb, int qtype)
 #define MOPT_EXT4_ONLY	(MOPT_NO_EXT2 | MOPT_NO_EXT3)
 #define MOPT_STRING	0x0400
 #define MOPT_SKIP	0x0800
+#define MOPT_2		0x1000
 
 static const struct mount_opts {
 	int	token;
@@ -2207,10 +2208,17 @@ static int handle_mount_opt(struct super_block *sb, char *opt, int token,
 			WARN_ON(1);
 			return -1;
 		}
-		if (arg != 0)
-			sbi->s_mount_opt |= m->mount_opt;
-		else
-			sbi->s_mount_opt &= ~m->mount_opt;
+		if (m->flags & MOPT_2) {
+			if (arg != 0)
+				sbi->s_mount_opt2 |= m->mount_opt;
+			else
+				sbi->s_mount_opt2 &= ~m->mount_opt;
+		} else {
+			if (arg != 0)
+				sbi->s_mount_opt |= m->mount_opt;
+			else
+				sbi->s_mount_opt &= ~m->mount_opt;
+		}
 	}
 	return 1;
 }
-- 
2.28.0.297.g1956fa8f8d-goog



* [RFC PATCH v2 2/9] ext4: rename ext4_mb_load_buddy to ext4_mb_load_allocator
  2020-08-21  1:55 [RFC PATCH v2 0/9] ext4: add free-space extent based allocator Harshad Shirwadkar
  2020-08-21  1:55 ` [RFC PATCH v2 1/9] ext4: add handling for extended mount options Harshad Shirwadkar
@ 2020-08-21  1:55 ` Harshad Shirwadkar
  2020-08-21  1:55 ` [RFC PATCH v2 3/9] ext4: add freespace tree allocator Harshad Shirwadkar
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 14+ messages in thread
From: Harshad Shirwadkar @ 2020-08-21  1:55 UTC (permalink / raw)
  To: linux-ext4; +Cc: tytso, lyx1209, Harshad Shirwadkar

From: Harshad Shirwadkar <harshadshirwadkar@gmail.com>

This patch renames ext4_mb_load_buddy and ext4_mb_unload_buddy to
ext4_mb_load_allocator and ext4_mb_unload_allocator. It also adds a
flags argument to ext4_mb_load_allocator, which is currently unused.
This significantly reduces the size of the following patch, "ext4: add
freespace tree allocator". In the interest of keeping this patchset
shorter, I have not renamed the ext4_buddy structure and the e4b
variable names, but I have added that as a TODO item.

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
---
 fs/ext4/mballoc.c | 86 ++++++++++++++++++++++++-----------------------
 1 file changed, 44 insertions(+), 42 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 132c118d12e1..48d791304bf1 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -29,6 +29,7 @@
  *   - don't normalize tails
  *   - quota
  *   - reservation for superuser
+ *   - rename ext4_buddy to ext4_allocator and e4b variables to allocator
  *
  * TODO v3:
  *   - bitmap read-ahead (proposed by Oleg Drokin aka green)
@@ -92,7 +93,7 @@
  * mapped to the buddy and bitmap information regarding different
  * groups. The buddy information is attached to buddy cache inode so that
  * we can access them through the page cache. The information regarding
- * each group is loaded via ext4_mb_load_buddy.  The information involve
+ * each group is loaded via ext4_mb_load_allocator.  The information involve
  * block bitmap and buddy information. The information are stored in the
  * inode as:
  *
@@ -845,7 +846,7 @@ static void mb_regenerate_buddy(struct ext4_buddy *e4b)
 
 /* The buddy information is attached the buddy cache inode
  * for convenience. The information regarding each group
- * is loaded via ext4_mb_load_buddy. The information involve
+ * is loaded via ext4_mb_load_allocator. The information involve
  * block bitmap and buddy information. The information are
  * stored in the inode as
  *
@@ -1105,7 +1106,7 @@ int ext4_mb_init_group(struct super_block *sb, ext4_group_t group, gfp_t gfp)
 	 * This ensures that we don't reinit the buddy cache
 	 * page which map to the group from which we are already
 	 * allocating. If we are looking at the buddy cache we would
-	 * have taken a reference using ext4_mb_load_buddy and that
+	 * have taken a reference using ext4_mb_load_allocator and that
 	 * would have pinned buddy page to page cache.
 	 * The call to ext4_mb_get_buddy_page_lock will mark the
 	 * page accessed.
@@ -1157,8 +1158,8 @@ int ext4_mb_init_group(struct super_block *sb, ext4_group_t group, gfp_t gfp)
  * calling this routine!
  */
 static noinline_for_stack int
-ext4_mb_load_buddy_gfp(struct super_block *sb, ext4_group_t group,
-		       struct ext4_buddy *e4b, gfp_t gfp)
+ext4_mb_load_allocator_gfp(struct super_block *sb, ext4_group_t group,
+				struct ext4_buddy *e4b, gfp_t gfp, int flags)
 {
 	int blocks_per_page;
 	int block;
@@ -1293,13 +1294,13 @@ ext4_mb_load_buddy_gfp(struct super_block *sb, ext4_group_t group,
 	return ret;
 }
 
-static int ext4_mb_load_buddy(struct super_block *sb, ext4_group_t group,
-			      struct ext4_buddy *e4b)
+static int ext4_mb_load_allocator(struct super_block *sb, ext4_group_t group,
+			      struct ext4_buddy *e4b, int flags)
 {
-	return ext4_mb_load_buddy_gfp(sb, group, e4b, GFP_NOFS);
+	return ext4_mb_load_allocator_gfp(sb, group, e4b, GFP_NOFS, flags);
 }
 
-static void ext4_mb_unload_buddy(struct ext4_buddy *e4b)
+static void ext4_mb_unload_allocator(struct ext4_buddy *e4b)
 {
 	if (e4b->bd_bitmap_page)
 		put_page(e4b->bd_bitmap_page);
@@ -1859,7 +1860,7 @@ int ext4_mb_try_best_found(struct ext4_allocation_context *ac,
 	int err;
 
 	BUG_ON(ex.fe_len <= 0);
-	err = ext4_mb_load_buddy(ac->ac_sb, group, e4b);
+	err = ext4_mb_load_allocator(ac->ac_sb, group, e4b, 0);
 	if (err)
 		return err;
 
@@ -1872,7 +1873,7 @@ int ext4_mb_try_best_found(struct ext4_allocation_context *ac,
 	}
 
 	ext4_unlock_group(ac->ac_sb, group);
-	ext4_mb_unload_buddy(e4b);
+	ext4_mb_unload_allocator(e4b);
 
 	return 0;
 }
@@ -1893,12 +1894,12 @@ int ext4_mb_find_by_goal(struct ext4_allocation_context *ac,
 	if (grp->bb_free == 0)
 		return 0;
 
-	err = ext4_mb_load_buddy(ac->ac_sb, group, e4b);
+	err = ext4_mb_load_allocator(ac->ac_sb, group, e4b, 0);
 	if (err)
 		return err;
 
 	if (unlikely(EXT4_MB_GRP_BBITMAP_CORRUPT(e4b->bd_info))) {
-		ext4_mb_unload_buddy(e4b);
+		ext4_mb_unload_allocator(e4b);
 		return 0;
 	}
 
@@ -1936,7 +1937,7 @@ int ext4_mb_find_by_goal(struct ext4_allocation_context *ac,
 		ext4_mb_use_best_found(ac, e4b);
 	}
 	ext4_unlock_group(ac->ac_sb, group);
-	ext4_mb_unload_buddy(e4b);
+	ext4_mb_unload_allocator(e4b);
 
 	return 0;
 }
@@ -2417,7 +2418,7 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 				continue;
 			}
 
-			err = ext4_mb_load_buddy(sb, group, &e4b);
+			err = ext4_mb_load_allocator(sb, group, &e4b, 0);
 			if (err)
 				goto out;
 
@@ -2430,7 +2431,7 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 			ret = ext4_mb_good_group(ac, group, cr);
 			if (ret == 0) {
 				ext4_unlock_group(sb, group);
-				ext4_mb_unload_buddy(&e4b);
+				ext4_mb_unload_allocator(&e4b);
 				continue;
 			}
 
@@ -2444,7 +2445,7 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 				ext4_mb_complex_scan_group(ac, &e4b);
 
 			ext4_unlock_group(sb, group);
-			ext4_mb_unload_buddy(&e4b);
+			ext4_mb_unload_allocator(&e4b);
 
 			if (ac->ac_status != AC_STATUS_CONTINUE)
 				break;
@@ -2543,7 +2544,7 @@ static int ext4_mb_seq_groups_show(struct seq_file *seq, void *v)
 	grinfo = ext4_get_group_info(sb, group);
 	/* Load the group info in memory only if not already loaded. */
 	if (unlikely(EXT4_MB_GRP_NEED_INIT(grinfo))) {
-		err = ext4_mb_load_buddy(sb, group, &e4b);
+		err = ext4_mb_load_allocator(sb, group, &e4b, 0);
 		if (err) {
 			seq_printf(seq, "#%-5u: I/O error\n", group);
 			return 0;
@@ -2554,7 +2555,7 @@ static int ext4_mb_seq_groups_show(struct seq_file *seq, void *v)
 	memcpy(&sg, ext4_get_group_info(sb, group), i);
 
 	if (buddy_loaded)
-		ext4_mb_unload_buddy(&e4b);
+		ext4_mb_unload_allocator(&e4b);
 
 	seq_printf(seq, "#%-5u: %-5u %-5u %-5u [", group, sg.info.bb_free,
 			sg.info.bb_fragments, sg.info.bb_first_free);
@@ -3049,7 +3050,7 @@ static void ext4_free_data_in_buddy(struct super_block *sb,
 	mb_debug(sb, "gonna free %u blocks in group %u (0x%p):",
 		 entry->efd_count, entry->efd_group, entry);
 
-	err = ext4_mb_load_buddy(sb, entry->efd_group, &e4b);
+	err = ext4_mb_load_allocator(sb, entry->efd_group, &e4b, 0);
 	/* we expect to find existing buddy because it's pinned */
 	BUG_ON(err != 0);
 
@@ -3084,7 +3085,7 @@ static void ext4_free_data_in_buddy(struct super_block *sb,
 	}
 	ext4_unlock_group(sb, entry->efd_group);
 	kmem_cache_free(ext4_free_data_cachep, entry);
-	ext4_mb_unload_buddy(&e4b);
+	ext4_mb_unload_allocator(&e4b);
 
 	mb_debug(sb, "freed %d blocks in %d structures\n", count,
 		 count2);
@@ -3558,12 +3559,13 @@ static void ext4_discard_allocated_blocks(struct ext4_allocation_context *ac)
 	if (pa == NULL) {
 		if (ac->ac_f_ex.fe_len == 0)
 			return;
-		err = ext4_mb_load_buddy(ac->ac_sb, ac->ac_f_ex.fe_group, &e4b);
+		err = ext4_mb_load_allocator(ac->ac_sb, ac->ac_f_ex.fe_group,
+						&e4b, 0);
 		if (err) {
 			/*
 			 * This should never happen since we pin the
 			 * pages in the ext4_allocation_context so
-			 * ext4_mb_load_buddy() should never fail.
+			 * ext4_mb_load_allocator() should never fail.
 			 */
 			WARN(1, "mb_load_buddy failed (%d)", err);
 			return;
@@ -3572,7 +3574,7 @@ static void ext4_discard_allocated_blocks(struct ext4_allocation_context *ac)
 		mb_free_blocks(ac->ac_inode, &e4b, ac->ac_f_ex.fe_start,
 			       ac->ac_f_ex.fe_len);
 		ext4_unlock_group(ac->ac_sb, ac->ac_f_ex.fe_group);
-		ext4_mb_unload_buddy(&e4b);
+		ext4_mb_unload_allocator(&e4b);
 		return;
 	}
 	if (pa->pa_type == MB_INODE_PA)
@@ -4175,7 +4177,7 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
 		goto out_dbg;
 	}
 
-	err = ext4_mb_load_buddy(sb, group, &e4b);
+	err = ext4_mb_load_allocator(sb, group, &e4b, 0);
 	if (err) {
 		ext4_warning(sb, "Error %d loading buddy information for %u",
 			     err, group);
@@ -4250,7 +4252,7 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
 
 out:
 	ext4_unlock_group(sb, group);
-	ext4_mb_unload_buddy(&e4b);
+	ext4_mb_unload_allocator(&e4b);
 	put_bh(bitmap_bh);
 out_dbg:
 	mb_debug(sb, "discarded (%d) blocks preallocated for group %u bb_free (%d)\n",
@@ -4347,8 +4349,8 @@ void ext4_discard_preallocations(struct inode *inode, unsigned int needed)
 		BUG_ON(pa->pa_type != MB_INODE_PA);
 		group = ext4_get_group_number(sb, pa->pa_pstart);
 
-		err = ext4_mb_load_buddy_gfp(sb, group, &e4b,
-					     GFP_NOFS|__GFP_NOFAIL);
+		err = ext4_mb_load_allocator_gfp(sb, group, &e4b,
+						GFP_NOFS|__GFP_NOFAIL, 0);
 		if (err) {
 			ext4_error_err(sb, -err, "Error %d loading buddy information for %u",
 				       err, group);
@@ -4360,7 +4362,7 @@ void ext4_discard_preallocations(struct inode *inode, unsigned int needed)
 			err = PTR_ERR(bitmap_bh);
 			ext4_error_err(sb, -err, "Error %d reading block bitmap for %u",
 				       err, group);
-			ext4_mb_unload_buddy(&e4b);
+			ext4_mb_unload_allocator(&e4b);
 			continue;
 		}
 
@@ -4369,7 +4371,7 @@ void ext4_discard_preallocations(struct inode *inode, unsigned int needed)
 		ext4_mb_release_inode_pa(&e4b, bitmap_bh, pa);
 		ext4_unlock_group(sb, group);
 
-		ext4_mb_unload_buddy(&e4b);
+		ext4_mb_unload_allocator(&e4b);
 		put_bh(bitmap_bh);
 
 		list_del(&pa->u.pa_tmp_list);
@@ -4642,8 +4644,8 @@ ext4_mb_discard_lg_preallocations(struct super_block *sb,
 		int err;
 
 		group = ext4_get_group_number(sb, pa->pa_pstart);
-		err = ext4_mb_load_buddy_gfp(sb, group, &e4b,
-					     GFP_NOFS|__GFP_NOFAIL);
+		err = ext4_mb_load_allocator_gfp(sb, group, &e4b,
+						GFP_NOFS|__GFP_NOFAIL, 0);
 		if (err) {
 			ext4_error_err(sb, -err, "Error %d loading buddy information for %u",
 				       err, group);
@@ -4654,7 +4656,7 @@ ext4_mb_discard_lg_preallocations(struct super_block *sb,
 		ext4_mb_release_group_pa(&e4b, pa);
 		ext4_unlock_group(sb, group);
 
-		ext4_mb_unload_buddy(&e4b);
+		ext4_mb_unload_allocator(&e4b);
 		list_del(&pa->u.pa_tmp_list);
 		call_rcu(&(pa)->u.pa_rcu, ext4_mb_pa_callback);
 	}
@@ -5241,8 +5243,8 @@ void ext4_free_blocks(handle_t *handle, struct inode *inode,
 	trace_ext4_mballoc_free(sb, inode, block_group, bit, count_clusters);
 
 	/* __GFP_NOFAIL: retry infinitely, ignore TIF_MEMDIE and memcg limit. */
-	err = ext4_mb_load_buddy_gfp(sb, block_group, &e4b,
-				     GFP_NOFS|__GFP_NOFAIL);
+	err = ext4_mb_load_allocator_gfp(sb, block_group, &e4b,
+					GFP_NOFS|__GFP_NOFAIL, 0);
 	if (err)
 		goto error_return;
 
@@ -5316,7 +5318,7 @@ void ext4_free_blocks(handle_t *handle, struct inode *inode,
 				   count_clusters);
 	}
 
-	ext4_mb_unload_buddy(&e4b);
+	ext4_mb_unload_allocator(&e4b);
 
 	/* We dirtied the bitmap block */
 	BUFFER_TRACE(bitmap_bh, "dirtied bitmap block");
@@ -5434,7 +5436,7 @@ int ext4_group_add_blocks(handle_t *handle, struct super_block *sb,
 		}
 	}
 
-	err = ext4_mb_load_buddy(sb, block_group, &e4b);
+	err = ext4_mb_load_allocator(sb, block_group, &e4b, 0);
 	if (err)
 		goto error_return;
 
@@ -5462,7 +5464,7 @@ int ext4_group_add_blocks(handle_t *handle, struct super_block *sb,
 						  flex_group)->free_clusters);
 	}
 
-	ext4_mb_unload_buddy(&e4b);
+	ext4_mb_unload_allocator(&e4b);
 
 	/* We dirtied the bitmap block */
 	BUFFER_TRACE(bitmap_bh, "dirtied bitmap block");
@@ -5550,7 +5552,7 @@ ext4_trim_all_free(struct super_block *sb, ext4_group_t group,
 
 	trace_ext4_trim_all_free(sb, group, start, max);
 
-	ret = ext4_mb_load_buddy(sb, group, &e4b);
+	ret = ext4_mb_load_allocator(sb, group, &e4b, 0);
 	if (ret) {
 		ext4_warning(sb, "Error %d loading buddy information for %u",
 			     ret, group);
@@ -5604,7 +5606,7 @@ ext4_trim_all_free(struct super_block *sb, ext4_group_t group,
 	}
 out:
 	ext4_unlock_group(sb, group);
-	ext4_mb_unload_buddy(&e4b);
+	ext4_mb_unload_allocator(&e4b);
 
 	ext4_debug("trimmed %d blocks in the group %d\n",
 		count, group);
@@ -5718,7 +5720,7 @@ ext4_mballoc_query_range(
 	struct ext4_buddy		e4b;
 	int				error;
 
-	error = ext4_mb_load_buddy(sb, group, &e4b);
+	error = ext4_mb_load_allocator(sb, group, &e4b, 0);
 	if (error)
 		return error;
 	bitmap = e4b.bd_bitmap;
@@ -5747,7 +5749,7 @@ ext4_mballoc_query_range(
 
 	ext4_unlock_group(sb, group);
 out_unload:
-	ext4_mb_unload_buddy(&e4b);
+	ext4_mb_unload_allocator(&e4b);
 
 	return error;
 }
-- 
2.28.0.297.g1956fa8f8d-goog



* [RFC PATCH v2 3/9] ext4: add freespace tree allocator
  2020-08-21  1:55 [RFC PATCH v2 0/9] ext4: add free-space extent based allocator Harshad Shirwadkar
  2020-08-21  1:55 ` [RFC PATCH v2 1/9] ext4: add handling for extended mount options Harshad Shirwadkar
  2020-08-21  1:55 ` [RFC PATCH v2 2/9] ext4: rename ext4_mb_load_buddy to ext4_mb_load_allocator Harshad Shirwadkar
@ 2020-08-21  1:55 ` Harshad Shirwadkar
  2020-08-21  4:06   ` kernel test robot
  2020-08-21  7:04   ` kernel test robot
  2020-08-21  1:55 ` [RFC PATCH v2 4/9] ext4: add prefetching support for freespace trees Harshad Shirwadkar
                   ` (5 subsequent siblings)
  8 siblings, 2 replies; 14+ messages in thread
From: Harshad Shirwadkar @ 2020-08-21  1:55 UTC (permalink / raw)
  To: linux-ext4; +Cc: tytso, lyx1209, Harshad Shirwadkar

From: Yuexin Li <lyx1209@gmail.com>

This patch adds a new freespace tree allocator. The allocator
organizes free space regions in a pair of in-memory rbtrees per
flex-bg, one sorted by length and the other by offset. We use these
per-flex-bg trees to quickly scan length-sorted free space regions and
determine the best region for a given request. This feature can be
enabled by passing the "-o freespace_tree" mount option.

Signed-off-by: Yuexin Li <lyx1209@gmail.com>
Co-developed-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
---
 fs/ext4/ext4.h    |  45 +++
 fs/ext4/mballoc.c | 984 ++++++++++++++++++++++++++++++++++++++++++++--
 fs/ext4/mballoc.h |  61 ++-
 fs/ext4/resize.c  |   8 +
 fs/ext4/super.c   |  35 +-
 5 files changed, 1084 insertions(+), 49 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 523e00d7b392..513c8473f45f 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -363,6 +363,7 @@ struct flex_groups {
 	atomic64_t	free_clusters;
 	atomic_t	free_inodes;
 	atomic_t	used_dirs;
+	struct ext4_frsp_tree	*frsp_tree;
 };
 
 #define EXT4_BG_INODE_UNINIT	0x0001 /* Inode table/bitmap not in use */
@@ -1214,6 +1215,12 @@ struct ext4_inode_info {
 #define EXT4_MOUNT2_EXPLICIT_JOURNAL_CHECKSUM	0x00000008 /* User explicitly
 						specified journal checksum */
 
+
+#define EXT4_MOUNT2_FREESPACE_TREE	0x00000020 /* Enable freespace tree
+						    * allocator (turns buddy
+						    * allocator off)
+						    */
+
 #define clear_opt(sb, opt)		EXT4_SB(sb)->s_mount_opt &= \
 						~EXT4_MOUNT_##opt
 #define set_opt(sb, opt)		EXT4_SB(sb)->s_mount_opt |= \
@@ -1419,6 +1426,30 @@ struct ext4_super_block {
 #define ext4_has_strict_mode(sbi) \
 	(sbi->s_encoding_flags & EXT4_ENC_STRICT_MODE_FL)
 
+/*
+ * Freespace tree structure
+ */
+struct ext4_frsp_tree {
+	struct rb_root_cached frsp_offset_root;		/* Root for offset
+							 * sorted tree
+							 */
+	struct rb_root_cached frsp_len_root;		/* Root for Length
+							 * sorted tree
+							 */
+	struct mutex frsp_lock;
+	__u8 frsp_flags;
+	__u32 frsp_max_free_len;			/* Max free extent in
+							 * this flex bg
+							 */
+	__u32 frsp_index;				/* Tree index (flex bg
+							 * number)
+							 */
+};
+
+/* Freespace tree flags */
+
+/* Tree is loaded in memory */
+#define EXT4_MB_FRSP_FLAG_LOADED			0x0001
 /*
  * fourth extended-fs super-block data in memory
  */
@@ -1557,6 +1588,9 @@ struct ext4_sb_info {
 	struct flex_groups * __rcu *s_flex_groups;
 	ext4_group_t s_flex_groups_allocated;
 
+	/* freespace trees stuff */
+	int s_mb_num_frsp_trees;
+
 	/* workqueue for reserved extent conversions (buffered io) */
 	struct workqueue_struct *rsv_conversion_wq;
 
@@ -2676,6 +2710,7 @@ extern int ext4_init_inode_table(struct super_block *sb,
 extern void ext4_end_bitmap_read(struct buffer_head *bh, int uptodate);
 
 /* mballoc.c */
+extern int ext4_mb_add_frsp_trees(struct super_block *sb, int ngroups);
 extern const struct seq_operations ext4_mb_seq_groups_ops;
 extern long ext4_mb_stats;
 extern long ext4_mb_max_to_scan;
@@ -3100,6 +3135,16 @@ static inline ext4_group_t ext4_flex_group(struct ext4_sb_info *sbi,
 	return block_group >> sbi->s_log_groups_per_flex;
 }
 
+static inline struct ext4_frsp_tree *
+ext4_get_frsp_tree(struct super_block *sb, ext4_group_t flex_bg)
+{
+	struct flex_groups *fg;
+
+	fg = sbi_array_rcu_deref(EXT4_SB(sb), s_flex_groups, flex_bg);
+
+	return fg->frsp_tree;
+}
+
 static inline unsigned int ext4_flex_bg_size(struct ext4_sb_info *sbi)
 {
 	return 1 << sbi->s_log_groups_per_flex;
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 48d791304bf1..4b1c405543f0 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -333,6 +333,7 @@
 static struct kmem_cache *ext4_pspace_cachep;
 static struct kmem_cache *ext4_ac_cachep;
 static struct kmem_cache *ext4_free_data_cachep;
+static struct kmem_cache *ext4_freespace_node_cachep;
 
 /* We create slab caches for groupinfo data structures based on the
  * superblock block size.  There will be one per mounted filesystem for
@@ -352,6 +353,11 @@ static void ext4_mb_generate_from_freelist(struct super_block *sb, void *bitmap,
 						ext4_group_t group);
 static void ext4_mb_new_preallocation(struct ext4_allocation_context *ac);
 
+static int ext4_mb_load_allocator(struct super_block *sb, ext4_group_t group,
+			      struct ext4_buddy *e4b, int flags);
+static int mb_mark_used(struct ext4_buddy *e4b, struct ext4_free_extent *ex);
+static void ext4_mb_unload_allocator(struct ext4_buddy *e4b);
+
 /*
  * The algorithm using this percpu seq counter goes below:
  * 1. We sample the percpu discard_pa_seq counter before trying for block
@@ -395,6 +401,33 @@ static inline void *mb_correct_addr_and_bit(int *bit, void *addr)
 	return addr;
 }
 
+static inline int ext4_blkno_to_flex_offset(struct super_block *sb,
+			ext4_fsblk_t blkno)
+{
+	return blkno % (ext4_flex_bg_size(EXT4_SB(sb)) *
+				EXT4_SB(sb)->s_blocks_per_group);
+}
+
+static inline ext4_fsblk_t ext4_flex_offset_to_blkno(struct super_block *sb,
+			int flex_bg, int flex_offset)
+{
+	return flex_bg * ext4_flex_bg_size(EXT4_SB(sb)) *
+		EXT4_SB(sb)->s_blocks_per_group + flex_offset;
+}
+
+static inline int ext4_mb_frsp_on(struct super_block *sb)
+{
+	return test_opt2(sb, FREESPACE_TREE) &&
+			EXT4_SB(sb)->s_es->s_log_groups_per_flex;
+}
+
+static inline unsigned int ext4_num_grps_to_flexbg(struct super_block *sb,
+							int ngroups)
+{
+	return (ngroups + ext4_flex_bg_size(EXT4_SB(sb)) - 1) >>
+			(EXT4_SB(sb)->s_es->s_log_groups_per_flex);
+}
+
 static inline int mb_test_bit(int bit, void *addr)
 {
 	/*
@@ -453,6 +486,7 @@ static void *mb_find_buddy(struct ext4_buddy *e4b, int order, int *max)
 {
 	char *bb;
 
+	WARN_ON(ext4_mb_frsp_on(e4b->bd_sb));
 	BUG_ON(e4b->bd_bitmap == e4b->bd_buddy);
 	BUG_ON(max == NULL);
 
@@ -620,6 +654,9 @@ static int __mb_check_buddy(struct ext4_buddy *e4b, char *file,
 	void *buddy;
 	void *buddy2;
 
+	if (ext4_mb_frsp_on(sb))
+		return 0;
+
 	{
 		static int mb_check_counter;
 		if (mb_check_counter++ % 100 != 0)
@@ -706,6 +743,729 @@ static int __mb_check_buddy(struct ext4_buddy *e4b, char *file,
 #define mb_check_buddy(e4b)
 #endif
 
+/* Generic macro for inserting a new node in cached rbtree */
+#define ext4_mb_rb_insert(__root, __new, __node_t, __entry, __cmp)	do {	\
+	struct rb_node **iter = &((__root)->rb_root.rb_node), *parent = NULL;	\
+	bool leftmost = true;							\
+	__node_t *this = NULL;							\
+										\
+	while (*iter) {								\
+		this = rb_entry(*iter, __node_t, __entry);			\
+		parent = *iter;							\
+		if (__cmp((__new), this)) {					\
+			iter = &((*iter)->rb_left);				\
+		} else {							\
+			iter = &((*iter)->rb_right);				\
+			leftmost = false;					\
+		}								\
+	}									\
+	rb_link_node(&(__new)->__entry, parent, iter);				\
+	rb_insert_color_cached(&(__new)->__entry, __root, leftmost);		\
+} while (0)
+
+static inline int ext4_mb_frsp_offset_cmp(struct ext4_frsp_node *arg1,
+					  struct ext4_frsp_node *arg2)
+{
+	return (arg1->frsp_offset < arg2->frsp_offset);
+}
+
+/*
+ * Free space extents length sorting function, the nodes are sorted in
+ * decreasing order of length
+ */
+static inline int ext4_mb_frsp_len_cmp(struct ext4_frsp_node *arg1,
+				       struct ext4_frsp_node *arg2)
+{
+	return (arg1->frsp_len > arg2->frsp_len);
+}
+
+/* insert to offset-indexed tree */
+static void ext4_mb_frsp_insert_off(struct ext4_frsp_tree *tree,
+				struct ext4_frsp_node *new_entry)
+{
+	memset(&new_entry->frsp_node, 0, sizeof(new_entry->frsp_node));
+	ext4_mb_rb_insert(&tree->frsp_offset_root, new_entry,
+		struct ext4_frsp_node, frsp_node, ext4_mb_frsp_offset_cmp);
+}
+
+/* insert to tree sorted by length */
+static void ext4_mb_frsp_insert_len(struct ext4_frsp_tree *tree,
+				struct ext4_frsp_node *new_entry)
+{
+	memset(&new_entry->frsp_len_node, 0, sizeof(new_entry->frsp_len_node));
+	ext4_mb_rb_insert(&tree->frsp_len_root, new_entry,
+		struct ext4_frsp_node, frsp_len_node,
+		ext4_mb_frsp_len_cmp);
+}
+
+#ifdef CONFIG_EXT4_DEBUG
+/* print freespace_tree in pre-order traversal */
+void ext4_mb_frsp_print_tree_len(struct super_block *sb,
+					struct ext4_frsp_tree *tree)
+{
+	unsigned int count = 0;
+	ext4_fsblk_t blk = 0, blk_end = 0;
+	ext4_group_t group = 0, group_end = 0;
+	struct ext4_frsp_node *entry = NULL;
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+	struct rb_node *cur;
+
+	cur = rb_first_cached(&tree->frsp_len_root);
+	mb_debug(sb, "\nTree\tNode\tlength\tblock\tgroup\n");
+
+	while (cur) {
+		entry = rb_entry(cur, struct ext4_frsp_node, frsp_len_node);
+		count++;
+		blk = ext4_flex_offset_to_blkno(sb, tree->frsp_index,
+			entry->frsp_offset);
+		blk_end = ext4_flex_offset_to_blkno(sb, tree->frsp_index,
+			entry->frsp_offset + entry->frsp_len - 1);
+		group = blk / sbi->s_blocks_per_group;
+		group_end = (blk_end-1) / sbi->s_blocks_per_group;
+		mb_debug(sb, "%u\t%u\t%u\t%u(%lld)--%llu\t%u--%u\n",
+			tree->frsp_index, count, entry->frsp_len,
+			entry->frsp_offset, blk, blk_end, group, group_end);
+		cur = rb_next(cur);
+	}
+}
+#endif
+
+static struct ext4_frsp_node *ext4_mb_frsp_alloc_node(struct super_block *sb)
+{
+	struct ext4_frsp_node *node;
+
+	node = kmem_cache_alloc(ext4_freespace_node_cachep, GFP_NOFS);
+	if (!node)
+		return NULL;
+
+	RB_CLEAR_NODE(&node->frsp_node);
+	RB_CLEAR_NODE(&node->frsp_len_node);
+
+	return node;
+}
+
+static void ext4_mb_frsp_free_node(struct super_block *sb,
+		struct ext4_frsp_node *node)
+{
+	kmem_cache_free(ext4_freespace_node_cachep, node);
+}
+
+/* Evict a tree from memory */
+void ext4_mb_frsp_free_tree(struct super_block *sb, struct ext4_frsp_tree *tree)
+{
+	struct ext4_frsp_node *frsp_node;
+	struct rb_node *node;
+
+	mb_debug(sb, "Evicting tree %d\n", tree->frsp_index);
+	mutex_lock(&tree->frsp_lock);
+	if (!(tree->frsp_flags & EXT4_MB_FRSP_FLAG_LOADED))
+		goto out;
+
+	node = rb_first_cached(&tree->frsp_offset_root);
+	while (node) {
+		frsp_node = rb_entry(node, struct ext4_frsp_node, frsp_node);
+		rb_erase_cached(node, &tree->frsp_offset_root);
+		rb_erase_cached(&frsp_node->frsp_len_node,
+				&tree->frsp_len_root);
+		ext4_mb_frsp_free_node(sb, frsp_node);
+		node = rb_first_cached(&tree->frsp_offset_root);
+	}
+	tree->frsp_flags &= ~EXT4_MB_FRSP_FLAG_LOADED;
+	tree->frsp_offset_root = RB_ROOT_CACHED;
+	tree->frsp_len_root = RB_ROOT_CACHED;
+out:
+	mutex_unlock(&tree->frsp_lock);
+}
+
+/*
+ * Search tree by flexbg offset. Returns the node containing "target". If
+ * no such node is present, returns NULL. Must be called with tree mutex held.
+ */
+struct ext4_frsp_node *ext4_mb_frsp_search_by_off(struct super_block *sb,
+				struct ext4_frsp_tree *tree,
+				ext4_grpblk_t target)
+{
+	struct rb_root *root = &tree->frsp_offset_root.rb_root;
+	struct rb_node *node = root->rb_node;
+	struct ext4_frsp_node *this = NULL;
+
+	while (node) {
+		this = rb_entry(node, struct ext4_frsp_node, frsp_node);
+		if (this->frsp_offset == target)
+			return this;
+		else if (target < this->frsp_offset)
+			node = node->rb_left;
+		else
+			node = node->rb_right;
+	}
+	return NULL;
+}
+
+/*
+ * Check if two entries in freespace tree can be merged together. Nodes
+ * can merge together only if they are physically contiguous and belong
+ * to the same block group.
+ */
+int ext4_mb_frsp_node_can_merge(struct super_block *sb,
+	struct ext4_frsp_node *prev_entry, struct ext4_frsp_node *cur_entry)
+{
+	if (prev_entry->frsp_offset + prev_entry->frsp_len !=
+		cur_entry->frsp_offset)
+		return 0;
+	if (prev_entry->frsp_offset / EXT4_SB(sb)->s_blocks_per_group !=
+		cur_entry->frsp_offset / EXT4_SB(sb)->s_blocks_per_group)
+		return 0;
+	return 1;
+}
+
+/*
+ * Add new free space region to tree. We insert the new node in both, the
+ * length sorted and the flex-bg offset sorted trees. Must be called with
+ * tree mutex held.
+ */
+int ext4_mb_frsp_add_region(struct super_block *sb, struct ext4_frsp_tree *tree,
+				ext4_grpblk_t offset, ext4_grpblk_t len)
+{
+	struct ext4_frsp_node *new_entry = NULL, *next_entry = NULL,
+				*prev_entry = NULL;
+	struct rb_node *left = NULL, *right = NULL;
+
+	new_entry = ext4_mb_frsp_alloc_node(sb);
+	if (!new_entry)
+		return -ENOMEM;
+
+	new_entry->frsp_offset = offset;
+	new_entry->frsp_len = len;
+
+	ext4_mb_frsp_insert_off(tree, new_entry);
+	/* try merge to left and right */
+	/* left */
+	left = rb_prev(&new_entry->frsp_node);
+	if (left) {
+		prev_entry = rb_entry(left, struct ext4_frsp_node, frsp_node);
+		if (ext4_mb_frsp_node_can_merge(sb, prev_entry, new_entry)) {
+			new_entry->frsp_offset = prev_entry->frsp_offset;
+			new_entry->frsp_len += prev_entry->frsp_len;
+			rb_erase_cached(left, &tree->frsp_offset_root);
+			rb_erase_cached(&prev_entry->frsp_len_node,
+						&tree->frsp_len_root);
+			ext4_mb_frsp_free_node(sb, prev_entry);
+		}
+	}
+
+	/* right */
+	right = rb_next(&new_entry->frsp_node);
+	if (right) {
+		next_entry = rb_entry(right, struct ext4_frsp_node, frsp_node);
+		if (ext4_mb_frsp_node_can_merge(sb, new_entry, next_entry)) {
+			new_entry->frsp_len += next_entry->frsp_len;
+			rb_erase_cached(right, &tree->frsp_offset_root);
+			rb_erase_cached(&next_entry->frsp_len_node,
+						&tree->frsp_len_root);
+			ext4_mb_frsp_free_node(sb, next_entry);
+		}
+	}
+	ext4_mb_frsp_insert_len(tree, new_entry);
+
+	return 0;
+}
+
+/*
+ * Free length number of blocks starting at block number block in free space
+ * tree.
+ */
+int ext4_mb_frsp_free_blocks(struct super_block *sb, ext4_group_t group,
+				ext4_grpblk_t block, ext4_grpblk_t length)
+{
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+	struct ext4_frsp_tree *tree =
+		ext4_get_frsp_tree(sb, ext4_flex_group(sbi, group));
+	int err = 0;
+
+	mutex_lock(&tree->frsp_lock);
+	err = ext4_mb_frsp_add_region(sb, tree,
+			ext4_blkno_to_flex_offset(sb, block), length);
+	mutex_unlock(&tree->frsp_lock);
+
+	return err;
+}
+
+/*
+ * Create freespace tree from on-disk bitmap. Must be called with tree mutex
+ * held. Returns 0 on success, error otherwise.
+ */
+int ext4_mb_frsp_bb_to_tree(struct ext4_buddy *e4b, ext4_group_t group,
+			struct buffer_head *bh)
+{
+	struct super_block *sb = e4b->bd_sb;
+	int length = 0, bit = 0, next;
+	int end = EXT4_SB(sb)->s_blocks_per_group;
+	ext4_fsblk_t block;
+	int ret = 0;
+
+	/* find all unused blocks in bitmap, convert them to new tree node */
+	while (bit < end) {
+		bit = mb_find_next_zero_bit(bh->b_data, end, bit);
+		if (bit >= end)
+			break;
+
+		next = mb_find_next_bit(bh->b_data, end, bit);
+		length = next - bit;
+		block = ext4_group_first_block_no(sb, group) + bit;
+
+		ret = ext4_mb_frsp_add_region(sb, e4b->frsp_tree,
+			ext4_blkno_to_flex_offset(sb, block), length);
+		if (ret)
+			break;
+		bit = next + 1;
+	}
+
+	return ret;
+}
+
+/*
+ * Load one freespace tree from disk. Must be called with the tree mutex held.
+ */
+int ext4_mb_frsp_load(struct ext4_buddy *e4b)
+{
+	ext4_group_t ngroups, group_start, group_end, grp;
+	struct ext4_sb_info *sbi = EXT4_SB(e4b->bd_sb);
+	int i, ret = 0;
+	struct buffer_head **bh = NULL;
+
+	if (e4b->frsp_tree->frsp_flags & EXT4_MB_FRSP_FLAG_LOADED)
+		return 0;
+
+	group_start = e4b->frsp_tree->frsp_index * ext4_flex_bg_size(sbi);
+	group_end = min(group_start + ext4_flex_bg_size(sbi),
+			ext4_get_groups_count(e4b->bd_sb)) - 1;
+	ngroups = group_end - group_start + 1;
+
+	bh = kcalloc(ngroups, sizeof(*bh), GFP_KERNEL);
+	if (!bh)
+		return -ENOMEM;
+	for (i = 0; i < ngroups; i++) {
+		grp = i + group_start;
+		bh[i] = ext4_read_block_bitmap_nowait(e4b->bd_sb, grp, false);
+		if (IS_ERR(bh[i])) {
+			ret = PTR_ERR(bh[i]);
+			goto out;
+		}
+	}
+	for (i = 0; i < ngroups; i++) {
+		grp = i + group_start;
+		if (!bitmap_uptodate(bh[i])) {
+			ret = ext4_wait_block_bitmap(e4b->bd_sb, grp, bh[i]);
+			if (ret)
+				goto out;
+		}
+		ret = ext4_mb_frsp_bb_to_tree(e4b, grp, bh[i]);
+		if (ret)
+			goto out;
+	}
+	e4b->frsp_tree->frsp_flags |= EXT4_MB_FRSP_FLAG_LOADED;
+out:
+	for (i = 0; i < ngroups; i++) {
+		if (!IS_ERR_OR_NULL(bh[i]))
+			put_bh(bh[i]);
+		if (!ret && IS_ERR(bh[i]))
+			ret = PTR_ERR(bh[i]);
+	}
+	kfree(bh);
+	return ret;
+
+}
+
+/*
+ * Determine if node with length len is better than what we have found until
+ * now. Return 1 if that is the case, 0 otherwise. If len is exact match,
+ * set *best = 1.
+ */
+static int ext4_mb_frsp_is_better(struct ext4_allocation_context *ac,
+					int len, int *best)
+{
+	struct ext4_tree_extent *btx = &ac->ac_b_tree_ex;
+	struct ext4_free_extent *gex = &ac->ac_g_ex;
+
+	if (best)
+		*best = 0;
+	if (len == gex->fe_len) {
+		if (best)
+			*best = 1;
+		return 1;
+	}
+	if (ac->ac_criteria == 0 && len < gex->fe_len)
+		return 0;
+	/*
+	 * See if the new node is closer to the goal length than
+	 * the best extent found so far
+	 */
+	if (btx->te_len < gex->fe_len && len > btx->te_len)
+		return 1;
+	if (len > gex->fe_len && len < btx->te_len)
+		return 1;
+	return 0;
+}
+
+/*
+ * check if we have scanned sufficient freespace candidates
+ * stop scanning if reached/exceeded s_max_to_scan
+ */
+static void ext4_mb_frsp_check_limits(struct ext4_allocation_context *ac)
+{
+	struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
+
+	if (ac->ac_status == AC_STATUS_FOUND)
+		return;
+
+	/*
+	 * Exceeded max number of nodes to scan
+	 */
+	if (ac->ac_found > sbi->s_mb_max_to_scan &&
+			!(ac->ac_flags & EXT4_MB_HINT_FIRST))
+		ac->ac_status = AC_STATUS_BREAK;
+}
+
+/*
+ * Mark free space in selected tree node as used and update the tree.
+ * This must be called with tree lock.
+ */
+static void ext4_mb_frsp_use_best_found(struct ext4_allocation_context *ac,
+			ext4_group_t flex, struct ext4_frsp_node *selected)
+{
+	struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
+	struct ext4_tree_extent *btx = &ac->ac_b_tree_ex;
+	struct ext4_free_extent *bex;
+	unsigned int group_no;
+	struct ext4_buddy e4b;
+
+	WARN_ON(ac->ac_status == AC_STATUS_FOUND);
+	btx->te_len = min(btx->te_len, ac->ac_g_ex.fe_len);
+	ac->ac_status = AC_STATUS_FOUND;
+
+	/* update ac best-found-extent */
+	bex = &ac->ac_b_ex;
+	group_no = (btx->te_flex * ext4_flex_bg_size(sbi)) +
+			(btx->te_flex_start / sbi->s_blocks_per_group);
+	bex->fe_start = btx->te_flex_start % sbi->s_blocks_per_group;
+	bex->fe_group = group_no;
+	bex->fe_len = btx->te_len;
+	bex->fe_node = selected;
+
+	/* Add free blocks to our tree */
+	ext4_mb_load_allocator(ac->ac_sb, group_no, &e4b,
+		EXT4_ALLOCATOR_FRSP_NOLOAD);
+	mb_mark_used(&e4b, bex);
+	ext4_mb_unload_allocator(&e4b);
+}
+/*
+ * The routine checks whether found tree node is good enough. If it is,
+ * then the tree node gets marked used and flag is set to the context
+ * to stop scanning.
+ *
+ * Otherwise, the tree node is compared with the previous found extent and
+ * if new one is better, then it's stored in the context.
+ *
+ * Later, the best found tree node will be used, if mballoc can't find good
+ * enough extent.
+ */
+static int ext4_mb_frsp_measure_node(struct ext4_allocation_context *ac,
+				int tree_idx, struct ext4_frsp_node *cur,
+				int goal)
+{
+	struct ext4_tree_extent *btx = &ac->ac_b_tree_ex;
+	ext4_fsblk_t start;
+	int best_found = 0, max;
+
+	WARN_ON(btx->te_len < 0);
+	WARN_ON(ac->ac_status != AC_STATUS_CONTINUE);
+
+	if (goal) {
+		start = ext4_group_first_block_no(ac->ac_sb,
+					ac->ac_g_ex.fe_group) +
+					ac->ac_g_ex.fe_start;
+		if (cur->frsp_offset > ext4_blkno_to_flex_offset(ac->ac_sb,
+						start))
+			return 0;
+		max = cur->frsp_offset + cur->frsp_len -
+			ext4_blkno_to_flex_offset(ac->ac_sb, start);
+		if (max >= ac->ac_g_ex.fe_len &&
+			ac->ac_g_ex.fe_len == EXT4_SB(ac->ac_sb)->s_stripe) {
+			if (do_div(start, EXT4_SB(ac->ac_sb)->s_stripe) != 0)
+				return 0;
+			best_found = 1;
+		} else if (max >= ac->ac_g_ex.fe_len) {
+			best_found = 1;
+		} else if (max > 0 && (ac->ac_flags & EXT4_MB_HINT_MERGE)) {
+			/*
+			 * Sometimes, caller may want to merge even small
+			 * number of blocks to an existing extent
+			 */
+			best_found = 1;
+
+		} else {
+			return 0;
+		}
+		ac->ac_found++;
+		goto use_extent;
+	}
+	ac->ac_found++;
+
+	/* The special case - take what you catch first */
+	if (unlikely(ac->ac_flags & EXT4_MB_HINT_FIRST)) {
+		best_found = 1;
+		goto use_extent;
+	}
+
+	/*
+	 * Prefer not allocating blocks in first group in flex bg for data
+	 * blocks.
+	 */
+	if (unlikely((ac->ac_criteria < 2) &&
+			(ac->ac_flags & EXT4_MB_HINT_DATA) &&
+			(cur->frsp_offset < EXT4_BLOCKS_PER_GROUP(ac->ac_sb))))
+		return 1;
+
+
+	/* If this is first found extent, just store it in the context */
+	if (btx->te_len == 0)
+		goto use_extent;
+
+	if (ext4_mb_frsp_is_better(ac, cur->frsp_len, &best_found))
+		goto use_extent;
+
+	ext4_mb_frsp_check_limits(ac);
+	return 0;
+
+use_extent:
+	btx->te_flex = tree_idx;
+	if (goal) {
+		btx->te_flex_start = ext4_blkno_to_flex_offset(ac->ac_sb,
+								start);
+		btx->te_len = max;
+	} else {
+		btx->te_flex_start = cur->frsp_offset;
+		btx->te_len = cur->frsp_len;
+	}
+	if (best_found)
+		ext4_mb_frsp_use_best_found(ac, tree_idx, cur);
+	ext4_mb_frsp_check_limits(ac);
+	return 1;
+}
+
+/* Search by goal */
+static noinline_for_stack
+void ext4_mb_frsp_find_by_goal(struct ext4_allocation_context *ac)
+{
+	struct ext4_frsp_node *cur = NULL;
+	unsigned int tree_blk;
+	ext4_fsblk_t blk;
+	struct ext4_buddy e4b;
+	int ret;
+
+	if (!(ac->ac_flags & EXT4_MB_HINT_TRY_GOAL))
+		return;
+
+	/* compute start node offset in tree */
+	blk = ext4_group_first_block_no(ac->ac_sb, ac->ac_g_ex.fe_group) +
+			ac->ac_g_ex.fe_start;
+	tree_blk = ext4_blkno_to_flex_offset(ac->ac_sb, blk);
+
+	ret = ext4_mb_load_allocator(ac->ac_sb, ac->ac_g_ex.fe_group, &e4b, 0);
+	if (ret)
+		return;
+
+	/* try goal block and its freespace_tree first */
+	mutex_lock(&e4b.frsp_tree->frsp_lock);
+	cur = ext4_mb_frsp_search_by_off(ac->ac_sb, e4b.frsp_tree, tree_blk);
+	if (cur && tree_blk < cur->frsp_offset + cur->frsp_len - 1)
+		ext4_mb_frsp_measure_node(ac, e4b.frsp_tree->frsp_index, cur,
+						1 /* searching by goal */);
+
+	mutex_unlock(&e4b.frsp_tree->frsp_lock);
+	ext4_mb_unload_allocator(&e4b);
+}
+
+static void ext4_mb_frsp_process(struct ext4_allocation_context *ac,
+				struct ext4_frsp_tree *tree)
+{
+	struct ext4_frsp_node *cur = NULL;
+	struct rb_node *node = NULL;
+	int ret;
+
+	node = rb_first_cached(&tree->frsp_len_root);
+	mb_debug(ac->ac_sb, "Processing tree index %d, flags = %x\n",
+			tree->frsp_index, tree->frsp_flags);
+	/*
+	 * In order to serve the "No data blocks in first group in a flexbg"
+	 * requirement, we cannot do an O(log N) search here. TODO: it's
+	 * possible to do that at CR=2, where that requirement doesn't apply.
+	 */
+	while (node && ac->ac_status == AC_STATUS_CONTINUE) {
+		cur = rb_entry(node, struct ext4_frsp_node, frsp_len_node);
+		if (ac->ac_criteria == 0 && cur->frsp_len < ac->ac_g_ex.fe_len)
+			return;
+		ret = ext4_mb_frsp_measure_node(ac, tree->frsp_index, cur,
+						0 /* searching by len */);
+		if (ret == 0)
+			return;
+		node = rb_next(node);
+	}
+}
+
+/* allocator for freespace_tree */
+static noinline_for_stack int
+ext4_mb_tree_allocator(struct ext4_allocation_context *ac)
+{
+	struct ext4_buddy e4b;
+	struct ext4_sb_info *sbi;
+	struct super_block *sb;
+	struct ext4_frsp_tree *tree = NULL;
+	struct ext4_frsp_node *cur = NULL;
+	struct ext4_tree_extent *btx = NULL;
+	int ret = 0, start_idx = 0, tree_idx, j;
+
+	sb = ac->ac_sb;
+	btx = &ac->ac_b_tree_ex;
+	sbi = EXT4_SB(sb);
+
+	start_idx = ext4_flex_group(sbi, ac->ac_g_ex.fe_group);
+	mb_debug(sb, "requested size %d\n", ac->ac_g_ex.fe_len);
+	/* First try searching from goal blk in start-idx-th freespace tree */
+	ext4_mb_frsp_find_by_goal(ac);
+	if (ac->ac_status == AC_STATUS_FOUND)
+		goto out;
+
+	if (unlikely(ac->ac_flags & EXT4_MB_HINT_GOAL_ONLY))
+		goto out;
+
+	ac->ac_criteria = 0;
+	ac->ac_groups_scanned = 0;
+repeat:
+
+	/* Loop through the rest of trees (flex_bg) */
+	for (j = start_idx;
+		(ac->ac_groups_scanned < sbi->s_mb_num_frsp_trees) &&
+			ac->ac_status == AC_STATUS_CONTINUE; j++) {
+		ac->ac_groups_scanned++;
+		tree_idx = (j % sbi->s_mb_num_frsp_trees);
+
+		ret = ext4_mb_load_allocator(sb,
+				tree_idx * ext4_flex_bg_size(sbi), &e4b, 0);
+		if (ret)
+			goto out;
+		mutex_lock(&e4b.frsp_tree->frsp_lock);
+		ext4_mb_frsp_process(ac, e4b.frsp_tree);
+		mutex_unlock(&e4b.frsp_tree->frsp_lock);
+		ext4_mb_unload_allocator(&e4b);
+	}
+
+	if (ac->ac_status != AC_STATUS_FOUND) {
+		if (ac->ac_criteria < 2) {
+			ac->ac_criteria++;
+			ac->ac_groups_scanned = 0;
+			mb_debug(sb, "Falling back to CR=%d", ac->ac_criteria);
+			goto repeat;
+		}
+		if (btx->te_len > 0 && !(ac->ac_flags & EXT4_MB_HINT_FIRST)) {
+			e4b.frsp_flags = 0;
+			ret = ext4_mb_load_allocator(sb, btx->te_flex *
+					ext4_flex_bg_size(sbi), &e4b, 0);
+			if (ret)
+				goto out;
+			tree = e4b.frsp_tree;
+			mutex_lock(&tree->frsp_lock);
+			cur = ext4_mb_frsp_search_by_off(sb, tree,
+					btx->te_flex_start);
+			if (cur) {
+				ext4_mb_frsp_use_best_found(ac, btx->te_flex,
+								cur);
+				mutex_unlock(&tree->frsp_lock);
+				ext4_mb_unload_allocator(&e4b);
+
+			} else {
+				/*
+				 * Someone else took this freespace node before
+				 * us. Reset the best-found tree extent, and
+				 * turn on FIRST HINT (greedy).
+				 */
+				mutex_unlock(&tree->frsp_lock);
+				ac->ac_b_tree_ex.te_flex_start = 0;
+				ac->ac_b_tree_ex.te_flex = 0;
+				ac->ac_b_tree_ex.te_len = 0;
+				ac->ac_status = AC_STATUS_CONTINUE;
+				ac->ac_flags |= EXT4_MB_HINT_FIRST;
+				ac->ac_groups_scanned = 0;
+				ext4_mb_unload_allocator(&e4b);
+				goto repeat;
+			}
+		}
+	}
+
+	ret = btx->te_len ? 0 : -ENOSPC;
+
+out:
+	return ret;
+}
+
+static void ext4_mb_frsp_init_tree(struct ext4_frsp_tree *tree, int index)
+{
+	tree->frsp_offset_root = RB_ROOT_CACHED;
+	tree->frsp_len_root = RB_ROOT_CACHED;
+	mutex_init(&(tree->frsp_lock));
+	tree->frsp_flags = 0;
+	tree->frsp_index = index;
+}
+
+int ext4_mb_init_freespace_trees(struct super_block *sb)
+{
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+	struct flex_groups *fg;
+	int i;
+
+	sbi->s_mb_num_frsp_trees =
+		ext4_num_grps_to_flexbg(sb, ext4_get_groups_count(sb));
+
+	for (i = 0; i < sbi->s_mb_num_frsp_trees; i++) {
+		fg = sbi_array_rcu_deref(sbi, s_flex_groups, i);
+		fg->frsp_tree = kzalloc(sizeof(struct ext4_frsp_tree),
+				GFP_KERNEL);
+		if (!fg->frsp_tree)
+			return -ENOMEM;
+		ext4_mb_frsp_init_tree(fg->frsp_tree, i);
+	}
+
+	return 0;
+}
+
+int ext4_mb_add_frsp_trees(struct super_block *sb, int ngroups)
+{
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+	int flex_bg_count, old_trees_count = sbi->s_mb_num_frsp_trees;
+	int i;
+
+	if (!ext4_mb_frsp_on(sb))
+		return 0;
+
+	flex_bg_count = ext4_num_grps_to_flexbg(sb, ngroups);
+	if (old_trees_count > 0)
+		ext4_mb_frsp_free_tree(sb, ext4_get_frsp_tree(sb,
+						old_trees_count - 1));
+
+	for (i = old_trees_count; i < flex_bg_count; i++) {
+		struct flex_groups *fg;
+
+		fg = sbi_array_rcu_deref(sbi, s_flex_groups, i);
+		fg->frsp_tree = kzalloc(sizeof(*fg->frsp_tree), GFP_KERNEL);
+		if (!fg->frsp_tree)
+			return -ENOMEM;
+		ext4_mb_frsp_init_tree(fg->frsp_tree, i);
+	}
+	sbi->s_mb_num_frsp_trees = flex_bg_count;
+
+	return 0;
+}
+
 /*
  * Divide blocks started from @first with length @len into
  * smaller chunks with power of 2 blocks.
@@ -1042,6 +1802,9 @@ static int ext4_mb_get_buddy_page_lock(struct super_block *sb,
 	e4b->bd_buddy_page = NULL;
 	e4b->bd_bitmap_page = NULL;
 
+	if (ext4_mb_frsp_on(sb))
+		return 0;
+
 	blocks_per_page = PAGE_SIZE / sb->s_blocksize;
 	/*
 	 * the buddy cache inode stores the block bitmap
@@ -1099,6 +1862,7 @@ int ext4_mb_init_group(struct super_block *sb, ext4_group_t group, gfp_t gfp)
 	struct page *page;
 	int ret = 0;
 
+	WARN_ON(ext4_mb_frsp_on(sb));
 	might_sleep();
 	mb_debug(sb, "init group %u\n", group);
 	this_grp = ext4_get_group_info(sb, group);
@@ -1156,6 +1920,11 @@ int ext4_mb_init_group(struct super_block *sb, ext4_group_t group, gfp_t gfp)
  * Locking note:  This routine calls ext4_mb_init_cache(), which takes the
  * block group lock of all groups for this page; do not hold the BG lock when
  * calling this routine!
+ *
+ * For freespace trees, do not hold tree mutex while calling this routine.
+ * It is okay to hold that mutex only if EXT4_ALLOCATOR_FRSP_NOLOAD flag is
+ * set in e4b->frsp_flags. If that flag is set, this function doesn't try
+ * to load an unloaded tree.
  */
 static noinline_for_stack int
 ext4_mb_load_allocator_gfp(struct super_block *sb, ext4_group_t group,
@@ -1166,7 +1935,7 @@ ext4_mb_load_allocator_gfp(struct super_block *sb, ext4_group_t group,
 	int pnum;
 	int poff;
 	struct page *page;
-	int ret;
+	int ret = 0;
 	struct ext4_group_info *grp;
 	struct ext4_sb_info *sbi = EXT4_SB(sb);
 	struct inode *inode = sbi->s_buddy_cache;
@@ -1183,6 +1952,22 @@ ext4_mb_load_allocator_gfp(struct super_block *sb, ext4_group_t group,
 	e4b->bd_group = group;
 	e4b->bd_buddy_page = NULL;
 	e4b->bd_bitmap_page = NULL;
+	if (ext4_mb_frsp_on(sb)) {
+		e4b->frsp_tree = ext4_get_frsp_tree(sb,
+					ext4_flex_group(sbi, group));
+		e4b->frsp_flags = flags;
+		if (flags & EXT4_ALLOCATOR_FRSP_NOLOAD)
+			return 0;
+
+		mutex_lock(&e4b->frsp_tree->frsp_lock);
+		if (e4b->frsp_tree->frsp_flags & EXT4_MB_FRSP_FLAG_LOADED) {
+			mutex_unlock(&e4b->frsp_tree->frsp_lock);
+			return 0;
+		}
+		ret = ext4_mb_frsp_load(e4b);
+		mutex_unlock(&e4b->frsp_tree->frsp_lock);
+		return ret;
+	}
 
 	if (unlikely(EXT4_MB_GRP_NEED_INIT(grp))) {
 		/*
@@ -1302,6 +2087,8 @@ static int ext4_mb_load_allocator(struct super_block *sb, ext4_group_t group,
 
 static void ext4_mb_unload_allocator(struct ext4_buddy *e4b)
 {
+	if (ext4_mb_frsp_on(e4b->bd_sb))
+		return;
 	if (e4b->bd_bitmap_page)
 		put_page(e4b->bd_bitmap_page);
 	if (e4b->bd_buddy_page)
@@ -1494,6 +2281,16 @@ static void mb_free_blocks(struct inode *inode, struct ext4_buddy *e4b,
 	if (first < e4b->bd_info->bb_first_free)
 		e4b->bd_info->bb_first_free = first;
 
+	if (ext4_mb_frsp_on(sb)) {
+		ext4_grpblk_t first_blk = EXT4_C2B(EXT4_SB(sb), first) +
+			ext4_group_first_block_no(sb, e4b->bd_group);
+
+		ext4_unlock_group(sb, e4b->bd_group);
+		ext4_mb_frsp_free_blocks(sb, e4b->bd_group, first_blk, count);
+		ext4_lock_group(sb, e4b->bd_group);
+		return;
+	}
+
 	/* access memory sequentially: check left neighbour,
 	 * clear range and then check right neighbour
 	 */
@@ -1560,6 +2357,9 @@ static int mb_find_extent(struct ext4_buddy *e4b, int block,
 	assert_spin_locked(ext4_group_lock_ptr(e4b->bd_sb, e4b->bd_group));
 	BUG_ON(ex == NULL);
 
+	if (ext4_mb_frsp_on(e4b->bd_sb))
+		return 0;
+
 	buddy = mb_find_buddy(e4b, 0, &max);
 	BUG_ON(buddy == NULL);
 	BUG_ON(block >= max);
@@ -1624,17 +2424,74 @@ static int mb_mark_used(struct ext4_buddy *e4b, struct ext4_free_extent *ex)
 	unsigned ret = 0;
 	int len0 = len;
 	void *buddy;
+	ext4_grpblk_t flex_offset;
 
 	BUG_ON(start + len > (e4b->bd_sb->s_blocksize << 3));
 	BUG_ON(e4b->bd_group != ex->fe_group);
+
+	e4b->bd_info->bb_free -= len;
+	if (e4b->bd_info->bb_first_free == start)
+		e4b->bd_info->bb_first_free += len;
+
+	if (ext4_mb_frsp_on(e4b->bd_sb)) {
+		struct ext4_frsp_node *new;
+
+		flex_offset = ext4_blkno_to_flex_offset(e4b->bd_sb,
+					ext4_group_first_block_no(e4b->bd_sb,
+						e4b->bd_group) + ex->fe_start);
+		if (!ex->fe_node) {
+			ex->fe_node = ext4_mb_frsp_search_by_off(e4b->bd_sb,
+					e4b->frsp_tree, flex_offset);
+			if (!ex->fe_node)
+				return 0;
+		}
+		/*
+		 * Remove the node from both trees before modifying it, since
+		 * we will change its length and/or offset.
+		 */
+		rb_erase_cached(&ex->fe_node->frsp_node,
+					&e4b->frsp_tree->frsp_offset_root);
+		rb_erase_cached(&ex->fe_node->frsp_len_node,
+					&e4b->frsp_tree->frsp_len_root);
+		RB_CLEAR_NODE(&ex->fe_node->frsp_node);
+		RB_CLEAR_NODE(&ex->fe_node->frsp_len_node);
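+		/*
+		 * Three cases follow: the allocated range starts in the
+		 * middle of the node (split or truncate it), starts at
+		 * the node's head (advance its offset and shrink it), or
+		 * covers the whole node (free it).
+		 */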
+		if (flex_offset > ex->fe_node->frsp_offset) {
+			if (flex_offset + ex->fe_len !=
+				ex->fe_node->frsp_offset +
+				ex->fe_node->frsp_len) {
+				/* Need to split the node */
+				new = ext4_mb_frsp_alloc_node(e4b->bd_sb);
+				if (!new)
+					return -ENOMEM;
+				new->frsp_offset = flex_offset + ex->fe_len;
+				new->frsp_len = (ex->fe_node->frsp_offset +
+					ex->fe_node->frsp_len) -
+					new->frsp_offset;
+				ext4_mb_frsp_insert_len(e4b->frsp_tree, new);
+				ext4_mb_frsp_insert_off(e4b->frsp_tree, new);
+			}
+			ex->fe_node->frsp_len = flex_offset -
+						ex->fe_node->frsp_offset;
+		} else if (ex->fe_len < ex->fe_node->frsp_len) {
+			/* Used only the head of the node: advance and shrink it */
+			ex->fe_node->frsp_offset += ex->fe_len;
+			ex->fe_node->frsp_len -= ex->fe_len;
+		} else {
+			ext4_mb_frsp_free_node(e4b->bd_sb, ex->fe_node);
+			return 0;
+		}
+
+		ext4_mb_frsp_insert_len(e4b->frsp_tree, ex->fe_node);
+		ext4_mb_frsp_insert_off(e4b->frsp_tree, ex->fe_node);
+
+		return 0;
+	}
+
 	assert_spin_locked(ext4_group_lock_ptr(e4b->bd_sb, e4b->bd_group));
 	mb_check_buddy(e4b);
 	mb_mark_used_double(e4b, start, len);
 
 	this_cpu_inc(discard_pa_seq);
-	e4b->bd_info->bb_free -= len;
-	if (e4b->bd_info->bb_first_free == start)
-		e4b->bd_info->bb_first_free += len;
 
 	/* let's maintain fragments counter */
 	if (start != 0)
@@ -2773,6 +3630,13 @@ static int ext4_mb_init_backend(struct super_block *sb)
 	rcu_read_lock();
 	kvfree(rcu_dereference(sbi->s_group_info));
 	rcu_read_unlock();
+	if (ext4_mb_frsp_on(sb)) {
+		for (i = 0; i < sbi->s_mb_num_frsp_trees; i++) {
+			struct ext4_frsp_tree *tree = ext4_get_frsp_tree(sb, i);
+
+			kfree(tree);
+		}
+	}
 	return -ENOMEM;
 }
 
@@ -2869,6 +3733,13 @@ int ext4_mb_init(struct super_block *sb)
 		i++;
 	} while (i <= sb->s_blocksize_bits + 1);
 
+	/* init for freespace trees */
+	if (ext4_mb_frsp_on(sb)) {
+		ret = ext4_mb_init_freespace_trees(sb);
+		if (ret)
+			goto out;
+	}
+
 	spin_lock_init(&sbi->s_md_lock);
 	spin_lock_init(&sbi->s_bal_lock);
 	sbi->s_mb_free_pending = 0;
@@ -2936,6 +3807,13 @@ int ext4_mb_init(struct super_block *sb)
 	sbi->s_mb_offsets = NULL;
 	kfree(sbi->s_mb_maxs);
 	sbi->s_mb_maxs = NULL;
+	if (ext4_mb_frsp_on(sb)) {
+		for (i = 0; i < sbi->s_mb_num_frsp_trees; i++) {
+			struct ext4_frsp_tree *tree = ext4_get_frsp_tree(sb, i);
+
+			kfree(tree);
+		}
+	}
 	return ret;
 }
 
@@ -3080,8 +3958,10 @@ static void ext4_free_data_in_buddy(struct super_block *sb,
 		/* No more items in the per group rb tree
 		 * balance refcounts from ext4_mb_free_metadata()
 		 */
-		put_page(e4b.bd_buddy_page);
-		put_page(e4b.bd_bitmap_page);
+		if (!ext4_mb_frsp_on(sb)) {
+			put_page(e4b.bd_buddy_page);
+			put_page(e4b.bd_bitmap_page);
+		}
 	}
 	ext4_unlock_group(sb, entry->efd_group);
 	kmem_cache_free(ext4_free_data_cachep, entry);
@@ -3159,9 +4039,14 @@ int __init ext4_init_mballoc(void)
 					   SLAB_RECLAIM_ACCOUNT);
 	if (ext4_free_data_cachep == NULL)
 		goto out_ac_free;
+	ext4_freespace_node_cachep = KMEM_CACHE(ext4_frsp_node,
+						SLAB_RECLAIM_ACCOUNT);
+	if (ext4_freespace_node_cachep == NULL)
+		goto out_frsp_free;
 
 	return 0;
-
+out_frsp_free:
+	kmem_cache_destroy(ext4_free_data_cachep);
 out_ac_free:
 	kmem_cache_destroy(ext4_ac_cachep);
 out_pa_free:
@@ -3180,6 +4065,7 @@ void ext4_exit_mballoc(void)
 	kmem_cache_destroy(ext4_pspace_cachep);
 	kmem_cache_destroy(ext4_ac_cachep);
 	kmem_cache_destroy(ext4_free_data_cachep);
+	kmem_cache_destroy(ext4_freespace_node_cachep);
 	ext4_groupinfo_destroy_slabs();
 }
 
@@ -3682,6 +4568,9 @@ ext4_mb_use_preallocated(struct ext4_allocation_context *ac)
 	if (!(ac->ac_flags & EXT4_MB_HINT_DATA))
 		return false;
 
+	if (ext4_mb_frsp_on(ac->ac_sb))
+		return false;
+
 	/* first, try per-file preallocation */
 	rcu_read_lock();
 	list_for_each_entry_rcu(pa, &ei->i_prealloc_list, pa_inode_list) {
@@ -4384,6 +5273,8 @@ static int ext4_mb_pa_alloc(struct ext4_allocation_context *ac)
 	struct ext4_prealloc_space *pa;
 
 	BUG_ON(ext4_pspace_cachep == NULL);
+	if (ext4_mb_frsp_on(ac->ac_sb))
+		return 0;
 	pa = kmem_cache_zalloc(ext4_pspace_cachep, GFP_NOFS);
 	if (!pa)
 		return -ENOMEM;
@@ -4396,6 +5287,8 @@ static void ext4_mb_pa_free(struct ext4_allocation_context *ac)
 {
 	struct ext4_prealloc_space *pa = ac->ac_pa;
 
+	if (ext4_mb_frsp_on(ac->ac_sb))
+		return;
 	BUG_ON(!pa);
 	ac->ac_pa = NULL;
 	WARN_ON(!atomic_dec_and_test(&pa->pa_count));
@@ -4569,6 +5462,13 @@ ext4_mb_initialize_context(struct ext4_allocation_context *ac,
 	ac->ac_o_ex.fe_len = len;
 	ac->ac_g_ex = ac->ac_o_ex;
 	ac->ac_flags = ar->flags;
+	if (ext4_mb_frsp_on(sb))
+		ac->ac_flags |= EXT4_MB_HINT_NOPREALLOC;
+
+	/* set up best-found tree node */
+	ac->ac_b_tree_ex.te_flex_start = 0;
+	ac->ac_b_tree_ex.te_flex = 0;
+	ac->ac_b_tree_ex.te_len = 0;
 
 	/* we have to define context: we'll work with a file or
 	 * locality group. this is a policy, actually */
@@ -4919,7 +5819,11 @@ ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle,
 			goto errout;
 repeat:
 		/* allocate space in core */
-		*errp = ext4_mb_regular_allocator(ac);
+		if (ext4_mb_frsp_on(sb))
+			*errp = ext4_mb_tree_allocator(ac);
+		else
+			*errp = ext4_mb_regular_allocator(ac);
+
 		/*
 		 * pa allocated above is added to grp->bb_prealloc_list only
 		 * when we were able to allocate some block i.e. when
@@ -4932,8 +5836,13 @@ ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle,
 			ext4_discard_allocated_blocks(ac);
 			goto errout;
 		}
-		if (ac->ac_status == AC_STATUS_FOUND &&
-			ac->ac_o_ex.fe_len >= ac->ac_f_ex.fe_len)
+		/*
+		 * Freespace trees should never return more than what was asked
+		 * for.
+		 */
+		if (!ext4_mb_frsp_on(sb) &&
+			(ac->ac_status == AC_STATUS_FOUND &&
+			ac->ac_o_ex.fe_len >= ac->ac_f_ex.fe_len))
 			ext4_mb_pa_free(ac);
 	}
 	if (likely(ac->ac_status == AC_STATUS_FOUND)) {
@@ -5024,13 +5933,12 @@ ext4_mb_free_metadata(handle_t *handle, struct ext4_buddy *e4b,
 	struct rb_node *parent = NULL, *new_node;
 
 	BUG_ON(!ext4_handle_valid(handle));
-	BUG_ON(e4b->bd_bitmap_page == NULL);
-	BUG_ON(e4b->bd_buddy_page == NULL);
+	BUG_ON(!ext4_mb_frsp_on(sb) && e4b->bd_bitmap_page == NULL);
 
 	new_node = &new_entry->efd_node;
 	cluster = new_entry->efd_start_cluster;
 
-	if (!*n) {
+	if (!*n && !ext4_mb_frsp_on(sb)) {
 		/* first free block extent. We need to
 		 * protect buddy cache from being freed,
 		 * otherwise we'll refresh it from
@@ -5509,6 +6417,7 @@ __acquires(bitlock)
 	ex.fe_start = start;
 	ex.fe_group = group;
 	ex.fe_len = count;
+	ex.fe_node = NULL;
 
 	/*
 	 * Mark blocks used, so no one can reuse them while
@@ -5548,6 +6457,7 @@ ext4_trim_all_free(struct super_block *sb, ext4_group_t group,
 	void *bitmap;
 	ext4_grpblk_t next, count = 0, free_count = 0;
 	struct ext4_buddy e4b;
+	struct buffer_head *bh = NULL;
 	int ret = 0;
 
 	trace_ext4_trim_all_free(sb, group, start, max);
@@ -5558,7 +6468,17 @@ ext4_trim_all_free(struct super_block *sb, ext4_group_t group,
 			     ret, group);
 		return ret;
 	}
-	bitmap = e4b.bd_bitmap;
+
+	if (ext4_mb_frsp_on(sb)) {
+		bh = ext4_read_block_bitmap(sb, group);
+		if (IS_ERR(bh)) {
+			ret = PTR_ERR(bh);
+			goto out;
+		}
+		bitmap = bh->b_data;
+	} else {
+		bitmap = e4b.bd_bitmap;
+	}
 
 	ext4_lock_group(sb, group);
 	if (EXT4_MB_GRP_WAS_TRIMMED(e4b.bd_info) &&
@@ -5605,6 +6525,8 @@ ext4_trim_all_free(struct super_block *sb, ext4_group_t group,
 		EXT4_MB_GRP_SET_TRIMMED(e4b.bd_info);
 	}
 out:
+	if (!IS_ERR_OR_NULL(bh))
+		brelse(bh);
 	ext4_unlock_group(sb, group);
 	ext4_mb_unload_allocator(&e4b);
 
@@ -5665,7 +6587,7 @@ int ext4_trim_fs(struct super_block *sb, struct fstrim_range *range)
 	for (group = first_group; group <= last_group; group++) {
 		grp = ext4_get_group_info(sb, group);
 		/* We only do this if the grp has never been initialized */
-		if (unlikely(EXT4_MB_GRP_NEED_INIT(grp))) {
+		if (unlikely(!ext4_mb_frsp_on(sb) && EXT4_MB_GRP_NEED_INIT(grp))) {
 			ret = ext4_mb_init_group(sb, group, GFP_NOFS);
 			if (ret)
 				break;
@@ -5719,16 +6641,25 @@ ext4_mballoc_query_range(
 	ext4_grpblk_t			next;
 	struct ext4_buddy		e4b;
 	int				error;
+	struct buffer_head		*bh = NULL;
 
-	error = ext4_mb_load_allocator(sb, group, &e4b, 0);
-	if (error)
-		return error;
-	bitmap = e4b.bd_bitmap;
-
+	if (ext4_mb_frsp_on(sb)) {
+		bh = ext4_read_block_bitmap(sb, group);
+		if (IS_ERR(bh)) {
+			error = PTR_ERR(bh);
+			goto out_unload;
+		}
+		bitmap = bh->b_data;
+	} else {
+		error = ext4_mb_load_allocator(sb, group, &e4b, 0);
+		if (error)
+			return error;
+		bitmap = e4b.bd_bitmap;
+	}
 	ext4_lock_group(sb, group);
-
-	start = (e4b.bd_info->bb_first_free > start) ?
-		e4b.bd_info->bb_first_free : start;
+	if (!ext4_mb_frsp_on(sb))
+		start = (e4b.bd_info->bb_first_free > start) ?
+			e4b.bd_info->bb_first_free : start;
 	if (end >= EXT4_CLUSTERS_PER_GROUP(sb))
 		end = EXT4_CLUSTERS_PER_GROUP(sb) - 1;
 
@@ -5737,19 +6668,20 @@ ext4_mballoc_query_range(
 		if (start > end)
 			break;
 		next = mb_find_next_bit(bitmap, end + 1, start);
-
 		ext4_unlock_group(sb, group);
 		error = formatter(sb, group, start, next - start, priv);
 		if (error)
 			goto out_unload;
 		ext4_lock_group(sb, group);
-
 		start = next + 1;
 	}
 
 	ext4_unlock_group(sb, group);
 out_unload:
-	ext4_mb_unload_allocator(&e4b);
+	if (!IS_ERR_OR_NULL(bh))
+		brelse(bh);
+	if (!ext4_mb_frsp_on(sb))
+		ext4_mb_unload_allocator(&e4b);
 
 	return error;
 }
diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h
index e75b4749aa1c..af61651258a3 100644
--- a/fs/ext4/mballoc.h
+++ b/fs/ext4/mballoc.h
@@ -78,6 +78,20 @@
  */
 #define MB_DEFAULT_MAX_INODE_PREALLOC	512
 
+/*
+ * Struct for tree node in freespace_tree
+ */
+struct ext4_frsp_node {
+	ext4_grpblk_t frsp_offset;	/* Start block offset inside
+					 * the current flexible group
+					 */
+	ext4_grpblk_t frsp_len;		/* Length of the free space in
+					 * number of clusters
+					 */
+	struct rb_node frsp_node;
+	struct rb_node frsp_len_node;
+};
 struct ext4_free_data {
 	/* this links the free block information from sb_info */
 	struct list_head		efd_list;
@@ -125,6 +139,7 @@ struct ext4_free_extent {
 	ext4_grpblk_t fe_start;	/* In cluster units */
 	ext4_group_t fe_group;
 	ext4_grpblk_t fe_len;	/* In cluster units */
+	struct ext4_frsp_node *fe_node;
 };
 
 /*
@@ -145,7 +160,14 @@ struct ext4_locality_group {
 	spinlock_t		lg_prealloc_lock;
 };
 
+struct ext4_tree_extent {
+	ext4_group_t te_flex;		/* flex_bg index (tree index) */
+	ext4_grpblk_t te_flex_start;	/* block offset w.r.t flex bg */
+	ext4_grpblk_t te_len;		/* length */
+};
+
 struct ext4_allocation_context {
+	__u32	ac_id;
 	struct inode *ac_inode;
 	struct super_block *ac_sb;
 
@@ -158,8 +180,16 @@ struct ext4_allocation_context {
 	/* the best found extent */
 	struct ext4_free_extent ac_b_ex;
 
-	/* copy of the best found extent taken before preallocation efforts */
-	struct ext4_free_extent ac_f_ex;
+	/* With freespace trees, we don't use preallocation anymore. */
+	union {
+		/*
+		 * copy of the best found extent taken before
+		 * preallocation efforts
+		 */
+		struct ext4_free_extent ac_f_ex;
+		/* the best found tree extent */
+		struct ext4_tree_extent ac_b_tree_ex;
+	};
 
 	__u16 ac_groups_scanned;
 	__u16 ac_found;
@@ -181,14 +211,31 @@ struct ext4_allocation_context {
 #define AC_STATUS_FOUND		2
 #define AC_STATUS_BREAK		3
 
+/*
+ * Freespace tree flags
+ */
+#define EXT4_ALLOCATOR_FRSP_NOLOAD		0x0001	/*
+							 * Don't load the
+							 * freespace tree if
+							 * it's not in memory.
+							 */
+
 struct ext4_buddy {
-	struct page *bd_buddy_page;
-	void *bd_buddy;
-	struct page *bd_bitmap_page;
-	void *bd_bitmap;
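+	/*
+	 * The buddy-cache fields are unused when the freespace-tree
+	 * allocator is enabled, so the two sets of fields are overlaid
+	 * in a union.
+	 */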
+	union {
+		struct {
+			struct page *bd_buddy_page;
+			void *bd_buddy;
+			struct page *bd_bitmap_page;
+			void *bd_bitmap;
+			__u16 bd_blkbits;
+		};
+		struct {
+			struct ext4_frsp_tree *frsp_tree;
+			__u32 frsp_flags;
+		};
+	};
 	struct ext4_group_info *bd_info;
 	struct super_block *bd_sb;
-	__u16 bd_blkbits;
 	ext4_group_t bd_group;
 };
 
diff --git a/fs/ext4/resize.c b/fs/ext4/resize.c
index a50b51270ea9..6a0e1fc18e95 100644
--- a/fs/ext4/resize.c
+++ b/fs/ext4/resize.c
@@ -1679,6 +1679,10 @@ int ext4_group_add(struct super_block *sb, struct ext4_new_group_data *input)
 	if (err)
 		goto out;
 
+	err = ext4_mb_add_frsp_trees(sb, input->group + 1);
+	if (err)
+		goto out;
+
 	flex_gd.count = 1;
 	flex_gd.groups = input;
 	flex_gd.bg_flags = &bg_flags;
@@ -2051,6 +2055,10 @@ int ext4_resize_fs(struct super_block *sb, ext4_fsblk_t n_blocks_count)
 	if (err)
 		goto out;
 
+	err = ext4_mb_add_frsp_trees(sb, n_group + 1);
+	if (err)
+		goto out;
+
 	flex_gd = alloc_flex_gd(flexbg_size);
 	if (flex_gd == NULL) {
 		err = -ENOMEM;
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 656eccf6fc9b..55ff4e8be976 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1526,7 +1526,7 @@ enum {
 	Opt_dioread_nolock, Opt_dioread_lock,
 	Opt_discard, Opt_nodiscard, Opt_init_itable, Opt_noinit_itable,
 	Opt_max_dir_size_kb, Opt_nojournal_checksum, Opt_nombcache,
-	Opt_prefetch_block_bitmaps,
+	Opt_prefetch_block_bitmaps, Opt_freespace_tree,
 };
 
 static const match_table_t tokens = {
@@ -1613,6 +1613,7 @@ static const match_table_t tokens = {
 	{Opt_init_itable, "init_itable=%u"},
 	{Opt_init_itable, "init_itable"},
 	{Opt_noinit_itable, "noinit_itable"},
+	{Opt_freespace_tree, "freespace_tree"},
 	{Opt_max_dir_size_kb, "max_dir_size_kb=%u"},
 	{Opt_test_dummy_encryption, "test_dummy_encryption=%s"},
 	{Opt_test_dummy_encryption, "test_dummy_encryption"},
@@ -1839,6 +1840,8 @@ static const struct mount_opts {
 	{Opt_nombcache, EXT4_MOUNT_NO_MBCACHE, MOPT_SET},
 	{Opt_prefetch_block_bitmaps, EXT4_MOUNT_PREFETCH_BLOCK_BITMAPS,
 	 MOPT_SET},
+	{Opt_freespace_tree, EXT4_MOUNT2_FREESPACE_TREE,
+	 MOPT_SET | MOPT_2 | MOPT_EXT4_ONLY},
 	{Opt_err, 0, 0}
 };
 
@@ -4745,12 +4748,19 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 		}
 	}
 
+	if (ext4_has_feature_flex_bg(sb))
+		if (!ext4_fill_flex_info(sb)) {
+			ext4_msg(sb, KERN_ERR,
+			       "unable to initialize flex_bg meta info!");
+			goto failed_mount5;
+		}
+
 	ext4_ext_init(sb);
 	err = ext4_mb_init(sb);
 	if (err) {
 		ext4_msg(sb, KERN_ERR, "failed to initialize mballoc (%d)",
 			 err);
-		goto failed_mount5;
+		goto failed_mount6;
 	}
 
 	block = ext4_count_free_clusters(sb);
@@ -4780,14 +4790,6 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 		goto failed_mount6;
 	}
 
-	if (ext4_has_feature_flex_bg(sb))
-		if (!ext4_fill_flex_info(sb)) {
-			ext4_msg(sb, KERN_ERR,
-			       "unable to initialize "
-			       "flex_bg meta info!");
-			goto failed_mount6;
-		}
-
 	err = ext4_register_li_request(sb, first_not_zeroed);
 	if (err)
 		goto failed_mount6;
@@ -4872,7 +4874,14 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 	ext4_unregister_li_request(sb);
 failed_mount6:
 	ext4_mb_release(sb);
+	percpu_counter_destroy(&sbi->s_freeclusters_counter);
+	percpu_counter_destroy(&sbi->s_freeinodes_counter);
+	percpu_counter_destroy(&sbi->s_dirs_counter);
+	percpu_counter_destroy(&sbi->s_dirtyclusters_counter);
+	percpu_free_rwsem(&sbi->s_writepages_rwsem);
+
+failed_mount5:
 	rcu_read_lock();
 	flex_groups = rcu_dereference(sbi->s_flex_groups);
 	if (flex_groups) {
 		for (i = 0; i < sbi->s_flex_groups_allocated; i++)
@@ -4880,12 +4889,6 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 		kvfree(flex_groups);
 	}
 	rcu_read_unlock();
-	percpu_counter_destroy(&sbi->s_freeclusters_counter);
-	percpu_counter_destroy(&sbi->s_freeinodes_counter);
-	percpu_counter_destroy(&sbi->s_dirs_counter);
-	percpu_counter_destroy(&sbi->s_dirtyclusters_counter);
-	percpu_free_rwsem(&sbi->s_writepages_rwsem);
-failed_mount5:
 	ext4_ext_release(sb);
 	ext4_release_system_zone(sb);
 failed_mount4a:
-- 
2.28.0.297.g1956fa8f8d-goog


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [RFC PATCH v2 4/9] ext4: add prefetching support for freespace trees
  2020-08-21  1:55 [RFC PATCH v2 0/9] ext4: add free-space extent based allocator Harshad Shirwadkar
                   ` (2 preceding siblings ...)
  2020-08-21  1:55 ` [RFC PATCH v2 3/9] ext4: add freespace tree allocator Harshad Shirwadkar
@ 2020-08-21  1:55 ` Harshad Shirwadkar
  2020-08-21  1:55 ` [RFC PATCH v2 5/9] ext4: add freespace tree optimizations Harshad Shirwadkar
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 14+ messages in thread
From: Harshad Shirwadkar @ 2020-08-21  1:55 UTC (permalink / raw)
  To: linux-ext4; +Cc: tytso, lyx1209, Harshad Shirwadkar

From: Harshad Shirwadkar <harshadshirwadkar@gmail.com>

This patch adds support for creating freespace trees from prefetched
block bitmaps.

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
---
 fs/ext4/mballoc.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 4b1c405543f0..168e9708257f 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -3137,6 +3137,9 @@ ext4_group_t ext4_mb_prefetch(struct super_block *sb, ext4_group_t group,
 void ext4_mb_prefetch_fini(struct super_block *sb, ext4_group_t group,
 			   unsigned int nr)
 {
+	struct ext4_buddy e4b;
+	int ret;
+
 	while (nr-- > 0) {
 		struct ext4_group_desc *gdp = ext4_get_group_desc(sb, group,
 								  NULL);
@@ -3151,8 +3154,15 @@ void ext4_mb_prefetch_fini(struct super_block *sb, ext4_group_t group,
 		    ext4_free_group_clusters(sb, gdp) > 0 &&
 		    !(ext4_has_group_desc_csum(sb) &&
 		      (gdp->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)))) {
-			if (ext4_mb_init_group(sb, group, GFP_NOFS))
+			if (ext4_mb_frsp_on(sb)) {
+				ret = ext4_mb_load_allocator(sb, group, &e4b,
+							     0);
+				if (ret)
+					break;
+				ext4_mb_unload_allocator(&e4b);
+			} else if (ext4_mb_init_group(sb, group, GFP_NOFS)) {
 				break;
+			}
 		}
 	}
 }
-- 
2.28.0.297.g1956fa8f8d-goog


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [RFC PATCH v2 5/9] ext4: add freespace tree optimizations
  2020-08-21  1:55 [RFC PATCH v2 0/9] ext4: add free-space extent based allocator Harshad Shirwadkar
                   ` (3 preceding siblings ...)
  2020-08-21  1:55 ` [RFC PATCH v2 4/9] ext4: add prefetching support for freespace trees Harshad Shirwadkar
@ 2020-08-21  1:55 ` Harshad Shirwadkar
  2020-08-21 10:48   ` kernel test robot
  2020-08-21  1:55 ` [RFC PATCH v2 6/9] ext4: add memory usage tracker for freespace trees Harshad Shirwadkar
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 14+ messages in thread
From: Harshad Shirwadkar @ 2020-08-21  1:55 UTC (permalink / raw)
  To: linux-ext4; +Cc: tytso, lyx1209, Harshad Shirwadkar

From: Harshad Shirwadkar <harshadshirwadkar@gmail.com>

This patch adds an optimization on top of the freespace tree
allocator. We add a new meta-tree which contains the per-flex-bg trees
sorted by the length of their largest free extent. We use this tree to
steer an allocation request so that it either immediately hits a tree
that can satisfy it or falls back to the next CR level.
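
To illustrate, here is a minimal sketch of the descent that the
meta-tree enables. The helper below is hypothetical - it is not part of
this patch - and assumes a meta-tree kept sorted in ascending order of
each tree's largest free extent length:

/*
 * Hypothetical helper: return the index of a flex-bg tree whose
 * largest free extent can satisfy a request of "goal" clusters,
 * preferring the tightest fit; return -1 if no tree qualifies.
 */
static int frsp_meta_lookup(struct rb_root_cached *meta, int goal)
{
	struct rb_node *node = meta->rb_root.rb_node;
	int best = -1;

	while (node) {
		struct ext4_frsp_tree *t = rb_entry(node,
				struct ext4_frsp_tree, frsp_len_node);

		if (t->frsp_max_free_len >= goal) {
			/* This tree can satisfy the request; remember
			 * it and keep looking for a tighter fit. */
			best = t->frsp_index;
			node = node->rb_left;
		} else {
			node = node->rb_right;
		}
	}
	return best;
}

Because this lookup is O(log(number of flex-bgs)), a miss also tells
the allocator immediately that it can fall back to the next CR level.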

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
---
 fs/ext4/ext4.h    |  17 +++++
 fs/ext4/mballoc.c | 178 ++++++++++++++++++++++++++++++++++++++++++++++
 fs/ext4/mballoc.h |   1 +
 fs/ext4/super.c   |   9 +++
 4 files changed, 205 insertions(+)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 513c8473f45f..15e6ce9f1afa 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -153,6 +153,8 @@ enum SHIFT_DIRECTION {
 #define EXT4_MB_USE_RESERVED		0x2000
 /* Do strict check for free blocks while retrying block allocation */
 #define EXT4_MB_STRICT_CHECK		0x4000
+/* Disable freespace optimizations on ac */
+#define EXT4_MB_FRSP_NO_OPTIMIZE	0x8000
 
 struct ext4_allocation_request {
 	/* target inode for block we're allocating */
@@ -1444,12 +1446,22 @@ struct ext4_frsp_tree {
 	__u32 frsp_index;				/* Tree index (flex bg
 							 * number)
 							 */
+	struct list_head frsp_list_node;
+	struct rb_node frsp_len_node;
 };
 
 /* Freespace tree flags */
 
 /* Tree is loaded in memory */
 #define EXT4_MB_FRSP_FLAG_LOADED			0x0001
+/* Tree is inserted into meta tree */
+#define EXT4_MB_FRSP_FLAG_CACHED			0x0002
+
+/* Freespace tree cache aggression levels */
+#define EXT4_MB_FRSP_MIN_CACHE_AGGRESSION		0x0
+#define EXT4_MB_FRSP_DEFAULT_CACHE_AGGRESSION		0x1
+#define EXT4_MB_FRSP_MAX_CACHE_AGGRESSION		0x3
+
 /*
  * fourth extended-fs super-block data in memory
  */
@@ -1590,6 +1602,11 @@ struct ext4_sb_info {
 
 	/* freespace trees stuff */
 	int s_mb_num_frsp_trees;
+	struct rb_root_cached s_mb_frsp_meta_tree;
+	rwlock_t s_mb_frsp_lock;
+	atomic_t s_mb_num_frsp_trees_cached;
+	struct list_head s_mb_uncached_trees;
+	u32 s_mb_frsp_cache_aggression;
 
 	/* workqueue for reserved extent conversions (buffered io) */
 	struct workqueue_struct *rsv_conversion_wq;
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 168e9708257f..1da63afdbb3d 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -779,6 +779,13 @@ static inline int ext4_mb_frsp_len_cmp(struct ext4_frsp_node *arg1,
 	return (arg1->frsp_len > arg2->frsp_len);
 }
 
+/* Compare function for meta tree */
+static inline int ext4_mb_frsp_meta_cmp(struct ext4_frsp_tree *arg1,
+					struct ext4_frsp_tree *arg2)
+{
+	return (arg1->frsp_max_free_len > arg2->frsp_max_free_len);
+}
+
 /* insert to offset-indexed tree */
 static void ext4_mb_frsp_insert_off(struct ext4_frsp_tree *tree,
 				struct ext4_frsp_node *new_entry)
@@ -798,6 +805,35 @@ static void ext4_mb_frsp_insert_len(struct ext4_frsp_tree *tree,
 		ext4_mb_frsp_len_cmp);
 }
 
+void ext4_mb_frsp_meta_reinsert(struct super_block *sb,
+	struct ext4_frsp_tree *tree)
+{
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+	struct ext4_frsp_node *node;
+	struct rb_node *first = rb_first_cached(&tree->frsp_len_root);
+	struct rb_root_cached *meta_root = &EXT4_SB(sb)->s_mb_frsp_meta_tree;
+	int expected_len = 0;
+
+	if (!(tree->frsp_flags & EXT4_MB_FRSP_FLAG_LOADED))
+		return;
+
+	if (first) {
+		node = rb_entry(first, struct ext4_frsp_node, frsp_len_node);
+		expected_len = node->frsp_len;
+	}
+
+	if (tree->frsp_max_free_len == expected_len)
+		return;
+
+	write_lock(&sbi->s_mb_frsp_lock);
+	tree->frsp_max_free_len = expected_len;
+	rb_erase_cached(&tree->frsp_len_node, &sbi->s_mb_frsp_meta_tree);
+	RB_CLEAR_NODE(&tree->frsp_len_node);
+	ext4_mb_rb_insert(meta_root, tree, struct ext4_frsp_tree, frsp_len_node,
+		ext4_mb_frsp_meta_cmp);
+	write_unlock(&sbi->s_mb_frsp_lock);
+}
+
 #ifdef CONFIG_EXT4_DEBUG
 /* print freespace_tree in pre-order traversal */
 void ext4_mb_frsp_print_tree_len(struct super_block *sb,
@@ -966,6 +1002,7 @@ int ext4_mb_frsp_add_region(struct super_block *sb, struct ext4_frsp_tree *tree,
 		}
 	}
 	ext4_mb_frsp_insert_len(tree, new_entry);
+	ext4_mb_frsp_meta_reinsert(sb, tree);
 
 	return 0;
 }
@@ -1063,6 +1100,9 @@ int ext4_mb_frsp_load(struct ext4_buddy *e4b)
 		if (ret)
 			goto out;
 	}
+	if (!(e4b->frsp_tree->frsp_flags & EXT4_MB_FRSP_FLAG_CACHED))
+		atomic_inc(&sbi->s_mb_num_frsp_trees_cached);
+	e4b->frsp_tree->frsp_flags |= EXT4_MB_FRSP_FLAG_CACHED;
 	e4b->frsp_tree->frsp_flags |= EXT4_MB_FRSP_FLAG_LOADED;
 out:
 	for (i = 0; i < ngroups; i++) {
@@ -1156,6 +1196,7 @@ static void ext4_mb_frsp_use_best_found(struct ext4_allocation_context *ac,
 	ext4_mb_load_allocator(ac->ac_sb, group_no, &e4b,
 		EXT4_ALLOCATOR_FRSP_NOLOAD);
 	mb_mark_used(&e4b, bex);
+	ext4_mb_frsp_meta_reinsert(ac->ac_sb, e4b.frsp_tree);
 	ext4_mb_unload_allocator(&e4b);
 }
 /*
@@ -1286,6 +1327,124 @@ void ext4_mb_frsp_find_by_goal(struct ext4_allocation_context *ac)
 	ext4_mb_unload_allocator(&e4b);
 }
 
+/*
+ * Determine whether caching more trees is warranted. This function returns
+ * 1 if it is, 0 if it is not.
+ */
+static int ext4_mb_frsp_should_cache(struct ext4_allocation_context *ac)
+{
+	struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
+
+	if (list_empty(&sbi->s_mb_uncached_trees))
+		return 0;
+
+	if (sbi->s_mb_frsp_cache_aggression >=
+		EXT4_MB_FRSP_MAX_CACHE_AGGRESSION)
+		return 1;
+
+	if (sbi->s_mb_frsp_cache_aggression ==
+		EXT4_MB_FRSP_MIN_CACHE_AGGRESSION)
+		return 0;
+
+	/* At cache aggression level 1, skip caching at CR 0 */
+	if (sbi->s_mb_frsp_cache_aggression == 1 && ac->ac_criteria == 0)
+		return 0;
+
+	/*
+	 * At cache aggression level 2, perform caching for every alternate
+	 * optimization.
+	 */
+	return (ac->ac_num_optimizations & 0x1);
+}
+
+/*
+ * Optimize an allocation request. This function _tries_ to look up the
+ * meta-tree and, if it can optimize the search in any way, it does so. It
+ * returns 1 if an optimization was performed; in that case, the caller
+ * should restart the search from the tree returned in *tree_idx.
+ */
+int ext4_mb_frsp_optimize(struct ext4_allocation_context *ac, int *tree_idx)
+{
+	struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
+	struct ext4_frsp_tree *cur = NULL;
+	struct rb_node *node = NULL;
+	int found = 0, best = 0, cache_more_trees = 0, better_len = 0, ret = 0;
+
+	if (ac->ac_flags & EXT4_MB_HINT_FIRST ||
+		ac->ac_flags & EXT4_MB_FRSP_NO_OPTIMIZE ||
+		ac->ac_status != AC_STATUS_CONTINUE)
+		return 0;
+
+	ac->ac_num_optimizations++;
+	if (!read_trylock(&sbi->s_mb_frsp_lock))
+		return 0;
+
+	node = sbi->s_mb_frsp_meta_tree.rb_root.rb_node;
+	while (node) {
+		cur = rb_entry(node, struct ext4_frsp_tree, frsp_len_node);
+		if (ext4_mb_frsp_is_better(ac, cur->frsp_max_free_len, &best)) {
+			/*
+			 * This tree definitely has a better node than the best
+			 * found so far.
+			 */
+			found = 1;
+			ret = 1;
+			*tree_idx = cur->frsp_index;
+			better_len = cur->frsp_max_free_len;
+			if (best)
+				break;
+		}
+		if (cur->frsp_max_free_len > ac->ac_g_ex.fe_len)
+			node = node->rb_right;
+		else
+			node = node->rb_left;
+	}
+
+	if (ac->ac_found == 0 && !found) {
+		/*
+		 * If we didn't find a good match above and hadn't found any
+		 * match earlier either, we need to loosen our criteria. Note
+		 * that if we had found something earlier, not finding a
+		 * better node here doesn't imply that no better node is
+		 * available.
+		 * TODO - in this case determine probabilistically which tree
+		 * may have a better node and direct our allocator there.
+		 */
+		if (ext4_mb_frsp_should_cache(ac)) {
+			cache_more_trees = 1;
+		} else if (ac->ac_criteria < 2) {
+			ac->ac_criteria++;
+			ac->ac_groups_scanned = 0;
+			*tree_idx = 0;
+			ret = 1;
+		} else {
+			ac->ac_flags |= EXT4_MB_HINT_FIRST;
+		}
+	} else if (!best && !list_empty(&sbi->s_mb_uncached_trees)) {
+		cache_more_trees = ext4_mb_frsp_should_cache(ac);
+	}
+
+	if (cache_more_trees) {
+		cur = list_first_entry(&sbi->s_mb_uncached_trees,
+				struct ext4_frsp_tree, frsp_list_node);
+		list_del_init(&cur->frsp_list_node);
+		*tree_idx = cur->frsp_index;
+		ret = 1;
+	}
+	read_unlock(&sbi->s_mb_frsp_lock);
+
+	/*
+	 * If we couldn't optimize now, it's unlikely that we'll be able to
+	 * optimize this request any further.
+	 */
+	if (!ret)
+		ac->ac_flags |= EXT4_MB_FRSP_NO_OPTIMIZE;
+	mb_debug(ac->ac_sb,
+		"Optimizer suggestion: found = %d, tree = %d, len = %d, cr = %d\n",
+		found, *tree_idx, better_len, ac->ac_criteria);
+	return ret;
+}
+
 static void ext4_mb_frsp_process(struct ext4_allocation_context *ac,
 				struct ext4_frsp_tree *tree)
 {
@@ -1324,6 +1483,7 @@ ext4_mb_tree_allocator(struct ext4_allocation_context *ac)
 	struct ext4_frsp_node *cur = NULL;
 	struct ext4_tree_extent *btx = NULL;
 	int ret = 0, start_idx = 0, tree_idx, j;
+	int optimize;
 
 	sb = ac->ac_sb;
 	btx = &ac->ac_b_tree_ex;
@@ -1341,6 +1501,8 @@ ext4_mb_tree_allocator(struct ext4_allocation_context *ac)
 
 	ac->ac_criteria = 0;
 	ac->ac_groups_scanned = 0;
+	ext4_mb_frsp_optimize(ac, &start_idx);
+
 repeat:
 
 	/* Loop through the rest of trees (flex_bg) */
@@ -1357,13 +1519,17 @@ ext4_mb_tree_allocator(struct ext4_allocation_context *ac)
 		mutex_lock(&e4b.frsp_tree->frsp_lock);
 		ext4_mb_frsp_process(ac, e4b.frsp_tree);
 		mutex_unlock(&e4b.frsp_tree->frsp_lock);
+		optimize = ext4_mb_frsp_optimize(ac, &start_idx);
 		ext4_mb_unload_allocator(&e4b);
+		if (optimize)
+			goto repeat;
 	}
 
 	if (ac->ac_status != AC_STATUS_FOUND) {
 		if (ac->ac_criteria < 2) {
 			ac->ac_criteria++;
 			ac->ac_groups_scanned = 0;
+			ac->ac_flags &= ~EXT4_MB_FRSP_NO_OPTIMIZE;
 			mb_debug(sb, "Falling back to CR=%d", ac->ac_criteria);
 			goto repeat;
 		}
@@ -1415,6 +1581,7 @@ static void ext4_mb_frsp_init_tree(struct ext4_frsp_tree *tree, int index)
 	mutex_init(&(tree->frsp_lock));
 	tree->frsp_flags = 0;
 	tree->frsp_index = index;
+	INIT_LIST_HEAD(&tree->frsp_list_node);
 }
 
 int ext4_mb_init_freespace_trees(struct super_block *sb)
@@ -1425,6 +1592,8 @@ int ext4_mb_init_freespace_trees(struct super_block *sb)
 
 	sbi->s_mb_num_frsp_trees =
 		ext4_num_grps_to_flexbg(sb, ext4_get_groups_count(sb));
+	sbi->s_mb_frsp_meta_tree = RB_ROOT_CACHED;
+	INIT_LIST_HEAD(&sbi->s_mb_uncached_trees);
 
 	for (i = 0; i < sbi->s_mb_num_frsp_trees; i++) {
 		fg = sbi_array_rcu_deref(sbi, s_flex_groups, i);
@@ -1433,7 +1602,11 @@ int ext4_mb_init_freespace_trees(struct super_block *sb)
 		if (!fg->frsp_tree)
 			return -ENOMEM;
 		ext4_mb_frsp_init_tree(fg->frsp_tree, i);
+		list_add(&fg->frsp_tree->frsp_list_node,
+				&sbi->s_mb_uncached_trees);
 	}
+	rwlock_init(&sbi->s_mb_frsp_lock);
+	atomic_set(&sbi->s_mb_num_frsp_trees_cached, 0);
 
 	return 0;
 }
@@ -1460,6 +1633,11 @@ int ext4_mb_add_frsp_trees(struct super_block *sb, int ngroups)
 		if (!fg->frsp_tree)
 			return -ENOMEM;
 		ext4_mb_frsp_init_tree(fg->frsp_tree, i);
+		write_lock(&sbi->s_mb_frsp_lock);
+		list_add(&fg->frsp_tree->frsp_list_node,
+				&sbi->s_mb_uncached_trees);
+		write_unlock(&sbi->s_mb_frsp_lock);
+		ext4_mb_frsp_meta_reinsert(sb, fg->frsp_tree);
 	}
 	sbi->s_mb_num_frsp_trees = flex_bg_count;
 
diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h
index af61651258a3..1fcdd3e6f7d5 100644
--- a/fs/ext4/mballoc.h
+++ b/fs/ext4/mballoc.h
@@ -200,6 +200,7 @@ struct ext4_allocation_context {
 	__u8 ac_criteria;
 	__u8 ac_2order;		/* if request is to allocate 2^N blocks and
 				 * N > 0, the field stores N, otherwise 0 */
+	__u8 ac_num_optimizations;
 	__u8 ac_op;		/* operation, for history only */
 	struct page *ac_bitmap_page;
 	struct page *ac_buddy_page;
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 55ff4e8be976..6a2d43d81b14 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1527,6 +1527,8 @@ enum {
 	Opt_discard, Opt_nodiscard, Opt_init_itable, Opt_noinit_itable,
 	Opt_max_dir_size_kb, Opt_nojournal_checksum, Opt_nombcache,
 	Opt_prefetch_block_bitmaps, Opt_freespace_tree,
+	Opt_frsp_cache_aggression,
 };
 
 static const match_table_t tokens = {
@@ -1620,6 +1622,7 @@ static const match_table_t tokens = {
 	{Opt_nombcache, "nombcache"},
 	{Opt_nombcache, "no_mbcache"},	/* for backward compatibility */
 	{Opt_prefetch_block_bitmaps, "prefetch_block_bitmaps"},
+	{Opt_frsp_cache_aggression, "frsp_cache_aggression=%u"},
 	{Opt_removed, "check=none"},	/* mount option from ext2/3 */
 	{Opt_removed, "nocheck"},	/* mount option from ext2/3 */
 	{Opt_removed, "reservation"},	/* mount option from ext2/3 */
@@ -1836,6 +1839,7 @@ static const struct mount_opts {
 	{Opt_jqfmt_vfsv0, QFMT_VFS_V0, MOPT_QFMT},
 	{Opt_jqfmt_vfsv1, QFMT_VFS_V1, MOPT_QFMT},
 	{Opt_max_dir_size_kb, 0, MOPT_GTE0},
+	{Opt_frsp_cache_aggression, 0, MOPT_GTE0},
 	{Opt_test_dummy_encryption, 0, MOPT_STRING},
 	{Opt_nombcache, EXT4_MOUNT_NO_MBCACHE, MOPT_SET},
 	{Opt_prefetch_block_bitmaps, EXT4_MOUNT_PREFETCH_BLOCK_BITMAPS,
@@ -2043,6 +2047,10 @@ static int handle_mount_opt(struct super_block *sb, char *opt, int token,
 		sbi->s_li_wait_mult = arg;
 	} else if (token == Opt_max_dir_size_kb) {
 		sbi->s_max_dir_size_kb = arg;
+	} else if (token == Opt_frsp_cache_aggression) {
+		sbi->s_mb_frsp_cache_aggression =
+			min(EXT4_MB_FRSP_MAX_CACHE_AGGRESSION,
+				max(EXT4_MB_FRSP_MIN_CACHE_AGGRESSION, arg));
 	} else if (token == Opt_stripe) {
 		sbi->s_stripe = arg;
 	} else if (token == Opt_resuid) {
@@ -3962,6 +3970,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 	sbi->s_commit_interval = JBD2_DEFAULT_MAX_COMMIT_AGE * HZ;
 	sbi->s_min_batch_time = EXT4_DEF_MIN_BATCH_TIME;
 	sbi->s_max_batch_time = EXT4_DEF_MAX_BATCH_TIME;
+	sbi->s_mb_frsp_cache_aggression = EXT4_MB_FRSP_DEFAULT_CACHE_AGGRESSION;
 
 	if ((def_mount_opts & EXT4_DEFM_NOBARRIER) == 0)
 		set_opt(sb, BARRIER);
-- 
2.28.0.297.g1956fa8f8d-goog


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [RFC PATCH v2 6/9] ext4: add memory usage tracker for freespace trees
  2020-08-21  1:55 [RFC PATCH v2 0/9] ext4: add free-space extent based allocator Harshad Shirwadkar
                   ` (4 preceding siblings ...)
  2020-08-21  1:55 ` [RFC PATCH v2 5/9] ext4: add freespace tree optimizations Harshad Shirwadkar
@ 2020-08-21  1:55 ` Harshad Shirwadkar
  2020-08-21  1:55 ` [RFC PATCH v2 7/9] ext4: add LRU eviction for free space trees Harshad Shirwadkar
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 14+ messages in thread
From: Harshad Shirwadkar @ 2020-08-21  1:55 UTC (permalink / raw)
  To: linux-ext4; +Cc: tytso, lyx1209, Harshad Shirwadkar

From: Harshad Shirwadkar <harshadshirwadkar@gmail.com>

Freespace trees can occupy a lot of memory as the fragmentation level
increases. This patch adds a sysfs file to monitor the memory usage of
the freespace tree allocator, and a sysfs tunable to control the
maximum memory the allocator may use. If the allocator exceeds this
threshold, the file system enters the "FRSP_MEM_CRUNCH" state. The next
patch in the series performs LRU eviction when this state is reached.
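
For reference, the threshold check boils down to a node-count
comparison. A condensed, illustrative sketch (the helper name is made
up and is not part of this patch):

/*
 * The byte limit is converted into a node count, since every free
 * extent costs one struct ext4_frsp_node. A limit of 0 means that no
 * limit is enforced.
 */
static bool frsp_over_mem_limit(struct ext4_sb_info *sbi)
{
	unsigned long max_nodes;

	if (!sbi->s_mb_frsp_mem_limit)
		return false;
	max_nodes = sbi->s_mb_frsp_mem_limit /
			sizeof(struct ext4_frsp_node);
	return atomic_read(&sbi->s_mb_num_fragments) > max_nodes;
}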

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
---
 fs/ext4/ext4.h    |  8 ++++++++
 fs/ext4/mballoc.c | 20 ++++++++++++++++++++
 fs/ext4/mballoc.h |  4 ++++
 fs/ext4/sysfs.c   | 11 +++++++++++
 4 files changed, 43 insertions(+)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 15e6ce9f1afa..93bf2fe35cf1 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1223,6 +1223,12 @@ struct ext4_inode_info {
 						    * allocator off)
 						    */
 
+#define EXT4_MOUNT2_FRSP_MEM_CRUNCH	0x00000040 /*
+						    * Freespace tree allocator
+						    * is in a tight memory
+						    * situation.
+						    */
+
 #define clear_opt(sb, opt)		EXT4_SB(sb)->s_mount_opt &= \
 						~EXT4_MOUNT_##opt
 #define set_opt(sb, opt)		EXT4_SB(sb)->s_mount_opt |= \
@@ -1607,6 +1613,8 @@ struct ext4_sb_info {
 	atomic_t s_mb_num_frsp_trees_cached;
 	struct list_head s_mb_uncached_trees;
 	u32 s_mb_frsp_cache_aggression;
+	atomic_t s_mb_num_fragments;
+	u32 s_mb_frsp_mem_limit;
 
 	/* workqueue for reserved extent conversions (buffered io) */
 	struct workqueue_struct *rsv_conversion_wq;
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 1da63afdbb3d..b28b7fb0506e 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -869,6 +869,7 @@ void ext4_mb_frsp_print_tree_len(struct super_block *sb,
 static struct ext4_frsp_node *ext4_mb_frsp_alloc_node(struct super_block *sb)
 {
 	struct ext4_frsp_node *node;
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
 
 	node = kmem_cache_alloc(ext4_freespace_node_cachep, GFP_NOFS);
 	if (!node)
@@ -877,13 +878,31 @@ static struct ext4_frsp_node *ext4_mb_frsp_alloc_node(struct super_block *sb)
 	RB_CLEAR_NODE(&node->frsp_node);
 	RB_CLEAR_NODE(&node->frsp_len_node);
 
+	atomic_inc(&sbi->s_mb_num_fragments);
+
+	if (sbi->s_mb_frsp_mem_limit &&
+		atomic_read(&sbi->s_mb_num_fragments) >
+		EXT4_FRSP_MEM_LIMIT_TO_NUM_NODES(sb))
+		set_opt2(sb, FRSP_MEM_CRUNCH);
+	else
+		clear_opt2(sb, FRSP_MEM_CRUNCH);
+
 	return node;
 }
 
 static void ext4_mb_frsp_free_node(struct super_block *sb,
 		struct ext4_frsp_node *node)
 {
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+
 	kmem_cache_free(ext4_freespace_node_cachep, node);
+	atomic_dec(&sbi->s_mb_num_fragments);
+
+	if (!sbi->s_mb_frsp_mem_limit ||
+		atomic_read(&sbi->s_mb_num_fragments) <
+		EXT4_FRSP_MEM_LIMIT_TO_NUM_NODES(sb))
+		clear_opt2(sb, FRSP_MEM_CRUNCH);
 }
 
 /* Evict a tree from memory */
@@ -1607,6 +1626,7 @@ int ext4_mb_init_freespace_trees(struct super_block *sb)
 	}
 	rwlock_init(&sbi->s_mb_frsp_lock);
 	atomic_set(&sbi->s_mb_num_frsp_trees_cached, 0);
+	atomic_set(&sbi->s_mb_num_fragments, 0);
 
 	return 0;
 }
diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h
index 1fcdd3e6f7d5..6cfb228e4da2 100644
--- a/fs/ext4/mballoc.h
+++ b/fs/ext4/mballoc.h
@@ -92,6 +92,10 @@ struct ext4_frsp_node {
 	struct rb_node frsp_node;
 	struct rb_node frsp_len_node;
 };
+
+#define EXT4_FRSP_MEM_LIMIT_TO_NUM_NODES(__sb)				\
+	(EXT4_SB(__sb)->s_mb_frsp_mem_limit / sizeof(struct ext4_frsp_node))
+
 struct ext4_free_data {
 	/* this links the free block information from sb_info */
 	struct list_head		efd_list;
diff --git a/fs/ext4/sysfs.c b/fs/ext4/sysfs.c
index bfabb799fa45..19301b10944b 100644
--- a/fs/ext4/sysfs.c
+++ b/fs/ext4/sysfs.c
@@ -8,6 +8,7 @@
  *
  */
 
+#include "mballoc.h"
 #include <linux/time.h>
 #include <linux/fs.h>
 #include <linux/seq_file.h>
@@ -24,6 +25,7 @@ typedef enum {
 	attr_session_write_kbytes,
 	attr_lifetime_write_kbytes,
 	attr_reserved_clusters,
+	attr_frsp_tree_usage,
 	attr_inode_readahead,
 	attr_trigger_test_error,
 	attr_first_error_time,
@@ -208,6 +210,7 @@ EXT4_ATTR_FUNC(delayed_allocation_blocks, 0444);
 EXT4_ATTR_FUNC(session_write_kbytes, 0444);
 EXT4_ATTR_FUNC(lifetime_write_kbytes, 0444);
 EXT4_ATTR_FUNC(reserved_clusters, 0644);
+EXT4_ATTR_FUNC(frsp_tree_usage, 0444);
 
 EXT4_ATTR_OFFSET(inode_readahead_blks, 0644, inode_readahead,
 		 ext4_sb_info, s_inode_readahead_blks);
@@ -248,6 +251,7 @@ EXT4_ATTR(last_error_time, 0444, last_error_time);
 EXT4_ATTR(journal_task, 0444, journal_task);
 EXT4_RW_ATTR_SBI_UI(mb_prefetch, s_mb_prefetch);
 EXT4_RW_ATTR_SBI_UI(mb_prefetch_limit, s_mb_prefetch_limit);
+EXT4_RW_ATTR_SBI_UI(mb_frsp_max_mem, s_mb_frsp_mem_limit);
 
 static unsigned int old_bump_val = 128;
 EXT4_ATTR_PTR(max_writeback_mb_bump, 0444, pointer_ui, &old_bump_val);
@@ -257,6 +261,7 @@ static struct attribute *ext4_attrs[] = {
 	ATTR_LIST(session_write_kbytes),
 	ATTR_LIST(lifetime_write_kbytes),
 	ATTR_LIST(reserved_clusters),
+	ATTR_LIST(frsp_tree_usage),
 	ATTR_LIST(inode_readahead_blks),
 	ATTR_LIST(inode_goal),
 	ATTR_LIST(mb_stats),
@@ -296,6 +301,7 @@ static struct attribute *ext4_attrs[] = {
 #endif
 	ATTR_LIST(mb_prefetch),
 	ATTR_LIST(mb_prefetch_limit),
+	ATTR_LIST(mb_frsp_max_mem),
 	NULL,
 };
 ATTRIBUTE_GROUPS(ext4);
@@ -378,6 +384,11 @@ static ssize_t ext4_attr_show(struct kobject *kobj,
 		return snprintf(buf, PAGE_SIZE, "%llu\n",
 				(unsigned long long)
 				atomic64_read(&sbi->s_resv_clusters));
+	case attr_frsp_tree_usage:
+		return snprintf(buf, PAGE_SIZE, "%llu\n",
+				(unsigned long long)
+				atomic_read(&sbi->s_mb_num_fragments) *
+				sizeof(struct ext4_frsp_node));
 	case attr_inode_readahead:
 	case attr_pointer_ui:
 		if (!ptr)
-- 
2.28.0.297.g1956fa8f8d-goog


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [RFC PATCH v2 7/9] ext4: add LRU eviction for free space trees
  2020-08-21  1:55 [RFC PATCH v2 0/9] ext4: add free-space extent based allocator Harshad Shirwadkar
                   ` (5 preceding siblings ...)
  2020-08-21  1:55 ` [RFC PATCH v2 6/9] ext4: add memory usage tracker for freespace trees Harshad Shirwadkar
@ 2020-08-21  1:55 ` Harshad Shirwadkar
  2020-08-21  1:55 ` [RFC PATCH v2 8/9] ext4: add tracepoints " Harshad Shirwadkar
  2020-08-21  1:55 ` [RFC PATCH v2 9/9] ext4: add freespace trees documentation in code Harshad Shirwadkar
  8 siblings, 0 replies; 14+ messages in thread
From: Harshad Shirwadkar @ 2020-08-21  1:55 UTC (permalink / raw)
  To: linux-ext4; +Cc: tytso, lyx1209, Harshad Shirwadkar

From: Harshad Shirwadkar <harshadshirwadkar@gmail.com>

This patch adds an LRU eviction policy for freespace trees. In order
to avoid contention on a single LRU lock, the LRU scheme is implemented
as two lists - an active list and an inactive list. Trees in the active
list aren't moved around within the list, thereby avoiding the need for
a lock on every access. Trees in the inactive list become active only
if they are accessed twice within a short interval, which prevents
outliers from entering the active list.
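
The promotion rule can be condensed into the following sketch (a
hypothetical helper written in terms of the names introduced by this
patch):

static bool frsp_should_promote(struct ext4_mb_frsp_lru *lru,
				struct ext4_frsp_tree *tree,
				u32 max_active_fragments)
{
	if (!tree->frsp_last_access)	/* first access ever */
		return true;
	if (!max_active_fragments ||
	    atomic_read(&lru->frsp_active_fragments) <=
			max_active_fragments)
		return true;		/* active list has headroom */
	/* promote only if the two accesses were close together */
	return time_before(jiffies, tree->frsp_last_access +
			EXT4_MB_FRSP_ACTIVE_THRESHOLD_JIFFIES);
}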

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
---
 fs/ext4/ext4.h    |  46 +++++++++++++
 fs/ext4/mballoc.c | 164 ++++++++++++++++++++++++++++++++++++++++++----
 2 files changed, 197 insertions(+), 13 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 93bf2fe35cf1..64d0dbbcd517 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1454,6 +1454,10 @@ struct ext4_frsp_tree {
 							 */
 	struct list_head frsp_list_node;
 	struct rb_node frsp_len_node;
+	atomic_t frsp_fragments;
+	struct list_head frsp_lru_active_node;
+	unsigned long frsp_last_access;
+	atomic_t frsp_ref;
 };
 
 /* Freespace tree flags */
@@ -1468,6 +1472,47 @@ struct ext4_frsp_tree {
 #define EXT4_MB_FRSP_DEFAULT_CACHE_AGGRESSION		0x1
 #define EXT4_MB_FRSP_MAX_CACHE_AGGRESSION		0x3
 
+/* LRU management for free space trees */
+struct ext4_mb_frsp_lru {
+	rwlock_t frsp_lru_lock;
+	atomic_t frsp_active_fragments;		/* Current #of fragments in
+						 * the active queue
+						 */
+	u32 frsp_max_active_fragments;
+	/*
+	 * List of active trees. Trees at tail are oldest trees in active set.
+	 */
+	struct list_head frsp_lru_active;
+	/*
+	 * List of inactive trees but loaded trees.
+	 */
+	struct list_head frsp_lru_inactive;
+	struct super_block *frsp_lru_sb;
+};
+
+/*
+ * Minimum memory needed for our allocator. The pathological worst case for
+ * freespace trees is when every other block is allocated. In this case,
+ * the allocator will end up storing an extent node of length 1 for each free
+ * block. We need to make sure that the minimum memory available for the
+ * allocator is at least as much as is needed for one worst-case tree. This
+ * ensures that we can keep at least one tree in memory and avoid thrashing.
+ */
+#define EXT4_MB_FRSP_MEM_MIN(sb)					\
+	((sizeof(struct ext4_frsp_node) * ext4_flex_bg_size(EXT4_SB(sb))) \
+		* (EXT4_BLOCKS_PER_GROUP(sb) / 2))
+
+/* Half of the total memory available is allowed for active list */
+#define EXT4_MB_FRSP_MAX_ACTIVE_FRAGMENTS(sb)				\
+	((EXT4_SB(sb)->s_mb_frsp_mem_limit /				\
+		(2 * sizeof(struct ext4_frsp_node))))
+
+/*
+ * Maximum number of jiffies allowed between two successive hits on a freespace
+ * tree before we move it to the "active" queue.
+ */
+#define EXT4_MB_FRSP_ACTIVE_THRESHOLD_JIFFIES		100
+
 /*
  * fourth extended-fs super-block data in memory
  */
@@ -1615,6 +1660,7 @@ struct ext4_sb_info {
 	u32 s_mb_frsp_cache_aggression;
 	atomic_t s_mb_num_fragments;
 	u32 s_mb_frsp_mem_limit;
+	struct ext4_mb_frsp_lru s_mb_frsp_lru;
 
 	/* workqueue for reserved extent conversions (buffered io) */
 	struct workqueue_struct *rsv_conversion_wq;
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index b28b7fb0506e..aea0eb8d28da 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -866,10 +866,12 @@ void ext4_mb_frsp_print_tree_len(struct super_block *sb,
 }
 #endif
 
-static struct ext4_frsp_node *ext4_mb_frsp_alloc_node(struct super_block *sb)
+static struct ext4_frsp_node *ext4_mb_frsp_alloc_node(struct super_block *sb,
+			struct ext4_frsp_tree *tree)
 {
 	struct ext4_frsp_node *node;
 	struct ext4_sb_info *sbi = EXT4_SB(sb);
+	struct ext4_mb_frsp_lru *lru = &EXT4_SB(sb)->s_mb_frsp_lru;
 
 	node = kmem_cache_alloc(ext4_freespace_node_cachep, GFP_NOFS);
 	if (!node)
@@ -879,6 +881,11 @@ static struct ext4_frsp_node *ext4_mb_frsp_alloc_node(struct super_block *sb)
 	RB_CLEAR_NODE(&node->frsp_len_node);
 
 	atomic_inc(&sbi->s_mb_num_fragments);
+	atomic_inc(&tree->frsp_fragments);
+	read_lock(&lru->frsp_lru_lock);
+	if (!list_empty(&tree->frsp_lru_active_node))
+		atomic_inc(&lru->frsp_active_fragments);
+	read_unlock(&lru->frsp_lru_lock);
 
 	if (sbi->s_mb_frsp_mem_limit &&
 		atomic_read(&sbi->s_mb_num_fragments) >
@@ -892,12 +899,18 @@ static struct ext4_frsp_node *ext4_mb_frsp_alloc_node(struct super_block *sb)
 }
 
 static void ext4_mb_frsp_free_node(struct super_block *sb,
-		struct ext4_frsp_node *node)
+		struct ext4_frsp_tree *tree, struct ext4_frsp_node *node)
 {
 	struct ext4_sb_info *sbi = EXT4_SB(sb);
+	struct ext4_mb_frsp_lru *lru = &EXT4_SB(sb)->s_mb_frsp_lru;
 
 	kmem_cache_free(ext4_freespace_node_cachep, node);
 	atomic_dec(&sbi->s_mb_num_fragments);
+	atomic_dec(&tree->frsp_fragments);
+	read_lock(&lru->frsp_lru_lock);
+	if (!list_empty(&tree->frsp_lru_active_node))
+		atomic_dec(&lru->frsp_active_fragments);
+	read_unlock(&lru->frsp_lru_lock);
 
 	if (!sbi->s_mb_frsp_mem_limit ||
 		atomic_read(&sbi->s_mb_num_fragments) <
@@ -915,6 +928,8 @@ void ext4_mb_frsp_free_tree(struct super_block *sb, struct ext4_frsp_tree *tree)
 	mutex_lock(&tree->frsp_lock);
 	if (!(tree->frsp_flags & EXT4_MB_FRSP_FLAG_LOADED))
 		goto out;
+	if (atomic_read(&tree->frsp_ref))
+		goto out;
 
 	node = rb_first_cached(&tree->frsp_offset_root);
 	while (node) {
@@ -922,7 +937,7 @@ void ext4_mb_frsp_free_tree(struct super_block *sb, struct ext4_frsp_tree *tree)
 		rb_erase_cached(node, &tree->frsp_offset_root);
 		rb_erase_cached(&frsp_node->frsp_len_node,
 				&tree->frsp_len_root);
-		ext4_mb_frsp_free_node(sb, frsp_node);
+		ext4_mb_frsp_free_node(sb, tree, frsp_node);
 		node = rb_first_cached(&tree->frsp_offset_root);
 	}
 	tree->frsp_flags &= ~EXT4_MB_FRSP_FLAG_LOADED;
@@ -985,7 +1000,7 @@ int ext4_mb_frsp_add_region(struct super_block *sb, struct ext4_frsp_tree *tree,
 				*prev_entry = NULL;
 	struct rb_node *left = NULL, *right = NULL;
 
-	new_entry = ext4_mb_frsp_alloc_node(sb);
+	new_entry = ext4_mb_frsp_alloc_node(sb, tree);
 	if (!new_entry)
 		return -ENOMEM;
 
@@ -1004,7 +1019,7 @@ int ext4_mb_frsp_add_region(struct super_block *sb, struct ext4_frsp_tree *tree,
 			rb_erase_cached(left, &tree->frsp_offset_root);
 			rb_erase_cached(&prev_entry->frsp_len_node,
 						&tree->frsp_len_root);
-			ext4_mb_frsp_free_node(sb, prev_entry);
+			ext4_mb_frsp_free_node(sb, tree, prev_entry);
 		}
 	}
 
@@ -1017,7 +1032,7 @@ int ext4_mb_frsp_add_region(struct super_block *sb, struct ext4_frsp_tree *tree,
 			rb_erase_cached(right, &tree->frsp_offset_root);
 			rb_erase_cached(&next_entry->frsp_len_node,
 						&tree->frsp_len_root);
-			ext4_mb_frsp_free_node(sb, next_entry);
+			ext4_mb_frsp_free_node(sb, tree, next_entry);
 		}
 	}
 	ext4_mb_frsp_insert_len(tree, new_entry);
@@ -1601,6 +1616,10 @@ static void ext4_mb_frsp_init_tree(struct ext4_frsp_tree *tree, int index)
 	tree->frsp_flags = 0;
 	tree->frsp_index = index;
 	INIT_LIST_HEAD(&tree->frsp_list_node);
+	atomic_set(&tree->frsp_fragments, 0);
+	tree->frsp_last_access = 0;
+	INIT_LIST_HEAD(&tree->frsp_lru_active_node);
+	atomic_set(&tree->frsp_ref, 0);
 }
 
 int ext4_mb_init_freespace_trees(struct super_block *sb)
@@ -1627,6 +1646,15 @@ int ext4_mb_init_freespace_trees(struct super_block *sb)
 	rwlock_init(&sbi->s_mb_frsp_lock);
 	atomic_set(&sbi->s_mb_num_frsp_trees_cached, 0);
 	atomic_set(&sbi->s_mb_num_fragments, 0);
+	INIT_LIST_HEAD(&sbi->s_mb_frsp_lru.frsp_lru_active);
+	INIT_LIST_HEAD(&sbi->s_mb_frsp_lru.frsp_lru_inactive);
+	rwlock_init(&sbi->s_mb_frsp_lru.frsp_lru_lock);
+	/* Set the default hard-limit to be as much as buddy bitmaps */
+	sbi->s_mb_frsp_mem_limit = ext4_get_groups_count(sb) <<
+				(sb->s_blocksize_bits + 1);
+	if (sbi->s_mb_frsp_mem_limit < EXT4_MB_FRSP_MEM_MIN(sb))
+		sbi->s_mb_frsp_mem_limit = EXT4_MB_FRSP_MEM_MIN(sb);
+	atomic_set(&sbi->s_mb_frsp_lru.frsp_active_fragments, 0);
 
 	return 0;
 }
@@ -1664,6 +1692,99 @@ int ext4_mb_add_frsp_trees(struct super_block *sb, int ngroups)
 	return 0;
 }
 
+/*
+ * Evict one tree from the inactive list.
+ */
+static void ext4_frsp_evict(struct super_block *sb)
+{
+	struct ext4_mb_frsp_lru *lru = &EXT4_SB(sb)->s_mb_frsp_lru;
+	struct ext4_frsp_tree *tree = NULL;
+	bool found = false;
+
+	write_lock(&lru->frsp_lru_lock);
+	if (!list_empty(&lru->frsp_lru_inactive)) {
+		/* Evict from front, insert at tail */
+		found = false;
+		list_for_each_entry(tree, &lru->frsp_lru_inactive,
+			frsp_list_node) {
+			if (!atomic_read(&tree->frsp_ref)) {
+				found = true;
+				break;
+			}
+		}
+		if (found)
+			list_del_init(&tree->frsp_list_node);
+	}
+	write_unlock(&lru->frsp_lru_lock);
+	if (found)
+		ext4_mb_frsp_free_tree(sb, tree);
+}
+
+/*
+ * This function maintains LRU lists. "tree" has just been accessed.
+ */
+static void ext4_mb_frsp_maintain_lru(struct super_block *sb,
+					struct ext4_frsp_tree *tree)
+{
+	struct ext4_mb_frsp_lru *lru = &EXT4_SB(sb)->s_mb_frsp_lru;
+	struct ext4_frsp_tree *last;
+	unsigned long current_jiffies = jiffies;
+
+	read_lock(&lru->frsp_lru_lock);
+	if (!list_empty(&tree->frsp_lru_active_node)) {
+		/* Already active, nothing to do */
+		read_unlock(&lru->frsp_lru_lock);
+		goto out;
+	}
+
+	/*
+	 * Check if this tree needs to be moved to the active list. We move
+	 * it there if one of the following three conditions is true:
+	 * - This is the first access to this tree.
+	 * - We haven't yet reached the max active fragments threshold, so
+	 *   there is space in the active list.
+	 * - This tree was accessed twice within the
+	 *   EXT4_MB_FRSP_ACTIVE_THRESHOLD_JIFFIES interval.
+	 */
+	if (tree->frsp_last_access &&
+		EXT4_MB_FRSP_MAX_ACTIVE_FRAGMENTS(sb) &&
+		atomic_read(&lru->frsp_active_fragments) >
+			EXT4_MB_FRSP_MAX_ACTIVE_FRAGMENTS(sb) &&
+		current_jiffies - tree->frsp_last_access >
+		EXT4_MB_FRSP_ACTIVE_THRESHOLD_JIFFIES) {
+		read_unlock(&lru->frsp_lru_lock);
+		goto out;
+	}
+	read_unlock(&lru->frsp_lru_lock);
+
+	write_lock(&lru->frsp_lru_lock);
+	/* Check again just in case */
+	if (!list_empty(&tree->frsp_lru_active_node)) {
+		write_unlock(&lru->frsp_lru_lock);
+		goto out;
+	}
+	list_add(&tree->frsp_lru_active_node, &lru->frsp_lru_active);
+	list_del_init(&tree->frsp_list_node);
+	atomic_add(atomic_read(&tree->frsp_fragments),
+			&lru->frsp_active_fragments);
+	/* Remove trees from active queue until we are below the limit */
+	while (EXT4_MB_FRSP_MAX_ACTIVE_FRAGMENTS(sb) &&
+		atomic_read(&lru->frsp_active_fragments) >
+			EXT4_MB_FRSP_MAX_ACTIVE_FRAGMENTS(sb)) {
+		last = list_last_entry(&lru->frsp_lru_active,
+				struct ext4_frsp_tree, frsp_lru_active_node);
+		list_del_init(&last->frsp_lru_active_node);
+		/* Evict from front, insert at tail */
+		list_add_tail(&last->frsp_list_node, &lru->frsp_lru_inactive);
+		atomic_sub(atomic_read(&last->frsp_fragments),
+			&lru->frsp_active_fragments);
+	}
+	write_unlock(&lru->frsp_lru_lock);
+
+out:
+	tree->frsp_last_access = current_jiffies;
+}
+
 /*
  * Divide the blocks starting at @first with length @len into
  * smaller chunks of power-of-2 blocks.
@@ -2154,10 +2275,12 @@ ext4_mb_load_allocator_gfp(struct super_block *sb, ext4_group_t group,
 		e4b->frsp_tree = ext4_get_frsp_tree(sb,
 					ext4_flex_group(sbi, group));
 		e4b->frsp_flags = flags;
+		ext4_mb_frsp_maintain_lru(sb, e4b->frsp_tree);
 		if (flags & EXT4_ALLOCATOR_FRSP_NOLOAD)
 			return 0;
 
 		mutex_lock(&e4b->frsp_tree->frsp_lock);
+		atomic_inc(&e4b->frsp_tree->frsp_ref);
 		if (e4b->frsp_tree->frsp_flags & EXT4_MB_FRSP_FLAG_LOADED) {
 			mutex_unlock(&e4b->frsp_tree->frsp_lock);
 			return 0;
@@ -2285,8 +2408,15 @@ static int ext4_mb_load_allocator(struct super_block *sb, ext4_group_t group,
 
 static void ext4_mb_unload_allocator(struct ext4_buddy *e4b)
 {
-	if (ext4_mb_frsp_on(e4b->bd_sb))
+	if (ext4_mb_frsp_on(e4b->bd_sb)) {
+		if (!e4b->frsp_tree ||
+			e4b->frsp_flags & EXT4_ALLOCATOR_FRSP_NOLOAD)
+			return;
+		atomic_dec(&e4b->frsp_tree->frsp_ref);
+		if (test_opt2(e4b->bd_sb, FRSP_MEM_CRUNCH))
+			ext4_frsp_evict(e4b->bd_sb);
 		return;
+	}
 	if (e4b->bd_bitmap_page)
 		put_page(e4b->bd_bitmap_page);
 	if (e4b->bd_buddy_page)
@@ -2658,7 +2788,8 @@ static int mb_mark_used(struct ext4_buddy *e4b, struct ext4_free_extent *ex)
 				ex->fe_node->frsp_offset +
 				ex->fe_node->frsp_len) {
 				/* Need to split the node */
-				new = ext4_mb_frsp_alloc_node(e4b->bd_sb);
+				new = ext4_mb_frsp_alloc_node(e4b->bd_sb,
+								e4b->frsp_tree);
 				if (!new)
 					return -ENOMEM;
 				new->frsp_offset = flex_offset + ex->fe_len;
@@ -2675,7 +2806,8 @@ static int mb_mark_used(struct ext4_buddy *e4b, struct ext4_free_extent *ex)
 			ex->fe_node->frsp_offset += ex->fe_len;
 			ex->fe_node->frsp_len -= ex->fe_len;
 		} else {
-			ext4_mb_frsp_free_node(e4b->bd_sb, ex->fe_node);
+			ext4_mb_frsp_free_node(e4b->bd_sb, e4b->frsp_tree,
+							ex->fe_node);
 			return 0;
 		}
 
@@ -4166,7 +4298,9 @@ static void ext4_free_data_in_buddy(struct super_block *sb,
 		/* No more items in the per group rb tree
 		 * balance refcounts from ext4_mb_free_metadata()
 		 */
-		if (!ext4_mb_frsp_on(sb)) {
+		if (ext4_mb_frsp_on(sb)) {
+			atomic_dec(&e4b.frsp_tree->frsp_ref);
+		} else {
 			put_page(e4b.bd_buddy_page);
 			put_page(e4b.bd_bitmap_page);
 		}
@@ -6146,14 +6280,18 @@ ext4_mb_free_metadata(handle_t *handle, struct ext4_buddy *e4b,
 	new_node = &new_entry->efd_node;
 	cluster = new_entry->efd_start_cluster;
 
-	if (!*n && !ext4_mb_frsp_on(sb)) {
+	if (!*n) {
 		/* first free block extent. We need to
 		 * protect buddy cache from being freed,
 		 * otherwise we'll refresh it from
 		 * on-disk bitmap and lose not-yet-available
 		 * blocks */
-		get_page(e4b->bd_buddy_page);
-		get_page(e4b->bd_bitmap_page);
+		if (ext4_mb_frsp_on(sb)) {
+			atomic_inc(&e4b->frsp_tree->frsp_ref);
+		} else {
+			get_page(e4b->bd_buddy_page);
+			get_page(e4b->bd_bitmap_page);
+		}
 	}
 	while (*n) {
 		parent = *n;
-- 
2.28.0.297.g1956fa8f8d-goog


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [RFC PATCH v2 8/9] ext4: add tracepoints for free space trees
  2020-08-21  1:55 [RFC PATCH v2 0/9] ext4: add free-space extent based allocator Harshad Shirwadkar
                   ` (6 preceding siblings ...)
  2020-08-21  1:55 ` [RFC PATCH v2 7/9] ext4: add LRU eviction for free space trees Harshad Shirwadkar
@ 2020-08-21  1:55 ` Harshad Shirwadkar
  2020-08-21  1:55 ` [RFC PATCH v2 9/9] ext4: add freespace trees documentation in code Harshad Shirwadkar
  8 siblings, 0 replies; 14+ messages in thread
From: Harshad Shirwadkar @ 2020-08-21  1:55 UTC (permalink / raw)
  To: linux-ext4; +Cc: tytso, lyx1209, Harshad Shirwadkar

From: Harshad Shirwadkar <harshadshirwadkar@gmail.com>

This patch adds a few tracepoints to track the flow of allocation
requests through the free space tree allocator.
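
Assuming tracefs is mounted at /sys/kernel/tracing, the new events can
be enabled and observed like any other ext4 tracepoint, for example:

  # echo 1 > /sys/kernel/tracing/events/ext4/ext4_mb_frsp_optimize/enable
  # cat /sys/kernel/tracing/trace_pipe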

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
---
 fs/ext4/ext4.h              |   1 +
 fs/ext4/mballoc.c           |   9 +++
 fs/ext4/mballoc.h           |   3 +-
 include/trace/events/ext4.h | 112 ++++++++++++++++++++++++++++++++++++
 4 files changed, 124 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 64d0dbbcd517..22850763c5a8 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1661,6 +1661,7 @@ struct ext4_sb_info {
 	atomic_t s_mb_num_fragments;
 	u32 s_mb_frsp_mem_limit;
 	struct ext4_mb_frsp_lru s_mb_frsp_lru;
+	atomic_t s_mb_ac_id;
 
 	/* workqueue for reserved extent conversions (buffered io) */
 	struct workqueue_struct *rsv_conversion_wq;
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index aea0eb8d28da..eb99e2fb9a68 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -945,6 +945,7 @@ void ext4_mb_frsp_free_tree(struct super_block *sb, struct ext4_frsp_tree *tree)
 	tree->frsp_len_root = RB_ROOT_CACHED;
 out:
 	mutex_unlock(&tree->frsp_lock);
+	trace_ext4_mb_frsp_free_tree(sb, tree->frsp_index);
 }
 
 /*
@@ -1476,6 +1477,7 @@ int ext4_mb_frsp_optimize(struct ext4_allocation_context *ac, int *tree_idx)
 	mb_debug(ac->ac_sb,
 		"Optimizer suggestion: found = %d, tree = %d, len = %d, cr = %d\n",
 		found, *tree_idx, better_len, ac->ac_criteria);
+	trace_ext4_mb_frsp_optimize(ac, found, cache_more_trees, *tree_idx);
 	return ret;
 }
 
@@ -1546,6 +1548,11 @@ ext4_mb_tree_allocator(struct ext4_allocation_context *ac)
 		ac->ac_groups_scanned++;
 		tree_idx = (j % sbi->s_mb_num_frsp_trees);
 
+		trace_ext4_mb_tree_allocator(ac, tree_idx,
+				sbi->s_mb_num_frsp_trees,
+				atomic_read(&sbi->s_mb_num_frsp_trees_cached),
+				atomic_read(&sbi->s_mb_num_fragments) *
+				sizeof(struct ext4_frsp_node));
 		ret = ext4_mb_load_allocator(sb,
 				tree_idx * ext4_flex_bg_size(sbi), &e4b, 0);
 		if (ret)
@@ -1655,6 +1662,7 @@ int ext4_mb_init_freespace_trees(struct super_block *sb)
 	if (sbi->s_mb_frsp_mem_limit < EXT4_MB_FRSP_MEM_MIN(sb))
 		sbi->s_mb_frsp_mem_limit = EXT4_MB_FRSP_MEM_MIN(sb);
 	atomic_set(&sbi->s_mb_frsp_lru.frsp_active_fragments, 0);
+	atomic_set(&sbi->s_mb_ac_id, 0);
 
 	return 0;
 }
@@ -5794,6 +5802,7 @@ ext4_mb_initialize_context(struct ext4_allocation_context *ac,
 	ext4_get_group_no_and_offset(sb, goal, &group, &block);
 
 	/* set up allocation goals */
+	ac->ac_id = atomic_inc_return(&sbi->s_mb_ac_id);
 	ac->ac_b_ex.fe_logical = EXT4_LBLK_CMASK(sbi, ar->logical);
 	ac->ac_status = AC_STATUS_CONTINUE;
 	ac->ac_sb = sb;
diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h
index 6cfb228e4da2..e734c05572b6 100644
--- a/fs/ext4/mballoc.h
+++ b/fs/ext4/mballoc.h
@@ -171,7 +171,6 @@ struct ext4_tree_extent {
 };
 
 struct ext4_allocation_context {
-	__u32	ac_id;
 	struct inode *ac_inode;
 	struct super_block *ac_sb;
 
@@ -206,6 +205,8 @@ struct ext4_allocation_context {
 				 * N > 0, the field stores N, otherwise 0 */
 	__u8 ac_num_optimizations;
 	__u8 ac_op;		/* operation, for history only */
+	__u8 ac_id;		/* allocation ID for tracking purposes */
+
 	struct page *ac_bitmap_page;
 	struct page *ac_buddy_page;
 	struct ext4_prealloc_space *ac_pa;
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index 4c8b99ec8606..e5b4c1576097 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -874,6 +874,118 @@ TRACE_EVENT(ext4_allocate_blocks,
 		  __entry->pleft, __entry->pright)
 );
 
+TRACE_EVENT(ext4_mb_frsp_free_tree,
+	TP_PROTO(struct super_block *sb, int tree),
+
+	TP_ARGS(sb, tree),
+
+	TP_STRUCT__entry(
+		__field(	dev_t,	dev			)
+		__field(	int,	tree			)
+	),
+
+	TP_fast_assign(
+		__entry->dev	= sb->s_dev;
+		__entry->tree	= tree;
+	),
+
+	TP_printk("dev %d,%d tree %d",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),  __entry->tree)
+);
+
+TRACE_EVENT(ext4_mb_frsp_optimize,
+	TP_PROTO(struct ext4_allocation_context *ac, int found,
+		 int cache_more_trees, int tree),
+
+	TP_ARGS(ac, found, cache_more_trees, tree),
+
+	TP_STRUCT__entry(
+		__field(	dev_t,	dev			)
+		__field(	ino_t,	ino			)
+		__field(	unsigned int, len		)
+		__field(	unsigned int, flags		)
+		__field(	int,	found			)
+		__field(	int,	tree			)
+		__field(	int,	num_found		)
+		__field(	int,	attempts		)
+		__field(	int,	ac_id			)
+		__field(	int,	ac_criteria		)
+		__field(	int,	ac_groups_scanned	)
+		__field(	int,	cache_more_trees	)
+	),
+
+	TP_fast_assign(
+		__entry->dev	= ac->ac_sb->s_dev;
+		__entry->ino	= ac->ac_inode->i_ino;
+		__entry->len	= ac->ac_b_tree_ex.te_len;
+		__entry->flags	= ac->ac_flags;
+		__entry->found	= found;
+		__entry->tree	= tree;
+		__entry->attempts = ac->ac_num_optimizations;
+		__entry->ac_id = ac->ac_id;
+		__entry->num_found = ac->ac_found;
+		__entry->ac_criteria = ac->ac_criteria;
+		__entry->ac_groups_scanned = ac->ac_groups_scanned;
+		__entry->cache_more_trees = cache_more_trees;
+	),
+
+	TP_printk("dev %d,%d ino %lu ac %d flags %s best-len %u num-found %d "
+		  "suggest-tree %d cache_more %d attempt %d status %d cr %d "
+		  "nflexgrps %d",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  (unsigned long) __entry->ino, __entry->ac_id,
+		  show_mballoc_flags(__entry->flags),
+		  __entry->len, __entry->num_found, __entry->tree,
+		  __entry->cache_more_trees, __entry->attempts, __entry->found,
+		  __entry->ac_criteria, __entry->ac_groups_scanned)
+);
+
+TRACE_EVENT(ext4_mb_tree_allocator,
+	TP_PROTO(struct ext4_allocation_context *ac, int tree, int num_trees,
+		int num_trees_loaded, int bytes_usage),
+
+	TP_ARGS(ac, tree, num_trees, num_trees_loaded, bytes_usage),
+
+	TP_STRUCT__entry(
+		__field(	dev_t,	dev			)
+		__field(	ino_t,	ino			)
+		__field(	unsigned int, len		)
+		__field(	unsigned int, flags		)
+		__field(	int,	num_trees		)
+		__field(	int,	num_trees_loaded	)
+		__field(	int,	tree			)
+		__field(	int,	ac_id			)
+		__field(	int,	ac_found		)
+		__field(	int,	ac_criteria		)
+		__field(	int,	ac_groups_scanned	)
+		__field(	int,	bytes_usage		)
+	),
+
+	TP_fast_assign(
+		__entry->dev	= ac->ac_sb->s_dev;
+		__entry->ino	= ac->ac_inode->i_ino;
+		__entry->len	= ac->ac_b_tree_ex.te_len;
+		__entry->flags	= ac->ac_flags;
+		__entry->num_trees = num_trees;
+		__entry->num_trees_loaded	= num_trees_loaded;
+		__entry->tree = tree;
+		__entry->ac_id = ac->ac_id;
+		__entry->ac_found = ac->ac_found;
+		__entry->ac_criteria = ac->ac_criteria;
+		__entry->ac_groups_scanned = ac->ac_groups_scanned;
+		__entry->bytes_usage = bytes_usage;
+	),
+
+	TP_printk("dev %d,%d ino %lu ac %d flags %s best-len %u found %d current-tree %d "
+		  "num-trees %d num-trees-loaded %d cr %d nflexgrps %d mem_bytes %d",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  (unsigned long) __entry->ino, __entry->ac_id,
+		  show_mballoc_flags(__entry->flags),
+		  __entry->len, __entry->ac_found, __entry->tree, __entry->num_trees,
+		  __entry->num_trees_loaded, __entry->ac_criteria,
+		  __entry->ac_groups_scanned, __entry->bytes_usage)
+);
+
 TRACE_EVENT(ext4_free_blocks,
 	TP_PROTO(struct inode *inode, __u64 block, unsigned long count,
 		 int flags),
-- 
2.28.0.297.g1956fa8f8d-goog


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [RFC PATCH v2 9/9] ext4: add freespace trees documentation in code
  2020-08-21  1:55 [RFC PATCH v2 0/9] ext4: add free-space extent based allocator Harshad Shirwadkar
                   ` (7 preceding siblings ...)
  2020-08-21  1:55 ` [RFC PATCH v2 8/9] ext4: add tracepoints " Harshad Shirwadkar
@ 2020-08-21  1:55 ` Harshad Shirwadkar
  8 siblings, 0 replies; 14+ messages in thread
From: Harshad Shirwadkar @ 2020-08-21  1:55 UTC (permalink / raw)
  To: linux-ext4; +Cc: tytso, lyx1209, Harshad Shirwadkar

From: Harshad Shirwadkar <harshadshirwadkar@gmail.com>

This patch adds a comment in mballoc.c describing the design and flow
of free-space trees.

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
---
 fs/ext4/mballoc.c | 116 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 116 insertions(+)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index eb99e2fb9a68..b5d55a52daff 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -330,6 +330,122 @@
  *        object
  *
  */
+
+/*
+ * The freespace tree allocator
+ * -----------------------------
+ *
+ * High Level Design
+ * =================
+ * This allocator maintains the free space information about the file system in
+ * per-flexible block group level trees. For every flexible block group, we
+ * maintain individual freespace nodes in two trees, one sorted by flex-bg
+ * offset and the other sorted by length. This gives us two benefits: i) in the
+ * allocation path, our search time complexity is O(Log(Number of freespace
+ * regions in the flex-bg)); ii) in the free path, we can merge adjacent nodes
+ * in the same time complexity, thereby keeping the tree compact.
+ *
+ * Along with flexible block group level trees, we also introduce a top level
+ * meta-tree which indexes the individual flex-bg trees. This tree is sorted
+ * by the length of the largest extent found in each flex-bg. The key advantage
+ * that this tree gives us is this: during an allocation request, the allocator
+ * can consult this tree and directly (in O(Log(Number of Flex BGs))) jump to a
+ * flexible block group which definitely has at least one (the largest) extent
+ * that can satisfy this request. If no flexible block group exists which can
+ * satisfy this request, the allocator can now immediately drop the CR level.
+ *
+ * In order to preserve the parallel allocation / free performance, the
+ * allocator only *tries* to consult this tree. It does so via read_trylock();
+ * if the meta-tree is busy, the allocator continues its linear search until it
+ * is able to grab a read-lock on this tree (a userspace sketch of this
+ * trylock-or-fall-back pattern follows this patch).
+ *
+ * Memory Footprint
+ * ================
+ *
+ * In a less fragmented file system, the memory footprint of the new allocator
+ * is significantly lower than that of the buddy bitmaps. However, as the
+ * fragmentation level increases, the memory footprint of this allocator grows
+ * significantly. The memory usage of the freespace tree allocator can be
+ * controlled using the sysfs tunable /sys/fs/ext4/<dev>/mb_frsp_max_mem. The
+ * default value of this is max(memory needed for buddy, maximum memory needed
+ * for one tree in the worst case). Accounting for max memory needed for one
+ * tree allows us to keep at least one tree in memory even in the worst
+ * case. This avoids thrashing. Once the memory threshold is reached, the
+ * allocator starts evicting trees from memory in LRU fashion. However, we
+ * don't remove the tree's entry from the meta-tree. This allows the allocator
+ * to efficiently reconstruct only the relevant trees from on-disk bitmaps. In
+ * our evaluations, we found that the freespace tree allocator still manages to
+ * outperform the buddy allocator in memory crunch situations.
+ *
+ * LRU
+ * ===
+ *
+ * We maintain two lists - an active list and an inactive list of trees. Trees
+ * in the active list stay there until evicted. In order to reduce contention
+ * on the active list lock, once a tree is in the active list it is not moved
+ * within the list. Also, a tree moves from the inactive list to the active
+ * list only if it is accessed twice within an
+ * EXT4_MB_FRSP_ACTIVE_THRESHOLD_JIFFIES window.
+ *
+ *
+ * Caching Tree Metadata
+ * =====================
+ * As noted in our experiments, caching tree metadata (the largest
+ * available extent in a tree) in the meta-tree significantly boosts allocation
+ * performance. However, if the allocator waits for the cache to fill up before
+ * starting to serve allocation requests, that may severely degrade allocation
+ * performance on large disks. Thus, it is important to tune the tree caching
+ * behavior according to the underlying block device. This patchset provides
+ * four cache aggression levels. The cache aggression level can be specified as
+ * a mount-time parameter, "frsp_cache_aggression". Here is the meaning of the
+ * different levels:
+ *
+ * Cache Aggression 0: Try to serve requests from only cached trees.
+ * Cache Aggression 1 (Default): Try to serve requests from only cached trees
+ *	at CR 0. At all other CRs, try to use an uncached tree for every
+ *	alternate request.
+ * Cache Aggression 2: Try to use an uncached tree for every alternate request
+ *	at all CR levels.
+ * Cache Aggression 3: Try to use uncached trees for every request.
+ *
+ * Moreover, if the file system is mounted with "prefetch_block_bitmaps", tree
+ * caching starts at mount time.
+ *
+ * Locking Order
+ * =============
+ *
+ * - Tree lock
+ * - Meta tree lock (sbi->s_mb_frsp_lock)
+ * - LRU lock
+ *
+ * Critical sections of the meta-tree lock and the LRU lock are kept as small
+ * as possible, and you shouldn't need to take the meta-tree lock and the
+ * LRU lock simultaneously.
+ *
+ * High Level Algorithm
+ * ====================
+ * - Consult the meta-tree to ask which flex-bg the allocator should look at
+ *   - One of the following can happen:
+ *     - Meta tree may find a suitable flex-bg for the request
+ *        - In this case we start searching in that tree
+ *     - Meta tree may realize that there's no suitable flex-bg
+ *        - In this case we increase CR and restart
+ *     - Based on the caching mode, meta tree may redirect us to an
+ *       uncached tree
+ *     - Meta tree is busy
+ *       - In this case, we don't wait for the meta-tree to become
+ *         available; instead, we continue our search linearly.
+ * - Pick a tree (either based on what the meta-tree told us or linearly from
+ *   the last one)
+ * - Manage LRU structure (ext4_mb_frsp_maintain_lru())
+ *   - Move the tree to the active list if needed
+ *   - Move trees from the active list to the inactive list if needed
+ * - Perform search by length and pick matching extents.
+ * - Repeat until best found
+ * - Perform a memory-crunch check and evict trees in LRU fashion if needed
+ *   (ext4_mb_unload_allocator())
+ */
+
 static struct kmem_cache *ext4_pspace_cachep;
 static struct kmem_cache *ext4_ac_cachep;
 static struct kmem_cache *ext4_free_data_cachep;
-- 
2.28.0.297.g1956fa8f8d-goog


^ permalink raw reply related	[flat|nested] 14+ messages in thread
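
The "only *tries* to consult" rule in the comment above maps onto a simple
trylock-or-fall-back pattern. A minimal userspace sketch is below, with a
pthread rwlock standing in for the kernel rwlock; consult_meta_tree() and
next_tree_linear() are illustrative stand-ins, not names from the patch.

#include <pthread.h>
#include <stdio.h>

static pthread_rwlock_t meta_lock = PTHREAD_RWLOCK_INITIALIZER;

/* Stand-in: a real implementation would search the meta-tree for a flex-bg
 * whose largest extent covers goal_len; -1 means "no suggestion". */
static int consult_meta_tree(int goal_len)
{
	(void)goal_len;
	return -1;
}

/* Stand-in for the allocator's linear scan over the flex-bg trees. */
static int next_tree_linear(int cur, int ntrees)
{
	return (cur + 1) % ntrees;
}

static int pick_tree(int cur, int ntrees, int goal_len)
{
	int idx = -1;

	if (pthread_rwlock_tryrdlock(&meta_lock) == 0) {
		/* Meta-tree available: jump straight to a suitable flex-bg. */
		idx = consult_meta_tree(goal_len);
		pthread_rwlock_unlock(&meta_lock);
	}
	if (idx < 0)
		/* Meta-tree busy (or no suggestion): keep scanning linearly. */
		idx = next_tree_linear(cur, ntrees);
	return idx;
}

int main(void)
{
	printf("next tree: %d\n", pick_tree(0, 4, 8));
	return 0;
}

Blocking here would serialize every allocating thread behind meta-tree
updates; falling back to the linear scan trades a possibly less targeted
search for parallelism, which is the design rationale stated in the comment.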

* Re: [RFC PATCH v2 3/9] ext4: add freespace tree allocator
  2020-08-21  1:55 ` [RFC PATCH v2 3/9] ext4: add freespace tree allocator Harshad Shirwadkar
@ 2020-08-21  4:06   ` kernel test robot
  2020-08-21  7:04   ` kernel test robot
  1 sibling, 0 replies; 14+ messages in thread
From: kernel test robot @ 2020-08-21  4:06 UTC (permalink / raw)
  To: kbuild-all

[-- Attachment #1: Type: text/plain, Size: 1972 bytes --]

Hi Harshad,

[FYI, it's a private test report for your RFC patch.]
[auto build test ERROR on ext4/dev]
[also build test ERROR on next-20200820]
[cannot apply to tip/perf/core v5.9-rc1]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Harshad-Shirwadkar/ext4-add-free-space-extent-based-allocator/20200821-095647
base:   https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git dev
config: nds32-defconfig (attached as .config)
compiler: nds32le-linux-gcc (GCC) 9.3.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross ARCH=nds32 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   nds32le-linux-ld: fs/ext4/mballoc.o: in function `ext4_mb_frsp_free_blocks':
>> mballoc.c:(.text+0x2b2e): undefined reference to `__umoddi3'
>> nds32le-linux-ld: mballoc.c:(.text+0x2b32): undefined reference to `__umoddi3'
   nds32le-linux-ld: fs/ext4/mballoc.o: in function `ext4_mb_frsp_bb_to_tree':
   mballoc.c:(.text+0x2c36): undefined reference to `__umoddi3'
   nds32le-linux-ld: mballoc.c:(.text+0x2c3a): undefined reference to `__umoddi3'
   nds32le-linux-ld: fs/ext4/mballoc.o: in function `mb_mark_used':
   mballoc.c:(.text+0x54d4): undefined reference to `__umoddi3'
   nds32le-linux-ld: fs/ext4/mballoc.o:mballoc.c:(.text+0x54d8): more undefined references to `__umoddi3' follow

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all(a)lists.01.org

[-- Attachment #2: config.gz --]
[-- Type: application/gzip, Size: 10907 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH v2 3/9] ext4: add freespace tree allocator
  2020-08-21  1:55 ` [RFC PATCH v2 3/9] ext4: add freespace tree allocator Harshad Shirwadkar
  2020-08-21  4:06   ` kernel test robot
@ 2020-08-21  7:04   ` kernel test robot
  1 sibling, 0 replies; 14+ messages in thread
From: kernel test robot @ 2020-08-21  7:04 UTC (permalink / raw)
  To: kbuild-all

[-- Attachment #1: Type: text/plain, Size: 2304 bytes --]

Hi Harshad,

[FYI, it's a private test report for your RFC patch.]
[auto build test ERROR on ext4/dev]
[also build test ERROR on next-20200820]
[cannot apply to tip/perf/core tytso-fscrypt/master v5.9-rc1]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Harshad-Shirwadkar/ext4-add-free-space-extent-based-allocator/20200821-095647
base:   https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git dev
config: i386-randconfig-s001-20200820 (attached as .config)
compiler: gcc-9 (Debian 9.3.0-15) 9.3.0
reproduce:
        # apt-get install sparse
        # sparse version: v0.6.2-191-g10164920-dirty
        # save the attached .config to linux build tree
        make W=1 C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' ARCH=i386 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   ld: fs/ext4/mballoc.o: in function `ext4_blkno_to_flex_offset':
>> fs/ext4/mballoc.c:407: undefined reference to `__umoddi3'
>> ld: fs/ext4/mballoc.c:407: undefined reference to `__umoddi3'
>> ld: fs/ext4/mballoc.c:407: undefined reference to `__umoddi3'
>> ld: fs/ext4/mballoc.c:407: undefined reference to `__umoddi3'
>> ld: fs/ext4/mballoc.c:407: undefined reference to `__umoddi3'
   ld: fs/ext4/mballoc.o:fs/ext4/mballoc.c:407: more undefined references to `__umoddi3' follow

# https://github.com/0day-ci/linux/commit/db488bc7c0ef14f7ea22209b8faaddaffcfabbd5
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review Harshad-Shirwadkar/ext4-add-free-space-extent-based-allocator/20200821-095647
git checkout db488bc7c0ef14f7ea22209b8faaddaffcfabbd5
vim +407 fs/ext4/mballoc.c

   403	
   404	static inline int ext4_blkno_to_flex_offset(struct super_block *sb,
   405				ext4_fsblk_t blkno)
   406	{
 > 407		return blkno % (ext4_flex_bg_size(EXT4_SB(sb)) *
   408					EXT4_SB(sb)->s_blocks_per_group);
   409	}
   410	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all(a)lists.01.org

[-- Attachment #2: config.gz --]
[-- Type: application/gzip, Size: 31788 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread
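
Both robot reports trace back to the same root cause: a 64-bit '%' in
ext4_blkno_to_flex_offset() (and similar spots), for which gcc on 32-bit
targets emits a call to libgcc's __umoddi3, which the kernel does not
provide. A hedged sketch of the usual fix - the 64-by-32 division helper
from <linux/math64.h>, assuming the flex-bg size in blocks fits in 32 bits -
would look like this (a suggestion, not a change from the series):

#include <linux/math64.h>

static inline int ext4_blkno_to_flex_offset(struct super_block *sb,
					    ext4_fsblk_t blkno)
{
	u32 rem;

	/* div_u64_rem() does 64-by-32 division without libgcc helpers. */
	div_u64_rem(blkno, ext4_flex_bg_size(EXT4_SB(sb)) *
			   EXT4_SB(sb)->s_blocks_per_group, &rem);
	return rem;
}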

* Re: [RFC PATCH v2 5/9] ext4: add freespace tree optimizations
  2020-08-21  1:55 ` [RFC PATCH v2 5/9] ext4: add freespace tree optimizations Harshad Shirwadkar
@ 2020-08-21 10:48   ` kernel test robot
  0 siblings, 0 replies; 14+ messages in thread
From: kernel test robot @ 2020-08-21 10:48 UTC (permalink / raw)
  To: kbuild-all

[-- Attachment #1: Type: text/plain, Size: 7452 bytes --]

Hi Harshad,

[FYI, it's a private test report for your RFC patch.]
[auto build test WARNING on ext4/dev]
[also build test WARNING on next-20200821]
[cannot apply to tip/perf/core v5.9-rc1]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Harshad-Shirwadkar/ext4-add-free-space-extent-based-allocator/20200821-095647
base:   https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git dev
config: x86_64-randconfig-r036-20200820 (attached as .config)
compiler: clang version 12.0.0 (https://github.com/llvm/llvm-project b587ca93be114d07ec3bf654add97d7872325281)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # install x86_64 cross compiling tool for clang build
        # apt-get install binutils-x86-64-linux-gnu
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=x86_64 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

>> fs/ext4/mballoc.c:808:6: warning: no previous prototype for function 'ext4_mb_frsp_meta_reinsert' [-Wmissing-prototypes]
   void ext4_mb_frsp_meta_reinsert(struct super_block *sb,
        ^
   fs/ext4/mballoc.c:808:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   void ext4_mb_frsp_meta_reinsert(struct super_block *sb,
   ^
   static 
   fs/ext4/mballoc.c:890:6: warning: no previous prototype for function 'ext4_mb_frsp_free_tree' [-Wmissing-prototypes]
   void ext4_mb_frsp_free_tree(struct super_block *sb, struct ext4_frsp_tree *tree)
        ^
   fs/ext4/mballoc.c:890:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   void ext4_mb_frsp_free_tree(struct super_block *sb, struct ext4_frsp_tree *tree)
   ^
   static 
   fs/ext4/mballoc.c:920:24: warning: no previous prototype for function 'ext4_mb_frsp_search_by_off' [-Wmissing-prototypes]
   struct ext4_frsp_node *ext4_mb_frsp_search_by_off(struct super_block *sb,
                          ^
   fs/ext4/mballoc.c:920:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   struct ext4_frsp_node *ext4_mb_frsp_search_by_off(struct super_block *sb,
   ^
   static 
   fs/ext4/mballoc.c:945:5: warning: no previous prototype for function 'ext4_mb_frsp_node_can_merge' [-Wmissing-prototypes]
   int ext4_mb_frsp_node_can_merge(struct super_block *sb,
       ^
   fs/ext4/mballoc.c:945:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   int ext4_mb_frsp_node_can_merge(struct super_block *sb,
   ^
   static 
   fs/ext4/mballoc.c:962:5: warning: no previous prototype for function 'ext4_mb_frsp_add_region' [-Wmissing-prototypes]
   int ext4_mb_frsp_add_region(struct super_block *sb, struct ext4_frsp_tree *tree,
       ^
   fs/ext4/mballoc.c:962:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   int ext4_mb_frsp_add_region(struct super_block *sb, struct ext4_frsp_tree *tree,
   ^
   static 
   fs/ext4/mballoc.c:1014:5: warning: no previous prototype for function 'ext4_mb_frsp_free_blocks' [-Wmissing-prototypes]
   int ext4_mb_frsp_free_blocks(struct super_block *sb, ext4_group_t group,
       ^
   fs/ext4/mballoc.c:1014:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   int ext4_mb_frsp_free_blocks(struct super_block *sb, ext4_group_t group,
   ^
   static 
   fs/ext4/mballoc.c:1034:5: warning: no previous prototype for function 'ext4_mb_frsp_bb_to_tree' [-Wmissing-prototypes]
   int ext4_mb_frsp_bb_to_tree(struct ext4_buddy *e4b, ext4_group_t group,
       ^
   fs/ext4/mballoc.c:1034:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   int ext4_mb_frsp_bb_to_tree(struct ext4_buddy *e4b, ext4_group_t group,
   ^
   static 
   fs/ext4/mballoc.c:1066:5: warning: no previous prototype for function 'ext4_mb_frsp_load' [-Wmissing-prototypes]
   int ext4_mb_frsp_load(struct ext4_buddy *e4b)
       ^
   fs/ext4/mballoc.c:1066:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   int ext4_mb_frsp_load(struct ext4_buddy *e4b)
   ^
   static 
>> fs/ext4/mballoc.c:1366:5: warning: no previous prototype for function 'ext4_mb_frsp_optimize' [-Wmissing-prototypes]
   int ext4_mb_frsp_optimize(struct ext4_allocation_context *ac, int *tree_idx)
       ^
   fs/ext4/mballoc.c:1366:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   int ext4_mb_frsp_optimize(struct ext4_allocation_context *ac, int *tree_idx)
   ^
   static 
   fs/ext4/mballoc.c:1587:5: warning: no previous prototype for function 'ext4_mb_init_freespace_trees' [-Wmissing-prototypes]
   int ext4_mb_init_freespace_trees(struct super_block *sb)
       ^
   fs/ext4/mballoc.c:1587:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   int ext4_mb_init_freespace_trees(struct super_block *sb)
   ^
   static 
   fs/ext4/mballoc.c:411:28: warning: unused function 'ext4_flex_offset_to_blkno' [-Wunused-function]
   static inline ext4_fsblk_t ext4_flex_offset_to_blkno(struct super_block *sb,
                              ^
   11 warnings generated.

# https://github.com/0day-ci/linux/commit/832851db4656a8d433083fc076a53db955d43020
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review Harshad-Shirwadkar/ext4-add-free-space-extent-based-allocator/20200821-095647
git checkout 832851db4656a8d433083fc076a53db955d43020
vim +/ext4_mb_frsp_meta_reinsert +808 fs/ext4/mballoc.c

   807	
 > 808	void ext4_mb_frsp_meta_reinsert(struct super_block *sb,
   809		struct ext4_frsp_tree *tree)
   810	{
   811		struct ext4_sb_info *sbi = EXT4_SB(sb);
   812		struct ext4_frsp_node *node;
   813		struct rb_node *first = rb_first_cached(&tree->frsp_len_root);
   814		struct rb_root_cached *meta_root = &EXT4_SB(sb)->s_mb_frsp_meta_tree;
   815		int expected_len = 0;
   816	
   817		if (!(tree->frsp_flags & EXT4_MB_FRSP_FLAG_LOADED))
   818			return;
   819	
   820		if (first) {
   821			node = rb_entry(first, struct ext4_frsp_node, frsp_len_node);
   822			expected_len = node->frsp_len;
   823		}
   824	
   825		if (tree->frsp_max_free_len == expected_len)
   826			return;
   827	
   828		write_lock(&sbi->s_mb_frsp_lock);
   829		tree->frsp_max_free_len = expected_len;
   830		rb_erase_cached(&tree->frsp_len_node, &sbi->s_mb_frsp_meta_tree);
   831		RB_CLEAR_NODE(&tree->frsp_len_node);
   832		ext4_mb_rb_insert(meta_root, tree, struct ext4_frsp_tree, frsp_len_node,
   833			ext4_mb_frsp_meta_cmp);
   834		write_unlock(&sbi->s_mb_frsp_lock);
   835	}
   836	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all(a)lists.01.org

[-- Attachment #2: config.gz --]
[-- Type: application/gzip, Size: 43415 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH v2 5/9] ext4: add freespace tree optimizations
@ 2020-08-21  5:40 kernel test robot
  0 siblings, 0 replies; 14+ messages in thread
From: kernel test robot @ 2020-08-21  5:40 UTC (permalink / raw)
  To: kbuild

[-- Attachment #1: Type: text/plain, Size: 12971 bytes --]

CC: kbuild-all(a)lists.01.org
In-Reply-To: <20200821015523.1698374-6-harshads@google.com>
References: <20200821015523.1698374-6-harshads@google.com>
TO: Harshad Shirwadkar <harshadshirwadkar@gmail.com>

Hi Harshad,

[FYI, it's a private test report for your RFC patch.]
[auto build test WARNING on ext4/dev]
[also build test WARNING on next-20200820]
[cannot apply to tip/perf/core tytso-fscrypt/master v5.9-rc1]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Harshad-Shirwadkar/ext4-add-free-space-extent-based-allocator/20200821-095647
base:   https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git dev
:::::: branch date: 4 hours ago
:::::: commit date: 4 hours ago
config: x86_64-randconfig-s021-20200820 (attached as .config)
compiler: gcc-9 (Debian 9.3.0-15) 9.3.0
reproduce:
        # apt-get install sparse
        # sparse version: v0.6.2-191-g10164920-dirty
        # save the attached .config to linux build tree
        make W=1 C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' ARCH=x86_64 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>


sparse warnings: (new ones prefixed by >>)

   fs/ext4/mballoc.c:808:6: sparse: sparse: symbol 'ext4_mb_frsp_meta_reinsert' was not declared. Should it be static?
   fs/ext4/mballoc.c:920:23: sparse: sparse: symbol 'ext4_mb_frsp_search_by_off' was not declared. Should it be static?
   fs/ext4/mballoc.c:945:5: sparse: sparse: symbol 'ext4_mb_frsp_node_can_merge' was not declared. Should it be static?
   fs/ext4/mballoc.c:962:5: sparse: sparse: symbol 'ext4_mb_frsp_add_region' was not declared. Should it be static?
   fs/ext4/mballoc.c:1014:5: sparse: sparse: symbol 'ext4_mb_frsp_free_blocks' was not declared. Should it be static?
   fs/ext4/mballoc.c:1034:5: sparse: sparse: symbol 'ext4_mb_frsp_bb_to_tree' was not declared. Should it be static?
   fs/ext4/mballoc.c:1066:5: sparse: sparse: symbol 'ext4_mb_frsp_load' was not declared. Should it be static?
   fs/ext4/mballoc.c:1587:5: sparse: sparse: symbol 'ext4_mb_init_freespace_trees' was not declared. Should it be static?
>> fs/ext4/mballoc.c:1366:5: sparse: sparse: context imbalance in 'ext4_mb_frsp_optimize' - different lock contexts for basic block
   fs/ext4/mballoc.c:1885:9: sparse: sparse: context imbalance in 'ext4_mb_init_cache' - different lock contexts for basic block
   fs/ext4/mballoc.c:2437:13: sparse: sparse: context imbalance in 'mb_free_blocks' - different lock contexts for basic block
   fs/ext4/mballoc.c:2889:5: sparse: sparse: context imbalance in 'ext4_mb_try_best_found' - different lock contexts for basic block
   fs/ext4/mballoc.c:2917:5: sparse: sparse: context imbalance in 'ext4_mb_find_by_goal' - different lock contexts for basic block
   fs/ext4/mballoc.c:3204:12: sparse: sparse: context imbalance in 'ext4_mb_good_group_nolock' - wrong count at exit
   fs/ext4/mballoc.c:3425:49: sparse: sparse: context imbalance in 'ext4_mb_regular_allocator' - different lock contexts for basic block
   fs/ext4/mballoc.c:4035:17: sparse: sparse: context imbalance in 'ext4_mb_release' - different lock contexts for basic block
   fs/ext4/mballoc.c: note: in included file (through include/linux/wait.h, include/linux/wait_bit.h, include/linux/fs.h, fs/ext4/ext4_jbd2.h):
   include/linux/spinlock.h:393:9: sparse: sparse: context imbalance in 'ext4_free_data_in_buddy' - wrong count at exit
   fs/ext4/mballoc.c:4376:15: sparse: sparse: context imbalance in 'ext4_mb_mark_diskspace_used' - different lock contexts for basic block
   fs/ext4/mballoc.c:4627:13: sparse: sparse: context imbalance in 'ext4_discard_allocated_blocks' - different lock contexts for basic block
   fs/ext4/mballoc.c:5270:24: sparse: sparse: context imbalance in 'ext4_mb_discard_group_preallocations' - different lock contexts for basic block
   fs/ext4/mballoc.c:5425:9: sparse: sparse: context imbalance in 'ext4_discard_preallocations' - different lock contexts for basic block
   fs/ext4/mballoc.c:5731:9: sparse: sparse: context imbalance in 'ext4_mb_discard_lg_preallocations' - different lock contexts for basic block
   fs/ext4/mballoc.c:6276:9: sparse: sparse: context imbalance in 'ext4_free_blocks' - different lock contexts for basic block
   fs/ext4/mballoc.c:6576:15: sparse: sparse: context imbalance in 'ext4_group_add_blocks' - different lock contexts for basic block
   fs/ext4/mballoc.c:6617:24: sparse: sparse: context imbalance in 'ext4_trim_extent' - wrong count at exit
   include/linux/spinlock.h:393:9: sparse: sparse: context imbalance in 'ext4_trim_all_free' - unexpected unlock
   fs/ext4/mballoc.c:6869:28: sparse: sparse: context imbalance in 'ext4_mballoc_query_range' - different lock contexts for basic block

# https://github.com/0day-ci/linux/commit/832851db4656a8d433083fc076a53db955d43020
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review Harshad-Shirwadkar/ext4-add-free-space-extent-based-allocator/20200821-095647
git checkout 832851db4656a8d433083fc076a53db955d43020
vim +/ext4_mb_frsp_optimize +1366 fs/ext4/mballoc.c

832851db4656a8d Harshad Shirwadkar 2020-08-20  1359  
832851db4656a8d Harshad Shirwadkar 2020-08-20  1360  /*
832851db4656a8d Harshad Shirwadkar 2020-08-20  1361   * Optimize allocation request. This function _tries_ to lookup the meta-tree
832851db4656a8d Harshad Shirwadkar 2020-08-20  1362   * and if it can optimize the search in any way, it does so. As a result
832851db4656a8d Harshad Shirwadkar 2020-08-20  1363   * this function returns 1, if the optimization was performed. In this case,
832851db4656a8d Harshad Shirwadkar 2020-08-20  1364   * the caller should restart the search from tree mentioned in *tree_idx.
832851db4656a8d Harshad Shirwadkar 2020-08-20  1365   */
832851db4656a8d Harshad Shirwadkar 2020-08-20 @1366  int ext4_mb_frsp_optimize(struct ext4_allocation_context *ac, int *tree_idx)
832851db4656a8d Harshad Shirwadkar 2020-08-20  1367  {
832851db4656a8d Harshad Shirwadkar 2020-08-20  1368  	struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
832851db4656a8d Harshad Shirwadkar 2020-08-20  1369  	struct ext4_frsp_tree *cur = NULL;
832851db4656a8d Harshad Shirwadkar 2020-08-20  1370  	struct rb_node *node = NULL;
832851db4656a8d Harshad Shirwadkar 2020-08-20  1371  	int found = 0, best = 0, cache_more_trees = 0, better_len = 0, ret = 0;
832851db4656a8d Harshad Shirwadkar 2020-08-20  1372  
832851db4656a8d Harshad Shirwadkar 2020-08-20  1373  	if (ac->ac_flags & EXT4_MB_HINT_FIRST ||
832851db4656a8d Harshad Shirwadkar 2020-08-20  1374  		ac->ac_flags & EXT4_MB_FRSP_NO_OPTIMIZE ||
832851db4656a8d Harshad Shirwadkar 2020-08-20  1375  		ac->ac_status != AC_STATUS_CONTINUE)
832851db4656a8d Harshad Shirwadkar 2020-08-20  1376  		return 0;
832851db4656a8d Harshad Shirwadkar 2020-08-20  1377  
832851db4656a8d Harshad Shirwadkar 2020-08-20  1378  	ac->ac_num_optimizations++;
832851db4656a8d Harshad Shirwadkar 2020-08-20  1379  	if (!read_trylock(&sbi->s_mb_frsp_lock))
832851db4656a8d Harshad Shirwadkar 2020-08-20  1380  		return 0;
832851db4656a8d Harshad Shirwadkar 2020-08-20  1381  
832851db4656a8d Harshad Shirwadkar 2020-08-20  1382  	node = sbi->s_mb_frsp_meta_tree.rb_root.rb_node;
832851db4656a8d Harshad Shirwadkar 2020-08-20  1383  	while (node) {
832851db4656a8d Harshad Shirwadkar 2020-08-20  1384  		cur = rb_entry(node, struct ext4_frsp_tree, frsp_len_node);
832851db4656a8d Harshad Shirwadkar 2020-08-20  1385  		if (ext4_mb_frsp_is_better(ac, cur->frsp_max_free_len, &best)) {
832851db4656a8d Harshad Shirwadkar 2020-08-20  1386  			/*
832851db4656a8d Harshad Shirwadkar 2020-08-20  1387  			 * This tree definitely has a better node than the best
832851db4656a8d Harshad Shirwadkar 2020-08-20  1388  			 * found so far.
832851db4656a8d Harshad Shirwadkar 2020-08-20  1389  			 */
832851db4656a8d Harshad Shirwadkar 2020-08-20  1390  			found = 1;
832851db4656a8d Harshad Shirwadkar 2020-08-20  1391  			ret = 1;
832851db4656a8d Harshad Shirwadkar 2020-08-20  1392  			*tree_idx = cur->frsp_index;
832851db4656a8d Harshad Shirwadkar 2020-08-20  1393  			better_len = cur->frsp_max_free_len;
832851db4656a8d Harshad Shirwadkar 2020-08-20  1394  			if (best)
832851db4656a8d Harshad Shirwadkar 2020-08-20  1395  				break;
832851db4656a8d Harshad Shirwadkar 2020-08-20  1396  		}
832851db4656a8d Harshad Shirwadkar 2020-08-20  1397  		if (cur->frsp_max_free_len > ac->ac_g_ex.fe_len)
832851db4656a8d Harshad Shirwadkar 2020-08-20  1398  			node = node->rb_right;
832851db4656a8d Harshad Shirwadkar 2020-08-20  1399  		else
832851db4656a8d Harshad Shirwadkar 2020-08-20  1400  			node = node->rb_left;
832851db4656a8d Harshad Shirwadkar 2020-08-20  1401  	}
832851db4656a8d Harshad Shirwadkar 2020-08-20  1402  
832851db4656a8d Harshad Shirwadkar 2020-08-20  1403  	if (ac->ac_found == 0 && !found) {
832851db4656a8d Harshad Shirwadkar 2020-08-20  1404  		/*
832851db4656a8d Harshad Shirwadkar 2020-08-20  1405  		 * If we haven't found a good match above, and we hadn't found
832851db4656a8d Harshad Shirwadkar 2020-08-20  1406  		 * any match before us, that means we need to loosen our
832851db4656a8d Harshad Shirwadkar 2020-08-20  1407  		 * criteria. Note that, if we had found something earlier,
832851db4656a8d Harshad Shirwadkar 2020-08-20  1408  		 * not finding a better node doesn't imply that there is no
832851db4656a8d Harshad Shirwadkar 2020-08-20  1409  		 * better node available.
832851db4656a8d Harshad Shirwadkar 2020-08-20  1410  		 * TODO - in this case determine probabilistically which tree
832851db4656a8d Harshad Shirwadkar 2020-08-20  1411  		 * may have a better node and direct our allocator there.
832851db4656a8d Harshad Shirwadkar 2020-08-20  1412  		 */
832851db4656a8d Harshad Shirwadkar 2020-08-20  1413  		if (ext4_mb_frsp_should_cache(ac)) {
832851db4656a8d Harshad Shirwadkar 2020-08-20  1414  			cache_more_trees = 1;
832851db4656a8d Harshad Shirwadkar 2020-08-20  1415  		} else if (ac->ac_criteria < 2) {
832851db4656a8d Harshad Shirwadkar 2020-08-20  1416  			ac->ac_criteria++;
832851db4656a8d Harshad Shirwadkar 2020-08-20  1417  			ac->ac_groups_scanned = 0;
832851db4656a8d Harshad Shirwadkar 2020-08-20  1418  			*tree_idx = 0;
832851db4656a8d Harshad Shirwadkar 2020-08-20  1419  			ret = 1;
832851db4656a8d Harshad Shirwadkar 2020-08-20  1420  		} else {
832851db4656a8d Harshad Shirwadkar 2020-08-20  1421  			ac->ac_flags |= EXT4_MB_HINT_FIRST;
832851db4656a8d Harshad Shirwadkar 2020-08-20  1422  		}
832851db4656a8d Harshad Shirwadkar 2020-08-20  1423  	} else if (!best && !list_empty(&sbi->s_mb_uncached_trees)) {
832851db4656a8d Harshad Shirwadkar 2020-08-20  1424  		cache_more_trees = ext4_mb_frsp_should_cache(ac);
832851db4656a8d Harshad Shirwadkar 2020-08-20  1425  	}
832851db4656a8d Harshad Shirwadkar 2020-08-20  1426  
832851db4656a8d Harshad Shirwadkar 2020-08-20  1427  	if (cache_more_trees) {
832851db4656a8d Harshad Shirwadkar 2020-08-20  1428  		cur = list_first_entry(&sbi->s_mb_uncached_trees,
832851db4656a8d Harshad Shirwadkar 2020-08-20  1429  				struct ext4_frsp_tree, frsp_list_node);
832851db4656a8d Harshad Shirwadkar 2020-08-20  1430  		list_del_init(&cur->frsp_list_node);
832851db4656a8d Harshad Shirwadkar 2020-08-20  1431  		*tree_idx = cur->frsp_index;
832851db4656a8d Harshad Shirwadkar 2020-08-20  1432  		ret = 1;
832851db4656a8d Harshad Shirwadkar 2020-08-20  1433  	}
832851db4656a8d Harshad Shirwadkar 2020-08-20  1434  	read_unlock(&sbi->s_mb_frsp_lock);
832851db4656a8d Harshad Shirwadkar 2020-08-20  1435  
832851db4656a8d Harshad Shirwadkar 2020-08-20  1436  	/*
832851db4656a8d Harshad Shirwadkar 2020-08-20  1437  	 * If we couldn't optimize now, it's unlikely that we'll be able to
832851db4656a8d Harshad Shirwadkar 2020-08-20  1438  	 * optimize this request anymore
832851db4656a8d Harshad Shirwadkar 2020-08-20  1439  	 */
832851db4656a8d Harshad Shirwadkar 2020-08-20  1440  	if (!ret)
832851db4656a8d Harshad Shirwadkar 2020-08-20  1441  		ac->ac_flags |= EXT4_MB_FRSP_NO_OPTIMIZE;
832851db4656a8d Harshad Shirwadkar 2020-08-20  1442  	mb_debug(ac->ac_sb,
832851db4656a8d Harshad Shirwadkar 2020-08-20  1443  		"Optimizer suggestion: found = %d, tree = %d, len = %d, cr = %d\n",
832851db4656a8d Harshad Shirwadkar 2020-08-20  1444  		found, *tree_idx, better_len, ac->ac_criteria);
832851db4656a8d Harshad Shirwadkar 2020-08-20  1445  	return ret;
832851db4656a8d Harshad Shirwadkar 2020-08-20  1446  }
832851db4656a8d Harshad Shirwadkar 2020-08-20  1447  

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all(a)lists.01.org

[-- Attachment #2: config.gz --]
[-- Type: application/gzip, Size: 28418 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2020-08-21 10:48 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-08-21  1:55 [RFC PATCH v2 0/9] ext4: add free-space extent based allocator Harshad Shirwadkar
2020-08-21  1:55 ` [RFC PATCH v2 1/9] ext4: add handling for extended mount options Harshad Shirwadkar
2020-08-21  1:55 ` [RFC PATCH v2 2/9] ext4: rename ext4_mb_load_buddy to ext4_mb_load_allocator Harshad Shirwadkar
2020-08-21  1:55 ` [RFC PATCH v2 3/9] ext4: add freespace tree allocator Harshad Shirwadkar
2020-08-21  4:06   ` kernel test robot
2020-08-21  7:04   ` kernel test robot
2020-08-21  1:55 ` [RFC PATCH v2 4/9] ext4: add prefetching support for freespace trees Harshad Shirwadkar
2020-08-21  1:55 ` [RFC PATCH v2 5/9] ext4: add freespace tree optimizations Harshad Shirwadkar
2020-08-21 10:48   ` kernel test robot
2020-08-21  1:55 ` [RFC PATCH v2 6/9] ext4: add memory usage tracker for freespace trees Harshad Shirwadkar
2020-08-21  1:55 ` [RFC PATCH v2 7/9] ext4: add LRU eviction for free space trees Harshad Shirwadkar
2020-08-21  1:55 ` [RFC PATCH v2 8/9] ext4: add tracepoints " Harshad Shirwadkar
2020-08-21  1:55 ` [RFC PATCH v2 9/9] ext4: add freespace trees documentation in code Harshad Shirwadkar
2020-08-21  5:40 [RFC PATCH v2 5/9] ext4: add freespace tree optimizations kernel test robot
