* [RFC PATCH v2 1/9] ext4: add handling for extended mount options
2020-08-21 1:55 [RFC PATCH v2 0/9] ext4: add free-space extent based allocator Harshad Shirwadkar
@ 2020-08-21 1:55 ` Harshad Shirwadkar
2020-08-21 1:55 ` [RFC PATCH v2 2/9] ext4: rename ext4_mb_load_buddy to ext4_mb_load_allocator Harshad Shirwadkar
From: Harshad Shirwadkar @ 2020-08-21 1:55 UTC
To: linux-ext4; +Cc: tytso, lyx1209, Harshad Shirwadkar, Andreas Dilger
From: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
We are running out of mount option bits in s_mount_opt. Add handling for
mount options that live in s_mount_opt2. This patch was originally
submitted as part of the fast commit patch series; it is resent here
because this series needs it as well.
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
---
fs/ext4/super.c | 16 ++++++++++++----
1 file changed, 12 insertions(+), 4 deletions(-)
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 13bdddc081e0..656eccf6fc9b 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1738,6 +1738,7 @@ static int clear_qf_name(struct super_block *sb, int qtype)
#define MOPT_EXT4_ONLY (MOPT_NO_EXT2 | MOPT_NO_EXT3)
#define MOPT_STRING 0x0400
#define MOPT_SKIP 0x0800
+#define MOPT_2 0x1000
static const struct mount_opts {
int token;
@@ -2207,10 +2208,17 @@ static int handle_mount_opt(struct super_block *sb, char *opt, int token,
WARN_ON(1);
return -1;
}
- if (arg != 0)
- sbi->s_mount_opt |= m->mount_opt;
- else
- sbi->s_mount_opt &= ~m->mount_opt;
+ if (m->flags & MOPT_2) {
+ if (arg != 0)
+ sbi->s_mount_opt2 |= m->mount_opt;
+ else
+ sbi->s_mount_opt2 &= ~m->mount_opt;
+ } else {
+ if (arg != 0)
+ sbi->s_mount_opt |= m->mount_opt;
+ else
+ sbi->s_mount_opt &= ~m->mount_opt;
+ }
}
return 1;
}
--
2.28.0.297.g1956fa8f8d-goog
* [RFC PATCH v2 2/9] ext4: rename ext4_mb_load_buddy to ext4_mb_load_allocator
2020-08-21 1:55 [RFC PATCH v2 0/9] ext4: add free-space extent based allocator Harshad Shirwadkar
2020-08-21 1:55 ` [RFC PATCH v2 1/9] ext4: add handling for extended mount options Harshad Shirwadkar
@ 2020-08-21 1:55 ` Harshad Shirwadkar
2020-08-21 1:55 ` [RFC PATCH v2 3/9] ext4: add freespace tree allocator Harshad Shirwadkar
From: Harshad Shirwadkar @ 2020-08-21 1:55 UTC
To: linux-ext4; +Cc: tytso, lyx1209, Harshad Shirwadkar
From: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
This patch renames ext4_mb_load_buddy and ext4_mb_unload_buddy to
ext4_mb_load_allocator and ext4_mb_unload_allocator. It also adds a
flags argument to ext4_mb_load_allocator(), which is currently unused.
This rename significantly reduces the size of the following patch,
"ext4: add freespace tree allocator". In the interest of keeping this
patchset short, I have not renamed the ext4_buddy structure or the e4b
variable names, but have added that as a TODO item.
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
---
fs/ext4/mballoc.c | 86 ++++++++++++++++++++++++-----------------------
1 file changed, 44 insertions(+), 42 deletions(-)
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 132c118d12e1..48d791304bf1 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -29,6 +29,7 @@
* - don't normalize tails
* - quota
* - reservation for superuser
+ * - rename ext4_buddy to ext4_allocator and e4b variables to allocator
*
* TODO v3:
* - bitmap read-ahead (proposed by Oleg Drokin aka green)
@@ -92,7 +93,7 @@
* mapped to the buddy and bitmap information regarding different
* groups. The buddy information is attached to buddy cache inode so that
* we can access them through the page cache. The information regarding
- * each group is loaded via ext4_mb_load_buddy. The information involve
+ * each group is loaded via ext4_mb_load_allocator. The information involve
* block bitmap and buddy information. The information are stored in the
* inode as:
*
@@ -845,7 +846,7 @@ static void mb_regenerate_buddy(struct ext4_buddy *e4b)
/* The buddy information is attached the buddy cache inode
* for convenience. The information regarding each group
- * is loaded via ext4_mb_load_buddy. The information involve
+ * is loaded via ext4_mb_load_allocator. The information involve
* block bitmap and buddy information. The information are
* stored in the inode as
*
@@ -1105,7 +1106,7 @@ int ext4_mb_init_group(struct super_block *sb, ext4_group_t group, gfp_t gfp)
* This ensures that we don't reinit the buddy cache
* page which map to the group from which we are already
* allocating. If we are looking at the buddy cache we would
- * have taken a reference using ext4_mb_load_buddy and that
+ * have taken a reference using ext4_mb_load_allocator and that
* would have pinned buddy page to page cache.
* The call to ext4_mb_get_buddy_page_lock will mark the
* page accessed.
@@ -1157,8 +1158,8 @@ int ext4_mb_init_group(struct super_block *sb, ext4_group_t group, gfp_t gfp)
* calling this routine!
*/
static noinline_for_stack int
-ext4_mb_load_buddy_gfp(struct super_block *sb, ext4_group_t group,
- struct ext4_buddy *e4b, gfp_t gfp)
+ext4_mb_load_allocator_gfp(struct super_block *sb, ext4_group_t group,
+ struct ext4_buddy *e4b, gfp_t gfp, int flags)
{
int blocks_per_page;
int block;
@@ -1293,13 +1294,13 @@ ext4_mb_load_buddy_gfp(struct super_block *sb, ext4_group_t group,
return ret;
}
-static int ext4_mb_load_buddy(struct super_block *sb, ext4_group_t group,
- struct ext4_buddy *e4b)
+static int ext4_mb_load_allocator(struct super_block *sb, ext4_group_t group,
+ struct ext4_buddy *e4b, int flags)
{
- return ext4_mb_load_buddy_gfp(sb, group, e4b, GFP_NOFS);
+ return ext4_mb_load_allocator_gfp(sb, group, e4b, GFP_NOFS, flags);
}
-static void ext4_mb_unload_buddy(struct ext4_buddy *e4b)
+static void ext4_mb_unload_allocator(struct ext4_buddy *e4b)
{
if (e4b->bd_bitmap_page)
put_page(e4b->bd_bitmap_page);
@@ -1859,7 +1860,7 @@ int ext4_mb_try_best_found(struct ext4_allocation_context *ac,
int err;
BUG_ON(ex.fe_len <= 0);
- err = ext4_mb_load_buddy(ac->ac_sb, group, e4b);
+ err = ext4_mb_load_allocator(ac->ac_sb, group, e4b, 0);
if (err)
return err;
@@ -1872,7 +1873,7 @@ int ext4_mb_try_best_found(struct ext4_allocation_context *ac,
}
ext4_unlock_group(ac->ac_sb, group);
- ext4_mb_unload_buddy(e4b);
+ ext4_mb_unload_allocator(e4b);
return 0;
}
@@ -1893,12 +1894,12 @@ int ext4_mb_find_by_goal(struct ext4_allocation_context *ac,
if (grp->bb_free == 0)
return 0;
- err = ext4_mb_load_buddy(ac->ac_sb, group, e4b);
+ err = ext4_mb_load_allocator(ac->ac_sb, group, e4b, 0);
if (err)
return err;
if (unlikely(EXT4_MB_GRP_BBITMAP_CORRUPT(e4b->bd_info))) {
- ext4_mb_unload_buddy(e4b);
+ ext4_mb_unload_allocator(e4b);
return 0;
}
@@ -1936,7 +1937,7 @@ int ext4_mb_find_by_goal(struct ext4_allocation_context *ac,
ext4_mb_use_best_found(ac, e4b);
}
ext4_unlock_group(ac->ac_sb, group);
- ext4_mb_unload_buddy(e4b);
+ ext4_mb_unload_allocator(e4b);
return 0;
}
@@ -2417,7 +2418,7 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
continue;
}
- err = ext4_mb_load_buddy(sb, group, &e4b);
+ err = ext4_mb_load_allocator(sb, group, &e4b, 0);
if (err)
goto out;
@@ -2430,7 +2431,7 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
ret = ext4_mb_good_group(ac, group, cr);
if (ret == 0) {
ext4_unlock_group(sb, group);
- ext4_mb_unload_buddy(&e4b);
+ ext4_mb_unload_allocator(&e4b);
continue;
}
@@ -2444,7 +2445,7 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
ext4_mb_complex_scan_group(ac, &e4b);
ext4_unlock_group(sb, group);
- ext4_mb_unload_buddy(&e4b);
+ ext4_mb_unload_allocator(&e4b);
if (ac->ac_status != AC_STATUS_CONTINUE)
break;
@@ -2543,7 +2544,7 @@ static int ext4_mb_seq_groups_show(struct seq_file *seq, void *v)
grinfo = ext4_get_group_info(sb, group);
/* Load the group info in memory only if not already loaded. */
if (unlikely(EXT4_MB_GRP_NEED_INIT(grinfo))) {
- err = ext4_mb_load_buddy(sb, group, &e4b);
+ err = ext4_mb_load_allocator(sb, group, &e4b, 0);
if (err) {
seq_printf(seq, "#%-5u: I/O error\n", group);
return 0;
@@ -2554,7 +2555,7 @@ static int ext4_mb_seq_groups_show(struct seq_file *seq, void *v)
memcpy(&sg, ext4_get_group_info(sb, group), i);
if (buddy_loaded)
- ext4_mb_unload_buddy(&e4b);
+ ext4_mb_unload_allocator(&e4b);
seq_printf(seq, "#%-5u: %-5u %-5u %-5u [", group, sg.info.bb_free,
sg.info.bb_fragments, sg.info.bb_first_free);
@@ -3049,7 +3050,7 @@ static void ext4_free_data_in_buddy(struct super_block *sb,
mb_debug(sb, "gonna free %u blocks in group %u (0x%p):",
entry->efd_count, entry->efd_group, entry);
- err = ext4_mb_load_buddy(sb, entry->efd_group, &e4b);
+ err = ext4_mb_load_allocator(sb, entry->efd_group, &e4b, 0);
/* we expect to find existing buddy because it's pinned */
BUG_ON(err != 0);
@@ -3084,7 +3085,7 @@ static void ext4_free_data_in_buddy(struct super_block *sb,
}
ext4_unlock_group(sb, entry->efd_group);
kmem_cache_free(ext4_free_data_cachep, entry);
- ext4_mb_unload_buddy(&e4b);
+ ext4_mb_unload_allocator(&e4b);
mb_debug(sb, "freed %d blocks in %d structures\n", count,
count2);
@@ -3558,12 +3559,13 @@ static void ext4_discard_allocated_blocks(struct ext4_allocation_context *ac)
if (pa == NULL) {
if (ac->ac_f_ex.fe_len == 0)
return;
- err = ext4_mb_load_buddy(ac->ac_sb, ac->ac_f_ex.fe_group, &e4b);
+ err = ext4_mb_load_allocator(ac->ac_sb, ac->ac_f_ex.fe_group,
+ &e4b, 0);
if (err) {
/*
* This should never happen since we pin the
* pages in the ext4_allocation_context so
- * ext4_mb_load_buddy() should never fail.
+ * ext4_mb_load_allocator() should never fail.
*/
WARN(1, "mb_load_buddy failed (%d)", err);
return;
@@ -3572,7 +3574,7 @@ static void ext4_discard_allocated_blocks(struct ext4_allocation_context *ac)
mb_free_blocks(ac->ac_inode, &e4b, ac->ac_f_ex.fe_start,
ac->ac_f_ex.fe_len);
ext4_unlock_group(ac->ac_sb, ac->ac_f_ex.fe_group);
- ext4_mb_unload_buddy(&e4b);
+ ext4_mb_unload_allocator(&e4b);
return;
}
if (pa->pa_type == MB_INODE_PA)
@@ -4175,7 +4177,7 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
goto out_dbg;
}
- err = ext4_mb_load_buddy(sb, group, &e4b);
+ err = ext4_mb_load_allocator(sb, group, &e4b, 0);
if (err) {
ext4_warning(sb, "Error %d loading buddy information for %u",
err, group);
@@ -4250,7 +4252,7 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
out:
ext4_unlock_group(sb, group);
- ext4_mb_unload_buddy(&e4b);
+ ext4_mb_unload_allocator(&e4b);
put_bh(bitmap_bh);
out_dbg:
mb_debug(sb, "discarded (%d) blocks preallocated for group %u bb_free (%d)\n",
@@ -4347,8 +4349,8 @@ void ext4_discard_preallocations(struct inode *inode, unsigned int needed)
BUG_ON(pa->pa_type != MB_INODE_PA);
group = ext4_get_group_number(sb, pa->pa_pstart);
- err = ext4_mb_load_buddy_gfp(sb, group, &e4b,
- GFP_NOFS|__GFP_NOFAIL);
+ err = ext4_mb_load_allocator_gfp(sb, group, &e4b,
+ GFP_NOFS|__GFP_NOFAIL, 0);
if (err) {
ext4_error_err(sb, -err, "Error %d loading buddy information for %u",
err, group);
@@ -4360,7 +4362,7 @@ void ext4_discard_preallocations(struct inode *inode, unsigned int needed)
err = PTR_ERR(bitmap_bh);
ext4_error_err(sb, -err, "Error %d reading block bitmap for %u",
err, group);
- ext4_mb_unload_buddy(&e4b);
+ ext4_mb_unload_allocator(&e4b);
continue;
}
@@ -4369,7 +4371,7 @@ void ext4_discard_preallocations(struct inode *inode, unsigned int needed)
ext4_mb_release_inode_pa(&e4b, bitmap_bh, pa);
ext4_unlock_group(sb, group);
- ext4_mb_unload_buddy(&e4b);
+ ext4_mb_unload_allocator(&e4b);
put_bh(bitmap_bh);
list_del(&pa->u.pa_tmp_list);
@@ -4642,8 +4644,8 @@ ext4_mb_discard_lg_preallocations(struct super_block *sb,
int err;
group = ext4_get_group_number(sb, pa->pa_pstart);
- err = ext4_mb_load_buddy_gfp(sb, group, &e4b,
- GFP_NOFS|__GFP_NOFAIL);
+ err = ext4_mb_load_allocator_gfp(sb, group, &e4b,
+ GFP_NOFS|__GFP_NOFAIL, 0);
if (err) {
ext4_error_err(sb, -err, "Error %d loading buddy information for %u",
err, group);
@@ -4654,7 +4656,7 @@ ext4_mb_discard_lg_preallocations(struct super_block *sb,
ext4_mb_release_group_pa(&e4b, pa);
ext4_unlock_group(sb, group);
- ext4_mb_unload_buddy(&e4b);
+ ext4_mb_unload_allocator(&e4b);
list_del(&pa->u.pa_tmp_list);
call_rcu(&(pa)->u.pa_rcu, ext4_mb_pa_callback);
}
@@ -5241,8 +5243,8 @@ void ext4_free_blocks(handle_t *handle, struct inode *inode,
trace_ext4_mballoc_free(sb, inode, block_group, bit, count_clusters);
/* __GFP_NOFAIL: retry infinitely, ignore TIF_MEMDIE and memcg limit. */
- err = ext4_mb_load_buddy_gfp(sb, block_group, &e4b,
- GFP_NOFS|__GFP_NOFAIL);
+ err = ext4_mb_load_allocator_gfp(sb, block_group, &e4b,
+ GFP_NOFS|__GFP_NOFAIL, 0);
if (err)
goto error_return;
@@ -5316,7 +5318,7 @@ void ext4_free_blocks(handle_t *handle, struct inode *inode,
count_clusters);
}
- ext4_mb_unload_buddy(&e4b);
+ ext4_mb_unload_allocator(&e4b);
/* We dirtied the bitmap block */
BUFFER_TRACE(bitmap_bh, "dirtied bitmap block");
@@ -5434,7 +5436,7 @@ int ext4_group_add_blocks(handle_t *handle, struct super_block *sb,
}
}
- err = ext4_mb_load_buddy(sb, block_group, &e4b);
+ err = ext4_mb_load_allocator(sb, block_group, &e4b, 0);
if (err)
goto error_return;
@@ -5462,7 +5464,7 @@ int ext4_group_add_blocks(handle_t *handle, struct super_block *sb,
flex_group)->free_clusters);
}
- ext4_mb_unload_buddy(&e4b);
+ ext4_mb_unload_allocator(&e4b);
/* We dirtied the bitmap block */
BUFFER_TRACE(bitmap_bh, "dirtied bitmap block");
@@ -5550,7 +5552,7 @@ ext4_trim_all_free(struct super_block *sb, ext4_group_t group,
trace_ext4_trim_all_free(sb, group, start, max);
- ret = ext4_mb_load_buddy(sb, group, &e4b);
+ ret = ext4_mb_load_allocator(sb, group, &e4b, 0);
if (ret) {
ext4_warning(sb, "Error %d loading buddy information for %u",
ret, group);
@@ -5604,7 +5606,7 @@ ext4_trim_all_free(struct super_block *sb, ext4_group_t group,
}
out:
ext4_unlock_group(sb, group);
- ext4_mb_unload_buddy(&e4b);
+ ext4_mb_unload_allocator(&e4b);
ext4_debug("trimmed %d blocks in the group %d\n",
count, group);
@@ -5718,7 +5720,7 @@ ext4_mballoc_query_range(
struct ext4_buddy e4b;
int error;
- error = ext4_mb_load_buddy(sb, group, &e4b);
+ error = ext4_mb_load_allocator(sb, group, &e4b, 0);
if (error)
return error;
bitmap = e4b.bd_bitmap;
@@ -5747,7 +5749,7 @@ ext4_mballoc_query_range(
ext4_unlock_group(sb, group);
out_unload:
- ext4_mb_unload_buddy(&e4b);
+ ext4_mb_unload_allocator(&e4b);
return error;
}
--
2.28.0.297.g1956fa8f8d-goog
* [RFC PATCH v2 3/9] ext4: add freespace tree allocator
2020-08-21 1:55 [RFC PATCH v2 0/9] ext4: add free-space extent based allocator Harshad Shirwadkar
2020-08-21 1:55 ` [RFC PATCH v2 1/9] ext4: add handling for extended mount options Harshad Shirwadkar
2020-08-21 1:55 ` [RFC PATCH v2 2/9] ext4: rename ext4_mb_load_buddy to ext4_mb_load_allocator Harshad Shirwadkar
@ 2020-08-21 1:55 ` Harshad Shirwadkar
2020-08-21 4:06 ` kernel test robot
2020-08-21 7:04 ` kernel test robot
2020-08-21 1:55 ` [RFC PATCH v2 4/9] ext4: add prefetching support for freespace trees Harshad Shirwadkar
From: Harshad Shirwadkar @ 2020-08-21 1:55 UTC
To: linux-ext4; +Cc: tytso, lyx1209, Harshad Shirwadkar
From: Yuexin Li <lyx1209@gmail.com>
This patch adds a new freespace tree allocator. The allocator organizes
free space regions in a pair of in-memory rbtrees, one sorted by length
and the other by offset. These per-flex-bg trees let us quickly scan the
length-sorted free space regions and determine the best region for a
given request. The feature can be enabled with the "-o freespace_tree"
mount option.
Signed-off-by: Yuexin Li <lyx1209@gmail.com>
Co-developed-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
---
fs/ext4/ext4.h | 45 +++
fs/ext4/mballoc.c | 984 ++++++++++++++++++++++++++++++++++++++++++++--
fs/ext4/mballoc.h | 61 ++-
fs/ext4/resize.c | 8 +
fs/ext4/super.c | 35 +-
5 files changed, 1084 insertions(+), 49 deletions(-)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 523e00d7b392..513c8473f45f 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -363,6 +363,7 @@ struct flex_groups {
atomic64_t free_clusters;
atomic_t free_inodes;
atomic_t used_dirs;
+ struct ext4_frsp_tree *frsp_tree;
};
#define EXT4_BG_INODE_UNINIT 0x0001 /* Inode table/bitmap not in use */
@@ -1214,6 +1215,12 @@ struct ext4_inode_info {
#define EXT4_MOUNT2_EXPLICIT_JOURNAL_CHECKSUM 0x00000008 /* User explicitly
specified journal checksum */
+
+#define EXT4_MOUNT2_FREESPACE_TREE 0x00000020 /* Enable freespace tree
+ * allocator (turns buddy
+ * allocator off)
+ */
+
#define clear_opt(sb, opt) EXT4_SB(sb)->s_mount_opt &= \
~EXT4_MOUNT_##opt
#define set_opt(sb, opt) EXT4_SB(sb)->s_mount_opt |= \
@@ -1419,6 +1426,30 @@ struct ext4_super_block {
#define ext4_has_strict_mode(sbi) \
(sbi->s_encoding_flags & EXT4_ENC_STRICT_MODE_FL)
+/*
+ * Freespace tree structure
+ */
+struct ext4_frsp_tree {
+ struct rb_root_cached frsp_offset_root; /* Root for offset
+ * sorted tree
+ */
+ struct rb_root_cached frsp_len_root; /* Root for Length
+ * sorted tree
+ */
+ struct mutex frsp_lock;
+ __u8 frsp_flags;
+ __u32 frsp_max_free_len; /* Max free extent in
+ * this flex bg
+ */
+ __u32 frsp_index; /* Tree index (flex bg
+ * number)
+ */
+};
+
+/* Freespace tree flags */
+
+/* Tree is loaded in memory */
+#define EXT4_MB_FRSP_FLAG_LOADED 0x0001
/*
* fourth extended-fs super-block data in memory
*/
@@ -1557,6 +1588,9 @@ struct ext4_sb_info {
struct flex_groups * __rcu *s_flex_groups;
ext4_group_t s_flex_groups_allocated;
+ /* freespace trees stuff */
+ int s_mb_num_frsp_trees;
+
/* workqueue for reserved extent conversions (buffered io) */
struct workqueue_struct *rsv_conversion_wq;
@@ -2676,6 +2710,7 @@ extern int ext4_init_inode_table(struct super_block *sb,
extern void ext4_end_bitmap_read(struct buffer_head *bh, int uptodate);
/* mballoc.c */
+extern int ext4_mb_add_frsp_trees(struct super_block *sb, int ngroups);
extern const struct seq_operations ext4_mb_seq_groups_ops;
extern long ext4_mb_stats;
extern long ext4_mb_max_to_scan;
@@ -3100,6 +3135,16 @@ static inline ext4_group_t ext4_flex_group(struct ext4_sb_info *sbi,
return block_group >> sbi->s_log_groups_per_flex;
}
+static inline struct ext4_frsp_tree *
+ext4_get_frsp_tree(struct super_block *sb, ext4_group_t flex_bg)
+{
+ struct flex_groups *fg;
+
+ fg = sbi_array_rcu_deref(EXT4_SB(sb), s_flex_groups, flex_bg);
+
+ return fg->frsp_tree;
+}
+
static inline unsigned int ext4_flex_bg_size(struct ext4_sb_info *sbi)
{
return 1 << sbi->s_log_groups_per_flex;
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 48d791304bf1..4b1c405543f0 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -333,6 +333,7 @@
static struct kmem_cache *ext4_pspace_cachep;
static struct kmem_cache *ext4_ac_cachep;
static struct kmem_cache *ext4_free_data_cachep;
+static struct kmem_cache *ext4_freespace_node_cachep;
/* We create slab caches for groupinfo data structures based on the
* superblock block size. There will be one per mounted filesystem for
@@ -352,6 +353,11 @@ static void ext4_mb_generate_from_freelist(struct super_block *sb, void *bitmap,
ext4_group_t group);
static void ext4_mb_new_preallocation(struct ext4_allocation_context *ac);
+static int ext4_mb_load_allocator(struct super_block *sb, ext4_group_t group,
+ struct ext4_buddy *e4b, int flags);
+static int mb_mark_used(struct ext4_buddy *e4b, struct ext4_free_extent *ex);
+static void ext4_mb_unload_allocator(struct ext4_buddy *e4b);
+
/*
* The algorithm using this percpu seq counter goes below:
* 1. We sample the percpu discard_pa_seq counter before trying for block
@@ -395,6 +401,33 @@ static inline void *mb_correct_addr_and_bit(int *bit, void *addr)
return addr;
}
+static inline int ext4_blkno_to_flex_offset(struct super_block *sb,
+ ext4_fsblk_t blkno)
+{
+ return blkno % (ext4_flex_bg_size(EXT4_SB(sb)) *
+ EXT4_SB(sb)->s_blocks_per_group);
+}
+
+static inline ext4_fsblk_t ext4_flex_offset_to_blkno(struct super_block *sb,
+ int flex_bg, int flex_offset)
+{
+ return flex_bg * ext4_flex_bg_size(EXT4_SB(sb)) *
+ EXT4_SB(sb)->s_blocks_per_group + flex_offset;
+}
+
+static inline int ext4_mb_frsp_on(struct super_block *sb)
+{
+ return test_opt2(sb, FREESPACE_TREE) &&
+ EXT4_SB(sb)->s_es->s_log_groups_per_flex;
+}
+
+static inline unsigned int ext4_num_grps_to_flexbg(struct super_block *sb,
+ int ngroups)
+{
+ return (ngroups + ext4_flex_bg_size(EXT4_SB(sb)) - 1) >>
+ (EXT4_SB(sb)->s_es->s_log_groups_per_flex);
+}
+
static inline int mb_test_bit(int bit, void *addr)
{
/*
@@ -453,6 +486,7 @@ static void *mb_find_buddy(struct ext4_buddy *e4b, int order, int *max)
{
char *bb;
+ WARN_ON(ext4_mb_frsp_on(e4b->bd_sb));
BUG_ON(e4b->bd_bitmap == e4b->bd_buddy);
BUG_ON(max == NULL);
@@ -620,6 +654,9 @@ static int __mb_check_buddy(struct ext4_buddy *e4b, char *file,
void *buddy;
void *buddy2;
+ if (ext4_mb_frsp_on(sb))
+ return 0;
+
{
static int mb_check_counter;
if (mb_check_counter++ % 100 != 0)
@@ -706,6 +743,729 @@ static int __mb_check_buddy(struct ext4_buddy *e4b, char *file,
#define mb_check_buddy(e4b)
#endif
+/* Generic macro for inserting a new node in cached rbtree */
+#define ext4_mb_rb_insert(__root, __new, __node_t, __entry, __cmp) do { \
+ struct rb_node **iter = &((__root)->rb_root.rb_node), *parent = NULL; \
+ bool leftmost = true; \
+ __node_t *this = NULL; \
+ \
+ while (*iter) { \
+ this = rb_entry(*iter, __node_t, __entry); \
+ parent = *iter; \
+ if (__cmp((__new), this)) { \
+ iter = &((*iter)->rb_left); \
+ } else { \
+ iter = &((*iter)->rb_right); \
+ leftmost = false; \
+ } \
+ } \
+ rb_link_node(&(__new)->__entry, parent, iter); \
+ rb_insert_color_cached(&(__new)->__entry, __root, leftmost); \
+} while (0)
+
+static inline int ext4_mb_frsp_offset_cmp(struct ext4_frsp_node *arg1,
+ struct ext4_frsp_node *arg2)
+{
+ return (arg1->frsp_offset < arg2->frsp_offset);
+}
+
+/*
+ * Free space extents length sorting function, the nodes are sorted in
+ * decreasing order of length
+ */
+static inline int ext4_mb_frsp_len_cmp(struct ext4_frsp_node *arg1,
+ struct ext4_frsp_node *arg2)
+{
+ return (arg1->frsp_len > arg2->frsp_len);
+}
+
+/* insert to offset-indexed tree */
+static void ext4_mb_frsp_insert_off(struct ext4_frsp_tree *tree,
+ struct ext4_frsp_node *new_entry)
+{
+ memset(&new_entry->frsp_node, 0, sizeof(new_entry->frsp_node));
+ ext4_mb_rb_insert(&tree->frsp_offset_root, new_entry,
+ struct ext4_frsp_node, frsp_node, ext4_mb_frsp_offset_cmp);
+}
+
+/* insert to tree sorted by length */
+static void ext4_mb_frsp_insert_len(struct ext4_frsp_tree *tree,
+ struct ext4_frsp_node *new_entry)
+{
+ memset(&new_entry->frsp_len_node, 0, sizeof(new_entry->frsp_len_node));
+ ext4_mb_rb_insert(&tree->frsp_len_root, new_entry,
+ struct ext4_frsp_node, frsp_len_node,
+ ext4_mb_frsp_len_cmp);
+}
+
+#ifdef CONFIG_EXT4_DEBUG
+/* print freespace_tree via in-order (length-sorted) traversal */
+void ext4_mb_frsp_print_tree_len(struct super_block *sb,
+ struct ext4_frsp_tree *tree)
+{
+ unsigned int count = 0;
+ ext4_fsblk_t blk = 0, blk_end = 0;
+ ext4_group_t group = 0, group_end = 0;
+ struct ext4_frsp_node *entry = NULL;
+ struct ext4_sb_info *sbi = EXT4_SB(sb);
+ struct rb_node *cur;
+
+ cur = rb_first_cached(&tree->frsp_len_root);
+ mb_debug(sb, "\nTree\tNode\tlength\tblock\tgroup\n");
+
+ while (cur) {
+ entry = rb_entry(cur, struct ext4_frsp_node, frsp_len_node);
+ count++;
+ blk = ext4_flex_offset_to_blkno(sb, tree->frsp_index,
+ entry->frsp_offset);
+ blk_end = ext4_flex_offset_to_blkno(sb, tree->frsp_index,
+ entry->frsp_offset + entry->frsp_len - 1);
+ group = blk / sbi->s_blocks_per_group;
+ group_end = (blk_end-1) / sbi->s_blocks_per_group;
+ mb_debug(sb, "%u\t%u\t%u\t%u(%lld)--%llu\t%u--%u\n",
+ tree->frsp_index, count, entry->frsp_len,
+ entry->frsp_offset, blk, blk_end, group, group_end);
+ cur = rb_next(cur);
+ }
+}
+#endif
+
+static struct ext4_frsp_node *ext4_mb_frsp_alloc_node(struct super_block *sb)
+{
+ struct ext4_frsp_node *node;
+
+ node = kmem_cache_alloc(ext4_freespace_node_cachep, GFP_NOFS);
+ if (!node)
+ return NULL;
+
+ RB_CLEAR_NODE(&node->frsp_node);
+ RB_CLEAR_NODE(&node->frsp_len_node);
+
+ return node;
+}
+
+static void ext4_mb_frsp_free_node(struct super_block *sb,
+ struct ext4_frsp_node *node)
+{
+ kmem_cache_free(ext4_freespace_node_cachep, node);
+}
+
+/* Evict a tree from memory */
+void ext4_mb_frsp_free_tree(struct super_block *sb, struct ext4_frsp_tree *tree)
+{
+ struct ext4_frsp_node *frsp_node;
+ struct rb_node *node;
+
+ mb_debug(sb, "Evicting tree %d\n", tree->frsp_index);
+ mutex_lock(&tree->frsp_lock);
+ if (!(tree->frsp_flags & EXT4_MB_FRSP_FLAG_LOADED))
+ goto out;
+
+ node = rb_first_cached(&tree->frsp_offset_root);
+ while (node) {
+ frsp_node = rb_entry(node, struct ext4_frsp_node, frsp_node);
+ rb_erase_cached(node, &tree->frsp_offset_root);
+ rb_erase_cached(&frsp_node->frsp_len_node,
+ &tree->frsp_len_root);
+ ext4_mb_frsp_free_node(sb, frsp_node);
+ node = rb_first_cached(&tree->frsp_offset_root);
+ }
+ tree->frsp_flags &= ~EXT4_MB_FRSP_FLAG_LOADED;
+ tree->frsp_offset_root = RB_ROOT_CACHED;
+ tree->frsp_len_root = RB_ROOT_CACHED;
+out:
+ mutex_unlock(&tree->frsp_lock);
+}
+
+/*
+ * Search the tree by flex-bg offset. Returns the node whose offset equals
+ * "target", or NULL if none exists. Must be called with tree mutex held.
+ */
+struct ext4_frsp_node *ext4_mb_frsp_search_by_off(struct super_block *sb,
+ struct ext4_frsp_tree *tree,
+ ext4_grpblk_t target)
+{
+ struct rb_root *root = &tree->frsp_offset_root.rb_root;
+ struct rb_node *node = root->rb_node;
+ struct ext4_frsp_node *this = NULL;
+
+ while (node) {
+ this = rb_entry(node, struct ext4_frsp_node, frsp_node);
+ if (this->frsp_offset == target)
+ return this;
+ else if (target < this->frsp_offset)
+ node = node->rb_left;
+ else
+ node = node->rb_right;
+ }
+ return NULL;
+}
+
+/*
+ * Check if two entries in freespace tree can be merged together. Nodes
+ * can merge together only if they are physically contiguous and belong
+ * to the same block group.
+ */
+int ext4_mb_frsp_node_can_merge(struct super_block *sb,
+ struct ext4_frsp_node *prev_entry, struct ext4_frsp_node *cur_entry)
+{
+ if (prev_entry->frsp_offset + prev_entry->frsp_len !=
+ cur_entry->frsp_offset)
+ return 0;
+ if (prev_entry->frsp_offset / EXT4_SB(sb)->s_blocks_per_group !=
+ cur_entry->frsp_offset / EXT4_SB(sb)->s_blocks_per_group)
+ return 0;
+ return 1;
+}
+
+/*
+ * Add a new free space region to the tree. We insert the new node in both
+ * the length-sorted and the flex-bg-offset-sorted trees. Must be called with
+ * tree mutex held.
+ */
+int ext4_mb_frsp_add_region(struct super_block *sb, struct ext4_frsp_tree *tree,
+ ext4_grpblk_t offset, ext4_grpblk_t len)
+{
+ struct ext4_frsp_node *new_entry = NULL, *next_entry = NULL,
+ *prev_entry = NULL;
+ struct rb_node *left = NULL, *right = NULL;
+
+ new_entry = ext4_mb_frsp_alloc_node(sb);
+ if (!new_entry)
+ return -ENOMEM;
+
+ new_entry->frsp_offset = offset;
+ new_entry->frsp_len = len;
+
+ ext4_mb_frsp_insert_off(tree, new_entry);
+ /* try merge to left and right */
+ /* left */
+ left = rb_prev(&new_entry->frsp_node);
+ if (left) {
+ prev_entry = rb_entry(left, struct ext4_frsp_node, frsp_node);
+ if (ext4_mb_frsp_node_can_merge(sb, prev_entry, new_entry)) {
+ new_entry->frsp_offset = prev_entry->frsp_offset;
+ new_entry->frsp_len += prev_entry->frsp_len;
+ rb_erase_cached(left, &tree->frsp_offset_root);
+ rb_erase_cached(&prev_entry->frsp_len_node,
+ &tree->frsp_len_root);
+ ext4_mb_frsp_free_node(sb, prev_entry);
+ }
+ }
+
+ /* right */
+ right = rb_next(&new_entry->frsp_node);
+ if (right) {
+ next_entry = rb_entry(right, struct ext4_frsp_node, frsp_node);
+ if (ext4_mb_frsp_node_can_merge(sb, new_entry, next_entry)) {
+ new_entry->frsp_len += next_entry->frsp_len;
+ rb_erase_cached(right, &tree->frsp_offset_root);
+ rb_erase_cached(&next_entry->frsp_len_node,
+ &tree->frsp_len_root);
+ ext4_mb_frsp_free_node(sb, next_entry);
+ }
+ }
+ ext4_mb_frsp_insert_len(tree, new_entry);
+
+ return 0;
+}
+
+/*
+ * Free 'length' blocks starting at block number 'block' in the free space
+ * tree.
+ */
+int ext4_mb_frsp_free_blocks(struct super_block *sb, ext4_group_t group,
+ ext4_grpblk_t block, ext4_grpblk_t length)
+{
+ struct ext4_sb_info *sbi = EXT4_SB(sb);
+ struct ext4_frsp_tree *tree =
+ ext4_get_frsp_tree(sb, ext4_flex_group(sbi, group));
+ int err = 0;
+
+ mutex_lock(&tree->frsp_lock);
+ err = ext4_mb_frsp_add_region(sb, tree,
+ ext4_blkno_to_flex_offset(sb, block), length);
+ mutex_unlock(&tree->frsp_lock);
+
+ return err;
+}
+
+/*
+ * Create freespace tree from on-disk bitmap. Must be called with tree mutex
+ * held. Returns 0 on success, error otherwise.
+ */
+int ext4_mb_frsp_bb_to_tree(struct ext4_buddy *e4b, ext4_group_t group,
+ struct buffer_head *bh)
+{
+ struct super_block *sb = e4b->bd_sb;
+ int length = 0, bit = 0, next;
+ int end = EXT4_SB(sb)->s_blocks_per_group;
+ ext4_fsblk_t block;
+ int ret = 0;
+
+ /* find all unused blocks in the bitmap, convert them to new tree nodes */
+ while (bit < end) {
+ bit = mb_find_next_zero_bit(bh->b_data, end, bit);
+ if (bit >= end)
+ break;
+
+ next = mb_find_next_bit(bh->b_data, end, bit);
+ length = next - bit;
+ block = ext4_group_first_block_no(sb, group) + bit;
+
+ ret = ext4_mb_frsp_add_region(sb, e4b->frsp_tree,
+ ext4_blkno_to_flex_offset(sb, block), length);
+ if (ret)
+ break;
+ bit = next + 1;
+ }
+
+ return ret;
+}
+
+/*
+ * Load one freespace tree from disk. Must be called with the tree mutex held.
+ */
+int ext4_mb_frsp_load(struct ext4_buddy *e4b)
+{
+ ext4_group_t ngroups, group_start, group_end, grp;
+ struct ext4_sb_info *sbi = EXT4_SB(e4b->bd_sb);
+ int i, ret = 0;
+ struct buffer_head **bh = NULL;
+
+ if (e4b->frsp_tree->frsp_flags & EXT4_MB_FRSP_FLAG_LOADED)
+ return 0;
+
+ group_start = e4b->frsp_tree->frsp_index * ext4_flex_bg_size(sbi);
+ group_end = min(group_start + ext4_flex_bg_size(sbi),
+ ext4_get_groups_count(e4b->bd_sb)) - 1;
+ ngroups = group_end - group_start + 1;
+
+ bh = kcalloc(ngroups, sizeof(*bh), GFP_KERNEL);
+ if (!bh)
+ return -ENOMEM;
+ for (i = 0; i < ngroups; i++) {
+ grp = i + group_start;
+ bh[i] = ext4_read_block_bitmap_nowait(e4b->bd_sb, grp, false);
+ if (IS_ERR(bh[i])) {
+ ret = PTR_ERR(bh[i]);
+ goto out;
+ }
+ }
+ for (i = 0; i < ngroups; i++) {
+ grp = i + group_start;
+ if (!bitmap_uptodate(bh[i])) {
+ ret = ext4_wait_block_bitmap(e4b->bd_sb, grp, bh[i]);
+ if (ret)
+ goto out;
+ }
+ ret = ext4_mb_frsp_bb_to_tree(e4b, grp, bh[i]);
+ if (ret)
+ goto out;
+ }
+ e4b->frsp_tree->frsp_flags |= EXT4_MB_FRSP_FLAG_LOADED;
+out:
+ for (i = 0; i < ngroups; i++) {
+ if (!IS_ERR_OR_NULL(bh[i]))
+ put_bh(bh[i]);
+ if (!ret && IS_ERR(bh[i]))
+ ret = PTR_ERR(bh[i]);
+ }
+ kfree(bh);
+ return ret;
+
+}
+
+/*
+ * Determine if node with length len is better than what we have found until
+ * now. Return 1 if that is the case, 0 otherwise. If len is an exact match,
+ * set *best = 1.
+ */
+static int ext4_mb_frsp_is_better(struct ext4_allocation_context *ac,
+ int len, int *best)
+{
+ struct ext4_tree_extent *btx = &ac->ac_b_tree_ex;
+ struct ext4_free_extent *gex = &ac->ac_g_ex;
+
+ if (best)
+ *best = 0;
+ if (len == gex->fe_len) {
+ if (best)
+ *best = 1;
+ return 1;
+ }
+ if (ac->ac_criteria == 0 && len < gex->fe_len)
+ return 0;
+ /*
+ * See if the new node is closer to goal length than
+ * the best extent found so far
+ */
+ if (btx->te_len < gex->fe_len && len > btx->te_len)
+ return 1;
+ if (len > gex->fe_len && len < btx->te_len)
+ return 1;
+ return 0;
+}
+
+/*
+ * Check whether we have scanned sufficient freespace candidates and
+ * stop scanning once we reach/exceed s_mb_max_to_scan.
+ */
+static void ext4_mb_frsp_check_limits(struct ext4_allocation_context *ac)
+{
+ struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
+
+ if (ac->ac_status == AC_STATUS_FOUND)
+ return;
+
+ /*
+ * Exceeded max number of nodes to scan
+ */
+ if (ac->ac_found > sbi->s_mb_max_to_scan &&
+ !(ac->ac_flags & EXT4_MB_HINT_FIRST))
+ ac->ac_status = AC_STATUS_BREAK;
+}
+
+/*
+ * Mark the free space in the selected tree node as used and update the
+ * tree. This must be called with the tree lock held.
+ */
+static void ext4_mb_frsp_use_best_found(struct ext4_allocation_context *ac,
+ ext4_group_t flex, struct ext4_frsp_node *selected)
+{
+ struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
+ struct ext4_tree_extent *btx = &ac->ac_b_tree_ex;
+ struct ext4_free_extent *bex;
+ unsigned int group_no;
+ struct ext4_buddy e4b;
+
+ WARN_ON(ac->ac_status == AC_STATUS_FOUND);
+ btx->te_len = min(btx->te_len, ac->ac_g_ex.fe_len);
+ ac->ac_status = AC_STATUS_FOUND;
+
+ /* update ac best-found-extent */
+ bex = &ac->ac_b_ex;
+ group_no = (btx->te_flex * ext4_flex_bg_size(sbi)) +
+ (btx->te_flex_start / sbi->s_blocks_per_group);
+ bex->fe_start = btx->te_flex_start % sbi->s_blocks_per_group;
+ bex->fe_group = group_no;
+ bex->fe_len = btx->te_len;
+ bex->fe_node = selected;
+
+ /* Add free blocks to our tree */
+ ext4_mb_load_allocator(ac->ac_sb, group_no, &e4b,
+ EXT4_ALLOCATOR_FRSP_NOLOAD);
+ mb_mark_used(&e4b, bex);
+ ext4_mb_unload_allocator(&e4b);
+}
+/*
+ * This routine checks whether the found tree node is good enough. If it
+ * is, the tree node gets marked used and a flag is set in the context
+ * to stop scanning.
+ *
+ * Otherwise, the tree node is compared with the previously found extent
+ * and, if the new one is better, it is stored in the context.
+ *
+ * Later, the best found tree node will be used if mballoc can't find a
+ * good enough extent.
+ */
+static int ext4_mb_frsp_measure_node(struct ext4_allocation_context *ac,
+ int tree_idx, struct ext4_frsp_node *cur,
+ int goal)
+{
+ struct ext4_tree_extent *btx = &ac->ac_b_tree_ex;
+ ext4_fsblk_t start;
+ int best_found = 0, max;
+
+ WARN_ON(btx->te_len < 0);
+ WARN_ON(ac->ac_status != AC_STATUS_CONTINUE);
+
+ if (goal) {
+ start = ext4_group_first_block_no(ac->ac_sb,
+ ac->ac_g_ex.fe_group) +
+ ac->ac_g_ex.fe_start;
+ if (cur->frsp_offset > ext4_blkno_to_flex_offset(ac->ac_sb,
+ start))
+ return 0;
+ max = cur->frsp_offset + cur->frsp_len -
+ ext4_blkno_to_flex_offset(ac->ac_sb, start);
+ if (max >= ac->ac_g_ex.fe_len &&
+ ac->ac_g_ex.fe_len == EXT4_SB(ac->ac_sb)->s_stripe) {
+ if (do_div(start, EXT4_SB(ac->ac_sb)->s_stripe) != 0)
+ return 0;
+ best_found = 1;
+ } else if (max >= ac->ac_g_ex.fe_len) {
+ best_found = 1;
+ } else if (max > 0 && (ac->ac_flags & EXT4_MB_HINT_MERGE)) {
+ /*
+ * Sometimes the caller may want to merge even a
+ * small number of blocks into an existing extent.
+ */
+ best_found = 1;
+
+ } else {
+ return 0;
+ }
+ ac->ac_found++;
+ goto use_extent;
+ }
+ ac->ac_found++;
+
+ /* The special case - take what you catch first */
+ if (unlikely(ac->ac_flags & EXT4_MB_HINT_FIRST)) {
+ best_found = 1;
+ goto use_extent;
+ }
+
+ /*
+ * For data allocations, prefer not to use blocks from the first
+ * group in a flex_bg.
+ */
+ if (unlikely((ac->ac_criteria < 2) &&
+ (ac->ac_flags & EXT4_MB_HINT_DATA) &&
+ (cur->frsp_offset < EXT4_BLOCKS_PER_GROUP(ac->ac_sb))))
+ return 1;
+
+
+ /* If this is the first found extent, just store it in the context */
+ if (btx->te_len == 0)
+ goto use_extent;
+
+ if (ext4_mb_frsp_is_better(ac, cur->frsp_len, &best_found))
+ goto use_extent;
+
+ ext4_mb_frsp_check_limits(ac);
+ return 0;
+
+use_extent:
+ btx->te_flex = tree_idx;
+ if (goal) {
+ btx->te_flex_start = ext4_blkno_to_flex_offset(ac->ac_sb,
+ start);
+ btx->te_len = max;
+ } else {
+ btx->te_flex_start = cur->frsp_offset;
+ btx->te_len = cur->frsp_len;
+ }
+ if (best_found)
+ ext4_mb_frsp_use_best_found(ac, tree_idx, cur);
+ ext4_mb_frsp_check_limits(ac);
+ return 1;
+}
+
+/* Search by goal */
+static noinline_for_stack
+void ext4_mb_frsp_find_by_goal(struct ext4_allocation_context *ac)
+{
+ struct ext4_frsp_node *cur = NULL;
+ unsigned int tree_blk;
+ ext4_fsblk_t blk;
+ struct ext4_buddy e4b;
+ int ret;
+
+ if (!(ac->ac_flags & EXT4_MB_HINT_TRY_GOAL))
+ return;
+
+ /* compute start node offset in tree */
+ blk = ext4_group_first_block_no(ac->ac_sb, ac->ac_g_ex.fe_group) +
+ ac->ac_g_ex.fe_start;
+ tree_blk = ext4_blkno_to_flex_offset(ac->ac_sb, blk);
+
+ ret = ext4_mb_load_allocator(ac->ac_sb, ac->ac_g_ex.fe_group, &e4b, 0);
+ if (ret)
+ return;
+
+ /* try the goal block and its freespace tree first */
+ mutex_lock(&e4b.frsp_tree->frsp_lock);
+ cur = ext4_mb_frsp_search_by_off(ac->ac_sb, e4b.frsp_tree, tree_blk);
+ if (cur && tree_blk < cur->frsp_offset + cur->frsp_len - 1)
+ ext4_mb_frsp_measure_node(ac, e4b.frsp_tree->frsp_index, cur,
+ 1 /* searching by goal */);
+
+ mutex_unlock(&e4b.frsp_tree->frsp_lock);
+ ext4_mb_unload_allocator(&e4b);
+}
+
+static void ext4_mb_frsp_process(struct ext4_allocation_context *ac,
+ struct ext4_frsp_tree *tree)
+{
+ struct ext4_frsp_node *cur = NULL;
+ struct rb_node *node = NULL;
+ int ret;
+
+ node = rb_first_cached(&tree->frsp_len_root);
+ mb_debug(ac->ac_sb, "Processing tree index %d, flags = %x\n",
+ tree->frsp_index, tree->frsp_flags);
+ /*
+ * To honor the "no data blocks in the first group of a flex_bg"
+ * requirement, we cannot do an O(log N) search here. TODO: it is
+ * possible to do that at CR=2, where that requirement doesn't apply.
+ */
+ while (node && ac->ac_status == AC_STATUS_CONTINUE) {
+ cur = rb_entry(node, struct ext4_frsp_node, frsp_len_node);
+ if (ac->ac_criteria == 0 && cur->frsp_len < ac->ac_g_ex.fe_len)
+ return;
+ ret = ext4_mb_frsp_measure_node(ac, tree->frsp_index, cur,
+ 0 /* searching by len */);
+ if (ret == 0)
+ return;
+ node = rb_next(node);
+ }
+}
+
+/* allocator for freespace_tree */
+static noinline_for_stack int
+ext4_mb_tree_allocator(struct ext4_allocation_context *ac)
+{
+ struct ext4_buddy e4b;
+ struct ext4_sb_info *sbi;
+ struct super_block *sb;
+ struct ext4_frsp_tree *tree = NULL;
+ struct ext4_frsp_node *cur = NULL;
+ struct ext4_tree_extent *btx = NULL;
+ int ret = 0, start_idx = 0, tree_idx, j;
+
+ sb = ac->ac_sb;
+ btx = &ac->ac_b_tree_ex;
+ sbi = EXT4_SB(sb);
+
+ start_idx = ext4_flex_group(sbi, ac->ac_g_ex.fe_group);
+ mb_debug(sb, "requested size %d\n", ac->ac_g_ex.fe_len);
+ /* First try searching from the goal block in the start_idx-th freespace tree */
+ ext4_mb_frsp_find_by_goal(ac);
+ if (ac->ac_status == AC_STATUS_FOUND)
+ goto out;
+
+ if (unlikely(ac->ac_flags & EXT4_MB_HINT_GOAL_ONLY))
+ goto out;
+
+ ac->ac_criteria = 0;
+ ac->ac_groups_scanned = 0;
+repeat:
+
+ /* Loop through the rest of trees (flex_bg) */
+ for (j = start_idx;
+ (ac->ac_groups_scanned < sbi->s_mb_num_frsp_trees) &&
+ ac->ac_status == AC_STATUS_CONTINUE; j++) {
+ ac->ac_groups_scanned++;
+ tree_idx = (j % sbi->s_mb_num_frsp_trees);
+
+ ret = ext4_mb_load_allocator(sb,
+ tree_idx * ext4_flex_bg_size(sbi), &e4b, 0);
+ if (ret)
+ goto out;
+ mutex_lock(&e4b.frsp_tree->frsp_lock);
+ ext4_mb_frsp_process(ac, e4b.frsp_tree);
+ mutex_unlock(&e4b.frsp_tree->frsp_lock);
+ ext4_mb_unload_allocator(&e4b);
+ }
+
+ if (ac->ac_status != AC_STATUS_FOUND) {
+ if (ac->ac_criteria < 2) {
+ ac->ac_criteria++;
+ ac->ac_groups_scanned = 0;
+ mb_debug(sb, "Falling back to CR=%d", ac->ac_criteria);
+ goto repeat;
+ }
+ if (btx->te_len > 0 && !(ac->ac_flags & EXT4_MB_HINT_FIRST)) {
+ e4b.frsp_flags = 0;
+ ret = ext4_mb_load_allocator(sb, btx->te_flex *
+ ext4_flex_bg_size(sbi), &e4b, 0);
+ if (ret)
+ goto out;
+ tree = e4b.frsp_tree;
+ mutex_lock(&tree->frsp_lock);
+ cur = ext4_mb_frsp_search_by_off(sb, tree,
+ btx->te_flex_start);
+ if (cur) {
+ ext4_mb_frsp_use_best_found(ac, btx->te_flex,
+ cur);
+ mutex_unlock(&tree->frsp_lock);
+ ext4_mb_unload_allocator(&e4b);
+
+ } else {
+ /*
+ * Someone else took this freespace node
+ * before us. Reset the best-found tree extent
+ * and turn on EXT4_MB_HINT_FIRST (greedy).
+ */
+ mutex_unlock(&tree->frsp_lock);
+ ac->ac_b_tree_ex.te_flex_start = 0;
+ ac->ac_b_tree_ex.te_flex = 0;
+ ac->ac_b_tree_ex.te_len = 0;
+ ac->ac_status = AC_STATUS_CONTINUE;
+ ac->ac_flags |= EXT4_MB_HINT_FIRST;
+ ac->ac_groups_scanned = 0;
+ ext4_mb_unload_allocator(&e4b);
+ goto repeat;
+ }
+ }
+ }
+
+ ret = btx->te_len ? 0 : -ENOSPC;
+
+out:
+ return ret;
+}
+
+static void ext4_mb_frsp_init_tree(struct ext4_frsp_tree *tree, int index)
+{
+ tree->frsp_offset_root = RB_ROOT_CACHED;
+ tree->frsp_len_root = RB_ROOT_CACHED;
+ mutex_init(&(tree->frsp_lock));
+ tree->frsp_flags = 0;
+ tree->frsp_index = index;
+}
+
+int ext4_mb_init_freespace_trees(struct super_block *sb)
+{
+ struct ext4_sb_info *sbi = EXT4_SB(sb);
+ struct flex_groups *fg;
+ int i;
+
+ sbi->s_mb_num_frsp_trees =
+ ext4_num_grps_to_flexbg(sb, ext4_get_groups_count(sb));
+
+ for (i = 0; i < sbi->s_mb_num_frsp_trees; i++) {
+ fg = sbi_array_rcu_deref(sbi, s_flex_groups, i);
+ fg->frsp_tree = kzalloc(sizeof(struct ext4_frsp_tree),
+ GFP_KERNEL);
+ if (!fg->frsp_tree)
+ return -ENOMEM;
+ ext4_mb_frsp_init_tree(fg->frsp_tree, i);
+ }
+
+ return 0;
+}
+
+int ext4_mb_add_frsp_trees(struct super_block *sb, int ngroups)
+{
+ struct ext4_sb_info *sbi = EXT4_SB(sb);
+ int flex_bg_count, old_trees_count = sbi->s_mb_num_frsp_trees;
+ int i;
+
+ if (!ext4_mb_frsp_on(sb))
+ return 0;
+
+ flex_bg_count = ext4_num_grps_to_flexbg(sb, ngroups);
+ if (old_trees_count > 0)
+ ext4_mb_frsp_free_tree(sb, ext4_get_frsp_tree(sb,
+ old_trees_count - 1));
+
+ for (i = old_trees_count; i < flex_bg_count; i++) {
+ struct flex_groups *fg;
+
+ fg = sbi_array_rcu_deref(sbi, s_flex_groups, i);
+ fg->frsp_tree = kzalloc(sizeof(*fg->frsp_tree), GFP_KERNEL);
+ if (!fg->frsp_tree)
+ return -ENOMEM;
+ ext4_mb_frsp_init_tree(fg->frsp_tree, i);
+ }
+ sbi->s_mb_num_frsp_trees = flex_bg_count;
+
+ return 0;
+}
+
/*
* Divide blocks started from @first with length @len into
* smaller chunks with power of 2 blocks.
@@ -1042,6 +1802,9 @@ static int ext4_mb_get_buddy_page_lock(struct super_block *sb,
e4b->bd_buddy_page = NULL;
e4b->bd_bitmap_page = NULL;
+ if (ext4_mb_frsp_on(sb))
+ return 0;
+
blocks_per_page = PAGE_SIZE / sb->s_blocksize;
/*
* the buddy cache inode stores the block bitmap
@@ -1099,6 +1862,7 @@ int ext4_mb_init_group(struct super_block *sb, ext4_group_t group, gfp_t gfp)
struct page *page;
int ret = 0;
+ WARN_ON(ext4_mb_frsp_on(sb));
might_sleep();
mb_debug(sb, "init group %u\n", group);
this_grp = ext4_get_group_info(sb, group);
@@ -1156,6 +1920,11 @@ int ext4_mb_init_group(struct super_block *sb, ext4_group_t group, gfp_t gfp)
* Locking note: This routine calls ext4_mb_init_cache(), which takes the
* block group lock of all groups for this page; do not hold the BG lock when
* calling this routine!
+ *
+ * For freespace trees, do not hold the tree mutex while calling this
+ * routine. Holding the mutex is okay only if the EXT4_ALLOCATOR_FRSP_NOLOAD
+ * flag is set in e4b->frsp_flags; with that flag set, this function does
+ * not try to load an unloaded tree.
*/
static noinline_for_stack int
ext4_mb_load_allocator_gfp(struct super_block *sb, ext4_group_t group,
@@ -1166,7 +1935,7 @@ ext4_mb_load_allocator_gfp(struct super_block *sb, ext4_group_t group,
int pnum;
int poff;
struct page *page;
- int ret;
+ int ret = 0;
struct ext4_group_info *grp;
struct ext4_sb_info *sbi = EXT4_SB(sb);
struct inode *inode = sbi->s_buddy_cache;
@@ -1183,6 +1952,22 @@ ext4_mb_load_allocator_gfp(struct super_block *sb, ext4_group_t group,
e4b->bd_group = group;
e4b->bd_buddy_page = NULL;
e4b->bd_bitmap_page = NULL;
+ if (ext4_mb_frsp_on(sb)) {
+ e4b->frsp_tree = ext4_get_frsp_tree(sb,
+ ext4_flex_group(sbi, group));
+ e4b->frsp_flags = flags;
+ if (flags & EXT4_ALLOCATOR_FRSP_NOLOAD)
+ return 0;
+
+ mutex_lock(&e4b->frsp_tree->frsp_lock);
+ if (e4b->frsp_tree->frsp_flags & EXT4_MB_FRSP_FLAG_LOADED) {
+ mutex_unlock(&e4b->frsp_tree->frsp_lock);
+ return 0;
+ }
+ ret = ext4_mb_frsp_load(e4b);
+ mutex_unlock(&e4b->frsp_tree->frsp_lock);
+ return ret;
+ }
if (unlikely(EXT4_MB_GRP_NEED_INIT(grp))) {
/*
@@ -1302,6 +2087,8 @@ static int ext4_mb_load_allocator(struct super_block *sb, ext4_group_t group,
static void ext4_mb_unload_allocator(struct ext4_buddy *e4b)
{
+ if (ext4_mb_frsp_on(e4b->bd_sb))
+ return;
if (e4b->bd_bitmap_page)
put_page(e4b->bd_bitmap_page);
if (e4b->bd_buddy_page)
@@ -1494,6 +2281,16 @@ static void mb_free_blocks(struct inode *inode, struct ext4_buddy *e4b,
if (first < e4b->bd_info->bb_first_free)
e4b->bd_info->bb_first_free = first;
+ if (ext4_mb_frsp_on(sb)) {
+ ext4_grpblk_t first_blk = EXT4_C2B(EXT4_SB(sb), first) +
+ ext4_group_first_block_no(sb, e4b->bd_group);
+
+ ext4_unlock_group(sb, e4b->bd_group);
+ ext4_mb_frsp_free_blocks(sb, e4b->bd_group, first_blk, count);
+ ext4_lock_group(sb, e4b->bd_group);
+ return;
+ }
+
/* access memory sequentially: check left neighbour,
* clear range and then check right neighbour
*/
@@ -1560,6 +2357,9 @@ static int mb_find_extent(struct ext4_buddy *e4b, int block,
assert_spin_locked(ext4_group_lock_ptr(e4b->bd_sb, e4b->bd_group));
BUG_ON(ex == NULL);
+ if (ext4_mb_frsp_on(e4b->bd_sb))
+ return 0;
+
buddy = mb_find_buddy(e4b, 0, &max);
BUG_ON(buddy == NULL);
BUG_ON(block >= max);
@@ -1624,17 +2424,74 @@ static int mb_mark_used(struct ext4_buddy *e4b, struct ext4_free_extent *ex)
unsigned ret = 0;
int len0 = len;
void *buddy;
+ ext4_grpblk_t flex_offset;
BUG_ON(start + len > (e4b->bd_sb->s_blocksize << 3));
BUG_ON(e4b->bd_group != ex->fe_group);
+
+ e4b->bd_info->bb_free -= len;
+ if (e4b->bd_info->bb_first_free == start)
+ e4b->bd_info->bb_first_free += len;
+
+ if (ext4_mb_frsp_on(e4b->bd_sb)) {
+ struct ext4_frsp_node *new;
+
+ flex_offset = ext4_blkno_to_flex_offset(e4b->bd_sb,
+ ext4_group_first_block_no(e4b->bd_sb,
+ e4b->bd_group) + ex->fe_start);
+ if (!ex->fe_node) {
+ ex->fe_node = ext4_mb_frsp_search_by_off(e4b->bd_sb,
+ e4b->frsp_tree, flex_offset);
+ if (!ex->fe_node)
+ return 0;
+ }
+ /*
+ * Remove the node from both trees before modifying it, since we
+ * are about to change its length and/or offset.
+ */
+ rb_erase_cached(&ex->fe_node->frsp_node,
+ &e4b->frsp_tree->frsp_offset_root);
+ rb_erase_cached(&ex->fe_node->frsp_len_node,
+ &e4b->frsp_tree->frsp_len_root);
+ RB_CLEAR_NODE(&ex->fe_node->frsp_node);
+ RB_CLEAR_NODE(&ex->fe_node->frsp_len_node);
+ if (flex_offset > ex->fe_node->frsp_offset) {
+ if (flex_offset + ex->fe_len !=
+ ex->fe_node->frsp_offset +
+ ex->fe_node->frsp_len) {
+ /* Need to split the node */
+ new = ext4_mb_frsp_alloc_node(e4b->bd_sb);
+ if (!new)
+ return -ENOMEM;
+ new->frsp_offset = flex_offset + ex->fe_len;
+ new->frsp_len = (ex->fe_node->frsp_offset +
+ ex->fe_node->frsp_len) -
+ new->frsp_offset;
+ ext4_mb_frsp_insert_len(e4b->frsp_tree, new);
+ ext4_mb_frsp_insert_off(e4b->frsp_tree, new);
+ }
+ ex->fe_node->frsp_len = flex_offset -
+ ex->fe_node->frsp_offset;
+ } else if (ex->fe_len < ex->fe_node->frsp_len) {
+ /* used only a part: update node */
+ ex->fe_node->frsp_offset += ex->fe_len;
+ ex->fe_node->frsp_len -= ex->fe_len;
+ } else {
+ ext4_mb_frsp_free_node(e4b->bd_sb, ex->fe_node);
+ return 0;
+ }
+
+ ext4_mb_frsp_insert_len(e4b->frsp_tree, ex->fe_node);
+ ext4_mb_frsp_insert_off(e4b->frsp_tree, ex->fe_node);
+
+ return 0;
+ }
+
assert_spin_locked(ext4_group_lock_ptr(e4b->bd_sb, e4b->bd_group));
mb_check_buddy(e4b);
mb_mark_used_double(e4b, start, len);
this_cpu_inc(discard_pa_seq);
- e4b->bd_info->bb_free -= len;
- if (e4b->bd_info->bb_first_free == start)
- e4b->bd_info->bb_first_free += len;
/* let's maintain fragments counter */
if (start != 0)
@@ -2773,6 +3630,13 @@ static int ext4_mb_init_backend(struct super_block *sb)
rcu_read_lock();
kvfree(rcu_dereference(sbi->s_group_info));
rcu_read_unlock();
+ if (ext4_mb_frsp_on(sb)) {
+ for (i = 0; i < sbi->s_mb_num_frsp_trees; i++) {
+ struct ext4_frsp_tree *tree = ext4_get_frsp_tree(sb, i);
+
+ kfree(tree);
+ }
+ }
return -ENOMEM;
}
@@ -2869,6 +3733,13 @@ int ext4_mb_init(struct super_block *sb)
i++;
} while (i <= sb->s_blocksize_bits + 1);
+ /* init for freespace trees */
+ if (ext4_mb_frsp_on(sb)) {
+ ret = ext4_mb_init_freespace_trees(sb);
+ if (ret)
+ goto out;
+ }
+
spin_lock_init(&sbi->s_md_lock);
spin_lock_init(&sbi->s_bal_lock);
sbi->s_mb_free_pending = 0;
@@ -2936,6 +3807,13 @@ int ext4_mb_init(struct super_block *sb)
sbi->s_mb_offsets = NULL;
kfree(sbi->s_mb_maxs);
sbi->s_mb_maxs = NULL;
+ if (ext4_mb_frsp_on(sb)) {
+ for (i = 0; i < sbi->s_mb_num_frsp_trees; i++) {
+ struct ext4_frsp_tree *tree = ext4_get_frsp_tree(sb, i);
+
+ kfree(tree);
+ }
+ }
return ret;
}
@@ -3080,8 +3958,10 @@ static void ext4_free_data_in_buddy(struct super_block *sb,
/* No more items in the per group rb tree
* balance refcounts from ext4_mb_free_metadata()
*/
- put_page(e4b.bd_buddy_page);
- put_page(e4b.bd_bitmap_page);
+ if (!ext4_mb_frsp_on(sb)) {
+ put_page(e4b.bd_buddy_page);
+ put_page(e4b.bd_bitmap_page);
+ }
}
ext4_unlock_group(sb, entry->efd_group);
kmem_cache_free(ext4_free_data_cachep, entry);
@@ -3159,9 +4039,14 @@ int __init ext4_init_mballoc(void)
SLAB_RECLAIM_ACCOUNT);
if (ext4_free_data_cachep == NULL)
goto out_ac_free;
+ ext4_freespace_node_cachep = KMEM_CACHE(ext4_frsp_node,
+ SLAB_RECLAIM_ACCOUNT);
+ if (ext4_freespace_node_cachep == NULL)
+ goto out_frsp_free;
return 0;
-
+out_frsp_free:
+ kmem_cache_destroy(ext4_free_data_cachep);
out_ac_free:
kmem_cache_destroy(ext4_ac_cachep);
out_pa_free:
@@ -3180,6 +4065,7 @@ void ext4_exit_mballoc(void)
kmem_cache_destroy(ext4_pspace_cachep);
kmem_cache_destroy(ext4_ac_cachep);
kmem_cache_destroy(ext4_free_data_cachep);
+ kmem_cache_destroy(ext4_freespace_node_cachep);
ext4_groupinfo_destroy_slabs();
}
@@ -3682,6 +4568,9 @@ ext4_mb_use_preallocated(struct ext4_allocation_context *ac)
if (!(ac->ac_flags & EXT4_MB_HINT_DATA))
return false;
+ if (ext4_mb_frsp_on(ac->ac_sb))
+ return 0;
+
/* first, try per-file preallocation */
rcu_read_lock();
list_for_each_entry_rcu(pa, &ei->i_prealloc_list, pa_inode_list) {
@@ -4384,6 +5273,8 @@ static int ext4_mb_pa_alloc(struct ext4_allocation_context *ac)
struct ext4_prealloc_space *pa;
BUG_ON(ext4_pspace_cachep == NULL);
+ if (ext4_mb_frsp_on(ac->ac_sb))
+ return 0;
pa = kmem_cache_zalloc(ext4_pspace_cachep, GFP_NOFS);
if (!pa)
return -ENOMEM;
@@ -4396,6 +5287,8 @@ static void ext4_mb_pa_free(struct ext4_allocation_context *ac)
{
struct ext4_prealloc_space *pa = ac->ac_pa;
+ if (ext4_mb_frsp_on(ac->ac_sb))
+ return;
BUG_ON(!pa);
ac->ac_pa = NULL;
WARN_ON(!atomic_dec_and_test(&pa->pa_count));
@@ -4569,6 +5462,13 @@ ext4_mb_initialize_context(struct ext4_allocation_context *ac,
ac->ac_o_ex.fe_len = len;
ac->ac_g_ex = ac->ac_o_ex;
ac->ac_flags = ar->flags;
+ if (ext4_mb_frsp_on(sb))
+ ac->ac_flags |= EXT4_MB_HINT_NOPREALLOC;
+
+ /* set up best-found tree node */
+ ac->ac_b_tree_ex.te_flex_start = 0;
+ ac->ac_b_tree_ex.te_flex = 0;
+ ac->ac_b_tree_ex.te_len = 0;
/* we have to define context: we'll work with a file or
* locality group. this is a policy, actually */
@@ -4919,7 +5819,11 @@ ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle,
goto errout;
repeat:
/* allocate space in core */
- *errp = ext4_mb_regular_allocator(ac);
+ if (ext4_mb_frsp_on(sb))
+ *errp = ext4_mb_tree_allocator(ac);
+ else
+ *errp = ext4_mb_regular_allocator(ac);
+
/*
* pa allocated above is added to grp->bb_prealloc_list only
* when we were able to allocate some block i.e. when
@@ -4932,8 +5836,13 @@ ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle,
ext4_discard_allocated_blocks(ac);
goto errout;
}
- if (ac->ac_status == AC_STATUS_FOUND &&
- ac->ac_o_ex.fe_len >= ac->ac_f_ex.fe_len)
+ /*
+ * Freespace trees should never return more than what was asked
+ * for.
+ */
+ if (!ext4_mb_frsp_on(sb) &&
+ (ac->ac_status == AC_STATUS_FOUND &&
+ ac->ac_o_ex.fe_len >= ac->ac_f_ex.fe_len))
ext4_mb_pa_free(ac);
}
if (likely(ac->ac_status == AC_STATUS_FOUND)) {
@@ -5024,13 +5933,12 @@ ext4_mb_free_metadata(handle_t *handle, struct ext4_buddy *e4b,
struct rb_node *parent = NULL, *new_node;
BUG_ON(!ext4_handle_valid(handle));
- BUG_ON(e4b->bd_bitmap_page == NULL);
- BUG_ON(e4b->bd_buddy_page == NULL);
+ BUG_ON(!ext4_mb_frsp_on(sb) && e4b->bd_bitmap_page == NULL);
new_node = &new_entry->efd_node;
cluster = new_entry->efd_start_cluster;
- if (!*n) {
+ if (!*n && !ext4_mb_frsp_on(sb)) {
/* first free block exent. We need to
protect buddy cache from being freed,
* otherwise we'll refresh it from
@@ -5509,6 +6417,7 @@ __acquires(bitlock)
ex.fe_start = start;
ex.fe_group = group;
ex.fe_len = count;
+ ex.fe_node = NULL;
/*
* Mark blocks used, so no one can reuse them while
@@ -5548,6 +6457,7 @@ ext4_trim_all_free(struct super_block *sb, ext4_group_t group,
void *bitmap;
ext4_grpblk_t next, count = 0, free_count = 0;
struct ext4_buddy e4b;
+ struct buffer_head *bh = NULL;
int ret = 0;
trace_ext4_trim_all_free(sb, group, start, max);
@@ -5558,7 +6468,17 @@ ext4_trim_all_free(struct super_block *sb, ext4_group_t group,
ret, group);
return ret;
}
- bitmap = e4b.bd_bitmap;
+
+ if (ext4_mb_frsp_on(sb)) {
+ bh = ext4_read_block_bitmap(sb, group);
+ if (IS_ERR(bh)) {
+ ret = PTR_ERR(bh);
+ goto out;
+ }
+ bitmap = bh->b_data;
+ } else {
+ bitmap = e4b.bd_bitmap;
+ }
ext4_lock_group(sb, group);
if (EXT4_MB_GRP_WAS_TRIMMED(e4b.bd_info) &&
@@ -5605,6 +6525,8 @@ ext4_trim_all_free(struct super_block *sb, ext4_group_t group,
EXT4_MB_GRP_SET_TRIMMED(e4b.bd_info);
}
out:
+ if (!IS_ERR_OR_NULL(bh))
+ brelse(bh);
ext4_unlock_group(sb, group);
ext4_mb_unload_allocator(&e4b);
@@ -5665,7 +6587,7 @@ int ext4_trim_fs(struct super_block *sb, struct fstrim_range *range)
for (group = first_group; group <= last_group; group++) {
grp = ext4_get_group_info(sb, group);
/* We only do this if the grp has never been initialized */
- if (unlikely(EXT4_MB_GRP_NEED_INIT(grp))) {
+ if (unlikely(!ext4_mb_frsp_on(sb) && EXT4_MB_GRP_NEED_INIT(grp))) {
ret = ext4_mb_init_group(sb, group, GFP_NOFS);
if (ret)
break;
@@ -5719,16 +6641,25 @@ ext4_mballoc_query_range(
ext4_grpblk_t next;
struct ext4_buddy e4b;
int error;
+ struct buffer_head *bh = NULL;
- error = ext4_mb_load_allocator(sb, group, &e4b, 0);
- if (error)
- return error;
- bitmap = e4b.bd_bitmap;
-
+ if (ext4_mb_frsp_on(sb)) {
+ bh = ext4_read_block_bitmap(sb, group);
+ if (IS_ERR(bh)) {
+ error = PTR_ERR(bh);
+ goto out_unload;
+ }
+ bitmap = bh->b_data;
+ } else {
+ error = ext4_mb_load_allocator(sb, group, &e4b, 0);
+ if (error)
+ return error;
+ bitmap = e4b.bd_bitmap;
+ }
ext4_lock_group(sb, group);
-
- start = (e4b.bd_info->bb_first_free > start) ?
- e4b.bd_info->bb_first_free : start;
+ if (!ext4_mb_frsp_on(sb))
+ start = (e4b.bd_info->bb_first_free > start) ?
+ e4b.bd_info->bb_first_free : start;
if (end >= EXT4_CLUSTERS_PER_GROUP(sb))
end = EXT4_CLUSTERS_PER_GROUP(sb) - 1;
@@ -5737,19 +6668,20 @@ ext4_mballoc_query_range(
if (start > end)
break;
next = mb_find_next_bit(bitmap, end + 1, start);
-
ext4_unlock_group(sb, group);
error = formatter(sb, group, start, next - start, priv);
if (error)
goto out_unload;
ext4_lock_group(sb, group);
-
start = next + 1;
}
ext4_unlock_group(sb, group);
out_unload:
- ext4_mb_unload_allocator(&e4b);
+ if (!IS_ERR_OR_NULL(bh))
+ brelse(bh);
+ if (!ext4_mb_frsp_on(sb))
+ ext4_mb_unload_allocator(&e4b);
return error;
}
diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h
index e75b4749aa1c..af61651258a3 100644
--- a/fs/ext4/mballoc.h
+++ b/fs/ext4/mballoc.h
@@ -78,6 +78,20 @@
*/
#define MB_DEFAULT_MAX_INODE_PREALLOC 512
+/*
+ * Struct for tree node in freespace_tree
+ */
+struct ext4_frsp_node {
+ ext4_grpblk_t frsp_offset; /* Start block offset inside
+ * current flexible group
+ */
+ ext4_grpblk_t frsp_len; /*
+ * Length of the free space in
+ * number of clusters
+ */
+ struct rb_node frsp_node;
+ struct rb_node frsp_len_node;
+};
struct ext4_free_data {
/* this links the free block information from sb_info */
struct list_head efd_list;
@@ -125,6 +139,7 @@ struct ext4_free_extent {
ext4_grpblk_t fe_start; /* In cluster units */
ext4_group_t fe_group;
ext4_grpblk_t fe_len; /* In cluster units */
+ struct ext4_frsp_node *fe_node;
};
/*
@@ -145,7 +160,14 @@ struct ext4_locality_group {
spinlock_t lg_prealloc_lock;
};
+struct ext4_tree_extent {
+ ext4_group_t te_flex; /* flex_bg index (tree index) */
+ ext4_grpblk_t te_flex_start; /* block offset w.r.t flex bg */
+ ext4_grpblk_t te_len; /* length */
+};
+
struct ext4_allocation_context {
+ __u32 ac_id;
struct inode *ac_inode;
struct super_block *ac_sb;
@@ -158,8 +180,16 @@ struct ext4_allocation_context {
/* the best found extent */
struct ext4_free_extent ac_b_ex;
- /* copy of the best found extent taken before preallocation efforts */
- struct ext4_free_extent ac_f_ex;
+ /* With freespace trees, we don't use preallocation anymore. */
+ union {
+ /*
+ * copy of the best found extent taken before
+ * preallocation efforts
+ */
+ struct ext4_free_extent ac_f_ex;
+ /* the best found tree extent */
+ struct ext4_tree_extent ac_b_tree_ex;
+ };
__u16 ac_groups_scanned;
__u16 ac_found;
@@ -181,14 +211,31 @@ struct ext4_allocation_context {
#define AC_STATUS_FOUND 2
#define AC_STATUS_BREAK 3
+/*
+ * Freespace tree flags
+ */
+#define EXT4_ALLOCATOR_FRSP_NOLOAD 0x0001 /*
+ * Don't load the freespace
+ * tree if it is not
+ * already in memory.
+ */
+
struct ext4_buddy {
- struct page *bd_buddy_page;
- void *bd_buddy;
- struct page *bd_bitmap_page;
- void *bd_bitmap;
+ union {
+ struct {
+ struct page *bd_buddy_page;
+ void *bd_buddy;
+ struct page *bd_bitmap_page;
+ void *bd_bitmap;
+ __u16 bd_blkbits;
+ };
+ struct {
+ struct ext4_frsp_tree *frsp_tree;
+ __u32 frsp_flags;
+ };
+ };
struct ext4_group_info *bd_info;
struct super_block *bd_sb;
- __u16 bd_blkbits;
ext4_group_t bd_group;
};
diff --git a/fs/ext4/resize.c b/fs/ext4/resize.c
index a50b51270ea9..6a0e1fc18e95 100644
--- a/fs/ext4/resize.c
+++ b/fs/ext4/resize.c
@@ -1679,6 +1679,10 @@ int ext4_group_add(struct super_block *sb, struct ext4_new_group_data *input)
if (err)
goto out;
+ err = ext4_mb_add_frsp_trees(sb, input->group + 1);
+ if (err)
+ goto out;
+
flex_gd.count = 1;
flex_gd.groups = input;
flex_gd.bg_flags = &bg_flags;
@@ -2051,6 +2055,10 @@ int ext4_resize_fs(struct super_block *sb, ext4_fsblk_t n_blocks_count)
if (err)
goto out;
+ err = ext4_mb_add_frsp_trees(sb, n_group + 1);
+ if (err)
+ goto out;
+
flex_gd = alloc_flex_gd(flexbg_size);
if (flex_gd == NULL) {
err = -ENOMEM;
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 656eccf6fc9b..55ff4e8be976 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1526,7 +1526,7 @@ enum {
Opt_dioread_nolock, Opt_dioread_lock,
Opt_discard, Opt_nodiscard, Opt_init_itable, Opt_noinit_itable,
Opt_max_dir_size_kb, Opt_nojournal_checksum, Opt_nombcache,
- Opt_prefetch_block_bitmaps,
+ Opt_prefetch_block_bitmaps, Opt_freespace_tree,
};
static const match_table_t tokens = {
@@ -1613,6 +1613,7 @@ static const match_table_t tokens = {
{Opt_init_itable, "init_itable=%u"},
{Opt_init_itable, "init_itable"},
{Opt_noinit_itable, "noinit_itable"},
+ {Opt_freespace_tree, "freespace_tree"},
{Opt_max_dir_size_kb, "max_dir_size_kb=%u"},
{Opt_test_dummy_encryption, "test_dummy_encryption=%s"},
{Opt_test_dummy_encryption, "test_dummy_encryption"},
@@ -1839,6 +1840,8 @@ static const struct mount_opts {
{Opt_nombcache, EXT4_MOUNT_NO_MBCACHE, MOPT_SET},
{Opt_prefetch_block_bitmaps, EXT4_MOUNT_PREFETCH_BLOCK_BITMAPS,
MOPT_SET},
+ {Opt_freespace_tree, EXT4_MOUNT2_FREESPACE_TREE,
+ MOPT_SET | MOPT_2 | MOPT_EXT4_ONLY},
{Opt_err, 0, 0}
};
@@ -4745,12 +4748,19 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
}
}
+ if (ext4_has_feature_flex_bg(sb))
+ if (!ext4_fill_flex_info(sb)) {
+ ext4_msg(sb, KERN_ERR,
+ "unable to initialize flex_bg meta info!");
+ goto failed_mount5;
+ }
+
ext4_ext_init(sb);
err = ext4_mb_init(sb);
if (err) {
ext4_msg(sb, KERN_ERR, "failed to initialize mballoc (%d)",
err);
- goto failed_mount5;
+ goto failed_mount6;
}
block = ext4_count_free_clusters(sb);
@@ -4780,14 +4790,6 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
goto failed_mount6;
}
- if (ext4_has_feature_flex_bg(sb))
- if (!ext4_fill_flex_info(sb)) {
- ext4_msg(sb, KERN_ERR,
- "unable to initialize "
- "flex_bg meta info!");
- goto failed_mount6;
- }
-
err = ext4_register_li_request(sb, first_not_zeroed);
if (err)
goto failed_mount6;
@@ -4872,7 +4874,14 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
ext4_unregister_li_request(sb);
failed_mount6:
ext4_mb_release(sb);
+ percpu_counter_destroy(&sbi->s_freeclusters_counter);
+ percpu_counter_destroy(&sbi->s_freeinodes_counter);
+ percpu_counter_destroy(&sbi->s_dirs_counter);
+ percpu_counter_destroy(&sbi->s_dirtyclusters_counter);
+ percpu_free_rwsem(&sbi->s_writepages_rwsem);
rcu_read_lock();
+
+failed_mount5:
flex_groups = rcu_dereference(sbi->s_flex_groups);
if (flex_groups) {
for (i = 0; i < sbi->s_flex_groups_allocated; i++)
@@ -4880,12 +4889,6 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
kvfree(flex_groups);
}
rcu_read_unlock();
- percpu_counter_destroy(&sbi->s_freeclusters_counter);
- percpu_counter_destroy(&sbi->s_freeinodes_counter);
- percpu_counter_destroy(&sbi->s_dirs_counter);
- percpu_counter_destroy(&sbi->s_dirtyclusters_counter);
- percpu_free_rwsem(&sbi->s_writepages_rwsem);
-failed_mount5:
ext4_ext_release(sb);
ext4_release_system_zone(sb);
failed_mount4a:
--
2.28.0.297.g1956fa8f8d-goog
^ permalink raw reply related [flat|nested] 13+ messages in thread
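For reference, the freespace-tree branch of `mb_mark_used()` in the patch above distinguishes three cases when an allocation is carved out of a free extent: a middle allocation splits the node in two, a front or tail allocation trims it in place, and an exact-length allocation consumes the node entirely. A minimal userspace sketch of that case analysis, using a plain struct instead of the rb-tree nodes (`frsp_carve()` and its types are illustrative stand-ins, not kernel code):

```c
#include <assert.h>

/* Simplified stand-in for struct ext4_frsp_node (units are clusters). */
struct frsp_ext {
	unsigned int offset;
	unsigned int len;
};

/*
 * Carve the range [start, start + len) out of the free extent *ex.
 * Returns:
 *   0 - extent fully consumed (the caller would free the node),
 *   1 - extent trimmed in place (front or tail allocation),
 *   2 - extent split; *tail receives the new right-hand piece.
 * This mirrors the three branches of the freespace-tree path in
 * mb_mark_used() above.
 */
static int frsp_carve(struct frsp_ext *ex, unsigned int start,
		      unsigned int len, struct frsp_ext *tail)
{
	/* caller must request a range inside the extent */
	assert(start >= ex->offset && start + len <= ex->offset + ex->len);

	if (start > ex->offset && start + len < ex->offset + ex->len) {
		/* allocation from the middle: split into two extents */
		tail->offset = start + len;
		tail->len = ex->offset + ex->len - tail->offset;
		ex->len = start - ex->offset;
		return 2;
	}
	if (len < ex->len) {
		/* allocation from the front or the tail: trim in place */
		if (start == ex->offset)
			ex->offset += len;
		ex->len -= len;
		return 1;
	}
	/* exact match: the whole extent is consumed */
	return 0;
}
```

In the kernel version, the modified node is additionally removed from and re-inserted into both the offset-indexed and the length-indexed rb-trees so that both lookup orders stay consistent.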
* Re: [RFC PATCH v2 3/9] ext4: add freespace tree allocator
2020-08-21 1:55 ` [RFC PATCH v2 3/9] ext4: add freespace tree allocator Harshad Shirwadkar
@ 2020-08-21 4:06 ` kernel test robot
2020-08-21 7:04 ` kernel test robot
1 sibling, 0 replies; 13+ messages in thread
From: kernel test robot @ 2020-08-21 4:06 UTC (permalink / raw)
To: kbuild-all
[-- Attachment #1: Type: text/plain, Size: 1972 bytes --]
Hi Harshad,
[FYI, it's a private test report for your RFC patch.]
[auto build test ERROR on ext4/dev]
[also build test ERROR on next-20200820]
[cannot apply to tip/perf/core v5.9-rc1]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]
url: https://github.com/0day-ci/linux/commits/Harshad-Shirwadkar/ext4-add-free-space-extent-based-allocator/20200821-095647
base: https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git dev
config: nds32-defconfig (attached as .config)
compiler: nds32le-linux-gcc (GCC) 9.3.0
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross ARCH=nds32
If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>
All errors (new ones prefixed by >>):
nds32le-linux-ld: fs/ext4/mballoc.o: in function `ext4_mb_frsp_free_blocks':
>> mballoc.c:(.text+0x2b2e): undefined reference to `__umoddi3'
>> nds32le-linux-ld: mballoc.c:(.text+0x2b32): undefined reference to `__umoddi3'
nds32le-linux-ld: fs/ext4/mballoc.o: in function `ext4_mb_frsp_bb_to_tree':
mballoc.c:(.text+0x2c36): undefined reference to `__umoddi3'
nds32le-linux-ld: mballoc.c:(.text+0x2c3a): undefined reference to `__umoddi3'
nds32le-linux-ld: fs/ext4/mballoc.o: in function `mb_mark_used':
mballoc.c:(.text+0x54d4): undefined reference to `__umoddi3'
nds32le-linux-ld: fs/ext4/mballoc.o:mballoc.c:(.text+0x54d8): more undefined references to `__umoddi3' follow
---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all(a)lists.01.org
[-- Attachment #2: config.gz --]
[-- Type: application/gzip, Size: 10907 bytes --]
^ permalink raw reply [flat|nested] 13+ messages in thread
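The `__umoddi3` link failures reported above come from applying the C `%` operator to 64-bit quantities: 32-bit kernel builds do not link against libgcc's 64-bit division helpers, so such expressions must instead go through `do_div()` from `<asm/div64.h>`, which divides a 64-bit lvalue in place by a 32-bit divisor and returns the remainder. A userspace sketch of the pattern (the `do_div` emulation and `blkno_to_flex_offset()` below are illustrative stand-ins, not the kernel implementations):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Userspace emulation of the kernel's do_div() semantics: divide the
 * 64-bit lvalue @n in place by the 32-bit @base and evaluate to the
 * 32-bit remainder. (The real macro expands to arch-specific code that
 * avoids the libgcc __umoddi3/__udivdi3 helpers; here plain '%' is fine
 * because userspace links against libgcc.)
 */
#define do_div(n, base) ({				\
	uint32_t __rem = (uint32_t)((n) % (base));	\
	(n) /= (base);					\
	__rem;						\
})

/*
 * Hypothetical helper mirroring what ext4_blkno_to_flex_offset() would
 * need to do to build on 32-bit targets: compute a block's offset inside
 * its flex group without a 64-bit '%' on the block number.
 */
static uint32_t blkno_to_flex_offset(uint64_t blkno, uint32_t blocks_per_flex)
{
	uint64_t n = blkno;

	/* after do_div(), n holds the flex group number (unused here) */
	return do_div(n, blocks_per_flex);
}
```

The same conversion would apply to the other `%` uses the robot flags in `ext4_mb_frsp_free_blocks()`, `ext4_mb_frsp_bb_to_tree()` and `mb_mark_used()`.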
* Re: [RFC PATCH v2 3/9] ext4: add freespace tree allocator
2020-08-21 1:55 ` [RFC PATCH v2 3/9] ext4: add freespace tree allocator Harshad Shirwadkar
2020-08-21 4:06 ` kernel test robot
@ 2020-08-21 7:04 ` kernel test robot
1 sibling, 0 replies; 13+ messages in thread
From: kernel test robot @ 2020-08-21 7:04 UTC (permalink / raw)
To: kbuild-all
[-- Attachment #1: Type: text/plain, Size: 2304 bytes --]
Hi Harshad,
[FYI, it's a private test report for your RFC patch.]
[auto build test ERROR on ext4/dev]
[also build test ERROR on next-20200820]
[cannot apply to tip/perf/core tytso-fscrypt/master v5.9-rc1]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]
url: https://github.com/0day-ci/linux/commits/Harshad-Shirwadkar/ext4-add-free-space-extent-based-allocator/20200821-095647
base: https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git dev
config: i386-randconfig-s001-20200820 (attached as .config)
compiler: gcc-9 (Debian 9.3.0-15) 9.3.0
reproduce:
# apt-get install sparse
# sparse version: v0.6.2-191-g10164920-dirty
# save the attached .config to linux build tree
make W=1 C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' ARCH=i386
If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>
All errors (new ones prefixed by >>):
ld: fs/ext4/mballoc.o: in function `ext4_blkno_to_flex_offset':
>> fs/ext4/mballoc.c:407: undefined reference to `__umoddi3'
>> ld: fs/ext4/mballoc.c:407: undefined reference to `__umoddi3'
>> ld: fs/ext4/mballoc.c:407: undefined reference to `__umoddi3'
>> ld: fs/ext4/mballoc.c:407: undefined reference to `__umoddi3'
>> ld: fs/ext4/mballoc.c:407: undefined reference to `__umoddi3'
ld: fs/ext4/mballoc.o:fs/ext4/mballoc.c:407: more undefined references to `__umoddi3' follow
# https://github.com/0day-ci/linux/commit/db488bc7c0ef14f7ea22209b8faaddaffcfabbd5
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review Harshad-Shirwadkar/ext4-add-free-space-extent-based-allocator/20200821-095647
git checkout db488bc7c0ef14f7ea22209b8faaddaffcfabbd5
vim +407 fs/ext4/mballoc.c
403
404 static inline int ext4_blkno_to_flex_offset(struct super_block *sb,
405 ext4_fsblk_t blkno)
406 {
> 407 return blkno % (ext4_flex_bg_size(EXT4_SB(sb)) *
408 EXT4_SB(sb)->s_blocks_per_group);
409 }
410
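[Editorial note: the undefined references to `__umoddi3` arise because `ext4_fsblk_t` is 64 bits wide, and on 32-bit targets such as nds32 and i386 GCC lowers a 64-bit `%` to a libgcc helper the kernel does not link. The usual kernel fix is `div_u64_rem()` or `do_div()` from `<linux/math64.h>`. The userspace sketch below illustrates the idea; `div_u64_rem_sketch` and `blkno_to_flex_offset_sketch` are hypothetical stand-ins, not the kernel helpers themselves.]

```c
#include <stdint.h>

/* Userspace stand-in for the kernel's div_u64_rem(): returns the
 * quotient and stores the remainder, so the caller never writes a
 * 64-bit '%' (the real kernel helper avoids __umoddi3 on 32-bit
 * targets by doing the division itself). */
static uint64_t div_u64_rem_sketch(uint64_t dividend, uint32_t divisor,
				   uint32_t *remainder)
{
	uint64_t quotient = dividend / divisor;

	*remainder = (uint32_t)(dividend - quotient * divisor);
	return quotient;
}

/* Hypothetical reworking of ext4_blkno_to_flex_offset(): compute
 * blkno modulo the flex-group size without a 64-bit modulo operator
 * at the call site. */
static uint32_t blkno_to_flex_offset_sketch(uint64_t blkno,
					    uint32_t flex_size)
{
	uint32_t off;

	div_u64_rem_sketch(blkno, flex_size, &off);
	return off;
}
```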
---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
[-- Attachment #2: config.gz --]
[-- Type: application/gzip, Size: 31788 bytes --]
^ permalink raw reply [flat|nested] 13+ messages in thread
* [RFC PATCH v2 4/9] ext4: add prefetching support for freespace trees
2020-08-21 1:55 [RFC PATCH v2 0/9] ext4: add free-space extent based allocator Harshad Shirwadkar
` (2 preceding siblings ...)
2020-08-21 1:55 ` [RFC PATCH v2 3/9] ext4: add freespace tree allocator Harshad Shirwadkar
@ 2020-08-21 1:55 ` Harshad Shirwadkar
2020-08-21 1:55 ` [RFC PATCH v2 5/9] ext4: add freespace tree optimizations Harshad Shirwadkar
` (4 subsequent siblings)
8 siblings, 0 replies; 13+ messages in thread
From: Harshad Shirwadkar @ 2020-08-21 1:55 UTC (permalink / raw)
To: linux-ext4; +Cc: tytso, lyx1209, Harshad Shirwadkar
From: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
This patch adds support for creating freespace trees from prefetched
block bitmaps.
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
---
fs/ext4/mballoc.c | 12 +++++++++++-
1 file changed, 11 insertions(+), 1 deletion(-)
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 4b1c405543f0..168e9708257f 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -3137,6 +3137,9 @@ ext4_group_t ext4_mb_prefetch(struct super_block *sb, ext4_group_t group,
void ext4_mb_prefetch_fini(struct super_block *sb, ext4_group_t group,
unsigned int nr)
{
+ struct ext4_buddy e4b;
+ int ret;
+
while (nr-- > 0) {
struct ext4_group_desc *gdp = ext4_get_group_desc(sb, group,
NULL);
@@ -3151,8 +3154,15 @@ void ext4_mb_prefetch_fini(struct super_block *sb, ext4_group_t group,
ext4_free_group_clusters(sb, gdp) > 0 &&
!(ext4_has_group_desc_csum(sb) &&
(gdp->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)))) {
- if (ext4_mb_init_group(sb, group, GFP_NOFS))
+ if (ext4_mb_frsp_on(sb)) {
+ ret = ext4_mb_load_allocator(sb, group, &e4b,
+ 0);
+ if (ret)
+ break;
+ ext4_mb_unload_allocator(&e4b);
+ } else if (ext4_mb_init_group(sb, group, GFP_NOFS)) {
break;
+ }
}
}
}
--
2.28.0.297.g1956fa8f8d-goog
^ permalink raw reply related [flat|nested] 13+ messages in thread
* [RFC PATCH v2 5/9] ext4: add freespace tree optimizations
2020-08-21 1:55 [RFC PATCH v2 0/9] ext4: add free-space extent based allocator Harshad Shirwadkar
` (3 preceding siblings ...)
2020-08-21 1:55 ` [RFC PATCH v2 4/9] ext4: add prefetching support for freespace trees Harshad Shirwadkar
@ 2020-08-21 1:55 ` Harshad Shirwadkar
2020-08-21 10:48 ` kernel test robot
2020-08-21 1:55 ` [RFC PATCH v2 6/9] ext4: add memory usage tracker for freespace trees Harshad Shirwadkar
` (3 subsequent siblings)
8 siblings, 1 reply; 13+ messages in thread
From: Harshad Shirwadkar @ 2020-08-21 1:55 UTC (permalink / raw)
To: linux-ext4; +Cc: tytso, lyx1209, Harshad Shirwadkar
From: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
This patch adds an optimization on top of our freespace tree
allocator. We add a new meta-tree that contains tree nodes sorted by
the length of their largest free extent. We use this tree to optimize
an allocation request so that it either gets an immediate hit or
falls back to the next CR level.
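[Editorial note: the meta-tree lookup described above amounts to an ordered search keyed on each tree's largest free extent. The sketch below uses a sorted array in place of the patch's rbtree; `struct tree_summary` and `find_tree_for_goal` are illustrative names, not part of the patch.]

```c
#include <stddef.h>

/* Hypothetical summary of one freespace tree: its index and the
 * length of its largest free extent (the key the meta-tree sorts on). */
struct tree_summary {
	int index;
	unsigned int max_free_len;
};

/* Sketch of the meta-tree lookup: given trees[] sorted ascending by
 * max_free_len, find a tree whose largest extent can satisfy
 * goal_len. Returns the tree index, or -1 if no tree can (the
 * allocator would then cache more trees or fall back to the next CR
 * level). */
static int find_tree_for_goal(const struct tree_summary *trees, size_t n,
			      unsigned int goal_len)
{
	size_t lo = 0, hi = n;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (trees[mid].max_free_len >= goal_len)
			hi = mid;
		else
			lo = mid + 1;
	}
	return lo < n ? trees[lo].index : -1;
}
```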
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
---
fs/ext4/ext4.h | 17 +++++
fs/ext4/mballoc.c | 178 ++++++++++++++++++++++++++++++++++++++++++++++
fs/ext4/mballoc.h | 1 +
fs/ext4/super.c | 9 +++
4 files changed, 205 insertions(+)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 513c8473f45f..15e6ce9f1afa 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -153,6 +153,8 @@ enum SHIFT_DIRECTION {
#define EXT4_MB_USE_RESERVED 0x2000
/* Do strict check for free blocks while retrying block allocation */
#define EXT4_MB_STRICT_CHECK 0x4000
+/* Disable freespace optimizations on ac */
+#define EXT4_MB_FRSP_NO_OPTIMIZE 0x8000
struct ext4_allocation_request {
/* target inode for block we're allocating */
@@ -1444,12 +1446,22 @@ struct ext4_frsp_tree {
__u32 frsp_index; /* Tree index (flex bg
* number)
*/
+ struct list_head frsp_list_node;
+ struct rb_node frsp_len_node;
};
/* Freespace tree flags */
/* Tree is loaded in memory */
#define EXT4_MB_FRSP_FLAG_LOADED 0x0001
+/* Tree is inserted into meta tree */
+#define EXT4_MB_FRSP_FLAG_CACHED 0x2
+
+/* Freespace tree cache aggression levels */
+#define EXT4_MB_FRSP_MIN_CACHE_AGGRESSION 0x0
+#define EXT4_MB_FRSP_DEFAULT_CACHE_AGGRESSION 0x1
+#define EXT4_MB_FRSP_MAX_CACHE_AGGRESSION 0x3
+
/*
* fourth extended-fs super-block data in memory
*/
@@ -1590,6 +1602,11 @@ struct ext4_sb_info {
/* freespace trees stuff */
int s_mb_num_frsp_trees;
+ struct rb_root_cached s_mb_frsp_meta_tree;
+ rwlock_t s_mb_frsp_lock;
+ atomic_t s_mb_num_frsp_trees_cached;
+ struct list_head s_mb_uncached_trees;
+ u32 s_mb_frsp_cache_aggression;
/* workqueue for reserved extent conversions (buffered io) */
struct workqueue_struct *rsv_conversion_wq;
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 168e9708257f..1da63afdbb3d 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -779,6 +779,13 @@ static inline int ext4_mb_frsp_len_cmp(struct ext4_frsp_node *arg1,
return (arg1->frsp_len > arg2->frsp_len);
}
+/* Compare function for meta tree */
+static inline int ext4_mb_frsp_meta_cmp(struct ext4_frsp_tree *arg1,
+ struct ext4_frsp_tree *arg2)
+{
+ return (arg1->frsp_max_free_len > arg2->frsp_max_free_len);
+}
+
/* insert to offset-indexed tree */
static void ext4_mb_frsp_insert_off(struct ext4_frsp_tree *tree,
struct ext4_frsp_node *new_entry)
@@ -798,6 +805,35 @@ static void ext4_mb_frsp_insert_len(struct ext4_frsp_tree *tree,
ext4_mb_frsp_len_cmp);
}
+void ext4_mb_frsp_meta_reinsert(struct super_block *sb,
+ struct ext4_frsp_tree *tree)
+{
+ struct ext4_sb_info *sbi = EXT4_SB(sb);
+ struct ext4_frsp_node *node;
+ struct rb_node *first = rb_first_cached(&tree->frsp_len_root);
+ struct rb_root_cached *meta_root = &EXT4_SB(sb)->s_mb_frsp_meta_tree;
+ int expected_len = 0;
+
+ if (!(tree->frsp_flags & EXT4_MB_FRSP_FLAG_LOADED))
+ return;
+
+ if (first) {
+ node = rb_entry(first, struct ext4_frsp_node, frsp_len_node);
+ expected_len = node->frsp_len;
+ }
+
+ if (tree->frsp_max_free_len == expected_len)
+ return;
+
+ write_lock(&sbi->s_mb_frsp_lock);
+ tree->frsp_max_free_len = expected_len;
+ rb_erase_cached(&tree->frsp_len_node, &sbi->s_mb_frsp_meta_tree);
+ RB_CLEAR_NODE(&tree->frsp_len_node);
+ ext4_mb_rb_insert(meta_root, tree, struct ext4_frsp_tree, frsp_len_node,
+ ext4_mb_frsp_meta_cmp);
+ write_unlock(&sbi->s_mb_frsp_lock);
+}
+
#ifdef CONFIG_EXT4_DEBUG
/* print freespace_tree in pre-order traversal */
void ext4_mb_frsp_print_tree_len(struct super_block *sb,
@@ -966,6 +1002,7 @@ int ext4_mb_frsp_add_region(struct super_block *sb, struct ext4_frsp_tree *tree,
}
}
ext4_mb_frsp_insert_len(tree, new_entry);
+ ext4_mb_frsp_meta_reinsert(sb, tree);
return 0;
}
@@ -1063,6 +1100,9 @@ int ext4_mb_frsp_load(struct ext4_buddy *e4b)
if (ret)
goto out;
}
+ if (!(e4b->frsp_tree->frsp_flags & EXT4_MB_FRSP_FLAG_CACHED))
+ atomic_inc(&sbi->s_mb_num_frsp_trees_cached);
+ e4b->frsp_tree->frsp_flags |= EXT4_MB_FRSP_FLAG_CACHED;
e4b->frsp_tree->frsp_flags |= EXT4_MB_FRSP_FLAG_LOADED;
out:
for (i = 0; i < ngroups; i++) {
@@ -1156,6 +1196,7 @@ static void ext4_mb_frsp_use_best_found(struct ext4_allocation_context *ac,
ext4_mb_load_allocator(ac->ac_sb, group_no, &e4b,
EXT4_ALLOCATOR_FRSP_NOLOAD);
mb_mark_used(&e4b, bex);
+ ext4_mb_frsp_meta_reinsert(ac->ac_sb, e4b.frsp_tree);
ext4_mb_unload_allocator(&e4b);
}
/*
@@ -1286,6 +1327,124 @@ void ext4_mb_frsp_find_by_goal(struct ext4_allocation_context *ac)
ext4_mb_unload_allocator(&e4b);
}
+/*
+ * Determine if caching of trees is necessary. This function returns 1 if it is,
+ * 0 if it is not.
+ */
+static int ext4_mb_frsp_should_cache(struct ext4_allocation_context *ac)
+{
+ struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
+
+ if (list_empty(&sbi->s_mb_uncached_trees))
+ return 0;
+
+ if (sbi->s_mb_frsp_cache_aggression >=
+ EXT4_MB_FRSP_MAX_CACHE_AGGRESSION)
+ return 1;
+
+ if (sbi->s_mb_frsp_cache_aggression ==
+ EXT4_MB_FRSP_MIN_CACHE_AGGRESSION)
+ return 0;
+
+ /* At cache aggression level 1, skip caching at CR 0 */
+ if (sbi->s_mb_frsp_cache_aggression == 1 && ac->ac_criteria == 0)
+ return 0;
+
+ /*
+ * At cache aggression level 2, perform caching for every alternate
+ * optimization.
+ */
+ return (ac->ac_num_optimizations & 0x1);
+}
+
+/*
+ * Optimize allocation request. This function _tries_ to lookup the meta-tree
+ * and if it can optimize the search in any way, it does so. As a result
+ * this function returns 1, if the optimization was performed. In this case,
+ * the caller should restart the search from tree mentioned in *tree_idx.
+ */
+int ext4_mb_frsp_optimize(struct ext4_allocation_context *ac, int *tree_idx)
+{
+ struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
+ struct ext4_frsp_tree *cur = NULL;
+ struct rb_node *node = NULL;
+ int found = 0, best = 0, cache_more_trees = 0, better_len = 0, ret = 0;
+
+ if (ac->ac_flags & EXT4_MB_HINT_FIRST ||
+ ac->ac_flags & EXT4_MB_FRSP_NO_OPTIMIZE ||
+ ac->ac_status != AC_STATUS_CONTINUE)
+ return 0;
+
+ ac->ac_num_optimizations++;
+ if (!read_trylock(&sbi->s_mb_frsp_lock))
+ return 0;
+
+ node = sbi->s_mb_frsp_meta_tree.rb_root.rb_node;
+ while (node) {
+ cur = rb_entry(node, struct ext4_frsp_tree, frsp_len_node);
+ if (ext4_mb_frsp_is_better(ac, cur->frsp_max_free_len, &best)) {
+ /*
+ * This tree definitely has a better node than the best
+ * found so far.
+ */
+ found = 1;
+ ret = 1;
+ *tree_idx = cur->frsp_index;
+ better_len = cur->frsp_max_free_len;
+ if (best)
+ break;
+ }
+ if (cur->frsp_max_free_len > ac->ac_g_ex.fe_len)
+ node = node->rb_right;
+ else
+ node = node->rb_left;
+ }
+
+ if (ac->ac_found == 0 && !found) {
+ /*
+ * If we haven't found a good match above, and we hadn't found
+ * any match before us, that means we need to loosen our
+ * criteria. Note that, if we had found something earlier,
+ * not finding a better node doesn't imply that there is no
+ * better node available.
+ * TODO - in this case determine probabilistically which tree
+ * may have a better node and direct our allocator there.
+ */
+ if (ext4_mb_frsp_should_cache(ac)) {
+ cache_more_trees = 1;
+ } else if (ac->ac_criteria < 2) {
+ ac->ac_criteria++;
+ ac->ac_groups_scanned = 0;
+ *tree_idx = 0;
+ ret = 1;
+ } else {
+ ac->ac_flags |= EXT4_MB_HINT_FIRST;
+ }
+ } else if (!best && !list_empty(&sbi->s_mb_uncached_trees)) {
+ cache_more_trees = ext4_mb_frsp_should_cache(ac);
+ }
+
+ if (cache_more_trees) {
+ cur = list_first_entry(&sbi->s_mb_uncached_trees,
+ struct ext4_frsp_tree, frsp_list_node);
+ list_del_init(&cur->frsp_list_node);
+ *tree_idx = cur->frsp_index;
+ ret = 1;
+ }
+ read_unlock(&sbi->s_mb_frsp_lock);
+
+ /*
+ * If we couldn't optimize now, it's unlikely that we'll be able to
+ * optimize this request anymore
+ */
+ if (!ret)
+ ac->ac_flags |= EXT4_MB_FRSP_NO_OPTIMIZE;
+ mb_debug(ac->ac_sb,
+ "Optimizer suggestion: found = %d, tree = %d, len = %d, cr = %d\n",
+ found, *tree_idx, better_len, ac->ac_criteria);
+ return ret;
+}
+
static void ext4_mb_frsp_process(struct ext4_allocation_context *ac,
struct ext4_frsp_tree *tree)
{
@@ -1324,6 +1483,7 @@ ext4_mb_tree_allocator(struct ext4_allocation_context *ac)
struct ext4_frsp_node *cur = NULL;
struct ext4_tree_extent *btx = NULL;
int ret = 0, start_idx = 0, tree_idx, j;
+ int optimize;
sb = ac->ac_sb;
btx = &ac->ac_b_tree_ex;
@@ -1341,6 +1501,8 @@ ext4_mb_tree_allocator(struct ext4_allocation_context *ac)
ac->ac_criteria = 0;
ac->ac_groups_scanned = 0;
+ ext4_mb_frsp_optimize(ac, &start_idx);
+
repeat:
/* Loop through the rest of trees (flex_bg) */
@@ -1357,13 +1519,17 @@ ext4_mb_tree_allocator(struct ext4_allocation_context *ac)
mutex_lock(&e4b.frsp_tree->frsp_lock);
ext4_mb_frsp_process(ac, e4b.frsp_tree);
mutex_unlock(&e4b.frsp_tree->frsp_lock);
+ optimize = ext4_mb_frsp_optimize(ac, &start_idx);
ext4_mb_unload_allocator(&e4b);
+ if (optimize)
+ goto repeat;
}
if (ac->ac_status != AC_STATUS_FOUND) {
if (ac->ac_criteria < 2) {
ac->ac_criteria++;
ac->ac_groups_scanned = 0;
+ ac->ac_flags &= ~EXT4_MB_FRSP_NO_OPTIMIZE;
mb_debug(sb, "Falling back to CR=%d", ac->ac_criteria);
goto repeat;
}
@@ -1415,6 +1581,7 @@ static void ext4_mb_frsp_init_tree(struct ext4_frsp_tree *tree, int index)
mutex_init(&(tree->frsp_lock));
tree->frsp_flags = 0;
tree->frsp_index = index;
+ INIT_LIST_HEAD(&tree->frsp_list_node);
}
int ext4_mb_init_freespace_trees(struct super_block *sb)
@@ -1425,6 +1592,8 @@ int ext4_mb_init_freespace_trees(struct super_block *sb)
sbi->s_mb_num_frsp_trees =
ext4_num_grps_to_flexbg(sb, ext4_get_groups_count(sb));
+ sbi->s_mb_frsp_meta_tree = RB_ROOT_CACHED;
+ INIT_LIST_HEAD(&sbi->s_mb_uncached_trees);
for (i = 0; i < sbi->s_mb_num_frsp_trees; i++) {
fg = sbi_array_rcu_deref(sbi, s_flex_groups, i);
@@ -1433,7 +1602,11 @@ int ext4_mb_init_freespace_trees(struct super_block *sb)
if (!fg->frsp_tree)
return -ENOMEM;
ext4_mb_frsp_init_tree(fg->frsp_tree, i);
+ list_add(&fg->frsp_tree->frsp_list_node,
+ &sbi->s_mb_uncached_trees);
}
+ rwlock_init(&sbi->s_mb_frsp_lock);
+ atomic_set(&sbi->s_mb_num_frsp_trees_cached, 0);
return 0;
}
@@ -1460,6 +1633,11 @@ int ext4_mb_add_frsp_trees(struct super_block *sb, int ngroups)
if (!fg->frsp_tree)
return -ENOMEM;
ext4_mb_frsp_init_tree(fg->frsp_tree, i);
+ write_lock(&sbi->s_mb_frsp_lock);
+ list_add(&fg->frsp_tree->frsp_list_node,
+ &sbi->s_mb_uncached_trees);
+ write_unlock(&sbi->s_mb_frsp_lock);
+ ext4_mb_frsp_meta_reinsert(sb, fg->frsp_tree);
}
sbi->s_mb_num_frsp_trees = flex_bg_count;
diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h
index af61651258a3..1fcdd3e6f7d5 100644
--- a/fs/ext4/mballoc.h
+++ b/fs/ext4/mballoc.h
@@ -200,6 +200,7 @@ struct ext4_allocation_context {
__u8 ac_criteria;
__u8 ac_2order; /* if request is to allocate 2^N blocks and
* N > 0, the field stores N, otherwise 0 */
+ __u8 ac_num_optimizations;
__u8 ac_op; /* operation, for history only */
struct page *ac_bitmap_page;
struct page *ac_buddy_page;
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 55ff4e8be976..6a2d43d81b14 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1527,6 +1527,8 @@ enum {
Opt_discard, Opt_nodiscard, Opt_init_itable, Opt_noinit_itable,
Opt_max_dir_size_kb, Opt_nojournal_checksum, Opt_nombcache,
Opt_prefetch_block_bitmaps, Opt_freespace_tree,
+ Opt_frsp_cache_aggression,
+
};
static const match_table_t tokens = {
@@ -1620,6 +1622,7 @@ static const match_table_t tokens = {
{Opt_nombcache, "nombcache"},
{Opt_nombcache, "no_mbcache"}, /* for backward compatibility */
{Opt_prefetch_block_bitmaps, "prefetch_block_bitmaps"},
+ {Opt_frsp_cache_aggression, "frsp_cache_aggression=%u"},
{Opt_removed, "check=none"}, /* mount option from ext2/3 */
{Opt_removed, "nocheck"}, /* mount option from ext2/3 */
{Opt_removed, "reservation"}, /* mount option from ext2/3 */
@@ -1836,6 +1839,7 @@ static const struct mount_opts {
{Opt_jqfmt_vfsv0, QFMT_VFS_V0, MOPT_QFMT},
{Opt_jqfmt_vfsv1, QFMT_VFS_V1, MOPT_QFMT},
{Opt_max_dir_size_kb, 0, MOPT_GTE0},
+ {Opt_frsp_cache_aggression, 0, MOPT_GTE0},
{Opt_test_dummy_encryption, 0, MOPT_STRING},
{Opt_nombcache, EXT4_MOUNT_NO_MBCACHE, MOPT_SET},
{Opt_prefetch_block_bitmaps, EXT4_MOUNT_PREFETCH_BLOCK_BITMAPS,
@@ -2043,6 +2047,10 @@ static int handle_mount_opt(struct super_block *sb, char *opt, int token,
sbi->s_li_wait_mult = arg;
} else if (token == Opt_max_dir_size_kb) {
sbi->s_max_dir_size_kb = arg;
+ } else if (token == Opt_frsp_cache_aggression) {
+ sbi->s_mb_frsp_cache_aggression =
+ min(EXT4_MB_FRSP_MAX_CACHE_AGGRESSION,
+ max(EXT4_MB_FRSP_MIN_CACHE_AGGRESSION, arg));
} else if (token == Opt_stripe) {
sbi->s_stripe = arg;
} else if (token == Opt_resuid) {
@@ -3962,6 +3970,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
sbi->s_commit_interval = JBD2_DEFAULT_MAX_COMMIT_AGE * HZ;
sbi->s_min_batch_time = EXT4_DEF_MIN_BATCH_TIME;
sbi->s_max_batch_time = EXT4_DEF_MAX_BATCH_TIME;
+ sbi->s_mb_frsp_cache_aggression = EXT4_MB_FRSP_DEFAULT_CACHE_AGGRESSION;
if ((def_mount_opts & EXT4_DEFM_NOBARRIER) == 0)
set_opt(sb, BARRIER);
--
2.28.0.297.g1956fa8f8d-goog
^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [RFC PATCH v2 5/9] ext4: add freespace tree optimizations
2020-08-21 1:55 ` [RFC PATCH v2 5/9] ext4: add freespace tree optimizations Harshad Shirwadkar
@ 2020-08-21 10:48 ` kernel test robot
0 siblings, 0 replies; 13+ messages in thread
From: kernel test robot @ 2020-08-21 10:48 UTC (permalink / raw)
To: kbuild-all
[-- Attachment #1: Type: text/plain, Size: 7452 bytes --]
Hi Harshad,
[FYI, it's a private test report for your RFC patch.]
[auto build test WARNING on ext4/dev]
[also build test WARNING on next-20200821]
[cannot apply to tip/perf/core v5.9-rc1]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]
url: https://github.com/0day-ci/linux/commits/Harshad-Shirwadkar/ext4-add-free-space-extent-based-allocator/20200821-095647
base: https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git dev
config: x86_64-randconfig-r036-20200820 (attached as .config)
compiler: clang version 12.0.0 (https://github.com/llvm/llvm-project b587ca93be114d07ec3bf654add97d7872325281)
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# install x86_64 cross compiling tool for clang build
# apt-get install binutils-x86-64-linux-gnu
# save the attached .config to linux build tree
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=x86_64
If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>
All warnings (new ones prefixed by >>):
>> fs/ext4/mballoc.c:808:6: warning: no previous prototype for function 'ext4_mb_frsp_meta_reinsert' [-Wmissing-prototypes]
void ext4_mb_frsp_meta_reinsert(struct super_block *sb,
^
fs/ext4/mballoc.c:808:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
void ext4_mb_frsp_meta_reinsert(struct super_block *sb,
^
static
fs/ext4/mballoc.c:890:6: warning: no previous prototype for function 'ext4_mb_frsp_free_tree' [-Wmissing-prototypes]
void ext4_mb_frsp_free_tree(struct super_block *sb, struct ext4_frsp_tree *tree)
^
fs/ext4/mballoc.c:890:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
void ext4_mb_frsp_free_tree(struct super_block *sb, struct ext4_frsp_tree *tree)
^
static
fs/ext4/mballoc.c:920:24: warning: no previous prototype for function 'ext4_mb_frsp_search_by_off' [-Wmissing-prototypes]
struct ext4_frsp_node *ext4_mb_frsp_search_by_off(struct super_block *sb,
^
fs/ext4/mballoc.c:920:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
struct ext4_frsp_node *ext4_mb_frsp_search_by_off(struct super_block *sb,
^
static
fs/ext4/mballoc.c:945:5: warning: no previous prototype for function 'ext4_mb_frsp_node_can_merge' [-Wmissing-prototypes]
int ext4_mb_frsp_node_can_merge(struct super_block *sb,
^
fs/ext4/mballoc.c:945:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
int ext4_mb_frsp_node_can_merge(struct super_block *sb,
^
static
fs/ext4/mballoc.c:962:5: warning: no previous prototype for function 'ext4_mb_frsp_add_region' [-Wmissing-prototypes]
int ext4_mb_frsp_add_region(struct super_block *sb, struct ext4_frsp_tree *tree,
^
fs/ext4/mballoc.c:962:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
int ext4_mb_frsp_add_region(struct super_block *sb, struct ext4_frsp_tree *tree,
^
static
fs/ext4/mballoc.c:1014:5: warning: no previous prototype for function 'ext4_mb_frsp_free_blocks' [-Wmissing-prototypes]
int ext4_mb_frsp_free_blocks(struct super_block *sb, ext4_group_t group,
^
fs/ext4/mballoc.c:1014:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
int ext4_mb_frsp_free_blocks(struct super_block *sb, ext4_group_t group,
^
static
fs/ext4/mballoc.c:1034:5: warning: no previous prototype for function 'ext4_mb_frsp_bb_to_tree' [-Wmissing-prototypes]
int ext4_mb_frsp_bb_to_tree(struct ext4_buddy *e4b, ext4_group_t group,
^
fs/ext4/mballoc.c:1034:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
int ext4_mb_frsp_bb_to_tree(struct ext4_buddy *e4b, ext4_group_t group,
^
static
fs/ext4/mballoc.c:1066:5: warning: no previous prototype for function 'ext4_mb_frsp_load' [-Wmissing-prototypes]
int ext4_mb_frsp_load(struct ext4_buddy *e4b)
^
fs/ext4/mballoc.c:1066:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
int ext4_mb_frsp_load(struct ext4_buddy *e4b)
^
static
>> fs/ext4/mballoc.c:1366:5: warning: no previous prototype for function 'ext4_mb_frsp_optimize' [-Wmissing-prototypes]
int ext4_mb_frsp_optimize(struct ext4_allocation_context *ac, int *tree_idx)
^
fs/ext4/mballoc.c:1366:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
int ext4_mb_frsp_optimize(struct ext4_allocation_context *ac, int *tree_idx)
^
static
fs/ext4/mballoc.c:1587:5: warning: no previous prototype for function 'ext4_mb_init_freespace_trees' [-Wmissing-prototypes]
int ext4_mb_init_freespace_trees(struct super_block *sb)
^
fs/ext4/mballoc.c:1587:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
int ext4_mb_init_freespace_trees(struct super_block *sb)
^
static
fs/ext4/mballoc.c:411:28: warning: unused function 'ext4_flex_offset_to_blkno' [-Wunused-function]
static inline ext4_fsblk_t ext4_flex_offset_to_blkno(struct super_block *sb,
^
11 warnings generated.
# https://github.com/0day-ci/linux/commit/832851db4656a8d433083fc076a53db955d43020
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review Harshad-Shirwadkar/ext4-add-free-space-extent-based-allocator/20200821-095647
git checkout 832851db4656a8d433083fc076a53db955d43020
vim +/ext4_mb_frsp_meta_reinsert +808 fs/ext4/mballoc.c
807
> 808 void ext4_mb_frsp_meta_reinsert(struct super_block *sb,
809 struct ext4_frsp_tree *tree)
810 {
811 struct ext4_sb_info *sbi = EXT4_SB(sb);
812 struct ext4_frsp_node *node;
813 struct rb_node *first = rb_first_cached(&tree->frsp_len_root);
814 struct rb_root_cached *meta_root = &EXT4_SB(sb)->s_mb_frsp_meta_tree;
815 int expected_len = 0;
816
817 if (!(tree->frsp_flags & EXT4_MB_FRSP_FLAG_LOADED))
818 return;
819
820 if (first) {
821 node = rb_entry(first, struct ext4_frsp_node, frsp_len_node);
822 expected_len = node->frsp_len;
823 }
824
825 if (tree->frsp_max_free_len == expected_len)
826 return;
827
828 write_lock(&sbi->s_mb_frsp_lock);
829 tree->frsp_max_free_len = expected_len;
830 rb_erase_cached(&tree->frsp_len_node, &sbi->s_mb_frsp_meta_tree);
831 RB_CLEAR_NODE(&tree->frsp_len_node);
832 ext4_mb_rb_insert(meta_root, tree, struct ext4_frsp_tree, frsp_len_node,
833 ext4_mb_frsp_meta_cmp);
834 write_unlock(&sbi->s_mb_frsp_lock);
835 }
836
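[Editorial note: clang's -Wmissing-prototypes fires for any function with external linkage that has no prior declaration. The fix is either internal linkage (`static`, as the robot suggests for ext4_mb_frsp_meta_reinsert() and friends) or a prototype in a header included by the .c file. A minimal illustration with hypothetical helper names:]

```c
/* Exported helper: a prior declaration (normally living in a shared
 * header) silences -Wmissing-prototypes while keeping external
 * linkage. */
int frsp_helper_decl(int x);

int frsp_helper_decl(int x)
{
	return x + 1;
}

/* Preferred when the function is used only in this translation unit:
 * internal linkage, so no prototype is expected. */
static int frsp_helper_static(int x)
{
	return x * 2;
}

/* Use the static helper so -Wunused-function does not fire in turn. */
int frsp_use_helpers(int x)
{
	return frsp_helper_decl(x) + frsp_helper_static(x);
}
```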
---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
[-- Attachment #2: config.gz --]
[-- Type: application/gzip, Size: 43415 bytes --]
^ permalink raw reply [flat|nested] 13+ messages in thread
* [RFC PATCH v2 6/9] ext4: add memory usage tracker for freespace trees
2020-08-21 1:55 [RFC PATCH v2 0/9] ext4: add free-space extent based allocator Harshad Shirwadkar
` (4 preceding siblings ...)
2020-08-21 1:55 ` [RFC PATCH v2 5/9] ext4: add freespace tree optimizations Harshad Shirwadkar
@ 2020-08-21 1:55 ` Harshad Shirwadkar
2020-08-21 1:55 ` [RFC PATCH v2 7/9] ext4: add LRU eviction for free space trees Harshad Shirwadkar
` (2 subsequent siblings)
8 siblings, 0 replies; 13+ messages in thread
From: Harshad Shirwadkar @ 2020-08-21 1:55 UTC (permalink / raw)
To: linux-ext4; +Cc: tytso, lyx1209, Harshad Shirwadkar
From: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Freespace trees can occupy a lot of memory as fragmentation
increases. This patch adds a sysfs file to monitor the memory usage
of the freespace tree allocator. It also adds a sysfs knob to control
the maximum memory that the allocator can use. If the allocator
exceeds this threshold, the file system enters the "FRSP_MEM_CRUNCH"
state. The next patch in the series performs LRU eviction when this
state is reached.
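[Editorial note: the threshold check in the patch converts the configured byte limit into a node count by dividing by the size of one in-memory extent node (the EXT4_FRSP_MEM_LIMIT_TO_NUM_NODES macro). The sketch below mirrors that logic; `struct frsp_node_sketch` and its field sizes are illustrative, not the real `struct ext4_frsp_node`.]

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative stand-in for struct ext4_frsp_node: one free extent
 * tracked in memory (two rbtree links plus start/length). */
struct frsp_node_sketch {
	void *off_links[3];
	void *len_links[3];
	unsigned long long start;
	unsigned int len;
};

/* Sketch of the FRSP_MEM_CRUNCH test: a byte limit of 0 disables the
 * check; otherwise the allocator is in a memory crunch once the node
 * count exceeds limit_bytes / sizeof(node). */
static bool frsp_mem_crunch(unsigned int num_fragments,
			    unsigned int limit_bytes)
{
	if (limit_bytes == 0)
		return false;
	return num_fragments > limit_bytes / sizeof(struct frsp_node_sketch);
}
```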
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
---
fs/ext4/ext4.h | 8 ++++++++
fs/ext4/mballoc.c | 20 ++++++++++++++++++++
fs/ext4/mballoc.h | 4 ++++
fs/ext4/sysfs.c | 11 +++++++++++
4 files changed, 43 insertions(+)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 15e6ce9f1afa..93bf2fe35cf1 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1223,6 +1223,12 @@ struct ext4_inode_info {
* allocator off)
*/
+#define EXT4_MOUNT2_FRSP_MEM_CRUNCH 0x00000040 /*
+ * Freespace tree allocator
+ * is in a tight memory
+ * situation.
+ */
+
#define clear_opt(sb, opt) EXT4_SB(sb)->s_mount_opt &= \
~EXT4_MOUNT_##opt
#define set_opt(sb, opt) EXT4_SB(sb)->s_mount_opt |= \
@@ -1607,6 +1613,8 @@ struct ext4_sb_info {
atomic_t s_mb_num_frsp_trees_cached;
struct list_head s_mb_uncached_trees;
u32 s_mb_frsp_cache_aggression;
+ atomic_t s_mb_num_fragments;
+ u32 s_mb_frsp_mem_limit;
/* workqueue for reserved extent conversions (buffered io) */
struct workqueue_struct *rsv_conversion_wq;
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 1da63afdbb3d..b28b7fb0506e 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -869,6 +869,7 @@ void ext4_mb_frsp_print_tree_len(struct super_block *sb,
static struct ext4_frsp_node *ext4_mb_frsp_alloc_node(struct super_block *sb)
{
struct ext4_frsp_node *node;
+ struct ext4_sb_info *sbi = EXT4_SB(sb);
node = kmem_cache_alloc(ext4_freespace_node_cachep, GFP_NOFS);
if (!node)
@@ -877,13 +878,31 @@ static struct ext4_frsp_node *ext4_mb_frsp_alloc_node(struct super_block *sb)
RB_CLEAR_NODE(&node->frsp_node);
RB_CLEAR_NODE(&node->frsp_len_node);
+ atomic_inc(&sbi->s_mb_num_fragments);
+
+ if (sbi->s_mb_frsp_mem_limit &&
+ atomic_read(&sbi->s_mb_num_fragments) >
+ EXT4_FRSP_MEM_LIMIT_TO_NUM_NODES(sb))
+ set_opt2(sb, FRSP_MEM_CRUNCH);
+ else
+ clear_opt2(sb, FRSP_MEM_CRUNCH);
+
+
return node;
}
static void ext4_mb_frsp_free_node(struct super_block *sb,
struct ext4_frsp_node *node)
{
+ struct ext4_sb_info *sbi = EXT4_SB(sb);
+
kmem_cache_free(ext4_freespace_node_cachep, node);
+ atomic_dec(&sbi->s_mb_num_fragments);
+
+ if (!sbi->s_mb_frsp_mem_limit ||
+ atomic_read(&sbi->s_mb_num_fragments) <
+ EXT4_FRSP_MEM_LIMIT_TO_NUM_NODES(sb))
+ clear_opt2(sb, FRSP_MEM_CRUNCH);
}
/* Evict a tree from memory */
@@ -1607,6 +1626,7 @@ int ext4_mb_init_freespace_trees(struct super_block *sb)
}
rwlock_init(&sbi->s_mb_frsp_lock);
atomic_set(&sbi->s_mb_num_frsp_trees_cached, 0);
+ atomic_set(&sbi->s_mb_num_fragments, 0);
return 0;
}
diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h
index 1fcdd3e6f7d5..6cfb228e4da2 100644
--- a/fs/ext4/mballoc.h
+++ b/fs/ext4/mballoc.h
@@ -92,6 +92,10 @@ struct ext4_frsp_node {
struct rb_node frsp_node;
struct rb_node frsp_len_node;
};
+
+#define EXT4_FRSP_MEM_LIMIT_TO_NUM_NODES(__sb) \
+ ((sbi->s_mb_frsp_mem_limit / sizeof(struct ext4_frsp_node)))
+
struct ext4_free_data {
/* this links the free block information from sb_info */
struct list_head efd_list;
diff --git a/fs/ext4/sysfs.c b/fs/ext4/sysfs.c
index bfabb799fa45..19301b10944b 100644
--- a/fs/ext4/sysfs.c
+++ b/fs/ext4/sysfs.c
@@ -8,6 +8,7 @@
*
*/
+#include "mballoc.h"
#include <linux/time.h>
#include <linux/fs.h>
#include <linux/seq_file.h>
@@ -24,6 +25,7 @@ typedef enum {
attr_session_write_kbytes,
attr_lifetime_write_kbytes,
attr_reserved_clusters,
+ attr_frsp_tree_usage,
attr_inode_readahead,
attr_trigger_test_error,
attr_first_error_time,
@@ -208,6 +210,7 @@ EXT4_ATTR_FUNC(delayed_allocation_blocks, 0444);
EXT4_ATTR_FUNC(session_write_kbytes, 0444);
EXT4_ATTR_FUNC(lifetime_write_kbytes, 0444);
EXT4_ATTR_FUNC(reserved_clusters, 0644);
+EXT4_ATTR_FUNC(frsp_tree_usage, 0444);
EXT4_ATTR_OFFSET(inode_readahead_blks, 0644, inode_readahead,
ext4_sb_info, s_inode_readahead_blks);
@@ -248,6 +251,7 @@ EXT4_ATTR(last_error_time, 0444, last_error_time);
EXT4_ATTR(journal_task, 0444, journal_task);
EXT4_RW_ATTR_SBI_UI(mb_prefetch, s_mb_prefetch);
EXT4_RW_ATTR_SBI_UI(mb_prefetch_limit, s_mb_prefetch_limit);
+EXT4_RW_ATTR_SBI_UI(mb_frsp_max_mem, s_mb_frsp_mem_limit);
static unsigned int old_bump_val = 128;
EXT4_ATTR_PTR(max_writeback_mb_bump, 0444, pointer_ui, &old_bump_val);
@@ -257,6 +261,7 @@ static struct attribute *ext4_attrs[] = {
ATTR_LIST(session_write_kbytes),
ATTR_LIST(lifetime_write_kbytes),
ATTR_LIST(reserved_clusters),
+ ATTR_LIST(frsp_tree_usage),
ATTR_LIST(inode_readahead_blks),
ATTR_LIST(inode_goal),
ATTR_LIST(mb_stats),
@@ -296,6 +301,7 @@ static struct attribute *ext4_attrs[] = {
#endif
ATTR_LIST(mb_prefetch),
ATTR_LIST(mb_prefetch_limit),
+ ATTR_LIST(mb_frsp_max_mem),
NULL,
};
ATTRIBUTE_GROUPS(ext4);
@@ -378,6 +384,11 @@ static ssize_t ext4_attr_show(struct kobject *kobj,
return snprintf(buf, PAGE_SIZE, "%llu\n",
(unsigned long long)
atomic64_read(&sbi->s_resv_clusters));
+ case attr_frsp_tree_usage:
+ return snprintf(buf, PAGE_SIZE, "%llu\n",
+ (unsigned long long)
+ atomic_read(&sbi->s_mb_num_fragments) *
+ sizeof(struct ext4_frsp_node));
case attr_inode_readahead:
case attr_pointer_ui:
if (!ptr)
--
2.28.0.297.g1956fa8f8d-goog
^ permalink raw reply related [flat|nested] 13+ messages in thread
* [RFC PATCH v2 7/9] ext4: add LRU eviction for free space trees
2020-08-21 1:55 [RFC PATCH v2 0/9] ext4: add free-space extent based allocator Harshad Shirwadkar
` (5 preceding siblings ...)
2020-08-21 1:55 ` [RFC PATCH v2 6/9] ext4: add memory usage tracker for freespace trees Harshad Shirwadkar
@ 2020-08-21 1:55 ` Harshad Shirwadkar
2020-08-21 1:55 ` [RFC PATCH v2 8/9] ext4: add tracepoints " Harshad Shirwadkar
2020-08-21 1:55 ` [RFC PATCH v2 9/9] ext4: add freespace trees documentation in code Harshad Shirwadkar
8 siblings, 0 replies; 13+ messages in thread
From: Harshad Shirwadkar @ 2020-08-21 1:55 UTC (permalink / raw)
To: linux-ext4; +Cc: tytso, lyx1209, Harshad Shirwadkar
From: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
This patch adds an LRU eviction policy for freespace trees. To avoid
contention on a single LRU lock, the LRU scheme is implemented as two
lists - an active list and an inactive list. Trees on the active list
are not moved within the list, avoiding the need for a lock on every
access. A tree on the inactive list becomes active only if it is
accessed twice within a short interval, which keeps outliers from
entering the active list.
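The promotion rule described above can be sketched outside the kernel as follows. The struct, helper name, and 100-tick threshold here are illustrative stand-ins for the patch's types, not the kernel code itself:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative threshold: max ticks between two hits before a tree on the
 * inactive list is considered hot enough to promote (a stand-in for
 * EXT4_MB_FRSP_ACTIVE_THRESHOLD_JIFFIES). */
#define ACTIVE_THRESHOLD 100

struct tree {
	unsigned long last_access;	/* 0 means never accessed */
	bool active;
};

/* Record an access and return true if it promotes the tree to the active
 * list: promote on the first-ever access, when there is room in the active
 * list, or on a second hit within ACTIVE_THRESHOLD ticks of the last one. */
static bool access_tree(struct tree *t, unsigned long now, bool active_has_room)
{
	bool promote = t->last_access == 0 ||
		       active_has_room ||
		       now - t->last_access <= ACTIVE_THRESHOLD;
	bool was_active = t->active;

	t->last_access = now;
	if (promote)
		t->active = true;
	return promote && !was_active;
}
```

Once on the active list a tree is never repositioned, so hot-path accesses need no list manipulation at all; only promotions and demotions take the write lock.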
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
---
fs/ext4/ext4.h | 46 +++++++++++++
fs/ext4/mballoc.c | 164 ++++++++++++++++++++++++++++++++++++++++++----
2 files changed, 197 insertions(+), 13 deletions(-)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 93bf2fe35cf1..64d0dbbcd517 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1454,6 +1454,10 @@ struct ext4_frsp_tree {
*/
struct list_head frsp_list_node;
struct rb_node frsp_len_node;
+ atomic_t frsp_fragments;
+ struct list_head frsp_lru_active_node;
+ unsigned long frsp_last_access;
+ atomic_t frsp_ref;
};
/* Freespace tree flags */
@@ -1468,6 +1472,47 @@ struct ext4_frsp_tree {
#define EXT4_MB_FRSP_DEFAULT_CACHE_AGGRESSION 0x1
#define EXT4_MB_FRSP_MAX_CACHE_AGGRESSION 0x3
+/* LRU management for free space trees */
+struct ext4_mb_frsp_lru {
+ rwlock_t frsp_lru_lock;
+ atomic_t frsp_active_fragments; /* Current #of fragments in
+ * the active queue
+ */
+ u32 frsp_max_active_fragments;
+ /*
+ * List of active trees. Trees at tail are oldest trees in active set.
+ */
+ struct list_head frsp_lru_active;
+ /*
+ * List of inactive trees but loaded trees.
+ */
+ struct list_head frsp_lru_inactive;
+ struct super_block *frsp_lru_sb;
+};
+
+/*
+ * Minimum memory needed for our allocator. The pathological worst case for
+ * freespace trees is when every other block is allocated. In this case,
+ * the allocator will end up storing an extent node of length 1 for each free
+ * block. We need to make sure that the minimum memory available for the
+ * allocator is at least the memory needed for 1 worst-case tree. This ensures
+ * that we can keep at least 1 tree in memory. This way we avoid thrashing.
+ */
+#define EXT4_MB_FRSP_MEM_MIN(sb) \
+ ((sizeof(struct ext4_frsp_node) * ext4_flex_bg_size(EXT4_SB(sb))) \
+ * (EXT4_BLOCKS_PER_GROUP(sb) / 2))
+
+/* Half of the total memory available is allowed for active list */
+#define EXT4_MB_FRSP_MAX_ACTIVE_FRAGMENTS(sb) \
+ ((EXT4_SB(sb)->s_mb_frsp_mem_limit / \
+ (2 * sizeof(struct ext4_frsp_node))))
+
+/*
+ * Maximum number of jiffies allowed between two successive hits on a freespace
+ * tree before we move it to the "active" queue.
+ */
+#define EXT4_MB_FRSP_ACTIVE_THRESHOLD_JIFFIES 100
+
/*
* fourth extended-fs super-block data in memory
*/
@@ -1615,6 +1660,7 @@ struct ext4_sb_info {
u32 s_mb_frsp_cache_aggression;
atomic_t s_mb_num_fragments;
u32 s_mb_frsp_mem_limit;
+ struct ext4_mb_frsp_lru s_mb_frsp_lru;
/* workqueue for reserved extent conversions (buffered io) */
struct workqueue_struct *rsv_conversion_wq;
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index b28b7fb0506e..aea0eb8d28da 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -866,10 +866,12 @@ void ext4_mb_frsp_print_tree_len(struct super_block *sb,
}
#endif
-static struct ext4_frsp_node *ext4_mb_frsp_alloc_node(struct super_block *sb)
+static struct ext4_frsp_node *ext4_mb_frsp_alloc_node(struct super_block *sb,
+ struct ext4_frsp_tree *tree)
{
struct ext4_frsp_node *node;
struct ext4_sb_info *sbi = EXT4_SB(sb);
+ struct ext4_mb_frsp_lru *lru = &EXT4_SB(sb)->s_mb_frsp_lru;
node = kmem_cache_alloc(ext4_freespace_node_cachep, GFP_NOFS);
if (!node)
@@ -879,6 +881,11 @@ static struct ext4_frsp_node *ext4_mb_frsp_alloc_node(struct super_block *sb)
RB_CLEAR_NODE(&node->frsp_len_node);
atomic_inc(&sbi->s_mb_num_fragments);
+ atomic_inc(&tree->frsp_fragments);
+ read_lock(&lru->frsp_lru_lock);
+ if (!list_empty(&tree->frsp_lru_active_node))
+ atomic_inc(&lru->frsp_active_fragments);
+ read_unlock(&lru->frsp_lru_lock);
if (sbi->s_mb_frsp_mem_limit &&
atomic_read(&sbi->s_mb_num_fragments) >
@@ -892,12 +899,18 @@ static struct ext4_frsp_node *ext4_mb_frsp_alloc_node(struct super_block *sb)
}
static void ext4_mb_frsp_free_node(struct super_block *sb,
- struct ext4_frsp_node *node)
+ struct ext4_frsp_tree *tree, struct ext4_frsp_node *node)
{
struct ext4_sb_info *sbi = EXT4_SB(sb);
+ struct ext4_mb_frsp_lru *lru = &EXT4_SB(sb)->s_mb_frsp_lru;
kmem_cache_free(ext4_freespace_node_cachep, node);
atomic_dec(&sbi->s_mb_num_fragments);
+ atomic_dec(&tree->frsp_fragments);
+ read_lock(&lru->frsp_lru_lock);
+ if (!list_empty(&tree->frsp_lru_active_node))
+ atomic_dec(&lru->frsp_active_fragments);
+ read_unlock(&lru->frsp_lru_lock);
if (!sbi->s_mb_frsp_mem_limit ||
atomic_read(&sbi->s_mb_num_fragments) <
@@ -915,6 +928,8 @@ void ext4_mb_frsp_free_tree(struct super_block *sb, struct ext4_frsp_tree *tree)
mutex_lock(&tree->frsp_lock);
if (!(tree->frsp_flags & EXT4_MB_FRSP_FLAG_LOADED))
goto out;
+ if (atomic_read(&tree->frsp_ref))
+ goto out;
node = rb_first_cached(&tree->frsp_offset_root);
while (node) {
@@ -922,7 +937,7 @@ void ext4_mb_frsp_free_tree(struct super_block *sb, struct ext4_frsp_tree *tree)
rb_erase_cached(node, &tree->frsp_offset_root);
rb_erase_cached(&frsp_node->frsp_len_node,
&tree->frsp_len_root);
- ext4_mb_frsp_free_node(sb, frsp_node);
+ ext4_mb_frsp_free_node(sb, tree, frsp_node);
node = rb_first_cached(&tree->frsp_offset_root);
}
tree->frsp_flags &= ~EXT4_MB_FRSP_FLAG_LOADED;
@@ -985,7 +1000,7 @@ int ext4_mb_frsp_add_region(struct super_block *sb, struct ext4_frsp_tree *tree,
*prev_entry = NULL;
struct rb_node *left = NULL, *right = NULL;
- new_entry = ext4_mb_frsp_alloc_node(sb);
+ new_entry = ext4_mb_frsp_alloc_node(sb, tree);
if (!new_entry)
return -ENOMEM;
@@ -1004,7 +1019,7 @@ int ext4_mb_frsp_add_region(struct super_block *sb, struct ext4_frsp_tree *tree,
rb_erase_cached(left, &tree->frsp_offset_root);
rb_erase_cached(&prev_entry->frsp_len_node,
&tree->frsp_len_root);
- ext4_mb_frsp_free_node(sb, prev_entry);
+ ext4_mb_frsp_free_node(sb, tree, prev_entry);
}
}
@@ -1017,7 +1032,7 @@ int ext4_mb_frsp_add_region(struct super_block *sb, struct ext4_frsp_tree *tree,
rb_erase_cached(right, &tree->frsp_offset_root);
rb_erase_cached(&next_entry->frsp_len_node,
&tree->frsp_len_root);
- ext4_mb_frsp_free_node(sb, next_entry);
+ ext4_mb_frsp_free_node(sb, tree, next_entry);
}
}
ext4_mb_frsp_insert_len(tree, new_entry);
@@ -1601,6 +1616,10 @@ static void ext4_mb_frsp_init_tree(struct ext4_frsp_tree *tree, int index)
tree->frsp_flags = 0;
tree->frsp_index = index;
INIT_LIST_HEAD(&tree->frsp_list_node);
+ atomic_set(&tree->frsp_fragments, 0);
+ tree->frsp_last_access = 0;
+ INIT_LIST_HEAD(&tree->frsp_lru_active_node);
+ atomic_set(&tree->frsp_ref, 0);
}
int ext4_mb_init_freespace_trees(struct super_block *sb)
@@ -1627,6 +1646,15 @@ int ext4_mb_init_freespace_trees(struct super_block *sb)
rwlock_init(&sbi->s_mb_frsp_lock);
atomic_set(&sbi->s_mb_num_frsp_trees_cached, 0);
atomic_set(&sbi->s_mb_num_fragments, 0);
+ INIT_LIST_HEAD(&sbi->s_mb_frsp_lru.frsp_lru_active);
+ INIT_LIST_HEAD(&sbi->s_mb_frsp_lru.frsp_lru_inactive);
+ rwlock_init(&sbi->s_mb_frsp_lru.frsp_lru_lock);
+ /* Set the default hard-limit to be as much as buddy bitmaps */
+ sbi->s_mb_frsp_mem_limit = ext4_get_groups_count(sb) <<
+ (sb->s_blocksize_bits + 1);
+ if (sbi->s_mb_frsp_mem_limit < EXT4_MB_FRSP_MEM_MIN(sb))
+ sbi->s_mb_frsp_mem_limit = EXT4_MB_FRSP_MEM_MIN(sb);
+ atomic_set(&sbi->s_mb_frsp_lru.frsp_active_fragments, 0);
return 0;
}
@@ -1664,6 +1692,99 @@ int ext4_mb_add_frsp_trees(struct super_block *sb, int ngroups)
return 0;
}
+/*
+ * Evict one tree from inactive list.
+ */
+static void ext4_frsp_evict(struct super_block *sb)
+{
+ struct ext4_mb_frsp_lru *lru = &EXT4_SB(sb)->s_mb_frsp_lru;
+ struct ext4_frsp_tree *tree = NULL;
+ bool found = false;
+
+ write_lock(&lru->frsp_lru_lock);
+ if (!list_empty(&lru->frsp_lru_inactive)) {
+ /* Evict from front, insert at tail */
+ found = false;
+ list_for_each_entry(tree, &lru->frsp_lru_inactive,
+ frsp_list_node) {
+ if (!atomic_read(&tree->frsp_ref)) {
+ found = true;
+ break;
+ }
+ }
+ if (found)
+ list_del_init(&tree->frsp_list_node);
+ }
+ write_unlock(&lru->frsp_lru_lock);
+ if (found)
+ ext4_mb_frsp_free_tree(sb, tree);
+}
+
+/*
+ * This function maintains LRU lists. "tree" has just been accessed.
+ */
+static void ext4_mb_frsp_maintain_lru(struct super_block *sb,
+ struct ext4_frsp_tree *tree)
+{
+ struct ext4_mb_frsp_lru *lru = &EXT4_SB(sb)->s_mb_frsp_lru;
+ struct ext4_frsp_tree *last;
+ unsigned long current_jiffies = jiffies;
+
+ read_lock(&lru->frsp_lru_lock);
+ if (!list_empty(&tree->frsp_lru_active_node)) {
+ /* Already active, nothing to do */
+ read_unlock(&lru->frsp_lru_lock);
+ goto out;
+ }
+
+ /*
+ * Check if this tree needs to be moved to active list. We move it to
+ * active list if one of the following three conditions is true:
+ * - This is the first access to this tree
+ * - We haven't yet reached the max active fragments threshold, so there
+ * is space in active list.
+ * - This tree was accessed twice in
+ * EXT4_MB_FRSP_ACTIVE_THRESHOLD_JIFFIES interval.
+ */
+ if (tree->frsp_last_access &&
+ EXT4_MB_FRSP_MAX_ACTIVE_FRAGMENTS(sb) &&
+ atomic_read(&lru->frsp_active_fragments) >
+ EXT4_MB_FRSP_MAX_ACTIVE_FRAGMENTS(sb) &&
+ current_jiffies - tree->frsp_last_access >
+ EXT4_MB_FRSP_ACTIVE_THRESHOLD_JIFFIES) {
+ read_unlock(&lru->frsp_lru_lock);
+ goto out;
+ }
+ read_unlock(&lru->frsp_lru_lock);
+
+ write_lock(&lru->frsp_lru_lock);
+ /* Check again just in case */
+ if (!list_empty(&tree->frsp_lru_active_node)) {
+ write_unlock(&lru->frsp_lru_lock);
+ goto out;
+ }
+ list_add(&tree->frsp_lru_active_node, &lru->frsp_lru_active);
+ list_del_init(&tree->frsp_list_node);
+ atomic_add(atomic_read(&tree->frsp_fragments),
+ &lru->frsp_active_fragments);
+ /* Remove trees from active queue until we are below the limit */
+ while (EXT4_MB_FRSP_MAX_ACTIVE_FRAGMENTS(sb) &&
+ atomic_read(&lru->frsp_active_fragments) >
+ EXT4_MB_FRSP_MAX_ACTIVE_FRAGMENTS(sb)) {
+ last = list_last_entry(&lru->frsp_lru_active,
+ struct ext4_frsp_tree, frsp_lru_active_node);
+ list_del_init(&last->frsp_lru_active_node);
+ /* Evict from front, insert at tail */
+ list_add_tail(&last->frsp_list_node, &lru->frsp_lru_inactive);
+ atomic_sub(atomic_read(&last->frsp_fragments),
+ &lru->frsp_active_fragments);
+ }
+ write_unlock(&lru->frsp_lru_lock);
+
+out:
+ tree->frsp_last_access = current_jiffies;
+}
+
/*
* Divide blocks started from @first with length @len into
* smaller chunks with power of 2 blocks.
@@ -2154,10 +2275,12 @@ ext4_mb_load_allocator_gfp(struct super_block *sb, ext4_group_t group,
e4b->frsp_tree = ext4_get_frsp_tree(sb,
ext4_flex_group(sbi, group));
e4b->frsp_flags = flags;
+ ext4_mb_frsp_maintain_lru(sb, e4b->frsp_tree);
if (flags & EXT4_ALLOCATOR_FRSP_NOLOAD)
return 0;
mutex_lock(&e4b->frsp_tree->frsp_lock);
+ atomic_inc(&e4b->frsp_tree->frsp_ref);
if (e4b->frsp_tree->frsp_flags & EXT4_MB_FRSP_FLAG_LOADED) {
mutex_unlock(&e4b->frsp_tree->frsp_lock);
return 0;
@@ -2285,8 +2408,15 @@ static int ext4_mb_load_allocator(struct super_block *sb, ext4_group_t group,
static void ext4_mb_unload_allocator(struct ext4_buddy *e4b)
{
- if (ext4_mb_frsp_on(e4b->bd_sb))
+ if (ext4_mb_frsp_on(e4b->bd_sb)) {
+ if (!e4b->frsp_tree ||
+ e4b->frsp_flags & EXT4_ALLOCATOR_FRSP_NOLOAD)
+ return;
+ atomic_dec(&e4b->frsp_tree->frsp_ref);
+ if (test_opt2(e4b->bd_sb, FRSP_MEM_CRUNCH))
+ ext4_frsp_evict(e4b->bd_sb);
return;
+ }
if (e4b->bd_bitmap_page)
put_page(e4b->bd_bitmap_page);
if (e4b->bd_buddy_page)
@@ -2658,7 +2788,8 @@ static int mb_mark_used(struct ext4_buddy *e4b, struct ext4_free_extent *ex)
ex->fe_node->frsp_offset +
ex->fe_node->frsp_len) {
/* Need to split the node */
- new = ext4_mb_frsp_alloc_node(e4b->bd_sb);
+ new = ext4_mb_frsp_alloc_node(e4b->bd_sb,
+ e4b->frsp_tree);
if (!new)
return -ENOMEM;
new->frsp_offset = flex_offset + ex->fe_len;
@@ -2675,7 +2806,8 @@ static int mb_mark_used(struct ext4_buddy *e4b, struct ext4_free_extent *ex)
ex->fe_node->frsp_offset += ex->fe_len;
ex->fe_node->frsp_len -= ex->fe_len;
} else {
- ext4_mb_frsp_free_node(e4b->bd_sb, ex->fe_node);
+ ext4_mb_frsp_free_node(e4b->bd_sb, e4b->frsp_tree,
+ ex->fe_node);
return 0;
}
@@ -4166,7 +4298,9 @@ static void ext4_free_data_in_buddy(struct super_block *sb,
/* No more items in the per group rb tree
* balance refcounts from ext4_mb_free_metadata()
*/
- if (!ext4_mb_frsp_on(sb)) {
+ if (ext4_mb_frsp_on(sb)) {
+ atomic_dec(&e4b.frsp_tree->frsp_ref);
+ } else {
put_page(e4b.bd_buddy_page);
put_page(e4b.bd_bitmap_page);
}
@@ -6146,14 +6280,18 @@ ext4_mb_free_metadata(handle_t *handle, struct ext4_buddy *e4b,
new_node = &new_entry->efd_node;
cluster = new_entry->efd_start_cluster;
- if (!*n && !ext4_mb_frsp_on(sb)) {
+ if (!*n) {
/* first free block exent. We need to
protect buddy cache from being freed,
* otherwise we'll refresh it from
* on-disk bitmap and lose not-yet-available
* blocks */
- get_page(e4b->bd_buddy_page);
- get_page(e4b->bd_bitmap_page);
+ if (ext4_mb_frsp_on(sb)) {
+ atomic_inc(&e4b->frsp_tree->frsp_ref);
+ } else {
+ get_page(e4b->bd_buddy_page);
+ get_page(e4b->bd_bitmap_page);
+ }
}
while (*n) {
parent = *n;
--
2.28.0.297.g1956fa8f8d-goog
^ permalink raw reply related [flat|nested] 13+ messages in thread
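As a rough sanity check of the EXT4_MB_FRSP_MEM_MIN floor introduced in the patch above: in the worst case (every other block free) the tree needs one node per free block, i.e. blocks_per_group / 2 nodes for each group in the flex-bg. The node size and geometry below are illustrative assumptions, not values read from a real file system:

```c
#include <assert.h>

/* Worst-case memory floor for one freespace tree:
 * node_size * groups_per_flex_bg * (blocks_per_group / 2). */
static unsigned long long frsp_mem_min(unsigned long long node_size,
				       unsigned long long groups_per_flex_bg,
				       unsigned long long blocks_per_group)
{
	return node_size * groups_per_flex_bg * (blocks_per_group / 2);
}
```

With, say, 48-byte nodes, 16 groups per flex-bg, and 32768 blocks per group, the floor works out to 48 * 16 * 16384 = 12582912 bytes, i.e. 12 MiB per worst-case tree.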
* [RFC PATCH v2 8/9] ext4: add tracepoints for free space trees
2020-08-21 1:55 [RFC PATCH v2 0/9] ext4: add free-space extent based allocator Harshad Shirwadkar
` (6 preceding siblings ...)
2020-08-21 1:55 ` [RFC PATCH v2 7/9] ext4: add LRU eviction for free space trees Harshad Shirwadkar
@ 2020-08-21 1:55 ` Harshad Shirwadkar
2020-08-21 1:55 ` [RFC PATCH v2 9/9] ext4: add freespace trees documentation in code Harshad Shirwadkar
8 siblings, 0 replies; 13+ messages in thread
From: Harshad Shirwadkar @ 2020-08-21 1:55 UTC (permalink / raw)
To: linux-ext4; +Cc: tytso, lyx1209, Harshad Shirwadkar
From: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
This patch adds a few tracepoints to track the flow of allocation
requests with free space trees.
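The per-allocation ID that these tracepoints print can be sketched with C11 atomics; this is an illustration of the idea, not the kernel's atomic_t API:

```c
#include <assert.h>
#include <stdatomic.h>

/* Shared counter handing out one ID per allocation context, so every trace
 * event emitted on behalf of one allocation can be correlated in the log. */
static atomic_uint next_ac_id;

/* Mimic the kernel's atomic_inc_return(): increment and return the new
 * value. atomic_fetch_add() returns the old value, hence the + 1. */
static unsigned int new_ac_id(void)
{
	return atomic_fetch_add(&next_ac_id, 1) + 1;
}
```

Each new allocation context would store the result of new_ac_id() and pass it to every tracepoint it fires.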
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
---
fs/ext4/ext4.h | 1 +
fs/ext4/mballoc.c | 9 +++
fs/ext4/mballoc.h | 3 +-
include/trace/events/ext4.h | 112 ++++++++++++++++++++++++++++++++++++
4 files changed, 124 insertions(+), 1 deletion(-)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 64d0dbbcd517..22850763c5a8 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1661,6 +1661,7 @@ struct ext4_sb_info {
atomic_t s_mb_num_fragments;
u32 s_mb_frsp_mem_limit;
struct ext4_mb_frsp_lru s_mb_frsp_lru;
+ atomic_t s_mb_ac_id;
/* workqueue for reserved extent conversions (buffered io) */
struct workqueue_struct *rsv_conversion_wq;
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index aea0eb8d28da..eb99e2fb9a68 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -945,6 +945,7 @@ void ext4_mb_frsp_free_tree(struct super_block *sb, struct ext4_frsp_tree *tree)
tree->frsp_len_root = RB_ROOT_CACHED;
out:
mutex_unlock(&tree->frsp_lock);
+ trace_ext4_mb_frsp_free_tree(sb, tree->frsp_index);
}
/*
@@ -1476,6 +1477,7 @@ int ext4_mb_frsp_optimize(struct ext4_allocation_context *ac, int *tree_idx)
mb_debug(ac->ac_sb,
"Optimizer suggestion: found = %d, tree = %d, len = %d, cr = %d\n",
found, *tree_idx, better_len, ac->ac_criteria);
+ trace_ext4_mb_frsp_optimize(ac, found, cache_more_trees, *tree_idx);
return ret;
}
@@ -1546,6 +1548,11 @@ ext4_mb_tree_allocator(struct ext4_allocation_context *ac)
ac->ac_groups_scanned++;
tree_idx = (j % sbi->s_mb_num_frsp_trees);
+ trace_ext4_mb_tree_allocator(ac, tree_idx,
+ sbi->s_mb_num_frsp_trees,
+ atomic_read(&sbi->s_mb_num_frsp_trees_cached),
+ atomic_read(&sbi->s_mb_num_fragments) *
+ sizeof(struct ext4_frsp_node));
ret = ext4_mb_load_allocator(sb,
tree_idx * ext4_flex_bg_size(sbi), &e4b, 0);
if (ret)
@@ -1655,6 +1662,7 @@ int ext4_mb_init_freespace_trees(struct super_block *sb)
if (sbi->s_mb_frsp_mem_limit < EXT4_MB_FRSP_MEM_MIN(sb))
sbi->s_mb_frsp_mem_limit = EXT4_MB_FRSP_MEM_MIN(sb);
atomic_set(&sbi->s_mb_frsp_lru.frsp_active_fragments, 0);
+ atomic_set(&sbi->s_mb_ac_id, 0);
return 0;
}
@@ -5794,6 +5802,7 @@ ext4_mb_initialize_context(struct ext4_allocation_context *ac,
ext4_get_group_no_and_offset(sb, goal, &group, &block);
/* set up allocation goals */
+ ac->ac_id = atomic_inc_return(&sbi->s_mb_ac_id);
ac->ac_b_ex.fe_logical = EXT4_LBLK_CMASK(sbi, ar->logical);
ac->ac_status = AC_STATUS_CONTINUE;
ac->ac_sb = sb;
diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h
index 6cfb228e4da2..e734c05572b6 100644
--- a/fs/ext4/mballoc.h
+++ b/fs/ext4/mballoc.h
@@ -171,7 +171,6 @@ struct ext4_tree_extent {
};
struct ext4_allocation_context {
- __u32 ac_id;
struct inode *ac_inode;
struct super_block *ac_sb;
@@ -206,6 +205,8 @@ struct ext4_allocation_context {
* N > 0, the field stores N, otherwise 0 */
__u8 ac_num_optimizations;
__u8 ac_op; /* operation, for history only */
+ __u8 ac_id; /* allocation ID for tracking purposes */
+
struct page *ac_bitmap_page;
struct page *ac_buddy_page;
struct ext4_prealloc_space *ac_pa;
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index 4c8b99ec8606..e5b4c1576097 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -874,6 +874,118 @@ TRACE_EVENT(ext4_allocate_blocks,
__entry->pleft, __entry->pright)
);
+TRACE_EVENT(ext4_mb_frsp_free_tree,
+ TP_PROTO(struct super_block *sb, int tree),
+
+ TP_ARGS(sb, tree),
+
+ TP_STRUCT__entry(
+ __field( dev_t, dev )
+ __field( int, tree )
+ ),
+
+ TP_fast_assign(
+ __entry->dev = sb->s_dev;
+ __entry->tree = tree;
+ ),
+
+ TP_printk("dev %d,%d tree %d",
+ MAJOR(__entry->dev), MINOR(__entry->dev), __entry->tree)
+);
+
+TRACE_EVENT(ext4_mb_frsp_optimize,
+ TP_PROTO(struct ext4_allocation_context *ac, int found,
+ int cache_more_trees, int tree),
+
+ TP_ARGS(ac, found, cache_more_trees, tree),
+
+ TP_STRUCT__entry(
+ __field( dev_t, dev )
+ __field( ino_t, ino )
+ __field( unsigned int, len )
+ __field( unsigned int, flags )
+ __field( int, found )
+ __field( int, tree )
+ __field( int, num_found )
+ __field( int, attempts )
+ __field( int, ac_id )
+ __field( int, ac_criteria )
+ __field( int, ac_groups_scanned )
+ __field( int, cache_more_trees )
+ ),
+
+ TP_fast_assign(
+ __entry->dev = ac->ac_sb->s_dev;
+ __entry->ino = ac->ac_inode->i_ino;
+ __entry->len = ac->ac_b_tree_ex.te_len;
+ __entry->flags = ac->ac_flags;
+ __entry->found = found;
+ __entry->tree = tree;
+ __entry->attempts = ac->ac_num_optimizations;
+ __entry->ac_id = ac->ac_id;
+ __entry->num_found = ac->ac_found;
+ __entry->ac_criteria = ac->ac_criteria;
+ __entry->ac_groups_scanned = ac->ac_groups_scanned;
+ __entry->cache_more_trees = cache_more_trees;
+ ),
+
+ TP_printk("dev %d,%d ino %lu ac %d flags %s best-len %u num-found %d "
+ "suggest-tree %d cache_more %d attempt %d status %d cr %d "
+ "nflexgrps %d",
+ MAJOR(__entry->dev), MINOR(__entry->dev),
+ (unsigned long) __entry->ino, __entry->ac_id,
+ show_mballoc_flags(__entry->flags),
+ __entry->len, __entry->num_found, __entry->tree,
+ __entry->cache_more_trees, __entry->attempts, __entry->found,
+ __entry->ac_criteria, __entry->ac_groups_scanned)
+);
+
+TRACE_EVENT(ext4_mb_tree_allocator,
+ TP_PROTO(struct ext4_allocation_context *ac, int tree, int num_trees,
+ int num_trees_loaded, int bytes_usage),
+
+ TP_ARGS(ac, tree, num_trees, num_trees_loaded, bytes_usage),
+
+ TP_STRUCT__entry(
+ __field( dev_t, dev )
+ __field( ino_t, ino )
+ __field( unsigned int, len )
+ __field( unsigned int, flags )
+ __field( int, num_trees )
+ __field( int, num_trees_loaded )
+ __field( int, tree )
+ __field( int, ac_id )
+ __field( int, ac_found )
+ __field( int, ac_criteria )
+ __field( int, ac_groups_scanned )
+ __field( int, bytes_usage )
+ ),
+
+ TP_fast_assign(
+ __entry->dev = ac->ac_sb->s_dev;
+ __entry->ino = ac->ac_inode->i_ino;
+ __entry->len = ac->ac_b_tree_ex.te_len;
+ __entry->flags = ac->ac_flags;
+ __entry->num_trees = num_trees;
+ __entry->num_trees_loaded = num_trees_loaded;
+ __entry->tree = tree;
+ __entry->ac_id = ac->ac_id;
+ __entry->ac_found = ac->ac_found;
+ __entry->ac_criteria = ac->ac_criteria;
+ __entry->ac_groups_scanned = ac->ac_groups_scanned;
+ __entry->bytes_usage = bytes_usage;
+ ),
+
+ TP_printk("dev %d,%d ino %lu ac %d flags %s best-len %u found %d current-tree %d "
+ "num-trees %d num-trees-loaded %d cr %d nflexgrps %d mem_bytes %d",
+ MAJOR(__entry->dev), MINOR(__entry->dev),
+ (unsigned long) __entry->ino, __entry->ac_id,
+ show_mballoc_flags(__entry->flags),
+ __entry->len, __entry->ac_found, __entry->tree, __entry->num_trees,
+ __entry->num_trees_loaded, __entry->ac_criteria,
+ __entry->ac_groups_scanned, __entry->bytes_usage)
+);
+
TRACE_EVENT(ext4_free_blocks,
TP_PROTO(struct inode *inode, __u64 block, unsigned long count,
int flags),
--
2.28.0.297.g1956fa8f8d-goog
^ permalink raw reply related [flat|nested] 13+ messages in thread
* [RFC PATCH v2 9/9] ext4: add freespace trees documentation in code
2020-08-21 1:55 [RFC PATCH v2 0/9] ext4: add free-space extent based allocator Harshad Shirwadkar
` (7 preceding siblings ...)
2020-08-21 1:55 ` [RFC PATCH v2 8/9] ext4: add tracepoints " Harshad Shirwadkar
@ 2020-08-21 1:55 ` Harshad Shirwadkar
8 siblings, 0 replies; 13+ messages in thread
From: Harshad Shirwadkar @ 2020-08-21 1:55 UTC (permalink / raw)
To: linux-ext4; +Cc: tytso, lyx1209, Harshad Shirwadkar
From: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
This patch adds a comment in mballoc.c describing the design and flow
of free-space trees.
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
---
fs/ext4/mballoc.c | 116 ++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 116 insertions(+)
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index eb99e2fb9a68..b5d55a52daff 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -330,6 +330,122 @@
* object
*
*/
+
+/*
+ * The freespace tree allocator
+ * -----------------------------
+ *
+ * High Level Design
+ * =================
+ * This allocator maintains the free space information about the file system in
+ * per-flexible block group level trees. For every flexible block group, we
+ * maintain individual freespace nodes in two trees, one sorted by flex-bg
+ * offset and other sorted by length. This gives us two benefits: i) In
+ * allocation path, our search time complexity is O(Log(Number of freespace
+ * regions in the flex-bg)). ii) In free path, in the same time complexity we
+ * can merge the adjacent nodes thereby reducing the size of the tree
+ * efficiently.
+ *
+ * Along with flexible block group level trees, we also introduce a top level
+ * meta-tree which consists of individual trees. This tree is sorted by length
+ * of largest extents found in flex-bgs. The key advantage that this tree gives
+ * us is this: During an allocation request, the allocator is now able to
+ * consult this tree and directly (in O(Log(Number of Flex BGs))) jump to a
+ * flexible block group which definitely has at least one (the largest) extent
+ * that can satisfy this request. If no flexible block group exists which can
+ * satisfy this request, the allocator can immediately move to the next CR level.
+ *
+ * In order to preserve parallel allocation / free performance, the allocator
+ * only *tries* to consult this tree. It does so by calling the read_trylock()
+ * function; if the meta-tree is busy, the allocator continues its linear
+ * search until it is able to grab a read-lock on this tree.
+ *
+ * Memory Footprint
+ * ================
+ *
+ * In a less fragmented file system, the memory footprint of the new allocator
+ * is significantly lower than buddy bitmaps. However, as the fragmentation
+ * level increases, the memory footprint of this allocator increases
+ * significantly. The memory usage of the freespace tree allocator can be
+ * controlled using sysfs tunable /sys/fs/ext4/<dev>/mb_frsp_max_mem. The
+ * default value of this is max(memory needed for buddy, maximum memory needed
+ * for one tree in the worst case). Accounting for max memory needed for one
+ * tree allows us to keep at least one tree in memory even in the worst
+ * case. This avoids thrashing. Once the memory threshold is reached, the
+ * allocator starts evicting trees from memory in LRU fashion. However, we don't
+ * remove tree's entry from the meta-tree. This allows the allocator to
+ * efficiently reconstruct only relevant trees from on-disk bitmaps. In our
+ * evaluations, we found that the freespace tree allocator still manages to
+ * outperform the buddy allocator in memory-crunch situations.
+ *
+ * LRU
+ * ===
+ *
+ * We maintain two lists of trees - an active list and an inactive list. Trees
+ * on the active list stay there until demoted. To reduce contention on the
+ * active list lock, a tree on the active list is never moved within the
+ * list. Also, a tree moves from the inactive list to the active list only if
+ * it is accessed twice within an EXT4_MB_FRSP_ACTIVE_THRESHOLD_JIFFIES window.
+ *
+ *
+ * Caching Tree Metadata
+ * =====================
+ * As noted in our experiments, we find caching tree metadata (the largest
+ * available extent in a tree) in the meta-tree significantly boosts allocation
+ * performance. However, if the allocator waits for the cache to fill up before
+ * starting to serve allocation requests, that may severely degrade allocation
+ * performance on large disks. Thus, it is important to tune the tree caching
+ * behavior according to the underlying block device. This patchset provides
+ * four cache aggression levels. Cache aggression level can be specified as a
+ * mount-time parameter "frsp_cache_aggression". Here is the meaning of these
+ * different levels:
+ *
+ * Cache Aggression 0: Try to serve requests from cached trees only
+ * Cache Aggression 1 (Default): Try to serve requests from only cached trees
+ * at CR 0. At all other CRs, try to use an uncached tree for every
+ * alternate request.
+ * Cache Aggression 2: Try to use an uncached tree for every alternate request
+ * at all CR levels.
+ * Cache Aggression 3: Try to use uncached trees for every request.
+ *
+ * Moreover, if the file system is mounted with "prefetch_block_bitmaps", tree
+ * caching starts at mount time.
+ *
+ * Locking Order
+ * =============
+ *
+ * - Tree lock
+ * - Meta tree lock (sbi->s_mb_frsp_lock)
+ * - LRU lock
+ *
+ * Critical sections of meta-tree lock and LRU locks are kept as small as
+ * possible and you shouldn't need to take meta-tree lock and lru-lock
+ * simultaneously.
+ *
+ * High Level Algorithm
+ * ====================
+ * - Consult meta-tree asking which flex-bg should the allocator look at
+ * - One of the following things can happen
+ * - Meta tree may find a suitable flex-bg for the request
+ * - In this case we start searching in that tree
+ * - Meta tree may realize that there's no suitable flex-bg
+ * - In this case we increase CR and restart
+ * - Based on the caching mode, meta tree may redirect us to an
+ * uncached tree
+ * - Meta tree is busy
+ * - In this case, we don't wait for the meta-tree to become available,
+ * instead, we continue our search linearly.
+ * - Pick a tree (either based on what meta-tree told us or linearly from the
+ * last one)
+ * - Manage LRU structure (ext4_mb_frsp_maintain_lru())
+ * - Move tree to active list if needed
+ * - Move trees from active list to inactive lists if needed
+ * - Perform search by length and pick matching extents.
+ * - Repeat until best found
+ * - Perform memory-crunch check and evict trees in LRU fashion if needed
+ * (ext4_mb_unload_allocator())
+ */
+
static struct kmem_cache *ext4_pspace_cachep;
static struct kmem_cache *ext4_ac_cachep;
static struct kmem_cache *ext4_free_data_cachep;
--
2.28.0.297.g1956fa8f8d-goog
^ permalink raw reply related [flat|nested] 13+ messages in thread
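The "try, don't wait" meta-tree consultation described in the comment above can be sketched as follows. The kernel uses read_trylock() on an rwlock; this standalone sketch substitutes a simple atomic_flag test-and-set lock, and the helper names and return value are hypothetical:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_flag meta_busy = ATOMIC_FLAG_INIT;

static bool meta_trylock(void)
{
	/* atomic_flag_test_and_set() returns the previous value; false
	 * means we just acquired the lock. */
	return !atomic_flag_test_and_set(&meta_busy);
}

static void meta_unlock(void)
{
	atomic_flag_clear(&meta_busy);
}

/* Stand-in for the O(Log(Number of Flex BGs)) rb-tree walk over the
 * largest-extent keys. */
static int fake_lookup(void)
{
	return 7;
}

/* Ask the meta-tree for a flex-bg suggestion. Returns the suggested tree
 * index, or -1 when the meta-tree is busy - the caller then continues its
 * linear scan instead of blocking on the lock. */
static int suggest_tree(int (*lookup)(void))
{
	int idx = -1;

	if (meta_trylock()) {
		idx = lookup();
		meta_unlock();
	}
	return idx;
}
```

The point of the trylock is exactly the -1 path: under contention the allocator degrades to its old linear behavior rather than serializing all allocations behind one lock.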