* [PATCH 1/9] ext4: add handling for extended mount options
2020-08-19 7:30 [PATCH 0/9] ext4: add free-space extent based allocator Harshad Shirwadkar
@ 2020-08-19 7:30 ` Harshad Shirwadkar
2020-08-19 7:30 ` [PATCH 2/9] ext4: rename ext4_mb_load_buddy to ext4_mb_load_allocator Harshad Shirwadkar
` (7 subsequent siblings)
8 siblings, 0 replies; 13+ messages in thread
From: Harshad Shirwadkar @ 2020-08-19 7:30 UTC
To: linux-ext4; +Cc: tytso, lyx1209, Harshad Shirwadkar, Andreas Dilger
We are running out of mount option bits. Add handling for using
s_mount_opt2. This patch was originally submitted as part of the fast
commit patch series; it is being resent here with this series as well.
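For illustration only (not part of this patch): an option whose bit
lives in s_mount_opt2 is declared in the mount_opts table with MOPT_2
in its flags and queried with the existing test_opt2() helper. The
option names below are hypothetical:

	/* hypothetical table entry; MOPT_2 makes handle_mount_opt()
	 * route the bit to sbi->s_mount_opt2 instead of s_mount_opt */
	{Opt_fake_feature, EXT4_MOUNT2_FAKE_FEATURE, MOPT_SET | MOPT_2},

	/* querying the option elsewhere */
	if (test_opt2(sb, FAKE_FEATURE))
		ext4_msg(sb, KERN_INFO, "fake_feature enabled");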
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
---
fs/ext4/super.c | 16 ++++++++++++----
1 file changed, 12 insertions(+), 4 deletions(-)
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 51e91a220ea9..98aa12602b68 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1733,6 +1733,7 @@ static int clear_qf_name(struct super_block *sb, int qtype)
#define MOPT_EXT4_ONLY (MOPT_NO_EXT2 | MOPT_NO_EXT3)
#define MOPT_STRING 0x0400
#define MOPT_SKIP 0x0800
+#define MOPT_2 0x1000
static const struct mount_opts {
int token;
@@ -2202,10 +2203,17 @@ static int handle_mount_opt(struct super_block *sb, char *opt, int token,
WARN_ON(1);
return -1;
}
- if (arg != 0)
- sbi->s_mount_opt |= m->mount_opt;
- else
- sbi->s_mount_opt &= ~m->mount_opt;
+ if (m->flags & MOPT_2) {
+ if (arg != 0)
+ sbi->s_mount_opt2 |= m->mount_opt;
+ else
+ sbi->s_mount_opt2 &= ~m->mount_opt;
+ } else {
+ if (arg != 0)
+ sbi->s_mount_opt |= m->mount_opt;
+ else
+ sbi->s_mount_opt &= ~m->mount_opt;
+ }
}
return 1;
}
--
2.28.0.220.ged08abb693-goog
* [PATCH 2/9] ext4: rename ext4_mb_load_buddy to ext4_mb_load_allocator
2020-08-19 7:30 [PATCH 0/9] ext4: add free-space extent based allocator Harshad Shirwadkar
2020-08-19 7:30 ` [PATCH 1/9] ext4: add handling for extended mount options Harshad Shirwadkar
@ 2020-08-19 7:30 ` Harshad Shirwadkar
2020-08-19 7:30 ` [PATCH 3/9] ext4: add freespace tree allocator Harshad Shirwadkar
` (6 subsequent siblings)
8 siblings, 0 replies; 13+ messages in thread
From: Harshad Shirwadkar @ 2020-08-19 7:30 UTC
To: linux-ext4; +Cc: tytso, lyx1209, Harshad Shirwadkar
This patch renames ext4_mb_load_buddy and ext4_mb_unload_buddy to
ext4_mb_load_allocator and ext4_mb_unload_allocator. It also adds a
flags argument to ext4_mb_load_allocator(), which is currently
unused. This significantly reduces the size of the following patch,
"ext4: add freespace tree allocator". In the interest of keeping this
patchset shorter, I have not renamed the ext4_buddy structure or the
e4b variable names, but have added that as a TODO item.
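For reference, a minimal sketch of a converted call site (this is the
pattern the hunks below produce; the flags argument is 0 for all
existing callers and only gains meaning in the next patch):

	struct ext4_buddy e4b;
	int err;

	err = ext4_mb_load_allocator(sb, group, &e4b, 0);
	if (err)
		return err;
	/* ... operate on the group via e4b ... */
	ext4_mb_unload_allocator(&e4b);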
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
---
fs/ext4/mballoc.c | 86 ++++++++++++++++++++++++-----------------------
1 file changed, 44 insertions(+), 42 deletions(-)
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 47de61e44db2..2d8d3d9c7918 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -29,6 +29,7 @@
* - don't normalize tails
* - quota
* - reservation for superuser
+ * - rename ext4_buddy to ext4_allocator and e4b variables to allocator
*
* TODO v3:
* - bitmap read-ahead (proposed by Oleg Drokin aka green)
@@ -92,7 +93,7 @@
* mapped to the buddy and bitmap information regarding different
* groups. The buddy information is attached to buddy cache inode so that
* we can access them through the page cache. The information regarding
- * each group is loaded via ext4_mb_load_buddy. The information involve
+ * each group is loaded via ext4_mb_load_allocator. The information involve
* block bitmap and buddy information. The information are stored in the
* inode as:
*
@@ -845,7 +846,7 @@ static void mb_regenerate_buddy(struct ext4_buddy *e4b)
/* The buddy information is attached the buddy cache inode
* for convenience. The information regarding each group
- * is loaded via ext4_mb_load_buddy. The information involve
+ * is loaded via ext4_mb_load_allocator. The information involve
* block bitmap and buddy information. The information are
* stored in the inode as
*
@@ -1105,7 +1106,7 @@ int ext4_mb_init_group(struct super_block *sb, ext4_group_t group, gfp_t gfp)
* This ensures that we don't reinit the buddy cache
* page which map to the group from which we are already
* allocating. If we are looking at the buddy cache we would
- * have taken a reference using ext4_mb_load_buddy and that
+ * have taken a reference using ext4_mb_load_allocator and that
* would have pinned buddy page to page cache.
* The call to ext4_mb_get_buddy_page_lock will mark the
* page accessed.
@@ -1157,8 +1158,8 @@ int ext4_mb_init_group(struct super_block *sb, ext4_group_t group, gfp_t gfp)
* calling this routine!
*/
static noinline_for_stack int
-ext4_mb_load_buddy_gfp(struct super_block *sb, ext4_group_t group,
- struct ext4_buddy *e4b, gfp_t gfp)
+ext4_mb_load_allocator_gfp(struct super_block *sb, ext4_group_t group,
+ struct ext4_buddy *e4b, gfp_t gfp, int flags)
{
int blocks_per_page;
int block;
@@ -1296,13 +1297,13 @@ ext4_mb_load_buddy_gfp(struct super_block *sb, ext4_group_t group,
return ret;
}
-static int ext4_mb_load_buddy(struct super_block *sb, ext4_group_t group,
- struct ext4_buddy *e4b)
+static int ext4_mb_load_allocator(struct super_block *sb, ext4_group_t group,
+ struct ext4_buddy *e4b, int flags)
{
- return ext4_mb_load_buddy_gfp(sb, group, e4b, GFP_NOFS);
+ return ext4_mb_load_allocator_gfp(sb, group, e4b, GFP_NOFS, flags);
}
-static void ext4_mb_unload_buddy(struct ext4_buddy *e4b)
+static void ext4_mb_unload_allocator(struct ext4_buddy *e4b)
{
if (e4b->bd_bitmap_page)
put_page(e4b->bd_bitmap_page);
@@ -1866,7 +1867,7 @@ int ext4_mb_try_best_found(struct ext4_allocation_context *ac,
int err;
BUG_ON(ex.fe_len <= 0);
- err = ext4_mb_load_buddy(ac->ac_sb, group, e4b);
+ err = ext4_mb_load_allocator(ac->ac_sb, group, e4b, 0);
if (err)
return err;
@@ -1879,7 +1880,7 @@ int ext4_mb_try_best_found(struct ext4_allocation_context *ac,
}
ext4_unlock_group(ac->ac_sb, group);
- ext4_mb_unload_buddy(e4b);
+ ext4_mb_unload_allocator(e4b);
return 0;
}
@@ -1900,12 +1901,12 @@ int ext4_mb_find_by_goal(struct ext4_allocation_context *ac,
if (grp->bb_free == 0)
return 0;
- err = ext4_mb_load_buddy(ac->ac_sb, group, e4b);
+ err = ext4_mb_load_allocator(ac->ac_sb, group, e4b, 0);
if (err)
return err;
if (unlikely(EXT4_MB_GRP_BBITMAP_CORRUPT(e4b->bd_info))) {
- ext4_mb_unload_buddy(e4b);
+ ext4_mb_unload_allocator(e4b);
return 0;
}
@@ -1943,7 +1944,7 @@ int ext4_mb_find_by_goal(struct ext4_allocation_context *ac,
ext4_mb_use_best_found(ac, e4b);
}
ext4_unlock_group(ac->ac_sb, group);
- ext4_mb_unload_buddy(e4b);
+ ext4_mb_unload_allocator(e4b);
return 0;
}
@@ -2424,7 +2425,7 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
continue;
}
- err = ext4_mb_load_buddy(sb, group, &e4b);
+ err = ext4_mb_load_allocator(sb, group, &e4b, 0);
if (err)
goto out;
@@ -2437,7 +2438,7 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
ret = ext4_mb_good_group(ac, group, cr);
if (ret == 0) {
ext4_unlock_group(sb, group);
- ext4_mb_unload_buddy(&e4b);
+ ext4_mb_unload_allocator(&e4b);
continue;
}
@@ -2451,7 +2452,7 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
ext4_mb_complex_scan_group(ac, &e4b);
ext4_unlock_group(sb, group);
- ext4_mb_unload_buddy(&e4b);
+ ext4_mb_unload_allocator(&e4b);
if (ac->ac_status != AC_STATUS_CONTINUE)
break;
@@ -2548,7 +2549,7 @@ static int ext4_mb_seq_groups_show(struct seq_file *seq, void *v)
grinfo = ext4_get_group_info(sb, group);
/* Load the group info in memory only if not already loaded. */
if (unlikely(EXT4_MB_GRP_NEED_INIT(grinfo))) {
- err = ext4_mb_load_buddy(sb, group, &e4b);
+ err = ext4_mb_load_allocator(sb, group, &e4b, 0);
if (err) {
seq_printf(seq, "#%-5u: I/O error\n", group);
return 0;
@@ -2559,7 +2560,7 @@ static int ext4_mb_seq_groups_show(struct seq_file *seq, void *v)
memcpy(&sg, ext4_get_group_info(sb, group), i);
if (buddy_loaded)
- ext4_mb_unload_buddy(&e4b);
+ ext4_mb_unload_allocator(&e4b);
seq_printf(seq, "#%-5u: %-5u %-5u %-5u [", group, sg.info.bb_free,
sg.info.bb_fragments, sg.info.bb_first_free);
@@ -3053,7 +3054,7 @@ static void ext4_free_data_in_buddy(struct super_block *sb,
mb_debug(sb, "gonna free %u blocks in group %u (0x%p):",
entry->efd_count, entry->efd_group, entry);
- err = ext4_mb_load_buddy(sb, entry->efd_group, &e4b);
+ err = ext4_mb_load_allocator(sb, entry->efd_group, &e4b, 0);
/* we expect to find existing buddy because it's pinned */
BUG_ON(err != 0);
@@ -3088,7 +3089,7 @@ static void ext4_free_data_in_buddy(struct super_block *sb,
}
ext4_unlock_group(sb, entry->efd_group);
kmem_cache_free(ext4_free_data_cachep, entry);
- ext4_mb_unload_buddy(&e4b);
+ ext4_mb_unload_allocator(&e4b);
mb_debug(sb, "freed %d blocks in %d structures\n", count,
count2);
@@ -3562,12 +3563,13 @@ static void ext4_discard_allocated_blocks(struct ext4_allocation_context *ac)
if (pa == NULL) {
if (ac->ac_f_ex.fe_len == 0)
return;
- err = ext4_mb_load_buddy(ac->ac_sb, ac->ac_f_ex.fe_group, &e4b);
+ err = ext4_mb_load_allocator(ac->ac_sb, ac->ac_f_ex.fe_group,
+ &e4b, 0);
if (err) {
/*
* This should never happen since we pin the
* pages in the ext4_allocation_context so
- * ext4_mb_load_buddy() should never fail.
+ * ext4_mb_load_allocator() should never fail.
*/
WARN(1, "mb_load_buddy failed (%d)", err);
return;
@@ -3576,7 +3578,7 @@ static void ext4_discard_allocated_blocks(struct ext4_allocation_context *ac)
mb_free_blocks(ac->ac_inode, &e4b, ac->ac_f_ex.fe_start,
ac->ac_f_ex.fe_len);
ext4_unlock_group(ac->ac_sb, ac->ac_f_ex.fe_group);
- ext4_mb_unload_buddy(&e4b);
+ ext4_mb_unload_allocator(&e4b);
return;
}
if (pa->pa_type == MB_INODE_PA)
@@ -4158,7 +4160,7 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
goto out_dbg;
}
- err = ext4_mb_load_buddy(sb, group, &e4b);
+ err = ext4_mb_load_allocator(sb, group, &e4b, 0);
if (err) {
ext4_warning(sb, "Error %d loading buddy information for %u",
err, group);
@@ -4233,7 +4235,7 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
out:
ext4_unlock_group(sb, group);
- ext4_mb_unload_buddy(&e4b);
+ ext4_mb_unload_allocator(&e4b);
put_bh(bitmap_bh);
out_dbg:
mb_debug(sb, "discarded (%d) blocks preallocated for group %u bb_free (%d)\n",
@@ -4325,8 +4327,8 @@ void ext4_discard_preallocations(struct inode *inode)
BUG_ON(pa->pa_type != MB_INODE_PA);
group = ext4_get_group_number(sb, pa->pa_pstart);
- err = ext4_mb_load_buddy_gfp(sb, group, &e4b,
- GFP_NOFS|__GFP_NOFAIL);
+ err = ext4_mb_load_allocator_gfp(sb, group, &e4b,
+ GFP_NOFS|__GFP_NOFAIL, 0);
if (err) {
ext4_error_err(sb, -err, "Error %d loading buddy information for %u",
err, group);
@@ -4338,7 +4340,7 @@ void ext4_discard_preallocations(struct inode *inode)
err = PTR_ERR(bitmap_bh);
ext4_error_err(sb, -err, "Error %d reading block bitmap for %u",
err, group);
- ext4_mb_unload_buddy(&e4b);
+ ext4_mb_unload_allocator(&e4b);
continue;
}
@@ -4347,7 +4349,7 @@ void ext4_discard_preallocations(struct inode *inode)
ext4_mb_release_inode_pa(&e4b, bitmap_bh, pa);
ext4_unlock_group(sb, group);
- ext4_mb_unload_buddy(&e4b);
+ ext4_mb_unload_allocator(&e4b);
put_bh(bitmap_bh);
list_del(&pa->u.pa_tmp_list);
@@ -4620,8 +4622,8 @@ ext4_mb_discard_lg_preallocations(struct super_block *sb,
int err;
group = ext4_get_group_number(sb, pa->pa_pstart);
- err = ext4_mb_load_buddy_gfp(sb, group, &e4b,
- GFP_NOFS|__GFP_NOFAIL);
+ err = ext4_mb_load_allocator_gfp(sb, group, &e4b,
+ GFP_NOFS|__GFP_NOFAIL, 0);
if (err) {
ext4_error_err(sb, -err, "Error %d loading buddy information for %u",
err, group);
@@ -4632,7 +4634,7 @@ ext4_mb_discard_lg_preallocations(struct super_block *sb,
ext4_mb_release_group_pa(&e4b, pa);
ext4_unlock_group(sb, group);
- ext4_mb_unload_buddy(&e4b);
+ ext4_mb_unload_allocator(&e4b);
list_del(&pa->u.pa_tmp_list);
call_rcu(&(pa)->u.pa_rcu, ext4_mb_pa_callback);
}
@@ -5189,8 +5191,8 @@ void ext4_free_blocks(handle_t *handle, struct inode *inode,
trace_ext4_mballoc_free(sb, inode, block_group, bit, count_clusters);
/* __GFP_NOFAIL: retry infinitely, ignore TIF_MEMDIE and memcg limit. */
- err = ext4_mb_load_buddy_gfp(sb, block_group, &e4b,
- GFP_NOFS|__GFP_NOFAIL);
+ err = ext4_mb_load_allocator_gfp(sb, block_group, &e4b,
+ GFP_NOFS|__GFP_NOFAIL, 0);
if (err)
goto error_return;
@@ -5264,7 +5266,7 @@ void ext4_free_blocks(handle_t *handle, struct inode *inode,
count_clusters);
}
- ext4_mb_unload_buddy(&e4b);
+ ext4_mb_unload_allocator(&e4b);
/* We dirtied the bitmap block */
BUFFER_TRACE(bitmap_bh, "dirtied bitmap block");
@@ -5382,7 +5384,7 @@ int ext4_group_add_blocks(handle_t *handle, struct super_block *sb,
}
}
- err = ext4_mb_load_buddy(sb, block_group, &e4b);
+ err = ext4_mb_load_allocator(sb, block_group, &e4b, 0);
if (err)
goto error_return;
@@ -5410,7 +5412,7 @@ int ext4_group_add_blocks(handle_t *handle, struct super_block *sb,
flex_group)->free_clusters);
}
- ext4_mb_unload_buddy(&e4b);
+ ext4_mb_unload_allocator(&e4b);
/* We dirtied the bitmap block */
BUFFER_TRACE(bitmap_bh, "dirtied bitmap block");
@@ -5498,7 +5500,7 @@ ext4_trim_all_free(struct super_block *sb, ext4_group_t group,
trace_ext4_trim_all_free(sb, group, start, max);
- ret = ext4_mb_load_buddy(sb, group, &e4b);
+ ret = ext4_mb_load_allocator(sb, group, &e4b, 0);
if (ret) {
ext4_warning(sb, "Error %d loading buddy information for %u",
ret, group);
@@ -5552,7 +5554,7 @@ ext4_trim_all_free(struct super_block *sb, ext4_group_t group,
}
out:
ext4_unlock_group(sb, group);
- ext4_mb_unload_buddy(&e4b);
+ ext4_mb_unload_allocator(&e4b);
ext4_debug("trimmed %d blocks in the group %d\n",
count, group);
@@ -5666,7 +5668,7 @@ ext4_mballoc_query_range(
struct ext4_buddy e4b;
int error;
- error = ext4_mb_load_buddy(sb, group, &e4b);
+ error = ext4_mb_load_allocator(sb, group, &e4b, 0);
if (error)
return error;
bitmap = e4b.bd_bitmap;
@@ -5695,7 +5697,7 @@ ext4_mballoc_query_range(
ext4_unlock_group(sb, group);
out_unload:
- ext4_mb_unload_buddy(&e4b);
+ ext4_mb_unload_allocator(&e4b);
return error;
}
--
2.28.0.220.ged08abb693-goog
* [PATCH 3/9] ext4: add freespace tree allocator
2020-08-19 7:30 [PATCH 0/9] ext4: add free-space extent based allocator Harshad Shirwadkar
2020-08-19 7:30 ` [PATCH 1/9] ext4: add handling for extended mount options Harshad Shirwadkar
2020-08-19 7:30 ` [PATCH 2/9] ext4: rename ext4_mb_load_buddy to ext4_mb_load_allocator Harshad Shirwadkar
@ 2020-08-19 7:30 ` Harshad Shirwadkar
2020-09-03 13:45 ` Благодаренко Артём
2020-08-19 7:30 ` [PATCH 4/9] ext4: add prefetching support for freespace trees Harshad Shirwadkar
` (5 subsequent siblings)
8 siblings, 1 reply; 13+ messages in thread
From: Harshad Shirwadkar @ 2020-08-19 7:30 UTC
To: linux-ext4; +Cc: tytso, lyx1209, Harshad Shirwadkar
From: Yuexin Li <lyx1209@gmail.com>
This patch adds a new freespace tree allocator. The allocator
organizes free space regions into a pair of in-memory rbtrees per
flex_bg, one sorted by length and the other by offset. We use these
per-flex-bg trees to quickly scan length-sorted free space regions
and determine the best region for a given request. This feature can
be enabled by passing the "-o freespace_tree" mount option.
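To sketch the idea (using the structures introduced below): each free
extent is a single ext4_frsp_node linked into both rbtrees, and
because the length tree is kept in decreasing order of length, the
largest free extent in a flex_bg is always its leftmost cached node.
A hypothetical helper, shown only for illustration, would be:

	/* Assumes the tree mutex is held; mirrors the walk done in
	 * ext4_mb_frsp_process() below. */
	static struct ext4_frsp_node *
	frsp_largest_extent(struct ext4_frsp_tree *tree)
	{
		struct rb_node *node = rb_first_cached(&tree->frsp_len_root);

		return node ? rb_entry(node, struct ext4_frsp_node,
				       frsp_len_node) : NULL;
	}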
Signed-off-by: Yuexin Li <lyx1209@gmail.com>
Co-developed-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
---
fs/ext4/ext4.h | 45 +++
fs/ext4/mballoc.c | 984 ++++++++++++++++++++++++++++++++++++++++++++--
fs/ext4/mballoc.h | 61 ++-
fs/ext4/resize.c | 8 +
fs/ext4/super.c | 35 +-
5 files changed, 1084 insertions(+), 49 deletions(-)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 4df6f429de1a..3bb2675d4d40 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -363,6 +363,7 @@ struct flex_groups {
atomic64_t free_clusters;
atomic_t free_inodes;
atomic_t used_dirs;
+ struct ext4_frsp_tree *frsp_tree;
};
#define EXT4_BG_INODE_UNINIT 0x0001 /* Inode table/bitmap not in use */
@@ -1197,6 +1198,12 @@ struct ext4_inode_info {
#define EXT4_MOUNT2_EXPLICIT_JOURNAL_CHECKSUM 0x00000008 /* User explicitly
specified journal checksum */
+
+#define EXT4_MOUNT2_FREESPACE_TREE 0x00000020 /* Enable freespace tree
+ * allocator (turns buddy
+ * allocator off)
+ */
+
#define clear_opt(sb, opt) EXT4_SB(sb)->s_mount_opt &= \
~EXT4_MOUNT_##opt
#define set_opt(sb, opt) EXT4_SB(sb)->s_mount_opt |= \
@@ -1402,6 +1409,30 @@ struct ext4_super_block {
#define ext4_has_strict_mode(sbi) \
(sbi->s_encoding_flags & EXT4_ENC_STRICT_MODE_FL)
+/*
+ * Freespace tree structure
+ */
+struct ext4_frsp_tree {
+ struct rb_root_cached frsp_offset_root; /* Root for offset
+ * sorted tree
+ */
+ struct rb_root_cached frsp_len_root; /* Root for Length
+ * sorted tree
+ */
+ struct mutex frsp_lock;
+ __u8 frsp_flags;
+ __u32 frsp_max_free_len; /* Max free extent in
+ * this flex bg
+ */
+ __u32 frsp_index; /* Tree index (flex bg
+ * number)
+ */
+};
+
+/* Freespace tree flags */
+
+/* Tree is loaded in memory */
+#define EXT4_MB_FRSP_FLAG_LOADED 0x0001
/*
* fourth extended-fs super-block data in memory
*/
@@ -1539,6 +1570,9 @@ struct ext4_sb_info {
struct flex_groups * __rcu *s_flex_groups;
ext4_group_t s_flex_groups_allocated;
+ /* freespace trees stuff */
+ int s_mb_num_frsp_trees;
+
/* workqueue for reserved extent conversions (buffered io) */
struct workqueue_struct *rsv_conversion_wq;
@@ -2653,6 +2687,7 @@ extern int ext4_init_inode_table(struct super_block *sb,
extern void ext4_end_bitmap_read(struct buffer_head *bh, int uptodate);
/* mballoc.c */
+extern int ext4_mb_add_frsp_trees(struct super_block *sb, int ngroups);
extern const struct seq_operations ext4_mb_seq_groups_ops;
extern long ext4_mb_stats;
extern long ext4_mb_max_to_scan;
@@ -3084,6 +3119,16 @@ static inline ext4_group_t ext4_flex_group(struct ext4_sb_info *sbi,
return block_group >> sbi->s_log_groups_per_flex;
}
+static inline struct ext4_frsp_tree *
+ext4_get_frsp_tree(struct super_block *sb, ext4_group_t flex_bg)
+{
+ struct flex_groups *fg;
+
+ fg = sbi_array_rcu_deref(EXT4_SB(sb), s_flex_groups, flex_bg);
+
+ return fg->frsp_tree;
+}
+
static inline unsigned int ext4_flex_bg_size(struct ext4_sb_info *sbi)
{
return 1 << sbi->s_log_groups_per_flex;
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 2d8d3d9c7918..74bdc2dcb49c 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -333,6 +333,7 @@
static struct kmem_cache *ext4_pspace_cachep;
static struct kmem_cache *ext4_ac_cachep;
static struct kmem_cache *ext4_free_data_cachep;
+static struct kmem_cache *ext4_freespace_node_cachep;
/* We create slab caches for groupinfo data structures based on the
* superblock block size. There will be one per mounted filesystem for
@@ -352,6 +353,11 @@ static void ext4_mb_generate_from_freelist(struct super_block *sb, void *bitmap,
ext4_group_t group);
static void ext4_mb_new_preallocation(struct ext4_allocation_context *ac);
+static int ext4_mb_load_allocator(struct super_block *sb, ext4_group_t group,
+ struct ext4_buddy *e4b, int flags);
+static int mb_mark_used(struct ext4_buddy *e4b, struct ext4_free_extent *ex);
+static void ext4_mb_unload_allocator(struct ext4_buddy *e4b);
+
/*
* The algorithm using this percpu seq counter goes below:
* 1. We sample the percpu discard_pa_seq counter before trying for block
@@ -395,6 +401,33 @@ static inline void *mb_correct_addr_and_bit(int *bit, void *addr)
return addr;
}
+static inline int ext4_blkno_to_flex_offset(struct super_block *sb,
+ ext4_fsblk_t blkno)
+{
+ return blkno % (ext4_flex_bg_size(EXT4_SB(sb)) *
+ EXT4_SB(sb)->s_blocks_per_group);
+}
+
+static inline ext4_fsblk_t ext4_flex_offset_to_blkno(struct super_block *sb,
+ int flex_bg, int flex_offset)
+{
+ return flex_bg * ext4_flex_bg_size(EXT4_SB(sb)) *
+ EXT4_SB(sb)->s_blocks_per_group + flex_offset;
+}
+
+static inline int ext4_mb_frsp_on(struct super_block *sb)
+{
+ return test_opt2(sb, FREESPACE_TREE) &&
+ EXT4_SB(sb)->s_es->s_log_groups_per_flex;
+}
+
+static inline unsigned int ext4_num_grps_to_flexbg(struct super_block *sb,
+ int ngroups)
+{
+ return (ngroups + ext4_flex_bg_size(EXT4_SB(sb)) - 1) >>
+ (EXT4_SB(sb)->s_es->s_log_groups_per_flex);
+}
+
static inline int mb_test_bit(int bit, void *addr)
{
/*
@@ -453,6 +486,7 @@ static void *mb_find_buddy(struct ext4_buddy *e4b, int order, int *max)
{
char *bb;
+ WARN_ON(ext4_mb_frsp_on(e4b->bd_sb));
BUG_ON(e4b->bd_bitmap == e4b->bd_buddy);
BUG_ON(max == NULL);
@@ -620,6 +654,9 @@ static int __mb_check_buddy(struct ext4_buddy *e4b, char *file,
void *buddy;
void *buddy2;
+ if (ext4_mb_frsp_on(sb))
+ return 0;
+
{
static int mb_check_counter;
if (mb_check_counter++ % 100 != 0)
@@ -706,6 +743,729 @@ static int __mb_check_buddy(struct ext4_buddy *e4b, char *file,
#define mb_check_buddy(e4b)
#endif
+/* Generic macro for inserting a new node into a cached rbtree */
+#define ext4_mb_rb_insert(__root, __new, __node_t, __entry, __cmp) do { \
+ struct rb_node **iter = &((__root)->rb_root.rb_node), *parent = NULL; \
+ bool leftmost = true; \
+ __node_t *this = NULL; \
+ \
+ while (*iter) { \
+ this = rb_entry(*iter, __node_t, __entry); \
+ parent = *iter; \
+ if (__cmp((__new), this)) { \
+ iter = &((*iter)->rb_left); \
+ } else { \
+ iter = &((*iter)->rb_right); \
+ leftmost = false; \
+ } \
+ } \
+ rb_link_node(&(__new)->__entry, parent, iter); \
+ rb_insert_color_cached(&(__new)->__entry, __root, leftmost); \
+} while (0)
+
+static inline int ext4_mb_frsp_offset_cmp(struct ext4_frsp_node *arg1,
+ struct ext4_frsp_node *arg2)
+{
+ return (arg1->frsp_offset < arg2->frsp_offset);
+}
+
+/*
+ * Free space extent length comparison function; the nodes are sorted in
+ * decreasing order of length.
+ */
+static inline int ext4_mb_frsp_len_cmp(struct ext4_frsp_node *arg1,
+ struct ext4_frsp_node *arg2)
+{
+ return (arg1->frsp_len > arg2->frsp_len);
+}
+
+/* insert into the offset-indexed tree */
+static void ext4_mb_frsp_insert_off(struct ext4_frsp_tree *tree,
+ struct ext4_frsp_node *new_entry)
+{
+ memset(&new_entry->frsp_node, 0, sizeof(new_entry->frsp_node));
+ ext4_mb_rb_insert(&tree->frsp_offset_root, new_entry,
+ struct ext4_frsp_node, frsp_node, ext4_mb_frsp_offset_cmp);
+}
+
+/* insert into the length-sorted tree */
+static void ext4_mb_frsp_insert_len(struct ext4_frsp_tree *tree,
+ struct ext4_frsp_node *new_entry)
+{
+ memset(&new_entry->frsp_len_node, 0, sizeof(new_entry->frsp_len_node));
+ ext4_mb_rb_insert(&tree->frsp_len_root, new_entry,
+ struct ext4_frsp_node, frsp_len_node,
+ ext4_mb_frsp_len_cmp);
+}
+
+#ifdef CONFIG_EXT4_DEBUG
+/* print freespace_tree nodes in sorted (in-order) traversal */
+void ext4_mb_frsp_print_tree_len(struct super_block *sb,
+ struct ext4_frsp_tree *tree)
+{
+ unsigned int count = 0;
+ ext4_fsblk_t blk = 0, blk_end = 0;
+ ext4_group_t group = 0, group_end = 0;
+ struct ext4_frsp_node *entry = NULL;
+ struct ext4_sb_info *sbi = EXT4_SB(sb);
+ struct rb_node *cur;
+
+ cur = rb_first_cached(&tree->frsp_len_root);
+ mb_debug(sb, "\nTree\tNode\tlength\tblock\tgroup\n");
+
+ while (cur) {
+ entry = rb_entry(cur, struct ext4_frsp_node, frsp_len_node);
+ count++;
+ blk = ext4_flex_offset_to_blkno(sb, tree->frsp_index,
+ entry->frsp_offset);
+ blk_end = ext4_flex_offset_to_blkno(sb, tree->frsp_index,
+ entry->frsp_offset + entry->frsp_len - 1);
+ group = blk / sbi->s_blocks_per_group;
+ group_end = (blk_end-1) / sbi->s_blocks_per_group;
+ mb_debug(sb, "%u\t%u\t%u\t%u(%lld)--%llu\t%u--%u\n",
+ tree->frsp_index, count, entry->frsp_len,
+ entry->frsp_offset, blk, blk_end, group, group_end);
+ cur = rb_next(cur);
+ }
+}
+#endif
+
+static struct ext4_frsp_node *ext4_mb_frsp_alloc_node(struct super_block *sb)
+{
+ struct ext4_frsp_node *node;
+
+ node = kmem_cache_alloc(ext4_freespace_node_cachep, GFP_NOFS);
+ if (!node)
+ return NULL;
+
+ RB_CLEAR_NODE(&node->frsp_node);
+ RB_CLEAR_NODE(&node->frsp_len_node);
+
+ return node;
+}
+
+static void ext4_mb_frsp_free_node(struct super_block *sb,
+ struct ext4_frsp_node *node)
+{
+ kmem_cache_free(ext4_freespace_node_cachep, node);
+}
+
+/* Evict a tree from memory */
+void ext4_mb_frsp_free_tree(struct super_block *sb, struct ext4_frsp_tree *tree)
+{
+ struct ext4_frsp_node *frsp_node;
+ struct rb_node *node;
+
+ mb_debug(sb, "Evicting tree %d\n", tree->frsp_index);
+ mutex_lock(&tree->frsp_lock);
+ if (!(tree->frsp_flags & EXT4_MB_FRSP_FLAG_LOADED))
+ goto out;
+
+ node = rb_first_cached(&tree->frsp_offset_root);
+ while (node) {
+ frsp_node = rb_entry(node, struct ext4_frsp_node, frsp_node);
+ rb_erase_cached(node, &tree->frsp_offset_root);
+ rb_erase_cached(&frsp_node->frsp_len_node,
+ &tree->frsp_len_root);
+ ext4_mb_frsp_free_node(sb, frsp_node);
+ node = rb_first_cached(&tree->frsp_offset_root);
+ }
+ tree->frsp_flags &= ~EXT4_MB_FRSP_FLAG_LOADED;
+ tree->frsp_offset_root = RB_ROOT_CACHED;
+ tree->frsp_len_root = RB_ROOT_CACHED;
+out:
+ mutex_unlock(&tree->frsp_lock);
+}
+
+/*
+ * Search tree by flexbg offset. Returns the node containing "target". If
+ * no such node is present, returns NULL. Must be called with tree mutex held.
+ */
+struct ext4_frsp_node *ext4_mb_frsp_search_by_off(struct super_block *sb,
+ struct ext4_frsp_tree *tree,
+ ext4_grpblk_t target)
+{
+ struct rb_root *root = &tree->frsp_offset_root.rb_root;
+ struct rb_node *node = root->rb_node;
+ struct ext4_frsp_node *this = NULL;
+
+ while (node) {
+ this = rb_entry(node, struct ext4_frsp_node, frsp_node);
+ if (this->frsp_offset == target)
+ return this;
+ else if (target < this->frsp_offset)
+ node = node->rb_left;
+ else
+ node = node->rb_right;
+ }
+ return NULL;
+}
+
+/*
+ * Check if two entries in the freespace tree can be merged. Nodes
+ * can be merged only if they are physically contiguous and belong
+ * to the same block group.
+ */
+int ext4_mb_frsp_node_can_merge(struct super_block *sb,
+ struct ext4_frsp_node *prev_entry, struct ext4_frsp_node *cur_entry)
+{
+ if (prev_entry->frsp_offset + prev_entry->frsp_len !=
+ cur_entry->frsp_offset)
+ return 0;
+ if (prev_entry->frsp_offset / EXT4_SB(sb)->s_blocks_per_group !=
+ cur_entry->frsp_offset / EXT4_SB(sb)->s_blocks_per_group)
+ return 0;
+ return 1;
+}
+
+/*
+ * Add a new free space region to the tree. We insert the new node into
+ * both the length-sorted and the flex-bg-offset-sorted trees. Must be
+ * called with the tree mutex held.
+ */
+int ext4_mb_frsp_add_region(struct super_block *sb, struct ext4_frsp_tree *tree,
+ ext4_grpblk_t offset, ext4_grpblk_t len)
+{
+ struct ext4_frsp_node *new_entry = NULL, *next_entry = NULL,
+ *prev_entry = NULL;
+ struct rb_node *left = NULL, *right = NULL;
+
+ new_entry = ext4_mb_frsp_alloc_node(sb);
+ if (!new_entry)
+ return -ENOMEM;
+
+ new_entry->frsp_offset = offset;
+ new_entry->frsp_len = len;
+
+ ext4_mb_frsp_insert_off(tree, new_entry);
+ /* try merge to left and right */
+ /* left */
+ left = rb_prev(&new_entry->frsp_node);
+ if (left) {
+ prev_entry = rb_entry(left, struct ext4_frsp_node, frsp_node);
+ if (ext4_mb_frsp_node_can_merge(sb, prev_entry, new_entry)) {
+ new_entry->frsp_offset = prev_entry->frsp_offset;
+ new_entry->frsp_len += prev_entry->frsp_len;
+ rb_erase_cached(left, &tree->frsp_offset_root);
+ rb_erase_cached(&prev_entry->frsp_len_node,
+ &tree->frsp_len_root);
+ ext4_mb_frsp_free_node(sb, prev_entry);
+ }
+ }
+
+ /* right */
+ right = rb_next(&new_entry->frsp_node);
+ if (right) {
+ next_entry = rb_entry(right, struct ext4_frsp_node, frsp_node);
+ if (ext4_mb_frsp_node_can_merge(sb, new_entry, next_entry)) {
+ new_entry->frsp_len += next_entry->frsp_len;
+ rb_erase_cached(right, &tree->frsp_offset_root);
+ rb_erase_cached(&next_entry->frsp_len_node,
+ &tree->frsp_len_root);
+ ext4_mb_frsp_free_node(sb, next_entry);
+ }
+ }
+ ext4_mb_frsp_insert_len(tree, new_entry);
+
+ return 0;
+}
+
+/*
+ * Free "length" blocks starting at block number "block" in the free space
+ * tree.
+ */
+int ext4_mb_frsp_free_blocks(struct super_block *sb, ext4_group_t group,
+ ext4_grpblk_t block, ext4_grpblk_t length)
+{
+ struct ext4_sb_info *sbi = EXT4_SB(sb);
+ struct ext4_frsp_tree *tree =
+ ext4_get_frsp_tree(sb, ext4_flex_group(sbi, group));
+ int err = 0;
+
+ mutex_lock(&tree->frsp_lock);
+ err = ext4_mb_frsp_add_region(sb, tree,
+ ext4_blkno_to_flex_offset(sb, block), length);
+ mutex_unlock(&tree->frsp_lock);
+
+ return err;
+}
+
+/*
+ * Create the freespace tree from the on-disk bitmaps. Must be called with
+ * the tree mutex held. Returns 0 on success, an error code otherwise.
+ */
+int ext4_mb_frsp_bb_to_tree(struct ext4_buddy *e4b, ext4_group_t group,
+ struct buffer_head *bh)
+{
+ struct super_block *sb = e4b->bd_sb;
+ int length = 0, bit = 0, next;
+ int end = EXT4_SB(sb)->s_blocks_per_group;
+ ext4_fsblk_t block;
+ int ret = 0;
+
+ /* find all unused blocks in the bitmap, convert them to new tree nodes */
+ while (bit < end) {
+ bit = mb_find_next_zero_bit(bh->b_data, end, bit);
+ if (bit >= end)
+ break;
+
+ next = mb_find_next_bit(bh->b_data, end, bit);
+ length = next - bit;
+ block = ext4_group_first_block_no(sb, group) + bit;
+
+ ret = ext4_mb_frsp_add_region(sb, e4b->frsp_tree,
+ ext4_blkno_to_flex_offset(sb, block), length);
+ if (ret)
+ break;
+ bit = next + 1;
+ }
+
+ return ret;
+}
+
+/*
+ * Load one freespace_tree from disk. Assumes the tree mutex is held.
+ */
+int ext4_mb_frsp_load(struct ext4_buddy *e4b)
+{
+ ext4_group_t ngroups, group_start, group_end, grp;
+ struct ext4_sb_info *sbi = EXT4_SB(e4b->bd_sb);
+ int i, ret = 0;
+ struct buffer_head **bh = NULL;
+
+ if (e4b->frsp_tree->frsp_flags & EXT4_MB_FRSP_FLAG_LOADED)
+ return 0;
+
+ group_start = e4b->frsp_tree->frsp_index * ext4_flex_bg_size(sbi);
+ group_end = min(group_start + ext4_flex_bg_size(sbi),
+ ext4_get_groups_count(e4b->bd_sb)) - 1;
+ ngroups = group_end - group_start + 1;
+
+ bh = kcalloc(ngroups, sizeof(*bh), GFP_KERNEL);
+ if (!bh)
+ return -ENOMEM;
+ for (i = 0; i < ngroups; i++) {
+ grp = i + group_start;
+ bh[i] = ext4_read_block_bitmap_nowait(e4b->bd_sb, grp, false);
+ if (IS_ERR(bh[i])) {
+ ret = PTR_ERR(bh[i]);
+ goto out;
+ }
+ }
+ for (i = 0; i < ngroups; i++) {
+ grp = i + group_start;
+ if (!bitmap_uptodate(bh[i])) {
+ ret = ext4_wait_block_bitmap(e4b->bd_sb, grp, bh[i]);
+ if (ret)
+ goto out;
+ }
+ ret = ext4_mb_frsp_bb_to_tree(e4b, grp, bh[i]);
+ if (ret)
+ goto out;
+ }
+ e4b->frsp_tree->frsp_flags |= EXT4_MB_FRSP_FLAG_LOADED;
+out:
+ for (i = 0; i < ngroups; i++) {
+ if (!IS_ERR_OR_NULL(bh[i]))
+ put_bh(bh[i]);
+ if (!ret && IS_ERR(bh[i]))
+ ret = PTR_ERR(bh[i]);
+ }
+ kfree(bh);
+ return ret;
+
+}
+
+/*
+ * Determine if a node with length "len" is better than what we have found
+ * so far. Return 1 if that is the case, 0 otherwise. If "len" is an exact
+ * match, set *best = 1.
+ */
+static int ext4_mb_frsp_is_better(struct ext4_allocation_context *ac,
+ int len, int *best)
+{
+ struct ext4_tree_extent *btx = &ac->ac_b_tree_ex;
+ struct ext4_free_extent *gex = &ac->ac_g_ex;
+
+ if (best)
+ *best = 0;
+ if (len == gex->fe_len) {
+ if (best)
+ *best = 1;
+ return 1;
+ }
+ if (ac->ac_criteria == 0 && len < gex->fe_len)
+ return 0;
+ /*
+ * See if the new node is closer to the goal length than
+ * the best extent found so far
+ */
+ if (btx->te_len < gex->fe_len && len > btx->te_len)
+ return 1;
+ if (len > gex->fe_len && len < btx->te_len)
+ return 1;
+ return 0;
+}
+
+/*
+ * Check if we have scanned a sufficient number of freespace candidates;
+ * stop scanning once we have reached/exceeded s_max_to_scan.
+ */
+static void ext4_mb_frsp_check_limits(struct ext4_allocation_context *ac)
+{
+ struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
+
+ if (ac->ac_status == AC_STATUS_FOUND)
+ return;
+
+ /*
+ * Exceeded max number of nodes to scan
+ */
+ if (ac->ac_found > sbi->s_mb_max_to_scan &&
+ !(ac->ac_flags & EXT4_MB_HINT_FIRST))
+ ac->ac_status = AC_STATUS_BREAK;
+}
+
+/*
+ * Mark the free space in the selected tree node as used and update the tree.
+ * This must be called with the tree lock held.
+ */
+static void ext4_mb_frsp_use_best_found(struct ext4_allocation_context *ac,
+ ext4_group_t flex, struct ext4_frsp_node *selected)
+{
+ struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
+ struct ext4_tree_extent *btx = &ac->ac_b_tree_ex;
+ struct ext4_free_extent *bex;
+ unsigned int group_no;
+ struct ext4_buddy e4b;
+
+ WARN_ON(ac->ac_status == AC_STATUS_FOUND);
+ btx->te_len = min(btx->te_len, ac->ac_g_ex.fe_len);
+ ac->ac_status = AC_STATUS_FOUND;
+
+ /* update ac best-found-extent */
+ bex = &ac->ac_b_ex;
+ group_no = (btx->te_flex * ext4_flex_bg_size(sbi)) +
+ (btx->te_flex_start / sbi->s_blocks_per_group);
+ bex->fe_start = btx->te_flex_start % sbi->s_blocks_per_group;
+ bex->fe_group = group_no;
+ bex->fe_len = btx->te_len;
+ bex->fe_node = selected;
+
+ /* Add free blocks to our tree */
+ ext4_mb_load_allocator(ac->ac_sb, group_no, &e4b,
+ EXT4_ALLOCATOR_FRSP_NOLOAD);
+ mb_mark_used(&e4b, bex);
+ ext4_mb_unload_allocator(&e4b);
+}
+/*
+ * The routine checks whether the found tree node is good enough. If it is,
+ * then the tree node gets marked used and a flag is set in the context
+ * to stop scanning.
+ *
+ * Otherwise, the tree node is compared with the previously found extent
+ * and, if the new one is better, it is stored in the context.
+ *
+ * Later, the best found tree node will be used, if mballoc can't find good
+ * enough extent.
+ */
+static int ext4_mb_frsp_measure_node(struct ext4_allocation_context *ac,
+ int tree_idx, struct ext4_frsp_node *cur,
+ int goal)
+{
+ struct ext4_tree_extent *btx = &ac->ac_b_tree_ex;
+ ext4_fsblk_t start;
+ int best_found = 0, max;
+
+ WARN_ON(btx->te_len < 0);
+ WARN_ON(ac->ac_status != AC_STATUS_CONTINUE);
+
+ if (goal) {
+ start = ext4_group_first_block_no(ac->ac_sb,
+ ac->ac_g_ex.fe_group) +
+ ac->ac_g_ex.fe_start;
+ if (cur->frsp_offset > ext4_blkno_to_flex_offset(ac->ac_sb,
+ start))
+ return 0;
+ max = cur->frsp_offset + cur->frsp_len -
+ ext4_blkno_to_flex_offset(ac->ac_sb, start);
+ if (max >= ac->ac_g_ex.fe_len &&
+ ac->ac_g_ex.fe_len == EXT4_SB(ac->ac_sb)->s_stripe) {
+ if (do_div(start, EXT4_SB(ac->ac_sb)->s_stripe) != 0)
+ return 0;
+ best_found = 1;
+ } else if (max >= ac->ac_g_ex.fe_len) {
+ best_found = 1;
+ } else if (max > 0 && (ac->ac_flags & EXT4_MB_HINT_MERGE)) {
+ /*
+ * Sometimes, caller may want to merge even small
+ * number of blocks to an existing extent
+ */
+ best_found = 1;
+
+ } else {
+ return 0;
+ }
+ ac->ac_found++;
+ goto use_extent;
+ }
+ ac->ac_found++;
+
+ /* The special case - take what you catch first */
+ if (unlikely(ac->ac_flags & EXT4_MB_HINT_FIRST)) {
+ best_found = 1;
+ goto use_extent;
+ }
+
+ /*
+ * Prefer not to allocate data blocks from the first group in a
+ * flex bg.
+ */
+ if (unlikely((ac->ac_criteria < 2) &&
+ (ac->ac_flags & EXT4_MB_HINT_DATA) &&
+ (cur->frsp_offset < EXT4_BLOCKS_PER_GROUP(ac->ac_sb))))
+ return 1;
+
+
+ /* If this is first found extent, just store it in the context */
+ if (btx->te_len == 0)
+ goto use_extent;
+
+ if (ext4_mb_frsp_is_better(ac, cur->frsp_len, &best_found))
+ goto use_extent;
+
+ ext4_mb_frsp_check_limits(ac);
+ return 0;
+
+use_extent:
+ btx->te_flex = tree_idx;
+ if (goal) {
+ btx->te_flex_start = ext4_blkno_to_flex_offset(ac->ac_sb,
+ start);
+ btx->te_len = max;
+ } else {
+ btx->te_flex_start = cur->frsp_offset;
+ btx->te_len = cur->frsp_len;
+ }
+ if (best_found)
+ ext4_mb_frsp_use_best_found(ac, tree_idx, cur);
+ ext4_mb_frsp_check_limits(ac);
+ return 1;
+}
+
+/* Search by goal */
+static noinline_for_stack
+void ext4_mb_frsp_find_by_goal(struct ext4_allocation_context *ac)
+{
+ struct ext4_frsp_node *cur = NULL;
+ unsigned int tree_blk;
+ ext4_fsblk_t blk;
+ struct ext4_buddy e4b;
+ int ret;
+
+ if (!(ac->ac_flags & EXT4_MB_HINT_TRY_GOAL))
+ return;
+
+ /* compute start node offset in tree */
+ blk = ext4_group_first_block_no(ac->ac_sb, ac->ac_g_ex.fe_group) +
+ ac->ac_g_ex.fe_start;
+ tree_blk = ext4_blkno_to_flex_offset(ac->ac_sb, blk);
+
+ ret = ext4_mb_load_allocator(ac->ac_sb, ac->ac_g_ex.fe_group, &e4b, 0);
+ if (ret)
+ return;
+
+ /* try goal block and its freespace_tree first */
+ mutex_lock(&e4b.frsp_tree->frsp_lock);
+ cur = ext4_mb_frsp_search_by_off(ac->ac_sb, e4b.frsp_tree, tree_blk);
+ if (cur && tree_blk < cur->frsp_offset + cur->frsp_len - 1)
+ ext4_mb_frsp_measure_node(ac, e4b.frsp_tree->frsp_index, cur,
+ 1 /* searching by goal */);
+
+ mutex_unlock(&e4b.frsp_tree->frsp_lock);
+ ext4_mb_unload_allocator(&e4b);
+}
+
+static void ext4_mb_frsp_process(struct ext4_allocation_context *ac,
+ struct ext4_frsp_tree *tree)
+{
+ struct ext4_frsp_node *cur = NULL;
+ struct rb_node *node = NULL;
+ int ret;
+
+ node = rb_first_cached(&tree->frsp_len_root);
+ mb_debug(ac->ac_sb, "Processing tree index %d, flags = %x\n",
+ tree->frsp_index, tree->frsp_flags);
+ /*
+ * In order to serve the "No data blocks in first group in a flexbg"
+ * requirement, we cannot do an O(log N) search here. TODO: it's possible
+ * to do that at CR=2, where that requirement doesn't apply.
+ */
+ while (node && ac->ac_status == AC_STATUS_CONTINUE) {
+ cur = rb_entry(node, struct ext4_frsp_node, frsp_len_node);
+ if (ac->ac_criteria == 0 && cur->frsp_len < ac->ac_g_ex.fe_len)
+ return;
+ ret = ext4_mb_frsp_measure_node(ac, tree->frsp_index, cur,
+ 0 /* searching by len */);
+ if (ret == 0)
+ return;
+ node = rb_next(node);
+ }
+}
+
+/* allocator for freespace_tree */
+static noinline_for_stack int
+ext4_mb_tree_allocator(struct ext4_allocation_context *ac)
+{
+ struct ext4_buddy e4b;
+ struct ext4_sb_info *sbi;
+ struct super_block *sb;
+ struct ext4_frsp_tree *tree = NULL;
+ struct ext4_frsp_node *cur = NULL;
+ struct ext4_tree_extent *btx = NULL;
+ int ret = 0, start_idx = 0, tree_idx, j;
+
+ sb = ac->ac_sb;
+ btx = &ac->ac_b_tree_ex;
+ sbi = EXT4_SB(sb);
+
+ start_idx = ext4_flex_group(sbi, ac->ac_g_ex.fe_group);
+ mb_debug(sb, "requested size %d\n", ac->ac_g_ex.fe_len);
+ /* First try searching from goal blk in start-idx-th freespace tree */
+ ext4_mb_frsp_find_by_goal(ac);
+ if (ac->ac_status == AC_STATUS_FOUND)
+ goto out;
+
+ if (unlikely(ac->ac_flags & EXT4_MB_HINT_GOAL_ONLY))
+ goto out;
+
+ ac->ac_criteria = 0;
+ ac->ac_groups_scanned = 0;
+repeat:
+
+ /* Loop through the rest of trees (flex_bg) */
+ for (j = start_idx;
+ (ac->ac_groups_scanned < sbi->s_mb_num_frsp_trees) &&
+ ac->ac_status == AC_STATUS_CONTINUE; j++) {
+ ac->ac_groups_scanned++;
+ tree_idx = (j % sbi->s_mb_num_frsp_trees);
+
+ ret = ext4_mb_load_allocator(sb,
+ tree_idx * ext4_flex_bg_size(sbi), &e4b, 0);
+ if (ret)
+ goto out;
+ mutex_lock(&e4b.frsp_tree->frsp_lock);
+ ext4_mb_frsp_process(ac, e4b.frsp_tree);
+ mutex_unlock(&e4b.frsp_tree->frsp_lock);
+ ext4_mb_unload_allocator(&e4b);
+ }
+
+ if (ac->ac_status != AC_STATUS_FOUND) {
+ if (ac->ac_criteria < 2) {
+ ac->ac_criteria++;
+ ac->ac_groups_scanned = 0;
+ mb_debug(sb, "Falling back to CR=%d", ac->ac_criteria);
+ goto repeat;
+ }
+ if (btx->te_len > 0 && !(ac->ac_flags & EXT4_MB_HINT_FIRST)) {
+ e4b.frsp_flags = 0;
+ ret = ext4_mb_load_allocator(sb, btx->te_flex *
+ ext4_flex_bg_size(sbi), &e4b, 0);
+ if (ret)
+ goto out;
+ tree = e4b.frsp_tree;
+ mutex_lock(&tree->frsp_lock);
+ cur = ext4_mb_frsp_search_by_off(sb, tree,
+ btx->te_flex_start);
+ if (cur) {
+ ext4_mb_frsp_use_best_found(ac, btx->te_flex,
+ cur);
+ mutex_unlock(&tree->frsp_lock);
+ ext4_mb_unload_allocator(&e4b);
+
+ } else {
+ /*
+ * Someone else took this freespace node before
+ * us. Reset the best-found tree extent, and
+ * turn on FIRST HINT (greedy).
+ */
+ mutex_unlock(&tree->frsp_lock);
+ ac->ac_b_tree_ex.te_flex_start = 0;
+ ac->ac_b_tree_ex.te_flex = 0;
+ ac->ac_b_tree_ex.te_len = 0;
+ ac->ac_status = AC_STATUS_CONTINUE;
+ ac->ac_flags |= EXT4_MB_HINT_FIRST;
+ ac->ac_groups_scanned = 0;
+ ext4_mb_unload_allocator(&e4b);
+ goto repeat;
+ }
+ }
+ }
+
+ ret = btx->te_len ? 0 : -ENOSPC;
+
+out:
+ return ret;
+}
+
+static void ext4_mb_frsp_init_tree(struct ext4_frsp_tree *tree, int index)
+{
+ tree->frsp_offset_root = RB_ROOT_CACHED;
+ tree->frsp_len_root = RB_ROOT_CACHED;
+ mutex_init(&(tree->frsp_lock));
+ tree->frsp_flags = 0;
+ tree->frsp_index = index;
+}
+
+int ext4_mb_init_freespace_trees(struct super_block *sb)
+{
+ struct ext4_sb_info *sbi = EXT4_SB(sb);
+ struct flex_groups *fg;
+ int i;
+
+ sbi->s_mb_num_frsp_trees =
+ ext4_num_grps_to_flexbg(sb, ext4_get_groups_count(sb));
+
+ for (i = 0; i < sbi->s_mb_num_frsp_trees; i++) {
+ fg = sbi_array_rcu_deref(sbi, s_flex_groups, i);
+ fg->frsp_tree = kzalloc(sizeof(struct ext4_frsp_tree),
+ GFP_KERNEL);
+ if (!fg->frsp_tree)
+ return -ENOMEM;
+ ext4_mb_frsp_init_tree(fg->frsp_tree, i);
+ }
+
+ return 0;
+}
+
+int ext4_mb_add_frsp_trees(struct super_block *sb, int ngroups)
+{
+ struct ext4_sb_info *sbi = EXT4_SB(sb);
+ int flex_bg_count, old_trees_count = sbi->s_mb_num_frsp_trees;
+ int i;
+
+ if (!ext4_mb_frsp_on(sb))
+ return 0;
+
+ flex_bg_count = ext4_num_grps_to_flexbg(sb, ngroups);
+ if (old_trees_count > 0)
+ ext4_mb_frsp_free_tree(sb, ext4_get_frsp_tree(sb,
+ old_trees_count - 1));
+
+ for (i = old_trees_count; i < flex_bg_count; i++) {
+ struct flex_groups *fg;
+
+ fg = sbi_array_rcu_deref(sbi, s_flex_groups, i);
+ fg->frsp_tree = kzalloc(sizeof(*fg->frsp_tree), GFP_KERNEL);
+ if (!fg->frsp_tree)
+ return -ENOMEM;
+ ext4_mb_frsp_init_tree(fg->frsp_tree, i);
+ }
+ sbi->s_mb_num_frsp_trees = flex_bg_count;
+
+ return 0;
+}
+
/*
* Divide blocks started from @first with length @len into
* smaller chunks with power of 2 blocks.
@@ -1042,6 +1802,9 @@ static int ext4_mb_get_buddy_page_lock(struct super_block *sb,
e4b->bd_buddy_page = NULL;
e4b->bd_bitmap_page = NULL;
+ if (ext4_mb_frsp_on(sb))
+ return 0;
+
blocks_per_page = PAGE_SIZE / sb->s_blocksize;
/*
* the buddy cache inode stores the block bitmap
@@ -1099,6 +1862,7 @@ int ext4_mb_init_group(struct super_block *sb, ext4_group_t group, gfp_t gfp)
struct page *page;
int ret = 0;
+ WARN_ON(ext4_mb_frsp_on(sb));
might_sleep();
mb_debug(sb, "init group %u\n", group);
this_grp = ext4_get_group_info(sb, group);
@@ -1156,6 +1920,11 @@ int ext4_mb_init_group(struct super_block *sb, ext4_group_t group, gfp_t gfp)
* Locking note: This routine calls ext4_mb_init_cache(), which takes the
* block group lock of all groups for this page; do not hold the BG lock when
* calling this routine!
+ *
+ * For freespace trees, do not hold the tree mutex while calling this routine.
+ * It is okay to hold that mutex only if the EXT4_ALLOCATOR_FRSP_NOLOAD flag is
+ * set in e4b->frsp_flags. If that flag is set, this function doesn't try
+ * to load an unloaded tree.
*/
static noinline_for_stack int
ext4_mb_load_allocator_gfp(struct super_block *sb, ext4_group_t group,
@@ -1166,7 +1935,7 @@ ext4_mb_load_allocator_gfp(struct super_block *sb, ext4_group_t group,
int pnum;
int poff;
struct page *page;
- int ret;
+ int ret = 0;
struct ext4_group_info *grp;
struct ext4_sb_info *sbi = EXT4_SB(sb);
struct inode *inode = sbi->s_buddy_cache;
@@ -1183,6 +1952,22 @@ ext4_mb_load_allocator_gfp(struct super_block *sb, ext4_group_t group,
e4b->bd_group = group;
e4b->bd_buddy_page = NULL;
e4b->bd_bitmap_page = NULL;
+ if (ext4_mb_frsp_on(sb)) {
+ e4b->frsp_tree = ext4_get_frsp_tree(sb,
+ ext4_flex_group(sbi, group));
+ e4b->frsp_flags = flags;
+ if (flags & EXT4_ALLOCATOR_FRSP_NOLOAD)
+ return 0;
+
+ mutex_lock(&e4b->frsp_tree->frsp_lock);
+ if (e4b->frsp_tree->frsp_flags & EXT4_MB_FRSP_FLAG_LOADED) {
+ mutex_unlock(&e4b->frsp_tree->frsp_lock);
+ return 0;
+ }
+ ret = ext4_mb_frsp_load(e4b);
+ mutex_unlock(&e4b->frsp_tree->frsp_lock);
+ return ret;
+ }
if (unlikely(EXT4_MB_GRP_NEED_INIT(grp))) {
/*
@@ -1305,6 +2090,8 @@ static int ext4_mb_load_allocator(struct super_block *sb, ext4_group_t group,
static void ext4_mb_unload_allocator(struct ext4_buddy *e4b)
{
+ if (ext4_mb_frsp_on(e4b->bd_sb))
+ return;
if (e4b->bd_bitmap_page)
put_page(e4b->bd_bitmap_page);
if (e4b->bd_buddy_page)
@@ -1497,6 +2284,16 @@ static void mb_free_blocks(struct inode *inode, struct ext4_buddy *e4b,
if (first < e4b->bd_info->bb_first_free)
e4b->bd_info->bb_first_free = first;
+ if (ext4_mb_frsp_on(sb)) {
+ ext4_grpblk_t first_blk = EXT4_C2B(EXT4_SB(sb), first) +
+ ext4_group_first_block_no(sb, e4b->bd_group);
+
+ ext4_unlock_group(sb, e4b->bd_group);
+ ext4_mb_frsp_free_blocks(sb, e4b->bd_group, first_blk, count);
+ ext4_lock_group(sb, e4b->bd_group);
+ return;
+ }
+
/* access memory sequentially: check left neighbour,
* clear range and then check right neighbour
*/
@@ -1563,6 +2360,9 @@ static int mb_find_extent(struct ext4_buddy *e4b, int block,
assert_spin_locked(ext4_group_lock_ptr(e4b->bd_sb, e4b->bd_group));
BUG_ON(ex == NULL);
+ if (ext4_mb_frsp_on(e4b->bd_sb))
+ return 0;
+
buddy = mb_find_buddy(e4b, 0, &max);
BUG_ON(buddy == NULL);
BUG_ON(block >= max);
@@ -1627,17 +2427,74 @@ static int mb_mark_used(struct ext4_buddy *e4b, struct ext4_free_extent *ex)
unsigned ret = 0;
int len0 = len;
void *buddy;
+ ext4_grpblk_t flex_offset;
BUG_ON(start + len > (e4b->bd_sb->s_blocksize << 3));
BUG_ON(e4b->bd_group != ex->fe_group);
+
+ e4b->bd_info->bb_free -= len;
+ if (e4b->bd_info->bb_first_free == start)
+ e4b->bd_info->bb_first_free += len;
+
+ if (ext4_mb_frsp_on(e4b->bd_sb)) {
+ struct ext4_frsp_node *new;
+
+ flex_offset = ext4_blkno_to_flex_offset(e4b->bd_sb,
+ ext4_group_first_block_no(e4b->bd_sb,
+ e4b->bd_group) + ex->fe_start);
+ if (!ex->fe_node) {
+ ex->fe_node = ext4_mb_frsp_search_by_off(e4b->bd_sb,
+ e4b->frsp_tree, flex_offset);
+ if (!ex->fe_node)
+ return 0;
+ }
+ /*
+ * Remove the node from the trees before we modify it, since
+ * we will change the length and/or offset of this node.
+ */
+ rb_erase_cached(&ex->fe_node->frsp_node,
+ &e4b->frsp_tree->frsp_offset_root);
+ rb_erase_cached(&ex->fe_node->frsp_len_node,
+ &e4b->frsp_tree->frsp_len_root);
+ RB_CLEAR_NODE(&ex->fe_node->frsp_node);
+ RB_CLEAR_NODE(&ex->fe_node->frsp_len_node);
+ if (flex_offset > ex->fe_node->frsp_offset) {
+ if (flex_offset + ex->fe_len !=
+ ex->fe_node->frsp_offset +
+ ex->fe_node->frsp_len) {
+ /* Need to split the node */
+ new = ext4_mb_frsp_alloc_node(e4b->bd_sb);
+ if (!new)
+ return -ENOMEM;
+ new->frsp_offset = flex_offset + ex->fe_len;
+ new->frsp_len = (ex->fe_node->frsp_offset +
+ ex->fe_node->frsp_len) -
+ new->frsp_offset;
+ ext4_mb_frsp_insert_len(e4b->frsp_tree, new);
+ ext4_mb_frsp_insert_off(e4b->frsp_tree, new);
+ }
+ ex->fe_node->frsp_len = flex_offset -
+ ex->fe_node->frsp_offset;
+ } else if (ex->fe_len < ex->fe_node->frsp_len) {
+ /* used only a part: update node */
+ ex->fe_node->frsp_offset += ex->fe_len;
+ ex->fe_node->frsp_len -= ex->fe_len;
+ } else {
+ ext4_mb_frsp_free_node(e4b->bd_sb, ex->fe_node);
+ return 0;
+ }
+
+ ext4_mb_frsp_insert_len(e4b->frsp_tree, ex->fe_node);
+ ext4_mb_frsp_insert_off(e4b->frsp_tree, ex->fe_node);
+
+ return 0;
+ }
+
assert_spin_locked(ext4_group_lock_ptr(e4b->bd_sb, e4b->bd_group));
mb_check_buddy(e4b);
mb_mark_used_double(e4b, start, len);
this_cpu_inc(discard_pa_seq);
- e4b->bd_info->bb_free -= len;
- if (e4b->bd_info->bb_first_free == start)
- e4b->bd_info->bb_first_free += len;
/* let's maintain fragments counter */
if (start != 0)
@@ -2778,6 +3635,13 @@ static int ext4_mb_init_backend(struct super_block *sb)
rcu_read_lock();
kvfree(rcu_dereference(sbi->s_group_info));
rcu_read_unlock();
+ if (ext4_mb_frsp_on(sb)) {
+ for (i = 0; i < sbi->s_mb_num_frsp_trees; i++) {
+ struct ext4_frsp_tree *tree = ext4_get_frsp_tree(sb, i);
+
+ kfree(tree);
+ }
+ }
return -ENOMEM;
}
@@ -2874,6 +3738,13 @@ int ext4_mb_init(struct super_block *sb)
i++;
} while (i <= sb->s_blocksize_bits + 1);
+ /* init for freespace trees */
+ if (ext4_mb_frsp_on(sb)) {
+ ret = ext4_mb_init_freespace_trees(sb);
+ if (ret)
+ goto out;
+ }
+
spin_lock_init(&sbi->s_md_lock);
spin_lock_init(&sbi->s_bal_lock);
sbi->s_mb_free_pending = 0;
@@ -2940,6 +3811,13 @@ int ext4_mb_init(struct super_block *sb)
sbi->s_mb_offsets = NULL;
kfree(sbi->s_mb_maxs);
sbi->s_mb_maxs = NULL;
+ if (ext4_mb_frsp_on(sb)) {
+ for (i = 0; i < sbi->s_mb_num_frsp_trees; i++) {
+ struct ext4_frsp_tree *tree = ext4_get_frsp_tree(sb, i);
+
+ kfree(tree);
+ }
+ }
return ret;
}
@@ -3084,8 +3962,10 @@ static void ext4_free_data_in_buddy(struct super_block *sb,
/* No more items in the per group rb tree
* balance refcounts from ext4_mb_free_metadata()
*/
- put_page(e4b.bd_buddy_page);
- put_page(e4b.bd_bitmap_page);
+ if (!ext4_mb_frsp_on(sb)) {
+ put_page(e4b.bd_buddy_page);
+ put_page(e4b.bd_bitmap_page);
+ }
}
ext4_unlock_group(sb, entry->efd_group);
kmem_cache_free(ext4_free_data_cachep, entry);
@@ -3163,9 +4043,14 @@ int __init ext4_init_mballoc(void)
SLAB_RECLAIM_ACCOUNT);
if (ext4_free_data_cachep == NULL)
goto out_ac_free;
+ ext4_freespace_node_cachep = KMEM_CACHE(ext4_frsp_node,
+ SLAB_RECLAIM_ACCOUNT);
+ if (ext4_freespace_node_cachep == NULL)
+ goto out_frsp_free;
return 0;
-
+out_frsp_free:
+ kmem_cache_destroy(ext4_free_data_cachep);
out_ac_free:
kmem_cache_destroy(ext4_ac_cachep);
out_pa_free:
@@ -3184,6 +4069,7 @@ void ext4_exit_mballoc(void)
kmem_cache_destroy(ext4_pspace_cachep);
kmem_cache_destroy(ext4_ac_cachep);
kmem_cache_destroy(ext4_free_data_cachep);
+ kmem_cache_destroy(ext4_freespace_node_cachep);
ext4_groupinfo_destroy_slabs();
}
@@ -3686,6 +4572,9 @@ ext4_mb_use_preallocated(struct ext4_allocation_context *ac)
if (!(ac->ac_flags & EXT4_MB_HINT_DATA))
return false;
+ if (ext4_mb_frsp_on(ac->ac_sb))
+ return 0;
+
/* first, try per-file preallocation */
rcu_read_lock();
list_for_each_entry_rcu(pa, &ei->i_prealloc_list, pa_inode_list) {
@@ -4362,6 +5251,8 @@ static int ext4_mb_pa_alloc(struct ext4_allocation_context *ac)
struct ext4_prealloc_space *pa;
BUG_ON(ext4_pspace_cachep == NULL);
+ if (ext4_mb_frsp_on(ac->ac_sb))
+ return 0;
pa = kmem_cache_zalloc(ext4_pspace_cachep, GFP_NOFS);
if (!pa)
return -ENOMEM;
@@ -4374,6 +5265,8 @@ static void ext4_mb_pa_free(struct ext4_allocation_context *ac)
{
struct ext4_prealloc_space *pa = ac->ac_pa;
+ if (ext4_mb_frsp_on(ac->ac_sb))
+ return;
BUG_ON(!pa);
ac->ac_pa = NULL;
WARN_ON(!atomic_dec_and_test(&pa->pa_count));
@@ -4547,6 +5440,13 @@ ext4_mb_initialize_context(struct ext4_allocation_context *ac,
ac->ac_o_ex.fe_len = len;
ac->ac_g_ex = ac->ac_o_ex;
ac->ac_flags = ar->flags;
+ if (ext4_mb_frsp_on(sb))
+ ac->ac_flags |= EXT4_MB_HINT_NOPREALLOC;
+
+ /* set up best-found tree node */
+ ac->ac_b_tree_ex.te_flex_start = 0;
+ ac->ac_b_tree_ex.te_flex = 0;
+ ac->ac_b_tree_ex.te_len = 0;
/* we have to define context: we'll we work with a file or
* locality group. this is a policy, actually */
@@ -4867,7 +5767,11 @@ ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle,
goto errout;
repeat:
/* allocate space in core */
- *errp = ext4_mb_regular_allocator(ac);
+ if (ext4_mb_frsp_on(sb))
+ *errp = ext4_mb_tree_allocator(ac);
+ else
+ *errp = ext4_mb_regular_allocator(ac);
+
/*
* pa allocated above is added to grp->bb_prealloc_list only
* when we were able to allocate some block i.e. when
@@ -4880,8 +5784,13 @@ ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle,
ext4_discard_allocated_blocks(ac);
goto errout;
}
- if (ac->ac_status == AC_STATUS_FOUND &&
- ac->ac_o_ex.fe_len >= ac->ac_f_ex.fe_len)
+ /*
+ * Freespace trees should never return more than what was asked
+ * for.
+ */
+ if (!ext4_mb_frsp_on(sb) &&
+ (ac->ac_status == AC_STATUS_FOUND &&
+ ac->ac_o_ex.fe_len >= ac->ac_f_ex.fe_len))
ext4_mb_pa_free(ac);
}
if (likely(ac->ac_status == AC_STATUS_FOUND)) {
@@ -4972,13 +5881,12 @@ ext4_mb_free_metadata(handle_t *handle, struct ext4_buddy *e4b,
struct rb_node *parent = NULL, *new_node;
BUG_ON(!ext4_handle_valid(handle));
- BUG_ON(e4b->bd_bitmap_page == NULL);
- BUG_ON(e4b->bd_buddy_page == NULL);
+ BUG_ON(!ext4_mb_frsp_on(sb) && e4b->bd_bitmap_page == NULL);
new_node = &new_entry->efd_node;
cluster = new_entry->efd_start_cluster;
- if (!*n) {
+ if (!*n && !ext4_mb_frsp_on(sb)) {
/* first free block exent. We need to
protect buddy cache from being freed,
* otherwise we'll refresh it from
@@ -5457,6 +6365,7 @@ __acquires(bitlock)
ex.fe_start = start;
ex.fe_group = group;
ex.fe_len = count;
+ ex.fe_node = NULL;
/*
* Mark blocks used, so no one can reuse them while
@@ -5496,6 +6405,7 @@ ext4_trim_all_free(struct super_block *sb, ext4_group_t group,
void *bitmap;
ext4_grpblk_t next, count = 0, free_count = 0;
struct ext4_buddy e4b;
+ struct buffer_head *bh = NULL;
int ret = 0;
trace_ext4_trim_all_free(sb, group, start, max);
@@ -5506,7 +6416,17 @@ ext4_trim_all_free(struct super_block *sb, ext4_group_t group,
ret, group);
return ret;
}
- bitmap = e4b.bd_bitmap;
+
+ if (ext4_mb_frsp_on(sb)) {
+ bh = ext4_read_block_bitmap(sb, group);
+ if (IS_ERR(bh)) {
+ ret = PTR_ERR(bh);
+ goto out;
+ }
+ bitmap = bh->b_data;
+ } else {
+ bitmap = e4b.bd_bitmap;
+ }
ext4_lock_group(sb, group);
if (EXT4_MB_GRP_WAS_TRIMMED(e4b.bd_info) &&
@@ -5553,6 +6473,8 @@ ext4_trim_all_free(struct super_block *sb, ext4_group_t group,
EXT4_MB_GRP_SET_TRIMMED(e4b.bd_info);
}
out:
+ if (!IS_ERR_OR_NULL(bh))
+ brelse(bh);
ext4_unlock_group(sb, group);
ext4_mb_unload_allocator(&e4b);
@@ -5613,7 +6535,7 @@ int ext4_trim_fs(struct super_block *sb, struct fstrim_range *range)
for (group = first_group; group <= last_group; group++) {
grp = ext4_get_group_info(sb, group);
/* We only do this if the grp has never been initialized */
- if (unlikely(EXT4_MB_GRP_NEED_INIT(grp))) {
+ if (unlikely(!ext4_mb_frsp_on(sb) && EXT4_MB_GRP_NEED_INIT(grp))) {
ret = ext4_mb_init_group(sb, group, GFP_NOFS);
if (ret)
break;
@@ -5667,16 +6589,25 @@ ext4_mballoc_query_range(
ext4_grpblk_t next;
struct ext4_buddy e4b;
int error;
+ struct buffer_head *bh = NULL;
- error = ext4_mb_load_allocator(sb, group, &e4b, 0);
- if (error)
- return error;
- bitmap = e4b.bd_bitmap;
-
+ if (ext4_mb_frsp_on(sb)) {
+ bh = ext4_read_block_bitmap(sb, group);
+ if (IS_ERR(bh)) {
+ error = PTR_ERR(bh);
+ goto out_unload;
+ }
+ bitmap = bh->b_data;
+ } else {
+ error = ext4_mb_load_allocator(sb, group, &e4b, 0);
+ if (error)
+ return error;
+ bitmap = e4b.bd_bitmap;
+ }
ext4_lock_group(sb, group);
-
- start = (e4b.bd_info->bb_first_free > start) ?
- e4b.bd_info->bb_first_free : start;
+ if (!ext4_mb_frsp_on(sb))
+ start = (e4b.bd_info->bb_first_free > start) ?
+ e4b.bd_info->bb_first_free : start;
if (end >= EXT4_CLUSTERS_PER_GROUP(sb))
end = EXT4_CLUSTERS_PER_GROUP(sb) - 1;
@@ -5685,19 +6616,20 @@ ext4_mballoc_query_range(
if (start > end)
break;
next = mb_find_next_bit(bitmap, end + 1, start);
-
ext4_unlock_group(sb, group);
error = formatter(sb, group, start, next - start, priv);
if (error)
goto out_unload;
ext4_lock_group(sb, group);
-
start = next + 1;
}
ext4_unlock_group(sb, group);
out_unload:
- ext4_mb_unload_allocator(&e4b);
+ if (!IS_ERR_OR_NULL(bh))
+ brelse(bh);
+ if (!ext4_mb_frsp_on(sb))
+ ext4_mb_unload_allocator(&e4b);
return error;
}
diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h
index 6b4d17c2935d..32b9ee452de7 100644
--- a/fs/ext4/mballoc.h
+++ b/fs/ext4/mballoc.h
@@ -74,6 +74,20 @@
#define MB_DEFAULT_GROUP_PREALLOC 512
+/*
+ * Struct for a tree node in the freespace_tree
+ */
+struct ext4_frsp_node {
+ ext4_grpblk_t frsp_offset; /* Start block offset inside
+ * current flexible group
+ */
+ ext4_grpblk_t frsp_len; /*
+ * Length of the free space in
+ * number of clusters
+ */
+ struct rb_node frsp_node;
+ struct rb_node frsp_len_node;
+};
struct ext4_free_data {
/* this links the free block information from sb_info */
struct list_head efd_list;
@@ -121,6 +135,7 @@ struct ext4_free_extent {
ext4_grpblk_t fe_start; /* In cluster units */
ext4_group_t fe_group;
ext4_grpblk_t fe_len; /* In cluster units */
+ struct ext4_frsp_node *fe_node;
};
/*
@@ -141,7 +156,14 @@ struct ext4_locality_group {
spinlock_t lg_prealloc_lock;
};
+struct ext4_tree_extent {
+ ext4_group_t te_flex; /* flex_bg index (tree index) */
+ ext4_grpblk_t te_flex_start; /* block offset w.r.t flex bg */
+ ext4_grpblk_t te_len; /* length */
+};
+
struct ext4_allocation_context {
+ __u32 ac_id;
struct inode *ac_inode;
struct super_block *ac_sb;
@@ -154,8 +176,16 @@ struct ext4_allocation_context {
/* the best found extent */
struct ext4_free_extent ac_b_ex;
- /* copy of the best found extent taken before preallocation efforts */
- struct ext4_free_extent ac_f_ex;
+ /* With freespace trees, we don't use preallocation anymore. */
+ union {
+ /*
+ * copy of the best found extent taken before
+ * preallocation efforts
+ */
+ struct ext4_free_extent ac_f_ex;
+ /* the best found tree extent */
+ struct ext4_tree_extent ac_b_tree_ex;
+ };
__u16 ac_groups_scanned;
__u16 ac_found;
@@ -177,14 +207,31 @@ struct ext4_allocation_context {
#define AC_STATUS_FOUND 2
#define AC_STATUS_BREAK 3
+/*
+ * Freespace tree flags
+ */
+#define EXT4_ALLOCATOR_FRSP_NOLOAD 0x0001 /*
+ * Don't load the freespace
+ * tree if it's not
+ * in memory.
+ */
+
struct ext4_buddy {
- struct page *bd_buddy_page;
- void *bd_buddy;
- struct page *bd_bitmap_page;
- void *bd_bitmap;
+ union {
+ struct {
+ struct page *bd_buddy_page;
+ void *bd_buddy;
+ struct page *bd_bitmap_page;
+ void *bd_bitmap;
+ __u16 bd_blkbits;
+ };
+ struct {
+ struct ext4_frsp_tree *frsp_tree;
+ __u32 frsp_flags;
+ };
+ };
struct ext4_group_info *bd_info;
struct super_block *bd_sb;
- __u16 bd_blkbits;
ext4_group_t bd_group;
};
diff --git a/fs/ext4/resize.c b/fs/ext4/resize.c
index a50b51270ea9..6a0e1fc18e95 100644
--- a/fs/ext4/resize.c
+++ b/fs/ext4/resize.c
@@ -1679,6 +1679,10 @@ int ext4_group_add(struct super_block *sb, struct ext4_new_group_data *input)
if (err)
goto out;
+ err = ext4_mb_add_frsp_trees(sb, input->group + 1);
+ if (err)
+ goto out;
+
flex_gd.count = 1;
flex_gd.groups = input;
flex_gd.bg_flags = &bg_flags;
@@ -2051,6 +2055,10 @@ int ext4_resize_fs(struct super_block *sb, ext4_fsblk_t n_blocks_count)
if (err)
goto out;
+ err = ext4_mb_add_frsp_trees(sb, n_group + 1);
+ if (err)
+ goto out;
+
flex_gd = alloc_flex_gd(flexbg_size);
if (flex_gd == NULL) {
err = -ENOMEM;
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 98aa12602b68..2f4b7061365f 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1521,7 +1521,7 @@ enum {
Opt_dioread_nolock, Opt_dioread_lock,
Opt_discard, Opt_nodiscard, Opt_init_itable, Opt_noinit_itable,
Opt_max_dir_size_kb, Opt_nojournal_checksum, Opt_nombcache,
- Opt_prefetch_block_bitmaps,
+ Opt_prefetch_block_bitmaps, Opt_freespace_tree,
};
static const match_table_t tokens = {
@@ -1608,6 +1608,7 @@ static const match_table_t tokens = {
{Opt_init_itable, "init_itable=%u"},
{Opt_init_itable, "init_itable"},
{Opt_noinit_itable, "noinit_itable"},
+ {Opt_freespace_tree, "freespace_tree"},
{Opt_max_dir_size_kb, "max_dir_size_kb=%u"},
{Opt_test_dummy_encryption, "test_dummy_encryption=%s"},
{Opt_test_dummy_encryption, "test_dummy_encryption"},
@@ -1834,6 +1835,8 @@ static const struct mount_opts {
{Opt_nombcache, EXT4_MOUNT_NO_MBCACHE, MOPT_SET},
{Opt_prefetch_block_bitmaps, EXT4_MOUNT_PREFETCH_BLOCK_BITMAPS,
MOPT_SET},
+ {Opt_freespace_tree, EXT4_MOUNT2_FREESPACE_TREE,
+ MOPT_SET | MOPT_2 | MOPT_EXT4_ONLY},
{Opt_err, 0, 0}
};
@@ -4740,12 +4743,19 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
goto failed_mount4a;
}
+ if (ext4_has_feature_flex_bg(sb))
+ if (!ext4_fill_flex_info(sb)) {
+ ext4_msg(sb, KERN_ERR,
+ "unable to initialize flex_bg meta info!");
+ goto failed_mount5;
+ }
+
ext4_ext_init(sb);
err = ext4_mb_init(sb);
if (err) {
ext4_msg(sb, KERN_ERR, "failed to initialize mballoc (%d)",
err);
- goto failed_mount5;
+ goto failed_mount6;
}
block = ext4_count_free_clusters(sb);
@@ -4775,14 +4785,6 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
goto failed_mount6;
}
- if (ext4_has_feature_flex_bg(sb))
- if (!ext4_fill_flex_info(sb)) {
- ext4_msg(sb, KERN_ERR,
- "unable to initialize "
- "flex_bg meta info!");
- goto failed_mount6;
- }
-
err = ext4_register_li_request(sb, first_not_zeroed);
if (err)
goto failed_mount6;
@@ -4856,7 +4858,14 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
ext4_unregister_li_request(sb);
failed_mount6:
ext4_mb_release(sb);
+ percpu_counter_destroy(&sbi->s_freeclusters_counter);
+ percpu_counter_destroy(&sbi->s_freeinodes_counter);
+ percpu_counter_destroy(&sbi->s_dirs_counter);
+ percpu_counter_destroy(&sbi->s_dirtyclusters_counter);
+ percpu_free_rwsem(&sbi->s_writepages_rwsem);
rcu_read_lock();
+
+failed_mount5:
flex_groups = rcu_dereference(sbi->s_flex_groups);
if (flex_groups) {
for (i = 0; i < sbi->s_flex_groups_allocated; i++)
@@ -4864,12 +4873,6 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
kvfree(flex_groups);
}
rcu_read_unlock();
- percpu_counter_destroy(&sbi->s_freeclusters_counter);
- percpu_counter_destroy(&sbi->s_freeinodes_counter);
- percpu_counter_destroy(&sbi->s_dirs_counter);
- percpu_counter_destroy(&sbi->s_dirtyclusters_counter);
- percpu_free_rwsem(&sbi->s_writepages_rwsem);
-failed_mount5:
ext4_ext_release(sb);
ext4_release_system_zone(sb);
failed_mount4a:
--
2.28.0.220.ged08abb693-goog
^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [PATCH 3/9] ext4: add freespace tree allocator
2020-08-19 7:30 ` [PATCH 3/9] ext4: add freespace tree allocator Harshad Shirwadkar
@ 2020-09-03 13:45 ` Благодаренко Артём
0 siblings, 0 replies; 13+ messages in thread
From: Благодаренко Артём @ 2020-09-03 13:45 UTC (permalink / raw)
To: Harshad Shirwadkar; +Cc: linux-ext4, tytso, lyx1209
Hi Harshad,
Some questions below.
> On 19 Aug 2020, at 10:30, Harshad Shirwadkar <harshadshirwadkar@gmail.com> wrote:
>
> From: Yuexin Li <lyx1209@gmail.com>
>
> This patch adds a new freespace tree allocator. The allocator
> organizes free space regions in a couple of in-memory rbtrees, one
> sorted by length and the other by offset. We use these per-flex-bg
> level trees to quickly scan length sorted free space regions and
> determine the best region for a given request. This feature can be
> enabled by passing the "-o freespace_tree" mount option.
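For orientation: within each flex bg, the length-sorted tree keeps free
extents in decreasing order of length, so the largest free extent is always
at rb_first_cached(). A minimal sketch of the fast check this enables, where
goal_len is a stand-in for the request's goal length and not a name from the
patch:

	struct rb_node *first = rb_first_cached(&tree->frsp_len_root);

	if (first) {
		struct ext4_frsp_node *largest;

		largest = rb_entry(first, struct ext4_frsp_node,
				   frsp_len_node);
		if (largest->frsp_len < goal_len)
			return -ENOSPC; /* nothing in this flex bg fits */
	}

The CR=0 pass in ext4_mb_frsp_process() below relies on exactly this
ordering when it bails out of a tree early.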
>
> Signed-off-by: Yuexin Li <lyx1209@gmail.com>
> Co-developed-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
> Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
> ---
> fs/ext4/ext4.h | 45 +++
> fs/ext4/mballoc.c | 984 ++++++++++++++++++++++++++++++++++++++++++++--
> fs/ext4/mballoc.h | 61 ++-
> fs/ext4/resize.c | 8 +
> fs/ext4/super.c | 35 +-
> 5 files changed, 1084 insertions(+), 49 deletions(-)
>
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 4df6f429de1a..3bb2675d4d40 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -363,6 +363,7 @@ struct flex_groups {
> atomic64_t free_clusters;
> atomic_t free_inodes;
> atomic_t used_dirs;
> + struct ext4_frsp_tree *frsp_tree;
> };
>
> #define EXT4_BG_INODE_UNINIT 0x0001 /* Inode table/bitmap not in use */
> @@ -1197,6 +1198,12 @@ struct ext4_inode_info {
> #define EXT4_MOUNT2_EXPLICIT_JOURNAL_CHECKSUM 0x00000008 /* User explicitly
> specified journal checksum */
>
> +
> +#define EXT4_MOUNT2_FREESPACE_TREE 0x00000020 /* Enable freespace tree
> + * allocator (turns buddy
> + * allocator off)
> + */
> +
> #define clear_opt(sb, opt) EXT4_SB(sb)->s_mount_opt &= \
> ~EXT4_MOUNT_##opt
> #define set_opt(sb, opt) EXT4_SB(sb)->s_mount_opt |= \
> @@ -1402,6 +1409,30 @@ struct ext4_super_block {
> #define ext4_has_strict_mode(sbi) \
> (sbi->s_encoding_flags & EXT4_ENC_STRICT_MODE_FL)
>
> +/*
> + * Freespace tree structure
> + */
> +struct ext4_frsp_tree {
> + struct rb_root_cached frsp_offset_root; /* Root for offset
> + * sorted tree
> + */
> + struct rb_root_cached frsp_len_root; /* Root for Length
> + * sorted tree
> + */
> + struct mutex frsp_lock;
> + __u8 frsp_flags;
> + __u32 frsp_max_free_len; /* Max free extent in
> + * this flex bg
> + */
> + __u32 frsp_index; /* Tree index (flex bg
> + * number)
> + */
> +};
> +
> +/* Freespace tree flags */
> +
> +/* Tree is loaded in memory */
> +#define EXT4_MB_FRSP_FLAG_LOADED 0x0001
> /*
> * fourth extended-fs super-block data in memory
> */
> @@ -1539,6 +1570,9 @@ struct ext4_sb_info {
> struct flex_groups * __rcu *s_flex_groups;
> ext4_group_t s_flex_groups_allocated;
>
> + /* freespace trees stuff */
> + int s_mb_num_frsp_trees;
> +
> /* workqueue for reserved extent conversions (buffered io) */
> struct workqueue_struct *rsv_conversion_wq;
>
> @@ -2653,6 +2687,7 @@ extern int ext4_init_inode_table(struct super_block *sb,
> extern void ext4_end_bitmap_read(struct buffer_head *bh, int uptodate);
>
> /* mballoc.c */
> +extern int ext4_mb_add_frsp_trees(struct super_block *sb, int ngroups);
> extern const struct seq_operations ext4_mb_seq_groups_ops;
> extern long ext4_mb_stats;
> extern long ext4_mb_max_to_scan;
> @@ -3084,6 +3119,16 @@ static inline ext4_group_t ext4_flex_group(struct ext4_sb_info *sbi,
> return block_group >> sbi->s_log_groups_per_flex;
> }
>
> +static inline struct ext4_frsp_tree *
> +ext4_get_frsp_tree(struct super_block *sb, ext4_group_t flex_bg)
> +{
> + struct flex_groups *fg;
> +
> + fg = sbi_array_rcu_deref(EXT4_SB(sb), s_flex_groups, flex_bg);
> +
> + return fg->frsp_tree;
> +}
> +
> static inline unsigned int ext4_flex_bg_size(struct ext4_sb_info *sbi)
> {
> return 1 << sbi->s_log_groups_per_flex;
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index 2d8d3d9c7918..74bdc2dcb49c 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -333,6 +333,7 @@
> static struct kmem_cache *ext4_pspace_cachep;
> static struct kmem_cache *ext4_ac_cachep;
> static struct kmem_cache *ext4_free_data_cachep;
> +static struct kmem_cache *ext4_freespace_node_cachep;
>
> /* We create slab caches for groupinfo data structures based on the
> * superblock block size. There will be one per mounted filesystem for
> @@ -352,6 +353,11 @@ static void ext4_mb_generate_from_freelist(struct super_block *sb, void *bitmap,
> ext4_group_t group);
> static void ext4_mb_new_preallocation(struct ext4_allocation_context *ac);
>
> +static int ext4_mb_load_allocator(struct super_block *sb, ext4_group_t group,
> + struct ext4_buddy *e4b, int flags);
> +static int mb_mark_used(struct ext4_buddy *e4b, struct ext4_free_extent *ex);
> +static void ext4_mb_unload_allocator(struct ext4_buddy *e4b);
> +
> /*
> * The algorithm using this percpu seq counter goes below:
> * 1. We sample the percpu discard_pa_seq counter before trying for block
> @@ -395,6 +401,33 @@ static inline void *mb_correct_addr_and_bit(int *bit, void *addr)
> return addr;
> }
>
> +static inline int ext4_blkno_to_flex_offset(struct super_block *sb,
> + ext4_fsblk_t blkno)
> +{
> + return blkno % (ext4_flex_bg_size(EXT4_SB(sb)) *
> + EXT4_SB(sb)->s_blocks_per_group);
> +}
> +
> +static inline ext4_fsblk_t ext4_flex_offset_to_blkno(struct super_block *sb,
> + int flex_bg, int flex_offset)
> +{
> + return flex_bg * ext4_flex_bg_size(EXT4_SB(sb)) *
> + EXT4_SB(sb)->s_blocks_per_group + flex_offset;
> +}
> +
> +static inline int ext4_mb_frsp_on(struct super_block *sb)
> +{
> + return test_opt2(sb, FREESPACE_TREE) &&
> + EXT4_SB(sb)->s_es->s_log_groups_per_flex;
> +}
> +
> +static inline unsigned int ext4_num_grps_to_flexbg(struct super_block *sb,
> + int ngroups)
> +{
> + return (ngroups + ext4_flex_bg_size(EXT4_SB(sb)) - 1) >>
> + (EXT4_SB(sb)->s_es->s_log_groups_per_flex);
> +}
> +
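With the common defaults of 32768 blocks per group and 16 groups per flex bg
(s_log_groups_per_flex = 4; both values are assumptions for illustration),
the two mappings above work out as:

	flex bg span  = 16 * 32768 = 524288 blocks
	blkno 1000000 -> flex offset = 1000000 % 524288 = 475712 (in flex bg 1)
	round trip    = 1 * 524288 + 475712 = 1000000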
> static inline int mb_test_bit(int bit, void *addr)
> {
> /*
> @@ -453,6 +486,7 @@ static void *mb_find_buddy(struct ext4_buddy *e4b, int order, int *max)
> {
> char *bb;
>
> + WARN_ON(ext4_mb_frsp_on(e4b->bd_sb));
> BUG_ON(e4b->bd_bitmap == e4b->bd_buddy);
> BUG_ON(max == NULL);
>
> @@ -620,6 +654,9 @@ static int __mb_check_buddy(struct ext4_buddy *e4b, char *file,
> void *buddy;
> void *buddy2;
>
> + if (ext4_mb_frsp_on(sb))
> + return 0;
> +
> {
> static int mb_check_counter;
> if (mb_check_counter++ % 100 != 0)
> @@ -706,6 +743,729 @@ static int __mb_check_buddy(struct ext4_buddy *e4b, char *file,
> #define mb_check_buddy(e4b)
> #endif
>
> +/* Generic macro for inserting a new node in cached rbtree */
> +#define ext4_mb_rb_insert(__root, __new, __node_t, __entry, __cmp) do { \
> + struct rb_node **iter = &((__root)->rb_root.rb_node), *parent = NULL; \
> + bool leftmost = true; \
> + __node_t *this = NULL; \
> + \
> + while (*iter) { \
> + this = rb_entry(*iter, __node_t, __entry); \
> + parent = *iter; \
> + if (__cmp((__new), this)) { \
> + iter = &((*iter)->rb_left); \
> + } else { \
> + iter = &((*iter)->rb_right); \
> + leftmost = false; \
> + } \
> + } \
> + rb_link_node(&(__new)->__entry, parent, iter); \
> + rb_insert_color_cached(&(__new)->__entry, __root, leftmost); \
> +} while (0)
Using the kernel primitives from lib/rbtree.c directly is preferable. That library has a lot of users and looks stable. There is rb_insert_color() there, which could be used instead of ext4_mb_rb_insert().
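For comparison, a typed, open-coded equivalent of one instantiation of this
macro (the offset-sorted tree), using only the lib/rbtree primitives the
patch already depends on — a sketch, not code from the patch:

	static void frsp_insert_off_opencoded(struct rb_root_cached *root,
					      struct ext4_frsp_node *new)
	{
		struct rb_node **link = &root->rb_root.rb_node;
		struct rb_node *parent = NULL;
		bool leftmost = true;

		while (*link) {
			struct ext4_frsp_node *this;

			this = rb_entry(*link, struct ext4_frsp_node,
					frsp_node);
			parent = *link;
			if (new->frsp_offset < this->frsp_offset) {
				link = &(*link)->rb_left;
			} else {
				link = &(*link)->rb_right;
				leftmost = false;
			}
		}
		rb_link_node(&new->frsp_node, parent, link);
		rb_insert_color_cached(&new->frsp_node, root, leftmost);
	}

One such helper per tree trades the macro for a little duplication but keeps
the insertions type-checked.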
> +
> +static inline int ext4_mb_frsp_offset_cmp(struct ext4_frsp_node *arg1,
> + struct ext4_frsp_node *arg2)
> +{
> + return (arg1->frsp_offset < arg2->frsp_offset);
> +}
> +
> +/*
> + * Free space extents length sorting function, the nodes are sorted in
> + * decreasing order of length
> + */
> +static inline int ext4_mb_frsp_len_cmp(struct ext4_frsp_node *arg1,
> + struct ext4_frsp_node *arg2)
> +{
> + return (arg1->frsp_len > arg2->frsp_len);
> +}
> +
> +/* insert to offset-indexed tree */
> +static void ext4_mb_frsp_insert_off(struct ext4_frsp_tree *tree,
> + struct ext4_frsp_node *new_entry)
> +{
> + memset(&new_entry->frsp_node, 0, sizeof(new_entry->frsp_node));
> + ext4_mb_rb_insert(&tree->frsp_offset_root, new_entry,
> + struct ext4_frsp_node, frsp_node, ext4_mb_frsp_offset_cmp);
> +}
> +
> +/* insert to tree sorted by length */
> +static void ext4_mb_frsp_insert_len(struct ext4_frsp_tree *tree,
> + struct ext4_frsp_node *new_entry)
> +{
> + memset(&new_entry->frsp_len_node, 0, sizeof(new_entry->frsp_len_node));
> + ext4_mb_rb_insert(&tree->frsp_len_root, new_entry,
> + struct ext4_frsp_node, frsp_len_node,
> + ext4_mb_frsp_len_cmp);
> +}
> +
> +#ifdef CONFIG_EXT4_DEBUG
> +/* print freespace_tree in pre-order traversal */
> +void ext4_mb_frsp_print_tree_len(struct super_block *sb,
> + struct ext4_frsp_tree *tree)
> +{
> + unsigned int count = 0;
> + ext4_fsblk_t blk = 0, blk_end = 0;
> + ext4_group_t group = 0, group_end = 0;
> + struct ext4_frsp_node *entry = NULL;
> + struct ext4_sb_info *sbi = EXT4_SB(sb);
> + struct rb_node *cur;
> +
> + cur = rb_first_cached(&tree->frsp_len_root);
> + mb_debug(sb, "\nTree\tNode\tlength\tblock\tgroup\n");
> +
> + while (cur) {
> + entry = rb_entry(cur, struct ext4_frsp_node, frsp_len_node);
> + count++;
> + blk = ext4_flex_offset_to_blkno(sb, tree->frsp_index,
> + entry->frsp_offset);
> + blk_end = ext4_flex_offset_to_blkno(sb, tree->frsp_index,
> + entry->frsp_offset + entry->frsp_len - 1);
> + group = blk / sbi->s_blocks_per_group;
> + group_end = (blk_end-1) / sbi->s_blocks_per_group;
> + mb_debug(sb, "%u\t%u\t%u\t%u(%lld)--%llu\t%u--%u\n",
> + tree->frsp_index, count, entry->frsp_len,
> + entry->frsp_offset, blk, blk_end, group, group_end);
> + cur = rb_next(cur);
> + }
> +}
> +#endif
> +
> +static struct ext4_frsp_node *ext4_mb_frsp_alloc_node(struct super_block *sb)
> +{
> + struct ext4_frsp_node *node;
> +
> + node = kmem_cache_alloc(ext4_freespace_node_cachep, GFP_NOFS);
> + if (!node)
> + return NULL;
> +
> + RB_CLEAR_NODE(&node->frsp_node);
> + RB_CLEAR_NODE(&node->frsp_len_node);
> +
> + return node;
> +}
> +
> +static void ext4_mb_frsp_free_node(struct super_block *sb,
> + struct ext4_frsp_node *node)
> +{
> + kmem_cache_free(ext4_freespace_node_cachep, node);
> +}
> +
> +/* Evict a tree from memory */
> +void ext4_mb_frsp_free_tree(struct super_block *sb, struct ext4_frsp_tree *tree)
> +{
> + struct ext4_frsp_node *frsp_node;
> + struct rb_node *node;
> +
> + mb_debug(sb, "Evicting tree %d\n", tree->frsp_index);
> + mutex_lock(&tree->frsp_lock);
> + if (!(tree->frsp_flags & EXT4_MB_FRSP_FLAG_LOADED))
> + goto out;
> +
> + node = rb_first_cached(&tree->frsp_offset_root);
> + while (node) {
> + frsp_node = rb_entry(node, struct ext4_frsp_node, frsp_node);
> + rb_erase_cached(node, &tree->frsp_offset_root);
> + rb_erase_cached(&frsp_node->frsp_len_node,
> + &tree->frsp_len_root);
> + ext4_mb_frsp_free_node(sb, frsp_node);
> + node = rb_first_cached(&tree->frsp_offset_root);
> + }
> + tree->frsp_flags &= ~EXT4_MB_FRSP_FLAG_LOADED;
> + tree->frsp_offset_root = RB_ROOT_CACHED;
> + tree->frsp_len_root = RB_ROOT_CACHED;
> +out:
> + mutex_unlock(&tree->frsp_lock);
> +}
> +
> +/*
> + * Search tree by flexbg offset. Returns the node containing "target". If
> + * no such node is present, returns NULL. Must be called with tree mutex held.
> + */
> +struct ext4_frsp_node *ext4_mb_frsp_search_by_off(struct super_block *sb,
> + struct ext4_frsp_tree *tree,
> + ext4_grpblk_t target)
> +{
> + struct rb_root *root = &tree->frsp_offset_root.rb_root;
> + struct rb_node *node = root->rb_node;
> + struct ext4_frsp_node *this = NULL;
> +
> + while (node) {
> + this = rb_entry(node, struct ext4_frsp_node, frsp_node);
> + if (this->frsp_offset == target)
> + return this;
> + else if (target < this->frsp_offset)
> + node = node->rb_left;
> + else
> + node = node->rb_right;
> + }
> + return NULL;
> +}
> +
> +/*
> + * Check if two entries in freespace tree can be merged together. Nodes
> + * can merge together only if they are physically contiguous and belong
> + * to the same block group.
> + */
> +int ext4_mb_frsp_node_can_merge(struct super_block *sb,
> + struct ext4_frsp_node *prev_entry, struct ext4_frsp_node *cur_entry)
> +{
> + if (prev_entry->frsp_offset + prev_entry->frsp_len !=
> + cur_entry->frsp_offset)
> + return 0;
> + if (prev_entry->frsp_offset / EXT4_SB(sb)->s_blocks_per_group !=
> + cur_entry->frsp_offset / EXT4_SB(sb)->s_blocks_per_group)
> + return 0;
> + return 1;
> +}
> +
> +/*
> + * Add a new free space region to the tree. We insert the new node in both
> + * the length-sorted and the flex-bg-offset-sorted trees. Must be called with
> + * tree mutex held.
> + */
> +int ext4_mb_frsp_add_region(struct super_block *sb, struct ext4_frsp_tree *tree,
> + ext4_grpblk_t offset, ext4_grpblk_t len)
> +{
> + struct ext4_frsp_node *new_entry = NULL, *next_entry = NULL,
> + *prev_entry = NULL;
> + struct rb_node *left = NULL, *right = NULL;
> +
> + new_entry = ext4_mb_frsp_alloc_node(sb);
> + if (!new_entry)
> + return -ENOMEM;
> +
> + new_entry->frsp_offset = offset;
> + new_entry->frsp_len = len;
> +
> + ext4_mb_frsp_insert_off(tree, new_entry);
> + /* try merge to left and right */
> + /* left */
> + left = rb_prev(&new_entry->frsp_node);
> + if (left) {
> + prev_entry = rb_entry(left, struct ext4_frsp_node, frsp_node);
> + if (ext4_mb_frsp_node_can_merge(sb, prev_entry, new_entry)) {
> + new_entry->frsp_offset = prev_entry->frsp_offset;
> + new_entry->frsp_len += prev_entry->frsp_len;
> + rb_erase_cached(left, &tree->frsp_offset_root);
> + rb_erase_cached(&prev_entry->frsp_len_node,
> + &tree->frsp_len_root);
> + ext4_mb_frsp_free_node(sb, prev_entry);
> + }
> + }
> +
> + /* right */
> + right = rb_next(&new_entry->frsp_node);
> + if (right) {
> + next_entry = rb_entry(right, struct ext4_frsp_node, frsp_node);
> + if (ext4_mb_frsp_node_can_merge(sb, new_entry, next_entry)) {
> + new_entry->frsp_len += next_entry->frsp_len;
> + rb_erase_cached(right, &tree->frsp_offset_root);
> + rb_erase_cached(&next_entry->frsp_len_node,
> + &tree->frsp_len_root);
> + ext4_mb_frsp_free_node(sb, next_entry);
> + }
> + }
> + ext4_mb_frsp_insert_len(tree, new_entry);
> +
> + return 0;
> +}
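A worked example of the merge logic above (illustrative numbers): suppose
flex-bg offsets [90,109] and [120,124] are free and [110,119] is now freed.
The new node (offset 110, len 10) is inserted into the offset tree, merges
left into (90, 30), then merges right into a single node (90, 35), and only
then is linked into the length-sorted tree, so the length tree never sees the
short-lived intermediate nodes. Had offset 110 been the first block of the
next block group, ext4_mb_frsp_node_can_merge() would have refused the left
merge and the extents would have stayed separate.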
> +
> +/*
> + * Free @length blocks starting at block number @block in the free space
> + * tree.
> + */
> +int ext4_mb_frsp_free_blocks(struct super_block *sb, ext4_group_t group,
> + ext4_grpblk_t block, ext4_grpblk_t length)
> +{
> + struct ext4_sb_info *sbi = EXT4_SB(sb);
> + struct ext4_frsp_tree *tree =
> + ext4_get_frsp_tree(sb, ext4_flex_group(sbi, group));
> + int err = 0;
> +
> + mutex_lock(&tree->frsp_lock);
> + err = ext4_mb_frsp_add_region(sb, tree,
> + ext4_blkno_to_flex_offset(sb, block), length);
> + mutex_unlock(&tree->frsp_lock);
> +
> + return err;
> +}
> +
> +/*
> + * Create freespace tree from on-disk bitmap. Must be called with tree mutex
> + * held. Returns 0 on success, error otherwise.
> + */
> +int ext4_mb_frsp_bb_to_tree(struct ext4_buddy *e4b, ext4_group_t group,
> + struct buffer_head *bh)
> +{
> + struct super_block *sb = e4b->bd_sb;
> + int length = 0, bit = 0, next;
> + int end = EXT4_SB(sb)->s_blocks_per_group;
> + ext4_fsblk_t block;
> + int ret = 0;
> +
> + /* find all unused blocks in bitmap, convert them to new tree node */
> + while (bit < end) {
> + bit = mb_find_next_zero_bit(bh->b_data, end, bit);
> + if (bit >= end)
> + break;
> +
> + next = mb_find_next_bit(bh->b_data, end, bit);
> + length = next - bit;
> + block = ext4_group_first_block_no(sb, group) + bit;
> +
> + ret = ext4_mb_frsp_add_region(sb, e4b->frsp_tree,
> + ext4_blkno_to_flex_offset(sb, block), length);
> + if (ret)
> + break;
> + bit = next + 1;
> + }
> +
> + return ret;
> +}
> +
> +/*
> + * Load one freespace_tree from disk. Assumes the tree mutex is held.
> + */
> +int ext4_mb_frsp_load(struct ext4_buddy *e4b)
> +{
> + ext4_group_t ngroups, group_start, group_end, grp;
> + struct ext4_sb_info *sbi = EXT4_SB(e4b->bd_sb);
> + int i, ret = 0;
> + struct buffer_head **bh = NULL;
> +
> + if (e4b->frsp_tree->frsp_flags & EXT4_MB_FRSP_FLAG_LOADED)
> + return 0;
> +
> + group_start = e4b->frsp_tree->frsp_index * ext4_flex_bg_size(sbi);
> + group_end = min(group_start + ext4_flex_bg_size(sbi),
> + ext4_get_groups_count(e4b->bd_sb)) - 1;
> + ngroups = group_end - group_start + 1;
> +
> + bh = kcalloc(ngroups, sizeof(*bh), GFP_KERNEL);
> + if (!bh)
> + return -ENOMEM;
> + for (i = 0; i < ngroups; i++) {
> + grp = i + group_start;
> + bh[i] = ext4_read_block_bitmap_nowait(e4b->bd_sb, grp, false);
> + if (IS_ERR(bh[i])) {
> + ret = PTR_ERR(bh[i]);
> + goto out;
> + }
> + }
> + for (i = 0; i < ngroups; i++) {
> + grp = i + group_start;
> + if (!bitmap_uptodate(bh[i])) {
> + ret = ext4_wait_block_bitmap(e4b->bd_sb, grp, bh[i]);
> + if (ret)
> + goto out;
> + }
> + ret = ext4_mb_frsp_bb_to_tree(e4b, grp, bh[i]);
> + if (ret)
> + goto out;
> + }
> + e4b->frsp_tree->frsp_flags |= EXT4_MB_FRSP_FLAG_LOADED;
> +out:
> + for (i = 0; i < ngroups; i++) {
> + if (!IS_ERR_OR_NULL(bh[i]))
> + put_bh(bh[i]);
> + if (!ret && IS_ERR(bh[i]))
> + ret = PTR_ERR(bh[i]);
> + }
> + kfree(bh);
> + return ret;
> +
> +}
> +
> +/*
> + * Determine if a node with length len is better than what we have found
> + * until now. Return 1 if that is the case, 0 otherwise. If len is an exact match,
> + * set *best = 1.
> + */
> +static int ext4_mb_frsp_is_better(struct ext4_allocation_context *ac,
> + int len, int *best)
> +{
> + struct ext4_tree_extent *btx = &ac->ac_b_tree_ex;
> + struct ext4_free_extent *gex = &ac->ac_g_ex;
> +
> + if (best)
> + *best = 0;
> + if (len == gex->fe_len) {
> + if (best)
> + *best = 1;
> + return 1;
> + }
> + if (ac->ac_criteria == 0 && len < gex->fe_len)
> + return 0;
> + /*
> + * See if the new node is closer to the goal length than
> + * the best extent found so far
> + */
> + if (btx->te_len < gex->fe_len && len > btx->te_len)
> + return 1;
> + if (len > gex->fe_len && len < btx->te_len)
> + return 1;
> + return 0;
> +}
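To make the comparison concrete, take a goal length of 64 clusters
(illustrative numbers): a 64-cluster node is an exact match and sets *best;
at CR=0 a 32-cluster node is rejected outright; a 48-cluster node beats a
best-so-far of 32 (both undersized, 48 is closer to the goal); and an
80-cluster node beats a best-so-far of 100 (both oversized, 80 wastes less).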
> +
> +/*
> + * Check if we have scanned sufficient freespace candidates;
> + * stop scanning once s_max_to_scan is reached or exceeded.
> + */
> +static void ext4_mb_frsp_check_limits(struct ext4_allocation_context *ac)
> +{
> + struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
> +
> + if (ac->ac_status == AC_STATUS_FOUND)
> + return;
> +
> + /*
> + * Exceeded max number of nodes to scan
> + */
> + if (ac->ac_found > sbi->s_mb_max_to_scan &&
> + !(ac->ac_flags & EXT4_MB_HINT_FIRST))
> + ac->ac_status = AC_STATUS_BREAK;
> +}
> +
> +/*
> + * Mark free space in selected tree node as used and update the tree.
> + * This must be called with the tree lock held.
> + */
> +static void ext4_mb_frsp_use_best_found(struct ext4_allocation_context *ac,
> + ext4_group_t flex, struct ext4_frsp_node *selected)
> +{
> + struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
> + struct ext4_tree_extent *btx = &ac->ac_b_tree_ex;
> + struct ext4_free_extent *bex;
> + unsigned int group_no;
> + struct ext4_buddy e4b;
> +
> + WARN_ON(ac->ac_status == AC_STATUS_FOUND);
> + btx->te_len = min(btx->te_len, ac->ac_g_ex.fe_len);
> + ac->ac_status = AC_STATUS_FOUND;
> +
> + /* update ac best-found-extent */
> + bex = &ac->ac_b_ex;
> + group_no = (btx->te_flex * ext4_flex_bg_size(sbi)) +
> + (btx->te_flex_start / sbi->s_blocks_per_group);
> + bex->fe_start = btx->te_flex_start % sbi->s_blocks_per_group;
> + bex->fe_group = group_no;
> + bex->fe_len = btx->te_len;
> + bex->fe_node = selected;
> +
> + /* Add free blocks to our tree */
> + ext4_mb_load_allocator(ac->ac_sb, group_no, &e4b,
> + EXT4_ALLOCATOR_FRSP_NOLOAD);
> + mb_mark_used(&e4b, bex);
> + ext4_mb_unload_allocator(&e4b);
> +}
> +/*
> + * The routine checks whether the found tree node is good enough. If it is,
> + * then the tree node gets marked used and a flag is set in the context
> + * to stop scanning.
> + *
> + * Otherwise, the tree node is compared with the previous found extent and
> + * if new one is better, then it's stored in the context.
> + *
> + * Later, the best found tree node will be used if mballoc can't find a good
> + * enough extent.
> + */
> +static int ext4_mb_frsp_measure_node(struct ext4_allocation_context *ac,
> + int tree_idx, struct ext4_frsp_node *cur,
> + int goal)
> +{
> + struct ext4_tree_extent *btx = &ac->ac_b_tree_ex;
> + ext4_fsblk_t start;
> + int best_found = 0, max;
> +
> + WARN_ON(btx->te_len < 0);
> + WARN_ON(ac->ac_status != AC_STATUS_CONTINUE);
> +
> + if (goal) {
> + start = ext4_group_first_block_no(ac->ac_sb,
> + ac->ac_g_ex.fe_group) +
> + ac->ac_g_ex.fe_start;
> + if (cur->frsp_offset > ext4_blkno_to_flex_offset(ac->ac_sb,
> + start))
> + return 0;
> + max = cur->frsp_offset + cur->frsp_len -
> + ext4_blkno_to_flex_offset(ac->ac_sb, start);
> + if (max >= ac->ac_g_ex.fe_len &&
> + ac->ac_g_ex.fe_len == EXT4_SB(ac->ac_sb)->s_stripe) {
> + if (do_div(start, EXT4_SB(ac->ac_sb)->s_stripe) != 0)
> + return 0;
> + best_found = 1;
> + } else if (max >= ac->ac_g_ex.fe_len) {
> + best_found = 1;
> + } else if (max > 0 && (ac->ac_flags & EXT4_MB_HINT_MERGE)) {
> + /*
> + * Sometimes, the caller may want to merge even a small
> + * number of blocks into an existing extent
> + */
> + best_found = 1;
> +
> + } else {
> + return 0;
> + }
> + ac->ac_found++;
> + goto use_extent;
> + }
> + ac->ac_found++;
> +
> + /* The special case - take what you catch first */
> + if (unlikely(ac->ac_flags & EXT4_MB_HINT_FIRST)) {
> + best_found = 1;
> + goto use_extent;
> + }
> +
> + /*
> + * Prefer not to allocate data blocks from the first group
> + * in a flex bg.
> + */
> + if (unlikely((ac->ac_criteria < 2) &&
> + (ac->ac_flags & EXT4_MB_HINT_DATA) &&
> + (cur->frsp_offset < EXT4_BLOCKS_PER_GROUP(ac->ac_sb))))
> + return 1;
> +
> +
> + /* If this is first found extent, just store it in the context */
> + if (btx->te_len == 0)
> + goto use_extent;
> +
> + if (ext4_mb_frsp_is_better(ac, cur->frsp_len, &best_found))
> + goto use_extent;
> +
> + ext4_mb_frsp_check_limits(ac);
> + return 0;
> +
> +use_extent:
> + btx->te_flex = tree_idx;
> + if (goal) {
> + btx->te_flex_start = ext4_blkno_to_flex_offset(ac->ac_sb,
> + start);
> + btx->te_len = max;
> + } else {
> + btx->te_flex_start = cur->frsp_offset;
> + btx->te_len = cur->frsp_len;
> + }
> + if (best_found)
> + ext4_mb_frsp_use_best_found(ac, tree_idx, cur);
> + ext4_mb_frsp_check_limits(ac);
> + return 1;
> +}
> +
> +/* Search by goal */
> +static noinline_for_stack
> +void ext4_mb_frsp_find_by_goal(struct ext4_allocation_context *ac)
> +{
> + struct ext4_frsp_node *cur = NULL;
> + unsigned int tree_blk;
> + ext4_fsblk_t blk;
> + struct ext4_buddy e4b;
> + int ret;
> +
> + if (!(ac->ac_flags & EXT4_MB_HINT_TRY_GOAL))
> + return;
> +
> + /* compute start node offset in tree */
> + blk = ext4_group_first_block_no(ac->ac_sb, ac->ac_g_ex.fe_group) +
> + ac->ac_g_ex.fe_start;
> + tree_blk = ext4_blkno_to_flex_offset(ac->ac_sb, blk);
> +
> + ret = ext4_mb_load_allocator(ac->ac_sb, ac->ac_g_ex.fe_group, &e4b, 0);
> + if (ret)
> + return;
> +
> + /* try goal block and its freespace_tree first */
> + mutex_lock(&e4b.frsp_tree->frsp_lock);
> + cur = ext4_mb_frsp_search_by_off(ac->ac_sb, e4b.frsp_tree, tree_blk);
> + if (cur && tree_blk < cur->frsp_offset + cur->frsp_len - 1)
> + ext4_mb_frsp_measure_node(ac, e4b.frsp_tree->frsp_index, cur,
> + 1 /* searching by goal */);
> +
> + mutex_unlock(&e4b.frsp_tree->frsp_lock);
> + ext4_mb_unload_allocator(&e4b);
> +}
> +
> +static void ext4_mb_frsp_process(struct ext4_allocation_context *ac,
> + struct ext4_frsp_tree *tree)
> +{
> + struct ext4_frsp_node *cur = NULL;
> + struct rb_node *node = NULL;
> + int ret;
> +
> + node = rb_first_cached(&tree->frsp_len_root);
> + mb_debug(ac->ac_sb, "Processing tree index %d, flags = %x\n",
> + tree->frsp_index, tree->frsp_flags);
> + /*
> + * In order to serve the "No data blocks in first group in a flexbg"
> + * requirement, we cannot do an O(log N) search here. TODO: it's possible
> + * to do that at CR=2, where that requirement doesn't apply.
> + */
> + while (node && ac->ac_status == AC_STATUS_CONTINUE) {
> + cur = rb_entry(node, struct ext4_frsp_node, frsp_len_node);
> + if (ac->ac_criteria == 0 && cur->frsp_len < ac->ac_g_ex.fe_len)
> + return;
> + ret = ext4_mb_frsp_measure_node(ac, tree->frsp_index, cur,
> + 0 /* searching by len */);
> + if (ret == 0)
> + return;
> + node = rb_next(node);
> + }
> +}
> +
> +/* allocator for freespace_tree */
> +static noinline_for_stack int
> +ext4_mb_tree_allocator(struct ext4_allocation_context *ac)
> +{
> + struct ext4_buddy e4b;
> + struct ext4_sb_info *sbi;
> + struct super_block *sb;
> + struct ext4_frsp_tree *tree = NULL;
> + struct ext4_frsp_node *cur = NULL;
> + struct ext4_tree_extent *btx = NULL;
> + int ret = 0, start_idx = 0, tree_idx, j;
> +
> + sb = ac->ac_sb;
> + btx = &ac->ac_b_tree_ex;
> + sbi = EXT4_SB(sb);
> +
> + start_idx = ext4_flex_group(sbi, ac->ac_g_ex.fe_group);
> + mb_debug(sb, "requested size %d\n", ac->ac_g_ex.fe_len);
> + /* First try searching from goal blk in start-idx-th freespace tree */
> + ext4_mb_frsp_find_by_goal(ac);
> + if (ac->ac_status == AC_STATUS_FOUND)
> + goto out;
> +
> + if (unlikely(ac->ac_flags & EXT4_MB_HINT_GOAL_ONLY))
> + goto out;
> +
> + ac->ac_criteria = 0;
> + ac->ac_groups_scanned = 0;
> +repeat:
> +
> + /* Loop through the rest of trees (flex_bg) */
> + for (j = start_idx;
> + (ac->ac_groups_scanned < sbi->s_mb_num_frsp_trees) &&
> + ac->ac_status == AC_STATUS_CONTINUE; j++) {
> + ac->ac_groups_scanned++;
> + tree_idx = (j % sbi->s_mb_num_frsp_trees);
> +
> + ret = ext4_mb_load_allocator(sb,
> + tree_idx * ext4_flex_bg_size(sbi), &e4b, 0);
> + if (ret)
> + goto out;
> + mutex_lock(&e4b.frsp_tree->frsp_lock);
> + ext4_mb_frsp_process(ac, e4b.frsp_tree);
> + mutex_unlock(&e4b.frsp_tree->frsp_lock);
> + ext4_mb_unload_allocator(&e4b);
> + }
> +
> + if (ac->ac_status != AC_STATUS_FOUND) {
> + if (ac->ac_criteria < 2) {
> + ac->ac_criteria++;
> + ac->ac_groups_scanned = 0;
> + mb_debug(sb, "Falling back to CR=%d", ac->ac_criteria);
> + goto repeat;
> + }
> + if (btx->te_len > 0 && !(ac->ac_flags & EXT4_MB_HINT_FIRST)) {
> + e4b.frsp_flags = 0;
> + ret = ext4_mb_load_allocator(sb, btx->te_flex *
> + ext4_flex_bg_size(sbi), &e4b, 0);
> + if (ret)
> + goto out;
> + tree = e4b.frsp_tree;
> + mutex_lock(&tree->frsp_lock);
> + cur = ext4_mb_frsp_search_by_off(sb, tree,
> + btx->te_flex_start);
> + if (cur) {
> + ext4_mb_frsp_use_best_found(ac, btx->te_flex,
> + cur);
> + mutex_unlock(&tree->frsp_lock);
> + ext4_mb_unload_allocator(&e4b);
> +
> + } else {
> + /*
> + * Someone else took this freespace node before
> + * us. Reset the best-found tree extent, and
> + * turn on FIRST HINT (greedy).
> + */
> + mutex_unlock(&tree->frsp_lock);
> + ac->ac_b_tree_ex.te_flex_start = 0;
> + ac->ac_b_tree_ex.te_flex = 0;
> + ac->ac_b_tree_ex.te_len = 0;
> + ac->ac_status = AC_STATUS_CONTINUE;
> + ac->ac_flags |= EXT4_MB_HINT_FIRST;
> + ac->ac_groups_scanned = 0;
> + ext4_mb_unload_allocator(&e4b);
> + goto repeat;
> + }
> + }
> + }
> +
> + ret = btx->te_len ? 0 : -ENOSPC;
> +
> +out:
> + return ret;
> +}
> +
> +static void ext4_mb_frsp_init_tree(struct ext4_frsp_tree *tree, int index)
> +{
> + tree->frsp_offset_root = RB_ROOT_CACHED;
> + tree->frsp_len_root = RB_ROOT_CACHED;
> + mutex_init(&(tree->frsp_lock));
> + tree->frsp_flags = 0;
> + tree->frsp_index = index;
> +}
> +
> +int ext4_mb_init_freespace_trees(struct super_block *sb)
> +{
> + struct ext4_sb_info *sbi = EXT4_SB(sb);
> + struct flex_groups *fg;
> + int i;
> +
> + sbi->s_mb_num_frsp_trees =
> + ext4_num_grps_to_flexbg(sb, ext4_get_groups_count(sb));
> +
> + for (i = 0; i < sbi->s_mb_num_frsp_trees; i++) {
> + fg = sbi_array_rcu_deref(sbi, s_flex_groups, i);
> + fg->frsp_tree = kzalloc(sizeof(struct ext4_frsp_tree),
> + GFP_KERNEL);
> + if (!fg->frsp_tree)
> + return -ENOMEM;
> + ext4_mb_frsp_init_tree(fg->frsp_tree, i);
> + }
> +
> + return 0;
> +}
> +
> +int ext4_mb_add_frsp_trees(struct super_block *sb, int ngroups)
> +{
> + struct ext4_sb_info *sbi = EXT4_SB(sb);
> + int flex_bg_count, old_trees_count = sbi->s_mb_num_frsp_trees;
> + int i;
> +
> + if (!ext4_mb_frsp_on(sb))
> + return 0;
> +
> + flex_bg_count = ext4_num_grps_to_flexbg(sb, ngroups);
> + if (old_trees_count > 0)
> + ext4_mb_frsp_free_tree(sb, ext4_get_frsp_tree(sb,
> + old_trees_count - 1));
> +
> + for (i = old_trees_count; i < flex_bg_count; i++) {
> + struct flex_groups *fg;
> +
> + fg = sbi_array_rcu_deref(sbi, s_flex_groups, i);
> + fg->frsp_tree = kzalloc(sizeof(*fg->frsp_tree), GFP_KERNEL);
> + if (!fg->frsp_tree)
> + return -ENOMEM;
> + ext4_mb_frsp_init_tree(fg->frsp_tree, i);
> + }
> + sbi->s_mb_num_frsp_trees = flex_bg_count;
> +
> + return 0;
> +}
> +
> /*
> * Divide blocks started from @first with length @len into
> * smaller chunks with power of 2 blocks.
> @@ -1042,6 +1802,9 @@ static int ext4_mb_get_buddy_page_lock(struct super_block *sb,
> e4b->bd_buddy_page = NULL;
> e4b->bd_bitmap_page = NULL;
>
> + if (ext4_mb_frsp_on(sb))
> + return 0;
> +
> blocks_per_page = PAGE_SIZE / sb->s_blocksize;
> /*
> * the buddy cache inode stores the block bitmap
> @@ -1099,6 +1862,7 @@ int ext4_mb_init_group(struct super_block *sb, ext4_group_t group, gfp_t gfp)
> struct page *page;
> int ret = 0;
>
> + WARN_ON(ext4_mb_frsp_on(sb));
> might_sleep();
> mb_debug(sb, "init group %u\n", group);
> this_grp = ext4_get_group_info(sb, group);
> @@ -1156,6 +1920,11 @@ int ext4_mb_init_group(struct super_block *sb, ext4_group_t group, gfp_t gfp)
> * Locking note: This routine calls ext4_mb_init_cache(), which takes the
> * block group lock of all groups for this page; do not hold the BG lock when
> * calling this routine!
> + *
> + * For freespace trees, do not hold the tree mutex while calling this routine.
> + * It is okay to hold that mutex only if the EXT4_ALLOCATOR_FRSP_NOLOAD flag
> + * is set in e4b->frsp_flags. If that flag is set, this function doesn't try
> + * to load an unloaded tree.
> */
> static noinline_for_stack int
> ext4_mb_load_allocator_gfp(struct super_block *sb, ext4_group_t group,
> @@ -1166,7 +1935,7 @@ ext4_mb_load_allocator_gfp(struct super_block *sb, ext4_group_t group,
> int pnum;
> int poff;
> struct page *page;
> - int ret;
> + int ret = 0;
> struct ext4_group_info *grp;
> struct ext4_sb_info *sbi = EXT4_SB(sb);
> struct inode *inode = sbi->s_buddy_cache;
> @@ -1183,6 +1952,22 @@ ext4_mb_load_allocator_gfp(struct super_block *sb, ext4_group_t group,
> e4b->bd_group = group;
> e4b->bd_buddy_page = NULL;
> e4b->bd_bitmap_page = NULL;
> + if (ext4_mb_frsp_on(sb)) {
> + e4b->frsp_tree = ext4_get_frsp_tree(sb,
> + ext4_flex_group(sbi, group));
> + e4b->frsp_flags = flags;
> + if (flags & EXT4_ALLOCATOR_FRSP_NOLOAD)
> + return 0;
> +
> + mutex_lock(&e4b->frsp_tree->frsp_lock);
> + if (e4b->frsp_tree->frsp_flags & EXT4_MB_FRSP_FLAG_LOADED) {
> + mutex_unlock(&e4b->frsp_tree->frsp_lock);
> + return 0;
> + }
> + ret = ext4_mb_frsp_load(e4b);
> + mutex_unlock(&e4b->frsp_tree->frsp_lock);
> + return ret;
> + }
>
> if (unlikely(EXT4_MB_GRP_NEED_INIT(grp))) {
> /*
> @@ -1305,6 +2090,8 @@ static int ext4_mb_load_allocator(struct super_block *sb, ext4_group_t group,
>
> static void ext4_mb_unload_allocator(struct ext4_buddy *e4b)
> {
> + if (ext4_mb_frsp_on(e4b->bd_sb))
> + return;
> if (e4b->bd_bitmap_page)
> put_page(e4b->bd_bitmap_page);
> if (e4b->bd_buddy_page)
> @@ -1497,6 +2284,16 @@ static void mb_free_blocks(struct inode *inode, struct ext4_buddy *e4b,
> if (first < e4b->bd_info->bb_first_free)
> e4b->bd_info->bb_first_free = first;
>
> + if (ext4_mb_frsp_on(sb)) {
> + ext4_grpblk_t first_blk = EXT4_C2B(EXT4_SB(sb), first) +
> + ext4_group_first_block_no(sb, e4b->bd_group);
> +
> + ext4_unlock_group(sb, e4b->bd_group);
> + ext4_mb_frsp_free_blocks(sb, e4b->bd_group, first_blk, count);
> + ext4_lock_group(sb, e4b->bd_group);
> + return;
> + }
> +
> /* access memory sequentially: check left neighbour,
> * clear range and then check right neighbour
> */
> @@ -1563,6 +2360,9 @@ static int mb_find_extent(struct ext4_buddy *e4b, int block,
> assert_spin_locked(ext4_group_lock_ptr(e4b->bd_sb, e4b->bd_group));
> BUG_ON(ex == NULL);
>
> + if (ext4_mb_frsp_on(e4b->bd_sb))
> + return 0;
> +
> buddy = mb_find_buddy(e4b, 0, &max);
> BUG_ON(buddy == NULL);
> BUG_ON(block >= max);
> @@ -1627,17 +2427,74 @@ static int mb_mark_used(struct ext4_buddy *e4b, struct ext4_free_extent *ex)
> unsigned ret = 0;
> int len0 = len;
> void *buddy;
> + ext4_grpblk_t flex_offset;
>
> BUG_ON(start + len > (e4b->bd_sb->s_blocksize << 3));
> BUG_ON(e4b->bd_group != ex->fe_group);
> +
> + e4b->bd_info->bb_free -= len;
> + if (e4b->bd_info->bb_first_free == start)
> + e4b->bd_info->bb_first_free += len;
> +
> + if (ext4_mb_frsp_on(e4b->bd_sb)) {
> + struct ext4_frsp_node *new;
> +
> + flex_offset = ext4_blkno_to_flex_offset(e4b->bd_sb,
> + ext4_group_first_block_no(e4b->bd_sb,
> + e4b->bd_group) + ex->fe_start);
> + if (!ex->fe_node) {
> + ex->fe_node = ext4_mb_frsp_search_by_off(e4b->bd_sb,
> + e4b->frsp_tree, flex_offset);
> + if (!ex->fe_node)
> + return 0;
> + }
> + /*
> + * Remove the node from the trees before we modify these since
> + * we will change the length and / or offset of this node.
> + */
> + rb_erase_cached(&ex->fe_node->frsp_node,
> + &e4b->frsp_tree->frsp_offset_root);
> + rb_erase_cached(&ex->fe_node->frsp_len_node,
> + &e4b->frsp_tree->frsp_len_root);
> + RB_CLEAR_NODE(&ex->fe_node->frsp_node);
> + RB_CLEAR_NODE(&ex->fe_node->frsp_len_node);
> + if (flex_offset > ex->fe_node->frsp_offset) {
> + if (flex_offset + ex->fe_len !=
> + ex->fe_node->frsp_offset +
> + ex->fe_node->frsp_len) {
> + /* Need to split the node */
> + new = ext4_mb_frsp_alloc_node(e4b->bd_sb);
> + if (!new)
> + return -ENOMEM;
> + new->frsp_offset = flex_offset + ex->fe_len;
> + new->frsp_len = (ex->fe_node->frsp_offset +
> + ex->fe_node->frsp_len) -
> + new->frsp_offset;
> + ext4_mb_frsp_insert_len(e4b->frsp_tree, new);
> + ext4_mb_frsp_insert_off(e4b->frsp_tree, new);
> + }
> + ex->fe_node->frsp_len = flex_offset -
> + ex->fe_node->frsp_offset;
> + } else if (ex->fe_len < ex->fe_node->frsp_len) {
> + /* used only a part: update node */
> + ex->fe_node->frsp_offset += ex->fe_len;
> + ex->fe_node->frsp_len -= ex->fe_len;
> + } else {
> + ext4_mb_frsp_free_node(e4b->bd_sb, ex->fe_node);
> + return 0;
> + }
> +
> + ext4_mb_frsp_insert_len(e4b->frsp_tree, ex->fe_node);
> + ext4_mb_frsp_insert_off(e4b->frsp_tree, ex->fe_node);
> +
> + return 0;
> + }
> +
> assert_spin_locked(ext4_group_lock_ptr(e4b->bd_sb, e4b->bd_group));
> mb_check_buddy(e4b);
> mb_mark_used_double(e4b, start, len);
>
> this_cpu_inc(discard_pa_seq);
> - e4b->bd_info->bb_free -= len;
> - if (e4b->bd_info->bb_first_free == start)
> - e4b->bd_info->bb_first_free += len;
>
> /* let's maintain fragments counter */
> if (start != 0)
> @@ -2778,6 +3635,13 @@ static int ext4_mb_init_backend(struct super_block *sb)
> rcu_read_lock();
> kvfree(rcu_dereference(sbi->s_group_info));
> rcu_read_unlock();
> + if (ext4_mb_frsp_on(sb)) {
> + for (i = 0; i < sbi->s_mb_num_frsp_trees; i++) {
> + struct ext4_frsp_tree *tree = ext4_get_frsp_tree(sb, i);
> +
> + kfree(tree);
> + }
> + }
> return -ENOMEM;
> }
>
> @@ -2874,6 +3738,13 @@ int ext4_mb_init(struct super_block *sb)
> i++;
> } while (i <= sb->s_blocksize_bits + 1);
>
> + /* init for freespace trees */
> + if (ext4_mb_frsp_on(sb)) {
> + ret = ext4_mb_init_freespace_trees(sb);
> + if (ret)
> + goto out;
> + }
> +
> spin_lock_init(&sbi->s_md_lock);
> spin_lock_init(&sbi->s_bal_lock);
> sbi->s_mb_free_pending = 0;
> @@ -2940,6 +3811,13 @@ int ext4_mb_init(struct super_block *sb)
> sbi->s_mb_offsets = NULL;
> kfree(sbi->s_mb_maxs);
> sbi->s_mb_maxs = NULL;
> + if (ext4_mb_frsp_on(sb)) {
> + for (i = 0; i < sbi->s_mb_num_frsp_trees; i++) {
> + struct ext4_frsp_tree *tree = ext4_get_frsp_tree(sb, i);
> +
> + kfree(tree);
> + }
> + }
> return ret;
> }
>
> @@ -3084,8 +3962,10 @@ static void ext4_free_data_in_buddy(struct super_block *sb,
> /* No more items in the per group rb tree
> * balance refcounts from ext4_mb_free_metadata()
> */
> - put_page(e4b.bd_buddy_page);
> - put_page(e4b.bd_bitmap_page);
> + if (!ext4_mb_frsp_on(sb)) {
> + put_page(e4b.bd_buddy_page);
> + put_page(e4b.bd_bitmap_page);
> + }
> }
> ext4_unlock_group(sb, entry->efd_group);
> kmem_cache_free(ext4_free_data_cachep, entry);
> @@ -3163,9 +4043,14 @@ int __init ext4_init_mballoc(void)
> SLAB_RECLAIM_ACCOUNT);
> if (ext4_free_data_cachep == NULL)
> goto out_ac_free;
> + ext4_freespace_node_cachep = KMEM_CACHE(ext4_frsp_node,
> + SLAB_RECLAIM_ACCOUNT);
> + if (ext4_freespace_node_cachep == NULL)
> + goto out_frsp_free;
>
> return 0;
> -
> +out_frsp_free:
> + kmem_cache_destroy(ext4_free_data_cachep);
> out_ac_free:
> kmem_cache_destroy(ext4_ac_cachep);
> out_pa_free:
> @@ -3184,6 +4069,7 @@ void ext4_exit_mballoc(void)
> kmem_cache_destroy(ext4_pspace_cachep);
> kmem_cache_destroy(ext4_ac_cachep);
> kmem_cache_destroy(ext4_free_data_cachep);
> + kmem_cache_destroy(ext4_freespace_node_cachep);
> ext4_groupinfo_destroy_slabs();
> }
>
> @@ -3686,6 +4572,9 @@ ext4_mb_use_preallocated(struct ext4_allocation_context *ac)
> if (!(ac->ac_flags & EXT4_MB_HINT_DATA))
> return false;
>
> + if (ext4_mb_frsp_on(ac->ac_sb))
> + return 0;
> +
> /* first, try per-file preallocation */
> rcu_read_lock();
> list_for_each_entry_rcu(pa, &ei->i_prealloc_list, pa_inode_list) {
> @@ -4362,6 +5251,8 @@ static int ext4_mb_pa_alloc(struct ext4_allocation_context *ac)
> struct ext4_prealloc_space *pa;
>
> BUG_ON(ext4_pspace_cachep == NULL);
> + if (ext4_mb_frsp_on(ac->ac_sb))
> + return 0;
> pa = kmem_cache_zalloc(ext4_pspace_cachep, GFP_NOFS);
> if (!pa)
> return -ENOMEM;
> @@ -4374,6 +5265,8 @@ static void ext4_mb_pa_free(struct ext4_allocation_context *ac)
> {
> struct ext4_prealloc_space *pa = ac->ac_pa;
>
> + if (ext4_mb_frsp_on(ac->ac_sb))
> + return;
> BUG_ON(!pa);
> ac->ac_pa = NULL;
> WARN_ON(!atomic_dec_and_test(&pa->pa_count));
> @@ -4547,6 +5440,13 @@ ext4_mb_initialize_context(struct ext4_allocation_context *ac,
> ac->ac_o_ex.fe_len = len;
> ac->ac_g_ex = ac->ac_o_ex;
> ac->ac_flags = ar->flags;
> + if (ext4_mb_frsp_on(sb))
> + ac->ac_flags |= EXT4_MB_HINT_NOPREALLOC;
> +
> + /* set up best-found tree node */
> + ac->ac_b_tree_ex.te_flex_start = 0;
> + ac->ac_b_tree_ex.te_flex = 0;
> + ac->ac_b_tree_ex.te_len = 0;
>
> /* we have to define context: we'll we work with a file or
> * locality group. this is a policy, actually */
> @@ -4867,7 +5767,11 @@ ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle,
> goto errout;
> repeat:
> /* allocate space in core */
> - *errp = ext4_mb_regular_allocator(ac);
> + if (ext4_mb_frsp_on(sb))
> + *errp = ext4_mb_tree_allocator(ac);
> + else
> + *errp = ext4_mb_regular_allocator(ac);
> +
> /*
> * pa allocated above is added to grp->bb_prealloc_list only
> * when we were able to allocate some block i.e. when
> @@ -4880,8 +5784,13 @@ ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle,
> ext4_discard_allocated_blocks(ac);
> goto errout;
> }
> - if (ac->ac_status == AC_STATUS_FOUND &&
> - ac->ac_o_ex.fe_len >= ac->ac_f_ex.fe_len)
> + /*
> + * Freespace trees should never return more than what was asked
> + * for.
> + */
> + if (!ext4_mb_frsp_on(sb) &&
> + (ac->ac_status == AC_STATUS_FOUND &&
> + ac->ac_o_ex.fe_len >= ac->ac_f_ex.fe_len))
> ext4_mb_pa_free(ac);
> }
> if (likely(ac->ac_status == AC_STATUS_FOUND)) {
> @@ -4972,13 +5881,12 @@ ext4_mb_free_metadata(handle_t *handle, struct ext4_buddy *e4b,
> struct rb_node *parent = NULL, *new_node;
>
> BUG_ON(!ext4_handle_valid(handle));
> - BUG_ON(e4b->bd_bitmap_page == NULL);
> - BUG_ON(e4b->bd_buddy_page == NULL);
> + BUG_ON(!ext4_mb_frsp_on(sb) && e4b->bd_bitmap_page == NULL);
>
> new_node = &new_entry->efd_node;
> cluster = new_entry->efd_start_cluster;
>
> - if (!*n) {
> + if (!*n && !ext4_mb_frsp_on(sb)) {
> /* first free block exent. We need to
> protect buddy cache from being freed,
> * otherwise we'll refresh it from
> @@ -5457,6 +6365,7 @@ __acquires(bitlock)
> ex.fe_start = start;
> ex.fe_group = group;
> ex.fe_len = count;
> + ex.fe_node = NULL;
>
> /*
> * Mark blocks used, so no one can reuse them while
> @@ -5496,6 +6405,7 @@ ext4_trim_all_free(struct super_block *sb, ext4_group_t group,
> void *bitmap;
> ext4_grpblk_t next, count = 0, free_count = 0;
> struct ext4_buddy e4b;
> + struct buffer_head *bh = NULL;
> int ret = 0;
>
> trace_ext4_trim_all_free(sb, group, start, max);
> @@ -5506,7 +6416,17 @@ ext4_trim_all_free(struct super_block *sb, ext4_group_t group,
> ret, group);
> return ret;
> }
> - bitmap = e4b.bd_bitmap;
> +
> + if (ext4_mb_frsp_on(sb)) {
> + bh = ext4_read_block_bitmap(sb, group);
> + if (IS_ERR(bh)) {
> + ret = PTR_ERR(bh);
> + goto out;
> + }
> + bitmap = bh->b_data;
> + } else {
> + bitmap = e4b.bd_bitmap;
> + }
>
> ext4_lock_group(sb, group);
> if (EXT4_MB_GRP_WAS_TRIMMED(e4b.bd_info) &&
> @@ -5553,6 +6473,8 @@ ext4_trim_all_free(struct super_block *sb, ext4_group_t group,
> EXT4_MB_GRP_SET_TRIMMED(e4b.bd_info);
> }
> out:
> + if (!IS_ERR_OR_NULL(bh))
> + brelse(bh);
> ext4_unlock_group(sb, group);
> ext4_mb_unload_allocator(&e4b);
>
> @@ -5613,7 +6535,7 @@ int ext4_trim_fs(struct super_block *sb, struct fstrim_range *range)
> for (group = first_group; group <= last_group; group++) {
> grp = ext4_get_group_info(sb, group);
> /* We only do this if the grp has never been initialized */
> - if (unlikely(EXT4_MB_GRP_NEED_INIT(grp))) {
> + if (unlikely(!ext4_mb_frsp_on(sb) && EXT4_MB_GRP_NEED_INIT(grp))) {
> ret = ext4_mb_init_group(sb, group, GFP_NOFS);
> if (ret)
> break;
> @@ -5667,16 +6589,25 @@ ext4_mballoc_query_range(
> ext4_grpblk_t next;
> struct ext4_buddy e4b;
> int error;
> + struct buffer_head *bh = NULL;
>
> - error = ext4_mb_load_allocator(sb, group, &e4b, 0);
> - if (error)
> - return error;
> - bitmap = e4b.bd_bitmap;
> -
> + if (ext4_mb_frsp_on(sb)) {
> + bh = ext4_read_block_bitmap(sb, group);
> + if (IS_ERR(bh)) {
> + error = PTR_ERR(bh);
> + goto out_unload;
> + }
> + bitmap = bh->b_data;
> + } else {
> + error = ext4_mb_load_allocator(sb, group, &e4b, 0);
> + if (error)
> + return error;
> + bitmap = e4b.bd_bitmap;
> + }
> ext4_lock_group(sb, group);
> -
> - start = (e4b.bd_info->bb_first_free > start) ?
> - e4b.bd_info->bb_first_free : start;
> + if (!ext4_mb_frsp_on(sb))
> + start = (e4b.bd_info->bb_first_free > start) ?
> + e4b.bd_info->bb_first_free : start;
> if (end >= EXT4_CLUSTERS_PER_GROUP(sb))
> end = EXT4_CLUSTERS_PER_GROUP(sb) - 1;
>
> @@ -5685,19 +6616,20 @@ ext4_mballoc_query_range(
> if (start > end)
> break;
> next = mb_find_next_bit(bitmap, end + 1, start);
> -
> ext4_unlock_group(sb, group);
> error = formatter(sb, group, start, next - start, priv);
> if (error)
> goto out_unload;
> ext4_lock_group(sb, group);
> -
> start = next + 1;
> }
>
> ext4_unlock_group(sb, group);
> out_unload:
> - ext4_mb_unload_allocator(&e4b);
> + if (!IS_ERR_OR_NULL(bh))
> + brelse(bh);
> + if (!ext4_mb_frsp_on(sb))
> + ext4_mb_unload_allocator(&e4b);
>
> return error;
> }
> diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h
> index 6b4d17c2935d..32b9ee452de7 100644
> --- a/fs/ext4/mballoc.h
> +++ b/fs/ext4/mballoc.h
> @@ -74,6 +74,20 @@
> #define MB_DEFAULT_GROUP_PREALLOC 512
>
>
> +/*
> + * Struct for tree node in freespace_tree
> + */
> +struct ext4_frsp_node {
> + ext4_grpblk_t frsp_offset; /* Start block offset inside
> + * current flexible group
> + */
> + ext4_grpblk_t frsp_len; /*
> + * Length of the free space in
> + * number of clusters
> + */
> + struct rb_node frsp_node;
> + struct rb_node frsp_len_node;
> +};
> struct ext4_free_data {
> /* this links the free block information from sb_info */
> struct list_head efd_list;
> @@ -121,6 +135,7 @@ struct ext4_free_extent {
> ext4_grpblk_t fe_start; /* In cluster units */
> ext4_group_t fe_group;
> ext4_grpblk_t fe_len; /* In cluster units */
> + struct ext4_frsp_node *fe_node;
> };
>
> /*
> @@ -141,7 +156,14 @@ struct ext4_locality_group {
> spinlock_t lg_prealloc_lock;
> };
>
> +struct ext4_tree_extent {
> + ext4_group_t te_flex; /* flex_bg index (tree index) */
> + ext4_grpblk_t te_flex_start; /* block offset w.r.t flex bg */
> + ext4_grpblk_t te_len; /* length */
> +};
> +
> struct ext4_allocation_context {
> + __u32 ac_id;
> struct inode *ac_inode;
> struct super_block *ac_sb;
>
> @@ -154,8 +176,16 @@ struct ext4_allocation_context {
> /* the best found extent */
> struct ext4_free_extent ac_b_ex;
>
> - /* copy of the best found extent taken before preallocation efforts */
> - struct ext4_free_extent ac_f_ex;
> + /* With freespace trees, we don't use preallocation anymore. */
> + union {
> + /*
> + * copy of the best found extent taken before
> + * preallocation efforts
> + */
> + struct ext4_free_extent ac_f_ex;
> + /* the best found tree extent */
> + struct ext4_tree_extent ac_b_tree_ex;
> + };
>
> __u16 ac_groups_scanned;
> __u16 ac_found;
> @@ -177,14 +207,31 @@ struct ext4_allocation_context {
> #define AC_STATUS_FOUND 2
> #define AC_STATUS_BREAK 3
>
> +/*
> + * Freespace tree flags
> + */
> +#define EXT4_ALLOCATOR_FRSP_NOLOAD 0x0001 /*
> + * Don't load freespace
> + * tree if it's not
> + * in memory.
> + */
> +
> struct ext4_buddy {
> - struct page *bd_buddy_page;
> - void *bd_buddy;
> - struct page *bd_bitmap_page;
> - void *bd_bitmap;
> + union {
> + struct {
> + struct page *bd_buddy_page;
> + void *bd_buddy;
> + struct page *bd_bitmap_page;
> + void *bd_bitmap;
> + __u16 bd_blkbits;
> + };
> + struct {
> + struct ext4_frsp_tree *frsp_tree;
> + __u32 frsp_flags;
> + };
> + };
> struct ext4_group_info *bd_info;
> struct super_block *bd_sb;
> - __u16 bd_blkbits;
> ext4_group_t bd_group;
> };
>
> diff --git a/fs/ext4/resize.c b/fs/ext4/resize.c
> index a50b51270ea9..6a0e1fc18e95 100644
> --- a/fs/ext4/resize.c
> +++ b/fs/ext4/resize.c
> @@ -1679,6 +1679,10 @@ int ext4_group_add(struct super_block *sb, struct ext4_new_group_data *input)
> if (err)
> goto out;
>
> + err = ext4_mb_add_frsp_trees(sb, input->group + 1);
> + if (err)
> + goto out;
> +
> flex_gd.count = 1;
> flex_gd.groups = input;
> flex_gd.bg_flags = &bg_flags;
> @@ -2051,6 +2055,10 @@ int ext4_resize_fs(struct super_block *sb, ext4_fsblk_t n_blocks_count)
> if (err)
> goto out;
>
> + err = ext4_mb_add_frsp_trees(sb, n_group + 1);
> + if (err)
> + goto out;
> +
> flex_gd = alloc_flex_gd(flexbg_size);
> if (flex_gd == NULL) {
> err = -ENOMEM;
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 98aa12602b68..2f4b7061365f 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -1521,7 +1521,7 @@ enum {
> Opt_dioread_nolock, Opt_dioread_lock,
> Opt_discard, Opt_nodiscard, Opt_init_itable, Opt_noinit_itable,
> Opt_max_dir_size_kb, Opt_nojournal_checksum, Opt_nombcache,
> - Opt_prefetch_block_bitmaps,
> + Opt_prefetch_block_bitmaps, Opt_freespace_tree,
> };
>
> static const match_table_t tokens = {
> @@ -1608,6 +1608,7 @@ static const match_table_t tokens = {
> {Opt_init_itable, "init_itable=%u"},
> {Opt_init_itable, "init_itable"},
> {Opt_noinit_itable, "noinit_itable"},
> + {Opt_freespace_tree, "freespace_tree"},
> {Opt_max_dir_size_kb, "max_dir_size_kb=%u"},
> {Opt_test_dummy_encryption, "test_dummy_encryption=%s"},
> {Opt_test_dummy_encryption, "test_dummy_encryption"},
> @@ -1834,6 +1835,8 @@ static const struct mount_opts {
> {Opt_nombcache, EXT4_MOUNT_NO_MBCACHE, MOPT_SET},
> {Opt_prefetch_block_bitmaps, EXT4_MOUNT_PREFETCH_BLOCK_BITMAPS,
> MOPT_SET},
> + {Opt_freespace_tree, EXT4_MOUNT2_FREESPACE_TREE,
> + MOPT_SET | MOPT_2 | MOPT_EXT4_ONLY},
> {Opt_err, 0, 0}
> };
>
> @@ -4740,12 +4743,19 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
> goto failed_mount4a;
> }
>
Flex_bg is required. This requirement needs to be documented. Also, mkfs should prevent creating a partition with this new feature but without flex_bg. Same for tunefs.
Harshad, are you going to share the e2fsprogs changes with the new option support soon? It would be great to have them for debugging purposes.
> + if (ext4_has_feature_flex_bg(sb))
> + if (!ext4_fill_flex_info(sb)) {
> + ext4_msg(sb, KERN_ERR,
> + "unable to initialize flex_bg meta info!");
> + goto failed_mount5;
> + }
> +
> ext4_ext_init(sb);
> err = ext4_mb_init(sb);
> if (err) {
> ext4_msg(sb, KERN_ERR, "failed to initialize mballoc (%d)",
> err);
> - goto failed_mount5;
> + goto failed_mount6;
> }
>
> block = ext4_count_free_clusters(sb);
> @@ -4775,14 +4785,6 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
> goto failed_mount6;
> }
>
> - if (ext4_has_feature_flex_bg(sb))
> - if (!ext4_fill_flex_info(sb)) {
> - ext4_msg(sb, KERN_ERR,
> - "unable to initialize "
> - "flex_bg meta info!");
> - goto failed_mount6;
> - }
> -
> err = ext4_register_li_request(sb, first_not_zeroed);
> if (err)
> goto failed_mount6;
> @@ -4856,7 +4858,14 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
> ext4_unregister_li_request(sb);
> failed_mount6:
> ext4_mb_release(sb);
> + percpu_counter_destroy(&sbi->s_freeclusters_counter);
> + percpu_counter_destroy(&sbi->s_freeinodes_counter);
> + percpu_counter_destroy(&sbi->s_dirs_counter);
> + percpu_counter_destroy(&sbi->s_dirtyclusters_counter);
> + percpu_free_rwsem(&sbi->s_writepages_rwsem);
> rcu_read_lock();
> +
> +failed_mount5:
> flex_groups = rcu_dereference(sbi->s_flex_groups);
> if (flex_groups) {
> for (i = 0; i < sbi->s_flex_groups_allocated; i++)
> @@ -4864,12 +4873,6 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
> kvfree(flex_groups);
> }
> rcu_read_unlock();
> - percpu_counter_destroy(&sbi->s_freeclusters_counter);
> - percpu_counter_destroy(&sbi->s_freeinodes_counter);
> - percpu_counter_destroy(&sbi->s_dirs_counter);
> - percpu_counter_destroy(&sbi->s_dirtyclusters_counter);
> - percpu_free_rwsem(&sbi->s_writepages_rwsem);
> -failed_mount5:
> ext4_ext_release(sb);
> ext4_release_system_zone(sb);
> failed_mount4a:
> --
> 2.28.0.220.ged08abb693-goog
Is it possible to split this patch into “freespace tree primitives” and “free space allocator logic” to simplify patch inspection and landing? Thanks.
Best regards,
Artem Blagodarenko.
^ permalink raw reply [flat|nested] 13+ messages in thread
* [PATCH 4/9] ext4: add prefetching support for freespace trees
2020-08-19 7:30 [PATCH 0/9] ext4: add free-space extent based allocator Harshad Shirwadkar
` (2 preceding siblings ...)
2020-08-19 7:30 ` [PATCH 3/9] ext4: add freespace tree allocator Harshad Shirwadkar
@ 2020-08-19 7:30 ` Harshad Shirwadkar
2020-08-19 7:31 ` [PATCH 5/9] ext4: add freespace tree optimizations Harshad Shirwadkar
` (4 subsequent siblings)
8 siblings, 0 replies; 13+ messages in thread
From: Harshad Shirwadkar @ 2020-08-19 7:30 UTC (permalink / raw)
To: linux-ext4; +Cc: tytso, lyx1209, Harshad Shirwadkar
This patch adds support for creating freespace trees from prefetched
block bitmaps.
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
---
fs/ext4/mballoc.c | 12 +++++++++++-
1 file changed, 11 insertions(+), 1 deletion(-)
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 74bdc2dcb49c..1f4e69c6f488 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -3145,6 +3145,9 @@ ext4_group_t ext4_mb_prefetch(struct super_block *sb, ext4_group_t group,
void ext4_mb_prefetch_fini(struct super_block *sb, ext4_group_t group,
unsigned int nr)
{
+ struct ext4_buddy e4b;
+ int ret;
+
while (nr-- > 0) {
struct ext4_group_desc *gdp = ext4_get_group_desc(sb, group,
NULL);
@@ -3159,8 +3162,15 @@ void ext4_mb_prefetch_fini(struct super_block *sb, ext4_group_t group,
ext4_free_group_clusters(sb, gdp) > 0 &&
!(ext4_has_group_desc_csum(sb) &&
(gdp->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)))) {
- if (ext4_mb_init_group(sb, group, GFP_NOFS))
+ if (ext4_mb_frsp_on(sb)) {
+ ret = ext4_mb_load_allocator(sb, group, &e4b,
+ 0);
+ if (ret)
+ break;
+ ext4_mb_unload_allocator(&e4b);
+ } else if (ext4_mb_init_group(sb, group, GFP_NOFS)) {
break;
+ }
}
}
}
--
2.28.0.220.ged08abb693-goog
^ permalink raw reply related [flat|nested] 13+ messages in thread
* [PATCH 5/9] ext4: add freespace tree optimizations
2020-08-19 7:30 [PATCH 0/9] ext4: add free-space extent based allocator Harshad Shirwadkar
` (3 preceding siblings ...)
2020-08-19 7:30 ` [PATCH 4/9] ext4: add prefetching support for freespace trees Harshad Shirwadkar
@ 2020-08-19 7:31 ` Harshad Shirwadkar
2020-08-19 7:31 ` [PATCH 6/9] ext4: add memory usage tracker for freespace trees Harshad Shirwadkar
` (3 subsequent siblings)
8 siblings, 0 replies; 13+ messages in thread
From: Harshad Shirwadkar @ 2020-08-19 7:31 UTC (permalink / raw)
To: linux-ext4; +Cc: tytso, lyx1209, Harshad Shirwadkar
This patch adds an optimization on top of our freespace tree
allocator. We add a new meta-tree which contains the per-flexbg trees
sorted by the length of their largest free extent. We use this tree to
optimize an allocation request so that it either goes straight to a
tree that can satisfy it or falls back to the next CR level.
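For example (illustrative numbers only): if the largest free extents
in trees T0..T3 are 8, 64, 16 and 256 clusters respectively, an
allocation request for 32 clusters descends the meta-tree in
O(Log(Number of Flex BGs)) and lands directly on T1 or T3, instead of
linearly scanning T0 and T2 only to find that they cannot satisfy it.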
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
---
fs/ext4/ext4.h | 17 +++++
fs/ext4/mballoc.c | 178 ++++++++++++++++++++++++++++++++++++++++++++++
fs/ext4/mballoc.h | 1 +
fs/ext4/super.c | 9 +++
4 files changed, 205 insertions(+)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 3bb2675d4d40..8cfe089ebea6 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -153,6 +153,8 @@ enum SHIFT_DIRECTION {
#define EXT4_MB_USE_RESERVED 0x2000
/* Do strict check for free blocks while retrying block allocation */
#define EXT4_MB_STRICT_CHECK 0x4000
+/* Disable freespace optimizations on ac */
+#define EXT4_MB_FRSP_NO_OPTIMIZE 0x8000
struct ext4_allocation_request {
/* target inode for block we're allocating */
@@ -1427,12 +1429,22 @@ struct ext4_frsp_tree {
__u32 frsp_index; /* Tree index (flex bg
* number)
*/
+ struct list_head frsp_list_node;
+ struct rb_node frsp_len_node;
};
/* Freespace tree flags */
/* Tree is loaded in memory */
#define EXT4_MB_FRSP_FLAG_LOADED 0x0001
+/* Tree is inserted into meta tree */
+#define EXT4_MB_FRSP_FLAG_CACHED 0x2
+
+/* Freespace tree cache aggression levels */
+#define EXT4_MB_FRSP_MIN_CACHE_AGGRESSION 0x0
+#define EXT4_MB_FRSP_DEFAULT_CACHE_AGGRESSION 0x1
+#define EXT4_MB_FRSP_MAX_CACHE_AGGRESSION 0x3
+
/*
* fourth extended-fs super-block data in memory
*/
@@ -1572,6 +1584,11 @@ struct ext4_sb_info {
/* freespace trees stuff */
int s_mb_num_frsp_trees;
+ struct rb_root_cached s_mb_frsp_meta_tree;
+ rwlock_t s_mb_frsp_lock;
+ atomic_t s_mb_num_frsp_trees_cached;
+ struct list_head s_mb_uncached_trees;
+ u32 s_mb_frsp_cache_aggression;
/* workqueue for reserved extent conversions (buffered io) */
struct workqueue_struct *rsv_conversion_wq;
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 1f4e69c6f488..fa027b626abe 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -779,6 +779,13 @@ static inline int ext4_mb_frsp_len_cmp(struct ext4_frsp_node *arg1,
return (arg1->frsp_len > arg2->frsp_len);
}
+/* Compare function for meta tree */
+static inline int ext4_mb_frsp_meta_cmp(struct ext4_frsp_tree *arg1,
+ struct ext4_frsp_tree *arg2)
+{
+ return (arg1->frsp_max_free_len > arg2->frsp_max_free_len);
+}
+
/* insert to offset-indexed tree */
static void ext4_mb_frsp_insert_off(struct ext4_frsp_tree *tree,
struct ext4_frsp_node *new_entry)
@@ -798,6 +805,35 @@ static void ext4_mb_frsp_insert_len(struct ext4_frsp_tree *tree,
ext4_mb_frsp_len_cmp);
}
+void ext4_mb_frsp_meta_reinsert(struct super_block *sb,
+ struct ext4_frsp_tree *tree)
+{
+ struct ext4_sb_info *sbi = EXT4_SB(sb);
+ struct ext4_frsp_node *node;
+ struct rb_node *first = rb_first_cached(&tree->frsp_len_root);
+ struct rb_root_cached *meta_root = &EXT4_SB(sb)->s_mb_frsp_meta_tree;
+ int expected_len = 0;
+
+ if (!(tree->frsp_flags & EXT4_MB_FRSP_FLAG_LOADED))
+ return;
+
+ if (first) {
+ node = rb_entry(first, struct ext4_frsp_node, frsp_len_node);
+ expected_len = node->frsp_len;
+ }
+
+ if (tree->frsp_max_free_len == expected_len)
+ return;
+
+ write_lock(&sbi->s_mb_frsp_lock);
+ tree->frsp_max_free_len = expected_len;
+ rb_erase_cached(&tree->frsp_len_node, &sbi->s_mb_frsp_meta_tree);
+ RB_CLEAR_NODE(&tree->frsp_len_node);
+ ext4_mb_rb_insert(meta_root, tree, struct ext4_frsp_tree, frsp_len_node,
+ ext4_mb_frsp_meta_cmp);
+ write_unlock(&sbi->s_mb_frsp_lock);
+}
+
#ifdef CONFIG_EXT4_DEBUG
/* print freespace_tree in pre-order traversal */
void ext4_mb_frsp_print_tree_len(struct super_block *sb,
@@ -966,6 +1002,7 @@ int ext4_mb_frsp_add_region(struct super_block *sb, struct ext4_frsp_tree *tree,
}
}
ext4_mb_frsp_insert_len(tree, new_entry);
+ ext4_mb_frsp_meta_reinsert(sb, tree);
return 0;
}
@@ -1063,6 +1100,9 @@ int ext4_mb_frsp_load(struct ext4_buddy *e4b)
if (ret)
goto out;
}
+ if (!(e4b->frsp_tree->frsp_flags & EXT4_MB_FRSP_FLAG_CACHED))
+ atomic_inc(&sbi->s_mb_num_frsp_trees_cached);
+ e4b->frsp_tree->frsp_flags |= EXT4_MB_FRSP_FLAG_CACHED;
e4b->frsp_tree->frsp_flags |= EXT4_MB_FRSP_FLAG_LOADED;
out:
for (i = 0; i < ngroups; i++) {
@@ -1156,6 +1196,7 @@ static void ext4_mb_frsp_use_best_found(struct ext4_allocation_context *ac,
ext4_mb_load_allocator(ac->ac_sb, group_no, &e4b,
EXT4_ALLOCATOR_FRSP_NOLOAD);
mb_mark_used(&e4b, bex);
+ ext4_mb_frsp_meta_reinsert(ac->ac_sb, e4b.frsp_tree);
ext4_mb_unload_allocator(&e4b);
}
/*
@@ -1286,6 +1327,124 @@ void ext4_mb_frsp_find_by_goal(struct ext4_allocation_context *ac)
ext4_mb_unload_allocator(&e4b);
}
+/*
+ * Determine if caching of trees is necessary. This function returns 1 if it is,
+ * 0 if it is not.
+ */
+static int ext4_mb_frsp_should_cache(struct ext4_allocation_context *ac)
+{
+ struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
+
+ if (list_empty(&sbi->s_mb_uncached_trees))
+ return 0;
+
+ if (sbi->s_mb_frsp_cache_aggression >=
+ EXT4_MB_FRSP_MAX_CACHE_AGGRESSION)
+ return 1;
+
+ if (sbi->s_mb_frsp_cache_aggression ==
+ EXT4_MB_FRSP_MIN_CACHE_AGGRESSION)
+ return 0;
+
+ /* At cache aggression level 1, skip caching at CR 0 */
+ if (sbi->s_mb_frsp_cache_aggression == 1 && ac->ac_criteria == 0)
+ return 0;
+
+ /*
+ * At cache aggression level 2, perform caching for every alternate
+ * optimization.
+ */
+ return (ac->ac_num_optimizations & 0x1);
+}
+
+/*
+ * Optimize an allocation request. This function _tries_ to look up the
+ * meta-tree and, if it can optimize the search in any way, it does so. This
+ * function returns 1 if the optimization was performed. In this case, the
+ * caller should restart the search from the tree mentioned in *tree_idx.
+ */
+int ext4_mb_frsp_optimize(struct ext4_allocation_context *ac, int *tree_idx)
+{
+ struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
+ struct ext4_frsp_tree *cur = NULL;
+ struct rb_node *node = NULL;
+ int found = 0, best = 0, cache_more_trees = 0, better_len = 0, ret = 0;
+
+ if (ac->ac_flags & EXT4_MB_HINT_FIRST ||
+ ac->ac_flags & EXT4_MB_FRSP_NO_OPTIMIZE ||
+ ac->ac_status != AC_STATUS_CONTINUE)
+ return 0;
+
+ ac->ac_num_optimizations++;
+ if (!read_trylock(&sbi->s_mb_frsp_lock))
+ return 0;
+
+ node = sbi->s_mb_frsp_meta_tree.rb_root.rb_node;
+ while (node) {
+ cur = rb_entry(node, struct ext4_frsp_tree, frsp_len_node);
+ if (ext4_mb_frsp_is_better(ac, cur->frsp_max_free_len, &best)) {
+ /*
+ * This tree definitely has a better node than the best
+ * found so far.
+ */
+ found = 1;
+ ret = 1;
+ *tree_idx = cur->frsp_index;
+ better_len = cur->frsp_max_free_len;
+ if (best)
+ break;
+ }
+ if (cur->frsp_max_free_len > ac->ac_g_ex.fe_len)
+ node = node->rb_right;
+ else
+ node = node->rb_left;
+ }
+
+ if (ac->ac_found == 0 && !found) {
+ /*
+ * If we haven't found a good match above, and we hadn't found
+ * any match before us, that means we need to loosen our
+ * criteria. Note that, if we had found something earlier,
+ * not finding a better node doesn't imply that there is no
+ * better node available.
+ * TODO - in this case determine probabilistically which tree
+ * may have a better node and direct our allocator there.
+ */
+ if (ext4_mb_frsp_should_cache(ac)) {
+ cache_more_trees = 1;
+ } else if (ac->ac_criteria < 2) {
+ ac->ac_criteria++;
+ ac->ac_groups_scanned = 0;
+ *tree_idx = 0;
+ ret = 1;
+ } else {
+ ac->ac_flags |= EXT4_MB_HINT_FIRST;
+ }
+ } else if (!best && !list_empty(&sbi->s_mb_uncached_trees)) {
+ cache_more_trees = ext4_mb_frsp_should_cache(ac);
+ }
+
+ if (cache_more_trees) {
+ cur = list_first_entry(&sbi->s_mb_uncached_trees,
+ struct ext4_frsp_tree, frsp_list_node);
+ list_del_init(&cur->frsp_list_node);
+ *tree_idx = cur->frsp_index;
+ ret = 1;
+ }
+ read_unlock(&sbi->s_mb_frsp_lock);
+
+ /*
+ * If we couldn't optimize now, it's unlikely that we'll be able to
+ * optimize this request anymore
+ */
+ if (!ret)
+ ac->ac_flags |= EXT4_MB_FRSP_NO_OPTIMIZE;
+ mb_debug(ac->ac_sb,
+ "Optimizer suggestion: found = %d, tree = %d, len = %d, cr = %d\n",
+ found, *tree_idx, better_len, ac->ac_criteria);
+ return ret;
+}
+
static void ext4_mb_frsp_process(struct ext4_allocation_context *ac,
struct ext4_frsp_tree *tree)
{
@@ -1324,6 +1483,7 @@ ext4_mb_tree_allocator(struct ext4_allocation_context *ac)
struct ext4_frsp_node *cur = NULL;
struct ext4_tree_extent *btx = NULL;
int ret = 0, start_idx = 0, tree_idx, j;
+ int optimize;
sb = ac->ac_sb;
btx = &ac->ac_b_tree_ex;
@@ -1341,6 +1501,8 @@ ext4_mb_tree_allocator(struct ext4_allocation_context *ac)
ac->ac_criteria = 0;
ac->ac_groups_scanned = 0;
+ ext4_mb_frsp_optimize(ac, &start_idx);
+
repeat:
/* Loop through the rest of trees (flex_bg) */
@@ -1357,13 +1519,17 @@ ext4_mb_tree_allocator(struct ext4_allocation_context *ac)
mutex_lock(&e4b.frsp_tree->frsp_lock);
ext4_mb_frsp_process(ac, e4b.frsp_tree);
mutex_unlock(&e4b.frsp_tree->frsp_lock);
+ optimize = ext4_mb_frsp_optimize(ac, &start_idx);
ext4_mb_unload_allocator(&e4b);
+ if (optimize)
+ goto repeat;
}
if (ac->ac_status != AC_STATUS_FOUND) {
if (ac->ac_criteria < 2) {
ac->ac_criteria++;
ac->ac_groups_scanned = 0;
+ ac->ac_flags &= ~EXT4_MB_FRSP_NO_OPTIMIZE;
mb_debug(sb, "Falling back to CR=%d", ac->ac_criteria);
goto repeat;
}
@@ -1415,6 +1581,7 @@ static void ext4_mb_frsp_init_tree(struct ext4_frsp_tree *tree, int index)
mutex_init(&(tree->frsp_lock));
tree->frsp_flags = 0;
tree->frsp_index = index;
+ INIT_LIST_HEAD(&tree->frsp_list_node);
}
int ext4_mb_init_freespace_trees(struct super_block *sb)
@@ -1425,6 +1592,8 @@ int ext4_mb_init_freespace_trees(struct super_block *sb)
sbi->s_mb_num_frsp_trees =
ext4_num_grps_to_flexbg(sb, ext4_get_groups_count(sb));
+ sbi->s_mb_frsp_meta_tree = RB_ROOT_CACHED;
+ INIT_LIST_HEAD(&sbi->s_mb_uncached_trees);
for (i = 0; i < sbi->s_mb_num_frsp_trees; i++) {
fg = sbi_array_rcu_deref(sbi, s_flex_groups, i);
@@ -1433,7 +1602,11 @@ int ext4_mb_init_freespace_trees(struct super_block *sb)
if (!fg->frsp_tree)
return -ENOMEM;
ext4_mb_frsp_init_tree(fg->frsp_tree, i);
+ list_add(&fg->frsp_tree->frsp_list_node,
+ &sbi->s_mb_uncached_trees);
}
+ rwlock_init(&sbi->s_mb_frsp_lock);
+ atomic_set(&sbi->s_mb_num_frsp_trees_cached, 0);
return 0;
}
@@ -1460,6 +1633,11 @@ int ext4_mb_add_frsp_trees(struct super_block *sb, int ngroups)
if (!fg->frsp_tree)
return -ENOMEM;
ext4_mb_frsp_init_tree(fg->frsp_tree, i);
+ write_lock(&sbi->s_mb_frsp_lock);
+ list_add(&fg->frsp_tree->frsp_list_node,
+ &sbi->s_mb_uncached_trees);
+ write_unlock(&sbi->s_mb_frsp_lock);
+ ext4_mb_frsp_meta_reinsert(sb, fg->frsp_tree);
}
sbi->s_mb_num_frsp_trees = flex_bg_count;
diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h
index 32b9ee452de7..ac65f7eac611 100644
--- a/fs/ext4/mballoc.h
+++ b/fs/ext4/mballoc.h
@@ -196,6 +196,7 @@ struct ext4_allocation_context {
__u8 ac_criteria;
__u8 ac_2order; /* if request is to allocate 2^N blocks and
* N > 0, the field stores N, otherwise 0 */
+ __u8 ac_num_optimizations;
__u8 ac_op; /* operation, for history only */
struct page *ac_bitmap_page;
struct page *ac_buddy_page;
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 2f4b7061365f..97b63e521b97 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1522,6 +1522,8 @@ enum {
Opt_discard, Opt_nodiscard, Opt_init_itable, Opt_noinit_itable,
Opt_max_dir_size_kb, Opt_nojournal_checksum, Opt_nombcache,
Opt_prefetch_block_bitmaps, Opt_freespace_tree,
+ Opt_frsp_cache_aggression,
+
};
static const match_table_t tokens = {
@@ -1615,6 +1617,7 @@ static const match_table_t tokens = {
{Opt_nombcache, "nombcache"},
{Opt_nombcache, "no_mbcache"}, /* for backward compatibility */
{Opt_prefetch_block_bitmaps, "prefetch_block_bitmaps"},
+ {Opt_frsp_cache_aggression, "frsp_cache_aggression=%u"},
{Opt_removed, "check=none"}, /* mount option from ext2/3 */
{Opt_removed, "nocheck"}, /* mount option from ext2/3 */
{Opt_removed, "reservation"}, /* mount option from ext2/3 */
@@ -1831,6 +1834,7 @@ static const struct mount_opts {
{Opt_jqfmt_vfsv0, QFMT_VFS_V0, MOPT_QFMT},
{Opt_jqfmt_vfsv1, QFMT_VFS_V1, MOPT_QFMT},
{Opt_max_dir_size_kb, 0, MOPT_GTE0},
+ {Opt_frsp_cache_aggression, 0, MOPT_GTE0},
{Opt_test_dummy_encryption, 0, MOPT_STRING},
{Opt_nombcache, EXT4_MOUNT_NO_MBCACHE, MOPT_SET},
{Opt_prefetch_block_bitmaps, EXT4_MOUNT_PREFETCH_BLOCK_BITMAPS,
@@ -2038,6 +2042,10 @@ static int handle_mount_opt(struct super_block *sb, char *opt, int token,
sbi->s_li_wait_mult = arg;
} else if (token == Opt_max_dir_size_kb) {
sbi->s_max_dir_size_kb = arg;
+ } else if (token == Opt_frsp_cache_aggression) {
+ sbi->s_mb_frsp_cache_aggression =
+ min(EXT4_MB_FRSP_MAX_CACHE_AGGRESSION,
+ max(EXT4_MB_FRSP_MIN_CACHE_AGGRESSION, arg));
} else if (token == Opt_stripe) {
sbi->s_stripe = arg;
} else if (token == Opt_resuid) {
@@ -3959,6 +3967,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
sbi->s_commit_interval = JBD2_DEFAULT_MAX_COMMIT_AGE * HZ;
sbi->s_min_batch_time = EXT4_DEF_MIN_BATCH_TIME;
sbi->s_max_batch_time = EXT4_DEF_MAX_BATCH_TIME;
+ sbi->s_mb_frsp_cache_aggression = EXT4_MB_FRSP_DEFAULT_CACHE_AGGRESSION;
if ((def_mount_opts & EXT4_DEFM_NOBARRIER) == 0)
set_opt(sb, BARRIER);
--
2.28.0.220.ged08abb693-goog
^ permalink raw reply related [flat|nested] 13+ messages in thread
* [PATCH 6/9] ext4: add memory usage tracker for freespace trees
2020-08-19 7:30 [PATCH 0/9] ext4: add free-space extent based allocator Harshad Shirwadkar
` (4 preceding siblings ...)
2020-08-19 7:31 ` [PATCH 5/9] ext4: add freespace tree optimizations Harshad Shirwadkar
@ 2020-08-19 7:31 ` Harshad Shirwadkar
2020-08-28 13:57 ` Благодаренко Артём
2020-08-19 7:31 ` [PATCH 7/9] ext4: add LRU eviction for free space trees Harshad Shirwadkar
` (2 subsequent siblings)
8 siblings, 1 reply; 13+ messages in thread
From: Harshad Shirwadkar @ 2020-08-19 7:31 UTC (permalink / raw)
To: linux-ext4; +Cc: tytso, lyx1209, Harshad Shirwadkar
Freespace trees can occupy a lot of memory as fragmentation
increases. This patch adds a sysfs file to monitor the memory usage of
the freespace tree allocator. It also adds a sysfs tunable to control
the maximum memory that the allocator can use. If the allocator
exceeds this threshold, the file system enters the "FRSP_MEM_CRUNCH"
state. The next patch in the series performs LRU eviction when this
state is reached.
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
---
fs/ext4/ext4.h | 8 ++++++++
fs/ext4/mballoc.c | 20 ++++++++++++++++++++
fs/ext4/mballoc.h | 4 ++++
fs/ext4/sysfs.c | 11 +++++++++++
4 files changed, 43 insertions(+)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 8cfe089ebea6..45fc3b230357 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1206,6 +1206,12 @@ struct ext4_inode_info {
* allocator off)
*/
+#define EXT4_MOUNT2_FRSP_MEM_CRUNCH 0x00000040 /*
+ * Freespace tree allocator
+ * is in a tight memory
+ * situation.
+ */
+
#define clear_opt(sb, opt) EXT4_SB(sb)->s_mount_opt &= \
~EXT4_MOUNT_##opt
#define set_opt(sb, opt) EXT4_SB(sb)->s_mount_opt |= \
@@ -1589,6 +1595,8 @@ struct ext4_sb_info {
atomic_t s_mb_num_frsp_trees_cached;
struct list_head s_mb_uncached_trees;
u32 s_mb_frsp_cache_aggression;
+ atomic_t s_mb_num_fragments;
+ u32 s_mb_frsp_mem_limit;
/* workqueue for reserved extent conversions (buffered io) */
struct workqueue_struct *rsv_conversion_wq;
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index fa027b626abe..aada6838cafd 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -869,6 +869,7 @@ void ext4_mb_frsp_print_tree_len(struct super_block *sb,
static struct ext4_frsp_node *ext4_mb_frsp_alloc_node(struct super_block *sb)
{
struct ext4_frsp_node *node;
+ struct ext4_sb_info *sbi = EXT4_SB(sb);
node = kmem_cache_alloc(ext4_freespace_node_cachep, GFP_NOFS);
if (!node)
@@ -877,13 +878,31 @@ static struct ext4_frsp_node *ext4_mb_frsp_alloc_node(struct super_block *sb)
RB_CLEAR_NODE(&node->frsp_node);
RB_CLEAR_NODE(&node->frsp_len_node);
+ atomic_inc(&sbi->s_mb_num_fragments);
+
+ if (sbi->s_mb_frsp_mem_limit &&
+ atomic_read(&sbi->s_mb_num_fragments) >
+ EXT4_FRSP_MEM_LIMIT_TO_NUM_NODES(sb))
+ set_opt2(sb, FRSP_MEM_CRUNCH);
+ else
+ clear_opt2(sb, FRSP_MEM_CRUNCH);
+
+
return node;
}
static void ext4_mb_frsp_free_node(struct super_block *sb,
struct ext4_frsp_node *node)
{
+ struct ext4_sb_info *sbi = EXT4_SB(sb);
+
kmem_cache_free(ext4_freespace_node_cachep, node);
+ atomic_dec(&sbi->s_mb_num_fragments);
+
+ if (!sbi->s_mb_frsp_mem_limit ||
+ atomic_read(&sbi->s_mb_num_fragments) <
+ EXT4_FRSP_MEM_LIMIT_TO_NUM_NODES(sb))
+ clear_opt2(sb, FRSP_MEM_CRUNCH);
}
/* Evict a tree from memory */
@@ -1607,6 +1626,7 @@ int ext4_mb_init_freespace_trees(struct super_block *sb)
}
rwlock_init(&sbi->s_mb_frsp_lock);
atomic_set(&sbi->s_mb_num_frsp_trees_cached, 0);
+ atomic_set(&sbi->s_mb_num_fragments, 0);
return 0;
}
diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h
index ac65f7eac611..08cac358324d 100644
--- a/fs/ext4/mballoc.h
+++ b/fs/ext4/mballoc.h
@@ -88,6 +88,10 @@ struct ext4_frsp_node {
struct rb_node frsp_node;
struct rb_node frsp_len_node;
};
+
+#define EXT4_FRSP_MEM_LIMIT_TO_NUM_NODES(__sb) \
+ ((sbi->s_mb_frsp_mem_limit / sizeof(struct ext4_frsp_node)))
+
struct ext4_free_data {
/* this links the free block information from sb_info */
struct list_head efd_list;
diff --git a/fs/ext4/sysfs.c b/fs/ext4/sysfs.c
index 31e0db726d21..d23cb51635c3 100644
--- a/fs/ext4/sysfs.c
+++ b/fs/ext4/sysfs.c
@@ -8,6 +8,7 @@
*
*/
+#include "mballoc.h"
#include <linux/time.h>
#include <linux/fs.h>
#include <linux/seq_file.h>
@@ -24,6 +25,7 @@ typedef enum {
attr_session_write_kbytes,
attr_lifetime_write_kbytes,
attr_reserved_clusters,
+ attr_frsp_tree_usage,
attr_inode_readahead,
attr_trigger_test_error,
attr_first_error_time,
@@ -205,6 +207,7 @@ EXT4_ATTR_FUNC(delayed_allocation_blocks, 0444);
EXT4_ATTR_FUNC(session_write_kbytes, 0444);
EXT4_ATTR_FUNC(lifetime_write_kbytes, 0444);
EXT4_ATTR_FUNC(reserved_clusters, 0644);
+EXT4_ATTR_FUNC(frsp_tree_usage, 0444);
EXT4_ATTR_OFFSET(inode_readahead_blks, 0644, inode_readahead,
ext4_sb_info, s_inode_readahead_blks);
@@ -242,6 +245,7 @@ EXT4_ATTR(last_error_time, 0444, last_error_time);
EXT4_ATTR(journal_task, 0444, journal_task);
EXT4_RW_ATTR_SBI_UI(mb_prefetch, s_mb_prefetch);
EXT4_RW_ATTR_SBI_UI(mb_prefetch_limit, s_mb_prefetch_limit);
+EXT4_RW_ATTR_SBI_UI(mb_frsp_max_mem, s_mb_frsp_mem_limit);
static unsigned int old_bump_val = 128;
EXT4_ATTR_PTR(max_writeback_mb_bump, 0444, pointer_ui, &old_bump_val);
@@ -251,6 +255,7 @@ static struct attribute *ext4_attrs[] = {
ATTR_LIST(session_write_kbytes),
ATTR_LIST(lifetime_write_kbytes),
ATTR_LIST(reserved_clusters),
+ ATTR_LIST(frsp_tree_usage),
ATTR_LIST(inode_readahead_blks),
ATTR_LIST(inode_goal),
ATTR_LIST(mb_stats),
@@ -287,6 +292,7 @@ static struct attribute *ext4_attrs[] = {
#endif
ATTR_LIST(mb_prefetch),
ATTR_LIST(mb_prefetch_limit),
+ ATTR_LIST(mb_frsp_max_mem),
NULL,
};
ATTRIBUTE_GROUPS(ext4);
@@ -369,6 +375,11 @@ static ssize_t ext4_attr_show(struct kobject *kobj,
return snprintf(buf, PAGE_SIZE, "%llu\n",
(unsigned long long)
atomic64_read(&sbi->s_resv_clusters));
+ case attr_frsp_tree_usage:
+ return snprintf(buf, PAGE_SIZE, "%llu\n",
+ (unsigned long long)
+ atomic_read(&sbi->s_mb_num_fragments) *
+ sizeof(struct ext4_frsp_node));
case attr_inode_readahead:
case attr_pointer_ui:
if (!ptr)
--
2.28.0.220.ged08abb693-goog
^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [PATCH 6/9] ext4: add memory usage tracker for freespace trees
2020-08-19 7:31 ` [PATCH 6/9] ext4: add memory usage tracker for freespace trees Harshad Shirwadkar
@ 2020-08-28 13:57 ` Благодаренко Артём
2020-08-28 15:22 ` harshad shirwadkar
0 siblings, 1 reply; 13+ messages in thread
From: Благодаренко Артём @ 2020-08-28 13:57 UTC (permalink / raw)
To: Harshad Shirwadkar; +Cc: linux-ext4, tytso, lyx1209
Hello Harshad,
Thank you for these useful patches. I am still reviewing them. This one looks good to me, but one place looks strange. See below.
> On 19 Aug 2020, at 10:31, Harshad Shirwadkar <harshadshirwadkar@gmail.com> wrote:
>
> Freespace trees can occupy a lot of memory as fragmentation
> increases. This patch adds a sysfs file to monitor the memory usage of
> the freespace tree allocator. It also adds a sysfs tunable to control
> the maximum memory that the allocator can use. If the allocator
> exceeds this threshold, the file system enters the "FRSP_MEM_CRUNCH"
> state. The next patch in the series performs LRU eviction when this
> state is reached.
>
> Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
> ---
> fs/ext4/ext4.h | 8 ++++++++
> fs/ext4/mballoc.c | 20 ++++++++++++++++++++
> fs/ext4/mballoc.h | 4 ++++
> fs/ext4/sysfs.c | 11 +++++++++++
> 4 files changed, 43 insertions(+)
>
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 8cfe089ebea6..45fc3b230357 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -1206,6 +1206,12 @@ struct ext4_inode_info {
> * allocator off)
> */
>
> +#define EXT4_MOUNT2_FRSP_MEM_CRUNCH 0x00000040 /*
> + * Freespace tree allocator
> + * is in a tight memory
> + * situation.
> + */
> +
> #define clear_opt(sb, opt) EXT4_SB(sb)->s_mount_opt &= \
> ~EXT4_MOUNT_##opt
> #define set_opt(sb, opt) EXT4_SB(sb)->s_mount_opt |= \
> @@ -1589,6 +1595,8 @@ struct ext4_sb_info {
> atomic_t s_mb_num_frsp_trees_cached;
> struct list_head s_mb_uncached_trees;
> u32 s_mb_frsp_cache_aggression;
> + atomic_t s_mb_num_fragments;
> + u32 s_mb_frsp_mem_limit;
>
> /* workqueue for reserved extent conversions (buffered io) */
> struct workqueue_struct *rsv_conversion_wq;
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index fa027b626abe..aada6838cafd 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -869,6 +869,7 @@ void ext4_mb_frsp_print_tree_len(struct super_block *sb,
> static struct ext4_frsp_node *ext4_mb_frsp_alloc_node(struct super_block *sb)
> {
> struct ext4_frsp_node *node;
> + struct ext4_sb_info *sbi = EXT4_SB(sb);
>
> node = kmem_cache_alloc(ext4_freespace_node_cachep, GFP_NOFS);
> if (!node)
> @@ -877,13 +878,31 @@ static struct ext4_frsp_node *ext4_mb_frsp_alloc_node(struct super_block *sb)
> RB_CLEAR_NODE(&node->frsp_node);
> RB_CLEAR_NODE(&node->frsp_len_node);
>
> + atomic_inc(&sbi->s_mb_num_fragments);
> +
> + if (sbi->s_mb_frsp_mem_limit &&
> + atomic_read(&sbi->s_mb_num_fragments) >
> + EXT4_FRSP_MEM_LIMIT_TO_NUM_NODES(sb))
> + set_opt2(sb, FRSP_MEM_CRUNCH);
> + else
> + clear_opt2(sb, FRSP_MEM_CRUNCH);
> +
> +
Why is FRSP_MEM_CRUNCH cleared here? Are there any cases where allocating a node can reduce the number of fragments?
> return node;
> }
>
> static void ext4_mb_frsp_free_node(struct super_block *sb,
> struct ext4_frsp_node *node)
> {
> + struct ext4_sb_info *sbi = EXT4_SB(sb);
> +
> kmem_cache_free(ext4_freespace_node_cachep, node);
> + atomic_dec(&sbi->s_mb_num_fragments);
> +
> + if (!sbi->s_mb_frsp_mem_limit ||
> + atomic_read(&sbi->s_mb_num_fragments) <
> + EXT4_FRSP_MEM_LIMIT_TO_NUM_NODES(sb))
> + clear_opt2(sb, FRSP_MEM_CRUNCH);
> }
If there are reasons to clear FRSP_MEM_CRUNCH in ext4_mb_frsp_alloc_node, should we also set FRSP_MEM_CRUNCH here?
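Something like the rough, untested sketch below is what I have in
mind; it simply mirrors the check from the alloc path so the flag can
move in both directions (all names are taken from this patch, nothing
new is assumed):

static void ext4_mb_frsp_free_node(struct super_block *sb,
				   struct ext4_frsp_node *node)
{
	struct ext4_sb_info *sbi = EXT4_SB(sb);

	kmem_cache_free(ext4_freespace_node_cachep, node);
	atomic_dec(&sbi->s_mb_num_fragments);

	/* Re-evaluate the flag in both directions after the decrement */
	if (sbi->s_mb_frsp_mem_limit &&
	    atomic_read(&sbi->s_mb_num_fragments) >
	    EXT4_FRSP_MEM_LIMIT_TO_NUM_NODES(sb))
		set_opt2(sb, FRSP_MEM_CRUNCH);
	else
		clear_opt2(sb, FRSP_MEM_CRUNCH);
}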
>
> /* Evict a tree from memory */
> @@ -1607,6 +1626,7 @@ int ext4_mb_init_freespace_trees(struct super_block *sb)
> }
> rwlock_init(&sbi->s_mb_frsp_lock);
> atomic_set(&sbi->s_mb_num_frsp_trees_cached, 0);
> + atomic_set(&sbi->s_mb_num_fragments, 0);
>
> return 0;
> }
> diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h
> index ac65f7eac611..08cac358324d 100644
> --- a/fs/ext4/mballoc.h
> +++ b/fs/ext4/mballoc.h
> @@ -88,6 +88,10 @@ struct ext4_frsp_node {
> struct rb_node frsp_node;
> struct rb_node frsp_len_node;
> };
> +
> +#define EXT4_FRSP_MEM_LIMIT_TO_NUM_NODES(__sb) \
> + ((sbi->s_mb_frsp_mem_limit / sizeof(struct ext4_frsp_node)))
> +
> struct ext4_free_data {
> /* this links the free block information from sb_info */
> struct list_head efd_list;
> diff --git a/fs/ext4/sysfs.c b/fs/ext4/sysfs.c
> index 31e0db726d21..d23cb51635c3 100644
> --- a/fs/ext4/sysfs.c
> +++ b/fs/ext4/sysfs.c
> @@ -8,6 +8,7 @@
> *
> */
>
> +#include "mballoc.h"
> #include <linux/time.h>
> #include <linux/fs.h>
> #include <linux/seq_file.h>
> @@ -24,6 +25,7 @@ typedef enum {
> attr_session_write_kbytes,
> attr_lifetime_write_kbytes,
> attr_reserved_clusters,
> + attr_frsp_tree_usage,
> attr_inode_readahead,
> attr_trigger_test_error,
> attr_first_error_time,
> @@ -205,6 +207,7 @@ EXT4_ATTR_FUNC(delayed_allocation_blocks, 0444);
> EXT4_ATTR_FUNC(session_write_kbytes, 0444);
> EXT4_ATTR_FUNC(lifetime_write_kbytes, 0444);
> EXT4_ATTR_FUNC(reserved_clusters, 0644);
> +EXT4_ATTR_FUNC(frsp_tree_usage, 0444);
>
> EXT4_ATTR_OFFSET(inode_readahead_blks, 0644, inode_readahead,
> ext4_sb_info, s_inode_readahead_blks);
> @@ -242,6 +245,7 @@ EXT4_ATTR(last_error_time, 0444, last_error_time);
> EXT4_ATTR(journal_task, 0444, journal_task);
> EXT4_RW_ATTR_SBI_UI(mb_prefetch, s_mb_prefetch);
> EXT4_RW_ATTR_SBI_UI(mb_prefetch_limit, s_mb_prefetch_limit);
> +EXT4_RW_ATTR_SBI_UI(mb_frsp_max_mem, s_mb_frsp_mem_limit);
>
> static unsigned int old_bump_val = 128;
> EXT4_ATTR_PTR(max_writeback_mb_bump, 0444, pointer_ui, &old_bump_val);
> @@ -251,6 +255,7 @@ static struct attribute *ext4_attrs[] = {
> ATTR_LIST(session_write_kbytes),
> ATTR_LIST(lifetime_write_kbytes),
> ATTR_LIST(reserved_clusters),
> + ATTR_LIST(frsp_tree_usage),
> ATTR_LIST(inode_readahead_blks),
> ATTR_LIST(inode_goal),
> ATTR_LIST(mb_stats),
> @@ -287,6 +292,7 @@ static struct attribute *ext4_attrs[] = {
> #endif
> ATTR_LIST(mb_prefetch),
> ATTR_LIST(mb_prefetch_limit),
> + ATTR_LIST(mb_frsp_max_mem),
> NULL,
> };
> ATTRIBUTE_GROUPS(ext4);
> @@ -369,6 +375,11 @@ static ssize_t ext4_attr_show(struct kobject *kobj,
> return snprintf(buf, PAGE_SIZE, "%llu\n",
> (unsigned long long)
> atomic64_read(&sbi->s_resv_clusters));
> + case attr_frsp_tree_usage:
> + return snprintf(buf, PAGE_SIZE, "%llu\n",
> + (unsigned long long)
> + atomic_read(&sbi->s_mb_num_fragments) *
> + sizeof(struct ext4_frsp_node));
> case attr_inode_readahead:
> case attr_pointer_ui:
> if (!ptr)
> --
> 2.28.0.220.ged08abb693-goog
>
Best regards,
Artem Blagodarenko.
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 6/9] ext4: add memory usage tracker for freespace trees
2020-08-28 13:57 ` Благодаренко Артём
@ 2020-08-28 15:22 ` harshad shirwadkar
0 siblings, 0 replies; 13+ messages in thread
From: harshad shirwadkar @ 2020-08-28 15:22 UTC (permalink / raw)
To: Благодаренко
Артём
Cc: Ext4 Developers List, Theodore Y. Ts'o, lyx1209
Thanks Artem for reviewing the patches.
On Fri, Aug 28, 2020 at 6:57 AM Благодаренко Артём
<artem.blagodarenko@gmail.com> wrote:
>
> Hello Harshad,
>
> Thank you for these useful patches. I am still reviewing them. This one looks good to me, but one place looks strange. See below.
>
> > On 19 Aug 2020, at 10:31, Harshad Shirwadkar <harshadshirwadkar@gmail.com> wrote:
> >
> > Freespace trees can occupy a lot of memory as fragmentation
> > increases. This patch adds a sysfs file to monitor the memory usage of
> > the freespace tree allocator. It also adds a sysfs tunable to control
> > the maximum memory that the allocator can use. If the allocator
> > exceeds this threshold, the file system enters the "FRSP_MEM_CRUNCH"
> > state. The next patch in the series performs LRU eviction when this
> > state is reached.
> >
> > Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
> > ---
> > fs/ext4/ext4.h | 8 ++++++++
> > fs/ext4/mballoc.c | 20 ++++++++++++++++++++
> > fs/ext4/mballoc.h | 4 ++++
> > fs/ext4/sysfs.c | 11 +++++++++++
> > 4 files changed, 43 insertions(+)
> >
> > diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> > index 8cfe089ebea6..45fc3b230357 100644
> > --- a/fs/ext4/ext4.h
> > +++ b/fs/ext4/ext4.h
> > @@ -1206,6 +1206,12 @@ struct ext4_inode_info {
> > * allocator off)
> > */
> >
> > +#define EXT4_MOUNT2_FRSP_MEM_CRUNCH 0x00000040 /*
> > + * Freespace tree allocator
> > + * is in a tight memory
> > + * situation.
> > + */
> > +
> > #define clear_opt(sb, opt) EXT4_SB(sb)->s_mount_opt &= \
> > ~EXT4_MOUNT_##opt
> > #define set_opt(sb, opt) EXT4_SB(sb)->s_mount_opt |= \
> > @@ -1589,6 +1595,8 @@ struct ext4_sb_info {
> > atomic_t s_mb_num_frsp_trees_cached;
> > struct list_head s_mb_uncached_trees;
> > u32 s_mb_frsp_cache_aggression;
> > + atomic_t s_mb_num_fragments;
> > + u32 s_mb_frsp_mem_limit;
> >
> > /* workqueue for reserved extent conversions (buffered io) */
> > struct workqueue_struct *rsv_conversion_wq;
> > diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> > index fa027b626abe..aada6838cafd 100644
> > --- a/fs/ext4/mballoc.c
> > +++ b/fs/ext4/mballoc.c
> > @@ -869,6 +869,7 @@ void ext4_mb_frsp_print_tree_len(struct super_block *sb,
> > static struct ext4_frsp_node *ext4_mb_frsp_alloc_node(struct super_block *sb)
> > {
> > struct ext4_frsp_node *node;
> > + struct ext4_sb_info *sbi = EXT4_SB(sb);
> >
> > node = kmem_cache_alloc(ext4_freespace_node_cachep, GFP_NOFS);
> > if (!node)
> > @@ -877,13 +878,31 @@ static struct ext4_frsp_node *ext4_mb_frsp_alloc_node(struct super_block *sb)
> > RB_CLEAR_NODE(&node->frsp_node);
> > RB_CLEAR_NODE(&node->frsp_len_node);
> >
> > + atomic_inc(&sbi->s_mb_num_fragments);
> > +
> > + if (sbi->s_mb_frsp_mem_limit &&
> > + atomic_read(&sbi->s_mb_num_fragments) >
> > + EXT4_FRSP_MEM_LIMIT_TO_NUM_NODES(sb))
> > + set_opt2(sb, FRSP_MEM_CRUNCH);
> > + else
> > + clear_opt2(sb, FRSP_MEM_CRUNCH);
> > +
> > +
>
> Why is FRSP_MEM_CRUNCH cleared here? Are there any cases where allocating a node can reduce the number of fragments?
The reason I have this here is to handle the case where the memory
limit gets updated via the sysfs tunable. Once the sysfs tunable
"mb_frsp_max_mem" is increased, the next allocation request clears the
MEM_CRUNCH flag. Without this, the FS would evict a tree in LRU
fashion (implemented in the next patch), which shouldn't be necessary.
The other option would be to update the MEM_CRUNCH flag right at the
time the sysfs tunable is updated.
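For reference, that alternative could look something like the rough,
untested sketch below. ext4_mb_frsp_update_mem_crunch() is a
hypothetical helper, not part of this series; it would be called from
the mb_frsp_max_mem store path in fs/ext4/sysfs.c whenever the limit
changes:

/* Hypothetical helper, not part of this series */
static void ext4_mb_frsp_update_mem_crunch(struct super_block *sb)
{
	struct ext4_sb_info *sbi = EXT4_SB(sb);

	/* Recompute the crunch state against the (possibly new) limit */
	if (sbi->s_mb_frsp_mem_limit &&
	    atomic_read(&sbi->s_mb_num_fragments) >
	    EXT4_FRSP_MEM_LIMIT_TO_NUM_NODES(sb))
		set_opt2(sb, FRSP_MEM_CRUNCH);
	else
		clear_opt2(sb, FRSP_MEM_CRUNCH);
}

That would let the allocation path drop the clear_opt2() call
entirely.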
>
> > return node;
> > }
> >
> > static void ext4_mb_frsp_free_node(struct super_block *sb,
> > struct ext4_frsp_node *node)
> > {
> > + struct ext4_sb_info *sbi = EXT4_SB(sb);
> > +
> > kmem_cache_free(ext4_freespace_node_cachep, node);
> > + atomic_dec(&sbi->s_mb_num_fragments);
> > +
> > + if (!sbi->s_mb_frsp_mem_limit ||
> > + atomic_read(&sbi->s_mb_num_fragments) <
> > + EXT4_FRSP_MEM_LIMIT_TO_NUM_NODES(sb))
> > + clear_opt2(sb, FRSP_MEM_CRUNCH);
> > }
>
> If there are reasons to clear FRSP_MEM_CRUNCH in ext4_mb_frsp_alloc_node, should we also set FRSP_MEM_CRUNCH here?
Thanks, I think we can do that. If the memory limit decreases, it'd be
good to set MEM_CRUNCH here. In my testing, I have found decreasing
the memory limit to be very slow to take effect; perhaps this is one
of the reasons.
Thanks,
Harshad
>
> >
> > /* Evict a tree from memory */
> > @@ -1607,6 +1626,7 @@ int ext4_mb_init_freespace_trees(struct super_block *sb)
> > }
> > rwlock_init(&sbi->s_mb_frsp_lock);
> > atomic_set(&sbi->s_mb_num_frsp_trees_cached, 0);
> > + atomic_set(&sbi->s_mb_num_fragments, 0);
> >
> > return 0;
> > }
> > diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h
> > index ac65f7eac611..08cac358324d 100644
> > --- a/fs/ext4/mballoc.h
> > +++ b/fs/ext4/mballoc.h
> > @@ -88,6 +88,10 @@ struct ext4_frsp_node {
> > struct rb_node frsp_node;
> > struct rb_node frsp_len_node;
> > };
> > +
> > +#define EXT4_FRSP_MEM_LIMIT_TO_NUM_NODES(__sb) \
> > + ((sbi->s_mb_frsp_mem_limit / sizeof(struct ext4_frsp_node)))
> > +
> > struct ext4_free_data {
> > /* this links the free block information from sb_info */
> > struct list_head efd_list;
> > diff --git a/fs/ext4/sysfs.c b/fs/ext4/sysfs.c
> > index 31e0db726d21..d23cb51635c3 100644
> > --- a/fs/ext4/sysfs.c
> > +++ b/fs/ext4/sysfs.c
> > @@ -8,6 +8,7 @@
> > *
> > */
> >
> > +#include "mballoc.h"
> > #include <linux/time.h>
> > #include <linux/fs.h>
> > #include <linux/seq_file.h>
> > @@ -24,6 +25,7 @@ typedef enum {
> > attr_session_write_kbytes,
> > attr_lifetime_write_kbytes,
> > attr_reserved_clusters,
> > + attr_frsp_tree_usage,
> > attr_inode_readahead,
> > attr_trigger_test_error,
> > attr_first_error_time,
> > @@ -205,6 +207,7 @@ EXT4_ATTR_FUNC(delayed_allocation_blocks, 0444);
> > EXT4_ATTR_FUNC(session_write_kbytes, 0444);
> > EXT4_ATTR_FUNC(lifetime_write_kbytes, 0444);
> > EXT4_ATTR_FUNC(reserved_clusters, 0644);
> > +EXT4_ATTR_FUNC(frsp_tree_usage, 0444);
> >
> > EXT4_ATTR_OFFSET(inode_readahead_blks, 0644, inode_readahead,
> > ext4_sb_info, s_inode_readahead_blks);
> > @@ -242,6 +245,7 @@ EXT4_ATTR(last_error_time, 0444, last_error_time);
> > EXT4_ATTR(journal_task, 0444, journal_task);
> > EXT4_RW_ATTR_SBI_UI(mb_prefetch, s_mb_prefetch);
> > EXT4_RW_ATTR_SBI_UI(mb_prefetch_limit, s_mb_prefetch_limit);
> > +EXT4_RW_ATTR_SBI_UI(mb_frsp_max_mem, s_mb_frsp_mem_limit);
> >
> > static unsigned int old_bump_val = 128;
> > EXT4_ATTR_PTR(max_writeback_mb_bump, 0444, pointer_ui, &old_bump_val);
> > @@ -251,6 +255,7 @@ static struct attribute *ext4_attrs[] = {
> > ATTR_LIST(session_write_kbytes),
> > ATTR_LIST(lifetime_write_kbytes),
> > ATTR_LIST(reserved_clusters),
> > + ATTR_LIST(frsp_tree_usage),
> > ATTR_LIST(inode_readahead_blks),
> > ATTR_LIST(inode_goal),
> > ATTR_LIST(mb_stats),
> > @@ -287,6 +292,7 @@ static struct attribute *ext4_attrs[] = {
> > #endif
> > ATTR_LIST(mb_prefetch),
> > ATTR_LIST(mb_prefetch_limit),
> > + ATTR_LIST(mb_frsp_max_mem),
> > NULL,
> > };
> > ATTRIBUTE_GROUPS(ext4);
> > @@ -369,6 +375,11 @@ static ssize_t ext4_attr_show(struct kobject *kobj,
> > return snprintf(buf, PAGE_SIZE, "%llu\n",
> > (unsigned long long)
> > atomic64_read(&sbi->s_resv_clusters));
> > + case attr_frsp_tree_usage:
> > + return snprintf(buf, PAGE_SIZE, "%llu\n",
> > + (unsigned long long)
> > + atomic_read(&sbi->s_mb_num_fragments) *
> > + sizeof(struct ext4_frsp_node));
> > case attr_inode_readahead:
> > case attr_pointer_ui:
> > if (!ptr)
> > --
> > 2.28.0.220.ged08abb693-goog
> >
>
> Best regards,
> Artem Blagodarenko.
>
>
^ permalink raw reply [flat|nested] 13+ messages in thread
* [PATCH 7/9] ext4: add LRU eviction for free space trees
2020-08-19 7:30 [PATCH 0/9] ext4: add free-space extent based allocator Harshad Shirwadkar
` (5 preceding siblings ...)
2020-08-19 7:31 ` [PATCH 6/9] ext4: add memory usage tracker for freespace trees Harshad Shirwadkar
@ 2020-08-19 7:31 ` Harshad Shirwadkar
2020-08-19 7:31 ` [PATCH 8/9] ext4: add tracepoints " Harshad Shirwadkar
2020-08-19 7:31 ` [PATCH 9/9] ext4: add freespace trees documentation in code Harshad Shirwadkar
8 siblings, 0 replies; 13+ messages in thread
From: Harshad Shirwadkar @ 2020-08-19 7:31 UTC (permalink / raw)
To: linux-ext4; +Cc: tytso, lyx1209, Harshad Shirwadkar
This patch adds an LRU eviction policy for freespace trees. In order
to avoid contention on a single LRU lock, the LRU scheme is
implemented as two lists - an active list and an inactive list. Trees
in the active list aren't moved within the list, which avoids taking
the lock on every access. Trees in the inactive list become active
only if they are accessed twice within a short interval, which
prevents outliers from entering the active list.
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
---
fs/ext4/ext4.h | 46 +++++++++++++
fs/ext4/mballoc.c | 164 ++++++++++++++++++++++++++++++++++++++++++----
2 files changed, 197 insertions(+), 13 deletions(-)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 45fc3b230357..8f015f90a537 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1437,6 +1437,10 @@ struct ext4_frsp_tree {
*/
struct list_head frsp_list_node;
struct rb_node frsp_len_node;
+ atomic_t frsp_fragments;
+ struct list_head frsp_lru_active_node;
+ unsigned long frsp_last_access;
+ atomic_t frsp_ref;
};
/* Freespace tree flags */
@@ -1451,6 +1455,47 @@ struct ext4_frsp_tree {
#define EXT4_MB_FRSP_DEFAULT_CACHE_AGGRESSION 0x1
#define EXT4_MB_FRSP_MAX_CACHE_AGGRESSION 0x3
+/* LRU management for free space trees */
+struct ext4_mb_frsp_lru {
+ rwlock_t frsp_lru_lock;
+ atomic_t frsp_active_fragments; /* Current #of fragments in
+ * the active queue
+ */
+ u32 frsp_max_active_fragments;
+ /*
+ * List of active trees. Trees at tail are oldest trees in active set.
+ */
+ struct list_head frsp_lru_active;
+ /*
+ * List of inactive trees but loaded trees.
+ */
+ struct list_head frsp_lru_inactive;
+ struct super_block *frsp_lru_sb;
+};
+
+/*
+ * Minimum memory needed for our allocator. The pathological worst case for
+ * freespace trees is when every other block is allocated. In this case,
+ * the allocator will end up storing an extent node of length 1 for each free
+ * block. We need to make sure that the minimum memory available for the
+ * allocator is as much memory as is needed for one worst-case tree. This
+ * ensures we can keep at least one tree in memory, which avoids thrashing.
+ */
+#define EXT4_MB_FRSP_MEM_MIN(sb) \
+ ((sizeof(struct ext4_frsp_node) * ext4_flex_bg_size(EXT4_SB(sb))) \
+ * (EXT4_BLOCKS_PER_GROUP(sb) / 2))
+
+/* Half of the total memory available is allowed for active list */
+#define EXT4_MB_FRSP_MAX_ACTIVE_FRAGMENTS(sb) \
+ ((EXT4_SB(sb)->s_mb_frsp_mem_limit / \
+ (2 * sizeof(struct ext4_frsp_node))))
+
+/*
+ * Maximum number of jiffies allowed between two successive hits on a freespace
+ * tree before we move it to the "active" queue.
+ */
+#define EXT4_MB_FRSP_ACTIVE_THRESHOLD_JIFFIES 100
+
/*
* fourth extended-fs super-block data in memory
*/
@@ -1597,6 +1642,7 @@ struct ext4_sb_info {
u32 s_mb_frsp_cache_aggression;
atomic_t s_mb_num_fragments;
u32 s_mb_frsp_mem_limit;
+ struct ext4_mb_frsp_lru s_mb_frsp_lru;
/* workqueue for reserved extent conversions (buffered io) */
struct workqueue_struct *rsv_conversion_wq;
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index aada6838cafd..98591d44e3cd 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -866,10 +866,12 @@ void ext4_mb_frsp_print_tree_len(struct super_block *sb,
}
#endif
-static struct ext4_frsp_node *ext4_mb_frsp_alloc_node(struct super_block *sb)
+static struct ext4_frsp_node *ext4_mb_frsp_alloc_node(struct super_block *sb,
+ struct ext4_frsp_tree *tree)
{
struct ext4_frsp_node *node;
struct ext4_sb_info *sbi = EXT4_SB(sb);
+ struct ext4_mb_frsp_lru *lru = &EXT4_SB(sb)->s_mb_frsp_lru;
node = kmem_cache_alloc(ext4_freespace_node_cachep, GFP_NOFS);
if (!node)
@@ -879,6 +881,11 @@ static struct ext4_frsp_node *ext4_mb_frsp_alloc_node(struct super_block *sb)
RB_CLEAR_NODE(&node->frsp_len_node);
atomic_inc(&sbi->s_mb_num_fragments);
+ atomic_inc(&tree->frsp_fragments);
+ read_lock(&lru->frsp_lru_lock);
+ if (!list_empty(&tree->frsp_lru_active_node))
+ atomic_inc(&lru->frsp_active_fragments);
+ read_unlock(&lru->frsp_lru_lock);
if (sbi->s_mb_frsp_mem_limit &&
atomic_read(&sbi->s_mb_num_fragments) >
@@ -892,12 +899,18 @@ static struct ext4_frsp_node *ext4_mb_frsp_alloc_node(struct super_block *sb)
}
static void ext4_mb_frsp_free_node(struct super_block *sb,
- struct ext4_frsp_node *node)
+ struct ext4_frsp_tree *tree, struct ext4_frsp_node *node)
{
struct ext4_sb_info *sbi = EXT4_SB(sb);
+ struct ext4_mb_frsp_lru *lru = &EXT4_SB(sb)->s_mb_frsp_lru;
kmem_cache_free(ext4_freespace_node_cachep, node);
atomic_dec(&sbi->s_mb_num_fragments);
+ atomic_dec(&tree->frsp_fragments);
+ read_lock(&lru->frsp_lru_lock);
+ if (!list_empty(&tree->frsp_lru_active_node))
+ atomic_dec(&lru->frsp_active_fragments);
+ read_unlock(&lru->frsp_lru_lock);
if (!sbi->s_mb_frsp_mem_limit ||
atomic_read(&sbi->s_mb_num_fragments) <
@@ -915,6 +928,8 @@ void ext4_mb_frsp_free_tree(struct super_block *sb, struct ext4_frsp_tree *tree)
mutex_lock(&tree->frsp_lock);
if (!(tree->frsp_flags & EXT4_MB_FRSP_FLAG_LOADED))
goto out;
+ if (atomic_read(&tree->frsp_ref))
+ goto out;
node = rb_first_cached(&tree->frsp_offset_root);
while (node) {
@@ -922,7 +937,7 @@ void ext4_mb_frsp_free_tree(struct super_block *sb, struct ext4_frsp_tree *tree)
rb_erase_cached(node, &tree->frsp_offset_root);
rb_erase_cached(&frsp_node->frsp_len_node,
&tree->frsp_len_root);
- ext4_mb_frsp_free_node(sb, frsp_node);
+ ext4_mb_frsp_free_node(sb, tree, frsp_node);
node = rb_first_cached(&tree->frsp_offset_root);
}
tree->frsp_flags &= ~EXT4_MB_FRSP_FLAG_LOADED;
@@ -985,7 +1000,7 @@ int ext4_mb_frsp_add_region(struct super_block *sb, struct ext4_frsp_tree *tree,
*prev_entry = NULL;
struct rb_node *left = NULL, *right = NULL;
- new_entry = ext4_mb_frsp_alloc_node(sb);
+ new_entry = ext4_mb_frsp_alloc_node(sb, tree);
if (!new_entry)
return -ENOMEM;
@@ -1004,7 +1019,7 @@ int ext4_mb_frsp_add_region(struct super_block *sb, struct ext4_frsp_tree *tree,
rb_erase_cached(left, &tree->frsp_offset_root);
rb_erase_cached(&prev_entry->frsp_len_node,
&tree->frsp_len_root);
- ext4_mb_frsp_free_node(sb, prev_entry);
+ ext4_mb_frsp_free_node(sb, tree, prev_entry);
}
}
@@ -1017,7 +1032,7 @@ int ext4_mb_frsp_add_region(struct super_block *sb, struct ext4_frsp_tree *tree,
rb_erase_cached(right, &tree->frsp_offset_root);
rb_erase_cached(&next_entry->frsp_len_node,
&tree->frsp_len_root);
- ext4_mb_frsp_free_node(sb, next_entry);
+ ext4_mb_frsp_free_node(sb, tree, next_entry);
}
}
ext4_mb_frsp_insert_len(tree, new_entry);
@@ -1601,6 +1616,10 @@ static void ext4_mb_frsp_init_tree(struct ext4_frsp_tree *tree, int index)
tree->frsp_flags = 0;
tree->frsp_index = index;
INIT_LIST_HEAD(&tree->frsp_list_node);
+ atomic_set(&tree->frsp_fragments, 0);
+ tree->frsp_last_access = 0;
+ INIT_LIST_HEAD(&tree->frsp_lru_active_node);
+ atomic_set(&tree->frsp_ref, 0);
}
int ext4_mb_init_freespace_trees(struct super_block *sb)
@@ -1627,6 +1646,15 @@ int ext4_mb_init_freespace_trees(struct super_block *sb)
rwlock_init(&sbi->s_mb_frsp_lock);
atomic_set(&sbi->s_mb_num_frsp_trees_cached, 0);
atomic_set(&sbi->s_mb_num_fragments, 0);
+ INIT_LIST_HEAD(&sbi->s_mb_frsp_lru.frsp_lru_active);
+ INIT_LIST_HEAD(&sbi->s_mb_frsp_lru.frsp_lru_inactive);
+ rwlock_init(&sbi->s_mb_frsp_lru.frsp_lru_lock);
+ /* Set the default hard-limit to be as much as buddy bitmaps */
+ sbi->s_mb_frsp_mem_limit = ext4_get_groups_count(sb) <<
+ (sb->s_blocksize_bits + 1);
+ if (sbi->s_mb_frsp_mem_limit < EXT4_MB_FRSP_MEM_MIN(sb))
+ sbi->s_mb_frsp_mem_limit = EXT4_MB_FRSP_MEM_MIN(sb);
+ atomic_set(&sbi->s_mb_frsp_lru.frsp_active_fragments, 0);
return 0;
}
@@ -1664,6 +1692,99 @@ int ext4_mb_add_frsp_trees(struct super_block *sb, int ngroups)
return 0;
}
+/*
+ * Evict one tree from inactive list.
+ */
+static void ext4_frsp_evict(struct super_block *sb)
+{
+ struct ext4_mb_frsp_lru *lru = &EXT4_SB(sb)->s_mb_frsp_lru;
+ struct ext4_frsp_tree *tree = NULL;
+ bool found = false;
+
+ write_lock(&lru->frsp_lru_lock);
+ if (!list_empty(&lru->frsp_lru_inactive)) {
+ /* Evict from front, insert at tail */
+ found = false;
+ list_for_each_entry(tree, &lru->frsp_lru_inactive,
+ frsp_list_node) {
+ if (!atomic_read(&tree->frsp_ref)) {
+ found = true;
+ break;
+ }
+ }
+ if (found)
+ list_del_init(&tree->frsp_list_node);
+ }
+ write_unlock(&lru->frsp_lru_lock);
+ if (found)
+ ext4_mb_frsp_free_tree(sb, tree);
+}
+
+/*
+ * This function maintains LRU lists. "tree" has just been accessed.
+ */
+static void ext4_mb_frsp_maintain_lru(struct super_block *sb,
+ struct ext4_frsp_tree *tree)
+{
+ struct ext4_mb_frsp_lru *lru = &EXT4_SB(sb)->s_mb_frsp_lru;
+ struct ext4_frsp_tree *last;
+ unsigned long current_jiffies = jiffies;
+
+ read_lock(&lru->frsp_lru_lock);
+ if (!list_empty(&tree->frsp_lru_active_node)) {
+ /* Already active, nothing to do */
+ read_unlock(&lru->frsp_lru_lock);
+ goto out;
+ }
+
+ /*
+ * Check if this tree needs to be moved to active list. We move it to
+ * active list if one of the following three conditions is true:
+ * - This is the first access to this tree
+ * - We haven't yet reached the max active fragments threshold, so there
+ * is space in active list.
+ * - This tree was accessed twice in
+ * EXT4_MB_FRSP_ACTIVE_THRESHOLD_JIFFIES interval.
+ */
+ if (tree->frsp_last_access &&
+ EXT4_MB_FRSP_MAX_ACTIVE_FRAGMENTS(sb) &&
+ atomic_read(&lru->frsp_active_fragments) >
+ EXT4_MB_FRSP_MAX_ACTIVE_FRAGMENTS(sb) &&
+ current_jiffies - tree->frsp_last_access >
+ EXT4_MB_FRSP_ACTIVE_THRESHOLD_JIFFIES) {
+ read_unlock(&lru->frsp_lru_lock);
+ goto out;
+ }
+ read_unlock(&lru->frsp_lru_lock);
+
+ write_lock(&lru->frsp_lru_lock);
+ /* Check again just in case */
+ if (!list_empty(&tree->frsp_lru_active_node)) {
+ write_unlock(&lru->frsp_lru_lock);
+ goto out;
+ }
+ list_add(&tree->frsp_lru_active_node, &lru->frsp_lru_active);
+ list_del_init(&tree->frsp_list_node);
+ atomic_add(atomic_read(&tree->frsp_fragments),
+ &lru->frsp_active_fragments);
+ /* Remove trees from active queue until we are below the limit */
+ while (EXT4_MB_FRSP_MAX_ACTIVE_FRAGMENTS(sb) &&
+ atomic_read(&lru->frsp_active_fragments) >
+ EXT4_MB_FRSP_MAX_ACTIVE_FRAGMENTS(sb)) {
+ last = list_last_entry(&lru->frsp_lru_active,
+ struct ext4_frsp_tree, frsp_lru_active_node);
+ list_del_init(&last->frsp_lru_active_node);
+ /* Evict from front, insert at tail */
+ list_add_tail(&last->frsp_list_node, &lru->frsp_lru_inactive);
+ atomic_sub(atomic_read(&last->frsp_fragments),
+ &lru->frsp_active_fragments);
+ }
+ write_unlock(&lru->frsp_lru_lock);
+
+out:
+ tree->frsp_last_access = current_jiffies;
+}
+
/*
* Divide blocks started from @first with length @len into
* smaller chunks with power of 2 blocks.
@@ -2154,10 +2275,12 @@ ext4_mb_load_allocator_gfp(struct super_block *sb, ext4_group_t group,
e4b->frsp_tree = ext4_get_frsp_tree(sb,
ext4_flex_group(sbi, group));
e4b->frsp_flags = flags;
+ ext4_mb_frsp_maintain_lru(sb, e4b->frsp_tree);
if (flags & EXT4_ALLOCATOR_FRSP_NOLOAD)
return 0;
mutex_lock(&e4b->frsp_tree->frsp_lock);
+ atomic_inc(&e4b->frsp_tree->frsp_ref);
if (e4b->frsp_tree->frsp_flags & EXT4_MB_FRSP_FLAG_LOADED) {
mutex_unlock(&e4b->frsp_tree->frsp_lock);
return 0;
@@ -2288,8 +2411,15 @@ static int ext4_mb_load_allocator(struct super_block *sb, ext4_group_t group,
static void ext4_mb_unload_allocator(struct ext4_buddy *e4b)
{
- if (ext4_mb_frsp_on(e4b->bd_sb))
+ if (ext4_mb_frsp_on(e4b->bd_sb)) {
+ if (!e4b->frsp_tree ||
+ e4b->frsp_flags & EXT4_ALLOCATOR_FRSP_NOLOAD)
+ return;
+ atomic_dec(&e4b->frsp_tree->frsp_ref);
+ if (test_opt2(e4b->bd_sb, FRSP_MEM_CRUNCH))
+ ext4_frsp_evict(e4b->bd_sb);
return;
+ }
if (e4b->bd_bitmap_page)
put_page(e4b->bd_bitmap_page);
if (e4b->bd_buddy_page)
@@ -2661,7 +2791,8 @@ static int mb_mark_used(struct ext4_buddy *e4b, struct ext4_free_extent *ex)
ex->fe_node->frsp_offset +
ex->fe_node->frsp_len) {
/* Need to split the node */
- new = ext4_mb_frsp_alloc_node(e4b->bd_sb);
+ new = ext4_mb_frsp_alloc_node(e4b->bd_sb,
+ e4b->frsp_tree);
if (!new)
return -ENOMEM;
new->frsp_offset = flex_offset + ex->fe_len;
@@ -2678,7 +2809,8 @@ static int mb_mark_used(struct ext4_buddy *e4b, struct ext4_free_extent *ex)
ex->fe_node->frsp_offset += ex->fe_len;
ex->fe_node->frsp_len -= ex->fe_len;
} else {
- ext4_mb_frsp_free_node(e4b->bd_sb, ex->fe_node);
+ ext4_mb_frsp_free_node(e4b->bd_sb, e4b->frsp_tree,
+ ex->fe_node);
return 0;
}
@@ -4170,7 +4302,9 @@ static void ext4_free_data_in_buddy(struct super_block *sb,
/* No more items in the per group rb tree
* balance refcounts from ext4_mb_free_metadata()
*/
- if (!ext4_mb_frsp_on(sb)) {
+ if (ext4_mb_frsp_on(sb)) {
+ atomic_dec(&e4b.frsp_tree->frsp_ref);
+ } else {
put_page(e4b.bd_buddy_page);
put_page(e4b.bd_bitmap_page);
}
@@ -6094,14 +6228,18 @@ ext4_mb_free_metadata(handle_t *handle, struct ext4_buddy *e4b,
new_node = &new_entry->efd_node;
cluster = new_entry->efd_start_cluster;
- if (!*n && !ext4_mb_frsp_on(sb)) {
+ if (!*n) {
/* first free block exent. We need to
protect buddy cache from being freed,
* otherwise we'll refresh it from
* on-disk bitmap and lose not-yet-available
* blocks */
- get_page(e4b->bd_buddy_page);
- get_page(e4b->bd_bitmap_page);
+ if (ext4_mb_frsp_on(sb)) {
+ atomic_inc(&e4b->frsp_tree->frsp_ref);
+ } else {
+ get_page(e4b->bd_buddy_page);
+ get_page(e4b->bd_bitmap_page);
+ }
}
while (*n) {
parent = *n;
--
2.28.0.220.ged08abb693-goog
^ permalink raw reply related [flat|nested] 13+ messages in thread
* [PATCH 8/9] ext4: add tracepoints for free space trees
2020-08-19 7:30 [PATCH 0/9] ext4: add free-space extent based allocator Harshad Shirwadkar
` (6 preceding siblings ...)
2020-08-19 7:31 ` [PATCH 7/9] ext4: add LRU eviction for free space trees Harshad Shirwadkar
@ 2020-08-19 7:31 ` Harshad Shirwadkar
2020-08-19 7:31 ` [PATCH 9/9] ext4: add freespace trees documentation in code Harshad Shirwadkar
8 siblings, 0 replies; 13+ messages in thread
From: Harshad Shirwadkar @ 2020-08-19 7:31 UTC (permalink / raw)
To: linux-ext4; +Cc: tytso, lyx1209, Harshad Shirwadkar
This patch adds a few tracepoints to track the flow of allocation
requests through the free space trees.
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
---
fs/ext4/ext4.h | 1 +
fs/ext4/mballoc.c | 9 +++
fs/ext4/mballoc.h | 3 +-
include/trace/events/ext4.h | 112 ++++++++++++++++++++++++++++++++++++
4 files changed, 124 insertions(+), 1 deletion(-)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 8f015f90a537..e2ab02eab129 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1643,6 +1643,7 @@ struct ext4_sb_info {
atomic_t s_mb_num_fragments;
u32 s_mb_frsp_mem_limit;
struct ext4_mb_frsp_lru s_mb_frsp_lru;
+ atomic_t s_mb_ac_id;
/* workqueue for reserved extent conversions (buffered io) */
struct workqueue_struct *rsv_conversion_wq;
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 98591d44e3cd..b9a833341b98 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -945,6 +945,7 @@ void ext4_mb_frsp_free_tree(struct super_block *sb, struct ext4_frsp_tree *tree)
tree->frsp_len_root = RB_ROOT_CACHED;
out:
mutex_unlock(&tree->frsp_lock);
+ trace_ext4_mb_frsp_free_tree(sb, tree->frsp_index);
}
/*
@@ -1476,6 +1477,7 @@ int ext4_mb_frsp_optimize(struct ext4_allocation_context *ac, int *tree_idx)
mb_debug(ac->ac_sb,
"Optimizer suggestion: found = %d, tree = %d, len = %d, cr = %d\n",
found, *tree_idx, better_len, ac->ac_criteria);
+ trace_ext4_mb_frsp_optimize(ac, found, cache_more_trees, *tree_idx);
return ret;
}
@@ -1546,6 +1548,11 @@ ext4_mb_tree_allocator(struct ext4_allocation_context *ac)
ac->ac_groups_scanned++;
tree_idx = (j % sbi->s_mb_num_frsp_trees);
+ trace_ext4_mb_tree_allocator(ac, tree_idx,
+ sbi->s_mb_num_frsp_trees,
+ atomic_read(&sbi->s_mb_num_frsp_trees_cached),
+ atomic_read(&sbi->s_mb_num_fragments) *
+ sizeof(struct ext4_frsp_node));
ret = ext4_mb_load_allocator(sb,
tree_idx * ext4_flex_bg_size(sbi), &e4b, 0);
if (ret)
@@ -1655,6 +1662,7 @@ int ext4_mb_init_freespace_trees(struct super_block *sb)
if (sbi->s_mb_frsp_mem_limit < EXT4_MB_FRSP_MEM_MIN(sb))
sbi->s_mb_frsp_mem_limit = EXT4_MB_FRSP_MEM_MIN(sb);
atomic_set(&sbi->s_mb_frsp_lru.frsp_active_fragments, 0);
+ atomic_set(&sbi->s_mb_ac_id, 0);
return 0;
}
@@ -5772,6 +5780,7 @@ ext4_mb_initialize_context(struct ext4_allocation_context *ac,
ext4_get_group_no_and_offset(sb, goal, &group, &block);
/* set up allocation goals */
+ ac->ac_id = atomic_inc_return(&sbi->s_mb_ac_id);
ac->ac_b_ex.fe_logical = EXT4_LBLK_CMASK(sbi, ar->logical);
ac->ac_status = AC_STATUS_CONTINUE;
ac->ac_sb = sb;
diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h
index 08cac358324d..d673cce1b53b 100644
--- a/fs/ext4/mballoc.h
+++ b/fs/ext4/mballoc.h
@@ -167,7 +167,6 @@ struct ext4_tree_extent {
};
struct ext4_allocation_context {
- __u32 ac_id;
struct inode *ac_inode;
struct super_block *ac_sb;
@@ -202,6 +201,8 @@ struct ext4_allocation_context {
* N > 0, the field stores N, otherwise 0 */
__u8 ac_num_optimizations;
__u8 ac_op; /* operation, for history only */
+ __u8 ac_id; /* allocation ID for tracking purpose */
+
struct page *ac_bitmap_page;
struct page *ac_buddy_page;
struct ext4_prealloc_space *ac_pa;
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index 8008d2e116b9..e9b6460e7456 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -869,6 +869,118 @@ TRACE_EVENT(ext4_allocate_blocks,
__entry->pleft, __entry->pright)
);
+TRACE_EVENT(ext4_mb_frsp_free_tree,
+ TP_PROTO(struct super_block *sb, int tree),
+
+ TP_ARGS(sb, tree),
+
+ TP_STRUCT__entry(
+ __field( dev_t, dev )
+ __field( int, tree )
+ ),
+
+ TP_fast_assign(
+ __entry->dev = sb->s_dev;
+ __entry->tree = tree;
+ ),
+
+ TP_printk("dev %d,%d tree %d",
+ MAJOR(__entry->dev), MINOR(__entry->dev), __entry->tree)
+);
+
+TRACE_EVENT(ext4_mb_frsp_optimize,
+ TP_PROTO(struct ext4_allocation_context *ac, int found,
+ int cache_more_trees, int tree),
+
+ TP_ARGS(ac, found, cache_more_trees, tree),
+
+ TP_STRUCT__entry(
+ __field( dev_t, dev )
+ __field( ino_t, ino )
+ __field( unsigned int, len )
+ __field( unsigned int, flags )
+ __field( int, found )
+ __field( int, tree )
+ __field( int, num_found )
+ __field( int, attempts )
+ __field( int, ac_id )
+ __field( int, ac_criteria )
+ __field( int, ac_groups_scanned )
+ __field( int, cache_more_trees )
+ ),
+
+ TP_fast_assign(
+ __entry->dev = ac->ac_sb->s_dev;
+ __entry->ino = ac->ac_inode->i_ino;
+ __entry->len = ac->ac_b_tree_ex.te_len;
+ __entry->flags = ac->ac_flags;
+ __entry->found = found;
+ __entry->tree = tree;
+ __entry->attempts = ac->ac_num_optimizations;
+ __entry->ac_id = ac->ac_id;
+ __entry->num_found = ac->ac_found;
+ __entry->ac_criteria = ac->ac_criteria;
+ __entry->ac_groups_scanned = ac->ac_groups_scanned;
+ __entry->cache_more_trees = cache_more_trees;
+ ),
+
+ TP_printk("dev %d,%d ino %lu ac %d flags %s best-len %u num-found %d "
+ "suggest-tree %d cache_more %d attempt %d status %d cr %d "
+ "nflexgrps %d",
+ MAJOR(__entry->dev), MINOR(__entry->dev),
+ (unsigned long) __entry->ino, __entry->ac_id,
+ show_mballoc_flags(__entry->flags),
+ __entry->len, __entry->num_found, __entry->tree,
+ __entry->cache_more_trees, __entry->attempts, __entry->found,
+ __entry->ac_criteria, __entry->ac_groups_scanned)
+);
+
+TRACE_EVENT(ext4_mb_tree_allocator,
+ TP_PROTO(struct ext4_allocation_context *ac, int tree, int num_trees,
+ int num_trees_loaded, int bytes_usage),
+
+ TP_ARGS(ac, tree, num_trees, num_trees_loaded, bytes_usage),
+
+ TP_STRUCT__entry(
+ __field( dev_t, dev )
+ __field( ino_t, ino )
+ __field( unsigned int, len )
+ __field( unsigned int, flags )
+ __field( int, num_trees )
+ __field( int, num_trees_loaded )
+ __field( int, tree )
+ __field( int, ac_id )
+ __field( int, ac_found )
+ __field( int, ac_criteria )
+ __field( int, ac_groups_scanned )
+ __field( int, bytes_usage )
+ ),
+
+ TP_fast_assign(
+ __entry->dev = ac->ac_sb->s_dev;
+ __entry->ino = ac->ac_inode->i_ino;
+ __entry->len = ac->ac_b_tree_ex.te_len;
+ __entry->flags = ac->ac_flags;
+ __entry->num_trees = num_trees;
+ __entry->num_trees_loaded = num_trees_loaded;
+ __entry->tree = tree;
+ __entry->ac_id = ac->ac_id;
+ __entry->ac_found = ac->ac_found;
+ __entry->ac_criteria = ac->ac_criteria;
+ __entry->ac_groups_scanned = ac->ac_groups_scanned;
+ __entry->bytes_usage = bytes_usage;
+ ),
+
+ TP_printk("dev %d,%d ino %lu ac %d flags %s best-len %u found %d current-tree %d "
+ "num-trees %d num-trees-loaded %d cr %d nflexgrps %d mem_bytes %d",
+ MAJOR(__entry->dev), MINOR(__entry->dev),
+ (unsigned long) __entry->ino, __entry->ac_id,
+ show_mballoc_flags(__entry->flags),
+ __entry->len, __entry->ac_found, __entry->tree, __entry->num_trees,
+ __entry->num_trees_loaded, __entry->ac_criteria,
+ __entry->ac_groups_scanned, __entry->bytes_usage)
+);
+
TRACE_EVENT(ext4_free_blocks,
TP_PROTO(struct inode *inode, __u64 block, unsigned long count,
int flags),
--
2.28.0.220.ged08abb693-goog
^ permalink raw reply related [flat|nested] 13+ messages in thread
* [PATCH 9/9] ext4: add freespace trees documentation in code
2020-08-19 7:30 [PATCH 0/9] ext4: add free-space extent based allocator Harshad Shirwadkar
` (7 preceding siblings ...)
2020-08-19 7:31 ` [PATCH 8/9] ext4: add tracepoints " Harshad Shirwadkar
@ 2020-08-19 7:31 ` Harshad Shirwadkar
8 siblings, 0 replies; 13+ messages in thread
From: Harshad Shirwadkar @ 2020-08-19 7:31 UTC (permalink / raw)
To: linux-ext4; +Cc: tytso, lyx1209, Harshad Shirwadkar
This patch adds a comment in mballoc.c describing the design and flow
of free-space trees.
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
---
fs/ext4/mballoc.c | 116 ++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 116 insertions(+)
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index b9a833341b98..644df0bf61ee 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -330,6 +330,122 @@
* object
*
*/
+
+/*
+ * The freespace tree allocator
+ * -----------------------------
+ *
+ * High Level Design
+ * =================
+ * This allocator maintains the free space information about the file system
+ * in per-flexible-block-group trees. For every flexible block group, we
+ * maintain individual freespace nodes in two trees, one sorted by flex-bg
+ * offset and the other sorted by length. This gives us two benefits: i) in
+ * the allocation path, our search time complexity is O(Log(Number of
+ * freespace regions in the flex-bg)); ii) in the free path, within the same
+ * time complexity we can merge adjacent nodes, thereby keeping the tree
+ * small.
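+ *
+ * A minimal sketch of the per-extent node this design implies (the field
+ * and type layout here is illustrative, not necessarily what the code
+ * below uses):
+ *
+ *	struct frsp_extent {
+ *		struct rb_node	node_offset;	// tree sorted by flex-bg offset
+ *		struct rb_node	node_len;	// tree sorted by extent length
+ *		ext4_grpblk_t	start;		// start cluster within the flex-bg
+ *		ext4_grpblk_t	len;		// length in clusters
+ *	};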
+ *
+ * Along with the flexible block group level trees, we also introduce a top
+ * level meta-tree which indexes the individual per-flex-bg trees. This tree
+ * is sorted by the length of the largest extent found in each flex-bg. The
+ * key advantage this tree gives us is the following: during an allocation
+ * request, the allocator is now able to consult this tree and directly (in
+ * O(Log(Number of Flex BGs))) jump to a flexible block group which definitely
+ * has at least one (the largest) extent that can satisfy the request. If no
+ * such flexible block group exists, the allocator can immediately drop the
+ * CR level.
+ *
+ * In order to preserve the parallel allocation / free performance, the
+ * allocator only *tries* to consult this tree. It does so by calling
+ * read_trylock(); if the meta-tree is busy, the allocator continues its
+ * linear search until it is able to grab a read lock on this tree.
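+ *
+ * In sketch form (sbi->s_mb_frsp_lock and the read_trylock() call are from
+ * the description above; the other helper names are illustrative):
+ *
+ *	if (read_trylock(&sbi->s_mb_frsp_lock)) {
+ *		group = frsp_meta_tree_suggest(ac);	// O(Log(nr of flex-bgs))
+ *		read_unlock(&sbi->s_mb_frsp_lock);
+ *	} else {
+ *		group = frsp_next_group_linear(ac);	// don't block; keep scanning
+ *	}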
+ *
+ * Memory Footprint
+ * ================
+ *
+ * In a less fragmented file system, the memory footprint of the new allocator
+ * is significantly lower than that of buddy bitmaps. However, as the
+ * fragmentation level increases, the memory footprint of this allocator
+ * increases significantly. The memory usage of the freespace tree allocator
+ * can be controlled using the sysfs tunable
+ * /sys/fs/ext4/<dev>/mb_frsp_max_mem. Its default value is max(memory needed
+ * for buddy, maximum memory needed for one tree in the worst case).
+ * Accounting for the maximum memory needed for one tree allows us to keep at
+ * least one tree in memory even in the worst case, which avoids thrashing.
+ * Once the memory threshold is reached, the allocator starts evicting trees
+ * from memory in LRU fashion. However, we don't remove the tree's entry from
+ * the meta-tree. This allows the allocator to efficiently reconstruct only
+ * the relevant trees from on-disk bitmaps. In our evaluations, we found that
+ * the freespace tree allocator still manages to outperform the buddy
+ * allocator in memory-crunch situations.
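+ *
+ * The default threshold described above can be expressed as (helper names
+ * are illustrative):
+ *
+ *	max_mem = max(mem_needed_for_buddy(sb),
+ *		      worst_case_mem_for_one_tree(sb));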
+ *
+ * LRU
+ * ===
+ *
+ * We maintain two lists - an active list and an inactive list of trees. Trees
+ * on the active list stay there until evicted. In order to reduce contention
+ * on the active list lock, once a tree is on the active list it is not moved
+ * around within the list. Also, a tree moves from the inactive list to the
+ * active list only if it is accessed twice within an
+ * EXT4_MB_FRSP_ACTIVE_THRESHOLD_JIFFIES window.
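+ *
+ * A sketch of the promotion rule (only EXT4_MB_FRSP_ACTIVE_THRESHOLD_JIFFIES
+ * is a name from this series; the rest are illustrative):
+ *
+ *	// a second access within the threshold window promotes the tree
+ *	if (on_inactive_list(tree) &&
+ *	    time_before(jiffies, tree->last_access +
+ *			EXT4_MB_FRSP_ACTIVE_THRESHOLD_JIFFIES))
+ *		move_to_active_list(tree);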
+ *
+ * Caching Tree Metadata
+ * =====================
+ * As noted in our experiments, caching tree metadata (the largest available
+ * extent in a tree) in the meta-tree significantly boosts allocation
+ * performance. However, if the allocator waits for the cache to fill up
+ * before starting to serve allocation requests, that may severely degrade
+ * allocation performance on large disks. Thus, it is important to tune the
+ * tree caching behavior according to the underlying block device. This
+ * patchset provides four cache aggression levels, selectable via the
+ * mount-time parameter "frsp_cache_aggression". Here is the meaning of the
+ * different levels (a condensed sketch of the selection logic follows):
+ *
+ * Cache Aggression 0: Try to serve requests from cached trees only.
+ * Cache Aggression 1 (Default): Try to serve requests from only cached trees
+ *    at CR 0. At all other CRs, try to use an uncached tree for every
+ *    alternate request.
+ * Cache Aggression 2: Try to use an uncached tree for every alternate request
+ *    at all CR levels.
+ * Cache Aggression 3: Try to use uncached trees for every request.
+ *
+ * Moreover, if the file system is mounted with "prefetch_block_bitmaps",
+ * tree caching starts at mount time.
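+ *
+ * A condensed sketch of how these levels could drive the decision (the
+ * variable names are illustrative):
+ *
+ *	switch (cache_aggression) {
+ *	case 0: use_uncached = false; break;
+ *	case 1: use_uncached = cr > 0 && alternate; break;
+ *	case 2: use_uncached = alternate; break;
+ *	case 3: use_uncached = true; break;
+ *	}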
+ *
+ * Locking Order
+ * =============
+ *
+ * - Tree lock
+ * - Meta tree lock (sbi->s_mb_frsp_lock)
+ * - LRU lock
+ *
+ * Critical sections under the meta-tree lock and the LRU lock are kept as
+ * small as possible, and you shouldn't need to take the meta-tree lock and
+ * the LRU lock simultaneously.
+ *
+ * High Level Algorithm
+ * ====================
+ * - Consult the meta-tree, asking which flex-bg the allocator should look
+ *   at. One of the following things can happen:
+ *   - The meta-tree may find a suitable flex-bg for the request.
+ *     In this case, we start searching in that tree.
+ *   - The meta-tree may realize that there's no suitable flex-bg.
+ *     In this case, we increase the CR and restart.
+ *   - Based on the caching mode, the meta-tree may redirect us to an
+ *     uncached tree.
+ *   - The meta-tree may be busy.
+ *     In this case, we don't wait for the meta-tree to become available;
+ *     instead, we continue our search linearly.
+ * - Pick a tree (either based on what the meta-tree told us or linearly
+ *   from the last one).
+ * - Manage the LRU structure (ext4_mb_frsp_maintain_lru()):
+ *   - Move the tree to the active list if needed.
+ *   - Move trees from the active list to the inactive list if needed.
+ * - Perform the search by length and pick matching extents.
+ * - Repeat until the best extent is found.
+ * - Perform a memory-crunch check and evict trees in LRU fashion if needed
+ *   (ext4_mb_unload_allocator()). A condensed sketch of this loop follows.
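+ *
+ * (Only ext4_mb_frsp_maintain_lru() and ext4_mb_unload_allocator() are names
+ * from this series; the other helpers and the argument lists below are
+ * illustrative.)
+ *
+ *	while (!frsp_best_found(ac)) {
+ *		tree = frsp_pick_tree(ac);	// meta-tree hint, or linear scan
+ *		ext4_mb_frsp_maintain_lru(sb, tree);
+ *		frsp_search_by_len(tree, ac);
+ *	}
+ *	ext4_mb_unload_allocator(&e4b);	// may evict trees in LRU fashion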
+ */
+
static struct kmem_cache *ext4_pspace_cachep;
static struct kmem_cache *ext4_ac_cachep;
static struct kmem_cache *ext4_free_data_cachep;
--
2.28.0.220.ged08abb693-goog
^ permalink raw reply related [flat|nested] 13+ messages in thread