* [PATCH 0/5 v2] ext4: Fix performance regression with mballoc
@ 2022-09-06 15:29 Jan Kara
  2022-09-06 15:29 ` [PATCH 1/5] ext4: Make mballoc try target group first even with mb_optimize_scan Jan Kara
                   ` (6 more replies)
  0 siblings, 7 replies; 27+ messages in thread
From: Jan Kara @ 2022-09-06 15:29 UTC (permalink / raw)
  To: Ted Tso
  Cc: linux-ext4, Thorsten Leemhuis, Ojaswin Mujoo, Stefan Wahren,
	Andreas Dilger, Jan Kara

Hello,

Here is a second version of my mballoc improvements to avoid spreading
allocations with mb_optimize_scan=1. The patches fix the performance
regression I was able to reproduce with reaim on my test machine:

                     mb_optimize_scan=0     mb_optimize_scan=1     patched
Hmean     disk-1       2076.12 (   0.00%)     2099.37 (   1.12%)     2032.52 (  -2.10%)
Hmean     disk-41     92481.20 (   0.00%)    83787.47 *  -9.40%*    90308.37 (  -2.35%)
Hmean     disk-81    155073.39 (   0.00%)   135527.05 * -12.60%*   154285.71 (  -0.51%)
Hmean     disk-121   185109.64 (   0.00%)   166284.93 * -10.17%*   185298.62 (   0.10%)
Hmean     disk-161   229890.53 (   0.00%)   207563.39 *  -9.71%*   232883.32 *   1.30%*
Hmean     disk-201   223333.33 (   0.00%)   203235.59 *  -9.00%*   221446.93 (  -0.84%)
Hmean     disk-241   235735.25 (   0.00%)   217705.51 *  -7.65%*   239483.27 *   1.59%*
Hmean     disk-281   266772.15 (   0.00%)   241132.72 *  -9.61%*   263108.62 (  -1.37%)
Hmean     disk-321   265435.50 (   0.00%)   245412.84 *  -7.54%*   267277.27 (   0.69%)

The changes also significantly reduce spreading of allocations for small /
moderately sized files. I'm not able to measure a performance difference
resulting from this, but on eMMC storage this spreading seems to be the main
culprit of the reduced performance. Untarring the raspberry-pi archive touches
the following numbers of groups:

	mb_optimize_scan=0	mb_optimize_scan=1	patched
groups	4			22			7

To achieve this I have added two more changes on top of v1 - patches 4 and 5.
Patch 4 makes sure we use locality group preallocation even for files that are
not likely to grow anymore (previously we disabled all preallocations for such
files, although locality group preallocation still makes a lot of sense for
them). This patch reduced the spread of small file allocations but larger file
allocations were still spread significantly because they avoid locality group
preallocation and, as they are not power-of-two in size, they also immediately
start with the cr=1 scan. To address that I've changed the data structure used
for looking up the best block group to allocate from (see patch 5 for details).
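
To give a concrete illustration (the file size here is only an example): a
file of, say, 100k is above the 64k default limit for locality group
preallocation, so it skips the locality group, and because its request is not
a power of two it immediately falls through to the cr=1 average-fragment-size
lookup - which, with the rbtree, tended to pick a different block group almost
every time. Patch 5 changes that lookup so these allocations stay within a
small set of groups.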

Stefan, can you please test whether these patches fix the problem for you as
well? Comments & review welcome.

								Honza
Previous versions:
Link: http://lore.kernel.org/r/20220823134508.27854-1-jack@suse.cz # v1


* [PATCH 1/5] ext4: Make mballoc try target group first even with mb_optimize_scan
  2022-09-06 15:29 [PATCH 0/5 v2] ext4: Fix performance regression with mballoc Jan Kara
@ 2022-09-06 15:29 ` Jan Kara
  2022-09-07 17:43   ` Ritesh Harjani (IBM)
  2022-09-06 15:29 ` [PATCH 2/5] ext4: Avoid unnecessary spreading of allocations among groups Jan Kara
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 27+ messages in thread
From: Jan Kara @ 2022-09-06 15:29 UTC (permalink / raw)
  To: Ted Tso
  Cc: linux-ext4, Thorsten Leemhuis, Ojaswin Mujoo, Stefan Wahren,
	Andreas Dilger, Jan Kara, stable

One of the side-effects of mb_optimize_scan was that the optimized
functions to select the next group to try were called even before we tried
the goal group. As a result we no longer allocate files close to their
corresponding inodes, and we also don't try to expand the currently
allocated extent in the same group. This results in a reaim regression with
the workfile.disk workload of up to 8% with many clients on my test
machine:

                     baseline               mb_optimize_scan
Hmean     disk-1       2114.16 (   0.00%)     2099.37 (  -0.70%)
Hmean     disk-41     87794.43 (   0.00%)    83787.47 *  -4.56%*
Hmean     disk-81    148170.73 (   0.00%)   135527.05 *  -8.53%*
Hmean     disk-121   177506.11 (   0.00%)   166284.93 *  -6.32%*
Hmean     disk-161   220951.51 (   0.00%)   207563.39 *  -6.06%*
Hmean     disk-201   208722.74 (   0.00%)   203235.59 (  -2.63%)
Hmean     disk-241   222051.60 (   0.00%)   217705.51 (  -1.96%)
Hmean     disk-281   252244.17 (   0.00%)   241132.72 *  -4.41%*
Hmean     disk-321   255844.84 (   0.00%)   245412.84 *  -4.08%*

This is also causing a huge regression (time increased by a factor of 5 or
so) when untarring an archive with lots of small files on some eMMC storage
cards.

Fix the problem by making sure we try the goal group first.
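
To spell out the effect of the change below: previously the first call to
ext4_mb_choose_next_group() happened before any group had been scanned, so
the very first group we looked at was already the optimizer's suggestion
rather than the goal group in ac->ac_g_ex.fe_group. With the loop
restructured, the goal group is scanned first and the optimized selection
only decides where to go in subsequent iterations.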

Fixes: 196e402adf2e ("ext4: improve cr 0 / cr 1 group scanning")
CC: stable@vger.kernel.org
Reported-by: Stefan Wahren <stefan.wahren@i2se.com>
Link: https://lore.kernel.org/all/20220727105123.ckwrhbilzrxqpt24@quack3/
Link: https://lore.kernel.org/all/0d81a7c2-46b7-6010-62a4-3e6cfc1628d6@i2se.com/
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/mballoc.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index bd8f8b5c3d30..41e1cfecac3b 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -1049,8 +1049,10 @@ static void ext4_mb_choose_next_group(struct ext4_allocation_context *ac,
 {
 	*new_cr = ac->ac_criteria;
 
-	if (!should_optimize_scan(ac) || ac->ac_groups_linear_remaining)
+	if (!should_optimize_scan(ac) || ac->ac_groups_linear_remaining) {
+		*group = next_linear_group(ac, *group, ngroups);
 		return;
+	}
 
 	if (*new_cr == 0) {
 		ext4_mb_choose_next_group_cr0(ac, new_cr, group, ngroups);
@@ -2636,7 +2638,7 @@ static noinline_for_stack int
 ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 {
 	ext4_group_t prefetch_grp = 0, ngroups, group, i;
-	int cr = -1;
+	int cr = -1, new_cr;
 	int err = 0, first_err = 0;
 	unsigned int nr = 0, prefetch_ios = 0;
 	struct ext4_sb_info *sbi;
@@ -2711,13 +2713,11 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 		ac->ac_groups_linear_remaining = sbi->s_mb_max_linear_groups;
 		prefetch_grp = group;
 
-		for (i = 0; i < ngroups; group = next_linear_group(ac, group, ngroups),
-			     i++) {
-			int ret = 0, new_cr;
+		for (i = 0, new_cr = cr; i < ngroups; i++,
+		     ext4_mb_choose_next_group(ac, &new_cr, &group, ngroups)) {
+			int ret = 0;
 
 			cond_resched();
-
-			ext4_mb_choose_next_group(ac, &new_cr, &group, ngroups);
 			if (new_cr != cr) {
 				cr = new_cr;
 				goto repeat;
-- 
2.35.3



* [PATCH 2/5] ext4: Avoid unnecessary spreading of allocations among groups
  2022-09-06 15:29 [PATCH 0/5 v2] ext4: Fix performance regression with mballoc Jan Kara
  2022-09-06 15:29 ` [PATCH 1/5] ext4: Make mballoc try target group first even with mb_optimize_scan Jan Kara
@ 2022-09-06 15:29 ` Jan Kara
  2022-09-07 18:05   ` Ritesh Harjani (IBM)
  2022-09-06 15:29 ` [PATCH 3/5] ext4: Make directory inode spreading reflect flexbg size Jan Kara
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 27+ messages in thread
From: Jan Kara @ 2022-09-06 15:29 UTC (permalink / raw)
  To: Ted Tso
  Cc: linux-ext4, Thorsten Leemhuis, Ojaswin Mujoo, Stefan Wahren,
	Andreas Dilger, Jan Kara

mb_set_largest_free_order() updates the lists containing groups with the
largest chunk of free space of a given order. The way it updates them leads
to always moving the group to the tail of the list. Thus allocations looking
for free space of a given order effectively end up cycling through all
groups (and, due to how the lists are initialized, in last-to-first order).
This spreads allocations among block groups, which reduces performance for
rotating disks or low-end flash media. Change mb_set_largest_free_order() to
only update the lists if the order of the largest free chunk in the group
has changed.
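
For example, freeing or allocating a few blocks in a group whose largest
free chunk stays within the same power-of-two order no longer requeues the
group to the tail of its list, so a stream of similarly sized allocations
keeps being directed to the same group instead of cycling through all of
them.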

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/mballoc.c | 24 +++++++++++++-----------
 1 file changed, 13 insertions(+), 11 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 41e1cfecac3b..6251b4a6cc63 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -1077,23 +1077,25 @@ mb_set_largest_free_order(struct super_block *sb, struct ext4_group_info *grp)
 	struct ext4_sb_info *sbi = EXT4_SB(sb);
 	int i;
 
-	if (test_opt2(sb, MB_OPTIMIZE_SCAN) && grp->bb_largest_free_order >= 0) {
+	for (i = MB_NUM_ORDERS(sb) - 1; i >= 0; i--)
+		if (grp->bb_counters[i] > 0)
+			break;
+	/* No need to move between order lists? */
+	if (!test_opt2(sb, MB_OPTIMIZE_SCAN) ||
+	    i == grp->bb_largest_free_order) {
+		grp->bb_largest_free_order = i;
+		return;
+	}
+
+	if (grp->bb_largest_free_order >= 0) {
 		write_lock(&sbi->s_mb_largest_free_orders_locks[
 					      grp->bb_largest_free_order]);
 		list_del_init(&grp->bb_largest_free_order_node);
 		write_unlock(&sbi->s_mb_largest_free_orders_locks[
 					      grp->bb_largest_free_order]);
 	}
-	grp->bb_largest_free_order = -1; /* uninit */
-
-	for (i = MB_NUM_ORDERS(sb) - 1; i >= 0; i--) {
-		if (grp->bb_counters[i] > 0) {
-			grp->bb_largest_free_order = i;
-			break;
-		}
-	}
-	if (test_opt2(sb, MB_OPTIMIZE_SCAN) &&
-	    grp->bb_largest_free_order >= 0 && grp->bb_free) {
+	grp->bb_largest_free_order = i;
+	if (grp->bb_largest_free_order >= 0 && grp->bb_free) {
 		write_lock(&sbi->s_mb_largest_free_orders_locks[
 					      grp->bb_largest_free_order]);
 		list_add_tail(&grp->bb_largest_free_order_node,
-- 
2.35.3



* [PATCH 3/5] ext4: Make directory inode spreading reflect flexbg size
  2022-09-06 15:29 [PATCH 0/5 v2] ext4: Fix performance regression with mballoc Jan Kara
  2022-09-06 15:29 ` [PATCH 1/5] ext4: Make mballoc try target group first even with mb_optimize_scan Jan Kara
  2022-09-06 15:29 ` [PATCH 2/5] ext4: Avoid unnecessary spreading of allocations among groups Jan Kara
@ 2022-09-06 15:29 ` Jan Kara
  2022-09-06 15:29 ` [PATCH 4/5] ext4: Use locality group preallocation for small closed files Jan Kara
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 27+ messages in thread
From: Jan Kara @ 2022-09-06 15:29 UTC (permalink / raw)
  To: Ted Tso
  Cc: linux-ext4, Thorsten Leemhuis, Ojaswin Mujoo, Stefan Wahren,
	Andreas Dilger, Jan Kara

Currently the Orlov inode allocator searches for free inodes for a
directory only in flex block groups with at most inodes_per_group/16 more
directory inodes than the average per flex block group. However with
growing flex block group size this becomes unnecessarily strict. Scale the
allowed difference from the average directory count per flex block group
with the flex block group size, as we do with other metrics.
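
As a worked example (numbers assumed purely for illustration): with
inodes_per_group = 8192 and flex_size = 16, the allowed excess over the
average directory count per flex block group grows from 8192 / 16 = 512 to
8192 * 16 / 16 = 8192 directory inodes, matching the flex_size scaling
already applied to min_inodes just below.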

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/ialloc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index f73e5eb43eae..208b87ce8858 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -510,7 +510,7 @@ static int find_group_orlov(struct super_block *sb, struct inode *parent,
 		goto fallback;
 	}
 
-	max_dirs = ndirs / ngroups + inodes_per_group / 16;
+	max_dirs = ndirs / ngroups + inodes_per_group*flex_size / 16;
 	min_inodes = avefreei - inodes_per_group*flex_size / 4;
 	if (min_inodes < 1)
 		min_inodes = 1;
-- 
2.35.3



* [PATCH 4/5] ext4: Use locality group preallocation for small closed files
  2022-09-06 15:29 [PATCH 0/5 v2] ext4: Fix performance regression with mballoc Jan Kara
                   ` (2 preceding siblings ...)
  2022-09-06 15:29 ` [PATCH 3/5] ext4: Make directory inode spreading reflect flexbg size Jan Kara
@ 2022-09-06 15:29 ` Jan Kara
  2022-09-07 18:25   ` Ritesh Harjani (IBM)
  2022-09-06 15:29 ` [PATCH 5/5] ext4: Use buckets for cr 1 block scan instead of rbtree Jan Kara
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 27+ messages in thread
From: Jan Kara @ 2022-09-06 15:29 UTC (permalink / raw)
  To: Ted Tso
  Cc: linux-ext4, Thorsten Leemhuis, Ojaswin Mujoo, Stefan Wahren,
	Andreas Dilger, Jan Kara

Currently we don't use any preallocation when a file is already closed by
the time we allocate blocks (from the writeback code when converting delayed
allocation). However for small files, using locality group preallocation is
still desirable as that preallocation is not specific to a particular file.
Rather it is a method to pack small files together to reduce fragmentation,
and for that purpose the fact that the file is closed is actually an even
stronger hint that the file would benefit from packing. So change the logic
to allow locality group preallocation in this case.
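
Purely for illustration, here is a minimal stand-alone sketch (not the
kernel code; collapsing the conditions onto two booleans is a
simplification) of the policy the change below ends up with. "small" stands
for max(size, isize) <= sbi->s_mb_stream_request, "closed" for the
size == isize && !inode_is_open_for_write() case, and s_mb_group_prealloc is
assumed to be positive:

#include <stdio.h>
#include <stdbool.h>

/* Toy model of ext4_mb_group_or_file() after this patch; the real code
 * sets flags in the allocation context instead of returning a string. */
static const char *prealloc_policy(bool small, bool closed)
{
	bool group_pa_eligible = true;	/* assumes s_mb_group_prealloc > 0 */
	bool inode_pa_eligible = true;

	if (closed)	/* no point in inode preallocation for closed files */
		inode_pa_eligible = false;
	if (!small)	/* don't use group allocation for large files */
		group_pa_eligible = false;

	if (!group_pa_eligible)
		return inode_pa_eligible ? "inode/stream preallocation"
					 : "no preallocation";
	return "locality group preallocation";
}

int main(void)
{
	printf("small file, still open    : %s\n", prealloc_policy(true, false));
	printf("small file, already closed: %s\n", prealloc_policy(true, true));
	printf("large file, still open    : %s\n", prealloc_policy(false, false));
	printf("large file, already closed: %s\n", prealloc_policy(false, true));
	return 0;
}

The second line of output is the behavioural change: such allocations
previously ended up with EXT4_MB_HINT_NOPREALLOC.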

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/mballoc.c | 27 +++++++++++++++------------
 1 file changed, 15 insertions(+), 12 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 6251b4a6cc63..af1e49c3603f 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -5195,6 +5195,7 @@ static void ext4_mb_group_or_file(struct ext4_allocation_context *ac)
 	struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
 	int bsbits = ac->ac_sb->s_blocksize_bits;
 	loff_t size, isize;
+	bool inode_pa_eligible, group_pa_eligible;
 
 	if (!(ac->ac_flags & EXT4_MB_HINT_DATA))
 		return;
@@ -5202,25 +5203,27 @@ static void ext4_mb_group_or_file(struct ext4_allocation_context *ac)
 	if (unlikely(ac->ac_flags & EXT4_MB_HINT_GOAL_ONLY))
 		return;
 
+	group_pa_eligible = sbi->s_mb_group_prealloc > 0;
+	inode_pa_eligible = true;
 	size = ac->ac_o_ex.fe_logical + EXT4_C2B(sbi, ac->ac_o_ex.fe_len);
 	isize = (i_size_read(ac->ac_inode) + ac->ac_sb->s_blocksize - 1)
 		>> bsbits;
 
+	/* No point in using inode preallocation for closed files */
 	if ((size == isize) && !ext4_fs_is_busy(sbi) &&
-	    !inode_is_open_for_write(ac->ac_inode)) {
-		ac->ac_flags |= EXT4_MB_HINT_NOPREALLOC;
-		return;
-	}
+	    !inode_is_open_for_write(ac->ac_inode))
+		inode_pa_eligible = false;
 
-	if (sbi->s_mb_group_prealloc <= 0) {
-		ac->ac_flags |= EXT4_MB_STREAM_ALLOC;
-		return;
-	}
-
-	/* don't use group allocation for large files */
 	size = max(size, isize);
-	if (size > sbi->s_mb_stream_request) {
-		ac->ac_flags |= EXT4_MB_STREAM_ALLOC;
+	/* Don't use group allocation for large files */
+	if (size > sbi->s_mb_stream_request)
+		group_pa_eligible = false;
+
+	if (!group_pa_eligible) {
+		if (inode_pa_eligible)
+			ac->ac_flags |= EXT4_MB_STREAM_ALLOC;
+		else
+			ac->ac_flags |= EXT4_MB_HINT_NOPREALLOC;
 		return;
 	}
 
-- 
2.35.3



* [PATCH 5/5] ext4: Use buckets for cr 1 block scan instead of rbtree
  2022-09-06 15:29 [PATCH 0/5 v2] ext4: Fix performance regression with mballoc Jan Kara
                   ` (3 preceding siblings ...)
  2022-09-06 15:29 ` [PATCH 4/5] ext4: Use locality group preallocation for small closed files Jan Kara
@ 2022-09-06 15:29 ` Jan Kara
  2022-09-08 11:00     ` Dan Carpenter
                     ` (2 more replies)
  2022-09-06 20:38 ` [PATCH 0/5 v2] ext4: Fix performance regression with mballoc Stefan Wahren
  2022-09-08  8:17 ` Ojaswin Mujoo
  6 siblings, 3 replies; 27+ messages in thread
From: Jan Kara @ 2022-09-06 15:29 UTC (permalink / raw)
  To: Ted Tso
  Cc: linux-ext4, Thorsten Leemhuis, Ojaswin Mujoo, Stefan Wahren,
	Andreas Dilger, Jan Kara

Using an rbtree for sorting groups by average fragment size is relatively
expensive (it needs an rbtree update on every block freeing or allocation)
and leads to wide spreading of allocations because the selection of the
block group is very sensitive both to changes in free space and to the
amount of blocks allocated. Furthermore, selecting the group with the best
matching average fragment size is not necessary anyway, even more so because
the variability of fragment sizes within a group is likely large, so the
average is not telling much. We just need a group with a large enough
average fragment size so that we have a high probability of finding a large
enough free extent, and we don't want the average fragment size to be too
big, so that we are likely to find a free extent only somewhat larger than
what we need.

So instead of maintaining an rbtree of groups sorted by fragment size, keep
bins (lists) of groups where the average fragment size is in the interval
[2^i, 2^(i+1)). This structure requires fewer updates on block allocation /
freeing, generally avoids chaotic spreading of allocations into block
groups, and is still able to quickly (even faster than the rbtree) provide a
block group which is likely to have a suitably sized free space extent.
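
As an aside, a minimal user-space sketch of the bucket mapping (it mirrors
the fls()-based helper added below; MB_NUM_ORDERS(sb) is replaced by an
arbitrary constant and the kernel's fls() by the GCC/Clang builtin, so the
numbers are illustrative only):

#include <stdio.h>

#define NUM_ORDERS 14	/* stand-in for MB_NUM_ORDERS(sb); value is illustrative */

/* 1-based index of the most significant set bit, like the kernel's fls() */
static int fls32(unsigned int x)
{
	return x ? 32 - __builtin_clz(x) : 0;
}

/* Mirrors mb_avg_fragment_size_order() from the patch below */
static int avg_fragment_size_order(int len)
{
	int order = fls32(len) - 2;

	if (order < 0)		/* no dedicated lists for tiny averages; clamp to list 0 */
		return 0;
	if (order == NUM_ORDERS)	/* clamp the top into the last list */
		order--;
	return order;
}

int main(void)
{
	int sizes[] = { 1, 2, 4, 7, 8, 64, 100, 1000, 32768 };
	unsigned int i;

	for (i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++)
		printf("avg fragment size %5d blocks -> list %d\n",
		       sizes[i], avg_fragment_size_order(sizes[i]));
	return 0;
}

Moving a group between lists is then a constant-time list_del() /
list_add_tail() under a per-list lock, instead of an rbtree rebalance under
a single global lock.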

This patch reduces the number of block groups used when untarring an
archive with medium sized files (size somewhat above 64k, which is the
default mballoc limit for avoiding locality group preallocation) to about
half and thus improves write speeds for eMMC flash significantly.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/ext4.h    |  10 +-
 fs/ext4/mballoc.c | 252 +++++++++++++++++++---------------------------
 fs/ext4/mballoc.h |   1 -
 3 files changed, 110 insertions(+), 153 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 9bca5565547b..3bf9a6926798 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -167,8 +167,6 @@ enum SHIFT_DIRECTION {
 #define EXT4_MB_CR0_OPTIMIZED		0x8000
 /* Avg fragment size rb tree lookup succeeded at least once for cr = 1 */
 #define EXT4_MB_CR1_OPTIMIZED		0x00010000
-/* Perform linear traversal for one group */
-#define EXT4_MB_SEARCH_NEXT_LINEAR	0x00020000
 struct ext4_allocation_request {
 	/* target inode for block we're allocating */
 	struct inode *inode;
@@ -1600,8 +1598,8 @@ struct ext4_sb_info {
 	struct list_head s_discard_list;
 	struct work_struct s_discard_work;
 	atomic_t s_retry_alloc_pending;
-	struct rb_root s_mb_avg_fragment_size_root;
-	rwlock_t s_mb_rb_lock;
+	struct list_head *s_mb_avg_fragment_size;
+	rwlock_t *s_mb_avg_fragment_size_locks;
 	struct list_head *s_mb_largest_free_orders;
 	rwlock_t *s_mb_largest_free_orders_locks;
 
@@ -3413,6 +3411,8 @@ struct ext4_group_info {
 	ext4_grpblk_t	bb_first_free;	/* first free block */
 	ext4_grpblk_t	bb_free;	/* total free blocks */
 	ext4_grpblk_t	bb_fragments;	/* nr of freespace fragments */
+	int		bb_avg_fragment_size_order;	/* order of average
+							   fragment in BG */
 	ext4_grpblk_t	bb_largest_free_order;/* order of largest frag in BG */
 	ext4_group_t	bb_group;	/* Group number */
 	struct          list_head bb_prealloc_list;
@@ -3420,7 +3420,7 @@ struct ext4_group_info {
 	void            *bb_bitmap;
 #endif
 	struct rw_semaphore alloc_sem;
-	struct rb_node	bb_avg_fragment_size_rb;
+	struct list_head bb_avg_fragment_size_node;
 	struct list_head bb_largest_free_order_node;
 	ext4_grpblk_t	bb_counters[];	/* Nr of free power-of-two-block
 					 * regions, index is order.
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index af1e49c3603f..213d2d0750dd 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -140,13 +140,15 @@
  *    number of buddy bitmap orders possible) number of lists. Group-infos are
  *    placed in appropriate lists.
  *
- * 2) Average fragment size rb tree (sbi->s_mb_avg_fragment_size_root)
+ * 2) Average fragment size lists (sbi->s_mb_avg_fragment_size)
  *
- *    Locking: sbi->s_mb_rb_lock (rwlock)
+ *    Locking: sbi->s_mb_avg_fragment_size_locks(array of rw locks)
  *
- *    This is a red black tree consisting of group infos and the tree is sorted
- *    by average fragment sizes (which is calculated as ext4_group_info->bb_free
- *    / ext4_group_info->bb_fragments).
+ *    This is an array of lists where in the i-th list there are groups with
+ *    average fragment size >= 2^i and < 2^(i+1). The average fragment size
+ *    is computed as ext4_group_info->bb_free / ext4_group_info->bb_fragments.
+ *    Note that we don't bother with a special list for completely empty groups
+ *    so we only have MB_NUM_ORDERS(sb) lists.
  *
  * When "mb_optimize_scan" mount option is set, mballoc consults the above data
  * structures to decide the order in which groups are to be traversed for
@@ -160,7 +162,8 @@
  *
  * At CR = 1, we only consider groups where average fragment size > request
  * size. So, we lookup a group which has average fragment size just above or
- * equal to request size using our rb tree (data structure 2) in O(log N) time.
+ * equal to request size using our average fragment size group lists (data
+ * structure 2) in O(1) time.
  *
  * If "mb_optimize_scan" mount option is not set, mballoc traverses groups in
  * linear order which requires O(N) search time for each CR 0 and CR 1 phase.
@@ -802,65 +805,51 @@ static void ext4_mb_mark_free_simple(struct super_block *sb,
 	}
 }
 
-static void ext4_mb_rb_insert(struct rb_root *root, struct rb_node *new,
-			int (*cmp)(struct rb_node *, struct rb_node *))
+static int mb_avg_fragment_size_order(struct super_block *sb, ext4_grpblk_t len)
 {
-	struct rb_node **iter = &root->rb_node, *parent = NULL;
+	int order;
 
-	while (*iter) {
-		parent = *iter;
-		if (cmp(new, *iter) > 0)
-			iter = &((*iter)->rb_left);
-		else
-			iter = &((*iter)->rb_right);
-	}
-
-	rb_link_node(new, parent, iter);
-	rb_insert_color(new, root);
-}
-
-static int
-ext4_mb_avg_fragment_size_cmp(struct rb_node *rb1, struct rb_node *rb2)
-{
-	struct ext4_group_info *grp1 = rb_entry(rb1,
-						struct ext4_group_info,
-						bb_avg_fragment_size_rb);
-	struct ext4_group_info *grp2 = rb_entry(rb2,
-						struct ext4_group_info,
-						bb_avg_fragment_size_rb);
-	int num_frags_1, num_frags_2;
-
-	num_frags_1 = grp1->bb_fragments ?
-		grp1->bb_free / grp1->bb_fragments : 0;
-	num_frags_2 = grp2->bb_fragments ?
-		grp2->bb_free / grp2->bb_fragments : 0;
-
-	return (num_frags_2 - num_frags_1);
+	/*
+	 * We don't bother with a special lists groups with only 1 block free
+ 	 * extents and for completely empty groups.
+	 */
+	order = fls(len) - 2;
+	if (order < 0)
+		return 0;
+	if (order == MB_NUM_ORDERS(sb))
+		order--;
+	return order;
 }
 
-/*
- * Reinsert grpinfo into the avg_fragment_size tree with new average
- * fragment size.
- */
+/* Move group to appropriate avg_fragment_size list */
 static void
 mb_update_avg_fragment_size(struct super_block *sb, struct ext4_group_info *grp)
 {
 	struct ext4_sb_info *sbi = EXT4_SB(sb);
+	int new_order;
 
 	if (!test_opt2(sb, MB_OPTIMIZE_SCAN) || grp->bb_free == 0)
 		return;
 
-	write_lock(&sbi->s_mb_rb_lock);
-	if (!RB_EMPTY_NODE(&grp->bb_avg_fragment_size_rb)) {
-		rb_erase(&grp->bb_avg_fragment_size_rb,
-				&sbi->s_mb_avg_fragment_size_root);
-		RB_CLEAR_NODE(&grp->bb_avg_fragment_size_rb);
-	}
+	new_order = mb_avg_fragment_size_order(sb,
+					grp->bb_free / grp->bb_fragments);
+	if (new_order == grp->bb_avg_fragment_size_order)
+		return;
 
-	ext4_mb_rb_insert(&sbi->s_mb_avg_fragment_size_root,
-		&grp->bb_avg_fragment_size_rb,
-		ext4_mb_avg_fragment_size_cmp);
-	write_unlock(&sbi->s_mb_rb_lock);
+	if (grp->bb_avg_fragment_size_order != -1) {
+		write_lock(&sbi->s_mb_avg_fragment_size_locks[
+					grp->bb_avg_fragment_size_order]);
+		list_del(&grp->bb_avg_fragment_size_node);
+		write_unlock(&sbi->s_mb_avg_fragment_size_locks[
+					grp->bb_avg_fragment_size_order]);
+	}
+	grp->bb_avg_fragment_size_order = new_order;
+	write_lock(&sbi->s_mb_avg_fragment_size_locks[
+					grp->bb_avg_fragment_size_order]);
+	list_add_tail(&grp->bb_avg_fragment_size_node,
+		&sbi->s_mb_avg_fragment_size[grp->bb_avg_fragment_size_order]);
+	write_unlock(&sbi->s_mb_avg_fragment_size_locks[
+					grp->bb_avg_fragment_size_order]);
 }
 
 /*
@@ -909,86 +898,56 @@ static void ext4_mb_choose_next_group_cr0(struct ext4_allocation_context *ac,
 		*new_cr = 1;
 	} else {
 		*group = grp->bb_group;
-		ac->ac_last_optimal_group = *group;
 		ac->ac_flags |= EXT4_MB_CR0_OPTIMIZED;
 	}
 }
 
 /*
  * Choose next group by traversing average fragment size tree. Updates *new_cr
- * if cr lvel needs an update. Sets EXT4_MB_SEARCH_NEXT_LINEAR to indicate that
- * the linear search should continue for one iteration since there's lock
- * contention on the rb tree lock.
+ * if cr level needs an update. 
  */
 static void ext4_mb_choose_next_group_cr1(struct ext4_allocation_context *ac,
 		int *new_cr, ext4_group_t *group, ext4_group_t ngroups)
 {
 	struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
-	int avg_fragment_size, best_so_far;
-	struct rb_node *node, *found;
-	struct ext4_group_info *grp;
-
-	/*
-	 * If there is contention on the lock, instead of waiting for the lock
-	 * to become available, just continue searching lineraly. We'll resume
-	 * our rb tree search later starting at ac->ac_last_optimal_group.
-	 */
-	if (!read_trylock(&sbi->s_mb_rb_lock)) {
-		ac->ac_flags |= EXT4_MB_SEARCH_NEXT_LINEAR;
-		return;
-	}
+	struct ext4_group_info *grp, *iter;
+	int i;
 
 	if (unlikely(ac->ac_flags & EXT4_MB_CR1_OPTIMIZED)) {
 		if (sbi->s_mb_stats)
 			atomic_inc(&sbi->s_bal_cr1_bad_suggestions);
-		/* We have found something at CR 1 in the past */
-		grp = ext4_get_group_info(ac->ac_sb, ac->ac_last_optimal_group);
-		for (found = rb_next(&grp->bb_avg_fragment_size_rb); found != NULL;
-		     found = rb_next(found)) {
-			grp = rb_entry(found, struct ext4_group_info,
-				       bb_avg_fragment_size_rb);
+	}
+
+	for (i = mb_avg_fragment_size_order(ac->ac_sb, ac->ac_g_ex.fe_len);
+	     i < MB_NUM_ORDERS(ac->ac_sb); i++) {
+		if (list_empty(&sbi->s_mb_avg_fragment_size[i]))
+			continue;
+		read_lock(&sbi->s_mb_avg_fragment_size_locks[i]);
+		if (list_empty(&sbi->s_mb_avg_fragment_size[i])) {
+			read_unlock(&sbi->s_mb_largest_free_orders_locks[i]);
+			continue;
+		}
+		grp = NULL;
+		list_for_each_entry(iter, &sbi->s_mb_avg_fragment_size[i],
+				    bb_avg_fragment_size_node) {
 			if (sbi->s_mb_stats)
 				atomic64_inc(&sbi->s_bal_cX_groups_considered[1]);
-			if (likely(ext4_mb_good_group(ac, grp->bb_group, 1)))
+			if (likely(ext4_mb_good_group(ac, iter->bb_group, 1))) {
+				grp = iter;
 				break;
-		}
-		goto done;
-	}
-
-	node = sbi->s_mb_avg_fragment_size_root.rb_node;
-	best_so_far = 0;
-	found = NULL;
-
-	while (node) {
-		grp = rb_entry(node, struct ext4_group_info,
-			       bb_avg_fragment_size_rb);
-		avg_fragment_size = 0;
-		if (ext4_mb_good_group(ac, grp->bb_group, 1)) {
-			avg_fragment_size = grp->bb_fragments ?
-				grp->bb_free / grp->bb_fragments : 0;
-			if (!best_so_far || avg_fragment_size < best_so_far) {
-				best_so_far = avg_fragment_size;
-				found = node;
 			}
 		}
-		if (avg_fragment_size > ac->ac_g_ex.fe_len)
-			node = node->rb_right;
-		else
-			node = node->rb_left;
+		read_unlock(&sbi->s_mb_avg_fragment_size_locks[i]);
+		if (grp)
+			break;
 	}
 
-done:
-	if (found) {
-		grp = rb_entry(found, struct ext4_group_info,
-			       bb_avg_fragment_size_rb);
+	if (grp) {
 		*group = grp->bb_group;
 		ac->ac_flags |= EXT4_MB_CR1_OPTIMIZED;
 	} else {
 		*new_cr = 2;
 	}
-
-	read_unlock(&sbi->s_mb_rb_lock);
-	ac->ac_last_optimal_group = *group;
 }
 
 static inline int should_optimize_scan(struct ext4_allocation_context *ac)
@@ -1017,11 +976,6 @@ next_linear_group(struct ext4_allocation_context *ac, int group, int ngroups)
 		goto inc_and_return;
 	}
 
-	if (ac->ac_flags & EXT4_MB_SEARCH_NEXT_LINEAR) {
-		ac->ac_flags &= ~EXT4_MB_SEARCH_NEXT_LINEAR;
-		goto inc_and_return;
-	}
-
 	return group;
 inc_and_return:
 	/*
@@ -1152,13 +1106,13 @@ void ext4_mb_generate_buddy(struct super_block *sb,
 					EXT4_GROUP_INFO_BBITMAP_CORRUPT);
 	}
 	mb_set_largest_free_order(sb, grp);
+	mb_update_avg_fragment_size(sb, grp);
 
 	clear_bit(EXT4_GROUP_INFO_NEED_INIT_BIT, &(grp->bb_state));
 
 	period = get_cycles() - period;
 	atomic_inc(&sbi->s_mb_buddies_generated);
 	atomic64_add(period, &sbi->s_mb_generation_time);
-	mb_update_avg_fragment_size(sb, grp);
 }
 
 /* The buddy information is attached the buddy cache inode
@@ -2711,7 +2665,6 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 		 * from the goal value specified
 		 */
 		group = ac->ac_g_ex.fe_group;
-		ac->ac_last_optimal_group = group;
 		ac->ac_groups_linear_remaining = sbi->s_mb_max_linear_groups;
 		prefetch_grp = group;
 
@@ -2993,9 +2946,7 @@ __acquires(&EXT4_SB(sb)->s_mb_rb_lock)
 	struct super_block *sb = pde_data(file_inode(seq->file));
 	unsigned long position;
 
-	read_lock(&EXT4_SB(sb)->s_mb_rb_lock);
-
-	if (*pos < 0 || *pos >= MB_NUM_ORDERS(sb) + 1)
+	if (*pos < 0 || *pos >= 2*MB_NUM_ORDERS(sb) + 1)
 		return NULL;
 	position = *pos + 1;
 	return (void *) ((unsigned long) position);
@@ -3007,7 +2958,7 @@ static void *ext4_mb_seq_structs_summary_next(struct seq_file *seq, void *v, lof
 	unsigned long position;
 
 	++*pos;
-	if (*pos < 0 || *pos >= MB_NUM_ORDERS(sb) + 1)
+	if (*pos < 0 || *pos >= 2*MB_NUM_ORDERS(sb) + 1)
 		return NULL;
 	position = *pos + 1;
 	return (void *) ((unsigned long) position);
@@ -3019,29 +2970,22 @@ static int ext4_mb_seq_structs_summary_show(struct seq_file *seq, void *v)
 	struct ext4_sb_info *sbi = EXT4_SB(sb);
 	unsigned long position = ((unsigned long) v);
 	struct ext4_group_info *grp;
-	struct rb_node *n;
-	unsigned int count, min, max;
+	unsigned int count;
 
 	position--;
 	if (position >= MB_NUM_ORDERS(sb)) {
-		seq_puts(seq, "fragment_size_tree:\n");
-		n = rb_first(&sbi->s_mb_avg_fragment_size_root);
-		if (!n) {
-			seq_puts(seq, "\ttree_min: 0\n\ttree_max: 0\n\ttree_nodes: 0\n");
-			return 0;
-		}
-		grp = rb_entry(n, struct ext4_group_info, bb_avg_fragment_size_rb);
-		min = grp->bb_fragments ? grp->bb_free / grp->bb_fragments : 0;
-		count = 1;
-		while (rb_next(n)) {
-			count++;
-			n = rb_next(n);
-		}
-		grp = rb_entry(n, struct ext4_group_info, bb_avg_fragment_size_rb);
-		max = grp->bb_fragments ? grp->bb_free / grp->bb_fragments : 0;
+		if (position == MB_NUM_ORDERS(sb))
+			seq_puts(seq, "fragment_size_tree:\n");
+		position -= MB_NUM_ORDERS(sb);
 
-		seq_printf(seq, "\ttree_min: %u\n\ttree_max: %u\n\ttree_nodes: %u\n",
-			   min, max, count);
+		count = 0;
+		read_lock(&sbi->s_mb_avg_fragment_size_locks[position]);
+		list_for_each_entry(grp, &sbi->s_mb_avg_fragment_size[position],
+				    bb_avg_fragment_size_node)
+			count++;
+		read_unlock(&sbi->s_mb_avg_fragment_size_locks[position]);
+		seq_printf(seq, "\tlist_order_%u_groups: %u\n",
+		   			(unsigned int)position, count);
 		return 0;
 	}
 
@@ -3051,27 +2995,20 @@ static int ext4_mb_seq_structs_summary_show(struct seq_file *seq, void *v)
 		seq_puts(seq, "max_free_order_lists:\n");
 	}
 	count = 0;
+	read_lock(&sbi->s_mb_largest_free_orders_locks[position]);
 	list_for_each_entry(grp, &sbi->s_mb_largest_free_orders[position],
 			    bb_largest_free_order_node)
 		count++;
+	read_unlock(&sbi->s_mb_largest_free_orders_locks[position]);
 	seq_printf(seq, "\tlist_order_%u_groups: %u\n",
 		   (unsigned int)position, count);
 
 	return 0;
 }
 
-static void ext4_mb_seq_structs_summary_stop(struct seq_file *seq, void *v)
-__releases(&EXT4_SB(sb)->s_mb_rb_lock)
-{
-	struct super_block *sb = pde_data(file_inode(seq->file));
-
-	read_unlock(&EXT4_SB(sb)->s_mb_rb_lock);
-}
-
 const struct seq_operations ext4_mb_seq_structs_summary_ops = {
 	.start  = ext4_mb_seq_structs_summary_start,
 	.next   = ext4_mb_seq_structs_summary_next,
-	.stop   = ext4_mb_seq_structs_summary_stop,
 	.show   = ext4_mb_seq_structs_summary_show,
 };
 
@@ -3178,8 +3115,9 @@ int ext4_mb_add_groupinfo(struct super_block *sb, ext4_group_t group,
 	init_rwsem(&meta_group_info[i]->alloc_sem);
 	meta_group_info[i]->bb_free_root = RB_ROOT;
 	INIT_LIST_HEAD(&meta_group_info[i]->bb_largest_free_order_node);
-	RB_CLEAR_NODE(&meta_group_info[i]->bb_avg_fragment_size_rb);
+	INIT_LIST_HEAD(&meta_group_info[i]->bb_avg_fragment_size_node);
 	meta_group_info[i]->bb_largest_free_order = -1;  /* uninit */
+	meta_group_info[i]->bb_avg_fragment_size_order = -1;  /* uninit */
 	meta_group_info[i]->bb_group = group;
 
 	mb_group_bb_bitmap_alloc(sb, meta_group_info[i], group);
@@ -3428,7 +3366,24 @@ int ext4_mb_init(struct super_block *sb)
 		i++;
 	} while (i < MB_NUM_ORDERS(sb));
 
-	sbi->s_mb_avg_fragment_size_root = RB_ROOT;
+	sbi->s_mb_avg_fragment_size =
+		kmalloc_array(MB_NUM_ORDERS(sb), sizeof(struct list_head),
+			GFP_KERNEL);
+	if (!sbi->s_mb_avg_fragment_size) {
+		ret = -ENOMEM;
+		goto out;
+	}
+	sbi->s_mb_avg_fragment_size_locks =
+		kmalloc_array(MB_NUM_ORDERS(sb), sizeof(rwlock_t),
+			GFP_KERNEL);
+	if (!sbi->s_mb_avg_fragment_size_locks) {
+		ret = -ENOMEM;
+		goto out;
+	}
+	for (i = 0; i < MB_NUM_ORDERS(sb); i++) {
+		INIT_LIST_HEAD(&sbi->s_mb_avg_fragment_size[i]);
+		rwlock_init(&sbi->s_mb_avg_fragment_size_locks[i]);
+	}
 	sbi->s_mb_largest_free_orders =
 		kmalloc_array(MB_NUM_ORDERS(sb), sizeof(struct list_head),
 			GFP_KERNEL);
@@ -3447,7 +3402,6 @@ int ext4_mb_init(struct super_block *sb)
 		INIT_LIST_HEAD(&sbi->s_mb_largest_free_orders[i]);
 		rwlock_init(&sbi->s_mb_largest_free_orders_locks[i]);
 	}
-	rwlock_init(&sbi->s_mb_rb_lock);
 
 	spin_lock_init(&sbi->s_md_lock);
 	sbi->s_mb_free_pending = 0;
@@ -3518,6 +3472,8 @@ int ext4_mb_init(struct super_block *sb)
 	free_percpu(sbi->s_locality_groups);
 	sbi->s_locality_groups = NULL;
 out:
+	kfree(sbi->s_mb_avg_fragment_size);
+	kfree(sbi->s_mb_avg_fragment_size_locks);
 	kfree(sbi->s_mb_largest_free_orders);
 	kfree(sbi->s_mb_largest_free_orders_locks);
 	kfree(sbi->s_mb_offsets);
@@ -3584,6 +3540,8 @@ int ext4_mb_release(struct super_block *sb)
 		kvfree(group_info);
 		rcu_read_unlock();
 	}
+	kfree(sbi->s_mb_avg_fragment_size);
+	kfree(sbi->s_mb_avg_fragment_size_locks);
 	kfree(sbi->s_mb_largest_free_orders);
 	kfree(sbi->s_mb_largest_free_orders_locks);
 	kfree(sbi->s_mb_offsets);
diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h
index 39da92ceabf8..dcda2a943cee 100644
--- a/fs/ext4/mballoc.h
+++ b/fs/ext4/mballoc.h
@@ -178,7 +178,6 @@ struct ext4_allocation_context {
 	/* copy of the best found extent taken before preallocation efforts */
 	struct ext4_free_extent ac_f_ex;
 
-	ext4_group_t ac_last_optimal_group;
 	__u32 ac_groups_considered;
 	__u32 ac_flags;		/* allocation hints */
 	__u16 ac_groups_scanned;
-- 
2.35.3



* Re: [PATCH 0/5 v2] ext4: Fix performance regression with mballoc
  2022-09-06 15:29 [PATCH 0/5 v2] ext4: Fix performance regression with mballoc Jan Kara
                   ` (4 preceding siblings ...)
  2022-09-06 15:29 ` [PATCH 5/5] ext4: Use buckets for cr 1 block scan instead of rbtree Jan Kara
@ 2022-09-06 20:38 ` Stefan Wahren
  2022-09-07 10:49   ` Jan Kara
  2022-09-08  8:17 ` Ojaswin Mujoo
  6 siblings, 1 reply; 27+ messages in thread
From: Stefan Wahren @ 2022-09-06 20:38 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-ext4, Thorsten Leemhuis, Ted Tso, Ojaswin Mujoo, Andreas Dilger

Hi Jan,

Am 06.09.22 um 17:29 schrieb Jan Kara:
> Hello,
>
> Here is a second version of my mballoc improvements to avoid spreading
> allocations with mb_optimize_scan=1. The patches fix the performance
> regression I was able to reproduce with reaim on my test machine:
>
>                       mb_optimize_scan=0     mb_optimize_scan=1     patched
> Hmean     disk-1       2076.12 (   0.00%)     2099.37 (   1.12%)     2032.52 (  -2.10%)
> Hmean     disk-41     92481.20 (   0.00%)    83787.47 *  -9.40%*    90308.37 (  -2.35%)
> Hmean     disk-81    155073.39 (   0.00%)   135527.05 * -12.60%*   154285.71 (  -0.51%)
> Hmean     disk-121   185109.64 (   0.00%)   166284.93 * -10.17%*   185298.62 (   0.10%)
> Hmean     disk-161   229890.53 (   0.00%)   207563.39 *  -9.71%*   232883.32 *   1.30%*
> Hmean     disk-201   223333.33 (   0.00%)   203235.59 *  -9.00%*   221446.93 (  -0.84%)
> Hmean     disk-241   235735.25 (   0.00%)   217705.51 *  -7.65%*   239483.27 *   1.59%*
> Hmean     disk-281   266772.15 (   0.00%)   241132.72 *  -9.61%*   263108.62 (  -1.37%)
> Hmean     disk-321   265435.50 (   0.00%)   245412.84 *  -7.54%*   267277.27 (   0.69%)
>
> The changes also significanly reduce spreading of allocations for small /
> moderately sized files. I'm not able to measure a performance difference
> resulting from this but on eMMC storage this seems to be the main culprit
> of reduced performance. Untarring of raspberry-pi archive touches following
> numbers of groups:
>
> 	mb_optimize_scan=0	mb_optimize_scan=1	patched
> groups	4			22			7
>
> To achieve this I have added two more changes on top of v1 - patches 4 and 5.
> Patch 4 makes sure we use locality group preallocation even for files that are
> not likely to grow anymore (previously we have disabled all preallocations for
> such files, however locality group preallocation still makes a lot of sense for
> such files). This patch reduced spread of a small file allocations but larger
> file allocations were still spread significantly because they avoid locality
> group preallocation and as they are not power-of-two in size, they also
> immediately start with cr=1 scan. To address that I've changed the data
> structure for looking up the best block group to allocate from (see patch 5
> for details).
>
> Stefan, can you please test whether these patches fix the problem for you as
> well? Comments & review welcome.

this looks amazing \o/

With this patch series (v2) applied, the untar with mb_optimize_scan=1 is now
faster than with mb_optimize_scan=0.

mb_optimize_scan=0 -> almost 5 minutes

mb_optimize_scan=1 -> almost 1 minute

The original scenario (firmware download) with mb_optimize_scan=1 is now as
fast as with mb_optimize_scan=0.

Here the iostat as usual:

https://github.com/lategoodbye/mb_optimize_scan_regress/commit/f4ad188e0feee60bffa23a8e1ad254544768c3bd

There is just one thing, but I'm not sure if this comes from these
patches. If I call

cat /proc/fs/ext4/mmcblk1p2/mb_structs_summary

The kernel hits a NULL pointer dereference in
ext4_mb_seq_structs_summary_show

Best regards

>
> 								Honza
> Previous versions:
> Link: http://lore.kernel.org/r/20220823134508.27854-1-jack@suse.cz # v1


* Re: [PATCH 5/5] ext4: Use buckets for cr 1 block scan instead of rbtree
  2022-09-06 15:29 ` [PATCH 5/5] ext4: Use buckets for cr 1 block scan instead of rbtree Jan Kara
  2022-09-08 11:00     ` Dan Carpenter
@ 2022-09-08 11:00     ` Dan Carpenter
  2022-09-08  8:29   ` Ojaswin Mujoo
  2 siblings, 0 replies; 27+ messages in thread
From: kernel test robot @ 2022-09-07  4:51 UTC (permalink / raw)
  To: kbuild

[-- Attachment #1: Type: text/plain, Size: 5765 bytes --]

BCC: lkp@intel.com
CC: kbuild-all@lists.01.org
In-Reply-To: <20220906152920.25584-5-jack@suse.cz>
References: <20220906152920.25584-5-jack@suse.cz>
TO: Jan Kara <jack@suse.cz>
TO: Ted Tso <tytso@mit.edu>
CC: linux-ext4@vger.kernel.org
CC: Thorsten Leemhuis <regressions@leemhuis.info>
CC: Ojaswin Mujoo <ojaswin@linux.ibm.com>
CC: Stefan Wahren <stefan.wahren@i2se.com>
CC: Andreas Dilger <adilger.kernel@dilger.ca>
CC: Jan Kara <jack@suse.cz>

Hi Jan,

I love your patch! Perhaps something to improve:

[auto build test WARNING on linus/master]
[also build test WARNING on v6.0-rc4 next-20220906]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Jan-Kara/ext4-Fix-performance-regression-with-mballoc/20220907-000945
base:   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 53e99dcff61e1523ec1c3628b2d564ba15d32eb7
:::::: branch date: 13 hours ago
:::::: commit date: 13 hours ago
config: m68k-randconfig-m041-20220906 (https://download.01.org/0day-ci/archive/20220907/202209071206.u1iHKVzB-lkp(a)intel.com/config)
compiler: m68k-linux-gcc (GCC) 12.1.0

If you fix the issue, kindly add following tag where applicable
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>

New smatch warnings:
fs/ext4/mballoc.c:945 ext4_mb_choose_next_group_cr1() error: uninitialized symbol 'grp'.

Old smatch warnings:
arch/m68k/include/asm/bitops.h:266 arch___test_and_clear_bit() warn: signedness bug returning '(-128)'
fs/ext4/mballoc.c:3960 ext4_mb_mark_bb() error: uninitialized symbol 'err'.

vim +/grp +945 fs/ext4/mballoc.c

196e402adf2e4c Harshad Shirwadkar 2021-04-01  904  
196e402adf2e4c Harshad Shirwadkar 2021-04-01  905  /*
196e402adf2e4c Harshad Shirwadkar 2021-04-01  906   * Choose next group by traversing average fragment size tree. Updates *new_cr
31b571b608cf66 Jan Kara           2022-09-06  907   * if cr level needs an update. 
196e402adf2e4c Harshad Shirwadkar 2021-04-01  908   */
196e402adf2e4c Harshad Shirwadkar 2021-04-01  909  static void ext4_mb_choose_next_group_cr1(struct ext4_allocation_context *ac,
196e402adf2e4c Harshad Shirwadkar 2021-04-01  910  		int *new_cr, ext4_group_t *group, ext4_group_t ngroups)
196e402adf2e4c Harshad Shirwadkar 2021-04-01  911  {
196e402adf2e4c Harshad Shirwadkar 2021-04-01  912  	struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
31b571b608cf66 Jan Kara           2022-09-06  913  	struct ext4_group_info *grp, *iter;
31b571b608cf66 Jan Kara           2022-09-06  914  	int i;
196e402adf2e4c Harshad Shirwadkar 2021-04-01  915  
196e402adf2e4c Harshad Shirwadkar 2021-04-01  916  	if (unlikely(ac->ac_flags & EXT4_MB_CR1_OPTIMIZED)) {
196e402adf2e4c Harshad Shirwadkar 2021-04-01  917  		if (sbi->s_mb_stats)
196e402adf2e4c Harshad Shirwadkar 2021-04-01  918  			atomic_inc(&sbi->s_bal_cr1_bad_suggestions);
31b571b608cf66 Jan Kara           2022-09-06  919  	}
31b571b608cf66 Jan Kara           2022-09-06  920  
31b571b608cf66 Jan Kara           2022-09-06  921  	for (i = mb_avg_fragment_size_order(ac->ac_sb, ac->ac_g_ex.fe_len);
31b571b608cf66 Jan Kara           2022-09-06  922  	     i < MB_NUM_ORDERS(ac->ac_sb); i++) {
31b571b608cf66 Jan Kara           2022-09-06  923  		if (list_empty(&sbi->s_mb_avg_fragment_size[i]))
31b571b608cf66 Jan Kara           2022-09-06  924  			continue;
31b571b608cf66 Jan Kara           2022-09-06  925  		read_lock(&sbi->s_mb_avg_fragment_size_locks[i]);
31b571b608cf66 Jan Kara           2022-09-06  926  		if (list_empty(&sbi->s_mb_avg_fragment_size[i])) {
31b571b608cf66 Jan Kara           2022-09-06  927  			read_unlock(&sbi->s_mb_largest_free_orders_locks[i]);
31b571b608cf66 Jan Kara           2022-09-06  928  			continue;
31b571b608cf66 Jan Kara           2022-09-06  929  		}
31b571b608cf66 Jan Kara           2022-09-06  930  		grp = NULL;
31b571b608cf66 Jan Kara           2022-09-06  931  		list_for_each_entry(iter, &sbi->s_mb_avg_fragment_size[i],
31b571b608cf66 Jan Kara           2022-09-06  932  				    bb_avg_fragment_size_node) {
196e402adf2e4c Harshad Shirwadkar 2021-04-01  933  			if (sbi->s_mb_stats)
196e402adf2e4c Harshad Shirwadkar 2021-04-01  934  				atomic64_inc(&sbi->s_bal_cX_groups_considered[1]);
31b571b608cf66 Jan Kara           2022-09-06  935  			if (likely(ext4_mb_good_group(ac, iter->bb_group, 1))) {
31b571b608cf66 Jan Kara           2022-09-06  936  				grp = iter;
196e402adf2e4c Harshad Shirwadkar 2021-04-01  937  				break;
196e402adf2e4c Harshad Shirwadkar 2021-04-01  938  			}
196e402adf2e4c Harshad Shirwadkar 2021-04-01  939  		}
31b571b608cf66 Jan Kara           2022-09-06  940  		read_unlock(&sbi->s_mb_avg_fragment_size_locks[i]);
31b571b608cf66 Jan Kara           2022-09-06  941  		if (grp)
31b571b608cf66 Jan Kara           2022-09-06  942  			break;
196e402adf2e4c Harshad Shirwadkar 2021-04-01  943  	}
196e402adf2e4c Harshad Shirwadkar 2021-04-01  944  
31b571b608cf66 Jan Kara           2022-09-06 @945  	if (grp) {
196e402adf2e4c Harshad Shirwadkar 2021-04-01  946  		*group = grp->bb_group;
196e402adf2e4c Harshad Shirwadkar 2021-04-01  947  		ac->ac_flags |= EXT4_MB_CR1_OPTIMIZED;
196e402adf2e4c Harshad Shirwadkar 2021-04-01  948  	} else {
196e402adf2e4c Harshad Shirwadkar 2021-04-01  949  		*new_cr = 2;
196e402adf2e4c Harshad Shirwadkar 2021-04-01  950  	}
196e402adf2e4c Harshad Shirwadkar 2021-04-01  951  }
196e402adf2e4c Harshad Shirwadkar 2021-04-01  952  

-- 
0-DAY CI Kernel Test Service
https://01.org/lkp


* Re: [PATCH 0/5 v2] ext4: Fix performance regression with mballoc
  2022-09-06 20:38 ` [PATCH 0/5 v2] ext4: Fix performance regression with mballoc Stefan Wahren
@ 2022-09-07 10:49   ` Jan Kara
  2022-09-07 13:02     ` Jan Kara
  0 siblings, 1 reply; 27+ messages in thread
From: Jan Kara @ 2022-09-07 10:49 UTC (permalink / raw)
  To: Stefan Wahren
  Cc: Jan Kara, linux-ext4, Thorsten Leemhuis, Ted Tso, Ojaswin Mujoo,
	Andreas Dilger

Hi Stefan!

On Tue 06-09-22 22:38:10, Stefan Wahren wrote:
> Am 06.09.22 um 17:29 schrieb Jan Kara:
> > Hello,
> > 
> > Here is a second version of my mballoc improvements to avoid spreading
> > allocations with mb_optimize_scan=1. The patches fix the performance
> > regression I was able to reproduce with reaim on my test machine:
> > 
> >                       mb_optimize_scan=0     mb_optimize_scan=1     patched
> > Hmean     disk-1       2076.12 (   0.00%)     2099.37 (   1.12%)     2032.52 (  -2.10%)
> > Hmean     disk-41     92481.20 (   0.00%)    83787.47 *  -9.40%*    90308.37 (  -2.35%)
> > Hmean     disk-81    155073.39 (   0.00%)   135527.05 * -12.60%*   154285.71 (  -0.51%)
> > Hmean     disk-121   185109.64 (   0.00%)   166284.93 * -10.17%*   185298.62 (   0.10%)
> > Hmean     disk-161   229890.53 (   0.00%)   207563.39 *  -9.71%*   232883.32 *   1.30%*
> > Hmean     disk-201   223333.33 (   0.00%)   203235.59 *  -9.00%*   221446.93 (  -0.84%)
> > Hmean     disk-241   235735.25 (   0.00%)   217705.51 *  -7.65%*   239483.27 *   1.59%*
> > Hmean     disk-281   266772.15 (   0.00%)   241132.72 *  -9.61%*   263108.62 (  -1.37%)
> > Hmean     disk-321   265435.50 (   0.00%)   245412.84 *  -7.54%*   267277.27 (   0.69%)
> > 
> > The changes also significanly reduce spreading of allocations for small /
> > moderately sized files. I'm not able to measure a performance difference
> > resulting from this but on eMMC storage this seems to be the main culprit
> > of reduced performance. Untarring of raspberry-pi archive touches following
> > numbers of groups:
> > 
> > 	mb_optimize_scan=0	mb_optimize_scan=1	patched
> > groups	4			22			7
> > 
> > To achieve this I have added two more changes on top of v1 - patches 4 and 5.
> > Patch 4 makes sure we use locality group preallocation even for files that are
> > not likely to grow anymore (previously we have disabled all preallocations for
> > such files, however locality group preallocation still makes a lot of sense for
> > such files). This patch reduced spread of a small file allocations but larger
> > file allocations were still spread significantly because they avoid locality
> > group preallocation and as they are not power-of-two in size, they also
> > immediately start with cr=1 scan. To address that I've changed the data
> > structure for looking up the best block group to allocate from (see patch 5
> > for details).
> > 
> > Stefan, can you please test whether these patches fix the problem for you as
> > well? Comments & review welcome.
> 
> this looks amazing \o/
> 
> With this patch v2 applied the untar with mb_optimize_scan=1 is now faster
> than mb_optimize_scan=0.
> 
> mb_optimize_scan=0 -> almost 5 minutes
> 
> mb_optimize_scan=1 -> almost 1 minute
> 
> The original scenario (firmware download) with mb_optimize_scan=1 is now
> fast as mb_optimize_scan=0.

Glad to hear that!

> Here the iostat as usual:
> 
> https://github.com/lategoodbye/mb_optimize_scan_regress/commit/f4ad188e0feee60bffa23a8e1ad254544768c3bd
> 
> There is just one thing, but not sure this if this comes from these patches.
> If i call
> 
> cat /proc/fs/ext4/mmcblk1p2/mb_structs_summary
> 
> The kernel throw a NULL pointer derefence in
> ext4_mb_seq_structs_summary_show

Yeah, likely a bug in my last patch. I didn't test my changes to the sysfs
interface. Thanks for testing this, I'll have a look.

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH 0/5 v2] ext4: Fix performance regression with mballoc
  2022-09-07 10:49   ` Jan Kara
@ 2022-09-07 13:02     ` Jan Kara
  0 siblings, 0 replies; 27+ messages in thread
From: Jan Kara @ 2022-09-07 13:02 UTC (permalink / raw)
  To: Stefan Wahren
  Cc: Jan Kara, linux-ext4, Thorsten Leemhuis, Ted Tso, Ojaswin Mujoo,
	Andreas Dilger

[-- Attachment #1: Type: text/plain, Size: 891 bytes --]

On Wed 07-09-22 12:49:47, Jan Kara wrote:
> > Here the iostat as usual:
> > 
> > https://github.com/lategoodbye/mb_optimize_scan_regress/commit/f4ad188e0feee60bffa23a8e1ad254544768c3bd
> > 
> > There is just one thing, but not sure this if this comes from these patches.
> > If i call
> > 
> > cat /proc/fs/ext4/mmcblk1p2/mb_structs_summary
> > 
> > The kernel throw a NULL pointer derefence in
> > ext4_mb_seq_structs_summary_show
> 
> Yeah, likely a bug in my last patch. I didn't test my changes to the sysfs
> interface. Thanks for testing this, I'll have a look.

Indeed the procfs interface update was buggy. Attached is a new version of
patch 5 that fixes up the interface (and a couple of whitespace issues) in
the previous version. I will resend the full series, but I'll wait some more
time to see whether there are any other comments.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

[-- Attachment #2: 0005-ext4-Use-buckets-for-cr-1-block-scan-instead-of-rbtr.patch --]
[-- Type: text/x-patch, Size: 18781 bytes --]

From 5bfd6973184e8ab7180cfebdf0e71e6dc9b7af7a Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@suse.cz>
Date: Fri, 2 Sep 2022 18:28:30 +0200
Subject: [PATCH 5/5] ext4: Use buckets for cr 1 block scan instead of rbtree

Using an rbtree for sorting groups by average fragment size is relatively
expensive (it needs an rbtree update on every block freeing or allocation)
and leads to wide spreading of allocations because the selection of the
block group is very sensitive both to changes in free space and to the
amount of blocks allocated. Furthermore, selecting the group with the best
matching average fragment size is not necessary anyway, even more so because
the variability of fragment sizes within a group is likely large, so the
average is not telling much. We just need a group with a large enough
average fragment size so that we have a high probability of finding a large
enough free extent, and we don't want the average fragment size to be too
big, so that we are likely to find a free extent only somewhat larger than
what we need.

So instead of maintaining an rbtree of groups sorted by fragment size, keep
bins (lists) of groups where the average fragment size is in the interval
[2^i, 2^(i+1)). This structure requires fewer updates on block allocation /
freeing, generally avoids chaotic spreading of allocations into block
groups, and is still able to quickly (even faster than the rbtree) provide a
block group which is likely to have a suitably sized free space extent.

This patch reduces the number of block groups used when untarring an
archive with medium sized files (size somewhat above 64k, which is the
default mballoc limit for avoiding locality group preallocation) to about
half and thus improves write speeds for eMMC flash significantly.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/ext4.h    |  10 +-
 fs/ext4/mballoc.c | 247 ++++++++++++++++++++--------------------------
 fs/ext4/mballoc.h |   1 -
 3 files changed, 110 insertions(+), 148 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 9bca5565547b..3bf9a6926798 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -167,8 +167,6 @@ enum SHIFT_DIRECTION {
 #define EXT4_MB_CR0_OPTIMIZED		0x8000
 /* Avg fragment size rb tree lookup succeeded at least once for cr = 1 */
 #define EXT4_MB_CR1_OPTIMIZED		0x00010000
-/* Perform linear traversal for one group */
-#define EXT4_MB_SEARCH_NEXT_LINEAR	0x00020000
 struct ext4_allocation_request {
 	/* target inode for block we're allocating */
 	struct inode *inode;
@@ -1600,8 +1598,8 @@ struct ext4_sb_info {
 	struct list_head s_discard_list;
 	struct work_struct s_discard_work;
 	atomic_t s_retry_alloc_pending;
-	struct rb_root s_mb_avg_fragment_size_root;
-	rwlock_t s_mb_rb_lock;
+	struct list_head *s_mb_avg_fragment_size;
+	rwlock_t *s_mb_avg_fragment_size_locks;
 	struct list_head *s_mb_largest_free_orders;
 	rwlock_t *s_mb_largest_free_orders_locks;
 
@@ -3413,6 +3411,8 @@ struct ext4_group_info {
 	ext4_grpblk_t	bb_first_free;	/* first free block */
 	ext4_grpblk_t	bb_free;	/* total free blocks */
 	ext4_grpblk_t	bb_fragments;	/* nr of freespace fragments */
+	int		bb_avg_fragment_size_order;	/* order of average
+							   fragment in BG */
 	ext4_grpblk_t	bb_largest_free_order;/* order of largest frag in BG */
 	ext4_group_t	bb_group;	/* Group number */
 	struct          list_head bb_prealloc_list;
@@ -3420,7 +3420,7 @@ struct ext4_group_info {
 	void            *bb_bitmap;
 #endif
 	struct rw_semaphore alloc_sem;
-	struct rb_node	bb_avg_fragment_size_rb;
+	struct list_head bb_avg_fragment_size_node;
 	struct list_head bb_largest_free_order_node;
 	ext4_grpblk_t	bb_counters[];	/* Nr of free power-of-two-block
 					 * regions, index is order.
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index af1e49c3603f..3cf1de0f8f68 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -140,13 +140,15 @@
  *    number of buddy bitmap orders possible) number of lists. Group-infos are
  *    placed in appropriate lists.
  *
- * 2) Average fragment size rb tree (sbi->s_mb_avg_fragment_size_root)
+ * 2) Average fragment size lists (sbi->s_mb_avg_fragment_size)
  *
- *    Locking: sbi->s_mb_rb_lock (rwlock)
+ *    Locking: sbi->s_mb_avg_fragment_size_locks(array of rw locks)
  *
- *    This is a red black tree consisting of group infos and the tree is sorted
- *    by average fragment sizes (which is calculated as ext4_group_info->bb_free
- *    / ext4_group_info->bb_fragments).
+ *    This is an array of lists where in the i-th list there are groups with
+ *    average fragment size >= 2^i and < 2^(i+1). The average fragment size
+ *    is computed as ext4_group_info->bb_free / ext4_group_info->bb_fragments.
+ *    Note that we don't bother with a special list for completely empty groups
+ *    so we only have MB_NUM_ORDERS(sb) lists.
  *
  * When "mb_optimize_scan" mount option is set, mballoc consults the above data
  * structures to decide the order in which groups are to be traversed for
@@ -160,7 +162,8 @@
  *
  * At CR = 1, we only consider groups where average fragment size > request
  * size. So, we lookup a group which has average fragment size just above or
- * equal to request size using our rb tree (data structure 2) in O(log N) time.
+ * equal to request size using our average fragment size group lists (data
+ * structure 2) in O(1) time.
  *
  * If "mb_optimize_scan" mount option is not set, mballoc traverses groups in
  * linear order which requires O(N) search time for each CR 0 and CR 1 phase.
@@ -802,65 +805,51 @@ static void ext4_mb_mark_free_simple(struct super_block *sb,
 	}
 }
 
-static void ext4_mb_rb_insert(struct rb_root *root, struct rb_node *new,
-			int (*cmp)(struct rb_node *, struct rb_node *))
+static int mb_avg_fragment_size_order(struct super_block *sb, ext4_grpblk_t len)
 {
-	struct rb_node **iter = &root->rb_node, *parent = NULL;
+	int order;
 
-	while (*iter) {
-		parent = *iter;
-		if (cmp(new, *iter) > 0)
-			iter = &((*iter)->rb_left);
-		else
-			iter = &((*iter)->rb_right);
-	}
-
-	rb_link_node(new, parent, iter);
-	rb_insert_color(new, root);
-}
-
-static int
-ext4_mb_avg_fragment_size_cmp(struct rb_node *rb1, struct rb_node *rb2)
-{
-	struct ext4_group_info *grp1 = rb_entry(rb1,
-						struct ext4_group_info,
-						bb_avg_fragment_size_rb);
-	struct ext4_group_info *grp2 = rb_entry(rb2,
-						struct ext4_group_info,
-						bb_avg_fragment_size_rb);
-	int num_frags_1, num_frags_2;
-
-	num_frags_1 = grp1->bb_fragments ?
-		grp1->bb_free / grp1->bb_fragments : 0;
-	num_frags_2 = grp2->bb_fragments ?
-		grp2->bb_free / grp2->bb_fragments : 0;
-
-	return (num_frags_2 - num_frags_1);
+	/*
+	 * We don't bother with a special lists groups with only 1 block free
+	 * extents and for completely empty groups.
+	 */
+	order = fls(len) - 2;
+	if (order < 0)
+		return 0;
+	if (order == MB_NUM_ORDERS(sb))
+		order--;
+	return order;
 }
 
-/*
- * Reinsert grpinfo into the avg_fragment_size tree with new average
- * fragment size.
- */
+/* Move group to appropriate avg_fragment_size list */
 static void
 mb_update_avg_fragment_size(struct super_block *sb, struct ext4_group_info *grp)
 {
 	struct ext4_sb_info *sbi = EXT4_SB(sb);
+	int new_order;
 
 	if (!test_opt2(sb, MB_OPTIMIZE_SCAN) || grp->bb_free == 0)
 		return;
 
-	write_lock(&sbi->s_mb_rb_lock);
-	if (!RB_EMPTY_NODE(&grp->bb_avg_fragment_size_rb)) {
-		rb_erase(&grp->bb_avg_fragment_size_rb,
-				&sbi->s_mb_avg_fragment_size_root);
-		RB_CLEAR_NODE(&grp->bb_avg_fragment_size_rb);
-	}
+	new_order = mb_avg_fragment_size_order(sb,
+					grp->bb_free / grp->bb_fragments);
+	if (new_order == grp->bb_avg_fragment_size_order)
+		return;
 
-	ext4_mb_rb_insert(&sbi->s_mb_avg_fragment_size_root,
-		&grp->bb_avg_fragment_size_rb,
-		ext4_mb_avg_fragment_size_cmp);
-	write_unlock(&sbi->s_mb_rb_lock);
+	if (grp->bb_avg_fragment_size_order != -1) {
+		write_lock(&sbi->s_mb_avg_fragment_size_locks[
+					grp->bb_avg_fragment_size_order]);
+		list_del(&grp->bb_avg_fragment_size_node);
+		write_unlock(&sbi->s_mb_avg_fragment_size_locks[
+					grp->bb_avg_fragment_size_order]);
+	}
+	grp->bb_avg_fragment_size_order = new_order;
+	write_lock(&sbi->s_mb_avg_fragment_size_locks[
+					grp->bb_avg_fragment_size_order]);
+	list_add_tail(&grp->bb_avg_fragment_size_node,
+		&sbi->s_mb_avg_fragment_size[grp->bb_avg_fragment_size_order]);
+	write_unlock(&sbi->s_mb_avg_fragment_size_locks[
+					grp->bb_avg_fragment_size_order]);
 }
 
 /*
@@ -909,86 +898,56 @@ static void ext4_mb_choose_next_group_cr0(struct ext4_allocation_context *ac,
 		*new_cr = 1;
 	} else {
 		*group = grp->bb_group;
-		ac->ac_last_optimal_group = *group;
 		ac->ac_flags |= EXT4_MB_CR0_OPTIMIZED;
 	}
 }
 
 /*
  * Choose next group by traversing average fragment size tree. Updates *new_cr
- * if cr lvel needs an update. Sets EXT4_MB_SEARCH_NEXT_LINEAR to indicate that
- * the linear search should continue for one iteration since there's lock
- * contention on the rb tree lock.
+ * if cr level needs an update.
  */
 static void ext4_mb_choose_next_group_cr1(struct ext4_allocation_context *ac,
 		int *new_cr, ext4_group_t *group, ext4_group_t ngroups)
 {
 	struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
-	int avg_fragment_size, best_so_far;
-	struct rb_node *node, *found;
-	struct ext4_group_info *grp;
-
-	/*
-	 * If there is contention on the lock, instead of waiting for the lock
-	 * to become available, just continue searching lineraly. We'll resume
-	 * our rb tree search later starting at ac->ac_last_optimal_group.
-	 */
-	if (!read_trylock(&sbi->s_mb_rb_lock)) {
-		ac->ac_flags |= EXT4_MB_SEARCH_NEXT_LINEAR;
-		return;
-	}
+	struct ext4_group_info *grp, *iter;
+	int i;
 
 	if (unlikely(ac->ac_flags & EXT4_MB_CR1_OPTIMIZED)) {
 		if (sbi->s_mb_stats)
 			atomic_inc(&sbi->s_bal_cr1_bad_suggestions);
-		/* We have found something at CR 1 in the past */
-		grp = ext4_get_group_info(ac->ac_sb, ac->ac_last_optimal_group);
-		for (found = rb_next(&grp->bb_avg_fragment_size_rb); found != NULL;
-		     found = rb_next(found)) {
-			grp = rb_entry(found, struct ext4_group_info,
-				       bb_avg_fragment_size_rb);
+	}
+
+	for (i = mb_avg_fragment_size_order(ac->ac_sb, ac->ac_g_ex.fe_len);
+	     i < MB_NUM_ORDERS(ac->ac_sb); i++) {
+		if (list_empty(&sbi->s_mb_avg_fragment_size[i]))
+			continue;
+		read_lock(&sbi->s_mb_avg_fragment_size_locks[i]);
+		if (list_empty(&sbi->s_mb_avg_fragment_size[i])) {
+			read_unlock(&sbi->s_mb_largest_free_orders_locks[i]);
+			continue;
+		}
+		grp = NULL;
+		list_for_each_entry(iter, &sbi->s_mb_avg_fragment_size[i],
+				    bb_avg_fragment_size_node) {
 			if (sbi->s_mb_stats)
 				atomic64_inc(&sbi->s_bal_cX_groups_considered[1]);
-			if (likely(ext4_mb_good_group(ac, grp->bb_group, 1)))
+			if (likely(ext4_mb_good_group(ac, iter->bb_group, 1))) {
+				grp = iter;
 				break;
-		}
-		goto done;
-	}
-
-	node = sbi->s_mb_avg_fragment_size_root.rb_node;
-	best_so_far = 0;
-	found = NULL;
-
-	while (node) {
-		grp = rb_entry(node, struct ext4_group_info,
-			       bb_avg_fragment_size_rb);
-		avg_fragment_size = 0;
-		if (ext4_mb_good_group(ac, grp->bb_group, 1)) {
-			avg_fragment_size = grp->bb_fragments ?
-				grp->bb_free / grp->bb_fragments : 0;
-			if (!best_so_far || avg_fragment_size < best_so_far) {
-				best_so_far = avg_fragment_size;
-				found = node;
 			}
 		}
-		if (avg_fragment_size > ac->ac_g_ex.fe_len)
-			node = node->rb_right;
-		else
-			node = node->rb_left;
+		read_unlock(&sbi->s_mb_avg_fragment_size_locks[i]);
+		if (grp)
+			break;
 	}
 
-done:
-	if (found) {
-		grp = rb_entry(found, struct ext4_group_info,
-			       bb_avg_fragment_size_rb);
+	if (grp) {
 		*group = grp->bb_group;
 		ac->ac_flags |= EXT4_MB_CR1_OPTIMIZED;
 	} else {
 		*new_cr = 2;
 	}
-
-	read_unlock(&sbi->s_mb_rb_lock);
-	ac->ac_last_optimal_group = *group;
 }
 
 static inline int should_optimize_scan(struct ext4_allocation_context *ac)
@@ -1017,11 +976,6 @@ next_linear_group(struct ext4_allocation_context *ac, int group, int ngroups)
 		goto inc_and_return;
 	}
 
-	if (ac->ac_flags & EXT4_MB_SEARCH_NEXT_LINEAR) {
-		ac->ac_flags &= ~EXT4_MB_SEARCH_NEXT_LINEAR;
-		goto inc_and_return;
-	}
-
 	return group;
 inc_and_return:
 	/*
@@ -1152,13 +1106,13 @@ void ext4_mb_generate_buddy(struct super_block *sb,
 					EXT4_GROUP_INFO_BBITMAP_CORRUPT);
 	}
 	mb_set_largest_free_order(sb, grp);
+	mb_update_avg_fragment_size(sb, grp);
 
 	clear_bit(EXT4_GROUP_INFO_NEED_INIT_BIT, &(grp->bb_state));
 
 	period = get_cycles() - period;
 	atomic_inc(&sbi->s_mb_buddies_generated);
 	atomic64_add(period, &sbi->s_mb_generation_time);
-	mb_update_avg_fragment_size(sb, grp);
 }
 
 /* The buddy information is attached the buddy cache inode
@@ -2711,7 +2665,6 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 		 * from the goal value specified
 		 */
 		group = ac->ac_g_ex.fe_group;
-		ac->ac_last_optimal_group = group;
 		ac->ac_groups_linear_remaining = sbi->s_mb_max_linear_groups;
 		prefetch_grp = group;
 
@@ -2993,9 +2946,7 @@ __acquires(&EXT4_SB(sb)->s_mb_rb_lock)
 	struct super_block *sb = pde_data(file_inode(seq->file));
 	unsigned long position;
 
-	read_lock(&EXT4_SB(sb)->s_mb_rb_lock);
-
-	if (*pos < 0 || *pos >= MB_NUM_ORDERS(sb) + 1)
+	if (*pos < 0 || *pos >= 2*MB_NUM_ORDERS(sb))
 		return NULL;
 	position = *pos + 1;
 	return (void *) ((unsigned long) position);
@@ -3007,7 +2958,7 @@ static void *ext4_mb_seq_structs_summary_next(struct seq_file *seq, void *v, lof
 	unsigned long position;
 
 	++*pos;
-	if (*pos < 0 || *pos >= MB_NUM_ORDERS(sb) + 1)
+	if (*pos < 0 || *pos >= 2*MB_NUM_ORDERS(sb))
 		return NULL;
 	position = *pos + 1;
 	return (void *) ((unsigned long) position);
@@ -3019,29 +2970,22 @@ static int ext4_mb_seq_structs_summary_show(struct seq_file *seq, void *v)
 	struct ext4_sb_info *sbi = EXT4_SB(sb);
 	unsigned long position = ((unsigned long) v);
 	struct ext4_group_info *grp;
-	struct rb_node *n;
-	unsigned int count, min, max;
+	unsigned int count;
 
 	position--;
 	if (position >= MB_NUM_ORDERS(sb)) {
-		seq_puts(seq, "fragment_size_tree:\n");
-		n = rb_first(&sbi->s_mb_avg_fragment_size_root);
-		if (!n) {
-			seq_puts(seq, "\ttree_min: 0\n\ttree_max: 0\n\ttree_nodes: 0\n");
-			return 0;
-		}
-		grp = rb_entry(n, struct ext4_group_info, bb_avg_fragment_size_rb);
-		min = grp->bb_fragments ? grp->bb_free / grp->bb_fragments : 0;
-		count = 1;
-		while (rb_next(n)) {
-			count++;
-			n = rb_next(n);
-		}
-		grp = rb_entry(n, struct ext4_group_info, bb_avg_fragment_size_rb);
-		max = grp->bb_fragments ? grp->bb_free / grp->bb_fragments : 0;
+		position -= MB_NUM_ORDERS(sb);
+		if (position == 0)
+			seq_puts(seq, "avg_fragment_size_lists:\n");
 
-		seq_printf(seq, "\ttree_min: %u\n\ttree_max: %u\n\ttree_nodes: %u\n",
-			   min, max, count);
+		count = 0;
+		read_lock(&sbi->s_mb_avg_fragment_size_locks[position]);
+		list_for_each_entry(grp, &sbi->s_mb_avg_fragment_size[position],
+				    bb_avg_fragment_size_node)
+			count++;
+		read_unlock(&sbi->s_mb_avg_fragment_size_locks[position]);
+		seq_printf(seq, "\tlist_order_%u_groups: %u\n",
+					(unsigned int)position, count);
 		return 0;
 	}
 
@@ -3051,9 +2995,11 @@ static int ext4_mb_seq_structs_summary_show(struct seq_file *seq, void *v)
 		seq_puts(seq, "max_free_order_lists:\n");
 	}
 	count = 0;
+	read_lock(&sbi->s_mb_largest_free_orders_locks[position]);
 	list_for_each_entry(grp, &sbi->s_mb_largest_free_orders[position],
 			    bb_largest_free_order_node)
 		count++;
+	read_unlock(&sbi->s_mb_largest_free_orders_locks[position]);
 	seq_printf(seq, "\tlist_order_%u_groups: %u\n",
 		   (unsigned int)position, count);
 
@@ -3061,11 +3007,7 @@ static int ext4_mb_seq_structs_summary_show(struct seq_file *seq, void *v)
 }
 
 static void ext4_mb_seq_structs_summary_stop(struct seq_file *seq, void *v)
-__releases(&EXT4_SB(sb)->s_mb_rb_lock)
 {
-	struct super_block *sb = pde_data(file_inode(seq->file));
-
-	read_unlock(&EXT4_SB(sb)->s_mb_rb_lock);
 }
 
 const struct seq_operations ext4_mb_seq_structs_summary_ops = {
@@ -3178,8 +3120,9 @@ int ext4_mb_add_groupinfo(struct super_block *sb, ext4_group_t group,
 	init_rwsem(&meta_group_info[i]->alloc_sem);
 	meta_group_info[i]->bb_free_root = RB_ROOT;
 	INIT_LIST_HEAD(&meta_group_info[i]->bb_largest_free_order_node);
-	RB_CLEAR_NODE(&meta_group_info[i]->bb_avg_fragment_size_rb);
+	INIT_LIST_HEAD(&meta_group_info[i]->bb_avg_fragment_size_node);
 	meta_group_info[i]->bb_largest_free_order = -1;  /* uninit */
+	meta_group_info[i]->bb_avg_fragment_size_order = -1;  /* uninit */
 	meta_group_info[i]->bb_group = group;
 
 	mb_group_bb_bitmap_alloc(sb, meta_group_info[i], group);
@@ -3428,7 +3371,24 @@ int ext4_mb_init(struct super_block *sb)
 		i++;
 	} while (i < MB_NUM_ORDERS(sb));
 
-	sbi->s_mb_avg_fragment_size_root = RB_ROOT;
+	sbi->s_mb_avg_fragment_size =
+		kmalloc_array(MB_NUM_ORDERS(sb), sizeof(struct list_head),
+			GFP_KERNEL);
+	if (!sbi->s_mb_avg_fragment_size) {
+		ret = -ENOMEM;
+		goto out;
+	}
+	sbi->s_mb_avg_fragment_size_locks =
+		kmalloc_array(MB_NUM_ORDERS(sb), sizeof(rwlock_t),
+			GFP_KERNEL);
+	if (!sbi->s_mb_avg_fragment_size_locks) {
+		ret = -ENOMEM;
+		goto out;
+	}
+	for (i = 0; i < MB_NUM_ORDERS(sb); i++) {
+		INIT_LIST_HEAD(&sbi->s_mb_avg_fragment_size[i]);
+		rwlock_init(&sbi->s_mb_avg_fragment_size_locks[i]);
+	}
 	sbi->s_mb_largest_free_orders =
 		kmalloc_array(MB_NUM_ORDERS(sb), sizeof(struct list_head),
 			GFP_KERNEL);
@@ -3447,7 +3407,6 @@ int ext4_mb_init(struct super_block *sb)
 		INIT_LIST_HEAD(&sbi->s_mb_largest_free_orders[i]);
 		rwlock_init(&sbi->s_mb_largest_free_orders_locks[i]);
 	}
-	rwlock_init(&sbi->s_mb_rb_lock);
 
 	spin_lock_init(&sbi->s_md_lock);
 	sbi->s_mb_free_pending = 0;
@@ -3518,6 +3477,8 @@ int ext4_mb_init(struct super_block *sb)
 	free_percpu(sbi->s_locality_groups);
 	sbi->s_locality_groups = NULL;
 out:
+	kfree(sbi->s_mb_avg_fragment_size);
+	kfree(sbi->s_mb_avg_fragment_size_locks);
 	kfree(sbi->s_mb_largest_free_orders);
 	kfree(sbi->s_mb_largest_free_orders_locks);
 	kfree(sbi->s_mb_offsets);
@@ -3584,6 +3545,8 @@ int ext4_mb_release(struct super_block *sb)
 		kvfree(group_info);
 		rcu_read_unlock();
 	}
+	kfree(sbi->s_mb_avg_fragment_size);
+	kfree(sbi->s_mb_avg_fragment_size_locks);
 	kfree(sbi->s_mb_largest_free_orders);
 	kfree(sbi->s_mb_largest_free_orders_locks);
 	kfree(sbi->s_mb_offsets);
diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h
index 39da92ceabf8..dcda2a943cee 100644
--- a/fs/ext4/mballoc.h
+++ b/fs/ext4/mballoc.h
@@ -178,7 +178,6 @@ struct ext4_allocation_context {
 	/* copy of the best found extent taken before preallocation efforts */
 	struct ext4_free_extent ac_f_ex;
 
-	ext4_group_t ac_last_optimal_group;
 	__u32 ac_groups_considered;
 	__u32 ac_flags;		/* allocation hints */
 	__u16 ac_groups_scanned;
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/5] ext4: Make mballoc try target group first even with mb_optimize_scan
  2022-09-06 15:29 ` [PATCH 1/5] ext4: Make mballoc try target group first even with mb_optimize_scan Jan Kara
@ 2022-09-07 17:43   ` Ritesh Harjani (IBM)
  0 siblings, 0 replies; 27+ messages in thread
From: Ritesh Harjani (IBM) @ 2022-09-07 17:43 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ted Tso, linux-ext4, Thorsten Leemhuis, Ojaswin Mujoo,
	Stefan Wahren, Andreas Dilger, stable

On 22/09/06 05:29PM, Jan Kara wrote:
> One of the side-effects of mb_optimize_scan was that the optimized
> functions to select next group to try were called even before we tried
> the goal group. As a result we no longer allocate files close to
> corresponding inodes as well as we don't try to expand currently
> allocated extent in the same group. This results in reaim regression
> with workfile.disk workload of upto 8% with many clients on my test
> machine:
> 
>                      baseline               mb_optimize_scan
> Hmean     disk-1       2114.16 (   0.00%)     2099.37 (  -0.70%)
> Hmean     disk-41     87794.43 (   0.00%)    83787.47 *  -4.56%*
> Hmean     disk-81    148170.73 (   0.00%)   135527.05 *  -8.53%*
> Hmean     disk-121   177506.11 (   0.00%)   166284.93 *  -6.32%*
> Hmean     disk-161   220951.51 (   0.00%)   207563.39 *  -6.06%*
> Hmean     disk-201   208722.74 (   0.00%)   203235.59 (  -2.63%)
> Hmean     disk-241   222051.60 (   0.00%)   217705.51 (  -1.96%)
> Hmean     disk-281   252244.17 (   0.00%)   241132.72 *  -4.41%*
> Hmean     disk-321   255844.84 (   0.00%)   245412.84 *  -4.08%*
> 
> Also this is causing huge regression (time increased by a factor of 5 or
> so) when untarring archive with lots of small files on some eMMC storage
> cards.
> 
> Fix the problem by making sure we try goal group first.
> 

Yup, this is definitely a bug. We were never trying the goal group then,
except maybe for rotational devices (due to ac_groups_linear_remaining).

Looks right to me.
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/5] ext4: Avoid unnecessary spreading of allocations among groups
  2022-09-06 15:29 ` [PATCH 2/5] ext4: Avoid unnecessary spreading of allocations among groups Jan Kara
@ 2022-09-07 18:05   ` Ritesh Harjani (IBM)
  2022-09-08  8:57     ` Jan Kara
  0 siblings, 1 reply; 27+ messages in thread
From: Ritesh Harjani (IBM) @ 2022-09-07 18:05 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ted Tso, linux-ext4, Thorsten Leemhuis, Ojaswin Mujoo,
	Stefan Wahren, Andreas Dilger

On 22/09/06 05:29PM, Jan Kara wrote:
> mb_set_largest_free_order() updates lists containing groups with largest
> chunk of free space of given order. The way it updates it leads to
> always moving the group to the tail of the list. Thus allocations
> looking for free space of given order effectively end up cycling through
> all groups (and due to initialization in last to first order). This
> spreads allocations among block groups which reduces performance for
> rotating disks or low-end flash media. Change
> mb_set_largest_free_order() to only update lists if the order of the
> largest free chunk in the group changed.

Nice and clear explanation. Thanks :)

This change also looks good to me.
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>


One other thought to further optimize - 
Will it make a difference if rather than adding the group to the tail of the list,
we add that group to the head of sbi->s_mb_largest_free_orders[new_order]. 

This is because this group is the latest from where blocks were allocated/freed,
and hence the next allocation should first try from this group in order to keep 
the files/extents blocks close to each other? 
(That sometimes might help with disk firmware to avoid doing discards if the freed 
block can be reused?)

Or will the goal block always cover that case by default so that we might never
require this? Maybe in the case of a new file within the same directory where
the goal group has no free blocks, but the last group attempted should be 
retried first?
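
A rough sketch of that experiment (just to illustrate the idea, not
something that is part of this series) would be to queue the group at the
head instead of the tail when it is (re)inserted, i.e. list_add() instead
of the list_add_tail() used today:

	list_add(&grp->bb_largest_free_order_node,
		 &sbi->s_mb_largest_free_orders[grp->bb_largest_free_order]);

so that the most recently touched group is considered first.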

-ritesh

 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 4/5] ext4: Use locality group preallocation for small closed files
  2022-09-06 15:29 ` [PATCH 4/5] ext4: Use locality group preallocation for small closed files Jan Kara
@ 2022-09-07 18:25   ` Ritesh Harjani (IBM)
  0 siblings, 0 replies; 27+ messages in thread
From: Ritesh Harjani (IBM) @ 2022-09-07 18:25 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ted Tso, linux-ext4, Thorsten Leemhuis, Ojaswin Mujoo,
	Stefan Wahren, Andreas Dilger

On 22/09/06 05:29PM, Jan Kara wrote:
> Currently we don't use any preallocation when a file is already closed
> when allocating blocks (from writeback code when converting delayed
> allocation). However for small files, using locality group preallocation

I had always wondered about this case. But yes, for small files it completely
makes sense to enable preallocations even though they are closed.


> is actually desirable as that is not specific to a particular file.
> Rather it is a method to pack small files together to reduce
> fragmentation and for that the fact the file is closed is actually an even
> stronger hint that the file would benefit from packing. So change the logic
> to allow locality group preallocation in this case.
> 

One thing which comes to mind is that we never discard lg preallocations.
With this change we will always enable lg preallocations for small files.
These preallocations will then only be discarded when some allocation request
fails, and that request is retried after discarding preallocations.

Though it is not a problem, since any small file allocation will benefit from
these lgs. But the lg space shouldn't grow so large that it starts causing
performance hits for large files.
Not for this patch but something to remember maybe ^^^^.
(While we are internally looking at the preallocation space for a few
optimizations, the above is something to keep a note of.)


> Signed-off-by: Jan Kara <jack@suse.cz>

Looks good to me.
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 5/5] ext4: Use buckets for cr 1 block scan instead of rbtree
  2022-09-06 15:29 ` [PATCH 5/5] ext4: Use buckets for cr 1 block scan instead of rbtree Jan Kara
  2022-09-08 11:00     ` Dan Carpenter
@ 2022-09-07 18:41   ` Ritesh Harjani (IBM)
  2022-09-08  9:01     ` Jan Kara
  2022-09-08  8:29   ` Ojaswin Mujoo
  2 siblings, 1 reply; 27+ messages in thread
From: Ritesh Harjani (IBM) @ 2022-09-07 18:41 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ted Tso, linux-ext4, Thorsten Leemhuis, Ojaswin Mujoo,
	Stefan Wahren, Andreas Dilger

On 22/09/06 05:29PM, Jan Kara wrote:
> Using rbtree for sorting groups by average fragment size is relatively
> expensive (needs rbtree update on every block freeing or allocation) and
> leads to wide spreading of allocations because selection of block group
> is very sensitive both to changes in free space and amount of blocks
> allocated. Furthermore selecting group with the best matching average
> fragment size is not necessary anyway, even more so because the
> variability of fragment sizes within a group is likely large so average
> is not telling much. We just need a group with large enough average
> fragment size so that we have high probability of finding large enough
> free extent and we don't want average fragment size to be too big so
> that we are likely to find free extent only somewhat larger than what we
> need.
> 
> So instead of maintaining an rbtree of groups sorted by fragment size, keep
> bins (lists) of groups where average fragment size is in the interval
> [2^i, 2^(i+1)). This structure requires less updates on block allocation
> / freeing, generally avoids chaotic spreading of allocations into block
> groups, and still is able to quickly (even faster than the rbtree)
> provide a block group which is likely to have a suitably sized free
> space extent.

This makes sense because we anyway maintain the buddy bitmap for MB_NUM_ORDERS
orders. Hence our data structure maintaining the different lists of groups by
their average fragment size can be bounded to MB_NUM_ORDERS lists.
This also gives amortized O(1) search time for finding the right group
in the CR1 search.
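
To make the bucket mapping concrete, here is a tiny userspace sketch of
mb_avg_fragment_size_order() from the patch (my own illustration, assuming
4k blocks so that MB_NUM_ORDERS(sb) == 14):

#include <stdio.h>

#define MB_NUM_ORDERS	14	/* sb->s_blocksize_bits + 2 for 4k blocks */

/* userspace stand-in for the kernel's fls() */
static int fls32(unsigned int x)
{
	return x ? 32 - __builtin_clz(x) : 0;
}

/* mirrors mb_avg_fragment_size_order() from the patch */
static int avg_frag_order(int len)
{
	int order = fls32(len) - 2;

	if (order < 0)
		return 0;
	if (order == MB_NUM_ORDERS)
		order--;
	return order;
}

int main(void)
{
	/* prints "0 5 13": a 2-block average fragment lands in list 0,
	 * a 100-block average in list 5, and a completely free
	 * 32768-block group is clamped into the top list 13 */
	printf("%d %d %d\n", avg_frag_order(2), avg_frag_order(100),
	       avg_frag_order(32768));
	return 0;
}

So a CR1 request for e.g. 300 blocks starts scanning at list
fls(300) - 2 == 7 and only walks towards the larger orders from there.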

> 
> This patch reduces number of block groups used when untarring archive
> with medium sized files (size somewhat above 64k which is default
> mballoc limit for avoiding locality group preallocation) to about half
> and thus improves write speeds for eMMC flash significantly.
> 

Indeed a nice change. More in line with how we maintain the
sbi->s_mb_largest_free_orders lists.

I think as you already noted there are few minor checkpatch errors,
other than that one small query below.

> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/ext4/ext4.h    |  10 +-
>  fs/ext4/mballoc.c | 252 +++++++++++++++++++---------------------------
>  fs/ext4/mballoc.h |   1 -
>  3 files changed, 110 insertions(+), 153 deletions(-)
> 
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 9bca5565547b..3bf9a6926798 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -167,8 +167,6 @@ enum SHIFT_DIRECTION {
>  #define EXT4_MB_CR0_OPTIMIZED		0x8000
>  /* Avg fragment size rb tree lookup succeeded at least once for cr = 1 */
>  #define EXT4_MB_CR1_OPTIMIZED		0x00010000
> -/* Perform linear traversal for one group */
> -#define EXT4_MB_SEARCH_NEXT_LINEAR	0x00020000
>  struct ext4_allocation_request {
>  	/* target inode for block we're allocating */
>  	struct inode *inode;
> @@ -1600,8 +1598,8 @@ struct ext4_sb_info {
>  	struct list_head s_discard_list;
>  	struct work_struct s_discard_work;
>  	atomic_t s_retry_alloc_pending;
> -	struct rb_root s_mb_avg_fragment_size_root;
> -	rwlock_t s_mb_rb_lock;
> +	struct list_head *s_mb_avg_fragment_size;
> +	rwlock_t *s_mb_avg_fragment_size_locks;
>  	struct list_head *s_mb_largest_free_orders;
>  	rwlock_t *s_mb_largest_free_orders_locks;
>  
> @@ -3413,6 +3411,8 @@ struct ext4_group_info {
>  	ext4_grpblk_t	bb_first_free;	/* first free block */
>  	ext4_grpblk_t	bb_free;	/* total free blocks */
>  	ext4_grpblk_t	bb_fragments;	/* nr of freespace fragments */
> +	int		bb_avg_fragment_size_order;	/* order of average
> +							   fragment in BG */
>  	ext4_grpblk_t	bb_largest_free_order;/* order of largest frag in BG */
>  	ext4_group_t	bb_group;	/* Group number */
>  	struct          list_head bb_prealloc_list;
> @@ -3420,7 +3420,7 @@ struct ext4_group_info {
>  	void            *bb_bitmap;
>  #endif
>  	struct rw_semaphore alloc_sem;
> -	struct rb_node	bb_avg_fragment_size_rb;
> +	struct list_head bb_avg_fragment_size_node;
>  	struct list_head bb_largest_free_order_node;
>  	ext4_grpblk_t	bb_counters[];	/* Nr of free power-of-two-block
>  					 * regions, index is order.
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index af1e49c3603f..213d2d0750dd 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -140,13 +140,15 @@
>   *    number of buddy bitmap orders possible) number of lists. Group-infos are
>   *    placed in appropriate lists.
>   *
> - * 2) Average fragment size rb tree (sbi->s_mb_avg_fragment_size_root)
> + * 2) Average fragment size lists (sbi->s_mb_avg_fragment_size)
>   *
> - *    Locking: sbi->s_mb_rb_lock (rwlock)
> + *    Locking: sbi->s_mb_avg_fragment_size_locks(array of rw locks)
>   *
> - *    This is a red black tree consisting of group infos and the tree is sorted
> - *    by average fragment sizes (which is calculated as ext4_group_info->bb_free
> - *    / ext4_group_info->bb_fragments).
> + *    This is an array of lists where in the i-th list there are groups with
> + *    average fragment size >= 2^i and < 2^(i+1). The average fragment size
> + *    is computed as ext4_group_info->bb_free / ext4_group_info->bb_fragments.
> + *    Note that we don't bother with a special list for completely empty groups
> + *    so we only have MB_NUM_ORDERS(sb) lists.
>   *
>   * When "mb_optimize_scan" mount option is set, mballoc consults the above data
>   * structures to decide the order in which groups are to be traversed for
> @@ -160,7 +162,8 @@
>   *
>   * At CR = 1, we only consider groups where average fragment size > request
>   * size. So, we lookup a group which has average fragment size just above or
> - * equal to request size using our rb tree (data structure 2) in O(log N) time.
> + * equal to request size using our average fragment size group lists (data
> + * structure 2) in O(1) time.
>   *
>   * If "mb_optimize_scan" mount option is not set, mballoc traverses groups in
>   * linear order which requires O(N) search time for each CR 0 and CR 1 phase.
> @@ -802,65 +805,51 @@ static void ext4_mb_mark_free_simple(struct super_block *sb,
>  	}
>  }
>  
> -static void ext4_mb_rb_insert(struct rb_root *root, struct rb_node *new,
> -			int (*cmp)(struct rb_node *, struct rb_node *))
> +static int mb_avg_fragment_size_order(struct super_block *sb, ext4_grpblk_t len)
>  {
> -	struct rb_node **iter = &root->rb_node, *parent = NULL;
> +	int order;
>  
> -	while (*iter) {
> -		parent = *iter;
> -		if (cmp(new, *iter) > 0)
> -			iter = &((*iter)->rb_left);
> -		else
> -			iter = &((*iter)->rb_right);
> -	}
> -
> -	rb_link_node(new, parent, iter);
> -	rb_insert_color(new, root);
> -}
> -
> -static int
> -ext4_mb_avg_fragment_size_cmp(struct rb_node *rb1, struct rb_node *rb2)
> -{
> -	struct ext4_group_info *grp1 = rb_entry(rb1,
> -						struct ext4_group_info,
> -						bb_avg_fragment_size_rb);
> -	struct ext4_group_info *grp2 = rb_entry(rb2,
> -						struct ext4_group_info,
> -						bb_avg_fragment_size_rb);
> -	int num_frags_1, num_frags_2;
> -
> -	num_frags_1 = grp1->bb_fragments ?
> -		grp1->bb_free / grp1->bb_fragments : 0;
> -	num_frags_2 = grp2->bb_fragments ?
> -		grp2->bb_free / grp2->bb_fragments : 0;
> -
> -	return (num_frags_2 - num_frags_1);
> +	/*
> +	 * We don't bother with a special lists groups with only 1 block free
> + 	 * extents and for completely empty groups.
> +	 */
> +	order = fls(len) - 2;
> +	if (order < 0)
> +		return 0;
> +	if (order == MB_NUM_ORDERS(sb))
> +		order--;
> +	return order;
>  }
>  
> -/*
> - * Reinsert grpinfo into the avg_fragment_size tree with new average
> - * fragment size.
> - */
> +/* Move group to appropriate avg_fragment_size list */
>  static void
>  mb_update_avg_fragment_size(struct super_block *sb, struct ext4_group_info *grp)
>  {
>  	struct ext4_sb_info *sbi = EXT4_SB(sb);
> +	int new_order;
>  
>  	if (!test_opt2(sb, MB_OPTIMIZE_SCAN) || grp->bb_free == 0)
>  		return;
>  
> -	write_lock(&sbi->s_mb_rb_lock);
> -	if (!RB_EMPTY_NODE(&grp->bb_avg_fragment_size_rb)) {
> -		rb_erase(&grp->bb_avg_fragment_size_rb,
> -				&sbi->s_mb_avg_fragment_size_root);
> -		RB_CLEAR_NODE(&grp->bb_avg_fragment_size_rb);
> -	}
> +	new_order = mb_avg_fragment_size_order(sb,
> +					grp->bb_free / grp->bb_fragments);

The previous rbtree code was always checking grp->bb_fragments for 0.
Can grp->bb_fragments be 0 here?

-ritesh

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/5 v2] ext4: Fix performance regression with mballoc
  2022-09-06 15:29 [PATCH 0/5 v2] ext4: Fix performance regression with mballoc Jan Kara
                   ` (5 preceding siblings ...)
  2022-09-06 20:38 ` [PATCH 0/5 v2] ext4: Fix performance regression with mballoc Stefan Wahren
@ 2022-09-08  8:17 ` Ojaswin Mujoo
  2022-09-08  9:12   ` Jan Kara
  6 siblings, 1 reply; 27+ messages in thread
From: Ojaswin Mujoo @ 2022-09-08  8:17 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ted Tso, linux-ext4, Thorsten Leemhuis, Stefan Wahren, Andreas Dilger

On Tue, Sep 06, 2022 at 05:29:06PM +0200, Jan Kara wrote:
> Hello,
> 
> Here is a second version of my mballoc improvements to avoid spreading
> allocations with mb_optimize_scan=1. The patches fix the performance
> regression I was able to reproduce with reaim on my test machine:
> 
>                      mb_optimize_scan=0     mb_optimize_scan=1     patched
> Hmean     disk-1       2076.12 (   0.00%)     2099.37 (   1.12%)     2032.52 (  -2.10%)
> Hmean     disk-41     92481.20 (   0.00%)    83787.47 *  -9.40%*    90308.37 (  -2.35%)
> Hmean     disk-81    155073.39 (   0.00%)   135527.05 * -12.60%*   154285.71 (  -0.51%)
> Hmean     disk-121   185109.64 (   0.00%)   166284.93 * -10.17%*   185298.62 (   0.10%)
> Hmean     disk-161   229890.53 (   0.00%)   207563.39 *  -9.71%*   232883.32 *   1.30%*
> Hmean     disk-201   223333.33 (   0.00%)   203235.59 *  -9.00%*   221446.93 (  -0.84%)
> Hmean     disk-241   235735.25 (   0.00%)   217705.51 *  -7.65%*   239483.27 *   1.59%*
> Hmean     disk-281   266772.15 (   0.00%)   241132.72 *  -9.61%*   263108.62 (  -1.37%)
> Hmean     disk-321   265435.50 (   0.00%)   245412.84 *  -7.54%*   267277.27 (   0.69%)
> 
> The changes also significanly reduce spreading of allocations for small /
> moderately sized files. I'm not able to measure a performance difference
> resulting from this but on eMMC storage this seems to be the main culprit
> of reduced performance. Untarring of raspberry-pi archive touches following
> numbers of groups:
> 
> 	mb_optimize_scan=0	mb_optimize_scan=1	patched
> groups	4			22			7
> 
> To achieve this I have added two more changes on top of v1 - patches 4 and 5.
> Patch 4 makes sure we use locality group preallocation even for files that are
> not likely to grow anymore (previously we have disabled all preallocations for
> such files, however locality group preallocation still makes a lot of sense for
> such files). This patch reduced spread of a small file allocations but larger
> file allocations were still spread significantly because they avoid locality
> group preallocation and as they are not power-of-two in size, they also
> immediately start with cr=1 scan. To address that I've changed the data
> structure for looking up the best block group to allocate from (see patch 5
> for details).
> 
> Stefan, can you please test whether these patches fix the problem for you as
> well? Comments & review welcome.
> 
> 								Honza
> Previous versions:
> Link: http://lore.kernel.org/r/20220823134508.27854-1-jack@suse.cz # v1

Hi Jan,

Thanks for the patch. I tested this series on my raspberry pi and I can
confirm that the regression is no longer present with both
mb_optimize_scan=0 and =1 taking a similar amount of time to untar. The
allocation spread I'm seeing is as follows:
mb_optimize_scan=0: 10
mb_optimize_scan=1: 17 (Check graphs for more details)

For mb_optimize_scan=1, I also compared the spread of locality group PA
eligible files (<64KB) and inode pa files. The results can be found
here:

mb_optimize_scan=0:
https://github.com/OjaswinM/mbopt-bug/blob/master/grpahs/patchv2-mbopt0.png
mb_optimize_scan=1:
https://github.com/OjaswinM/mbopt-bug/blob/master/grpahs/patchv2.png
mb_optimize_scan=1 (lg pa only):
https://github.com/OjaswinM/mbopt-bug/blob/master/grpahs/patchv2-lgs.png
mb_optimize_scan=1 (inode pa only):
https://github.com/OjaswinM/mbopt-bug/blob/master/grpahs/patchv2-i.png

The smaller files are now closer together due to the changes to
locality group pa logic. Most of the spread is now coming from mid to
large files.

To test this further, I created a tar of 2000 100KB files to see if
there is any performance drop due to higher spread of these files and
noticed that although the spread is slightly higher (5 BGs vs 9), we don't
see a performance drop while untarring with mb_optimize_scan=1.

Although we still have some spread, I think we have brought it down to a
much more manageable level, and that, combined with the improvements to CR1
allocation, has given us a good performance improvement.

Feel free to add:
Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>

Regards,
Ojaswin

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 5/5] ext4: Use buckets for cr 1 block scan instead of rbtree
  2022-09-06 15:29 ` [PATCH 5/5] ext4: Use buckets for cr 1 block scan instead of rbtree Jan Kara
  2022-09-08 11:00     ` Dan Carpenter
  2022-09-07 18:41   ` Ritesh Harjani (IBM)
@ 2022-09-08  8:29   ` Ojaswin Mujoo
  2022-09-08  9:03     ` Jan Kara
  2 siblings, 1 reply; 27+ messages in thread
From: Ojaswin Mujoo @ 2022-09-08  8:29 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ted Tso, linux-ext4, Thorsten Leemhuis, Stefan Wahren, Andreas Dilger

Hi Jan,

On Tue, Sep 06, 2022 at 05:29:11PM +0200, Jan Kara wrote:
>  
>  ** snip **
>  /*
>   * Choose next group by traversing average fragment size tree. Updates *new_cr
Maybe we can change this to "average fragment size list of suitable
order"
> - * if cr lvel needs an update. Sets EXT4_MB_SEARCH_NEXT_LINEAR to indicate that
> - * the linear search should continue for one iteration since there's lock
> - * contention on the rb tree lock.
> + * if cr level needs an update. 
>   */
>  static void ext4_mb_choose_next_group_cr1(struct ext4_allocation_context *ac,
>  		int *new_cr, ext4_group_t *group, ext4_group_t ngroups)
>  {
>  	struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
> -	int avg_fragment_size, best_so_far;
> -	struct rb_node *node, *found;
> -	struct ext4_group_info *grp;

Other than that, this patch along with the updated mb_structs_summary
proc file change looks good to me.

Regards,
Ojaswin

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/5] ext4: Avoid unnecessary spreading of allocations among groups
  2022-09-07 18:05   ` Ritesh Harjani (IBM)
@ 2022-09-08  8:57     ` Jan Kara
  2022-09-08  9:24       ` Ritesh Harjani (IBM)
  0 siblings, 1 reply; 27+ messages in thread
From: Jan Kara @ 2022-09-08  8:57 UTC (permalink / raw)
  To: Ritesh Harjani (IBM)
  Cc: Jan Kara, Ted Tso, linux-ext4, Thorsten Leemhuis, Ojaswin Mujoo,
	Stefan Wahren, Andreas Dilger

On Wed 07-09-22 23:35:07, Ritesh Harjani (IBM) wrote:
> On 22/09/06 05:29PM, Jan Kara wrote:
> > mb_set_largest_free_order() updates lists containing groups with largest
> > chunk of free space of given order. The way it updates it leads to
> > always moving the group to the tail of the list. Thus allocations
> > looking for free space of given order effectively end up cycling through
> > all groups (and due to initialization in last to first order). This
> > spreads allocations among block groups which reduces performance for
> > rotating disks or low-end flash media. Change
> > mb_set_largest_free_order() to only update lists if the order of the
> > largest free chunk in the group changed.
> 
> Nice and clear explanation. Thanks :)
> 
> This change also looks good to me.
> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>

Thanks for review!

> One other thought to further optimize - 
> Will it make a difference if rather than adding the group to the tail of the list,
> we add that group to the head of sbi->s_mb_largest_free_orders[new_order]. 
> 
> This is because this group is the latest from where blocks were allocated/freed,
> and hence the next allocation should first try from this group in order to keep 
> the files/extents blocks close to each other? 
> (That sometimes might help with disk firmware to avoid doing discards if the freed 
> block can be reused?)
> 
> Or will the goal block always cover that case by default so that we might never
> require this? Maybe in the case of a new file within the same directory where
> the goal group has no free blocks, but the last group attempted should be 
> retried first?

So I was also wondering about this somewhat. I think that goal group will
take care of keeping file data together so head/tail insertion should not
matter too much for one file. Maybe if the allocation comes from a
different inode, then the head/tail insertion matters but then it is not
certain whether the allocation is actually related and what its order is
(depending on that we might prefer same / different group) so I've decided
to just keep things as they are. I agree it might be interesting to
investigate and experiment with various workloads and see whether the
head/tail insertion makes a difference for some workload but I think it's a
separate project.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 5/5] ext4: Use buckets for cr 1 block scan instead of rbtree
  2022-09-07 18:41   ` Ritesh Harjani (IBM)
@ 2022-09-08  9:01     ` Jan Kara
  2022-09-08  9:23       ` Ritesh Harjani (IBM)
  0 siblings, 1 reply; 27+ messages in thread
From: Jan Kara @ 2022-09-08  9:01 UTC (permalink / raw)
  To: Ritesh Harjani (IBM)
  Cc: Jan Kara, Ted Tso, linux-ext4, Thorsten Leemhuis, Ojaswin Mujoo,
	Stefan Wahren, Andreas Dilger

On Thu 08-09-22 00:11:10, Ritesh Harjani (IBM) wrote:
> On 22/09/06 05:29PM, Jan Kara wrote:
> > Using rbtree for sorting groups by average fragment size is relatively
> > expensive (needs rbtree update on every block freeing or allocation) and
> > leads to wide spreading of allocations because selection of block group
> > is very sensitive both to changes in free space and amount of blocks
> > allocated. Furthermore selecting group with the best matching average
> > fragment size is not necessary anyway, even more so because the
> > variability of fragment sizes within a group is likely large so average
> > is not telling much. We just need a group with large enough average
> > fragment size so that we have high probability of finding large enough
> > free extent and we don't want average fragment size to be too big so
> > that we are likely to find free extent only somewhat larger than what we
> > need.
> > 
> > So instead of maintaining an rbtree of groups sorted by fragment size, keep
> > bins (lists) of groups where average fragment size is in the interval
> > [2^i, 2^(i+1)). This structure requires less updates on block allocation
> > / freeing, generally avoids chaotic spreading of allocations into block
> > groups, and still is able to quickly (even faster than the rbtree)
> > provide a block group which is likely to have a suitably sized free
> > space extent.
> 
> This makes sense because we anyway maintain the buddy bitmap for MB_NUM_ORDERS
> orders. Hence our data structure maintaining the different lists of groups by
> their average fragment size can be bounded to MB_NUM_ORDERS lists.
> This also gives amortized O(1) search time for finding the right group
> in the CR1 search.
> 
> > 
> > This patch reduces number of block groups used when untarring archive
> > with medium sized files (size somewhat above 64k which is default
> > mballoc limit for avoiding locality group preallocation) to about half
> > and thus improves write speeds for eMMC flash significantly.
> > 
> 
> Indeed a nice change. More in line with how we maintain the
> sbi->s_mb_largest_free_orders lists.

I didn't really find more comments than the one below?

> I think as you already noted there are few minor checkpatch errors,
> other than that one small query below.

Yep, some checkpatch errors + procfs file handling bugs + one bad unlock in
an error recovery path. All fixed up locally :)
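
For reference, the unlock that looks off in the posted diff is in the
empty-list recheck of ext4_mb_choose_next_group_cr1(): the code takes
s_mb_avg_fragment_size_locks[i] but drops s_mb_largest_free_orders_locks[i].
Assuming that is the "bad unlock" in question, the fixup is presumably just:

-			read_unlock(&sbi->s_mb_largest_free_orders_locks[i]);
+			read_unlock(&sbi->s_mb_avg_fragment_size_locks[i]);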

> > -/*
> > - * Reinsert grpinfo into the avg_fragment_size tree with new average
> > - * fragment size.
> > - */
> > +/* Move group to appropriate avg_fragment_size list */
> >  static void
> >  mb_update_avg_fragment_size(struct super_block *sb, struct ext4_group_info *grp)
> >  {
> >  	struct ext4_sb_info *sbi = EXT4_SB(sb);
> > +	int new_order;
> >  
> >  	if (!test_opt2(sb, MB_OPTIMIZE_SCAN) || grp->bb_free == 0)
> >  		return;
> >  
> > -	write_lock(&sbi->s_mb_rb_lock);
> > -	if (!RB_EMPTY_NODE(&grp->bb_avg_fragment_size_rb)) {
> > -		rb_erase(&grp->bb_avg_fragment_size_rb,
> > -				&sbi->s_mb_avg_fragment_size_root);
> > -		RB_CLEAR_NODE(&grp->bb_avg_fragment_size_rb);
> > -	}
> > +	new_order = mb_avg_fragment_size_order(sb,
> > +					grp->bb_free / grp->bb_fragments);
> 
> The previous rbtree code was always checking grp->bb_fragments for 0.
> Can grp->bb_fragments be 0 here?

Since grp->bb_free is greater than zero, there should be at least one
fragment...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 5/5] ext4: Use buckets for cr 1 block scan instead of rbtree
  2022-09-08  8:29   ` Ojaswin Mujoo
@ 2022-09-08  9:03     ` Jan Kara
  0 siblings, 0 replies; 27+ messages in thread
From: Jan Kara @ 2022-09-08  9:03 UTC (permalink / raw)
  To: Ojaswin Mujoo
  Cc: Jan Kara, Ted Tso, linux-ext4, Thorsten Leemhuis, Stefan Wahren,
	Andreas Dilger

On Thu 08-09-22 13:59:58, Ojaswin Mujoo wrote:
> Hi Jan,
> 
> On Tue, Sep 06, 2022 at 05:29:11PM +0200, Jan Kara wrote:
> >  
> >  ** snip **
> >  /*
> >   * Choose next group by traversing average fragment size tree. Updates *new_cr
> Maybe we can change this to "average fragment size list of suitable
> order"

Right. Fixed. Thanks for catching this.

> > - * if cr lvel needs an update. Sets EXT4_MB_SEARCH_NEXT_LINEAR to indicate that
> > - * the linear search should continue for one iteration since there's lock
> > - * contention on the rb tree lock.
> > + * if cr level needs an update. 
> >   */
> >  static void ext4_mb_choose_next_group_cr1(struct ext4_allocation_context *ac,
> >  		int *new_cr, ext4_group_t *group, ext4_group_t ngroups)
> >  {
> >  	struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
> > -	int avg_fragment_size, best_so_far;
> > -	struct rb_node *node, *found;
> > -	struct ext4_group_info *grp;
> 
> Other than that, this patch along with the updated mb_structs_summary
> proc file change looks good to me.

Thanks for review & testing!

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/5 v2] ext4: Fix performance regression with mballoc
  2022-09-08  8:17 ` Ojaswin Mujoo
@ 2022-09-08  9:12   ` Jan Kara
  0 siblings, 0 replies; 27+ messages in thread
From: Jan Kara @ 2022-09-08  9:12 UTC (permalink / raw)
  To: Ojaswin Mujoo
  Cc: Jan Kara, Ted Tso, linux-ext4, Thorsten Leemhuis, Stefan Wahren,
	Andreas Dilger

On Thu 08-09-22 13:47:56, Ojaswin Mujoo wrote:
> On Tue, Sep 06, 2022 at 05:29:06PM +0200, Jan Kara wrote:
> > Hello,
> > 
> > Here is a second version of my mballoc improvements to avoid spreading
> > allocations with mb_optimize_scan=1. The patches fix the performance
> > regression I was able to reproduce with reaim on my test machine:
> > 
> >                      mb_optimize_scan=0     mb_optimize_scan=1     patched
> > Hmean     disk-1       2076.12 (   0.00%)     2099.37 (   1.12%)     2032.52 (  -2.10%)
> > Hmean     disk-41     92481.20 (   0.00%)    83787.47 *  -9.40%*    90308.37 (  -2.35%)
> > Hmean     disk-81    155073.39 (   0.00%)   135527.05 * -12.60%*   154285.71 (  -0.51%)
> > Hmean     disk-121   185109.64 (   0.00%)   166284.93 * -10.17%*   185298.62 (   0.10%)
> > Hmean     disk-161   229890.53 (   0.00%)   207563.39 *  -9.71%*   232883.32 *   1.30%*
> > Hmean     disk-201   223333.33 (   0.00%)   203235.59 *  -9.00%*   221446.93 (  -0.84%)
> > Hmean     disk-241   235735.25 (   0.00%)   217705.51 *  -7.65%*   239483.27 *   1.59%*
> > Hmean     disk-281   266772.15 (   0.00%)   241132.72 *  -9.61%*   263108.62 (  -1.37%)
> > Hmean     disk-321   265435.50 (   0.00%)   245412.84 *  -7.54%*   267277.27 (   0.69%)
> > 
> > The changes also significanly reduce spreading of allocations for small /
> > moderately sized files. I'm not able to measure a performance difference
> > resulting from this but on eMMC storage this seems to be the main culprit
> > of reduced performance. Untarring of raspberry-pi archive touches following
> > numbers of groups:
> > 
> > 	mb_optimize_scan=0	mb_optimize_scan=1	patched
> > groups	4			22			7
> > 
> > To achieve this I have added two more changes on top of v1 - patches 4 and 5.
> > Patch 4 makes sure we use locality group preallocation even for files that are
> > not likely to grow anymore (previously we have disabled all preallocations for
> > such files, however locality group preallocation still makes a lot of sense for
> > such files). This patch reduced spread of a small file allocations but larger
> > file allocations were still spread significantly because they avoid locality
> > group preallocation and as they are not power-of-two in size, they also
> > immediately start with cr=1 scan. To address that I've changed the data
> > structure for looking up the best block group to allocate from (see patch 5
> > for details).
> > 
> > Stefan, can you please test whether these patches fix the problem for you as
> > well? Comments & review welcome.
> > 
> > 								Honza
> > Previous versions:
> > Link: http://lore.kernel.org/r/20220823134508.27854-1-jack@suse.cz # v1
> 
> Hi Jan,
> 
> Thanks for the patch. I tested this series on my raspberry pi and I can
> confirm that the regression is no longer present with both
> mb_optimize_scan=0 and =1 taking a similar amount of time to untar. The
> allocation spread I'm seeing is as follows:
> mb_optimize_scan=0: 10
> mb_optimize_scan=1: 17 (Check graphs for more details)
> 
> For mb_optimize_scan=1, I also compared the spread of locality group PA
> eligible files (<64KB) and inode pa files. The results can be found
> here:
> 
> mb_optimize_scan=0:
> https://github.com/OjaswinM/mbopt-bug/blob/master/grpahs/patchv2-mbopt0.png
> mb_optimize_scan=1:
> https://github.com/OjaswinM/mbopt-bug/blob/master/grpahs/patchv2.png
> mb_optimize_scan=1 (lg pa only):
> https://github.com/OjaswinM/mbopt-bug/blob/master/grpahs/patchv2-lgs.png
> mb_optimize_scan=1 (inode pa only):
> https://github.com/OjaswinM/mbopt-bug/blob/master/grpahs/patchv2-i.png
> 
> The smaller files are now closer together due to the changes to
> locality group pa logic. Most of the spread is now coming from mid to
> large files.
> 
> To test this further, I created a tar of 2000 100KB files to see if
> there is any performance drop due to higher spread of these files and
> noticed that although the spread is slightly higher (5 BGs vs 9), we don't
> see a performance drop while untarring with mb_optimize_scan=1.
> 
> Although we still have some spread, I think we have brought it down to a
> much more manageable level, and that, combined with the improvements to CR1
> allocation, has given us a good performance improvement.
> 
> Feel free to add:
> Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>

Thanks a lot for the thorough testing!

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 5/5] ext4: Use buckets for cr 1 block scan instead of rbtree
  2022-09-08  9:01     ` Jan Kara
@ 2022-09-08  9:23       ` Ritesh Harjani (IBM)
  0 siblings, 0 replies; 27+ messages in thread
From: Ritesh Harjani (IBM) @ 2022-09-08  9:23 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ted Tso, linux-ext4, Thorsten Leemhuis, Ojaswin Mujoo,
	Stefan Wahren, Andreas Dilger

On 22/09/08 11:01AM, Jan Kara wrote:
> On Thu 08-09-22 00:11:10, Ritesh Harjani (IBM) wrote:
> > On 22/09/06 05:29PM, Jan Kara wrote:
> > > Using rbtree for sorting groups by average fragment size is relatively
> > > expensive (needs rbtree update on every block freeing or allocation) and
> > > leads to wide spreading of allocations because selection of block group
> > > is very sensitive both to changes in free space and amount of blocks
> > > allocated. Furthermore selecting group with the best matching average
> > > fragment size is not necessary anyway, even more so because the
> > > variability of fragment sizes within a group is likely large so average
> > > is not telling much. We just need a group with large enough average
> > > fragment size so that we have high probability of finding large enough
> > > free extent and we don't want average fragment size to be too big so
> > > that we are likely to find free extent only somewhat larger than what we
> > > need.
> > > 
> > > So instead of maintaining an rbtree of groups sorted by fragment size, keep
> > > bins (lists) of groups where average fragment size is in the interval
> > > [2^i, 2^(i+1)). This structure requires less updates on block allocation
> > > / freeing, generally avoids chaotic spreading of allocations into block
> > > groups, and still is able to quickly (even faster than the rbtree)
> > > provide a block group which is likely to have a suitably sized free
> > > space extent.
> > 
> > This makes sense because we anyway maintain the buddy bitmap for MB_NUM_ORDERS
> > orders. Hence our data structure maintaining the different lists of groups by
> > their average fragment size can be bounded to MB_NUM_ORDERS lists.
> > This also gives amortized O(1) search time for finding the right group
> > in the CR1 search.
> > 
> > > 
> > > This patch reduces number of block groups used when untarring archive
> > > with medium sized files (size somewhat above 64k which is default
> > > mballoc limit for avoiding locality group preallocation) to about half
> > > and thus improves write speeds for eMMC flash significantly.
> > > 
> > 
> > Indeed a nice change. More in line with how we maintain the
> > sbi->s_mb_largest_free_orders lists.
> 
> I didn't really find more comments than the one below?

No, I meant that the data structure is more in line with the
sbi->s_mb_largest_free_orders lists :) Had no other comments.

> 
> > I think as you already noted there are few minor checkpatch errors,
> > other than that one small query below.
> 
> Yep, some checkpatch errors + procfs file handling bugs + one bad unlock in
> an error recovery path. All fixed up locally :)

Sure.

> 
> > > -/*
> > > - * Reinsert grpinfo into the avg_fragment_size tree with new average
> > > - * fragment size.
> > > - */
> > > +/* Move group to appropriate avg_fragment_size list */
> > >  static void
> > >  mb_update_avg_fragment_size(struct super_block *sb, struct ext4_group_info *grp)
> > >  {
> > >  	struct ext4_sb_info *sbi = EXT4_SB(sb);
> > > +	int new_order;
> > >  
> > >  	if (!test_opt2(sb, MB_OPTIMIZE_SCAN) || grp->bb_free == 0)
> > >  		return;
> > >  
> > > -	write_lock(&sbi->s_mb_rb_lock);
> > > -	if (!RB_EMPTY_NODE(&grp->bb_avg_fragment_size_rb)) {
> > > -		rb_erase(&grp->bb_avg_fragment_size_rb,
> > > -				&sbi->s_mb_avg_fragment_size_root);
> > > -		RB_CLEAR_NODE(&grp->bb_avg_fragment_size_rb);
> > > -	}
> > > +	new_order = mb_avg_fragment_size_order(sb,
> > > +					grp->bb_free / grp->bb_fragments);
> > 
> > The previous rbtree code was always checking grp->bb_fragments for 0.
> > Can grp->bb_fragments be 0 here?
> 
> Since grp->bb_free is greater than zero, there should be at least one
> fragment...

aah yes, right.

-ritesh

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/5] ext4: Avoid unnecessary spreading of allocations among groups
  2022-09-08  8:57     ` Jan Kara
@ 2022-09-08  9:24       ` Ritesh Harjani (IBM)
  0 siblings, 0 replies; 27+ messages in thread
From: Ritesh Harjani (IBM) @ 2022-09-08  9:24 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ted Tso, linux-ext4, Thorsten Leemhuis, Ojaswin Mujoo,
	Stefan Wahren, Andreas Dilger

On 22/09/08 10:57AM, Jan Kara wrote:
> On Wed 07-09-22 23:35:07, Ritesh Harjani (IBM) wrote:
> > On 22/09/06 05:29PM, Jan Kara wrote:
> > > mb_set_largest_free_order() updates lists containing groups with largest
> > > chunk of free space of given order. The way it updates it leads to
> > > always moving the group to the tail of the list. Thus allocations
> > > looking for free space of given order effectively end up cycling through
> > > all groups (and due to initialization in last to first order). This
> > > spreads allocations among block groups which reduces performance for
> > > rotating disks or low-end flash media. Change
> > > mb_set_largest_free_order() to only update lists if the order of the
> > > largest free chunk in the group changed.
> > 
> > Nice and clear explanation. Thanks :)
> > 
> > This change also looks good to me.
> > Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
> 
> Thanks for review!
> 
> > One other thought to further optimize - 
> > Will it make a difference if rather than adding the group to the tail of the list,
> > we add that group to the head of sbi->s_mb_largest_free_orders[new_order]. 
> > 
> > This is because this group is the latest from where blocks were allocated/freed,
> > and hence the next allocation should first try from this group in order to keep 
> > the files/extents blocks close to each other? 
> > (That sometimes might help with disk firmware to avoid doing discards if the freed 
> > block can be reused?)
> > 
> > Or will the goal block always cover that case by default so that we might never
> > require this? Maybe in the case of a new file within the same directory where
> > the goal group has no free blocks, but the last group attempted should be 
> > retried first?
> 
> So I was also wondering about this somewhat. I think that goal group will
> take care of keeping file data together so head/tail insertion should not
> matter too much for one file. Maybe if the allocation comes from a
> different inode, then the head/tail insertion matters but then it is not
> certain whether the allocation is actually related and what its order is
> (depending on that we might prefer same / different group) so I've decided
> to just keep things as they are. I agree it might be interesting to
> investigate and experiment with various workloads and see whether the
> head/tail insertion makes a difference for some workload but I think it's a
> separate project.
> 

Sure. Makes sense.
Thanks for still sharing your thoughts on it.

-ritesh

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 5/5] ext4: Use buckets for cr 1 block scan instead of rbtree
@ 2022-09-08 11:00     ` Dan Carpenter
  0 siblings, 0 replies; 27+ messages in thread
From: Dan Carpenter @ 2022-09-08 11:00 UTC (permalink / raw)
  To: kbuild, Jan Kara, Ted Tso
  Cc: lkp, kbuild-all, linux-ext4, Thorsten Leemhuis, Ojaswin Mujoo,
	Stefan Wahren, Andreas Dilger, Jan Kara

Hi Jan,

https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Jan-Kara/ext4-Fix-performance-regression-with-mballoc/20220907-000945
base:   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 53e99dcff61e1523ec1c3628b2d564ba15d32eb7
config: m68k-randconfig-m041-20220906 (https://download.01.org/0day-ci/archive/20220907/202209071206.u1iHKVzB-lkp@intel.com/config)
compiler: m68k-linux-gcc (GCC) 12.1.0

If you fix the issue, kindly add following tag where applicable
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>

New smatch warnings:
fs/ext4/mballoc.c:945 ext4_mb_choose_next_group_cr1() error: uninitialized symbol 'grp'.

vim +/grp +945 fs/ext4/mballoc.c

196e402adf2e4c Harshad Shirwadkar 2021-04-01  909  static void ext4_mb_choose_next_group_cr1(struct ext4_allocation_context *ac,
196e402adf2e4c Harshad Shirwadkar 2021-04-01  910  		int *new_cr, ext4_group_t *group, ext4_group_t ngroups)
196e402adf2e4c Harshad Shirwadkar 2021-04-01  911  {
196e402adf2e4c Harshad Shirwadkar 2021-04-01  912  	struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
31b571b608cf66 Jan Kara           2022-09-06  913  	struct ext4_group_info *grp, *iter;
31b571b608cf66 Jan Kara           2022-09-06  914  	int i;
196e402adf2e4c Harshad Shirwadkar 2021-04-01  915  
196e402adf2e4c Harshad Shirwadkar 2021-04-01  916  	if (unlikely(ac->ac_flags & EXT4_MB_CR1_OPTIMIZED)) {
196e402adf2e4c Harshad Shirwadkar 2021-04-01  917  		if (sbi->s_mb_stats)
196e402adf2e4c Harshad Shirwadkar 2021-04-01  918  			atomic_inc(&sbi->s_bal_cr1_bad_suggestions);
31b571b608cf66 Jan Kara           2022-09-06  919  	}
31b571b608cf66 Jan Kara           2022-09-06  920  
31b571b608cf66 Jan Kara           2022-09-06  921  	for (i = mb_avg_fragment_size_order(ac->ac_sb, ac->ac_g_ex.fe_len);
31b571b608cf66 Jan Kara           2022-09-06  922  	     i < MB_NUM_ORDERS(ac->ac_sb); i++) {
31b571b608cf66 Jan Kara           2022-09-06  923  		if (list_empty(&sbi->s_mb_avg_fragment_size[i]))
31b571b608cf66 Jan Kara           2022-09-06  924  			continue;
31b571b608cf66 Jan Kara           2022-09-06  925  		read_lock(&sbi->s_mb_avg_fragment_size_locks[i]);
31b571b608cf66 Jan Kara           2022-09-06  926  		if (list_empty(&sbi->s_mb_avg_fragment_size[i])) {
31b571b608cf66 Jan Kara           2022-09-06  927  			read_unlock(&sbi->s_mb_largest_free_orders_locks[i]);
31b571b608cf66 Jan Kara           2022-09-06  928  			continue;

Smatch worries that we can hit these two continues on every iteration.
Why not just initialize "grp = NULL;" at the start of the function? (A rough
sketch of that change follows the listing below.)

31b571b608cf66 Jan Kara           2022-09-06  929  		}
31b571b608cf66 Jan Kara           2022-09-06  930  		grp = NULL;
31b571b608cf66 Jan Kara           2022-09-06  931  		list_for_each_entry(iter, &sbi->s_mb_avg_fragment_size[i],
31b571b608cf66 Jan Kara           2022-09-06  932  				    bb_avg_fragment_size_node) {
196e402adf2e4c Harshad Shirwadkar 2021-04-01  933  			if (sbi->s_mb_stats)
196e402adf2e4c Harshad Shirwadkar 2021-04-01  934  				atomic64_inc(&sbi->s_bal_cX_groups_considered[1]);
31b571b608cf66 Jan Kara           2022-09-06  935  			if (likely(ext4_mb_good_group(ac, iter->bb_group, 1))) {
31b571b608cf66 Jan Kara           2022-09-06  936  				grp = iter;
196e402adf2e4c Harshad Shirwadkar 2021-04-01  937  				break;
196e402adf2e4c Harshad Shirwadkar 2021-04-01  938  			}
196e402adf2e4c Harshad Shirwadkar 2021-04-01  939  		}
31b571b608cf66 Jan Kara           2022-09-06  940  		read_unlock(&sbi->s_mb_avg_fragment_size_locks[i]);
31b571b608cf66 Jan Kara           2022-09-06  941  		if (grp)
31b571b608cf66 Jan Kara           2022-09-06  942  			break;
196e402adf2e4c Harshad Shirwadkar 2021-04-01  943  	}
196e402adf2e4c Harshad Shirwadkar 2021-04-01  944  
31b571b608cf66 Jan Kara           2022-09-06 @945  	if (grp) {
196e402adf2e4c Harshad Shirwadkar 2021-04-01  946  		*group = grp->bb_group;
196e402adf2e4c Harshad Shirwadkar 2021-04-01  947  		ac->ac_flags |= EXT4_MB_CR1_OPTIMIZED;
196e402adf2e4c Harshad Shirwadkar 2021-04-01  948  	} else {
196e402adf2e4c Harshad Shirwadkar 2021-04-01  949  		*new_cr = 2;
196e402adf2e4c Harshad Shirwadkar 2021-04-01  950  	}
196e402adf2e4c Harshad Shirwadkar 2021-04-01  951  }

-- 
0-DAY CI Kernel Test Service
https://01.org/lkp


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 5/5] ext4: Use buckets for cr 1 block scan instead of rbtree
  2022-09-08 11:00     ` Dan Carpenter
@ 2022-09-08 11:33       ` Jan Kara
  -1 siblings, 0 replies; 27+ messages in thread
From: Jan Kara @ 2022-09-08 11:33 UTC (permalink / raw)
  To: Dan Carpenter
  Cc: kbuild, Jan Kara, Ted Tso, lkp, kbuild-all, linux-ext4,
	Thorsten Leemhuis, Ojaswin Mujoo, Stefan Wahren, Andreas Dilger

On Thu 08-09-22 14:00:17, Dan Carpenter wrote:
> Hi Jan,
> 
> https://git-scm.com/docs/git-format-patch#_base_tree_information]
> 
> url:    https://github.com/intel-lab-lkp/linux/commits/Jan-Kara/ext4-Fix-performance-regression-with-mballoc/20220907-000945
> base:   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 53e99dcff61e1523ec1c3628b2d564ba15d32eb7
> config: m68k-randconfig-m041-20220906 (https://download.01.org/0day-ci/archive/20220907/202209071206.u1iHKVzB-lkp@intel.com/config)
> compiler: m68k-linux-gcc (GCC) 12.1.0
> 
> If you fix the issue, kindly add the following tag where applicable
> Reported-by: kernel test robot <lkp@intel.com>
> Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
> 
> New smatch warnings:
> fs/ext4/mballoc.c:945 ext4_mb_choose_next_group_cr1() error: uninitialized symbol 'grp'.
> 
> vim +/grp +945 fs/ext4/mballoc.c
> 
> 196e402adf2e4c Harshad Shirwadkar 2021-04-01  909  static void ext4_mb_choose_next_group_cr1(struct ext4_allocation_context *ac,
> 196e402adf2e4c Harshad Shirwadkar 2021-04-01  910  		int *new_cr, ext4_group_t *group, ext4_group_t ngroups)
> 196e402adf2e4c Harshad Shirwadkar 2021-04-01  911  {
> 196e402adf2e4c Harshad Shirwadkar 2021-04-01  912  	struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
> 31b571b608cf66 Jan Kara           2022-09-06  913  	struct ext4_group_info *grp, *iter;
> 31b571b608cf66 Jan Kara           2022-09-06  914  	int i;
> 196e402adf2e4c Harshad Shirwadkar 2021-04-01  915  
> 196e402adf2e4c Harshad Shirwadkar 2021-04-01  916  	if (unlikely(ac->ac_flags & EXT4_MB_CR1_OPTIMIZED)) {
> 196e402adf2e4c Harshad Shirwadkar 2021-04-01  917  		if (sbi->s_mb_stats)
> 196e402adf2e4c Harshad Shirwadkar 2021-04-01  918  			atomic_inc(&sbi->s_bal_cr1_bad_suggestions);
> 31b571b608cf66 Jan Kara           2022-09-06  919  	}
> 31b571b608cf66 Jan Kara           2022-09-06  920  
> 31b571b608cf66 Jan Kara           2022-09-06  921  	for (i = mb_avg_fragment_size_order(ac->ac_sb, ac->ac_g_ex.fe_len);
> 31b571b608cf66 Jan Kara           2022-09-06  922  	     i < MB_NUM_ORDERS(ac->ac_sb); i++) {
> 31b571b608cf66 Jan Kara           2022-09-06  923  		if (list_empty(&sbi->s_mb_avg_fragment_size[i]))
> 31b571b608cf66 Jan Kara           2022-09-06  924  			continue;
> 31b571b608cf66 Jan Kara           2022-09-06  925  		read_lock(&sbi->s_mb_avg_fragment_size_locks[i]);
> 31b571b608cf66 Jan Kara           2022-09-06  926  		if (list_empty(&sbi->s_mb_avg_fragment_size[i])) {
> 31b571b608cf66 Jan Kara           2022-09-06  927  			read_unlock(&sbi->s_mb_largest_free_orders_locks[i]);
> 31b571b608cf66 Jan Kara           2022-09-06  928  			continue;
> 
> Smatch worries that we can hit these two continues on every iteration.
> Why not just initialize "grp = NULL;" at the start of the function?

Yes, good point. It may even be possible in some rare racy corner case.
I'll do what you suggest. Thanks!

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 27+ messages in thread
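
For reference, the fix suggested above amounts to initializing the pointer at
its declaration in ext4_mb_choose_next_group_cr1(). A minimal sketch against
the quoted v2 code (the actual v3 patch may differ in detail):

-	struct ext4_group_info *grp, *iter;
+	struct ext4_group_info *grp = NULL, *iter;

With grp starting out as NULL, the "if (grp)" test at line 945 is well defined
even when every loop iteration takes one of the two continue paths (all
candidate lists empty, or a list emptied between the unlocked check and the
locked re-check), and the "grp = NULL;" inside the loop (line 930 above) would
then be redundant.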

* [PATCH 2/5] ext4: Avoid unnecessary spreading of allocations among groups
  2022-09-08  9:21 [PATCH 0/5 v3] " Jan Kara
@ 2022-09-08  9:21 ` Jan Kara
  0 siblings, 0 replies; 27+ messages in thread
From: Jan Kara @ 2022-09-08  9:21 UTC (permalink / raw)
  To: Ted Tso
  Cc: linux-ext4, Thorsten Leemhuis, Ojaswin Mujoo, Stefan Wahren,
	Harshad Shirwadkar, Jan Kara, stable, Ritesh Harjani

mb_set_largest_free_order() updates the lists that track block groups by
the order of their largest chunk of free space. The way it performs the
update always moves the group to the tail of its list. Thus allocations
looking for free space of a given order effectively end up cycling
through all groups (and, due to initialization, in last-to-first order).
This spreads allocations among block groups, which reduces performance
for rotating disks or low-end flash media. Change
mb_set_largest_free_order() to update the lists only if the order of the
largest free chunk in the group has changed.

Fixes: 196e402adf2e ("ext4: improve cr 0 / cr 1 group scanning")
CC: stable@vger.kernel.org
Reported-and-tested-by: Stefan Wahren <stefan.wahren@i2se.com>
Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/mballoc.c | 24 +++++++++++++-----------
 1 file changed, 13 insertions(+), 11 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 41e1cfecac3b..6251b4a6cc63 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -1077,23 +1077,25 @@ mb_set_largest_free_order(struct super_block *sb, struct ext4_group_info *grp)
 	struct ext4_sb_info *sbi = EXT4_SB(sb);
 	int i;
 
-	if (test_opt2(sb, MB_OPTIMIZE_SCAN) && grp->bb_largest_free_order >= 0) {
+	for (i = MB_NUM_ORDERS(sb) - 1; i >= 0; i--)
+		if (grp->bb_counters[i] > 0)
+			break;
+	/* No need to move between order lists? */
+	if (!test_opt2(sb, MB_OPTIMIZE_SCAN) ||
+	    i == grp->bb_largest_free_order) {
+		grp->bb_largest_free_order = i;
+		return;
+	}
+
+	if (grp->bb_largest_free_order >= 0) {
 		write_lock(&sbi->s_mb_largest_free_orders_locks[
 					      grp->bb_largest_free_order]);
 		list_del_init(&grp->bb_largest_free_order_node);
 		write_unlock(&sbi->s_mb_largest_free_orders_locks[
 					      grp->bb_largest_free_order]);
 	}
-	grp->bb_largest_free_order = -1; /* uninit */
-
-	for (i = MB_NUM_ORDERS(sb) - 1; i >= 0; i--) {
-		if (grp->bb_counters[i] > 0) {
-			grp->bb_largest_free_order = i;
-			break;
-		}
-	}
-	if (test_opt2(sb, MB_OPTIMIZE_SCAN) &&
-	    grp->bb_largest_free_order >= 0 && grp->bb_free) {
+	grp->bb_largest_free_order = i;
+	if (grp->bb_largest_free_order >= 0 && grp->bb_free) {
 		write_lock(&sbi->s_mb_largest_free_orders_locks[
 					      grp->bb_largest_free_order]);
 		list_add_tail(&grp->bb_largest_free_order_node,
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 27+ messages in thread
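
The effect described in the changelog above can be illustrated with a tiny
stand-alone toy model (plain C, not ext4 code; all names here are invented
for illustration): if the allocator always takes the head of an order list
and the update path unconditionally re-queues the chosen group at the tail,
successive allocations end up visiting every group in the list.

#include <stdio.h>

#define NGROUPS 4	/* groups sitting in one "largest free order" bucket */
#define NALLOCS 8	/* allocations served from this bucket */

int main(void)
{
	int list[NGROUPS] = {0, 1, 2, 3};	/* bucket contents, head first */
	int touched[NGROUPS] = {0};
	int i, a, n = 0;

	for (a = 0; a < NALLOCS; a++) {
		int g = list[0];	/* allocator picks the head of the bucket */

		touched[g] = 1;
		/* old behaviour: unconditional move-to-tail on every update */
		for (i = 0; i < NGROUPS - 1; i++)
			list[i] = list[i + 1];
		list[NGROUPS - 1] = g;
	}
	for (i = 0; i < NGROUPS; i++)
		n += touched[i];
	/* prints "4 of 4 groups touched"; if the group is only re-queued when
	 * its order changes, the head stays put and the answer is 1 of 4 */
	printf("%d of %d groups touched\n", n, NGROUPS);
	return 0;
}

With the patched mb_set_largest_free_order() the group is left where it is
unless its largest free order actually changed, so repeated allocations keep
hitting the same group instead of cycling through the whole list.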

end of thread, other threads:[~2022-09-08 11:34 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-09-06 15:29 [PATCH 0/5 v2] ext4: Fix performance regression with mballoc Jan Kara
2022-09-06 15:29 ` [PATCH 1/5] ext4: Make mballoc try target group first even with mb_optimize_scan Jan Kara
2022-09-07 17:43   ` Ritesh Harjani (IBM)
2022-09-06 15:29 ` [PATCH 2/5] ext4: Avoid unnecessary spreading of allocations among groups Jan Kara
2022-09-07 18:05   ` Ritesh Harjani (IBM)
2022-09-08  8:57     ` Jan Kara
2022-09-08  9:24       ` Ritesh Harjani (IBM)
2022-09-06 15:29 ` [PATCH 3/5] ext4: Make directory inode spreading reflect flexbg size Jan Kara
2022-09-06 15:29 ` [PATCH 4/5] ext4: Use locality group preallocation for small closed files Jan Kara
2022-09-07 18:25   ` Ritesh Harjani (IBM)
2022-09-06 15:29 ` [PATCH 5/5] ext4: Use buckets for cr 1 block scan instead of rbtree Jan Kara
2022-09-07  4:51   ` kernel test robot
2022-09-08 11:00     ` Dan Carpenter
2022-09-08 11:33       ` Jan Kara
2022-09-07 18:41   ` Ritesh Harjani (IBM)
2022-09-08  9:01     ` Jan Kara
2022-09-08  9:23       ` Ritesh Harjani (IBM)
2022-09-08  8:29   ` Ojaswin Mujoo
2022-09-08  9:03     ` Jan Kara
2022-09-06 20:38 ` [PATCH 0/5 v2] ext4: Fix performance regression with mballoc Stefan Wahren
2022-09-07 10:49   ` Jan Kara
2022-09-07 13:02     ` Jan Kara
2022-09-08  8:17 ` Ojaswin Mujoo
2022-09-08  9:12   ` Jan Kara
2022-09-08  9:21 [PATCH 0/5 v3] " Jan Kara
2022-09-08  9:21 ` [PATCH 2/5] ext4: Avoid unnecessary spreading of allocations among groups Jan Kara

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.