* [PATCH 0/5 v3] ext4: Fix performance regression with mballoc
From: Jan Kara @ 2022-09-08  9:21 UTC
To: Ted Tso
Cc: linux-ext4, Thorsten Leemhuis, Ojaswin Mujoo, Stefan Wahren,
    Harshad Shirwadkar, Jan Kara

Hello,

Here is the third version of my mballoc improvements to avoid spreading
allocations with mb_optimize_scan=1. Since v2 there are only small changes
and fixes found during review and testing. Overall the series looks mostly
ready to go; I just didn't see any comments regarding patch 3 - a fix of
metabg handling in the Orlov allocator which is kind of independent, I've
just found it when reading the code. Also patch 5 needs final review after
all the fixes.

Changes since v1:
- reworked data structure for CR 1 scan
- make small closed files use locality group preallocation
- fix metabg handling in the Orlov allocator

Changes since v2:
- whitespace fixes
- fix outdated comment
- fix handling of mb_structs_summary procfs file
- fix bad unlock on error recovery path

Original cover letter:

The patches fix the performance regression I was able to reproduce with
reaim on my test machine:

                     mb_optimize_scan=0     mb_optimize_scan=1     patched
Hmean disk-1       2076.12 (   0.00%)     2099.37 (   1.12%)     2032.52 (  -2.10%)
Hmean disk-41     92481.20 (   0.00%)    83787.47 *  -9.40%*    90308.37 (  -2.35%)
Hmean disk-81    155073.39 (   0.00%)   135527.05 * -12.60%*   154285.71 (  -0.51%)
Hmean disk-121   185109.64 (   0.00%)   166284.93 * -10.17%*   185298.62 (   0.10%)
Hmean disk-161   229890.53 (   0.00%)   207563.39 *  -9.71%*   232883.32 *   1.30%*
Hmean disk-201   223333.33 (   0.00%)   203235.59 *  -9.00%*   221446.93 (  -0.84%)
Hmean disk-241   235735.25 (   0.00%)   217705.51 *  -7.65%*   239483.27 *   1.59%*
Hmean disk-281   266772.15 (   0.00%)   241132.72 *  -9.61%*   263108.62 (  -1.37%)
Hmean disk-321   265435.50 (   0.00%)   245412.84 *  -7.54%*   267277.27 (   0.69%)

The changes also significantly reduce spreading of allocations for small /
moderately sized files. I'm not able to measure a performance difference
resulting from this but on eMMC storage this seems to be the main culprit
of reduced performance. Untarring of raspberry-pi archive touches the
following numbers of groups:

         mb_optimize_scan=0     mb_optimize_scan=1     patched
groups           4                      22                7

To achieve this I have added two more changes on top of v1 - patches 4 and
5. Patch 4 makes sure we use locality group preallocation even for files
that are not likely to grow anymore (previously we have disabled all
preallocations for such files, however locality group preallocation still
makes a lot of sense for such files). This patch reduced spread of small
file allocations but larger file allocations were still spread
significantly because they avoid locality group preallocation and, as they
are not power-of-two in size, they also immediately start with the cr=1
scan. To address that I've changed the data structure for looking up the
best block group to allocate from (see patch 5 for details).

								Honza

Previous versions:
Link: http://lore.kernel.org/r/20220823134508.27854-1-jack@suse.cz # v1
Link: http://lore.kernel.org/r/20220906150803.375-1-jack@suse.cz # v2
* [PATCH 1/5] ext4: Make mballoc try target group first even with mb_optimize_scan
From: Jan Kara @ 2022-09-08  9:21 UTC
To: Ted Tso
Cc: linux-ext4, Thorsten Leemhuis, Ojaswin Mujoo, Stefan Wahren,
    Harshad Shirwadkar, Jan Kara, stable, Ritesh Harjani

One of the side-effects of mb_optimize_scan was that the optimized
functions to select next group to try were called even before we tried
the goal group. As a result we no longer allocate files close to
corresponding inodes as well as we don't try to expand currently
allocated extent in the same group. This results in reaim regression
with workfile.disk workload of up to 8% with many clients on my test
machine:

                            baseline               mb_optimize_scan
Hmean disk-1       2114.16 (   0.00%)     2099.37 (  -0.70%)
Hmean disk-41     87794.43 (   0.00%)    83787.47 *  -4.56%*
Hmean disk-81    148170.73 (   0.00%)   135527.05 *  -8.53%*
Hmean disk-121   177506.11 (   0.00%)   166284.93 *  -6.32%*
Hmean disk-161   220951.51 (   0.00%)   207563.39 *  -6.06%*
Hmean disk-201   208722.74 (   0.00%)   203235.59 (  -2.63%)
Hmean disk-241   222051.60 (   0.00%)   217705.51 (  -1.96%)
Hmean disk-281   252244.17 (   0.00%)   241132.72 *  -4.41%*
Hmean disk-321   255844.84 (   0.00%)   245412.84 *  -4.08%*

Also this is causing huge regression (time increased by a factor of 5 or
so) when untarring archive with lots of small files on some eMMC storage
cards.

Fix the problem by making sure we try goal group first.

Fixes: 196e402adf2e ("ext4: improve cr 0 / cr 1 group scanning")
CC: stable@vger.kernel.org
Reported-and-tested-by: Stefan Wahren <stefan.wahren@i2se.com>
Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://lore.kernel.org/all/20220727105123.ckwrhbilzrxqpt24@quack3/
Link: https://lore.kernel.org/all/0d81a7c2-46b7-6010-62a4-3e6cfc1628d6@i2se.com/
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/mballoc.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index bd8f8b5c3d30..41e1cfecac3b 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -1049,8 +1049,10 @@ static void ext4_mb_choose_next_group(struct ext4_allocation_context *ac,
 {
         *new_cr = ac->ac_criteria;
 
-        if (!should_optimize_scan(ac) || ac->ac_groups_linear_remaining)
+        if (!should_optimize_scan(ac) || ac->ac_groups_linear_remaining) {
+                *group = next_linear_group(ac, *group, ngroups);
                 return;
+        }
 
         if (*new_cr == 0) {
                 ext4_mb_choose_next_group_cr0(ac, new_cr, group, ngroups);
@@ -2636,7 +2638,7 @@ static noinline_for_stack int
 ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 {
         ext4_group_t prefetch_grp = 0, ngroups, group, i;
-        int cr = -1;
+        int cr = -1, new_cr;
         int err = 0, first_err = 0;
         unsigned int nr = 0, prefetch_ios = 0;
         struct ext4_sb_info *sbi;
@@ -2711,13 +2713,11 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
         ac->ac_groups_linear_remaining = sbi->s_mb_max_linear_groups;
         prefetch_grp = group;
 
-        for (i = 0; i < ngroups; group = next_linear_group(ac, group, ngroups),
-                     i++) {
-                int ret = 0, new_cr;
+        for (i = 0, new_cr = cr; i < ngroups; i++,
+             ext4_mb_choose_next_group(ac, &new_cr, &group, ngroups)) {
+                int ret = 0;
 
                 cond_resched();
-
-                ext4_mb_choose_next_group(ac, &new_cr, &group, ngroups);
                 if (new_cr != cr) {
                         cr = new_cr;
                         goto repeat;
-- 
2.35.3
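To make the control-flow change concrete, here is a minimal userspace C
sketch of the two loop shapes (NGROUPS, GOAL and choose_next_group() are
illustrative stand-ins, not the kernel API): in the buggy shape the group
chooser runs before the first scan, so the goal group is skipped; in the
fixed shape the chooser runs in the loop's continuation expression, so the
goal group is scanned first.

#include <stdio.h>

#define NGROUPS 8
#define GOAL    3       /* hypothetical goal group near the inode */

/* Toy stand-in for ext4_mb_choose_next_group(): just walks linearly. */
static int choose_next_group(int group)
{
        return (group + 1) % NGROUPS;
}

int main(void)
{
        int group, i;

        /* Buggy shape: the chooser runs before the goal group is tried. */
        printf("before fix:");
        group = GOAL;
        for (i = 0; i < NGROUPS; i++) {
                group = choose_next_group(group);       /* skips GOAL */
                printf(" %d", group);
        }

        /* Fixed shape: scan first, advance at the end of the iteration. */
        printf("\nafter fix: ");
        for (i = 0, group = GOAL; i < NGROUPS;
             i++, group = choose_next_group(group))
                printf(" %d", group);
        printf("\n");
        return 0;
}

Running it prints "4 5 6 7 0 1 2 3" for the buggy shape and "3 4 5 6 7 0 1
2" for the fixed one, i.e. only the fixed loop visits the goal group first.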
* Re: [PATCH 1/5] ext4: Make mballoc try target group first even with mb_optimize_scan
From: Theodore Ts'o @ 2022-09-22  2:52 UTC
To: jack
Cc: Theodore Ts'o, harshadshirwadkar, stable, ritesh.list, stefan.wahren,
    ojaswin, linux-ext4, regressions

On Thu, 8 Sep 2022 11:21:24 +0200, Jan Kara wrote:
> One of the side-effects of mb_optimize_scan was that the optimized
> functions to select next group to try were called even before we tried
> the goal group. As a result we no longer allocate files close to
> corresponding inodes as well as we don't try to expand currently
> allocated extent in the same group. This results in reaim regression
> with workfile.disk workload of up to 8% with many clients on my test
> machine:
>
> [...]

Applied, thanks!

[1/5] ext4: Make mballoc try target group first even with mb_optimize_scan
      commit: 4fca50d440cc5d4dc570ad5484cc0b70b381bc2a
[2/5] ext4: Avoid unnecessary spreading of allocations among groups
      commit: 1940265ede6683f6317cba0d428ce6505eaca944
[3/5] ext4: Make directory inode spreading reflect flexbg size
      commit: 613c5a85898d1cd44e68f28d65eccf64a8ace9cf
[4/5] ext4: Use locality group preallocation for small closed files
      commit: a9f2a2931d0e197ab28c6007966053fdababd53f
[5/5] ext4: Use buckets for cr 1 block scan instead of rbtree
      commit: 83e80a6e3543f37f74c8e48a5f305b054b65ce2a

Best regards,
-- 
Theodore Ts'o <tytso@mit.edu>
* Re: [PATCH 1/5] ext4: Make mballoc try target group first even with mb_optimize_scan
From: Jan Kara @ 2022-09-22  9:15 UTC
To: Theodore Ts'o
Cc: jack, harshadshirwadkar, stable, ritesh.list, stefan.wahren, ojaswin,
    linux-ext4, regressions

[-- Attachment #1: Type: text/plain, Size: 1458 bytes --]

On Wed 21-09-22 22:52:34, Theodore Ts'o wrote:
> On Thu, 8 Sep 2022 11:21:24 +0200, Jan Kara wrote:
> > One of the side-effects of mb_optimize_scan was that the optimized
> > functions to select next group to try were called even before we tried
> > the goal group. As a result we no longer allocate files close to
> > corresponding inodes as well as we don't try to expand currently
> > allocated extent in the same group. This results in reaim regression
> > with workfile.disk workload of up to 8% with many clients on my test
> > machine:
> >
> > [...]
>
> Applied, thanks!
>
> [1/5] ext4: Make mballoc try target group first even with mb_optimize_scan
>       commit: 4fca50d440cc5d4dc570ad5484cc0b70b381bc2a
> [2/5] ext4: Avoid unnecessary spreading of allocations among groups
>       commit: 1940265ede6683f6317cba0d428ce6505eaca944
> [3/5] ext4: Make directory inode spreading reflect flexbg size
>       commit: 613c5a85898d1cd44e68f28d65eccf64a8ace9cf
> [4/5] ext4: Use locality group preallocation for small closed files
>       commit: a9f2a2931d0e197ab28c6007966053fdababd53f
> [5/5] ext4: Use buckets for cr 1 block scan instead of rbtree
>       commit: 83e80a6e3543f37f74c8e48a5f305b054b65ce2a

Thanks Ted! I just have locally a small fixup to the series that was
reported by Smatch. It is attached; either fold it into the last patch or
just merge it as a separate patch. Thanks!

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

[-- Attachment #2: 0001-ext4-Fixup-possible-uninitialized-variable-access-in.patch --]
[-- Type: text/x-patch, Size: 1475 bytes --]

From 8885b11fb253e08ecfa90a28beffb01719af84f5 Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@suse.cz>
Date: Thu, 22 Sep 2022 11:09:29 +0200
Subject: [PATCH] ext4: Fixup possible uninitialized variable access in
 ext4_mb_choose_next_group_cr1()

Variable 'grp' may be left uninitialized if there's no group with
suitable average fragment size (or larger). Fix the problem by
initializing it earlier.

Fixes: 83e80a6e3543 ("ext4: use buckets for cr 1 block scan instead of rbtree")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/mballoc.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 71f5b67d7f28..9dad93059945 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -910,7 +910,7 @@ static void ext4_mb_choose_next_group_cr1(struct ext4_allocation_context *ac,
                 int *new_cr, ext4_group_t *group, ext4_group_t ngroups)
 {
         struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
-        struct ext4_group_info *grp, *iter;
+        struct ext4_group_info *grp = NULL, *iter;
         int i;
 
         if (unlikely(ac->ac_flags & EXT4_MB_CR1_OPTIMIZED)) {
@@ -927,7 +927,6 @@ static void ext4_mb_choose_next_group_cr1(struct ext4_allocation_context *ac,
                         read_unlock(&sbi->s_mb_avg_fragment_size_locks[i]);
                         continue;
                 }
-                grp = NULL;
                 list_for_each_entry(iter, &sbi->s_mb_avg_fragment_size[i],
                                     bb_avg_fragment_size_node) {
                         if (sbi->s_mb_stats)
-- 
2.35.3
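For readers unfamiliar with this class of Smatch warning, here is a hedged,
minimal reproduction of the bug shape (pure illustration, not the kernel
code): if every iteration takes the continue branch, the pre-fix code never
reaches its in-loop "grp = NULL;" assignment, so the final "if (grp)" test
would read an uninitialized pointer. Hoisting the initialization to the
declaration makes the fall-through path well defined.

#include <stddef.h>
#include <stdio.h>

struct group { int good; };

static struct group *pick(struct group *buckets, int n)
{
        struct group *grp = NULL;       /* the fix: initialize up front */
        int i;

        for (i = 0; i < n; i++) {
                if (!buckets[i].good)
                        continue;       /* pre-fix code only set grp = NULL
                                         * past this point, so grp stayed
                                         * uninitialized when every iteration
                                         * took this branch */
                grp = &buckets[i];
                break;
        }
        return grp;     /* well-defined even when nothing matched */
}

int main(void)
{
        struct group buckets[3] = { {0}, {0}, {0} };

        printf("picked: %p\n", (void *)pick(buckets, 3));
        return 0;
}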
* Re: [PATCH 1/5] ext4: Make mballoc try target group first even with mb_optimize_scan
From: Jan Kara @ 2022-09-26  9:11 UTC
To: Theodore Ts'o
Cc: jack, harshadshirwadkar, stable, ritesh.list, stefan.wahren, ojaswin,
    linux-ext4, regressions

On Thu 22-09-22 11:15:42, Jan Kara wrote:
> On Wed 21-09-22 22:52:34, Theodore Ts'o wrote:
> > On Thu, 8 Sep 2022 11:21:24 +0200, Jan Kara wrote:
> > > One of the side-effects of mb_optimize_scan was that the optimized
> > > functions to select next group to try were called even before we tried
> > > the goal group. As a result we no longer allocate files close to
> > > corresponding inodes as well as we don't try to expand currently
> > > allocated extent in the same group. This results in reaim regression
> > > with workfile.disk workload of up to 8% with many clients on my test
> > > machine:
> > >
> > > [...]
> >
> > Applied, thanks!
> >
> > [1/5] ext4: Make mballoc try target group first even with mb_optimize_scan
> >       commit: 4fca50d440cc5d4dc570ad5484cc0b70b381bc2a
> > [2/5] ext4: Avoid unnecessary spreading of allocations among groups
> >       commit: 1940265ede6683f6317cba0d428ce6505eaca944
> > [3/5] ext4: Make directory inode spreading reflect flexbg size
> >       commit: 613c5a85898d1cd44e68f28d65eccf64a8ace9cf
> > [4/5] ext4: Use locality group preallocation for small closed files
> >       commit: a9f2a2931d0e197ab28c6007966053fdababd53f
> > [5/5] ext4: Use buckets for cr 1 block scan instead of rbtree
> >       commit: 83e80a6e3543f37f74c8e48a5f305b054b65ce2a
>
> Thanks Ted! I just have locally a small fixup to the series that was
> reported by Smatch. It is attached; either fold it into the last patch or
> just merge it as a separate patch. Thanks!

Ted, I've noticed you've merged my mballoc fixes (and pushed them to
Linus) without this fixup. Can you please merge it? The use of an
uninitialized variable seems rare but possible...

								Honza

> From 8885b11fb253e08ecfa90a28beffb01719af84f5 Mon Sep 17 00:00:00 2001
> From: Jan Kara <jack@suse.cz>
> Date: Thu, 22 Sep 2022 11:09:29 +0200
> Subject: [PATCH] ext4: Fixup possible uninitialized variable access in
>  ext4_mb_choose_next_group_cr1()
>
> Variable 'grp' may be left uninitialized if there's no group with
> suitable average fragment size (or larger). Fix the problem by
> initializing it earlier.
>
> Fixes: 83e80a6e3543 ("ext4: use buckets for cr 1 block scan instead of rbtree")
> Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/ext4/mballoc.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index 71f5b67d7f28..9dad93059945 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -910,7 +910,7 @@ static void ext4_mb_choose_next_group_cr1(struct ext4_allocation_context *ac,
>                  int *new_cr, ext4_group_t *group, ext4_group_t ngroups)
>  {
>          struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
> -        struct ext4_group_info *grp, *iter;
> +        struct ext4_group_info *grp = NULL, *iter;
>          int i;
>
>          if (unlikely(ac->ac_flags & EXT4_MB_CR1_OPTIMIZED)) {
> @@ -927,7 +927,6 @@ static void ext4_mb_choose_next_group_cr1(struct ext4_allocation_context *ac,
>                          read_unlock(&sbi->s_mb_avg_fragment_size_locks[i]);
>                          continue;
>                  }
> -                grp = NULL;
>                  list_for_each_entry(iter, &sbi->s_mb_avg_fragment_size[i],
>                                      bb_avg_fragment_size_node) {
>                          if (sbi->s_mb_stats)
> -- 
> 2.35.3

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH 1/5] ext4: Make mballoc try target group first even with mb_optimize_scan
From: Theodore Ts'o @ 2022-09-26 17:32 UTC
To: Jan Kara
Cc: harshadshirwadkar, stable, ritesh.list, stefan.wahren, ojaswin,
    linux-ext4, regressions

On Mon, Sep 26, 2022 at 11:11:26AM +0200, Jan Kara wrote:
>
> Ted, I've noticed you've merged my mballoc fixes (and pushed them to
> Linus) without this fixup. Can you please merge it? The use of an
> uninitialized variable seems rare but possible...

Oops, sorry, my bad. I had forgotten to merge the fix before I pushed
them to Linus. I'm preparing a follow-up push to Linus right now. Once
a smoke test passes, I'll send out the pull request.

					- Ted
* [PATCH 2/5] ext4: Avoid unnecessary spreading of allocations among groups
From: Jan Kara @ 2022-09-08  9:21 UTC
To: Ted Tso
Cc: linux-ext4, Thorsten Leemhuis, Ojaswin Mujoo, Stefan Wahren,
    Harshad Shirwadkar, Jan Kara, stable, Ritesh Harjani

mb_set_largest_free_order() updates lists containing groups with largest
chunk of free space of given order. The way it updates these lists leads
to always moving the group to the tail of the list. Thus allocations
looking for free space of given order effectively end up cycling through
all groups (and, due to how the lists are initialized, in last-to-first
order). This spreads allocations among block groups, which reduces
performance for rotating disks or low-end flash media. Change
mb_set_largest_free_order() to only update lists if the order of the
largest free chunk in the group changed.

Fixes: 196e402adf2e ("ext4: improve cr 0 / cr 1 group scanning")
CC: stable@vger.kernel.org
Reported-and-tested-by: Stefan Wahren <stefan.wahren@i2se.com>
Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/mballoc.c | 24 +++++++++++++-----------
 1 file changed, 13 insertions(+), 11 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 41e1cfecac3b..6251b4a6cc63 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -1077,23 +1077,25 @@ mb_set_largest_free_order(struct super_block *sb, struct ext4_group_info *grp)
         struct ext4_sb_info *sbi = EXT4_SB(sb);
         int i;
 
-        if (test_opt2(sb, MB_OPTIMIZE_SCAN) && grp->bb_largest_free_order >= 0) {
+        for (i = MB_NUM_ORDERS(sb) - 1; i >= 0; i--)
+                if (grp->bb_counters[i] > 0)
+                        break;
+        /* No need to move between order lists? */
+        if (!test_opt2(sb, MB_OPTIMIZE_SCAN) ||
+            i == grp->bb_largest_free_order) {
+                grp->bb_largest_free_order = i;
+                return;
+        }
+
+        if (grp->bb_largest_free_order >= 0) {
                 write_lock(&sbi->s_mb_largest_free_orders_locks[
                                               grp->bb_largest_free_order]);
                 list_del_init(&grp->bb_largest_free_order_node);
                 write_unlock(&sbi->s_mb_largest_free_orders_locks[
                                               grp->bb_largest_free_order]);
         }
-        grp->bb_largest_free_order = -1; /* uninit */
-
-        for (i = MB_NUM_ORDERS(sb) - 1; i >= 0; i--) {
-                if (grp->bb_counters[i] > 0) {
-                        grp->bb_largest_free_order = i;
-                        break;
-                }
-        }
-        if (test_opt2(sb, MB_OPTIMIZE_SCAN) &&
-            grp->bb_largest_free_order >= 0 && grp->bb_free) {
+        grp->bb_largest_free_order = i;
+        if (grp->bb_largest_free_order >= 0 && grp->bb_free) {
                 write_lock(&sbi->s_mb_largest_free_orders_locks[
                                               grp->bb_largest_free_order]);
                 list_add_tail(&grp->bb_largest_free_order_node,
-- 
2.35.3
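To see why the relink-only-on-change rule matters, here is a minimal,
self-contained C sketch of the pattern (the intrusive list is hand-rolled
for the demo; the kernel uses <linux/list.h>, and update_order() is an
illustrative stand-in for mb_set_largest_free_order(), not the real
function): re-adding a group to the tail of its order list on every
free/alloc rotates the list, so scans keep landing on different groups,
while the fixed pattern leaves the list untouched for same-order updates.

#include <stdio.h>

#define NUM_ORDERS 4

struct group {
        int order;                      /* current bucket, -1 = unlisted */
        struct group *prev, *next;      /* intrusive circular list node */
};

static struct group heads[NUM_ORDERS];  /* one list head per order */

static void list_init(struct group *h) { h->prev = h->next = h; }

static void list_del(struct group *g)
{
        g->prev->next = g->next;
        g->next->prev = g->prev;
}

static void list_add_tail(struct group *g, struct group *h)
{
        g->prev = h->prev;
        g->next = h;
        h->prev->next = g;
        h->prev = g;
}

/* The fixed pattern: only relink when the bucket actually changes, so
 * same-order updates cause no list churn and the scan order stays stable. */
static void update_order(struct group *grp, int new_order)
{
        if (new_order == grp->order)
                return;
        if (grp->order >= 0)
                list_del(grp);
        grp->order = new_order;
        list_add_tail(grp, &heads[new_order]);
}

int main(void)
{
        struct group g = { .order = -1 };
        int i;

        for (i = 0; i < NUM_ORDERS; i++)
                list_init(&heads[i]);

        update_order(&g, 2);    /* moves into bucket 2 */
        update_order(&g, 2);    /* no-op: same bucket, no tail rotation */
        update_order(&g, 3);    /* moves to bucket 3 */
        printf("final order: %d\n", g.order);
        return 0;
}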
* [PATCH 3/5] ext4: Make directory inode spreading reflect flexbg size
From: Jan Kara @ 2022-09-08  9:21 UTC
To: Ted Tso
Cc: linux-ext4, Thorsten Leemhuis, Ojaswin Mujoo, Stefan Wahren,
    Harshad Shirwadkar, Jan Kara

Currently the Orlov inode allocator searches for free inodes for a
directory only in flex block groups with at most inodes_per_group/16
more directory inodes than average per flex block group. However with
growing size of flex block group this becomes unnecessarily strict.
Scale allowed difference from average directory count per flex block
group with flex block group size as we do with other metrics.

Tested-by: Stefan Wahren <stefan.wahren@i2se.com>
Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/ialloc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index f73e5eb43eae..208b87ce8858 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -510,7 +510,7 @@ static int find_group_orlov(struct super_block *sb, struct inode *parent,
                         goto fallback;
         }
 
-        max_dirs = ndirs / ngroups + inodes_per_group / 16;
+        max_dirs = ndirs / ngroups + inodes_per_group*flex_size / 16;
         min_inodes = avefreei - inodes_per_group*flex_size / 4;
         if (min_inodes < 1)
                 min_inodes = 1;
-- 
2.35.3
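As a hedged, worked example of the threshold change (all numbers below are
made up for illustration, not taken from a real filesystem): with
inodes_per_group = 8192 and flex_size = 16, the old slack of
inodes_per_group/16 = 512 extra directories per flex block group is tiny
compared to the 16 * 8192 = 131072 inodes the flex block group actually
holds, whereas scaling by flex_size raises the slack to 8192, matching how
min_inodes already scales in the line below it.

#include <stdio.h>

int main(void)
{
        /* Illustrative values, not read from a real filesystem. */
        unsigned long inodes_per_group = 8192;
        unsigned long flex_size = 16;   /* groups per flex block group */
        unsigned long ndirs = 100000, ngroups = 1024;

        /* Old vs. new allowed directory count per flex block group. */
        unsigned long old_max = ndirs / ngroups + inodes_per_group / 16;
        unsigned long new_max = ndirs / ngroups +
                                inodes_per_group * flex_size / 16;

        printf("allowed dirs per flex bg: old=%lu new=%lu\n",
               old_max, new_max);      /* prints old=609 new=8289 */
        return 0;
}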
* [PATCH 4/5] ext4: Use locality group preallocation for small closed files
From: Jan Kara @ 2022-09-08  9:21 UTC
To: Ted Tso
Cc: linux-ext4, Thorsten Leemhuis, Ojaswin Mujoo, Stefan Wahren,
    Harshad Shirwadkar, Jan Kara, stable, Ritesh Harjani

Currently we don't use any preallocation when a file is already closed
when allocating blocks (from writeback code when converting delayed
allocation). However for small files, using locality group preallocation
is actually desirable as that is not specific to a particular file.
Rather it is a method to pack small files together to reduce
fragmentation and for that the fact the file is closed is actually an
even stronger hint the file would benefit from packing. So change the
logic to allow locality group preallocation in this case.

Fixes: 196e402adf2e ("ext4: improve cr 0 / cr 1 group scanning")
CC: stable@vger.kernel.org
Reported-and-tested-by: Stefan Wahren <stefan.wahren@i2se.com>
Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/mballoc.c | 27 +++++++++++++++------------
 1 file changed, 15 insertions(+), 12 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 6251b4a6cc63..af1e49c3603f 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -5195,6 +5195,7 @@ static void ext4_mb_group_or_file(struct ext4_allocation_context *ac)
         struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
         int bsbits = ac->ac_sb->s_blocksize_bits;
         loff_t size, isize;
+        bool inode_pa_eligible, group_pa_eligible;
 
         if (!(ac->ac_flags & EXT4_MB_HINT_DATA))
                 return;
@@ -5202,25 +5203,27 @@ static void ext4_mb_group_or_file(struct ext4_allocation_context *ac)
         if (unlikely(ac->ac_flags & EXT4_MB_HINT_GOAL_ONLY))
                 return;
 
+        group_pa_eligible = sbi->s_mb_group_prealloc > 0;
+        inode_pa_eligible = true;
         size = ac->ac_o_ex.fe_logical + EXT4_C2B(sbi, ac->ac_o_ex.fe_len);
         isize = (i_size_read(ac->ac_inode) + ac->ac_sb->s_blocksize - 1)
                 >> bsbits;
 
+        /* No point in using inode preallocation for closed files */
         if ((size == isize) && !ext4_fs_is_busy(sbi) &&
-            !inode_is_open_for_write(ac->ac_inode)) {
-                ac->ac_flags |= EXT4_MB_HINT_NOPREALLOC;
-                return;
-        }
+            !inode_is_open_for_write(ac->ac_inode))
+                inode_pa_eligible = false;
 
-        if (sbi->s_mb_group_prealloc <= 0) {
-                ac->ac_flags |= EXT4_MB_STREAM_ALLOC;
-                return;
-        }
-
-        /* don't use group allocation for large files */
         size = max(size, isize);
-        if (size > sbi->s_mb_stream_request) {
-                ac->ac_flags |= EXT4_MB_STREAM_ALLOC;
+        /* Don't use group allocation for large files */
+        if (size > sbi->s_mb_stream_request)
+                group_pa_eligible = false;
+
+        if (!group_pa_eligible) {
+                if (inode_pa_eligible)
+                        ac->ac_flags |= EXT4_MB_STREAM_ALLOC;
+                else
+                        ac->ac_flags |= EXT4_MB_HINT_NOPREALLOC;
                 return;
         }
-- 
2.35.3
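The rewritten ext4_mb_group_or_file() effectively implements a small
decision table. Below is a hedged userspace sketch of that table (choose()
and its boolean inputs are illustrative simplifications, not the kernel
function, which derives these predicates from i_size, s_mb_stream_request
and s_mb_group_prealloc):

#include <stdbool.h>
#include <stdio.h>

enum policy { GROUP_PA, STREAM_ALLOC, NOPREALLOC };

static enum policy choose(bool small_file, bool file_closed,
                          bool group_pa_enabled)
{
        bool group_pa_eligible = group_pa_enabled && small_file;
        bool inode_pa_eligible = !file_closed;  /* closed files won't grow */

        if (group_pa_eligible)
                return GROUP_PA;        /* pack small files together */
        if (inode_pa_eligible)
                return STREAM_ALLOC;    /* per-inode preallocation */
        return NOPREALLOC;              /* large + closed: no prealloc */
}

int main(void)
{
        /* The case patch 4 changes: a small, closed file now gets locality
         * group preallocation instead of no preallocation at all. */
        printf("small+closed -> %d (0 == GROUP_PA)\n",
               choose(true, true, true));
        return 0;
}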
* [PATCH 5/5] ext4: Use buckets for cr 1 block scan instead of rbtree
From: Jan Kara @ 2022-09-08  9:21 UTC
To: Ted Tso
Cc: linux-ext4, Thorsten Leemhuis, Ojaswin Mujoo, Stefan Wahren,
    Harshad Shirwadkar, Jan Kara, stable

Using rbtree for sorting groups by average fragment size is relatively
expensive (needs rbtree update on every block freeing or allocation) and
leads to wide spreading of allocations because selection of block group
is very sensitive both to changes in free space and amount of blocks
allocated. Furthermore selecting group with the best matching average
fragment size is not necessary anyway, even more so because the
variability of fragment sizes within a group is likely large so average
is not telling much. We just need a group with large enough average
fragment size so that we have high probability of finding large enough
free extent and we don't want average fragment size to be too big so
that we are likely to find free extent only somewhat larger than what we
need.

So instead of maintaining rbtree of groups sorted by fragment size keep
bins (lists) of groups where average fragment size is in the interval
[2^i, 2^(i+1)). This structure requires fewer updates on block allocation
/ freeing, generally avoids chaotic spreading of allocations into block
groups, and still is able to quickly (even faster than the rbtree)
provide a block group which is likely to have a suitably sized free
space extent.

This patch reduces number of block groups used when untarring archive
with medium sized files (size somewhat above 64k which is default
mballoc limit for avoiding locality group preallocation) to about half
and thus improves write speeds for eMMC flash significantly.

Fixes: 196e402adf2e ("ext4: improve cr 0 / cr 1 group scanning")
CC: stable@vger.kernel.org
Reported-and-tested-by: Stefan Wahren <stefan.wahren@i2se.com>
Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/ext4.h    |  10 +-
 fs/ext4/mballoc.c | 249 ++++++++++++++++++++--------------------------
 fs/ext4/mballoc.h |   1 -
 3 files changed, 111 insertions(+), 149 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 9bca5565547b..3bf9a6926798 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -167,8 +167,6 @@ enum SHIFT_DIRECTION {
 #define EXT4_MB_CR0_OPTIMIZED           0x8000
 /* Avg fragment size rb tree lookup succeeded at least once for cr = 1 */
 #define EXT4_MB_CR1_OPTIMIZED           0x00010000
-/* Perform linear traversal for one group */
-#define EXT4_MB_SEARCH_NEXT_LINEAR      0x00020000
 struct ext4_allocation_request {
         /* target inode for block we're allocating */
         struct inode *inode;
@@ -1600,8 +1598,8 @@ struct ext4_sb_info {
         struct list_head s_discard_list;
         struct work_struct s_discard_work;
         atomic_t s_retry_alloc_pending;
-        struct rb_root s_mb_avg_fragment_size_root;
-        rwlock_t s_mb_rb_lock;
+        struct list_head *s_mb_avg_fragment_size;
+        rwlock_t *s_mb_avg_fragment_size_locks;
         struct list_head *s_mb_largest_free_orders;
         rwlock_t *s_mb_largest_free_orders_locks;
@@ -3413,6 +3411,8 @@ struct ext4_group_info {
         ext4_grpblk_t   bb_first_free;  /* first free block */
         ext4_grpblk_t   bb_free;        /* total free blocks */
         ext4_grpblk_t   bb_fragments;   /* nr of freespace fragments */
+        int             bb_avg_fragment_size_order;     /* order of average
+                                                           fragment in BG */
         ext4_grpblk_t   bb_largest_free_order;/* order of largest frag in BG */
         ext4_group_t    bb_group;       /* Group number */
         struct list_head bb_prealloc_list;
@@ -3420,7 +3420,7 @@ struct ext4_group_info {
         void            *bb_bitmap;
 #endif
         struct rw_semaphore alloc_sem;
-        struct rb_node  bb_avg_fragment_size_rb;
+        struct list_head bb_avg_fragment_size_node;
         struct list_head bb_largest_free_order_node;
         ext4_grpblk_t   bb_counters[];  /* Nr of free power-of-two-block
                                          * regions, index is order.
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index af1e49c3603f..31873af0421b 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -140,13 +140,15 @@
  * number of buddy bitmap orders possible) number of lists. Group-infos are
  * placed in appropriate lists.
  *
- * 2) Average fragment size rb tree (sbi->s_mb_avg_fragment_size_root)
+ * 2) Average fragment size lists (sbi->s_mb_avg_fragment_size)
  *
- * Locking: sbi->s_mb_rb_lock (rwlock)
+ * Locking: sbi->s_mb_avg_fragment_size_locks(array of rw locks)
  *
- * This is a red black tree consisting of group infos and the tree is sorted
- * by average fragment sizes (which is calculated as ext4_group_info->bb_free
- * / ext4_group_info->bb_fragments).
+ * This is an array of lists where in the i-th list there are groups with
+ * average fragment size >= 2^i and < 2^(i+1). The average fragment size
+ * is computed as ext4_group_info->bb_free / ext4_group_info->bb_fragments.
+ * Note that we don't bother with a special list for completely empty groups
+ * so we only have MB_NUM_ORDERS(sb) lists.
  *
  * When "mb_optimize_scan" mount option is set, mballoc consults the above data
  * structures to decide the order in which groups are to be traversed for
@@ -160,7 +162,8 @@
  *
  * At CR = 1, we only consider groups where average fragment size > request
  * size. So, we lookup a group which has average fragment size just above or
- * equal to request size using our rb tree (data structure 2) in O(log N) time.
+ * equal to request size using our average fragment size group lists (data
+ * structure 2) in O(1) time.
  *
  * If "mb_optimize_scan" mount option is not set, mballoc traverses groups in
  * linear order which requires O(N) search time for each CR 0 and CR 1 phase.
@@ -802,65 +805,51 @@ static void ext4_mb_mark_free_simple(struct super_block *sb,
         }
 }
 
-static void ext4_mb_rb_insert(struct rb_root *root, struct rb_node *new,
-                        int (*cmp)(struct rb_node *, struct rb_node *))
+static int mb_avg_fragment_size_order(struct super_block *sb, ext4_grpblk_t len)
 {
-        struct rb_node **iter = &root->rb_node, *parent = NULL;
+        int order;
 
-        while (*iter) {
-                parent = *iter;
-                if (cmp(new, *iter) > 0)
-                        iter = &((*iter)->rb_left);
-                else
-                        iter = &((*iter)->rb_right);
-        }
-
-        rb_link_node(new, parent, iter);
-        rb_insert_color(new, root);
-}
-
-static int
-ext4_mb_avg_fragment_size_cmp(struct rb_node *rb1, struct rb_node *rb2)
-{
-        struct ext4_group_info *grp1 = rb_entry(rb1,
-                                                struct ext4_group_info,
-                                                bb_avg_fragment_size_rb);
-        struct ext4_group_info *grp2 = rb_entry(rb2,
-                                                struct ext4_group_info,
-                                                bb_avg_fragment_size_rb);
-        int num_frags_1, num_frags_2;
-
-        num_frags_1 = grp1->bb_fragments ?
-                grp1->bb_free / grp1->bb_fragments : 0;
-        num_frags_2 = grp2->bb_fragments ?
-                grp2->bb_free / grp2->bb_fragments : 0;
-
-        return (num_frags_2 - num_frags_1);
+        /*
+         * We don't bother with a special lists groups with only 1 block free
+         * extents and for completely empty groups.
+         */
+        order = fls(len) - 2;
+        if (order < 0)
+                return 0;
+        if (order == MB_NUM_ORDERS(sb))
+                order--;
+        return order;
 }
 
-/*
- * Reinsert grpinfo into the avg_fragment_size tree with new average
- * fragment size.
- */
+/* Move group to appropriate avg_fragment_size list */
 static void
 mb_update_avg_fragment_size(struct super_block *sb, struct ext4_group_info *grp)
 {
         struct ext4_sb_info *sbi = EXT4_SB(sb);
+        int new_order;
 
         if (!test_opt2(sb, MB_OPTIMIZE_SCAN) || grp->bb_free == 0)
                 return;
 
-        write_lock(&sbi->s_mb_rb_lock);
-        if (!RB_EMPTY_NODE(&grp->bb_avg_fragment_size_rb)) {
-                rb_erase(&grp->bb_avg_fragment_size_rb,
-                                &sbi->s_mb_avg_fragment_size_root);
-                RB_CLEAR_NODE(&grp->bb_avg_fragment_size_rb);
-        }
+        new_order = mb_avg_fragment_size_order(sb,
+                                        grp->bb_free / grp->bb_fragments);
+        if (new_order == grp->bb_avg_fragment_size_order)
+                return;
 
-        ext4_mb_rb_insert(&sbi->s_mb_avg_fragment_size_root,
-                &grp->bb_avg_fragment_size_rb,
-                ext4_mb_avg_fragment_size_cmp);
-        write_unlock(&sbi->s_mb_rb_lock);
+        if (grp->bb_avg_fragment_size_order != -1) {
+                write_lock(&sbi->s_mb_avg_fragment_size_locks[
+                                        grp->bb_avg_fragment_size_order]);
+                list_del(&grp->bb_avg_fragment_size_node);
+                write_unlock(&sbi->s_mb_avg_fragment_size_locks[
+                                        grp->bb_avg_fragment_size_order]);
+        }
+        grp->bb_avg_fragment_size_order = new_order;
+        write_lock(&sbi->s_mb_avg_fragment_size_locks[
+                                        grp->bb_avg_fragment_size_order]);
+        list_add_tail(&grp->bb_avg_fragment_size_node,
+                &sbi->s_mb_avg_fragment_size[grp->bb_avg_fragment_size_order]);
+        write_unlock(&sbi->s_mb_avg_fragment_size_locks[
+                                        grp->bb_avg_fragment_size_order]);
 }
 
 /*
@@ -909,86 +898,56 @@ static void ext4_mb_choose_next_group_cr0(struct ext4_allocation_context *ac,
                 *new_cr = 1;
         } else {
                 *group = grp->bb_group;
-                ac->ac_last_optimal_group = *group;
                 ac->ac_flags |= EXT4_MB_CR0_OPTIMIZED;
         }
 }
 
 /*
- * Choose next group by traversing average fragment size tree. Updates *new_cr
- * if cr level needs an update. Sets EXT4_MB_SEARCH_NEXT_LINEAR to indicate that
- * the linear search should continue for one iteration since there's lock
- * contention on the rb tree lock.
+ * Choose next group by traversing average fragment size list of suitable
+ * order. Updates *new_cr if cr level needs an update.
  */
 static void ext4_mb_choose_next_group_cr1(struct ext4_allocation_context *ac,
                 int *new_cr, ext4_group_t *group, ext4_group_t ngroups)
 {
         struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
-        int avg_fragment_size, best_so_far;
-        struct rb_node *node, *found;
-        struct ext4_group_info *grp;
-
-        /*
-         * If there is contention on the lock, instead of waiting for the lock
-         * to become available, just continue searching linearly. We'll resume
-         * our rb tree search later starting at ac->ac_last_optimal_group.
-         */
-        if (!read_trylock(&sbi->s_mb_rb_lock)) {
-                ac->ac_flags |= EXT4_MB_SEARCH_NEXT_LINEAR;
-                return;
-        }
+        struct ext4_group_info *grp, *iter;
+        int i;
 
         if (unlikely(ac->ac_flags & EXT4_MB_CR1_OPTIMIZED)) {
                 if (sbi->s_mb_stats)
                         atomic_inc(&sbi->s_bal_cr1_bad_suggestions);
-                /* We have found something at CR 1 in the past */
-                grp = ext4_get_group_info(ac->ac_sb, ac->ac_last_optimal_group);
-                for (found = rb_next(&grp->bb_avg_fragment_size_rb); found != NULL;
-                     found = rb_next(found)) {
-                        grp = rb_entry(found, struct ext4_group_info,
-                                       bb_avg_fragment_size_rb);
+        }
+
+        for (i = mb_avg_fragment_size_order(ac->ac_sb, ac->ac_g_ex.fe_len);
+             i < MB_NUM_ORDERS(ac->ac_sb); i++) {
+                if (list_empty(&sbi->s_mb_avg_fragment_size[i]))
+                        continue;
+                read_lock(&sbi->s_mb_avg_fragment_size_locks[i]);
+                if (list_empty(&sbi->s_mb_avg_fragment_size[i])) {
+                        read_unlock(&sbi->s_mb_avg_fragment_size_locks[i]);
+                        continue;
+                }
+                grp = NULL;
+                list_for_each_entry(iter, &sbi->s_mb_avg_fragment_size[i],
+                                    bb_avg_fragment_size_node) {
                         if (sbi->s_mb_stats)
                                 atomic64_inc(&sbi->s_bal_cX_groups_considered[1]);
-                        if (likely(ext4_mb_good_group(ac, grp->bb_group, 1)))
+                        if (likely(ext4_mb_good_group(ac, iter->bb_group, 1))) {
+                                grp = iter;
                                 break;
-                }
-                goto done;
-        }
-
-        node = sbi->s_mb_avg_fragment_size_root.rb_node;
-        best_so_far = 0;
-        found = NULL;
-
-        while (node) {
-                grp = rb_entry(node, struct ext4_group_info,
-                               bb_avg_fragment_size_rb);
-                avg_fragment_size = 0;
-                if (ext4_mb_good_group(ac, grp->bb_group, 1)) {
-                        avg_fragment_size = grp->bb_fragments ?
-                                grp->bb_free / grp->bb_fragments : 0;
-                        if (!best_so_far || avg_fragment_size < best_so_far) {
-                                best_so_far = avg_fragment_size;
-                                found = node;
-                        }
+                        }
                 }
-                if (avg_fragment_size > ac->ac_g_ex.fe_len)
-                        node = node->rb_right;
-                else
-                        node = node->rb_left;
+                read_unlock(&sbi->s_mb_avg_fragment_size_locks[i]);
+                if (grp)
+                        break;
         }
 
-done:
-        if (found) {
-                grp = rb_entry(found, struct ext4_group_info,
-                               bb_avg_fragment_size_rb);
+        if (grp) {
                 *group = grp->bb_group;
                 ac->ac_flags |= EXT4_MB_CR1_OPTIMIZED;
         } else {
                 *new_cr = 2;
         }
-
-        read_unlock(&sbi->s_mb_rb_lock);
-        ac->ac_last_optimal_group = *group;
 }
 
 static inline int should_optimize_scan(struct ext4_allocation_context *ac)
@@ -1017,11 +976,6 @@ next_linear_group(struct ext4_allocation_context *ac, int group, int ngroups)
                 goto inc_and_return;
         }
 
-        if (ac->ac_flags & EXT4_MB_SEARCH_NEXT_LINEAR) {
-                ac->ac_flags &= ~EXT4_MB_SEARCH_NEXT_LINEAR;
-                goto inc_and_return;
-        }
-
         return group;
 inc_and_return:
         /*
@@ -1152,13 +1106,13 @@ void ext4_mb_generate_buddy(struct super_block *sb,
                                         EXT4_GROUP_INFO_BBITMAP_CORRUPT);
         }
         mb_set_largest_free_order(sb, grp);
+        mb_update_avg_fragment_size(sb, grp);
 
         clear_bit(EXT4_GROUP_INFO_NEED_INIT_BIT, &(grp->bb_state));
 
         period = get_cycles() - period;
         atomic_inc(&sbi->s_mb_buddies_generated);
         atomic64_add(period, &sbi->s_mb_generation_time);
-        mb_update_avg_fragment_size(sb, grp);
 }
 
 /* The buddy information is attached the buddy cache inode
@@ -2711,7 +2665,6 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
          * from the goal value specified
          */
         group = ac->ac_g_ex.fe_group;
-        ac->ac_last_optimal_group = group;
         ac->ac_groups_linear_remaining = sbi->s_mb_max_linear_groups;
         prefetch_grp = group;
 
@@ -2993,9 +2946,7 @@ __acquires(&EXT4_SB(sb)->s_mb_rb_lock)
         struct super_block *sb = pde_data(file_inode(seq->file));
         unsigned long position;
 
-        read_lock(&EXT4_SB(sb)->s_mb_rb_lock);
-
-        if (*pos < 0 || *pos >= MB_NUM_ORDERS(sb) + 1)
+        if (*pos < 0 || *pos >= 2*MB_NUM_ORDERS(sb))
                 return NULL;
         position = *pos + 1;
         return (void *) ((unsigned long) position);
@@ -3007,7 +2958,7 @@ static void *ext4_mb_seq_structs_summary_next(struct seq_file *seq, void *v, lof
         unsigned long position;
 
         ++*pos;
-        if (*pos < 0 || *pos >= MB_NUM_ORDERS(sb) + 1)
+        if (*pos < 0 || *pos >= 2*MB_NUM_ORDERS(sb))
                 return NULL;
         position = *pos + 1;
         return (void *) ((unsigned long) position);
@@ -3019,29 +2970,22 @@ static int ext4_mb_seq_structs_summary_show(struct seq_file *seq, void *v)
         struct ext4_sb_info *sbi = EXT4_SB(sb);
         unsigned long position = ((unsigned long) v);
         struct ext4_group_info *grp;
-        struct rb_node *n;
-        unsigned int count, min, max;
+        unsigned int count;
 
         position--;
         if (position >= MB_NUM_ORDERS(sb)) {
-                seq_puts(seq, "fragment_size_tree:\n");
-                n = rb_first(&sbi->s_mb_avg_fragment_size_root);
-                if (!n) {
-                        seq_puts(seq, "\ttree_min: 0\n\ttree_max: 0\n\ttree_nodes: 0\n");
-                        return 0;
-                }
-                grp = rb_entry(n, struct ext4_group_info, bb_avg_fragment_size_rb);
-                min = grp->bb_fragments ? grp->bb_free / grp->bb_fragments : 0;
-                count = 1;
-                while (rb_next(n)) {
-                        count++;
-                        n = rb_next(n);
-                }
-                grp = rb_entry(n, struct ext4_group_info, bb_avg_fragment_size_rb);
-                max = grp->bb_fragments ? grp->bb_free / grp->bb_fragments : 0;
+                position -= MB_NUM_ORDERS(sb);
+                if (position == 0)
+                        seq_puts(seq, "avg_fragment_size_lists:\n");
 
-                seq_printf(seq, "\ttree_min: %u\n\ttree_max: %u\n\ttree_nodes: %u\n",
-                           min, max, count);
+                count = 0;
+                read_lock(&sbi->s_mb_avg_fragment_size_locks[position]);
+                list_for_each_entry(grp, &sbi->s_mb_avg_fragment_size[position],
+                                    bb_avg_fragment_size_node)
+                        count++;
+                read_unlock(&sbi->s_mb_avg_fragment_size_locks[position]);
+                seq_printf(seq, "\tlist_order_%u_groups: %u\n",
+                                        (unsigned int)position, count);
                 return 0;
         }
 
@@ -3051,9 +2995,11 @@ static int ext4_mb_seq_structs_summary_show(struct seq_file *seq, void *v)
                 seq_puts(seq, "max_free_order_lists:\n");
         }
         count = 0;
+        read_lock(&sbi->s_mb_largest_free_orders_locks[position]);
         list_for_each_entry(grp, &sbi->s_mb_largest_free_orders[position],
                             bb_largest_free_order_node)
                 count++;
+        read_unlock(&sbi->s_mb_largest_free_orders_locks[position]);
         seq_printf(seq, "\tlist_order_%u_groups: %u\n",
                    (unsigned int)position, count);
 
@@ -3061,11 +3007,7 @@ static int ext4_mb_seq_structs_summary_show(struct seq_file *seq, void *v)
 }
 
 static void ext4_mb_seq_structs_summary_stop(struct seq_file *seq, void *v)
-__releases(&EXT4_SB(sb)->s_mb_rb_lock)
 {
-        struct super_block *sb = pde_data(file_inode(seq->file));
-
-        read_unlock(&EXT4_SB(sb)->s_mb_rb_lock);
 }
 
 const struct seq_operations ext4_mb_seq_structs_summary_ops = {
@@ -3178,8 +3120,9 @@ int ext4_mb_add_groupinfo(struct super_block *sb, ext4_group_t group,
         init_rwsem(&meta_group_info[i]->alloc_sem);
         meta_group_info[i]->bb_free_root = RB_ROOT;
         INIT_LIST_HEAD(&meta_group_info[i]->bb_largest_free_order_node);
-        RB_CLEAR_NODE(&meta_group_info[i]->bb_avg_fragment_size_rb);
+        INIT_LIST_HEAD(&meta_group_info[i]->bb_avg_fragment_size_node);
         meta_group_info[i]->bb_largest_free_order = -1;  /* uninit */
+        meta_group_info[i]->bb_avg_fragment_size_order = -1;  /* uninit */
         meta_group_info[i]->bb_group = group;
 
         mb_group_bb_bitmap_alloc(sb, meta_group_info[i], group);
@@ -3428,7 +3371,24 @@ int ext4_mb_init(struct super_block *sb)
                 i++;
         } while (i < MB_NUM_ORDERS(sb));
 
-        sbi->s_mb_avg_fragment_size_root = RB_ROOT;
+        sbi->s_mb_avg_fragment_size =
+                kmalloc_array(MB_NUM_ORDERS(sb), sizeof(struct list_head),
+                        GFP_KERNEL);
+        if (!sbi->s_mb_avg_fragment_size) {
+                ret = -ENOMEM;
+                goto out;
+        }
+        sbi->s_mb_avg_fragment_size_locks =
+                kmalloc_array(MB_NUM_ORDERS(sb), sizeof(rwlock_t),
+                        GFP_KERNEL);
+        if (!sbi->s_mb_avg_fragment_size_locks) {
+                ret = -ENOMEM;
+                goto out;
+        }
+        for (i = 0; i < MB_NUM_ORDERS(sb); i++) {
+                INIT_LIST_HEAD(&sbi->s_mb_avg_fragment_size[i]);
+                rwlock_init(&sbi->s_mb_avg_fragment_size_locks[i]);
+        }
         sbi->s_mb_largest_free_orders =
                 kmalloc_array(MB_NUM_ORDERS(sb), sizeof(struct list_head),
                         GFP_KERNEL);
@@ -3447,7 +3407,6 @@ int ext4_mb_init(struct super_block *sb)
                 INIT_LIST_HEAD(&sbi->s_mb_largest_free_orders[i]);
                 rwlock_init(&sbi->s_mb_largest_free_orders_locks[i]);
         }
-        rwlock_init(&sbi->s_mb_rb_lock);
 
         spin_lock_init(&sbi->s_md_lock);
         sbi->s_mb_free_pending = 0;
@@ -3518,6 +3477,8 @@ int ext4_mb_init(struct super_block *sb)
         free_percpu(sbi->s_locality_groups);
         sbi->s_locality_groups = NULL;
 out:
+        kfree(sbi->s_mb_avg_fragment_size);
+        kfree(sbi->s_mb_avg_fragment_size_locks);
         kfree(sbi->s_mb_largest_free_orders);
         kfree(sbi->s_mb_largest_free_orders_locks);
         kfree(sbi->s_mb_offsets);
@@ -3584,6 +3545,8 @@ int ext4_mb_release(struct super_block *sb)
                 kvfree(group_info);
                 rcu_read_unlock();
         }
+        kfree(sbi->s_mb_avg_fragment_size);
+        kfree(sbi->s_mb_avg_fragment_size_locks);
         kfree(sbi->s_mb_largest_free_orders);
         kfree(sbi->s_mb_largest_free_orders_locks);
         kfree(sbi->s_mb_offsets);
diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h
index 39da92ceabf8..dcda2a943cee 100644
--- a/fs/ext4/mballoc.h
+++ b/fs/ext4/mballoc.h
@@ -178,7 +178,6 @@ struct ext4_allocation_context {
         /* copy of the best found extent taken before preallocation efforts */
         struct ext4_free_extent ac_f_ex;
 
-        ext4_group_t ac_last_optimal_group;
         __u32 ac_groups_considered;
         __u32 ac_flags;         /* allocation hints */
         __u16 ac_groups_scanned;
-- 
2.35.3
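For readers who want to see the bucket mapping in isolation, here is a
hedged, standalone userspace demo of the mb_avg_fragment_size_order()
helper added above (fls() is open-coded and MB_NUM_ORDERS is pinned to an
illustrative constant; in the kernel both depend on the superblock):

#include <stdio.h>

/* Open-coded find-last-set: 1-based index of the highest set bit, 0 for 0.
 * (The kernel provides fls(); this is only for a standalone demo.) */
static int fls_(unsigned int x)
{
        int r = 0;

        while (x) {
                r++;
                x >>= 1;
        }
        return r;
}

#define MB_NUM_ORDERS 14        /* illustrative; the real value depends on
                                 * the filesystem block size */

/* Mirror of mb_avg_fragment_size_order(): average fragment sizes in
 * [2^(i+1), 2^(i+2)) map to list i, single-block averages fold into
 * list 0, and the top is clamped to MB_NUM_ORDERS - 1. */
static int avg_frag_order(unsigned int len)
{
        int order = fls_(len) - 2;

        if (order < 0)
                return 0;
        if (order == MB_NUM_ORDERS)
                order--;
        return order;
}

int main(void)
{
        unsigned int lens[] = { 1, 2, 3, 4, 7, 8, 64, 32768 };
        unsigned int i;

        for (i = 0; i < sizeof(lens) / sizeof(lens[0]); i++)
                printf("avg fragment %6u -> list %d\n", lens[i],
                       avg_frag_order(lens[i]));
        return 0;
}

The last entry shows the clamp: 32768 would compute order 14, which is
folded back into the highest list, 13. This is also why the cr 1 scan can
start directly at the bucket matching the request size in O(1) rather than
descending an rbtree.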
* Re: [PATCH 5/5] ext4: Use buckets for cr 1 block scan instead of rbtree
From: Ritesh Harjani (IBM) @ 2022-09-09  6:12 UTC
To: Jan Kara
Cc: Ted Tso, linux-ext4, Thorsten Leemhuis, Ojaswin Mujoo, Stefan Wahren,
    Harshad Shirwadkar, stable

On 22/09/08 11:21AM, Jan Kara wrote:
> Using rbtree for sorting groups by average fragment size is relatively
> expensive (needs rbtree update on every block freeing or allocation) and
> leads to wide spreading of allocations because selection of block group
> is very sensitive both to changes in free space and amount of blocks
> allocated. Furthermore selecting group with the best matching average
> fragment size is not necessary anyway, even more so because the
> variability of fragment sizes within a group is likely large so average
> is not telling much. We just need a group with large enough average
> fragment size so that we have high probability of finding large enough
> free extent and we don't want average fragment size to be too big so
> that we are likely to find free extent only somewhat larger than what we
> need.
>
> So instead of maintaining rbtree of groups sorted by fragment size keep
> bins (lists) of groups where average fragment size is in the interval
> [2^i, 2^(i+1)). This structure requires fewer updates on block allocation
> / freeing, generally avoids chaotic spreading of allocations into block
> groups, and still is able to quickly (even faster than the rbtree)
> provide a block group which is likely to have a suitably sized free
> space extent.
>
> This patch reduces number of block groups used when untarring archive
> with medium sized files (size somewhat above 64k which is default
> mballoc limit for avoiding locality group preallocation) to about half
> and thus improves write speeds for eMMC flash significantly.
>
> Fixes: 196e402adf2e ("ext4: improve cr 0 / cr 1 group scanning")
> CC: stable@vger.kernel.org
> Reported-and-tested-by: Stefan Wahren <stefan.wahren@i2se.com>
> Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/ext4/ext4.h    |  10 +-
>  fs/ext4/mballoc.c | 249 ++++++++++++++++++++--------------------------
>  fs/ext4/mballoc.h |   1 -
>  3 files changed, 111 insertions(+), 149 deletions(-)

Hello Jan,

I have reviewed the patch and also verified bb_fragments. That indeed will
at least be 1 (ext4_mb_generate_buddy()) as you had mentioned.

The patch looks good to me. Please feel free to add:

Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
* Re: [PATCH 0/5 v3] ext4: Fix performance regression with mballoc
From: Stefan Wahren @ 2022-09-08 10:36 UTC
To: Jan Kara
Cc: linux-ext4, Thorsten Leemhuis, Ojaswin Mujoo, Harshad Shirwadkar, Ted Tso

Hi Jan,

On 08.09.22 11:21, Jan Kara wrote:
> Hello,
>
> Here is the third version of my mballoc improvements to avoid spreading
> allocations with mb_optimize_scan=1. Since v2 there are only small changes
> and fixes found during review and testing. Overall the series looks mostly
> ready to go; I just didn't see any comments regarding patch 3 - a fix of
> metabg handling in the Orlov allocator which is kind of independent, I've
> just found it when reading the code. Also patch 5 needs final review after
> all the fixes.
>
> Changes since v1:
> - reworked data structure for CR 1 scan
> - make small closed files use locality group preallocation
> - fix metabg handling in the Orlov allocator
>
> Changes since v2:
> - whitespace fixes
> - fix outdated comment
> - fix handling of mb_structs_summary procfs file
> - fix bad unlock on error recovery path

unfortunately the real patches don't have v3, which leads to confusion.

Just a note: in case this series cannot be applied for stable (5.15), we
need a second solution to fix the regression there.
* Re: [PATCH 0/5 v3] ext4: Fix performance regression with mballoc
From: Jan Kara @ 2022-09-09 10:40 UTC
To: Stefan Wahren
Cc: Jan Kara, linux-ext4, Thorsten Leemhuis, Ojaswin Mujoo,
    Harshad Shirwadkar, Ted Tso

On Thu 08-09-22 12:36:10, Stefan Wahren wrote:
> Hi Jan,
>
> On 08.09.22 11:21, Jan Kara wrote:
> > Hello,
> >
> > Here is the third version of my mballoc improvements to avoid spreading
> > allocations with mb_optimize_scan=1. Since v2 there are only small changes
> > and fixes found during review and testing. Overall the series looks mostly
> > ready to go; I just didn't see any comments regarding patch 3 - a fix of
> > metabg handling in the Orlov allocator which is kind of independent, I've
> > just found it when reading the code. Also patch 5 needs final review after
> > all the fixes.
> >
> > Changes since v1:
> > - reworked data structure for CR 1 scan
> > - make small closed files use locality group preallocation
> > - fix metabg handling in the Orlov allocator
> >
> > Changes since v2:
> > - whitespace fixes
> > - fix outdated comment
> > - fix handling of mb_structs_summary procfs file
> > - fix bad unlock on error recovery path
>
> unfortunately the real patches don't have v3, which leads to confusion.

Yeah, OK, I've updated my scripting for posting patches to include version
in each posted patch :)

> Just a note: in case this series cannot be applied for stable (5.15), we
> need a second solution to fix the regression there.

My plan is to backport the current series. It is not that invasive so it
should be doable...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH 0/5 v3] ext4: Fix performance regression with mballoc
From: Stefan Wahren @ 2022-09-11 12:32 UTC
To: Jan Kara
Cc: linux-ext4, Thorsten Leemhuis, Ojaswin Mujoo, Harshad Shirwadkar, Ted Tso

Hi Jan,

On 09.09.22 12:40, Jan Kara wrote:
> On Thu 08-09-22 12:36:10, Stefan Wahren wrote:
>> Hi Jan,
>>
>> On 08.09.22 11:21, Jan Kara wrote:
>>> Hello,
>>>
>>> Here is the third version of my mballoc improvements to avoid spreading
>>> allocations with mb_optimize_scan=1. Since v2 there are only small changes
>>> and fixes found during review and testing. Overall the series looks mostly
>>> ready to go; I just didn't see any comments regarding patch 3 - a fix of
>>> metabg handling in the Orlov allocator which is kind of independent, I've
>>> just found it when reading the code. Also patch 5 needs final review after
>>> all the fixes.
>>>
>>> Changes since v1:
>>> - reworked data structure for CR 1 scan
>>> - make small closed files use locality group preallocation
>>> - fix metabg handling in the Orlov allocator
>>>
>>> Changes since v2:
>>> - whitespace fixes
>>> - fix outdated comment
>>> - fix handling of mb_structs_summary procfs file
>>> - fix bad unlock on error recovery path
>> unfortunately the real patches don't have v3, which leads to confusion.
> Yeah, OK, I've updated my scripting for posting patches to include version
> in each posted patch :)
>
>> Just a note: in case this series cannot be applied for stable (5.15), we
>> need a second solution to fix the regression there.
> My plan is to backport the current series. It is not that invasive so it
> should be doable...

This would be great. Btw, I retested this version and I can confirm the
procfs crash is fixed.

>
> 								Honza
* [PATCH 0/5 v2] ext4: Fix performance regression with mballoc
From: Jan Kara @ 2022-09-06 15:29 UTC
To: Ted Tso
Cc: linux-ext4, Thorsten Leemhuis, Ojaswin Mujoo, Stefan Wahren,
    Andreas Dilger, Jan Kara

Hello,

Here is a second version of my mballoc improvements to avoid spreading
allocations with mb_optimize_scan=1. The patches fix the performance
regression I was able to reproduce with reaim on my test machine:

                     mb_optimize_scan=0     mb_optimize_scan=1     patched
Hmean disk-1       2076.12 (   0.00%)     2099.37 (   1.12%)     2032.52 (  -2.10%)
Hmean disk-41     92481.20 (   0.00%)    83787.47 *  -9.40%*    90308.37 (  -2.35%)
Hmean disk-81    155073.39 (   0.00%)   135527.05 * -12.60%*   154285.71 (  -0.51%)
Hmean disk-121   185109.64 (   0.00%)   166284.93 * -10.17%*   185298.62 (   0.10%)
Hmean disk-161   229890.53 (   0.00%)   207563.39 *  -9.71%*   232883.32 *   1.30%*
Hmean disk-201   223333.33 (   0.00%)   203235.59 *  -9.00%*   221446.93 (  -0.84%)
Hmean disk-241   235735.25 (   0.00%)   217705.51 *  -7.65%*   239483.27 *   1.59%*
Hmean disk-281   266772.15 (   0.00%)   241132.72 *  -9.61%*   263108.62 (  -1.37%)
Hmean disk-321   265435.50 (   0.00%)   245412.84 *  -7.54%*   267277.27 (   0.69%)

The changes also significantly reduce spreading of allocations for small /
moderately sized files. I'm not able to measure a performance difference
resulting from this but on eMMC storage this seems to be the main culprit
of reduced performance. Untarring of raspberry-pi archive touches the
following numbers of groups:

         mb_optimize_scan=0     mb_optimize_scan=1     patched
groups           4                      22                7

To achieve this I have added two more changes on top of v1 - patches 4 and
5. Patch 4 makes sure we use locality group preallocation even for files
that are not likely to grow anymore (previously we have disabled all
preallocations for such files, however locality group preallocation still
makes a lot of sense for such files). This patch reduced spread of small
file allocations but larger file allocations were still spread
significantly because they avoid locality group preallocation and, as they
are not power-of-two in size, they also immediately start with the cr=1
scan. To address that I've changed the data structure for looking up the
best block group to allocate from (see patch 5 for details).

Stefan, can you please test whether these patches fix the problem for you
as well? Comments & review welcome.

								Honza

Previous versions:
Link: http://lore.kernel.org/r/20220823134508.27854-1-jack@suse.cz # v1
* [PATCH 1/5] ext4: Make mballoc try target group first even with mb_optimize_scan
From: Jan Kara @ 2022-09-06 15:29 UTC
To: Ted Tso
Cc: linux-ext4, Thorsten Leemhuis, Ojaswin Mujoo, Stefan Wahren,
    Andreas Dilger, Jan Kara, stable

One of the side-effects of mb_optimize_scan was that the optimized
functions to select next group to try were called even before we tried
the goal group. As a result we no longer allocate files close to
corresponding inodes as well as we don't try to expand currently
allocated extent in the same group. This results in reaim regression
with workfile.disk workload of up to 8% with many clients on my test
machine:

                            baseline               mb_optimize_scan
Hmean disk-1       2114.16 (   0.00%)     2099.37 (  -0.70%)
Hmean disk-41     87794.43 (   0.00%)    83787.47 *  -4.56%*
Hmean disk-81    148170.73 (   0.00%)   135527.05 *  -8.53%*
Hmean disk-121   177506.11 (   0.00%)   166284.93 *  -6.32%*
Hmean disk-161   220951.51 (   0.00%)   207563.39 *  -6.06%*
Hmean disk-201   208722.74 (   0.00%)   203235.59 (  -2.63%)
Hmean disk-241   222051.60 (   0.00%)   217705.51 (  -1.96%)
Hmean disk-281   252244.17 (   0.00%)   241132.72 *  -4.41%*
Hmean disk-321   255844.84 (   0.00%)   245412.84 *  -4.08%*

Also this is causing huge regression (time increased by a factor of 5 or
so) when untarring archive with lots of small files on some eMMC storage
cards.

Fix the problem by making sure we try goal group first.

Fixes: 196e402adf2e ("ext4: improve cr 0 / cr 1 group scanning")
CC: stable@vger.kernel.org
Reported-by: Stefan Wahren <stefan.wahren@i2se.com>
Link: https://lore.kernel.org/all/20220727105123.ckwrhbilzrxqpt24@quack3/
Link: https://lore.kernel.org/all/0d81a7c2-46b7-6010-62a4-3e6cfc1628d6@i2se.com/
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/mballoc.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index bd8f8b5c3d30..41e1cfecac3b 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -1049,8 +1049,10 @@ static void ext4_mb_choose_next_group(struct ext4_allocation_context *ac,
 {
         *new_cr = ac->ac_criteria;
 
-        if (!should_optimize_scan(ac) || ac->ac_groups_linear_remaining)
+        if (!should_optimize_scan(ac) || ac->ac_groups_linear_remaining) {
+                *group = next_linear_group(ac, *group, ngroups);
                 return;
+        }
 
         if (*new_cr == 0) {
                 ext4_mb_choose_next_group_cr0(ac, new_cr, group, ngroups);
@@ -2636,7 +2638,7 @@ static noinline_for_stack int
 ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 {
         ext4_group_t prefetch_grp = 0, ngroups, group, i;
-        int cr = -1;
+        int cr = -1, new_cr;
         int err = 0, first_err = 0;
         unsigned int nr = 0, prefetch_ios = 0;
         struct ext4_sb_info *sbi;
@@ -2711,13 +2713,11 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
         ac->ac_groups_linear_remaining = sbi->s_mb_max_linear_groups;
         prefetch_grp = group;
 
-        for (i = 0; i < ngroups; group = next_linear_group(ac, group, ngroups),
-                     i++) {
-                int ret = 0, new_cr;
+        for (i = 0, new_cr = cr; i < ngroups; i++,
+             ext4_mb_choose_next_group(ac, &new_cr, &group, ngroups)) {
+                int ret = 0;
 
                 cond_resched();
-
-                ext4_mb_choose_next_group(ac, &new_cr, &group, ngroups);
                 if (new_cr != cr) {
                         cr = new_cr;
                         goto repeat;
-- 
2.35.3
* Re: [PATCH 1/5] ext4: Make mballoc try target group first even with mb_optimize_scan
From: Ritesh Harjani (IBM) @ 2022-09-07 17:43 UTC
To: Jan Kara
Cc: Ted Tso, linux-ext4, Thorsten Leemhuis, Ojaswin Mujoo, Stefan Wahren,
    Andreas Dilger, stable

On 22/09/06 05:29PM, Jan Kara wrote:
> One of the side-effects of mb_optimize_scan was that the optimized
> functions to select next group to try were called even before we tried
> the goal group. As a result we no longer allocate files close to
> corresponding inodes as well as we don't try to expand currently
> allocated extent in the same group. This results in reaim regression
> with workfile.disk workload of up to 8% with many clients on my test
> machine:
>
>                             baseline               mb_optimize_scan
> Hmean disk-1       2114.16 (   0.00%)     2099.37 (  -0.70%)
> Hmean disk-41     87794.43 (   0.00%)    83787.47 *  -4.56%*
> Hmean disk-81    148170.73 (   0.00%)   135527.05 *  -8.53%*
> Hmean disk-121   177506.11 (   0.00%)   166284.93 *  -6.32%*
> Hmean disk-161   220951.51 (   0.00%)   207563.39 *  -6.06%*
> Hmean disk-201   208722.74 (   0.00%)   203235.59 (  -2.63%)
> Hmean disk-241   222051.60 (   0.00%)   217705.51 (  -1.96%)
> Hmean disk-281   252244.17 (   0.00%)   241132.72 *  -4.41%*
> Hmean disk-321   255844.84 (   0.00%)   245412.84 *  -4.08%*
>
> Also this is causing huge regression (time increased by a factor of 5 or
> so) when untarring archive with lots of small files on some eMMC storage
> cards.
>
> Fix the problem by making sure we try goal group first.
>

Yup, this is definitely a bug. We were never trying the goal group then,
except maybe for rotational devices (due to ac_groups_linear_remaining).

Looks right to me.
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Thread overview: 16+ messages (newest: ~2022-09-26 17:56 UTC)

2022-09-08  9:21 [PATCH 0/5 v3] ext4: Fix performance regression with mballoc Jan Kara
2022-09-08  9:21 ` [PATCH 1/5] ext4: Make mballoc try target group first even with mb_optimize_scan Jan Kara
2022-09-22  2:52   ` Theodore Ts'o
2022-09-22  9:15     ` Jan Kara
2022-09-26  9:11       ` Jan Kara
2022-09-26 17:32         ` Theodore Ts'o
2022-09-08  9:21 ` [PATCH 2/5] ext4: Avoid unnecessary spreading of allocations among groups Jan Kara
2022-09-08  9:21 ` [PATCH 3/5] ext4: Make directory inode spreading reflect flexbg size Jan Kara
2022-09-08  9:21 ` [PATCH 4/5] ext4: Use locality group preallocation for small closed files Jan Kara
2022-09-08  9:21 ` [PATCH 5/5] ext4: Use buckets for cr 1 block scan instead of rbtree Jan Kara
2022-09-09  6:12   ` Ritesh Harjani (IBM)
2022-09-08 10:36 ` [PATCH 0/5 v3] ext4: Fix performance regression with mballoc Stefan Wahren
2022-09-09 10:40   ` Jan Kara
2022-09-11 12:32     ` Stefan Wahren

-- strict thread matches above, loose matches on Subject: below --
2022-09-06 15:29 [PATCH 0/5 v2] ext4: Fix performance regression with mballoc Jan Kara
2022-09-06 15:29 ` [PATCH 1/5] ext4: Make mballoc try target group first even with mb_optimize_scan Jan Kara
2022-09-07 17:43   ` Ritesh Harjani (IBM)