* [PATCH 0/2] ext4: Fix performance regression with mballoc
@ 2022-08-23 20:15 Jan Kara
  2022-08-23 20:15 ` [PATCH 1/2] ext4: Make mballoc try target group first even with mb_optimize_scan Jan Kara
                   ` (2 more replies)
  0 siblings, 3 replies; 20+ messages in thread
From: Jan Kara @ 2022-08-23 20:15 UTC (permalink / raw)
  To: Ted Tso
  Cc: linux-ext4, Thorsten Leemhuis, Ojaswin Mujoo, Stefan Wahren,
	Harshad Shirwadkar, Jan Kara

Hello,

So I have implemented mballoc improvements to avoid spreading allocations
even with mb_optimize_scan=1. It fixes the performance regression I was able
to reproduce with reaim on my test machine:

                     mb_optimize_scan=0     mb_optimize_scan=1     patched
Hmean     disk-1       2076.12 (   0.00%)     2099.37 (   1.12%)     2032.52 (  -2.10%)
Hmean     disk-41     92481.20 (   0.00%)    83787.47 *  -9.40%*    90308.37 (  -2.35%)
Hmean     disk-81    155073.39 (   0.00%)   135527.05 * -12.60%*   154285.71 (  -0.51%)
Hmean     disk-121   185109.64 (   0.00%)   166284.93 * -10.17%*   185298.62 (   0.10%)
Hmean     disk-161   229890.53 (   0.00%)   207563.39 *  -9.71%*   232883.32 *   1.30%*
Hmean     disk-201   223333.33 (   0.00%)   203235.59 *  -9.00%*   221446.93 (  -0.84%)
Hmean     disk-241   235735.25 (   0.00%)   217705.51 *  -7.65%*   239483.27 *   1.59%*
Hmean     disk-281   266772.15 (   0.00%)   241132.72 *  -9.61%*   263108.62 (  -1.37%)
Hmean     disk-321   265435.50 (   0.00%)   245412.84 *  -7.54%*   267277.27 (   0.69%)

Stefan, can you please test whether these patches fix the problem for you as
well? Comments & review welcome.

								Honza


* [PATCH 1/2] ext4: Make mballoc try target group first even with mb_optimize_scan
  2022-08-23 20:15 [PATCH 0/2] ext4: Fix performance regression with mballoc Jan Kara
@ 2022-08-23 20:15 ` Jan Kara
  2022-08-23 20:15 ` [PATCH 2/2] ext4: Avoid unnecessary spreading of allocations among groups Jan Kara
  2022-08-24 10:17 ` [PATCH 0/2] ext4: Fix performance regression with mballoc Stefan Wahren
  2 siblings, 0 replies; 20+ messages in thread
From: Jan Kara @ 2022-08-23 20:15 UTC (permalink / raw)
  To: Ted Tso
  Cc: linux-ext4, Thorsten Leemhuis, Ojaswin Mujoo, Stefan Wahren,
	Harshad Shirwadkar, Jan Kara, stable

One of the side effects of mb_optimize_scan was that the optimized
functions to select the next group to try were called even before we
tried the goal group. As a result we no longer allocate files close to
their corresponding inodes, nor do we try to expand the currently
allocated extent in the same group. This results in a reaim regression
with the workfile.disk workload of up to 8% with many clients on my test
machine:

                     baseline               mb_optimize_scan
Hmean     disk-1       2114.16 (   0.00%)     2099.37 (  -0.70%)
Hmean     disk-41     87794.43 (   0.00%)    83787.47 *  -4.56%*
Hmean     disk-81    148170.73 (   0.00%)   135527.05 *  -8.53%*
Hmean     disk-121   177506.11 (   0.00%)   166284.93 *  -6.32%*
Hmean     disk-161   220951.51 (   0.00%)   207563.39 *  -6.06%*
Hmean     disk-201   208722.74 (   0.00%)   203235.59 (  -2.63%)
Hmean     disk-241   222051.60 (   0.00%)   217705.51 (  -1.96%)
Hmean     disk-281   252244.17 (   0.00%)   241132.72 *  -4.41%*
Hmean     disk-321   255844.84 (   0.00%)   245412.84 *  -4.08%*

This also causes a huge regression (time increased by a factor of 5 or
so) when untarring an archive with lots of small files on some eMMC
storage cards.

Fix the problem by making sure we try the goal group first.
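
In effect, the main allocator loop changes shape so that the optimized
group selection runs only after a group has been scanned. Roughly (a
simplified sketch of the resulting flow, not the literal code):

	/*
	 * 'group' starts at the goal group and is advanced only in the
	 * loop's continuation expression, so the goal group is scanned
	 * on the first iteration before any optimized selection runs.
	 */
	group = ac->ac_g_ex.fe_group;
	for (i = 0, new_cr = cr; i < ngroups; i++,
	     ext4_mb_choose_next_group(ac, &new_cr, &group, ngroups)) {
		/* ... scan 'group' for a suitable free extent ... */
	}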

Fixes: 196e402adf2e ("ext4: improve cr 0 / cr 1 group scanning")
CC: stable@vger.kernel.org
Reported-by: Stefan Wahren <stefan.wahren@i2se.com>
Link: https://lore.kernel.org/all/20220727105123.ckwrhbilzrxqpt24@quack3/
Link: https://lore.kernel.org/all/0d81a7c2-46b7-6010-62a4-3e6cfc1628d6@i2se.com/
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/mballoc.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index bd8f8b5c3d30..41e1cfecac3b 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -1049,8 +1049,10 @@ static void ext4_mb_choose_next_group(struct ext4_allocation_context *ac,
 {
 	*new_cr = ac->ac_criteria;
 
-	if (!should_optimize_scan(ac) || ac->ac_groups_linear_remaining)
+	if (!should_optimize_scan(ac) || ac->ac_groups_linear_remaining) {
+		*group = next_linear_group(ac, *group, ngroups);
 		return;
+	}
 
 	if (*new_cr == 0) {
 		ext4_mb_choose_next_group_cr0(ac, new_cr, group, ngroups);
@@ -2636,7 +2638,7 @@ static noinline_for_stack int
 ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 {
 	ext4_group_t prefetch_grp = 0, ngroups, group, i;
-	int cr = -1;
+	int cr = -1, new_cr;
 	int err = 0, first_err = 0;
 	unsigned int nr = 0, prefetch_ios = 0;
 	struct ext4_sb_info *sbi;
@@ -2711,13 +2713,11 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 		ac->ac_groups_linear_remaining = sbi->s_mb_max_linear_groups;
 		prefetch_grp = group;
 
-		for (i = 0; i < ngroups; group = next_linear_group(ac, group, ngroups),
-			     i++) {
-			int ret = 0, new_cr;
+		for (i = 0, new_cr = cr; i < ngroups; i++,
+		     ext4_mb_choose_next_group(ac, &new_cr, &group, ngroups)) {
+			int ret = 0;
 
 			cond_resched();
-
-			ext4_mb_choose_next_group(ac, &new_cr, &group, ngroups);
 			if (new_cr != cr) {
 				cr = new_cr;
 				goto repeat;
-- 
2.35.3



* [PATCH 2/2] ext4: Avoid unnecessary spreading of allocations among groups
  2022-08-23 20:15 [PATCH 0/2] ext4: Fix performance regression with mballoc Jan Kara
  2022-08-23 20:15 ` [PATCH 1/2] ext4: Make mballoc try target group first even with mb_optimize_scan Jan Kara
@ 2022-08-23 20:15 ` Jan Kara
  2022-08-24 10:17 ` [PATCH 0/2] ext4: Fix performance regression with mballoc Stefan Wahren
  2 siblings, 0 replies; 20+ messages in thread
From: Jan Kara @ 2022-08-23 20:15 UTC (permalink / raw)
  To: Ted Tso
  Cc: linux-ext4, Thorsten Leemhuis, Ojaswin Mujoo, Stefan Wahren,
	Harshad Shirwadkar, Jan Kara

mb_set_largest_free_order() updates the lists containing groups whose
largest chunk of free space is of a given order. The way it performs the
update always moves the group to the tail of its list. Thus allocations
looking for free space of a given order effectively end up cycling
through all groups (and, due to initialization, in last-to-first order).
This spreads allocations among block groups, which reduces performance
for rotating disks or low-end flash media. Change
mb_set_largest_free_order() to only update the lists if the order of the
largest free chunk in the group has changed.
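
In list terms, the old code always did roughly the following on every
update, even when the order had not changed (a simplified sketch, not
the literal code):

	/*
	 * Unconditionally move the group to the tail of its order list.
	 * Allocations walk these lists from the head, so every touched
	 * group gets pushed back and a different group is picked next
	 * time, cycling through all groups.
	 */
	list_del_init(&grp->bb_largest_free_order_node);
	/* ... recompute grp->bb_largest_free_order ... */
	list_add_tail(&grp->bb_largest_free_order_node,
		      &sbi->s_mb_largest_free_orders[
					grp->bb_largest_free_order]);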

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/mballoc.c | 24 +++++++++++++-----------
 1 file changed, 13 insertions(+), 11 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 41e1cfecac3b..6251b4a6cc63 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -1077,23 +1077,25 @@ mb_set_largest_free_order(struct super_block *sb, struct ext4_group_info *grp)
 	struct ext4_sb_info *sbi = EXT4_SB(sb);
 	int i;
 
-	if (test_opt2(sb, MB_OPTIMIZE_SCAN) && grp->bb_largest_free_order >= 0) {
+	for (i = MB_NUM_ORDERS(sb) - 1; i >= 0; i--)
+		if (grp->bb_counters[i] > 0)
+			break;
+	/* No need to move between order lists? */
+	if (!test_opt2(sb, MB_OPTIMIZE_SCAN) ||
+	    i == grp->bb_largest_free_order) {
+		grp->bb_largest_free_order = i;
+		return;
+	}
+
+	if (grp->bb_largest_free_order >= 0) {
 		write_lock(&sbi->s_mb_largest_free_orders_locks[
 					      grp->bb_largest_free_order]);
 		list_del_init(&grp->bb_largest_free_order_node);
 		write_unlock(&sbi->s_mb_largest_free_orders_locks[
 					      grp->bb_largest_free_order]);
 	}
-	grp->bb_largest_free_order = -1; /* uninit */
-
-	for (i = MB_NUM_ORDERS(sb) - 1; i >= 0; i--) {
-		if (grp->bb_counters[i] > 0) {
-			grp->bb_largest_free_order = i;
-			break;
-		}
-	}
-	if (test_opt2(sb, MB_OPTIMIZE_SCAN) &&
-	    grp->bb_largest_free_order >= 0 && grp->bb_free) {
+	grp->bb_largest_free_order = i;
+	if (grp->bb_largest_free_order >= 0 && grp->bb_free) {
 		write_lock(&sbi->s_mb_largest_free_orders_locks[
 					      grp->bb_largest_free_order]);
 		list_add_tail(&grp->bb_largest_free_order_node,
-- 
2.35.3



* Re: [PATCH 0/2] ext4: Fix performance regression with mballoc
  2022-08-23 20:15 [PATCH 0/2] ext4: Fix performance regression with mballoc Jan Kara
  2022-08-23 20:15 ` [PATCH 1/2] ext4: Make mballoc try target group first even with mb_optimize_scan Jan Kara
  2022-08-23 20:15 ` [PATCH 2/2] ext4: Avoid unnecessary spreading of allocations among groups Jan Kara
@ 2022-08-24 10:17 ` Stefan Wahren
  2022-08-24 10:40   ` Jan Kara
  2 siblings, 1 reply; 20+ messages in thread
From: Stefan Wahren @ 2022-08-24 10:17 UTC (permalink / raw)
  To: Jan Kara, Ted Tso
  Cc: linux-ext4, Thorsten Leemhuis, Ojaswin Mujoo, Harshad Shirwadkar

Hi Jan,

On 23.08.22 22:15, Jan Kara wrote:
> Hello,
>
> So I have implemented mballoc improvements to avoid spreading allocations
> even with mb_optimize_scan=1. It fixes the performance regression I was able
> to reproduce with reaim on my test machine:
>
>                       mb_optimize_scan=0     mb_optimize_scan=1     patched
> Hmean     disk-1       2076.12 (   0.00%)     2099.37 (   1.12%)     2032.52 (  -2.10%)
> Hmean     disk-41     92481.20 (   0.00%)    83787.47 *  -9.40%*    90308.37 (  -2.35%)
> Hmean     disk-81    155073.39 (   0.00%)   135527.05 * -12.60%*   154285.71 (  -0.51%)
> Hmean     disk-121   185109.64 (   0.00%)   166284.93 * -10.17%*   185298.62 (   0.10%)
> Hmean     disk-161   229890.53 (   0.00%)   207563.39 *  -9.71%*   232883.32 *   1.30%*
> Hmean     disk-201   223333.33 (   0.00%)   203235.59 *  -9.00%*   221446.93 (  -0.84%)
> Hmean     disk-241   235735.25 (   0.00%)   217705.51 *  -7.65%*   239483.27 *   1.59%*
> Hmean     disk-281   266772.15 (   0.00%)   241132.72 *  -9.61%*   263108.62 (  -1.37%)
> Hmean     disk-321   265435.50 (   0.00%)   245412.84 *  -7.54%*   267277.27 (   0.69%)
>
> Stefan, can you please test whether these patches fix the problem for you as
> well? Comments & review welcome.

I tested the whole series against 5.19 and 6.0.0-rc2. In both cases the
update process succeeded, which is an improvement, but the download +
unpack duration ( ~ 7 minutes ) is not as good as with mb_optimize_scan=0
( ~ 1 minute ).

Unfortunately I don't have much time this week, and next week I'm on
holiday.

Just a question: my tests always had MBCACHE=y. Is it possible that
mb_optimize_scan is counterproductive for MBCACHE in this case?

I'm asking because, before the download, the update script removes the
files from the previous update process, which already causes a high
load.

Best regards

>
> 								Honza


* Re: [PATCH 0/2] ext4: Fix performance regression with mballoc
  2022-08-24 10:17 ` [PATCH 0/2] ext4: Fix performance regression with mballoc Stefan Wahren
@ 2022-08-24 10:40   ` Jan Kara
  2022-08-24 14:13     ` Jan Kara
  2022-08-24 21:24     ` Stefan Wahren
  0 siblings, 2 replies; 20+ messages in thread
From: Jan Kara @ 2022-08-24 10:40 UTC (permalink / raw)
  To: Stefan Wahren
  Cc: Jan Kara, Ted Tso, linux-ext4, Thorsten Leemhuis, Ojaswin Mujoo,
	Harshad Shirwadkar

Hi Stefan!

On Wed 24-08-22 12:17:14, Stefan Wahren wrote:
> On 23.08.22 22:15, Jan Kara wrote:
> > Hello,
> > 
> > So I have implemented mballoc improvements to avoid spreading allocations
> > even with mb_optimize_scan=1. It fixes the performance regression I was able
> > to reproduce with reaim on my test machine:
> > 
> >                       mb_optimize_scan=0     mb_optimize_scan=1     patched
> > Hmean     disk-1       2076.12 (   0.00%)     2099.37 (   1.12%)     2032.52 (  -2.10%)
> > Hmean     disk-41     92481.20 (   0.00%)    83787.47 *  -9.40%*    90308.37 (  -2.35%)
> > Hmean     disk-81    155073.39 (   0.00%)   135527.05 * -12.60%*   154285.71 (  -0.51%)
> > Hmean     disk-121   185109.64 (   0.00%)   166284.93 * -10.17%*   185298.62 (   0.10%)
> > Hmean     disk-161   229890.53 (   0.00%)   207563.39 *  -9.71%*   232883.32 *   1.30%*
> > Hmean     disk-201   223333.33 (   0.00%)   203235.59 *  -9.00%*   221446.93 (  -0.84%)
> > Hmean     disk-241   235735.25 (   0.00%)   217705.51 *  -7.65%*   239483.27 *   1.59%*
> > Hmean     disk-281   266772.15 (   0.00%)   241132.72 *  -9.61%*   263108.62 (  -1.37%)
> > Hmean     disk-321   265435.50 (   0.00%)   245412.84 *  -7.54%*   267277.27 (   0.69%)
> > 
> > Stefan, can you please test whether these patches fix the problem for you as
> > well? Comments & review welcome.
> 
> I tested the whole series against 5.19 and 6.0.0-rc2. In both cases the
> update process succeeded, which is an improvement, but the download +
> unpack duration ( ~ 7 minutes ) is not as good as with mb_optimize_scan=0
> ( ~ 1 minute ).

OK, thanks for testing! I'll try to check specifically untar whether I can
still see some differences in the IO pattern on my test machine.

> Unfortunately I don't have much time this week, and next week I'm on
> holiday.

No problem.

> Just a question: my tests always had MBCACHE=y. Is it possible that
> mb_optimize_scan is counterproductive for MBCACHE in this case?

MBCACHE (despite similar name) is actually related to extended attributes
so it likely has no impact on your workload.

> I'm asking because, before the download, the update script removes the
> files from the previous update process, which already causes a high load.

Do you mean the removal step alone is already noticeably slower with
mb_optimize_scan=1? The removal will be modifying directory blocks, inode
table blocks, block & inode bitmaps, and group descriptors. So if block
allocations are more spread out (due to mb_optimize_scan=1 used during
the untar), the removal may also take somewhat longer.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH 0/2] ext4: Fix performance regression with mballoc
  2022-08-24 10:40   ` Jan Kara
@ 2022-08-24 14:13     ` Jan Kara
  2022-08-25 17:49       ` Ojaswin Mujoo
  2022-08-24 21:24     ` Stefan Wahren
  1 sibling, 1 reply; 20+ messages in thread
From: Jan Kara @ 2022-08-24 14:13 UTC (permalink / raw)
  To: Stefan Wahren
  Cc: Jan Kara, Ted Tso, linux-ext4, Thorsten Leemhuis, Ojaswin Mujoo,
	Harshad Shirwadkar

On Wed 24-08-22 12:40:10, Jan Kara wrote:
> Hi Stefan!
> 
> On Wed 24-08-22 12:17:14, Stefan Wahren wrote:
> > On 23.08.22 22:15, Jan Kara wrote:
> > > Hello,
> > > 
> > > So I have implemented mballoc improvements to avoid spreading allocations
> > > even with mb_optimize_scan=1. It fixes the performance regression I was able
> > > to reproduce with reaim on my test machine:
> > > 
> > >                       mb_optimize_scan=0     mb_optimize_scan=1     patched
> > > Hmean     disk-1       2076.12 (   0.00%)     2099.37 (   1.12%)     2032.52 (  -2.10%)
> > > Hmean     disk-41     92481.20 (   0.00%)    83787.47 *  -9.40%*    90308.37 (  -2.35%)
> > > Hmean     disk-81    155073.39 (   0.00%)   135527.05 * -12.60%*   154285.71 (  -0.51%)
> > > Hmean     disk-121   185109.64 (   0.00%)   166284.93 * -10.17%*   185298.62 (   0.10%)
> > > Hmean     disk-161   229890.53 (   0.00%)   207563.39 *  -9.71%*   232883.32 *   1.30%*
> > > Hmean     disk-201   223333.33 (   0.00%)   203235.59 *  -9.00%*   221446.93 (  -0.84%)
> > > Hmean     disk-241   235735.25 (   0.00%)   217705.51 *  -7.65%*   239483.27 *   1.59%*
> > > Hmean     disk-281   266772.15 (   0.00%)   241132.72 *  -9.61%*   263108.62 (  -1.37%)
> > > Hmean     disk-321   265435.50 (   0.00%)   245412.84 *  -7.54%*   267277.27 (   0.69%)
> > > 
> > > Stefan, can you please test whether these patches fix the problem for you as
> > > well? Comments & review welcome.
> > 
> > I tested the whole series against 5.19 and 6.0.0-rc2. In both cases the
> > update process succeeded, which is an improvement, but the download +
> > unpack duration ( ~ 7 minutes ) is not as good as with mb_optimize_scan=0
> > ( ~ 1 minute ).
> 
> OK, thanks for testing! I'll try to check specifically untar whether I can
> still see some differences in the IO pattern on my test machine.

I have created the same tar archive as the one you referenced (files
with the same number of blocks) and looked at where blocks get allocated
with mb_optimize_scan=0 and with mb_optimize_scan=1 + my patches. The
resulting IO pattern looks practically the same on my test machine. In
particular, in both cases files get allocated in only 6 groups, and if I
look at the number of erase blocks expected to be touched by file data
(for various erase block sizes from 512k to 4MB) I get practically the
same numbers for both cases.
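
(As a side note, that erase-block accounting boils down to a tiny helper
along these lines; this is an illustrative sketch assuming a 4k
filesystem block size, where 'start' and 'len' describe an extent in
filesystem blocks and 'erase_blk' is the erase block size in bytes being
tested - the helper name and parameters are made up for the example:)

	/* How many distinct erase blocks a single extent spans. */
	static unsigned long erase_blocks_spanned(unsigned long start,
						  unsigned long len,
						  unsigned long erase_blk)
	{
		unsigned long first = (start * 4096UL) / erase_blk;
		unsigned long last = ((start + len) * 4096UL - 1) / erase_blk;

		return last - first + 1;
	}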

Ojaswin, I think you've also mentioned you were able to reproduce the
issue in your setup? Are you still able to reproduce it with the patched
kernel? Can you help with debugging while Stefan is away?

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH 0/2] ext4: Fix performance regression with mballoc
  2022-08-24 10:40   ` Jan Kara
  2022-08-24 14:13     ` Jan Kara
@ 2022-08-24 21:24     ` Stefan Wahren
  2022-08-25  9:18       ` Jan Kara
  1 sibling, 1 reply; 20+ messages in thread
From: Stefan Wahren @ 2022-08-24 21:24 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ted Tso, linux-ext4, Thorsten Leemhuis, Ojaswin Mujoo,
	Harshad Shirwadkar

Hi Jan,

On 24.08.22 12:40, Jan Kara wrote:
> Hi Stefan!
>
> On Wed 24-08-22 12:17:14, Stefan Wahren wrote:
>> On 23.08.22 22:15, Jan Kara wrote:
>>> Hello,
>>>
>>> So I have implemented mballoc improvements to avoid spreading allocations
>>> even with mb_optimize_scan=1. It fixes the performance regression I was able
>>> to reproduce with reaim on my test machine:
>>>
>>>                        mb_optimize_scan=0     mb_optimize_scan=1     patched
>>> Hmean     disk-1       2076.12 (   0.00%)     2099.37 (   1.12%)     2032.52 (  -2.10%)
>>> Hmean     disk-41     92481.20 (   0.00%)    83787.47 *  -9.40%*    90308.37 (  -2.35%)
>>> Hmean     disk-81    155073.39 (   0.00%)   135527.05 * -12.60%*   154285.71 (  -0.51%)
>>> Hmean     disk-121   185109.64 (   0.00%)   166284.93 * -10.17%*   185298.62 (   0.10%)
>>> Hmean     disk-161   229890.53 (   0.00%)   207563.39 *  -9.71%*   232883.32 *   1.30%*
>>> Hmean     disk-201   223333.33 (   0.00%)   203235.59 *  -9.00%*   221446.93 (  -0.84%)
>>> Hmean     disk-241   235735.25 (   0.00%)   217705.51 *  -7.65%*   239483.27 *   1.59%*
>>> Hmean     disk-281   266772.15 (   0.00%)   241132.72 *  -9.61%*   263108.62 (  -1.37%)
>>> Hmean     disk-321   265435.50 (   0.00%)   245412.84 *  -7.54%*   267277.27 (   0.69%)
>>>
>>> Stefan, can you please test whether these patches fix the problem for you as
>>> well? Comments & review welcome.
>> I tested the whole series against 5.19 and 6.0.0-rc2. In both cases the
>> update process succeeded, which is an improvement, but the download +
>> unpack duration ( ~ 7 minutes ) is not as good as with mb_optimize_scan=0
>> ( ~ 1 minute ).
> OK, thanks for testing! I'll try to check specifically untar whether I can
> still see some differences in the IO pattern on my test machine.

I made two iostat output logs during the complete download phase with
5.19 and your series applied. iostat was running via an ssh connection
and rpi-update via the serial console.

First with mb_optimize_scan=0

https://github.com/lategoodbye/mb_optimize_scan_regress/blob/main/5.19_SDCIT_patch_nooptimize_download_success.iostat.log

Second with mb_optimize_scan=1

https://github.com/lategoodbye/mb_optimize_scan_regress/blob/main/5.19_SDCIT_patch_optimize_download_success.iostat.log

Maybe this helps

>
>> Unfortunately I don't have much time this week, and next week I'm on
>> holiday.
> No problem.
>
>> Just a question: my tests always had MBCACHE=y. Is it possible that
>> mb_optimize_scan is counterproductive for MBCACHE in this case?
> MBCACHE (despite similar name) is actually related to extended attributes
> so it likely has no impact on your workload.
>
>> I'm asking because, before the download, the update script removes the
>> files from the previous update process, which already causes a high load.
> Do you mean the removal step alone is already noticeably slower with
> mb_optimize_scan=1? The removal will be modifying directory blocks, inode
> table blocks, block & inode bitmaps, and group descriptors. So if block
> allocations are more spread out (due to mb_optimize_scan=1 used during
> the untar), the removal may also take somewhat longer.
Not sure about this; maybe we should concentrate on the download/untar
phase.
> 								Honza


* Re: [PATCH 0/2] ext4: Fix performance regression with mballoc
  2022-08-24 21:24     ` Stefan Wahren
@ 2022-08-25  9:18       ` Jan Kara
  2022-08-25 15:48         ` Stefan Wahren
  2022-08-25 16:57         ` Stefan Wahren
  0 siblings, 2 replies; 20+ messages in thread
From: Jan Kara @ 2022-08-25  9:18 UTC (permalink / raw)
  To: Stefan Wahren
  Cc: Jan Kara, Ted Tso, linux-ext4, Thorsten Leemhuis, Ojaswin Mujoo,
	Harshad Shirwadkar

Hi Stefan!

On Wed 24-08-22 23:24:43, Stefan Wahren wrote:
> On 24.08.22 12:40, Jan Kara wrote:
> > Hi Stefan!
> > 
> > On Wed 24-08-22 12:17:14, Stefan Wahren wrote:
> > > On 23.08.22 22:15, Jan Kara wrote:
> > > > Hello,
> > > > 
> > > > So I have implemented mballoc improvements to avoid spreading allocations
> > > > even with mb_optimize_scan=1. It fixes the performance regression I was able
> > > > to reproduce with reaim on my test machine:
> > > > 
> > > >                        mb_optimize_scan=0     mb_optimize_scan=1     patched
> > > > Hmean     disk-1       2076.12 (   0.00%)     2099.37 (   1.12%)     2032.52 (  -2.10%)
> > > > Hmean     disk-41     92481.20 (   0.00%)    83787.47 *  -9.40%*    90308.37 (  -2.35%)
> > > > Hmean     disk-81    155073.39 (   0.00%)   135527.05 * -12.60%*   154285.71 (  -0.51%)
> > > > Hmean     disk-121   185109.64 (   0.00%)   166284.93 * -10.17%*   185298.62 (   0.10%)
> > > > Hmean     disk-161   229890.53 (   0.00%)   207563.39 *  -9.71%*   232883.32 *   1.30%*
> > > > Hmean     disk-201   223333.33 (   0.00%)   203235.59 *  -9.00%*   221446.93 (  -0.84%)
> > > > Hmean     disk-241   235735.25 (   0.00%)   217705.51 *  -7.65%*   239483.27 *   1.59%*
> > > > Hmean     disk-281   266772.15 (   0.00%)   241132.72 *  -9.61%*   263108.62 (  -1.37%)
> > > > Hmean     disk-321   265435.50 (   0.00%)   245412.84 *  -7.54%*   267277.27 (   0.69%)
> > > > 
> > > > Stefan, can you please test whether these patches fix the problem for you as
> > > > well? Comments & review welcome.
> > > I tested the whole series against 5.19 and 6.0.0-rc2. In both cases the
> > > update process succeeded, which is an improvement, but the download +
> > > unpack duration ( ~ 7 minutes ) is not as good as with mb_optimize_scan=0
> > > ( ~ 1 minute ).
> > OK, thanks for testing! I'll try to check specifically untar whether I can
> > still see some differences in the IO pattern on my test machine.
> 
> I made two iostat output logs during the complete download phase with 5.19
> and your series applied. iostat was running via an ssh connection and
> rpi-update via the serial console.
> 
> First with mb_optimize_scan=0
> 
> https://github.com/lategoodbye/mb_optimize_scan_regress/blob/main/5.19_SDCIT_patch_nooptimize_download_success.iostat.log
> 
> Second with mb_optimize_scan=1
> 
> https://github.com/lategoodbye/mb_optimize_scan_regress/blob/main/5.19_SDCIT_patch_optimize_download_success.iostat.log
> 
> Maybe this helps

Thanks for the data! So this is interesting. In both iostat logs, there
is an initial phase where no IO happens. I guess that's expected. It is
significantly longer in the mb_optimize_scan=0 case, but I suppose that
is just caused by a difference in when iostat was actually started. Then
in mb_optimize_scan=0 there are 155 seconds during which the eMMC card is
100% utilized, and then iostat ends. During this time ~63MB is written
altogether (~0.4 MB/s on average). Request sizes vary a lot; the average
is 60KB.

In the mb_optimize_scan=1 case there are 715 seconds recorded during
which the eMMC card is 100% utilized. During this time ~133MB is written
(~0.19 MB/s on average) and the average request size is 40KB. If I look
at just the first 155 seconds of the trace (assuming iostat was in both
cases terminated before writing was fully done), we have written ~53MB
and the average request size is 56KB.

So with mb_optimize_scan=1 we are indeed still somewhat slower, but based
on the trace it is not clear why the download+unpack should take 7
minutes instead of 1 minute. There must be some other effect we are
missing.

Perhaps if you just download the archive manually, call sync(1), and
measure how long it takes to (untar the archive + sync) in
mb_optimize_scan=0/1, we can see whether plain untar is indeed making the
difference or whether there's something else influencing the result as
well (I have checked, and rpi-update does a lot of other deleting &
copying as part of the update)? Thanks.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH 0/2] ext4: Fix performance regression with mballoc
  2022-08-25  9:18       ` Jan Kara
@ 2022-08-25 15:48         ` Stefan Wahren
  2022-08-26  9:52           ` Jan Kara
  2022-08-25 16:57         ` Stefan Wahren
  1 sibling, 1 reply; 20+ messages in thread
From: Stefan Wahren @ 2022-08-25 15:48 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ted Tso, linux-ext4, Thorsten Leemhuis, Ojaswin Mujoo,
	Harshad Shirwadkar

Hi Jan,

On 25.08.22 11:18, Jan Kara wrote:
> Hi Stefan!
>
> On Wed 24-08-22 23:24:43, Stefan Wahren wrote:
>> On 24.08.22 12:40, Jan Kara wrote:
>>> Hi Stefan!
>>>
>>> On Wed 24-08-22 12:17:14, Stefan Wahren wrote:
>>>> On 23.08.22 22:15, Jan Kara wrote:
>>>>> Hello,
>>>>>
>>>>> So I have implemented mballoc improvements to avoid spreading allocations
>>>>> even with mb_optimize_scan=1. It fixes the performance regression I was able
>>>>> to reproduce with reaim on my test machine:
>>>>>
>>>>>                         mb_optimize_scan=0     mb_optimize_scan=1     patched
>>>>> Hmean     disk-1       2076.12 (   0.00%)     2099.37 (   1.12%)     2032.52 (  -2.10%)
>>>>> Hmean     disk-41     92481.20 (   0.00%)    83787.47 *  -9.40%*    90308.37 (  -2.35%)
>>>>> Hmean     disk-81    155073.39 (   0.00%)   135527.05 * -12.60%*   154285.71 (  -0.51%)
>>>>> Hmean     disk-121   185109.64 (   0.00%)   166284.93 * -10.17%*   185298.62 (   0.10%)
>>>>> Hmean     disk-161   229890.53 (   0.00%)   207563.39 *  -9.71%*   232883.32 *   1.30%*
>>>>> Hmean     disk-201   223333.33 (   0.00%)   203235.59 *  -9.00%*   221446.93 (  -0.84%)
>>>>> Hmean     disk-241   235735.25 (   0.00%)   217705.51 *  -7.65%*   239483.27 *   1.59%*
>>>>> Hmean     disk-281   266772.15 (   0.00%)   241132.72 *  -9.61%*   263108.62 (  -1.37%)
>>>>> Hmean     disk-321   265435.50 (   0.00%)   245412.84 *  -7.54%*   267277.27 (   0.69%)
>>>>>
>>>>> Stefan, can you please test whether these patches fix the problem for you as
>>>>> well? Comments & review welcome.
>>>> I tested the whole series against 5.19 and 6.0.0-rc2. In both cases the
>>>> update process succeeded, which is an improvement, but the download +
>>>> unpack duration ( ~ 7 minutes ) is not as good as with mb_optimize_scan=0
>>>> ( ~ 1 minute ).
>>> OK, thanks for testing! I'll try to check specifically untar whether I can
>>> still see some differences in the IO pattern on my test machine.
>> I made two iostat output logs during the complete download phase with 5.19
>> and your series applied. iostat was running via an ssh connection and
>> rpi-update via the serial console.
>>
>> First with mb_optimize_scan=0
>>
>> https://github.com/lategoodbye/mb_optimize_scan_regress/blob/main/5.19_SDCIT_patch_nooptimize_download_success.iostat.log
>>
>> Second with mb_optimize_scan=1
>>
>> https://github.com/lategoodbye/mb_optimize_scan_regress/blob/main/5.19_SDCIT_patch_optimize_download_success.iostat.log
>>
>> Maybe this helps
> Thanks for the data! So this is interesting. In both iostat logs, there is
> initial phase where no IO happens. I guess that's expected. It is
> significantly longer in the mb_optimize_scan=0 but I suppose that is just
> caused by a difference in when iostat was actually started. Then in
> mb_optimize_scan=0 there is 155 seconds where the eMMC card is 100%
> utilized and then iostat ends. During this time ~63MB is written
> altogether. Request sizes vary a lot, average is 60KB.
>
> In mb_optimize_scan=1 case there is 715 seconds recorded where eMMC card is
> 100% utilized. During this time ~133MB is written, average request size is
> 40KB. If I look just at first 155 seconds of the trace (assuming iostat was
> in both cases terminated before writing was fully done), we have written
> ~53MB and average request size is 56KB.
>
> So with mb_optimize_scan=1 we are indeed still somewhat slower but based on
> the trace it is not clear why the download+unpack should take 7 minutes
> instead of 1 minute. There must be some other effect we are missing.
>
> Perhaps if you just download the archive manually, call sync(1), and measure
> how long it takes to (untar the archive + sync) in mb_optimize_scan=0/1 we
> can see whether plain untar is indeed making the difference or there's
> something else influencing the result as well (I have checked and
> rpi-update does a lot of other deleting & copying as the part of the
> update)? Thanks.

I will provide those iostats.

Btw I untarred the firmware archive (mb_optimize_scan=1 and your patch)
and got the following:

cat /proc/fs/ext4/mmcblk1p2/mb_structs_summary


optimize_scan: 1
max_free_order_lists:
         list_order_0_groups: 5
         list_order_1_groups: 0
         list_order_2_groups: 0
         list_order_3_groups: 0
         list_order_4_groups: 1
         list_order_5_groups: 0
         list_order_6_groups: 1
         list_order_7_groups: 1
         list_order_8_groups: 10
         list_order_9_groups: 1
         list_order_10_groups: 2
         list_order_11_groups: 0
         list_order_12_groups: 2
         list_order_13_groups: 55
fragment_size_tree:
         tree_min: 1
         tree_max: 31249

         tree_nodes: 79

Is this expected?

>
> 								Honza


* Re: [PATCH 0/2] ext4: Fix performance regression with mballoc
  2022-08-25  9:18       ` Jan Kara
  2022-08-25 15:48         ` Stefan Wahren
@ 2022-08-25 16:57         ` Stefan Wahren
  2022-08-26 10:15           ` Jan Kara
  1 sibling, 1 reply; 20+ messages in thread
From: Stefan Wahren @ 2022-08-25 16:57 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ted Tso, linux-ext4, Thorsten Leemhuis, Ojaswin Mujoo,
	Harshad Shirwadkar

Hi Jan,

On 25.08.22 11:18, Jan Kara wrote:
> Hi Stefan!
>
> On Wed 24-08-22 23:24:43, Stefan Wahren wrote:
>> On 24.08.22 12:40, Jan Kara wrote:
>>> Hi Stefan!
>>>
>>> On Wed 24-08-22 12:17:14, Stefan Wahren wrote:
>>>> On 23.08.22 22:15, Jan Kara wrote:
>>>>> Hello,
>>>>>
>>>>> So I have implemented mballoc improvements to avoid spreading allocations
>>>>> even with mb_optimize_scan=1. It fixes the performance regression I was able
>>>>> to reproduce with reaim on my test machine:
>>>>>
>>>>>                         mb_optimize_scan=0     mb_optimize_scan=1     patched
>>>>> Hmean     disk-1       2076.12 (   0.00%)     2099.37 (   1.12%)     2032.52 (  -2.10%)
>>>>> Hmean     disk-41     92481.20 (   0.00%)    83787.47 *  -9.40%*    90308.37 (  -2.35%)
>>>>> Hmean     disk-81    155073.39 (   0.00%)   135527.05 * -12.60%*   154285.71 (  -0.51%)
>>>>> Hmean     disk-121   185109.64 (   0.00%)   166284.93 * -10.17%*   185298.62 (   0.10%)
>>>>> Hmean     disk-161   229890.53 (   0.00%)   207563.39 *  -9.71%*   232883.32 *   1.30%*
>>>>> Hmean     disk-201   223333.33 (   0.00%)   203235.59 *  -9.00%*   221446.93 (  -0.84%)
>>>>> Hmean     disk-241   235735.25 (   0.00%)   217705.51 *  -7.65%*   239483.27 *   1.59%*
>>>>> Hmean     disk-281   266772.15 (   0.00%)   241132.72 *  -9.61%*   263108.62 (  -1.37%)
>>>>> Hmean     disk-321   265435.50 (   0.00%)   245412.84 *  -7.54%*   267277.27 (   0.69%)
>>>>>
>>>>> Stefan, can you please test whether these patches fix the problem for you as
>>>>> well? Comments & review welcome.
>>>> I tested the whole series against 5.19 and 6.0.0-rc2. In both cases the
>>>> update process succeeded, which is an improvement, but the download +
>>>> unpack duration ( ~ 7 minutes ) is not as good as with mb_optimize_scan=0
>>>> ( ~ 1 minute ).
>>> OK, thanks for testing! I'll try to check specifically untar whether I can
>>> still see some differences in the IO pattern on my test machine.
>> I made two iostat output logs during the complete download phase with 5.19
>> and your series applied. iostat was running via an ssh connection and
>> rpi-update via the serial console.
>>
>> First with mb_optimize_scan=0
>>
>> https://github.com/lategoodbye/mb_optimize_scan_regress/blob/main/5.19_SDCIT_patch_nooptimize_download_success.iostat.log
>>
>> Second with mb_optimize_scan=1
>>
>> https://github.com/lategoodbye/mb_optimize_scan_regress/blob/main/5.19_SDCIT_patch_optimize_download_success.iostat.log
>>
>> Maybe this helps
> Thanks for the data! So this is interesting. In both iostat logs, there is
> initial phase where no IO happens. I guess that's expected. It is
> significantly longer in the mb_optimize_scan=0 but I suppose that is just
> caused by a difference in when iostat was actually started. Then in
> mb_optimize_scan=0 there is 155 seconds where the eMMC card is 100%
> utilized and then iostat ends. During this time ~63MB is written
> altogether. Request sizes vary a lot, average is 60KB.
>
> In mb_optimize_scan=1 case there is 715 seconds recorded where eMMC card is
> 100% utilized. During this time ~133MB is written, average request size is
> 40KB. If I look just at first 155 seconds of the trace (assuming iostat was
> in both cases terminated before writing was fully done), we have written
> ~53MB and average request size is 56KB.
>
> So with mb_optimize_scan=1 we are indeed still somewhat slower but based on
> the trace it is not clear why the download+unpack should take 7 minutes
> instead of 1 minute. There must be some other effect we are missing.
>
> Perhaps if you just download the archive manually, call sync(1), and measure
> how long it takes to (untar the archive + sync) in mb_optimize_scan=0/1 we
> can see whether plain untar is indeed making the difference or there's
> something else influencing the result as well (I have checked and
> rpi-update does a lot of other deleting & copying as the part of the
> update)? Thanks.

mb_optimize_scan=0 -> almost 5 minutes

mb_optimize_scan=1 -> almost 18 minutes

https://github.com/lategoodbye/mb_optimize_scan_regress/commit/3f3fe8f87881687bb654051942923a6b78f16dec

>
> 								Honza


* Re: [PATCH 0/2] ext4: Fix performance regression with mballoc
  2022-08-24 14:13     ` Jan Kara
@ 2022-08-25 17:49       ` Ojaswin Mujoo
  2022-08-26  9:07         ` Jan Kara
  0 siblings, 1 reply; 20+ messages in thread
From: Ojaswin Mujoo @ 2022-08-25 17:49 UTC (permalink / raw)
  To: Jan Kara
  Cc: Stefan Wahren, Ted Tso, linux-ext4, Thorsten Leemhuis,
	Harshad Shirwadkar

On Wed, Aug 24, 2022 at 04:13:38PM +0200, Jan Kara wrote:
> On Wed 24-08-22 12:40:10, Jan Kara wrote:
> > Hi Stefan!
> > 
> > On Wed 24-08-22 12:17:14, Stefan Wahren wrote:
> > > On 23.08.22 22:15, Jan Kara wrote:
> > > > Hello,
> > > > 
> > > > So I have implemented mballoc improvements to avoid spreading allocations
> > > > even with mb_optimize_scan=1. It fixes the performance regression I was able
> > > > to reproduce with reaim on my test machine:
> > > > 
> > > >                       mb_optimize_scan=0     mb_optimize_scan=1     patched
> > > > Hmean     disk-1       2076.12 (   0.00%)     2099.37 (   1.12%)     2032.52 (  -2.10%)
> > > > Hmean     disk-41     92481.20 (   0.00%)    83787.47 *  -9.40%*    90308.37 (  -2.35%)
> > > > Hmean     disk-81    155073.39 (   0.00%)   135527.05 * -12.60%*   154285.71 (  -0.51%)
> > > > Hmean     disk-121   185109.64 (   0.00%)   166284.93 * -10.17%*   185298.62 (   0.10%)
> > > > Hmean     disk-161   229890.53 (   0.00%)   207563.39 *  -9.71%*   232883.32 *   1.30%*
> > > > Hmean     disk-201   223333.33 (   0.00%)   203235.59 *  -9.00%*   221446.93 (  -0.84%)
> > > > Hmean     disk-241   235735.25 (   0.00%)   217705.51 *  -7.65%*   239483.27 *   1.59%*
> > > > Hmean     disk-281   266772.15 (   0.00%)   241132.72 *  -9.61%*   263108.62 (  -1.37%)
> > > > Hmean     disk-321   265435.50 (   0.00%)   245412.84 *  -7.54%*   267277.27 (   0.69%)
> > > > 
> > > > Stefan, can you please test whether these patches fix the problem for you as
> > > > well? Comments & review welcome.
> > > 
> > > I tested the whole series against 5.19 and 6.0.0-rc2. In both cases the
> > > update process succeeded, which is an improvement, but the download +
> > > unpack duration ( ~ 7 minutes ) is not as good as with mb_optimize_scan=0
> > > ( ~ 1 minute ).
> > 
> > OK, thanks for testing! I'll try to check specifically untar whether I can
> > still see some differences in the IO pattern on my test machine.
> 
> I have created the same tar archive as you've referenced (files with same
> number of blocks) and looked at where blocks get allocated with
> mb_optimize_scan=0 and with mb_optimize_scan=1 + my patches. And the
> resulting IO pattern looks practically the same on my test machine. In
> particular in both cases files get allocated only in 6 groups, if I look
> at the number of erase blocks that are expected to be touched by file data
> (for various erase block sizes from 512k to 4MB) I get practically same
> numbers for both cases.
> 
> Ojaswin, I think you've also mentioned you were able to reproduce the issue
> in your setup? Are you still able to reproduce it with the patched kernel?
> Can you help debugging while Stefan is away?
> 
>                 Honza
Hi Jan,

So I ran some more tests on the v6.0-rc2 kernel with and without your
patches, and here are the details:

Workload:-
  time tar -xf rpi-firmware.tar -C ./test
  time sync

System details:
  - Rpi 3b+ w/ 8G memory card (~4G free)
  - tar is ~120MB compressed

And here is the output of the time command for the various tests. Since
some of them take some time to complete, I ran them only 2-3 times each,
so the numbers might vary, but they are indicative of the issue.

v6.0-rc2 (Without patches)

mb_optimize_scan = 0

**tar**
real    1m39.574s
user    0m10.311s
sys     0m2.761s  

**sync**
real    0m22.269s
user    0m0.001s
sys     0m0.005s

mb_optimize_scan = 1

**tar**
real    21m25.288s
user    0m9.607
sys     0m3.026

**sync**
real    1m23.402s
user    0m0.005s
sys     0m0.000s

v6.0-rc2 (With patches)

mb_optimize_scan = 0

* similar to unpatched (~1 to 2mins) *

mb_optimize_scan = 1

**tar**
real    5m7.858s
user    0m11.008s
sys     0m2.739s

**sync**
real    6m7.308s
user    0m0.003s
sys     0m0.001s

At this point, I'm pretty confident that it is the untar operation that
accounts for most of the regression, and that no other download/delete
operations in rpi-update are behind the delay. Further, your patches do
seem to improve the performance, but, from my tests, we are still not
close to the mb_optimize_scan=0 numbers.

I'm going to spend some more time collecting perf data and finding which
block groups the allocations happen in during the untar, to see if we can
get a better idea from that data. Let me know if you'd want me to collect
anything else.

PS: One question: to find the block groups being used, I'm planning to
take the dumpe2fs output before and after the untar and then see the
groups where the free block counts changed (since there is not much
running on the Pi, I assume this should give us a rough idea of the
allocation pattern of the untar); just wanted to check if there's a
better way to get this data.

Regards,
Ojaswin
> 
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR


* Re: [PATCH 0/2] ext4: Fix performance regression with mballoc
  2022-08-25 17:49       ` Ojaswin Mujoo
@ 2022-08-26  9:07         ` Jan Kara
  0 siblings, 0 replies; 20+ messages in thread
From: Jan Kara @ 2022-08-26  9:07 UTC (permalink / raw)
  To: Ojaswin Mujoo
  Cc: Jan Kara, Stefan Wahren, Ted Tso, linux-ext4, Thorsten Leemhuis,
	Harshad Shirwadkar

On Thu 25-08-22 23:19:48, Ojaswin Mujoo wrote:
> On Wed, Aug 24, 2022 at 04:13:38PM +0200, Jan Kara wrote:
> > On Wed 24-08-22 12:40:10, Jan Kara wrote:
> > > Hi Stefan!
> > > 
> > > On Wed 24-08-22 12:17:14, Stefan Wahren wrote:
> > > > On 23.08.22 22:15, Jan Kara wrote:
> > > > > Hello,
> > > > > 
> > > > > So I have implemented mballoc improvements to avoid spreading allocations
> > > > > even with mb_optimize_scan=1. It fixes the performance regression I was able
> > > > > to reproduce with reaim on my test machine:
> > > > > 
> > > > >                       mb_optimize_scan=0     mb_optimize_scan=1     patched
> > > > > Hmean     disk-1       2076.12 (   0.00%)     2099.37 (   1.12%)     2032.52 (  -2.10%)
> > > > > Hmean     disk-41     92481.20 (   0.00%)    83787.47 *  -9.40%*    90308.37 (  -2.35%)
> > > > > Hmean     disk-81    155073.39 (   0.00%)   135527.05 * -12.60%*   154285.71 (  -0.51%)
> > > > > Hmean     disk-121   185109.64 (   0.00%)   166284.93 * -10.17%*   185298.62 (   0.10%)
> > > > > Hmean     disk-161   229890.53 (   0.00%)   207563.39 *  -9.71%*   232883.32 *   1.30%*
> > > > > Hmean     disk-201   223333.33 (   0.00%)   203235.59 *  -9.00%*   221446.93 (  -0.84%)
> > > > > Hmean     disk-241   235735.25 (   0.00%)   217705.51 *  -7.65%*   239483.27 *   1.59%*
> > > > > Hmean     disk-281   266772.15 (   0.00%)   241132.72 *  -9.61%*   263108.62 (  -1.37%)
> > > > > Hmean     disk-321   265435.50 (   0.00%)   245412.84 *  -7.54%*   267277.27 (   0.69%)
> > > > > 
> > > > > Stefan, can you please test whether these patches fix the problem for you as
> > > > > well? Comments & review welcome.
> > > > 
> > > > I tested the whole series against 5.19 and 6.0.0-rc2. In both cases the
> > > > update process succeeded, which is an improvement, but the download +
> > > > unpack duration ( ~ 7 minutes ) is not as good as with mb_optimize_scan=0
> > > > ( ~ 1 minute ).
> > > 
> > > OK, thanks for testing! I'll try to check specifically untar whether I can
> > > still see some differences in the IO pattern on my test machine.
> > 
> > I have created the same tar archive as you've referenced (files with same
> > number of blocks) and looked at where blocks get allocated with
> > mb_optimize_scan=0 and with mb_optimize_scan=1 + my patches. And the
> > resulting IO pattern looks practically the same on my test machine. In
> > particular in both cases files get allocated only in 6 groups, if I look
> > at the number of erase blocks that are expected to be touched by file data
> > (for various erase block sizes from 512k to 4MB) I get practically same
> > numbers for both cases.
> > 
> > Ojaswin, I think you've also mentioned you were able to reproduce the issue
> > in your setup? Are you still able to reproduce it with the patched kernel?
> > Can you help debugging while Stefan is away?
> > 
> >                 Honza
> Hi Jan,
> 
> So I ran some more tests on v6.0-rc2 kernel with and without your patches and
> here are the details:
> 
> Workload:-
>   time tar -xf rpi-firmware.tar -C ./test
>   time sync
> 
> System details:
>   - Rpi 3b+ w/ 8G memory card (~4G free)
>   - tar is ~120MB compressed

Hum, maybe the difference is that I've tried with a somewhat larger (20G)
and otherwise empty filesystem...

> And here is the output of the time command for the various tests. Since
> some of them take some time to complete, I ran them only 2-3 times each,
> so the numbers might vary, but they are indicative of the issue.
> 
> v6.0-rc2 (Without patches)
> 
> mb_optimize_scan = 0
> 
> **tar**
> real    1m39.574s
> user    0m10.311s
> sys     0m2.761s  
> 
> **sync**
> real    0m22.269s
> user    0m0.001s
> sys     0m0.005s
> 
> mb_optimize_scan = 1
> 
> **tar**
> real    21m25.288s
> user    0m9.607
> sys     0m3.026
> 
> **sync**
> real    1m23.402s
> user    0m0.005s
> sys     0m0.000s
> 
> v6.0-rc2 (With patches)
> 
> mb_optimize_scan = 0
> 
> * similar to unpatched (~1 to 2mins) *
> 
> mb_optimize_scan = 1
> 
> **tar**
> real    5m7.858s
> user    0m11.008s
> sys     0m2.739s
> 
> **sync**
> real    6m7.308s
> user    0m0.003s
> sys     0m0.001s
> 
> At this point, I'm pretty confident that it is the untar operation that
> accounts for most of the regression, and that no other download/delete
> operations in rpi-update are behind the delay. Further, your patches do
> seem to improve the performance, but, from my tests, we are still not
> close to the mb_optimize_scan=0 numbers.

Yes, thanks for the tests!

> I'm going to spend some more time collecting perf data and finding which
> block groups the allocations happen in during the untar, to see if we can
> get a better idea from that data. Let me know if you'd want me to collect
> anything else.
> 
> PS: One question: to find the block groups being used, I'm planning to
> take the dumpe2fs output before and after the untar and then see the
> groups where the free block counts changed (since there is not much
> running on the Pi, I assume this should give us a rough idea of the
> allocation pattern of the untar); just wanted to check if there's a
> better way to get this data.

I have used 'find <target-dir> -exec filefrag -v {} \;' to get the block
numbers of the files. That gets you better insight than plain dumpe2fs
numbers...
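
If it helps, mapping those physical block numbers to block groups is
just a division. A minimal sketch (the 32768 blocks-per-group value is
an assumption for the common 4k block size; check "Blocks per group" in
the dumpe2fs output for the real value):

	#include <stdio.h>

	/* Read physical block numbers (one per line, e.g. extracted
	 * from filefrag -v output) and print the containing group. */
	#define BLOCKS_PER_GROUP 32768UL

	int main(void)
	{
		unsigned long blk;

		while (scanf("%lu", &blk) == 1)
			printf("block %lu -> group %lu\n",
			       blk, blk / BLOCKS_PER_GROUP);
		return 0;
	}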

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH 0/2] ext4: Fix performance regression with mballoc
  2022-08-25 15:48         ` Stefan Wahren
@ 2022-08-26  9:52           ` Jan Kara
  0 siblings, 0 replies; 20+ messages in thread
From: Jan Kara @ 2022-08-26  9:52 UTC (permalink / raw)
  To: Stefan Wahren
  Cc: Jan Kara, Ted Tso, linux-ext4, Thorsten Leemhuis, Ojaswin Mujoo,
	Harshad Shirwadkar

Hi Stefan!

On Thu 25-08-22 17:48:32, Stefan Wahren wrote:
> On 25.08.22 11:18, Jan Kara wrote:
> > On Wed 24-08-22 23:24:43, Stefan Wahren wrote:
> > > On 24.08.22 12:40, Jan Kara wrote:
> > > > On Wed 24-08-22 12:17:14, Stefan Wahren wrote:
> > > > > On 23.08.22 22:15, Jan Kara wrote:
> > > > > > So I have implemented mballoc improvements to avoid spreading allocations
> > > > > > even with mb_optimize_scan=1. It fixes the performance regression I was able
> > > > > > to reproduce with reaim on my test machine:
> > > > > > 
> > > > > >                         mb_optimize_scan=0     mb_optimize_scan=1     patched
> > > > > > Hmean     disk-1       2076.12 (   0.00%)     2099.37 (   1.12%)     2032.52 (  -2.10%)
> > > > > > Hmean     disk-41     92481.20 (   0.00%)    83787.47 *  -9.40%*    90308.37 (  -2.35%)
> > > > > > Hmean     disk-81    155073.39 (   0.00%)   135527.05 * -12.60%*   154285.71 (  -0.51%)
> > > > > > Hmean     disk-121   185109.64 (   0.00%)   166284.93 * -10.17%*   185298.62 (   0.10%)
> > > > > > Hmean     disk-161   229890.53 (   0.00%)   207563.39 *  -9.71%*   232883.32 *   1.30%*
> > > > > > Hmean     disk-201   223333.33 (   0.00%)   203235.59 *  -9.00%*   221446.93 (  -0.84%)
> > > > > > Hmean     disk-241   235735.25 (   0.00%)   217705.51 *  -7.65%*   239483.27 *   1.59%*
> > > > > > Hmean     disk-281   266772.15 (   0.00%)   241132.72 *  -9.61%*   263108.62 (  -1.37%)
> > > > > > Hmean     disk-321   265435.50 (   0.00%)   245412.84 *  -7.54%*   267277.27 (   0.69%)
> > > > > > 
> > > > > > Stefan, can you please test whether these patches fix the problem for you as
> > > > > > well? Comments & review welcome.
> > > > > I tested the whole series against 5.19 and 6.0.0-rc2. In both cases the
> > > > > update process succeeded, which is an improvement, but the download +
> > > > > unpack duration ( ~ 7 minutes ) is not as good as with mb_optimize_scan=0
> > > > > ( ~ 1 minute ).
> > > > OK, thanks for testing! I'll try to check specifically untar whether I can
> > > > still see some differences in the IO pattern on my test machine.
> > > I made two iostat output logs during the complete download phase with 5.19
> > > and your series applied. iostat was running via an ssh connection and
> > > rpi-update via the serial console.
> > > 
> > > First with mb_optimize_scan=0
> > > 
> > > https://github.com/lategoodbye/mb_optimize_scan_regress/blob/main/5.19_SDCIT_patch_nooptimize_download_success.iostat.log
> > > 
> > > Second with mb_optimize_scan=1
> > > 
> > > https://github.com/lategoodbye/mb_optimize_scan_regress/blob/main/5.19_SDCIT_patch_optimize_download_success.iostat.log
> > > 
> > > Maybe this helps
> > Thanks for the data! So this is interesting. In both iostat logs, there is
> > initial phase where no IO happens. I guess that's expected. It is
> > significantly longer in the mb_optimize_scan=0 but I suppose that is just
> > caused by a difference in when iostat was actually started. Then in
> > mb_optimize_scan=0 there is 155 seconds where the eMMC card is 100%
> > utilized and then iostat ends. During this time ~63MB is written
> > altogether. Request sizes vary a lot, average is 60KB.
> > 
> > In mb_optimize_scan=1 case there is 715 seconds recorded where eMMC card is
> > 100% utilized. During this time ~133MB is written, average request size is
> > 40KB. If I look just at first 155 seconds of the trace (assuming iostat was
> > in both cases terminated before writing was fully done), we have written
> > ~53MB and average request size is 56KB.
> > 
> > So with mb_optimize_scan=1 we are indeed still somewhat slower but based on
> > the trace it is not clear why the download+unpack should take 7 minutes
> > instead of 1 minute. There must be some other effect we are missing.
> > 
> > Perhaps if you just download the archive manually, call sync(1), and measure
> > how long it takes to (untar the archive + sync) in mb_optimize_scan=0/1 we
> > can see whether plain untar is indeed making the difference or there's
> > something else influencing the result as well (I have checked and
> > rpi-update does a lot of other deleting & copying as the part of the
> > update)? Thanks.
> 
> I will provide those iostats.
> 
> Btw I untarred the firmware archive (mb_optimize_scan=1 and your patch)
> and got the following:
> 
> cat /proc/fs/ext4/mmcblk1p2/mb_structs_summary
> 
> 
> optimize_scan: 1
> max_free_order_lists:
>         list_order_0_groups: 5
>         list_order_1_groups: 0
>         list_order_2_groups: 0
>         list_order_3_groups: 0
>         list_order_4_groups: 1
>         list_order_5_groups: 0
>         list_order_6_groups: 1
>         list_order_7_groups: 1
>         list_order_8_groups: 10
>         list_order_9_groups: 1
>         list_order_10_groups: 2
>         list_order_11_groups: 0
>         list_order_12_groups: 2
>         list_order_13_groups: 55
> fragment_size_tree:
>         tree_min: 1
>         tree_max: 31249
> 
>         tree_nodes: 79
> 
> Is this expected?

Yes, I don't see anything out of the ordinary in this for a used
filesystem. It tells us there are 55 groups with big chunks of free space
(order 13 means free chunks of 2^13 blocks, i.e. 32MB with 4k blocks),
and there are some groups which have only small chunks of free space, but
that's expected when the filesystem is reasonably filled...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH 0/2] ext4: Fix performance regression with mballoc
  2022-08-25 16:57         ` Stefan Wahren
@ 2022-08-26 10:15           ` Jan Kara
  2022-08-27 14:36             ` Ojaswin Mujoo
  0 siblings, 1 reply; 20+ messages in thread
From: Jan Kara @ 2022-08-26 10:15 UTC (permalink / raw)
  To: Stefan Wahren
  Cc: Jan Kara, Ted Tso, linux-ext4, Thorsten Leemhuis, Ojaswin Mujoo,
	Harshad Shirwadkar

Hi Stefan,

On Thu 25-08-22 18:57:08, Stefan Wahren wrote:
> > Perhaps if you just download the archive manually, call sync(1), and measure
> > how long it takes to (untar the archive + sync) in mb_optimize_scan=0/1 we
> > can see whether plain untar is indeed making the difference or there's
> > something else influencing the result as well (I have checked and
> > rpi-update does a lot of other deleting & copying as the part of the
> > update)? Thanks.
> 
> mb_optimize_scan=0 -> almost 5 minutes
> 
> mb_optimize_scan=1 -> almost 18 minutes
> 
> https://github.com/lategoodbye/mb_optimize_scan_regress/commit/3f3fe8f87881687bb654051942923a6b78f16dec

Thanks! So now the iostat data indeed looks substantially different.

			nooptimize	optimize
Total written		183.6 MB	190.5 MB
Time (recorded)		283 s		1040 s
Avg write request size	79 KB		41 KB

So indeed with mb_optimize_scan=1 we submit substantially smaller
requests on average, and sustain roughly 0.18 MB/s of writeback versus
0.65 MB/s. So far I'm not sure why that is. Since Ojaswin can reproduce
as well, let's see what he can see from the block location info. Thanks
again for the help with debugging this, and enjoy your vacation!

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH 0/2] ext4: Fix performance regression with mballoc
  2022-08-26 10:15           ` Jan Kara
@ 2022-08-27 14:36             ` Ojaswin Mujoo
  2022-08-29  9:04               ` Jan Kara
  2022-09-04 10:01               ` Stefan Wahren
  0 siblings, 2 replies; 20+ messages in thread
From: Ojaswin Mujoo @ 2022-08-27 14:36 UTC (permalink / raw)
  To: Jan Kara
  Cc: Stefan Wahren, Ted Tso, linux-ext4, Thorsten Leemhuis,
	Harshad Shirwadkar

On Fri, Aug 26, 2022 at 12:15:22PM +0200, Jan Kara wrote:
> Hi Stefan,
> 
> On Thu 25-08-22 18:57:08, Stefan Wahren wrote:
> > > Perhaps if you just download the archive manually, call sync(1), and measure
> > > how long it takes to (untar the archive + sync) in mb_optimize_scan=0/1 we
> > > can see whether plain untar is indeed making the difference or there's
> > > something else influencing the result as well (I have checked and
> > > rpi-update does a lot of other deleting & copying as the part of the
> > > update)? Thanks.
> > 
> > mb_optimize_scan=0 -> almost 5 minutes
> > 
> > mb_optimize_scan=1 -> almost 18 minutes
> > 
> > https://github.com/lategoodbye/mb_optimize_scan_regress/commit/3f3fe8f87881687bb654051942923a6b78f16dec
> 
> Thanks! So now the iostat data indeed looks substantially different.
> 
> 			nooptimize	optimize
> Total written		183.6 MB	190.5 MB
> Time (recorded)		283 s		1040 s
> Avg write request size	79 KB		41 KB
> 
> So indeed with mb_optimize_scan=1 we do submit substantially smaller
> requests on average. So far I'm not sure why that is. Since Ojaswin can
> reproduce as well, let's see what he can see from block location info.
> Thanks again for help with debugging this and enjoy your vacation!
> 

Hi Jan and Stefan,

Apologies for the delay; I was on leave yesterday and couldn't find time
to get to this.

So I was able to collect the block numbers using the method you
suggested. I converted the block numbers to BG numbers and plotted that
data to visualize the allocation spread. You can find them here:

mb-opt=0, patched kernel: https://github.com/OjaswinM/mbopt-bug/blob/master/grpahs/mbopt-0-patched.png
mb-opt=1, patched kernel: https://github.com/OjaswinM/mbopt-bug/blob/master/grpahs/mbopt-1-patched.png
mb-opt=1, unpatched kernel: https://github.com/OjaswinM/mbopt-bug/blob/master/grpahs/mbopt-1-unpatched.png

Observations:
* Before the patch, mb_optimize_scan=1 allocations were way more spread
  out, across 40 different BGs.
* With the patch, we still allocate in 36 different BGs, but the majority
  happen in just 1 or 2 BGs.
* With mb_optimize_scan=0, we allocate in just 7 unique BGs, which could
  explain why this is faster.

Also, one strange thing I'm seeing is that the perf profiles don't really
show any particular function causing the regression, which is surprising
considering mb_optimize_scan=1 takes almost 10 times more time.

All the perf data can be found here (raw files and perf report/diff
--stdio):

Lastly, FWIW I'm not able to replicate the regression when using loop
devices, and mb_optimize_scan=1 performs similarly to mb_optimize_scan=0
(without the patches as well). Not sure if this is related to the issue
or just a side effect of using loop devices.

Will post here if I have any updates on this.

Regards,
Ojaswin

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/2] ext4: Fix performance regression with mballoc
  2022-08-27 14:36             ` Ojaswin Mujoo
@ 2022-08-29  9:04               ` Jan Kara
  2022-08-31  7:36                 ` Ojaswin Mujoo
  2022-09-04 10:01               ` Stefan Wahren
  1 sibling, 1 reply; 20+ messages in thread
From: Jan Kara @ 2022-08-29  9:04 UTC (permalink / raw)
  To: Ojaswin Mujoo
  Cc: Jan Kara, Stefan Wahren, Ted Tso, linux-ext4, Thorsten Leemhuis,
	Harshad Shirwadkar

On Sat 27-08-22 20:06:00, Ojaswin Mujoo wrote:
> On Fri, Aug 26, 2022 at 12:15:22PM +0200, Jan Kara wrote:
> > Hi Stefan,
> > 
> > On Thu 25-08-22 18:57:08, Stefan Wahren wrote:
> > > > Perhaps if you just download the archive manually, call sync(1), and measure
> > > > how long it takes to (untar the archive + sync) in mb_optimize_scan=0/1 we
> > > > can see whether plain untar is indeed making the difference or there's
> > > > something else influencing the result as well (I have checked and
> > > > rpi-update does a lot of other deleting & copying as the part of the
> > > > update)? Thanks.
> > > 
> > > mb_optimize_scan=0 -> almost 5 minutes
> > > 
> > > mb_optimize_scan=1 -> almost 18 minutes
> > > 
> > > https://github.com/lategoodbye/mb_optimize_scan_regress/commit/3f3fe8f87881687bb654051942923a6b78f16dec
> > 
> > Thanks! So now the iostat data indeed looks substantially different.
> > 
> > 			nooptimize	optimize
> > Total written		183.6 MB	190.5 MB
> > Time (recorded)		283 s		1040 s
> > Avg write request size	79 KB		41 KB
> > 
> > So indeed with mb_optimize_scan=1 we do submit substantially smaller
> > requests on average. So far I'm not sure why that is. Since Ojaswin can
> > reproduce as well, let's see what he can see from block location info.
> > Thanks again for help with debugging this and enjoy your vacation!
> > 
> 
> Hi Jan and Stefan,
> 
> Apologies for the delay, I was on leave yesterday and couldn't find time to get to this.
> 
> So I was able to collect the block numbers using the method you suggested. I converted the 
> blocks numbers to BG numbers and plotted that data to visualze the allocation spread. You can 
> find them here:
> 
> mb-opt=0, patched kernel: https://github.com/OjaswinM/mbopt-bug/blob/master/grpahs/mbopt-0-patched.png
> mb-opt=1, patched kernel: https://github.com/OjaswinM/mbopt-bug/blob/master/grpahs/mbopt-1-patched.png
> mb-opt=1, unpatched kernel: https://github.com/OjaswinM/mbopt-bug/blob/master/grpahs/mbopt-1-unpatched.png
> 
> Observations:
> * Before the patched mb_optimize_scan=1 allocations were way more spread out in
>   40 different BGs.
> * With the patch, we still allocate in 36 different BGs but majority happen in
>   just 1 or 2 BGs.
> * With mb_optimize_scan=0, we only allocate in just 7 unique BGs, which could
>   explain why this is faster.

Thanks for testing, Ojaswin! Based on the iostats from Stefan, I'm
relatively confident the spread between block groups is responsible for
the performance regression. The iostats show pretty clearly that the
write throughput is determined by the average write request size, which
is directly related to the number of block groups we allocate from.
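
(For concreteness, the quoted iostat numbers work out to roughly
183.6 MB / 283 s ~ 0.65 MB/s versus 190.5 MB / 1040 s ~ 0.18 MB/s.)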

Your stats for the patched kernel show that there are two block groups
which get a big part of the allocations (these are likely the target
block groups) but the remaining ~1/3 is spread a lot. I'm not yet sure
why that is... I guess I will fiddle some more with my test VM and try
to reproduce these allocation differences (on my test server the
allocation pattern on the patched kernel is very similar with
mb_optimize_scan=0/1).

> Also, one strange thing I'm seeing is that the perfs don't really show any
> particular function causing the regression, which is surprising considering
> mb_optimize_scan=1 almost takes 10 times more time.

Well, the time is not spent on the CPU. We spend more time waiting for IO,
which is not visible in perf profiles. You could plot something like
off-CPU flame graphs; there the difference would be visible, but I don't
expect you'd see anything more than that we spend more time in functions
waiting for writeback to complete.
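
(In case it's useful, one way to get such an off-CPU profile, assuming
the BCC tools and Brendan Gregg's FlameGraph scripts are installed; the
flags below are from memory, so double-check them against your local
versions, and on some distros the tool is packaged as offcputime-bpfcc:

  # sample off-CPU time for 60s; -d delimits kernel/user stacks,
  # -f emits folded stacks suitable for flamegraph.pl
  offcputime -df 60 > out.stacks
  # render; the folded counts are microseconds spent off-CPU
  flamegraph.pl --color=io --countname=us < out.stacks > offcpu.svg
)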

> Lastly, FWIW I'm not able to replicate the regression when using loop devices
> and mb_optmize_scan=1 performs similar to mb-opmtimize_scan=0 (without patches
> as well). Not sure if this is related to the issue or just some side effect of
> using loop devices.

This is because eMMC devices seem to be very sensitive to the IO pattern
(and write request size). For loop devices, we don't care much about the
request size, so mb_optimize_scan makes no big difference there. But can
you still see the difference in the allocation pattern with the loop
device?
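
(Something like this untested sketch is what I have in mind; paths and
sizes are arbitrary, and mb_optimize_scan can be set directly on the
loop mount since it is a plain ext4 mount option:

  truncate -s 8G /tmp/ext4.img
  mkfs.ext4 /tmp/ext4.img     # confirm the "not a block device" prompt
  mount -o loop,mb_optimize_scan=1 /tmp/ext4.img /mnt
  time sh -c 'tar xf archive.tar -C /mnt && sync'
)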

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/2] ext4: Fix performance regression with mballoc
  2022-08-29  9:04               ` Jan Kara
@ 2022-08-31  7:36                 ` Ojaswin Mujoo
  0 siblings, 0 replies; 20+ messages in thread
From: Ojaswin Mujoo @ 2022-08-31  7:36 UTC (permalink / raw)
  To: Jan Kara
  Cc: Stefan Wahren, Ted Tso, linux-ext4, Thorsten Leemhuis,
	Harshad Shirwadkar

On Mon, Aug 29, 2022 at 11:04:34AM +0200, Jan Kara wrote:
> On Sat 27-08-22 20:06:00, Ojaswin Mujoo wrote:
> > On Fri, Aug 26, 2022 at 12:15:22PM +0200, Jan Kara wrote:
> > > Hi Stefan,
> > > 
> > > On Thu 25-08-22 18:57:08, Stefan Wahren wrote:
> > > > > Perhaps if you just download the archive manually, call sync(1), and measure
> > > > > how long it takes to (untar the archive + sync) in mb_optimize_scan=0/1 we
> > > > > can see whether plain untar is indeed making the difference or there's
> > > > > something else influencing the result as well (I have checked and
> > > > > rpi-update does a lot of other deleting & copying as the part of the
> > > > > update)? Thanks.
> > > > 
> > > > mb_optimize_scan=0 -> almost 5 minutes
> > > > 
> > > > mb_optimize_scan=1 -> almost 18 minutes
> > > > 
> > > > https://github.com/lategoodbye/mb_optimize_scan_regress/commit/3f3fe8f87881687bb654051942923a6b78f16dec
> > > 
> > > Thanks! So now the iostat data indeed looks substantially different.
> > > 
> > > 			nooptimize	optimize
> > > Total written		183.6 MB	190.5 MB
> > > Time (recorded)		283 s		1040 s
> > > Avg write request size	79 KB		41 KB
> > > 
> > > So indeed with mb_optimize_scan=1 we do submit substantially smaller
> > > requests on average. So far I'm not sure why that is. Since Ojaswin can
> > > reproduce as well, let's see what he can see from block location info.
> > > Thanks again for help with debugging this and enjoy your vacation!
> > > 
> > 
> > Hi Jan and Stefan,
> > 
> > Apologies for the delay, I was on leave yesterday and couldn't find time to get to this.
> > 
> > So I was able to collect the block numbers using the method you suggested. I converted the 
> > blocks numbers to BG numbers and plotted that data to visualze the allocation spread. You can 
> > find them here:
> > 
> > mb-opt=0, patched kernel: https://github.com/OjaswinM/mbopt-bug/blob/master/grpahs/mbopt-0-patched.png
> > mb-opt=1, patched kernel: https://github.com/OjaswinM/mbopt-bug/blob/master/grpahs/mbopt-1-patched.png
> > mb-opt=1, unpatched kernel: https://github.com/OjaswinM/mbopt-bug/blob/master/grpahs/mbopt-1-unpatched.png
> > 
> > Observations:
> > * Before the patched mb_optimize_scan=1 allocations were way more spread out in
> >   40 different BGs.
> > * With the patch, we still allocate in 36 different BGs but majority happen in
> >   just 1 or 2 BGs.
> > * With mb_optimize_scan=0, we only allocate in just 7 unique BGs, which could
> >   explain why this is faster.
> 
> Thanks for testing Ojaswin! Based on iostats from Stefan, I'm relatively
> confident the spread between block groups is responsible for the
> performance regression. Iostats show pretty clearly that the write
> throughput is determined by the average write request size which is
> directly related to the number of block groups we allocate from.
> 
> Your stats for patched kernel show that there are two block groups which
> get big part of allocations (these are likely the target block groups) but
> then remaining ~1/3 is spread a lot. I'm not yet sure why that is... I
> guess I will fiddle some more with my test VM and try to reproduce these
> allocation differences (on my test server the allocation pattern on patched
> kernel is very similar with mb_optimize_scan=0/1).
> 
> > Also, one strange thing I'm seeing is that the perfs don't really show any
> > particular function causing the regression, which is surprising considering
> > mb_optimize_scan=1 almost takes 10 times more time.
> 
> Well, the time is not spent by CPU. We spend more time waiting for IO which
> is not visible in perf profiles. You could plot something like offcpu flame
> graphs, there the difference would be visible but I don't expect you'd see
> anything more than just we spend more time in functions waiting for
> writeback to complete.
Ahh I see, that makes sense.
> 
> > Lastly, FWIW I'm not able to replicate the regression when using loop devices
> > and mb_optmize_scan=1 performs similar to mb-opmtimize_scan=0 (without patches
> > as well). Not sure if this is related to the issue or just some side effect of
> > using loop devices.
> 
> This is because eMMC devices seem to be very sensitive to IO pattern (and
> write request size). For loop devices, we don't care about request size
> much so that's why mb_optimize_scan makes no big difference. But can you
> still see the difference in the allocation pattern with the loop device?
So I tested with loop devices and yes, I do see a similar allocation
pattern, although the regression is not there.

Thanks,
Ojaswin

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/2] ext4: Fix performance regression with mballoc
  2022-08-27 14:36             ` Ojaswin Mujoo
  2022-08-29  9:04               ` Jan Kara
@ 2022-09-04 10:01               ` Stefan Wahren
  2022-09-04 22:32                 ` Andreas Dilger
  1 sibling, 1 reply; 20+ messages in thread
From: Stefan Wahren @ 2022-09-04 10:01 UTC (permalink / raw)
  To: Ojaswin Mujoo, Jan Kara
  Cc: Ted Tso, linux-ext4, Thorsten Leemhuis, Harshad Shirwadkar

Hi,

On 27.08.22 at 16:36, Ojaswin Mujoo wrote:
> On Fri, Aug 26, 2022 at 12:15:22PM +0200, Jan Kara wrote:
>> Hi Stefan,
>>
>> On Thu 25-08-22 18:57:08, Stefan Wahren wrote:
>>>> Perhaps if you just download the archive manually, call sync(1), and measure
>>>> how long it takes to (untar the archive + sync) in mb_optimize_scan=0/1 we
>>>> can see whether plain untar is indeed making the difference or there's
>>>> something else influencing the result as well (I have checked and
>>>> rpi-update does a lot of other deleting & copying as the part of the
>>>> update)? Thanks.
>>> mb_optimize_scan=0 -> almost 5 minutes
>>>
>>> mb_optimize_scan=1 -> almost 18 minutes
>>>
>>> https://github.com/lategoodbye/mb_optimize_scan_regress/commit/3f3fe8f87881687bb654051942923a6b78f16dec
>> Thanks! So now the iostat data indeed looks substantially different.
>>
>> 			nooptimize	optimize
>> Total written		183.6 MB	190.5 MB
>> Time (recorded)		283 s		1040 s
>> Avg write request size	79 KB		41 KB
>>
>> So indeed with mb_optimize_scan=1 we do submit substantially smaller
>> requests on average. So far I'm not sure why that is. Since Ojaswin can
>> reproduce as well, let's see what he can see from block location info.
>> Thanks again for help with debugging this and enjoy your vacation!
>>
> Hi Jan and Stefan,
>
> Apologies for the delay, I was on leave yesterday and couldn't find time to get to this.
>
> So I was able to collect the block numbers using the method you suggested. I converted the
> blocks numbers to BG numbers and plotted that data to visualze the allocation spread. You can
> find them here:
>
> mb-opt=0, patched kernel: https://github.com/OjaswinM/mbopt-bug/blob/master/grpahs/mbopt-0-patched.png
> mb-opt=1, patched kernel: https://github.com/OjaswinM/mbopt-bug/blob/master/grpahs/mbopt-1-patched.png
> mb-opt=1, unpatched kernel: https://github.com/OjaswinM/mbopt-bug/blob/master/grpahs/mbopt-1-unpatched.png
>
> Observations:
> * Before the patched mb_optimize_scan=1 allocations were way more spread out in
>    40 different BGs.
> * With the patch, we still allocate in 36 different BGs but majority happen in
>    just 1 or 2 BGs.
> * With mb_optimize_scan=0, we only allocate in just 7 unique BGs, which could
>    explain why this is faster.

thanks, this is very helpful for my understanding. So it seems to me that 
with mb_optimize_scan disabled we have a more sequential write behavior 
and with mb_optimize_scan enabled a more random write behavior.

From my understanding, writing small blocks at random addresses of the 
SD card flash causes a lot of overhead, because the SD card controller 
needs to erase huge blocks (up to 1 MB) before it's able to program the 
flash pages. This would explain why this series doesn't fix the 
performance issue: the total number of BGs touched is still much higher.
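
A rough worst-case number, assuming a 1 MB erase block: each scattered
41 KB write can force the controller to rewrite a whole erase block,
i.e. up to 1024/41, so roughly 25x write amplification.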

Is this new block allocation pattern a side effect of the optimization 
or desired?

> Also, one strange thing I'm seeing is that the perfs don't really show any
> particular function causing the regression, which is surprising considering
> mb_optimize_scan=1 almost takes 10 times more time.
>
> All the perfs can be found here (raw files and perf report/diff --stdio ):
> https://github.com/OjaswinM/mbopt-bug/tree/master/perfs
>
> Lastly, FWIW I'm not able to replicate the regression when using loop devices
> and mb_optmize_scan=1 performs similar to mb-opmtimize_scan=0 (without patches
> as well). Not sure if this is related to the issue or just some side effect of
> using loop devices.
The regression actually happens in the SD card, but it's triggered 
externally by the different IO pattern. For a loop device there is no 
difference between sequential and random writes.
>
> Will post here if I have any updates on this.
>
> Regards,
> Ojaswin

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/2] ext4: Fix performance regression with mballoc
  2022-09-04 10:01               ` Stefan Wahren
@ 2022-09-04 22:32                 ` Andreas Dilger
  2022-09-05 10:15                   ` Jan Kara
  0 siblings, 1 reply; 20+ messages in thread
From: Andreas Dilger @ 2022-09-04 22:32 UTC (permalink / raw)
  To: Stefan Wahren
  Cc: Ojaswin Mujoo, Jan Kara, Ted Tso, linux-ext4, Thorsten Leemhuis,
	Harshad Shirwadkar

On Sep 4, 2022, at 00:01, Stefan Wahren <stefan.wahren@i2se.com> wrote:
> 
>> Am 27.08.22 um 16:36 schrieb Ojaswin Mujoo:
>>> On Fri, Aug 26, 2022 at 12:15:22PM +0200, Jan Kara wrote:
>>> Hi Stefan,
>>> 
>>> On Thu 25-08-22 18:57:08, Stefan Wahren wrote:
>>>>> Perhaps if you just download the archive manually, call sync(1), and measure
>>>>> how long it takes to (untar the archive + sync) in mb_optimize_scan=0/1 we
>>>>> can see whether plain untar is indeed making the difference or there's
>>>>> something else influencing the result as well (I have checked and
>>>>> rpi-update does a lot of other deleting & copying as the part of the
>>>>> update)? Thanks.
>>>> mb_optimize_scan=0 -> almost 5 minutes
>>>> 
>>>> mb_optimize_scan=1 -> almost 18 minutes
>>>> 
>>>> https://github.com/lategoodbye/mb_optimize_scan_regress/commit/3f3fe8f87881687bb654051942923a6b78f16dec
>>> Thanks! So now the iostat data indeed looks substantially different.
>>> 
>>> 			nooptimize	optimize
>>> Total written		183.6 MB	190.5 MB
>>> Time (recorded)		283 s		1040 s
>>> Avg write request size	79 KB		41 KB
>>> 
>>> So indeed with mb_optimize_scan=1 we do submit substantially smaller
>>> requests on average. So far I'm not sure why that is. Since Ojaswin can
>>> reproduce as well, let's see what he can see from block location info.
>>> Thanks again for help with debugging this and enjoy your vacation!
>>> 
>> Hi Jan and Stefan,
>> 
>> Apologies for the delay, I was on leave yesterday and couldn't find time to get to this.
>> 
>> So I was able to collect the block numbers using the method you suggested. I converted the
>> blocks numbers to BG numbers and plotted that data to visualze the allocation spread. You can
>> find them here:
>> 
>> mb-opt=0, patched kernel: https://github.com/OjaswinM/mbopt-bug/blob/master/grpahs/mbopt-0-patched.png
>> mb-opt=1, patched kernel: https://github.com/OjaswinM/mbopt-bug/blob/master/grpahs/mbopt-1-patched.png
>> mb-opt=1, unpatched kernel: https://github.com/OjaswinM/mbopt-bug/blob/master/grpahs/mbopt-1-unpatched.png
>> 
>> Observations:
>> * Before the patched mb_optimize_scan=1 allocations were way more spread out in
>>   40 different BGs.
>> * With the patch, we still allocate in 36 different BGs but majority happen in
>>   just 1 or 2 BGs.
>> * With mb_optimize_scan=0, we only allocate in just 7 unique BGs, which could
>>   explain why this is faster.
> 
> thanks this is very helpful for me to understand. So it seems to me that with disabled mb_optimized_scan we have a more sequential write behavior and with enabled mb_optimized_scan a more random write behavior.
> 
> From my understanding writing small blocks at random addresses of the sd card flash causes a lot of overhead, because the sd card controller need to erase huge blocks (up to 1 MB) before it's able to program the flash pages. This would explain why this series doesn't fix the performance issue, the total amount of BGs is still much higher.
> 
> Is this new block allocation pattern a side effect of the optimization or desired?

The goal of mb_optimize_scan is to avoid a large amount of linear scanning
for very large filesystems when there are many block groups that are full
or fragmented.

It seems for empty filesystems the new list management is not ideal. It
probably makes sense to have a hybrid, with a small amount of linear
scanning (e.g. a meta block group's worth), and then use the new list to
find a group if that doesn't find any group with free space.
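
In pseudocode, something along these lines (all names are made up for
illustration, not the actual mballoc API):

  /*
   * Hybrid group selection sketch: try a bounded linear window at the
   * goal group first, fall back to the optimized lists afterwards.
   */
  static ext4_group_t choose_group_hybrid(struct alloc_ctx *ac)
  {
  	ext4_group_t g = ac->goal_group;
  	unsigned int i;

  	/* e.g. one meta block group worth of linear scanning */
  	for (i = 0; i < LINEAR_SCAN_GROUPS; i++, g = next_group(ac, g))
  		if (group_has_space(ac, g))
  			return g;

  	/* otherwise use the list/tree based selection */
  	return choose_group_from_freespace_lists(ac);
  }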

Cheers, Andreas

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/2] ext4: Fix performance regression with mballoc
  2022-09-04 22:32                 ` Andreas Dilger
@ 2022-09-05 10:15                   ` Jan Kara
  0 siblings, 0 replies; 20+ messages in thread
From: Jan Kara @ 2022-09-05 10:15 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Stefan Wahren, Ojaswin Mujoo, Jan Kara, Ted Tso, linux-ext4,
	Thorsten Leemhuis, Harshad Shirwadkar

On Sun 04-09-22 12:32:59, Andreas Dilger wrote:
> On Sep 4, 2022, at 00:01, Stefan Wahren <stefan.wahren@i2se.com> wrote:
> > 
> >> Am 27.08.22 um 16:36 schrieb Ojaswin Mujoo:
> >>> On Fri, Aug 26, 2022 at 12:15:22PM +0200, Jan Kara wrote:
> >>> Hi Stefan,
> >>> 
> >>> On Thu 25-08-22 18:57:08, Stefan Wahren wrote:
> >>>>> Perhaps if you just download the archive manually, call sync(1), and measure
> >>>>> how long it takes to (untar the archive + sync) in mb_optimize_scan=0/1 we
> >>>>> can see whether plain untar is indeed making the difference or there's
> >>>>> something else influencing the result as well (I have checked and
> >>>>> rpi-update does a lot of other deleting & copying as the part of the
> >>>>> update)? Thanks.
> >>>> mb_optimize_scan=0 -> almost 5 minutes
> >>>> 
> >>>> mb_optimize_scan=1 -> almost 18 minutes
> >>>> 
> >>>> https://github.com/lategoodbye/mb_optimize_scan_regress/commit/3f3fe8f87881687bb654051942923a6b78f16dec
> >>> Thanks! So now the iostat data indeed looks substantially different.
> >>> 
> >>> 			nooptimize	optimize
> >>> Total written		183.6 MB	190.5 MB
> >>> Time (recorded)		283 s		1040 s
> >>> Avg write request size	79 KB		41 KB
> >>> 
> >>> So indeed with mb_optimize_scan=1 we do submit substantially smaller
> >>> requests on average. So far I'm not sure why that is. Since Ojaswin can
> >>> reproduce as well, let's see what he can see from block location info.
> >>> Thanks again for help with debugging this and enjoy your vacation!
> >>> 
> >> Hi Jan and Stefan,
> >> 
> >> Apologies for the delay, I was on leave yesterday and couldn't find time to get to this.
> >> 
> >> So I was able to collect the block numbers using the method you suggested. I converted the
> >> blocks numbers to BG numbers and plotted that data to visualze the allocation spread. You can
> >> find them here:
> >> 
> >> mb-opt=0, patched kernel: https://github.com/OjaswinM/mbopt-bug/blob/master/grpahs/mbopt-0-patched.png
> >> mb-opt=1, patched kernel: https://github.com/OjaswinM/mbopt-bug/blob/master/grpahs/mbopt-1-patched.png
> >> mb-opt=1, unpatched kernel: https://github.com/OjaswinM/mbopt-bug/blob/master/grpahs/mbopt-1-unpatched.png
> >> 
> >> Observations:
> >> * Before the patched mb_optimize_scan=1 allocations were way more spread out in
> >>   40 different BGs.
> >> * With the patch, we still allocate in 36 different BGs but majority happen in
> >>   just 1 or 2 BGs.
> >> * With mb_optimize_scan=0, we only allocate in just 7 unique BGs, which could
> >>   explain why this is faster.
> > 
> > thanks this is very helpful for me to understand. So it seems to me that with disabled mb_optimized_scan we have a more sequential write behavior and with enabled mb_optimized_scan a more random write behavior.
> > 
> > From my understanding writing small blocks at random addresses of the sd card flash causes a lot of overhead, because the sd card controller need to erase huge blocks (up to 1 MB) before it's able to program the flash pages. This would explain why this series doesn't fix the performance issue, the total amount of BGs is still much higher.
> > 
> > Is this new block allocation pattern a side effect of the optimization or desired?
> 
> The goal of the mb_optimized_scan is to avoid a large amount of linear
> scanning for very large filesystems when there are many block groups that
> are full or fragmented. 
> 
> It seems for empty filesystems the new list management is not very ideal.

The filesystems here are actually about half full and not too fragmented.

> It probably makes sense to have a hybrid, with some small amount of
> linear scanning (eg. a meta block group worth), and then use the new list
> to find a new group if that doesn't find any group with free space. 

There is a heuristic in the current mballoc code to scan a few block
groups linearly before using the data structures to decide about the next
block group, but it gets used only for rotational devices. I don't know
of an easy way to detect other types of storage, like eMMC cards, that
also benefit from better data allocation locality.
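
(From memory, the current logic is along these lines; take it as a
paraphrase, not a quote of fs/ext4/mballoc.c:

  /* at mount: grant a linear-scan budget only to rotational devices */
  sbi->s_mb_max_linear_groups = bdev_nonrot(sb->s_bdev) ?
  				0 : MB_DEFAULT_LINEAR_LIMIT;

  /* in the group selection path: use up the linear budget first */
  if (ac->ac_groups_linear_remaining) {
  	ac->ac_groups_linear_remaining--;
  	return;		/* just advance to the next group */
  }
)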

I have come up with two more patches on top of my current attempt which
improve allocation locality; at least for the untar case causing issues
on eMMC, they get close to the mb_optimize_scan=0 locality. I still want
to check that the higher locality does not hurt performance for highly
parallel workloads. Then I'll post them for review and discussion.

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2022-09-05 10:17 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-08-23 20:15 [PATCH 0/2] ext4: Fix performance regression with mballoc Jan Kara
2022-08-23 20:15 ` [PATCH 1/2] ext4: Make mballoc try target group first even with mb_optimize_scan Jan Kara
2022-08-23 20:15 ` [PATCH 2/2] ext4: Avoid unnecessary spreading of allocations among groups Jan Kara
2022-08-24 10:17 ` [PATCH 0/2] ext4: Fix performance regression with mballoc Stefan Wahren
2022-08-24 10:40   ` Jan Kara
2022-08-24 14:13     ` Jan Kara
2022-08-25 17:49       ` Ojaswin Mujoo
2022-08-26  9:07         ` Jan Kara
2022-08-24 21:24     ` Stefan Wahren
2022-08-25  9:18       ` Jan Kara
2022-08-25 15:48         ` Stefan Wahren
2022-08-26  9:52           ` Jan Kara
2022-08-25 16:57         ` Stefan Wahren
2022-08-26 10:15           ` Jan Kara
2022-08-27 14:36             ` Ojaswin Mujoo
2022-08-29  9:04               ` Jan Kara
2022-08-31  7:36                 ` Ojaswin Mujoo
2022-09-04 10:01               ` Stefan Wahren
2022-09-04 22:32                 ` Andreas Dilger
2022-09-05 10:15                   ` Jan Kara
