* Raid6 check performance regression 5.15 -> 5.16
@ 2022-03-07 18:15 Larkin Lowrey
2022-03-08 1:00 ` Song Liu
` (3 more replies)
0 siblings, 4 replies; 10+ messages in thread
From: Larkin Lowrey @ 2022-03-07 18:15 UTC (permalink / raw)
To: linux-raid
I am seeing a 'check' speed regression between kernels 5.15 and 5.16.
One host with a 20 drive array went from 170MB/s to 11MB/s. Another host
with a 15 drive array went from 180MB/s to 43MB/s. In both cases the
arrays are almost completely idle. I can flip between the two kernels
with no other changes and observe the performance changes.
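For reference, the throughput behind the MB/s figures above can be read straight from /proc/mdstat or sysfs. A minimal sketch, assuming the array is /dev/md1 (placeholder name) and remembering that md reports rates in KiB/s:

```shell
# Watch a running check (md1 is a placeholder device name):
#   cat /proc/mdstat
#   cat /sys/block/md1/md/sync_speed   # current rate, KiB/s

# md reports rates in KiB/s; tiny helper to get the MB/s-style
# figures quoted above (integer division is close enough here).
kib_to_mb() { echo $(( $1 / 1024 )); }
kib_to_mb 174080   # prints 170 (the healthy ~170MB/s rate)
kib_to_mb 11264    # prints 11 (the regressed ~11MB/s rate)
```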
Is this a known issue?
--Larkin
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Raid6 check performance regression 5.15 -> 5.16
2022-03-07 18:15 Raid6 check performance regression 5.15 -> 5.16 Larkin Lowrey
@ 2022-03-08 1:00 ` Song Liu
2022-03-08 22:31 ` Roger Heflin
2022-03-08 22:51 ` Larkin Lowrey
2022-03-08 5:44 ` Thorsten Leemhuis
` (2 subsequent siblings)
3 siblings, 2 replies; 10+ messages in thread
From: Song Liu @ 2022-03-08 1:00 UTC (permalink / raw)
To: Larkin Lowrey; +Cc: linux-raid
On Mon, Mar 7, 2022 at 10:21 AM Larkin Lowrey <llowrey@nuclearwinter.com> wrote:
>
> I am seeing a 'check' speed regression between kernels 5.15 and 5.16.
> One host with a 20 drive array went from 170MB/s to 11MB/s. Another host
> with a 15 drive array went from 180MB/s to 43MB/s. In both cases the
> arrays are almost completely idle. I can flip between the two kernels
> with no other changes and observe the performance changes.
>
> Is this a known issue?
I am not aware of this issue. Could you please share
mdadm --detail /dev/mdXXXX
output of the array?
Thanks,
Song
* Re: Raid6 check performance regression 5.15 -> 5.16
2022-03-07 18:15 Raid6 check performance regression 5.15 -> 5.16 Larkin Lowrey
2022-03-08 1:00 ` Song Liu
@ 2022-03-08 5:44 ` Thorsten Leemhuis
2022-03-17 13:10 ` Raid6 check performance regression 5.15 -> 5.16 #forregzbot Thorsten Leemhuis
2022-03-08 9:41 ` Raid6 check performance regression 5.15 -> 5.16 Wilson Jonathan
2022-03-08 10:32 ` Wilson Jonathan
3 siblings, 1 reply; 10+ messages in thread
From: Thorsten Leemhuis @ 2022-03-08 5:44 UTC (permalink / raw)
To: Larkin Lowrey, linux-raid, regressions
[TLDR: I'm adding the regression report below to regzbot, the Linux
kernel regression tracking bot; all text you find below is compiled
from a few template paragraphs you might already have encountered in
similar mails.]
On 07.03.22 19:15, Larkin Lowrey wrote:
> I am seeing a 'check' speed regression between kernels 5.15 and 5.16.
> One host with a 20 drive array went from 170MB/s to 11MB/s. Another host
> with a 15 drive array went from 180MB/s to 43MB/s. In both cases the
> arrays are almost completely idle. I can flip between the two kernels
> with no other changes and observe the performance changes.
>
> Is this a known issue?
Hi, this is your Linux kernel regression tracker.
Thanks for the report.
CCing the regression mailing list, as it should be in the loop for all
regressions, as explained here:
https://www.kernel.org/doc/html/latest/admin-guide/reporting-issues.html
To be sure the issue below doesn't fall through the cracks unnoticed,
I'm adding it to regzbot, my Linux kernel regression tracking bot:
#regzbot ^introduced v5.15..v5.16
#regzbot title md: Raid6 check performance regression
#regzbot ignore-activity
If it turns out this isn't a regression, feel free to remove it from the
tracking by sending a reply to this thread containing a paragraph like
"#regzbot invalid: reason why this is invalid" (without the quotes).
Reminder for developers: when fixing the issue, please add a 'Link:'
tag pointing to the report (the mail quoted above) using
lore.kernel.org/r/, as explained in
'Documentation/process/submitting-patches.rst' and
'Documentation/process/5.Posting.rst'. Regzbot needs them to
automatically connect reports with fixes, but they are useful in
general, too.
I'm sending this to everyone that got the initial report, to make
everyone aware of the tracking. I also hope that messages like this
motivate people to directly get at least the regression mailing list and
ideally even regzbot involved when dealing with regressions, as messages
like this wouldn't be needed then. And don't worry, if I need to send
other mails regarding this regression only relevant for regzbot I'll
send them to the regressions lists only (with a tag in the subject so
people can filter them away). With a bit of luck no such messages will
be needed anyway.
Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
P.S.: As the Linux kernel's regression tracker I'm getting a lot of
reports on my table. I can only look briefly into most of them and lack
knowledge about most of the areas they concern. I thus unfortunately
will sometimes get things wrong or miss something important. I hope
that's not the case here; if you think it is, don't hesitate to tell me
in a public reply, it's in everyone's interest to set the public record
straight.
--
Additional information about regzbot:
If you want to know more about regzbot, check out its web interface,
the getting started guide, and the reference documentation:
https://linux-regtracking.leemhuis.info/regzbot/
https://gitlab.com/knurd42/regzbot/-/blob/main/docs/getting_started.md
https://gitlab.com/knurd42/regzbot/-/blob/main/docs/reference.md
The last two documents will explain how you can interact with regzbot
yourself if you want to.
Hint for reporters: when reporting a regression, it's in your interest
to CC the regression list and tell regzbot about the issue, as that
ensures the regression makes it onto the radar of the Linux kernel's
regression tracker and your report won't fall through the cracks
unnoticed.
Hint for developers: you normally don't need to care about regzbot once
it's involved. Fix the issue as you normally would, just remember to
include a 'Link:' tag in the patch descriptions pointing to all reports
about the issue. This has been expected from developers even before
regzbot showed up for reasons explained in
'Documentation/process/submitting-patches.rst' and
'Documentation/process/5.Posting.rst'.
* Re: Raid6 check performance regression 5.15 -> 5.16
2022-03-07 18:15 Raid6 check performance regression 5.15 -> 5.16 Larkin Lowrey
2022-03-08 1:00 ` Song Liu
2022-03-08 5:44 ` Thorsten Leemhuis
@ 2022-03-08 9:41 ` Wilson Jonathan
2022-03-08 10:32 ` Wilson Jonathan
3 siblings, 0 replies; 10+ messages in thread
From: Wilson Jonathan @ 2022-03-08 9:41 UTC (permalink / raw)
To: Larkin Lowrey, linux-raid
On Mon, 2022-03-07 at 13:15 -0500, Larkin Lowrey wrote:
> I am seeing a 'check' speed regression between kernels 5.15 and 5.16.
> One host with a 20 drive array went from 170MB/s to 11MB/s. Another
> host
> with a 15 drive array went from 180MB/s to 43MB/s. In both cases the
> arrays are almost completely idle. I can flip between the two kernels
> with no other changes and observe the performance changes.
I am also seeing a huge slowdown on Debian using 5.16.0-3-amd64.
Normally my monthly scrub would run from 1am till about 10am.
That has been a consistent timing for close to two years without fail.
The check speed would start in the 130MB/s-ish range and eventually
slow to about 90MB/s-ish the closer it got to finishing.
The disks are WD Reds (the non-dodgy ones), WDC WD40EFRX-68N32N0, and
there are 6 of them in raid6 (no spares). There are no abnormal
smartctl figures (such as RRER, MZER, etc.), so it's not one of them
starting to fail.
The current speed is now down to 54,851K with at least 4 hours to go,
and it has already been running from 8PM to 9AM. (I kicked it off
manually last night, as I could see the weekend run was going to take
forever, and my granddaughter doesn't deal with "it's going slow" very
well, so I had killed that one.)
The problem is not limited to hard drives. I also run 3
arrays/partitions on NVMe (set up as 3 drives, one spare, raid10-far2,
used for /, /var, and swap) which, instead of taking about 2 mins,
are taking in excess of 10 mins to complete.
Before running the current mdadm check(s) the kernel was upgraded. I
try to apt-get update and apt-get dist-upgrade at the weekend but
sometimes forget, so I can't tell whether the last check ran under the
previous version or a version prior to that... The previous version
was 5.15.0-3-amd64, which as far as I can tell had no issues. (I tend
to access my computer around 9 on a Sunday and get hit once a month by
programs "hanging"/being slow, which reminds me to check whether an
mdadm check is running, cat /proc/mdstat, which it usually is, and it
usually tells me I should be fine by 10-ish; I do the mins/60.)
In the time it's taken me to type this, run commands to check figures,
and then check and amend things (about 30-40 mins), the speed is now
down to 52,187K. I'm going to let it finish, as I don't like the idea
of not having the monthly scrub complete, but boy does it suck to watch
it get much slower than usual the closer it gets to finishing.
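The "mins/60" arithmetic mentioned above can be scripted against the finish= field that /proc/mdstat prints during a check; a small sketch, with the sample progress line being illustrative rather than a verbatim capture from this host:

```shell
# Extract the finish= estimate (minutes) from a /proc/mdstat progress
# line and convert it to hours. The sample line is illustrative.
line='[====>................]  check = 24.9% (972023416/3900989440) finish=243.6min speed=54851K/sec'
mins=$(printf '%s\n' "$line" | sed -n 's/.*finish=\([0-9.]*\)min.*/\1/p')
awk -v m="$mins" 'BEGIN { printf "%.1f hours to go\n", m / 60 }'   # prints "4.1 hours to go"
```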
>
> Is this a known issue?
Well, you and me makes two noticing an issue, so...
>
> --Larkin
Jon.
* Re: Raid6 check performance regression 5.15 -> 5.16
2022-03-07 18:15 Raid6 check performance regression 5.15 -> 5.16 Larkin Lowrey
` (2 preceding siblings ...)
2022-03-08 9:41 ` Raid6 check performance regression 5.15 -> 5.16 Wilson Jonathan
@ 2022-03-08 10:32 ` Wilson Jonathan
3 siblings, 0 replies; 10+ messages in thread
From: Wilson Jonathan @ 2022-03-08 10:32 UTC (permalink / raw)
To: Larkin Lowrey, linux-raid
On Mon, 2022-03-07 at 13:15 -0500, Larkin Lowrey wrote:
> I am seeing a 'check' speed regression between kernels 5.15 and 5.16.
> One host with a 20 drive array went from 170MB/s to 11MB/s. Another
> host
> with a 15 drive array went from 180MB/s to 43MB/s. In both cases the
> arrays are almost completely idle. I can flip between the two kernels
> with no other changes and observe the performance changes.
>
> Is this a known issue?
>
> --Larkin
I killed it in the end. The computer went from "slow" and "delayed"...
to taking an annoyingly long time to do anything.
It also gave me a chance to test using the other kernel. Booting to
5.15.0-3-amd64 and starting the "check" shows circa 400 mins to
complete, which is what it normally takes.
Rebooting to 5.16.0-3-amd64 and starting the check shows circa
1000 mins to complete.
I noticed on marc.info that Song had posted a request (it hadn't
filtered through to my mail yet). This is the output of that command
for two of the arrays:
/dev/md8:
Version : 1.2
Creation Time : Fri Feb 14 08:38:30 2020
Raid Level : raid6
Array Size : 15073892352 (14.04 TiB 15.44 TB)
Used Dev Size : 3768473088 (3.51 TiB 3.86 TB)
Raid Devices : 6
Total Devices : 6
Persistence : Superblock is persistent
Intent Bitmap : Internal
Update Time : Tue Mar 8 10:28:00 2022
State : clean
Active Devices : 6
Working Devices : 6
Failed Devices : 0
Spare Devices : 0
Layout : left-symmetric
Chunk Size : 512K
Consistency Policy : bitmap
Name : debianz97:8
UUID : 51cfc705:98c0ef75:d2b5c558:363f2fd0
Events : 159898
Number Major Minor RaidDevice State
0 8 88 0 active sync /dev/sdf8
1 8 40 1 active sync /dev/sdc8
2 8 72 2 active sync /dev/sde8
3 8 56 3 active sync /dev/sdd8
4 8 24 4 active sync /dev/sdb8
5 8 8 5 active sync /dev/sda8
/dev/md4:
Version : 1.2
Creation Time : Wed Feb 5 11:11:16 2020
Raid Level : raid10
Array Size : 71236608 (67.94 GiB 72.95 GB)
Used Dev Size : 71236608 (67.94 GiB 72.95 GB)
Raid Devices : 2
Total Devices : 3
Persistence : Superblock is persistent
Update Time : Tue Mar 8 10:17:08 2022
State : clean
Active Devices : 2
Working Devices : 3
Failed Devices : 0
Spare Devices : 1
Layout : far=2
Chunk Size : 512K
Consistency Policy : resync
Name : BusterTR4:R10Swap
UUID : 3f2d098b:4b0df7a4:dfa23b05:0af8f480
Events : 144
Number Major Minor RaidDevice State
0 259 14 0 active sync /dev/nvme1n1p4
1 259 10 1 active sync /dev/nvme2n1p4
2 259 5 - spare /dev/nvme0n1p4
* Re: Raid6 check performance regression 5.15 -> 5.16
2022-03-08 1:00 ` Song Liu
@ 2022-03-08 22:31 ` Roger Heflin
2022-03-08 22:51 ` Larkin Lowrey
1 sibling, 0 replies; 10+ messages in thread
From: Roger Heflin @ 2022-03-08 22:31 UTC (permalink / raw)
To: Song Liu; +Cc: Larkin Lowrey, linux-raid
I just looked at my raid6 check start/end times (before is 5.15.10-200
(Fedora), after is 5.16.11-200 (Fedora)).
md14: 7 disks before: 2h20m, 2h19m, 2h16m, 2h18m, 2h34m, 2h28m, 2h27m;
after: 5h6m, 4h50m.
md15: 7 disks before: 3h14m; after: 7h24m, 6h6m, 7h8m.
md17: 4 disks before: 6h11m, 6h36m, 6h27m, 6h8m, 6h16m; after:
8h10m, 7h, 5h33m.
So it appears to have affected the arrays with 4 disks significantly
less than my arrays with 7 disks.
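For what it's worth, the slowdown factors implied by the timings above can be computed with a small helper; the minute values below are rough hand conversions of representative runs, not exact averages:

```shell
# Slowdown factor = after / before, durations given in minutes.
ratio() { awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1fx\n", b / a }'; }
ratio 140 306   # md14 (7 disks): ~2h20m before vs 5h6m after
ratio 194 444   # md15 (7 disks):  3h14m before vs 7h24m after
ratio 376 490   # md17 (4 disks): ~6h16m before vs 8h10m after
```

The 7-disk arrays come out around 2.2-2.3x slower, the 4-disk array around 1.3x, matching the observation above.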
On Tue, Mar 8, 2022 at 3:50 PM Song Liu <song@kernel.org> wrote:
>
> On Mon, Mar 7, 2022 at 10:21 AM Larkin Lowrey <llowrey@nuclearwinter.com> wrote:
> >
> > I am seeing a 'check' speed regression between kernels 5.15 and 5.16.
> > One host with a 20 drive array went from 170MB/s to 11MB/s. Another host
> > with a 15 drive array went from 180MB/s to 43MB/s. In both cases the
> > arrays are almost completely idle. I can flip between the two kernels
> > with no other changes and observe the performance changes.
> >
> > Is this a known issue?
>
> I am not aware of this issue. Could you please share
>
> mdadm --detail /dev/mdXXXX
>
> output of the array?
>
> Thanks,
> Song
* Re: Raid6 check performance regression 5.15 -> 5.16
2022-03-08 1:00 ` Song Liu
2022-03-08 22:31 ` Roger Heflin
@ 2022-03-08 22:51 ` Larkin Lowrey
2022-03-09 6:35 ` Song Liu
1 sibling, 1 reply; 10+ messages in thread
From: Larkin Lowrey @ 2022-03-08 22:51 UTC (permalink / raw)
To: Song Liu; +Cc: linux-raid
On Tue, Mar 8, 2022 at 3:50 PM Song Liu <song@kernel.org> wrote:
> On Mon, Mar 7, 2022 at 10:21 AM Larkin Lowrey <llowrey@nuclearwinter.com> wrote:
>> I am seeing a 'check' speed regression between kernels 5.15 and 5.16.
>> One host with a 20 drive array went from 170MB/s to 11MB/s. Another host
>> with a 15 drive array went from 180MB/s to 43MB/s. In both cases the
>> arrays are almost completely idle. I can flip between the two kernels
>> with no other changes and observe the performance changes.
>>
>> Is this a known issue?
>
> I am not aware of this issue. Could you please share
>
> mdadm --detail /dev/mdXXXX
>
> output of the array?
>
> Thanks,
> Song
Host A:
# mdadm --detail /dev/md1
/dev/md1:
Version : 1.2
Creation Time : Thu Nov 19 18:21:44 2020
Raid Level : raid6
Array Size : 126961942016 (118.24 TiB 130.01 TB)
Used Dev Size : 9766303232 (9.10 TiB 10.00 TB)
Raid Devices : 15
Total Devices : 15
Persistence : Superblock is persistent
Intent Bitmap : Internal
Update Time : Tue Mar 8 12:39:14 2022
State : clean
Active Devices : 15
Working Devices : 15
Failed Devices : 0
Spare Devices : 0
Layout : left-symmetric
Chunk Size : 512K
Consistency Policy : bitmap
Name : fubar:1 (local to host fubar)
UUID : eaefc9b7:74af4850:69556e2e:bc05d666
Events : 85950
Number Major Minor RaidDevice State
0 8 1 0 active sync /dev/sda1
1 8 17 1 active sync /dev/sdb1
2 8 33 2 active sync /dev/sdc1
3 8 49 3 active sync /dev/sdd1
4 8 65 4 active sync /dev/sde1
5 8 81 5 active sync /dev/sdf1
16 8 97 6 active sync /dev/sdg1
7 8 113 7 active sync /dev/sdh1
8 8 129 8 active sync /dev/sdi1
9 8 145 9 active sync /dev/sdj1
10 8 161 10 active sync /dev/sdk1
11 8 177 11 active sync /dev/sdl1
12 8 193 12 active sync /dev/sdm1
13 8 209 13 active sync /dev/sdn1
14 8 225 14 active sync /dev/sdo1
Host B:
# mdadm --detail /dev/md1
/dev/md1:
Version : 1.2
Creation Time : Thu Oct 10 14:18:16 2019
Raid Level : raid6
Array Size : 140650080768 (130.99 TiB 144.03 TB)
Used Dev Size : 7813893376 (7.28 TiB 8.00 TB)
Raid Devices : 20
Total Devices : 20
Persistence : Superblock is persistent
Intent Bitmap : Internal
Update Time : Tue Mar 8 17:40:48 2022
State : clean
Active Devices : 20
Working Devices : 20
Failed Devices : 0
Spare Devices : 0
Layout : left-symmetric
Chunk Size : 128K
Consistency Policy : bitmap
Name : mcp:1
UUID : 803f5eb5:e59d4091:5b91fa17:64801e54
Events : 302158
Number Major Minor RaidDevice State
0 8 1 0 active sync /dev/sda1
1 65 145 1 active sync /dev/sdz1
2 65 177 2 active sync /dev/sdab1
3 65 209 3 active sync /dev/sdad1
4 8 209 4 active sync /dev/sdn1
5 65 129 5 active sync /dev/sdy1
6 8 241 6 active sync /dev/sdp1
7 65 241 7 active sync /dev/sdaf1
8 8 161 8 active sync /dev/sdk1
9 8 113 9 active sync /dev/sdh1
10 8 129 10 active sync /dev/sdi1
11 66 33 11 active sync /dev/sdai1
12 65 1 12 active sync /dev/sdq1
13 8 65 13 active sync /dev/sde1
14 66 17 14 active sync /dev/sdah1
15 8 49 15 active sync /dev/sdd1
19 66 81 16 active sync /dev/sdal1
16 66 65 17 active sync /dev/sdak1
17 8 145 18 active sync /dev/sdj1
18 66 129 19 active sync /dev/sdao1
The regression was introduced somewhere between these two Fedora kernels:
5.15.18-200 (good)
5.16.5-200 (bad)
--Larkin
* Re: Raid6 check performance regression 5.15 -> 5.16
2022-03-08 22:51 ` Larkin Lowrey
@ 2022-03-09 6:35 ` Song Liu
2022-03-09 16:27 ` Roger Heflin
0 siblings, 1 reply; 10+ messages in thread
From: Song Liu @ 2022-03-09 6:35 UTC (permalink / raw)
To: Larkin Lowrey, Roger Heflin, Wilson Jonathan; +Cc: linux-raid
On Tue, Mar 8, 2022 at 2:51 PM Larkin Lowrey <llowrey@nuclearwinter.com> wrote:
>
> On Tue, Mar 8, 2022 at 3:50 PM Song Liu <song@kernel.org> wrote:
> > On Mon, Mar 7, 2022 at 10:21 AM Larkin Lowrey <llowrey@nuclearwinter.com> wrote:
> >> I am seeing a 'check' speed regression between kernels 5.15 and 5.16.
> >> One host with a 20 drive array went from 170MB/s to 11MB/s. Another host
> >> with a 15 drive array went from 180MB/s to 43MB/s. In both cases the
> >> arrays are almost completely idle. I can flip between the two kernels
> >> with no other changes and observe the performance changes.
> >>
> >> Is this a known issue?
> >
> > I am not aware of this issue. Could you please share
> >
> > mdadm --detail /dev/mdXXXX
> >
> > output of the array?
> >
> > Thanks,
> > Song
>
> Host A:
> # mdadm --detail /dev/md1
> /dev/md1:
> Version : 1.2
> Creation Time : Thu Nov 19 18:21:44 2020
> Raid Level : raid6
> Array Size : 126961942016 (118.24 TiB 130.01 TB)
> Used Dev Size : 9766303232 (9.10 TiB 10.00 TB)
> Raid Devices : 15
> Total Devices : 15
> Persistence : Superblock is persistent
>
> Intent Bitmap : Internal
>
> Update Time : Tue Mar 8 12:39:14 2022
> State : clean
> Active Devices : 15
> Working Devices : 15
> Failed Devices : 0
> Spare Devices : 0
>
> Layout : left-symmetric
> Chunk Size : 512K
>
> Consistency Policy : bitmap
>
> Name : fubar:1 (local to host fubar)
> UUID : eaefc9b7:74af4850:69556e2e:bc05d666
> Events : 85950
>
> Number Major Minor RaidDevice State
> 0 8 1 0 active sync /dev/sda1
> 1 8 17 1 active sync /dev/sdb1
> 2 8 33 2 active sync /dev/sdc1
> 3 8 49 3 active sync /dev/sdd1
> 4 8 65 4 active sync /dev/sde1
> 5 8 81 5 active sync /dev/sdf1
> 16 8 97 6 active sync /dev/sdg1
> 7 8 113 7 active sync /dev/sdh1
> 8 8 129 8 active sync /dev/sdi1
> 9 8 145 9 active sync /dev/sdj1
> 10 8 161 10 active sync /dev/sdk1
> 11 8 177 11 active sync /dev/sdl1
> 12 8 193 12 active sync /dev/sdm1
> 13 8 209 13 active sync /dev/sdn1
> 14 8 225 14 active sync /dev/sdo1
>
> Host B:
> # mdadm --detail /dev/md1
> /dev/md1:
> Version : 1.2
> Creation Time : Thu Oct 10 14:18:16 2019
> Raid Level : raid6
> Array Size : 140650080768 (130.99 TiB 144.03 TB)
> Used Dev Size : 7813893376 (7.28 TiB 8.00 TB)
> Raid Devices : 20
> Total Devices : 20
> Persistence : Superblock is persistent
>
> Intent Bitmap : Internal
>
> Update Time : Tue Mar 8 17:40:48 2022
> State : clean
> Active Devices : 20
> Working Devices : 20
> Failed Devices : 0
> Spare Devices : 0
>
> Layout : left-symmetric
> Chunk Size : 128K
>
> Consistency Policy : bitmap
>
> Name : mcp:1
> UUID : 803f5eb5:e59d4091:5b91fa17:64801e54
> Events : 302158
>
> Number Major Minor RaidDevice State
> 0 8 1 0 active sync /dev/sda1
> 1 65 145 1 active sync /dev/sdz1
> 2 65 177 2 active sync /dev/sdab1
> 3 65 209 3 active sync /dev/sdad1
> 4 8 209 4 active sync /dev/sdn1
> 5 65 129 5 active sync /dev/sdy1
> 6 8 241 6 active sync /dev/sdp1
> 7 65 241 7 active sync /dev/sdaf1
> 8 8 161 8 active sync /dev/sdk1
> 9 8 113 9 active sync /dev/sdh1
> 10 8 129 10 active sync /dev/sdi1
> 11 66 33 11 active sync /dev/sdai1
> 12 65 1 12 active sync /dev/sdq1
> 13 8 65 13 active sync /dev/sde1
> 14 66 17 14 active sync /dev/sdah1
> 15 8 49 15 active sync /dev/sdd1
> 19 66 81 16 active sync /dev/sdal1
> 16 66 65 17 active sync /dev/sdak1
> 17 8 145 18 active sync /dev/sdj1
> 18 66 129 19 active sync /dev/sdao1
>
> The regression was introduced somewhere between these two Fedora kernels:
> 5.15.18-200 (good)
> 5.16.5-200 (bad)
Hi folks,
Sorry for the regression and thanks for sharing your array setup and
observations.
I think I have found the fix for it. I will send a patch. If you want
to try the fix sooner, you can find it at:
For 5.16:
https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git/commit/?h=tmp/fix-5.16&id=872c1a638b9751061b11b64a240892c989d1c618
For 5.17:
https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git/commit/?h=tmp/fix-5.17&id=c06ccb305e697d89fe99376c9036d1a2ece44c77
Thanks,
Song
* Re: Raid6 check performance regression 5.15 -> 5.16
2022-03-09 6:35 ` Song Liu
@ 2022-03-09 16:27 ` Roger Heflin
0 siblings, 0 replies; 10+ messages in thread
From: Roger Heflin @ 2022-03-09 16:27 UTC (permalink / raw)
To: Song Liu; +Cc: Larkin Lowrey, Wilson Jonathan, linux-raid
I have tested this. The patch seems to fix the issue.
Test method was:
Fedora 5.16.11-200 (check broken, taking about 4h50m to 5h6m in the 2
runs I have data for)
kernel.org 5.16.13 + this patch (17% done in 25 min, 100 more minutes
to finish; it seems to be fast again, predicted around 2h, consistent
with the good speed before 5.16).
On Wed, Mar 9, 2022 at 12:35 AM Song Liu <song@kernel.org> wrote:
>
> On Tue, Mar 8, 2022 at 2:51 PM Larkin Lowrey <llowrey@nuclearwinter.com> wrote:
> >
> > On Tue, Mar 8, 2022 at 3:50 PM Song Liu <song@kernel.org> wrote:
> > > On Mon, Mar 7, 2022 at 10:21 AM Larkin Lowrey <llowrey@nuclearwinter.com> wrote:
> > >> I am seeing a 'check' speed regression between kernels 5.15 and 5.16.
> > >> One host with a 20 drive array went from 170MB/s to 11MB/s. Another host
> > >> with a 15 drive array went from 180MB/s to 43MB/s. In both cases the
> > >> arrays are almost completely idle. I can flip between the two kernels
> > >> with no other changes and observe the performance changes.
> > >>
> > >> Is this a known issue?
> > >
> > > I am not aware of this issue. Could you please share
> > >
> > > mdadm --detail /dev/mdXXXX
> > >
> > > output of the array?
> > >
> > > Thanks,
> > > Song
> >
> > Host A:
> > # mdadm --detail /dev/md1
> > /dev/md1:
> > Version : 1.2
> > Creation Time : Thu Nov 19 18:21:44 2020
> > Raid Level : raid6
> > Array Size : 126961942016 (118.24 TiB 130.01 TB)
> > Used Dev Size : 9766303232 (9.10 TiB 10.00 TB)
> > Raid Devices : 15
> > Total Devices : 15
> > Persistence : Superblock is persistent
> >
> > Intent Bitmap : Internal
> >
> > Update Time : Tue Mar 8 12:39:14 2022
> > State : clean
> > Active Devices : 15
> > Working Devices : 15
> > Failed Devices : 0
> > Spare Devices : 0
> >
> > Layout : left-symmetric
> > Chunk Size : 512K
> >
> > Consistency Policy : bitmap
> >
> > Name : fubar:1 (local to host fubar)
> > UUID : eaefc9b7:74af4850:69556e2e:bc05d666
> > Events : 85950
> >
> > Number Major Minor RaidDevice State
> > 0 8 1 0 active sync /dev/sda1
> > 1 8 17 1 active sync /dev/sdb1
> > 2 8 33 2 active sync /dev/sdc1
> > 3 8 49 3 active sync /dev/sdd1
> > 4 8 65 4 active sync /dev/sde1
> > 5 8 81 5 active sync /dev/sdf1
> > 16 8 97 6 active sync /dev/sdg1
> > 7 8 113 7 active sync /dev/sdh1
> > 8 8 129 8 active sync /dev/sdi1
> > 9 8 145 9 active sync /dev/sdj1
> > 10 8 161 10 active sync /dev/sdk1
> > 11 8 177 11 active sync /dev/sdl1
> > 12 8 193 12 active sync /dev/sdm1
> > 13 8 209 13 active sync /dev/sdn1
> > 14 8 225 14 active sync /dev/sdo1
> >
> > Host B:
> > # mdadm --detail /dev/md1
> > /dev/md1:
> > Version : 1.2
> > Creation Time : Thu Oct 10 14:18:16 2019
> > Raid Level : raid6
> > Array Size : 140650080768 (130.99 TiB 144.03 TB)
> > Used Dev Size : 7813893376 (7.28 TiB 8.00 TB)
> > Raid Devices : 20
> > Total Devices : 20
> > Persistence : Superblock is persistent
> >
> > Intent Bitmap : Internal
> >
> > Update Time : Tue Mar 8 17:40:48 2022
> > State : clean
> > Active Devices : 20
> > Working Devices : 20
> > Failed Devices : 0
> > Spare Devices : 0
> >
> > Layout : left-symmetric
> > Chunk Size : 128K
> >
> > Consistency Policy : bitmap
> >
> > Name : mcp:1
> > UUID : 803f5eb5:e59d4091:5b91fa17:64801e54
> > Events : 302158
> >
> > Number Major Minor RaidDevice State
> > 0 8 1 0 active sync /dev/sda1
> > 1 65 145 1 active sync /dev/sdz1
> > 2 65 177 2 active sync /dev/sdab1
> > 3 65 209 3 active sync /dev/sdad1
> > 4 8 209 4 active sync /dev/sdn1
> > 5 65 129 5 active sync /dev/sdy1
> > 6 8 241 6 active sync /dev/sdp1
> > 7 65 241 7 active sync /dev/sdaf1
> > 8 8 161 8 active sync /dev/sdk1
> > 9 8 113 9 active sync /dev/sdh1
> > 10 8 129 10 active sync /dev/sdi1
> > 11 66 33 11 active sync /dev/sdai1
> > 12 65 1 12 active sync /dev/sdq1
> > 13 8 65 13 active sync /dev/sde1
> > 14 66 17 14 active sync /dev/sdah1
> > 15 8 49 15 active sync /dev/sdd1
> > 19 66 81 16 active sync /dev/sdal1
> > 16 66 65 17 active sync /dev/sdak1
> > 17 8 145 18 active sync /dev/sdj1
> > 18 66 129 19 active sync /dev/sdao1
> >
> > The regression was introduced somewhere between these two Fedora kernels:
> > 5.15.18-200 (good)
> > 5.16.5-200 (bad)
>
> Hi folks,
>
> Sorry for the regression and thanks for sharing your array setup and
> observations.
>
> I think I have found the fix for it. I will send a patch for it. If
> you want to try the fix
> sooner, you can find it at:
>
> For 5.16:
> https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git/commit/?h=tmp/fix-5.16&id=872c1a638b9751061b11b64a240892c989d1c618
>
> For 5.17:
> https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git/commit/?h=tmp/fix-5.17&id=c06ccb305e697d89fe99376c9036d1a2ece44c77
>
> Thanks,
> Song
* Re: Raid6 check performance regression 5.15 -> 5.16 #forregzbot
2022-03-08 5:44 ` Thorsten Leemhuis
@ 2022-03-17 13:10 ` Thorsten Leemhuis
0 siblings, 0 replies; 10+ messages in thread
From: Thorsten Leemhuis @ 2022-03-17 13:10 UTC (permalink / raw)
To: regressions
TWIMC: this mail is primarily sent for documentation purposes and for
regzbot, my Linux kernel regression tracking bot. These mails usually
contain '#forregzbot' in the subject, to make them easy to spot and filter.
#regzbot fixed-by: 26fed4ac4eab09c27
On 08.03.22 06:44, Thorsten Leemhuis wrote:
> [TLDR: I'm adding the regression report below to regzbot, the Linux
> kernel regression tracking bot; all text you find below is compiled
> from a few template paragraphs you might already have encountered in
> similar mails.]
>
> On 07.03.22 19:15, Larkin Lowrey wrote:
>> I am seeing a 'check' speed regression between kernels 5.15 and 5.16.
>> One host with a 20 drive array went from 170MB/s to 11MB/s. Another host
>> with a 15 drive array went from 180MB/s to 43MB/s. In both cases the
>> arrays are almost completely idle. I can flip between the two kernels
>> with no other changes and observe the performance changes.
>>
>> Is this a known issue?
>
> Hi, this is your Linux kernel regression tracker.
>
> Thanks for the report.
>
> CCing the regression mailing list, as it should be in the loop for all
> regressions, as explained here:
> https://www.kernel.org/doc/html/latest/admin-guide/reporting-issues.html
>
> To be sure the issue below doesn't fall through the cracks unnoticed,
> I'm adding it to regzbot, my Linux kernel regression tracking bot:
>
> #regzbot ^introduced v5.15..v5.16
> #regzbot title md: Raid6 check performance regression
> #regzbot ignore-activity
>
> If it turns out this isn't a regression, feel free to remove it from the
> tracking by sending a reply to this thread containing a paragraph like
> "#regzbot invalid: reason why this is invalid" (without the quotes).
>
> Reminder for developers: when fixing the issue, please add a 'Link:'
> tag pointing to the report (the mail quoted above) using
> lore.kernel.org/r/, as explained in
> 'Documentation/process/submitting-patches.rst' and
> 'Documentation/process/5.Posting.rst'. Regzbot needs them to
> automatically connect reports with fixes, but they are useful in
> general, too.
>
> I'm sending this to everyone that got the initial report, to make
> everyone aware of the tracking. I also hope that messages like this
> motivate people to directly get at least the regression mailing list and
> ideally even regzbot involved when dealing with regressions, as messages
> like this wouldn't be needed then. And don't worry, if I need to send
> other mails regarding this regression only relevant for regzbot I'll
> send them to the regressions lists only (with a tag in the subject so
> people can filter them away). With a bit of luck no such messages will
> be needed anyway.
>
> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
>
> P.S.: As the Linux kernel's regression tracker I'm getting a lot of
> reports on my table. I can only look briefly into most of them and lack
> knowledge about most of the areas they concern. I thus unfortunately
> will sometimes get things wrong or miss something important. I hope
> that's not the case here; if you think it is, don't hesitate to tell me
> in a public reply, it's in everyone's interest to set the public record
> straight.
>
end of thread, other threads:[~2022-03-17 13:10 UTC | newest]
Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
2022-03-07 18:15 Raid6 check performance regression 5.15 -> 5.16 Larkin Lowrey
2022-03-08 1:00 ` Song Liu
2022-03-08 22:31 ` Roger Heflin
2022-03-08 22:51 ` Larkin Lowrey
2022-03-09 6:35 ` Song Liu
2022-03-09 16:27 ` Roger Heflin
2022-03-08 5:44 ` Thorsten Leemhuis
2022-03-17 13:10 ` Raid6 check performance regression 5.15 -> 5.16 #forregzbot Thorsten Leemhuis
2022-03-08 9:41 ` Raid6 check performance regression 5.15 -> 5.16 Wilson Jonathan
2022-03-08 10:32 ` Wilson Jonathan