* Raid6 check performance regression 5.15 -> 5.16
@ 2022-03-07 18:15 Larkin Lowrey
  2022-03-08  1:00 ` Song Liu
                   ` (3 more replies)
  0 siblings, 4 replies; 10+ messages in thread
From: Larkin Lowrey @ 2022-03-07 18:15 UTC (permalink / raw)
  To: linux-raid

I am seeing a 'check' speed regression between kernels 5.15 and 5.16. 
One host with a 20 drive array went from 170MB/s to 11MB/s. Another host 
with a 15 drive array went from 180MB/s to 43MB/s. In both cases the 
arrays are almost completely idle. I can flip between the two kernels 
with no other changes and observe the performance changes.
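
For reference, the 'check' here is the normal md scrub; a rough
sketch of how such a check is started and watched, with /dev/md1
only as an example device name:

   # kick off a manual check on the array
   echo check > /sys/block/md1/md/sync_action
   # watch progress; the speed=...K/sec field is the check speed
   # being quoted here
   cat /proc/mdstat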

Is this a known issue?

--Larkin


* Re: Raid6 check performance regression 5.15 -> 5.16
  2022-03-07 18:15 Raid6 check performance regression 5.15 -> 5.16 Larkin Lowrey
@ 2022-03-08  1:00 ` Song Liu
  2022-03-08 22:31   ` Roger Heflin
  2022-03-08 22:51   ` Larkin Lowrey
  2022-03-08  5:44 ` Thorsten Leemhuis
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 10+ messages in thread
From: Song Liu @ 2022-03-08  1:00 UTC (permalink / raw)
  To: Larkin Lowrey; +Cc: linux-raid

On Mon, Mar 7, 2022 at 10:21 AM Larkin Lowrey <llowrey@nuclearwinter.com> wrote:
>
> I am seeing a 'check' speed regression between kernels 5.15 and 5.16.
> One host with a 20 drive array went from 170MB/s to 11MB/s. Another host
> with a 15 drive array went from 180MB/s to 43MB/s. In both cases the
> arrays are almost completely idle. I can flip between the two kernels
> with no other changes and observe the performance changes.
>
> Is this a known issue?

I am not aware of this issue. Could you please share

  mdadm --detail /dev/mdXXXX

output of the array?

Thanks,
Song


* Re: Raid6 check performance regression 5.15 -> 5.16
  2022-03-07 18:15 Raid6 check performance regression 5.15 -> 5.16 Larkin Lowrey
  2022-03-08  1:00 ` Song Liu
@ 2022-03-08  5:44 ` Thorsten Leemhuis
  2022-03-17 13:10   ` Raid6 check performance regression 5.15 -> 5.16 #forregzbot Thorsten Leemhuis
  2022-03-08  9:41 ` Raid6 check performance regression 5.15 -> 5.16 Wilson Jonathan
  2022-03-08 10:32 ` Wilson Jonathan
  3 siblings, 1 reply; 10+ messages in thread
From: Thorsten Leemhuis @ 2022-03-08  5:44 UTC (permalink / raw)
  To: Larkin Lowrey, linux-raid, regressions

[TLDR: I'm adding the regression report below to regzbot, the Linux
kernel regression tracking bot; all text you find below is compiled
from a few template paragraphs you might have encountered already from
similar mails.]

On 07.03.22 19:15, Larkin Lowrey wrote:
> I am seeing a 'check' speed regression between kernels 5.15 and 5.16.
> One host with a 20 drive array went from 170MB/s to 11MB/s. Another host
> with a 15 drive array went from 180MB/s to 43MB/s. In both cases the
> arrays are almost completely idle. I can flip between the two kernels
> with no other changes and observe the performance changes.
> 
> Is this a known issue?

Hi, this is your Linux kernel regression tracker.

Thanks for the report.

CCing the regression mailing list, as it should be in the loop for all
regressions, as explained here:
https://www.kernel.org/doc/html/latest/admin-guide/reporting-issues.html

To be sure below issue doesn't fall through the cracks unnoticed, I'm
adding it to regzbot, my Linux kernel regression tracking bot:

#regzbot ^introduced v5.15..v5.16
#regzbot title md: Raid6 check performance regression
#regzbot ignore-activity

If it turns out this isn't a regression, feel free to remove it from
the tracking by sending a reply to this thread containing a paragraph
like "#regzbot invalid: reason why this is invalid" (without the
quotes).

Reminder for developers: when fixing the issue, please add a 'Link:'
tag pointing to the report (the mail quoted above) using
lore.kernel.org/r/, as explained in
'Documentation/process/submitting-patches.rst' and
'Documentation/process/5.Posting.rst'. Regzbot needs them to
automatically connect reports with fixes, but they are useful in
general, too.
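
As an illustration only (the message-id below is a placeholder, not
the real one of this report), such a trailer in the fix's commit
message would look like:

  Link: https://lore.kernel.org/r/<message-id-of-the-report>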

I'm sending this to everyone that got the initial report, to make
everyone aware of the tracking. I also hope that messages like this
motivate people to directly get at least the regression mailing list and
ideally even regzbot involved when dealing with regressions, as messages
like this wouldn't be needed then. And don't worry, if I need to send
other mails regarding this regression only relevant for regzbot I'll
send them to the regressions lists only (with a tag in the subject so
people can filter them away). With a bit of luck no such messages will
be needed anyway.

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)

P.S.: As the Linux kernel's regression tracker I'm getting a lot of
reports on my table. I can only look briefly into most of them and lack
knowledge about most of the areas they concern. I thus unfortunately
will sometimes get things wrong or miss something important. I hope
that's not the case here; if you think it is, don't hesitate to tell me
in a public reply, it's in everyone's interest to set the public record
straight.

-- 
Additional information about regzbot:

If you want to know more about regzbot, check out its web-interface,
the getting started guide, and the reference documentation:

https://linux-regtracking.leemhuis.info/regzbot/
https://gitlab.com/knurd42/regzbot/-/blob/main/docs/getting_started.md
https://gitlab.com/knurd42/regzbot/-/blob/main/docs/reference.md

The last two documents will explain how you can interact with regzbot
yourself if you want to.

Hint for reporters: when reporting a regression it's in your interest
to CC the regression list and tell regzbot about the issue, as that
ensures the regression makes it onto the radar of the Linux kernel's
regression tracker and your report won't fall through the cracks
unnoticed.

Hint for developers: you normally don't need to care about regzbot once
it's involved. Fix the issue as you normally would, just remember to
include a 'Link:' tag in the patch descriptions pointing to all reports
about the issue. This has been expected from developers even before
regzbot showed up for reasons explained in
'Documentation/process/submitting-patches.rst' and
'Documentation/process/5.Posting.rst'.



* Re: Raid6 check performance regression 5.15 -> 5.16
  2022-03-07 18:15 Raid6 check performance regression 5.15 -> 5.16 Larkin Lowrey
  2022-03-08  1:00 ` Song Liu
  2022-03-08  5:44 ` Thorsten Leemhuis
@ 2022-03-08  9:41 ` Wilson Jonathan
  2022-03-08 10:32 ` Wilson Jonathan
  3 siblings, 0 replies; 10+ messages in thread
From: Wilson Jonathan @ 2022-03-08  9:41 UTC (permalink / raw)
  To: Larkin Lowrey, linux-raid

On Mon, 2022-03-07 at 13:15 -0500, Larkin Lowrey wrote:
> I am seeing a 'check' speed regression between kernels 5.15 and 5.16.
> One host with a 20 drive array went from 170MB/s to 11MB/s. Another
> host 
> with a 15 drive array went from 180MB/s to 43MB/s. In both cases the 
> arrays are almost completely idle. I can flip between the two kernels
> with no other changes and observe the performance changes.

I am also seeing a huge slowdown on Debian using 5.16.0-3-amd64.
Normally my monthly scrub would take from 1am till about 10am.

That timing has been consistent for close to two years without fail.
The check speed would start in the 130MB/s-ish range and eventually
slow to about 90MB/s-ish the closer it got to finishing. The disks are
WD Reds (the non-dodgy ones), WDC WD40EFRX-68N32N0, and there are 6 of
them in raid6 (no spares). There are no abnormal smartctl figures
(such as RRER, MZER, etc.), so it's not a case of one of them starting
to fail.

The current speed is now down to 54,851K/sec with at least 4 hours to
go, and it has been running from 8PM to 9AM already. (I kicked it off
manually last night; I could see the weekend run was going to take
forever, and my granddaughter doesn't deal with "it's going slow" very
well, so I had killed that one.)

The problem is not limited to hard drives. I also run 3
arrays/partitions on NVMe (set up as 3 drives with one spare,
raid10 far=2, used for /, /var and swap), and their checks, which
normally take about 2 minutes, are now taking in excess of 10 minutes
to complete.

Before running the current mdadm check(s) the kernel was upgraded. I
try to apt-get update / apt-get dist-upgrade at the weekend but
sometimes forget, so I can't tell whether a check was run under the
previous version or a version prior to that... The previous version
was 5.16.0-3-amd64, which as far as I can tell had no issues. (I tend
to access my computer around 9 on a Sunday and get hit once a month by
programs "hanging"/being slow, which reminds me to check whether an
mdadm check is running with cat /proc/mdstat. It usually is, and it
usually tells me I should be fine by 10-ish; I do the mins/60.)
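
For anyone wondering, the speed and finish figures quoted here come
straight from /proc/mdstat; roughly:

  cat /proc/mdstat
  # while a check is running, each array gets a progress line along
  # the lines of (numbers illustrative):
  #   [=====>...........]  check = 28.5%  finish=240.0min speed=54851K/sec
  # finish= is the remaining time in minutes, hence the mins/60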

In the time it's taken me to type this, run commands to check the
figures, and then re-check and amend things (about 30-40 mins), the
speed is now down to 52,187K/sec. I'm going to let it finish as I
don't like the idea of not having the monthly scrub complete, but boy
does it suck when I can see it getting much slower than usual the
closer it gets to finishing.

> 
> Is this a known issue?

Well, you and me makes two noticing an issue, so...

> 
> --Larkin

Jon.


* Re: Raid6 check performance regression 5.15 -> 5.16
  2022-03-07 18:15 Raid6 check performance regression 5.15 -> 5.16 Larkin Lowrey
                   ` (2 preceding siblings ...)
  2022-03-08  9:41 ` Raid6 check performance regression 5.15 -> 5.16 Wilson Jonathan
@ 2022-03-08 10:32 ` Wilson Jonathan
  3 siblings, 0 replies; 10+ messages in thread
From: Wilson Jonathan @ 2022-03-08 10:32 UTC (permalink / raw)
  To: Larkin Lowrey, linux-raid

On Mon, 2022-03-07 at 13:15 -0500, Larkin Lowrey wrote:
> I am seeing a 'check' speed regression between kernels 5.15 and 5.16.
> One host with a 20 drive array went from 170MB/s to 11MB/s. Another
> host 
> with a 15 drive array went from 180MB/s to 43MB/s. In both cases the 
> arrays are almost completely idle. I can flip between the two kernels
> with no other changes and observe the performance changes.
> 
> Is this a known issue?
> 
> --Larkin

I killed it in the end. The computer went from "slow" and "delayed"...
to taking an annoyingly long time to do anything.

It also gave me a chance to test using the other kernel. Booting to
5.15.0-3-amd64 and starting the "check" shows circa 400 mins to
complete, which is what it normally takes.

Re-booting to 5.16.0-3-amd64 and starting the check shows circa
1000 mins to complete.

I noticed on marc.info that Song had posted a request (it hadn't
filtered through to my mail yet). This is the output for two of the
arrays:

/dev/md8:
           Version : 1.2
     Creation Time : Fri Feb 14 08:38:30 2020
        Raid Level : raid6
        Array Size : 15073892352 (14.04 TiB 15.44 TB)
     Used Dev Size : 3768473088 (3.51 TiB 3.86 TB)
      Raid Devices : 6
     Total Devices : 6
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Tue Mar  8 10:28:00 2022
             State : clean 
    Active Devices : 6
   Working Devices : 6
    Failed Devices : 0
     Spare Devices : 0

            Layout : left-symmetric
        Chunk Size : 512K

Consistency Policy : bitmap

              Name : debianz97:8
              UUID : 51cfc705:98c0ef75:d2b5c558:363f2fd0
            Events : 159898

    Number   Major   Minor   RaidDevice State
       0       8       88        0      active sync   /dev/sdf8
       1       8       40        1      active sync   /dev/sdc8
       2       8       72        2      active sync   /dev/sde8
       3       8       56        3      active sync   /dev/sdd8
       4       8       24        4      active sync   /dev/sdb8
       5       8        8        5      active sync   /dev/sda8


/dev/md4:
           Version : 1.2
     Creation Time : Wed Feb  5 11:11:16 2020
        Raid Level : raid10
        Array Size : 71236608 (67.94 GiB 72.95 GB)
     Used Dev Size : 71236608 (67.94 GiB 72.95 GB)
      Raid Devices : 2
     Total Devices : 3
       Persistence : Superblock is persistent

       Update Time : Tue Mar  8 10:17:08 2022
             State : clean 
    Active Devices : 2
   Working Devices : 3
    Failed Devices : 0
     Spare Devices : 1

            Layout : far=2
        Chunk Size : 512K

Consistency Policy : resync

              Name : BusterTR4:R10Swap
              UUID : 3f2d098b:4b0df7a4:dfa23b05:0af8f480
            Events : 144

    Number   Major   Minor   RaidDevice State
       0     259       14        0      active sync   /dev/nvme1n1p4
       1     259       10        1      active sync   /dev/nvme2n1p4

       2     259        5        -      spare   /dev/nvme0n1p4

* Re: Raid6 check performance regression 5.15 -> 5.16
  2022-03-08  1:00 ` Song Liu
@ 2022-03-08 22:31   ` Roger Heflin
  2022-03-08 22:51   ` Larkin Lowrey
  1 sibling, 0 replies; 10+ messages in thread
From: Roger Heflin @ 2022-03-08 22:31 UTC (permalink / raw)
  To: Song Liu; +Cc: Larkin Lowrey, linux-raid

I just looked at my raid6 check start/end times (before is 5.15.10-200
(Fedora), after is 5.16.11-200 (Fedora)):

md14: 7 disks, before: 2h20m, 2h19m, 2h16m, 2h18m, 2h34m, 2h28m, 2h27m;
  after: 5h6m, 4h50m.
md15: 7 disks, before: 3h14m; after: 7h24m, 6h6m, 7h8m.
md17: 4 disks, before: 6h11m, 6h36m, 6h27m, 6h8m, 6h16m;
  after: 8h10m, 7h, 5h33m.

So it appears to have affected the arrays with 4 disks significantly
less than my arrays with 7 disks.
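
For reference, the start/end times for each run show up in the kernel
log; a rough sketch of pulling them out (exact message text may vary
by kernel):

  journalctl | grep data-check
  # start:  "md: data-check of RAID array md14"
  # finish: "md: md14: data-check done."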


On Tue, Mar 8, 2022 at 3:50 PM Song Liu <song@kernel.org> wrote:
>
> On Mon, Mar 7, 2022 at 10:21 AM Larkin Lowrey <llowrey@nuclearwinter.com> wrote:
> >
> > I am seeing a 'check' speed regression between kernels 5.15 and 5.16.
> > One host with a 20 drive array went from 170MB/s to 11MB/s. Another host
> > with a 15 drive array went from 180MB/s to 43MB/s. In both cases the
> > arrays are almost completely idle. I can flip between the two kernels
> > with no other changes and observe the performance changes.
> >
> > Is this a known issue?
>
> I am not aware of this issue. Could you please share
>
>   mdadm --detail /dev/mdXXXX
>
> output of the array?
>
> Thanks,
> Song


* Re: Raid6 check performance regression 5.15 -> 5.16
  2022-03-08  1:00 ` Song Liu
  2022-03-08 22:31   ` Roger Heflin
@ 2022-03-08 22:51   ` Larkin Lowrey
  2022-03-09  6:35     ` Song Liu
  1 sibling, 1 reply; 10+ messages in thread
From: Larkin Lowrey @ 2022-03-08 22:51 UTC (permalink / raw)
  To: Song Liu; +Cc: linux-raid

On Tue, Mar 8, 2022 at 3:50 PM Song Liu <song@kernel.org> wrote:
> On Mon, Mar 7, 2022 at 10:21 AM Larkin Lowrey <llowrey@nuclearwinter.com> wrote:
>> I am seeing a 'check' speed regression between kernels 5.15 and 5.16.
>> One host with a 20 drive array went from 170MB/s to 11MB/s. Another host
>> with a 15 drive array went from 180MB/s to 43MB/s. In both cases the
>> arrays are almost completely idle. I can flip between the two kernels
>> with no other changes and observe the performance changes.
>>
>> Is this a known issue?
>
> I am not aware of this issue. Could you please share
>
>    mdadm --detail /dev/mdXXXX
>
> output of the array?
>
> Thanks,
> Song

Host A:
# mdadm --detail /dev/md1
/dev/md1:
            Version : 1.2
      Creation Time : Thu Nov 19 18:21:44 2020
         Raid Level : raid6
         Array Size : 126961942016 (118.24 TiB 130.01 TB)
      Used Dev Size : 9766303232 (9.10 TiB 10.00 TB)
       Raid Devices : 15
      Total Devices : 15
        Persistence : Superblock is persistent

      Intent Bitmap : Internal

        Update Time : Tue Mar  8 12:39:14 2022
              State : clean
     Active Devices : 15
    Working Devices : 15
     Failed Devices : 0
      Spare Devices : 0

             Layout : left-symmetric
         Chunk Size : 512K

Consistency Policy : bitmap

               Name : fubar:1  (local to host fubar)
               UUID : eaefc9b7:74af4850:69556e2e:bc05d666
             Events : 85950

     Number   Major   Minor   RaidDevice State
        0       8        1        0      active sync   /dev/sda1
        1       8       17        1      active sync   /dev/sdb1
        2       8       33        2      active sync   /dev/sdc1
        3       8       49        3      active sync   /dev/sdd1
        4       8       65        4      active sync   /dev/sde1
        5       8       81        5      active sync   /dev/sdf1
       16       8       97        6      active sync   /dev/sdg1
        7       8      113        7      active sync   /dev/sdh1
        8       8      129        8      active sync   /dev/sdi1
        9       8      145        9      active sync   /dev/sdj1
       10       8      161       10      active sync   /dev/sdk1
       11       8      177       11      active sync   /dev/sdl1
       12       8      193       12      active sync   /dev/sdm1
       13       8      209       13      active sync   /dev/sdn1
       14       8      225       14      active sync   /dev/sdo1

Host B:
# mdadm --detail /dev/md1
/dev/md1:
            Version : 1.2
      Creation Time : Thu Oct 10 14:18:16 2019
         Raid Level : raid6
         Array Size : 140650080768 (130.99 TiB 144.03 TB)
      Used Dev Size : 7813893376 (7.28 TiB 8.00 TB)
       Raid Devices : 20
      Total Devices : 20
        Persistence : Superblock is persistent

      Intent Bitmap : Internal

        Update Time : Tue Mar  8 17:40:48 2022
              State : clean
     Active Devices : 20
    Working Devices : 20
     Failed Devices : 0
      Spare Devices : 0

             Layout : left-symmetric
         Chunk Size : 128K

Consistency Policy : bitmap

               Name : mcp:1
               UUID : 803f5eb5:e59d4091:5b91fa17:64801e54
             Events : 302158

     Number   Major   Minor   RaidDevice State
        0       8        1        0      active sync   /dev/sda1
        1      65      145        1      active sync   /dev/sdz1
        2      65      177        2      active sync   /dev/sdab1
        3      65      209        3      active sync   /dev/sdad1
        4       8      209        4      active sync   /dev/sdn1
        5      65      129        5      active sync   /dev/sdy1
        6       8      241        6      active sync   /dev/sdp1
        7      65      241        7      active sync   /dev/sdaf1
        8       8      161        8      active sync   /dev/sdk1
        9       8      113        9      active sync   /dev/sdh1
       10       8      129       10      active sync   /dev/sdi1
       11      66       33       11      active sync   /dev/sdai1
       12      65        1       12      active sync   /dev/sdq1
       13       8       65       13      active sync   /dev/sde1
       14      66       17       14      active sync   /dev/sdah1
       15       8       49       15      active sync   /dev/sdd1
       19      66       81       16      active sync   /dev/sdal1
       16      66       65       17      active sync   /dev/sdak1
       17       8      145       18      active sync   /dev/sdj1
       18      66      129       19      active sync   /dev/sdao1

The regression was introduced somewhere between these two Fedora kernels:
5.15.18-200 (good)
5.16.5-200 (bad)

--Larkin



* Re: Raid6 check performance regression 5.15 -> 5.16
  2022-03-08 22:51   ` Larkin Lowrey
@ 2022-03-09  6:35     ` Song Liu
  2022-03-09 16:27       ` Roger Heflin
  0 siblings, 1 reply; 10+ messages in thread
From: Song Liu @ 2022-03-09  6:35 UTC (permalink / raw)
  To: Larkin Lowrey, Roger Heflin, Wilson Jonathan; +Cc: linux-raid

On Tue, Mar 8, 2022 at 2:51 PM Larkin Lowrey <llowrey@nuclearwinter.com> wrote:
>
> On Tue, Mar 8, 2022 at 3:50 PM Song Liu <song@kernel.org> wrote:
> > On Mon, Mar 7, 2022 at 10:21 AM Larkin Lowrey <llowrey@nuclearwinter.com> wrote:
> >> I am seeing a 'check' speed regression between kernels 5.15 and 5.16.
> >> One host with a 20 drive array went from 170MB/s to 11MB/s. Another host
> >> with a 15 drive array went from 180MB/s to 43MB/s. In both cases the
> >> arrays are almost completely idle. I can flip between the two kernels
> >> with no other changes and observe the performance changes.
> >>
> >> Is this a known issue?
> >
> > I am not aware of this issue. Could you please share
> >
> >    mdadm --detail /dev/mdXXXX
> >
> > output of the array?
> >
> > Thanks,
> > Song
>
> Host A:
> # mdadm --detail /dev/md1
> /dev/md1:
>             Version : 1.2
>       Creation Time : Thu Nov 19 18:21:44 2020
>          Raid Level : raid6
>          Array Size : 126961942016 (118.24 TiB 130.01 TB)
>       Used Dev Size : 9766303232 (9.10 TiB 10.00 TB)
>        Raid Devices : 15
>       Total Devices : 15
>         Persistence : Superblock is persistent
>
>       Intent Bitmap : Internal
>
>         Update Time : Tue Mar  8 12:39:14 2022
>               State : clean
>      Active Devices : 15
>     Working Devices : 15
>      Failed Devices : 0
>       Spare Devices : 0
>
>              Layout : left-symmetric
>          Chunk Size : 512K
>
> Consistency Policy : bitmap
>
>                Name : fubar:1  (local to host fubar)
>                UUID : eaefc9b7:74af4850:69556e2e:bc05d666
>              Events : 85950
>
>      Number   Major   Minor   RaidDevice State
>         0       8        1        0      active sync   /dev/sda1
>         1       8       17        1      active sync   /dev/sdb1
>         2       8       33        2      active sync   /dev/sdc1
>         3       8       49        3      active sync   /dev/sdd1
>         4       8       65        4      active sync   /dev/sde1
>         5       8       81        5      active sync   /dev/sdf1
>        16       8       97        6      active sync   /dev/sdg1
>         7       8      113        7      active sync   /dev/sdh1
>         8       8      129        8      active sync   /dev/sdi1
>         9       8      145        9      active sync   /dev/sdj1
>        10       8      161       10      active sync   /dev/sdk1
>        11       8      177       11      active sync   /dev/sdl1
>        12       8      193       12      active sync   /dev/sdm1
>        13       8      209       13      active sync   /dev/sdn1
>        14       8      225       14      active sync   /dev/sdo1
>
> Host B:
> # mdadm --detail /dev/md1
> /dev/md1:
>             Version : 1.2
>       Creation Time : Thu Oct 10 14:18:16 2019
>          Raid Level : raid6
>          Array Size : 140650080768 (130.99 TiB 144.03 TB)
>       Used Dev Size : 7813893376 (7.28 TiB 8.00 TB)
>        Raid Devices : 20
>       Total Devices : 20
>         Persistence : Superblock is persistent
>
>       Intent Bitmap : Internal
>
>         Update Time : Tue Mar  8 17:40:48 2022
>               State : clean
>      Active Devices : 20
>     Working Devices : 20
>      Failed Devices : 0
>       Spare Devices : 0
>
>              Layout : left-symmetric
>          Chunk Size : 128K
>
> Consistency Policy : bitmap
>
>                Name : mcp:1
>                UUID : 803f5eb5:e59d4091:5b91fa17:64801e54
>              Events : 302158
>
>      Number   Major   Minor   RaidDevice State
>         0       8        1        0      active sync   /dev/sda1
>         1      65      145        1      active sync   /dev/sdz1
>         2      65      177        2      active sync   /dev/sdab1
>         3      65      209        3      active sync   /dev/sdad1
>         4       8      209        4      active sync   /dev/sdn1
>         5      65      129        5      active sync   /dev/sdy1
>         6       8      241        6      active sync   /dev/sdp1
>         7      65      241        7      active sync   /dev/sdaf1
>         8       8      161        8      active sync   /dev/sdk1
>         9       8      113        9      active sync   /dev/sdh1
>        10       8      129       10      active sync   /dev/sdi1
>        11      66       33       11      active sync   /dev/sdai1
>        12      65        1       12      active sync   /dev/sdq1
>        13       8       65       13      active sync   /dev/sde1
>        14      66       17       14      active sync   /dev/sdah1
>        15       8       49       15      active sync   /dev/sdd1
>        19      66       81       16      active sync   /dev/sdal1
>        16      66       65       17      active sync   /dev/sdak1
>        17       8      145       18      active sync   /dev/sdj1
>        18      66      129       19      active sync   /dev/sdao1
>
> The regression was introduced somewhere between these two Fedora kernels:
> 5.15.18-200 (good)
> 5.16.5-200 (bad)

Hi folks,

Sorry for the regression and thanks for sharing your array setup and
observations.

I think I have found the fix. I will send a patch for it. If you want
to try the fix sooner, you can find it at:

For 5.16:
https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git/commit/?h=tmp/fix-5.16&id=872c1a638b9751061b11b64a240892c989d1c618

For 5.17:
https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git/commit/?h=tmp/fix-5.17&id=c06ccb305e697d89fe99376c9036d1a2ece44c77
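
If it is easier to test on top of an existing source tree, one way
(just a sketch) is to fetch the branch and cherry-pick the commit,
e.g. for 5.16:

  git fetch https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git tmp/fix-5.16
  git cherry-pick 872c1a638b9751061b11b64a240892c989d1c618
  # then rebuild and boot the patched kernel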

Thanks,
Song


* Re: Raid6 check performance regression 5.15 -> 5.16
  2022-03-09  6:35     ` Song Liu
@ 2022-03-09 16:27       ` Roger Heflin
  0 siblings, 0 replies; 10+ messages in thread
From: Roger Heflin @ 2022-03-09 16:27 UTC (permalink / raw)
  To: Song Liu; +Cc: Larkin Lowrey, Wilson Jonathan, linux-raid

I have tested this.  The patch seems to fix the issue.

Test method was:

Fedora 5.16.11-200: check broken, taking about 4h50m to 5h6m over the
2 runs I have data for.
kernel.org 5.16.13 + this patch: 17% done in 25 min with about 100
more minutes to finish; it seems to be fast again, predicted around
2h, which is consistent with the good speed before 5.16.

On Wed, Mar 9, 2022 at 12:35 AM Song Liu <song@kernel.org> wrote:
>
> On Tue, Mar 8, 2022 at 2:51 PM Larkin Lowrey <llowrey@nuclearwinter.com> wrote:
> >
> > On Tue, Mar 8, 2022 at 3:50 PM Song Liu <song@kernel.org> wrote:
> > > On Mon, Mar 7, 2022 at 10:21 AM Larkin Lowrey <llowrey@nuclearwinter.com> wrote:
> > >> I am seeing a 'check' speed regression between kernels 5.15 and 5.16.
> > >> One host with a 20 drive array went from 170MB/s to 11MB/s. Another host
> > >> with a 15 drive array went from 180MB/s to 43MB/s. In both cases the
> > >> arrays are almost completely idle. I can flip between the two kernels
> > >> with no other changes and observe the performance changes.
> > >>
> > >> Is this a known issue?
> > >
> > > I am not aware of this issue. Could you please share
> > >
> > >    mdadm --detail /dev/mdXXXX
> > >
> > > output of the array?
> > >
> > > Thanks,
> > > Song
> >
> > Host A:
> > # mdadm --detail /dev/md1
> > /dev/md1:
> >             Version : 1.2
> >       Creation Time : Thu Nov 19 18:21:44 2020
> >          Raid Level : raid6
> >          Array Size : 126961942016 (118.24 TiB 130.01 TB)
> >       Used Dev Size : 9766303232 (9.10 TiB 10.00 TB)
> >        Raid Devices : 15
> >       Total Devices : 15
> >         Persistence : Superblock is persistent
> >
> >       Intent Bitmap : Internal
> >
> >         Update Time : Tue Mar  8 12:39:14 2022
> >               State : clean
> >      Active Devices : 15
> >     Working Devices : 15
> >      Failed Devices : 0
> >       Spare Devices : 0
> >
> >              Layout : left-symmetric
> >          Chunk Size : 512K
> >
> > Consistency Policy : bitmap
> >
> >                Name : fubar:1  (local to host fubar)
> >                UUID : eaefc9b7:74af4850:69556e2e:bc05d666
> >              Events : 85950
> >
> >      Number   Major   Minor   RaidDevice State
> >         0       8        1        0      active sync   /dev/sda1
> >         1       8       17        1      active sync   /dev/sdb1
> >         2       8       33        2      active sync   /dev/sdc1
> >         3       8       49        3      active sync   /dev/sdd1
> >         4       8       65        4      active sync   /dev/sde1
> >         5       8       81        5      active sync   /dev/sdf1
> >        16       8       97        6      active sync   /dev/sdg1
> >         7       8      113        7      active sync   /dev/sdh1
> >         8       8      129        8      active sync   /dev/sdi1
> >         9       8      145        9      active sync   /dev/sdj1
> >        10       8      161       10      active sync   /dev/sdk1
> >        11       8      177       11      active sync   /dev/sdl1
> >        12       8      193       12      active sync   /dev/sdm1
> >        13       8      209       13      active sync   /dev/sdn1
> >        14       8      225       14      active sync   /dev/sdo1
> >
> > Host B:
> > # mdadm --detail /dev/md1
> > /dev/md1:
> >             Version : 1.2
> >       Creation Time : Thu Oct 10 14:18:16 2019
> >          Raid Level : raid6
> >          Array Size : 140650080768 (130.99 TiB 144.03 TB)
> >       Used Dev Size : 7813893376 (7.28 TiB 8.00 TB)
> >        Raid Devices : 20
> >       Total Devices : 20
> >         Persistence : Superblock is persistent
> >
> >       Intent Bitmap : Internal
> >
> >         Update Time : Tue Mar  8 17:40:48 2022
> >               State : clean
> >      Active Devices : 20
> >     Working Devices : 20
> >      Failed Devices : 0
> >       Spare Devices : 0
> >
> >              Layout : left-symmetric
> >          Chunk Size : 128K
> >
> > Consistency Policy : bitmap
> >
> >                Name : mcp:1
> >                UUID : 803f5eb5:e59d4091:5b91fa17:64801e54
> >              Events : 302158
> >
> >      Number   Major   Minor   RaidDevice State
> >         0       8        1        0      active sync   /dev/sda1
> >         1      65      145        1      active sync   /dev/sdz1
> >         2      65      177        2      active sync   /dev/sdab1
> >         3      65      209        3      active sync   /dev/sdad1
> >         4       8      209        4      active sync   /dev/sdn1
> >         5      65      129        5      active sync   /dev/sdy1
> >         6       8      241        6      active sync   /dev/sdp1
> >         7      65      241        7      active sync   /dev/sdaf1
> >         8       8      161        8      active sync   /dev/sdk1
> >         9       8      113        9      active sync   /dev/sdh1
> >        10       8      129       10      active sync   /dev/sdi1
> >        11      66       33       11      active sync   /dev/sdai1
> >        12      65        1       12      active sync   /dev/sdq1
> >        13       8       65       13      active sync   /dev/sde1
> >        14      66       17       14      active sync   /dev/sdah1
> >        15       8       49       15      active sync   /dev/sdd1
> >        19      66       81       16      active sync   /dev/sdal1
> >        16      66       65       17      active sync   /dev/sdak1
> >        17       8      145       18      active sync   /dev/sdj1
> >        18      66      129       19      active sync   /dev/sdao1
> >
> > The regression was introduced somewhere between these two Fedora kernels:
> > 5.15.18-200 (good)
> > 5.16.5-200 (bad)
>
> Hi folks,
>
> Sorry for the regression and thanks for sharing your array setup and
> observations.
>
> I think I have found the fix for it. I will send a patch for it. If
> you want to try the fix
> sooner, you can find it at:
>
> For 5.16:
> https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git/commit/?h=tmp/fix-5.16&id=872c1a638b9751061b11b64a240892c989d1c618
>
> For 5.17:
> https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git/commit/?h=tmp/fix-5.17&id=c06ccb305e697d89fe99376c9036d1a2ece44c77
>
> Thanks,
> Song


* Re: Raid6 check performance regression 5.15 -> 5.16 #forregzbot
  2022-03-08  5:44 ` Thorsten Leemhuis
@ 2022-03-17 13:10   ` Thorsten Leemhuis
  0 siblings, 0 replies; 10+ messages in thread
From: Thorsten Leemhuis @ 2022-03-17 13:10 UTC (permalink / raw)
  To: regressions

TWIMC: this mail is primarily sent for documentation purposes and for
regzbot, my Linux kernel regression tracking bot. These mails usually
contain '#forregzbot' in the subject, to make them easy to spot and
filter.

#regzbot fixed-by: 26fed4ac4eab09c27

On 08.03.22 06:44, Thorsten Leemhuis wrote:
> [TLDR: I'm adding the regression report below to regzbot, the Linux
> kernel regression tracking bot; all text you find below is compiled
> from a few template paragraphs you might have encountered already
> from similar mails.]
> 
> On 07.03.22 19:15, Larkin Lowrey wrote:
>> I am seeing a 'check' speed regression between kernels 5.15 and 5.16.
>> One host with a 20 drive array went from 170MB/s to 11MB/s. Another host
>> with a 15 drive array went from 180MB/s to 43MB/s. In both cases the
>> arrays are almost completely idle. I can flip between the two kernels
>> with no other changes and observe the performance changes.
>>
>> Is this a known issue?
> 
> Hi, this is your Linux kernel regression tracker.
> 
> Thanks for the report.
> 
> CCing the regression mailing list, as it should be in the loop for all
> regressions, as explained here:
> https://www.kernel.org/doc/html/latest/admin-guide/reporting-issues.html
> 
> To be sure below issue doesn't fall through the cracks unnoticed, I'm
> adding it to regzbot, my Linux kernel regression tracking bot:
> 
> #regzbot ^introduced v5.15..v5.16
> #regzbot title md: Raid6 check performance regression
> #regzbot ignore-activity
> 
> If it turns out this isn't a regression, feel free to remove it from
> the tracking by sending a reply to this thread containing a paragraph
> like "#regzbot invalid: reason why this is invalid" (without the
> quotes).
> 
> Reminder for developers: when fixing the issue, please add a 'Link:'
> tag pointing to the report (the mail quoted above) using
> lore.kernel.org/r/, as explained in
> 'Documentation/process/submitting-patches.rst' and
> 'Documentation/process/5.Posting.rst'. Regzbot needs them to
> automatically connect reports with fixes, but they are useful in
> general, too.
> 
> I'm sending this to everyone that got the initial report, to make
> everyone aware of the tracking. I also hope that messages like this
> motivate people to directly get at least the regression mailing list and
> ideally even regzbot involved when dealing with regressions, as messages
> like this wouldn't be needed then. And don't worry, if I need to send
> other mails regarding this regression only relevant for regzbot I'll
> send them to the regressions lists only (with a tag in the subject so
> people can filter them away). With a bit of luck no such messages will
> be needed anyway.
> 
> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
> 
> P.S.: As the Linux kernel's regression tracker I'm getting a lot of
> reports on my table. I can only look briefly into most of them and lack
> knowledge about most of the areas they concern. I thus unfortunately
> will sometimes get things wrong or miss something important. I hope
> that's not the case here; if you think it is, don't hesitate to tell me
> in a public reply, it's in everyone's interest to set the public record
> straight.
> 
