linux-kernel.vger.kernel.org archive mirror
* This is the fourth time I’ve tried to find what led to the regression of outgoing network speed and each time I find the merge commit 8c94ccc7cd691472461448f98e2372c75849406c
@ 2024-02-03  1:02 Mikhail Gavrilov
  2024-02-03  1:08 ` Rahul Rameshbabu
                   ` (3 more replies)
  0 siblings, 4 replies; 31+ messages in thread
From: Mikhail Gavrilov @ 2024-02-03  1:02 UTC (permalink / raw)
  To: Linux List Kernel Mailing, linux-netdev, Greg KH

[-- Attachment #1: Type: text/plain, Size: 1060 bytes --]

Hi,
I'm trying to find the first bad commit that caused a drop in
outgoing network speed.
Every time, I end up at a huge merge [Merge tag 'usb-6.8-rc1' of
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb].
I have already triple-checked all my bisect answers and speed measurements.
I don't understand where I'm making a mistake.

Let's try to figure it out together.

Input data:
Two computers connected by a 1Gbps link.
Both have the same hardware.
Network: RTL8125 2.5GbE Controller (rev 05)

When I copy files from one computer to the other with a kernel snapshot
built from commit 296455ade1fd, I get 97-110MB/sec, which is almost the
maximum speed of a 1Gbps link.
When I move to commit 9d1694dc91ce, I get only 66-70MB/sec, which is
significantly slower.

I bisected the issue by measuring network speed at each step.
I saved all results to a file [1].

[1] The file is attached as a zip archive.

# first bad commit: [8c94ccc7cd691472461448f98e2372c75849406c] Merge
tag 'usb-6.8-rc1' of
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb

-- 
Best Regards,
Mike Gavrilov.

[-- Attachment #2: speed-measure.zip --]
[-- Type: application/zip, Size: 2039 bytes --]


* Re: This is the fourth time I’ve tried to find what led to the regression of outgoing network speed and each time I find the merge commit 8c94ccc7cd691472461448f98e2372c75849406c
  2024-02-03  1:02 This is the fourth time I’ve tried to find what led to the regression of outgoing network speed and each time I find the merge commit 8c94ccc7cd691472461448f98e2372c75849406c Mikhail Gavrilov
@ 2024-02-03  1:08 ` Rahul Rameshbabu
  2024-02-03  1:15 ` Randy Dunlap
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 31+ messages in thread
From: Rahul Rameshbabu @ 2024-02-03  1:08 UTC (permalink / raw)
  To: Mikhail Gavrilov; +Cc: Linux List Kernel Mailing, linux-netdev, Greg KH

On Sat, 03 Feb, 2024 06:02:15 +0500 Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com> wrote:
> Hi,
> I'm trying to find the first bad commit that led to a decreased
> network outgoing speed.
> And every time I come to a huge merge [Merge tag 'usb-6.8-rc1' of
> git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb]
> I have already triple-checked all my answers and speed measurements.
> I don't understand where I'm making a mistake.

Have you tried using --first-parent when you git bisect to see if that
helps you find the culprit aside from the merge commit you keep hitting?

https://git-scm.com/docs/git-bisect#Documentation/git-bisect.txt---first-parent
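
A minimal sketch of what that suggestion could look like, assuming Git 2.29 or later (the release where `git bisect` gained `--first-parent`). The helper name `bisect_first_parent` and its argument order are hypothetical and only serve the illustration; the commit IDs in the comments are the ones quoted in this thread.

```shell
# Start a bisect that follows only first-parent links, so that a merged
# branch is tested as a single unit (the merge commit itself).
# bisect_first_parent is a hypothetical helper name, not a git command.
bisect_first_parent() {
    good=$1   # known-good commit, e.g. 296455ade1fd
    bad=$2    # known-bad commit, e.g. 9d1694dc91ce
    git bisect start --first-parent "$bad" "$good"
}

# Usage (inside a kernel tree):
#   bisect_first_parent 296455ade1fd 9d1694dc91ce
#   ... build, boot, measure, then: git bisect good   (or: git bisect bad)
```

With this option the bisect can still end on a merge commit; it only guarantees that commits inside merged branches are not offered as test candidates.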

>
> Let's try to figure it out together.
>
> Input data:
> Two computers connected 1Gbps link.
> Both have the same hardware.
> Network: RTL8125 2.5GbE Controller (rev 05)
>
> When I copy files from one computer to another and kernel snapshot
> builded from commit 296455ade1fd I have 97-110MB/sec which is almost
> max speed of 1Gbps link.
> When I move to commit 9d1694dc91ce I have only 66-70MB/sec which is
> significantly slower.
>
> I bisected the issue by measuring network speed on each step.
> I save all results to file [1]
>
> [1] file is attached as a zip archive.
>
> # first bad commit: [8c94ccc7cd691472461448f98e2372c75849406c] Merge
> tag 'usb-6.8-rc1' of
> git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb

--
Thanks,

Rahul Rameshbabu


* Re: This is the fourth time I’ve tried to find what led to the regression of outgoing network speed and each time I find the merge commit 8c94ccc7cd691472461448f98e2372c75849406c
  2024-02-03  1:02 This is the fourth time I’ve tried to find what led to the regression of outgoing network speed and each time I find the merge commit 8c94ccc7cd691472461448f98e2372c75849406c Mikhail Gavrilov
  2024-02-03  1:08 ` Rahul Rameshbabu
@ 2024-02-03  1:15 ` Randy Dunlap
  2024-02-03  1:16   ` Randy Dunlap
  2024-02-03 18:20 ` Christian A. Ehrhardt
  2024-02-21  6:48 ` This is the fourth time I’ve " Linux regression tracking #adding (Thorsten Leemhuis)
  3 siblings, 1 reply; 31+ messages in thread
From: Randy Dunlap @ 2024-02-03  1:15 UTC (permalink / raw)
  To: Mikhail Gavrilov, Linux List Kernel Mailing, linux-netdev, Greg KH

Hi,

On 2/2/24 17:02, Mikhail Gavrilov wrote:
> Hi,
> I'm trying to find the first bad commit that led to a decreased
> network outgoing speed.
> And every time I come to a huge merge [Merge tag 'usb-6.8-rc1' of
> git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb]
> I have already triple-checked all my answers and speed measurements.
> I don't understand where I'm making a mistake.
> 
> Let's try to figure it out together.
> 
> Input data:
> Two computers connected 1Gbps link.
> Both have the same hardware.
> Network: RTL8125 2.5GbE Controller (rev 05)
> 
> When I copy files from one computer to another and kernel snapshot
> builded from commit 296455ade1fd I have 97-110MB/sec which is almost
> max speed of 1Gbps link.
> When I move to commit 9d1694dc91ce I have only 66-70MB/sec which is
> significantly slower.
> 
> I bisected the issue by measuring network speed on each step.
> I save all results to file [1]
> 
> [1] file is attached as a zip archive.
> 
> # first bad commit: [8c94ccc7cd691472461448f98e2372c75849406c] Merge
> tag 'usb-6.8-rc1' of
> git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb

a. Do you clean the object files between each test run?
or at least clean net/* and drivers/net/ethernet/* ?

b. I am far from a git expert, but in the bisects that I have
done, after each test run, I just say
$ git bisect good
or
$ git bisect bad

It looks like you are typing
$ git bisect [good | bad] hashID

Is that correct?

Anyway, I am interested in your outcome just to learn
how to handle this problem.

Good luck.

-- 
#Randy


* Re: This is the fourth time I’ve tried to find what led to the regression of outgoing network speed and each time I find the merge commit 8c94ccc7cd691472461448f98e2372c75849406c
  2024-02-03  1:15 ` Randy Dunlap
@ 2024-02-03  1:16   ` Randy Dunlap
  2024-02-03  2:32     ` Jakub Kicinski
  0 siblings, 1 reply; 31+ messages in thread
From: Randy Dunlap @ 2024-02-03  1:16 UTC (permalink / raw)
  To: Mikhail Gavrilov, Linux List Kernel Mailing, Greg KH,
	Network Development

[correct the netdev mailing list address]


On 2/2/24 17:15, Randy Dunlap wrote:
> Hi,
> 
> On 2/2/24 17:02, Mikhail Gavrilov wrote:
>> Hi,
>> I'm trying to find the first bad commit that led to a decreased
>> network outgoing speed.
>> And every time I come to a huge merge [Merge tag 'usb-6.8-rc1' of
>> git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb]
>> I have already triple-checked all my answers and speed measurements.
>> I don't understand where I'm making a mistake.
>>
>> Let's try to figure it out together.
>>
>> Input data:
>> Two computers connected 1Gbps link.
>> Both have the same hardware.
>> Network: RTL8125 2.5GbE Controller (rev 05)
>>
>> When I copy files from one computer to another and kernel snapshot
>> builded from commit 296455ade1fd I have 97-110MB/sec which is almost
>> max speed of 1Gbps link.
>> When I move to commit 9d1694dc91ce I have only 66-70MB/sec which is
>> significantly slower.
>>
>> I bisected the issue by measuring network speed on each step.
>> I save all results to file [1]
>>
>> [1] file is attached as a zip archive.
>>
>> # first bad commit: [8c94ccc7cd691472461448f98e2372c75849406c] Merge
>> tag 'usb-6.8-rc1' of
>> git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
> 
> a. Do you clean the object files between each test run?
> or at least clean net/* and drivers/net/ethernet/* ?
> 
> b. I am far from a git expert, but in the bisects that I have
> done, after each test run, I just say
> $ git bisect good
> or
> $ git bisect bad
> 
> It looks like you are typing
> $ git bisect [good | bad] hashID
> 
> Is that correct?
> 
> Anyway, I am interested in your outcome just to learn
> how to handle this problem.
> 
> Good luck.
> 

-- 
#Randy


* Re: This is the fourth time I’ve tried to find what led to the regression of outgoing network speed and each time I find the merge commit 8c94ccc7cd691472461448f98e2372c75849406c
  2024-02-03  1:16   ` Randy Dunlap
@ 2024-02-03  2:32     ` Jakub Kicinski
  0 siblings, 0 replies; 31+ messages in thread
From: Jakub Kicinski @ 2024-02-03  2:32 UTC (permalink / raw)
  To: Mikhail Gavrilov
  Cc: Randy Dunlap, Linux List Kernel Mailing, Greg KH, Network Development

Thanks for the forward!

On Fri, 2 Feb 2024 17:16:41 -0800 Randy Dunlap wrote:
> >> When I copy files from one computer to another and kernel snapshot
> >> builded from commit 296455ade1fd I have 97-110MB/sec which is almost
> >> max speed of 1Gbps link.
> >> When I move to commit 9d1694dc91ce I have only 66-70MB/sec which is
> >> significantly slower.

There isn't that much networking in between the two.
Is any of the CPU cores at 100% when you are transferring the data on
the bad commit?
Do you have any iptables / nftables rules?
Are you using TLS in the transfer?
Did you try reverting f1172f3ee3a98754?


* Re: This is the fourth time I’ve tried to find what led to the regression of outgoing network speed and each time I find the merge commit 8c94ccc7cd691472461448f98e2372c75849406c
  2024-02-03  1:02 This is the fourth time I’ve tried to find what led to the regression of outgoing network speed and each time I find the merge commit 8c94ccc7cd691472461448f98e2372c75849406c Mikhail Gavrilov
  2024-02-03  1:08 ` Rahul Rameshbabu
  2024-02-03  1:15 ` Randy Dunlap
@ 2024-02-03 18:20 ` Christian A. Ehrhardt
  2024-02-04 20:47   ` Christian A. Ehrhardt
  2024-02-21  6:48 ` This is the fourth time I’ve " Linux regression tracking #adding (Thorsten Leemhuis)
  3 siblings, 1 reply; 31+ messages in thread
From: Christian A. Ehrhardt @ 2024-02-03 18:20 UTC (permalink / raw)
  To: Mikhail Gavrilov; +Cc: Linux List Kernel Mailing, linux-netdev, Greg KH

On Sat, Feb 03, 2024 at 06:02:15AM +0500, Mikhail Gavrilov wrote:
> Hi,
> I'm trying to find the first bad commit that led to a decreased
> network outgoing speed.
> And every time I come to a huge merge [Merge tag 'usb-6.8-rc1' of
> git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb]
> I have already triple-checked all my answers and speed measurements.
> I don't understand where I'm making a mistake.
> 
> Let's try to figure it out together.
> 
> Input data:
> Two computers connected 1Gbps link.
> Both have the same hardware.
> Network: RTL8125 2.5GbE Controller (rev 05)
> 
> When I copy files from one computer to another and kernel snapshot
> builded from commit 296455ade1fd I have 97-110MB/sec which is almost
> max speed of 1Gbps link.
> When I move to commit 9d1694dc91ce I have only 66-70MB/sec which is
> significantly slower.
> 
> I bisected the issue by measuring network speed on each step.
> I save all results to file [1]
> 
> [1] file is attached as a zip archive.
> 
> # first bad commit: [8c94ccc7cd691472461448f98e2372c75849406c] Merge
> tag 'usb-6.8-rc1' of
> git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb

So (simplified) the change history looks something like this:

   branch point  296455ade1fd (good)     9d1694dc91ce (merge, bad)
       |             |                         |
 ----- * ----------- * ----------------------- * --------------  <- master
        \                                     /
          ----------- * ---------------------   <- development branch
                      |
                  Bug introduced here

The straight line is Linus' master branch. At some point in the
past a development branch was forked from Linus's branch (the
lower line). During that development the regression was introduced
and when the branch was merged back the master branch started to
have the regression.

To further pinpoint the bug in the development branch you'll have to
bisect along the lower line.

So how do you do this:

Look at the merge:
$ git cat-file -p 9d1694dc91ce | head -n3
tree d9093aecb9261cccaea1f0a58887fcd9db542172
parent e9a5a78d1ad8ceb4e3df6d6ad93360094c84ac40
parent b2e792ae883a0aa976d4176dfa7dc933263440ea

So the merge commit has two parent commits, one is on the master
branch, the other is on the development branch. To find out which
of the parents is on the development branch you can ask git to find
the common ancestor of the two parent commits and your good commit
on the master branch:

$ git merge-base 296455ade1fd e9a5a78d1ad8ceb4e3df6d6ad93360094c84ac40
296455ade1fdcf5f8f8c033201633b60946c589a
$ git merge-base 296455ade1fd b2e792ae883a0aa976d4176dfa7dc933263440ea
587371ed783b046f22ba7a5e1cc9a19ae35123b4

So the second parent is not on the master branch and the merge base
(i.e. the point where the development branch was forked from the master
branch) is 587371ed783b046f22ba7a5e1cc9a19ae35123b4.

So I'd assume that 587371ed783b046f22ba7a5e1cc9a19ae35123b4 is a good
commit and b2e792ae883a0aa976d4176dfa7dc933263440ea is a bad commit.
You can verify this and then start bisecting like this for better
results:
$ git bisect good 587371ed783b046f22ba7a5e1cc9a19ae35123b4
$ git bisect bad b2e792ae883a0aa976d4176dfa7dc933263440ea
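
The parent and merge-base inspection described above can be wrapped in a small helper. This is only a sketch: the function name `show_parent_bases` is hypothetical, and it assumes it is run inside the repository that contains both commits.

```shell
# For every parent of a merge commit, print the parent followed by its
# merge base with a known-good commit. A parent whose merge base differs
# from the good commit is the one that lies on the side branch.
# show_parent_bases is a hypothetical helper name.
show_parent_bases() {
    merge=$1   # the merge commit to inspect
    good=$2    # a known-good commit on the main line
    for parent in $(git rev-parse "$merge^@"); do
        printf '%s %s\n' "$parent" "$(git merge-base "$good" "$parent")"
    done
}

# Usage (inside a kernel tree):
#   show_parent_bases 9d1694dc91ce 296455ade1fd
```

`<rev>^@` is git's shorthand for "all parents of `<rev>`", so the loop also copes with merges that have more than two parents.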

      regards   Christian



* Re: This is the fourth time I’ve tried to find what led to the regression of outgoing network speed and each time I find the merge commit 8c94ccc7cd691472461448f98e2372c75849406c
  2024-02-03 18:20 ` Christian A. Ehrhardt
@ 2024-02-04 20:47   ` Christian A. Ehrhardt
  2024-02-05 21:08     ` Mikhail Gavrilov
  0 siblings, 1 reply; 31+ messages in thread
From: Christian A. Ehrhardt @ 2024-02-04 20:47 UTC (permalink / raw)
  To: Mikhail Gavrilov; +Cc: Linux List Kernel Mailing, linux-netdev, Greg KH

Hi,

[ sorry, replying to myself ]

On Sat, Feb 03, 2024 at 07:20:47PM +0100, Christian A. Ehrhardt wrote:
> On Sat, Feb 03, 2024 at 06:02:15AM +0500, Mikhail Gavrilov wrote:
> > Hi,
> > I'm trying to find the first bad commit that led to a decreased
> > network outgoing speed.
> > And every time I come to a huge merge [Merge tag 'usb-6.8-rc1' of
> > git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb]
> > I have already triple-checked all my answers and speed measurements.
> > I don't understand where I'm making a mistake.
> > 
> > Let's try to figure it out together.
> > 
> > Input data:
> > Two computers connected 1Gbps link.
> > Both have the same hardware.
> > Network: RTL8125 2.5GbE Controller (rev 05)
> > 
> > When I copy files from one computer to another and kernel snapshot
> > builded from commit 296455ade1fd I have 97-110MB/sec which is almost
> > max speed of 1Gbps link.
> > When I move to commit 9d1694dc91ce I have only 66-70MB/sec which is
> > significantly slower.
> > 
> > I bisected the issue by measuring network speed on each step.
> > I save all results to file [1]
> > 
> > [1] file is attached as a zip archive.
> > 
> > # first bad commit: [8c94ccc7cd691472461448f98e2372c75849406c] Merge
> > tag 'usb-6.8-rc1' of
> > git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
> 
> So (simplified) the change history looks something like this:
> [ ... ]

Sorry, I was looking at the wrong merge commit and when using
the commit pinpointed by your bisect your log shows
that _both_ parents of the bad merge commit are marked as good
which is somewhat strange.

However, it should be possible to bisect further if you do a rebase
like this:

$ git cat-file -p 8c94ccc7cd691472461448f98e2372c75849406c | head -n 3
tree d3907cad2a1fbbbcf71847274fdbdcf5a2aeb9a2
parent bd736f38c014ba70ba7ec3bdc6af6fe5368d6612
parent 933bb7b878ddd0f8c094db45551a7daddf806e00
$ git branch m bd736f38c014ba70ba7ec3bdc6af6fe5368d6612
$ git branch d 933bb7b878ddd0f8c094db45551a7daddf806e00
$ git checkout d 
Updating files: 100% (11666/11666), done.
Switched to branch 'd'
$ git rebase m 
Successfully rebased and updated refs/heads/d.

Now, "m" must be good as per your bisect log and "d" must be bad
because it is the same tree as the bad merge commit (8c94ccc7cd69).

Due to the rebase there's a linear history between the two, thus
starting a bisect like this might yield more information:

$ git bisect good m
$ git bisect bad d
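
The rebase steps above could be scripted roughly as follows. This is a sketch under assumptions: the function name `linearize_merge` is hypothetical, the branch names `m` and `d` are the ones used in this message and must not exist yet, and the rebase is assumed to apply without conflicts. The final check makes the "same tree as the merge" claim explicit.

```shell
# Turn a two-parent merge into a linear history: branch m is the first
# parent, branch d is the second parent rebased on top of m.
# Returns 0 only if every step succeeds and the rebased d has exactly
# the same tree as the merge commit (typical for conflict-free merges).
# linearize_merge is a hypothetical helper name.
linearize_merge() {
    merge=$1
    git branch m "$merge^1" &&
        git branch d "$merge^2" &&
        git checkout -q d &&
        git rebase -q m &&
        git diff --quiet d "$merge"
}

# Usage (inside a kernel tree):
#   linearize_merge 8c94ccc7cd691472461448f98e2372c75849406c &&
#       git bisect start d m
```

Because the rebase rewrites the commits, the hashes reported by the bisect will differ from the hashes of the same patches in the upstream tree; match them by subject line.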

     regards   Christian




* Re: This is the fourth time I’ve tried to find what led to the regression of outgoing network speed and each time I find the merge commit 8c94ccc7cd691472461448f98e2372c75849406c
  2024-02-04 20:47   ` Christian A. Ehrhardt
@ 2024-02-05 21:08     ` Mikhail Gavrilov
  2024-02-06 11:26       ` Mathias Nyman
  0 siblings, 1 reply; 31+ messages in thread
From: Mikhail Gavrilov @ 2024-02-05 21:08 UTC (permalink / raw)
  To: Christian A. Ehrhardt, niklas.neronin, mathias.nyman
  Cc: Linux List Kernel Mailing, linux-netdev, Greg KH

[-- Attachment #1: Type: text/plain, Size: 2622 bytes --]

On Mon, Feb 5, 2024 at 1:47 AM Christian A. Ehrhardt <lk@c--e.de> wrote:
>
>
> Sorry, I was looking at the wrong merge commit and when using
> the commit pinpointed by your bisect your log shows
> that _both_ parents of the bad merge commit are marked as good
> which is somewhat strange.
>
> However, it should be possible to bisect further if you do a rebase
> like this:
>
> $ git cat-file -p 8c94ccc7cd691472461448f98e2372c75849406c | head -n 3
> tree d3907cad2a1fbbbcf71847274fdbdcf5a2aeb9a2
> parent bd736f38c014ba70ba7ec3bdc6af6fe5368d6612
> parent 933bb7b878ddd0f8c094db45551a7daddf806e00
> $ git branch m bd736f38c014ba70ba7ec3bdc6af6fe5368d6612
> $ git branch d 933bb7b878ddd0f8c094db45551a7daddf806e00
> $ git checkout d
> Updating files: 100% (11666/11666), done.
> Switched to branch 'd'
> $ git rebase m
> Successfully rebased and updated refs/heads/d.
>
> Now, "m" must be good as per your bisect log and "d" must be bad
> because it is the same tree as the bad merge commit (8c94ccc7cd69).
>
> Due to the rebase there's a linear history between the two, thus
> starting a bisect like this might yield more information:
>
> $ git bisect good m
> $ git bisect bad d
>
>      regards   Christian
>
>

Thanks for the real help.
Now I have found the actual bad commit.

57e153dfd0e7a080373fe5853c5609443d97fa5a is the first bad commit
commit 57e153dfd0e7a080373fe5853c5609443d97fa5a
Author: Niklas Neronin <niklas.neronin@linux.intel.com>
Date:   Fri Dec 1 17:06:40 2023 +0200

    xhci: add handler for only one interrupt line

    Current xHCI driver only supports one "interrupter", meaning we will
    only use one MSI/MSI-X interrupt line. Thus, add handler only to the
    first interrupt line.

    Signed-off-by: Niklas Neronin <niklas.neronin@linux.intel.com>
    Co-developed-by: Mathias Nyman <mathias.nyman@linux.intel.com>
    Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com>
    Link: https://lore.kernel.org/r/20231201150647.1307406-13-mathias.nyman@linux.intel.com
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

 drivers/usb/host/xhci-pci.c | 35 ++++++++++-------------------------
 1 file changed, 10 insertions(+), 25 deletions(-)

Niklas, Mathias, I noticed decreased network speed when sending
files via sftp between my workstations on the local network.
Bisecting the issue led me to this commit.
My motherboard is an MPG-B650I-EDGE-WIFI; it looks like it is related
to the mentioned commit.
https://www.msi.com/Motherboard/MPG-B650I-EDGE-WIFI

-- 
Best Regards,
Mike Gavrilov.

[-- Attachment #2: speed-measure-3.zip --]
[-- Type: application/zip, Size: 1019 bytes --]

[-- Attachment #3: .config.zip --]
[-- Type: application/zip, Size: 65390 bytes --]

[-- Attachment #4: git-bisect-log-regression-network-performance.zip --]
[-- Type: application/zip, Size: 1033 bytes --]

[-- Attachment #5: dmesg.zip --]
[-- Type: application/zip, Size: 58258 bytes --]


* Re: This is the fourth time I’ve tried to find what led to the regression of outgoing network speed and each time I find the merge commit 8c94ccc7cd691472461448f98e2372c75849406c
  2024-02-05 21:08     ` Mikhail Gavrilov
@ 2024-02-06 11:26       ` Mathias Nyman
  2024-02-06 16:12         ` Mikhail Gavrilov
  0 siblings, 1 reply; 31+ messages in thread
From: Mathias Nyman @ 2024-02-06 11:26 UTC (permalink / raw)
  To: Mikhail Gavrilov, Christian A. Ehrhardt, niklas.neronin
  Cc: Linux List Kernel Mailing, linux-netdev, Greg KH

On 5.2.2024 23.08, Mikhail Gavrilov wrote:
> On Mon, Feb 5, 2024 at 1:47 AM Christian A. Ehrhardt <lk@c--e.de> wrote:
> 
> Thanks for real help.
> Now I spotted a really bad commit.
> 
> 57e153dfd0e7a080373fe5853c5609443d97fa5a is the first bad commit
> commit 57e153dfd0e7a080373fe5853c5609443d97fa5a
> Author: Niklas Neronin <niklas.neronin@linux.intel.com>
> Date:   Fri Dec 1 17:06:40 2023 +0200
> 
>      xhci: add handler for only one interrupt line
> 
>      Current xHCI driver only supports one "interrupter", meaning we will
>      only use one MSI/MSI-X interrupt line. Thus, add handler only to the
>      first interrupt line.
> 
>      Signed-off-by: Niklas Neronin <niklas.neronin@linux.intel.com>
>      Co-developed-by: Mathias Nyman <mathias.nyman@linux.intel.com>
>      Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com>
>      Link: https://lore.kernel.org/r/20231201150647.1307406-13-mathias.nyman@linux.intel.com
>      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> 
>   drivers/usb/host/xhci-pci.c | 35 ++++++++++-------------------------
>   1 file changed, 10 insertions(+), 25 deletions(-)
> 
> Niklas, Mathias I spotted decreased network speed on sending when
> transferring files via sftp between my workstations in the local
> network.
> And bisection of issue leads me to this commit.
> My motherboard is MPG-B650I-EDGE-WIFI looks like it is related to the
> mentioned commit.
> https://www.msi.com/Motherboard/MPG-B650I-EDGE-WIFI
> 

This seems odd, not sure how this usb host change would impact your network speed.

Could you try reverting that patch from 6.8-rc1 and see if it helps?

There are some other patches on top of it that need to be reverted first.
These should be enough:

36b24ebf9a04 xhci: minor coding style cleanup in 'xhci_try_enable_msi()'
9831960df237 xhci: rework 'xhci_try_enable_msi()' MSI and MSI-X setup code
dfbf4441f2d3 xhci: change 'msix_count' to encompass MSI or MSI-X vectors
a795f708b284 xhci: refactor static MSI function
74554e9c2276 xhci: refactor static MSI-X function
f977f4c9301c xhci: add handler for only one interrupt line

That patch changes how we request MSI/MSI-X interrupt(s) for xhci.

Is there any change in /proc/interrupts between a good and bad case?
Such as xhci_hcd using MSI-X instead of MSI, or eth0 and xhci_hcd
interrupting on the same CPU?
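
One way to compare the two cases is to save /proc/interrupts on each kernel and count the lines per driver. The sketch below assumes the driver name is the last field of its line (true for the simple, unshared case); the function name `count_irq_lines` and the file names in the usage comment are hypothetical.

```shell
# Count how many interrupt lines in a saved copy of /proc/interrupts
# end with the given driver name. Shared lines that list several
# drivers in the last column are not matched by this simple test.
# count_irq_lines is a hypothetical helper name.
count_irq_lines() {
    file=$1   # saved copy of /proc/interrupts
    name=$2   # driver name, e.g. xhci_hcd
    awk -v name="$name" '$NF == name { n++ } END { print n + 0 }' "$file"
}

# Usage:
#   cat /proc/interrupts > interrupts-good.txt   # on the good kernel
#   cat /proc/interrupts > interrupts-bad.txt    # on the bad kernel
#   count_irq_lines interrupts-good.txt xhci_hcd
#   count_irq_lines interrupts-bad.txt xhci_hcd
```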

Thanks
Mathias


* Re: This is the fourth time I’ve tried to find what led to the regression of outgoing network speed and each time I find the merge commit 8c94ccc7cd691472461448f98e2372c75849406c
  2024-02-06 11:26       ` Mathias Nyman
@ 2024-02-06 16:12         ` Mikhail Gavrilov
  2024-02-07 10:40           ` Mathias Nyman
  0 siblings, 1 reply; 31+ messages in thread
From: Mikhail Gavrilov @ 2024-02-06 16:12 UTC (permalink / raw)
  To: Mathias Nyman
  Cc: Christian A. Ehrhardt, niklas.neronin, Linux List Kernel Mailing,
	Greg KH, linux-usb

[-- Attachment #1: Type: text/plain, Size: 1369 bytes --]

On Tue, Feb 6, 2024 at 4:24 PM Mathias Nyman
<mathias.nyman@linux.intel.com> wrote:
>
>
> This seems odd, not sure how this usb host change would impact your network speed.
>
> Could you try reverting that patch from 6.8-rc1 and see if it helps?
>
> There are some other patches on top of it that needs to be reverted first.
> These should be enough:
>
> 36b24ebf9a04 xhci: minor coding style cleanup in 'xhci_try_enable_msi()
> 9831960df237 xhci: rework 'xhci_try_enable_msi()' MSI and MSI-X setup code
> dfbf4441f2d3 xhci: change 'msix_count' to encompass MSI or MSI-X vectors
> a795f708b284 xhci: refactor static MSI function
> 74554e9c2276 xhci: refactor static MSI-X function
> f977f4c9301c xhci: add handler for only one interrupt line

I confirm that after reverting all the listed commits and 57e153dfd0e7,
network performance returned to the theoretical maximum.

> That patch changes how we request MSI/MSI-X interrupt(s) for xhci.
>
> Is there any change is /proc/interrupts between a good and bad case?
> Such as xhci_hcd using MSI-X instead of MSI, or eth0 and xhci_hcd
> interrupting on the same CPU?

On the good kernel I have 32 xhci_hcd entries, and on the bad one only 4.
Both scenarios use PCI-MSIX.
I attached both interrupt outputs as archives to this message.

[1] https://postimg.cc/zL2RYgYZ

-- 
Best Regards,
Mike Gavrilov.

[-- Attachment #2: interrupts-good.zip --]
[-- Type: application/zip, Size: 3009 bytes --]

[-- Attachment #3: interrupts-bad.zip --]
[-- Type: application/zip, Size: 3097 bytes --]


* Re: This is the fourth time I’ve tried to find what led to the regression of outgoing network speed and each time I find the merge commit 8c94ccc7cd691472461448f98e2372c75849406c
  2024-02-06 16:12         ` Mikhail Gavrilov
@ 2024-02-07 10:40           ` Mathias Nyman
  2024-02-07 11:55             ` Mikhail Gavrilov
  0 siblings, 1 reply; 31+ messages in thread
From: Mathias Nyman @ 2024-02-07 10:40 UTC (permalink / raw)
  To: Mikhail Gavrilov
  Cc: Christian A. Ehrhardt, niklas.neronin, Linux List Kernel Mailing,
	Greg KH, linux-usb

On 6.2.2024 18.12, Mikhail Gavrilov wrote:
> On Tue, Feb 6, 2024 at 4:24 PM Mathias Nyman
> <mathias.nyman@linux.intel.com> wrote:
> 
> I confirm after reverting all listed commits and 57e153dfd0e7
> performance of the network returned to theoretical maximum.
> 
>> That patch changes how we request MSI/MSI-X interrupt(s) for xhci.
>>
>> Is there any change is /proc/interrupts between a good and bad case?
>> Such as xhci_hcd using MSI-X instead of MSI, or eth0 and xhci_hcd
>> interrupting on the same CPU?
> 
> On the good kernel I have - 32 xhci_hcd, and bad only - 4.
> In both scenarios using PCI-MSIX.
> I attached both interrupt output as archives to this message.
> 


Thanks,

Looks like your network adapter ends up interrupting CPU0 in the bad case due
to the change in how many interrupts are requested by xhci_hcd before it.

bad case:
	CPU0	CPU1	...	CPU31
87:	18213809 0	... 	0	IR-PCI-MSIX-0000:0e:00.0    0-edge      enp14s0

Does manually changing it to some other CPU help? Pick one that doesn't already
handle a lot of interrupts. CPU0 could also be busier in general, possibly spending
more time with interrupts disabled.

For example change to CPU23 in the bad case:

echo 800000 > /proc/irq/87/smp_affinity

Check from /proc/interrupts that enp14s0 interrupts actually go to CPU23 after this.
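
For reference, the value 800000 is the hexadecimal bitmask with only bit 23 set. A small sketch of how such a mask can be derived for another CPU; the helper name `cpu_to_mask` is hypothetical, and the plain hex form is only assumed to be suitable for CPU numbers below 32 (larger systems use comma-separated 32-bit groups).

```shell
# Print the hexadecimal smp_affinity mask that selects a single CPU.
# Intended for CPU numbers 0..31 only.
# cpu_to_mask is a hypothetical helper name.
cpu_to_mask() {
    printf '%x\n' $((1 << $1))
}

cpu_to_mask 23   # prints 800000

# Usage (as root, IRQ number taken from /proc/interrupts):
#   cpu_to_mask 23 > /proc/irq/87/smp_affinity
```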

Thanks
Mathias



* Re: This is the fourth time I’ve tried to find what led to the regression of outgoing network speed and each time I find the merge commit 8c94ccc7cd691472461448f98e2372c75849406c
  2024-02-07 10:40           ` Mathias Nyman
@ 2024-02-07 11:55             ` Mikhail Gavrilov
  2024-02-08  9:25               ` Mathias Nyman
  0 siblings, 1 reply; 31+ messages in thread
From: Mikhail Gavrilov @ 2024-02-07 11:55 UTC (permalink / raw)
  To: Mathias Nyman
  Cc: Christian A. Ehrhardt, niklas.neronin, Linux List Kernel Mailing,
	Greg KH, linux-usb

On Wed, Feb 7, 2024 at 3:39 PM Mathias Nyman
<mathias.nyman@linux.intel.com> wrote:
>
> Thanks,
>
> Looks like your network adapter ends up interrupting CPU0 in the bad case due
> to the change in how many interrupts are requested by xhci_hcd before it.
>
> bad case:
>         CPU0    CPU1    ...     CPU31
> 87:     18213809 0      ...     0       IR-PCI-MSIX-0000:0e:00.0    0-edge      enp14s0
>
> Does manually changing it to some other CPU help? picking one that doesn't already
> handle a lot of interrupts. CPU0 could also in general be more busy, possibly spending
> more time with interrupts disabled.
>
> For example change to CPU23 in the bad case:
>
> echo 800000 > /proc/irq/87/smp_affinity
>
> Check from proc/interrupts that enp14s0 interrupts actually go to CPU23 after this.
>
> Thanks
> Mathias
>

root@secondary-ws ~# iperf3 -c primary-ws.local -t 5 -p 5000 -P 1
Connecting to host primary-ws.local, port 5000
[  5] local 192.168.1.130 port 49152 connected to 192.168.1.96 port 5000
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  70.9 MBytes   594 Mbits/sec    0    376 KBytes
[  5]   1.00-2.00   sec  72.4 MBytes   607 Mbits/sec    0    431 KBytes
[  5]   2.00-3.00   sec  73.1 MBytes   613 Mbits/sec    0    479 KBytes
[  5]   3.00-4.00   sec  72.4 MBytes   607 Mbits/sec    0    501 KBytes
[  5]   4.00-5.00   sec  73.2 MBytes   614 Mbits/sec    0    501 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-5.00   sec   362 MBytes   607 Mbits/sec    0             sender
[  5]   0.00-5.00   sec   360 MBytes   603 Mbits/sec                  receiver

iperf Done.
root@secondary-ws ~# echo 800000 > /proc/irq/87/smp_affinity
root@secondary-ws ~# iperf3 -c primary-ws.local -t 5 -p 5000 -P 1
Connecting to host primary-ws.local, port 5000
[  5] local 192.168.1.130 port 37620 connected to 192.168.1.96 port 5000
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   111 MBytes   934 Mbits/sec    0    621 KBytes
[  5]   1.00-2.00   sec   109 MBytes   913 Mbits/sec    0    621 KBytes
[  5]   2.00-3.00   sec   110 MBytes   920 Mbits/sec    0    621 KBytes
[  5]   3.00-4.00   sec   110 MBytes   924 Mbits/sec    0    621 KBytes
[  5]   4.00-5.00   sec   109 MBytes   917 Mbits/sec    0    621 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-5.00   sec   549 MBytes   921 Mbits/sec    0             sender
[  5]   0.00-5.00   sec   547 MBytes   916 Mbits/sec                  receiver

iperf Done.

Very interesting, is CPU0 slower than CPU23 by 30%?
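
To put a number on the difference between the two runs above, here is a rough sketch that pulls the sender bitrate out of an iperf3 text summary and computes the relative drop. The function names `sender_mbits` and `percent_lower` are hypothetical, and the parsing assumes iperf3 reports the summary in Mbits/sec as it does here.

```shell
# Print the sender-side bitrate from an iperf3 text summary on stdin.
# Assumes the summary line ends in "sender" and the unit is Mbits/sec.
# sender_mbits and percent_lower are hypothetical helper names.
sender_mbits() {
    awk '$NF == "sender" { print $(NF - 3) }'
}

# Print how much lower "slow" is than "fast", in percent (one decimal).
percent_lower() {
    awk -v slow="$1" -v fast="$2" \
        'BEGIN { printf "%.1f\n", (fast - slow) * 100 / fast }'
}

percent_lower 607 921   # prints 34.1

# Usage:
#   iperf3 -c primary-ws.local -t 5 -p 5000 -P 1 | sender_mbits
```

With the sender figures from the two runs (607 and 921 Mbits/sec), the run on CPU0 comes out about 34% lower.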

-- 
Best Regards,
Mike Gavrilov.


* Re: This is the fourth time I’ve tried to find what led to the regression of outgoing network speed and each time I find the merge commit 8c94ccc7cd691472461448f98e2372c75849406c
  2024-02-07 11:55             ` Mikhail Gavrilov
@ 2024-02-08  9:25               ` Mathias Nyman
  2024-02-08 10:32                 ` Mikhail Gavrilov
  0 siblings, 1 reply; 31+ messages in thread
From: Mathias Nyman @ 2024-02-08  9:25 UTC (permalink / raw)
  To: Mikhail Gavrilov
  Cc: Christian A. Ehrhardt, niklas.neronin, Linux List Kernel Mailing,
	Greg KH, linux-usb

On 7.2.2024 13.55, Mikhail Gavrilov wrote:
> On Wed, Feb 7, 2024 at 3:39 PM Mathias Nyman
> <mathias.nyman@linux.intel.com> wrote:
>>
>> Thanks,
>>
>> Looks like your network adapter ends up interrupting CPU0 in the bad case due
>> to the change in how many interrupts are requested by xhci_hcd before it.
>>
>> bad case:
>>          CPU0    CPU1    ...     CPU31
>> 87:     18213809 0      ...     0       IR-PCI-MSIX-0000:0e:00.0    0-edge      enp14s0
>>
>> Does manually changing it to some other CPU help? picking one that doesn't already
>> handle a lot of interrupts. CPU0 could also in general be more busy, possibly spending
>> more time with interrupts disabled.
>>
>> For example change to CPU23 in the bad case:
>>
>> echo 800000 > /proc/irq/87/smp_affinity
>>
>> Check from /proc/interrupts that enp14s0 interrupts actually go to CPU23 after this.
>>
>> Thanks
>> Mathias
>>
> 
> root@secondary-ws ~# iperf3 -c primary-ws.local -t 5 -p 5000 -P 1
> Connecting to host primary-ws.local, port 5000
> [  5] local 192.168.1.130 port 49152 connected to 192.168.1.96 port 5000
> [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> [  5]   0.00-1.00   sec  70.9 MBytes   594 Mbits/sec    0    376 KBytes
> [  5]   1.00-2.00   sec  72.4 MBytes   607 Mbits/sec    0    431 KBytes
> [  5]   2.00-3.00   sec  73.1 MBytes   613 Mbits/sec    0    479 KBytes
> [  5]   3.00-4.00   sec  72.4 MBytes   607 Mbits/sec    0    501 KBytes
> [  5]   4.00-5.00   sec  73.2 MBytes   614 Mbits/sec    0    501 KBytes
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bitrate         Retr
> [  5]   0.00-5.00   sec   362 MBytes   607 Mbits/sec    0             sender
> [  5]   0.00-5.00   sec   360 MBytes   603 Mbits/sec                  receiver
> 
> iperf Done.
> root@secondary-ws ~# echo 800000 > /proc/irq/87/smp_affinity
> root@secondary-ws ~# iperf3 -c primary-ws.local -t 5 -p 5000 -P 1
> Connecting to host primary-ws.local, port 5000
> [  5] local 192.168.1.130 port 37620 connected to 192.168.1.96 port 5000
> [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> [  5]   0.00-1.00   sec   111 MBytes   934 Mbits/sec    0    621 KBytes
> [  5]   1.00-2.00   sec   109 MBytes   913 Mbits/sec    0    621 KBytes
> [  5]   2.00-3.00   sec   110 MBytes   920 Mbits/sec    0    621 KBytes
> [  5]   3.00-4.00   sec   110 MBytes   924 Mbits/sec    0    621 KBytes
> [  5]   4.00-5.00   sec   109 MBytes   917 Mbits/sec    0    621 KBytes
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bitrate         Retr
> [  5]   0.00-5.00   sec   549 MBytes   921 Mbits/sec    0             sender
> [  5]   0.00-5.00   sec   547 MBytes   916 Mbits/sec                  receiver
> 
> iperf Done.
> 
> Very interesting, is CPU0 slower than CPU23 by 30%?
> 

My guess is that CPU0 spends more time with interrupts disabled than other CPUs,
either because it's handling interrupts from some other hardware or running
code that disables interrupts (for example kernel code inside spin_lock_irq),
and is thus not able to handle network adapter interrupts at the same rate as CPU23.
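
One rough way to check the first half of that guess is to total the
per-CPU columns of /proc/interrupts and see whether CPU0 stands out. The
sketch below is only an illustration: per_cpu_totals is a made-up helper
name, and the sample text is invented data in the usual /proc/interrupts
layout, not output from Mikhail's system. It counts delivered interrupts
only, so it says nothing about time spent with interrupts disabled.

```python
def per_cpu_totals(interrupts_text):
    """Sum the per-CPU counters of /proc/interrupts-style text.

    Returns a dict such as {"CPU0": 510, "CPU1": 3}.
    """
    lines = interrupts_text.strip().splitlines()
    cpus = lines[0].split()              # header row: CPU0 CPU1 ...
    totals = [0] * len(cpus)
    for line in lines[1:]:
        # Drop the "87:" style label; split only on the first colon because
        # PCI addresses later in the line contain colons too.
        _, _, rest = line.partition(":")
        for i, field in enumerate(rest.split()[:len(cpus)]):
            if not field.isdigit():      # reached the chip/handler columns
                break
            totals[i] += int(field)
    return dict(zip(cpus, totals))


# Invented sample, for illustration only.
sample = """
           CPU0       CPU1
  1:        10          0   IR-IO-APIC    1-edge      i8042
 87:       500          3   IR-PCI-MSIX-0000:0e:00.0    0-edge      enp14s0
ERR:         0
"""
print(per_cpu_totals(sample))

# On a live system one would pass the real file contents instead:
#   per_cpu_totals(open("/proc/interrupts").read())
```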

Thanks
Mathias


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: This is the fourth time I’ve tried to find what led to the regression of outgoing network speed and each time I find the merge commit 8c94ccc7cd691472461448f98e2372c75849406c
  2024-02-08  9:25               ` Mathias Nyman
@ 2024-02-08 10:32                 ` Mikhail Gavrilov
  2024-02-08 15:43                   ` Mathias Nyman
  0 siblings, 1 reply; 31+ messages in thread
From: Mikhail Gavrilov @ 2024-02-08 10:32 UTC (permalink / raw)
  To: Mathias Nyman
  Cc: Christian A. Ehrhardt, niklas.neronin, Linux List Kernel Mailing,
	Greg KH, linux-usb

On Thu, Feb 8, 2024 at 2:23 PM Mathias Nyman
<mathias.nyman@linux.intel.com> wrote:
>
> My guess is that CPU0 spends more time with interrupts disabled than other CPUs.
> Either because it's handling interrupts from some other hardware, or running
> code that disables interrupts (for example kernel code inside spin_lock_irq),
> and thus not able to handle network adapter interrupts at the same rate as CPU23
>

Can this be fixed?
Can I help you here with anything else?

-- 
Best Regards,
Mike Gavrilov.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: This is the fourth time I’ve tried to find what led to the regression of outgoing network speed and each time I find the merge commit 8c94ccc7cd691472461448f98e2372c75849406c
  2024-02-08 10:32                 ` Mikhail Gavrilov
@ 2024-02-08 15:43                   ` Mathias Nyman
  2024-02-16  6:15                     ` This is the fourth time I’ve " Linux regression tracking (Thorsten Leemhuis)
  2024-02-19  9:41                     ` This is the fourth time I’ve " Mikhail Gavrilov
  0 siblings, 2 replies; 31+ messages in thread
From: Mathias Nyman @ 2024-02-08 15:43 UTC (permalink / raw)
  To: Mikhail Gavrilov
  Cc: Christian A. Ehrhardt, niklas.neronin, Linux List Kernel Mailing,
	Greg KH, linux-usb

On 8.2.2024 12.32, Mikhail Gavrilov wrote:
> On Thu, Feb 8, 2024 at 2:23 PM Mathias Nyman
> <mathias.nyman@linux.intel.com> wrote:
>>
>> My guess is that CPU0 spends more time with interrupts disabled than other CPUs.
>> Either because it's handling interrupts from some other hardware, or running
>> code that disables interrupts (for example kernel code inside spin_lock_irq),
>> and thus not able to handle network adapter interrupts at the same rate as CPU23
>>
> 
> Can this be fixed?

Not sure, I'm not that familiar with this area.
Maybe running irqbalance could help?

Thanks
Mathias

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: This is the fourth time I’ve tried to find what led to the regression of outgoing network speed and each time I find the merge commit 8c94ccc7cd691472461448f98e2372c75849406c
  2024-02-08 15:43                   ` Mathias Nyman
@ 2024-02-16  6:15                     ` Linux regression tracking (Thorsten Leemhuis)
  2024-02-19  9:41                     ` This is the fourth time I’ve " Mikhail Gavrilov
  1 sibling, 0 replies; 31+ messages in thread
From: Linux regression tracking (Thorsten Leemhuis) @ 2024-02-16  6:15 UTC (permalink / raw)
  To: Mathias Nyman, Mikhail Gavrilov
  Cc: Christian A. Ehrhardt, niklas.neronin, Linux List Kernel Mailing,
	Greg KH, linux-usb, Linux kernel regressions list

[CCing the regression list, as it should be in the loop for regressions:
https://docs.kernel.org/admin-guide/reporting-regressions.html]

On 08.02.24 16:43, Mathias Nyman wrote:
> On 8.2.2024 12.32, Mikhail Gavrilov wrote:
>> On Thu, Feb 8, 2024 at 2:23 PM Mathias Nyman
>> <mathias.nyman@linux.intel.com> wrote:
>>>
>>> My guess is that CPU0 spends more time with interrupts disabled than
>>> other CPUs.
>>> Either because it's handling interrupts from some other hardware, or
>>> running
>>> code that disables interrupts (for example kernel code inside
>>> spin_lock_irq),
>>> and thus not able to handle network adapter interrupts at the same
>>> rate as CPU23
>>
>> Can this be fixed?
> 
> Not sure, I'm not that familiar with this area.
> Maybe running irqbalance could help?

Mikhail, what's the status of this? I wonder if I should track this as a
regression to ensure Linus is aware of this.

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: This is the fourth time I’ve tried to find what led to the regression of outgoing network speed and each time I find the merge commit 8c94ccc7cd691472461448f98e2372c75849406c
  2024-02-08 15:43                   ` Mathias Nyman
  2024-02-16  6:15                     ` This is the fourth time I’ve " Linux regression tracking (Thorsten Leemhuis)
@ 2024-02-19  9:41                     ` Mikhail Gavrilov
  2024-02-20 23:19                       ` Mikhail Gavrilov
  1 sibling, 1 reply; 31+ messages in thread
From: Mikhail Gavrilov @ 2024-02-19  9:41 UTC (permalink / raw)
  To: Mathias Nyman
  Cc: Christian A. Ehrhardt, niklas.neronin, Linux List Kernel Mailing,
	Greg KH, linux-usb

[-- Attachment #1: Type: text/plain, Size: 843 bytes --]

On Thu, Feb 8, 2024 at 8:42 PM Mathias Nyman
<mathias.nyman@linux.intel.com> wrote:
>
> On 8.2.2024 12.32, Mikhail Gavrilov wrote:
> > On Thu, Feb 8, 2024 at 2:23 PM Mathias Nyman
> > <mathias.nyman@linux.intel.com> wrote:
> >>
> >> My guess is that CPU0 spends more time with interrupts disabled than other CPUs.
> >> Either because it's handling interrupts from some other hardware, or running
> >> code that disables interrupts (for example kernel code inside spin_lock_irq),
> >> and thus not able to handle network adapter interrupts at the same rate as CPU23
> >>
> >
> > Can this be fixed?
>
> Not sure, I'm not that familiar with this area.
> Maybe running irqbalance could help?

I installed the irqbalance daemon and nothing changed.
So who is responsible for irq balancing?

-- 
Best Regards,
Mike Gavrilov.

[-- Attachment #2: measuaments-irqbalance.zip --]
[-- Type: application/zip, Size: 1214 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: This is the fourth time I’ve tried to find what led to the regression of outgoing network speed and each time I find the merge commit 8c94ccc7cd691472461448f98e2372c75849406c
  2024-02-19  9:41                     ` This is the fourth time I’ve " Mikhail Gavrilov
@ 2024-02-20 23:19                       ` Mikhail Gavrilov
  2024-02-20 23:41                         ` Randy Dunlap
  0 siblings, 1 reply; 31+ messages in thread
From: Mikhail Gavrilov @ 2024-02-20 23:19 UTC (permalink / raw)
  To: Mathias Nyman
  Cc: Christian A. Ehrhardt, niklas.neronin, Linux List Kernel Mailing,
	Greg KH, linux-usb, linux-x86_64, netdev

On Mon, Feb 19, 2024 at 2:41 PM Mikhail Gavrilov
<mikhail.v.gavrilov@gmail.com> wrote:
>
> I installed irqbalance daemon and nothing changed.
> So who is responsible for irq balancing?

Sorry for the noise. Can anyone give me an answer?
Who is responsible for distributing interrupts in Linux?
I spotted a network performance regression, and it turned out that this was
due to the network card getting a different interrupt. It is a side effect
of commit 57e153dfd0e7a080373fe5853c5609443d97fa5a.
Installing the irqbalance daemon did not help. Has anyone else experienced
such a problem?

-- 
Best Regards,
Mike Gavrilov.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: This is the fourth time I’ve tried to find what led to the regression of outgoing network speed and each time I find the merge commit 8c94ccc7cd691472461448f98e2372c75849406c
  2024-02-20 23:19                       ` Mikhail Gavrilov
@ 2024-02-20 23:41                         ` Randy Dunlap
  2024-02-20 23:43                           ` Randy Dunlap
  0 siblings, 1 reply; 31+ messages in thread
From: Randy Dunlap @ 2024-02-20 23:41 UTC (permalink / raw)
  To: Mikhail Gavrilov, Mathias Nyman
  Cc: Christian A. Ehrhardt, niklas.neronin, Linux List Kernel Mailing,
	Greg KH, linux-usb, linux-x86_64, netdev

[+ tglx]

On 2/20/24 15:19, Mikhail Gavrilov wrote:
> On Mon, Feb 19, 2024 at 2:41 PM Mikhail Gavrilov
> <mikhail.v.gavrilov@gmail.com> wrote:
>>
>> I installed irqbalance daemon and nothing changed.
>> So who is responsible for irq balancing?
> 
> Sorry for the noise. Can anyone give me an answer?
> Who is responsible for distributing interrupts in Linux?
> I spotted network performance regression and it turned out, this was
> due to the network card getting other interrupt. It is a side effect
> of commit 57e153dfd0e7a080373fe5853c5609443d97fa5a.

That's a merge commit (AFAIK, maybe not so much). The commit in mainline is:

commit f977f4c9301c
Author: Niklas Neronin <niklas.neronin@linux.intel.com>
Date:   Fri Dec 1 17:06:40 2023 +0200

    xhci: add handler for only one interrupt line

> Installing irqbalance daemon did not help. Maybe someone experienced
> such a problem?
> 

Thomas, would you look at this, please?

A network device and xhci (USB) driver are now sharing interrupts.
This causes a large performance decrease for the networking device.

The thread begins here:
https://lore.kernel.org/lkml/CABXGCsNnUfCCYVSb_-j-a-cAdONu1r6Fe8p2OtQ5op_wskOfpw@mail.gmail.com/


motherboard:
"My motherboard is MPG-B650I-EDGE-WIFI looks like it is related to the
mentioned commit.
https://www.msi.com/Motherboard/MPG-B650I-EDGE-WIFI"

network device:
Network: RTL8125 2.5GbE Controller (rev 05)


thanks.
-- 
#Randy

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: This is the fourth time I’ve tried to find what led to the regression of outgoing network speed and each time I find the merge commit 8c94ccc7cd691472461448f98e2372c75849406c
  2024-02-20 23:41                         ` Randy Dunlap
@ 2024-02-20 23:43                           ` Randy Dunlap
  2024-02-21 13:44                             ` Mathias Nyman
  0 siblings, 1 reply; 31+ messages in thread
From: Randy Dunlap @ 2024-02-20 23:43 UTC (permalink / raw)
  To: Mikhail Gavrilov, Mathias Nyman, Thomas Gleixner
  Cc: Christian A. Ehrhardt, niklas.neronin, Linux List Kernel Mailing,
	Greg KH, linux-usb, linux-x86_64, netdev



On 2/20/24 15:41, Randy Dunlap wrote:
> [+ tglx]

(this time for real)

> 
> On 2/20/24 15:19, Mikhail Gavrilov wrote:
>> On Mon, Feb 19, 2024 at 2:41 PM Mikhail Gavrilov
>> <mikhail.v.gavrilov@gmail.com> wrote:
>>>
>>> I installed irqbalance daemon and nothing changed.
>>> So who is responsible for irq balancing?
>>
>> Sorry for the noise. Can anyone give me an answer?
>> Who is responsible for distributing interrupts in Linux?
>> I spotted network performance regression and it turned out, this was
>> due to the network card getting other interrupt. It is a side effect
>> of commit 57e153dfd0e7a080373fe5853c5609443d97fa5a.
> 
> That's a merge commit (AFAIK, maybe not so much). The commit in mainline is:
> 
> commit f977f4c9301c
> Author: Niklas Neronin <niklas.neronin@linux.intel.com>
> Date:   Fri Dec 1 17:06:40 2023 +0200
> 
>     xhci: add handler for only one interrupt line
> 
>> Installing irqbalance daemon did not help. Maybe someone experienced
>> such a problem?
>>
> 
> Thomas, would you look at this, please?
> 
> A network device and xhci (USB) driver are now sharing interrupts.
> This causes a large performance decrease for the networking device.
> 
> The thread begins here:
> https://lore.kernel.org/lkml/CABXGCsNnUfCCYVSb_-j-a-cAdONu1r6Fe8p2OtQ5op_wskOfpw@mail.gmail.com/
> 
> 
> motherboard:
> "My motherboard is MPG-B650I-EDGE-WIFI looks like it is related to the
> mentioned commit.
> https://www.msi.com/Motherboard/MPG-B650I-EDGE-WIFI"
> 
> network device:
> Network: RTL8125 2.5GbE Controller (rev 05)
> 
> 
> thanks.

-- 
#Randy

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: This is the fourth time I’ve tried to find what led to the regression of outgoing network speed and each time I find the merge commit 8c94ccc7cd691472461448f98e2372c75849406c
  2024-02-03  1:02 This is the fourth time I’ve tried to find what led to the regression of outgoing network speed and each time I find the merge commit 8c94ccc7cd691472461448f98e2372c75849406c Mikhail Gavrilov
                   ` (2 preceding siblings ...)
  2024-02-03 18:20 ` Christian A. Ehrhardt
@ 2024-02-21  6:48 ` Linux regression tracking #adding (Thorsten Leemhuis)
  3 siblings, 0 replies; 31+ messages in thread
From: Linux regression tracking #adding (Thorsten Leemhuis) @ 2024-02-21  6:48 UTC (permalink / raw)
  To: Linux List Kernel Mailing, linux-netdev, Linux kernel regressions list

On 03.02.24 02:02, Mikhail Gavrilov wrote:
> Hi,
> I'm trying to find the first bad commit that led to a decreased
> network outgoing speed.
> And every time I come to a huge merge [Merge tag 'usb-6.8-rc1' of
> git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb]
> I have already triple-checked all my answers and speed measurements.
> I don't understand where I'm making a mistake.
>

To be sure the issue doesn't fall through the cracks unnoticed, I'm
adding it to regzbot, the Linux kernel regression tracking bot:

#regzbot ^introduced f977f4c9301c
#regzbot title irq/net/usb: performance decrease now that network device
and xhci share IRQs
#regzbot ignore-activity

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
That page also explains what to do if mails like this annoy you.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: This is the fourth time I’ve tried to find what led to the regression of outgoing network speed and each time I find the merge commit 8c94ccc7cd691472461448f98e2372c75849406c
  2024-02-20 23:43                           ` Randy Dunlap
@ 2024-02-21 13:44                             ` Mathias Nyman
  2024-02-26  5:45                               ` This is the fourth time I've " Linux regression tracking (Thorsten Leemhuis)
  0 siblings, 1 reply; 31+ messages in thread
From: Mathias Nyman @ 2024-02-21 13:44 UTC (permalink / raw)
  To: Randy Dunlap, Mikhail Gavrilov, Thomas Gleixner
  Cc: Christian A. Ehrhardt, niklas.neronin, Linux List Kernel Mailing,
	Greg KH, linux-usb, linux-x86_64, netdev

On 21.2.2024 1.43, Randy Dunlap wrote:
> 
> 
> On 2/20/24 15:41, Randy Dunlap wrote:
>> [+ tglx]
> 
> (this time for real)
> 
>>
>> On 2/20/24 15:19, Mikhail Gavrilov wrote:
>>> On Mon, Feb 19, 2024 at 2:41 PM Mikhail Gavrilov
>>> <mikhail.v.gavrilov@gmail.com> wrote:
>>>>
>>>> I installed irqbalance daemon and nothing changed.
>>>> So who is responsible for irq balancing?
>>>
>>> Sorry for the noise. Can anyone give me an answer?
>>> Who is responsible for distributing interrupts in Linux?
>>> I spotted network performance regression and it turned out, this was
>>> due to the network card getting other interrupt. It is a side effect
>>> of commit 57e153dfd0e7a080373fe5853c5609443d97fa5a.
>>
>> That's a merge commit (AFAIK, maybe not so much). The commit in mainline is:
>>
>> commit f977f4c9301c
>> Author: Niklas Neronin <niklas.neronin@linux.intel.com>
>> Date:   Fri Dec 1 17:06:40 2023 +0200
>>
>>      xhci: add handler for only one interrupt line
>>
>>> Installing irqbalance daemon did not help. Maybe someone experienced
>>> such a problem?
>>>
>>
>> Thomas, would you look at this, please?
>>
>> A network device and xhci (USB) driver are now sharing interrupts.
>> This causes a large performance decrease for the networking device.

Short recap:

xhci (USB) and the network device didn't share interrupts, or even interrupt
the same CPU, in either the good or the bad case.

A change in how many interrupts the xhci driver requests changed which CPU
the network device interrupts.

In the bad case Mikhail Gavrilov's network device was interrupting CPU0
together with:
- IR-IO-APIC    2-edge      timer
- IR-PCI-MSIX-0000:07:00.0    1-edge      nvme1q1

In the good case the network device was interrupting CPU27 together with:
- IR-PCI-MSIX-0000:04:00.0   27-edge      nvme0q27
- IR-PCI-MSIX-0000:07:00.0   28-edge      nvme1q28

Manually moving the network device's IRQ 87 from CPU0 to CPU23 helped.
(echo 800000 > /proc/irq/87/smp_affinity)
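
The value written to smp_affinity is a hexadecimal CPU bitmask in which bit N
selects CPU N, so CPU23 corresponds to 1 << 23 = 0x800000. A small sketch of
the conversion in both directions; the function names are illustrative, and
the comma-separated 32-bit grouping for wider masks is my understanding of
the format the kernel uses on machines with more than 32 CPUs:

```python
def cpus_to_mask(cpus):
    """Build an smp_affinity-style hex mask from a list of CPU numbers."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu                 # bit N selects CPU N
    digits = format(mask, "x")
    if len(digits) <= 8:
        return digits
    # Wider masks: pad to a multiple of 8 hex digits and split into
    # 32-bit groups, most significant group first.
    digits = digits.zfill(-(-len(digits) // 8) * 8)
    return ",".join(digits[i:i + 8] for i in range(0, len(digits), 8))


def mask_to_cpus(mask_text):
    """List the CPU numbers selected by an smp_affinity-style hex mask."""
    mask = int(mask_text.strip().replace(",", ""), 16)
    return [cpu for cpu in range(mask.bit_length()) if (mask >> cpu) & 1]


print(cpus_to_mask([23]))        # the mask used in the echo command above
print(mask_to_cpus("800000"))    # which CPUs that mask selects
```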

Thanks
-Mathias


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: This is the fourth time I've tried to find what led to the regression of outgoing network speed and each time I find the merge commit 8c94ccc7cd691472461448f98e2372c75849406c
  2024-02-21 13:44                             ` Mathias Nyman
@ 2024-02-26  5:45                               ` Linux regression tracking (Thorsten Leemhuis)
  2024-02-26  9:24                                 ` Mathias Nyman
  0 siblings, 1 reply; 31+ messages in thread
From: Linux regression tracking (Thorsten Leemhuis) @ 2024-02-26  5:45 UTC (permalink / raw)
  To: Mathias Nyman, Thomas Gleixner
  Cc: Christian A. Ehrhardt, niklas.neronin, Linux List Kernel Mailing,
	Greg KH, linux-usb, linux-x86_64, netdev,
	Linux kernel regressions list, Randy Dunlap, Mikhail Gavrilov

On 21.02.24 14:44, Mathias Nyman wrote:
> On 21.2.2024 1.43, Randy Dunlap wrote:
>> On 2/20/24 15:41, Randy Dunlap wrote:
>>> [+ tglx]
>>> On 2/20/24 15:19, Mikhail Gavrilov wrote:
>>>> On Mon, Feb 19, 2024 at 2:41 PM Mikhail Gavrilov
>>>> <mikhail.v.gavrilov@gmail.com> wrote:
>>>> I spotted network performance regression and it turned out, this was
>>>> due to the network card getting other interrupt. It is a side effect
>>>> of commit 57e153dfd0e7a080373fe5853c5609443d97fa5a.
>>> That's a merge commit (AFAIK, maybe not so much). The commit in
>>> mainline is:
>>>
>>> commit f977f4c9301c
>>> Author: Niklas Neronin <niklas.neronin@linux.intel.com>
>>> Date:   Fri Dec 1 17:06:40 2023 +0200
>>>
>>>      xhci: add handler for only one interrupt line
>>>
>>>> Installing irqbalance daemon did not help. Maybe someone experienced
>>>> such a problem?
>>>
>>> Thomas, would you look at this, please?
>>>
>>> A network device and xhci (USB) driver are now sharing interrupts.
>>> This causes a large performance decrease for the networking device.
> 
> Short recap:

Thx for that. As the 6.8 release is merely two or three weeks away while
a fix is nowhere near in sight yet (afaics!) I start to wonder if we
should consider a revert here and try reapplying the culprit in a later
cycle when this problem is fixed.

Mathias, would that be an option? Or is there still hope that we see a
fix for this regression before the release of 6.8?

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.

#regzbot poke

> xhci (USB) and network device didn't share interrupts, or even interrupt
> the
> same CPU in either good or bad case.
> 
> A change in how many interrupts xhci driver requests changed which CPU
> the network device interrupts.
> 
> In the bad case Mikhail Gavrilov's network device was interrupting CPU0
> together with:
> - IR-IO-APIC    2-edge      timer
> - IR-PCI-MSIX-0000:07:00.0    1-edge      nvme1q1
> 
> In the good case network device was interrupting CPU27 together with:
> - IR-PCI-MSIX-0000:04:00.0   27-edge      nvme0q27
> - IR-PCI-MSIX-0000:07:00.0   28-edge      nvme1q28
> 
> Manually moving network device irq 87 from CPU0 to CPU23 helped.
> (echo 800000 > /proc/irq/87/smp_affinity)
> 
> Thanks
> -Mathias
> 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: This is the fourth time I've tried to find what led to the regression of outgoing network speed and each time I find the merge commit 8c94ccc7cd691472461448f98e2372c75849406c
  2024-02-26  5:45                               ` This is the fourth time I've " Linux regression tracking (Thorsten Leemhuis)
@ 2024-02-26  9:24                                 ` Mathias Nyman
  2024-02-26  9:51                                   ` Linux regression tracking (Thorsten Leemhuis)
  0 siblings, 1 reply; 31+ messages in thread
From: Mathias Nyman @ 2024-02-26  9:24 UTC (permalink / raw)
  To: Linux regressions mailing list, Thomas Gleixner
  Cc: Christian A. Ehrhardt, niklas.neronin, Linux List Kernel Mailing,
	Greg KH, linux-usb, linux-x86_64, netdev, Randy Dunlap,
	Mikhail Gavrilov

On 26.2.2024 7.45, Linux regression tracking (Thorsten Leemhuis) wrote:
> On 21.02.24 14:44, Mathias Nyman wrote:
>> On 21.2.2024 1.43, Randy Dunlap wrote:
>>> On 2/20/24 15:41, Randy Dunlap wrote:
>>>> [+ tglx]
>>>> On 2/20/24 15:19, Mikhail Gavrilov wrote:
>>>>> On Mon, Feb 19, 2024 at 2:41 PM Mikhail Gavrilov
>>>>> <mikhail.v.gavrilov@gmail.com> wrote:
>>>>> I spotted network performance regression and it turned out, this was
>>>>> due to the network card getting other interrupt. It is a side effect
>>>>> of commit 57e153dfd0e7a080373fe5853c5609443d97fa5a.
>>>> That's a merge commit (AFAIK, maybe not so much). The commit in
>>>> mainline is:
>>>>
>>>> commit f977f4c9301c
>>>> Author: Niklas Neronin <niklas.neronin@linux.intel.com>
>>>> Date:   Fri Dec 1 17:06:40 2023 +0200
>>>>
>>>>       xhci: add handler for only one interrupt line
>>>>
>>>>> Installing irqbalance daemon did not help. Maybe someone experienced
>>>>> such a problem?
>>>>
>>>> Thomas, would you look at this, please?
>>>>
>>>> A network device and xhci (USB) driver are now sharing interrupts.
>>>> This causes a large performance decrease for the networking device.
>>
>> Short recap:
> 
> Thx for that. As the 6.8 release is merely two or three weeks away while
> a fix is nowhere near in sight yet (afaics!) I start to wonder if we
> should consider a revert here and try reapplying the culprit in a later
> cycle when this problem is fixed.

I don't think reverting this series is a solution.

This isn't really about those usb xhci patches.
This is about which interrupt gets assigned to which CPU.

Mikhail got unlucky when the network adapter interrupts on that system were
assigned to CPU0, clearly a more "clogged" CPU, thus causing a drop in max
bandwidth.

Thanks
Mathias

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: This is the fourth time I've tried to find what led to the regression of outgoing network speed and each time I find the merge commit 8c94ccc7cd691472461448f98e2372c75849406c
  2024-02-26  9:24                                 ` Mathias Nyman
@ 2024-02-26  9:51                                   ` Linux regression tracking (Thorsten Leemhuis)
  2024-02-26 10:54                                     ` Mathias Nyman
  2024-03-04 14:10                                     ` Linux regression tracking (Thorsten Leemhuis)
  0 siblings, 2 replies; 31+ messages in thread
From: Linux regression tracking (Thorsten Leemhuis) @ 2024-02-26  9:51 UTC (permalink / raw)
  To: Mathias Nyman, Linux regressions mailing list, Thomas Gleixner
  Cc: Christian A. Ehrhardt, niklas.neronin, Linux List Kernel Mailing,
	Greg KH, linux-usb, linux-x86_64, netdev, Randy Dunlap,
	Mikhail Gavrilov

On 26.02.24 10:24, Mathias Nyman wrote:
> On 26.2.2024 7.45, Linux regression tracking (Thorsten Leemhuis) wrote:
>> On 21.02.24 14:44, Mathias Nyman wrote:
>>> On 21.2.2024 1.43, Randy Dunlap wrote:
>>>> On 2/20/24 15:41, Randy Dunlap wrote:
>>>>> [+ tglx]
>>>>> On 2/20/24 15:19, Mikhail Gavrilov wrote:
>>>>>> On Mon, Feb 19, 2024 at 2:41 PM Mikhail Gavrilov
>>>>>> <mikhail.v.gavrilov@gmail.com> wrote:
>>>>>> I spotted network performance regression and it turned out, this was
>>>>>> due to the network card getting other interrupt. It is a side effect
>>>>>> of commit 57e153dfd0e7a080373fe5853c5609443d97fa5a.
>>>>> That's a merge commit (AFAIK, maybe not so much). The commit in
>>>>> mainline is:
>>>>>
>>>>> commit f977f4c9301c
>>>>> Author: Niklas Neronin <niklas.neronin@linux.intel.com>
>>>>> Date:   Fri Dec 1 17:06:40 2023 +0200
>>>>>
>>>>>       xhci: add handler for only one interrupt line
>>>>>
>>>>>> Installing irqbalance daemon did not help. Maybe someone experienced
>>>>>> such a problem?
>>>>>
>>>>> Thomas, would you look at this, please?
>>>>>
>>>>> A network device and xhci (USB) driver are now sharing interrupts.
>>>>> This causes a large performance decrease for the networking device.
>>>
>>> Short recap:
>>
>> Thx for that. As the 6.8 release is merely two or three weeks away while
>> a fix is nowhere near in sight yet (afaics!) I start to wonder if we
>> should consider a revert here and try reapplying the culprit in a later
>> cycle when this problem is fixed.

Thx for the reply.

> I don't think reverting this series is a solution.
> 
> This isn't really about those usb xhci patches.
> This is about which interrupt gets assigned to which CPU.

I know, but from my understanding of Linus' expectations wrt handling
regressions it does not matter much if a bug existed earlier or
somewhere else: what counts is the commit that exposed the problem.

But I might be wrong here. Anyway, not CCing Linus for this; but I'll
likely point him to this direction on Sunday in my next weekly report,
unless some fix comes into sight.

> Mikhail got unlucky when the network adapter interrupts on that system were
> assigned to CPU0, clearly a more "clogged" CPU, thus causing a drop in max
> bandwidth.

But maybe others will be just as "unlucky". Or is there any reason to
believe otherwise? Maybe some aspect of the .config or local setup that
is most likely unique to Mikhail's setup?

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: This is the fourth time I've tried to find what led to the regression of outgoing network speed and each time I find the merge commit 8c94ccc7cd691472461448f98e2372c75849406c
  2024-02-26  9:51                                   ` Linux regression tracking (Thorsten Leemhuis)
@ 2024-02-26 10:54                                     ` Mathias Nyman
  2024-02-26 18:09                                       ` Thomas Gleixner
  2024-03-04 14:10                                     ` Linux regression tracking (Thorsten Leemhuis)
  1 sibling, 1 reply; 31+ messages in thread
From: Mathias Nyman @ 2024-02-26 10:54 UTC (permalink / raw)
  To: Linux regressions mailing list, Thomas Gleixner
  Cc: Christian A. Ehrhardt, niklas.neronin, Linux List Kernel Mailing,
	Greg KH, linux-usb, linux-x86_64, netdev, Randy Dunlap,
	Mikhail Gavrilov

On 26.2.2024 11.51, Linux regression tracking (Thorsten Leemhuis) wrote:
> On 26.02.24 10:24, Mathias Nyman wrote:
>> On 26.2.2024 7.45, Linux regression tracking (Thorsten Leemhuis) wrote:
>>> On 21.02.24 14:44, Mathias Nyman wrote:
>>>> On 21.2.2024 1.43, Randy Dunlap wrote:
>>>>> On 2/20/24 15:41, Randy Dunlap wrote:
>>>>>> [+ tglx]
>>>>>> On 2/20/24 15:19, Mikhail Gavrilov wrote:
>>>>>>> On Mon, Feb 19, 2024 at 2:41 PM Mikhail Gavrilov
>>>>>>> <mikhail.v.gavrilov@gmail.com> wrote:
>>>>>>> I spotted network performance regression and it turned out, this was
>>>>>>> due to the network card getting other interrupt. It is a side effect
>>>>>>> of commit 57e153dfd0e7a080373fe5853c5609443d97fa5a.
>>>>>> That's a merge commit (AFAIK, maybe not so much). The commit in
>>>>>> mainline is:
>>>>>>
>>>>>> commit f977f4c9301c
>>>>>> Author: Niklas Neronin <niklas.neronin@linux.intel.com>
>>>>>> Date:   Fri Dec 1 17:06:40 2023 +0200
>>>>>>
>>>>>>        xhci: add handler for only one interrupt line
>>>>>>
>>>>>>> Installing irqbalance daemon did not help. Maybe someone experienced
>>>>>>> such a problem?
>>>>>>
>>>>>> Thomas, would you look at this, please?
>>>>>>
>>>>>> A network device and xhci (USB) driver are now sharing interrupts.
>>>>>> This causes a large performance decrease for the networking device.
>>>>
>>>> Short recap:
>>>
>>> Thx for that. As the 6.8 release is merely two or three weeks away while
>>> a fix is nowhere near in sight yet (afaics!) I start to wonder if we
>>> should consider a revert here and try reapplying the culprit in a later
>>> cycle when this problem is fixed.
> 
> Thx for the reply.
> 
>> I don't think reverting this series is a solution.
>>
>> This isn't really about those usb xhci patches.
>> This is about which interrupt gets assigned to which CPU.
> 
> I know, but from my understanding of Linus' expectations wrt handling
> regressions it does not matter much if a bug existed earlier or
> somewhere else: what counts is the commit that exposed the problem.
> 
> But I might be wrong here. Anyway, not CCing Linus for this; but I'll
> likely point him to this direction on Sunday in my next weekly report,
> unless some fix comes into sight.
> 
>> Mikhail got unlucky when the network adapter interrupts on that system were
>> assigned to CPU0, clearly a more "clogged" CPU, thus causing a drop in max
>> bandwidth.
> 
> But maybe others will be just as "unlucky". Or is there anything to
> believe otherwise? Maybe some aspect of the .config or local setup that
> is most likely unique to Mikhail's setup?

I believe this is a zero-sum case.

Others got equally lucky due to this change.
Their devices end up interrupting less clogged CPUs and see a similar
performance increase.
  
Thanks
Mathias


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: This is the fourth time I've tried to find what led to the regression of outgoing network speed and each time I find the merge commit 8c94ccc7cd691472461448f98e2372c75849406c
  2024-02-26 10:54                                     ` Mathias Nyman
@ 2024-02-26 18:09                                       ` Thomas Gleixner
  2024-02-27 17:08                                         ` mikhail.v.gavrilov
  0 siblings, 1 reply; 31+ messages in thread
From: Thomas Gleixner @ 2024-02-26 18:09 UTC (permalink / raw)
  To: Mathias Nyman, Linux regressions mailing list
  Cc: Christian A. Ehrhardt, niklas.neronin, Linux List Kernel Mailing,
	Greg KH, linux-usb, linux-x86_64, netdev, Randy Dunlap,
	Mikhail Gavrilov

On Mon, Feb 26 2024 at 12:54, Mathias Nyman wrote:
> On 26.2.2024 11.51, Linux regression tracking (Thorsten Leemhuis) wrote:
>>> I don't think reverting this series is a solution.
>>>
>>> This isn't really about those usb xhci patches.
>>> This is about which interrupt gets assigned to which CPU.
>> 
>> I know, but from my understanding of Linus expectations wrt to handling
>> regressions it does not matter much if a bug existed earlier or
>> somewhere else: what counts is the commit that exposed the problem.
>> 
>> But I might be wrong here. Anyway, not CCing Linus for this; but I'll
>> likely point him to this direction on Sunday in my next weekly report,
>> unless some fix comes into sight.
>> 
>>> Mikhail got unlucky when the network adapter interrupts on that system was
>>> assigned to CPU0, clearly a more "clogged" CPU, thus causing a drop in max
>>> bandwidth.
>> 
>> But maybe others will be just as "unlucky". Or is there anything to
>> believe otherwise? Maybe some aspect of the .config or local setup that
>> is most likely unique to Mikhail's setup?
>
> I believe this is a zero-sum case.
>
> Others got equally lucky due to this change.
> Their devices end up interrupting less clogged CPUs and see a similar
> performance increase.

Reverting this does not make any sense.

The kernel assigns the initial interrupt affinities to the CPUs so that
the number of interrupts is halfway balanced. That spreading algorithm
is completely agnostic of the actual usage of the interrupts. Where
e.g. the network interrupt ends up depends on the probe/enumeration
order of devices. Add another PCI-E card into the machine and it will
again look different.

There is nothing the kernel can do about it and earlier attempts to do
interrupt frequency based balancing in the kernel ended up nowhere
simply because the kernel does not have enough information about the
overall requirements. That's why the kernel leaves the affinity
configuration for user space, e.g. irqbalance, except for true
multi-queue scenarios like NVME where the kernel binds queues and their
interrupts to specific CPUs or groups of CPUs.

Why ending up on CPU0 has this particular effect on Mikhail's machine is
unclear as we don't have any information about the overall workload,
other interrupt sources on CPU0 and their frequency. That'd need to be
investigated with instrumentation and might unearth some completely
different underlying reason causing this behavior.
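
A minimal sketch of how the "other interrupt sources on CPU0 and their
frequency" could be listed from /proc/interrupts. The helper name
irqs_on_cpu is hypothetical, and the parsing assumes the usual layout
(one header row, then an IRQ label followed by one counter per CPU):

```shell
#!/usr/bin/env bash
# irqs_on_cpu CPU [FILE]  (hypothetical helper, not from the thread)
# Prints "count irq last-description-word" for every interrupt that has
# fired on the given CPU, busiest first. FILE defaults to /proc/interrupts.
irqs_on_cpu() {
	local cpu=$1 file=${2:-/proc/interrupts}
	awk -v cpu="$cpu" '
		NR == 1 { next }          # header row: CPU0 CPU1 ...
		NF < cpu + 2 { next }     # short rows such as ERR: or MIS:
		{
			count = $(cpu + 2)    # field 1 is the IRQ label
			if (count ~ /^[0-9]+$/ && count > 0) {
				irq = $1
				sub(/:$/, "", irq)
				print count, irq, $NF
			}
		}
	' "$file" | sort -rn
}
```

Running `irqs_on_cpu 0` before and after a fixed-length iperf3 run and
comparing the counters would give a rough per-source interrupt rate for
CPU0. It is only a starting point and does not replace tracing.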

So I don't think this is a regression in the true sense of
regressions. It's an unfortunate coincidence and reverting the
identified commits would just paper over the real problem, if there is
actually one single source of trouble which causes the performance drop
only on CPU0.  The commits are definitely _not_ the root cause, they
happen to unearth some other issue, which might be as mundane as
e.g. that the NVME interrupt on CPU0 is competing with the network
interrupt. So don't shoot the messenger.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: This is the fourth time I've tried to find what led to the regression of outgoing network speed and each time I find the merge commit 8c94ccc7cd691472461448f98e2372c75849406c
  2024-02-26 18:09                                       ` Thomas Gleixner
@ 2024-02-27 17:08                                         ` mikhail.v.gavrilov
  2024-02-27 17:23                                           ` Thomas Gleixner
  0 siblings, 1 reply; 31+ messages in thread
From: mikhail.v.gavrilov @ 2024-02-27 17:08 UTC (permalink / raw)
  To: Thomas Gleixner, Mathias Nyman, Linux regressions mailing list
  Cc: Christian A. Ehrhardt, niklas.neronin, Linux List Kernel Mailing,
	Greg KH, linux-usb, x86, netdev, Randy Dunlap

[-- Attachment #1: Type: text/plain, Size: 1429 bytes --]

On Mon, 2024-02-26 at 19:09 +0100, Thomas Gleixner wrote:
> we don't have any information about the overall workload,

During measurements nothing was running except iperf3

> other interrupt sources on CPU0 and their frequency. That'd need to
> be investigated with instrumentation and might unearth some
> completely different underlying reason causing this behavior.

I made a simple bash script to benchmark enp14s0 on each CPU core.

#!/usr/bin/env bash
# Pin IRQ 84 (enp14s0 on this machine) to each CPU in turn and measure throughput.
for i in {0..31}
do
	# One-hot CPU mask in hex, e.g. i=4 -> 10
	smp_affinity=$(echo "obase=16; $((2 ** i))" | bc)
	echo "echo $smp_affinity > /proc/irq/84/smp_affinity"
	echo "$smp_affinity" > /proc/irq/84/smp_affinity
	echo 'iperf3 -c primary-ws.local -t 5 -p 5000 -P 1'
	iperf3 -c primary-ws.local -t 5 -p 5000 -P 1
done
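
The mask computation in the loop above can also be done without bc. The
following is only a sketch: cpu_to_mask is a hypothetical helper name,
and it assumes the kernel's comma-separated 32-bit group format for
smp_affinity so that CPUs above 31 are covered as well:

```shell
#!/usr/bin/env bash
# cpu_to_mask CPU  (hypothetical helper)
# Prints the hex affinity mask that selects exactly one CPU, using
# comma-separated 32-bit groups, most significant group first.
cpu_to_mask() {
	local cpu=$1 group bit mask
	group=$((cpu / 32))
	bit=$((cpu % 32))
	mask=$(printf '%x' "$((1 << bit))")
	# Append one all-zero group for every lower 32-CPU block.
	while [ "$group" -gt 0 ]; do
		mask="$mask,00000000"
		group=$((group - 1))
	done
	printf '%s\n' "$mask"
}
```

For example, `cpu_to_mask 23 > /proc/irq/84/smp_affinity` (as root) would
pin IRQ 84 to CPU23. Writing the plain CPU number to
/proc/irq/84/smp_affinity_list is an alternative that avoids the mask
arithmetic entirely.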

I attach the iperf3 results for kernels 6.7.0 and 6.8.0-rc6 here.
They confirm once again that CPU0 is a bad option in both cases,
while any other CPU, not necessarily 23, allows the network interface
to operate at the limit of the capabilities of the network cable.

I also attach /proc/interrupts. I hope this helps clear things up.

I don't know how else to help you or what other information to provide.

About the repeatability of my "unlucky" scenario:
I have two MSI MPG B650I EDGE WIFI motherboards and this problem
happened on both at the same time.

It seems the problem has always been there, we just never noticed it.

-- 
Best Regards,
Mike Gavrilov.

[-- Attachment #2: benchmarking-6.7.0-all-cores.zip --]
[-- Type: application/zip, Size: 2152 bytes --]

[-- Attachment #3: benchmarking-6.8.0-0.rc6-all-cores.zip --]
[-- Type: application/zip, Size: 2282 bytes --]

[-- Attachment #4: proc-interrupts.zip --]
[-- Type: application/zip, Size: 2882 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: This is the fourth time I've tried to find what led to the regression of outgoing network speed and each time I find the merge commit 8c94ccc7cd691472461448f98e2372c75849406c
  2024-02-27 17:08                                         ` mikhail.v.gavrilov
@ 2024-02-27 17:23                                           ` Thomas Gleixner
       [not found]                                             ` <960fd112b294a902e1bea1fdd8e04a708a05cf45.camel@gmail.com>
  0 siblings, 1 reply; 31+ messages in thread
From: Thomas Gleixner @ 2024-02-27 17:23 UTC (permalink / raw)
  To: mikhail.v.gavrilov, Mathias Nyman, Linux regressions mailing list
  Cc: Christian A. Ehrhardt, niklas.neronin, Linux List Kernel Mailing,
	Greg KH, linux-usb, x86, netdev, Randy Dunlap

On Tue, Feb 27 2024 at 22:08, mikhail.v.gavrilov@gmail.com wrote:
> On Mon, 2024-02-26 at 19:09 +0100, Thomas Gleixner wrote:
>> we don't have any information about the overall workload,
>
> During measurements nothing was running except iperf3

Ok.

> I don't know how else to help you. What information to provide.

If we want to understand why CPU0 is problematic, then you need to use
tracing to capture what's going on on CPU0 vs. other CPUs.
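
One possible way to do such a capture is the raw ftrace interface. This
is a sketch under assumptions, not a prescribed procedure: trace_cpu is
a hypothetical helper, the tracefs mount point is assumed to be
/sys/kernel/tracing (older setups use /sys/kernel/debug/tracing), and
only the irq event group is enabled:

```shell
#!/usr/bin/env bash
# Assumed tracefs location; override by setting TRACEFS beforehand.
TRACEFS=${TRACEFS:-/sys/kernel/tracing}

# trace_cpu CPU SECONDS OUTFILE  (hypothetical helper, needs root)
# Records hard-IRQ and softirq events on one CPU for SECONDS seconds.
trace_cpu() {
	local cpu=$1 seconds=$2 out=$3
	# Limit tracing to a single CPU (hex mask; CPUs 0..31 in this sketch).
	printf '%x\n' "$((1 << cpu))" > "$TRACEFS/tracing_cpumask"
	# Enable every event in the irq group (handler and softirq entry/exit).
	echo 1 > "$TRACEFS/events/irq/enable"
	echo > "$TRACEFS/trace"        # clear the ring buffer
	echo 1 > "$TRACEFS/tracing_on"
	sleep "$seconds"               # run iperf3 in another terminal meanwhile
	echo 0 > "$TRACEFS/tracing_on"
	cat "$TRACEFS/trace" > "$out"
	echo 0 > "$TRACEFS/events/irq/enable"
}
```

Calling `trace_cpu 0 5 cpu0.txt` with the NIC interrupt pinned to CPU0,
then `trace_cpu 23 5 cpu23.txt` with it pinned to CPU23, would give two
traces that can be compared side by side.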

> About repeatability my "unlucky" scenario.
> I have two MSI MPG B650I EDGE WIFI motherboards and this problem
> happened both at the same time.

Sure. The probe order and the number of interrupts are probably exactly
the same. As the spreading algorithm is very basic, it will result in
exactly the same setup for both.

> It seems the problem has always been there, we just never noticed it.

Exactly.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: This is the fourth time I've tried to find what led to the regression of outgoing network speed and each time I find the merge commit 8c94ccc7cd691472461448f98e2372c75849406c
       [not found]                                             ` <960fd112b294a902e1bea1fdd8e04a708a05cf45.camel@gmail.com>
@ 2024-02-29  9:41                                               ` Mikhail Gavrilov
  0 siblings, 0 replies; 31+ messages in thread
From: Mikhail Gavrilov @ 2024-02-29  9:41 UTC (permalink / raw)
  To: Thomas Gleixner, Mathias Nyman, Linux regressions mailing list
  Cc: Christian A. Ehrhardt, niklas.neronin, Linux List Kernel Mailing,
	Greg KH, linux-usb, x86, netdev, Randy Dunlap

On Tue, Feb 27, 2024 at 11:03 PM <mikhail.v.gavrilov@gmail.com> wrote:
>
> On Tue, 2024-02-27 at 18:23 +0100, Thomas Gleixner wrote:
> > If we want to understand why CPU0 is problematic, then you need to
> > use tracing to capture what's going on on CPU0 vs. other CPUs.
>
> I am not sure what kind of profiler software you prefer.
> I am familiar with sysprof and attach captures here for both cases, CPU0 vs
> CPU23. I hope this helps clear things up.
>

Sorry for the noise.
I am unsure whether you received my previous message with the captures,
so I uploaded them to the mega file exchange server and share the links below.
capture CPU0: https://mega.nz/file/Ik5XiZAS#Hra7Xtzplp8xcHYFj4JXnpp8T-0UA0nhNSIJJLEcSBk
capture CPU23: https://mega.nz/file/swg0CQ4C#PvGv_WXmtnATD7tNun5xz-lfA5GGqA-KOv1ZbVRJ_lI

-- 
Best Regards,
Mike Gavrilov.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: This is the fourth time I've tried to find what led to the regression of outgoing network speed and each time I find the merge commit 8c94ccc7cd691472461448f98e2372c75849406c
  2024-02-26  9:51                                   ` Linux regression tracking (Thorsten Leemhuis)
  2024-02-26 10:54                                     ` Mathias Nyman
@ 2024-03-04 14:10                                     ` Linux regression tracking (Thorsten Leemhuis)
  1 sibling, 0 replies; 31+ messages in thread
From: Linux regression tracking (Thorsten Leemhuis) @ 2024-03-04 14:10 UTC (permalink / raw)
  To: Mathias Nyman, Linux regressions mailing list, Thomas Gleixner
  Cc: Christian A. Ehrhardt, niklas.neronin, Linux List Kernel Mailing,
	Greg KH, linux-usb, linux-x86_64, netdev, Randy Dunlap,
	Mikhail Gavrilov

On 26.02.24 10:51, Linux regression tracking (Thorsten Leemhuis) wrote:
> On 26.02.24 10:24, Mathias Nyman wrote:
>> On 26.2.2024 7.45, Linux regression tracking (Thorsten Leemhuis) wrote:
>>> On 21.02.24 14:44, Mathias Nyman wrote:
>>>> On 21.2.2024 1.43, Randy Dunlap wrote:
>>>>> On 2/20/24 15:41, Randy Dunlap wrote:
>>>>>> [+ tglx]
>>>>>> On 2/20/24 15:19, Mikhail Gavrilov wrote:
>>>>>>> On Mon, Feb 19, 2024 at 2:41 PM Mikhail Gavrilov
>>>>>>> <mikhail.v.gavrilov@gmail.com> wrote:
>>>>>>> I spotted network performance regression and it turned out, this was
>>>>>>> due to the network card getting other interrupt. It is a side effect
>>>>>>> of commit 57e153dfd0e7a080373fe5853c5609443d97fa5a.
>>>>>> That's a merge commit (AFAIK, maybe not so much). The commit in
>>>>>> mainline is:
>>>>>>
>>>>>> commit f977f4c9301c
>>>>>> Author: Niklas Neronin <niklas.neronin@linux.intel.com>
>>>>>> Date:   Fri Dec 1 17:06:40 2023 +0200
>>>>>>
>>>>>>       xhci: add handler for only one interrupt line
>>>>>>
>>>>>>> Installing irqbalance daemon did not help. Maybe someone experienced
>>>>>>> such a problem?
>> This isn't really about those usb xhci patches.
>> This is about which interrupt gets assigned to which CPU.
> I know, but from my understanding of Linus expectations wrt to handling
> regressions it does not matter much if a bug existed earlier or
> somewhere else: what counts is the commit that exposed the problem.

TWIMC, I mentioned this twice in mails to Linus, he didn't get involved,
so I assume things are fine the way they are for him. And then it's of
course totally fine for me, too. :-D

Thx again for all your help and sorry for causing trouble, but in my
line of work these "might or might not be a regression from Linus
viewpoint" sometimes happen.

Ciao, Thorsten

#regzbot resolve: apparently not a regression from Linus viewpoint

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2024-03-04 14:10 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-02-03  1:02 This is the fourth time I’ve tried to find what led to the regression of outgoing network speed and each time I find the merge commit 8c94ccc7cd691472461448f98e2372c75849406c Mikhail Gavrilov
2024-02-03  1:08 ` Rahul Rameshbabu
2024-02-03  1:15 ` Randy Dunlap
2024-02-03  1:16   ` Randy Dunlap
2024-02-03  2:32     ` Jakub Kicinski
2024-02-03 18:20 ` Christian A. Ehrhardt
2024-02-04 20:47   ` Christian A. Ehrhardt
2024-02-05 21:08     ` Mikhail Gavrilov
2024-02-06 11:26       ` Mathias Nyman
2024-02-06 16:12         ` Mikhail Gavrilov
2024-02-07 10:40           ` Mathias Nyman
2024-02-07 11:55             ` Mikhail Gavrilov
2024-02-08  9:25               ` Mathias Nyman
2024-02-08 10:32                 ` Mikhail Gavrilov
2024-02-08 15:43                   ` Mathias Nyman
2024-02-16  6:15                     ` This is the fourth time I’ve " Linux regression tracking (Thorsten Leemhuis)
2024-02-19  9:41                     ` This is the fourth time I’ve " Mikhail Gavrilov
2024-02-20 23:19                       ` Mikhail Gavrilov
2024-02-20 23:41                         ` Randy Dunlap
2024-02-20 23:43                           ` Randy Dunlap
2024-02-21 13:44                             ` Mathias Nyman
2024-02-26  5:45                               ` This is the fourth time I've " Linux regression tracking (Thorsten Leemhuis)
2024-02-26  9:24                                 ` Mathias Nyman
2024-02-26  9:51                                   ` Linux regression tracking (Thorsten Leemhuis)
2024-02-26 10:54                                     ` Mathias Nyman
2024-02-26 18:09                                       ` Thomas Gleixner
2024-02-27 17:08                                         ` mikhail.v.gavrilov
2024-02-27 17:23                                           ` Thomas Gleixner
     [not found]                                             ` <960fd112b294a902e1bea1fdd8e04a708a05cf45.camel@gmail.com>
2024-02-29  9:41                                               ` Mikhail Gavrilov
2024-03-04 14:10                                     ` Linux regression tracking (Thorsten Leemhuis)
2024-02-21  6:48 ` This is the fourth time I’ve " Linux regression tracking #adding (Thorsten Leemhuis)
