* NAT performance issue 944mbit -> ~40mbit
@ 2020-07-11 15:53 Ian Kumlien
  2020-07-15 20:05   ` [Intel-wired-lan] " Ian Kumlien
  0 siblings, 1 reply; 51+ messages in thread
From: Ian Kumlien @ 2020-07-11 15:53 UTC (permalink / raw)
  To: Linux Kernel Network Developers

Hi,

I first detected this with 5.7.6, but it seems to apply as far back as 5.6.1...
(so: 5.7.8 client -> NAT (5.6.1 -> 5.8-rc4) -> server 5.7.7)

It seems to me that the window size doesn't advance, so I reverted
"tcp: grow window for OOO packets only for SACK flows" [1],
but it made no difference...
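
For completeness, the revert itself was nothing special - roughly:

git checkout v5.7.8
git revert bf780119617797b5690e999e59a64ad79a572374   # the stable backport from [1]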

I have a 384 MB tcpdump of an iperf3 session that starts low and then
actually starts to get the bandwidth...
I do use BBR - I have tried cubic as well... it didn't help - the NAT
machine uses fq, but changing it doesn't seem to yield any other
results.
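
Switching congestion control and the qdisc was just the usual sysctl/tc
dance, something like this (interface name is only an example):

sysctl -w net.ipv4.tcp_congestion_control=cubic   # and back to bbr afterwards
tc qdisc replace dev eth0 root fq                 # swapping fq out changed nothing either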

Doing -P10 gives you the bandwidth and can sometimes break the
stalemate, but you'll end up back at the lower transfer speed again.
(it only seems to apply to NAT - the machine is an A2SDi-12C-HLN4F and
has handled this without problems in the past...)
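
The numbers below are plain iperf3 client runs, roughly:

iperf3 -c <server>          # single TCP stream, 10 second default
iperf3 -c <server> -P 10    # 10 parallel streams - breaks the stalemate for a while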


[1] https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=v5.7.8&id=bf780119617797b5690e999e59a64ad79a572374

First iperf3 as a reference:
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   113 MBytes   945 Mbits/sec    0    814 KBytes
[  5]   1.00-2.00   sec   109 MBytes   912 Mbits/sec    0    806 KBytes
[  5]   2.00-3.00   sec   112 MBytes   944 Mbits/sec   31    792 KBytes
[  5]   3.00-4.00   sec   101 MBytes   849 Mbits/sec   31   1.18 MBytes
[  5]   4.00-5.00   sec   108 MBytes   902 Mbits/sec    0    783 KBytes
[  5]   5.00-6.00   sec   111 MBytes   933 Mbits/sec   31    778 KBytes
[  5]   6.00-7.00   sec   111 MBytes   933 Mbits/sec   93    772 KBytes
[  5]   7.00-8.00   sec   112 MBytes   944 Mbits/sec    0    778 KBytes
[  5]   8.00-9.00   sec   111 MBytes   933 Mbits/sec   60    778 KBytes
[  5]   9.00-10.00  sec   111 MBytes   933 Mbits/sec   92    814 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  1.07 GBytes   923 Mbits/sec  338             sender
[  5]   0.00-10.01  sec  1.07 GBytes   919 Mbits/sec                  receiver

After that:
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  4.77 MBytes  40.0 Mbits/sec    0   42.4 KBytes
[  5]   1.00-2.00   sec  4.10 MBytes  34.4 Mbits/sec    0   84.8 KBytes
[  5]   2.00-3.00   sec  4.60 MBytes  38.6 Mbits/sec    0   87.7 KBytes
[  5]   3.00-4.00   sec  4.23 MBytes  35.4 Mbits/sec    0   42.4 KBytes
[  5]   4.00-5.00   sec  4.23 MBytes  35.4 Mbits/sec    0   42.4 KBytes
[  5]   5.00-6.00   sec  4.47 MBytes  37.5 Mbits/sec    0   76.4 KBytes
[  5]   6.00-7.00   sec  5.47 MBytes  45.9 Mbits/sec    0   67.9 KBytes
[  5]   7.00-8.00   sec  4.66 MBytes  39.1 Mbits/sec    0   67.9 KBytes
[  5]   8.00-9.00   sec  4.35 MBytes  36.5 Mbits/sec    0   82.0 KBytes
[  5]   9.00-10.00  sec  4.66 MBytes  39.1 Mbits/sec    0    139 KBytes
- - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  45.5 MBytes  38.2 Mbits/sec    0             sender
[  5]   0.00-10.00  sec  45.0 MBytes  37.8 Mbits/sec                  receiver

Occasionally you even get a run like this:
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  5.38 MBytes  45.2 Mbits/sec    0   42.4 KBytes
[  5]   1.00-2.00   sec  7.08 MBytes  59.4 Mbits/sec    0    535 KBytes
[  5]   2.00-3.00   sec   108 MBytes   907 Mbits/sec    0    778 KBytes
[  5]   3.00-4.00   sec   111 MBytes   933 Mbits/sec    0    814 KBytes
[  5]   4.00-5.00   sec  91.2 MBytes   765 Mbits/sec    0    829 KBytes
[  5]   5.00-6.00   sec   111 MBytes   933 Mbits/sec    0    783 KBytes
[  5]   6.00-7.00   sec   111 MBytes   933 Mbits/sec    0    769 KBytes
[  5]   7.00-8.00   sec   111 MBytes   933 Mbits/sec    0    778 KBytes
[  5]   8.00-9.00   sec   112 MBytes   944 Mbits/sec    0    809 KBytes
[  5]   9.00-10.00  sec   110 MBytes   923 Mbits/sec    0    823 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   879 MBytes   738 Mbits/sec    0             sender
[  5]   0.00-10.00  sec   875 MBytes   734 Mbits/sec                  receiver

* Re: NAT performance issue 944mbit -> ~40mbit
  2020-07-11 15:53 NAT performance issue 944mbit -> ~40mbit Ian Kumlien
@ 2020-07-15 20:05   ` Ian Kumlien
  0 siblings, 0 replies; 51+ messages in thread
From: Ian Kumlien @ 2020-07-15 20:05 UTC (permalink / raw)
  To: Linux Kernel Network Developers, jeffrey.t.kirsher, intel-wired-lan

After a lot of debugging, it turns out that the bug is in igb...

driver: igb
version: 5.6.0-k
firmware-version:  0. 6-1

03:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network
Connection (rev 03)

It's interesting that it only seems to happen on longer links... Any clues?

On Sat, Jul 11, 2020 at 5:53 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
>
> Hi,
>
> I first detected this with 5.7.6 but it seems to apply as far back as 5.6.1...
> (so, 5.7.8 client -> nat (5.6.1 -> 5.8-rc4 -> server 5.7.7)
>
> It seems to me that the window size doesn't advance, so i did revert
> the tcp: grow window for OOO packets only for SACK flows [1]
> but it did no difference...
>
> I have a 384 MB tcpdump of a iperf3 session that starts low and then
> actually starts to get the bandwidth...
> I do use BBR - I have tried with cubic... it didn't help  - the NAT
> machine does use fq but changing it doesn't seem to yield any other
> results.
>
> Doing -P10 gives you the bandwith and can sometimes break the
> stalemate but you'll end up back with the lower transfer speed again.
> (it only seems to apply to NAT - the machine is a: A2SDi-12C-HLN4F and
> has handled this without problems in the past...)
>
>
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=v5.7.8&id=bf780119617797b5690e999e59a64ad79a572374
>
> First iperf3 as a reference:
> [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> [  5]   0.00-1.00   sec   113 MBytes   945 Mbits/sec    0    814 KBytes
> [  5]   1.00-2.00   sec   109 MBytes   912 Mbits/sec    0    806 KBytes
> [  5]   2.00-3.00   sec   112 MBytes   944 Mbits/sec   31    792 KBytes
> [  5]   3.00-4.00   sec   101 MBytes   849 Mbits/sec   31   1.18 MBytes
> [  5]   4.00-5.00   sec   108 MBytes   902 Mbits/sec    0    783 KBytes
> [  5]   5.00-6.00   sec   111 MBytes   933 Mbits/sec   31    778 KBytes
> [  5]   6.00-7.00   sec   111 MBytes   933 Mbits/sec   93    772 KBytes
> [  5]   7.00-8.00   sec   112 MBytes   944 Mbits/sec    0    778 KBytes
> [  5]   8.00-9.00   sec   111 MBytes   933 Mbits/sec   60    778 KBytes
> [  5]   9.00-10.00  sec   111 MBytes   933 Mbits/sec   92    814 KBytes
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bitrate         Retr
> [  5]   0.00-10.00  sec  1.07 GBytes   923 Mbits/sec  338             sender
> [  5]   0.00-10.01  sec  1.07 GBytes   919 Mbits/sec                  receiver
>
> After that:
> [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> [  5]   0.00-1.00   sec  4.77 MBytes  40.0 Mbits/sec    0   42.4 KBytes
> [  5]   1.00-2.00   sec  4.10 MBytes  34.4 Mbits/sec    0   84.8 KBytes
> [  5]   2.00-3.00   sec  4.60 MBytes  38.6 Mbits/sec    0   87.7 KBytes
> [  5]   3.00-4.00   sec  4.23 MBytes  35.4 Mbits/sec    0   42.4 KBytes
> [  5]   4.00-5.00   sec  4.23 MBytes  35.4 Mbits/sec    0   42.4 KBytes
> [  5]   5.00-6.00   sec  4.47 MBytes  37.5 Mbits/sec    0   76.4 KBytes
> [  5]   6.00-7.00   sec  5.47 MBytes  45.9 Mbits/sec    0   67.9 KBytes
> [  5]   7.00-8.00   sec  4.66 MBytes  39.1 Mbits/sec    0   67.9 KBytes
> [  5]   8.00-9.00   sec  4.35 MBytes  36.5 Mbits/sec    0   82.0 KBytes
> [  5]   9.00-10.00  sec  4.66 MBytes  39.1 Mbits/sec    0    139 KBytes
> - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bitrate         Retr
> [  5]   0.00-10.00  sec  45.5 MBytes  38.2 Mbits/sec    0             sender
> [  5]   0.00-10.00  sec  45.0 MBytes  37.8 Mbits/sec                  receiver
>
> You even get some:
> [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> [  5]   0.00-1.00   sec  5.38 MBytes  45.2 Mbits/sec    0   42.4 KBytes
> [  5]   1.00-2.00   sec  7.08 MBytes  59.4 Mbits/sec    0    535 KBytes
> [  5]   2.00-3.00   sec   108 MBytes   907 Mbits/sec    0    778 KBytes
> [  5]   3.00-4.00   sec   111 MBytes   933 Mbits/sec    0    814 KBytes
> [  5]   4.00-5.00   sec  91.2 MBytes   765 Mbits/sec    0    829 KBytes
> [  5]   5.00-6.00   sec   111 MBytes   933 Mbits/sec    0    783 KBytes
> [  5]   6.00-7.00   sec   111 MBytes   933 Mbits/sec    0    769 KBytes
> [  5]   7.00-8.00   sec   111 MBytes   933 Mbits/sec    0    778 KBytes
> [  5]   8.00-9.00   sec   112 MBytes   944 Mbits/sec    0    809 KBytes
> [  5]   9.00-10.00  sec   110 MBytes   923 Mbits/sec    0    823 KBytes
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bitrate         Retr
> [  5]   0.00-10.00  sec   879 MBytes   738 Mbits/sec    0             sender
> [  5]   0.00-10.00  sec   875 MBytes   734 Mbits/sec                  receiver

* Re: NAT performance issue 944mbit -> ~40mbit
  2020-07-15 20:05   ` [Intel-wired-lan] " Ian Kumlien
@ 2020-07-15 20:31     ` Jakub Kicinski
  -1 siblings, 0 replies; 51+ messages in thread
From: Jakub Kicinski @ 2020-07-15 20:31 UTC (permalink / raw)
  To: Ian Kumlien
  Cc: Linux Kernel Network Developers, jeffrey.t.kirsher, intel-wired-lan

On Wed, 15 Jul 2020 22:05:58 +0200 Ian Kumlien wrote:
> After a  lot of debugging it turns out that the bug is in igb...
> 
> driver: igb
> version: 5.6.0-k
> firmware-version:  0. 6-1
> 
> 03:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network
> Connection (rev 03)

Unclear to me what you're actually reporting. Is this a regression
after a kernel upgrade? Compared to no NAT?

> It's interesting that it only seems to happen on longer links... Any clues?

Links as in with longer cables?

* Re: NAT performance issue 944mbit -> ~40mbit
  2020-07-15 20:31     ` [Intel-wired-lan] " Jakub Kicinski
@ 2020-07-15 21:02       ` Ian Kumlien
  -1 siblings, 0 replies; 51+ messages in thread
From: Ian Kumlien @ 2020-07-15 21:02 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Linux Kernel Network Developers, jeffrey.t.kirsher, intel-wired-lan

On Wed, Jul 15, 2020 at 10:31 PM Jakub Kicinski <kuba@kernel.org> wrote:
> On Wed, 15 Jul 2020 22:05:58 +0200 Ian Kumlien wrote:
> > After a  lot of debugging it turns out that the bug is in igb...
> >
> > driver: igb
> > version: 5.6.0-k
> > firmware-version:  0. 6-1
> >
> > 03:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network
> > Connection (rev 03)
>
> Unclear to me what you're actually reporting. Is this a regression
> after a kernel upgrade? Compared to no NAT?

It only happens on "internet links"

Let's say that A is a client with the igb driver, B is a firewall running NAT
with the ixgbe driver, C is another local node with igb, and
D is a remote node with a bridge backed by a bnx2 interface.

A -> B -> C is ok (B and C are on the same switch)

A -> B -> D -- 32-40mbit

B -> D 944 mbit
C -> D 944 mbit

A' -> D ~933 mbit (A with realtek nic -- also link is not idle atm)

Could it be a timing issue? This is on an AMD Ryzen 9 system - I have
tcpdumps, but I doubt that they'll help...

> > It's interesting that it only seems to happen on longer links... Any clues?
>
> Links as in with longer cables?

Longer links, as in more hops and unknown (in this case Juniper) switches/boxes

* Re: NAT performance issue 944mbit -> ~40mbit
  2020-07-15 21:02       ` [Intel-wired-lan] " Ian Kumlien
@ 2020-07-15 21:12         ` Ian Kumlien
  -1 siblings, 0 replies; 51+ messages in thread
From: Ian Kumlien @ 2020-07-15 21:12 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Linux Kernel Network Developers, jeffrey.t.kirsher, intel-wired-lan

On Wed, Jul 15, 2020 at 11:02 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> On Wed, Jul 15, 2020 at 10:31 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > On Wed, 15 Jul 2020 22:05:58 +0200 Ian Kumlien wrote:
> > > After a  lot of debugging it turns out that the bug is in igb...
> > >
> > > driver: igb
> > > version: 5.6.0-k
> > > firmware-version:  0. 6-1
> > >
> > > 03:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network
> > > Connection (rev 03)
> >
> > Unclear to me what you're actually reporting. Is this a regression
> > after a kernel upgrade? Compared to no NAT?
>
> It only happens on "internet links"
>
> Lets say that A is client with ibg driver, B is a firewall running NAT
> with ixgbe drivers, C is another local node with igb and
> D is a remote node with a bridge backed by a bnx2 interface.
>
> A -> B -> C is ok (B and C is on the same switch)
>
> A -> B -> D -- 32-40mbit
>
> B -> D 944 mbit
> C -> D 944 mbit
>
> A' -> D ~933 mbit (A with realtek nic -- also link is not idle atm)

This should of course be A' -> B -> D

Sorry, I've been scratching my head for about a week...

> Can it be a timing issue? this is on a AMD Ryzen 9 system - I have
> tcpdumps but i doubt that they'll help...
>
> > > It's interesting that it only seems to happen on longer links... Any clues?
> >
> > Links as in with longer cables?
>
> Longer links, as in more hops and unknown (in this case Juniper) switches/boxes

* Re: NAT performance issue 944mbit -> ~40mbit
  2020-07-15 21:12         ` [Intel-wired-lan] " Ian Kumlien
@ 2020-07-15 21:40           ` Jakub Kicinski
  -1 siblings, 0 replies; 51+ messages in thread
From: Jakub Kicinski @ 2020-07-15 21:40 UTC (permalink / raw)
  To: Ian Kumlien
  Cc: Linux Kernel Network Developers, jeffrey.t.kirsher, intel-wired-lan

On Wed, 15 Jul 2020 23:12:23 +0200 Ian Kumlien wrote:
> On Wed, Jul 15, 2020 at 11:02 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > On Wed, Jul 15, 2020 at 10:31 PM Jakub Kicinski <kuba@kernel.org> wrote:  
> > > On Wed, 15 Jul 2020 22:05:58 +0200 Ian Kumlien wrote:  
> > > > After a  lot of debugging it turns out that the bug is in igb...
> > > >
> > > > driver: igb
> > > > version: 5.6.0-k
> > > > firmware-version:  0. 6-1
> > > >
> > > > 03:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network
> > > > Connection (rev 03)  
> > >
> > > Unclear to me what you're actually reporting. Is this a regression
> > > after a kernel upgrade? Compared to no NAT?  
> >
> > It only happens on "internet links"
> >
> > Lets say that A is client with ibg driver, B is a firewall running NAT
> > with ixgbe drivers, C is another local node with igb and
> > D is a remote node with a bridge backed by a bnx2 interface.
> >
> > A -> B -> C is ok (B and C is on the same switch)
> >
> > A -> B -> D -- 32-40mbit
> >
> > B -> D 944 mbit
> > C -> D 944 mbit
> >
> > A' -> D ~933 mbit (A with realtek nic -- also link is not idle atm)  
> 
> This should of course be A' -> B -> D
> 
> Sorry, I've been scratching my head for about a week...

Hm, only thing that comes to mind if A' works reliably and A doesn't is
that A has somehow broken TCP offloads. Could you try disabling things
via ethtool -K and see if those settings make a difference?
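
Something along these lines (interface name is just an example):

ethtool -k eth0                                  # list current offload settings
ethtool -K eth0 tso off gso off gro off sg off   # then retest
ethtool -K eth0 rx off tx off                    # rx/tx checksum offloads, if needed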

* Re: NAT performance issue 944mbit -> ~40mbit
  2020-07-15 21:40           ` [Intel-wired-lan] " Jakub Kicinski
@ 2020-07-15 21:59             ` Ian Kumlien
  -1 siblings, 0 replies; 51+ messages in thread
From: Ian Kumlien @ 2020-07-15 21:59 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Linux Kernel Network Developers, jeffrey.t.kirsher, intel-wired-lan

On Wed, Jul 15, 2020 at 11:40 PM Jakub Kicinski <kuba@kernel.org> wrote:
> On Wed, 15 Jul 2020 23:12:23 +0200 Ian Kumlien wrote:
> > On Wed, Jul 15, 2020 at 11:02 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > > On Wed, Jul 15, 2020 at 10:31 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > > > On Wed, 15 Jul 2020 22:05:58 +0200 Ian Kumlien wrote:
> > > > > After a  lot of debugging it turns out that the bug is in igb...
> > > > >
> > > > > driver: igb
> > > > > version: 5.6.0-k
> > > > > firmware-version:  0. 6-1
> > > > >
> > > > > 03:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network
> > > > > Connection (rev 03)
> > > >
> > > > Unclear to me what you're actually reporting. Is this a regression
> > > > after a kernel upgrade? Compared to no NAT?
> > >
> > > It only happens on "internet links"
> > >
> > > Lets say that A is client with ibg driver, B is a firewall running NAT
> > > with ixgbe drivers, C is another local node with igb and
> > > D is a remote node with a bridge backed by a bnx2 interface.
> > >
> > > A -> B -> C is ok (B and C is on the same switch)
> > >
> > > A -> B -> D -- 32-40mbit
> > >
> > > B -> D 944 mbit
> > > C -> D 944 mbit
> > >
> > > A' -> D ~933 mbit (A with realtek nic -- also link is not idle atm)
> >
> > This should of course be A' -> B -> D
> >
> > Sorry, I've been scratching my head for about a week...
>
> Hm, only thing that comes to mind if A' works reliably and A doesn't is
> that A has somehow broken TCP offloads. Could you try disabling things
> via ethtool -K and see if those settings make a difference?

It's a bit hard, since it behaves like this - with TSO turned off:
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   108 MBytes   902 Mbits/sec    0    783 KBytes
[  5]   1.00-2.00   sec   110 MBytes   923 Mbits/sec   31    812 KBytes
[  5]   2.00-3.00   sec   111 MBytes   933 Mbits/sec   92    772 KBytes
[  5]   3.00-4.00   sec   110 MBytes   923 Mbits/sec    0    834 KBytes
[  5]   4.00-5.00   sec   111 MBytes   933 Mbits/sec   60    823 KBytes
[  5]   5.00-6.00   sec   110 MBytes   923 Mbits/sec   31    789 KBytes
[  5]   6.00-7.00   sec   111 MBytes   933 Mbits/sec    0    786 KBytes
[  5]   7.00-8.00   sec   110 MBytes   923 Mbits/sec    0    761 KBytes
[  5]   8.00-9.00   sec   110 MBytes   923 Mbits/sec    0    772 KBytes
[  5]   9.00-10.00  sec   109 MBytes   912 Mbits/sec    0    868 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  1.07 GBytes   923 Mbits/sec  214             sender
[  5]   0.00-10.00  sec  1.07 GBytes   920 Mbits/sec                  receiver

Continued running tests:
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  5.82 MBytes  48.8 Mbits/sec    0   82.0 KBytes
[  5]   1.00-2.00   sec  4.97 MBytes  41.7 Mbits/sec    0    130 KBytes
[  5]   2.00-3.00   sec  5.28 MBytes  44.3 Mbits/sec    0   99.0 KBytes
[  5]   3.00-4.00   sec  5.28 MBytes  44.3 Mbits/sec    0    105 KBytes
[  5]   4.00-5.00   sec  5.28 MBytes  44.3 Mbits/sec    0    122 KBytes
[  5]   5.00-6.00   sec  5.28 MBytes  44.3 Mbits/sec    0   82.0 KBytes
[  5]   6.00-7.00   sec  5.28 MBytes  44.3 Mbits/sec    0   79.2 KBytes
[  5]   7.00-8.00   sec  5.28 MBytes  44.3 Mbits/sec    0    110 KBytes
[  5]   8.00-9.00   sec  5.28 MBytes  44.3 Mbits/sec    0    156 KBytes
[  5]   9.00-10.00  sec  5.28 MBytes  44.3 Mbits/sec    0   87.7 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  53.0 MBytes  44.5 Mbits/sec    0             sender
[  5]   0.00-10.00  sec  52.5 MBytes  44.1 Mbits/sec                  receiver

[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  7.08 MBytes  59.4 Mbits/sec    0    156 KBytes
[  5]   1.00-2.00   sec  5.97 MBytes  50.0 Mbits/sec    0    110 KBytes
[  5]   2.00-3.00   sec  4.97 MBytes  41.7 Mbits/sec    0    124 KBytes
[  5]   3.00-4.00   sec  5.47 MBytes  45.9 Mbits/sec    0   96.2 KBytes
[  5]   4.00-5.00   sec  5.47 MBytes  45.9 Mbits/sec    0    158 KBytes
[  5]   5.00-6.00   sec  4.97 MBytes  41.7 Mbits/sec    0   70.7 KBytes
[  5]   6.00-7.00   sec  5.47 MBytes  45.9 Mbits/sec    0    113 KBytes
[  5]   7.00-8.00   sec  5.47 MBytes  45.9 Mbits/sec    0   96.2 KBytes
[  5]   8.00-9.00   sec  4.97 MBytes  41.7 Mbits/sec    0   84.8 KBytes
[  5]   9.00-10.00  sec  5.47 MBytes  45.9 Mbits/sec    0    116 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  55.3 MBytes  46.4 Mbits/sec    0             sender
[  5]   0.00-10.00  sec  53.9 MBytes  45.2 Mbits/sec                  receiver

And the low bandwidth continues with:
ethtool -k enp3s0 |grep ": on"
rx-vlan-offload: on
tx-vlan-offload: on [requested off]
highdma: on [fixed]
rx-vlan-filter: on [fixed]
tx-gre-segmentation: on
tx-gre-csum-segmentation: on
tx-ipxip4-segmentation: on
tx-ipxip6-segmentation: on
tx-udp_tnl-segmentation: on
tx-udp_tnl-csum-segmentation: on
tx-gso-partial: on
tx-udp-segmentation: on
hw-tc-offload: on

I can't quite find how to turn those off, since the names listed by
ethtool aren't the strings you use to enable/disable them.

I was hoping that you'd have a clue about what might have introduced
a regression - i.e. specific patches to try to revert.

Btw, the same issue applies to UDP as well

[ ID] Interval           Transfer     Bitrate         Total Datagrams
[  5]   0.00-1.00   sec  6.77 MBytes  56.8 Mbits/sec  4900
[  5]   1.00-2.00   sec  4.27 MBytes  35.8 Mbits/sec  3089
[  5]   2.00-3.00   sec  4.20 MBytes  35.2 Mbits/sec  3041
[  5]   3.00-4.00   sec  4.30 MBytes  36.1 Mbits/sec  3116
[  5]   4.00-5.00   sec  4.24 MBytes  35.6 Mbits/sec  3070
[  5]   5.00-6.00   sec  4.21 MBytes  35.3 Mbits/sec  3047
[  5]   6.00-7.00   sec  4.29 MBytes  36.0 Mbits/sec  3110
[  5]   7.00-8.00   sec  4.28 MBytes  35.9 Mbits/sec  3097
[  5]   8.00-9.00   sec  4.25 MBytes  35.6 Mbits/sec  3075
[  5]   9.00-10.00  sec  4.20 MBytes  35.2 Mbits/sec  3039
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Jitter    Lost/Total Datagrams
[  5]   0.00-10.00  sec  45.0 MBytes  37.7 Mbits/sec  0.000 ms  0/32584 (0%)  sender
[  5]   0.00-10.00  sec  45.0 MBytes  37.7 Mbits/sec  0.037 ms  0/32573 (0%)  receiver

vs:

[ ID] Interval           Transfer     Bitrate         Total Datagrams
[  5]   0.00-1.00   sec   114 MBytes   954 Mbits/sec  82342
[  5]   1.00-2.00   sec   114 MBytes   955 Mbits/sec  82439
[  5]   2.00-3.00   sec   114 MBytes   956 Mbits/sec  82507
[  5]   3.00-4.00   sec   114 MBytes   955 Mbits/sec  82432
[  5]   4.00-5.00   sec   114 MBytes   956 Mbits/sec  82535
[  5]   5.00-6.00   sec   114 MBytes   953 Mbits/sec  82240
[  5]   6.00-7.00   sec   114 MBytes   956 Mbits/sec  82512
[  5]   7.00-8.00   sec   114 MBytes   956 Mbits/sec  82503
[  5]   8.00-9.00   sec   114 MBytes   956 Mbits/sec  82532
[  5]   9.00-10.00  sec   114 MBytes   956 Mbits/sec  82488
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Jitter    Lost/Total Datagrams
[  5]   0.00-10.00  sec  1.11 GBytes   955 Mbits/sec  0.000 ms  0/824530 (0%)  sender
[  5]   0.00-10.01  sec  1.11 GBytes   949 Mbits/sec  0.014 ms  4756/824530 (0.58%)  receiver


lspci -s 03:00.0  -vvv
03:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network
Connection (rev 03)
Subsystem: ASUSTeK Computer Inc. I211 Gigabit Network Connection
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
<TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0
Interrupt: pin A routed to IRQ 57
IOMMU group: 20
Region 0: Memory at fc900000 (32-bit, non-prefetchable) [size=128K]
Region 2: I/O ports at e000 [size=32]
Region 3: Memory at fc920000 (32-bit, non-prefetchable) [size=16K]
Capabilities: [40] Power Management version 3
Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=1 PME-
Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
Address: 0000000000000000  Data: 0000
Masking: 00000000  Pending: 00000000
Capabilities: [70] MSI-X: Enable+ Count=5 Masked-
Vector table: BAR=3 offset=00000000
PBA: BAR=3 offset=00002000
Capabilities: [a0] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+ FLReset-
MaxPayload 128 bytes, MaxReadReq 512 bytes
DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr+ TransPend-
LnkCap: Port #3, Speed 2.5GT/s, Width x1, ASPM L0s L1, Exit Latency
L0s <2us, L1 <16us
ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl: ASPM L1 Enabled; RCB 64 bytes, Disabled- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 2.5GT/s (ok), Width x1 (ok)
TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range ABCD, TimeoutDis+ NROPrPrP- LTR-
10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt- EETLPPrefix-
EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
FRS- TPHComp- ExtTPHComp-
AtomicOpsCap: 32bit- 64bit- 128bitCAS-
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- OBFF Disabled,
AtomicOpsCtl: ReqEn-
LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete-
EqualizationPhase1-
EqualizationPhase2- EqualizationPhase3- LinkEqualizationRequest-
Retimer- 2Retimers- CrosslinkRes: unsupported
Capabilities: [100 v2] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF-
MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF-
MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+
MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
HeaderLog: 00000000 00000000 00000000 00000000
Capabilities: [140 v1] Device Serial Number 34-97-f6-ff-ff-31-88-f4
Capabilities: [1a0 v1] Transaction Processing Hints
Device specific mode supported
Steering table in TPH capability structure
Kernel driver in use: igb

* Re: [Intel-wired-lan] NAT performance issue 944mbit -> ~40mbit
  2020-07-15 21:59             ` [Intel-wired-lan] " Ian Kumlien
@ 2020-07-15 22:32               ` Alexander Duyck
  -1 siblings, 0 replies; 51+ messages in thread
From: Alexander Duyck @ 2020-07-15 22:32 UTC (permalink / raw)
  To: Ian Kumlien
  Cc: Jakub Kicinski, Linux Kernel Network Developers, intel-wired-lan

On Wed, Jul 15, 2020 at 3:00 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
>
> On Wed, Jul 15, 2020 at 11:40 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > On Wed, 15 Jul 2020 23:12:23 +0200 Ian Kumlien wrote:
> > > On Wed, Jul 15, 2020 at 11:02 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > > > On Wed, Jul 15, 2020 at 10:31 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > > > > On Wed, 15 Jul 2020 22:05:58 +0200 Ian Kumlien wrote:
> > > > > > After a  lot of debugging it turns out that the bug is in igb...
> > > > > >
> > > > > > driver: igb
> > > > > > version: 5.6.0-k
> > > > > > firmware-version:  0. 6-1
> > > > > >
> > > > > > 03:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network
> > > > > > Connection (rev 03)
> > > > >
> > > > > Unclear to me what you're actually reporting. Is this a regression
> > > > > after a kernel upgrade? Compared to no NAT?
> > > >
> > > > It only happens on "internet links"
> > > >
> > > > Lets say that A is client with ibg driver, B is a firewall running NAT
> > > > with ixgbe drivers, C is another local node with igb and
> > > > D is a remote node with a bridge backed by a bnx2 interface.
> > > >
> > > > A -> B -> C is ok (B and C is on the same switch)
> > > >
> > > > A -> B -> D -- 32-40mbit
> > > >
> > > > B -> D 944 mbit
> > > > C -> D 944 mbit
> > > >
> > > > A' -> D ~933 mbit (A with realtek nic -- also link is not idle atm)
> > >
> > > This should of course be A' -> B -> D
> > >
> > > Sorry, I've been scratching my head for about a week...
> >
> > Hm, only thing that comes to mind if A' works reliably and A doesn't is
> > that A has somehow broken TCP offloads. Could you try disabling things
> > via ethtool -K and see if those settings make a difference?
>
> It's a bit hard since it works like this, turned tso off:
> [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> [  5]   0.00-1.00   sec   108 MBytes   902 Mbits/sec    0    783 KBytes
> [  5]   1.00-2.00   sec   110 MBytes   923 Mbits/sec   31    812 KBytes
> [  5]   2.00-3.00   sec   111 MBytes   933 Mbits/sec   92    772 KBytes
> [  5]   3.00-4.00   sec   110 MBytes   923 Mbits/sec    0    834 KBytes
> [  5]   4.00-5.00   sec   111 MBytes   933 Mbits/sec   60    823 KBytes
> [  5]   5.00-6.00   sec   110 MBytes   923 Mbits/sec   31    789 KBytes
> [  5]   6.00-7.00   sec   111 MBytes   933 Mbits/sec    0    786 KBytes
> [  5]   7.00-8.00   sec   110 MBytes   923 Mbits/sec    0    761 KBytes
> [  5]   8.00-9.00   sec   110 MBytes   923 Mbits/sec    0    772 KBytes
> [  5]   9.00-10.00  sec   109 MBytes   912 Mbits/sec    0    868 KBytes
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bitrate         Retr
> [  5]   0.00-10.00  sec  1.07 GBytes   923 Mbits/sec  214             sender
> [  5]   0.00-10.00  sec  1.07 GBytes   920 Mbits/sec                  receiver
>
> Continued running tests:
> [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> [  5]   0.00-1.00   sec  5.82 MBytes  48.8 Mbits/sec    0   82.0 KBytes
> [  5]   1.00-2.00   sec  4.97 MBytes  41.7 Mbits/sec    0    130 KBytes
> [  5]   2.00-3.00   sec  5.28 MBytes  44.3 Mbits/sec    0   99.0 KBytes
> [  5]   3.00-4.00   sec  5.28 MBytes  44.3 Mbits/sec    0    105 KBytes
> [  5]   4.00-5.00   sec  5.28 MBytes  44.3 Mbits/sec    0    122 KBytes
> [  5]   5.00-6.00   sec  5.28 MBytes  44.3 Mbits/sec    0   82.0 KBytes
> [  5]   6.00-7.00   sec  5.28 MBytes  44.3 Mbits/sec    0   79.2 KBytes
> [  5]   7.00-8.00   sec  5.28 MBytes  44.3 Mbits/sec    0    110 KBytes
> [  5]   8.00-9.00   sec  5.28 MBytes  44.3 Mbits/sec    0    156 KBytes
> [  5]   9.00-10.00  sec  5.28 MBytes  44.3 Mbits/sec    0   87.7 KBytes
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bitrate         Retr
> [  5]   0.00-10.00  sec  53.0 MBytes  44.5 Mbits/sec    0             sender
> [  5]   0.00-10.00  sec  52.5 MBytes  44.1 Mbits/sec                  receiver
>
> [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> [  5]   0.00-1.00   sec  7.08 MBytes  59.4 Mbits/sec    0    156 KBytes
> [  5]   1.00-2.00   sec  5.97 MBytes  50.0 Mbits/sec    0    110 KBytes
> [  5]   2.00-3.00   sec  4.97 MBytes  41.7 Mbits/sec    0    124 KBytes
> [  5]   3.00-4.00   sec  5.47 MBytes  45.9 Mbits/sec    0   96.2 KBytes
> [  5]   4.00-5.00   sec  5.47 MBytes  45.9 Mbits/sec    0    158 KBytes
> [  5]   5.00-6.00   sec  4.97 MBytes  41.7 Mbits/sec    0   70.7 KBytes
> [  5]   6.00-7.00   sec  5.47 MBytes  45.9 Mbits/sec    0    113 KBytes
> [  5]   7.00-8.00   sec  5.47 MBytes  45.9 Mbits/sec    0   96.2 KBytes
> [  5]   8.00-9.00   sec  4.97 MBytes  41.7 Mbits/sec    0   84.8 KBytes
> [  5]   9.00-10.00  sec  5.47 MBytes  45.9 Mbits/sec    0    116 KBytes
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bitrate         Retr
> [  5]   0.00-10.00  sec  55.3 MBytes  46.4 Mbits/sec    0             sender
> [  5]   0.00-10.00  sec  53.9 MBytes  45.2 Mbits/sec                  receiver
>
> And the low bandwidth continues with:
> ethtool -k enp3s0 |grep ": on"
> rx-vlan-offload: on
> tx-vlan-offload: on [requested off]
> highdma: on [fixed]
> rx-vlan-filter: on [fixed]
> tx-gre-segmentation: on
> tx-gre-csum-segmentation: on
> tx-ipxip4-segmentation: on
> tx-ipxip6-segmentation: on
> tx-udp_tnl-segmentation: on
> tx-udp_tnl-csum-segmentation: on
> tx-gso-partial: on
> tx-udp-segmentation: on
> hw-tc-offload: on
>
> Can't quite find how to turn those off since they aren't listed in
> ethtool (since the text is not what you use to enable/disable)

To disable them you would just reuse the same string that is shown in
the display. So it should just be "ethtool -K enp3s0 tx-gso-partial off",
and that would turn off a large chunk of them, as all the encapsulation
support requires gso-partial support.
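
For example, to turn off the rest of the offloads from that list in one
go, something like this should work (the entries marked [fixed] can't be
changed anyway):

for f in tx-gre-segmentation tx-gre-csum-segmentation \
         tx-ipxip4-segmentation tx-ipxip6-segmentation \
         tx-udp_tnl-segmentation tx-udp_tnl-csum-segmentation \
         tx-gso-partial tx-udp-segmentation hw-tc-offload; do
        ethtool -K enp3s0 "$f" off   # the displayed name is passed straight to -K
done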

> I was hoping that you'd have a clue of something that might introduce
> a regression - ie specific patches to try to revert
>
> Btw, the same issue applies to udp as werll
>
> [ ID] Interval           Transfer     Bitrate         Total Datagrams
> [  5]   0.00-1.00   sec  6.77 MBytes  56.8 Mbits/sec  4900
> [  5]   1.00-2.00   sec  4.27 MBytes  35.8 Mbits/sec  3089
> [  5]   2.00-3.00   sec  4.20 MBytes  35.2 Mbits/sec  3041
> [  5]   3.00-4.00   sec  4.30 MBytes  36.1 Mbits/sec  3116
> [  5]   4.00-5.00   sec  4.24 MBytes  35.6 Mbits/sec  3070
> [  5]   5.00-6.00   sec  4.21 MBytes  35.3 Mbits/sec  3047
> [  5]   6.00-7.00   sec  4.29 MBytes  36.0 Mbits/sec  3110
> [  5]   7.00-8.00   sec  4.28 MBytes  35.9 Mbits/sec  3097
> [  5]   8.00-9.00   sec  4.25 MBytes  35.6 Mbits/sec  3075
> [  5]   9.00-10.00  sec  4.20 MBytes  35.2 Mbits/sec  3039
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bitrate         Jitter
> Lost/Total Datagrams
> [  5]   0.00-10.00  sec  45.0 MBytes  37.7 Mbits/sec  0.000 ms
> 0/32584 (0%)  sender
> [  5]   0.00-10.00  sec  45.0 MBytes  37.7 Mbits/sec  0.037 ms
> 0/32573 (0%)  receiver
>
> vs:
>
> [ ID] Interval           Transfer     Bitrate         Total Datagrams
> [  5]   0.00-1.00   sec   114 MBytes   954 Mbits/sec  82342
> [  5]   1.00-2.00   sec   114 MBytes   955 Mbits/sec  82439
> [  5]   2.00-3.00   sec   114 MBytes   956 Mbits/sec  82507
> [  5]   3.00-4.00   sec   114 MBytes   955 Mbits/sec  82432
> [  5]   4.00-5.00   sec   114 MBytes   956 Mbits/sec  82535
> [  5]   5.00-6.00   sec   114 MBytes   953 Mbits/sec  82240
> [  5]   6.00-7.00   sec   114 MBytes   956 Mbits/sec  82512
> [  5]   7.00-8.00   sec   114 MBytes   956 Mbits/sec  82503
> [  5]   8.00-9.00   sec   114 MBytes   956 Mbits/sec  82532
> [  5]   9.00-10.00  sec   114 MBytes   956 Mbits/sec  82488
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bitrate         Jitter
> Lost/Total Datagrams
> [  5]   0.00-10.00  sec  1.11 GBytes   955 Mbits/sec  0.000 ms
> 0/824530 (0%)  sender
> [  5]   0.00-10.01  sec  1.11 GBytes   949 Mbits/sec  0.014 ms
> 4756/824530 (0.58%)  receiver

The fact that it is impacting UDP seems odd. I wonder if we don't have
a qdisc somewhere that is misbehaving and throttling the Tx. Either
that or I wonder if we are getting spammed with flow control frames.

It would be useful to include the output of just calling "ethtool
enp3s0" on the interface to verify the speed, "ethtool -a enp3s0" to
verify flow control settings, and "ethtool -S enp3s0 | grep -v :\ 0"
to output the statistics and dump anything that isn't zero.
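
Something like this grabs all of that in one go (sketch - same interface
name as above, and the pause counter names can differ between drivers):

ethtool enp3s0
ethtool -a enp3s0
ethtool -S enp3s0 | grep -v ': 0$'
# pause/flow-control counters specifically, even if they are still zero:
ethtool -S enp3s0 | grep -iE 'flow_control|pause'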

> lspci -s 03:00.0  -vvv
> 03:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network
> Connection (rev 03)
> Subsystem: ASUSTeK Computer Inc. I211 Gigabit Network Connection
> Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
> Stepping- SERR- FastB2B- DisINTx+
> Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
> <TAbort- <MAbort- >SERR- <PERR- INTx-
> Latency: 0
> Interrupt: pin A routed to IRQ 57
> IOMMU group: 20
> Region 0: Memory at fc900000 (32-bit, non-prefetchable) [size=128K]
> Region 2: I/O ports at e000 [size=32]
> Region 3: Memory at fc920000 (32-bit, non-prefetchable) [size=16K]
> Capabilities: [40] Power Management version 3
> Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
> Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=1 PME-
> Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
> Address: 0000000000000000  Data: 0000
> Masking: 00000000  Pending: 00000000
> Capabilities: [70] MSI-X: Enable+ Count=5 Masked-
> Vector table: BAR=3 offset=00000000
> PBA: BAR=3 offset=00002000
> Capabilities: [a0] Express (v2) Endpoint, MSI 00
> DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
> ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
> DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
> RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+ FLReset-
> MaxPayload 128 bytes, MaxReadReq 512 bytes
> DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr+ TransPend-
> LnkCap: Port #3, Speed 2.5GT/s, Width x1, ASPM L0s L1, Exit Latency
> L0s <2us, L1 <16us
> ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
> LnkCtl: ASPM L1 Enabled; RCB 64 bytes, Disabled- CommClk+
> ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
> LnkSta: Speed 2.5GT/s (ok), Width x1 (ok)
> TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-

PCIe-wise the connection is going to be pretty tight in terms of
bandwidth: it looks like we have 2.5GT/s with only a single lane of
PCIe. In addition we are running with ASPM enabled, which means that
when there isn't enough traffic the one PCIe lane we have gets powered
down, so bursty traffic can get ugly.
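
The LnkCtl line in the lspci dump above already shows "ASPM L1 Enabled"
on the NIC side; the upstream port matters as well, so something like
this shows both ends of the link (sketch - find the parent port with
lspci -tv first):

lspci -s 03:00.0 -vvv | grep 'LnkCtl:'
lspci -s <parent port> -vvv | grep 'LnkCtl:'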

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [Intel-wired-lan] NAT performance issue 944mbit -> ~40mbit
  2020-07-15 22:32               ` Alexander Duyck
@ 2020-07-15 22:51                 ` Ian Kumlien
  -1 siblings, 0 replies; 51+ messages in thread
From: Ian Kumlien @ 2020-07-15 22:51 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Jakub Kicinski, Linux Kernel Network Developers, intel-wired-lan

On Thu, Jul 16, 2020 at 12:32 AM Alexander Duyck
<alexander.duyck@gmail.com> wrote:
> On Wed, Jul 15, 2020 at 3:00 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > On Wed, Jul 15, 2020 at 11:40 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > > On Wed, 15 Jul 2020 23:12:23 +0200 Ian Kumlien wrote:
> > > > On Wed, Jul 15, 2020 at 11:02 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > > > > On Wed, Jul 15, 2020 at 10:31 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > > > > > On Wed, 15 Jul 2020 22:05:58 +0200 Ian Kumlien wrote:
> > > > > > > After a  lot of debugging it turns out that the bug is in igb...
> > > > > > >
> > > > > > > driver: igb
> > > > > > > version: 5.6.0-k
> > > > > > > firmware-version:  0. 6-1
> > > > > > >
> > > > > > > 03:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network
> > > > > > > Connection (rev 03)
> > > > > >
> > > > > > Unclear to me what you're actually reporting. Is this a regression
> > > > > > after a kernel upgrade? Compared to no NAT?
> > > > >
> > > > > It only happens on "internet links"
> > > > >
> > > > > Let's say that A is a client with the igb driver, B is a firewall running NAT
> > > > > with the ixgbe driver, C is another local node with igb, and
> > > > > D is a remote node with a bridge backed by a bnx2 interface.
> > > > >
> > > > > A -> B -> C is ok (B and C are on the same switch)
> > > > >
> > > > > A -> B -> D -- 32-40mbit
> > > > >
> > > > > B -> D 944 mbit
> > > > > C -> D 944 mbit
> > > > >
> > > > > A' -> D ~933 mbit (A with realtek nic -- also link is not idle atm)
> > > >
> > > > This should of course be A' -> B -> D
> > > >
> > > > Sorry, I've been scratching my head for about a week...
> > >
> > > Hm, only thing that comes to mind if A' works reliably and A doesn't is
> > > that A has somehow broken TCP offloads. Could you try disabling things
> > > via ethtool -K and see if those settings make a difference?
> >
> > It's a bit hard since it behaves like this; with TSO turned off:
> > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > [  5]   0.00-1.00   sec   108 MBytes   902 Mbits/sec    0    783 KBytes
> > [  5]   1.00-2.00   sec   110 MBytes   923 Mbits/sec   31    812 KBytes
> > [  5]   2.00-3.00   sec   111 MBytes   933 Mbits/sec   92    772 KBytes
> > [  5]   3.00-4.00   sec   110 MBytes   923 Mbits/sec    0    834 KBytes
> > [  5]   4.00-5.00   sec   111 MBytes   933 Mbits/sec   60    823 KBytes
> > [  5]   5.00-6.00   sec   110 MBytes   923 Mbits/sec   31    789 KBytes
> > [  5]   6.00-7.00   sec   111 MBytes   933 Mbits/sec    0    786 KBytes
> > [  5]   7.00-8.00   sec   110 MBytes   923 Mbits/sec    0    761 KBytes
> > [  5]   8.00-9.00   sec   110 MBytes   923 Mbits/sec    0    772 KBytes
> > [  5]   9.00-10.00  sec   109 MBytes   912 Mbits/sec    0    868 KBytes
> > - - - - - - - - - - - - - - - - - - - - - - - - -
> > [ ID] Interval           Transfer     Bitrate         Retr
> > [  5]   0.00-10.00  sec  1.07 GBytes   923 Mbits/sec  214             sender
> > [  5]   0.00-10.00  sec  1.07 GBytes   920 Mbits/sec                  receiver
> >
> > Continued running tests:
> > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > [  5]   0.00-1.00   sec  5.82 MBytes  48.8 Mbits/sec    0   82.0 KBytes
> > [  5]   1.00-2.00   sec  4.97 MBytes  41.7 Mbits/sec    0    130 KBytes
> > [  5]   2.00-3.00   sec  5.28 MBytes  44.3 Mbits/sec    0   99.0 KBytes
> > [  5]   3.00-4.00   sec  5.28 MBytes  44.3 Mbits/sec    0    105 KBytes
> > [  5]   4.00-5.00   sec  5.28 MBytes  44.3 Mbits/sec    0    122 KBytes
> > [  5]   5.00-6.00   sec  5.28 MBytes  44.3 Mbits/sec    0   82.0 KBytes
> > [  5]   6.00-7.00   sec  5.28 MBytes  44.3 Mbits/sec    0   79.2 KBytes
> > [  5]   7.00-8.00   sec  5.28 MBytes  44.3 Mbits/sec    0    110 KBytes
> > [  5]   8.00-9.00   sec  5.28 MBytes  44.3 Mbits/sec    0    156 KBytes
> > [  5]   9.00-10.00  sec  5.28 MBytes  44.3 Mbits/sec    0   87.7 KBytes
> > - - - - - - - - - - - - - - - - - - - - - - - - -
> > [ ID] Interval           Transfer     Bitrate         Retr
> > [  5]   0.00-10.00  sec  53.0 MBytes  44.5 Mbits/sec    0             sender
> > [  5]   0.00-10.00  sec  52.5 MBytes  44.1 Mbits/sec                  receiver
> >
> > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > [  5]   0.00-1.00   sec  7.08 MBytes  59.4 Mbits/sec    0    156 KBytes
> > [  5]   1.00-2.00   sec  5.97 MBytes  50.0 Mbits/sec    0    110 KBytes
> > [  5]   2.00-3.00   sec  4.97 MBytes  41.7 Mbits/sec    0    124 KBytes
> > [  5]   3.00-4.00   sec  5.47 MBytes  45.9 Mbits/sec    0   96.2 KBytes
> > [  5]   4.00-5.00   sec  5.47 MBytes  45.9 Mbits/sec    0    158 KBytes
> > [  5]   5.00-6.00   sec  4.97 MBytes  41.7 Mbits/sec    0   70.7 KBytes
> > [  5]   6.00-7.00   sec  5.47 MBytes  45.9 Mbits/sec    0    113 KBytes
> > [  5]   7.00-8.00   sec  5.47 MBytes  45.9 Mbits/sec    0   96.2 KBytes
> > [  5]   8.00-9.00   sec  4.97 MBytes  41.7 Mbits/sec    0   84.8 KBytes
> > [  5]   9.00-10.00  sec  5.47 MBytes  45.9 Mbits/sec    0    116 KBytes
> > - - - - - - - - - - - - - - - - - - - - - - - - -
> > [ ID] Interval           Transfer     Bitrate         Retr
> > [  5]   0.00-10.00  sec  55.3 MBytes  46.4 Mbits/sec    0             sender
> > [  5]   0.00-10.00  sec  53.9 MBytes  45.2 Mbits/sec                  receiver
> >
> > And the low bandwidth continues with:
> > ethtool -k enp3s0 |grep ": on"
> > rx-vlan-offload: on
> > tx-vlan-offload: on [requested off]
> > highdma: on [fixed]
> > rx-vlan-filter: on [fixed]
> > tx-gre-segmentation: on
> > tx-gre-csum-segmentation: on
> > tx-ipxip4-segmentation: on
> > tx-ipxip6-segmentation: on
> > tx-udp_tnl-segmentation: on
> > tx-udp_tnl-csum-segmentation: on
> > tx-gso-partial: on
> > tx-udp-segmentation: on
> > hw-tc-offload: on
> >
> > Can't quite figure out how to turn those off since they aren't listed as
> > ethtool options (the displayed text isn't what you use to enable/disable them)
>
> To disable them you just repeat the same string shown in the display
> output. So it should just be "ethtool -K enp3s0 tx-gso-partial off",
> and that would turn off a large chunk of them since all the
> encapsulation offloads require GSO partial support.

 ethtool -k enp3s0 |grep ": on"
highdma: on [fixed]
rx-vlan-filter: on [fixed]
---
And then back to back:
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  4.91 MBytes  41.2 Mbits/sec    0   45.2 KBytes
[  5]   1.00-2.00   sec  4.47 MBytes  37.5 Mbits/sec    0   52.3 KBytes
[  5]   2.00-3.00   sec  4.47 MBytes  37.5 Mbits/sec    0   42.4 KBytes
[  5]   3.00-4.00   sec  4.47 MBytes  37.5 Mbits/sec    0    141 KBytes
[  5]   4.00-5.00   sec   111 MBytes   928 Mbits/sec   63    764 KBytes
[  5]   5.00-6.00   sec  86.2 MBytes   724 Mbits/sec    0    744 KBytes
[  5]   6.00-7.00   sec  98.8 MBytes   828 Mbits/sec   61    769 KBytes
[  5]   7.00-8.00   sec   110 MBytes   923 Mbits/sec    0    749 KBytes
[  5]   8.00-9.00   sec   110 MBytes   923 Mbits/sec    0    741 KBytes
[  5]   9.00-10.00  sec   110 MBytes   923 Mbits/sec   31    761 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   644 MBytes   540 Mbits/sec  155             sender
[  5]   0.00-10.01  sec   641 MBytes   537 Mbits/sec                  receiver

and we're back at the not working bit:
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  4.84 MBytes  40.6 Mbits/sec    0   42.4 KBytes
[  5]   1.00-2.00   sec  4.60 MBytes  38.6 Mbits/sec    0   42.4 KBytes
[  5]   2.00-3.00   sec  4.23 MBytes  35.4 Mbits/sec    0   42.4 KBytes
[  5]   3.00-4.00   sec  4.47 MBytes  37.5 Mbits/sec    0   67.9 KBytes
[  5]   4.00-5.00   sec  4.47 MBytes  37.5 Mbits/sec    0   42.4 KBytes
[  5]   5.00-6.00   sec  4.23 MBytes  35.4 Mbits/sec    0   42.4 KBytes
[  5]   6.00-7.00   sec  4.23 MBytes  35.4 Mbits/sec    0   42.4 KBytes
[  5]   7.00-8.00   sec  4.47 MBytes  37.5 Mbits/sec    0   67.9 KBytes
[  5]   8.00-9.00   sec  4.47 MBytes  37.5 Mbits/sec    0   53.7 KBytes
[  5]   9.00-10.00  sec  4.47 MBytes  37.5 Mbits/sec    0   79.2 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  44.5 MBytes  37.3 Mbits/sec    0             sender
[  5]   0.00-10.00  sec  43.9 MBytes  36.8 Mbits/sec                  receiver

> > I was hoping that you'd have a clue of something that might introduce
> > a regression - ie specific patches to try to revert
> >
> > Btw, the same issue applies to UDP as well
> >
> > [ ID] Interval           Transfer     Bitrate         Total Datagrams
> > [  5]   0.00-1.00   sec  6.77 MBytes  56.8 Mbits/sec  4900
> > [  5]   1.00-2.00   sec  4.27 MBytes  35.8 Mbits/sec  3089
> > [  5]   2.00-3.00   sec  4.20 MBytes  35.2 Mbits/sec  3041
> > [  5]   3.00-4.00   sec  4.30 MBytes  36.1 Mbits/sec  3116
> > [  5]   4.00-5.00   sec  4.24 MBytes  35.6 Mbits/sec  3070
> > [  5]   5.00-6.00   sec  4.21 MBytes  35.3 Mbits/sec  3047
> > [  5]   6.00-7.00   sec  4.29 MBytes  36.0 Mbits/sec  3110
> > [  5]   7.00-8.00   sec  4.28 MBytes  35.9 Mbits/sec  3097
> > [  5]   8.00-9.00   sec  4.25 MBytes  35.6 Mbits/sec  3075
> > [  5]   9.00-10.00  sec  4.20 MBytes  35.2 Mbits/sec  3039
> > - - - - - - - - - - - - - - - - - - - - - - - - -
> > [ ID] Interval           Transfer     Bitrate         Jitter
> > Lost/Total Datagrams
> > [  5]   0.00-10.00  sec  45.0 MBytes  37.7 Mbits/sec  0.000 ms
> > 0/32584 (0%)  sender
> > [  5]   0.00-10.00  sec  45.0 MBytes  37.7 Mbits/sec  0.037 ms
> > 0/32573 (0%)  receiver
> >
> > vs:
> >
> > [ ID] Interval           Transfer     Bitrate         Total Datagrams
> > [  5]   0.00-1.00   sec   114 MBytes   954 Mbits/sec  82342
> > [  5]   1.00-2.00   sec   114 MBytes   955 Mbits/sec  82439
> > [  5]   2.00-3.00   sec   114 MBytes   956 Mbits/sec  82507
> > [  5]   3.00-4.00   sec   114 MBytes   955 Mbits/sec  82432
> > [  5]   4.00-5.00   sec   114 MBytes   956 Mbits/sec  82535
> > [  5]   5.00-6.00   sec   114 MBytes   953 Mbits/sec  82240
> > [  5]   6.00-7.00   sec   114 MBytes   956 Mbits/sec  82512
> > [  5]   7.00-8.00   sec   114 MBytes   956 Mbits/sec  82503
> > [  5]   8.00-9.00   sec   114 MBytes   956 Mbits/sec  82532
> > [  5]   9.00-10.00  sec   114 MBytes   956 Mbits/sec  82488
> > - - - - - - - - - - - - - - - - - - - - - - - - -
> > [ ID] Interval           Transfer     Bitrate         Jitter
> > Lost/Total Datagrams
> > [  5]   0.00-10.00  sec  1.11 GBytes   955 Mbits/sec  0.000 ms
> > 0/824530 (0%)  sender
> > [  5]   0.00-10.01  sec  1.11 GBytes   949 Mbits/sec  0.014 ms
> > 4756/824530 (0.58%)  receiver
>
> The fact that it is impacting UDP seems odd. I wonder if we don't have
> a qdisc somewhere that is misbehaving and throttling the Tx. Either
> that or I wonder if we are getting spammed with flow control frames.

It sometimes works; it looks like the congestion window just isn't
increased - that's where I started...

Example:
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  4.86 MBytes  40.8 Mbits/sec    0   50.9 KBytes
[  5]   1.00-2.00   sec  4.66 MBytes  39.1 Mbits/sec    0   65.0 KBytes
[  5]   2.00-3.00   sec  4.29 MBytes  36.0 Mbits/sec    0   42.4 KBytes
[  5]   3.00-4.00   sec  4.66 MBytes  39.1 Mbits/sec    0   42.4 KBytes
[  5]   4.00-5.00   sec  23.1 MBytes   194 Mbits/sec    0   1.07 MBytes
[  5]   5.00-6.00   sec   110 MBytes   923 Mbits/sec    0    761 KBytes
[  5]   6.00-7.00   sec  98.8 MBytes   828 Mbits/sec   60    806 KBytes
[  5]   7.00-8.00   sec  82.5 MBytes   692 Mbits/sec    0    812 KBytes
[  5]   8.00-9.00   sec   110 MBytes   923 Mbits/sec   92    761 KBytes
[  5]   9.00-10.00  sec   111 MBytes   933 Mbits/sec    0    755 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   554 MBytes   465 Mbits/sec  152             sender
[  5]   0.00-10.00  sec   550 MBytes   461 Mbits/sec                  receiver
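
A quick way to watch the congestion window from the sender side during
one of these runs is something like this (sketch - the destination
address is just an example):

watch -n1 'ss -tin dst 192.168.1.1'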

> It would be useful to include the output of just calling "ethtool
> enp3s0" on the interface to verify the speed, "ethtool -a enp3s0" to
> verify flow control settings, and "ethtool -S enp3s0 | grep -v :\ 0"
> to output the statistics and dump anything that isn't zero.

ethtool enp3s0
Settings for enp3s0:
Supported ports: [ TP ]
Supported link modes:   10baseT/Half 10baseT/Full
                        100baseT/Half 100baseT/Full
                        1000baseT/Full
Supported pause frame use: Symmetric
Supports auto-negotiation: Yes
Supported FEC modes: Not reported
Advertised link modes:  10baseT/Half 10baseT/Full
                        100baseT/Half 100baseT/Full
                        1000baseT/Full
Advertised pause frame use: Symmetric
Advertised auto-negotiation: Yes
Advertised FEC modes: Not reported
Speed: 1000Mb/s
Duplex: Full
Auto-negotiation: on
Port: Twisted Pair
PHYAD: 1
Transceiver: internal
MDI-X: off (auto)
Supports Wake-on: pumbg
Wake-on: g
        Current message level: 0x00000007 (7)
                               drv probe link
Link detected: yes
---
ethtool -a enp3s0
Pause parameters for enp3s0:
Autonegotiate: on
RX: on
TX: off
---
ethtool -S enp3s0 |grep  -v :\ 0
NIC statistics:
     rx_packets: 15920618
     tx_packets: 17846725
     rx_bytes: 15676264423
     tx_bytes: 19925010639
     rx_broadcast: 119553
     tx_broadcast: 497
     rx_multicast: 330193
     tx_multicast: 18190
     multicast: 330193
     rx_missed_errors: 270102
     rx_long_length_errors: 6
     tx_tcp_seg_good: 1342561
     rx_long_byte_count: 15676264423
     rx_errors: 6
     rx_length_errors: 6
     rx_fifo_errors: 270102
     tx_queue_0_packets: 7651168
     tx_queue_0_bytes: 7823281566
     tx_queue_0_restart: 4920
     tx_queue_1_packets: 10195557
     tx_queue_1_bytes: 12027522118
     tx_queue_1_restart: 12718
     rx_queue_0_packets: 15920618
     rx_queue_0_bytes: 15612581951
     rx_queue_0_csum_err: 76
(I've only done two runs since I re-enabled the interface)
---

> > lspci -s 03:00.0  -vvv
> > 03:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network
> > Connection (rev 03)
> > Subsystem: ASUSTeK Computer Inc. I211 Gigabit Network Connection
> > Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
> > Stepping- SERR- FastB2B- DisINTx+
> > Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
> > <TAbort- <MAbort- >SERR- <PERR- INTx-
> > Latency: 0
> > Interrupt: pin A routed to IRQ 57
> > IOMMU group: 20
> > Region 0: Memory at fc900000 (32-bit, non-prefetchable) [size=128K]
> > Region 2: I/O ports at e000 [size=32]
> > Region 3: Memory at fc920000 (32-bit, non-prefetchable) [size=16K]
> > Capabilities: [40] Power Management version 3
> > Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
> > Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=1 PME-
> > Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
> > Address: 0000000000000000  Data: 0000
> > Masking: 00000000  Pending: 00000000
> > Capabilities: [70] MSI-X: Enable+ Count=5 Masked-
> > Vector table: BAR=3 offset=00000000
> > PBA: BAR=3 offset=00002000
> > Capabilities: [a0] Express (v2) Endpoint, MSI 00
> > DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
> > ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
> > DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
> > RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+ FLReset-
> > MaxPayload 128 bytes, MaxReadReq 512 bytes
> > DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr+ TransPend-
> > LnkCap: Port #3, Speed 2.5GT/s, Width x1, ASPM L0s L1, Exit Latency
> > L0s <2us, L1 <16us
> > ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
> > LnkCtl: ASPM L1 Enabled; RCB 64 bytes, Disabled- CommClk+
> > ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
> > LnkSta: Speed 2.5GT/s (ok), Width x1 (ok)
> > TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
>
> PCIe-wise the connection is going to be pretty tight in terms of
> bandwidth: it looks like we have 2.5GT/s with only a single lane of
> PCIe. In addition we are running with ASPM enabled, which means that
> when there isn't enough traffic the one PCIe lane we have gets powered
> down, so bursty traffic can get ugly.

Hmm... is there a way to force-disable ASPM in sysfs?
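(Guessing here - maybe the pcie_aspm policy knob would do it, something like:

echo performance > /sys/module/pcie_aspm/parameters/policy

or booting with pcie_aspm=off if that doesn't stick?)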

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [Intel-wired-lan] NAT performance issue 944mbit -> ~40mbit
  2020-07-15 22:51                 ` Ian Kumlien
@ 2020-07-15 23:41                   ` Alexander Duyck
  -1 siblings, 0 replies; 51+ messages in thread
From: Alexander Duyck @ 2020-07-15 23:41 UTC (permalink / raw)
  To: Ian Kumlien
  Cc: Jakub Kicinski, Linux Kernel Network Developers, intel-wired-lan

On Wed, Jul 15, 2020 at 3:51 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
>
> On Thu, Jul 16, 2020 at 12:32 AM Alexander Duyck
> <alexander.duyck@gmail.com> wrote:
> > On Wed, Jul 15, 2020 at 3:00 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > > On Wed, Jul 15, 2020 at 11:40 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > > > On Wed, 15 Jul 2020 23:12:23 +0200 Ian Kumlien wrote:
> > > > > On Wed, Jul 15, 2020 at 11:02 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > > > > > On Wed, Jul 15, 2020 at 10:31 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > > > > > > On Wed, 15 Jul 2020 22:05:58 +0200 Ian Kumlien wrote:
> > > > > > > > After a  lot of debugging it turns out that the bug is in igb...
> > > > > > > >
> > > > > > > > driver: igb
> > > > > > > > version: 5.6.0-k
> > > > > > > > firmware-version:  0. 6-1
> > > > > > > >
> > > > > > > > 03:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network
> > > > > > > > Connection (rev 03)
> > > > > > >
> > > > > > > Unclear to me what you're actually reporting. Is this a regression
> > > > > > > after a kernel upgrade? Compared to no NAT?
> > > > > >
> > > > > > It only happens on "internet links"
> > > > > >
> > > > > > Let's say that A is a client with the igb driver, B is a firewall running NAT
> > > > > > with the ixgbe driver, C is another local node with igb, and
> > > > > > D is a remote node with a bridge backed by a bnx2 interface.
> > > > > >
> > > > > > A -> B -> C is ok (B and C are on the same switch)
> > > > > >
> > > > > > A -> B -> D -- 32-40mbit
> > > > > >
> > > > > > B -> D 944 mbit
> > > > > > C -> D 944 mbit
> > > > > >
> > > > > > A' -> D ~933 mbit (A with realtek nic -- also link is not idle atm)
> > > > >
> > > > > This should of course be A' -> B -> D
> > > > >
> > > > > Sorry, I've been scratching my head for about a week...
> > > >
> > > > Hm, only thing that comes to mind if A' works reliably and A doesn't is
> > > > that A has somehow broken TCP offloads. Could you try disabling things
> > > > via ethtool -K and see if those settings make a difference?
> > >
> > > It's a bit hard since it behaves like this; with TSO turned off:
> > > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > > [  5]   0.00-1.00   sec   108 MBytes   902 Mbits/sec    0    783 KBytes
> > > [  5]   1.00-2.00   sec   110 MBytes   923 Mbits/sec   31    812 KBytes
> > > [  5]   2.00-3.00   sec   111 MBytes   933 Mbits/sec   92    772 KBytes
> > > [  5]   3.00-4.00   sec   110 MBytes   923 Mbits/sec    0    834 KBytes
> > > [  5]   4.00-5.00   sec   111 MBytes   933 Mbits/sec   60    823 KBytes
> > > [  5]   5.00-6.00   sec   110 MBytes   923 Mbits/sec   31    789 KBytes
> > > [  5]   6.00-7.00   sec   111 MBytes   933 Mbits/sec    0    786 KBytes
> > > [  5]   7.00-8.00   sec   110 MBytes   923 Mbits/sec    0    761 KBytes
> > > [  5]   8.00-9.00   sec   110 MBytes   923 Mbits/sec    0    772 KBytes
> > > [  5]   9.00-10.00  sec   109 MBytes   912 Mbits/sec    0    868 KBytes
> > > - - - - - - - - - - - - - - - - - - - - - - - - -
> > > [ ID] Interval           Transfer     Bitrate         Retr
> > > [  5]   0.00-10.00  sec  1.07 GBytes   923 Mbits/sec  214             sender
> > > [  5]   0.00-10.00  sec  1.07 GBytes   920 Mbits/sec                  receiver
> > >
> > > Continued running tests:
> > > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > > [  5]   0.00-1.00   sec  5.82 MBytes  48.8 Mbits/sec    0   82.0 KBytes
> > > [  5]   1.00-2.00   sec  4.97 MBytes  41.7 Mbits/sec    0    130 KBytes
> > > [  5]   2.00-3.00   sec  5.28 MBytes  44.3 Mbits/sec    0   99.0 KBytes
> > > [  5]   3.00-4.00   sec  5.28 MBytes  44.3 Mbits/sec    0    105 KBytes
> > > [  5]   4.00-5.00   sec  5.28 MBytes  44.3 Mbits/sec    0    122 KBytes
> > > [  5]   5.00-6.00   sec  5.28 MBytes  44.3 Mbits/sec    0   82.0 KBytes
> > > [  5]   6.00-7.00   sec  5.28 MBytes  44.3 Mbits/sec    0   79.2 KBytes
> > > [  5]   7.00-8.00   sec  5.28 MBytes  44.3 Mbits/sec    0    110 KBytes
> > > [  5]   8.00-9.00   sec  5.28 MBytes  44.3 Mbits/sec    0    156 KBytes
> > > [  5]   9.00-10.00  sec  5.28 MBytes  44.3 Mbits/sec    0   87.7 KBytes
> > > - - - - - - - - - - - - - - - - - - - - - - - - -
> > > [ ID] Interval           Transfer     Bitrate         Retr
> > > [  5]   0.00-10.00  sec  53.0 MBytes  44.5 Mbits/sec    0             sender
> > > [  5]   0.00-10.00  sec  52.5 MBytes  44.1 Mbits/sec                  receiver
> > >
> > > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > > [  5]   0.00-1.00   sec  7.08 MBytes  59.4 Mbits/sec    0    156 KBytes
> > > [  5]   1.00-2.00   sec  5.97 MBytes  50.0 Mbits/sec    0    110 KBytes
> > > [  5]   2.00-3.00   sec  4.97 MBytes  41.7 Mbits/sec    0    124 KBytes
> > > [  5]   3.00-4.00   sec  5.47 MBytes  45.9 Mbits/sec    0   96.2 KBytes
> > > [  5]   4.00-5.00   sec  5.47 MBytes  45.9 Mbits/sec    0    158 KBytes
> > > [  5]   5.00-6.00   sec  4.97 MBytes  41.7 Mbits/sec    0   70.7 KBytes
> > > [  5]   6.00-7.00   sec  5.47 MBytes  45.9 Mbits/sec    0    113 KBytes
> > > [  5]   7.00-8.00   sec  5.47 MBytes  45.9 Mbits/sec    0   96.2 KBytes
> > > [  5]   8.00-9.00   sec  4.97 MBytes  41.7 Mbits/sec    0   84.8 KBytes
> > > [  5]   9.00-10.00  sec  5.47 MBytes  45.9 Mbits/sec    0    116 KBytes
> > > - - - - - - - - - - - - - - - - - - - - - - - - -
> > > [ ID] Interval           Transfer     Bitrate         Retr
> > > [  5]   0.00-10.00  sec  55.3 MBytes  46.4 Mbits/sec    0             sender
> > > [  5]   0.00-10.00  sec  53.9 MBytes  45.2 Mbits/sec                  receiver
> > >
> > > And the low bandwidth continues with:
> > > ethtool -k enp3s0 |grep ": on"
> > > rx-vlan-offload: on
> > > tx-vlan-offload: on [requested off]
> > > highdma: on [fixed]
> > > rx-vlan-filter: on [fixed]
> > > tx-gre-segmentation: on
> > > tx-gre-csum-segmentation: on
> > > tx-ipxip4-segmentation: on
> > > tx-ipxip6-segmentation: on
> > > tx-udp_tnl-segmentation: on
> > > tx-udp_tnl-csum-segmentation: on
> > > tx-gso-partial: on
> > > tx-udp-segmentation: on
> > > hw-tc-offload: on
> > >
> > > Can't quite figure out how to turn those off since they aren't listed as
> > > ethtool options (the displayed text isn't what you use to enable/disable them)
> >
> > To disable them you just repeat the same string shown in the display
> > output. So it should just be "ethtool -K enp3s0 tx-gso-partial off",
> > and that would turn off a large chunk of them since all the
> > encapsulation offloads require GSO partial support.
>
>  ethtool -k enp3s0 |grep ": on"
> highdma: on [fixed]
> rx-vlan-filter: on [fixed]
> ---
> And then back to back:
> [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> [  5]   0.00-1.00   sec  4.91 MBytes  41.2 Mbits/sec    0   45.2 KBytes
> [  5]   1.00-2.00   sec  4.47 MBytes  37.5 Mbits/sec    0   52.3 KBytes
> [  5]   2.00-3.00   sec  4.47 MBytes  37.5 Mbits/sec    0   42.4 KBytes
> [  5]   3.00-4.00   sec  4.47 MBytes  37.5 Mbits/sec    0    141 KBytes
> [  5]   4.00-5.00   sec   111 MBytes   928 Mbits/sec   63    764 KBytes
> [  5]   5.00-6.00   sec  86.2 MBytes   724 Mbits/sec    0    744 KBytes
> [  5]   6.00-7.00   sec  98.8 MBytes   828 Mbits/sec   61    769 KBytes
> [  5]   7.00-8.00   sec   110 MBytes   923 Mbits/sec    0    749 KBytes
> [  5]   8.00-9.00   sec   110 MBytes   923 Mbits/sec    0    741 KBytes
> [  5]   9.00-10.00  sec   110 MBytes   923 Mbits/sec   31    761 KBytes
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bitrate         Retr
> [  5]   0.00-10.00  sec   644 MBytes   540 Mbits/sec  155             sender
> [  5]   0.00-10.01  sec   641 MBytes   537 Mbits/sec                  receiver
>
> and we're back at the not working bit:
> [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> [  5]   0.00-1.00   sec  4.84 MBytes  40.6 Mbits/sec    0   42.4 KBytes
> [  5]   1.00-2.00   sec  4.60 MBytes  38.6 Mbits/sec    0   42.4 KBytes
> [  5]   2.00-3.00   sec  4.23 MBytes  35.4 Mbits/sec    0   42.4 KBytes
> [  5]   3.00-4.00   sec  4.47 MBytes  37.5 Mbits/sec    0   67.9 KBytes
> [  5]   4.00-5.00   sec  4.47 MBytes  37.5 Mbits/sec    0   42.4 KBytes
> [  5]   5.00-6.00   sec  4.23 MBytes  35.4 Mbits/sec    0   42.4 KBytes
> [  5]   6.00-7.00   sec  4.23 MBytes  35.4 Mbits/sec    0   42.4 KBytes
> [  5]   7.00-8.00   sec  4.47 MBytes  37.5 Mbits/sec    0   67.9 KBytes
> [  5]   8.00-9.00   sec  4.47 MBytes  37.5 Mbits/sec    0   53.7 KBytes
> [  5]   9.00-10.00  sec  4.47 MBytes  37.5 Mbits/sec    0   79.2 KBytes
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bitrate         Retr
> [  5]   0.00-10.00  sec  44.5 MBytes  37.3 Mbits/sec    0             sender
> [  5]   0.00-10.00  sec  43.9 MBytes  36.8 Mbits/sec                  receiver
>
> > > I was hoping that you'd have a clue about something that might introduce
> > > a regression - i.e. specific patches to try to revert
> > >
> > > Btw, the same issue applies to UDP as well
> > >
> > > [ ID] Interval           Transfer     Bitrate         Total Datagrams
> > > [  5]   0.00-1.00   sec  6.77 MBytes  56.8 Mbits/sec  4900
> > > [  5]   1.00-2.00   sec  4.27 MBytes  35.8 Mbits/sec  3089
> > > [  5]   2.00-3.00   sec  4.20 MBytes  35.2 Mbits/sec  3041
> > > [  5]   3.00-4.00   sec  4.30 MBytes  36.1 Mbits/sec  3116
> > > [  5]   4.00-5.00   sec  4.24 MBytes  35.6 Mbits/sec  3070
> > > [  5]   5.00-6.00   sec  4.21 MBytes  35.3 Mbits/sec  3047
> > > [  5]   6.00-7.00   sec  4.29 MBytes  36.0 Mbits/sec  3110
> > > [  5]   7.00-8.00   sec  4.28 MBytes  35.9 Mbits/sec  3097
> > > [  5]   8.00-9.00   sec  4.25 MBytes  35.6 Mbits/sec  3075
> > > [  5]   9.00-10.00  sec  4.20 MBytes  35.2 Mbits/sec  3039
> > > - - - - - - - - - - - - - - - - - - - - - - - - -
> > > [ ID] Interval           Transfer     Bitrate         Jitter
> > > Lost/Total Datagrams
> > > [  5]   0.00-10.00  sec  45.0 MBytes  37.7 Mbits/sec  0.000 ms
> > > 0/32584 (0%)  sender
> > > [  5]   0.00-10.00  sec  45.0 MBytes  37.7 Mbits/sec  0.037 ms
> > > 0/32573 (0%)  receiver
> > >
> > > vs:
> > >
> > > [ ID] Interval           Transfer     Bitrate         Total Datagrams
> > > [  5]   0.00-1.00   sec   114 MBytes   954 Mbits/sec  82342
> > > [  5]   1.00-2.00   sec   114 MBytes   955 Mbits/sec  82439
> > > [  5]   2.00-3.00   sec   114 MBytes   956 Mbits/sec  82507
> > > [  5]   3.00-4.00   sec   114 MBytes   955 Mbits/sec  82432
> > > [  5]   4.00-5.00   sec   114 MBytes   956 Mbits/sec  82535
> > > [  5]   5.00-6.00   sec   114 MBytes   953 Mbits/sec  82240
> > > [  5]   6.00-7.00   sec   114 MBytes   956 Mbits/sec  82512
> > > [  5]   7.00-8.00   sec   114 MBytes   956 Mbits/sec  82503
> > > [  5]   8.00-9.00   sec   114 MBytes   956 Mbits/sec  82532
> > > [  5]   9.00-10.00  sec   114 MBytes   956 Mbits/sec  82488
> > > - - - - - - - - - - - - - - - - - - - - - - - - -
> > > [ ID] Interval           Transfer     Bitrate         Jitter
> > > Lost/Total Datagrams
> > > [  5]   0.00-10.00  sec  1.11 GBytes   955 Mbits/sec  0.000 ms
> > > 0/824530 (0%)  sender
> > > [  5]   0.00-10.01  sec  1.11 GBytes   949 Mbits/sec  0.014 ms
> > > 4756/824530 (0.58%)  receiver
> >
> > The fact that it is impacting UDP seems odd. I wonder if we don't have
> > a qdisc somewhere that is misbehaving and throttling the Tx. Either
> > that or I wonder if we are getting spammed with flow control frames.
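(Two quick checks for those two theories, a sketch; the exact flow-control
counter names vary per driver, so the grep below is only a filter:

  # look for drops or backlog building up in the qdisc layer
  tc -s qdisc show dev enp3s0
  # check whether pause frames are negotiated and being counted
  ethtool -a enp3s0
  ethtool -S enp3s0 | grep -i -E "pause|flow_control"
)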
>
> It sometimes works; it looks like the congestion window just isn't
> increased - that's where I started...
>
> Example:
> [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> [  5]   0.00-1.00   sec  4.86 MBytes  40.8 Mbits/sec    0   50.9 KBytes
> [  5]   1.00-2.00   sec  4.66 MBytes  39.1 Mbits/sec    0   65.0 KBytes
> [  5]   2.00-3.00   sec  4.29 MBytes  36.0 Mbits/sec    0   42.4 KBytes
> [  5]   3.00-4.00   sec  4.66 MBytes  39.1 Mbits/sec    0   42.4 KBytes
> [  5]   4.00-5.00   sec  23.1 MBytes   194 Mbits/sec    0   1.07 MBytes
> [  5]   5.00-6.00   sec   110 MBytes   923 Mbits/sec    0    761 KBytes
> [  5]   6.00-7.00   sec  98.8 MBytes   828 Mbits/sec   60    806 KBytes
> [  5]   7.00-8.00   sec  82.5 MBytes   692 Mbits/sec    0    812 KBytes
> [  5]   8.00-9.00   sec   110 MBytes   923 Mbits/sec   92    761 KBytes
> [  5]   9.00-10.00  sec   111 MBytes   933 Mbits/sec    0    755 KBytes
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bitrate         Retr
> [  5]   0.00-10.00  sec   554 MBytes   465 Mbits/sec  152             sender
> [  5]   0.00-10.00  sec   550 MBytes   461 Mbits/sec                  receiver
>
> > It would be useful to include the output of just calling "ethtool
> > enp3s0" on the interface to verify the speed, "ethtool -a enp3s0" to
> > verify flow control settings, and "ethtool -S enp3s0 | grep -v :\ 0"
> > to output the statistics and dump anything that isn't zero.
>
> ethtool enp3s0
> Settings for enp3s0:
> Supported ports: [ TP ]
> Supported link modes:   10baseT/Half 10baseT/Full
>                         100baseT/Half 100baseT/Full
>                         1000baseT/Full
> Supported pause frame use: Symmetric
> Supports auto-negotiation: Yes
> Supported FEC modes: Not reported
> Advertised link modes:  10baseT/Half 10baseT/Full
>                         100baseT/Half 100baseT/Full
>                         1000baseT/Full
> Advertised pause frame use: Symmetric
> Advertised auto-negotiation: Yes
> Advertised FEC modes: Not reported
> Speed: 1000Mb/s
> Duplex: Full
> Auto-negotiation: on
> Port: Twisted Pair
> PHYAD: 1
> Transceiver: internal
> MDI-X: off (auto)
> Supports Wake-on: pumbg
> Wake-on: g
>         Current message level: 0x00000007 (7)
>                                drv probe link
> Link detected: yes
> ---
> ethtool -a enp3s0
> Pause parameters for enp3s0:
> Autonegotiate: on
> RX: on
> TX: off
> ---
> ethtool -S enp3s0 |grep  -v :\ 0
> NIC statistics:
>      rx_packets: 15920618
>      tx_packets: 17846725
>      rx_bytes: 15676264423
>      tx_bytes: 19925010639
>      rx_broadcast: 119553
>      tx_broadcast: 497
>      rx_multicast: 330193
>      tx_multicast: 18190
>      multicast: 330193
>      rx_missed_errors: 270102
>      rx_long_length_errors: 6
>      tx_tcp_seg_good: 1342561
>      rx_long_byte_count: 15676264423
>      rx_errors: 6
>      rx_length_errors: 6
>      rx_fifo_errors: 270102
>      tx_queue_0_packets: 7651168
>      tx_queue_0_bytes: 7823281566
>      tx_queue_0_restart: 4920
>      tx_queue_1_packets: 10195557
>      tx_queue_1_bytes: 12027522118
>      tx_queue_1_restart: 12718
>      rx_queue_0_packets: 15920618
>      rx_queue_0_bytes: 15612581951
>      rx_queue_0_csum_err: 76
> (I've only done two runs since I re-enabled the interface)

So I am seeing three things here.

The rx_long_length_errors are usually due to an MTU mismatch. Do you
have something on the network that is using jumbo frames, or is the
MTU on the NIC set to something smaller than what is supported on the
network?
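(A quick way to check both, a sketch assuming a 9000-byte jumbo MTU on the
LAN and substituting a real LAN host for the placeholder:

  ip link show enp3s0              # current MTU on the NIC
  ping -M do -s 8972 <lan-host>    # 8972 + 28 bytes of headers = 9000, DF set
)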

You are getting rx_missed_errors, which would seem to imply that the
DMA is not able to keep up. We may want to try disabling ASPM L1 to see
if we get any boost from doing that.

The last bit is that queue 0 is seeing packets with bad checksums. You
might want to run some tests and see where the bad checksums are
coming from. If they are being detected on a specific NIC, such as
the ixgbe in your example, it might point to some sort of checksum
error being created as a result of the NAT translation.
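(One way to do that, a sketch assuming the same interface name and an iperf3
server to test against; only counters that already appear above are used:

  # snapshot the suspicious counters, run a slow transfer, then compare
  ethtool -S enp3s0 | grep -E "rx_missed_errors|csum_err|restart"
  iperf3 -c <server> -t 10
  ethtool -S enp3s0 | grep -E "rx_missed_errors|csum_err|restart"

If rx_queue_0_csum_err only climbs on the NATed path, that points at the
translation rather than the link itself.)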

> ---
>
> > > lspci -s 03:00.0  -vvv
> > > 03:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network
> > > Connection (rev 03)
> > > Subsystem: ASUSTeK Computer Inc. I211 Gigabit Network Connection
> > > Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
> > > Stepping- SERR- FastB2B- DisINTx+
> > > Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
> > > <TAbort- <MAbort- >SERR- <PERR- INTx-
> > > Latency: 0
> > > Interrupt: pin A routed to IRQ 57
> > > IOMMU group: 20
> > > Region 0: Memory at fc900000 (32-bit, non-prefetchable) [size=128K]
> > > Region 2: I/O ports at e000 [size=32]
> > > Region 3: Memory at fc920000 (32-bit, non-prefetchable) [size=16K]
> > > Capabilities: [40] Power Management version 3
> > > Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
> > > Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=1 PME-
> > > Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
> > > Address: 0000000000000000  Data: 0000
> > > Masking: 00000000  Pending: 00000000
> > > Capabilities: [70] MSI-X: Enable+ Count=5 Masked-
> > > Vector table: BAR=3 offset=00000000
> > > PBA: BAR=3 offset=00002000
> > > Capabilities: [a0] Express (v2) Endpoint, MSI 00
> > > DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
> > > ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
> > > DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
> > > RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+ FLReset-
> > > MaxPayload 128 bytes, MaxReadReq 512 bytes
> > > DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr+ TransPend-
> > > LnkCap: Port #3, Speed 2.5GT/s, Width x1, ASPM L0s L1, Exit Latency
> > > L0s <2us, L1 <16us
> > > ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
> > > LnkCtl: ASPM L1 Enabled; RCB 64 bytes, Disabled- CommClk+
> > > ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
> > > LnkSta: Speed 2.5GT/s (ok), Width x1 (ok)
> > > TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
> >
> > PCIe-wise the connection is going to be pretty tight in terms of
> > bandwidth. It looks like we have 2.5GT/s with only a single lane of
> > PCIe. In addition, we are running with ASPM enabled, which means that
> > if we don't have enough traffic we are shutting off the one PCIe lane
> > we have, so if we are getting bursty traffic that can get ugly.
>
> Humm... is there a way to force disable ASPM in sysfs?

Actually the easiest way to do this is to just use setpci.

You should be able to dump the word containing the setting via:
# setpci -s 3:00.0 0xB0.w
0042
# setpci -s 3:00.0 0xB0.w=0040

Basically what you do is clear the two low bits of the value (the ASPM
Control field of the Link Control register), so in this case that means
replacing the 2 with a 0 based on the output of the first command.
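(Scripted, the same read-modify-write looks roughly like this; it assumes the
Link Control register of this device really is at 0xB0 - the PCIe capability
at [a0] plus the 0x10 Link Control offset - and the change does not survive a
reboot or driver reload:

  # read Link Control, clear the ASPM Control bits [1:0], write it back
  VAL=$(setpci -s 03:00.0 0xB0.w)
  setpci -s 03:00.0 0xB0.w=$(printf "%04x" $(( 0x$VAL & ~0x3 )))
)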

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [Intel-wired-lan] NAT performance issue 944mbit -> ~40mbit
  2020-07-15 23:41                   ` Alexander Duyck
@ 2020-07-15 23:59                     ` Ian Kumlien
  -1 siblings, 0 replies; 51+ messages in thread
From: Ian Kumlien @ 2020-07-15 23:59 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Jakub Kicinski, Linux Kernel Network Developers, intel-wired-lan

On Thu, Jul 16, 2020 at 1:42 AM Alexander Duyck
<alexander.duyck@gmail.com> wrote:
> On Wed, Jul 15, 2020 at 3:51 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > On Thu, Jul 16, 2020 at 12:32 AM Alexander Duyck
> > <alexander.duyck@gmail.com> wrote:
> > > On Wed, Jul 15, 2020 at 3:00 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > > > On Wed, Jul 15, 2020 at 11:40 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > > > > On Wed, 15 Jul 2020 23:12:23 +0200 Ian Kumlien wrote:
> > > > > > On Wed, Jul 15, 2020 at 11:02 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > > > > > > On Wed, Jul 15, 2020 at 10:31 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > > > > > > > On Wed, 15 Jul 2020 22:05:58 +0200 Ian Kumlien wrote:
> > > > > > > > > After a  lot of debugging it turns out that the bug is in igb...
> > > > > > > > >
> > > > > > > > > driver: igb
> > > > > > > > > version: 5.6.0-k
> > > > > > > > > firmware-version:  0. 6-1
> > > > > > > > >
> > > > > > > > > 03:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network
> > > > > > > > > Connection (rev 03)
> > > > > > > >
> > > > > > > > Unclear to me what you're actually reporting. Is this a regression
> > > > > > > > after a kernel upgrade? Compared to no NAT?
> > > > > > >
> > > > > > > It only happens on "internet links"
> > > > > > >
> > > > > > > Let's say that A is a client with the igb driver, B is a firewall running NAT
> > > > > > > with ixgbe drivers, C is another local node with igb and
> > > > > > > D is a remote node with a bridge backed by a bnx2 interface.
> > > > > > >
> > > > > > > A -> B -> C is ok (B and C are on the same switch)
> > > > > > >
> > > > > > > A -> B -> D -- 32-40mbit
> > > > > > >
> > > > > > > B -> D 944 mbit
> > > > > > > C -> D 944 mbit
> > > > > > >
> > > > > > > A' -> D ~933 mbit (A with realtek nic -- also link is not idle atm)
> > > > > >
> > > > > > This should of course be A' -> B -> D
> > > > > >
> > > > > > Sorry, I've been scratching my head for about a week...
> > > > >
> > > > > Hm, only thing that comes to mind if A' works reliably and A doesn't is
> > > > > that A has somehow broken TCP offloads. Could you try disabling things
> > > > > via ethtool -K and see if those settings make a difference?
> > > >
> > > > It's a bit hard since it works like this, turned tso off:
> > > > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > > > [  5]   0.00-1.00   sec   108 MBytes   902 Mbits/sec    0    783 KBytes
> > > > [  5]   1.00-2.00   sec   110 MBytes   923 Mbits/sec   31    812 KBytes
> > > > [  5]   2.00-3.00   sec   111 MBytes   933 Mbits/sec   92    772 KBytes
> > > > [  5]   3.00-4.00   sec   110 MBytes   923 Mbits/sec    0    834 KBytes
> > > > [  5]   4.00-5.00   sec   111 MBytes   933 Mbits/sec   60    823 KBytes
> > > > [  5]   5.00-6.00   sec   110 MBytes   923 Mbits/sec   31    789 KBytes
> > > > [  5]   6.00-7.00   sec   111 MBytes   933 Mbits/sec    0    786 KBytes
> > > > [  5]   7.00-8.00   sec   110 MBytes   923 Mbits/sec    0    761 KBytes
> > > > [  5]   8.00-9.00   sec   110 MBytes   923 Mbits/sec    0    772 KBytes
> > > > [  5]   9.00-10.00  sec   109 MBytes   912 Mbits/sec    0    868 KBytes
> > > > - - - - - - - - - - - - - - - - - - - - - - - - -
> > > > [ ID] Interval           Transfer     Bitrate         Retr
> > > > [  5]   0.00-10.00  sec  1.07 GBytes   923 Mbits/sec  214             sender
> > > > [  5]   0.00-10.00  sec  1.07 GBytes   920 Mbits/sec                  receiver
> > > >
> > > > Continued running tests:
> > > > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > > > [  5]   0.00-1.00   sec  5.82 MBytes  48.8 Mbits/sec    0   82.0 KBytes
> > > > [  5]   1.00-2.00   sec  4.97 MBytes  41.7 Mbits/sec    0    130 KBytes
> > > > [  5]   2.00-3.00   sec  5.28 MBytes  44.3 Mbits/sec    0   99.0 KBytes
> > > > [  5]   3.00-4.00   sec  5.28 MBytes  44.3 Mbits/sec    0    105 KBytes
> > > > [  5]   4.00-5.00   sec  5.28 MBytes  44.3 Mbits/sec    0    122 KBytes
> > > > [  5]   5.00-6.00   sec  5.28 MBytes  44.3 Mbits/sec    0   82.0 KBytes
> > > > [  5]   6.00-7.00   sec  5.28 MBytes  44.3 Mbits/sec    0   79.2 KBytes
> > > > [  5]   7.00-8.00   sec  5.28 MBytes  44.3 Mbits/sec    0    110 KBytes
> > > > [  5]   8.00-9.00   sec  5.28 MBytes  44.3 Mbits/sec    0    156 KBytes
> > > > [  5]   9.00-10.00  sec  5.28 MBytes  44.3 Mbits/sec    0   87.7 KBytes
> > > > - - - - - - - - - - - - - - - - - - - - - - - - -
> > > > [ ID] Interval           Transfer     Bitrate         Retr
> > > > [  5]   0.00-10.00  sec  53.0 MBytes  44.5 Mbits/sec    0             sender
> > > > [  5]   0.00-10.00  sec  52.5 MBytes  44.1 Mbits/sec                  receiver
> > > >
> > > > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > > > [  5]   0.00-1.00   sec  7.08 MBytes  59.4 Mbits/sec    0    156 KBytes
> > > > [  5]   1.00-2.00   sec  5.97 MBytes  50.0 Mbits/sec    0    110 KBytes
> > > > [  5]   2.00-3.00   sec  4.97 MBytes  41.7 Mbits/sec    0    124 KBytes
> > > > [  5]   3.00-4.00   sec  5.47 MBytes  45.9 Mbits/sec    0   96.2 KBytes
> > > > [  5]   4.00-5.00   sec  5.47 MBytes  45.9 Mbits/sec    0    158 KBytes
> > > > [  5]   5.00-6.00   sec  4.97 MBytes  41.7 Mbits/sec    0   70.7 KBytes
> > > > [  5]   6.00-7.00   sec  5.47 MBytes  45.9 Mbits/sec    0    113 KBytes
> > > > [  5]   7.00-8.00   sec  5.47 MBytes  45.9 Mbits/sec    0   96.2 KBytes
> > > > [  5]   8.00-9.00   sec  4.97 MBytes  41.7 Mbits/sec    0   84.8 KBytes
> > > > [  5]   9.00-10.00  sec  5.47 MBytes  45.9 Mbits/sec    0    116 KBytes
> > > > - - - - - - - - - - - - - - - - - - - - - - - - -
> > > > [ ID] Interval           Transfer     Bitrate         Retr
> > > > [  5]   0.00-10.00  sec  55.3 MBytes  46.4 Mbits/sec    0             sender
> > > > [  5]   0.00-10.00  sec  53.9 MBytes  45.2 Mbits/sec                  receiver
> > > >
> > > > And the low bandwidth continues with:
> > > > ethtool -k enp3s0 |grep ": on"
> > > > rx-vlan-offload: on
> > > > tx-vlan-offload: on [requested off]
> > > > highdma: on [fixed]
> > > > rx-vlan-filter: on [fixed]
> > > > tx-gre-segmentation: on
> > > > tx-gre-csum-segmentation: on
> > > > tx-ipxip4-segmentation: on
> > > > tx-ipxip6-segmentation: on
> > > > tx-udp_tnl-segmentation: on
> > > > tx-udp_tnl-csum-segmentation: on
> > > > tx-gso-partial: on
> > > > tx-udp-segmentation: on
> > > > hw-tc-offload: on
> > > >
> > > > Can't quite find how to turn those off, since the text ethtool displays
> > > > isn't necessarily the string you use to enable/disable them
> > >
> > > To disable them you would just pass the same string that is shown in
> > > the display. So it should just be "ethtool -K enp3s0 tx-gso-partial off",
> > > and that would turn off a large chunk of them, as all the encapsulated
> > > support requires GSO partial support.
> >
> >  ethtool -k enp3s0 |grep ": on"
> > highdma: on [fixed]
> > rx-vlan-filter: on [fixed]
> > ---
> > And then back to back:
> > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > [  5]   0.00-1.00   sec  4.91 MBytes  41.2 Mbits/sec    0   45.2 KBytes
> > [  5]   1.00-2.00   sec  4.47 MBytes  37.5 Mbits/sec    0   52.3 KBytes
> > [  5]   2.00-3.00   sec  4.47 MBytes  37.5 Mbits/sec    0   42.4 KBytes
> > [  5]   3.00-4.00   sec  4.47 MBytes  37.5 Mbits/sec    0    141 KBytes
> > [  5]   4.00-5.00   sec   111 MBytes   928 Mbits/sec   63    764 KBytes
> > [  5]   5.00-6.00   sec  86.2 MBytes   724 Mbits/sec    0    744 KBytes
> > [  5]   6.00-7.00   sec  98.8 MBytes   828 Mbits/sec   61    769 KBytes
> > [  5]   7.00-8.00   sec   110 MBytes   923 Mbits/sec    0    749 KBytes
> > [  5]   8.00-9.00   sec   110 MBytes   923 Mbits/sec    0    741 KBytes
> > [  5]   9.00-10.00  sec   110 MBytes   923 Mbits/sec   31    761 KBytes
> > - - - - - - - - - - - - - - - - - - - - - - - - -
> > [ ID] Interval           Transfer     Bitrate         Retr
> > [  5]   0.00-10.00  sec   644 MBytes   540 Mbits/sec  155             sender
> > [  5]   0.00-10.01  sec   641 MBytes   537 Mbits/sec                  receiver
> >
> > and we're back at the not working bit:
> > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > [  5]   0.00-1.00   sec  4.84 MBytes  40.6 Mbits/sec    0   42.4 KBytes
> > [  5]   1.00-2.00   sec  4.60 MBytes  38.6 Mbits/sec    0   42.4 KBytes
> > [  5]   2.00-3.00   sec  4.23 MBytes  35.4 Mbits/sec    0   42.4 KBytes
> > [  5]   3.00-4.00   sec  4.47 MBytes  37.5 Mbits/sec    0   67.9 KBytes
> > [  5]   4.00-5.00   sec  4.47 MBytes  37.5 Mbits/sec    0   42.4 KBytes
> > [  5]   5.00-6.00   sec  4.23 MBytes  35.4 Mbits/sec    0   42.4 KBytes
> > [  5]   6.00-7.00   sec  4.23 MBytes  35.4 Mbits/sec    0   42.4 KBytes
> > [  5]   7.00-8.00   sec  4.47 MBytes  37.5 Mbits/sec    0   67.9 KBytes
> > [  5]   8.00-9.00   sec  4.47 MBytes  37.5 Mbits/sec    0   53.7 KBytes
> > [  5]   9.00-10.00  sec  4.47 MBytes  37.5 Mbits/sec    0   79.2 KBytes
> > - - - - - - - - - - - - - - - - - - - - - - - - -
> > [ ID] Interval           Transfer     Bitrate         Retr
> > [  5]   0.00-10.00  sec  44.5 MBytes  37.3 Mbits/sec    0             sender
> > [  5]   0.00-10.00  sec  43.9 MBytes  36.8 Mbits/sec                  receiver
> >
> > > > I was hoping that you'd have a clue about something that might introduce
> > > > a regression - i.e. specific patches to try to revert
> > > >
> > > > Btw, the same issue applies to UDP as well
> > > >
> > > > [ ID] Interval           Transfer     Bitrate         Total Datagrams
> > > > [  5]   0.00-1.00   sec  6.77 MBytes  56.8 Mbits/sec  4900
> > > > [  5]   1.00-2.00   sec  4.27 MBytes  35.8 Mbits/sec  3089
> > > > [  5]   2.00-3.00   sec  4.20 MBytes  35.2 Mbits/sec  3041
> > > > [  5]   3.00-4.00   sec  4.30 MBytes  36.1 Mbits/sec  3116
> > > > [  5]   4.00-5.00   sec  4.24 MBytes  35.6 Mbits/sec  3070
> > > > [  5]   5.00-6.00   sec  4.21 MBytes  35.3 Mbits/sec  3047
> > > > [  5]   6.00-7.00   sec  4.29 MBytes  36.0 Mbits/sec  3110
> > > > [  5]   7.00-8.00   sec  4.28 MBytes  35.9 Mbits/sec  3097
> > > > [  5]   8.00-9.00   sec  4.25 MBytes  35.6 Mbits/sec  3075
> > > > [  5]   9.00-10.00  sec  4.20 MBytes  35.2 Mbits/sec  3039
> > > > - - - - - - - - - - - - - - - - - - - - - - - - -
> > > > [ ID] Interval           Transfer     Bitrate         Jitter
> > > > Lost/Total Datagrams
> > > > [  5]   0.00-10.00  sec  45.0 MBytes  37.7 Mbits/sec  0.000 ms
> > > > 0/32584 (0%)  sender
> > > > [  5]   0.00-10.00  sec  45.0 MBytes  37.7 Mbits/sec  0.037 ms
> > > > 0/32573 (0%)  receiver
> > > >
> > > > vs:
> > > >
> > > > [ ID] Interval           Transfer     Bitrate         Total Datagrams
> > > > [  5]   0.00-1.00   sec   114 MBytes   954 Mbits/sec  82342
> > > > [  5]   1.00-2.00   sec   114 MBytes   955 Mbits/sec  82439
> > > > [  5]   2.00-3.00   sec   114 MBytes   956 Mbits/sec  82507
> > > > [  5]   3.00-4.00   sec   114 MBytes   955 Mbits/sec  82432
> > > > [  5]   4.00-5.00   sec   114 MBytes   956 Mbits/sec  82535
> > > > [  5]   5.00-6.00   sec   114 MBytes   953 Mbits/sec  82240
> > > > [  5]   6.00-7.00   sec   114 MBytes   956 Mbits/sec  82512
> > > > [  5]   7.00-8.00   sec   114 MBytes   956 Mbits/sec  82503
> > > > [  5]   8.00-9.00   sec   114 MBytes   956 Mbits/sec  82532
> > > > [  5]   9.00-10.00  sec   114 MBytes   956 Mbits/sec  82488
> > > > - - - - - - - - - - - - - - - - - - - - - - - - -
> > > > [ ID] Interval           Transfer     Bitrate         Jitter
> > > > Lost/Total Datagrams
> > > > [  5]   0.00-10.00  sec  1.11 GBytes   955 Mbits/sec  0.000 ms
> > > > 0/824530 (0%)  sender
> > > > [  5]   0.00-10.01  sec  1.11 GBytes   949 Mbits/sec  0.014 ms
> > > > 4756/824530 (0.58%)  receiver
> > >
> > > The fact that it is impacting UDP seems odd. I wonder if we don't have
> > > a qdisc somewhere that is misbehaving and throttling the Tx. Either
> > > that or I wonder if we are getting spammed with flow control frames.
> >
> > It sometimes works; it looks like the congestion window just isn't
> > increased - that's where I started...
> >
> > Example:
> > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > [  5]   0.00-1.00   sec  4.86 MBytes  40.8 Mbits/sec    0   50.9 KBytes
> > [  5]   1.00-2.00   sec  4.66 MBytes  39.1 Mbits/sec    0   65.0 KBytes
> > [  5]   2.00-3.00   sec  4.29 MBytes  36.0 Mbits/sec    0   42.4 KBytes
> > [  5]   3.00-4.00   sec  4.66 MBytes  39.1 Mbits/sec    0   42.4 KBytes
> > [  5]   4.00-5.00   sec  23.1 MBytes   194 Mbits/sec    0   1.07 MBytes
> > [  5]   5.00-6.00   sec   110 MBytes   923 Mbits/sec    0    761 KBytes
> > [  5]   6.00-7.00   sec  98.8 MBytes   828 Mbits/sec   60    806 KBytes
> > [  5]   7.00-8.00   sec  82.5 MBytes   692 Mbits/sec    0    812 KBytes
> > [  5]   8.00-9.00   sec   110 MBytes   923 Mbits/sec   92    761 KBytes
> > [  5]   9.00-10.00  sec   111 MBytes   933 Mbits/sec    0    755 KBytes
> > - - - - - - - - - - - - - - - - - - - - - - - - -
> > [ ID] Interval           Transfer     Bitrate         Retr
> > [  5]   0.00-10.00  sec   554 MBytes   465 Mbits/sec  152             sender
> > [  5]   0.00-10.00  sec   550 MBytes   461 Mbits/sec                  receiver
> >
> > > It would be useful to include the output of just calling "ethtool
> > > enp3s0" on the interface to verify the speed, "ethtool -a enp3s0" to
> > > verify flow control settings, and "ethtool -S enp3s0 | grep -v :\ 0"
> > > to output the statistics and dump anything that isn't zero.
> >
> > ethtool enp3s0
> > Settings for enp3s0:
> > Supported ports: [ TP ]
> > Supported link modes:   10baseT/Half 10baseT/Full
> >                         100baseT/Half 100baseT/Full
> >                         1000baseT/Full
> > Supported pause frame use: Symmetric
> > Supports auto-negotiation: Yes
> > Supported FEC modes: Not reported
> > Advertised link modes:  10baseT/Half 10baseT/Full
> >                         100baseT/Half 100baseT/Full
> >                         1000baseT/Full
> > Advertised pause frame use: Symmetric
> > Advertised auto-negotiation: Yes
> > Advertised FEC modes: Not reported
> > Speed: 1000Mb/s
> > Duplex: Full
> > Auto-negotiation: on
> > Port: Twisted Pair
> > PHYAD: 1
> > Transceiver: internal
> > MDI-X: off (auto)
> > Supports Wake-on: pumbg
> > Wake-on: g
> >         Current message level: 0x00000007 (7)
> >                                drv probe link
> > Link detected: yes
> > ---
> > ethtool -a enp3s0
> > Pause parameters for enp3s0:
> > Autonegotiate: on
> > RX: on
> > TX: off
> > ---
> > ethtool -S enp3s0 |grep  -v :\ 0
> > NIC statistics:
> >      rx_packets: 15920618
> >      tx_packets: 17846725
> >      rx_bytes: 15676264423
> >      tx_bytes: 19925010639
> >      rx_broadcast: 119553
> >      tx_broadcast: 497
> >      rx_multicast: 330193
> >      tx_multicast: 18190
> >      multicast: 330193
> >      rx_missed_errors: 270102
> >      rx_long_length_errors: 6
> >      tx_tcp_seg_good: 1342561
> >      rx_long_byte_count: 15676264423
> >      rx_errors: 6
> >      rx_length_errors: 6
> >      rx_fifo_errors: 270102
> >      tx_queue_0_packets: 7651168
> >      tx_queue_0_bytes: 7823281566
> >      tx_queue_0_restart: 4920
> >      tx_queue_1_packets: 10195557
> >      tx_queue_1_bytes: 12027522118
> >      tx_queue_1_restart: 12718
> >      rx_queue_0_packets: 15920618
> >      rx_queue_0_bytes: 15612581951
> >      rx_queue_0_csum_err: 76
> > (I've only done two runs since I re-enabled the interface)
>
> So I am seeing three things here.
>
> The rx_long_length_errors are usually due to an MTU mismatch. Do you
> have something on the network that is using jumbo frames, or is the
> MTU on the NIC set to something smaller than what is supported on the
> network?

I'm using jumbo frames on the local network; the internet side is the
normal 1500-byte MTU though

> You are getting rx_missed_errors, which would seem to imply that the
> DMA is not able to keep up. We may want to try disabling ASPM L1 to see
> if we get any boost from doing that.

It used to work, but I don't do benchmarks all the time and sometimes the
first benchmarks turn out fine... so it's hard to say when this started
happening...

It could also be related to a BIOS upgrade, but I'm pretty sure I did
successful benchmarks after that...

How do I disable L1? Just echo 0 >
/sys/bus/pci/drivers/igb/0000\:03\:00.0/link/l1_aspm ?
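(For reference, a sketch of that sysfs approach; it assumes the kernel was
built with CONFIG_PCIEASPM and exposes the per-link ASPM attributes for this
device - the /sys/bus/pci/devices path below points at the same node as the
driver symlink:

  # 1 = L1 may be entered on this link, 0 = L1 disabled
  cat /sys/bus/pci/devices/0000:03:00.0/link/l1_aspm
  echo 0 > /sys/bus/pci/devices/0000:03:00.0/link/l1_aspm
)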

> The last bit is that queue 0 is seeing packets with bad checksums. You
> might want to run some tests and see where the bad checksums are
> coming from. If they are being detected on a specific NIC, such as
> the ixgbe in your example, it might point to some sort of checksum
> error being created as a result of the NAT translation.

But that should also affect A' and the A -> B -> C case, which it doesn't...

It only seems to happen with higher rtt (6 hops, sub 3 ms in this case
but still high enough somehow)

> > ---
> >
> > > > lspci -s 03:00.0  -vvv
> > > > 03:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network
> > > > Connection (rev 03)
> > > > Subsystem: ASUSTeK Computer Inc. I211 Gigabit Network Connection
> > > > Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
> > > > Stepping- SERR- FastB2B- DisINTx+
> > > > Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
> > > > <TAbort- <MAbort- >SERR- <PERR- INTx-
> > > > Latency: 0
> > > > Interrupt: pin A routed to IRQ 57
> > > > IOMMU group: 20
> > > > Region 0: Memory at fc900000 (32-bit, non-prefetchable) [size=128K]
> > > > Region 2: I/O ports at e000 [size=32]
> > > > Region 3: Memory at fc920000 (32-bit, non-prefetchable) [size=16K]
> > > > Capabilities: [40] Power Management version 3
> > > > Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
> > > > Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=1 PME-
> > > > Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
> > > > Address: 0000000000000000  Data: 0000
> > > > Masking: 00000000  Pending: 00000000
> > > > Capabilities: [70] MSI-X: Enable+ Count=5 Masked-
> > > > Vector table: BAR=3 offset=00000000
> > > > PBA: BAR=3 offset=00002000
> > > > Capabilities: [a0] Express (v2) Endpoint, MSI 00
> > > > DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
> > > > ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
> > > > DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
> > > > RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+ FLReset-
> > > > MaxPayload 128 bytes, MaxReadReq 512 bytes
> > > > DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr+ TransPend-
> > > > LnkCap: Port #3, Speed 2.5GT/s, Width x1, ASPM L0s L1, Exit Latency
> > > > L0s <2us, L1 <16us
> > > > ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
> > > > LnkCtl: ASPM L1 Enabled; RCB 64 bytes, Disabled- CommClk+
> > > > ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
> > > > LnkSta: Speed 2.5GT/s (ok), Width x1 (ok)
> > > > TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
> > >
> > > PCIe-wise the connection is going to be pretty tight in terms of
> > > bandwidth. It looks like we have 2.5GT/s with only a single lane of
> > > PCIe. In addition, we are running with ASPM enabled, which means that
> > > if we don't have enough traffic we are shutting off the one PCIe lane
> > > we have, so if we are getting bursty traffic that can get ugly.
> >
> > Humm... is there a way to force disable ASPM in sysfs?
>
> Actually the easiest way to do this is to just use setpci.
>
> You should be able to dump the word containing the setting via:
> # setpci -s 3:00.0 0xB0.w
> 0042
> # setpci -s 3:00.0 0xB0.w=0040
>
> Basically what you do is clear the two low bits of the value (the ASPM
> Control field of the Link Control register), so in this case that means
> replacing the 2 with a 0 based on the output of the first command.

Well... I'll be damned... I used to force-enable ASPM... this must be
related to the change in PCIe bus ASPM.
Perhaps disable ASPM if there is only one link?
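(If ASPM L1 really is the culprit, a coarser system-wide workaround is to
switch the ASPM policy - a sketch, using the standard kernel knobs rather
than anything igb-specific; the per-device setpci/sysfs tweaks above are
more surgical:

  # at runtime, for all PCIe links:
  echo performance > /sys/module/pcie_aspm/parameters/policy
  # or on the kernel command line at boot:
  #   pcie_aspm.policy=performance
)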

[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   113 MBytes   950 Mbits/sec   31    710 KBytes
[  5]   1.00-2.00   sec   110 MBytes   923 Mbits/sec  135    626 KBytes
[  5]   2.00-3.00   sec   112 MBytes   944 Mbits/sec   18    713 KBytes
[  5]   3.00-4.00   sec   111 MBytes   933 Mbits/sec    0    798 KBytes
[  5]   4.00-5.00   sec   111 MBytes   933 Mbits/sec    0    721 KBytes
[  5]   5.00-6.00   sec   112 MBytes   944 Mbits/sec   31    800 KBytes
[  5]   6.00-7.00   sec   111 MBytes   933 Mbits/sec    0    730 KBytes
[  5]   7.00-8.00   sec   111 MBytes   933 Mbits/sec   19    730 KBytes
[  5]   8.00-9.00   sec   111 MBytes   933 Mbits/sec    0    701 KBytes
[  5]   9.00-10.00  sec   112 MBytes   944 Mbits/sec   12    701 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  1.09 GBytes   937 Mbits/sec  246             sender
[  5]   0.00-10.01  sec  1.09 GBytes   933 Mbits/sec                  receiver

[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   114 MBytes   956 Mbits/sec    0    749 KBytes
[  5]   1.00-2.00   sec   111 MBytes   933 Mbits/sec   30    766 KBytes
[  5]   2.00-3.00   sec   112 MBytes   944 Mbits/sec    7    749 KBytes
[  5]   3.00-4.00   sec   111 MBytes   933 Mbits/sec   11    707 KBytes
[  5]   4.00-5.00   sec   111 MBytes   933 Mbits/sec    2    699 KBytes
[  5]   5.00-6.00   sec   111 MBytes   933 Mbits/sec    8    699 KBytes
[  5]   6.00-7.00   sec   112 MBytes   944 Mbits/sec    1    953 KBytes
[  5]   7.00-8.00   sec   111 MBytes   933 Mbits/sec    0    701 KBytes
[  5]   8.00-9.00   sec   111 MBytes   933 Mbits/sec   26    707 KBytes
[  5]   9.00-10.00  sec   112 MBytes   944 Mbits/sec    2   1.07 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  1.09 GBytes   939 Mbits/sec   87             sender
[  5]   0.00-10.00  sec  1.09 GBytes   934 Mbits/sec                  receiver

[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   114 MBytes   953 Mbits/sec   16    908 KBytes
[  5]   1.00-2.00   sec   112 MBytes   944 Mbits/sec    0    693 KBytes
[  5]   2.00-3.00   sec   111 MBytes   933 Mbits/sec    0    713 KBytes
[  5]   3.00-4.00   sec   111 MBytes   933 Mbits/sec    0    687 KBytes
[  5]   4.00-5.00   sec   112 MBytes   944 Mbits/sec   15    687 KBytes
[  5]   5.00-6.00   sec   111 MBytes   933 Mbits/sec    2    888 KBytes
[  5]   6.00-7.00   sec   111 MBytes   933 Mbits/sec   17    696 KBytes
[  5]   7.00-8.00   sec   111 MBytes   933 Mbits/sec    0    758 KBytes
[  5]   8.00-9.00   sec   111 MBytes   933 Mbits/sec   31    749 KBytes
[  5]   9.00-10.00  sec   112 MBytes   944 Mbits/sec    0    792 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  1.09 GBytes   938 Mbits/sec   81             sender
[  5]   0.00-10.00  sec  1.09 GBytes   934 Mbits/sec                  receiver

[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   114 MBytes   956 Mbits/sec    0    747 KBytes
[  5]   1.00-2.00   sec   111 MBytes   933 Mbits/sec    0    744 KBytes
[  5]   2.00-3.00   sec   112 MBytes   944 Mbits/sec   12   1.18 MBytes
[  5]   3.00-4.00   sec   111 MBytes   933 Mbits/sec    2    699 KBytes
[  5]   4.00-5.00   sec   111 MBytes   933 Mbits/sec   28    699 KBytes
[  5]   5.00-6.00   sec   112 MBytes   944 Mbits/sec    0    684 KBytes
[  5]   6.00-7.00   sec   111 MBytes   933 Mbits/sec    0    741 KBytes
[  5]   7.00-8.00   sec   111 MBytes   933 Mbits/sec    3    687 KBytes
[  5]   8.00-9.00   sec   111 MBytes   933 Mbits/sec   22    699 KBytes
[  5]   9.00-10.00  sec   111 MBytes   933 Mbits/sec   11    707 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  1.09 GBytes   938 Mbits/sec   78             sender
[  5]   0.00-10.01  sec  1.09 GBytes   934 Mbits/sec                  receiver
---

ethtool -S enp3s0 |grep -v ": 0"
NIC statistics:
     rx_packets: 16303520
     tx_packets: 21602840
     rx_bytes: 15711958157
     tx_bytes: 25599009212
     rx_broadcast: 122212
     tx_broadcast: 530
     rx_multicast: 333489
     tx_multicast: 18446
     multicast: 333489
     rx_missed_errors: 270143
     rx_long_length_errors: 6
     tx_tcp_seg_good: 1342561
     rx_long_byte_count: 15711958157
     rx_errors: 6
     rx_length_errors: 6
     rx_fifo_errors: 270143
     tx_queue_0_packets: 8963830
     tx_queue_0_bytes: 9803196683
     tx_queue_0_restart: 4920
     tx_queue_1_packets: 12639010
     tx_queue_1_bytes: 15706576814
     tx_queue_1_restart: 12718
     rx_queue_0_packets: 16303520
     rx_queue_0_bytes: 15646744077
     rx_queue_0_csum_err: 76

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [Intel-wired-lan] NAT performance issue 944mbit -> ~40mbit
@ 2020-07-15 23:59                     ` Ian Kumlien
  0 siblings, 0 replies; 51+ messages in thread
From: Ian Kumlien @ 2020-07-15 23:59 UTC (permalink / raw)
  To: intel-wired-lan

On Thu, Jul 16, 2020 at 1:42 AM Alexander Duyck
<alexander.duyck@gmail.com> wrote:
> On Wed, Jul 15, 2020 at 3:51 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > On Thu, Jul 16, 2020 at 12:32 AM Alexander Duyck
> > <alexander.duyck@gmail.com> wrote:
> > > On Wed, Jul 15, 2020 at 3:00 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > > > On Wed, Jul 15, 2020 at 11:40 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > > > > On Wed, 15 Jul 2020 23:12:23 +0200 Ian Kumlien wrote:
> > > > > > On Wed, Jul 15, 2020 at 11:02 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > > > > > > On Wed, Jul 15, 2020 at 10:31 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > > > > > > > On Wed, 15 Jul 2020 22:05:58 +0200 Ian Kumlien wrote:
> > > > > > > > > After a  lot of debugging it turns out that the bug is in igb...
> > > > > > > > >
> > > > > > > > > driver: igb
> > > > > > > > > version: 5.6.0-k
> > > > > > > > > firmware-version:  0. 6-1
> > > > > > > > >
> > > > > > > > > 03:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network
> > > > > > > > > Connection (rev 03)
> > > > > > > >
> > > > > > > > Unclear to me what you're actually reporting. Is this a regression
> > > > > > > > after a kernel upgrade? Compared to no NAT?
> > > > > > >
> > > > > > > It only happens on "internet links"
> > > > > > >
> > > > > > > Let's say that A is a client with the igb driver, B is a firewall running NAT
> > > > > > > with ixgbe drivers, C is another local node with igb and
> > > > > > > D is a remote node with a bridge backed by a bnx2 interface.
> > > > > > >
> > > > > > > A -> B -> C is ok (B and C are on the same switch)
> > > > > > >
> > > > > > > A -> B -> D -- 32-40mbit
> > > > > > >
> > > > > > > B -> D 944 mbit
> > > > > > > C -> D 944 mbit
> > > > > > >
> > > > > > > A' -> D ~933 mbit (A with realtek nic -- also link is not idle atm)
> > > > > >
> > > > > > This should of course be A' -> B -> D
> > > > > >
> > > > > > Sorry, I've been scratching my head for about a week...
> > > > >
> > > > > Hm, only thing that comes to mind if A' works reliably and A doesn't is
> > > > > that A has somehow broken TCP offloads. Could you try disabling things
> > > > > via ethtool -K and see if those settings make a difference?
> > > >
> > > > It's a bit hard since it works like this, turned tso off:
> > > > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > > > [  5]   0.00-1.00   sec   108 MBytes   902 Mbits/sec    0    783 KBytes
> > > > [  5]   1.00-2.00   sec   110 MBytes   923 Mbits/sec   31    812 KBytes
> > > > [  5]   2.00-3.00   sec   111 MBytes   933 Mbits/sec   92    772 KBytes
> > > > [  5]   3.00-4.00   sec   110 MBytes   923 Mbits/sec    0    834 KBytes
> > > > [  5]   4.00-5.00   sec   111 MBytes   933 Mbits/sec   60    823 KBytes
> > > > [  5]   5.00-6.00   sec   110 MBytes   923 Mbits/sec   31    789 KBytes
> > > > [  5]   6.00-7.00   sec   111 MBytes   933 Mbits/sec    0    786 KBytes
> > > > [  5]   7.00-8.00   sec   110 MBytes   923 Mbits/sec    0    761 KBytes
> > > > [  5]   8.00-9.00   sec   110 MBytes   923 Mbits/sec    0    772 KBytes
> > > > [  5]   9.00-10.00  sec   109 MBytes   912 Mbits/sec    0    868 KBytes
> > > > - - - - - - - - - - - - - - - - - - - - - - - - -
> > > > [ ID] Interval           Transfer     Bitrate         Retr
> > > > [  5]   0.00-10.00  sec  1.07 GBytes   923 Mbits/sec  214             sender
> > > > [  5]   0.00-10.00  sec  1.07 GBytes   920 Mbits/sec                  receiver
> > > >
> > > > Continued running tests:
> > > > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > > > [  5]   0.00-1.00   sec  5.82 MBytes  48.8 Mbits/sec    0   82.0 KBytes
> > > > [  5]   1.00-2.00   sec  4.97 MBytes  41.7 Mbits/sec    0    130 KBytes
> > > > [  5]   2.00-3.00   sec  5.28 MBytes  44.3 Mbits/sec    0   99.0 KBytes
> > > > [  5]   3.00-4.00   sec  5.28 MBytes  44.3 Mbits/sec    0    105 KBytes
> > > > [  5]   4.00-5.00   sec  5.28 MBytes  44.3 Mbits/sec    0    122 KBytes
> > > > [  5]   5.00-6.00   sec  5.28 MBytes  44.3 Mbits/sec    0   82.0 KBytes
> > > > [  5]   6.00-7.00   sec  5.28 MBytes  44.3 Mbits/sec    0   79.2 KBytes
> > > > [  5]   7.00-8.00   sec  5.28 MBytes  44.3 Mbits/sec    0    110 KBytes
> > > > [  5]   8.00-9.00   sec  5.28 MBytes  44.3 Mbits/sec    0    156 KBytes
> > > > [  5]   9.00-10.00  sec  5.28 MBytes  44.3 Mbits/sec    0   87.7 KBytes
> > > > - - - - - - - - - - - - - - - - - - - - - - - - -
> > > > [ ID] Interval           Transfer     Bitrate         Retr
> > > > [  5]   0.00-10.00  sec  53.0 MBytes  44.5 Mbits/sec    0             sender
> > > > [  5]   0.00-10.00  sec  52.5 MBytes  44.1 Mbits/sec                  receiver
> > > >
> > > > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > > > [  5]   0.00-1.00   sec  7.08 MBytes  59.4 Mbits/sec    0    156 KBytes
> > > > [  5]   1.00-2.00   sec  5.97 MBytes  50.0 Mbits/sec    0    110 KBytes
> > > > [  5]   2.00-3.00   sec  4.97 MBytes  41.7 Mbits/sec    0    124 KBytes
> > > > [  5]   3.00-4.00   sec  5.47 MBytes  45.9 Mbits/sec    0   96.2 KBytes
> > > > [  5]   4.00-5.00   sec  5.47 MBytes  45.9 Mbits/sec    0    158 KBytes
> > > > [  5]   5.00-6.00   sec  4.97 MBytes  41.7 Mbits/sec    0   70.7 KBytes
> > > > [  5]   6.00-7.00   sec  5.47 MBytes  45.9 Mbits/sec    0    113 KBytes
> > > > [  5]   7.00-8.00   sec  5.47 MBytes  45.9 Mbits/sec    0   96.2 KBytes
> > > > [  5]   8.00-9.00   sec  4.97 MBytes  41.7 Mbits/sec    0   84.8 KBytes
> > > > [  5]   9.00-10.00  sec  5.47 MBytes  45.9 Mbits/sec    0    116 KBytes
> > > > - - - - - - - - - - - - - - - - - - - - - - - - -
> > > > [ ID] Interval           Transfer     Bitrate         Retr
> > > > [  5]   0.00-10.00  sec  55.3 MBytes  46.4 Mbits/sec    0             sender
> > > > [  5]   0.00-10.00  sec  53.9 MBytes  45.2 Mbits/sec                  receiver
> > > >
> > > > And the low bandwidth continues with:
> > > > ethtool -k enp3s0 |grep ": on"
> > > > rx-vlan-offload: on
> > > > tx-vlan-offload: on [requested off]
> > > > highdma: on [fixed]
> > > > rx-vlan-filter: on [fixed]
> > > > tx-gre-segmentation: on
> > > > tx-gre-csum-segmentation: on
> > > > tx-ipxip4-segmentation: on
> > > > tx-ipxip6-segmentation: on
> > > > tx-udp_tnl-segmentation: on
> > > > tx-udp_tnl-csum-segmentation: on
> > > > tx-gso-partial: on
> > > > tx-udp-segmentation: on
> > > > hw-tc-offload: on
> > > >
> > > > Can't quite find how to turn those off, since the names listed by
> > > > ethtool aren't the strings you use to enable/disable them
> > >
> > > To disable them you can just pass the same string that is shown in the
> > > display. So it should just be "ethtool -K enp3s0 tx-gso-partial off"
> > > and that would turn off a large chunk of them as all the encapsulated
> > > support requires gso partial support.
> >
> >  ethtool -k enp3s0 |grep ": on"
> > highdma: on [fixed]
> > rx-vlan-filter: on [fixed]
> > ---
> > And then back to back:
> > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > [  5]   0.00-1.00   sec  4.91 MBytes  41.2 Mbits/sec    0   45.2 KBytes
> > [  5]   1.00-2.00   sec  4.47 MBytes  37.5 Mbits/sec    0   52.3 KBytes
> > [  5]   2.00-3.00   sec  4.47 MBytes  37.5 Mbits/sec    0   42.4 KBytes
> > [  5]   3.00-4.00   sec  4.47 MBytes  37.5 Mbits/sec    0    141 KBytes
> > [  5]   4.00-5.00   sec   111 MBytes   928 Mbits/sec   63    764 KBytes
> > [  5]   5.00-6.00   sec  86.2 MBytes   724 Mbits/sec    0    744 KBytes
> > [  5]   6.00-7.00   sec  98.8 MBytes   828 Mbits/sec   61    769 KBytes
> > [  5]   7.00-8.00   sec   110 MBytes   923 Mbits/sec    0    749 KBytes
> > [  5]   8.00-9.00   sec   110 MBytes   923 Mbits/sec    0    741 KBytes
> > [  5]   9.00-10.00  sec   110 MBytes   923 Mbits/sec   31    761 KBytes
> > - - - - - - - - - - - - - - - - - - - - - - - - -
> > [ ID] Interval           Transfer     Bitrate         Retr
> > [  5]   0.00-10.00  sec   644 MBytes   540 Mbits/sec  155             sender
> > [  5]   0.00-10.01  sec   641 MBytes   537 Mbits/sec                  receiver
> >
> > and we're back at the not working bit:
> > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > [  5]   0.00-1.00   sec  4.84 MBytes  40.6 Mbits/sec    0   42.4 KBytes
> > [  5]   1.00-2.00   sec  4.60 MBytes  38.6 Mbits/sec    0   42.4 KBytes
> > [  5]   2.00-3.00   sec  4.23 MBytes  35.4 Mbits/sec    0   42.4 KBytes
> > [  5]   3.00-4.00   sec  4.47 MBytes  37.5 Mbits/sec    0   67.9 KBytes
> > [  5]   4.00-5.00   sec  4.47 MBytes  37.5 Mbits/sec    0   42.4 KBytes
> > [  5]   5.00-6.00   sec  4.23 MBytes  35.4 Mbits/sec    0   42.4 KBytes
> > [  5]   6.00-7.00   sec  4.23 MBytes  35.4 Mbits/sec    0   42.4 KBytes
> > [  5]   7.00-8.00   sec  4.47 MBytes  37.5 Mbits/sec    0   67.9 KBytes
> > [  5]   8.00-9.00   sec  4.47 MBytes  37.5 Mbits/sec    0   53.7 KBytes
> > [  5]   9.00-10.00  sec  4.47 MBytes  37.5 Mbits/sec    0   79.2 KBytes
> > - - - - - - - - - - - - - - - - - - - - - - - - -
> > [ ID] Interval           Transfer     Bitrate         Retr
> > [  5]   0.00-10.00  sec  44.5 MBytes  37.3 Mbits/sec    0             sender
> > [  5]   0.00-10.00  sec  43.9 MBytes  36.8 Mbits/sec                  receiver
> >
> > > > I was hoping that you'd have an idea of something that might introduce
> > > > a regression - i.e. specific patches to try to revert
> > > >
> > > > Btw, the same issue applies to udp as well
> > > >
> > > > [ ID] Interval           Transfer     Bitrate         Total Datagrams
> > > > [  5]   0.00-1.00   sec  6.77 MBytes  56.8 Mbits/sec  4900
> > > > [  5]   1.00-2.00   sec  4.27 MBytes  35.8 Mbits/sec  3089
> > > > [  5]   2.00-3.00   sec  4.20 MBytes  35.2 Mbits/sec  3041
> > > > [  5]   3.00-4.00   sec  4.30 MBytes  36.1 Mbits/sec  3116
> > > > [  5]   4.00-5.00   sec  4.24 MBytes  35.6 Mbits/sec  3070
> > > > [  5]   5.00-6.00   sec  4.21 MBytes  35.3 Mbits/sec  3047
> > > > [  5]   6.00-7.00   sec  4.29 MBytes  36.0 Mbits/sec  3110
> > > > [  5]   7.00-8.00   sec  4.28 MBytes  35.9 Mbits/sec  3097
> > > > [  5]   8.00-9.00   sec  4.25 MBytes  35.6 Mbits/sec  3075
> > > > [  5]   9.00-10.00  sec  4.20 MBytes  35.2 Mbits/sec  3039
> > > > - - - - - - - - - - - - - - - - - - - - - - - - -
> > > > [ ID] Interval           Transfer     Bitrate         Jitter
> > > > Lost/Total Datagrams
> > > > [  5]   0.00-10.00  sec  45.0 MBytes  37.7 Mbits/sec  0.000 ms
> > > > 0/32584 (0%)  sender
> > > > [  5]   0.00-10.00  sec  45.0 MBytes  37.7 Mbits/sec  0.037 ms
> > > > 0/32573 (0%)  receiver
> > > >
> > > > vs:
> > > >
> > > > [ ID] Interval           Transfer     Bitrate         Total Datagrams
> > > > [  5]   0.00-1.00   sec   114 MBytes   954 Mbits/sec  82342
> > > > [  5]   1.00-2.00   sec   114 MBytes   955 Mbits/sec  82439
> > > > [  5]   2.00-3.00   sec   114 MBytes   956 Mbits/sec  82507
> > > > [  5]   3.00-4.00   sec   114 MBytes   955 Mbits/sec  82432
> > > > [  5]   4.00-5.00   sec   114 MBytes   956 Mbits/sec  82535
> > > > [  5]   5.00-6.00   sec   114 MBytes   953 Mbits/sec  82240
> > > > [  5]   6.00-7.00   sec   114 MBytes   956 Mbits/sec  82512
> > > > [  5]   7.00-8.00   sec   114 MBytes   956 Mbits/sec  82503
> > > > [  5]   8.00-9.00   sec   114 MBytes   956 Mbits/sec  82532
> > > > [  5]   9.00-10.00  sec   114 MBytes   956 Mbits/sec  82488
> > > > - - - - - - - - - - - - - - - - - - - - - - - - -
> > > > [ ID] Interval           Transfer     Bitrate         Jitter
> > > > Lost/Total Datagrams
> > > > [  5]   0.00-10.00  sec  1.11 GBytes   955 Mbits/sec  0.000 ms
> > > > 0/824530 (0%)  sender
> > > > [  5]   0.00-10.01  sec  1.11 GBytes   949 Mbits/sec  0.014 ms
> > > > 4756/824530 (0.58%)  receiver
> > >
> > > The fact that it is impacting UDP seems odd. I wonder if we don't have
> > > a qdisc somewhere that is misbehaving and throttling the Tx. Either
> > > that or I wonder if we are getting spammed with flow control frames.
> >
> > it sometimes works, it looks like the cwnd just isn't increased -
> > that's where I started...
> >
> > Example:
> > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > [  5]   0.00-1.00   sec  4.86 MBytes  40.8 Mbits/sec    0   50.9 KBytes
> > [  5]   1.00-2.00   sec  4.66 MBytes  39.1 Mbits/sec    0   65.0 KBytes
> > [  5]   2.00-3.00   sec  4.29 MBytes  36.0 Mbits/sec    0   42.4 KBytes
> > [  5]   3.00-4.00   sec  4.66 MBytes  39.1 Mbits/sec    0   42.4 KBytes
> > [  5]   4.00-5.00   sec  23.1 MBytes   194 Mbits/sec    0   1.07 MBytes
> > [  5]   5.00-6.00   sec   110 MBytes   923 Mbits/sec    0    761 KBytes
> > [  5]   6.00-7.00   sec  98.8 MBytes   828 Mbits/sec   60    806 KBytes
> > [  5]   7.00-8.00   sec  82.5 MBytes   692 Mbits/sec    0    812 KBytes
> > [  5]   8.00-9.00   sec   110 MBytes   923 Mbits/sec   92    761 KBytes
> > [  5]   9.00-10.00  sec   111 MBytes   933 Mbits/sec    0    755 KBytes
> > - - - - - - - - - - - - - - - - - - - - - - - - -
> > [ ID] Interval           Transfer     Bitrate         Retr
> > [  5]   0.00-10.00  sec   554 MBytes   465 Mbits/sec  152             sender
> > [  5]   0.00-10.00  sec   550 MBytes   461 Mbits/sec                  receiver
> >
> > > It would be useful to include the output of just calling "ethtool
> > > enp3s0" on the interface to verify the speed, "ethtool -a enp3s0" to
> > > verify flow control settings, and "ethtool -S enp3s0 | grep -v :\ 0"
> > > to output the statistics and dump anything that isn't zero.
> >
> > ethtool enp3s0
> > Settings for enp3s0:
> > Supported ports: [ TP ]
> > Supported link modes:   10baseT/Half 10baseT/Full
> >                         100baseT/Half 100baseT/Full
> >                         1000baseT/Full
> > Supported pause frame use: Symmetric
> > Supports auto-negotiation: Yes
> > Supported FEC modes: Not reported
> > Advertised link modes:  10baseT/Half 10baseT/Full
> >                         100baseT/Half 100baseT/Full
> >                         1000baseT/Full
> > Advertised pause frame use: Symmetric
> > Advertised auto-negotiation: Yes
> > Advertised FEC modes: Not reported
> > Speed: 1000Mb/s
> > Duplex: Full
> > Auto-negotiation: on
> > Port: Twisted Pair
> > PHYAD: 1
> > Transceiver: internal
> > MDI-X: off (auto)
> > Supports Wake-on: pumbg
> > Wake-on: g
> >         Current message level: 0x00000007 (7)
> >                                drv probe link
> > Link detected: yes
> > ---
> > ethtool -a enp3s0
> > Pause parameters for enp3s0:
> > Autonegotiate: on
> > RX: on
> > TX: off
> > ---
> > ethtool -S enp3s0 |grep  -v :\ 0
> > NIC statistics:
> >      rx_packets: 15920618
> >      tx_packets: 17846725
> >      rx_bytes: 15676264423
> >      tx_bytes: 19925010639
> >      rx_broadcast: 119553
> >      tx_broadcast: 497
> >      rx_multicast: 330193
> >      tx_multicast: 18190
> >      multicast: 330193
> >      rx_missed_errors: 270102
> >      rx_long_length_errors: 6
> >      tx_tcp_seg_good: 1342561
> >      rx_long_byte_count: 15676264423
> >      rx_errors: 6
> >      rx_length_errors: 6
> >      rx_fifo_errors: 270102
> >      tx_queue_0_packets: 7651168
> >      tx_queue_0_bytes: 7823281566
> >      tx_queue_0_restart: 4920
> >      tx_queue_1_packets: 10195557
> >      tx_queue_1_bytes: 12027522118
> >      tx_queue_1_restart: 12718
> >      rx_queue_0_packets: 15920618
> >      rx_queue_0_bytes: 15612581951
> >      rx_queue_0_csum_err: 76
> > (I've only run two runs since I re-enabled the interface)
>
> So I am seeing three things here.
>
> The rx_long_length_errors are usually due to an MTU mismatch. Do you
> have something on the network that is using jumbo frames, or is the
> MTU on the NIC set to something smaller than what is supported on the
> network?

I'm using jumbo frames on the local network; the internet side is the
normal 1500 byte MTU though

> You are getting rx_missed_errors, that would seem to imply that the
> DMA is not able to keep up. We may want to try disabling the L1 to see
> if we get any boost from doing that.

It used to work. I don't do benchmarks all the time, and sometimes the first
benchmarks turn out fine... so it's hard to say when this started happening...

It could also be related to a BIOS upgrade, but I'm pretty sure I did
successful benchmarks after that...

How do I disable L1? Just echo 0 >
/sys/bus/pci/drivers/igb/0000\:03\:00.0/link/l1_aspm ?
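
To be concrete, this is roughly what I have in mind (just a sketch; it
assumes the l1_aspm attribute is actually exposed for this device, which
needs a reasonably recent kernel with the ASPM sysfs support, and root):

# check whether L1 is currently enabled (1) or disabled (0)
cat /sys/bus/pci/devices/0000:03:00.0/link/l1_aspm
# disable L1 for this device (same device as the drivers/igb path above)
echo 0 > /sys/bus/pci/devices/0000:03:00.0/link/l1_aspm
# confirm via lspci - LnkCtl should now report "ASPM Disabled"
lspci -s 03:00.0 -vvv | grep 'LnkCtl:'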

> The last bit is that queue 0 is seeing packets with bad checksums. You
> might want to run some tests and see where the bad checksums are
> coming from. If they are being detected from a specific NIC such as
> the ixgbe in your example it might point to some sort of checksum
> error being created as a result of the NAT translation.

But that should also affect A' and the A -> B -> C case, which it doesn't...

It only seems to happen with higher RTT (6 hops, sub-3 ms in this case,
but still high enough somehow)
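
I can keep an eye on the counters while testing, though - something like
this, just to see whether the csum/missed counters actually move during
the slow runs:

watch -d "ethtool -S enp3s0 | grep -E 'csum_err|missed_errors'"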

> > ---
> >
> > > > lspci -s 03:00.0  -vvv
> > > > 03:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network
> > > > Connection (rev 03)
> > > > Subsystem: ASUSTeK Computer Inc. I211 Gigabit Network Connection
> > > > Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
> > > > Stepping- SERR- FastB2B- DisINTx+
> > > > Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
> > > > <TAbort- <MAbort- >SERR- <PERR- INTx-
> > > > Latency: 0
> > > > Interrupt: pin A routed to IRQ 57
> > > > IOMMU group: 20
> > > > Region 0: Memory at fc900000 (32-bit, non-prefetchable) [size=128K]
> > > > Region 2: I/O ports at e000 [size=32]
> > > > Region 3: Memory at fc920000 (32-bit, non-prefetchable) [size=16K]
> > > > Capabilities: [40] Power Management version 3
> > > > Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
> > > > Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=1 PME-
> > > > Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
> > > > Address: 0000000000000000  Data: 0000
> > > > Masking: 00000000  Pending: 00000000
> > > > Capabilities: [70] MSI-X: Enable+ Count=5 Masked-
> > > > Vector table: BAR=3 offset=00000000
> > > > PBA: BAR=3 offset=00002000
> > > > Capabilities: [a0] Express (v2) Endpoint, MSI 00
> > > > DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
> > > > ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
> > > > DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
> > > > RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+ FLReset-
> > > > MaxPayload 128 bytes, MaxReadReq 512 bytes
> > > > DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr+ TransPend-
> > > > LnkCap: Port #3, Speed 2.5GT/s, Width x1, ASPM L0s L1, Exit Latency
> > > > L0s <2us, L1 <16us
> > > > ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
> > > > LnkCtl: ASPM L1 Enabled; RCB 64 bytes, Disabled- CommClk+
> > > > ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
> > > > LnkSta: Speed 2.5GT/s (ok), Width x1 (ok)
> > > > TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
> > >
> > > PCIe-wise the connection is going to be pretty tight in terms of
> > > bandwidth. It looks like we have 2.5GT/s with only a single lane of
> > > PCIe. In addition we are running with ASPM enabled, so if we don't
> > > have enough traffic we are shutting off the one PCIe lane we have,
> > > and if the traffic is bursty that can get ugly.
> >
> > Humm... is there a way to force disable ASPM in sysfs?
>
> Actually the easiest way to do this is to just use setpci.
>
> You should be able to dump the word containing the setting via:
> # setpci -s 3:00.0 0xB0.w
> 0042
> # setpci -s 3:00.0 0xB0.w=0040
>
> Basically what you do is clear the lower 3 bits of the value so in
> this case that means replacing the 2 with a 0 based on the output of
> the first command.

Well... I'll be damned... I used to force-enable ASPM... this must be
related to the change in PCIe bus ASPM handling.
Perhaps ASPM should be disabled if there is only one link?
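
For reference, this is roughly how I verified it afterwards (just a
sketch; 0xB0 is the Link Control register, i.e. the Express capability
at 0xa0 plus 0x10, per the lspci dump above):

setpci -s 03:00.0 0xB0.w                  # read Link Control, was 0042
setpci -s 03:00.0 0xB0.w=0040             # clear the ASPM control bits
lspci -s 03:00.0 -vvv | grep 'LnkCtl:'    # should now say "ASPM Disabled"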

[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   113 MBytes   950 Mbits/sec   31    710 KBytes
[  5]   1.00-2.00   sec   110 MBytes   923 Mbits/sec  135    626 KBytes
[  5]   2.00-3.00   sec   112 MBytes   944 Mbits/sec   18    713 KBytes
[  5]   3.00-4.00   sec   111 MBytes   933 Mbits/sec    0    798 KBytes
[  5]   4.00-5.00   sec   111 MBytes   933 Mbits/sec    0    721 KBytes
[  5]   5.00-6.00   sec   112 MBytes   944 Mbits/sec   31    800 KBytes
[  5]   6.00-7.00   sec   111 MBytes   933 Mbits/sec    0    730 KBytes
[  5]   7.00-8.00   sec   111 MBytes   933 Mbits/sec   19    730 KBytes
[  5]   8.00-9.00   sec   111 MBytes   933 Mbits/sec    0    701 KBytes
[  5]   9.00-10.00  sec   112 MBytes   944 Mbits/sec   12    701 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  1.09 GBytes   937 Mbits/sec  246             sender
[  5]   0.00-10.01  sec  1.09 GBytes   933 Mbits/sec                  receiver

[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   114 MBytes   956 Mbits/sec    0    749 KBytes
[  5]   1.00-2.00   sec   111 MBytes   933 Mbits/sec   30    766 KBytes
[  5]   2.00-3.00   sec   112 MBytes   944 Mbits/sec    7    749 KBytes
[  5]   3.00-4.00   sec   111 MBytes   933 Mbits/sec   11    707 KBytes
[  5]   4.00-5.00   sec   111 MBytes   933 Mbits/sec    2    699 KBytes
[  5]   5.00-6.00   sec   111 MBytes   933 Mbits/sec    8    699 KBytes
[  5]   6.00-7.00   sec   112 MBytes   944 Mbits/sec    1    953 KBytes
[  5]   7.00-8.00   sec   111 MBytes   933 Mbits/sec    0    701 KBytes
[  5]   8.00-9.00   sec   111 MBytes   933 Mbits/sec   26    707 KBytes
[  5]   9.00-10.00  sec   112 MBytes   944 Mbits/sec    2   1.07 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  1.09 GBytes   939 Mbits/sec   87             sender
[  5]   0.00-10.00  sec  1.09 GBytes   934 Mbits/sec                  receiver

[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   114 MBytes   953 Mbits/sec   16    908 KBytes
[  5]   1.00-2.00   sec   112 MBytes   944 Mbits/sec    0    693 KBytes
[  5]   2.00-3.00   sec   111 MBytes   933 Mbits/sec    0    713 KBytes
[  5]   3.00-4.00   sec   111 MBytes   933 Mbits/sec    0    687 KBytes
[  5]   4.00-5.00   sec   112 MBytes   944 Mbits/sec   15    687 KBytes
[  5]   5.00-6.00   sec   111 MBytes   933 Mbits/sec    2    888 KBytes
[  5]   6.00-7.00   sec   111 MBytes   933 Mbits/sec   17    696 KBytes
[  5]   7.00-8.00   sec   111 MBytes   933 Mbits/sec    0    758 KBytes
[  5]   8.00-9.00   sec   111 MBytes   933 Mbits/sec   31    749 KBytes
[  5]   9.00-10.00  sec   112 MBytes   944 Mbits/sec    0    792 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  1.09 GBytes   938 Mbits/sec   81             sender
[  5]   0.00-10.00  sec  1.09 GBytes   934 Mbits/sec                  receiver

[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   114 MBytes   956 Mbits/sec    0    747 KBytes
[  5]   1.00-2.00   sec   111 MBytes   933 Mbits/sec    0    744 KBytes
[  5]   2.00-3.00   sec   112 MBytes   944 Mbits/sec   12   1.18 MBytes
[  5]   3.00-4.00   sec   111 MBytes   933 Mbits/sec    2    699 KBytes
[  5]   4.00-5.00   sec   111 MBytes   933 Mbits/sec   28    699 KBytes
[  5]   5.00-6.00   sec   112 MBytes   944 Mbits/sec    0    684 KBytes
[  5]   6.00-7.00   sec   111 MBytes   933 Mbits/sec    0    741 KBytes
[  5]   7.00-8.00   sec   111 MBytes   933 Mbits/sec    3    687 KBytes
[  5]   8.00-9.00   sec   111 MBytes   933 Mbits/sec   22    699 KBytes
[  5]   9.00-10.00  sec   111 MBytes   933 Mbits/sec   11    707 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  1.09 GBytes   938 Mbits/sec   78             sender
[  5]   0.00-10.01  sec  1.09 GBytes   934 Mbits/sec                  receiver
---

ethtool -S enp3s0 |grep -v ": 0"
NIC statistics:
     rx_packets: 16303520
     tx_packets: 21602840
     rx_bytes: 15711958157
     tx_bytes: 25599009212
     rx_broadcast: 122212
     tx_broadcast: 530
     rx_multicast: 333489
     tx_multicast: 18446
     multicast: 333489
     rx_missed_errors: 270143
     rx_long_length_errors: 6
     tx_tcp_seg_good: 1342561
     rx_long_byte_count: 15711958157
     rx_errors: 6
     rx_length_errors: 6
     rx_fifo_errors: 270143
     tx_queue_0_packets: 8963830
     tx_queue_0_bytes: 9803196683
     tx_queue_0_restart: 4920
     tx_queue_1_packets: 12639010
     tx_queue_1_bytes: 15706576814
     tx_queue_1_restart: 12718
     rx_queue_0_packets: 16303520
     rx_queue_0_bytes: 15646744077
     rx_queue_0_csum_err: 76

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [Intel-wired-lan] NAT performance issue 944mbit -> ~40mbit
  2020-07-15 23:59                     ` Ian Kumlien
@ 2020-07-16 15:18                       ` Alexander Duyck
  -1 siblings, 0 replies; 51+ messages in thread
From: Alexander Duyck @ 2020-07-16 15:18 UTC (permalink / raw)
  To: Ian Kumlien
  Cc: Jakub Kicinski, Linux Kernel Network Developers, intel-wired-lan

On Wed, Jul 15, 2020 at 5:00 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
>
> On Thu, Jul 16, 2020 at 1:42 AM Alexander Duyck
> <alexander.duyck@gmail.com> wrote:
> > On Wed, Jul 15, 2020 at 3:51 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > > On Thu, Jul 16, 2020 at 12:32 AM Alexander Duyck
> > > <alexander.duyck@gmail.com> wrote:
> > > > On Wed, Jul 15, 2020 at 3:00 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > > > > On Wed, Jul 15, 2020 at 11:40 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > > > > > On Wed, 15 Jul 2020 23:12:23 +0200 Ian Kumlien wrote:
> > > > > > > On Wed, Jul 15, 2020 at 11:02 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > > > > > > > On Wed, Jul 15, 2020 at 10:31 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > > > > > > > > On Wed, 15 Jul 2020 22:05:58 +0200 Ian Kumlien wrote:
> > > > > > > > > > After a  lot of debugging it turns out that the bug is in igb...
> > > > > > > > > >
> > > > > > > > > > driver: igb
> > > > > > > > > > version: 5.6.0-k
> > > > > > > > > > firmware-version:  0. 6-1
> > > > > > > > > >
> > > > > > > > > > 03:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network
> > > > > > > > > > Connection (rev 03)
> > > > > > > > >
> > > > > > > > > Unclear to me what you're actually reporting. Is this a regression
> > > > > > > > > after a kernel upgrade? Compared to no NAT?
> > > > > > > >
> > > > > > > > It only happens on "internet links"
> > > > > > > >
> > > > > > > > Let's say that A is a client with the igb driver, B is a firewall running NAT
> > > > > > > > with ixgbe drivers, C is another local node with igb and
> > > > > > > > D is a remote node with a bridge backed by a bnx2 interface.
> > > > > > > >
> > > > > > > > A -> B -> C is ok (B and C are on the same switch)
> > > > > > > >
> > > > > > > > A -> B -> D -- 32-40mbit
> > > > > > > >
> > > > > > > > B -> D 944 mbit
> > > > > > > > C -> D 944 mbit
> > > > > > > >
> > > > > > > > A' -> D ~933 mbit (A with realtek nic -- also link is not idle atm)
> > > > > > >
> > > > > > > This should of course be A' -> B -> D
> > > > > > >
> > > > > > > Sorry, I've been scratching my head for about a week...
> > > > > >
> > > > > > Hm, only thing that comes to mind if A' works reliably and A doesn't is
> > > > > > that A has somehow broken TCP offloads. Could you try disabling things
> > > > > > via ethtool -K and see if those settings make a difference?
> > > > >
> > > > > It's a bit hard to test since it behaves like this - with tso turned off:
> > > > > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > > > > [  5]   0.00-1.00   sec   108 MBytes   902 Mbits/sec    0    783 KBytes
> > > > > [  5]   1.00-2.00   sec   110 MBytes   923 Mbits/sec   31    812 KBytes
> > > > > [  5]   2.00-3.00   sec   111 MBytes   933 Mbits/sec   92    772 KBytes
> > > > > [  5]   3.00-4.00   sec   110 MBytes   923 Mbits/sec    0    834 KBytes
> > > > > [  5]   4.00-5.00   sec   111 MBytes   933 Mbits/sec   60    823 KBytes
> > > > > [  5]   5.00-6.00   sec   110 MBytes   923 Mbits/sec   31    789 KBytes
> > > > > [  5]   6.00-7.00   sec   111 MBytes   933 Mbits/sec    0    786 KBytes
> > > > > [  5]   7.00-8.00   sec   110 MBytes   923 Mbits/sec    0    761 KBytes
> > > > > [  5]   8.00-9.00   sec   110 MBytes   923 Mbits/sec    0    772 KBytes
> > > > > [  5]   9.00-10.00  sec   109 MBytes   912 Mbits/sec    0    868 KBytes
> > > > > - - - - - - - - - - - - - - - - - - - - - - - - -
> > > > > [ ID] Interval           Transfer     Bitrate         Retr
> > > > > [  5]   0.00-10.00  sec  1.07 GBytes   923 Mbits/sec  214             sender
> > > > > [  5]   0.00-10.00  sec  1.07 GBytes   920 Mbits/sec                  receiver
> > > > >
> > > > > Continued running tests:
> > > > > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > > > > [  5]   0.00-1.00   sec  5.82 MBytes  48.8 Mbits/sec    0   82.0 KBytes
> > > > > [  5]   1.00-2.00   sec  4.97 MBytes  41.7 Mbits/sec    0    130 KBytes
> > > > > [  5]   2.00-3.00   sec  5.28 MBytes  44.3 Mbits/sec    0   99.0 KBytes
> > > > > [  5]   3.00-4.00   sec  5.28 MBytes  44.3 Mbits/sec    0    105 KBytes
> > > > > [  5]   4.00-5.00   sec  5.28 MBytes  44.3 Mbits/sec    0    122 KBytes
> > > > > [  5]   5.00-6.00   sec  5.28 MBytes  44.3 Mbits/sec    0   82.0 KBytes
> > > > > [  5]   6.00-7.00   sec  5.28 MBytes  44.3 Mbits/sec    0   79.2 KBytes
> > > > > [  5]   7.00-8.00   sec  5.28 MBytes  44.3 Mbits/sec    0    110 KBytes
> > > > > [  5]   8.00-9.00   sec  5.28 MBytes  44.3 Mbits/sec    0    156 KBytes
> > > > > [  5]   9.00-10.00  sec  5.28 MBytes  44.3 Mbits/sec    0   87.7 KBytes
> > > > > - - - - - - - - - - - - - - - - - - - - - - - - -
> > > > > [ ID] Interval           Transfer     Bitrate         Retr
> > > > > [  5]   0.00-10.00  sec  53.0 MBytes  44.5 Mbits/sec    0             sender
> > > > > [  5]   0.00-10.00  sec  52.5 MBytes  44.1 Mbits/sec                  receiver
> > > > >
> > > > > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > > > > [  5]   0.00-1.00   sec  7.08 MBytes  59.4 Mbits/sec    0    156 KBytes
> > > > > [  5]   1.00-2.00   sec  5.97 MBytes  50.0 Mbits/sec    0    110 KBytes
> > > > > [  5]   2.00-3.00   sec  4.97 MBytes  41.7 Mbits/sec    0    124 KBytes
> > > > > [  5]   3.00-4.00   sec  5.47 MBytes  45.9 Mbits/sec    0   96.2 KBytes
> > > > > [  5]   4.00-5.00   sec  5.47 MBytes  45.9 Mbits/sec    0    158 KBytes
> > > > > [  5]   5.00-6.00   sec  4.97 MBytes  41.7 Mbits/sec    0   70.7 KBytes
> > > > > [  5]   6.00-7.00   sec  5.47 MBytes  45.9 Mbits/sec    0    113 KBytes
> > > > > [  5]   7.00-8.00   sec  5.47 MBytes  45.9 Mbits/sec    0   96.2 KBytes
> > > > > [  5]   8.00-9.00   sec  4.97 MBytes  41.7 Mbits/sec    0   84.8 KBytes
> > > > > [  5]   9.00-10.00  sec  5.47 MBytes  45.9 Mbits/sec    0    116 KBytes
> > > > > - - - - - - - - - - - - - - - - - - - - - - - - -
> > > > > [ ID] Interval           Transfer     Bitrate         Retr
> > > > > [  5]   0.00-10.00  sec  55.3 MBytes  46.4 Mbits/sec    0             sender
> > > > > [  5]   0.00-10.00  sec  53.9 MBytes  45.2 Mbits/sec                  receiver
> > > > >
> > > > > And the low bandwidth continues with:
> > > > > ethtool -k enp3s0 |grep ": on"
> > > > > rx-vlan-offload: on
> > > > > tx-vlan-offload: on [requested off]
> > > > > highdma: on [fixed]
> > > > > rx-vlan-filter: on [fixed]
> > > > > tx-gre-segmentation: on
> > > > > tx-gre-csum-segmentation: on
> > > > > tx-ipxip4-segmentation: on
> > > > > tx-ipxip6-segmentation: on
> > > > > tx-udp_tnl-segmentation: on
> > > > > tx-udp_tnl-csum-segmentation: on
> > > > > tx-gso-partial: on
> > > > > tx-udp-segmentation: on
> > > > > hw-tc-offload: on
> > > > >
> > > > > Can't quite find how to turn those off, since the names listed by
> > > > > ethtool aren't the strings you use to enable/disable them
> > > >
> > > > To disable them you can just pass the same string that is shown in the
> > > > display. So it should just be "ethtool -K enp3s0 tx-gso-partial off"
> > > > and that would turn off a large chunk of them as all the encapsulated
> > > > support requires gso partial support.
> > >
> > >  ethtool -k enp3s0 |grep ": on"
> > > highdma: on [fixed]
> > > rx-vlan-filter: on [fixed]
> > > ---
> > > And then back to back:
> > > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > > [  5]   0.00-1.00   sec  4.91 MBytes  41.2 Mbits/sec    0   45.2 KBytes
> > > [  5]   1.00-2.00   sec  4.47 MBytes  37.5 Mbits/sec    0   52.3 KBytes
> > > [  5]   2.00-3.00   sec  4.47 MBytes  37.5 Mbits/sec    0   42.4 KBytes
> > > [  5]   3.00-4.00   sec  4.47 MBytes  37.5 Mbits/sec    0    141 KBytes
> > > [  5]   4.00-5.00   sec   111 MBytes   928 Mbits/sec   63    764 KBytes
> > > [  5]   5.00-6.00   sec  86.2 MBytes   724 Mbits/sec    0    744 KBytes
> > > [  5]   6.00-7.00   sec  98.8 MBytes   828 Mbits/sec   61    769 KBytes
> > > [  5]   7.00-8.00   sec   110 MBytes   923 Mbits/sec    0    749 KBytes
> > > [  5]   8.00-9.00   sec   110 MBytes   923 Mbits/sec    0    741 KBytes
> > > [  5]   9.00-10.00  sec   110 MBytes   923 Mbits/sec   31    761 KBytes
> > > - - - - - - - - - - - - - - - - - - - - - - - - -
> > > [ ID] Interval           Transfer     Bitrate         Retr
> > > [  5]   0.00-10.00  sec   644 MBytes   540 Mbits/sec  155             sender
> > > [  5]   0.00-10.01  sec   641 MBytes   537 Mbits/sec                  receiver
> > >
> > > and we're back at the not working bit:
> > > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > > [  5]   0.00-1.00   sec  4.84 MBytes  40.6 Mbits/sec    0   42.4 KBytes
> > > [  5]   1.00-2.00   sec  4.60 MBytes  38.6 Mbits/sec    0   42.4 KBytes
> > > [  5]   2.00-3.00   sec  4.23 MBytes  35.4 Mbits/sec    0   42.4 KBytes
> > > [  5]   3.00-4.00   sec  4.47 MBytes  37.5 Mbits/sec    0   67.9 KBytes
> > > [  5]   4.00-5.00   sec  4.47 MBytes  37.5 Mbits/sec    0   42.4 KBytes
> > > [  5]   5.00-6.00   sec  4.23 MBytes  35.4 Mbits/sec    0   42.4 KBytes
> > > [  5]   6.00-7.00   sec  4.23 MBytes  35.4 Mbits/sec    0   42.4 KBytes
> > > [  5]   7.00-8.00   sec  4.47 MBytes  37.5 Mbits/sec    0   67.9 KBytes
> > > [  5]   8.00-9.00   sec  4.47 MBytes  37.5 Mbits/sec    0   53.7 KBytes
> > > [  5]   9.00-10.00  sec  4.47 MBytes  37.5 Mbits/sec    0   79.2 KBytes
> > > - - - - - - - - - - - - - - - - - - - - - - - - -
> > > [ ID] Interval           Transfer     Bitrate         Retr
> > > [  5]   0.00-10.00  sec  44.5 MBytes  37.3 Mbits/sec    0             sender
> > > [  5]   0.00-10.00  sec  43.9 MBytes  36.8 Mbits/sec                  receiver
> > >
> > > > > I was hoping that you'd have an idea of something that might introduce
> > > > > a regression - i.e. specific patches to try to revert
> > > > >
> > > > > Btw, the same issue applies to udp as well
> > > > >
> > > > > [ ID] Interval           Transfer     Bitrate         Total Datagrams
> > > > > [  5]   0.00-1.00   sec  6.77 MBytes  56.8 Mbits/sec  4900
> > > > > [  5]   1.00-2.00   sec  4.27 MBytes  35.8 Mbits/sec  3089
> > > > > [  5]   2.00-3.00   sec  4.20 MBytes  35.2 Mbits/sec  3041
> > > > > [  5]   3.00-4.00   sec  4.30 MBytes  36.1 Mbits/sec  3116
> > > > > [  5]   4.00-5.00   sec  4.24 MBytes  35.6 Mbits/sec  3070
> > > > > [  5]   5.00-6.00   sec  4.21 MBytes  35.3 Mbits/sec  3047
> > > > > [  5]   6.00-7.00   sec  4.29 MBytes  36.0 Mbits/sec  3110
> > > > > [  5]   7.00-8.00   sec  4.28 MBytes  35.9 Mbits/sec  3097
> > > > > [  5]   8.00-9.00   sec  4.25 MBytes  35.6 Mbits/sec  3075
> > > > > [  5]   9.00-10.00  sec  4.20 MBytes  35.2 Mbits/sec  3039
> > > > > - - - - - - - - - - - - - - - - - - - - - - - - -
> > > > > [ ID] Interval           Transfer     Bitrate         Jitter
> > > > > Lost/Total Datagrams
> > > > > [  5]   0.00-10.00  sec  45.0 MBytes  37.7 Mbits/sec  0.000 ms
> > > > > 0/32584 (0%)  sender
> > > > > [  5]   0.00-10.00  sec  45.0 MBytes  37.7 Mbits/sec  0.037 ms
> > > > > 0/32573 (0%)  receiver
> > > > >
> > > > > vs:
> > > > >
> > > > > [ ID] Interval           Transfer     Bitrate         Total Datagrams
> > > > > [  5]   0.00-1.00   sec   114 MBytes   954 Mbits/sec  82342
> > > > > [  5]   1.00-2.00   sec   114 MBytes   955 Mbits/sec  82439
> > > > > [  5]   2.00-3.00   sec   114 MBytes   956 Mbits/sec  82507
> > > > > [  5]   3.00-4.00   sec   114 MBytes   955 Mbits/sec  82432
> > > > > [  5]   4.00-5.00   sec   114 MBytes   956 Mbits/sec  82535
> > > > > [  5]   5.00-6.00   sec   114 MBytes   953 Mbits/sec  82240
> > > > > [  5]   6.00-7.00   sec   114 MBytes   956 Mbits/sec  82512
> > > > > [  5]   7.00-8.00   sec   114 MBytes   956 Mbits/sec  82503
> > > > > [  5]   8.00-9.00   sec   114 MBytes   956 Mbits/sec  82532
> > > > > [  5]   9.00-10.00  sec   114 MBytes   956 Mbits/sec  82488
> > > > > - - - - - - - - - - - - - - - - - - - - - - - - -
> > > > > [ ID] Interval           Transfer     Bitrate         Jitter
> > > > > Lost/Total Datagrams
> > > > > [  5]   0.00-10.00  sec  1.11 GBytes   955 Mbits/sec  0.000 ms
> > > > > 0/824530 (0%)  sender
> > > > > [  5]   0.00-10.01  sec  1.11 GBytes   949 Mbits/sec  0.014 ms
> > > > > 4756/824530 (0.58%)  receiver
> > > >
> > > > The fact that it is impacting UDP seems odd. I wonder if we don't have
> > > > a qdisc somewhere that is misbehaving and throttling the Tx. Either
> > > > that or I wonder if we are getting spammed with flow control frames.
> > >
> > > it sometimes works, it looks like the cwnd just isn't increased -
> > > that's where I started...
> > >
> > > Example:
> > > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > > [  5]   0.00-1.00   sec  4.86 MBytes  40.8 Mbits/sec    0   50.9 KBytes
> > > [  5]   1.00-2.00   sec  4.66 MBytes  39.1 Mbits/sec    0   65.0 KBytes
> > > [  5]   2.00-3.00   sec  4.29 MBytes  36.0 Mbits/sec    0   42.4 KBytes
> > > [  5]   3.00-4.00   sec  4.66 MBytes  39.1 Mbits/sec    0   42.4 KBytes
> > > [  5]   4.00-5.00   sec  23.1 MBytes   194 Mbits/sec    0   1.07 MBytes
> > > [  5]   5.00-6.00   sec   110 MBytes   923 Mbits/sec    0    761 KBytes
> > > [  5]   6.00-7.00   sec  98.8 MBytes   828 Mbits/sec   60    806 KBytes
> > > [  5]   7.00-8.00   sec  82.5 MBytes   692 Mbits/sec    0    812 KBytes
> > > [  5]   8.00-9.00   sec   110 MBytes   923 Mbits/sec   92    761 KBytes
> > > [  5]   9.00-10.00  sec   111 MBytes   933 Mbits/sec    0    755 KBytes
> > > - - - - - - - - - - - - - - - - - - - - - - - - -
> > > [ ID] Interval           Transfer     Bitrate         Retr
> > > [  5]   0.00-10.00  sec   554 MBytes   465 Mbits/sec  152             sender
> > > [  5]   0.00-10.00  sec   550 MBytes   461 Mbits/sec                  receiver
> > >
> > > > It would be useful to include the output of just calling "ethtool
> > > > enp3s0" on the interface to verify the speed, "ethtool -a enp3s0" to
> > > > verify flow control settings, and "ethtool -S enp3s0 | grep -v :\ 0"
> > > > to output the statistics and dump anything that isn't zero.
> > >
> > > ethtool enp3s0
> > > Settings for enp3s0:
> > > Supported ports: [ TP ]
> > > Supported link modes:   10baseT/Half 10baseT/Full
> > >                         100baseT/Half 100baseT/Full
> > >                         1000baseT/Full
> > > Supported pause frame use: Symmetric
> > > Supports auto-negotiation: Yes
> > > Supported FEC modes: Not reported
> > > Advertised link modes:  10baseT/Half 10baseT/Full
> > >                         100baseT/Half 100baseT/Full
> > >                         1000baseT/Full
> > > Advertised pause frame use: Symmetric
> > > Advertised auto-negotiation: Yes
> > > Advertised FEC modes: Not reported
> > > Speed: 1000Mb/s
> > > Duplex: Full
> > > Auto-negotiation: on
> > > Port: Twisted Pair
> > > PHYAD: 1
> > > Transceiver: internal
> > > MDI-X: off (auto)
> > > Supports Wake-on: pumbg
> > > Wake-on: g
> > >         Current message level: 0x00000007 (7)
> > >                                drv probe link
> > > Link detected: yes
> > > ---
> > > ethtool -a enp3s0
> > > Pause parameters for enp3s0:
> > > Autonegotiate: on
> > > RX: on
> > > TX: off
> > > ---
> > > ethtool -S enp3s0 |grep  -v :\ 0
> > > NIC statistics:
> > >      rx_packets: 15920618
> > >      tx_packets: 17846725
> > >      rx_bytes: 15676264423
> > >      tx_bytes: 19925010639
> > >      rx_broadcast: 119553
> > >      tx_broadcast: 497
> > >      rx_multicast: 330193
> > >      tx_multicast: 18190
> > >      multicast: 330193
> > >      rx_missed_errors: 270102
> > >      rx_long_length_errors: 6
> > >      tx_tcp_seg_good: 1342561
> > >      rx_long_byte_count: 15676264423
> > >      rx_errors: 6
> > >      rx_length_errors: 6
> > >      rx_fifo_errors: 270102
> > >      tx_queue_0_packets: 7651168
> > >      tx_queue_0_bytes: 7823281566
> > >      tx_queue_0_restart: 4920
> > >      tx_queue_1_packets: 10195557
> > >      tx_queue_1_bytes: 12027522118
> > >      tx_queue_1_restart: 12718
> > >      rx_queue_0_packets: 15920618
> > >      rx_queue_0_bytes: 15612581951
> > >      rx_queue_0_csum_err: 76
> > > (I've only run two runs since I re-enabled the interface)
> >
> > So I am seeing three things here.
> >
> > The rx_long_length_errors are usually due to an MTU mismatch. Do you
> > have something on the network that is using jumbo frames, or is the
> > MTU on the NIC set to something smaller than what is supported on the
> > network?
>
> I'm using jumbo frames on the local network; the internet side is the
> normal 1500 byte MTU though
>
> > You are getting rx_missed_errors, that would seem to imply that the
> > DMA is not able to keep up. We may want to try disabling the L1 to see
> > if we get any boost from doing that.
>
> It used to work. I don't do benchmarks all the time, and sometimes the first
> benchmarks turn out fine... so it's hard to say when this started happening...
>
> It could also be related to a BIOS upgrade, but I'm pretty sure I did
> successful benchmarks after that...
>
> How do I disable L1? Just echo 0 >
> /sys/bus/pci/drivers/igb/0000\:03\:00.0/link/l1_aspm ?
>
> > The last bit is that queue 0 is seeing packets with bad checksums. You
> > might want to run some tests and see where the bad checksums are
> > coming from. If they are being detected from a specific NIC such as
> > the ixgbe in your example it might point to some sort of checksum
> > error being created as a result of the NAT translation.
>
> But that should also affect A' and the A -> B -> C case, which it doesn't...
>
> It only seems to happen with higher RTT (6 hops, sub-3 ms in this case,
> but still high enough somehow)
>
> > > ---
> > >
> > > > > lspci -s 03:00.0  -vvv
> > > > > 03:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network
> > > > > Connection (rev 03)
> > > > > Subsystem: ASUSTeK Computer Inc. I211 Gigabit Network Connection
> > > > > Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
> > > > > Stepping- SERR- FastB2B- DisINTx+
> > > > > Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
> > > > > <TAbort- <MAbort- >SERR- <PERR- INTx-
> > > > > Latency: 0
> > > > > Interrupt: pin A routed to IRQ 57
> > > > > IOMMU group: 20
> > > > > Region 0: Memory at fc900000 (32-bit, non-prefetchable) [size=128K]
> > > > > Region 2: I/O ports at e000 [size=32]
> > > > > Region 3: Memory at fc920000 (32-bit, non-prefetchable) [size=16K]
> > > > > Capabilities: [40] Power Management version 3
> > > > > Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
> > > > > Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=1 PME-
> > > > > Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
> > > > > Address: 0000000000000000  Data: 0000
> > > > > Masking: 00000000  Pending: 00000000
> > > > > Capabilities: [70] MSI-X: Enable+ Count=5 Masked-
> > > > > Vector table: BAR=3 offset=00000000
> > > > > PBA: BAR=3 offset=00002000
> > > > > Capabilities: [a0] Express (v2) Endpoint, MSI 00
> > > > > DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
> > > > > ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
> > > > > DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
> > > > > RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+ FLReset-
> > > > > MaxPayload 128 bytes, MaxReadReq 512 bytes
> > > > > DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr+ TransPend-
> > > > > LnkCap: Port #3, Speed 2.5GT/s, Width x1, ASPM L0s L1, Exit Latency
> > > > > L0s <2us, L1 <16us
> > > > > ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
> > > > > LnkCtl: ASPM L1 Enabled; RCB 64 bytes, Disabled- CommClk+
> > > > > ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
> > > > > LnkSta: Speed 2.5GT/s (ok), Width x1 (ok)
> > > > > TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
> > > >
> > > > PCIe-wise the connection is going to be pretty tight in terms of
> > > > bandwidth. It looks like we have 2.5GT/s with only a single lane of
> > > > PCIe. In addition we are running with ASPM enabled, so if we don't
> > > > have enough traffic we are shutting off the one PCIe lane we have,
> > > > and if the traffic is bursty that can get ugly.
> > >
> > > Humm... is there a way to force disable ASPM in sysfs?
> >
> > Actually the easiest way to do this is to just use setpci.
> >
> > You should be able to dump the word containing the setting via:
> > # setpci -s 3:00.0 0xB0.w
> > 0042
> > # setpci -s 3:00.0 0xB0.w=0040
> >
> > Basically what you do is clear the lower 3 bits of the value so in
> > this case that means replacing the 2 with a 0 based on the output of
> > the first command.
>
> Well... I'll be damned... I used to force-enable ASPM... this must be
> related to the change in PCIe bus ASPM handling.
> Perhaps ASPM should be disabled if there is only one link?

Is there any specific reason why you are enabling ASPM? Is this system
a laptop where you are trying to conserve power when on battery? If
not, disabling it probably won't hurt things too much, since the power
consumption for a 2.5GT/s link operating at a width of x1 shouldn't
be too high. Otherwise you are likely going to end up paying the price
for getting the interface out of L1 whenever the traffic goes idle, so
flows that get bursty will pay a heavy penalty when they start dropping
packets.

It is also possible this could be something that changed with the
physical PCIe link. Basically L1 works by powering down the link when
idle and then powering it back up when there is activity. The problem
is that bringing it back up can sometimes be a challenge when the
physical link starts to go faulty. I have seen cases where it even
results in the device falling off the PCIe bus if the link training
fails.
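
If you want to rule that out, a quick sanity check is something along
these lines (just a sketch; it assumes AER is enabled in your kernel
config):

# look for PCIe/AER complaints around the time of the stalls
dmesg | grep -iE 'aer|pcie bus error|0000:03:00.0'
# check whether the device has logged correctable/non-fatal errors
lspci -s 03:00.0 -vvv | grep 'DevSta:'
# confirm the link is still trained at the expected speed/width
lspci -s 03:00.0 -vvv | grep 'LnkSta:'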

> [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> [  5]   0.00-1.00   sec   113 MBytes   950 Mbits/sec   31    710 KBytes
> [  5]   1.00-2.00   sec   110 MBytes   923 Mbits/sec  135    626 KBytes
> [  5]   2.00-3.00   sec   112 MBytes   944 Mbits/sec   18    713 KBytes
> [  5]   3.00-4.00   sec   111 MBytes   933 Mbits/sec    0    798 KBytes
> [  5]   4.00-5.00   sec   111 MBytes   933 Mbits/sec    0    721 KBytes
> [  5]   5.00-6.00   sec   112 MBytes   944 Mbits/sec   31    800 KBytes
> [  5]   6.00-7.00   sec   111 MBytes   933 Mbits/sec    0    730 KBytes
> [  5]   7.00-8.00   sec   111 MBytes   933 Mbits/sec   19    730 KBytes
> [  5]   8.00-9.00   sec   111 MBytes   933 Mbits/sec    0    701 KBytes
> [  5]   9.00-10.00  sec   112 MBytes   944 Mbits/sec   12    701 KBytes
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bitrate         Retr
> [  5]   0.00-10.00  sec  1.09 GBytes   937 Mbits/sec  246             sender
> [  5]   0.00-10.01  sec  1.09 GBytes   933 Mbits/sec                  receiver
>
> [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> [  5]   0.00-1.00   sec   114 MBytes   956 Mbits/sec    0    749 KBytes
> [  5]   1.00-2.00   sec   111 MBytes   933 Mbits/sec   30    766 KBytes
> [  5]   2.00-3.00   sec   112 MBytes   944 Mbits/sec    7    749 KBytes
> [  5]   3.00-4.00   sec   111 MBytes   933 Mbits/sec   11    707 KBytes
> [  5]   4.00-5.00   sec   111 MBytes   933 Mbits/sec    2    699 KBytes
> [  5]   5.00-6.00   sec   111 MBytes   933 Mbits/sec    8    699 KBytes
> [  5]   6.00-7.00   sec   112 MBytes   944 Mbits/sec    1    953 KBytes
> [  5]   7.00-8.00   sec   111 MBytes   933 Mbits/sec    0    701 KBytes
> [  5]   8.00-9.00   sec   111 MBytes   933 Mbits/sec   26    707 KBytes
> [  5]   9.00-10.00  sec   112 MBytes   944 Mbits/sec    2   1.07 MBytes
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bitrate         Retr
> [  5]   0.00-10.00  sec  1.09 GBytes   939 Mbits/sec   87             sender
> [  5]   0.00-10.00  sec  1.09 GBytes   934 Mbits/sec                  receiver
>
> [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> [  5]   0.00-1.00   sec   114 MBytes   953 Mbits/sec   16    908 KBytes
> [  5]   1.00-2.00   sec   112 MBytes   944 Mbits/sec    0    693 KBytes
> [  5]   2.00-3.00   sec   111 MBytes   933 Mbits/sec    0    713 KBytes
> [  5]   3.00-4.00   sec   111 MBytes   933 Mbits/sec    0    687 KBytes
> [  5]   4.00-5.00   sec   112 MBytes   944 Mbits/sec   15    687 KBytes
> [  5]   5.00-6.00   sec   111 MBytes   933 Mbits/sec    2    888 KBytes
> [  5]   6.00-7.00   sec   111 MBytes   933 Mbits/sec   17    696 KBytes
> [  5]   7.00-8.00   sec   111 MBytes   933 Mbits/sec    0    758 KBytes
> [  5]   8.00-9.00   sec   111 MBytes   933 Mbits/sec   31    749 KBytes
> [  5]   9.00-10.00  sec   112 MBytes   944 Mbits/sec    0    792 KBytes
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bitrate         Retr
> [  5]   0.00-10.00  sec  1.09 GBytes   938 Mbits/sec   81             sender
> [  5]   0.00-10.00  sec  1.09 GBytes   934 Mbits/sec                  receiver
>
> [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> [  5]   0.00-1.00   sec   114 MBytes   956 Mbits/sec    0    747 KBytes
> [  5]   1.00-2.00   sec   111 MBytes   933 Mbits/sec    0    744 KBytes
> [  5]   2.00-3.00   sec   112 MBytes   944 Mbits/sec   12   1.18 MBytes
> [  5]   3.00-4.00   sec   111 MBytes   933 Mbits/sec    2    699 KBytes
> [  5]   4.00-5.00   sec   111 MBytes   933 Mbits/sec   28    699 KBytes
> [  5]   5.00-6.00   sec   112 MBytes   944 Mbits/sec    0    684 KBytes
> [  5]   6.00-7.00   sec   111 MBytes   933 Mbits/sec    0    741 KBytes
> [  5]   7.00-8.00   sec   111 MBytes   933 Mbits/sec    3    687 KBytes
> [  5]   8.00-9.00   sec   111 MBytes   933 Mbits/sec   22    699 KBytes
> [  5]   9.00-10.00  sec   111 MBytes   933 Mbits/sec   11    707 KBytes
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bitrate         Retr
> [  5]   0.00-10.00  sec  1.09 GBytes   938 Mbits/sec   78             sender
> [  5]   0.00-10.01  sec  1.09 GBytes   934 Mbits/sec                  receiver
> ---
>
> ethtool -S enp3s0 |grep -v ": 0"
> NIC statistics:
>      rx_packets: 16303520
>      tx_packets: 21602840
>      rx_bytes: 15711958157
>      tx_bytes: 25599009212
>      rx_broadcast: 122212
>      tx_broadcast: 530
>      rx_multicast: 333489
>      tx_multicast: 18446
>      multicast: 333489
>      rx_missed_errors: 270143
>      rx_long_length_errors: 6
>      tx_tcp_seg_good: 1342561
>      rx_long_byte_count: 15711958157
>      rx_errors: 6
>      rx_length_errors: 6
>      rx_fifo_errors: 270143
>      tx_queue_0_packets: 8963830
>      tx_queue_0_bytes: 9803196683
>      tx_queue_0_restart: 4920
>      tx_queue_1_packets: 12639010
>      tx_queue_1_bytes: 15706576814
>      tx_queue_1_restart: 12718
>      rx_queue_0_packets: 16303520
>      rx_queue_0_bytes: 15646744077
>      rx_queue_0_csum_err: 76

Okay, so this result still shows the same length and checksum errors.
Were you resetting the system/statistics between runs?
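
There is no clean way to zero the NIC counters short of reloading the
driver, so the simplest approach is to snapshot and diff them around a
run - a rough sketch (assuming the interface keeps the name enp3s0):

ethtool -S enp3s0 > /tmp/stats.before
# ... run the iperf3 test ...
ethtool -S enp3s0 > /tmp/stats.after
diff /tmp/stats.before /tmp/stats.after
# alternatively, reloading the driver should reset the counters, but it
# will briefly take down every igb interface on the box:
# modprobe -r igb && modprobe igb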

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [Intel-wired-lan] NAT performance issue 944mbit -> ~40mbit
@ 2020-07-16 15:18                       ` Alexander Duyck
  0 siblings, 0 replies; 51+ messages in thread
From: Alexander Duyck @ 2020-07-16 15:18 UTC (permalink / raw)
  To: intel-wired-lan

On Wed, Jul 15, 2020 at 5:00 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
>
> On Thu, Jul 16, 2020 at 1:42 AM Alexander Duyck
> <alexander.duyck@gmail.com> wrote:
> > On Wed, Jul 15, 2020 at 3:51 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > > On Thu, Jul 16, 2020 at 12:32 AM Alexander Duyck
> > > <alexander.duyck@gmail.com> wrote:
> > > > On Wed, Jul 15, 2020 at 3:00 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > > > > On Wed, Jul 15, 2020 at 11:40 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > > > > > On Wed, 15 Jul 2020 23:12:23 +0200 Ian Kumlien wrote:
> > > > > > > On Wed, Jul 15, 2020 at 11:02 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > > > > > > > On Wed, Jul 15, 2020 at 10:31 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > > > > > > > > On Wed, 15 Jul 2020 22:05:58 +0200 Ian Kumlien wrote:
> > > > > > > > > > After a  lot of debugging it turns out that the bug is in igb...
> > > > > > > > > >
> > > > > > > > > > driver: igb
> > > > > > > > > > version: 5.6.0-k
> > > > > > > > > > firmware-version:  0. 6-1
> > > > > > > > > >
> > > > > > > > > > 03:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network
> > > > > > > > > > Connection (rev 03)
> > > > > > > > >
> > > > > > > > > Unclear to me what you're actually reporting. Is this a regression
> > > > > > > > > after a kernel upgrade? Compared to no NAT?
> > > > > > > >
> > > > > > > > It only happens on "internet links"
> > > > > > > >
> > > > > > > > Let's say that A is a client with the igb driver, B is a firewall running NAT
> > > > > > > > with ixgbe drivers, C is another local node with igb and
> > > > > > > > D is a remote node with a bridge backed by a bnx2 interface.
> > > > > > > >
> > > > > > > > A -> B -> C is ok (B and C are on the same switch)
> > > > > > > >
> > > > > > > > A -> B -> D -- 32-40mbit
> > > > > > > >
> > > > > > > > B -> D 944 mbit
> > > > > > > > C -> D 944 mbit
> > > > > > > >
> > > > > > > > A' -> D ~933 mbit (A with realtek nic -- also link is not idle atm)
> > > > > > >
> > > > > > > This should of course be A' -> B -> D
> > > > > > >
> > > > > > > Sorry, I've been scratching my head for about a week...
> > > > > >
> > > > > > Hm, only thing that comes to mind if A' works reliably and A doesn't is
> > > > > > that A has somehow broken TCP offloads. Could you try disabling things
> > > > > > via ethtool -K and see if those settings make a difference?
> > > > >
> > > > > It's a bit hard to test since it behaves like this - with tso turned off:
> > > > > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > > > > [  5]   0.00-1.00   sec   108 MBytes   902 Mbits/sec    0    783 KBytes
> > > > > [  5]   1.00-2.00   sec   110 MBytes   923 Mbits/sec   31    812 KBytes
> > > > > [  5]   2.00-3.00   sec   111 MBytes   933 Mbits/sec   92    772 KBytes
> > > > > [  5]   3.00-4.00   sec   110 MBytes   923 Mbits/sec    0    834 KBytes
> > > > > [  5]   4.00-5.00   sec   111 MBytes   933 Mbits/sec   60    823 KBytes
> > > > > [  5]   5.00-6.00   sec   110 MBytes   923 Mbits/sec   31    789 KBytes
> > > > > [  5]   6.00-7.00   sec   111 MBytes   933 Mbits/sec    0    786 KBytes
> > > > > [  5]   7.00-8.00   sec   110 MBytes   923 Mbits/sec    0    761 KBytes
> > > > > [  5]   8.00-9.00   sec   110 MBytes   923 Mbits/sec    0    772 KBytes
> > > > > [  5]   9.00-10.00  sec   109 MBytes   912 Mbits/sec    0    868 KBytes
> > > > > - - - - - - - - - - - - - - - - - - - - - - - - -
> > > > > [ ID] Interval           Transfer     Bitrate         Retr
> > > > > [  5]   0.00-10.00  sec  1.07 GBytes   923 Mbits/sec  214             sender
> > > > > [  5]   0.00-10.00  sec  1.07 GBytes   920 Mbits/sec                  receiver
> > > > >
> > > > > Continued running tests:
> > > > > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > > > > [  5]   0.00-1.00   sec  5.82 MBytes  48.8 Mbits/sec    0   82.0 KBytes
> > > > > [  5]   1.00-2.00   sec  4.97 MBytes  41.7 Mbits/sec    0    130 KBytes
> > > > > [  5]   2.00-3.00   sec  5.28 MBytes  44.3 Mbits/sec    0   99.0 KBytes
> > > > > [  5]   3.00-4.00   sec  5.28 MBytes  44.3 Mbits/sec    0    105 KBytes
> > > > > [  5]   4.00-5.00   sec  5.28 MBytes  44.3 Mbits/sec    0    122 KBytes
> > > > > [  5]   5.00-6.00   sec  5.28 MBytes  44.3 Mbits/sec    0   82.0 KBytes
> > > > > [  5]   6.00-7.00   sec  5.28 MBytes  44.3 Mbits/sec    0   79.2 KBytes
> > > > > [  5]   7.00-8.00   sec  5.28 MBytes  44.3 Mbits/sec    0    110 KBytes
> > > > > [  5]   8.00-9.00   sec  5.28 MBytes  44.3 Mbits/sec    0    156 KBytes
> > > > > [  5]   9.00-10.00  sec  5.28 MBytes  44.3 Mbits/sec    0   87.7 KBytes
> > > > > - - - - - - - - - - - - - - - - - - - - - - - - -
> > > > > [ ID] Interval           Transfer     Bitrate         Retr
> > > > > [  5]   0.00-10.00  sec  53.0 MBytes  44.5 Mbits/sec    0             sender
> > > > > [  5]   0.00-10.00  sec  52.5 MBytes  44.1 Mbits/sec                  receiver
> > > > >
> > > > > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > > > > [  5]   0.00-1.00   sec  7.08 MBytes  59.4 Mbits/sec    0    156 KBytes
> > > > > [  5]   1.00-2.00   sec  5.97 MBytes  50.0 Mbits/sec    0    110 KBytes
> > > > > [  5]   2.00-3.00   sec  4.97 MBytes  41.7 Mbits/sec    0    124 KBytes
> > > > > [  5]   3.00-4.00   sec  5.47 MBytes  45.9 Mbits/sec    0   96.2 KBytes
> > > > > [  5]   4.00-5.00   sec  5.47 MBytes  45.9 Mbits/sec    0    158 KBytes
> > > > > [  5]   5.00-6.00   sec  4.97 MBytes  41.7 Mbits/sec    0   70.7 KBytes
> > > > > [  5]   6.00-7.00   sec  5.47 MBytes  45.9 Mbits/sec    0    113 KBytes
> > > > > [  5]   7.00-8.00   sec  5.47 MBytes  45.9 Mbits/sec    0   96.2 KBytes
> > > > > [  5]   8.00-9.00   sec  4.97 MBytes  41.7 Mbits/sec    0   84.8 KBytes
> > > > > [  5]   9.00-10.00  sec  5.47 MBytes  45.9 Mbits/sec    0    116 KBytes
> > > > > - - - - - - - - - - - - - - - - - - - - - - - - -
> > > > > [ ID] Interval           Transfer     Bitrate         Retr
> > > > > [  5]   0.00-10.00  sec  55.3 MBytes  46.4 Mbits/sec    0             sender
> > > > > [  5]   0.00-10.00  sec  53.9 MBytes  45.2 Mbits/sec                  receiver
> > > > >
> > > > > And the low bandwidth continues with:
> > > > > ethtool -k enp3s0 |grep ": on"
> > > > > rx-vlan-offload: on
> > > > > tx-vlan-offload: on [requested off]
> > > > > highdma: on [fixed]
> > > > > rx-vlan-filter: on [fixed]
> > > > > tx-gre-segmentation: on
> > > > > tx-gre-csum-segmentation: on
> > > > > tx-ipxip4-segmentation: on
> > > > > tx-ipxip6-segmentation: on
> > > > > tx-udp_tnl-segmentation: on
> > > > > tx-udp_tnl-csum-segmentation: on
> > > > > tx-gso-partial: on
> > > > > tx-udp-segmentation: on
> > > > > hw-tc-offload: on
> > > > >
> > > > > Can't quite find how to turn those off, since the names listed by
> > > > > ethtool aren't the strings you use to enable/disable them
> > > >
> > > > To disable them you can just pass the same string that is shown in the
> > > > display. So it should just be "ethtool -K enp3s0 tx-gso-partial off"
> > > > and that would turn off a large chunk of them as all the encapsulated
> > > > support requires gso partial support.
> > >
> > >  ethtool -k enp3s0 |grep ": on"
> > > highdma: on [fixed]
> > > rx-vlan-filter: on [fixed]
> > > ---
> > > And then back to back:
> > > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > > [  5]   0.00-1.00   sec  4.91 MBytes  41.2 Mbits/sec    0   45.2 KBytes
> > > [  5]   1.00-2.00   sec  4.47 MBytes  37.5 Mbits/sec    0   52.3 KBytes
> > > [  5]   2.00-3.00   sec  4.47 MBytes  37.5 Mbits/sec    0   42.4 KBytes
> > > [  5]   3.00-4.00   sec  4.47 MBytes  37.5 Mbits/sec    0    141 KBytes
> > > [  5]   4.00-5.00   sec   111 MBytes   928 Mbits/sec   63    764 KBytes
> > > [  5]   5.00-6.00   sec  86.2 MBytes   724 Mbits/sec    0    744 KBytes
> > > [  5]   6.00-7.00   sec  98.8 MBytes   828 Mbits/sec   61    769 KBytes
> > > [  5]   7.00-8.00   sec   110 MBytes   923 Mbits/sec    0    749 KBytes
> > > [  5]   8.00-9.00   sec   110 MBytes   923 Mbits/sec    0    741 KBytes
> > > [  5]   9.00-10.00  sec   110 MBytes   923 Mbits/sec   31    761 KBytes
> > > - - - - - - - - - - - - - - - - - - - - - - - - -
> > > [ ID] Interval           Transfer     Bitrate         Retr
> > > [  5]   0.00-10.00  sec   644 MBytes   540 Mbits/sec  155             sender
> > > [  5]   0.00-10.01  sec   641 MBytes   537 Mbits/sec                  receiver
> > >
> > > and we're back at the not working bit:
> > > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > > [  5]   0.00-1.00   sec  4.84 MBytes  40.6 Mbits/sec    0   42.4 KBytes
> > > [  5]   1.00-2.00   sec  4.60 MBytes  38.6 Mbits/sec    0   42.4 KBytes
> > > [  5]   2.00-3.00   sec  4.23 MBytes  35.4 Mbits/sec    0   42.4 KBytes
> > > [  5]   3.00-4.00   sec  4.47 MBytes  37.5 Mbits/sec    0   67.9 KBytes
> > > [  5]   4.00-5.00   sec  4.47 MBytes  37.5 Mbits/sec    0   42.4 KBytes
> > > [  5]   5.00-6.00   sec  4.23 MBytes  35.4 Mbits/sec    0   42.4 KBytes
> > > [  5]   6.00-7.00   sec  4.23 MBytes  35.4 Mbits/sec    0   42.4 KBytes
> > > [  5]   7.00-8.00   sec  4.47 MBytes  37.5 Mbits/sec    0   67.9 KBytes
> > > [  5]   8.00-9.00   sec  4.47 MBytes  37.5 Mbits/sec    0   53.7 KBytes
> > > [  5]   9.00-10.00  sec  4.47 MBytes  37.5 Mbits/sec    0   79.2 KBytes
> > > - - - - - - - - - - - - - - - - - - - - - - - - -
> > > [ ID] Interval           Transfer     Bitrate         Retr
> > > [  5]   0.00-10.00  sec  44.5 MBytes  37.3 Mbits/sec    0             sender
> > > [  5]   0.00-10.00  sec  43.9 MBytes  36.8 Mbits/sec                  receiver
> > >
> > > > > I was hoping that you'd have a clue of something that might introduce
> > > > > a regression - ie specific patches to try to revert
> > > > >
> > > > > Btw, the same issue applies to udp as well
> > > > >
> > > > > [ ID] Interval           Transfer     Bitrate         Total Datagrams
> > > > > [  5]   0.00-1.00   sec  6.77 MBytes  56.8 Mbits/sec  4900
> > > > > [  5]   1.00-2.00   sec  4.27 MBytes  35.8 Mbits/sec  3089
> > > > > [  5]   2.00-3.00   sec  4.20 MBytes  35.2 Mbits/sec  3041
> > > > > [  5]   3.00-4.00   sec  4.30 MBytes  36.1 Mbits/sec  3116
> > > > > [  5]   4.00-5.00   sec  4.24 MBytes  35.6 Mbits/sec  3070
> > > > > [  5]   5.00-6.00   sec  4.21 MBytes  35.3 Mbits/sec  3047
> > > > > [  5]   6.00-7.00   sec  4.29 MBytes  36.0 Mbits/sec  3110
> > > > > [  5]   7.00-8.00   sec  4.28 MBytes  35.9 Mbits/sec  3097
> > > > > [  5]   8.00-9.00   sec  4.25 MBytes  35.6 Mbits/sec  3075
> > > > > [  5]   9.00-10.00  sec  4.20 MBytes  35.2 Mbits/sec  3039
> > > > > - - - - - - - - - - - - - - - - - - - - - - - - -
> > > > > [ ID] Interval           Transfer     Bitrate         Jitter
> > > > > Lost/Total Datagrams
> > > > > [  5]   0.00-10.00  sec  45.0 MBytes  37.7 Mbits/sec  0.000 ms
> > > > > 0/32584 (0%)  sender
> > > > > [  5]   0.00-10.00  sec  45.0 MBytes  37.7 Mbits/sec  0.037 ms
> > > > > 0/32573 (0%)  receiver
> > > > >
> > > > > vs:
> > > > >
> > > > > [ ID] Interval           Transfer     Bitrate         Total Datagrams
> > > > > [  5]   0.00-1.00   sec   114 MBytes   954 Mbits/sec  82342
> > > > > [  5]   1.00-2.00   sec   114 MBytes   955 Mbits/sec  82439
> > > > > [  5]   2.00-3.00   sec   114 MBytes   956 Mbits/sec  82507
> > > > > [  5]   3.00-4.00   sec   114 MBytes   955 Mbits/sec  82432
> > > > > [  5]   4.00-5.00   sec   114 MBytes   956 Mbits/sec  82535
> > > > > [  5]   5.00-6.00   sec   114 MBytes   953 Mbits/sec  82240
> > > > > [  5]   6.00-7.00   sec   114 MBytes   956 Mbits/sec  82512
> > > > > [  5]   7.00-8.00   sec   114 MBytes   956 Mbits/sec  82503
> > > > > [  5]   8.00-9.00   sec   114 MBytes   956 Mbits/sec  82532
> > > > > [  5]   9.00-10.00  sec   114 MBytes   956 Mbits/sec  82488
> > > > > - - - - - - - - - - - - - - - - - - - - - - - - -
> > > > > [ ID] Interval           Transfer     Bitrate         Jitter
> > > > > Lost/Total Datagrams
> > > > > [  5]   0.00-10.00  sec  1.11 GBytes   955 Mbits/sec  0.000 ms
> > > > > 0/824530 (0%)  sender
> > > > > [  5]   0.00-10.01  sec  1.11 GBytes   949 Mbits/sec  0.014 ms
> > > > > 4756/824530 (0.58%)  receiver
> > > >
> > > > The fact that it is impacting UDP seems odd. I wonder if we don't have
> > > > a qdisc somewhere that is misbehaving and throttling the Tx. Either
> > > > that or I wonder if we are getting spammed with flow control frames.
> > >
> > > it sometimes works, it looks like the cwindow just isn't increased -
> > > that's where i started...
> > >
> > > Example:
> > > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > > [  5]   0.00-1.00   sec  4.86 MBytes  40.8 Mbits/sec    0   50.9 KBytes
> > > [  5]   1.00-2.00   sec  4.66 MBytes  39.1 Mbits/sec    0   65.0 KBytes
> > > [  5]   2.00-3.00   sec  4.29 MBytes  36.0 Mbits/sec    0   42.4 KBytes
> > > [  5]   3.00-4.00   sec  4.66 MBytes  39.1 Mbits/sec    0   42.4 KBytes
> > > [  5]   4.00-5.00   sec  23.1 MBytes   194 Mbits/sec    0   1.07 MBytes
> > > [  5]   5.00-6.00   sec   110 MBytes   923 Mbits/sec    0    761 KBytes
> > > [  5]   6.00-7.00   sec  98.8 MBytes   828 Mbits/sec   60    806 KBytes
> > > [  5]   7.00-8.00   sec  82.5 MBytes   692 Mbits/sec    0    812 KBytes
> > > [  5]   8.00-9.00   sec   110 MBytes   923 Mbits/sec   92    761 KBytes
> > > [  5]   9.00-10.00  sec   111 MBytes   933 Mbits/sec    0    755 KBytes
> > > - - - - - - - - - - - - - - - - - - - - - - - - -
> > > [ ID] Interval           Transfer     Bitrate         Retr
> > > [  5]   0.00-10.00  sec   554 MBytes   465 Mbits/sec  152             sender
> > > [  5]   0.00-10.00  sec   550 MBytes   461 Mbits/sec                  receiver
> > >
> > > > It would be useful to include the output of just calling "ethtool
> > > > enp3s0" on the interface to verify the speed, "ethtool -a enp3s0" to
> > > > verify flow control settings, and "ethtool -S enp3s0 | grep -v :\ 0"
> > > > to output the statistics and dump anything that isn't zero.
> > >
> > > ethtool enp3s0
> > > Settings for enp3s0:
> > > Supported ports: [ TP ]
> > > Supported link modes:   10baseT/Half 10baseT/Full
> > >                         100baseT/Half 100baseT/Full
> > >                         1000baseT/Full
> > > Supported pause frame use: Symmetric
> > > Supports auto-negotiation: Yes
> > > Supported FEC modes: Not reported
> > > Advertised link modes:  10baseT/Half 10baseT/Full
> > >                         100baseT/Half 100baseT/Full
> > >                         1000baseT/Full
> > > Advertised pause frame use: Symmetric
> > > Advertised auto-negotiation: Yes
> > > Advertised FEC modes: Not reported
> > > Speed: 1000Mb/s
> > > Duplex: Full
> > > Auto-negotiation: on
> > > Port: Twisted Pair
> > > PHYAD: 1
> > > Transceiver: internal
> > > MDI-X: off (auto)
> > > Supports Wake-on: pumbg
> > > Wake-on: g
> > >         Current message level: 0x00000007 (7)
> > >                                drv probe link
> > > Link detected: yes
> > > ---
> > > ethtool -a enp3s0
> > > Pause parameters for enp3s0:
> > > Autonegotiate: on
> > > RX: on
> > > TX: off
> > > ---
> > > ethtool -S enp3s0 |grep  -v :\ 0
> > > NIC statistics:
> > >      rx_packets: 15920618
> > >      tx_packets: 17846725
> > >      rx_bytes: 15676264423
> > >      tx_bytes: 19925010639
> > >      rx_broadcast: 119553
> > >      tx_broadcast: 497
> > >      rx_multicast: 330193
> > >      tx_multicast: 18190
> > >      multicast: 330193
> > >      rx_missed_errors: 270102
> > >      rx_long_length_errors: 6
> > >      tx_tcp_seg_good: 1342561
> > >      rx_long_byte_count: 15676264423
> > >      rx_errors: 6
> > >      rx_length_errors: 6
> > >      rx_fifo_errors: 270102
> > >      tx_queue_0_packets: 7651168
> > >      tx_queue_0_bytes: 7823281566
> > >      tx_queue_0_restart: 4920
> > >      tx_queue_1_packets: 10195557
> > >      tx_queue_1_bytes: 12027522118
> > >      tx_queue_1_restart: 12718
> > >      rx_queue_0_packets: 15920618
> > >      rx_queue_0_bytes: 15612581951
> > >      rx_queue_0_csum_err: 76
> > > (I've only done two runs since I re-enabled the interface)
> >
> > So I am seeing three things here.
> >
> > The rx_long_length_errors are usually due to an MTU mismatch. Do you
> > have something on the network that is using jumbo frames, or is the
> > MTU on the NIC set to something smaller than what is supported on the
> > network?
>
> I'm using jumbo frames on the local network, internet side is the
> normal 1500 bytes mtu though
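
A quick way to sanity-check the MTU mix from the client (host names are
placeholders; 1472 = 1500 - 28 bytes of IP+ICMP header, 8972 = 9000 - 28):

ip link show dev enp3s0 | grep -o 'mtu [0-9]*'   # per-interface MTU
ping -M do -s 1472 remote-host    # must pass on the 1500-byte internet path
ping -M do -s 8972 local-host     # must pass on the jumbo-frame LAN segment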
>
> > You are getting rx_missed_errors, that would seem to imply that the
> > DMA is not able to keep up. We may want to try disabling the L1 to see
> > if we get any boost from doing that.
>
> It used to work, I don't do benchmarks all the time and sometimes the first
> benchmarks turn out fine... so it's hard to say when this started happening...
>
> It could also be related to a bios upgrade, but I'm pretty sure I did
> successful benchmarks after that...
>
> How do I disable the l1? just echo 0 >
> /sys/bus/pci/drivers/igb/0000\:03\:00.0/link/l1_aspm ?
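
For reference, on kernels that expose the per-device ASPM attributes in
sysfs, that is roughly the idea (the drivers/igb path is a symlink to the
same device node); a sketch, including a quick way to verify the result:

echo 0 > /sys/bus/pci/devices/0000:03:00.0/link/l1_aspm   # disable L1 for this NIC only
lspci -s 03:00.0 -vvv | grep -E 'LnkCtl|LnkSta'           # LnkCtl should now report ASPM Disabled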
>
> > The last bit is that queue 0 is seeing packets with bad checksums. You
> > might want to run some tests and see where the bad checksums are
> > coming from. If they are being detected from a specific NIC such as
> > the ixgbe in your example it might point to some sort of checksum
> > error being created as a result of the NAT translation.
>
> But that should also affect A' and the A -> B -> C case, which it doesn't...
>
> It only seems to happen with higher rtt (6 hops, sub 3 ms in this case
> but still high enough somehow)
>
> > > ---
> > >
> > > > > lspci -s 03:00.0  -vvv
> > > > > 03:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network
> > > > > Connection (rev 03)
> > > > > Subsystem: ASUSTeK Computer Inc. I211 Gigabit Network Connection
> > > > > Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
> > > > > Stepping- SERR- FastB2B- DisINTx+
> > > > > Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
> > > > > <TAbort- <MAbort- >SERR- <PERR- INTx-
> > > > > Latency: 0
> > > > > Interrupt: pin A routed to IRQ 57
> > > > > IOMMU group: 20
> > > > > Region 0: Memory at fc900000 (32-bit, non-prefetchable) [size=128K]
> > > > > Region 2: I/O ports at e000 [size=32]
> > > > > Region 3: Memory at fc920000 (32-bit, non-prefetchable) [size=16K]
> > > > > Capabilities: [40] Power Management version 3
> > > > > Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
> > > > > Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=1 PME-
> > > > > Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
> > > > > Address: 0000000000000000  Data: 0000
> > > > > Masking: 00000000  Pending: 00000000
> > > > > Capabilities: [70] MSI-X: Enable+ Count=5 Masked-
> > > > > Vector table: BAR=3 offset=00000000
> > > > > PBA: BAR=3 offset=00002000
> > > > > Capabilities: [a0] Express (v2) Endpoint, MSI 00
> > > > > DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
> > > > > ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
> > > > > DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
> > > > > RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+ FLReset-
> > > > > MaxPayload 128 bytes, MaxReadReq 512 bytes
> > > > > DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr+ TransPend-
> > > > > LnkCap: Port #3, Speed 2.5GT/s, Width x1, ASPM L0s L1, Exit Latency
> > > > > L0s <2us, L1 <16us
> > > > > ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
> > > > > LnkCtl: ASPM L1 Enabled; RCB 64 bytes, Disabled- CommClk+
> > > > > ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
> > > > > LnkSta: Speed 2.5GT/s (ok), Width x1 (ok)
> > > > > TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
> > > >
> > > > PCIe wise the connection is going to be pretty tight in terms of
> > > > bandwidth. It looks like we have 2.5GT/s with only a single lane of
> > > > PCIe. In addition we are running with ASPM enabled so that means that
> > > > if we don't have enough traffic we are shutting off the one PCIe lane
> > > > we have so if we are getting bursty traffic that can get ugly.
> > >
> > > Humm... is there a way to force disable ASPM in sysfs?
> >
> > Actually the easiest way to do this is to just use setpci.
> >
> > You should be able to dump the word containing the setting via:
> > # setpci -s 3:00.0 0xB0.w
> > 0042
> > # setpci -s 3:00.0 0xB0.w=0040
> >
> > Basically what you do is clear the lower 3 bits of the value so in
> > this case that means replacing the 2 with a 0 based on the output of
> > the first command.
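
In other words (0xB0 is this device's Link Control register: the PCIe
capability sits at 0xa0 in the lspci dump above and Link Control is at
offset 0x10 into it; bits 1:0 are the ASPM Control field, bit 0 = L0s,
bit 1 = L1), a generic version of the same edit:

val=$(setpci -s 03:00.0 0xB0.w)                                # e.g. 0042
setpci -s 03:00.0 0xB0.w=$(printf '%04x' $((0x$val & ~0x3)))   # clear the ASPM enable bits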
>
> Well... I'll be damned... I used to force-enable ASPM... this must be
> related to the change in the default PCIe bus ASPM settings.
> Perhaps disable ASPM if there is only one lane?

Is there any specific reason why you are enabling ASPM? Is this system
a laptop where you are trying to conserve power when on battery? If
not, disabling it probably won't hurt things too much, since the power
consumption for a 2.5GT/s link operating at a width of one shouldn't
be too high. Otherwise you are likely going to end up paying the price
for getting the interface out of L1 whenever the traffic goes idle, so
flows that get bursty will pay a heavy penalty when they start
dropping packets.
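
If the goal is just to rule ASPM out for a test, the system-wide policy
can also be flipped at runtime (assuming CONFIG_PCIEASPM is enabled and
the BIOS hasn't locked ASPM control away from the OS):

cat /sys/module/pcie_aspm/parameters/policy    # e.g. [default] performance powersave powersupersave
echo performance > /sys/module/pcie_aspm/parameters/policy   # stop entering L0s/L1 on all links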

It is also possible this could be something that changed with the
physical PCIe link. Basically L1 works by powering down the link when
idle, and then powering it back up when there is activity. The problem
is bringing it back up can sometimes be a challenge when the physical
link starts to go faulty. I know I have seen that in some cases it can
even result in the device falling off of the PCIe bus if the link
training fails.
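
A quick way to see whether the link itself is acting up is to watch for
PCIe link/error noise while one of the slow transfers is running
(generic greps, nothing device-specific):

dmesg -w | grep -iE 'pcie|aer|aspm|enp3s0'    # watch for link retrain / error messages live
lspci -s 03:00.0 -vvv | grep LnkSta           # speed/width should stay at 2.5GT/s x1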

> [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> [  5]   0.00-1.00   sec   113 MBytes   950 Mbits/sec   31    710 KBytes
> [  5]   1.00-2.00   sec   110 MBytes   923 Mbits/sec  135    626 KBytes
> [  5]   2.00-3.00   sec   112 MBytes   944 Mbits/sec   18    713 KBytes
> [  5]   3.00-4.00   sec   111 MBytes   933 Mbits/sec    0    798 KBytes
> [  5]   4.00-5.00   sec   111 MBytes   933 Mbits/sec    0    721 KBytes
> [  5]   5.00-6.00   sec   112 MBytes   944 Mbits/sec   31    800 KBytes
> [  5]   6.00-7.00   sec   111 MBytes   933 Mbits/sec    0    730 KBytes
> [  5]   7.00-8.00   sec   111 MBytes   933 Mbits/sec   19    730 KBytes
> [  5]   8.00-9.00   sec   111 MBytes   933 Mbits/sec    0    701 KBytes
> [  5]   9.00-10.00  sec   112 MBytes   944 Mbits/sec   12    701 KBytes
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bitrate         Retr
> [  5]   0.00-10.00  sec  1.09 GBytes   937 Mbits/sec  246             sender
> [  5]   0.00-10.01  sec  1.09 GBytes   933 Mbits/sec                  receiver
>
> [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> [  5]   0.00-1.00   sec   114 MBytes   956 Mbits/sec    0    749 KBytes
> [  5]   1.00-2.00   sec   111 MBytes   933 Mbits/sec   30    766 KBytes
> [  5]   2.00-3.00   sec   112 MBytes   944 Mbits/sec    7    749 KBytes
> [  5]   3.00-4.00   sec   111 MBytes   933 Mbits/sec   11    707 KBytes
> [  5]   4.00-5.00   sec   111 MBytes   933 Mbits/sec    2    699 KBytes
> [  5]   5.00-6.00   sec   111 MBytes   933 Mbits/sec    8    699 KBytes
> [  5]   6.00-7.00   sec   112 MBytes   944 Mbits/sec    1    953 KBytes
> [  5]   7.00-8.00   sec   111 MBytes   933 Mbits/sec    0    701 KBytes
> [  5]   8.00-9.00   sec   111 MBytes   933 Mbits/sec   26    707 KBytes
> [  5]   9.00-10.00  sec   112 MBytes   944 Mbits/sec    2   1.07 MBytes
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bitrate         Retr
> [  5]   0.00-10.00  sec  1.09 GBytes   939 Mbits/sec   87             sender
> [  5]   0.00-10.00  sec  1.09 GBytes   934 Mbits/sec                  receiver
>
> [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> [  5]   0.00-1.00   sec   114 MBytes   953 Mbits/sec   16    908 KBytes
> [  5]   1.00-2.00   sec   112 MBytes   944 Mbits/sec    0    693 KBytes
> [  5]   2.00-3.00   sec   111 MBytes   933 Mbits/sec    0    713 KBytes
> [  5]   3.00-4.00   sec   111 MBytes   933 Mbits/sec    0    687 KBytes
> [  5]   4.00-5.00   sec   112 MBytes   944 Mbits/sec   15    687 KBytes
> [  5]   5.00-6.00   sec   111 MBytes   933 Mbits/sec    2    888 KBytes
> [  5]   6.00-7.00   sec   111 MBytes   933 Mbits/sec   17    696 KBytes
> [  5]   7.00-8.00   sec   111 MBytes   933 Mbits/sec    0    758 KBytes
> [  5]   8.00-9.00   sec   111 MBytes   933 Mbits/sec   31    749 KBytes
> [  5]   9.00-10.00  sec   112 MBytes   944 Mbits/sec    0    792 KBytes
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bitrate         Retr
> [  5]   0.00-10.00  sec  1.09 GBytes   938 Mbits/sec   81             sender
> [  5]   0.00-10.00  sec  1.09 GBytes   934 Mbits/sec                  receiver
>
> [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> [  5]   0.00-1.00   sec   114 MBytes   956 Mbits/sec    0    747 KBytes
> [  5]   1.00-2.00   sec   111 MBytes   933 Mbits/sec    0    744 KBytes
> [  5]   2.00-3.00   sec   112 MBytes   944 Mbits/sec   12   1.18 MBytes
> [  5]   3.00-4.00   sec   111 MBytes   933 Mbits/sec    2    699 KBytes
> [  5]   4.00-5.00   sec   111 MBytes   933 Mbits/sec   28    699 KBytes
> [  5]   5.00-6.00   sec   112 MBytes   944 Mbits/sec    0    684 KBytes
> [  5]   6.00-7.00   sec   111 MBytes   933 Mbits/sec    0    741 KBytes
> [  5]   7.00-8.00   sec   111 MBytes   933 Mbits/sec    3    687 KBytes
> [  5]   8.00-9.00   sec   111 MBytes   933 Mbits/sec   22    699 KBytes
> [  5]   9.00-10.00  sec   111 MBytes   933 Mbits/sec   11    707 KBytes
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bitrate         Retr
> [  5]   0.00-10.00  sec  1.09 GBytes   938 Mbits/sec   78             sender
> [  5]   0.00-10.01  sec  1.09 GBytes   934 Mbits/sec                  receiver
> ---
>
> ethtool -S enp3s0 |grep -v ": 0"
> NIC statistics:
>      rx_packets: 16303520
>      tx_packets: 21602840
>      rx_bytes: 15711958157
>      tx_bytes: 25599009212
>      rx_broadcast: 122212
>      tx_broadcast: 530
>      rx_multicast: 333489
>      tx_multicast: 18446
>      multicast: 333489
>      rx_missed_errors: 270143
>      rx_long_length_errors: 6
>      tx_tcp_seg_good: 1342561
>      rx_long_byte_count: 15711958157
>      rx_errors: 6
>      rx_length_errors: 6
>      rx_fifo_errors: 270143
>      tx_queue_0_packets: 8963830
>      tx_queue_0_bytes: 9803196683
>      tx_queue_0_restart: 4920
>      tx_queue_1_packets: 12639010
>      tx_queue_1_bytes: 15706576814
>      tx_queue_1_restart: 12718
>      rx_queue_0_packets: 16303520
>      rx_queue_0_bytes: 15646744077
>      rx_queue_0_csum_err: 76

Okay, so this result still has the same length and checksum errors,
were you resetting the system/statistics between runs?
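
There is no statistics-reset knob in igb as far as I know, so the usual
options are to diff snapshots taken around a run, or to reload the
driver; a sketch (server name is a placeholder):

ethtool -S enp3s0 > /tmp/stats.before
iperf3 -c server-host -t 10                  # the run under test
ethtool -S enp3s0 > /tmp/stats.after
diff /tmp/stats.before /tmp/stats.after      # only the counters that moved
modprobe -r igb && modprobe igb              # or start from zero (drops all igb links briefly)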

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [Intel-wired-lan] NAT performance issue 944mbit -> ~40mbit
  2020-07-16 15:18                       ` Alexander Duyck
  (?)
@ 2020-07-16 15:51                       ` Ian Kumlien
  -1 siblings, 0 replies; 51+ messages in thread
From: Ian Kumlien @ 2020-07-16 15:51 UTC (permalink / raw)
  To: intel-wired-lan

On Thursday, July 16, 2020, Alexander Duyck <alexander.duyck@gmail.com>
wrote:

> On Wed, Jul 15, 2020 at 5:00 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> >
> > On Thu, Jul 16, 2020 at 1:42 AM Alexander Duyck
> > <alexander.duyck@gmail.com> wrote:
> > > On Wed, Jul 15, 2020 at 3:51 PM Ian Kumlien <ian.kumlien@gmail.com>
> wrote:
> > > > On Thu, Jul 16, 2020 at 12:32 AM Alexander Duyck
> > > > <alexander.duyck@gmail.com> wrote:
> > > > > On Wed, Jul 15, 2020 at 3:00 PM Ian Kumlien <ian.kumlien@gmail.com>
> wrote:
> > > > > > On Wed, Jul 15, 2020 at 11:40 PM Jakub Kicinski <kuba@kernel.org>
> wrote:
> > > > > > > On Wed, 15 Jul 2020 23:12:23 +0200 Ian Kumlien wrote:
> > > > > > > > On Wed, Jul 15, 2020 at 11:02 PM Ian Kumlien <
> ian.kumlien at gmail.com> wrote:
> > > > > > > > > On Wed, Jul 15, 2020 at 10:31 PM Jakub Kicinski <
> kuba at kernel.org> wrote:
> > > > > > > > > > On Wed, 15 Jul 2020 22:05:58 +0200 Ian Kumlien wrote:
> > > > > > > > > > > After a  lot of debugging it turns out that the bug is
> in igb...
> > > > > > > > > > >
> > > > > > > > > > > driver: igb
> > > > > > > > > > > version: 5.6.0-k
> > > > > > > > > > > firmware-version:  0. 6-1
> > > > > > > > > > >
> > > > > > > > > > > 03:00.0 Ethernet controller: Intel Corporation I211
> Gigabit Network
> > > > > > > > > > > Connection (rev 03)
> > > > > > > > > >
> > > > > > > > > > Unclear to me what you're actually reporting. Is this a
> regression
> > > > > > > > > > after a kernel upgrade? Compared to no NAT?
> > > > > > > > >
> > > > > > > > > It only happens on "internet links"
> > > > > > > > >
> > > > > > > > > Let's say that A is a client with the igb driver, B is a
> > > > > > > > > firewall running NAT
> > > > > > > > > with ixgbe drivers, C is another local node with igb and
> > > > > > > > > D is a remote node with a bridge backed by a bnx2
> interface.
> > > > > > > > >
> > > > > > > > > A -> B -> C is ok (B and C is on the same switch)
> > > > > > > > >
> > > > > > > > > A -> B -> D -- 32-40mbit
> > > > > > > > >
> > > > > > > > > B -> D 944 mbit
> > > > > > > > > C -> D 944 mbit
> > > > > > > > >
> > > > > > > > > A' -> D ~933 mbit (A with realtek nic -- also link is not
> idle atm)
> > > > > > > >
> > > > > > > > This should of course be A' -> B -> D
> > > > > > > >
> > > > > > > > Sorry, I've been scratching my head for about a week...
> > > > > > >
> > > > > > > Hm, only thing that comes to mind if A' works reliably and A
> doesn't is
> > > > > > > that A has somehow broken TCP offloads. Could you try
> disabling things
> > > > > > > via ethtool -K and see if those settings make a difference?
> > > > > >
> > > > > > It's a bit hard since it works like this; with TSO turned off:
> > > > > > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > > > > > [  5]   0.00-1.00   sec   108 MBytes   902 Mbits/sec    0    783
> KBytes
> > > > > > [  5]   1.00-2.00   sec   110 MBytes   923 Mbits/sec   31    812
> KBytes
> > > > > > [  5]   2.00-3.00   sec   111 MBytes   933 Mbits/sec   92    772
> KBytes
> > > > > > [  5]   3.00-4.00   sec   110 MBytes   923 Mbits/sec    0    834
> KBytes
> > > > > > [  5]   4.00-5.00   sec   111 MBytes   933 Mbits/sec   60    823
> KBytes
> > > > > > [  5]   5.00-6.00   sec   110 MBytes   923 Mbits/sec   31    789
> KBytes
> > > > > > [  5]   6.00-7.00   sec   111 MBytes   933 Mbits/sec    0    786
> KBytes
> > > > > > [  5]   7.00-8.00   sec   110 MBytes   923 Mbits/sec    0    761
> KBytes
> > > > > > [  5]   8.00-9.00   sec   110 MBytes   923 Mbits/sec    0    772
> KBytes
> > > > > > [  5]   9.00-10.00  sec   109 MBytes   912 Mbits/sec    0    868
> KBytes
> > > > > > - - - - - - - - - - - - - - - - - - - - - - - - -
> > > > > > [ ID] Interval           Transfer     Bitrate         Retr
> > > > > > [  5]   0.00-10.00  sec  1.07 GBytes   923 Mbits/sec  214
>      sender
> > > > > > [  5]   0.00-10.00  sec  1.07 GBytes   920 Mbits/sec
>       receiver
> > > > > >
> > > > > > Continued running tests:
> > > > > > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > > > > > [  5]   0.00-1.00   sec  5.82 MBytes  48.8 Mbits/sec    0   82.0
> KBytes
> > > > > > [  5]   1.00-2.00   sec  4.97 MBytes  41.7 Mbits/sec    0    130
> KBytes
> > > > > > [  5]   2.00-3.00   sec  5.28 MBytes  44.3 Mbits/sec    0   99.0
> KBytes
> > > > > > [  5]   3.00-4.00   sec  5.28 MBytes  44.3 Mbits/sec    0    105
> KBytes
> > > > > > [  5]   4.00-5.00   sec  5.28 MBytes  44.3 Mbits/sec    0    122
> KBytes
> > > > > > [  5]   5.00-6.00   sec  5.28 MBytes  44.3 Mbits/sec    0   82.0
> KBytes
> > > > > > [  5]   6.00-7.00   sec  5.28 MBytes  44.3 Mbits/sec    0   79.2
> KBytes
> > > > > > [  5]   7.00-8.00   sec  5.28 MBytes  44.3 Mbits/sec    0    110
> KBytes
> > > > > > [  5]   8.00-9.00   sec  5.28 MBytes  44.3 Mbits/sec    0    156
> KBytes
> > > > > > [  5]   9.00-10.00  sec  5.28 MBytes  44.3 Mbits/sec    0   87.7
> KBytes
> > > > > > - - - - - - - - - - - - - - - - - - - - - - - - -
> > > > > > [ ID] Interval           Transfer     Bitrate         Retr
> > > > > > [  5]   0.00-10.00  sec  53.0 MBytes  44.5 Mbits/sec    0
>      sender
> > > > > > [  5]   0.00-10.00  sec  52.5 MBytes  44.1 Mbits/sec
>       receiver
> > > > > >
> > > > > > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > > > > > [  5]   0.00-1.00   sec  7.08 MBytes  59.4 Mbits/sec    0    156
> KBytes
> > > > > > [  5]   1.00-2.00   sec  5.97 MBytes  50.0 Mbits/sec    0    110
> KBytes
> > > > > > [  5]   2.00-3.00   sec  4.97 MBytes  41.7 Mbits/sec    0    124
> KBytes
> > > > > > [  5]   3.00-4.00   sec  5.47 MBytes  45.9 Mbits/sec    0   96.2
> KBytes
> > > > > > [  5]   4.00-5.00   sec  5.47 MBytes  45.9 Mbits/sec    0    158
> KBytes
> > > > > > [  5]   5.00-6.00   sec  4.97 MBytes  41.7 Mbits/sec    0   70.7
> KBytes
> > > > > > [  5]   6.00-7.00   sec  5.47 MBytes  45.9 Mbits/sec    0    113
> KBytes
> > > > > > [  5]   7.00-8.00   sec  5.47 MBytes  45.9 Mbits/sec    0   96.2
> KBytes
> > > > > > [  5]   8.00-9.00   sec  4.97 MBytes  41.7 Mbits/sec    0   84.8
> KBytes
> > > > > > [  5]   9.00-10.00  sec  5.47 MBytes  45.9 Mbits/sec    0    116
> KBytes
> > > > > > - - - - - - - - - - - - - - - - - - - - - - - - -
> > > > > > [ ID] Interval           Transfer     Bitrate         Retr
> > > > > > [  5]   0.00-10.00  sec  55.3 MBytes  46.4 Mbits/sec    0
>      sender
> > > > > > [  5]   0.00-10.00  sec  53.9 MBytes  45.2 Mbits/sec
>       receiver
> > > > > >
> > > > > > And the low bandwidth continues with:
> > > > > > ethtool -k enp3s0 |grep ": on"
> > > > > > rx-vlan-offload: on
> > > > > > tx-vlan-offload: on [requested off]
> > > > > > highdma: on [fixed]
> > > > > > rx-vlan-filter: on [fixed]
> > > > > > tx-gre-segmentation: on
> > > > > > tx-gre-csum-segmentation: on
> > > > > > tx-ipxip4-segmentation: on
> > > > > > tx-ipxip6-segmentation: on
> > > > > > tx-udp_tnl-segmentation: on
> > > > > > tx-udp_tnl-csum-segmentation: on
> > > > > > tx-gso-partial: on
> > > > > > tx-udp-segmentation: on
> > > > > > hw-tc-offload: on
> > > > > >
> > > > > > I can't quite find how to turn those off, since they aren't listed
> > > > > > under those names in ethtool (the text shown isn't what you use to
> > > > > > enable/disable them)
> > > > >
> > > > > To disable them you would just repeat the same string in the
> display
> > > > > string. So it should just be "ethtool -K enp3s0 tx-gso-partial off"
> > > > > and that would turn off a large chunk of them as all the
> encapsulated
> > > > > support requires gso partial support.
> > > >
> > > >  ethtool -k enp3s0 |grep ": on"
> > > > highdma: on [fixed]
> > > > rx-vlan-filter: on [fixed]
> > > > ---
> > > > And then back to back:
> > > > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > > > [  5]   0.00-1.00   sec  4.91 MBytes  41.2 Mbits/sec    0   45.2
> KBytes
> > > > [  5]   1.00-2.00   sec  4.47 MBytes  37.5 Mbits/sec    0   52.3
> KBytes
> > > > [  5]   2.00-3.00   sec  4.47 MBytes  37.5 Mbits/sec    0   42.4
> KBytes
> > > > [  5]   3.00-4.00   sec  4.47 MBytes  37.5 Mbits/sec    0    141
> KBytes
> > > > [  5]   4.00-5.00   sec   111 MBytes   928 Mbits/sec   63    764
> KBytes
> > > > [  5]   5.00-6.00   sec  86.2 MBytes   724 Mbits/sec    0    744
> KBytes
> > > > [  5]   6.00-7.00   sec  98.8 MBytes   828 Mbits/sec   61    769
> KBytes
> > > > [  5]   7.00-8.00   sec   110 MBytes   923 Mbits/sec    0    749
> KBytes
> > > > [  5]   8.00-9.00   sec   110 MBytes   923 Mbits/sec    0    741
> KBytes
> > > > [  5]   9.00-10.00  sec   110 MBytes   923 Mbits/sec   31    761
> KBytes
> > > > - - - - - - - - - - - - - - - - - - - - - - - - -
> > > > [ ID] Interval           Transfer     Bitrate         Retr
> > > > [  5]   0.00-10.00  sec   644 MBytes   540 Mbits/sec  155
>  sender
> > > > [  5]   0.00-10.01  sec   641 MBytes   537 Mbits/sec
>   receiver
> > > >
> > > > and we're back at the not working bit:
> > > > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > > > [  5]   0.00-1.00   sec  4.84 MBytes  40.6 Mbits/sec    0   42.4
> KBytes
> > > > [  5]   1.00-2.00   sec  4.60 MBytes  38.6 Mbits/sec    0   42.4
> KBytes
> > > > [  5]   2.00-3.00   sec  4.23 MBytes  35.4 Mbits/sec    0   42.4
> KBytes
> > > > [  5]   3.00-4.00   sec  4.47 MBytes  37.5 Mbits/sec    0   67.9
> KBytes
> > > > [  5]   4.00-5.00   sec  4.47 MBytes  37.5 Mbits/sec    0   42.4
> KBytes
> > > > [  5]   5.00-6.00   sec  4.23 MBytes  35.4 Mbits/sec    0   42.4
> KBytes
> > > > [  5]   6.00-7.00   sec  4.23 MBytes  35.4 Mbits/sec    0   42.4
> KBytes
> > > > [  5]   7.00-8.00   sec  4.47 MBytes  37.5 Mbits/sec    0   67.9
> KBytes
> > > > [  5]   8.00-9.00   sec  4.47 MBytes  37.5 Mbits/sec    0   53.7
> KBytes
> > > > [  5]   9.00-10.00  sec  4.47 MBytes  37.5 Mbits/sec    0   79.2
> KBytes
> > > > - - - - - - - - - - - - - - - - - - - - - - - - -
> > > > [ ID] Interval           Transfer     Bitrate         Retr
> > > > [  5]   0.00-10.00  sec  44.5 MBytes  37.3 Mbits/sec    0
>  sender
> > > > [  5]   0.00-10.00  sec  43.9 MBytes  36.8 Mbits/sec
>   receiver
> > > >
> > > > > > I was hoping that you'd have a clue of something that might
> introduce
> > > > > > a regression - ie specific patches to try to revert
> > > > > >
> > > > > > Btw, the same issue applies to udp as well
> > > > > >
> > > > > > [ ID] Interval           Transfer     Bitrate         Total
> Datagrams
> > > > > > [  5]   0.00-1.00   sec  6.77 MBytes  56.8 Mbits/sec  4900
> > > > > > [  5]   1.00-2.00   sec  4.27 MBytes  35.8 Mbits/sec  3089
> > > > > > [  5]   2.00-3.00   sec  4.20 MBytes  35.2 Mbits/sec  3041
> > > > > > [  5]   3.00-4.00   sec  4.30 MBytes  36.1 Mbits/sec  3116
> > > > > > [  5]   4.00-5.00   sec  4.24 MBytes  35.6 Mbits/sec  3070
> > > > > > [  5]   5.00-6.00   sec  4.21 MBytes  35.3 Mbits/sec  3047
> > > > > > [  5]   6.00-7.00   sec  4.29 MBytes  36.0 Mbits/sec  3110
> > > > > > [  5]   7.00-8.00   sec  4.28 MBytes  35.9 Mbits/sec  3097
> > > > > > [  5]   8.00-9.00   sec  4.25 MBytes  35.6 Mbits/sec  3075
> > > > > > [  5]   9.00-10.00  sec  4.20 MBytes  35.2 Mbits/sec  3039
> > > > > > - - - - - - - - - - - - - - - - - - - - - - - - -
> > > > > > [ ID] Interval           Transfer     Bitrate         Jitter
> > > > > > Lost/Total Datagrams
> > > > > > [  5]   0.00-10.00  sec  45.0 MBytes  37.7 Mbits/sec  0.000 ms
> > > > > > 0/32584 (0%)  sender
> > > > > > [  5]   0.00-10.00  sec  45.0 MBytes  37.7 Mbits/sec  0.037 ms
> > > > > > 0/32573 (0%)  receiver
> > > > > >
> > > > > > vs:
> > > > > >
> > > > > > [ ID] Interval           Transfer     Bitrate         Total
> Datagrams
> > > > > > [  5]   0.00-1.00   sec   114 MBytes   954 Mbits/sec  82342
> > > > > > [  5]   1.00-2.00   sec   114 MBytes   955 Mbits/sec  82439
> > > > > > [  5]   2.00-3.00   sec   114 MBytes   956 Mbits/sec  82507
> > > > > > [  5]   3.00-4.00   sec   114 MBytes   955 Mbits/sec  82432
> > > > > > [  5]   4.00-5.00   sec   114 MBytes   956 Mbits/sec  82535
> > > > > > [  5]   5.00-6.00   sec   114 MBytes   953 Mbits/sec  82240
> > > > > > [  5]   6.00-7.00   sec   114 MBytes   956 Mbits/sec  82512
> > > > > > [  5]   7.00-8.00   sec   114 MBytes   956 Mbits/sec  82503
> > > > > > [  5]   8.00-9.00   sec   114 MBytes   956 Mbits/sec  82532
> > > > > > [  5]   9.00-10.00  sec   114 MBytes   956 Mbits/sec  82488
> > > > > > - - - - - - - - - - - - - - - - - - - - - - - - -
> > > > > > [ ID] Interval           Transfer     Bitrate         Jitter
> > > > > > Lost/Total Datagrams
> > > > > > [  5]   0.00-10.00  sec  1.11 GBytes   955 Mbits/sec  0.000 ms
> > > > > > 0/824530 (0%)  sender
> > > > > > [  5]   0.00-10.01  sec  1.11 GBytes   949 Mbits/sec  0.014 ms
> > > > > > 4756/824530 (0.58%)  receiver
> > > > >
> > > > > The fact that it is impacting UDP seems odd. I wonder if we don't
> have
> > > > > a qdisc somewhere that is misbehaving and throttling the Tx. Either
> > > > > that or I wonder if we are getting spammed with flow control
> frames.
> > > >
> > > > it sometimes works, it looks like the cwindow just isn't increased -
> > > > that's where i started...
> > > >
> > > > Example:
> > > > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > > > [  5]   0.00-1.00   sec  4.86 MBytes  40.8 Mbits/sec    0   50.9
> KBytes
> > > > [  5]   1.00-2.00   sec  4.66 MBytes  39.1 Mbits/sec    0   65.0
> KBytes
> > > > [  5]   2.00-3.00   sec  4.29 MBytes  36.0 Mbits/sec    0   42.4
> KBytes
> > > > [  5]   3.00-4.00   sec  4.66 MBytes  39.1 Mbits/sec    0   42.4
> KBytes
> > > > [  5]   4.00-5.00   sec  23.1 MBytes   194 Mbits/sec    0   1.07
> MBytes
> > > > [  5]   5.00-6.00   sec   110 MBytes   923 Mbits/sec    0    761
> KBytes
> > > > [  5]   6.00-7.00   sec  98.8 MBytes   828 Mbits/sec   60    806
> KBytes
> > > > [  5]   7.00-8.00   sec  82.5 MBytes   692 Mbits/sec    0    812
> KBytes
> > > > [  5]   8.00-9.00   sec   110 MBytes   923 Mbits/sec   92    761
> KBytes
> > > > [  5]   9.00-10.00  sec   111 MBytes   933 Mbits/sec    0    755
> KBytes
> > > > - - - - - - - - - - - - - - - - - - - - - - - - -
> > > > [ ID] Interval           Transfer     Bitrate         Retr
> > > > [  5]   0.00-10.00  sec   554 MBytes   465 Mbits/sec  152
>  sender
> > > > [  5]   0.00-10.00  sec   550 MBytes   461 Mbits/sec
>   receiver
> > > >
> > > > > It would be useful to include the output of just calling "ethtool
> > > > > enp3s0" on the interface to verify the speed, "ethtool -a enp3s0"
> to
> > > > > verify flow control settings, and "ethtool -S enp3s0 | grep -v :\
> 0"
> > > > > to output the statistics and dump anything that isn't zero.
> > > >
> > > > ethtool enp3s0
> > > > Settings for enp3s0:
> > > > Supported ports: [ TP ]
> > > > Supported link modes:   10baseT/Half 10baseT/Full
> > > >                         100baseT/Half 100baseT/Full
> > > >                         1000baseT/Full
> > > > Supported pause frame use: Symmetric
> > > > Supports auto-negotiation: Yes
> > > > Supported FEC modes: Not reported
> > > > Advertised link modes:  10baseT/Half 10baseT/Full
> > > >                         100baseT/Half 100baseT/Full
> > > >                         1000baseT/Full
> > > > Advertised pause frame use: Symmetric
> > > > Advertised auto-negotiation: Yes
> > > > Advertised FEC modes: Not reported
> > > > Speed: 1000Mb/s
> > > > Duplex: Full
> > > > Auto-negotiation: on
> > > > Port: Twisted Pair
> > > > PHYAD: 1
> > > > Transceiver: internal
> > > > MDI-X: off (auto)
> > > > Supports Wake-on: pumbg
> > > > Wake-on: g
> > > >         Current message level: 0x00000007 (7)
> > > >                                drv probe link
> > > > Link detected: yes
> > > > ---
> > > > ethtool -a enp3s0
> > > > Pause parameters for enp3s0:
> > > > Autonegotiate: on
> > > > RX: on
> > > > TX: off
> > > > ---
> > > > ethtool -S enp3s0 |grep  -v :\ 0
> > > > NIC statistics:
> > > >      rx_packets: 15920618
> > > >      tx_packets: 17846725
> > > >      rx_bytes: 15676264423
> > > >      tx_bytes: 19925010639
> > > >      rx_broadcast: 119553
> > > >      tx_broadcast: 497
> > > >      rx_multicast: 330193
> > > >      tx_multicast: 18190
> > > >      multicast: 330193
> > > >      rx_missed_errors: 270102
> > > >      rx_long_length_errors: 6
> > > >      tx_tcp_seg_good: 1342561
> > > >      rx_long_byte_count: 15676264423
> > > >      rx_errors: 6
> > > >      rx_length_errors: 6
> > > >      rx_fifo_errors: 270102
> > > >      tx_queue_0_packets: 7651168
> > > >      tx_queue_0_bytes: 7823281566
> > > >      tx_queue_0_restart: 4920
> > > >      tx_queue_1_packets: 10195557
> > > >      tx_queue_1_bytes: 12027522118
> > > >      tx_queue_1_restart: 12718
> > > >      rx_queue_0_packets: 15920618
> > > >      rx_queue_0_bytes: 15612581951
> > > >      rx_queue_0_csum_err: 76
> > > > (I've only done two runs since I re-enabled the interface)
> > >
> > > So I am seeing three things here.
> > >
> > > The rx_long_length_errors are usually due to an MTU mismatch. Do you
> > > have something on the network that is using jumbo frames, or is the
> > > MTU on the NIC set to something smaller than what is supported on the
> > > network?
> >
> > I'm using jumbo frames on the local network, internet side is the
> > normal 1500 bytes mtu though
> >
> > > You are getting rx_missed_errors, that would seem to imply that the
> > > DMA is not able to keep up. We may want to try disabling the L1 to see
> > > if we get any boost from doing that.
> >
> > It used to work, I don't do benchmarks all the time and sometimes the
> first
> > benchmarks turn out fine... so it's hard to say when this started
> happening...
> >
> > It could also be related to a bios upgrade, but I'm pretty sure I did
> > successful benchmarks after that...
> >
> > How do I disable the l1? just echo 0 >
> > /sys/bus/pci/drivers/igb/0000\:03\:00.0/link/l1_aspm ?
> >
> > > The last bit is that queue 0 is seeing packets with bad checksums. You
> > > might want to run some tests and see where the bad checksums are
> > > coming from. If they are being detected from a specific NIC such as
> > > the ixgbe in your example it might point to some sort of checksum
> > > error being created as a result of the NAT translation.
> >
> > But that should also affect A' and the A -> B -> C case, which it
> doesn't...
> >
> > It only seems to happen with higher rtt (6 hops, sub 3 ms in this case
> > but still high enough somehow)
> >
> > > > ---
> > > >
> > > > > > lspci -s 03:00.0  -vvv
> > > > > > 03:00.0 Ethernet controller: Intel Corporation I211 Gigabit
> Network
> > > > > > Connection (rev 03)
> > > > > > Subsystem: ASUSTeK Computer Inc. I211 Gigabit Network Connection
> > > > > > Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
> ParErr-
> > > > > > Stepping- SERR- FastB2B- DisINTx+
> > > > > > Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
> > > > > > <TAbort- <MAbort- >SERR- <PERR- INTx-
> > > > > > Latency: 0
> > > > > > Interrupt: pin A routed to IRQ 57
> > > > > > IOMMU group: 20
> > > > > > Region 0: Memory at fc900000 (32-bit, non-prefetchable)
> [size=128K]
> > > > > > Region 2: I/O ports at e000 [size=32]
> > > > > > Region 3: Memory at fc920000 (32-bit, non-prefetchable)
> [size=16K]
> > > > > > Capabilities: [40] Power Management version 3
> > > > > > Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA
> PME(D0+,D1-,D2-,D3hot+,D3cold+)
> > > > > > Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=1 PME-
> > > > > > Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
> > > > > > Address: 0000000000000000  Data: 0000
> > > > > > Masking: 00000000  Pending: 00000000
> > > > > > Capabilities: [70] MSI-X: Enable+ Count=5 Masked-
> > > > > > Vector table: BAR=3 offset=00000000
> > > > > > PBA: BAR=3 offset=00002000
> > > > > > Capabilities: [a0] Express (v2) Endpoint, MSI 00
> > > > > > DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s <512ns,
> L1 <64us
> > > > > > ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit
> 0.000W
> > > > > > DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
> > > > > > RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+ FLReset-
> > > > > > MaxPayload 128 bytes, MaxReadReq 512 bytes
> > > > > > DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr+
> TransPend-
> > > > > > LnkCap: Port #3, Speed 2.5GT/s, Width x1, ASPM L0s L1, Exit
> Latency
> > > > > > L0s <2us, L1 <16us
> > > > > > ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
> > > > > > LnkCtl: ASPM L1 Enabled; RCB 64 bytes, Disabled- CommClk+
> > > > > > ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
> > > > > > LnkSta: Speed 2.5GT/s (ok), Width x1 (ok)
> > > > > > TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
> > > > >
> > > > > PCIe wise the connection is going to be pretty tight in terms of
> > > > > bandwidth. It looks like we have 2.5GT/s with only a single lane of
> > > > > PCIe. In addition we are running with ASPM enabled so that means
> that
> > > > > if we don't have enough traffic we are shutting off the one PCIe
> lane
> > > > > we have so if we are getting bursty traffic that can get ugly.
> > > >
> > > > Humm... is there a way to force disable ASPM in sysfs?
> > >
> > > Actually the easiest way to do this is to just use setpci.
> > >
> > > You should be able to dump the word containing the setting via:
> > > # setpci -s 3:00.0 0xB0.w
> > > 0042
> > > # setpci -s 3:00.0 0xB0.w=0040
> > >
> > > Basically what you do is clear the lower 3 bits of the value so in
> > > this case that means replacing the 2 with a 0 based on the output of
> > > the first command.
> >
> > Well... I'll be damned... I used to force-enable ASPM... this must be
> > related to the change in the default PCIe bus ASPM settings.
> > Perhaps disable ASPM if there is only one lane?
>
> Is there any specific reason why you are enabling ASPM? Is this system
> a laptop where you are trying to conserve power when on battery? If
> not, disabling it probably won't hurt things too much, since the power
> consumption for a 2.5GT/s link operating at a width of one shouldn't
> be too high. Otherwise you are likely going to end up paying the price
> for getting the interface out of L1 whenever the traffic goes idle, so
> flows that get bursty will pay a heavy penalty when they start
> dropping packets.
>

Ah, you misunderstand: I used to force-enable ASPM myself and everything
worked. Now Linux enables ASPM by default on all PCIe controllers, so IMHO
this should be a quirk: if there is only one lane, don't enable ASPM,
because of latency and timing issues...
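
Until something like that exists in the kernel, the workaround has to
live on the command line or in userspace; a sketch of both, assuming
the NIC stays at 0000:03:00.0:

# blunt: disable ASPM everywhere via the kernel command line,
# e.g. add to GRUB_CMDLINE_LINUX in /etc/default/grub:
#   pcie_aspm=off
# targeted: re-apply the per-device setting at every boot, e.g. from
# rc.local or a oneshot systemd unit:
echo 0 > /sys/bus/pci/devices/0000:03:00.0/link/l1_aspm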


> It is also possible this could be something that changed with the
> physical PCIe link. Basically L1 works by powering down the link when
> idle, and then powering it back up when there is activity. The problem
> is bringing it back up can sometimes be a challenge when the physical
> link starts to go faulty. I know I have seen that in some cases it can
> even result in the device falling off of the PCIe bus if the link
> training fails.


It works fine without ASPM (and the machine is pretty new)

I suspect we hit some timing race with aggressive ASPM (an assumption,
since it works on local links but not over ~3 ms links)


> > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > [  5]   0.00-1.00   sec   113 MBytes   950 Mbits/sec   31    710 KBytes
> > [  5]   1.00-2.00   sec   110 MBytes   923 Mbits/sec  135    626 KBytes
> > [  5]   2.00-3.00   sec   112 MBytes   944 Mbits/sec   18    713 KBytes
> > [  5]   3.00-4.00   sec   111 MBytes   933 Mbits/sec    0    798 KBytes
> > [  5]   4.00-5.00   sec   111 MBytes   933 Mbits/sec    0    721 KBytes
> > [  5]   5.00-6.00   sec   112 MBytes   944 Mbits/sec   31    800 KBytes
> > [  5]   6.00-7.00   sec   111 MBytes   933 Mbits/sec    0    730 KBytes
> > [  5]   7.00-8.00   sec   111 MBytes   933 Mbits/sec   19    730 KBytes
> > [  5]   8.00-9.00   sec   111 MBytes   933 Mbits/sec    0    701 KBytes
> > [  5]   9.00-10.00  sec   112 MBytes   944 Mbits/sec   12    701 KBytes
> > - - - - - - - - - - - - - - - - - - - - - - - - -
> > [ ID] Interval           Transfer     Bitrate         Retr
> > [  5]   0.00-10.00  sec  1.09 GBytes   937 Mbits/sec  246
>  sender
> > [  5]   0.00-10.01  sec  1.09 GBytes   933 Mbits/sec
> receiver
> >
> > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > [  5]   0.00-1.00   sec   114 MBytes   956 Mbits/sec    0    749 KBytes
> > [  5]   1.00-2.00   sec   111 MBytes   933 Mbits/sec   30    766 KBytes
> > [  5]   2.00-3.00   sec   112 MBytes   944 Mbits/sec    7    749 KBytes
> > [  5]   3.00-4.00   sec   111 MBytes   933 Mbits/sec   11    707 KBytes
> > [  5]   4.00-5.00   sec   111 MBytes   933 Mbits/sec    2    699 KBytes
> > [  5]   5.00-6.00   sec   111 MBytes   933 Mbits/sec    8    699 KBytes
> > [  5]   6.00-7.00   sec   112 MBytes   944 Mbits/sec    1    953 KBytes
> > [  5]   7.00-8.00   sec   111 MBytes   933 Mbits/sec    0    701 KBytes
> > [  5]   8.00-9.00   sec   111 MBytes   933 Mbits/sec   26    707 KBytes
> > [  5]   9.00-10.00  sec   112 MBytes   944 Mbits/sec    2   1.07 MBytes
> > - - - - - - - - - - - - - - - - - - - - - - - - -
> > [ ID] Interval           Transfer     Bitrate         Retr
> > [  5]   0.00-10.00  sec  1.09 GBytes   939 Mbits/sec   87
>  sender
> > [  5]   0.00-10.00  sec  1.09 GBytes   934 Mbits/sec
> receiver
> >
> > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > [  5]   0.00-1.00   sec   114 MBytes   953 Mbits/sec   16    908 KBytes
> > [  5]   1.00-2.00   sec   112 MBytes   944 Mbits/sec    0    693 KBytes
> > [  5]   2.00-3.00   sec   111 MBytes   933 Mbits/sec    0    713 KBytes
> > [  5]   3.00-4.00   sec   111 MBytes   933 Mbits/sec    0    687 KBytes
> > [  5]   4.00-5.00   sec   112 MBytes   944 Mbits/sec   15    687 KBytes
> > [  5]   5.00-6.00   sec   111 MBytes   933 Mbits/sec    2    888 KBytes
> > [  5]   6.00-7.00   sec   111 MBytes   933 Mbits/sec   17    696 KBytes
> > [  5]   7.00-8.00   sec   111 MBytes   933 Mbits/sec    0    758 KBytes
> > [  5]   8.00-9.00   sec   111 MBytes   933 Mbits/sec   31    749 KBytes
> > [  5]   9.00-10.00  sec   112 MBytes   944 Mbits/sec    0    792 KBytes
> > - - - - - - - - - - - - - - - - - - - - - - - - -
> > [ ID] Interval           Transfer     Bitrate         Retr
> > [  5]   0.00-10.00  sec  1.09 GBytes   938 Mbits/sec   81
>  sender
> > [  5]   0.00-10.00  sec  1.09 GBytes   934 Mbits/sec
> receiver
> >
> > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > [  5]   0.00-1.00   sec   114 MBytes   956 Mbits/sec    0    747 KBytes
> > [  5]   1.00-2.00   sec   111 MBytes   933 Mbits/sec    0    744 KBytes
> > [  5]   2.00-3.00   sec   112 MBytes   944 Mbits/sec   12   1.18 MBytes
> > [  5]   3.00-4.00   sec   111 MBytes   933 Mbits/sec    2    699 KBytes
> > [  5]   4.00-5.00   sec   111 MBytes   933 Mbits/sec   28    699 KBytes
> > [  5]   5.00-6.00   sec   112 MBytes   944 Mbits/sec    0    684 KBytes
> > [  5]   6.00-7.00   sec   111 MBytes   933 Mbits/sec    0    741 KBytes
> > [  5]   7.00-8.00   sec   111 MBytes   933 Mbits/sec    3    687 KBytes
> > [  5]   8.00-9.00   sec   111 MBytes   933 Mbits/sec   22    699 KBytes
> > [  5]   9.00-10.00  sec   111 MBytes   933 Mbits/sec   11    707 KBytes
> > - - - - - - - - - - - - - - - - - - - - - - - - -
> > [ ID] Interval           Transfer     Bitrate         Retr
> > [  5]   0.00-10.00  sec  1.09 GBytes   938 Mbits/sec   78
>  sender
> > [  5]   0.00-10.01  sec  1.09 GBytes   934 Mbits/sec
> receiver
> > ---
> >
> > ethtool -S enp3s0 |grep -v ": 0"
> > NIC statistics:
> >      rx_packets: 16303520
> >      tx_packets: 21602840
> >      rx_bytes: 15711958157
> >      tx_bytes: 25599009212
> >      rx_broadcast: 122212
> >      tx_broadcast: 530
> >      rx_multicast: 333489
> >      tx_multicast: 18446
> >      multicast: 333489
> >      rx_missed_errors: 270143
> >      rx_long_length_errors: 6
> >      tx_tcp_seg_good: 1342561
> >      rx_long_byte_count: 15711958157
> >      rx_errors: 6
> >      rx_length_errors: 6
> >      rx_fifo_errors: 270143
> >      tx_queue_0_packets: 8963830
> >      tx_queue_0_bytes: 9803196683
> >      tx_queue_0_restart: 4920
> >      tx_queue_1_packets: 12639010
> >      tx_queue_1_bytes: 15706576814
> >      tx_queue_1_restart: 12718
> >      rx_queue_0_packets: 16303520
> >      rx_queue_0_bytes: 15646744077
> >      rx_queue_0_csum_err: 76
>
> Okay, so this result still has the same length and checksum errors,
> were you resetting the system/statistics between runs?
>

Ah, no.... Will reset and do more tests when I'm back home
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.osuosl.org/pipermail/intel-wired-lan/attachments/20200716/85ffb509/attachment-0001.html>

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [Intel-wired-lan] NAT performance issue 944mbit -> ~40mbit
  2020-07-16 15:18                       ` Alexander Duyck
@ 2020-07-16 19:47                         ` Ian Kumlien
  -1 siblings, 0 replies; 51+ messages in thread
From: Ian Kumlien @ 2020-07-16 19:47 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Jakub Kicinski, Linux Kernel Network Developers, intel-wired-lan

Sorry, I tried to respond from my phone using the web browser version,
but it still sent HTML mail... :/

On Thu, Jul 16, 2020 at 5:18 PM Alexander Duyck
<alexander.duyck@gmail.com> wrote:
> On Wed, Jul 15, 2020 at 5:00 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > On Thu, Jul 16, 2020 at 1:42 AM Alexander Duyck
> > <alexander.duyck@gmail.com> wrote:
> > > On Wed, Jul 15, 2020 at 3:51 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > > > On Thu, Jul 16, 2020 at 12:32 AM Alexander Duyck
> > > > <alexander.duyck@gmail.com> wrote:
> > > > > On Wed, Jul 15, 2020 at 3:00 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > > > > > On Wed, Jul 15, 2020 at 11:40 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > > > > > > On Wed, 15 Jul 2020 23:12:23 +0200 Ian Kumlien wrote:
> > > > > > > > On Wed, Jul 15, 2020 at 11:02 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > > > > > > > > On Wed, Jul 15, 2020 at 10:31 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > > > > > > > > > On Wed, 15 Jul 2020 22:05:58 +0200 Ian Kumlien wrote:
> > > > > > > > > > > After a  lot of debugging it turns out that the bug is in igb...
> > > > > > > > > > >
> > > > > > > > > > > driver: igb
> > > > > > > > > > > version: 5.6.0-k
> > > > > > > > > > > firmware-version:  0. 6-1
> > > > > > > > > > >
> > > > > > > > > > > 03:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network
> > > > > > > > > > > Connection (rev 03)
> > > > > > > > > >
> > > > > > > > > > Unclear to me what you're actually reporting. Is this a regression
> > > > > > > > > > after a kernel upgrade? Compared to no NAT?
> > > > > > > > >
> > > > > > > > > It only happens on "internet links"
> > > > > > > > >
> > > > > > > > > Let's say that A is a client with the igb driver, B is a firewall running NAT
> > > > > > > > > with ixgbe drivers, C is another local node with igb and
> > > > > > > > > D is a remote node with a bridge backed by a bnx2 interface.
> > > > > > > > >
> > > > > > > > > A -> B -> C is ok (B and C is on the same switch)
> > > > > > > > >
> > > > > > > > > A -> B -> D -- 32-40mbit
> > > > > > > > >
> > > > > > > > > B -> D 944 mbit
> > > > > > > > > C -> D 944 mbit
> > > > > > > > >
> > > > > > > > > A' -> D ~933 mbit (A with realtek nic -- also link is not idle atm)
> > > > > > > >
> > > > > > > > This should of course be A' -> B -> D
> > > > > > > >
> > > > > > > > Sorry, I've been scratching my head for about a week...
> > > > > > >
> > > > > > > Hm, only thing that comes to mind if A' works reliably and A doesn't is
> > > > > > > that A has somehow broken TCP offloads. Could you try disabling things
> > > > > > > via ethtool -K and see if those settings make a difference?
> > > > > >
> > > > > > It's a bit hard since it works like this; with TSO turned off:
> > > > > > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > > > > > [  5]   0.00-1.00   sec   108 MBytes   902 Mbits/sec    0    783 KBytes
> > > > > > [  5]   1.00-2.00   sec   110 MBytes   923 Mbits/sec   31    812 KBytes
> > > > > > [  5]   2.00-3.00   sec   111 MBytes   933 Mbits/sec   92    772 KBytes
> > > > > > [  5]   3.00-4.00   sec   110 MBytes   923 Mbits/sec    0    834 KBytes
> > > > > > [  5]   4.00-5.00   sec   111 MBytes   933 Mbits/sec   60    823 KBytes
> > > > > > [  5]   5.00-6.00   sec   110 MBytes   923 Mbits/sec   31    789 KBytes
> > > > > > [  5]   6.00-7.00   sec   111 MBytes   933 Mbits/sec    0    786 KBytes
> > > > > > [  5]   7.00-8.00   sec   110 MBytes   923 Mbits/sec    0    761 KBytes
> > > > > > [  5]   8.00-9.00   sec   110 MBytes   923 Mbits/sec    0    772 KBytes
> > > > > > [  5]   9.00-10.00  sec   109 MBytes   912 Mbits/sec    0    868 KBytes
> > > > > > - - - - - - - - - - - - - - - - - - - - - - - - -
> > > > > > [ ID] Interval           Transfer     Bitrate         Retr
> > > > > > [  5]   0.00-10.00  sec  1.07 GBytes   923 Mbits/sec  214             sender
> > > > > > [  5]   0.00-10.00  sec  1.07 GBytes   920 Mbits/sec                  receiver
> > > > > >
> > > > > > Continued running tests:
> > > > > > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > > > > > [  5]   0.00-1.00   sec  5.82 MBytes  48.8 Mbits/sec    0   82.0 KBytes
> > > > > > [  5]   1.00-2.00   sec  4.97 MBytes  41.7 Mbits/sec    0    130 KBytes
> > > > > > [  5]   2.00-3.00   sec  5.28 MBytes  44.3 Mbits/sec    0   99.0 KBytes
> > > > > > [  5]   3.00-4.00   sec  5.28 MBytes  44.3 Mbits/sec    0    105 KBytes
> > > > > > [  5]   4.00-5.00   sec  5.28 MBytes  44.3 Mbits/sec    0    122 KBytes
> > > > > > [  5]   5.00-6.00   sec  5.28 MBytes  44.3 Mbits/sec    0   82.0 KBytes
> > > > > > [  5]   6.00-7.00   sec  5.28 MBytes  44.3 Mbits/sec    0   79.2 KBytes
> > > > > > [  5]   7.00-8.00   sec  5.28 MBytes  44.3 Mbits/sec    0    110 KBytes
> > > > > > [  5]   8.00-9.00   sec  5.28 MBytes  44.3 Mbits/sec    0    156 KBytes
> > > > > > [  5]   9.00-10.00  sec  5.28 MBytes  44.3 Mbits/sec    0   87.7 KBytes
> > > > > > - - - - - - - - - - - - - - - - - - - - - - - - -
> > > > > > [ ID] Interval           Transfer     Bitrate         Retr
> > > > > > [  5]   0.00-10.00  sec  53.0 MBytes  44.5 Mbits/sec    0             sender
> > > > > > [  5]   0.00-10.00  sec  52.5 MBytes  44.1 Mbits/sec                  receiver
> > > > > >
> > > > > > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > > > > > [  5]   0.00-1.00   sec  7.08 MBytes  59.4 Mbits/sec    0    156 KBytes
> > > > > > [  5]   1.00-2.00   sec  5.97 MBytes  50.0 Mbits/sec    0    110 KBytes
> > > > > > [  5]   2.00-3.00   sec  4.97 MBytes  41.7 Mbits/sec    0    124 KBytes
> > > > > > [  5]   3.00-4.00   sec  5.47 MBytes  45.9 Mbits/sec    0   96.2 KBytes
> > > > > > [  5]   4.00-5.00   sec  5.47 MBytes  45.9 Mbits/sec    0    158 KBytes
> > > > > > [  5]   5.00-6.00   sec  4.97 MBytes  41.7 Mbits/sec    0   70.7 KBytes
> > > > > > [  5]   6.00-7.00   sec  5.47 MBytes  45.9 Mbits/sec    0    113 KBytes
> > > > > > [  5]   7.00-8.00   sec  5.47 MBytes  45.9 Mbits/sec    0   96.2 KBytes
> > > > > > [  5]   8.00-9.00   sec  4.97 MBytes  41.7 Mbits/sec    0   84.8 KBytes
> > > > > > [  5]   9.00-10.00  sec  5.47 MBytes  45.9 Mbits/sec    0    116 KBytes
> > > > > > - - - - - - - - - - - - - - - - - - - - - - - - -
> > > > > > [ ID] Interval           Transfer     Bitrate         Retr
> > > > > > [  5]   0.00-10.00  sec  55.3 MBytes  46.4 Mbits/sec    0             sender
> > > > > > [  5]   0.00-10.00  sec  53.9 MBytes  45.2 Mbits/sec                  receiver
> > > > > >
> > > > > > And the low bandwidth continues with:
> > > > > > ethtool -k enp3s0 |grep ": on"
> > > > > > rx-vlan-offload: on
> > > > > > tx-vlan-offload: on [requested off]
> > > > > > highdma: on [fixed]
> > > > > > rx-vlan-filter: on [fixed]
> > > > > > tx-gre-segmentation: on
> > > > > > tx-gre-csum-segmentation: on
> > > > > > tx-ipxip4-segmentation: on
> > > > > > tx-ipxip6-segmentation: on
> > > > > > tx-udp_tnl-segmentation: on
> > > > > > tx-udp_tnl-csum-segmentation: on
> > > > > > tx-gso-partial: on
> > > > > > tx-udp-segmentation: on
> > > > > > hw-tc-offload: on
> > > > > >
> > > > > > I can't quite find how to turn those off, since they aren't listed
> > > > > > under those names in ethtool (the text shown isn't what you use to
> > > > > > enable/disable them)
> > > > >
> > > > > To disable them you would just repeat the same string in the display
> > > > > string. So it should just be "ethtool -K enp3s0 tx-gso-partial off"
> > > > > and that would turn off a large chunk of them as all the encapsulated
> > > > > support requires gso partial support.
> > > >
> > > >  ethtool -k enp3s0 |grep ": on"
> > > > highdma: on [fixed]
> > > > rx-vlan-filter: on [fixed]
> > > > ---
> > > > And then back to back:
> > > > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > > > [  5]   0.00-1.00   sec  4.91 MBytes  41.2 Mbits/sec    0   45.2 KBytes
> > > > [  5]   1.00-2.00   sec  4.47 MBytes  37.5 Mbits/sec    0   52.3 KBytes
> > > > [  5]   2.00-3.00   sec  4.47 MBytes  37.5 Mbits/sec    0   42.4 KBytes
> > > > [  5]   3.00-4.00   sec  4.47 MBytes  37.5 Mbits/sec    0    141 KBytes
> > > > [  5]   4.00-5.00   sec   111 MBytes   928 Mbits/sec   63    764 KBytes
> > > > [  5]   5.00-6.00   sec  86.2 MBytes   724 Mbits/sec    0    744 KBytes
> > > > [  5]   6.00-7.00   sec  98.8 MBytes   828 Mbits/sec   61    769 KBytes
> > > > [  5]   7.00-8.00   sec   110 MBytes   923 Mbits/sec    0    749 KBytes
> > > > [  5]   8.00-9.00   sec   110 MBytes   923 Mbits/sec    0    741 KBytes
> > > > [  5]   9.00-10.00  sec   110 MBytes   923 Mbits/sec   31    761 KBytes
> > > > - - - - - - - - - - - - - - - - - - - - - - - - -
> > > > [ ID] Interval           Transfer     Bitrate         Retr
> > > > [  5]   0.00-10.00  sec   644 MBytes   540 Mbits/sec  155             sender
> > > > [  5]   0.00-10.01  sec   641 MBytes   537 Mbits/sec                  receiver
> > > >
> > > > and we're back at the not working bit:
> > > > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > > > [  5]   0.00-1.00   sec  4.84 MBytes  40.6 Mbits/sec    0   42.4 KBytes
> > > > [  5]   1.00-2.00   sec  4.60 MBytes  38.6 Mbits/sec    0   42.4 KBytes
> > > > [  5]   2.00-3.00   sec  4.23 MBytes  35.4 Mbits/sec    0   42.4 KBytes
> > > > [  5]   3.00-4.00   sec  4.47 MBytes  37.5 Mbits/sec    0   67.9 KBytes
> > > > [  5]   4.00-5.00   sec  4.47 MBytes  37.5 Mbits/sec    0   42.4 KBytes
> > > > [  5]   5.00-6.00   sec  4.23 MBytes  35.4 Mbits/sec    0   42.4 KBytes
> > > > [  5]   6.00-7.00   sec  4.23 MBytes  35.4 Mbits/sec    0   42.4 KBytes
> > > > [  5]   7.00-8.00   sec  4.47 MBytes  37.5 Mbits/sec    0   67.9 KBytes
> > > > [  5]   8.00-9.00   sec  4.47 MBytes  37.5 Mbits/sec    0   53.7 KBytes
> > > > [  5]   9.00-10.00  sec  4.47 MBytes  37.5 Mbits/sec    0   79.2 KBytes
> > > > - - - - - - - - - - - - - - - - - - - - - - - - -
> > > > [ ID] Interval           Transfer     Bitrate         Retr
> > > > [  5]   0.00-10.00  sec  44.5 MBytes  37.3 Mbits/sec    0             sender
> > > > [  5]   0.00-10.00  sec  43.9 MBytes  36.8 Mbits/sec                  receiver
> > > >
> > > > > > I was hoping that you'd have a clue about something that might introduce
> > > > > > a regression - i.e. specific patches to try to revert
> > > > > >
> > > > > > Btw, the same issue applies to udp as well
> > > > > >
> > > > > > [ ID] Interval           Transfer     Bitrate         Total Datagrams
> > > > > > [  5]   0.00-1.00   sec  6.77 MBytes  56.8 Mbits/sec  4900
> > > > > > [  5]   1.00-2.00   sec  4.27 MBytes  35.8 Mbits/sec  3089
> > > > > > [  5]   2.00-3.00   sec  4.20 MBytes  35.2 Mbits/sec  3041
> > > > > > [  5]   3.00-4.00   sec  4.30 MBytes  36.1 Mbits/sec  3116
> > > > > > [  5]   4.00-5.00   sec  4.24 MBytes  35.6 Mbits/sec  3070
> > > > > > [  5]   5.00-6.00   sec  4.21 MBytes  35.3 Mbits/sec  3047
> > > > > > [  5]   6.00-7.00   sec  4.29 MBytes  36.0 Mbits/sec  3110
> > > > > > [  5]   7.00-8.00   sec  4.28 MBytes  35.9 Mbits/sec  3097
> > > > > > [  5]   8.00-9.00   sec  4.25 MBytes  35.6 Mbits/sec  3075
> > > > > > [  5]   9.00-10.00  sec  4.20 MBytes  35.2 Mbits/sec  3039
> > > > > > - - - - - - - - - - - - - - - - - - - - - - - - -
> > > > > > [ ID] Interval           Transfer     Bitrate         Jitter
> > > > > > Lost/Total Datagrams
> > > > > > [  5]   0.00-10.00  sec  45.0 MBytes  37.7 Mbits/sec  0.000 ms
> > > > > > 0/32584 (0%)  sender
> > > > > > [  5]   0.00-10.00  sec  45.0 MBytes  37.7 Mbits/sec  0.037 ms
> > > > > > 0/32573 (0%)  receiver
> > > > > >
> > > > > > vs:
> > > > > >
> > > > > > [ ID] Interval           Transfer     Bitrate         Total Datagrams
> > > > > > [  5]   0.00-1.00   sec   114 MBytes   954 Mbits/sec  82342
> > > > > > [  5]   1.00-2.00   sec   114 MBytes   955 Mbits/sec  82439
> > > > > > [  5]   2.00-3.00   sec   114 MBytes   956 Mbits/sec  82507
> > > > > > [  5]   3.00-4.00   sec   114 MBytes   955 Mbits/sec  82432
> > > > > > [  5]   4.00-5.00   sec   114 MBytes   956 Mbits/sec  82535
> > > > > > [  5]   5.00-6.00   sec   114 MBytes   953 Mbits/sec  82240
> > > > > > [  5]   6.00-7.00   sec   114 MBytes   956 Mbits/sec  82512
> > > > > > [  5]   7.00-8.00   sec   114 MBytes   956 Mbits/sec  82503
> > > > > > [  5]   8.00-9.00   sec   114 MBytes   956 Mbits/sec  82532
> > > > > > [  5]   9.00-10.00  sec   114 MBytes   956 Mbits/sec  82488
> > > > > > - - - - - - - - - - - - - - - - - - - - - - - - -
> > > > > > [ ID] Interval           Transfer     Bitrate         Jitter
> > > > > > Lost/Total Datagrams
> > > > > > [  5]   0.00-10.00  sec  1.11 GBytes   955 Mbits/sec  0.000 ms
> > > > > > 0/824530 (0%)  sender
> > > > > > [  5]   0.00-10.01  sec  1.11 GBytes   949 Mbits/sec  0.014 ms
> > > > > > 4756/824530 (0.58%)  receiver
> > > > >
> > > > > The fact that it is impacting UDP seems odd. I wonder if we don't have
> > > > > a qdisc somewhere that is misbehaving and throttling the Tx. Either
> > > > > that or I wonder if we are getting spammed with flow control frames.
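
(Both theories are easy to sanity-check - untested here, but plain
tc/ethtool usage, nothing driver specific:

  tc -s qdisc show dev enp3s0                  # drops/overlimits/requeues per qdisc
  ethtool -a enp3s0                            # negotiated flow control
  ethtool -S enp3s0 | grep -i -E 'pause|flow'  # pause/flow-control counters, if the driver exposes any
)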
> > > >
> > > > It sometimes works; it looks like the congestion window just isn't increased -
> > > > that's where I started...
> > > >
> > > > Example:
> > > > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > > > [  5]   0.00-1.00   sec  4.86 MBytes  40.8 Mbits/sec    0   50.9 KBytes
> > > > [  5]   1.00-2.00   sec  4.66 MBytes  39.1 Mbits/sec    0   65.0 KBytes
> > > > [  5]   2.00-3.00   sec  4.29 MBytes  36.0 Mbits/sec    0   42.4 KBytes
> > > > [  5]   3.00-4.00   sec  4.66 MBytes  39.1 Mbits/sec    0   42.4 KBytes
> > > > [  5]   4.00-5.00   sec  23.1 MBytes   194 Mbits/sec    0   1.07 MBytes
> > > > [  5]   5.00-6.00   sec   110 MBytes   923 Mbits/sec    0    761 KBytes
> > > > [  5]   6.00-7.00   sec  98.8 MBytes   828 Mbits/sec   60    806 KBytes
> > > > [  5]   7.00-8.00   sec  82.5 MBytes   692 Mbits/sec    0    812 KBytes
> > > > [  5]   8.00-9.00   sec   110 MBytes   923 Mbits/sec   92    761 KBytes
> > > > [  5]   9.00-10.00  sec   111 MBytes   933 Mbits/sec    0    755 KBytes
> > > > - - - - - - - - - - - - - - - - - - - - - - - - -
> > > > [ ID] Interval           Transfer     Bitrate         Retr
> > > > [  5]   0.00-10.00  sec   554 MBytes   465 Mbits/sec  152             sender
> > > > [  5]   0.00-10.00  sec   550 MBytes   461 Mbits/sec                  receiver
> > > >
> > > > > It would be useful to include the output of just calling "ethtool
> > > > > enp3s0" on the interface to verify the speed, "ethtool -a enp3s0" to
> > > > > verify flow control settings, and "ethtool -S enp3s0 | grep -v :\ 0"
> > > > > to output the statistics and dump anything that isn't zero.
> > > >
> > > > ethtool enp3s0
> > > > Settings for enp3s0:
> > > > Supported ports: [ TP ]
> > > > Supported link modes:   10baseT/Half 10baseT/Full
> > > >                         100baseT/Half 100baseT/Full
> > > >                         1000baseT/Full
> > > > Supported pause frame use: Symmetric
> > > > Supports auto-negotiation: Yes
> > > > Supported FEC modes: Not reported
> > > > Advertised link modes:  10baseT/Half 10baseT/Full
> > > >                         100baseT/Half 100baseT/Full
> > > >                         1000baseT/Full
> > > > Advertised pause frame use: Symmetric
> > > > Advertised auto-negotiation: Yes
> > > > Advertised FEC modes: Not reported
> > > > Speed: 1000Mb/s
> > > > Duplex: Full
> > > > Auto-negotiation: on
> > > > Port: Twisted Pair
> > > > PHYAD: 1
> > > > Transceiver: internal
> > > > MDI-X: off (auto)
> > > > Supports Wake-on: pumbg
> > > > Wake-on: g
> > > >         Current message level: 0x00000007 (7)
> > > >                                drv probe link
> > > > Link detected: yes
> > > > ---
> > > > ethtool -a enp3s0
> > > > Pause parameters for enp3s0:
> > > > Autonegotiate: on
> > > > RX: on
> > > > TX: off
> > > > ---
> > > > ethtool -S enp3s0 |grep  -v :\ 0
> > > > NIC statistics:
> > > >      rx_packets: 15920618
> > > >      tx_packets: 17846725
> > > >      rx_bytes: 15676264423
> > > >      tx_bytes: 19925010639
> > > >      rx_broadcast: 119553
> > > >      tx_broadcast: 497
> > > >      rx_multicast: 330193
> > > >      tx_multicast: 18190
> > > >      multicast: 330193
> > > >      rx_missed_errors: 270102
> > > >      rx_long_length_errors: 6
> > > >      tx_tcp_seg_good: 1342561
> > > >      rx_long_byte_count: 15676264423
> > > >      rx_errors: 6
> > > >      rx_length_errors: 6
> > > >      rx_fifo_errors: 270102
> > > >      tx_queue_0_packets: 7651168
> > > >      tx_queue_0_bytes: 7823281566
> > > >      tx_queue_0_restart: 4920
> > > >      tx_queue_1_packets: 10195557
> > > >      tx_queue_1_bytes: 12027522118
> > > >      tx_queue_1_restart: 12718
> > > >      rx_queue_0_packets: 15920618
> > > >      rx_queue_0_bytes: 15612581951
> > > >      rx_queue_0_csum_err: 76
> > > > (I've only done two runs since I re-enabled the interface)
> > >
> > > So I am seeing three things here.
> > >
> > > The rx_long_length_errors are usually due to an MTU mismatch. Do you
> > > have something on the network that is using jumbo frames, or is the
> > > MTU on the NIC set to something smaller than what is supported on the
> > > network?
> >
> > I'm using jumbo frames on the local network; the internet side is the
> > normal 1500-byte MTU though
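
(Assuming the usual 9000-byte jumbo MTU locally, the quick checks would be
something like:

  ip link show enp3s0 | grep mtu     # what the NIC itself is set to
  ping -M do -s 8972 <local host>    # 8972 = 9000 - 20 (IPv4) - 8 (ICMP), verifies the jumbo path end to end
)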
> >
> > > You are getting rx_missed_errors; that would seem to imply that the
> > > DMA is not able to keep up. We may want to try disabling the L1 to see
> > > if we get any boost from doing that.
> >
> > It used to work, I don't do benchmarks all the time and sometimes the first
> > benchmarks turn out fine... so it's hard to say when this started happening...
> >
> > It could also be related to a bios upgrade, but I'm pretty sure I did
> > successful benchmarks after that...
> >
> > How do I disable the l1? just echo 0 >
> > /sys/bus/pci/drivers/igb/0000\:03\:00.0/link/l1_aspm ?
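
(For reference: on kernels that expose the ASPM sysfs attributes
(CONFIG_PCIEASPM), the attribute lives under the device's link/ directory,
so presumably:

  echo 0 > /sys/bus/pci/devices/0000:03:00.0/link/l1_aspm

assuming that file exists with this kernel/config.)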
> >
> > > The last bit is that queue 0 is seeing packets with bad checksums. You
> > > might want to run some tests and see where the bad checksums are
> > > coming from. If they are being detected from a specific NIC such as
> > > the ixgbe in your example it might point to some sort of checksum
> > > error being created as a result of the NAT translation.
> >
> > But that should also affect A' and the A -> B -> C case, which it doesn't...
> >
> > It only seems to happen with higher rtt (6 hops, sub 3 ms in this case
> > but still high enough somehow)
> >
> > > > ---
> > > >
> > > > > > lspci -s 03:00.0  -vvv
> > > > > > 03:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network
> > > > > > Connection (rev 03)
> > > > > > Subsystem: ASUSTeK Computer Inc. I211 Gigabit Network Connection
> > > > > > Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
> > > > > > Stepping- SERR- FastB2B- DisINTx+
> > > > > > Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
> > > > > > <TAbort- <MAbort- >SERR- <PERR- INTx-
> > > > > > Latency: 0
> > > > > > Interrupt: pin A routed to IRQ 57
> > > > > > IOMMU group: 20
> > > > > > Region 0: Memory at fc900000 (32-bit, non-prefetchable) [size=128K]
> > > > > > Region 2: I/O ports at e000 [size=32]
> > > > > > Region 3: Memory at fc920000 (32-bit, non-prefetchable) [size=16K]
> > > > > > Capabilities: [40] Power Management version 3
> > > > > > Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
> > > > > > Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=1 PME-
> > > > > > Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
> > > > > > Address: 0000000000000000  Data: 0000
> > > > > > Masking: 00000000  Pending: 00000000
> > > > > > Capabilities: [70] MSI-X: Enable+ Count=5 Masked-
> > > > > > Vector table: BAR=3 offset=00000000
> > > > > > PBA: BAR=3 offset=00002000
> > > > > > Capabilities: [a0] Express (v2) Endpoint, MSI 00
> > > > > > DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
> > > > > > ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
> > > > > > DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
> > > > > > RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+ FLReset-
> > > > > > MaxPayload 128 bytes, MaxReadReq 512 bytes
> > > > > > DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr+ TransPend-
> > > > > > LnkCap: Port #3, Speed 2.5GT/s, Width x1, ASPM L0s L1, Exit Latency
> > > > > > L0s <2us, L1 <16us
> > > > > > ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
> > > > > > LnkCtl: ASPM L1 Enabled; RCB 64 bytes, Disabled- CommClk+
> > > > > > ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
> > > > > > LnkSta: Speed 2.5GT/s (ok), Width x1 (ok)
> > > > > > TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
> > > > >
> > > > > PCIe-wise the connection is going to be pretty tight in terms of
> > > > > bandwidth. It looks like we have 2.5GT/s with only a single lane of
> > > > > PCIe. In addition we are running with ASPM enabled, which means that
> > > > > if we don't have enough traffic we are shutting off the one PCIe lane
> > > > > we have, so bursty traffic can get ugly.
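
(Rough numbers: 2.5 GT/s x1 is a gen1 link with 8b/10b encoding, so about
2 Gbit/s or 250 MB/s of raw bandwidth per direction; after TLP/DLLP
overhead maybe ~200 MB/s is usable, while line-rate gigabit needs ~125 MB/s
per direction plus descriptor traffic. Workable, but not a lot of headroom
if the link also keeps dropping into and out of L1.)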
> > > >
> > > > Humm... is there a way to force disable ASPM in sysfs?
> > >
> > > Actually the easiest way to do this is to just use setpci.
> > >
> > > You should be able to dump the word containing the setting via:
> > > # setpci -s 3:00.0 0xB0.w
> > > 0042
> > > # setpci -s 3:00.0 0xB0.w=0040
> > >
> > > Basically what you do is clear the lower 3 bits of the value so in
> > > this case that means replacing the 2 with a 0 based on the output of
> > > the first command.
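
(The 0xB0 offset is the Link Control register: the lspci dump above shows
the PCIe capability at [a0], and Link Control sits at +0x10 into it, hence
0xa0 + 0x10 = 0xb0. A recent pciutils can also address it by name, so
presumably the equivalent is:

  setpci -s 03:00.0 CAP_EXP+0x10.w         # read LnkCtl
  setpci -s 03:00.0 CAP_EXP+0x10.w=0040    # same write as above, ASPM control bits cleared
)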
> >
> > Well... I'll be damned... I used to force-enable ASPM... this must be
> > related to the change in the default PCIe bus ASPM handling.
> > Perhaps disable ASPM if there is only one lane?
>
> Is there any specific reason why you are enabling ASPM? Is this system
> a laptop where you are trying to conserve power when on battery? If
> not, disabling it probably won't hurt things too much, since the power
> consumption for a 2.5GT/s link operating at a width of one shouldn't
> be too high. Otherwise you are likely going to end up paying the
> price for getting the interface out of L1 when the traffic goes idle,
> so you are going to see bursty flows paying a heavy penalty
> when they start dropping packets.

Ah, you misunderstand: I used to force-enable ASPM myself and everything
worked - now Linux enables ASPM by default on all PCIe controllers.
So IMHO this should be a quirk: if there is only one lane, don't do
ASPM, due to latency and timing issues...
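
(Until that is sorted out, the blunt workarounds would presumably be either
booting with pcie_aspm=off, or switching the global policy at runtime
(assuming CONFIG_PCIEASPM and that the BIOS hasn't locked it):

  echo performance > /sys/module/pcie_aspm/parameters/policy

plus the per-device link/l1_aspm attribute mentioned earlier for just this
NIC.)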

> It is also possible this could be something that changed with the
> physical PCIe link. Basically L1 works by powering down the link when
> idle, and then powering it back up when there is activity. The problem
> is bringing it back up can sometimes be a challenge when the physical
> link starts to go faulty. I know I have seen that in some cases it can
> even result in the device falling off of the PCIe bus if the link
> training fails.

It works fine without ASPM (and the machine is pretty new)

I suspect we hit some timing race with aggressive ASPM (assumed because
it works on local links but doesn't on ~3 ms links)

> > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > [  5]   0.00-1.00   sec   113 MBytes   950 Mbits/sec   31    710 KBytes
> > [  5]   1.00-2.00   sec   110 MBytes   923 Mbits/sec  135    626 KBytes
> > [  5]   2.00-3.00   sec   112 MBytes   944 Mbits/sec   18    713 KBytes
> > [  5]   3.00-4.00   sec   111 MBytes   933 Mbits/sec    0    798 KBytes
> > [  5]   4.00-5.00   sec   111 MBytes   933 Mbits/sec    0    721 KBytes
> > [  5]   5.00-6.00   sec   112 MBytes   944 Mbits/sec   31    800 KBytes
> > [  5]   6.00-7.00   sec   111 MBytes   933 Mbits/sec    0    730 KBytes
> > [  5]   7.00-8.00   sec   111 MBytes   933 Mbits/sec   19    730 KBytes
> > [  5]   8.00-9.00   sec   111 MBytes   933 Mbits/sec    0    701 KBytes
> > [  5]   9.00-10.00  sec   112 MBytes   944 Mbits/sec   12    701 KBytes
> > - - - - - - - - - - - - - - - - - - - - - - - - -
> > [ ID] Interval           Transfer     Bitrate         Retr
> > [  5]   0.00-10.00  sec  1.09 GBytes   937 Mbits/sec  246             sender
> > [  5]   0.00-10.01  sec  1.09 GBytes   933 Mbits/sec                  receiver
> >
> > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > [  5]   0.00-1.00   sec   114 MBytes   956 Mbits/sec    0    749 KBytes
> > [  5]   1.00-2.00   sec   111 MBytes   933 Mbits/sec   30    766 KBytes
> > [  5]   2.00-3.00   sec   112 MBytes   944 Mbits/sec    7    749 KBytes
> > [  5]   3.00-4.00   sec   111 MBytes   933 Mbits/sec   11    707 KBytes
> > [  5]   4.00-5.00   sec   111 MBytes   933 Mbits/sec    2    699 KBytes
> > [  5]   5.00-6.00   sec   111 MBytes   933 Mbits/sec    8    699 KBytes
> > [  5]   6.00-7.00   sec   112 MBytes   944 Mbits/sec    1    953 KBytes
> > [  5]   7.00-8.00   sec   111 MBytes   933 Mbits/sec    0    701 KBytes
> > [  5]   8.00-9.00   sec   111 MBytes   933 Mbits/sec   26    707 KBytes
> > [  5]   9.00-10.00  sec   112 MBytes   944 Mbits/sec    2   1.07 MBytes
> > - - - - - - - - - - - - - - - - - - - - - - - - -
> > [ ID] Interval           Transfer     Bitrate         Retr
> > [  5]   0.00-10.00  sec  1.09 GBytes   939 Mbits/sec   87             sender
> > [  5]   0.00-10.00  sec  1.09 GBytes   934 Mbits/sec                  receiver
> >
> > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > [  5]   0.00-1.00   sec   114 MBytes   953 Mbits/sec   16    908 KBytes
> > [  5]   1.00-2.00   sec   112 MBytes   944 Mbits/sec    0    693 KBytes
> > [  5]   2.00-3.00   sec   111 MBytes   933 Mbits/sec    0    713 KBytes
> > [  5]   3.00-4.00   sec   111 MBytes   933 Mbits/sec    0    687 KBytes
> > [  5]   4.00-5.00   sec   112 MBytes   944 Mbits/sec   15    687 KBytes
> > [  5]   5.00-6.00   sec   111 MBytes   933 Mbits/sec    2    888 KBytes
> > [  5]   6.00-7.00   sec   111 MBytes   933 Mbits/sec   17    696 KBytes
> > [  5]   7.00-8.00   sec   111 MBytes   933 Mbits/sec    0    758 KBytes
> > [  5]   8.00-9.00   sec   111 MBytes   933 Mbits/sec   31    749 KBytes
> > [  5]   9.00-10.00  sec   112 MBytes   944 Mbits/sec    0    792 KBytes
> > - - - - - - - - - - - - - - - - - - - - - - - - -
> > [ ID] Interval           Transfer     Bitrate         Retr
> > [  5]   0.00-10.00  sec  1.09 GBytes   938 Mbits/sec   81             sender
> > [  5]   0.00-10.00  sec  1.09 GBytes   934 Mbits/sec                  receiver
> >
> > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > [  5]   0.00-1.00   sec   114 MBytes   956 Mbits/sec    0    747 KBytes
> > [  5]   1.00-2.00   sec   111 MBytes   933 Mbits/sec    0    744 KBytes
> > [  5]   2.00-3.00   sec   112 MBytes   944 Mbits/sec   12   1.18 MBytes
> > [  5]   3.00-4.00   sec   111 MBytes   933 Mbits/sec    2    699 KBytes
> > [  5]   4.00-5.00   sec   111 MBytes   933 Mbits/sec   28    699 KBytes
> > [  5]   5.00-6.00   sec   112 MBytes   944 Mbits/sec    0    684 KBytes
> > [  5]   6.00-7.00   sec   111 MBytes   933 Mbits/sec    0    741 KBytes
> > [  5]   7.00-8.00   sec   111 MBytes   933 Mbits/sec    3    687 KBytes
> > [  5]   8.00-9.00   sec   111 MBytes   933 Mbits/sec   22    699 KBytes
> > [  5]   9.00-10.00  sec   111 MBytes   933 Mbits/sec   11    707 KBytes
> > - - - - - - - - - - - - - - - - - - - - - - - - -
> > [ ID] Interval           Transfer     Bitrate         Retr
> > [  5]   0.00-10.00  sec  1.09 GBytes   938 Mbits/sec   78             sender
> > [  5]   0.00-10.01  sec  1.09 GBytes   934 Mbits/sec                  receiver
> > ---
> >
> > ethtool -S enp3s0 |grep -v ": 0"
> > NIC statistics:
> >      rx_packets: 16303520
> >      tx_packets: 21602840
> >      rx_bytes: 15711958157
> >      tx_bytes: 25599009212
> >      rx_broadcast: 122212
> >      tx_broadcast: 530
> >      rx_multicast: 333489
> >      tx_multicast: 18446
> >      multicast: 333489
> >      rx_missed_errors: 270143
> >      rx_long_length_errors: 6
> >      tx_tcp_seg_good: 1342561
> >      rx_long_byte_count: 15711958157
> >      rx_errors: 6
> >      rx_length_errors: 6
> >      rx_fifo_errors: 270143
> >      tx_queue_0_packets: 8963830
> >      tx_queue_0_bytes: 9803196683
> >      tx_queue_0_restart: 4920
> >      tx_queue_1_packets: 12639010
> >      tx_queue_1_bytes: 15706576814
> >      tx_queue_1_restart: 12718
> >      rx_queue_0_packets: 16303520
> >      rx_queue_0_bytes: 15646744077
> >      rx_queue_0_csum_err: 76
>
> Okay, so this result still has the same length and checksum errors.
> Were you resetting the system/statistics between runs?

Ah, no... I will reset and do more tests when I'm back home.

Am I blind, or is this part missing from ethtool's man page?
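
(As far as I can tell ethtool has no "reset statistics" command for igb;
the options seem to be either diffing two snapshots, e.g.

  ethtool -S enp3s0 > before; <run the test>; ethtool -S enp3s0 > after
  diff before after

or reloading the driver (modprobe -r igb && modprobe igb), which should
zero the counters but takes the link down briefly.)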

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [Intel-wired-lan] NAT performance issue 944mbit -> ~40mbit
  2020-07-16 19:47                         ` Ian Kumlien
@ 2020-07-17  0:09                           ` Alexander Duyck
  -1 siblings, 0 replies; 51+ messages in thread
From: Alexander Duyck @ 2020-07-17  0:09 UTC (permalink / raw)
  To: Ian Kumlien
  Cc: Jakub Kicinski, Linux Kernel Network Developers, intel-wired-lan

On Thu, Jul 16, 2020 at 12:47 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
>
> Sorry, tried to respond via the phone, used the webbrowser version but
> still html mails... :/
>
> On Thu, Jul 16, 2020 at 5:18 PM Alexander Duyck
> <alexander.duyck@gmail.com> wrote:
> > On Wed, Jul 15, 2020 at 5:00 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > > On Thu, Jul 16, 2020 at 1:42 AM Alexander Duyck
> > > <alexander.duyck@gmail.com> wrote:
> > > > On Wed, Jul 15, 2020 at 3:51 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > > > > On Thu, Jul 16, 2020 at 12:32 AM Alexander Duyck
> > > > > <alexander.duyck@gmail.com> wrote:
> > > > > > On Wed, Jul 15, 2020 at 3:00 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > > > > > > On Wed, Jul 15, 2020 at 11:40 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > > > > > > > On Wed, 15 Jul 2020 23:12:23 +0200 Ian Kumlien wrote:
> > > > > > > > > On Wed, Jul 15, 2020 at 11:02 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > > > > > > > > > On Wed, Jul 15, 2020 at 10:31 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > > > > > > > > > > On Wed, 15 Jul 2020 22:05:58 +0200 Ian Kumlien wrote:
> > > > > > > > > > > > After a  lot of debugging it turns out that the bug is in igb...
> > > > > > > > > > > >
> > > > > > > > > > > > driver: igb
> > > > > > > > > > > > version: 5.6.0-k
> > > > > > > > > > > > firmware-version:  0. 6-1
> > > > > > > > > > > >
> > > > > > > > > > > > 03:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network
> > > > > > > > > > > > Connection (rev 03)
> > > > > > > > > > >
> > > > > > > > > > > Unclear to me what you're actually reporting. Is this a regression
> > > > > > > > > > > after a kernel upgrade? Compared to no NAT?
> > > > > > > > > >
> > > > > > > > > > It only happens on "internet links"
> > > > > > > > > >
> > > > > > > > > > Let's say that A is a client with the igb driver, B is a firewall running NAT
> > > > > > > > > > with ixgbe drivers, C is another local node with igb and
> > > > > > > > > > D is a remote node with a bridge backed by a bnx2 interface.
> > > > > > > > > >
> > > > > > > > > > A -> B -> C is ok (B and C is on the same switch)
> > > > > > > > > >
> > > > > > > > > > A -> B -> D -- 32-40mbit
> > > > > > > > > >
> > > > > > > > > > B -> D 944 mbit
> > > > > > > > > > C -> D 944 mbit
> > > > > > > > > >
> > > > > > > > > > A' -> D ~933 mbit (A with realtek nic -- also link is not idle atm)
> > > > > > > > >
> > > > > > > > > This should of course be A' -> B -> D
> > > > > > > > >
> > > > > > > > > Sorry, I've been scratching my head for about a week...
> > > > > > > >
> > > > > > > > Hm, only thing that comes to mind if A' works reliably and A doesn't is
> > > > > > > > that A has somehow broken TCP offloads. Could you try disabling things
> > > > > > > > via ethtool -K and see if those settings make a difference?
> > > > > > >
> > > > > > > It's a bit hard since it works like this, turned tso off:
> > > > > > > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > > > > > > [  5]   0.00-1.00   sec   108 MBytes   902 Mbits/sec    0    783 KBytes
> > > > > > > [  5]   1.00-2.00   sec   110 MBytes   923 Mbits/sec   31    812 KBytes
> > > > > > > [  5]   2.00-3.00   sec   111 MBytes   933 Mbits/sec   92    772 KBytes
> > > > > > > [  5]   3.00-4.00   sec   110 MBytes   923 Mbits/sec    0    834 KBytes
> > > > > > > [  5]   4.00-5.00   sec   111 MBytes   933 Mbits/sec   60    823 KBytes
> > > > > > > [  5]   5.00-6.00   sec   110 MBytes   923 Mbits/sec   31    789 KBytes
> > > > > > > [  5]   6.00-7.00   sec   111 MBytes   933 Mbits/sec    0    786 KBytes
> > > > > > > [  5]   7.00-8.00   sec   110 MBytes   923 Mbits/sec    0    761 KBytes
> > > > > > > [  5]   8.00-9.00   sec   110 MBytes   923 Mbits/sec    0    772 KBytes
> > > > > > > [  5]   9.00-10.00  sec   109 MBytes   912 Mbits/sec    0    868 KBytes
> > > > > > > - - - - - - - - - - - - - - - - - - - - - - - - -
> > > > > > > [ ID] Interval           Transfer     Bitrate         Retr
> > > > > > > [  5]   0.00-10.00  sec  1.07 GBytes   923 Mbits/sec  214             sender
> > > > > > > [  5]   0.00-10.00  sec  1.07 GBytes   920 Mbits/sec                  receiver
> > > > > > >
> > > > > > > Continued running tests:
> > > > > > > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > > > > > > [  5]   0.00-1.00   sec  5.82 MBytes  48.8 Mbits/sec    0   82.0 KBytes
> > > > > > > [  5]   1.00-2.00   sec  4.97 MBytes  41.7 Mbits/sec    0    130 KBytes
> > > > > > > [  5]   2.00-3.00   sec  5.28 MBytes  44.3 Mbits/sec    0   99.0 KBytes
> > > > > > > [  5]   3.00-4.00   sec  5.28 MBytes  44.3 Mbits/sec    0    105 KBytes
> > > > > > > [  5]   4.00-5.00   sec  5.28 MBytes  44.3 Mbits/sec    0    122 KBytes
> > > > > > > [  5]   5.00-6.00   sec  5.28 MBytes  44.3 Mbits/sec    0   82.0 KBytes
> > > > > > > [  5]   6.00-7.00   sec  5.28 MBytes  44.3 Mbits/sec    0   79.2 KBytes
> > > > > > > [  5]   7.00-8.00   sec  5.28 MBytes  44.3 Mbits/sec    0    110 KBytes
> > > > > > > [  5]   8.00-9.00   sec  5.28 MBytes  44.3 Mbits/sec    0    156 KBytes
> > > > > > > [  5]   9.00-10.00  sec  5.28 MBytes  44.3 Mbits/sec    0   87.7 KBytes
> > > > > > > - - - - - - - - - - - - - - - - - - - - - - - - -
> > > > > > > [ ID] Interval           Transfer     Bitrate         Retr
> > > > > > > [  5]   0.00-10.00  sec  53.0 MBytes  44.5 Mbits/sec    0             sender
> > > > > > > [  5]   0.00-10.00  sec  52.5 MBytes  44.1 Mbits/sec                  receiver
> > > > > > >
> > > > > > > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > > > > > > [  5]   0.00-1.00   sec  7.08 MBytes  59.4 Mbits/sec    0    156 KBytes
> > > > > > > [  5]   1.00-2.00   sec  5.97 MBytes  50.0 Mbits/sec    0    110 KBytes
> > > > > > > [  5]   2.00-3.00   sec  4.97 MBytes  41.7 Mbits/sec    0    124 KBytes
> > > > > > > [  5]   3.00-4.00   sec  5.47 MBytes  45.9 Mbits/sec    0   96.2 KBytes
> > > > > > > [  5]   4.00-5.00   sec  5.47 MBytes  45.9 Mbits/sec    0    158 KBytes
> > > > > > > [  5]   5.00-6.00   sec  4.97 MBytes  41.7 Mbits/sec    0   70.7 KBytes
> > > > > > > [  5]   6.00-7.00   sec  5.47 MBytes  45.9 Mbits/sec    0    113 KBytes
> > > > > > > [  5]   7.00-8.00   sec  5.47 MBytes  45.9 Mbits/sec    0   96.2 KBytes
> > > > > > > [  5]   8.00-9.00   sec  4.97 MBytes  41.7 Mbits/sec    0   84.8 KBytes
> > > > > > > [  5]   9.00-10.00  sec  5.47 MBytes  45.9 Mbits/sec    0    116 KBytes
> > > > > > > - - - - - - - - - - - - - - - - - - - - - - - - -
> > > > > > > [ ID] Interval           Transfer     Bitrate         Retr
> > > > > > > [  5]   0.00-10.00  sec  55.3 MBytes  46.4 Mbits/sec    0             sender
> > > > > > > [  5]   0.00-10.00  sec  53.9 MBytes  45.2 Mbits/sec                  receiver
> > > > > > >
> > > > > > > And the low bandwidth continues with:
> > > > > > > ethtool -k enp3s0 |grep ": on"
> > > > > > > rx-vlan-offload: on
> > > > > > > tx-vlan-offload: on [requested off]
> > > > > > > highdma: on [fixed]
> > > > > > > rx-vlan-filter: on [fixed]
> > > > > > > tx-gre-segmentation: on
> > > > > > > tx-gre-csum-segmentation: on
> > > > > > > tx-ipxip4-segmentation: on
> > > > > > > tx-ipxip6-segmentation: on
> > > > > > > tx-udp_tnl-segmentation: on
> > > > > > > tx-udp_tnl-csum-segmentation: on
> > > > > > > tx-gso-partial: on
> > > > > > > tx-udp-segmentation: on
> > > > > > > hw-tc-offload: on
> > > > > > >
> > > > > > > Can't quite work out how to turn those off, since the names listed by
> > > > > > > ethtool -k don't seem to be the strings you use to enable/disable them
> > > > > >
> > > > > > To disable them you just pass the same string that is shown in the
> > > > > > display. So it should just be "ethtool -K enp3s0 tx-gso-partial off",
> > > > > > and that would turn off a large chunk of them, as all the encapsulation
> > > > > > offloads require gso-partial support.
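For reference, a minimal sketch of that sequence (assuming the interface name enp3s0 used above; the feature names are the ones shown in the ethtool -k listing):

# disabling gso-partial also drops the dependent tunnel offloads
ethtool -K enp3s0 tx-gso-partial off
# the remaining ones can be switched off by the same names, e.g.
ethtool -K enp3s0 tx-udp-segmentation off hw-tc-offload off
# check what is still enabled
ethtool -k enp3s0 | grep ": on"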
> > > > >
> > > > >  ethtool -k enp3s0 |grep ": on"
> > > > > highdma: on [fixed]
> > > > > rx-vlan-filter: on [fixed]
> > > > > ---
> > > > > And then back to back:
> > > > > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > > > > [  5]   0.00-1.00   sec  4.91 MBytes  41.2 Mbits/sec    0   45.2 KBytes
> > > > > [  5]   1.00-2.00   sec  4.47 MBytes  37.5 Mbits/sec    0   52.3 KBytes
> > > > > [  5]   2.00-3.00   sec  4.47 MBytes  37.5 Mbits/sec    0   42.4 KBytes
> > > > > [  5]   3.00-4.00   sec  4.47 MBytes  37.5 Mbits/sec    0    141 KBytes
> > > > > [  5]   4.00-5.00   sec   111 MBytes   928 Mbits/sec   63    764 KBytes
> > > > > [  5]   5.00-6.00   sec  86.2 MBytes   724 Mbits/sec    0    744 KBytes
> > > > > [  5]   6.00-7.00   sec  98.8 MBytes   828 Mbits/sec   61    769 KBytes
> > > > > [  5]   7.00-8.00   sec   110 MBytes   923 Mbits/sec    0    749 KBytes
> > > > > [  5]   8.00-9.00   sec   110 MBytes   923 Mbits/sec    0    741 KBytes
> > > > > [  5]   9.00-10.00  sec   110 MBytes   923 Mbits/sec   31    761 KBytes
> > > > > - - - - - - - - - - - - - - - - - - - - - - - - -
> > > > > [ ID] Interval           Transfer     Bitrate         Retr
> > > > > [  5]   0.00-10.00  sec   644 MBytes   540 Mbits/sec  155             sender
> > > > > [  5]   0.00-10.01  sec   641 MBytes   537 Mbits/sec                  receiver
> > > > >
> > > > > and we're back at the not working bit:
> > > > > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > > > > [  5]   0.00-1.00   sec  4.84 MBytes  40.6 Mbits/sec    0   42.4 KBytes
> > > > > [  5]   1.00-2.00   sec  4.60 MBytes  38.6 Mbits/sec    0   42.4 KBytes
> > > > > [  5]   2.00-3.00   sec  4.23 MBytes  35.4 Mbits/sec    0   42.4 KBytes
> > > > > [  5]   3.00-4.00   sec  4.47 MBytes  37.5 Mbits/sec    0   67.9 KBytes
> > > > > [  5]   4.00-5.00   sec  4.47 MBytes  37.5 Mbits/sec    0   42.4 KBytes
> > > > > [  5]   5.00-6.00   sec  4.23 MBytes  35.4 Mbits/sec    0   42.4 KBytes
> > > > > [  5]   6.00-7.00   sec  4.23 MBytes  35.4 Mbits/sec    0   42.4 KBytes
> > > > > [  5]   7.00-8.00   sec  4.47 MBytes  37.5 Mbits/sec    0   67.9 KBytes
> > > > > [  5]   8.00-9.00   sec  4.47 MBytes  37.5 Mbits/sec    0   53.7 KBytes
> > > > > [  5]   9.00-10.00  sec  4.47 MBytes  37.5 Mbits/sec    0   79.2 KBytes
> > > > > - - - - - - - - - - - - - - - - - - - - - - - - -
> > > > > [ ID] Interval           Transfer     Bitrate         Retr
> > > > > [  5]   0.00-10.00  sec  44.5 MBytes  37.3 Mbits/sec    0             sender
> > > > > [  5]   0.00-10.00  sec  43.9 MBytes  36.8 Mbits/sec                  receiver
> > > > >
> > > > > > > I was hoping that you'd have a clue of something that might introduce
> > > > > > > a regression - ie specific patches to try to revert
> > > > > > >
> > > > > > > Btw, the same issue applies to UDP as well
> > > > > > >
> > > > > > > [ ID] Interval           Transfer     Bitrate         Total Datagrams
> > > > > > > [  5]   0.00-1.00   sec  6.77 MBytes  56.8 Mbits/sec  4900
> > > > > > > [  5]   1.00-2.00   sec  4.27 MBytes  35.8 Mbits/sec  3089
> > > > > > > [  5]   2.00-3.00   sec  4.20 MBytes  35.2 Mbits/sec  3041
> > > > > > > [  5]   3.00-4.00   sec  4.30 MBytes  36.1 Mbits/sec  3116
> > > > > > > [  5]   4.00-5.00   sec  4.24 MBytes  35.6 Mbits/sec  3070
> > > > > > > [  5]   5.00-6.00   sec  4.21 MBytes  35.3 Mbits/sec  3047
> > > > > > > [  5]   6.00-7.00   sec  4.29 MBytes  36.0 Mbits/sec  3110
> > > > > > > [  5]   7.00-8.00   sec  4.28 MBytes  35.9 Mbits/sec  3097
> > > > > > > [  5]   8.00-9.00   sec  4.25 MBytes  35.6 Mbits/sec  3075
> > > > > > > [  5]   9.00-10.00  sec  4.20 MBytes  35.2 Mbits/sec  3039
> > > > > > > - - - - - - - - - - - - - - - - - - - - - - - - -
> > > > > > > [ ID] Interval           Transfer     Bitrate         Jitter    Lost/Total Datagrams
> > > > > > > [  5]   0.00-10.00  sec  45.0 MBytes  37.7 Mbits/sec  0.000 ms  0/32584 (0%)  sender
> > > > > > > [  5]   0.00-10.00  sec  45.0 MBytes  37.7 Mbits/sec  0.037 ms  0/32573 (0%)  receiver
> > > > > > >
> > > > > > > vs:
> > > > > > >
> > > > > > > [ ID] Interval           Transfer     Bitrate         Total Datagrams
> > > > > > > [  5]   0.00-1.00   sec   114 MBytes   954 Mbits/sec  82342
> > > > > > > [  5]   1.00-2.00   sec   114 MBytes   955 Mbits/sec  82439
> > > > > > > [  5]   2.00-3.00   sec   114 MBytes   956 Mbits/sec  82507
> > > > > > > [  5]   3.00-4.00   sec   114 MBytes   955 Mbits/sec  82432
> > > > > > > [  5]   4.00-5.00   sec   114 MBytes   956 Mbits/sec  82535
> > > > > > > [  5]   5.00-6.00   sec   114 MBytes   953 Mbits/sec  82240
> > > > > > > [  5]   6.00-7.00   sec   114 MBytes   956 Mbits/sec  82512
> > > > > > > [  5]   7.00-8.00   sec   114 MBytes   956 Mbits/sec  82503
> > > > > > > [  5]   8.00-9.00   sec   114 MBytes   956 Mbits/sec  82532
> > > > > > > [  5]   9.00-10.00  sec   114 MBytes   956 Mbits/sec  82488
> > > > > > > - - - - - - - - - - - - - - - - - - - - - - - - -
> > > > > > > [ ID] Interval           Transfer     Bitrate         Jitter    Lost/Total Datagrams
> > > > > > > [  5]   0.00-10.00  sec  1.11 GBytes   955 Mbits/sec  0.000 ms  0/824530 (0%)  sender
> > > > > > > [  5]   0.00-10.01  sec  1.11 GBytes   949 Mbits/sec  0.014 ms  4756/824530 (0.58%)  receiver
> > > > > >
> > > > > > The fact that it is impacting UDP seems odd. I wonder if we don't have
> > > > > > a qdisc somewhere that is misbehaving and throttling the Tx. Either
> > > > > > that or I wonder if we are getting spammed with flow control frames.
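One quick way to check both theories, as a sketch (interface name assumed to be the same enp3s0; worth running on both the sending box and the NAT machine while iperf3 is active):

# look for drops/overlimits building up in the qdisc during a run
tc -s qdisc show dev enp3s0
# see whether pause frames are negotiated and actually being counted
ethtool -a enp3s0
ethtool -S enp3s0 | grep -i flow_control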
> > > > >
> > > > > it sometimes works; it looks like the cwnd just isn't increased -
> > > > > that's where I started...
> > > > >
> > > > > Example:
> > > > > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > > > > [  5]   0.00-1.00   sec  4.86 MBytes  40.8 Mbits/sec    0   50.9 KBytes
> > > > > [  5]   1.00-2.00   sec  4.66 MBytes  39.1 Mbits/sec    0   65.0 KBytes
> > > > > [  5]   2.00-3.00   sec  4.29 MBytes  36.0 Mbits/sec    0   42.4 KBytes
> > > > > [  5]   3.00-4.00   sec  4.66 MBytes  39.1 Mbits/sec    0   42.4 KBytes
> > > > > [  5]   4.00-5.00   sec  23.1 MBytes   194 Mbits/sec    0   1.07 MBytes
> > > > > [  5]   5.00-6.00   sec   110 MBytes   923 Mbits/sec    0    761 KBytes
> > > > > [  5]   6.00-7.00   sec  98.8 MBytes   828 Mbits/sec   60    806 KBytes
> > > > > [  5]   7.00-8.00   sec  82.5 MBytes   692 Mbits/sec    0    812 KBytes
> > > > > [  5]   8.00-9.00   sec   110 MBytes   923 Mbits/sec   92    761 KBytes
> > > > > [  5]   9.00-10.00  sec   111 MBytes   933 Mbits/sec    0    755 KBytes
> > > > > - - - - - - - - - - - - - - - - - - - - - - - - -
> > > > > [ ID] Interval           Transfer     Bitrate         Retr
> > > > > [  5]   0.00-10.00  sec   554 MBytes   465 Mbits/sec  152             sender
> > > > > [  5]   0.00-10.00  sec   550 MBytes   461 Mbits/sec                  receiver
> > > > >
> > > > > > It would be useful to include the output of just calling "ethtool
> > > > > > enp3s0" on the interface to verify the speed, "ethtool -a enp3s0" to
> > > > > > verify flow control settings, and "ethtool -S enp3s0 | grep -v :\ 0"
> > > > > > to output the statistics and dump anything that isn't zero.
> > > > >
> > > > > ethtool enp3s0
> > > > > Settings for enp3s0:
> > > > > Supported ports: [ TP ]
> > > > > Supported link modes:   10baseT/Half 10baseT/Full
> > > > >                         100baseT/Half 100baseT/Full
> > > > >                         1000baseT/Full
> > > > > Supported pause frame use: Symmetric
> > > > > Supports auto-negotiation: Yes
> > > > > Supported FEC modes: Not reported
> > > > > Advertised link modes:  10baseT/Half 10baseT/Full
> > > > >                         100baseT/Half 100baseT/Full
> > > > >                         1000baseT/Full
> > > > > Advertised pause frame use: Symmetric
> > > > > Advertised auto-negotiation: Yes
> > > > > Advertised FEC modes: Not reported
> > > > > Speed: 1000Mb/s
> > > > > Duplex: Full
> > > > > Auto-negotiation: on
> > > > > Port: Twisted Pair
> > > > > PHYAD: 1
> > > > > Transceiver: internal
> > > > > MDI-X: off (auto)
> > > > > Supports Wake-on: pumbg
> > > > > Wake-on: g
> > > > >         Current message level: 0x00000007 (7)
> > > > >                                drv probe link
> > > > > Link detected: yes
> > > > > ---
> > > > > ethtool -a enp3s0
> > > > > Pause parameters for enp3s0:
> > > > > Autonegotiate: on
> > > > > RX: on
> > > > > TX: off
> > > > > ---
> > > > > ethtool -S enp3s0 |grep  -v :\ 0
> > > > > NIC statistics:
> > > > >      rx_packets: 15920618
> > > > >      tx_packets: 17846725
> > > > >      rx_bytes: 15676264423
> > > > >      tx_bytes: 19925010639
> > > > >      rx_broadcast: 119553
> > > > >      tx_broadcast: 497
> > > > >      rx_multicast: 330193
> > > > >      tx_multicast: 18190
> > > > >      multicast: 330193
> > > > >      rx_missed_errors: 270102
> > > > >      rx_long_length_errors: 6
> > > > >      tx_tcp_seg_good: 1342561
> > > > >      rx_long_byte_count: 15676264423
> > > > >      rx_errors: 6
> > > > >      rx_length_errors: 6
> > > > >      rx_fifo_errors: 270102
> > > > >      tx_queue_0_packets: 7651168
> > > > >      tx_queue_0_bytes: 7823281566
> > > > >      tx_queue_0_restart: 4920
> > > > >      tx_queue_1_packets: 10195557
> > > > >      tx_queue_1_bytes: 12027522118
> > > > >      tx_queue_1_restart: 12718
> > > > >      rx_queue_0_packets: 15920618
> > > > >      rx_queue_0_bytes: 15612581951
> > > > >      rx_queue_0_csum_err: 76
> > > > > (I've only done two runs since I re-enabled the interface)
> > > >
> > > > So I am seeing three things here.
> > > >
> > > > The rx_long_length_errors are usually due to an MTU mismatch. Do you
> > > > have something on the network that is using jumbo frames, or is the
> > > > MTU on the NIC set to something smaller than what is supported on the
> > > > network?
> > >
> > > I'm using jumbo frames on the local network, internet side is the
> > > normal 1500 bytes mtu though
> > >
> > > > You are getting rx_missed_errors, which would seem to imply that the
> > > > DMA is not able to keep up. We may want to try disabling L1 to see
> > > > if we get any boost from doing that.
> > >
> > > It used to work, I don't do benchmarks all the time and sometimes the first
> > > benchmarks turn out fine... so it's hard to say when this started happening...
> > >
> > > It could also be related to a bios upgrade, but I'm pretty sure I did
> > > successful benchmarks after that...
> > >
> > > How do I disable the l1? just echo 0 >
> > > /sys/bus/pci/drivers/igb/0000\:03\:00.0/link/l1_aspm ?
> > >
> > > > The last bit is that queue 0 is seeing packets with bad checksums. You
> > > > might want to run some tests and see where the bad checksums are
> > > > coming from. If they are being detected from a specific NIC such as
> > > > the ixgbe in your example it might point to some sort of checksum
> > > > error being created as a result of the NAT translation.
> > >
> > > But that should also affect A' and the A -> B -> C case, which it doesn't...
> > >
> > > It only seems to happen with higher rtt (6 hops, sub 3 ms in this case
> > > but still high enough somehow)
> > >
> > > > > ---
> > > > >
> > > > > > > lspci -s 03:00.0  -vvv
> > > > > > > 03:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network
> > > > > > > Connection (rev 03)
> > > > > > > Subsystem: ASUSTeK Computer Inc. I211 Gigabit Network Connection
> > > > > > > Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
> > > > > > > Stepping- SERR- FastB2B- DisINTx+
> > > > > > > Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
> > > > > > > <TAbort- <MAbort- >SERR- <PERR- INTx-
> > > > > > > Latency: 0
> > > > > > > Interrupt: pin A routed to IRQ 57
> > > > > > > IOMMU group: 20
> > > > > > > Region 0: Memory at fc900000 (32-bit, non-prefetchable) [size=128K]
> > > > > > > Region 2: I/O ports at e000 [size=32]
> > > > > > > Region 3: Memory at fc920000 (32-bit, non-prefetchable) [size=16K]
> > > > > > > Capabilities: [40] Power Management version 3
> > > > > > > Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
> > > > > > > Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=1 PME-
> > > > > > > Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
> > > > > > > Address: 0000000000000000  Data: 0000
> > > > > > > Masking: 00000000  Pending: 00000000
> > > > > > > Capabilities: [70] MSI-X: Enable+ Count=5 Masked-
> > > > > > > Vector table: BAR=3 offset=00000000
> > > > > > > PBA: BAR=3 offset=00002000
> > > > > > > Capabilities: [a0] Express (v2) Endpoint, MSI 00
> > > > > > > DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
> > > > > > > ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
> > > > > > > DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
> > > > > > > RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+ FLReset-
> > > > > > > MaxPayload 128 bytes, MaxReadReq 512 bytes
> > > > > > > DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr+ TransPend-
> > > > > > > LnkCap: Port #3, Speed 2.5GT/s, Width x1, ASPM L0s L1, Exit Latency
> > > > > > > L0s <2us, L1 <16us
> > > > > > > ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
> > > > > > > LnkCtl: ASPM L1 Enabled; RCB 64 bytes, Disabled- CommClk+
> > > > > > > ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
> > > > > > > LnkSta: Speed 2.5GT/s (ok), Width x1 (ok)
> > > > > > > TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
> > > > > >
> > > > > > PCIe-wise the connection is going to be pretty tight in terms of
> > > > > > bandwidth: it looks like we have 2.5GT/s with only a single lane of
> > > > > > PCIe. In addition we are running with ASPM enabled, which means that
> > > > > > when there isn't enough traffic we shut off the one PCIe lane we
> > > > > > have, so bursty traffic can get ugly.
> > > > >
> > > > > Humm... is there a way to force disable ASPM in sysfs?
> > > >
> > > > Actually the easiest way to do this is to just use setpci.
> > > >
> > > > You should be able to dump the word containing the setting via:
> > > > # setpci -s 3:00.0 0xB0.w
> > > > 0042
> > > > # setpci -s 3:00.0 0xB0.w=0040
> > > >
> > > > Basically what you do is clear the lower 3 bits of the value so in
> > > > this case that means replacing the 2 with a 0 based on the output of
> > > > the first command.
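A scripted version of the same read-modify-write, as a sketch, assuming the Express capability really is at offset 0xa0 as in the lspci dump above (Link Control is then at 0xa0 + 0x10 = 0xb0, and the ASPM control field is its low two bits):

VAL=$(setpci -s 03:00.0 0xB0.w)
# clear the ASPM enable bits, keep everything else (e.g. 0042 -> 0040)
setpci -s 03:00.0 0xB0.w=$(printf '%04x' $(( 0x$VAL & ~0x3 )))
# confirm what the link reports afterwards
lspci -s 03:00.0 -vvv | grep -E 'LnkCtl|LnkSta'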
> > >
> > > Well... I'll be damned... I used to force enable ASPM... this must be
> > > related to the change in PCIe bus ASPM
> > > Perhaps disable ASPM if there is only one lane?
> >
> > Is there any specific reason why you are enabling ASPM? Is this system
> > a laptop where you are trying to conserve power when on battery? If
> > not disabling it probably won't hurt things too much since the power
> > consumption for a 2.5GT/s link operating in a width of one shouldn't
> > be too high. Otherwise you are likely going to end up paying the
> > price for getting the interface out of L1 when the traffic goes idle
> > so you are going to see flows that get bursty paying a heavy penalty
> > when they start dropping packets.
>
> Ah, you misunderstand, I used to do this and everything worked - now
> Linux enables ASPM by default on all pcie controllers,
> so imho this should be a quirk, if there is only one lane, don't do
> ASPM due to latency and timing issues...
>
> > It is also possible this could be something that changed with the
> > physical PCIe link. Basically L1 works by powering down the link when
> > idle, and then powering it back up when there is activity. The problem
> > is bringing it back up can sometimes be a challenge when the physical
> > link starts to go faulty. I know I have seen that in some cases it can
> > even result in the device falling off of the PCIe bus if the link
> > training fails.
>
> It works fine without ASPM (and the machine is pretty new)
>
> I suspect we hit some timing race with aggressive ASPM (assumed as
> such since it works on local links but doesn't on ~3 ms links)

Agreed. What is probably happening is that the NAT introduces some
burstiness, so the part goes to sleep during the idle gaps and is then
overrun when the traffic does arrive.
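If it really is the L1 exit latency, a sketch of two ways to take ASPM out of the picture without setpci (the sysfs attribute is the one mentioned earlier in the thread; whether it exists depends on the kernel's ASPM sysfs support):

# per device: forbid L1 entry on the NIC's link
echo 0 > /sys/bus/pci/devices/0000:03:00.0/link/l1_aspm
# or system wide, as a boot parameter: pcie_aspm=off (or pcie_aspm.policy=performance)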

> > > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > > [  5]   0.00-1.00   sec   113 MBytes   950 Mbits/sec   31    710 KBytes
> > > [  5]   1.00-2.00   sec   110 MBytes   923 Mbits/sec  135    626 KBytes
> > > [  5]   2.00-3.00   sec   112 MBytes   944 Mbits/sec   18    713 KBytes
> > > [  5]   3.00-4.00   sec   111 MBytes   933 Mbits/sec    0    798 KBytes
> > > [  5]   4.00-5.00   sec   111 MBytes   933 Mbits/sec    0    721 KBytes
> > > [  5]   5.00-6.00   sec   112 MBytes   944 Mbits/sec   31    800 KBytes
> > > [  5]   6.00-7.00   sec   111 MBytes   933 Mbits/sec    0    730 KBytes
> > > [  5]   7.00-8.00   sec   111 MBytes   933 Mbits/sec   19    730 KBytes
> > > [  5]   8.00-9.00   sec   111 MBytes   933 Mbits/sec    0    701 KBytes
> > > [  5]   9.00-10.00  sec   112 MBytes   944 Mbits/sec   12    701 KBytes
> > > - - - - - - - - - - - - - - - - - - - - - - - - -
> > > [ ID] Interval           Transfer     Bitrate         Retr
> > > [  5]   0.00-10.00  sec  1.09 GBytes   937 Mbits/sec  246             sender
> > > [  5]   0.00-10.01  sec  1.09 GBytes   933 Mbits/sec                  receiver
> > >
> > > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > > [  5]   0.00-1.00   sec   114 MBytes   956 Mbits/sec    0    749 KBytes
> > > [  5]   1.00-2.00   sec   111 MBytes   933 Mbits/sec   30    766 KBytes
> > > [  5]   2.00-3.00   sec   112 MBytes   944 Mbits/sec    7    749 KBytes
> > > [  5]   3.00-4.00   sec   111 MBytes   933 Mbits/sec   11    707 KBytes
> > > [  5]   4.00-5.00   sec   111 MBytes   933 Mbits/sec    2    699 KBytes
> > > [  5]   5.00-6.00   sec   111 MBytes   933 Mbits/sec    8    699 KBytes
> > > [  5]   6.00-7.00   sec   112 MBytes   944 Mbits/sec    1    953 KBytes
> > > [  5]   7.00-8.00   sec   111 MBytes   933 Mbits/sec    0    701 KBytes
> > > [  5]   8.00-9.00   sec   111 MBytes   933 Mbits/sec   26    707 KBytes
> > > [  5]   9.00-10.00  sec   112 MBytes   944 Mbits/sec    2   1.07 MBytes
> > > - - - - - - - - - - - - - - - - - - - - - - - - -
> > > [ ID] Interval           Transfer     Bitrate         Retr
> > > [  5]   0.00-10.00  sec  1.09 GBytes   939 Mbits/sec   87             sender
> > > [  5]   0.00-10.00  sec  1.09 GBytes   934 Mbits/sec                  receiver
> > >
> > > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > > [  5]   0.00-1.00   sec   114 MBytes   953 Mbits/sec   16    908 KBytes
> > > [  5]   1.00-2.00   sec   112 MBytes   944 Mbits/sec    0    693 KBytes
> > > [  5]   2.00-3.00   sec   111 MBytes   933 Mbits/sec    0    713 KBytes
> > > [  5]   3.00-4.00   sec   111 MBytes   933 Mbits/sec    0    687 KBytes
> > > [  5]   4.00-5.00   sec   112 MBytes   944 Mbits/sec   15    687 KBytes
> > > [  5]   5.00-6.00   sec   111 MBytes   933 Mbits/sec    2    888 KBytes
> > > [  5]   6.00-7.00   sec   111 MBytes   933 Mbits/sec   17    696 KBytes
> > > [  5]   7.00-8.00   sec   111 MBytes   933 Mbits/sec    0    758 KBytes
> > > [  5]   8.00-9.00   sec   111 MBytes   933 Mbits/sec   31    749 KBytes
> > > [  5]   9.00-10.00  sec   112 MBytes   944 Mbits/sec    0    792 KBytes
> > > - - - - - - - - - - - - - - - - - - - - - - - - -
> > > [ ID] Interval           Transfer     Bitrate         Retr
> > > [  5]   0.00-10.00  sec  1.09 GBytes   938 Mbits/sec   81             sender
> > > [  5]   0.00-10.00  sec  1.09 GBytes   934 Mbits/sec                  receiver
> > >
> > > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > > [  5]   0.00-1.00   sec   114 MBytes   956 Mbits/sec    0    747 KBytes
> > > [  5]   1.00-2.00   sec   111 MBytes   933 Mbits/sec    0    744 KBytes
> > > [  5]   2.00-3.00   sec   112 MBytes   944 Mbits/sec   12   1.18 MBytes
> > > [  5]   3.00-4.00   sec   111 MBytes   933 Mbits/sec    2    699 KBytes
> > > [  5]   4.00-5.00   sec   111 MBytes   933 Mbits/sec   28    699 KBytes
> > > [  5]   5.00-6.00   sec   112 MBytes   944 Mbits/sec    0    684 KBytes
> > > [  5]   6.00-7.00   sec   111 MBytes   933 Mbits/sec    0    741 KBytes
> > > [  5]   7.00-8.00   sec   111 MBytes   933 Mbits/sec    3    687 KBytes
> > > [  5]   8.00-9.00   sec   111 MBytes   933 Mbits/sec   22    699 KBytes
> > > [  5]   9.00-10.00  sec   111 MBytes   933 Mbits/sec   11    707 KBytes
> > > - - - - - - - - - - - - - - - - - - - - - - - - -
> > > [ ID] Interval           Transfer     Bitrate         Retr
> > > [  5]   0.00-10.00  sec  1.09 GBytes   938 Mbits/sec   78             sender
> > > [  5]   0.00-10.01  sec  1.09 GBytes   934 Mbits/sec                  receiver
> > > ---
> > >
> > > ethtool -S enp3s0 |grep -v ": 0"
> > > NIC statistics:
> > >      rx_packets: 16303520
> > >      tx_packets: 21602840
> > >      rx_bytes: 15711958157
> > >      tx_bytes: 25599009212
> > >      rx_broadcast: 122212
> > >      tx_broadcast: 530
> > >      rx_multicast: 333489
> > >      tx_multicast: 18446
> > >      multicast: 333489
> > >      rx_missed_errors: 270143
> > >      rx_long_length_errors: 6
> > >      tx_tcp_seg_good: 1342561
> > >      rx_long_byte_count: 15711958157
> > >      rx_errors: 6
> > >      rx_length_errors: 6
> > >      rx_fifo_errors: 270143
> > >      tx_queue_0_packets: 8963830
> > >      tx_queue_0_bytes: 9803196683
> > >      tx_queue_0_restart: 4920
> > >      tx_queue_1_packets: 12639010
> > >      tx_queue_1_bytes: 15706576814
> > >      tx_queue_1_restart: 12718
> > >      rx_queue_0_packets: 16303520
> > >      rx_queue_0_bytes: 15646744077
> > >      rx_queue_0_csum_err: 76
> >
> > Okay, so this result still has the same length and checksum errors,
> > were you resetting the system/statistics between runs?
>
> Ah, no.... Will reset and do more tests when I'm back home
>
> Am I blind or is this part missing from the ethtool man page?

There isn't a way to reset the stats via ethtool. The device stats
persist until the driver is unloaded and reloaded or the system is
reset. You can reset the queue stats by changing the number of queues,
for example with "ethtool -L enp3s0 1; ethtool -L enp3s0 2".
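A sketch of a full reset-and-retest cycle; note that ethtool -L normally wants a channel-type keyword, so on igb this is presumably the combined channel count:

# cycle the channel count to clear the per-queue counters
ethtool -L enp3s0 combined 1
ethtool -L enp3s0 combined 2
# re-run iperf3, then dump only the non-zero counters again
ethtool -S enp3s0 | grep -v ": 0"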

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [Intel-wired-lan] NAT performance issue 944mbit -> ~40mbit
  2020-07-17  0:09                           ` Alexander Duyck
@ 2020-07-17 13:45                             ` Ian Kumlien
  -1 siblings, 0 replies; 51+ messages in thread
From: Ian Kumlien @ 2020-07-17 13:45 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Jakub Kicinski, Linux Kernel Network Developers, intel-wired-lan

On Fri, Jul 17, 2020 at 2:09 AM Alexander Duyck
<alexander.duyck@gmail.com> wrote:
> On Thu, Jul 16, 2020 at 12:47 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:

> > Sorry, tried to respond via the phone, used the webbrowser version but
> > still html mails... :/

> > On Thu, Jul 16, 2020 at 5:18 PM Alexander Duyck
> > <alexander.duyck@gmail.com> wrote:
> > > On Wed, Jul 15, 2020 at 5:00 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:

[--8<--]

> > > > Well... I'll be damned... I used to force enable ASPM... this must be
> > > > related to the change in PCIe bus ASPM
> > > > Perhaps disable ASPM if there is only one lane?
> > >
> > > Is there any specific reason why you are enabling ASPM? Is this system
> > > a laptop where you are trying to conserve power when on battery? If
> > > not disabling it probably won't hurt things too much since the power
> > > consumption for a 2.5GT/s link operating in a width of one shouldn't
> > > be too high. Otherwise you are likely going to end up paying the
> > > price for getting the interface out of L1 when the traffic goes idle
> > > so you are going to see flows that get bursty paying a heavy penalty
> > > when they start dropping packets.
> >
> > Ah, you misunderstand, I used to do this and everything worked - now
> > Linux enables ASPM by default on all pcie controllers,
> > so imho this should be a quirk, if there is only one lane, don't do
> > ASPM due to latency and timing issues...
> >
> > > It is also possible this could be something that changed with the
> > > physical PCIe link. Basically L1 works by powering down the link when
> > > idle, and then powering it back up when there is activity. The problem
> > > is bringing it back up can sometimes be a challenge when the physical
> > > link starts to go faulty. I know I have seen that in some cases it can
> > > even result in the device falling off of the PCIe bus if the link
> > > training fails.
> >
> > It works fine without ASPM (and the machine is pretty new)
> >
> > I suspect we hit some timing race with aggressive ASPM (assumed as
> > such since it works on local links but doesn't on ~3 ms links)
>
> Agreed. What is probably happening is that the NAT introduces some
> burstiness, so the part goes to sleep during the idle gaps and is then
> overrun when the traffic does arrive.

Weird though, seems to be very aggressive timings =)

[--8<--]

> > > > ethtool -S enp3s0 |grep -v ": 0"
> > > > NIC statistics:
> > > >      rx_packets: 16303520
> > > >      tx_packets: 21602840
> > > >      rx_bytes: 15711958157
> > > >      tx_bytes: 25599009212
> > > >      rx_broadcast: 122212
> > > >      tx_broadcast: 530
> > > >      rx_multicast: 333489
> > > >      tx_multicast: 18446
> > > >      multicast: 333489
> > > >      rx_missed_errors: 270143
> > > >      rx_long_length_errors: 6
> > > >      tx_tcp_seg_good: 1342561
> > > >      rx_long_byte_count: 15711958157
> > > >      rx_errors: 6
> > > >      rx_length_errors: 6
> > > >      rx_fifo_errors: 270143
> > > >      tx_queue_0_packets: 8963830
> > > >      tx_queue_0_bytes: 9803196683
> > > >      tx_queue_0_restart: 4920
> > > >      tx_queue_1_packets: 12639010
> > > >      tx_queue_1_bytes: 15706576814
> > > >      tx_queue_1_restart: 12718
> > > >      rx_queue_0_packets: 16303520
> > > >      rx_queue_0_bytes: 15646744077
> > > >      rx_queue_0_csum_err: 76
> > >
> > > Okay, so this result still has the same length and checksum errors,
> > > were you resetting the system/statistics between runs?
> >
> > Ah, no.... Will reset and do more tests when I'm back home
> >
> > Am I blind or is this part missing from ethtools man page?
>
> There isn't a reset that will reset the stats via ethtool. The device
> stats will be persistent until the driver is unloaded and reloaded or
> the system is reset. You can reset the queue stats by changing the
> number of queues. So for example using "ethtool -L enp3s0 1;  ethtool
> -L enp3s0 2".

It did reset some counters but not all...

NIC statistics:
     rx_packets: 37339997
     tx_packets: 36066432
     rx_bytes: 39226365570
     tx_bytes: 37364799188
     rx_broadcast: 197736
     tx_broadcast: 1187
     rx_multicast: 572374
     tx_multicast: 30546
     multicast: 572374
     collisions: 0
     rx_crc_errors: 0
     rx_no_buffer_count: 0
     rx_missed_errors: 270844
     tx_aborted_errors: 0
     tx_carrier_errors: 0
     tx_window_errors: 0
     tx_abort_late_coll: 0
     tx_deferred_ok: 0
     tx_single_coll_ok: 0
     tx_multi_coll_ok: 0
     tx_timeout_count: 0
     rx_long_length_errors: 6
     rx_short_length_errors: 0
     rx_align_errors: 0
     tx_tcp_seg_good: 2663350
     tx_tcp_seg_failed: 0
     rx_flow_control_xon: 0
     rx_flow_control_xoff: 0
     tx_flow_control_xon: 0
     tx_flow_control_xoff: 0
     rx_long_byte_count: 39226365570
     tx_dma_out_of_sync: 0
     tx_smbus: 0
     rx_smbus: 0
     dropped_smbus: 0
     os2bmc_rx_by_bmc: 0
     os2bmc_tx_by_bmc: 0
     os2bmc_tx_by_host: 0
     os2bmc_rx_by_host: 0
     tx_hwtstamp_timeouts: 0
     tx_hwtstamp_skipped: 0
     rx_hwtstamp_cleared: 0
     rx_errors: 6
     tx_errors: 0
     tx_dropped: 0
     rx_length_errors: 6
     rx_over_errors: 0
     rx_frame_errors: 0
     rx_fifo_errors: 270844
     tx_fifo_errors: 0
     tx_heartbeat_errors: 0
     tx_queue_0_packets: 16069894
     tx_queue_0_bytes: 16031462246
     tx_queue_0_restart: 4920
     tx_queue_1_packets: 19996538
     tx_queue_1_bytes: 21169430746
     tx_queue_1_restart: 12718
     rx_queue_0_packets: 37339997
     rx_queue_0_bytes: 39077005582
     rx_queue_0_drops: 0
     rx_queue_0_csum_err: 76
     rx_queue_0_alloc_failed: 0
     rx_queue_1_packets: 0
     rx_queue_1_bytes: 0
     rx_queue_1_drops: 0
     rx_queue_1_csum_err: 0
     rx_queue_1_alloc_failed: 0

-- vs --

NIC statistics:
     rx_packets: 37340720
     tx_packets: 36066920
     rx_bytes: 39226590275
     tx_bytes: 37364899567
     rx_broadcast: 197755
     tx_broadcast: 1204
     rx_multicast: 572582
     tx_multicast: 30563
     multicast: 572582
     collisions: 0
     rx_crc_errors: 0
     rx_no_buffer_count: 0
     rx_missed_errors: 270844
     tx_aborted_errors: 0
     tx_carrier_errors: 0
     tx_window_errors: 0
     tx_abort_late_coll: 0
     tx_deferred_ok: 0
     tx_single_coll_ok: 0
     tx_multi_coll_ok: 0
     tx_timeout_count: 0
     rx_long_length_errors: 6
     rx_short_length_errors: 0
     rx_align_errors: 0
     tx_tcp_seg_good: 2663352
     tx_tcp_seg_failed: 0
     rx_flow_control_xon: 0
     rx_flow_control_xoff: 0
     tx_flow_control_xon: 0
     tx_flow_control_xoff: 0
     rx_long_byte_count: 39226590275
     tx_dma_out_of_sync: 0
     tx_smbus: 0
     rx_smbus: 0
     dropped_smbus: 0
     os2bmc_rx_by_bmc: 0
     os2bmc_tx_by_bmc: 0
     os2bmc_tx_by_host: 0
     os2bmc_rx_by_host: 0
     tx_hwtstamp_timeouts: 0
     tx_hwtstamp_skipped: 0
     rx_hwtstamp_cleared: 0
     rx_errors: 6
     tx_errors: 0
     tx_dropped: 0
     rx_length_errors: 6
     rx_over_errors: 0
     rx_frame_errors: 0
     rx_fifo_errors: 270844
     tx_fifo_errors: 0
     tx_heartbeat_errors: 0
     tx_queue_0_packets: 59
     tx_queue_0_bytes: 11829
     tx_queue_0_restart: 0
     tx_queue_1_packets: 49
     tx_queue_1_bytes: 12058
     tx_queue_1_restart: 0
     rx_queue_0_packets: 84
     rx_queue_0_bytes: 22195
     rx_queue_0_drops: 0
     rx_queue_0_csum_err: 0
     rx_queue_0_alloc_failed: 0
     rx_queue_1_packets: 0
     rx_queue_1_bytes: 0
     rx_queue_1_drops: 0
     rx_queue_1_csum_err: 0
     rx_queue_1_alloc_failed: 0

---

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [Intel-wired-lan] NAT performance issue 944mbit -> ~40mbit
  2020-07-17 13:45                             ` Ian Kumlien
@ 2020-07-24 12:01                               ` Ian Kumlien
  -1 siblings, 0 replies; 51+ messages in thread
From: Ian Kumlien @ 2020-07-24 12:01 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Jakub Kicinski, Linux Kernel Network Developers, intel-wired-lan

On Fri, Jul 17, 2020 at 3:45 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
>
> On Fri, Jul 17, 2020 at 2:09 AM Alexander Duyck
> <alexander.duyck@gmail.com> wrote:
> > On Thu, Jul 16, 2020 at 12:47 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
>
> > > Sorry, tried to respond via the phone, used the webbrowser version but
> > > still html mails... :/
>
> > > On Thu, Jul 16, 2020 at 5:18 PM Alexander Duyck
> > > <alexander.duyck@gmail.com> wrote:
> > > > On Wed, Jul 15, 2020 at 5:00 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
>
> [--8<--]
>
> > > > > Well... I'll be damned... I used to force enable ASPM... this must be
> > > > > related to the change in PCIe bus ASPM
> > > > > Perhaps disable ASPM if there is only one link?
> > > >
> > > > Is there any specific reason why you are enabling ASPM? Is this system
> > > > a laptop where you are trying to conserve power when on battery? If
> > > > not disabling it probably won't hurt things too much since the power
> > > > consumption for a 2.5GT/s link operating in a width of one shouldn't
> > > > bee too high. Otherwise you are likely going to end up paying the
> > > > price for getting the interface out of L1 when the traffic goes idle
> > > > so you are going to see flows that get bursty paying a heavy penalty
> > > > when they start dropping packets.
> > >
> > > Ah, you misunderstand, I used to do this and everything worked - now
> > > Linux enables ASPM by default on all pcie controllers,
> > > so imho this should be a quirk, if there is only one lane, don't do
> > > ASPM due to latency and timing issues...
> > >
> > > > It is also possible this could be something that changed with the
> > > > physical PCIe link. Basically L1 works by powering down the link when
> > > > idle, and then powering it back up when there is activity. The problem
> > > > is bringing it back up can sometimes be a challenge when the physical
> > > > link starts to go faulty. I know I have seen that in some cases it can
> > > > even result in the device falling off of the PCIe bus if the link
> > > > training fails.
> > >
> > > It works fine without ASPM (and the machine is pretty new)
> > >
> > > I suspect we hit some timing race with aggressive ASPM (assumed as
> > > such since it works on local links but doesn't on ~3 ms Links)
> >
> > Agreed. What is probably happening if you are using a NAT is that it
> > may be seeing some burstiness being introduced and as a result the
> > part is going to sleep and then being overrun when the traffic does
> > arrive.
>
> Weird though, seems to be very aggressive timings =)
>
> [--8<--]
>
> > > > > ethtool -S enp3s0 |grep -v ": 0"
> > > > > NIC statistics:
> > > > >      rx_packets: 16303520
> > > > >      tx_packets: 21602840
> > > > >      rx_bytes: 15711958157
> > > > >      tx_bytes: 25599009212
> > > > >      rx_broadcast: 122212
> > > > >      tx_broadcast: 530
> > > > >      rx_multicast: 333489
> > > > >      tx_multicast: 18446
> > > > >      multicast: 333489
> > > > >      rx_missed_errors: 270143
> > > > >      rx_long_length_errors: 6
> > > > >      tx_tcp_seg_good: 1342561
> > > > >      rx_long_byte_count: 15711958157
> > > > >      rx_errors: 6
> > > > >      rx_length_errors: 6
> > > > >      rx_fifo_errors: 270143
> > > > >      tx_queue_0_packets: 8963830
> > > > >      tx_queue_0_bytes: 9803196683
> > > > >      tx_queue_0_restart: 4920
> > > > >      tx_queue_1_packets: 12639010
> > > > >      tx_queue_1_bytes: 15706576814
> > > > >      tx_queue_1_restart: 12718
> > > > >      rx_queue_0_packets: 16303520
> > > > >      rx_queue_0_bytes: 15646744077
> > > > >      rx_queue_0_csum_err: 76
> > > >
> > > > Okay, so this result still has the same length and checksum errors,
> > > > were you resetting the system/statistics between runs?
> > >
> > > Ah, no.... Will reset and do more tests when I'm back home
> > >
> > > Am I blind or is this part missing from ethtools man page?
> >
> > There isn't a reset that will reset the stats via ethtool. The device
> > stats will be persistent until the driver is unloaded and reloaded or
> > the system is reset. You can reset the queue stats by changing the
> > number of queues. So for example using "ethtool -L enp3s0 1;  ethtool
> > -L enp3s0 2".

As a side note, would something like this fix it - not even compile tested


diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index 8bb3db2cbd41..1a7240aae85c 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -3396,6 +3396,13 @@ static int igb_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
                          "Width x2" :
                          (hw->bus.width == e1000_bus_width_pcie_x1) ?
                          "Width x1" : "unknown"), netdev->dev_addr);
+               /* quirk */
+#ifdef CONFIG_PCIEASPM
+               if (hw->bus.width == e1000_bus_width_pcie_x1) {
+                       /* single lane pcie causes problems with ASPM */
+                       pdev->pcie_link_state->aspm_enabled = 0;
+               }
+#endif
        }

        if ((hw->mac.type >= e1000_i210 ||

I don't know where the right place to put a quirk would be...

^ permalink raw reply related	[flat|nested] 51+ messages in thread

* Re: [Intel-wired-lan] NAT performance issue 944mbit -> ~40mbit
  2020-07-24 12:01                               ` Ian Kumlien
@ 2020-07-24 12:33                                 ` Ian Kumlien
  -1 siblings, 0 replies; 51+ messages in thread
From: Ian Kumlien @ 2020-07-24 12:33 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Jakub Kicinski, Linux Kernel Network Developers, intel-wired-lan

On Fri, Jul 24, 2020 at 2:01 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
>
> On Fri, Jul 17, 2020 at 3:45 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:

[--8<--]

> As a side note, would something like this fix it - not even compile tested
>
>
> diff --git a/drivers/net/ethernet/intel/igb/igb_main.c
> b/drivers/net/ethernet/intel/igb/igb_main.c
> index 8bb3db2cbd41..1a7240aae85c 100644
> --- a/drivers/net/ethernet/intel/igb/igb_main.c
> +++ b/drivers/net/ethernet/intel/igb/igb_main.c
> @@ -3396,6 +3396,13 @@ static int igb_probe(struct pci_dev *pdev,
> const struct pci_device_id *ent)
>                           "Width x2" :
>                           (hw->bus.width == e1000_bus_width_pcie_x1) ?
>                           "Width x1" : "unknown"), netdev->dev_addr);
> +               /* quirk */
> +#ifdef CONFIG_PCIEASPM
> +               if (hw->bus.width == e1000_bus_width_pcie_x1) {
> +                       /* single lane pcie causes problems with ASPM */
> +                       pdev->pcie_link_state->aspm_enabled = 0;
> +               }
> +#endif
>         }
>
>         if ((hw->mac.type >= e1000_i210 ||
>
> I don't know where the right place to put a quirk would be...

OK, so that was a real brainfart... it turns out there is a lack of
good ways to get at that from a driver, but it was mostly intended to
show where the quirk should go...

Due to the lack of APIs, I started wondering if this will apply to
more devices than just network cards - potentially we could be a
little more selective and only not enable it in one direction, but...

diff --git a/drivers/pci/pcie/aspm.c b/drivers/pci/pcie/aspm.c
index b17e5ffd31b1..96a3c6837124 100644
--- a/drivers/pci/pcie/aspm.c
+++ b/drivers/pci/pcie/aspm.c
@@ -584,15 +584,16 @@ static void pcie_aspm_cap_init(struct pcie_link_state *link, int blacklist)
         * given link unless components on both sides of the link each
         * support L0s.
         */
-       if (dwreg.support & upreg.support & PCIE_LINK_STATE_L0S)
-               link->aspm_support |= ASPM_STATE_L0S;
-       if (dwreg.enabled & PCIE_LINK_STATE_L0S)
-               link->aspm_enabled |= ASPM_STATE_L0S_UP;
-       if (upreg.enabled & PCIE_LINK_STATE_L0S)
-               link->aspm_enabled |= ASPM_STATE_L0S_DW;
-       link->latency_up.l0s = calc_l0s_latency(upreg.latency_encoding_l0s);
-       link->latency_dw.l0s = calc_l0s_latency(dwreg.latency_encoding_l0s);
-
+       if (pcie_get_width_cap(child) != PCIE_LNK_X1) {
+               if (dwreg.support & upreg.support & PCIE_LINK_STATE_L0S)
+                       link->aspm_support |= ASPM_STATE_L0S;
+               if (dwreg.enabled & PCIE_LINK_STATE_L0S)
+                       link->aspm_enabled |= ASPM_STATE_L0S_UP;
+               if (upreg.enabled & PCIE_LINK_STATE_L0S)
+                       link->aspm_enabled |= ASPM_STATE_L0S_DW;
+               link->latency_up.l0s = calc_l0s_latency(upreg.latency_encoding_l0s);
+               link->latency_dw.l0s = calc_l0s_latency(dwreg.latency_encoding_l0s);
+       }

this time it's compile tested...

It could also be  if (pcie_get_width_cap(child) > PCIE_LNK_X1) {

I assume that ASPM is not enabled for: PCIE_LNK_WIDTH_RESRV ;)

^ permalink raw reply related	[flat|nested] 51+ messages in thread

* Re: [Intel-wired-lan] NAT performance issue 944mbit -> ~40mbit
  2020-07-24 12:33                                 ` Ian Kumlien
@ 2020-07-24 14:56                                   ` Alexander Duyck
  -1 siblings, 0 replies; 51+ messages in thread
From: Alexander Duyck @ 2020-07-24 14:56 UTC (permalink / raw)
  To: Ian Kumlien
  Cc: Jakub Kicinski, Linux Kernel Network Developers, intel-wired-lan

On Fri, Jul 24, 2020 at 5:33 AM Ian Kumlien <ian.kumlien@gmail.com> wrote:
>
> On Fri, Jul 24, 2020 at 2:01 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> >
> > On Fri, Jul 17, 2020 at 3:45 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
>
> [--8<--]
>
> > As a side note, would something like this fix it - not even compile tested
> >
> >
> > diff --git a/drivers/net/ethernet/intel/igb/igb_main.c
> > b/drivers/net/ethernet/intel/igb/igb_main.c
> > index 8bb3db2cbd41..1a7240aae85c 100644
> > --- a/drivers/net/ethernet/intel/igb/igb_main.c
> > +++ b/drivers/net/ethernet/intel/igb/igb_main.c
> > @@ -3396,6 +3396,13 @@ static int igb_probe(struct pci_dev *pdev,
> > const struct pci_device_id *ent)
> >                           "Width x2" :
> >                           (hw->bus.width == e1000_bus_width_pcie_x1) ?
> >                           "Width x1" : "unknown"), netdev->dev_addr);
> > +               /* quirk */
> > +#ifdef CONFIG_PCIEASPM
> > +               if (hw->bus.width == e1000_bus_width_pcie_x1) {
> > +                       /* single lane pcie causes problems with ASPM */
> > +                       pdev->pcie_link_state->aspm_enabled = 0;
> > +               }
> > +#endif
> >         }
> >
> >         if ((hw->mac.type >= e1000_i210 ||
> >
> > I don't know where the right place to put a quirk would be...
>
> Ok so that was a real brainfart... turns out that there is a lack of
> good ways to get to that but it was more intended to
> know where the quirk should go...
>
> Due to the lack of api:s i started wondering if this will apply to
> more devices than just network cards - potentially we could
> be a little bit more selective and only not enable it in one direction but...
>
> diff --git a/drivers/pci/pcie/aspm.c b/drivers/pci/pcie/aspm.c
> index b17e5ffd31b1..96a3c6837124 100644
> --- a/drivers/pci/pcie/aspm.c
> +++ b/drivers/pci/pcie/aspm.c
> @@ -584,15 +584,16 @@ static void pcie_aspm_cap_init(struct
> pcie_link_state *link, int blacklist)
>          * given link unless components on both sides of the link each
>          * support L0s.
>          */
> -       if (dwreg.support & upreg.support & PCIE_LINK_STATE_L0S)
> -               link->aspm_support |= ASPM_STATE_L0S;
> -       if (dwreg.enabled & PCIE_LINK_STATE_L0S)
> -               link->aspm_enabled |= ASPM_STATE_L0S_UP;
> -       if (upreg.enabled & PCIE_LINK_STATE_L0S)
> -               link->aspm_enabled |= ASPM_STATE_L0S_DW;
> -       link->latency_up.l0s = calc_l0s_latency(upreg.latency_encoding_l0s);
> -       link->latency_dw.l0s = calc_l0s_latency(dwreg.latency_encoding_l0s);
> -
> +       if (pcie_get_width_cap(child) != PCIE_LNK_X1) {
> +               if (dwreg.support & upreg.support & PCIE_LINK_STATE_L0S)
> +                       link->aspm_support |= ASPM_STATE_L0S;
> +               if (dwreg.enabled & PCIE_LINK_STATE_L0S)
> +                       link->aspm_enabled |= ASPM_STATE_L0S_UP;
> +               if (upreg.enabled & PCIE_LINK_STATE_L0S)
> +                       link->aspm_enabled |= ASPM_STATE_L0S_DW;
> +               link->latency_up.l0s =
> calc_l0s_latency(upreg.latency_encoding_l0s);
> +               link->latency_dw.l0s =
> calc_l0s_latency(dwreg.latency_encoding_l0s);
> +       }
>
> this time it's compile tested...
>
> It could also be  if (pcie_get_width_cap(child) > PCIE_LNK_X1) {
>
> I assume that ASPM is not enabled for: PCIE_LNK_WIDTH_RESRV ;)

This is probably too broad a scope to be used generically, since
this will disable ASPM for all devices that have an x1 link
width.

It might make more sense to look at something such as
e1000e_disable_aspm as an example of how to approach this.
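
A minimal sketch of a driver-side opt-out along those lines, assuming
the driver only wants to keep ASPM off on its own link
(igb_aspm_workaround() is a made-up name; pci_disable_link_state() is
the existing interface a driver can call instead of poking
pdev->pcie_link_state directly):

#include <linux/pci.h>

/* Hypothetical helper, called from probe once the link width is known. */
static void igb_aspm_workaround(struct pci_dev *pdev)
{
        /* Keep the opt-out narrow: only the known-problematic x1 links. */
        if (pcie_get_width_cap(pdev) != PCIE_LNK_X1)
                return;

        /* Ask the PCI core to keep L0s/L1 off for this device's link.
         * This compiles to a no-op when CONFIG_PCIEASPM is not set and
         * simply has no effect if the BIOS retains control of ASPM.
         */
        pci_disable_link_state(pdev, PCIE_LINK_STATE_L0S | PCIE_LINK_STATE_L1);
}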

As far as what triggers it we would need to get more details about the
setup. I'd be curious if we have an "lspci -vvv" for the system
available. The assumption is that the ASPM exit latency is high on
this system and that in turn is causing the bandwidth issues as you
start entering L1. If I am not mistaken the device should advertise
about 16us for the exit latency. I'd be curious if we have a device
somewhere between the NIC and the root port that might be increasing
the delay in exiting L1, and then if we could identify that we could
add a PCIe quirk for that.
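
If such a device were identified, the quirk would most likely take the
shape of an entry in drivers/pci/quirks.c, roughly like the sketch below
(the fixup name and the 0x1234/0x5678 IDs are placeholders, not a known
bad part; DECLARE_PCI_FIXUP_FINAL() and pci_disable_link_state() are the
existing hooks it would build on):

/* drivers/pci/quirks.c already includes <linux/pci.h> */
static void quirk_slow_aspm_l1_exit(struct pci_dev *dev)
{
        /* Placeholder logic: keep L1 off below a device whose real-world
         * L1 exit latency is higher than what it advertises.
         */
        pci_info(dev, "disabling ASPM L1 (excessive exit latency)\n");
        pci_disable_link_state(dev, PCIE_LINK_STATE_L1);
}
DECLARE_PCI_FIXUP_FINAL(0x1234, 0x5678, quirk_slow_aspm_l1_exit);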

Thanks.

- Alex

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [Intel-wired-lan] NAT performance issue 944mbit -> ~40mbit
       [not found]                                       ` <CAA85sZs_PSsyZhvdKBCoAGxoZvaQFhQ6j7qoA7y8ffjs2RqEGw@mail.gmail.com>
@ 2020-07-24 21:50                                         ` Alexander Duyck
  2020-07-24 22:41                                           ` Ian Kumlien
  0 siblings, 1 reply; 51+ messages in thread
From: Alexander Duyck @ 2020-07-24 21:50 UTC (permalink / raw)
  To: intel-wired-lan

On Fri, Jul 24, 2020 at 2:14 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
>
> On Fri, Jul 24, 2020 at 10:45 PM Alexander Duyck
> <alexander.duyck@gmail.com> wrote:
> >
> > On Fri, Jul 24, 2020 at 12:23 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > >
> > > On Fri, Jul 24, 2020 at 4:57 PM Alexander Duyck
> > > <alexander.duyck@gmail.com> wrote:
> > > >
> > > > On Fri, Jul 24, 2020 at 5:33 AM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > > > >
> > > > > On Fri, Jul 24, 2020 at 2:01 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > > > > >
> > > > > > On Fri, Jul 17, 2020 at 3:45 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > > > >
> > > > > [--8<--]
> > > > >
> > > > > > As a side note, would something like this fix it - not even compile tested
> > > > > >
> > > > > >
> > > > > > diff --git a/drivers/net/ethernet/intel/igb/igb_main.c
> > > > > > b/drivers/net/ethernet/intel/igb/igb_main.c
> > > > > > index 8bb3db2cbd41..1a7240aae85c 100644
> > > > > > --- a/drivers/net/ethernet/intel/igb/igb_main.c
> > > > > > +++ b/drivers/net/ethernet/intel/igb/igb_main.c
> > > > > > @@ -3396,6 +3396,13 @@ static int igb_probe(struct pci_dev *pdev,
> > > > > > const struct pci_device_id *ent)
> > > > > >                           "Width x2" :
> > > > > >                           (hw->bus.width == e1000_bus_width_pcie_x1) ?
> > > > > >                           "Width x1" : "unknown"), netdev->dev_addr);
> > > > > > +               /* quirk */
> > > > > > +#ifdef CONFIG_PCIEASPM
> > > > > > +               if (hw->bus.width == e1000_bus_width_pcie_x1) {
> > > > > > +                       /* single lane pcie causes problems with ASPM */
> > > > > > +                       pdev->pcie_link_state->aspm_enabled = 0;
> > > > > > +               }
> > > > > > +#endif
> > > > > >         }
> > > > > >
> > > > > >         if ((hw->mac.type >= e1000_i210 ||
> > > > > >
> > > > > > I don't know where the right place to put a quirk would be...
> > > > >
> > > > > Ok so that was a real brainfart... turns out that there is a lack of
> > > > > good ways to get to that but it was more intended to
> > > > > know where the quirk should go...
> > > > >
> > > > > Due to the lack of api:s i started wondering if this will apply to
> > > > > more devices than just network cards - potentially we could
> > > > > be a little bit more selective and only not enable it in one direction but...
> > > > >
> > > > > diff --git a/drivers/pci/pcie/aspm.c b/drivers/pci/pcie/aspm.c
> > > > > index b17e5ffd31b1..96a3c6837124 100644
> > > > > --- a/drivers/pci/pcie/aspm.c
> > > > > +++ b/drivers/pci/pcie/aspm.c
> > > > > @@ -584,15 +584,16 @@ static void pcie_aspm_cap_init(struct
> > > > > pcie_link_state *link, int blacklist)
> > > > >          * given link unless components on both sides of the link each
> > > > >          * support L0s.
> > > > >          */
> > > > > -       if (dwreg.support & upreg.support & PCIE_LINK_STATE_L0S)
> > > > > -               link->aspm_support |= ASPM_STATE_L0S;
> > > > > -       if (dwreg.enabled & PCIE_LINK_STATE_L0S)
> > > > > -               link->aspm_enabled |= ASPM_STATE_L0S_UP;
> > > > > -       if (upreg.enabled & PCIE_LINK_STATE_L0S)
> > > > > -               link->aspm_enabled |= ASPM_STATE_L0S_DW;
> > > > > -       link->latency_up.l0s = calc_l0s_latency(upreg.latency_encoding_l0s);
> > > > > -       link->latency_dw.l0s = calc_l0s_latency(dwreg.latency_encoding_l0s);
> > > > > -
> > > > > +       if (pcie_get_width_cap(child) != PCIE_LNK_X1) {
> > > > > +               if (dwreg.support & upreg.support & PCIE_LINK_STATE_L0S)
> > > > > +                       link->aspm_support |= ASPM_STATE_L0S;
> > > > > +               if (dwreg.enabled & PCIE_LINK_STATE_L0S)
> > > > > +                       link->aspm_enabled |= ASPM_STATE_L0S_UP;
> > > > > +               if (upreg.enabled & PCIE_LINK_STATE_L0S)
> > > > > +                       link->aspm_enabled |= ASPM_STATE_L0S_DW;
> > > > > +               link->latency_up.l0s =
> > > > > calc_l0s_latency(upreg.latency_encoding_l0s);
> > > > > +               link->latency_dw.l0s =
> > > > > calc_l0s_latency(dwreg.latency_encoding_l0s);
> > > > > +       }
> > > > >
> > > > > this time it's compile tested...
> > > > >
> > > > > It could also be  if (pcie_get_width_cap(child) > PCIE_LNK_X1) {
> > > > >
> > > > > I assume that ASPM is not enabled for: PCIE_LNK_WIDTH_RESRV ;)
> > > >
> > > > This is probably a bit too broad of a scope to be used generically
> > > > since this will disable ASPM for all devices that have a x1 link
> > > > width.
> > >
> > > I agree, but also, the change to enable aspm on the controllers was
> > > quite recent so it could be a general
> > > issue that a lot of people could be suffering from... I haven't seen
> > > any reports though...
> >
> > The problem is your layout is very likely specific. It may effect
> > others with a similar layout, but for example I have the same
> > controller in one of my systems and I have not been having any issues.
> >
> > > But otoh worst case would be a minor revert in power usage ;)
> > >
> > > > It might make more sense to look at something such as
> > > > e1000e_disable_aspm as an example of how to approach this.
> > >
> > > Oh, my grepping completely failed to dig up this, thanks!
> >
> > https://elixir.bootlin.com/linux/latest/source/drivers/net/ethernet/intel/e1000e/netdev.c#L6743
> >
> > > > As far as what triggers it we would need to get more details about the
> > > > setup. I'd be curious if we have an "lspci -vvv" for the system
> > > > available. The assumption is that the ASPM exit latency is high on
> > > > this system and that in turn is causing the bandwidth issues as you
> > > > start entering L1. If I am not mistaken the device should advertise
> > > > about 16us for the exit latency. I'd be curious if we have a device
> > > > somewhere between the NIC and the root port that might be increasing
> > > > the delay in exiting L1, and then if we could identify that we could
> > > > add a PCIe quirk for that.
> > >
> > > We only disabled the L0s afair, but from e1000e_disable_aspm - "can't
> > > have L1 without L0s"
> > > so perhaps they are disabled as well...
> >
> > Not sure where you got that from. It looks like with your system the
> > L0s is disabled and you only have support for L1.
>
> First of all, sorry, I accidentally dropped the mailinglist :(
>
> And the comment I quoted was from the e1000e_disable_aspm:
>         switch (state) {
>         case PCIE_LINK_STATE_L0S:
>         case PCIE_LINK_STATE_L0S | PCIE_LINK_STATE_L1:
>                 aspm_dis_mask |= PCI_EXP_LNKCTL_ASPM_L0S;
>                 /* fall-through - can't have L1 without L0s */
>        <====
>         case PCIE_LINK_STATE_L1:
>                 aspm_dis_mask |= PCI_EXP_LNKCTL_ASPM_L1;
>                 break;
>         default:
>                 return;
>         }
> ----
>
> > >
> > > And:
> > > lspci -t
> > > -[0000:00]-+-00.0
> > >            +-00.2
> > >            +-01.0
> > >            +-01.2-[01-07]----00.0-[02-07]--+-03.0-[03]----00.0
> >
> > I think I now know what patch broke things for you. It is most likely
> > this one that enabled ASPM on devices behind bridges:
> > https://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci.git/commit/?h=next&id=66ff14e59e8a30690755b08bc3042359703fb07a
>
> Ah, yes, correct
>
> > My advice would be to revert that patch and see if it resolves the
> > issue for you.
>
> Could do that yes =)
>
> I'm mainly looking for a more generic solution...

That would be the generic solution. The patch has obviously broken
things so we need to address the issues. The immediate solution is to
revert it, but the more correct solution may be to do something like
add an allowlist for the cases where enabling ASPM will not harm
system performance.

> > Device 3:00.0 is your i211 gigabit network controller. Notice you have
> > a bridge between it and the root complex. This can be problematic as I
> > believe the main reason for the code that was removed in the patch is
> > that wakeups can end up being serialized if all of the links are down
> > or you could end up with one of the other devices on the bridge
> > utilizing the PCIe link an reducing the total throughput, especially
> > if you have the link to the root complex also taking part in power
> > management. Starting at the root complex it looks like you have the
> > link between the bridge and the PCIe switch. It is running L1 enabled
> > with a 32us time for it to reestablish link according to the root
> > complex side (00:01.2->1:00.0). The next link is the switch to the
> > i211 which is 2:03.0 -> 3:00.0. The interesting bit here is that the
> > bridge says it only needs 32us while the NIC is saying it will need
> > 64us. That upper bound is already a pretty significant value, however
> > you have two links to contend with so in reality you are looking at
> > something like 96us to bring up both links if they are brought up
> > serially.
>
> hummm... Interesting... I have never managed to parse that lspci thing
> properly...

Actually I parsed it a bit incorrectly too.

The i211 advertises that it can tolerate at most 64us of L1 wakeup
latency, while the switch advertises a 32us delay to come out of L1 on
both its upstream and downstream ports. As such the link would be
considered marginal with L1 enabled, and so it should be disabled.
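
Concretely, assuming the two links have to be brought up serially: 32us
for the upstream link plus 32us for the switch-to-NIC link is 64us,
which already equals the 64us the i211 says it can tolerate, leaving no
margin for anything else on the path.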

> It is also interesting that the realtek card seems to be on the same link then?
> With ASPM disabled, I wonder if that is due to the setpci command or
> if it was disabled before..
> (playing with setpci makes no difference but it might require a reboot.. )

Are you using the same command you were using for the i211? Did you
make sure to update the offset since the PCIe configuration block
starts at a different offset? Also you probably need to make sure to
only try to update function 0 of the device since I suspect the other
functions won't have any effect.

> > When you consider that you are using a Gigabit Ethernet connection
> > that is moving data at roughly 1000 bits per microsecond, or 125 bytes
> > per microsecond. At that rate we should have roughly 270us worth of
> > packets we can buffer before we are forced to start dropping packets
> > assuming the device is configured with a default 34K Rx buffer. As
> > such I am not entirely sure ASPM is the only issue we have here. I
> > assume you may also have CPU C states enabled as well? By any chance
> > do you have C6 or deeper sleep states enabled on the system? If so
> > that might be what is pushing us into the issues that you were seeing.
> > Basically we are seeing something that is causing the PCIe to stall
> > for over 270us. My thought is that it is likely a number of factors
> > where we have too many devices sleeping and as a result the total
> > wakeup latency is likely 300us or more resulting in dropped packets.
>
> It seems like I only have C2 as max...
>
> grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name
> /sys/devices/system/cpu/cpu0/cpuidle/state0/name:POLL
> /sys/devices/system/cpu/cpu0/cpuidle/state1/name:C1
> /sys/devices/system/cpu/cpu0/cpuidle/state2/name:C2
>
> Anyway, we should bring this back to the mailing list

That's fine. I assumed you didn't want to post the lspci to the
mailing list as it might bounce for being too large.

So a generic solution for this would be to have a function that walks
the PCIe bus and determines the total L1 and L0s exit latency along the
path to a device. If the device advertises an acceptable exit latency
for an ASPM power state and the accumulated latency meets or exceeds
it, we should disable that ASPM state for the device.
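
As a rough, self-contained illustration of that idea (the structures and
names below are made up for the example, not the real
drivers/pci/pcie/aspm.c internals): accumulate the worst-case exit
latency of every link between the root port and the endpoint, then
compare the total against what the endpoint says it can tolerate.

#include <stdbool.h>
#include <stdint.h>

/* One entry per link on the path from the endpoint up to the root port. */
struct link_exit_lat {
        uint32_t l1_exit_us;            /* worst of the two link partners */
        struct link_exit_lat *parent;   /* next link towards the root port */
};

static bool l1_latency_acceptable(const struct link_exit_lat *ep_link,
                                  uint32_t ep_l1_acceptable_us)
{
        const struct link_exit_lat *l;
        uint32_t total_us = 0;

        /* Every link on the path may need to come out of L1 before a
         * packet can reach the endpoint, so sum the exit latencies.
         */
        for (l = ep_link; l; l = l->parent)
                total_us += l->l1_exit_us;

        return total_us <= ep_l1_acceptable_us;
}

With the numbers discussed above (two links advertising 32us each
against the i211's 64us acceptable L1 latency) a check like this already
sits right at the limit, which matches the marginal behaviour seen in
this thread.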

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [Intel-wired-lan] NAT performance issue 944mbit -> ~40mbit
  2020-07-24 21:50                                         ` Alexander Duyck
@ 2020-07-24 22:41                                           ` Ian Kumlien
  2020-07-24 22:49                                             ` Ian Kumlien
  0 siblings, 1 reply; 51+ messages in thread
From: Ian Kumlien @ 2020-07-24 22:41 UTC (permalink / raw)
  To: intel-wired-lan

On Fri, Jul 24, 2020 at 11:51 PM Alexander Duyck
<alexander.duyck@gmail.com> wrote:
>
> On Fri, Jul 24, 2020 at 2:14 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> >
> > On Fri, Jul 24, 2020 at 10:45 PM Alexander Duyck
> > <alexander.duyck@gmail.com> wrote:
> > >
> > > On Fri, Jul 24, 2020 at 12:23 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > > >
> > > > On Fri, Jul 24, 2020 at 4:57 PM Alexander Duyck
> > > > <alexander.duyck@gmail.com> wrote:
> > > > >
> > > > > On Fri, Jul 24, 2020 at 5:33 AM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > > > > >
> > > > > > On Fri, Jul 24, 2020 at 2:01 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > > > > > >
> > > > > > > On Fri, Jul 17, 2020 at 3:45 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > > > > >
> > > > > > [--8<--]
> > > > > >
> > > > > > > As a side note, would something like this fix it - not even compile tested
> > > > > > >
> > > > > > >
> > > > > > > diff --git a/drivers/net/ethernet/intel/igb/igb_main.c
> > > > > > > b/drivers/net/ethernet/intel/igb/igb_main.c
> > > > > > > index 8bb3db2cbd41..1a7240aae85c 100644
> > > > > > > --- a/drivers/net/ethernet/intel/igb/igb_main.c
> > > > > > > +++ b/drivers/net/ethernet/intel/igb/igb_main.c
> > > > > > > @@ -3396,6 +3396,13 @@ static int igb_probe(struct pci_dev *pdev,
> > > > > > > const struct pci_device_id *ent)
> > > > > > >                           "Width x2" :
> > > > > > >                           (hw->bus.width == e1000_bus_width_pcie_x1) ?
> > > > > > >                           "Width x1" : "unknown"), netdev->dev_addr);
> > > > > > > +               /* quirk */
> > > > > > > +#ifdef CONFIG_PCIEASPM
> > > > > > > +               if (hw->bus.width == e1000_bus_width_pcie_x1) {
> > > > > > > +                       /* single lane pcie causes problems with ASPM */
> > > > > > > +                       pdev->pcie_link_state->aspm_enabled = 0;
> > > > > > > +               }
> > > > > > > +#endif
> > > > > > >         }
> > > > > > >
> > > > > > >         if ((hw->mac.type >= e1000_i210 ||
> > > > > > >
> > > > > > > I don't know where the right place to put a quirk would be...
> > > > > >
> > > > > > Ok so that was a real brainfart... turns out that there is a lack of
> > > > > > good ways to get to that but it was more intended to
> > > > > > know where the quirk should go...
> > > > > >
> > > > > > Due to the lack of api:s i started wondering if this will apply to
> > > > > > more devices than just network cards - potentially we could
> > > > > > be a little bit more selective and only not enable it in one direction but...
> > > > > >
> > > > > > diff --git a/drivers/pci/pcie/aspm.c b/drivers/pci/pcie/aspm.c
> > > > > > index b17e5ffd31b1..96a3c6837124 100644
> > > > > > --- a/drivers/pci/pcie/aspm.c
> > > > > > +++ b/drivers/pci/pcie/aspm.c
> > > > > > @@ -584,15 +584,16 @@ static void pcie_aspm_cap_init(struct
> > > > > > pcie_link_state *link, int blacklist)
> > > > > >          * given link unless components on both sides of the link each
> > > > > >          * support L0s.
> > > > > >          */
> > > > > > -       if (dwreg.support & upreg.support & PCIE_LINK_STATE_L0S)
> > > > > > -               link->aspm_support |= ASPM_STATE_L0S;
> > > > > > -       if (dwreg.enabled & PCIE_LINK_STATE_L0S)
> > > > > > -               link->aspm_enabled |= ASPM_STATE_L0S_UP;
> > > > > > -       if (upreg.enabled & PCIE_LINK_STATE_L0S)
> > > > > > -               link->aspm_enabled |= ASPM_STATE_L0S_DW;
> > > > > > -       link->latency_up.l0s = calc_l0s_latency(upreg.latency_encoding_l0s);
> > > > > > -       link->latency_dw.l0s = calc_l0s_latency(dwreg.latency_encoding_l0s);
> > > > > > -
> > > > > > +       if (pcie_get_width_cap(child) != PCIE_LNK_X1) {
> > > > > > +               if (dwreg.support & upreg.support & PCIE_LINK_STATE_L0S)
> > > > > > +                       link->aspm_support |= ASPM_STATE_L0S;
> > > > > > +               if (dwreg.enabled & PCIE_LINK_STATE_L0S)
> > > > > > +                       link->aspm_enabled |= ASPM_STATE_L0S_UP;
> > > > > > +               if (upreg.enabled & PCIE_LINK_STATE_L0S)
> > > > > > +                       link->aspm_enabled |= ASPM_STATE_L0S_DW;
> > > > > > +               link->latency_up.l0s =
> > > > > > calc_l0s_latency(upreg.latency_encoding_l0s);
> > > > > > +               link->latency_dw.l0s =
> > > > > > calc_l0s_latency(dwreg.latency_encoding_l0s);
> > > > > > +       }
> > > > > >
> > > > > > this time it's compile tested...
> > > > > >
> > > > > > It could also be  if (pcie_get_width_cap(child) > PCIE_LNK_X1) {
> > > > > >
> > > > > > I assume that ASPM is not enabled for: PCIE_LNK_WIDTH_RESRV ;)
> > > > >
> > > > > This is probably a bit too broad of a scope to be used generically
> > > > > since this will disable ASPM for all devices that have a x1 link
> > > > > width.
> > > >
> > > > I agree, but also, the change to enable aspm on the controllers was
> > > > quite recent so it could be a general
> > > > issue that a lot of people could be suffering from... I haven't seen
> > > > any reports though...
> > >
> > > The problem is your layout is very likely specific. It may effect
> > > others with a similar layout, but for example I have the same
> > > controller in one of my systems and I have not been having any issues.
> > >
> > > > But otoh worst case would be a minor revert in power usage ;)
> > > >
> > > > > It might make more sense to look at something such as
> > > > > e1000e_disable_aspm as an example of how to approach this.
> > > >
> > > > Oh, my grepping completely failed to dig up this, thanks!
> > >
> > > https://elixir.bootlin.com/linux/latest/source/drivers/net/ethernet/intel/e1000e/netdev.c#L6743
> > >
> > > > > As far as what triggers it we would need to get more details about the
> > > > > setup. I'd be curious if we have an "lspci -vvv" for the system
> > > > > available. The assumption is that the ASPM exit latency is high on
> > > > > this system and that in turn is causing the bandwidth issues as you
> > > > > start entering L1. If I am not mistaken the device should advertise
> > > > > about 16us for the exit latency. I'd be curious if we have a device
> > > > > somewhere between the NIC and the root port that might be increasing
> > > > > the delay in exiting L1, and then if we could identify that we could
> > > > > add a PCIe quirk for that.
> > > >
> > > > We only disabled the L0s afair, but from e1000e_disable_aspm - "can't
> > > > have L1 without L0s"
> > > > so perhaps they are disabled as well...
> > >
> > > Not sure where you got that from. It looks like with your system the
> > > L0s is disabled and you only have support for L1.
> >
> > First of all, sorry, I accidentally dropped the mailinglist :(
> >
> > And the comment I quoted was from the e1000e_disable_aspm:
> >         switch (state) {
> >         case PCIE_LINK_STATE_L0S:
> >         case PCIE_LINK_STATE_L0S | PCIE_LINK_STATE_L1:
> >                 aspm_dis_mask |= PCI_EXP_LNKCTL_ASPM_L0S;
> >                 /* fall-through - can't have L1 without L0s */
> >        <====
> >         case PCIE_LINK_STATE_L1:
> >                 aspm_dis_mask |= PCI_EXP_LNKCTL_ASPM_L1;
> >                 break;
> >         default:
> >                 return;
> >         }
> > ----
> >
> > > >
> > > > And:
> > > > lspci -t
> > > > -[0000:00]-+-00.0
> > > >            +-00.2
> > > >            +-01.0
> > > >            +-01.2-[01-07]----00.0-[02-07]--+-03.0-[03]----00.0
> > >
> > > I think I now know what patch broke things for you. It is most likely
> > > this one that enabled ASPM on devices behind bridges:
> > > https://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci.git/commit/?h=next&id=66ff14e59e8a30690755b08bc3042359703fb07a
> >
> > Ah, yes, correct
> >
> > > My advice would be to revert that patch and see if it resolves the
> > > issue for you.
> >
> > Could do that yes =)
> >
> > I'm mainly looking for a more generic solution...
>
> That would be the generic solution. The patch has obviously broken
> things so we need to address the issues. The immediate solution is to
> revert it, but the more correct solution may be to do something like
> add an allowlist for the cases where enabling ASPM will not harm
> system performance.

I was thinking more of a generic solution, like the one you mention
below, where we get the best of both worlds... =)

> > > Device 3:00.0 is your i211 gigabit network controller. Notice you have
> > > a bridge between it and the root complex. This can be problematic as I
> > > believe the main reason for the code that was removed in the patch is
> > > that wakeups can end up being serialized if all of the links are down
> > > or you could end up with one of the other devices on the bridge
> > > utilizing the PCIe link an reducing the total throughput, especially
> > > if you have the link to the root complex also taking part in power
> > > management. Starting at the root complex it looks like you have the
> > > link between the bridge and the PCIe switch. It is running L1 enabled
> > > with a 32us time for it to reestablish link according to the root
> > > complex side (00:01.2->1:00.0). The next link is the switch to the
> > > i211 which is 2:03.0 -> 3:00.0. The interesting bit here is that the
> > > bridge says it only needs 32us while the NIC is saying it will need
> > > 64us. That upper bound is already a pretty significant value, however
> > > you have two links to contend with so in reality you are looking at
> > > something like 96us to bring up both links if they are brought up
> > > serially.
> >
> > hummm... Interesting... I have never managed to parse that lspci thing
> > properly...
>
> Actually I parsed it a bit incorrectly too.
>
> The i211 lists that it only supports up to 64us maximum delay in L1
> wakeup latency. The switch is advertising 32us delay to come out of L1
> on both the upstream and downstream ports. As such the link would be
> considered marginal with L1 enabled and so it should be disabled.
>
> > It is also interesting that the realtek card seems to be on the same link then?
> > With ASPM disabled, I wonder if that is due to the setpci command or
> > if it was disabled before..
> > (playing with setpci makes no difference but it might require a reboot.. )
>
> Are you using the same command you were using for the i211? Did you
> make sure to update the offset since the PCIe configuration block
> starts at a different offset? Also you probably need to make sure to
> only try to update function 0 of the device since I suspect the other
> functions won't have any effect.

Ah, no, I only toggled the i211 to see if that's what caused the ASPM
to be disabled...

But it seems it isn't -- I will have to reboot to verify, though

> > > When you consider that you are using a Gigabit Ethernet connection
> > > that is moving data at roughly 1000 bits per microsecond, or 125 bytes
> > > per microsecond. At that rate we should have roughly 270us worth of
> > > packets we can buffer before we are forced to start dropping packets
> > > assuming the device is configured with a default 34K Rx buffer. As
> > > such I am not entirely sure ASPM is the only issue we have here. I
> > > assume you may also have CPU C states enabled as well? By any chance
> > > do you have C6 or deeper sleep states enabled on the system? If so
> > > that might be what is pushing us into the issues that you were seeing.
> > > Basically we are seeing something that is causing the PCIe to stall
> > > for over 270us. My thought is that it is likely a number of factors
> > > where we have too many devices sleeping and as a result the total
> > > wakeup latency is likely 300us or more resulting in dropped packets.
> >
> > It seems like I only have C2 as max...
> >
> > grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name
> > /sys/devices/system/cpu/cpu0/cpuidle/state0/name:POLL
> > /sys/devices/system/cpu/cpu0/cpuidle/state1/name:C1
> > /sys/devices/system/cpu/cpu0/cpuidle/state2/name:C2
> >
> > Anyway, we should bring this back to the mailing list
>
> That's fine. I assumed you didn't want to post the lspci to the
> mailing list as it might bounce for being too large.

Good thinking, but it was actually a slip :/

> So a generic solution for this would be to have a function that would
> scan the PCIe bus and determine the total L1 and L0s exit latency. If
> a device advertises an acceptable ASPM power state exit latency and we
> have met or exceeded that we should be disabling that ASPM feature for
> the device.

Yeah, since I'm on vacation I'll actually see if I can look into that!
(I mean, I'm not that used to these kinds of things, but if my messing
around inspires someone, or if no one else is working on it, then...
what the hey ;) )
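
Just to check that I'm reading you right, is the idea roughly something
like the below? (A rough, untested sketch using the field names from
drivers/pci/pcie/aspm.c; the l1_total accumulator and the place to hook
this in are my own guesses, not existing code.)

        /* Walk from the endpoint towards the root and sum the L1 exit
         * latencies of every link on the path, instead of checking each
         * link in isolation.  Clear L1 if the accumulated latency exceeds
         * what the endpoint says it can tolerate.
         */
        u32 l1_total = 0;
        struct pcie_link_state *link = endpoint->bus->self->link_state;
        struct aspm_latency *acceptable =
                &link->acceptable[PCI_FUNC(endpoint->devfn)];

        while (link) {
                if (link->aspm_capable & ASPM_STATE_L1)
                        l1_total += max_t(u32, link->latency_up.l1,
                                          link->latency_dw.l1);
                if (l1_total > acceptable->l1) {
                        /* cumulative wakeup is too slow for this endpoint */
                        link->aspm_capable &= ~ASPM_STATE_L1;
                        break;
                }
                link = link->parent;
        }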

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [Intel-wired-lan] NAT performance issue 944mbit -> ~40mbit
  2020-07-24 22:41                                           ` Ian Kumlien
@ 2020-07-24 22:49                                             ` Ian Kumlien
  2020-07-24 23:08                                               ` Ian Kumlien
  0 siblings, 1 reply; 51+ messages in thread
From: Ian Kumlien @ 2020-07-24 22:49 UTC (permalink / raw)
  To: intel-wired-lan

On Sat, Jul 25, 2020 at 12:41 AM Ian Kumlien <ian.kumlien@gmail.com> wrote:
>
> On Fri, Jul 24, 2020 at 11:51 PM Alexander Duyck
> <alexander.duyck@gmail.com> wrote:
> >
> > On Fri, Jul 24, 2020 at 2:14 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > >
> > > On Fri, Jul 24, 2020 at 10:45 PM Alexander Duyck
> > > <alexander.duyck@gmail.com> wrote:
> > > >
> > > > On Fri, Jul 24, 2020 at 12:23 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > > > >
> > > > > On Fri, Jul 24, 2020 at 4:57 PM Alexander Duyck
> > > > > <alexander.duyck@gmail.com> wrote:
> > > > > >
> > > > > > On Fri, Jul 24, 2020 at 5:33 AM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > > > > > >
> > > > > > > On Fri, Jul 24, 2020 at 2:01 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > > > > > > >
> > > > > > > > On Fri, Jul 17, 2020 at 3:45 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > > > > > >
> > > > > > > [--8<--]
> > > > > > >
> > > > > > > > As a side note, would something like this fix it - not even compile tested
> > > > > > > >
> > > > > > > >
> > > > > > > > diff --git a/drivers/net/ethernet/intel/igb/igb_main.c
> > > > > > > > b/drivers/net/ethernet/intel/igb/igb_main.c
> > > > > > > > index 8bb3db2cbd41..1a7240aae85c 100644
> > > > > > > > --- a/drivers/net/ethernet/intel/igb/igb_main.c
> > > > > > > > +++ b/drivers/net/ethernet/intel/igb/igb_main.c
> > > > > > > > @@ -3396,6 +3396,13 @@ static int igb_probe(struct pci_dev *pdev,
> > > > > > > > const struct pci_device_id *ent)
> > > > > > > >                           "Width x2" :
> > > > > > > >                           (hw->bus.width == e1000_bus_width_pcie_x1) ?
> > > > > > > >                           "Width x1" : "unknown"), netdev->dev_addr);
> > > > > > > > +               /* quirk */
> > > > > > > > +#ifdef CONFIG_PCIEASPM
> > > > > > > > +               if (hw->bus.width == e1000_bus_width_pcie_x1) {
> > > > > > > > +                       /* single lane pcie causes problems with ASPM */
> > > > > > > > +                       pdev->pcie_link_state->aspm_enabled = 0;
> > > > > > > > +               }
> > > > > > > > +#endif
> > > > > > > >         }
> > > > > > > >
> > > > > > > >         if ((hw->mac.type >= e1000_i210 ||
> > > > > > > >
> > > > > > > > I don't know where the right place to put a quirk would be...
> > > > > > >
> > > > > > > Ok so that was a real brainfart... turns out that there is a lack of
> > > > > > > good ways to get to that but it was more intended to
> > > > > > > know where the quirk should go...
> > > > > > >
> > > > > > > Due to the lack of api:s i started wondering if this will apply to
> > > > > > > more devices than just network cards - potentially we could
> > > > > > > be a little bit more selective and only not enable it in one direction but...
> > > > > > >
> > > > > > > diff --git a/drivers/pci/pcie/aspm.c b/drivers/pci/pcie/aspm.c
> > > > > > > index b17e5ffd31b1..96a3c6837124 100644
> > > > > > > --- a/drivers/pci/pcie/aspm.c
> > > > > > > +++ b/drivers/pci/pcie/aspm.c
> > > > > > > @@ -584,15 +584,16 @@ static void pcie_aspm_cap_init(struct
> > > > > > > pcie_link_state *link, int blacklist)
> > > > > > >          * given link unless components on both sides of the link each
> > > > > > >          * support L0s.
> > > > > > >          */
> > > > > > > -       if (dwreg.support & upreg.support & PCIE_LINK_STATE_L0S)
> > > > > > > -               link->aspm_support |= ASPM_STATE_L0S;
> > > > > > > -       if (dwreg.enabled & PCIE_LINK_STATE_L0S)
> > > > > > > -               link->aspm_enabled |= ASPM_STATE_L0S_UP;
> > > > > > > -       if (upreg.enabled & PCIE_LINK_STATE_L0S)
> > > > > > > -               link->aspm_enabled |= ASPM_STATE_L0S_DW;
> > > > > > > -       link->latency_up.l0s = calc_l0s_latency(upreg.latency_encoding_l0s);
> > > > > > > -       link->latency_dw.l0s = calc_l0s_latency(dwreg.latency_encoding_l0s);
> > > > > > > -
> > > > > > > +       if (pcie_get_width_cap(child) != PCIE_LNK_X1) {
> > > > > > > +               if (dwreg.support & upreg.support & PCIE_LINK_STATE_L0S)
> > > > > > > +                       link->aspm_support |= ASPM_STATE_L0S;
> > > > > > > +               if (dwreg.enabled & PCIE_LINK_STATE_L0S)
> > > > > > > +                       link->aspm_enabled |= ASPM_STATE_L0S_UP;
> > > > > > > +               if (upreg.enabled & PCIE_LINK_STATE_L0S)
> > > > > > > +                       link->aspm_enabled |= ASPM_STATE_L0S_DW;
> > > > > > > +               link->latency_up.l0s =
> > > > > > > calc_l0s_latency(upreg.latency_encoding_l0s);
> > > > > > > +               link->latency_dw.l0s =
> > > > > > > calc_l0s_latency(dwreg.latency_encoding_l0s);
> > > > > > > +       }
> > > > > > >
> > > > > > > this time it's compile tested...
> > > > > > >
> > > > > > > It could also be  if (pcie_get_width_cap(child) > PCIE_LNK_X1) {
> > > > > > >
> > > > > > > I assume that ASPM is not enabled for: PCIE_LNK_WIDTH_RESRV ;)
> > > > > >
> > > > > > This is probably a bit too broad of a scope to be used generically
> > > > > > since this will disable ASPM for all devices that have a x1 link
> > > > > > width.
> > > > >
> > > > > I agree, but also, the change to enable aspm on the controllers was
> > > > > quite recent so it could be a general
> > > > > issue that a lot of people could be suffering from... I haven't seen
> > > > > any reports though...
> > > >
> > > > The problem is your layout is very likely specific. It may effect
> > > > others with a similar layout, but for example I have the same
> > > > controller in one of my systems and I have not been having any issues.
> > > >
> > > > > But otoh worst case would be a minor revert in power usage ;)
> > > > >
> > > > > > It might make more sense to look at something such as
> > > > > > e1000e_disable_aspm as an example of how to approach this.
> > > > >
> > > > > Oh, my grepping completely failed to dig up this, thanks!
> > > >
> > > > https://elixir.bootlin.com/linux/latest/source/drivers/net/ethernet/intel/e1000e/netdev.c#L6743
> > > >
> > > > > > As far as what triggers it we would need to get more details about the
> > > > > > setup. I'd be curious if we have an "lspci -vvv" for the system
> > > > > > available. The assumption is that the ASPM exit latency is high on
> > > > > > this system and that in turn is causing the bandwidth issues as you
> > > > > > start entering L1. If I am not mistaken the device should advertise
> > > > > > about 16us for the exit latency. I'd be curious if we have a device
> > > > > > somewhere between the NIC and the root port that might be increasing
> > > > > > the delay in exiting L1, and then if we could identify that we could
> > > > > > add a PCIe quirk for that.
> > > > >
> > > > > We only disabled the L0s afair, but from e1000e_disable_aspm - "can't
> > > > > have L1 without L0s"
> > > > > so perhaps they are disabled as well...
> > > >
> > > > Not sure where you got that from. It looks like with your system the
> > > > L0s is disabled and you only have support for L1.
> > >
> > > First of all, sorry, I accidentally dropped the mailinglist :(
> > >
> > > And the comment I quoted was from the e1000e_disable_aspm:
> > >         switch (state) {
> > >         case PCIE_LINK_STATE_L0S:
> > >         case PCIE_LINK_STATE_L0S | PCIE_LINK_STATE_L1:
> > >                 aspm_dis_mask |= PCI_EXP_LNKCTL_ASPM_L0S;
> > >                 /* fall-through - can't have L1 without L0s */
> > >        <====
> > >         case PCIE_LINK_STATE_L1:
> > >                 aspm_dis_mask |= PCI_EXP_LNKCTL_ASPM_L1;
> > >                 break;
> > >         default:
> > >                 return;
> > >         }
> > > ----
> > >
> > > > >
> > > > > And:
> > > > > lspci -t
> > > > > -[0000:00]-+-00.0
> > > > >            +-00.2
> > > > >            +-01.0
> > > > >            +-01.2-[01-07]----00.0-[02-07]--+-03.0-[03]----00.0
> > > >
> > > > I think I now know what patch broke things for you. It is most likely
> > > > this one that enabled ASPM on devices behind bridges:
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci.git/commit/?h=next&id=66ff14e59e8a30690755b08bc3042359703fb07a
> > >
> > > Ah, yes, correct
> > >
> > > > My advice would be to revert that patch and see if it resolves the
> > > > issue for you.
> > >
> > > Could do that yes =)
> > >
> > > I'm mainly looking for a more generic solution...
> >
> > That would be the generic solution. The patch has obviously broken
> > things so we need to address the issues. The immediate solution is to
> > revert it, but the more correct solution may be to do something like
> > add an allowlist for the cases where enabling ASPM will not harm
> > system performance.
>
> more like a generic solution like the one you mention below where we
> get the best of both worlds... =)
>
> > > > Device 3:00.0 is your i211 gigabit network controller. Notice you have
> > > > a bridge between it and the root complex. This can be problematic as I
> > > > believe the main reason for the code that was removed in the patch is
> > > > that wakeups can end up being serialized if all of the links are down
> > > > or you could end up with one of the other devices on the bridge
> > > > utilizing the PCIe link an reducing the total throughput, especially
> > > > if you have the link to the root complex also taking part in power
> > > > management. Starting at the root complex it looks like you have the
> > > > link between the bridge and the PCIe switch. It is running L1 enabled
> > > > with a 32us time for it to reestablish link according to the root
> > > > complex side (00:01.2->1:00.0). The next link is the switch to the
> > > > i211 which is 2:03.0 -> 3:00.0. The interesting bit here is that the
> > > > bridge says it only needs 32us while the NIC is saying it will need
> > > > 64us. That upper bound is already a pretty significant value, however
> > > > you have two links to contend with so in reality you are looking at
> > > > something like 96us to bring up both links if they are brought up
> > > > serially.
> > >
> > > hummm... Interesting... I have never managed to parse that lspci thing
> > > properly...
> >
> > Actually I parsed it a bit incorrectly too.
> >
> > The i211 lists that it only supports up to 64us maximum delay in L1
> > wakeup latency. The switch is advertising 32us delay to come out of L1
> > on both the upstream and downstream ports. As such the link would be
> > considered marginal with L1 enabled and so it should be disabled.
> >
> > > It is also interesting that the realtek card seems to be on the same link then?
> > > With ASPM disabled, I wonder if that is due to the setpci command or
> > > if it was disabled before..
> > > (playing with setpci makes no difference but it might require a reboot.. )
> >
> > Are you using the same command you were using for the i211? Did you
> > make sure to update the offset since the PCIe configuration block
> > starts at a different offset? Also you probably need to make sure to
> > only try to update function 0 of the device since I suspect the other
> > functions won't have any effect.
>
> Ah, no, i only toggled the i211 to see if that's what caused the ASPM
> to be disabled...
>
> But it seems it isn't -- will have to reboot to verify though
>
> > > > When you consider that you are using a Gigabit Ethernet connection
> > > > that is moving data at roughly 1000 bits per microsecond, or 125 bytes
> > > > per microsecond. At that rate we should have roughly 270us worth of
> > > > packets we can buffer before we are forced to start dropping packets
> > > > assuming the device is configured with a default 34K Rx buffer. As
> > > > such I am not entirely sure ASPM is the only issue we have here. I
> > > > assume you may also have CPU C states enabled as well? By any chance
> > > > do you have C6 or deeper sleep states enabled on the system? If so
> > > > that might be what is pushing us into the issues that you were seeing.
> > > > Basically we are seeing something that is causing the PCIe to stall
> > > > for over 270us. My thought is that it is likely a number of factors
> > > > where we have too many devices sleeping and as a result the total
> > > > wakeup latency is likely 300us or more resulting in dropped packets.
> > >
> > > It seems like I only have C2 as max...
> > >
> > > grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name
> > > /sys/devices/system/cpu/cpu0/cpuidle/state0/name:POLL
> > > /sys/devices/system/cpu/cpu0/cpuidle/state1/name:C1
> > > /sys/devices/system/cpu/cpu0/cpuidle/state2/name:C2
> > >
> > > Anyway, we should bring this back to the mailing list
> >
> > That's fine. I assumed you didn't want to post the lspci to the
> > mailing list as it might bounce for being too large.
>
> Good thinking, but it was actually a slip :/
>
> > So a generic solution for this would be to have a function that would
> > scan the PCIe bus and determine the total L1 and L0s exit latency. If
> > a device advertises an acceptable ASPM power state exit latency and we
> > have met or exceeded that we should be disabling that ASPM feature for
> > the device.
>
> Yeah, since I'm on vacation I'll actually see if I can look in to that!
> (I mean, I'm not that used to these kinds of things but if my messing
> around inspires someone
> or if noone else is working on it, then... what the hey ;) )

Uhm... so, in the function that determines the latency, they only take
the MAX of the two link ends

Ie:
static void pcie_aspm_check_latency(struct pci_dev *endpoint)
{
...
                latency = max_t(u32, link->latency_up.l1, link->latency_dw.l1);
---

I just want to see if I'm understanding you right: is it correct that
the latency should be

a.up + b.dw + b.up + c.dw

for a (root?) going through b (bridge/switch?) to c (device)?
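
For reference, the surrounding loop is roughly this (paraphrased from
drivers/pci/pcie/aspm.c, so the exact wording of the comments may be
off):

        while (link) {
                /* Check upstream direction L0s latency */
                if ((link->aspm_capable & ASPM_STATE_L0S_UP) &&
                    (link->latency_up.l0s > acceptable->l0s))
                        link->aspm_capable &= ~ASPM_STATE_L0S_UP;

                /* Check downstream direction L0s latency */
                if ((link->aspm_capable & ASPM_STATE_L0S_DW) &&
                    (link->latency_dw.l0s > acceptable->l0s))
                        link->aspm_capable &= ~ASPM_STATE_L0S_DW;

                /* Check L1 latency, each switch adds 1us on top */
                latency = max_t(u32, link->latency_up.l1, link->latency_dw.l1);
                if ((link->aspm_capable & ASPM_STATE_L1) &&
                    (latency + l1_switch_latency > acceptable->l1))
                        link->aspm_capable &= ~ASPM_STATE_L1;
                l1_switch_latency += 1000;

                link = link->parent;
        }

So the only thing that accumulates across the path today is the fixed
1us (1000 ns) per switch, not the advertised exit latencies themselves.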

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [Intel-wired-lan] NAT performance issue 944mbit -> ~40mbit
  2020-07-24 22:49                                             ` Ian Kumlien
@ 2020-07-24 23:08                                               ` Ian Kumlien
  2020-07-25  0:13                                                 ` Ian Kumlien
  2020-07-25  0:45                                                 ` Alexander Duyck
  0 siblings, 2 replies; 51+ messages in thread
From: Ian Kumlien @ 2020-07-24 23:08 UTC (permalink / raw)
  To: intel-wired-lan

On Sat, Jul 25, 2020 at 12:49 AM Ian Kumlien <ian.kumlien@gmail.com> wrote:
>
> On Sat, Jul 25, 2020 at 12:41 AM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> >
> > On Fri, Jul 24, 2020 at 11:51 PM Alexander Duyck
> > <alexander.duyck@gmail.com> wrote:
> > >
> > > On Fri, Jul 24, 2020 at 2:14 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:

[--8<--]

> > > > > >
> > > > > > And:
> > > > > > lspci -t
> > > > > > -[0000:00]-+-00.0
> > > > > >            +-00.2
> > > > > >            +-01.0
> > > > > >            +-01.2-[01-07]----00.0-[02-07]--+-03.0-[03]----00.0
> > > > >
> > > > > I think I now know what patch broke things for you. It is most likely
> > > > > this one that enabled ASPM on devices behind bridges:
> > > > > https://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci.git/commit/?h=next&id=66ff14e59e8a30690755b08bc3042359703fb07a
> > > >
> > > > Ah, yes, correct
> > > >
> > > > > My advice would be to revert that patch and see if it resolves the
> > > > > issue for you.
> > > >
> > > > Could do that yes =)
> > > >
> > > > I'm mainly looking for a more generic solution...
> > >
> > > That would be the generic solution. The patch has obviously broken
> > > things so we need to address the issues. The immediate solution is to
> > > revert it, but the more correct solution may be to do something like
> > > add an allowlist for the cases where enabling ASPM will not harm
> > > system performance.
> >
> > more like a generic solution like the one you mention below where we
> > get the best of both worlds... =)
> >
> > > > > Device 3:00.0 is your i211 gigabit network controller. Notice you have
> > > > > a bridge between it and the root complex. This can be problematic as I
> > > > > believe the main reason for the code that was removed in the patch is
> > > > > that wakeups can end up being serialized if all of the links are down
> > > > > or you could end up with one of the other devices on the bridge
> > > > > utilizing the PCIe link an reducing the total throughput, especially
> > > > > if you have the link to the root complex also taking part in power
> > > > > management. Starting at the root complex it looks like you have the
> > > > > link between the bridge and the PCIe switch. It is running L1 enabled
> > > > > with a 32us time for it to reestablish link according to the root
> > > > > complex side (00:01.2->1:00.0). The next link is the switch to the
> > > > > i211 which is 2:03.0 -> 3:00.0. The interesting bit here is that the
> > > > > bridge says it only needs 32us while the NIC is saying it will need
> > > > > 64us. That upper bound is already a pretty significant value, however
> > > > > you have two links to contend with so in reality you are looking at
> > > > > something like 96us to bring up both links if they are brought up
> > > > > serially.
> > > >
> > > > hummm... Interesting... I have never managed to parse that lspci thing
> > > > properly...
> > >
> > > Actually I parsed it a bit incorrectly too.
> > >
> > > The i211 lists that it only supports up to 64us maximum delay in L1
> > > wakeup latency. The switch is advertising 32us delay to come out of L1
> > > on both the upstream and downstream ports. As such the link would be
> > > considered marginal with L1 enabled and so it should be disabled.
> > >
> > > > It is also interesting that the realtek card seems to be on the same link then?
> > > > With ASPM disabled, I wonder if that is due to the setpci command or
> > > > if it was disabled before..
> > > > (playing with setpci makes no difference but it might require a reboot.. )
> > >
> > > Are you using the same command you were using for the i211? Did you
> > > make sure to update the offset since the PCIe configuration block
> > > starts at a different offset? Also you probably need to make sure to
> > > only try to update function 0 of the device since I suspect the other
> > > functions won't have any effect.
> >
> > Ah, no, i only toggled the i211 to see if that's what caused the ASPM
> > to be disabled...
> >
> > But it seems it isn't -- will have to reboot to verify though
> >
> > > > > When you consider that you are using a Gigabit Ethernet connection
> > > > > that is moving data at roughly 1000 bits per microsecond, or 125 bytes
> > > > > per microsecond. At that rate we should have roughly 270us worth of
> > > > > packets we can buffer before we are forced to start dropping packets
> > > > > assuming the device is configured with a default 34K Rx buffer. As
> > > > > such I am not entirely sure ASPM is the only issue we have here. I
> > > > > assume you may also have CPU C states enabled as well? By any chance
> > > > > do you have C6 or deeper sleep states enabled on the system? If so
> > > > > that might be what is pushing us into the issues that you were seeing.
> > > > > Basically we are seeing something that is causing the PCIe to stall
> > > > > for over 270us. My thought is that it is likely a number of factors
> > > > > where we have too many devices sleeping and as a result the total
> > > > > wakeup latency is likely 300us or more resulting in dropped packets.
> > > >
> > > > It seems like I only have C2 as max...
> > > >
> > > > grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name
> > > > /sys/devices/system/cpu/cpu0/cpuidle/state0/name:POLL
> > > > /sys/devices/system/cpu/cpu0/cpuidle/state1/name:C1
> > > > /sys/devices/system/cpu/cpu0/cpuidle/state2/name:C2
> > > >
> > > > Anyway, we should bring this back to the mailing list
> > >
> > > That's fine. I assumed you didn't want to post the lspci to the
> > > mailing list as it might bounce for being too large.
> >
> > Good thinking, but it was actually a slip :/
> >
> > > So a generic solution for this would be to have a function that would
> > > scan the PCIe bus and determine the total L1 and L0s exit latency. If
> > > a device advertises an acceptable ASPM power state exit latency and we
> > > have met or exceeded that we should be disabling that ASPM feature for
> > > the device.
> >
> > Yeah, since I'm on vacation I'll actually see if I can look in to that!
> > (I mean, I'm not that used to these kinds of things but if my messing
> > around inspires someone
> > or if noone else is working on it, then... what the hey ;) )
>
> Uhm... so, in the function that determines latency they only do MAX
>
> Ie:
> static void pcie_aspm_check_latency(struct pci_dev *endpoint)
> {
> ...
>                 latency = max_t(u32, link->latency_up.l1, link->latency_dw.l1);
> ---
>
> I just want to see if I'm understanding you right, is it correct that
> the latency should be:
> a.up + b.dw + b.up + c.dw
>
> for a (root?) to go through b (bridge/switch?) to c (device)

Also, we only disabled L0s, which isn't summed into a total at all; it
is only checked per side.

Since PCIe is serial, and judging from your previous statements, I
assume that the max statement is a bug.
I also assume that L0s counts, should be dealt with in the same way,
and should also be cumulative...

The question becomes: is this latency measured from the root, or is it "one step"?

Also, they add one microsecond, but that looks like it should be the
parent.l0s.up + link.l0s.dw latency values.

So are my assumptions correct that the serial nature means that all
latencies stack?

L0s is done first, so the max latency is actually L0s + L1? (per side)

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [Intel-wired-lan] NAT performance issue 944mbit -> ~40mbit
  2020-07-24 23:08                                               ` Ian Kumlien
@ 2020-07-25  0:13                                                 ` Ian Kumlien
  2020-07-25  0:45                                                 ` Alexander Duyck
  1 sibling, 0 replies; 51+ messages in thread
From: Ian Kumlien @ 2020-07-25  0:13 UTC (permalink / raw)
  To: intel-wired-lan

On Sat, Jul 25, 2020 at 1:08 AM Ian Kumlien <ian.kumlien@gmail.com> wrote:
>
> On Sat, Jul 25, 2020 at 12:49 AM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> >
> > On Sat, Jul 25, 2020 at 12:41 AM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > >
> > > On Fri, Jul 24, 2020 at 11:51 PM Alexander Duyck
> > > <alexander.duyck@gmail.com> wrote:
> > > >
> > > > On Fri, Jul 24, 2020 at 2:14 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
>
> [--8<--]
>
> > > > > > >
> > > > > > > And:
> > > > > > > lspci -t
> > > > > > > -[0000:00]-+-00.0
> > > > > > >            +-00.2
> > > > > > >            +-01.0
> > > > > > >            +-01.2-[01-07]----00.0-[02-07]--+-03.0-[03]----00.0
> > > > > >
> > > > > > I think I now know what patch broke things for you. It is most likely
> > > > > > this one that enabled ASPM on devices behind bridges:
> > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci.git/commit/?h=next&id=66ff14e59e8a30690755b08bc3042359703fb07a
> > > > >
> > > > > Ah, yes, correct
> > > > >
> > > > > > My advice would be to revert that patch and see if it resolves the
> > > > > > issue for you.
> > > > >
> > > > > Could do that yes =)
> > > > >
> > > > > I'm mainly looking for a more generic solution...
> > > >
> > > > That would be the generic solution. The patch has obviously broken
> > > > things so we need to address the issues. The immediate solution is to
> > > > revert it, but the more correct solution may be to do something like
> > > > add an allowlist for the cases where enabling ASPM will not harm
> > > > system performance.
> > >
> > > more like a generic solution like the one you mention below where we
> > > get the best of both worlds... =)
> > >
> > > > > > Device 3:00.0 is your i211 gigabit network controller. Notice you have
> > > > > > a bridge between it and the root complex. This can be problematic as I
> > > > > > believe the main reason for the code that was removed in the patch is
> > > > > > that wakeups can end up being serialized if all of the links are down
> > > > > > or you could end up with one of the other devices on the bridge
> > > > > > utilizing the PCIe link an reducing the total throughput, especially
> > > > > > if you have the link to the root complex also taking part in power
> > > > > > management. Starting at the root complex it looks like you have the
> > > > > > link between the bridge and the PCIe switch. It is running L1 enabled
> > > > > > with a 32us time for it to reestablish link according to the root
> > > > > > complex side (00:01.2->1:00.0). The next link is the switch to the
> > > > > > i211 which is 2:03.0 -> 3:00.0. The interesting bit here is that the
> > > > > > bridge says it only needs 32us while the NIC is saying it will need
> > > > > > 64us. That upper bound is already a pretty significant value, however
> > > > > > you have two links to contend with so in reality you are looking at
> > > > > > something like 96us to bring up both links if they are brought up
> > > > > > serially.
> > > > >
> > > > > hummm... Interesting... I have never managed to parse that lspci thing
> > > > > properly...
> > > >
> > > > Actually I parsed it a bit incorrectly too.
> > > >
> > > > The i211 lists that it only supports up to 64us maximum delay in L1
> > > > wakeup latency. The switch is advertising 32us delay to come out of L1
> > > > on both the upstream and downstream ports. As such the link would be
> > > > considered marginal with L1 enabled and so it should be disabled.
> > > >
> > > > > It is also interesting that the realtek card seems to be on the same link then?
> > > > > With ASPM disabled, I wonder if that is due to the setpci command or
> > > > > if it was disabled before..
> > > > > (playing with setpci makes no difference but it might require a reboot.. )
> > > >
> > > > Are you using the same command you were using for the i211? Did you
> > > > make sure to update the offset since the PCIe configuration block
> > > > starts at a different offset? Also you probably need to make sure to
> > > > only try to update function 0 of the device since I suspect the other
> > > > functions won't have any effect.
> > >
> > > Ah, no, i only toggled the i211 to see if that's what caused the ASPM
> > > to be disabled...
> > >
> > > But it seems it isn't -- will have to reboot to verify though
> > >
> > > > > > When you consider that you are using a Gigabit Ethernet connection
> > > > > > that is moving data at roughly 1000 bits per microsecond, or 125 bytes
> > > > > > per microsecond. At that rate we should have roughly 270us worth of
> > > > > > packets we can buffer before we are forced to start dropping packets
> > > > > > assuming the device is configured with a default 34K Rx buffer. As
> > > > > > such I am not entirely sure ASPM is the only issue we have here. I
> > > > > > assume you may also have CPU C states enabled as well? By any chance
> > > > > > do you have C6 or deeper sleep states enabled on the system? If so
> > > > > > that might be what is pushing us into the issues that you were seeing.
> > > > > > Basically we are seeing something that is causing the PCIe to stall
> > > > > > for over 270us. My thought is that it is likely a number of factors
> > > > > > where we have too many devices sleeping and as a result the total
> > > > > > wakeup latency is likely 300us or more resulting in dropped packets.
> > > > >
> > > > > It seems like I only have C2 as max...
> > > > >
> > > > > grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name
> > > > > /sys/devices/system/cpu/cpu0/cpuidle/state0/name:POLL
> > > > > /sys/devices/system/cpu/cpu0/cpuidle/state1/name:C1
> > > > > /sys/devices/system/cpu/cpu0/cpuidle/state2/name:C2
> > > > >
> > > > > Anyway, we should bring this back to the mailing list
> > > >
> > > > That's fine. I assumed you didn't want to post the lspci to the
> > > > mailing list as it might bounce for being too large.
> > >
> > > Good thinking, but it was actually a slip :/
> > >
> > > > So a generic solution for this would be to have a function that would
> > > > scan the PCIe bus and determine the total L1 and L0s exit latency. If
> > > > a device advertises an acceptable ASPM power state exit latency and we
> > > > have met or exceeded that we should be disabling that ASPM feature for
> > > > the device.
> > >
> > > Yeah, since I'm on vacation I'll actually see if I can look in to that!
> > > (I mean, I'm not that used to these kinds of things but if my messing
> > > around inspires someone
> > > or if noone else is working on it, then... what the hey ;) )
> >
> > Uhm... so, in the function that determines latency they only do MAX
> >
> > Ie:
> > static void pcie_aspm_check_latency(struct pci_dev *endpoint)
> > {
> > ...
> >                 latency = max_t(u32, link->latency_up.l1, link->latency_dw.l1);
> > ---
> >
> > I just want to see if I'm understanding you right, is it correct that
> > the latency should be:
> > a.up + b.dw + b.up + c.dw
> >
> > for a (root?) to go through b (bridge/switch?) to c (device)
>
> Also, we only disabled L0, which isn't counted as a total at all, it
> only checks each side.
>
> Since pcie is serial and judging from your previous statements I
> assume that the max statement is a bug.
> I also assume that l0 counts, and should be dealt with the same way
> and it should also be cumulative...
>
> The question becomes, is this latency from root? or is it "one step"?
>
> Also they add one microsecond but that looks like it should be
> parent.l0.up + link.l0.dw latency values
>
> So are my assumptions correct that the serial nature means that all
> latenies stack?
>
> l0 is done first, so latency is actually l0 + l1 as max latency? (per side)

Found some Intel documentation ("PCI Express compiler v6.1 User Guide") and it
does seem like the L0s and L1 latencies are cumulative...

So it looks like we should have something like:
        while (link) {
                /* Check L0s latency, this is cumulative from the root */
                if (link->aspm_capable & ASPM_STATE_L0S) {
                        /* Add downstream direction L0s latency */
                        if (link->aspm_capable & ASPM_STATE_L0S_DW)
                                l0_switch_latency += link->latency_dw.l0s;
                        /* Add upstream direction L0s latency of the parent */
                        if (link->parent &&
                            (link->parent->aspm_capable & ASPM_STATE_L0S_UP))
                                l0_switch_latency += link->parent->latency_up.l0s;

                        /* Clear ASPM L0s since the latency is too high */
                        if (l0_switch_latency > acceptable->l0s)
                                endpoint->bus->self->link_state->aspm_capable &=
                                        ~ASPM_STATE_L0S;
                }

                if (link->aspm_capable & ASPM_STATE_L1) {
                        l1_switch_latency += link->latency_dw.l1;
                        if (link->parent &&
                            (link->parent->aspm_capable & ASPM_STATE_L1))
                                l1_switch_latency += link->parent->latency_up.l1;

                        /* Clear ASPM L1 since the latency is too high */
                        /* FIXME: should the L0s latency be counted here as well? */
                        if (l1_switch_latency + l0_switch_latency > acceptable->l1)
                                endpoint->bus->self->link_state->aspm_capable &=
                                        ~ASPM_STATE_L1;
                        /* FIXME: is the extra 1us per switch still required? */
                        l1_switch_latency += 1000;
                }

                link = link->parent;
        }
---

I don't know if I got downstream/upstream right though... but as you
can see there are a few questions...
An additional question is whether it's enough to only clear endpoints, or if
I should walk the whole path... (it does feel like endpoints would be good
enough)

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [Intel-wired-lan] NAT performance issue 944mbit -> ~40mbit
  2020-07-24 23:08                                               ` Ian Kumlien
  2020-07-25  0:13                                                 ` Ian Kumlien
@ 2020-07-25  0:45                                                 ` Alexander Duyck
  2020-07-25  1:03                                                   ` Ian Kumlien
  1 sibling, 1 reply; 51+ messages in thread
From: Alexander Duyck @ 2020-07-25  0:45 UTC (permalink / raw)
  To: intel-wired-lan

On Fri, Jul 24, 2020 at 4:08 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
>
> On Sat, Jul 25, 2020 at 12:49 AM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> >
> > On Sat, Jul 25, 2020 at 12:41 AM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > >
> > > On Fri, Jul 24, 2020 at 11:51 PM Alexander Duyck
> > > <alexander.duyck@gmail.com> wrote:
> > > >
> > > > On Fri, Jul 24, 2020 at 2:14 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
>
> [--8<--]
>
> > > > > > >
> > > > > > > And:
> > > > > > > lspci -t
> > > > > > > -[0000:00]-+-00.0
> > > > > > >            +-00.2
> > > > > > >            +-01.0
> > > > > > >            +-01.2-[01-07]----00.0-[02-07]--+-03.0-[03]----00.0
> > > > > >
> > > > > > I think I now know what patch broke things for you. It is most likely
> > > > > > this one that enabled ASPM on devices behind bridges:
> > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci.git/commit/?h=next&id=66ff14e59e8a30690755b08bc3042359703fb07a
> > > > >
> > > > > Ah, yes, correct
> > > > >
> > > > > > My advice would be to revert that patch and see if it resolves the
> > > > > > issue for you.
> > > > >
> > > > > Could do that yes =)
> > > > >
> > > > > I'm mainly looking for a more generic solution...
> > > >
> > > > That would be the generic solution. The patch has obviously broken
> > > > things so we need to address the issues. The immediate solution is to
> > > > revert it, but the more correct solution may be to do something like
> > > > add an allowlist for the cases where enabling ASPM will not harm
> > > > system performance.
> > >
> > > more like a generic solution like the one you mention below where we
> > > get the best of both worlds... =)
> > >
> > > > > > Device 3:00.0 is your i211 gigabit network controller. Notice you have
> > > > > > a bridge between it and the root complex. This can be problematic as I
> > > > > > believe the main reason for the code that was removed in the patch is
> > > > > > that wakeups can end up being serialized if all of the links are down
> > > > > > or you could end up with one of the other devices on the bridge
> > > > > > utilizing the PCIe link an reducing the total throughput, especially
> > > > > > if you have the link to the root complex also taking part in power
> > > > > > management. Starting at the root complex it looks like you have the
> > > > > > link between the bridge and the PCIe switch. It is running L1 enabled
> > > > > > with a 32us time for it to reestablish link according to the root
> > > > > > complex side (00:01.2->1:00.0). The next link is the switch to the
> > > > > > i211 which is 2:03.0 -> 3:00.0. The interesting bit here is that the
> > > > > > bridge says it only needs 32us while the NIC is saying it will need
> > > > > > 64us. That upper bound is already a pretty significant value, however
> > > > > > you have two links to contend with so in reality you are looking at
> > > > > > something like 96us to bring up both links if they are brought up
> > > > > > serially.
> > > > >
> > > > > hummm... Interesting... I have never managed to parse that lspci thing
> > > > > properly...
> > > >
> > > > Actually I parsed it a bit incorrectly too.
> > > >
> > > > The i211 lists that it only supports up to 64us maximum delay in L1
> > > > wakeup latency. The switch is advertising 32us delay to come out of L1
> > > > on both the upstream and downstream ports. As such the link would be
> > > > considered marginal with L1 enabled and so it should be disabled.
> > > >
> > > > > It is also interesting that the realtek card seems to be on the same link then?
> > > > > With ASPM disabled, I wonder if that is due to the setpci command or
> > > > > if it was disabled before..
> > > > > (playing with setpci makes no difference but it might require a reboot.. )
> > > >
> > > > Are you using the same command you were using for the i211? Did you
> > > > make sure to update the offset since the PCIe configuration block
> > > > starts at a different offset? Also you probably need to make sure to
> > > > only try to update function 0 of the device since I suspect the other
> > > > functions won't have any effect.
> > >
> > > Ah, no, i only toggled the i211 to see if that's what caused the ASPM
> > > to be disabled...
> > >
> > > But it seems it isn't -- will have to reboot to verify though
> > >
> > > > > > When you consider that you are using a Gigabit Ethernet connection
> > > > > > that is moving data at roughly 1000 bits per microsecond, or 125 bytes
> > > > > > per microsecond. At that rate we should have roughly 270us worth of
> > > > > > packets we can buffer before we are forced to start dropping packets
> > > > > > assuming the device is configured with a default 34K Rx buffer. As
> > > > > > such I am not entirely sure ASPM is the only issue we have here. I
> > > > > > assume you may also have CPU C states enabled as well? By any chance
> > > > > > do you have C6 or deeper sleep states enabled on the system? If so
> > > > > > that might be what is pushing us into the issues that you were seeing.
> > > > > > Basically we are seeing something that is causing the PCIe to stall
> > > > > > for over 270us. My thought is that it is likely a number of factors
> > > > > > where we have too many devices sleeping and as a result the total
> > > > > > wakeup latency is likely 300us or more resulting in dropped packets.
> > > > >
> > > > > It seems like I only have C2 as max...
> > > > >
> > > > > grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name
> > > > > /sys/devices/system/cpu/cpu0/cpuidle/state0/name:POLL
> > > > > /sys/devices/system/cpu/cpu0/cpuidle/state1/name:C1
> > > > > /sys/devices/system/cpu/cpu0/cpuidle/state2/name:C2
> > > > >
> > > > > Anyway, we should bring this back to the mailing list
> > > >
> > > > That's fine. I assumed you didn't want to post the lspci to the
> > > > mailing list as it might bounce for being too large.
> > >
> > > Good thinking, but it was actually a slip :/
> > >
> > > > So a generic solution for this would be to have a function that would
> > > > scan the PCIe bus and determine the total L1 and L0s exit latency. If
> > > > a device advertises an acceptable ASPM power state exit latency and we
> > > > have met or exceeded that we should be disabling that ASPM feature for
> > > > the device.
> > >
> > > Yeah, since I'm on vacation I'll actually see if I can look in to that!
> > > (I mean, I'm not that used to these kinds of things but if my messing
> > > around inspires someone
> > > or if noone else is working on it, then... what the hey ;) )
> >
> > Uhm... so, in the function that determines latency they only do MAX
> >
> > Ie:
> > static void pcie_aspm_check_latency(struct pci_dev *endpoint)
> > {
> > ...
> >                 latency = max_t(u32, link->latency_up.l1, link->latency_dw.l1);
> > ---
> >
> > I just want to see if I'm understanding you right, is it correct that
> > the latency should be:
> > a.up + b.dw + b.up + c.dw
> >
> > for a (root?) to go through b (bridge/switch?) to c (device)

Actually I think it is max(a.dw, b.up) + max(b.dw, c.up). Basically
what you want is the maximum time to bring each link up; technically
you only have 2 links, so you just add up the maximum time to bring up
each of them.
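
For concreteness, with the numbers from earlier in the thread that
works out to max(32, 32) = 32us for the root port <-> switch upstream
link and max(32, 64) = 64us for the switch downstream <-> i211 link,
i.e. roughly 96us if the two links have to be brought up one after the
other, which is the figure I mentioned before.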

> Also, we only disabled L0, which isn't counted as a total at all, it
> only checks each side.

Not sure what you mean here. L0 is the link fully powered on. The two
link states we have to worry about are L0s and L1 as those involve
various states of power-down. The L1 latency is the nasty one as that
basically involves fully powering down the link and requires time for
the link to be reestablished.

> Since pcie is serial and judging from your previous statements I
> assume that the max statement is a bug.
> I also assume that l0 counts, and should be dealt with the same way
> and it should also be cumulative...

That latency check looks like it is for a single link, not for each
link in the chain.

> The question becomes, is this latency from root? or is it "one step"?

Actually the function is doing both. I had to reread the spec.
Basically the switch is supposed to start trying to bring up the other
links as soon as it detects that we are trying to bring up the link.
In theory this is only supposed to add about 1us. So in theory the
total bring-up time should be 33us.

> Also they add one microsecond but that looks like it should be
> parent.l0.up + link.l0.dw latency values

Yes, the 1us is the value I reference above. Basically the assumption
is that as soon as one link starts retraining it should start working
on the other ones so the serialization penalty is only supposed to be
1us.

> So are my assumptions correct that the serial nature means that all
> latenies stack?
>
> l0 is done first, so latency is actually l0 + l1 as max latency? (per side)

I don't think the L0s latency needs to be added if that is what you
are asking. Basically you either go from L0s->L0 or L1->L0. There is
no jumping between L0s and L1.

Something still isn't adding up with all this as the latency shouldn't
be enough to trigger buffer overruns. I wonder if we don't have
something that is misreporting the actual L1 wakeup latency. One thing
that I notice is that the link between the root complex and the PCIe
switch seems to have some sort of electrical issue. If you look at the
section from the upstream side of the switch:
LnkCap: Port #0, Speed 16GT/s, Width x8, ASPM L1, Exit Latency L1 <32us
ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl: ASPM L1 Enabled; Disabled- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 16GT/s (ok), Width x4 (downgraded)
TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-

One thing that catches my eye is that it is only linked at x4 when
both sides are listing themselves as x8. Do you know if this has ever
linked at x8 or has it always been x4? With this being a Gen 4 x8
connection it would normally take a little while to establish a link,
but with it having to fail down to a x4 that would add extra time and
likely push it out of the expected exit latency. Also I notice that
there are mentions of lane errors in the config, however I suspect
those are Gen 4 features that I am not familiar with so I don't know
if those are normal. It might be interesting to reboot and see if the
link goes back to x8 and if the lane errors clear at some point. If
it does then we might want to disable ASPM on the upstream port of the
switch since I have seen ASPM cause link downgrades and that may be
what is occurring here.
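
If disabling ASPM just on that upstream port turns out to be the right
workaround, one way to do it permanently is a PCI quirk calling
pci_disable_link_state() for the port. A minimal sketch (the vendor and
device IDs here are placeholders, not the real IDs of your switch):

#include <linux/pci.h>

/* Placeholder IDs -- substitute the real IDs of the switch upstream
 * port (01:00.0 in the lspci -t output above). */
#define EXAMPLE_SWITCH_VENDOR_ID 0xffff
#define EXAMPLE_SWITCH_DEVICE_ID 0xffff

static void quirk_no_aspm_l1(struct pci_dev *pdev)
{
        /* Keep the link leading to this port out of L1; the L1 exit
         * appears to downgrade/stall the link in practice. */
        pci_disable_link_state(pdev, PCIE_LINK_STATE_L1);
}
DECLARE_PCI_FIXUP_FINAL(EXAMPLE_SWITCH_VENDOR_ID,
                        EXAMPLE_SWITCH_DEVICE_ID, quirk_no_aspm_l1);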

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [Intel-wired-lan] NAT performance issue 944mbit -> ~40mbit
  2020-07-25  0:45                                                 ` Alexander Duyck
@ 2020-07-25  1:03                                                   ` Ian Kumlien
  2020-07-25 13:53                                                     ` Ian Kumlien
  2020-07-25 17:30                                                     ` Alexander Duyck
  0 siblings, 2 replies; 51+ messages in thread
From: Ian Kumlien @ 2020-07-25  1:03 UTC (permalink / raw)
  To: intel-wired-lan

On Sat, Jul 25, 2020 at 2:45 AM Alexander Duyck
<alexander.duyck@gmail.com> wrote:
>
> On Fri, Jul 24, 2020 at 4:08 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> >
> > On Sat, Jul 25, 2020 at 12:49 AM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > >
> > > On Sat, Jul 25, 2020 at 12:41 AM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > > >
> > > > On Fri, Jul 24, 2020 at 11:51 PM Alexander Duyck
> > > > <alexander.duyck@gmail.com> wrote:
> > > > >
> > > > > On Fri, Jul 24, 2020 at 2:14 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> >
> > [--8<--]
> >
> > > > > > > >
> > > > > > > > And:
> > > > > > > > lspci -t
> > > > > > > > -[0000:00]-+-00.0
> > > > > > > >            +-00.2
> > > > > > > >            +-01.0
> > > > > > > >            +-01.2-[01-07]----00.0-[02-07]--+-03.0-[03]----00.0
> > > > > > >
> > > > > > > I think I now know what patch broke things for you. It is most likely
> > > > > > > this one that enabled ASPM on devices behind bridges:
> > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci.git/commit/?h=next&id=66ff14e59e8a30690755b08bc3042359703fb07a
> > > > > >
> > > > > > Ah, yes, correct
> > > > > >
> > > > > > > My advice would be to revert that patch and see if it resolves the
> > > > > > > issue for you.
> > > > > >
> > > > > > Could do that yes =)
> > > > > >
> > > > > > I'm mainly looking for a more generic solution...
> > > > >
> > > > > That would be the generic solution. The patch has obviously broken
> > > > > things so we need to address the issues. The immediate solution is to
> > > > > revert it, but the more correct solution may be to do something like
> > > > > add an allowlist for the cases where enabling ASPM will not harm
> > > > > system performance.
> > > >
> > > > more like a generic solution like the one you mention below where we
> > > > get the best of both worlds... =)
> > > >
> > > > > > > Device 3:00.0 is your i211 gigabit network controller. Notice you have
> > > > > > > a bridge between it and the root complex. This can be problematic as I
> > > > > > > believe the main reason for the code that was removed in the patch is
> > > > > > > that wakeups can end up being serialized if all of the links are down
> > > > > > > or you could end up with one of the other devices on the bridge
> > > > > > > utilizing the PCIe link an reducing the total throughput, especially
> > > > > > > if you have the link to the root complex also taking part in power
> > > > > > > management. Starting at the root complex it looks like you have the
> > > > > > > link between the bridge and the PCIe switch. It is running L1 enabled
> > > > > > > with a 32us time for it to reestablish link according to the root
> > > > > > > complex side (00:01.2->1:00.0). The next link is the switch to the
> > > > > > > i211 which is 2:03.0 -> 3:00.0. The interesting bit here is that the
> > > > > > > bridge says it only needs 32us while the NIC is saying it will need
> > > > > > > 64us. That upper bound is already a pretty significant value, however
> > > > > > > you have two links to contend with so in reality you are looking at
> > > > > > > something like 96us to bring up both links if they are brought up
> > > > > > > serially.
> > > > > >
> > > > > > hummm... Interesting... I have never managed to parse that lspci thing
> > > > > > properly...
> > > > >
> > > > > Actually I parsed it a bit incorrectly too.
> > > > >
> > > > > The i211 lists that it only supports up to 64us maximum delay in L1
> > > > > wakeup latency. The switch is advertising 32us delay to come out of L1
> > > > > on both the upstream and downstream ports. As such the link would be
> > > > > considered marginal with L1 enabled and so it should be disabled.
> > > > >
> > > > > > It is also interesting that the realtek card seems to be on the same link then?
> > > > > > With ASPM disabled, I wonder if that is due to the setpci command or
> > > > > > if it was disabled before..
> > > > > > (playing with setpci makes no difference but it might require a reboot.. )
> > > > >
> > > > > Are you using the same command you were using for the i211? Did you
> > > > > make sure to update the offset since the PCIe configuration block
> > > > > starts at a different offset? Also you probably need to make sure to
> > > > > only try to update function 0 of the device since I suspect the other
> > > > > functions won't have any effect.
> > > >
> > > > Ah, no, i only toggled the i211 to see if that's what caused the ASPM
> > > > to be disabled...
> > > >
> > > > But it seems it isn't -- will have to reboot to verify though
> > > >
> > > > > > > When you consider that you are using a Gigabit Ethernet connection
> > > > > > > that is moving data at roughly 1000 bits per microsecond, or 125 bytes
> > > > > > > per microsecond. At that rate we should have roughly 270us worth of
> > > > > > > packets we can buffer before we are forced to start dropping packets
> > > > > > > assuming the device is configured with a default 34K Rx buffer. As
> > > > > > > such I am not entirely sure ASPM is the only issue we have here. I
> > > > > > > assume you may also have CPU C states enabled as well? By any chance
> > > > > > > do you have C6 or deeper sleep states enabled on the system? If so
> > > > > > > that might be what is pushing us into the issues that you were seeing.
> > > > > > > Basically we are seeing something that is causing the PCIe to stall
> > > > > > > for over 270us. My thought is that it is likely a number of factors
> > > > > > > where we have too many devices sleeping and as a result the total
> > > > > > > wakeup latency is likely 300us or more resulting in dropped packets.
> > > > > >
> > > > > > It seems like I only have C2 as max...
> > > > > >
> > > > > > grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name
> > > > > > /sys/devices/system/cpu/cpu0/cpuidle/state0/name:POLL
> > > > > > /sys/devices/system/cpu/cpu0/cpuidle/state1/name:C1
> > > > > > /sys/devices/system/cpu/cpu0/cpuidle/state2/name:C2
> > > > > >
> > > > > > Anyway, we should bring this back to the mailing list
> > > > >
> > > > > That's fine. I assumed you didn't want to post the lspci to the
> > > > > mailing list as it might bounce for being too large.
> > > >
> > > > Good thinking, but it was actually a slip :/
> > > >
> > > > > So a generic solution for this would be to have a function that would
> > > > > scan the PCIe bus and determine the total L1 and L0s exit latency. If
> > > > > a device advertises an acceptable ASPM power state exit latency and we
> > > > > have met or exceeded that we should be disabling that ASPM feature for
> > > > > the device.
> > > >
> > > > Yeah, since I'm on vacation I'll actually see if I can look in to that!
> > > > (I mean, I'm not that used to these kinds of things but if my messing
> > > > around inspires someone
> > > > or if noone else is working on it, then... what the hey ;) )
> > >
> > > Uhm... so, in the function that determines latency they only do MAX
> > >
> > > Ie:
> > > static void pcie_aspm_check_latency(struct pci_dev *endpoint)
> > > {
> > > ...
> > >                 latency = max_t(u32, link->latency_up.l1, link->latency_dw.l1);
> > > ---
> > >
> > > I just want to see if I'm understanding you right, is it correct that
> > > the latency should be:
> > > a.up + b.dw + b.up + c.dw
> > >
> > > for a (root?) to go through b (bridge/switch?) to c (device)
>
> Actually I think it is max(a.dw, b.up) + max(b.dw, a.up). Basically
> what you want is the maximum time to bring the link up so technically
> you only have 2 links so you just have to add up the maximum time to
> create each link.

Ah so it's not cumulative per link, only max value on one, got it!
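(So, if I follow, with a = 00:01.2, b = the switch and c = the i211: bringing up
link a<->b costs max(a.dw, b.up) and link b<->c costs max(b.dw, c.up), and the
96us mentioned earlier was just the serial worst case of adding a 32us and a
64us bring-up back to back -- before the 64us turned out to be the i211's
acceptable latency rather than an exit latency.)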

> > Also, we only disabled L0, which isn't counted as a total at all, it
> > only checks each side.
>
> Not sure what you mean here. L0 is the link fully powered on. The two
> link states we have to worry about are L0s and L1 as those involve
> various states of power-down. The L1 latency is the nasty one as that
> basically involves fully powering down the link and requires time for
> the link to be reestablished.

we basically did the &= ~ASPM_STATE_L0S - is the S indicative of something?

> > Since pcie is serial and judging from your previous statements I
> > assume that the max statement is a bug.
> > I also assume that l0 counts, and should be dealt with the same way
> > and it should also be cumulative...
>
> That latency check looks like it would be for a single link. Not each
> link in the chain.

Yes, it checks each link in the chain, which is incorrect, it's actually the
cumulative latency that is important... Well... according to what I have
been able to gather from various documents anyway ;)

> > The question becomes, is this latency from root? or is it "one step"?
>
> Actually the function is doing both. I had to reread the spec.
> Basically the switch is supposed to start trying to bring up the other
> links as soon as it detects that we are trying to bring up the link.
> In theory this is only supposed to add about 1us. So in theory the
> total bring-up time should be 33us.

Ah ok, thanks, that answers another question in the chain ;)
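(So for this chain: 32us for the slowest single link plus 1us for the one
switch in between, i.e. roughly 33us, which is comfortably under the 64us the
i211 says it can tolerate.)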

> > Also they add one microsecond but that looks like it should be
> > parent.l0.up + link.l0.dw latency values
>
> Yes, the 1us is the value I reference above. Basically the assumption
> is that as soon as one link starts retraining it should start working
> on the other ones so the serialization penalty is only supposed to be
> 1us.

AH!

> > So are my assumptions correct that the serial nature means that all
> > latenies stack?
> >
> > l0 is done first, so latency is actually l0 + l1 as max latency? (per side)
>
> I don't think the L0s latency needs to be added if that is what you
> are asking. Basically you either go from L0s->L0 or L1->L0. There is
> no jumping between L0s and L1.

Ok!

> Something still isn't adding up with all this as the latency shouldn't
> be enough to trigger buffer overruns. I wonder if we don't have
> something that is misreporting the actual L1 wakeup latency. One thing
> that I notice is that the link between the root complex and the PCIe
> switch seems to have some sort of electrical issue. If you look at the
> section from the upstream side of the switch:
> LnkCap: Port #0, Speed 16GT/s, Width x8, ASPM L1, Exit Latency L1 <32us
> ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
> LnkCtl: ASPM L1 Enabled; Disabled- CommClk+
> ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
> LnkSta: Speed 16GT/s (ok), Width x4 (downgraded)
> TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
>
> One thing that catches my eye is that it is only linked at x4 when
> both sides are listing themselves as x8. Do you know if this has ever
> linked at x8 or has it always been x4? With this being a Gen 4 x8
> connection it would normally take a little while to establish a link,
> but with it having to fail down to a x4 that would add extra time and
> likely push it out of the expected exit latency. Also I notice that
> there are mentions of lane errors in the config, however I suspect
> those are Gen 4 features that I am not familiar with so I don't know
> if those are normal. It might be interesting to reboot and see if the
> link goes back to a x8 and if the lane errors clear at some point. If
> it does then we might want to disable ASPM on the upstream port of the
> switch since I have seen ASPM cause link downgrades and that may be
> what is occurring here.

Humm... And that would mean disabling ASPM completely to test?

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [Intel-wired-lan] NAT performance issue 944mbit -> ~40mbit
  2020-07-25  1:03                                                   ` Ian Kumlien
@ 2020-07-25 13:53                                                     ` Ian Kumlien
  2020-07-25 17:43                                                       ` Alexander Duyck
  2020-07-25 17:30                                                     ` Alexander Duyck
  1 sibling, 1 reply; 51+ messages in thread
From: Ian Kumlien @ 2020-07-25 13:53 UTC (permalink / raw)
  To: intel-wired-lan

On Sat, Jul 25, 2020 at 3:03 AM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> On Sat, Jul 25, 2020 at 2:45 AM Alexander Duyck
> <alexander.duyck@gmail.com> wrote:
> > On Fri, Jul 24, 2020 at 4:08 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:

[--8<--]

> > Actually I think it is max(a.dw, b.up) + max(b.dw, a.up). Basically
> > what you want is the maximum time to bring the link up so technically
> > you only have 2 links so you just have to add up the maximum time to
> > create each link.
>
> Ah so it's not cumulative per link, only max value on one, got it!
>
> > > Also, we only disabled L0, which isn't counted as a total at all, it
> > > only checks each side.
> >
> > Not sure what you mean here. L0 is the link fully powered on. The two
> > link states we have to worry about are L0s and L1 as those involve
> > various states of power-down. The L1 latency is the nasty one as that
> > basically involves fully powering down the link and requires time for
> > the link to be reestablished.
>
> we basically did the &= ~ASPM_STATE_L0S - is the S indicative of something?
>
> > > Since pcie is serial and judging from your previous statements I
> > > assume that the max statement is a bug.
> > > I also assume that l0 counts, and should be dealt with the same way
> > > and it should also be cumulative...
> >
> > That latency check looks like it would be for a single link. Not each
> > link in the chain.
>
> Yes, it checks each link in the chain, which is incorrect, it's actually the
> cumulative latency that is important... Well... according to what I have
> been able to gather from various documents anyway ;)
>
> > > The question becomes, is this latency from root? or is it "one step"?
> >
> > Actually the function is doing both. I had to reread the spec.
> > Basically the switch is supposed to start trying to bring up the other
> > links as soon as it detects that we are trying to bring up the link.
> > In theory this is only supposed to add about 1us. So in theory the
> > total bring-up time should be 33us.
>
> Ah ok, thanks, that answers another question in the chain ;)

So, then this is what should be done:
diff --git a/drivers/pci/pcie/aspm.c b/drivers/pci/pcie/aspm.c
index b17e5ffd31b1..2b8f7ea7f7bc 100644
--- a/drivers/pci/pcie/aspm.c
+++ b/drivers/pci/pcie/aspm.c
@@ -434,7 +434,7 @@ static void pcie_get_aspm_reg(struct pci_dev *pdev,

 static void pcie_aspm_check_latency(struct pci_dev *endpoint)
 {
-       u32 latency, l1_switch_latency = 0;
+       u32 latency, l1_max_latency = 0, l1_switch_latency = 0;
        struct aspm_latency *acceptable;
        struct pcie_link_state *link;

@@ -470,8 +470,9 @@ static void pcie_aspm_check_latency(struct pci_dev
*endpoint)
                 * substate latencies (and hence do not do any check).
                 */
                latency = max_t(u32, link->latency_up.l1, link->latency_dw.l1);
+               l1_max_latency = max_t(u32, latency, l1_max_latency)
                if ((link->aspm_capable & ASPM_STATE_L1) &&
-                   (latency + l1_switch_latency > acceptable->l1))
+                   (l1_max_latency + l1_switch_latency > acceptable->l1))
                        link->aspm_capable &= ~ASPM_STATE_L1;
                l1_switch_latency += 1000;
---

for l1 latency... I do however find it odd that you don't disable it
on the endpoint but on the
potential bridge/switch/root you're on - shouldn't the disable be on
endpoint->bus->self->link_state?

Anyway, this should handle any latency bumps... and could be done
differently reusing latency and:
latency = max_t(u32, latency, max_t(u32, link->latency_up.l1,
link->latency_dw.l1));

but kept it this way for legibility...
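To sanity-check my own reading of it, here is a small userspace sketch of what
the loop computes with the change applied, walking from the endpoint towards
the root complex. The device names and latency numbers below are made up for
illustration only; the max()/l1_switch_latency handling is the part that
mirrors the patch (the kernel keeps these values in nanoseconds, which is where
the += 1000 per switch comes from):

#include <stdio.h>

/* One link in the chain, with the L1 exit latency advertised by each
 * end of that link, in nanoseconds. */
struct hop {
	const char *name;
	unsigned int latency_up;
	unsigned int latency_dw;
};

int main(void)
{
	/* Hypothetical values, ordered from the endpoint towards the root. */
	struct hop chain[] = {
		{ "2:03.0 <-> 3:00.0", 32000,  8000 },
		{ "00:01.2 <-> 1:00.0", 32000, 32000 },
	};
	unsigned int acceptable_l1 = 64000;	/* endpoint's acceptable L1 latency */
	unsigned int l1_max_latency = 0, l1_switch_latency = 0;
	unsigned int i;

	for (i = 0; i < sizeof(chain) / sizeof(chain[0]); i++) {
		unsigned int latency = chain[i].latency_up > chain[i].latency_dw ?
				       chain[i].latency_up : chain[i].latency_dw;

		/* Track the slowest link seen so far instead of only the
		 * current link, which is what the patch changes. */
		if (latency > l1_max_latency)
			l1_max_latency = latency;

		printf("%s: %u ns worst link + %u ns switch penalty vs %u ns -> %s\n",
		       chain[i].name, l1_max_latency, l1_switch_latency, acceptable_l1,
		       l1_max_latency + l1_switch_latency > acceptable_l1 ?
				"L1 would be disabled here" : "L1 stays enabled");

		l1_switch_latency += 1000;	/* every switch we pass adds ~1us */
	}
	return 0;
}

It builds with plain gcc and just prints where in the chain the check would
trip for whatever numbers you feed it.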

But for L0 -- been looking at it and I wonder... from what I can see
it's cumulative for the link, but L0S seems
different and is perhaps not quite the same...

[--8<--]

> > Something still isn't adding up with all this as the latency shouldn't
> > be enough to trigger buffer overruns. I wonder if we don't have
> > something that is misreporting the actual L1 wakeup latency. One thing
> > that I notice is that the link between the root complex and the PCIe
> > switch seems to have some sort of electrical issue. If you look at the
> > section from the upstream side of the switch:
> > LnkCap: Port #0, Speed 16GT/s, Width x8, ASPM L1, Exit Latency L1 <32us
> > ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
> > LnkCtl: ASPM L1 Enabled; Disabled- CommClk+
> > ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
> > LnkSta: Speed 16GT/s (ok), Width x4 (downgraded)
> > TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
> >
> > One thing that catches my eye is that it is only linked at x4 when
> > both sides are listing themselves as x8. Do you know if this has ever
> > linked at x8 or has it always been x4? With this being a Gen 4 x8
> > connection it would normally take a little while to establish a link,
> > but with it having to fail down to a x4 that would add extra time and
> > likely push it out of the expected exit latency. Also I notice that
> > there are mentions of lane errors in the config, however I suspect
> > those are Gen 4 features that I am not familiar with so I don't know
> > if those are normal. It might be interesting to reboot and see if the
> > link goes back to a x8 and if the lane errors clear at some point. If
> > it does then we might want to disable ASPM on the upstream port of the
> > switch since I have seen ASPM cause link downgrades and that may be
> > what is occurring here.
>
> Humm... And that would mean disabling ASPM completely to test?

^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [Intel-wired-lan] NAT performance issue 944mbit -> ~40mbit
  2020-07-25  1:03                                                   ` Ian Kumlien
  2020-07-25 13:53                                                     ` Ian Kumlien
@ 2020-07-25 17:30                                                     ` Alexander Duyck
  2020-07-25 18:52                                                       ` Ian Kumlien
  1 sibling, 1 reply; 51+ messages in thread
From: Alexander Duyck @ 2020-07-25 17:30 UTC (permalink / raw)
  To: intel-wired-lan

On Fri, Jul 24, 2020 at 6:03 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
>
> On Sat, Jul 25, 2020 at 2:45 AM Alexander Duyck
> <alexander.duyck@gmail.com> wrote:
> >
> > On Fri, Jul 24, 2020 at 4:08 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > >
> > > On Sat, Jul 25, 2020 at 12:49 AM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > > >
> > > > On Sat, Jul 25, 2020 at 12:41 AM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > > > >
> > > > > On Fri, Jul 24, 2020 at 11:51 PM Alexander Duyck
> > > > > <alexander.duyck@gmail.com> wrote:
> > > > > >
> > > > > > On Fri, Jul 24, 2020 at 2:14 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > >
> > > [--8<--]
> > >
> > > > > > > > >
> > > > > > > > > And:
> > > > > > > > > lspci -t
> > > > > > > > > -[0000:00]-+-00.0
> > > > > > > > >            +-00.2
> > > > > > > > >            +-01.0
> > > > > > > > >            +-01.2-[01-07]----00.0-[02-07]--+-03.0-[03]----00.0
> > > > > > > >
> > > > > > > > I think I now know what patch broke things for you. It is most likely
> > > > > > > > this one that enabled ASPM on devices behind bridges:
> > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci.git/commit/?h=next&id=66ff14e59e8a30690755b08bc3042359703fb07a
> > > > > > >
> > > > > > > Ah, yes, correct
> > > > > > >
> > > > > > > > My advice would be to revert that patch and see if it resolves the
> > > > > > > > issue for you.
> > > > > > >
> > > > > > > Could do that yes =)
> > > > > > >
> > > > > > > I'm mainly looking for a more generic solution...
> > > > > >
> > > > > > That would be the generic solution. The patch has obviously broken
> > > > > > things so we need to address the issues. The immediate solution is to
> > > > > > revert it, but the more correct solution may be to do something like
> > > > > > add an allowlist for the cases where enabling ASPM will not harm
> > > > > > system performance.
> > > > >
> > > > > more like a generic solution like the one you mention below where we
> > > > > get the best of both worlds... =)
> > > > >
> > > > > > > > Device 3:00.0 is your i211 gigabit network controller. Notice you have
> > > > > > > > a bridge between it and the root complex. This can be problematic as I
> > > > > > > > believe the main reason for the code that was removed in the patch is
> > > > > > > > that wakeups can end up being serialized if all of the links are down
> > > > > > > > or you could end up with one of the other devices on the bridge
> > > > > > > > utilizing the PCIe link an reducing the total throughput, especially
> > > > > > > > if you have the link to the root complex also taking part in power
> > > > > > > > management. Starting at the root complex it looks like you have the
> > > > > > > > link between the bridge and the PCIe switch. It is running L1 enabled
> > > > > > > > with a 32us time for it to reestablish link according to the root
> > > > > > > > complex side (00:01.2->1:00.0). The next link is the switch to the
> > > > > > > > i211 which is 2:03.0 -> 3:00.0. The interesting bit here is that the
> > > > > > > > bridge says it only needs 32us while the NIC is saying it will need
> > > > > > > > 64us. That upper bound is already a pretty significant value, however
> > > > > > > > you have two links to contend with so in reality you are looking at
> > > > > > > > something like 96us to bring up both links if they are brought up
> > > > > > > > serially.
> > > > > > >
> > > > > > > hummm... Interesting... I have never managed to parse that lspci thing
> > > > > > > properly...
> > > > > >
> > > > > > Actually I parsed it a bit incorrectly too.
> > > > > >
> > > > > > The i211 lists that it only supports up to 64us maximum delay in L1
> > > > > > wakeup latency. The switch is advertising 32us delay to come out of L1
> > > > > > on both the upstream and downstream ports. As such the link would be
> > > > > > considered marginal with L1 enabled and so it should be disabled.
> > > > > >
> > > > > > > It is also interesting that the realtek card seems to be on the same link then?
> > > > > > > With ASPM disabled, I wonder if that is due to the setpci command or
> > > > > > > if it was disabled before..
> > > > > > > (playing with setpci makes no difference but it might require a reboot.. )
> > > > > >
> > > > > > Are you using the same command you were using for the i211? Did you
> > > > > > make sure to update the offset since the PCIe configuration block
> > > > > > starts at a different offset? Also you probably need to make sure to
> > > > > > only try to update function 0 of the device since I suspect the other
> > > > > > functions won't have any effect.
> > > > >
> > > > > Ah, no, i only toggled the i211 to see if that's what caused the ASPM
> > > > > to be disabled...
> > > > >
> > > > > But it seems it isn't -- will have to reboot to verify though
> > > > >
> > > > > > > > When you consider that you are using a Gigabit Ethernet connection
> > > > > > > > that is moving data at roughly 1000 bits per microsecond, or 125 bytes
> > > > > > > > per microsecond. At that rate we should have roughly 270us worth of
> > > > > > > > packets we can buffer before we are forced to start dropping packets
> > > > > > > > assuming the device is configured with a default 34K Rx buffer. As
> > > > > > > > such I am not entirely sure ASPM is the only issue we have here. I
> > > > > > > > assume you may also have CPU C states enabled as well? By any chance
> > > > > > > > do you have C6 or deeper sleep states enabled on the system? If so
> > > > > > > > that might be what is pushing us into the issues that you were seeing.
> > > > > > > > Basically we are seeing something that is causing the PCIe to stall
> > > > > > > > for over 270us. My thought is that it is likely a number of factors
> > > > > > > > where we have too many devices sleeping and as a result the total
> > > > > > > > wakeup latency is likely 300us or more resulting in dropped packets.
> > > > > > >
> > > > > > > It seems like I only have C2 as max...
> > > > > > >
> > > > > > > grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name
> > > > > > > /sys/devices/system/cpu/cpu0/cpuidle/state0/name:POLL
> > > > > > > /sys/devices/system/cpu/cpu0/cpuidle/state1/name:C1
> > > > > > > /sys/devices/system/cpu/cpu0/cpuidle/state2/name:C2
> > > > > > >
> > > > > > > Anyway, we should bring this back to the mailing list
> > > > > >
> > > > > > That's fine. I assumed you didn't want to post the lspci to the
> > > > > > mailing list as it might bounce for being too large.
> > > > >
> > > > > Good thinking, but it was actually a slip :/
> > > > >
> > > > > > So a generic solution for this would be to have a function that would
> > > > > > scan the PCIe bus and determine the total L1 and L0s exit latency. If
> > > > > > a device advertises an acceptable ASPM power state exit latency and we
> > > > > > have met or exceeded that we should be disabling that ASPM feature for
> > > > > > the device.
> > > > >
> > > > > Yeah, since I'm on vacation I'll actually see if I can look in to that!
> > > > > (I mean, I'm not that used to these kinds of things but if my messing
> > > > > around inspires someone
> > > > > or if noone else is working on it, then... what the hey ;) )
> > > >
> > > > Uhm... so, in the function that determines latency they only do MAX
> > > >
> > > > Ie:
> > > > static void pcie_aspm_check_latency(struct pci_dev *endpoint)
> > > > {
> > > > ...
> > > >                 latency = max_t(u32, link->latency_up.l1, link->latency_dw.l1);
> > > > ---
> > > >
> > > > I just want to see if I'm understanding you right, is it correct that
> > > > the latency should be:
> > > > a.up + b.dw + b.up + c.dw
> > > >
> > > > for a (root?) to go through b (bridge/switch?) to c (device)
> >
> > Actually I think it is max(a.dw, b.up) + max(b.dw, a.up). Basically
> > what you want is the maximum time to bring the link up so technically
> > you only have 2 links so you just have to add up the maximum time to
> > create each link.
>
> Ah so it's not cumulative per link, only max value on one, got it!
>
> > > Also, we only disabled L0, which isn't counted as a total at all, it
> > > only checks each side.
> >
> > Not sure what you mean here. L0 is the link fully powered on. The two
> > link states we have to worry about are L0s and L1 as those involve
> > various states of power-down. The L1 latency is the nasty one as that
> > basically involves fully powering down the link and requires time for
> > the link to be reestablished.
>
> we basically did the &= ~ASPM_STATE_L0S - is the S indicative of something?

So the command I gave you was basically clearing both the L1 and L0S
states. It disabled ASPM entirely. However it looks like only L1 is
supported on your platform.
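(For reference, and as far as I recall the spec: the ASPM control field is just
the low two bits of the Link Control register, bit 0 gating L0s entry and bit 1
gating L1 entry, so clearing both is what turns ASPM fully off for that port.
The "s" in L0s is usually read as "standby"; it is the cheap, per-direction
idle state, while L1 takes the whole link down.)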

> > > Since pcie is serial and judging from your previous statements I
> > > assume that the max statement is a bug.
> > > I also assume that l0 counts, and should be dealt with the same way
> > > and it should also be cumulative...
> >
> > That latency check looks like it would be for a single link. Not each
> > link in the chain.
>
> Yes, it checks each link in the chain, which is incorrect, it's actually the
> cumulative latency that is important... Well... according to what I have
> been able to gather from various documents anyway ;)

Right. We would need to determine the latency of the entire chain. So
that would effectively be the max for any one link plus 1us for every
switch it has to pass through.
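(As a formula, roughly: path L1 exit time ~= max over all links of
max(latency_up, latency_dw), plus 1us per switch traversed, and that total is
what has to stay within the endpoint's advertised acceptable L1 latency.)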

> > > The question becomes, is this latency from root? or is it "one step"?
> >
> > Actually the function is doing both. I had to reread the spec.
> > Basically the switch is supposed to start trying to bring up the other
> > links as soon as it detects that we are trying to bring up the link.
> > In theory this is only supposed to add about 1us. So in theory the
> > total bring-up time should be 33us.
>
> Ah ok, thanks, that answers another question in the chain ;)
>
> > > Also they add one microsecond but that looks like it should be
> > > parent.l0.up + link.l0.dw latency values
> >
> > Yes, the 1us is the value I reference above. Basically the assumption
> > is that as soon as one link starts retraining it should start working
> > on the other ones so the serialization penalty is only supposed to be
> > 1us.
>
> AH!
>
> > > So are my assumptions correct that the serial nature means that all
> > > latenies stack?
> > >
> > > l0 is done first, so latency is actually l0 + l1 as max latency? (per side)
> >
> > I don't think the L0s latency needs to be added if that is what you
> > are asking. Basically you either go from L0s->L0 or L1->L0. There is
> > no jumping between L0s and L1.
>
> Ok!
>
> > Something still isn't adding up with all this as the latency shouldn't
> > be enough to trigger buffer overruns. I wonder if we don't have
> > something that is misreporting the actual L1 wakeup latency. One thing
> > that I notice is that the link between the root complex and the PCIe
> > switch seems to have some sort of electrical issue. If you look at the
> > section from the upstream side of the switch:
> > LnkCap: Port #0, Speed 16GT/s, Width x8, ASPM L1, Exit Latency L1 <32us
> > ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
> > LnkCtl: ASPM L1 Enabled; Disabled- CommClk+
> > ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
> > LnkSta: Speed 16GT/s (ok), Width x4 (downgraded)
> > TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
> >
> > One thing that catches my eye is that it is only linked at x4 when
> > both sides are listing themselves as x8. Do you know if this has ever
> > linked at x8 or has it always been x4? With this being a Gen 4 x8
> > connection it would normally take a little while to establish a link,
> > but with it having to fail down to a x4 that would add extra time and
> > likely push it out of the expected exit latency. Also I notice that
> > there are mentions of lane errors in the config, however I suspect
> > those are Gen 4 features that I am not familiar with so I don't know
> > if those are normal. It might be interesting to reboot and see if the
> > link goes back to a x8 and if the lane errors clear at some point. If
> > it does then we might want to disable ASPM on the upstream port of the
> > switch since I have seen ASPM cause link downgrades and that may be
> > what is occurring here.
>
> Humm... And that would mean disabling ASPM completely to test?

Maybe. Right after boot there is a good likelihood that the link will
be the most healthy, so it is likely to still be x8 if that is the
true width of the link. If we are seeing it degrade over time that
would be a sign that maybe we should disable L1 on the link between
the switch and the root complex instead of messing with the NICs.
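(If you want to rule ASPM out entirely for a test, booting with pcie_aspm=off,
or writing "performance" to /sys/module/pcie_aspm/parameters/policy, should be
enough, assuming the firmware hasn't locked the ASPM configuration away from
the kernel.)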

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [Intel-wired-lan] NAT performance issue 944mbit -> ~40mbit
  2020-07-25 13:53                                                     ` Ian Kumlien
@ 2020-07-25 17:43                                                       ` Alexander Duyck
  2020-07-25 18:56                                                         ` Ian Kumlien
  0 siblings, 1 reply; 51+ messages in thread
From: Alexander Duyck @ 2020-07-25 17:43 UTC (permalink / raw)
  To: intel-wired-lan

On Sat, Jul 25, 2020 at 6:53 AM Ian Kumlien <ian.kumlien@gmail.com> wrote:
>
> On Sat, Jul 25, 2020 at 3:03 AM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > On Sat, Jul 25, 2020 at 2:45 AM Alexander Duyck
> > <alexander.duyck@gmail.com> wrote:
> > > On Fri, Jul 24, 2020 at 4:08 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
>
> [--8<--]
>
> > > Actually I think it is max(a.dw, b.up) + max(b.dw, a.up). Basically
> > > what you want is the maximum time to bring the link up so technically
> > > you only have 2 links so you just have to add up the maximum time to
> > > create each link.
> >
> > Ah so it's not cumulative per link, only max value on one, got it!
> >
> > > > Also, we only disabled L0, which isn't counted as a total at all, it
> > > > only checks each side.
> > >
> > > Not sure what you mean here. L0 is the link fully powered on. The two
> > > link states we have to worry about are L0s and L1 as those involve
> > > various states of power-down. The L1 latency is the nasty one as that
> > > basically involves fully powering down the link and requires time for
> > > the link to be reestablished.
> >
> > we basically did the &= ~ASPM_STATE_L0S - is the S indicative of something?
> >
> > > > Since pcie is serial and judging from your previous statements I
> > > > assume that the max statement is a bug.
> > > > I also assume that l0 counts, and should be dealt with the same way
> > > > and it should also be cumulative...
> > >
> > > That latency check looks like it would be for a single link. Not each
> > > link in the chain.
> >
> > Yes, it checks each link in the chain, which is incorrect, it's actually the
> > cumulative latency that is important... Well... according to what I have
> > been able to gather from various documents anyway ;)
> >
> > > > The question becomes, is this latency from root? or is it "one step"?
> > >
> > > Actually the function is doing both. I had to reread the spec.
> > > Basically the switch is supposed to start trying to bring up the other
> > > links as soon as it detects that we are trying to bring up the link.
> > > In theory this is only supposed to add about 1us. So in theory the
> > > total bring-up time should be 33us.
> >
> > Ah ok, thanks, that answers another question in the chain ;)
>
> So, then this is what should be done:
> diff --git a/drivers/pci/pcie/aspm.c b/drivers/pci/pcie/aspm.c
> index b17e5ffd31b1..2b8f7ea7f7bc 100644
> --- a/drivers/pci/pcie/aspm.c
> +++ b/drivers/pci/pcie/aspm.c
> @@ -434,7 +434,7 @@ static void pcie_get_aspm_reg(struct pci_dev *pdev,
>
>  static void pcie_aspm_check_latency(struct pci_dev *endpoint)
>  {
> -       u32 latency, l1_switch_latency = 0;
> +       u32 latency, l1_max_latency = 0, l1_switch_latency = 0;
>         struct aspm_latency *acceptable;
>         struct pcie_link_state *link;
>
> @@ -470,8 +470,9 @@ static void pcie_aspm_check_latency(struct pci_dev
> *endpoint)
>                  * substate latencies (and hence do not do any check).
>                  */
>                 latency = max_t(u32, link->latency_up.l1, link->latency_dw.l1);
> +               l1_max_latency = max_t(u32, latency, l1_max_latency)
>                 if ((link->aspm_capable & ASPM_STATE_L1) &&
> -                   (latency + l1_switch_latency > acceptable->l1))
> +                   (l1_max_latency + l1_switch_latency > acceptable->l1))
>                         link->aspm_capable &= ~ASPM_STATE_L1;
>                 l1_switch_latency += 1000;
> ---

This makes sense to me. You might want to submit it to the linux-pci
mailing list.

> for l1 latency... I do however find it odd that you don't disable it
> on the endpoint but on the
> potential bridge/switch/root you're on - shouldn't the disable be on
> endpoint->bus->self->link_state?

I think the idea is that we want to leave the leaves of the root
complex with ASPM enabled and disable it as we approach the trunk as
it is more likely that we are going to see more frequent wakeups as we
approach the root complex, or at least that would be my theory anyway.
Basically the closer you get to the root complex the more likely you
are to have more devices making use of the path so the more likely it
is to have to stay on anyway.

> Anyway, this should handle any latency bumps... and could be done
> differently reusing latency and:
> latency = max_t(u32, latency, max_t(u32, link->latency_up.l1,
> link->latency_dw.l1));
>
> but kept it this way for legibility...
>
> But for L0 -- been looking at it and I wonder... from what I can see
> it's cumulative for the link, but L0S seems
> different and is perhaps not quite the same...

You have mentioned L0 several times now and I wonder what you are
referring to. L0 is the fully powered on state. That is the state we
are trying to get back to. L0s and L1 are the lower power states that
we have to get out of with L1 being a much more complicated state to
get out of as we shut down the clocks and link if I recall and have to
reestablish both before we can resume operation.
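(If it helps put numbers on it: the exit-latency encodings in the Link
Capabilities register only run from <64ns up to >4us for L0s, but from <1us up
to >64us for L1, which is why L1 is the state that can realistically eat a
~64us acceptable-latency budget while L0s rarely matters.)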

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [Intel-wired-lan] NAT performance issue 944mbit -> ~40mbit
  2020-07-25 17:30                                                     ` Alexander Duyck
@ 2020-07-25 18:52                                                       ` Ian Kumlien
  0 siblings, 0 replies; 51+ messages in thread
From: Ian Kumlien @ 2020-07-25 18:52 UTC (permalink / raw)
  To: intel-wired-lan

On Sat, Jul 25, 2020 at 7:30 PM Alexander Duyck
<alexander.duyck@gmail.com> wrote:
>
> On Fri, Jul 24, 2020 at 6:03 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> >
> > On Sat, Jul 25, 2020 at 2:45 AM Alexander Duyck
> > <alexander.duyck@gmail.com> wrote:
> > >
> > > On Fri, Jul 24, 2020 at 4:08 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > > >
> > > > On Sat, Jul 25, 2020 at 12:49 AM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > > > >
> > > > > On Sat, Jul 25, 2020 at 12:41 AM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > > > > >
> > > > > > On Fri, Jul 24, 2020 at 11:51 PM Alexander Duyck
> > > > > > <alexander.duyck@gmail.com> wrote:
> > > > > > >
> > > > > > > On Fri, Jul 24, 2020 at 2:14 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > > >
> > > > [--8<--]
> > > >
> > > > > > > > > >
> > > > > > > > > > And:
> > > > > > > > > > lspci -t
> > > > > > > > > > -[0000:00]-+-00.0
> > > > > > > > > >            +-00.2
> > > > > > > > > >            +-01.0
> > > > > > > > > >            +-01.2-[01-07]----00.0-[02-07]--+-03.0-[03]----00.0
> > > > > > > > >
> > > > > > > > > I think I now know what patch broke things for you. It is most likely
> > > > > > > > > this one that enabled ASPM on devices behind bridges:
> > > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci.git/commit/?h=next&id=66ff14e59e8a30690755b08bc3042359703fb07a
> > > > > > > >
> > > > > > > > Ah, yes, correct
> > > > > > > >
> > > > > > > > > My advice would be to revert that patch and see if it resolves the
> > > > > > > > > issue for you.
> > > > > > > >
> > > > > > > > Could do that yes =)
> > > > > > > >
> > > > > > > > I'm mainly looking for a more generic solution...
> > > > > > >
> > > > > > > That would be the generic solution. The patch has obviously broken
> > > > > > > things so we need to address the issues. The immediate solution is to
> > > > > > > revert it, but the more correct solution may be to do something like
> > > > > > > add an allowlist for the cases where enabling ASPM will not harm
> > > > > > > system performance.
> > > > > >
> > > > > > more like a generic solution like the one you mention below where we
> > > > > > get the best of both worlds... =)
> > > > > >
> > > > > > > > > Device 3:00.0 is your i211 gigabit network controller. Notice you have
> > > > > > > > > a bridge between it and the root complex. This can be problematic as I
> > > > > > > > > believe the main reason for the code that was removed in the patch is
> > > > > > > > > that wakeups can end up being serialized if all of the links are down
> > > > > > > > > or you could end up with one of the other devices on the bridge
> > > > > > > > > utilizing the PCIe link an reducing the total throughput, especially
> > > > > > > > > if you have the link to the root complex also taking part in power
> > > > > > > > > management. Starting at the root complex it looks like you have the
> > > > > > > > > link between the bridge and the PCIe switch. It is running L1 enabled
> > > > > > > > > with a 32us time for it to reestablish link according to the root
> > > > > > > > > complex side (00:01.2->1:00.0). The next link is the switch to the
> > > > > > > > > i211 which is 2:03.0 -> 3:00.0. The interesting bit here is that the
> > > > > > > > > bridge says it only needs 32us while the NIC is saying it will need
> > > > > > > > > 64us. That upper bound is already a pretty significant value, however
> > > > > > > > > you have two links to contend with so in reality you are looking at
> > > > > > > > > something like 96us to bring up both links if they are brought up
> > > > > > > > > serially.
> > > > > > > >
> > > > > > > > hummm... Interesting... I have never managed to parse that lspci thing
> > > > > > > > properly...
> > > > > > >
> > > > > > > Actually I parsed it a bit incorrectly too.
> > > > > > >
> > > > > > > The i211 lists that it only supports up to 64us maximum delay in L1
> > > > > > > wakeup latency. The switch is advertising 32us delay to come out of L1
> > > > > > > on both the upstream and downstream ports. As such the link would be
> > > > > > > considered marginal with L1 enabled and so it should be disabled.
> > > > > > >
> > > > > > > > It is also interesting that the realtek card seems to be on the same link then?
> > > > > > > > With ASPM disabled, I wonder if that is due to the setpci command or
> > > > > > > > if it was disabled before..
> > > > > > > > (playing with setpci makes no difference but it might require a reboot.. )
> > > > > > >
> > > > > > > Are you using the same command you were using for the i211? Did you
> > > > > > > make sure to update the offset since the PCIe configuration block
> > > > > > > starts at a different offset? Also you probably need to make sure to
> > > > > > > only try to update function 0 of the device since I suspect the other
> > > > > > > functions won't have any effect.
> > > > > >
> > > > > > Ah, no, i only toggled the i211 to see if that's what caused the ASPM
> > > > > > to be disabled...
> > > > > >
> > > > > > But it seems it isn't -- will have to reboot to verify though
> > > > > >
> > > > > > > > > When you consider that you are using a Gigabit Ethernet connection
> > > > > > > > > that is moving data at roughly 1000 bits per microsecond, or 125 bytes
> > > > > > > > > per microsecond. At that rate we should have roughly 270us worth of
> > > > > > > > > packets we can buffer before we are forced to start dropping packets
> > > > > > > > > assuming the device is configured with a default 34K Rx buffer. As
> > > > > > > > > such I am not entirely sure ASPM is the only issue we have here. I
> > > > > > > > > assume you may also have CPU C states enabled as well? By any chance
> > > > > > > > > do you have C6 or deeper sleep states enabled on the system? If so
> > > > > > > > > that might be what is pushing us into the issues that you were seeing.
> > > > > > > > > Basically we are seeing something that is causing the PCIe to stall
> > > > > > > > > for over 270us. My thought is that it is likely a number of factors
> > > > > > > > > where we have too many devices sleeping and as a result the total
> > > > > > > > > wakeup latency is likely 300us or more resulting in dropped packets.
> > > > > > > >
> > > > > > > > It seems like I only have C2 as max...
> > > > > > > >
> > > > > > > > grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name
> > > > > > > > /sys/devices/system/cpu/cpu0/cpuidle/state0/name:POLL
> > > > > > > > /sys/devices/system/cpu/cpu0/cpuidle/state1/name:C1
> > > > > > > > /sys/devices/system/cpu/cpu0/cpuidle/state2/name:C2
> > > > > > > >
> > > > > > > > Anyway, we should bring this back to the mailing list
> > > > > > >
> > > > > > > That's fine. I assumed you didn't want to post the lspci to the
> > > > > > > mailing list as it might bounce for being too large.
> > > > > >
> > > > > > Good thinking, but it was actually a slip :/
> > > > > >
> > > > > > > So a generic solution for this would be to have a function that would
> > > > > > > scan the PCIe bus and determine the total L1 and L0s exit latency. If
> > > > > > > a device advertises an acceptable ASPM power state exit latency and we
> > > > > > > have met or exceeded that we should be disabling that ASPM feature for
> > > > > > > the device.
> > > > > >
> > > > > > Yeah, since I'm on vacation I'll actually see if I can look in to that!
> > > > > > (I mean, I'm not that used to these kinds of things but if my messing
> > > > > > around inspires someone
> > > > > > or if noone else is working on it, then... what the hey ;) )
> > > > >
> > > > > Uhm... so, in the function that determines latency they only do MAX
> > > > >
> > > > > Ie:
> > > > > static void pcie_aspm_check_latency(struct pci_dev *endpoint)
> > > > > {
> > > > > ...
> > > > >                 latency = max_t(u32, link->latency_up.l1, link->latency_dw.l1);
> > > > > ---
> > > > >
> > > > > I just want to see if I'm understanding you right, is it correct that
> > > > > the latency should be:
> > > > > a.up + b.dw + b.up + c.dw
> > > > >
> > > > > for a (root?) to go through b (bridge/switch?) to c (device)
> > >
> > > Actually I think it is max(a.dw, b.up) + max(b.dw, a.up). Basically
> > > what you want is the maximum time to bring the link up so technically
> > > you only have 2 links so you just have to add up the maximum time to
> > > create each link.
> >
> > Ah so it's not cumulative per link, only max value on one, got it!
> >
> > > > Also, we only disabled L0, which isn't counted as a total at all, it
> > > > only checks each side.
> > >
> > > Not sure what you mean here. L0 is the link fully powered on. The two
> > > link states we have to worry about are L0s and L1 as those involve
> > > various states of power-down. The L1 latency is the nasty one as that
> > > basically involves fully powering down the link and requires time for
> > > the link to be reestablished.
> >
> > we basically did the &= ~ASPM_STATE_L0S - is the S indicative of something?
>
> So the command I gave you was basically clearing both the L1 and L0S
> states. It disabled ASPM entirely. However it looks like only L1 is
> supported on your platform.

Ah ok, perhaps I missed that somewhere, it looked like L0s was off but
L1 was still on

> > > > Since pcie is serial and judging from your previous statements I
> > > > assume that the max statement is a bug.
> > > > I also assume that l0 counts, and should be dealt with the same way
> > > > and it should also be cumulative...
> > >
> > > That latency check looks like it would be for a single link. Not each
> > > link in the chain.
> >
> > Yes, it checks each link in the chain, which is incorrect, it's actually the
> > cumulative latency that is important... Well... according to what I have
> > been able to gather from various documents anyway ;)
>
> Right. We would need to determine the latency of the entire chain. So
> that would effectively be the max for any one link plus 1us for every
> switch it has to pass through.

Which is what the patch I did actually does =)

Gonna test it soon to see if it handles my case

> > > > The question becomes, is this latency from root? or is it "one step"?
> > >
> > > Actually the function is doing both. I had to reread the spec.
> > > Basically the switch is supposed to start trying to bring up the other
> > > links as soon as it detects that we are trying to bring up the link.
> > > In theory this is only supposed to add about 1us. So in theory the
> > > total bring-up time should be 33us.
> >
> > Ah ok, thanks, that answers another question in the chain ;)
> >
> > > > Also they add one microsecond but that looks like it should be
> > > > parent.l0.up + link.l0.dw latency values
> > >
> > > Yes, the 1us is the value I reference above. Basically the assumption
> > > is that as soon as one link starts retraining it should start working
> > > on the other ones so the serialization penalty is only supposed to be
> > > 1us.
> >
> > AH!
> >
> > > > So are my assumptions correct that the serial nature means that all
> > > > latenies stack?
> > > >
> > > > l0 is done first, so latency is actually l0 + l1 as max latency? (per side)
> > >
> > > I don't think the L0s latency needs to be added if that is what you
> > > are asking. Basically you either go from L0s->L0 or L1->L0. There is
> > > no jumping between L0s and L1.
> >
> > Ok!
> >
> > > Something still isn't adding up with all this as the latency shouldn't
> > > be enough to trigger buffer overruns. I wonder if we don't have
> > > something that is misreporting the actual L1 wakeup latency. One thing
> > > that I notice is that the link between the root complex and the PCIe
> > > switch seems to have some sort of electrical issue. If you look at the
> > > section from the upstream side of the switch:
> > > LnkCap: Port #0, Speed 16GT/s, Width x8, ASPM L1, Exit Latency L1 <32us
> > > ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
> > > LnkCtl: ASPM L1 Enabled; Disabled- CommClk+
> > > ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
> > > LnkSta: Speed 16GT/s (ok), Width x4 (downgraded)
> > > TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
> > >
> > > One thing that catches my eye is that it is only linked at x4 when
> > > both sides are listing themselves as x8. Do you know if this has ever
> > > linked at x8 or has it always been x4? With this being a Gen 4 x8
> > > connection it would normally take a little while to establish a link,
> > > but with it having to fail down to a x4 that would add extra time and
> > > likely push it out of the expected exit latency. Also I notice that
> > > there are mentions of lane errors in the config, however I suspect
> > > those are Gen 4 features that I am not familiar with so I don't know
> > > if those are normal. It might be interesting to reboot and see if the
> > > link goes back to a x8 and if the lane errors clear at some point. If
> > > it does then we might want to disable ASPM on the upstream port of the
> > > switch since I have seen ASPM cause link downgrades and that may be
> > > what is occurring here.
> >
> > Humm... And that would mean disabling ASPM completely to test?
>
> Maybe. Right after boot there is a good likelihood that the link will
> be the most healthy, so it is likely to still be x8 if that is the
> true width of the link. If we are seeing it degrade over time that
> would be a sign that maybe we should disable L1 on the link between
> the switch and the root complex instead of messing with the NICs.

I doubt it, I suspect that the chip can do more but the extra lanes just
aren't physically there...

Just like, for example:
LnkSta: Speed 8GT/s (ok), Width x8 (downgraded)
TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-

A PCIe v3 x16 card in an x8 slot

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [Intel-wired-lan] NAT performance issue 944mbit -> ~40mbit
  2020-07-25 17:43                                                       ` Alexander Duyck
@ 2020-07-25 18:56                                                         ` Ian Kumlien
  2020-07-25 19:35                                                           ` Ian Kumlien
  0 siblings, 1 reply; 51+ messages in thread
From: Ian Kumlien @ 2020-07-25 18:56 UTC (permalink / raw)
  To: intel-wired-lan

On Sat, Jul 25, 2020 at 7:43 PM Alexander Duyck
<alexander.duyck@gmail.com> wrote:
>
> On Sat, Jul 25, 2020 at 6:53 AM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> >
> > On Sat, Jul 25, 2020 at 3:03 AM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > > On Sat, Jul 25, 2020 at 2:45 AM Alexander Duyck
> > > <alexander.duyck@gmail.com> wrote:
> > > > On Fri, Jul 24, 2020 at 4:08 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> >
> > [--8<--]
> >
> > > > Actually I think it is max(a.dw, b.up) + max(b.dw, a.up). Basically
> > > > what you want is the maximum time to bring the link up so technically
> > > > you only have 2 links so you just have to add up the maximum time to
> > > > create each link.
> > >
> > > Ah so it's not cumulative per link, only max value on one, got it!
> > >
> > > > > Also, we only disabled L0, which isn't counted as a total at all, it
> > > > > only checks each side.
> > > >
> > > > Not sure what you mean here. L0 is the link fully powered on. The two
> > > > link states we have to worry about are L0s and L1 as those involve
> > > > various states of power-down. The L1 latency is the nasty one as that
> > > > basically involves fully powering down the link and requires time for
> > > > the link to be reestablished.
> > >
> > > we basically did the &= ~ASPM_STATE_L0S - is the S indicative of something?
> > >
> > > > > Since pcie is serial and judging from your previous statements I
> > > > > assume that the max statement is a bug.
> > > > > I also assume that l0 counts, and should be dealt with the same way
> > > > > and it should also be cumulative...
> > > >
> > > > That latency check looks like it would be for a single link. Not each
> > > > link in the chain.
> > >
> > > Yes, it checks each link in the chain, which is incorrect, it's actually the
> > > cumulative latency that is important... Well... according to what I have
> > > been able to gather from various documents anyway ;)
> > >
> > > > > The question becomes, is this latency from root? or is it "one step"?
> > > >
> > > > Actually the function is doing both. I had to reread the spec.
> > > > Basically the switch is supposed to start trying to bring up the other
> > > > links as soon as it detects that we are trying to bring up the link.
> > > > In theory this is only supposed to add about 1us. So in theory the
> > > > total bring-up time should be 33us.
> > >
> > > Ah ok, thanks, that answers another question in the chain ;)
> >
> > So, then this is what should be done:
> > diff --git a/drivers/pci/pcie/aspm.c b/drivers/pci/pcie/aspm.c
> > index b17e5ffd31b1..2b8f7ea7f7bc 100644
> > --- a/drivers/pci/pcie/aspm.c
> > +++ b/drivers/pci/pcie/aspm.c
> > @@ -434,7 +434,7 @@ static void pcie_get_aspm_reg(struct pci_dev *pdev,
> >
> >  static void pcie_aspm_check_latency(struct pci_dev *endpoint)
> >  {
> > -       u32 latency, l1_switch_latency = 0;
> > +       u32 latency, l1_max_latency = 0, l1_switch_latency = 0;
> >         struct aspm_latency *acceptable;
> >         struct pcie_link_state *link;
> >
> > @@ -470,8 +470,9 @@ static void pcie_aspm_check_latency(struct pci_dev
> > *endpoint)
> >                  * substate latencies (and hence do not do any check).
> >                  */
> >                 latency = max_t(u32, link->latency_up.l1, link->latency_dw.l1);
> > +               l1_max_latency = max_t(u32, latency, l1_max_latency)
> >                 if ((link->aspm_capable & ASPM_STATE_L1) &&
> > -                   (latency + l1_switch_latency > acceptable->l1))
> > +                   (l1_max_latency + l1_switch_latency > acceptable->l1))
> >                         link->aspm_capable &= ~ASPM_STATE_L1;
> >                 l1_switch_latency += 1000;
> > ---
>
> This makes sense to me. You might want to submit it to the linux-pci
> mailing list.

Will after trying it and adding the missing ';'

> > for l1 latency... I do however find it odd that you don't disable it
> > on the endpoint but on the
> > potential bridge/switch/root you're on - shouldn't the disable be on
> > endpoint->bus->self->link_state?
>
> I think the idea is that we want to leave the leaves of the root
> complex with ASPM enabled and disable it as we approach the trunk as
> it is more likely that we are going to see more frequent wakeups as we
> approach the root complex, or at least that would be my theory anyway.
> Basically the closer you get to the root complex the more likely you
> are to have more devices making use of the path so the more likely it
> is to have to stay on anyway.

Well, since we walk from the endpoint to the root complex, it's actually more
likely that we disable L1 on the link nearest the root complex...

since max_latency + the accumulated per-switch latency will be largest there

> > Anyway, this should handle any latency bumps... and could be done
> > differently reusing latency and:
> > latency = max_t(u32, latency, max_t(u32, link->latency_up.l1,
> > link->latency_dw.l1));
> >
> > but kept it this way for legibility...
> >
> > But for L0 -- been looking at it and I wonder... from what I can see
> > it's cumulative for the link, but L0S seems
> > different and is perhaps not quite the same...
>
> You have mentioned L0 several times now and I wonder what you are
> referring to. L0 is the fully powered on state. That is the state we
> are trying to get back to. L0s and L1 are the lower power states that
> we have to get out of with L1 being a much more complicated state to
> get out of as we shut down the clocks and link if I recall and have to
> reestablish both before we can resume operation.

Yeah, trying to understand it all...

Reading about L0 suggests that the latency covers the whole chain, but I
don't know if there is a case for L0s that I'm missing, since it seems hard
to get information about it and how it works.

Basically, is the L0s check correct, or should it be changed? Should it
respect the limitations set forth for L0, or is it completely different?

I'm really thankful for all the information you've been providing :)

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [Intel-wired-lan] NAT performance issue 944mbit -> ~40mbit
  2020-07-25 18:56                                                         ` Ian Kumlien
@ 2020-07-25 19:35                                                           ` Ian Kumlien
  2020-07-25 20:10                                                             ` Alexander Duyck
  0 siblings, 1 reply; 51+ messages in thread
From: Ian Kumlien @ 2020-07-25 19:35 UTC (permalink / raw)
  To: intel-wired-lan

On Sat, Jul 25, 2020 at 8:56 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
>
> On Sat, Jul 25, 2020 at 7:43 PM Alexander Duyck
> <alexander.duyck@gmail.com> wrote:
> >
> > On Sat, Jul 25, 2020 at 6:53 AM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > >
> > > On Sat, Jul 25, 2020 at 3:03 AM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > > > On Sat, Jul 25, 2020 at 2:45 AM Alexander Duyck
> > > > <alexander.duyck@gmail.com> wrote:
> > > > > On Fri, Jul 24, 2020 at 4:08 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > >
> > > [--8<--]
> > >
> > > > > Actually I think it is max(a.dw, b.up) + max(b.dw, a.up). Basically
> > > > > what you want is the maximum time to bring the link up so technically
> > > > > you only have 2 links so you just have to add up the maximum time to
> > > > > create each link.
> > > >
> > > > Ah so it's not cumulative per link, only max value on one, got it!
> > > >
> > > > > > Also, we only disabled L0, which isn't counted as a total at all, it
> > > > > > only checks each side.
> > > > >
> > > > > Not sure what you mean here. L0 is the link fully powered on. The two
> > > > > link states we have to worry about are L0s and L1 as those involve
> > > > > various states of power-down. The L1 latency is the nasty one as that
> > > > > basically involves fully powering down the link and requires time for
> > > > > the link to be reestablished.
> > > >
> > > > we basically did the &= ~ASPM_STATE_L0S - is the S indicative of something?
> > > >
> > > > > > Since pcie is serial and judging from your previous statements I
> > > > > > assume that the max statement is a bug.
> > > > > > I also assume that l0 counts, and should be dealt with the same way
> > > > > > and it should also be cumulative...
> > > > >
> > > > > That latency check looks like it would be for a single link. Not each
> > > > > link in the chain.
> > > >
> > > > Yes, it checks each link in the chain, which is incorrect, it's actually the
> > > > cumulative latency that is important... Well... according to what I have
> > > > been able to gather from various documents anyway ;)
> > > >
> > > > > > The question becomes, is this latency from root? or is it "one step"?
> > > > >
> > > > > Actually the function is doing both. I had to reread the spec.
> > > > > Basically the switch is supposed to start trying to bring up the other
> > > > > links as soon as it detects that we are trying to bring up the link.
> > > > > In theory this is only supposed to add about 1us. So in theory the
> > > > > total bring-up time should be 33us.
> > > >
> > > > Ah ok, thanks, that answers another question in the chain ;)
> > >
> > > So, then this is what should be done:
> > > diff --git a/drivers/pci/pcie/aspm.c b/drivers/pci/pcie/aspm.c
> > > index b17e5ffd31b1..2b8f7ea7f7bc 100644
> > > --- a/drivers/pci/pcie/aspm.c
> > > +++ b/drivers/pci/pcie/aspm.c
> > > @@ -434,7 +434,7 @@ static void pcie_get_aspm_reg(struct pci_dev *pdev,
> > >
> > >  static void pcie_aspm_check_latency(struct pci_dev *endpoint)
> > >  {
> > > -       u32 latency, l1_switch_latency = 0;
> > > +       u32 latency, l1_max_latency = 0, l1_switch_latency = 0;
> > >         struct aspm_latency *acceptable;
> > >         struct pcie_link_state *link;
> > >
> > > @@ -470,8 +470,9 @@ static void pcie_aspm_check_latency(struct pci_dev
> > > *endpoint)
> > >                  * substate latencies (and hence do not do any check).
> > >                  */
> > >                 latency = max_t(u32, link->latency_up.l1, link->latency_dw.l1);
> > > +               l1_max_latency = max_t(u32, latency, l1_max_latency)
> > >                 if ((link->aspm_capable & ASPM_STATE_L1) &&
> > > -                   (latency + l1_switch_latency > acceptable->l1))
> > > +                   (l1_max_latency + l1_switch_latency > acceptable->l1))
> > >                         link->aspm_capable &= ~ASPM_STATE_L1;
> > >                 l1_switch_latency += 1000;
> > > ---
> >
> > This makes sense to me. You might want to submit it to the linux-pci
> > mailing list.
>
> Will after trying it and adding the missing ';'

So rebooted, and the chain we had was:
00:01.2->1:00.0 -> 2:03.0 -> 3:00.0

And with my patch, we have:
for x in 00:01.2 1:00.0 2:03.0 3:00.0 ; do echo $x && lspci -s $x -vvv
|grep LnkCtl ; done
00:01.2
LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
LnkCtl3: LnkEquIntrruptEn- PerformEqu-
01:00.0
LnkCtl: ASPM Disabled; Disabled- CommClk+
LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
LnkCtl3: LnkEquIntrruptEn- PerformEqu-
2:03.0
LnkCtl: ASPM L1 Enabled; Disabled- CommClk+
LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-,
Selectable De-emphasis: -6dB
LnkCtl3: LnkEquIntrruptEn- PerformEqu-
3:00.0
LnkCtl: ASPM L1 Enabled; RCB 64 bytes, Disabled- CommClk+
LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-

And the switch is still downgraded, so I suspect a lack of physical lanes...
LnkSta: Speed 16GT/s (ok), Width x4 (downgraded)
TrErr- Train- SlotClk+ DLActive+ BWMgmt- ABWMgmt-
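
To sanity-check my own understanding of the patched check, with
made-up numbers (these are not the real exit latencies): say the NIC
accepts 64 us of L1 exit latency, the 02:03.0 <-> 03:00.0 link takes
64 us to exit L1, and the 00:01.2 <-> 01:00.0 link takes 8 us. The old
per-link check passes both hops (64 + 0 and 8 + 1 us), even though
waking the whole path would take roughly 64 + 1 = 65 us, which is over
budget. With the patch, the second hop is checked as max(64, 8) + 1 =
65 > 64, so L1 gets dropped on the root side while staying enabled on
the NIC side - which matches the lspci output above.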

Just disabling the endpoint however results in:
for x in 00:01.2 1:00.0 2:03.0 3:00.0 ; do echo $x && lspci -s $x -vvv
|grep LnkCtl ; done
00:01.2
LnkCtl: ASPM L1 Enabled; RCB 64 bytes, Disabled- CommClk+
LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
LnkCtl3: LnkEquIntrruptEn- PerformEqu-
1:00.0
LnkCtl: ASPM L1 Enabled; Disabled- CommClk+
LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
LnkCtl3: LnkEquIntrruptEn- PerformEqu-
2:03.0
LnkCtl: ASPM L1 Enabled; Disabled- CommClk+
LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-,
Selectable De-emphasis: -6dB
LnkCtl3: LnkEquIntrruptEn- PerformEqu-
3:00.0
LnkCtl: ASPM L1 Enabled; RCB 64 bytes, Disabled- CommClk+
LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
----

Ie, it didn't seem to apply...

Looking at the differences:
diff -u lscpi-root.output lscpi-endpoint.output
--- lscpi-root.output 2020-07-25 21:24:10.661458522 +0200
+++ lscpi-endpoint.output 2020-07-25 21:20:50.316049129 +0200
@@ -3,7 +3,6 @@
 00:00.2
 00:01.0
 00:01.2
- LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
 00:02.0
 00:03.0
 00:03.1
@@ -27,7 +26,6 @@
 00:18.6
 00:18.7
 01:00.0
- LnkCtl: ASPM Disabled; Disabled- CommClk+
 02:03.0
 02:04.0
  LnkCtl: ASPM Disabled; Disabled- CommClk+

So that handles two bridges then...
00:01.2 PCI bridge: Advanced Micro Devices, Inc. [AMD]
Starship/Matisse GPP Bridge (prog-if 00 [Normal decode])
01:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Matisse Switch
Upstream (prog-if 00 [Normal decode])

And these two had ASPM Enabled before my changes... They actually seem
to fix it as well! ;)





> > > for l1 latency... I do however find it odd that you don't disable it
> > > on the endpoint but on the
> > > potential bridge/switch/root you're on - shouldn't the disable be on
> > > endpoint->bus->self->link_state?
> >
> > I think the idea is that we want to leave the leaves of the root
> > complex with ASPM enabled and disable it as we approach the trunk as
> > it is more likely that we are going to see more frequent wakeups as we
> > approach the root complex, or at least that would be my theory anyway.
> > Basically the closer you get to the root complex the more likely you
> > are to have more devices making use of the path so the more likely it
> > is to have to stay on anyway.
>
> Well since we walk from endpoint to root complex, it's actually more
> likely that we disable the root complex...
>
> since max_latency + hops will be the largest number there
>
> > > Anyway, this should handle any latency bumps... and could be done
> > > differently reusing latency and:
> > > latency = max_t(u32, latency, max_t(u32, link->latency_up.l1,
> > > link->latency_dw.l1));
> > >
> > > but kept it this way for legibility...
> > >
> > > But for L0 -- been looking at it and I wonder... from what I can see
> > > it's cumulative for the link, but L0S seems
> > > different and is perhaps not quite the same...
> >
> > You have mentioned L0 several times now and I wonder what you are
> > referring to. L0 is the fully powered on state. That is the state we
> > are trying to get back to. L0s and L1 are the lower power states that
> > we have to get out of with L1 being a much more complicated state to
> > get out of as we shut down the clocks and link if I recall and have to
> > reestablish both before we can resume operation.
>
> Yeah, trying to understand it all...
>
> Reading about L0 says that the latency covers the whole chain, but I
> don't know if there is a case for L0s that I'm missing, since it seems
> hard to get information about it and how it works.
>
> Basically, is the L0s check correct, or should it be changed? Should
> it respect the limitations set forth for L0, or is it completely
> different?
>
> I'm really thankful for all the information you've been providing :)

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [Intel-wired-lan] NAT performance issue 944mbit -> ~40mbit
  2020-07-25 19:35                                                           ` Ian Kumlien
@ 2020-07-25 20:10                                                             ` Alexander Duyck
  2020-07-25 20:16                                                               ` Ian Kumlien
  0 siblings, 1 reply; 51+ messages in thread
From: Alexander Duyck @ 2020-07-25 20:10 UTC (permalink / raw)
  To: intel-wired-lan

On Sat, Jul 25, 2020 at 12:35 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
>
> On Sat, Jul 25, 2020 at 8:56 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> >
> > On Sat, Jul 25, 2020 at 7:43 PM Alexander Duyck
> > <alexander.duyck@gmail.com> wrote:
> > >
> > > On Sat, Jul 25, 2020 at 6:53 AM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > > >
> > > > On Sat, Jul 25, 2020 at 3:03 AM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > > > > On Sat, Jul 25, 2020 at 2:45 AM Alexander Duyck
> > > > > <alexander.duyck@gmail.com> wrote:
> > > > > > On Fri, Jul 24, 2020 at 4:08 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > > >
> > > > [--8<--]
> > > >
> > > > > > Actually I think it is max(a.dw, b.up) + max(b.dw, a.up). Basically
> > > > > > what you want is the maximum time to bring the link up so technically
> > > > > > you only have 2 links so you just have to add up the maximum time to
> > > > > > create each link.
> > > > >
> > > > > Ah so it's not cumulative per link, only max value on one, got it!
> > > > >
> > > > > > > Also, we only disabled L0, which isn't counted as a total at all, it
> > > > > > > only checks each side.
> > > > > >
> > > > > > Not sure what you mean here. L0 is the link fully powered on. The two
> > > > > > link states we have to worry about are L0s and L1 as those involve
> > > > > > various states of power-down. The L1 latency is the nasty one as that
> > > > > > basically involves fully powering down the link and requires time for
> > > > > > the link to be reestablished.
> > > > >
> > > > > we basically did the &= ~ASPM_STATE_L0S - is the S indicative of something?
> > > > >
> > > > > > > Since pcie is serial and judging from your previous statements I
> > > > > > > assume that the max statement is a bug.
> > > > > > > I also assume that l0 counts, and should be dealt with the same way
> > > > > > > and it should also be cumulative...
> > > > > >
> > > > > > That latency check looks like it would be for a single link. Not each
> > > > > > link in the chain.
> > > > >
> > > > > Yes, it checks each link in the chain, which is incorrect, it's actually the
> > > > > cumulative latency that is important... Well... according to what I have
> > > > > been able to gather from various documents anyway ;)
> > > > >
> > > > > > > The question becomes, is this latency from root? or is it "one step"?
> > > > > >
> > > > > > Actually the function is doing both. I had to reread the spec.
> > > > > > Basically the switch is supposed to start trying to bring up the other
> > > > > > links as soon as it detects that we are trying to bring up the link.
> > > > > > In theory this is only supposed to add about 1us. So in theory the
> > > > > > total bring-up time should be 33us.
> > > > >
> > > > > Ah ok, thanks, that answers another question in the chain ;)
> > > >
> > > > So, then this is what should be done:
> > > > diff --git a/drivers/pci/pcie/aspm.c b/drivers/pci/pcie/aspm.c
> > > > index b17e5ffd31b1..2b8f7ea7f7bc 100644
> > > > --- a/drivers/pci/pcie/aspm.c
> > > > +++ b/drivers/pci/pcie/aspm.c
> > > > @@ -434,7 +434,7 @@ static void pcie_get_aspm_reg(struct pci_dev *pdev,
> > > >
> > > >  static void pcie_aspm_check_latency(struct pci_dev *endpoint)
> > > >  {
> > > > -       u32 latency, l1_switch_latency = 0;
> > > > +       u32 latency, l1_max_latency = 0, l1_switch_latency = 0;
> > > >         struct aspm_latency *acceptable;
> > > >         struct pcie_link_state *link;
> > > >
> > > > @@ -470,8 +470,9 @@ static void pcie_aspm_check_latency(struct pci_dev
> > > > *endpoint)
> > > >                  * substate latencies (and hence do not do any check).
> > > >                  */
> > > >                 latency = max_t(u32, link->latency_up.l1, link->latency_dw.l1);
> > > > +               l1_max_latency = max_t(u32, latency, l1_max_latency)
> > > >                 if ((link->aspm_capable & ASPM_STATE_L1) &&
> > > > -                   (latency + l1_switch_latency > acceptable->l1))
> > > > +                   (l1_max_latency + l1_switch_latency > acceptable->l1))
> > > >                         link->aspm_capable &= ~ASPM_STATE_L1;
> > > >                 l1_switch_latency += 1000;
> > > > ---
> > >
> > > This makes sense to me. You might want to submit it to the linux-pci
> > > mailing list.
> >
> > Will after trying it and adding the missing ';'
>
> So rebooted, and the chain we had was:
> 00:01.2->1:00.0 -> 2:03.0 -> 3:00.0
>
> And with my patch, we have:
> for x in 00:01.2 1:00.0 2:03.0 3:00.0 ; do echo $x && lspci -s $x -vvv
> |grep LnkCtl ; done
> 00:01.2
> LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
> LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
> LnkCtl3: LnkEquIntrruptEn- PerformEqu-
> 01:00.0
> LnkCtl: ASPM Disabled; Disabled- CommClk+
> LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
> LnkCtl3: LnkEquIntrruptEn- PerformEqu-
> 2:03.0
> LnkCtl: ASPM L1 Enabled; Disabled- CommClk+
> LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-,
> Selectable De-emphasis: -6dB
> LnkCtl3: LnkEquIntrruptEn- PerformEqu-
> 3:00.0
> LnkCtl: ASPM L1 Enabled; RCB 64 bytes, Disabled- CommClk+
> LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
>
> And the switch is still downgraded, so I suspect a lack of physical lanes...
> LnkSta: Speed 16GT/s (ok), Width x4 (downgraded)
> TrErr- Train- SlotClk+ DLActive+ BWMgmt- ABWMgmt-

Well that is good. So it is disabling ASPM on the root complex side of
the switch and leaving ASPM enabled for the NIC then. That is the
behavior I would expect to see since that will still cut total power
while avoiding cycling L1 on the upstream facing side of the switch.

> Just disabling the endpoint however results in:
> for x in 00:01.2 1:00.0 2:03.0 3:00.0 ; do echo $x && lspci -s $x -vvv
> |grep LnkCtl ; done
> 00:01.2
> LnkCtl: ASPM L1 Enabled; RCB 64 bytes, Disabled- CommClk+
> LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
> LnkCtl3: LnkEquIntrruptEn- PerformEqu-
> 1:00.0
> LnkCtl: ASPM L1 Enabled; Disabled- CommClk+
> LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
> LnkCtl3: LnkEquIntrruptEn- PerformEqu-
> 2:03.0
> LnkCtl: ASPM L1 Enabled; Disabled- CommClk+
> LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-,
> Selectable De-emphasis: -6dB
> LnkCtl3: LnkEquIntrruptEn- PerformEqu-
> 3:00.0
> LnkCtl: ASPM L1 Enabled; RCB 64 bytes, Disabled- CommClk+
> LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
> ----
>
> Ie, it didn't seem to apply...

What do you mean by "just disabling the endpoint"?

> Looking at the differences:
> diff -u lscpi-root.output lscpi-endpoint.output
> --- lscpi-root.output 2020-07-25 21:24:10.661458522 +0200
> +++ lscpi-endpoint.output 2020-07-25 21:20:50.316049129 +0200
> @@ -3,7 +3,6 @@
>  00:00.2
>  00:01.0
>  00:01.2
> - LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
>  00:02.0
>  00:03.0
>  00:03.1
> @@ -27,7 +26,6 @@
>  00:18.6
>  00:18.7
>  01:00.0
> - LnkCtl: ASPM Disabled; Disabled- CommClk+
>  02:03.0
>  02:04.0
>   LnkCtl: ASPM Disabled; Disabled- CommClk+
>
> So that handles two bridges then...
> 00:01.2 PCI bridge: Advanced Micro Devices, Inc. [AMD]
> Starship/Matisse GPP Bridge (prog-if 00 [Normal decode])
> 01:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Matisse Switch
> Upstream (prog-if 00 [Normal decode])
>
> And these two had ASPM Enabled before my changes... They actually seem
> to fix it as well! ;)
>

That is what I would have suspected. Odds are this is the optimal
setup in terms of power savings as well, since the link to the root
complex would be cycling back on for any of the other devices that
are connected to this switch anyway.

It looks like you submitted it as an RFC over on the linux-pci mailing
list. One thing I would suggest is when you go to submit the actual
patch, make sure to include a "Signed-off-by:" with your name and
preferred email address, as that is required for official submissions.
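
For example (using the address you've been posting from), a line like:

Signed-off-by: Ian Kumlien <ian.kumlien@gmail.com>

at the end of the patch description, before the "---" separator.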

Thanks.

- Alex

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [Intel-wired-lan] NAT performance issue 944mbit -> ~40mbit
  2020-07-25 20:10                                                             ` Alexander Duyck
@ 2020-07-25 20:16                                                               ` Ian Kumlien
  0 siblings, 0 replies; 51+ messages in thread
From: Ian Kumlien @ 2020-07-25 20:16 UTC (permalink / raw)
  To: intel-wired-lan

On Sat, Jul 25, 2020 at 10:10 PM Alexander Duyck
<alexander.duyck@gmail.com> wrote:
>
> On Sat, Jul 25, 2020 at 12:35 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> >
> > On Sat, Jul 25, 2020 at 8:56 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > >
> > > On Sat, Jul 25, 2020 at 7:43 PM Alexander Duyck
> > > <alexander.duyck@gmail.com> wrote:
> > > >
> > > > On Sat, Jul 25, 2020 at 6:53 AM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > > > >
> > > > > On Sat, Jul 25, 2020 at 3:03 AM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > > > > > On Sat, Jul 25, 2020 at 2:45 AM Alexander Duyck
> > > > > > <alexander.duyck@gmail.com> wrote:
> > > > > > > On Fri, Jul 24, 2020 at 4:08 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > > > >
> > > > > [--8<--]
> > > > >
> > > > > > > Actually I think it is max(a.dw, b.up) + max(b.dw, a.up). Basically
> > > > > > > what you want is the maximum time to bring the link up so technically
> > > > > > > you only have 2 links so you just have to add up the maximum time to
> > > > > > > create each link.
> > > > > >
> > > > > > Ah so it's not cumulative per link, only max value on one, got it!
> > > > > >
> > > > > > > > Also, we only disabled L0, which isn't counted as a total at all, it
> > > > > > > > only checks each side.
> > > > > > >
> > > > > > > Not sure what you mean here. L0 is the link fully powered on. The two
> > > > > > > link states we have to worry about are L0s and L1 as those involve
> > > > > > > various states of power-down. The L1 latency is the nasty one as that
> > > > > > > basically involves fully powering down the link and requires time for
> > > > > > > the link to be reestablished.
> > > > > >
> > > > > > we basically did the &= ~ASPM_STATE_L0S - is the S indicative of something?
> > > > > >
> > > > > > > > Since pcie is serial and judging from your previous statements I
> > > > > > > > assume that the max statement is a bug.
> > > > > > > > I also assume that l0 counts, and should be dealt with the same way
> > > > > > > > and it should also be cumulative...
> > > > > > >
> > > > > > > That latency check looks like it would be for a single link. Not each
> > > > > > > link in the chain.
> > > > > >
> > > > > > Yes, it checks each link in the chain, which is incorrect, it's actually the
> > > > > > cumulative latency that is important... Well... according to what I have
> > > > > > been able to gather from various documents anyway ;)
> > > > > >
> > > > > > > > The question becomes, is this latency from root? or is it "one step"?
> > > > > > >
> > > > > > > Actually the function is doing both. I had to reread the spec.
> > > > > > > Basically the switch is supposed to start trying to bring up the other
> > > > > > > links as soon as it detects that we are trying to bring up the link.
> > > > > > > In theory this is only supposed to add about 1us. So in theory the
> > > > > > > total bring-up time should be 33us.
> > > > > >
> > > > > > Ah ok, thanks, that answers another question in the chain ;)
> > > > >
> > > > > So, then this is what should be done:
> > > > > diff --git a/drivers/pci/pcie/aspm.c b/drivers/pci/pcie/aspm.c
> > > > > index b17e5ffd31b1..2b8f7ea7f7bc 100644
> > > > > --- a/drivers/pci/pcie/aspm.c
> > > > > +++ b/drivers/pci/pcie/aspm.c
> > > > > @@ -434,7 +434,7 @@ static void pcie_get_aspm_reg(struct pci_dev *pdev,
> > > > >
> > > > >  static void pcie_aspm_check_latency(struct pci_dev *endpoint)
> > > > >  {
> > > > > -       u32 latency, l1_switch_latency = 0;
> > > > > +       u32 latency, l1_max_latency = 0, l1_switch_latency = 0;
> > > > >         struct aspm_latency *acceptable;
> > > > >         struct pcie_link_state *link;
> > > > >
> > > > > @@ -470,8 +470,9 @@ static void pcie_aspm_check_latency(struct pci_dev
> > > > > *endpoint)
> > > > >                  * substate latencies (and hence do not do any check).
> > > > >                  */
> > > > >                 latency = max_t(u32, link->latency_up.l1, link->latency_dw.l1);
> > > > > +               l1_max_latency = max_t(u32, latency, l1_max_latency)
> > > > >                 if ((link->aspm_capable & ASPM_STATE_L1) &&
> > > > > -                   (latency + l1_switch_latency > acceptable->l1))
> > > > > +                   (l1_max_latency + l1_switch_latency > acceptable->l1))
> > > > >                         link->aspm_capable &= ~ASPM_STATE_L1;
> > > > >                 l1_switch_latency += 1000;
> > > > > ---
> > > >
> > > > This makes sense to me. You might want to submit it to the linux-pci
> > > > mailing list.
> > >
> > > Will after trying it and adding the missing ';'
> >
> > So rebooted, and the chain we had was:
> > 00:01.2->1:00.0 -> 2:03.0 -> 3:00.0
> >
> > And with my patch, we have:
> > for x in 00:01.2 1:00.0 2:03.0 3:00.0 ; do echo $x && lspci -s $x -vvv
> > |grep LnkCtl ; done
> > 00:01.2
> > LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
> > LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
> > LnkCtl3: LnkEquIntrruptEn- PerformEqu-
> > 01:00.0
> > LnkCtl: ASPM Disabled; Disabled- CommClk+
> > LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
> > LnkCtl3: LnkEquIntrruptEn- PerformEqu-
> > 2:03.0
> > LnkCtl: ASPM L1 Enabled; Disabled- CommClk+
> > LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-,
> > Selectable De-emphasis: -6dB
> > LnkCtl3: LnkEquIntrruptEn- PerformEqu-
> > 3:00.0
> > LnkCtl: ASPM L1 Enabled; RCB 64 bytes, Disabled- CommClk+
> > LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
> >
> > And the switch is still downgraded, so I suspect a lack of physical lanes...
> > LnkSta: Speed 16GT/s (ok), Width x4 (downgraded)
> > TrErr- Train- SlotClk+ DLActive+ BWMgmt- ABWMgmt-
>
> Well that is good. So it is disabling ASPM on the root complex side of
> the switch and leaving ASPM enabled for the NIC then. That is the
> behavior I would expect to see since that will still cut total power
> while avoiding cycling L1 on the upstream facing side of the switch.

Great, it wasn't what I was expecting ;)

It's been a real learning experience - I thought it worked like a
chain, where disabling ASPM on one link would also disable it on every
link further along, but it seems each link is handled completely
separately...

> > Just disabling the endpoint however results in:
> > for x in 00:01.2 1:00.0 2:03.0 3:00.0 ; do echo $x && lspci -s $x -vvv
> > |grep LnkCtl ; done
> > 00:01.2
> > LnkCtl: ASPM L1 Enabled; RCB 64 bytes, Disabled- CommClk+
> > LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
> > LnkCtl3: LnkEquIntrruptEn- PerformEqu-
> > 1:00.0
> > LnkCtl: ASPM L1 Enabled; Disabled- CommClk+
> > LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
> > LnkCtl3: LnkEquIntrruptEn- PerformEqu-
> > 2:03.0
> > LnkCtl: ASPM L1 Enabled; Disabled- CommClk+
> > LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-,
> > Selectable De-emphasis: -6dB
> > LnkCtl3: LnkEquIntrruptEn- PerformEqu-
> > 3:00.0
> > LnkCtl: ASPM L1 Enabled; RCB 64 bytes, Disabled- CommClk+
> > LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
> > ----
> >
> > Ie, it didn't seem to apply...
>
> What do you mean by "just disabling the endpoint"?

The function starts with:
link = endpoint->bus->self->link_state;

and then we walk link = link->parent.

So disabling on the endpoint would mean disabling on the NIC (in this
case, potentially).
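
To spell out what I mean, the loop looks roughly like this (I'm
paraphrasing aspm.c from memory, so the exact code may differ a bit):

	link = endpoint->bus->self->link_state;
	acceptable = &link->acceptable[PCI_FUNC(endpoint->devfn)];

	while (link) {
		/* per-link L0s and L1 checks against 'acceptable' go here */
		...
		l1_switch_latency += 1000;
		link = link->parent;
	}

So the first link_state we touch is the one between the switch's
downstream port and the NIC (02:03.0 <-> 03:00.0), and link->parent
then takes us step by step towards the root port.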

> > Looking at the differences:
> > diff -u lscpi-root.output lscpi-endpoint.output
> > --- lscpi-root.output 2020-07-25 21:24:10.661458522 +0200
> > +++ lscpi-endpoint.output 2020-07-25 21:20:50.316049129 +0200
> > @@ -3,7 +3,6 @@
> >  00:00.2
> >  00:01.0
> >  00:01.2
> > - LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
> >  00:02.0
> >  00:03.0
> >  00:03.1
> > @@ -27,7 +26,6 @@
> >  00:18.6
> >  00:18.7
> >  01:00.0
> > - LnkCtl: ASPM Disabled; Disabled- CommClk+
> >  02:03.0
> >  02:04.0
> >   LnkCtl: ASPM Disabled; Disabled- CommClk+
> >
> > So that handles two bridges then...
> > 00:01.2 PCI bridge: Advanced Micro Devices, Inc. [AMD]
> > Starship/Matisse GPP Bridge (prog-if 00 [Normal decode])
> > 01:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Matisse Switch
> > Upstream (prog-if 00 [Normal decode])
> >
> > And these two had ASPM Enabled before my changes... They actually seem
> > to fix it as well! ;)
> >
>
> That is what I would have suspected. Odds are this is the optimal
> setup in terms of power savings as well, since the link to the root
> complex would be cycling back on for any of the other devices that
> are connected to this switch anyway.
>
> It looks like you submitted it as an RFC over on the linux-pci mailing
> list. One thing I would suggest is when you go to submit the actual
> patch, make sure to include a "Signed-off-by:" with your name and
> preferred email address, as that is required for official submissions.

Yep, will do, just want to see if there is any feedback first

> Thanks.
>
> - Alex

^ permalink raw reply	[flat|nested] 51+ messages in thread

end of thread, other threads:[~2020-07-25 20:16 UTC | newest]

Thread overview: 51+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-07-11 15:53 NAT performance issue 944mbit -> ~40mbit Ian Kumlien
2020-07-15 20:05 ` Ian Kumlien
2020-07-15 20:05   ` [Intel-wired-lan] " Ian Kumlien
2020-07-15 20:31   ` Jakub Kicinski
2020-07-15 20:31     ` [Intel-wired-lan] " Jakub Kicinski
2020-07-15 21:02     ` Ian Kumlien
2020-07-15 21:02       ` [Intel-wired-lan] " Ian Kumlien
2020-07-15 21:12       ` Ian Kumlien
2020-07-15 21:12         ` [Intel-wired-lan] " Ian Kumlien
2020-07-15 21:40         ` Jakub Kicinski
2020-07-15 21:40           ` [Intel-wired-lan] " Jakub Kicinski
2020-07-15 21:59           ` Ian Kumlien
2020-07-15 21:59             ` [Intel-wired-lan] " Ian Kumlien
2020-07-15 22:32             ` Alexander Duyck
2020-07-15 22:32               ` Alexander Duyck
2020-07-15 22:51               ` Ian Kumlien
2020-07-15 22:51                 ` Ian Kumlien
2020-07-15 23:41                 ` Alexander Duyck
2020-07-15 23:41                   ` Alexander Duyck
2020-07-15 23:59                   ` Ian Kumlien
2020-07-15 23:59                     ` Ian Kumlien
2020-07-16 15:18                     ` Alexander Duyck
2020-07-16 15:18                       ` Alexander Duyck
2020-07-16 15:51                       ` Ian Kumlien
2020-07-16 19:47                       ` Ian Kumlien
2020-07-16 19:47                         ` Ian Kumlien
2020-07-17  0:09                         ` Alexander Duyck
2020-07-17  0:09                           ` Alexander Duyck
2020-07-17 13:45                           ` Ian Kumlien
2020-07-17 13:45                             ` Ian Kumlien
2020-07-24 12:01                             ` Ian Kumlien
2020-07-24 12:01                               ` Ian Kumlien
2020-07-24 12:33                               ` Ian Kumlien
2020-07-24 12:33                                 ` Ian Kumlien
2020-07-24 14:56                                 ` Alexander Duyck
2020-07-24 14:56                                   ` Alexander Duyck
     [not found]                                   ` <CAA85sZsEG_SCC4GLb8xaUsESrzZyAwF0qmse6sJ=e1QkK9DVsQ@mail.gmail.com>
     [not found]                                     ` <CAKgT0UcY4FwAFf0BXv7vc_5ram7YkFXda78PWkdEFgMLsitvWA@mail.gmail.com>
     [not found]                                       ` <CAA85sZs_PSsyZhvdKBCoAGxoZvaQFhQ6j7qoA7y8ffjs2RqEGw@mail.gmail.com>
2020-07-24 21:50                                         ` Alexander Duyck
2020-07-24 22:41                                           ` Ian Kumlien
2020-07-24 22:49                                             ` Ian Kumlien
2020-07-24 23:08                                               ` Ian Kumlien
2020-07-25  0:13                                                 ` Ian Kumlien
2020-07-25  0:45                                                 ` Alexander Duyck
2020-07-25  1:03                                                   ` Ian Kumlien
2020-07-25 13:53                                                     ` Ian Kumlien
2020-07-25 17:43                                                       ` Alexander Duyck
2020-07-25 18:56                                                         ` Ian Kumlien
2020-07-25 19:35                                                           ` Ian Kumlien
2020-07-25 20:10                                                             ` Alexander Duyck
2020-07-25 20:16                                                               ` Ian Kumlien
2020-07-25 17:30                                                     ` Alexander Duyck
2020-07-25 18:52                                                       ` Ian Kumlien
