* [BUG,REGRESSION?] 3.11.6+,3.12: GbE iface rate drops to few KB/s
@ 2013-11-10 13:53 ` Arnaud Ebalard
  0 siblings, 0 replies; 121+ messages in thread
From: Arnaud Ebalard @ 2013-11-10 13:53 UTC (permalink / raw)
  To: Eric Dumazet, David S. Miller, Greg Kroah-Hartman
  Cc: netdev, stable, linux-arm-kernel

Hi,

I decided to upgrade the kernel on one of my ReadyNAS 102 devices from 3.11.1 to
3.11.7. The device is based on a Marvell Armada 370 SoC and uses the mvneta
driver. Mine runs Debian armel unstable, but I can confirm the issue also
happens on Debian armhf unstable.

Doing some scp transfers of files located on the NAS (1000baseT-FD on
both sides), I noticed the transfer rate is ridiculously slow (280KB/s).
I did the same test with a 3.12 kernel and got the same results,
i.e. AFAICT the bug also exists upstream.

So I decided to roll up my sleeves and start digging a bit: I ran a 'git
bisect' session on the stable tree from 3.11.1 (known good) to 3.11.7
(known bad). The results are given below.

First, I rebooted on my old 3.11.1 kernel and did 20 transfers of a 1GB
file located on the NAS to my laptop via scp. I took the 20+ minutes
needed to let them all finish: each transfer took between 1min5s and
1min7s (around 16MB/s, the limitation in that case being the crypto part).

I rebooted again and did the exact same thing on 3.11.7: after the first
transfer completed in 1m6s (16MB/s), the second one gave me this:

arno@small:~$ scp RN102:/tmp/random /dev/null
random                               0% 1664KB 278.9KB/s 1:05:37 ETA^C

And it continued that way for the remaining transfers (I did ^C after a
few seconds to restart the transfer when the rate was low):

$ for k in $(seq 1 20) ; do scp RN102:random /dev/null ; done 
random                             100% 1024MB  15.6MB/s   01:06 ETA^C 
random                               0% 9856KB 282.2KB/s 1:01:20 ETA^C
random                              16%  168MB 563.9KB/s   25:54 ETA^C
random                               0% 2816KB 273.5KB/s 1:03:43 ETA^C
random                             100% 1024MB  15.5MB/s   01:06    
random                               1%   17MB 282.3KB/s 1:00:54 ETA^C
random                               0%  544KB 259.2KB/s 1:07:23 ETA^C
random                               0% 4224KB 277.3KB/s 1:02:45 ETA^C
random                               0%  832KB 262.1KB/s 1:06:37 ETA^C
random                               0% 3360KB 273.4KB/s 1:03:43 ETA^C
random                               0% 3072KB 271.8KB/s 1:04:07 ETA^C
random                               0%  832KB 262.1KB/s 1:06:37 ETA^C
random                               0% 1408KB 267.0KB/s 1:05:21 ETA^C
random                               0% 1120KB 264.7KB/s 1:05:57 ETA
...

To be sure, I did 2 additional reboots, one on each kernel, and the
result is consistent: perfect on 3.11.1 and a slow rate most of the time
on 3.11.7 (both kernels are compiled from a fresh make clean, using the
same config file).

Then, knowing that, I started a git bisect session on the stable tree and
ended up with the following suspects. I failed to narrow it down to a
single commit due to crashes, but I can recompile a kernel w/ debug info
and report what I get if needed.

commit dc0791aee672 tcp: do not forget FIN in tcp_shifted_skb()     [bad]
commit 18ddf5127c9f tcp: must unclone packets before mangling them
commit 80bd5d8968d8 tcp: TSQ can use a dynamic limit
commit dbeb18b22197 tcp: TSO packets automatic sizing
commit 50704410d014 Linux 3.11.6                                    [good]
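
For reference, the bisect session looked roughly like the sketch below
(the exact build and test commands are assumptions, not a transcript of
what I typed):

    $ git bisect start
    $ git bisect bad v3.11.7          # slow transfers (~280KB/s)
    $ git bisect good v3.11.1         # normal transfers (~16MB/s)
    # at each step: rebuild with the same .config, boot the NAS, then test
    $ scp RN102:/tmp/random /dev/null
    $ git bisect good                 # or 'git bisect bad' when the rate collapses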

Eric, David, if it has already been reported and fixed, just tell
me. Otherwise, if you have any ideas, I'll be happy to test this
evening.

Cheers,

a+


Just in case it may be useful, here is what ethtool reports on RN102:

# ethtool -i eth0
driver: mvneta
version: 1.0
firmware-version: 
bus-info: eth0
supports-statistics: no
supports-test: no
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: no

# ethtool -k eth0
Features for eth0:
rx-checksumming: off [fixed]
tx-checksumming: on
        tx-checksum-ipv4: on
        tx-checksum-ip-generic: off [fixed]
        tx-checksum-ipv6: off [fixed]
        tx-checksum-fcoe-crc: off [fixed]
        tx-checksum-sctp: off [fixed]
scatter-gather: on
        tx-scatter-gather: on
        tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: off
        tx-tcp-segmentation: off [fixed]
        tx-tcp-ecn-segmentation: off [fixed]
        tx-tcp6-segmentation: off [fixed]
udp-fragmentation-offload: off [fixed]
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off [fixed]
rx-vlan-offload: off [fixed]
tx-vlan-offload: off [fixed]
ntuple-filters: off [fixed]
receive-hashing: off [fixed]
highdma: off [fixed]
rx-vlan-filter: off [fixed]
vlan-challenged: off [fixed]
tx-lockless: off [fixed]
netns-local: off [fixed]
tx-gso-robust: off [fixed]
tx-fcoe-segmentation: off [fixed]
tx-gre-segmentation: off [fixed]
tx-udp_tnl-segmentation: off [fixed]
tx-mpls-segmentation: off [fixed]
fcoe-mtu: off [fixed]
tx-nocache-copy: on
loopback: off [fixed]
rx-fcs: off [fixed]
rx-all: off [fixed]
tx-vlan-stag-hw-insert: off [fixed]
rx-vlan-stag-hw-parse: off [fixed]
rx-vlan-stag-filter: off [fixed]

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [BUG,REGRESSION?] 3.11.6+,3.12: GbE iface rate drops to few KB/s
  2013-11-10 13:53 ` Arnaud Ebalard
@ 2013-11-12  6:48   ` Cong Wang
  -1 siblings, 0 replies; 121+ messages in thread
From: Cong Wang @ 2013-11-12  6:48 UTC (permalink / raw)
  To: linux-arm-kernel; +Cc: netdev, stable

On Sun, 10 Nov 2013 at 13:53 GMT, Arnaud Ebalard <arno@natisbad.org> wrote:
> Hi,
>
> I decided to upgrade the kernel on one of my ReadyNAS 102 from 3.11.1 to 
> 3.11.7. The device is based on Marvell Armada 370 SoC and uses mvneta
> driver. Mine runs Debian armel unstable but I can confirm the issue also
> happens on a debian harmhf unstable.
>
[...]
>
> Then, knowing that, I started a git bisect session on stable tree to end
> up with the following suspects. I failed to go any further to a single
> commit, due to crashes, but I could recompile a kernel w/ debug info and
> report what I get if neeeded.
>
> commit dc0791aee672 tcp: do not forget FIN in tcp_shifted_skb()     [bad]
> commit 18ddf5127c9f tcp: must unclone packets before mangling them
> commit 80bd5d8968d8 tcp: TSQ can use a dynamic limit
> commit dbeb18b22197 tcp: TSO packets automatic sizing
> commit 50704410d014 Linux 3.11.6                                    [good]
>

This regression was probably introduced by the last TSQ commit; Eric has
a patch for mvneta in the other thread:

http://article.gmane.org/gmane.linux.network/290359

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [BUG,REGRESSION?] 3.11.6+,3.12: GbE iface rate drops to few KB/s
  2013-11-12  6:48   ` Cong Wang
@ 2013-11-12  7:56     ` Arnaud Ebalard
  -1 siblings, 0 replies; 121+ messages in thread
From: Arnaud Ebalard @ 2013-11-12  7:56 UTC (permalink / raw)
  To: Cong Wang; +Cc: linux-arm-kernel, netdev, stable, edumazet

Hi,

Thanks for the pointer. See below.

Cong Wang <xiyou.wangcong@gmail.com> writes:

> On Sun, 10 Nov 2013 at 13:53 GMT, Arnaud Ebalard <arno@natisbad.org> wrote:
>> Hi,
>>
>> I decided to upgrade the kernel on one of my ReadyNAS 102 from 3.11.1 to 
>> 3.11.7. The device is based on Marvell Armada 370 SoC and uses mvneta
>> driver. Mine runs Debian armel unstable but I can confirm the issue also
>> happens on a debian harmhf unstable.
>>
> [...]
>>
>> Then, knowing that, I started a git bisect session on stable tree to end
>> up with the following suspects. I failed to go any further to a single
>> commit, due to crashes, but I could recompile a kernel w/ debug info and
>> report what I get if neeeded.
>>
>> commit dc0791aee672 tcp: do not forget FIN in tcp_shifted_skb()     [bad]
>> commit 18ddf5127c9f tcp: must unclone packets before mangling them
>> commit 80bd5d8968d8 tcp: TSQ can use a dynamic limit
>> commit dbeb18b22197 tcp: TSO packets automatic sizing
>> commit 50704410d014 Linux 3.11.6                                    [good]
>>
>
> This regression is probably introduced the last TSQ commit, Eric has a patch
> for mvneta in the other thread:
>
> http://article.gmane.org/gmane.linux.network/290359

I had some offline (*) discussions w/ Eric and did some tests w/ the
patches he sent. They do not fix the regression I see. It would be nice
if someone w/ the hardware and more knowledge of the mvneta driver could
reproduce the issue and spend some time on it.

That being said, even if the driver is most probably not the only one to
blame here (considering the result of the bisect and the current thread on
netdev), I never managed to get the performance I have on my ReadyNAS
Duo v2 (i.e. 108MB/s for a file served by Apache) with a mvneta-based
platform (RN102, RN104 or RN2120). Understanding why is already on a
long todo list.

Cheers,

a+

(*): for some reason, my messages to netdev and stable are not published
even though I can interact w/ {majordomo,autoanswer}@vger.kernel.org. I
poked postmaster@ but got no reply yet.

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [BUG,REGRESSION?] 3.11.6+,3.12: GbE iface rate drops to few KB/s
  2013-11-12  7:56     ` Arnaud Ebalard
@ 2013-11-12  8:36       ` Willy Tarreau
  -1 siblings, 0 replies; 121+ messages in thread
From: Willy Tarreau @ 2013-11-12  8:36 UTC (permalink / raw)
  To: Arnaud Ebalard; +Cc: Cong Wang, edumazet, stable, linux-arm-kernel, netdev

Hi Arnaud,

On Tue, Nov 12, 2013 at 08:56:25AM +0100, Arnaud Ebalard wrote:
> I had some offline (*) discussions w/ Eric and did some test w/ the patches
> he sent. It does not fix the regression I see. It would be nice if someone
> w/ the hardware and more knowledge of mvneta driver could reproduce the
> issue and spend some time on it.

I could give it a try but I'm running very short on time at the moment.

> That been said, even if the driver is most probably not the only one to
> blame here (considering the result of bisect and current thread on
> netdev), I never managed to get the performance I have on my ReadyNAS
> Duo v2 (i.e. 108MB/s for a file served by an Apache) with a mvneta-based
> platform (RN102, RN104 or RN2120). Understanding why is on an already a
> long todo list. 

Yes, I found that your original numbers were already quite low, so it is
also possible that you have a different problem (e.g. a faulty switch, or
an auto-negotiation problem where the switch falls back to half duplex
because the neta does not advertise nway, or whatever) that is emphasized
by the latest changes.
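
A quick way to rule that out is to check the negotiated link state on both
ends, for instance (a sketch, assuming ethtool is available on the NAS):

    # negotiated speed/duplex and auto-negotiation state
    $ ethtool eth0 | grep -E 'Speed|Duplex|Auto-negotiation'
    # a healthy GbE link should report:
    #   Speed: 1000Mb/s
    #   Duplex: Full
    #   Auto-negotiation: on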

> Cheers,
> 
> a+
> 
> (*): for some reasons, my messages to netdev and stable are not published
> even though I can interact w/ {majordomo,autoanswer}@vger.kernel.org. I
> poked postmaster@ bug got no reply yet.

I can confirm that I got this message from you on netdev so it should be OK
now.

Willy

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [BUG,REGRESSION?] 3.11.6+,3.12: GbE iface rate drops to few KB/s
  2013-11-12  8:36       ` Willy Tarreau
@ 2013-11-12  9:14         ` Arnaud Ebalard
  -1 siblings, 0 replies; 121+ messages in thread
From: Arnaud Ebalard @ 2013-11-12  9:14 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: Cong Wang, linux-arm-kernel, netdev, stable, edumazet

Hi,

Willy Tarreau <w@1wt.eu> writes:

>> That been said, even if the driver is most probably not the only one to
>> blame here (considering the result of bisect and current thread on
>> netdev), I never managed to get the performance I have on my ReadyNAS
>> Duo v2 (i.e. 108MB/s for a file served by an Apache) with a mvneta-based
>> platform (RN102, RN104 or RN2120). Understanding why is on an already a
>> long todo list. 
>
> Yes I found that your original numbers were already quite low,

Tests for the regression were done w/ scp, and were hence limited by the
crypto (16MB/s using arcfour128). But I also did some tests w/ a simple
wget for a file served by Apache *before* the regression, and I never got
more than 60MB/s from what I recall. Can you beat that?

> so it is also possible that you have a different problem (eg: faulty
> switch or auto-negociation problem where the switch goes to half
> duplex because the neta does not advertise nway or whatever) that is
> emphasized by the latest changes

Tested w/ back-to-back connections to the NAS from various hosts and
through different switches. Never saturated the link.

> I can confirm that I got this message from you on netdev so it should be OK
> now.

Good. Thanks for the info.

a+

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [BUG,REGRESSION?] 3.11.6+,3.12: GbE iface rate drops to few KB/s
  2013-11-12  9:14         ` Arnaud Ebalard
@ 2013-11-12 10:01           ` Willy Tarreau
  -1 siblings, 0 replies; 121+ messages in thread
From: Willy Tarreau @ 2013-11-12 10:01 UTC (permalink / raw)
  To: Arnaud Ebalard; +Cc: Cong Wang, edumazet, linux-arm-kernel, stable, netdev

Hi Arnaud,

On Tue, Nov 12, 2013 at 10:14:34AM +0100, Arnaud Ebalard wrote:
> Tests for the rgression were done w/ scp, and were hence limited by the
> crypto (16MB/s using arcfour128). But I also did some tests w/ a simple
> wget for a file served by Apache *before* the regression and I never got
> more than 60MB/s from what I recall. Can you beat that? 

Yes, I finally picked my mirabox out of my bag for a quick test. It boots
off 3.10.0-rc7 and I totally saturate one port (stable 988 Mbps) even
with a single TCP stream.

With two systems, one directly connected (dockstar) and the other one via
a switch, I get 2*650 Mbps (a single TCP stream is enough on each).

I'll have to re-run some tests using a more up to date kernel, but that
will probably not be today though.

Regards,
Willy

^ permalink raw reply	[flat|nested] 121+ messages in thread

* [PATCH] tcp: tsq: restore minimal amount of queueing
  2013-11-12  7:56     ` Arnaud Ebalard
  (?)
  (?)
@ 2013-11-12 14:39     ` Eric Dumazet
  2013-11-12 15:24       ` Sujith Manoharan
                         ` (2 more replies)
  -1 siblings, 3 replies; 121+ messages in thread
From: Eric Dumazet @ 2013-11-12 14:39 UTC (permalink / raw)
  To: Arnaud Ebalard, David Miller, Sujith Manoharan
  Cc: Cong Wang, netdev, Felix Fietkau

From: Eric Dumazet <edumazet@google.com>

After commit c9eeec26e32e ("tcp: TSQ can use a dynamic limit"), several
users reported throughput regressions, notably on mvneta and wifi
adapters.

802.11 AMPDU requires a fair amount of queueing to be effective.

This patch partially reverts the change done in tcp_write_xmit()
so that the minimal amount is sysctl_tcp_limit_output_bytes.

It also removes the use of this sysctl while building skbs stored
in the write queue, as TSO autosizing does the right thing anyway.

Users with well-behaving NICs and a correct qdisc (like sch_fq)
can then lower the default sysctl_tcp_limit_output_bytes value from
128KB to 8KB.

The new usage of sysctl_tcp_limit_output_bytes permits each driver
author to check how their driver performs when/if the value is set
to a minimum of 4KB:

Normally, line rate for a single TCP flow should be possible, 
but some drivers rely on timers to perform TX completion and
too long delays prevent reaching full throughput.

Fixes: c9eeec26e32e ("tcp: TSQ can use a dynamic limit")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Sujith Manoharan <sujith@msujith.org>
Reported-by: Arnaud Ebalard <arno@natisbad.org>
Cc: Felix Fietkau <nbd@openwrt.org>
---
 Documentation/networking/ip-sysctl.txt |    3 ---
 net/ipv4/tcp.c                         |    6 ------
 net/ipv4/tcp_output.c                  |    6 +++++-
 3 files changed, 5 insertions(+), 10 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index a46d785..7d8dc93 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -588,9 +588,6 @@ tcp_limit_output_bytes - INTEGER
 	typical pfifo_fast qdiscs.
 	tcp_limit_output_bytes limits the number of bytes on qdisc
 	or device to reduce artificial RTT/cwnd and reduce bufferbloat.
-	Note: For GSO/TSO enabled flows, we try to have at least two
-	packets in flight. Reducing tcp_limit_output_bytes might also
-	reduce the size of individual GSO packet (64KB being the max)
 	Default: 131072
 
 tcp_challenge_ack_limit - INTEGER
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 6e5617b..be5246e 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -806,12 +806,6 @@ static unsigned int tcp_xmit_size_goal(struct sock *sk, u32 mss_now,
 		xmit_size_goal = min_t(u32, gso_size,
 				       sk->sk_gso_max_size - 1 - hlen);
 
-		/* TSQ : try to have at least two segments in flight
-		 * (one in NIC TX ring, another in Qdisc)
-		 */
-		xmit_size_goal = min_t(u32, xmit_size_goal,
-				       sysctl_tcp_limit_output_bytes >> 1);
-
 		xmit_size_goal = tcp_bound_to_half_wnd(tp, xmit_size_goal);
 
 		/* We try hard to avoid divides here */
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index d46f214..9f0b338 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1875,8 +1875,12 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 		 *  - better RTT estimation and ACK scheduling
 		 *  - faster recovery
 		 *  - high rates
+		 * Alas, some drivers / subsystems require a fair amount
+		 * of queued bytes to ensure line rate.
+		 * One example is wifi aggregation (802.11 AMPDU)
 		 */
-		limit = max(skb->truesize, sk->sk_pacing_rate >> 10);
+		limit = max(sysctl_tcp_limit_output_bytes,
+			    sk->sk_pacing_rate >> 10);
 
 		if (atomic_read(&sk->sk_wmem_alloc) > limit) {
 			set_bit(TSQ_THROTTLED, &tp->tsq_flags);

^ permalink raw reply related	[flat|nested] 121+ messages in thread

* Re: [PATCH] tcp: tsq: restore minimal amount of queueing
  2013-11-12 14:39     ` [PATCH] tcp: tsq: restore minimal amount of queueing Eric Dumazet
@ 2013-11-12 15:24       ` Sujith Manoharan
  2013-11-13 14:06       ` Eric Dumazet
  2013-11-13 14:32       ` [PATCH v2] " Eric Dumazet
  2 siblings, 0 replies; 121+ messages in thread
From: Sujith Manoharan @ 2013-11-12 15:24 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Arnaud Ebalard, David Miller, Cong Wang, netdev, Felix Fietkau

Eric Dumazet wrote:
> From: Eric Dumazet <edumazet@google.com>
> 
> After commit c9eeec26e32e ("tcp: TSQ can use a dynamic limit"), several
> users reported throughput regressions, notably on mvneta and wifi
> adapters.
> 
> 802.11 AMPDU requires a fair amount of queueing to be effective.
> 
> This patch partially reverts the change done in tcp_write_xmit()
> so that the minimal amount is sysctl_tcp_limit_output_bytes.
> 
> It also remove the use of this sysctl while building skb stored
> in write queue, as TSO autosizing does the right thing anyway.
> 
> Users with well behaving NICS and correct qdisc (like sch_fq),
> can then lower the default sysctl_tcp_limit_output_bytes value from
> 128KB to 8KB.
> 
> The new usage of sysctl_tcp_limit_output_bytes permits each driver
> author to check how driver performs when/if the value is set
> to a minimum of 4KB :
> 
> Normally, line rate for a single TCP flow should be possible, 
> but some drivers rely on timers to perform TX completion and
> too long delays prevent reaching full throughput.

I tested the patch with ath9k and performance with a 2-stream card is normal
again, about 195 Mbps in open air.

Thanks for the fix !

Also, I think this needs to be marked as a stable candidate, since 3.12 needs
this fix.

Sujith

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [BUG,REGRESSION?] 3.11.6+,3.12: GbE iface rate drops to few KB/s
  2013-11-12 10:01           ` Willy Tarreau
@ 2013-11-12 15:34             ` Arnaud Ebalard
  -1 siblings, 0 replies; 121+ messages in thread
From: Arnaud Ebalard @ 2013-11-12 15:34 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: Cong Wang, edumazet, stable, linux-arm-kernel, netdev

Hi,

Willy Tarreau <w@1wt.eu> writes:

> On Tue, Nov 12, 2013 at 10:14:34AM +0100, Arnaud Ebalard wrote:
>> Tests for the rgression were done w/ scp, and were hence limited by the
>> crypto (16MB/s using arcfour128). But I also did some tests w/ a simple
>> wget for a file served by Apache *before* the regression and I never got
>> more than 60MB/s from what I recall. Can you beat that? 
>
> Yes, I finally picked my mirabox out of my bag for a quick test. It boots
> off 3.10.0-rc7 and I totally saturate one port (stable 988 Mbps) with even
> a single TCP stream.

Thanks for the feedback. That's interesting. What are you using for your tests
(wget, ...)? 

> With two systems, one directly connected (dockstar) and the other one via
> a switch, I get 2*650 Mbps (a single TCP stream is enough on each).
>
> I'll have to re-run some tests using a more up to date kernel, but that
> will probably not be today though.

Can you give a pre-3.11.7 kernel a try if you find the time? I started
working on the RN102 during the 3.10-rc cycle but do not remember if I did
the first performance tests on 3.10 or 3.11. And if you find more time,
3.11.7 would be nice too ;-)

Cheers,

a+

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [BUG,REGRESSION?] 3.11.6+,3.12: GbE iface rate drops to few KB/s
  2013-11-12 15:34             ` Arnaud Ebalard
@ 2013-11-13  7:22               ` Willy Tarreau
  -1 siblings, 0 replies; 121+ messages in thread
From: Willy Tarreau @ 2013-11-13  7:22 UTC (permalink / raw)
  To: Arnaud Ebalard; +Cc: Cong Wang, edumazet, stable, linux-arm-kernel, netdev

On Tue, Nov 12, 2013 at 04:34:24PM +0100, Arnaud Ebalard wrote:
> Hi,
> 
> Willy Tarreau <w@1wt.eu> writes:
> 
> > On Tue, Nov 12, 2013 at 10:14:34AM +0100, Arnaud Ebalard wrote:
> >> Tests for the rgression were done w/ scp, and were hence limited by the
> >> crypto (16MB/s using arcfour128). But I also did some tests w/ a simple
> >> wget for a file served by Apache *before* the regression and I never got
> >> more than 60MB/s from what I recall. Can you beat that? 
> >
> > Yes, I finally picked my mirabox out of my bag for a quick test. It boots
> > off 3.10.0-rc7 and I totally saturate one port (stable 988 Mbps) with even
> > a single TCP stream.
> 
> Thanks for the feedback. That's interesting. What are you using for your tests
> (wget, ...)? 

No, inject (for the client) + httpterm (for the server), but it also works with
a simple netcat < /dev/zero, except that netcat uses 8kB buffers and is quickly
CPU-bound. The tools I'm talking about are available here :

  http://1wt.eu/tools/inject/?C=M;O=D
  http://1wt.eu/tools/httpterm/httpterm-1.7.2.tar.gz

Httpterm is a dummy web server. You can send requests like
"GET /?s=1m HTTP/1.0" and it returns 1 MB of data in the response,
which is quite convenient! I'm sorry for the limited documentation
(don't even try to write a config file, it's a fork of an old haproxy
version). Simply start it as :

     httpterm -D -L ip:port    (where 'ip' is optional)
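
For instance, once it is running on the server, a single object can be
fetched and timed from the client side (a sketch; wget is only a stand-in
for inject here, and the address/port are placeholders):

    # on the server under test:
    $ httpterm -D -L 192.168.0.1:8000
    # on the client, fetch a 1 MB object as described above;
    # wget reports the achieved rate:
    $ wget -O /dev/null "http://192.168.0.1:8000/?s=1m"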

Inject is an HTTP client initially designed to test applications but still
doing well enough for component testing (though it does not scale well with
large numbers of connections). I remember that Pablo Neira rewrote a simpler
equivalent here: http://1984.lsi.us.es/git/http-client-benchmark, but I'm
used to using my old version.

There's an old doc in PDF in the download directory. Unfortunately it
is in French, which is not always very convenient. But what I like there
is that you get one line of stats per second so you can easily follow how
the test goes, as opposed to some tools like "ab" which only give you a
summary at the end. That's one of the key points that Pablo has
reimplemented in his tool BTW.
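
The netcat variant mentioned above is even simpler when only raw TCP
throughput matters (a sketch; the port is arbitrary, and the -p flag is
for traditional netcat, BSD netcat omits it):

    # on the box under test: stream zeroes to whoever connects
    $ nc -l -p 5001 < /dev/zero
    # on the client: read for 10 seconds and count the received bytes
    $ timeout 10 nc <server-ip> 5001 | wc -c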

> > With two systems, one directly connected (dockstar) and the other one via
> > a switch, I get 2*650 Mbps (a single TCP stream is enough on each).
> >
> > I'll have to re-run some tests using a more up to date kernel, but that
> > will probably not be today though.
> 
> Can you give a pre-3.11.7 kernel a try if you find the time? I started
> working on RN102 during 3.10-rc cycle but do not remember if I did the
> first preformance tests on 3.10 or 3.11. And if you find more time,
> 3.11.7 would be nice too ;-)

I still have not found time for this, but I observed something intriguing
which might possibly match your experience: if I use large enough send
buffers on the mirabox and receive buffers on the client, then the
throughput drops for objects larger than 1 MB. I quickly checked what's
happening, and it's just that there are pauses of up to 8 ms between some
packets when the TCP send window grows larger than about 200 kB. And
since there are no drops, there is no reason for the window to shrink.
I suspect it's exactly related to the issue explained by Eric about the
timer used to recycle the Tx descriptors. However, last time I checked,
those were also processed in the Rx path, which means that the
ACKs that flow back should have had the same effect as a Tx IRQ (unless
I had been using asymmetric routing, which was not the case). So there
might be another issue. Ah, and it only happens with GSO.

I really need some time to perform more tests, I'm sorry Arnaud, but I
can't do them right now. What you can do is try to reduce your send
window to 1 MB or less to see if the issue persists:

   $ cat /proc/sys/net/ipv4/tcp_wmem
   $ echo 4096 16384 1048576 > /proc/sys/net/ipv4/tcp_wmem

You also need to monitor your CPU usage to ensure that you're not limited
by some processing inside Apache. At 1 Gbps, you should only use something
like 40-50% of the CPU.
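
Something as simple as this, run on the NAS while the transfer is in
progress, is enough to see it (a sketch, assuming vmstat is installed):

    # one line per second; watch the us/sy/id CPU columns during the transfer
    $ vmstat 1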

Cheers,
Willy

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH] tcp: tsq: restore minimal amount of queueing
  2013-11-12 14:39     ` [PATCH] tcp: tsq: restore minimal amount of queueing Eric Dumazet
  2013-11-12 15:24       ` Sujith Manoharan
@ 2013-11-13 14:06       ` Eric Dumazet
  2013-11-13 14:32       ` [PATCH v2] " Eric Dumazet
  2 siblings, 0 replies; 121+ messages in thread
From: Eric Dumazet @ 2013-11-13 14:06 UTC (permalink / raw)
  To: Arnaud Ebalard
  Cc: David Miller, Sujith Manoharan, Cong Wang, netdev, Felix Fietkau

On Tue, 2013-11-12 at 06:39 -0800, Eric Dumazet wrote:

> -		limit = max(skb->truesize, sk->sk_pacing_rate >> 10);
> +		limit = max(sysctl_tcp_limit_output_bytes,
> +			    sk->sk_pacing_rate >> 10);
>  


I'll send a v2; a max_t(unsigned int, ..., ...) is needed here,
as reported by the kbuild bot.

net/ipv4/tcp_output.c:1882:177: warning: comparison of distinct pointer
types lacks a cast [enabled by default]

^ permalink raw reply	[flat|nested] 121+ messages in thread

* [PATCH v2] tcp: tsq: restore minimal amount of queueing
  2013-11-12 14:39     ` [PATCH] tcp: tsq: restore minimal amount of queueing Eric Dumazet
  2013-11-12 15:24       ` Sujith Manoharan
  2013-11-13 14:06       ` Eric Dumazet
@ 2013-11-13 14:32       ` Eric Dumazet
  2013-11-13 21:18         ` Arnaud Ebalard
  2013-11-14 21:26         ` David Miller
  2 siblings, 2 replies; 121+ messages in thread
From: Eric Dumazet @ 2013-11-13 14:32 UTC (permalink / raw)
  To: Arnaud Ebalard
  Cc: David Miller, Sujith Manoharan, Cong Wang, netdev, Felix Fietkau

From: Eric Dumazet <edumazet@google.com>

After commit c9eeec26e32e ("tcp: TSQ can use a dynamic limit"), several
users reported throughput regressions, notably on mvneta and wifi
adapters.

802.11 AMPDU requires a fair amount of queueing to be effective.

This patch partially reverts the change done in tcp_write_xmit()
so that the minimal amount is sysctl_tcp_limit_output_bytes.

It also removes the use of this sysctl while building skbs stored
in the write queue, as TSO autosizing does the right thing anyway.

Users with well-behaving NICs and a correct qdisc (like sch_fq)
can then lower the default sysctl_tcp_limit_output_bytes value from
128KB to 8KB.

This new usage of sysctl_tcp_limit_output_bytes permits each driver
author to check how their driver performs when/if the value is set
to a minimum of 4KB.

Normally, line rate for a single TCP flow should be possible, 
but some drivers rely on timers to perform TX completion and
too long TX completion delays prevent reaching full throughput.

Fixes: c9eeec26e32e ("tcp: TSQ can use a dynamic limit")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Sujith Manoharan <sujith@msujith.org>
Reported-by: Arnaud Ebalard <arno@natisbad.org>
Tested-by: Sujith Manoharan <sujith@msujith.org>
Cc: Felix Fietkau <nbd@openwrt.org>
---
v2: use a max_t() instead of max() to suppress a compiler warning

 Documentation/networking/ip-sysctl.txt |    3 ---
 net/ipv4/tcp.c                         |    6 ------
 net/ipv4/tcp_output.c                  |    6 +++++-
 3 files changed, 5 insertions(+), 10 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index 8b8a057..3c12d9a 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -577,9 +577,6 @@ tcp_limit_output_bytes - INTEGER
 	typical pfifo_fast qdiscs.
 	tcp_limit_output_bytes limits the number of bytes on qdisc
 	or device to reduce artificial RTT/cwnd and reduce bufferbloat.
-	Note: For GSO/TSO enabled flows, we try to have at least two
-	packets in flight. Reducing tcp_limit_output_bytes might also
-	reduce the size of individual GSO packet (64KB being the max)
 	Default: 131072
 
 tcp_challenge_ack_limit - INTEGER
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 8e8529d..3dc0c6c 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -808,12 +808,6 @@ static unsigned int tcp_xmit_size_goal(struct sock *sk, u32 mss_now,
 		xmit_size_goal = min_t(u32, gso_size,
 				       sk->sk_gso_max_size - 1 - hlen);
 
-		/* TSQ : try to have at least two segments in flight
-		 * (one in NIC TX ring, another in Qdisc)
-		 */
-		xmit_size_goal = min_t(u32, xmit_size_goal,
-				       sysctl_tcp_limit_output_bytes >> 1);
-
 		xmit_size_goal = tcp_bound_to_half_wnd(tp, xmit_size_goal);
 
 		/* We try hard to avoid divides here */
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 6728546..c5231d9 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1875,8 +1875,12 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 		 *  - better RTT estimation and ACK scheduling
 		 *  - faster recovery
 		 *  - high rates
+		 * Alas, some drivers / subsystems require a fair amount
+		 * of queued bytes to ensure line rate.
+		 * One example is wifi aggregation (802.11 AMPDU)
 		 */
-		limit = max(skb->truesize, sk->sk_pacing_rate >> 10);
+		limit = max_t(unsigned int, sysctl_tcp_limit_output_bytes,
+			      sk->sk_pacing_rate >> 10);
 
 		if (atomic_read(&sk->sk_wmem_alloc) > limit) {
 			set_bit(TSQ_THROTTLED, &tp->tsq_flags);

^ permalink raw reply related	[flat|nested] 121+ messages in thread

* Re: [PATCH v2] tcp: tsq: restore minimal amount of queueing
  2013-11-13 14:32       ` [PATCH v2] " Eric Dumazet
@ 2013-11-13 21:18         ` Arnaud Ebalard
  2013-11-13 21:59           ` Holger Hoffstaette
  2013-11-13 22:41           ` Eric Dumazet
  2013-11-14 21:26         ` David Miller
  1 sibling, 2 replies; 121+ messages in thread
From: Arnaud Ebalard @ 2013-11-13 21:18 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, Sujith Manoharan, Cong Wang, netdev, Felix Fietkau

Hi Eric,

Eric Dumazet <eric.dumazet@gmail.com> writes:

> From: Eric Dumazet <edumazet@google.com>
>
> After commit c9eeec26e32e ("tcp: TSQ can use a dynamic limit"), several
> users reported throughput regressions, notably on mvneta and wifi
> adapters.
>
> 802.11 AMPDU requires a fair amount of queueing to be effective.
>
> This patch partially reverts the change done in tcp_write_xmit()
> so that the minimal amount is sysctl_tcp_limit_output_bytes.

I just tested the fix on the current Linux tree with the same setup
on which I observed the regression: it also fixes the issue on my side
on my RN102. Thanks for the quick fix, Eric.

Cheers,

a+

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH v2] tcp: tsq: restore minimal amount of queueing
  2013-11-13 21:18         ` Arnaud Ebalard
@ 2013-11-13 21:59           ` Holger Hoffstaette
  2013-11-13 23:40             ` Eric Dumazet
  2013-11-13 22:41           ` Eric Dumazet
  1 sibling, 1 reply; 121+ messages in thread
From: Holger Hoffstaette @ 2013-11-13 21:59 UTC (permalink / raw)
  To: netdev

On Wed, 13 Nov 2013 22:18:12 +0100, Arnaud Ebalard wrote:

> Hi Eric,
> 
> Eric Dumazet <eric.dumazet@gmail.com> writes:
> 
>> From: Eric Dumazet <edumazet@google.com>
>>
>> After commit c9eeec26e32e ("tcp: TSQ can use a dynamic limit"), several
>> users reported throughput regressions, notably on mvneta and wifi
>> adapters.
>>
>> 802.11 AMPDU requires a fair amount of queueing to be effective.
>>
>> This patch partially reverts the change done in tcp_write_xmit() so that
>> the minimal amount is sysctl_tcp_limit_output_bytes.
> 
> I just tested the fix on current linux tree with the same setup on which I
> observed the regression: it also fixes it on my side on my RN102. Thanks
> for the quick fix, Eric.

+1: fixes spastic NFS & Samba throughput since 3.12.0 with r8169 and e1000e
for me as well. It's really not just broken/weird Wifi cards that are
affected by this.

Thanks!

-h

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH v2] tcp: tsq: restore minimal amount of queueing
  2013-11-13 21:18         ` Arnaud Ebalard
  2013-11-13 21:59           ` Holger Hoffstaette
@ 2013-11-13 22:41           ` Eric Dumazet
  1 sibling, 0 replies; 121+ messages in thread
From: Eric Dumazet @ 2013-11-13 22:41 UTC (permalink / raw)
  To: Arnaud Ebalard
  Cc: David Miller, Sujith Manoharan, Cong Wang, netdev, Felix Fietkau

On Wed, 2013-11-13 at 22:18 +0100, Arnaud Ebalard wrote:
> Hi Eric,
> 
> Eric Dumazet <eric.dumazet@gmail.com> writes:
> 
> > From: Eric Dumazet <edumazet@google.com>
> >
> > After commit c9eeec26e32e ("tcp: TSQ can use a dynamic limit"), several
> > users reported throughput regressions, notably on mvneta and wifi
> > adapters.
> >
> > 802.11 AMPDU requires a fair amount of queueing to be effective.
> >
> > This patch partially reverts the change done in tcp_write_xmit()
> > so that the minimal amount is sysctl_tcp_limit_output_bytes.
> 
> I just tested the fix on current linux tree with the same setup
> on which I observed the regression: it also fixes it on my side
> on my RN102. Thanks for the quick fix, Eric. 
> 

It would be nice if you could find the 'right value' for the sysctl, to
give a hint to the mvneta maintainers ;)
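
Something along these lines should do, following the changelog's suggestion
(a sketch; the fq setup is optional and the exact values to try are up to
the tester):

    # on the sending side (the mvneta NAS):
    $ tc qdisc replace dev eth0 root fq                 # optional, per the changelog
    $ sysctl -w net.ipv4.tcp_limit_output_bytes=8192    # then try 4096, 16384, ...
    # re-run a single-flow transfer from the client after each change and
    # note the lowest value that still reaches full throughput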

Thanks

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH v2] tcp: tsq: restore minimal amount of queueing
  2013-11-13 21:59           ` Holger Hoffstaette
@ 2013-11-13 23:40             ` Eric Dumazet
  2013-11-13 23:52               ` Holger Hoffstaette
  0 siblings, 1 reply; 121+ messages in thread
From: Eric Dumazet @ 2013-11-13 23:40 UTC (permalink / raw)
  To: Holger Hoffstaette; +Cc: netdev

On Wed, 2013-11-13 at 22:59 +0100, Holger Hoffstaette wrote:

> +1: fixes spastic NFS & Samba throughput since 3.12.0 with r8169 and e100e
> for me as well. It's really not just broken/weird Wifi cards that are
> affected by this.

Same remark, it would be nice to pinpoint the problematic driver.

e1000e had a problem last year, fixed in commit 
8edc0e624db3756783233e464879eb2e3b904c13
("e1000e: Change wthresh to 1 to avoid possible Tx stalls")

We can live with the current situation and continue to fill buffers
to work around bugs, or we can try to fix the bugs ;)

Thanks

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH v2] tcp: tsq: restore minimal amount of queueing
  2013-11-13 23:40             ` Eric Dumazet
@ 2013-11-13 23:52               ` Holger Hoffstaette
  2013-11-17 23:15                 ` Francois Romieu
  0 siblings, 1 reply; 121+ messages in thread
From: Holger Hoffstaette @ 2013-11-13 23:52 UTC (permalink / raw)
  To: netdev

On Wed, 13 Nov 2013 15:40:40 -0800, Eric Dumazet wrote:

> On Wed, 2013-11-13 at 22:59 +0100, Holger Hoffstaette wrote:
> 
>> +1: fixes spastic NFS & Samba throughput since 3.12.0 with r8169 and
>> e100e for me as well. It's really not just broken/weird Wifi cards that
>> are affected by this.
> 
> Same remark, it would be nice to pinpoint the problematic driver.

Since I saw this with r8169->r8169 and e1000e->r8169, it's probably
everyone's favourite r8169 :)
Unfortunately I can't be of more help, but if you can suggest/whip up a fix
I'd be happy to help test.

-h

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH v2] tcp: tsq: restore minimal amount of queueing
  2013-11-13 14:32       ` [PATCH v2] " Eric Dumazet
  2013-11-13 21:18         ` Arnaud Ebalard
@ 2013-11-14 21:26         ` David Miller
  1 sibling, 0 replies; 121+ messages in thread
From: David Miller @ 2013-11-14 21:26 UTC (permalink / raw)
  To: eric.dumazet; +Cc: arno, sujith, xiyou.wangcong, netdev, nbd

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 13 Nov 2013 06:32:54 -0800

> From: Eric Dumazet <edumazet@google.com>
> 
> After commit c9eeec26e32e ("tcp: TSQ can use a dynamic limit"), several
> users reported throughput regressions, notably on mvneta and wifi
> adapters.
> 
> 802.11 AMPDU requires a fair amount of queueing to be effective.
> 
> This patch partially reverts the change done in tcp_write_xmit()
> so that the minimal amount is sysctl_tcp_limit_output_bytes.
> 
> It also removes the use of this sysctl while building skbs stored
> in the write queue, as TSO autosizing does the right thing anyway.
> 
> Users with well-behaving NICs and a correct qdisc (like sch_fq)
> can then lower the default sysctl_tcp_limit_output_bytes value from
> 128KB to 8KB.
> 
> This new usage of sysctl_tcp_limit_output_bytes permits driver
> authors to check how their driver performs when/if the value is set
> to a minimum of 4KB.
> 
> Normally, line rate for a single TCP flow should be possible, 
> but some drivers rely on timers to perform TX completion and
> too long TX completion delays prevent reaching full throughput.
> 
> Fixes: c9eeec26e32e ("tcp: TSQ can use a dynamic limit")
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Reported-by: Sujith Manoharan <sujith@msujith.org>
> Reported-by: Arnaud Ebalard <arno@natisbad.org>
> Tested-by: Sujith Manoharan <sujith@msujith.org>

Applied and queued up for -stable, thanks Eric.

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [BUG,REGRESSION?] 3.11.6+,3.12: GbE iface rate drops to few KB/s
  2013-11-13  7:22               ` Willy Tarreau
@ 2013-11-17 14:19                 ` Willy Tarreau
  -1 siblings, 0 replies; 121+ messages in thread
From: Willy Tarreau @ 2013-11-17 14:19 UTC (permalink / raw)
  To: Arnaud Ebalard
  Cc: Thomas Petazzoni, Cong Wang, edumazet, linux-arm-kernel, netdev

Hi Arnaud,

[CCing Thomas and removing stable@]

On Wed, Nov 13, 2013 at 08:22:57AM +0100, Willy Tarreau wrote:
> On Tue, Nov 12, 2013 at 04:34:24PM +0100, Arnaud Ebalard wrote:
> > Can you give a pre-3.11.7 kernel a try if you find the time? I started
> > working on RN102 during 3.10-rc cycle but do not remember if I did the
> > first performance tests on 3.10 or 3.11. And if you find more time,
> > 3.11.7 would be nice too ;-)
> 
> Still have not found time for this but I observed something intriguing
> which might possibly match your experience : if I use large enough send
> buffers on the mirabox and receive buffers on the client, then the
> traffic drops for objects larger than 1 MB. I have quickly checked what's
> happening and it's just that there are pauses of up to 8 ms between some
> packets when the TCP send window grows larger than about 200 kB. And
> since there are no drops, there is no reason for the window to shrink.
> I suspect it's exactly related to the issue explained by Eric about the
> timer used to recycle the Tx descriptors. However last time I checked,
> these ones were also processed in the Rx path, which means that the
> ACKs that flow back should have had the same effect as a Tx IRQ (unless
> I'd use asymmetric routing, which was not the case). So there might be
> another issue. Ah, and it only happens with GSO.

I just had a quick look at the driver and I can confirm that Eric is right
about the fact that we use up to two descriptors per GSO segment. Thus, we
can saturate the Tx queue at 532/2 = 266 Tx segments = 388360 bytes (for
1460 MSS). I thought I had seen a Tx flush from the Rx poll function, but I
can't find it, so it seems I was wrong, or that I possibly misunderstood
mvneta_poll() the first time I read it. Thus the observed behaviour is
perfectly normal.

With GSO enabled, as soon as the window grows large enough, we can fill
all the Tx descriptors with relatively few segments, and then we need to
wait up to 10ms (12ms if running at 250 Hz as I am) for them to be
flushed, which explains the low speed I was observing with large windows.
When GSO is disabled, up to twice as many segments can be queued, which
is enough to fill the wire in the same time frame. Additionally, it's
likely that more descriptors get the time to be sent during that period,
and that each call to mvneta_tx(), by causing a call to mvneta_txq_done(),
releases some of the previously sent descriptors, allowing wire rate to
be sustained.
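
As a back-of-the-envelope check of the numbers above (a stand-alone
illustration using only the figures quoted in this thread: a 532-entry Tx
ring, 2 descriptors per GSO segment, a 1460-byte MSS and a 10-12ms flush
period):

#include <stdio.h>

int main(void)
{
	const int ring_size = 532;	/* Tx descriptors */
	const int desc_per_seg = 2;	/* with GSO enabled */
	const int mss = 1460;		/* bytes per segment */
	const double flush_ms[] = { 10.0, 12.0 };
	int segs = ring_size / desc_per_seg;	/* 266 */
	long bytes = (long)segs * mss;		/* 388360 */
	int i;

	printf("ring holds %d segments = %ld bytes\n", segs, bytes);
	for (i = 0; i < 2; i++)
		printf("flushed every %.0fms -> ceiling ~%.1f MB/s\n",
		       flush_ms[i], bytes / flush_ms[i] / 1000.0);
	return 0;
}

This gives a ceiling of roughly 32-39 MB/s for a single flow once the ring
is full and only the timer drains it, well below line rate.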

I wonder if we can call mvneta_txq_done() from the IRQ handler, which would
cause some recycling of the Tx descriptors when receiving the corresponding
ACKs.

Ideally we should enable the Tx IRQ, but I still have no access to this
chip's datasheet despite having asked Marvell several times in one year
(Thomas has it though).

So it is fairly possible that in your case you can't fill the link if you
consume too many descriptors. For example, if your server uses TCP_NODELAY
and sends incomplete segments (which is quite common), it's very easy to
run out of descriptors before the link is full.

I still have not had time to run a new kernel on this device, however :-(

Best regards,
Willy

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [BUG,REGRESSION?] 3.11.6+,3.12: GbE iface rate drops to few KB/s
  2013-11-17 14:19                 ` Willy Tarreau
@ 2013-11-17 17:41                   ` Eric Dumazet
  -1 siblings, 0 replies; 121+ messages in thread
From: Eric Dumazet @ 2013-11-17 17:41 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Arnaud Ebalard, Cong Wang, edumazet, linux-arm-kernel, netdev,
	Thomas Petazzoni

On Sun, 2013-11-17 at 15:19 +0100, Willy Tarreau wrote:

> 
> So it is fairly possible that in your case you can't fill the link if you
> consume too many descriptors. For example, if your server uses TCP_NODELAY
> and sends incomplete segments (which is quite common), it's very easy to
> run out of descriptors before the link is full.

BTW I have a very simple patch for the TCP stack that could help this
exact situation...

The idea is to use TCP Small Queues so that we don't fill the qdisc/TX
ring with very small frames, and let tcp_sendmsg() have more of a chance
to build complete packets.

Again, for this to work very well, you need the NIC to perform TX
completion in a reasonable amount of time...

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 3dc0c6c..10456cf 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -624,13 +624,19 @@ static inline void tcp_push(struct sock *sk, int flags, int mss_now,
 {
 	if (tcp_send_head(sk)) {
 		struct tcp_sock *tp = tcp_sk(sk);
+		struct sk_buff *skb = tcp_write_queue_tail(sk);
 
 		if (!(flags & MSG_MORE) || forced_push(tp))
-			tcp_mark_push(tp, tcp_write_queue_tail(sk));
+			tcp_mark_push(tp, skb);
 
 		tcp_mark_urg(tp, flags);
-		__tcp_push_pending_frames(sk, mss_now,
-					  (flags & MSG_MORE) ? TCP_NAGLE_CORK : nonagle);
+		if (flags & MSG_MORE)
+			nonagle = TCP_NAGLE_CORK;
+		if (atomic_read(&sk->sk_wmem_alloc) > 2048) {
+			set_bit(TSQ_THROTTLED, &tp->tsq_flags);
+			nonagle = TCP_NAGLE_CORK;
+		}
+		__tcp_push_pending_frames(sk, mss_now, nonagle);
 	}
 }
 

^ permalink raw reply related	[flat|nested] 121+ messages in thread

* Re: [PATCH v2] tcp: tsq: restore minimal amount of queueing
  2013-11-13 23:52               ` Holger Hoffstaette
@ 2013-11-17 23:15                 ` Francois Romieu
  2013-11-18 16:26                   ` Holger Hoffstätte
  0 siblings, 1 reply; 121+ messages in thread
From: Francois Romieu @ 2013-11-17 23:15 UTC (permalink / raw)
  To: Holger Hoffstaette; +Cc: netdev

Holger Hoffstaette <holger.hoffstaette@googlemail.com> :
[...]
> Since I saw this with r8169->r8169 and e1000e->r8169 it's probably
> everyone's favourite r8169 :)
> Unfortunately I can't be more help but if you can suggest/whip up a fix
> I'd be happy to help test.

The r8169 driver does not rely on a timer for Tx completion.

The patch below should not hurt.

Can you describe your system a bit more specifically?

diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index 3397cee..7280d5d 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -6393,12 +6393,12 @@ static int rtl8169_poll(struct napi_struct *napi, int budget)
 	status = rtl_get_events(tp);
 	rtl_ack_events(tp, status & ~tp->event_slow);
 
-	if (status & RTL_EVENT_NAPI_RX)
-		work_done = rtl_rx(dev, tp, (u32) budget);
-
 	if (status & RTL_EVENT_NAPI_TX)
 		rtl_tx(dev, tp);
 
+	if (status & RTL_EVENT_NAPI_RX)
+		work_done = rtl_rx(dev, tp, (u32) budget);
+
 	if (status & tp->event_slow) {
 		enable_mask &= ~tp->event_slow;
 

^ permalink raw reply related	[flat|nested] 121+ messages in thread

* RE: [BUG,REGRESSION?] 3.11.6+,3.12: GbE iface rate drops to few KB/s
  2013-11-17 14:19                 ` Willy Tarreau
@ 2013-11-18 10:09                   ` David Laight
  -1 siblings, 0 replies; 121+ messages in thread
From: David Laight @ 2013-11-18 10:09 UTC (permalink / raw)
  To: Willy Tarreau, Arnaud Ebalard
  Cc: Cong Wang, edumazet, linux-arm-kernel, netdev, Thomas Petazzoni

> I wonder if we can call mvneta_txq_done() from the IRQ handler, which would
> cause some recycling of the Tx descriptors when receiving the corresponding
> ACKs.
> 
> Ideally we should enable the Tx IRQ, but I still have no access to this
> chip's datasheet despite having asked Marvell several times in one year
> (Thomas has it though).
> 
> So it is fairly possible that in your case you can't fill the link if you
> consume too many descriptors. For example, if your server uses TCP_NODELAY
> and sends incomplete segments (which is quite common), it's very easy to
> run out of descriptors before the link is full.

Or you have a significant number of active tcp connections.

Even if there were no requirement to free the skb quickly, you would
still need to take a 'tx done' interrupt when the link is transmit rate
limited. There are scenarios where there is no receive traffic - e.g.
asymmetric routing - but this is testable with netperf UDP transmits.

	David

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [BUG,REGRESSION?] 3.11.6+,3.12: GbE iface rate drops to few KB/s
  2013-11-17 14:19                 ` Willy Tarreau
@ 2013-11-18 10:26                   ` Thomas Petazzoni
  -1 siblings, 0 replies; 121+ messages in thread
From: Thomas Petazzoni @ 2013-11-18 10:26 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Arnaud Ebalard, Cong Wang, edumazet, linux-arm-kernel, netdev,
	simon.guinot

Willy, All,

On Sun, 17 Nov 2013 15:19:40 +0100, Willy Tarreau wrote:

> On Wed, Nov 13, 2013 at 08:22:57AM +0100, Willy Tarreau wrote:
> > On Tue, Nov 12, 2013 at 04:34:24PM +0100, Arnaud Ebalard wrote:
> > > Can you give a pre-3.11.7 kernel a try if you find the time? I
> > > started working on RN102 during 3.10-rc cycle but do not remember
> > > if I did the first performance tests on 3.10 or 3.11. And if you
> > > find more time, 3.11.7 would be nice too ;-)
> > 
> > Still have not found time for this but I observed something
> > intriguing which might possibly match your experience : if I use
> > large enough send buffers on the mirabox and receive buffers on the
> > client, then the traffic drops for objects larger than 1 MB. I have
> > quickly checked what's happening and it's just that there are
> > pauses of up to 8 ms between some packets when the TCP send window
> > grows larger than about 200 kB. And since there are no drops, there
> > is no reason for the window to shrink. I suspect it's exactly
> > related to the issue explained by Eric about the timer used to
> > recycle the Tx descriptors. However last time I checked, these ones
> > were also processed in the Rx path, which means that the ACKs that
> > flow back should have had the same effect as a Tx IRQ (unless I'd
> > use asymmetric routing, which was not the case). So there might be
> > another issue. Ah, and it only happens with GSO.

I haven't read the entire discussion yet, but do you guys have
http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/drivers/clk/mvebu?id=1022c75f5abd3a3b25e679bc8793d21bedd009b4
applied? It got merged recently, and it fixes a number of networking
problems on Armada 370.

I've added Simon Guinot in Cc, who is the author of this patch.

Best regards,

Thomas
-- 
Thomas Petazzoni, CTO, Free Electrons
Embedded Linux, Kernel and Android engineering
http://free-electrons.com

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [BUG,REGRESSION?] 3.11.6+,3.12: GbE iface rate drops to few KB/s
  2013-11-18 10:26                   ` Thomas Petazzoni
@ 2013-11-18 10:44                     ` Simon Guinot
  -1 siblings, 0 replies; 121+ messages in thread
From: Simon Guinot @ 2013-11-18 10:44 UTC (permalink / raw)
  To: Thomas Petazzoni
  Cc: netdev, Arnaud Ebalard, Vincent Donnefort, edumazet, Cong Wang,
	Willy Tarreau, linux-arm-kernel


On Mon, Nov 18, 2013 at 11:26:01AM +0100, Thomas Petazzoni wrote:
> Willy, All,
> 
> On Sun, 17 Nov 2013 15:19:40 +0100, Willy Tarreau wrote:
> 
> > On Wed, Nov 13, 2013 at 08:22:57AM +0100, Willy Tarreau wrote:
> > > On Tue, Nov 12, 2013 at 04:34:24PM +0100, Arnaud Ebalard wrote:
> > > > Can you give a pre-3.11.7 kernel a try if you find the time? I
> > > > started working on RN102 during 3.10-rc cycle but do not remember
> > > > if I did the first performance tests on 3.10 or 3.11. And if you
> > > > find more time, 3.11.7 would be nice too ;-)
> > > 
> > > Still have not found time for this but I observed something
> > > intriguing which might possibly match your experience : if I use
> > > large enough send buffers on the mirabox and receive buffers on the
> > > client, then the traffic drops for objects larger than 1 MB. I have
> > > quickly checked what's happening and it's just that there are
> > > pauses of up to 8 ms between some packets when the TCP send window
> > > grows larger than about 200 kB. And since there are no drops, there
> > > is no reason for the window to shrink. I suspect it's exactly
> > > related to the issue explained by Eric about the timer used to
> > > recycle the Tx descriptors. However last time I checked, these ones
> > > were also processed in the Rx path, which means that the ACKs that
> > > flow back should have had the same effect as a Tx IRQ (unless I'd
> > > use asymmetric routing, which was not the case). So there might be
> > > another issue. Ah, and it only happens with GSO.
> 
> I haven't read the entire discussion yet, but do you guys have
> http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/drivers/clk/mvebu?id=1022c75f5abd3a3b25e679bc8793d21bedd009b4
> applied? It got merged recently, and it fixes a number of networking
> problems on Armada 370.
> 
> I've added Simon Guinot in Cc, who is the author of this patch.

I don't think it is related. We have also noticed a huge performance
regression. Reverting the following patch restores the rate:

c9eeec26 tcp: TSQ can use a dynamic limit

I don't understand why...

Regards,

Simon

> 
> Best regards,
> 
> Thomas
> -- 
> Thomas Petazzoni, CTO, Free Electrons
> Embedded Linux, Kernel and Android engineering
> http://free-electrons.com

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [BUG,REGRESSION?] 3.11.6+,3.12: GbE iface rate drops to few KB/s
  2013-11-18 10:26                   ` Thomas Petazzoni
@ 2013-11-18 10:51                     ` Willy Tarreau
  -1 siblings, 0 replies; 121+ messages in thread
From: Willy Tarreau @ 2013-11-18 10:51 UTC (permalink / raw)
  To: Thomas Petazzoni
  Cc: Arnaud Ebalard, Cong Wang, edumazet, linux-arm-kernel, netdev,
	simon.guinot

Hi Thomas,

On Mon, Nov 18, 2013 at 11:26:01AM +0100, Thomas Petazzoni wrote:
> I haven't read the entire discussion yet, but do you guys have
> http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/drivers/clk/mvebu?id=1022c75f5abd3a3b25e679bc8793d21bedd009b4
> applied? It got merged recently, and it fixes a number of networking
> problems on Armada 370.

No, because my version was even older than the code which introduced this
issue :-)

The main issue is related to something we discussed a while ago and which
surprised both of us: the use of a Tx timer to release the Tx descriptors.
I remember I considered that it was not a big issue because the flush was
also done in the Rx path (thus on ACKs), but I can't find any trace of
that code, so my analysis was wrong. Thus we can hit some situations where
we fill the descriptors before filling the link.

Ideally we should have a Tx IRQ. At the very least we should call the tx
refill function in mvneta_poll() I believe. I can try to do it but I'd
rather have the Tx IRQ working instead.

Regards,
Willy

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [BUG,REGRESSION?] 3.11.6+,3.12: GbE iface rate drops to few KB/s
  2013-11-18 10:09                   ` David Laight
@ 2013-11-18 10:52                     ` Willy Tarreau
  -1 siblings, 0 replies; 121+ messages in thread
From: Willy Tarreau @ 2013-11-18 10:52 UTC (permalink / raw)
  To: David Laight
  Cc: Arnaud Ebalard, Cong Wang, edumazet, linux-arm-kernel, netdev,
	Thomas Petazzoni

On Mon, Nov 18, 2013 at 10:09:53AM -0000, David Laight wrote:
> > So it is fairly possible that in your case you can't fill the link if you
> > consume too many descriptors. For example, if your server uses TCP_NODELAY
> > and sends incomplete segments (which is quite common), it's very easy to
> > run out of descriptors before the link is full.
> 
> Or you have a significant number of active tcp connections.
> 
> Even if there were no requirement to free the skb quickly you still
> need to take a 'tx done' interrupt when the link is transmit rate limited.
> There are scenarios when there is no receive traffic - eg asymmetric
> routing, but testable with netperf UDP transmits.

Yes absolutely, but I was talking about the current situation that Arnaud
is facing and which I could reproduce with a large window.

Willy

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH v2] tcp: tsq: restore minimal amount of queueing
  2013-11-17 23:15                 ` Francois Romieu
@ 2013-11-18 16:26                   ` Holger Hoffstätte
  2013-11-18 16:47                     ` Eric Dumazet
  0 siblings, 1 reply; 121+ messages in thread
From: Holger Hoffstätte @ 2013-11-18 16:26 UTC (permalink / raw)
  To: Francois Romieu; +Cc: netdev

On 11/18/13 00:15, Francois Romieu wrote:
> Holger Hoffstaette <holger.hoffstaette@googlemail.com> :
> [...]
>> Since I saw this with r8169->r8169 and e1000e->r8169 it's probably
>> everyone's favourite r8169 :)
>> Unfortunately I can't be more help but if you can suggest/whip up a fix
>> I'd be happy to help test.
> 
> The r8169 driver does not rely on a timer for Tx completion.

Thanks, that's good to hear.

> The patch below should not hurt.

It does not seem to hurt, but neither can I notice much of a change.
However that's probably because of some other side effects, see below.

Do I understand the diff correctly that it makes the driver perform
outstanding transmissions before budgeting reads? Just curious.

> Can you describe your system a bit more specifically ?

Server has r8169, client is either r8169 (Windows/linux) or Thinkpad
with e1000e. Clients use NFS & Samba. Since Eric's TSQ patch the erratic
3.12.0-vanilla behaviour has "stabilized" in the sense that latency &
throughout became relatively smooth and more or less as expected, both
for large copies and many small files.

However since then I found that increasing the tcp_limit_output_bytes to
262144 (twice the default of 128k) makes things really fly. Copying
large files (>1GB) over NFS from the e1000e now quickly reaches the full
1Gb line throughput. This was really surprising.

Apart from the laptop being relatively old and being difficult to
benchmark due to typical power state scaling, I suspect the e1000e
running with dynamic interrupt moderation is not completely innocent
either. I used to turn this off some years back and had great success,
but that was on Windows.

Long story short there's just too much up- and downscaling, buffering
and queueing involved in all parts to point to any single culprit, but
increasing the byte limit *has* helped everywhere and had no noticeable
impact on internet traffic. I understand the motivation for small queues
from a bufferbloat-fighting point of view (using fq_codel did wonders
for a friend without external router!), but apparently for LAN traffic
this doesn't seem to work in all cases.

Not sure if any of this is helpful. :)

cheers
Holger

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH v2] tcp: tsq: restore minimal amount of queueing
  2013-11-18 16:26                   ` Holger Hoffstätte
@ 2013-11-18 16:47                     ` Eric Dumazet
  0 siblings, 0 replies; 121+ messages in thread
From: Eric Dumazet @ 2013-11-18 16:47 UTC (permalink / raw)
  To: Holger Hoffstätte; +Cc: Francois Romieu, netdev

On Mon, 2013-11-18 at 17:26 +0100, Holger Hoffstätte wrote:
> On 11/18/13 00:15, Francois Romieu wrote:
> > Holger Hoffstaette <holger.hoffstaette@googlemail.com> :
> > [...]
> >> Since I saw this with r8169->r8169 and e1000e->r8169 it's probably
> >> everyone's favourite r8169 :)
> >> Unfortunately I can't be more help but if you can suggest/whip up a fix
> >> I'd be happy to help test.
> > 
> > The r8169 driver does not rely on a timer for Tx completion.
> 
> Thankx, that's good to hear.
> 
> > The patch below should not hurt.
> 
> It does not seem to hurt, but neither can I notice much of a change.
> However that's probably because of some other side effects, see below.
> 
> Do I understand the diff correctly that it makes the driver perform
> outstanding transmissions before budgeting reads? Just curious.
> 
> > Can you describe your system a bit more specifically ?
> 
> Server has r8169, client is either r8169 (Windows/linux) or Thinkpad
> with e1000e. Clients use NFS & Samba. Since Eric's TSQ patch the erratic
> 3.12.0-vanilla behaviour has "stabilized" in the sense that latency &
> throughout became relatively smooth and more or less as expected, both
> for large copies and many small files.
> 
> However since then I found that increasing the tcp_limit_output_bytes to
> 262144 (twice the default of 128k) makes things really fly. Copying
> large files (>1GB) over NFS from the e1000e now quickly reaches the full
> 1Gb line throughput. This was really surprising.
> 
> Apart from the laptop being relatively old and being difficult to
> benchmark due to typical power state scaling, I suspect the e1000e
> running with dynamic interrupt moderation is not completely innocent
> either. I used to turn this off some years back and had great success,
> but that was on Windows.

I think it would make sense to instrument the delay between
ndo_start_xmit() and kfree_skb() for transmitted skbs.

We might be in for a surprise with some drivers, seeing delays on the
order of several ms...
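
A rough sketch of what such instrumentation could look like (hypothetical
driver-side code, not from any existing driver: struct my_tx_info and the
two hooks are placeholders for whatever per-descriptor bookkeeping a
driver already has):

#include <linux/hrtimer.h>
#include <linux/ktime.h>
#include <linux/netdevice.h>
#include <linux/skbuff.h>

struct my_tx_info {
	struct sk_buff *skb;
	ktime_t xmit_time;
};

/* called from ndo_start_xmit(), once the skb is mapped to a descriptor */
static void my_record_xmit(struct my_tx_info *info, struct sk_buff *skb)
{
	info->skb = skb;
	info->xmit_time = ktime_get();
}

/* called from the TX completion path, just before freeing the skb */
static void my_report_completion(struct my_tx_info *info)
{
	s64 delta_us = ktime_to_us(ktime_sub(ktime_get(), info->xmit_time));

	if (delta_us > 1000)	/* more than 1ms between xmit and free */
		pr_info("tx completion delayed by %lld us\n",
			(long long)delta_us);
	dev_kfree_skb_any(info->skb);
}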

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [BUG,REGRESSION?] 3.11.6+,3.12: GbE iface rate drops to few KB/s
  2013-11-18 10:44                     ` Simon Guinot
@ 2013-11-18 16:54                       ` Stephen Hemminger
  -1 siblings, 0 replies; 121+ messages in thread
From: Stephen Hemminger @ 2013-11-18 16:54 UTC (permalink / raw)
  To: Simon Guinot
  Cc: Thomas Petazzoni, Willy Tarreau, Arnaud Ebalard, Cong Wang,
	edumazet, linux-arm-kernel, netdev, Vincent Donnefort

On Mon, 18 Nov 2013 11:44:48 +0100
Simon Guinot <simon.guinot@sequanux.org> wrote:

> On Mon, Nov 18, 2013 at 11:26:01AM +0100, Thomas Petazzoni wrote:
> > Willy, All,
> > 
> > On Sun, 17 Nov 2013 15:19:40 +0100, Willy Tarreau wrote:
> > 
> > > On Wed, Nov 13, 2013 at 08:22:57AM +0100, Willy Tarreau wrote:
> > > > On Tue, Nov 12, 2013 at 04:34:24PM +0100, Arnaud Ebalard wrote:
> > > > > Can you give a pre-3.11.7 kernel a try if you find the time? I
> > > > > started working on RN102 during 3.10-rc cycle but do not remember
> > > > > if I did the first performance tests on 3.10 or 3.11. And if you
> > > > > find more time, 3.11.7 would be nice too ;-)
> > > > 
> > > > Still have not found time for this but I observed something
> > > > intriguing which might possibly match your experience : if I use
> > > > large enough send buffers on the mirabox and receive buffers on the
> > > > client, then the traffic drops for objects larger than 1 MB. I have
> > > > quickly checked what's happening and it's just that there are
> > > > pauses of up to 8 ms between some packets when the TCP send window
> > > > grows larger than about 200 kB. And since there are no drops, there
> > > > is no reason for the window to shrink. I suspect it's exactly
> > > > related to the issue explained by Eric about the timer used to
> > > > recycle the Tx descriptors. However last time I checked, these ones
> > > > were also processed in the Rx path, which means that the ACKs that
> > > > flow back should have had the same effect as a Tx IRQ (unless I'd
> > > > use asymmetric routing, which was not the case). So there might be
> > > > another issue. Ah, and it only happens with GSO.
> > 
> > I haven't read the entire discussion yet, but do you guys have
> > http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/drivers/clk/mvebu?id=1022c75f5abd3a3b25e679bc8793d21bedd009b4
> > applied? It got merged recently, and it fixes a number of networking
> > problems on Armada 370.
> > 
> > I've added Simon Guinot in Cc, who is the author of this patch.
> 
> I don't think it is related. We also have noticed a huge performance
> regression. Reverting the following patch restores the rate:
> 
> c9eeec26 tcp: TSQ can use a dynamic limit
> 

But without that patch there was a performance regression for high-speed
interfaces which was caused by TSQ: 10G performance dropped to 8G.

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [BUG,REGRESSION?] 3.11.6+,3.12: GbE iface rate drops to few KB/s
  2013-11-18 16:54                       ` Stephen Hemminger
@ 2013-11-18 17:13                         ` Eric Dumazet
  -1 siblings, 0 replies; 121+ messages in thread
From: Eric Dumazet @ 2013-11-18 17:13 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Simon Guinot, Thomas Petazzoni, Willy Tarreau, Arnaud Ebalard,
	Cong Wang, edumazet, linux-arm-kernel, netdev, Vincent Donnefort

On Mon, 2013-11-18 at 08:54 -0800, Stephen Hemminger wrote:
> On Mon, 18 Nov 2013 11:44:48 +0100
> Simon Guinot <simon.guinot@sequanux.org> wrote:

> > c9eeec26 tcp: TSQ can use a dynamic limit
> > 
> 
> But without that patch there was a performance regression for high speed
> interfaces whihc was caused by TSQ. 10G performance dropped to 8G

Yes, this made sure we could feed more than 2 TSO packets into the TX ring.

But decreasing the minimal amount of queueing from 128KB to ~1ms worth of
the current rate did not please NICs which can have a big delay between
ndo_start_xmit() and the actual skb freeing (TX completion).

So
http://git.kernel.org/cgit/linux/kernel/git/davem/net.git/commit/?id=98e09386c0ef4dfd48af7ba60ff908f0d525cdee

restored this minimal amount of buffering, and left the bigger amounts to
40Gb NICs ;)

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [BUG,REGRESSION?] 3.11.6+,3.12: GbE iface rate drops to few KB/s
  2013-11-18 10:51                     ` Willy Tarreau
@ 2013-11-18 17:58                       ` Florian Fainelli
  -1 siblings, 0 replies; 121+ messages in thread
From: Florian Fainelli @ 2013-11-18 17:58 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Thomas Petazzoni, simon.guinot, netdev, Arnaud Ebalard, edumazet,
	Cong Wang, linux-arm-kernel

Hello Willy, Thomas,

2013/11/18 Willy Tarreau <w@1wt.eu>:
> Hi Thomas,
>
> On Mon, Nov 18, 2013 at 11:26:01AM +0100, Thomas Petazzoni wrote:
>> I haven't read the entire discussion yet, but do you guys have
>> http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/drivers/clk/mvebu?id=1022c75f5abd3a3b25e679bc8793d21bedd009b4
>> applied? It got merged recently, and it fixes a number of networking
>> problems on Armada 370.
>
> No, because my version was even older than the code which introduced this
> issue :-)
>
> The main issue is related to something we discussed once ago which surprized
> both of us, the use of a Tx timer to release the Tx descriptors. I remember
> I considered that it was not a big issue because the flush was also done in
> the Rx path (thus on ACKs) but I can't find trace of this code so my analysis
> was wrong. Thus we can hit some situations where we fill the descriptors
> before filling the link.

So long as you are using TCP this works, because the ACKs will somehow
create an artificial "forced" completion of your transmitted SKBs, but
what about a UDP streamer use case? In that case you will quickly fill
up all of your descriptors and have to wait for them to be freed by the
10ms timer. I do not think this is desirable at all, and it will require
very large UDP sender socket buffers. I remember asking Thomas what the
reason was for not using the TX completion IRQ in the first incarnation
of the driver, but I do not quite remember the answer.

If the original mvneta driver authors' fear was that TX completion
could generate too many IRQs, they should use netif_stop_queue() /
netif_wake_queue() and mask interrupts off/on appropriately to slow
down the pace of TX interrupts.

>
> Ideally we should have a Tx IRQ. At the very least we should call the tx
> refill function in mvneta_poll() I believe. I can try to do it but I'd
> rather have the Tx IRQ working instead.

Right, actually you should do both: free transmitted SKBs from your
NAPI poll callback and from the TX completion IRQ, to ensure SKBs are
freed in time no matter what workload/use case is being used.
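
In pattern form, that recommendation looks roughly like the sketch below
(generic and hypothetical: struct my_priv, my_clean_tx_ring(), my_rx()
and my_enable_irqs() stand in for a driver's own state and helpers; this
is not mvneta code):

#include <linux/netdevice.h>

struct my_priv {
	struct napi_struct napi;
	/* ... driver-private ring state ... */
};

void my_clean_tx_ring(struct my_priv *priv);
int my_rx(struct my_priv *priv, int budget);
void my_enable_irqs(struct my_priv *priv);

static int my_poll(struct napi_struct *napi, int budget)
{
	struct my_priv *priv = container_of(napi, struct my_priv, napi);
	int work_done;

	/* reclaim already-transmitted skbs first, so TX completions are
	 * reported promptly and independently of RX traffic */
	my_clean_tx_ring(priv);

	/* then consume up to 'budget' received packets */
	work_done = my_rx(priv, budget);

	if (work_done < budget) {
		napi_complete(napi);
		/* re-enable both RX and TX-done interrupts here */
		my_enable_irqs(priv);
	}
	return work_done;
}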
-- 
Florian

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [BUG,REGRESSION?] 3.11.6+,3.12: GbE iface rate drops to few KB/s
  2013-11-17 17:41                   ` Eric Dumazet
@ 2013-11-19  6:44                     ` Arnaud Ebalard
  -1 siblings, 0 replies; 121+ messages in thread
From: Arnaud Ebalard @ 2013-11-19  6:44 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Thomas Petazzoni, Florian Fainelli, simon.guinot, netdev,
	edumazet, Cong Wang, Willy Tarreau, linux-arm-kernel

Hi,

Eric Dumazet <eric.dumazet@gmail.com> writes:

> On Sun, 2013-11-17 at 15:19 +0100, Willy Tarreau wrote:
>
>> 
>> So it is fairly possible that in your case you can't fill the link if you
>> consume too many descriptors. For example, if your server uses TCP_NODELAY
>> and sends incomplete segments (which is quite common), it's very easy to
>> run out of descriptors before the link is full.
>
> BTW I have a very simple patch for TCP stack that could help this exact
> situation...
>
> Idea is to use TCP Small Queue so that we dont fill qdisc/TX ring with
> very small frames, and let tcp_sendmsg() have more chance to fill
> complete packets.
>
> Again, for this to work very well, you need that NIC performs TX
> completion in reasonable amount of time...
>
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 3dc0c6c..10456cf 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -624,13 +624,19 @@ static inline void tcp_push(struct sock *sk, int flags, int mss_now,
>  {
>  	if (tcp_send_head(sk)) {
>  		struct tcp_sock *tp = tcp_sk(sk);
> +		struct sk_buff *skb = tcp_write_queue_tail(sk);
>  
>  		if (!(flags & MSG_MORE) || forced_push(tp))
> -			tcp_mark_push(tp, tcp_write_queue_tail(sk));
> +			tcp_mark_push(tp, skb);
>  
>  		tcp_mark_urg(tp, flags);
> -		__tcp_push_pending_frames(sk, mss_now,
> -					  (flags & MSG_MORE) ? TCP_NAGLE_CORK : nonagle);
> +		if (flags & MSG_MORE)
> +			nonagle = TCP_NAGLE_CORK;
> +		if (atomic_read(&sk->sk_wmem_alloc) > 2048) {
> +			set_bit(TSQ_THROTTLED, &tp->tsq_flags);
> +			nonagle = TCP_NAGLE_CORK;
> +		}
> +		__tcp_push_pending_frames(sk, mss_now, nonagle);
>  	}
>  }

I did some tests regarding mvneta perf on the current Linus tree (commit
2d3c627502f2a9b0, w/ c9eeec26e32e "tcp: TSQ can use a dynamic limit"
reverted). It has Simon's tclk patch for mvebu (1022c75f5abd, "clk:
armada-370: fix tclk frequencies"). The kernel has some debug options
enabled and the patch above is not applied. I will spend some time on
these two directions this evening. The idea was to get some numbers on
the impact of the TCP send window size and tcp_limit_output_bytes for
mvneta.

The test is done with a laptop (Debian, 3.11.0, e1000e) directly
connected to a RN102 (Marvell Armada 370 @1.2GHz, mvneta). The RN102
is running Debian armhf with Apache2 serving a 1GB file from ext4
over lvm over RAID1 on 2 WD30EFRX drives. The client is nothing fancy,
i.e. a simple wget w/ the -O /dev/null option.

With the exact same setup on a ReadyNAS Duo v2 (Kirkwood 88f6282
@1.6GHz, mv643xx_eth), I managed to get a throughput of 108MB/s
(cannot remember the kernel version, but something between 3.8 and 3.10).

So with that setup:

w/ TCP send window set to   4MB: 17.4 MB/s
w/ TCP send window set to   2MB: 16.2 MB/s
w/ TCP send window set to   1MB: 15.6 MB/s
w/ TCP send window set to 512KB: 25.6 MB/s
w/ TCP send window set to 256KB: 57.7 MB/s
w/ TCP send window set to 128KB: 54.0 MB/s
w/ TCP send window set to  64KB: 46.2 MB/s
w/ TCP send window set to  32KB: 42.8 MB/s

Then, I started playing w/ tcp_limit_output_bytes (default is 131072),
w/ TCP send window set to 256KB:

tcp_limit_output_bytes set to 512KB: 59.3 MB/s
tcp_limit_output_bytes set to 256KB: 58.5 MB/s
tcp_limit_output_bytes set to 128KB: 56.2 MB/s
tcp_limit_output_bytes set to  64KB: 32.1 MB/s
tcp_limit_output_bytes set to  32KB: 4.76 MB/s

As a side note, during the test I sometimes get peaks at 90MB/s for a
few seconds at the beginning, which tends to confirm what Willy wrote,
i.e. that the hardware can do more.

Cheers,

a+

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [BUG,REGRESSION?] 3.11.6+,3.12: GbE iface rate drops to few KB/s
  2013-11-19  6:44                     ` Arnaud Ebalard
@ 2013-11-19 13:53                       ` Eric Dumazet
  -1 siblings, 0 replies; 121+ messages in thread
From: Eric Dumazet @ 2013-11-19 13:53 UTC (permalink / raw)
  To: Arnaud Ebalard
  Cc: Willy Tarreau, Thomas Petazzoni, netdev, edumazet, Cong Wang,
	linux-arm-kernel, Florian Fainelli, simon.guinot

On Tue, 2013-11-19 at 07:44 +0100, Arnaud Ebalard wrote:

> I did some test regarding mvneta perf on current linus tree (commit
> 2d3c627502f2a9b0, w/ c9eeec26e32e "tcp: TSQ can use a dynamic limit"
> reverted). It has Simon's tclk patch for mvebu (1022c75f5abd, "clk:
> armada-370: fix tclk frequencies"). Kernel has some debug options
> enabled and the patch above is not applied. I will spend some time on
> this two directions this evening. The idea was to get some numbers on
> the impact of TCP send window size and tcp_limit_output_bytes for
> mvneta.

Note the last patch I sent was not relevant to your problem, do not
bother trying it. It's useful for applications doing lots of
consecutive short writes, like an interactive ssh session launching
line-buffered commands.

>  
> 
> The test is done with a laptop (Debian, 3.11.0, e1000e) directly
> connected to a RN102 (Marvell Armada 370 @1.2GHz, mvneta). The RN102
> is running Debian armhf with an Apache2 serving a 1GB file from ext4
> over lvm over RAID1 from 2 WD30EFRX. The client is nothing fancy, i.e.
> a simple wget w/ -O /dev/null option.
> 
> With the exact same setup on a ReadyNAS Duo v2 (Kirkwood 88f6282
> @1.6GHz, mv643xx_eth), I managed to get a throughput of 108MB/s
> (cannot remember the kernel version but sth between 3.8 and 3.10.
> 
> So with that setup:
> 
> w/ TCP send window set to   4MB: 17.4 MB/s
> w/ TCP send window set to   2MB: 16.2 MB/s
> w/ TCP send window set to   1MB: 15.6 MB/s
> w/ TCP send window set to 512KB: 25.6 MB/s
> w/ TCP send window set to 256KB: 57.7 MB/s
> w/ TCP send window set to 128KB: 54.0 MB/s
> w/ TCP send window set to  64KB: 46.2 MB/s
> w/ TCP send window set to  32KB: 42.8 MB/s

One of the problems is that tcp_sendmsg() holds the socket lock for
the whole duration of the system call if it does not have to sleep.
This model doesn't allow incoming ACKs to be processed (they are put in
the socket backlog and processed at socket release time), nor TX
completions to queue the next chunk.

These strange results you have tend to show that if you have a big TCP
send window, the web server pushes a lot of bytes per system call and
might stall the ACK clocking or TX refills.

> 
> Then, I started playing w/ tcp_limit_output_bytes (default is 131072),
> w/ TCP send window set to 256KB:
> 
> tcp_limit_output_bytes set to 512KB: 59.3 MB/s
> tcp_limit_output_bytes set to 256KB: 58.5 MB/s
> tcp_limit_output_bytes set to 128KB: 56.2 MB/s
> tcp_limit_output_bytes set to  64KB: 32.1 MB/s
> tcp_limit_output_bytes set to  32KB: 4.76 MB/s
> 
> As a side note, during the test, I sometimes gets peak for some seconds
> at the beginning at 90MB/s which tend to confirm what WIlly wrote,
> i.e. that the hardware can do more.

I would also check the receiver. I suspect packet drops because of a
bad driver overshooting skb->truesize.

nstat >/dev/null ; wget .... ; nstat
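
(The first nstat just resets the counter deltas; the second one, run
right after the transfer, shows what changed. Retransmit and drop
related counters such as TcpRetransSegs are the ones to compare on both
sides.)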

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [BUG,REGRESSION?] 3.11.6+,3.12: GbE iface rate drops to few KB/s
  2013-11-19 13:53                       ` Eric Dumazet
@ 2013-11-19 17:43                         ` Willy Tarreau
  -1 siblings, 0 replies; 121+ messages in thread
From: Willy Tarreau @ 2013-11-19 17:43 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Thomas Petazzoni, Florian Fainelli, simon.guinot, netdev,
	Arnaud Ebalard, edumazet, Cong Wang, linux-arm-kernel

Hi Eric,

On Tue, Nov 19, 2013 at 05:53:14AM -0800, Eric Dumazet wrote:
> These strange results you have tend to show that if you have a big TCP
> send window, the web server pushes a lot of bytes per system call and
> might stall the ACK clocking or TX refills.

It's the TX refills which are not done in this case, from what I think
I understood of the driver. IIRC, the refill is done once at the
beginning of xmit and in the tx timer callback. So if you have too
large a window that fills the descriptors during a few tx calls during
which no descriptor was released, you could end up having to wait for
the timer since you're not allowed to send anymore.

> > Then, I started playing w/ tcp_limit_output_bytes (default is 131072),
> > w/ TCP send window set to 256KB:
> > 
> > tcp_limit_output_bytes set to 512KB: 59.3 MB/s
> > tcp_limit_output_bytes set to 256KB: 58.5 MB/s
> > tcp_limit_output_bytes set to 128KB: 56.2 MB/s
> > tcp_limit_output_bytes set to  64KB: 32.1 MB/s
> > tcp_limit_output_bytes set to  32KB: 4.76 MB/s
> > 
> > As a side note, during the test, I sometimes gets peak for some seconds
> > at the beginning at 90MB/s which tend to confirm what WIlly wrote,
> > i.e. that the hardware can do more.
> 
> I would also check the receiver. I suspect packets drops because of a
> bad driver doing skb->truesize overshooting.

When I first observed this problem, I suspected my laptop's driver, so
I tried with a dockstar instead and the issue disappeared... until I
increased the tcp_rmem on it to match my laptop :-)

Arnaud, you might be interested in checking whether the following
change in mvneta.c does something for you:

- #define MVNETA_TX_DONE_TIMER_PERIOD 10
+ #define MVNETA_TX_DONE_TIMER_PERIOD (1000/HZ)

This can only have any effect if you run at 250 or 1000 Hz, but not at 100
of course. It should reduce the time to first IRQ.
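
To check which HZ the kernel was built with, something like the
following works (a sketch; the config location depends on the distro,
and /proc/config.gz needs CONFIG_IKCONFIG_PROC):

  zcat /proc/config.gz | grep '^CONFIG_HZ'
  grep '^CONFIG_HZ' /boot/config-$(uname -r)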

Willy

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [BUG,REGRESSION?] 3.11.6+,3.12: GbE iface rate drops to few KB/s
  2013-11-19 17:43                         ` Willy Tarreau
@ 2013-11-19 18:31                           ` Eric Dumazet
  -1 siblings, 0 replies; 121+ messages in thread
From: Eric Dumazet @ 2013-11-19 18:31 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Arnaud Ebalard, Thomas Petazzoni, netdev, edumazet, Cong Wang,
	linux-arm-kernel, Florian Fainelli, simon.guinot

On Tue, 2013-11-19 at 18:43 +0100, Willy Tarreau wrote:

> - #define MVNETA_TX_DONE_TIMER_PERIOD 10
> + #define MVNETA_TX_DONE_TIMER_PERIOD (1000/HZ)
> 

I suggested this in a prior mail :

#define MVNETA_TX_DONE_TIMER_PERIOD 1

But apparently it was triggering strange crashes...

> This can only have any effect if you run at 250 or 1000 Hz, but not at 100
> of course. It should reduce the time to first IRQ.

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [BUG,REGRESSION?] 3.11.6+,3.12: GbE iface rate drops to few KB/s
  2013-11-19 18:31                           ` Eric Dumazet
@ 2013-11-19 18:41                             ` Willy Tarreau
  -1 siblings, 0 replies; 121+ messages in thread
From: Willy Tarreau @ 2013-11-19 18:41 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Thomas Petazzoni, Florian Fainelli, simon.guinot, netdev,
	Arnaud Ebalard, edumazet, Cong Wang, linux-arm-kernel

On Tue, Nov 19, 2013 at 10:31:50AM -0800, Eric Dumazet wrote:
> On Tue, 2013-11-19 at 18:43 +0100, Willy Tarreau wrote:
> 
> > - #define MVNETA_TX_DONE_TIMER_PERIOD 10
> > + #define MVNETA_TX_DONE_TIMER_PERIOD (1000/HZ)
> > 
> 
> I suggested this in a prior mail :
> 
> #define MVNETA_TX_DONE_TIMER_PERIOD 1

Ah sorry, I remember now.

> But apparently it was triggering strange crashes...

Ah, when a bug hides another one, it's the situation I prefer, because
by working on one, you end up fixing two :-)

Cheers,
Willy

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [BUG,REGRESSION?] 3.11.6+,3.12: GbE iface rate drops to few KB/s
  2013-11-19 18:41                             ` Willy Tarreau
@ 2013-11-19 23:53                               ` Arnaud Ebalard
  -1 siblings, 0 replies; 121+ messages in thread
From: Arnaud Ebalard @ 2013-11-19 23:53 UTC (permalink / raw)
  To: Willy Tarreau, Eric Dumazet
  Cc: Thomas Petazzoni, Florian Fainelli, simon.guinot, netdev,
	edumazet, Cong Wang, linux-arm-kernel

Hi,

Willy Tarreau <w@1wt.eu> writes:

> On Tue, Nov 19, 2013 at 10:31:50AM -0800, Eric Dumazet wrote:
>> On Tue, 2013-11-19 at 18:43 +0100, Willy Tarreau wrote:
>> 
>> > - #define MVNETA_TX_DONE_TIMER_PERIOD 10
>> > + #define MVNETA_TX_DONE_TIMER_PERIOD (1000/HZ)
>> > 
>> 
>> I suggested this in a prior mail :
>> 
>> #define MVNETA_TX_DONE_TIMER_PERIOD 1
>
> Ah sorry, I remember now.
>
>> But apparently it was triggering strange crashes...
>
> Ah, when a bug hides another one, it's the situation I prefer, because
> by working on one, you end up fixing two :-)

Follow me just for one sec: today, I got a USB 3.0 Gigabit Ethernet
adapter. More specifically an AX88179-based one (Logitec LAN-GTJU3H3),
about which there is currently a thread on netdev and linux-usb
lists. Anyway, I decided to give it a try on my RN102 just to check what
performance I could achieve. So I basically did the same experiment as
yesterday (wget on the client against a 1GB file located on the
filesystem served by an apache on the NAS) except that this time the
AX88179-based adapter was used instead of the mvneta-based interface.
Well, the download started at a high rate (90MB/s) but then dropped and
I got some SATA errors on the NAS (similar to the errors I already got
during the 3.12-rc series [1] and finally, *erroneously*, considered an
artefact).

So I decided to remove the SATA controllers and disks from the equation:
I switched to my ReadyNAS 2120, whose GbE interfaces are also based on
the mvneta driver and which comes w/ 2GB of RAM. The main additional
difference is that the device is a dual-core Armada @1.2GHz, where the
RN102 is a single-core Armada @1.2GHz. I created a dummy 1GB file *in RAM*
(/run/shm) to have it served by the apache2 instead of the file
previously stored on the disks. 
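
Roughly like this (a sketch; the paths are only an example and the
exact commands are not shown):

  dd if=/dev/urandom of=/run/shm/random bs=1M count=1024
  # serve it through the existing apache2, e.g. via a symlink into the
  # document root (path assumed)
  ln -s /run/shm/random /var/www/random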

I started w/ today's Linus tree (dec8e46178b) with Eric's revert patch
for c9eeec26e32e (tcp: TSQ can use a dynamic limit) and also the change
to the mvneta driver to have:

-#define MVNETA_TX_DONE_TIMER_PERIOD    10
+#define MVNETA_TX_DONE_TIMER_PERIOD    1

Here are the average speeds given by wget for the following TCP send
window sizes:

   4 MB:  19 MB/s
   2 MB:  21 MB/s
   1 MB:  21 MB/s
  512KB:  23 MB/s
  384KB: 105 MB/s
  256KB: 112 MB/s
  128KB: 111 MB/s
   64KB:  93 MB/s

Then, I decided to redo the exact same test w/o the change on
MVNETA_TX_DONE_TIMER_PERIOD (i.e. w/ the initial value of 10). I got
the exact same results as with MVNETA_TX_DONE_TIMER_PERIOD set to 1,
i.e.:

   4 MB:  20 MB/s
   2 MB:  21 MB/s
   1 MB:  21 MB/s
  512KB:  22 MB/s
  384KB: 105 MB/s
  256KB: 112 MB/s
  128KB: 111 MB/s
   64KB:  93 MB/s

And then, I also dropped Eric's revert patch for c9eeec26e32e (tcp: TSQ
can use a dynamic limit), just to verify we came back to where the
thread started, but I got a surprise:

   4 MB:  10 MB/s
   2 MB:  11 MB/s
   1 MB:  10 MB/s
  512KB:  12 MB/s
  384KB: 104 MB/s
  256KB: 112 MB/s
  128KB: 112 MB/s
   64KB:  93 MB/s

Instead of the 256KB/s I had observed previously, the low value was now
10MB/s. I thought it was due to the switch from the RN102 to the RN2120,
so I came back to the RN102 w/o any specific patch for mvneta nor your
revert patch for c9eeec26e32e, i.e. only Linus' tree as it is today
(dec8e46178b). The file is served from the disk:

   4 MB:   5 MB/s
   2 MB:   5 MB/s
   1 MB:   5 MB/s
  512KB:   5 MB/s
  384KB:  90 MB/s for 4s, then 3 MB/s
  256KB:  80 MB/s for 3s, then 2 MB/s
  128KB:  90 MB/s for 3s, then 3 MB/s
   64KB:  80 MB/s for 3s, then 3 MB/S

Then, I allocated a dummy 400MB file in RAM (/run/shm) and redid the
test on the RN102:

   4 MB:   8 MB/s
   2 MB:   8 MB/s
   1 MB:  92 MB/s
  512KB:  90 MB/s
  384KB:  90 MB/s
  256KB:  90 MB/s
  128KB:  90 MB/s
   64KB:  60 MB/s

In the end, here are the conclusions *I* draw from this test session,
do not hesitate to correct me:

 - Eric, it seems something changed in Linus' tree between the
   beginning of the thread and now, which somehow reduces the effect of
   the regression we were seeing: I never got back the 256KB/s.
 - Your revert patch still improves the perf a lot
 - It seems reducing MVNETA_TX_DONE_TIMER_PERIOD does not help
 - w/ your revert patch, I can confirm that the mvneta driver is capable
   of doing line rate w/ a proper tweak of the TCP send window (256KB
   instead of 4MB)
 - It seems I will have to spend some time on the SATA issues I
   previously thought were an artefact of not cleaning my tree during a
   debug session [1], i.e. there is IMHO an issue.

What I do not get is what can cause the perf to drop from 90MB/s to
3MB/s (w/ a 256KB send window) when streaming from the disk instead of
from RAM. I have no issue having dd read from the fs @ 150MB/s, or
mvneta streaming from RAM @ 90MB/s, but both together get me 3MB/s
after a few seconds.
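
The dd check mentioned above is just something along these lines, with
the file path assumed:

  echo 3 > /proc/sys/vm/drop_caches    # make sure we really hit the disks
  dd if=/path/to/random of=/dev/null bs=1M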

Anyway, I think if the thread keeps going on improving mvneta, I'll do
all additional tests from RAM and will stop polluting netdev w/ possible
sata/disk/fs issues.

Cheers,

a+

[1]: http://thread.gmane.org/gmane.linux.ports.arm.kernel/271508

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [BUG,REGRESSION?] 3.11.6+,3.12: GbE iface rate drops to few KB/s
  2013-11-19 23:53                               ` Arnaud Ebalard
@ 2013-11-20  0:08                                 ` Eric Dumazet
  -1 siblings, 0 replies; 121+ messages in thread
From: Eric Dumazet @ 2013-11-20  0:08 UTC (permalink / raw)
  To: Arnaud Ebalard
  Cc: Willy Tarreau, Thomas Petazzoni, Florian Fainelli, simon.guinot,
	netdev, edumazet, Cong Wang, linux-arm-kernel

On Wed, 2013-11-20 at 00:53 +0100, Arnaud Ebalard wrote:

> Anyway, I think if the thread keeps going on improving mvneta, I'll do
> all additional tests from RAM and will stop polluting netdev w/ possible
> sata/disk/fs issues.

;)

An alternative would be to use netperf or iperf to not use the disk at
all and focus on TCP/network issues only.
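
A minimal sketch, with the NAS address as a placeholder:

  netserver                                   # netperf's server side, on the NAS
  netperf -H <nas_ip> -t TCP_STREAM -l 30     # on the laptop
  # or, with iperf:
  iperf -s                                    # on the NAS
  iperf -c <nas_ip> -t 30                     # on the laptop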

Thanks !

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [BUG,REGRESSION?] 3.11.6+,3.12: GbE iface rate drops to few KB/s
  2013-11-20  0:08                                 ` Eric Dumazet
@ 2013-11-20  0:35                                   ` Willy Tarreau
  -1 siblings, 0 replies; 121+ messages in thread
From: Willy Tarreau @ 2013-11-20  0:35 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Thomas Petazzoni, Florian Fainelli, simon.guinot, netdev,
	Arnaud Ebalard, edumazet, Cong Wang, linux-arm-kernel

On Tue, Nov 19, 2013 at 04:08:49PM -0800, Eric Dumazet wrote:
> On Wed, 2013-11-20 at 00:53 +0100, Arnaud Ebalard wrote:
> 
> > Anyway, I think if the thread keeps going on improving mvneta, I'll do
> > all additional tests from RAM and will stop polluting netdev w/ possible
> > sata/disk/fs issues.
> 
> ;)
> 
> Alternative would be to use netperf or iperf to not use disk at all
> and focus on TCP/network issues only.

Yes, that's for the same reason that I continue to use inject/httpterm
for such purposes :
  - httpterm uses tee()+splice() to send pre-built pages without copying ;
  - inject uses recv(MSG_TRUNC) to ack everything without copying.

Both of them are really interesting for testing the hardware's
capabilities and for pushing components in the middle to their limits
without putting too much burden on the end points.

I don't know if either netperf or iperf can make use of this now, and
I'm used to my tools, but I should take a look again.

Cheers,
Willy

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [BUG,REGRESSION?] 3.11.6+,3.12: GbE iface rate drops to few KB/s
  2013-11-20  0:35                                   ` Willy Tarreau
@ 2013-11-20  0:43                                     ` Eric Dumazet
  -1 siblings, 0 replies; 121+ messages in thread
From: Eric Dumazet @ 2013-11-20  0:43 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Arnaud Ebalard, Thomas Petazzoni, Florian Fainelli, simon.guinot,
	netdev, edumazet, Cong Wang, linux-arm-kernel

On Wed, 2013-11-20 at 01:35 +0100, Willy Tarreau wrote:
> On Tue, Nov 19, 2013 at 04:08:49PM -0800, Eric Dumazet wrote:
> > On Wed, 2013-11-20 at 00:53 +0100, Arnaud Ebalard wrote:
> > 
> > > Anyway, I think if the thread keeps going on improving mvneta, I'll do
> > > all additional tests from RAM and will stop polluting netdev w/ possible
> > > sata/disk/fs issues.
> > 
> > ;)
> > 
> > Alternative would be to use netperf or iperf to not use disk at all
> > and focus on TCP/network issues only.
> 
> Yes, that's for the same reason that I continue to use inject/httpterm
> for such purposes :
>   - httpterm uses tee()+splice() to send pre-built pages without copying ;
>   - inject uses recv(MSG_TRUNC) to ack everything without copying.
> 
> Both of them are really interesting to test the hardware's capabilities
> and to push components in the middle to their limits without causing too
> much burden to the end points.
> 
> I don't know if either netperf or iperf can make use of this now, and
> I've been used to my tools, but I should take a look again.

netperf -t TCP_SENDFILE does zero copy at the sender.

And more generally, the -V option does copy avoidance

(it uses splice(sockfd -> nullfd) at the receiver, if I remember
correctly)


Anyway, we should do the normal copy, because it might demonstrate
scheduling problems.

If you want to test raw speed, you could use pktgen ;)

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [BUG,REGRESSION?] 3.11.6+,3.12: GbE iface rate drops to few KB/s
  2013-11-20  0:43                                     ` Eric Dumazet
@ 2013-11-20  0:52                                       ` Willy Tarreau
  -1 siblings, 0 replies; 121+ messages in thread
From: Willy Tarreau @ 2013-11-20  0:52 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Thomas Petazzoni, Florian Fainelli, simon.guinot, netdev,
	Arnaud Ebalard, edumazet, Cong Wang, linux-arm-kernel

On Tue, Nov 19, 2013 at 04:43:48PM -0800, Eric Dumazet wrote:
> On Wed, 2013-11-20 at 01:35 +0100, Willy Tarreau wrote:
> > On Tue, Nov 19, 2013 at 04:08:49PM -0800, Eric Dumazet wrote:
> > > On Wed, 2013-11-20 at 00:53 +0100, Arnaud Ebalard wrote:
> > > 
> > > > Anyway, I think if the thread keeps going on improving mvneta, I'll do
> > > > all additional tests from RAM and will stop polluting netdev w/ possible
> > > > sata/disk/fs issues.
> > > 
> > > ;)
> > > 
> > > Alternative would be to use netperf or iperf to not use disk at all
> > > and focus on TCP/network issues only.
> > 
> > Yes, that's for the same reason that I continue to use inject/httpterm
> > for such purposes :
> >   - httpterm uses tee()+splice() to send pre-built pages without copying ;
> >   - inject uses recv(MSG_TRUNC) to ack everything without copying.
> > 
> > Both of them are really interesting to test the hardware's capabilities
> > and to push components in the middle to their limits without causing too
> > much burden to the end points.
> > 
> > I don't know if either netperf or iperf can make use of this now, and
> > I've been used to my tools, but I should take a look again.
> 
> netperf -t TCP_SENDFILE  does the zero copy at sender.
> 
> And more generally -V option does copy avoidance
> 
> (Use splice(sockfd -> nullfd) at receiver if I remember well)

OK thanks for the info.

> Anyway, we should do the normal copy, because it might demonstrate
> scheduling problems.

Yes, especially in this case, though I got the issue with GSO only,
so it might vary as well.

> If you want to test raw speed, you could use pktgen ;)

Except I'm mostly focused on HTTP, as you know. And for generating
higher packet rates than pktgen, I have an absolutely ugly patch for
mvneta that I'm a bit ashamed of, which multiplies the number of
emitted descriptors for a given skb by skb->sk->sk_mark (which I set
using setsockopt); this allows me to generate up to 1.488 Mpps on a
USB-powered system, not that bad in my opinion :-)
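
(1.488 Mpps is the theoretical gigabit line rate for minimum-size
frames: each 64-byte frame occupies 64 + 8 preamble + 12 IFG = 84 bytes
on the wire, i.e. 672 bits, and 10^9 / 672 is roughly 1,488,095 packets
per second.)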

Willy

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [BUG,REGRESSION?] 3.11.6+,3.12: GbE iface rate drops to few KB/s
  2013-11-19 23:53                               ` Arnaud Ebalard
@ 2013-11-20  8:50                                 ` Thomas Petazzoni
  -1 siblings, 0 replies; 121+ messages in thread
From: Thomas Petazzoni @ 2013-11-20  8:50 UTC (permalink / raw)
  To: Arnaud Ebalard
  Cc: Willy Tarreau, Eric Dumazet, Florian Fainelli, simon.guinot,
	netdev, edumazet, Cong Wang, linux-arm-kernel

Arnaud,

On Wed, 20 Nov 2013 00:53:43 +0100, Arnaud Ebalard wrote:

>  - It seems I will I have to spend some time on the SATA issues I
>    previously thought were an artefact of not cleaning my tree during a
>    debug session [1], i.e. there is IMHO an issue.

I don't remember in detail what your SATA problem was, but just to let
you know that we are currently debugging a problem that occurs on
Armada XP (more than one core is needed for the problem to occur), with
SATA (the symptom is that after some time of SATA usage, SATA traffic is
stalled, and no SATA interrupts are generated anymore). We're still
working on this one, and trying to figure out where the problem is.

Best regards,

Thomas
-- 
Thomas Petazzoni, CTO, Free Electrons
Embedded Linux, Kernel and Android engineering
http://free-electrons.com

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [BUG,REGRESSION?] 3.11.6+,3.12: GbE iface rate drops to few KB/s
  2013-11-17 17:41                   ` Eric Dumazet
@ 2013-11-20 17:12                     ` Willy Tarreau
  -1 siblings, 0 replies; 121+ messages in thread
From: Willy Tarreau @ 2013-11-20 17:12 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Thomas Petazzoni, netdev, Arnaud Ebalard, edumazet, Cong Wang,
	linux-arm-kernel

Hi guys,

On Sun, Nov 17, 2013 at 09:41:38AM -0800, Eric Dumazet wrote:
> On Sun, 2013-11-17 at 15:19 +0100, Willy Tarreau wrote:
> 
> > 
> > So it is fairly possible that in your case you can't fill the link if you
> > consume too many descriptors. For example, if your server uses TCP_NODELAY
> > and sends incomplete segments (which is quite common), it's very easy to
> > run out of descriptors before the link is full.
> 
> BTW I have a very simple patch for TCP stack that could help this exact
> situation...
> 
> Idea is to use TCP Small Queue so that we dont fill qdisc/TX ring with
> very small frames, and let tcp_sendmsg() have more chance to fill
> complete packets.
> 
> Again, for this to work very well, you need that NIC performs TX
> completion in reasonable amount of time...

Eric, first I would like to confirm that I could reproduce Arnaud's issue
using 3.10.19 (160 kB/s in the worst case).

Second, I confirm that your patch partially fixes it and my performance
can be brought back to what I had with 3.10-rc7, but only with a lot of
concurrent streams. In fact, with 3.10-rc7, I managed to constantly
saturate the wire when transferring 7 concurrent streams (118.6 MB/s).
With the patch applied, performance is still only 27 MB/s at 7
concurrent streams, and I need at least 35 concurrent streams to fill
the pipe. Strangely, after 2 GB of cumulated data transferred, the
bandwidth dropped 11-fold and fell to 10 MB/s again.

If I revert both "0ae5f47eff tcp: TSQ can use a dynamic limit" and
your latest patch, the performance is back to original.

Now I understand there's a major issue with the driver. But since the
patch emphasizes the situations where drivers take a lot of time to
wake the queue up, don't you think there could be an issue with low
bandwidth links (e.g. PPPoE over xDSL, 10 Mbps Ethernet, etc.)?
I'm a bit worried, I must confess, about what we might discover in this
area (despite generally being mostly focused on 10+ Gbps).

Best regards,
Willy

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [BUG,REGRESSION?] 3.11.6+,3.12: GbE iface rate drops to few KB/s
  2013-11-20 17:12                     ` Willy Tarreau
@ 2013-11-20 17:30                       ` Eric Dumazet
  -1 siblings, 0 replies; 121+ messages in thread
From: Eric Dumazet @ 2013-11-20 17:30 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Arnaud Ebalard, Cong Wang, edumazet, linux-arm-kernel, netdev,
	Thomas Petazzoni

On Wed, 2013-11-20 at 18:12 +0100, Willy Tarreau wrote:
> Hi guys,

> Eric, first I would like to confirm that I could reproduce Arnaud's issue
> using 3.10.19 (160 kB/s in the worst case).
> 
> Second, I confirm that your patch partially fixes it and my performance
> can be brought back to what I had with 3.10-rc7, but with a lot of
> concurrent streams. In fact, in 3.10-rc7, I managed to constantly saturate
> the wire when transfering 7 concurrent streams (118.6 kB/s). With the patch
> applied, performance is still only 27 MB/s at 7 concurrent streams, and I
> need at least 35 concurrent streams to fill the pipe. Strangely, after
> 2 GB of cumulated data transferred, the bandwidth divided by 11-fold and
> fell to 10 MB/s again.
> 
> If I revert both "0ae5f47eff tcp: TSQ can use a dynamic limit" and
> your latest patch, the performance is back to original.
> 
> Now I understand there's a major issue with the driver. But since the
> patch emphasizes the situations where drivers take a lot of time to
> wake the queue up, don't you think there could be an issue with low
> bandwidth links (eg: PPPoE over xDSL, 10 Mbps ethernet, etc...) ?
> I'm a bit worried about what we might discover in this area I must
> confess (despite generally being mostly focused on 10+ Gbps).

Well, all TCP performance results are highly dependent on the workload,
and on both the receiver's and the sender's behavior.

We made many improvements like TSO auto sizing, DRS (Dynamic Right
Sizing), and if the application used some specific settings (like
SO_SNDBUF / SO_RCVBUF or other tweaks), we cannot guarantee that the
same exact performance is reached from kernel version X to kernel
version Y.

We try to make forward progress; there is little gain in reverting all
this great work. Linux had a tendency to favor throughput by using
overly large skbs. It's time to do better.

As explained, some drivers are buggy and need fixes.

If nobody wants to fix them, this really means no one is interested in
getting them fixed.

I am willing to help if you provide details, because otherwise I need
a crystal ball ;)

One known problem of TCP is the fact that an incoming ACK making room
in the socket write queue immediately wakes up a blocked thread
(POLLOUT), even if only one MSS was acked and the write queue still has
2MB of outstanding bytes.

All these scheduling problems should be identified and fixed, and yes,
this will require a dozen more patches.

max(128KB, 1-2ms) of buffering per flow should be enough to reach
line rate, even for a single flow, but this means the sk_sndbuf value
for the socket must take into account the pipe size _plus_ 1ms of
buffering.
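
(As an order of magnitude: at 1 Gb/s, 1ms of buffering is about 125KB,
so on a LAN with a sub-millisecond RTT something in the 128-256KB range
for sk_sndbuf should be enough, which matches the 256KB sweet spot
measured earlier in the thread.)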

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [BUG,REGRESSION?] 3.11.6+,3.12: GbE iface rate drops to few KB/s
  2013-11-20 17:12                     ` Willy Tarreau
@ 2013-11-20 17:34                       ` Willy Tarreau
  -1 siblings, 0 replies; 121+ messages in thread
From: Willy Tarreau @ 2013-11-20 17:34 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Thomas Petazzoni, netdev, Arnaud Ebalard, edumazet, Cong Wang,
	linux-arm-kernel

On Wed, Nov 20, 2013 at 06:12:27PM +0100, Willy Tarreau wrote:
> Hi guys,
> 
> On Sun, Nov 17, 2013 at 09:41:38AM -0800, Eric Dumazet wrote:
> > On Sun, 2013-11-17 at 15:19 +0100, Willy Tarreau wrote:
> > 
> > > 
> > > So it is fairly possible that in your case you can't fill the link if you
> > > consume too many descriptors. For example, if your server uses TCP_NODELAY
> > > and sends incomplete segments (which is quite common), it's very easy to
> > > run out of descriptors before the link is full.
> > 
> > BTW I have a very simple patch for TCP stack that could help this exact
> > situation...
> > 
> > Idea is to use TCP Small Queue so that we dont fill qdisc/TX ring with
> > very small frames, and let tcp_sendmsg() have more chance to fill
> > complete packets.
> > 
> > Again, for this to work very well, you need that NIC performs TX
> > completion in reasonable amount of time...
> 
> Eric, first I would like to confirm that I could reproduce Arnaud's issue
> using 3.10.19 (160 kB/s in the worst case).
> 
> Second, I confirm that your patch partially fixes it and my performance
> can be brought back to what I had with 3.10-rc7, but with a lot of
> concurrent streams. In fact, in 3.10-rc7, I managed to constantly saturate
> the wire when transfering 7 concurrent streams (118.6 kB/s). With the patch
> applied, performance is still only 27 MB/s at 7 concurrent streams, and I
> need at least 35 concurrent streams to fill the pipe. Strangely, after
> 2 GB of cumulated data transferred, the bandwidth divided by 11-fold and
> fell to 10 MB/s again.
> 
> If I revert both "0ae5f47eff tcp: TSQ can use a dynamic limit" and
> your latest patch, the performance is back to original.
> 
> Now I understand there's a major issue with the driver. But since the
> patch emphasizes the situations where drivers take a lot of time to
> wake the queue up, don't you think there could be an issue with low
> bandwidth links (eg: PPPoE over xDSL, 10 Mbps ethernet, etc...) ?
> I'm a bit worried about what we might discover in this area I must
> confess (despite generally being mostly focused on 10+ Gbps).

One important point: I was looking for the other patch you pointed to
in this long thread and finally found it:

> So
> http://git.kernel.org/cgit/linux/kernel/git/davem/net.git/commit/?id=98e09386c0ef4dfd48af7ba60ff908f0d525cdee
> 
> restored this minimal amount of buffering, and let the bigger amount for
> 40Gb NICs ;)

This one definitely restores original performance, so it's a much better
bet in my opinion :-)

Best regards,
Willy

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [BUG,REGRESSION?] 3.11.6+,3.12: GbE iface rate drops to few KB/s
  2013-11-20 17:30                       ` Eric Dumazet
@ 2013-11-20 17:38                         ` Willy Tarreau
  -1 siblings, 0 replies; 121+ messages in thread
From: Willy Tarreau @ 2013-11-20 17:38 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Arnaud Ebalard, Cong Wang, edumazet, linux-arm-kernel, netdev,
	Thomas Petazzoni

On Wed, Nov 20, 2013 at 09:30:07AM -0800, Eric Dumazet wrote:
> Well, all TCP performance results are highly dependent on the workload,
> and both receivers and senders behavior.
> 
> We made many improvements like TSO auto sizing, DRS (dynamic Right
> Sizing), and if the application used some specific settings (like
> SO_SNDBUF / SO_RCVBUF or other tweaks), we can not guarantee that same
> exact performance is reached from kernel version X to kernel version Y.

Of course, which is why I only care when there's a significant
difference. If I need 6 streams in a version and 8 in another one to
fill the wire, I call them identical. It's only when we dig into the
details that we analyse the differences.

> We try to make forward progress; there is little gain in reverting all
> this great work. Linux had this tendency to favor throughput by using
> overly large skbs. It's time to do better.

I agree. Unfortunately our mails have crossed each other, so just to
keep this thread mostly linear, your next patch here :

   http://git.kernel.org/cgit/linux/kernel/git/davem/net.git/commit/?id=98e09386c0ef4dfd48af7ba60ff908f0d525cdee

It fixes that regression and the performance is back to normal, which is
good.

> As explained, some drivers are buggy, and need fixes.

Agreed!

> If nobody wants to fix them, this really means no one is interested
> in getting them fixed.

I was actually reading the code when I found the window with your patch
above, which I had been looking for :-)

> I am willing to help if you provide details, because otherwise I need
> a crystal ball ;)
> 
> One known problem of TCP is the fact that an incoming ACK making room in
> the socket write queue immediately wakes up a blocked thread (POLLOUT), even
> if only one MSS was acked and the write queue has 2MB of outstanding bytes.

Indeed.

> All these scheduling problems should be identified and fixed, and yes,
> this will require a dozen more patches.
> 
> max (128KB , 1-2 ms) of buffering per flow should be enough to reach
> line rate, even for a single flow, but this means the sk_sndbuf value
> for the socket must take into account the pipe size _plus_ 1ms of
> buffering.

Which is the purpose of your patch above and I confirm it fixes the
problem.

Now looking at how to workaround this lack of Tx IRQ.

Thanks!
Willy

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [BUG,REGRESSION?] 3.11.6+,3.12: GbE iface rate drops to few KB/s
  2013-11-20 17:34                       ` Willy Tarreau
@ 2013-11-20 17:40                         ` Eric Dumazet
  -1 siblings, 0 replies; 121+ messages in thread
From: Eric Dumazet @ 2013-11-20 17:40 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Thomas Petazzoni, netdev, Arnaud Ebalard, edumazet, Cong Wang,
	linux-arm-kernel

On Wed, 2013-11-20 at 18:34 +0100, Willy Tarreau wrote:

> One important point, I was looking for the other patch you pointed
> in this long thread and finally found it :
> 
> > So
> > http://git.kernel.org/cgit/linux/kernel/git/davem/net.git/commit/?id=98e09386c0ef4dfd48af7ba60ff908f0d525cdee
> > 
> > restored this minimal amount of buffering, and let the bigger amount for
> > 40Gb NICs ;)
> 
> This one definitely restores original performance, so it's a much better
> bet in my opinion :-)

I don't understand. I thought you were using this patch.

I guess we are spending time on an already solved problem.

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [BUG,REGRESSION?] 3.11.6+,3.12: GbE iface rate drops to few KB/s
  2013-11-20 17:40                         ` Eric Dumazet
@ 2013-11-20 18:15                           ` Willy Tarreau
  -1 siblings, 0 replies; 121+ messages in thread
From: Willy Tarreau @ 2013-11-20 18:15 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Thomas Petazzoni, netdev, Arnaud Ebalard, edumazet, Cong Wang,
	linux-arm-kernel

On Wed, Nov 20, 2013 at 09:40:22AM -0800, Eric Dumazet wrote:
> On Wed, 2013-11-20 at 18:34 +0100, Willy Tarreau wrote:
> 
> > One important point, I was looking for the other patch you pointed
> > in this long thread and finally found it :
> > 
> > > So
> > > http://git.kernel.org/cgit/linux/kernel/git/davem/net.git/commit/?id=98e09386c0ef4dfd48af7ba60ff908f0d525cdee
> > > 
> > > restored this minimal amount of buffering, and let the bigger amount for
> > > 40Gb NICs ;)
> > 
> > This one definitely restores original performance, so it's a much better
> > bet in my opinion :-)
> 
> I don't understand. I thought you were using this patch.

No, I was on latest stable (3.10.19) which exhibits the regression but does
not yet have your fix above. Then I tested the patch you proposed in this
thread, then this latest one. Since the patch is not yet even in Linus'
tree, I'm not sure Arnaud has tried it yet.

> I guess we are spending time on an already solved problem.

That's possible indeed. Sorry if I was not clear enough, I tried.

Regards,
Willy

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [BUG,REGRESSION?] 3.11.6+,3.12: GbE iface rate drops to few KB/s
  2013-11-20 18:15                           ` Willy Tarreau
@ 2013-11-20 18:21                             ` Eric Dumazet
  -1 siblings, 0 replies; 121+ messages in thread
From: Eric Dumazet @ 2013-11-20 18:21 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Thomas Petazzoni, netdev, Arnaud Ebalard, edumazet, Cong Wang,
	linux-arm-kernel

On Wed, 2013-11-20 at 19:15 +0100, Willy Tarreau wrote:

> No, I was on latest stable (3.10.19) which exhibits the regression but does
> not yet have your fix above. Then I tested the patch you proposed in this
> thread, then this latest one. Since the patch is not yet even in Linus'
> tree, I'm not sure Arnaud has tried it yet.

Oh right ;)

BTW Linus tree has the fix, I just checked this now.

(Linus got David changes 19 hours ago)

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [BUG,REGRESSION?] 3.11.6+,3.12: GbE iface rate drops to few KB/s
  2013-11-20 18:21                             ` Eric Dumazet
@ 2013-11-20 18:29                               ` Willy Tarreau
  -1 siblings, 0 replies; 121+ messages in thread
From: Willy Tarreau @ 2013-11-20 18:29 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Thomas Petazzoni, netdev, Arnaud Ebalard, edumazet, Cong Wang,
	linux-arm-kernel

On Wed, Nov 20, 2013 at 10:21:24AM -0800, Eric Dumazet wrote:
> On Wed, 2013-11-20 at 19:15 +0100, Willy Tarreau wrote:
> 
> > No, I was on latest stable (3.10.19) which exhibits the regression but does
> > not yet have your fix above. Then I tested the patch you proposed in this
> > thread, then this latest one. Since the patch is not yet even in Linus'
> > tree, I'm not sure Arnaud has tried it yet.
> 
> Oh right ;)
> 
> BTW Linus tree has the fix, I just checked this now.
> 
> (Linus got David changes 19 hours ago)

Ah yes I didn't look far enough back.

Thanks,
Willy

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [BUG,REGRESSION?] 3.11.6+,3.12: GbE iface rate drops to few KB/s
  2013-11-20 17:30                       ` Eric Dumazet
@ 2013-11-20 18:52                         ` David Miller
  -1 siblings, 0 replies; 121+ messages in thread
From: David Miller @ 2013-11-20 18:52 UTC (permalink / raw)
  To: eric.dumazet
  Cc: w, arno, xiyou.wangcong, edumazet, linux-arm-kernel, netdev,
	thomas.petazzoni

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 20 Nov 2013 09:30:07 -0800

> max (128KB , 1-2 ms) of buffering per flow should be enough to reach
> line rate, even for a single flow, but this means the sk_sndbuf value
> for the socket must take into account the pipe size _plus_ 1ms of
> buffering.

And we can implement this using the estimated pacing rate.
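
The sizing rule quoted above is easy to put into numbers; the standalone C
sketch below shows the arithmetic only (the 0.5ms RTT, the ~1GbE pacing rate
and the variable names are assumptions for illustration, not an
implementation of the TCP change being discussed):

/* sndbuf_estimate.c - arithmetic sketch, NOT kernel code.
 * Illustrates "pipe size plus ~1ms of buffering, with a 128KB minimum":
 * the pipe is rate * RTT, the extra term is 1ms worth of data derived
 * from an estimated pacing rate.
 * Build: gcc -o sndbuf_estimate sndbuf_estimate.c
 */
#include <stdio.h>

int main(void)
{
	unsigned long pacing_rate = 118 * 1000 * 1000;	/* ~1GbE in bytes/s */
	unsigned long rtt_us = 500;			/* assumed 0.5ms LAN RTT */
	unsigned long pipe = pacing_rate / 1000000 * rtt_us;	/* BDP in bytes */
	unsigned long extra = pacing_rate / 1000;	/* ~1ms of buffering */

	if (extra < 128 * 1024)
		extra = 128 * 1024;			/* keep the 128KB minimum */

	printf("pipe=%lu bytes, extra=%lu bytes, sndbuf >= %lu bytes\n",
	       pipe, extra, pipe + extra);
	return 0;
}

For a gigabit LAN flow this lands around 190KB, which is consistent with the
256KB tcp_wmem cap reported elsewhere in the thread to restore line rate on
the mvneta boards.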

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [BUG,REGRESSION?] 3.11.6+,3.12: GbE iface rate drops to few KB/s
  2013-11-19 23:53                               ` Arnaud Ebalard
@ 2013-11-20 19:11                                 ` Willy Tarreau
  -1 siblings, 0 replies; 121+ messages in thread
From: Willy Tarreau @ 2013-11-20 19:11 UTC (permalink / raw)
  To: Arnaud Ebalard
  Cc: Thomas Petazzoni, Florian Fainelli, simon.guinot, Eric Dumazet,
	netdev, edumazet, Cong Wang, linux-arm-kernel

Hi Arnaud,

first, thanks for all these tests.

On Wed, Nov 20, 2013 at 12:53:43AM +0100, Arnaud Ebalard wrote:
(...)
> In the end, here are the conclusions *I* draw from this test session,
> do not hesitate to correct me:
> 
> >  - Eric, it seems something changed in Linus' tree between the beginning
> >    of the thread and now, which somehow reduces the effect of the
> >    regression we were seeing: I never got back the 256KB/s.
> >  - Your revert patch still improves the perf a lot
>  - It seems reducing MVNETA_TX_DONE_TIMER_PERIOD does not help
>  - w/ your revert patch, I can confirm that mvneta driver is capable of
>    doing line rate w/ proper tweak of TCP send window (256KB instead of
>    4M)
> >  - It seems I will have to spend some time on the SATA issues I
>    previously thought were an artefact of not cleaning my tree during a
>    debug session [1], i.e. there is IMHO an issue.

Could you please try Eric's patch that was just merged into Linus' tree
if it was not yet in the kernel you tried :

  98e09386c0e  tcp: tsq: restore minimal amount of queueing

For me it restored the original performance (I saturate the Gbps with
about 7 concurrent streams).

Further, I wrote the small patch below for mvneta. I'm not sure it's
smp-safe but it's a PoC. In mvneta_poll() which currently is only called
upon Rx interrupt, it tries to flush all possible remaining Tx descriptors
if any. That significantly improved my transfer rate, now I easily achieve
1 Gbps using a single TCP stream on the mirabox. Not tried on the AX3 yet.

It also increased the overall connection rate by 10% on empty HTTP responses
(small packets), very likely by reducing the dead time between some segments!

You'll probably want to give it a try, so here it comes.

Cheers,
Willy

>From d1a00e593841223c7d871007b1e1fc528afe8e4d Mon Sep 17 00:00:00 2001
From: Willy Tarreau <w@1wt.eu>
Date: Wed, 20 Nov 2013 19:47:11 +0100
Subject: EXP: net: mvneta: try to flush Tx descriptor queue upon Rx
 interrupts

Right now the mvneta driver doesn't handle Tx IRQ, and solely relies on a
timer to flush Tx descriptors. This causes jerky output traffic with bursts
and pauses, making it difficult to reach line rate with very few streams.
This patch tries to improve the situation which is complicated by the lack
of public datasheet from Marvell. The workaround consists in trying to flush
pending buffers during the Rx polling. The idea is that for symmetric TCP
traffic, ACKs received in response to the packets sent will trigger the Rx
interrupt and will anticipate the flushing of the descriptors.

The results are quite good, a single TCP stream is now capable of saturating
a gigabit.

This is only a workaround, it doesn't address asymmetric traffic nor datagram
based traffic.

Signed-off-by: Willy Tarreau <w@1wt.eu>
---
 drivers/net/ethernet/marvell/mvneta.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index 5aed8ed..59e1c86 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -2013,6 +2013,26 @@ static int mvneta_poll(struct napi_struct *napi, int budget)
 	}
 
 	pp->cause_rx_tx = cause_rx_tx;
+
+	/* Try to flush pending Tx buffers if any */
+	if (test_bit(MVNETA_F_TX_DONE_TIMER_BIT, &pp->flags)) {
+		int tx_todo = 0;
+
+		mvneta_tx_done_gbe(pp,
+	                           (((1 << txq_number) - 1) &
+	                           MVNETA_CAUSE_TXQ_SENT_DESC_ALL_MASK),
+	                           &tx_todo);
+
+		if (tx_todo > 0) {
+			mod_timer(&pp->tx_done_timer,
+			          jiffies + msecs_to_jiffies(MVNETA_TX_DONE_TIMER_PERIOD));
+		}
+		else {
+			clear_bit(MVNETA_F_TX_DONE_TIMER_BIT, &pp->flags);
+			del_timer(&pp->tx_done_timer);
+		}
+	}
+
 	return rx_done;
 }
 
-- 
1.7.12.1

^ permalink raw reply related	[flat|nested] 121+ messages in thread

* [BUG,REGRESSION?] 3.11.6+,3.12: GbE iface rate drops to few KB/s
  2013-11-20  8:50                                 ` Thomas Petazzoni
  (?)
@ 2013-11-20 19:21                                 ` Arnaud Ebalard
  -1 siblings, 0 replies; 121+ messages in thread
From: Arnaud Ebalard @ 2013-11-20 19:21 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Thomas,

I removed netdev from that reply

arno at natisbad.org (Arnaud Ebalard) writes:

> <#secure method=pgpmime mode=sign>
> Thomas Petazzoni <thomas.petazzoni@free-electrons.com> writes:
>
>> Arnaud,
>>
>> On Wed, 20 Nov 2013 00:53:43 +0100, Arnaud Ebalard wrote:
>>
>>>  - It seems I will have to spend some time on the SATA issues I
>>>    previously thought were an artefact of not cleaning my tree during a
>>>    debug session [1], i.e. there is IMHO an issue.
>>
>> I don't remember in detail what your SATA problem was, but just to let
>> you know that we are currently debugging a problem that occurs on
>> Armada XP (more than one core is needed for the problem to occur), with
>> SATA (the symptom is that after some time of SATA usage, SATA traffic is
>> stalled, and no SATA interrupts are generated anymore). We're still
>> working on this one, and trying to figure out where the problem is.

The problem I had is in the first email of:

  http://thread.gmane.org/gmane.linux.ports.arm.kernel/271508

Then, yesterday, when testing with the USB 3.0 to ethernet dongle
connected to my RN102 (Armada 370) as primary interface, I got the
following. It is easily reproducible:

[  317.412873] ata1.00: exception Emask 0x0 SAct 0x3 SErr 0x0 action 0x6 frozen
[  317.419947] ata1.00: failed command: READ FPDMA QUEUED
[  317.425118] ata1.00: cmd 60/00:00:00:07:2a/01:00:00:00:00/40 tag 0 ncq 131072 in
[  317.425118]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  317.439926] ata1.00: status: { DRDY }
[  317.443600] ata1.00: failed command: READ FPDMA QUEUED
[  317.448756] ata1.00: cmd 60/00:08:00:08:2a/01:00:00:00:00/40 tag 1 ncq 131072 in
[  317.448756]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  317.463565] ata1.00: status: { DRDY }
[  317.467244] ata1: hard resetting link
[  318.012913] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[  318.020220] ata1.00: configured for UDMA/133
[  318.024559] ata1.00: device reported invalid CHS sector 0
[  318.030001] ata1.00: device reported invalid CHS sector 0
[  318.035425] ata1: EH complete

And then again:

[  381.342873] ata1.00: exception Emask 0x0 SAct 0x3 SErr 0x0 action 0x6 frozen
[  381.349947] ata1.00: failed command: READ FPDMA QUEUED
[  381.355119] ata1.00: cmd 60/00:00:00:03:30/01:00:00:00:00/40 tag 0 ncq 131072 in
[  381.355119]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  381.369927] ata1.00: status: { DRDY }
[  381.373599] ata1.00: failed command: READ FPDMA QUEUED
[  381.378756] ata1.00: cmd 60/00:08:00:04:30/01:00:00:00:00/40 tag 1 ncq 131072 in
[  381.378756]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  381.393563] ata1.00: status: { DRDY }
[  381.397242] ata1: hard resetting link
[  381.942848] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[  381.950167] ata1.00: configured for UDMA/133
[  381.954496] ata1.00: device reported invalid CHS sector 0
[  381.959958] ata1.00: device reported invalid CHS sector 0
[  381.965383] ata1: EH complete

But, as the problem seems to happen when the dongle is connected and in
use (and considering current threads on the topic on USB and netdev ML),
I think I will wait for things to calm down and test again with a 3.12
and then a 3.13-rcX.

Anyway, if you find something on your bug, I can give patches a try on
my RN2120.

Cheers,

a+

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [BUG,REGRESSION?] 3.11.6+,3.12: GbE iface rate drops to few KB/s
  2013-11-20 18:15                           ` Willy Tarreau
@ 2013-11-20 19:22                             ` Arnaud Ebalard
  -1 siblings, 0 replies; 121+ messages in thread
From: Arnaud Ebalard @ 2013-11-20 19:22 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Eric Dumazet, Thomas Petazzoni, netdev, edumazet, Cong Wang,
	linux-arm-kernel

Hi,

Willy Tarreau <w@1wt.eu> writes:

> On Wed, Nov 20, 2013 at 09:40:22AM -0800, Eric Dumazet wrote:
>> On Wed, 2013-11-20 at 18:34 +0100, Willy Tarreau wrote:
>> 
>> > One important point, I was looking for the other patch you pointed
>> > in this long thread and finally found it :
>> > 
>> > > So
>> > > http://git.kernel.org/cgit/linux/kernel/git/davem/net.git/commit/?id=98e09386c0ef4dfd48af7ba60ff908f0d525cdee
>> > > 
>> > > restored this minimal amount of buffering, and let the bigger amount for
>> > > 40Gb NICs ;)
>> > 
>> > This one definitely restores original performance, so it's a much better
>> > bet in my opinion :-)
>> 
>> I don't understand. I thought you were using this patch.
>
> No, I was on latest stable (3.10.19) which exhibits the regression but does
> not yet have your fix above. Then I tested the patch you proposed in this
> thread, then this latest one. Since the patch is not yet even in Linus'
> tree, I'm not sure Arnaud has tried it yet.

This is the one I have used for a week now, since its publication by
Eric a week ago:

 http://www.spinics.net/lists/netdev/msg257396.html

Cheers,

a+

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [BUG,REGRESSION?] 3.11.6+,3.12: GbE iface rate drops to few KB/s
  2013-11-20 19:11                                 ` Willy Tarreau
@ 2013-11-20 19:26                                   ` Arnaud Ebalard
  -1 siblings, 0 replies; 121+ messages in thread
From: Arnaud Ebalard @ 2013-11-20 19:26 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Eric Dumazet, Thomas Petazzoni, Florian Fainelli, simon.guinot,
	netdev, edumazet, Cong Wang, linux-arm-kernel

Hi,

Willy Tarreau <w@1wt.eu> writes:

> first, thanks for all these tests.
>
> On Wed, Nov 20, 2013 at 12:53:43AM +0100, Arnaud Ebalard wrote:
> (...)
>> In the end, here are the conclusions *I* draw from this test session,
>> do not hesitate to correct me:
>> 
>>  - Eric, it seems something changed in Linus' tree between the beginning
>>    of the thread and now, which somehow reduces the effect of the
>>    regression we were seeing: I never got back the 256KB/s.
>>  - Your revert patch still improves the perf a lot
>>  - It seems reducing MVNETA_TX_DONE_TIMER_PERIOD does not help
>>  - w/ your revert patch, I can confirm that mvneta driver is capable of
>>    doing line rate w/ proper tweak of TCP send window (256KB instead of
>>    4M)
>>  - It seems I will have to spend some time on the SATA issues I
>>    previously thought were an artefact of not cleaning my tree during a
>>    debug session [1], i.e. there is IMHO an issue.
>
> Could you please try Eric's patch that was just merged into Linus' tree
> if it was not yet in the kernel you tried :
>
>   98e09386c0e  tcp: tsq: restore minimal amount of queueing

I have it in my quilt set.


> For me it restored the original performance (I saturate the Gbps with
> about 7 concurrent streams).
>
> Further, I wrote the small patch below for mvneta. I'm not sure it's
> smp-safe but it's a PoC. In mvneta_poll() which currently is only called
> upon Rx interrupt, it tries to flush all possible remaining Tx descriptors
> if any. That significantly improved my transfer rate, now I easily achieve
> 1 Gbps using a single TCP stream on the mirabox. Not tried on the AX3 yet.
>
> It also increased the overall connection rate by 10% on empty HTTP responses
> (small packets), very likely by reducing the dead time between some segments!
>
> You'll probably want to give it a try, so here it comes.

hehe, I was falling short of patches to test tonight ;-) I will give it
a try now.

Cheers,

a+

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [BUG,REGRESSION?] 3.11.6+,3.12: GbE iface rate drops to few KB/s
  2013-11-20 19:11                                 ` Willy Tarreau
@ 2013-11-20 21:28                                   ` Arnaud Ebalard
  -1 siblings, 0 replies; 121+ messages in thread
From: Arnaud Ebalard @ 2013-11-20 21:28 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Eric Dumazet, Thomas Petazzoni, Florian Fainelli, simon.guinot,
	netdev, edumazet, Cong Wang, linux-arm-kernel

Hi,

Willy Tarreau <w@1wt.eu> writes:

> From d1a00e593841223c7d871007b1e1fc528afe8e4d Mon Sep 17 00:00:00 2001
> From: Willy Tarreau <w@1wt.eu>
> Date: Wed, 20 Nov 2013 19:47:11 +0100
> Subject: EXP: net: mvneta: try to flush Tx descriptor queue upon Rx
>  interrupts
>
> Right now the mvneta driver doesn't handle Tx IRQ, and solely relies on a
> timer to flush Tx descriptors. This causes jerky output traffic with bursts
> and pauses, making it difficult to reach line rate with very few streams.
> This patch tries to improve the situation which is complicated by the lack
> of public datasheet from Marvell. The workaround consists in trying to flush
> pending buffers during the Rx polling. The idea is that for symmetric TCP
> traffic, ACKs received in response to the packets sent will trigger the Rx
> interrupt and will anticipate the flushing of the descriptors.
>
> The results are quite good, a single TCP stream is now capable of saturating
> a gigabit.
>
> This is only a workaround, it doesn't address asymmetric traffic nor datagram
> based traffic.
>
> Signed-off-by: Willy Tarreau <w@1wt.eu>
> ---
>  drivers/net/ethernet/marvell/mvneta.c | 20 ++++++++++++++++++++
>  1 file changed, 20 insertions(+)
>
> diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
> index 5aed8ed..59e1c86 100644
> --- a/drivers/net/ethernet/marvell/mvneta.c
> +++ b/drivers/net/ethernet/marvell/mvneta.c
> @@ -2013,6 +2013,26 @@ static int mvneta_poll(struct napi_struct *napi, int budget)
>  	}
>  
>  	pp->cause_rx_tx = cause_rx_tx;
> +
> +	/* Try to flush pending Tx buffers if any */
> +	if (test_bit(MVNETA_F_TX_DONE_TIMER_BIT, &pp->flags)) {
> +		int tx_todo = 0;
> +
> +		mvneta_tx_done_gbe(pp,
> +	                           (((1 << txq_number) - 1) &
> +	                           MVNETA_CAUSE_TXQ_SENT_DESC_ALL_MASK),
> +	                           &tx_todo);
> +
> +		if (tx_todo > 0) {
> +			mod_timer(&pp->tx_done_timer,
> +			          jiffies + msecs_to_jiffies(MVNETA_TX_DONE_TIMER_PERIOD));
> +		}
> +		else {
> +			clear_bit(MVNETA_F_TX_DONE_TIMER_BIT, &pp->flags);
> +			del_timer(&pp->tx_done_timer);
> +		}
> +	}
> +
>  	return rx_done;
>  }

With current Linus tree (head being b4789b8e: aacraid: prevent invalid
pointer dereference), as a baseline here is what I get:

 w/ tcp_wmem left w/ default values (4096 16384 4071360)

  via netperf (TCP_MAERTS/TCP_STREAM): 151.13 / 935.50 Mbits/s
  via wget against apache: 15.4 MB/s
  via wget against nginx: 104 MB/s
 
 w/ tcp_wmem set to 4096 16384 262144:

  via netperf (TCP_MAERTS/TCP_STREAM): 919.89 / 935.50 Mbits/s
  via wget against apache: 63.3 MB/s
  via wget against nginx: 104 MB/s
 
With your patch on top of it (and tcp_wmem kept at its default value):

 via netperf: 939.16 / 935.44 Mbits/s
 via wget against apache: 65.9 MB/s (top reports 69.5 sy, 30.1 si
                                     and 72% CPU for apache2)
 via wget against nginx: 106 MB/s


With your patch and MVNETA_TX_DONE_TIMER_PERIOD set to 1 instead of 10
(still w/ tcp_wmem kept at its default value):

 via netperf: 939.12 / 935.84 Mbits/s
 via wget against apache: 63.7 MB/s
 via wget against nginx: 108 MB/s

So:

 - First, Eric's patch sitting in Linus tree does fix the regression
   I had on 3.11.7 and early 3.12 (15.4 MB/s vs 256KB/s).

 - As can be seen in the results of first test, Eric's patch still
   requires some additional tweaking of tcp_wmem to get netperf and
   apache somewhat happy w/ perfectible drivers (63.3 MB/s instead of
   15.4MB/s by setting max tcp send buffer space to 256KB for apache).

 - For unknown reasons, nginx manages to provide a 104MB/s download rate
   even with a tcp_wmem set to default and no specific patch of mvneta.

 - Now, Willy's patch seems to make netperf happy (link saturated from
   server to client), w/o tweaking tcp_wmem.

 - Again with Willy's patch I guess the "limitations" of the platform
   (1.2GHz CPU w/ 512MB of RAM) somehow prevent Apache from saturating the
   link. All I can say is that the same test some months ago on a 1.6GHz
   ARMv5TE (kirkwood 88f6282) w/ 256MB of RAM gave me 108MB/s. I do not
   know if it is some apache regression, some mvneta vs mv63xx_eth
   difference or some CPU frequency issue, but having netperf and nginx
   happy makes me wonder about Apache.

 - Willy, setting MVNETA_TX_DONE_TIMER_PERIOD to 1 instead of 10 w/ your
   patch does not improve the already good value I get w/ your patch.


In the end if you iterate on your work to push a version of your patch
upstream, I'll be happy to test it. And thanks for the time you already
spent!

Cheers,

a+

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [BUG,REGRESSION?] 3.11.6+,3.12: GbE iface rate drops to few KB/s
  2013-11-20 21:28                                   ` Arnaud Ebalard
@ 2013-11-20 21:54                                     ` Willy Tarreau
  -1 siblings, 0 replies; 121+ messages in thread
From: Willy Tarreau @ 2013-11-20 21:54 UTC (permalink / raw)
  To: Arnaud Ebalard
  Cc: Eric Dumazet, Thomas Petazzoni, Florian Fainelli, simon.guinot,
	netdev, edumazet, Cong Wang, linux-arm-kernel

Hi Arnaud,

On Wed, Nov 20, 2013 at 10:28:50PM +0100, Arnaud Ebalard wrote:
> With current Linus tree (head being b4789b8e: aacraid: prevent invalid
> pointer dereference), as a baseline here is what I get:
> 
>  w/ tcp_wmem left w/ default values (4096 16384 4071360)
> 
>   via netperf (TCP_MAERTS/TCP_STREAM): 151.13 / 935.50 Mbits/s
>   via wget against apache: 15.4 MB/s
>   via wget against nginx: 104 MB/s
>  
>  w/ tcp_wmem set to 4096 16384 262144:
> 
>   via netperf (TCP_MAERTS/TCP_STREAM): 919.89 / 935.50 Mbits/s
>   via wget against apache: 63.3 MB/s
>   via wget against nginx: 104 MB/s
>  
> With your patch on top of it (and tcp_wmem kept at its default value):
> 
>  via netperf: 939.16 / 935.44 Mbits/s
>  via wget against apache: 65.9 MB/s (top reports 69.5 sy, 30.1 si
>                                      and 72% CPU for apache2)
>  via wget against nginx: 106 MB/s
> 
> 
> With your patch and MVNETA_TX_DONE_TIMER_PERIOD set to 1 instead of 10
> (still w/ tcp_wmem kept at its default value):
> 
>  via netperf: 939.12 / 935.84 Mbits/s
>  via wget against apache: 63.7 MB/s
>  via wget against nginx: 108 MB/s
> 
> So:
> 
>  - First, Eric's patch sitting in Linus tree does fix the regression
>    I had on 3.11.7 and early 3.12 (15.4 MB/s vs 256KB/s).
> 
>  - As can be seen in the results of first test, Eric's patch still
>    requires some additional tweaking of tcp_wmem to get netperf and
>    apache somewhat happy w/ perfectible drivers (63.3 MB/s instead of
>    15.4MB/s by setting max tcp send buffer space to 256KB for apache).
> 
>  - For unknown reasons, nginx manages to provide a 104MB/s download rate
>    even with a tcp_wmem set to default and no specific patch of mvneta.
> 
> >  - Now, Willy's patch seems to make netperf happy (link saturated from
>    server to client), w/o tweaking tcp_wmem.
> 
>  - Again with Willy's patch I guess the "limitations" of the platform
> >    (1.2GHz CPU w/ 512MB of RAM) somehow prevent Apache from saturating the
>    link. All I can say is that the same test some months ago on a 1.6GHz
>    ARMv5TE (kirkwood 88f6282) w/ 256MB of RAM gave me 108MB/s. I do not
>    know if it is some apache regression, some mvneta vs mv63xx_eth
> >    difference or some CPU frequency issue, but having netperf and nginx
> >    happy makes me wonder about Apache.
> 
>  - Willy, setting MVNETA_TX_DONE_TIMER_PERIOD to 1 instead of 10 w/ your
>    patch does not improve the already good value I get w/ your patch.

Great, thanks for your detailed tests! Concerning Apache, it's common to
see it consume more CPU than other servers, which makes it more sensitive
to the limits of small devices like these ones (which BTW have a very small
cache and only a 16bit RAM bus). Please also note that there could be a
number of other differences, such as Apache always using TCP_NODELAY,
resulting in incomplete segments being sent at the end of each buffer, which
consumes slightly more descriptors.
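
For readers not familiar with it, "doing TCP_NODELAY" means the server
disables Nagle's algorithm on its sockets. A minimal C sketch, assuming an
already-connected socket descriptor fd and leaving error handling aside:

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

static int disable_nagle(int fd)
{
	/* fd is assumed to be an already-connected TCP socket.  With
	 * Nagle disabled, small writes are pushed to the NIC immediately
	 * instead of being coalesced, so the tail of each buffer tends to
	 * go out as a partial segment and consumes an extra TX descriptor.
	 */
	int one = 1;

	return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
}

nginx configured with sendfile and tcp_nopush batches its output
differently, which could be part of the gap Arnaud measured.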

> In the end if you iterate on your work to push a version of your patch
> upstream, I'll be happy to test it. And thanks for the time you already
> spent!

I'm currently trying to implement TX IRQ handling. I found the register
descriptions in the neta driver provided in Marvell's LSP kernel, which is
shipped with some devices using their CPUs. This code is utterly broken
(eg: splice fails with -EBADF), but I think the register descriptions can
be trusted.

I'd rather have real IRQ handling than just relying on mvneta_poll(), so
that we can use it for asymmetric traffic/routing/whatever.

Regards,
Willy

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [BUG,REGRESSION?] 3.11.6+,3.12: GbE iface rate drops to few KB/s
  2013-11-20 21:54                                     ` Willy Tarreau
@ 2013-11-21  0:44                                       ` Willy Tarreau
  -1 siblings, 0 replies; 121+ messages in thread
From: Willy Tarreau @ 2013-11-21  0:44 UTC (permalink / raw)
  To: Arnaud Ebalard
  Cc: Thomas Petazzoni, Florian Fainelli, simon.guinot, Eric Dumazet,
	netdev, edumazet, Cong Wang, linux-arm-kernel

[-- Attachment #1: Type: text/plain, Size: 2106 bytes --]

Hi Arnaud,

On Wed, Nov 20, 2013 at 10:54:35PM +0100, Willy Tarreau wrote:
> I'm currently trying to implement TX IRQ handling. I found the registers
> description in the neta driver that is provided in Marvell's LSP kernel
> that is shipped with some devices using their CPUs. This code is utterly
> broken (eg: splice fails with -EBADF) but I think the register descriptions
> could be trusted.
> 
> I'd rather have real IRQ handling than just relying on mvneta_poll(), so
> that we can use it for asymmetric traffic/routing/whatever.

OK it paid off. And very well :-)

I did it at once and it worked immediately. I generally don't like this
because I always fear that some bug was left there hidden in the code. I have
only tested it on the Mirabox, so I'll have to try on the OpenBlocks AX3-4 and
on the XP-GP board for some SMP stress tests.

I upgraded my Mirabox to latest Linus' git (commit 5527d151) and compared
with and without the patch.

  without :
      - need at least 12 streams to reach gigabit.
      - 60% of idle CPU remains at 1 Gbps
      - HTTP connection rate on empty objects is 9950 connections/s
      - cumulated outgoing traffic on two ports reaches 1.3 Gbps

  with the patch :
      - a single stream easily saturates the gigabit
      - 87% of idle CPU at 1 Gbps (12 streams, 90% idle at 1 stream)
      - HTTP connection rate on empty objects is 10250 connections/s
      - I saturate the two gig ports at 99% CPU, so 2 Gbps sustained output.

BTW I must say I was impressed to see that big an improvement in CPU
usage between 3.10 and 3.13, I suspect some of the Tx queue improvements
that Eric has done in between account for this.

I cut the patch in 3 parts :
   - one which reintroduces the hidden bits of the driver
   - one which replaces the timer with the IRQ
   - one which changes the default Tx coalesce from 16 to 4 packets
     (larger was preferred with the timer, but less is better now).

I'm attaching them, please test them on your device.

Note that this is *not* for inclusion at the moment as it has not been
tested on the SMP CPUs.

Cheers,
Willy


[-- Attachment #2: 0001-net-mvneta-add-missing-bit-descriptions-for-interrup.patch --]
[-- Type: text/plain, Size: 3902 bytes --]

>From b77b32dbffdfffc6aa21fa230054e09e2a4cd227 Mon Sep 17 00:00:00 2001
From: Willy Tarreau <w@1wt.eu>
Date: Wed, 20 Nov 2013 23:58:30 +0100
Subject: net: mvneta: add missing bit descriptions for interrupt masks and
 causes

Marvell has not published the chip's datasheet yet, so it's very hard
to find the relevant bits to manipulate to change the IRQ behaviour.
Fortunately, these bits are described in the proprietary LSP patch set
which is publicly available here :

    http://www.plugcomputer.org/downloads/mirabox/

So let's put them back in the driver in order to reduce the burden of
current and future maintenance.

Signed-off-by: Willy Tarreau <w@1wt.eu>
---
 drivers/net/ethernet/marvell/mvneta.c | 44 +++++++++++++++++++++++++++++++++--
 1 file changed, 42 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index b8e232b..6630690 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -101,16 +101,56 @@
 #define      MVNETA_CPU_RXQ_ACCESS_ALL_MASK      0x000000ff
 #define      MVNETA_CPU_TXQ_ACCESS_ALL_MASK      0x0000ff00
 #define MVNETA_RXQ_TIME_COAL_REG(q)              (0x2580 + ((q) << 2))
+
+/* Exception Interrupt Port/Queue Cause register */
+
 #define MVNETA_INTR_NEW_CAUSE                    0x25a0
-#define      MVNETA_RX_INTR_MASK(nr_rxqs)        (((1 << nr_rxqs) - 1) << 8)
 #define MVNETA_INTR_NEW_MASK                     0x25a4
+
+/* bits  0..7  = TXQ SENT, one bit per queue.
+ * bits  8..15 = RXQ OCCUP, one bit per queue.
+ * bits 16..23 = RXQ FREE, one bit per queue.
+ * bit  29 = OLD_REG_SUM, see old reg ?
+ * bit  30 = TX_ERR_SUM, one bit for 4 ports
+ * bit  31 = MISC_SUM,   one bit for 4 ports
+ */
+#define      MVNETA_TX_INTR_MASK(nr_txqs)        (((1 << nr_txqs) - 1) << 0)
+#define      MVNETA_TX_INTR_MASK_ALL             (0xff << 0)
+#define      MVNETA_RX_INTR_MASK(nr_rxqs)        (((1 << nr_rxqs) - 1) << 8)
+#define      MVNETA_RX_INTR_MASK_ALL             (0xff << 8)
+
 #define MVNETA_INTR_OLD_CAUSE                    0x25a8
 #define MVNETA_INTR_OLD_MASK                     0x25ac
+
+/* Data Path Port/Queue Cause Register */
 #define MVNETA_INTR_MISC_CAUSE                   0x25b0
 #define MVNETA_INTR_MISC_MASK                    0x25b4
+
+#define      MVNETA_CAUSE_PHY_STATUS_CHANGE      BIT(0)
+#define      MVNETA_CAUSE_LINK_CHANGE            BIT(1)
+#define      MVNETA_CAUSE_PTP                    BIT(4)
+
+#define      MVNETA_CAUSE_INTERNAL_ADDR_ERR      BIT(7)
+#define      MVNETA_CAUSE_RX_OVERRUN             BIT(8)
+#define      MVNETA_CAUSE_RX_CRC_ERROR           BIT(9)
+#define      MVNETA_CAUSE_RX_LARGE_PKT           BIT(10)
+#define      MVNETA_CAUSE_TX_UNDERUN             BIT(11)
+#define      MVNETA_CAUSE_PRBS_ERR               BIT(12)
+#define      MVNETA_CAUSE_PSC_SYNC_CHANGE        BIT(13)
+#define      MVNETA_CAUSE_SERDES_SYNC_ERR        BIT(14)
+
+#define      MVNETA_CAUSE_BMU_ALLOC_ERR_SHIFT    16
+#define      MVNETA_CAUSE_BMU_ALLOC_ERR_ALL_MASK   (0xF << MVNETA_CAUSE_BMU_ALLOC_ERR_SHIFT)
+#define      MVNETA_CAUSE_BMU_ALLOC_ERR_MASK(pool) (1 << (MVNETA_CAUSE_BMU_ALLOC_ERR_SHIFT + (pool)))
+
+#define      MVNETA_CAUSE_TXQ_ERROR_SHIFT        24
+#define      MVNETA_CAUSE_TXQ_ERROR_ALL_MASK     (0xFF << MVNETA_CAUSE_TXQ_ERROR_SHIFT)
+#define      MVNETA_CAUSE_TXQ_ERROR_MASK(q)      (1 << (MVNETA_CAUSE_TXQ_ERROR_SHIFT + (q)))
+
 #define MVNETA_INTR_ENABLE                       0x25b8
 #define      MVNETA_TXQ_INTR_ENABLE_ALL_MASK     0x0000ff00
-#define      MVNETA_RXQ_INTR_ENABLE_ALL_MASK     0xff000000
+#define      MVNETA_RXQ_INTR_ENABLE_ALL_MASK     0xff000000  // note: neta says it's 0x000000FF
+
 #define MVNETA_RXQ_CMD                           0x2680
 #define      MVNETA_RXQ_DISABLE_SHIFT            8
 #define      MVNETA_RXQ_ENABLE_MASK              0x000000ff
-- 
1.7.12.1


[-- Attachment #3: 0002-net-mvneta-replace-Tx-timer-with-a-real-interrupt.patch --]
[-- Type: text/plain, Size: 6447 bytes --]

>From 741bc1bfccbfe33cebaceb2e854539946e8ec9fa Mon Sep 17 00:00:00 2001
From: Willy Tarreau <w@1wt.eu>
Date: Wed, 20 Nov 2013 19:47:11 +0100
Subject: net: mvneta: replace Tx timer with a real interrupt

Right now the mvneta driver doesn't handle Tx IRQ, and solely relies on a
timer to flush Tx descriptors. This causes jerky output traffic with bursts
and pauses, making it difficult to reach line rate with very few streams.

It seems that this feature was inherited from the original driver but
nothing there mentions any reason for not using the interrupt instead,
which the chip supports.

Thus, this patch enables Tx interrupts and removes the timer. It does the
two at once because it's not really possible to make the two mechanisms
coexist, so a split patch doesn't make sense.

First tests performed on a Mirabox (Armada 370) show that much less CPU
is used when sending traffic. One reason might be that we now call the
mvneta_tx_done_gbe() with a mask indicating which queues have been done
instead of looping over all of them.

The results are quite good, a single TCP stream is now capable of saturating
a gigabit link, while a minimum of 12 concurrent streams are needed without
the patch. At 1 Gbps with 12 concurrent streams, 60% of the CPU remains idle
without the patch, and it grows to 87% with the patch. The connection rate
has also increased from 9950 to 10150 connections per second (HTTP requests
of empty objects).

Signed-off-by: Willy Tarreau <w@1wt.eu>
Cc: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Cc: Arnaud Ebalard <arno@natisbad.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
---
 drivers/net/ethernet/marvell/mvneta.c | 71 ++++++-----------------------------
 1 file changed, 12 insertions(+), 59 deletions(-)

diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index 6630690..def32a8 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -216,9 +216,6 @@
 #define MVNETA_RX_COAL_PKTS		32
 #define MVNETA_RX_COAL_USEC		100
 
-/* Timer */
-#define MVNETA_TX_DONE_TIMER_PERIOD	10
-
 /* Napi polling weight */
 #define MVNETA_RX_POLL_WEIGHT		64
 
@@ -272,16 +269,11 @@ struct mvneta_port {
 	void __iomem *base;
 	struct mvneta_rx_queue *rxqs;
 	struct mvneta_tx_queue *txqs;
-	struct timer_list tx_done_timer;
 	struct net_device *dev;
 
 	u32 cause_rx_tx;
 	struct napi_struct napi;
 
-	/* Flags */
-	unsigned long flags;
-#define MVNETA_F_TX_DONE_TIMER_BIT  0
-
 	/* Napi weight */
 	int weight;
 
@@ -1140,17 +1132,6 @@ static void mvneta_tx_done_pkts_coal_set(struct mvneta_port *pp,
 	txq->done_pkts_coal = value;
 }
 
-/* Trigger tx done timer in MVNETA_TX_DONE_TIMER_PERIOD msecs */
-static void mvneta_add_tx_done_timer(struct mvneta_port *pp)
-{
-	if (test_and_set_bit(MVNETA_F_TX_DONE_TIMER_BIT, &pp->flags) == 0) {
-		pp->tx_done_timer.expires = jiffies +
-			msecs_to_jiffies(MVNETA_TX_DONE_TIMER_PERIOD);
-		add_timer(&pp->tx_done_timer);
-	}
-}
-
-
 /* Handle rx descriptor fill by setting buf_cookie and buf_phys_addr */
 static void mvneta_rx_desc_fill(struct mvneta_rx_desc *rx_desc,
 				u32 phys_addr, u32 cookie)
@@ -1632,15 +1613,6 @@ out:
 		dev_kfree_skb_any(skb);
 	}
 
-	if (txq->count >= MVNETA_TXDONE_COAL_PKTS)
-		mvneta_txq_done(pp, txq);
-
-	/* If after calling mvneta_txq_done, count equals
-	 * frags, we need to set the timer
-	 */
-	if (txq->count == frags && frags > 0)
-		mvneta_add_tx_done_timer(pp);
-
 	return NETDEV_TX_OK;
 }
 
@@ -1916,14 +1888,22 @@ static int mvneta_poll(struct napi_struct *napi, int budget)
 
 	/* Read cause register */
 	cause_rx_tx = mvreg_read(pp, MVNETA_INTR_NEW_CAUSE) &
-		MVNETA_RX_INTR_MASK(rxq_number);
+		(MVNETA_RX_INTR_MASK(rxq_number) | MVNETA_TX_INTR_MASK(txq_number));
+
+	/* Release Tx descriptors */
+	if (cause_rx_tx & MVNETA_TX_INTR_MASK_ALL) {
+		int tx_todo = 0;
+
+		mvneta_tx_done_gbe(pp, (cause_rx_tx & MVNETA_TX_INTR_MASK_ALL), &tx_todo);
+		cause_rx_tx &= ~MVNETA_TX_INTR_MASK_ALL;
+	}
 
 	/* For the case where the last mvneta_poll did not process all
 	 * RX packets
 	 */
 	cause_rx_tx |= pp->cause_rx_tx;
 	if (rxq_number > 1) {
-		while ((cause_rx_tx != 0) && (budget > 0)) {
+		while ((cause_rx_tx & MVNETA_RX_INTR_MASK_ALL) && (budget > 0)) {
 			int count;
 			struct mvneta_rx_queue *rxq;
 			/* get rx queue number from cause_rx_tx */
@@ -1955,7 +1935,7 @@ static int mvneta_poll(struct napi_struct *napi, int budget)
 		napi_complete(napi);
 		local_irq_save(flags);
 		mvreg_write(pp, MVNETA_INTR_NEW_MASK,
-			    MVNETA_RX_INTR_MASK(rxq_number));
+			    MVNETA_RX_INTR_MASK(rxq_number) | MVNETA_TX_INTR_MASK(txq_number));
 		local_irq_restore(flags);
 	}
 
@@ -1963,26 +1943,6 @@ static int mvneta_poll(struct napi_struct *napi, int budget)
 	return rx_done;
 }
 
-/* tx done timer callback */
-static void mvneta_tx_done_timer_callback(unsigned long data)
-{
-	struct net_device *dev = (struct net_device *)data;
-	struct mvneta_port *pp = netdev_priv(dev);
-	int tx_done = 0, tx_todo = 0;
-
-	if (!netif_running(dev))
-		return ;
-
-	clear_bit(MVNETA_F_TX_DONE_TIMER_BIT, &pp->flags);
-
-	tx_done = mvneta_tx_done_gbe(pp,
-				     (((1 << txq_number) - 1) &
-				      MVNETA_CAUSE_TXQ_SENT_DESC_ALL_MASK),
-				     &tx_todo);
-	if (tx_todo > 0)
-		mvneta_add_tx_done_timer(pp);
-}
-
 /* Handle rxq fill: allocates rxq skbs; called when initializing a port */
 static int mvneta_rxq_fill(struct mvneta_port *pp, struct mvneta_rx_queue *rxq,
 			   int num)
@@ -2232,7 +2192,7 @@ static void mvneta_start_dev(struct mvneta_port *pp)
 
 	/* Unmask interrupts */
 	mvreg_write(pp, MVNETA_INTR_NEW_MASK,
-		    MVNETA_RX_INTR_MASK(rxq_number));
+		    MVNETA_RX_INTR_MASK(rxq_number) | MVNETA_TX_INTR_MASK(txq_number));
 
 	phy_start(pp->phy_dev);
 	netif_tx_start_all_queues(pp->dev);
@@ -2518,8 +2478,6 @@ static int mvneta_stop(struct net_device *dev)
 	free_irq(dev->irq, pp);
 	mvneta_cleanup_rxqs(pp);
 	mvneta_cleanup_txqs(pp);
-	del_timer(&pp->tx_done_timer);
-	clear_bit(MVNETA_F_TX_DONE_TIMER_BIT, &pp->flags);
 
 	return 0;
 }
@@ -2868,11 +2826,6 @@ static int mvneta_probe(struct platform_device *pdev)
 		}
 	}
 
-	pp->tx_done_timer.data = (unsigned long)dev;
-	pp->tx_done_timer.function = mvneta_tx_done_timer_callback;
-	init_timer(&pp->tx_done_timer);
-	clear_bit(MVNETA_F_TX_DONE_TIMER_BIT, &pp->flags);
-
 	pp->tx_ring_size = MVNETA_MAX_TXD;
 	pp->rx_ring_size = MVNETA_MAX_RXD;
 
-- 
1.7.12.1


[-- Attachment #4: 0003-net-mvneta-reduce-Tx-coalesce-from-16-to-4-packets.patch --]
[-- Type: text/plain, Size: 1012 bytes --]

>From 04a4891c4f9a77052e5aea7d2ade25a3f8da5436 Mon Sep 17 00:00:00 2001
From: Willy Tarreau <w@1wt.eu>
Date: Thu, 21 Nov 2013 00:13:06 +0100
Subject: net: mvneta: reduce Tx coalesce from 16 to 4 packets

I'm getting slightly better performance with a smaller Tx coalesce setting,
both with large and short packets. Since it was used differently with the
timer, it is possible that the previous value was more suited for use with
a slow timer.

Signed-off-by: Willy Tarreau <w@1wt.eu>
---
 drivers/net/ethernet/marvell/mvneta.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index def32a8..d188828 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -212,7 +212,7 @@
 /* Various constants */
 
 /* Coalescing */
-#define MVNETA_TXDONE_COAL_PKTS		16
+#define MVNETA_TXDONE_COAL_PKTS		4
 #define MVNETA_RX_COAL_PKTS		32
 #define MVNETA_RX_COAL_USEC		100
 
-- 
1.7.12.1



^ permalink raw reply related	[flat|nested] 121+ messages in thread

* [BUG,REGRESSION?] 3.11.6+,3.12: GbE iface rate drops to few KB/s
@ 2013-11-21  0:44                                       ` Willy Tarreau
  0 siblings, 0 replies; 121+ messages in thread
From: Willy Tarreau @ 2013-11-21  0:44 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Arnaud,

On Wed, Nov 20, 2013 at 10:54:35PM +0100, Willy Tarreau wrote:
> I'm currently trying to implement TX IRQ handling. I found the registers
> description in the neta driver that is provided in Marvell's LSP kernel
> that is shipped with some devices using their CPUs. This code is utterly
> broken (eg: splice fails with -EBADF) but I think the register descriptions
> could be trusted.
> 
> I'd rather have real IRQ handling than just relying on mvneta_poll(), so
> that we can use it for asymmetric traffic/routing/whatever.

OK it paid off. And very well :-)

I did it at once and it worked immediately. I generally don't like this
because I always fear that some bug was left there hidden in the code. I have
only tested it on the Mirabox, so I'll have to try on the OpenBlocks AX3-4 and
on the XP-GP board for some SMP stress tests.

I upgraded my Mirabox to latest Linus' git (commit 5527d151) and compared
with and without the patch.

  without :
      - need at least 12 streams to reach gigabit.
      - 60% of idle CPU remains at 1 Gbps
      - HTTP connection rate on empty objects is 9950 connections/s
      - cumulated outgoing traffic on two ports reaches 1.3 Gbps

  with the patch :
      - a single stream easily saturates the gigabit
      - 87% of idle CPU at 1 Gbps (12 streams, 90% idle at 1 stream)
      - HTTP connection rate on empty objects is 10250 connections/s
      - I saturate the two gig ports at 99% CPU, so 2 Gbps sustained output.

BTW I must say I was impressed to see that big an improvement in CPU
usage between 3.10 and 3.13, I suspect some of the Tx queue improvements
that Eric has done in between account for this.

I cut the patch in 3 parts :
   - one which reintroduces the hidden bits of the driver
   - one which replaces the timer with the IRQ
   - one which changes the default Tx coalesce from 16 to 4 packets
     (larger was preferred with the timer, but less is better now).

I'm attaching them, please test them on your device.

Note that this is *not* for inclusion at the moment as it has not been
tested on the SMP CPUs.

Cheers,
Willy

-------------- next part --------------
>From b77b32dbffdfffc6aa21fa230054e09e2a4cd227 Mon Sep 17 00:00:00 2001
From: Willy Tarreau <w@1wt.eu>
Date: Wed, 20 Nov 2013 23:58:30 +0100
Subject: net: mvneta: add missing bit descriptions for interrupt masks and
 causes

Marvell has not published the chip's datasheet yet, so it's very hard
to find the relevant bits to manipulate to change the IRQ behaviour.
Fortunately, these bits are described in the proprietary LSP patch set
which is publicly available here :

    http://www.plugcomputer.org/downloads/mirabox/

So let's put them back in the driver in order to reduce the burden of
current and future maintenance.

Signed-off-by: Willy Tarreau <w@1wt.eu>
---
 drivers/net/ethernet/marvell/mvneta.c | 44 +++++++++++++++++++++++++++++++++--
 1 file changed, 42 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index b8e232b..6630690 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -101,16 +101,56 @@
 #define      MVNETA_CPU_RXQ_ACCESS_ALL_MASK      0x000000ff
 #define      MVNETA_CPU_TXQ_ACCESS_ALL_MASK      0x0000ff00
 #define MVNETA_RXQ_TIME_COAL_REG(q)              (0x2580 + ((q) << 2))
+
+/* Exception Interrupt Port/Queue Cause register */
+
 #define MVNETA_INTR_NEW_CAUSE                    0x25a0
-#define      MVNETA_RX_INTR_MASK(nr_rxqs)        (((1 << nr_rxqs) - 1) << 8)
 #define MVNETA_INTR_NEW_MASK                     0x25a4
+
+/* bits  0..7  = TXQ SENT, one bit per queue.
+ * bits  8..15 = RXQ OCCUP, one bit per queue.
+ * bits 16..23 = RXQ FREE, one bit per queue.
+ * bit  29 = OLD_REG_SUM, see old reg ?
+ * bit  30 = TX_ERR_SUM, one bit for 4 ports
+ * bit  31 = MISC_SUM,   one bit for 4 ports
+ */
+#define      MVNETA_TX_INTR_MASK(nr_txqs)        (((1 << nr_txqs) - 1) << 0)
+#define      MVNETA_TX_INTR_MASK_ALL             (0xff << 0)
+#define      MVNETA_RX_INTR_MASK(nr_rxqs)        (((1 << nr_rxqs) - 1) << 8)
+#define      MVNETA_RX_INTR_MASK_ALL             (0xff << 8)
+
 #define MVNETA_INTR_OLD_CAUSE                    0x25a8
 #define MVNETA_INTR_OLD_MASK                     0x25ac
+
+/* Data Path Port/Queue Cause Register */
 #define MVNETA_INTR_MISC_CAUSE                   0x25b0
 #define MVNETA_INTR_MISC_MASK                    0x25b4
+
+#define      MVNETA_CAUSE_PHY_STATUS_CHANGE      BIT(0)
+#define      MVNETA_CAUSE_LINK_CHANGE            BIT(1)
+#define      MVNETA_CAUSE_PTP                    BIT(4)
+
+#define      MVNETA_CAUSE_INTERNAL_ADDR_ERR      BIT(7)
+#define      MVNETA_CAUSE_RX_OVERRUN             BIT(8)
+#define      MVNETA_CAUSE_RX_CRC_ERROR           BIT(9)
+#define      MVNETA_CAUSE_RX_LARGE_PKT           BIT(10)
+#define      MVNETA_CAUSE_TX_UNDERUN             BIT(11)
+#define      MVNETA_CAUSE_PRBS_ERR               BIT(12)
+#define      MVNETA_CAUSE_PSC_SYNC_CHANGE        BIT(13)
+#define      MVNETA_CAUSE_SERDES_SYNC_ERR        BIT(14)
+
+#define      MVNETA_CAUSE_BMU_ALLOC_ERR_SHIFT    16
+#define      MVNETA_CAUSE_BMU_ALLOC_ERR_ALL_MASK   (0xF << MVNETA_CAUSE_BMU_ALLOC_ERR_SHIFT)
+#define      MVNETA_CAUSE_BMU_ALLOC_ERR_MASK(pool) (1 << (MVNETA_CAUSE_BMU_ALLOC_ERR_SHIFT + (pool)))
+
+#define      MVNETA_CAUSE_TXQ_ERROR_SHIFT        24
+#define      MVNETA_CAUSE_TXQ_ERROR_ALL_MASK     (0xFF << MVNETA_CAUSE_TXQ_ERROR_SHIFT)
+#define      MVNETA_CAUSE_TXQ_ERROR_MASK(q)      (1 << (MVNETA_CAUSE_TXQ_ERROR_SHIFT + (q)))
+
 #define MVNETA_INTR_ENABLE                       0x25b8
 #define      MVNETA_TXQ_INTR_ENABLE_ALL_MASK     0x0000ff00
-#define      MVNETA_RXQ_INTR_ENABLE_ALL_MASK     0xff000000
+#define      MVNETA_RXQ_INTR_ENABLE_ALL_MASK     0xff000000  // note: neta says it's 0x000000FF
+
 #define MVNETA_RXQ_CMD                           0x2680
 #define      MVNETA_RXQ_DISABLE_SHIFT            8
 #define      MVNETA_RXQ_ENABLE_MASK              0x000000ff
-- 
1.7.12.1

-------------- next part --------------
>From 741bc1bfccbfe33cebaceb2e854539946e8ec9fa Mon Sep 17 00:00:00 2001
From: Willy Tarreau <w@1wt.eu>
Date: Wed, 20 Nov 2013 19:47:11 +0100
Subject: net: mvneta: replace Tx timer with a real interrupt

Right now the mvneta driver doesn't handle Tx IRQ, and solely relies on a
timer to flush Tx descriptors. This causes jerky output traffic with bursts
and pauses, making it difficult to reach line rate with very few streams.

It seems that this feature was inherited from the original driver but
nothing there mentions any reason for not using the interrupt instead,
which the chip supports.

Thus, this patch enables Tx interrupts and removes the timer. It does the
two at once because it's not really possible to make the two mechanisms
coexist, so a split patch doesn't make sense.

First tests performed on a Mirabox (Armada 370) show that much less CPU
is used when sending traffic. One reason might be that we now call the
mvneta_tx_done_gbe() with a mask indicating which queues have been done
instead of looping over all of them.

The results are quite good, a single TCP stream is now capable of saturating
a gigabit link, while a minimum of 12 concurrent streams are needed without
the patch. At 1 Gbps with 12 concurrent streams, 60% of the CPU remains idle
without the patch, and it grows to 87% with the patch. The connection rate
has also increased from 9950 to 10150 connections per second (HTTP requests
of empty objects).

Signed-off-by: Willy Tarreau <w@1wt.eu>
Cc: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Cc: Arnaud Ebalard <arno@natisbad.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
---
 drivers/net/ethernet/marvell/mvneta.c | 71 ++++++-----------------------------
 1 file changed, 12 insertions(+), 59 deletions(-)

diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index 6630690..def32a8 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -216,9 +216,6 @@
 #define MVNETA_RX_COAL_PKTS		32
 #define MVNETA_RX_COAL_USEC		100
 
-/* Timer */
-#define MVNETA_TX_DONE_TIMER_PERIOD	10
-
 /* Napi polling weight */
 #define MVNETA_RX_POLL_WEIGHT		64
 
@@ -272,16 +269,11 @@ struct mvneta_port {
 	void __iomem *base;
 	struct mvneta_rx_queue *rxqs;
 	struct mvneta_tx_queue *txqs;
-	struct timer_list tx_done_timer;
 	struct net_device *dev;
 
 	u32 cause_rx_tx;
 	struct napi_struct napi;
 
-	/* Flags */
-	unsigned long flags;
-#define MVNETA_F_TX_DONE_TIMER_BIT  0
-
 	/* Napi weight */
 	int weight;
 
@@ -1140,17 +1132,6 @@ static void mvneta_tx_done_pkts_coal_set(struct mvneta_port *pp,
 	txq->done_pkts_coal = value;
 }
 
-/* Trigger tx done timer in MVNETA_TX_DONE_TIMER_PERIOD msecs */
-static void mvneta_add_tx_done_timer(struct mvneta_port *pp)
-{
-	if (test_and_set_bit(MVNETA_F_TX_DONE_TIMER_BIT, &pp->flags) == 0) {
-		pp->tx_done_timer.expires = jiffies +
-			msecs_to_jiffies(MVNETA_TX_DONE_TIMER_PERIOD);
-		add_timer(&pp->tx_done_timer);
-	}
-}
-
-
 /* Handle rx descriptor fill by setting buf_cookie and buf_phys_addr */
 static void mvneta_rx_desc_fill(struct mvneta_rx_desc *rx_desc,
 				u32 phys_addr, u32 cookie)
@@ -1632,15 +1613,6 @@ out:
 		dev_kfree_skb_any(skb);
 	}
 
-	if (txq->count >= MVNETA_TXDONE_COAL_PKTS)
-		mvneta_txq_done(pp, txq);
-
-	/* If after calling mvneta_txq_done, count equals
-	 * frags, we need to set the timer
-	 */
-	if (txq->count == frags && frags > 0)
-		mvneta_add_tx_done_timer(pp);
-
 	return NETDEV_TX_OK;
 }
 
@@ -1916,14 +1888,22 @@ static int mvneta_poll(struct napi_struct *napi, int budget)
 
 	/* Read cause register */
 	cause_rx_tx = mvreg_read(pp, MVNETA_INTR_NEW_CAUSE) &
-		MVNETA_RX_INTR_MASK(rxq_number);
+		(MVNETA_RX_INTR_MASK(rxq_number) | MVNETA_TX_INTR_MASK(txq_number));
+
+	/* Release Tx descriptors */
+	if (cause_rx_tx & MVNETA_TX_INTR_MASK_ALL) {
+		int tx_todo = 0;
+
+		mvneta_tx_done_gbe(pp, (cause_rx_tx & MVNETA_TX_INTR_MASK_ALL), &tx_todo);
+		cause_rx_tx &= ~MVNETA_TX_INTR_MASK_ALL;
+	}
 
 	/* For the case where the last mvneta_poll did not process all
 	 * RX packets
 	 */
 	cause_rx_tx |= pp->cause_rx_tx;
 	if (rxq_number > 1) {
-		while ((cause_rx_tx != 0) && (budget > 0)) {
+		while ((cause_rx_tx & MVNETA_RX_INTR_MASK_ALL) && (budget > 0)) {
 			int count;
 			struct mvneta_rx_queue *rxq;
 			/* get rx queue number from cause_rx_tx */
@@ -1955,7 +1935,7 @@ static int mvneta_poll(struct napi_struct *napi, int budget)
 		napi_complete(napi);
 		local_irq_save(flags);
 		mvreg_write(pp, MVNETA_INTR_NEW_MASK,
-			    MVNETA_RX_INTR_MASK(rxq_number));
+			    MVNETA_RX_INTR_MASK(rxq_number) | MVNETA_TX_INTR_MASK(txq_number));
 		local_irq_restore(flags);
 	}
 
@@ -1963,26 +1943,6 @@ static int mvneta_poll(struct napi_struct *napi, int budget)
 	return rx_done;
 }
 
-/* tx done timer callback */
-static void mvneta_tx_done_timer_callback(unsigned long data)
-{
-	struct net_device *dev = (struct net_device *)data;
-	struct mvneta_port *pp = netdev_priv(dev);
-	int tx_done = 0, tx_todo = 0;
-
-	if (!netif_running(dev))
-		return ;
-
-	clear_bit(MVNETA_F_TX_DONE_TIMER_BIT, &pp->flags);
-
-	tx_done = mvneta_tx_done_gbe(pp,
-				     (((1 << txq_number) - 1) &
-				      MVNETA_CAUSE_TXQ_SENT_DESC_ALL_MASK),
-				     &tx_todo);
-	if (tx_todo > 0)
-		mvneta_add_tx_done_timer(pp);
-}
-
 /* Handle rxq fill: allocates rxq skbs; called when initializing a port */
 static int mvneta_rxq_fill(struct mvneta_port *pp, struct mvneta_rx_queue *rxq,
 			   int num)
@@ -2232,7 +2192,7 @@ static void mvneta_start_dev(struct mvneta_port *pp)
 
 	/* Unmask interrupts */
 	mvreg_write(pp, MVNETA_INTR_NEW_MASK,
-		    MVNETA_RX_INTR_MASK(rxq_number));
+		    MVNETA_RX_INTR_MASK(rxq_number) | MVNETA_TX_INTR_MASK(txq_number));
 
 	phy_start(pp->phy_dev);
 	netif_tx_start_all_queues(pp->dev);
@@ -2518,8 +2478,6 @@ static int mvneta_stop(struct net_device *dev)
 	free_irq(dev->irq, pp);
 	mvneta_cleanup_rxqs(pp);
 	mvneta_cleanup_txqs(pp);
-	del_timer(&pp->tx_done_timer);
-	clear_bit(MVNETA_F_TX_DONE_TIMER_BIT, &pp->flags);
 
 	return 0;
 }
@@ -2868,11 +2826,6 @@ static int mvneta_probe(struct platform_device *pdev)
 		}
 	}
 
-	pp->tx_done_timer.data = (unsigned long)dev;
-	pp->tx_done_timer.function = mvneta_tx_done_timer_callback;
-	init_timer(&pp->tx_done_timer);
-	clear_bit(MVNETA_F_TX_DONE_TIMER_BIT, &pp->flags);
-
 	pp->tx_ring_size = MVNETA_MAX_TXD;
 	pp->rx_ring_size = MVNETA_MAX_RXD;
 
-- 
1.7.12.1

-------------- next part --------------
>From 04a4891c4f9a77052e5aea7d2ade25a3f8da5436 Mon Sep 17 00:00:00 2001
From: Willy Tarreau <w@1wt.eu>
Date: Thu, 21 Nov 2013 00:13:06 +0100
Subject: net: mvneta: reduce Tx coalesce from 16 to 4 packets

I'm getting slightly better performance with a smaller Tx coalesce setting,
both with large and short packets. Since it was used differently with the
timer, it is possible that the previous value was more suited for use with
a slow timer.

Signed-off-by: Willy Tarreau <w@1wt.eu>
---
 drivers/net/ethernet/marvell/mvneta.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index def32a8..d188828 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -212,7 +212,7 @@
 /* Various constants */
 
 /* Coalescing */
-#define MVNETA_TXDONE_COAL_PKTS		16
+#define MVNETA_TXDONE_COAL_PKTS		4
 #define MVNETA_RX_COAL_PKTS		32
 #define MVNETA_RX_COAL_USEC		100
 
-- 
1.7.12.1

^ permalink raw reply related	[flat|nested] 121+ messages in thread

* ARM network performance and dma_mask (was: [BUG,REGRESSION?] 3.11.6+,3.12: GbE iface rate drops to few KB/s)
  2013-11-21  0:44                                       ` Willy Tarreau
  (?)
@ 2013-11-21 18:38                                       ` Willy Tarreau
  2013-11-21 19:04                                           ` Thomas Petazzoni
  2013-11-21 22:01                                           ` Rob Herring
  -1 siblings, 2 replies; 121+ messages in thread
From: Willy Tarreau @ 2013-11-21 18:38 UTC (permalink / raw)
  To: Rob Herring
  Cc: Arnaud Ebalard, Thomas Petazzoni, Florian Fainelli, simon.guinot,
	Eric Dumazet, netdev, edumazet, Cong Wang, linux-arm-kernel

Hi Rob,

While we were diagnosing a network performance regression that we finally
found and fixed, it appeared during a test that Linus' tree shows a much
higher performance on Armada 370 (armv7) than its predecessors. I can
saturate the two Gig links of my Mirabox each with a single TCP flow and
keep up to 25% of idle CPU in the optimal case. In 3.12.1 or 3.10.20, I
can achieve around 1.3 Gbps when the two ports are used in parallel.

Today I bisected these kernels to find what was causing this difference.
I found it was your patch below which I can copy entirely here :

  commit 0589342c27944e50ebd7a54f5215002b6598b748
  Author: Rob Herring <rob.herring@calxeda.com>
  Date:   Tue Oct 29 23:36:46 2013 -0500

      of: set dma_mask to point to coherent_dma_mask
    
      Platform devices created by DT code don't initialize dma_mask pointer to
      anything. Set it to coherent_dma_mask by default if the architecture
      code has not set it.
    
      Signed-off-by: Rob Herring <rob.herring@calxeda.com>

  diff --git a/drivers/of/platform.c b/drivers/of/platform.c
  index 9b439ac..c005495 100644
  --- a/drivers/of/platform.c
  +++ b/drivers/of/platform.c
  @@ -216,6 +216,8 @@ static struct platform_device *of_platform_device_create_pdata(
          dev->archdata.dma_mask = 0xffffffffUL;
   #endif
          dev->dev.coherent_dma_mask = DMA_BIT_MASK(32);
  +       if (!dev->dev.dma_mask)
  +               dev->dev.dma_mask = &dev->dev.coherent_dma_mask;
          dev->dev.bus = &platform_bus_type;
          dev->dev.platform_data = platform_data;

And I can confirm that applying this patch on 3.10.20 + the fixes we found
yesterday substantially boosted my network performance (and reduced the CPU
usage when running on a single link).

I'm not at ease with these things so I'd like to ask your opinion here, is
this supposed to be an improvement or a fix ? Is this something we should
backport into stable versions, or is there something to fix in the armada
platform so that it works just as if the patch was applied ?

Thanks,
Willy

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: ARM network performance and dma_mask (was: [BUG,REGRESSION?] 3.11.6+,3.12: GbE iface rate drops to few KB/s)
  2013-11-21 18:38                                       ` ARM network performance and dma_mask (was: [BUG,REGRESSION?] 3.11.6+,3.12: GbE iface rate drops to few KB/s) Willy Tarreau
@ 2013-11-21 19:04                                           ` Thomas Petazzoni
  2013-11-21 22:01                                           ` Rob Herring
  1 sibling, 0 replies; 121+ messages in thread
From: Thomas Petazzoni @ 2013-11-21 19:04 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Rob Herring, Arnaud Ebalard, Florian Fainelli, simon.guinot,
	Eric Dumazet, netdev, edumazet, Cong Wang, linux-arm-kernel

Dear Willy Tarreau,

On Thu, 21 Nov 2013 19:38:34 +0100, Willy Tarreau wrote:

> While we were diagnosing a network performance regression that we finally
> found and fixed, it appeared during a test that Linus' tree shows a much
> higher performance on Armada 370 (armv7) than its predecessors. I can
> saturate the two Gig links of my Mirabox each with a single TCP flow and
> keep up to 25% of idle CPU in the optimal case. In 3.12.1 or 3.10.20, I
> can achieve around 1.3 Gbps when the two ports are used in parallel.

Interesting finding and analysis, once again!

> I'm not at ease with these things so I'd like to ask your opinion here, is
> this supposed to be an improvement or a fix ? Is this something we should
> backport into stable versions, or is there something to fix in the armada
> platform so that it works just as if the patch was applied ?

I guess the driver should have been setting its dma_mask to 0xffffffff,
since the platform is capable of doing DMA on the first 32 bits of the
physical address space, probably something like calling
pci_set_dma_mask(pdev, DMA_BIT_MASK(32)) or something like that. I know
Russell has recently added some helpers to prevent stupid people (like
me) from doing mistakes when setting the DMA masks. Certainly worth
having a look.

Best regards,

Thomas
-- 
Thomas Petazzoni, CTO, Free Electrons
Embedded Linux, Kernel and Android engineering
http://free-electrons.com

^ permalink raw reply	[flat|nested] 121+ messages in thread

* ARM network performance and dma_mask (was: [BUG,REGRESSION?] 3.11.6+,3.12: GbE iface rate drops to few KB/s)
@ 2013-11-21 19:04                                           ` Thomas Petazzoni
  0 siblings, 0 replies; 121+ messages in thread
From: Thomas Petazzoni @ 2013-11-21 19:04 UTC (permalink / raw)
  To: linux-arm-kernel

Dear Willy Tarreau,

On Thu, 21 Nov 2013 19:38:34 +0100, Willy Tarreau wrote:

> While we were diagnosing a network performance regression that we finally
> found and fixed, it appeared during a test that Linus' tree shows a much
> higher performance on Armada 370 (armv7) than its predecessors. I can
> saturate the two Gig links of my Mirabox each with a single TCP flow and
> keep up to 25% of idle CPU in the optimal case. In 3.12.1 or 3.10.20, I
> can achieve around 1.3 Gbps when the two ports are used in parallel.

Interesting finding and analysis, once again!

> I'm not at ease with these things so I'd like to ask your opinion here, is
> this supposed to be an improvement or a fix ? Is this something we should
> backport into stable versions, or is there something to fix in the armada
> platform so that it works just as if the patch was applied ?

I guess the driver should have been setting its dma_mask to 0xffffffff,
since the platform is capable of doing DMA on the first 32 bits of the
physical address space, probably something like calling
pci_set_dma_mask(pdev, DMA_BIT_MASK(32)) or something like that. I know
Russell has recently added some helpers to prevent stupid people (like
me) from doing mistakes when setting the DMA masks. Certainly worth
having a look.

Best regards,

Thomas
-- 
Thomas Petazzoni, CTO, Free Electrons
Embedded Linux, Kernel and Android engineering
http://free-electrons.com

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [BUG,REGRESSION?] 3.11.6+,3.12: GbE iface rate drops to few KB/s
  2013-11-21  0:44                                       ` Willy Tarreau
@ 2013-11-21 21:51                                         ` Arnaud Ebalard
  -1 siblings, 0 replies; 121+ messages in thread
From: Arnaud Ebalard @ 2013-11-21 21:51 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Thomas Petazzoni, Florian Fainelli, simon.guinot, Eric Dumazet,
	netdev, edumazet, Cong Wang, linux-arm-kernel

Hi,

Willy Tarreau <w@1wt.eu> writes:

> OK it paid off. And very well :-)
>
> I did it at once and it worked immediately. I generally don't like this
> because I always fear that some bug was left there hidden in the code. I have
> only tested it on the Mirabox, so I'll have to try on the OpenBlocks AX3-4 and
> on the XP-GP board for some SMP stress tests.
>
> I upgraded my Mirabox to latest Linus' git (commit 5527d151) and compared
> with and without the patch.
>
>   without :
>       - need at least 12 streams to reach gigabit.
>       - 60% of idle CPU remains at 1 Gbps
>       - HTTP connection rate on empty objects is 9950 connections/s
>       - cumulated outgoing traffic on two ports reaches 1.3 Gbps
>
>   with the patch :
>       - a single stream easily saturates the gigabit
>       - 87% of idle CPU at 1 Gbps (12 streams, 90% idle at 1 stream)
>       - HTTP connection rate on empty objects is 10250 connections/s
>       - I saturate the two gig ports at 99% CPU, so 2 Gbps sustained output.
>
> BTW I must say I was impressed to see that big an improvement in CPU
> usage between 3.10 and 3.13, I suspect some of the Tx queue improvements
> that Eric has done in between account for this.
>
> I cut the patch in 3 parts :
>    - one which reintroduces the hidden bits of the driver
>    - one which replaces the timer with the IRQ
>    - one which changes the default Tx coalesce from 16 to 4 packets
>      (larger was preferred with the timer, but less is better now).
>
> I'm attaching them, please test them on your device.

Well, on the RN102 (Armada 370), I get the same results as with your
previous patch, i.e. netperf and nginx saturate the link. Apache still
lagging behind though.

> Note that this is *not* for inclusion at the moment as it has not been
> tested on the SMP CPUs.

I tested it on my RN2120 (2-core armada XP): I got no problem and the
link saturated w/ apache, nginx and netperf. Good work!

Cheers,

a+

^ permalink raw reply	[flat|nested] 121+ messages in thread

* [BUG,REGRESSION?] 3.11.6+,3.12: GbE iface rate drops to few KB/s
@ 2013-11-21 21:51                                         ` Arnaud Ebalard
  0 siblings, 0 replies; 121+ messages in thread
From: Arnaud Ebalard @ 2013-11-21 21:51 UTC (permalink / raw)
  To: linux-arm-kernel

Hi,

Willy Tarreau <w@1wt.eu> writes:

> OK it paid off. And very well :-)
>
> I did it at once and it worked immediately. I generally don't like this
> because I always fear that some bug was left there hidden in the code. I have
> only tested it on the Mirabox, so I'll have to try on the OpenBlocks AX3-4 and
> on the XP-GP board for some SMP stress tests.
>
> I upgraded my Mirabox to latest Linus' git (commit 5527d151) and compared
> with and without the patch.
>
>   without :
>       - need at least 12 streams to reach gigabit.
>       - 60% of idle CPU remains at 1 Gbps
>       - HTTP connection rate on empty objects is 9950 connections/s
>       - cumulated outgoing traffic on two ports reaches 1.3 Gbps
>
>   with the patch :
>       - a single stream easily saturates the gigabit
>       - 87% of idle CPU at 1 Gbps (12 streams, 90% idle at 1 stream)
>       - HTTP connection rate on empty objects is 10250 connections/s
>       - I saturate the two gig ports at 99% CPU, so 2 Gbps sustained output.
>
> BTW I must say I was impressed to see that big an improvement in CPU
> usage between 3.10 and 3.13, I suspect some of the Tx queue improvements
> that Eric has done in between account for this.
>
> I cut the patch in 3 parts :
>    - one which reintroduces the hidden bits of the driver
>    - one which replaces the timer with the IRQ
>    - one which changes the default Tx coalesce from 16 to 4 packets
>      (larger was preferred with the timer, but less is better now).
>
> I'm attaching them, please test them on your device.

Well, on the RN102 (Armada 370), I get the same results as with your
previous patch, i.e. netperf and nginx saturate the link. Apache still
lagging behind though.

> Note that this is *not* for inclusion at the moment as it has not been
> tested on the SMP CPUs.

I tested it on my RN2120 (2-core armada XP): I got no problem and the
link saturated w/ apache, nginx and netperf. Good work!

Cheers,

a+

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: ARM network performance and dma_mask (was: [BUG,REGRESSION?] 3.11.6+,3.12: GbE iface rate drops to few KB/s)
  2013-11-21 19:04                                           ` Thomas Petazzoni
@ 2013-11-21 21:51                                             ` Willy Tarreau
  -1 siblings, 0 replies; 121+ messages in thread
From: Willy Tarreau @ 2013-11-21 21:51 UTC (permalink / raw)
  To: Thomas Petazzoni
  Cc: Florian Fainelli, simon.guinot, Eric Dumazet, netdev,
	Arnaud Ebalard, Rob Herring, edumazet, Cong Wang,
	linux-arm-kernel

Hi Thomas,

On Thu, Nov 21, 2013 at 08:04:33PM +0100, Thomas Petazzoni wrote:
> > I'm not at ease with these things so I'd like to ask your opinion here, is
> > this supposed to be an improvement or a fix ? Is this something we should
> > backport into stable versions, or is there something to fix in the armada
> > platform so that it works just as if the patch was applied ?
> 
> I guess the driver should have been setting its dma_mask to 0xffffffff,
> since the platform is capable of doing DMA on the first 32 bits of the
> physical address space, probably something like calling
> pci_set_dma_mask(pdev, DMA_BIT_MASK(32)) or something like that.

Almost, yes. Thanks for the tip! There are so few drivers which do this
that I was convinced something was missing (nobody initializes dma_mask
on this platform), so calls to dma_set_mask() from drivers return -EIO
and are ignored.

I ended up with that in mvneta_init() at the end :

       /* setting DMA mask significantly improves transfer rates */
       pp->dev->dev.parent->coherent_dma_mask = DMA_BIT_MASK(32);
       pp->dev->dev.parent->dma_mask = &pp->dev->dev.parent->coherent_dma_mask;

This method changed in 3.12 with Russell's commit fa6a8d6 (DMA-API: provide
a helper to setup DMA masks) doing it in a cleaner and safer way, using
dma_coerce_mask_and_coherent().
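
For the record, with that helper the whole thing boils down to something
like this in the probe path (untested sketch on my side, just to show the
3.12+ form) :

       /* 3.12+ equivalent of the manual mask assignment above */
       if (dma_coerce_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(32)))
               dev_warn(&pdev->dev, "failed to set DMA mask\n");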

Then Rob's commit 0589342 (of: set dma_mask to point to coherent_dma_mask),
also merged in 3.12, pre-initialized the dma_mask to point to
&coherent_dma_mask for all devices by default.

> I know
> Russell has recently added some helpers to prevent stupid people (like
> me) from making mistakes when setting the DMA masks. Certainly worth
> having a look.

My change now allows me to proxy HTTP traffic at 1 Gbps between the two
ports of the Mirabox, while it was limited to 650 Mbps without the change.
But it's not needed in mainline anymore. However it might be worth having
in older kernels (I don't know if it's suitable for stable since I don't
know whether that's a bug), or at least in your own kernels if you have
to maintain an older branch for some customers.

That said, I tend to believe that applying Rob's patch will be better than
just the change above since it will cover all drivers, not only mvneta.
I'll have to test on the AX3 and the XP-GP to see the performance gain in
SMP and using the PCIe.

Best regards,
Willy

^ permalink raw reply	[flat|nested] 121+ messages in thread

* ARM network performance and dma_mask (was: [BUG, REGRESSION?] 3.11.6+, 3.12: GbE iface rate drops to few KB/s)
@ 2013-11-21 21:51                                             ` Willy Tarreau
  0 siblings, 0 replies; 121+ messages in thread
From: Willy Tarreau @ 2013-11-21 21:51 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Thomas,

On Thu, Nov 21, 2013 at 08:04:33PM +0100, Thomas Petazzoni wrote:
> > I'm not at ease with these things so I'd like to ask your opinion here, is
> > this supposed to be an improvement or a fix ? Is this something we should
> > backport into stable versions, or is there something to fix in the armada
> > platform so that it works just as if the patch was applied ?
> 
> I guess the driver should have been setting its dma_mask to 0xffffffff,
> since the platform is capable of doing DMA on the first 32 bits of the
> physical address space, probably something like calling
> pci_set_dma_mask(pdev, DMA_BIT_MASK(32)) or something like that.

Almost, yes. Thanks for the tip! There are so few drivers which do this
that I was convinced something was missing (nobody initializes dma_mask
on this platform), so calls to dma_set_mask() from drivers return -EIO
and are ignored.

I ended up with that in mvneta_init() at the end :

       /* setting DMA mask significantly improves transfer rates */
       pp->dev->dev.parent->coherent_dma_mask = DMA_BIT_MASK(32);
       pp->dev->dev.parent->dma_mask = &pp->dev->dev.parent->coherent_dma_mask;

This method changed in 3.12 with Russell's commit fa6a8d6 (DMA-API: provide
a helper to setup DMA masks) doing it in a cleaner and safer way, using
dma_coerce_mask_and_coherent().

Then Rob's commit 0589342 (of: set dma_mask to point to coherent_dma_mask),
also merged in 3.12, pre-initialized the dma_mask to point to
&coherent_dma_mask for all devices by default.

> I know
> Russell has recently added some helpers to prevent stupid people (like
> me) from making mistakes when setting the DMA masks. Certainly worth
> having a look.

My change now allows me to proxy HTTP traffic at 1 Gbps between the two
ports of the Mirabox, while it was limited to 650 Mbps without the change.
But it's not needed in mainline anymore. However it might be worth having
in older kernels (I don't know if it's suitable for stable since I don't
know whether that's a bug), or at least in your own kernels if you have
to maintain an older branch for some customers.

That said, I tend to believe that applying Rob's patch will be better than
just the change above since it will cover all drivers, not only mvneta.
I'll have to test on the AX3 and the XP-GP to see the performance gain in
SMP and using the PCIe.

Best regards,
Willy

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [BUG,REGRESSION?] 3.11.6+,3.12: GbE iface rate drops to few KB/s
  2013-11-21 21:51                                         ` Arnaud Ebalard
@ 2013-11-21 21:52                                           ` Willy Tarreau
  -1 siblings, 0 replies; 121+ messages in thread
From: Willy Tarreau @ 2013-11-21 21:52 UTC (permalink / raw)
  To: Arnaud Ebalard
  Cc: Thomas Petazzoni, Florian Fainelli, simon.guinot, Eric Dumazet,
	netdev, edumazet, Cong Wang, linux-arm-kernel

On Thu, Nov 21, 2013 at 10:51:09PM +0100, Arnaud Ebalard wrote:
> Hi,
> 
> Willy Tarreau <w@1wt.eu> writes:
> 
> > OK it paid off. And very well :-)
> >
> > I did it at once and it worked immediately. I generally don't like this
> > because I always fear that some bug was left there hidden in the code. I have
> > only tested it on the Mirabox, so I'll have to try on the OpenBlocks AX3-4 and
> > on the XP-GP board for some SMP stress tests.
> >
> > I upgraded my Mirabox to latest Linus' git (commit 5527d151) and compared
> > with and without the patch.
> >
> >   without :
> >       - need at least 12 streams to reach gigabit.
> >       - 60% of idle CPU remains at 1 Gbps
> >       - HTTP connection rate on empty objects is 9950 connections/s
> >       - cumulated outgoing traffic on two ports reaches 1.3 Gbps
> >
> >   with the patch :
> >       - a single stream easily saturates the gigabit
> >       - 87% of idle CPU at 1 Gbps (12 streams, 90% idle at 1 stream)
> >       - HTTP connection rate on empty objects is 10250 connections/s
> >       - I saturate the two gig ports at 99% CPU, so 2 Gbps sustained output.
> >
> > BTW I must say I was impressed to see that big an improvement in CPU
> > usage between 3.10 and 3.13, I suspect some of the Tx queue improvements
> > that Eric has done in between account for this.
> >
> > I cut the patch in 3 parts :
> >    - one which reintroduces the hidden bits of the driver
> >    - one which replaces the timer with the IRQ
> >    - one which changes the default Tx coalesce from 16 to 4 packets
> >      (larger was preferred with the timer, but less is better now).
> >
> > I'm attaching them, please test them on your device.
> 
> Well, on the RN102 (Armada 370), I get the same results as with your
> previous patch, i.e. netperf and nginx saturate the link. Apache still
> lagging behind though.
> 
> > Note that this is *not* for inclusion at the moment as it has not been
> > tested on the SMP CPUs.
> 
> I tested it on my RN2120 (2-core armada XP): I got no problem and the
> link saturated w/ apache, nginx and netperf. Good work!

Great, thanks for your tests Arnaud. I forgot to mention that all my
tests this evening involved this patch as well.

Cheers,
Willy

^ permalink raw reply	[flat|nested] 121+ messages in thread

* [BUG,REGRESSION?] 3.11.6+,3.12: GbE iface rate drops to few KB/s
@ 2013-11-21 21:52                                           ` Willy Tarreau
  0 siblings, 0 replies; 121+ messages in thread
From: Willy Tarreau @ 2013-11-21 21:52 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Nov 21, 2013 at 10:51:09PM +0100, Arnaud Ebalard wrote:
> Hi,
> 
> Willy Tarreau <w@1wt.eu> writes:
> 
> > OK it paid off. And very well :-)
> >
> > I did it at once and it worked immediately. I generally don't like this
> > because I always fear that some bug was left there hidden in the code. I have
> > only tested it on the Mirabox, so I'll have to try on the OpenBlocks AX3-4 and
> > on the XP-GP board for some SMP stress tests.
> >
> > I upgraded my Mirabox to latest Linus' git (commit 5527d151) and compared
> > with and without the patch.
> >
> >   without :
> >       - need at least 12 streams to reach gigabit.
> >       - 60% of idle CPU remains at 1 Gbps
> >       - HTTP connection rate on empty objects is 9950 connections/s
> >       - cumulated outgoing traffic on two ports reaches 1.3 Gbps
> >
> >   with the patch :
> >       - a single stream easily saturates the gigabit
> >       - 87% of idle CPU at 1 Gbps (12 streams, 90% idle at 1 stream)
> >       - HTTP connection rate on empty objects is 10250 connections/s
> >       - I saturate the two gig ports at 99% CPU, so 2 Gbps sustained output.
> >
> > BTW I must say I was impressed to see that big an improvement in CPU
> > usage between 3.10 and 3.13, I suspect some of the Tx queue improvements
> > that Eric has done in between account for this.
> >
> > I cut the patch in 3 parts :
> >    - one which reintroduces the hidden bits of the driver
> >    - one which replaces the timer with the IRQ
> >    - one which changes the default Tx coalesce from 16 to 4 packets
> >      (larger was preferred with the timer, but less is better now).
> >
> > I'm attaching them, please test them on your device.
> 
> Well, on the RN102 (Armada 370), I get the same results as with your
> previous patch, i.e. netperf and nginx saturate the link. Apache still
> lagging behind though.
> 
> > Note that this is *not* for inclusion at the moment as it has not been
> > tested on the SMP CPUs.
> 
> I tested it on my RN2120 (2-core armada XP): I got no problem and the
> link saturated w/ apache, nginx and netperf. Good work!

Great, thanks for your tests Arnaud. I forgot to mention that all my
tests this evening involved this patch as well.

Cheers,
Willy

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [BUG,REGRESSION?] 3.11.6+,3.12: GbE iface rate drops to few KB/s
  2013-11-21 21:52                                           ` Willy Tarreau
@ 2013-11-21 22:00                                             ` Eric Dumazet
  -1 siblings, 0 replies; 121+ messages in thread
From: Eric Dumazet @ 2013-11-21 22:00 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Arnaud Ebalard, Thomas Petazzoni, Florian Fainelli, simon.guinot,
	netdev, edumazet, Cong Wang, linux-arm-kernel

On Thu, 2013-11-21 at 22:52 +0100, Willy Tarreau wrote:
> On Thu, Nov 21, 2013 at 10:51:09PM +0100, Arnaud Ebalard wrote:
> > Hi,
> > 
> > Willy Tarreau <w@1wt.eu> writes:
> > 
> > > OK it paid off. And very well :-)
> > >
> > > I did it at once and it worked immediately. I generally don't like this
> > > because I always fear that some bug was left there hidden in the code. I have
> > > only tested it on the Mirabox, so I'll have to try on the OpenBlocks AX3-4 and
> > > on the XP-GP board for some SMP stress tests.
> > >
> > > I upgraded my Mirabox to latest Linus' git (commit 5527d151) and compared
> > > with and without the patch.
> > >
> > >   without :
> > >       - need at least 12 streams to reach gigabit.
> > >       - 60% of idle CPU remains at 1 Gbps
> > >       - HTTP connection rate on empty objects is 9950 connections/s
> > >       - cumulated outgoing traffic on two ports reaches 1.3 Gbps
> > >
> > >   with the patch :
> > >       - a single stream easily saturates the gigabit
> > >       - 87% of idle CPU at 1 Gbps (12 streams, 90% idle at 1 stream)
> > >       - HTTP connection rate on empty objects is 10250 connections/s
> > >       - I saturate the two gig ports at 99% CPU, so 2 Gbps sustained output.
> > >
> > > BTW I must say I was impressed to see that big an improvement in CPU
> > > usage between 3.10 and 3.13, I suspect some of the Tx queue improvements
> > > that Eric has done in between account for this.
> > >
> > > I cut the patch in 3 parts :
> > >    - one which reintroduces the hidden bits of the driver
> > >    - one which replaces the timer with the IRQ
> > >    - one which changes the default Tx coalesce from 16 to 4 packets
> > >      (larger was preferred with the timer, but less is better now).
> > >
> > > I'm attaching them, please test them on your device.
> > 
> > Well, on the RN102 (Armada 370), I get the same results as with your
> > previous patch, i.e. netperf and nginx saturate the link. Apache still
> > lagging behind though.
> > 
> > > Note that this is *not* for inclusion at the moment as it has not been
> > > tested on the SMP CPUs.
> > 
> > I tested it on my RN2120 (2-core armada XP): I got no problem and the
> > link saturated w/ apache, nginx and netperf. Good work!
> 
> Great, thanks for your tests Arnaud. I forgot to mention that all my
> tests this evening involved this patch as well.

Now you might try to set a lower value
for /proc/sys/net/ipv4/tcp_limit_output_bytes

Ideally, a value of 8192 (instead of 131072) allows queueing less data
per TCP flow and reacting faster to losses, as retransmits don't have
to wait until the previous packets in the Qdisc have left the host.

131072 bytes for an 80 Mbit flow means more than 11 ms of queueing :(
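
As a quick back-of-the-envelope check of those numbers (my own arithmetic,
nothing more than limit * 8 / rate) :

    /* rough per-flow queueing delay implied by tcp_limit_output_bytes */
    #include <stdio.h>

    int main(void)
    {
            const double rate = 80e6;                 /* 80 Mbit/s flow  */
            const double limits[] = { 131072, 8192 }; /* bytes per flow  */

            for (int i = 0; i < 2; i++)
                    printf("%6.0f bytes -> %4.1f ms of queueing\n",
                           limits[i], limits[i] * 8 * 1000 / rate);
            return 0;
    }

This prints about 13 ms for the default 131072 and under 1 ms for 8192.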

^ permalink raw reply	[flat|nested] 121+ messages in thread

* [BUG,REGRESSION?] 3.11.6+,3.12: GbE iface rate drops to few KB/s
@ 2013-11-21 22:00                                             ` Eric Dumazet
  0 siblings, 0 replies; 121+ messages in thread
From: Eric Dumazet @ 2013-11-21 22:00 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, 2013-11-21 at 22:52 +0100, Willy Tarreau wrote:
> On Thu, Nov 21, 2013 at 10:51:09PM +0100, Arnaud Ebalard wrote:
> > Hi,
> > 
> > Willy Tarreau <w@1wt.eu> writes:
> > 
> > > OK it paid off. And very well :-)
> > >
> > > I did it at once and it worked immediately. I generally don't like this
> > > because I always fear that some bug was left there hidden in the code. I have
> > > only tested it on the Mirabox, so I'll have to try on the OpenBlocks AX3-4 and
> > > on the XP-GP board for some SMP stress tests.
> > >
> > > I upgraded my Mirabox to latest Linus' git (commit 5527d151) and compared
> > > with and without the patch.
> > >
> > >   without :
> > >       - need at least 12 streams to reach gigabit.
> > >       - 60% of idle CPU remains at 1 Gbps
> > >       - HTTP connection rate on empty objects is 9950 connections/s
> > >       - cumulated outgoing traffic on two ports reaches 1.3 Gbps
> > >
> > >   with the patch :
> > >       - a single stream easily saturates the gigabit
> > >       - 87% of idle CPU at 1 Gbps (12 streams, 90% idle at 1 stream)
> > >       - HTTP connection rate on empty objects is 10250 connections/s
> > >       - I saturate the two gig ports at 99% CPU, so 2 Gbps sustained output.
> > >
> > > BTW I must say I was impressed to see that big an improvement in CPU
> > > usage between 3.10 and 3.13, I suspect some of the Tx queue improvements
> > > that Eric has done in between account for this.
> > >
> > > I cut the patch in 3 parts :
> > >    - one which reintroduces the hidden bits of the driver
> > >    - one which replaces the timer with the IRQ
> > >    - one which changes the default Tx coalesce from 16 to 4 packets
> > >      (larger was preferred with the timer, but less is better now).
> > >
> > > I'm attaching them, please test them on your device.
> > 
> > Well, on the RN102 (Armada 370), I get the same results as with your
> > previous patch, i.e. netperf and nginx saturate the link. Apache still
> > lagging behind though.
> > 
> > > Note that this is *not* for inclusion at the moment as it has not been
> > > tested on the SMP CPUs.
> > 
> > I tested it on my RN2120 (2-core armada XP): I got no problem and the
> > link saturated w/ apache, nginx and netperf. Good work!
> 
> Great, thanks for your tests Arnaud. I forgot to mention that all my
> tests this evening involved this patch as well.

Now you might try to set a lower value
for /proc/sys/net/ipv4/tcp_limit_output_bytes

Ideally, a value of 8192 (instead of 131072) allows queueing less data
per TCP flow and reacting faster to losses, as retransmits don't have
to wait until the previous packets in the Qdisc have left the host.

131072 bytes for an 80 Mbit flow means more than 11 ms of queueing :(

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: ARM network performance and dma_mask
  2013-11-21 18:38                                       ` ARM network performance and dma_mask (was: [BUG,REGRESSION?] 3.11.6+,3.12: GbE iface rate drops to few KB/s) Willy Tarreau
@ 2013-11-21 22:01                                           ` Rob Herring
  2013-11-21 22:01                                           ` Rob Herring
  1 sibling, 0 replies; 121+ messages in thread
From: Rob Herring @ 2013-11-21 22:01 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Arnaud Ebalard, Thomas Petazzoni, Florian Fainelli, simon.guinot,
	Eric Dumazet, netdev, edumazet, Cong Wang, linux-arm-kernel

On 11/21/2013 12:38 PM, Willy Tarreau wrote:
> Hi Rob,
> 
> While we were diagnosing a network performance regression that we finally
> found and fixed, it appeared during a test that Linus' tree shows much
> higher performance on Armada 370 (armv7) than its predecessors. I can
> saturate the two Gig links of my Mirabox, each with a single TCP flow, and
> keep up to 25% of idle CPU in the optimal case. With 3.12.1 or 3.10.20, I
> can only achieve around 1.3 Gbps when the two ports are used in parallel.
> 
> Today I bisected these kernels to find what was causing this difference.
> I found it was your patch below, which I copy entirely here:
> 
>   commit 0589342c27944e50ebd7a54f5215002b6598b748
>   Author: Rob Herring <rob.herring@calxeda.com>
>   Date:   Tue Oct 29 23:36:46 2013 -0500
> 
>       of: set dma_mask to point to coherent_dma_mask
>     
>       Platform devices created by DT code don't initialize dma_mask pointer to
>       anything. Set it to coherent_dma_mask by default if the architecture
>       code has not set it.
>     
>       Signed-off-by: Rob Herring <rob.herring@calxeda.com>
> 
>   diff --git a/drivers/of/platform.c b/drivers/of/platform.c
>   index 9b439ac..c005495 100644
>   --- a/drivers/of/platform.c
>   +++ b/drivers/of/platform.c
>   @@ -216,6 +216,8 @@ static struct platform_device *of_platform_device_create_pdata(
>           dev->archdata.dma_mask = 0xffffffffUL;
>    #endif
>           dev->dev.coherent_dma_mask = DMA_BIT_MASK(32);
>   +       if (!dev->dev.dma_mask)
>   +               dev->dev.dma_mask = &dev->dev.coherent_dma_mask;
>           dev->dev.bus = &platform_bus_type;
>           dev->dev.platform_data = platform_data;
> 
> And I can confirm that applying this patch on 3.10.20 + the fixes we found
> yesterday substantially boosted my network performance (and reduced the CPU
> usage when running on a single link).
> 
> I'm not at ease with these things, so I'd like to ask your opinion here: is
> this supposed to be an improvement or a fix? Is this something we should
> backport into stable versions, or is there something to fix in the Armada
> platform so that it works just as if the patch was applied?
> 

The patch was to fix this issue[1]. It is fixed in the core code because
dma_mask not being set has been a known issue with DT probing for some
time. Since most drivers don't seem to care, we've gotten away with it.
I thought the normal failure mode was drivers failing to probe.

As to why it helps performance, I'm not really sure. Perhaps it is
causing some bounce buffers to be used.

Rob

[1] http://lists.xen.org/archives/html/xen-devel/2013-10/msg00092.html
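
As for the backport question, a hypothetical way to try the quoted commit on
a stable tree such as 3.10.20 (the branch name is made up, and the
cherry-pick may need minor context adjustments on older trees):

    # assumes a clone of the stable tree with the v3.10.20 tag available
    git checkout -b dma-mask-test v3.10.20
    git cherry-pick 0589342c27944e50ebd7a54f5215002b6598b748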

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: ARM network performance and dma_mask
  2013-11-21 22:01                                           ` Rob Herring
@ 2013-11-21 22:13                                             ` Willy Tarreau
  -1 siblings, 0 replies; 121+ messages in thread
From: Willy Tarreau @ 2013-11-21 22:13 UTC (permalink / raw)
  To: Rob Herring
  Cc: Arnaud Ebalard, Thomas Petazzoni, Florian Fainelli, simon.guinot,
	Eric Dumazet, netdev, edumazet, Cong Wang, linux-arm-kernel

On Thu, Nov 21, 2013 at 04:01:42PM -0600, Rob Herring wrote:
> The patch was to fix this issue[1]. It is fixed in the core code because
> dma_mask not being set has been a known issue with DT probing for some
> time. Since most drivers don't seem to care, we've gotten away with it.
> I thought the normal failure mode was drivers failing to probe.

It seems that very few drivers try to set their mask, so the default
value was probably already OK, even though less performant.

> As to why it helps performance, I'm not really sure. Perhaps it is
> causing some bounce buffers to be used.

That's also the thing I have been thinking about, and given this device
only has a 16-bit DDR bus, bounce buffers can make a difference.

> Rob
> 
> [1] http://lists.xen.org/archives/html/xen-devel/2013-10/msg00092.html

Thanks for your quick explanation Rob!
Willy

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [BUG,REGRESSION?] 3.11.6+,3.12: GbE iface rate drops to few KB/s
  2013-11-21 22:00                                             ` Eric Dumazet
@ 2013-11-21 22:55                                               ` Arnaud Ebalard
  -1 siblings, 0 replies; 121+ messages in thread
From: Arnaud Ebalard @ 2013-11-21 22:55 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Willy Tarreau, Thomas Petazzoni, Florian Fainelli, simon.guinot,
	netdev, edumazet, Cong Wang, linux-arm-kernel

Hi Eric,

Eric Dumazet <eric.dumazet@gmail.com> writes:

>> > I tested it on my RN2120 (2-core armada XP): I got no problem and the
>> > link saturated w/ apache, nginx and netperf. Good work!
>> 
>> Great, thanks for your tests Arnaud. I forgot to mention that all my
>> tests this evening involved this patch as well.
>
> Now you might try to set a lower value
> for /proc/sys/net/ipv4/tcp_limit_output_bytes

On the RN2120, for a file served from /run/shm (for apache and nginx),
with the first column giving the tcp_limit_output_bytes value in bytes:

          Apache     nginx       netperf
131072:  102 MB/s   112 MB/s   941.11 Mb/s
 65536:  102 MB/s   112 MB/s   935.97 Mb/s
 32768:  101 MB/s   105 MB/s   940.49 Mb/s
 16384:   94 MB/s    90 MB/s   770.07 Mb/s
  8192:   83 MB/s    66 MB/s   556.79 Mb/s

On the RN102, this time with the file served from disks (ext4/lvm/raid1)
for apache and nginx:

          Apache     nginx       netperf
131072:  66 MB/s   105 MB/s   925.63 Mb/s
 65536:  59 MB/s   105 MB/s   862.55 Mb/s
 32768:  62 MB/s   105 MB/s   918.99 Mb/s
 16384:  65 MB/s   105 MB/s   927.71 Mb/s
  8192:  60 MB/s   104 MB/s   915.63 Mb/s

Values above are for a single flow though.
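
For reference, a rough sketch of how the netperf column of such a sweep
could be scripted (the host name 'laptop' is hypothetical and must run
netserver; the sysctl writes need root):

    for v in 131072 65536 32768 16384 8192; do
        sysctl -w net.ipv4.tcp_limit_output_bytes=$v >/dev/null
        printf '%7d: ' $v
        # single TCP_STREAM flow; -P 0 strips banners, throughput is field 5
        netperf -H laptop -t TCP_STREAM -l 30 -P 0 | awk '{print $5, "Mb/s"}'
    done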

Cheers,

a+

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [BUG,REGRESSION?] 3.11.6+,3.12: GbE iface rate drops to few KB/s
  2013-11-21 22:55                                               ` Arnaud Ebalard
@ 2013-11-21 23:23                                                 ` Rick Jones
  -1 siblings, 0 replies; 121+ messages in thread
From: Rick Jones @ 2013-11-21 23:23 UTC (permalink / raw)
  To: Arnaud Ebalard, Eric Dumazet
  Cc: Willy Tarreau, Thomas Petazzoni, Florian Fainelli, simon.guinot,
	netdev, edumazet, Cong Wang, linux-arm-kernel

On 11/21/2013 02:55 PM, Arnaud Ebalard wrote:
> On the RN2120, for a file served from /run/shm (for apache and nginx):
>
>            Apache     nginx       netperf
> 131072:  102 MB/s   112 MB/s   941.11 Mb/s
>   65536:  102 MB/s   112 MB/s   935.97 Mb/s
>   32768:  101 MB/s   105 MB/s   940.49 Mb/s
>   16384:   94 MB/s    90 MB/s   770.07 Mb/s
>    8192:   83 MB/s    66 MB/s   556.79 Mb/s

If you want to make the units common across all three tests, netperf 
accepts a global -f option to alter the output units.  If you add -f M 
netperf will then emit results in MB/s (M == 1048576).  I'm assuming of 
course that the MB/s of Apache and nginx are also M == 1048576.
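
For example (the host name below is just a placeholder for the NAS):

    # report a single TCP_STREAM run in MB/s (M == 1048576) rather than 10^6 bit/s
    netperf -f M -H nas.example.org -t TCP_STREAM -l 30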

happy benchmarking,

rick jones

^ permalink raw reply	[flat|nested] 121+ messages in thread

