* BW regression after "tcp: refine TSO autosizing"
@ 2015-01-13 16:48 Eyal Perry
  2015-01-13 18:57 ` Eric Dumazet
  0 siblings, 1 reply; 16+ messages in thread
From: Eyal Perry @ 2015-01-13 16:48 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, Amir Vadai, yevgenyp, saeedm, idos, amira, eyalpe

Hello Eric,
Lately we've observed a performance degradation in BW of about 30-40% (depending on
the setup we use).
I've bisected the issue down to this commit: 605ad7f1 ("tcp: refine TSO
autosizing")

For instance, I was running the following test:
1. Binding the net device's IRQs to core 0 on both the client and server side
2. Running netperf with a 64K message size (using the following command)
$ netperf -H remote -T 1,1 -l 100 -t TCP_STREAM -- -k THROUGHPUT -M 65536 -m 65536

I ran the test on upstream net-next including your patch and then with it reverted;
reverting improved throughput from 14.6Gbps to 22.1Gbps.

An additional difference I noticed when inspecting the ethtool statistics: the
number of xmit_more packets increased from 4 to 160 with the reverted kernel.

We are investigating this issue; do you have a hint?

Best regards,
Eyal.


* Re: BW regression after "tcp: refine TSO autosizing"
  2015-01-13 16:48 BW regression after "tcp: refine TSO autosizing" Eyal Perry
@ 2015-01-13 18:57 ` Eric Dumazet
  2015-01-13 20:21   ` Or Gerlitz
  0 siblings, 1 reply; 16+ messages in thread
From: Eric Dumazet @ 2015-01-13 18:57 UTC (permalink / raw)
  To: Eyal Perry; +Cc: netdev, Amir Vadai, yevgenyp, saeedm, idos, amira, eyalpe

On Tue, 2015-01-13 at 18:48 +0200, Eyal Perry wrote:
> Hello Eric,
> Lately we've observed a performance degradation in BW of about 30-40% (depending on
> the setup we use).
> I've bisected the issue down to this commit: 605ad7f1 ("tcp: refine TSO
> autosizing")
>
> For instance, I was running the following test:
> 1. Binding the net device's IRQs to core 0 on both the client and server side
> 2. Running netperf with a 64K message size (using the following command)
> $ netperf -H remote -T 1,1 -l 100 -t TCP_STREAM -- -k THROUGHPUT -M 65536 -m 65536
>
> I ran the test on upstream net-next including your patch and then with it reverted;
> reverting improved throughput from 14.6Gbps to 22.1Gbps.
>
> An additional difference I noticed when inspecting the ethtool statistics: the
> number of xmit_more packets increased from 4 to 160 with the reverted kernel.
>
> We are investigating this issue; do you have a hint?

Which driver are you using for this test ?


* Re: BW regression after "tcp: refine TSO autosizing"
  2015-01-13 18:57 ` Eric Dumazet
@ 2015-01-13 20:21   ` Or Gerlitz
  2015-01-13 21:41     ` Eyal Perry
  0 siblings, 1 reply; 16+ messages in thread
From: Or Gerlitz @ 2015-01-13 20:21 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Eyal Perry, Linux Netdev List, Amir Vadai, Yevgeny Petrilin,
	Saeed Mahameed, Ido Shamay, Amir Ancel, Eyal Perry

On Tue, Jan 13, 2015 at 8:57 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Tue, 2015-01-13 at 18:48 +0200, Eyal Perry wrote:
>> Hello Eric,
>> Lately we've observed a performance degradation in BW of about 30-40% (depending on
>> the setup we use).
>> I've bisected the issue down to this commit: 605ad7f1 ("tcp: refine TSO
>> autosizing")
>>
>> For instance, I was running the following test:
>> 1. Binding the net device's IRQs to core 0 on both the client and server side
>> 2. Running netperf with a 64K message size (using the following command)
>> $ netperf -H remote -T 1,1 -l 100 -t TCP_STREAM -- -k THROUGHPUT -M 65536 -m 65536
>>
>> I ran the test on upstream net-next including your patch and then with it reverted;
>> reverting improved throughput from 14.6Gbps to 22.1Gbps.
>>
>> An additional difference I noticed when inspecting the ethtool statistics: the
>> number of xmit_more packets increased from 4 to 160 with the reverted kernel.
>>
>> We are investigating this issue; do you have a hint?
>
> Which driver are you using for this test ?

AFAIK, mlx4


* Re: BW regression after "tcp: refine TSO autosizing"
  2015-01-13 20:21   ` Or Gerlitz
@ 2015-01-13 21:41     ` Eyal Perry
  2015-01-13 22:00       ` Eric Dumazet
  0 siblings, 1 reply; 16+ messages in thread
From: Eyal Perry @ 2015-01-13 21:41 UTC (permalink / raw)
  To: Or Gerlitz, Eric Dumazet
  Cc: Linux Netdev List, Amir Vadai, Yevgeny Petrilin, Saeed Mahameed,
	Ido Shamay, Amir Ancel, Eyal Perry

On 1/13/2015 22:21 PM, Or Gerlitz wrote:
> On Tue, Jan 13, 2015 at 8:57 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> On Tue, 2015-01-13 at 18:48 +0200, Eyal Perry wrote:
>>> Hello Eric,
>>> Lately we've observed a performance degradation in BW of about 30-40% (depending on
>>> the setup we use).
>>> I've bisected the issue down to this commit: 605ad7f1 ("tcp: refine TSO
>>> autosizing")
>>>
>>> For instance, I was running the following test:
>>> 1. Binding the net device's IRQs to core 0 on both the client and server side
>>> 2. Running netperf with a 64K message size (using the following command)
>>> $ netperf -H remote -T 1,1 -l 100 -t TCP_STREAM -- -k THROUGHPUT -M 65536 -m 65536
>>>
>>> I ran the test on upstream net-next including your patch and then with it reverted;
>>> reverting improved throughput from 14.6Gbps to 22.1Gbps.
>>>
>>> An additional difference I noticed when inspecting the ethtool statistics: the
>>> number of xmit_more packets increased from 4 to 160 with the reverted kernel.
>>>
>>> We are investigating this issue; do you have a hint?
>> Which driver are you using for this test ?
> AFAIK, mlx4
Oops, forgot to mention.
mlx4 indeed.


* Re: BW regression after "tcp: refine TSO autosizing"
  2015-01-13 21:41     ` Eyal Perry
@ 2015-01-13 22:00       ` Eric Dumazet
  2015-01-18 16:22         ` Eyal Perry
  0 siblings, 1 reply; 16+ messages in thread
From: Eric Dumazet @ 2015-01-13 22:00 UTC (permalink / raw)
  To: Eyal Perry
  Cc: Or Gerlitz, Linux Netdev List, Amir Vadai, Yevgeny Petrilin,
	Saeed Mahameed, Ido Shamay, Amir Ancel, Eyal Perry

On Tue, 2015-01-13 at 23:41 +0200, Eyal Perry wrote:
> On 1/13/2015 22:21 PM, Or Gerlitz wrote:
> > On Tue, Jan 13, 2015 at 8:57 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >> On Tue, 2015-01-13 at 18:48 +0200, Eyal Perry wrote:
> >>> Hello Eric,
> >>> Lately we've observed a performance degradation in BW of about 30-40% (depending on
> >>> the setup we use).
> >>> I've bisected the issue down to this commit: 605ad7f1 ("tcp: refine TSO
> >>> autosizing")
> >>>
> >>> For instance, I was running the following test:
> >>> 1. Binding the net device's IRQs to core 0 on both the client and server side
> >>> 2. Running netperf with a 64K message size (using the following command)
> >>> $ netperf -H remote -T 1,1 -l 100 -t TCP_STREAM -- -k THROUGHPUT -M 65536 -m 65536
> >>>
> >>> I ran the test on upstream net-next including your patch and then with it reverted;
> >>> reverting improved throughput from 14.6Gbps to 22.1Gbps.
> >>>
> >>> An additional difference I noticed when inspecting the ethtool statistics: the
> >>> number of xmit_more packets increased from 4 to 160 with the reverted kernel.
> >>>
> >>> We are investigating this issue; do you have a hint?
> >> Which driver are you using for this test ?
> > AFAIK, mlx4
> Oops, forgot to mention.
> mlx4 indeed.

Make sure you do not drop packets at receiver.

(Patch might have increased raw speed, and receiver starts dropping
packets because it is not able to sustain line rate on a single flow)

If cwnd is too small, then yes, sending slightly smaller TSO packets can
impact performance, but this is desirable as well.

This is a congestion control problem.
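
For context, the sizing logic that commit 605ad7f1 introduced looks roughly
like the function below (a lightly paraphrased sketch, not guaranteed to be a
verbatim quote of the commit). sk_pacing_rate is in bytes per second, so
">> 10" approximates the bytes that can be paced out in 1 ms; each TSO packet
is therefore capped to about 1 ms worth of data at the current rate.

static u32 tcp_tso_autosize(const struct sock *sk, unsigned int mss_now)
{
	u32 bytes, segs;

	/* ~1ms worth of bytes at the current pacing rate, capped by GSO */
	bytes = min(sk->sk_pacing_rate >> 10,
		    sk->sk_gso_max_size - 1 - MAX_TCP_HEADER);

	/* Goal is to send at least one packet per ms, not one big TSO
	 * packet every 100 ms: this preserves ACK clocking.
	 */
	segs = max_t(u32, bytes / mss_now, sysctl_tcp_min_tso_segs);

	return min_t(u32, segs, sk->sk_gso_max_segs);
}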


lpaa23:~# nstat >/dev/null; DUMP_TCP_INFO=1 ./netperf -H lpaa24;nstat
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to lpaa24.prod.google.com () port 0 AF_INET
rto=201000 ato=0 pmtu=1500 rcv_ssthresh=29200 rtt=52 rttvar=2 snd_ssthresh=66 cwnd=102 reordering=3 total_retrans=439 ca_state=0
Recv   Send    Send                          
Socket Socket  Message  Elapsed              
Size   Size    Size     Time     Throughput  
bytes  bytes   bytes    secs.    10^6bits/sec  

 87380  16384  16384    10.00    17366.51   
#kernel
IpInReceives                    379010             0.0
IpInDelivers                    379010             0.0
IpOutRequests                   494794             0.0
IcmpInErrors                    1                  0.0
IcmpInTimeExcds                 1                  0.0
IcmpOutErrors                   1                  0.0
IcmpOutTimeExcds                1                  0.0
IcmpMsgInType3                  1                  0.0
IcmpMsgOutType3                 1                  0.0
TcpActiveOpens                  18                 0.0
TcpPassiveOpens                 4                  0.0
TcpAttemptFails                 8                  0.0
TcpEstabResets                  7                  0.0
TcpInSegs                       378992             0.0
TcpOutSegs                      14993053           0.0
TcpRetransSegs                  439                0.0
TcpOutRsts                      28                 0.0
UdpInDatagrams                  16                 0.0
UdpNoPorts                      1                  0.0
UdpOutDatagrams                 17                 0.0
TcpExtTW                        3                  0.0
TcpExtDelayedACKs               1                  0.0
TcpExtTCPPrequeued              1                  0.0
TcpExtTCPHPHits                 14                 0.0
TcpExtTCPPureAcks               301046             0.0
TcpExtTCPHPAcks                 77858              0.0
TcpExtTCPSackRecovery           75                 0.0
TcpExtTCPFastRetrans            439                0.0
TcpExtTCPAbortOnData            7                  0.0
TcpExtTCPSackShifted            17                 0.0
TcpExtTCPSackMerged             57                 0.0
TcpExtTCPSackShiftFallback      234                0.0
TcpExtTCPRcvCoalesce            6                  0.0
TcpExtTCPFastOpenActive         7                  0.0
TcpExtTCPSpuriousRtxHostQueues  2                  0.0
TcpExtTCPAutoCorking            68423              0.0
TcpExtTCPOrigDataSent           14992970           0.0
TcpExtTCPHystartTrainDetect     1                  0.0
TcpExtTCPHystartTrainCwnd       70                 0.0
IpExtInOctets                   19731445           0.0
IpExtOutOctets                  21736126719        0.0
IpExtInNoECTPkts                379010             0.0


You can also see in this sample that Hystart ended slow start
with a very small cwnd of 70.


* Re: BW regression after "tcp: refine TSO autosizing"
  2015-01-13 22:00       ` Eric Dumazet
@ 2015-01-18 16:22         ` Eyal Perry
  2015-01-18 17:48           ` Eric Dumazet
  0 siblings, 1 reply; 16+ messages in thread
From: Eyal Perry @ 2015-01-18 16:22 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Or Gerlitz, Linux Netdev List, Amir Vadai, Yevgeny Petrilin,
	Saeed Mahameed, Ido Shamay, Amir Ancel, Eyal Perry

On Wed, Jan 14, 2015 at 12:00 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Make sure you do not drop packets at receiver. (Patch might have increased raw speed, and receiver starts dropping packets because it is not able to sustain line rate on a single flow)

Hi Eric,

I double checked that there are no drops on the receiver (ifconfig,
ethtool, and net_dropmonitor).

> If cwnd is too small, then yes, sending slightly smaller TSO packets can impact performance, but this is desirable as well. This is a congestion control problem.

How can we reliably measure the cwnd? tpci_snd_cwnd is not consistent
across runs.
Anyway, we don't see a difference in the TSO packet sizes (see results below).
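
(For reference, cwnd can be sampled directly with the TCP_INFO getsockopt;
a minimal sketch, where fd is assumed to be any connected TCP socket. This
reads the same structure netperf dumps with DUMP_TCP_INFO=1.)

#include <stdio.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

static void print_cwnd(int fd)
{
	struct tcp_info ti;
	socklen_t len = sizeof(ti);

	/* tcpi_rtt is reported in microseconds */
	if (getsockopt(fd, IPPROTO_TCP, TCP_INFO, &ti, &len) == 0)
		printf("cwnd=%u ssthresh=%u rtt=%uus total_retrans=%u\n",
		       ti.tcpi_snd_cwnd, ti.tcpi_snd_ssthresh,
		       ti.tcpi_rtt, ti.tcpi_total_retrans);
}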

> lpaa23:~# nstat >/dev/null; DUMP_TCP_INFO=1 ./netperf -H lpaa24;nstat
[...]
>
> You can also see in this sample that Hystart ended slow start with a very small cwnd of 70.

We also see the issue on very long runs, so I don't understand how it is
related to the slow start mechanism.


Below are two measurements with all the statistics.
* With your patch:
$ nstat >/dev/null; netperf -H remote;nstat
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
11.11.11.36 () port 0 AF_INET
tcpi_rto 204000 tcpi_ato 0 tcpi_pmtu 1500 tcpi_rcv_ssthresh 29200
tcpi_rtt 79 tcpi_rttvar 14 tcpi_snd_ssthresh 214 tpci_snd_cwnd 634
tcpi_reordering 3 tcpi_total_retrans 0
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 87380  16384  16384    10.00    18537.46
#kernel
IpInReceives                    260728             0.0
IpInDelivers                    260728             0.0
IpOutRequests                   355684             0.0
TcpActiveOpens                  2                  0.0
TcpInSegs                       260729             0.0
TcpOutSegs                      16004157           0.0
UdpInDatagrams                  1                  0.0
UdpOutDatagrams                 7                  0.0
UdpIgnoredMulti                 1                  0.0
Ip6InReceives                   4                  0.0
Ip6InDelivers                   4                  0.0
Ip6OutRequests                  3                  0.0
Ip6InMcastPkts                  1                  0.0
Ip6InOctets                     288                0.0
Ip6OutOctets                    856                0.0
Ip6InMcastOctets                72                 0.0
Ip6InNoECTPkts                  4                  0.0
Icmp6InMsgs                     1                  0.0
Icmp6InNeighborAdvertisements   1                  0.0
Icmp6InType136                  1                  0.0
TcpExtDelayedACKs               2                  0.0
TcpExtTCPHPHits                 3                  0.0
TcpExtTCPPureAcks               208131             0.0
TcpExtTCPHPAcks                 52591              0.0
TcpExtTCPAutoCorking            49920              0.0
TcpExtTCPOrigDataSent           16004146           0.0
TcpExtTCPHystartTrainDetect     1                  0.0
TcpExtTCPHystartTrainCwnd       214                0.0
IpExtInBcastPkts                1                  0.0
IpExtInOctets                   13560399           0.0
IpExtOutOctets                  23192484354        0.0
IpExtInBcastOctets              367                0.0
IpExtInNoECTPkts                260728             0.0

* And without it:
$ nstat >/dev/null; netperf -H remote;nstat
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
11.11.11.37 () port 0 AF_INET
tcpi_rto 204000 tcpi_ato 0 tcpi_pmtu 1500 tcpi_rcv_ssthresh 29200
tcpi_rtt 108 tcpi_rttvar 13 tcpi_snd_ssthresh 801 tpci_snd_cwnd 809
tcpi_reordering 64 tcpi_total_retrans 6
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 87380  16384  16384    10.00    26826.32
#kernel
IpInReceives                    431716             0.0
IpInDelivers                    431716             0.0
IpOutRequests                   521537             0.0
TcpActiveOpens                  2                  0.0
TcpInSegs                       431676             0.0
TcpOutSegs                      23164903           0.0
TcpRetransSegs                  6                  0.0
UdpInDatagrams                  3                  0.0
UdpOutDatagrams                 1                  0.0
UdpIgnoredMulti                 40                 0.0
Ip6InReceives                   4                  0.0
Ip6InDelivers                   4                  0.0
Ip6OutRequests                  4                  0.0
Ip6InOctets                     288                0.0
Ip6OutOctets                    928                0.0
Ip6InNoECTPkts                  4                  0.0
Icmp6InMsgs                     1                  0.0
Icmp6OutMsgs                    1                  0.0
Icmp6InNeighborAdvertisements   1                  0.0
Icmp6OutNeighborSolicits        1                  0.0
Icmp6InType136                  1                  0.0
Icmp6OutType135                 1                  0.0
TcpExtDelayedACKs               1                  0.0
TcpExtTCPHPHits                 2                  0.0
TcpExtTCPPureAcks               291894             0.0
TcpExtTCPHPAcks                 139775             0.0
TcpExtTCPSackRecovery           5                  0.0
TcpExtTCPTSReorder              1                  0.0
TcpExtTCPPartialUndo            1                  0.0
TcpExtTCPDSACKUndo              4                  0.0
TcpExtTCPFastRetrans            6                  0.0
TcpExtTCPDSACKRecv              6                  0.0
TcpExtTCPDSACKIgnoredNoUndo     1                  0.0
TcpExtTCPSackShifted            46436              0.0
TcpExtTCPSackMerged             1306               0.0
TcpExtTCPSackShiftFallback      4414               0.0
TcpExtTCPAutoCorking            274120             0.0
TcpExtTCPOrigDataSent           23164893           0.0
TcpExtTCPHystartTrainDetect     1                  0.0
TcpExtTCPHystartTrainCwnd       71                 0.0
IpExtInBcastPkts                42                 0.0
IpExtInOctets                   23066435           0.0
IpExtOutOctets                  33567160556        0.0
IpExtInBcastOctets              4275               0.0
IpExtInNoECTPkts                431716             0.0


Please let me know if you see something in the results.

Regards,
Eyal.


* Re: BW regression after "tcp: refine TSO autosizing"
  2015-01-18 16:22         ` Eyal Perry
@ 2015-01-18 17:48           ` Eric Dumazet
  2015-01-18 21:40             ` Eyal Perry
  0 siblings, 1 reply; 16+ messages in thread
From: Eric Dumazet @ 2015-01-18 17:48 UTC (permalink / raw)
  To: Eyal Perry
  Cc: Or Gerlitz, Linux Netdev List, Amir Vadai, Yevgeny Petrilin,
	Saeed Mahameed, Ido Shamay, Amir Ancel, Eyal Perry

On Sun, 2015-01-18 at 18:22 +0200, Eyal Perry wrote:

> 
> Please let me know if you see something in the results.

Getting high throughput on a single flow means a lot of tweaking.

For a start, mlx4 is known to have interrupt mitigation that can hurt,
as the TX interrupt timer is restarted for every packet that is
delivered to the NIC.

ethtool -c ethX
..
tx-usecs: 16
tx-frames: 16
tx-usecs-irq: 0
tx-frames-irq: 256
...

-> TX IRQ can be delayed by 16*16 = 256 usec.

Can you try :

ethtool -C ethX tx-usecs 2 tx-frames 2

Or even

ethtool -C ethX tx-usecs 1 tx-frames 1

Interrupt mitigation is a trade-off.

If one customer wants high throughput on a single flow, then you might
remove interrupt mitigation.

If another customer wants cpu efficiency with thousands of flows, I guess
current mlx4 defaults are pretty good.
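
(If you need to script this tuning without shelling out to the ethtool
binary, the same knobs are reachable via the SIOCETHTOOL ioctl. A hedged
sketch; the device name and values are whatever you would pass to
"ethtool -C":)

#include <string.h>
#include <unistd.h>
#include <net/if.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>

/* Programmatic equivalent of "ethtool -C <dev> tx-usecs N tx-frames N" */
static int set_tx_coalesce(const char *dev, __u32 usecs, __u32 frames)
{
	struct ethtool_coalesce ec = { .cmd = ETHTOOL_GCOALESCE };
	struct ifreq ifr;
	int fd = socket(AF_INET, SOCK_DGRAM, 0);
	int ret = -1;

	if (fd < 0)
		return -1;
	memset(&ifr, 0, sizeof(ifr));
	strncpy(ifr.ifr_name, dev, IFNAMSIZ - 1);
	ifr.ifr_data = (char *)&ec;
	if (ioctl(fd, SIOCETHTOOL, &ifr) == 0) {	/* read current values */
		ec.cmd = ETHTOOL_SCOALESCE;
		ec.tx_coalesce_usecs = usecs;
		ec.tx_max_coalesced_frames = frames;
		ret = ioctl(fd, SIOCETHTOOL, &ifr);	/* write them back */
	}
	close(fd);
	return ret;
}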


* Re: BW regression after "tcp: refine TSO autosizing"
  2015-01-18 17:48           ` Eric Dumazet
@ 2015-01-18 21:40             ` Eyal Perry
  2015-01-20  2:16               ` Eric Dumazet
  0 siblings, 1 reply; 16+ messages in thread
From: Eyal Perry @ 2015-01-18 21:40 UTC (permalink / raw)
  To: Eric Dumazet, Eyal Perry
  Cc: Or Gerlitz, Linux Netdev List, Amir Vadai, Yevgeny Petrilin,
	Saeed Mahameed, Ido Shamay, Amir Ancel


On 1/18/2015 19:48 PM, Eric Dumazet wrote:
> On Sun, 2015-01-18 at 18:22 +0200, Eyal Perry wrote:
>
>> Please let me know if you see something in the results.
> Getting high throughput on a single flow means a lot of tweaking.
>
> For a start, mlx4 is known to have interrupt mitigation that can hurt,
> as the TX interrupt timer is restarted for every packet that is
> delivered to the NIC.
>
> ethtool -c ethX
> ..
> tx-usecs: 16
> tx-frames: 16
> tx-usecs-irq: 0
> tx-frames-irq: 256
> ...
>
> -> TX IRQ can be delayed by 16*16 = 256 usec.
>
> Can you try :
>
> ethtool -C ethX tx-usecs 2 tx-frames 2
>
> Or even
>
> ethtool -C ethX tx-usecs 1 tx-frames 1
So indeed, interrupt mitigation tuning (tx-usecs 1 tx-frames 1) improves things
for the "refined TSO autosizing" kernel (from 18.4Gbps to 19.7Gbps). But in the
other kernel, the BW remains the same with and without the coalescing.
> Interrupt mitigation is a trade-off.
>
> If one customer wants high throughput on a single flow, then you might
> remove interrupt mitigation.
>
> If another customer wants cpu efficiency with thousands of flows, I guess
> current mlx4 defaults are pretty good.
>


* Re: BW regression after "tcp: refine TSO autosizing"
  2015-01-18 21:40             ` Eyal Perry
@ 2015-01-20  2:16               ` Eric Dumazet
  2015-01-20  2:37                 ` Dave Taht
  0 siblings, 1 reply; 16+ messages in thread
From: Eric Dumazet @ 2015-01-20  2:16 UTC (permalink / raw)
  To: Eyal Perry, Yuchung Cheng, Neal Cardwell
  Cc: Eyal Perry, Or Gerlitz, Linux Netdev List, Amir Vadai,
	Yevgeny Petrilin, Saeed Mahameed, Ido Shamay, Amir Ancel

On Sun, 2015-01-18 at 23:40 +0200, Eyal Perry wrote:

> So indeed, interrupt mitigation tuning (tx-usecs 1 tx-frames 1) improves things
> for the "refined TSO autosizing" kernel (from 18.4Gbps to 19.7Gbps). But in the
> other kernel, the BW remains the same with and without the coalescing.

OK thanks for testing.

I believe the regression comes from the inability of congestion control
to cope with stretch ACKs.

Nowadays on fast networks, each ACK packet acknowledges ~45 MSS, but
CUBIC (and other cc modules) got support for this only during slow start, with
commit 9f9843a751d0a2057f9f3d313886e7e5e6ebaac9
("tcp: properly handle stretch acks in slow start")

I guess it is time to also handle congestion avoidance phase.

With the following patch (very close to what we use here at Google) I
reached 37Gbps instead of 20Gbps:

ethtool -C eth1 tx-usecs 4 tx-frames 4

DUMP_TCP_INFO=1 ./netperf -H remote -T2,2 -t TCP_STREAM -l 20
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to remote () port 0 AF_INET : cpu bind
rto=201000 ato=0 pmtu=1500 rcv_ssthresh=29200 rtt=67 rttvar=6 snd_ssthresh=263 cwnd=265 reordering=3 total_retrans=4569 ca_state=0
Recv   Send    Send                          
Socket Socket  Message  Elapsed              
Size   Size    Size     Time     Throughput  
bytes  bytes   bytes    secs.    10^6bits/sec  

 87380  16384  16384    20.00    37213.05   

I guess this is a world record; my previous one was 34Gbps.


 include/net/tcp.h    |    2 
 net/ipv4/tcp_cong.c  |    4 +
 net/ipv4/tcp_cubic.c |   91 +++++++++++++++++++----------------------
 3 files changed, 47 insertions(+), 50 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index b8fdc6bab3f3..05815fbb490f 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -843,7 +843,7 @@ void tcp_get_available_congestion_control(char *buf, size_t len);
 void tcp_get_allowed_congestion_control(char *buf, size_t len);
 int tcp_set_allowed_congestion_control(char *allowed);
 int tcp_set_congestion_control(struct sock *sk, const char *name);
-void tcp_slow_start(struct tcp_sock *tp, u32 acked);
+int tcp_slow_start(struct tcp_sock *tp, u32 acked);
 void tcp_cong_avoid_ai(struct tcp_sock *tp, u32 w);
 
 u32 tcp_reno_ssthresh(struct sock *sk);
diff --git a/net/ipv4/tcp_cong.c b/net/ipv4/tcp_cong.c
index 63c29dba68a8..f0fc696b9333 100644
--- a/net/ipv4/tcp_cong.c
+++ b/net/ipv4/tcp_cong.c
@@ -360,13 +360,15 @@ int tcp_set_congestion_control(struct sock *sk, const char *name)
  * ABC caps N to 2. Slow start exits when cwnd grows over ssthresh and
  * returns the leftover acks to adjust cwnd in congestion avoidance mode.
  */
-void tcp_slow_start(struct tcp_sock *tp, u32 acked)
+int tcp_slow_start(struct tcp_sock *tp, u32 acked)
 {
 	u32 cwnd = tp->snd_cwnd + acked;
 
 	if (cwnd > tp->snd_ssthresh)
 		cwnd = tp->snd_ssthresh + 1;
+	acked -= cwnd - tp->snd_cwnd;
 	tp->snd_cwnd = min(cwnd, tp->snd_cwnd_clamp);
+	return acked;
 }
 EXPORT_SYMBOL_GPL(tcp_slow_start);
 
diff --git a/net/ipv4/tcp_cubic.c b/net/ipv4/tcp_cubic.c
index 6b6002416a73..c0e048929b74 100644
--- a/net/ipv4/tcp_cubic.c
+++ b/net/ipv4/tcp_cubic.c
@@ -81,7 +81,6 @@ MODULE_PARM_DESC(hystart_ack_delta, "spacing between ack's indicating train (mse
 
 /* BIC TCP Parameters */
 struct bictcp {
-	u32	cnt;		/* increase cwnd by 1 after ACKs */
 	u32	last_max_cwnd;	/* last maximum snd_cwnd */
 	u32	loss_cwnd;	/* congestion window at last loss */
 	u32	last_cwnd;	/* the last snd_cwnd */
@@ -93,20 +92,18 @@ struct bictcp {
 	u32	epoch_start;	/* beginning of an epoch */
 	u32	ack_cnt;	/* number of acks */
 	u32	tcp_cwnd;	/* estimated tcp cwnd */
-#define ACK_RATIO_SHIFT	4
-#define ACK_RATIO_LIMIT (32u << ACK_RATIO_SHIFT)
-	u16	delayed_ack;	/* estimate the ratio of Packets/ACKs << 4 */
 	u8	sample_cnt;	/* number of samples to decide curr_rtt */
 	u8	found;		/* the exit point is found? */
 	u32	round_start;	/* beginning of each round */
 	u32	end_seq;	/* end_seq of the round */
 	u32	last_ack;	/* last time when the ACK spacing is close */
 	u32	curr_rtt;	/* the minimum rtt of current round */
+	u32	last_bic_target;/* last target cwnd computed by cubic
+				 * (not tcp_friendliness mode) */
 };
 
 static inline void bictcp_reset(struct bictcp *ca)
 {
-	ca->cnt = 0;
 	ca->last_max_cwnd = 0;
 	ca->last_cwnd = 0;
 	ca->last_time = 0;
@@ -114,7 +111,6 @@ static inline void bictcp_reset(struct bictcp *ca)
 	ca->bic_K = 0;
 	ca->delay_min = 0;
 	ca->epoch_start = 0;
-	ca->delayed_ack = 2 << ACK_RATIO_SHIFT;
 	ca->ack_cnt = 0;
 	ca->tcp_cwnd = 0;
 	ca->found = 0;
@@ -205,12 +201,14 @@ static u32 cubic_root(u64 a)
 /*
  * Compute congestion window to use.
  */
-static inline void bictcp_update(struct bictcp *ca, u32 cwnd)
+static inline void bictcp_update(struct bictcp *ca, u32 pkts_acked, u32 cwnd)
 {
-	u32 delta, bic_target, max_cnt;
+	u32 delta, bic_target;
 	u64 offs, t;
 
-	ca->ack_cnt++;	/* count the number of ACKs */
+	ca->ack_cnt += pkts_acked;	/* count the number of packets that
+					 * have been ACKed
+					 */
 
 	if (ca->last_cwnd == cwnd &&
 	    (s32)(tcp_time_stamp - ca->last_time) <= HZ / 32)
@@ -221,7 +219,7 @@ static inline void bictcp_update(struct bictcp *ca, u32 cwnd)
 
 	if (ca->epoch_start == 0) {
 		ca->epoch_start = tcp_time_stamp;	/* record beginning */
-		ca->ack_cnt = 1;			/* start counting */
+		ca->ack_cnt = pkts_acked;		/* start counting */
 		ca->tcp_cwnd = cwnd;			/* syn with cubic */
 
 		if (ca->last_max_cwnd <= cwnd) {
@@ -269,19 +267,7 @@ static inline void bictcp_update(struct bictcp *ca, u32 cwnd)
 	else                                          /* above origin*/
 		bic_target = ca->bic_origin_point + delta;
 
-	/* cubic function - calc bictcp_cnt*/
-	if (bic_target > cwnd) {
-		ca->cnt = cwnd / (bic_target - cwnd);
-	} else {
-		ca->cnt = 100 * cwnd;              /* very small increment*/
-	}
-
-	/*
-	 * The initial growth of cubic function may be too conservative
-	 * when the available bandwidth is still unknown.
-	 */
-	if (ca->last_max_cwnd == 0 && ca->cnt > 20)
-		ca->cnt = 20;	/* increase cwnd 5% per RTT */
+	ca->last_bic_target = bic_target;
 
 	/* TCP Friendly */
 	if (tcp_friendliness) {
@@ -292,18 +278,7 @@ static inline void bictcp_update(struct bictcp *ca, u32 cwnd)
 			ca->ack_cnt -= delta;
 			ca->tcp_cwnd++;
 		}
-
-		if (ca->tcp_cwnd > cwnd) {	/* if bic is slower than tcp */
-			delta = ca->tcp_cwnd - cwnd;
-			max_cnt = cwnd / delta;
-			if (ca->cnt > max_cnt)
-				ca->cnt = max_cnt;
-		}
 	}
-
-	ca->cnt = (ca->cnt << ACK_RATIO_SHIFT) / ca->delayed_ack;
-	if (ca->cnt == 0)			/* cannot be zero */
-		ca->cnt = 1;
 }
 
 static void bictcp_cong_avoid(struct sock *sk, u32 ack, u32 acked)
@@ -314,13 +289,43 @@ static void bictcp_cong_avoid(struct sock *sk, u32 ack, u32 acked)
 	if (!tcp_is_cwnd_limited(sk))
 		return;
 
+	/* cwnd may first advance in slow start then move on to congestion
+	 * control mode on a stretch ACK.
+	 */
 	if (tp->snd_cwnd <= tp->snd_ssthresh) {
 		if (hystart && after(ack, ca->end_seq))
 			bictcp_hystart_reset(sk);
-		tcp_slow_start(tp, acked);
-	} else {
-		bictcp_update(ca, tp->snd_cwnd);
-		tcp_cong_avoid_ai(tp, ca->cnt);
+		acked = tcp_slow_start(tp, acked);
+	}
+
+	if (acked && tp->snd_cwnd > tp->snd_ssthresh) {
+		u32 target, cnt;
+
+		bictcp_update(ca, acked, tp->snd_cwnd);
+		/* Compute target cwnd based on bic_target and tcp_cwnd
+		 * (whichever is faster)
+		 */
+		target = (ca->last_bic_target >= ca->tcp_cwnd) ?
+				ca->last_bic_target : ca->tcp_cwnd;
+		while (acked > 0) {
+			if (target > tp->snd_cwnd)
+				cnt = tp->snd_cwnd / (target - tp->snd_cwnd);
+			else
+				cnt = 100 * tp->snd_cwnd;
+
+			/* The initial growth of cubic function may be
+			 * too conservative when the available
+			 * bandwidth is still unknown.
+			 */
+			if (ca->last_max_cwnd == 0 && cnt > 20)
+				cnt = 20;   /* increase cwnd 5% per RTT */
+
+			if (cnt == 0)		/* cannot be zero */
+				cnt = 1;
+
+			tcp_cong_avoid_ai(tp, cnt);
+			acked--;
+		}
 	}
 }
 
@@ -411,20 +416,10 @@ static void hystart_update(struct sock *sk, u32 delay)
  */
 static void bictcp_acked(struct sock *sk, u32 cnt, s32 rtt_us)
 {
-	const struct inet_connection_sock *icsk = inet_csk(sk);
 	const struct tcp_sock *tp = tcp_sk(sk);
 	struct bictcp *ca = inet_csk_ca(sk);
 	u32 delay;
 
-	if (icsk->icsk_ca_state == TCP_CA_Open) {
-		u32 ratio = ca->delayed_ack;
-
-		ratio -= ca->delayed_ack >> ACK_RATIO_SHIFT;
-		ratio += cnt;
-
-		ca->delayed_ack = clamp(ratio, 1U, ACK_RATIO_LIMIT);
-	}
-
 	/* Some calls are for duplicates without timetamps */
 	if (rtt_us < 0)
 		return;

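For readers following the new loop in bictcp_cong_avoid(): tcp_cong_avoid_ai()
is the generic additive-increase helper, which at the time looked roughly like
this (a sketch of net/ipv4/tcp_cong.c, possibly not verbatim):

void tcp_cong_avoid_ai(struct tcp_sock *tp, u32 w)
{
	/* Grow snd_cwnd by one once w acks have been counted */
	if (tp->snd_cwnd_cnt >= w) {
		if (tp->snd_cwnd < tp->snd_cwnd_clamp)
			tp->snd_cwnd++;
		tp->snd_cwnd_cnt = 0;
	} else {
		tp->snd_cwnd_cnt++;
	}
}

Before this patch it ran once per ACK, so a stretch ACK covering ~45 segments
advanced the counter by only 1 (the old delayed_ack ratio compensated only
partially); the per-segment loop above restores the intended growth rate.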

* Re: BW regression after "tcp: refine TSO autosizing"
  2015-01-20  2:16               ` Eric Dumazet
@ 2015-01-20  2:37                 ` Dave Taht
  2015-01-20  3:14                   ` Eric Dumazet
  0 siblings, 1 reply; 16+ messages in thread
From: Dave Taht @ 2015-01-20  2:37 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Eyal Perry, Yuchung Cheng, Neal Cardwell, Eyal Perry, Or Gerlitz,
	Linux Netdev List, Amir Vadai, Yevgeny Petrilin, Saeed Mahameed,
	Ido Shamay, Amir Ancel

On Mon, Jan 19, 2015 at 6:16 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Sun, 2015-01-18 at 23:40 +0200, Eyal Perry wrote:
>
>> So indeed, interrupt mitigation tuning (tx-usecs 1 tx-frames 1) improves things
>> for the "refined TSO autosizing" kernel (from 18.4Gbps to 19.7Gbps). But in the
>> other kernel, the BW remains the same with and without the coalescing.
>
> OK thanks for testing.
>
> I believe the regression comes from the inability of congestion control
> to cope with stretch ACKs.
>
> Nowadays on fast networks, each ACK packet acknowledges ~45 MSS, but
> CUBIC (and other cc modules) got support for this only during slow start, with
> commit 9f9843a751d0a2057f9f3d313886e7e5e6ebaac9
> ("tcp: properly handle stretch acks in slow start")
>
> I guess it is time to also handle congestion avoidance phase.

Are you saying that at long last, delayed acks as we knew them are
dead, dead, dead?

> With the following patch (very close to what we use here at Google) I
> reached 37Gbps instead of 20Gbps:
>
> ethtool -C eth1 tx-usecs 4 tx-frames 4

What is the default here?

What happens with the default here?

>
> DUMP_TCP_INFO=1 ./netperf -H remote -T2,2 -t TCP_STREAM -l 20
> MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to remote () port 0 AF_INET : cpu bind
> rto=201000 ato=0 pmtu=1500 rcv_ssthresh=29200 rtt=67 rttvar=6 snd_ssthresh=263 cwnd=265 reordering=3 total_retrans=4569 ca_state=0

The above statistics are not dumped by my netperf, and look extremely
desirable to capture in netperf-wrapper. Is this a script parsing some
other kernel data at the conclusion of the run, or a better netperf?

If ECN was on the bottleneck link, I imagine total_retrans would be 0,
or are packets getting dropped in the kernel?

> Recv   Send    Send
> Socket Socket  Message  Elapsed
> Size   Size    Size     Time     Throughput
> bytes  bytes   bytes    secs.    10^6bits/sec
>
>  87380  16384  16384    20.00    37213.05
>
> I guess this is a world record; my previous one was 34Gbps.
>
>
>  include/net/tcp.h    |    2
>  net/ipv4/tcp_cong.c  |    4 +
>  net/ipv4/tcp_cubic.c |   91 +++++++++++++++++++----------------------
>  3 files changed, 47 insertions(+), 50 deletions(-)
>
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index b8fdc6bab3f3..05815fbb490f 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -843,7 +843,7 @@ void tcp_get_available_congestion_control(char *buf, size_t len);
>  void tcp_get_allowed_congestion_control(char *buf, size_t len);
>  int tcp_set_allowed_congestion_control(char *allowed);
>  int tcp_set_congestion_control(struct sock *sk, const char *name);
> -void tcp_slow_start(struct tcp_sock *tp, u32 acked);
> +int tcp_slow_start(struct tcp_sock *tp, u32 acked);
>  void tcp_cong_avoid_ai(struct tcp_sock *tp, u32 w);
>
>  u32 tcp_reno_ssthresh(struct sock *sk);
> diff --git a/net/ipv4/tcp_cong.c b/net/ipv4/tcp_cong.c
> index 63c29dba68a8..f0fc696b9333 100644
> --- a/net/ipv4/tcp_cong.c
> +++ b/net/ipv4/tcp_cong.c
> @@ -360,13 +360,15 @@ int tcp_set_congestion_control(struct sock *sk, const char *name)
>   * ABC caps N to 2. Slow start exits when cwnd grows over ssthresh and
>   * returns the leftover acks to adjust cwnd in congestion avoidance mode.
>   */
> -void tcp_slow_start(struct tcp_sock *tp, u32 acked)
> +int tcp_slow_start(struct tcp_sock *tp, u32 acked)
>  {
>         u32 cwnd = tp->snd_cwnd + acked;
>
>         if (cwnd > tp->snd_ssthresh)
>                 cwnd = tp->snd_ssthresh + 1;
> +       acked -= cwnd - tp->snd_cwnd;
>         tp->snd_cwnd = min(cwnd, tp->snd_cwnd_clamp);
> +       return acked;
>  }
>  EXPORT_SYMBOL_GPL(tcp_slow_start);
>
> diff --git a/net/ipv4/tcp_cubic.c b/net/ipv4/tcp_cubic.c
> index 6b6002416a73..c0e048929b74 100644
> --- a/net/ipv4/tcp_cubic.c
> +++ b/net/ipv4/tcp_cubic.c
> @@ -81,7 +81,6 @@ MODULE_PARM_DESC(hystart_ack_delta, "spacing between ack's indicating train (mse
>
>  /* BIC TCP Parameters */
>  struct bictcp {
> -       u32     cnt;            /* increase cwnd by 1 after ACKs */
>         u32     last_max_cwnd;  /* last maximum snd_cwnd */
>         u32     loss_cwnd;      /* congestion window at last loss */
>         u32     last_cwnd;      /* the last snd_cwnd */
> @@ -93,20 +92,18 @@ struct bictcp {
>         u32     epoch_start;    /* beginning of an epoch */
>         u32     ack_cnt;        /* number of acks */
>         u32     tcp_cwnd;       /* estimated tcp cwnd */
> -#define ACK_RATIO_SHIFT        4
> -#define ACK_RATIO_LIMIT (32u << ACK_RATIO_SHIFT)
> -       u16     delayed_ack;    /* estimate the ratio of Packets/ACKs << 4 */
>         u8      sample_cnt;     /* number of samples to decide curr_rtt */
>         u8      found;          /* the exit point is found? */
>         u32     round_start;    /* beginning of each round */
>         u32     end_seq;        /* end_seq of the round */
>         u32     last_ack;       /* last time when the ACK spacing is close */
>         u32     curr_rtt;       /* the minimum rtt of current round */
> +       u32     last_bic_target;/* last target cwnd computed by cubic
> +                                * (not tcp_friendliness mode) */
>  };
>
>  static inline void bictcp_reset(struct bictcp *ca)
>  {
> -       ca->cnt = 0;
>         ca->last_max_cwnd = 0;
>         ca->last_cwnd = 0;
>         ca->last_time = 0;
> @@ -114,7 +111,6 @@ static inline void bictcp_reset(struct bictcp *ca)
>         ca->bic_K = 0;
>         ca->delay_min = 0;
>         ca->epoch_start = 0;
> -       ca->delayed_ack = 2 << ACK_RATIO_SHIFT;
>         ca->ack_cnt = 0;
>         ca->tcp_cwnd = 0;
>         ca->found = 0;
> @@ -205,12 +201,14 @@ static u32 cubic_root(u64 a)
>  /*
>   * Compute congestion window to use.
>   */
> -static inline void bictcp_update(struct bictcp *ca, u32 cwnd)
> +static inline void bictcp_update(struct bictcp *ca, u32 pkts_acked, u32 cwnd)
>  {
> -       u32 delta, bic_target, max_cnt;
> +       u32 delta, bic_target;
>         u64 offs, t;
>
> -       ca->ack_cnt++;  /* count the number of ACKs */
> +       ca->ack_cnt += pkts_acked;      /* count the number of packets that
> +                                        * have been ACKed
> +                                        */
>
>         if (ca->last_cwnd == cwnd &&
>             (s32)(tcp_time_stamp - ca->last_time) <= HZ / 32)
> @@ -221,7 +219,7 @@ static inline void bictcp_update(struct bictcp *ca, u32 cwnd)
>
>         if (ca->epoch_start == 0) {
>                 ca->epoch_start = tcp_time_stamp;       /* record beginning */
> -               ca->ack_cnt = 1;                        /* start counting */
> +               ca->ack_cnt = pkts_acked;               /* start counting */
>                 ca->tcp_cwnd = cwnd;                    /* syn with cubic */
>
>                 if (ca->last_max_cwnd <= cwnd) {
> @@ -269,19 +267,7 @@ static inline void bictcp_update(struct bictcp *ca, u32 cwnd)
>         else                                          /* above origin*/
>                 bic_target = ca->bic_origin_point + delta;
>
> -       /* cubic function - calc bictcp_cnt*/
> -       if (bic_target > cwnd) {
> -               ca->cnt = cwnd / (bic_target - cwnd);
> -       } else {
> -               ca->cnt = 100 * cwnd;              /* very small increment*/
> -       }
> -
> -       /*
> -        * The initial growth of cubic function may be too conservative
> -        * when the available bandwidth is still unknown.
> -        */
> -       if (ca->last_max_cwnd == 0 && ca->cnt > 20)
> -               ca->cnt = 20;   /* increase cwnd 5% per RTT */
> +       ca->last_bic_target = bic_target;
>
>         /* TCP Friendly */
>         if (tcp_friendliness) {
> @@ -292,18 +278,7 @@ static inline void bictcp_update(struct bictcp *ca, u32 cwnd)
>                         ca->ack_cnt -= delta;
>                         ca->tcp_cwnd++;
>                 }
> -
> -               if (ca->tcp_cwnd > cwnd) {      /* if bic is slower than tcp */
> -                       delta = ca->tcp_cwnd - cwnd;
> -                       max_cnt = cwnd / delta;
> -                       if (ca->cnt > max_cnt)
> -                               ca->cnt = max_cnt;
> -               }
>         }
> -
> -       ca->cnt = (ca->cnt << ACK_RATIO_SHIFT) / ca->delayed_ack;
> -       if (ca->cnt == 0)                       /* cannot be zero */
> -               ca->cnt = 1;
>  }
>
>  static void bictcp_cong_avoid(struct sock *sk, u32 ack, u32 acked)
> @@ -314,13 +289,43 @@ static void bictcp_cong_avoid(struct sock *sk, u32 ack, u32 acked)
>         if (!tcp_is_cwnd_limited(sk))
>                 return;
>
> +       /* cwnd may first advance in slow start then move on to congestion
> +        * control mode on a stretch ACK.
> +        */
>         if (tp->snd_cwnd <= tp->snd_ssthresh) {
>                 if (hystart && after(ack, ca->end_seq))
>                         bictcp_hystart_reset(sk);
> -               tcp_slow_start(tp, acked);
> -       } else {
> -               bictcp_update(ca, tp->snd_cwnd);
> -               tcp_cong_avoid_ai(tp, ca->cnt);
> +               acked = tcp_slow_start(tp, acked);
> +       }
> +
> +       if (acked && tp->snd_cwnd > tp->snd_ssthresh) {
> +               u32 target, cnt;
> +
> +               bictcp_update(ca, acked, tp->snd_cwnd);
> +               /* Compute target cwnd based on bic_target and tcp_cwnd
> +                * (whichever is faster)
> +                */
> +               target = (ca->last_bic_target >= ca->tcp_cwnd) ?
> +                               ca->last_bic_target : ca->tcp_cwnd;
> +               while (acked > 0) {
> +                       if (target > tp->snd_cwnd)
> +                               cnt = tp->snd_cwnd / (target - tp->snd_cwnd);
> +                       else
> +                               cnt = 100 * tp->snd_cwnd;
> +
> +                       /* The initial growth of cubic function may be
> +                        * too conservative when the available
> +                        * bandwidth is still unknown.
> +                        */
> +                       if (ca->last_max_cwnd == 0 && cnt > 20)
> +                               cnt = 20;   /* increase cwnd 5% per RTT */
> +
> +                       if (cnt == 0)           /* cannot be zero */
> +                               cnt = 1;
> +
> +                       tcp_cong_avoid_ai(tp, cnt);
> +                       acked--;
> +               }
>         }
>  }
>
> @@ -411,20 +416,10 @@ static void hystart_update(struct sock *sk, u32 delay)
>   */
>  static void bictcp_acked(struct sock *sk, u32 cnt, s32 rtt_us)
>  {
> -       const struct inet_connection_sock *icsk = inet_csk(sk);
>         const struct tcp_sock *tp = tcp_sk(sk);
>         struct bictcp *ca = inet_csk_ca(sk);
>         u32 delay;
>
> -       if (icsk->icsk_ca_state == TCP_CA_Open) {
> -               u32 ratio = ca->delayed_ack;
> -
> -               ratio -= ca->delayed_ack >> ACK_RATIO_SHIFT;
> -               ratio += cnt;
> -
> -               ca->delayed_ack = clamp(ratio, 1U, ACK_RATIO_LIMIT);
> -       }
> -
>         /* Some calls are for duplicates without timetamps */
>         if (rtt_us < 0)
>                 return;
>
>



-- 
Dave Täht

http://www.bufferbloat.net/projects/bloat/wiki/Upcoming_Talks


* Re: BW regression after "tcp: refine TSO autosizing"
  2015-01-20  2:37                 ` Dave Taht
@ 2015-01-20  3:14                   ` Eric Dumazet
  2015-01-20 19:14                     ` Rick Jones
  0 siblings, 1 reply; 16+ messages in thread
From: Eric Dumazet @ 2015-01-20  3:14 UTC (permalink / raw)
  To: Dave Taht
  Cc: Eyal Perry, Yuchung Cheng, Neal Cardwell, Eyal Perry, Or Gerlitz,
	Linux Netdev List, Amir Vadai, Yevgeny Petrilin, Saeed Mahameed,
	Ido Shamay, Amir Ancel

On Mon, 2015-01-19 at 18:37 -0800, Dave Taht wrote:
> On Mon, Jan 19, 2015 at 6:16 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > On Sun, 2015-01-18 at 23:40 +0200, Eyal Perry wrote:
> >
> >> So indeed, interrupt mitigation tuning (tx-usecs 1 tx-frames 1) improves things
> >> for the "refined TSO autosizing" kernel (from 18.4Gbps to 19.7Gbps). But in the
> >> other kernel, the BW remains the same with and without the coalescing.
> >
> > OK thanks for testing.
> >
> > I believe the regression comes from the inability of congestion control
> > to cope with stretch ACKs.
> >
> > Nowadays on fast networks, each ACK packet acknowledges ~45 MSS, but
> > CUBIC (and other cc modules) got support for this only during slow start, with
> > commit 9f9843a751d0a2057f9f3d313886e7e5e6ebaac9
> > ("tcp: properly handle stretch acks in slow start")
> >
> > I guess it is time to also handle congestion avoidance phase.
> 
> Are you saying that at long last, delayed acks as we knew them are
> dead, dead, dead?

Sorry, I can not parse what you are saying.

In case you missed it, it has nothing to do with delayed ACKs but with GRO
on the receiver.


> 
> > With the following patch (very close to what we use here at Google) I
> > reached 37Gbps instead of 20Gbps:
> >
> > ethtool -C eth1 tx-usecs 4 tx-frames 4
> 
> What is the default here?

16 & 16, see my prior answer in this thread.

> 
> What happens with the default here?

ethtool -C eth1 tx-usecs 16 tx-frames 16
DUMP_TCP_INFO=1 ./netperf -H remote -T2,2 -t TCP_STREAM -l 20
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to remote
() port 0 AF_INET : cpu bind
rto=201000 ato=0 pmtu=1500 rcv_ssthresh=29200 rtt=60 rttvar=2
snd_ssthresh=179 cwnd=243 reordering=3 total_retrans=23 ca_state=0
Recv   Send    Send                          
Socket Socket  Message  Elapsed              
Size   Size    Size     Time     Throughput  
bytes  bytes   bytes    secs.    10^6bits/sec  

 87380  16384  16384    20.00    22923.74   




> 
> >
> > DUMP_TCP_INFO=1 ./netperf -H remote -T2,2 -t TCP_STREAM -l 20
> > MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to remote () port 0 AF_INET : cpu bind
> > rto=201000 ato=0 pmtu=1500 rcv_ssthresh=29200 rtt=67 rttvar=6 snd_ssthresh=263 cwnd=265 reordering=3 total_retrans=4569 ca_state=0
> 
> The above statistics are not dumped by my netperf, and look extremely
> desirable to capture in netperf-wrapper. Is this a script parsing some
> other kernel data at the conclusion of the run, or a better netperf?

That's a 3-line patch in netperf, actually.

> 
> If ECN was on the bottleneck link, I imagine total_retrans would be 0,
> or are packets getting dropped in the kernel?

The receiver drops frames because we are at the limit of what the NIC
can do on a single RX queue.


* Re: BW regression after "tcp: refine TSO autosizing"
  2015-01-20  3:14                   ` Eric Dumazet
@ 2015-01-20 19:14                     ` Rick Jones
  2015-01-20 19:26                       ` Eric Dumazet
  2015-01-21 12:26                       ` David Laight
  0 siblings, 2 replies; 16+ messages in thread
From: Rick Jones @ 2015-01-20 19:14 UTC (permalink / raw)
  To: Eric Dumazet, Dave Taht
  Cc: Eyal Perry, Yuchung Cheng, Neal Cardwell, Eyal Perry, Or Gerlitz,
	Linux Netdev List, Amir Vadai, Yevgeny Petrilin, Saeed Mahameed,
	Ido Shamay, Amir Ancel


>> Are you saying that at long last, delayed acks as we knew them are
>> dead, dead, dead?
>
> Sorry, I can not parse what you are saying.
>
> In case you missed it, it has nothing to do with delayed ACKs but with GRO
> on the receiver.

Dave - assuming I've interpreted Eric's comments correctly, I believe 
the answer to your question is No.  Your desire for a world brimming 
with ack-every-other purity has not been fulfilled :)

However, the engineers formerly at Mentat are probably pleased that a 
functional near-equivalent to their ACK avoidance heuristic has ended up 
being implemented and tacitly accepted, albeit by the back door :)


>>> DUMP_TCP_INFO=1 ./netperf -H remote -T2,2 -t TCP_STREAM -l 20
>>> MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to remote () port 0 AF_INET : cpu bind
>>> rto=201000 ato=0 pmtu=1500 rcv_ssthresh=29200 rtt=67 rttvar=6 snd_ssthresh=263 cwnd=265 reordering=3 total_retrans=4569 ca_state=0
>>
>> The above statistics are not dumped by my netperf, and look extremely
>> desirable to capture in netperf-wrapper. Is this a script parsing some
>> other kernel data at the conclusion of the run, or a better netperf?
>
> That's a 3-line patch in netperf, actually.

More stuff to pull from a TCP_INFO call I presume?  Feel free to drop me 
a patch, though I'd probably want it to be in the guise of the omni 
output selectors.

happy benchmarking,

rick


* Re: BW regression after "tcp: refine TSO autosizing"
  2015-01-20 19:14                     ` Rick Jones
@ 2015-01-20 19:26                       ` Eric Dumazet
  2015-01-20 19:44                         ` Rick Jones
  2015-01-21 12:26                       ` David Laight
  1 sibling, 1 reply; 16+ messages in thread
From: Eric Dumazet @ 2015-01-20 19:26 UTC (permalink / raw)
  To: Rick Jones
  Cc: Dave Taht, Eyal Perry, Yuchung Cheng, Neal Cardwell, Eyal Perry,
	Or Gerlitz, Linux Netdev List, Amir Vadai, Yevgeny Petrilin,
	Saeed Mahameed, Ido Shamay, Amir Ancel

On Tue, 2015-01-20 at 11:14 -0800, Rick Jones wrote:

> > That's a 3-line patch in netperf, actually.
> 
> More stuff to pull from a TCP_INFO call I presume?  Feel free to drop me 
> a patch, though I'd probably want it to be in the guise of the omni 
> output selectors.
> 

It was something like :

diff --git a/src/nettest_omni.c b/src/nettest_omni.c
index fb2d5f4..80e43ca 100644
--- a/src/nettest_omni.c
+++ b/src/nettest_omni.c
@@ -3465,7 +3465,7 @@ static void
 dump_tcp_info(struct tcp_info *tcp_info)
 {
 
-  printf("tcpi_rto %d tcpi_ato %d tcpi_pmtu %d tcpi_rcv_ssthresh %d\n"
+  fprintf(stderr, "tcpi_rto %d tcpi_ato %d tcpi_pmtu %d tcpi_rcv_ssthresh %d\n"
         "tcpi_rtt %d tcpi_rttvar %d tcpi_snd_ssthresh %d tpci_snd_cwnd %d\n"
         "tcpi_reordering %d tcpi_total_retrans %d\n",
         tcp_info->tcpi_rto,
@@ -3539,7 +3539,7 @@ get_transport_retrans(SOCKET socket, int protocol) {
   }
   else {
 
-    if (debug > 1) {
+    if (debug > 1 || getenv("DUMP_TCP_INFO")) {
       dump_tcp_info(&tcp_info);
     }
     return tcp_info.tcpi_total_retrans;


* Re: BW regression after "tcp: refine TSO autosizing"
  2015-01-20 19:26                       ` Eric Dumazet
@ 2015-01-20 19:44                         ` Rick Jones
  0 siblings, 0 replies; 16+ messages in thread
From: Rick Jones @ 2015-01-20 19:44 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Dave Taht, Eyal Perry, Yuchung Cheng, Neal Cardwell, Eyal Perry,
	Or Gerlitz, Linux Netdev List, Amir Vadai, Yevgeny Petrilin,
	Saeed Mahameed, Ido Shamay, Amir Ancel

>> More stuff to pull from a TCP_INFO call I presume?  Feel free to drop me
>> a patch, though I'd probably want it to be in the guise of the omni
>> output selectors.
>>
>
> It was something like :

I'd forgotten about dump_tcp_info() :)

Committed revision 673.

happy benchmarking,

rick


* RE: BW regression after "tcp: refine TSO autosizing"
  2015-01-20 19:14                     ` Rick Jones
  2015-01-20 19:26                       ` Eric Dumazet
@ 2015-01-21 12:26                       ` David Laight
  2015-01-21 17:01                         ` Eric Dumazet
  1 sibling, 1 reply; 16+ messages in thread
From: David Laight @ 2015-01-21 12:26 UTC (permalink / raw)
  To: 'Rick Jones', Eric Dumazet, Dave Taht
  Cc: Eyal Perry, Yuchung Cheng, Neal Cardwell, Eyal Perry, Or Gerlitz,
	Linux Netdev List, Amir Vadai, Yevgeny Petrilin, Saeed Mahameed,
	Ido Shamay, Amir Ancel

From: Of Rick Jones
> >> Are you saying that at long last, delayed acks as we knew them are
> >> dead, dead, dead?
> >
> > Sorry, I can not parse what you are saying.
> >
> > In case you missed it, it has nothing to do with delayed ACKs but with GRO
> > on the receiver.
> 
> Dave - assuming I've interpreted Eric's comments correctly, I believe
> the answer to your question is No.  Your desire for a world brimming
> with ack-every-other purity has not been fulfilled :)
> 
> However, the engineers formerly at Mentat are probably pleased that a
> functional near-equivalent to their ACK avoidance heuristic has ended up
> being implemented and tacitly accepted, albeit by the back door :)

I must recheck something I discovered a while back with more recent kernels.
There has been a bad interaction between 'slow start' and 'delayed acks'
when Nagle is disabled on 0 RTT local links with uni-directional traffic.

'Slow start' would refuse to send more than 4 messages until it received
an ack (rather than 4 mss of data).
The receiving system wouldn't send an ack until the timer expired
(or several mss of data were received), by which time the sender could have
a lot of data queued.

Due to the 0 RTT and bursty nature of the data, 'slow start' happened
all the time.

	David



* Re: BW regression after "tcp: refine TSO autosizing"
  2015-01-21 12:26                       ` David Laight
@ 2015-01-21 17:01                         ` Eric Dumazet
  0 siblings, 0 replies; 16+ messages in thread
From: Eric Dumazet @ 2015-01-21 17:01 UTC (permalink / raw)
  To: David Laight
  Cc: 'Rick Jones',
	Dave Taht, Eyal Perry, Yuchung Cheng, Neal Cardwell, Eyal Perry,
	Or Gerlitz, Linux Netdev List, Amir Vadai, Yevgeny Petrilin,
	Saeed Mahameed, Ido Shamay, Amir Ancel

On Wed, 2015-01-21 at 12:26 +0000, David Laight wrote:
> From: Of Rick Jones
> > >> Are you saying that at long last, delayed acks as we knew them are
> > >> dead, dead, dead?
> > >
> > > Sorry, I can not parse what you are saying.
> > >
> > > In case you missed it, it has nothing to do with delayed ACKs but with GRO
> > > on the receiver.
> > 
> > Dave - assuming I've interpreted Eric's comments correctly, I believe
> > the answer to your question is No.  Your desire for a world brimming
> > with ack-every-other purity has not been fulfilled :)
> > 
> > However, the engineers formerly at Mentat are probably pleased that a
> > functional near-equivalent to their ACK avoidance heuristic has ended up
> > being implemented and tacitly accepted, albeit by the back door :)
> 
> I must recheck something I discovered a while back with more recent kernels.
> There has been a bad interaction between 'slow start' and 'delayed acks'
> when Nagle is disabled on 0 RTT local links with uni-directional traffic.
> 
> 'Slow start' would refuse to send more than 4 messages until it received
> an ack (rather than 4 mss of data).
> The receiving system wouldn't send an ack until the timer expired
> (or several mss of data were received), by which time the sender could have
> a lot of data queued.
> 
> Due to the 0 RTT and bursty nature of the data, 'slow start' happened
> all the time.

The following packetdrill test suggests that the current kernel sends up to 10
messages without having to wait for any ACK
(IW10):

// Set up production and experiment configs
`../common/defaults.sh`

// Establish a connection.
0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
0.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
0.000 bind(3, ..., ...) = 0
0.000 listen(3, 1) = 0

0.100 < S 0:0(0) win 32792 <mss 1460,nop,wscale 7>
0.100 > S. 0:0(0) ack 1 <mss 1460,nop,wscale 6>
0.110 < . 1:1(0) ack 1 win 257
0.110 accept(3, ..., ...) = 4

0.200 %{ assert tcpi_snd_cwnd == 10 }%
+0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0

+0.01 write(4, ..., 100) = 100
+0  > P. 1:101(100) ack 1

+0.01 write(4, ..., 100) = 100
+0  > P. 101:201(100) ack 1

+0.01 write(4, ..., 100) = 100
+0  > P. 201:301(100) ack 1

+0.01 write(4, ..., 100) = 100
+0  > P. 301:401(100) ack 1

+0.01 write(4, ..., 100) = 100
+0  > P. 401:501(100) ack 1

+0.01 write(4, ..., 100) = 100
+0  > P. 501:601(100) ack 1

+0.01 write(4, ..., 100) = 100
+0  > P. 601:701(100) ack 1

+0.01 write(4, ..., 100) = 100
+0  > P. 701:801(100) ack 1

+0.01 write(4, ..., 100) = 100
+0  > P. 801:901(100) ack 1

+0.01 write(4, ..., 100) = 100
+0  > P. 901:1001(100) ack 1

