* BW regression after "tcp: refine TSO autosizing"
@ 2015-01-13 16:48 Eyal Perry
2015-01-13 18:57 ` Eric Dumazet
0 siblings, 1 reply; 16+ messages in thread
From: Eyal Perry @ 2015-01-13 16:48 UTC (permalink / raw)
To: eric.dumazet; +Cc: netdev, Amir Vadai, yevgenyp, saeedm, idos, amira, eyalpe
Hello Eric,
Lately we've observed a performance degradation in BW of about 30-40% (depending on
the setup we use).
I've bisected the issue down to this commit: 605ad7f1 ("tcp: refine TSO
autosizing")
For instance, I was running the following test:
1. Binding the net device's IRQs to core 0 on both the client and server side
2. Running netperf with a 64K message size (using the following command)
$ netperf -H remote -T 1,1 -l 100 -t TCP_STREAM -- -k THROUGHPUT -M 65536 -m 65536
I ran the test on upstream net-next including your patch and then reverted it;
reverting improved throughput from 14.6Gbps to 22.1Gbps.
An additional difference I've noticed when inspecting the ethtool statistics:
the number of xmit_more packets increased from 4 to 160 with the reverted kernel.
We are investigating this issue, do you have a hint?
Best regards,
Eyal.
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: BW regression after "tcp: refine TSO autosizing"
2015-01-13 16:48 BW regression after "tcp: refine TSO autosizing" Eyal Perry
@ 2015-01-13 18:57 ` Eric Dumazet
2015-01-13 20:21 ` Or Gerlitz
0 siblings, 1 reply; 16+ messages in thread
From: Eric Dumazet @ 2015-01-13 18:57 UTC (permalink / raw)
To: Eyal Perry; +Cc: netdev, Amir Vadai, yevgenyp, saeedm, idos, amira, eyalpe
On Tue, 2015-01-13 at 18:48 +0200, Eyal Perry wrote:
> Hello Eric,
> Lately we've observed a performance degradation in BW of about 30-40% (depending on
> the setup we use).
> I've bisected the issue down to this commit: 605ad7f1 ("tcp: refine TSO
> autosizing")
>
> For instance, I was running the following test:
> 1. Binding the net device's IRQs to core 0 on both the client and server side
> 2. Running netperf with a 64K message size (using the following command)
> $ netperf -H remote -T 1,1 -l 100 -t TCP_STREAM -- -k THROUGHPUT -M 65536 -m 65536
>
> I ran the test on upstream net-next including your patch and then reverted it;
> reverting improved throughput from 14.6Gbps to 22.1Gbps.
>
> An additional difference I've noticed when inspecting the ethtool statistics:
> the number of xmit_more packets increased from 4 to 160 with the reverted kernel.
>
> We are investigating this issue, do you have a hint?
Which driver are you using for this test ?
* Re: BW regression after "tcp: refine TSO autosizing"
2015-01-13 18:57 ` Eric Dumazet
@ 2015-01-13 20:21 ` Or Gerlitz
2015-01-13 21:41 ` Eyal Perry
0 siblings, 1 reply; 16+ messages in thread
From: Or Gerlitz @ 2015-01-13 20:21 UTC (permalink / raw)
To: Eric Dumazet
Cc: Eyal Perry, Linux Netdev List, Amir Vadai, Yevgeny Petrilin,
Saeed Mahameed, Ido Shamay, Amir Ancel, Eyal Perry
On Tue, Jan 13, 2015 at 8:57 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Tue, 2015-01-13 at 18:48 +0200, Eyal Perry wrote:
>> Hello Eric,
>> Lately we've observed a performance degradation in BW of about 30-40% (depending on
>> the setup we use).
>> I've bisected the issue down to this commit: 605ad7f1 ("tcp: refine TSO
>> autosizing")
>>
>> For instance, I was running the following test:
>> 1. Binding the net device's IRQs to core 0 on both the client and server side
>> 2. Running netperf with a 64K message size (using the following command)
>> $ netperf -H remote -T 1,1 -l 100 -t TCP_STREAM -- -k THROUGHPUT -M 65536 -m 65536
>>
>> I ran the test on upstream net-next including your patch and then reverted it;
>> reverting improved throughput from 14.6Gbps to 22.1Gbps.
>>
>> An additional difference I've noticed when inspecting the ethtool statistics:
>> the number of xmit_more packets increased from 4 to 160 with the reverted kernel.
>>
>> We are investigating this issue, do you have a hint?
>
> Which driver are you using for this test ?
AFAIK, mlx4
* Re: BW regression after "tcp: refine TSO autosizing"
2015-01-13 20:21 ` Or Gerlitz
@ 2015-01-13 21:41 ` Eyal Perry
2015-01-13 22:00 ` Eric Dumazet
0 siblings, 1 reply; 16+ messages in thread
From: Eyal Perry @ 2015-01-13 21:41 UTC (permalink / raw)
To: Or Gerlitz, Eric Dumazet
Cc: Linux Netdev List, Amir Vadai, Yevgeny Petrilin, Saeed Mahameed,
Ido Shamay, Amir Ancel, Eyal Perry
On 1/13/2015 22:21, Or Gerlitz wrote:
> On Tue, Jan 13, 2015 at 8:57 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> On Tue, 2015-01-13 at 18:48 +0200, Eyal Perry wrote:
>>> Hello Eric,
>>> Lately we've observed a performance degradation in BW of about 30-40% (depending on
>>> the setup we use).
>>> I've bisected the issue down to this commit: 605ad7f1 ("tcp: refine TSO
>>> autosizing")
>>>
>>> For instance, I was running the following test:
>>> 1. Binding the net device's IRQs to core 0 on both the client and server side
>>> 2. Running netperf with a 64K message size (using the following command)
>>> $ netperf -H remote -T 1,1 -l 100 -t TCP_STREAM -- -k THROUGHPUT -M 65536 -m 65536
>>>
>>> I ran the test on upstream net-next including your patch and then reverted it;
>>> reverting improved throughput from 14.6Gbps to 22.1Gbps.
>>>
>>> An additional difference I've noticed when inspecting the ethtool statistics:
>>> the number of xmit_more packets increased from 4 to 160 with the reverted kernel.
>>>
>>> We are investigating this issue, do you have a hint?
>> Which driver are you using for this test ?
> AFAIK, mlx4
Oops, forgot to mention.
mlx4 indeed.
* Re: BW regression after "tcp: refine TSO autosizing"
2015-01-13 21:41 ` Eyal Perry
@ 2015-01-13 22:00 ` Eric Dumazet
2015-01-18 16:22 ` Eyal Perry
0 siblings, 1 reply; 16+ messages in thread
From: Eric Dumazet @ 2015-01-13 22:00 UTC (permalink / raw)
To: Eyal Perry
Cc: Or Gerlitz, Linux Netdev List, Amir Vadai, Yevgeny Petrilin,
Saeed Mahameed, Ido Shamay, Amir Ancel, Eyal Perry
On Tue, 2015-01-13 at 23:41 +0200, Eyal Perry wrote:
> On 1/13/2015 22:21, Or Gerlitz wrote:
> > On Tue, Jan 13, 2015 at 8:57 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >> On Tue, 2015-01-13 at 18:48 +0200, Eyal Perry wrote:
> >>> Hello Eric,
> >>> Lately we've observed a performance degradation in BW of about 30-40% (depending on
> >>> the setup we use).
> >>> I've bisected the issue down to this commit: 605ad7f1 ("tcp: refine TSO
> >>> autosizing")
> >>>
> >>> For instance, I was running the following test:
> >>> 1. Binding the net device's IRQs to core 0 on both the client and server side
> >>> 2. Running netperf with a 64K message size (using the following command)
> >>> $ netperf -H remote -T 1,1 -l 100 -t TCP_STREAM -- -k THROUGHPUT -M 65536 -m 65536
> >>>
> >>> I ran the test on upstream net-next including your patch and then reverted it;
> >>> reverting improved throughput from 14.6Gbps to 22.1Gbps.
> >>>
> >>> An additional difference I've noticed when inspecting the ethtool statistics:
> >>> the number of xmit_more packets increased from 4 to 160 with the reverted kernel.
> >>>
> >>> We are investigating this issue, do you have a hint?
> >> Which driver are you using for this test ?
> > AFAIK, mlx4
> Oops, forgot to mention.
> mlx4 indeed.
Make sure you do not drop packets at receiver.
(Patch might have increased raw speed, and receiver starts dropping
packets because it is not able to sustain line rate on a single flow)
If cwnd is too small, then yes, sending slightly smaller TSO packets can
impact performance, but this is desirable as well.
This is a congestion control problem.
lpaa23:~# nstat >/dev/null; DUMP_TCP_INFO=1 ./netperf -H lpaa24;nstat
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to lpaa24.prod.google.com () port 0 AF_INET
rto=201000 ato=0 pmtu=1500 rcv_ssthresh=29200 rtt=52 rttvar=2 snd_ssthresh=66 cwnd=102 reordering=3 total_retrans=439 ca_state=0
Recv Send Send
Socket Socket Message Elapsed
Size Size Size Time Throughput
bytes bytes bytes secs. 10^6bits/sec
87380 16384 16384 10.00 17366.51
#kernel
IpInReceives 379010 0.0
IpInDelivers 379010 0.0
IpOutRequests 494794 0.0
IcmpInErrors 1 0.0
IcmpInTimeExcds 1 0.0
IcmpOutErrors 1 0.0
IcmpOutTimeExcds 1 0.0
IcmpMsgInType3 1 0.0
IcmpMsgOutType3 1 0.0
TcpActiveOpens 18 0.0
TcpPassiveOpens 4 0.0
TcpAttemptFails 8 0.0
TcpEstabResets 7 0.0
TcpInSegs 378992 0.0
TcpOutSegs 14993053 0.0
TcpRetransSegs 439 0.0
TcpOutRsts 28 0.0
UdpInDatagrams 16 0.0
UdpNoPorts 1 0.0
UdpOutDatagrams 17 0.0
TcpExtTW 3 0.0
TcpExtDelayedACKs 1 0.0
TcpExtTCPPrequeued 1 0.0
TcpExtTCPHPHits 14 0.0
TcpExtTCPPureAcks 301046 0.0
TcpExtTCPHPAcks 77858 0.0
TcpExtTCPSackRecovery 75 0.0
TcpExtTCPFastRetrans 439 0.0
TcpExtTCPAbortOnData 7 0.0
TcpExtTCPSackShifted 17 0.0
TcpExtTCPSackMerged 57 0.0
TcpExtTCPSackShiftFallback 234 0.0
TcpExtTCPRcvCoalesce 6 0.0
TcpExtTCPFastOpenActive 7 0.0
TcpExtTCPSpuriousRtxHostQueues 2 0.0
TcpExtTCPAutoCorking 68423 0.0
TcpExtTCPOrigDataSent 14992970 0.0
TcpExtTCPHystartTrainDetect 1 0.0
TcpExtTCPHystartTrainCwnd 70 0.0
IpExtInOctets 19731445 0.0
IpExtOutOctets 21736126719 0.0
IpExtInNoECTPkts 379010 0.0
You can also see in this sample that Hystart ended slow start
with a very small cwnd of 70.
* Re: BW regression after "tcp: refine TSO autosizing"
2015-01-13 22:00 ` Eric Dumazet
@ 2015-01-18 16:22 ` Eyal Perry
2015-01-18 17:48 ` Eric Dumazet
0 siblings, 1 reply; 16+ messages in thread
From: Eyal Perry @ 2015-01-18 16:22 UTC (permalink / raw)
To: Eric Dumazet
Cc: Or Gerlitz, Linux Netdev List, Amir Vadai, Yevgeny Petrilin,
Saeed Mahameed, Ido Shamay, Amir Ancel, Eyal Perry
On Wed, Jan 14, 2015 at 12:00 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Make sure you do not drop packets at receiver. (Patch might have increased raw speed, and receiver starts dropping packets because it is not able to sustain line rate on a single flow)
Hi Eric,
I double checked that there are no drops on the receiver (ifconfig,
ethtool, and net_dropmonitor).
> If cwnd is too small, then yes, sending slightly smaller TSO packets can impact performance, but this is desirable as well. This is a congestion control problem.
How can we reliably measure the cwnd? tpci_snd_cwnd is not consistent
across runs.
Anyway, we don't see a difference in the TSO packet size (see results below).
> lpaa23:~# nstat >/dev/null; DUMP_TCP_INFO=1 ./netperf -H lpaa24;nstat
[...]
>
> You also can see in this sample Hystart ended slow start with a very small cwnd of 70
We see the issue also on very long runs, so I don't understand how it is
related to the slow start mechanism.
Below, are two measurements with all the statistics.
* with your patch:
$ nstat >/dev/null; netperf -H remote;nstat
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
11.11.11.36 () port 0 AF_INET
tcpi_rto 204000 tcpi_ato 0 tcpi_pmtu 1500 tcpi_rcv_ssthresh 29200
tcpi_rtt 79 tcpi_rttvar 14 tcpi_snd_ssthresh 214 tpci_snd_cwnd 634
tcpi_reordering 3 tcpi_total_retrans 0
Recv Send Send
Socket Socket Message Elapsed
Size Size Size Time Throughput
bytes bytes bytes secs. 10^6bits/sec
87380 16384 16384 10.00 18537.46
#kernel
IpInReceives 260728 0.0
IpInDelivers 260728 0.0
IpOutRequests 355684 0.0
TcpActiveOpens 2 0.0
TcpInSegs 260729 0.0
TcpOutSegs 16004157 0.0
UdpInDatagrams 1 0.0
UdpOutDatagrams 7 0.0
UdpIgnoredMulti 1 0.0
Ip6InReceives 4 0.0
Ip6InDelivers 4 0.0
Ip6OutRequests 3 0.0
Ip6InMcastPkts 1 0.0
Ip6InOctets 288 0.0
Ip6OutOctets 856 0.0
Ip6InMcastOctets 72 0.0
Ip6InNoECTPkts 4 0.0
Icmp6InMsgs 1 0.0
Icmp6InNeighborAdvertisements 1 0.0
Icmp6InType136 1 0.0
TcpExtDelayedACKs 2 0.0
TcpExtTCPHPHits 3 0.0
TcpExtTCPPureAcks 208131 0.0
TcpExtTCPHPAcks 52591 0.0
TcpExtTCPAutoCorking 49920 0.0
TcpExtTCPOrigDataSent 16004146 0.0
TcpExtTCPHystartTrainDetect 1 0.0
TcpExtTCPHystartTrainCwnd 214 0.0
IpExtInBcastPkts 1 0.0
IpExtInOctets 13560399 0.0
IpExtOutOctets 23192484354 0.0
IpExtInBcastOctets 367 0.0
IpExtInNoECTPkts 260728 0.0
* And without it:
$ nstat >/dev/null; netperf -H remote;nstat
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
11.11.11.37 () port 0 AF_INET
tcpi_rto 204000 tcpi_ato 0 tcpi_pmtu 1500 tcpi_rcv_ssthresh 29200
tcpi_rtt 108 tcpi_rttvar 13 tcpi_snd_ssthresh 801 tpci_snd_cwnd 809
tcpi_reordering 64 tcpi_total_retrans 6
Recv Send Send
Socket Socket Message Elapsed
Size Size Size Time Throughput
bytes bytes bytes secs. 10^6bits/sec
87380 16384 16384 10.00 26826.32
#kernel
IpInReceives 431716 0.0
IpInDelivers 431716 0.0
IpOutRequests 521537 0.0
TcpActiveOpens 2 0.0
TcpInSegs 431676 0.0
TcpOutSegs 23164903 0.0
TcpRetransSegs 6 0.0
UdpInDatagrams 3 0.0
UdpOutDatagrams 1 0.0
UdpIgnoredMulti 40 0.0
Ip6InReceives 4 0.0
Ip6InDelivers 4 0.0
Ip6OutRequests 4 0.0
Ip6InOctets 288 0.0
Ip6OutOctets 928 0.0
Ip6InNoECTPkts 4 0.0
Icmp6InMsgs 1 0.0
Icmp6OutMsgs 1 0.0
Icmp6InNeighborAdvertisements 1 0.0
Icmp6OutNeighborSolicits 1 0.0
Icmp6InType136 1 0.0
Icmp6OutType135 1 0.0
TcpExtDelayedACKs 1 0.0
TcpExtTCPHPHits 2 0.0
TcpExtTCPPureAcks 291894 0.0
TcpExtTCPHPAcks 139775 0.0
TcpExtTCPSackRecovery 5 0.0
TcpExtTCPTSReorder 1 0.0
TcpExtTCPPartialUndo 1 0.0
TcpExtTCPDSACKUndo 4 0.0
TcpExtTCPFastRetrans 6 0.0
TcpExtTCPDSACKRecv 6 0.0
TcpExtTCPDSACKIgnoredNoUndo 1 0.0
TcpExtTCPSackShifted 46436 0.0
TcpExtTCPSackMerged 1306 0.0
TcpExtTCPSackShiftFallback 4414 0.0
TcpExtTCPAutoCorking 274120 0.0
TcpExtTCPOrigDataSent 23164893 0.0
TcpExtTCPHystartTrainDetect 1 0.0
TcpExtTCPHystartTrainCwnd 71 0.0
IpExtInBcastPkts 42 0.0
IpExtInOctets 23066435 0.0
IpExtOutOctets 33567160556 0.0
IpExtInBcastOctets 4275 0.0
IpExtInNoECTPkts 431716 0.0
Please let me know if you see something in the results.
Regards,
Eyal.
* Re: BW regression after "tcp: refine TSO autosizing"
2015-01-18 16:22 ` Eyal Perry
@ 2015-01-18 17:48 ` Eric Dumazet
2015-01-18 21:40 ` Eyal Perry
0 siblings, 1 reply; 16+ messages in thread
From: Eric Dumazet @ 2015-01-18 17:48 UTC (permalink / raw)
To: Eyal Perry
Cc: Or Gerlitz, Linux Netdev List, Amir Vadai, Yevgeny Petrilin,
Saeed Mahameed, Ido Shamay, Amir Ancel, Eyal Perry
On Sun, 2015-01-18 at 18:22 +0200, Eyal Perry wrote:
>
> Please let me know if you see something in the results.
Getting high throughput on a single flow means a lot of tweaking.
For a start, mlx4 is known to have interrupt mitigation that can hurt,
as the TX interrupt timer is restarted for every packet that is
delivered to the NIC.
ethtool -c ethX
..
tx-usecs: 16
tx-frames: 16
tx-usecs-irq: 0
tx-frames-irq: 256
...
-> TX IRQ can be delayed by 16*16 = 256 usec.
Can you try :
ethtool -C ethX tx-usecs 2 tx-frames 2
Or even
ethtool -C ethX tx-usecs 1 tx-frames 1
Interrupt mitigation is a trade-off.
If one customer wants high throughput on a single flow, then you might
remove interrupt mitigation.
If another customer wants CPU efficiency with thousands of flows, I guess
current mlx4 defaults are pretty good.
* Re: BW regression after "tcp: refine TSO autosizing"
2015-01-18 17:48 ` Eric Dumazet
@ 2015-01-18 21:40 ` Eyal Perry
2015-01-20 2:16 ` Eric Dumazet
0 siblings, 1 reply; 16+ messages in thread
From: Eyal Perry @ 2015-01-18 21:40 UTC (permalink / raw)
To: Eric Dumazet, Eyal Perry
Cc: Or Gerlitz, Linux Netdev List, Amir Vadai, Yevgeny Petrilin,
Saeed Mahameed, Ido Shamay, Amir Ancel
On 1/18/2015 19:48, Eric Dumazet wrote:
> On Sun, 2015-01-18 at 18:22 +0200, Eyal Perry wrote:
>
>> Please let me know if you see something in the results.
> Getting high throughput on a single flow means a lot of tweaking.
>
> For a start, mlx4 is known to have interrupt mitigation that can hurt,
> as the TX interrupt timer is restarted for every packet that is
> delivered to the NIC.
>
> ethtool -c ethX
> ..
> tx-usecs: 16
> tx-frames: 16
> tx-usecs-irq: 0
> tx-frames-irq: 256
> ...
>
> -> TX IRQ can be delayed by 16*16 = 256 usec.
>
> Can you try :
>
> ethtool -C ethX tx-usecs 2 tx-frames 2
>
> Or even
>
> ethtool -C ethX tx-usecs 1 tx-frames 1
So indeed, reducing interrupt mitigation (tx-usecs 1 tx-frames 1) improves things
for the "refined TSO autosizing" kernel (from 18.4Gbps to 19.7Gbps), but in the
other kernel the BW remains the same with and without the coalescing change.
> Interrupt mitigation is a trade-off.
>
> If one customer wants high throughput on a single flow, then you might
> remove interrupt mitigation.
>
> If another customer wants CPU efficiency with thousands of flows, I guess
> current mlx4 defaults are pretty good.
>
* Re: BW regression after "tcp: refine TSO autosizing"
2015-01-18 21:40 ` Eyal Perry
@ 2015-01-20 2:16 ` Eric Dumazet
2015-01-20 2:37 ` Dave Taht
0 siblings, 1 reply; 16+ messages in thread
From: Eric Dumazet @ 2015-01-20 2:16 UTC (permalink / raw)
To: Eyal Perry, Yuchung Cheng, Neal Cardwell
Cc: Eyal Perry, Or Gerlitz, Linux Netdev List, Amir Vadai,
Yevgeny Petrilin, Saeed Mahameed, Ido Shamay, Amir Ancel
On Sun, 2015-01-18 at 23:40 +0200, Eyal Perry wrote:
> So indeed, reducing interrupt mitigation (tx-usecs 1 tx-frames 1) improves things
> for the "refined TSO autosizing" kernel (from 18.4Gbps to 19.7Gbps), but in the
> other kernel the BW remains the same with and without the coalescing change.
OK thanks for testing.
I believe the regression comes from the inability of the congestion
control to cope with stretch ACKs.
Nowadays on fast networks, each ACK packet acknowledges ~45 MSS, but
CUBIC (and other congestion controls) got support for this only during
slow start, with commit 9f9843a751d0a2057f9f3d313886e7e5e6ebaac9
("tcp: properly handle stretch acks in slow start")
I guess it is time to also handle congestion avoidance phase.
With the following patch (very close to what we use here at Google) I
reached 37Gbps instead of 20Gbps:
ethtool -C eth1 tx-usecs 4 tx-frames 4
DUMP_TCP_INFO=1 ./netperf -H remote -T2,2 -t TCP_STREAM -l 20
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to remote () port 0 AF_INET : cpu bind
rto=201000 ato=0 pmtu=1500 rcv_ssthresh=29200 rtt=67 rttvar=6 snd_ssthresh=263 cwnd=265 reordering=3 total_retrans=4569 ca_state=0
Recv Send Send
Socket Socket Message Elapsed
Size Size Size Time Throughput
bytes bytes bytes secs. 10^6bits/sec
87380 16384 16384 20.00 37213.05
I guess this is a world record, my previous one was 34Gbps.
include/net/tcp.h | 2
net/ipv4/tcp_cong.c | 4 +
net/ipv4/tcp_cubic.c | 91 +++++++++++++++++++----------------------
3 files changed, 47 insertions(+), 50 deletions(-)
diff --git a/include/net/tcp.h b/include/net/tcp.h
index b8fdc6bab3f3..05815fbb490f 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -843,7 +843,7 @@ void tcp_get_available_congestion_control(char *buf, size_t len);
void tcp_get_allowed_congestion_control(char *buf, size_t len);
int tcp_set_allowed_congestion_control(char *allowed);
int tcp_set_congestion_control(struct sock *sk, const char *name);
-void tcp_slow_start(struct tcp_sock *tp, u32 acked);
+int tcp_slow_start(struct tcp_sock *tp, u32 acked);
void tcp_cong_avoid_ai(struct tcp_sock *tp, u32 w);
u32 tcp_reno_ssthresh(struct sock *sk);
diff --git a/net/ipv4/tcp_cong.c b/net/ipv4/tcp_cong.c
index 63c29dba68a8..f0fc696b9333 100644
--- a/net/ipv4/tcp_cong.c
+++ b/net/ipv4/tcp_cong.c
@@ -360,13 +360,15 @@ int tcp_set_congestion_control(struct sock *sk, const char *name)
* ABC caps N to 2. Slow start exits when cwnd grows over ssthresh and
* returns the leftover acks to adjust cwnd in congestion avoidance mode.
*/
-void tcp_slow_start(struct tcp_sock *tp, u32 acked)
+int tcp_slow_start(struct tcp_sock *tp, u32 acked)
{
u32 cwnd = tp->snd_cwnd + acked;
if (cwnd > tp->snd_ssthresh)
cwnd = tp->snd_ssthresh + 1;
+ acked -= cwnd - tp->snd_cwnd;
tp->snd_cwnd = min(cwnd, tp->snd_cwnd_clamp);
+ return acked;
}
EXPORT_SYMBOL_GPL(tcp_slow_start);
diff --git a/net/ipv4/tcp_cubic.c b/net/ipv4/tcp_cubic.c
index 6b6002416a73..c0e048929b74 100644
--- a/net/ipv4/tcp_cubic.c
+++ b/net/ipv4/tcp_cubic.c
@@ -81,7 +81,6 @@ MODULE_PARM_DESC(hystart_ack_delta, "spacing between ack's indicating train (mse
/* BIC TCP Parameters */
struct bictcp {
- u32 cnt; /* increase cwnd by 1 after ACKs */
u32 last_max_cwnd; /* last maximum snd_cwnd */
u32 loss_cwnd; /* congestion window at last loss */
u32 last_cwnd; /* the last snd_cwnd */
@@ -93,20 +92,18 @@ struct bictcp {
u32 epoch_start; /* beginning of an epoch */
u32 ack_cnt; /* number of acks */
u32 tcp_cwnd; /* estimated tcp cwnd */
-#define ACK_RATIO_SHIFT 4
-#define ACK_RATIO_LIMIT (32u << ACK_RATIO_SHIFT)
- u16 delayed_ack; /* estimate the ratio of Packets/ACKs << 4 */
u8 sample_cnt; /* number of samples to decide curr_rtt */
u8 found; /* the exit point is found? */
u32 round_start; /* beginning of each round */
u32 end_seq; /* end_seq of the round */
u32 last_ack; /* last time when the ACK spacing is close */
u32 curr_rtt; /* the minimum rtt of current round */
+ u32 last_bic_target;/* last target cwnd computed by cubic
+ * (not tcp_friendliness mode) */
};
static inline void bictcp_reset(struct bictcp *ca)
{
- ca->cnt = 0;
ca->last_max_cwnd = 0;
ca->last_cwnd = 0;
ca->last_time = 0;
@@ -114,7 +111,6 @@ static inline void bictcp_reset(struct bictcp *ca)
ca->bic_K = 0;
ca->delay_min = 0;
ca->epoch_start = 0;
- ca->delayed_ack = 2 << ACK_RATIO_SHIFT;
ca->ack_cnt = 0;
ca->tcp_cwnd = 0;
ca->found = 0;
@@ -205,12 +201,14 @@ static u32 cubic_root(u64 a)
/*
* Compute congestion window to use.
*/
-static inline void bictcp_update(struct bictcp *ca, u32 cwnd)
+static inline void bictcp_update(struct bictcp *ca, u32 pkts_acked, u32 cwnd)
{
- u32 delta, bic_target, max_cnt;
+ u32 delta, bic_target;
u64 offs, t;
- ca->ack_cnt++; /* count the number of ACKs */
+ ca->ack_cnt += pkts_acked; /* count the number of packets that
+ * have been ACKed
+ */
if (ca->last_cwnd == cwnd &&
(s32)(tcp_time_stamp - ca->last_time) <= HZ / 32)
@@ -221,7 +219,7 @@ static inline void bictcp_update(struct bictcp *ca, u32 cwnd)
if (ca->epoch_start == 0) {
ca->epoch_start = tcp_time_stamp; /* record beginning */
- ca->ack_cnt = 1; /* start counting */
+ ca->ack_cnt = pkts_acked; /* start counting */
ca->tcp_cwnd = cwnd; /* syn with cubic */
if (ca->last_max_cwnd <= cwnd) {
@@ -269,19 +267,7 @@ static inline void bictcp_update(struct bictcp *ca, u32 cwnd)
else /* above origin*/
bic_target = ca->bic_origin_point + delta;
- /* cubic function - calc bictcp_cnt*/
- if (bic_target > cwnd) {
- ca->cnt = cwnd / (bic_target - cwnd);
- } else {
- ca->cnt = 100 * cwnd; /* very small increment*/
- }
-
- /*
- * The initial growth of cubic function may be too conservative
- * when the available bandwidth is still unknown.
- */
- if (ca->last_max_cwnd == 0 && ca->cnt > 20)
- ca->cnt = 20; /* increase cwnd 5% per RTT */
+ ca->last_bic_target = bic_target;
/* TCP Friendly */
if (tcp_friendliness) {
@@ -292,18 +278,7 @@ static inline void bictcp_update(struct bictcp *ca, u32 cwnd)
ca->ack_cnt -= delta;
ca->tcp_cwnd++;
}
-
- if (ca->tcp_cwnd > cwnd) { /* if bic is slower than tcp */
- delta = ca->tcp_cwnd - cwnd;
- max_cnt = cwnd / delta;
- if (ca->cnt > max_cnt)
- ca->cnt = max_cnt;
- }
}
-
- ca->cnt = (ca->cnt << ACK_RATIO_SHIFT) / ca->delayed_ack;
- if (ca->cnt == 0) /* cannot be zero */
- ca->cnt = 1;
}
static void bictcp_cong_avoid(struct sock *sk, u32 ack, u32 acked)
@@ -314,13 +289,43 @@ static void bictcp_cong_avoid(struct sock *sk, u32 ack, u32 acked)
if (!tcp_is_cwnd_limited(sk))
return;
+ /* cwnd may first advance in slow start then move on to congestion
+ * control mode on a stretch ACK.
+ */
if (tp->snd_cwnd <= tp->snd_ssthresh) {
if (hystart && after(ack, ca->end_seq))
bictcp_hystart_reset(sk);
- tcp_slow_start(tp, acked);
- } else {
- bictcp_update(ca, tp->snd_cwnd);
- tcp_cong_avoid_ai(tp, ca->cnt);
+ acked = tcp_slow_start(tp, acked);
+ }
+
+ if (acked && tp->snd_cwnd > tp->snd_ssthresh) {
+ u32 target, cnt;
+
+ bictcp_update(ca, acked, tp->snd_cwnd);
+ /* Compute target cwnd based on bic_target and tcp_cwnd
+ * (whichever is faster)
+ */
+ target = (ca->last_bic_target >= ca->tcp_cwnd) ?
+ ca->last_bic_target : ca->tcp_cwnd;
+ while (acked > 0) {
+ if (target > tp->snd_cwnd)
+ cnt = tp->snd_cwnd / (target - tp->snd_cwnd);
+ else
+ cnt = 100 * tp->snd_cwnd;
+
+ /* The initial growth of cubic function may be
+ * too conservative when the available
+ * bandwidth is still unknown.
+ */
+ if (ca->last_max_cwnd == 0 && cnt > 20)
+ cnt = 20; /* increase cwnd 5% per RTT */
+
+ if (cnt == 0) /* cannot be zero */
+ cnt = 1;
+
+ tcp_cong_avoid_ai(tp, cnt);
+ acked--;
+ }
}
}
@@ -411,20 +416,10 @@ static void hystart_update(struct sock *sk, u32 delay)
*/
static void bictcp_acked(struct sock *sk, u32 cnt, s32 rtt_us)
{
- const struct inet_connection_sock *icsk = inet_csk(sk);
const struct tcp_sock *tp = tcp_sk(sk);
struct bictcp *ca = inet_csk_ca(sk);
u32 delay;
- if (icsk->icsk_ca_state == TCP_CA_Open) {
- u32 ratio = ca->delayed_ack;
-
- ratio -= ca->delayed_ack >> ACK_RATIO_SHIFT;
- ratio += cnt;
-
- ca->delayed_ack = clamp(ratio, 1U, ACK_RATIO_LIMIT);
- }
-
/* Some calls are for duplicates without timetamps */
if (rtt_us < 0)
return;
* Re: BW regression after "tcp: refine TSO autosizing"
2015-01-20 2:16 ` Eric Dumazet
@ 2015-01-20 2:37 ` Dave Taht
2015-01-20 3:14 ` Eric Dumazet
0 siblings, 1 reply; 16+ messages in thread
From: Dave Taht @ 2015-01-20 2:37 UTC (permalink / raw)
To: Eric Dumazet
Cc: Eyal Perry, Yuchung Cheng, Neal Cardwell, Eyal Perry, Or Gerlitz,
Linux Netdev List, Amir Vadai, Yevgeny Petrilin, Saeed Mahameed,
Ido Shamay, Amir Ancel
On Mon, Jan 19, 2015 at 6:16 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Sun, 2015-01-18 at 23:40 +0200, Eyal Perry wrote:
>
>> So indeed, interrupt mitigation (tx-usecs 1 tx-frames 1) improves things up
>> for the "refined TSO autosizing" kernel (from 18.4Gbps to 19.7Gbps). but
>> in the
>> other kernel, the BW is remains the same with and without the coalescing.
>
> OK thanks for testing.
>
> I believe the regression comes from the inability of the congestion
> control to cope with stretch ACKs.
>
> Nowadays on fast networks, each ACK packet acknowledges ~45 MSS, but
> CUBIC (and other congestion controls) got support for this only during
> slow start, with commit 9f9843a751d0a2057f9f3d313886e7e5e6ebaac9
> ("tcp: properly handle stretch acks in slow start")
>
> I guess it is time to also handle congestion avoidance phase.
Are you saying that at long last, delayed acks as we knew them are
dead, dead, dead?
> With the following patch (very close to what we use here at Google) I
> reached 37Gbps instead of 20Gbps:
>
> ethtool -C eth1 tx-usecs 4 tx-frames 4
What is the default here?
What happens with the default here?
>
> DUMP_TCP_INFO=1 ./netperf -H remote -T2,2 -t TCP_STREAM -l 20
> MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to remote () port 0 AF_INET : cpu bind
> rto=201000 ato=0 pmtu=1500 rcv_ssthresh=29200 rtt=67 rttvar=6 snd_ssthresh=263 cwnd=265 reordering=3 total_retrans=4569 ca_state=0
The above statistics are not dumped by my netperf, and look extremely
desirable to capture in netperf-wrapper. Is this a script parsing some
other kernel data at the conclusion of the run, or a better netperf?
If ECN was on the bottleneck link, I imagine total_retrans would be 0,
or are packets getting dropped in the kernel?
> Recv Send Send
> Socket Socket Message Elapsed
> Size Size Size Time Throughput
> bytes bytes bytes secs. 10^6bits/sec
>
> 87380 16384 16384 20.00 37213.05
>
> I guess this is a world record, my previous one was 34Gbps.
>
>
> include/net/tcp.h | 2
> net/ipv4/tcp_cong.c | 4 +
> net/ipv4/tcp_cubic.c | 91 +++++++++++++++++++----------------------
> 3 files changed, 47 insertions(+), 50 deletions(-)
>
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index b8fdc6bab3f3..05815fbb490f 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -843,7 +843,7 @@ void tcp_get_available_congestion_control(char *buf, size_t len);
> void tcp_get_allowed_congestion_control(char *buf, size_t len);
> int tcp_set_allowed_congestion_control(char *allowed);
> int tcp_set_congestion_control(struct sock *sk, const char *name);
> -void tcp_slow_start(struct tcp_sock *tp, u32 acked);
> +int tcp_slow_start(struct tcp_sock *tp, u32 acked);
> void tcp_cong_avoid_ai(struct tcp_sock *tp, u32 w);
>
> u32 tcp_reno_ssthresh(struct sock *sk);
> diff --git a/net/ipv4/tcp_cong.c b/net/ipv4/tcp_cong.c
> index 63c29dba68a8..f0fc696b9333 100644
> --- a/net/ipv4/tcp_cong.c
> +++ b/net/ipv4/tcp_cong.c
> @@ -360,13 +360,15 @@ int tcp_set_congestion_control(struct sock *sk, const char *name)
> * ABC caps N to 2. Slow start exits when cwnd grows over ssthresh and
> * returns the leftover acks to adjust cwnd in congestion avoidance mode.
> */
> -void tcp_slow_start(struct tcp_sock *tp, u32 acked)
> +int tcp_slow_start(struct tcp_sock *tp, u32 acked)
> {
> u32 cwnd = tp->snd_cwnd + acked;
>
> if (cwnd > tp->snd_ssthresh)
> cwnd = tp->snd_ssthresh + 1;
> + acked -= cwnd - tp->snd_cwnd;
> tp->snd_cwnd = min(cwnd, tp->snd_cwnd_clamp);
> + return acked;
> }
> EXPORT_SYMBOL_GPL(tcp_slow_start);
>
> diff --git a/net/ipv4/tcp_cubic.c b/net/ipv4/tcp_cubic.c
> index 6b6002416a73..c0e048929b74 100644
> --- a/net/ipv4/tcp_cubic.c
> +++ b/net/ipv4/tcp_cubic.c
> @@ -81,7 +81,6 @@ MODULE_PARM_DESC(hystart_ack_delta, "spacing between ack's indicating train (mse
>
> /* BIC TCP Parameters */
> struct bictcp {
> - u32 cnt; /* increase cwnd by 1 after ACKs */
> u32 last_max_cwnd; /* last maximum snd_cwnd */
> u32 loss_cwnd; /* congestion window at last loss */
> u32 last_cwnd; /* the last snd_cwnd */
> @@ -93,20 +92,18 @@ struct bictcp {
> u32 epoch_start; /* beginning of an epoch */
> u32 ack_cnt; /* number of acks */
> u32 tcp_cwnd; /* estimated tcp cwnd */
> -#define ACK_RATIO_SHIFT 4
> -#define ACK_RATIO_LIMIT (32u << ACK_RATIO_SHIFT)
> - u16 delayed_ack; /* estimate the ratio of Packets/ACKs << 4 */
> u8 sample_cnt; /* number of samples to decide curr_rtt */
> u8 found; /* the exit point is found? */
> u32 round_start; /* beginning of each round */
> u32 end_seq; /* end_seq of the round */
> u32 last_ack; /* last time when the ACK spacing is close */
> u32 curr_rtt; /* the minimum rtt of current round */
> + u32 last_bic_target;/* last target cwnd computed by cubic
> + * (not tcp_friendliness mode) */
> };
>
> static inline void bictcp_reset(struct bictcp *ca)
> {
> - ca->cnt = 0;
> ca->last_max_cwnd = 0;
> ca->last_cwnd = 0;
> ca->last_time = 0;
> @@ -114,7 +111,6 @@ static inline void bictcp_reset(struct bictcp *ca)
> ca->bic_K = 0;
> ca->delay_min = 0;
> ca->epoch_start = 0;
> - ca->delayed_ack = 2 << ACK_RATIO_SHIFT;
> ca->ack_cnt = 0;
> ca->tcp_cwnd = 0;
> ca->found = 0;
> @@ -205,12 +201,14 @@ static u32 cubic_root(u64 a)
> /*
> * Compute congestion window to use.
> */
> -static inline void bictcp_update(struct bictcp *ca, u32 cwnd)
> +static inline void bictcp_update(struct bictcp *ca, u32 pkts_acked, u32 cwnd)
> {
> - u32 delta, bic_target, max_cnt;
> + u32 delta, bic_target;
> u64 offs, t;
>
> - ca->ack_cnt++; /* count the number of ACKs */
> + ca->ack_cnt += pkts_acked; /* count the number of packets that
> + * have been ACKed
> + */
>
> if (ca->last_cwnd == cwnd &&
> (s32)(tcp_time_stamp - ca->last_time) <= HZ / 32)
> @@ -221,7 +219,7 @@ static inline void bictcp_update(struct bictcp *ca, u32 cwnd)
>
> if (ca->epoch_start == 0) {
> ca->epoch_start = tcp_time_stamp; /* record beginning */
> - ca->ack_cnt = 1; /* start counting */
> + ca->ack_cnt = pkts_acked; /* start counting */
> ca->tcp_cwnd = cwnd; /* syn with cubic */
>
> if (ca->last_max_cwnd <= cwnd) {
> @@ -269,19 +267,7 @@ static inline void bictcp_update(struct bictcp *ca, u32 cwnd)
> else /* above origin*/
> bic_target = ca->bic_origin_point + delta;
>
> - /* cubic function - calc bictcp_cnt*/
> - if (bic_target > cwnd) {
> - ca->cnt = cwnd / (bic_target - cwnd);
> - } else {
> - ca->cnt = 100 * cwnd; /* very small increment*/
> - }
> -
> - /*
> - * The initial growth of cubic function may be too conservative
> - * when the available bandwidth is still unknown.
> - */
> - if (ca->last_max_cwnd == 0 && ca->cnt > 20)
> - ca->cnt = 20; /* increase cwnd 5% per RTT */
> + ca->last_bic_target = bic_target;
>
> /* TCP Friendly */
> if (tcp_friendliness) {
> @@ -292,18 +278,7 @@ static inline void bictcp_update(struct bictcp *ca, u32 cwnd)
> ca->ack_cnt -= delta;
> ca->tcp_cwnd++;
> }
> -
> - if (ca->tcp_cwnd > cwnd) { /* if bic is slower than tcp */
> - delta = ca->tcp_cwnd - cwnd;
> - max_cnt = cwnd / delta;
> - if (ca->cnt > max_cnt)
> - ca->cnt = max_cnt;
> - }
> }
> -
> - ca->cnt = (ca->cnt << ACK_RATIO_SHIFT) / ca->delayed_ack;
> - if (ca->cnt == 0) /* cannot be zero */
> - ca->cnt = 1;
> }
>
> static void bictcp_cong_avoid(struct sock *sk, u32 ack, u32 acked)
> @@ -314,13 +289,43 @@ static void bictcp_cong_avoid(struct sock *sk, u32 ack, u32 acked)
> if (!tcp_is_cwnd_limited(sk))
> return;
>
> + /* cwnd may first advance in slow start then move on to congestion
> + * control mode on a stretch ACK.
> + */
> if (tp->snd_cwnd <= tp->snd_ssthresh) {
> if (hystart && after(ack, ca->end_seq))
> bictcp_hystart_reset(sk);
> - tcp_slow_start(tp, acked);
> - } else {
> - bictcp_update(ca, tp->snd_cwnd);
> - tcp_cong_avoid_ai(tp, ca->cnt);
> + acked = tcp_slow_start(tp, acked);
> + }
> +
> + if (acked && tp->snd_cwnd > tp->snd_ssthresh) {
> + u32 target, cnt;
> +
> + bictcp_update(ca, acked, tp->snd_cwnd);
> + /* Compute target cwnd based on bic_target and tcp_cwnd
> + * (whichever is faster)
> + */
> + target = (ca->last_bic_target >= ca->tcp_cwnd) ?
> + ca->last_bic_target : ca->tcp_cwnd;
> + while (acked > 0) {
> + if (target > tp->snd_cwnd)
> + cnt = tp->snd_cwnd / (target - tp->snd_cwnd);
> + else
> + cnt = 100 * tp->snd_cwnd;
> +
> + /* The initial growth of cubic function may be
> + * too conservative when the available
> + * bandwidth is still unknown.
> + */
> + if (ca->last_max_cwnd == 0 && cnt > 20)
> + cnt = 20; /* increase cwnd 5% per RTT */
> +
> + if (cnt == 0) /* cannot be zero */
> + cnt = 1;
> +
> + tcp_cong_avoid_ai(tp, cnt);
> + acked--;
> + }
> }
> }
>
> @@ -411,20 +416,10 @@ static void hystart_update(struct sock *sk, u32 delay)
> */
> static void bictcp_acked(struct sock *sk, u32 cnt, s32 rtt_us)
> {
> - const struct inet_connection_sock *icsk = inet_csk(sk);
> const struct tcp_sock *tp = tcp_sk(sk);
> struct bictcp *ca = inet_csk_ca(sk);
> u32 delay;
>
> - if (icsk->icsk_ca_state == TCP_CA_Open) {
> - u32 ratio = ca->delayed_ack;
> -
> - ratio -= ca->delayed_ack >> ACK_RATIO_SHIFT;
> - ratio += cnt;
> -
> - ca->delayed_ack = clamp(ratio, 1U, ACK_RATIO_LIMIT);
> - }
> -
> /* Some calls are for duplicates without timetamps */
> if (rtt_us < 0)
> return;
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
Dave Täht
http://www.bufferbloat.net/projects/bloat/wiki/Upcoming_Talks
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: BW regression after "tcp: refine TSO autosizing"
2015-01-20 2:37 ` Dave Taht
@ 2015-01-20 3:14 ` Eric Dumazet
2015-01-20 19:14 ` Rick Jones
0 siblings, 1 reply; 16+ messages in thread
From: Eric Dumazet @ 2015-01-20 3:14 UTC (permalink / raw)
To: Dave Taht
Cc: Eyal Perry, Yuchung Cheng, Neal Cardwell, Eyal Perry, Or Gerlitz,
Linux Netdev List, Amir Vadai, Yevgeny Petrilin, Saeed Mahameed,
Ido Shamay, Amir Ancel
On Mon, 2015-01-19 at 18:37 -0800, Dave Taht wrote:
> On Mon, Jan 19, 2015 at 6:16 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > On Sun, 2015-01-18 at 23:40 +0200, Eyal Perry wrote:
> >
> >> So indeed, interrupt mitigation (tx-usecs 1 tx-frames 1) improves things
> >> for the "refined TSO autosizing" kernel (from 18.4Gbps to 19.7Gbps), but
> >> in the other kernel, the BW remains the same with and without the
> >> coalescing.
> >
> > OK thanks for testing.
> >
> > I believe the regression comes from inability for cc to cope with
> > stretch acks.
> >
> > Nowadays on fast networks, each ACK packet acknowledges ~45 MSS, but
> > CUBIC (and others cc) got support for this only during slow start, with
> > commit 9f9843a751d0a2057f9f3d313886e7e5e6ebaac9
> > ("tcp: properly handle stretch acks in slow start")
> >
> > I guess it is time to also handle congestion avoidance phase.
>
> Are you saying that at long last, delayed acks as we knew them are
> dead, dead, dead?
Sorry, I can not parse what you are saying.
In case you missed it, it has nothing to do with delayed ACK but GRO on
receiver.
>
> > With following patch (very close to what we use here at Google) I
> > reached 37Gbps instead of 20Gbps :
> >
> > ethtool -C eth1 tx-usecs 4 tx-frames 4
>
> What is the default here?
16 & 16, see my prior answer in this thread.
>
> What happens with the default here?
ethtool -C eth1 tx-usecs 16 tx-frames 16
DUMP_TCP_INFO=1 ./netperf -H remote -T2,2 -t TCP_STREAM -l 20
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to remote
() port 0 AF_INET : cpu bind
rto=201000 ato=0 pmtu=1500 rcv_ssthresh=29200 rtt=60 rttvar=2
snd_ssthresh=179 cwnd=243 reordering=3 total_retrans=23 ca_state=0
Recv Send Send
Socket Socket Message Elapsed
Size Size Size Time Throughput
bytes bytes bytes secs. 10^6bits/sec
87380 16384 16384 20.00 22923.74
>
> >
> > DUMP_TCP_INFO=1 ./netperf -H remote -T2,2 -t TCP_STREAM -l 20
> > MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to remote () port 0 AF_INET : cpu bind
> > rto=201000 ato=0 pmtu=1500 rcv_ssthresh=29200 rtt=67 rttvar=6 snd_ssthresh=263 cwnd=265 reordering=3 total_retrans=4569 ca_state=0
>
> The above statistics are not dumped by my netperf, and look extremely
> desirable to capture in netperf-wrapper. This is a script parsing some
> other kernel data at the conclusion of the run? or a better netperf?
That's a 3-line patch in netperf, actually.
>
> If ECN was on the bottleneck link, I imagine total_retrans would be 0,
> or are packets getting dropped in the kernel?
The receiver drops frames, because we are at the limit of what the NIC
can do on a single RX queue.
* Re: BW regression after "tcp: refine TSO autosizing"
2015-01-20 3:14 ` Eric Dumazet
@ 2015-01-20 19:14 ` Rick Jones
2015-01-20 19:26 ` Eric Dumazet
2015-01-21 12:26 ` David Laight
0 siblings, 2 replies; 16+ messages in thread
From: Rick Jones @ 2015-01-20 19:14 UTC (permalink / raw)
To: Eric Dumazet, Dave Taht
Cc: Eyal Perry, Yuchung Cheng, Neal Cardwell, Eyal Perry, Or Gerlitz,
Linux Netdev List, Amir Vadai, Yevgeny Petrilin, Saeed Mahameed,
Ido Shamay, Amir Ancel
>> Are you saying that at long last, delayed acks as we knew them are
>> dead, dead, dead?
>
> Sorry, I can not parse what you are saying.
>
> In case you missed it, it has nothing to do with delayed ACK but GRO on
> receiver.
Dave - assuming I've interpreted Eric's comments correctly, I believe
the answer to your question is No. Your desire for a world brimming
with ack-every-other purity has not been fulfilled :)
However, the engineers formerly at Mentat are probably pleased that a
functional near-equivalent to their ACK avoidance heuristic has ended-up
being implemented and tacitly accepted, albeit by the back door :)
>>> DUMP_TCP_INFO=1 ./netperf -H remote -T2,2 -t TCP_STREAM -l 20
>>> MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to remote () port 0 AF_INET : cpu bind
>>> rto=201000 ato=0 pmtu=1500 rcv_ssthresh=29200 rtt=67 rttvar=6 snd_ssthresh=263 cwnd=265 reordering=3 total_retrans=4569 ca_state=0
>>
>> The above statistics are not dumped by my netperf, and look extremely
>> desirable to capture in netperf-wrapper. This is a script parsing some
>> other kernel data at the conclusion of the run? or a better netperf?
>
> That's a 3-line patch in netperf, actually.
More stuff to pull from a TCP_INFO call I presume? Feel free to drop me
a patch, though I'd probably want it to be in the guise of the omni
output selectors.
happy benchmarking,
rick
* Re: BW regression after "tcp: refine TSO autosizing"
2015-01-20 19:14 ` Rick Jones
@ 2015-01-20 19:26 ` Eric Dumazet
2015-01-20 19:44 ` Rick Jones
2015-01-21 12:26 ` David Laight
1 sibling, 1 reply; 16+ messages in thread
From: Eric Dumazet @ 2015-01-20 19:26 UTC (permalink / raw)
To: Rick Jones
Cc: Dave Taht, Eyal Perry, Yuchung Cheng, Neal Cardwell, Eyal Perry,
Or Gerlitz, Linux Netdev List, Amir Vadai, Yevgeny Petrilin,
Saeed Mahameed, Ido Shamay, Amir Ancel
On Tue, 2015-01-20 at 11:14 -0800, Rick Jones wrote:
> > That's a 3-line patch in netperf, actually.
>
> More stuff to pull from a TCP_INFO call I presume? Feel free to drop me
> a patch, though I'd probably want it to be in the guise of the omni
> output selectors.
>
It was something like :
diff --git a/src/nettest_omni.c b/src/nettest_omni.c
index fb2d5f4..80e43ca 100644
--- a/src/nettest_omni.c
+++ b/src/nettest_omni.c
@@ -3465,7 +3465,7 @@ static void
dump_tcp_info(struct tcp_info *tcp_info)
{
- printf("tcpi_rto %d tcpi_ato %d tcpi_pmtu %d tcpi_rcv_ssthresh %d\n"
+ fprintf(stderr, "tcpi_rto %d tcpi_ato %d tcpi_pmtu %d tcpi_rcv_ssthresh %d\n"
"tcpi_rtt %d tcpi_rttvar %d tcpi_snd_ssthresh %d tpci_snd_cwnd %d\n"
"tcpi_reordering %d tcpi_total_retrans %d\n",
tcp_info->tcpi_rto,
@@ -3539,7 +3539,7 @@ get_transport_retrans(SOCKET socket, int protocol) {
}
else {
- if (debug > 1) {
+ if (debug > 1 || getenv("DUMP_TCP_INFO")) {
dump_tcp_info(&tcp_info);
}
return tcp_info.tcpi_total_retrans;
* Re: BW regression after "tcp: refine TSO autosizing"
2015-01-20 19:26 ` Eric Dumazet
@ 2015-01-20 19:44 ` Rick Jones
0 siblings, 0 replies; 16+ messages in thread
From: Rick Jones @ 2015-01-20 19:44 UTC (permalink / raw)
To: Eric Dumazet
Cc: Dave Taht, Eyal Perry, Yuchung Cheng, Neal Cardwell, Eyal Perry,
Or Gerlitz, Linux Netdev List, Amir Vadai, Yevgeny Petrilin,
Saeed Mahameed, Ido Shamay, Amir Ancel
>> More stuff to pull from a TCP_INFO call I presume? Feel free to drop me
>> a patch, though I'd probably want it to be in the guise of the omni
>> output selectors.
>>
>
> It was something like :
I'd forgotten about dump_tcp_info() :)
Committed revision 673.
happy benchmarking,
rick
* RE: BW regression after "tcp: refine TSO autosizing"
2015-01-20 19:14 ` Rick Jones
2015-01-20 19:26 ` Eric Dumazet
@ 2015-01-21 12:26 ` David Laight
2015-01-21 17:01 ` Eric Dumazet
1 sibling, 1 reply; 16+ messages in thread
From: David Laight @ 2015-01-21 12:26 UTC (permalink / raw)
To: 'Rick Jones', Eric Dumazet, Dave Taht
Cc: Eyal Perry, Yuchung Cheng, Neal Cardwell, Eyal Perry, Or Gerlitz,
Linux Netdev List, Amir Vadai, Yevgeny Petrilin, Saeed Mahameed,
Ido Shamay, Amir Ancel
From: Of Rick Jones
> >> Are you saying that at long last, delayed acks as we knew them are
> >> dead, dead, dead?
> >
> > Sorry, I can not parse what you are saying.
> >
> > In case you missed it, it has nothing to do with delayed ACK but GRO on
> > receiver.
>
> Dave - assuming I've interpreted Eric's comments correctly, I believe
> the answer to your question is No. Your desire for a world brimming
> with ack-every-other purity has not been fulfilled :)
>
> However, the engineers formerly at Mentat are probably pleased that a
> functional near-equivalent to their ACK avoidance heuristic has ended-up
> being implemented and tacitly accepted, albeit by the back door :)
I must recheck something I discovered a while back with more recent kernels.
There has been a bad interaction between 'slow start' and 'delayed acks'
when nagle is disabled on 0 RTT local links with uni-directional traffic.
'Slow start' would refuse to send more than 4 messages until it received
an ack (rather than 4 mss of data).
The receiving system wouldn't send an ack until the timer expired
(or several mss of data were received) by which time the sender could have
a lot of data queued.
Due to the 0 RTT and bursty nature of the data 'slow start' happened
all the time.
David
* Re: BW regression after "tcp: refine TSO autosizing"
2015-01-21 12:26 ` David Laight
@ 2015-01-21 17:01 ` Eric Dumazet
0 siblings, 0 replies; 16+ messages in thread
From: Eric Dumazet @ 2015-01-21 17:01 UTC (permalink / raw)
To: David Laight
Cc: 'Rick Jones',
Dave Taht, Eyal Perry, Yuchung Cheng, Neal Cardwell, Eyal Perry,
Or Gerlitz, Linux Netdev List, Amir Vadai, Yevgeny Petrilin,
Saeed Mahameed, Ido Shamay, Amir Ancel
On Wed, 2015-01-21 at 12:26 +0000, David Laight wrote:
> From: Of Rick Jones
> > >> Are you saying that at long last, delayed acks as we knew them are
> > >> dead, dead, dead?
> > >
> > > Sorry, I can not parse what you are saying.
> > >
> > > In case you missed it, it has nothing to do with delayed ACK but GRO on
> > > receiver.
> >
> > Dave - assuming I've interpreted Eric's comments correctly, I believe
> > the answer to your question is No. Your desire for a world brimming
> > with ack-every-other purity has not been fulfilled :)
> >
> > However, the engineers formerly at Mentat are probably pleased that a
> > functional near-equivalent to their ACK avoidance heuristic has ended-up
> > being implemented and tacitly accepted, albeit by the back door :)
>
> I must recheck something I discovered a while back with more recent kernels.
> There has been a bad interaction between 'slow start' and 'delayed acks'
> when nagle is disabled on 0 RTT local links with uni-directional traffic.
>
> 'Slow start' would refuse to send more than 4 messages until it received
> an ack (rather than 4 mss of data).
> The receiving system wouldn't send an ack until the timer expired
> (or several mss of data were received) by which time the sender could have
> a lot of data queued.
>
> Due to the 0 RTT and bursty nature of the data 'slow start' happened
> all the time.
The following packetdrill test suggests that the current kernel sends up to 10
messages without having to wait for any ACK
(IW10):
// Set up production and experiment configs
`../common/defaults.sh`
// Establish a connection.
0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
0.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
0.000 bind(3, ..., ...) = 0
0.000 listen(3, 1) = 0
0.100 < S 0:0(0) win 32792 <mss 1460,nop,wscale 7>
0.100 > S. 0:0(0) ack 1 <mss 1460,nop,wscale 6>
0.110 < . 1:1(0) ack 1 win 257
0.110 accept(3, ..., ...) = 4
0.200 %{ assert tcpi_snd_cwnd == 10 }%
+0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0
+0.01 write(4, ..., 100) = 100
+0 > P. 1:101(100) ack 1
+0.01 write(4, ..., 100) = 100
+0 > P. 101:201(100) ack 1
+0.01 write(4, ..., 100) = 100
+0 > P. 201:301(100) ack 1
+0.01 write(4, ..., 100) = 100
+0 > P. 301:401(100) ack 1
+0.01 write(4, ..., 100) = 100
+0 > P. 401:501(100) ack 1
+0.01 write(4, ..., 100) = 100
+0 > P. 501:601(100) ack 1
+0.01 write(4, ..., 100) = 100
+0 > P. 601:701(100) ack 1
+0.01 write(4, ..., 100) = 100
+0 > P. 701:801(100) ack 1
+0.01 write(4, ..., 100) = 100
+0 > P. 801:901(100) ack 1
+0.01 write(4, ..., 100) = 100
+0 > P. 901:1001(100) ack 1
end of thread, other threads:[~2015-01-21 17:02 UTC | newest]
Thread overview: 16+ messages
2015-01-13 16:48 BW regression after "tcp: refine TSO autosizing" Eyal Perry
2015-01-13 18:57 ` Eric Dumazet
2015-01-13 20:21 ` Or Gerlitz
2015-01-13 21:41 ` Eyal Perry
2015-01-13 22:00 ` Eric Dumazet
2015-01-18 16:22 ` Eyal Perry
2015-01-18 17:48 ` Eric Dumazet
2015-01-18 21:40 ` Eyal Perry
2015-01-20 2:16 ` Eric Dumazet
2015-01-20 2:37 ` Dave Taht
2015-01-20 3:14 ` Eric Dumazet
2015-01-20 19:14 ` Rick Jones
2015-01-20 19:26 ` Eric Dumazet
2015-01-20 19:44 ` Rick Jones
2015-01-21 12:26 ` David Laight
2015-01-21 17:01 ` Eric Dumazet