* TCP fast retransmit
@ 2011-11-25 13:33 Esztermann, Ansgar
  2011-11-25 16:36 ` Eric Dumazet
                   ` (2 more replies)
  0 siblings, 3 replies; 23+ messages in thread
From: Esztermann, Ansgar @ 2011-11-25 13:33 UTC (permalink / raw)
  To: netdev

[originally posted to lkml]
Hello list,

Is there some documentation available on TCP fast retransmit? There seem to be quite a lot of descriptions -- from informal to scholarly papers -- on the various algorithms available to calculate the proper size of the congestion window, but I have so far been unable to find out *when* a fast retransmit is triggered. RFC 2581 states that the third dupACK "should" do it, and this seems to be quoted fairly often. However, I can easily produce connections that fail to perform a fast retransmit even after 5 dupACKs. Some people mention that Linux uses a different (presumably more sophisticated) algorithm to trigger fast retransmits, but no one seems to elaborate.


Thanks,

A.
-- 
Ansgar Esztermann
DV-Systemadministration
Max-Planck-Institut für biophysikalische Chemie, Abteilung 105

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: TCP fast retransmit
  2011-11-25 13:33 TCP fast retransmit Esztermann, Ansgar
@ 2011-11-25 16:36 ` Eric Dumazet
  2011-11-25 16:39   ` Eric Dumazet
                     ` (2 more replies)
  2011-11-25 16:57 ` Ilpo Järvinen
  2011-11-28 21:17 ` Yuchung Cheng
  2 siblings, 3 replies; 23+ messages in thread
From: Eric Dumazet @ 2011-11-25 16:36 UTC (permalink / raw)
  To: Esztermann, Ansgar; +Cc: netdev

On Friday, 25 November 2011 at 14:33 +0100, Esztermann, Ansgar wrote:
> [originally posted to lkml]
> Hello list,
> 
> is there some documentation available on TCP fast retransmit? There
> seem to be quite a lot of descriptions -- from informal to scholarly
> papers -- on the various algorithms available to calculate the proper
> size of the congestion window, but I have been unable so far to find
> out *when* a fast retransmit is triggered. RFC 2581 states the third
> dupACK "should" do it, and this seems to be quoted fairly often.
> However, I can easily produce connections that fail to perform fast
> retransmit even after 5 dupACKs. Some people mention Linux uses a
> different (presumable more sophisticated) algorithm to trigger fast
> retransmits, but no-one seems to elaborate.

Could you send a sample pcap of such a problem? Please include the full
TCP session, from the first SYN packet up to the packets following the
retransmits.

A diff of "netstat -s" taken before your session and after your session
on the receiver would help too, provided the receiver is not a heavily
loaded machine, of course.

Also, which version of the Linux kernel are you using on the receiver?
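
The before/after comparison of "netstat -s" asked for above can be
scripted. A minimal sketch in Python, assuming two raw snapshots saved as
text. The helper names parse_netstat and counter_deltas are made up for
illustration, and the parsing is only a guess at the usual layout (a
section header such as "Tcp:" followed by indented counter lines):

```python
import re

# First standalone integer on a line; "OutType3: 117" yields 117, not 3.
NUM_RE = re.compile(r"\b\d+\b")


def parse_netstat(text):
    """Map (section, counter description) -> value for one snapshot."""
    counters = {}
    section = ""
    for line in text.splitlines():
        if not line.strip():
            continue
        # Section headers are unindented and end with a colon ("Tcp:").
        if not line[0].isspace() and line.rstrip().endswith(":"):
            section = line.strip().rstrip(":")
            continue
        match = NUM_RE.search(line)
        if match is None:
            continue
        # Replace the number with "N" so the remaining text names the counter.
        key = (line[:match.start()] + "N" + line[match.end():]).strip()
        counters[(section, key)] = int(match.group())
    return counters


def counter_deltas(before_text, after_text):
    """Return only the counters whose value changed between the snapshots."""
    before = parse_netstat(before_text)
    after = parse_netstat(after_text)
    return {k: after[k] - before[k]
            for k in after if k in before and after[k] != before[k]}
```

The snapshots themselves would come from running "netstat -s" once before
and once after the session. The counters are system-wide, so other traffic
on the machine shows up in the deltas as well.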


* Re: TCP fast retransmit
  2011-11-25 16:36 ` Eric Dumazet
@ 2011-11-25 16:39   ` Eric Dumazet
  2011-11-29  9:00   ` Esztermann, Ansgar
  2011-12-09 13:34   ` Esztermann, Ansgar
  2 siblings, 0 replies; 23+ messages in thread
From: Eric Dumazet @ 2011-11-25 16:39 UTC (permalink / raw)
  To: Esztermann, Ansgar; +Cc: netdev

On Friday, 25 November 2011 at 17:36 +0100, Eric Dumazet wrote:

> Could you send a sample pcap of such problem, but please include full
> tcp sesssion, from the first SYN packet, up to packets following
> restransmits.
> 
> A diff of "netstat -s" taken before your session and after your session
> on receiver would help too, if receiver is not a loaded machine of
> course.
> 
> Also, what version of linux kernel are you using in receiver ?
> 
> 

Oh well, I meant the sender, not the receiver!


* Re: TCP fast retransmit
  2011-11-25 13:33 TCP fast retransmit Esztermann, Ansgar
  2011-11-25 16:36 ` Eric Dumazet
@ 2011-11-25 16:57 ` Ilpo Järvinen
  2011-11-28 21:17 ` Yuchung Cheng
  2 siblings, 0 replies; 23+ messages in thread
From: Ilpo Järvinen @ 2011-11-25 16:57 UTC (permalink / raw)
  To: Esztermann, Ansgar; +Cc: netdev

On Fri, 25 Nov 2011, Esztermann, Ansgar wrote:

> [originally posted to lkml]
> Hello list,
> 
> is there some documentation available on TCP fast retransmit? There seem 
> to be quite a lot of descriptions -- from informal to scholarly papers 
> -- on the various algorithms available to calculate the proper size of 
> the congestion window, but I have been unable so far to find out *when* 
> a fast retransmit is triggered. RFC 2581 states the third dupACK 
> "should" do it, and this seems to be quoted fairly often. However, I can 
> easily produce connections that fail to perform fast retransmit even 
> after 5 dupACKs. Some people mention Linux uses a different (presumable 
> more sophisticated) algorithm to trigger fast retransmits, but no-one 
> seems to elaborate.

With SACK, dupACKs are meaningless (just in case you are crafting them).
Instead, the SACK blocks matter... but how exactly depends on whether FACK
is in use or not... with FACK, holes (segments not reported by SACK) below
the highest SACKed segment also count.
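
A minimal sketch of the distinction above, as a toy model in Python rather
than the kernel's actual code. The function names, the index-based
representation of the window, and the ">= threshold" comparison are
assumptions made for illustration; threshold=3 only mirrors the
three-dupACK rule from RFC 2581 quoted in the question:

```python
def loss_evidence(sacked, mode):
    """Count the segments that argue for a loss at the head of the window.

    sacked: set of indices of SACKed segments, where index 0 is the first
            unacknowledged segment (hypothetical representation).
    mode:   "sack" counts only the SACKed segments themselves;
            "fack" also counts the holes below the highest SACKed segment.
    """
    if not sacked:
        return 0
    if mode == "sack":
        return len(sacked)
    if mode == "fack":
        # Everything from index 0 up to and including the highest SACKed one.
        return max(sacked) + 1
    raise ValueError(f"unknown mode: {mode!r}")


def should_fast_retransmit(sacked, mode, threshold=3):
    """Toy trigger: enough evidence has accumulated to retransmit early."""
    return loss_evidence(sacked, mode) >= threshold
```

With a single SACK block covering only the fourth outstanding segment
(sacked={3}), the "sack" count is 1 while the "fack" count is 4, so only
the FACK-style rule would trigger in this model.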

-- 
 i.


* Re: TCP fast retransmit
  2011-11-25 13:33 TCP fast retransmit Esztermann, Ansgar
  2011-11-25 16:36 ` Eric Dumazet
  2011-11-25 16:57 ` Ilpo Järvinen
@ 2011-11-28 21:17 ` Yuchung Cheng
  2011-11-29  9:00   ` Esztermann, Ansgar
  2 siblings, 1 reply; 23+ messages in thread
From: Yuchung Cheng @ 2011-11-28 21:17 UTC (permalink / raw)
  To: Esztermann, Ansgar; +Cc: netdev

Hi Ansgar,

On Fri, Nov 25, 2011 at 5:33 AM, Esztermann, Ansgar
<Ansgar.Esztermann@mpi-bpc.mpg.de> wrote:
> [originally posted to lkml]
> Hello list,
>
> is there some documentation available on TCP fast retransmit? There seem to be quite a lot of descriptions -- from informal to scholarly papers -- on the various algorithms available to calculate the proper size of the congestion window, but I have been unable so far to find out *when* a fast retransmit is triggered. RFC 2581 states the third dupACK "should" do it, and this seems to be quoted fairly often. However, I can easily produce connections that fail to perform fast retransmit even after 5 dupACKs. Some people mention Linux uses a different (presumable more sophisticated) algorithm to trigger fast retransmits, but no-one seems to elaborate.
>

This 2003 paper has the core algorithms for Linux loss recovery.
http://www.cs.helsinki.fi/research/iwtcp/papers/linuxtcp.pdf

We've also detailed the latest Linux fast retransmit algorithms in the PRR
paper (section 3.2).
http://research.google.com/pubs/pub37486.html

HTH


* Re: TCP fast retransmit
  2011-11-25 16:36 ` Eric Dumazet
  2011-11-25 16:39   ` Eric Dumazet
@ 2011-11-29  9:00   ` Esztermann, Ansgar
  2011-12-09 13:34   ` Esztermann, Ansgar
  2 siblings, 0 replies; 23+ messages in thread
From: Esztermann, Ansgar @ 2011-11-29  9:00 UTC (permalink / raw)
  To: netdev



On Nov 25, 2011, at 17:36 , Eric Dumazet wrote:

> Could you send a sample pcap of such problem, but please include full
> tcp sesssion, from the first SYN packet, up to packets following
> restransmits.

I will prepare one (these sessions often run for hours or days, as they're our backup runs). It may take a few days, as I'm on sick leave because of an accident.

> A diff of "netstat -s" taken before your session and after your session
> on receiver would help too, if receiver is not a loaded machine of
> course.
> 
> Also, what version of linux kernel are you using in receiver ?

2.6.37.6 with openSUSE patches in the sender, some version of AIX in the receiver. The latter seems to be critical: we've never encountered this problem with any other combination of OSs but AIX & Linux.


A.
-- 
Ansgar Esztermann
DV-Systemadministration
Max-Planck-Institut für biophysikalische Chemie, Abteilung 105



* Re: TCP fast retransmit
  2011-11-28 21:17 ` Yuchung Cheng
@ 2011-11-29  9:00   ` Esztermann, Ansgar
  0 siblings, 0 replies; 23+ messages in thread
From: Esztermann, Ansgar @ 2011-11-29  9:00 UTC (permalink / raw)
  To: netdev



On Nov 28, 2011, at 22:17 , Yuchung Cheng wrote:

> This 2003 paper has the core algorithms for Linux loss recovery.
> http://www.cs.helsinki.fi/research/iwtcp/papers/linuxtcp.pdf
> 
> We've also detailed latest Linux fast retransmit algorithms in the PRR
> paper (section 3.2).
> http://research.google.com/pubs/pub37486.html

Thank you very much, I will be sure to read them!


A.
-- 
Ansgar Esztermann
DV-Systemadministration
Max-Planck-Institut für biophysikalische Chemie, Abteilung 105



* Re: TCP fast retransmit
  2011-11-25 16:36 ` Eric Dumazet
  2011-11-25 16:39   ` Eric Dumazet
  2011-11-29  9:00   ` Esztermann, Ansgar
@ 2011-12-09 13:34   ` Esztermann, Ansgar
  2011-12-09 14:43     ` Eric Dumazet
  2 siblings, 1 reply; 23+ messages in thread
From: Esztermann, Ansgar @ 2011-12-09 13:34 UTC (permalink / raw)
  To: netdev

[-- Attachment #1: Type: text/plain, Size: 1120 bytes --]


On Nov 25, 2011, at 17:36 , Eric Dumazet wrote:

> Could you send a sample pcap of such problem, but please include full
> tcp sesssion, from the first SYN packet, up to packets following
> restransmits.

OK, I've got a dump now. It is rather large (>300MB), so it's probably not a good idea to send it to the list. Instead, you can find it here:
http://wwwuser.gwdg.de/~aeszter/tcpstream.pcap

The capture has been taken in the sender, 10.208.9.87, with a capture filter on the receiver's IP address. OS is:
% uname -a
Linux mwolf 2.6.37.6-0.9-default #1 SMP 2011-10-19 22:33:27 +0200 x86_64 x86_64 x86_64 GNU/Linux

The first "strange" retransmission is in frame 166859, following ACKs in frames 166849 .. 166858. 

If I can do anything to reduce the amount of data, I will of course do so.

> A diff of "netstat -s" taken before your session and after your session
> on receiver would help too, if receiver is not a loaded machine of
> course.

Attached.

Thanks a lot,

A.

-- 
Ansgar Esztermann
DV-Systemadministration
Max-Planck-Institut für biophysikalische Chemie, Abteilung 105

[-- Attachment #2: netstat.diff --]
[-- Type: application/octet-stream, Size: 4185 bytes --]

--- /tmp/before	2011-12-09 12:16:46.000000000 +0100
+++ /tmp/after	2011-12-09 12:24:33.000000000 +0100
@@ -1,10 +1,10 @@
 Ip:
-    1072587745 total packets received
+    1072859058 total packets received
     151 with invalid addresses
     0 forwarded
     0 incoming packets discarded
-    1072579855 incoming packets delivered
-    638884639 requests sent out
+    1072850362 incoming packets delivered
+    639060780 requests sent out
 Icmp:
     78 ICMP messages received
     1 input ICMP message failed.
@@ -24,21 +24,21 @@
         OutType3: 117
         OutType8: 287
 Tcp:
-    48789 active connections openings
-    26811 passive connection openings
+    48801 active connections openings
+    26814 passive connection openings
     132 failed connection attempts
     1153 connection resets received
-    20 connections established
-    1061726671 segments received
-    3164994399 segments send out
-    40400659 segments retransmited
+    23 connections established
+    1061994682 segments received
+    3165601316 segments send out
+    40407999 segments retransmited
     0 bad segments received.
     2272 resets sent
 Udp:
-    9868225 packets received
+    9870499 packets received
     44 packets to unknown port received.
     0 packet receive errors
-    444603 packets sent
+    444689 packets sent
     RcvbufErrors: 0
     SndbufErrors: 0
 UdpLite:
@@ -53,18 +53,18 @@
     13 packets pruned from receive queue because of socket buffer overrun
     33 ICMP packets dropped because they were out-of-window
     ArpFilter: 0
-    29164 TCP sockets finished time wait in fast timer
+    29177 TCP sockets finished time wait in fast timer
     43 packets rejects in established connections because of timestamp
-    287177 delayed acks sent
-    527 delayed acks further delayed because of locked socket
-    Quick ack mode was activated 15340 times
+    287884 delayed acks sent
+    528 delayed acks further delayed because of locked socket
+    Quick ack mode was activated 15590 times
     49278 packets directly queued to recvmsg prequeue.
-    1803084 packets directly received from backlog
+    1804188 packets directly received from backlog
     176281729 packets directly received from prequeue
-    121263452 packets header predicted
-    50350 packets header predicted and directly queued to user
-    TCPPureAcks: 287487005
-    TCPHPAcks: 627694332
+    121375107 packets header predicted
+    50351 packets header predicted and directly queued to user
+    TCPPureAcks: 287534908
+    TCPHPAcks: 627801067
     TCPRenoRecovery: 0
     TCPSackRecovery: 286
     TCPSACKReneging: 0
@@ -79,17 +79,17 @@
     TCPLoss: 104
     TCPLostRetransmit: 0
     TCPRenoFailures: 0
-    TCPSackFailures: 112431
-    TCPLossFailures: 77046
+    TCPSackFailures: 112458
+    TCPLossFailures: 77061
     TCPFastRetrans: 3654
     TCPForwardRetrans: 1
-    TCPSlowStartRetrans: 34571085
-    TCPTimeouts: 5586734
+    TCPSlowStartRetrans: 34577497
+    TCPTimeouts: 5587603
     TCPRenoRecoveryFail: 0
     TCPSackRecoveryFail: 64
     TCPSchedulerFailed: 0
     TCPRcvCollapsed: 1875
-    TCPDSACKOldSent: 12229
+    TCPDSACKOldSent: 12479
     TCPDSACKOfoSent: 428
     TCPDSACKRecv: 92820540
     TCPDSACKOfoRecv: 0
@@ -101,7 +101,7 @@
     TCPAbortOnLinger: 0
     TCPAbortFailed: 0
     TCPMemoryPressures: 0
-    TCPSACKDiscard: 20059268
+    TCPSACKDiscard: 20074900
     TCPDSACKIgnoredOld: 24637182
     TCPDSACKIgnoredNoUndo: 68178807
     TCPSpuriousRTOs: 25
@@ -118,13 +118,13 @@
 IpExt:
     InNoRoutes: 0
     InTruncatedPkts: 0
-    InMcastPkts: 9564962
-    OutMcastPkts: 130464
-    InBcastPkts: 983842
+    InMcastPkts: 9567193
+    OutMcastPkts: 130499
+    InBcastPkts: 984064
     OutBcastPkts: 0
-    InOctets: 718897124903
-    OutOctets: 4260817429624
-    InMcastOctets: 1697591676
-    OutMcastOctets: 19340504
-    InBcastOctets: 138610857
+    InOctets: 719247299280
+    OutOctets: 4261557593433
+    InMcastOctets: 1697945778
+    OutMcastOctets: 19344841
+    InBcastOctets: 138637486
     OutBcastOctets: 0
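
A short worked example on the numbers in this diff, as a hedged reading
rather than a diagnosis. The values are copied from the Tcp and TcpExt
lines above; the dictionary keys are shortened names chosen for this
sketch, and the counters are system-wide, not per connection:

```python
# Values taken from the "-" (before) and "+" (after) lines of the diff.
before = {
    "segments_sent": 3164994399,
    "segments_retransmitted": 40400659,
    "TCPFastRetrans": 3654,          # context line, identical in both files
    "TCPSlowStartRetrans": 34571085,
    "TCPTimeouts": 5586734,
}
after = {
    "segments_sent": 3165601316,
    "segments_retransmitted": 40407999,
    "TCPFastRetrans": 3654,
    "TCPSlowStartRetrans": 34577497,
    "TCPTimeouts": 5587603,
}

delta = {name: after[name] - before[name] for name in before}

# Share of all sent segments that were retransmissions during the interval.
retrans_share = delta["segments_retransmitted"] / delta["segments_sent"]
# Share of those retransmissions that were counted as slow-start retransmits.
slow_start_share = delta["TCPSlowStartRetrans"] / delta["segments_retransmitted"]

for name, value in delta.items():
    print(f"{name}: +{value}")
print(f"retransmitted share of sent segments: {retrans_share:.2%}")
print(f"slow-start share of retransmissions:  {slow_start_share:.2%}")
```

Over the roughly eight minutes between the two snapshots, TCPFastRetrans
does not move at all, while TCPTimeouts grows by 869 and
TCPSlowStartRetrans by 6412 out of 7340 retransmitted segments.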


* Re: TCP fast retransmit
  2011-12-09 13:34   ` Esztermann, Ansgar
@ 2011-12-09 14:43     ` Eric Dumazet
  2011-12-09 16:17       ` Esztermann, Ansgar
  2011-12-14 19:00       ` Yuchung Cheng
  0 siblings, 2 replies; 23+ messages in thread
From: Eric Dumazet @ 2011-12-09 14:43 UTC (permalink / raw)
  To: Esztermann, Ansgar; +Cc: netdev

On Friday, 9 December 2011 at 14:34 +0100, Esztermann, Ansgar wrote:
> On Nov 25, 2011, at 17:36 , Eric Dumazet wrote:
> 
> > Could you send a sample pcap of such problem, but please include full
> > tcp sesssion, from the first SYN packet, up to packets following
> > restransmits.
> 
> OK, I've got a dump now. It is rather large (>300MB), so it's probably not a good idea to send it to the list. Instead, you can find it here:
> http://wwwuser.gwdg.de/~aeszter/tcpstream.pcap
> 
> The capture has been taken in the sender, 10.208.9.87, with a capture filter on the receiver's IP address. OS is:
> % uname -a
> Linux mwolf 2.6.37.6-0.9-default #1 SMP 2011-10-19 22:33:27 +0200 x86_64 x86_64 x86_64 GNU/Linux
> 
> The first "strange" retransmission is in frame 166859, following ACKs in frames 166849 .. 166858. 
> 
> If I can do anything to reduce the amount of data, I will of course do so.
> 
> > A diff of "netstat -s" taken before your session and after your session
> > on receiver would help too, if receiver is not a loaded machine of
> > course.
> 
> Attached.
> 
> Thanks a lot,
> 
> A.
> 

It seems you have a lot of packet reordering.

Are you using multipath or some kind of channel bonding?


* Re: TCP fast retransmit
  2011-12-09 14:43     ` Eric Dumazet
@ 2011-12-09 16:17       ` Esztermann, Ansgar
  2011-12-09 16:31         ` Eric Dumazet
  2011-12-14 19:00       ` Yuchung Cheng
  1 sibling, 1 reply; 23+ messages in thread
From: Esztermann, Ansgar @ 2011-12-09 16:17 UTC (permalink / raw)
  To: netdev


On Dec 9, 2011, at 15:43 , Eric Dumazet wrote:

> It seems you have a lot of packet reorders.
> 
> Are you using multipath or some channel bonding ?

Not that I am aware of (i.e. not on our end). However, the connection will probably be routed through a firewall. I will have to check whether it is configured to avoid reordering.


Regards,

A.

-- 
Ansgar Esztermann
DV-Systemadministration
Max-Planck-Institut für biophysikalische Chemie, Abteilung 105


* Re: TCP fast retransmit
  2011-12-09 16:17       ` Esztermann, Ansgar
@ 2011-12-09 16:31         ` Eric Dumazet
  2011-12-13 14:05           ` Esztermann, Ansgar
  0 siblings, 1 reply; 23+ messages in thread
From: Eric Dumazet @ 2011-12-09 16:31 UTC (permalink / raw)
  To: Esztermann, Ansgar; +Cc: netdev

On Friday, 9 December 2011 at 17:17 +0100, Esztermann, Ansgar wrote:
> On Dec 9, 2011, at 15:43 , Eric Dumazet wrote:
> 
> > It seems you have a lot of packet reorders.
> > 
> > Are you using multipath or some channel bonding ?
> 
> Not that I am aware of (i.e. not on our end). However, the connection
> will probably routed through a firewall. I will have to check if it is
> configured to avoid reordering.
> 

Is it a Linux-based firewall?

I suspect this firewall terminates a tunnel?

If so, make sure network interrupts are handled by the same CPU.

(Tunneling means calling netif_rx(), which can be the reason for
out-of-order packets if the next hardware interrupt is delivered to
another CPU.)

We really should call netif_receive_skb() for the first tunnel level to
avoid this...


* Re: TCP fast retransmit
  2011-12-09 16:31         ` Eric Dumazet
@ 2011-12-13 14:05           ` Esztermann, Ansgar
  2011-12-13 14:31             ` Eric Dumazet
  0 siblings, 1 reply; 23+ messages in thread
From: Esztermann, Ansgar @ 2011-12-13 14:05 UTC (permalink / raw)
  To: netdev



On Dec 9, 2011, at 17:31 , Eric Dumazet wrote:

>>> It seems you have a lot of packet reorders.
>>> 
>>> Are you using multipath or some channel bonding ?
>> 
>> Not that I am aware of (i.e. not on our end). However, the connection
>> will probably routed through a firewall. I will have to check if it is
>> configured to avoid reordering.
>> 
> 
> Is it a linux based firewall ?

No, it's Cisco.


A.

-- 
Ansgar Esztermann
DV-Systemadministration
Max-Planck-Institut für biophysikalische Chemie, Abteilung 105



* Re: TCP fast retransmit
  2011-12-13 14:05           ` Esztermann, Ansgar
@ 2011-12-13 14:31             ` Eric Dumazet
  2011-12-13 14:59               ` Carsten Wolff
  0 siblings, 1 reply; 23+ messages in thread
From: Eric Dumazet @ 2011-12-13 14:31 UTC (permalink / raw)
  To: Esztermann, Ansgar; +Cc: netdev

On Tuesday, 13 December 2011 at 15:05 +0100, Esztermann, Ansgar wrote:
> On Dec 9, 2011, at 17:31 , Eric Dumazet wrote:
> 
> >>> It seems you have a lot of packet reorders.
> >>> 
> >>> Are you using multipath or some channel bonding ?
> >> 
> >> Not that I am aware of (i.e. not on our end). However, the connection
> >> will probably routed through a firewall. I will have to check if it is
> >> configured to avoid reordering.
> >> 
> > 
> > Is it a linux based firewall ?
> 
> No, it's Cisco.
> 

OK

You could increase /proc/sys/net/ipv4/tcp_reordering from 3 to 16.

(I'm not sure it will change anything, since it's kind of dynamic for a
TCP session.)


* Re: TCP fast retransmit
  2011-12-13 14:31             ` Eric Dumazet
@ 2011-12-13 14:59               ` Carsten Wolff
  0 siblings, 0 replies; 23+ messages in thread
From: Carsten Wolff @ 2011-12-13 14:59 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Esztermann, Ansgar, netdev

On Tuesday 13 December 2011, you wrote:
> On Tuesday, 13 December 2011 at 15:05 +0100, Esztermann, Ansgar wrote:
> > On Dec 9, 2011, at 17:31 , Eric Dumazet wrote:
> > >>> It seems you have a lot of packet reorders.
> > >>> 
> > >>> Are you using multipath or some channel bonding ?
> > >> 
> > >> Not that I am aware of (i.e. not on our end). However, the connection
> > >> will probably routed through a firewall. I will have to check if it is
> > >> configured to avoid reordering.
> > > 
> > > Is it a linux based firewall ?
> > 
> > No, it's Cisco.
> 
> OK
> 
> You could increase /proc/sys/net/ipv4/tcp_reordering from 3 to 16
> 
> (Not sure it will change anything, since its kinda dynamic for a tcp
> session)

It's dynamic, but the sysctl changes the dynamics a lot, because it's the
minimum value of the dupACK threshold. Be sure not to set this too high, or
you will risk a lot of RTOs, which should kill performance even more
thoroughly.
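
A minimal sketch of the point above, assuming Python on the Linux sender.
read_tcp_reordering reads the sysctl from the /proc path Eric mentioned;
dupack_threshold is a made-up toy model of "the sysctl is the minimum, the
connection can only raise it", not the kernel's actual bookkeeping:

```python
from pathlib import Path

TCP_REORDERING = "/proc/sys/net/ipv4/tcp_reordering"


def read_tcp_reordering(path=TCP_REORDERING):
    """Return the current value of the tcp_reordering sysctl as an int."""
    return int(Path(path).read_text().strip())


def dupack_threshold(sysctl_value, measured_reordering):
    """Toy model: the sysctl is a floor; detected reordering can only raise it."""
    return max(sysctl_value, measured_reordering)
```

In this model, raising the sysctl from 3 to 16 moves the floor to 16 for
every connection, even one that never sees reordering, which is where the
risk of falling back to RTOs comes from when fewer than 16 segments follow
a loss.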

Carsten


* Re: TCP fast retransmit
  2011-12-09 14:43     ` Eric Dumazet
  2011-12-09 16:17       ` Esztermann, Ansgar
@ 2011-12-14 19:00       ` Yuchung Cheng
  2011-12-14 22:31         ` Eric Dumazet
  1 sibling, 1 reply; 23+ messages in thread
From: Yuchung Cheng @ 2011-12-14 19:00 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Esztermann, Ansgar, netdev

On Fri, Dec 9, 2011 at 6:43 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Friday, 9 December 2011 at 14:34 +0100, Esztermann, Ansgar wrote:
>> On Nov 25, 2011, at 17:36 , Eric Dumazet wrote:
>>
>> > Could you send a sample pcap of such problem, but please include full
>> > tcp sesssion, from the first SYN packet, up to packets following
>> > restransmits.
>>
>> OK, I've got a dump now. It is rather large (>300MB), so it's probably not a good idea to send it to the list. Instead, you can find it here:
>> http://wwwuser.gwdg.de/~aeszter/tcpstream.pcap
>>
>> The capture has been taken in the sender, 10.208.9.87, with a capture filter on the receiver's IP address. OS is:
>> % uname -a
>> Linux mwolf 2.6.37.6-0.9-default #1 SMP 2011-10-19 22:33:27 +0200 x86_64 x86_64 x86_64 GNU/Linux
>>
>> The first "strange" retransmission is in frame 166859, following ACKs in frames 166849 .. 166858.
>>
>> If I can do anything to reduce the amount of data, I will of course do so.
>>
>> > A diff of "netstat -s" taken before your session and after your session
>> > on receiver would help too, if receiver is not a loaded machine of
>> > course.
>>
>> Attached.
>>
>> Thanks a lot,
>>
>> A.
>>
>
> It seems you have a lot of packet reorders.
>
I used tcptrace to check the time sequence and I am puzzled:
I see a lot of OOO packets too, but how can this happen in a sender-side
trace, unless the trace was taken close to but not exactly at the sender?
I would expect to see in-sequence packets, but lots of SACKs plus some
spurious retransmits.


* Re: TCP fast retransmit
  2011-12-14 19:00       ` Yuchung Cheng
@ 2011-12-14 22:31         ` Eric Dumazet
  2011-12-15  7:41           ` Carsten Wolff
  0 siblings, 1 reply; 23+ messages in thread
From: Eric Dumazet @ 2011-12-14 22:31 UTC (permalink / raw)
  To: Yuchung Cheng; +Cc: Esztermann, Ansgar, netdev

On Wednesday, 14 December 2011 at 11:00 -0800, Yuchung Cheng wrote:
> >
> I use tcptrace to check the time sequence and I am puzzled:
> I see a lot of OOO packets too but how can this happen at a sender-side trace?
> unless the trace is taken close to but not exactly at the sender.
> I expect on seeing in-sequence packets but a lots of SACKs plus some
> spurious retransmists.

I understood the trace was a receiver-side one (a Linux machine if I am
not mistaken, while the sender is AIX-powered).

(Looking at the timing of the ACKs, which come a few us after the
corresponding data packet arrives.)


* Re: TCP fast retransmit
  2011-12-14 22:31         ` Eric Dumazet
@ 2011-12-15  7:41           ` Carsten Wolff
  2011-12-15  8:24             ` Eric Dumazet
  0 siblings, 1 reply; 23+ messages in thread
From: Carsten Wolff @ 2011-12-15  7:41 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Yuchung Cheng, Esztermann, Ansgar, netdev

On Wednesday 14 December 2011, Eric Dumazet wrote:
> On Wednesday, 14 December 2011 at 11:00 -0800, Yuchung Cheng wrote:
> > I use tcptrace to check the time sequence and I am puzzled:
> > I see a lot of OOO packets too but how can this happen at a sender-side
> > trace? unless the trace is taken close to but not exactly at the sender.
> > I expect on seeing in-sequence packets but a lots of SACKs plus some
> > spurious retransmists.
> 
> I understood the trace was a receiver-side one (a linux machine if I am
> not mistaken, while the sender is AIX powered)
> 
> (Looking at timings of ACKS, coming a few us after corresponding data
> packet arrival)

Oh. Right. This also means that net.ipv4.tcp_reordering is only available at
the receiver (Linux), which doesn't help, because the reordering robustness
logic runs on the sender side. So don't even bother changing that sysctl.

Carsten


* Re: TCP fast retransmit
  2011-12-15  7:41           ` Carsten Wolff
@ 2011-12-15  8:24             ` Eric Dumazet
  2011-12-16 15:53               ` Esztermann, Ansgar
  0 siblings, 1 reply; 23+ messages in thread
From: Eric Dumazet @ 2011-12-15  8:24 UTC (permalink / raw)
  To: Carsten Wolff; +Cc: Yuchung Cheng, Esztermann, Ansgar, netdev

On Thursday, 15 December 2011 at 08:41 +0100, Carsten Wolff wrote:
> On Wednesday 14 December 2011, Eric Dumazet wrote:
> > On Wednesday, 14 December 2011 at 11:00 -0800, Yuchung Cheng wrote:
> > > I use tcptrace to check the time sequence and I am puzzled:
> > > I see a lot of OOO packets too but how can this happen at a sender-side
> > > trace? unless the trace is taken close to but not exactly at the sender.
> > > I expect on seeing in-sequence packets but a lots of SACKs plus some
> > > spurious retransmists.
> > 
> > I understood the trace was a receiver-side one (a linux machine if I am
> > not mistaken, while the sender is AIX powered)
> > 
> > (Looking at timings of ACKS, coming a few us after corresponding data
> > packet arrival)
> 
> Oh. Right. This also means, that net.ipv4.tcp_reordering is only available at 
> the receiver (Linux), which doesn't help, because the reordering robustness 
> stuff happens on sender-side. So don't even bother changing that sysctl.
> 

Oh well, reading Ansgar's mail, it seems it is the other way around:

Quote:
2.6.37.6 with openSUSE patches in the sender, some version of AIX in the
receiver. The latter seems to be critical: we've never encountered this
problem with any other combination of OSs but AIX & Linux.


I just don't understand how we can receive an ACK so fast (6 us after the
data packet being ACKed, and even 3 us a bit later). This does not seem
possible, even with 10Gb infrastructure. (A Cisco firewall was mentioned.)

12:18:20.732998 IP 134.76.98.13.1500 > 10.208.9.87.35337: Flags [.], seq 284400:287136, ack 555, win 65280, options [nop,nop,TS val 1327509818 ecr 627192022], length 2736
12:18:20.733004 IP 10.208.9.87.35337 > 134.76.98.13.1500: Flags [.], ack 287136, win 591, options [nop,nop,TS val 627192022 ecr 1327509818], length 0
12:18:20.733048 IP 134.76.98.13.1500 > 10.208.9.87.35337: Flags [.], seq 287136:293976, ack 555, win 65280, options [nop,nop,TS val 1327509818 ecr 627192022], length 6840
12:18:20.733073 IP 10.208.9.87.35337 > 134.76.98.13.1500: Flags [.], ack 293976, win 549, options [nop,nop,TS val 627192022 ecr 1327509818], length 0
12:18:20.733104 IP 134.76.98.13.1500 > 10.208.9.87.35337: Flags [.], seq 293976:298080, ack 555, win 65280, options [nop,nop,TS val 1327509818 ecr 627192022], length 4104
12:18:20.733120 IP 10.208.9.87.35337 > 134.76.98.13.1500: Flags [.], ack 298080, win 522, options [nop,nop,TS val 627192022 ecr 1327509818], length 0
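
The data-to-ACK gaps in the six lines above can be checked with a few
lines of Python. This is only a sketch: the helper names are invented for
illustration, and it assumes tcpdump's default "HH:MM:SS.ffffff" time
stamps with exactly six fractional digits:

```python
def to_microseconds(stamp):
    """Convert a tcpdump "HH:MM:SS.ffffff" time stamp to microseconds."""
    clock, fraction = stamp.split(".")
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    return ((hours * 60 + minutes) * 60 + seconds) * 1_000_000 + int(fraction)


def ack_delay_us(data_stamp, ack_stamp):
    """Microseconds between a data segment and the ACK that covers it."""
    return to_microseconds(ack_stamp) - to_microseconds(data_stamp)


# (data segment, covering ACK) time stamps copied from the lines above.
pairs = [
    ("12:18:20.732998", "12:18:20.733004"),
    ("12:18:20.733048", "12:18:20.733073"),
    ("12:18:20.733104", "12:18:20.733120"),
]
delays = [ack_delay_us(data, ack) for data, ack in pairs]
print(delays)
```

Gaps this small leave essentially no room for a network round trip, which
is the reason for suspecting that the capture point is the host generating
the ACKs.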

Here the next two packets we send are out of order.

12:18:20.733161 IP 134.76.98.13.1500 > 10.208.9.87.35337: Flags [.], seq 299448:300816, ack 555, win 65280, options [nop,nop,TS val 1327509818 ecr 627192022], length 1368
12:18:20.733164 IP 10.208.9.87.35337 > 134.76.98.13.1500: Flags [.], ack 298080, win 522, options [nop,nop,TS val 627192022 ecr 1327509818,nop,nop,sack 1 {299448:300816}], length 0
12:18:20.733166 IP 134.76.98.13.1500 > 10.208.9.87.35337: Flags [.], seq 298080:299448, ack 555, win 65280, options [nop,nop,TS val 1327509818 ecr 627192022], length 1368
12:18:20.733169 IP 134.76.98.13.1500 > 10.208.9.87.35337: Flags [.], seq 300816:302184, ack 555, win 65280, options [nop,nop,TS val 1327509818 ecr 627192022], length 1368
12:18:20.733171 IP 134.76.98.13.1500 > 10.208.9.87.35337: Flags [.], seq 303552:304920, ack 555, win 65280, options [nop,nop,TS val 1327509818 ecr 627192022], length 1368
12:18:20.733173 IP 10.208.9.87.35337 > 134.76.98.13.1500: Flags [.], ack 302184, win 490, options [nop,nop,TS val 627192022 ecr 1327509818,nop,nop,sack 1 {303552:304920}], length 0
12:18:20.733174 IP 134.76.98.13.1500 > 10.208.9.87.35337: Flags [.], seq 302184:303552, ack 555, win 65280, options [nop,nop,TS val 1327509818 ecr 627192022], length 1368
12:18:20.733177 IP 10.208.9.87.35337 > 134.76.98.13.1500: Flags [.], ack 304920, win 469, options [nop,nop,TS val 627192022 ecr 1327509818], length 0
12:18:20.733224 IP 134.76.98.13.1500 > 10.208.9.87.35337: Flags [.], seq 304920:310392, ack 555, win 65280, options [nop,nop,TS val 1327509818 ecr 627192022], length 5472
12:18:20.733228 IP 134.76.98.13.1500 > 10.208.9.87.35337: Flags [.], seq 311760:313128, ack 555, win 65280, options [nop,nop,TS val 1327509818 ecr 627192022], length 1368
12:18:20.733230 IP 10.208.9.87.35337 > 134.76.98.13.1500: Flags [.], ack 310392, win 427, options [nop,nop,TS val 627192022 ecr 1327509818,nop,nop,sack 1 {311760:313128}], length 0
12:18:20.733272 IP 134.76.98.13.1500 > 10.208.9.87.35337: Flags [.], seq 313128:315864, ack 555, win 65280, options [nop,nop,TS val 1327509818 ecr 627192022], length 2736
12:18:20.733276 IP 10.208.9.87.35337 > 134.76.98.13.1500: Flags [.], ack 310392, win 427, options [nop,nop,TS val 627192022 ecr 1327509818,nop,nop,sack 1 {311760:315864}], length 0
12:18:20.733326 IP 134.76.98.13.1500 > 10.208.9.87.35337: Flags [.], seq 315864:319968, ack 555, win 65280, options [nop,nop,TS val 1327509818 ecr 627192022], length 4104
12:18:20.733330 IP 10.208.9.87.35337 > 134.76.98.13.1500: Flags [.], ack 310392, win 427, options [nop,nop,TS val 627192022 ecr 1327509818,nop,nop,sack 1 {311760:319968}], length 0
12:18:20.733332 IP 134.76.98.13.1500 > 10.208.9.87.35337: Flags [.], seq 310392:311760, ack 555, win 65280, options [nop,nop,TS val 1327509818 ecr 627192022], length 1368
12:18:20.733333 IP 134.76.98.13.1500 > 10.208.9.87.35337: Flags [.], seq 321336:322704, ack 555, win 65280, options [nop,nop,TS val 1327509818 ecr 627192022], length 1368
12:18:20.733335 IP 10.208.9.87.35337 > 134.76.98.13.1500: Flags [.], ack 319968, win 353, options [nop,nop,TS val 627192022 ecr 1327509818,nop,nop,sack 1 {321336:322704}], length 0
12:18:20.733372 IP 134.76.98.13.1500 > 10.208.9.87.35337: Flags [.], seq 322704:324072, ack 555, win 65280, options [nop,nop,TS val 1327509818 ecr 627192022], length 1368
12:18:20.733375 IP 10.208.9.87.35337 > 134.76.98.13.1500: Flags [.], ack 319968, win 353, options [nop,nop,TS val 627192022 ecr 1327509818,nop,nop,sack 1 {321336:324072}], length 0
12:18:20.733377 IP 134.76.98.13.1500 > 10.208.9.87.35337: Flags [.], seq 319968:321336, ack 555, win 65280, options [nop,nop,TS val 1327509818 ecr 627192022], length 1368
12:18:20.733381 IP 10.208.9.87.35337 > 134.76.98.13.1500: Flags [.], ack 324072, win 327, options [nop,nop,TS val 627192022 ecr 1327509818], length 0


Really, my feeling is that this trace was taken on the receiver, and maybe
LRO/GRO is buggy?

Ansgar, please provide more details, such as the NIC you use (hardware,
driver version, and so on).
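
One way to check the aggregation idea against a trace like the one above is to look for captured segments that are larger than one MSS. A sender never puts more than one MSS into a single segment, so a larger capture means several wire packets were merged before the capture point saw them. The sketch below is only an illustration: the 1368-byte MSS is an assumption read off the smallest data segments in the trace, the function name is hypothetical, and the sample lines are shortened copies of trace lines with the TCP options removed.

```python
import re

# Assumption: the smallest data segments in the trace carry 1368 bytes,
# so that value is taken as the MSS of this connection.
ASSUMED_MSS = 1368

# tcpdump prints the TCP payload size at the end of each line.
LENGTH_RE = re.compile(r"length (\d+)$")

# Shortened copies of trace lines (TCP options removed for brevity).
SAMPLE = [
    "12:18:20.733224 IP 134.76.98.13.1500 > 10.208.9.87.35337: Flags [.], seq 304920:310392, ack 555, win 65280, length 5472",
    "12:18:20.733228 IP 134.76.98.13.1500 > 10.208.9.87.35337: Flags [.], seq 311760:313128, ack 555, win 65280, length 1368",
    "12:18:20.733230 IP 10.208.9.87.35337 > 134.76.98.13.1500: Flags [.], ack 310392, win 427, length 0",
    "12:18:20.733272 IP 134.76.98.13.1500 > 10.208.9.87.35337: Flags [.], seq 313128:315864, ack 555, win 65280, length 2736",
    "12:18:20.733326 IP 134.76.98.13.1500 > 10.208.9.87.35337: Flags [.], seq 315864:319968, ack 555, win 65280, length 4104",
]


def coalesced_segments(lines, mss=ASSUMED_MSS):
    """Return (length, full_segments, remainder) for each packet above one MSS."""
    found = []
    for line in lines:
        match = LENGTH_RE.search(line.strip())
        if match is None:
            continue  # not a tcpdump TCP line we understand
        length = int(match.group(1))
        if length > mss:
            full_segments, remainder = divmod(length, mss)
            found.append((length, full_segments, remainder))
    return found


for length, full_segments, remainder in coalesced_segments(SAMPLE):
    print(f"{length} bytes = {full_segments} x {ASSUMED_MSS} + {remainder}")
```

Lengths that are whole multiples of the MSS are consistent with GRO/LRO on a receiving host (or TSO/GSO on a sending host); they do not by themselves show that the offload is buggy.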


* Re: TCP fast retransmit
  2011-12-15  8:24             ` Eric Dumazet
@ 2011-12-16 15:53               ` Esztermann, Ansgar
  0 siblings, 0 replies; 23+ messages in thread
From: Esztermann, Ansgar @ 2011-12-16 15:53 UTC (permalink / raw)
  To: netdev


On Dec 15, 2011, at 9:24 , Eric Dumazet wrote:

> Really, my feeling is that this trace was taken on the receiver, and maybe
> LRO/GRO is buggy?

Oh dear. I'm sorry, my mistake: at the beginning of a backup session, the server transmits to the client, so this trace was indeed taken on the receiver. I should have waited until the client began pushing data to the server, as that is where we noticed the original problem of frequent RTOs.

I'll get a new trace extending all the way to the interesting part of the session...


Sorry for the confusion,

A.

-- 
Ansgar Esztermann
DV-Systemadministration
Max-Planck-Institut für biophysikalische Chemie, Abteilung 105


* Re: TCP fast retransmit
  2011-11-25 12:55   ` Esztermann, Ansgar
@ 2011-11-25 13:09     ` Eric Dumazet
  0 siblings, 0 replies; 23+ messages in thread
From: Eric Dumazet @ 2011-11-25 13:09 UTC (permalink / raw)
  To: Esztermann, Ansgar; +Cc: linux-kernel

On Friday, 25 November 2011 at 13:55 +0100, Esztermann, Ansgar wrote:
> On Nov 25, 2011, at 11:42 , Eric Dumazet wrote:
> 
> > 
> > Could you provide a trace showing what you believe is a violation of the
> > standards ?
> 
> "Violation" is probably a bit on the harsh side (after all, it says "should"), but here goes (wireshark/libpcap format):
> The trace has been collected on 10.208.9.87. After one ACK plus five duplicates, a retransmission is triggered -- but it takes more than 200 ms, so that would be an ordinary retransmission. The original trace is (much) longer, but I've cut it down to keep the mail small. If required, I can provide more.

Ah, I missed the fact that you sent your messages to
linux-kernel@vger.kernel.org.

Please start a new thread on netdev@vger.kernel.org to reach the network
guys.





* Re: TCP fast retransmit
  2011-11-25 10:42 ` Eric Dumazet
@ 2011-11-25 12:55   ` Esztermann, Ansgar
  2011-11-25 13:09     ` Eric Dumazet
  0 siblings, 1 reply; 23+ messages in thread
From: Esztermann, Ansgar @ 2011-11-25 12:55 UTC (permalink / raw)
  To: linux-kernel



On Nov 25, 2011, at 11:42 , Eric Dumazet wrote:

> 
> Could you provide a trace showing what you believe is a violation of the
> standards ?

"Violation" is probably a bit on the harsh side (after all, it says "should"), but here goes (wireshark/libpcap format):
The trace has been collected on 10.208.9.87. After one ACK plus five duplicates, a retransmission is triggered -- but it takes more than 200 ms, so that would be an ordinary retransmission. The original trace is (much) longer, but I've cut it down to keep the mail small. If required, I can provide more.
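
The reasoning above (enough dupACKs, but the retransmission only shows up after more than 200 ms) can be written down as a rough heuristic. This is a sketch under stated assumptions, not how the kernel decides anything: the `Ack` type, the function name and all timestamps are made up for illustration, 200 ms is Linux's minimum RTO, and the threshold of three is the RFC 2581 rule.

```python
from dataclasses import dataclass

LINUX_MIN_RTO_S = 0.2  # Linux's minimum retransmission timeout is 200 ms
DUPTHRESH = 3          # RFC 2581: the third duplicate ACK should trigger it


@dataclass
class Ack:
    time_s: float  # capture timestamp in seconds
    ack_no: int    # cumulative acknowledgment number


def classify_retransmit(acks, retransmit_time_s):
    """Guess whether a retransmission was a fast retransmit or a timeout.

    acks: ACKs captured before the retransmission, in capture order.
    Returns (dupack_count, delay_after_third_dupack_or_None, label).
    """
    if not acks:
        return 0, None, "timeout (RTO)"
    last = acks[-1].ack_no
    # Collect the trailing run of ACKs carrying the same number: the first
    # one is the original ACK, the rest are duplicates.
    run = []
    for ack in reversed(acks):
        if ack.ack_no != last:
            break
        run.append(ack)
    run.reverse()
    dupacks = len(run) - 1
    if dupacks < DUPTHRESH:
        return dupacks, None, "timeout (RTO)"
    trigger = run[DUPTHRESH]  # index 0 is the original, index 3 the third dup
    delay = retransmit_time_s - trigger.time_s
    label = "fast retransmit" if delay < LINUX_MIN_RTO_S else "timeout (RTO)"
    return dupacks, delay, label


# Hypothetical timings: one ACK plus five duplicates, 1 ms apart.
EXAMPLE_ACKS = [Ack(0.001 * i, 1000) for i in range(6)]
print(classify_retransmit(EXAMPLE_ACKS, 0.250))
```

With these invented numbers the retransmission arrives well over 200 ms after the third duplicate, so the heuristic labels it a timeout, which matches the reading given above.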


Thanks,

A.

-- 
Ansgar Esztermann
DV-Systemadministration
Max-Planck-Institut für biophysikalische Chemie, Abteilung 105

[-- Attachment #2: retransmit --]
[-- Type: application/octet-stream, Size: 18524 bytes --]


* Re: TCP fast retransmit
  2011-11-25  9:42 Esztermann, Ansgar
@ 2011-11-25 10:42 ` Eric Dumazet
  2011-11-25 12:55   ` Esztermann, Ansgar
  0 siblings, 1 reply; 23+ messages in thread
From: Eric Dumazet @ 2011-11-25 10:42 UTC (permalink / raw)
  To: Esztermann, Ansgar; +Cc: linux-kernel

On Friday, 25 November 2011 at 10:42 +0100, Esztermann, Ansgar wrote:
> Hello list,
> 
> is there some documentation available on TCP fast retransmit? There
> seem to be quite a lot of descriptions -- from informal to scholarly
> papers -- on the various algorithms available to calculate the proper
> size of the congestion window, but I have been unable so far to find
> out *when* a fast retransmit is triggered. RFC 2581 states the third
> dupACK "should" do it, and this seems to be quoted fairly often.
> However, I can easily produce connections that fail to perform fast
> retransmit even after 5 dupACKs. Some people mention Linux uses a
> different (presumably more sophisticated) algorithm to trigger fast
> retransmits, but no-one seems to elaborate.
> 
I believe the RFC you cited should be the basis for answering your question.

Could you provide a trace showing what you believe is a violation of the
standards ?





* TCP fast retransmit
@ 2011-11-25  9:42 Esztermann, Ansgar
  2011-11-25 10:42 ` Eric Dumazet
  0 siblings, 1 reply; 23+ messages in thread
From: Esztermann, Ansgar @ 2011-11-25  9:42 UTC (permalink / raw)
  To: linux-kernel

Hello list,

is there some documentation available on TCP fast retransmit? There seem to be quite a lot of descriptions -- from informal to scholarly papers -- on the various algorithms available to calculate the proper size of the congestion window, but I have been unable so far to find out *when* a fast retransmit is triggered. RFC 2581 states the third dupACK "should" do it, and this seems to be quoted fairly often. However, I can easily produce connections that fail to perform fast retransmit even after 5 dupACKs. Some people mention Linux uses a different (presumably more sophisticated) algorithm to trigger fast retransmits, but no-one seems to elaborate.
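
For reference, the RFC 2581 rule mentioned above (three duplicate ACKs, that is, four identical ACKs in a row) can be sketched as a small counter. This shows only the textbook rule; it is not the Linux implementation, which this sketch makes no attempt to model. The class name is hypothetical, the ACK numbers are example values, and sequence-number wraparound and SACK are ignored.

```python
DUPTHRESH = 3  # RFC 2581: fast retransmit on the third duplicate ACK


class DupAckCounter:
    """Textbook dupACK counting on the sender side (no SACK, no wraparound)."""

    def __init__(self, snd_una):
        self.snd_una = snd_una  # highest cumulative ACK seen so far
        self.dupacks = 0        # duplicates of snd_una seen in a row

    def on_ack(self, ack_no):
        """Return True only for the ACK that would trigger fast retransmit."""
        if ack_no > self.snd_una:
            # New data acknowledged: advance and reset the duplicate count.
            self.snd_una = ack_no
            self.dupacks = 0
            return False
        if ack_no == self.snd_una:
            self.dupacks += 1
            return self.dupacks == DUPTHRESH
        return False  # stale ACK below snd_una, ignored


counter = DupAckCounter(snd_una=304920)
# One ACK that advances the window, followed by four duplicates of it.
results = [counter.on_ack(310392) for _ in range(5)]
print(results)
```

Under this rule only the fourth identical ACK returns True; later duplicates do not trigger again until new data has been acknowledged.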


Thanks,

A.
-- 
Ansgar Esztermann
DV-Systemadministration
Max-Planck-Institut für biophysikalische Chemie, Abteilung 105


