* TCP fast retransmit issues @ 2017-07-26 11:07 Klavs Klavsen 2017-07-26 11:49 ` Eric Dumazet 0 siblings, 1 reply; 21+ messages in thread From: Klavs Klavsen @ 2017-07-26 11:07 UTC (permalink / raw) To: netdev

[-- Attachment #1: Type: text/plain, Size: 1591 bytes --]

Hi guys,

My colleagues and I have an annoying issue with our Linux desktops and the company's Junos VPN.

We connect with openconnect (some use the official Pulse client) - which then opens up a tun0 device - and traffic runs through that.

If we try to scp a file of ~100MB (e.g. linux-4.12.3.tar.xz :) - it typically stalls after sending 20-30%.. then starts again after some time, and typically dies before finishing. I've captured it with tcpdump (it's a large 77MB file - that's how far it got before it died :) -
http://blog.klavsen.info/fast-retransmit-problem-junos-linux

I've attached an image of wireshark - where the (AFAIK) interesting part starts.. where my client starts getting DUP ACKs.. but my Linux client does nothing :(

I've tried upgrading to the latest Ubuntu-mainline kernel build (4.12.3) and it changed nothing.

The problem goes away if I do:
sysctl -w net.ipv4.tcp_sack=0

I've tried specifically enabling net.ipv4.tcp_fack=1 - but that did not help.

This is not an issue on Mac OS X or Windows clients.

None of the Linux users here figured this could be a Linux kernel issue - but the evidence seems to suggest it, and all my googling and reading does not lead me to any other conclusion.

It may of course be that Junos has implemented the standard badly/wrongly - and Windows/Mac have worked around that?

I hope you can help me figure out what's going wrong.

-- 
Regards, Klavs Klavsen, GSEC - kl@vsen.dk - http://blog.klavsen.info - Tlf. 61281200

"Those who do not understand Unix are condemned to reinvent it, poorly." --Henry Spencer

[-- Attachment #2: fast-retransmit-not-happening.png --]
[-- Type: image/png, Size: 323591 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: TCP fast retransmit issues 2017-07-26 11:07 TCP fast retransmit issues Klavs Klavsen @ 2017-07-26 11:49 ` Eric Dumazet 2017-07-26 12:18 ` Klavs Klavsen 0 siblings, 1 reply; 21+ messages in thread From: Eric Dumazet @ 2017-07-26 11:49 UTC (permalink / raw) To: Klavs Klavsen; +Cc: netdev

On Wed, 2017-07-26 at 13:07 +0200, Klavs Klavsen wrote:
> Hi guys,
>
> Me and my colleagues have an annoying issue with our Linux desktops and
> the company's Junos VPN.
[CUT]

sack blocks returned by the remote peer are completely bogus.

Maybe a firewall is messing with them?

I suspect ACK packets might be simply dropped because of invalid SACK information.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: TCP fast retransmit issues 2017-07-26 11:49 ` Eric Dumazet @ 2017-07-26 12:18 ` Klavs Klavsen 2017-07-26 13:31 ` Eric Dumazet 0 siblings, 1 reply; 21+ messages in thread From: Klavs Klavsen @ 2017-07-26 12:18 UTC (permalink / raw) To: Eric Dumazet; +Cc: netdev

The 192.168.32.44 host is a CentOS 7 box.

Could you help me by elaborating on how to see why the "dup ack" (sack blocks) are bogus?

Thank you very much. I'll try to capture the same scp done on a Mac - and see if it also gets DUP ACKs - and how they look in comparison (since it works on Mac clients).

Eric Dumazet wrote on 2017-07-26 13:49:
> On Wed, 2017-07-26 at 13:07 +0200, Klavs Klavsen wrote:
[CUT]
>
> sack blocks returned by the remote peer are completely bogus.
>
> Maybe a firewall is messing with them ?
>
> I suspect ACK packets might be simply dropped because of invalid SACK
> information.

-- 
Regards, Klavs Klavsen, GSEC - kl@vsen.dk - http://blog.klavsen.info - Tlf. 61281200

"Those who do not understand Unix are condemned to reinvent it, poorly." --Henry Spencer

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: TCP fast retransmit issues 2017-07-26 12:18 ` Klavs Klavsen @ 2017-07-26 13:31 ` Eric Dumazet 2017-07-26 13:42 ` Willy Tarreau 2017-07-26 14:08 ` Klavs Klavsen 0 siblings, 2 replies; 21+ messages in thread From: Eric Dumazet @ 2017-07-26 13:31 UTC (permalink / raw) To: Klavs Klavsen; +Cc: netdev

On Wed, 2017-07-26 at 14:18 +0200, Klavs Klavsen wrote:
> the 192.168.32.44 is a Centos 7 box.

Could you grab a capture on this box, to see if the bogus packets are sent by it, or later mangled by a middlebox?

>
> Could you help me by elaborating on how to see why the "dup ack" (sack
> blocks) are bogus?

tcpdump -S ...

01:37:44.455158 IP 62.242.222.50.54004 > 192.168.32.44.22: Flags [.], seq 370328274:370329622, ack 4062374366, win 1382, options [nop,nop,TS val 769439585 ecr 510569974], length 1348
01:37:44.455159 IP 62.242.222.50.54004 > 192.168.32.44.22: Flags [.], seq 370329622:370330970, ack 4062374366, win 1382, options [nop,nop,TS val 769439585 ecr 510569974], length 1348
01:37:44.455160 IP 62.242.222.50.54004 > 192.168.32.44.22: Flags [.], seq 370330970:370332318, ack 4062374366, win 1382, options [nop,nop,TS val 769439585 ecr 510569974], length 1348
01:37:44.455160 IP 62.242.222.50.54004 > 192.168.32.44.22: Flags [.], seq 370332318:370333666, ack 4062374366, win 1382, options [nop,nop,TS val 769439585 ecr 510569974], length 1348
01:37:44.455163 IP 192.168.32.44.22 > 62.242.222.50.54004: Flags [.], ack 370166514, win 2730, options [nop,nop,TS val 510569975 ecr 769439578,nop,nop,sack 1 {3012336703:3012338051}], length 0

3012336703:3012338051 is clearly outside of the window. The receiver claims to have received bytes that were never sent.
01:37:44.455169 IP 192.168.32.44.22 > 62.242.222.50.54004: Flags [.], ack 370166514, win 2730, options [nop,nop,TS val 510569975 ecr 769439578,nop,nop,sack 1 {3012336703:3012339399}], length 0
01:37:44.455172 IP 192.168.32.44.22 > 62.242.222.50.54004: Flags [.], ack 370166514, win 2730, options [nop,nop,TS val 510569975 ecr 769439578,nop,nop,sack 1 {3012336703:3012340747}], length 0
01:37:44.455175 IP 192.168.32.44.22 > 62.242.222.50.54004: Flags [.], ack 370166514, win 2730, options [nop,nop,TS val 510569975 ecr 769439578,nop,nop,sack 1 {3012336703:3012342095}], length 0
01:37:44.455178 IP 192.168.32.44.22 > 62.242.222.50.54004: Flags [.], ack 370166514, win 2730, options [nop,nop,TS val 510569975 ecr 769439578,nop,nop,sack 1 {3012336703:3012343443}], length 0
01:37:44.455181 IP 192.168.32.44.22 > 62.242.222.50.54004: Flags [.], ack 370166514, win 2730, options [nop,nop,TS val 510569976 ecr 769439578,nop,nop,sack 1 {3012336703:3012344791}], length 0
01:37:44.455183 IP 192.168.32.44.22 > 62.242.222.50.54004: Flags [.], ack 370166514, win 2730, options [nop,nop,TS val 510569976 ecr 769439578,nop,nop,sack 1 {3012336703:3012346139}], length 0
01:37:44.455186 IP 192.168.32.44.22 > 62.242.222.50.54004: Flags [.], ack 370166514, win 2730, options [nop,nop,TS val 510569976 ecr 769439578,nop,nop,sack 1 {3012336703:3012347487}], length 0
01:37:44.455189 IP 192.168.32.44.22 > 62.242.222.50.54004: Flags [.], ack 370166514, win 2730, options [nop,nop,TS val 510569977 ecr 769439578,nop,nop,sack 1 {3012336703:3012348835}], length 0

>
> Thank you very much. I'll try to capture the same scp done on mac - and
> see if it also gets DUP ACK's - and how they look in comparison (since
> it works on Mac clients).
>
> Eric Dumazet wrote on 2017-07-26 13:49:
[CUT]

^ permalink raw reply	[flat|nested] 21+ messages in thread
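[Editor's note: the out-of-window check Eric applies to that SACK block can be sketched with TCP's wrap-safe 32-bit sequence arithmetic. This is a simplified illustration only, not the kernel's actual validation code (which, among other things, also has to accept D-SACK blocks below the cumulative ACK point):]

```python
MASK = 0xFFFFFFFF  # sequence numbers live in 32-bit modular space

def seq_before(a, b):
    """Wrap-safe 'a comes before b', like the kernel's before() macro:
    true iff the signed 32-bit difference (a - b) is negative."""
    return ((a - b) & MASK) >= 0x80000000

def sack_block_in_window(start, end, snd_una, snd_nxt):
    """Simplified validity check for a received SACK block: it must be
    non-empty and lie entirely within [snd_una, snd_nxt]."""
    if not seq_before(start, end):
        return False          # empty or reversed block
    if seq_before(start, snd_una):
        return False          # starts below data already cumulatively ACKed
    if seq_before(snd_nxt, end):
        return False          # ends beyond anything ever sent
    return True

# Numbers from the trace above: the receiver ACKs 370166514 but SACKs
# 3012336703:3012338051, billions of bytes away from the window.
print(sack_block_in_window(3012336703, 3012338051,
                           snd_una=370166514, snd_nxt=370333666))  # False
```

A plausible block just above the ACK point, e.g. `sack_block_in_window(370301000, 370302348, 370166514, 370333666)`, passes the same check.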
* Re: TCP fast retransmit issues 2017-07-26 13:31 ` Eric Dumazet @ 2017-07-26 13:42 ` Willy Tarreau 2017-07-26 14:32 ` Eric Dumazet 2017-07-28 6:53 ` Christoph Paasch 0 siblings, 2 replies; 21+ messages in thread From: Willy Tarreau @ 2017-07-26 13:42 UTC (permalink / raw) To: Eric Dumazet; +Cc: Klavs Klavsen, netdev

On Wed, Jul 26, 2017 at 06:31:21AM -0700, Eric Dumazet wrote:
> On Wed, 2017-07-26 at 14:18 +0200, Klavs Klavsen wrote:
> > the 192.168.32.44 is a Centos 7 box.
>
> Could you grab a capture on this box, to see if the bogus packets are
> sent by it, or later mangled by a middle box ?

Given the huge difference between the window and the ranges of the values in the SACK field, I'm pretty sure there's a firewall in the middle doing sequence number randomization that is not aware of SACK and does not translate those options. I've had to disable such broken features more than once in the field after similar observations! Probably the Mac doesn't advertise SACK support and so doesn't experience the problem.

Willy

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: TCP fast retransmit issues 2017-07-26 13:42 ` Willy Tarreau @ 2017-07-26 14:32 ` Eric Dumazet 2017-07-26 14:50 ` Willy Tarreau 2017-07-28 6:53 ` Christoph Paasch 0 siblings, 2 replies; 21+ messages in thread From: Eric Dumazet @ 2017-07-26 14:32 UTC (permalink / raw) To: Willy Tarreau; +Cc: Klavs Klavsen, netdev

On Wed, 2017-07-26 at 15:42 +0200, Willy Tarreau wrote:
> Given the huge difference between the window and the ranges of the
> values in the SACK field, I'm pretty sure there's a firewall doing
> some sequence numbers randomization in the middle, not aware of SACK
> and not converting these ones.
[CUT]

We need to check the RFC to see whether such invalid SACK blocks should be ignored (the DUP ACKs would then still be processed and trigger fast retransmit anyway), or strongly validated (as I suspect we currently do), leading to a total freeze.

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: TCP fast retransmit issues 2017-07-26 14:32 ` Eric Dumazet @ 2017-07-26 14:50 ` Willy Tarreau 2017-07-26 16:43 ` Neal Cardwell 0 siblings, 1 reply; 21+ messages in thread From: Willy Tarreau @ 2017-07-26 14:50 UTC (permalink / raw) To: Eric Dumazet; +Cc: Klavs Klavsen, netdev

On Wed, Jul 26, 2017 at 07:32:12AM -0700, Eric Dumazet wrote:
> We need to check RFC if such invalid SACK blocks should be ignored (DUP
> ACK would be processed and trigger fast retransmit anyway), or strongly
> validated (as I suspect we currently do), leading to a total freeze.

RFC 2883 section 4.3 talks about the interaction with PAWS and only suggests that, since sequence numbers can wrap, the sender should be aware that a reported segment may in fact relate to a value within the prior sequence number space before cycling - but it doesn't expect any side effect from this. So to me that more or less means "you should consider that some of these segments might be old and meaningless, and ignore them". But as you can see, the recommendation lacks a bit of strength, given that no issue was expected in such a situation.
Willy ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: TCP fast retransmit issues 2017-07-26 14:50 ` Willy Tarreau @ 2017-07-26 16:43 ` Neal Cardwell 2017-07-26 17:06 ` Neal Cardwell 0 siblings, 1 reply; 21+ messages in thread From: Neal Cardwell @ 2017-07-26 16:43 UTC (permalink / raw) To: Willy Tarreau Cc: Eric Dumazet, Klavs Klavsen, Netdev, Yuchung Cheng, Nandita Dukkipati

[-- Attachment #1: Type: text/plain, Size: 1412 bytes --]

On Wed, Jul 26, 2017 at 04:08:19PM +0200, Klavs Klavsen wrote:
> Grabbed on both ends.
>
> http://blog.klavsen.info/fast-retransmit-problem-junos-linux (updated to new
> dump - from client scp'ing)
> http://blog.klavsen.info/fast-retransmit-problem-junos-linux-receiving-side
> (receiving host)

Looking at some time-sequence plots of the sender trace (attached), and thinking about the Linux TCP sender code, it looks like there are at least two interesting things going on:

(1) Because the connection negotiated SACK, the Linux TCP sender does not get to its tcp_add_reno_sack() code to count dupacks and enter fast recovery on the 3rd dupack. The sender keeps waiting for specific packets to be SACKed that would signal that something has probably been lost. We could probably mitigate this by having the sender turn off SACK once it sees SACKed ranges that are completely invalid (way out of window). Then it would use the old non-SACK "recovery on 3rd dupack" path.

(2) It looks like there is a bug in the sender code where it seems to be repeatedly using a TLP timer firing 211ms after every ACK is received to transmit another TLP probe (a new packet in this case). Somehow these weird invalid SACKs seem to be triggering a code path that makes us think we can send another TLP, when we probably should be firing an RTO. That's my interpretation, anyway. I will try to reproduce this with packetdrill.
neal [-- Attachment #2: linux-tcp-fr-issues-2017-07-26-zoomed-out.png --] [-- Type: image/png, Size: 37196 bytes --] [-- Attachment #3: linux-tcp-fr-issues-2017-07-26-zoomed-in-1.png --] [-- Type: image/png, Size: 39867 bytes --] [-- Attachment #4: linux-tcp-fr-issues-2017-07-26-zoomed-in-2.png --] [-- Type: image/png, Size: 37358 bytes --] ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: TCP fast retransmit issues 2017-07-26 16:43 ` Neal Cardwell @ 2017-07-26 17:06 ` Neal Cardwell 2017-07-26 18:38 ` Neal Cardwell 0 siblings, 1 reply; 21+ messages in thread From: Neal Cardwell @ 2017-07-26 17:06 UTC (permalink / raw) To: Willy Tarreau Cc: Eric Dumazet, Klavs Klavsen, Netdev, Yuchung Cheng, Nandita Dukkipati

On Wed, Jul 26, 2017 at 12:43 PM, Neal Cardwell <ncardwell@google.com> wrote:
> (2) It looks like there is a bug in the sender code where it seems to
> be repeatedly using a TLP timer firing 211ms after every ACK is
> received to transmit another TLP probe (a new packet in this case).
[CUT]

Hmm. It looks like this might be a general issue, where any time we get an ACK that doesn't ACK/SACK anything new (whether because it's incoming data in a bi-directional flow, or a middlebox breaking the SACKs), then we schedule a TLP timer further out in time. Probably we should only push the TLP timer out if something is cumulatively ACKed.

But that's not a trivial thing to do, because by the time we are deciding whether to schedule another TLP, we have already canceled the previous TLP and reinstalled an RTO. Hmm.

neal

^ permalink raw reply	[flat|nested] 21+ messages in thread
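[Editor's note: the timer dance Neal describes can be illustrated with a toy model (plain Python, not kernel code; the 211ms probe delay is taken from his observation above). If every ACK reschedules the probe timer, a steady stream of ACKs that acknowledge nothing new keeps pushing the xmit timer - and with it real loss recovery - further out:]

```python
def xmit_timer_deadline(ack_events, tlp_delay=0.211, reschedule_always=True):
    """Track the TLP/xmit timer deadline across a stream of ACKs.

    ack_events: (time, data_acked) pairs in time order, with data sent at t=0.
    reschedule_always=True models the buggy behavior: every ACK pushes the
    timer out, even when it cumulatively ACKs nothing new.
    Returns the time at which the timer finally fires.
    """
    deadline = tlp_delay                 # armed when the data was sent
    for t, data_acked in ack_events:
        if t >= deadline:
            break                        # timer fired before this ACK arrived
        if reschedule_always or data_acked:
            deadline = t + tlp_delay     # push the probe timer out again
    return deadline

# Nine ACKs, 100ms apart, none making forward progress:
acks = [(0.1 * i, False) for i in range(1, 10)]
buggy = xmit_timer_deadline(acks, reschedule_always=True)   # keeps sliding
fixed = xmit_timer_deadline(acks, reschedule_always=False)  # fires at 0.211
```

With the "only reschedule on forward progress" rule the timer fires at 0.211s regardless of how many no-progress ACKs arrive, while the always-reschedule variant slides past 1.1s here and would slide forever given a continuous ACK stream.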
* Re: TCP fast retransmit issues 2017-07-26 17:06 ` Neal Cardwell @ 2017-07-26 18:38 ` Neal Cardwell 2017-07-26 19:02 ` Neal Cardwell 0 siblings, 1 reply; 21+ messages in thread From: Neal Cardwell @ 2017-07-26 18:38 UTC (permalink / raw) To: Willy Tarreau Cc: Eric Dumazet, Klavs Klavsen, Netdev, Yuchung Cheng, Nandita Dukkipati

[-- Attachment #1: Type: text/plain, Size: 1351 bytes --]

On Wed, Jul 26, 2017 at 1:06 PM, Neal Cardwell <ncardwell@google.com> wrote:
> On Wed, Jul 26, 2017 at 12:43 PM, Neal Cardwell <ncardwell@google.com> wrote:
>> (2) It looks like there is a bug in the sender code where it seems to
>> be repeatedly using a TLP timer firing 211ms after every ACK is
>> received to transmit another TLP probe (a new packet in this case).
[CUT]
>
> But that's not a trivial thing to do, because by the time we are
> deciding whether to schedule another TLP, we have already canceled the
> previous TLP and reinstalled an RTO. Hmm.

Yeah, it looks like I can reproduce this issue with (1) bad sacks causing repeated TLPs, and (2) TLP timers being pushed out to later times due to incoming data. Scripts are attached.

neal

[-- Attachment #2: tlp-bad-sacks.pkt --]
[-- Type: application/octet-stream, Size: 1665 bytes --]

// Test for TLP behavior when all SACKs that come back are invalid
// (e.g. because of a middlebox).
// (Oops... it seems invalid SACKs can cause us to send TLPs forever.)

// Set up production config.
`../common/defaults.sh`

// Establish a connection.
    0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
   +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
   +0 bind(3, ..., ...) = 0
   +0 listen(3, 1) = 0

   +0 < S 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 7>
   +0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 8>
 +.020 < . 1:1(0) ack 1 win 257
   +0 accept(3, ..., ...) = 4

// Send 10 MSS.
   +0 write(4, ..., 20000) = 20000
   +0 > . 1:10001(10000) ack 1

// First round:
// An ACK arrives with a bogus SACK.
 +.020 < . 1:1(0) ack 1 win 257 <sack 1000000001:1000001001,nop,nop>
// At 2*RTT, send a TLP loss probe that is a new packet.
 +.040 > . 10001:11001(1000) ack 1
   +0 %{ assert tcpi_snd_cwnd == 10, tcpi_snd_cwnd
         assert tcpi_unacked == 11, tcpi_unacked }%

// Second round (same as the first):
// An ACK arrives with another bogus SACK.
 +.020 < . 1:1(0) ack 1 win 257 <sack 1000000001:1000002001,nop,nop>
// At 2*RTT, send a TLP loss probe that is a new packet.
 +.040 > . 11001:12001(1000) ack 1
   +0 %{ assert tcpi_snd_cwnd == 10, tcpi_snd_cwnd
         assert tcpi_unacked == 12, tcpi_unacked }%

// Third round (same as the first):
// An ACK arrives with another bogus SACK.
 +.020 < . 1:1(0) ack 1 win 257 <sack 1000000001:1000003001,nop,nop>
// At 2*RTT, send a TLP loss probe that is a new packet.
 +.040 > . 12001:13001(1000) ack 1
   +0 %{ assert tcpi_snd_cwnd == 10, tcpi_snd_cwnd
         assert tcpi_unacked == 13, tcpi_unacked }%

[-- Attachment #3: tlp-bidirectional.pkt --]
[-- Type: application/octet-stream, Size: 1272 bytes --]

// Test for TLP behavior with bi-directional traffic (incoming data).
// Make sure that incoming data does not push back the TLP timer.
// (Oops... currently it does seem that incoming data pushes back
// the TLP timer.)

// Set up production config.
`../common/defaults.sh`

// Establish a connection.
    0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
   +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
   +0 bind(3, ..., ...) = 0
   +0 listen(3, 1) = 0

   +0 < S 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 7>
   +0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 8>
 +.020 < . 1:1(0) ack 1 win 257
   +0 accept(3, ..., ...) = 4

// Send 10 MSS.
   +0 write(4, ..., 20000) = 20000
   +0 > . 1:10001(10000) ack 1

// Incoming data arrives.
 +.010 < . 1:1001(1000) ack 1 win 257
   +0 > . 10001:10001(0) ack 1001
 +.010 < . 1001:2001(1000) ack 1 win 257
   +0 > . 10001:10001(0) ack 2001
 +.010 < . 2001:3001(1000) ack 1 win 257
   +0 > . 10001:10001(0) ack 3001
 +.010 < . 3001:4001(1000) ack 1 win 257
   +0 > . 10001:10001(0) ack 4001

// At 2*RTT after the last transmit, send a TLP loss probe
// that is a new packet.
 +.032 > . 10001:11001(1000) ack 4001
   +0 %{ assert tcpi_snd_cwnd == 10, tcpi_snd_cwnd
         assert tcpi_unacked == 11, tcpi_unacked }%

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: TCP fast retransmit issues 2017-07-26 18:38 ` Neal Cardwell @ 2017-07-26 19:02 ` Neal Cardwell 2017-07-28 22:54 ` Neal Cardwell 0 siblings, 1 reply; 21+ messages in thread From: Neal Cardwell @ 2017-07-26 19:02 UTC (permalink / raw) To: Willy Tarreau Cc: Eric Dumazet, Klavs Klavsen, Netdev, Yuchung Cheng, Nandita Dukkipati

On Wed, Jul 26, 2017 at 2:38 PM, Neal Cardwell <ncardwell@google.com> wrote:
> Yeah, it looks like I can reproduce this issue with (1) bad sacks
> causing repeated TLPs, and (2) TLPs timers being pushed out to later
> times due to incoming data. Scripts are attached.

I'm testing a fix of only scheduling a TLP if (flag & FLAG_DATA_ACKED) is true...

neal

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: TCP fast retransmit issues 2017-07-26 19:02 ` Neal Cardwell @ 2017-07-28 22:54 ` Neal Cardwell 2017-08-01 3:17 ` Neal Cardwell 0 siblings, 1 reply; 21+ messages in thread From: Neal Cardwell @ 2017-07-28 22:54 UTC (permalink / raw) To: Willy Tarreau Cc: Eric Dumazet, Klavs Klavsen, Netdev, Yuchung Cheng, Nandita Dukkipati

On Wed, Jul 26, 2017 at 3:02 PM, Neal Cardwell <ncardwell@google.com> wrote:
> On Wed, Jul 26, 2017 at 2:38 PM, Neal Cardwell <ncardwell@google.com> wrote:
>> Yeah, it looks like I can reproduce this issue with (1) bad sacks
>> causing repeated TLPs, and (2) TLPs timers being pushed out to later
>> times due to incoming data. Scripts are attached.
>
> I'm testing a fix of only scheduling a TLP if (flag & FLAG_DATA_ACKED)
> is true...

An update for the TLP aspect of this thread: our team has a proposed fix for this RTO/TLP reschedule issue that we have reviewed internally and tested with our packetdrill test suite, including some new tests. The basic approach in the fix is as follows:

a) only reschedule the xmit timer once per ACK

b) only reschedule the xmit timer if tcp_clean_rtx_queue() deems this is safe (a packet was cumulatively ACKed, or we got a SACK for a packet that was sent before the most recent retransmit of the write queue head)

After further review and testing we will post it. Hopefully next week.

thanks,
neal

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: TCP fast retransmit issues 2017-07-28 22:54 ` Neal Cardwell @ 2017-08-01 3:17 ` Neal Cardwell 0 siblings, 0 replies; 21+ messages in thread From: Neal Cardwell @ 2017-08-01 3:17 UTC (permalink / raw) To: Willy Tarreau Cc: Eric Dumazet, Klavs Klavsen, Netdev, Yuchung Cheng, Nandita Dukkipati

On Fri, Jul 28, 2017 at 6:54 PM, Neal Cardwell <ncardwell@google.com> wrote:
> An update for the TLP aspect of this thread: our team has a proposed
> fix for this RTO/TLP reschedule issue that we have reviewed internally
> and tested with our packetdrill test suite, including some new tests.
[CUT]
> After further review and testing we will post it. Hopefully next week.

The timer patches are upstream for review for the "net" branch:

https://patchwork.ozlabs.org/patch/796057/
https://patchwork.ozlabs.org/patch/796058/
https://patchwork.ozlabs.org/patch/796059/

Again, thank you for reporting this, and thanks for the packet trace!

neal

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: TCP fast retransmit issues 2017-07-26 14:32 ` Eric Dumazet 2017-07-26 14:50 ` Willy Tarreau @ 2017-07-28 6:53 ` Christoph Paasch 1 sibling, 0 replies; 21+ messages in thread From: Christoph Paasch @ 2017-07-28 6:53 UTC (permalink / raw) To: Eric Dumazet; +Cc: Willy Tarreau, Klavs Klavsen, netdev

Hello,

On Wed, Jul 26, 2017 at 7:32 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
[CUT]
> We need to check RFC if such invalid SACK blocks should be ignored (DUP
> ACK would be processed and trigger fast retransmit anyway), or strongly
> validated (as I suspect we currently do), leading to a total freeze.

This issue with sequence number randomizing middleboxes already came up quite some time ago (http://marc.info/?l=netfilter-devel&m=137691623129822&w=2). From what I remember, the RFC does not say that invalid SACK blocks should be strongly validated, so triggering the dup-ack retransmission seems fine.
I had some patches at the time that ignored invalid sack-blocks and allowed fast-retransmit to happen thanks to the duplicate acks:

https://patchwork.ozlabs.org/patch/268297/
https://patchwork.ozlabs.org/patch/268298/

Cheers,
Christoph

^ permalink raw reply	[flat|nested] 21+ messages in thread
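[Editor's note: the behavior Christoph's patches aim for - discard unusable SACK blocks but still let the segments count as duplicate ACKs - can be sketched as follows. This is illustrative pseudologic only; the names and structure are not taken from the patches or from the kernel:]

```python
DUPACK_THRESHOLD = 3  # the classic fast-retransmit trigger

def process_ack(state, ack, sack_blocks, block_is_valid):
    """Toy dupack counter: invalid SACK blocks are dropped, but an ACK
    that advances nothing still counts as a duplicate ACK."""
    valid_blocks = [b for b in sack_blocks if block_is_valid(b)]
    if ack == state["snd_una"] and not valid_blocks:
        state["dupacks"] += 1
        if state["dupacks"] >= DUPACK_THRESHOLD:
            state["fast_retransmit"] = True   # enter loss recovery
    elif ack != state["snd_una"]:
        state["snd_una"] = ack                # cumulative progress
        state["dupacks"] = 0

state = {"snd_una": 370166514, "dupacks": 0, "fast_retransmit": False}
never_valid = lambda b: False                 # models the mangled SACKs
for _ in range(3):
    process_ack(state, 370166514, [(3012336703, 3012338051)], never_valid)
# Three dupacks arm fast retransmit despite the unusable SACK data.
```

Under strict validation, by contrast, such ACKs may be discarded entirely, which matches the stall observed in this thread.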
* Re: TCP fast retransmit issues 2017-07-26 13:31 ` Eric Dumazet 2017-07-26 13:42 ` Willy Tarreau @ 2017-07-26 14:08 ` Klavs Klavsen 2017-07-26 14:18 ` Willy Tarreau 1 sibling, 1 reply; 21+ messages in thread From: Klavs Klavsen @ 2017-07-26 14:08 UTC (permalink / raw) To: Eric Dumazet; +Cc: netdev

Grabbed on both ends:

http://blog.klavsen.info/fast-retransmit-problem-junos-linux (updated to a new dump - from the client doing the scp)
http://blog.klavsen.info/fast-retransmit-problem-junos-linux-receiving-side (receiving host)

Eric Dumazet wrote on 2017-07-26 15:31:
> On Wed, 2017-07-26 at 14:18 +0200, Klavs Klavsen wrote:
>> the 192.168.32.44 is a Centos 7 box.
>
> Could you grab a capture on this box, to see if the bogus packets are
> sent by it, or later mangled by a middle box ?
[CUT]
> 3012336703:3012338051 is clearly outside of the window.
> Receiver claims to have received bytes that were never sent.

Thank you very much.

[CUT]

-- 
Regards, Klavs Klavsen, GSEC - kl@vsen.dk - http://blog.klavsen.info - Tlf. 61281200

"Those who do not understand Unix are condemned to reinvent it, poorly." --Henry Spencer

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: TCP fast retransmit issues 2017-07-26 14:08 ` Klavs Klavsen @ 2017-07-26 14:18 ` Willy Tarreau 2017-07-26 14:25 ` Klavs Klavsen 0 siblings, 1 reply; 21+ messages in thread From: Willy Tarreau @ 2017-07-26 14:18 UTC (permalink / raw) To: Klavs Klavsen; +Cc: Eric Dumazet, netdev On Wed, Jul 26, 2017 at 04:08:19PM +0200, Klavs Klavsen wrote: > Grabbed on both ends. > > http://blog.klavsen.info/fast-retransmit-problem-junos-linux (updated to new > dump - from client scp'ing) > http://blog.klavsen.info/fast-retransmit-problem-junos-linux-receiving-side > (receiving host) So bingo, Eric guessed right, the client's sequence numbers are translated on their way to/from the server, but the SACK fields are not : Server : 15:59:54.292867 IP (tos 0x8, ttl 64, id 15878, offset 0, flags [DF], proto TCP (6), length 64) 192.168.32.44.22 > 62.242.222.50.35002: Flags [.], cksum 0xfe2b (incorrect -> 0xce0e), seq 1568063538, ack 3903858556, win 10965, options [nop,nop,TS val 529899820 ecr 774272020,nop,nop,sack 1 {3903859904:3903861252}], length 0 Client : 15:59:54.297388 IP (tos 0x8, ttl 56, id 15878, offset 0, flags [DF], proto TCP (6), length 64) 192.168.32.44.22 > 62.242.222.50.35002: Flags [.], cksum 0xbb2c (correct), seq 1568063538, ack 2684453645, win 10965, options [nop,nop,TS val 529899820 ecr 774272020,nop,nop,sack 1 {3903859904:3903861252}], length 0 So there's very likely a broken firewall in the middle that is waiting for a bug fix, or to have its feature disabled. Sometimes it can also happen on firewalls performing some SYN proxying except that it would mangle the server's sequence numbers instead of the client ones. Willy ^ permalink raw reply [flat|nested] 21+ messages in thread
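The mismatch Willy points out can be made concrete with a little arithmetic on the two captures (the numbers are copied from the tcpdump output above; reading the difference as a constant randomization offset is an inference from them, not something the firewall reports):

```python
# ACK field of the same packet, before and after the middlebox rewrite
server_ack = 3903858556   # as emitted by the server (server-side capture)
client_ack = 2684453645   # as received by the client (client-side capture)

# The SACK option is byte-for-byte identical in both captures
sack_start, sack_end = 3903859904, 3903861252

# The rewrite applied a constant shift to the ACK field...
offset = server_ack - client_ack
print(offset)                       # 1219404911

# In the server's sequence space the block is perfectly sensible:
# one 1348-byte segment was lost, the next one was received.
print(sack_start - server_ack)      # 1348  (hole right after the cum. ACK)
print(sack_end - sack_start)        # 1348  (one MSS-sized segment SACKed)

# ...but the SACK option was not translated, so in the client's sequence
# space the block lands ~1.2 GB out of window:
print(sack_start - client_ack)      # 1219406259

# Had the firewall translated the option too, the client would have seen:
print(sack_start - offset, sack_end - offset)   # 2684454993 2684456341
```

This is exactly Eric's "completely bogus" observation in numeric form: the receiver appears to acknowledge bytes the client never sent.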
* Re: TCP fast retransmit issues 2017-07-26 14:18 ` Willy Tarreau @ 2017-07-26 14:25 ` Klavs Klavsen 2017-07-26 14:38 ` Willy Tarreau 0 siblings, 1 reply; 21+ messages in thread From: Klavs Klavsen @ 2017-07-26 14:25 UTC (permalink / raw) To: Willy Tarreau; +Cc: Eric Dumazet, netdev Thank you very much guys for your insight.. it's highly appreciated. Next up for me, is waiting till the network guys come back from summer vacation, and convince them to sniff on the devices in between to pinpoint the culprit :) Willy Tarreau skrev den 2017-07-26 16:18: > On Wed, Jul 26, 2017 at 04:08:19PM +0200, Klavs Klavsen wrote: >> Grabbed on both ends. >> >> http://blog.klavsen.info/fast-retransmit-problem-junos-linux (updated >> to new >> dump - from client scp'ing) >> http://blog.klavsen.info/fast-retransmit-problem-junos-linux-receiving-side >> (receiving host) > > So bingo, Eric guessed right, the client's sequence numbers are > translated > on their way to/from the server, but the SACK fields are not : > > Server : > 15:59:54.292867 IP (tos 0x8, ttl 64, id 15878, offset 0, flags [DF], > proto TCP (6), length 64) > 192.168.32.44.22 > 62.242.222.50.35002: Flags [.], cksum 0xfe2b > (incorrect -> 0xce0e), seq 1568063538, ack 3903858556, > win 10965, options [nop,nop,TS val 529899820 ecr > 774272020,nop,nop,sack 1 {3903859904:3903861252}], length 0 > > Client : > 15:59:54.297388 IP (tos 0x8, ttl 56, id 15878, offset 0, flags [DF], > proto TCP (6), length 64) > 192.168.32.44.22 > 62.242.222.50.35002: Flags [.], cksum 0xbb2c > (correct), seq 1568063538, ack 2684453645, > win 10965, options [nop,nop,TS val 529899820 ecr > 774272020,nop,nop,sack 1 {3903859904:3903861252}], length 0 > > So there's very likely a broken firewall in the middle that is waiting > for > a bug fix, or to have its feature disabled. Sometimes it can also > happen > on firewalls performing some SYN proxying except that it would mangle > the > server's sequence numbers instead of the client ones. 
> > Willy -- Regards, Klavs Klavsen, GSEC - kl@vsen.dk - http://blog.klavsen.info - Tlf. 61281200 "Those who do not understand Unix are condemned to reinvent it, poorly." --Henry Spencer ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: TCP fast retransmit issues 2017-07-26 14:25 ` Klavs Klavsen @ 2017-07-26 14:38 ` Willy Tarreau 2017-07-28 6:36 ` Klavs Klavsen 0 siblings, 1 reply; 21+ messages in thread From: Willy Tarreau @ 2017-07-26 14:38 UTC (permalink / raw) To: Klavs Klavsen; +Cc: Eric Dumazet, netdev On Wed, Jul 26, 2017 at 04:25:29PM +0200, Klavs Klavsen wrote: > Thank you very much guys for your insight.. it's highly appreciated. > > Next up for me, is waiting till the network guys come back from summer > vacation, and convince them to sniff on the devices in between to pinpoint > the culprit :) That said, Eric, I'm a bit surprised that it completely stalls. Shouldn't the sender end up retransmitting unacked segments after seeing a certain number of ACKs not making progress ? Or maybe this is disabled when SACKs are in use but it seems to me that once invalid SACKs are ignored we should ideally fall back to the normal way to deal with losses. Here the server ACKed 3903858556 for the first time at 15:59:54.292743 and repeated this one 850 times till 16:01:17.296407 but the client kept sending past this point probably due to a huge window, so this looks suboptimal to me. Willy ^ permalink raw reply [flat|nested] 21+ messages in thread
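The fallback Willy is describing is classic RFC 5681 fast retransmit: three duplicate ACKs in a row should trigger a retransmission of the missing segment even without usable SACK information. A toy model of just that mechanism (deliberately ignoring SACK scoreboarding, window updates, and congestion control, so it is only a sketch of the idea, not how tcp_input.c is structured):

```python
def on_ack(state, ack):
    """Count duplicate ACKs and schedule a fast retransmit on the third
    one, per RFC 5681. Real TCP only counts a segment as a duplicate ACK
    if it carries no data and no window update, among other conditions."""
    if ack == state["snd_una"]:             # no progress: duplicate ACK
        state["dupacks"] += 1
        if state["dupacks"] == 3:
            state["retransmit"] = state["snd_una"]  # resend the lost segment
    else:                                   # progress: reset the counter
        state["snd_una"] = ack
        state["dupacks"] = 0

state = {"snd_una": 3903858556, "dupacks": 0, "retransmit": None}
for ack in [3903858556] * 3:    # the server repeated this ACK 850 times
    on_ack(state, ack)
print(state["retransmit"])      # 3903858556
```

In the captured trace the duplicate ACKs did arrive, so a sender ignoring the bogus SACK blocks would have retransmitted after the third one instead of pushing the rest of the window and stalling.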
* Re: TCP fast retransmit issues 2017-07-26 14:38 ` Willy Tarreau @ 2017-07-28 6:36 ` Klavs Klavsen 2017-07-28 7:27 ` Willy Tarreau 0 siblings, 1 reply; 21+ messages in thread From: Klavs Klavsen @ 2017-07-28 6:36 UTC (permalink / raw) To: Willy Tarreau; +Cc: Eric Dumazet, netdev The network guys know what caused it. Apparently on (at least some) Cisco equipment the feature: TCP Sequence Number Randomization is enabled by default. It would most definitely be beneficial if Linux handled SACK "not working" better than it does - but then I might never have found the culprit who destroyed SACK :) Willy Tarreau wrote on 26-07-2017 16:38: > On Wed, Jul 26, 2017 at 04:25:29PM +0200, Klavs Klavsen wrote: >> Thank you very much guys for your insight.. it's highly appreciated. >> >> Next up for me, is waiting till the network guys come back from summer >> vacation, and convince them to sniff on the devices in between to pinpoint >> the culprit :) > That said, Eric, I'm a bit surprised that it completely stalls. Shouldn't > the sender end up retransmitting unacked segments after seeing a certain > number of ACKs not making progress ? Or maybe this is disabled when SACKs > are in use but it seems to me that once invalid SACKs are ignored we should > ideally fall back to the normal way to deal with losses. Here the server > ACKed 3903858556 for the first time at 15:59:54.292743 and repeated this > one 850 times till 16:01:17.296407 but the client kept sending past this > point probably due to a huge window, so this looks suboptimal to me. > > Willy -- Regards, Klavs Klavsen, GSEC - kl@vsen.dk - http://www.vsen.dk - Tlf. 61281200 "Those who do not understand Unix are condemned to reinvent it, poorly." --Henry Spencer ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: TCP fast retransmit issues 2017-07-28 6:36 ` Klavs Klavsen @ 2017-07-28 7:27 ` Willy Tarreau 2017-08-17 13:20 ` Jeremy Harris 0 siblings, 1 reply; 21+ messages in thread From: Willy Tarreau @ 2017-07-28 7:27 UTC (permalink / raw) To: Klavs Klavsen; +Cc: Eric Dumazet, netdev On Fri, Jul 28, 2017 at 08:36:49AM +0200, Klavs Klavsen wrote: > The network guys know what caused it. > > Apparently on (at least some) Cisco equipment the feature: > > TCP Sequence Number Randomization > > is enabled by default. I didn't want to suggest names but since you did it first ;-) Indeed it's mostly on the same device that I've been bothered a lot by their annoying randomization. I used to know the exact command to type to disable it by heart, but I don't anymore (something along the lines of "no randomization"). The other trouble it causes is retransmits of the first SYN when your source ports wrap too fast (i.e. when installed after a proxy). The SYNs reaching the other end find a session in TIME_WAIT, but the SYN sometimes lands in the previous window and leads to an ACK instead of a SYN-ACK, which the firewall blocks. This was easily worked around using timestamps on both sides thanks to PAWS. But disabling the broken feature is better. And no, "more secure" is not an excuse for "broken". > It would most definitely be beneficial if Linux handled SACK "not working" > better than it does - but then I might never have found the culprit who > destroyed SACK :) Yep ;-) Willy ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: TCP fast retransmit issues 2017-07-28 7:27 ` Willy Tarreau @ 2017-08-17 13:20 ` Jeremy Harris 0 siblings, 0 replies; 21+ messages in thread From: Jeremy Harris @ 2017-08-17 13:20 UTC (permalink / raw) Cc: netdev On 28/07/17 08:27, Willy Tarreau wrote: > I didn't want to suggest names but since you did it first ;-) Indeed it's > mostly on the same device that I've been bothered a lot by their annoying > randomization. I used to know by memory the exact command to type to disable > it, but I don't anymore (something along "no randomization"). https://supportforums.cisco.com/document/48551/single-tcp-flow-performance-firewall-services-module-fwsm#TCP_Sequence_Number_Randomization_and_SACK https://supportcenter.checkpoint.com/supportcenter/portal?eventSubmit_doGoviewsolutiondetails=&solutionid=sk74640 -- Cheers, Jeremy ^ permalink raw reply [flat|nested] 21+ messages in thread
end of thread, other threads:[~2017-08-17 13:21 UTC | newest] Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2017-07-26 11:07 TCP fast retransmit issues Klavs Klavsen 2017-07-26 11:49 ` Eric Dumazet 2017-07-26 12:18 ` Klavs Klavsen 2017-07-26 13:31 ` Eric Dumazet 2017-07-26 13:42 ` Willy Tarreau 2017-07-26 14:32 ` Eric Dumazet 2017-07-26 14:50 ` Willy Tarreau 2017-07-26 16:43 ` Neal Cardwell 2017-07-26 17:06 ` Neal Cardwell 2017-07-26 18:38 ` Neal Cardwell 2017-07-26 19:02 ` Neal Cardwell 2017-07-28 22:54 ` Neal Cardwell 2017-08-01 3:17 ` Neal Cardwell 2017-07-28 6:53 ` Christoph Paasch 2017-07-26 14:08 ` Klavs Klavsen 2017-07-26 14:18 ` Willy Tarreau 2017-07-26 14:25 ` Klavs Klavsen 2017-07-26 14:38 ` Willy Tarreau 2017-07-28 6:36 ` Klavs Klavsen 2017-07-28 7:27 ` Willy Tarreau 2017-08-17 13:20 ` Jeremy Harris