* TCP one-by-one acking - RFC interpretation question
@ 2018-04-06 10:05 Michal Kubecek
  2018-04-06 12:01 ` Eric Dumazet
  0 siblings, 1 reply; 7+ messages in thread
From: Michal Kubecek @ 2018-04-06 10:05 UTC (permalink / raw)
  To: netdev

Hello,

I encountered strange behaviour of a (non-Linux) TCP stack which
I believe is incorrect, but the support engineers of the company
producing it claim it is OK.

Assume a client (sender, Linux 4.4 kernel) sends a stream of MSS sized
segments but segments 2, 4 and 6 do not reach the server (receiver):

         ACK             SAK             SAK             SAK
      +-------+-------+-------+-------+-------+-------+-------+
      |   1   |   2   |   3   |   4   |   5   |   6   |   7   |
      +-------+-------+-------+-------+-------+-------+-------+
    34273   35701   37129   38557   39985   41413   42841   44269

When segment 2 is retransmitted after the RTO timeout, the normal
response would be an ACK up to the end of segment 3 (38557) with SACK
blocks for 5 and 7 (39985-41413 and 42841-44269).

However, this server stack responds with two separate ACKs:

  - ACK 37129, SACK 37129-38557 39985-41413 42841-44269
  - ACK 38557, SACK 39985-41413 42841-44269

There is no payload from the server and no window update, and it
happens even if no other packet is received by the server between those
two. The result: since segment 3 was never retransmitted, the second
ACK is interpreted by the 4.4 kernel as acking a newly arrived segment,
so the whole interval between the first transmission of segment 3 and
this second ACK is fed into the RTT estimator. Even worse, when the
same thing happens again for segment 5, both timeouts (for 2 and 4) are
counted into its RTT. As a result the RTO grows exponentially until it
reaches the maximum (120 seconds) and the connection is effectively
stalled.
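
To illustrate why such samples blow up the RTO, here is a toy
RFC 6298-style computation (a simplified sketch with made-up numbers,
not the kernel's estimator or units):

#include <stdio.h>

/* Toy RFC 6298 estimator, in milliseconds; illustrative only. */
static double srtt, rttvar, rto;

static void rtt_sample(double r)
{
        if (srtt == 0) {                /* first measurement */
                srtt = r;
                rttvar = r / 2;
        } else {
                double err = srtt > r ? srtt - r : r - srtt;

                rttvar = 0.75 * rttvar + 0.25 * err;
                srtt = 0.875 * srtt + 0.125 * r;
        }
        rto = srtt + 4 * rttvar;
        if (rto < 200)
                rto = 200;              /* Linux-like 200 ms floor */
        printf("sample %6.0f ms -> srtt %6.0f ms, rto %6.0f ms\n",
               r, srtt, rto);
}

int main(void)
{
        rtt_sample(10);         /* genuine path RTT */
        /* Bogus sample for segment 3: measured from its first
         * transmission to the second ACK, so it includes the RTO spent
         * waiting to retransmit segment 2. */
        rtt_sample(10 + rto);
        /* Same for segment 5, now including two (larger) RTOs. */
        rtt_sample(10 + 2 * rto);
        rtt_sample(10 + 2 * rto);
        return 0;
}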

In my opinion, server behaviour violates the last paragraph of RFC 5681,
section 4.2:

  A TCP receiver MUST NOT generate more than one ACK for every incoming
  segment, other than to update the offered window as the receiving
  application consumes new data (see [RFC813] and page 42 of [RFC793]).

The server vendor claims that their behaviour is correct, as the first
ACK is sent in response to segment 2 and the second ACK in response to
segment 3 (which had merely been delayed in the out-of-order queue).

Note that SACK doesn't really help here. The first SACK block in the
first ACK (37129-38557) is actually invalid, as it violates the "the
bytes just below the block ... have not been received" condition from
RFC 2018 section 3. Therefore the Linux 4.4 stack ignores this SACK
block, detects (spurious) SACK reneging and unmarks the "previously
sacked" flag of segment 3, so that when the second ACK arrives, there
is no trace of it having been sacked before. The vendor has already
admitted this SACK block is incorrect, but there is still disagreement
about the "one-by-one acking" behaviour in general.
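
For reference, the violated condition boils down to simple
sequence-number arithmetic; a minimal sketch (the helper names are
mine, not the kernel's):

#include <stdbool.h>
#include <stdint.h>

/* Standard TCP sequence-number comparison, modulo 2^32. */
static bool seq_after(uint32_t a, uint32_t b)
{
        return (int32_t)(a - b) > 0;
}

/*
 * RFC 2018, section 3: the byte just below a SACK block must not have
 * been received.  For the first block this means its left edge has to
 * lie strictly above the cumulative ACK; a block starting at or below
 * ack_seq (here 37129-38557 with ACK 37129) reports data the ACK
 * already covers and is either a D-SACK (RFC 2883) or simply invalid.
 */
static bool first_sack_block_valid(uint32_t ack_seq,
                                   uint32_t start_seq, uint32_t end_seq)
{
        return seq_after(start_seq, ack_seq) && seq_after(end_seq, start_seq);
}

/* first_sack_block_valid(37129, 37129, 38557) -> false */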

My question is: is my interpretation correct? If so, is there an even
less ambiguous statement somewhere that a receiver is supposed to send
one ACK for "everything it got so far" rather than acking the segments
one by one? While reading the RFCs I always considered this obvious,
but apparently some people may think otherwise.

Thanks in advance,
Michal Kubecek


* Re: TCP one-by-one acking - RFC interpretation question
  2018-04-06 10:05 TCP one-by-one acking - RFC interpretation question Michal Kubecek
@ 2018-04-06 12:01 ` Eric Dumazet
  2018-04-06 15:03   ` Michal Kubecek
  0 siblings, 1 reply; 7+ messages in thread
From: Eric Dumazet @ 2018-04-06 12:01 UTC (permalink / raw)
  To: Michal Kubecek, netdev



On 04/06/2018 03:05 AM, Michal Kubecek wrote:
> Hello,
> 
> I encountered a strange behaviour of some (non-linux) TCP stack which
> I believe is incorrect but support engineers from the company producing
> it claim is OK.
> 
> Assume a client (sender, Linux 4.4 kernel) sends a stream of MSS sized
> segments but segments 2, 4 and 6 do not reach the server (receiver):
> 
>          ACK             SAK             SAK             SAK
>       +-------+-------+-------+-------+-------+-------+-------+
>       |   1   |   2   |   3   |   4   |   5   |   6   |   7   |
>       +-------+-------+-------+-------+-------+-------+-------+
>     34273   35701   37129   38557   39985   41413   42841   44269
> 
> When segment 2 is retransmitted after RTO timeout, normal response would
> be ACK-ing segment 3 (38557) with SACK for 5 and 7 (39985-41413 and
> 42841-44269).
> 
> However, this server stack responds with two separate ACKs:
> 
>   - ACK 37129, SACK 37129-38557 39985-41413 42841-44269
>   - ACK 38557, SACK 39985-41413 42841-44269

Hmmm... Yes this seems very very wrong and lazy.

Have you verified the behaviour of more recent Linux kernels against such threats?

A packetdrill test would be relatively easy to write.

Regardless of this broken alien stack, we might be able to work around this faster
than the vendor is able to fix and deploy a new stack.

( https://en.wikipedia.org/wiki/Robustness_principle )
Be conservative in what you do, be liberal in what you accept from others...



> 
> There is no payload from server, no window update and it happens even if
> there is no other packet received by server between those two. The
> result is that as segment 3 was never retransmitted, second ACK is
> interpreted as acking a newly arrived segment by 4.4 kernel so that the
> whole interval between first transmission of segment 3 and this second
> ACK is used for RTT estimator; even worse, when the same happens again
> for segment 5, both timeouts (for 2 and 4) are counted into its RTT.
> The result is RTO growing exponentially until it reaches the maximum
> (120 seconds) and the connection is effectively stalled.
> 
> In my opinion, server behaviour violates the last paragraph of RFC 5681,
> section 4.2:
> 
>   A TCP receiver MUST NOT generate more than one ACK for every incoming
>   segment, other than to update the offered window as the receiving
>   application consumes new data (see [RFC813] and page 42 of [RFC793]).
> 
> Server vendor claims that their behaviour is correct as first ACK is
> sent in response to segment 2 and second ACK in response to segment 3
> (which has just been delayed in the out of order queue).
> 
> Note that SACK doesn't really help here. First SACK block in first ACK
> (37129-38557) is actually invalid as it violates the "the bytes just
> below the block ... have not been received" condition from RFC 2018
> section 3. Therefore Linux 4.4 stack ignores this SACK block, detects
> (spurious) SACK reneging and unmarks the "previously sacked" flag of
> segment 3 so that when second ACK arrives, there is no trace of it
> having been sacked before. They already admitted this SACK block is
> incorrect but there is still disagreement about the "one-by-one acking"
> behaviour in general.
> 
> My question is: is my interpretation correct? If so, is there an even
> less ambiguous statement somewhere that receiver is supposed to send one
> ACK for "everything they got so far" rather than acking the segments one
> by one? While reading the RFCs, I always considered this obvious but
> apparently some people may think otherwise.
> 
> Thanks in advance,
> Michal Kubecek
> 


* Re: TCP one-by-one acking - RFC interpretation question
  2018-04-06 12:01 ` Eric Dumazet
@ 2018-04-06 15:03   ` Michal Kubecek
  2018-04-06 16:49     ` Eric Dumazet
  0 siblings, 1 reply; 7+ messages in thread
From: Michal Kubecek @ 2018-04-06 15:03 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev

On Fri, Apr 06, 2018 at 05:01:29AM -0700, Eric Dumazet wrote:
> 
> 
> On 04/06/2018 03:05 AM, Michal Kubecek wrote:
> > Hello,
> > 
> > I encountered a strange behaviour of some (non-linux) TCP stack which
> > I believe is incorrect but support engineers from the company producing
> > it claim is OK.
> > 
> > Assume a client (sender, Linux 4.4 kernel) sends a stream of MSS sized
> > segments but segments 2, 4 and 6 do not reach the server (receiver):
> > 
> >          ACK             SAK             SAK             SAK
> >       +-------+-------+-------+-------+-------+-------+-------+
> >       |   1   |   2   |   3   |   4   |   5   |   6   |   7   |
> >       +-------+-------+-------+-------+-------+-------+-------+
> >     34273   35701   37129   38557   39985   41413   42841   44269
> > 
> > When segment 2 is retransmitted after RTO timeout, normal response would
> > be ACK-ing segment 3 (38557) with SACK for 5 and 7 (39985-41413 and
> > 42841-44269).
> > 
> > However, this server stack responds with two separate ACKs:
> > 
> >   - ACK 37129, SACK 37129-38557 39985-41413 42841-44269
> >   - ACK 38557, SACK 39985-41413 42841-44269
> 
> Hmmm... Yes this seems very very wrong and lazy.
> 
> Have you verified behavior of more recent linux kernel to such threats ?

No, unfortunately the problem has only been encountered by our customer
in a production environment (they tried to reproduce it in a test lab
but had no luck). They are running backups to an NFS server and it
happens from time to time (on the order of hours, IIUC). So it would
probably be hard to get them to try a more recent kernel.

On the other hand, they reported that SLE11 clients (kernel 3.0) do not
run into this kind of problem. It was originally reported as a
regression on migration from SLE11-SP4 (3.0 kernel) to SLE12-SP2 (4.4
kernel), and the problem was described as "SLE12-SP2 is ignoring
dupacks" (which seems to be mostly caused by the switch to RACK).

It also seems that part of the problem is a specific packet loss
pattern where, at some point, many packets are lost in an "every second
one" pattern. The customer finally started to investigate this and it
seems to have something to do with their bonding setup (they provided
no details; my guess is packets are divided over two paths and one of
them fails).
> packetdrill test would be relatively easy to write.

I'll try but I have very little experience with writing packetdrill
scripts so it will probably take some time.

> Regardless of this broken alien stack, we might be able to work around
> this faster than the vendor is able to fix and deploy a new stack.
> 
> ( https://en.wikipedia.org/wiki/Robustness_principle )
> Be conservative in what you do, be liberal in what you accept from
> others...

I was thinking about this a bit. "Fixing" the acknowledgment number
could do the trick but it doesn't feel correct. We might use the fact
that the TSecr of both ACKs above matches the TSval of the
retransmission which triggered them, so an RTT calculated from the
timestamps would be the right one. So perhaps something like "prefer
the timestamp RTT if the measured RTT seems way off". But I'm not sure
it couldn't break other use cases where the (high) measured RTT is
actually correct, rather than the (low) timestamp RTT.
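
Roughly something like the sketch below (a standalone helper with
invented names, not actual tcp_input.c code; the factor of 8 is an
arbitrary guess):

#include <stdint.h>

/*
 * seq_rtt_us: RTT measured from the segment's original transmit time.
 * ts_rtt_us:  RTT derived from the TSval/TSecr of the retransmission.
 * Prefer the timestamp RTT when the measured value looks implausibly
 * large; the obvious risk is misjudging cases where a high measured
 * RTT is actually the correct one.
 */
static int64_t choose_rtt_sample_us(int64_t seq_rtt_us, int64_t ts_rtt_us)
{
        if (ts_rtt_us > 0 && seq_rtt_us > 8 * ts_rtt_us)
                return ts_rtt_us;       /* measured RTT likely includes RTOs */
        return seq_rtt_us;
}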

Michal Kubecek


* Re: TCP one-by-one acking - RFC interpretation question
  2018-04-06 15:03   ` Michal Kubecek
@ 2018-04-06 16:49     ` Eric Dumazet
  2018-04-11 10:58       ` Michal Kubecek
  0 siblings, 1 reply; 7+ messages in thread
From: Eric Dumazet @ 2018-04-06 16:49 UTC (permalink / raw)
  To: Michal Kubecek, Eric Dumazet; +Cc: netdev, Yuchung Cheng, Neal Cardwell

Cc Neal and Yuchung if they missed this thread.

On 04/06/2018 08:03 AM, Michal Kubecek wrote:
> On Fri, Apr 06, 2018 at 05:01:29AM -0700, Eric Dumazet wrote:
>>
>>
>> On 04/06/2018 03:05 AM, Michal Kubecek wrote:
>>> Hello
>>>
>>> I encountered a strange behaviour of some (non-linux) TCP stack which
>>> I believe is incorrect but support engineers from the company producing
>>> it claim is OK.
>>>
>>> Assume a client (sender, Linux 4.4 kernel) sends a stream of MSS sized
>>> segments but segments 2, 4 and 6 do not reach the server (receiver):
>>>
>>>          ACK             SAK             SAK             SAK
>>>       +-------+-------+-------+-------+-------+-------+-------+
>>>       |   1   |   2   |   3   |   4   |   5   |   6   |   7   |
>>>       +-------+-------+-------+-------+-------+-------+-------+
>>>     34273   35701   37129   38557   39985   41413   42841   44269
>>>
>>> When segment 2 is retransmitted after RTO timeout, normal response would
>>> be ACK-ing segment 3 (38557) with SACK for 5 and 7 (39985-41413 and
>>> 42841-44269).
>>>
>>> However, this server stack responds with two separate ACKs:
>>>
>>>   - ACK 37129, SACK 37129-38557 39985-41413 42841-44269
>>>   - ACK 38557, SACK 39985-41413 42841-44269
>>
>> Hmmm... Yes this seems very very wrong and lazy.
>>
>> Have you verified behavior of more recent linux kernel to such threats ?
> 
> No, unfortunately the problem was only encountered by our customer in
> production environment (they tried to reproduce in a test lab but no
> luck). They are running backups to NFS server and it happens from time
> to time (in the order of hours, IIUC). So it would be probably hard to
> let them try with more recent kernel.
> 
> On the other hand, they reported that SLE11 clients (kernel 3.0) do not
> run into this kind of problem. It was originally reported as a
> a regression on migration from SLE11-SP4 (3.0 kernel) to SLE12-SP2 (4.4
> kernel) and the problem was reported as "SLE12-SP2 is ignoring dupacks"
> (which seems to be mostly caused by the switch to RACK).
> 
> It also seems that part of the problem is specific packet loss pattern
> where at some point, many packets are lost in "every second" pattern.
> The customer finally started to investigate this problem and it seems it
> has something to do with their bonding setup (they provided no details,
> my guess is packets are divided over two paths and one of them fails).
> 
>> packetdrill test would be relatively easy to write.
> 
> I'll try but I have very little experience with writing packetdrill
> scripts so it will probably take some time.
> 
>> Regardless of this broken alien stack, we might be able to work around
>> this faster than the vendor is able to fix and deploy a new stack.
>>
>> ( https://en.wikipedia.org/wiki/Robustness_principle )
>> Be conservative in what you do, be liberal in what you accept from
>> others...
> 
> I was thinking about this a bit. "Fixing" the acknowledgment number
> could do the trick but it doesn't feel correct. We might use the fact
> that TSecr of both ACKs above matches TSval of the retransmission which
> triggered them so that RTT calculated from timestamp would be the right
> one. So perhaps something like "prefer timestamp RTT if measured RTT
> seems way too off". But I'm not sure if it couldn't break other use
> cases where (high) measured RTT is actually correct, rather than (low)
> timestamp RTT.
> 
> Michal Kubecek
> 


* Re: TCP one-by-one acking - RFC interpretation question
  2018-04-06 16:49     ` Eric Dumazet
@ 2018-04-11 10:58       ` Michal Kubecek
  2018-04-11 12:06         ` Michal Kubecek
  0 siblings, 1 reply; 7+ messages in thread
From: Michal Kubecek @ 2018-04-11 10:58 UTC (permalink / raw)
  To: netdev
  Cc: Eric Dumazet, Yuchung Cheng, Neal Cardwell, Kenneth Klette Jonassen

On Fri, Apr 06, 2018 at 09:49:58AM -0700, Eric Dumazet wrote:
> Cc Neal and Yuchung if they missed this thread.
> 
> On 04/06/2018 08:03 AM, Michal Kubecek wrote:
> > On Fri, Apr 06, 2018 at 05:01:29AM -0700, Eric Dumazet wrote:
> >>
> >>
> >> On 04/06/2018 03:05 AM, Michal Kubecek wrote:
> >>> Hello
> >>>
> >>> I encountered a strange behaviour of some (non-linux) TCP stack which
> >>> I believe is incorrect but support engineers from the company producing
> >>> it claim is OK.
> >>>
> >>> Assume a client (sender, Linux 4.4 kernel) sends a stream of MSS sized
> >>> segments but segments 2, 4 and 6 do not reach the server (receiver):
> >>>
> >>>          ACK             SAK             SAK             SAK
> >>>       +-------+-------+-------+-------+-------+-------+-------+
> >>>       |   1   |   2   |   3   |   4   |   5   |   6   |   7   |
> >>>       +-------+-------+-------+-------+-------+-------+-------+
> >>>     34273   35701   37129   38557   39985   41413   42841   44269
> >>>
> >>> When segment 2 is retransmitted after RTO timeout, normal response would
> >>> be ACK-ing segment 3 (38557) with SACK for 5 and 7 (39985-41413 and
> >>> 42841-44269).
> >>>
> >>> However, this server stack responds with two separate ACKs:
> >>>
> >>>   - ACK 37129, SACK 37129-38557 39985-41413 42841-44269
> >>>   - ACK 38557, SACK 39985-41413 42841-44269
> >>
> >> Hmmm... Yes this seems very very wrong and lazy.
> >>
> >> Have you verified behavior of more recent linux kernel to such threats ?
> > 
> > No, unfortunately the problem was only encountered by our customer in
> > production environment (they tried to reproduce in a test lab but no
> > luck). They are running backups to NFS server and it happens from time
> > to time (in the order of hours, IIUC). So it would be probably hard to
> > let them try with more recent kernel.
> > 
> > On the other hand, they reported that SLE11 clients (kernel 3.0) do not
> > run into this kind of problem. It was originally reported as a
> > a regression on migration from SLE11-SP4 (3.0 kernel) to SLE12-SP2 (4.4
> > kernel) and the problem was reported as "SLE12-SP2 is ignoring dupacks"
> > (which seems to be mostly caused by the switch to RACK).
> > 
> > It also seems that part of the problem is specific packet loss pattern
> > where at some point, many packets are lost in "every second" pattern.
> > The customer finally started to investigate this problem and it seems it
> > has something to do with their bonding setup (they provided no details,
> > my guess is packets are divided over two paths and one of them fails).
> > 
> >> packetdrill test would be relatively easy to write.
> > 
> > I'll try but I have very little experience with writing packetdrill
> > scripts so it will probably take some time.
> > 
> >> Regardless of this broken alien stack, we might be able to work around
> >> this faster than the vendor is able to fix and deploy a new stack.
> >>
> >> ( https://en.wikipedia.org/wiki/Robustness_principle )
> >> Be conservative in what you do, be liberal in what you accept from
> >> others...
> > 
> > I was thinking about this a bit. "Fixing" the acknowledgment number
> > could do the trick but it doesn't feel correct. We might use the fact
> > that TSecr of both ACKs above matches TSval of the retransmission which
> > triggered them so that RTT calculated from timestamp would be the right
> > one. So perhaps something like "prefer timestamp RTT if measured RTT
> > seems way too off". But I'm not sure if it couldn't break other use
> > cases where (high) measured RTT is actually correct, rather than (low)
> > timestamp RTT.

I stared at the code some more and apparently I was wrong. I put my
tracing right after the check

        if (flag & FLAG_SACK_RENEGING) {

in tcp_check_sack_reneging() and completely missed Neal's commit
5ae344c949e7 ("tcp: reduce spurious retransmits due to transient SACK
reneging"), which deals with exactly this kind of broken TCP stack
sending a series of ACKs for each segment from the out-of-order queue.
Therefore it seems the events in my log weren't actual SACK reneging
events and the SACK scoreboard wasn't cleared.

There is something else I don't understand, though. In the case of
acking a previously sacked and never-retransmitted segment,
tcp_clean_rtx_queue() calculates the parameters for
tcp_ack_update_rtt() using

        if (sack->first_sackt.v64) {
                sack_rtt_us = skb_mstamp_us_delta(&now, &sack->first_sackt);
                ca_rtt_us = skb_mstamp_us_delta(&now, &sack->last_sackt);
        }

(in 4.4; mainline code replaces &now with tp->tcp_mstamp). If I read
the code correctly, both sack->first_sackt and sack->last_sackt contain
timestamps of the initial segment transmission. This would mean we use
the time difference between the initial transmission and now, i.e.
including the RTO of the lost packet.

IMHO we should take the actual round trip time instead, i.e. the
difference between the original transmission and the time the packet
was sacked (for the first time). It seems we were doing this before
commit 31231a8a8730 ("tcp: improve RTT from SACK for CC").

Michal Kubecek


* Re: TCP one-by-one acking - RFC interpretation question
  2018-04-11 10:58       ` Michal Kubecek
@ 2018-04-11 12:06         ` Michal Kubecek
  2018-04-12 16:20           ` Yuchung Cheng
  0 siblings, 1 reply; 7+ messages in thread
From: Michal Kubecek @ 2018-04-11 12:06 UTC (permalink / raw)
  To: netdev
  Cc: Eric Dumazet, Yuchung Cheng, Neal Cardwell, Kenneth Klette Jonassen

On Wed, Apr 11, 2018 at 12:58:37PM +0200, Michal Kubecek wrote:
> There is something else I don't understand, though. In the case of
> acking previously sacked and never retransmitted segment,
> tcp_clean_rtx_queue() calculates the parameters for tcp_ack_update_rtt()
> using
> 
>         if (sack->first_sackt.v64) {
>                 sack_rtt_us = skb_mstamp_us_delta(&now, &sack->first_sackt);
>                 ca_rtt_us = skb_mstamp_us_delta(&now, &sack->last_sackt);
>         }
> 
> (in 4.4; mainline code replaces &now with tp->tcp_mstamp). If I read the
> code correctly, both sack->first_sackt and sack->last_sackt contain
> timestamps of initial segment transmission. This would mean we use the
> time difference between the initial transmission and now, i.e. including
> the RTO of the lost packet).
> 
> IMHO we should take the actual round trip time instead, i.e. the
> difference between the original transmission and the time the packet
> sacked (first time). It seems we have been doing this before commit
> 31231a8a8730 ("tcp: improve RTT from SACK for CC").

Sorry for the noise, this was my misunderstanding, the first_sackt and
last_sackt values are only taken from segments newly sacked by ack
received right now, not those which were already sacked before.

The actual problem and the unrealistic RTT measurements come from
another RFC violation I didn't mention before: the NAS doesn't follow
the RFC 2018 section 4 rule for ordering of SACK blocks. Rather than
sending SACK blocks for the three most recently received out-of-order
blocks, it simply sends the first three ordered by sequence number.
In the earlier example (odd packets were received, even ones lost)

       ACK             SAK             SAK             SAK
    +-------+-------+-------+-------+-------+-------+-------+-------+-------+
    |   1   |   2   |   3   |   4   |   5   |   6   |   7   |   8   |   9   |
    +-------+-------+-------+-------+-------+-------+-------+-------+-------+
  34273   35701   37129   38557   39985   41413   42841   44269   45697   47125

it responds to retransmitted segment 2 by

  1. ACK 37129, SACK 37129-38557 39985-41413 42841-44269
  2. ACK 38557, SACK 39985-41413 42841-44269 45697-47125

This new SACK block 45697-47125 covers a segment that was never
retransmitted, and as it wasn't reported as sacked before, it is
considered newly sacked. Therefore it gets processed and its deemed RTT
(the time since its original transmission) "poisons" the RTT
calculation, leading to the RTO spiraling up.

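With made-up numbers (assuming a ~10 ms path RTT and a ~200 ms RTO; the
variable names are invented, not the kernel's), the bogus sample looks
like this:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
        /* Segment 9 (45697-47125) is transmitted once and received right
         * away, but the receiver only reports it after the RTO-triggered
         * retransmission of segment 2. */
        uint64_t xmit_us = 0;           /* original (and only) transmission */
        uint64_t ack_us  = 210000;      /* ACK that finally lists the block */
        uint64_t sack_rtt_us = ack_us - xmit_us;

        printf("sampled RTT %llu us, actual path RTT ~10000 us\n",
               (unsigned long long)sack_rtt_us);
        return 0;
}
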
Thus if we wanted to work around the NAS behaviour, we would need to
recognize such a new SACK block as "not really new" and ignore it for
first_sackt/last_sackt. I'm not sure that is possible without
misinterpreting genuinely delayed out-of-order packets. Of course, it
is also not clear whether it's worth the effort to work around such a
severely broken TCP implementation (two obvious RFC violations, even if
we don't count the one-by-one acking).

Michal Kubecek


* Re: TCP one-by-one acking - RFC interpretation question
  2018-04-11 12:06         ` Michal Kubecek
@ 2018-04-12 16:20           ` Yuchung Cheng
  0 siblings, 0 replies; 7+ messages in thread
From: Yuchung Cheng @ 2018-04-12 16:20 UTC (permalink / raw)
  To: Michal Kubecek
  Cc: netdev, Eric Dumazet, Neal Cardwell, Kenneth Klette Jonassen

On Wed, Apr 11, 2018 at 5:06 AM, Michal Kubecek <mkubecek@suse.cz> wrote:
> On Wed, Apr 11, 2018 at 12:58:37PM +0200, Michal Kubecek wrote:
>> There is something else I don't understand, though. In the case of
>> acking previously sacked and never retransmitted segment,
>> tcp_clean_rtx_queue() calculates the parameters for tcp_ack_update_rtt()
>> using
>>
>>         if (sack->first_sackt.v64) {
>>                 sack_rtt_us = skb_mstamp_us_delta(&now, &sack->first_sackt);
>>                 ca_rtt_us = skb_mstamp_us_delta(&now, &sack->last_sackt);
>>         }
>>
>> (in 4.4; mainline code replaces &now with tp->tcp_mstamp). If I read the
>> code correctly, both sack->first_sackt and sack->last_sackt contain
>> timestamps of initial segment transmission. This would mean we use the
>> time difference between the initial transmission and now, i.e. including
>> the RTO of the lost packet).
>>
>> IMHO we should take the actual round trip time instead, i.e. the
>> difference between the original transmission and the time the packet
>> sacked (first time). It seems we have been doing this before commit
>> 31231a8a8730 ("tcp: improve RTT from SACK for CC").
>
> Sorry for the noise, this was my misunderstanding, the first_sackt and
> last_sackt values are only taken from segments newly sacked by ack
> received right now, not those which were already sacked before.
>
> The actual problem and unrealistic RTT measurements come from another
> RFC violation I didn't mention before: the NAS doesn't follow RFC 2018
> section 4 rule for ordering of SACK blocks. Rather than sending SACK
> blocks three most recently received out-of-order blocks, it simply sends
> first three ordered by sequence numbers. In the earlier example (odd
> packets were received, even lost)
>
>        ACK             SAK             SAK             SAK
>     +-------+-------+-------+-------+-------+-------+-------+-------+-------+
>     |   1   |   2   |   3   |   4   |   5   |   6   |   7   |   8   |   9   |
>     +-------+-------+-------+-------+-------+-------+-------+-------+-------+
>   34273   35701   37129   38557   39985   41413   42841   44269   45697   47125
>
> it responds to retransmitted segment 2 by
>
>   1. ACK 37129, SACK 37129-38557 39985-41413 42841-44269
>   2. ACK 38557, SACK 39985-41413 42841-44269 45697-47125
>
> This new SACK block 45697-47125 has not been retransmitted and as it
> wasn't sacked before, it is considered newly sacked. Therefore it gets
> processed and its deemed RTT (time since its original transmit time)
> "poisons" the RTT calculation, leading to RTO spiraling up.
>
> Thus if we want to work around the NAS behaviour, we would need to
> recognize such new SACK block as "not really new" and ignore it for
> first_sackt/last_sackt. I'm not sure if it's possible without
> misinterpreting actually delayed out of order packets. Of course, it is
> not clear if it's worth the effort to work around so severely broken TCP
> implementations (two obvious RFC violations, even if we don't count the
> one-by-one acking).
Right. Not much we (the sender) can do if the receiver is not reporting
the delivery status correctly. This also negatively impacts TCP
congestion control (CUBIC, Reno, BBR, CDG, etc.) because we've changed
it to increase/decrease cwnd based on both in-order and out-of-order
delivery.

We're close to publishing our internal packetdrill tests. Hopefully
they can be used to test these poor implementations.

>
> Michal Kubecek

