Re: TCP one-by-one acking - RFC interpretation question

From: Yuchung Cheng <ycheng@google.com>
To: Michal Kubecek <mkubecek@suse.cz>
Cc: netdev <netdev@vger.kernel.org>,
	Eric Dumazet <eric.dumazet@gmail.com>,
	Neal Cardwell <ncardwell@google.com>,
	Kenneth Klette Jonassen <kennetkl@ifi.uio.no>
Subject: Re: TCP one-by-one acking - RFC interpretation question
Date: Thu, 12 Apr 2018 09:20:31 -0700	[thread overview]
Message-ID: <CAK6E8=cWfjLh_sUSi2tVm8v-iWd8KtPASkC6kgPnKTN+4Mdywg@mail.gmail.com> (raw)
In-Reply-To: <20180411120655.aib34ye5xnch4zw3@unicorn.suse.cz>

On Wed, Apr 11, 2018 at 5:06 AM, Michal Kubecek <mkubecek@suse.cz> wrote:
> On Wed, Apr 11, 2018 at 12:58:37PM +0200, Michal Kubecek wrote:
>> There is something else I don't understand, though. In the case of
>> acking previously sacked and never retransmitted segment,
>> tcp_clean_rtx_queue() calculates the parameters for tcp_ack_update_rtt()
>> using
>>
>>         if (sack->first_sackt.v64) {
>>                 sack_rtt_us = skb_mstamp_us_delta(&now,
>> &sack->first_sackt);
>>                 ca_rtt_us = skb_mstamp_us_delta(&now,
>> &sack->last_sackt);
>>         }
>>
>> (in 4.4; mainline code replaces &now with tp->tcp_mstamp). If I read the
>> code correctly, both sack->first_sackt and sack->last_sackt contain
>> timestamps of initial segment transmission. This would mean we use the
>> time difference between the initial transmission and now, i.e. including
>> the RTO of the lost packet).
>>
>> IMHO we should take the actual round trip time instead, i.e. the
>> difference between the original transmission and the time the packet
>> sacked (first time). It seems we have been doing this before commit
>> 31231a8a8730 ("tcp: improve RTT from SACK for CC").
>
> Sorry for the noise, this was my misunderstanding, the first_sackt and
> last_sackt values are only taken from segments newly sacked by ack
> received right now, not those which were already sacked before.
>
> The actual problem and unrealistic RTT measurements come from another
> RFC violation I didn't mention before: the NAS doesn't follow RFC 2018
> section 4 rule for ordering of SACK blocks. Rather than sending SACK
> blocks three most recently received out-of-order blocks, it simply sends
> first three ordered by sequence numbers. In the earlier example (odd
> packets were received, even lost)
>
>        ACK             SAK             SAK             SAK
>     +-------+-------+-------+-------+-------+-------+-------+-------+-------+
>     |   1   |   2   |   3   |   4   |   5   |   6   |   7   |   8   |   9   |
>     +-------+-------+-------+-------+-------+-------+-------+-------+-------+
>   34273   35701   37129   38557   39985   41413   42841   44269   45697   47125
>
> it responds to retransmitted segment 2 by
>
>   1. ACK 37129, SACK 37129-38557 39985-41413 42841-44269
>   2. ACK 38557, SACK 39985-41413 42841-44269 45697-47125
>
> This new SACK block 45697-47125 has not been retransmitted and as it
> wasn't sacked before, it is considered newly sacked. Therefore it gets
> processed and its deemed RTT (time since its original transmit time)
> "poisons" the RTT calculation, leading to RTO spiraling up.
>
> Thus if we want to work around the NAS behaviour, we would need to
> recognize such new SACK block as "not really new" and ignore it for
> first_sackt/last_sackt. I'm not sure if it's possible without
> misinterpreting actually delayed out of order packets. Of course, it is
> not clear if it's worth the effort to work around so severely broken TCP
> implementations (two obvious RFC violations, even if we don't count the
> one-by-one acking).
Right. Not much we (sender) can do if the receiver is not reporting
the delivery status correctly. This also negatively impacts TCP
congestion control (Cubic, Reno, BBR, CDG etc) because we've changed
it to increase/decrease cwnd based on both inorder and out-of-order
delivery.

We're close to publish our internal packetdrill tests. Hopefully they
can be used to test these poor implementations.

>
> Michal Kubecek