From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michal Kubecek Subject: Re: TCP one-by-one acking - RFC interpretation question Date: Wed, 11 Apr 2018 14:06:55 +0200 Message-ID: <20180411120655.aib34ye5xnch4zw3@unicorn.suse.cz> References: <20180406100511.oify5ej2zvxr3isg@unicorn.suse.cz> <65f7ac65-e6d1-11f2-408e-39eb542ec817@gmail.com> <20180406150345.i3t7cu5g6dmoc3io@unicorn.suse.cz> <20180411105837.bwnpoqvbra43kjub@unicorn.suse.cz> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: Eric Dumazet , Yuchung Cheng , Neal Cardwell , Kenneth Klette Jonassen To: netdev@vger.kernel.org Return-path: Received: from mx2.suse.de ([195.135.220.15]:47644 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752351AbeDKMG5 (ORCPT ); Wed, 11 Apr 2018 08:06:57 -0400 Content-Disposition: inline In-Reply-To: <20180411105837.bwnpoqvbra43kjub@unicorn.suse.cz> Sender: netdev-owner@vger.kernel.org List-ID: On Wed, Apr 11, 2018 at 12:58:37PM +0200, Michal Kubecek wrote: > There is something else I don't understand, though. In the case of > acking previously sacked and never retransmitted segment, > tcp_clean_rtx_queue() calculates the parameters for tcp_ack_update_rtt() > using > > if (sack->first_sackt.v64) { > sack_rtt_us = skb_mstamp_us_delta(&now, > &sack->first_sackt); > ca_rtt_us = skb_mstamp_us_delta(&now, > &sack->last_sackt); > } > > (in 4.4; mainline code replaces &now with tp->tcp_mstamp). If I read the > code correctly, both sack->first_sackt and sack->last_sackt contain > timestamps of initial segment transmission. This would mean we use the > time difference between the initial transmission and now, i.e. including > the RTO of the lost packet). > > IMHO we should take the actual round trip time instead, i.e. the > difference between the original transmission and the time the packet > sacked (first time). It seems we have been doing this before commit > 31231a8a8730 ("tcp: improve RTT from SACK for CC"). Sorry for the noise, this was my misunderstanding, the first_sackt and last_sackt values are only taken from segments newly sacked by ack received right now, not those which were already sacked before. The actual problem and unrealistic RTT measurements come from another RFC violation I didn't mention before: the NAS doesn't follow RFC 2018 section 4 rule for ordering of SACK blocks. Rather than sending SACK blocks three most recently received out-of-order blocks, it simply sends first three ordered by sequence numbers. In the earlier example (odd packets were received, even lost) ACK SAK SAK SAK +-------+-------+-------+-------+-------+-------+-------+-------+-------+ | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | +-------+-------+-------+-------+-------+-------+-------+-------+-------+ 34273 35701 37129 38557 39985 41413 42841 44269 45697 47125 it responds to retransmitted segment 2 by 1. ACK 37129, SACK 37129-38557 39985-41413 42841-44269 2. ACK 38557, SACK 39985-41413 42841-44269 45697-47125 This new SACK block 45697-47125 has not been retransmitted and as it wasn't sacked before, it is considered newly sacked. Therefore it gets processed and its deemed RTT (time since its original transmit time) "poisons" the RTT calculation, leading to RTO spiraling up. Thus if we want to work around the NAS behaviour, we would need to recognize such new SACK block as "not really new" and ignore it for first_sackt/last_sackt. I'm not sure if it's possible without misinterpreting actually delayed out of order packets. Of course, it is not clear if it's worth the effort to work around so severely broken TCP implementations (two obvious RFC violations, even if we don't count the one-by-one acking). Michal Kubecek