* [PATCH 1/4 net-next] tcp: consolidate SYNACK RTT sampling
@ 2013-07-21  5:39 Yuchung Cheng
  2013-07-21  5:39 ` [PATCH 2/4 net-next] tcp: prefer packet timing to TS-ECR for RTT Yuchung Cheng
                   ` (3 more replies)
  0 siblings, 4 replies; 8+ messages in thread
From: Yuchung Cheng @ 2013-07-21  5:39 UTC (permalink / raw)
  To: davem, ncardwell, edumazet; +Cc: nanditad, netdev, Yuchung Cheng

This patch series improves RTT sampling in three ways:
1. Sample RTT during fast recovery and reordering events.
2. Favor ACK-based RTT over timestamps because of broken TS ECR fields.
3. Consolidate the RTT measurement logic.

The first patch consolidates SYNACK and other RTT measurements to use a
central function, tcp_ack_update_rtt(). A (small) bonus is that SYNACK
RTT measurement now happens after the PAWS check, potentially reducing the
impact of seeding the RTO from bad TCP timestamp values.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
---
 include/net/tcp.h        |  9 ---------
 net/ipv4/tcp_input.c     | 11 +++++++++++
 net/ipv4/tcp_ipv4.c      |  2 --
 net/ipv4/tcp_minisocks.c |  8 ++------
 net/ipv6/tcp_ipv6.c      |  2 --
 5 files changed, 13 insertions(+), 19 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index d198005..f9777db 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1094,15 +1094,6 @@ static inline void tcp_openreq_init(struct request_sock *req,
 	ireq->loc_port = tcp_hdr(skb)->dest;
 }
 
-/* Compute time elapsed between SYNACK and the ACK completing 3WHS */
-static inline void tcp_synack_rtt_meas(struct sock *sk,
-				       struct request_sock *req)
-{
-	if (tcp_rsk(req)->snt_synack)
-		tcp_valid_rtt_meas(sk,
-		    tcp_time_stamp - tcp_rsk(req)->snt_synack);
-}
-
 extern void tcp_enter_memory_pressure(struct sock *sk);
 
 static inline int keepalive_intvl_when(const struct tcp_sock *tp)
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 28af45a..988e816 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2853,6 +2853,17 @@ static inline void tcp_ack_update_rtt(struct sock *sk, const int flag,
 		tcp_ack_no_tstamp(sk, seq_rtt, flag);
 }
 
+/* Compute time elapsed between (last) SYNACK and the ACK completing 3WHS. */
+static void tcp_synack_rtt_meas(struct sock *sk, struct request_sock *req)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	s32 seq_rtt = -1;
+
+	if (tp->lsndtime && !tp->total_retrans)
+		seq_rtt = tcp_time_stamp - tp->lsndtime;
+	tcp_ack_update_rtt(sk, FLAG_SYN_ACKED, seq_rtt);
+}
+
 static void tcp_cong_avoid(struct sock *sk, u32 ack, u32 in_flight)
 {
 	const struct inet_connection_sock *icsk = inet_csk(sk);
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index b299da5..2e3f129 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1671,8 +1671,6 @@ struct sock *tcp_v4_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
 		newtp->advmss = tcp_sk(sk)->rx_opt.user_mss;
 
 	tcp_initialize_rcv_mss(newsk);
-	tcp_synack_rtt_meas(newsk, req);
-	newtp->total_retrans = req->num_retrans;
 
 #ifdef CONFIG_TCP_MD5SIG
 	/* Copy over the MD5 key from the original socket */
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index ab1c086..58a3e69 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -411,6 +411,8 @@ struct sock *tcp_create_openreq_child(struct sock *sk, struct request_sock *req,
 		newtp->snd_ssthresh = TCP_INFINITE_SSTHRESH;
 		tcp_enable_early_retrans(newtp);
 		newtp->tlp_high_seq = 0;
+		newtp->lsndtime = treq->snt_synack;
+		newtp->total_retrans = req->num_retrans;
 
 		/* So many TCP implementations out there (incorrectly) count the
 		 * initial SYN frame in their delayed-ACK and congestion control
@@ -666,12 +668,6 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
 	if (!(flg & TCP_FLAG_ACK))
 		return NULL;
 
-	/* Got ACK for our SYNACK, so update baseline for SYNACK RTT sample. */
-	if (tmp_opt.saw_tstamp && tmp_opt.rcv_tsecr)
-		tcp_rsk(req)->snt_synack = tmp_opt.rcv_tsecr;
-	else if (req->num_retrans) /* don't take RTT sample if retrans && ~TS */
-		tcp_rsk(req)->snt_synack = 0;
-
 	/* For Fast Open no more processing is needed (sk is the
 	 * child socket).
 	 */
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 6e1649d..80fe69e 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1237,8 +1237,6 @@ static struct sock * tcp_v6_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
 		newtp->advmss = tcp_sk(sk)->rx_opt.user_mss;
 
 	tcp_initialize_rcv_mss(newsk);
-	tcp_synack_rtt_meas(newsk, req);
-	newtp->total_retrans = req->num_retrans;
 
 	newinet->inet_daddr = newinet->inet_saddr = LOOPBACK4_IPV6;
 	newinet->inet_rcv_saddr = LOOPBACK4_IPV6;
-- 
1.8.3

* [PATCH 2/4 net-next] tcp: prefer packet timing to TS-ECR for RTT
  2013-07-21  5:39 [PATCH 1/4 net-next] tcp: consolidate SYNACK RTT sampling Yuchung Cheng
@ 2013-07-21  5:39 ` Yuchung Cheng
  2013-07-22  3:17   ` Neal Cardwell
  2013-07-21  5:39 ` [PATCH 3/4 net-next] tcp: measure RTT from new SACK Yuchung Cheng
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 8+ messages in thread
From: Yuchung Cheng @ 2013-07-21  5:39 UTC (permalink / raw)
  To: davem, ncardwell, edumazet; +Cc: nanditad, netdev, Yuchung Cheng

Prefer packet timings to TS-ECR for RTT measurements when both
sources are available, because broken middle-boxes and remote
peers can return packets with corrupted TS ECR fields. Most
congestion controls that require RTT signals likewise favor
timing-based sources. Also check for bad TS ECR values to avoid RTT
blow-ups, which have happened on production web servers.
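
As a worked example of such a blow-up (illustrative numbers, not from
a trace): with HZ=1000, one timestamp tick is 1ms. If a middle-box
corrupts TS ECR so that tcp_time_stamp - rcv_tsecr comes out as
120000 ticks instead of a true 50, seeding the estimator from it
would push srtt toward two minutes and inflate the RTO accordingly;
preferring the ACK's own packet timing sidesteps this.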

Signed-off-by: Yuchung Cheng <ycheng@google.com>
---
 include/net/tcp.h    |  1 -
 net/ipv4/tcp_input.c | 67 ++++++++++++++--------------------------------------
 2 files changed, 18 insertions(+), 50 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index f9777db..c586847 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -591,7 +591,6 @@ extern void tcp_initialize_rcv_mss(struct sock *sk);
 extern int tcp_mtu_to_mss(struct sock *sk, int pmtu);
 extern int tcp_mss_to_mtu(struct sock *sk, int mss);
 extern void tcp_mtup_init(struct sock *sk);
-extern void tcp_valid_rtt_meas(struct sock *sk, u32 seq_rtt);
 extern void tcp_init_buffer_space(struct sock *sk);
 
 static inline void tcp_bound_rto(const struct sock *sk)
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 988e816..b6357ec 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2792,65 +2792,36 @@ static void tcp_fastretrans_alert(struct sock *sk, const int acked,
 	tcp_xmit_retransmit_queue(sk);
 }
 
-void tcp_valid_rtt_meas(struct sock *sk, u32 seq_rtt)
+static inline void tcp_ack_update_rtt(struct sock *sk, const int flag,
+				      s32 seq_rtt)
 {
-	tcp_rtt_estimator(sk, seq_rtt);
-	tcp_set_rto(sk);
-	inet_csk(sk)->icsk_backoff = 0;
-}
-EXPORT_SYMBOL(tcp_valid_rtt_meas);
+	const struct tcp_sock *tp = tcp_sk(sk);
+
+	/* Prefer RTT measured from ACK's timing to TS-ECR. This is because
+	 * broken middle-boxes or peers may corrupt TS-ECR fields. But
+	 * Karn's algorithm forbids taking RTT if some retransmitted data
+	 * is acked (RFC6298).
+	 */
+	if (flag & FLAG_RETRANS_DATA_ACKED)
+		seq_rtt = -1;
 
-/* Read draft-ietf-tcplw-high-performance before mucking
- * with this code. (Supersedes RFC1323)
- */
-static void tcp_ack_saw_tstamp(struct sock *sk, int flag)
-{
 	/* RTTM Rule: A TSecr value received in a segment is used to
 	 * update the averaged RTT measurement only if the segment
 	 * acknowledges some new data, i.e., only if it advances the
 	 * left edge of the send window.
-	 *
 	 * See draft-ietf-tcplw-high-performance-00, section 3.3.
-	 * 1998/04/10 Andrey V. Savochkin <saw@msu.ru>
-	 *
-	 * Changed: reset backoff as soon as we see the first valid sample.
-	 * If we do not, we get strongly overestimated rto. With timestamps
-	 * samples are accepted even from very old segments: f.e., when rtt=1
-	 * increases to 8, we retransmit 5 times and after 8 seconds delayed
-	 * answer arrives rto becomes 120 seconds! If at least one of segments
-	 * in window is lost... Voila.	 			--ANK (010210)
 	 */
-	struct tcp_sock *tp = tcp_sk(sk);
+	if (seq_rtt < 0 && tp->rx_opt.saw_tstamp && tp->rx_opt.rcv_tsecr)
+		seq_rtt = tcp_time_stamp - tp->rx_opt.rcv_tsecr;
 
-	tcp_valid_rtt_meas(sk, tcp_time_stamp - tp->rx_opt.rcv_tsecr);
-}
-
-static void tcp_ack_no_tstamp(struct sock *sk, u32 seq_rtt, int flag)
-{
-	/* We don't have a timestamp. Can only use
-	 * packets that are not retransmitted to determine
-	 * rtt estimates. Also, we must not reset the
-	 * backoff for rto until we get a non-retransmitted
-	 * packet. This allows us to deal with a situation
-	 * where the network delay has increased suddenly.
-	 * I.e. Karn's algorithm. (SIGCOMM '87, p5.)
-	 */
-
-	if (flag & FLAG_RETRANS_DATA_ACKED)
+	if (seq_rtt < 0)
 		return;
 
-	tcp_valid_rtt_meas(sk, seq_rtt);
-}
+	tcp_rtt_estimator(sk, seq_rtt);
+	tcp_set_rto(sk);
 
-static inline void tcp_ack_update_rtt(struct sock *sk, const int flag,
-				      const s32 seq_rtt)
-{
-	const struct tcp_sock *tp = tcp_sk(sk);
-	/* Note that peer MAY send zero echo. In this case it is ignored. (rfc1323) */
-	if (tp->rx_opt.saw_tstamp && tp->rx_opt.rcv_tsecr)
-		tcp_ack_saw_tstamp(sk, flag);
-	else if (seq_rtt >= 0)
-		tcp_ack_no_tstamp(sk, seq_rtt, flag);
+	/* RFC6298: only reset backoff on valid RTT measurement. */
+	inet_csk(sk)->icsk_backoff = 0;
 }
 
 /* Compute time elapsed between (last) SYNACK and the ACK completing 3WHS. */
@@ -2989,8 +2960,6 @@ static int tcp_clean_rtx_queue(struct sock *sk, int prior_fackets,
 			if (sacked & TCPCB_SACKED_RETRANS)
 				tp->retrans_out -= acked_pcount;
 			flag |= FLAG_RETRANS_DATA_ACKED;
-			ca_seq_rtt = -1;
-			seq_rtt = -1;
 		} else {
 			ca_seq_rtt = now - scb->when;
 			last_ackt = skb->tstamp;
-- 
1.8.3

* [PATCH 3/4 net-next] tcp: measure RTT from new SACK
  2013-07-21  5:39 [PATCH 1/4 net-next] tcp: consolidate SYNACK RTT sampling Yuchung Cheng
  2013-07-21  5:39 ` [PATCH 2/4 net-next] tcp: prefer packet timing to TS-ECR for RTT Yuchung Cheng
@ 2013-07-21  5:39 ` Yuchung Cheng
  2013-07-22  3:32   ` Neal Cardwell
  2013-07-21  5:39 ` [PATCH 4/4 net-next] tcp: use RTT from SACK for RTO Yuchung Cheng
  2013-07-22  2:45 ` [PATCH 1/4 net-next] tcp: consolidate SYNACK RTT sampling Neal Cardwell
  3 siblings, 1 reply; 8+ messages in thread
From: Yuchung Cheng @ 2013-07-21  5:39 UTC (permalink / raw)
  To: davem, ncardwell, edumazet; +Cc: nanditad, netdev, Yuchung Cheng

Take an RTT sample if an ACK selectively acks some sequences that
have never been retransmitted. Karn's algorithm does not apply
even if that ACK (s)acks other retransmitted sequences, because the
ACK must have been generated by an original, if perhaps out-of-order,
packet. There is no ambiguity.

Such RTT samples allow the sender to estimate the RTO during loss
recovery and packet reordering events. They are useful even with
TCP timestamps, because during these events SND.UNA may not
advance, preventing RTT samples from TS ECR (hence the FLAG_ACKED
check before calling tcp_ack_update_rtt()). Therefore this new
RTT source complements the existing ACK and TS ECR RTT mechanisms.

This patch does not update the RTO; that is done in the next patch.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
---
 net/ipv4/tcp_input.c | 21 +++++++++++++++------
 1 file changed, 15 insertions(+), 6 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index b6357ec..46aeef1 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -1048,6 +1048,7 @@ struct tcp_sacktag_state {
 	int reord;
 	int fack_count;
 	int flag;
+	s32 rtt; /* RTT measured by SACKing never-retransmitted data */
 };
 
 /* Check if skb is fully within the SACK block. In presence of GSO skbs,
@@ -1108,7 +1109,7 @@ static int tcp_match_skb_to_sack(struct sock *sk, struct sk_buff *skb,
 static u8 tcp_sacktag_one(struct sock *sk,
 			  struct tcp_sacktag_state *state, u8 sacked,
 			  u32 start_seq, u32 end_seq,
-			  bool dup_sack, int pcount)
+			  int dup_sack, int pcount, u32 lsndtime)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	int fack_count = state->fack_count;
@@ -1148,6 +1149,7 @@ static u8 tcp_sacktag_one(struct sock *sk,
 							   state->reord);
 				if (!after(end_seq, tp->high_seq))
 					state->flag |= FLAG_ORIG_SACK_ACKED;
+				state->rtt = tcp_time_stamp - lsndtime;
 			}
 
 			if (sacked & TCPCB_LOST) {
@@ -1205,7 +1207,8 @@ static bool tcp_shifted_skb(struct sock *sk, struct sk_buff *skb,
 	 * tcp_highest_sack_seq() when skb is highest_sack.
 	 */
 	tcp_sacktag_one(sk, state, TCP_SKB_CB(skb)->sacked,
-			start_seq, end_seq, dup_sack, pcount);
+			start_seq, end_seq, dup_sack, pcount,
+			TCP_SKB_CB(skb)->when);
 
 	if (skb == tp->lost_skb_hint)
 		tp->lost_cnt_hint += pcount;
@@ -1479,7 +1482,8 @@ static struct sk_buff *tcp_sacktag_walk(struct sk_buff *skb, struct sock *sk,
 						TCP_SKB_CB(skb)->seq,
 						TCP_SKB_CB(skb)->end_seq,
 						dup_sack,
-						tcp_skb_pcount(skb));
+						tcp_skb_pcount(skb),
+						TCP_SKB_CB(skb)->when);
 
 			if (!before(TCP_SKB_CB(skb)->seq,
 				    tcp_highest_sack_seq(tp)))
@@ -1536,7 +1540,7 @@ static int tcp_sack_cache_ok(const struct tcp_sock *tp, const struct tcp_sack_bl
 
 static int
 tcp_sacktag_write_queue(struct sock *sk, const struct sk_buff *ack_skb,
-			u32 prior_snd_una)
+			u32 prior_snd_una, s32 *sack_rtt)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	const unsigned char *ptr = (skb_transport_header(ack_skb) +
@@ -1554,6 +1558,7 @@ tcp_sacktag_write_queue(struct sock *sk, const struct sk_buff *ack_skb,
 
 	state.flag = 0;
 	state.reord = tp->packets_out;
+	state.rtt = -1;
 
 	if (!tp->sacked_out) {
 		if (WARN_ON(tp->fackets_out))
@@ -1737,6 +1742,7 @@ out:
 	WARN_ON((int)tp->retrans_out < 0);
 	WARN_ON((int)tcp_packets_in_flight(tp) < 0);
 #endif
+	*sack_rtt = state.rtt;
 	return state.flag;
 }
 
@@ -3254,6 +3260,7 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
 	int prior_packets = tp->packets_out;
 	const int prior_unsacked = tp->packets_out - tp->sacked_out;
 	int acked = 0; /* Number of packets newly acked */
+	s32 sack_rtt = -1;
 
 	/* If the ack is older than previous acks
 	 * then we can probably ignore it.
@@ -3310,7 +3317,8 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
 		flag |= tcp_ack_update_window(sk, skb, ack, ack_seq);
 
 		if (TCP_SKB_CB(skb)->sacked)
-			flag |= tcp_sacktag_write_queue(sk, skb, prior_snd_una);
+			flag |= tcp_sacktag_write_queue(sk, skb, prior_snd_una,
+							&sack_rtt);
 
 		if (TCP_ECN_rcv_ecn_echo(tp, tcp_hdr(skb)))
 			flag |= FLAG_ECE;
@@ -3382,7 +3390,8 @@ old_ack:
 	 * If data was DSACKed, see if we can undo a cwnd reduction.
 	 */
 	if (TCP_SKB_CB(skb)->sacked) {
-		flag |= tcp_sacktag_write_queue(sk, skb, prior_snd_una);
+		flag |= tcp_sacktag_write_queue(sk, skb, prior_snd_una,
+						&sack_rtt);
 		tcp_fastretrans_alert(sk, acked, prior_unsacked,
 				      is_dupack, flag);
 	}
-- 
1.8.3

* [PATCH 4/4 net-next] tcp: use RTT from SACK for RTO
  2013-07-21  5:39 [PATCH 1/4 net-next] tcp: consolidate SYNACK RTT sampling Yuchung Cheng
  2013-07-21  5:39 ` [PATCH 2/4 net-next] tcp: prefer packet timing to TS-ECR for RTT Yuchung Cheng
  2013-07-21  5:39 ` [PATCH 3/4 net-next] tcp: measure RTT from new SACK Yuchung Cheng
@ 2013-07-21  5:39 ` Yuchung Cheng
  2013-07-22  3:40   ` Neal Cardwell
  2013-07-22  2:45 ` [PATCH 1/4 net-next] tcp: consolidate SYNACK RTT sampling Neal Cardwell
  3 siblings, 1 reply; 8+ messages in thread
From: Yuchung Cheng @ 2013-07-21  5:39 UTC (permalink / raw)
  To: davem, ncardwell, edumazet; +Cc: nanditad, netdev, Yuchung Cheng

If RTT is not available because Karn's check failed or no
new packet was acked, use the RTT measured from SACK to estimate
the RTO. The sender can then continue to estimate the RTO during loss
recovery or reordering events upon receiving non-partial ACKs.

This also changes when the RTO is re-armed. Previously it was
only re-armed when some data was cumulatively acknowledged (i.e.,
SND.UNA advanced), but now it is re-armed whenever the RTT estimator
is updated. This is particularly useful for reducing spurious
timeouts caused by buffer bloat, including on cellular carriers [1],
and for RTT estimation during reordering events.

[1] "An In-depth Study of LTE: Effect of Network Protocol and
 Application Behavior on Performance", In Proc. of SIGCOMM 2013
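
For background, a minimal user-space sketch of the RFC6298 estimator
that tcp_rtt_estimator() plus tcp_set_rto() implement (floating point
for clarity; the kernel keeps scaled fixed-point srtt/mdev state and
clamps the RTO to [TCP_RTO_MIN, TCP_RTO_MAX]):

	#include <math.h>
	#include <stdbool.h>

	struct rtt_est {
		double srtt;	/* smoothed RTT estimate */
		double rttvar;	/* RTT variation estimate */
		double rto;	/* retransmission timeout, seconds */
		bool   seeded;	/* has the first sample been taken? */
	};

	static void rtt_sample(struct rtt_est *e, double r)
	{
		if (!e->seeded) {	/* first measurement (RFC6298 2.2) */
			e->srtt = r;
			e->rttvar = r / 2;
			e->seeded = true;
		} else {		/* later measurements (RFC6298 2.3);
					 * rttvar must use the old srtt */
			e->rttvar = 0.75 * e->rttvar + 0.25 * fabs(e->srtt - r);
			e->srtt = 0.875 * e->srtt + 0.125 * r;
		}
		e->rto = e->srtt + 4 * e->rttvar;
		if (e->rto < 1.0)	/* RFC minimum; Linux uses 200ms */
			e->rto = 1.0;
	}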

Signed-off-by: Yuchung Cheng <ycheng@google.com>
---
 net/ipv4/tcp_input.c | 23 ++++++++++++++---------
 1 file changed, 14 insertions(+), 9 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 46aeef1..8d2f712 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2798,8 +2798,8 @@ static void tcp_fastretrans_alert(struct sock *sk, const int acked,
 	tcp_xmit_retransmit_queue(sk);
 }
 
-static inline void tcp_ack_update_rtt(struct sock *sk, const int flag,
-				      s32 seq_rtt)
+static inline bool tcp_ack_update_rtt(struct sock *sk, const int flag,
+				      s32 seq_rtt, s32 sack_rtt)
 {
 	const struct tcp_sock *tp = tcp_sk(sk);
 
@@ -2811,6 +2811,9 @@ static inline void tcp_ack_update_rtt(struct sock *sk, const int flag,
 	if (flag & FLAG_RETRANS_DATA_ACKED)
 		seq_rtt = -1;
 
+	if (seq_rtt < 0)
+		seq_rtt = sack_rtt;
+
 	/* RTTM Rule: A TSecr value received in a segment is used to
 	 * update the averaged RTT measurement only if the segment
 	 * acknowledges some new data, i.e., only if it advances the
@@ -2821,13 +2824,14 @@ static inline void tcp_ack_update_rtt(struct sock *sk, const int flag,
 		seq_rtt = tcp_time_stamp - tp->rx_opt.rcv_tsecr;
 
 	if (seq_rtt < 0)
-		return;
+		return false;
 
 	tcp_rtt_estimator(sk, seq_rtt);
 	tcp_set_rto(sk);
 
 	/* RFC6298: only reset backoff on valid RTT measurement. */
 	inet_csk(sk)->icsk_backoff = 0;
+	return true;
 }
 
 /* Compute time elapsed between (last) SYNACK and the ACK completing 3WHS. */
@@ -2838,7 +2842,7 @@ static void tcp_synack_rtt_meas(struct sock *sk, struct request_sock *req)
 
 	if (tp->lsndtime && !tp->total_retrans)
 		seq_rtt = tcp_time_stamp - tp->lsndtime;
-	tcp_ack_update_rtt(sk, FLAG_SYN_ACKED, seq_rtt);
+	tcp_ack_update_rtt(sk, FLAG_SYN_ACKED, seq_rtt, -1);
 }
 
 static void tcp_cong_avoid(struct sock *sk, u32 ack, u32 in_flight)
@@ -2927,7 +2931,7 @@ static u32 tcp_tso_acked(struct sock *sk, struct sk_buff *skb)
  * arrived at the other end.
  */
 static int tcp_clean_rtx_queue(struct sock *sk, int prior_fackets,
-			       u32 prior_snd_una)
+			       u32 prior_snd_una, s32 sack_rtt)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	const struct inet_connection_sock *icsk = inet_csk(sk);
@@ -3017,6 +3021,10 @@ static int tcp_clean_rtx_queue(struct sock *sk, int prior_fackets,
 	if (skb && (TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED))
 		flag |= FLAG_SACK_RENEGING;
 
+	if (tcp_ack_update_rtt(sk, flag, seq_rtt, sack_rtt) ||
+	    (flag & FLAG_ACKED))
+		tcp_rearm_rto(sk);
+
 	if (flag & FLAG_ACKED) {
 		const struct tcp_congestion_ops *ca_ops
 			= inet_csk(sk)->icsk_ca_ops;
@@ -3026,9 +3034,6 @@ static int tcp_clean_rtx_queue(struct sock *sk, int prior_fackets,
 			tcp_mtup_probe_success(sk);
 		}
 
-		tcp_ack_update_rtt(sk, flag, seq_rtt);
-		tcp_rearm_rto(sk);
-
 		if (tcp_is_reno(tp)) {
 			tcp_remove_reno_sacks(sk, pkts_acked);
 		} else {
@@ -3337,7 +3342,7 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
 
 	/* See if we can take anything off of the retransmit queue. */
 	acked = tp->packets_out;
-	flag |= tcp_clean_rtx_queue(sk, prior_fackets, prior_snd_una);
+	flag |= tcp_clean_rtx_queue(sk, prior_fackets, prior_snd_una, sack_rtt);
 	acked -= tp->packets_out;
 
 	if (tcp_ack_is_dubious(sk, flag)) {
-- 
1.8.3

* Re: [PATCH 1/4 net-next] tcp: consolidate SYNACK RTT sampling
  2013-07-21  5:39 [PATCH 1/4 net-next] tcp: consolidate SYNACK RTT sampling Yuchung Cheng
                   ` (2 preceding siblings ...)
  2013-07-21  5:39 ` [PATCH 4/4 net-next] tcp: use RTT from SACK for RTO Yuchung Cheng
@ 2013-07-22  2:45 ` Neal Cardwell
  3 siblings, 0 replies; 8+ messages in thread
From: Neal Cardwell @ 2013-07-22  2:45 UTC (permalink / raw)
  To: Yuchung Cheng; +Cc: David Miller, Eric Dumazet, Nandita Dukkipati, Netdev

On Sun, Jul 21, 2013 at 1:39 AM, Yuchung Cheng <ycheng@google.com> wrote:
> This patch series improves RTT sampling in three ways:
> 1. Sample RTT during fast recovery and reordering events.
> 2. Favor ACK-based RTT over timestamps because of broken TS ECR fields.
> 3. Consolidate the RTT measurement logic.
>
> The first patch consolidates SYNACK and other RTT measurements to use a
> central function, tcp_ack_update_rtt(). A (small) bonus is that SYNACK
> RTT measurement now happens after the PAWS check, potentially reducing the
> impact of seeding the RTO from bad TCP timestamp values.
>
> Signed-off-by: Yuchung Cheng <ycheng@google.com>

The idea seems very nice to me, but I think a line or two got dropped
in a rebase. :-)

In the previous version of this patch you moved the TFO-specific call
to tcp_synack_rtt_meas() out to a spot in tcp_rcv_state_process()
where both TFO and non-TFO passive connections would hit it. This
version of the patch removes the existing calls to
tcp_synack_rtt_meas() but does not add that new call to replace them.
(And it does not seem to be added back in a later patch in the
series.)
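
For concreteness, here is the placement I mean, sketched from memory
of the previous version (untested, details approximate):

	/* in tcp_rcv_state_process(), TCP_SYN_RECV handling: */
	case TCP_SYN_RECV:
		if (!acceptable)
			return 1;

		/* Both TFO and non-TFO passive sockets pass through here
		 * once the ACK completing the 3WHS is accepted. req may be
		 * NULL for non-TFO, but the new helper only reads fields
		 * already copied into tp, so one call covers both:
		 */
		tcp_synack_rtt_meas(sk, req);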

Other than that, looks great.

neal

* Re: [PATCH 2/4 net-next] tcp: prefer packet timing to TS-ECR for RTT
  2013-07-21  5:39 ` [PATCH 2/4 net-next] tcp: prefer packet timing to TS-ECR for RTT Yuchung Cheng
@ 2013-07-22  3:17   ` Neal Cardwell
  0 siblings, 0 replies; 8+ messages in thread
From: Neal Cardwell @ 2013-07-22  3:17 UTC (permalink / raw)
  To: Yuchung Cheng; +Cc: David Miller, Eric Dumazet, Nandita Dukkipati, Netdev

On Sun, Jul 21, 2013 at 1:39 AM, Yuchung Cheng <ycheng@google.com> wrote:
> Prefer packet timings to TS-ECR for RTT measurements when both
> sources are available, because broken middle-boxes and remote
> peers can return packets with corrupted TS ECR fields. Most
> congestion controls that require RTT signals likewise favor
> timing-based sources. Also check for bad TS ECR values to avoid RTT
> blow-ups, which have happened on production web servers.
>
> Signed-off-by: Yuchung Cheng <ycheng@google.com>

Acked-by: Neal Cardwell <ncardwell@google.com>

* Re: [PATCH 3/4 net-next] tcp: measure RTT from new SACK
  2013-07-21  5:39 ` [PATCH 3/4 net-next] tcp: measure RTT from new SACK Yuchung Cheng
@ 2013-07-22  3:32   ` Neal Cardwell
  0 siblings, 0 replies; 8+ messages in thread
From: Neal Cardwell @ 2013-07-22  3:32 UTC (permalink / raw)
  To: Yuchung Cheng; +Cc: David Miller, Eric Dumazet, Nandita Dukkipati, Netdev

On Sun, Jul 21, 2013 at 1:39 AM, Yuchung Cheng <ycheng@google.com> wrote:
> @@ -1148,6 +1149,7 @@ static u8 tcp_sacktag_one(struct sock *sk,
>                                                            state->reord);
>                                 if (!after(end_seq, tp->high_seq))
>                                         state->flag |= FLAG_ORIG_SACK_ACKED;
> +                               state->rtt = tcp_time_stamp - lsndtime;

To be safer we should probably take the RTT sample for the left-most
SACKed segment, just like tcp_clean_rtx_queue() takes a seq_rtt RTT
sample for the left-most cumulatively ACKed segment. Since
tcp_sacktag_write_queue() already sorts the SACK blocks, I think that
means tcp_sacktag_one() just needs to set state->rtt only if it is
still less than 0 (analogous to the update logic in
tcp_clean_rtx_queue()).
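
I.e., roughly (untested, using the patch's lsndtime parameter):

	/* only the first (left-most) newly SACKed skb sets the sample */
	if (state->rtt < 0)
		state->rtt = tcp_time_stamp - lsndtime;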

Also, FWIW I think the "lsndtime" variable name here is a little confusing,
since it is being used for the send time of the particular skb, which is
quite different from the similarly-named existing variable, tp->lsndtime.
Maybe "xmit_stamp"?

Otherwise, looks good. This is going to be nice to have.

neal

* Re: [PATCH 4/4 net-next] tcp: use RTT from SACK for RTO
  2013-07-21  5:39 ` [PATCH 4/4 net-next] tcp: use RTT from SACK for RTO Yuchung Cheng
@ 2013-07-22  3:40   ` Neal Cardwell
  0 siblings, 0 replies; 8+ messages in thread
From: Neal Cardwell @ 2013-07-22  3:40 UTC (permalink / raw)
  To: Yuchung Cheng; +Cc: David Miller, Eric Dumazet, Nandita Dukkipati, Netdev

On Sun, Jul 21, 2013 at 1:39 AM, Yuchung Cheng <ycheng@google.com> wrote:
> If RTT is not available because Karn's check failed or no
> new packet was acked, use the RTT measured from SACK to estimate
> the RTO. The sender can then continue to estimate the RTO during loss
> recovery or reordering events upon receiving non-partial ACKs.
>
> This also changes when the RTO is re-armed. Previously it was
> only re-armed when some data was cumulatively acknowledged (i.e.,
> SND.UNA advanced), but now it is re-armed whenever the RTT estimator
> is updated. This is particularly useful for reducing spurious
> timeouts caused by buffer bloat, including on cellular carriers [1],
> and for RTT estimation during reordering events.
>
> [1] "An In-depth Study of LTE: Effect of Network Protocol and
>  Application Behavior on Performance", In Proc. of SIGCOMM 2013
>
> Signed-off-by: Yuchung Cheng <ycheng@google.com>

Acked-by: Neal Cardwell <ncardwell@google.com>
