* Re: [PATCH net-next 2/7] tcp: track min RTT using windowed min-filter
  2015-10-17  4:57 ` [PATCH net-next 2/7] tcp: track min RTT using windowed min-filter Yuchung Cheng
@ 2015-10-14  9:28   ` Andrew Shewmaker
  2015-10-18 14:33     ` Neal Cardwell
  2015-10-19  4:39     ` Andrew Shewmaker
  0 siblings, 2 replies; 15+ messages in thread
From: Andrew Shewmaker @ 2015-10-14  9:28 UTC (permalink / raw)
  To: Yuchung Cheng; +Cc: davem, netdev, Neal Cardwell, Eric Dumazet

On Fri, Oct 16, 2015 at 09:57:42PM -0700, Yuchung Cheng wrote:
...
> diff --git a/include/linux/tcp.h b/include/linux/tcp.h
> index 86a7eda..90edef5 100644
> --- a/include/linux/tcp.h
> +++ b/include/linux/tcp.h
> @@ -217,6 +217,9 @@ struct tcp_sock {
>  	u32	mdev_max_us;	/* maximal mdev for the last rtt period	*/
>  	u32	rttvar_us;	/* smoothed mdev_max			*/
>  	u32	rtt_seq;	/* sequence number to update rttvar	*/
> +	struct rtt_meas {
> +		u32 rtt, ts;	/* RTT in usec and sampling time in jiffies. */
> +	} rtt_min[3];

First, thanks for all the work in this patch series. In particular,
applying Kern's check to ca_seq_rtt_us should fix some bad behavior
I've observed recently.

I only have a couple comments, so I abbreviated most of your patch.
Should rtt_meas.rtt be rtt_meas.rtt_us in order to be more consistent
with the naming of related variables?

...
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 38743e5..e177386 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
...
> @@ -2961,7 +3028,7 @@ void tcp_synack_rtt_meas(struct sock *sk, struct request_sock *req)
>  		rtt_us = skb_mstamp_us_delta(&now, &tcp_rsk(req)->snt_synack);
>  	}
>  
> -	tcp_ack_update_rtt(sk, FLAG_SYN_ACKED, rtt_us, -1L);
> +	tcp_ack_update_rtt(sk, FLAG_SYN_ACKED, rtt_us, -1L, rtt_us);
>  }

This didn't apply to net-next for me. I see seq_rtt_us instead of
rtt_us and a check on the existence of tp->srtt_us. Maybe I've
misapplied the patch? I'll try again and test the patch series
against the bad behavior I mentioned above as soon as I can.
Hopefully today.

-Andrew Shewmaker


* Re: [PATCH net-next 2/7] tcp: track min RTT using windowed min-filter
  2015-10-18 14:33     ` Neal Cardwell
@ 2015-10-14 13:25       ` Andrew Shewmaker
  0 siblings, 0 replies; 15+ messages in thread
From: Andrew Shewmaker @ 2015-10-14 13:25 UTC (permalink / raw)
  To: Neal Cardwell; +Cc: Yuchung Cheng, David Miller, Netdev, Eric Dumazet

On Sun, Oct 18, 2015 at 10:33:41AM -0400, Neal Cardwell wrote:
> On Wed, Oct 14, 2015 at 5:28 AM, Andrew Shewmaker <agshew@gmail.com> wrote:
> >> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> >> index 38743e5..e177386 100644
> >> --- a/net/ipv4/tcp_input.c
> >> +++ b/net/ipv4/tcp_input.c
> > ...
> >> @@ -2961,7 +3028,7 @@ void tcp_synack_rtt_meas(struct sock *sk, struct request_sock *req)
> >>               rtt_us = skb_mstamp_us_delta(&now, &tcp_rsk(req)->snt_synack);
> >>       }
> >>
> >> -     tcp_ack_update_rtt(sk, FLAG_SYN_ACKED, rtt_us, -1L);
> >> +     tcp_ack_update_rtt(sk, FLAG_SYN_ACKED, rtt_us, -1L, rtt_us);
> >>  }
> >
> > This didn't apply to net-next for me. I see seq_rtt_us instead of
> > rtt_us and a check on the existence of tp->srtt_us. Maybe I've
> > misapplied the patch?
> 
> This patch series applies cleanly for me against David Miller's
> net-next at SHA1 4be3158 (from Friday Oct 16). (Using "git am" on the
> mbox patches from http://patchwork.ozlabs.org/project/netdev/list/
> ...)
> 
> On top of what SHA1 are you applying the series?

Doesn't matter which one ... it was the wrong one :)

I'd executed a git pull before applying the patch set, but I ended up having to
force it. Once I did, it applied cleanly.

Apologies for the false alarm,

Andrew


* [PATCH net-next 0/7] RACK loss detection
@ 2015-10-17  4:57 Yuchung Cheng
  2015-10-17  4:57 ` [PATCH net-next 1/7] tcp: apply Kern's check on RTTs used for congestion control Yuchung Cheng
                   ` (7 more replies)
  0 siblings, 8 replies; 15+ messages in thread
From: Yuchung Cheng @ 2015-10-17  4:57 UTC (permalink / raw)
  To: davem; +Cc: netdev, Yuchung Cheng

RACK (Recent ACK) loss recovery uses the notion of time instead of
packet sequence (FACK) or counts (dupthresh).

It's inspired by the FACK heuristic in tcp_mark_lost_retrans(): when a
limited transmit (new data packet) is sacked in recovery, then any
retransmission sent before that newly sacked packet was sent must have
been lost, since at least one round trip time has elapsed.

But that existing heuristic from tcp_mark_lost_retrans()
has several limitations:
  1) it can't detect tail drops since it depends on limited transmit
  2) it's disabled upon reordering (assumes no reordering)
  3) it's only enabled in fast recovery but not timeout recovery

RACK addresses these limitations with a core idea: an unacknowledged
packet P1 is deemed lost if a packet P2 that was sent later is
s/acked, since at least one round trip has passed.

Since RACK cares about the time sequence instead of the data sequence
of packets, it can detect tail drops when a later retransmission is
s/acked, while FACK or dupthresh can't. For reordering RACK uses a
dynamically adjusted reordering window ("reo_wnd") to reduce false
positives on every (small) degree of reordering, similar to the delayed
Early Retransmit.
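
In rough C-like pseudocode, the core rule reduces to a single time
comparison (an illustrative sketch with made-up names and kernel-style
types, not code from this series):

	/* P1 is deemed lost once some packet sent more than reo_wnd
	 * later has already been s/acked; rack_mstamp_us is the send
	 * time of the most recently delivered packet.
	 */
	static bool rack_deems_lost(u64 p1_sent_us, u64 rack_mstamp_us,
				    u64 reo_wnd_us)
	{
		return rack_mstamp_us > p1_sent_us + reo_wnd_us;
	}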

In the current patch set RACK is only a supplemental loss detection
and does not trigger fast recovery. However we are developing RACK
to replace or consolidate FACK/dupthresh, early retransmit, and
thin-dupack. These heuristics all implicitly embed the notion of time.
For example, the delayed Early Retransmit is simply applying RACK
to trigger the fast recovery with small inflight.

RACK requires measuring the minimum RTT. Tracking a global min is less
robust due to traffic engineering pathing changes. Therefore it uses a
windowed filter by Kathleen Nichols. The min RTT can also be useful
for various other purposes like congestion control or stat monitoring.
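
As a hypothetical illustration (not part of this series), a congestion
control module of this era could consume the new tcp_min_rtt() accessor
from its pkts_acked() hook; this is a fragment, not a complete module:

	static void example_pkts_acked(struct sock *sk, u32 num_acked,
				       s32 rtt_us)
	{
		u32 min_rtt = tcp_min_rtt(tcp_sk(sk));	/* usec; ~0U = none */

		if (min_rtt == ~0U || rtt_us <= 0)
			return;
		/* Anything above the windowed min approximates queueing delay */
		if ((u32)rtt_us > 2 * min_rtt)
			pr_debug("rtt %d us > 2x windowed min %u us\n",
				 rtt_us, min_rtt);
	}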

This patch set has been used on Google servers for well over 1 year. RACK
has also been implemented in the QUIC protocol. We are submitting an
IETF draft as well.

Yuchung Cheng (7):
  tcp: apply Kern's check on RTTs used for congestion control
  tcp: track min RTT using windowed min-filter
  tcp: remove tcp_mark_lost_retrans()
  tcp: add tcp_tsopt_ecr_before helper
  tcp: skb_mstamp_after helper
  tcp: track the packet timings in RACK
  tcp: use RACK to detect losses

 Documentation/networking/ip-sysctl.txt |  17 ++++
 include/linux/skbuff.h                 |   9 ++
 include/linux/tcp.h                    |  11 +-
 include/net/tcp.h                      |  21 ++++
 net/ipv4/Makefile                      |   1 +
 net/ipv4/sysctl_net_ipv4.c             |  14 +++
 net/ipv4/tcp.c                         |   1 +
 net/ipv4/tcp_input.c                   | 180 +++++++++++++++++++--------------
 net/ipv4/tcp_minisocks.c               |   3 +
 net/ipv4/tcp_output.c                  |   6 --
 net/ipv4/tcp_recovery.c                | 109 ++++++++++++++++++++
 11 files changed, 286 insertions(+), 86 deletions(-)
 create mode 100644 net/ipv4/tcp_recovery.c

-- 
2.6.0.rc2.230.g3dd15c0


* [PATCH net-next 1/7] tcp: apply Kern's check on RTTs used for congestion control
  2015-10-17  4:57 [PATCH net-next 0/7] RACK loss detection Yuchung Cheng
@ 2015-10-17  4:57 ` Yuchung Cheng
  2016-02-02 19:30   ` Kenneth Klette Jonassen
  2015-10-17  4:57 ` [PATCH net-next 2/7] tcp: track min RTT using windowed min-filter Yuchung Cheng
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 15+ messages in thread
From: Yuchung Cheng @ 2015-10-17  4:57 UTC (permalink / raw)
  To: davem; +Cc: netdev, Yuchung Cheng, Neal Cardwell, Eric Dumazet

Currently ca_seq_rtt_us does not use Kern's check. Fix that by
checking if any packet acked is a retransmit, for both RTT used
for RTT estimation and congestion control.

Fixes: 5b08e47ca ("tcp: prefer packet timing to TS-ECR for RTT")
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/ipv4/tcp_input.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 3b35c3f..38743e5 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2925,9 +2925,6 @@ static inline bool tcp_ack_update_rtt(struct sock *sk, const int flag,
 	 * Karn's algorithm forbids taking RTT if some retransmitted data
 	 * is acked (RFC6298).
 	 */
-	if (flag & FLAG_RETRANS_DATA_ACKED)
-		seq_rtt_us = -1L;
-
 	if (seq_rtt_us < 0)
 		seq_rtt_us = sack_rtt_us;
 
@@ -3169,7 +3166,7 @@ static int tcp_clean_rtx_queue(struct sock *sk, int prior_fackets,
 		flag |= FLAG_SACK_RENEGING;
 
 	skb_mstamp_get(&now);
-	if (likely(first_ackt.v64)) {
+	if (likely(first_ackt.v64) && !(flag & FLAG_RETRANS_DATA_ACKED)) {
 		seq_rtt_us = skb_mstamp_us_delta(&now, &first_ackt);
 		ca_rtt_us = skb_mstamp_us_delta(&now, &last_ackt);
 	}
-- 
2.6.0.rc2.230.g3dd15c0


* [PATCH net-next 2/7] tcp: track min RTT using windowed min-filter
  2015-10-17  4:57 [PATCH net-next 0/7] RACK loss detection Yuchung Cheng
  2015-10-17  4:57 ` [PATCH net-next 1/7] tcp: apply Kern's check on RTTs used for congestion control Yuchung Cheng
@ 2015-10-17  4:57 ` Yuchung Cheng
  2015-10-14  9:28   ` Andrew Shewmaker
  2015-10-17  4:57 ` [PATCH net-next 3/7] tcp: remove tcp_mark_lost_retrans() Yuchung Cheng
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 15+ messages in thread
From: Yuchung Cheng @ 2015-10-17  4:57 UTC (permalink / raw)
  To: davem; +Cc: netdev, Yuchung Cheng, Neal Cardwell, Eric Dumazet

Kathleen Nichols' algorithm for tracking the minimum RTT of a
data stream over some measurement window. It uses constant space
and constant time per update. Yet it almost always delivers
the same minimum as an implementation that has to keep all
the data in the window. The measurement window is tunable via
sysctl.net.ipv4.tcp_min_rtt_wlen with a default value of 5 minutes.

The algorithm keeps track of the best, 2nd best & 3rd best min
values, maintaining an invariant that the measurement time of
the n'th best >= n-1'th best. It also makes sure that the three
values are widely separated in the time window since that bounds
the worst case error when the data is monotonically increasing
over the window.

Upon getting a new min, we can forget everything earlier because
it has no value - the new min is less than everything else in the
window by definition and it's the most recent. So we restart fresh
on every new min and overwrite the 2nd & 3rd choices. The same
property holds for the 2nd & 3rd best.

Therefore we have to maintain two invariants to maximize the
information in the samples, one on values (1st.v <= 2nd.v <=
3rd.v) and the other on times (now-win <= 1st.t <= 2nd.t <= 3rd.t <=
now). These invariants determine the structure of the code.

The RTT input to the windowed filter is the minimum RTT measured
from ACK or SACK, or, as a last resort, from TCP timestamps.

The accessor tcp_min_rtt() returns the minimum RTT seen in the
window. ~0U indicates it is not available. The minimum is 1usec
even if the true RTT is below that.
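
To see the promotion logic in action outside the kernel, here is a
minimal userspace transcription of the filter (illustrative only; an
8-tick window stands in for sysctl_tcp_min_rtt_wlen and the fake clock
never wraps):

	#include <stdio.h>
	#include <stdint.h>

	struct rtt_meas { uint32_t rtt, ts; };
	static struct rtt_meas m[3] = { { ~0u, 0 }, { ~0u, 0 }, { ~0u, 0 } };

	#define WLEN 8u	/* window length in fake clock ticks */

	static void update_rtt_min(uint32_t now, uint32_t rtt)
	{
		struct rtt_meas rttm = { rtt ? rtt : 1, now };
		uint32_t elapsed;

		if (rttm.rtt <= m[0].rtt)	/* new min: forget everything else */
			m[0] = m[1] = m[2] = rttm;
		else if (rttm.rtt <= m[1].rtt)
			m[1] = m[2] = rttm;
		else if (rttm.rtt <= m[2].rtt)
			m[2] = rttm;

		elapsed = now - m[0].ts;
		if (elapsed > WLEN) {		/* 1st choice aged out: promote */
			m[0] = m[1]; m[1] = m[2]; m[2] = rttm;
			if (now - m[0].ts > WLEN) {
				m[0] = m[1]; m[1] = rttm;
				if (now - m[0].ts > WLEN)
					m[0] = rttm;
			}
		} else if (m[1].ts == m[0].ts && elapsed > WLEN / 4) {
			m[2] = m[1] = rttm;	/* 2nd choice from 2nd quarter */
		} else if (m[2].ts == m[1].ts && elapsed > WLEN / 2) {
			m[2] = rttm;		/* 3rd choice from last half */
		}
	}

	int main(void)
	{
		/* rising RTTs: the old min ages out via the 2nd/3rd choices */
		uint32_t rtts[] = { 40, 50, 55, 60, 70, 80, 90, 95, 99, 120, 130 };

		for (uint32_t t = 0; t < sizeof(rtts) / sizeof(rtts[0]); t++) {
			update_rtt_min(t, rtts[t]);
			printf("t=%2u rtt=%3u min=%3u\n", t, rtts[t], m[0].rtt);
		}
		return 0;
	}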

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 Documentation/networking/ip-sysctl.txt |  8 ++++
 include/linux/tcp.h                    |  3 ++
 include/net/tcp.h                      |  7 +++
 net/ipv4/sysctl_net_ipv4.c             |  7 +++
 net/ipv4/tcp.c                         |  1 +
 net/ipv4/tcp_input.c                   | 78 +++++++++++++++++++++++++++++++---
 net/ipv4/tcp_minisocks.c               |  1 +
 7 files changed, 100 insertions(+), 5 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index ebe94f2..502d6a5 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -384,6 +384,14 @@ tcp_mem - vector of 3 INTEGERs: min, pressure, max
 	Defaults are calculated at boot time from amount of available
 	memory.
 
+tcp_min_rtt_wlen - INTEGER
+	The window length of the windowed min filter to track the minimum RTT.
+	A shorter window lets a flow more quickly pick up new (higher)
+	minimum RTT when it is moved to a longer path (e.g., due to traffic
+	engineering). A longer window makes the filter more resistant to RTT
+	inflations such as transient congestion. The unit is seconds.
+	Default: 300
+
 tcp_moderate_rcvbuf - BOOLEAN
 	If set, TCP performs receive buffer auto-tuning, attempting to
 	automatically size the buffer (no greater than tcp_rmem[2]) to
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 86a7eda..90edef5 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -217,6 +217,9 @@ struct tcp_sock {
 	u32	mdev_max_us;	/* maximal mdev for the last rtt period	*/
 	u32	rttvar_us;	/* smoothed mdev_max			*/
 	u32	rtt_seq;	/* sequence number to update rttvar	*/
+	struct rtt_meas {
+		u32 rtt, ts;	/* RTT in usec and sampling time in jiffies. */
+	} rtt_min[3];
 
 	u32	packets_out;	/* Packets which are "in flight"	*/
 	u32	retrans_out;	/* Retransmitted packets out		*/
diff --git a/include/net/tcp.h b/include/net/tcp.h
index a6be56d..4575f0e 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -279,6 +279,7 @@ extern int sysctl_tcp_limit_output_bytes;
 extern int sysctl_tcp_challenge_ack_limit;
 extern unsigned int sysctl_tcp_notsent_lowat;
 extern int sysctl_tcp_min_tso_segs;
+extern int sysctl_tcp_min_rtt_wlen;
 extern int sysctl_tcp_autocorking;
 extern int sysctl_tcp_invalid_ratelimit;
 extern int sysctl_tcp_pacing_ss_ratio;
@@ -671,6 +672,12 @@ static inline bool tcp_ca_dst_locked(const struct dst_entry *dst)
 	return dst_metric_locked(dst, RTAX_CC_ALGO);
 }
 
+/* Minimum RTT in usec. ~0 means not available. */
+static inline u32 tcp_min_rtt(const struct tcp_sock *tp)
+{
+	return tp->rtt_min[0].rtt;
+}
+
 /* Compute the actual receive window we are currently advertising.
  * Rcv_nxt can be after the window if our peer push more data
  * than the offered window.
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 894da3a..13ab434 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -577,6 +577,13 @@ static struct ctl_table ipv4_table[] = {
 		.proc_handler	= proc_dointvec
 	},
 	{
+		.procname	= "tcp_min_rtt_wlen",
+		.data		= &sysctl_tcp_min_rtt_wlen,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec
+	},
+	{
 		.procname	= "tcp_low_latency",
 		.data		= &sysctl_tcp_low_latency,
 		.maxlen		= sizeof(int),
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index ac1bdbb..0cfa7c0 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -388,6 +388,7 @@ void tcp_init_sock(struct sock *sk)
 
 	icsk->icsk_rto = TCP_TIMEOUT_INIT;
 	tp->mdev_us = jiffies_to_usecs(TCP_TIMEOUT_INIT);
+	tp->rtt_min[0].rtt = ~0U;
 
 	/* So many TCP implementations out there (incorrectly) count the
 	 * initial SYN frame in their delayed-ACK and congestion control
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 38743e5..e177386 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -95,6 +95,7 @@ int sysctl_tcp_stdurg __read_mostly;
 int sysctl_tcp_rfc1337 __read_mostly;
 int sysctl_tcp_max_orphans __read_mostly = NR_FILE;
 int sysctl_tcp_frto __read_mostly = 2;
+int sysctl_tcp_min_rtt_wlen __read_mostly = 300;
 
 int sysctl_tcp_thin_dupack __read_mostly;
 
@@ -2915,8 +2916,69 @@ static void tcp_fastretrans_alert(struct sock *sk, const int acked,
 	tcp_xmit_retransmit_queue(sk);
 }
 
+/* Kathleen Nichols' algorithm for tracking the minimum value of
+ * a data stream over some fixed time interval. (E.g., the minimum
+ * RTT over the past five minutes.) It uses constant space and constant
+ * time per update yet almost always delivers the same minimum as an
+ * implementation that has to keep all the data in the window.
+ *
+ * The algorithm keeps track of the best, 2nd best & 3rd best min
+ * values, maintaining an invariant that the measurement time of the
+ * n'th best >= n-1'th best. It also makes sure that the three values
+ * are widely separated in the time window since that bounds the worse
+ * case error when that data is monotonically increasing over the window.
+ *
+ * Upon getting a new min, we can forget everything earlier because it
+ * has no value - the new min is <= everything else in the window by
+ * definition and it's the most recent. So we restart fresh on every new min
+ * and overwrites 2nd & 3rd choices. The same property holds for 2nd & 3rd
+ * best.
+ */
+static void tcp_update_rtt_min(struct sock *sk, u32 rtt_us)
+{
+	const u32 now = tcp_time_stamp, wlen = sysctl_tcp_min_rtt_wlen * HZ;
+	struct rtt_meas *m = tcp_sk(sk)->rtt_min;
+	struct rtt_meas rttm = { .rtt = (rtt_us ? : 1), .ts = now };
+	u32 elapsed;
+
+	/* Check if the new measurement updates the 1st, 2nd, or 3rd choices */
+	if (unlikely(rttm.rtt <= m[0].rtt))
+		m[0] = m[1] = m[2] = rttm;
+	else if (rttm.rtt <= m[1].rtt)
+		m[1] = m[2] = rttm;
+	else if (rttm.rtt <= m[2].rtt)
+		m[2] = rttm;
+
+	elapsed = now - m[0].ts;
+	if (unlikely(elapsed > wlen)) {
+		/* Passed entire window without a new min so make 2nd choice
+		 * the new min & 3rd choice the new 2nd. So forth and so on.
+		 */
+		m[0] = m[1];
+		m[1] = m[2];
+		m[2] = rttm;
+		if (now - m[0].ts > wlen) {
+			m[0] = m[1];
+			m[1] = rttm;
+			if (now - m[0].ts > wlen)
+				m[0] = rttm;
+		}
+	} else if (m[1].ts == m[0].ts && elapsed > wlen / 4) {
+		/* Passed a quarter of the window without a new min so
+		 * take 2nd choice from the 2nd quarter of the window.
+		 */
+		m[2] = m[1] = rttm;
+	} else if (m[2].ts == m[1].ts && elapsed > wlen / 2) {
+		/* Passed half the window without a new min so take the 3rd
+		 * choice from the last half of the window.
+		 */
+		m[2] = rttm;
+	}
+}
+
 static inline bool tcp_ack_update_rtt(struct sock *sk, const int flag,
-				      long seq_rtt_us, long sack_rtt_us)
+				      long seq_rtt_us, long sack_rtt_us,
+				      long ca_rtt_us)
 {
 	const struct tcp_sock *tp = tcp_sk(sk);
 
@@ -2936,11 +2998,16 @@ static inline bool tcp_ack_update_rtt(struct sock *sk, const int flag,
 	 */
 	if (seq_rtt_us < 0 && tp->rx_opt.saw_tstamp && tp->rx_opt.rcv_tsecr &&
 	    flag & FLAG_ACKED)
-		seq_rtt_us = jiffies_to_usecs(tcp_time_stamp - tp->rx_opt.rcv_tsecr);
-
+		seq_rtt_us = ca_rtt_us = jiffies_to_usecs(tcp_time_stamp -
+							  tp->rx_opt.rcv_tsecr);
 	if (seq_rtt_us < 0)
 		return false;
 
+	/* ca_rtt_us >= 0 is counting on the invariant that ca_rtt_us is
+	 * always taken together with ACK, SACK, or TS-opts. Any negative
+	 * values will be skipped with the seq_rtt_us < 0 check above.
+	 */
+	tcp_update_rtt_min(sk, ca_rtt_us);
 	tcp_rtt_estimator(sk, seq_rtt_us);
 	tcp_set_rto(sk);
 
@@ -2961,7 +3028,7 @@ void tcp_synack_rtt_meas(struct sock *sk, struct request_sock *req)
 		rtt_us = skb_mstamp_us_delta(&now, &tcp_rsk(req)->snt_synack);
 	}
 
-	tcp_ack_update_rtt(sk, FLAG_SYN_ACKED, rtt_us, -1L);
+	tcp_ack_update_rtt(sk, FLAG_SYN_ACKED, rtt_us, -1L, rtt_us);
 }
 
 
@@ -3175,7 +3242,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, int prior_fackets,
 		ca_rtt_us = skb_mstamp_us_delta(&now, &sack->last_sackt);
 	}
 
-	rtt_update = tcp_ack_update_rtt(sk, flag, seq_rtt_us, sack_rtt_us);
+	rtt_update = tcp_ack_update_rtt(sk, flag, seq_rtt_us, sack_rtt_us,
+					ca_rtt_us);
 
 	if (flag & FLAG_ACKED) {
 		tcp_rearm_rto(sk);
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 41828bd..b875c28 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -470,6 +470,7 @@ struct sock *tcp_create_openreq_child(const struct sock *sk,
 
 		newtp->srtt_us = 0;
 		newtp->mdev_us = jiffies_to_usecs(TCP_TIMEOUT_INIT);
+		newtp->rtt_min[0].rtt = ~0U;
 		newicsk->icsk_rto = TCP_TIMEOUT_INIT;
 
 		newtp->packets_out = 0;
-- 
2.6.0.rc2.230.g3dd15c0


* [PATCH net-next 3/7] tcp: remove tcp_mark_lost_retrans()
  2015-10-17  4:57 [PATCH net-next 0/7] RACK loss detection Yuchung Cheng
  2015-10-17  4:57 ` [PATCH net-next 1/7] tcp: apply Kern's check on RTTs used for congestion control Yuchung Cheng
  2015-10-17  4:57 ` [PATCH net-next 2/7] tcp: track min RTT using windowed min-filter Yuchung Cheng
@ 2015-10-17  4:57 ` Yuchung Cheng
  2015-10-17  4:57 ` [PATCH net-next 4/7] tcp: add tcp_tsopt_ecr_before helper Yuchung Cheng
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 15+ messages in thread
From: Yuchung Cheng @ 2015-10-17  4:57 UTC (permalink / raw)
  To: davem; +Cc: netdev, Yuchung Cheng, Neal Cardwell, Eric Dumazet

Remove the existing lost retransmit detection because RACK subsumes
it completely. This also stops overloading the ack_seq field of
the skb control block.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/linux/tcp.h   |  2 --
 net/ipv4/tcp_input.c  | 65 ---------------------------------------------------
 net/ipv4/tcp_output.c |  6 -----
 3 files changed, 73 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 90edef5..8c54863 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -283,8 +283,6 @@ struct tcp_sock {
 	int     lost_cnt_hint;
 	u32     retransmit_high;	/* L-bits may be on up to this seqno */
 
-	u32	lost_retrans_low;	/* Sent seq after any rxmit (lowest) */
-
 	u32	prior_ssthresh; /* ssthresh saved at recovery start	*/
 	u32	high_seq;	/* snd_nxt at onset of congestion	*/
 
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index e177386..5ad08d8 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -1048,70 +1048,6 @@ static bool tcp_is_sackblock_valid(struct tcp_sock *tp, bool is_dsack,
 	return !before(start_seq, end_seq - tp->max_window);
 }
 
-/* Check for lost retransmit. This superb idea is borrowed from "ratehalving".
- * Event "B". Later note: FACK people cheated me again 8), we have to account
- * for reordering! Ugly, but should help.
- *
- * Search retransmitted skbs from write_queue that were sent when snd_nxt was
- * less than what is now known to be received by the other end (derived from
- * highest SACK block). Also calculate the lowest snd_nxt among the remaining
- * retransmitted skbs to avoid some costly processing per ACKs.
- */
-static void tcp_mark_lost_retrans(struct sock *sk, int *flag)
-{
-	const struct inet_connection_sock *icsk = inet_csk(sk);
-	struct tcp_sock *tp = tcp_sk(sk);
-	struct sk_buff *skb;
-	int cnt = 0;
-	u32 new_low_seq = tp->snd_nxt;
-	u32 received_upto = tcp_highest_sack_seq(tp);
-
-	if (!tcp_is_fack(tp) || !tp->retrans_out ||
-	    !after(received_upto, tp->lost_retrans_low) ||
-	    icsk->icsk_ca_state != TCP_CA_Recovery)
-		return;
-
-	tcp_for_write_queue(skb, sk) {
-		u32 ack_seq = TCP_SKB_CB(skb)->ack_seq;
-
-		if (skb == tcp_send_head(sk))
-			break;
-		if (cnt == tp->retrans_out)
-			break;
-		if (!after(TCP_SKB_CB(skb)->end_seq, tp->snd_una))
-			continue;
-
-		if (!(TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_RETRANS))
-			continue;
-
-		/* TODO: We would like to get rid of tcp_is_fack(tp) only
-		 * constraint here (see above) but figuring out that at
-		 * least tp->reordering SACK blocks reside between ack_seq
-		 * and received_upto is not easy task to do cheaply with
-		 * the available datastructures.
-		 *
-		 * Whether FACK should check here for tp->reordering segs
-		 * in-between one could argue for either way (it would be
-		 * rather simple to implement as we could count fack_count
-		 * during the walk and do tp->fackets_out - fack_count).
-		 */
-		if (after(received_upto, ack_seq)) {
-			TCP_SKB_CB(skb)->sacked &= ~TCPCB_SACKED_RETRANS;
-			tp->retrans_out -= tcp_skb_pcount(skb);
-			*flag |= FLAG_LOST_RETRANS;
-			tcp_skb_mark_lost_uncond_verify(tp, skb);
-			NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPLOSTRETRANSMIT);
-		} else {
-			if (before(ack_seq, new_low_seq))
-				new_low_seq = ack_seq;
-			cnt += tcp_skb_pcount(skb);
-		}
-	}
-
-	if (tp->retrans_out)
-		tp->lost_retrans_low = new_low_seq;
-}
-
 static bool tcp_check_dsack(struct sock *sk, const struct sk_buff *ack_skb,
 			    struct tcp_sack_block_wire *sp, int num_sacks,
 			    u32 prior_snd_una)
@@ -1838,7 +1774,6 @@ advance_sp:
 	    ((inet_csk(sk)->icsk_ca_state != TCP_CA_Loss) || tp->undo_marker))
 		tcp_update_reordering(sk, tp->fackets_out - state->reord, 0);
 
-	tcp_mark_lost_retrans(sk, &state->flag);
 	tcp_verify_left_out(tp);
 out:
 
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 6e79fcb..49730e5 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2655,8 +2655,6 @@ int tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb)
 			net_dbg_ratelimited("retrans_out leaked\n");
 		}
 #endif
-		if (!tp->retrans_out)
-			tp->lost_retrans_low = tp->snd_nxt;
 		TCP_SKB_CB(skb)->sacked |= TCPCB_RETRANS;
 		tp->retrans_out += tcp_skb_pcount(skb);
 
@@ -2664,10 +2662,6 @@ int tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb)
 		if (!tp->retrans_stamp)
 			tp->retrans_stamp = tcp_skb_timestamp(skb);
 
-		/* snd_nxt is stored to detect loss of retransmitted segment,
-		 * see tcp_input.c tcp_sacktag_write_queue().
-		 */
-		TCP_SKB_CB(skb)->ack_seq = tp->snd_nxt;
 	} else if (err != -EBUSY) {
 		NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPRETRANSFAIL);
 	}
-- 
2.6.0.rc2.230.g3dd15c0


* [PATCH net-next 4/7] tcp: add tcp_tsopt_ecr_before helper
  2015-10-17  4:57 [PATCH net-next 0/7] RACK loss detection Yuchung Cheng
                   ` (2 preceding siblings ...)
  2015-10-17  4:57 ` [PATCH net-next 3/7] tcp: remove tcp_mark_lost_retrans() Yuchung Cheng
@ 2015-10-17  4:57 ` Yuchung Cheng
  2015-10-17  4:57 ` [PATCH net-next 5/7] tcp: skb_mstamp_after helper Yuchung Cheng
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 15+ messages in thread
From: Yuchung Cheng @ 2015-10-17  4:57 UTC (permalink / raw)
  To: davem; +Cc: netdev, Yuchung Cheng, Neal Cardwell, Eric Dumazet

A helper to prepare the main RACK patch.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/ipv4/tcp_input.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 5ad08d8..c304b5f 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2250,14 +2250,19 @@ static inline void tcp_moderate_cwnd(struct tcp_sock *tp)
 	tp->snd_cwnd_stamp = tcp_time_stamp;
 }
 
+static bool tcp_tsopt_ecr_before(const struct tcp_sock *tp, u32 when)
+{
+	return tp->rx_opt.saw_tstamp && tp->rx_opt.rcv_tsecr &&
+	       before(tp->rx_opt.rcv_tsecr, when);
+}
+
 /* Nothing was retransmitted or returned timestamp is less
  * than timestamp of the first retransmission.
  */
 static inline bool tcp_packet_delayed(const struct tcp_sock *tp)
 {
 	return !tp->retrans_stamp ||
-		(tp->rx_opt.saw_tstamp && tp->rx_opt.rcv_tsecr &&
-		 before(tp->rx_opt.rcv_tsecr, tp->retrans_stamp));
+	       tcp_tsopt_ecr_before(tp, tp->retrans_stamp);
 }
 
 /* Undo procedures. */
-- 
2.6.0.rc2.230.g3dd15c0


* [PATCH net-next 5/7] tcp: skb_mstamp_after helper
  2015-10-17  4:57 [PATCH net-next 0/7] RACK loss detection Yuchung Cheng
                   ` (3 preceding siblings ...)
  2015-10-17  4:57 ` [PATCH net-next 4/7] tcp: add tcp_tsopt_ecr_before helper Yuchung Cheng
@ 2015-10-17  4:57 ` Yuchung Cheng
  2015-10-17  4:57 ` [PATCH net-next 6/7] tcp: track the packet timings in RACK Yuchung Cheng
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 15+ messages in thread
From: Yuchung Cheng @ 2015-10-17  4:57 UTC (permalink / raw)
  To: davem; +Cc: netdev, Yuchung Cheng, Neal Cardwell, Eric Dumazet

A helper to prepare the first main RACK patch.
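
A standalone sketch of the wraparound behavior this helper relies on
(a userspace approximation, not the kernel type; the field widths
mirror skb_mstamp):

	#include <stdint.h>
	#include <stdio.h>

	struct mstamp { uint32_t stamp_jiffies, stamp_us; };

	/* Same shape as skb_mstamp_after(): a signed difference of the
	 * coarse clocks is wrap-safe; microseconds break ties.
	 */
	static int mstamp_after(const struct mstamp *t1, const struct mstamp *t0)
	{
		int32_t diff = (int32_t)(t1->stamp_jiffies - t0->stamp_jiffies);

		if (!diff)
			diff = (int32_t)(t1->stamp_us - t0->stamp_us);
		return diff > 0;
	}

	int main(void)
	{
		struct mstamp old = { 0xfffffff0u, 0 }, new = { 0x10u, 0 };

		/* jiffies wrapped, yet "new" still compares after "old" */
		printf("%d\n", mstamp_after(&new, &old));	/* prints 1 */
		return 0;
	}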

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/linux/skbuff.h | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 4398411..24f4dfd 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -463,6 +463,15 @@ static inline u32 skb_mstamp_us_delta(const struct skb_mstamp *t1,
 	return delta_us;
 }
 
+static inline bool skb_mstamp_after(const struct skb_mstamp *t1,
+				    const struct skb_mstamp *t0)
+{
+	s32 diff = t1->stamp_jiffies - t0->stamp_jiffies;
+
+	if (!diff)
+		diff = t1->stamp_us - t0->stamp_us;
+	return diff > 0;
+}
 
 /** 
  *	struct sk_buff - socket buffer
-- 
2.6.0.rc2.230.g3dd15c0


* [PATCH net-next 6/7] tcp: track the packet timings in RACK
  2015-10-17  4:57 [PATCH net-next 0/7] RACK loss detection Yuchung Cheng
                   ` (4 preceding siblings ...)
  2015-10-17  4:57 ` [PATCH net-next 5/7] tcp: skb_mstamp_after helper Yuchung Cheng
@ 2015-10-17  4:57 ` Yuchung Cheng
  2015-10-17  4:57 ` [PATCH net-next 7/7] tcp: use RACK to detect losses Yuchung Cheng
  2015-10-21 14:01 ` [PATCH net-next 0/7] RACK loss detection David Miller
  7 siblings, 0 replies; 15+ messages in thread
From: Yuchung Cheng @ 2015-10-17  4:57 UTC (permalink / raw)
  To: davem; +Cc: netdev, Yuchung Cheng, Neal Cardwell, Eric Dumazet

This patch is the first half of the RACK loss recovery.

RACK loss recovery uses the notion of time instead
of packet sequence (FACK) or counts (dupthresh). It's inspired by the
previous FACK heuristic in tcp_mark_lost_retrans(): when a limited
transmit (new data packet) is sacked, then any retransmitted
sequence below the newly sacked sequence must have been lost,
since at least one round trip time has elapsed.

But it has several limitations:
1) can't detect tail drops since it depends on limited transmit
2) is disabled upon reordering (assumes no reordering)
3) only enabled in fast recovery but not timeout recovery

RACK (Recent ACK) addresses these limitations with the notion
of time instead: a packet P1 is lost if a later packet P2 is s/acked,
as at least one round trip has passed.

Since RACK cares about the time sequence instead of the data sequence
of packets, it can detect tail drops when a later retransmission is
s/acked while FACK or dupthresh can't. For reordering RACK uses a
dynamically adjusted reordering window ("reo_wnd") to reduce false
positives on every (small) degree of reordering.

This patch implements tcp_rack_advance(), which tracks the
most recent transmission time among the packets that have been
delivered (ACKed or SACKed) in tp->rack.mstamp. This timestamp
is the key to determine which packet has been lost.

Consider an example where the sender sends four packets:
T1: P1 (lost)
T2: P2
T3: P3
T4: P4
T100: sack of P2. rack.mstamp = T2
T101: retransmit P1
T102: sack of P2,P3,P4. rack.mstamp = T4
T205: ACK of P4 since the hole is repaired. rack.mstamp = T101

We need to be careful about spurious retransmission because it may
falsely advance tp->rack.mstamp by an RTT or an RTO, causing RACK
to falsely mark all packets lost, just like a spurious timeout.

We identify spurious retransmission by the ACK's TS echo value.
If the TS option is not available but the retransmission is
acknowledged less than min-RTT after it was (re)sent, it is likely
spurious: e.g., with a 10 msec min RTT, a retransmission s/acked 3 msec
after being resent was almost certainly answered by the original. We
refrain from using the transmission time of these spurious
retransmissions.

The second half is implemented in the next patch that marks packet
lost using RACK timestamp.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/linux/tcp.h      |  6 ++++++
 include/net/tcp.h        |  5 +++++
 net/ipv4/Makefile        |  1 +
 net/ipv4/tcp_input.c     | 14 ++++++++++++++
 net/ipv4/tcp_minisocks.c |  2 ++
 net/ipv4/tcp_recovery.c  | 32 ++++++++++++++++++++++++++++++++
 6 files changed, 60 insertions(+)
 create mode 100644 net/ipv4/tcp_recovery.c

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 8c54863..5dce970 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -194,6 +194,12 @@ struct tcp_sock {
 	u32	window_clamp;	/* Maximal window to advertise		*/
 	u32	rcv_ssthresh;	/* Current window clamp			*/
 
+	/* Information of the most recently (s)acked skb */
+	struct tcp_rack {
+		struct skb_mstamp mstamp; /* (Re)sent time of the skb */
+		u8 advanced; /* mstamp advanced since last lost marking */
+		u8 reord;    /* reordering detected */
+	} rack;
 	u16	advmss;		/* Advertised MSS			*/
 	u8	unused;
 	u8	nonagle     : 4,/* Disable Nagle algorithm?             */
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 4575f0e..aee5f23 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1750,6 +1750,11 @@ int tcpv4_offload_init(void);
 void tcp_v4_init(void);
 void tcp_init(void);
 
+/* tcp_recovery.c */
+
+extern void tcp_rack_advance(struct tcp_sock *tp,
+			     const struct skb_mstamp *xmit_time, u8 sacked);
+
 /*
  * Save and compile IPv4 options, return a pointer to it
  */
diff --git a/net/ipv4/Makefile b/net/ipv4/Makefile
index 89aacb6..c29809f 100644
--- a/net/ipv4/Makefile
+++ b/net/ipv4/Makefile
@@ -8,6 +8,7 @@ obj-y     := route.o inetpeer.o protocol.o \
 	     inet_timewait_sock.o inet_connection_sock.o \
 	     tcp.o tcp_input.o tcp_output.o tcp_timer.o tcp_ipv4.o \
 	     tcp_minisocks.o tcp_cong.o tcp_metrics.o tcp_fastopen.o \
+	     tcp_recovery.o \
 	     tcp_offload.o datagram.o raw.o udp.o udplite.o \
 	     udp_offload.o arp.o icmp.o devinet.o af_inet.o igmp.o \
 	     fib_frontend.o fib_semantics.o fib_trie.o \
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index c304b5f..21a9ea4 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -1173,6 +1173,8 @@ static u8 tcp_sacktag_one(struct sock *sk,
 		return sacked;
 
 	if (!(sacked & TCPCB_SACKED_ACKED)) {
+		tcp_rack_advance(tp, xmit_time, sacked);
+
 		if (sacked & TCPCB_SACKED_RETRANS) {
 			/* If the segment is not tagged as lost,
 			 * we do not clear RETRANS, believing
@@ -2256,6 +2258,16 @@ static bool tcp_tsopt_ecr_before(const struct tcp_sock *tp, u32 when)
 	       before(tp->rx_opt.rcv_tsecr, when);
 }
 
+/* skb is spurious retransmitted if the returned timestamp echo
+ * reply is prior to the skb transmission time
+ */
+static bool tcp_skb_spurious_retrans(const struct tcp_sock *tp,
+				     const struct sk_buff *skb)
+{
+	return (TCP_SKB_CB(skb)->sacked & TCPCB_RETRANS) &&
+	       tcp_tsopt_ecr_before(tp, tcp_skb_timestamp(skb));
+}
+
 /* Nothing was retransmitted or returned timestamp is less
  * than timestamp of the first retransmission.
  */
@@ -3135,6 +3147,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, int prior_fackets,
 
 		if (sacked & TCPCB_SACKED_ACKED)
 			tp->sacked_out -= acked_pcount;
+		else if (tcp_is_sack(tp) && !tcp_skb_spurious_retrans(tp, skb))
+			tcp_rack_advance(tp, &skb->skb_mstamp, sacked);
 		if (sacked & TCPCB_LOST)
 			tp->lost_out -= acked_pcount;
 
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index b875c28..1fd5d41 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -548,6 +548,8 @@ struct sock *tcp_create_openreq_child(const struct sock *sk,
 		tcp_ecn_openreq_child(newtp, req);
 		newtp->fastopen_rsk = NULL;
 		newtp->syn_data_acked = 0;
+		newtp->rack.mstamp.v64 = 0;
+		newtp->rack.advanced = 0;
 
 		newtp->saved_syn = req->saved_syn;
 		req->saved_syn = NULL;
diff --git a/net/ipv4/tcp_recovery.c b/net/ipv4/tcp_recovery.c
new file mode 100644
index 0000000..8f66a65
--- /dev/null
+++ b/net/ipv4/tcp_recovery.c
@@ -0,0 +1,32 @@
+#include <linux/tcp.h>
+#include <net/tcp.h>
+
+/* Record the most recently (re)sent time among the (s)acked packets */
+void tcp_rack_advance(struct tcp_sock *tp,
+		      const struct skb_mstamp *xmit_time, u8 sacked)
+{
+	if (tp->rack.mstamp.v64 &&
+	    !skb_mstamp_after(xmit_time, &tp->rack.mstamp))
+		return;
+
+	if (sacked & TCPCB_RETRANS) {
+		struct skb_mstamp now;
+
+		/* If the sacked packet was retransmitted, it's ambiguous
+		 * whether the retransmission or the original (or the prior
+		 * retransmission) was sacked.
+		 *
+		 * If the original is lost, there is no ambiguity. Otherwise
+		 * we assume the original can be delayed up to aRTT + min_rtt.
+		 * the aRTT term is bounded by the fast recovery or timeout,
+		 * so it's at least one RTT (i.e., retransmission is at least
+		 * an RTT later).
+		 */
+		skb_mstamp_get(&now);
+		if (skb_mstamp_us_delta(&now, xmit_time) < tcp_min_rtt(tp))
+			return;
+	}
+
+	tp->rack.mstamp = *xmit_time;
+	tp->rack.advanced = 1;
+}
-- 
2.6.0.rc2.230.g3dd15c0


* [PATCH net-next 7/7] tcp: use RACK to detect losses
  2015-10-17  4:57 [PATCH net-next 0/7] RACK loss detection Yuchung Cheng
                   ` (5 preceding siblings ...)
  2015-10-17  4:57 ` [PATCH net-next 6/7] tcp: track the packet timings in RACK Yuchung Cheng
@ 2015-10-17  4:57 ` Yuchung Cheng
  2015-10-21 14:01 ` [PATCH net-next 0/7] RACK loss detection David Miller
  7 siblings, 0 replies; 15+ messages in thread
From: Yuchung Cheng @ 2015-10-17  4:57 UTC (permalink / raw)
  To: davem; +Cc: netdev, Yuchung Cheng, Neal Cardwell, Eric Dumazet

This patch implements the second half of RACK that uses the most
recent transmit time among all delivered packets to detect losses.

tcp_rack_mark_lost() is called upon receiving a dubious ACK.
It then checks if a not-yet-sacked packet was sent at least
"reo_wnd" prior to the sent time of the most recently delivered
packet. If so, the packet is deemed lost.

The "reo_wnd" reordering window starts with 1msec for fast loss
detection and changes to min-RTT/4 when reordering is observed.
We found 1msec accommodates tiny degrees of reordering (<3 pkts)
well on faster links. We use min-RTT instead of SRTT because
reordering is more of a path property, while SRTT can be inflated by
self-inflicted congestion. The factor of 4 is borrowed from the
delayed early retransmit and seems to work reasonably well.
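
The resulting window can be checked with a little standalone arithmetic
(an illustrative userspace sketch, not the kernel code; the min-RTT
inputs are hypothetical):

	#include <stdio.h>
	#include <stdint.h>

	/* How reo_wnd is chosen under the rules above */
	static uint32_t reo_wnd_us(uint32_t min_rtt_us, int reord_seen)
	{
		uint32_t wnd = 1000;			/* 1 msec default */

		if (reord_seen && min_rtt_us != ~0u && min_rtt_us / 4 > wnd)
			wnd = min_rtt_us / 4;		/* min-RTT/4 on reordering */
		return wnd;
	}

	int main(void)
	{
		printf("%u\n", reo_wnd_us(20000, 1));	/* 20 ms min RTT -> 5000 us */
		printf("%u\n", reo_wnd_us(2000, 1));	/* 2 ms min RTT -> 1000 us floor */
		printf("%u\n", reo_wnd_us(20000, 0));	/* no reordering -> 1000 us */
		return 0;
	}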

Since RACK is still experimental, it is now used as a supplemental
loss detection on top of existing algorithms. It is only effective
after the fast recovery starts or after the timeout occurs. The
fast recovery is still triggered by FACK and/or dupack threshold
instead of RACK.

We introduce a new sysctl net.ipv4.tcp_recovery for future
experiments with loss recovery. For now RACK can be disabled by
setting it to 0.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 Documentation/networking/ip-sysctl.txt |  9 ++++
 include/net/tcp.h                      |  9 ++++
 net/ipv4/sysctl_net_ipv4.c             |  7 ++++
 net/ipv4/tcp_input.c                   |  9 +++-
 net/ipv4/tcp_recovery.c                | 77 ++++++++++++++++++++++++++++++++++
 5 files changed, 109 insertions(+), 2 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index 502d6a5..85752c8 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -433,6 +433,15 @@ tcp_orphan_retries - INTEGER
 	you should think about lowering this value, such sockets
 	may consume significant resources. Cf. tcp_max_orphans.
 
+tcp_recovery - INTEGER
+	This value is a bitmap to enable various experimental loss recovery
+	features.
+
+	RACK: 0x1 enables the RACK loss detection for fast detection of lost
+	      retransmissions and tail drops.
+
+	Default: 0x1
+
 tcp_reordering - INTEGER
 	Initial reordering level of packets in a TCP stream.
 	TCP stack can then dynamically adjust flow reordering level
diff --git a/include/net/tcp.h b/include/net/tcp.h
index aee5f23..f615789 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -567,6 +567,7 @@ void tcp_resume_early_retransmit(struct sock *sk);
 void tcp_rearm_rto(struct sock *sk);
 void tcp_synack_rtt_meas(struct sock *sk, struct request_sock *req);
 void tcp_reset(struct sock *sk);
+void tcp_skb_mark_lost_uncond_verify(struct tcp_sock *tp, struct sk_buff *skb);
 
 /* tcp_timer.c */
 void tcp_init_xmit_timers(struct sock *);
@@ -1752,6 +1753,14 @@ void tcp_init(void);
 
 /* tcp_recovery.c */
 
+/* Flags to enable various loss recovery features. See below */
+extern int sysctl_tcp_recovery;
+
+/* Use TCP RACK to detect (some) tail and retransmit losses */
+#define TCP_RACK_LOST_RETRANS  0x1
+
+extern int tcp_rack_mark_lost(struct sock *sk);
+
 extern void tcp_rack_advance(struct tcp_sock *tp,
 			     const struct skb_mstamp *xmit_time, u8 sacked);
 
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 13ab434..25300c5 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -496,6 +496,13 @@ static struct ctl_table ipv4_table[] = {
 		.proc_handler	= proc_dointvec
 	},
 	{
+		.procname	= "tcp_recovery",
+		.data		= &sysctl_tcp_recovery,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
+	{
 		.procname	= "tcp_reordering",
 		.data		= &sysctl_tcp_reordering,
 		.maxlen		= sizeof(int),
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 21a9ea4..cf0f954 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -881,6 +881,7 @@ static void tcp_update_reordering(struct sock *sk, const int metric,
 
 	if (metric > 0)
 		tcp_disable_early_retrans(tp);
+	tp->rack.reord = 1;
 }
 
 /* This must be called before lost_out is incremented */
@@ -906,8 +907,7 @@ static void tcp_skb_mark_lost(struct tcp_sock *tp, struct sk_buff *skb)
 	}
 }
 
-static void tcp_skb_mark_lost_uncond_verify(struct tcp_sock *tp,
-					    struct sk_buff *skb)
+void tcp_skb_mark_lost_uncond_verify(struct tcp_sock *tp, struct sk_buff *skb)
 {
 	tcp_verify_retransmit_hint(tp, skb);
 
@@ -2806,6 +2806,11 @@ static void tcp_fastretrans_alert(struct sock *sk, const int acked,
 		}
 	}
 
+	/* Use RACK to detect loss */
+	if (sysctl_tcp_recovery & TCP_RACK_LOST_RETRANS &&
+	    tcp_rack_mark_lost(sk))
+		flag |= FLAG_LOST_RETRANS;
+
 	/* E. Process state. */
 	switch (icsk->icsk_ca_state) {
 	case TCP_CA_Recovery:
diff --git a/net/ipv4/tcp_recovery.c b/net/ipv4/tcp_recovery.c
index 8f66a65..5353085 100644
--- a/net/ipv4/tcp_recovery.c
+++ b/net/ipv4/tcp_recovery.c
@@ -1,6 +1,83 @@
 #include <linux/tcp.h>
 #include <net/tcp.h>
 
+int sysctl_tcp_recovery __read_mostly = TCP_RACK_LOST_RETRANS;
+
+/* Marks a packet lost, if some packet sent later has been (s)acked.
+ * The underlying idea is similar to the traditional dupthresh and FACK
+ * but they look at different metrics:
+ *
+ * dupthresh: 3 OOO packets delivered (packet count)
+ * FACK: sequence delta to highest sacked sequence (sequence space)
+ * RACK: sent time delta to the latest delivered packet (time domain)
+ *
+ * The advantage of RACK is it applies to both original and retransmitted
+ * packet and therefore is robust against tail losses. Another advantage
+ * is being more resilient to reordering by simply allowing some
+ * "settling delay", instead of tweaking the dupthresh.
+ *
+ * The current version is only used after recovery starts but can be
+ * easily extended to detect the first loss.
+ */
+int tcp_rack_mark_lost(struct sock *sk)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	struct sk_buff *skb;
+	u32 reo_wnd, prior_retrans = tp->retrans_out;
+
+	if (inet_csk(sk)->icsk_ca_state < TCP_CA_Recovery || !tp->rack.advanced)
+		return 0;
+
+	/* Reset the advanced flag to avoid unnecessary queue scanning */
+	tp->rack.advanced = 0;
+
+	/* To be more reordering resilient, allow min_rtt/4 settling delay
+	 * (lower-bounded to 1000uS). We use min_rtt instead of the smoothed
+	 * RTT because reordering is often a path property and less related
+	 * to queuing or delayed ACKs.
+	 *
+	 * TODO: measure and adapt to the observed reordering delay, and
+	 * use a timer to retransmit like the delayed early retransmit.
+	 */
+	reo_wnd = 1000;
+	if (tp->rack.reord && tcp_min_rtt(tp) != ~0U)
+		reo_wnd = max(tcp_min_rtt(tp) >> 2, reo_wnd);
+
+	tcp_for_write_queue(skb, sk) {
+		struct tcp_skb_cb *scb = TCP_SKB_CB(skb);
+
+		if (skb == tcp_send_head(sk))
+			break;
+
+		/* Skip ones already (s)acked */
+		if (!after(scb->end_seq, tp->snd_una) ||
+		    scb->sacked & TCPCB_SACKED_ACKED)
+			continue;
+
+		if (skb_mstamp_after(&tp->rack.mstamp, &skb->skb_mstamp)) {
+
+			if (skb_mstamp_us_delta(&tp->rack.mstamp,
+						&skb->skb_mstamp) <= reo_wnd)
+				continue;
+
+			/* skb is lost if packet sent later is sacked */
+			tcp_skb_mark_lost_uncond_verify(tp, skb);
+			if (scb->sacked & TCPCB_SACKED_RETRANS) {
+				scb->sacked &= ~TCPCB_SACKED_RETRANS;
+				tp->retrans_out -= tcp_skb_pcount(skb);
+				NET_INC_STATS_BH(sock_net(sk),
+						 LINUX_MIB_TCPLOSTRETRANSMIT);
+			}
+		} else if (!(scb->sacked & TCPCB_RETRANS)) {
+			/* Original data are sent sequentially so stop early
+			 * b/c the rest are all sent after rack_sent
+			 */
+			break;
+		}
+	}
+	return prior_retrans - tp->retrans_out;
+}
+
 /* Record the most recently (re)sent time among the (s)acked packets */
 void tcp_rack_advance(struct tcp_sock *tp,
 		      const struct skb_mstamp *xmit_time, u8 sacked)
-- 
2.6.0.rc2.230.g3dd15c0


* Re: [PATCH net-next 2/7] tcp: track min RTT using windowed min-filter
  2015-10-14  9:28   ` Andrew Shewmaker
@ 2015-10-18 14:33     ` Neal Cardwell
  2015-10-14 13:25       ` Andrew Shewmaker
  2015-10-19  4:39     ` Andrew Shewmaker
  1 sibling, 1 reply; 15+ messages in thread
From: Neal Cardwell @ 2015-10-18 14:33 UTC (permalink / raw)
  To: Andrew Shewmaker; +Cc: Yuchung Cheng, David Miller, Netdev, Eric Dumazet

On Wed, Oct 14, 2015 at 5:28 AM, Andrew Shewmaker <agshew@gmail.com> wrote:
>> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
>> index 38743e5..e177386 100644
>> --- a/net/ipv4/tcp_input.c
>> +++ b/net/ipv4/tcp_input.c
> ...
>> @@ -2961,7 +3028,7 @@ void tcp_synack_rtt_meas(struct sock *sk, struct request_sock *req)
>>               rtt_us = skb_mstamp_us_delta(&now, &tcp_rsk(req)->snt_synack);
>>       }
>>
>> -     tcp_ack_update_rtt(sk, FLAG_SYN_ACKED, rtt_us, -1L);
>> +     tcp_ack_update_rtt(sk, FLAG_SYN_ACKED, rtt_us, -1L, rtt_us);
>>  }
>
> This didn't apply to net-next for me. I see seq_rtt_us instead of
> rtt_us and a check on the existence of tp->srtt_us. Maybe I've
> misapplied the patch?

This patch series applies cleanly for me against David Miller's
net-next at SHA1 4be3158 (from Friday Oct 16). (Using "git am" on the
mbox patches from http://patchwork.ozlabs.org/project/netdev/list/
...)

On top of what SHA1 are you applying the series?

neal


* Re: [PATCH net-next 2/7] tcp: track min RTT using windowed min-filter
  2015-10-14  9:28   ` Andrew Shewmaker
  2015-10-18 14:33     ` Neal Cardwell
@ 2015-10-19  4:39     ` Andrew Shewmaker
  1 sibling, 0 replies; 15+ messages in thread
From: Andrew Shewmaker @ 2015-10-19  4:39 UTC (permalink / raw)
  To: Yuchung Cheng; +Cc: davem, netdev, Neal Cardwell, Eric Dumazet

On Wed, Oct 14, 2015 at 02:28:00AM -0700, Andrew Shewmaker wrote:
> On Fri, Oct 16, 2015 at 09:57:42PM -0700, Yuchung Cheng wrote:
> ...
> > diff --git a/include/linux/tcp.h b/include/linux/tcp.h
> > index 86a7eda..90edef5 100644
> > --- a/include/linux/tcp.h
> > +++ b/include/linux/tcp.h
> > @@ -217,6 +217,9 @@ struct tcp_sock {
> >  	u32	mdev_max_us;	/* maximal mdev for the last rtt period	*/
> >  	u32	rttvar_us;	/* smoothed mdev_max			*/
> >  	u32	rtt_seq;	/* sequence number to update rttvar	*/
> > +	struct rtt_meas {
> > +		u32 rtt, ts;	/* RTT in usec and sampling time in jiffies. */
> > +	} rtt_min[3];
> 
> First, thanks for all the work in this patch series. In particular,
> applying Kern's check to ca_seq_rtt_us should fix some bad behavior
> I've observed recently.

I'd have to run more tests to be sure, but net-next with this patch series
significantly improves upon the poor behavior I was seeing where randomly
dropped packets greatly limited the usefulness of RTTs as a signal of
congestion. I still see some measurements that exceed the amount of delay
possible from queue buildup, but fewer than before.

Thanks!

-Andrew


* Re: [PATCH net-next 0/7] RACK loss detection
  2015-10-17  4:57 [PATCH net-next 0/7] RACK loss detection Yuchung Cheng
                   ` (6 preceding siblings ...)
  2015-10-17  4:57 ` [PATCH net-next 7/7] tcp: use RACK to detect losses Yuchung Cheng
@ 2015-10-21 14:01 ` David Miller
  7 siblings, 0 replies; 15+ messages in thread
From: David Miller @ 2015-10-21 14:01 UTC (permalink / raw)
  To: ycheng; +Cc: netdev

From: Yuchung Cheng <ycheng@google.com>
Date: Fri, 16 Oct 2015 21:57:40 -0700

> RACK (Recent ACK) loss recovery uses the notion of time instead of
> packet sequence (FACK) or counts (dupthresh).
> 
> It's inspired by the FACK heuristic in tcp_mark_lost_retrans(): when a
> limited transmit (new data packet) is sacked in recovery, then any
> retransmission sent before that newly sacked packet was sent must have
> been lost, since at least one round trip time has elapsed.
> 
> But that existing heuristic from tcp_mark_lost_retrans()
> has several limitations:
>   1) it can't detect tail drops since it depends on limited transmit
>   2) it's disabled upon reordering (assumes no reordering)
>   3) it's only enabled in fast recovery but not timeout recovery
> 
> RACK addresses these limitations with a core idea: an unacknowledged
> packet P1 is deemed lost if a packet P2 that was sent later is
> s/acked, since at least one round trip has passed.
> 
> Since RACK cares about the time sequence instead of the data sequence
> of packets, it can detect tail drops when a later retransmission is
> s/acked, while FACK or dupthresh can't. For reordering RACK uses a
> dynamically adjusted reordering window ("reo_wnd") to reduce false
> positives on every (small) degree of reordering, similar to the delayed
> Early Retransmit.
> 
> In the current patch set RACK is only a supplemental loss detection
> and does not trigger fast recovery. However we are developing RACK
> to replace or consolidate FACK/dupthresh, early retransmit, and
> thin-dupack. These heuristics all implicitly embed the notion of time.
> For example, the delayed Early Retransmit is simply applying RACK
> to trigger the fast recovery with small inflight.
> 
> RACK requires measuring the minimum RTT. Tracking a global min is less
> robust due to traffic engineering pathing changes. Therefore it uses a
> windowed filter by Kathleen Nichols. The min RTT can also be useful
> for various other purposes like congestion control or stat monitoring.
> 
> This patch set has been used on Google servers for well over 1 year. RACK
> has also been implemented in the QUIC protocol. We are submitting an
> IETF draft as well.

This looks really great, in fact in my eyes the entire series is
justified merely by patch #3 :-)

Series applied, thanks.


* Re: [PATCH net-next 1/7] tcp: apply Kern's check on RTTs used for congestion control
  2015-10-17  4:57 ` [PATCH net-next 1/7] tcp: apply Kern's check on RTTs used for congestion control Yuchung Cheng
@ 2016-02-02 19:30   ` Kenneth Klette Jonassen
  2016-02-02 23:28     ` Yuchung Cheng
  0 siblings, 1 reply; 15+ messages in thread
From: Kenneth Klette Jonassen @ 2016-02-02 19:30 UTC (permalink / raw)
  To: Yuchung Cheng; +Cc: David Miller, netdev, Neal Cardwell, Eric Dumazet

On Sat, Oct 17, 2015 at 6:57 AM, Yuchung Cheng <ycheng@google.com> wrote:
> Currently ca_seq_rtt_us does not use Kern's check. Fix that by
> checking if any packet acked is a retransmit, for both RTT used
> for RTT estimation and congestion control.
>
*snip*

This patch (commit 9e45a3e) puzzles me, because Karn's check was
already in effect:
http://lxr.free-electrons.com/source/net/ipv4/tcp_input.c?v=4.3#L3117

Since first_ackt/last_ackt is only set when non-retransmitted packets
are ACKed (sequentially), we know them to be unambiguous samples for
RTTM. Even if a (sequential) ACK covers retransmitted packets, we can
still make a valid RTTM if that ACK also covers non-retransmitted
packets. But this patch seems to prevent that?


* Re: [PATCH net-next 1/7] tcp: apply Kern's check on RTTs used for congestion control
  2016-02-02 19:30   ` Kenneth Klette Jonassen
@ 2016-02-02 23:28     ` Yuchung Cheng
  0 siblings, 0 replies; 15+ messages in thread
From: Yuchung Cheng @ 2016-02-02 23:28 UTC (permalink / raw)
  To: Kenneth Klette Jonassen; +Cc: David Miller, netdev, Neal Cardwell, Eric Dumazet

On Tue, Feb 2, 2016 at 11:30 AM, Kenneth Klette Jonassen
<kennetkl@ifi.uio.no> wrote:
>
> On Sat, Oct 17, 2015 at 6:57 AM, Yuchung Cheng <ycheng@google.com> wrote:
> > Currently ca_seq_rtt_us does not use Kern's check. Fix that by
> > checking if any packet acked is a retransmit, for both RTT used
> > for RTT estimation and congestion control.
> >
> *snip*
>
> This patch (commit 9e45a3e) puzzles me, because Karn's check was
> already in effect:
> http://lxr.free-electrons.com/source/net/ipv4/tcp_input.c?v=4.3#L3117
>
> Since first_ackt/last_ackt is only set when non-retransmitted packets
> are ACKed (sequentially), we know them to be unambiguous samples for
> RTTM. Even if a (sequential) ACK covers retransmitted packets, we can
> still make a valid RTTM if that ACK also covers non-retransmitted
> packets. But this patch seems to prevent that?
Perhaps the commit message is not clear. Here is an example: an ACK
acks 2 packets where the 1st packet was retransmitted but the 2nd
packet was not.

Since we don't know if the ACK was caused by the retransmission (plugs
a hole) or by the original, we should not take an RTT sample. In other
words, we should refrain from taking an RTT sample as long as the ACK
covers any retransmitted sequence.
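
Concretely, as a hypothetical timeline:

	T0: send P1
	T1: send P2
	T2: retransmit P1
	T3: cumulative ACK of P1 and P2

If the ACK answers the original P1 the sample would be T3-T0; if it
answers the retransmission it would be T3-T2. Either guess can be badly
wrong, so the sample is discarded.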
