* [PATCH 1/2] tcp: Tail loss probe (TLP)
@ 2013-03-11 20:00 Nandita Dukkipati
  2013-03-11 20:00 ` [PATCH 2/2] tcp: TLP loss detection Nandita Dukkipati
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Nandita Dukkipati @ 2013-03-11 20:00 UTC (permalink / raw)
  To: David S. Miller, Neal Cardwell, Yuchung Cheng, Eric Dumazet
  Cc: netdev, Ilpo Jarvinen, Tom Herbert, Nandita Dukkipati

This patch series implements the Tail loss probe (TLP) algorithm described
in http://tools.ietf.org/html/draft-dukkipati-tcpm-tcp-loss-probe-01. The
first patch implements the basic algorithm.

TLP's goal is to reduce tail latency of short transactions. It achieves
this by converting retransmission timeouts (RTOs) occurring due
to tail losses (losses at end of transactions) into fast recovery.
TLP transmits one packet in two round-trips when a connection is in
Open state and isn't receiving any ACKs. The transmitted packet, aka
loss probe, can be either new or a retransmission. When there is tail
loss, the ACK from a loss probe triggers FACK/early-retransmit based
fast recovery, thus avoiding a costly RTO. In the absence of loss,
there is no change in the connection state.
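
For illustration (a made-up timeline): a server sends the last two
segments of a response and the final segment is dropped. No dupacks
can arrive, so without TLP the sender sits idle until the RTO, often
hundreds of milliseconds, expires. With TLP the PTO fires after
roughly two round-trips, the last segment is retransmitted as the
loss probe, and the ACK it elicits exposes the hole so that FACK or
early retransmit can repair it through fast recovery.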

PTO stands for probe timeout. It is a timer event indicating
that an ACK is overdue and triggers a loss probe packet. The PTO value
is set to max(2*SRTT, 10ms) and is adjusted to account for delayed
ACK timer when there is only one outstanding packet.

TLP Algorithm

On transmission of new data in Open state:
  -> packets_out > 1: schedule PTO in max(2*SRTT, 10ms).
  -> packets_out == 1: schedule PTO in max(2*RTT, 1.5*RTT + 200ms)
  -> PTO = min(PTO, RTO)
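
As a worked example with illustrative numbers, take SRTT = 50ms:
  packets_out > 1:  PTO = max(2*50ms, 10ms)             = 100ms
  packets_out == 1: PTO = max(2*50ms, 1.5*50ms + 200ms) = 275ms
and in both cases the result is clamped to min(PTO, RTO).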

Conditions for scheduling PTO:
  -> Connection is in Open state.
  -> Connection is either cwnd limited or no new data to send.
  -> Number of probes per tail loss episode is limited to one.
  -> Connection is SACK enabled.

When PTO fires:
  new_segment_exists:
    -> transmit new segment.
    -> packets_out++. cwnd remains same.

  no_new_packet:
    -> retransmit the last segment.
       Its ACK triggers FACK or early retransmit based recovery.

ACK path:
  -> rearm RTO at start of ACK processing.
  -> reschedule PTO if need be.

In addition, the patch includes a small variation to the Early Retransmit
(ER) algorithm, such that ER and TLP together can in principle recover any
N-degree of tail loss through fast recovery. TLP is controlled by the
same sysctl as ER, tcp_early_retrans:
tcp_early_retrans==0; disables TLP and ER.
		 ==1; enables RFC5827 ER.
		 ==2; delayed ER.
		 ==3; TLP and delayed ER. [DEFAULT]
		 ==4; TLP only.
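
For example, the mode can be changed at runtime through the standard
sysctl interface, using the values listed above:
	sysctl -w net.ipv4.tcp_early_retrans=3  # TLP and delayed ER (default)
	sysctl -w net.ipv4.tcp_early_retrans=2  # delayed ER only, no TLP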

The TLP patch series has been extensively tested on Google Web servers.
It is most effective for short Web transactions, where it reduced RTOs by 15%
and improved HTTP response time (average by 6%, 99th percentile by 10%).
The transmitted probes account for <0.5% of the overall transmissions.
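
Probe transmissions are counted in the new TCPLossProbes SNMP counter
added below; on a live host it can be read with, e.g.,
"nstat -az TcpExtTCPLossProbes" or found in netstat -s output.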

Signed-off-by: Nandita Dukkipati <nanditad@google.com>
---
 Documentation/networking/ip-sysctl.txt |   8 ++-
 include/linux/tcp.h                    |   1 -
 include/net/inet_connection_sock.h     |   5 +-
 include/net/tcp.h                      |   6 +-
 include/uapi/linux/snmp.h              |   1 +
 net/ipv4/inet_diag.c                   |   4 +-
 net/ipv4/proc.c                        |   1 +
 net/ipv4/sysctl_net_ipv4.c             |   4 +-
 net/ipv4/tcp_input.c                   |  24 ++++---
 net/ipv4/tcp_ipv4.c                    |   4 +-
 net/ipv4/tcp_output.c                  | 128 +++++++++++++++++++++++++++++++--
 net/ipv4/tcp_timer.c                   |  13 ++--
 12 files changed, 171 insertions(+), 28 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index dc2dc87..1cae6c3 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -190,7 +190,9 @@ tcp_early_retrans - INTEGER
 	Enable Early Retransmit (ER), per RFC 5827. ER lowers the threshold
 	for triggering fast retransmit when the amount of outstanding data is
 	small and when no previously unsent data can be transmitted (such
-	that limited transmit could be used).
+	that limited transmit could be used). Also controls the use of
+	Tail loss probe (TLP) that converts RTOs occurring due to tail
+	losses into fast recovery (draft-dukkipati-tcpm-tcp-loss-probe-01).
 	Possible values:
 		0 disables ER
 		1 enables ER
@@ -198,7 +200,9 @@ tcp_early_retrans - INTEGER
 		  by a fourth of RTT. This mitigates connection falsely
 		  recovers when network has a small degree of reordering
 		  (less than 3 packets).
-	Default: 2
+		3 enables delayed ER and TLP.
+		4 enables TLP only.
+	Default: 3
 
 tcp_ecn - INTEGER
 	Control use of Explicit Congestion Notification (ECN) by TCP.
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 515c374..01860d7 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -201,7 +201,6 @@ struct tcp_sock {
 		unused      : 1;
 	u8	repair_queue;
 	u8	do_early_retrans:1,/* Enable RFC5827 early-retransmit  */
-		early_retrans_delayed:1, /* Delayed ER timer installed */
 		syn_data:1,	/* SYN includes data */
 		syn_fastopen:1,	/* SYN includes Fast Open option */
 		syn_data_acked:1;/* data in SYN is acked by SYN-ACK */
diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
index 1832927..de2c785 100644
--- a/include/net/inet_connection_sock.h
+++ b/include/net/inet_connection_sock.h
@@ -133,6 +133,8 @@ struct inet_connection_sock {
 #define ICSK_TIME_RETRANS	1	/* Retransmit timer */
 #define ICSK_TIME_DACK		2	/* Delayed ack timer */
 #define ICSK_TIME_PROBE0	3	/* Zero window probe timer */
+#define ICSK_TIME_EARLY_RETRANS 4	/* Early retransmit timer */
+#define ICSK_TIME_LOSS_PROBE	5	/* Tail loss probe timer */
 
 static inline struct inet_connection_sock *inet_csk(const struct sock *sk)
 {
@@ -222,7 +224,8 @@ static inline void inet_csk_reset_xmit_timer(struct sock *sk, const int what,
 		when = max_when;
 	}
 
-	if (what == ICSK_TIME_RETRANS || what == ICSK_TIME_PROBE0) {
+	if (what == ICSK_TIME_RETRANS || what == ICSK_TIME_PROBE0 ||
+	    what == ICSK_TIME_EARLY_RETRANS || what == ICSK_TIME_LOSS_PROBE) {
 		icsk->icsk_pending = what;
 		icsk->icsk_timeout = jiffies + when;
 		sk_reset_timer(sk, &icsk->icsk_retransmit_timer, icsk->icsk_timeout);
diff --git a/include/net/tcp.h b/include/net/tcp.h
index a2baa5e..ab9f947 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -543,6 +543,8 @@ extern bool tcp_syn_flood_action(struct sock *sk,
 extern void tcp_push_one(struct sock *, unsigned int mss_now);
 extern void tcp_send_ack(struct sock *sk);
 extern void tcp_send_delayed_ack(struct sock *sk);
+extern void tcp_send_loss_probe(struct sock *sk);
+extern bool tcp_schedule_loss_probe(struct sock *sk);
 
 /* tcp_input.c */
 extern void tcp_cwnd_application_limited(struct sock *sk);
@@ -873,8 +875,8 @@ static inline void tcp_enable_fack(struct tcp_sock *tp)
 static inline void tcp_enable_early_retrans(struct tcp_sock *tp)
 {
 	tp->do_early_retrans = sysctl_tcp_early_retrans &&
-		!sysctl_tcp_thin_dupack && sysctl_tcp_reordering == 3;
-	tp->early_retrans_delayed = 0;
+		sysctl_tcp_early_retrans < 4 && !sysctl_tcp_thin_dupack &&
+		sysctl_tcp_reordering == 3;
 }
 
 static inline void tcp_disable_early_retrans(struct tcp_sock *tp)
diff --git a/include/uapi/linux/snmp.h b/include/uapi/linux/snmp.h
index b49eab8..290bed6 100644
--- a/include/uapi/linux/snmp.h
+++ b/include/uapi/linux/snmp.h
@@ -202,6 +202,7 @@ enum
 	LINUX_MIB_TCPFORWARDRETRANS,		/* TCPForwardRetrans */
 	LINUX_MIB_TCPSLOWSTARTRETRANS,		/* TCPSlowStartRetrans */
 	LINUX_MIB_TCPTIMEOUTS,			/* TCPTimeouts */
+	LINUX_MIB_TCPLOSSPROBES,		/* TCPLossProbes */
 	LINUX_MIB_TCPRENORECOVERYFAIL,		/* TCPRenoRecoveryFail */
 	LINUX_MIB_TCPSACKRECOVERYFAIL,		/* TCPSackRecoveryFail */
 	LINUX_MIB_TCPSCHEDULERFAILED,		/* TCPSchedulerFailed */
diff --git a/net/ipv4/inet_diag.c b/net/ipv4/inet_diag.c
index 7afa2c3..8620408 100644
--- a/net/ipv4/inet_diag.c
+++ b/net/ipv4/inet_diag.c
@@ -158,7 +158,9 @@ int inet_sk_diag_fill(struct sock *sk, struct inet_connection_sock *icsk,
 
 #define EXPIRES_IN_MS(tmo)  DIV_ROUND_UP((tmo - jiffies) * 1000, HZ)
 
-	if (icsk->icsk_pending == ICSK_TIME_RETRANS) {
+	if (icsk->icsk_pending == ICSK_TIME_RETRANS ||
+	    icsk->icsk_pending == ICSK_TIME_EARLY_RETRANS ||
+	    icsk->icsk_pending == ICSK_TIME_LOSS_PROBE) {
 		r->idiag_timer = 1;
 		r->idiag_retrans = icsk->icsk_retransmits;
 		r->idiag_expires = EXPIRES_IN_MS(icsk->icsk_timeout);
diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
index 32030a2..4c35911 100644
--- a/net/ipv4/proc.c
+++ b/net/ipv4/proc.c
@@ -224,6 +224,7 @@ static const struct snmp_mib snmp4_net_list[] = {
 	SNMP_MIB_ITEM("TCPForwardRetrans", LINUX_MIB_TCPFORWARDRETRANS),
 	SNMP_MIB_ITEM("TCPSlowStartRetrans", LINUX_MIB_TCPSLOWSTARTRETRANS),
 	SNMP_MIB_ITEM("TCPTimeouts", LINUX_MIB_TCPTIMEOUTS),
+	SNMP_MIB_ITEM("TCPLossProbes", LINUX_MIB_TCPLOSSPROBES),
 	SNMP_MIB_ITEM("TCPRenoRecoveryFail", LINUX_MIB_TCPRENORECOVERYFAIL),
 	SNMP_MIB_ITEM("TCPSackRecoveryFail", LINUX_MIB_TCPSACKRECOVERYFAIL),
 	SNMP_MIB_ITEM("TCPSchedulerFailed", LINUX_MIB_TCPSCHEDULERFAILED),
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 960fd29..cca4550 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -28,7 +28,7 @@
 
 static int zero;
 static int one = 1;
-static int two = 2;
+static int four = 4;
 static int tcp_retr1_max = 255;
 static int ip_local_port_range_min[] = { 1, 1 };
 static int ip_local_port_range_max[] = { 65535, 65535 };
@@ -760,7 +760,7 @@ static struct ctl_table ipv4_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= &zero,
-		.extra2		= &two,
+		.extra2		= &four,
 	},
 	{
 		.procname	= "udp_mem",
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 0d9bdac..b794f89 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -98,7 +98,7 @@ int sysctl_tcp_frto_response __read_mostly;
 int sysctl_tcp_thin_dupack __read_mostly;
 
 int sysctl_tcp_moderate_rcvbuf __read_mostly = 1;
-int sysctl_tcp_early_retrans __read_mostly = 2;
+int sysctl_tcp_early_retrans __read_mostly = 3;
 
 #define FLAG_DATA		0x01 /* Incoming frame contained data.		*/
 #define FLAG_WIN_UPDATE		0x02 /* Incoming ACK was a window update.	*/
@@ -2150,15 +2150,16 @@ static bool tcp_pause_early_retransmit(struct sock *sk, int flag)
 	 * max(RTT/4, 2msec) unless ack has ECE mark, no RTT samples
 	 * available, or RTO is scheduled to fire first.
 	 */
-	if (sysctl_tcp_early_retrans < 2 || (flag & FLAG_ECE) || !tp->srtt)
+	if (sysctl_tcp_early_retrans < 2 || sysctl_tcp_early_retrans > 3 ||
+	    (flag & FLAG_ECE) || !tp->srtt)
 		return false;
 
 	delay = max_t(unsigned long, (tp->srtt >> 5), msecs_to_jiffies(2));
 	if (!time_after(inet_csk(sk)->icsk_timeout, (jiffies + delay)))
 		return false;
 
-	inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, delay, TCP_RTO_MAX);
-	tp->early_retrans_delayed = 1;
+	inet_csk_reset_xmit_timer(sk, ICSK_TIME_EARLY_RETRANS, delay,
+				  TCP_RTO_MAX);
 	return true;
 }
 
@@ -2321,7 +2322,7 @@ static bool tcp_time_to_recover(struct sock *sk, int flag)
 	 * interval if appropriate.
 	 */
 	if (tp->do_early_retrans && !tp->retrans_out && tp->sacked_out &&
-	    (tp->packets_out == (tp->sacked_out + 1) && tp->packets_out < 4) &&
+	    (tp->packets_out >= (tp->sacked_out + 1) && tp->packets_out < 4) &&
 	    !tcp_may_send_now(sk))
 		return !tcp_pause_early_retransmit(sk, flag);
 
@@ -3081,6 +3082,7 @@ static void tcp_cong_avoid(struct sock *sk, u32 ack, u32 in_flight)
  */
 void tcp_rearm_rto(struct sock *sk)
 {
+	const struct inet_connection_sock *icsk = inet_csk(sk);
 	struct tcp_sock *tp = tcp_sk(sk);
 
 	/* If the retrans timer is currently being used by Fast Open
@@ -3094,12 +3096,13 @@ void tcp_rearm_rto(struct sock *sk)
 	} else {
 		u32 rto = inet_csk(sk)->icsk_rto;
 		/* Offset the time elapsed after installing regular RTO */
-		if (tp->early_retrans_delayed) {
+		if (icsk->icsk_pending == ICSK_TIME_EARLY_RETRANS ||
+		    icsk->icsk_pending == ICSK_TIME_LOSS_PROBE) {
 			struct sk_buff *skb = tcp_write_queue_head(sk);
 			const u32 rto_time_stamp = TCP_SKB_CB(skb)->when + rto;
 			s32 delta = (s32)(rto_time_stamp - tcp_time_stamp);
 			/* delta may not be positive if the socket is locked
-			 * when the delayed ER timer fires and is rescheduled.
+			 * when the retrans timer fires and is rescheduled.
 			 */
 			if (delta > 0)
 				rto = delta;
@@ -3107,7 +3110,6 @@ void tcp_rearm_rto(struct sock *sk)
 		inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, rto,
 					  TCP_RTO_MAX);
 	}
-	tp->early_retrans_delayed = 0;
 }
 
 /* This function is called when the delayed ER timer fires. TCP enters
@@ -3601,7 +3603,8 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
 	if (after(ack, tp->snd_nxt))
 		goto invalid_ack;
 
-	if (tp->early_retrans_delayed)
+	if (icsk->icsk_pending == ICSK_TIME_EARLY_RETRANS ||
+	    icsk->icsk_pending == ICSK_TIME_LOSS_PROBE)
 		tcp_rearm_rto(sk);
 
 	if (after(ack, prior_snd_una))
@@ -3678,6 +3681,9 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
 		if (dst)
 			dst_confirm(dst);
 	}
+
+	if (icsk->icsk_pending == ICSK_TIME_RETRANS)
+		tcp_schedule_loss_probe(sk);
 	return 1;
 
 no_queue:
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 8cdee12..b7ab868 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2703,7 +2703,9 @@ static void get_tcp4_sock(struct sock *sk, struct seq_file *f, int i, int *len)
 	__u16 srcp = ntohs(inet->inet_sport);
 	int rx_queue;
 
-	if (icsk->icsk_pending == ICSK_TIME_RETRANS) {
+	if (icsk->icsk_pending == ICSK_TIME_RETRANS ||
+	    icsk->icsk_pending == ICSK_TIME_EARLY_RETRANS ||
+	    icsk->icsk_pending == ICSK_TIME_LOSS_PROBE) {
 		timer_active	= 1;
 		timer_expires	= icsk->icsk_timeout;
 	} else if (icsk->icsk_pending == ICSK_TIME_PROBE0) {
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index e2b4461..beb63db 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -74,6 +74,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 /* Account for new data that has been sent to the network. */
 static void tcp_event_new_data_sent(struct sock *sk, const struct sk_buff *skb)
 {
+	struct inet_connection_sock *icsk = inet_csk(sk);
 	struct tcp_sock *tp = tcp_sk(sk);
 	unsigned int prior_packets = tp->packets_out;
 
@@ -85,7 +86,8 @@ static void tcp_event_new_data_sent(struct sock *sk, const struct sk_buff *skb)
 		tp->frto_counter = 3;
 
 	tp->packets_out += tcp_skb_pcount(skb);
-	if (!prior_packets || tp->early_retrans_delayed)
+	if (!prior_packets || icsk->icsk_pending == ICSK_TIME_EARLY_RETRANS ||
+	    icsk->icsk_pending == ICSK_TIME_LOSS_PROBE)
 		tcp_rearm_rto(sk);
 }
 
@@ -1959,6 +1961,9 @@ static int tcp_mtu_probe(struct sock *sk)
  * snd_up-64k-mss .. snd_up cannot be large. However, taking into
  * account rare use of URG, this is not a big flaw.
  *
+ * Send at most one packet when push_one > 0. Temporarily ignore
+ * cwnd limit to force at most one packet out when push_one == 2.
+ *
  * Returns true, if no segments are in flight and we have queued segments,
  * but cannot send anything now because of SWS or another problem.
  */
@@ -1994,8 +1999,13 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 			goto repair; /* Skip network transmission */
 
 		cwnd_quota = tcp_cwnd_test(tp, skb);
-		if (!cwnd_quota)
-			break;
+		if (!cwnd_quota) {
+			if (push_one == 2)
+				/* Force out a loss probe pkt. */
+				cwnd_quota = 1;
+			else
+				break;
+		}
 
 		if (unlikely(!tcp_snd_wnd_test(tp, skb, mss_now)))
 			break;
@@ -2049,10 +2059,120 @@ repair:
 	if (likely(sent_pkts)) {
 		if (tcp_in_cwnd_reduction(sk))
 			tp->prr_out += sent_pkts;
+
+		/* Send one loss probe per tail loss episode. */
+		if (push_one != 2)
+			tcp_schedule_loss_probe(sk);
 		tcp_cwnd_validate(sk);
 		return false;
 	}
-	return !tp->packets_out && tcp_send_head(sk);
+	return (push_one == 2) || (!tp->packets_out && tcp_send_head(sk));
+}
+
+bool tcp_schedule_loss_probe(struct sock *sk)
+{
+	struct inet_connection_sock *icsk = inet_csk(sk);
+	struct tcp_sock *tp = tcp_sk(sk);
+	u32 timeout, tlp_time_stamp, rto_time_stamp;
+	u32 rtt = tp->srtt >> 3;
+
+	if (WARN_ON(icsk->icsk_pending == ICSK_TIME_EARLY_RETRANS))
+		return false;
+	/* No consecutive loss probes. */
+	if (WARN_ON(icsk->icsk_pending == ICSK_TIME_LOSS_PROBE)) {
+		tcp_rearm_rto(sk);
+		return false;
+	}
+	/* Don't do any loss probe on a Fast Open connection before 3WHS
+	 * finishes.
+	 */
+	if (sk->sk_state == TCP_SYN_RECV)
+		return false;
+
+	/* TLP is only scheduled when next timer event is RTO. */
+	if (icsk->icsk_pending != ICSK_TIME_RETRANS)
+		return false;
+
+	/* Schedule a loss probe in 2*RTT for SACK capable connections
+	 * in Open state, that are either limited by cwnd or application.
+	 */
+	if (sysctl_tcp_early_retrans < 3 || !rtt || !tp->packets_out ||
+	    !tcp_is_sack(tp) || inet_csk(sk)->icsk_ca_state != TCP_CA_Open)
+		return false;
+
+	if ((tp->snd_cwnd > tcp_packets_in_flight(tp)) &&
+	     tcp_send_head(sk))
+		return false;
+
+	/* Probe timeout is at least 1.5*rtt + TCP_DELACK_MAX to account
+	 * for delayed ack when there's one outstanding packet.
+	 */
+	timeout = rtt << 1;
+	if (tp->packets_out == 1)
+		timeout = max_t(u32, timeout,
+				(rtt + (rtt >> 1) + TCP_DELACK_MAX));
+	timeout = max_t(u32, timeout, msecs_to_jiffies(10));
+
+	/* If RTO is shorter, just schedule TLP in its place. */
+	tlp_time_stamp = tcp_time_stamp + timeout;
+	rto_time_stamp = (u32)inet_csk(sk)->icsk_timeout;
+	if ((s32)(tlp_time_stamp - rto_time_stamp) > 0) {
+		s32 delta = rto_time_stamp - tcp_time_stamp;
+		if (delta > 0)
+			timeout = delta;
+	}
+
+	inet_csk_reset_xmit_timer(sk, ICSK_TIME_LOSS_PROBE, timeout,
+				  TCP_RTO_MAX);
+	return true;
+}
+
+/* When probe timeout (PTO) fires, send a new segment if one exists, else
+ * retransmit the last segment.
+ */
+void tcp_send_loss_probe(struct sock *sk)
+{
+	struct sk_buff *skb;
+	int pcount;
+	int mss = tcp_current_mss(sk);
+	int err = -1;
+
+	if (tcp_send_head(sk) != NULL) {
+		err = tcp_write_xmit(sk, mss, TCP_NAGLE_OFF, 2, GFP_ATOMIC);
+		goto rearm_timer;
+	}
+
+	/* Retransmit last segment. */
+	skb = tcp_write_queue_tail(sk);
+	if (WARN_ON(!skb))
+		goto rearm_timer;
+
+	pcount = tcp_skb_pcount(skb);
+	if (WARN_ON(!pcount))
+		goto rearm_timer;
+
+	if ((pcount > 1) && (skb->len > (pcount - 1) * mss)) {
+		if (unlikely(tcp_fragment(sk, skb, (pcount - 1) * mss, mss)))
+			goto rearm_timer;
+		skb = tcp_write_queue_tail(sk);
+	}
+
+	if (WARN_ON(!skb || !tcp_skb_pcount(skb)))
+		goto rearm_timer;
+
+	/* Probe with zero data doesn't trigger fast recovery. */
+	if (skb->len > 0)
+		err = __tcp_retransmit_skb(sk, skb);
+
+rearm_timer:
+	inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
+				  inet_csk(sk)->icsk_rto,
+				  TCP_RTO_MAX);
+
+	if (likely(!err))
+		NET_INC_STATS_BH(sock_net(sk),
+				 LINUX_MIB_TCPLOSSPROBES);
+	return;
 }
 
 /* Push out any pending frames which were held back due to
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index b78aac3..ecd61d5 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -342,10 +342,6 @@ void tcp_retransmit_timer(struct sock *sk)
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct inet_connection_sock *icsk = inet_csk(sk);
 
-	if (tp->early_retrans_delayed) {
-		tcp_resume_early_retransmit(sk);
-		return;
-	}
 	if (tp->fastopen_rsk) {
 		WARN_ON_ONCE(sk->sk_state != TCP_SYN_RECV &&
 			     sk->sk_state != TCP_FIN_WAIT1);
@@ -495,13 +491,20 @@ void tcp_write_timer_handler(struct sock *sk)
 	}
 
 	event = icsk->icsk_pending;
-	icsk->icsk_pending = 0;
 
 	switch (event) {
+	case ICSK_TIME_EARLY_RETRANS:
+		tcp_resume_early_retransmit(sk);
+		break;
+	case ICSK_TIME_LOSS_PROBE:
+		tcp_send_loss_probe(sk);
+		break;
 	case ICSK_TIME_RETRANS:
+		icsk->icsk_pending = 0;
 		tcp_retransmit_timer(sk);
 		break;
 	case ICSK_TIME_PROBE0:
+		icsk->icsk_pending = 0;
 		tcp_probe_timer(sk);
 		break;
 	}
-- 
1.8.1.3


* [PATCH 2/2] tcp: TLP loss detection.
  2013-03-11 20:00 [PATCH 1/2] tcp: Tail loss probe (TLP) Nandita Dukkipati
@ 2013-03-11 20:00 ` Nandita Dukkipati
  2013-03-11 21:38   ` Neal Cardwell
  2013-03-11 21:37 ` [PATCH 1/2] tcp: Tail loss probe (TLP) Neal Cardwell
  2013-03-11 22:47 ` Yuchung Cheng
  2 siblings, 1 reply; 7+ messages in thread
From: Nandita Dukkipati @ 2013-03-11 20:00 UTC (permalink / raw)
  To: David S. Miller, Neal Cardwell, Yuchung Cheng, Eric Dumazet
  Cc: netdev, Ilpo Jarvinen, Tom Herbert, Nandita Dukkipati

This is the second of the TLP patch series; it augments the basic TLP
algorithm with a loss detection scheme.

This patch implements a mechanism for loss detection when a Tail
loss probe retransmission plugs a hole, thereby masking packet loss
from the sender. The loss detection algorithm relies on counting
TLP dupacks as outlined in Sec. 3 of:
http://tools.ietf.org/html/draft-dukkipati-tcpm-tcp-loss-probe-01

The basic idea: the sender starts a TLP "episode" upon
retransmission of a TLP packet. An episode ends when the sender receives
an ACK above the SND.NXT (tracked by tlp_high_seq) at the time of the
episode. We want to make sure that before the episode ends the sender
receives a "TLP dupack", indicating that the TLP retransmission was
unnecessary, so there was no loss/hole that needed plugging. If the
sender gets no TLP dupack before the end of the episode, then it reduces
ssthresh and the congestion window, because the TLP packet arriving at
the receiver probably plugged a hole.
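
As an illustrative sequence: the sender transmits segments 1..3, the
PTO fires, segment 3 is retransmitted as the probe, and tlp_high_seq
records SND.NXT. If segment 3 had in fact been delivered and only its
ACK was delayed, the probe merely elicits a duplicate ACK for
tlp_high_seq (the TLP dupack), and the episode ends with no cwnd
change. If segment 3 really was lost, the probe plugs the hole, no TLP
dupack arrives before the ACKs advance past tlp_high_seq, and the
sender reduces ssthresh and cwnd.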

Signed-off-by: Nandita Dukkipati <nanditad@google.com>
---
 include/linux/tcp.h       |  1 +
 include/uapi/linux/snmp.h |  1 +
 net/ipv4/proc.c           |  1 +
 net/ipv4/tcp_input.c      | 39 +++++++++++++++++++++++++++++++++++++++
 net/ipv4/tcp_minisocks.c  |  1 +
 net/ipv4/tcp_output.c     |  9 +++++++++
 net/ipv4/tcp_timer.c      |  2 ++
 7 files changed, 54 insertions(+)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 01860d7..763c108 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -204,6 +204,7 @@ struct tcp_sock {
 		syn_data:1,	/* SYN includes data */
 		syn_fastopen:1,	/* SYN includes Fast Open option */
 		syn_data_acked:1;/* data in SYN is acked by SYN-ACK */
+	u32	tlp_high_seq;	/* snd_nxt at the time of TLP retransmit. */
 
 /* RTT measurement */
 	u32	srtt;		/* smoothed round trip time << 3	*/
diff --git a/include/uapi/linux/snmp.h b/include/uapi/linux/snmp.h
index 290bed6..e00013a 100644
--- a/include/uapi/linux/snmp.h
+++ b/include/uapi/linux/snmp.h
@@ -203,6 +203,7 @@ enum
 	LINUX_MIB_TCPSLOWSTARTRETRANS,		/* TCPSlowStartRetrans */
 	LINUX_MIB_TCPTIMEOUTS,			/* TCPTimeouts */
 	LINUX_MIB_TCPLOSSPROBES,		/* TCPLossProbes */
+	LINUX_MIB_TCPLOSSPROBERECOVERY,		/* TCPLossProbeRecovery */
 	LINUX_MIB_TCPRENORECOVERYFAIL,		/* TCPRenoRecoveryFail */
 	LINUX_MIB_TCPSACKRECOVERYFAIL,		/* TCPSackRecoveryFail */
 	LINUX_MIB_TCPSCHEDULERFAILED,		/* TCPSchedulerFailed */
diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
index 4c35911..b6f2ea1 100644
--- a/net/ipv4/proc.c
+++ b/net/ipv4/proc.c
@@ -225,6 +225,7 @@ static const struct snmp_mib snmp4_net_list[] = {
 	SNMP_MIB_ITEM("TCPSlowStartRetrans", LINUX_MIB_TCPSLOWSTARTRETRANS),
 	SNMP_MIB_ITEM("TCPTimeouts", LINUX_MIB_TCPTIMEOUTS),
 	SNMP_MIB_ITEM("TCPLossProbes", LINUX_MIB_TCPLOSSPROBES),
+	SNMP_MIB_ITEM("TCPLossProbeRecovery", LINUX_MIB_TCPLOSSPROBERECOVERY),
 	SNMP_MIB_ITEM("TCPRenoRecoveryFail", LINUX_MIB_TCPRENORECOVERYFAIL),
 	SNMP_MIB_ITEM("TCPSackRecoveryFail", LINUX_MIB_TCPSACKRECOVERYFAIL),
 	SNMP_MIB_ITEM("TCPSchedulerFailed", LINUX_MIB_TCPSCHEDULERFAILED),
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index b794f89..836d74d 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2682,6 +2682,7 @@ static void tcp_init_cwnd_reduction(struct sock *sk, const bool set_ssthresh)
 	struct tcp_sock *tp = tcp_sk(sk);
 
 	tp->high_seq = tp->snd_nxt;
+	tp->tlp_high_seq = 0;
 	tp->snd_cwnd_cnt = 0;
 	tp->prior_cwnd = tp->snd_cwnd;
 	tp->prr_delivered = 0;
@@ -3569,6 +3570,38 @@ static void tcp_send_challenge_ack(struct sock *sk)
 	}
 }
 
+/* This routine deals with acks during a TLP episode.
+ * Ref: loss detection algorithm in draft-dukkipati-tcpm-tcp-loss-probe.
+ */
+static void tcp_process_tlp_ack(struct sock *sk, u32 ack, int flag)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	bool is_tlp_dupack = (ack == tp->tlp_high_seq) &&
+			     !(flag & (FLAG_SND_UNA_ADVANCED |
+				       FLAG_NOT_DUP | FLAG_DATA_SACKED));
+
+	/* Mark the end of TLP episode on receiving TLP dupack or when
+	 * ack is after tlp_high_seq.
+	 */
+	if (is_tlp_dupack) {
+		tp->tlp_high_seq = 0;
+		return;
+	}
+
+	if (after(ack, tp->tlp_high_seq)) {
+		tp->tlp_high_seq = 0;
+		/* Don't reduce cwnd if DSACK arrives for TLP retrans. */
+		if (!(flag & FLAG_DSACKING_ACK)) {
+			tcp_init_cwnd_reduction(sk, true);
+			tcp_set_ca_state(sk, TCP_CA_CWR);
+			tcp_end_cwnd_reduction(sk);
+			tcp_set_ca_state(sk, TCP_CA_Open);
+			NET_INC_STATS_BH(sock_net(sk),
+					 LINUX_MIB_TCPLOSSPROBERECOVERY);
+		}
+	}
+}
+
 /* This routine deals with incoming acks, but not outgoing ones. */
 static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
 {
@@ -3676,6 +3709,9 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
 			tcp_cong_avoid(sk, ack, prior_in_flight);
 	}
 
+	if (tp->tlp_high_seq)
+		tcp_process_tlp_ack(sk, ack, flag);
+
 	if ((flag & FLAG_FORWARD_PROGRESS) || !(flag & FLAG_NOT_DUP)) {
 		struct dst_entry *dst = __sk_dst_get(sk);
 		if (dst)
@@ -3697,6 +3733,9 @@ no_queue:
 	 */
 	if (tcp_send_head(sk))
 		tcp_ack_probe(sk);
+
+	if (tp->tlp_high_seq)
+		tcp_process_tlp_ack(sk, ack, flag);
 	return 1;
 
 invalid_ack:
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index b83a49c..4bdb09f 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -440,6 +440,7 @@ struct sock *tcp_create_openreq_child(struct sock *sk, struct request_sock *req,
 		newtp->fackets_out = 0;
 		newtp->snd_ssthresh = TCP_INFINITE_SSTHRESH;
 		tcp_enable_early_retrans(newtp);
+		newtp->tlp_high_seq = 0;
 
 		/* So many TCP implementations out there (incorrectly) count the
 		 * initial SYN frame in their delayed-ACK and congestion control
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index beb63db..8e7742f 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2132,6 +2132,7 @@ bool tcp_schedule_loss_probe(struct sock *sk)
  */
 void tcp_send_loss_probe(struct sock *sk)
 {
+	struct tcp_sock *tp = tcp_sk(sk);
 	struct sk_buff *skb;
 	int pcount;
 	int mss = tcp_current_mss(sk);
@@ -2142,6 +2143,10 @@ void tcp_send_loss_probe(struct sock *sk)
 		goto rearm_timer;
 	}
 
+	/* At most one outstanding TLP retransmission. */
+	if (tp->tlp_high_seq)
+		goto rearm_timer;
+
 	/* Retransmit last segment. */
 	skb = tcp_write_queue_tail(sk);
 	if (WARN_ON(!skb))
@@ -2164,6 +2169,10 @@ void tcp_send_loss_probe(struct sock *sk)
 	if (skb->len > 0)
 		err = __tcp_retransmit_skb(sk, skb);
 
+	/* Record snd_nxt for loss detection. */
+	if (likely(!err))
+		tp->tlp_high_seq = tp->snd_nxt;
+
 rearm_timer:
 	inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
 				  inet_csk(sk)->icsk_rto,
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index ecd61d5..eeccf79 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -356,6 +356,8 @@ void tcp_retransmit_timer(struct sock *sk)
 
 	WARN_ON(tcp_write_queue_empty(sk));
 
+	tp->tlp_high_seq = 0;
+
 	if (!tp->snd_wnd && !sock_flag(sk, SOCK_DEAD) &&
 	    !((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV))) {
 		/* Receiver dastardly shrinks window. Our retransmits
-- 
1.8.1.3


* Re: [PATCH 1/2] tcp: Tail loss probe (TLP)
  2013-03-11 20:00 [PATCH 1/2] tcp: Tail loss probe (TLP) Nandita Dukkipati
  2013-03-11 20:00 ` [PATCH 2/2] tcp: TLP loss detection Nandita Dukkipati
@ 2013-03-11 21:37 ` Neal Cardwell
  2013-03-12 12:45   ` David Miller
  2013-03-11 22:47 ` Yuchung Cheng
  2 siblings, 1 reply; 7+ messages in thread
From: Neal Cardwell @ 2013-03-11 21:37 UTC (permalink / raw)
  To: Nandita Dukkipati
  Cc: David S. Miller, Yuchung Cheng, Eric Dumazet, Netdev,
	Ilpo Jarvinen, Tom Herbert

On Mon, Mar 11, 2013 at 4:00 PM, Nandita Dukkipati <nanditad@google.com> wrote:
> This patch series implements the Tail loss probe (TLP) algorithm described
> in http://tools.ietf.org/html/draft-dukkipati-tcpm-tcp-loss-probe-01. The
> first patch implements the basic algorithm.
 ...
> Signed-off-by: Nandita Dukkipati <nanditad@google.com>
> ---

Acked-by: Neal Cardwell <ncardwell@google.com>

neal


* Re: [PATCH 2/2] tcp: TLP loss detection.
  2013-03-11 20:00 ` [PATCH 2/2] tcp: TLP loss detection Nandita Dukkipati
@ 2013-03-11 21:38   ` Neal Cardwell
  2013-03-12 12:45     ` David Miller
  0 siblings, 1 reply; 7+ messages in thread
From: Neal Cardwell @ 2013-03-11 21:38 UTC (permalink / raw)
  To: Nandita Dukkipati
  Cc: David S. Miller, Yuchung Cheng, Eric Dumazet, Netdev,
	Ilpo Jarvinen, Tom Herbert

On Mon, Mar 11, 2013 at 4:00 PM, Nandita Dukkipati <nanditad@google.com> wrote:
> This is the second of the TLP patch series; it augments the basic TLP
> algorithm with a loss detection scheme.
>
> This patch implements a mechanism for loss detection when a Tail
> loss probe retransmission plugs a hole thereby masking packet loss
> from the sender. The loss detection algorithm relies on counting
> TLP dupacks as outlined in Sec. 3 of:
> http://tools.ietf.org/html/draft-dukkipati-tcpm-tcp-loss-probe-01
>
> The basic idea is: Sender keeps track of TLP "episode" upon
> retransmission of a TLP packet. An episode ends when the sender receives
> an ACK above the SND.NXT (tracked by tlp_high_seq) at the time of the
> episode. We want to make sure that before the episode ends the sender
> receives a "TLP dupack", indicating that the TLP retransmission was
> unnecessary, so there was no loss/hole that needed plugging. If the
> sender gets no TLP dupack before the end of the episode, then it reduces
> ssthresh and the congestion window, because the TLP packet arriving at
> the receiver probably plugged a hole.
>
> Signed-off-by: Nandita Dukkipati <nanditad@google.com>

Acked-by: Neal Cardwell <ncardwell@google.com>

neal


* Re: [PATCH 1/2] tcp: Tail loss probe (TLP)
  2013-03-11 20:00 [PATCH 1/2] tcp: Tail loss probe (TLP) Nandita Dukkipati
  2013-03-11 20:00 ` [PATCH 2/2] tcp: TLP loss detection Nandita Dukkipati
  2013-03-11 21:37 ` [PATCH 1/2] tcp: Tail loss probe (TLP) Neal Cardwell
@ 2013-03-11 22:47 ` Yuchung Cheng
  2 siblings, 0 replies; 7+ messages in thread
From: Yuchung Cheng @ 2013-03-11 22:47 UTC (permalink / raw)
  To: Nandita Dukkipati
  Cc: David S. Miller, Neal Cardwell, Eric Dumazet, netdev,
	Ilpo Jarvinen, Tom Herbert

On Mon, Mar 11, 2013 at 4:00 PM, Nandita Dukkipati <nanditad@google.com> wrote:
> This patch series implements the Tail loss probe (TLP) algorithm described
> in http://tools.ietf.org/html/draft-dukkipati-tcpm-tcp-loss-probe-01. The
> first patch implements the basic algorithm.
 ...
> Signed-off-by: Nandita Dukkipati <nanditad@google.com>
> ---
Acked-by: Yuchung Cheng <ycheng@google.com>

>  Documentation/networking/ip-sysctl.txt |   8 ++-
>  include/linux/tcp.h                    |   1 -
>  include/net/inet_connection_sock.h     |   5 +-
>  include/net/tcp.h                      |   6 +-
>  include/uapi/linux/snmp.h              |   1 +
>  net/ipv4/inet_diag.c                   |   4 +-
>  net/ipv4/proc.c                        |   1 +
>  net/ipv4/sysctl_net_ipv4.c             |   4 +-
>  net/ipv4/tcp_input.c                   |  24 ++++---
>  net/ipv4/tcp_ipv4.c                    |   4 +-
>  net/ipv4/tcp_output.c                  | 128 +++++++++++++++++++++++++++++++--
>  net/ipv4/tcp_timer.c                   |  13 ++--
>  12 files changed, 171 insertions(+), 28 deletions(-)
>
> diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
> index dc2dc87..1cae6c3 100644
> --- a/Documentation/networking/ip-sysctl.txt
> +++ b/Documentation/networking/ip-sysctl.txt
> @@ -190,7 +190,9 @@ tcp_early_retrans - INTEGER
>         Enable Early Retransmit (ER), per RFC 5827. ER lowers the threshold
>         for triggering fast retransmit when the amount of outstanding data is
>         small and when no previously unsent data can be transmitted (such
> -       that limited transmit could be used).
> +       that limited transmit could be used). Also controls the use of
> +       Tail loss probe (TLP) that converts RTOs occuring due to tail
> +       losses into fast recovery (draft-dukkipati-tcpm-tcp-loss-probe-01).
>         Possible values:
>                 0 disables ER
>                 1 enables ER
> @@ -198,7 +200,9 @@ tcp_early_retrans - INTEGER
>                   by a fourth of RTT. This mitigates connection falsely
>                   recovers when network has a small degree of reordering
>                   (less than 3 packets).
> -       Default: 2
> +               3 enables delayed ER and TLP.
> +               4 enables TLP only.
> +       Default: 3
>
>  tcp_ecn - INTEGER
>         Control use of Explicit Congestion Notification (ECN) by TCP.
> diff --git a/include/linux/tcp.h b/include/linux/tcp.h
> index 515c374..01860d7 100644
> --- a/include/linux/tcp.h
> +++ b/include/linux/tcp.h
> @@ -201,7 +201,6 @@ struct tcp_sock {
>                 unused      : 1;
>         u8      repair_queue;
>         u8      do_early_retrans:1,/* Enable RFC5827 early-retransmit  */
> -               early_retrans_delayed:1, /* Delayed ER timer installed */
>                 syn_data:1,     /* SYN includes data */
>                 syn_fastopen:1, /* SYN includes Fast Open option */
>                 syn_data_acked:1;/* data in SYN is acked by SYN-ACK */
> diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
> index 1832927..de2c785 100644
> --- a/include/net/inet_connection_sock.h
> +++ b/include/net/inet_connection_sock.h
> @@ -133,6 +133,8 @@ struct inet_connection_sock {
>  #define ICSK_TIME_RETRANS      1       /* Retransmit timer */
>  #define ICSK_TIME_DACK         2       /* Delayed ack timer */
>  #define ICSK_TIME_PROBE0       3       /* Zero window probe timer */
> +#define ICSK_TIME_EARLY_RETRANS 4      /* Early retransmit timer */
> +#define ICSK_TIME_LOSS_PROBE   5       /* Tail loss probe timer */
>
>  static inline struct inet_connection_sock *inet_csk(const struct sock *sk)
>  {
> @@ -222,7 +224,8 @@ static inline void inet_csk_reset_xmit_timer(struct sock *sk, const int what,
>                 when = max_when;
>         }
>
> -       if (what == ICSK_TIME_RETRANS || what == ICSK_TIME_PROBE0) {
> +       if (what == ICSK_TIME_RETRANS || what == ICSK_TIME_PROBE0 ||
> +           what == ICSK_TIME_EARLY_RETRANS || what ==  ICSK_TIME_LOSS_PROBE) {
>                 icsk->icsk_pending = what;
>                 icsk->icsk_timeout = jiffies + when;
>                 sk_reset_timer(sk, &icsk->icsk_retransmit_timer, icsk->icsk_timeout);
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index a2baa5e..ab9f947 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -543,6 +543,8 @@ extern bool tcp_syn_flood_action(struct sock *sk,
>  extern void tcp_push_one(struct sock *, unsigned int mss_now);
>  extern void tcp_send_ack(struct sock *sk);
>  extern void tcp_send_delayed_ack(struct sock *sk);
> +extern void tcp_send_loss_probe(struct sock *sk);
> +extern bool tcp_schedule_loss_probe(struct sock *sk);
>
>  /* tcp_input.c */
>  extern void tcp_cwnd_application_limited(struct sock *sk);
> @@ -873,8 +875,8 @@ static inline void tcp_enable_fack(struct tcp_sock *tp)
>  static inline void tcp_enable_early_retrans(struct tcp_sock *tp)
>  {
>         tp->do_early_retrans = sysctl_tcp_early_retrans &&
> -               !sysctl_tcp_thin_dupack && sysctl_tcp_reordering == 3;
> -       tp->early_retrans_delayed = 0;
> +               sysctl_tcp_early_retrans < 4 && !sysctl_tcp_thin_dupack &&
> +               sysctl_tcp_reordering == 3;
>  }
>
>  static inline void tcp_disable_early_retrans(struct tcp_sock *tp)
> diff --git a/include/uapi/linux/snmp.h b/include/uapi/linux/snmp.h
> index b49eab8..290bed6 100644
> --- a/include/uapi/linux/snmp.h
> +++ b/include/uapi/linux/snmp.h
> @@ -202,6 +202,7 @@ enum
>         LINUX_MIB_TCPFORWARDRETRANS,            /* TCPForwardRetrans */
>         LINUX_MIB_TCPSLOWSTARTRETRANS,          /* TCPSlowStartRetrans */
>         LINUX_MIB_TCPTIMEOUTS,                  /* TCPTimeouts */
> +       LINUX_MIB_TCPLOSSPROBES,                /* TCPLossProbes */
>         LINUX_MIB_TCPRENORECOVERYFAIL,          /* TCPRenoRecoveryFail */
>         LINUX_MIB_TCPSACKRECOVERYFAIL,          /* TCPSackRecoveryFail */
>         LINUX_MIB_TCPSCHEDULERFAILED,           /* TCPSchedulerFailed */
> diff --git a/net/ipv4/inet_diag.c b/net/ipv4/inet_diag.c
> index 7afa2c3..8620408 100644
> --- a/net/ipv4/inet_diag.c
> +++ b/net/ipv4/inet_diag.c
> @@ -158,7 +158,9 @@ int inet_sk_diag_fill(struct sock *sk, struct inet_connection_sock *icsk,
>
>  #define EXPIRES_IN_MS(tmo)  DIV_ROUND_UP((tmo - jiffies) * 1000, HZ)
>
> -       if (icsk->icsk_pending == ICSK_TIME_RETRANS) {
> +       if (icsk->icsk_pending == ICSK_TIME_RETRANS ||
> +           icsk->icsk_pending == ICSK_TIME_EARLY_RETRANS ||
> +           icsk->icsk_pending == ICSK_TIME_LOSS_PROBE) {
>                 r->idiag_timer = 1;
>                 r->idiag_retrans = icsk->icsk_retransmits;
>                 r->idiag_expires = EXPIRES_IN_MS(icsk->icsk_timeout);
> diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
> index 32030a2..4c35911 100644
> --- a/net/ipv4/proc.c
> +++ b/net/ipv4/proc.c
> @@ -224,6 +224,7 @@ static const struct snmp_mib snmp4_net_list[] = {
>         SNMP_MIB_ITEM("TCPForwardRetrans", LINUX_MIB_TCPFORWARDRETRANS),
>         SNMP_MIB_ITEM("TCPSlowStartRetrans", LINUX_MIB_TCPSLOWSTARTRETRANS),
>         SNMP_MIB_ITEM("TCPTimeouts", LINUX_MIB_TCPTIMEOUTS),
> +       SNMP_MIB_ITEM("TCPLossProbes", LINUX_MIB_TCPLOSSPROBES),
>         SNMP_MIB_ITEM("TCPRenoRecoveryFail", LINUX_MIB_TCPRENORECOVERYFAIL),
>         SNMP_MIB_ITEM("TCPSackRecoveryFail", LINUX_MIB_TCPSACKRECOVERYFAIL),
>         SNMP_MIB_ITEM("TCPSchedulerFailed", LINUX_MIB_TCPSCHEDULERFAILED),
> diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
> index 960fd29..cca4550 100644
> --- a/net/ipv4/sysctl_net_ipv4.c
> +++ b/net/ipv4/sysctl_net_ipv4.c
> @@ -28,7 +28,7 @@
>
>  static int zero;
>  static int one = 1;
> -static int two = 2;
> +static int four = 4;
>  static int tcp_retr1_max = 255;
>  static int ip_local_port_range_min[] = { 1, 1 };
>  static int ip_local_port_range_max[] = { 65535, 65535 };
> @@ -760,7 +760,7 @@ static struct ctl_table ipv4_table[] = {
>                 .mode           = 0644,
>                 .proc_handler   = proc_dointvec_minmax,
>                 .extra1         = &zero,
> -               .extra2         = &two,
> +               .extra2         = &four,
>         },
>         {
>                 .procname       = "udp_mem",
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 0d9bdac..b794f89 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -98,7 +98,7 @@ int sysctl_tcp_frto_response __read_mostly;
>  int sysctl_tcp_thin_dupack __read_mostly;
>
>  int sysctl_tcp_moderate_rcvbuf __read_mostly = 1;
> -int sysctl_tcp_early_retrans __read_mostly = 2;
> +int sysctl_tcp_early_retrans __read_mostly = 3;
>
>  #define FLAG_DATA              0x01 /* Incoming frame contained data.          */
>  #define FLAG_WIN_UPDATE                0x02 /* Incoming ACK was a window update.       */
> @@ -2150,15 +2150,16 @@ static bool tcp_pause_early_retransmit(struct sock *sk, int flag)
>          * max(RTT/4, 2msec) unless ack has ECE mark, no RTT samples
>          * available, or RTO is scheduled to fire first.
>          */
> -       if (sysctl_tcp_early_retrans < 2 || (flag & FLAG_ECE) || !tp->srtt)
> +       if (sysctl_tcp_early_retrans < 2 || sysctl_tcp_early_retrans > 3 ||
> +           (flag & FLAG_ECE) || !tp->srtt)
>                 return false;
>
>         delay = max_t(unsigned long, (tp->srtt >> 5), msecs_to_jiffies(2));
>         if (!time_after(inet_csk(sk)->icsk_timeout, (jiffies + delay)))
>                 return false;
>
> -       inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, delay, TCP_RTO_MAX);
> -       tp->early_retrans_delayed = 1;
> +       inet_csk_reset_xmit_timer(sk, ICSK_TIME_EARLY_RETRANS, delay,
> +                                 TCP_RTO_MAX);
>         return true;
>  }
>
> @@ -2321,7 +2322,7 @@ static bool tcp_time_to_recover(struct sock *sk, int flag)
>          * interval if appropriate.
>          */
>         if (tp->do_early_retrans && !tp->retrans_out && tp->sacked_out &&
> -           (tp->packets_out == (tp->sacked_out + 1) && tp->packets_out < 4) &&
> +           (tp->packets_out >= (tp->sacked_out + 1) && tp->packets_out < 4) &&
>             !tcp_may_send_now(sk))
>                 return !tcp_pause_early_retransmit(sk, flag);
>
> @@ -3081,6 +3082,7 @@ static void tcp_cong_avoid(struct sock *sk, u32 ack, u32 in_flight)
>   */
>  void tcp_rearm_rto(struct sock *sk)
>  {
> +       const struct inet_connection_sock *icsk = inet_csk(sk);
>         struct tcp_sock *tp = tcp_sk(sk);
>
>         /* If the retrans timer is currently being used by Fast Open
> @@ -3094,12 +3096,13 @@ void tcp_rearm_rto(struct sock *sk)
>         } else {
>                 u32 rto = inet_csk(sk)->icsk_rto;
>                 /* Offset the time elapsed after installing regular RTO */
> -               if (tp->early_retrans_delayed) {
> +               if (icsk->icsk_pending == ICSK_TIME_EARLY_RETRANS ||
> +                   icsk->icsk_pending == ICSK_TIME_LOSS_PROBE) {
>                         struct sk_buff *skb = tcp_write_queue_head(sk);
>                         const u32 rto_time_stamp = TCP_SKB_CB(skb)->when + rto;
>                         s32 delta = (s32)(rto_time_stamp - tcp_time_stamp);
>                         /* delta may not be positive if the socket is locked
> -                        * when the delayed ER timer fires and is rescheduled.
> +                        * when the retrans timer fires and is rescheduled.
>                          */
>                         if (delta > 0)
>                                 rto = delta;
> @@ -3107,7 +3110,6 @@ void tcp_rearm_rto(struct sock *sk)
>                 inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, rto,
>                                           TCP_RTO_MAX);
>         }
> -       tp->early_retrans_delayed = 0;
>  }
>
>  /* This function is called when the delayed ER timer fires. TCP enters
> @@ -3601,7 +3603,8 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
>         if (after(ack, tp->snd_nxt))
>                 goto invalid_ack;
>
> -       if (tp->early_retrans_delayed)
> +       if (icsk->icsk_pending == ICSK_TIME_EARLY_RETRANS ||
> +           icsk->icsk_pending == ICSK_TIME_LOSS_PROBE)
>                 tcp_rearm_rto(sk);
>
>         if (after(ack, prior_snd_una))
> @@ -3678,6 +3681,9 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
>                 if (dst)
>                         dst_confirm(dst);
>         }
> +
> +       if (icsk->icsk_pending == ICSK_TIME_RETRANS)
> +               tcp_schedule_loss_probe(sk);
>         return 1;
>
>  no_queue:
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index 8cdee12..b7ab868 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -2703,7 +2703,9 @@ static void get_tcp4_sock(struct sock *sk, struct seq_file *f, int i, int *len)
>         __u16 srcp = ntohs(inet->inet_sport);
>         int rx_queue;
>
> -       if (icsk->icsk_pending == ICSK_TIME_RETRANS) {
> +       if (icsk->icsk_pending == ICSK_TIME_RETRANS ||
> +           icsk->icsk_pending == ICSK_TIME_EARLY_RETRANS ||
> +           icsk->icsk_pending == ICSK_TIME_LOSS_PROBE) {
>                 timer_active    = 1;
>                 timer_expires   = icsk->icsk_timeout;
>         } else if (icsk->icsk_pending == ICSK_TIME_PROBE0) {
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index e2b4461..beb63db 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -74,6 +74,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
>  /* Account for new data that has been sent to the network. */
>  static void tcp_event_new_data_sent(struct sock *sk, const struct sk_buff *skb)
>  {
> +       struct inet_connection_sock *icsk = inet_csk(sk);
>         struct tcp_sock *tp = tcp_sk(sk);
>         unsigned int prior_packets = tp->packets_out;
>
> @@ -85,7 +86,8 @@ static void tcp_event_new_data_sent(struct sock *sk, const struct sk_buff *skb)
>                 tp->frto_counter = 3;
>
>         tp->packets_out += tcp_skb_pcount(skb);
> -       if (!prior_packets || tp->early_retrans_delayed)
> +       if (!prior_packets || icsk->icsk_pending == ICSK_TIME_EARLY_RETRANS ||
> +           icsk->icsk_pending == ICSK_TIME_LOSS_PROBE)
>                 tcp_rearm_rto(sk);
>  }
>
> @@ -1959,6 +1961,9 @@ static int tcp_mtu_probe(struct sock *sk)
>   * snd_up-64k-mss .. snd_up cannot be large. However, taking into
>   * account rare use of URG, this is not a big flaw.
>   *
> + * Send at most one packet when push_one > 0. Temporarily ignore
> + * cwnd limit to force at most one packet out when push_one == 2.
> +
>   * Returns true, if no segments are in flight and we have queued segments,
>   * but cannot send anything now because of SWS or another problem.
>   */
> @@ -1994,8 +1999,13 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
>                         goto repair; /* Skip network transmission */
>
>                 cwnd_quota = tcp_cwnd_test(tp, skb);
> -               if (!cwnd_quota)
> -                       break;
> +               if (!cwnd_quota) {
> +                       if (push_one == 2)
> +                               /* Force out a loss probe pkt. */
> +                               cwnd_quota = 1;
> +                       else
> +                               break;
> +               }
>
>                 if (unlikely(!tcp_snd_wnd_test(tp, skb, mss_now)))
>                         break;
> @@ -2049,10 +2059,120 @@ repair:
>         if (likely(sent_pkts)) {
>                 if (tcp_in_cwnd_reduction(sk))
>                         tp->prr_out += sent_pkts;
> +
> +               /* Send one loss probe per tail loss episode. */
> +               if (push_one != 2)
> +                       tcp_schedule_loss_probe(sk);
>                 tcp_cwnd_validate(sk);
>                 return false;
>         }
> -       return !tp->packets_out && tcp_send_head(sk);
> +       return (push_one == 2) || (!tp->packets_out && tcp_send_head(sk));
> +}
> +
> +bool tcp_schedule_loss_probe(struct sock *sk)
> +{
> +       struct inet_connection_sock *icsk = inet_csk(sk);
> +       struct tcp_sock *tp = tcp_sk(sk);
> +       u32 timeout, tlp_time_stamp, rto_time_stamp;
> +       u32 rtt = tp->srtt >> 3;
> +
> +       if (WARN_ON(icsk->icsk_pending == ICSK_TIME_EARLY_RETRANS))
> +               return false;
> +       /* No consecutive loss probes. */
> +       if (WARN_ON(icsk->icsk_pending == ICSK_TIME_LOSS_PROBE)) {
> +               tcp_rearm_rto(sk);
> +               return false;
> +       }
> +       /* Don't do any loss probe on a Fast Open connection before 3WHS
> +        * finishes.
> +        */
> +       if (sk->sk_state == TCP_SYN_RECV)
> +               return false;
> +
> +       /* TLP is only scheduled when next timer event is RTO. */
> +       if (icsk->icsk_pending != ICSK_TIME_RETRANS)
> +               return false;
> +
> +       /* Schedule a loss probe in 2*RTT for SACK capable connections
> +        * in Open state, that are either limited by cwnd or application.
> +        */
> +       if (sysctl_tcp_early_retrans < 3 || !rtt || !tp->packets_out ||
> +           !tcp_is_sack(tp) || inet_csk(sk)->icsk_ca_state != TCP_CA_Open)
> +               return false;
> +
> +       if ((tp->snd_cwnd > tcp_packets_in_flight(tp)) &&
> +            tcp_send_head(sk))
> +               return false;
> +
> +       /* Probe timeout is at least 1.5*rtt + TCP_DELACK_MAX to account
> +        * for delayed ack when there's one outstanding packet.
> +        */
> +       timeout = rtt << 1;
> +       if (tp->packets_out == 1)
> +               timeout = max_t(u32, timeout,
> +                               (rtt + (rtt >> 1) + TCP_DELACK_MAX));
> +       timeout = max_t(u32, timeout, msecs_to_jiffies(10));
> +
> +       /* If RTO is shorter, just schedule TLP in its place. */
> +       tlp_time_stamp = tcp_time_stamp + timeout;
> +       rto_time_stamp = (u32)inet_csk(sk)->icsk_timeout;
> +       if ((s32)(tlp_time_stamp - rto_time_stamp) > 0) {
> +               s32 delta = rto_time_stamp - tcp_time_stamp;
> +               if (delta > 0)
> +                       timeout = delta;
> +       }
> +
> +       inet_csk_reset_xmit_timer(sk, ICSK_TIME_LOSS_PROBE, timeout,
> +                                 TCP_RTO_MAX);
> +       return true;
> +}
> +
> +/* When probe timeout (PTO) fires, send a new segment if one exists, else
> + * retransmit the last segment.
> + */
> +void tcp_send_loss_probe(struct sock *sk)
> +{
> +       struct sk_buff *skb;
> +       int pcount;
> +       int mss = tcp_current_mss(sk);
> +       int err = -1;
> +
> +       if (tcp_send_head(sk) != NULL) {
> +               err = tcp_write_xmit(sk, mss, TCP_NAGLE_OFF, 2, GFP_ATOMIC);
> +               goto rearm_timer;
> +       }
> +
> +       /* Retransmit last segment. */
> +       skb = tcp_write_queue_tail(sk);
> +       if (WARN_ON(!skb))
> +               goto rearm_timer;
> +
> +       pcount = tcp_skb_pcount(skb);
> +       if (WARN_ON(!pcount))
> +               goto rearm_timer;
> +
> +       if ((pcount > 1) && (skb->len > (pcount - 1) * mss)) {
> +               if (unlikely(tcp_fragment(sk, skb, (pcount - 1) * mss, mss)))
> +                       goto rearm_timer;
> +               skb = tcp_write_queue_tail(sk);
> +       }
> +
> +       if (WARN_ON(!skb || !tcp_skb_pcount(skb)))
> +               goto rearm_timer;
> +
> +       /* Probe with zero data doesn't trigger fast recovery. */
> +       if (skb->len > 0)
> +               err = __tcp_retransmit_skb(sk, skb);
> +
> +rearm_timer:
> +       inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
> +                                 inet_csk(sk)->icsk_rto,
> +                                 TCP_RTO_MAX);
> +
> +       if (likely(!err))
> +               NET_INC_STATS_BH(sock_net(sk),
> +                                LINUX_MIB_TCPLOSSPROBES);
> +       return;
>  }
>
>  /* Push out any pending frames which were held back due to
> diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
> index b78aac3..ecd61d5 100644
> --- a/net/ipv4/tcp_timer.c
> +++ b/net/ipv4/tcp_timer.c
> @@ -342,10 +342,6 @@ void tcp_retransmit_timer(struct sock *sk)
>         struct tcp_sock *tp = tcp_sk(sk);
>         struct inet_connection_sock *icsk = inet_csk(sk);
>
> -       if (tp->early_retrans_delayed) {
> -               tcp_resume_early_retransmit(sk);
> -               return;
> -       }
>         if (tp->fastopen_rsk) {
>                 WARN_ON_ONCE(sk->sk_state != TCP_SYN_RECV &&
>                              sk->sk_state != TCP_FIN_WAIT1);
> @@ -495,13 +491,20 @@ void tcp_write_timer_handler(struct sock *sk)
>         }
>
>         event = icsk->icsk_pending;
> -       icsk->icsk_pending = 0;
>
>         switch (event) {
> +       case ICSK_TIME_EARLY_RETRANS:
> +               tcp_resume_early_retransmit(sk);
> +               break;
> +       case ICSK_TIME_LOSS_PROBE:
> +               tcp_send_loss_probe(sk);
> +               break;
>         case ICSK_TIME_RETRANS:
> +               icsk->icsk_pending = 0;
>                 tcp_retransmit_timer(sk);
>                 break;
>         case ICSK_TIME_PROBE0:
> +               icsk->icsk_pending = 0;
>                 tcp_probe_timer(sk);
>                 break;
>         }
> --
> 1.8.1.3
>


* Re: [PATCH 1/2] tcp: Tail loss probe (TLP)
  2013-03-11 21:37 ` [PATCH 1/2] tcp: Tail loss probe (TLP) Neal Cardwell
@ 2013-03-12 12:45   ` David Miller
  0 siblings, 0 replies; 7+ messages in thread
From: David Miller @ 2013-03-12 12:45 UTC (permalink / raw)
  To: ncardwell; +Cc: nanditad, ycheng, edumazet, netdev, ilpo.jarvinen, therbert

From: Neal Cardwell <ncardwell@google.com>
Date: Mon, 11 Mar 2013 17:37:28 -0400

> On Mon, Mar 11, 2013 at 4:00 PM, Nandita Dukkipati <nanditad@google.com> wrote:
>> This patch series implement the Tail loss probe (TLP) algorithm described
>> in http://tools.ietf.org/html/draft-dukkipati-tcpm-tcp-loss-probe-01. The
>> first patch implements the basic algorithm.
 ...
>> Signed-off-by: Nandita Dukkipati <nanditad@google.com>
>> ---
> 
> Acked-by: Neal Cardwell <ncardwell@google.com>

Looks good, applied, thanks.


* Re: [PATCH 2/2] tcp: TLP loss detection.
  2013-03-11 21:38   ` Neal Cardwell
@ 2013-03-12 12:45     ` David Miller
  0 siblings, 0 replies; 7+ messages in thread
From: David Miller @ 2013-03-12 12:45 UTC (permalink / raw)
  To: ncardwell; +Cc: nanditad, ycheng, edumazet, netdev, ilpo.jarvinen, therbert

From: Neal Cardwell <ncardwell@google.com>
Date: Mon, 11 Mar 2013 17:38:12 -0400

> On Mon, Mar 11, 2013 at 4:00 PM, Nandita Dukkipati <nanditad@google.com> wrote:
>> This is the second of the TLP patch series; it augments the basic TLP
>> algorithm with a loss detection scheme.
>>
>> This patch implements a mechanism for loss detection when a Tail
>> loss probe retransmission plugs a hole thereby masking packet loss
>> from the sender. The loss detection algorithm relies on counting
>> TLP dupacks as outlined in Sec. 3 of:
>> http://tools.ietf.org/html/draft-dukkipati-tcpm-tcp-loss-probe-01
>>
>> The basic idea is: Sender keeps track of TLP "episode" upon
>> retransmission of a TLP packet. An episode ends when the sender receives
>> an ACK above the SND.NXT (tracked by tlp_high_seq) at the time of the
>> episode. We want to make sure that before the episode ends the sender
>> receives a "TLP dupack", indicating that the TLP retransmission was
>> unnecessary, so there was no loss/hole that needed plugging. If the
>> sender gets no TLP dupack before the end of the episode, then it reduces
>> ssthresh and the congestion window, because the TLP packet arriving at
>> the receiver probably plugged a hole.
>>
>> Signed-off-by: Nandita Dukkipati <nanditad@google.com>
> 
> Acked-by: Neal Cardwell <ncardwell@google.com>

Also applied, thanks.


end of thread

Thread overview: 7+ messages
2013-03-11 20:00 [PATCH 1/2] tcp: Tail loss probe (TLP) Nandita Dukkipati
2013-03-11 20:00 ` [PATCH 2/2] tcp: TLP loss detection Nandita Dukkipati
2013-03-11 21:38   ` Neal Cardwell
2013-03-12 12:45     ` David Miller
2013-03-11 21:37 ` [PATCH 1/2] tcp: Tail loss probe (TLP) Neal Cardwell
2013-03-12 12:45   ` David Miller
2013-03-11 22:47 ` Yuchung Cheng
