All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC PATCH net-next 0/2] tcp: timer restart for tail loss
@ 2015-12-07  9:00 Per Hurtig
  2015-12-07  9:00 ` [RFC PATCH net-next 1/2] tcp: RTO Restart (RTOR) Per Hurtig
                   ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Per Hurtig @ 2015-12-07  9:00 UTC (permalink / raw)
  To: davem, edumazet, ncardwell, nanditad, tom, ycheng, viro, fw,
	mleitner, daniel, willemb, ilpo.jarvinen, pasi.sarolahti,
	stephen, netdev
  Cc: anna.brunstrom, apetlund, michawe, mohammad.rajiullah, Per Hurtig

This is a request for comments.

RTO and TLP restart is a modification to the restart process of the RTO and
TLP timers in TCP. Currently, both timers are restarted with its
corresponding timeout value when an acknowledgment (ACK) for correctly
received data is received. In many situations, resetting the timers on
incoming ACKs will add an implicit offset of at least RTT seconds to the
loss recovery process.

The goal of the modified restart is to provide quicker loss recovery for
segments lost in the end of a burst/connection, where the limited feedback
from a receiver inhibits the use of fast/early retransmit. To accomplish
this the algorithm adjusts the RTO and PTO (TLP's timer) values on each
rearm of the timers to allow them to expire after exactly RTO and PTO ms,
respectively.

The restart behavior is controlled by the new tcp_timer_restart sysctl.
tcp_timer_restart==0; disables RTOR and TLPR.
		 ==1; enables RTOR.
		 ==2; enables TLPR.
		 ==3; enables both RTOR and TLPR. [DEFAULT]

The new restart behavior has been approved by the IETF for publication as
an experimental RFC (for the RTO part of the mechanism) [1], and
experiments have been conducted to show both the benefit of this strategy
(in terms of reduced loss-recovery delays) and the limited negative impact
(in terms of spurious retransmissions) [2]. More information regarding the
mechanism can be found at [3].

Basic functionality tests, using packetdrill, are available at:
https://github.com/perhurt/packetdrill/tree/master/gtests/net/packetdrill/tests/linux/timer_restart


[1] https://datatracker.ietf.org/doc/draft-ietf-tcpm-rtorestart/
[2] http://www.sigcomm.org/sites/default/files/ccr/papers/2015/January/0000000-0000000.pdf
[3] http://riteproject.eu/resources/rto-restart/

Per Hurtig (2):
  tcp: RTO Restart (RTOR)
  tcp: TLP restart (TLPR)

 Documentation/networking/ip-sysctl.txt | 12 ++++++++++++
 include/net/tcp.h                      |  6 +++++-
 net/ipv4/sysctl_net_ipv4.c             | 10 ++++++++++
 net/ipv4/tcp_input.c                   | 26 +++++++++++++++++++++++++-
 net/ipv4/tcp_output.c                  | 12 ++++++++++--
 5 files changed, 62 insertions(+), 4 deletions(-)

-- 
1.9.1

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [RFC PATCH net-next 1/2] tcp: RTO Restart (RTOR)
  2015-12-07  9:00 [RFC PATCH net-next 0/2] tcp: timer restart for tail loss Per Hurtig
@ 2015-12-07  9:00 ` Per Hurtig
  2015-12-07 10:22   ` Ilpo Järvinen
                     ` (3 more replies)
  2015-12-07  9:00 ` [RFC PATCH net-next 2/2] tcp: TLP restart (TLPR) Per Hurtig
  2015-12-08  9:19 ` [RFC PATCHv2 net-next 0/2] tcp: timer restart for tail loss Per Hurtig
  2 siblings, 4 replies; 17+ messages in thread
From: Per Hurtig @ 2015-12-07  9:00 UTC (permalink / raw)
  To: davem, edumazet, ncardwell, nanditad, tom, ycheng, viro, fw,
	mleitner, daniel, willemb, ilpo.jarvinen, pasi.sarolahti,
	stephen, netdev
  Cc: anna.brunstrom, apetlund, michawe, mohammad.rajiullah, Per Hurtig

This patch implements the RTO restart modification (RTOR). When data is
ACKed, and the RTO timer is restarted, the time elapsed since the last
outstanding segment was transmitted is subtracted from the calculated RTO
value. This way, the RTO timer will expire after exactly RTO seconds, and
not RTO + RTT [+ delACK] seconds.

This patch also implements a new sysctl (tcp_timer_restart) that is used
to control the timer restart behavior.

Signed-off-by: Per Hurtig <per.hurtig@kau.se>
---
 Documentation/networking/ip-sysctl.txt | 12 ++++++++++++
 include/net/tcp.h                      |  4 ++++
 net/ipv4/sysctl_net_ipv4.c             | 10 ++++++++++
 net/ipv4/tcp_input.c                   | 24 ++++++++++++++++++++++++
 4 files changed, 50 insertions(+)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index 2ea4c45..4094128 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -591,6 +591,18 @@ tcp_syn_retries - INTEGER
 	with the current initial RTO of 1second. With this the final timeout
 	for an active TCP connection attempt will happen after 127seconds.
 
+tcp_timer_restart - INTEGER
+	Controls how the RTO and PTO timers are restarted (RTOR and TLPR).
+	If set (per timer or combined) the timers are restarted with
+	respect to the earliest outstanding segment, to not extend tail loss
+	latency unnecessarily.
+	Possible values:
+		0 disables RTOR and TLPR.
+		1 enables RTOR.
+		2 enables TLPR.
+		3 enables RTOR and TLPR.
+	Default: 3
+
 tcp_timestamps - BOOLEAN
 	Enable timestamps as defined in RFC1323.
 
diff --git a/include/net/tcp.h b/include/net/tcp.h
index f80e74c..bf98768 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -76,6 +76,9 @@ void tcp_time_wait(struct sock *sk, int state, int timeo);
 /* After receiving this amount of duplicate ACKs fast retransmit starts. */
 #define TCP_FASTRETRANS_THRESH 3
 
+/* Disable RTO Restart if the number of outstanding segments is at least. */
+#define TCP_RTORESTART_THRESH	4
+
 /* Maximal number of ACKs sent quickly to accelerate slow-start. */
 #define TCP_MAX_QUICKACKS	16U
 
@@ -284,6 +287,7 @@ extern int sysctl_tcp_autocorking;
 extern int sysctl_tcp_invalid_ratelimit;
 extern int sysctl_tcp_pacing_ss_ratio;
 extern int sysctl_tcp_pacing_ca_ratio;
+extern int sysctl_tcp_timer_restart;
 
 extern atomic_long_t tcp_memory_allocated;
 extern struct percpu_counter tcp_sockets_allocated;
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index a0bd7a5..dfb6968 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -28,6 +28,7 @@
 
 static int zero;
 static int one = 1;
+static int three = 3;
 static int four = 4;
 static int thousand = 1000;
 static int gso_max_segs = GSO_MAX_SEGS;
@@ -745,6 +746,15 @@ static struct ctl_table ipv4_table[] = {
 		.extra2		= &thousand,
 	},
 	{
+		.procname	= "tcp_timer_restart",
+		.data		= &sysctl_tcp_timer_restart,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &zero,
+		.extra2		= &three,
+	},
+	{
 		.procname	= "tcp_autocorking",
 		.data		= &sysctl_tcp_autocorking,
 		.maxlen		= sizeof(int),
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index fdd88c3..66e0425 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -101,6 +101,7 @@ int sysctl_tcp_thin_dupack __read_mostly;
 
 int sysctl_tcp_moderate_rcvbuf __read_mostly = 1;
 int sysctl_tcp_early_retrans __read_mostly = 3;
+int sysctl_tcp_timer_restart __read_mostly = 3;
 int sysctl_tcp_invalid_ratelimit __read_mostly = HZ/2;
 
 #define FLAG_DATA		0x01 /* Incoming frame contained data.		*/
@@ -2997,6 +2998,18 @@ static void tcp_cong_avoid(struct sock *sk, u32 ack, u32 acked)
 	tcp_sk(sk)->snd_cwnd_stamp = tcp_time_stamp;
 }
 
+static u32 tcp_unsent_pkts(const struct sock *sk)
+{
+	struct sk_buff *skb = tcp_send_head(sk);
+	u32 pkts = 0;
+
+	if (skb)
+		tcp_for_write_queue_from(skb, sk)
+			pkts += tcp_skb_pcount(skb);
+
+	return pkts;
+}
+
 /* Restart timer after forward progress on connection.
  * RFC2988 recommends to restart timer to now+rto.
  */
@@ -3027,6 +3040,17 @@ void tcp_rearm_rto(struct sock *sk)
 			 */
 			if (delta > 0)
 				rto = delta;
+		} else if (icsk->icsk_pending == ICSK_TIME_RETRANS &&
+			   (sysctl_tcp_timer_restart == 1 ||
+			    sysctl_tcp_timer_restart == 3) &&
+			   (tp->packets_out + tcp_unsent_pkts(sk) <
+			    TCP_RTORESTART_THRESH)) {
+			struct sk_buff *skb = tcp_write_queue_head(sk);
+			const u32 rto_time_stamp = tcp_skb_timestamp(skb);
+			s32 delta = (s32)(tcp_time_stamp - rto_time_stamp);
+
+			if (delta > 0 && rto > delta)
+				rto -= delta;
 		}
 		inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, rto,
 					  TCP_RTO_MAX);
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [RFC PATCH net-next 2/2] tcp: TLP restart (TLPR)
  2015-12-07  9:00 [RFC PATCH net-next 0/2] tcp: timer restart for tail loss Per Hurtig
  2015-12-07  9:00 ` [RFC PATCH net-next 1/2] tcp: RTO Restart (RTOR) Per Hurtig
@ 2015-12-07  9:00 ` Per Hurtig
  2015-12-08  9:19 ` [RFC PATCHv2 net-next 0/2] tcp: timer restart for tail loss Per Hurtig
  2 siblings, 0 replies; 17+ messages in thread
From: Per Hurtig @ 2015-12-07  9:00 UTC (permalink / raw)
  To: davem, edumazet, ncardwell, nanditad, tom, ycheng, viro, fw,
	mleitner, daniel, willemb, ilpo.jarvinen, pasi.sarolahti,
	stephen, netdev
  Cc: anna.brunstrom, apetlund, michawe, mohammad.rajiullah, Per Hurtig

This patch implements the TLP restart modification (TLPR). When data is
ACKed, and TLP's PTO timer is restarted, the time elapsed since the last
outstanding segment was transmitted is subtracted from the calculated RTO
value to not unnecessarily delay loss probes.

Signed-off-by: Per Hurtig <per.hurtig@kau.se>
---
 include/net/tcp.h     |  2 +-
 net/ipv4/tcp_input.c  |  2 +-
 net/ipv4/tcp_output.c | 12 ++++++++++--
 3 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index bf98768..8ac4118 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -566,7 +566,7 @@ void tcp_push_one(struct sock *, unsigned int mss_now);
 void tcp_send_ack(struct sock *sk);
 void tcp_send_delayed_ack(struct sock *sk);
 void tcp_send_loss_probe(struct sock *sk);
-bool tcp_schedule_loss_probe(struct sock *sk);
+bool tcp_schedule_loss_probe(struct sock *sk, bool restart);
 
 /* tcp_input.c */
 void tcp_resume_early_retransmit(struct sock *sk);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 66e0425..28d3b21 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3655,7 +3655,7 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
 	}
 
 	if (icsk->icsk_pending == ICSK_TIME_RETRANS)
-		tcp_schedule_loss_probe(sk);
+		tcp_schedule_loss_probe(sk, true);
 	tcp_update_pacing_rate(sk);
 	return 1;
 
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index cb7ca56..752db3d 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2135,7 +2135,7 @@ repair:
 
 		/* Send one loss probe per tail loss episode. */
 		if (push_one != 2)
-			tcp_schedule_loss_probe(sk);
+			tcp_schedule_loss_probe(sk, false);
 		is_cwnd_limited |= (tcp_packets_in_flight(tp) >= tp->snd_cwnd);
 		tcp_cwnd_validate(sk, is_cwnd_limited);
 		return false;
@@ -2143,10 +2143,11 @@ repair:
 	return !tp->packets_out && tcp_send_head(sk);
 }
 
-bool tcp_schedule_loss_probe(struct sock *sk)
+bool tcp_schedule_loss_probe(struct sock *sk, bool restart)
 {
 	struct inet_connection_sock *icsk = inet_csk(sk);
 	struct tcp_sock *tp = tcp_sk(sk);
+	struct sk_buff *skb = tcp_write_queue_head(sk);
 	u32 timeout, tlp_time_stamp, rto_time_stamp;
 	u32 rtt = usecs_to_jiffies(tp->srtt_us >> 3);
 
@@ -2186,6 +2187,13 @@ bool tcp_schedule_loss_probe(struct sock *sk)
 	if (tp->packets_out == 1)
 		timeout = max_t(u32, timeout,
 				(rtt + (rtt >> 1) + TCP_DELACK_MAX));
+	if (sysctl_tcp_timer_restart > 1 && restart && skb) {
+		const u32 rto_time_stamp = tcp_skb_timestamp(skb);
+		s32 delta = (s32)(tcp_time_stamp - rto_time_stamp);
+
+		if (delta > 0 && timeout > delta)
+			timeout -= delta;
+	}
 	timeout = max_t(u32, timeout, msecs_to_jiffies(10));
 
 	/* If RTO is shorter, just schedule TLP in its place. */
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH net-next 1/2] tcp: RTO Restart (RTOR)
  2015-12-07  9:00 ` [RFC PATCH net-next 1/2] tcp: RTO Restart (RTOR) Per Hurtig
@ 2015-12-07 10:22   ` Ilpo Järvinen
  2015-12-07 16:46   ` Marcelo Ricardo Leitner
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 17+ messages in thread
From: Ilpo Järvinen @ 2015-12-07 10:22 UTC (permalink / raw)
  To: Per Hurtig
  Cc: David Miller, edumazet, ncardwell, nanditad, tom, ycheng, viro,
	fw, mleitner, daniel, willemb, pasi.sarolahti, stephen, Netdev,
	anna.brunstrom, apetlund, michawe, mohammad.rajiullah

On Mon, 7 Dec 2015, Per Hurtig wrote:

> This patch implements the RTO restart modification (RTOR). When data is
> ACKed, and the RTO timer is restarted, the time elapsed since the last
> outstanding segment was transmitted is subtracted from the calculated RTO
> value. This way, the RTO timer will expire after exactly RTO seconds, and
> not RTO + RTT [+ delACK] seconds.
> 
> This patch also implements a new sysctl (tcp_timer_restart) that is used
> to control the timer restart behavior.
> 
> Signed-off-by: Per Hurtig <per.hurtig@kau.se>
> ---
>  Documentation/networking/ip-sysctl.txt | 12 ++++++++++++
>  include/net/tcp.h                      |  4 ++++
>  net/ipv4/sysctl_net_ipv4.c             | 10 ++++++++++
>  net/ipv4/tcp_input.c                   | 24 ++++++++++++++++++++++++
>  4 files changed, 50 insertions(+)
> 
> diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
> index 2ea4c45..4094128 100644
> --- a/Documentation/networking/ip-sysctl.txt
> +++ b/Documentation/networking/ip-sysctl.txt
> @@ -591,6 +591,18 @@ tcp_syn_retries - INTEGER
>  	with the current initial RTO of 1second. With this the final timeout
>  	for an active TCP connection attempt will happen after 127seconds.
>  
> +tcp_timer_restart - INTEGER
> +	Controls how the RTO and PTO timers are restarted (RTOR and TLPR).
> +	If set (per timer or combined) the timers are restarted with
> +	respect to the earliest outstanding segment, to not extend tail loss
> +	latency unnecessarily.
> +	Possible values:
> +		0 disables RTOR and TLPR.
> +		1 enables RTOR.
> +		2 enables TLPR.
> +		3 enables RTOR and TLPR.
> +	Default: 3
> +
>  tcp_timestamps - BOOLEAN
>  	Enable timestamps as defined in RFC1323.
>  

[...snip...]

> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index fdd88c3..66e0425 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c

[...snip...]

>  /* Restart timer after forward progress on connection.
>   * RFC2988 recommends to restart timer to now+rto.
>   */
> @@ -3027,6 +3040,17 @@ void tcp_rearm_rto(struct sock *sk)
>  			 */
>  			if (delta > 0)
>  				rto = delta;
> +		} else if (icsk->icsk_pending == ICSK_TIME_RETRANS &&
> +			   (sysctl_tcp_timer_restart == 1 ||
> +			    sysctl_tcp_timer_restart == 3) &&

Use a bit operation here instead? Also I think that this sysctl would 
benefit from named constants rather than use of literals (similar 
comment applies to the other patch too).

> +			   (tp->packets_out + tcp_unsent_pkts(sk) <
> +			    TCP_RTORESTART_THRESH)) {
> +			struct sk_buff *skb = tcp_write_queue_head(sk);
> +			const u32 rto_time_stamp = tcp_skb_timestamp(skb);
> +			s32 delta = (s32)(tcp_time_stamp - rto_time_stamp);
> +
> +			if (delta > 0 && rto > delta)
> +				rto -= delta;
>  		}
>  		inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, rto,
>  					  TCP_RTO_MAX);


-- 
 i.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH net-next 1/2] tcp: RTO Restart (RTOR)
  2015-12-07  9:00 ` [RFC PATCH net-next 1/2] tcp: RTO Restart (RTOR) Per Hurtig
  2015-12-07 10:22   ` Ilpo Järvinen
@ 2015-12-07 16:46   ` Marcelo Ricardo Leitner
  2015-12-07 17:03   ` Eric Dumazet
  2015-12-08  2:05   ` Yuchung Cheng
  3 siblings, 0 replies; 17+ messages in thread
From: Marcelo Ricardo Leitner @ 2015-12-07 16:46 UTC (permalink / raw)
  To: Per Hurtig
  Cc: davem, edumazet, ncardwell, nanditad, tom, ycheng, viro, fw,
	daniel, willemb, ilpo.jarvinen, pasi.sarolahti, stephen, netdev,
	anna.brunstrom, apetlund, michawe, mohammad.rajiullah

On Mon, Dec 07, 2015 at 10:00:11AM +0100, Per Hurtig wrote:
> This patch implements the RTO restart modification (RTOR). When data is
> ACKed, and the RTO timer is restarted, the time elapsed since the last
> outstanding segment was transmitted is subtracted from the calculated RTO
> value. This way, the RTO timer will expire after exactly RTO seconds, and
> not RTO + RTT [+ delACK] seconds.
> 
> This patch also implements a new sysctl (tcp_timer_restart) that is used
> to control the timer restart behavior.
> 
> Signed-off-by: Per Hurtig <per.hurtig@kau.se>
> ---
>  Documentation/networking/ip-sysctl.txt | 12 ++++++++++++
>  include/net/tcp.h                      |  4 ++++
>  net/ipv4/sysctl_net_ipv4.c             | 10 ++++++++++
>  net/ipv4/tcp_input.c                   | 24 ++++++++++++++++++++++++
>  4 files changed, 50 insertions(+)
> 
> diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
> index 2ea4c45..4094128 100644
> --- a/Documentation/networking/ip-sysctl.txt
> +++ b/Documentation/networking/ip-sysctl.txt

(snip)

> @@ -2997,6 +2998,18 @@ static void tcp_cong_avoid(struct sock *sk, u32 ack, u32 acked)
>  	tcp_sk(sk)->snd_cwnd_stamp = tcp_time_stamp;
>  }
>  
> +static u32 tcp_unsent_pkts(const struct sock *sk)
> +{
> +	struct sk_buff *skb = tcp_send_head(sk);
> +	u32 pkts = 0;
> +
> +	if (skb)
> +		tcp_for_write_queue_from(skb, sk)
> +			pkts += tcp_skb_pcount(skb);
> +
> +	return pkts;
> +}
> +
>  /* Restart timer after forward progress on connection.
>   * RFC2988 recommends to restart timer to now+rto.
>   */
> @@ -3027,6 +3040,17 @@ void tcp_rearm_rto(struct sock *sk)
>  			 */
>  			if (delta > 0)
>  				rto = delta;
> +		} else if (icsk->icsk_pending == ICSK_TIME_RETRANS &&
> +			   (sysctl_tcp_timer_restart == 1 ||
> +			    sysctl_tcp_timer_restart == 3) &&
> +			   (tp->packets_out + tcp_unsent_pkts(sk) <
> +			    TCP_RTORESTART_THRESH)) {

(snip)

By when this gets hit, you could have a big write queue.
What about wrapping at least this this condition 
tp->packets_out + tcp_unsent_pkts(sk) < TCP_RTORESTART_THRESH
in its own check function? Like:

+static bool tcp_can_rtor(const struct sock *sk)
+{
+	struct sk_buff *skb = tcp_send_head(sk);
+	s32 target = TCP_RTORESTART_THRESH - tp->packets_out;
+
+	if (target <= 0)
+		return false;
+
+	if (skb) {
+		tcp_for_write_queue_from(skb, sk) {
+			target -= tcp_skb_pcount(skb);
+			if (target <= 0)
+				return false;
+		}
+	}
+
+	return true;
+}

This way it will only traverse what is needed for the check itself.

  Marcelo

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH net-next 1/2] tcp: RTO Restart (RTOR)
  2015-12-07  9:00 ` [RFC PATCH net-next 1/2] tcp: RTO Restart (RTOR) Per Hurtig
  2015-12-07 10:22   ` Ilpo Järvinen
  2015-12-07 16:46   ` Marcelo Ricardo Leitner
@ 2015-12-07 17:03   ` Eric Dumazet
  2015-12-08  2:05   ` Yuchung Cheng
  3 siblings, 0 replies; 17+ messages in thread
From: Eric Dumazet @ 2015-12-07 17:03 UTC (permalink / raw)
  To: Per Hurtig
  Cc: davem, edumazet, ncardwell, nanditad, tom, ycheng, viro, fw,
	mleitner, daniel, willemb, ilpo.jarvinen, pasi.sarolahti,
	stephen, netdev, anna.brunstrom, apetlund, michawe,
	mohammad.rajiullah

On Mon, 2015-12-07 at 10:00 +0100, Per Hurtig wrote:
\
>  
> +static u32 tcp_unsent_pkts(const struct sock *sk)
> +{
> +	struct sk_buff *skb = tcp_send_head(sk);
> +	u32 pkts = 0;
> +
> +	if (skb)
> +		tcp_for_write_queue_from(skb, sk)
> +			pkts += tcp_skb_pcount(skb);
> +
> +	return pkts;
> +}

write queue can be very big (consider GSO/TSO being off for example)

Parsing it just to implement later :

(tp->packets_out + tcp_unsent_pkts(sk) <
> +			    TCP_RTORESTART_THRESH)

is probably not very efficient.

I would rather implement a different helper, aborting the loop as soon
as the condition can not be met anymore.

> +
>  /* Restart timer after forward progress on connection.
>   * RFC2988 recommends to restart timer to now+rto.
>   */
> @@ -3027,6 +3040,17 @@ void tcp_rearm_rto(struct sock *sk)
>  			 */
>  			if (delta > 0)
>  				rto = delta;
> +		} else if (icsk->icsk_pending == ICSK_TIME_RETRANS &&
> +			   (sysctl_tcp_timer_restart == 1 ||
> +			    sysctl_tcp_timer_restart == 3) &&
> +			   (tp->packets_out + tcp_unsent_pkts(sk) <
> +			    TCP_RTORESTART_THRESH)) {
> +			struct sk_buff *skb = tcp_write_queue_head(sk);
> +			const u32 rto_time_stamp = tcp_skb_timestamp(skb);
> +			s32 delta = (s32)(tcp_time_stamp - rto_time_stamp);
> +
> +			if (delta > 0 && rto > delta)
> +				rto -= delta;
>  		}
>  		inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, rto,
>  					  TCP_RTO_MAX);

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH net-next 1/2] tcp: RTO Restart (RTOR)
  2015-12-07  9:00 ` [RFC PATCH net-next 1/2] tcp: RTO Restart (RTOR) Per Hurtig
                     ` (2 preceding siblings ...)
  2015-12-07 17:03   ` Eric Dumazet
@ 2015-12-08  2:05   ` Yuchung Cheng
  2015-12-08  9:25     ` Per Hurtig
  3 siblings, 1 reply; 17+ messages in thread
From: Yuchung Cheng @ 2015-12-08  2:05 UTC (permalink / raw)
  To: Per Hurtig
  Cc: David Miller, Eric Dumazet, Neal Cardwell, Nandita Dukkipati,
	Tom Herbert, viro, Florian Westphal, Marcelo Ricardo Leitner,
	Daniel Borkmann, Willem de Bruijn, Ilpo Järvinen,
	Pasi Sarolahti, Stephen Hemminger, netdev, Anna Brunstrom,
	Andreas Petlund, Michael Welzl, mohammad.rajiullah

On Mon, Dec 7, 2015 at 1:00 AM, Per Hurtig <per.hurtig@kau.se> wrote:
> This patch implements the RTO restart modification (RTOR). When data is
> ACKed, and the RTO timer is restarted, the time elapsed since the last
> outstanding segment was transmitted is subtracted from the calculated RTO
> value. This way, the RTO timer will expire after exactly RTO seconds, and
> not RTO + RTT [+ delACK] seconds.
>
> This patch also implements a new sysctl (tcp_timer_restart) that is used
> to control the timer restart behavior.
>
> Signed-off-by: Per Hurtig <per.hurtig@kau.se>
> ---
>  Documentation/networking/ip-sysctl.txt | 12 ++++++++++++
>  include/net/tcp.h                      |  4 ++++
>  net/ipv4/sysctl_net_ipv4.c             | 10 ++++++++++
>  net/ipv4/tcp_input.c                   | 24 ++++++++++++++++++++++++
>  4 files changed, 50 insertions(+)
>
> diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
> index 2ea4c45..4094128 100644
> --- a/Documentation/networking/ip-sysctl.txt
> +++ b/Documentation/networking/ip-sysctl.txt
> @@ -591,6 +591,18 @@ tcp_syn_retries - INTEGER
>         with the current initial RTO of 1second. With this the final timeout
>         for an active TCP connection attempt will happen after 127seconds.
>
> +tcp_timer_restart - INTEGER
> +       Controls how the RTO and PTO timers are restarted (RTOR and TLPR).
> +       If set (per timer or combined) the timers are restarted with
> +       respect to the earliest outstanding segment, to not extend tail loss
> +       latency unnecessarily.
> +       Possible values:
> +               0 disables RTOR and TLPR.
> +               1 enables RTOR.
> +               2 enables TLPR.
> +               3 enables RTOR and TLPR.
> +       Default: 3
> +
>  tcp_timestamps - BOOLEAN
>         Enable timestamps as defined in RFC1323.
>
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index f80e74c..bf98768 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -76,6 +76,9 @@ void tcp_time_wait(struct sock *sk, int state, int timeo);
>  /* After receiving this amount of duplicate ACKs fast retransmit starts. */
>  #define TCP_FASTRETRANS_THRESH 3
>
> +/* Disable RTO Restart if the number of outstanding segments is at least. */
> +#define TCP_RTORESTART_THRESH  4
> +
>  /* Maximal number of ACKs sent quickly to accelerate slow-start. */
>  #define TCP_MAX_QUICKACKS      16U
>
> @@ -284,6 +287,7 @@ extern int sysctl_tcp_autocorking;
>  extern int sysctl_tcp_invalid_ratelimit;
>  extern int sysctl_tcp_pacing_ss_ratio;
>  extern int sysctl_tcp_pacing_ca_ratio;
> +extern int sysctl_tcp_timer_restart;
>
>  extern atomic_long_t tcp_memory_allocated;
>  extern struct percpu_counter tcp_sockets_allocated;
> diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
> index a0bd7a5..dfb6968 100644
> --- a/net/ipv4/sysctl_net_ipv4.c
> +++ b/net/ipv4/sysctl_net_ipv4.c
> @@ -28,6 +28,7 @@
>
>  static int zero;
>  static int one = 1;
> +static int three = 3;
>  static int four = 4;
>  static int thousand = 1000;
>  static int gso_max_segs = GSO_MAX_SEGS;
> @@ -745,6 +746,15 @@ static struct ctl_table ipv4_table[] = {
>                 .extra2         = &thousand,
>         },
>         {
> +               .procname       = "tcp_timer_restart",
> +               .data           = &sysctl_tcp_timer_restart,
> +               .maxlen         = sizeof(int),
> +               .mode           = 0644,
> +               .proc_handler   = proc_dointvec_minmax,
> +               .extra1         = &zero,
> +               .extra2         = &three,
> +       },
> +       {
>                 .procname       = "tcp_autocorking",
>                 .data           = &sysctl_tcp_autocorking,
>                 .maxlen         = sizeof(int),
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index fdd88c3..66e0425 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -101,6 +101,7 @@ int sysctl_tcp_thin_dupack __read_mostly;
>
>  int sysctl_tcp_moderate_rcvbuf __read_mostly = 1;
>  int sysctl_tcp_early_retrans __read_mostly = 3;
> +int sysctl_tcp_timer_restart __read_mostly = 3;
>  int sysctl_tcp_invalid_ratelimit __read_mostly = HZ/2;
>
>  #define FLAG_DATA              0x01 /* Incoming frame contained data.          */
> @@ -2997,6 +2998,18 @@ static void tcp_cong_avoid(struct sock *sk, u32 ack, u32 acked)
>         tcp_sk(sk)->snd_cwnd_stamp = tcp_time_stamp;
>  }
>
> +static u32 tcp_unsent_pkts(const struct sock *sk)
> +{
> +       struct sk_buff *skb = tcp_send_head(sk);
> +       u32 pkts = 0;
> +
> +       if (skb)
> +               tcp_for_write_queue_from(skb, sk)
> +                       pkts += tcp_skb_pcount(skb);
> +
> +       return pkts;
> +}
> +
>  /* Restart timer after forward progress on connection.
>   * RFC2988 recommends to restart timer to now+rto.
>   */
> @@ -3027,6 +3040,17 @@ void tcp_rearm_rto(struct sock *sk)
>                          */
>                         if (delta > 0)
>                                 rto = delta;
> +               } else if (icsk->icsk_pending == ICSK_TIME_RETRANS &&
> +                          (sysctl_tcp_timer_restart == 1 ||
> +                           sysctl_tcp_timer_restart == 3) &&
> +                          (tp->packets_out + tcp_unsent_pkts(sk) <
We re-arm RTO every ACK. Do we really need a loop here to improve
short-transfer RTO performance?

Going one level up: what's the rationale to arm RTO differently
depending on the #packets in the write queue? if the idea is to offset
the elapsed of the first unacked packet in RTO, it should generally
apply to any packet w/ any inflight, and we can forgo another packet
counting heuristic.

While larger inflight can trigger more fast recoveries, tail drops can
happen at any time. This is why TLP is always enabled of any inflight.
Sounds like we can benefit if RTO-restart is always used too? I am
planning to experiment your patch w and w/o the packet count checks.


> +                           TCP_RTORESTART_THRESH)) {
> +                       struct sk_buff *skb = tcp_write_queue_head(sk);
> +                       const u32 rto_time_stamp = tcp_skb_timestamp(skb);
> +                       s32 delta = (s32)(tcp_time_stamp - rto_time_stamp);
> +
> +                       if (delta > 0 && rto > delta)
> +                               rto -= delta;
>                 }
>                 inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, rto,
>                                           TCP_RTO_MAX);
> --
> 1.9.1
>
>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [RFC PATCHv2 net-next 0/2] tcp: timer restart for tail loss
  2015-12-07  9:00 [RFC PATCH net-next 0/2] tcp: timer restart for tail loss Per Hurtig
  2015-12-07  9:00 ` [RFC PATCH net-next 1/2] tcp: RTO Restart (RTOR) Per Hurtig
  2015-12-07  9:00 ` [RFC PATCH net-next 2/2] tcp: TLP restart (TLPR) Per Hurtig
@ 2015-12-08  9:19 ` Per Hurtig
  2015-12-08  9:19   ` [RFC PATCHv2 net-next 1/2] tcp: RTO Restart (RTOR) Per Hurtig
  2015-12-08  9:19   ` [RFC PATCHv2 net-next 2/2] tcp: TLP restart (TLPR) Per Hurtig
  2 siblings, 2 replies; 17+ messages in thread
From: Per Hurtig @ 2015-12-08  9:19 UTC (permalink / raw)
  To: davem, edumazet, ncardwell, nanditad, tom, ycheng, viro, fw,
	mleitner, daniel, willemb, ilpo.jarvinen, pasi.sarolahti,
	stephen, netdev
  Cc: anna.brunstrom, apetlund, michawe, mohammad.rajiullah, Per Hurtig

This is a request for comments.

RTO and TLP restart is a modification to the restart process of the RTO and
TLP timers in TCP. Currently, both timers are restarted with its
corresponding timeout value when an acknowledgment (ACK) for correctly
received data is received. In many situations, resetting the timers on
incoming ACKs will add an implicit offset of at least RTT seconds to the
loss recovery process.

The goal of the modified restart is to provide quicker loss recovery for
segments lost in the end of a burst/connection, where the limited feedback
from a receiver inhibits the use of fast/early retransmit. To accomplish
this the algorithm adjusts the RTO and PTO (TLP's timer) values on each
rearm of the timers to allow them to expire after exactly RTO and PTO ms,
respectively.

The restart behavior is controlled by the new tcp_timer_restart sysctl.
tcp_timer_restart==0; disables RTOR and TLPR.
		 ==1; enables RTOR.
		 ==2; enables TLPR.
		 ==3; enables both RTOR and TLPR. [DEFAULT]

The new restart behavior has been approved by the IETF for publication as
an experimental RFC (for the RTO part of the mechanism) [1], and
experiments have been conducted to show both the benefit of this strategy
(in terms of reduced loss-recovery delays) and the limited negative impact
(in terms of spurious retransmissions) [2]. More information regarding the
mechanism can be found at [3].

Basic functionality tests, using packetdrill, are available at:
https://github.com/perhurt/packetdrill/tree/master/gtests/net/packetdrill/tests/linux/timer_restart


[1] https://datatracker.ietf.org/doc/draft-ietf-tcpm-rtorestart/
[2] http://www.sigcomm.org/sites/default/files/ccr/papers/2015/January/0000000-0000000.pdf
[3] http://riteproject.eu/resources/rto-restart/

Per Hurtig (2):
  tcp: RTO Restart (RTOR)
  tcp: TLP restart (TLPR)

 Documentation/networking/ip-sysctl.txt | 12 ++++++++++++
 include/net/tcp.h                      |  8 +++++++-
 net/ipv4/sysctl_net_ipv4.c             | 10 ++++++++++
 net/ipv4/tcp_input.c                   | 31 ++++++++++++++++++++++++++++++-
 net/ipv4/tcp_output.c                  | 12 ++++++++++--
 5 files changed, 69 insertions(+), 4 deletions(-)

-- 
1.9.1

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [RFC PATCHv2 net-next 1/2] tcp: RTO Restart (RTOR)
  2015-12-08  9:19 ` [RFC PATCHv2 net-next 0/2] tcp: timer restart for tail loss Per Hurtig
@ 2015-12-08  9:19   ` Per Hurtig
  2015-12-08 10:50     ` Ilpo Järvinen
  2015-12-08 13:47     ` Eric Dumazet
  2015-12-08  9:19   ` [RFC PATCHv2 net-next 2/2] tcp: TLP restart (TLPR) Per Hurtig
  1 sibling, 2 replies; 17+ messages in thread
From: Per Hurtig @ 2015-12-08  9:19 UTC (permalink / raw)
  To: davem, edumazet, ncardwell, nanditad, tom, ycheng, viro, fw,
	mleitner, daniel, willemb, ilpo.jarvinen, pasi.sarolahti,
	stephen, netdev
  Cc: anna.brunstrom, apetlund, michawe, mohammad.rajiullah, Per Hurtig

This patch implements the RTO restart modification (RTOR). When data is
ACKed, and the RTO timer is restarted, the time elapsed since the last
outstanding segment was transmitted is subtracted from the calculated RTO
value. This way, the RTO timer will expire after exactly RTO seconds, and
not RTO + RTT [+ delACK] seconds.

This patch also implements a new sysctl (tcp_timer_restart) that is used
to control the timer restart behavior.

Signed-off-by: Per Hurtig <per.hurtig@kau.se>
---
 Documentation/networking/ip-sysctl.txt | 12 ++++++++++++
 include/net/tcp.h                      |  6 ++++++
 net/ipv4/sysctl_net_ipv4.c             | 10 ++++++++++
 net/ipv4/tcp_input.c                   | 29 +++++++++++++++++++++++++++++
 4 files changed, 57 insertions(+)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index 2ea4c45..4094128 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -591,6 +591,18 @@ tcp_syn_retries - INTEGER
 	with the current initial RTO of 1second. With this the final timeout
 	for an active TCP connection attempt will happen after 127seconds.
 
+tcp_timer_restart - INTEGER
+	Controls how the RTO and PTO timers are restarted (RTOR and TLPR).
+	If set (per timer or combined) the timers are restarted with
+	respect to the earliest outstanding segment, to not extend tail loss
+	latency unnecessarily.
+	Possible values:
+		0 disables RTOR and TLPR.
+		1 enables RTOR.
+		2 enables TLPR.
+		3 enables RTOR and TLPR.
+	Default: 3
+
 tcp_timestamps - BOOLEAN
 	Enable timestamps as defined in RFC1323.
 
diff --git a/include/net/tcp.h b/include/net/tcp.h
index f80e74c..833efb7 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -76,6 +76,11 @@ void tcp_time_wait(struct sock *sk, int state, int timeo);
 /* After receiving this amount of duplicate ACKs fast retransmit starts. */
 #define TCP_FASTRETRANS_THRESH 3
 
+/* Disable RTO Restart if the number of outstanding segments is at least. */
+#define TCP_TIMER_RTORESTART	1
+#define TCP_TIMER_TLPRESTART	2
+#define TCP_RTORESTART_THRESH	4
+
 /* Maximal number of ACKs sent quickly to accelerate slow-start. */
 #define TCP_MAX_QUICKACKS	16U
 
@@ -284,6 +289,7 @@ extern int sysctl_tcp_autocorking;
 extern int sysctl_tcp_invalid_ratelimit;
 extern int sysctl_tcp_pacing_ss_ratio;
 extern int sysctl_tcp_pacing_ca_ratio;
+extern int sysctl_tcp_timer_restart;
 
 extern atomic_long_t tcp_memory_allocated;
 extern struct percpu_counter tcp_sockets_allocated;
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index a0bd7a5..dfb6968 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -28,6 +28,7 @@
 
 static int zero;
 static int one = 1;
+static int three = 3;
 static int four = 4;
 static int thousand = 1000;
 static int gso_max_segs = GSO_MAX_SEGS;
@@ -745,6 +746,15 @@ static struct ctl_table ipv4_table[] = {
 		.extra2		= &thousand,
 	},
 	{
+		.procname	= "tcp_timer_restart",
+		.data		= &sysctl_tcp_timer_restart,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &zero,
+		.extra2		= &three,
+	},
+	{
 		.procname	= "tcp_autocorking",
 		.data		= &sysctl_tcp_autocorking,
 		.maxlen		= sizeof(int),
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 2d656ee..2870af3 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -101,6 +101,8 @@ int sysctl_tcp_thin_dupack __read_mostly;
 
 int sysctl_tcp_moderate_rcvbuf __read_mostly = 1;
 int sysctl_tcp_early_retrans __read_mostly = 3;
+int sysctl_tcp_timer_restart __read_mostly = TCP_TIMER_RTORESTART |
+					     TCP_TIMER_TLPRESTART;
 int sysctl_tcp_invalid_ratelimit __read_mostly = HZ/2;
 
 #define FLAG_DATA		0x01 /* Incoming frame contained data.		*/
@@ -2997,6 +2999,22 @@ static void tcp_cong_avoid(struct sock *sk, u32 ack, u32 acked)
 	tcp_sk(sk)->snd_cwnd_stamp = tcp_time_stamp;
 }
 
+static u32 tcp_unsent_pkts(const struct sock *sk, u32 ulimit)
+{
+	struct sk_buff *skb = tcp_send_head(sk);
+	u32 pkts = 0;
+
+	if (skb)
+		tcp_for_write_queue_from(skb, sk) {
+			pkts += tcp_skb_pcount(skb);
+
+			if (ulimit && pkts >= ulimit)
+				return ulimit;
+		}
+
+	return pkts;
+}
+
 /* Restart timer after forward progress on connection.
  * RFC2988 recommends to restart timer to now+rto.
  */
@@ -3027,6 +3045,17 @@ void tcp_rearm_rto(struct sock *sk)
 			 */
 			if (delta > 0)
 				rto = delta;
+		} else if (icsk->icsk_pending == ICSK_TIME_RETRANS &&
+			   (sysctl_tcp_timer_restart & TCP_TIMER_RTORESTART) &&
+			   (tp->packets_out +
+			    tcp_unsent_pkts(sk, TCP_RTORESTART_THRESH) <
+			    TCP_RTORESTART_THRESH)) {
+			struct sk_buff *skb = tcp_write_queue_head(sk);
+			const u32 rto_time_stamp = tcp_skb_timestamp(skb);
+			s32 delta = (s32)(tcp_time_stamp - rto_time_stamp);
+
+			if (delta > 0 && rto > delta)
+				rto -= delta;
 		}
 		inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, rto,
 					  TCP_RTO_MAX);
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [RFC PATCHv2 net-next 2/2] tcp: TLP restart (TLPR)
  2015-12-08  9:19 ` [RFC PATCHv2 net-next 0/2] tcp: timer restart for tail loss Per Hurtig
  2015-12-08  9:19   ` [RFC PATCHv2 net-next 1/2] tcp: RTO Restart (RTOR) Per Hurtig
@ 2015-12-08  9:19   ` Per Hurtig
  1 sibling, 0 replies; 17+ messages in thread
From: Per Hurtig @ 2015-12-08  9:19 UTC (permalink / raw)
  To: davem, edumazet, ncardwell, nanditad, tom, ycheng, viro, fw,
	mleitner, daniel, willemb, ilpo.jarvinen, pasi.sarolahti,
	stephen, netdev
  Cc: anna.brunstrom, apetlund, michawe, mohammad.rajiullah, Per Hurtig

This patch implements the TLP restart modification (TLPR). When data is
ACKed, and TLP's PTO timer is restarted, the time elapsed since the last
outstanding segment was transmitted is subtracted from the calculated RTO
value to not unnecessarily delay loss probes.

Signed-off-by: Per Hurtig <per.hurtig@kau.se>
---
 include/net/tcp.h     |  2 +-
 net/ipv4/tcp_input.c  |  2 +-
 net/ipv4/tcp_output.c | 12 ++++++++++--
 3 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 833efb7..14c0888 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -568,7 +568,7 @@ void tcp_push_one(struct sock *, unsigned int mss_now);
 void tcp_send_ack(struct sock *sk);
 void tcp_send_delayed_ack(struct sock *sk);
 void tcp_send_loss_probe(struct sock *sk);
-bool tcp_schedule_loss_probe(struct sock *sk);
+bool tcp_schedule_loss_probe(struct sock *sk, bool restart);
 
 /* tcp_input.c */
 void tcp_resume_early_retransmit(struct sock *sk);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 2870af3..952c4cd 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3660,7 +3660,7 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
 	}
 
 	if (icsk->icsk_pending == ICSK_TIME_RETRANS)
-		tcp_schedule_loss_probe(sk);
+		tcp_schedule_loss_probe(sk, true);
 	tcp_update_pacing_rate(sk);
 	return 1;
 
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index a800cee..b89ba50 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2135,7 +2135,7 @@ repair:
 
 		/* Send one loss probe per tail loss episode. */
 		if (push_one != 2)
-			tcp_schedule_loss_probe(sk);
+			tcp_schedule_loss_probe(sk, false);
 		is_cwnd_limited |= (tcp_packets_in_flight(tp) >= tp->snd_cwnd);
 		tcp_cwnd_validate(sk, is_cwnd_limited);
 		return false;
@@ -2143,10 +2143,11 @@ repair:
 	return !tp->packets_out && tcp_send_head(sk);
 }
 
-bool tcp_schedule_loss_probe(struct sock *sk)
+bool tcp_schedule_loss_probe(struct sock *sk, bool restart)
 {
 	struct inet_connection_sock *icsk = inet_csk(sk);
 	struct tcp_sock *tp = tcp_sk(sk);
+	struct sk_buff *skb = tcp_write_queue_head(sk);
 	u32 timeout, tlp_time_stamp, rto_time_stamp;
 	u32 rtt = usecs_to_jiffies(tp->srtt_us >> 3);
 
@@ -2186,6 +2187,13 @@ bool tcp_schedule_loss_probe(struct sock *sk)
 	if (tp->packets_out == 1)
 		timeout = max_t(u32, timeout,
 				(rtt + (rtt >> 1) + TCP_DELACK_MAX));
+	if (sysctl_tcp_timer_restart & TCP_TIMER_TLPRESTART && restart && skb) {
+		const u32 rto_time_stamp = tcp_skb_timestamp(skb);
+		s32 delta = (s32)(tcp_time_stamp - rto_time_stamp);
+
+		if (delta > 0 && timeout > delta)
+			timeout -= delta;
+	}
 	timeout = max_t(u32, timeout, msecs_to_jiffies(10));
 
 	/* If RTO is shorter, just schedule TLP in its place. */
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH net-next 1/2] tcp: RTO Restart (RTOR)
  2015-12-08  2:05   ` Yuchung Cheng
@ 2015-12-08  9:25     ` Per Hurtig
  0 siblings, 0 replies; 17+ messages in thread
From: Per Hurtig @ 2015-12-08  9:25 UTC (permalink / raw)
  To: Yuchung Cheng
  Cc: David Miller, Eric Dumazet, Neal Cardwell, Nandita Dukkipati,
	Tom Herbert, viro, Florian Westphal, Marcelo Ricardo Leitner,
	Daniel Borkmann, Willem de Bruijn, Ilpo Järvinen,
	Pasi Sarolahti, Stephen Hemminger, netdev, Anna Brunström,
	Andreas Petlund, Michael Welzl, Mohammad Rajiullah

[-- Attachment #1: Type: text/plain, Size: 7635 bytes --]


> On 08 Dec 2015, at 03:05, Yuchung Cheng <ycheng@google.com> wrote:
> 
> On Mon, Dec 7, 2015 at 1:00 AM, Per Hurtig <per.hurtig@kau.se> wrote:
>> This patch implements the RTO restart modification (RTOR). When data is
>> ACKed, and the RTO timer is restarted, the time elapsed since the last
>> outstanding segment was transmitted is subtracted from the calculated RTO
>> value. This way, the RTO timer will expire after exactly RTO seconds, and
>> not RTO + RTT [+ delACK] seconds.
>> 
>> This patch also implements a new sysctl (tcp_timer_restart) that is used
>> to control the timer restart behavior.
>> 
>> Signed-off-by: Per Hurtig <per.hurtig@kau.se>
>> ---
>> Documentation/networking/ip-sysctl.txt | 12 ++++++++++++
>> include/net/tcp.h                      |  4 ++++
>> net/ipv4/sysctl_net_ipv4.c             | 10 ++++++++++
>> net/ipv4/tcp_input.c                   | 24 ++++++++++++++++++++++++
>> 4 files changed, 50 insertions(+)
>> 
>> diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
>> index 2ea4c45..4094128 100644
>> --- a/Documentation/networking/ip-sysctl.txt
>> +++ b/Documentation/networking/ip-sysctl.txt
>> @@ -591,6 +591,18 @@ tcp_syn_retries - INTEGER
>>        with the current initial RTO of 1second. With this the final timeout
>>        for an active TCP connection attempt will happen after 127seconds.
>> 
>> +tcp_timer_restart - INTEGER
>> +       Controls how the RTO and PTO timers are restarted (RTOR and TLPR).
>> +       If set (per timer or combined) the timers are restarted with
>> +       respect to the earliest outstanding segment, to not extend tail loss
>> +       latency unnecessarily.
>> +       Possible values:
>> +               0 disables RTOR and TLPR.
>> +               1 enables RTOR.
>> +               2 enables TLPR.
>> +               3 enables RTOR and TLPR.
>> +       Default: 3
>> +
>> tcp_timestamps - BOOLEAN
>>        Enable timestamps as defined in RFC1323.
>> 
>> diff --git a/include/net/tcp.h b/include/net/tcp.h
>> index f80e74c..bf98768 100644
>> --- a/include/net/tcp.h
>> +++ b/include/net/tcp.h
>> @@ -76,6 +76,9 @@ void tcp_time_wait(struct sock *sk, int state, int timeo);
>> /* After receiving this amount of duplicate ACKs fast retransmit starts. */
>> #define TCP_FASTRETRANS_THRESH 3
>> 
>> +/* Disable RTO Restart if the number of outstanding segments is at least. */
>> +#define TCP_RTORESTART_THRESH  4
>> +
>> /* Maximal number of ACKs sent quickly to accelerate slow-start. */
>> #define TCP_MAX_QUICKACKS      16U
>> 
>> @@ -284,6 +287,7 @@ extern int sysctl_tcp_autocorking;
>> extern int sysctl_tcp_invalid_ratelimit;
>> extern int sysctl_tcp_pacing_ss_ratio;
>> extern int sysctl_tcp_pacing_ca_ratio;
>> +extern int sysctl_tcp_timer_restart;
>> 
>> extern atomic_long_t tcp_memory_allocated;
>> extern struct percpu_counter tcp_sockets_allocated;
>> diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
>> index a0bd7a5..dfb6968 100644
>> --- a/net/ipv4/sysctl_net_ipv4.c
>> +++ b/net/ipv4/sysctl_net_ipv4.c
>> @@ -28,6 +28,7 @@
>> 
>> static int zero;
>> static int one = 1;
>> +static int three = 3;
>> static int four = 4;
>> static int thousand = 1000;
>> static int gso_max_segs = GSO_MAX_SEGS;
>> @@ -745,6 +746,15 @@ static struct ctl_table ipv4_table[] = {
>>                .extra2         = &thousand,
>>        },
>>        {
>> +               .procname       = "tcp_timer_restart",
>> +               .data           = &sysctl_tcp_timer_restart,
>> +               .maxlen         = sizeof(int),
>> +               .mode           = 0644,
>> +               .proc_handler   = proc_dointvec_minmax,
>> +               .extra1         = &zero,
>> +               .extra2         = &three,
>> +       },
>> +       {
>>                .procname       = "tcp_autocorking",
>>                .data           = &sysctl_tcp_autocorking,
>>                .maxlen         = sizeof(int),
>> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
>> index fdd88c3..66e0425 100644
>> --- a/net/ipv4/tcp_input.c
>> +++ b/net/ipv4/tcp_input.c
>> @@ -101,6 +101,7 @@ int sysctl_tcp_thin_dupack __read_mostly;
>> 
>> int sysctl_tcp_moderate_rcvbuf __read_mostly = 1;
>> int sysctl_tcp_early_retrans __read_mostly = 3;
>> +int sysctl_tcp_timer_restart __read_mostly = 3;
>> int sysctl_tcp_invalid_ratelimit __read_mostly = HZ/2;
>> 
>> #define FLAG_DATA              0x01 /* Incoming frame contained data.          */
>> @@ -2997,6 +2998,18 @@ static void tcp_cong_avoid(struct sock *sk, u32 ack, u32 acked)
>>        tcp_sk(sk)->snd_cwnd_stamp = tcp_time_stamp;
>> }
>> 
>> +static u32 tcp_unsent_pkts(const struct sock *sk)
>> +{
>> +       struct sk_buff *skb = tcp_send_head(sk);
>> +       u32 pkts = 0;
>> +
>> +       if (skb)
>> +               tcp_for_write_queue_from(skb, sk)
>> +                       pkts += tcp_skb_pcount(skb);
>> +
>> +       return pkts;
>> +}
>> +
>> /* Restart timer after forward progress on connection.
>>  * RFC2988 recommends to restart timer to now+rto.
>>  */
>> @@ -3027,6 +3040,17 @@ void tcp_rearm_rto(struct sock *sk)
>>                         */
>>                        if (delta > 0)
>>                                rto = delta;
>> +               } else if (icsk->icsk_pending == ICSK_TIME_RETRANS &&
>> +                          (sysctl_tcp_timer_restart == 1 ||
>> +                           sysctl_tcp_timer_restart == 3) &&
>> +                          (tp->packets_out + tcp_unsent_pkts(sk) <
> We re-arm RTO every ACK. Do we really need a loop here to improve
> short-transfer RTO performance?
> 
> Going one level up: what's the rationale to arm RTO differently
> depending on the #packets in the write queue? if the idea is to offset
> the elapsed of the first unacked packet in RTO, it should generally
> apply to any packet w/ any inflight, and we can forgo another packet
> counting heuristic.
> 
The rationale is to not risk being overly aggressive when there is enough
packets in the write queue to allow other recovery strategies than RTO
to kick in. However, this version of the patch tightly follows the IETF
internet-draft of the mechanism. If people like a more aggressive
approach I’m all for that, it’ll at least be a cleaner solution in terms of
code :)

> While larger inflight can trigger more fast recoveries, tail drops can
> happen at any time. This is why TLP is always enabled of any inflight.
> Sounds like we can benefit if RTO-restart is always used too? I am
> planning to experiment your patch w and w/o the packet count checks.
> 
This is of course possible. We welcome any experimentation with this
piece of code to make it better!

> 
>> +                           TCP_RTORESTART_THRESH)) {
>> +                       struct sk_buff *skb = tcp_write_queue_head(sk);
>> +                       const u32 rto_time_stamp = tcp_skb_timestamp(skb);
>> +                       s32 delta = (s32)(tcp_time_stamp - rto_time_stamp);
>> +
>> +                       if (delta > 0 && rto > delta)
>> +                               rto -= delta;
>>                }
>>                inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, rto,
>>                                          TCP_RTO_MAX);
>> --
>> 1.9.1
>> 
>> 
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


[-- Attachment #2: Message signed with OpenPGP using GPGMail --]
[-- Type: application/pgp-signature, Size: 455 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCHv2 net-next 1/2] tcp: RTO Restart (RTOR)
  2015-12-08  9:19   ` [RFC PATCHv2 net-next 1/2] tcp: RTO Restart (RTOR) Per Hurtig
@ 2015-12-08 10:50     ` Ilpo Järvinen
  2015-12-08 11:03       ` Per Hurtig
  2015-12-08 13:47     ` Eric Dumazet
  1 sibling, 1 reply; 17+ messages in thread
From: Ilpo Järvinen @ 2015-12-08 10:50 UTC (permalink / raw)
  To: Per Hurtig
  Cc: David Miller, edumazet, ncardwell, nanditad, tom, ycheng, viro,
	fw, mleitner, daniel, willemb, pasi.sarolahti, stephen, Netdev,
	anna.brunstrom, apetlund, michawe, mohammad.rajiullah

On Tue, 8 Dec 2015, Per Hurtig wrote:

> This patch implements the RTO restart modification (RTOR). When data is
> ACKed, and the RTO timer is restarted, the time elapsed since the last
> outstanding segment was transmitted is subtracted from the calculated RTO
> value. This way, the RTO timer will expire after exactly RTO seconds, and
> not RTO + RTT [+ delACK] seconds.
> 
> This patch also implements a new sysctl (tcp_timer_restart) that is used
> to control the timer restart behavior.
> 
> Signed-off-by: Per Hurtig <per.hurtig@kau.se>
> ---
>  Documentation/networking/ip-sysctl.txt | 12 ++++++++++++
>  include/net/tcp.h                      |  6 ++++++
>  net/ipv4/sysctl_net_ipv4.c             | 10 ++++++++++
>  net/ipv4/tcp_input.c                   | 29 +++++++++++++++++++++++++++++
>  4 files changed, 57 insertions(+)
> 
> diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
> index 2ea4c45..4094128 100644
> --- a/Documentation/networking/ip-sysctl.txt
> +++ b/Documentation/networking/ip-sysctl.txt
> @@ -591,6 +591,18 @@ tcp_syn_retries - INTEGER
>  	with the current initial RTO of 1second. With this the final timeout
>  	for an active TCP connection attempt will happen after 127seconds.
>  
> +tcp_timer_restart - INTEGER
> +	Controls how the RTO and PTO timers are restarted (RTOR and TLPR).
> +	If set (per timer or combined) the timers are restarted with
> +	respect to the earliest outstanding segment, to not extend tail loss
> +	latency unnecessarily.
> +	Possible values:
> +		0 disables RTOR and TLPR.
> +		1 enables RTOR.
> +		2 enables TLPR.
> +		3 enables RTOR and TLPR.
> +	Default: 3
> +
>  tcp_timestamps - BOOLEAN
>  	Enable timestamps as defined in RFC1323.
>  
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index f80e74c..833efb7 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -76,6 +76,11 @@ void tcp_time_wait(struct sock *sk, int state, int timeo);
>  /* After receiving this amount of duplicate ACKs fast retransmit starts. */
>  #define TCP_FASTRETRANS_THRESH 3
>  
> +/* Disable RTO Restart if the number of outstanding segments is at least. */
> +#define TCP_TIMER_RTORESTART	1
> +#define TCP_TIMER_TLPRESTART	2
> +#define TCP_RTORESTART_THRESH	4

Unfortunately the comment got now separated from the actual define.


-- 
 i.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCHv2 net-next 1/2] tcp: RTO Restart (RTOR)
  2015-12-08 10:50     ` Ilpo Järvinen
@ 2015-12-08 11:03       ` Per Hurtig
  0 siblings, 0 replies; 17+ messages in thread
From: Per Hurtig @ 2015-12-08 11:03 UTC (permalink / raw)
  To: Ilpo Järvinen
  Cc: David Miller, edumazet, ncardwell, nanditad, tom, Yuchung Cheng,
	viro, fw, mleitner, daniel, willemb, pasi.sarolahti, stephen,
	Netdev, Anna Brunström, apetlund, Michael Welzl,
	Mohammad Rajiullah

[-- Attachment #1: Type: text/plain, Size: 2677 bytes --]


> On 08 Dec 2015, at 11:50, Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> wrote:
> 
> On Tue, 8 Dec 2015, Per Hurtig wrote:
> 
>> This patch implements the RTO restart modification (RTOR). When data is
>> ACKed, and the RTO timer is restarted, the time elapsed since the last
>> outstanding segment was transmitted is subtracted from the calculated RTO
>> value. This way, the RTO timer will expire after exactly RTO seconds, and
>> not RTO + RTT [+ delACK] seconds.
>> 
>> This patch also implements a new sysctl (tcp_timer_restart) that is used
>> to control the timer restart behavior.
>> 
>> Signed-off-by: Per Hurtig <per.hurtig@kau.se>
>> ---
>> Documentation/networking/ip-sysctl.txt | 12 ++++++++++++
>> include/net/tcp.h                      |  6 ++++++
>> net/ipv4/sysctl_net_ipv4.c             | 10 ++++++++++
>> net/ipv4/tcp_input.c                   | 29 +++++++++++++++++++++++++++++
>> 4 files changed, 57 insertions(+)
>> 
>> diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
>> index 2ea4c45..4094128 100644
>> --- a/Documentation/networking/ip-sysctl.txt
>> +++ b/Documentation/networking/ip-sysctl.txt
>> @@ -591,6 +591,18 @@ tcp_syn_retries - INTEGER
>> 	with the current initial RTO of 1second. With this the final timeout
>> 	for an active TCP connection attempt will happen after 127seconds.
>> 
>> +tcp_timer_restart - INTEGER
>> +	Controls how the RTO and PTO timers are restarted (RTOR and TLPR).
>> +	If set (per timer or combined) the timers are restarted with
>> +	respect to the earliest outstanding segment, to not extend tail loss
>> +	latency unnecessarily.
>> +	Possible values:
>> +		0 disables RTOR and TLPR.
>> +		1 enables RTOR.
>> +		2 enables TLPR.
>> +		3 enables RTOR and TLPR.
>> +	Default: 3
>> +
>> tcp_timestamps - BOOLEAN
>> 	Enable timestamps as defined in RFC1323.
>> 
>> diff --git a/include/net/tcp.h b/include/net/tcp.h
>> index f80e74c..833efb7 100644
>> --- a/include/net/tcp.h
>> +++ b/include/net/tcp.h
>> @@ -76,6 +76,11 @@ void tcp_time_wait(struct sock *sk, int state, int timeo);
>> /* After receiving this amount of duplicate ACKs fast retransmit starts. */
>> #define TCP_FASTRETRANS_THRESH 3
>> 
>> +/* Disable RTO Restart if the number of outstanding segments is at least. */
>> +#define TCP_TIMER_RTORESTART	1
>> +#define TCP_TIMER_TLPRESTART	2
>> +#define TCP_RTORESTART_THRESH	4
> 
> Unfortunately the comment got now separated from the actual define.
> 
> 
> --
> i.
Darn, I knew I missed something. Well, I’ll fix that in the next round. Suppose there are more things that could be improved.

Per


[-- Attachment #2: Message signed with OpenPGP using GPGMail --]
[-- Type: application/pgp-signature, Size: 455 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCHv2 net-next 1/2] tcp: RTO Restart (RTOR)
  2015-12-08  9:19   ` [RFC PATCHv2 net-next 1/2] tcp: RTO Restart (RTOR) Per Hurtig
  2015-12-08 10:50     ` Ilpo Järvinen
@ 2015-12-08 13:47     ` Eric Dumazet
  2015-12-10  6:51       ` Per Hurtig
  1 sibling, 1 reply; 17+ messages in thread
From: Eric Dumazet @ 2015-12-08 13:47 UTC (permalink / raw)
  To: Per Hurtig
  Cc: davem, edumazet, ncardwell, nanditad, tom, ycheng, viro, fw,
	mleitner, daniel, willemb, ilpo.jarvinen, pasi.sarolahti,
	stephen, netdev, anna.brunstrom, apetlund, michawe,
	mohammad.rajiullah

On Tue, 2015-12-08 at 10:19 +0100, Per Hurtig wrote:

> +static u32 tcp_unsent_pkts(const struct sock *sk, u32 ulimit)
> +{
> +	struct sk_buff *skb = tcp_send_head(sk);
> +	u32 pkts = 0;
> +
> +	if (skb)
> +		tcp_for_write_queue_from(skb, sk) {
> +			pkts += tcp_skb_pcount(skb);
> +
> +			if (ulimit && pkts >= ulimit)
> +				return ulimit;
> +		}
> +
> +	return pkts;
> +}


Considering Yuchung feedback, have you looked at using an approximation
instead ? 

(ie using tp->write_seq - tp->snd_nxt)

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCHv2 net-next 1/2] tcp: RTO Restart (RTOR)
  2015-12-08 13:47     ` Eric Dumazet
@ 2015-12-10  6:51       ` Per Hurtig
  2015-12-10 15:37         ` Neal Cardwell
  0 siblings, 1 reply; 17+ messages in thread
From: Per Hurtig @ 2015-12-10  6:51 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, edumazet, ncardwell, nanditad, tom, Yuchung Cheng,
	viro, fw, mleitner, daniel, willemb, ilpo.jarvinen,
	pasi.sarolahti, stephen, netdev, Anna Brunström, apetlund,
	Michael Welzl, Mohammad Rajiullah

[-- Attachment #1: Type: text/plain, Size: 1580 bytes --]


> On 08 Dec 2015, at 14:47, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
> On Tue, 2015-12-08 at 10:19 +0100, Per Hurtig wrote:
> 
>> +static u32 tcp_unsent_pkts(const struct sock *sk, u32 ulimit)
>> +{
>> +	struct sk_buff *skb = tcp_send_head(sk);
>> +	u32 pkts = 0;
>> +
>> +	if (skb)
>> +		tcp_for_write_queue_from(skb, sk) {
>> +			pkts += tcp_skb_pcount(skb);
>> +
>> +			if (ulimit && pkts >= ulimit)
>> +				return ulimit;
>> +		}
>> +
>> +	return pkts;
>> +}
> 
> 
> Considering Yuchung feedback, have you looked at using an approximation
> instead ?
> 
> (ie using tp->write_seq - tp->snd_nxt)
> 
> 

Well, an approximation is rather “dangerous” as missing a single packet
could inhibit the desired behaviour. If looping is undesired, I think a
better solution is to actually *not* do this check at all and instead rely
solely on the

tp->packets_out < TCP_RTORESTART_THRESH

check instead. The reason why the number of unsent packets was
included was only to fix a corner case where it should be possible
to use the modified restart, but impossible due to the conditioning.

However, this corner case is likely to not occur very often and we
may be better off with the simpler check.

The corner case (if I remember this correctly) is that the restart is
not triggered when you have 2 segments in flight and (i) have a
congestion window of exactly 3; or (ii) get a packet written to the socket
just between previous data transmission and the arrival of the
acknowledgment that triggers the restart.


— Per


[-- Attachment #2: Message signed with OpenPGP using GPGMail --]
[-- Type: application/pgp-signature, Size: 455 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCHv2 net-next 1/2] tcp: RTO Restart (RTOR)
  2015-12-10  6:51       ` Per Hurtig
@ 2015-12-10 15:37         ` Neal Cardwell
  2015-12-10 21:11           ` Per Hurtig
  0 siblings, 1 reply; 17+ messages in thread
From: Neal Cardwell @ 2015-12-10 15:37 UTC (permalink / raw)
  To: Per Hurtig
  Cc: Eric Dumazet, David Miller, Eric Dumazet, Nandita Dukkipati,
	Tom Herbert, Yuchung Cheng, Al Viro, Florian Westphal,
	Marcelo Ricardo Leitner, Daniel Borkmann, Willem de Bruijn,
	Ilpo Järvinen, Pasi Sarolahti, Stephen Hemminger, Netdev,
	Anna Brunström, Andreas Petlund, Michael Welzl,
	Mohammad Rajiullah

On Thu, Dec 10, 2015 at 1:51 AM, Per Hurtig <per.hurtig@kau.se> wrote:
>
>> On 08 Dec 2015, at 14:47, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>
>> On Tue, 2015-12-08 at 10:19 +0100, Per Hurtig wrote:
>>
>>> +static u32 tcp_unsent_pkts(const struct sock *sk, u32 ulimit)
>>> +{
>>> +    struct sk_buff *skb = tcp_send_head(sk);
>>> +    u32 pkts = 0;
>>> +
>>> +    if (skb)
>>> +            tcp_for_write_queue_from(skb, sk) {
>>> +                    pkts += tcp_skb_pcount(skb);
>>> +
>>> +                    if (ulimit && pkts >= ulimit)
>>> +                            return ulimit;
>>> +            }
>>> +
>>> +    return pkts;
>>> +}
>>
>>
>> Considering Yuchung feedback, have you looked at using an approximation
>> instead ?
>>
>> (ie using tp->write_seq - tp->snd_nxt)
>>
>>
>
> Well, an approximation is rather “dangerous” as missing a single packet
> could inhibit the desired behaviour. If looping is undesired, I think a
> better solution is to actually *not* do this check at all and instead rely
> solely on the
>
> tp->packets_out < TCP_RTORESTART_THRESH

Yes, this simpler version seems very much preferable, IMHO. I agree
that it does not seem worth the complexity to try to cover the kind of
corner cases you outline.

I would also suggest a TCP_RTORESTART_THRESH value higher than 4.

In the ID at https://tools.ietf.org/html/draft-ietf-tcpm-rtorestart-10 it says:

   The RECOMMENDED value of rrthresh is four, as this value will ensure
   that RTOR is only used when fast retransmit cannot be triggered.

But my sense is that fast retransmit is often not triggered at
in-flight counts of much higher than 4, due to drop-tail queues, TSO
bursts, the initial IW10 being unpaced, etc. It would be interesting
to see A/B experiments for a few TCP_RTORESTART_THRESH values, say, 4
vs 10.

neal

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCHv2 net-next 1/2] tcp: RTO Restart (RTOR)
  2015-12-10 15:37         ` Neal Cardwell
@ 2015-12-10 21:11           ` Per Hurtig
  0 siblings, 0 replies; 17+ messages in thread
From: Per Hurtig @ 2015-12-10 21:11 UTC (permalink / raw)
  To: Neal Cardwell
  Cc: Eric Dumazet, David Miller, Eric Dumazet, Nandita Dukkipati,
	Tom Herbert, Yuchung Cheng, Al Viro, Florian Westphal,
	Marcelo Ricardo Leitner, Daniel Borkmann, Willem de Bruijn,
	Ilpo Järvinen, Pasi Sarolahti, Stephen Hemminger, Netdev,
	Anna Brunström, Andreas Petlund, Michael Welzl,
	Mohammad Rajiullah



> 10 dec. 2015 kl. 16:37 skrev Neal Cardwell <ncardwell@google.com>:
> 
>> On Thu, Dec 10, 2015 at 1:51 AM, Per Hurtig <per.hurtig@kau.se> wrote:
>> 
>>> On 08 Dec 2015, at 14:47, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>> 
>>> On Tue, 2015-12-08 at 10:19 +0100, Per Hurtig wrote:
>>> 
>>>> +static u32 tcp_unsent_pkts(const struct sock *sk, u32 ulimit)
>>>> +{
>>>> +    struct sk_buff *skb = tcp_send_head(sk);
>>>> +    u32 pkts = 0;
>>>> +
>>>> +    if (skb)
>>>> +            tcp_for_write_queue_from(skb, sk) {
>>>> +                    pkts += tcp_skb_pcount(skb);
>>>> +
>>>> +                    if (ulimit && pkts >= ulimit)
>>>> +                            return ulimit;
>>>> +            }
>>>> +
>>>> +    return pkts;
>>>> +}
>>> 
>>> 
>>> Considering Yuchung feedback, have you looked at using an approximation
>>> instead ?
>>> 
>>> (ie using tp->write_seq - tp->snd_nxt)
>> 
>> Well, an approximation is rather “dangerous” as missing a single packet
>> could inhibit the desired behaviour. If looping is undesired, I think a
>> better solution is to actually *not* do this check at all and instead rely
>> solely on the
>> 
>> tp->packets_out < TCP_RTORESTART_THRESH
> 
> Yes, this simpler version seems very much preferable, IMHO. I agree
> that it does not seem worth the complexity to try to cover the kind of
> corner cases you outline.
> 
> I would also suggest a TCP_RTORESTART_THRESH value higher than 4.
> 
> In the ID at https://tools.ietf.org/html/draft-ietf-tcpm-rtorestart-10 it says:
> 
>   The RECOMMENDED value of rrthresh is four, as this value will ensure
>   that RTOR is only used when fast retransmit cannot be triggered.
> 
> But my sense is that fast retransmit is often not triggered at
> in-flight counts of much higher than 4, due to drop-tail queues, TSO
> bursts, the initial IW10 being unpaced, etc. It would be interesting
> to see A/B experiments for a few TCP_RTORESTART_THRESH values, say, 4
> vs 10.
> 
> neal
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
Sure. One idea could also be to use the "reordering" value as a dynamic threshhold?

-- Per

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2015-12-10 21:11 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-12-07  9:00 [RFC PATCH net-next 0/2] tcp: timer restart for tail loss Per Hurtig
2015-12-07  9:00 ` [RFC PATCH net-next 1/2] tcp: RTO Restart (RTOR) Per Hurtig
2015-12-07 10:22   ` Ilpo Järvinen
2015-12-07 16:46   ` Marcelo Ricardo Leitner
2015-12-07 17:03   ` Eric Dumazet
2015-12-08  2:05   ` Yuchung Cheng
2015-12-08  9:25     ` Per Hurtig
2015-12-07  9:00 ` [RFC PATCH net-next 2/2] tcp: TLP restart (TLPR) Per Hurtig
2015-12-08  9:19 ` [RFC PATCHv2 net-next 0/2] tcp: timer restart for tail loss Per Hurtig
2015-12-08  9:19   ` [RFC PATCHv2 net-next 1/2] tcp: RTO Restart (RTOR) Per Hurtig
2015-12-08 10:50     ` Ilpo Järvinen
2015-12-08 11:03       ` Per Hurtig
2015-12-08 13:47     ` Eric Dumazet
2015-12-10  6:51       ` Per Hurtig
2015-12-10 15:37         ` Neal Cardwell
2015-12-10 21:11           ` Per Hurtig
2015-12-08  9:19   ` [RFC PATCHv2 net-next 2/2] tcp: TLP restart (TLPR) Per Hurtig

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.