All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 3/3] Revert Backoff [v3]: Calculate TCP's connection close threshold as a time value.
@ 2009-08-26 10:16 Damian Lukowski
  2009-08-31 13:26 ` Ilpo Järvinen
                   ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: Damian Lukowski @ 2009-08-26 10:16 UTC (permalink / raw)
  To: Netdev

RFC 1122 specifies two threshold values R1 and R2 for connection timeouts,
which may represent a number of allowed retransmissions or a timeout value.
Currently linux uses sysctl_tcp_retries{1,2} to specify the thresholds
in number of allowed retransmissions.

For any desired threshold R2 (by means of time) one can specify tcp_retries2
(by means of number of retransmissions) such that TCP will not time out
earlier than R2. This is the case, because the RTO schedule follows a fixed
pattern, namely exponential backoff.

However, the RTO behaviour is not predictable any more if RTO backoffs can be
reverted, as it is the case in the draft
"Make TCP more Robust to Long Connectivity Disruptions"
(http://tools.ietf.org/html/draft-zimmermann-tcp-lcd).

In the worst case TCP would time out a connection after 3.2 seconds, if the
initial RTO equaled MIN_RTO and each backoff has been reverted.

This patch introduces a function retransmits_timed_out(N),
which calculates the timeout of a TCP connection, assuming an initial
RTO of MIN_RTO and N unsuccessful, exponentially backed-off retransmissions.

Whenever timeout decisions are made by comparing the retransmission counter
to some value N, this function can be used, instead.

The meaning of tcp_retries2 will be changed, as many more RTO retransmissions
can occur than the value indicates. However, it yields a timeout which is
similar to the one of an unpatched, exponentially backing off TCP in the same
scenario. As no application could rely on an RTO greater than MIN_RTO, there
should be no risk of a regression.

Signed-off-by: Damian Lukowski <damian@tvk.rwth-aachen.de>
---
 include/net/tcp.h    |   18 ++++++++++++++++++
 net/ipv4/tcp_timer.c |   11 +++++++----
 2 files changed, 25 insertions(+), 4 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index c35b329..17d1a88 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1247,6 +1247,24 @@ static inline struct sk_buff *tcp_write_queue_prev(struct sock *sk, struct sk_bu
 #define tcp_for_write_queue_from_safe(skb, tmp, sk)			\
 	skb_queue_walk_from_safe(&(sk)->sk_write_queue, skb, tmp)
 
+static inline bool retransmits_timed_out(const struct sock *sk,
+					 unsigned int boundary)
+{
+	int limit, K;
+	if (!inet_csk(sk)->icsk_retransmits)
+		return false;
+
+	K = ilog2(TCP_RTO_MAX/TCP_RTO_MIN);
+
+	if (boundary <= K)
+		limit = ((2 << boundary) - 1) * TCP_RTO_MIN;
+	else
+		limit = ((2 << K) - 1) * TCP_RTO_MIN +
+			(boundary - K) * TCP_RTO_MAX;
+
+	return (tcp_time_stamp - tcp_sk(sk)->retrans_stamp) >= limit;
+}
+
 static inline struct sk_buff *tcp_send_head(struct sock *sk)
 {
 	return sk->sk_send_head;
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index a3ba494..2972d7b 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -137,13 +137,14 @@ static int tcp_write_timeout(struct sock *sk)
 {
 	struct inet_connection_sock *icsk = inet_csk(sk);
 	int retry_until;
+	bool do_reset;
 
 	if ((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV)) {
 		if (icsk->icsk_retransmits)
 			dst_negative_advice(&sk->sk_dst_cache);
 		retry_until = icsk->icsk_syn_retries ? : sysctl_tcp_syn_retries;
 	} else {
-		if (icsk->icsk_retransmits >= sysctl_tcp_retries1) {
+		if (retransmits_timed_out(sk, sysctl_tcp_retries1)) {
 			/* Black hole detection */
 			tcp_mtu_probing(icsk, sk);
 
@@ -155,13 +156,15 @@ static int tcp_write_timeout(struct sock *sk)
 			const int alive = (icsk->icsk_rto < TCP_RTO_MAX);
 
 			retry_until = tcp_orphan_retries(sk, alive);
+			do_reset = alive ||
+				   !retransmits_timed_out(sk, retry_until);
 
-			if (tcp_out_of_resources(sk, alive || icsk->icsk_retransmits < retry_until))
+			if (tcp_out_of_resources(sk, do_reset))
 				return 1;
 		}
 	}
 
-	if (icsk->icsk_retransmits >= retry_until) {
+	if (retransmits_timed_out(sk, retry_until)) {
 		/* Has it gone just too far? */
 		tcp_write_err(sk);
 		return 1;
@@ -385,7 +388,7 @@ void tcp_retransmit_timer(struct sock *sk)
 out_reset_timer:
 	icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX);
 	inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, icsk->icsk_rto, TCP_RTO_MAX);
-	if (icsk->icsk_retransmits > sysctl_tcp_retries1)
+	if (retransmits_timed_out(sk, sysctl_tcp_retries1 + 1))
 		__sk_dst_reset(sk);
 
 out:;
-- 
1.6.3.3


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/3] Revert Backoff [v3]: Calculate TCP's connection close threshold as a time value.
  2009-08-26 10:16 [PATCH 3/3] Revert Backoff [v3]: Calculate TCP's connection close threshold as a time value Damian Lukowski
@ 2009-08-31 13:26 ` Ilpo Järvinen
  2009-09-01 11:28   ` Damian Lukowski
  2009-08-31 14:01 ` Eric Dumazet
       [not found] ` <CAFbMe2Mu46N8kRrYkGzYRLuntEd2J9aO_d0Jw_y0gsQA4q-qHw@mail.gmail.com>
  2 siblings, 1 reply; 19+ messages in thread
From: Ilpo Järvinen @ 2009-08-31 13:26 UTC (permalink / raw)
  To: Damian Lukowski; +Cc: Netdev, David Miller

[-- Attachment #1: Type: TEXT/PLAIN, Size: 5801 bytes --]

On Wed, 26 Aug 2009, Damian Lukowski wrote:

> RFC 1122 specifies two threshold values R1 and R2 for connection timeouts,
> which may represent a number of allowed retransmissions or a timeout value.
> Currently linux uses sysctl_tcp_retries{1,2} to specify the thresholds
> in number of allowed retransmissions.
>
> For any desired threshold R2 (by means of time) one can specify tcp_retries2
> (by means of number of retransmissions) such that TCP will not time out
> earlier than R2. This is the case, because the RTO schedule follows a fixed
> pattern, namely exponential backoff.
>
> However, the RTO behaviour is not predictable any more if RTO backoffs can be
> reverted, as it is the case in the draft
> "Make TCP more Robust to Long Connectivity Disruptions"
> (http://tools.ietf.org/html/draft-zimmermann-tcp-lcd).
>
> In the worst case TCP would time out a connection after 3.2 seconds, if the
> initial RTO equaled MIN_RTO and each backoff has been reverted.
>
> This patch introduces a function retransmits_timed_out(N),
> which calculates the timeout of a TCP connection, assuming an initial
> RTO of MIN_RTO and N unsuccessful, exponentially backed-off retransmissions.
>
> Whenever timeout decisions are made by comparing the retransmission counter
> to some value N, this function can be used, instead.
>
> The meaning of tcp_retries2 will be changed, as many more RTO retransmissions
> can occur than the value indicates. However, it yields a timeout which is
> similar to the one of an unpatched, exponentially backing off TCP in the same
> scenario. As no application could rely on an RTO greater than MIN_RTO, there
> should be no risk of a regression.
>
> Signed-off-by: Damian Lukowski <damian@tvk.rwth-aachen.de>
> ---
> include/net/tcp.h    |   18 ++++++++++++++++++
> net/ipv4/tcp_timer.c |   11 +++++++----
> 2 files changed, 25 insertions(+), 4 deletions(-)
>
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index c35b329..17d1a88 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -1247,6 +1247,24 @@ static inline struct sk_buff *tcp_write_queue_prev(struct sock *sk, struct sk_bu
> #define tcp_for_write_queue_from_safe(skb, tmp, sk)			\
> 	skb_queue_walk_from_safe(&(sk)->sk_write_queue, skb, tmp)
>

IMHO, having an introductionary comment here wouldn't hurt as this is a 
bit tricky thing we end up doing here :-).

> +static inline bool retransmits_timed_out(const struct sock *sk,
> +					 unsigned int boundary)
> +{
> +	int limit, K;

An empty line after local variables. Just rename the K to max_backoff or 
something like that (more meaningful).

> +	if (!inet_csk(sk)->icsk_retransmits)
> +		return false;
> +
> +	K = ilog2(TCP_RTO_MAX/TCP_RTO_MIN);
> +
> +	if (boundary <= K)
> +		limit = ((2 << boundary) - 1) * TCP_RTO_MIN;
> +	else
> +		limit = ((2 << K) - 1) * TCP_RTO_MIN +
> +			(boundary - K) * TCP_RTO_MAX;
> +
> +	return (tcp_time_stamp - tcp_sk(sk)->retrans_stamp) >= limit;
> +}
> +
> static inline struct sk_buff *tcp_send_head(struct sock *sk)
> {
> 	return sk->sk_send_head;
> diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
> index a3ba494..2972d7b 100644
> --- a/net/ipv4/tcp_timer.c
> +++ b/net/ipv4/tcp_timer.c
> @@ -137,13 +137,14 @@ static int tcp_write_timeout(struct sock *sk)
> {
> 	struct inet_connection_sock *icsk = inet_csk(sk);
> 	int retry_until;
> +	bool do_reset;
>
> 	if ((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV)) {
> 		if (icsk->icsk_retransmits)
> 			dst_negative_advice(&sk->sk_dst_cache);
> 		retry_until = icsk->icsk_syn_retries ? : sysctl_tcp_syn_retries;
> 	} else {
> -		if (icsk->icsk_retransmits >= sysctl_tcp_retries1) {
> +		if (retransmits_timed_out(sk, sysctl_tcp_retries1)) {
> 			/* Black hole detection */
> 			tcp_mtu_probing(icsk, sk);
>
> @@ -155,13 +156,15 @@ static int tcp_write_timeout(struct sock *sk)
> 			const int alive = (icsk->icsk_rto < TCP_RTO_MAX);
>
> 			retry_until = tcp_orphan_retries(sk, alive);
> +			do_reset = alive ||
> +				   !retransmits_timed_out(sk, retry_until);
>
> -			if (tcp_out_of_resources(sk, alive || icsk->icsk_retransmits < retry_until))
> +			if (tcp_out_of_resources(sk, do_reset))
> 				return 1;
> 		}
> 	}
>
> -	if (icsk->icsk_retransmits >= retry_until) {
> +	if (retransmits_timed_out(sk, retry_until)) {
> 		/* Has it gone just too far? */
> 		tcp_write_err(sk);
> 		return 1;
> @@ -385,7 +388,7 @@ void tcp_retransmit_timer(struct sock *sk)
> out_reset_timer:
> 	icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX);
> 	inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, icsk->icsk_rto, TCP_RTO_MAX);
> -	if (icsk->icsk_retransmits > sysctl_tcp_retries1)
> +	if (retransmits_timed_out(sk, sysctl_tcp_retries1 + 1))
> 		__sk_dst_reset(sk);
>
> out:;

The implementation itself seems ok. I was a bit concerned that the use of 
retrans_stamp would result in considerably different behavior than the use 
of icsk_retransmits but it seems I was wrong.

Acked-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>

I'm fine with this approach as no matter what we would do the previous 
approach just isn't fitting the new model that breaks the assumptions 
behind the previous model. I don't expect the change in the timing to be 
significant for anybody unless one did copy our code exactly somewhere 
(including rtt calculation). ...but if somebody has some objections in 
this, please speak up! After all, it will change how the sysctl behaves.

Perhaps we should mention this artificial timing change in 
ip-sysctl.txt too...?

It would be worth to do a trivial pull-the-plug testing comparing this and 
the previous approach in the rto < TCP_RTO_MIN region without any icmps to 
verify that this didn't change the timing a bit, afaict, it shouldn't (If 
you didn't do that already). Just to be on very sure grounds.


-- 
  i.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/3] Revert Backoff [v3]: Calculate TCP's connection close threshold as a time value.
  2009-08-26 10:16 [PATCH 3/3] Revert Backoff [v3]: Calculate TCP's connection close threshold as a time value Damian Lukowski
  2009-08-31 13:26 ` Ilpo Järvinen
@ 2009-08-31 14:01 ` Eric Dumazet
  2009-08-31 14:04   ` Eric Dumazet
  2009-09-01 11:49   ` Damian Lukowski
       [not found] ` <CAFbMe2Mu46N8kRrYkGzYRLuntEd2J9aO_d0Jw_y0gsQA4q-qHw@mail.gmail.com>
  2 siblings, 2 replies; 19+ messages in thread
From: Eric Dumazet @ 2009-08-31 14:01 UTC (permalink / raw)
  To: Damian Lukowski; +Cc: Netdev

Damian Lukowski a écrit :
> RFC 1122 specifies two threshold values R1 and R2 for connection timeouts,
> which may represent a number of allowed retransmissions or a timeout value.
> Currently linux uses sysctl_tcp_retries{1,2} to specify the thresholds
> in number of allowed retransmissions.
> 
> For any desired threshold R2 (by means of time) one can specify tcp_retries2
> (by means of number of retransmissions) such that TCP will not time out
> earlier than R2. This is the case, because the RTO schedule follows a fixed
> pattern, namely exponential backoff.
> 
> However, the RTO behaviour is not predictable any more if RTO backoffs can be
> reverted, as it is the case in the draft
> "Make TCP more Robust to Long Connectivity Disruptions"
> (http://tools.ietf.org/html/draft-zimmermann-tcp-lcd).
> 
> In the worst case TCP would time out a connection after 3.2 seconds, if the
> initial RTO equaled MIN_RTO and each backoff has been reverted.
> 
> This patch introduces a function retransmits_timed_out(N),
> which calculates the timeout of a TCP connection, assuming an initial
> RTO of MIN_RTO and N unsuccessful, exponentially backed-off retransmissions.
> 
> Whenever timeout decisions are made by comparing the retransmission counter
> to some value N, this function can be used, instead.
> 
> The meaning of tcp_retries2 will be changed, as many more RTO retransmissions
> can occur than the value indicates. However, it yields a timeout which is
> similar to the one of an unpatched, exponentially backing off TCP in the same
> scenario. As no application could rely on an RTO greater than MIN_RTO, there
> should be no risk of a regression.
> 
> Signed-off-by: Damian Lukowski <damian@tvk.rwth-aachen.de>
> ---
>  include/net/tcp.h    |   18 ++++++++++++++++++
>  net/ipv4/tcp_timer.c |   11 +++++++----
>  2 files changed, 25 insertions(+), 4 deletions(-)
> 
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index c35b329..17d1a88 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -1247,6 +1247,24 @@ static inline struct sk_buff *tcp_write_queue_prev(struct sock *sk, struct sk_bu
>  #define tcp_for_write_queue_from_safe(skb, tmp, sk)			\
>  	skb_queue_walk_from_safe(&(sk)->sk_write_queue, skb, tmp)
>  
> +static inline bool retransmits_timed_out(const struct sock *sk,
> +					 unsigned int boundary)
> +{
> +	int limit, K;
> +	if (!inet_csk(sk)->icsk_retransmits)
> +		return false;
> +
> +	K = ilog2(TCP_RTO_MAX/TCP_RTO_MIN);
> +
> +	if (boundary <= K)
> +		limit = ((2 << boundary) - 1) * TCP_RTO_MIN;
> +	else
> +		limit = ((2 << K) - 1) * TCP_RTO_MIN +
> +			(boundary - K) * TCP_RTO_MAX;

Doing this computation might allow us to respect RFC 1122 here :
 
"The value of R2 SHOULD correspond to at least 100 seconds."

adding a third parameter to retransmits_timed_out(), min_limit,
being 100*HZ if sysctl_tcp_retries2 was used...

limit = min(min_limit, limit);


> +
> +	return (tcp_time_stamp - tcp_sk(sk)->retrans_stamp) >= limit;
> +}
> +
>  static inline struct sk_buff *tcp_send_head(struct sock *sk)
>  {
>  	return sk->sk_send_head;
> diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
> index a3ba494..2972d7b 100644
> --- a/net/ipv4/tcp_timer.c
> +++ b/net/ipv4/tcp_timer.c
> @@ -137,13 +137,14 @@ static int tcp_write_timeout(struct sock *sk)
>  {
>  	struct inet_connection_sock *icsk = inet_csk(sk);
>  	int retry_until;
> +	bool do_reset;
>  
>  	if ((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV)) {
>  		if (icsk->icsk_retransmits)
>  			dst_negative_advice(&sk->sk_dst_cache);
>  		retry_until = icsk->icsk_syn_retries ? : sysctl_tcp_syn_retries;
>  	} else {
> -		if (icsk->icsk_retransmits >= sysctl_tcp_retries1) {
> +		if (retransmits_timed_out(sk, sysctl_tcp_retries1)) {
>  			/* Black hole detection */
>  			tcp_mtu_probing(icsk, sk);
>  
> @@ -155,13 +156,15 @@ static int tcp_write_timeout(struct sock *sk)
>  			const int alive = (icsk->icsk_rto < TCP_RTO_MAX);
>  
>  			retry_until = tcp_orphan_retries(sk, alive);
> +			do_reset = alive ||
> +				   !retransmits_timed_out(sk, retry_until);
>  
> -			if (tcp_out_of_resources(sk, alive || icsk->icsk_retransmits < retry_until))
> +			if (tcp_out_of_resources(sk, do_reset))
>  				return 1;
>  		}
>  	}
>  
> -	if (icsk->icsk_retransmits >= retry_until) {
> +	if (retransmits_timed_out(sk, retry_until)) {
>  		/* Has it gone just too far? */
>  		tcp_write_err(sk);
>  		return 1;
> @@ -385,7 +388,7 @@ void tcp_retransmit_timer(struct sock *sk)
>  out_reset_timer:
>  	icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX);
>  	inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, icsk->icsk_rto, TCP_RTO_MAX);
> -	if (icsk->icsk_retransmits > sysctl_tcp_retries1)
> +	if (retransmits_timed_out(sk, sysctl_tcp_retries1 + 1))
>  		__sk_dst_reset(sk);
>  
>  out:;


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/3] Revert Backoff [v3]: Calculate TCP's connection close threshold as a time value.
  2009-08-31 14:01 ` Eric Dumazet
@ 2009-08-31 14:04   ` Eric Dumazet
  2009-09-01 11:49   ` Damian Lukowski
  1 sibling, 0 replies; 19+ messages in thread
From: Eric Dumazet @ 2009-08-31 14:04 UTC (permalink / raw)
  Cc: Damian Lukowski, Netdev

Eric Dumazet a écrit :
> Damian Lukowski a écrit :
>> RFC 1122 specifies two threshold values R1 and R2 for connection timeouts,
>> which may represent a number of allowed retransmissions or a timeout value.
>> Currently linux uses sysctl_tcp_retries{1,2} to specify the thresholds
>> in number of allowed retransmissions.
>>
>> For any desired threshold R2 (by means of time) one can specify tcp_retries2
>> (by means of number of retransmissions) such that TCP will not time out
>> earlier than R2. This is the case, because the RTO schedule follows a fixed
>> pattern, namely exponential backoff.
>>
>> However, the RTO behaviour is not predictable any more if RTO backoffs can be
>> reverted, as it is the case in the draft
>> "Make TCP more Robust to Long Connectivity Disruptions"
>> (http://tools.ietf.org/html/draft-zimmermann-tcp-lcd).
>>
>> In the worst case TCP would time out a connection after 3.2 seconds, if the
>> initial RTO equaled MIN_RTO and each backoff has been reverted.
>>
>> This patch introduces a function retransmits_timed_out(N),
>> which calculates the timeout of a TCP connection, assuming an initial
>> RTO of MIN_RTO and N unsuccessful, exponentially backed-off retransmissions.
>>
>> Whenever timeout decisions are made by comparing the retransmission counter
>> to some value N, this function can be used, instead.
>>
>> The meaning of tcp_retries2 will be changed, as many more RTO retransmissions
>> can occur than the value indicates. However, it yields a timeout which is
>> similar to the one of an unpatched, exponentially backing off TCP in the same
>> scenario. As no application could rely on an RTO greater than MIN_RTO, there
>> should be no risk of a regression.
>>
>> Signed-off-by: Damian Lukowski <damian@tvk.rwth-aachen.de>
>> ---
>>  include/net/tcp.h    |   18 ++++++++++++++++++
>>  net/ipv4/tcp_timer.c |   11 +++++++----
>>  2 files changed, 25 insertions(+), 4 deletions(-)
>>
>> diff --git a/include/net/tcp.h b/include/net/tcp.h
>> index c35b329..17d1a88 100644
>> --- a/include/net/tcp.h
>> +++ b/include/net/tcp.h
>> @@ -1247,6 +1247,24 @@ static inline struct sk_buff *tcp_write_queue_prev(struct sock *sk, struct sk_bu
>>  #define tcp_for_write_queue_from_safe(skb, tmp, sk)			\
>>  	skb_queue_walk_from_safe(&(sk)->sk_write_queue, skb, tmp)
>>  
>> +static inline bool retransmits_timed_out(const struct sock *sk,
>> +					 unsigned int boundary)
>> +{
>> +	int limit, K;
>> +	if (!inet_csk(sk)->icsk_retransmits)
>> +		return false;
>> +
>> +	K = ilog2(TCP_RTO_MAX/TCP_RTO_MIN);
>> +
>> +	if (boundary <= K)
>> +		limit = ((2 << boundary) - 1) * TCP_RTO_MIN;
>> +	else
>> +		limit = ((2 << K) - 1) * TCP_RTO_MIN +
>> +			(boundary - K) * TCP_RTO_MAX;
> 
> Doing this computation might allow us to respect RFC 1122 here :
>  
> "The value of R2 SHOULD correspond to at least 100 seconds."
> 
> adding a third parameter to retransmits_timed_out(), min_limit,
> being 100*HZ if sysctl_tcp_retries2 was used...
> 
> limit = min(min_limit, limit);
> 

Oh well, typical min/max error, sorry :)

> 
>> +
>> +	return (tcp_time_stamp - tcp_sk(sk)->retrans_stamp) >= limit;
>> +}
>> +
>>  static inline struct sk_buff *tcp_send_head(struct sock *sk)
>>  {
>>  	return sk->sk_send_head;
>> diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
>> index a3ba494..2972d7b 100644
>> --- a/net/ipv4/tcp_timer.c
>> +++ b/net/ipv4/tcp_timer.c
>> @@ -137,13 +137,14 @@ static int tcp_write_timeout(struct sock *sk)
>>  {
>>  	struct inet_connection_sock *icsk = inet_csk(sk);
>>  	int retry_until;
>> +	bool do_reset;
>>  
>>  	if ((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV)) {
>>  		if (icsk->icsk_retransmits)
>>  			dst_negative_advice(&sk->sk_dst_cache);
>>  		retry_until = icsk->icsk_syn_retries ? : sysctl_tcp_syn_retries;
>>  	} else {
>> -		if (icsk->icsk_retransmits >= sysctl_tcp_retries1) {
>> +		if (retransmits_timed_out(sk, sysctl_tcp_retries1)) {
>>  			/* Black hole detection */
>>  			tcp_mtu_probing(icsk, sk);
>>  
>> @@ -155,13 +156,15 @@ static int tcp_write_timeout(struct sock *sk)
>>  			const int alive = (icsk->icsk_rto < TCP_RTO_MAX);
>>  
>>  			retry_until = tcp_orphan_retries(sk, alive);
>> +			do_reset = alive ||
>> +				   !retransmits_timed_out(sk, retry_until);
>>  
>> -			if (tcp_out_of_resources(sk, alive || icsk->icsk_retransmits < retry_until))
>> +			if (tcp_out_of_resources(sk, do_reset))
>>  				return 1;
>>  		}
>>  	}
>>  
>> -	if (icsk->icsk_retransmits >= retry_until) {
>> +	if (retransmits_timed_out(sk, retry_until)) {
>>  		/* Has it gone just too far? */
>>  		tcp_write_err(sk);
>>  		return 1;
>> @@ -385,7 +388,7 @@ void tcp_retransmit_timer(struct sock *sk)
>>  out_reset_timer:
>>  	icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX);
>>  	inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, icsk->icsk_rto, TCP_RTO_MAX);
>> -	if (icsk->icsk_retransmits > sysctl_tcp_retries1)
>> +	if (retransmits_timed_out(sk, sysctl_tcp_retries1 + 1))
>>  		__sk_dst_reset(sk);
>>  
>>  out:;
> 
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/3] Revert Backoff [v3]: Calculate TCP's connection close threshold as a time value.
  2009-08-31 13:26 ` Ilpo Järvinen
@ 2009-09-01 11:28   ` Damian Lukowski
  2009-09-01 13:41     ` Ilpo Järvinen
  0 siblings, 1 reply; 19+ messages in thread
From: Damian Lukowski @ 2009-09-01 11:28 UTC (permalink / raw)
  To: Ilpo Järvinen; +Cc: Netdev, David Miller

Ilpo Järvinen schrieb:
> On Wed, 26 Aug 2009, Damian Lukowski wrote:
> 
>> RFC 1122 specifies two threshold values R1 and R2 for connection
>> timeouts,
>> which may represent a number of allowed retransmissions or a timeout
>> value.
>> Currently linux uses sysctl_tcp_retries{1,2} to specify the thresholds
>> in number of allowed retransmissions.
>>
>> For any desired threshold R2 (by means of time) one can specify
>> tcp_retries2
>> (by means of number of retransmissions) such that TCP will not time out
>> earlier than R2. This is the case, because the RTO schedule follows a
>> fixed
>> pattern, namely exponential backoff.
>>
>> However, the RTO behaviour is not predictable any more if RTO backoffs
>> can be
>> reverted, as it is the case in the draft
>> "Make TCP more Robust to Long Connectivity Disruptions"
>> (http://tools.ietf.org/html/draft-zimmermann-tcp-lcd).
>>
>> In the worst case TCP would time out a connection after 3.2 seconds,
>> if the
>> initial RTO equaled MIN_RTO and each backoff has been reverted.
>>
>> This patch introduces a function retransmits_timed_out(N),
>> which calculates the timeout of a TCP connection, assuming an initial
>> RTO of MIN_RTO and N unsuccessful, exponentially backed-off
>> retransmissions.
>>
>> Whenever timeout decisions are made by comparing the retransmission
>> counter
>> to some value N, this function can be used, instead.
>>
>> The meaning of tcp_retries2 will be changed, as many more RTO
>> retransmissions
>> can occur than the value indicates. However, it yields a timeout which is
>> similar to the one of an unpatched, exponentially backing off TCP in
>> the same
>> scenario. As no application could rely on an RTO greater than MIN_RTO,
>> there
>> should be no risk of a regression.
>>
>> Signed-off-by: Damian Lukowski <damian@tvk.rwth-aachen.de>
>> ---
>> include/net/tcp.h    |   18 ++++++++++++++++++
>> net/ipv4/tcp_timer.c |   11 +++++++----
>> 2 files changed, 25 insertions(+), 4 deletions(-)
>>
>> diff --git a/include/net/tcp.h b/include/net/tcp.h
>> index c35b329..17d1a88 100644
>> --- a/include/net/tcp.h
>> +++ b/include/net/tcp.h
>> @@ -1247,6 +1247,24 @@ static inline struct sk_buff
>> *tcp_write_queue_prev(struct sock *sk, struct sk_bu
>> #define tcp_for_write_queue_from_safe(skb, tmp, sk)            \
>>     skb_queue_walk_from_safe(&(sk)->sk_write_queue, skb, tmp)
>>
> 
> IMHO, having an introductionary comment here wouldn't hurt as this is a
> bit tricky thing we end up doing here :-).
> 
>> +static inline bool retransmits_timed_out(const struct sock *sk,
>> +                     unsigned int boundary)
>> +{
>> +    int limit, K;
> 
> An empty line after local variables. Just rename the K to max_backoff or
> something like that (more meaningful).
> 
>> +    if (!inet_csk(sk)->icsk_retransmits)
>> +        return false;
>> +
>> +    K = ilog2(TCP_RTO_MAX/TCP_RTO_MIN);
>> +
>> +    if (boundary <= K)
>> +        limit = ((2 << boundary) - 1) * TCP_RTO_MIN;
>> +    else
>> +        limit = ((2 << K) - 1) * TCP_RTO_MIN +
>> +            (boundary - K) * TCP_RTO_MAX;
>> +
>> +    return (tcp_time_stamp - tcp_sk(sk)->retrans_stamp) >= limit;
>> +}
>> +
>> static inline struct sk_buff *tcp_send_head(struct sock *sk)
>> {
>>     return sk->sk_send_head;
>> diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
>> index a3ba494..2972d7b 100644
>> --- a/net/ipv4/tcp_timer.c
>> +++ b/net/ipv4/tcp_timer.c
>> @@ -137,13 +137,14 @@ static int tcp_write_timeout(struct sock *sk)
>> {
>>     struct inet_connection_sock *icsk = inet_csk(sk);
>>     int retry_until;
>> +    bool do_reset;
>>
>>     if ((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV)) {
>>         if (icsk->icsk_retransmits)
>>             dst_negative_advice(&sk->sk_dst_cache);
>>         retry_until = icsk->icsk_syn_retries ? : sysctl_tcp_syn_retries;
>>     } else {
>> -        if (icsk->icsk_retransmits >= sysctl_tcp_retries1) {
>> +        if (retransmits_timed_out(sk, sysctl_tcp_retries1)) {
>>             /* Black hole detection */
>>             tcp_mtu_probing(icsk, sk);
>>
>> @@ -155,13 +156,15 @@ static int tcp_write_timeout(struct sock *sk)
>>             const int alive = (icsk->icsk_rto < TCP_RTO_MAX);
>>
>>             retry_until = tcp_orphan_retries(sk, alive);
>> +            do_reset = alive ||
>> +                   !retransmits_timed_out(sk, retry_until);
>>
>> -            if (tcp_out_of_resources(sk, alive ||
>> icsk->icsk_retransmits < retry_until))
>> +            if (tcp_out_of_resources(sk, do_reset))
>>                 return 1;
>>         }
>>     }
>>
>> -    if (icsk->icsk_retransmits >= retry_until) {
>> +    if (retransmits_timed_out(sk, retry_until)) {
>>         /* Has it gone just too far? */
>>         tcp_write_err(sk);
>>         return 1;
>> @@ -385,7 +388,7 @@ void tcp_retransmit_timer(struct sock *sk)
>> out_reset_timer:
>>     icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX);
>>     inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, icsk->icsk_rto,
>> TCP_RTO_MAX);
>> -    if (icsk->icsk_retransmits > sysctl_tcp_retries1)
>> +    if (retransmits_timed_out(sk, sysctl_tcp_retries1 + 1))
>>         __sk_dst_reset(sk);
>>
>> out:;
> 
> The implementation itself seems ok. I was a bit concerned that the use
> of retrans_stamp would result in considerably different behavior than
> the use of icsk_retransmits but it seems I was wrong.
> 
> Acked-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
> 
> I'm fine with this approach as no matter what we would do the previous
> approach just isn't fitting the new model that breaks the assumptions
> behind the previous model. I don't expect the change in the timing to be
> significant for anybody unless one did copy our code exactly somewhere
> (including rtt calculation). ...but if somebody has some objections in
> this, please speak up! After all, it will change how the sysctl behaves.
> 
> Perhaps we should mention this artificial timing change in ip-sysctl.txt
> too...?
> 
> It would be worth to do a trivial pull-the-plug testing comparing this
> and the previous approach in the rto < TCP_RTO_MIN region without any
> icmps to verify that this didn't change the timing a bit, afaict, it
> shouldn't (If you didn't do that already). Just to be on very sure grounds.

Hi,
thanks for reviewing all the patches.
Which region do you mean exactly? I mean, rto < TCP_RTO_MIN should be
impossible by definition, shouldn't it?
However, there are differences between new and old timing, which maybe
should be discussed. The new timeout values can deviate from the old 
approach by up to the value of the last actual RTO. Consider following:

retries2 is set to 15 and MIN_RTO/MAX_RTO are 0.2 and 120 seconds,
respectively. retrans_timed_out() calculates 924.6 seconds as timeout,
regardless of the actual RTO. Assume, that rtt measurement calculates
0.4 seconds as RTO before connectivity breaks.
An unpatched TCP - as well as patched TCP without ICMPs - will do the
standard backoff procedure and fire it's 15th retransmission after 924.4
seconds, i.e. 0.2 seconds before the calculated timeout of 924.6 seconds.

Then, patched TCP will wait for the next RTO, before checking again if the
timeout expired - and will finally time out 119.8 seconds later
than calculated. However, unpatched TCP will also wait until the 16th RTO
(924.4 + 120 seconds) and then time out. So in this case, the effective
timeouts for unpatched and patched TCP without ICMPs are the same, though
not intended by retransmits_timed_out().

On the other hand, when ICMPs will revert backoffs, patched TCP will
time out somewhere nearby the calculated value.

So one can say the following:
Since retransmits_timed_out() does always underestimate the effective timeout
of an unpatched TCP (it assumes initial RTO == MIN_RTO), the higher effective
timeout of the patched TCP (compared to the calculated) is not too bad, if
compared to the effective unpatched timeout, instead.
When consecutive RTOs remain small due to reverts, the effective timeout
will be closer to the calculated one.

I have made a simple test with a netem delay of 30ms at the sender,
and retries2 set to 6:
Unpatched TCP times out after ~29.7 seconds,
patched TCP without ICMPs also after ~29.7 seconds,
and patched TCP with ICMPs after ~25.6 seconds.

The same scenario but with netem delay of 150ms:
Unpatched TCP times out after ~46 seconds,
patched TCP without ICMPs after ~45 seconds,
and patched TCP with ICMPs after ~25.9 seconds.

Anyway, what is the procedure for further updates? As David has applied
the current patches, I assume that I should post a new series of two
subpatches; one concerning your suggestions and the second concerning
the ip-sysctl.txt update?

Best regards
 Damian


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/3] Revert Backoff [v3]: Calculate TCP's connection close threshold as a time value.
  2009-08-31 14:01 ` Eric Dumazet
  2009-08-31 14:04   ` Eric Dumazet
@ 2009-09-01 11:49   ` Damian Lukowski
  2009-09-01 12:25     ` Eric Dumazet
  1 sibling, 1 reply; 19+ messages in thread
From: Damian Lukowski @ 2009-09-01 11:49 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Netdev

Eric Dumazet schrieb:
> Damian Lukowski a écrit :
>> RFC 1122 specifies two threshold values R1 and R2 for connection timeouts,
>> which may represent a number of allowed retransmissions or a timeout value.
>> Currently linux uses sysctl_tcp_retries{1,2} to specify the thresholds
>> in number of allowed retransmissions.
>>
>> For any desired threshold R2 (by means of time) one can specify tcp_retries2
>> (by means of number of retransmissions) such that TCP will not time out
>> earlier than R2. This is the case, because the RTO schedule follows a fixed
>> pattern, namely exponential backoff.
>>
>> However, the RTO behaviour is not predictable any more if RTO backoffs can be
>> reverted, as it is the case in the draft
>> "Make TCP more Robust to Long Connectivity Disruptions"
>> (http://tools.ietf.org/html/draft-zimmermann-tcp-lcd).
>>
>> In the worst case TCP would time out a connection after 3.2 seconds, if the
>> initial RTO equaled MIN_RTO and each backoff has been reverted.
>>
>> This patch introduces a function retransmits_timed_out(N),
>> which calculates the timeout of a TCP connection, assuming an initial
>> RTO of MIN_RTO and N unsuccessful, exponentially backed-off retransmissions.
>>
>> Whenever timeout decisions are made by comparing the retransmission counter
>> to some value N, this function can be used, instead.
>>
>> The meaning of tcp_retries2 will be changed, as many more RTO retransmissions
>> can occur than the value indicates. However, it yields a timeout which is
>> similar to the one of an unpatched, exponentially backing off TCP in the same
>> scenario. As no application could rely on an RTO greater than MIN_RTO, there
>> should be no risk of a regression.
>>
>> Signed-off-by: Damian Lukowski <damian@tvk.rwth-aachen.de>
>> ---
>>  include/net/tcp.h    |   18 ++++++++++++++++++
>>  net/ipv4/tcp_timer.c |   11 +++++++----
>>  2 files changed, 25 insertions(+), 4 deletions(-)
>>
>> diff --git a/include/net/tcp.h b/include/net/tcp.h
>> index c35b329..17d1a88 100644
>> --- a/include/net/tcp.h
>> +++ b/include/net/tcp.h
>> @@ -1247,6 +1247,24 @@ static inline struct sk_buff *tcp_write_queue_prev(struct sock *sk, struct sk_bu
>>  #define tcp_for_write_queue_from_safe(skb, tmp, sk)			\
>>  	skb_queue_walk_from_safe(&(sk)->sk_write_queue, skb, tmp)
>>  
>> +static inline bool retransmits_timed_out(const struct sock *sk,
>> +					 unsigned int boundary)
>> +{
>> +	int limit, K;
>> +	if (!inet_csk(sk)->icsk_retransmits)
>> +		return false;
>> +
>> +	K = ilog2(TCP_RTO_MAX/TCP_RTO_MIN);
>> +
>> +	if (boundary <= K)
>> +		limit = ((2 << boundary) - 1) * TCP_RTO_MIN;
>> +	else
>> +		limit = ((2 << K) - 1) * TCP_RTO_MIN +
>> +			(boundary - K) * TCP_RTO_MAX;
> 
> Doing this computation might allow us to respect RFC 1122 here :
>  
> "The value of R2 SHOULD correspond to at least 100 seconds."
> 
> adding a third parameter to retransmits_timed_out(), min_limit,
> being 100*HZ if sysctl_tcp_retries2 was used...
> 
> limit = min(min_limit, limit);

Hi.
Hm, with this restriction, we would make it a MUST instead of a SHOULD.
The current approach does also allow retries2 values, which can yield
lower timeouts than 100 seconds.
I could implement the min_timeout, but in my opinion, the 100 seconds
shouldn't be enforced. We could make a patch later, which introduces a
lower limit to the sysctl, so the user gets feedback, if he tries to adjust
the limit below the recommended 100 seconds, or something like that.

Maybe the others would like to comment on this?

Regards
 Damian

>> +
>> +	return (tcp_time_stamp - tcp_sk(sk)->retrans_stamp) >= limit;
>> +}
>> +
>>  static inline struct sk_buff *tcp_send_head(struct sock *sk)
>>  {
>>  	return sk->sk_send_head;
>> diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
>> index a3ba494..2972d7b 100644
>> --- a/net/ipv4/tcp_timer.c
>> +++ b/net/ipv4/tcp_timer.c
>> @@ -137,13 +137,14 @@ static int tcp_write_timeout(struct sock *sk)
>>  {
>>  	struct inet_connection_sock *icsk = inet_csk(sk);
>>  	int retry_until;
>> +	bool do_reset;
>>  
>>  	if ((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV)) {
>>  		if (icsk->icsk_retransmits)
>>  			dst_negative_advice(&sk->sk_dst_cache);
>>  		retry_until = icsk->icsk_syn_retries ? : sysctl_tcp_syn_retries;
>>  	} else {
>> -		if (icsk->icsk_retransmits >= sysctl_tcp_retries1) {
>> +		if (retransmits_timed_out(sk, sysctl_tcp_retries1)) {
>>  			/* Black hole detection */
>>  			tcp_mtu_probing(icsk, sk);
>>  
>> @@ -155,13 +156,15 @@ static int tcp_write_timeout(struct sock *sk)
>>  			const int alive = (icsk->icsk_rto < TCP_RTO_MAX);
>>  
>>  			retry_until = tcp_orphan_retries(sk, alive);
>> +			do_reset = alive ||
>> +				   !retransmits_timed_out(sk, retry_until);
>>  
>> -			if (tcp_out_of_resources(sk, alive || icsk->icsk_retransmits < retry_until))
>> +			if (tcp_out_of_resources(sk, do_reset))
>>  				return 1;
>>  		}
>>  	}
>>  
>> -	if (icsk->icsk_retransmits >= retry_until) {
>> +	if (retransmits_timed_out(sk, retry_until)) {
>>  		/* Has it gone just too far? */
>>  		tcp_write_err(sk);
>>  		return 1;
>> @@ -385,7 +388,7 @@ void tcp_retransmit_timer(struct sock *sk)
>>  out_reset_timer:
>>  	icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX);
>>  	inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, icsk->icsk_rto, TCP_RTO_MAX);
>> -	if (icsk->icsk_retransmits > sysctl_tcp_retries1)
>> +	if (retransmits_timed_out(sk, sysctl_tcp_retries1 + 1))
>>  		__sk_dst_reset(sk);
>>  
>>  out:;


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/3] Revert Backoff [v3]: Calculate TCP's connection close threshold as a time value.
  2009-09-01 11:49   ` Damian Lukowski
@ 2009-09-01 12:25     ` Eric Dumazet
  2009-09-01 13:23       ` Ilpo Järvinen
  0 siblings, 1 reply; 19+ messages in thread
From: Eric Dumazet @ 2009-09-01 12:25 UTC (permalink / raw)
  To: Damian Lukowski; +Cc: Netdev

Damian Lukowski a écrit :
> Eric Dumazet schrieb:
>> Damian Lukowski a écrit :
>>> RFC 1122 specifies two threshold values R1 and R2 for connection timeouts,
>>> which may represent a number of allowed retransmissions or a timeout value.
>>> Currently linux uses sysctl_tcp_retries{1,2} to specify the thresholds
>>> in number of allowed retransmissions.
>>>
>>> For any desired threshold R2 (by means of time) one can specify tcp_retries2
>>> (by means of number of retransmissions) such that TCP will not time out
>>> earlier than R2. This is the case, because the RTO schedule follows a fixed
>>> pattern, namely exponential backoff.
>>>
>>> However, the RTO behaviour is not predictable any more if RTO backoffs can be
>>> reverted, as it is the case in the draft
>>> "Make TCP more Robust to Long Connectivity Disruptions"
>>> (http://tools.ietf.org/html/draft-zimmermann-tcp-lcd).
>>>
>>> In the worst case TCP would time out a connection after 3.2 seconds, if the
>>> initial RTO equaled MIN_RTO and each backoff has been reverted.
>>>
>>> This patch introduces a function retransmits_timed_out(N),
>>> which calculates the timeout of a TCP connection, assuming an initial
>>> RTO of MIN_RTO and N unsuccessful, exponentially backed-off retransmissions.
>>>
>>> Whenever timeout decisions are made by comparing the retransmission counter
>>> to some value N, this function can be used, instead.
>>>
>>> The meaning of tcp_retries2 will be changed, as many more RTO retransmissions
>>> can occur than the value indicates. However, it yields a timeout which is
>>> similar to the one of an unpatched, exponentially backing off TCP in the same
>>> scenario. As no application could rely on an RTO greater than MIN_RTO, there
>>> should be no risk of a regression.
>>>
>>> Signed-off-by: Damian Lukowski <damian@tvk.rwth-aachen.de>
>>> ---
>>>  include/net/tcp.h    |   18 ++++++++++++++++++
>>>  net/ipv4/tcp_timer.c |   11 +++++++----
>>>  2 files changed, 25 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/include/net/tcp.h b/include/net/tcp.h
>>> index c35b329..17d1a88 100644
>>> --- a/include/net/tcp.h
>>> +++ b/include/net/tcp.h
>>> @@ -1247,6 +1247,24 @@ static inline struct sk_buff *tcp_write_queue_prev(struct sock *sk, struct sk_bu
>>>  #define tcp_for_write_queue_from_safe(skb, tmp, sk)			\
>>>  	skb_queue_walk_from_safe(&(sk)->sk_write_queue, skb, tmp)
>>>  
>>> +static inline bool retransmits_timed_out(const struct sock *sk,
>>> +					 unsigned int boundary)
>>> +{
>>> +	int limit, K;
>>> +	if (!inet_csk(sk)->icsk_retransmits)
>>> +		return false;
>>> +
>>> +	K = ilog2(TCP_RTO_MAX/TCP_RTO_MIN);
>>> +
>>> +	if (boundary <= K)
>>> +		limit = ((2 << boundary) - 1) * TCP_RTO_MIN;
>>> +	else
>>> +		limit = ((2 << K) - 1) * TCP_RTO_MIN +
>>> +			(boundary - K) * TCP_RTO_MAX;
>> Doing this computation might allow us to respect RFC 1122 here :
>>  
>> "The value of R2 SHOULD correspond to at least 100 seconds."
>>
>> adding a third parameter to retransmits_timed_out(), min_limit,
>> being 100*HZ if sysctl_tcp_retries2 was used...
>>
>> limit = min(min_limit, limit);
> 
> Hi.
> Hm, with this restriction, we would make it a MUST instead of a SHOULD.
> The current approach does also allow retries2 values, which can yield
> lower timeouts than 100 seconds.
> I could implement the min_timeout, but in my opinion, the 100 seconds
> shouldn't be enforced. We could make a patch later, which introduces a
> lower limit to the sysctl, so the user gets feedback, if he tries to adjust
> the limit below the recommended 100 seconds, or something like that.
> 

Fair enough, this 100 seconds limit is only a hint, not an enforcement.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/3] Revert Backoff [v3]: Calculate TCP's connection close threshold as a time value.
  2009-09-01 12:25     ` Eric Dumazet
@ 2009-09-01 13:23       ` Ilpo Järvinen
  0 siblings, 0 replies; 19+ messages in thread
From: Ilpo Järvinen @ 2009-09-01 13:23 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Damian Lukowski, Netdev

[-- Attachment #1: Type: TEXT/PLAIN, Size: 4264 bytes --]

On Tue, 1 Sep 2009, Eric Dumazet wrote:

> Damian Lukowski a écrit :
>> Eric Dumazet schrieb:
>>> Damian Lukowski a écrit :
>>>> RFC 1122 specifies two threshold values R1 and R2 for connection timeouts,
>>>> which may represent a number of allowed retransmissions or a timeout value.
>>>> Currently linux uses sysctl_tcp_retries{1,2} to specify the thresholds
>>>> in number of allowed retransmissions.
>>>>
>>>> For any desired threshold R2 (by means of time) one can specify tcp_retries2
>>>> (by means of number of retransmissions) such that TCP will not time out
>>>> earlier than R2. This is the case, because the RTO schedule follows a fixed
>>>> pattern, namely exponential backoff.
>>>>
>>>> However, the RTO behaviour is not predictable any more if RTO backoffs can be
>>>> reverted, as it is the case in the draft
>>>> "Make TCP more Robust to Long Connectivity Disruptions"
>>>> (http://tools.ietf.org/html/draft-zimmermann-tcp-lcd).
>>>>
>>>> In the worst case TCP would time out a connection after 3.2 seconds, if the
>>>> initial RTO equaled MIN_RTO and each backoff has been reverted.
>>>>
>>>> This patch introduces a function retransmits_timed_out(N),
>>>> which calculates the timeout of a TCP connection, assuming an initial
>>>> RTO of MIN_RTO and N unsuccessful, exponentially backed-off retransmissions.
>>>>
>>>> Whenever timeout decisions are made by comparing the retransmission counter
>>>> to some value N, this function can be used, instead.
>>>>
>>>> The meaning of tcp_retries2 will be changed, as many more RTO retransmissions
>>>> can occur than the value indicates. However, it yields a timeout which is
>>>> similar to the one of an unpatched, exponentially backing off TCP in the same
>>>> scenario. As no application could rely on an RTO greater than MIN_RTO, there
>>>> should be no risk of a regression.
>>>>
>>>> Signed-off-by: Damian Lukowski <damian@tvk.rwth-aachen.de>
>>>> ---
>>>>  include/net/tcp.h    |   18 ++++++++++++++++++
>>>>  net/ipv4/tcp_timer.c |   11 +++++++----
>>>>  2 files changed, 25 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/include/net/tcp.h b/include/net/tcp.h
>>>> index c35b329..17d1a88 100644
>>>> --- a/include/net/tcp.h
>>>> +++ b/include/net/tcp.h
>>>> @@ -1247,6 +1247,24 @@ static inline struct sk_buff *tcp_write_queue_prev(struct sock *sk, struct sk_bu
>>>>  #define tcp_for_write_queue_from_safe(skb, tmp, sk)			\
>>>>  	skb_queue_walk_from_safe(&(sk)->sk_write_queue, skb, tmp)
>>>>
>>>> +static inline bool retransmits_timed_out(const struct sock *sk,
>>>> +					 unsigned int boundary)
>>>> +{
>>>> +	int limit, K;
>>>> +	if (!inet_csk(sk)->icsk_retransmits)
>>>> +		return false;
>>>> +
>>>> +	K = ilog2(TCP_RTO_MAX/TCP_RTO_MIN);
>>>> +
>>>> +	if (boundary <= K)
>>>> +		limit = ((2 << boundary) - 1) * TCP_RTO_MIN;
>>>> +	else
>>>> +		limit = ((2 << K) - 1) * TCP_RTO_MIN +
>>>> +			(boundary - K) * TCP_RTO_MAX;
>>> Doing this computation might allow us to respect RFC 1122 here :
>>>
>>> "The value of R2 SHOULD correspond to at least 100 seconds."
>>>
>>> adding a third parameter to retransmits_timed_out(), min_limit,
>>> being 100*HZ if sysctl_tcp_retries2 was used...
>>>
>>> limit = min(min_limit, limit);
>>
>> Hi.
>> Hm, with this restriction, we would make it a MUST instead of a SHOULD.
>> The current approach does also allow retries2 values, which can yield
>> lower timeouts than 100 seconds.
>> I could implement the min_timeout, but in my opinion, the 100 seconds
>> shouldn't be enforced. We could make a patch later, which introduces a
>> lower limit to the sysctl, so the user gets feedback, if he tries to adjust
>> the limit below the recommended 100 seconds, or something like that.
>>
>
> Fair enough, this 100 seconds limit is only a hint, not an enforcement.

Please excuse my curiousness. Were you aiming for some particular case 
with this thought of a lower bound? ...To me it seems quite a small 
difference. I can understand that one can construct a scenario with small 
retries and heterogenous rtts but how significant that is in practice 
whether we timeout 100s or slightly earlier the small rtt'ed ones (as the 
enforced limits are anyway quite strict I doubt we really would care that 
much on those hanging around for long).

-- 
  i.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/3] Revert Backoff [v3]: Calculate TCP's connection close threshold as a time value.
  2009-09-01 11:28   ` Damian Lukowski
@ 2009-09-01 13:41     ` Ilpo Järvinen
  0 siblings, 0 replies; 19+ messages in thread
From: Ilpo Järvinen @ 2009-09-01 13:41 UTC (permalink / raw)
  To: Damian Lukowski; +Cc: Netdev, David Miller

On Tue, 1 Sep 2009, Damian Lukowski wrote:

>> I'm fine with this approach as no matter what we would do the previous
>> approach just isn't fitting the new model that breaks the assumptions
>> behind the previous model. I don't expect the change in the timing to be
>> significant for anybody unless one did copy our code exactly somewhere
>> (including rtt calculation). ...but if somebody has some objections in
>> this, please speak up! After all, it will change how the sysctl behaves.
>>
>> Perhaps we should mention this artificial timing change in ip-sysctl.txt
>> too...?
>>
>> It would be worth to do a trivial pull-the-plug testing comparing this
>> and the previous approach in the rto < TCP_RTO_MIN region without any
>> icmps to verify that this didn't change the timing a bit, afaict, it
>> shouldn't (If you didn't do that already). Just to be on very sure grounds.
>
> thanks for reviewing all the patches.
> Which region do you mean exactly? I mean, rto < TCP_RTO_MIN should be
> impossible by definition, shouldn't it?

Right, I was careless in the wording. I meant with a low rtt'ed flow which 
lower bounds rto to TCP_RTO_MIN.

> However, there are differences between new and old timing, which maybe
> should be discussed. The new timeout values can deviate from the old
> approach by up to the value of the last actual RTO. Consider following:
>
> retries2 is set to 15 and MIN_RTO/MAX_RTO are 0.2 and 120 seconds,
> respectively. retrans_timed_out() calculates 924.6 seconds as timeout,
> regardless of the actual RTO.

My point was to take a connection which has rtt below the lower bound and 
this the actual RTO should be at TCP_RTO_MIN.

> Assume, that rtt measurement calculates
> 0.4 seconds as RTO before connectivity breaks.
> An unpatched TCP - as well as patched TCP without ICMPs - will do the
> standard backoff procedure and fire it's 15th retransmission after 924.4
> seconds, i.e. 0.2 seconds before the calculated timeout of 924.6 seconds.
>
> Then, patched TCP will wait for the next RTO, before checking again if the
> timeout expired - and will finally time out 119.8 seconds later
> than calculated. However, unpatched TCP will also wait until the 16th RTO
> (924.4 + 120 seconds) and then time out. So in this case, the effective
> timeouts for unpatched and patched TCP without ICMPs are the same, though
> not intended by retransmits_timed_out().

If RTO value was > MIN_RTO to begin with there will be the obvious 
difference.

> I have made a simple test with a netem delay of 30ms at the sender,
> and retries2 set to 6:
> Unpatched TCP times out after ~29.7 seconds,
> patched TCP without ICMPs also after ~29.7 seconds,

Thanks for your efforts in this, these times I was mostly interested in. 
Seems very good.

> Anyway, what is the procedure for further updates?
>
> As David has applied the current patches, I assume that I should post a 
> new series of two subpatches; one concerning your suggestions and the 
> second concerning the ip-sysctl.txt update?

Your guessed right, once applied no updates to the previous version but 
you take what Dave currently has as base (in net-next) and build on top of 
that.


-- 
  i.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/3] Revert Backoff [v3]: Calculate TCP's connection close threshold as a time value.
       [not found] ` <CAFbMe2Mu46N8kRrYkGzYRLuntEd2J9aO_d0Jw_y0gsQA4q-qHw@mail.gmail.com>
@ 2012-06-01 22:58   ` Jerry Chu
  2012-06-04 17:50     ` Damian Lukowski
  0 siblings, 1 reply; 19+ messages in thread
From: Jerry Chu @ 2012-06-01 22:58 UTC (permalink / raw)
  To: Damian Lukowski; +Cc: Netdev, David Miller, Ilpo Järvinen

> From: Damian Lukowski <damian@tvk.rwth-aachen.de>
> Date: Wed, Aug 26, 2009 at 3:16 AM
> Subject: [PATCH 3/3] Revert Backoff [v3]: Calculate TCP's connection close
> threshold as a time value.
> To: Netdev <netdev@vger.kernel.org>
>
>
> RFC 1122 specifies two threshold values R1 and R2 for connection timeouts,
> which may represent a number of allowed retransmissions or a timeout value.
> Currently linux uses sysctl_tcp_retries{1,2} to specify the thresholds
> in number of allowed retransmissions.
>
> For any desired threshold R2 (by means of time) one can specify tcp_retries2
> (by means of number of retransmissions) such that TCP will not time out
> earlier than R2. This is the case, because the RTO schedule follows a fixed
> pattern, namely exponential backoff.
>
> However, the RTO behaviour is not predictable any more if RTO backoffs can
> be
> reverted, as it is the case in the draft
> "Make TCP more Robust to Long Connectivity Disruptions"
> (http://tools.ietf.org/html/draft-zimmermann-tcp-lcd).
>
> In the worst case TCP would time out a connection after 3.2 seconds, if the
> initial RTO equaled MIN_RTO and each backoff has been reverted.
>
> This patch introduces a function retransmits_timed_out(N),
> which calculates the timeout of a TCP connection, assuming an initial
> RTO of MIN_RTO and N unsuccessful, exponentially backed-off retransmissions.
>
> Whenever timeout decisions are made by comparing the retransmission counter
> to some value N, this function can be used, instead.
>
> The meaning of tcp_retries2 will be changed, as many more RTO
> retransmissions
> can occur than the value indicates. However, it yields a timeout which is
> similar to the one of an unpatched, exponentially backing off TCP in the
> same
> scenario. As no application could rely on an RTO greater than MIN_RTO, there
> should be no risk of a regression.

This looks like a typical "fix one problem, introducing a few more" patch :(.
What do you mean by "no application could rely on an RTO greater than
MIN_RTO..."
above? How can you make the assumption that RTO is not too far off
from TCP_RTO_MIN?

While you tried to address a problem where the retransmission count
was high but the actual
timeout duration was too short, have you considered the other case
around, i.e., the timeout
duration is long but the retransmission count is too short? This is
exactly what's happening
to us with your patch. We've much reduced TCP_RTO_MIN for our internal
traffic, but not
noticing your change has severely shortened the R1 & R2 recommended by
RFC1122 for our
long haul traffic until now. In many cases R1 threshold was met upon
the first retrans timeout.

I think retransmits_timed_out() should check against both time
duration and retrans count
(icsk_retransmits).

Thought?

Jerry

>
> Signed-off-by: Damian Lukowski <damian@tvk.rwth-aachen.de>
> ---
>  include/net/tcp.h    |   18 ++++++++++++++++++
>  net/ipv4/tcp_timer.c |   11 +++++++----
>  2 files changed, 25 insertions(+), 4 deletions(-)
>
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index c35b329..17d1a88 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -1247,6 +1247,24 @@ static inline struct sk_buff
> *tcp_write_queue_prev(struct sock *sk, struct sk_bu
>  #define tcp_for_write_queue_from_safe(skb, tmp, sk)                    \
>        skb_queue_walk_from_safe(&(sk)->sk_write_queue, skb, tmp)
>
> +static inline bool retransmits_timed_out(const struct sock *sk,
> +                                        unsigned int boundary)
> +{
> +       int limit, K;
> +       if (!inet_csk(sk)->icsk_retransmits)
> +               return false;
> +
> +       K = ilog2(TCP_RTO_MAX/TCP_RTO_MIN);
> +
> +       if (boundary <= K)
> +               limit = ((2 << boundary) - 1) * TCP_RTO_MIN;
> +       else
> +               limit = ((2 << K) - 1) * TCP_RTO_MIN +
> +                       (boundary - K) * TCP_RTO_MAX;
> +
> +       return (tcp_time_stamp - tcp_sk(sk)->retrans_stamp) >= limit;
> +}
> +
>  static inline struct sk_buff *tcp_send_head(struct sock *sk)
>  {
>        return sk->sk_send_head;
> diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
> index a3ba494..2972d7b 100644
> --- a/net/ipv4/tcp_timer.c
> +++ b/net/ipv4/tcp_timer.c
> @@ -137,13 +137,14 @@ static int tcp_write_timeout(struct sock *sk)
>  {
>        struct inet_connection_sock *icsk = inet_csk(sk);
>        int retry_until;
> +       bool do_reset;
>
>        if ((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV)) {
>                if (icsk->icsk_retransmits)
>                        dst_negative_advice(&sk->sk_dst_cache);
>                retry_until = icsk->icsk_syn_retries ? :
> sysctl_tcp_syn_retries;
>        } else {
> -               if (icsk->icsk_retransmits >= sysctl_tcp_retries1) {
> +               if (retransmits_timed_out(sk, sysctl_tcp_retries1)) {
>                        /* Black hole detection */
>                        tcp_mtu_probing(icsk, sk);
>
> @@ -155,13 +156,15 @@ static int tcp_write_timeout(struct sock *sk)
>                        const int alive = (icsk->icsk_rto < TCP_RTO_MAX);
>
>                        retry_until = tcp_orphan_retries(sk, alive);
> +                       do_reset = alive ||
> +                                  !retransmits_timed_out(sk, retry_until);
>
> -                       if (tcp_out_of_resources(sk, alive ||
> icsk->icsk_retransmits < retry_until))
> +                       if (tcp_out_of_resources(sk, do_reset))
>                                return 1;
>                }
>        }
>
> -       if (icsk->icsk_retransmits >= retry_until) {
> +       if (retransmits_timed_out(sk, retry_until)) {
>                /* Has it gone just too far? */
>                tcp_write_err(sk);
>                return 1;
> @@ -385,7 +388,7 @@ void tcp_retransmit_timer(struct sock *sk)
>  out_reset_timer:
>        icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX);
>        inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, icsk->icsk_rto,
> TCP_RTO_MAX);
> -       if (icsk->icsk_retransmits > sysctl_tcp_retries1)
> +       if (retransmits_timed_out(sk, sysctl_tcp_retries1 + 1))
>                __sk_dst_reset(sk);
>
>  out:;
> --
> 1.6.3.3
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/3] Revert Backoff [v3]: Calculate TCP's connection close threshold as a time value.
  2012-06-01 22:58   ` Jerry Chu
@ 2012-06-04 17:50     ` Damian Lukowski
  2012-06-04 19:18       ` Ilpo Järvinen
  2012-06-04 23:50       ` Jerry Chu
  0 siblings, 2 replies; 19+ messages in thread
From: Damian Lukowski @ 2012-06-04 17:50 UTC (permalink / raw)
  To: Jerry Chu; +Cc: Netdev, David Miller, Ilpo Järvinen

Hi Jerry,

please verify, I understood you correctly.

You have set TCP_RTO_MIN to a lower value, e.g. 0.002 seconds to improve
your internal low-latency traffic. Because of the improvement, R1
timeouts are triggered too fast for external high-RTT traffic. Is that
correct?
If so, may I suggest to set tcp_retries1 to a higher value? For
TCP_RTO_MIN == 0.002 and tcp_retries1 ==  10, R1 will be calculated to
approximately 4 seconds.

Is that ok?

Best regards
 Damian

Am Freitag, den 01.06.2012, 15:58 -0700 schrieb Jerry Chu:
> > From: Damian Lukowski <damian@tvk.rwth-aachen.de>
> > Date: Wed, Aug 26, 2009 at 3:16 AM
> > Subject: [PATCH 3/3] Revert Backoff [v3]: Calculate TCP's connection close
> > threshold as a time value.
> > To: Netdev <netdev@vger.kernel.org>
> >
> >
> > RFC 1122 specifies two threshold values R1 and R2 for connection timeouts,
> > which may represent a number of allowed retransmissions or a timeout value.
> > Currently linux uses sysctl_tcp_retries{1,2} to specify the thresholds
> > in number of allowed retransmissions.
> >
> > For any desired threshold R2 (by means of time) one can specify tcp_retries2
> > (by means of number of retransmissions) such that TCP will not time out
> > earlier than R2. This is the case, because the RTO schedule follows a fixed
> > pattern, namely exponential backoff.
> >
> > However, the RTO behaviour is not predictable any more if RTO backoffs can
> > be
> > reverted, as it is the case in the draft
> > "Make TCP more Robust to Long Connectivity Disruptions"
> > (http://tools.ietf.org/html/draft-zimmermann-tcp-lcd).
> >
> > In the worst case TCP would time out a connection after 3.2 seconds, if the
> > initial RTO equaled MIN_RTO and each backoff has been reverted.
> >
> > This patch introduces a function retransmits_timed_out(N),
> > which calculates the timeout of a TCP connection, assuming an initial
> > RTO of MIN_RTO and N unsuccessful, exponentially backed-off retransmissions.
> >
> > Whenever timeout decisions are made by comparing the retransmission counter
> > to some value N, this function can be used, instead.
> >
> > The meaning of tcp_retries2 will be changed, as many more RTO
> > retransmissions
> > can occur than the value indicates. However, it yields a timeout which is
> > similar to the one of an unpatched, exponentially backing off TCP in the
> > same
> > scenario. As no application could rely on an RTO greater than MIN_RTO, there
> > should be no risk of a regression.
> 
> This looks like a typical "fix one problem, introducing a few more" patch :(.
> What do you mean by "no application could rely on an RTO greater than
> MIN_RTO..."
> above? How can you make the assumption that RTO is not too far off
> from TCP_RTO_MIN?
> 
> While you tried to address a problem where the retransmission count
> was high but the actual
> timeout duration was too short, have you considered the other case
> around, i.e., the timeout
> duration is long but the retransmission count is too short? This is
> exactly what's happening
> to us with your patch. We've much reduced TCP_RTO_MIN for our internal
> traffic, but not
> noticing your change has severely shortened the R1 & R2 recommended by
> RFC1122 for our
> long haul traffic until now. In many cases R1 threshold was met upon
> the first retrans timeout.
> 
> I think retransmits_timed_out() should check against both time
> duration and retrans count
> (icsk_retransmits).
> 
> Thought?
> 
> Jerry
> 
> >
> > Signed-off-by: Damian Lukowski <damian@tvk.rwth-aachen.de>
> > ---
> >  include/net/tcp.h    |   18 ++++++++++++++++++
> >  net/ipv4/tcp_timer.c |   11 +++++++----
> >  2 files changed, 25 insertions(+), 4 deletions(-)
> >
> > diff --git a/include/net/tcp.h b/include/net/tcp.h
> > index c35b329..17d1a88 100644
> > --- a/include/net/tcp.h
> > +++ b/include/net/tcp.h
> > @@ -1247,6 +1247,24 @@ static inline struct sk_buff
> > *tcp_write_queue_prev(struct sock *sk, struct sk_bu
> >  #define tcp_for_write_queue_from_safe(skb, tmp, sk)                    \
> >        skb_queue_walk_from_safe(&(sk)->sk_write_queue, skb, tmp)
> >
> > +static inline bool retransmits_timed_out(const struct sock *sk,
> > +                                        unsigned int boundary)
> > +{
> > +       int limit, K;
> > +       if (!inet_csk(sk)->icsk_retransmits)
> > +               return false;
> > +
> > +       K = ilog2(TCP_RTO_MAX/TCP_RTO_MIN);
> > +
> > +       if (boundary <= K)
> > +               limit = ((2 << boundary) - 1) * TCP_RTO_MIN;
> > +       else
> > +               limit = ((2 << K) - 1) * TCP_RTO_MIN +
> > +                       (boundary - K) * TCP_RTO_MAX;
> > +
> > +       return (tcp_time_stamp - tcp_sk(sk)->retrans_stamp) >= limit;
> > +}
> > +
> >  static inline struct sk_buff *tcp_send_head(struct sock *sk)
> >  {
> >        return sk->sk_send_head;
> > diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
> > index a3ba494..2972d7b 100644
> > --- a/net/ipv4/tcp_timer.c
> > +++ b/net/ipv4/tcp_timer.c
> > @@ -137,13 +137,14 @@ static int tcp_write_timeout(struct sock *sk)
> >  {
> >        struct inet_connection_sock *icsk = inet_csk(sk);
> >        int retry_until;
> > +       bool do_reset;
> >
> >        if ((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV)) {
> >                if (icsk->icsk_retransmits)
> >                        dst_negative_advice(&sk->sk_dst_cache);
> >                retry_until = icsk->icsk_syn_retries ? :
> > sysctl_tcp_syn_retries;
> >        } else {
> > -               if (icsk->icsk_retransmits >= sysctl_tcp_retries1) {
> > +               if (retransmits_timed_out(sk, sysctl_tcp_retries1)) {
> >                        /* Black hole detection */
> >                        tcp_mtu_probing(icsk, sk);
> >
> > @@ -155,13 +156,15 @@ static int tcp_write_timeout(struct sock *sk)
> >                        const int alive = (icsk->icsk_rto < TCP_RTO_MAX);
> >
> >                        retry_until = tcp_orphan_retries(sk, alive);
> > +                       do_reset = alive ||
> > +                                  !retransmits_timed_out(sk, retry_until);
> >
> > -                       if (tcp_out_of_resources(sk, alive ||
> > icsk->icsk_retransmits < retry_until))
> > +                       if (tcp_out_of_resources(sk, do_reset))
> >                                return 1;
> >                }
> >        }
> >
> > -       if (icsk->icsk_retransmits >= retry_until) {
> > +       if (retransmits_timed_out(sk, retry_until)) {
> >                /* Has it gone just too far? */
> >                tcp_write_err(sk);
> >                return 1;
> > @@ -385,7 +388,7 @@ void tcp_retransmit_timer(struct sock *sk)
> >  out_reset_timer:
> >        icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX);
> >        inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, icsk->icsk_rto,
> > TCP_RTO_MAX);
> > -       if (icsk->icsk_retransmits > sysctl_tcp_retries1)
> > +       if (retransmits_timed_out(sk, sysctl_tcp_retries1 + 1))
> >                __sk_dst_reset(sk);
> >
> >  out:;
> > --
> > 1.6.3.3
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe netdev" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/3] Revert Backoff [v3]: Calculate TCP's connection close threshold as a time value.
  2012-06-04 17:50     ` Damian Lukowski
@ 2012-06-04 19:18       ` Ilpo Järvinen
  2012-06-04 23:57         ` Jerry Chu
  2012-06-04 23:50       ` Jerry Chu
  1 sibling, 1 reply; 19+ messages in thread
From: Ilpo Järvinen @ 2012-06-04 19:18 UTC (permalink / raw)
  To: Damian Lukowski; +Cc: Jerry Chu, Netdev, David Miller

On Mon, 4 Jun 2012, Damian Lukowski wrote:

> please verify, I understood you correctly.
> 
> You have set TCP_RTO_MIN to a lower value, e.g. 0.002 seconds to improve
> your internal low-latency traffic. Because of the improvement, R1
> timeouts are triggered too fast for external high-RTT traffic. Is that
> correct?
> If so, may I suggest to set tcp_retries1 to a higher value? For
> TCP_RTO_MIN == 0.002 and tcp_retries1 ==  10, R1 will be calculated to
> approximately 4 seconds.
> 
> Is that ok?

I suppose what he meant is that you could have e.g., 60sec RTT and with 
small enough retries the timeout calculation yields to some timeout 
smaller than 60 secs, and therefore no retransmissions are made which is 
certainly not a desirable property? ...This is valid issue even if no min 
rto tweaking was done but can of course get much worse if min rto is 
shorter.

I agree with his proposed solution:

> > I think retransmits_timed_out() should check against both time
> > duration and retrans count (icsk_retransmits).

...that is, use both pseudo timeout check of the current code and the 
previously used icsk_retransmits compare at the same time.

-- 
 i.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/3] Revert Backoff [v3]: Calculate TCP's connection close threshold as a time value.
  2012-06-04 17:50     ` Damian Lukowski
  2012-06-04 19:18       ` Ilpo Järvinen
@ 2012-06-04 23:50       ` Jerry Chu
  2012-06-05 17:42         ` Jerry Chu
  1 sibling, 1 reply; 19+ messages in thread
From: Jerry Chu @ 2012-06-04 23:50 UTC (permalink / raw)
  To: Damian Lukowski; +Cc: Netdev, David Miller, Ilpo Järvinen

Hi Damian,

On Mon, Jun 4, 2012 at 10:50 AM, Damian Lukowski
<damian@tvk.rwth-aachen.de> wrote:
> Hi Jerry,
>
> please verify, I understood you correctly.
>
> You have set TCP_RTO_MIN to a lower value, e.g. 0.002 seconds to improve
> your internal low-latency traffic. Because of the improvement, R1
> timeouts are triggered too fast for external high-RTT traffic. Is that
> correct?

Correct.

> If so, may I suggest to set tcp_retries1 to a higher value? For
> TCP_RTO_MIN == 0.002 and tcp_retries1 ==  10, R1 will be calculated to
> approximately 4 seconds.

I think hacking tcp_retries1 is the wrong solution. E.g., 10 retries may be too
generous for those short RTT flows.

I think the fundamental problem is - the ideal fix for your original RTO revert
problem should've used the per-flow RTO to compute R1 & R2. But that
computation may be too expensive so you used TCP_RTO_MIN as an
approximation - not a good idea IMHO!

The easiest solution I can see so far is to replace the check

if (!inet_csk(sk)->icsk_retransmits)
                return false;

at the beginning of retransmits_timed_out() with

if (inet_csk(sk)->icsk_retransmits < boundary)
                return false;

Best,

Jerry

>
> Is that ok?
>
> Best regards
>  Damian
>
> Am Freitag, den 01.06.2012, 15:58 -0700 schrieb Jerry Chu:
>> > From: Damian Lukowski <damian@tvk.rwth-aachen.de>
>> > Date: Wed, Aug 26, 2009 at 3:16 AM
>> > Subject: [PATCH 3/3] Revert Backoff [v3]: Calculate TCP's connection close
>> > threshold as a time value.
>> > To: Netdev <netdev@vger.kernel.org>
>> >
>> >
>> > RFC 1122 specifies two threshold values R1 and R2 for connection timeouts,
>> > which may represent a number of allowed retransmissions or a timeout value.
>> > Currently linux uses sysctl_tcp_retries{1,2} to specify the thresholds
>> > in number of allowed retransmissions.
>> >
>> > For any desired threshold R2 (by means of time) one can specify tcp_retries2
>> > (by means of number of retransmissions) such that TCP will not time out
>> > earlier than R2. This is the case, because the RTO schedule follows a fixed
>> > pattern, namely exponential backoff.
>> >
>> > However, the RTO behaviour is not predictable any more if RTO backoffs can
>> > be
>> > reverted, as it is the case in the draft
>> > "Make TCP more Robust to Long Connectivity Disruptions"
>> > (http://tools.ietf.org/html/draft-zimmermann-tcp-lcd).
>> >
>> > In the worst case TCP would time out a connection after 3.2 seconds, if the
>> > initial RTO equaled MIN_RTO and each backoff has been reverted.
>> >
>> > This patch introduces a function retransmits_timed_out(N),
>> > which calculates the timeout of a TCP connection, assuming an initial
>> > RTO of MIN_RTO and N unsuccessful, exponentially backed-off retransmissions.
>> >
>> > Whenever timeout decisions are made by comparing the retransmission counter
>> > to some value N, this function can be used, instead.
>> >
>> > The meaning of tcp_retries2 will be changed, as many more RTO
>> > retransmissions
>> > can occur than the value indicates. However, it yields a timeout which is
>> > similar to the one of an unpatched, exponentially backing off TCP in the
>> > same
>> > scenario. As no application could rely on an RTO greater than MIN_RTO, there
>> > should be no risk of a regression.
>>
>> This looks like a typical "fix one problem, introducing a few more" patch :(.
>> What do you mean by "no application could rely on an RTO greater than
>> MIN_RTO..."
>> above? How can you make the assumption that RTO is not too far off
>> from TCP_RTO_MIN?
>>
>> While you tried to address a problem where the retransmission count
>> was high but the actual
>> timeout duration was too short, have you considered the other case
>> around, i.e., the timeout
>> duration is long but the retransmission count is too short? This is
>> exactly what's happening
>> to us with your patch. We've much reduced TCP_RTO_MIN for our internal
>> traffic, but not
>> noticing your change has severely shortened the R1 & R2 recommended by
>> RFC1122 for our
>> long haul traffic until now. In many cases R1 threshold was met upon
>> the first retrans timeout.
>>
>> I think retransmits_timed_out() should check against both time
>> duration and retrans count
>> (icsk_retransmits).
>>
>> Thought?
>>
>> Jerry
>>
>> >
>> > Signed-off-by: Damian Lukowski <damian@tvk.rwth-aachen.de>
>> > ---
>> >  include/net/tcp.h    |   18 ++++++++++++++++++
>> >  net/ipv4/tcp_timer.c |   11 +++++++----
>> >  2 files changed, 25 insertions(+), 4 deletions(-)
>> >
>> > diff --git a/include/net/tcp.h b/include/net/tcp.h
>> > index c35b329..17d1a88 100644
>> > --- a/include/net/tcp.h
>> > +++ b/include/net/tcp.h
>> > @@ -1247,6 +1247,24 @@ static inline struct sk_buff
>> > *tcp_write_queue_prev(struct sock *sk, struct sk_bu
>> >  #define tcp_for_write_queue_from_safe(skb, tmp, sk)                    \
>> >        skb_queue_walk_from_safe(&(sk)->sk_write_queue, skb, tmp)
>> >
>> > +static inline bool retransmits_timed_out(const struct sock *sk,
>> > +                                        unsigned int boundary)
>> > +{
>> > +       int limit, K;
>> > +       if (!inet_csk(sk)->icsk_retransmits)
>> > +               return false;
>> > +
>> > +       K = ilog2(TCP_RTO_MAX/TCP_RTO_MIN);
>> > +
>> > +       if (boundary <= K)
>> > +               limit = ((2 << boundary) - 1) * TCP_RTO_MIN;
>> > +       else
>> > +               limit = ((2 << K) - 1) * TCP_RTO_MIN +
>> > +                       (boundary - K) * TCP_RTO_MAX;
>> > +
>> > +       return (tcp_time_stamp - tcp_sk(sk)->retrans_stamp) >= limit;
>> > +}
>> > +
>> >  static inline struct sk_buff *tcp_send_head(struct sock *sk)
>> >  {
>> >        return sk->sk_send_head;
>> > diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
>> > index a3ba494..2972d7b 100644
>> > --- a/net/ipv4/tcp_timer.c
>> > +++ b/net/ipv4/tcp_timer.c
>> > @@ -137,13 +137,14 @@ static int tcp_write_timeout(struct sock *sk)
>> >  {
>> >        struct inet_connection_sock *icsk = inet_csk(sk);
>> >        int retry_until;
>> > +       bool do_reset;
>> >
>> >        if ((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV)) {
>> >                if (icsk->icsk_retransmits)
>> >                        dst_negative_advice(&sk->sk_dst_cache);
>> >                retry_until = icsk->icsk_syn_retries ? :
>> > sysctl_tcp_syn_retries;
>> >        } else {
>> > -               if (icsk->icsk_retransmits >= sysctl_tcp_retries1) {
>> > +               if (retransmits_timed_out(sk, sysctl_tcp_retries1)) {
>> >                        /* Black hole detection */
>> >                        tcp_mtu_probing(icsk, sk);
>> >
>> > @@ -155,13 +156,15 @@ static int tcp_write_timeout(struct sock *sk)
>> >                        const int alive = (icsk->icsk_rto < TCP_RTO_MAX);
>> >
>> >                        retry_until = tcp_orphan_retries(sk, alive);
>> > +                       do_reset = alive ||
>> > +                                  !retransmits_timed_out(sk, retry_until);
>> >
>> > -                       if (tcp_out_of_resources(sk, alive ||
>> > icsk->icsk_retransmits < retry_until))
>> > +                       if (tcp_out_of_resources(sk, do_reset))
>> >                                return 1;
>> >                }
>> >        }
>> >
>> > -       if (icsk->icsk_retransmits >= retry_until) {
>> > +       if (retransmits_timed_out(sk, retry_until)) {
>> >                /* Has it gone just too far? */
>> >                tcp_write_err(sk);
>> >                return 1;
>> > @@ -385,7 +388,7 @@ void tcp_retransmit_timer(struct sock *sk)
>> >  out_reset_timer:
>> >        icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX);
>> >        inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, icsk->icsk_rto,
>> > TCP_RTO_MAX);
>> > -       if (icsk->icsk_retransmits > sysctl_tcp_retries1)
>> > +       if (retransmits_timed_out(sk, sysctl_tcp_retries1 + 1))
>> >                __sk_dst_reset(sk);
>> >
>> >  out:;
>> > --
>> > 1.6.3.3
>> >
>> > --
>> > To unsubscribe from this list: send the line "unsubscribe netdev" in
>> > the body of a message to majordomo@vger.kernel.org
>> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> >
>
>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/3] Revert Backoff [v3]: Calculate TCP's connection close threshold as a time value.
  2012-06-04 19:18       ` Ilpo Järvinen
@ 2012-06-04 23:57         ` Jerry Chu
  0 siblings, 0 replies; 19+ messages in thread
From: Jerry Chu @ 2012-06-04 23:57 UTC (permalink / raw)
  To: Ilpo Järvinen; +Cc: Damian Lukowski, Netdev, David Miller

On Mon, Jun 4, 2012 at 12:18 PM, Ilpo Järvinen
<ilpo.jarvinen@helsinki.fi> wrote:
> On Mon, 4 Jun 2012, Damian Lukowski wrote:
>
>> please verify, I understood you correctly.
>>
>> You have set TCP_RTO_MIN to a lower value, e.g. 0.002 seconds to improve
>> your internal low-latency traffic. Because of the improvement, R1
>> timeouts are triggered too fast for external high-RTT traffic. Is that
>> correct?
>> If so, may I suggest to set tcp_retries1 to a higher value? For
>> TCP_RTO_MIN == 0.002 and tcp_retries1 ==  10, R1 will be calculated to
>> approximately 4 seconds.
>>
>> Is that ok?
>
> I suppose what he meant is that you could have e.g., 60sec RTT and with
> small enough retries the timeout calculation yields to some timeout
> smaller than 60 secs, and therefore no retransmissions are made which is
> certainly not a desirable property? ...This is valid issue even if no min
> rto tweaking was done but can of course get much worse if min rto is
> shorter.

The extreme case you described above won't happen because there is a
point check at the beginning to return false if inet_csk(sk)->icsk_retransmits
is zero. But that seems to be a hack because why is 0 so special, not 1, 2,...?

>
> I agree with his proposed solution:
>
>> > I think retransmits_timed_out() should check against both time
>> > duration and retrans count (icsk_retransmits).
>
> ...that is, use both pseudo timeout check of the current code and the
> previously used icsk_retransmits compare at the same time.

Yep, it's not an ideal solution, i.e., the problem the original patch
tried to address
may continue to exist if the gap between TCP_RTO_MIN used as an estimator
and the real RTO is large, but it at least addresses the problem we
have locally.

I will submit a patch for this asap.

Thanks,

Jerry

>
> --
>  i.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/3] Revert Backoff [v3]: Calculate TCP's connection close threshold as a time value.
  2012-06-04 23:50       ` Jerry Chu
@ 2012-06-05 17:42         ` Jerry Chu
  2012-06-05 18:39           ` Damian Lukowski
  0 siblings, 1 reply; 19+ messages in thread
From: Jerry Chu @ 2012-06-05 17:42 UTC (permalink / raw)
  To: Damian Lukowski; +Cc: Netdev, David Miller, Ilpo Järvinen

On Mon, Jun 4, 2012 at 4:50 PM, Jerry Chu <hkchu@google.com> wrote:
> Hi Damian,
>
> On Mon, Jun 4, 2012 at 10:50 AM, Damian Lukowski
> <damian@tvk.rwth-aachen.de> wrote:
>> Hi Jerry,
>>
>> please verify, I understood you correctly.
>>
>> You have set TCP_RTO_MIN to a lower value, e.g. 0.002 seconds to improve
>> your internal low-latency traffic. Because of the improvement, R1
>> timeouts are triggered too fast for external high-RTT traffic. Is that
>> correct?
>
> Correct.
>
>> If so, may I suggest to set tcp_retries1 to a higher value? For
>> TCP_RTO_MIN == 0.002 and tcp_retries1 ==  10, R1 will be calculated to
>> approximately 4 seconds.
>
> I think hacking tcp_retries1 is the wrong solution. E.g., 10 retries may be too
> generous for those short RTT flows.
>
> I think the fundamental problem is - the ideal fix for your original RTO revert
> problem should've used the per-flow RTO to compute R1 & R2. But that
> computation may be too expensive so you used TCP_RTO_MIN as an
> approximation - not a good idea IMHO!

Just realized the correct fix of using the original, non-backoff per flow RTO is
not any more expensive than the current code through ilog2(). What's needed
is a new field "base_rto" to record the original RTO before backoff. I'm leaning
toward this more accurate fix now without any fudge because fudging almost
always causes bugs.

Any comment is welcome. I'm not sure in the existing code if it makes sense
to apply the exponential backoff based computation to thin stream but it's a
separate question so I won't touch it.

Jerry

>
> The easiest solution I can see so far is to replace the check
>
> if (!inet_csk(sk)->icsk_retransmits)
>                return false;
>
> at the beginning of retransmits_timed_out() with
>
> if (inet_csk(sk)->icsk_retransmits < boundary)
>                return false;
>
> Best,
>
> Jerry
>
>>
>> Is that ok?
>>
>> Best regards
>>  Damian
>>
>> Am Freitag, den 01.06.2012, 15:58 -0700 schrieb Jerry Chu:
>>> > From: Damian Lukowski <damian@tvk.rwth-aachen.de>
>>> > Date: Wed, Aug 26, 2009 at 3:16 AM
>>> > Subject: [PATCH 3/3] Revert Backoff [v3]: Calculate TCP's connection close
>>> > threshold as a time value.
>>> > To: Netdev <netdev@vger.kernel.org>
>>> >
>>> >
>>> > RFC 1122 specifies two threshold values R1 and R2 for connection timeouts,
>>> > which may represent a number of allowed retransmissions or a timeout value.
>>> > Currently linux uses sysctl_tcp_retries{1,2} to specify the thresholds
>>> > in number of allowed retransmissions.
>>> >
>>> > For any desired threshold R2 (by means of time) one can specify tcp_retries2
>>> > (by means of number of retransmissions) such that TCP will not time out
>>> > earlier than R2. This is the case, because the RTO schedule follows a fixed
>>> > pattern, namely exponential backoff.
>>> >
>>> > However, the RTO behaviour is not predictable any more if RTO backoffs can
>>> > be
>>> > reverted, as it is the case in the draft
>>> > "Make TCP more Robust to Long Connectivity Disruptions"
>>> > (http://tools.ietf.org/html/draft-zimmermann-tcp-lcd).
>>> >
>>> > In the worst case TCP would time out a connection after 3.2 seconds, if the
>>> > initial RTO equaled MIN_RTO and each backoff has been reverted.
>>> >
>>> > This patch introduces a function retransmits_timed_out(N),
>>> > which calculates the timeout of a TCP connection, assuming an initial
>>> > RTO of MIN_RTO and N unsuccessful, exponentially backed-off retransmissions.
>>> >
>>> > Whenever timeout decisions are made by comparing the retransmission counter
>>> > to some value N, this function can be used, instead.
>>> >
>>> > The meaning of tcp_retries2 will be changed, as many more RTO
>>> > retransmissions
>>> > can occur than the value indicates. However, it yields a timeout which is
>>> > similar to the one of an unpatched, exponentially backing off TCP in the
>>> > same
>>> > scenario. As no application could rely on an RTO greater than MIN_RTO, there
>>> > should be no risk of a regression.
>>>
>>> This looks like a typical "fix one problem, introducing a few more" patch :(.
>>> What do you mean by "no application could rely on an RTO greater than
>>> MIN_RTO..."
>>> above? How can you make the assumption that RTO is not too far off
>>> from TCP_RTO_MIN?
>>>
>>> While you tried to address a problem where the retransmission count
>>> was high but the actual
>>> timeout duration was too short, have you considered the other case
>>> around, i.e., the timeout
>>> duration is long but the retransmission count is too short? This is
>>> exactly what's happening
>>> to us with your patch. We've much reduced TCP_RTO_MIN for our internal
>>> traffic, but not
>>> noticing your change has severely shortened the R1 & R2 recommended by
>>> RFC1122 for our
>>> long haul traffic until now. In many cases R1 threshold was met upon
>>> the first retrans timeout.
>>>
>>> I think retransmits_timed_out() should check against both time
>>> duration and retrans count
>>> (icsk_retransmits).
>>>
>>> Thought?
>>>
>>> Jerry
>>>
>>> >
>>> > Signed-off-by: Damian Lukowski <damian@tvk.rwth-aachen.de>
>>> > ---
>>> >  include/net/tcp.h    |   18 ++++++++++++++++++
>>> >  net/ipv4/tcp_timer.c |   11 +++++++----
>>> >  2 files changed, 25 insertions(+), 4 deletions(-)
>>> >
>>> > diff --git a/include/net/tcp.h b/include/net/tcp.h
>>> > index c35b329..17d1a88 100644
>>> > --- a/include/net/tcp.h
>>> > +++ b/include/net/tcp.h
>>> > @@ -1247,6 +1247,24 @@ static inline struct sk_buff
>>> > *tcp_write_queue_prev(struct sock *sk, struct sk_bu
>>> >  #define tcp_for_write_queue_from_safe(skb, tmp, sk)                    \
>>> >        skb_queue_walk_from_safe(&(sk)->sk_write_queue, skb, tmp)
>>> >
>>> > +static inline bool retransmits_timed_out(const struct sock *sk,
>>> > +                                        unsigned int boundary)
>>> > +{
>>> > +       int limit, K;
>>> > +       if (!inet_csk(sk)->icsk_retransmits)
>>> > +               return false;
>>> > +
>>> > +       K = ilog2(TCP_RTO_MAX/TCP_RTO_MIN);
>>> > +
>>> > +       if (boundary <= K)
>>> > +               limit = ((2 << boundary) - 1) * TCP_RTO_MIN;
>>> > +       else
>>> > +               limit = ((2 << K) - 1) * TCP_RTO_MIN +
>>> > +                       (boundary - K) * TCP_RTO_MAX;
>>> > +
>>> > +       return (tcp_time_stamp - tcp_sk(sk)->retrans_stamp) >= limit;
>>> > +}
>>> > +
>>> >  static inline struct sk_buff *tcp_send_head(struct sock *sk)
>>> >  {
>>> >        return sk->sk_send_head;
>>> > diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
>>> > index a3ba494..2972d7b 100644
>>> > --- a/net/ipv4/tcp_timer.c
>>> > +++ b/net/ipv4/tcp_timer.c
>>> > @@ -137,13 +137,14 @@ static int tcp_write_timeout(struct sock *sk)
>>> >  {
>>> >        struct inet_connection_sock *icsk = inet_csk(sk);
>>> >        int retry_until;
>>> > +       bool do_reset;
>>> >
>>> >        if ((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV)) {
>>> >                if (icsk->icsk_retransmits)
>>> >                        dst_negative_advice(&sk->sk_dst_cache);
>>> >                retry_until = icsk->icsk_syn_retries ? :
>>> > sysctl_tcp_syn_retries;
>>> >        } else {
>>> > -               if (icsk->icsk_retransmits >= sysctl_tcp_retries1) {
>>> > +               if (retransmits_timed_out(sk, sysctl_tcp_retries1)) {
>>> >                        /* Black hole detection */
>>> >                        tcp_mtu_probing(icsk, sk);
>>> >
>>> > @@ -155,13 +156,15 @@ static int tcp_write_timeout(struct sock *sk)
>>> >                        const int alive = (icsk->icsk_rto < TCP_RTO_MAX);
>>> >
>>> >                        retry_until = tcp_orphan_retries(sk, alive);
>>> > +                       do_reset = alive ||
>>> > +                                  !retransmits_timed_out(sk, retry_until);
>>> >
>>> > -                       if (tcp_out_of_resources(sk, alive ||
>>> > icsk->icsk_retransmits < retry_until))
>>> > +                       if (tcp_out_of_resources(sk, do_reset))
>>> >                                return 1;
>>> >                }
>>> >        }
>>> >
>>> > -       if (icsk->icsk_retransmits >= retry_until) {
>>> > +       if (retransmits_timed_out(sk, retry_until)) {
>>> >                /* Has it gone just too far? */
>>> >                tcp_write_err(sk);
>>> >                return 1;
>>> > @@ -385,7 +388,7 @@ void tcp_retransmit_timer(struct sock *sk)
>>> >  out_reset_timer:
>>> >        icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX);
>>> >        inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, icsk->icsk_rto,
>>> > TCP_RTO_MAX);
>>> > -       if (icsk->icsk_retransmits > sysctl_tcp_retries1)
>>> > +       if (retransmits_timed_out(sk, sysctl_tcp_retries1 + 1))
>>> >                __sk_dst_reset(sk);
>>> >
>>> >  out:;
>>> > --
>>> > 1.6.3.3
>>> >
>>> > --
>>> > To unsubscribe from this list: send the line "unsubscribe netdev" in
>>> > the body of a message to majordomo@vger.kernel.org
>>> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>> >
>>
>>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/3] Revert Backoff [v3]: Calculate TCP's connection close threshold as a time value.
  2012-06-05 17:42         ` Jerry Chu
@ 2012-06-05 18:39           ` Damian Lukowski
  2012-06-05 21:22             ` Jerry Chu
  0 siblings, 1 reply; 19+ messages in thread
From: Damian Lukowski @ 2012-06-05 18:39 UTC (permalink / raw)
  To: Jerry Chu; +Cc: Netdev, David Miller, Ilpo Järvinen

Am Dienstag, den 05.06.2012, 10:42 -0700 schrieb Jerry Chu:
> On Mon, Jun 4, 2012 at 4:50 PM, Jerry Chu <hkchu@google.com> wrote:
> > Hi Damian,
> >
> > On Mon, Jun 4, 2012 at 10:50 AM, Damian Lukowski
> > <damian@tvk.rwth-aachen.de> wrote:
> >> Hi Jerry,
> >>
> >> please verify, I understood you correctly.
> >>
> >> You have set TCP_RTO_MIN to a lower value, e.g. 0.002 seconds to improve
> >> your internal low-latency traffic. Because of the improvement, R1
> >> timeouts are triggered too fast for external high-RTT traffic. Is that
> >> correct?
> >
> > Correct.
> >
> >> If so, may I suggest to set tcp_retries1 to a higher value? For
> >> TCP_RTO_MIN == 0.002 and tcp_retries1 ==  10, R1 will be calculated to
> >> approximately 4 seconds.
> >
> > I think hacking tcp_retries1 is the wrong solution. E.g., 10 retries may be too
> > generous for those short RTT flows.
> >
> > I think the fundamental problem is - the ideal fix for your original RTO revert
> > problem should've used the per-flow RTO to compute R1 & R2. But that
> > computation may be too expensive so you used TCP_RTO_MIN as an
> > approximation - not a good idea IMHO!
> 
> Just realized the correct fix of using the original, non-backoff per flow RTO is
> not any more expensive than the current code through ilog2(). What's needed
> is a new field "base_rto" to record the original RTO before backoff. I'm leaning
> toward this more accurate fix now without any fudge because fudging almost
> always causes bugs.


The current version of retransmits_timed_out() uses such a field
already. I suppose, we can do a combination like the following?

-       unsigned int rto_base = syn_set ? TCP_TIMEOUT_INIT : TCP_RTO_MIN;
+       unsigned int rto_base = syn_set ? TCP_TIMEOUT_INIT : __tcp_set_rto(tcp_sk(sk));
+       rto_base = rto_base ? : TCP_RTO_MIN;
 
-       if (!inet_csk(sk)->icsk_retransmits)
+       if (inet_csk(sk)->icsk_retransmits < boundary)
 

Regards
 Damian

> 
> Any comment is welcome. I'm not sure in the existing code if it makes sense
> to apply the exponential backoff based computation to thin stream but it's a
> separate question so I won't touch it.
> 
> Jerry
> 
> >
> > The easiest solution I can see so far is to replace the check
> >
> > if (!inet_csk(sk)->icsk_retransmits)
> >                return false;
> >
> > at the beginning of retransmits_timed_out() with
> >
> > if (inet_csk(sk)->icsk_retransmits < boundary)
> >                return false;
> >
> > Best,
> >
> > Jerry
> >
> >>
> >> Is that ok?
> >>
> >> Best regards
> >>  Damian
> >>
> >> Am Freitag, den 01.06.2012, 15:58 -0700 schrieb Jerry Chu:
> >>> > From: Damian Lukowski <damian@tvk.rwth-aachen.de>
> >>> > Date: Wed, Aug 26, 2009 at 3:16 AM
> >>> > Subject: [PATCH 3/3] Revert Backoff [v3]: Calculate TCP's connection close
> >>> > threshold as a time value.
> >>> > To: Netdev <netdev@vger.kernel.org>
> >>> >
> >>> >
> >>> > RFC 1122 specifies two threshold values R1 and R2 for connection timeouts,
> >>> > which may represent a number of allowed retransmissions or a timeout value.
> >>> > Currently linux uses sysctl_tcp_retries{1,2} to specify the thresholds
> >>> > in number of allowed retransmissions.
> >>> >
> >>> > For any desired threshold R2 (by means of time) one can specify tcp_retries2
> >>> > (by means of number of retransmissions) such that TCP will not time out
> >>> > earlier than R2. This is the case, because the RTO schedule follows a fixed
> >>> > pattern, namely exponential backoff.
> >>> >
> >>> > However, the RTO behaviour is not predictable any more if RTO backoffs can
> >>> > be
> >>> > reverted, as it is the case in the draft
> >>> > "Make TCP more Robust to Long Connectivity Disruptions"
> >>> > (http://tools.ietf.org/html/draft-zimmermann-tcp-lcd).
> >>> >
> >>> > In the worst case TCP would time out a connection after 3.2 seconds, if the
> >>> > initial RTO equaled MIN_RTO and each backoff has been reverted.
> >>> >
> >>> > This patch introduces a function retransmits_timed_out(N),
> >>> > which calculates the timeout of a TCP connection, assuming an initial
> >>> > RTO of MIN_RTO and N unsuccessful, exponentially backed-off retransmissions.
> >>> >
> >>> > Whenever timeout decisions are made by comparing the retransmission counter
> >>> > to some value N, this function can be used, instead.
> >>> >
> >>> > The meaning of tcp_retries2 will be changed, as many more RTO
> >>> > retransmissions
> >>> > can occur than the value indicates. However, it yields a timeout which is
> >>> > similar to the one of an unpatched, exponentially backing off TCP in the
> >>> > same
> >>> > scenario. As no application could rely on an RTO greater than MIN_RTO, there
> >>> > should be no risk of a regression.
> >>>
> >>> This looks like a typical "fix one problem, introducing a few more" patch :(.
> >>> What do you mean by "no application could rely on an RTO greater than
> >>> MIN_RTO..."
> >>> above? How can you make the assumption that RTO is not too far off
> >>> from TCP_RTO_MIN?
> >>>
> >>> While you tried to address a problem where the retransmission count
> >>> was high but the actual
> >>> timeout duration was too short, have you considered the other case
> >>> around, i.e., the timeout
> >>> duration is long but the retransmission count is too short? This is
> >>> exactly what's happening
> >>> to us with your patch. We've much reduced TCP_RTO_MIN for our internal
> >>> traffic, but not
> >>> noticing your change has severely shortened the R1 & R2 recommended by
> >>> RFC1122 for our
> >>> long haul traffic until now. In many cases R1 threshold was met upon
> >>> the first retrans timeout.
> >>>
> >>> I think retransmits_timed_out() should check against both time
> >>> duration and retrans count
> >>> (icsk_retransmits).
> >>>
> >>> Thought?
> >>>
> >>> Jerry
> >>>
> >>> >
> >>> > Signed-off-by: Damian Lukowski <damian@tvk.rwth-aachen.de>
> >>> > ---
> >>> >  include/net/tcp.h    |   18 ++++++++++++++++++
> >>> >  net/ipv4/tcp_timer.c |   11 +++++++----
> >>> >  2 files changed, 25 insertions(+), 4 deletions(-)
> >>> >
> >>> > diff --git a/include/net/tcp.h b/include/net/tcp.h
> >>> > index c35b329..17d1a88 100644
> >>> > --- a/include/net/tcp.h
> >>> > +++ b/include/net/tcp.h
> >>> > @@ -1247,6 +1247,24 @@ static inline struct sk_buff
> >>> > *tcp_write_queue_prev(struct sock *sk, struct sk_bu
> >>> >  #define tcp_for_write_queue_from_safe(skb, tmp, sk)                    \
> >>> >        skb_queue_walk_from_safe(&(sk)->sk_write_queue, skb, tmp)
> >>> >
> >>> > +static inline bool retransmits_timed_out(const struct sock *sk,
> >>> > +                                        unsigned int boundary)
> >>> > +{
> >>> > +       int limit, K;
> >>> > +       if (!inet_csk(sk)->icsk_retransmits)
> >>> > +               return false;
> >>> > +
> >>> > +       K = ilog2(TCP_RTO_MAX/TCP_RTO_MIN);
> >>> > +
> >>> > +       if (boundary <= K)
> >>> > +               limit = ((2 << boundary) - 1) * TCP_RTO_MIN;
> >>> > +       else
> >>> > +               limit = ((2 << K) - 1) * TCP_RTO_MIN +
> >>> > +                       (boundary - K) * TCP_RTO_MAX;
> >>> > +
> >>> > +       return (tcp_time_stamp - tcp_sk(sk)->retrans_stamp) >= limit;
> >>> > +}
> >>> > +
> >>> >  static inline struct sk_buff *tcp_send_head(struct sock *sk)
> >>> >  {
> >>> >        return sk->sk_send_head;
> >>> > diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
> >>> > index a3ba494..2972d7b 100644
> >>> > --- a/net/ipv4/tcp_timer.c
> >>> > +++ b/net/ipv4/tcp_timer.c
> >>> > @@ -137,13 +137,14 @@ static int tcp_write_timeout(struct sock *sk)
> >>> >  {
> >>> >        struct inet_connection_sock *icsk = inet_csk(sk);
> >>> >        int retry_until;
> >>> > +       bool do_reset;
> >>> >
> >>> >        if ((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV)) {
> >>> >                if (icsk->icsk_retransmits)
> >>> >                        dst_negative_advice(&sk->sk_dst_cache);
> >>> >                retry_until = icsk->icsk_syn_retries ? :
> >>> > sysctl_tcp_syn_retries;
> >>> >        } else {
> >>> > -               if (icsk->icsk_retransmits >= sysctl_tcp_retries1) {
> >>> > +               if (retransmits_timed_out(sk, sysctl_tcp_retries1)) {
> >>> >                        /* Black hole detection */
> >>> >                        tcp_mtu_probing(icsk, sk);
> >>> >
> >>> > @@ -155,13 +156,15 @@ static int tcp_write_timeout(struct sock *sk)
> >>> >                        const int alive = (icsk->icsk_rto < TCP_RTO_MAX);
> >>> >
> >>> >                        retry_until = tcp_orphan_retries(sk, alive);
> >>> > +                       do_reset = alive ||
> >>> > +                                  !retransmits_timed_out(sk, retry_until);
> >>> >
> >>> > -                       if (tcp_out_of_resources(sk, alive ||
> >>> > icsk->icsk_retransmits < retry_until))
> >>> > +                       if (tcp_out_of_resources(sk, do_reset))
> >>> >                                return 1;
> >>> >                }
> >>> >        }
> >>> >
> >>> > -       if (icsk->icsk_retransmits >= retry_until) {
> >>> > +       if (retransmits_timed_out(sk, retry_until)) {
> >>> >                /* Has it gone just too far? */
> >>> >                tcp_write_err(sk);
> >>> >                return 1;
> >>> > @@ -385,7 +388,7 @@ void tcp_retransmit_timer(struct sock *sk)
> >>> >  out_reset_timer:
> >>> >        icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX);
> >>> >        inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, icsk->icsk_rto,
> >>> > TCP_RTO_MAX);
> >>> > -       if (icsk->icsk_retransmits > sysctl_tcp_retries1)
> >>> > +       if (retransmits_timed_out(sk, sysctl_tcp_retries1 + 1))
> >>> >                __sk_dst_reset(sk);
> >>> >
> >>> >  out:;
> >>> > --
> >>> > 1.6.3.3
> >>> >
> >>> > --
> >>> > To unsubscribe from this list: send the line "unsubscribe netdev" in
> >>> > the body of a message to majordomo@vger.kernel.org
> >>> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >>> >
> >>
> >>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/3] Revert Backoff [v3]: Calculate TCP's connection close threshold as a time value.
  2012-06-05 18:39           ` Damian Lukowski
@ 2012-06-05 21:22             ` Jerry Chu
  2012-06-06 20:35               ` Damian Lukowski
  0 siblings, 1 reply; 19+ messages in thread
From: Jerry Chu @ 2012-06-05 21:22 UTC (permalink / raw)
  To: Damian Lukowski; +Cc: Netdev, David Miller, Ilpo Järvinen

On Tue, Jun 5, 2012 at 11:39 AM, Damian Lukowski
<damian@tvk.rwth-aachen.de> wrote:
> Am Dienstag, den 05.06.2012, 10:42 -0700 schrieb Jerry Chu:
>> On Mon, Jun 4, 2012 at 4:50 PM, Jerry Chu <hkchu@google.com> wrote:
>> > Hi Damian,
>> >
>> > On Mon, Jun 4, 2012 at 10:50 AM, Damian Lukowski
>> > <damian@tvk.rwth-aachen.de> wrote:
>> >> Hi Jerry,
>> >>
>> >> please verify, I understood you correctly.
>> >>
>> >> You have set TCP_RTO_MIN to a lower value, e.g. 0.002 seconds to improve
>> >> your internal low-latency traffic. Because of the improvement, R1
>> >> timeouts are triggered too fast for external high-RTT traffic. Is that
>> >> correct?
>> >
>> > Correct.
>> >
>> >> If so, may I suggest to set tcp_retries1 to a higher value? For
>> >> TCP_RTO_MIN == 0.002 and tcp_retries1 ==  10, R1 will be calculated to
>> >> approximately 4 seconds.
>> >
>> > I think hacking tcp_retries1 is the wrong solution. E.g., 10 retries may be too
>> > generous for those short RTT flows.
>> >
>> > I think the fundamental problem is - the ideal fix for your original RTO revert
>> > problem should've used the per-flow RTO to compute R1 & R2. But that
>> > computation may be too expensive so you used TCP_RTO_MIN as an
>> > approximation - not a good idea IMHO!
>>
>> Just realized the correct fix of using the original, non-backoff per flow RTO is
>> not any more expensive than the current code through ilog2(). What's needed
>> is a new field "base_rto" to record the original RTO before backoff. I'm leaning
>> toward this more accurate fix now without any fudge because fudging almost
>> always causes bugs.
>
>
> The current version of retransmits_timed_out() uses such a field
> already. I suppose, we can do a combination like the following?
>
> -       unsigned int rto_base = syn_set ? TCP_TIMEOUT_INIT : TCP_RTO_MIN;
> +       unsigned int rto_base = syn_set ? TCP_TIMEOUT_INIT : __tcp_set_rto(tcp_sk(sk));

Yes that could work and we probably don't need a new field for the original RTO.

But I started wondering what the problem you tried to solve initially. The old
counter (icsk_retransmits) based code was really easy to understand, debug, and
matched well with the API (sysctl_tcp_retries1, sysctl_tcp_retries2,
TCP_SYNCNT,...), which are all counter based. Moreover, my simple brain has
a strong prejudice against complex code unless the complexity is justified.

Could you point out where backoff revert might happen? (tcp_v4_err() when
handing ICMP errors?) And for those cases is it possible to either not increment
icsk_retransmits (as long as it won't get us into infinite
retransmissions), or invent
a separate field for the sole purpose of timeout check? Won't that be
much simpler
than your current fix?

Best,

Jerry

> +       rto_base = rto_base ? : TCP_RTO_MIN;
>
> -       if (!inet_csk(sk)->icsk_retransmits)
> +       if (inet_csk(sk)->icsk_retransmits < boundary)
>
>
> Regards
>  Damian
>
>>
>> Any comment is welcome. I'm not sure in the existing code if it makes sense
>> to apply the exponential backoff based computation to thin stream but it's a
>> separate question so I won't touch it.
>>
>> Jerry
>>
>> >
>> > The easiest solution I can see so far is to replace the check
>> >
>> > if (!inet_csk(sk)->icsk_retransmits)
>> >                return false;
>> >
>> > at the beginning of retransmits_timed_out() with
>> >
>> > if (inet_csk(sk)->icsk_retransmits < boundary)
>> >                return false;
>> >
>> > Best,
>> >
>> > Jerry
>> >
>> >>
>> >> Is that ok?
>> >>
>> >> Best regards
>> >>  Damian
>> >>
>> >> Am Freitag, den 01.06.2012, 15:58 -0700 schrieb Jerry Chu:
>> >>> > From: Damian Lukowski <damian@tvk.rwth-aachen.de>
>> >>> > Date: Wed, Aug 26, 2009 at 3:16 AM
>> >>> > Subject: [PATCH 3/3] Revert Backoff [v3]: Calculate TCP's connection close
>> >>> > threshold as a time value.
>> >>> > To: Netdev <netdev@vger.kernel.org>
>> >>> >
>> >>> >
>> >>> > RFC 1122 specifies two threshold values R1 and R2 for connection timeouts,
>> >>> > which may represent a number of allowed retransmissions or a timeout value.
>> >>> > Currently linux uses sysctl_tcp_retries{1,2} to specify the thresholds
>> >>> > in number of allowed retransmissions.
>> >>> >
>> >>> > For any desired threshold R2 (by means of time) one can specify tcp_retries2
>> >>> > (by means of number of retransmissions) such that TCP will not time out
>> >>> > earlier than R2. This is the case, because the RTO schedule follows a fixed
>> >>> > pattern, namely exponential backoff.
>> >>> >
>> >>> > However, the RTO behaviour is not predictable any more if RTO backoffs can
>> >>> > be
>> >>> > reverted, as it is the case in the draft
>> >>> > "Make TCP more Robust to Long Connectivity Disruptions"
>> >>> > (http://tools.ietf.org/html/draft-zimmermann-tcp-lcd).
>> >>> >
>> >>> > In the worst case TCP would time out a connection after 3.2 seconds, if the
>> >>> > initial RTO equaled MIN_RTO and each backoff has been reverted.
>> >>> >
>> >>> > This patch introduces a function retransmits_timed_out(N),
>> >>> > which calculates the timeout of a TCP connection, assuming an initial
>> >>> > RTO of MIN_RTO and N unsuccessful, exponentially backed-off retransmissions.
>> >>> >
>> >>> > Whenever timeout decisions are made by comparing the retransmission counter
>> >>> > to some value N, this function can be used, instead.
>> >>> >
>> >>> > The meaning of tcp_retries2 will be changed, as many more RTO
>> >>> > retransmissions
>> >>> > can occur than the value indicates. However, it yields a timeout which is
>> >>> > similar to the one of an unpatched, exponentially backing off TCP in the
>> >>> > same
>> >>> > scenario. As no application could rely on an RTO greater than MIN_RTO, there
>> >>> > should be no risk of a regression.
>> >>>
>> >>> This looks like a typical "fix one problem, introducing a few more" patch :(.
>> >>> What do you mean by "no application could rely on an RTO greater than
>> >>> MIN_RTO..."
>> >>> above? How can you make the assumption that RTO is not too far off
>> >>> from TCP_RTO_MIN?
>> >>>
>> >>> While you tried to address a problem where the retransmission count
>> >>> was high but the actual
>> >>> timeout duration was too short, have you considered the other case
>> >>> around, i.e., the timeout
>> >>> duration is long but the retransmission count is too short? This is
>> >>> exactly what's happening
>> >>> to us with your patch. We've much reduced TCP_RTO_MIN for our internal
>> >>> traffic, but not
>> >>> noticing your change has severely shortened the R1 & R2 recommended by
>> >>> RFC1122 for our
>> >>> long haul traffic until now. In many cases R1 threshold was met upon
>> >>> the first retrans timeout.
>> >>>
>> >>> I think retransmits_timed_out() should check against both time
>> >>> duration and retrans count
>> >>> (icsk_retransmits).
>> >>>
>> >>> Thought?
>> >>>
>> >>> Jerry
>> >>>
>> >>> >
>> >>> > Signed-off-by: Damian Lukowski <damian@tvk.rwth-aachen.de>
>> >>> > ---
>> >>> >  include/net/tcp.h    |   18 ++++++++++++++++++
>> >>> >  net/ipv4/tcp_timer.c |   11 +++++++----
>> >>> >  2 files changed, 25 insertions(+), 4 deletions(-)
>> >>> >
>> >>> > diff --git a/include/net/tcp.h b/include/net/tcp.h
>> >>> > index c35b329..17d1a88 100644
>> >>> > --- a/include/net/tcp.h
>> >>> > +++ b/include/net/tcp.h
>> >>> > @@ -1247,6 +1247,24 @@ static inline struct sk_buff
>> >>> > *tcp_write_queue_prev(struct sock *sk, struct sk_bu
>> >>> >  #define tcp_for_write_queue_from_safe(skb, tmp, sk)                    \
>> >>> >        skb_queue_walk_from_safe(&(sk)->sk_write_queue, skb, tmp)
>> >>> >
>> >>> > +static inline bool retransmits_timed_out(const struct sock *sk,
>> >>> > +                                        unsigned int boundary)
>> >>> > +{
>> >>> > +       int limit, K;
>> >>> > +       if (!inet_csk(sk)->icsk_retransmits)
>> >>> > +               return false;
>> >>> > +
>> >>> > +       K = ilog2(TCP_RTO_MAX/TCP_RTO_MIN);
>> >>> > +
>> >>> > +       if (boundary <= K)
>> >>> > +               limit = ((2 << boundary) - 1) * TCP_RTO_MIN;
>> >>> > +       else
>> >>> > +               limit = ((2 << K) - 1) * TCP_RTO_MIN +
>> >>> > +                       (boundary - K) * TCP_RTO_MAX;
>> >>> > +
>> >>> > +       return (tcp_time_stamp - tcp_sk(sk)->retrans_stamp) >= limit;
>> >>> > +}
>> >>> > +
>> >>> >  static inline struct sk_buff *tcp_send_head(struct sock *sk)
>> >>> >  {
>> >>> >        return sk->sk_send_head;
>> >>> > diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
>> >>> > index a3ba494..2972d7b 100644
>> >>> > --- a/net/ipv4/tcp_timer.c
>> >>> > +++ b/net/ipv4/tcp_timer.c
>> >>> > @@ -137,13 +137,14 @@ static int tcp_write_timeout(struct sock *sk)
>> >>> >  {
>> >>> >        struct inet_connection_sock *icsk = inet_csk(sk);
>> >>> >        int retry_until;
>> >>> > +       bool do_reset;
>> >>> >
>> >>> >        if ((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV)) {
>> >>> >                if (icsk->icsk_retransmits)
>> >>> >                        dst_negative_advice(&sk->sk_dst_cache);
>> >>> >                retry_until = icsk->icsk_syn_retries ? :
>> >>> > sysctl_tcp_syn_retries;
>> >>> >        } else {
>> >>> > -               if (icsk->icsk_retransmits >= sysctl_tcp_retries1) {
>> >>> > +               if (retransmits_timed_out(sk, sysctl_tcp_retries1)) {
>> >>> >                        /* Black hole detection */
>> >>> >                        tcp_mtu_probing(icsk, sk);
>> >>> >
>> >>> > @@ -155,13 +156,15 @@ static int tcp_write_timeout(struct sock *sk)
>> >>> >                        const int alive = (icsk->icsk_rto < TCP_RTO_MAX);
>> >>> >
>> >>> >                        retry_until = tcp_orphan_retries(sk, alive);
>> >>> > +                       do_reset = alive ||
>> >>> > +                                  !retransmits_timed_out(sk, retry_until);
>> >>> >
>> >>> > -                       if (tcp_out_of_resources(sk, alive ||
>> >>> > icsk->icsk_retransmits < retry_until))
>> >>> > +                       if (tcp_out_of_resources(sk, do_reset))
>> >>> >                                return 1;
>> >>> >                }
>> >>> >        }
>> >>> >
>> >>> > -       if (icsk->icsk_retransmits >= retry_until) {
>> >>> > +       if (retransmits_timed_out(sk, retry_until)) {
>> >>> >                /* Has it gone just too far? */
>> >>> >                tcp_write_err(sk);
>> >>> >                return 1;
>> >>> > @@ -385,7 +388,7 @@ void tcp_retransmit_timer(struct sock *sk)
>> >>> >  out_reset_timer:
>> >>> >        icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX);
>> >>> >        inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, icsk->icsk_rto,
>> >>> > TCP_RTO_MAX);
>> >>> > -       if (icsk->icsk_retransmits > sysctl_tcp_retries1)
>> >>> > +       if (retransmits_timed_out(sk, sysctl_tcp_retries1 + 1))
>> >>> >                __sk_dst_reset(sk);
>> >>> >
>> >>> >  out:;
>> >>> > --
>> >>> > 1.6.3.3
>> >>> >
>> >>> > --
>> >>> > To unsubscribe from this list: send the line "unsubscribe netdev" in
>> >>> > the body of a message to majordomo@vger.kernel.org
>> >>> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> >>> >
>> >>
>> >>
>
>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/3] Revert Backoff [v3]: Calculate TCP's connection close threshold as a time value.
  2012-06-05 21:22             ` Jerry Chu
@ 2012-06-06 20:35               ` Damian Lukowski
  2012-06-06 22:41                 ` Yuchung Cheng
  0 siblings, 1 reply; 19+ messages in thread
From: Damian Lukowski @ 2012-06-06 20:35 UTC (permalink / raw)
  To: Jerry Chu; +Cc: Netdev, David Miller, Ilpo Järvinen

Am Dienstag, den 05.06.2012, 14:22 -0700 schrieb Jerry Chu:
> On Tue, Jun 5, 2012 at 11:39 AM, Damian Lukowski
> <damian@tvk.rwth-aachen.de> wrote:
> > Am Dienstag, den 05.06.2012, 10:42 -0700 schrieb Jerry Chu:
> >> On Mon, Jun 4, 2012 at 4:50 PM, Jerry Chu <hkchu@google.com> wrote:
> >> > Hi Damian,
> >> >
> >> > On Mon, Jun 4, 2012 at 10:50 AM, Damian Lukowski
> >> > <damian@tvk.rwth-aachen.de> wrote:
> >> >> Hi Jerry,
> >> >>
> >> >> please verify, I understood you correctly.
> >> >>
> >> >> You have set TCP_RTO_MIN to a lower value, e.g. 0.002 seconds to improve
> >> >> your internal low-latency traffic. Because of the improvement, R1
> >> >> timeouts are triggered too fast for external high-RTT traffic. Is that
> >> >> correct?
> >> >
> >> > Correct.
> >> >
> >> >> If so, may I suggest to set tcp_retries1 to a higher value? For
> >> >> TCP_RTO_MIN == 0.002 and tcp_retries1 ==  10, R1 will be calculated to
> >> >> approximately 4 seconds.
> >> >
> >> > I think hacking tcp_retries1 is the wrong solution. E.g., 10 retries may be too
> >> > generous for those short RTT flows.
> >> >
> >> > I think the fundamental problem is - the ideal fix for your original RTO revert
> >> > problem should've used the per-flow RTO to compute R1 & R2. But that
> >> > computation may be too expensive so you used TCP_RTO_MIN as an
> >> > approximation - not a good idea IMHO!
> >>
> >> Just realized the correct fix of using the original, non-backoff per flow RTO is
> >> not any more expensive than the current code through ilog2(). What's needed
> >> is a new field "base_rto" to record the original RTO before backoff. I'm leaning
> >> toward this more accurate fix now without any fudge because fudging almost
> >> always causes bugs.
> >
> >
> > The current version of retransmits_timed_out() uses such a field
> > already. I suppose, we can do a combination like the following?
> >
> > -       unsigned int rto_base = syn_set ? TCP_TIMEOUT_INIT : TCP_RTO_MIN;
> > +       unsigned int rto_base = syn_set ? TCP_TIMEOUT_INIT : __tcp_set_rto(tcp_sk(sk));
> 
> Yes that could work and we probably don't need a new field for the original RTO.
> 
> But I started wondering what the problem you tried to solve initially. The old
> counter (icsk_retransmits) based code was really easy to understand, debug, and
> matched well with the API (sysctl_tcp_retries1, sysctl_tcp_retries2,
> TCP_SYNCNT,...), which are all counter based. Moreover, my simple brain has
> a strong prejudice against complex code unless the complexity is justified.
> 
> Could you point out where backoff revert might happen? (tcp_v4_err() when
> handing ICMP errors?) And for those cases is it possible to either not increment
> icsk_retransmits (as long as it won't get us into infinite
> retransmissions), or invent
> a separate field for the sole purpose of timeout check? Won't that be
> much simpler than your current fix?

Hi,

backoffs are reverted when an RTO retransmission triggers an ICMP
destination unreachable along the path towards the target.
Consider A --- R --- B, where A and B are TCP endpoints, and R is some
router in between. When the link between R and B breaks for a longer
time, A will perform RTO retransmissions if there are outstanding ACKs.
Those packets which arrive at R will hopefully trigger an ICMP response
back towards A, as R has no more route towards B. The ICMP packet is an
indication for A, that the retransmission has not been lost because of
congestion but because of a link outage, and the backoff will be
reverted for the corresponding TCP session. In the best case, every RTO
retransmission triggers an ICMP response, so every backoff is reverted,
and the time between retransmissions remains at the original value.
If icsk_retransmits is decremented at this point within the original
code's logic, the connection might never time out. And we cannot take
tcp_retriesX literally here, as the above scenario would time out after
tcp_retries2 x base_rto, where the base_rto might be as small as 0.2
seconds.

I am not sure, how an additional counter variable should help. You still
cannot take tcp_retriesX literally. Besides, I think that changing the
socket structure is too heavy machinery, isn't it?

Regards
 Damian

> 
> Best,
> 
> Jerry
> 
> > +       rto_base = rto_base ? : TCP_RTO_MIN;
> >
> > -       if (!inet_csk(sk)->icsk_retransmits)
> > +       if (inet_csk(sk)->icsk_retransmits < boundary)
> >
> >
> > Regards
> >  Damian
> >
> >>
> >> Any comment is welcome. I'm not sure in the existing code if it makes sense
> >> to apply the exponential backoff based computation to thin stream but it's a
> >> separate question so I won't touch it.
> >>
> >> Jerry
> >>
> >> >
> >> > The easiest solution I can see so far is to replace the check
> >> >
> >> > if (!inet_csk(sk)->icsk_retransmits)
> >> >                return false;
> >> >
> >> > at the beginning of retransmits_timed_out() with
> >> >
> >> > if (inet_csk(sk)->icsk_retransmits < boundary)
> >> >                return false;
> >> >
> >> > Best,
> >> >
> >> > Jerry
> >> >
> >> >>
> >> >> Is that ok?
> >> >>
> >> >> Best regards
> >> >>  Damian
> >> >>
> >> >> Am Freitag, den 01.06.2012, 15:58 -0700 schrieb Jerry Chu:
> >> >>> > From: Damian Lukowski <damian@tvk.rwth-aachen.de>
> >> >>> > Date: Wed, Aug 26, 2009 at 3:16 AM
> >> >>> > Subject: [PATCH 3/3] Revert Backoff [v3]: Calculate TCP's connection close
> >> >>> > threshold as a time value.
> >> >>> > To: Netdev <netdev@vger.kernel.org>
> >> >>> >
> >> >>> >
> >> >>> > RFC 1122 specifies two threshold values R1 and R2 for connection timeouts,
> >> >>> > which may represent a number of allowed retransmissions or a timeout value.
> >> >>> > Currently linux uses sysctl_tcp_retries{1,2} to specify the thresholds
> >> >>> > in number of allowed retransmissions.
> >> >>> >
> >> >>> > For any desired threshold R2 (by means of time) one can specify tcp_retries2
> >> >>> > (by means of number of retransmissions) such that TCP will not time out
> >> >>> > earlier than R2. This is the case, because the RTO schedule follows a fixed
> >> >>> > pattern, namely exponential backoff.
> >> >>> >
> >> >>> > However, the RTO behaviour is not predictable any more if RTO backoffs can
> >> >>> > be
> >> >>> > reverted, as it is the case in the draft
> >> >>> > "Make TCP more Robust to Long Connectivity Disruptions"
> >> >>> > (http://tools.ietf.org/html/draft-zimmermann-tcp-lcd).
> >> >>> >
> >> >>> > In the worst case TCP would time out a connection after 3.2 seconds, if the
> >> >>> > initial RTO equaled MIN_RTO and each backoff has been reverted.
> >> >>> >
> >> >>> > This patch introduces a function retransmits_timed_out(N),
> >> >>> > which calculates the timeout of a TCP connection, assuming an initial
> >> >>> > RTO of MIN_RTO and N unsuccessful, exponentially backed-off retransmissions.
> >> >>> >
> >> >>> > Whenever timeout decisions are made by comparing the retransmission counter
> >> >>> > to some value N, this function can be used, instead.
> >> >>> >
> >> >>> > The meaning of tcp_retries2 will be changed, as many more RTO
> >> >>> > retransmissions
> >> >>> > can occur than the value indicates. However, it yields a timeout which is
> >> >>> > similar to the one of an unpatched, exponentially backing off TCP in the
> >> >>> > same
> >> >>> > scenario. As no application could rely on an RTO greater than MIN_RTO, there
> >> >>> > should be no risk of a regression.
> >> >>>
> >> >>> This looks like a typical "fix one problem, introducing a few more" patch :(.
> >> >>> What do you mean by "no application could rely on an RTO greater than
> >> >>> MIN_RTO..."
> >> >>> above? How can you make the assumption that RTO is not too far off
> >> >>> from TCP_RTO_MIN?
> >> >>>
> >> >>> While you tried to address a problem where the retransmission count
> >> >>> was high but the actual
> >> >>> timeout duration was too short, have you considered the other case
> >> >>> around, i.e., the timeout
> >> >>> duration is long but the retransmission count is too short? This is
> >> >>> exactly what's happening
> >> >>> to us with your patch. We've much reduced TCP_RTO_MIN for our internal
> >> >>> traffic, but not
> >> >>> noticing your change has severely shortened the R1 & R2 recommended by
> >> >>> RFC1122 for our
> >> >>> long haul traffic until now. In many cases R1 threshold was met upon
> >> >>> the first retrans timeout.
> >> >>>
> >> >>> I think retransmits_timed_out() should check against both time
> >> >>> duration and retrans count
> >> >>> (icsk_retransmits).
> >> >>>
> >> >>> Thought?
> >> >>>
> >> >>> Jerry
> >> >>>
> >> >>> >
> >> >>> > Signed-off-by: Damian Lukowski <damian@tvk.rwth-aachen.de>
> >> >>> > ---
> >> >>> >  include/net/tcp.h    |   18 ++++++++++++++++++
> >> >>> >  net/ipv4/tcp_timer.c |   11 +++++++----
> >> >>> >  2 files changed, 25 insertions(+), 4 deletions(-)
> >> >>> >
> >> >>> > diff --git a/include/net/tcp.h b/include/net/tcp.h
> >> >>> > index c35b329..17d1a88 100644
> >> >>> > --- a/include/net/tcp.h
> >> >>> > +++ b/include/net/tcp.h
> >> >>> > @@ -1247,6 +1247,24 @@ static inline struct sk_buff
> >> >>> > *tcp_write_queue_prev(struct sock *sk, struct sk_bu
> >> >>> >  #define tcp_for_write_queue_from_safe(skb, tmp, sk)                    \
> >> >>> >        skb_queue_walk_from_safe(&(sk)->sk_write_queue, skb, tmp)
> >> >>> >
> >> >>> > +static inline bool retransmits_timed_out(const struct sock *sk,
> >> >>> > +                                        unsigned int boundary)
> >> >>> > +{
> >> >>> > +       int limit, K;
> >> >>> > +       if (!inet_csk(sk)->icsk_retransmits)
> >> >>> > +               return false;
> >> >>> > +
> >> >>> > +       K = ilog2(TCP_RTO_MAX/TCP_RTO_MIN);
> >> >>> > +
> >> >>> > +       if (boundary <= K)
> >> >>> > +               limit = ((2 << boundary) - 1) * TCP_RTO_MIN;
> >> >>> > +       else
> >> >>> > +               limit = ((2 << K) - 1) * TCP_RTO_MIN +
> >> >>> > +                       (boundary - K) * TCP_RTO_MAX;
> >> >>> > +
> >> >>> > +       return (tcp_time_stamp - tcp_sk(sk)->retrans_stamp) >= limit;
> >> >>> > +}
> >> >>> > +
> >> >>> >  static inline struct sk_buff *tcp_send_head(struct sock *sk)
> >> >>> >  {
> >> >>> >        return sk->sk_send_head;
> >> >>> > diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
> >> >>> > index a3ba494..2972d7b 100644
> >> >>> > --- a/net/ipv4/tcp_timer.c
> >> >>> > +++ b/net/ipv4/tcp_timer.c
> >> >>> > @@ -137,13 +137,14 @@ static int tcp_write_timeout(struct sock *sk)
> >> >>> >  {
> >> >>> >        struct inet_connection_sock *icsk = inet_csk(sk);
> >> >>> >        int retry_until;
> >> >>> > +       bool do_reset;
> >> >>> >
> >> >>> >        if ((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV)) {
> >> >>> >                if (icsk->icsk_retransmits)
> >> >>> >                        dst_negative_advice(&sk->sk_dst_cache);
> >> >>> >                retry_until = icsk->icsk_syn_retries ? :
> >> >>> > sysctl_tcp_syn_retries;
> >> >>> >        } else {
> >> >>> > -               if (icsk->icsk_retransmits >= sysctl_tcp_retries1) {
> >> >>> > +               if (retransmits_timed_out(sk, sysctl_tcp_retries1)) {
> >> >>> >                        /* Black hole detection */
> >> >>> >                        tcp_mtu_probing(icsk, sk);
> >> >>> >
> >> >>> > @@ -155,13 +156,15 @@ static int tcp_write_timeout(struct sock *sk)
> >> >>> >                        const int alive = (icsk->icsk_rto < TCP_RTO_MAX);
> >> >>> >
> >> >>> >                        retry_until = tcp_orphan_retries(sk, alive);
> >> >>> > +                       do_reset = alive ||
> >> >>> > +                                  !retransmits_timed_out(sk, retry_until);
> >> >>> >
> >> >>> > -                       if (tcp_out_of_resources(sk, alive ||
> >> >>> > icsk->icsk_retransmits < retry_until))
> >> >>> > +                       if (tcp_out_of_resources(sk, do_reset))
> >> >>> >                                return 1;
> >> >>> >                }
> >> >>> >        }
> >> >>> >
> >> >>> > -       if (icsk->icsk_retransmits >= retry_until) {
> >> >>> > +       if (retransmits_timed_out(sk, retry_until)) {
> >> >>> >                /* Has it gone just too far? */
> >> >>> >                tcp_write_err(sk);
> >> >>> >                return 1;
> >> >>> > @@ -385,7 +388,7 @@ void tcp_retransmit_timer(struct sock *sk)
> >> >>> >  out_reset_timer:
> >> >>> >        icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX);
> >> >>> >        inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, icsk->icsk_rto,
> >> >>> > TCP_RTO_MAX);
> >> >>> > -       if (icsk->icsk_retransmits > sysctl_tcp_retries1)
> >> >>> > +       if (retransmits_timed_out(sk, sysctl_tcp_retries1 + 1))
> >> >>> >                __sk_dst_reset(sk);
> >> >>> >
> >> >>> >  out:;
> >> >>> > --
> >> >>> > 1.6.3.3
> >> >>> >
> >> >>> > --
> >> >>> > To unsubscribe from this list: send the line "unsubscribe netdev" in
> >> >>> > the body of a message to majordomo@vger.kernel.org
> >> >>> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >> >>> >
> >> >>
> >> >>
> >
> >

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/3] Revert Backoff [v3]: Calculate TCP's connection close threshold as a time value.
  2012-06-06 20:35               ` Damian Lukowski
@ 2012-06-06 22:41                 ` Yuchung Cheng
  0 siblings, 0 replies; 19+ messages in thread
From: Yuchung Cheng @ 2012-06-06 22:41 UTC (permalink / raw)
  To: Damian Lukowski; +Cc: Jerry Chu, Netdev, David Miller, Ilpo Järvinen

On Wed, Jun 6, 2012 at 1:35 PM, Damian Lukowski
<damian@tvk.rwth-aachen.de> wrote:
> Hi,
>
> backoffs are reverted when an RTO retransmission triggers an ICMP
> destination unreachable along the path towards the target.
> Consider A --- R --- B, where A and B are TCP endpoints, and R is some
> router in between. When the link between R and B breaks for a longer
> time, A will perform RTO retransmissions if there are outstanding ACKs.
> Those packets which arrive at R will hopefully trigger an ICMP response
> back towards A, as R has no more route towards B. The ICMP packet is an
> indication for A, that the retransmission has not been lost because of
> congestion but because of a link outage, and the backoff will be
> reverted for the corresponding TCP session. In the best case, every RTO
> retransmission triggers an ICMP response, so every backoff is reverted,
> and the time between retransmissions remains at the original value.
> If icsk_retransmits is decremented at this point within the original
> code's logic, the connection might never time out. And we cannot take
> tcp_retriesX literally here, as the above scenario would time out after
> tcp_retries2 x base_rto, where the base_rto might be as small as 0.2
> seconds.
so wouldn't upping the threshold counter a (much) simpler solution?

IMO, this is a valid concern and listed in the TODOs of the expired LCD draft.
But it creates a new rule that's not compatible with others in the stack: an RTO
backoff can be reverted but TCP should still timeout based on exp.
backoff rule. But
we can also do linear RTOs and use counter-based checks. End result:
unpredictable
and unclear timeout behavior.

I second Jerry's idea to revert back to simple counter checks.

>
> I am not sure, how an additional counter variable should help. You still
> cannot take tcp_retriesX literally. Besides, I think that changing the
> socket structure is too heavy machinery, isn't it?

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2012-06-06 22:41 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-08-26 10:16 [PATCH 3/3] Revert Backoff [v3]: Calculate TCP's connection close threshold as a time value Damian Lukowski
2009-08-31 13:26 ` Ilpo Järvinen
2009-09-01 11:28   ` Damian Lukowski
2009-09-01 13:41     ` Ilpo Järvinen
2009-08-31 14:01 ` Eric Dumazet
2009-08-31 14:04   ` Eric Dumazet
2009-09-01 11:49   ` Damian Lukowski
2009-09-01 12:25     ` Eric Dumazet
2009-09-01 13:23       ` Ilpo Järvinen
     [not found] ` <CAFbMe2Mu46N8kRrYkGzYRLuntEd2J9aO_d0Jw_y0gsQA4q-qHw@mail.gmail.com>
2012-06-01 22:58   ` Jerry Chu
2012-06-04 17:50     ` Damian Lukowski
2012-06-04 19:18       ` Ilpo Järvinen
2012-06-04 23:57         ` Jerry Chu
2012-06-04 23:50       ` Jerry Chu
2012-06-05 17:42         ` Jerry Chu
2012-06-05 18:39           ` Damian Lukowski
2012-06-05 21:22             ` Jerry Chu
2012-06-06 20:35               ` Damian Lukowski
2012-06-06 22:41                 ` Yuchung Cheng

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.