* [PATCH net-next] tcp: use RFC6298 compliant TCP RTO calculation
@ 2016-06-13 20:45 Daniel Metz
  2016-06-13 21:19 ` Eric Dumazet
  2016-06-13 22:38 ` Yuchung Cheng
  0 siblings, 2 replies; 22+ messages in thread
From: Daniel Metz @ 2016-06-13 20:45 UTC (permalink / raw)
  To: netdev
  Cc: Daniel Metz, Hagen Paul Pfeifer, Eric Dumazet, Yuchung Cheng,
	Neal Cardwell

This patch adjusts the Linux RTO calculation to be RFC 6298
compliant. MinRTO is no longer added to the computed RTO, and RTO
damping and overestimation are reduced.

In the RFC 6298 standard TCP Retransmission Timeout (RTO) calculation,
the computed RTO is rounded up to the minimum RTO (MinRTO) if it is
less. The Linux implementation, in a departure from the standard,
effectively adds the defined MinRTO to the computed RTO. When the two
approaches are compared, the Linux calculation performs worse for
sender-limited TCP flows such as Telnet, SSH or constant-bit-rate
encoded transmissions, especially for Round Trip Times (RTT) of
50ms to 800ms.

Compared to the Linux implementation, the RFC 6298 RTO calculation
performs better and adapts more precisely to current network
characteristics. Extensive measurements for bulk data did not show a
negative impact from the adjusted calculation.

Exemplary performance comparison for sender-limited flows:

- Rate: 10Mbit/s
- Delay: 200ms, Delay Variation: 10ms
- Time between each scheduled segment: 1s
- Number of Data Segments: 300
- Mean of 11 runs

         Mean Response Waiting Time [milliseconds]

PER [%] |   0.5      1    1.5      2      3      5      7     10
--------+-------------------------------------------------------
old     | 206.4  208.6  218.0  218.6  227.2  249.3  274.7  308.2
new     | 203.9  206.0  207.0  209.9  217.3  225.6  238.7  259.1


Detailed Analysis:

https://docs.google.com/document/d/1pKmPfnQb6fDK4qpiNVkN8cQyGE4wYDZukcuZfR-BnnM/
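For illustration, the two MinRTO policies contrasted above can be
sketched in plain, unscaled C. This is a hypothetical model, not kernel
code: the function names and the fixed 200ms MinRTO are illustrative
stand-ins.

```c
#include <assert.h>

#define MIN_RTO_US 200000u	/* 200 ms, stand-in for TCP_RTO_MIN */

/* Old Linux behaviour (sketch): the variance term is floored at MinRTO,
 * so on low-variance paths MinRTO is effectively added on top of the
 * smoothed RTT. */
unsigned int rto_linux_old(unsigned int srtt_us, unsigned int mdev_us)
{
	unsigned int var = 4 * mdev_us;

	if (var < MIN_RTO_US)
		var = MIN_RTO_US;
	return srtt_us + var;
}

/* RFC 6298 (sketch): compute SRTT + 4*RTTVAR first, then round the
 * result up to MinRTO only if it is smaller. */
unsigned int rto_rfc6298(unsigned int srtt_us, unsigned int rttvar_us)
{
	unsigned int rto = srtt_us + 4 * rttvar_us;

	return rto < MIN_RTO_US ? MIN_RTO_US : rto;
}
```

For SRTT = 300ms and a 10ms deviation, the old formula yields 500ms
while the RFC formula yields a much tighter 340ms; this is the regime
the table above measures.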


Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net>
Signed-off-by: Daniel Metz <dmetz@mytum.de>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
---

We removed outdated comments in the code, though more cleanup may be required.


 net/ipv4/tcp_input.c | 74 ++++++++++++----------------------------------------
 1 file changed, 17 insertions(+), 57 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index d6c8f4cd0..a0f66f8 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -680,8 +680,7 @@ static void tcp_event_data_recv(struct sock *sk, struct sk_buff *skb)
 /* Called to compute a smoothed rtt estimate. The data fed to this
  * routine either comes from timestamps, or from segments that were
  * known _not_ to have been retransmitted [see Karn/Partridge
- * Proceedings SIGCOMM 87]. The algorithm is from the SIGCOMM 88
- * piece by Van Jacobson.
+ * Proceedings SIGCOMM 87].
  * NOTE: the next three routines used to be one big routine.
  * To save cycles in the RFC 1323 implementation it was better to break
  * it up into three procedures. -- erics
@@ -692,59 +691,21 @@ static void tcp_rtt_estimator(struct sock *sk, long mrtt_us)
 	long m = mrtt_us; /* RTT */
 	u32 srtt = tp->srtt_us;
 
-	/*	The following amusing code comes from Jacobson's
-	 *	article in SIGCOMM '88.  Note that rtt and mdev
-	 *	are scaled versions of rtt and mean deviation.
-	 *	This is designed to be as fast as possible
-	 *	m stands for "measurement".
-	 *
-	 *	On a 1990 paper the rto value is changed to:
-	 *	RTO = rtt + 4 * mdev
-	 *
-	 * Funny. This algorithm seems to be very broken.
-	 * These formulae increase RTO, when it should be decreased, increase
-	 * too slowly, when it should be increased quickly, decrease too quickly
-	 * etc. I guess in BSD RTO takes ONE value, so that it is absolutely
-	 * does not matter how to _calculate_ it. Seems, it was trap
-	 * that VJ failed to avoid. 8)
-	 */
 	if (srtt != 0) {
-		m -= (srtt >> 3);	/* m is now error in rtt est */
-		srtt += m;		/* rtt = 7/8 rtt + 1/8 new */
-		if (m < 0) {
-			m = -m;		/* m is now abs(error) */
-			m -= (tp->mdev_us >> 2);   /* similar update on mdev */
-			/* This is similar to one of Eifel findings.
-			 * Eifel blocks mdev updates when rtt decreases.
-			 * This solution is a bit different: we use finer gain
-			 * for mdev in this case (alpha*beta).
-			 * Like Eifel it also prevents growth of rto,
-			 * but also it limits too fast rto decreases,
-			 * happening in pure Eifel.
-			 */
-			if (m > 0)
-				m >>= 3;
-		} else {
-			m -= (tp->mdev_us >> 2);   /* similar update on mdev */
-		}
-		tp->mdev_us += m;		/* mdev = 3/4 mdev + 1/4 new */
-		if (tp->mdev_us > tp->mdev_max_us) {
-			tp->mdev_max_us = tp->mdev_us;
-			if (tp->mdev_max_us > tp->rttvar_us)
-				tp->rttvar_us = tp->mdev_max_us;
-		}
-		if (after(tp->snd_una, tp->rtt_seq)) {
-			if (tp->mdev_max_us < tp->rttvar_us)
-				tp->rttvar_us -= (tp->rttvar_us - tp->mdev_max_us) >> 2;
+		m -= (srtt >> 3);	 /* m' = m - srtt / 8 = (R' - SRTT) */
+		srtt += m;		 /* srtt = srtt + m' = srtt + m - srtt / 8 */
+		if (m < 0)
+			m = -m;
+		m -= (tp->mdev_us >> 2); /* m'' = |m'| - mdev / 4 */
+		tp->mdev_us += m;
+		tp->rttvar_us = tp->mdev_us;
+		if (after(tp->snd_una, tp->rtt_seq))
 			tp->rtt_seq = tp->snd_nxt;
-			tp->mdev_max_us = tcp_rto_min_us(sk);
-		}
 	} else {
 		/* no previous measure. */
-		srtt = m << 3;		/* take the measured time to be rtt */
-		tp->mdev_us = m << 1;	/* make sure rto = 3*rtt */
-		tp->rttvar_us = max(tp->mdev_us, tcp_rto_min_us(sk));
-		tp->mdev_max_us = tp->rttvar_us;
+		srtt = m << 3;
+		tp->mdev_us = m << 1;
+		tp->rttvar_us = tp->mdev_us;
 		tp->rtt_seq = tp->snd_nxt;
 	}
 	tp->srtt_us = max(1U, srtt);
@@ -798,6 +759,7 @@ static void tcp_update_pacing_rate(struct sock *sk)
  */
 static void tcp_set_rto(struct sock *sk)
 {
+	const u32 min_rto = tcp_rto_min_us(sk);
 	const struct tcp_sock *tp = tcp_sk(sk);
 	/* Old crap is replaced with new one. 8)
 	 *
@@ -809,17 +771,15 @@ static void tcp_set_rto(struct sock *sk)
 	 *    is invisible. Actually, Linux-2.4 also generates erratic
 	 *    ACKs in some circumstances.
 	 */
-	inet_csk(sk)->icsk_rto = __tcp_set_rto(tp);
-
+	if (((tp->srtt_us >> 3) + tp->rttvar_us) < min_rto)
+		inet_csk(sk)->icsk_rto = usecs_to_jiffies(min_rto);
+	else
+		inet_csk(sk)->icsk_rto = __tcp_set_rto(tp);
 	/* 2. Fixups made earlier cannot be right.
 	 *    If we do not estimate RTO correctly without them,
 	 *    all the algo is pure shit and should be replaced
 	 *    with correct one. It is exactly, which we pretend to do.
 	 */
-
-	/* NOTE: clamping at TCP_RTO_MIN is not required, current algo
-	 * guarantees that rto is higher.
-	 */
 	tcp_bound_rto(sk);
 }
 
-- 
2.8.3

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH net-next] tcp: use RFC6298 compliant TCP RTO calculation
  2016-06-13 20:45 [PATCH net-next] tcp: use RFC6298 compliant TCP RTO calculation Daniel Metz
@ 2016-06-13 21:19 ` Eric Dumazet
  2016-06-13 22:38 ` Yuchung Cheng
  1 sibling, 0 replies; 22+ messages in thread
From: Eric Dumazet @ 2016-06-13 21:19 UTC (permalink / raw)
  To: Daniel Metz
  Cc: netdev, Hagen Paul Pfeifer, Eric Dumazet, Yuchung Cheng, Neal Cardwell

On Mon, 2016-06-13 at 22:45 +0200, Daniel Metz wrote:
> This patch adjusts the Linux RTO calculation to be RFC 6298
> compliant. MinRTO is no longer added to the computed RTO, and RTO
> damping and overestimation are reduced.
>
> In the RFC 6298 standard TCP Retransmission Timeout (RTO) calculation,
> the computed RTO is rounded up to the minimum RTO (MinRTO) if it is
> less. The Linux implementation, in a departure from the standard,
> effectively adds the defined MinRTO to the computed RTO. When the two
> approaches are compared, the Linux calculation performs worse for
> sender-limited TCP flows such as Telnet, SSH or constant-bit-rate
> encoded transmissions, especially for Round Trip Times (RTT) of
> 50ms to 800ms.
>
> Compared to the Linux implementation, the RFC 6298 RTO calculation
> performs better and adapts more precisely to current network
> characteristics. Extensive measurements for bulk data did not show a
> negative impact from the adjusted calculation.
>
> Exemplary performance comparison for sender-limited flows:
>
> - Rate: 10Mbit/s
> - Delay: 200ms, Delay Variation: 10ms
> - Time between each scheduled segment: 1s
> - Number of Data Segments: 300
> - Mean of 11 runs
> 
>          Mean Response Waiting Time [milliseconds]
> 
> PER [%] |   0.5      1    1.5      2      3      5      7     10
> --------+-------------------------------------------------------
> old     | 206.4  208.6  218.0  218.6  227.2  249.3  274.7  308.2
> new     | 203.9  206.0  207.0  209.9  217.3  225.6  238.7  259.1
> 
> 
> Detailed Analysis:
> 
> https://docs.google.com/document/d/1pKmPfnQb6fDK4qpiNVkN8cQyGE4wYDZukcuZfR-BnnM/
> 
> 
> Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net>
> Signed-off-by: Daniel Metz <dmetz@mytum.de>
> Cc: Eric Dumazet <edumazet@google.com>
> Cc: Yuchung Cheng <ycheng@google.com>
> Cc: Neal Cardwell <ncardwell@google.com>
> ---
> 
> We removed outdated comments in the code, though more cleanup may be required.
> 

It seems you left mdev_max_us ?


* Re: [PATCH net-next] tcp: use RFC6298 compliant TCP RTO calculation
  2016-06-13 20:45 [PATCH net-next] tcp: use RFC6298 compliant TCP RTO calculation Daniel Metz
  2016-06-13 21:19 ` Eric Dumazet
@ 2016-06-13 22:38 ` Yuchung Cheng
  2016-06-14  6:17   ` Hagen Paul Pfeifer
  1 sibling, 1 reply; 22+ messages in thread
From: Yuchung Cheng @ 2016-06-13 22:38 UTC (permalink / raw)
  To: Daniel Metz
  Cc: netdev, Hagen Paul Pfeifer, Eric Dumazet, Neal Cardwell,
	Pasi Sarolahti, Van Jacobson

On Mon, Jun 13, 2016 at 1:45 PM, Daniel Metz <dmetz@mytum.de> wrote:
> This patch adjusts the Linux RTO calculation to be RFC 6298
> compliant. MinRTO is no longer added to the computed RTO, and RTO
> damping and overestimation are reduced.
>
> In the RFC 6298 standard TCP Retransmission Timeout (RTO) calculation,
> the computed RTO is rounded up to the minimum RTO (MinRTO) if it is
> less. The Linux implementation, in a departure from the standard,
> effectively adds the defined MinRTO to the computed RTO. When the two
> approaches are compared, the Linux calculation performs worse for
> sender-limited TCP flows such as Telnet, SSH or constant-bit-rate
> encoded transmissions, especially for Round Trip Times (RTT) of
> 50ms to 800ms.
>
> Compared to the Linux implementation, the RFC 6298 RTO calculation
> performs better and adapts more precisely to current network
> characteristics. Extensive measurements for bulk data did not show a
> negative impact from the adjusted calculation.
>
> Exemplary performance comparison for sender-limited flows:
>
> - Rate: 10Mbit/s
> - Delay: 200ms, Delay Variation: 10ms
> - Time between each scheduled segment: 1s
> - Number of Data Segments: 300
> - Mean of 11 runs
>
>          Mean Response Waiting Time [milliseconds]
>
> PER [%] |   0.5      1    1.5      2      3      5      7     10
> --------+-------------------------------------------------------
> old     | 206.4  208.6  218.0  218.6  227.2  249.3  274.7  308.2
> new     | 203.9  206.0  207.0  209.9  217.3  225.6  238.7  259.1
>
>
> Detailed Analysis:
>
> https://docs.google.com/document/d/1pKmPfnQb6fDK4qpiNVkN8cQyGE4wYDZukcuZfR-BnnM/
Thanks for the patch. I also have long wanted to evaluate Linux's RTO vs RFC's.

Since this is not a small change, and your patch has only been tested
on an emulation-based testbed AFAICT, I'd like to try your patch on
Google servers to get more data. But this would take a few days to
set up & collect.

Note that this paper
https://www.cs.helsinki.fi/research/iwtcp/papers/linuxtcp.pdf has
a detailed rationale for the current design (section 4). IMO having a
"tight" RTO is less necessary now after TLP. I am also testing a new
set of patches to install a quick reordering timer. But it's worth
mentioning the paper in the commit message.


>
>
> Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net>
> Signed-off-by: Daniel Metz <dmetz@mytum.de>
> Cc: Eric Dumazet <edumazet@google.com>
> Cc: Yuchung Cheng <ycheng@google.com>
> Cc: Neal Cardwell <ncardwell@google.com>
> ---
>
> We removed outdated comments in the code, though more cleanup may be required.
>
>
>  net/ipv4/tcp_input.c | 74 ++++++++++++----------------------------------------
>  1 file changed, 17 insertions(+), 57 deletions(-)
>
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index d6c8f4cd0..a0f66f8 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -680,8 +680,7 @@ static void tcp_event_data_recv(struct sock *sk, struct sk_buff *skb)
>  /* Called to compute a smoothed rtt estimate. The data fed to this
>   * routine either comes from timestamps, or from segments that were
>   * known _not_ to have been retransmitted [see Karn/Partridge
> - * Proceedings SIGCOMM 87]. The algorithm is from the SIGCOMM 88
> - * piece by Van Jacobson.
> + * Proceedings SIGCOMM 87].
>   * NOTE: the next three routines used to be one big routine.
>   * To save cycles in the RFC 1323 implementation it was better to break
>   * it up into three procedures. -- erics
> @@ -692,59 +691,21 @@ static void tcp_rtt_estimator(struct sock *sk, long mrtt_us)
>         long m = mrtt_us; /* RTT */
>         u32 srtt = tp->srtt_us;
>
> -       /*      The following amusing code comes from Jacobson's
> -        *      article in SIGCOMM '88.  Note that rtt and mdev
> -        *      are scaled versions of rtt and mean deviation.
> -        *      This is designed to be as fast as possible
> -        *      m stands for "measurement".
> -        *
> -        *      On a 1990 paper the rto value is changed to:
> -        *      RTO = rtt + 4 * mdev
> -        *
> -        * Funny. This algorithm seems to be very broken.
> -        * These formulae increase RTO, when it should be decreased, increase
> -        * too slowly, when it should be increased quickly, decrease too quickly
> -        * etc. I guess in BSD RTO takes ONE value, so that it is absolutely
> -        * does not matter how to _calculate_ it. Seems, it was trap
> -        * that VJ failed to avoid. 8)
> -        */
>         if (srtt != 0) {
> -               m -= (srtt >> 3);       /* m is now error in rtt est */
> -               srtt += m;              /* rtt = 7/8 rtt + 1/8 new */
> -               if (m < 0) {
> -                       m = -m;         /* m is now abs(error) */
> -                       m -= (tp->mdev_us >> 2);   /* similar update on mdev */
> -                       /* This is similar to one of Eifel findings.
> -                        * Eifel blocks mdev updates when rtt decreases.
> -                        * This solution is a bit different: we use finer gain
> -                        * for mdev in this case (alpha*beta).
> -                        * Like Eifel it also prevents growth of rto,
> -                        * but also it limits too fast rto decreases,
> -                        * happening in pure Eifel.
> -                        */
> -                       if (m > 0)
> -                               m >>= 3;
> -               } else {
> -                       m -= (tp->mdev_us >> 2);   /* similar update on mdev */
> -               }
> -               tp->mdev_us += m;               /* mdev = 3/4 mdev + 1/4 new */
> -               if (tp->mdev_us > tp->mdev_max_us) {
> -                       tp->mdev_max_us = tp->mdev_us;
> -                       if (tp->mdev_max_us > tp->rttvar_us)
> -                               tp->rttvar_us = tp->mdev_max_us;
> -               }
> -               if (after(tp->snd_una, tp->rtt_seq)) {
> -                       if (tp->mdev_max_us < tp->rttvar_us)
> -                               tp->rttvar_us -= (tp->rttvar_us - tp->mdev_max_us) >> 2;
> +               m -= (srtt >> 3);        /* m' = m - srtt / 8 = (R' - SRTT) */
> +               srtt += m;               /* srtt = srtt + m' = srtt + m - srtt / 8 */
> +               if (m < 0)
> +                       m = -m;
> +               m -= (tp->mdev_us >> 2); /* m'' = |m'| - mdev / 4 */
> +               tp->mdev_us += m;
> +               tp->rttvar_us = tp->mdev_us;
> +               if (after(tp->snd_una, tp->rtt_seq))
>                         tp->rtt_seq = tp->snd_nxt;
> -                       tp->mdev_max_us = tcp_rto_min_us(sk);
> -               }
>         } else {
>                 /* no previous measure. */
> -               srtt = m << 3;          /* take the measured time to be rtt */
> -               tp->mdev_us = m << 1;   /* make sure rto = 3*rtt */
> -               tp->rttvar_us = max(tp->mdev_us, tcp_rto_min_us(sk));
> -               tp->mdev_max_us = tp->rttvar_us;
> +               srtt = m << 3;
> +               tp->mdev_us = m << 1;
> +               tp->rttvar_us = tp->mdev_us;
>                 tp->rtt_seq = tp->snd_nxt;
>         }
>         tp->srtt_us = max(1U, srtt);
> @@ -798,6 +759,7 @@ static void tcp_update_pacing_rate(struct sock *sk)
>   */
>  static void tcp_set_rto(struct sock *sk)
>  {
> +       const u32 min_rto = tcp_rto_min_us(sk);
>         const struct tcp_sock *tp = tcp_sk(sk);
>         /* Old crap is replaced with new one. 8)
>          *
> @@ -809,17 +771,15 @@ static void tcp_set_rto(struct sock *sk)
>          *    is invisible. Actually, Linux-2.4 also generates erratic
>          *    ACKs in some circumstances.
>          */
> -       inet_csk(sk)->icsk_rto = __tcp_set_rto(tp);
> -
> +       if (((tp->srtt_us >> 3) + tp->rttvar_us) < min_rto)
> +               inet_csk(sk)->icsk_rto = usecs_to_jiffies(min_rto);
> +       else
> +               inet_csk(sk)->icsk_rto = __tcp_set_rto(tp);
>         /* 2. Fixups made earlier cannot be right.
>          *    If we do not estimate RTO correctly without them,
>          *    all the algo is pure shit and should be replaced
>          *    with correct one. It is exactly, which we pretend to do.
>          */
> -
> -       /* NOTE: clamping at TCP_RTO_MIN is not required, current algo
> -        * guarantees that rto is higher.
> -        */
>         tcp_bound_rto(sk);
>  }
>
> --
> 2.8.3
>


* Re: [PATCH net-next] tcp: use RFC6298 compliant TCP RTO calculation
  2016-06-13 22:38 ` Yuchung Cheng
@ 2016-06-14  6:17   ` Hagen Paul Pfeifer
  2016-06-14 17:58     ` Yuchung Cheng
  0 siblings, 1 reply; 22+ messages in thread
From: Hagen Paul Pfeifer @ 2016-06-14  6:17 UTC (permalink / raw)
  To: Yuchung Cheng
  Cc: Daniel Metz, netdev, Eric Dumazet, Neal Cardwell, Pasi Sarolahti,
	Van Jacobson

* Yuchung Cheng | 2016-06-13 15:38:24 [-0700]:

Hey Eric, Yuchung,

regarding the missed mdev_max_us: internal communication problem. Daniel
will respin a v2 removing the no-longer-required mdev_max_us.

>Thanks for the patch. I also have long wanted to evaluate Linux's RTO vs RFC's.
>
>Since this is not a small change, and your patch has only been tested
>on an emulation-based testbed AFAICT, I'd like to try your patch on
>Google servers to get more data. But this would take a few days to
>set up & collect.

Great - no hurry! We have tried hard to find downsides of RFC 6298, so
far without result. If you have any specific, concrete tests in mind,
Daniel will test them!

>Note that this paper
>https://www.cs.helsinki.fi/research/iwtcp/papers/linuxtcp.pdf has
>a detailed rationale for the current design (section 4). IMO having a
>"tight" RTO is less necessary now after TLP. I am also testing a new
>set of patches to install a quick reordering timer. But it's worth
>mentioning the paper in the commit message.

We had difficulty finding scenarios where the RTO kicks in. In the
majority of use cases, duplicate ACKs trigger TCP retransmission. For
bulk data transmissions almost 100% of retransmissions are triggered
by duplicate ACKs (except at connection teardown). TLP will reduce the
need for the RTO even further, and window probes also help sometimes.
The use case we identified was sender-limited, non-continuous flows,
where an RFC 6298 compliant implementation is better.

Thank you Yuchung, we will add a reference in v2.

Hagen


* Re: [PATCH net-next] tcp: use RFC6298 compliant TCP RTO calculation
  2016-06-14  6:17   ` Hagen Paul Pfeifer
@ 2016-06-14 17:58     ` Yuchung Cheng
  2016-06-14 19:18       ` [PATCH net-next v2] " Daniel Metz
  0 siblings, 1 reply; 22+ messages in thread
From: Yuchung Cheng @ 2016-06-14 17:58 UTC (permalink / raw)
  To: Hagen Paul Pfeifer
  Cc: Daniel Metz, netdev, Eric Dumazet, Neal Cardwell, Pasi Sarolahti,
	Van Jacobson

On Mon, Jun 13, 2016 at 11:17 PM, Hagen Paul Pfeifer <hagen@jauu.net> wrote:
>
> * Yuchung Cheng | 2016-06-13 15:38:24 [-0700]:
>
> Hey Eric, Yuchung,
>
> regarding the missed mdev_max_us: internal communication problem. Daniel
> will respin a v2 removing the no-longer-required mdev_max_us.
>
> >Thanks for the patch. I also have long wanted to evaluate Linux's RTO vs RFC's.
> >
> >Since this is not a small change, and your patch has only been tested
> >on an emulation-based testbed AFAICT, I'd like to try your patch on
> >Google servers to get more data. But this would take a few days to
> >set up & collect.
>
> Great - no hurry! We have tried hard to find downsides of RFC 6298, so
> far without result. If you have any specific, concrete tests in mind,
> Daniel will test them!
>
> >Note that this paper
> >https://www.cs.helsinki.fi/research/iwtcp/papers/linuxtcp.pdf has
> >a detailed rationale for the current design (section 4). IMO having a
> >"tight" RTO is less necessary now after TLP. I am also testing a new
> >set of patches to install a quick reordering timer. But it's worth
> >mentioning the paper in the commit message.
>
> We had difficulty finding scenarios where the RTO kicks in. In the
> majority of use cases, duplicate ACKs trigger TCP retransmission. For
> bulk data transmissions almost 100% of retransmissions are triggered
> by duplicate ACKs (except at connection teardown). TLP will reduce the
> need for the RTO even further, and window probes also help sometimes.
> The use case we identified was sender-limited, non-continuous flows,
> where an RFC 6298 compliant implementation is better.
>
> Thank you Yuchung, we will add a reference in v2.
Sounds good. I will wait for v2 to do the experiment.
>
> Hagen


* [PATCH net-next v2] tcp: use RFC6298 compliant TCP RTO calculation
  2016-06-14 17:58     ` Yuchung Cheng
@ 2016-06-14 19:18       ` Daniel Metz
  2016-06-14 21:33         ` Yuchung Cheng
  0 siblings, 1 reply; 22+ messages in thread
From: Daniel Metz @ 2016-06-14 19:18 UTC (permalink / raw)
  To: netdev
  Cc: Daniel Metz, Hagen Paul Pfeifer, Daniel Metz, Eric Dumazet,
	Yuchung Cheng

From: Daniel Metz <daniel.metz@rohde-schwarz.com>

This patch adjusts the Linux RTO calculation to be RFC 6298
compliant. MinRTO is no longer added to the computed RTO, and RTO
damping and overestimation are reduced.

In the RFC 6298 standard TCP Retransmission Timeout (RTO) calculation,
the computed RTO is rounded up to the minimum RTO (MinRTO) if it is
less. The Linux implementation, in a departure from the standard,
effectively adds the defined MinRTO to the computed RTO. When the two
approaches are compared, the Linux calculation performs worse for
sender-limited TCP flows such as Telnet, SSH or constant-bit-rate
encoded transmissions, especially for Round Trip Times (RTT) of
50ms to 800ms.

Compared to the Linux implementation, the RFC 6298 RTO calculation
performs better and adapts more precisely to current network
characteristics. Extensive measurements for bulk data did not show a
negative impact from the adjusted calculation.

Exemplary performance comparison for sender-limited flows:

- Rate: 10Mbit/s
- Delay: 200ms, Delay Variation: 10ms
- Time between each scheduled segment: 1s
- Number of Data Segments: 300
- Mean of 11 runs

         Mean Response Waiting Time [milliseconds]

PER [%] |   0.5      1    1.5      2      3      5      7     10
--------+-------------------------------------------------------
old     | 206.4  208.6  218.0  218.6  227.2  249.3  274.7  308.2
new     | 203.9  206.0  207.0  209.9  217.3  225.6  238.7  259.1


Detailed analysis:
https://docs.google.com/document/d/1pKmPfnQb6fDK4qpiNVkN8cQyGE4wYDZukcuZfR-BnnM/

Reasoning for historic design:
Sarolahti, P.; Kuznetsov, A. (2002). Congestion Control in Linux TCP.
Conference Paper. Proceedings of the FREENIX Track. 2002 USENIX Annual
https://www.cs.helsinki.fi/research/iwtcp/papers/linuxtcp.pdf
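For reference, the arithmetic of the rewritten tcp_rtt_estimator() can
be exercised stand-alone. The user-space sketch below mirrors the patch
under the same scaling conventions (srtt kept scaled by 8, mdev
pre-scaled by 4, so the RFC 6298 gains alpha = 1/8 and beta = 1/4
reduce to shifts); the struct and function names are illustrative, not
kernel identifiers.

```c
#include <assert.h>

struct rtt_est {
	unsigned int srtt;	/* smoothed RTT << 3, usecs */
	unsigned int mdev;	/* mean deviation << 2, usecs */
};

/* Feed one RTT measurement m (usecs), mirroring the patched
 * tcp_rtt_estimator(). */
void rtt_sample(struct rtt_est *e, long m)
{
	if (e->srtt != 0) {
		m -= (e->srtt >> 3);	/* m' = R' - SRTT */
		e->srtt += m;		/* SRTT = 7/8 SRTT + 1/8 R' */
		if (m < 0)
			m = -m;
		m -= (e->mdev >> 2);	/* m'' = |m'| - RTTVAR */
		e->mdev += m;		/* RTTVAR = 3/4 RTTVAR + 1/4 |m'| */
	} else {
		e->srtt = m << 3;	/* first sample: SRTT = R' */
		e->mdev = m << 1;	/* RTTVAR = R'/2, i.e. RTO = 3 * R' */
	}
}

/* RTO before the MinRTO clamp, as computed in __tcp_set_rto(). */
unsigned int est_rto(const struct rtt_est *e)
{
	return (e->srtt >> 3) + e->mdev;
}
```

On a steady path the estimator converges: after repeated identical
samples, srtt >> 3 settles at the measured RTT and mdev decays toward
zero, so the unclamped RTO approaches the RTT and the MinRTO clamp in
tcp_set_rto() takes over.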


Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net>
Signed-off-by: Daniel Metz <dmetz@mytum.de>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
---

v2:
 - With the RFC 6298 compliant implementation, the tcp_sock struct member
   u32 mdev_max_us becomes obsolete and is removed.
 - Add a reference to the Sarolahti/Kuznetsov paper


 include/linux/tcp.h    |  1 -
 net/ipv4/tcp_input.c   | 74 ++++++++++++--------------------------------------
 net/ipv4/tcp_metrics.c |  2 +-
 3 files changed, 18 insertions(+), 59 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 7be9b12..d1790c5 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -231,7 +231,6 @@ struct tcp_sock {
 /* RTT measurement */
 	u32	srtt_us;	/* smoothed round trip time << 3 in usecs */
 	u32	mdev_us;	/* medium deviation			*/
-	u32	mdev_max_us;	/* maximal mdev for the last rtt period	*/
 	u32	rttvar_us;	/* smoothed mdev_max			*/
 	u32	rtt_seq;	/* sequence number to update rttvar	*/
 	struct rtt_meas {
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 94d4aff..0d53537 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -680,8 +680,7 @@ static void tcp_event_data_recv(struct sock *sk, struct sk_buff *skb)
 /* Called to compute a smoothed rtt estimate. The data fed to this
  * routine either comes from timestamps, or from segments that were
  * known _not_ to have been retransmitted [see Karn/Partridge
- * Proceedings SIGCOMM 87]. The algorithm is from the SIGCOMM 88
- * piece by Van Jacobson.
+ * Proceedings SIGCOMM 87].
  * NOTE: the next three routines used to be one big routine.
  * To save cycles in the RFC 1323 implementation it was better to break
  * it up into three procedures. -- erics
@@ -692,59 +691,21 @@ static void tcp_rtt_estimator(struct sock *sk, long mrtt_us)
 	long m = mrtt_us; /* RTT */
 	u32 srtt = tp->srtt_us;
 
-	/*	The following amusing code comes from Jacobson's
-	 *	article in SIGCOMM '88.  Note that rtt and mdev
-	 *	are scaled versions of rtt and mean deviation.
-	 *	This is designed to be as fast as possible
-	 *	m stands for "measurement".
-	 *
-	 *	On a 1990 paper the rto value is changed to:
-	 *	RTO = rtt + 4 * mdev
-	 *
-	 * Funny. This algorithm seems to be very broken.
-	 * These formulae increase RTO, when it should be decreased, increase
-	 * too slowly, when it should be increased quickly, decrease too quickly
-	 * etc. I guess in BSD RTO takes ONE value, so that it is absolutely
-	 * does not matter how to _calculate_ it. Seems, it was trap
-	 * that VJ failed to avoid. 8)
-	 */
 	if (srtt != 0) {
-		m -= (srtt >> 3);	/* m is now error in rtt est */
-		srtt += m;		/* rtt = 7/8 rtt + 1/8 new */
-		if (m < 0) {
-			m = -m;		/* m is now abs(error) */
-			m -= (tp->mdev_us >> 2);   /* similar update on mdev */
-			/* This is similar to one of Eifel findings.
-			 * Eifel blocks mdev updates when rtt decreases.
-			 * This solution is a bit different: we use finer gain
-			 * for mdev in this case (alpha*beta).
-			 * Like Eifel it also prevents growth of rto,
-			 * but also it limits too fast rto decreases,
-			 * happening in pure Eifel.
-			 */
-			if (m > 0)
-				m >>= 3;
-		} else {
-			m -= (tp->mdev_us >> 2);   /* similar update on mdev */
-		}
-		tp->mdev_us += m;		/* mdev = 3/4 mdev + 1/4 new */
-		if (tp->mdev_us > tp->mdev_max_us) {
-			tp->mdev_max_us = tp->mdev_us;
-			if (tp->mdev_max_us > tp->rttvar_us)
-				tp->rttvar_us = tp->mdev_max_us;
-		}
-		if (after(tp->snd_una, tp->rtt_seq)) {
-			if (tp->mdev_max_us < tp->rttvar_us)
-				tp->rttvar_us -= (tp->rttvar_us - tp->mdev_max_us) >> 2;
+		m -= (srtt >> 3);	 /* m' = m - srtt / 8 = (R' - SRTT) */
+		srtt += m;	/* srtt = srtt + m' = srtt + m - srtt / 8 */
+		if (m < 0)
+			m = -m;
+		m -= (tp->mdev_us >> 2); /* m'' = |m'| - mdev / 4 */
+		tp->mdev_us += m;
+		tp->rttvar_us = tp->mdev_us;
+		if (after(tp->snd_una, tp->rtt_seq))
 			tp->rtt_seq = tp->snd_nxt;
-			tp->mdev_max_us = tcp_rto_min_us(sk);
-		}
 	} else {
 		/* no previous measure. */
-		srtt = m << 3;		/* take the measured time to be rtt */
-		tp->mdev_us = m << 1;	/* make sure rto = 3*rtt */
-		tp->rttvar_us = max(tp->mdev_us, tcp_rto_min_us(sk));
-		tp->mdev_max_us = tp->rttvar_us;
+		srtt = m << 3;
+		tp->mdev_us = m << 1;
+		tp->rttvar_us = tp->mdev_us;
 		tp->rtt_seq = tp->snd_nxt;
 	}
 	tp->srtt_us = max(1U, srtt);
@@ -798,6 +759,7 @@ static void tcp_update_pacing_rate(struct sock *sk)
  */
 static void tcp_set_rto(struct sock *sk)
 {
+	const u32 min_rto = tcp_rto_min_us(sk);
 	const struct tcp_sock *tp = tcp_sk(sk);
 	/* Old crap is replaced with new one. 8)
 	 *
@@ -809,17 +771,15 @@ static void tcp_set_rto(struct sock *sk)
 	 *    is invisible. Actually, Linux-2.4 also generates erratic
 	 *    ACKs in some circumstances.
 	 */
-	inet_csk(sk)->icsk_rto = __tcp_set_rto(tp);
-
+	if (((tp->srtt_us >> 3) + tp->rttvar_us) < min_rto)
+		inet_csk(sk)->icsk_rto = usecs_to_jiffies(min_rto);
+	else
+		inet_csk(sk)->icsk_rto = __tcp_set_rto(tp);
 	/* 2. Fixups made earlier cannot be right.
 	 *    If we do not estimate RTO correctly without them,
 	 *    all the algo is pure shit and should be replaced
 	 *    with correct one. It is exactly, which we pretend to do.
 	 */
-
-	/* NOTE: clamping at TCP_RTO_MIN is not required, current algo
-	 * guarantees that rto is higher.
-	 */
 	tcp_bound_rto(sk);
 }
 
diff --git a/net/ipv4/tcp_metrics.c b/net/ipv4/tcp_metrics.c
index b617826..7f59f5b 100644
--- a/net/ipv4/tcp_metrics.c
+++ b/net/ipv4/tcp_metrics.c
@@ -561,7 +561,7 @@ reset:
 		 * retransmission.
 		 */
 		tp->rttvar_us = jiffies_to_usecs(TCP_TIMEOUT_FALLBACK);
-		tp->mdev_us = tp->mdev_max_us = tp->rttvar_us;
+		tp->mdev_us = tp->rttvar_us;
 
 		inet_csk(sk)->icsk_rto = TCP_TIMEOUT_FALLBACK;
 	}
-- 
2.8.3


* Re: [PATCH net-next v2] tcp: use RFC6298 compliant TCP RTO calculation
  2016-06-14 19:18       ` [PATCH net-next v2] " Daniel Metz
@ 2016-06-14 21:33         ` Yuchung Cheng
  2016-06-15 17:41           ` Hagen Paul Pfeifer
  2016-06-15 18:00           ` [PATCH net-next v3] " Daniel Metz
  0 siblings, 2 replies; 22+ messages in thread
From: Yuchung Cheng @ 2016-06-14 21:33 UTC (permalink / raw)
  To: Daniel Metz; +Cc: netdev, Daniel Metz, Hagen Paul Pfeifer, Eric Dumazet

On Tue, Jun 14, 2016 at 12:18 PM, Daniel Metz <dmetz@mytum.de> wrote:
> From: Daniel Metz <daniel.metz@rohde-schwarz.com>
>
> This patch adjusts the Linux RTO calculation to be RFC 6298
> compliant. MinRTO is no longer added to the computed RTO, and RTO
> damping and overestimation are reduced.
>
> In the RFC 6298 standard TCP Retransmission Timeout (RTO) calculation,
> the computed RTO is rounded up to the minimum RTO (MinRTO) if it is
> less. The Linux implementation, in a departure from the standard,
> effectively adds the defined MinRTO to the computed RTO. When the two
> approaches are compared, the Linux calculation performs worse for
> sender-limited TCP flows such as Telnet, SSH or constant-bit-rate
> encoded transmissions, especially for Round Trip Times (RTT) of
> 50ms to 800ms.
>
> Compared to the Linux implementation, the RFC 6298 RTO calculation
> performs better and adapts more precisely to current network
> characteristics. Extensive measurements for bulk data did not show a
> negative impact from the adjusted calculation.
>
> Exemplary performance comparison for sender-limited flows:
>
> - Rate: 10Mbit/s
> - Delay: 200ms, Delay Variation: 10ms
> - Time between each scheduled segment: 1s
> - Number of Data Segments: 300
> - Mean of 11 runs
>
>          Mean Response Waiting Time [milliseconds]
>
> PER [%] |   0.5      1    1.5      2      3      5      7     10
> --------+-------------------------------------------------------
> old     | 206.4  208.6  218.0  218.6  227.2  249.3  274.7  308.2
> new     | 203.9  206.0  207.0  209.9  217.3  225.6  238.7  259.1
>
>
> Detailed analysis:
> https://docs.google.com/document/d/1pKmPfnQb6fDK4qpiNVkN8cQyGE4wYDZukcuZfR-BnnM/
>
> Reasoning for historic design:
> Sarolahti, P.; Kuznetsov, A. (2002). Congestion Control in Linux TCP.
> Conference Paper. Proceedings of the FREENIX Track. 2002 USENIX Annual
> https://www.cs.helsinki.fi/research/iwtcp/papers/linuxtcp.pdf
>
>
> Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net>
> Signed-off-by: Daniel Metz <dmetz@mytum.de>
> Cc: Eric Dumazet <edumazet@google.com>
> Cc: Yuchung Cheng <ycheng@google.com>
> ---
>
> v2:
>  - Using the RFC 6298 compliant implementation, the tcp_sock struct variable
>          u32 mdev_max_us becomes obsolete and consequently is being removed.
>  - Add reference to Kuznetsov paper
>
>
>  include/linux/tcp.h    |  1 -
>  net/ipv4/tcp_input.c   | 74 ++++++++++++--------------------------------------
>  net/ipv4/tcp_metrics.c |  2 +-
>  3 files changed, 18 insertions(+), 59 deletions(-)
>
> diff --git a/include/linux/tcp.h b/include/linux/tcp.h
> index 7be9b12..d1790c5 100644
> --- a/include/linux/tcp.h
> +++ b/include/linux/tcp.h
> @@ -231,7 +231,6 @@ struct tcp_sock {
>  /* RTT measurement */
>         u32     srtt_us;        /* smoothed round trip time << 3 in usecs */
>         u32     mdev_us;        /* medium deviation                     */
> -       u32     mdev_max_us;    /* maximal mdev for the last rtt period */
>         u32     rttvar_us;      /* smoothed mdev_max                    */
>         u32     rtt_seq;        /* sequence number to update rttvar     */
>         struct rtt_meas {
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 94d4aff..0d53537 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -680,8 +680,7 @@ static void tcp_event_data_recv(struct sock *sk, struct sk_buff *skb)
>  /* Called to compute a smoothed rtt estimate. The data fed to this
>   * routine either comes from timestamps, or from segments that were
>   * known _not_ to have been retransmitted [see Karn/Partridge
> - * Proceedings SIGCOMM 87]. The algorithm is from the SIGCOMM 88
> - * piece by Van Jacobson.
> + * Proceedings SIGCOMM 87].
>   * NOTE: the next three routines used to be one big routine.
>   * To save cycles in the RFC 1323 implementation it was better to break
>   * it up into three procedures. -- erics
> @@ -692,59 +691,21 @@ static void tcp_rtt_estimator(struct sock *sk, long mrtt_us)
>         long m = mrtt_us; /* RTT */
>         u32 srtt = tp->srtt_us;
>
> -       /*      The following amusing code comes from Jacobson's
> -        *      article in SIGCOMM '88.  Note that rtt and mdev
> -        *      are scaled versions of rtt and mean deviation.
> -        *      This is designed to be as fast as possible
> -        *      m stands for "measurement".
> -        *
> -        *      On a 1990 paper the rto value is changed to:
> -        *      RTO = rtt + 4 * mdev
> -        *
> -        * Funny. This algorithm seems to be very broken.
> -        * These formulae increase RTO, when it should be decreased, increase
> -        * too slowly, when it should be increased quickly, decrease too quickly
> -        * etc. I guess in BSD RTO takes ONE value, so that it is absolutely
> -        * does not matter how to _calculate_ it. Seems, it was trap
> -        * that VJ failed to avoid. 8)
> -        */
>         if (srtt != 0) {
> -               m -= (srtt >> 3);       /* m is now error in rtt est */
> -               srtt += m;              /* rtt = 7/8 rtt + 1/8 new */
> -               if (m < 0) {
> -                       m = -m;         /* m is now abs(error) */
> -                       m -= (tp->mdev_us >> 2);   /* similar update on mdev */
> -                       /* This is similar to one of Eifel findings.
> -                        * Eifel blocks mdev updates when rtt decreases.
> -                        * This solution is a bit different: we use finer gain
> -                        * for mdev in this case (alpha*beta).
> -                        * Like Eifel it also prevents growth of rto,
> -                        * but also it limits too fast rto decreases,
> -                        * happening in pure Eifel.
> -                        */
> -                       if (m > 0)
> -                               m >>= 3;
> -               } else {
> -                       m -= (tp->mdev_us >> 2);   /* similar update on mdev */
> -               }
> -               tp->mdev_us += m;               /* mdev = 3/4 mdev + 1/4 new */
> -               if (tp->mdev_us > tp->mdev_max_us) {
> -                       tp->mdev_max_us = tp->mdev_us;
> -                       if (tp->mdev_max_us > tp->rttvar_us)
> -                               tp->rttvar_us = tp->mdev_max_us;
> -               }
> -               if (after(tp->snd_una, tp->rtt_seq)) {
> -                       if (tp->mdev_max_us < tp->rttvar_us)
> -                               tp->rttvar_us -= (tp->rttvar_us - tp->mdev_max_us) >> 2;
> +               m -= (srtt >> 3);        /* m' = m - srtt / 8 = (R' - SRTT) */
> +               srtt += m;      /* srtt = srtt + m’ = srtt + m - srtt / 8 */
> +               if (m < 0)
> +                       m = -m;
> +               m -= (tp->mdev_us >> 2); /* m'' = |m'| - mdev / 4 */
> +               tp->mdev_us += m;
> +               tp->rttvar_us = tp->mdev_us;
> +               if (after(tp->snd_una, tp->rtt_seq))
>                         tp->rtt_seq = tp->snd_nxt;
> -                       tp->mdev_max_us = tcp_rto_min_us(sk);
> -               }
>         } else {
>                 /* no previous measure. */
> -               srtt = m << 3;          /* take the measured time to be rtt */
> -               tp->mdev_us = m << 1;   /* make sure rto = 3*rtt */
> -               tp->rttvar_us = max(tp->mdev_us, tcp_rto_min_us(sk));
> -               tp->mdev_max_us = tp->rttvar_us;
> +               srtt = m << 3;
> +               tp->mdev_us = m << 1;
> +               tp->rttvar_us = tp->mdev_us;
AFAICT we can update rttvar_us directly and don't need mdev_us anymore?

>                 tp->rtt_seq = tp->snd_nxt;
>         }
>         tp->srtt_us = max(1U, srtt);
> @@ -798,6 +759,7 @@ static void tcp_update_pacing_rate(struct sock *sk)
>   */
>  static void tcp_set_rto(struct sock *sk)
>  {
> +       const u32 min_rto = tcp_rto_min_us(sk);
>         const struct tcp_sock *tp = tcp_sk(sk);
>         /* Old crap is replaced with new one. 8)
>          *
> @@ -809,17 +771,15 @@ static void tcp_set_rto(struct sock *sk)
>          *    is invisible. Actually, Linux-2.4 also generates erratic
>          *    ACKs in some circumstances.
>          */
> -       inet_csk(sk)->icsk_rto = __tcp_set_rto(tp);
> -
> +       if (((tp->srtt_us >> 3) + tp->rttvar_us) < min_rto)
> +               inet_csk(sk)->icsk_rto = usecs_to_jiffies(min_rto);
> +       else
> +               inet_csk(sk)->icsk_rto = __tcp_set_rto(tp);

This is more aggressive than RFC6298, which uses RTO <- SRTT + max(G,
K*RTTVAR) where G = MIN_RTO = 200ms.

Based on our discussion, in the spirit of keeping the RTO more
conservative, I recommend we implement the RFC formula. ACKs being
delayed over 200ms is not uncommon (unfortunately, due to bufferbloat
or other issues).

Also I think we should change __tcp_set_rto so that the formula
applies to backoff and ICMP timeout calculations too.

>         /* 2. Fixups made earlier cannot be right.
>          *    If we do not estimate RTO correctly without them,
>          *    all the algo is pure shit and should be replaced
>          *    with correct one. It is exactly, which we pretend to do.
>          */
> -
> -       /* NOTE: clamping at TCP_RTO_MIN is not required, current algo
> -        * guarantees that rto is higher.
> -        */
>         tcp_bound_rto(sk);
>  }
>
> diff --git a/net/ipv4/tcp_metrics.c b/net/ipv4/tcp_metrics.c
> index b617826..7f59f5b 100644
> --- a/net/ipv4/tcp_metrics.c
> +++ b/net/ipv4/tcp_metrics.c
> @@ -561,7 +561,7 @@ reset:
>                  * retransmission.
>                  */
>                 tp->rttvar_us = jiffies_to_usecs(TCP_TIMEOUT_FALLBACK);
> -               tp->mdev_us = tp->mdev_max_us = tp->rttvar_us;
> +               tp->mdev_us = tp->rttvar_us;
>
>                 inet_csk(sk)->icsk_rto = TCP_TIMEOUT_FALLBACK;
>         }
> --
> 2.8.3
>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH net-next v2] tcp: use RFC6298 compliant TCP RTO calculation
  2016-06-14 21:33         ` Yuchung Cheng
@ 2016-06-15 17:41           ` Hagen Paul Pfeifer
  2016-06-15 18:02             ` Yuchung Cheng
  2016-06-15 18:00           ` [PATCH net-next v3] " Daniel Metz
  1 sibling, 1 reply; 22+ messages in thread
From: Hagen Paul Pfeifer @ 2016-06-15 17:41 UTC (permalink / raw)
  To: Yuchung Cheng; +Cc: Daniel Metz, netdev, Daniel Metz, Eric Dumazet

* Yuchung Cheng | 2016-06-14 14:33:18 [-0700]:

>> +               tp->rttvar_us = tp->mdev_us;
>AFAICT we can update rttvar_us directly and don't need mdev_us anymore?

Yes, v3 will remove mdev_us.

>This is more aggressive than  RFC6298 that RTO <- SRTT + max (G,
>K*RTTVAR) where G = MIN_RTO = 200ms
>
>based on our discussion, in the spirit of keeping RTO more
>conservative, I recommend we implement RFC formula. Acks being delayed
>over 200ms is not uncommon (unfortunately due to bloat or other
>issues).
>
>Also I think we should change __tcp_set_rto so that the formula
>applies to backoffs or ICMP timeouts calculations too.

We are unsure what you mean, Yuchung. We believe this patch is not more
aggressive than RFC 6298. In fact, we believe it to be RFC 6298 compliant:
in RFC 6298, G is the clock granularity, and we don’t see where the patch
deviates from the RFC. However, it is more aggressive than “RTO <- SRTT +
max (G, K*RTTVAR) where G = MIN_RTO = 200ms”. Which formula do you want to
implement?

Hagen

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH net-next v3] tcp: use RFC6298 compliant TCP RTO calculation
  2016-06-14 21:33         ` Yuchung Cheng
  2016-06-15 17:41           ` Hagen Paul Pfeifer
@ 2016-06-15 18:00           ` Daniel Metz
  2016-06-17 18:32             ` David Miller
  1 sibling, 1 reply; 22+ messages in thread
From: Daniel Metz @ 2016-06-15 18:00 UTC (permalink / raw)
  To: netdev
  Cc: Daniel Metz, Hagen Paul Pfeifer, Daniel Metz, Eric Dumazet,
	Yuchung Cheng

From: Daniel Metz <Daniel.Metz@rohde-schwarz.com>

This patch adjusts the Linux RTO calculation to be RFC 6298 compliant:
MinRTO is no longer added to the computed RTO, and RTO damping and
overestimation are decreased.

In the RFC 6298 TCP Retransmission Timeout (RTO) calculation, the
computed RTO is rounded up to the minimum RTO (MinRTO) if it is less.
The Linux implementation, in a departure from the standard, effectively
adds the MinRTO to the calculated RTO. When comparing both approaches,
the Linux calculation performs worse for sender-limited TCP flows such
as Telnet, SSH or constant-bit-rate encoded transmissions, especially
for Round Trip Times (RTT) of 50ms to 800ms.

Compared to the Linux implementation, the RFC 6298 RTO calculation
performs better and adapts more precisely to current network
characteristics. Extensive measurements for bulk data did not show a
negative impact from the adjusted calculation.

Exemplary performance comparison for sender-limited flows:

- Rate: 10Mbit/s
- Delay: 200ms, Delay Variation: 10ms
- Time between each scheduled segment: 1s
- Number of data segments: 300
- Mean of 11 runs

         Mean Response Waiting Time [milliseconds]

PER [%] |   0.5      1    1.5      2      3      5      7     10
--------+-------------------------------------------------------
old     | 206.4  208.6  218.0  218.6  227.2  249.3  274.7  308.2
new     | 203.9  206.0  207.0  209.9  217.3  225.6  238.7  259.1

Detailed analysis:
https://docs.google.com/document/d/1pKmPfnQb6fDK4qpiNVkN8cQyGE4wYDZukcuZfR-BnnM/

Reasoning for historic design:
Sarolahti, P.; Kuznetsov, A. (2002). Congestion Control in Linux TCP.
Conference Paper. Proceedings of the FREENIX Track. 2002 USENIX Annual
https://www.cs.helsinki.fi/research/iwtcp/papers/linuxtcp.pdf

Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net>
Signed-off-by: Daniel Metz <dmetz@mytum.de>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
---

 v3:
  - remove mdev_us

 v2:
  - Using the RFC 6298 compliant implementation, the tcp_sock struct variable
    u32 mdev_max_us becomes obsolete and consequently is being removed.
  - Add reference to Kuznetsov paper


 include/linux/tcp.h      |  2 --
 net/ipv4/tcp.c           |  3 +-
 net/ipv4/tcp_input.c     | 72 ++++++++++--------------------------------------
 net/ipv4/tcp_metrics.c   |  5 ++--
 net/ipv4/tcp_minisocks.c |  1 -
 5 files changed, 18 insertions(+), 65 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 7be9b12..3128eb1 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -230,8 +230,6 @@ struct tcp_sock {
 
 /* RTT measurement */
 	u32	srtt_us;	/* smoothed round trip time << 3 in usecs */
-	u32	mdev_us;	/* medium deviation			*/
-	u32	mdev_max_us;	/* maximal mdev for the last rtt period	*/
 	u32	rttvar_us;	/* smoothed mdev_max			*/
 	u32	rtt_seq;	/* sequence number to update rttvar	*/
 	struct rtt_meas {
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 5c7ed14..4a7597c 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -386,7 +386,6 @@ void tcp_init_sock(struct sock *sk)
 	INIT_LIST_HEAD(&tp->tsq_node);
 
 	icsk->icsk_rto = TCP_TIMEOUT_INIT;
-	tp->mdev_us = jiffies_to_usecs(TCP_TIMEOUT_INIT);
 	tp->rtt_min[0].rtt = ~0U;
 
 	/* So many TCP implementations out there (incorrectly) count the
@@ -2703,7 +2702,7 @@ void tcp_get_info(struct sock *sk, struct tcp_info *info)
 	info->tcpi_pmtu = icsk->icsk_pmtu_cookie;
 	info->tcpi_rcv_ssthresh = tp->rcv_ssthresh;
 	info->tcpi_rtt = tp->srtt_us >> 3;
-	info->tcpi_rttvar = tp->mdev_us >> 2;
+	info->tcpi_rttvar = tp->rttvar_us >> 2;
 	info->tcpi_snd_ssthresh = tp->snd_ssthresh;
 	info->tcpi_snd_cwnd = tp->snd_cwnd;
 	info->tcpi_advmss = tp->advmss;
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 94d4aff..279f5f7 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -680,8 +680,7 @@ static void tcp_event_data_recv(struct sock *sk, struct sk_buff *skb)
 /* Called to compute a smoothed rtt estimate. The data fed to this
  * routine either comes from timestamps, or from segments that were
  * known _not_ to have been retransmitted [see Karn/Partridge
- * Proceedings SIGCOMM 87]. The algorithm is from the SIGCOMM 88
- * piece by Van Jacobson.
+ * Proceedings SIGCOMM 87].
  * NOTE: the next three routines used to be one big routine.
  * To save cycles in the RFC 1323 implementation it was better to break
  * it up into three procedures. -- erics
@@ -692,59 +691,19 @@ static void tcp_rtt_estimator(struct sock *sk, long mrtt_us)
 	long m = mrtt_us; /* RTT */
 	u32 srtt = tp->srtt_us;
 
-	/*	The following amusing code comes from Jacobson's
-	 *	article in SIGCOMM '88.  Note that rtt and mdev
-	 *	are scaled versions of rtt and mean deviation.
-	 *	This is designed to be as fast as possible
-	 *	m stands for "measurement".
-	 *
-	 *	On a 1990 paper the rto value is changed to:
-	 *	RTO = rtt + 4 * mdev
-	 *
-	 * Funny. This algorithm seems to be very broken.
-	 * These formulae increase RTO, when it should be decreased, increase
-	 * too slowly, when it should be increased quickly, decrease too quickly
-	 * etc. I guess in BSD RTO takes ONE value, so that it is absolutely
-	 * does not matter how to _calculate_ it. Seems, it was trap
-	 * that VJ failed to avoid. 8)
-	 */
 	if (srtt != 0) {
-		m -= (srtt >> 3);	/* m is now error in rtt est */
-		srtt += m;		/* rtt = 7/8 rtt + 1/8 new */
-		if (m < 0) {
-			m = -m;		/* m is now abs(error) */
-			m -= (tp->mdev_us >> 2);   /* similar update on mdev */
-			/* This is similar to one of Eifel findings.
-			 * Eifel blocks mdev updates when rtt decreases.
-			 * This solution is a bit different: we use finer gain
-			 * for mdev in this case (alpha*beta).
-			 * Like Eifel it also prevents growth of rto,
-			 * but also it limits too fast rto decreases,
-			 * happening in pure Eifel.
-			 */
-			if (m > 0)
-				m >>= 3;
-		} else {
-			m -= (tp->mdev_us >> 2);   /* similar update on mdev */
-		}
-		tp->mdev_us += m;		/* mdev = 3/4 mdev + 1/4 new */
-		if (tp->mdev_us > tp->mdev_max_us) {
-			tp->mdev_max_us = tp->mdev_us;
-			if (tp->mdev_max_us > tp->rttvar_us)
-				tp->rttvar_us = tp->mdev_max_us;
-		}
-		if (after(tp->snd_una, tp->rtt_seq)) {
-			if (tp->mdev_max_us < tp->rttvar_us)
-				tp->rttvar_us -= (tp->rttvar_us - tp->mdev_max_us) >> 2;
+		m -= (srtt >> 3);	 /* m' = m - srtt / 8 = (R' - SRTT) */
+		srtt += m;	/* srtt = srtt + m’ = srtt + m - srtt / 8 */
+		if (m < 0)
+			m = -m;
+		m -= (tp->rttvar_us >> 2); /* m'' = |m'| - rttvar / 4 */
+		tp->rttvar_us += m;
+		if (after(tp->snd_una, tp->rtt_seq))
 			tp->rtt_seq = tp->snd_nxt;
-			tp->mdev_max_us = tcp_rto_min_us(sk);
-		}
 	} else {
 		/* no previous measure. */
-		srtt = m << 3;		/* take the measured time to be rtt */
-		tp->mdev_us = m << 1;	/* make sure rto = 3*rtt */
-		tp->rttvar_us = max(tp->mdev_us, tcp_rto_min_us(sk));
-		tp->mdev_max_us = tp->rttvar_us;
+		srtt = m << 3;
+		tp->rttvar_us = m << 1;
 		tp->rtt_seq = tp->snd_nxt;
 	}
 	tp->srtt_us = max(1U, srtt);
@@ -798,6 +757,7 @@ static void tcp_update_pacing_rate(struct sock *sk)
  */
 static void tcp_set_rto(struct sock *sk)
 {
+	const u32 min_rto = tcp_rto_min_us(sk);
 	const struct tcp_sock *tp = tcp_sk(sk);
 	/* Old crap is replaced with new one. 8)
 	 *
@@ -809,17 +769,15 @@ static void tcp_set_rto(struct sock *sk)
 	 *    is invisible. Actually, Linux-2.4 also generates erratic
 	 *    ACKs in some circumstances.
 	 */
-	inet_csk(sk)->icsk_rto = __tcp_set_rto(tp);
-
+	if (((tp->srtt_us >> 3) + tp->rttvar_us) < min_rto)
+		inet_csk(sk)->icsk_rto = usecs_to_jiffies(min_rto);
+	else
+		inet_csk(sk)->icsk_rto = __tcp_set_rto(tp);
 	/* 2. Fixups made earlier cannot be right.
 	 *    If we do not estimate RTO correctly without them,
 	 *    all the algo is pure shit and should be replaced
 	 *    with correct one. It is exactly, which we pretend to do.
 	 */
-
-	/* NOTE: clamping at TCP_RTO_MIN is not required, current algo
-	 * guarantees that rto is higher.
-	 */
 	tcp_bound_rto(sk);
 }
 
diff --git a/net/ipv4/tcp_metrics.c b/net/ipv4/tcp_metrics.c
index b617826..9ec9d99 100644
--- a/net/ipv4/tcp_metrics.c
+++ b/net/ipv4/tcp_metrics.c
@@ -420,8 +420,8 @@ void tcp_update_metrics(struct sock *sk)
 
 		/* Scale deviation to rttvar fixed point */
 		m >>= 1;
-		if (m < tp->mdev_us)
-			m = tp->mdev_us;
+		if (m < tp->rttvar_us)
+			m = tp->rttvar_us;
 
 		var = tcp_metric_get(tm, TCP_METRIC_RTTVAR);
 		if (m >= var)
@@ -561,7 +561,6 @@ reset:
 		 * retransmission.
 		 */
 		tp->rttvar_us = jiffies_to_usecs(TCP_TIMEOUT_FALLBACK);
-		tp->mdev_us = tp->mdev_max_us = tp->rttvar_us;
 
 		inet_csk(sk)->icsk_rto = TCP_TIMEOUT_FALLBACK;
 	}
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 4b95ec4..e72c2f5 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -463,7 +463,6 @@ struct sock *tcp_create_openreq_child(const struct sock *sk,
 		tcp_init_wl(newtp, treq->rcv_isn);
 
 		newtp->srtt_us = 0;
-		newtp->mdev_us = jiffies_to_usecs(TCP_TIMEOUT_INIT);
 		newtp->rtt_min[0].rtt = ~0U;
 		newicsk->icsk_rto = TCP_TIMEOUT_INIT;
 
-- 
2.8.3

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH net-next v2] tcp: use RFC6298 compliant TCP RTO calculation
  2016-06-15 17:41           ` Hagen Paul Pfeifer
@ 2016-06-15 18:02             ` Yuchung Cheng
  2016-06-15 20:34               ` Daniel Metz
  2016-06-16  9:03               ` Hagen Paul Pfeifer
  0 siblings, 2 replies; 22+ messages in thread
From: Yuchung Cheng @ 2016-06-15 18:02 UTC (permalink / raw)
  To: Hagen Paul Pfeifer; +Cc: Daniel Metz, netdev, Daniel Metz, Eric Dumazet

On Wed, Jun 15, 2016 at 10:41 AM, Hagen Paul Pfeifer <hagen@jauu.net> wrote:
>
> * Yuchung Cheng | 2016-06-14 14:33:18 [-0700]:
>
> >> +               tp->rttvar_us = tp->mdev_us;
> >AFAICT we can update rttvar_us directly and don't need mdev_us anymore?
>
> Yes, v3 will remove mdev_us.
>
> >This is more aggressive than  RFC6298 that RTO <- SRTT + max (G,
> >K*RTTVAR) where G = MIN_RTO = 200ms
> >
> >based on our discussion, in the spirit of keeping RTO more
> >conservative, I recommend we implement RFC formula. Acks being delayed
> >over 200ms is not uncommon (unfortunately due to bloat or other
> >issues).
> >
> >Also I think we should change __tcp_set_rto so that the formula
> >applies to backoffs or ICMP timeouts calculations too.
>
> We are a unsure what you mean Yuchung. We believe this patch not to be more
> aggressive than RFC 6298. In fact, we believe it to be RFC 6298 compliant, as
> in RFC 6298, G is the clock granularity and we don’t see where it deviates
> from the RFC. However, it is more aggressive than “RTO <- SRTT + max (G,
> K*RTTVAR) where G = MIN_RTO = 200ms”. Which formula do you want to implement?
Let me explain in a different way:

* RFC6298 applies a lower bound of 1 second to RTO (section 2.4)

* Linux currently applies a lower bound of 200ms (min_rto) to
K*RTTVAR, but /not/ RTO itself.

* This patch applies the lower bound of 200ms to RTO, similar to RFC6298


Let's say the SRTT is 100ms and RTT variations is 10ms. The variation
is low because we've been sending large chunks, and RTT is fairly
stable, and we sample on every ACK. The RTOs produced are

RFC6298: RTO=1s
Linux: RTO=300ms
This patch: RTO=200ms

Then we send 1 packet out. The receiver delays the ACK up to 200ms.
The actual RTT can be longer because other network components further
delay the data or the ACK. This patch would surely fire the RTO
spuriously.

so we can either implement RFC6298 faithfully, or apply the
lower-bound as-is, or something in between. But the current patch
as-is is more aggressive. Did I miss something?


>
>
> Hagen

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH net-next v2] tcp: use RFC6298 compliant TCP RTO calculation
  2016-06-15 18:02             ` Yuchung Cheng
@ 2016-06-15 20:34               ` Daniel Metz
  2016-06-15 20:38                 ` Eric Dumazet
  2016-06-16  9:03               ` Hagen Paul Pfeifer
  1 sibling, 1 reply; 22+ messages in thread
From: Daniel Metz @ 2016-06-15 20:34 UTC (permalink / raw)
  To: Yuchung Cheng, Hagen Paul Pfeifer; +Cc: netdev, Daniel Metz, Eric Dumazet

Yuchung Cheng | 2016-06-15 20:02:
 > Let me explain in a different way:
 >
 > * RFC6298 applies a lower bound of 1 second to RTO (section 2.4)
 >
 > * Linux currently applies a lower bound of 200ms (min_rto) to
 > K*RTTVAR, but /not/ RTO itself.
 >
 > * This patch applies the lower bound of 200ms to RTO, similar to RFC6298
 >
 >
 > Let's say the SRTT is 100ms and RTT variations is 10ms. The variation
 > is low because we've been sending large chunks, and RTT is fairly
 > stable, and we sample on every ACK. The RTOs produced are
 >
 > RFC6298: RTO=1s
 > Linux: RTO=300ms
 > This patch: RTO=200ms
 >
 > Then we send 1 packet out. The receiver delays the ACK up to 200ms.
 > The actual RTT can be longer because other network components further
 > delay the data or the ACK. This patch would surely fire the RTO
 > spuriously.
 >
 > so we can either implement RFC6298 faithfully, or apply the
 > lower-bound as-is, or something in between. But the current patch
 > as-is is more aggressive. Did I miss something?

Thank you for the clarification. The fundamental idea of this patch
was to decrease the Linux RTO overestimation. This also involved not
clinging to the RFC 6298 MinRTO of 1 second ((2.4) "[...] at the same
time acknowledging that at some future point, research may show that a
smaller minimum RTO is acceptable or superior."). A more aggressive RTO
will of course increase the amount of spurious retransmissions. The
question is whether the benefit outweighs the sacrifice. The tests we
conducted have not shown a significant negative impact so far, and for
sender-limited TCP flows the results were promising.

Daniel

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH net-next v2] tcp: use RFC6298 compliant TCP RTO calculation
  2016-06-15 20:34               ` Daniel Metz
@ 2016-06-15 20:38                 ` Eric Dumazet
  2016-06-15 21:07                   ` Yuchung Cheng
  2016-06-16  9:07                   ` Hagen Paul Pfeifer
  0 siblings, 2 replies; 22+ messages in thread
From: Eric Dumazet @ 2016-06-15 20:38 UTC (permalink / raw)
  To: Daniel Metz; +Cc: Yuchung Cheng, Hagen Paul Pfeifer, netdev, Daniel Metz

On Wed, Jun 15, 2016 at 1:34 PM, Daniel Metz <dmetz@mytum.de> wrote:
> Yuchung Cheng | 2016-06-15 20:02:
>> Let me explain in a different way:
>>
>> * RFC6298 applies a lower bound of 1 second to RTO (section 2.4)
>>
>> * Linux currently applies a lower bound of 200ms (min_rto) to
>> K*RTTVAR, but /not/ RTO itself.
>>
>> * This patch applies the lower bound of 200ms to RTO, similar to RFC6298
>>
>>
>> Let's say the SRTT is 100ms and RTT variations is 10ms. The variation
>> is low because we've been sending large chunks, and RTT is fairly
>> stable, and we sample on every ACK. The RTOs produced are
>>
>> RFC6298: RTO=1s
>> Linux: RTO=300ms
>> This patch: RTO=200ms
>>
>> Then we send 1 packet out. The receiver delays the ACK up to 200ms.
>> The actual RTT can be longer because other network components further
>> delay the data or the ACK. This patch would surely fire the RTO
>> spuriously.
>>
>> so we can either implement RFC6298 faithfully, or apply the
>> lower-bound as-is, or something in between. But the current patch
>> as-is is more aggressive. Did I miss something?
>
> Thank you for the clarification. The fundamental thought of this patch was
> to decrease Linux RTO overestimation. This also involved not clinging to the
> RFC 6298 MinRTO of 1 second ((2.4) "[...] at the same time acknowledging
> that at some future point, research may show that a smaller minimum RTO is
> acceptable or superior."). A more aggressive RTO will of course increase the
> amount of Spurious Retransmission. The question is, if the benefit is higher
> than the sacrifice. The tests we conducted did not show significant negative
> impact so far. However, for sender-limited TCP flows the results were
> promising.
>

I guess the problem is that some folks use a smaller rto than
RTAX_RTO_MIN; look at tcp_rto_min()

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH net-next v2] tcp: use RFC6298 compliant TCP RTO calculation
  2016-06-15 20:38                 ` Eric Dumazet
@ 2016-06-15 21:07                   ` Yuchung Cheng
  2016-06-16  9:07                   ` Hagen Paul Pfeifer
  1 sibling, 0 replies; 22+ messages in thread
From: Yuchung Cheng @ 2016-06-15 21:07 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Daniel Metz, Hagen Paul Pfeifer, netdev, Daniel Metz

On Wed, Jun 15, 2016 at 1:38 PM, Eric Dumazet <edumazet@google.com> wrote:
>
> On Wed, Jun 15, 2016 at 1:34 PM, Daniel Metz <dmetz@mytum.de> wrote:
> > Yuchung Cheng | 2016-06-15 20:02:
> >> Let me explain in a different way:
> >>
> >> * RFC6298 applies a lower bound of 1 second to RTO (section 2.4)
> >>
> >> * Linux currently applies a lower bound of 200ms (min_rto) to
> >> K*RTTVAR, but /not/ RTO itself.
> >>
> >> * This patch applies the lower bound of 200ms to RTO, similar to RFC6298
> >>
> >>
> >> Let's say the SRTT is 100ms and RTT variations is 10ms. The variation
> >> is low because we've been sending large chunks, and RTT is fairly
> >> stable, and we sample on every ACK. The RTOs produced are
> >>
> >> RFC6298: RTO=1s
> >> Linux: RTO=300ms
> >> This patch: RTO=200ms
> >>
> >> Then we send 1 packet out. The receiver delays the ACK up to 200ms.
> >> The actual RTT can be longer because other network components further
> >> delay the data or the ACK. This patch would surely fire the RTO
> >> spuriously.
> >>
> >> so we can either implement RFC6298 faithfully, or apply the
> >> lower-bound as-is, or something in between. But the current patch
> >> as-is is more aggressive. Did I miss something?
> >
> > Thank you for the clarification. The fundamental thought of this patch was
> > to decrease Linux RTO overestimation. This also involved not clinging to the
> > RFC 6298 MinRTO of 1 second ((2.4) "[...] at the same time acknowledging
> > that at some future point, research may show that a smaller minimum RTO is
> > acceptable or superior."). A more aggressive RTO will of course increase the
> > amount of Spurious Retransmission. The question is, if the benefit is higher
> > than the sacrifice. The tests we conducted did not show significant negative
> > impact so far. However, for sender-limited TCP flows the results were
> > promising.
> >
>
> I guess the problem is that some folks use smaller rto than
> RTAX_RTO_MIN , look at tcp_rto_min()

Also, many other stacks (e.g., Windows until very recently) do not have
40ms delayed ACKs like Linux. One thing we at least know is that the
current 200ms lower bound on RTTVAR has worked for a long time. That's
why I propose keeping it: in other words, change the RTT variation
averaging, but not the lower bound.

Will try to get the experiment going to test different min_rto values
so we have more data.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH net-next v2] tcp: use RFC6298 compliant TCP RTO calculation
  2016-06-15 18:02             ` Yuchung Cheng
  2016-06-15 20:34               ` Daniel Metz
@ 2016-06-16  9:03               ` Hagen Paul Pfeifer
  1 sibling, 0 replies; 22+ messages in thread
From: Hagen Paul Pfeifer @ 2016-06-16  9:03 UTC (permalink / raw)
  To: Yuchung Cheng; +Cc: Daniel Metz, netdev, Daniel Metz, Eric Dumazet

> On June 15, 2016 at 8:02 PM Yuchung Cheng <ycheng@google.com> wrote:
> 
> Let's say the SRTT is 100ms and RTT variations is 10ms. The variation
> is low because we've been sending large chunks, and RTT is fairly
> stable, and we sample on every ACK. The RTOs produced are
> 
> RFC6298: RTO=1s
> Linux: RTO=300ms
> This patch: RTO=200ms
> 
> Then we send 1 packet out. The receiver delays the ACK up to 200ms.
> The actual RTT can be longer because other network components further
> delay the data or the ACK. This patch would surely fire the RTO
> spuriously.
> 
> so we can either implement RFC6298 faithfully, or apply the
> lower-bound as-is, or something in between. But the current patch
> as-is is more aggressive. Did I miss something?

We analyzed the impact for a wide variety of network characteristics. Starting from bulk data till chatty, sender-limited transmissions from low RTTs to high RTTs, small and large variances as well as different queue characteristics. For a group of tests we measured advantages of a RFC 6298 compliant implementation: sender-limited flows. For bulk data we did not measured any difference compared to standard Linux. As a result we concluded that the RFC conform implementation - mapped to real world protocols - if beneficial.

For the mentioned use case, yes, the new implementation is a little more aggressive: when the delayed ACK kicks in, a spurious retransmission can be triggered. We asked ourselves whether this is a real-world scenario or more a theoretical issue. And if it is a real-world problem, is the spurious retransmission negligible compared to the advantages?

Yuchung, can you test the patch and see if it has any downsides? And thank you for the comments!

Hagen


* Re: [PATCH net-next v2] tcp: use RFC6298 compliant TCP RTO calculation
  2016-06-15 20:38                 ` Eric Dumazet
  2016-06-15 21:07                   ` Yuchung Cheng
@ 2016-06-16  9:07                   ` Hagen Paul Pfeifer
  1 sibling, 0 replies; 22+ messages in thread
From: Hagen Paul Pfeifer @ 2016-06-16  9:07 UTC (permalink / raw)
  To: Daniel Metz, Eric Dumazet; +Cc: Yuchung Cheng, netdev, Daniel Metz

> On June 15, 2016 at 10:38 PM Eric Dumazet <edumazet@google.com> wrote:
>
> I guess the problem is that some folks use smaller rto than
> RTAX_RTO_MIN , look at tcp_rto_min()

Due to the nature of the Linux calculation, this is probably one more reason to use the RFC 6298 calculation. When a MinRTO smaller than 200ms is used, the Linux “advantage” of accounting for Delayed ACKs of up to 200ms shrinks. Assuming a MinRTO of 0ms, the Linux ability and the RFC ability to account for sudden Delayed ACKs are equal: zero.

To illustrate this: RTT: 50ms, RTTVAR: 0ms, MinRTO: 50ms, Delayed ACKs: 200ms. 

Before any ACK is delayed:

Linux RTO ~ 100+ms (tested) 
RFC 6298 RTO ~ 50+ms (tested) 

RTT of first delayed ACK if it is not shortened due to another data packet: ~250ms 

This is not tied to the RTT:
RTT 1000ms, RTTVAR: 0ms, MinRTO: 50ms, Delayed ACKs: 200ms 

Before any ACK is delayed:

Linux RTO ~ 1050+ms (tested) 
RFC 6298 RTO ~ 1000+ms (tested)

RTT of first delayed ACK if it is not shortened due to another data packet: ~1200ms 
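
The numbers above can be checked against a simplified model (a sketch in milliseconds; the function names are ours, not kernel code):

```c
#include <assert.h>

static int max_ms(int a, int b) { return a > b ? a : b; }

/* Linux-style RTO (simplified): variance term lower-bounded by minRTO/4. */
static int rto_linux(int srtt, int rttvar, int min_rto)
{
	return srtt + 4 * max_ms(rttvar, min_rto / 4);
}

/* RFC-style RTO (this patch): the whole RTO lower-bounded by minRTO. */
static int rto_rfc(int srtt, int rttvar, int min_rto)
{
	return max_ms(srtt + 4 * rttvar, min_rto);
}

/* With a 200ms delayed ACK, the observed RTT of a lone segment grows to
 * roughly RTT + 200ms, which can exceed either RTO once MinRTO is small. */
static int delayed_ack_rtt(int rtt) { return rtt + 200; }
```

With RTT=50ms, RTTVAR=0ms and MinRTO=50ms this yields roughly 100ms (Linux) versus 50ms (RFC), and the first delayed ACK of ~250ms overruns both, matching the illustration.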

An RFC 6298 problem we ran into so far was with extremely steady RTTs and sender-limited data: a spurious retransmission occurred from time to time in this case.

Hagen


* Re: [PATCH net-next v3] tcp: use RFC6298 compliant TCP RTO calculation
  2016-06-15 18:00           ` [PATCH net-next v3] " Daniel Metz
@ 2016-06-17 18:32             ` David Miller
  2016-06-17 18:56               ` Yuchung Cheng
  0 siblings, 1 reply; 22+ messages in thread
From: David Miller @ 2016-06-17 18:32 UTC (permalink / raw)
  To: dmetz; +Cc: netdev, Daniel.Metz, hagen, edumazet, ycheng

From: Daniel Metz <dmetz@mytum.de>
Date: Wed, 15 Jun 2016 20:00:03 +0200

> This patch adjusts Linux RTO calculation to be RFC6298 Standard
> compliant. MinRTO is no longer added to the computed RTO, RTO damping
> and overestimation are decreased.
 ...

Yuchung, I assume I am waiting for you to do the testing you said
you would do for this patch, right?


* Re: [PATCH net-next v3] tcp: use RFC6298 compliant TCP RTO calculation
  2016-06-17 18:32             ` David Miller
@ 2016-06-17 18:56               ` Yuchung Cheng
  2016-06-22  5:53                 ` Yuchung Cheng
  0 siblings, 1 reply; 22+ messages in thread
From: Yuchung Cheng @ 2016-06-17 18:56 UTC (permalink / raw)
  To: David Miller
  Cc: Daniel Metz, netdev, Daniel Metz, Hagen Paul Pfeifer, Eric Dumazet

On Fri, Jun 17, 2016 at 11:32 AM, David Miller <davem@davemloft.net> wrote:
>
> From: Daniel Metz <dmetz@mytum.de>
> Date: Wed, 15 Jun 2016 20:00:03 +0200
>
> > This patch adjusts Linux RTO calculation to be RFC6298 Standard
> > compliant. MinRTO is no longer added to the computed RTO, RTO damping
> > and overestimation are decreased.
>  ...
>
> Yuchung, I assume I am waiting for you to do the testing you said
> you would do for this patch, right?
Yes I spent the last two days resolving some unrelated glitches to
start my testing on Web servers. I should be able to get some results
over the weekend.

I will test
0) current Linux
1) this patch
2) RFC6298 with min_RTO=1sec
3) RFC6298 with minimum RTTVAR of 200ms (so it is more like current
Linux style of min RTO which only applies to RTTVAR)

and collect the TCP latency (how long to send an HTTP response) and
(spurious) timeout & retransmission stats.

I didn't respond to Hagen's email yet b/c I thought data would help
the discussion better :-)


* Re: [PATCH net-next v3] tcp: use RFC6298 compliant TCP RTO calculation
  2016-06-17 18:56               ` Yuchung Cheng
@ 2016-06-22  5:53                 ` Yuchung Cheng
  2016-06-22 11:21                   ` Hagen Paul Pfeifer
  2016-06-29  7:06                   ` Yuchung Cheng
  0 siblings, 2 replies; 22+ messages in thread
From: Yuchung Cheng @ 2016-06-22  5:53 UTC (permalink / raw)
  To: David Miller
  Cc: Daniel Metz, netdev, Daniel Metz, Hagen Paul Pfeifer, Eric Dumazet

On Fri, Jun 17, 2016 at 11:56 AM, Yuchung Cheng <ycheng@google.com> wrote:
>
> On Fri, Jun 17, 2016 at 11:32 AM, David Miller <davem@davemloft.net> wrote:
> >
> > From: Daniel Metz <dmetz@mytum.de>
> > Date: Wed, 15 Jun 2016 20:00:03 +0200
> >
> > > This patch adjusts Linux RTO calculation to be RFC6298 Standard
> > > compliant. MinRTO is no longer added to the computed RTO, RTO damping
> > > and overestimation are decreased.
> >  ...
> >
> > Yuchung, I assume I am waiting for you to do the testing you said
> > you would do for this patch, right?
> Yes I spent the last two days resolving some unrelated glitches to
> start my testing on Web servers. I should be able to get some results
> over the weekend.
>
> I will test
> 0) current Linux
> 1) this patch
> 2) RFC6298 with min_RTO=1sec
> 3) RFC6298 with minimum RTTVAR of 200ms (so it is more like current
> Linux style of min RTO which only applies to RTTVAR)
>
> and collect the TCP latency (how long to send an HTTP response) and
> (spurious) timeout & retransmission stats.
>
Thanks for the patience. I've collected data from some Google Web
servers. They serve a mix of US and SouthAm users using HTTP1 and
HTTP2. The traffic is Web browsing (e.g., search, maps, Gmail,
etc., but not Youtube videos). The mean RTT is about 100ms.

The user connections were split into 4 groups of different TCP RTO
configs. Each group has many millions of connections but the
size variation among groups is well under 1%.

B: baseline Linux
D: this patch
R: change RTTVAR averaging as in D, but bound RTO to 1sec per RFC6298
Y: change RTTVAR averaging as in D, but bound RTTVAR to 200ms instead (like B)

For mean TCP latency of HTTP responses (first byte sent to last byte
acked), B < R < Y < D. But the differences are insignificant (<1%).
The median, 95pctl, and 99pctl show similarly insignificant
differences. In summary there is hardly any visible impact on latency.
I also looked at only responses less than 4KB but did not see a
different picture.

The main difference is the retransmission rate, where R =~ Y < B =~ D.
R and Y are ~20% lower than B and D. Parsing the SNMP stats reveals
more interesting details. The table shows the deltas in percentage to
the baseline B.

                D      R     Y
------------------------------
Timeout      +12%   -16%  -16%
TailLossProb +28%    -7%   -7%
DSACK_rcvd   +37%    -7%   -7%
Cwnd-undo    +16%   -29%  -29%

RTO change affects TLP because TLP will use the min of RTO and TLP
timer value to arm the probe timer.
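
That interaction can be sketched as (a simplification in milliseconds; the names are ours, not kernel code):

```c
#include <assert.h>

/* TLP arms the probe timer at the earlier of the probe timeout and the
 * RTO, so any change that lowers the RTO also makes tail loss probes
 * fire sooner. */
static int tlp_arm(int probe_timeout, int rto)
{
	return probe_timeout < rto ? probe_timeout : rto;
}
```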

The stats indicate that the main culprit of spurious timeouts / rtx is
the RTO lower-bound. But they also show that the RFC RTTVAR averaging
is as good as the current Linux approach.

Given that, I would recommend we revise this patch to use the RFC
averaging but keep the existing lower-bound (of RTTVAR to 200ms). We
can further experiment with the lower-bound and change that in a
separate patch.
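
For reference, the RFC 6298 averaging in question is the standard EWMA update; a sketch (milliseconds as doubles; the names are ours):

```c
#include <assert.h>
#include <math.h>

/* RFC 6298 SRTT/RTTVAR update for one RTT sample r (ms).
 * First sample: SRTT = r, RTTVAR = r/2. */
static void rfc6298_update(double *srtt, double *rttvar, double r)
{
	if (*srtt == 0.0) {
		*srtt = r;
		*rttvar = r / 2.0;
		return;
	}
	*rttvar = 0.75 * *rttvar + 0.25 * fabs(*srtt - r);
	*srtt = 0.875 * *srtt + 0.125 * r;
}

/* RTO after feeding n samples, with the RFC's 1 second lower bound. */
static double rfc6298_rto(const double *samples, int n)
{
	double srtt = 0.0, rttvar = 0.0, rto;
	int i;

	for (i = 0; i < n; i++)
		rfc6298_update(&srtt, &rttvar, samples[i]);
	rto = srtt + 4.0 * rttvar;
	return rto < 1000.0 ? 1000.0 : rto;
}
```

The experiments above replace only this averaging; variant Y then substitutes the 200ms RTTVAR lower bound for the RFC's 1 second RTO bound.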


* Re: [PATCH net-next v3] tcp: use RFC6298 compliant TCP RTO calculation
  2016-06-22  5:53                 ` Yuchung Cheng
@ 2016-06-22 11:21                   ` Hagen Paul Pfeifer
  2016-06-22 20:50                     ` Yuchung Cheng
  2016-06-29  7:06                   ` Yuchung Cheng
  1 sibling, 1 reply; 22+ messages in thread
From: Hagen Paul Pfeifer @ 2016-06-22 11:21 UTC (permalink / raw)
  To: Yuchung Cheng, David Miller
  Cc: Daniel Metz, netdev, Daniel Metz, Eric Dumazet

> On June 22, 2016 at 7:53 AM Yuchung Cheng <ycheng@google.com> wrote:
> 
> Thanks for the patience. I've collected data from some Google Web
> servers. They serve both a mix of US and SouthAm users using
> HTTP1 and HTTP2. The traffic is Web browsing (e.g., search, maps,
> gmails, etc but not Youtube videos). The mean RTT is about 100ms.
> 
> The user connections were split into 4 groups of different TCP RTO
> configs. Each group has many millions of connections but the
> size variation among groups is well under 1%.
> 
> B: baseline Linux
> D: this patch
> R: change RTTVAR averaging as in D, but bound RTO to 1sec per RFC6298
> Y: change RTTVAR averaging as in D, but bound RTTVAR to 200ms instead (like B)
> 
> For mean TCP latency of HTTP responses (first byte sent to last byte
> acked), B < R < Y < D. But the differences are so insignificant (<1%).
> The median, 95pctl, and 99pctl has similar indifference. In summary
> there's hardly visible impact on latency. I also look at only response
> less than 4KB but do not see a different picture.
> 
> The main difference is the retransmission rate where R =~ Y < B =~D.
> R and Y are ~20% lower than B and D. Parsing the SNMP stats reveal
> more interesting details. The table shows the deltas in percentage to
> the baseline B.
> 
>                 D      R     Y
> ------------------------------
> Timeout      +12%   -16%  -16%
> TailLossProb +28%    -7%   -7%
> DSACK_rcvd   +37%    -7%   -7%
> Cwnd-undo    +16%   -29%  -29%
> 
> RTO change affects TLP because TLP will use the min of RTO and TLP
> timer value to arm the probe timer.
> 
> The stats indicate that the main culprit of spurious timeouts / rtx is
> the RTO lower-bound. But they also show the RFC RTTVAR averaging is as
> good as current Linux approach.
> 
> Given that I would recommend we revise this patch to use the RFC
> averaging but keep existing lower-bound (of RTTVAR to 200ms). We can
> further experiment the lower-bound and change that in a separate
> patch.

Great news Yuchung!

Then Daniel will prepare v4 with a min-rto lower bound:

    max(RTTVAR, tcp_rto_min_us(struct sock))
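
A hedged sketch of what that bound would compute (milliseconds; only tcp_rto_min_us() is a real kernel helper, all other names here are ours):

```c
#include <assert.h>

/* Proposed v4 rule (sketch): keep the RFC 6298 averaging, but
 * lower-bound the variance term by the minimum RTO, as Linux does
 * today. rttvar4 stands in for the 4*RTTVAR term. */
static int rto_v4(int srtt, int rttvar4, int min_rto)
{
	return srtt + (rttvar4 > min_rto ? rttvar4 : min_rto);
}
```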

Any further suggestions, Yuchung, Eric? We will also feed this v4 into our test environment to check the behavior for sender-limited, non-continuous flows.

Hagen


* Re: [PATCH net-next v3] tcp: use RFC6298 compliant TCP RTO calculation
  2016-06-22 11:21                   ` Hagen Paul Pfeifer
@ 2016-06-22 20:50                     ` Yuchung Cheng
  0 siblings, 0 replies; 22+ messages in thread
From: Yuchung Cheng @ 2016-06-22 20:50 UTC (permalink / raw)
  To: Hagen Paul Pfeifer
  Cc: David Miller, Daniel Metz, netdev, Daniel Metz, Eric Dumazet

On Wed, Jun 22, 2016 at 4:21 AM, Hagen Paul Pfeifer <hagen@jauu.net> wrote:
>
> > On June 22, 2016 at 7:53 AM Yuchung Cheng <ycheng@google.com> wrote:
> >
> > Thanks for the patience. I've collected data from some Google Web
> > servers. They serve both a mix of US and SouthAm users using
> > HTTP1 and HTTP2. The traffic is Web browsing (e.g., search, maps,
> > gmails, etc but not Youtube videos). The mean RTT is about 100ms.
> >
> > The user connections were split into 4 groups of different TCP RTO
> > configs. Each group has many millions of connections but the
> > size variation among groups is well under 1%.
> >
> > B: baseline Linux
> > D: this patch
> > R: change RTTVAR averaging as in D, but bound RTO to 1sec per RFC6298
> > Y: change RTTVAR averaging as in D, but bound RTTVAR to 200ms instead (like B)
> >
> > For mean TCP latency of HTTP responses (first byte sent to last byte
> > acked), B < R < Y < D. But the differences are so insignificant (<1%).
> > The median, 95pctl, and 99pctl has similar indifference. In summary
> > there's hardly visible impact on latency. I also look at only response
> > less than 4KB but do not see a different picture.
> >
> > The main difference is the retransmission rate where R =~ Y < B =~D.
> > R and Y are ~20% lower than B and D. Parsing the SNMP stats reveal
> > more interesting details. The table shows the deltas in percentage to
> > the baseline B.
> >
> >                 D      R     Y
> > ------------------------------
> > Timeout      +12%   -16%  -16%
> > TailLossProb +28%    -7%   -7%
> > DSACK_rcvd   +37%    -7%   -7%
> > Cwnd-undo    +16%   -29%  -29%
> >
> > RTO change affects TLP because TLP will use the min of RTO and TLP
> > timer value to arm the probe timer.
> >
> > The stats indicate that the main culprit of spurious timeouts / rtx is
> > the RTO lower-bound. But they also show the RFC RTTVAR averaging is as
> > good as current Linux approach.
> >
> > Given that I would recommend we revise this patch to use the RFC
> > averaging but keep existing lower-bound (of RTTVAR to 200ms). We can
> > further experiment the lower-bound and change that in a separate
> > patch.
>
> Great news Yuchung!
>
> Then Daniel will prepare v4 with a min-rto lower bound:
>
>     max(RTTVAR, tcp_rto_min_us(struct sock))
>
> Any further suggestions Yuchung, Eric? We will also feed this v4 in our test environment to check the behavior for sender limited, non-continuous flows.
yes a small one: I think the patch should change __tcp_set_rto()
instead of tcp_set_rto() so it applies to recurring timeouts as well.


>
> Hagen


* Re: [PATCH net-next v3] tcp: use RFC6298 compliant TCP RTO calculation
  2016-06-22  5:53                 ` Yuchung Cheng
  2016-06-22 11:21                   ` Hagen Paul Pfeifer
@ 2016-06-29  7:06                   ` Yuchung Cheng
  2016-06-29 20:18                     ` Daniel Metz
  1 sibling, 1 reply; 22+ messages in thread
From: Yuchung Cheng @ 2016-06-29  7:06 UTC (permalink / raw)
  To: Daniel Metz, Hagen Paul Pfeifer, Daniel Metz
  Cc: netdev, Eric Dumazet, Neal Cardwell, David Miller

[-- Attachment #1: Type: text/plain, Size: 5255 bytes --]

On Tue, Jun 21, 2016 at 10:53 PM, Yuchung Cheng <ycheng@google.com> wrote:
>
> On Fri, Jun 17, 2016 at 11:56 AM, Yuchung Cheng <ycheng@google.com> wrote:
> >
> > On Fri, Jun 17, 2016 at 11:32 AM, David Miller <davem@davemloft.net> wrote:
> > >
> > > From: Daniel Metz <dmetz@mytum.de>
> > > Date: Wed, 15 Jun 2016 20:00:03 +0200
> > >
> > > > This patch adjusts Linux RTO calculation to be RFC6298 Standard
> > > > compliant. MinRTO is no longer added to the computed RTO, RTO damping
> > > > and overestimation are decreased.
> > >  ...
> > >
> > > Yuchung, I assume I am waiting for you to do the testing you said
> > > you would do for this patch, right?
> > Yes I spent the last two days resolving some unrelated glitches to
> > start my testing on Web servers. I should be able to get some results
> > over the weekend.
> >
> > I will test
> > 0) current Linux
> > 1) this patch
> > 2) RFC6298 with min_RTO=1sec
> > 3) RFC6298 with minimum RTTVAR of 200ms (so it is more like current
> > Linux style of min RTO which only applies to RTTVAR)
> >
> > and collect the TCP latency (how long to send an HTTP response) and
> > (spurious) timeout & retransmission stats.
> >
> Thanks for the patience. I've collected data from some Google Web
> servers. They serve both a mix of US and SouthAm users using
> HTTP1 and HTTP2. The traffic is Web browsing (e.g., search, maps,
> gmails, etc but not Youtube videos). The mean RTT is about 100ms.
>
> The user connections were split into 4 groups of different TCP RTO
> configs. Each group has many millions of connections but the
> size variation among groups is well under 1%.
>
> B: baseline Linux
> D: this patch
> R: change RTTVAR averaging as in D, but bound RTO to 1sec per RFC6298
> Y: change RTTVAR averaging as in D, but bound RTTVAR to 200ms instead (like B)
>
> For mean TCP latency of HTTP responses (first byte sent to last byte
> acked), B < R < Y < D. But the differences are so insignificant (<1%).
> The median, 95pctl, and 99pctl has similar indifference. In summary
> there's hardly visible impact on latency. I also look at only response
> less than 4KB but do not see a different picture.
>
> The main difference is the retransmission rate where R =~ Y < B =~D.
> R and Y are ~20% lower than B and D. Parsing the SNMP stats reveal
> more interesting details. The table shows the deltas in percentage to
> the baseline B.
>
>                 D      R     Y
> ------------------------------
> Timeout      +12%   -16%  -16%
> TailLossProb +28%    -7%   -7%
> DSACK_rcvd   +37%    -7%   -7%
> Cwnd-undo    +16%   -29%  -29%
>
> RTO change affects TLP because TLP will use the min of RTO and TLP
> timer value to arm the probe timer.
>
> The stats indicate that the main culprit of spurious timeouts / rtx is
> the RTO lower-bound. But they also show the RFC RTTVAR averaging is as
> good as current Linux approach.
>
> Given that I would recommend we revise this patch to use the RFC
> averaging but keep existing lower-bound (of RTTVAR to 200ms). We can
> further experiment the lower-bound and change that in a separate
> patch.
Hi, I have an update.

I instrumented the kernel to capture the time spent in recovery
(attached). The latency measurement starts when TCP goes into
recovery, triggered by either ACKs or RTOs.  The start time is the
(original) sent time of the first unacked packet. The end time is when
the ACK covers the highest sent sequence when recovery started. The
total latency in usec and count are recorded in MIB_TCPRECOVLAT and
MIB_TCPRECOVCNT. If the connection times out or closes while the
sender was still in recovery, the total latency and count are stored
in MIB_TCPRECOVLAT2 and MIB_TCPRECOVCNT2. This second bucket is to
capture long recovery that led to eventual connection aborts.

Since network stats usually follow a power-law distribution, the mean
is dominated by the tail. But the new metrics still show a very
interesting impact of the different RTOs. Using the same table format
as in my previous email, this table shows the difference in percentage
to the baseline.

                       B     D     R     Y
------------------------------------------
mean TCPRecovLat      3s   -7%  +39%  +38%
mean TCPRecovLat2    52s   +1%  -11%  -11%

The new metrics show that lower-bounding the RTO to 200ms (D) indeed
lowers the latency. But by my previous analysis, D also has a lot more
spurious rtx and TLPs (whose collateral damage on latency is not
captured by these metrics). And note that the TLP timer uses the min
of RTO and TLP timeout, so TLP fires 28% more often in (D). Therefore
the latency benefit may come mainly from a faster TLP timer.
Nevertheless, the significant impacts on recovery latency do not show
up in the response latency we measured earlier. My conjecture is that
only a small fraction of flows experience losses, so even a 40%
average increase in loss recovery does not move the needle, or the
latency reduction was cancelled by the latency increase from spurious
timeouts and CC reactions.

It almost seems that the current code is the right balance. We may
need more investigation.

Daniel and Hagen, could you try my instrumentation patch on your testbed?

[-- Attachment #2: 0001-tcp-instrument-recovery-latency.patch --]
[-- Type: application/octet-stream, Size: 6650 bytes --]

From 7692d87af171514e4e4b5fb31631faf6d4f604d6 Mon Sep 17 00:00:00 2001
From: Yuchung Cheng <ycheng@google.com>
Date: Mon, 27 Jun 2016 18:16:11 -0700
Subject: [PATCH] tcp: instrument recovery latency

Adding four SNMP stats for recovery latency. The latency measurement
starts when TCP goes into recovery, triggered by either ACKs
or RTOs.  The start time is the (original) sent time of the first
unacked packet. The end time is when the ACK covers the highest
sent sequence when recovery started. The total latency in usec and
count are recorded in MIB_TCPRECOVLAT and MIB_TCPRECOVCNT.

If the connection times out or closes while we were in recovery,
the total latency and count are stored in MIB_TCPRECOVLAT2 and
MIB_TCPRECOVCNT2.
---
 include/linux/tcp.h       |  3 +++
 include/net/tcp.h         |  7 +++++++
 include/uapi/linux/snmp.h |  4 ++++
 net/ipv4/proc.c           |  4 ++++
 net/ipv4/tcp.c            |  1 +
 net/ipv4/tcp_input.c      |  7 +++++++
 net/ipv4/tcp_recovery.c   | 31 +++++++++++++++++++++++++++++++
 net/ipv4/tcp_timer.c      |  1 +
 8 files changed, 58 insertions(+)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 7be9b12..c6d6189 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -355,6 +355,7 @@ struct tcp_sock {
 	 */
 	struct request_sock *fastopen_rsk;
 	u32	*saved_syn;
+	struct skb_mstamp recov_start;
 };
 
 enum tsq_flags {
@@ -388,6 +389,8 @@ struct tcp_timewait_sock {
 #ifdef CONFIG_TCP_MD5SIG
 	struct tcp_md5sig_key	  *tw_md5_key;
 #endif
+
+	struct skb_mstamp recov_start;
 };
 
 static inline struct tcp_timewait_sock *tcp_twsk(const struct sock *sk)
diff --git a/include/net/tcp.h b/include/net/tcp.h
index a79894b..85c38fa 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1862,6 +1862,13 @@ static inline void tcp_segs_in(struct tcp_sock *tp, const struct sk_buff *skb)
 		tp->data_segs_in += segs_in;
 }
 
+static inline void tcp_reset_recovery_latency(struct tcp_sock *tp)
+{
+	tp->recov_start.v64 = 0;
+}
+void tcp_start_recovery_latency(struct sock *sk);
+void tcp_end_recovery_latency(struct sock *sk, bool acked);
+
 /*
  * TCP listen path runs lockless.
  * We forced "struct sock" to be const qualified to make sure
diff --git a/include/uapi/linux/snmp.h b/include/uapi/linux/snmp.h
index 25a9ad8..6228bf9 100644
--- a/include/uapi/linux/snmp.h
+++ b/include/uapi/linux/snmp.h
@@ -280,6 +280,10 @@ enum
 	LINUX_MIB_TCPKEEPALIVE,			/* TCPKeepAlive */
 	LINUX_MIB_TCPMTUPFAIL,			/* TCPMTUPFail */
 	LINUX_MIB_TCPMTUPSUCCESS,		/* TCPMTUPSuccess */
+	LINUX_MIB_TCPRECOVLAT,
+	LINUX_MIB_TCPRECOVCNT,
+	LINUX_MIB_TCPRECOVLAT2,
+	LINUX_MIB_TCPRECOVCNT2,
 	__LINUX_MIB_MAX
 };
 
diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
index 9f665b6..0029429 100644
--- a/net/ipv4/proc.c
+++ b/net/ipv4/proc.c
@@ -302,6 +302,10 @@ static const struct snmp_mib snmp4_net_list[] = {
 	SNMP_MIB_ITEM("TCPKeepAlive", LINUX_MIB_TCPKEEPALIVE),
 	SNMP_MIB_ITEM("TCPMTUPFail", LINUX_MIB_TCPMTUPFAIL),
 	SNMP_MIB_ITEM("TCPMTUPSuccess", LINUX_MIB_TCPMTUPSUCCESS),
+	SNMP_MIB_ITEM("TCPRecovLat", LINUX_MIB_TCPRECOVLAT),
+	SNMP_MIB_ITEM("TCPRecovCnt", LINUX_MIB_TCPRECOVCNT),
+	SNMP_MIB_ITEM("TCPRecovLat2", LINUX_MIB_TCPRECOVLAT2),
+	SNMP_MIB_ITEM("TCPRecovCnt2", LINUX_MIB_TCPRECOVCNT2),
 	SNMP_MIB_SENTINEL
 };
 
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 5c7ed14..3de4649 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2111,6 +2111,7 @@ void tcp_close(struct sock *sk, long timeout)
 		 */
 		tcp_send_fin(sk);
 	}
+	tcp_end_recovery_latency(sk, false);
 
 	sk_stream_wait_close(sk, timeout);
 
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 94d4aff..1169183 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -1951,6 +1951,8 @@ void tcp_enter_loss(struct sock *sk)
 	tp->frto = sysctl_tcp_frto &&
 		   (new_recovery || icsk->icsk_retransmits) &&
 		   !inet_csk(sk)->icsk_mtup.probe_size;
+
+	tcp_start_recovery_latency(sk);
 }
 
 /* If ACK arrived pointing to a remembered SACK, it means that our
@@ -2374,6 +2376,7 @@ static void tcp_undo_cwnd_reduction(struct sock *sk, bool unmark_loss)
 	}
 	tp->snd_cwnd_stamp = tcp_time_stamp;
 	tp->undo_marker = 0;
+	tcp_reset_recovery_latency(tp); /* Spurious recovery */
 }
 
 static inline bool tcp_may_undo(const struct tcp_sock *tp)
@@ -2658,6 +2661,8 @@ static void tcp_enter_recovery(struct sock *sk, bool ece_ack)
 		tcp_init_cwnd_reduction(sk);
 	}
 	tcp_set_ca_state(sk, TCP_CA_Recovery);
+
+	tcp_start_recovery_latency(sk);
 }
 
 /* Process an ACK in CA_Loss state. Move to CA_Open if lost data are
@@ -2790,6 +2795,7 @@ static void tcp_fastretrans_alert(struct sock *sk, const int acked,
 	if (icsk->icsk_ca_state == TCP_CA_Open) {
 		WARN_ON(tp->retrans_out != 0);
 		tp->retrans_stamp = 0;
+		tcp_reset_recovery_latency(tp); /* Paranoid reset */
 	} else if (!before(tp->snd_una, tp->high_seq)) {
 		switch (icsk->icsk_ca_state) {
 		case TCP_CA_CWR:
@@ -2809,6 +2815,7 @@ static void tcp_fastretrans_alert(struct sock *sk, const int acked,
 			tcp_end_cwnd_reduction(sk);
 			break;
 		}
+		tcp_end_recovery_latency(sk, true);
 	}
 
 	/* Use RACK to detect loss */
diff --git a/net/ipv4/tcp_recovery.c b/net/ipv4/tcp_recovery.c
index e36df4f..46093a5 100644
--- a/net/ipv4/tcp_recovery.c
+++ b/net/ipv4/tcp_recovery.c
@@ -107,3 +107,34 @@ void tcp_rack_advance(struct tcp_sock *tp,
 	tp->rack.mstamp = *xmit_time;
 	tp->rack.advanced = 1;
 }
+
+/* Caller must check write queue is not empty */
+void tcp_start_recovery_latency(struct sock *sk)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+
+	if (!tp->recov_start.v64)
+		tp->recov_start = tcp_write_queue_head(sk)->skb_mstamp;
+}
+
+void tcp_end_recovery_latency(struct sock *sk, bool acked)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	struct net *net = sock_net(sk);
+	struct skb_mstamp now;
+
+	if (!tp->recov_start.v64)
+		return;
+
+	skb_mstamp_get(&now);
+	if (acked) {
+		NET_ADD_STATS(net, LINUX_MIB_TCPRECOVLAT,
+			      skb_mstamp_us_delta(&now, &tp->recov_start));
+		NET_INC_STATS(net, LINUX_MIB_TCPRECOVCNT);
+	} else {
+		NET_ADD_STATS(net, LINUX_MIB_TCPRECOVLAT2,
+			      skb_mstamp_us_delta(&now, &tp->recov_start));
+		NET_INC_STATS(net, LINUX_MIB_TCPRECOVCNT2);
+	}
+	tp->recov_start.v64 = 0;
+}
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index debdd8b..7edc059 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -29,6 +29,7 @@ static void tcp_write_err(struct sock *sk)
 	sk->sk_err = sk->sk_err_soft ? : ETIMEDOUT;
 	sk->sk_error_report(sk);
 
+	tcp_end_recovery_latency(sk, false);
 	tcp_done(sk);
 	__NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPABORTONTIMEOUT);
 }
-- 
2.8.0.rc3.226.g39d4020



* Re: [PATCH net-next v3] tcp: use RFC6298 compliant TCP RTO calculation
  2016-06-29  7:06                   ` Yuchung Cheng
@ 2016-06-29 20:18                     ` Daniel Metz
  0 siblings, 0 replies; 22+ messages in thread
From: Daniel Metz @ 2016-06-29 20:18 UTC (permalink / raw)
  To: Yuchung Cheng, Hagen Paul Pfeifer, Daniel Metz
  Cc: netdev, Eric Dumazet, Neal Cardwell, David Miller

>                        B     D     R     Y
> ------------------------------------------
> mean TCPRecovLat      3s   -7%  +39%  +38%
> mean TCPRecovLat2    52s   +1%  -11%  -11%

This is indeed very interesting and somewhat unexpected. Do you have any 
clue why Y is as bad as R and so much worse than B? By my understanding, 
I would have expected Y to be similar to B. At least tests on the Mean 
Response Waiting Time of sender-limited flows show hardly any difference 
to B (as expected).

Also, is a potentially longer time in TCPRecovLat such a bad thing, 
considering your information on HTTP response performance?


end of thread, other threads:[~2016-06-29 20:26 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-06-13 20:45 [PATCH net-next] tcp: use RFC6298 compliant TCP RTO calculation Daniel Metz
2016-06-13 21:19 ` Eric Dumazet
2016-06-13 22:38 ` Yuchung Cheng
2016-06-14  6:17   ` Hagen Paul Pfeifer
2016-06-14 17:58     ` Yuchung Cheng
2016-06-14 19:18       ` [PATCH net-next v2] " Daniel Metz
2016-06-14 21:33         ` Yuchung Cheng
2016-06-15 17:41           ` Hagen Paul Pfeifer
2016-06-15 18:02             ` Yuchung Cheng
2016-06-15 20:34               ` Daniel Metz
2016-06-15 20:38                 ` Eric Dumazet
2016-06-15 21:07                   ` Yuchung Cheng
2016-06-16  9:07                   ` Hagen Paul Pfeifer
2016-06-16  9:03               ` Hagen Paul Pfeifer
2016-06-15 18:00           ` [PATCH net-next v3] " Daniel Metz
2016-06-17 18:32             ` David Miller
2016-06-17 18:56               ` Yuchung Cheng
2016-06-22  5:53                 ` Yuchung Cheng
2016-06-22 11:21                   ` Hagen Paul Pfeifer
2016-06-22 20:50                     ` Yuchung Cheng
2016-06-29  7:06                   ` Yuchung Cheng
2016-06-29 20:18                     ` Daniel Metz
