* tcp: multiple ssthresh reductions before all packets are retransmitted
@ 2014-06-16 15:35 Michal Kubecek
  2014-06-16 17:02 ` Yuchung Cheng
  0 siblings, 1 reply; 21+ messages in thread
From: Michal Kubecek @ 2014-06-16 15:35 UTC (permalink / raw)
  To: netdev

Hello,

while investigating a problem with TCP loss recovery, I noticed that if
a large window is lost and another loss happens before the recovery is
finished, ssthresh can drop to very low values. This is especially
harmful with Reno congestion control, where cwnd growth in the congestion
avoidance phase is linear and rather slow. This is 100% reproducible
on older (3.0) kernels, but I was able to reproduce it on 3.15 as well.

RFC 5681 says that ssthresh reduction in response to RTO should be done
only once and should not be repeated until all packets from the first
loss are retransmitted. RFC 6582 (as well as its predecessor RFC 3782)
is even more specific and says that when loss is detected, one should
mark current SND.NXT and ssthresh shouldn't be reduced again due to
a loss until SND.UNA reaches this remembered value.

The Linux implementation does exactly that, but in the TCP_CA_Loss state,
tcp_enter_loss() also takes icsk_retransmits into account:

	/* Reduce ssthresh if it has not yet been made inside this window. */
	if (icsk->icsk_ca_state <= TCP_CA_Disorder ||
	    !after(tp->high_seq, tp->snd_una) ||
	    (icsk->icsk_ca_state == TCP_CA_Loss && !icsk->icsk_retransmits)) {
		new_recovery = true;
		tp->prior_ssthresh = tcp_current_ssthresh(sk);
		tp->snd_ssthresh = icsk->icsk_ca_ops->ssthresh(sk);
		tcp_ca_event(sk, CA_EVENT_LOSS);
	}

This seems correct as icsk_retransmits is supposed to mean "we still
have packets to retransmit". But icsk_retransmits is reset in
tcp_process_loss() if one of two conditions is satisfied:

	bool recovered = !before(tp->snd_una, tp->high_seq);
...
	if (recovered) {
		/* F-RTO RFC5682 sec 3.1 step 2.a and 1st part of step 3.a */
		icsk->icsk_retransmits = 0;
		tcp_try_undo_recovery(sk);
		return;
	}
	if (flag & FLAG_DATA_ACKED)
		icsk->icsk_retransmits = 0;

The first part implements the condition from the RFCs, but the second
part can reset icsk_retransmits well before all previously lost data have
been retransmitted, so that ssthresh can be reduced more than once.
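
To illustrate with made-up numbers, here is a toy sketch (plain C, not
kernel code) of the effect with Reno, where ssthresh is set to half of
the current cwnd:

    #include <stdio.h>

    int main(void)
    {
        /* hypothetical values, chosen only for illustration */
        unsigned int cwnd = 2100;
        unsigned int ssthresh = cwnd / 2; /* first RTO: ssthresh = 1050, cwnd -> 1 */

        cwnd = 8;                         /* a few RTTs of slow start later */

        /* second RTO before snd_una has reached high_seq */
        printf("gated on high_seq: ssthresh stays at %u\n", ssthresh);
        printf("without the gate:  ssthresh drops to %u\n", cwnd / 2);
        return 0;
    }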

The first test is recent and comes from commits ab42d9ee3 ("tcp: refactor
CA_Loss state processing") and e33099f96 ("tcp: implement RFC5682
F-RTO"). The second test is much older and predates the start of the git
tree.

I believe this code is actually wrong and we should remove those two
lines:

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 40661fc..f534723 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2710,8 +2710,6 @@ static void tcp_process_loss(struct sock *sk, int flag, bool is_dupack)
 		tcp_try_undo_recovery(sk);
 		return;
 	}
-	if (flag & FLAG_DATA_ACKED)
-		icsk->icsk_retransmits = 0;
 	if (tcp_is_reno(tp)) {
 		/* A Reno DUPACK means new data in F-RTO step 2.b above are
 		 * delivered. Lower inflight to clock out (re)tranmissions.

But as I don't know where they come from, I can't be sure they don't
serve some purpose, so I wanted to ask first whether someone knows the
logic behind them.

Thanks in advance,
Michal Kubecek

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: tcp: multiple ssthresh reductions before all packets are retransmitted
  2014-06-16 15:35 tcp: multiple ssthresh reductions before all packets are retransmitted Michal Kubecek
@ 2014-06-16 17:02 ` Yuchung Cheng
  2014-06-16 18:48   ` Michal Kubecek
       [not found]   ` <20140616174721.GA15406@lion>
  0 siblings, 2 replies; 21+ messages in thread
From: Yuchung Cheng @ 2014-06-16 17:02 UTC (permalink / raw)
  To: Michal Kubecek; +Cc: netdev

On Mon, Jun 16, 2014 at 8:35 AM, Michal Kubecek <mkubecek@suse.cz> wrote:
> Hello,
>
> while investigating a problem with TCP loss recovery, I noticed that if
> large window is lost and another loss happens before the recovery is
> finished, ssthresh can drop to very low values. This is especially
> harmful with Reno congestion control where cwnd growth in congestion
> avoidance phase is linear and rather slow. This is 100% reproducible
> on older (3.0) kernels but I was able to reproduce it on 3.15 as well.
>
> RFC 5681 says that ssthresh reduction in response to RTO should be done
> only once and should not be repeated until all packets from the first
> loss are retransmitted. RFC 6582 (as well as its predecessor RFC 3782)
> is even more specific and says that when loss is detected, one should
> mark current SND.NXT and ssthresh shouldn't be reduced again due to
> a loss until SND.UNA reaches this remembered value.
>
> Linux implementation does exactly that but in TCP_CA_Loss state,
> tcp_enter_loss() also takes icsk_retransmits into account:
>
>         /* Reduce ssthresh if it has not yet been made inside this window. */
>         if (icsk->icsk_ca_state <= TCP_CA_Disorder ||
>             !after(tp->high_seq, tp->snd_una) ||
>             (icsk->icsk_ca_state == TCP_CA_Loss && !icsk->icsk_retransmits)) {
>                 new_recovery = true;
>                 tp->prior_ssthresh = tcp_current_ssthresh(sk);
>                 tp->snd_ssthresh = icsk->icsk_ca_ops->ssthresh(sk);
>                 tcp_ca_event(sk, CA_EVENT_LOSS);
>         }
>
> This seems correct as icsk_retransmits is supposed to mean "we still
> have packets to retransmit". But icsk_retransmits is reset in
> tcp_process_loss() if one of two conditions is satisfied:

icsk_retransmits indicates the number of recurring timeouts (of the
same sequence), so it is reset when the recovery is done or SND_UNA is
advanced. The variable name, however, is confusing.

Does your suggested change fix the problem you are observing?

>
>         bool recovered = !before(tp->snd_una, tp->high_seq);
> ...
>         if (recovered) {
>                 /* F-RTO RFC5682 sec 3.1 step 2.a and 1st part of step 3.a */
>                 icsk->icsk_retransmits = 0;
>                 tcp_try_undo_recovery(sk);
>                 return;
>         }
>         if (flag & FLAG_DATA_ACKED)
>                 icsk->icsk_retransmits = 0;
>
> First part implements the condition from RFC. But the second part can
> reset icsk_retransmits much earlier than all previously lost data are
> retransmitted so that ssthresh can be reduced more than once.
>
> The first is recent and comes from commits ab42d9ee3 ("tcp: refactor
> CA_Loss state processing") and e33099f96 ("tcp: implement RFC5682
> F-RTO"). Second test is much older and predates the start of git tree.
>
> I believe this code is actually wrong and we should remove those two
> lines:
>
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 40661fc..f534723 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -2710,8 +2710,6 @@ static void tcp_process_loss(struct sock *sk, int flag, bool is_dupack)
>                 tcp_try_undo_recovery(sk);
>                 return;
>         }
> -       if (flag & FLAG_DATA_ACKED)
> -               icsk->icsk_retransmits = 0;
>         if (tcp_is_reno(tp)) {
>                 /* A Reno DUPACK means new data in F-RTO step 2.b above are
>                  * delivered. Lower inflight to clock out (re)tranmissions.
>
> But as I don't know where do they come from, I can't be sure they don't
> serve some purpose so I wanted to ask first if someone knows the logic
> behind them.
>
> Thanks in advance,
> Michal Kubecek
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: tcp: multiple ssthresh reductions before all packets are retransmitted
  2014-06-16 17:02 ` Yuchung Cheng
@ 2014-06-16 18:48   ` Michal Kubecek
       [not found]   ` <20140616174721.GA15406@lion>
  1 sibling, 0 replies; 21+ messages in thread
From: Michal Kubecek @ 2014-06-16 18:48 UTC (permalink / raw)
  To: Yuchung Cheng; +Cc: netdev

On Mon, Jun 16, 2014 at 10:02:08AM -0700, Yuchung Cheng wrote:
> On Mon, Jun 16, 2014 at 8:35 AM, Michal Kubecek <mkubecek@suse.cz> wrote:
> >
> > RFC 5681 says that ssthresh reduction in response to RTO should be done
> > only once and should not be repeated until all packets from the first
> > loss are retransmitted. RFC 6582 (as well as its predecessor RFC 3782)
> > is even more specific and says that when loss is detected, one should
> > mark current SND.NXT and ssthresh shouldn't be reduced again due to
> > a loss until SND.UNA reaches this remembered value.
> >
> > Linux implementation does exactly that but in TCP_CA_Loss state,
> > tcp_enter_loss() also takes icsk_retransmits into account:
> >
> >         /* Reduce ssthresh if it has not yet been made inside this window. */
> >         if (icsk->icsk_ca_state <= TCP_CA_Disorder ||
> >             !after(tp->high_seq, tp->snd_una) ||
> >             (icsk->icsk_ca_state == TCP_CA_Loss && !icsk->icsk_retransmits)) {
> >                 new_recovery = true;
> >                 tp->prior_ssthresh = tcp_current_ssthresh(sk);
> >                 tp->snd_ssthresh = icsk->icsk_ca_ops->ssthresh(sk);
> >                 tcp_ca_event(sk, CA_EVENT_LOSS);
> >         }
> >
> > This seems correct as icsk_retransmits is supposed to mean "we still
> > have packets to retransmit". But icsk_retransmits is reset in
> > tcp_process_loss() if one of two conditions is satisfied:
> 
> icsk_retransmits indicates the number of recurring timeouts (of the
> same sequence). so it is reset when the recovery is done or SND_UNA is
> advanced. the variable name however is confusing.

In that case, I suppose the problem is this part

    (icsk->icsk_ca_state == TCP_CA_Loss && !icsk->icsk_retransmits)

of the condition above (in tcp_enter_loss()), as it means we allow a
further ssthresh reduction as soon as SND.UNA moves forward (not only
once it reaches high_seq).

> does your suggested change fix the problem you are observing?

It does: with the change, ssthresh isn't lowered again until all lost
packets are retransmitted (at which point cwnd is already reasonably
high). But I wasn't sure it wouldn't break something else.

                                                          Michal Kubecek

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: tcp: multiple ssthresh reductions before all packets are retransmitted
       [not found]   ` <20140616174721.GA15406@lion>
@ 2014-06-16 19:04     ` Yuchung Cheng
  2014-06-16 20:06       ` Michal Kubecek
  2014-06-16 21:19       ` [PATCH net] tcp: avoid multiple ssthresh reductions in on retransmit window Michal Kubecek
  0 siblings, 2 replies; 21+ messages in thread
From: Yuchung Cheng @ 2014-06-16 19:04 UTC (permalink / raw)
  To: Michal Kubecek; +Cc: netdev

On Mon, Jun 16, 2014 at 10:47 AM, Michal Kubecek <mkubecek@suse.cz> wrote:
> On Mon, Jun 16, 2014 at 10:02:08AM -0700, Yuchung Cheng wrote:
>> On Mon, Jun 16, 2014 at 8:35 AM, Michal Kubecek <mkubecek@suse.cz> wrote:
>> >
>> > RFC 5681 says that ssthresh reduction in response to RTO should be done
>> > only once and should not be repeated until all packets from the first
>> > loss are retransmitted. RFC 6582 (as well as its predecessor RFC 3782)
>> > is even more specific and says that when loss is detected, one should
>> > mark current SND.NXT and ssthresh shouldn't be reduced again due to
>> > a loss until SND.UNA reaches this remembered value.
>> >
>> > Linux implementation does exactly that but in TCP_CA_Loss state,
>> > tcp_enter_loss() also takes icsk_retransmits into account:
>> >
>> >         /* Reduce ssthresh if it has not yet been made inside this window. */
>> >         if (icsk->icsk_ca_state <= TCP_CA_Disorder ||
>> >             !after(tp->high_seq, tp->snd_una) ||
>> >             (icsk->icsk_ca_state == TCP_CA_Loss && !icsk->icsk_retransmits)) {
>> >                 new_recovery = true;
>> >                 tp->prior_ssthresh = tcp_current_ssthresh(sk);
>> >                 tp->snd_ssthresh = icsk->icsk_ca_ops->ssthresh(sk);
>> >                 tcp_ca_event(sk, CA_EVENT_LOSS);
>> >         }
>> >
>> > This seems correct as icsk_retransmits is supposed to mean "we still
>> > have packets to retransmit". But icsk_retransmits is reset in
>> > tcp_process_loss() if one of two conditions is satisfied:
>>
>> icsk_retransmits indicates the number of recurring timeouts (of the
>> same sequence). so it is reset when the recovery is done or SND_UNA is
>> advanced. the variable name however is confusing.
>
> In that case, I suppose the problem would be this part
>
>     (icsk->icsk_ca_state == TCP_CA_Loss && !icsk->icsk_retransmits)
>
> of the condition above (in tcp_enter_loss()). As that would mean we
> would allow further ssthresh reduction as soon as SND.UNA moves forward
> (not when it reaches high_seq).

Nice catch. That check is incorrect and should be removed. I have used
packetdrill to verify that removing it fixes the ssthresh problem.

`../common/defaults.sh`

0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
   +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
   +0 bind(3, ..., ...) = 0
   +0 listen(3, 1) = 0

   +0 < S 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 7>
   +0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 6>

+.020 < . 1:1(0) ack 1 win 257
   +0 accept(3, ..., ...) = 4
   +0 write(4, ..., 11000) = 11000
   +0 > . 1:10001(10000) ack 1

// TLP (is dropped too)
+.040 > P. 10001:11001(1000) ack 1

// RTO and retransmit head, SST is set
+.221 > . 1:1001(1000) ack 1
   +0 %{ sst = tcpi_snd_ssthresh }%

+.020 < . 1:1(0) ack 1001 win 257
   +0 > . 1001:2001(1000) ack 1
   +0 > . 2001:3001(1000) ack 1

+.020 < . 1:1(0) ack 2001 win 257
   +0 > . 3001:4001(1000) ack 1
   +0 > . 4001:5001(1000) ack 1

// Another timeout during recovery. Check SST remains the same
+.020 < . 1:1(0) ack 2001 win 257
+.421 > . 2001:3001(1000) ack 1
   +0 %{ assert tcpi_snd_ssthresh == sst }%

+.020 < . 1:1(0) ack 3001 win 257
   +0 > . 3001:4001(1000) ack 1
   +0 > . 4001:5001(1000) ack 1

+.020 < . 1:1(0) ack 5001 win 257
   +0 > . 5001:6001(1000) ack 1
   +0 > . 6001:7001(1000) ack 1
   +0 > . 7001:8001(1000) ack 1
   +0 > . 8001:9001(1000) ack 1

+.020 < . 1:1(0) ack 9001 win 257
   +0 > . 9001:10001(1000) ack 1
   +0 > P. 10001:11001(1000) ack 1

// Recovery is done. Check SST remains the same
+0.20 < . 1:1(0) ack 11001 win 257
   +0 %{ assert tcpi_snd_ssthresh == sst }%
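
(For anyone wanting to reproduce this: with the script saved as, say,
ssthresh-rto.pkt (the name is just an example), it should run as

    packetdrill ssthresh-rto.pkt

assuming the usual packetdrill test-tree layout for the
`../common/defaults.sh` include above.)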

Do you want to submit a one-liner patch? Or I am happy to do that.


>
>> does your suggested change fix the problem you are observing?
>
> It does, with it ssthresh isn't lowered again until all lost packets are
> retransmitted (at which time, cwnd is already reasonably high).
>
>                                                           Michal Kubecek
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: tcp: multiple ssthresh reductions before all packets are retransmitted
  2014-06-16 19:04     ` Yuchung Cheng
@ 2014-06-16 20:06       ` Michal Kubecek
  2014-06-16 21:19       ` [PATCH net] tcp: avoid multiple ssthresh reductions in on retransmit window Michal Kubecek
  1 sibling, 0 replies; 21+ messages in thread
From: Michal Kubecek @ 2014-06-16 20:06 UTC (permalink / raw)
  To: Yuchung Cheng; +Cc: netdev

On Mon, Jun 16, 2014 at 12:04:37PM -0700, Yuchung Cheng wrote:
> On Mon, Jun 16, 2014 at 10:47 AM, Michal Kubecek <mkubecek@suse.cz> wrote:
> > On Mon, Jun 16, 2014 at 10:02:08AM -0700, Yuchung Cheng wrote:
> >> On Mon, Jun 16, 2014 at 8:35 AM, Michal Kubecek <mkubecek@suse.cz> wrote:
> >> >
> >> > RFC 5681 says that ssthresh reduction in response to RTO should be done
> >> > only once and should not be repeated until all packets from the first
> >> > loss are retransmitted. RFC 6582 (as well as its predecessor RFC 3782)
> >> > is even more specific and says that when loss is detected, one should
> >> > mark current SND.NXT and ssthresh shouldn't be reduced again due to
> >> > a loss until SND.UNA reaches this remembered value.
> >> >
> >> > Linux implementation does exactly that but in TCP_CA_Loss state,
> >> > tcp_enter_loss() also takes icsk_retransmits into account:
> >> >
> >> >         /* Reduce ssthresh if it has not yet been made inside this window. */
> >> >         if (icsk->icsk_ca_state <= TCP_CA_Disorder ||
> >> >             !after(tp->high_seq, tp->snd_una) ||
> >> >             (icsk->icsk_ca_state == TCP_CA_Loss && !icsk->icsk_retransmits)) {
> >> >                 new_recovery = true;
> >> >                 tp->prior_ssthresh = tcp_current_ssthresh(sk);
> >> >                 tp->snd_ssthresh = icsk->icsk_ca_ops->ssthresh(sk);
> >> >                 tcp_ca_event(sk, CA_EVENT_LOSS);
> >> >         }
> >> >
> >> > This seems correct as icsk_retransmits is supposed to mean "we still
> >> > have packets to retransmit". But icsk_retransmits is reset in
> >> > tcp_process_loss() if one of two conditions is satisfied:
> >>
> >> icsk_retransmits indicates the number of recurring timeouts (of the
> >> same sequence). so it is reset when the recovery is done or SND_UNA is
> >> advanced. the variable name however is confusing.
> >
> > In that case, I suppose the problem would be this part
> >
> >     (icsk->icsk_ca_state == TCP_CA_Loss && !icsk->icsk_retransmits)
> >
> > of the condition above (in tcp_enter_loss()). As that would mean we
> > would allow further ssthresh reduction as soon as SND.UNA moves forward
> > (not when it reaches high_seq).
> 
> Nice catch. That check is incorrect and should be removed. I have used
> packetdrill to verify it fixes the ssthresh problem.
> 
...
> 
> Do you want to submit a one-liner patch? or I am happy to do that.

I'll submit it. Thank you for your help.

                                                          Michal Kubecek

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH net] tcp: avoid multiple ssthresh reductions in on retransmit window
  2014-06-16 19:04     ` Yuchung Cheng
  2014-06-16 20:06       ` Michal Kubecek
@ 2014-06-16 21:19       ` Michal Kubecek
  2014-06-16 22:39         ` Yuchung Cheng
  1 sibling, 1 reply; 21+ messages in thread
From: Michal Kubecek @ 2014-06-16 21:19 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Yuchung Cheng, Alexey Kuznetsov, James Morris,
	Hideaki YOSHIFUJI, Patrick McHardy

RFC 5681 says that ssthresh reduction in response to RTO should
be done only once and should not be repeated until all packets
from the first loss are retransmitted. RFC 6582 (as well as its
predecessor RFC 3782) is even more specific and says that when
loss is detected, one should mark current SND.NXT and ssthresh
shouldn't be reduced again due to a loss until SND.UNA reaches
this remembered value.

In the Linux implementation, this is done in tcp_enter_loss()
but an additional condition

  (icsk->icsk_ca_state == TCP_CA_Loss && !icsk->icsk_retransmits)

allows ssthresh to be reduced further before snd_una reaches
high_seq (the snd_nxt value at the previous loss), because
icsk_retransmits is reset as soon as snd_una moves forward. As
a result, if a retransmit timeout occurs early in the
retransmit phase, we can adjust snd_ssthresh based on a very
low value of cwnd. This can be especially harmful for Reno
congestion control with its slow linear cwnd growth in the
congestion avoidance phase.

The patch removes the condition above so that snd_ssthresh is
not reduced again until snd_una reaches high_seq as described in
RFC 5681 and 6582.

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
---
 net/ipv4/tcp_input.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 40661fc..768ba88 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -1917,8 +1917,7 @@ void tcp_enter_loss(struct sock *sk, int how)
 
 	/* Reduce ssthresh if it has not yet been made inside this window. */
 	if (icsk->icsk_ca_state <= TCP_CA_Disorder ||
-	    !after(tp->high_seq, tp->snd_una) ||
-	    (icsk->icsk_ca_state == TCP_CA_Loss && !icsk->icsk_retransmits)) {
+	    !after(tp->high_seq, tp->snd_una)) {
 		new_recovery = true;
 		tp->prior_ssthresh = tcp_current_ssthresh(sk);
 		tp->snd_ssthresh = icsk->icsk_ca_ops->ssthresh(sk);
-- 
1.8.4.5

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH net] tcp: avoid multiple ssthresh reductions in on retransmit window
  2014-06-16 21:19       ` [PATCH net] tcp: avoid multiple ssthresh reductions in on retransmit window Michal Kubecek
@ 2014-06-16 22:39         ` Yuchung Cheng
  2014-06-16 23:42           ` Neal Cardwell
  0 siblings, 1 reply; 21+ messages in thread
From: Yuchung Cheng @ 2014-06-16 22:39 UTC (permalink / raw)
  To: Michal Kubecek
  Cc: David S. Miller, netdev, Alexey Kuznetsov, James Morris,
	Hideaki YOSHIFUJI, Patrick McHardy, Neal Cardwell

On Mon, Jun 16, 2014 at 2:19 PM, Michal Kubecek <mkubecek@suse.cz> wrote:
> RFC 5681 says that ssthresh reduction in response to RTO should
> be done only once and should not be repeated until all packets
> from the first loss are retransmitted. RFC 6582 (as well as its
> predecessor RFC 3782) is even more specific and says that when
> loss is detected, one should mark current SND.NXT and ssthresh
> shouldn't be reduced again due to a loss until SND.UNA reaches
> this remembered value.
>
> In Linux implementation, this is done in tcp_enter_loss() but an
> additional condition
>
>   (icsk->icsk_ca_state == TCP_CA_Loss && !icsk->icsk_retransmits)
>
> allows to further reduce ssthresh before snd_una reaches the
> high_seq (the snd_nxt value at the previous loss) as
> icsk_retransmits is reset as soon as snd_una moves forward. As a
> result, if a retransmit timeout ouccurs early in the retransmit
> phase, we can adjust snd_ssthresh based on very low value of
> cwnd. This can be especially harmful for reno congestion control
> with slow linear cwnd growth in congestion avoidance phase.
>
> The patch removes the condition above so that snd_ssthresh is
> not reduced again until snd_una reaches high_seq as described in
> RFC 5681 and 6582.
>
> Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Acked-by: Yuchung Cheng <ycheng@google.com>

> ---
>  net/ipv4/tcp_input.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 40661fc..768ba88 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -1917,8 +1917,7 @@ void tcp_enter_loss(struct sock *sk, int how)
>
>         /* Reduce ssthresh if it has not yet been made inside this window. */
>         if (icsk->icsk_ca_state <= TCP_CA_Disorder ||
> -           !after(tp->high_seq, tp->snd_una) ||
> -           (icsk->icsk_ca_state == TCP_CA_Loss && !icsk->icsk_retransmits)) {
> +           !after(tp->high_seq, tp->snd_una)) {
>                 new_recovery = true;
>                 tp->prior_ssthresh = tcp_current_ssthresh(sk);
>                 tp->snd_ssthresh = icsk->icsk_ca_ops->ssthresh(sk);
> --
> 1.8.4.5
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net] tcp: avoid multiple ssthresh reductions in on retransmit window
  2014-06-16 22:39         ` Yuchung Cheng
@ 2014-06-16 23:42           ` Neal Cardwell
  2014-06-17  0:25             ` Yuchung Cheng
  0 siblings, 1 reply; 21+ messages in thread
From: Neal Cardwell @ 2014-06-16 23:42 UTC (permalink / raw)
  To: Yuchung Cheng
  Cc: Michal Kubecek, David S. Miller, netdev, Alexey Kuznetsov,
	James Morris, Hideaki YOSHIFUJI, Patrick McHardy

On Mon, Jun 16, 2014 at 6:39 PM, Yuchung Cheng <ycheng@google.com> wrote:
> On Mon, Jun 16, 2014 at 2:19 PM, Michal Kubecek <mkubecek@suse.cz> wrote:
>> RFC 5681 says that ssthresh reduction in response to RTO should
>> be done only once and should not be repeated until all packets
>> from the first loss are retransmitted. RFC 6582 (as well as its
>> predecessor RFC 3782) is even more specific and says that when
>> loss is detected, one should mark current SND.NXT and ssthresh
>> shouldn't be reduced again due to a loss until SND.UNA reaches
>> this remembered value.
>>
>> In Linux implementation, this is done in tcp_enter_loss() but an
>> additional condition
>>
>>   (icsk->icsk_ca_state == TCP_CA_Loss && !icsk->icsk_retransmits)
>>
>> allows to further reduce ssthresh before snd_una reaches the
>> high_seq (the snd_nxt value at the previous loss) as
>> icsk_retransmits is reset as soon as snd_una moves forward. As a
>> result, if a retransmit timeout ouccurs early in the retransmit
>> phase, we can adjust snd_ssthresh based on very low value of
>> cwnd. This can be especially harmful for reno congestion control
>> with slow linear cwnd growth in congestion avoidance phase.
>>
>> The patch removes the condition above so that snd_ssthresh is
>> not reduced again until snd_una reaches high_seq as described in
>> RFC 5681 and 6582.
>>
>> Signed-off-by: Michal Kubecek <mkubecek@suse.cz>

AFAICT this commit description is arguing from a mis-reading of the
RFCs.

RFC 6582 and RFC 3782 are only about Fast Recovery, and not relevant
to the timeout recovery we're dealing with in tcp_enter_loss().

RFC 5681, Section 4.3 says:

   Loss in two successive windows of data, or the loss of a
   retransmission, should be taken as two indications of congestion and,
   therefore, cwnd (and ssthresh) MUST be lowered twice in this case.

So if we're in TCP_CA_Loss and snd_una advances (FLAG_DATA_ACKED is set
and icsk_retransmits is reset to zero), but snd_una does not advance
above high_seq, then a subsequent RTO (and call to tcp_enter_loss())
indicates that a retransmission was lost, which this passage from sec
4.3 says should be taken as a second indication of congestion.

> - (icsk->icsk_ca_state == TCP_CA_Loss && !icsk->icsk_retransmits)) {

AFAICT this existing code is a faithful implementation of RFC 5681,
Section 7:

   The treatment of ssthresh on retransmission timeout was clarified.
   In particular, ssthresh must be set to half the FlightSize on the
   first retransmission of a given segment and then is held constant on
   subsequent retransmissions of the same segment.

That is, if snd_una advances (FLAG_DATA_ACKED is set and
icsk_retransmits is zero) and we subsequently suffer an RTO and call
tcp_enter_loss(), then we will be sending a "first retransmission" of
the segment pointed to by the new/higher snd_una. Since this is the
first retransmission of that new segment, we should reduce ssthresh.

And from first principles, the current Linux code and RFCs seem
sensible on this matter, AFAICT. Suppose we suffer an RTO, and then
over the following RTTs in TCP_CA_Loss we grow cwnd exponentially
again. If we suffer another RTO in this cwnd growth process, then it
seems like a good idea to remember the reduced ssthresh inferred from
this smaller cwnd at which we suffered a loss.

So AFAICT the existing code is sensible and complies with the RFC.

Now, I agree the linear growth of Reno in such situations is
problematic, but I think it's a somewhat separate issue. Or at least
if we're going to change the behavior here then we should justify it
by using data, and not by reference to RFCs. :-)

neal

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net] tcp: avoid multiple ssthresh reductions in on retransmit window
  2014-06-16 23:42           ` Neal Cardwell
@ 2014-06-17  0:25             ` Yuchung Cheng
  2014-06-17  0:44               ` Neal Cardwell
  0 siblings, 1 reply; 21+ messages in thread
From: Yuchung Cheng @ 2014-06-17  0:25 UTC (permalink / raw)
  To: Neal Cardwell
  Cc: Michal Kubecek, David S. Miller, netdev, Alexey Kuznetsov,
	James Morris, Hideaki YOSHIFUJI, Patrick McHardy

On Mon, Jun 16, 2014 at 4:42 PM, Neal Cardwell <ncardwell@google.com> wrote:
> On Mon, Jun 16, 2014 at 6:39 PM, Yuchung Cheng <ycheng@google.com> wrote:
>> On Mon, Jun 16, 2014 at 2:19 PM, Michal Kubecek <mkubecek@suse.cz> wrote:
>>> RFC 5681 says that ssthresh reduction in response to RTO should
>>> be done only once and should not be repeated until all packets
>>> from the first loss are retransmitted. RFC 6582 (as well as its
>>> predecessor RFC 3782) is even more specific and says that when
>>> loss is detected, one should mark current SND.NXT and ssthresh
>>> shouldn't be reduced again due to a loss until SND.UNA reaches
>>> this remembered value.
>>>
>>> In Linux implementation, this is done in tcp_enter_loss() but an
>>> additional condition
>>>
>>>   (icsk->icsk_ca_state == TCP_CA_Loss && !icsk->icsk_retransmits)
>>>
>>> allows to further reduce ssthresh before snd_una reaches the
>>> high_seq (the snd_nxt value at the previous loss) as
>>> icsk_retransmits is reset as soon as snd_una moves forward. As a
>>> result, if a retransmit timeout ouccurs early in the retransmit
>>> phase, we can adjust snd_ssthresh based on very low value of
>>> cwnd. This can be especially harmful for reno congestion control
>>> with slow linear cwnd growth in congestion avoidance phase.
>>>
>>> The patch removes the condition above so that snd_ssthresh is
>>> not reduced again until snd_una reaches high_seq as described in
>>> RFC 5681 and 6582.
>>>
>>> Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
>
> AFAICT this commit description is arguing from a mis-reading of the
> RFCs.
>
> RFC 6582 and RFC 3782 are only about Fast Recovery, and not relevant
> to the timeout recovery we're dealing with in tcp_enter_loss().
>
> RFC 5681, Section 4.3 says:
>
>    Loss in two successive windows of data, or the loss of a
>    retransmission, should be taken as two indications of congestion and,
>    therefore, cwnd (and ssthresh) MUST be lowered twice in this case.
>
> So if we're in TCP_CA_Loss snd_una advances (FLAG_DATA_ACKED is set
> and icsk_retransmits is zero), but snd_una does not advance above
> high_seq, then if we subsequently suffer an RTO (and call
> tcp_enter_loss()) then that indicates a retransmission is lost, which
> this passage from sec 4.3 indicates should be taken as a second
> indication of congestion.
That's right. I should have checked the RFC more thoroughly. Sorry,
please ignore my Acked-by.

However, Linux is inconsistent about the loss of a retransmission: it
reduces ssthresh (and cwnd) if this happens on a timeout, but not in
fast recovery (tcp_mark_lost_retrans). We should fix that, and it
should help in dealing with traffic policers.


>
>> - (icsk->icsk_ca_state == TCP_CA_Loss && !icsk->icsk_retransmits)) {
>
> AFAICT this existing code is a faithful implementation of, RFC 5681,
> Section 7:
>
>    The treatment of ssthresh on retransmission timeout was clarified.
>    In particular, ssthresh must be set to half the FlightSize on the
>    first retransmission of a given segment and then is held constant on
>    subsequent retransmissions of the same segment.
>
> That is, if snd_una advances (FLAG_DATA_ACKED is set and
> icsk_retransmits is zero), if we subsequently suffer an RTO and call
> tcp_enter_loss() then we will be sending a "first retransmission" at
> the segment pointed to by the new/higher snd_una. So this is the first
> retransmission of that new segment, so we should reduce ssthresh.
>
> And from first principles, the current Linux code and RFCs seem
> sensible on this matter, AFAICT. Suppose we suffer an RTO, and then
> over the following RTTs in TCP_CA_Loss we grow cwnd exponentially
> again. If we suffer another RTO in this cwnd growth process, then it
> seems like a good idea to remember the reduced ssthresh inferred from
> this smaller cwnd at which we suffered a loss.
>
> So AFAICT the existing code is sensible and complies with the RFC.
>
> Now, I agree the linear growth of Reno in such situations is
> problematic, but I think it's a somewhat separate issue. Or at least
> if we're going to change the behavior here then we should justify it
> by using data, and not by reference to RFCs. :-)
>
> neal

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net] tcp: avoid multiple ssthresh reductions in on retransmit window
  2014-06-17  0:25             ` Yuchung Cheng
@ 2014-06-17  0:44               ` Neal Cardwell
  2014-06-17 12:20                 ` Michal Kubecek
  0 siblings, 1 reply; 21+ messages in thread
From: Neal Cardwell @ 2014-06-17  0:44 UTC (permalink / raw)
  To: Yuchung Cheng
  Cc: Michal Kubecek, David S. Miller, netdev, Alexey Kuznetsov,
	James Morris, Hideaki YOSHIFUJI, Patrick McHardy

On Mon, Jun 16, 2014 at 8:25 PM, Yuchung Cheng <ycheng@google.com> wrote:
> However Linux is inconsistent on the loss of a retransmission. It
> reduces ssthresh (and cwnd) if this happens on a timeout, but not in
> fast recovery (tcp_mark_lost_retrans). We should fix that and that
> should help dealing with traffic policers.

Yes, great point!

neal

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net] tcp: avoid multiple ssthresh reductions in on retransmit window
  2014-06-17  0:44               ` Neal Cardwell
@ 2014-06-17 12:20                 ` Michal Kubecek
  2014-06-17 21:35                   ` Yuchung Cheng
  0 siblings, 1 reply; 21+ messages in thread
From: Michal Kubecek @ 2014-06-17 12:20 UTC (permalink / raw)
  To: Neal Cardwell
  Cc: Yuchung Cheng, David S. Miller, netdev, Alexey Kuznetsov,
	James Morris, Hideaki YOSHIFUJI, Patrick McHardy

On Mon, Jun 16, 2014 at 08:44:04PM -0400, Neal Cardwell wrote:
> On Mon, Jun 16, 2014 at 8:25 PM, Yuchung Cheng <ycheng@google.com> wrote:
> > However Linux is inconsistent on the loss of a retransmission. It
> > reduces ssthresh (and cwnd) if this happens on a timeout, but not in
> > fast recovery (tcp_mark_lost_retrans). We should fix that and that
> > should help dealing with traffic policers.
> 
> Yes, great point!

Does it mean the patch itself would be acceptable if the reasoning in
its commit message was changed? Or would you prefer a different way to
unify the two situations?

                                                        Michal Kubecek

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net] tcp: avoid multiple ssthresh reductions in on retransmit window
  2014-06-17 12:20                 ` Michal Kubecek
@ 2014-06-17 21:35                   ` Yuchung Cheng
  2014-06-17 22:42                     ` Michal Kubecek
  0 siblings, 1 reply; 21+ messages in thread
From: Yuchung Cheng @ 2014-06-17 21:35 UTC (permalink / raw)
  To: Michal Kubecek
  Cc: Neal Cardwell, David S. Miller, netdev, Alexey Kuznetsov,
	James Morris, Hideaki YOSHIFUJI, Patrick McHardy

On Tue, Jun 17, 2014 at 5:20 AM, Michal Kubecek <mkubecek@suse.cz> wrote:
> On Mon, Jun 16, 2014 at 08:44:04PM -0400, Neal Cardwell wrote:
>> On Mon, Jun 16, 2014 at 8:25 PM, Yuchung Cheng <ycheng@google.com> wrote:
>> > However Linux is inconsistent on the loss of a retransmission. It
>> > reduces ssthresh (and cwnd) if this happens on a timeout, but not in
>> > fast recovery (tcp_mark_lost_retrans). We should fix that and that
>> > should help dealing with traffic policers.
>>
>> Yes, great point!
>
> Does it mean the patch itself would be acceptable if the reasoning in
> its commit message was changed? Or would you prefer a different way to
> unify the two situations?

It's the latter, but it seems to belong in a different patch (and it
won't solve the problem you are seeing).

The idea behind the RFC is that TCP should reduce cwnd and ssthresh
across round trips of sending, but not within an RTT. Suppose cwnd was
10 on the first timeout, so cwnd becomes 1 and ssthresh is 5. Then,
after 3 round trips, we time out again. By the design of Reno this
should reset cwnd from 8 to 1, and ssthresh from 5 to 2.5.

Of course this may not make sense in various cases. But it will be a
design bug in the congestion control rather than an implementation bug
in the loss recovery. We are seeing many similar issues where
non-queue-overflow drops mess up CCs relying on ssthresh :(

>
>                                                         Michal Kubecek
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net] tcp: avoid multiple ssthresh reductions in on retransmit window
  2014-06-17 21:35                   ` Yuchung Cheng
@ 2014-06-17 22:42                     ` Michal Kubecek
  2014-06-18  0:38                       ` Jay Vosburgh
  2014-06-18  7:17                       ` Eric Dumazet
  0 siblings, 2 replies; 21+ messages in thread
From: Michal Kubecek @ 2014-06-17 22:42 UTC (permalink / raw)
  To: Yuchung Cheng
  Cc: Neal Cardwell, David S. Miller, netdev, Alexey Kuznetsov,
	James Morris, Hideaki YOSHIFUJI, Patrick McHardy

On Tue, Jun 17, 2014 at 02:35:23PM -0700, Yuchung Cheng wrote:
> On Tue, Jun 17, 2014 at 5:20 AM, Michal Kubecek <mkubecek@suse.cz> wrote:
> > On Mon, Jun 16, 2014 at 08:44:04PM -0400, Neal Cardwell wrote:
> >> On Mon, Jun 16, 2014 at 8:25 PM, Yuchung Cheng <ycheng@google.com> wrote:
> >> > However Linux is inconsistent on the loss of a retransmission. It
> >> > reduces ssthresh (and cwnd) if this happens on a timeout, but not in
> >> > fast recovery (tcp_mark_lost_retrans). We should fix that and that
> >> > should help dealing with traffic policers.
> >>
> >> Yes, great point!
> >
> > Does it mean the patch itself would be acceptable if the reasoning in
> > its commit message was changed? Or would you prefer a different way to
> > unify the two situations?
> 
> It's the latter but it seems to belong to a different patch (and it'll
> not solve the problem you are seeing).

OK, thank you. I guess we will have to persuade them to move to cubic
which handles their problems much better.

> The idea behind the RFC is that TCP should reduce cwnd and ssthresh
> across round trips of send, but not within an RTT. Suppose cwnd was
> 10 on first timeout, so cwnd becomes 1 and ssthresh is 5. Then after 3
> round trips, we time out again. By the design of Reno this should
> reset cwnd from 8 to 1, and ssthresh from 5 to 2.5.

Shouldn't that be from 5 to 4? We reduce ssthresh to half of the current
cwnd, not half of the current ssthresh.
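
For reference, Reno's ssthresh hook (tcp_reno_ssthresh() in
net/ipv4/tcp_cong.c) is essentially half of the current cwnd, floored at
two segments; a standalone sketch of the same calculation:

    /* Sketch of the Reno ssthresh calculation, not the kernel code
     * itself: half of the current cwnd, but never below 2 segments.
     * With cwnd = 8 at the second timeout this gives 4, as above.
     */
    static unsigned int reno_ssthresh(unsigned int snd_cwnd)
    {
        unsigned int half = snd_cwnd >> 1;

        return half > 2 ? half : 2;
    }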

BTW, this is exactly the problem our customer is facing: they have a
relatively fast line (15 Mb/s) but with big buffers, so that the
round-trip time can rise from 35 ms unloaded up to something like 1.5 s
under full load.

What happens is this: cwnd initially rises to ~2100, then the first drops
are encountered, cwnd is set to 1 and ssthresh to ~1050. Slow start
lets cwnd reach ssthresh, but after that, a slow linear growth follows.
In this state, all in-flight packets are dropped (a simulation of what
happens on router switchover) so that cwnd is reset to 1 again and
ssthresh to something like 530-550 (cwnd was a bit higher than ssthresh).
If a packet loss comes shortly after that, cwnd is still very low and
ssthresh is reduced to half of that cwnd (i.e. much lower than half of
the previous ssthresh). If unlucky, one can even end up with ssthresh
reduced to 2, which takes really long to recover from.
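
To put "really long" in numbers, a rough back-of-the-envelope sketch
(assuming Reno adds roughly one segment to cwnd per RTT in congestion
avoidance, and using the RTT values from above) for climbing from
ssthresh = 2 back to the earlier ~1050:

    #include <stdio.h>

    int main(void)
    {
        unsigned int rtts = 1050 - 2;              /* ~1 cwnd increment per RTT */
        double rtt_idle = 0.035, rtt_loaded = 1.5; /* seconds */

        printf("~%u RTTs: %.0f s unloaded, %.0f s under full load\n",
               rtts, rtts * rtt_idle, rtts * rtt_loaded);
        return 0;
    }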

                                                        Michal Kubecek

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net] tcp: avoid multiple ssthresh reductions in on retransmit window
  2014-06-17 22:42                     ` Michal Kubecek
@ 2014-06-18  0:38                       ` Jay Vosburgh
  2014-06-18  0:56                         ` Neal Cardwell
  2014-06-18 16:56                         ` Yuchung Cheng
  2014-06-18  7:17                       ` Eric Dumazet
  1 sibling, 2 replies; 21+ messages in thread
From: Jay Vosburgh @ 2014-06-18  0:38 UTC (permalink / raw)
  To: Michal Kubecek
  Cc: Yuchung Cheng, Neal Cardwell, David S. Miller, netdev,
	Alexey Kuznetsov, James Morris, Hideaki YOSHIFUJI,
	Patrick McHardy

Michal Kubecek <mkubecek@suse.cz> wrote:

>On Tue, Jun 17, 2014 at 02:35:23PM -0700, Yuchung Cheng wrote:
>> On Tue, Jun 17, 2014 at 5:20 AM, Michal Kubecek <mkubecek@suse.cz> wrote:
>> > On Mon, Jun 16, 2014 at 08:44:04PM -0400, Neal Cardwell wrote:
>> >> On Mon, Jun 16, 2014 at 8:25 PM, Yuchung Cheng <ycheng@google.com> wrote:
>> >> > However Linux is inconsistent on the loss of a retransmission. It
>> >> > reduces ssthresh (and cwnd) if this happens on a timeout, but not in
>> >> > fast recovery (tcp_mark_lost_retrans). We should fix that and that
>> >> > should help dealing with traffic policers.
>> >>
>> >> Yes, great point!
>> >
>> > Does it mean the patch itself would be acceptable if the reasoning in
>> > its commit message was changed? Or would you prefer a different way to
>> > unify the two situations?
>> 
>> It's the latter but it seems to belong to a different patch (and it'll
>> not solve the problem you are seeing).
>
>OK, thank you. I guess we will have to persuade them to move to cubic
>which handles their problems much better.
>
>> The idea behind the RFC is that TCP should reduce cwnd and ssthresh
>> across round trips of send, but not within an RTT. Suppose cwnd was
>> 10 on first timeout, so cwnd becomes 1 and ssthresh is 5. Then after 3
>> round trips, we time out again. By the design of Reno this should
>> reset cwnd from 8 to 1, and ssthresh from 5 to 2.5.
>
>Shouldn't that be from 5 to 4? We reduce ssthresh to half of current
>cwnd, not current ssthresh.
>
>BtW, this is exactly the problem our customer is facing: they have
>relatively fast line (15 Mb/s) but with big buffers so that the
>roundtrip times can rise from unloaded 35 ms up to something like 1.5 s
>under full load.
>
>What happens is this: cwnd initally rises to ~2100 then first drops
>are encountered, cwnd is set to 1 and ssthresh to ~1050. The slow start
>lets cwnd reach ssthresh but after that, a slow linear growth follows.
>In this state, all in-flight packets are dropped (simulation of what
>happens on router switchover) so that cwnd is reset to 1 again and
>ssthresh to something like 530-550 (cwnd was a bit higher than ssthresh).
>If a packet loss comes shortly after that, cwnd is still very low and
>ssthresh is reduced to half of that cwnd (i.e. much lower than to half
>of ssthresh). If unlucky, one can even end up with ssthresh reduced to 2
>which takes really long to recover from.

	I'm also looking into a problem that exhibits very similar TCP
characteristics, even down to cwnd and ssthresh values similar to what
you cite.  In this case, the situation has to do with high RTT (around
80 ms) connections competing with low RTT (1 ms) connections.  This case
is already using cubic.

	Essentially, a high RTT connection to the server transfers data
in at a reasonable and steady rate until something causes some packets
to be lost (in this case, another transfer from a low RTT host to the
same server).  Some packets are lost, and cwnd drops from ~2200 to ~300
(in stages: first to ~1500, then ~600, then to ~300).  The ssthresh
starts at around 1100, then drops to ~260, which is the lowest cwnd
value.

	The recovery from the low cwnd situation is very slow; cwnd
climbs a bit and then remains essentially flat for around 5 seconds.  It
then begins to climb until a few packets are lost again, and the cycle
repeats.  If no further losses occur (if the competing traffic has
ceased, for example), recovery from a low cwnd (300 - 750 ish) to the
full value (~2200) requires on the order of 20 seconds.  The connection
exits recovery state fairly quickly, and most of the 20 seconds is spent
in open state.

	-J

---
	-Jay Vosburgh, jay.vosburgh@canonical.com

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net] tcp: avoid multiple ssthresh reductions in on retransmit window
  2014-06-18  0:38                       ` Jay Vosburgh
@ 2014-06-18  0:56                         ` Neal Cardwell
  2014-06-18  2:00                           ` Jay Vosburgh
  2014-06-19  1:52                           ` Jay Vosburgh
  2014-06-18 16:56                         ` Yuchung Cheng
  1 sibling, 2 replies; 21+ messages in thread
From: Neal Cardwell @ 2014-06-18  0:56 UTC (permalink / raw)
  To: Jay Vosburgh
  Cc: Michal Kubecek, Yuchung Cheng, David S. Miller, netdev,
	Alexey Kuznetsov, James Morris, Hideaki YOSHIFUJI,
	Patrick McHardy

On Tue, Jun 17, 2014 at 8:38 PM, Jay Vosburgh
<jay.vosburgh@canonical.com> wrote:
> Michal Kubecek <mkubecek@suse.cz> wrote:
>
>>On Tue, Jun 17, 2014 at 02:35:23PM -0700, Yuchung Cheng wrote:
>>> On Tue, Jun 17, 2014 at 5:20 AM, Michal Kubecek <mkubecek@suse.cz> wrote:
>>> > On Mon, Jun 16, 2014 at 08:44:04PM -0400, Neal Cardwell wrote:
>>> >> On Mon, Jun 16, 2014 at 8:25 PM, Yuchung Cheng <ycheng@google.com> wrote:
>>> >> > However Linux is inconsistent on the loss of a retransmission. It
>>> >> > reduces ssthresh (and cwnd) if this happens on a timeout, but not in
>>> >> > fast recovery (tcp_mark_lost_retrans). We should fix that and that
>>> >> > should help dealing with traffic policers.
>>> >>
>>> >> Yes, great point!
>>> >
>>> > Does it mean the patch itself would be acceptable if the reasoning in
>>> > its commit message was changed? Or would you prefer a different way to
>>> > unify the two situations?
>>>
>>> It's the latter but it seems to belong to a different patch (and it'll
>>> not solve the problem you are seeing).
>>
>>OK, thank you. I guess we will have to persuade them to move to cubic
>>which handles their problems much better.
>>
>>> The idea behind the RFC is that TCP should reduce cwnd and ssthresh
>>> across round trips of send, but not within an RTT. Suppose cwnd was
>>> 10 on first timeout, so cwnd becomes 1 and ssthresh is 5. Then after 3
>>> round trips, we time out again. By the design of Reno this should
>>> reset cwnd from 8 to 1, and ssthresh from 5 to 2.5.
>>
>>Shouldn't that be from 5 to 4? We reduce ssthresh to half of current
>>cwnd, not current ssthresh.
>>
>>BtW, this is exactly the problem our customer is facing: they have
>>relatively fast line (15 Mb/s) but with big buffers so that the
>>roundtrip times can rise from unloaded 35 ms up to something like 1.5 s
>>under full load.
>>
>>What happens is this: cwnd initally rises to ~2100 then first drops
>>are encountered, cwnd is set to 1 and ssthresh to ~1050. The slow start
>>lets cwnd reach ssthresh but after that, a slow linear growth follows.
>>In this state, all in-flight packets are dropped (simulation of what
>>happens on router switchover) so that cwnd is reset to 1 again and
>>ssthresh to something like 530-550 (cwnd was a bit higher than ssthresh).
>>If a packet loss comes shortly after that, cwnd is still very low and
>>ssthresh is reduced to half of that cwnd (i.e. much lower than to half
>>of ssthresh). If unlucky, one can even end up with ssthresh reduced to 2
>>which takes really long to recover from.
>
>         I'm also looking into a problem that exhibits very similar TCP
> characteristics, even down to cwnd and ssthresh values similar to what
> you cite.  In this case, the situation has to do with high RTT (around
> 80 ms) connections competing with low RTT (1 ms) connections.  This case
> is already using cubic.
>
>         Essentially, a high RTT connection to the server transfers data
> in at a reasonable and steady rate until something causes some packets
> to be lost (in this case, another transfer from a low RTT host to the
> same server).  Some packets are lost, and cwnd drops from ~2200 to ~300
> (in stages, first to ~1500, then ~600, then to ~300, ).  The ssthresh
> starts at around 1100, then drops to ~260, which is the lowest cwnd
> value.
>
>         The recovery from the low cwnd situation is very slow; cwnd
> climbs a bit and then remains essentially flat for around 5 seconds.  It
> then begins to climb until a few packets are lost again, and the cycle
> repeats.  If no futher losses occur (if the competing traffic has
> ceased, for example), recovery from a low cwnd (300 - 750 ish) to the
> full value (~2200) requires on the order of 20 seconds.  The connection
> exits recovery state fairly quickly, and most of the 20 seconds is spent
> in open state.

Interesting. I'm a little surprised it takes CUBIC so long to re-grow
cwnd to the full value. Would you be able to provide your kernel
version number and post a tcpdump binary packet trace somewhere
public?

One thing you could try would be to disable CUBIC's "fast convergence" feature:

  echo 0 > /sys/module/tcp_cubic/parameters/fast_convergence

We have noticed that this feature can hurt performance when there is a
high rate of random packet drops (packet drops that are not correlated
with the sending rate of the flow in question).
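
Roughly, the idea behind fast convergence is the following (a
paraphrased, standalone sketch, not verbatim net/ipv4/tcp_cubic.c): on a
loss event, if cwnd is still below the previously remembered maximum,
the remembered maximum is pulled down further, so a flow that keeps
losing below its old peak backs off its growth target more aggressively:

    static unsigned int cubic_loss_sketch(unsigned int cwnd,
                                          unsigned int *last_max_cwnd,
                                          int fast_convergence)
    {
        /* beta/scale ~= 0.7, the multiplicative-decrease factor */
        const unsigned int beta = 717, scale = 1024;
        unsigned int ssthresh = cwnd * beta / scale;

        if (fast_convergence && cwnd < *last_max_cwnd)
            /* remember a point below the cwnd we crashed at */
            *last_max_cwnd = cwnd * (scale + beta) / (2 * scale);
        else
            *last_max_cwnd = cwnd;

        return ssthresh > 2 ? ssthresh : 2;
    }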

neal

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net] tcp: avoid multiple ssthresh reductions in on retransmit window
  2014-06-18  0:56                         ` Neal Cardwell
@ 2014-06-18  2:00                           ` Jay Vosburgh
  2014-06-19  1:52                           ` Jay Vosburgh
  1 sibling, 0 replies; 21+ messages in thread
From: Jay Vosburgh @ 2014-06-18  2:00 UTC (permalink / raw)
  To: Neal Cardwell
  Cc: Michal Kubecek, Yuchung Cheng, David S. Miller, netdev,
	Alexey Kuznetsov, James Morris, Hideaki YOSHIFUJI,
	Patrick McHardy

Neal Cardwell <ncardwell@google.com> wrote:

>On Tue, Jun 17, 2014 at 8:38 PM, Jay Vosburgh
><jay.vosburgh@canonical.com> wrote:
>> Michal Kubecek <mkubecek@suse.cz> wrote:
>>
>>>On Tue, Jun 17, 2014 at 02:35:23PM -0700, Yuchung Cheng wrote:
>>>> On Tue, Jun 17, 2014 at 5:20 AM, Michal Kubecek <mkubecek@suse.cz> wrote:
>>>> > On Mon, Jun 16, 2014 at 08:44:04PM -0400, Neal Cardwell wrote:
>>>> >> On Mon, Jun 16, 2014 at 8:25 PM, Yuchung Cheng <ycheng@google.com> wrote:
>>>> >> > However Linux is inconsistent on the loss of a retransmission. It
>>>> >> > reduces ssthresh (and cwnd) if this happens on a timeout, but not in
>>>> >> > fast recovery (tcp_mark_lost_retrans). We should fix that and that
>>>> >> > should help dealing with traffic policers.
>>>> >>
>>>> >> Yes, great point!
>>>> >
>>>> > Does it mean the patch itself would be acceptable if the reasoning in
>>>> > its commit message was changed? Or would you prefer a different way to
>>>> > unify the two situations?
>>>>
>>>> It's the latter but it seems to belong to a different patch (and it'll
>>>> not solve the problem you are seeing).
>>>
>>>OK, thank you. I guess we will have to persuade them to move to cubic
>>>which handles their problems much better.
>>>
>>>> The idea behind the RFC is that TCP should reduce cwnd and ssthresh
>>>> across round trips of send, but not within an RTT. Suppose cwnd was
>>>> 10 on first timeout, so cwnd becomes 1 and ssthresh is 5. Then after 3
>>>> round trips, we time out again. By the design of Reno this should
>>>> reset cwnd from 8 to 1, and ssthresh from 5 to 2.5.
>>>
>>>Shouldn't that be from 5 to 4? We reduce ssthresh to half of current
>>>cwnd, not current ssthresh.
>>>
>>>BtW, this is exactly the problem our customer is facing: they have
>>>relatively fast line (15 Mb/s) but with big buffers so that the
>>>roundtrip times can rise from unloaded 35 ms up to something like 1.5 s
>>>under full load.
>>>
>>>What happens is this: cwnd initally rises to ~2100 then first drops
>>>are encountered, cwnd is set to 1 and ssthresh to ~1050. The slow start
>>>lets cwnd reach ssthresh but after that, a slow linear growth follows.
>>>In this state, all in-flight packets are dropped (simulation of what
>>>happens on router switchover) so that cwnd is reset to 1 again and
>>>ssthresh to something like 530-550 (cwnd was a bit higher than ssthresh).
>>>If a packet loss comes shortly after that, cwnd is still very low and
>>>ssthresh is reduced to half of that cwnd (i.e. much lower than to half
>>>of ssthresh). If unlucky, one can even end up with ssthresh reduced to 2
>>>which takes really long to recover from.
>>
>>         I'm also looking into a problem that exhibits very similar TCP
>> characteristics, even down to cwnd and ssthresh values similar to what
>> you cite.  In this case, the situation has to do with high RTT (around
>> 80 ms) connections competing with low RTT (1 ms) connections.  This case
>> is already using cubic.
>>
>>         Essentially, a high RTT connection to the server transfers data
>> in at a reasonable and steady rate until something causes some packets
>> to be lost (in this case, another transfer from a low RTT host to the
>> same server).  Some packets are lost, and cwnd drops from ~2200 to ~300
>> (in stages, first to ~1500, then ~600, then to ~300, ).  The ssthresh
>> starts at around 1100, then drops to ~260, which is the lowest cwnd
>> value.
>>
>>         The recovery from the low cwnd situation is very slow; cwnd
>> climbs a bit and then remains essentially flat for around 5 seconds.  It
>> then begins to climb until a few packets are lost again, and the cycle
>> repeats.  If no futher losses occur (if the competing traffic has
>> ceased, for example), recovery from a low cwnd (300 - 750 ish) to the
>> full value (~2200) requires on the order of 20 seconds.  The connection
>> exits recovery state fairly quickly, and most of the 20 seconds is spent
>> in open state.
>
>Interesting. I'm a little surprised it takes CUBIC so long to re-grow
>cwnd to the full value. Would you be able to provide your kernel
>version number and post a tcpdump binary packet trace somewhere
>public?

	The kernel I'm using at the moment is an Ubuntu 3.2.0-54 distro
kernel, but I've reproduced the problem on Ubuntu distro 3.13 and a
mainline 3.15-rc (although in the 3.13/3.15 cases using netem to inject
delay).  I've been gathering data mostly with systemtap, but I should be
able to get some packet captures as well, although not until tomorrow.

	The test I'm using right now is pretty simple. I have three
machines: two, A and B, are separated by about 80 ms RTT; the third
machine, C, is about 1 ms from B, so:

	A --- 80ms --- B --- 1ms ---- C

	On A, I run an "iperf -i 1" to B, and let it max its cwnd, and
then on C, run an "iperf -t 1" to B ("-t 1" means only run for one
second then exit).  The iperf results on A look like this:

[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 1.0 sec   896 KBytes  7.34 Mbits/sec
[  3]  1.0- 2.0 sec  1.50 MBytes  12.6 Mbits/sec
[  3]  2.0- 3.0 sec  4.62 MBytes  38.8 Mbits/sec
[  3]  3.0- 4.0 sec  13.5 MBytes   113 Mbits/sec
[  3]  4.0- 5.0 sec  27.8 MBytes   233 Mbits/sec
[  3]  5.0- 6.0 sec  39.0 MBytes   327 Mbits/sec
[  3]  6.0- 7.0 sec  36.9 MBytes   309 Mbits/sec
[  3]  7.0- 8.0 sec  34.8 MBytes   292 Mbits/sec
[  3]  8.0- 9.0 sec  39.0 MBytes   327 Mbits/sec
[  3]  9.0-10.0 sec  36.9 MBytes   309 Mbits/sec
[  3] 10.0-11.0 sec  36.9 MBytes   309 Mbits/sec
[  3] 11.0-12.0 sec  11.1 MBytes  93.3 Mbits/sec
[  3] 12.0-13.0 sec  4.50 MBytes  37.7 Mbits/sec
[  3] 13.0-14.0 sec  2.88 MBytes  24.1 Mbits/sec
[  3] 14.0-15.0 sec  5.50 MBytes  46.1 Mbits/sec
[  3] 15.0-16.0 sec  6.38 MBytes  53.5 Mbits/sec
[  3] 16.0-17.0 sec  6.50 MBytes  54.5 Mbits/sec
[  3] 17.0-18.0 sec  6.38 MBytes  53.5 Mbits/sec
[  3] 18.0-19.0 sec  4.25 MBytes  35.7 Mbits/sec
[  3] 19.0-20.0 sec  6.38 MBytes  53.5 Mbits/sec
[  3] 20.0-21.0 sec  6.38 MBytes  53.5 Mbits/sec
[  3] 21.0-22.0 sec  6.38 MBytes  53.5 Mbits/sec
[  3] 22.0-23.0 sec  6.38 MBytes  53.5 Mbits/sec
[  3] 23.0-24.0 sec  6.50 MBytes  54.5 Mbits/sec
[  3] 24.0-25.0 sec  8.62 MBytes  72.4 Mbits/sec
[  3] 25.0-26.0 sec  6.38 MBytes  53.5 Mbits/sec
[  3] 26.0-27.0 sec  8.50 MBytes  71.3 Mbits/sec
[  3] 27.0-28.0 sec  8.62 MBytes  72.4 Mbits/sec
[  3] 28.0-29.0 sec  10.6 MBytes  89.1 Mbits/sec
[  3] 29.0-30.0 sec  12.9 MBytes   108 Mbits/sec
[  3] 30.0-31.0 sec  15.0 MBytes   126 Mbits/sec
[  3] 31.0-32.0 sec  15.0 MBytes   126 Mbits/sec
[  3] 32.0-33.0 sec  21.8 MBytes   182 Mbits/sec
[  3] 33.0-34.0 sec  21.4 MBytes   179 Mbits/sec
[  3] 34.0-35.0 sec  27.8 MBytes   233 Mbits/sec
[  3] 35.0-36.0 sec  32.6 MBytes   274 Mbits/sec
[  3] 36.0-37.0 sec  36.6 MBytes   307 Mbits/sec
[  3] 37.0-38.0 sec  36.6 MBytes   307 Mbits/sec

	The second iperf starts at about time 10.  The middle value is 1
second's throughput, so the flat throughput between roughly time 13 and
time 23 is the cwnd slow recovery.

	I've got one graph prepared already that I can post:

http://people.canonical.com/~jvosburgh/t-vs-cwnd-ssthresh.jpg

	This shows cwnd (green) and ssthresh (red) vs. time.  In this
case, the second (low RTT) iperf started at the first big drop at around
time 22 and ran for 30 seconds (its data is not on the graph).  The big
cwnd drop is actually a series of drops, but that's hard to see at this
scale.  This graph shows two of the slow recoveries, and was done on a
3.13 kernel using netem to add delay.  The cwnd and ssthresh data was
captured by systemtap when exiting tcp_ack.

>One thing you could try would be to disable CUBIC's "fast convergence" feature:
>
>  echo 0 > /sys/module/tcp_cubic/parameters/fast_convergence
>
>We have noticed that this feature can hurt performance when there is a
>high rate of random packet drops (packet drops that are not correlated
>with the sending rate of the flow in question).

	I ran the above iperf test with this disabled; it does not
appear to have any effect.

	-J

---
	-Jay Vosburgh, jay.vosburgh@canonical.com

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net] tcp: avoid multiple ssthresh reductions in on retransmit window
  2014-06-17 22:42                     ` Michal Kubecek
  2014-06-18  0:38                       ` Jay Vosburgh
@ 2014-06-18  7:17                       ` Eric Dumazet
  1 sibling, 0 replies; 21+ messages in thread
From: Eric Dumazet @ 2014-06-18  7:17 UTC (permalink / raw)
  To: Michal Kubecek
  Cc: Yuchung Cheng, Neal Cardwell, David S. Miller, netdev,
	Alexey Kuznetsov, James Morris, Hideaki YOSHIFUJI,
	Patrick McHardy

On Wed, 2014-06-18 at 00:42 +0200, Michal Kubecek wrote:

> BtW, this is exactly the problem our customer is facing: they have
> a relatively fast line (15 Mb/s) but with big buffers so that the
> roundtrip times can rise from unloaded 35 ms up to something like 1.5 s
> under full load.

It looks like some cwnd limiting would be nice ;)

BDP is about 45 packets for a 35ms rtt and 15 Mb/s link.
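
Back-of-the-envelope, assuming ~1448-byte segments:

# 15 Mbit/s * 35 ms, in segments
echo 'scale=1; (15*10^6/8) * 0.035 / 1448' | bc
# -> 45.3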

> 
> What happens is this: cwnd initially rises to ~2100 then first drops
> are encountered, cwnd is set to 1 and ssthresh to ~1050. The slow start
> lets cwnd reach ssthresh but after that, a slow linear growth follows.
> In this state, all in-flight packets are dropped (simulation of what
> happens on router switchover) so that cwnd is reset to 1 again and
> ssthresh to something like 530-550 (cwnd was a bit higher than ssthresh).
> If a packet loss comes shortly after that, cwnd is still very low and
> ssthresh is reduced to half of that cwnd (i.e. much lower than to half
> of ssthresh). If unlucky, one can even end up with ssthresh reduced to 2
> which takes really long to recover from.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net] tcp: avoid multiple ssthresh reductions in on retransmit window
  2014-06-18  0:38                       ` Jay Vosburgh
  2014-06-18  0:56                         ` Neal Cardwell
@ 2014-06-18 16:56                         ` Yuchung Cheng
  1 sibling, 0 replies; 21+ messages in thread
From: Yuchung Cheng @ 2014-06-18 16:56 UTC (permalink / raw)
  To: Jay Vosburgh
  Cc: Michal Kubecek, Neal Cardwell, David S. Miller, netdev,
	Alexey Kuznetsov, James Morris, Hideaki YOSHIFUJI,
	Patrick McHardy

On Tue, Jun 17, 2014 at 5:38 PM, Jay Vosburgh
<jay.vosburgh@canonical.com> wrote:
> Michal Kubecek <mkubecek@suse.cz> wrote:
>
>>On Tue, Jun 17, 2014 at 02:35:23PM -0700, Yuchung Cheng wrote:
>>> On Tue, Jun 17, 2014 at 5:20 AM, Michal Kubecek <mkubecek@suse.cz> wrote:
>>> > On Mon, Jun 16, 2014 at 08:44:04PM -0400, Neal Cardwell wrote:
>>> >> On Mon, Jun 16, 2014 at 8:25 PM, Yuchung Cheng <ycheng@google.com> wrote:
>>> >> > However Linux is inconsistent on the loss of a retransmission. It
>>> >> > reduces ssthresh (and cwnd) if this happens on a timeout, but not in
>>> >> > fast recovery (tcp_mark_lost_retrans). We should fix that and that
>>> >> > should help dealing with traffic policers.
>>> >>
>>> >> Yes, great point!
>>> >
>>> > Does it mean the patch itself would be acceptable if the reasoning in
>>> > its commit message was changed? Or would you prefer a different way to
>>> > unify the two situations?
>>>
>>> It's the latter but it seems to belong to a different patch (and it'll
>>> not solve the problem you are seeing).
>>
>>OK, thank you. I guess we will have to persuade them to move to cubic
>>which handles their problems much better.
>>
>>> The idea behind the RFC is that TCP should reduce cwnd and ssthresh
>>> across round trips of send, but not within an RTT. Suppose cwnd was
>>> 10 on first timeout, so cwnd becomes 1 and ssthresh is 5. Then after 3
>>> round trips, we time out again. By the design of Reno this should
>>> reset cwnd from 8 to 1, and ssthresh from 5 to 2.5.
>>
>>Shouldn't that be from 5 to 4? We reduce ssthresh to half of current
>>cwnd, not current ssthresh.
Oops, yes: it should go from 5 to 4 (half of the cwnd of 8), not to 2.5.
>>
>>BtW, this is exactly the problem our customer is facing: they have
>>a relatively fast line (15 Mb/s) but with big buffers so that the
>>roundtrip times can rise from unloaded 35 ms up to something like 1.5 s
>>under full load.
>>
>>What happens is this: cwnd initially rises to ~2100 then first drops
>>are encountered, cwnd is set to 1 and ssthresh to ~1050. The slow start
>>lets cwnd reach ssthresh but after that, a slow linear growth follows.
>>In this state, all in-flight packets are dropped (simulation of what
>>happens on router switchover) so that cwnd is reset to 1 again and
>>ssthresh to something like 530-550 (cwnd was a bit higher than ssthresh).
>>If a packet loss comes shortly after that, cwnd is still very low and
>>ssthresh is reduced to half of that cwnd (i.e. much lower than to half
>>of ssthresh). If unlucky, one can even end up with ssthresh reduced to 2
>>which takes really long to recover from.
>
>         I'm also looking into a problem that exhibits very similar TCP
> characteristics, even down to cwnd and ssthresh values similar to what
> you cite.  In this case, the situation has to do with high RTT (around
> 80 ms) connections competing with low RTT (1 ms) connections.  This case
> is already using cubic.
>
>         Essentially, a high RTT connection to the server transfers data
> in at a reasonable and steady rate until something causes some packets
> to be lost (in this case, another transfer from a low RTT host to the
> same server).  Some packets are lost, and cwnd drops from ~2200 to ~300
> (in stages, first to ~1500, then to ~600, then to ~300).  The ssthresh
> starts at around 1100, then drops to ~260, which is the lowest cwnd
> value.
>
>         The recovery from the low cwnd situation is very slow; cwnd
> climbs a bit and then remains essentially flat for around 5 seconds.  It
> then begins to climb until a few packets are lost again, and the cycle
> repeats.  If no further losses occur (if the competing traffic has
> ceased, for example), recovery from a low cwnd (300 - 750 ish) to the
> full value (~2200) requires on the order of 20 seconds.  The connection
> exits recovery state fairly quickly, and most of the 20 seconds is spent
> in open state.

ssthresh is problematic. Both cases show the same shortcoming of
Reno/CUBIC's use of losses and ssthresh.
If the losses are not caused by queue overflows but by link flaps,
bursts, etc., then ssthresh is not indicative of the BDP. It ends up
as more or less a random value (well above the BDP on bufferbloated
links, well below it in these cases). TCP throughput goes south if we
hit two losses within a few RTTs, and it's a point of no return :(
Hopefully someone can come up with a more intelligent control.

Several posts in tcpm also discuss the low ssthresh issues.

http://www.ietf.org/mail-archive/web/tcpm/current/msg08145.html
http://www.ietf.org/mail-archive/web/tcpm/current/msg08778.html

>
>         -J
>
> ---
>         -Jay Vosburgh, jay.vosburgh@canonical.com

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net] tcp: avoid multiple ssthresh reductions in on retransmit window
  2014-06-18  0:56                         ` Neal Cardwell
  2014-06-18  2:00                           ` Jay Vosburgh
@ 2014-06-19  1:52                           ` Jay Vosburgh
  2014-06-19  2:28                             ` Eric Dumazet
  1 sibling, 1 reply; 21+ messages in thread
From: Jay Vosburgh @ 2014-06-19  1:52 UTC (permalink / raw)
  To: Neal Cardwell
  Cc: Michal Kubecek, Yuchung Cheng, David S. Miller, netdev,
	Alexey Kuznetsov, James Morris, Hideaki YOSHIFUJI,
	Patrick McHardy

Neal Cardwell <ncardwell@google.com> wrote:

>On Tue, Jun 17, 2014 at 8:38 PM, Jay Vosburgh
><jay.vosburgh@canonical.com> wrote:
[...]
>>         The recovery from the low cwnd situation is very slow; cwnd
>> climbs a bit and then remains essentially flat for around 5 seconds.  It
>> then begins to climb until a few packets are lost again, and the cycle
>> repeats.  If no further losses occur (if the competing traffic has
>> ceased, for example), recovery from a low cwnd (300 - 750 ish) to the
>> full value (~2200) requires on the order of 20 seconds.  The connection
>> exits recovery state fairly quickly, and most of the 20 seconds is spent
>> in open state.
>
>Interesting. I'm a little surprised it takes CUBIC so long to re-grow
>cwnd to the full value. Would you be able to provide your kernel
>version number and post a tcpdump binary packet trace somewhere
>public?

	Ok, I ran a test today that demonstrates the slow cwnd growth.
The sending machine is running 3.15-rc8 (net-next as of about two weeks
ago); the receiver is running Ubuntu's 3.13.0-24 kernel.

	The test involves adding 40 ms of delay in and out from machine
A with netem, then running iperf from A to B.  Once iperf reaches a
steady cwnd, I add an iptables rule on B to drop 1 packet out of every
1000 coming from A, then remove the rule after 10 seconds.  The behavior
resulting from this closely matches what I see on the real systems.
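
	(The exact iptables rule isn't shown here; one way to get the
1-in-1000 drop with stock iptables is the statistic match, for
example:)

A_ADDR=192.0.2.2		# placeholder for A's address

# on B: drop every 1000th packet arriving from A, for ~10 seconds
iptables -A INPUT -s ${A_ADDR} -m statistic --mode nth --every 1000 --packet 0 -j DROP
sleep 10
iptables -D INPUT -s ${A_ADDR} -m statistic --mode nth --every 1000 --packet 0 -j DROP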

	I captured packets from both ends, running it twice, the second
time with GSO, GRO and TSO disabled.

	The iperf output is as follows:

[  3]  5.0- 6.0 sec  33.6 MBytes   282 Mbits/sec
[  3]  6.0- 7.0 sec  33.8 MBytes   283 Mbits/sec
[  3]  7.0- 8.0 sec  27.0 MBytes   226 Mbits/sec
[  3]  8.0- 9.0 sec  23.2 MBytes   195 Mbits/sec
[  3]  9.0-10.0 sec  17.4 MBytes   146 Mbits/sec
[  3] 10.0-11.0 sec  13.9 MBytes   116 Mbits/sec
[  3] 11.0-12.0 sec  10.4 MBytes  87.0 Mbits/sec
[  3] 12.0-13.0 sec  6.38 MBytes  53.5 Mbits/sec
[  3] 13.0-14.0 sec  5.75 MBytes  48.2 Mbits/sec
[  3] 14.0-15.0 sec  4.75 MBytes  39.8 Mbits/sec
[  3] 15.0-16.0 sec  3.12 MBytes  26.2 Mbits/sec
[  3] 16.0-17.0 sec  4.38 MBytes  36.7 Mbits/sec
[  3] 17.0-18.0 sec  3.12 MBytes  26.2 Mbits/sec
[  3] 18.0-19.0 sec  3.12 MBytes  26.2 Mbits/sec
[  3] 19.0-20.0 sec  4.25 MBytes  35.7 Mbits/sec
[  3] 20.0-21.0 sec  3.12 MBytes  26.2 Mbits/sec
[  3] 21.0-22.0 sec  3.25 MBytes  27.3 Mbits/sec
[  3] 22.0-23.0 sec  4.25 MBytes  35.7 Mbits/sec
[  3] 23.0-24.0 sec  3.12 MBytes  26.2 Mbits/sec
[  3] 24.0-25.0 sec  4.12 MBytes  34.6 Mbits/sec
[  3] 25.0-26.0 sec  4.50 MBytes  37.7 Mbits/sec
[  3] 26.0-27.0 sec  4.50 MBytes  37.7 Mbits/sec
[  3] 27.0-28.0 sec  5.88 MBytes  49.3 Mbits/sec
[  3] 28.0-29.0 sec  7.12 MBytes  59.8 Mbits/sec
[  3] 29.0-30.0 sec  7.38 MBytes  61.9 Mbits/sec
[  3] 30.0-31.0 sec  10.0 MBytes  83.9 Mbits/sec
[  3] 31.0-32.0 sec  11.6 MBytes  97.5 Mbits/sec
[  3] 32.0-33.0 sec  15.5 MBytes   130 Mbits/sec
[  3] 33.0-34.0 sec  17.2 MBytes   145 Mbits/sec
[  3] 34.0-35.0 sec  20.0 MBytes   168 Mbits/sec
[  3] 35.0-36.0 sec  25.5 MBytes   214 Mbits/sec
[  3] 36.0-37.0 sec  29.8 MBytes   250 Mbits/sec
[  3] 37.0-38.0 sec  32.2 MBytes   271 Mbits/sec
[  3] 38.0-39.0 sec  32.4 MBytes   272 Mbits/sec

	For the above run, the iptables drop rule went in at about time
7, and was removed 10 seconds later, so recovery began at about time 17.
The second run is similar, although the exact start times differ.

	The full data (two runs, each with packet capture from both ends
and the iperf output) can be found at:

http://people.canonical.com/~jvosburgh/tcp-slow-recovery.tar.bz2

	-J

---
	-Jay Vosburgh, jay.vosburgh@canonical.com

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net] tcp: avoid multiple ssthresh reductions in on retransmit window
  2014-06-19  1:52                           ` Jay Vosburgh
@ 2014-06-19  2:28                             ` Eric Dumazet
  2014-06-19  6:05                               ` Jay Vosburgh
  0 siblings, 1 reply; 21+ messages in thread
From: Eric Dumazet @ 2014-06-19  2:28 UTC (permalink / raw)
  To: Jay Vosburgh
  Cc: Neal Cardwell, Michal Kubecek, Yuchung Cheng, David S. Miller,
	netdev, Alexey Kuznetsov, James Morris, Hideaki YOSHIFUJI,
	Patrick McHardy

On Wed, 2014-06-18 at 18:52 -0700, Jay Vosburgh wrote:
> 	The test involves adding 40 ms of delay in and out from machine
> A with netem, then running iperf from A to B.  Once the iperf reaches a
> steady cwnd, on B, I add an iptables rule to drop 1 packet out of every
> 1000 coming from A, then remove the rule after 10 seconds.  The behavior
> resulting from this closely matches what I see on the real systems.

Please share the netem setup. Are you sure you do not drop frames in
netem? (Considering you disabled GSO/TSO, netem has to be able to store
a lot of packets.)

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net] tcp: avoid multiple ssthresh reductions in on retransmit window
  2014-06-19  2:28                             ` Eric Dumazet
@ 2014-06-19  6:05                               ` Jay Vosburgh
  0 siblings, 0 replies; 21+ messages in thread
From: Jay Vosburgh @ 2014-06-19  6:05 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Neal Cardwell, Michal Kubecek, Yuchung Cheng, David S. Miller,
	netdev, Alexey Kuznetsov, James Morris, Hideaki YOSHIFUJI,
	Patrick McHardy

Eric Dumazet <eric.dumazet@gmail.com> wrote:

>On Wed, 2014-06-18 at 18:52 -0700, Jay Vosburgh wrote:
>> 	The test involves adding 40 ms of delay in and out from machine
>> A with netem, then running iperf from A to B.  Once the iperf reaches a
>> steady cwnd, on B, I add an iptables rule to drop 1 packet out of every
>> 1000 coming from A, then remove the rule after 10 seconds.  The behavior
>> resulting from this closely matches what I see on the real systems.
>
>Please share the netem setup. Are you sure you do not drop frames on
>netem ? (considering you disable GSO/TSO netem has to be able to store a
>lot of packets)

	Reasonably sure; the tc -s qdisc output doesn't show any drops by netem
for these test runs.  The data I linked to earlier is one run with
TSO/GSO/GRO enabled, and one with TSO/GSO/GRO disabled, and the results
are similar in terms of cwnd recovery time.  Looking at the packet
capture for the TSO/GSO/GRO disabled case, the time span from the first
duplicate ACK to the last is about 9 seconds, which is close to the 10
seconds the iptables drop rule is in effect; the same time analysis
applies to retransmissions from the sender.

	I've also tested using netem to induce drops, but in this
particular case I used iptables.
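
	(With the netem setup below, inducing the drops in netem instead
would be something like adding a loss parameter to the egress qdisc on
A, e.g.

tc qdisc change dev eth1 root netem delay 40ms loss 0.1% limit 5000

rather than the iptables rule on B.)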

	The script I use to set up netem is:

#!/bin/bash

IF=eth1
TC=/usr/local/bin/tc
DELAY=40ms

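# reload ifb so ingress traffic can be redirected through its own netem
# qdisc (for delaying the return path)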
rmmod ifb
modprobe ifb
ip link set dev ifb0 up

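# (re)create the ingress qdisc on ${IF} so ingress traffic can be
# filtered and redirected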
if ${TC} qdisc show dev ${IF} | grep -q ingress; then
	${TC} qdisc del dev ${IF} ingress
fi
${TC} qdisc add dev ${IF} ingress

${TC} qdisc del dev ${IF} root

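# redirect all ingress traffic to ifb0, then add ${DELAY} of netem delay
# in each direction (ifb0 covers ingress, the ${IF} root qdisc covers
# egress)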
${TC} filter add dev ${IF} parent ffff: protocol ip \
	u32 match u32 0 0 flowid 1:1 action mirred egress redirect dev ifb0
${TC} qdisc add dev ifb0 root netem delay ${DELAY} limit 5000
${TC} qdisc add dev ${IF} root netem delay ${DELAY} limit 5000

	In the past I've watched the tc backlog, and the highest I've
seen is about 900 packets, so the limit 5000 is probably overkill.

	I'm also not absolutely sure that 40ms of delay in each
direction is materially different from 80ms in one direction, but the
real configuration I'm recreating has 40ms each way.

	The tc qdisc stats after the two runs I did earlier to capture
data look like this:

qdisc pfifo_fast 0: dev eth0 root refcnt 2 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
 Sent 1905005 bytes 22277 pkt (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 0p requeues 0 
qdisc netem 8002: dev eth1 root refcnt 2 limit 5000 delay 40.0ms
 Sent 773383636 bytes 510901 pkt (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 0p requeues 0 
qdisc ingress ffff: dev eth1 parent ffff:fff1 ---------------- 
 Sent 14852588 bytes 281846 pkt (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 0p requeues 0 
qdisc netem 8001: dev ifb0 root refcnt 2 limit 5000 delay 40.0ms
 Sent 18763686 bytes 281291 pkt (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 0p requeues 0 

	Lastly, I ran the same test on the actual systems, and the iperf
results are similar to my test lab:

[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 1.0 sec   896 KBytes  7.34 Mbits/sec
[  3]  1.0- 2.0 sec  1.50 MBytes  12.6 Mbits/sec
[  3]  2.0- 3.0 sec  5.12 MBytes  43.0 Mbits/sec
[  3]  3.0- 4.0 sec  13.9 MBytes   116 Mbits/sec
[  3]  4.0- 5.0 sec  27.8 MBytes   233 Mbits/sec
[  3]  5.0- 6.0 sec  39.0 MBytes   327 Mbits/sec
[  3]  6.0- 7.0 sec  36.8 MBytes   308 Mbits/sec
[  3]  7.0- 8.0 sec  36.8 MBytes   308 Mbits/sec
[  3]  8.0- 9.0 sec  37.0 MBytes   310 Mbits/sec
[  3]  9.0-10.0 sec  36.6 MBytes   307 Mbits/sec
[  3] 10.0-11.0 sec  33.9 MBytes   284 Mbits/sec
[  3] 11.0-12.0 sec  0.00 Bytes  0.00 bits/sec
[  3] 12.0-13.0 sec  0.00 Bytes  0.00 bits/sec
[  3] 13.0-14.0 sec  4.38 MBytes  36.7 Mbits/sec
[  3] 14.0-15.0 sec  6.38 MBytes  53.5 Mbits/sec
[  3] 15.0-16.0 sec  7.00 MBytes  58.7 Mbits/sec
[  3] 16.0-17.0 sec  8.62 MBytes  72.4 Mbits/sec
[  3] 17.0-18.0 sec  4.25 MBytes  35.7 Mbits/sec
[  3] 18.0-19.0 sec  8.50 MBytes  71.3 Mbits/sec
[  3] 19.0-20.0 sec  4.25 MBytes  35.7 Mbits/sec
[  3] 20.0-21.0 sec  6.50 MBytes  54.5 Mbits/sec
[  3] 21.0-22.0 sec  6.38 MBytes  53.5 Mbits/sec
[  3] 22.0-23.0 sec  6.50 MBytes  54.5 Mbits/sec
[  3] 23.0-24.0 sec  8.50 MBytes  71.3 Mbits/sec
[  3] 24.0-25.0 sec  8.50 MBytes  71.3 Mbits/sec
[  3] 25.0-26.0 sec  8.38 MBytes  70.3 Mbits/sec
[  3] 26.0-27.0 sec  8.62 MBytes  72.4 Mbits/sec
[  3] 27.0-28.0 sec  8.50 MBytes  71.3 Mbits/sec
[  3] 28.0-29.0 sec  8.50 MBytes  71.3 Mbits/sec
[  3] 29.0-30.0 sec  8.38 MBytes  70.3 Mbits/sec
[  3] 30.0-31.0 sec  8.50 MBytes  71.3 Mbits/sec
[  3] 31.0-32.0 sec  8.62 MBytes  72.4 Mbits/sec
[  3] 32.0-33.0 sec  8.38 MBytes  70.3 Mbits/sec
[  3] 33.0-34.0 sec  10.6 MBytes  89.1 Mbits/sec
[  3] 34.0-35.0 sec  10.6 MBytes  89.1 Mbits/sec
[  3] 35.0-36.0 sec  10.6 MBytes  89.1 Mbits/sec
[  3] 36.0-37.0 sec  12.8 MBytes   107 Mbits/sec
[  3] 37.0-38.0 sec  15.0 MBytes   126 Mbits/sec
[  3] 38.0-39.0 sec  17.0 MBytes   143 Mbits/sec
[  3] 39.0-40.0 sec  19.4 MBytes   163 Mbits/sec
[  3] 40.0-41.0 sec  23.5 MBytes   197 Mbits/sec
[  3] 41.0-42.0 sec  25.6 MBytes   215 Mbits/sec
[  3] 42.0-43.0 sec  30.2 MBytes   254 Mbits/sec
[  3] 43.0-44.0 sec  34.2 MBytes   287 Mbits/sec
[  3] 44.0-45.0 sec  36.6 MBytes   307 Mbits/sec
[  3] 45.0-46.0 sec  38.8 MBytes   325 Mbits/sec
[  3] 46.0-47.0 sec  36.5 MBytes   306 Mbits/sec

	This result is consistently repeatable.  These systems have more
hops between them than my lab systems, but the ping RTT is 80ms.

	-J

---
	-Jay Vosburgh, jay.vosburgh@canonical.com

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread

Thread overview: 21+ messages
2014-06-16 15:35 tcp: multiple ssthresh reductions before all packets are retransmitted Michal Kubecek
2014-06-16 17:02 ` Yuchung Cheng
2014-06-16 18:48   ` Michal Kubecek
     [not found]   ` <20140616174721.GA15406@lion>
2014-06-16 19:04     ` Yuchung Cheng
2014-06-16 20:06       ` Michal Kubecek
2014-06-16 21:19       ` [PATCH net] tcp: avoid multiple ssthresh reductions in on retransmit window Michal Kubecek
2014-06-16 22:39         ` Yuchung Cheng
2014-06-16 23:42           ` Neal Cardwell
2014-06-17  0:25             ` Yuchung Cheng
2014-06-17  0:44               ` Neal Cardwell
2014-06-17 12:20                 ` Michal Kubecek
2014-06-17 21:35                   ` Yuchung Cheng
2014-06-17 22:42                     ` Michal Kubecek
2014-06-18  0:38                       ` Jay Vosburgh
2014-06-18  0:56                         ` Neal Cardwell
2014-06-18  2:00                           ` Jay Vosburgh
2014-06-19  1:52                           ` Jay Vosburgh
2014-06-19  2:28                             ` Eric Dumazet
2014-06-19  6:05                               ` Jay Vosburgh
2014-06-18 16:56                         ` Yuchung Cheng
2014-06-18  7:17                       ` Eric Dumazet
