* [PATCH net-next] tcp: Fix CWV being too strict on thin streams
@ 2015-09-18 23:38 Bendik Rønning Opstad
  2015-09-19 12:46 ` Neal Cardwell
  0 siblings, 1 reply; 10+ messages in thread
From: Bendik Rønning Opstad @ 2015-09-18 23:38 UTC (permalink / raw)
  To: David S. Miller, Alexey Kuznetsov, James Morris,
	Hideaki YOSHIFUJI, Patrick McHardy
  Cc: netdev, Eric Dumazet, Neal Cardwell, Bendik Rønning Opstad,
	Andreas Petlund, Carsten Griwodz, Jonas Markussen,
	Kenneth Klette Jonassen

Application-limited streams, such as thin streams that transmit small
amounts of payload in relatively few packets per RTT, are prevented from
growing the CWND after experiencing loss. This leads to increased
sojourn times for data segments in streams that often transmit
time-dependent data.

After the CWND is reduced due to loss, and an ACK has made room in the
send window for more packets to be transmitted, the CWND will not grow
unless there is more unsent data buffered in the output queue than the
CWND permits to be sent. That is because tcp_cwnd_validate(), which
updates tp->is_cwnd_limited, is only called in tcp_write_xmit() when at
least one packet with new data has been sent. However, if all the
buffered data in the output queue was sent within the current CWND,
is_cwnd_limited will remain false even when there is no more room in the
CWND. While the CWND is fully utilized, any new data put on the output
queue will be held back (i.e. limited by the CWND), but
tp->is_cwnd_limited will not be updated as no packets were transmitted.

Fix by updating tp->is_cwnd_limited if no packets are sent due to the
CWND being fully utilized.

Cc: Andreas Petlund <apetlund@simula.no>
Cc: Carsten Griwodz <griff@simula.no>
Cc: Jonas Markussen <jonassm@ifi.uio.no>
Cc: Kenneth Klette Jonassen <kennetkl@ifi.uio.no>
Signed-off-by: Bendik Rønning Opstad <bro.devel+kernel@gmail.com>
---
 net/ipv4/tcp_output.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index d0ad355..7096b8b 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2145,6 +2145,11 @@ repair:
 		tcp_cwnd_validate(sk, is_cwnd_limited);
 		return false;
 	}
+	/* We are prevented from transmitting because we are limited
+	 * by the CWND, so update tcp sock with correct value.
+	 */
+	if (is_cwnd_limited)
+		tp->is_cwnd_limited = is_cwnd_limited;
 	return !tp->packets_out && tcp_send_head(sk);
 }
 
-- 
1.9.1


* Re: [PATCH net-next] tcp: Fix CWV being too strict on thin streams
  2015-09-18 23:38 [PATCH net-next] tcp: Fix CWV being too strict on thin streams Bendik Rønning Opstad
@ 2015-09-19 12:46 ` Neal Cardwell
  2015-09-22 14:29   ` Bendik Rønning Opstad
  0 siblings, 1 reply; 10+ messages in thread
From: Neal Cardwell @ 2015-09-19 12:46 UTC (permalink / raw)
  To: Bendik Rønning Opstad
  Cc: David S. Miller, Alexey Kuznetsov, James Morris,
	Hideaki YOSHIFUJI, Patrick McHardy, Netdev, Eric Dumazet,
	Bendik Rønning Opstad, Andreas Petlund, Carsten Griwodz,
	Jonas Markussen, Kenneth Klette Jonassen, Yuchung Cheng

On Fri, Sep 18, 2015 at 7:38 PM, Bendik Rønning Opstad
<bro.devel@gmail.com> wrote:
>
> Application-limited streams, such as thin streams that transmit small
> amounts of payload in relatively few packets per RTT, are prevented from
> growing the CWND after experiencing loss. This leads to increased
> sojourn times for data segments in streams that often transmit
> time-dependent data.
>
> After the CWND is reduced due to loss, and an ACK has made room in the
> send window for more packets to be transmitted, the CWND will not grow
> unless there is more unsent data buffered in the output queue than the
> CWND permits to be sent. That is because tcp_cwnd_validate(), which
> updates tp->is_cwnd_limited, is only called in tcp_write_xmit() when at
> least one packet with new data has been sent. However, if all the
> buffered data in the output queue was sent within the current CWND,
> is_cwnd_limited will remain false even when there is no more room in the
> CWND. While the CWND is fully utilized, any new data put on the output
> queue will be held back (i.e. limited by the CWND), but
> tp->is_cwnd_limited will not be updated as no packets were transmitted.
>
> Fix by updating tp->is_cwnd_limited if no packets are sent due to the
> CWND being fully utilized.

Thanks for this report!

When you say "CWND is reduced due to loss", are you talking about RTO
or Fast Recovery? Do you have any traces you can share that illustrate
this issue?

Have you verified that this patch fixes the issue you identified? I
think the fix may be in the wrong place for that scenario, or at least
incomplete.

In the scenario you describe, you say "all the buffered data in the
output queue was sent within the current CWND", which means that there
is no unsent data, so in tcp_write_xmit() the call to tcp_send_head()
would return NULL, so we would not enter the while loop, and could not
set is_cwnd_limited to true.

In the scenario you describe, I think we'd need to check for being
cwnd-limited in tcp_xmit_retransmit_queue().

How about something like the following patch:

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index d0ad355..8e6a772 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2145,6 +2145,7 @@ repair:
                tcp_cwnd_validate(sk, is_cwnd_limited);
                return false;
        }
+       tp->is_cwnd_limited |= is_cwnd_limited;
        return !tp->packets_out && tcp_send_head(sk);
 }

@@ -2762,8 +2763,10 @@ void tcp_xmit_retransmit_queue(struct sock *sk)
                 * packet to be MSS sized and all the
                 * packet counting works out.
                 */
-               if (tcp_packets_in_flight(tp) >= tp->snd_cwnd)
+               if (tcp_packets_in_flight(tp) >= tp->snd_cwnd) {
+                       tp->is_cwnd_limited = true;
                        return;
+               }

                if (fwd_rexmitting) {
 begin_fwd:

neal


* Re: [PATCH net-next] tcp: Fix CWV being too strict on thin streams
  2015-09-19 12:46 ` Neal Cardwell
@ 2015-09-22 14:29   ` Bendik Rønning Opstad
  2015-09-22 14:46     ` Eric Dumazet
  0 siblings, 1 reply; 10+ messages in thread
From: Bendik Rønning Opstad @ 2015-09-22 14:29 UTC (permalink / raw)
  To: Neal Cardwell
  Cc: Bendik Rønning Opstad, David S. Miller, Alexey Kuznetsov,
	James Morris, Hideaki YOSHIFUJI, Patrick McHardy, Netdev,
	Eric Dumazet, Andreas Petlund, Carsten Griwodz, Jonas Markussen,
	Kenneth Klette Jonassen, Yuchung Cheng

> Thanks for this report!

Thank you for answering!

> When you say "CWND is reduced due to loss", are you talking about RTO
> or Fast Recovery? Do you have any traces you can share that illustrate
> this issue?

The problem is not related to loss recovery, but only occurs in congestion
avoidance state. This is because tcp_is_cwnd_limited(), called from the
congestion control's ".cong_avoid" hook, will only use tp->is_cwnd_limited when
in congestion avoidance.
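
For reference, the check in question looks roughly like this (a simplified
sketch of tcp_is_cwnd_limited() around this kernel version, not a verbatim
copy):

static inline bool tcp_is_cwnd_limited(const struct sock *sk)
{
	const struct tcp_sock *tp = tcp_sk(sk);

	/* In slow start, keep growing as long as the CWND is below
	 * twice the largest number of packets recently in flight.
	 */
	if (tp->snd_cwnd <= tp->snd_ssthresh)
		return tp->snd_cwnd < 2 * tp->max_packets_out;

	/* In congestion avoidance, only the recorded flag is used. */
	return tp->is_cwnd_limited;
}

So as long as the connection is in slow start the flag is effectively ignored,
which is why the stalled CWND growth only shows up once a loss has put the
connection into congestion avoidance.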

I've made two traces to compare the effect with and without the patch:
http://heim.ifi.uio.no/bendiko/traces/CWV_PATCH/

To ensure that the results are comparable, the traces are produced with an extra
patch applied (shown below) such that both connections are in congestion
avoidance from the beginning.

> Have you verified that this patch fixes the issue you identified? I
> think the fix may be in the wrong place for that scenario, or at least
> incomplete.

Yes, we have tested this and verified that the patch solves the issue in the
test scenarios we've run.

As always (in networking) it can be difficult to reproduce the issue in a
consistent manner, so I will describe the setup we have used.

In our testbed we have a sender host, a bridge and a receiver. The bridge is
configured with 75 ms delay in each direction, giving an RTT of 150 ms.

On bridge:
tc qdisc add dev eth0 handle 1:0 root netem delay 75ms
tc qdisc add dev eth1 handle 1:0 root netem delay 75ms

We have used the tool streamzero (https://bitbucket.org/bendikro/streamzero) to
produce the thin stream traffic. An example where the sender opens a TCP
connection and performs write calls to the socket every 10 ms with 100 bytes per
call for 60 seconds:

On receiver host (10.0.0.22):
./streamzero_srv -p 5010 -A

On sender host:
./streamzero_client -s 10.0.0.22 -p 5010 -v3 -A4 -d 60 -I i:10,S:100

streamzero_client will by default disable Nagle (TCP_NODELAY=1).

Adding loss for a few seconds in netem causes the CWND and ssthresh to
eventually drop to a low value, e.g. 2.

On bridge:
tc qdisc del dev eth1 root && tc qdisc add dev eth1 handle 1:0 root netem delay
75ms loss 4%

Removing loss after a few seconds:
tc qdisc del dev eth1 root && tc qdisc add dev eth1 handle 1:0 root netem delay
75ms

From this point on, the CWND will not grow any further, as shown by the print
statement when applying this patch:

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index d0ad355..947eb57 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2135,6 +2135,8 @@ repair:
 			break;
 	}
 
+	printk("CWND: %d, SSTHRESH: %d, PKTS_OUT: %d\n", tp->snd_cwnd, tp->snd_ssthresh, tp->packets_out);
+
 	if (likely(sent_pkts)) {
 		if (tcp_in_cwnd_reduction(sk))
 			tp->prr_out += sent_pkts;


The following patch avoids the need to induce loss by putting the connection in
congestion avoidance from the beginning:

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index a62e9c7..206b32f 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5394,6 +5394,7 @@ void tcp_finish_connect(struct sock *sk, struct sk_buff *skb)
 
 	tcp_init_metrics(sk);
 
+	tp->snd_cwnd = tp->snd_ssthresh = 2;
 	tcp_init_congestion_control(sk);
 
 	/* Prevent spurious tcp_cwnd_restart() on first data


The traces linked to above are run with this patch applied.

> In the scenario you describe, you say "all the buffered data in the
> output queue was sent within the current CWND", which means that there
> is no unsent data, so in tcp_write_xmit() the call to tcp_send_head()
> would return NULL, so we would not enter the while loop, and could not
> set is_cwnd_limited to true.

I'll describe two example scenarios in detail. In both scenarios we are in
congestion avoidance after experiencing loss. Nagle is disabled.

Scenario 1:
Current CWND is 2 and there is currently one packet in flight. The output queue
contains one SKB with unsent data (payload <= 1 * MSS).

In tcp_write_xmit(), tcp_send_head() will return the SKB with unsent data, and
tcp_cwnd_test() will set cwnd_quota to 1, which means that is_cwnd_limited will
remain false. After sending the packet, sent_pkts will be set to 1, and the
while loop will break because tcp_send_head() returns NULL. tcp_cwnd_validate() will
be called with is_cwnd_limited=false.

When a new data chunk is put on the output queue, and tcp_write_xmit is called,
tcp_send_head() will return the SKB with unsent data. tcp_cwnd_test() will now
set cwnd_quota to 0 because packets in flight == CWND. is_cwnd_limited will
correctly be set to true, however, since no packets were sent,
tcp_cwnd_validate() will never be called.
(Calling tcp_cwnd_validate() if sent_pkts == 0 will not solve the problem in
this scenario as it will not enter the first if test in tcp_cwnd_validate()
where tp->is_cwnd_limited is updated.)


Scenario 2:
CWND is 2 and there is currently one packet in flight. The output queue contains
two SKBs with unsent data.

In tcp_write_xmit(), tcp_send_head() will return the first SKB with unsent data,
and tcp_cwnd_test() will set cwnd_quota to 1, which means that is_cwnd_limited
will remain false. After sending the packet, sent_pkts will be set to 1, and the
while loop continues on to the next (and last) SKB with unsent data returned by
tcp_send_head(). tcp_cwnd_test() will set cwnd_quota to 0 and is_cwnd_limited
will be set to true (and the while loop breaks). As one packet was sent, and one
packet was held back, tcp_cwnd_validate() will be called with
is_cwnd_limited=true, and tp->is_cwnd_limited will be set to true (as it
should).


In scenario 1, "all the buffered data in the output queue was sent within the
current CWND", and the result is that the CWND does not grow even when the CWND
is fully utilized after the packet is sent. In scenario 2, some of the unsent
data buffered in the output queue was not sent because there was no room in the
CWND, and this results in the CWND growing because tp->is_cwnd_limited was set
to true.
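
For reference, both walk-throughs follow this heavily simplified sketch of the
relevant control flow in tcp_write_xmit(); TSO, MTU probing, nagle and TLP
handling are omitted, and this is not verbatim kernel code:

static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
			   int push_one, gfp_t gfp)
{
	struct tcp_sock *tp = tcp_sk(sk);
	bool is_cwnd_limited = false;
	u32 sent_pkts = 0;
	struct sk_buff *skb;

	/* Walk the SKBs with unsent data in the output queue. */
	while ((skb = tcp_send_head(sk))) {
		if (!tcp_cwnd_test(tp, skb)) {
			/* No room in the CWND: remember it and stop. */
			is_cwnd_limited = true;
			break;
		}
		/* ... window/nagle/TSO tests, then transmit skb ... */
		sent_pkts += tcp_skb_pcount(skb);
	}

	if (likely(sent_pkts)) {
		/* Only reached when something was sent, so in scenario 1 the
		 * second call (the one that sets is_cwnd_limited) never gets here.
		 */
		tcp_cwnd_validate(sk, is_cwnd_limited);
		return false;
	}
	return !tp->packets_out && tcp_send_head(sk);
}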

Analysing the traces with the analyseTCP tool
(https://bitbucket.org/mpg_code/tstools_analysetcp) gives the following results:

# ./analyseTCP -s 10.0.0.15 -f test_net-next.pcap
Processing sent packets...
Using filter: 'tcp && src host 10.0.0.15'
Finished processing sent packets...
Processing acknowledgements...
Finished processing acknowledgements...
STATS FOR CONN: 10.0.0.15:15000 -> 10.0.0.22:5010
  Duration: 61 seconds (0.016944 hours)
  Total packets sent                            :        782
  Total data packets sent                       :        779
  Total pure acks (no payload)                  :          2
  SYN/FIN/RST packets sent                      :      1/1/0
  Number of retransmissions                     :          0
  Number of packets with bundled segments       :          0
  Number of packets with redundant data         :          0
  Number of received acks                       :        779
  Total bytes sent (payload)                    :     555600
  Number of unique bytes                        :     555600
  Number of retransmitted bytes                 :          0
  Redundant bytes (bytes already sent)          :          0 (0.00 %)
  Estimated loss rate based on retransmissions  :       0.00 %
---------------------------------------------------------------
Payload size stats:
  Minimum    payload                            :     100 bytes
  Average    payload                            :     713 bytes
  Maximum    payload                            :    1448 bytes
---------------------------------------------------------------
Latency stats:
  Minimum    latency                            :  150244 usec
  Average    latency                            :  150675 usec
  Maximum    latency                            :  302493 usec
---------------------------------------------------------------
ITT stats:
  Minimum        itt                            :      12 usec
  Average        itt                            :   78383 usec
  Maximum        itt                            :  302704 usec
===============================================================


General info for entire dump:
  10.0.0.15: -> :
  Filename: test_net-next.pcap
  Sent Packet Count     : 782
  Received Packet Count : 0
  ACK Count             : 779
  Sent Bytes Count      : 555600
  Max payload size      : 1448
#


# ./analyseTCP -s 10.0.0.15 -f test_cwv_fix.pcap
Processing sent packets...
Using filter: 'tcp && src host 10.0.0.15'
Finished processing sent packets...
Processing acknowledgements...
Finished processing acknowledgements...
STATS FOR CONN: 10.0.0.15:15000 -> 10.0.0.22:5010
  Duration: 60 seconds (0.016667 hours)
  Total packets sent                            :       4760
  Total data packets sent                       :       4756
  Total pure acks (no payload)                  :          2
  SYN/FIN/RST packets sent                      :      1/1/0
  Number of retransmissions                     :          0
  Number of packets with bundled segments       :          0
  Number of packets with redundant data         :          0
  Number of received acks                       :       4758
  Total bytes sent (payload)                    :     483300
  Number of unique bytes                        :     483300
  Number of retransmitted bytes                 :          0
  Redundant bytes (bytes already sent)          :          0 (0.00 %)
  Estimated loss rate based on retransmissions  :       0.00 %
---------------------------------------------------------------
Payload size stats:
  Minimum    payload                            :     100 bytes
  Average    payload                            :     102 bytes
  Maximum    payload                            :    1400 bytes
---------------------------------------------------------------
Latency stats:
  Minimum    latency                            :  150221 usec
  Average    latency                            :  150340 usec
  Maximum    latency                            :  302022 usec
---------------------------------------------------------------
ITT stats:
  Minimum        itt                            :    1469 usec
  Average        itt                            :   12761 usec
  Maximum        itt                            :  302287 usec
===============================================================


General info for entire dump:
  10.0.0.15: -> :
  Filename: test_cwv_fix.pcap
  Sent Packet Count     : 4760
  Received Packet Count : 0
  ACK Count             : 4758
  Sent Bytes Count      : 483300
  Max payload size      : 1400
#


The key results revealing the effect of the proposed patch:

test_net-next.pcap:
Packets with payload sent: 779
Average payload per packet: 713 bytes
Average time between each packet transmission: 78.3 ms

test_cwv_fix.pcap:
Packets with payload sent: 4756
Average payload per packet: 102 bytes
Average time between each packet transmission: 12.7 ms


> In the scenario you describe, I think we'd need to check for being
> cwnd-limited in tcp_xmit_retransmit_queue().
> 
> How about something like the following patch:
> 
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index d0ad355..8e6a772 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -2145,6 +2145,7 @@ repair:
>                 tcp_cwnd_validate(sk, is_cwnd_limited);
>                 return false;
>         }
> +       tp->is_cwnd_limited |= is_cwnd_limited;
>         return !tp->packets_out && tcp_send_head(sk);
>  }
> 
> @@ -2762,8 +2763,10 @@ void tcp_xmit_retransmit_queue(struct sock *sk)
>                  * packet to be MSS sized and all the
>                  * packet counting works out.
>                  */
> -               if (tcp_packets_in_flight(tp) >= tp->snd_cwnd)
> +               if (tcp_packets_in_flight(tp) >= tp->snd_cwnd) {
> +                       tp->is_cwnd_limited = true;
>                         return;
> +               }
> 
>                 if (fwd_rexmitting) {
>  begin_fwd:
> 
> neal

This will not fix the problem, as that would only set tp->is_cwnd_limited=true
once after retransmitting. The main issue is that the CWND does not grow for
these types of streams while no loss occurs, even when the CWND is fully
utilized (but only after a loss _has_ occurred to put the connection in
congestion avoidance).

When setting the initial CWND and ssthresh to 10 instead of 2 (with the patch
above to start in congestion avoidance), and running the same test as in the
traces (where the application performs ~15 write calls per RTT), the CWND never
grows past 10. When applying the proposed patch to update tp->is_cwnd_limited
when no packets are transmitted, the CWND grows to 13-14, which produces minimal
sojourn time for the segments written by the application.

Let me know if you need more information or additional test traces.

Bendik


* Re: [PATCH net-next] tcp: Fix CWV being too strict on thin streams
  2015-09-22 14:29   ` Bendik Rønning Opstad
@ 2015-09-22 14:46     ` Eric Dumazet
  2015-09-22 16:09       ` Eric Dumazet
  0 siblings, 1 reply; 10+ messages in thread
From: Eric Dumazet @ 2015-09-22 14:46 UTC (permalink / raw)
  To: Bendik Rønning Opstad
  Cc: Neal Cardwell, David S. Miller, Alexey Kuznetsov, James Morris,
	Hideaki YOSHIFUJI, Patrick McHardy, Netdev, Eric Dumazet,
	Andreas Petlund, Carsten Griwodz, Jonas Markussen,
	Kenneth Klette Jonassen, Yuchung Cheng

On Tue, 2015-09-22 at 16:29 +0200, Bendik Rønning Opstad wrote:
> > Thanks for this report!
> 
> Thank you for answering!
> 
> > When you say "CWND is reduced due to loss", are you talking about RTO
> > or Fast Recovery? Do you have any traces you can share that illustrate
> > this issue?
> 
> The problem is not related to loss recovery, but only occurs in congestion
> avoidance state. This is because tcp_is_cwnd_limited(), called from the
> congestion control's ".cong_avoid" hook, will only use tp->is_cwnd_limited when
> in congestion avoidance.
> 
> I've made two traces to compare the effect with and without the patch:
> http://heim.ifi.uio.no/bendiko/traces/CWV_PATCH/
> 
> To ensure that the results are comparable, the traces are produced with an extra
> patch applied (shown below) such that both connections are in congestion
> avoidance from the beginning.
> 
> > Have you verified that this patch fixes the issue you identified? I
> > think the fix may be in the wrong place for that scenario, or at least
> > incomplete.
> 
> Yes, we have tested this and verified that the patch solves the issue in the
> test scenarios we've run.
> 
> As always (in networking) it can be difficult to reproduce the issue in a
> consistent manner, so I will describe the setup we have used.
> 


Ahem.

packetdrill can make this in one script, as you can exactly control the
packets that the 'remote' peer would answer.

No need for complex setup. You should try it, and as a bonus we could
easily reproduce the problem and check the fix.

Let's see if we can cook a packetdrill scenario.


* Re: [PATCH net-next] tcp: Fix CWV being too strict on thin streams
  2015-09-22 14:46     ` Eric Dumazet
@ 2015-09-22 16:09       ` Eric Dumazet
  2015-09-22 17:11         ` Eric Dumazet
  0 siblings, 1 reply; 10+ messages in thread
From: Eric Dumazet @ 2015-09-22 16:09 UTC (permalink / raw)
  To: Bendik Rønning Opstad
  Cc: Neal Cardwell, David S. Miller, Alexey Kuznetsov, James Morris,
	Hideaki YOSHIFUJI, Patrick McHardy, Netdev, Eric Dumazet,
	Andreas Petlund, Carsten Griwodz, Jonas Markussen,
	Kenneth Klette Jonassen, Yuchung Cheng

On Tue, 2015-09-22 at 07:46 -0700, Eric Dumazet wrote:

> 
> Ahem.
> 
> packetdrill can make this in one script, as you can exactly control the
> packets that the 'remote' peer would answer.
> 
> No need for complex setup. You should try it, and as a bonus we could
> easily reproduce the problem and check the fix.
> 
> Let's see if we can cook a packetdrill scenario.


The following packetdrill skeleton will give you a connection with cwnd=1 and
ssthresh=2 (without patching the kernel):

// Establish a connection and send 1 MSS.
0     socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
   +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
   +0 setsockopt(3, SOL_TCP, TCP_NODELAY, [1], 4) = 0
   +0 bind(3, ..., ...) = 0
   +0 listen(3, 1) = 0

   +0 < S 0:0(0) win 65535 <mss 1000,sackOK,nop,nop>
   +0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK>
+.200 < . 1:1(0) ack 1 win 65535
   +0 accept(3, ..., ...) = 4

   +0 write(4, ..., 100) = 100
   +0 > P. 1:101(100) ack 1
+.000 %{ print  tcpi_rto }%

// TLP
+.500~+.505 > P. 1:101(100) ack 1
// RTO
+.600~+.605 > P. 1:101(100) ack 1

+.200 < . 1:1(0) ack 101 win 65535
// cwnd should be 2, ssthresh should be 7
2.000 write(4, ..., 100) = 100
   +0  > P. 101:201(100) ack 1

// TLP
+.500~+.505 > P. 101:201(100) ack 1
// RTO
+1.200~+1.210 > P. 101:201(100) ack 1

   +0 %{ print "tcpi_snd_cwnd=%d" % tcpi_snd_cwnd }%
   +0 %{ print "tcpi_snd_ssthresh=%d" % tcpi_snd_ssthresh }%


Then we can add a bunch of  +.01 write(4, ..., 100) = 100


* Re: [PATCH net-next] tcp: Fix CWV being too strict on thin streams
  2015-09-22 16:09       ` Eric Dumazet
@ 2015-09-22 17:11         ` Eric Dumazet
  2015-09-22 18:02           ` Neal Cardwell
  0 siblings, 1 reply; 10+ messages in thread
From: Eric Dumazet @ 2015-09-22 17:11 UTC (permalink / raw)
  To: Bendik Rønning Opstad
  Cc: Neal Cardwell, David S. Miller, Alexey Kuznetsov, James Morris,
	Hideaki YOSHIFUJI, Patrick McHardy, Netdev, Eric Dumazet,
	Andreas Petlund, Carsten Griwodz, Jonas Markussen,
	Kenneth Klette Jonassen, Yuchung Cheng

On Tue, 2015-09-22 at 09:09 -0700, Eric Dumazet wrote:
> On Tue, 2015-09-22 at 07:46 -0700, Eric Dumazet wrote:
> 
> > 
> > Ahem.
> > 
> > packetdrill can make this in one script, as you can exactly control the
> > packets that the 'remote' peer would answer.
> > 
> > No need for complex setup. You should try it, and as a bonus we could
> > easily reproduce the problem and check the fix.
> > 
> > Let's see if we can cook a packetdrill scenario.
> 
> 
> The following packetdrill skeleton will give you a connection with cwnd=1 and
> ssthresh=2 (without patching the kernel):
> 

A more complete packetdrill scenario would be:

# cat cwv.pkt
// Establish a connection and send 1 MSS.
0     socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
   +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
   +0 setsockopt(3, SOL_TCP, TCP_NODELAY, [1], 4) = 0
   +0 bind(3, ..., ...) = 0
   +0 listen(3, 1) = 0

   +0 < S 0:0(0) win 65535 <mss 1000,sackOK,nop,nop>
   +0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK>
+.200 < . 1:1(0) ack 1 win 65535
   +0 accept(3, ..., ...) = 4

   +0 write(4, ..., 100) = 100
   +0 > P. 1:101(100) ack 1
+.000 %{ print  tcpi_rto }%

// TLP
+.500~+.505 > P. 1:101(100) ack 1
// RTO
+.600~+.605 > P. 1:101(100) ack 1

+.200 < . 1:1(0) ack 101 win 65535
// cwnd should be 2, ssthresh should be 7
+0 %{ print "tcpi_snd_cwnd=%d tcpi_snd_ssthresh=%d" % (tcpi_snd_cwnd, tcpi_snd_ssthresh) }%
2.000 write(4, ..., 100) = 100
   +0  > P. 101:201(100) ack 1

// TLP
+.500~+.505 > P. 101:201(100) ack 1
+0 %{ print "tcpi_snd_cwnd=%d tcpi_snd_ssthresh=%d" % (tcpi_snd_cwnd, tcpi_snd_ssthresh) }%
// RTO
+1.200~+1.210 > P. 101:201(100) ack 1

+0 %{ print "tcpi_snd_cwnd=%d tcpi_snd_ssthresh=%d" % (tcpi_snd_cwnd, tcpi_snd_ssthresh) }%
+.200 < . 1:1(0) ack 201 win 65535

4.00 write(4, ..., 100) = 100
    +0  > P. 201:301(100) ack 1
4.01 write(4, ..., 100) = 100
	+0 > P. 301:401(100) ack 1
4.02 write(4, ..., 100) = 100
4.03 write(4, ..., 100) = 100
4.04 write(4, ..., 100) = 100
4.05 write(4, ..., 100) = 100
4.06 write(4, ..., 100) = 100
4.07 write(4, ..., 100) = 100
4.08 write(4, ..., 100) = 100
4.09 write(4, ..., 100) = 100
4.10 write(4, ..., 100) = 100
4.11 write(4, ..., 100) = 100
4.12 write(4, ..., 100) = 100
4.13 write(4, ..., 100) = 100
4.14 write(4, ..., 100) = 100
4.15 write(4, ..., 100) = 100
4.16 write(4, ..., 100) = 100
4.17 write(4, ..., 100) = 100
4.18 write(4, ..., 100) = 100
4.19 write(4, ..., 100) = 100

4.20 write(4, ..., 100) = 100
4.20 < . 1:1(0) ack 301 win 65535
4.20 > . 401:1401(1000) ack 1

4.21 write(4, ..., 100) = 100
4.21 < . 1:1(0) ack 401 win 65535
4.21 > P. 1401:2401(1000) ack 1
   +0 %{ print "tcpi_snd_cwnd=%d tcpi_snd_ssthresh=%d" % (tcpi_snd_cwnd, tcpi_snd_ssthresh) }%
4.22 write(4, ..., 100) = 100
4.23 write(4, ..., 100) = 100
4.24 write(4, ..., 100) = 100
4.25 write(4, ..., 100) = 100
4.26 write(4, ..., 100) = 100
4.27 write(4, ..., 100) = 100
4.28 write(4, ..., 100) = 100
4.29 write(4, ..., 100) = 100
4.31 write(4, ..., 100) = 100
4.32 write(4, ..., 100) = 100
4.33 write(4, ..., 100) = 100
4.34 write(4, ..., 100) = 100
4.35 write(4, ..., 100) = 100
4.36 write(4, ..., 100) = 100
4.37 write(4, ..., 100) = 100
4.38 write(4, ..., 100) = 100
4.39 write(4, ..., 100) = 100

4.40 write(4, ..., 100) = 100
4.40 < . 1:1(0) ack 1401 win 65535
4.40 > . 2401:3401(1000) ack 1
4.41 write(4, ..., 100) = 100
4.41 < . 1:1(0) ack 2401 win 65535
4.41 > P. 3401:4301(900) ack 1
   +0 %{ print "tcpi_snd_cwnd=%d tcpi_snd_ssthresh=%d" % (tcpi_snd_cwnd, tcpi_snd_ssthresh) }%
4.42 write(4, ..., 100) = 100
4.43 write(4, ..., 100) = 100
4.44 write(4, ..., 100) = 100
4.45 write(4, ..., 100) = 100
4.46 write(4, ..., 100) = 100
4.47 write(4, ..., 100) = 100
4.48 write(4, ..., 100) = 100
4.49 write(4, ..., 100) = 100
4.50 write(4, ..., 100) = 100
4.51 write(4, ..., 100) = 100
4.52 write(4, ..., 100) = 100
4.53 write(4, ..., 100) = 100
4.54 write(4, ..., 100) = 100
4.55 write(4, ..., 100) = 100
4.56 write(4, ..., 100) = 100
4.57 write(4, ..., 100) = 100
4.58 write(4, ..., 100) = 100
4.59 write(4, ..., 100) = 100
4.60 write(4, ..., 100) = 100
4.60 < . 1:1(0) ack 3401 win 65535
4.60 > . 4301:5301(1000) ack 1
4.61 write(4, ..., 100) = 100
4.61 < . 1:1(0) ack 4301 win 65535
4.61 > P. 5301:6301(1000) ack 1
   +0 %{ print "tcpi_snd_cwnd=%d tcpi_snd_ssthresh=%d" % (tcpi_snd_cwnd, tcpi_snd_ssthresh) }%
4.62 write(4, ..., 100) = 100
4.63 write(4, ..., 100) = 100
4.64 write(4, ..., 100) = 100
4.65 write(4, ..., 100) = 100
4.66 write(4, ..., 100) = 100
4.67 write(4, ..., 100) = 100
4.68 write(4, ..., 100) = 100
4.69 write(4, ..., 100) = 100
4.70 write(4, ..., 100) = 100
4.71 write(4, ..., 100) = 100
4.72 write(4, ..., 100) = 100
4.73 write(4, ..., 100) = 100
4.74 write(4, ..., 100) = 100
4.75 write(4, ..., 100) = 100
4.76 write(4, ..., 100) = 100
4.77 write(4, ..., 100) = 100
4.78 write(4, ..., 100) = 100
4.79 write(4, ..., 100) = 100
4.80 write(4, ..., 100) = 100
4.80 < . 1:1(0) ack 5301 win 65535
4.80 > . 6301:7301(1000) ack 1
4.81 write(4, ..., 100) = 100
4.81 < . 1:1(0) ack 6301 win 65535
   +0 %{ print "tcpi_snd_cwnd=%d tcpi_snd_ssthresh=%d" % (tcpi_snd_cwnd, tcpi_snd_ssthresh) }%
4.81 > P. 7301:8301(1000) ack 1
4.82 write(4, ..., 100) = 100
4.83 write(4, ..., 100) = 100
4.84 write(4, ..., 100) = 100
4.85 write(4, ..., 100) = 100
4.86 write(4, ..., 100) = 100
4.87 write(4, ..., 100) = 100
4.88 write(4, ..., 100) = 100
4.89 write(4, ..., 100) = 100
4.90 write(4, ..., 100) = 100
4.91 write(4, ..., 100) = 100
4.92 write(4, ..., 100) = 100
4.93 write(4, ..., 100) = 100
4.94 write(4, ..., 100) = 100
4.95 write(4, ..., 100) = 100
4.96 write(4, ..., 100) = 100
4.97 write(4, ..., 100) = 100
4.98 write(4, ..., 100) = 100
4.99 write(4, ..., 100) = 100
5.00 write(4, ..., 100) = 100
5.00 < . 1:1(0) ack 7301 win 65535
5.00 > . 8301:9301(1000) ack 1
5.01 write(4, ..., 100) = 100
5.01 < . 1:1(0) ack 8301 win 65535
5.01 > P. 9301:10301(1000) ack 1
  +0 %{ print "tcpi_snd_cwnd=%d tcpi_snd_ssthresh=%d" % (tcpi_snd_cwnd, tcpi_snd_ssthresh) }%
5.02 write(4, ..., 100) = 100
5.03 write(4, ..., 100) = 100
5.04 write(4, ..., 100) = 100
5.05 write(4, ..., 100) = 100
5.06 write(4, ..., 100) = 100
5.07 write(4, ..., 100) = 100
5.08 write(4, ..., 100) = 100
5.09 write(4, ..., 100) = 100
5.10 write(4, ..., 100) = 100
5.11 write(4, ..., 100) = 100
5.12 write(4, ..., 100) = 100
5.13 write(4, ..., 100) = 100
5.14 write(4, ..., 100) = 100
5.15 write(4, ..., 100) = 100
5.16 write(4, ..., 100) = 100
5.17 write(4, ..., 100) = 100
5.18 write(4, ..., 100) = 100
5.19 write(4, ..., 100) = 100
5.20 write(4, ..., 100) = 100
5.20 < . 1:1(0) ack 9301 win 65535
5.20 > . 10301:11301(1000) ack 1
5.21 write(4, ..., 100) = 100
5.21 < . 1:1(0) ack 10301 win 65535
5.21 > P. 11301:12301(1000) ack 1
+0 %{ print "tcpi_snd_cwnd=%d tcpi_snd_ssthresh=%d" % (tcpi_snd_cwnd, tcpi_snd_ssthresh) }%
5.22 write(4, ..., 100) = 100
5.23 write(4, ..., 100) = 100
5.24 write(4, ..., 100) = 100
5.25 write(4, ..., 100) = 100
5.26 write(4, ..., 100) = 100
5.27 write(4, ..., 100) = 100
5.28 write(4, ..., 100) = 100
5.29 write(4, ..., 100) = 100
5.30 write(4, ..., 100) = 100
5.31 write(4, ..., 100) = 100
5.32 write(4, ..., 100) = 100
5.33 write(4, ..., 100) = 100
5.34 write(4, ..., 100) = 100
5.35 write(4, ..., 100) = 100
5.36 write(4, ..., 100) = 100
5.37 write(4, ..., 100) = 100
5.38 write(4, ..., 100) = 100
5.39 write(4, ..., 100) = 100
5.40 write(4, ..., 100) = 100
5.40 < . 1:1(0) ack 11301 win 65535
5.40 > . 12301:13301(1000) ack 1
5.41 write(4, ..., 100) = 100
5.41 < . 1:1(0) ack 12301 win 65535
5.41 > P. 13301:14301(1000) ack 1
+0 %{ print "tcpi_snd_cwnd=%d tcpi_snd_ssthresh=%d" % (tcpi_snd_cwnd, tcpi_snd_ssthresh) }% 


Output is:
tcpi_snd_cwnd=2 tcpi_snd_ssthresh=7
tcpi_snd_cwnd=2 tcpi_snd_ssthresh=7
tcpi_snd_cwnd=1 tcpi_snd_ssthresh=2
tcpi_snd_cwnd=2 tcpi_snd_ssthresh=2
tcpi_snd_cwnd=2 tcpi_snd_ssthresh=2
tcpi_snd_cwnd=2 tcpi_snd_ssthresh=2
tcpi_snd_cwnd=2 tcpi_snd_ssthresh=2
tcpi_snd_cwnd=2 tcpi_snd_ssthresh=2
tcpi_snd_cwnd=2 tcpi_snd_ssthresh=2
tcpi_snd_cwnd=2 tcpi_snd_ssthresh=2


* Re: [PATCH net-next] tcp: Fix CWV being too strict on thin streams
  2015-09-22 17:11         ` Eric Dumazet
@ 2015-09-22 18:02           ` Neal Cardwell
  2015-09-22 20:04             ` Neal Cardwell
  0 siblings, 1 reply; 10+ messages in thread
From: Neal Cardwell @ 2015-09-22 18:02 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Bendik Rønning Opstad, David S. Miller, Alexey Kuznetsov,
	James Morris, Hideaki YOSHIFUJI, Patrick McHardy, Netdev,
	Eric Dumazet, Andreas Petlund, Carsten Griwodz, Jonas Markussen,
	Kenneth Klette Jonassen, Yuchung Cheng

> I'll describe two example scenarios in detail. In both scenarios we are in
> congestion avoidance after experiencing loss. Nagle is disabled.

Thanks for the detailed follow-up! And thanks, Eric, for the packetdrill
script!

This looks like an issue of how to deal with the case when we run out of
packets to send and cwnd at exactly the same moment, and whether we consider
that case cwnd-limited.

Previously we did not consider ourselves cwnd-limited in that case, but I think
your "scenario 1", and Eric's packetdrill case, show convincingly that we ought
to consider ourselves cwnd-limited in that case.

I would suggest we try fixing this by making sure that in scenario 1, at the
moment when we fill the cwnd (packets_in_flight == cwnd == 2) we mark ourselves
as cwnd-limited.

More generally, my sense is that we should tweak the is_cwnd_limited code to
shift from saying "set is_cwnd_limited to true iff the cwnd is known to be
limiting transmits" to saying "set is_cwnd_limited to true iff the packets in
flight fill the cwnd".

This is partly justified by the kind of case presented here. Plus, in RFC 2861,
section 3.1, which this code is helping to implement, it says:

   After TCP sends a packet, it also checks to see if that packet filled
   the congestion window.  If so, the sender is network-limited, and
   sets the variable T_prev to the current TCP clock time,

So at the point in Scenario 1 when we send the second packet, and
  packets_in_flight == cwnd == 2
then IMHO we should set
  tp->is_cwnd_limited = true
and thus
  tp->snd_cwnd_stamp = tcp_time_stamp;

So maybe we want to slightly tweak the way tcp_write_xmit() and
tcp_tso_should_defer() treat this boundary condition? Gotta run to a
meeting....

neal

ps: If the tp->is_cwnd_limited and tcp_cwnd_validate() code were only used to
decide whether to increase cwnd, then I think your fix would probably be
sufficient. But it is also used for the code to implement RFC 2861 ("TCP
Congestion Window Validation"). So I don't think just changing
tp->is_cwnd_limited is sufficient.
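
For context, the validation path referred to here does roughly the following
(a simplified sketch, not verbatim kernel code), which is why the flag affects
more than just whether .cong_avoid grows the CWND:

static void tcp_cwnd_validate(struct sock *sk, bool is_cwnd_limited)
{
	struct tcp_sock *tp = tcp_sk(sk);

	/* Record whether the window's largest burst filled the CWND;
	 * tcp_is_cwnd_limited() reads this flag later.
	 */
	if (!before(tp->snd_una, tp->max_packets_seq) ||
	    tp->packets_out > tp->max_packets_out) {
		tp->max_packets_out = tp->packets_out;
		tp->max_packets_seq = tp->snd_nxt;
		tp->is_cwnd_limited = is_cwnd_limited;
	}

	if (tcp_is_cwnd_limited(sk)) {
		/* Network-limited: refresh the RFC 2861 validation stamp. */
		tp->snd_cwnd_used = 0;
		tp->snd_cwnd_stamp = tcp_time_stamp;
	} else {
		/* Application-limited: track CWND usage so RFC 2861 can
		 * shrink an under-used CWND after an idle period.
		 */
		if (tp->packets_out > tp->snd_cwnd_used)
			tp->snd_cwnd_used = tp->packets_out;
	}
}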


* Re: [PATCH net-next] tcp: Fix CWV being too strict on thin streams
  2015-09-22 18:02           ` Neal Cardwell
@ 2015-09-22 20:04             ` Neal Cardwell
  2015-09-23 14:46               ` Bendik Rønning Opstad
  0 siblings, 1 reply; 10+ messages in thread
From: Neal Cardwell @ 2015-09-22 20:04 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Bendik Rønning Opstad, David S. Miller, Alexey Kuznetsov,
	James Morris, Hideaki YOSHIFUJI, Patrick McHardy, Netdev,
	Eric Dumazet, Andreas Petlund, Carsten Griwodz, Jonas Markussen,
	Kenneth Klette Jonassen, Yuchung Cheng

On Tue, Sep 22, 2015 at 2:02 PM, Neal Cardwell <ncardwell@google.com> wrote:
> More generally, my sense is that we should tweak the is_cwnd_limited code to
> shift from saying "set is_cwnd_limited to true iff the cwnd is known to be
> limiting transmits" to saying "set is_cwnd_limited to true iff the packets in
> flight fill the cwnd".

Here is a proposed patch, and a packetdrill test case based on Eric's,
trying to capture the flavor of your Scenario 1. The test shows how
this proposed variant allows the sender (Reno in this case, for
simplicity) to increase cwnd each RTT in congestion avoidance, since
the cwnd is fully utilized, even though we run out of packets to send
and available cwnd at exactly the same moment.

Bendik, does this fix your scenario? How does this strike you as a
potential fix?

-------------------------------------

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 4cd0b50..57a586f 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1827,7 +1827,7 @@ static bool tcp_tso_should_defer(struct sock *sk, struct sk_buff *skb,

        /* Ok, it looks like it is advisable to defer. */

-       if (cong_win < send_win && cong_win < skb->len)
+       if (cong_win < send_win && cong_win <= skb->len)
                *is_cwnd_limited = true;

        return true;
@@ -2060,7 +2060,6 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,

                cwnd_quota = tcp_cwnd_test(tp, skb);
                if (!cwnd_quota) {
-                       is_cwnd_limited = true;
                        if (push_one == 2)
                                /* Force out a loss probe pkt. */
                                cwnd_quota = 1;
@@ -2142,6 +2141,7 @@ repair:
                /* Send one loss probe per tail loss episode. */
                if (push_one != 2)
                        tcp_schedule_loss_probe(sk);
+               is_cwnd_limited |= (tcp_packets_in_flight(tp) >= tp->snd_cwnd);
                tcp_cwnd_validate(sk, is_cwnd_limited);
                return false;
        }

-------------------------------------
# cat cwv-ndc-upstream-1.pkt
--->

// Keep it simple.
`sysctl net.ipv4.tcp_congestion_control=reno`

// Establish a connection and send 1 MSS.
0     socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
   +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
   +0 setsockopt(3, SOL_TCP, TCP_NODELAY, [1], 4) = 0
   +0 bind(3, ..., ...) = 0
   +0 listen(3, 1) = 0

   +0 < S 0:0(0) win 65535 <mss 1000,sackOK,nop,nop>
   +0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK>
+.200 < . 1:1(0) ack 1 win 65535
   +0 accept(3, ..., ...) = 4

   +0 write(4, ..., 100) = 100
   +0 > P. 1:101(100) ack 1
+.000 %{ print  tcpi_rto }%

// TLP
+.500~+.505 > P. 1:101(100) ack 1
// RTO
+.600~+.605 > P. 1:101(100) ack 1

+.200 < . 1:1(0) ack 101 win 65535
// cwnd should be 2, ssthresh should be 7
+0 %{ print "1: cwnd=%d ssthresh=%d out=%d" % (tcpi_snd_cwnd,
tcpi_snd_ssthresh, tcpi_unacked) }%
2.000 write(4, ..., 100) = 100
   +0  > P. 101:201(100) ack 1

// TLP
+.500~+.505 > P. 101:201(100) ack 1
+0 %{ print "2: cwnd=%d ssthresh=%d out=%d" % (tcpi_snd_cwnd,
tcpi_snd_ssthresh, tcpi_unacked) }%
// RTO
+1.200~+1.210 > P. 101:201(100) ack 1

+0 %{ print "3: cwnd=%d ssthresh=%d out=%d" % (tcpi_snd_cwnd,
tcpi_snd_ssthresh, tcpi_unacked) }%
+.200 < . 1:1(0) ack 201 win 65535

// cwnd should be 2; since we fill it we should raise it to 3:
+0 %{ assert tcpi_snd_cwnd == 2 }%

+.010 write(4, ..., 1000) = 1000
   +0 > P. 201:1201(1000) ack 1
+0 %{ print "4: cwnd=%d ssthresh=%d out=%d" % (tcpi_snd_cwnd,
tcpi_snd_ssthresh, tcpi_unacked) }%

+.010 write(4, ..., 1000) = 1000
   +0 > P. 1201:2201(1000) ack 1
+0 %{ print "5: cwnd=%d ssthresh=%d out=%d" % (tcpi_snd_cwnd,
tcpi_snd_ssthresh, tcpi_unacked) }%

+.010 < . 1:1(0) ack 1201 win 65535
+0 %{ print "6: cwnd=%d ssthresh=%d out=%d" % (tcpi_snd_cwnd,
tcpi_snd_ssthresh, tcpi_unacked) }%

+.010 < . 1:1(0) ack 2201 win 65535
+0 %{ print "7: cwnd=%d ssthresh=%d out=%d" % (tcpi_snd_cwnd,
tcpi_snd_ssthresh, tcpi_unacked) }%

// cwnd should be 3; since we fill it we should raise it to 4:
+0 %{ assert tcpi_snd_cwnd == 3 }%

+.010 write(4, ..., 1000) = 1000
   +0 > P. 2201:3201(1000) ack 1
+0 %{ print "8: cwnd=%d ssthresh=%d out=%d" % (tcpi_snd_cwnd,
tcpi_snd_ssthresh, tcpi_unacked) }%

+.010 write(4, ..., 1000) = 1000
   +0 > P. 3201:4201(1000) ack 1
+0 %{ print "9: cwnd=%d ssthresh=%d out=%d" % (tcpi_snd_cwnd,
tcpi_snd_ssthresh, tcpi_unacked) }%

+.010 write(4, ..., 1000) = 1000
   +0 > P. 4201:5201(1000) ack 1
+0 %{ print "9: cwnd=%d ssthresh=%d out=%d" % (tcpi_snd_cwnd,
tcpi_snd_ssthresh, tcpi_unacked) }%

+.010 < . 1:1(0) ack 3201 win 65535
+0 %{ print "10: cwnd=%d ssthresh=%d out=%d" % (tcpi_snd_cwnd,
tcpi_snd_ssthresh, tcpi_unacked) }%

+.010 < . 1:1(0) ack 4201 win 65535
+0 %{ print "11: cwnd=%d ssthresh=%d out=%d" % (tcpi_snd_cwnd,
tcpi_snd_ssthresh, tcpi_unacked) }%

+.010 < . 1:1(0) ack 5201 win 65535
+0 %{ print "12: cwnd=%d ssthresh=%d out=%d" % (tcpi_snd_cwnd,
tcpi_snd_ssthresh, tcpi_unacked) }%

// cwnd should be 4:
+0 %{ assert tcpi_snd_cwnd == 4 }%

-------------------------------------


Output is:

net.ipv4.tcp_congestion_control = reno
602000
1: cwnd=2 ssthresh=5 out=0
2: cwnd=2 ssthresh=5 out=1
3: cwnd=1 ssthresh=2 out=1
4: cwnd=2 ssthresh=2 out=1
5: cwnd=2 ssthresh=2 out=2
6: cwnd=2 ssthresh=2 out=1
7: cwnd=3 ssthresh=2 out=0
8: cwnd=3 ssthresh=2 out=1
9: cwnd=3 ssthresh=2 out=2
9: cwnd=3 ssthresh=2 out=3
10: cwnd=3 ssthresh=2 out=2
11: cwnd=3 ssthresh=2 out=1
12: cwnd=4 ssthresh=2 out=0


* Re: [PATCH net-next] tcp: Fix CWV being too strict on thin streams
  2015-09-22 20:04             ` Neal Cardwell
@ 2015-09-23 14:46               ` Bendik Rønning Opstad
  2015-09-23 15:34                 ` Neal Cardwell
  0 siblings, 1 reply; 10+ messages in thread
From: Bendik Rønning Opstad @ 2015-09-23 14:46 UTC (permalink / raw)
  To: Neal Cardwell
  Cc: Eric Dumazet, David S. Miller, Alexey Kuznetsov, James Morris,
	Hideaki YOSHIFUJI, Patrick McHardy, Netdev, Eric Dumazet,
	Andreas Petlund, Carsten Griwodz, Jonas Markussen,
	Kenneth Klette Jonassen, Yuchung Cheng


Neal, Eric, sorry for the late replies, but keeping up with your speedy replies
is a full time job :-)

The packetdrill scripts are certainly useful to test this, so thanks for
supplying those!

On Tuesday, September 22, 2015 04:04:37 PM you wrote:
> On Tue, Sep 22, 2015 at 2:02 PM, Neal Cardwell <ncardwell@google.com> wrote:
> > More generally, my sense is that we should tweak the is_cwnd_limited code to
> > shift from saying "set is_cwnd_limited to true iff the cwnd is known to be
> > limiting transmits" to saying "set is_cwnd_limited to true iff the packets in
> > flight fill the cwnd".

I fully agree. This ensures that a failed attempt at transmitting a packet
is not required for the connection to be considered CWND limited.

> Here is a proposed patch, and a packetdrill test case based on Eric's,
> trying to capture the flavor of your Scenario 1. The test shows how
> this proposed variant allows the sender (Reno in this case, for
> simplicity) to increase cwnd each RTT in congestion avoidance, since
> the cwnd is fully utilized, even though we run out of packets to send
> and available cwnd at exactly the same moment.

This is actually very close to a variation of the patch we experimented with,
where we added a test before the call to tcp_cwnd_validate() to set
is_cwnd_limited. However, to avoid the extra check every time one or more
packets are transmitted (which we presumed would be preferable for performance),
we ended up updating the value only when no packets were sent. If that overhead
is not a problem, then updating the value before the call to tcp_cwnd_validate()
will be better, and also more correct according to RFC 2861, as you mentioned.
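
Roughly, the difference in placement between the two patches in this thread is
(a sketch around the tail of tcp_write_xmit(), not an exact diff):

	if (likely(sent_pkts)) {
		/* ... */
		/* Your proposal: consider the connection CWND limited
		 * whenever the packets now in flight fill the CWND.
		 */
		is_cwnd_limited |= (tcp_packets_in_flight(tp) >= tp->snd_cwnd);
		tcp_cwnd_validate(sk, is_cwnd_limited);
		return false;
	}
	/* The v1 patch instead updated the flag here, i.e. only when a
	 * transmit attempt was blocked and nothing was sent at all.
	 */
	if (is_cwnd_limited)
		tp->is_cwnd_limited = is_cwnd_limited;
	return !tp->packets_out && tcp_send_head(sk);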

> Bendik, does this fix your scenario? How does this strike you as a
> potential fix?

This fixes the identified problem and is a good fix as far as our tests show.

Should I resend this as PATCH v2?

> -------------------------------------
> 
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 4cd0b50..57a586f 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -1827,7 +1827,7 @@ static bool tcp_tso_should_defer(struct sock *sk, struct sk_buff *skb,
> 
>         /* Ok, it looks like it is advisable to defer. */
> 
> -       if (cong_win < send_win && cong_win < skb->len)
> +       if (cong_win < send_win && cong_win <= skb->len)
>                 *is_cwnd_limited = true;
> 
>         return true;
> @@ -2060,7 +2060,6 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
> 
>                 cwnd_quota = tcp_cwnd_test(tp, skb);
>                 if (!cwnd_quota) {
> -                       is_cwnd_limited = true;
>                         if (push_one == 2)
>                                 /* Force out a loss probe pkt. */
>                                 cwnd_quota = 1;
> @@ -2142,6 +2141,7 @@ repair:
>                 /* Send one loss probe per tail loss episode. */
>                 if (push_one != 2)
>                         tcp_schedule_loss_probe(sk);
> +               is_cwnd_limited |= (tcp_packets_in_flight(tp) >= tp->snd_cwnd);
>                 tcp_cwnd_validate(sk, is_cwnd_limited);
>                 return false;
>         }
> 


* Re: [PATCH net-next] tcp: Fix CWV being too strict on thin streams
  2015-09-23 14:46               ` Bendik Rønning Opstad
@ 2015-09-23 15:34                 ` Neal Cardwell
  0 siblings, 0 replies; 10+ messages in thread
From: Neal Cardwell @ 2015-09-23 15:34 UTC (permalink / raw)
  To: Bendik Rønning Opstad
  Cc: Eric Dumazet, David S. Miller, Alexey Kuznetsov, James Morris,
	Hideaki YOSHIFUJI, Patrick McHardy, Netdev, Eric Dumazet,
	Andreas Petlund, Carsten Griwodz, Jonas Markussen,
	Kenneth Klette Jonassen, Yuchung Cheng

On Wed, Sep 23, 2015 at 10:46 AM, Bendik Rønning Opstad
<bro.devel@gmail.com> wrote:
>> On Tue, Sep 22, 2015 at 2:02 PM, Neal Cardwell <ncardwell@google.com> wrote:
>> > More generally, my sense is that we should tweak the is_cwnd_limited code to
>> > shift from saying "set is_cwnd_limited to true iff the cwnd is known to be
>> > limiting transmits" to saying "set is_cwnd_limited to true iff the packets in
>> > flight fill the cwnd".
...
> This fixes the identified problem and is a good fix as far as our tests show.
>
> Should I resend this as PATCH v2?

Yes, please send this latest patch as a v2, with an updated commit
message, to reflect the latest diagnosis/rationale.

Thanks!
neal
