* [PATCH net-next v2 0/2] tcp: fix high tail latencies in DCTCP
@ 2018-07-02 21:39 Lawrence Brakmo
  2018-07-02 21:39 ` [PATCH net-next v2 1/2] tcp: notify when a delayed ack is sent Lawrence Brakmo
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Lawrence Brakmo @ 2018-07-02 21:39 UTC (permalink / raw)
  To: netdev
  Cc: Kernel Team, Blake Matheny, Alexei Starovoitov, Neal Cardwell,
	Yuchung Cheng, Steve Ibanez, Eric Dumazet

We have observed high tail latencies when using DCTCP for RPCs as
compared to using Cubic. For example, in one setup there are 2 hosts
sending to a 3rd one, with each sender having 3 flows (1 stream,
1 1MB back-to-back RPCs and 1 10KB back-to-back RPCs). The following
table shows the 99% and 99.9% latencies for both Cubic and dctcp:

           Cubic 99%  Cubic 99.9%   dctcp 99%    dctcp 99.9%
 1MB RPCs    2.6ms       5.5ms         43ms          208ms
10KB RPCs    1.1ms       1.3ms         53ms          212ms

Looking at tcpdump traces showed that there are two causes for the
latency.  

  1) RTOs caused by the receiver sending a dup ACK and not ACKing
     the last (and only) packet sent.
  2) Delaying ACKs when the sender has a cwnd of 1, so everything
     pauses for the duration of the delayed ACK.

The first patch fixes the cause of the dup ACKs, not updating DCTCP
state when an ACK that was initially delayed has been sent with a
data packet.

The second patch ensures that an ACK is sent immediately when a
CWR marked packet arrives.

With the patches the latencies for DCTCP now look like:

           dctcp 99%  dctcp 99.9%
 1MB RPCs    5.8ms       6.9ms
10KB RPCs    146us       203us

Note that while the 1MB RPCs tail latencies are higher than Cubic's,
the 10KB latencies are much smaller than Cubic's. These patches fix
issues on the receiver, but tcpdump traces indicate there is an
opportunity to also fix an issue at the sender that adds about 3ms
to the tail latencies.

The following trace shows the issue that triggers an RTO (fixed by these patches):

   Host A sends the last packets of the request
   Host B receives them, and the last packet is marked with congestion (CE)
   Host B sends ACKs for packets not marked with congestion
   Host B sends data packet with reply and ACK for packet marked with
          congestion (TCP flag ECE)
   Host A receives ACKs with no ECE flag
   Host A receives data packet with ACK for the last packet of request
          and which has TCP ECE bit set
   Host A sends 1st data packet of the next request with TCP flag CWR
   Host B receives the packet (as seen in tcpdump at B), no CE flag
   Host B sends a dup ACK that also has the TCP ECE flag
   Host A RTO timer fires!
   Host A retransmits the packet
   Host A receives an ACK for everything it has sent (i.e. Host B
          did receive 1st packet of request)
   Host A sends more packets…

[PATCH net-next v2 1/2] tcp: notify when a delayed ack is sent
[PATCH net-next v2 2/2] tcp: ack immediately when a cwr packet

 net/ipv4/tcp_input.c  | 16 +++++++++++-----
 net/ipv4/tcp_output.c |  4 ++--
 2 files changed, 13 insertions(+), 7 deletions(-)

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH net-next v2 1/2] tcp: notify when a delayed ack is sent
  2018-07-02 21:39 [PATCH net-next v2 0/2] tcp: fix high tail latencies in DCTCP Lawrence Brakmo
@ 2018-07-02 21:39 ` Lawrence Brakmo
  2018-07-02 23:48   ` Yuchung Cheng
  2018-07-02 21:39 ` [PATCH net-next v2 2/2] tcp: ack immediately when a cwr packet arrives Lawrence Brakmo
  2018-07-03  0:51 ` [PATCH net-next v2 0/2] tcp: fix high tail latencies in DCTCP Neal Cardwell
  2 siblings, 1 reply; 10+ messages in thread
From: Lawrence Brakmo @ 2018-07-02 21:39 UTC (permalink / raw)
  To: netdev
  Cc: Kernel Team, Blake Matheny, Alexei Starovoitov, Neal Cardwell,
	Yuchung Cheng, Steve Ibanez, Eric Dumazet

DCTCP depends on the CA_EVENT_NON_DELAYED_ACK and CA_EVENT_DELAYED_ACK
notifications to keep track if it needs to send an ACK for packets that
were received with a particular ECN state but whose ACK was delayed.
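
For context, DCTCP's bookkeeping for these events looks roughly like the
following (a paraphrased sketch of net/ipv4/tcp_dctcp.c from memory, not
the exact code):

static void dctcp_update_ack_reserved(struct sock *sk, enum tcp_ca_event ev)
{
	struct dctcp *ca = inet_csk_ca(sk);

	switch (ev) {
	case CA_EVENT_DELAYED_ACK:
		/* An ACK was held back; remember that one is still owed. */
		if (!ca->delayed_ack_reserved)
			ca->delayed_ack_reserved = 1;
		break;
	case CA_EVENT_NON_DELAYED_ACK:
		/* The owed ACK went out (or was never delayed). */
		if (ca->delayed_ack_reserved)
			ca->delayed_ack_reserved = 0;
		break;
	default:
		/* Don't care about other events. */
		break;
	}
}

The delayed_ack_reserved flag is what goes stale when the NON_DELAYED_ACK
notification is missed.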

Under some circumstances, for example when a delayed ACK is sent with a
data packet, DCTCP state was not being updated due to a lack of
notification that the previously delayed ACK was sent. As a result, it
would sometimes send a duplicate ACK when a new data packet arrived.

This patch ensures that DCTCP's state is correctly updated so it will
not send the duplicate ACK.

Improved based on comments from Neal Cardwell <ncardwell@google.com>.

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
---
 net/ipv4/tcp_output.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index f8f6129160dd..acefb64e8280 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -172,6 +172,8 @@ static inline void tcp_event_ack_sent(struct sock *sk, unsigned int pkts)
 			__sock_put(sk);
 	}
 	tcp_dec_quickack_mode(sk, pkts);
+	if (inet_csk_ack_scheduled(sk))
+		tcp_ca_event(sk, CA_EVENT_NON_DELAYED_ACK);
 	inet_csk_clear_xmit_timer(sk, ICSK_TIME_DACK);
 }
 
@@ -3567,8 +3569,6 @@ void tcp_send_ack(struct sock *sk)
 	if (sk->sk_state == TCP_CLOSE)
 		return;
 
-	tcp_ca_event(sk, CA_EVENT_NON_DELAYED_ACK);
-
 	/* We are not putting this on the write queue, so
 	 * tcp_transmit_skb() will set the ownership to this
 	 * sock.
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH net-next v2 2/2] tcp: ack immediately when a cwr packet arrives
  2018-07-02 21:39 [PATCH net-next v2 0/2] tcp: fix high tail latencies in DCTCP Lawrence Brakmo
  2018-07-02 21:39 ` [PATCH net-next v2 1/2] tcp: notify when a delayed ack is sent Lawrence Brakmo
@ 2018-07-02 21:39 ` Lawrence Brakmo
  2018-07-03  0:51 ` [PATCH net-next v2 0/2] tcp: fix high tail latencies in DCTCP Neal Cardwell
  2 siblings, 0 replies; 10+ messages in thread
From: Lawrence Brakmo @ 2018-07-02 21:39 UTC (permalink / raw)
  To: netdev
  Cc: Kernel Team, Blake Matheny, Alexei Starovoitov, Neal Cardwell,
	Yuchung Cheng, Steve Ibanez, Eric Dumazet

We observed high 99% and 99.9% latencies when doing RPCs with DCTCP. The
problem is triggered when the last packet of a request arrives CE
marked. The reply will carry the ECE mark, causing TCP to shrink its
cwnd to 1 (because there are no packets in flight). When the 1st packet
of the next request arrives, its ACK is sometimes delayed even though
the packet is CWR marked, adding up to 40ms to the RPC latency.
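
(For scale: the delayed ACK timer is bounded by constants along these lines,
quoted from memory for HZ=1000 and meant only as a rough reference:)

#define TCP_DELACK_MIN	((unsigned)(HZ / 25))	/* ~40 ms  */
#define TCP_DELACK_MAX	((unsigned)(HZ / 5))	/* ~200 ms */

With a cwnd of 1 nothing else is in flight, so the whole delay is added to
the RPC.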

This patch ensures that arriving CWR marked data packets are acked
immediately.

Modified based on comments by Neal Cardwell <ncardwell@google.com>

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
---
 net/ipv4/tcp_input.c | 16 +++++++++++-----
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 76ca88f63b70..6fd1f2378f6c 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -254,10 +254,16 @@ static void tcp_ecn_withdraw_cwr(struct tcp_sock *tp)
 	tp->ecn_flags &= ~TCP_ECN_DEMAND_CWR;
 }
 
-static void __tcp_ecn_check_ce(struct sock *sk, const struct sk_buff *skb)
+static void __tcp_ecn_check(struct sock *sk, const struct sk_buff *skb)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 
+	/* If the sender is telling us it has entered CWR, then its cwnd may be
+	 * very low (even just 1 packet), so we should ACK immediately.
+	 */
+	if (tcp_hdr(skb)->cwr)
+		tcp_enter_quickack_mode(sk, 2);
+
 	switch (TCP_SKB_CB(skb)->ip_dsfield & INET_ECN_MASK) {
 	case INET_ECN_NOT_ECT:
 		/* Funny extension: if ECT is not set on a segment,
@@ -286,10 +292,10 @@ static void __tcp_ecn_check_ce(struct sock *sk, const struct sk_buff *skb)
 	}
 }
 
-static void tcp_ecn_check_ce(struct sock *sk, const struct sk_buff *skb)
+static void tcp_ecn_check(struct sock *sk, const struct sk_buff *skb)
 {
 	if (tcp_sk(sk)->ecn_flags & TCP_ECN_OK)
-		__tcp_ecn_check_ce(sk, skb);
+		__tcp_ecn_check(sk, skb);
 }
 
 static void tcp_ecn_rcv_synack(struct tcp_sock *tp, const struct tcphdr *th)
@@ -715,7 +721,7 @@ static void tcp_event_data_recv(struct sock *sk, struct sk_buff *skb)
 	}
 	icsk->icsk_ack.lrcvtime = now;
 
-	tcp_ecn_check_ce(sk, skb);
+	tcp_ecn_check(sk, skb);
 
 	if (skb->len >= 128)
 		tcp_grow_window(sk, skb);
@@ -4439,7 +4445,7 @@ static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
 	u32 seq, end_seq;
 	bool fragstolen;
 
-	tcp_ecn_check_ce(sk, skb);
+	tcp_ecn_check(sk, skb);
 
 	if (unlikely(tcp_try_rmem_schedule(sk, skb, skb->truesize))) {
 		NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPOFODROP);
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH net-next v2 1/2] tcp: notify when a delayed ack is sent
  2018-07-02 21:39 ` [PATCH net-next v2 1/2] tcp: notify when a delayed ack is sent Lawrence Brakmo
@ 2018-07-02 23:48   ` Yuchung Cheng
  2018-07-03 13:14     ` Neal Cardwell
  2018-07-03 15:15     ` Lawrence Brakmo
  0 siblings, 2 replies; 10+ messages in thread
From: Yuchung Cheng @ 2018-07-02 23:48 UTC (permalink / raw)
  To: Lawrence Brakmo
  Cc: netdev, Kernel Team, Blake Matheny, Alexei Starovoitov,
	Neal Cardwell, Steve Ibanez, Eric Dumazet

On Mon, Jul 2, 2018 at 2:39 PM, Lawrence Brakmo <brakmo@fb.com> wrote:
>
> DCTCP depends on the CA_EVENT_NON_DELAYED_ACK and CA_EVENT_DELAYED_ACK
> notifications to keep track if it needs to send an ACK for packets that
> were received with a particular ECN state but whose ACK was delayed.
>
> Under some circumstances, for example when a delayed ACK is sent with a
> data packet, DCTCP state was not being updated due to a lack of
> notification that the previously delayed ACK was sent. As a result, it
> would sometimes send a duplicate ACK when a new data packet arrived.
>
> This patch ensures that DCTCP's state is correctly updated so it will
> not send the duplicate ACK.
Sorry to chime in late here (lame excuse: IETF deadline)

IIRC this issue would exist prior to 4.11 kernel. While it'd be good
to fix that, it's not clear which patch introduced the regression
between 4.11 and 4.16? I assume you tested Eric's most recent quickack
fix.

In terms of the fix itself, it seems odd the tcp_send_ack() call in
DCTCP generates NON_DELAYED_ACK event to toggle DCTCP's
delayed_ack_reserved bit. Shouldn't the fix be to have DCTCP send the
"prior" ACK w/o cancelling delayed-ACK and mis-recording that it's
cancelled, because that prior-ACK really is a supplementary old ACK.

But it's still unclear how this bug introduces the regression 4.11 - 4.16


>
> Improved based on comments from Neal Cardwell <ncardwell@google.com>.
>
> Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
> ---
>  net/ipv4/tcp_output.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index f8f6129160dd..acefb64e8280 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -172,6 +172,8 @@ static inline void tcp_event_ack_sent(struct sock *sk, unsigned int pkts)
>                         __sock_put(sk);
>         }
>         tcp_dec_quickack_mode(sk, pkts);
> +       if (inet_csk_ack_scheduled(sk))
> +               tcp_ca_event(sk, CA_EVENT_NON_DELAYED_ACK);
>         inet_csk_clear_xmit_timer(sk, ICSK_TIME_DACK);
>  }
>
> @@ -3567,8 +3569,6 @@ void tcp_send_ack(struct sock *sk)
>         if (sk->sk_state == TCP_CLOSE)
>                 return;
>
> -       tcp_ca_event(sk, CA_EVENT_NON_DELAYED_ACK);
> -
>         /* We are not putting this on the write queue, so
>          * tcp_transmit_skb() will set the ownership to this
>          * sock.
> --
> 2.17.1
>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH net-next v2 0/2] tcp: fix high tail latencies in DCTCP
  2018-07-02 21:39 [PATCH net-next v2 0/2] tcp: fix high tail latencies in DCTCP Lawrence Brakmo
  2018-07-02 21:39 ` [PATCH net-next v2 1/2] tcp: notify when a delayed ack is sent Lawrence Brakmo
  2018-07-02 21:39 ` [PATCH net-next v2 2/2] tcp: ack immediately when a cwr packet arrives Lawrence Brakmo
@ 2018-07-03  0:51 ` Neal Cardwell
  2018-07-03 15:10   ` Lawrence Brakmo
  2 siblings, 1 reply; 10+ messages in thread
From: Neal Cardwell @ 2018-07-03  0:51 UTC (permalink / raw)
  To: Lawrence Brakmo
  Cc: Netdev, Kernel Team, bmatheny, ast, Yuchung Cheng, Steve Ibanez,
	Eric Dumazet

On Mon, Jul 2, 2018 at 5:39 PM Lawrence Brakmo <brakmo@fb.com> wrote:
>
> We have observed high tail latencies when using DCTCP for RPCs as
> compared to using Cubic. For example, in one setup there are 2 hosts
> sending to a 3rd one, with each sender having 3 flows (1 stream,
> 1 1MB back-to-back RPCs and 1 10KB back-to-back RPCs). The following
> table shows the 99% and 99.9% latencies for both Cubic and dctcp:
>
>            Cubic 99%  Cubic 99.9%   dctcp 99%    dctcp 99.9%
>  1MB RPCs    2.6ms       5.5ms         43ms          208ms
> 10KB RPCs    1.1ms       1.3ms         53ms          212ms
>
> Looking at tcpdump traces showed that there are two causes for the
> latency.
>
>   1) RTOs caused by the receiver sending a dup ACK and not ACKing
>      the last (and only) packet sent.
>   2) Delaying ACKs when the sender has a cwnd of 1, so everything
>      pauses for the duration of the delayed ACK.
>
> The first patch fixes the cause of the dup ACKs, not updating DCTCP
> state when an ACK that was initially delayed has been sent with a
> data packet.
>
> The second patch ensures that an ACK is sent immediately when a
> CWR marked packet arrives.
>
> With the patches the latencies for DCTCP now look like:
>
>            dctcp 99%  dctcp 99.9%
>  1MB RPCs    5.8ms       6.9ms
> 10KB RPCs    146us       203us
>
> Note that while the 1MB RPCs tail latencies are higher than Cubic's,
> the 10KB latencies are much smaller than Cubic's. These patches fix
> issues on the receiver, but tcpdump traces indicate there is an
> opportunity to also fix an issue at the sender that adds about 3ms
> to the tail latencies.
>
> The following trace shows the issue that triggers an RTO (fixed by these patches):
>
>    Host A sends the last packets of the request
>    Host B receives them, and the last packet is marked with congestion (CE)
>    Host B sends ACKs for packets not marked with congestion
>    Host B sends data packet with reply and ACK for packet marked with
>           congestion (TCP flag ECE)
>    Host A receives ACKs with no ECE flag
>    Host A receives data packet with ACK for the last packet of request
>           and which has TCP ECE bit set
>    Host A sends 1st data packet of the next request with TCP flag CWR
>    Host B receives the packet (as seen in tcpdump at B), no CE flag
>    Host B sends a dup ACK that also has the TCP ECE flag
>    Host A RTO timer fires!
>    Host A retransmits the packet
>    Host A receives an ACK for everything it has sent (i.e. Host B
>           did receive 1st packet of request)
>    Host A sends more packets…
>
> [PATCH net-next v2 1/2] tcp: notify when a delayed ack is sent
> [PATCH net-next v2 2/2] tcp: ack immediately when a cwr packet
>
>  net/ipv4/tcp_input.c  | 16 +++++++++++-----
>  net/ipv4/tcp_output.c |  4 ++--
>  2 files changed, 13 insertions(+), 7 deletions(-)

Thanks, Larry. Just for context, can you please let us know whether
your tests included zero, one, or both of Eric's recent commits
(listed below) that tuned the number of ACKs after ECN events? (Or
maybe the tests were literally using a net-next kernel?) Just wanted
to get a better handle on any possible interactions there.

Thanks!

neal

---
https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=522040ea5fdd1c33bbf75e1d7c7c0422b96a94ef
commit 522040ea5fdd1c33bbf75e1d7c7c0422b96a94ef
Author: Eric Dumazet <edumazet@google.com>
Date:   Mon May 21 15:08:57 2018 -0700

    tcp: do not aggressively quick ack after ECN events

    ECN signals currently forces TCP to enter quickack mode for
    up to 16 (TCP_MAX_QUICKACKS) following incoming packets.

    We believe this is not needed, and only sending one immediate ack
    for the current packet should be enough.

    This should reduce the extra load noticed in DCTCP environments,
    after congestion events.

    This is part 2 of our effort to reduce pure ACK packets.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
    Acked-by: Yuchung Cheng <ycheng@google.com>
    Acked-by: Neal Cardwell <ncardwell@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=15ecbe94a45ef88491ca459b26efdd02f91edb6d
commit 15ecbe94a45ef88491ca459b26efdd02f91edb6d
Author: Eric Dumazet <edumazet@google.com>
Date:   Wed Jun 27 08:47:21 2018 -0700

    tcp: add one more quick ack after after ECN events

    Larry Brakmo proposal ( https://patchwork.ozlabs.org/patch/935233/
    tcp: force cwnd at least 2 in tcp_cwnd_reduction) made us rethink
    about our recent patch removing ~16 quick acks after ECN events.

    tcp_enter_quickack_mode(sk, 1) makes sure one immediate ack is sent,
    but in the case the sender cwnd was lowered to 1, we do not want
    to have a delayed ack for the next packet we will receive.

    Fixes: 522040ea5fdd ("tcp: do not aggressively quick ack after ECN events")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Reported-by: Neal Cardwell <ncardwell@google.com>
    Cc: Lawrence Brakmo <brakmo@fb.com>
    Acked-by: Neal Cardwell <ncardwell@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH net-next v2 1/2] tcp: notify when a delayed ack is sent
  2018-07-02 23:48   ` Yuchung Cheng
@ 2018-07-03 13:14     ` Neal Cardwell
  2018-07-03 18:38       ` Lawrence Brakmo
  2018-07-03 15:15     ` Lawrence Brakmo
  1 sibling, 1 reply; 10+ messages in thread
From: Neal Cardwell @ 2018-07-03 13:14 UTC (permalink / raw)
  To: Yuchung Cheng
  Cc: Lawrence Brakmo, Netdev, Kernel Team, bmatheny, ast,
	Steve Ibanez, Eric Dumazet

On Mon, Jul 2, 2018 at 7:49 PM Yuchung Cheng <ycheng@google.com> wrote:
>
> On Mon, Jul 2, 2018 at 2:39 PM, Lawrence Brakmo <brakmo@fb.com> wrote:
> >
> > DCTCP depends on the CA_EVENT_NON_DELAYED_ACK and CA_EVENT_DELAYED_ACK
> > notifications to keep track if it needs to send an ACK for packets that
> > were received with a particular ECN state but whose ACK was delayed.
> >
> > Under some circumstances, for example when a delayed ACK is sent with a
> > data packet, DCTCP state was not being updated due to a lack of
> > notification that the previously delayed ACK was sent. As a result, it
> > would sometimes send a duplicate ACK when a new data packet arrived.
> >
> > This patch ensures that DCTCP's state is correctly updated so it will
> > not send the duplicate ACK.
> Sorry to chime-in late here (lame excuse: IETF deadline)
>
> IIRC this issue would exist prior to 4.11 kernel. While it'd be good
> to fix that, it's not clear which patch introduced the regression
> between 4.11 and 4.16? I assume you tested Eric's most recent quickack
> fix.

I don't think Larry is saying there's a regression between 4.11 and
4.16. His recent "tcp: force cwnd at least 2 in tcp_cwnd_reduction"
patch here:

  https://patchwork.ozlabs.org/patch/935233/

said that 4.11 was bad (high tail latency and lots of RTOs) and 4.16
was still bad but with different netstat counters (no RTOs but still
high tail latency):

"""
On 4.11, pcap traces indicate that in some instances the 1st packet of
the RPC is received but no ACK is sent before the packet is
retransmitted. On 4.11 netstat shows TCP timeouts, with some of them
spurious.

On 4.16, we don't see retransmits in netstat but the high tail latencies
are still there.
"""

I suspect the RTOs disappeared but latencies remained too high because
between 4.11 and 4.16 we introduced:
  tcp: allow TLP in ECN CWR
  https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=b4f70c3d4ec32a2ff4c62e1e2da0da5f55fe12bd

So the RTOs probably disappeared because this commit turned them into
TLPs. But the latencies remained high because the fundamental bug
remained throughout 4.11 and 4.16 and today: the DCTCP use of
tcp_send_ack() with an old rcv_nxt caused delayed ACKs to be cancelled
when they should not have been.
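
Concretely, the pattern in question looks roughly like this (a paraphrased
sketch of the DCTCP CE-state transition, not an exact quote of tcp_dctcp.c):

static void dctcp_ce_state_0_to_1(struct sock *sk)
{
	struct dctcp *ca = inet_csk_ca(sk);
	struct tcp_sock *tp = tcp_sk(sk);

	/* CE went 0 -> 1 while an ACK for the pre-CE data is still owed:
	 * emit that "prior" ACK first so its ECN echo is not lost, by
	 * temporarily rewinding rcv_nxt.
	 */
	if (!ca->ce_state && ca->delayed_ack_reserved) {
		u32 tmp_rcv_nxt = tp->rcv_nxt;

		tp->ecn_flags &= ~TCP_ECN_DEMAND_CWR;
		tp->rcv_nxt = ca->prior_rcv_nxt;
		tcp_send_ack(sk);
		tp->rcv_nxt = tmp_rcv_nxt;
	}

	ca->prior_rcv_nxt = tp->rcv_nxt;
	ca->ce_state = 1;
	tp->ecn_flags |= TCP_ECN_DEMAND_CWR;
}

That tcp_send_ack() cancels, as a side effect, the delayed ACK that is
still owed for the current rcv_nxt, which is the unwanted cancellation
described above.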

> In terms of the fix itself, it seems odd the tcp_send_ack() call in
> DCTCP generates NON_DELAYED_ACK event to toggle DCTCP's
> delayed_ack_reserved bit. Shouldn't the fix be to have DCTCP send the
> "prior" ACK w/o cancelling delayed-ACK and mis-recording that it's
> cancelled, because that prior-ACK really is a supplementary old ACK.

This patch is fixing an issue that's orthogonal to the one you are
talking about. Using the taxonomy from our team's internal discussion
yesterday, the issue you mention where the DCTCP "prior" ACK is
cancelling delayed ACKs is issue (4); the issue that this particular
"tcp: notify when a delayed ack is sent" patch from Larry fixes is
issue (3). It's a bit tricky because both issues appeared in Larry's
trace summary and packetdrill script to reproduce the issue.

neal

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH net-next v2 0/2] tcp: fix high tail latencies in DCTCP
  2018-07-03  0:51 ` [PATCH net-next v2 0/2] tcp: fix high tail latencies in DCTCP Neal Cardwell
@ 2018-07-03 15:10   ` Lawrence Brakmo
  2018-07-03 15:25     ` Neal Cardwell
  0 siblings, 1 reply; 10+ messages in thread
From: Lawrence Brakmo @ 2018-07-03 15:10 UTC (permalink / raw)
  To: Neal Cardwell
  Cc: Netdev, Kernel Team, Blake Matheny, Alexei Starovoitov,
	Yuchung Cheng, Steve Ibanez, Eric Dumazet

On 7/2/18, 5:52 PM, "netdev-owner@vger.kernel.org on behalf of Neal Cardwell" <netdev-owner@vger.kernel.org on behalf of ncardwell@google.com> wrote:

    On Mon, Jul 2, 2018 at 5:39 PM Lawrence Brakmo <brakmo@fb.com> wrote:
    >
    > We have observed high tail latencies when using DCTCP for RPCs as
    > compared to using Cubic. For example, in one setup there are 2 hosts
    > sending to a 3rd one, with each sender having 3 flows (1 stream,
    > 1 1MB back-to-back RPCs and 1 10KB back-to-back RPCs). The following
    > table shows the 99% and 99.9% latencies for both Cubic and dctcp:
    >
    >            Cubic 99%  Cubic 99.9%   dctcp 99%    dctcp 99.9%
    >  1MB RPCs    2.6ms       5.5ms         43ms          208ms
    > 10KB RPCs    1.1ms       1.3ms         53ms          212ms
    >
    > Looking at tcpdump traces showed that there are two causes for the
    > latency.
    >
    >   1) RTOs caused by the receiver sending a dup ACK and not ACKing
    >      the last (and only) packet sent.
    >   2) Delaying ACKs when the sender has a cwnd of 1, so everything
    >      pauses for the duration of the delayed ACK.
    >
    > The first patch fixes the cause of the dup ACKs, not updating DCTCP
    > state when an ACK that was initially delayed has been sent with a
    > data packet.
    >
    > The second patch ensures that an ACK is sent immediately when a
    > CWR marked packet arrives.
    >
    > With the patches the latencies for DCTCP now look like:
    >
    >            dctcp 99%  dctcp 99.9%
    >  1MB RPCs    5.8ms       6.9ms
    > 10KB RPCs    146us       203us
    >
    > Note that while the 1MB RPCs tail latencies are higher than Cubic's,
    > the 10KB latencies are much smaller than Cubic's. These patches fix
    > issues on the receiver, but tcpdump traces indicate there is an
    > opportunity to also fix an issue at the sender that adds about 3ms
    > to the tail latencies.
    >
    > The following trace shows the issue that triggers an RTO (fixed by these patches):
    >
    >    Host A sends the last packets of the request
    >    Host B receives them, and the last packet is marked with congestion (CE)
    >    Host B sends ACKs for packets not marked with congestion
    >    Host B sends data packet with reply and ACK for packet marked with
    >           congestion (TCP flag ECE)
    >    Host A receives ACKs with no ECE flag
    >    Host A receives data packet with ACK for the last packet of request
    >           and which has TCP ECE bit set
    >    Host A sends 1st data packet of the next request with TCP flag CWR
    >    Host B receives the packet (as seen in tcpdump at B), no CE flag
    >    Host B sends a dup ACK that also has the TCP ECE flag
    >    Host A RTO timer fires!
    >    Host A retransmits the packet
    >    Host A receives an ACK for everything it has sent (i.e. Host B
    >           did receive 1st packet of request)
    >    Host A sends more packets…
    >
    > [PATCH net-next v2 1/2] tcp: notify when a delayed ack is sent
    > [PATCH net-next v2 2/2] tcp: ack immediately when a cwr packet
    >
    >  net/ipv4/tcp_input.c  | 16 +++++++++++-----
    >  net/ipv4/tcp_output.c |  4 ++--
    >  2 files changed, 13 insertions(+), 7 deletions(-)
    
    Thanks, Larry. Just for context, can you please let us know whether
    your tests included zero, one, or both of Eric's recent commits
    (listed below) that tuned the number of ACKs after ECN events? (Or
    maybe the tests were literally using a net-next kernel?) Just wanted
    to get a better handle on any possible interactions there.
    
    Thanks!
    
    neal

Yes, my test kernel includes both patches listed below. BTW, I will send a new patch where I move the call to tcp_incr_quickack from tcp_ecn_check_ce to tcp_ecn_accept_cwr.
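
Roughly what I have in mind, as a sketch only (using tcp_enter_quickack_mode()
as in patch 2; the sk-based signature here is an assumption on my part, not
necessarily what the next version will look like):

static void tcp_ecn_accept_cwr(struct sock *sk, const struct sk_buff *skb)
{
	struct tcp_sock *tp = tcp_sk(sk);

	if (tcp_hdr(skb)->cwr) {
		tp->ecn_flags &= ~TCP_ECN_DEMAND_CWR;

		/* The sender may have cut cwnd to 1, so don't sit on a
		 * delayed ACK for this packet.
		 */
		tcp_enter_quickack_mode(sk, 2);
	}
}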
    
    ---
    https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=522040ea5fdd1c33bbf75e1d7c7c0422b96a94ef
    commit 522040ea5fdd1c33bbf75e1d7c7c0422b96a94ef
    Author: Eric Dumazet <edumazet@google.com>
    Date:   Mon May 21 15:08:57 2018 -0700
    
        tcp: do not aggressively quick ack after ECN events
    
        ECN signals currently forces TCP to enter quickack mode for
        up to 16 (TCP_MAX_QUICKACKS) following incoming packets.
    
        We believe this is not needed, and only sending one immediate ack
        for the current packet should be enough.
    
        This should reduce the extra load noticed in DCTCP environments,
        after congestion events.
    
        This is part 2 of our effort to reduce pure ACK packets.
    
        Signed-off-by: Eric Dumazet <edumazet@google.com>
        Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
        Acked-by: Yuchung Cheng <ycheng@google.com>
        Acked-by: Neal Cardwell <ncardwell@google.com>
        Signed-off-by: David S. Miller <davem@davemloft.net>
    
    https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=15ecbe94a45ef88491ca459b26efdd02f91edb6d
    commit 15ecbe94a45ef88491ca459b26efdd02f91edb6d
    Author: Eric Dumazet <edumazet@google.com>
    Date:   Wed Jun 27 08:47:21 2018 -0700
    
        tcp: add one more quick ack after after ECN events
    
        Larry Brakmo proposal ( https://patchwork.ozlabs.org/patch/935233/
        tcp: force cwnd at least 2 in tcp_cwnd_reduction) made us rethink
        about our recent patch removing ~16 quick acks after ECN events.
    
        tcp_enter_quickack_mode(sk, 1) makes sure one immediate ack is sent,
        but in the case the sender cwnd was lowered to 1, we do not want
        to have a delayed ack for the next packet we will receive.
    
        Fixes: 522040ea5fdd ("tcp: do not aggressively quick ack after ECN events")
        Signed-off-by: Eric Dumazet <edumazet@google.com>
        Reported-by: Neal Cardwell <ncardwell@google.com>
        Cc: Lawrence Brakmo <brakmo@fb.com>
        Acked-by: Neal Cardwell <ncardwell@google.com>
        Signed-off-by: David S. Miller <davem@davemloft.net>
    

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH net-next v2 1/2] tcp: notify when a delayed ack is sent
  2018-07-02 23:48   ` Yuchung Cheng
  2018-07-03 13:14     ` Neal Cardwell
@ 2018-07-03 15:15     ` Lawrence Brakmo
  1 sibling, 0 replies; 10+ messages in thread
From: Lawrence Brakmo @ 2018-07-03 15:15 UTC (permalink / raw)
  To: Yuchung Cheng
  Cc: netdev, Kernel Team, Blake Matheny, Alexei Starovoitov,
	Neal Cardwell, Steve Ibanez, Eric Dumazet

On 7/2/18, 4:50 PM, "Yuchung Cheng" <ycheng@google.com> wrote:

    On Mon, Jul 2, 2018 at 2:39 PM, Lawrence Brakmo <brakmo@fb.com> wrote:
    >
    > DCTCP depends on the CA_EVENT_NON_DELAYED_ACK and CA_EVENT_DELAYED_ACK
    > notifications to keep track if it needs to send an ACK for packets that
    > were received with a particular ECN state but whose ACK was delayed.
    >
    > Under some circumstances, for example when a delayed ACK is sent with a
    > data packet, DCTCP state was not being updated due to a lack of
    > notification that the previously delayed ACK was sent. As a result, it
    > would sometimes send a duplicate ACK when a new data packet arrived.
    >
    > This patch ensures that DCTCP's state is correctly updated so it will
    > not send the duplicate ACK.
    Sorry to chime-in late here (lame excuse: IETF deadline)
    
    IIRC this issue would exist prior to 4.11 kernel. While it'd be good
    to fix that, it's not clear which patch introduced the regression
    between 4.11 and 4.16? I assume you tested Eric's most recent quickack
    fix.
    
    In terms of the fix itself, it seems odd the tcp_send_ack() call in
    DCTCP generates NON_DELAYED_ACK event to toggle DCTCP's
    delayed_ack_reserved bit. Shouldn't the fix be to have DCTCP send the
    "prior" ACK w/o cancelling delayed-ACK and mis-recording that it's
    cancelled, because that prior-ACK really is a supplementary old ACK.
    
    But it's still unclear how this bug introduces the regression 4.11 - 4.16
   
Feedback is always appreciated! This issue is also present in 4.11 (that is where I discovered it). I think the bug was introduced much earlier.

Yes, I tested with Eric's quickack fix; it did not fix either of the two issues that are fixed with this patch set.

As I mentioned earlier, the bug was introduced before 4.11. I am not sure I understand your comments. Yes, at some level it would make sense to change the delayed_ack_reserved bit directly, but we would still need to do it whenever we send the ACK, so I do not think it can be helped. Please clarify if I misunderstood your comment.
    
    >
    > Improved based on comments from Neal Cardwell <ncardwell@google.com>.
    >
    > Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
    > ---
    >  net/ipv4/tcp_output.c | 4 ++--
    >  1 file changed, 2 insertions(+), 2 deletions(-)
    >
    > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
    > index f8f6129160dd..acefb64e8280 100644
    > --- a/net/ipv4/tcp_output.c
    > +++ b/net/ipv4/tcp_output.c
    > @@ -172,6 +172,8 @@ static inline void tcp_event_ack_sent(struct sock *sk, unsigned int pkts)
    >                         __sock_put(sk);
    >         }
    >         tcp_dec_quickack_mode(sk, pkts);
    > +       if (inet_csk_ack_scheduled(sk))
    > +               tcp_ca_event(sk, CA_EVENT_NON_DELAYED_ACK);
    >         inet_csk_clear_xmit_timer(sk, ICSK_TIME_DACK);
    >  }
    >
    > @@ -3567,8 +3569,6 @@ void tcp_send_ack(struct sock *sk)
    >         if (sk->sk_state == TCP_CLOSE)
    >                 return;
    >
    > -       tcp_ca_event(sk, CA_EVENT_NON_DELAYED_ACK);
    > -
    >         /* We are not putting this on the write queue, so
    >          * tcp_transmit_skb() will set the ownership to this
    >          * sock.
    > --
    > 2.17.1
    >
    

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH net-next v2 0/2] tcp: fix high tail latencies in DCTCP
  2018-07-03 15:10   ` Lawrence Brakmo
@ 2018-07-03 15:25     ` Neal Cardwell
  0 siblings, 0 replies; 10+ messages in thread
From: Neal Cardwell @ 2018-07-03 15:25 UTC (permalink / raw)
  To: Lawrence Brakmo
  Cc: Netdev, Kernel Team, bmatheny, ast, Yuchung Cheng, Steve Ibanez,
	Eric Dumazet

On Tue, Jul 3, 2018 at 11:10 AM Lawrence Brakmo <brakmo@fb.com> wrote:
>
> On 7/2/18, 5:52 PM, "netdev-owner@vger.kernel.org on behalf of Neal Cardwell" <netdev-owner@vger.kernel.org on behalf of ncardwell@google.com> wrote:
>
>     On Mon, Jul 2, 2018 at 5:39 PM Lawrence Brakmo <brakmo@fb.com> wrote:
>     >
>     > We have observed high tail latencies when using DCTCP for RPCs as
>     > compared to using Cubic. For example, in one setup there are 2 hosts
>     > sending to a 3rd one, with each sender having 3 flows (1 stream,
>     > 1 1MB back-to-back RPCs and 1 10KB back-to-back RPCs). The following
>     > table shows the 99% and 99.9% latencies for both Cubic and dctcp:
>     >
>     >            Cubic 99%  Cubic 99.9%   dctcp 99%    dctcp 99.9%
>     >  1MB RPCs    2.6ms       5.5ms         43ms          208ms
>     > 10KB RPCs    1.1ms       1.3ms         53ms          212ms
>     >
>     > Looking at tcpdump traces showed that there are two causes for the
>     > latency.
>     >
>     >   1) RTOs caused by the receiver sending a dup ACK and not ACKing
>     >      the last (and only) packet sent.
>     >   2) Delaying ACKs when the sender has a cwnd of 1, so everything
>     >      pauses for the duration of the delayed ACK.
>     >
>     > The first patch fixes the cause of the dup ACKs, not updating DCTCP
>     > state when an ACK that was initially delayed has been sent with a
>     > data packet.
>     >
>     > The second patch ensures that an ACK is sent immediately when a
>     > CWR marked packet arrives.
>     >
>     > With the patches the latencies for DCTCP now look like:
>     >
>     >            dctcp 99%  dctcp 99.9%
>     >  1MB RPCs    5.8ms       6.9ms
>     > 10KB RPCs    146us       203us
>     >
>     > Note that while the 1MB RPCs tail latencies are higher than Cubic's,
>     > the 10KB latencies are much smaller than Cubic's. These patches fix
>     > issues on the receiver, but tcpdump traces indicate there is an
>     > opportunity to also fix an issue at the sender that adds about 3ms
>     > to the tail latencies.
>     >
>     > The following trace shows the issue that triggers an RTO (fixed by these patches):
>     >
>     >    Host A sends the last packets of the request
>     >    Host B receives them, and the last packet is marked with congestion (CE)
>     >    Host B sends ACKs for packets not marked with congestion
>     >    Host B sends data packet with reply and ACK for packet marked with
>     >           congestion (TCP flag ECE)
>     >    Host A receives ACKs with no ECE flag
>     >    Host A receives data packet with ACK for the last packet of request
>     >           and which has TCP ECE bit set
>     >    Host A sends 1st data packet of the next request with TCP flag CWR
>     >    Host B receives the packet (as seen in tcpdump at B), no CE flag
>     >    Host B sends a dup ACK that also has the TCP ECE flag
>     >    Host A RTO timer fires!
>     >    Host A retransmits the packet
>     >    Host A receives an ACK for everything it has sent (i.e. Host B
>     >           did receive 1st packet of request)
>     >    Host A sends more packets…
>     >
>     > [PATCH net-next v2 1/2] tcp: notify when a delayed ack is sent
>     > [PATCH net-next v2 2/2] tcp: ack immediately when a cwr packet
>     >
>     >  net/ipv4/tcp_input.c  | 16 +++++++++++-----
>     >  net/ipv4/tcp_output.c |  4 ++--
>     >  2 files changed, 13 insertions(+), 7 deletions(-)
>
>     Thanks, Larry. Just for context, can you please let us know whether
>     your tests included zero, one, or both of Eric's recent commits
>     (listed below) that tuned the number of ACKs after ECN events? (Or
>     maybe the tests were literally using a net-next kernel?) Just wanted
>     to get a better handle on any possible interactions there.
>
>     Thanks!
>
>     neal
>
> Yes, my test kernel includes both patches listed below.

OK, great.

> BTW, I will send a new
> patch where I move the call to tcp_incr_quickack from tcp_ecn_check_ce to tcp_ecn_accept_cwr.

OK, sounds good to me.

thanks,
neal

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH net-next v2 1/2] tcp: notify when a delayed ack is sent
  2018-07-03 13:14     ` Neal Cardwell
@ 2018-07-03 18:38       ` Lawrence Brakmo
  0 siblings, 0 replies; 10+ messages in thread
From: Lawrence Brakmo @ 2018-07-03 18:38 UTC (permalink / raw)
  To: Neal Cardwell, Yuchung Cheng
  Cc: Netdev, Kernel Team, Blake Matheny, Alexei Starovoitov,
	Steve Ibanez, Eric Dumazet, Daniel Borkmann

On 7/3/18, 6:15 AM, "Neal Cardwell" <ncardwell@google.com> wrote:

    On Mon, Jul 2, 2018 at 7:49 PM Yuchung Cheng <ycheng@google.com> wrote:
    >
    > On Mon, Jul 2, 2018 at 2:39 PM, Lawrence Brakmo <brakmo@fb.com> wrote:
    > >
    > > DCTCP depends on the CA_EVENT_NON_DELAYED_ACK and CA_EVENT_DELAYED_ACK
    > > notifications to keep track if it needs to send an ACK for packets that
    > > were received with a particular ECN state but whose ACK was delayed.
    > >
    > > Under some circumstances, for example when a delayed ACK is sent with a
    > > data packet, DCTCP state was not being updated due to a lack of
    > > notification that the previously delayed ACK was sent. As a result, it
    > > would sometimes send a duplicate ACK when a new data packet arrived.
    > >
    > > This patch ensures that DCTCP's state is correctly updated so it will
    > > not send the duplicate ACK.
    > Sorry to chime-in late here (lame excuse: IETF deadline)
    >
    > IIRC this issue would exist prior to 4.11 kernel. While it'd be good
    > to fix that, it's not clear which patch introduced the regression
    > between 4.11 and 4.16? I assume you tested Eric's most recent quickack
    > fix.
    
    I don't think Larry is saying there's a regression between 4.11 and
    4.16. His recent "tcp: force cwnd at least 2 in tcp_cwnd_reduction"
    patch here:
    
      https://patchwork.ozlabs.org/patch/935233/
    
    said that 4.11 was bad (high tail latency and lots of RTOs) and 4.16
    was still bad but with different netstat counters (no RTOs but still
    high tail latency):
    
    """
    On 4.11, pcap traces indicate that in some instances the 1st packet of
    the RPC is received but no ACK is sent before the packet is
    retransmitted. On 4.11 netstat shows TCP timeouts, with some of them
    spurious.
    
    On 4.16, we don't see retransmits in netstat but the high tail latencies
    are still there.
    """
    
    I suspect the RTOs disappeared but latencies remained too high because
    between 4.11 and 4.16 we introduced:
      tcp: allow TLP in ECN CWR
      https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=b4f70c3d4ec32a2ff4c62e1e2da0da5f55fe12bd
    
    So the RTOs probably disappeared because this commit turned them into
    TLPs. But the latencies remained high because the fundamental bug
    remained throughout 4.11 and 4.16 and today: the DCTCP use of
    tcp_send_ack() with an old rcv_nxt caused delayed ACKs to be cancelled
    when they should not have been.
    
    > In terms of the fix itself, it seems odd the tcp_send_ack() call in
    > DCTCP generates NON_DELAYED_ACK event to toggle DCTCP's
    > delayed_ack_reserved bit. Shouldn't the fix be to have DCTCP send the
    > "prior" ACK w/o cancelling delayed-ACK and mis-recording that it's
    > cancelled, because that prior-ACK really is a supplementary old ACK.
    
    This patch is fixing an issue that's orthogonal to the one you are
    talking about. Using the taxonomy from our team's internal discussion
    yesterday, the issue you mention where the DCTCP "prior" ACK is
    cancelling delayed ACKs is issue (4); the issue that this particular
    "tcp: notify when a delayed ack is sent" patch from Larry fixes is
    issue (3). It's a bit tricky because both issues appeared in Larry's
    trace summary and packetdrill script to reproduce the issue.
    
    neal

I was able to track down the patch that introduced the problem:

    commit 3759824da87b30ce7a35b4873b62b0ba38905ef5
    Author: Yuchung Cheng <ycheng@google.com>
    Date:   Wed Jul 1 14:11:15 2015 -0700

        tcp: PRR uses CRB mode by default and SS mode conditionally

I tested a kernel which reverted the relevant change (see diff below) and the high tail latencies of more than 40ms disappeared. However, the 10KB high percentile latencies are 4ms vs. less than 200us with my patches. It looks like the patch above ended up reducing the cwnd to 1 in the scenarios that were triggering the high tail latencies. That is, it increased the likelihood of triggering actual bugs in the network stack code that my patch set fixes.


diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 2c5d70bc294e..50fabb07d739 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2468,13 +2468,14 @@ void tcp_cwnd_reduction(struct sock *sk, int newly_acked_sacked, int flag)
                u64 dividend = (u64)tp->snd_ssthresh * tp->prr_delivered +
                               tp->prior_cwnd - 1;
                sndcnt = div_u64(dividend, tp->prior_cwnd) - tp->prr_out;
-       } else if ((flag & FLAG_RETRANS_DATA_ACKED) &&
-                  !(flag & FLAG_LOST_RETRANS)) {
+       } else {
+//     } else if ((flag & FLAG_RETRANS_DATA_ACKED) &&
+//                !(flag & FLAG_LOST_RETRANS)) {
                sndcnt = min_t(int, delta,
                               max_t(int, tp->prr_delivered - tp->prr_out,
                                     newly_acked_sacked) + 1);
-       } else {
-               sndcnt = min(delta, newly_acked_sacked);
+//     } else {
+//             sndcnt = min(delta, newly_acked_sacked);
        }
        /* Force a fast retransmit upon entering fast recovery */
        sndcnt = max(sndcnt, (tp->prr_out ? 0 : 1));

    

^ permalink raw reply related	[flat|nested] 10+ messages in thread

Thread overview: 10+ messages
2018-07-02 21:39 [PATCH net-next v2 0/2] tcp: fix high tail latencies in DCTCP Lawrence Brakmo
2018-07-02 21:39 ` [PATCH net-next v2 1/2] tcp: notify when a delayed ack is sent Lawrence Brakmo
2018-07-02 23:48   ` Yuchung Cheng
2018-07-03 13:14     ` Neal Cardwell
2018-07-03 18:38       ` Lawrence Brakmo
2018-07-03 15:15     ` Lawrence Brakmo
2018-07-02 21:39 ` [PATCH net-next v2 2/2] tcp: ack immediately when a cwr packet arrives Lawrence Brakmo
2018-07-03  0:51 ` [PATCH net-next v2 0/2] tcp: fix high tail latencies in DCTCP Neal Cardwell
2018-07-03 15:10   ` Lawrence Brakmo
2018-07-03 15:25     ` Neal Cardwell
