* [PATCH net-next 0/3] tcp: take a bit more care of backlog stress
@ 2018-11-21 17:52 Eric Dumazet
  2018-11-21 17:52 ` [PATCH net-next 1/3] tcp: remove hdrlen argument from tcp_queue_rcv() Eric Dumazet
                   ` (3 more replies)
  0 siblings, 4 replies; 27+ messages in thread
From: Eric Dumazet @ 2018-11-21 17:52 UTC (permalink / raw)
  To: David S . Miller
  Cc: netdev, Jean-Louis Dupond, Neal Cardwell, Yuchung Cheng,
	Eric Dumazet, Eric Dumazet

While working on the SACK compression issue Jean-Louis Dupond
reported, we found that his linux box was suffering very hard
from tail drops on the socket backlog queue, because the opposite
TCP stack was not implementing the latest RFC recommendations.

First patch is a cleanup.

Second patch attempts coalescing when a new packet must
be added to the backlog queue. Cooking bigger skbs helps
keep the backlog list smaller and speeds up its handling when
the user thread finally releases the socket lock.

Third patch implements head drops as a last resort.
Head drops are generally better for optimal TCP behavior.
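
For context, a simplified sketch (not part of this series) of the
receive path these patches touch, roughly following tcp_v4_rcv():

	bh_lock_sock_nested(sk);
	ret = 0;
	if (!sock_owned_by_user(sk)) {
		/* Socket lock not held by a user thread:
		 * process the segment right away.
		 */
		ret = tcp_v4_do_rcv(sk, skb);
	} else if (tcp_add_backlog(sk, skb)) {
		/* Owner is busy (e.g. in __release_sock()):
		 * queue to the backlog, or drop if it is full.
		 */
		goto discard_and_relse;
	}
	bh_unlock_sock(sk);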

Eric Dumazet (3):
  tcp: remove hdrlen argument from tcp_queue_rcv()
  tcp: implement coalescing on backlog queue
  tcp: implement head drops in backlog queue

 include/uapi/linux/snmp.h |  1 +
 net/ipv4/proc.c           |  1 +
 net/ipv4/tcp_input.c      | 13 +++---
 net/ipv4/tcp_ipv4.c       | 89 ++++++++++++++++++++++++++++++++++++---
 4 files changed, 91 insertions(+), 13 deletions(-)

-- 
2.19.1.1215.g8438c0b245-goog

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH net-next 1/3] tcp: remove hdrlen argument from tcp_queue_rcv()
  2018-11-21 17:52 [PATCH net-next 0/3] tcp: take a bit more care of backlog stress Eric Dumazet
@ 2018-11-21 17:52 ` Eric Dumazet
  2018-11-21 22:41   ` Yuchung Cheng
  2018-11-21 17:52 ` [PATCH net-next 2/3] tcp: implement coalescing on backlog queue Eric Dumazet
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 27+ messages in thread
From: Eric Dumazet @ 2018-11-21 17:52 UTC (permalink / raw)
  To: David S . Miller
  Cc: netdev, Jean-Louis Dupond, Neal Cardwell, Yuchung Cheng,
	Eric Dumazet, Eric Dumazet

Only one caller needs to pull TCP headers, so let's
move __skb_pull() to the caller side.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/ipv4/tcp_input.c | 13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index edaaebfbcd4693ef6689f3f4c71733b3888c7c2c..e0ad7d3825b5945049c099171ce36e5c8bb9ba99 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4602,13 +4602,12 @@ static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
 	}
 }
 
-static int __must_check tcp_queue_rcv(struct sock *sk, struct sk_buff *skb, int hdrlen,
-		  bool *fragstolen)
+static int __must_check tcp_queue_rcv(struct sock *sk, struct sk_buff *skb,
+				      bool *fragstolen)
 {
 	int eaten;
 	struct sk_buff *tail = skb_peek_tail(&sk->sk_receive_queue);
 
-	__skb_pull(skb, hdrlen);
 	eaten = (tail &&
 		 tcp_try_coalesce(sk, tail,
 				  skb, fragstolen)) ? 1 : 0;
@@ -4659,7 +4658,7 @@ int tcp_send_rcvq(struct sock *sk, struct msghdr *msg, size_t size)
 	TCP_SKB_CB(skb)->end_seq = TCP_SKB_CB(skb)->seq + size;
 	TCP_SKB_CB(skb)->ack_seq = tcp_sk(sk)->snd_una - 1;
 
-	if (tcp_queue_rcv(sk, skb, 0, &fragstolen)) {
+	if (tcp_queue_rcv(sk, skb, &fragstolen)) {
 		WARN_ON_ONCE(fragstolen); /* should not happen */
 		__kfree_skb(skb);
 	}
@@ -4719,7 +4718,7 @@ static void tcp_data_queue(struct sock *sk, struct sk_buff *skb)
 			goto drop;
 		}
 
-		eaten = tcp_queue_rcv(sk, skb, 0, &fragstolen);
+		eaten = tcp_queue_rcv(sk, skb, &fragstolen);
 		if (skb->len)
 			tcp_event_data_recv(sk, skb);
 		if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN)
@@ -5585,8 +5584,8 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb)
 			NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPHPHITS);
 
 			/* Bulk data transfer: receiver */
-			eaten = tcp_queue_rcv(sk, skb, tcp_header_len,
-					      &fragstolen);
+			__skb_pull(skb, tcp_header_len);
+			eaten = tcp_queue_rcv(sk, skb, &fragstolen);
 
 			tcp_event_data_recv(sk, skb);
 
-- 
2.19.1.1215.g8438c0b245-goog

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH net-next 2/3] tcp: implement coalescing on backlog queue
  2018-11-21 17:52 [PATCH net-next 0/3] tcp: take a bit more care of backlog stress Eric Dumazet
  2018-11-21 17:52 ` [PATCH net-next 1/3] tcp: remove hdrlen argument from tcp_queue_rcv() Eric Dumazet
@ 2018-11-21 17:52 ` Eric Dumazet
  2018-11-21 22:31   ` Yuchung Cheng
  2018-11-22 18:01   ` Neal Cardwell
  2018-11-21 17:52 ` [PATCH net-next 3/3] tcp: implement head drops in " Eric Dumazet
  2018-11-23 19:25 ` [PATCH net-next 0/3] tcp: take a bit more care of backlog stress David Miller
  3 siblings, 2 replies; 27+ messages in thread
From: Eric Dumazet @ 2018-11-21 17:52 UTC (permalink / raw)
  To: David S . Miller
  Cc: netdev, Jean-Louis Dupond, Neal Cardwell, Yuchung Cheng,
	Eric Dumazet, Eric Dumazet

In case GRO is not as efficient as it should be or is disabled,
we might have a user thread trapped in __release_sock() while
the softirq handler floods packets up to the point where we have to drop.

This patch balances work done from the user thread and softirq,
to give __release_sock() more chances to complete its work.

This also helps if we receive many ACK packets, since GRO
does not aggregate them.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Tested-by: Jean-Louis Dupond <jean-louis@dupond.be>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
---
 include/uapi/linux/snmp.h |  1 +
 net/ipv4/proc.c           |  1 +
 net/ipv4/tcp_ipv4.c       | 75 +++++++++++++++++++++++++++++++++++----
 3 files changed, 71 insertions(+), 6 deletions(-)

diff --git a/include/uapi/linux/snmp.h b/include/uapi/linux/snmp.h
index f80135e5feaa886000009db6dff75b2bc2d637b2..86dc24a96c90ab047d5173d625450facd6c6dd79 100644
--- a/include/uapi/linux/snmp.h
+++ b/include/uapi/linux/snmp.h
@@ -243,6 +243,7 @@ enum
 	LINUX_MIB_TCPREQQFULLDROP,		/* TCPReqQFullDrop */
 	LINUX_MIB_TCPRETRANSFAIL,		/* TCPRetransFail */
 	LINUX_MIB_TCPRCVCOALESCE,		/* TCPRcvCoalesce */
+	LINUX_MIB_TCPBACKLOGCOALESCE,		/* TCPBacklogCoalesce */
 	LINUX_MIB_TCPOFOQUEUE,			/* TCPOFOQueue */
 	LINUX_MIB_TCPOFODROP,			/* TCPOFODrop */
 	LINUX_MIB_TCPOFOMERGE,			/* TCPOFOMerge */
diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
index 70289682a6701438aed99a00a9705c39fa4394d3..c3610b37bb4ce665b1976d8cc907b6dd0de42ab9 100644
--- a/net/ipv4/proc.c
+++ b/net/ipv4/proc.c
@@ -219,6 +219,7 @@ static const struct snmp_mib snmp4_net_list[] = {
 	SNMP_MIB_ITEM("TCPRenoRecoveryFail", LINUX_MIB_TCPRENORECOVERYFAIL),
 	SNMP_MIB_ITEM("TCPSackRecoveryFail", LINUX_MIB_TCPSACKRECOVERYFAIL),
 	SNMP_MIB_ITEM("TCPRcvCollapsed", LINUX_MIB_TCPRCVCOLLAPSED),
+	SNMP_MIB_ITEM("TCPBacklogCoalesce", LINUX_MIB_TCPBACKLOGCOALESCE),
 	SNMP_MIB_ITEM("TCPDSACKOldSent", LINUX_MIB_TCPDSACKOLDSENT),
 	SNMP_MIB_ITEM("TCPDSACKOfoSent", LINUX_MIB_TCPDSACKOFOSENT),
 	SNMP_MIB_ITEM("TCPDSACKRecv", LINUX_MIB_TCPDSACKRECV),
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 795605a2327504b8a025405826e7e0ca8dc8501d..401e1d1cb904a4c7963d8baa419cfbf178593344 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1619,12 +1619,10 @@ int tcp_v4_early_demux(struct sk_buff *skb)
 bool tcp_add_backlog(struct sock *sk, struct sk_buff *skb)
 {
 	u32 limit = sk->sk_rcvbuf + sk->sk_sndbuf;
-
-	/* Only socket owner can try to collapse/prune rx queues
-	 * to reduce memory overhead, so add a little headroom here.
-	 * Few sockets backlog are possibly concurrently non empty.
-	 */
-	limit += 64*1024;
+	struct skb_shared_info *shinfo;
+	const struct tcphdr *th;
+	struct sk_buff *tail;
+	unsigned int hdrlen;
 
 	/* In case all data was pulled from skb frags (in __pskb_pull_tail()),
 	 * we can fix skb->truesize to its real value to avoid future drops.
@@ -1636,6 +1634,71 @@ bool tcp_add_backlog(struct sock *sk, struct sk_buff *skb)
 
 	skb_dst_drop(skb);
 
+	if (unlikely(tcp_checksum_complete(skb))) {
+		bh_unlock_sock(sk);
+		__TCP_INC_STATS(sock_net(sk), TCP_MIB_CSUMERRORS);
+		__TCP_INC_STATS(sock_net(sk), TCP_MIB_INERRS);
+		return true;
+	}
+
+	/* Attempt coalescing to last skb in backlog, even if we are
+	 * above the limits.
+	 * This is okay because skb capacity is limited to MAX_SKB_FRAGS.
+	 */
+	th = (const struct tcphdr *)skb->data;
+	hdrlen = th->doff * 4;
+	shinfo = skb_shinfo(skb);
+
+	if (!shinfo->gso_size)
+		shinfo->gso_size = skb->len - hdrlen;
+
+	if (!shinfo->gso_segs)
+		shinfo->gso_segs = 1;
+
+	tail = sk->sk_backlog.tail;
+	if (tail &&
+	    TCP_SKB_CB(tail)->end_seq == TCP_SKB_CB(skb)->seq &&
+#ifdef CONFIG_TLS_DEVICE
+	    tail->decrypted == skb->decrypted &&
+#endif
+	    !memcmp(tail->data + sizeof(*th), skb->data + sizeof(*th),
+		    hdrlen - sizeof(*th))) {
+		bool fragstolen;
+		int delta;
+
+		__skb_pull(skb, hdrlen);
+		if (skb_try_coalesce(tail, skb, &fragstolen, &delta)) {
+			TCP_SKB_CB(tail)->end_seq = TCP_SKB_CB(skb)->end_seq;
+			TCP_SKB_CB(tail)->ack_seq = TCP_SKB_CB(skb)->ack_seq;
+			TCP_SKB_CB(tail)->tcp_flags |= TCP_SKB_CB(skb)->tcp_flags;
+
+			if (TCP_SKB_CB(skb)->has_rxtstamp) {
+				TCP_SKB_CB(tail)->has_rxtstamp = true;
+				tail->tstamp = skb->tstamp;
+				skb_hwtstamps(tail)->hwtstamp = skb_hwtstamps(skb)->hwtstamp;
+			}
+
+			/* Not as strict as GRO. We only need to carry mss max value */
+			skb_shinfo(tail)->gso_size = max(shinfo->gso_size,
+							 skb_shinfo(tail)->gso_size);
+
+			skb_shinfo(tail)->gso_segs += shinfo->gso_segs;
+
+			sk->sk_backlog.len += delta;
+			__NET_INC_STATS(sock_net(sk),
+					LINUX_MIB_TCPBACKLOGCOALESCE);
+			kfree_skb_partial(skb, fragstolen);
+			return false;
+		}
+		__skb_push(skb, hdrlen);
+	}
+
+	/* Only socket owner can try to collapse/prune rx queues
+	 * to reduce memory overhead, so add a little headroom here.
+	 * Few sockets backlog are possibly concurrently non empty.
+	 */
+	limit += 64*1024;
+
 	if (unlikely(sk_add_backlog(sk, skb, limit))) {
 		bh_unlock_sock(sk);
 		__NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPBACKLOGDROP);
-- 
2.19.1.1215.g8438c0b245-goog

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH net-next 3/3] tcp: implement head drops in backlog queue
  2018-11-21 17:52 [PATCH net-next 0/3] tcp: take a bit more care of backlog stress Eric Dumazet
  2018-11-21 17:52 ` [PATCH net-next 1/3] tcp: remove hdrlen argument from tcp_queue_rcv() Eric Dumazet
  2018-11-21 17:52 ` [PATCH net-next 2/3] tcp: implement coalescing on backlog queue Eric Dumazet
@ 2018-11-21 17:52 ` Eric Dumazet
  2018-11-21 22:40   ` Yuchung Cheng
  2018-11-23 19:25 ` [PATCH net-next 0/3] tcp: take a bit more care of backlog stress David Miller
  3 siblings, 1 reply; 27+ messages in thread
From: Eric Dumazet @ 2018-11-21 17:52 UTC (permalink / raw)
  To: David S . Miller
  Cc: netdev, Jean-Louis Dupond, Neal Cardwell, Yuchung Cheng,
	Eric Dumazet, Eric Dumazet

Under high stress, and if GRO or coalescing does not help,
we had better make room in the backlog queue to be able to keep
the latest incoming packet.

This generally helps fast recovery, given that we often receive
packets in order.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Tested-by: Jean-Louis Dupond <jean-louis@dupond.be>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
---
 net/ipv4/tcp_ipv4.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 401e1d1cb904a4c7963d8baa419cfbf178593344..36c9d715bf2aa7eb7bf58b045bfeb85a2ec1a696 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1693,6 +1693,20 @@ bool tcp_add_backlog(struct sock *sk, struct sk_buff *skb)
 		__skb_push(skb, hdrlen);
 	}
 
+	while (sk_rcvqueues_full(sk, limit)) {
+		struct sk_buff *head;
+
+		head = sk->sk_backlog.head;
+		if (!head)
+			break;
+		sk->sk_backlog.head = head->next;
+		if (!head->next)
+			sk->sk_backlog.tail = NULL;
+		skb_mark_not_on_list(head);
+		sk->sk_backlog.len -= head->truesize;
+		kfree_skb(head);
+		__NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPBACKLOGDROP);
+	}
 	/* Only socket owner can try to collapse/prune rx queues
 	 * to reduce memory overhead, so add a little headroom here.
 	 * Few sockets backlog are possibly concurrently non empty.
-- 
2.19.1.1215.g8438c0b245-goog

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH net-next 2/3] tcp: implement coalescing on backlog queue
  2018-11-21 17:52 ` [PATCH net-next 2/3] tcp: implement coalescing on backlog queue Eric Dumazet
@ 2018-11-21 22:31   ` Yuchung Cheng
  2018-11-21 22:40     ` Eric Dumazet
  2018-11-22 18:01   ` Neal Cardwell
  1 sibling, 1 reply; 27+ messages in thread
From: Yuchung Cheng @ 2018-11-21 22:31 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, netdev, Jean-Louis Dupond, Neal Cardwell, Eric Dumazet

On Wed, Nov 21, 2018 at 9:52 AM, Eric Dumazet <edumazet@google.com> wrote:
>
> In case GRO is not as efficient as it should be or disabled,
> we might have a user thread trapped in __release_sock() while
> softirq handler flood packets up to the point we have to drop.
>
> This patch balances work done from user thread and softirq,
> to give more chances to __release_sock() to complete its work.
>
> This also helps if we receive many ACK packets, since GRO
> does not aggregate them.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Tested-by: Jean-Louis Dupond <jean-louis@dupond.be>
> Cc: Neal Cardwell <ncardwell@google.com>
> Cc: Yuchung Cheng <ycheng@google.com>
> ---
>  include/uapi/linux/snmp.h |  1 +
>  net/ipv4/proc.c           |  1 +
>  net/ipv4/tcp_ipv4.c       | 75 +++++++++++++++++++++++++++++++++++----
>  3 files changed, 71 insertions(+), 6 deletions(-)
>
> diff --git a/include/uapi/linux/snmp.h b/include/uapi/linux/snmp.h
> index f80135e5feaa886000009db6dff75b2bc2d637b2..86dc24a96c90ab047d5173d625450facd6c6dd79 100644
> --- a/include/uapi/linux/snmp.h
> +++ b/include/uapi/linux/snmp.h
> @@ -243,6 +243,7 @@ enum
>         LINUX_MIB_TCPREQQFULLDROP,              /* TCPReqQFullDrop */
>         LINUX_MIB_TCPRETRANSFAIL,               /* TCPRetransFail */
>         LINUX_MIB_TCPRCVCOALESCE,               /* TCPRcvCoalesce */
> +       LINUX_MIB_TCPBACKLOGCOALESCE,           /* TCPBacklogCoalesce */
>         LINUX_MIB_TCPOFOQUEUE,                  /* TCPOFOQueue */
>         LINUX_MIB_TCPOFODROP,                   /* TCPOFODrop */
>         LINUX_MIB_TCPOFOMERGE,                  /* TCPOFOMerge */
> diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
> index 70289682a6701438aed99a00a9705c39fa4394d3..c3610b37bb4ce665b1976d8cc907b6dd0de42ab9 100644
> --- a/net/ipv4/proc.c
> +++ b/net/ipv4/proc.c
> @@ -219,6 +219,7 @@ static const struct snmp_mib snmp4_net_list[] = {
>         SNMP_MIB_ITEM("TCPRenoRecoveryFail", LINUX_MIB_TCPRENORECOVERYFAIL),
>         SNMP_MIB_ITEM("TCPSackRecoveryFail", LINUX_MIB_TCPSACKRECOVERYFAIL),
>         SNMP_MIB_ITEM("TCPRcvCollapsed", LINUX_MIB_TCPRCVCOLLAPSED),
> +       SNMP_MIB_ITEM("TCPBacklogCoalesce", LINUX_MIB_TCPBACKLOGCOALESCE),
>         SNMP_MIB_ITEM("TCPDSACKOldSent", LINUX_MIB_TCPDSACKOLDSENT),
>         SNMP_MIB_ITEM("TCPDSACKOfoSent", LINUX_MIB_TCPDSACKOFOSENT),
>         SNMP_MIB_ITEM("TCPDSACKRecv", LINUX_MIB_TCPDSACKRECV),
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index 795605a2327504b8a025405826e7e0ca8dc8501d..401e1d1cb904a4c7963d8baa419cfbf178593344 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -1619,12 +1619,10 @@ int tcp_v4_early_demux(struct sk_buff *skb)
>  bool tcp_add_backlog(struct sock *sk, struct sk_buff *skb)
>  {
>         u32 limit = sk->sk_rcvbuf + sk->sk_sndbuf;
> -
> -       /* Only socket owner can try to collapse/prune rx queues
> -        * to reduce memory overhead, so add a little headroom here.
> -        * Few sockets backlog are possibly concurrently non empty.
> -        */
> -       limit += 64*1024;
> +       struct skb_shared_info *shinfo;
> +       const struct tcphdr *th;
> +       struct sk_buff *tail;
> +       unsigned int hdrlen;
>
>         /* In case all data was pulled from skb frags (in __pskb_pull_tail()),
>          * we can fix skb->truesize to its real value to avoid future drops.
> @@ -1636,6 +1634,71 @@ bool tcp_add_backlog(struct sock *sk, struct sk_buff *skb)
>
>         skb_dst_drop(skb);
>
> +       if (unlikely(tcp_checksum_complete(skb))) {
> +               bh_unlock_sock(sk);
> +               __TCP_INC_STATS(sock_net(sk), TCP_MIB_CSUMERRORS);
> +               __TCP_INC_STATS(sock_net(sk), TCP_MIB_INERRS);
> +               return true;
> +       }
> +
> +       /* Attempt coalescing to last skb in backlog, even if we are
> +        * above the limits.
> +        * This is okay because skb capacity is limited to MAX_SKB_FRAGS.
> +        */
> +       th = (const struct tcphdr *)skb->data;
> +       hdrlen = th->doff * 4;
> +       shinfo = skb_shinfo(skb);
> +
> +       if (!shinfo->gso_size)
> +               shinfo->gso_size = skb->len - hdrlen;
> +
> +       if (!shinfo->gso_segs)
> +               shinfo->gso_segs = 1;
> +
> +       tail = sk->sk_backlog.tail;
> +       if (tail &&
> +           TCP_SKB_CB(tail)->end_seq == TCP_SKB_CB(skb)->seq &&
> +#ifdef CONFIG_TLS_DEVICE
> +           tail->decrypted == skb->decrypted &&
> +#endif
> +           !memcmp(tail->data + sizeof(*th), skb->data + sizeof(*th),
> +                   hdrlen - sizeof(*th))) {
> +               bool fragstolen;
> +               int delta;
> +
> +               __skb_pull(skb, hdrlen);
> +               if (skb_try_coalesce(tail, skb, &fragstolen, &delta)) {
> +                       TCP_SKB_CB(tail)->end_seq = TCP_SKB_CB(skb)->end_seq;
> +                       TCP_SKB_CB(tail)->ack_seq = TCP_SKB_CB(skb)->ack_seq;
> +                       TCP_SKB_CB(tail)->tcp_flags |= TCP_SKB_CB(skb)->tcp_flags;
> +
> +                       if (TCP_SKB_CB(skb)->has_rxtstamp) {
> +                               TCP_SKB_CB(tail)->has_rxtstamp = true;
> +                               tail->tstamp = skb->tstamp;
> +                               skb_hwtstamps(tail)->hwtstamp = skb_hwtstamps(skb)->hwtstamp;
> +                       }
> +
Really nice! would it make sense to re-use (some of) the similar
tcp_try_coalesce()?

> +                       /* Not as strict as GRO. We only need to carry mss max value */
> +                       skb_shinfo(tail)->gso_size = max(shinfo->gso_size,
> +                                                        skb_shinfo(tail)->gso_size);
> +
> +                       skb_shinfo(tail)->gso_segs += shinfo->gso_segs;
> +
> +                       sk->sk_backlog.len += delta;
> +                       __NET_INC_STATS(sock_net(sk),
> +                                       LINUX_MIB_TCPBACKLOGCOALESCE);
> +                       kfree_skb_partial(skb, fragstolen);
> +                       return false;
> +               }
> +               __skb_push(skb, hdrlen);
> +       }
> +
> +       /* Only socket owner can try to collapse/prune rx queues
> +        * to reduce memory overhead, so add a little headroom here.
> +        * Few sockets backlog are possibly concurrently non empty.
> +        */
> +       limit += 64*1024;
> +
>         if (unlikely(sk_add_backlog(sk, skb, limit))) {
>                 bh_unlock_sock(sk);
>                 __NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPBACKLOGDROP);
> --
> 2.19.1.1215.g8438c0b245-goog
>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH net-next 2/3] tcp: implement coalescing on backlog queue
  2018-11-21 22:31   ` Yuchung Cheng
@ 2018-11-21 22:40     ` Eric Dumazet
  2018-11-22 16:34       ` Yuchung Cheng
  0 siblings, 1 reply; 27+ messages in thread
From: Eric Dumazet @ 2018-11-21 22:40 UTC (permalink / raw)
  To: Yuchung Cheng, Eric Dumazet
  Cc: David S . Miller, netdev, Jean-Louis Dupond, Neal Cardwell



On 11/21/2018 02:31 PM, Yuchung Cheng wrote:
> On Wed, Nov 21, 2018 at 9:52 AM, Eric Dumazet <edumazet@google.com> wrote:

>> +
> Really nice! would it make sense to re-use (some of) the similar
> tcp_try_coalesce()?
>

Maybe, but it is a bit complex, since skbs in receive queues (regular or out of order)
are accounted differently (they have skb->destructor set)

Also they had the TCP header pulled already, while the backlog coalescing also has
to make sure TCP options match.

Not sure if we want to add extra parameters and conditional checks...

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH net-next 3/3] tcp: implement head drops in backlog queue
  2018-11-21 17:52 ` [PATCH net-next 3/3] tcp: implement head drops in " Eric Dumazet
@ 2018-11-21 22:40   ` Yuchung Cheng
  2018-11-21 22:47     ` Eric Dumazet
  0 siblings, 1 reply; 27+ messages in thread
From: Yuchung Cheng @ 2018-11-21 22:40 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, netdev, Jean-Louis Dupond, Neal Cardwell, Eric Dumazet

On Wed, Nov 21, 2018 at 9:52 AM, Eric Dumazet <edumazet@google.com> wrote:
> Under high stress, and if GRO or coalescing does not help,
> we better make room in backlog queue to be able to keep latest
> packet coming.
>
> This generally helps fast recovery, given that we often receive
> packets in order.

I like the benefit of fast recovery but I am a bit leery about head
drops causing HoLB on large reads, while tail drops can be repaired by
RACK and TLP already. Hmm -

>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Tested-by: Jean-Louis Dupond <jean-louis@dupond.be>
> Cc: Neal Cardwell <ncardwell@google.com>
> Cc: Yuchung Cheng <ycheng@google.com>
> ---
>  net/ipv4/tcp_ipv4.c | 14 ++++++++++++++
>  1 file changed, 14 insertions(+)
>
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index 401e1d1cb904a4c7963d8baa419cfbf178593344..36c9d715bf2aa7eb7bf58b045bfeb85a2ec1a696 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -1693,6 +1693,20 @@ bool tcp_add_backlog(struct sock *sk, struct sk_buff *skb)
>                 __skb_push(skb, hdrlen);
>         }
>
> +       while (sk_rcvqueues_full(sk, limit)) {
> +               struct sk_buff *head;
> +
> +               head = sk->sk_backlog.head;
> +               if (!head)
> +                       break;
> +               sk->sk_backlog.head = head->next;
> +               if (!head->next)
> +                       sk->sk_backlog.tail = NULL;
> +               skb_mark_not_on_list(head);
> +               sk->sk_backlog.len -= head->truesize;
> +               kfree_skb(head);
> +               __NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPBACKLOGDROP);
> +       }
>         /* Only socket owner can try to collapse/prune rx queues
>          * to reduce memory overhead, so add a little headroom here.
>          * Few sockets backlog are possibly concurrently non empty.
> --
> 2.19.1.1215.g8438c0b245-goog
>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH net-next 1/3] tcp: remove hdrlen argument from tcp_queue_rcv()
  2018-11-21 17:52 ` [PATCH net-next 1/3] tcp: remove hdrlen argument from tcp_queue_rcv() Eric Dumazet
@ 2018-11-21 22:41   ` Yuchung Cheng
  0 siblings, 0 replies; 27+ messages in thread
From: Yuchung Cheng @ 2018-11-21 22:41 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, netdev, Jean-Louis Dupond, Neal Cardwell, Eric Dumazet

On Wed, Nov 21, 2018 at 9:52 AM, Eric Dumazet <edumazet@google.com> wrote:
> Only one caller needs to pull TCP headers, so lets
> move __skb_pull() to the caller side.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> ---
Acked-by: Yuchung Cheng <ycheng@google.com>

>  net/ipv4/tcp_input.c | 13 ++++++-------
>  1 file changed, 6 insertions(+), 7 deletions(-)
>
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index edaaebfbcd4693ef6689f3f4c71733b3888c7c2c..e0ad7d3825b5945049c099171ce36e5c8bb9ba99 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -4602,13 +4602,12 @@ static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
>         }
>  }
>
> -static int __must_check tcp_queue_rcv(struct sock *sk, struct sk_buff *skb, int hdrlen,
> -                 bool *fragstolen)
> +static int __must_check tcp_queue_rcv(struct sock *sk, struct sk_buff *skb,
> +                                     bool *fragstolen)
>  {
>         int eaten;
>         struct sk_buff *tail = skb_peek_tail(&sk->sk_receive_queue);
>
> -       __skb_pull(skb, hdrlen);
>         eaten = (tail &&
>                  tcp_try_coalesce(sk, tail,
>                                   skb, fragstolen)) ? 1 : 0;
> @@ -4659,7 +4658,7 @@ int tcp_send_rcvq(struct sock *sk, struct msghdr *msg, size_t size)
>         TCP_SKB_CB(skb)->end_seq = TCP_SKB_CB(skb)->seq + size;
>         TCP_SKB_CB(skb)->ack_seq = tcp_sk(sk)->snd_una - 1;
>
> -       if (tcp_queue_rcv(sk, skb, 0, &fragstolen)) {
> +       if (tcp_queue_rcv(sk, skb, &fragstolen)) {
>                 WARN_ON_ONCE(fragstolen); /* should not happen */
>                 __kfree_skb(skb);
>         }
> @@ -4719,7 +4718,7 @@ static void tcp_data_queue(struct sock *sk, struct sk_buff *skb)
>                         goto drop;
>                 }
>
> -               eaten = tcp_queue_rcv(sk, skb, 0, &fragstolen);
> +               eaten = tcp_queue_rcv(sk, skb, &fragstolen);
>                 if (skb->len)
>                         tcp_event_data_recv(sk, skb);
>                 if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN)
> @@ -5585,8 +5584,8 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb)
>                         NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPHPHITS);
>
>                         /* Bulk data transfer: receiver */
> -                       eaten = tcp_queue_rcv(sk, skb, tcp_header_len,
> -                                             &fragstolen);
> +                       __skb_pull(skb, tcp_header_len);
> +                       eaten = tcp_queue_rcv(sk, skb, &fragstolen);
>
>                         tcp_event_data_recv(sk, skb);
>
> --
> 2.19.1.1215.g8438c0b245-goog
>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH net-next 3/3] tcp: implement head drops in backlog queue
  2018-11-21 22:40   ` Yuchung Cheng
@ 2018-11-21 22:47     ` Eric Dumazet
  2018-11-21 23:46       ` Yuchung Cheng
  0 siblings, 1 reply; 27+ messages in thread
From: Eric Dumazet @ 2018-11-21 22:47 UTC (permalink / raw)
  To: Yuchung Cheng, Eric Dumazet
  Cc: David S . Miller, netdev, Jean-Louis Dupond, Neal Cardwell



On 11/21/2018 02:40 PM, Yuchung Cheng wrote:
> On Wed, Nov 21, 2018 at 9:52 AM, Eric Dumazet <edumazet@google.com> wrote:
>> Under high stress, and if GRO or coalescing does not help,
>> we better make room in backlog queue to be able to keep latest
>> packet coming.
>>
>> This generally helps fast recovery, given that we often receive
>> packets in order.
> 
> I like the benefit of fast recovery but I am a bit leery about head
> drop causing HoLB on large read, while tail drops can be repaired by
> RACK and TLP already. Hmm -

This is a very different pattern here.

We have a train of packets coming, and the last packet is not a TLP probe...

Consider this train coming from an old stack without burst control nor pacing.

This patch guarantees the last packet will be processed, and either:

1) We are a receiver, and we will send a SACK. The sender will typically start recovery.

2) We are a sender, and we will process the most recent ACK sent by the receiver.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH net-next 3/3] tcp: implement head drops in backlog queue
  2018-11-21 22:47     ` Eric Dumazet
@ 2018-11-21 23:46       ` Yuchung Cheng
  2018-11-21 23:52         ` Eric Dumazet
  0 siblings, 1 reply; 27+ messages in thread
From: Yuchung Cheng @ 2018-11-21 23:46 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Eric Dumazet, David S . Miller, netdev, Jean-Louis Dupond, Neal Cardwell

On Wed, Nov 21, 2018 at 2:47 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
>
> On 11/21/2018 02:40 PM, Yuchung Cheng wrote:
>> On Wed, Nov 21, 2018 at 9:52 AM, Eric Dumazet <edumazet@google.com> wrote:
>>> Under high stress, and if GRO or coalescing does not help,
>>> we better make room in backlog queue to be able to keep latest
>>> packet coming.
>>>
>>> This generally helps fast recovery, given that we often receive
>>> packets in order.
>>
>> I like the benefit of fast recovery but I am a bit leery about head
>> drop causing HoLB on large read, while tail drops can be repaired by
>> RACK and TLP already. Hmm -
>
> This is very different pattern here.
>
> We have a train of packets coming, the last packet is not a TLP probe...
>
> Consider this train coming from an old stack without burst control nor pacing.
>
> This patch guarantees last packet will be processed, and either :
>
> 1) We are a receiver, we will send a SACK. Sender will typically start recovery
>
> 2) We are a sender, we will process the most recent ACK sent by the receiver.
>

Sure, on the sender side it's universally good.

On the receiver side my scenario was not the last packet being a TLP.
AFAIU the new design will first try to coalesce the incoming skb into
the tail one and then exit. Otherwise it's queued to the back with an
additional 64KB space credit. This patch checks the space w/o the
extra credit and drops the head skb. If the head skb has been
coalesced, we might end up dropping the first big chunk, which may need
a few rounds of fast recovery to repair. But I am likely to
misunderstand the patch :-)

Would it make sense to check the space first before the coalesce
operation, and drop just enough bytes of the head to make room?

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH net-next 3/3] tcp: implement head drops in backlog queue
  2018-11-21 23:46       ` Yuchung Cheng
@ 2018-11-21 23:52         ` Eric Dumazet
  2018-11-22  0:18           ` Eric Dumazet
  0 siblings, 1 reply; 27+ messages in thread
From: Eric Dumazet @ 2018-11-21 23:52 UTC (permalink / raw)
  To: Yuchung Cheng
  Cc: Eric Dumazet, David Miller, netdev, Jean-Louis Dupond, Neal Cardwell

On Wed, Nov 21, 2018 at 3:47 PM Yuchung Cheng <ycheng@google.com> wrote:
>
> On Wed, Nov 21, 2018 at 2:47 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >
> >
> > On 11/21/2018 02:40 PM, Yuchung Cheng wrote:
> >> On Wed, Nov 21, 2018 at 9:52 AM, Eric Dumazet <edumazet@google.com> wrote:
> >>> Under high stress, and if GRO or coalescing does not help,
> >>> we better make room in backlog queue to be able to keep latest
> >>> packet coming.
> >>>
> >>> This generally helps fast recovery, given that we often receive
> >>> packets in order.
> >>
> >> I like the benefit of fast recovery but I am a bit leery about head
> >> drop causing HoLB on large read, while tail drops can be repaired by
> >> RACK and TLP already. Hmm -
> >
> > This is very different pattern here.
> >
> > We have a train of packets coming, the last packet is not a TLP probe...
> >
> > Consider this train coming from an old stack without burst control nor pacing.
> >
> > This patch guarantees last packet will be processed, and either :
> >
> > 1) We are a receiver, we will send a SACK. Sender will typically start recovery
> >
> > 2) We are a sender, we will process the most recent ACK sent by the receiver.
> >
>
> Sure on the sender it's universally good.
>
> On the receiver side my scenario was not the last packet being TLP.
> AFAIU the new design will first try coalesce the incoming skb to the
> tail one then exit. Otherwise it's queued to the back with an
> additional 64KB space credit. This patch checks the space w/o the
> extra credit and drops the head skb. If the head skb has been
> coalesced, we might end dropping the first big chunk that may need a
> few rounds of fast recovery to repair. But I am likely to
> misunderstand the patch :-)

This situation happens only when we are largely overcommitting
memory, due to a bad skb->len/skb->truesize ratio.

Think about MTU=9000 and some drivers allocating 16 KB of memory per frame,
even if the packet has 1500 bytes in it.
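
As a rough worked example of the over-commit (buffer sizes assumed
just for illustration): with limit = sk_rcvbuf + sk_sndbuf ~= 6 MB and
skbs of truesize 16 KB each carrying only 1500 bytes of payload, the
backlog fills after about 6 MB / 16 KB ~= 384 skbs, i.e. only ~576 KB
of actual payload queued before drops would start.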

>
> Would it make sense to check the space first before the coalesce
> operation, and drop just enough bytes of the head to make room?

This is basically what the patch does: the while loop breaks once we
have freed just enough skbs.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH net-next 3/3] tcp: implement head drops in backlog queue
  2018-11-21 23:52         ` Eric Dumazet
@ 2018-11-22  0:18           ` Eric Dumazet
  2018-11-22  0:54             ` Yuchung Cheng
  0 siblings, 1 reply; 27+ messages in thread
From: Eric Dumazet @ 2018-11-22  0:18 UTC (permalink / raw)
  To: Yuchung Cheng
  Cc: Eric Dumazet, David Miller, netdev, Jean-Louis Dupond, Neal Cardwell

On Wed, Nov 21, 2018 at 3:52 PM Eric Dumazet <edumazet@google.com> wrote:
> This is basically what the patch does, the while loop breaks when we have freed
> just enough skbs.

Also this is the patch we tested with Jean-Louis on his host, bringing
very nice results, even from an old stack sender (the one that had
problems with the SACK compression we just fixed).

Keep in mind we are dealing here with the exception; I would not spend
too much time testing another variant if this one simply works.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH net-next 3/3] tcp: implement head drops in backlog queue
  2018-11-22  0:18           ` Eric Dumazet
@ 2018-11-22  0:54             ` Yuchung Cheng
  2018-11-22  1:01               ` Eric Dumazet
  0 siblings, 1 reply; 27+ messages in thread
From: Yuchung Cheng @ 2018-11-22  0:54 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Eric Dumazet, David Miller, netdev, Jean-Louis Dupond, Neal Cardwell

On Wed, Nov 21, 2018 at 4:18 PM, Eric Dumazet <edumazet@google.com> wrote:
> On Wed, Nov 21, 2018 at 3:52 PM Eric Dumazet <edumazet@google.com> wrote:
>> This is basically what the patch does, the while loop breaks when we have freed
>> just enough skbs.
>
> Also this is the patch we tested with Jean-Louis on his host, bring
> very nice results,
> even from an old stack sender (the one that had problems with the SACK
> compression we just fixed)
>
> Keep in mind we are dealing here with the exception, I would not spend
> too much time
> testing another variant if this one simply works.
To clarify, I do think this patch set is overall useful, so I only
wanted to discuss the specifics of the head drop.

It occurs to me we check the limit differently (one w/ 64KB more), so
we may overcommit and trim more often than necessary?

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH net-next 3/3] tcp: implement head drops in backlog queue
  2018-11-22  0:54             ` Yuchung Cheng
@ 2018-11-22  1:01               ` Eric Dumazet
  0 siblings, 0 replies; 27+ messages in thread
From: Eric Dumazet @ 2018-11-22  1:01 UTC (permalink / raw)
  To: Yuchung Cheng
  Cc: Eric Dumazet, David Miller, netdev, Jean-Louis Dupond, Neal Cardwell

On Wed, Nov 21, 2018 at 4:55 PM Yuchung Cheng <ycheng@google.com> wrote:
>
> To clarify I do think this patch set is overall useful so I only
> wanted to discuss the specifics of the head drop.
>
> It occurs to me we check the limit differently (one w/ 64KB more), so
> we may overcommit and trim more often than necessary?

Keep in mind that the normal limit might be hit while the backlog is empty.

So the loop might have nothing to remove.

We only have an extra 64KB credit for this very particular case, so
that we can still queue this packet _if_ the additional credit was not
already consumed.
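
For reference, the limit check used both by sk_add_backlog() and by the
head-drop loop in patch 3 is approximately this (from include/net/sock.h
around the time of this thread):

	static inline bool sk_rcvqueues_full(const struct sock *sk,
					     unsigned int limit)
	{
		unsigned int qsize = sk->sk_backlog.len +
				     atomic_read(&sk->sk_rmem_alloc);

		return qsize > limit;
	}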

We do not want to add back the terrible bug fixed by :

commit 8eae939f1400326b06d0c9afe53d2a484a326871
Author: Zhu Yi <yi.zhu@intel.com>
Date:   Thu Mar 4 18:01:40 2010 +0000

    net: add limit for socket backlog

    We got system OOM while running some UDP netperf testing on the loopback
    device. The case is multiple senders sent stream UDP packets to a single
    receiver via loopback on local host. Of course, the receiver is not able
    to handle all the packets in time. But we surprisingly found that these
    packets were not discarded due to the receiver's sk->sk_rcvbuf limit.
    Instead, they are kept queuing to sk->sk_backlog and finally ate up all
    the memory. We believe this is a secure hole that a none privileged user
    can crash the system.

    The root cause for this problem is, when the receiver is doing
    __release_sock() (i.e. after userspace recv, kernel udp_recvmsg ->
    skb_free_datagram_locked -> release_sock), it moves skbs from backlog to
    sk_receive_queue with the softirq enabled. In the above case, multiple
    busy senders will almost make it an endless loop. The skbs in the
    backlog end up eat all the system memory.

    The issue is not only for UDP. Any protocols using socket backlog is
    potentially affected. The patch adds limit for socket backlog so that
    the backlog size cannot be expanded endlessly.

    Reported-by: Alex Shi <alex.shi@intel.com>
    Cc: David Miller <davem@davemloft.net>
    Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
    Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru
    Cc: "Pekka Savola (ipv6)" <pekkas@netcore.fi>
    Cc: Patrick McHardy <kaber@trash.net>
    Cc: Vlad Yasevich <vladislav.yasevich@hp.com>
    Cc: Sridhar Samudrala <sri@us.ibm.com>
    Cc: Jon Maloy <jon.maloy@ericsson.com>
    Cc: Allan Stephens <allan.stephens@windriver.com>
    Cc: Andrew Hendry <andrew.hendry@gmail.com>
    Signed-off-by: Zhu Yi <yi.zhu@intel.com>
    Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
    Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH net-next 2/3] tcp: implement coalescing on backlog queue
  2018-11-21 22:40     ` Eric Dumazet
@ 2018-11-22 16:34       ` Yuchung Cheng
  0 siblings, 0 replies; 27+ messages in thread
From: Yuchung Cheng @ 2018-11-22 16:34 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Eric Dumazet, David S . Miller, netdev, Jean-Louis Dupond, Neal Cardwell

On Wed, Nov 21, 2018 at 2:40 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
>
> On 11/21/2018 02:31 PM, Yuchung Cheng wrote:
>> On Wed, Nov 21, 2018 at 9:52 AM, Eric Dumazet <edumazet@google.com> wrote:
>
>>> +
>> Really nice! would it make sense to re-use (some of) the similar
>> tcp_try_coalesce()?
>>
>
> Maybe, but it is a bit complex, since skbs in receive queues (regular or out of order)
> are accounted differently (they have skb->destructor set)
>
> Also they had the TCP header pulled already, while the backlog coalescing also has
> to make sure TCP options match.
>
> Not sure if we want to add extra parameters and conditional checks...
Makes sense.

Acked-by: Yuchung Cheng <ycheng@google.com>

>
>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH net-next 2/3] tcp: implement coalescing on backlog queue
  2018-11-21 17:52 ` [PATCH net-next 2/3] tcp: implement coalescing on backlog queue Eric Dumazet
  2018-11-21 22:31   ` Yuchung Cheng
@ 2018-11-22 18:01   ` Neal Cardwell
  2018-11-22 18:16     ` Eric Dumazet
  1 sibling, 1 reply; 27+ messages in thread
From: Neal Cardwell @ 2018-11-22 18:01 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, Netdev, jean-louis, Yuchung Cheng, Eric Dumazet

On Wed, Nov 21, 2018 at 12:52 PM Eric Dumazet <edumazet@google.com> wrote:
>
> In case GRO is not as efficient as it should be or disabled,
> we might have a user thread trapped in __release_sock() while
> softirq handler flood packets up to the point we have to drop.
>
> This patch balances work done from user thread and softirq,
> to give more chances to __release_sock() to complete its work.
>
> This also helps if we receive many ACK packets, since GRO
> does not aggregate them.

Would this coalesce duplicate incoming ACK packets? Is there a risk
that this would eliminate incoming dupacks needed for fast recovery in
non-SACK connections? Perhaps pure ACKs should only be coalesced if
the ACK field is different?

neal

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH net-next 2/3] tcp: implement coalescing on backlog queue
  2018-11-22 18:01   ` Neal Cardwell
@ 2018-11-22 18:16     ` Eric Dumazet
  2018-11-22 18:21       ` Eric Dumazet
  0 siblings, 1 reply; 27+ messages in thread
From: Eric Dumazet @ 2018-11-22 18:16 UTC (permalink / raw)
  To: Neal Cardwell
  Cc: David Miller, netdev, Jean-Louis Dupond, Yuchung Cheng, Eric Dumazet

On Thu, Nov 22, 2018 at 10:01 AM Neal Cardwell <ncardwell@google.com> wrote:
>
> On Wed, Nov 21, 2018 at 12:52 PM Eric Dumazet <edumazet@google.com> wrote:
> >
> > In case GRO is not as efficient as it should be or disabled,
> > we might have a user thread trapped in __release_sock() while
> > softirq handler flood packets up to the point we have to drop.
> >
> > This patch balances work done from user thread and softirq,
> > to give more chances to __release_sock() to complete its work.
> >
> > This also helps if we receive many ACK packets, since GRO
> > does not aggregate them.
>
> Would this coalesce duplicate incoming ACK packets? Is there a risk
> that this would eliminate incoming dupacks needed for fast recovery in
> non-SACK connections? Perhaps pure ACKs should only be coalesced if
> the ACK field is different?

Yes, I was considering properly filtering SACK as a refinement later [1]
but you raise a valid point for alien stacks that are not yet using SACK :/

[1] This version of the patch will not aggregate sacks since the
memcmp() on tcp options would fail.

Neal can you double check if cake_ack_filter() does not have the issue
you just mentioned ?

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH net-next 2/3] tcp: implement coalescing on backlog queue
  2018-11-22 18:16     ` Eric Dumazet
@ 2018-11-22 18:21       ` Eric Dumazet
  0 siblings, 0 replies; 27+ messages in thread
From: Eric Dumazet @ 2018-11-22 18:21 UTC (permalink / raw)
  To: Neal Cardwell
  Cc: David Miller, netdev, Jean-Louis Dupond, Yuchung Cheng, Eric Dumazet

On Thu, Nov 22, 2018 at 10:16 AM Eric Dumazet <edumazet@google.com> wrote:

> Yes, I was considering properly filtering SACK as a refinement later [1]
> but you raise a valid point for alien stacks that are not yet using SACK :/
>
> [1] This version of the patch will not aggregate sacks since the
> memcmp() on tcp options would fail.
>
> Neal can you double check if cake_ack_filter() does not have the issue
> you just mentioned ?

Note that aggregated pure ACKs will have gso_segs set to the number
of aggregated ACKs; we might simply use this value later in the stack,
instead of forcing X pure ACKs to sit in the backlog and increasing
memory needs and cpu costs.
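
As a hedged illustration of that idea (not part of the fix below), a
consumer later in the stack could in principle read the aggregated
count with something like:

	u16 num_acks = max_t(u16, 1, skb_shinfo(skb)->gso_segs);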

Then I guess I need this fix :

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 36c9d715bf2aa7eb7bf58b045bfeb85a2ec1a696..736f7f24cdb4fe61769faaa1644c8bff01c746c4 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1669,7 +1669,8 @@ bool tcp_add_backlog(struct sock *sk, struct sk_buff *skb)
                 __skb_pull(skb, hdrlen);
                 if (skb_try_coalesce(tail, skb, &fragstolen, &delta)) {
                         TCP_SKB_CB(tail)->end_seq = TCP_SKB_CB(skb)->end_seq;
-                        TCP_SKB_CB(tail)->ack_seq = TCP_SKB_CB(skb)->ack_seq;
+                        if (after(TCP_SKB_CB(skb)->ack_seq, TCP_SKB_CB(tail)->ack_seq))
+                                TCP_SKB_CB(tail)->ack_seq = TCP_SKB_CB(skb)->ack_seq;
                         TCP_SKB_CB(tail)->tcp_flags |= TCP_SKB_CB(skb)->tcp_flags;
 
                         if (TCP_SKB_CB(skb)->has_rxtstamp) {

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH net-next 0/3] tcp: take a bit more care of backlog stress
  2018-11-21 17:52 [PATCH net-next 0/3] tcp: take a bit more care of backlog stress Eric Dumazet
                   ` (2 preceding siblings ...)
  2018-11-21 17:52 ` [PATCH net-next 3/3] tcp: implement head drops in " Eric Dumazet
@ 2018-11-23 19:25 ` David Miller
  2018-11-23 19:27   ` Eric Dumazet
  3 siblings, 1 reply; 27+ messages in thread
From: David Miller @ 2018-11-23 19:25 UTC (permalink / raw)
  To: edumazet; +Cc: netdev, jean-louis, ncardwell, ycheng, eric.dumazet

From: Eric Dumazet <edumazet@google.com>
Date: Wed, 21 Nov 2018 09:52:37 -0800

> While working on the SACK compression issue Jean-Louis Dupond
> reported, we found that his linux box was suffering very hard
> from tail drops on the socket backlog queue, because the opposite
> TCP stack was not implementing the latest RFC recommendations.
> 
> First patch is a cleanup
> 
> Second patch is attempting coalescing when a new packet must
> be added to the backlog queue. Cooking bigger skbs helps
> to keep backlog list smaller and speeds its handling when
> user thread finally releases the socket lock.
> 
> Third patch is implementing head drop as a last resort.
> Head drops are generally better for optimal TCP behavior.

My impression is that patch #2 needs some fixes in order to not
lose dupacks.  So there will be a respin of this.

Thanks.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH net-next 0/3] tcp: take a bit more care of backlog stress
  2018-11-23 19:25 ` [PATCH net-next 0/3] tcp: take a bit more care of backlog stress David Miller
@ 2018-11-23 19:27   ` Eric Dumazet
  0 siblings, 0 replies; 27+ messages in thread
From: Eric Dumazet @ 2018-11-23 19:27 UTC (permalink / raw)
  To: David Miller
  Cc: netdev, Jean-Louis Dupond, Neal Cardwell, Yuchung Cheng, Eric Dumazet

On Fri, Nov 23, 2018 at 11:25 AM David Miller <davem@davemloft.net> wrote:
> My impression is that patch #2 needs some fixes in order to not
> lose dupacks.  So there will be a respin of this.
>
> Thanks.

You are absolutely right, we will submit a v2 next week after TG holidays.

Thanks.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH net-next 2/3] tcp: implement coalescing on backlog queue
  2019-04-25 13:30         ` Eric Dumazet
@ 2019-04-25 14:16           ` Bruno Prémont
  0 siblings, 0 replies; 27+ messages in thread
From: Bruno Prémont @ 2019-04-25 14:16 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Richard Purdie, Neal Cardwell, Yuchung Cheng, David S. Miller,
	netdev, Alexander Kanavin, Bruce Ashfield

On Thu, 25 Apr 2019 06:30:24 -0700 Eric Dumazet wrote:
> On Thu, Apr 25, 2019 at 6:13 AM Bruno Prémont wrote:
> > On Thu, 25 Apr 2019 09:55:31 +0200 Bruno Prémont wrote:  
> > > On Wed, 24 Apr 2019 08:47:27 -0700 Eric Dumazet wrote:  
> > > > On Wed, Apr 24, 2019 at 7:51 AM Bruno Prémont wrote:  
> > > > > I'm seeing issues with this patch as well, not as regular as for
> > > > > Richard but still (about up to one in 30-50 TCP sessions).
> > > > >
> > > > > In my case I have a virtual machine (on VMWare) with this patch where
> > > > > NGINX as reverse proxy misses part (end) of payload from its upstream
> > > > > and times out on the upstream connection (while according to tcpdump all
> > > > > packets including upstream's FIN were sent and the upstream did get
> > > > > ACKs from the VM).
> > > > >
> > > > > From when browsers get from NGINX it feels as if at some point reading
> > > > > from the socket or waiting for data using select() never returned data
> > > > > that arrived as more than just EOF is missing.
> > > > >
> > > > > The upstream is a hardware machine in the same subnet.
> > > > >
> > > > > My VM is using VMware VMXNET3 Ethernet Controller [15ad:07b0] (rev 01)
> > > > > as network adapter which lists the following features:
> > > > >  
> > > >
> > > > Hi Bruno.
> > > >
> > > > I suspect a EPOLLIN notification being lost by the application.
> > > >
> > > > Fact that TCP backlog contains 1 instead of 2+ packets should not
> > > > change stack behavior,
> > > > this packet should land into socket receive queue eventually.
> > > >
> > > > Are you using epoll() in Edge Trigger mode. You mention select() but
> > > > select() is a rather old and inefficient API.  
> > >
> > > nginx is using epoll (c.f. http://nginx.org/en/docs/events.html)
> > >
> > > For source, see here
> > > https://trac.nginx.org/nginx/browser/nginx/src/event/modules/ngx_epoll_module.c?rev=ebf8c9686b8ce7428f975d8a567935ea3722da70  
> >
> > I've been trying to reproduce while running nginx under strace (with verbose logging), but
> > have not been successful. Looks like tracing modifies the timing in a
> > way to prevent the issue.
> > Without verbose logging I could hit it (though needed way more perseverence).
> >
> > What I do have for a working session under strace is:
> >   socket(AF_INET6, SOCK_STREAM, IPPROTO_IP) = 12
> >   ioctl(12, FIONBIO, [1]) = 0
> >   epoll_ctl(9, EPOLL_CTL_ADD, 12, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=1221158320, u64=139682147558832}}) = 0
> >   connect(12, {sa_family=AF_INET6, sin6_port=htons(8080), inet_pton(AF_INET6, "2001:a18:1:6::217", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28) = -1 EINPROGRESS (Operation now in progress)
> >   epoll_wait(9, [{EPOLLOUT, {u32=1221158320, u64=139682147558832}}], 512, 57230) = 1
> >   getsockopt(12, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
> >   writev(12, [{""..., 830}, {""..., 115}], 2) = 945
> >   epoll_wait(9, [{EPOLLIN|EPOLLOUT, {u32=1221158320, u64=139682147558832}}], 512, 57220) = 1
> >   recvfrom(12, ""..., 4096, 0, NULL, NULL) = 4096
> >   readv(12, [{""..., 4096}], 1) = 4096
> >   readv(12, [{""..., 4096}], 1) = 4096
> >   readv(12, [{""..., 4096}], 1) = 4096
> >   readv(12, [{""..., 4096}], 1) = 4096
> >   readv(12, [{""..., 4096}], 1) = 4096
> >   readv(12, [{""..., 4096}], 1) = 3855
> >   epoll_wait(9, [{EPOLLIN|EPOLLOUT|EPOLLRDHUP, {u32=1221158320, u64=139682147558832}}], 512, 57190) = 1
> >   readv(12, [{"", 241}, {"", 4096}, {"", 4096}, {"", 4096}, {"", 4096}, {"", 4096}, {"", 4096}], 7) = 0
> >   close(12)               = 0
> >
> > Under failing session I got:
> >   socket(AF_INET, SOCK_STREAM, IPPROTO_IP) = 6
> >   ioctl(6, FIONBIO, [1])  = 0
> >   epoll_ctl(86, EPOLL_CTL_ADD, 6, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=1221159521, u64=139682147560033}}) = 0
> >   connect(6, {sa_family=AF_INET, sin_port=htons(8080), sin_addr=inet_addr("158.64.2.217")}, 16) = -1 EINPROGRESS (Operation now in progress)
> >   epoll_wait(86, [{EPOLLOUT, {u32=1221159521, u64=139682147560033}}], 512, 60000) = 1
> >   getsockopt(6, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
> >   writev(6, [{""..., 830}, {""..., 120}], 2) = 950
> >   epoll_wait(86, [{EPOLLIN|EPOLLOUT, {u32=1221159521, u64=139682147560033}}], 512, 60000) = 1
> >   recvfrom(6, ""..., 4096, 0, NULL, NULL) = 1740
> >   epoll_wait(86, [{EPOLLIN|EPOLLOUT, {u32=1221159521, u64=139682147560033}}], 512, 59980) = 1
> >   readv(6, [{""..., 2356}], 1) = 2356
> >   readv(6, [{""..., 4096}], 1) = 3362
> >   epoll_wait(86, [{EPOLLIN|EPOLLOUT, {u32=1221159521, u64=139682147560033}}], 512, 59980) = 1
> >   readv(6, [{""..., 734}, {""..., 4096}], 2) = 4830
> >   readv(6, [{""..., 4096}], 1) = 4096
> >   readv(6, [{""..., 4096}], 1) = 4096
> >   readv(6, [{""..., 4096}], 1) = 4096
> >   readv(6, [{""..., 4096}], 1) = 3619
> >   epoll_wait(86, [{EPOLLOUT, {u32=1221159521, u64=139682147560033}}], 512, 59980) = 1
> >   ... some other socket with activity and epoll_wait calls for 50 seconds
> >   epoll_wait(86, [], 512, 3331) = 0
> >   close(6)                = 0
> >
> >  So the EPOLLRDHUP seems to never happen here.
> >
> >
> > As a side-note, in strace output (over lots of sessions) I see almost no readv() returning -1 EAGAIN
> > except for the first readv() following recvfrom().
> >
> >
> > I hope this helps at looking for the cause, as well as that not collapsing FIN packets
> > also prevents the issue.  
> 
> Very interesting Bruno, thanks a lot for all these details.
> 
> I wonder if you have some existing epoll notification bug, that would
> trigger if the last packet received from the peer
> contains both data and FIN.
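
For reference, a minimal sketch of the edge-trigger contract in question
(names and error handling are illustrative only): with EPOLLET, the
reader has to drain the socket until readv() returns EAGAIN before
waiting again, otherwise a wakeup can effectively be lost.

	#include <errno.h>
	#include <sys/uio.h>

	/* Drain an edge-triggered, non-blocking fd.
	 * Returns 1 on EOF (peer FIN seen), 0 once drained, -1 on error.
	 * Consumption of the bytes read is omitted here.
	 */
	static int drain_et(int fd, struct iovec *iov, int iovcnt)
	{
		for (;;) {
			ssize_t n = readv(fd, iov, iovcnt);

			if (n > 0)
				continue;	/* process n bytes, keep reading */
			if (n == 0)
				return 1;	/* EOF: peer closed (FIN) */
			if (errno == EAGAIN || errno == EWOULDBLOCK)
				return 0;	/* drained, wait for next EPOLLIN */
			return -1;		/* real error */
		}
	}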

I just took a tcpdump on the nginx host and have a TCP session
for which the issue occurred in the dump, and a few working ones too (I
can't share it in public though unless I could anonymise the payload).
For almost all connections I have in the dump, the last packet from
upstream has payload as well as the FIN flag, same for the failing one.
So a FIN carrying payload is not the only trigger (and the amount of
data probably isn't either, as one working connection had 6 bytes while
the failing one had 7), if it is a trigger at all.

The dump doesn't reveal anything that would catch my attention though.
At best a difference in timing as to when nginx host sends ACKs.

> Were the "ss" outputs taken from nginx side ?

Yes, they were.

Cheers,
Bruno

> > > > Could you watch/report the output of " ss -temoi "  for the frozen TCP flow ?  
> > >
> > > Here it is, three distinct reproducing attempts:
> > > State      Recv-Q  Send-Q  Local Address:Port  Peer Address:Port
> > > ESTAB      0       0       158.64.2.228:44248  158.64.2.217:webcache              uid:83 ino:13245 sk:87 <->
> > >          skmem:(r0,rb131072,t0,tb46080,f0,w0,o0,bl0,d0) ts sack cubic wscale:7,7 rto:210 rtt:0.24/0.118 ato:40 mss:1448 rcvmss:1448 advmss:1448 cwnd:10 bytes_acked:949 bytes_received:28381 segs_out:12 segs_in:12 data_segs_out:1 data_segs_in:10 send 482.7Mbps lastsnd:46810 lastrcv:46790 lastack:46790 pacing_rate 965.3Mbps delivery_rate 74.3Mbps app_limited rcv_rtt:1 rcv_space:14480 minrtt:0.156
> > >
> > >
> > > ESTAB      0      0                                                                   2001:a18:1:6::228:33572                                                                             2001:a18:1:6::217:webcache              uid:83 ino:16699 sk:e1 <->
> > >          skmem:(r0,rb131072,t0,tb46080,f0,w0,o0,bl0,d0) ts sack cubic wscale:7,7 rto:210 rtt:0.231/0.11 ato:40 mss:1428 rcvmss:1428 advmss:1428 cwnd:10 bytes_acked:948 bytes_received:28474 segs_out:12 segs_in:12 data_segs_out:1 data_segs_in:10 send 494.5Mbps lastsnd:8380 lastrcv:8360 lastack:8360 pacing_rate 989.1Mbps delivery_rate 71.0Mbps app_limited rcv_rtt:1.109 rcv_space:14280 minrtt:0.161
> > >
> > >
> > > ESTAB      0      0                                                                        158.64.2.228:44578                                                                                  158.64.2.217:webcache              uid:83 ino:17628 sk:12c <->
> > >          skmem:(r0,rb131072,t0,tb46080,f0,w0,o0,bl0,d0) ts sack cubic wscale:7,7 rto:210 rtt:0.279/0.136 ato:40 mss:1448 rcvmss:1448 advmss:1448 cwnd:10 bytes_acked:949 bytes_received:28481 segs_out:12 segs_in:12 data_segs_out:1 data_segs_in:10 send 415.2Mbps lastsnd:11360 lastrcv:11330 lastack:11340 pacing_rate 828.2Mbps delivery_rate 61.9Mbps app_limited rcv_rtt:1 rcv_space:14480 minrtt:0.187
> > >
> > >
> > > From nginx debug logging I don't get a real clue though it seems for working connections
> > > the last event obtained is 2005 (EPOLLMSG | EPOLLWRBAND | EPOLLWRNORM |
> > > EPOLLRDBAND | EPOLLRDNORM | EPOLLHUP | EPOLLIN | EPOLLOUT) - previous ones are 5
> > > while for failing connections it looks like last event seen is 5 (EPOLLIN | EPOLLOUT).
> > >  
> > > > This migtht give us a clue about packets being dropped, say the the
> > > > accumulated packet became too big.  
> > >
> > >
> > > The following minor patch (might be white-space mangled) does prevent the issue
> > > happening for me:
> > >
> > > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> > > index 4904250a9aac..c102cd367c79 100644
> > > --- a/net/ipv4/tcp_ipv4.c
> > > +++ b/net/ipv4/tcp_ipv4.c
> > > @@ -1667,7 +1667,7 @@ bool tcp_add_backlog(struct sock *sk, struct sk_buff *skb)
> > >         if (TCP_SKB_CB(tail)->end_seq != TCP_SKB_CB(skb)->seq ||
> > >             TCP_SKB_CB(tail)->ip_dsfield != TCP_SKB_CB(skb)->ip_dsfield ||
> > >             ((TCP_SKB_CB(tail)->tcp_flags |
> > > -             TCP_SKB_CB(skb)->tcp_flags) & TCPHDR_URG) ||
> > > +             TCP_SKB_CB(skb)->tcp_flags) & (TCPHDR_URG | TCPHDR_FIN)) ||
> > >             ((TCP_SKB_CB(tail)->tcp_flags ^
> > >               TCP_SKB_CB(skb)->tcp_flags) & (TCPHDR_ECE | TCPHDR_CWR)) ||
> > >  #ifdef CONFIG_TLS_DEVICE
> > >
> > > Cheers,
> > > Bruno

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH net-next 2/3] tcp: implement coalescing on backlog queue
  2019-04-25 13:13       ` Bruno Prémont
@ 2019-04-25 13:30         ` Eric Dumazet
  2019-04-25 14:16           ` Bruno Prémont
  0 siblings, 1 reply; 27+ messages in thread
From: Eric Dumazet @ 2019-04-25 13:30 UTC (permalink / raw)
  To: Bruno Prémont
  Cc: Richard Purdie, Neal Cardwell, Yuchung Cheng, David S. Miller,
	netdev, Alexander Kanavin, Bruce Ashfield

On Thu, Apr 25, 2019 at 6:13 AM Bruno Prémont <bonbons@sysophe.eu> wrote:
>
> Hi Eric,
>
> On Thu, 25 Apr 2019 09:55:31 +0200 Bruno Prémont wrote:
> > Hi Eric,
> >
> > On Wed, 24 Apr 2019 08:47:27 -0700 Eric Dumazet wrote:
> > > On Wed, Apr 24, 2019 at 7:51 AM Bruno Prémont wrote:
> > > >
> > > > Hi Eric,
> > > >
> > > > I'm seeing issues with this patch as well, not as regularly as for
> > > > Richard, but still (up to about one in 30-50 TCP sessions).
> > > >
> > > > In my case I have a virtual machine (on VMWare) with this patch where
> > > > NGINX as reverse proxy misses part (end) of payload from its upstream
> > > > and times out on the upstream connection (while according to tcpdump all
> > > > packets including upstream's FIN were sent and the upstream did get
> > > > ACKs from the VM).
> > > >
> > > > From what browsers get from NGINX it feels as if at some point reading
> > > > from the socket or waiting for data using select() never returned data
> > > > that had arrived, as more than just the EOF is missing.
> > > >
> > > > The upstream is a hardware machine in the same subnet.
> > > >
> > > > My VM is using VMware VMXNET3 Ethernet Controller [15ad:07b0] (rev 01)
> > > > as network adapter which lists the following features:
> > > >
> > >
> > > Hi Bruno.
> > >
> > > I suspect an EPOLLIN notification being lost by the application.
> > >
> > > The fact that the TCP backlog contains 1 instead of 2+ packets should not
> > > change stack behavior; this packet should land in the socket receive
> > > queue eventually.
> > >
> > > Are you using epoll() in Edge Trigger mode? You mention select(), but
> > > select() is a rather old and inefficient API.
> >
> > nginx is using epoll (c.f. http://nginx.org/en/docs/events.html)
> >
> > For source, see here
> > https://trac.nginx.org/nginx/browser/nginx/src/event/modules/ngx_epoll_module.c?rev=ebf8c9686b8ce7428f975d8a567935ea3722da70
>
> I've been trying to reproduce while running nginx under strace (with verbose logging), but
> have not been successful. It looks like tracing modifies the timing in a
> way that prevents the issue.
> Without verbose logging I could hit it (though it took way more perseverance).
>
> What I do have for a working session under strace is:
>   socket(AF_INET6, SOCK_STREAM, IPPROTO_IP) = 12
>   ioctl(12, FIONBIO, [1]) = 0
>   epoll_ctl(9, EPOLL_CTL_ADD, 12, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=1221158320, u64=139682147558832}}) = 0
>   connect(12, {sa_family=AF_INET6, sin6_port=htons(8080), inet_pton(AF_INET6, "2001:a18:1:6::217", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28) = -1 EINPROGRESS (Operation now in progress)
>   epoll_wait(9, [{EPOLLOUT, {u32=1221158320, u64=139682147558832}}], 512, 57230) = 1
>   getsockopt(12, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
>   writev(12, [{""..., 830}, {""..., 115}], 2) = 945
>   epoll_wait(9, [{EPOLLIN|EPOLLOUT, {u32=1221158320, u64=139682147558832}}], 512, 57220) = 1
>   recvfrom(12, ""..., 4096, 0, NULL, NULL) = 4096
>   readv(12, [{""..., 4096}], 1) = 4096
>   readv(12, [{""..., 4096}], 1) = 4096
>   readv(12, [{""..., 4096}], 1) = 4096
>   readv(12, [{""..., 4096}], 1) = 4096
>   readv(12, [{""..., 4096}], 1) = 4096
>   readv(12, [{""..., 4096}], 1) = 3855
>   epoll_wait(9, [{EPOLLIN|EPOLLOUT|EPOLLRDHUP, {u32=1221158320, u64=139682147558832}}], 512, 57190) = 1
>   readv(12, [{"", 241}, {"", 4096}, {"", 4096}, {"", 4096}, {"", 4096}, {"", 4096}, {"", 4096}], 7) = 0
>   close(12)               = 0
>
> Under failing session I got:
>   socket(AF_INET, SOCK_STREAM, IPPROTO_IP) = 6
>   ioctl(6, FIONBIO, [1])  = 0
>   epoll_ctl(86, EPOLL_CTL_ADD, 6, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=1221159521, u64=139682147560033}}) = 0
>   connect(6, {sa_family=AF_INET, sin_port=htons(8080), sin_addr=inet_addr("158.64.2.217")}, 16) = -1 EINPROGRESS (Operation now in progress)
>   epoll_wait(86, [{EPOLLOUT, {u32=1221159521, u64=139682147560033}}], 512, 60000) = 1
>   getsockopt(6, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
>   writev(6, [{""..., 830}, {""..., 120}], 2) = 950
>   epoll_wait(86, [{EPOLLIN|EPOLLOUT, {u32=1221159521, u64=139682147560033}}], 512, 60000) = 1
>   recvfrom(6, ""..., 4096, 0, NULL, NULL) = 1740
>   epoll_wait(86, [{EPOLLIN|EPOLLOUT, {u32=1221159521, u64=139682147560033}}], 512, 59980) = 1
>   readv(6, [{""..., 2356}], 1) = 2356
>   readv(6, [{""..., 4096}], 1) = 3362
>   epoll_wait(86, [{EPOLLIN|EPOLLOUT, {u32=1221159521, u64=139682147560033}}], 512, 59980) = 1
>   readv(6, [{""..., 734}, {""..., 4096}], 2) = 4830
>   readv(6, [{""..., 4096}], 1) = 4096
>   readv(6, [{""..., 4096}], 1) = 4096
>   readv(6, [{""..., 4096}], 1) = 4096
>   readv(6, [{""..., 4096}], 1) = 3619
>   epoll_wait(86, [{EPOLLOUT, {u32=1221159521, u64=139682147560033}}], 512, 59980) = 1
>   ... some other socket with activity and epoll_wait calls for 50 seconds
>   epoll_wait(86, [], 512, 3331) = 0
>   close(6)                = 0
>
>  So the EPOLLRDHUP seems to never happen here.
>
>
> As a side note, in strace output (over lots of sessions) I see almost no readv() returning -1 EAGAIN
> except for the first readv() following recvfrom().
>
>
> I hope this helps in looking for the cause, as does the fact that not collapsing FIN packets
> prevents the issue.

Very interesting Bruno, thanks a lot for all these details.

I wonder if you have some existing epoll notification bug that would
trigger if the last packet received from the peer
contains both data and FIN.

Were the "ss" outputs taken from the nginx side?
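
Roughly the scenario I have in mind, as an untested loopback sketch (error
handling omitted, and with no guarantee the kernel actually carries the
payload and the FIN in a single segment here): the peer's last write() is
immediately followed by close(), and an edge-triggered watcher is expected
to see EPOLLIN and EPOLLRDHUP reported together in one event.

#include <arpa/inet.h>
#include <stdio.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
	struct sockaddr_in addr = { .sin_family = AF_INET,
				    .sin_addr.s_addr = htonl(INADDR_LOOPBACK) };
	socklen_t alen = sizeof(addr);
	int lfd = socket(AF_INET, SOCK_STREAM, 0);

	bind(lfd, (struct sockaddr *)&addr, sizeof(addr));
	listen(lfd, 1);
	getsockname(lfd, (struct sockaddr *)&addr, &alen);

	int cfd = socket(AF_INET, SOCK_STREAM, 0);
	connect(cfd, (struct sockaddr *)&addr, sizeof(addr));
	int sfd = accept(lfd, NULL, NULL);

	write(cfd, "proxied data\n", 13);	/* payload ...             */
	close(cfd);				/* ... then FIN right away */

	int ep = epoll_create1(0);
	struct epoll_event ev = { .events = EPOLLIN | EPOLLRDHUP | EPOLLET };

	epoll_ctl(ep, EPOLL_CTL_ADD, sfd, &ev);
	if (epoll_wait(ep, &ev, 1, 1000) == 1)
		printf("events=%#x (expect EPOLLIN|EPOLLRDHUP)\n", ev.events);

	char buf[64];
	ssize_t r;

	while ((r = read(sfd, buf, sizeof(buf))) > 0)
		;				/* drain the payload */
	printf("final read=%zd (0 means the FIN/EOF was seen)\n", r);
	return 0;
}

If that single combined edge is missed, or the reader stops short of the EOF
after it, nothing else will ever wake the socket up, which would match the
frozen flows you are seeing.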

>
> Cheers,
> Bruno
>
>
> > > Could you watch/report the output of " ss -temoi "  for the frozen TCP flow ?
> >
> > Here it is, three distinct reproducing attempts:
> > State      Recv-Q  Send-Q  Local Address:Port  Peer Address:Port
> > ESTAB      0       0       158.64.2.228:44248  158.64.2.217:webcache              uid:83 ino:13245 sk:87 <->
> >          skmem:(r0,rb131072,t0,tb46080,f0,w0,o0,bl0,d0) ts sack cubic wscale:7,7 rto:210 rtt:0.24/0.118 ato:40 mss:1448 rcvmss:1448 advmss:1448 cwnd:10 bytes_acked:949 bytes_received:28381 segs_out:12 segs_in:12 data_segs_out:1 data_segs_in:10 send 482.7Mbps lastsnd:46810 lastrcv:46790 lastack:46790 pacing_rate 965.3Mbps delivery_rate 74.3Mbps app_limited rcv_rtt:1 rcv_space:14480 minrtt:0.156
> >
> >
> > ESTAB      0      0                                                                   2001:a18:1:6::228:33572                                                                             2001:a18:1:6::217:webcache              uid:83 ino:16699 sk:e1 <->
> >          skmem:(r0,rb131072,t0,tb46080,f0,w0,o0,bl0,d0) ts sack cubic wscale:7,7 rto:210 rtt:0.231/0.11 ato:40 mss:1428 rcvmss:1428 advmss:1428 cwnd:10 bytes_acked:948 bytes_received:28474 segs_out:12 segs_in:12 data_segs_out:1 data_segs_in:10 send 494.5Mbps lastsnd:8380 lastrcv:8360 lastack:8360 pacing_rate 989.1Mbps delivery_rate 71.0Mbps app_limited rcv_rtt:1.109 rcv_space:14280 minrtt:0.161
> >
> >
> > ESTAB      0      0                                                                        158.64.2.228:44578                                                                                  158.64.2.217:webcache              uid:83 ino:17628 sk:12c <->
> >          skmem:(r0,rb131072,t0,tb46080,f0,w0,o0,bl0,d0) ts sack cubic wscale:7,7 rto:210 rtt:0.279/0.136 ato:40 mss:1448 rcvmss:1448 advmss:1448 cwnd:10 bytes_acked:949 bytes_received:28481 segs_out:12 segs_in:12 data_segs_out:1 data_segs_in:10 send 415.2Mbps lastsnd:11360 lastrcv:11330 lastack:11340 pacing_rate 828.2Mbps delivery_rate 61.9Mbps app_limited rcv_rtt:1 rcv_space:14480 minrtt:0.187
> >
> >
> > From nginx debug logging I don't get a real clue, though it seems that for working connections
> > the last event obtained is 2005 (EPOLLMSG | EPOLLWRBAND | EPOLLWRNORM |
> > EPOLLRDBAND | EPOLLRDNORM | EPOLLHUP | EPOLLIN | EPOLLOUT) - previous ones are 5 -
> > while for failing connections it looks like the last event seen is 5 (EPOLLIN | EPOLLOUT).
> >
> > > This might give us a clue about packets being dropped, say if the
> > > accumulated packet became too big.
> >
> >
> > The following minor patch (might be white-space mangled) does prevent the issue
> > happening for me:
> >
> > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> > index 4904250a9aac..c102cd367c79 100644
> > --- a/net/ipv4/tcp_ipv4.c
> > +++ b/net/ipv4/tcp_ipv4.c
> > @@ -1667,7 +1667,7 @@ bool tcp_add_backlog(struct sock *sk, struct sk_buff *skb)
> >         if (TCP_SKB_CB(tail)->end_seq != TCP_SKB_CB(skb)->seq ||
> >             TCP_SKB_CB(tail)->ip_dsfield != TCP_SKB_CB(skb)->ip_dsfield ||
> >             ((TCP_SKB_CB(tail)->tcp_flags |
> > -             TCP_SKB_CB(skb)->tcp_flags) & TCPHDR_URG) ||
> > +             TCP_SKB_CB(skb)->tcp_flags) & (TCPHDR_URG | TCPHDR_FIN)) ||
> >             ((TCP_SKB_CB(tail)->tcp_flags ^
> >               TCP_SKB_CB(skb)->tcp_flags) & (TCPHDR_ECE | TCPHDR_CWR)) ||
> >  #ifdef CONFIG_TLS_DEVICE
> >
> > Cheers,
> > Bruno
> >
> >
> > > > rx-checksumming: on
> > > > tx-checksumming: on
> > > >         tx-checksum-ipv4: off [fixed]
> > > >         tx-checksum-ip-generic: on
> > > >         tx-checksum-ipv6: off [fixed]
> > > >         tx-checksum-fcoe-crc: off [fixed]
> > > >         tx-checksum-sctp: off [fixed]
> > > > scatter-gather: on
> > > >         tx-scatter-gather: on
> > > >         tx-scatter-gather-fraglist: off [fixed]
> > > > tcp-segmentation-offload: on
> > > >         tx-tcp-segmentation: on
> > > >         tx-tcp-ecn-segmentation: off [fixed]
> > > >         tx-tcp-mangleid-segmentation: off
> > > >         tx-tcp6-segmentation: on
> > > > udp-fragmentation-offload: off
> > > > generic-segmentation-offload: on
> > > > generic-receive-offload: on
> > > > large-receive-offload: on
> > > > rx-vlan-offload: on
> > > > tx-vlan-offload: on
> > > > ntuple-filters: off [fixed]
> > > > receive-hashing: off [fixed]
> > > > highdma: on
> > > > rx-vlan-filter: on [fixed]
> > > > vlan-challenged: off [fixed]
> > > > tx-lockless: off [fixed]
> > > > netns-local: off [fixed]
> > > > tx-gso-robust: off [fixed]
> > > > tx-fcoe-segmentation: off [fixed]
> > > > tx-gre-segmentation: off [fixed]
> > > > tx-gre-csum-segmentation: off [fixed]
> > > > tx-ipxip4-segmentation: off [fixed]
> > > > tx-ipxip6-segmentation: off [fixed]
> > > > tx-udp_tnl-segmentation: off [fixed]
> > > > tx-udp_tnl-csum-segmentation: off [fixed]
> > > > tx-gso-partial: off [fixed]
> > > > tx-sctp-segmentation: off [fixed]
> > > > tx-esp-segmentation: off [fixed]
> > > > tx-udp-segmentation: off [fixed]
> > > > fcoe-mtu: off [fixed]
> > > > tx-nocache-copy: off
> > > > loopback: off [fixed]
> > > > rx-fcs: off [fixed]
> > > > rx-all: off [fixed]
> > > > tx-vlan-stag-hw-insert: off [fixed]
> > > > rx-vlan-stag-hw-parse: off [fixed]
> > > > rx-vlan-stag-filter: off [fixed]
> > > > l2-fwd-offload: off [fixed]
> > > > hw-tc-offload: off [fixed]
> > > > esp-hw-offload: off [fixed]
> > > > esp-tx-csum-hw-offload: off [fixed]
> > > > rx-udp_tunnel-port-offload: off [fixed]
> > > > tls-hw-tx-offload: off [fixed]
> > > > tls-hw-rx-offload: off [fixed]
> > > > rx-gro-hw: off [fixed]
> > > > tls-hw-record: off [fixed]
> > > >
> > > >
> > > > I can reproduce the issue with kernels 5.0.x and as recent as 5.1-rc6.
> > > >
> > > > Cheers,
> > > > Bruno
> > > >
> > > > On Sunday, April 7, 2019 11:28:30 PM CEST, richard.purdie@linuxfoundation.org wrote:
> > > > > Hi,
> > > > >
> > > > > I've been chasing down why a python test from the python3 testsuite
> > > > > started failing and it seems to point to this kernel change in the
> > > > > networking stack.
> > > > >
> > > > > In kernels beyond commit 4f693b55c3d2d2239b8a0094b518a1e533cf75d5 the
> > > > > test hangs about 90% of the time (I've reproduced with 5.1-rc3, 5.0.7,
> > > > > 5.0-rc1 but not 4.18, 4.19 or 4.20). The reproducer is:
> > > > >
> > > > > $ python3 -m test test_httplib -v
> > > > > == CPython 3.7.2 (default, Apr 5 2019, 15:17:15) [GCC 8.3.0]
> > > > > == Linux-5.0.0-yocto-standard-x86_64-with-glibc2.2.5 little-endian
> > > > > == cwd: /var/volatile/tmp/test_python_288
> > > > > == CPU count: 1
> > > > > == encodings: locale=UTF-8, FS=utf-8
> > > > > [...]
> > > > > test_response_fileno (test.test_httplib.BasicTest) ...
> > > > >
> > > > > and it hangs in test_response_fileno.
> > > > >
> > > > > The test in question comes from Lib/test/test_httplib.py in the python
> > > > > source tree and the code is:
> > > > >
> > > > >     def test_response_fileno(self):
> > > > >         # Make sure fd returned by fileno is valid.
> > > > >         serv = socket.socket(
> > > > >             socket.AF_INET, socket.SOCK_STREAM, socket.IPPROTO_TCP)
> > > > >         self.addCleanup(serv.close)
> > > > >         serv.bind((HOST, 0))
> > > > >         serv.listen()
> > > > >
> > > > >         result = None
> > > > >         def run_server():
> > > > >             [conn, address] = serv.accept()
> > > > >             with conn, conn.makefile("rb") as reader:
> > > > >                 # Read the request header until a blank line
> > > > >                 while True:
> > > > >                     line = reader.readline()
> > > > >                     if not line.rstrip(b"\r\n"):
> > > > >                         break
> > > > >                 conn.sendall(b"HTTP/1.1 200 Connection established\r\n\r\n")
> > > > >                 nonlocal result
> > > > >                 result = reader.read()
> > > > >
> > > > >         thread = threading.Thread(target=run_server)
> > > > >         thread.start()
> > > > >         self.addCleanup(thread.join, float(1))
> > > > >         conn = client.HTTPConnection(*serv.getsockname())
> > > > >         conn.request("CONNECT", "dummy:1234")
> > > > >         response = conn.getresponse()
> > > > >         try:
> > > > >             self.assertEqual(response.status, client.OK)
> > > > >             s = socket.socket(fileno=response.fileno())
> > > > >             try:
> > > > >                 s.sendall(b"proxied data\n")
> > > > >             finally:
> > > > >                 s.detach()
> > > > >         finally:
> > > > >             response.close()
> > > > >             conn.close()
> > > > >         thread.join()
> > > > >         self.assertEqual(result, b"proxied data\n")
> > > > >
> > > > > I was hoping someone with more understanding of the networking stack
> > > > > could look at this and tell whether it's a bug in the python test or the
> > > > > kernel change, or otherwise give a pointer to where the problem might
> > > > > be? I'll freely admit this is not an area I know much about.
> > > > >
> > > > > Cheers,
> > > > >
> > > > > Richard
> > > > >
> > > > >
> > > > >

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH net-next 2/3] tcp: implement coalescing on backlog queue
  2019-04-25  7:55     ` Bruno Prémont
@ 2019-04-25 13:13       ` Bruno Prémont
  2019-04-25 13:30         ` Eric Dumazet
  0 siblings, 1 reply; 27+ messages in thread
From: Bruno Prémont @ 2019-04-25 13:13 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: richard.purdie, Neal Cardwell, Yuchung Cheng, David S. Miller,
	netdev, Alexander Kanavin, Bruce Ashfield

Hi Eric,

On Thu, 25 Apr 2019 09:55:31 +0200 Bruno Prémont wrote:
> Hi Eric,
> 
> On Wed, 24 Apr 2019 08:47:27 -0700 Eric Dumazet wrote:
> > On Wed, Apr 24, 2019 at 7:51 AM Bruno Prémont wrote:  
> > >
> > > Hi Eric,
> > >
> > > I'm seeing issues with this patch as well, not as regularly as for
> > > Richard, but still (up to about one in 30-50 TCP sessions).
> > >
> > > In my case I have a virtual machine (on VMWare) with this patch where
> > > NGINX as reverse proxy misses part (end) of payload from its upstream
> > > and times out on the upstream connection (while according to tcpdump all
> > > packets including upstream's FIN were sent and the upstream did get
> > > ACKs from the VM).
> > >
> > > From what browsers get from NGINX it feels as if at some point reading
> > > from the socket or waiting for data using select() never returned data
> > > that had arrived, as more than just the EOF is missing.
> > >
> > > The upstream is a hardware machine in the same subnet.
> > >
> > > My VM is using VMware VMXNET3 Ethernet Controller [15ad:07b0] (rev 01)
> > > as network adapter which lists the following features:
> > >    
> > 
> > Hi Bruno.
> > 
> > I suspect an EPOLLIN notification being lost by the application.
> >
> > The fact that the TCP backlog contains 1 instead of 2+ packets should not
> > change stack behavior; this packet should land in the socket receive
> > queue eventually.
> >
> > Are you using epoll() in Edge Trigger mode? You mention select(), but
> > select() is a rather old and inefficient API.
> 
> nginx is using epoll (c.f. http://nginx.org/en/docs/events.html)
> 
> For source, see here
> https://trac.nginx.org/nginx/browser/nginx/src/event/modules/ngx_epoll_module.c?rev=ebf8c9686b8ce7428f975d8a567935ea3722da70

I've been trying to reproduce while running nginx under strace (with verbose logging), but
have not been successful. It looks like tracing modifies the timing in a
way that prevents the issue.
Without verbose logging I could hit it (though it took way more perseverance).

What I do have for a working session under strace is:
  socket(AF_INET6, SOCK_STREAM, IPPROTO_IP) = 12
  ioctl(12, FIONBIO, [1]) = 0
  epoll_ctl(9, EPOLL_CTL_ADD, 12, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=1221158320, u64=139682147558832}}) = 0
  connect(12, {sa_family=AF_INET6, sin6_port=htons(8080), inet_pton(AF_INET6, "2001:a18:1:6::217", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28) = -1 EINPROGRESS (Operation now in progress)
  epoll_wait(9, [{EPOLLOUT, {u32=1221158320, u64=139682147558832}}], 512, 57230) = 1
  getsockopt(12, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
  writev(12, [{""..., 830}, {""..., 115}], 2) = 945
  epoll_wait(9, [{EPOLLIN|EPOLLOUT, {u32=1221158320, u64=139682147558832}}], 512, 57220) = 1
  recvfrom(12, ""..., 4096, 0, NULL, NULL) = 4096
  readv(12, [{""..., 4096}], 1) = 4096
  readv(12, [{""..., 4096}], 1) = 4096
  readv(12, [{""..., 4096}], 1) = 4096
  readv(12, [{""..., 4096}], 1) = 4096
  readv(12, [{""..., 4096}], 1) = 4096
  readv(12, [{""..., 4096}], 1) = 3855
  epoll_wait(9, [{EPOLLIN|EPOLLOUT|EPOLLRDHUP, {u32=1221158320, u64=139682147558832}}], 512, 57190) = 1
  readv(12, [{"", 241}, {"", 4096}, {"", 4096}, {"", 4096}, {"", 4096}, {"", 4096}, {"", 4096}], 7) = 0
  close(12)               = 0

Under failing session I got:
  socket(AF_INET, SOCK_STREAM, IPPROTO_IP) = 6
  ioctl(6, FIONBIO, [1])  = 0
  epoll_ctl(86, EPOLL_CTL_ADD, 6, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=1221159521, u64=139682147560033}}) = 0
  connect(6, {sa_family=AF_INET, sin_port=htons(8080), sin_addr=inet_addr("158.64.2.217")}, 16) = -1 EINPROGRESS (Operation now in progress)
  epoll_wait(86, [{EPOLLOUT, {u32=1221159521, u64=139682147560033}}], 512, 60000) = 1
  getsockopt(6, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
  writev(6, [{""..., 830}, {""..., 120}], 2) = 950
  epoll_wait(86, [{EPOLLIN|EPOLLOUT, {u32=1221159521, u64=139682147560033}}], 512, 60000) = 1
  recvfrom(6, ""..., 4096, 0, NULL, NULL) = 1740
  epoll_wait(86, [{EPOLLIN|EPOLLOUT, {u32=1221159521, u64=139682147560033}}], 512, 59980) = 1
  readv(6, [{""..., 2356}], 1) = 2356
  readv(6, [{""..., 4096}], 1) = 3362
  epoll_wait(86, [{EPOLLIN|EPOLLOUT, {u32=1221159521, u64=139682147560033}}], 512, 59980) = 1
  readv(6, [{""..., 734}, {""..., 4096}], 2) = 4830
  readv(6, [{""..., 4096}], 1) = 4096
  readv(6, [{""..., 4096}], 1) = 4096
  readv(6, [{""..., 4096}], 1) = 4096
  readv(6, [{""..., 4096}], 1) = 3619
  epoll_wait(86, [{EPOLLOUT, {u32=1221159521, u64=139682147560033}}], 512, 59980) = 1
  ... some other socket with activity and epoll_wait calls for 50 seconds
  epoll_wait(86, [], 512, 3331) = 0
  close(6)                = 0

 So the EPOLLRDHUP seems to never happen here.


As a side note, in strace output (over lots of sessions) I see almost no readv() returning -1 EAGAIN
except for the first readv() following recvfrom().


I hope this helps in looking for the cause, as does the fact that not collapsing FIN packets
prevents the issue.

Cheers,
Bruno


> > Could you watch/report the output of " ss -temoi "  for the frozen TCP flow ?  
> 
> Here it is, three distinct reproducing attempts:
> State      Recv-Q  Send-Q  Local Address:Port  Peer Address:Port
> ESTAB      0       0       158.64.2.228:44248  158.64.2.217:webcache              uid:83 ino:13245 sk:87 <->
>          skmem:(r0,rb131072,t0,tb46080,f0,w0,o0,bl0,d0) ts sack cubic wscale:7,7 rto:210 rtt:0.24/0.118 ato:40 mss:1448 rcvmss:1448 advmss:1448 cwnd:10 bytes_acked:949 bytes_received:28381 segs_out:12 segs_in:12 data_segs_out:1 data_segs_in:10 send 482.7Mbps lastsnd:46810 lastrcv:46790 lastack:46790 pacing_rate 965.3Mbps delivery_rate 74.3Mbps app_limited rcv_rtt:1 rcv_space:14480 minrtt:0.156
> 
> 
> ESTAB      0      0                                                                   2001:a18:1:6::228:33572                                                                             2001:a18:1:6::217:webcache              uid:83 ino:16699 sk:e1 <->
>          skmem:(r0,rb131072,t0,tb46080,f0,w0,o0,bl0,d0) ts sack cubic wscale:7,7 rto:210 rtt:0.231/0.11 ato:40 mss:1428 rcvmss:1428 advmss:1428 cwnd:10 bytes_acked:948 bytes_received:28474 segs_out:12 segs_in:12 data_segs_out:1 data_segs_in:10 send 494.5Mbps lastsnd:8380 lastrcv:8360 lastack:8360 pacing_rate 989.1Mbps delivery_rate 71.0Mbps app_limited rcv_rtt:1.109 rcv_space:14280 minrtt:0.161
> 
> 
> ESTAB      0      0                                                                        158.64.2.228:44578                                                                                  158.64.2.217:webcache              uid:83 ino:17628 sk:12c <->
>          skmem:(r0,rb131072,t0,tb46080,f0,w0,o0,bl0,d0) ts sack cubic wscale:7,7 rto:210 rtt:0.279/0.136 ato:40 mss:1448 rcvmss:1448 advmss:1448 cwnd:10 bytes_acked:949 bytes_received:28481 segs_out:12 segs_in:12 data_segs_out:1 data_segs_in:10 send 415.2Mbps lastsnd:11360 lastrcv:11330 lastack:11340 pacing_rate 828.2Mbps delivery_rate 61.9Mbps app_limited rcv_rtt:1 rcv_space:14480 minrtt:0.187
> 
> 
> From nginx debug logging I don't get a real clue, though it seems that for working connections
> the last event obtained is 2005 (EPOLLMSG | EPOLLWRBAND | EPOLLWRNORM |
> EPOLLRDBAND | EPOLLRDNORM | EPOLLHUP | EPOLLIN | EPOLLOUT) - previous ones are 5 -
> while for failing connections it looks like the last event seen is 5 (EPOLLIN | EPOLLOUT).
> 
> > This might give us a clue about packets being dropped, say if the
> > accumulated packet became too big.
> 
> 
> The following minor patch (might be white-space mangled) does prevent the issue
> happening for me:
> 
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index 4904250a9aac..c102cd367c79 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -1667,7 +1667,7 @@ bool tcp_add_backlog(struct sock *sk, struct sk_buff *skb)
>         if (TCP_SKB_CB(tail)->end_seq != TCP_SKB_CB(skb)->seq ||
>             TCP_SKB_CB(tail)->ip_dsfield != TCP_SKB_CB(skb)->ip_dsfield ||
>             ((TCP_SKB_CB(tail)->tcp_flags |
> -             TCP_SKB_CB(skb)->tcp_flags) & TCPHDR_URG) ||
> +             TCP_SKB_CB(skb)->tcp_flags) & (TCPHDR_URG | TCPHDR_FIN)) ||
>             ((TCP_SKB_CB(tail)->tcp_flags ^
>               TCP_SKB_CB(skb)->tcp_flags) & (TCPHDR_ECE | TCPHDR_CWR)) ||
>  #ifdef CONFIG_TLS_DEVICE
> 
> Cheers,
> Bruno
> 
> 
> > > rx-checksumming: on
> > > tx-checksumming: on
> > >         tx-checksum-ipv4: off [fixed]
> > >         tx-checksum-ip-generic: on
> > >         tx-checksum-ipv6: off [fixed]
> > >         tx-checksum-fcoe-crc: off [fixed]
> > >         tx-checksum-sctp: off [fixed]
> > > scatter-gather: on
> > >         tx-scatter-gather: on
> > >         tx-scatter-gather-fraglist: off [fixed]
> > > tcp-segmentation-offload: on
> > >         tx-tcp-segmentation: on
> > >         tx-tcp-ecn-segmentation: off [fixed]
> > >         tx-tcp-mangleid-segmentation: off
> > >         tx-tcp6-segmentation: on
> > > udp-fragmentation-offload: off
> > > generic-segmentation-offload: on
> > > generic-receive-offload: on
> > > large-receive-offload: on
> > > rx-vlan-offload: on
> > > tx-vlan-offload: on
> > > ntuple-filters: off [fixed]
> > > receive-hashing: off [fixed]
> > > highdma: on
> > > rx-vlan-filter: on [fixed]
> > > vlan-challenged: off [fixed]
> > > tx-lockless: off [fixed]
> > > netns-local: off [fixed]
> > > tx-gso-robust: off [fixed]
> > > tx-fcoe-segmentation: off [fixed]
> > > tx-gre-segmentation: off [fixed]
> > > tx-gre-csum-segmentation: off [fixed]
> > > tx-ipxip4-segmentation: off [fixed]
> > > tx-ipxip6-segmentation: off [fixed]
> > > tx-udp_tnl-segmentation: off [fixed]
> > > tx-udp_tnl-csum-segmentation: off [fixed]
> > > tx-gso-partial: off [fixed]
> > > tx-sctp-segmentation: off [fixed]
> > > tx-esp-segmentation: off [fixed]
> > > tx-udp-segmentation: off [fixed]
> > > fcoe-mtu: off [fixed]
> > > tx-nocache-copy: off
> > > loopback: off [fixed]
> > > rx-fcs: off [fixed]
> > > rx-all: off [fixed]
> > > tx-vlan-stag-hw-insert: off [fixed]
> > > rx-vlan-stag-hw-parse: off [fixed]
> > > rx-vlan-stag-filter: off [fixed]
> > > l2-fwd-offload: off [fixed]
> > > hw-tc-offload: off [fixed]
> > > esp-hw-offload: off [fixed]
> > > esp-tx-csum-hw-offload: off [fixed]
> > > rx-udp_tunnel-port-offload: off [fixed]
> > > tls-hw-tx-offload: off [fixed]
> > > tls-hw-rx-offload: off [fixed]
> > > rx-gro-hw: off [fixed]
> > > tls-hw-record: off [fixed]
> > >
> > >
> > > I can reproduce the issue with kernels 5.0.x and as recent as 5.1-rc6.
> > >
> > > Cheers,
> > > Bruno
> > >
> > > On Sunday, April 7, 2019 11:28:30 PM CEST, richard.purdie@linuxfoundation.org wrote:    
> > > > Hi,
> > > >
> > > > I've been chasing down why a python test from the python3 testsuite
> > > > started failing and it seems to point to this kernel change in the
> > > > networking stack.
> > > >
> > > > In kernels beyond commit 4f693b55c3d2d2239b8a0094b518a1e533cf75d5 the
> > > > test hangs about 90% of the time (I've reproduced with 5.1-rc3, 5.0.7,
> > > > 5.0-rc1 but not 4.18, 4.19 or 4.20). The reproducer is:
> > > >
> > > > $ python3 -m test test_httplib -v
> > > > == CPython 3.7.2 (default, Apr 5 2019, 15:17:15) [GCC 8.3.0]
> > > > == Linux-5.0.0-yocto-standard-x86_64-with-glibc2.2.5 little-endian
> > > > == cwd: /var/volatile/tmp/test_python_288
> > > > == CPU count: 1
> > > > == encodings: locale=UTF-8, FS=utf-8
> > > > [...]
> > > > test_response_fileno (test.test_httplib.BasicTest) ...
> > > >
> > > > and it hangs in test_response_fileno.
> > > >
> > > > The test in question comes from Lib/test/test_httplib.py in the python
> > > > source tree and the code is:
> > > >
> > > >     def test_response_fileno(self):
> > > >         # Make sure fd returned by fileno is valid.
> > > >         serv = socket.socket(
> > > >             socket.AF_INET, socket.SOCK_STREAM, socket.IPPROTO_TCP)
> > > >         self.addCleanup(serv.close)
> > > >         serv.bind((HOST, 0))
> > > >         serv.listen()
> > > >
> > > >         result = None
> > > >         def run_server():
> > > >             [conn, address] = serv.accept()
> > > >             with conn, conn.makefile("rb") as reader:
> > > >                 # Read the request header until a blank line
> > > >                 while True:
> > > >                     line = reader.readline()
> > > >                     if not line.rstrip(b"\r\n"):
> > > >                         break
> > > >                 conn.sendall(b"HTTP/1.1 200 Connection established\r\n\r\n")
> > > >                 nonlocal result
> > > >                 result = reader.read()
> > > >
> > > >         thread = threading.Thread(target=run_server)
> > > >         thread.start()
> > > >         self.addCleanup(thread.join, float(1))
> > > >         conn = client.HTTPConnection(*serv.getsockname())
> > > >         conn.request("CONNECT", "dummy:1234")
> > > >         response = conn.getresponse()
> > > >         try:
> > > >             self.assertEqual(response.status, client.OK)
> > > >             s = socket.socket(fileno=response.fileno())
> > > >             try:
> > > >                 s.sendall(b"proxied data\n")
> > > >             finally:
> > > >                 s.detach()
> > > >         finally:
> > > >             response.close()
> > > >             conn.close()
> > > >         thread.join()
> > > >         self.assertEqual(result, b"proxied data\n")
> > > >
> > > > I was hoping someone with more understanding of the networking stack
> > > > could look at this and tell whether it's a bug in the python test or the
> > > > kernel change, or otherwise give a pointer to where the problem might
> > > > be? I'll freely admit this is not an area I know much about.
> > > >
> > > > Cheers,
> > > >
> > > > Richard
> > > >
> > > >
> > > >    

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH net-next 2/3] tcp: implement coalescing on backlog queue
  2019-04-24 15:47   ` Eric Dumazet
@ 2019-04-25  7:55     ` Bruno Prémont
  2019-04-25 13:13       ` Bruno Prémont
  0 siblings, 1 reply; 27+ messages in thread
From: Bruno Prémont @ 2019-04-25  7:55 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: richard.purdie, Neal Cardwell, Yuchung Cheng, David S. Miller,
	netdev, Alexander Kanavin, Bruce Ashfield

Hi Eric,

On Wed, 24 Apr 2019 08:47:27 -0700 Eric Dumazet wrote:
> On Wed, Apr 24, 2019 at 7:51 AM Bruno Prémont wrote:
> >
> > Hi Eric,
> >
> > I'm seeing issues with this patch as well, not as regularly as for
> > Richard, but still (up to about one in 30-50 TCP sessions).
> >
> > In my case I have a virtual machine (on VMWare) with this patch where
> > NGINX as reverse proxy misses part (end) of payload from its upstream
> > and times out on the upstream connection (while according to tcpdump all
> > packets including upstream's FIN were sent and the upstream did get
> > ACKs from the VM).
> >
> > From what browsers get from NGINX it feels as if at some point reading
> > from the socket or waiting for data using select() never returned data
> > that had arrived, as more than just the EOF is missing.
> >
> > The upstream is a hardware machine in the same subnet.
> >
> > My VM is using VMware VMXNET3 Ethernet Controller [15ad:07b0] (rev 01)
> > as network adapter which lists the following features:
> >  
> 
> Hi Bruno.
> 
> I suspect an EPOLLIN notification being lost by the application.
>
> The fact that the TCP backlog contains 1 instead of 2+ packets should not
> change stack behavior; this packet should land in the socket receive
> queue eventually.
>
> Are you using epoll() in Edge Trigger mode? You mention select(), but
> select() is a rather old and inefficient API.

nginx is using epoll (c.f. http://nginx.org/en/docs/events.html)

For source, see here
https://trac.nginx.org/nginx/browser/nginx/src/event/modules/ngx_epoll_module.c?rev=ebf8c9686b8ce7428f975d8a567935ea3722da70

> Could you watch/report the output of " ss -temoi "  for the frozen TCP flow ?

Here it is, three distinct reproducing attempts:
State      Recv-Q  Send-Q  Local Address:Port  Peer Address:Port
ESTAB      0       0       158.64.2.228:44248  158.64.2.217:webcache              uid:83 ino:13245 sk:87 <->
         skmem:(r0,rb131072,t0,tb46080,f0,w0,o0,bl0,d0) ts sack cubic wscale:7,7 rto:210 rtt:0.24/0.118 ato:40 mss:1448 rcvmss:1448 advmss:1448 cwnd:10 bytes_acked:949 bytes_received:28381 segs_out:12 segs_in:12 data_segs_out:1 data_segs_in:10 send 482.7Mbps lastsnd:46810 lastrcv:46790 lastack:46790 pacing_rate 965.3Mbps delivery_rate 74.3Mbps app_limited rcv_rtt:1 rcv_space:14480 minrtt:0.156


ESTAB      0      0                                                                   2001:a18:1:6::228:33572                                                                             2001:a18:1:6::217:webcache              uid:83 ino:16699 sk:e1 <->
         skmem:(r0,rb131072,t0,tb46080,f0,w0,o0,bl0,d0) ts sack cubic wscale:7,7 rto:210 rtt:0.231/0.11 ato:40 mss:1428 rcvmss:1428 advmss:1428 cwnd:10 bytes_acked:948 bytes_received:28474 segs_out:12 segs_in:12 data_segs_out:1 data_segs_in:10 send 494.5Mbps lastsnd:8380 lastrcv:8360 lastack:8360 pacing_rate 989.1Mbps delivery_rate 71.0Mbps app_limited rcv_rtt:1.109 rcv_space:14280 minrtt:0.161


ESTAB      0      0                                                                        158.64.2.228:44578                                                                                  158.64.2.217:webcache              uid:83 ino:17628 sk:12c <->
         skmem:(r0,rb131072,t0,tb46080,f0,w0,o0,bl0,d0) ts sack cubic wscale:7,7 rto:210 rtt:0.279/0.136 ato:40 mss:1448 rcvmss:1448 advmss:1448 cwnd:10 bytes_acked:949 bytes_received:28481 segs_out:12 segs_in:12 data_segs_out:1 data_segs_in:10 send 415.2Mbps lastsnd:11360 lastrcv:11330 lastack:11340 pacing_rate 828.2Mbps delivery_rate 61.9Mbps app_limited rcv_rtt:1 rcv_space:14480 minrtt:0.187


From nginx debug logging I don't get a real clue, though it seems that for working connections
the last event obtained is 2005 (EPOLLMSG | EPOLLWRBAND | EPOLLWRNORM |
EPOLLRDBAND | EPOLLRDNORM | EPOLLHUP | EPOLLIN | EPOLLOUT) - previous ones are 5 -
while for failing connections it looks like the last event seen is 5 (EPOLLIN | EPOLLOUT).

> This might give us a clue about packets being dropped, say if the
> accumulated packet became too big.


The following minor patch (might be white-space mangled) does prevent the issue
happening for me:

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 4904250a9aac..c102cd367c79 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1667,7 +1667,7 @@ bool tcp_add_backlog(struct sock *sk, struct sk_buff *skb)
        if (TCP_SKB_CB(tail)->end_seq != TCP_SKB_CB(skb)->seq ||
            TCP_SKB_CB(tail)->ip_dsfield != TCP_SKB_CB(skb)->ip_dsfield ||
            ((TCP_SKB_CB(tail)->tcp_flags |
-             TCP_SKB_CB(skb)->tcp_flags) & TCPHDR_URG) ||
+             TCP_SKB_CB(skb)->tcp_flags) & (TCPHDR_URG | TCPHDR_FIN)) ||
            ((TCP_SKB_CB(tail)->tcp_flags ^
              TCP_SKB_CB(skb)->tcp_flags) & (TCPHDR_ECE | TCPHDR_CWR)) ||
 #ifdef CONFIG_TLS_DEVICE

Cheers,
Bruno


> > rx-checksumming: on
> > tx-checksumming: on
> >         tx-checksum-ipv4: off [fixed]
> >         tx-checksum-ip-generic: on
> >         tx-checksum-ipv6: off [fixed]
> >         tx-checksum-fcoe-crc: off [fixed]
> >         tx-checksum-sctp: off [fixed]
> > scatter-gather: on
> >         tx-scatter-gather: on
> >         tx-scatter-gather-fraglist: off [fixed]
> > tcp-segmentation-offload: on
> >         tx-tcp-segmentation: on
> >         tx-tcp-ecn-segmentation: off [fixed]
> >         tx-tcp-mangleid-segmentation: off
> >         tx-tcp6-segmentation: on
> > udp-fragmentation-offload: off
> > generic-segmentation-offload: on
> > generic-receive-offload: on
> > large-receive-offload: on
> > rx-vlan-offload: on
> > tx-vlan-offload: on
> > ntuple-filters: off [fixed]
> > receive-hashing: off [fixed]
> > highdma: on
> > rx-vlan-filter: on [fixed]
> > vlan-challenged: off [fixed]
> > tx-lockless: off [fixed]
> > netns-local: off [fixed]
> > tx-gso-robust: off [fixed]
> > tx-fcoe-segmentation: off [fixed]
> > tx-gre-segmentation: off [fixed]
> > tx-gre-csum-segmentation: off [fixed]
> > tx-ipxip4-segmentation: off [fixed]
> > tx-ipxip6-segmentation: off [fixed]
> > tx-udp_tnl-segmentation: off [fixed]
> > tx-udp_tnl-csum-segmentation: off [fixed]
> > tx-gso-partial: off [fixed]
> > tx-sctp-segmentation: off [fixed]
> > tx-esp-segmentation: off [fixed]
> > tx-udp-segmentation: off [fixed]
> > fcoe-mtu: off [fixed]
> > tx-nocache-copy: off
> > loopback: off [fixed]
> > rx-fcs: off [fixed]
> > rx-all: off [fixed]
> > tx-vlan-stag-hw-insert: off [fixed]
> > rx-vlan-stag-hw-parse: off [fixed]
> > rx-vlan-stag-filter: off [fixed]
> > l2-fwd-offload: off [fixed]
> > hw-tc-offload: off [fixed]
> > esp-hw-offload: off [fixed]
> > esp-tx-csum-hw-offload: off [fixed]
> > rx-udp_tunnel-port-offload: off [fixed]
> > tls-hw-tx-offload: off [fixed]
> > tls-hw-rx-offload: off [fixed]
> > rx-gro-hw: off [fixed]
> > tls-hw-record: off [fixed]
> >
> >
> > I can reproduce the issue with kernels 5.0.x and as recent as 5.1-rc6.
> >
> > Cheers,
> > Bruno
> >
> > On Sunday, April 7, 2019 11:28:30 PM CEST, richard.purdie@linuxfoundation.org wrote:  
> > > Hi,
> > >
> > > I've been chasing down why a python test from the python3 testsuite
> > > started failing and it seems to point to this kernel change in the
> > > networking stack.
> > >
> > > In kernels beyond commit 4f693b55c3d2d2239b8a0094b518a1e533cf75d5 the
> > > test hangs about 90% of the time (I've reproduced with 5.1-rc3, 5.0.7,
> > > 5.0-rc1 but not 4.18, 4.19 or 4.20). The reproducer is:
> > >
> > > $ python3 -m test test_httplib -v
> > > == CPython 3.7.2 (default, Apr 5 2019, 15:17:15) [GCC 8.3.0]
> > > == Linux-5.0.0-yocto-standard-x86_64-with-glibc2.2.5 little-endian
> > > == cwd: /var/volatile/tmp/test_python_288
> > > == CPU count: 1
> > > == encodings: locale=UTF-8, FS=utf-8
> > > [...]
> > > test_response_fileno (test.test_httplib.BasicTest) ...
> > >
> > > and it hangs in test_response_fileno.
> > >
> > > The test in question comes from Lib/test/test_httplib.py in the python
> > > source tree and the code is:
> > >
> > >     def test_response_fileno(self):
> > >         # Make sure fd returned by fileno is valid.
> > >         serv = socket.socket(
> > >             socket.AF_INET, socket.SOCK_STREAM, socket.IPPROTO_TCP)
> > >         self.addCleanup(serv.close)
> > >         serv.bind((HOST, 0))
> > >         serv.listen()
> > >
> > >         result = None
> > >         def run_server():
> > >             [conn, address] = serv.accept()
> > >             with conn, conn.makefile("rb") as reader:
> > >                 # Read the request header until a blank line
> > >                 while True:
> > >                     line = reader.readline()
> > >                     if not line.rstrip(b"\r\n"):
> > >                         break
> > >                 conn.sendall(b"HTTP/1.1 200 Connection established\r\n\r\n")
> > >                 nonlocal result
> > >                 result = reader.read()
> > >
> > >         thread = threading.Thread(target=run_server)
> > >         thread.start()
> > >         self.addCleanup(thread.join, float(1))
> > >         conn = client.HTTPConnection(*serv.getsockname())
> > >         conn.request("CONNECT", "dummy:1234")
> > >         response = conn.getresponse()
> > >         try:
> > >             self.assertEqual(response.status, client.OK)
> > >             s = socket.socket(fileno=response.fileno())
> > >             try:
> > >                 s.sendall(b"proxied data\n")
> > >             finally:
> > >                 s.detach()
> > >         finally:
> > >             response.close()
> > >             conn.close()
> > >         thread.join()
> > >         self.assertEqual(result, b"proxied data\n")
> > >
> > > I was hoping someone with more understanding of the networking stack
> > > could look at this and tell whether it's a bug in the python test or the
> > > kernel change, or otherwise give a pointer to where the problem might
> > > be? I'll freely admit this is not an area I know much about.
> > >
> > > Cheers,
> > >
> > > Richard
> > >
> > >
> > >  

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH net-next 2/3] tcp: implement coalescing on backlog queue
  2019-04-24 14:51 ` Bruno Prémont
@ 2019-04-24 15:47   ` Eric Dumazet
  2019-04-25  7:55     ` Bruno Prémont
  0 siblings, 1 reply; 27+ messages in thread
From: Eric Dumazet @ 2019-04-24 15:47 UTC (permalink / raw)
  To: Bruno Prémont
  Cc: richard.purdie, Neal Cardwell, Yuchung Cheng, David S. Miller,
	netdev, Alexander Kanavin, Bruce Ashfield

On Wed, Apr 24, 2019 at 7:51 AM Bruno Prémont <bonbons@sysophe.eu> wrote:
>
> Hi Eric,
>
> I'm seeing issues with this patch as well, not as regularly as for
> Richard, but still (up to about one in 30-50 TCP sessions).
>
> In my case I have a virtual machine (on VMWare) with this patch where
> NGINX as reverse proxy misses part (end) of payload from its upstream
> and times out on the upstream connection (while according to tcpdump all
> packets including upstream's FIN were sent and the upstream did get
> ACKs from the VM).
>
> From what browsers get from NGINX it feels as if at some point reading
> from the socket or waiting for data using select() never returned data
> that had arrived, as more than just the EOF is missing.
>
> The upstream is a hardware machine in the same subnet.
>
> My VM is using VMware VMXNET3 Ethernet Controller [15ad:07b0] (rev 01)
> as network adapter which lists the following features:
>

Hi Bruno.

I suspect an EPOLLIN notification being lost by the application.

The fact that the TCP backlog contains 1 instead of 2+ packets should not
change stack behavior; this packet should land in the socket receive
queue eventually.

Are you using epoll() in Edge Trigger mode? You mention select(), but
select() is a rather old and inefficient API.

Could you watch/report the output of " ss -temoi "  for the frozen TCP flow ?

This might give us a clue about packets being dropped, say if the
accumulated packet became too big.
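
For reference, the edge-trigger contract is that after an EPOLLIN edge the
application has to keep reading until it sees EAGAIN (or EOF); bytes left in
the receive queue do not generate a new edge by themselves. A rough, untested
sketch of that pattern (error handling trimmed):

#include <errno.h>
#include <unistd.h>

/* Returns 1 once the socket is fully drained (EAGAIN), 0 on EOF (peer FIN),
 * -1 on a real error.  Assumes fd is non-blocking and was registered with
 * EPOLLIN | EPOLLRDHUP | EPOLLET.
 */
static int drain_until_eagain(int fd)
{
	char buf[4096];

	for (;;) {
		ssize_t n = read(fd, buf, sizeof(buf));

		if (n > 0)
			continue;	/* consume payload */
		if (n == 0)
			return 0;	/* EOF: the FIN was reached */
		if (errno == EAGAIN || errno == EWOULDBLOCK)
			return 1;	/* drained, wait for the next edge */
		return -1;
	}
}

If the reader stops before hitting EAGAIN or EOF, anything already queued
(including a final data+FIN segment) only becomes visible again on the next
edge, so a single missed notification is enough to freeze the flow.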



> rx-checksumming: on
> tx-checksumming: on
>         tx-checksum-ipv4: off [fixed]
>         tx-checksum-ip-generic: on
>         tx-checksum-ipv6: off [fixed]
>         tx-checksum-fcoe-crc: off [fixed]
>         tx-checksum-sctp: off [fixed]
> scatter-gather: on
>         tx-scatter-gather: on
>         tx-scatter-gather-fraglist: off [fixed]
> tcp-segmentation-offload: on
>         tx-tcp-segmentation: on
>         tx-tcp-ecn-segmentation: off [fixed]
>         tx-tcp-mangleid-segmentation: off
>         tx-tcp6-segmentation: on
> udp-fragmentation-offload: off
> generic-segmentation-offload: on
> generic-receive-offload: on
> large-receive-offload: on
> rx-vlan-offload: on
> tx-vlan-offload: on
> ntuple-filters: off [fixed]
> receive-hashing: off [fixed]
> highdma: on
> rx-vlan-filter: on [fixed]
> vlan-challenged: off [fixed]
> tx-lockless: off [fixed]
> netns-local: off [fixed]
> tx-gso-robust: off [fixed]
> tx-fcoe-segmentation: off [fixed]
> tx-gre-segmentation: off [fixed]
> tx-gre-csum-segmentation: off [fixed]
> tx-ipxip4-segmentation: off [fixed]
> tx-ipxip6-segmentation: off [fixed]
> tx-udp_tnl-segmentation: off [fixed]
> tx-udp_tnl-csum-segmentation: off [fixed]
> tx-gso-partial: off [fixed]
> tx-sctp-segmentation: off [fixed]
> tx-esp-segmentation: off [fixed]
> tx-udp-segmentation: off [fixed]
> fcoe-mtu: off [fixed]
> tx-nocache-copy: off
> loopback: off [fixed]
> rx-fcs: off [fixed]
> rx-all: off [fixed]
> tx-vlan-stag-hw-insert: off [fixed]
> rx-vlan-stag-hw-parse: off [fixed]
> rx-vlan-stag-filter: off [fixed]
> l2-fwd-offload: off [fixed]
> hw-tc-offload: off [fixed]
> esp-hw-offload: off [fixed]
> esp-tx-csum-hw-offload: off [fixed]
> rx-udp_tunnel-port-offload: off [fixed]
> tls-hw-tx-offload: off [fixed]
> tls-hw-rx-offload: off [fixed]
> rx-gro-hw: off [fixed]
> tls-hw-record: off [fixed]
>
>
> I can reproduce the issue with kernels 5.0.x and as recent as 5.1-rc6.
>
> Cheers,
> Bruno
>
> On Sunday, April 7, 2019 11:28:30 PM CEST, richard.purdie@linuxfoundation.org wrote:
> > Hi,
> >
> > I've been chasing down why a python test from the python3 testsuite
> > started failing and it seems to point to this kernel change in the
> > networking stack.
> >
> > In kernels beyond commit 4f693b55c3d2d2239b8a0094b518a1e533cf75d5 the
> > test hangs about 90% of the time (I've reproduced with 5.1-rc3, 5.0.7,
> > 5.0-rc1 but not 4.18, 4.19 or 4.20). The reproducer is:
> >
> > $ python3 -m test test_httplib -v
> > == CPython 3.7.2 (default, Apr 5 2019, 15:17:15) [GCC 8.3.0]
> > == Linux-5.0.0-yocto-standard-x86_64-with-glibc2.2.5 little-endian
> > == cwd: /var/volatile/tmp/test_python_288
> > == CPU count: 1
> > == encodings: locale=UTF-8, FS=utf-8
> > [...]
> > test_response_fileno (test.test_httplib.BasicTest) ...
> >
> > and it hangs in test_response_fileno.
> >
> > The test in question comes from Lib/test/test_httplib.py in the python
> > source tree and the code is:
> >
> >     def test_response_fileno(self):
> >         # Make sure fd returned by fileno is valid.
> >         serv = socket.socket(
> >             socket.AF_INET, socket.SOCK_STREAM, socket.IPPROTO_TCP)
> >         self.addCleanup(serv.close)
> >         serv.bind((HOST, 0))
> >         serv.listen()
> >
> >         result = None
> >         def run_server():
> >             [conn, address] = serv.accept()
> >             with conn, conn.makefile("rb") as reader:
> >                 # Read the request header until a blank line
> >                 while True:
> >                     line = reader.readline()
> >                     if not line.rstrip(b"\r\n"):
> >                         break
> >                 conn.sendall(b"HTTP/1.1 200 Connection established\r\n\r\n")
> >                 nonlocal result
> >                 result = reader.read()
> >
> >         thread = threading.Thread(target=run_server)
> >         thread.start()
> >         self.addCleanup(thread.join, float(1))
> >         conn = client.HTTPConnection(*serv.getsockname())
> >         conn.request("CONNECT", "dummy:1234")
> >         response = conn.getresponse()
> >         try:
> >             self.assertEqual(response.status, client.OK)
> >             s = socket.socket(fileno=response.fileno())
> >             try:
> >                 s.sendall(b"proxied data\n")
> >             finally:
> >                 s.detach()
> >         finally:
> >             response.close()
> >             conn.close()
> >         thread.join()
> >         self.assertEqual(result, b"proxied data\n")
> >
> > I was hoping someone with more understanding of the networking stack
> > could look at this and tell whether it's a bug in the python test or the
> > kernel change, or otherwise give a pointer to where the problem might
> > be? I'll freely admit this is not an area I know much about.
> >
> > Cheers,
> >
> > Richard
> >
> >
> >

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH net-next 2/3] tcp: implement coalescing on backlog queue
  2019-04-07 21:28 [PATCH net-next 2/3] tcp: implement coalescing on backlog queue richard.purdie
@ 2019-04-24 14:51 ` Bruno Prémont
  2019-04-24 15:47   ` Eric Dumazet
  0 siblings, 1 reply; 27+ messages in thread
From: Bruno Prémont @ 2019-04-24 14:51 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: richard.purdie, Neal Cardwell, Yuchung Cheng, David S. Miller,
	netdev, Alexander Kanavin, Bruce Ashfield

Hi Eric,

I'm seeing issues with this patch as well, not as regularly as for
Richard, but still (up to about one in 30-50 TCP sessions).

In my case I have a virtual machine (on VMWare) with this patch where
NGINX as reverse proxy misses part (end) of payload from its upstream
and times out on the upstream connection (while according to tcpdump all
packets including upstream's FIN were sent and the upstream did get
ACKs from the VM).

From what browsers get from NGINX it feels as if at some point reading
from the socket or waiting for data using select() never returned data
that had arrived, as more than just the EOF is missing.

The upstream is a hardware machine in the same subnet.

My VM is using VMware VMXNET3 Ethernet Controller [15ad:07b0] (rev 01)
as network adapter which lists the following features:

rx-checksumming: on
tx-checksumming: on
        tx-checksum-ipv4: off [fixed]
        tx-checksum-ip-generic: on
        tx-checksum-ipv6: off [fixed]
        tx-checksum-fcoe-crc: off [fixed]
        tx-checksum-sctp: off [fixed]
scatter-gather: on
        tx-scatter-gather: on
        tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: on
        tx-tcp-segmentation: on
        tx-tcp-ecn-segmentation: off [fixed]
        tx-tcp-mangleid-segmentation: off
        tx-tcp6-segmentation: on
udp-fragmentation-offload: off
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: on
rx-vlan-offload: on
tx-vlan-offload: on
ntuple-filters: off [fixed]
receive-hashing: off [fixed]
highdma: on
rx-vlan-filter: on [fixed]
vlan-challenged: off [fixed]
tx-lockless: off [fixed]
netns-local: off [fixed]
tx-gso-robust: off [fixed]
tx-fcoe-segmentation: off [fixed]
tx-gre-segmentation: off [fixed]
tx-gre-csum-segmentation: off [fixed]
tx-ipxip4-segmentation: off [fixed]
tx-ipxip6-segmentation: off [fixed]
tx-udp_tnl-segmentation: off [fixed]
tx-udp_tnl-csum-segmentation: off [fixed]
tx-gso-partial: off [fixed]
tx-sctp-segmentation: off [fixed]
tx-esp-segmentation: off [fixed]
tx-udp-segmentation: off [fixed]
fcoe-mtu: off [fixed]
tx-nocache-copy: off
loopback: off [fixed]
rx-fcs: off [fixed]
rx-all: off [fixed]
tx-vlan-stag-hw-insert: off [fixed]
rx-vlan-stag-hw-parse: off [fixed]
rx-vlan-stag-filter: off [fixed]
l2-fwd-offload: off [fixed]
hw-tc-offload: off [fixed]
esp-hw-offload: off [fixed]
esp-tx-csum-hw-offload: off [fixed]
rx-udp_tunnel-port-offload: off [fixed]
tls-hw-tx-offload: off [fixed]
tls-hw-rx-offload: off [fixed]
rx-gro-hw: off [fixed]
tls-hw-record: off [fixed]


I can reproduce the issue with kernels 5.0.x and as recent as 5.1-rc6.

Cheers,
Bruno

On Sunday, April 7, 2019 11:28:30 PM CEST, richard.purdie@linuxfoundation.org wrote:
> Hi,
>
> I've been chasing down why a python test from the python3 testsuite
> started failing and it seems to point to this kernel change in the
> networking stack.
>
> In kernels beyond commit 4f693b55c3d2d2239b8a0094b518a1e533cf75d5 the
> test hangs about 90% of the time (I've reproduced with 5.1-rc3, 5.0.7,
> 5.0-rc1 but not 4.18, 4.19 or 4.20). The reproducer is:
>
> $ python3 -m test test_httplib -v
> == CPython 3.7.2 (default, Apr 5 2019, 15:17:15) [GCC 8.3.0]
> == Linux-5.0.0-yocto-standard-x86_64-with-glibc2.2.5 little-endian
> == cwd: /var/volatile/tmp/test_python_288
> == CPU count: 1
> == encodings: locale=UTF-8, FS=utf-8
> [...]
> test_response_fileno (test.test_httplib.BasicTest) ... 
>
> and it hangs in test_response_fileno.
>
> The test in question comes from Lib/test/test_httplib.py in the python
> source tree and the code is:
>
>     def test_response_fileno(self):
>         # Make sure fd returned by fileno is valid.
>         serv = socket.socket(
>             socket.AF_INET, socket.SOCK_STREAM, socket.IPPROTO_TCP)
>         self.addCleanup(serv.close)
>         serv.bind((HOST, 0))
>         serv.listen()
>
>         result = None
>         def run_server():
>             [conn, address] = serv.accept()
>             with conn, conn.makefile("rb") as reader:
>                 # Read the request header until a blank line
>                 while True:
>                     line = reader.readline()
>                     if not line.rstrip(b"\r\n"):
>                         break
>                 conn.sendall(b"HTTP/1.1 200 Connection established\r\n\r\n")
>                 nonlocal result
>                 result = reader.read()
>
>         thread = threading.Thread(target=run_server)
>         thread.start()
>         self.addCleanup(thread.join, float(1))
>         conn = client.HTTPConnection(*serv.getsockname())
>         conn.request("CONNECT", "dummy:1234")
>         response = conn.getresponse()
>         try:
>             self.assertEqual(response.status, client.OK)
>             s = socket.socket(fileno=response.fileno())
>             try:
>                 s.sendall(b"proxied data\n")
>             finally:
>                 s.detach()
>         finally:
>             response.close()
>             conn.close()
>         thread.join()
>         self.assertEqual(result, b"proxied data\n")
>
> I was hoping someone with more understanding of the networking stack
> could look at this and tell whether it's a bug in the python test or the
> kernel change, or otherwise give a pointer to where the problem might
> be? I'll freely admit this is not an area I know much about.
>
> Cheers,
>
> Richard
>
>
>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH net-next 2/3] tcp: implement coalescing on backlog queue
@ 2019-04-07 21:28 richard.purdie
  2019-04-24 14:51 ` Bruno Prémont
  0 siblings, 1 reply; 27+ messages in thread
From: richard.purdie @ 2019-04-07 21:28 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Neal Cardwell, Yuchung Cheng, David S. Miller, netdev,
	Alexander Kanavin, Bruce Ashfield

Hi,

I've been chasing down why a python test from the python3 testsuite
started failing and it seems to point to this kernel change in the
networking stack.

In kernels beyond commit 4f693b55c3d2d2239b8a0094b518a1e533cf75d5 the
test hangs about 90% of the time (I've reproduced with 5.1-rc3, 5.0.7,
5.0-rc1 but not 4.18, 4.19 or 4.20). The reproducer is:

$ python3 -m test test_httplib -v
== CPython 3.7.2 (default, Apr 5 2019, 15:17:15) [GCC 8.3.0]
== Linux-5.0.0-yocto-standard-x86_64-with-glibc2.2.5 little-endian
== cwd: /var/volatile/tmp/test_python_288
== CPU count: 1
== encodings: locale=UTF-8, FS=utf-8
[...]
test_response_fileno (test.test_httplib.BasicTest) ... 

and it hangs in test_response_fileno.

The test in question comes from Lib/test/test_httplib.py in the python
source tree and the code is:

    def test_response_fileno(self):
        # Make sure fd returned by fileno is valid.
        serv = socket.socket(
            socket.AF_INET, socket.SOCK_STREAM, socket.IPPROTO_TCP)
        self.addCleanup(serv.close)
        serv.bind((HOST, 0))
        serv.listen()

        result = None
        def run_server():
            [conn, address] = serv.accept()
            with conn, conn.makefile("rb") as reader:
                # Read the request header until a blank line
                while True:
                    line = reader.readline()
                    if not line.rstrip(b"\r\n"):
                        break
                conn.sendall(b"HTTP/1.1 200 Connection established\r\n\r\n")
                nonlocal result
                result = reader.read()

        thread = threading.Thread(target=run_server)
        thread.start()
        self.addCleanup(thread.join, float(1))
        conn = client.HTTPConnection(*serv.getsockname())
        conn.request("CONNECT", "dummy:1234")
        response = conn.getresponse()
        try:
            self.assertEqual(response.status, client.OK)
            s = socket.socket(fileno=response.fileno())
            try:
                s.sendall(b"proxied data\n")
            finally:
                s.detach()
        finally:
            response.close()
            conn.close()
        thread.join()
        self.assertEqual(result, b"proxied data\n")

I was hoping someone with more understanding of the networking stack
could look at this and tell whether it's a bug in the python test or the
kernel change, or otherwise give a pointer to where the problem might
be? I'll freely admit this is not an area I know much about.

Cheers,

Richard


^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2019-04-25 14:16 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-11-21 17:52 [PATCH net-next 0/3] tcp: take a bit more care of backlog stress Eric Dumazet
2018-11-21 17:52 ` [PATCH net-next 1/3] tcp: remove hdrlen argument from tcp_queue_rcv() Eric Dumazet
2018-11-21 22:41   ` Yuchung Cheng
2018-11-21 17:52 ` [PATCH net-next 2/3] tcp: implement coalescing on backlog queue Eric Dumazet
2018-11-21 22:31   ` Yuchung Cheng
2018-11-21 22:40     ` Eric Dumazet
2018-11-22 16:34       ` Yuchung Cheng
2018-11-22 18:01   ` Neal Cardwell
2018-11-22 18:16     ` Eric Dumazet
2018-11-22 18:21       ` Eric Dumazet
2018-11-21 17:52 ` [PATCH net-next 3/3] tcp: implement head drops in " Eric Dumazet
2018-11-21 22:40   ` Yuchung Cheng
2018-11-21 22:47     ` Eric Dumazet
2018-11-21 23:46       ` Yuchung Cheng
2018-11-21 23:52         ` Eric Dumazet
2018-11-22  0:18           ` Eric Dumazet
2018-11-22  0:54             ` Yuchung Cheng
2018-11-22  1:01               ` Eric Dumazet
2018-11-23 19:25 ` [PATCH net-next 0/3] tcp: take a bit more care of backlog stress David Miller
2018-11-23 19:27   ` Eric Dumazet
2019-04-07 21:28 [PATCH net-next 2/3] tcp: implement coalescing on backlog queue richard.purdie
2019-04-24 14:51 ` Bruno Prémont
2019-04-24 15:47   ` Eric Dumazet
2019-04-25  7:55     ` Bruno Prémont
2019-04-25 13:13       ` Bruno Prémont
2019-04-25 13:30         ` Eric Dumazet
2019-04-25 14:16           ` Bruno Prémont
