* [PATCH net-next 0/3] tcp: take a bit more care of backlog stress
@ 2018-11-21 17:52 Eric Dumazet
  2018-11-21 17:52 ` [PATCH net-next 1/3] tcp: remove hdrlen argument from tcp_queue_rcv() Eric Dumazet
                   ` (3 more replies)
  0 siblings, 4 replies; 20+ messages in thread
From: Eric Dumazet @ 2018-11-21 17:52 UTC (permalink / raw)
  To: David S . Miller
  Cc: netdev, Jean-Louis Dupond, Neal Cardwell, Yuchung Cheng,
	Eric Dumazet, Eric Dumazet

While working on the SACK compression issue Jean-Louis Dupond
reported, we found that his Linux box was suffering badly
from tail drops on the socket backlog queue, because the remote
TCP stack was not implementing the latest RFC recommendations.

The first patch is a cleanup.

The second patch attempts coalescing when a new packet must
be added to the backlog queue. Cooking bigger skbs helps
to keep the backlog list smaller and speeds up its handling when
the user thread finally releases the socket lock.
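
For reference, the backlog list is drained by __release_sock(), which
hands every queued skb to the protocol handler (tcp_v4_do_rcv() for
TCP) while the user thread owns the socket lock. Roughly, and with the
lock juggling and cond_resched() details omitted, the drain loop is
the following simplified sketch (not verbatim kernel source):

        struct sk_buff *skb, *next;

        while ((skb = sk->sk_backlog.head) != NULL) {
                /* Detach the whole pending list, then process it. */
                sk->sk_backlog.head = sk->sk_backlog.tail = NULL;

                do {
                        next = skb->next;
                        skb_mark_not_on_list(skb);
                        sk_backlog_rcv(sk, skb);        /* tcp_v4_do_rcv() */
                        skb = next;
                } while (skb);
        }
        sk->sk_backlog.len = 0;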

The third patch implements head drop as a last resort.
Head drops are generally better for optimal TCP behavior.

Eric Dumazet (3):
  tcp: remove hdrlen argument from tcp_queue_rcv()
  tcp: implement coalescing on backlog queue
  tcp: implement head drops in backlog queue

 include/uapi/linux/snmp.h |  1 +
 net/ipv4/proc.c           |  1 +
 net/ipv4/tcp_input.c      | 13 +++---
 net/ipv4/tcp_ipv4.c       | 89 ++++++++++++++++++++++++++++++++++++---
 4 files changed, 91 insertions(+), 13 deletions(-)

-- 
2.19.1.1215.g8438c0b245-goog


* [PATCH net-next 1/3] tcp: remove hdrlen argument from tcp_queue_rcv()
  2018-11-21 17:52 [PATCH net-next 0/3] tcp: take a bit more care of backlog stress Eric Dumazet
@ 2018-11-21 17:52 ` Eric Dumazet
  2018-11-21 22:41   ` Yuchung Cheng
  2018-11-21 17:52 ` [PATCH net-next 2/3] tcp: implement coalescing on backlog queue Eric Dumazet
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 20+ messages in thread
From: Eric Dumazet @ 2018-11-21 17:52 UTC (permalink / raw)
  To: David S . Miller
  Cc: netdev, Jean-Louis Dupond, Neal Cardwell, Yuchung Cheng,
	Eric Dumazet, Eric Dumazet

Only one caller needs to pull TCP headers, so let's
move __skb_pull() to the caller side.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/ipv4/tcp_input.c | 13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index edaaebfbcd4693ef6689f3f4c71733b3888c7c2c..e0ad7d3825b5945049c099171ce36e5c8bb9ba99 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4602,13 +4602,12 @@ static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
 	}
 }
 
-static int __must_check tcp_queue_rcv(struct sock *sk, struct sk_buff *skb, int hdrlen,
-		  bool *fragstolen)
+static int __must_check tcp_queue_rcv(struct sock *sk, struct sk_buff *skb,
+				      bool *fragstolen)
 {
 	int eaten;
 	struct sk_buff *tail = skb_peek_tail(&sk->sk_receive_queue);
 
-	__skb_pull(skb, hdrlen);
 	eaten = (tail &&
 		 tcp_try_coalesce(sk, tail,
 				  skb, fragstolen)) ? 1 : 0;
@@ -4659,7 +4658,7 @@ int tcp_send_rcvq(struct sock *sk, struct msghdr *msg, size_t size)
 	TCP_SKB_CB(skb)->end_seq = TCP_SKB_CB(skb)->seq + size;
 	TCP_SKB_CB(skb)->ack_seq = tcp_sk(sk)->snd_una - 1;
 
-	if (tcp_queue_rcv(sk, skb, 0, &fragstolen)) {
+	if (tcp_queue_rcv(sk, skb, &fragstolen)) {
 		WARN_ON_ONCE(fragstolen); /* should not happen */
 		__kfree_skb(skb);
 	}
@@ -4719,7 +4718,7 @@ static void tcp_data_queue(struct sock *sk, struct sk_buff *skb)
 			goto drop;
 		}
 
-		eaten = tcp_queue_rcv(sk, skb, 0, &fragstolen);
+		eaten = tcp_queue_rcv(sk, skb, &fragstolen);
 		if (skb->len)
 			tcp_event_data_recv(sk, skb);
 		if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN)
@@ -5585,8 +5584,8 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb)
 			NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPHPHITS);
 
 			/* Bulk data transfer: receiver */
-			eaten = tcp_queue_rcv(sk, skb, tcp_header_len,
-					      &fragstolen);
+			__skb_pull(skb, tcp_header_len);
+			eaten = tcp_queue_rcv(sk, skb, &fragstolen);
 
 			tcp_event_data_recv(sk, skb);
 
-- 
2.19.1.1215.g8438c0b245-goog


* [PATCH net-next 2/3] tcp: implement coalescing on backlog queue
  2018-11-21 17:52 [PATCH net-next 0/3] tcp: take a bit more care of backlog stress Eric Dumazet
  2018-11-21 17:52 ` [PATCH net-next 1/3] tcp: remove hdrlen argument from tcp_queue_rcv() Eric Dumazet
@ 2018-11-21 17:52 ` Eric Dumazet
  2018-11-21 22:31   ` Yuchung Cheng
  2018-11-22 18:01   ` Neal Cardwell
  2018-11-21 17:52 ` [PATCH net-next 3/3] tcp: implement head drops in " Eric Dumazet
  2018-11-23 19:25 ` [PATCH net-next 0/3] tcp: take a bit more care of backlog stress David Miller
  3 siblings, 2 replies; 20+ messages in thread
From: Eric Dumazet @ 2018-11-21 17:52 UTC (permalink / raw)
  To: David S . Miller
  Cc: netdev, Jean-Louis Dupond, Neal Cardwell, Yuchung Cheng,
	Eric Dumazet, Eric Dumazet

In case GRO is not as efficient as it should be or disabled,
we might have a user thread trapped in __release_sock() while
the softirq handler floods packets up to the point we have to drop.

This patch balances work done from user thread and softirq,
to give more chances to __release_sock() to complete its work.

This also helps if we receive many ACK packets, since GRO
does not aggregate them.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Tested-by: Jean-Louis Dupond <jean-louis@dupond.be>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
---
 include/uapi/linux/snmp.h |  1 +
 net/ipv4/proc.c           |  1 +
 net/ipv4/tcp_ipv4.c       | 75 +++++++++++++++++++++++++++++++++++----
 3 files changed, 71 insertions(+), 6 deletions(-)

diff --git a/include/uapi/linux/snmp.h b/include/uapi/linux/snmp.h
index f80135e5feaa886000009db6dff75b2bc2d637b2..86dc24a96c90ab047d5173d625450facd6c6dd79 100644
--- a/include/uapi/linux/snmp.h
+++ b/include/uapi/linux/snmp.h
@@ -243,6 +243,7 @@ enum
 	LINUX_MIB_TCPREQQFULLDROP,		/* TCPReqQFullDrop */
 	LINUX_MIB_TCPRETRANSFAIL,		/* TCPRetransFail */
 	LINUX_MIB_TCPRCVCOALESCE,		/* TCPRcvCoalesce */
+	LINUX_MIB_TCPBACKLOGCOALESCE,		/* TCPBacklogCoalesce */
 	LINUX_MIB_TCPOFOQUEUE,			/* TCPOFOQueue */
 	LINUX_MIB_TCPOFODROP,			/* TCPOFODrop */
 	LINUX_MIB_TCPOFOMERGE,			/* TCPOFOMerge */
diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
index 70289682a6701438aed99a00a9705c39fa4394d3..c3610b37bb4ce665b1976d8cc907b6dd0de42ab9 100644
--- a/net/ipv4/proc.c
+++ b/net/ipv4/proc.c
@@ -219,6 +219,7 @@ static const struct snmp_mib snmp4_net_list[] = {
 	SNMP_MIB_ITEM("TCPRenoRecoveryFail", LINUX_MIB_TCPRENORECOVERYFAIL),
 	SNMP_MIB_ITEM("TCPSackRecoveryFail", LINUX_MIB_TCPSACKRECOVERYFAIL),
 	SNMP_MIB_ITEM("TCPRcvCollapsed", LINUX_MIB_TCPRCVCOLLAPSED),
+	SNMP_MIB_ITEM("TCPBacklogCoalesce", LINUX_MIB_TCPBACKLOGCOALESCE),
 	SNMP_MIB_ITEM("TCPDSACKOldSent", LINUX_MIB_TCPDSACKOLDSENT),
 	SNMP_MIB_ITEM("TCPDSACKOfoSent", LINUX_MIB_TCPDSACKOFOSENT),
 	SNMP_MIB_ITEM("TCPDSACKRecv", LINUX_MIB_TCPDSACKRECV),
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 795605a2327504b8a025405826e7e0ca8dc8501d..401e1d1cb904a4c7963d8baa419cfbf178593344 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1619,12 +1619,10 @@ int tcp_v4_early_demux(struct sk_buff *skb)
 bool tcp_add_backlog(struct sock *sk, struct sk_buff *skb)
 {
 	u32 limit = sk->sk_rcvbuf + sk->sk_sndbuf;
-
-	/* Only socket owner can try to collapse/prune rx queues
-	 * to reduce memory overhead, so add a little headroom here.
-	 * Few sockets backlog are possibly concurrently non empty.
-	 */
-	limit += 64*1024;
+	struct skb_shared_info *shinfo;
+	const struct tcphdr *th;
+	struct sk_buff *tail;
+	unsigned int hdrlen;
 
 	/* In case all data was pulled from skb frags (in __pskb_pull_tail()),
 	 * we can fix skb->truesize to its real value to avoid future drops.
@@ -1636,6 +1634,71 @@ bool tcp_add_backlog(struct sock *sk, struct sk_buff *skb)
 
 	skb_dst_drop(skb);
 
+	if (unlikely(tcp_checksum_complete(skb))) {
+		bh_unlock_sock(sk);
+		__TCP_INC_STATS(sock_net(sk), TCP_MIB_CSUMERRORS);
+		__TCP_INC_STATS(sock_net(sk), TCP_MIB_INERRS);
+		return true;
+	}
+
+	/* Attempt coalescing to last skb in backlog, even if we are
+	 * above the limits.
+	 * This is okay because skb capacity is limited to MAX_SKB_FRAGS.
+	 */
+	th = (const struct tcphdr *)skb->data;
+	hdrlen = th->doff * 4;
+	shinfo = skb_shinfo(skb);
+
+	if (!shinfo->gso_size)
+		shinfo->gso_size = skb->len - hdrlen;
+
+	if (!shinfo->gso_segs)
+		shinfo->gso_segs = 1;
+
+	tail = sk->sk_backlog.tail;
+	if (tail &&
+	    TCP_SKB_CB(tail)->end_seq == TCP_SKB_CB(skb)->seq &&
+#ifdef CONFIG_TLS_DEVICE
+	    tail->decrypted == skb->decrypted &&
+#endif
+	    !memcmp(tail->data + sizeof(*th), skb->data + sizeof(*th),
+		    hdrlen - sizeof(*th))) {
+		bool fragstolen;
+		int delta;
+
+		__skb_pull(skb, hdrlen);
+		if (skb_try_coalesce(tail, skb, &fragstolen, &delta)) {
+			TCP_SKB_CB(tail)->end_seq = TCP_SKB_CB(skb)->end_seq;
+			TCP_SKB_CB(tail)->ack_seq = TCP_SKB_CB(skb)->ack_seq;
+			TCP_SKB_CB(tail)->tcp_flags |= TCP_SKB_CB(skb)->tcp_flags;
+
+			if (TCP_SKB_CB(skb)->has_rxtstamp) {
+				TCP_SKB_CB(tail)->has_rxtstamp = true;
+				tail->tstamp = skb->tstamp;
+				skb_hwtstamps(tail)->hwtstamp = skb_hwtstamps(skb)->hwtstamp;
+			}
+
+			/* Not as strict as GRO. We only need to carry mss max value */
+			skb_shinfo(tail)->gso_size = max(shinfo->gso_size,
+							 skb_shinfo(tail)->gso_size);
+
+			skb_shinfo(tail)->gso_segs += shinfo->gso_segs;
+
+			sk->sk_backlog.len += delta;
+			__NET_INC_STATS(sock_net(sk),
+					LINUX_MIB_TCPBACKLOGCOALESCE);
+			kfree_skb_partial(skb, fragstolen);
+			return false;
+		}
+		__skb_push(skb, hdrlen);
+	}
+
+	/* Only socket owner can try to collapse/prune rx queues
+	 * to reduce memory overhead, so add a little headroom here.
+	 * Few sockets backlog are possibly concurrently non empty.
+	 */
+	limit += 64*1024;
+
 	if (unlikely(sk_add_backlog(sk, skb, limit))) {
 		bh_unlock_sock(sk);
 		__NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPBACKLOGDROP);
-- 
2.19.1.1215.g8438c0b245-goog


* [PATCH net-next 3/3] tcp: implement head drops in backlog queue
  2018-11-21 17:52 [PATCH net-next 0/3] tcp: take a bit more care of backlog stress Eric Dumazet
  2018-11-21 17:52 ` [PATCH net-next 1/3] tcp: remove hdrlen argument from tcp_queue_rcv() Eric Dumazet
  2018-11-21 17:52 ` [PATCH net-next 2/3] tcp: implement coalescing on backlog queue Eric Dumazet
@ 2018-11-21 17:52 ` Eric Dumazet
  2018-11-21 22:40   ` Yuchung Cheng
  2018-11-23 19:25 ` [PATCH net-next 0/3] tcp: take a bit more care of backlog stress David Miller
  3 siblings, 1 reply; 20+ messages in thread
From: Eric Dumazet @ 2018-11-21 17:52 UTC (permalink / raw)
  To: David S . Miller
  Cc: netdev, Jean-Louis Dupond, Neal Cardwell, Yuchung Cheng,
	Eric Dumazet, Eric Dumazet

Under high stress, and if GRO or coalescing does not help,
we had better make room in the backlog queue to be able to keep
the latest incoming packet.

This generally helps fast recovery, given that we often receive
packets in order.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Tested-by: Jean-Louis Dupond <jean-louis@dupond.be>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
---
 net/ipv4/tcp_ipv4.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 401e1d1cb904a4c7963d8baa419cfbf178593344..36c9d715bf2aa7eb7bf58b045bfeb85a2ec1a696 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1693,6 +1693,20 @@ bool tcp_add_backlog(struct sock *sk, struct sk_buff *skb)
 		__skb_push(skb, hdrlen);
 	}
 
+	while (sk_rcvqueues_full(sk, limit)) {
+		struct sk_buff *head;
+
+		head = sk->sk_backlog.head;
+		if (!head)
+			break;
+		sk->sk_backlog.head = head->next;
+		if (!head->next)
+			sk->sk_backlog.tail = NULL;
+		skb_mark_not_on_list(head);
+		sk->sk_backlog.len -= head->truesize;
+		kfree_skb(head);
+		__NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPBACKLOGDROP);
+	}
 	/* Only socket owner can try to collapse/prune rx queues
 	 * to reduce memory overhead, so add a little headroom here.
 	 * Few sockets backlog are possibly concurrently non empty.
-- 
2.19.1.1215.g8438c0b245-goog


* Re: [PATCH net-next 2/3] tcp: implement coalescing on backlog queue
  2018-11-21 17:52 ` [PATCH net-next 2/3] tcp: implement coalescing on backlog queue Eric Dumazet
@ 2018-11-21 22:31   ` Yuchung Cheng
  2018-11-21 22:40     ` Eric Dumazet
  2018-11-22 18:01   ` Neal Cardwell
  1 sibling, 1 reply; 20+ messages in thread
From: Yuchung Cheng @ 2018-11-21 22:31 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, netdev, Jean-Louis Dupond, Neal Cardwell, Eric Dumazet

On Wed, Nov 21, 2018 at 9:52 AM, Eric Dumazet <edumazet@google.com> wrote:
>
> In case GRO is not as efficient as it should be or disabled,
> we might have a user thread trapped in __release_sock() while
> the softirq handler floods packets up to the point we have to drop.
>
> This patch balances work done from user thread and softirq,
> to give more chances to __release_sock() to complete its work.
>
> This also helps if we receive many ACK packets, since GRO
> does not aggregate them.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Tested-by: Jean-Louis Dupond <jean-louis@dupond.be>
> Cc: Neal Cardwell <ncardwell@google.com>
> Cc: Yuchung Cheng <ycheng@google.com>
> ---
>  include/uapi/linux/snmp.h |  1 +
>  net/ipv4/proc.c           |  1 +
>  net/ipv4/tcp_ipv4.c       | 75 +++++++++++++++++++++++++++++++++++----
>  3 files changed, 71 insertions(+), 6 deletions(-)
>
> diff --git a/include/uapi/linux/snmp.h b/include/uapi/linux/snmp.h
> index f80135e5feaa886000009db6dff75b2bc2d637b2..86dc24a96c90ab047d5173d625450facd6c6dd79 100644
> --- a/include/uapi/linux/snmp.h
> +++ b/include/uapi/linux/snmp.h
> @@ -243,6 +243,7 @@ enum
>         LINUX_MIB_TCPREQQFULLDROP,              /* TCPReqQFullDrop */
>         LINUX_MIB_TCPRETRANSFAIL,               /* TCPRetransFail */
>         LINUX_MIB_TCPRCVCOALESCE,               /* TCPRcvCoalesce */
> +       LINUX_MIB_TCPBACKLOGCOALESCE,           /* TCPBacklogCoalesce */
>         LINUX_MIB_TCPOFOQUEUE,                  /* TCPOFOQueue */
>         LINUX_MIB_TCPOFODROP,                   /* TCPOFODrop */
>         LINUX_MIB_TCPOFOMERGE,                  /* TCPOFOMerge */
> diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
> index 70289682a6701438aed99a00a9705c39fa4394d3..c3610b37bb4ce665b1976d8cc907b6dd0de42ab9 100644
> --- a/net/ipv4/proc.c
> +++ b/net/ipv4/proc.c
> @@ -219,6 +219,7 @@ static const struct snmp_mib snmp4_net_list[] = {
>         SNMP_MIB_ITEM("TCPRenoRecoveryFail", LINUX_MIB_TCPRENORECOVERYFAIL),
>         SNMP_MIB_ITEM("TCPSackRecoveryFail", LINUX_MIB_TCPSACKRECOVERYFAIL),
>         SNMP_MIB_ITEM("TCPRcvCollapsed", LINUX_MIB_TCPRCVCOLLAPSED),
> +       SNMP_MIB_ITEM("TCPBacklogCoalesce", LINUX_MIB_TCPBACKLOGCOALESCE),
>         SNMP_MIB_ITEM("TCPDSACKOldSent", LINUX_MIB_TCPDSACKOLDSENT),
>         SNMP_MIB_ITEM("TCPDSACKOfoSent", LINUX_MIB_TCPDSACKOFOSENT),
>         SNMP_MIB_ITEM("TCPDSACKRecv", LINUX_MIB_TCPDSACKRECV),
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index 795605a2327504b8a025405826e7e0ca8dc8501d..401e1d1cb904a4c7963d8baa419cfbf178593344 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -1619,12 +1619,10 @@ int tcp_v4_early_demux(struct sk_buff *skb)
>  bool tcp_add_backlog(struct sock *sk, struct sk_buff *skb)
>  {
>         u32 limit = sk->sk_rcvbuf + sk->sk_sndbuf;
> -
> -       /* Only socket owner can try to collapse/prune rx queues
> -        * to reduce memory overhead, so add a little headroom here.
> -        * Few sockets backlog are possibly concurrently non empty.
> -        */
> -       limit += 64*1024;
> +       struct skb_shared_info *shinfo;
> +       const struct tcphdr *th;
> +       struct sk_buff *tail;
> +       unsigned int hdrlen;
>
>         /* In case all data was pulled from skb frags (in __pskb_pull_tail()),
>          * we can fix skb->truesize to its real value to avoid future drops.
> @@ -1636,6 +1634,71 @@ bool tcp_add_backlog(struct sock *sk, struct sk_buff *skb)
>
>         skb_dst_drop(skb);
>
> +       if (unlikely(tcp_checksum_complete(skb))) {
> +               bh_unlock_sock(sk);
> +               __TCP_INC_STATS(sock_net(sk), TCP_MIB_CSUMERRORS);
> +               __TCP_INC_STATS(sock_net(sk), TCP_MIB_INERRS);
> +               return true;
> +       }
> +
> +       /* Attempt coalescing to last skb in backlog, even if we are
> +        * above the limits.
> +        * This is okay because skb capacity is limited to MAX_SKB_FRAGS.
> +        */
> +       th = (const struct tcphdr *)skb->data;
> +       hdrlen = th->doff * 4;
> +       shinfo = skb_shinfo(skb);
> +
> +       if (!shinfo->gso_size)
> +               shinfo->gso_size = skb->len - hdrlen;
> +
> +       if (!shinfo->gso_segs)
> +               shinfo->gso_segs = 1;
> +
> +       tail = sk->sk_backlog.tail;
> +       if (tail &&
> +           TCP_SKB_CB(tail)->end_seq == TCP_SKB_CB(skb)->seq &&
> +#ifdef CONFIG_TLS_DEVICE
> +           tail->decrypted == skb->decrypted &&
> +#endif
> +           !memcmp(tail->data + sizeof(*th), skb->data + sizeof(*th),
> +                   hdrlen - sizeof(*th))) {
> +               bool fragstolen;
> +               int delta;
> +
> +               __skb_pull(skb, hdrlen);
> +               if (skb_try_coalesce(tail, skb, &fragstolen, &delta)) {
> +                       TCP_SKB_CB(tail)->end_seq = TCP_SKB_CB(skb)->end_seq;
> +                       TCP_SKB_CB(tail)->ack_seq = TCP_SKB_CB(skb)->ack_seq;
> +                       TCP_SKB_CB(tail)->tcp_flags |= TCP_SKB_CB(skb)->tcp_flags;
> +
> +                       if (TCP_SKB_CB(skb)->has_rxtstamp) {
> +                               TCP_SKB_CB(tail)->has_rxtstamp = true;
> +                               tail->tstamp = skb->tstamp;
> +                               skb_hwtstamps(tail)->hwtstamp = skb_hwtstamps(skb)->hwtstamp;
> +                       }
> +
Really nice! would it make sense to re-use (some of) the similar
tcp_try_coalesce()?

> +                       /* Not as strict as GRO. We only need to carry mss max value */
> +                       skb_shinfo(tail)->gso_size = max(shinfo->gso_size,
> +                                                        skb_shinfo(tail)->gso_size);
> +
> +                       skb_shinfo(tail)->gso_segs += shinfo->gso_segs;
> +
> +                       sk->sk_backlog.len += delta;
> +                       __NET_INC_STATS(sock_net(sk),
> +                                       LINUX_MIB_TCPBACKLOGCOALESCE);
> +                       kfree_skb_partial(skb, fragstolen);
> +                       return false;
> +               }
> +               __skb_push(skb, hdrlen);
> +       }
> +
> +       /* Only socket owner can try to collapse/prune rx queues
> +        * to reduce memory overhead, so add a little headroom here.
> +        * Few sockets backlog are possibly concurrently non empty.
> +        */
> +       limit += 64*1024;
> +
>         if (unlikely(sk_add_backlog(sk, skb, limit))) {
>                 bh_unlock_sock(sk);
>                 __NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPBACKLOGDROP);
> --
> 2.19.1.1215.g8438c0b245-goog
>


* Re: [PATCH net-next 2/3] tcp: implement coalescing on backlog queue
  2018-11-21 22:31   ` Yuchung Cheng
@ 2018-11-21 22:40     ` Eric Dumazet
  2018-11-22 16:34       ` Yuchung Cheng
  0 siblings, 1 reply; 20+ messages in thread
From: Eric Dumazet @ 2018-11-21 22:40 UTC (permalink / raw)
  To: Yuchung Cheng, Eric Dumazet
  Cc: David S . Miller, netdev, Jean-Louis Dupond, Neal Cardwell



On 11/21/2018 02:31 PM, Yuchung Cheng wrote:
> On Wed, Nov 21, 2018 at 9:52 AM, Eric Dumazet <edumazet@google.com> wrote:

>> +
> Really nice! would it make sense to re-use (some of) the similar
> tcp_try_coalesce()?
>

Maybe, but it is a bit complex, since skbs in receive queues (regular or out of order)
are accounted differently (they have skb->destructor set)

Also they had the TCP header pulled already, while the backlog coalescing also has
to make sure TCP options match.

Not sure if we want to add extra parameters and conditional checks...


* Re: [PATCH net-next 3/3] tcp: implement head drops in backlog queue
  2018-11-21 17:52 ` [PATCH net-next 3/3] tcp: implement head drops in " Eric Dumazet
@ 2018-11-21 22:40   ` Yuchung Cheng
  2018-11-21 22:47     ` Eric Dumazet
  0 siblings, 1 reply; 20+ messages in thread
From: Yuchung Cheng @ 2018-11-21 22:40 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, netdev, Jean-Louis Dupond, Neal Cardwell, Eric Dumazet

On Wed, Nov 21, 2018 at 9:52 AM, Eric Dumazet <edumazet@google.com> wrote:
> Under high stress, and if GRO or coalescing does not help,
> we had better make room in the backlog queue to be able to keep
> the latest incoming packet.
>
> This generally helps fast recovery, given that we often receive
> packets in order.

I like the benefit of fast recovery but I am a bit leery about head
drop causing HoLB on large read, while tail drops can be repaired by
RACK and TLP already. Hmm -

>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Tested-by: Jean-Louis Dupond <jean-louis@dupond.be>
> Cc: Neal Cardwell <ncardwell@google.com>
> Cc: Yuchung Cheng <ycheng@google.com>
> ---
>  net/ipv4/tcp_ipv4.c | 14 ++++++++++++++
>  1 file changed, 14 insertions(+)
>
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index 401e1d1cb904a4c7963d8baa419cfbf178593344..36c9d715bf2aa7eb7bf58b045bfeb85a2ec1a696 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -1693,6 +1693,20 @@ bool tcp_add_backlog(struct sock *sk, struct sk_buff *skb)
>                 __skb_push(skb, hdrlen);
>         }
>
> +       while (sk_rcvqueues_full(sk, limit)) {
> +               struct sk_buff *head;
> +
> +               head = sk->sk_backlog.head;
> +               if (!head)
> +                       break;
> +               sk->sk_backlog.head = head->next;
> +               if (!head->next)
> +                       sk->sk_backlog.tail = NULL;
> +               skb_mark_not_on_list(head);
> +               sk->sk_backlog.len -= head->truesize;
> +               kfree_skb(head);
> +               __NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPBACKLOGDROP);
> +       }
>         /* Only socket owner can try to collapse/prune rx queues
>          * to reduce memory overhead, so add a little headroom here.
>          * Few sockets backlog are possibly concurrently non empty.
> --
> 2.19.1.1215.g8438c0b245-goog
>


* Re: [PATCH net-next 1/3] tcp: remove hdrlen argument from tcp_queue_rcv()
  2018-11-21 17:52 ` [PATCH net-next 1/3] tcp: remove hdrlen argument from tcp_queue_rcv() Eric Dumazet
@ 2018-11-21 22:41   ` Yuchung Cheng
  0 siblings, 0 replies; 20+ messages in thread
From: Yuchung Cheng @ 2018-11-21 22:41 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, netdev, Jean-Louis Dupond, Neal Cardwell, Eric Dumazet

On Wed, Nov 21, 2018 at 9:52 AM, Eric Dumazet <edumazet@google.com> wrote:
> Only one caller needs to pull TCP headers, so let's
> move __skb_pull() to the caller side.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> ---
Acked-by: Yuchung Cheng <ycheng@google.com>

>  net/ipv4/tcp_input.c | 13 ++++++-------
>  1 file changed, 6 insertions(+), 7 deletions(-)
>
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index edaaebfbcd4693ef6689f3f4c71733b3888c7c2c..e0ad7d3825b5945049c099171ce36e5c8bb9ba99 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -4602,13 +4602,12 @@ static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
>         }
>  }
>
> -static int __must_check tcp_queue_rcv(struct sock *sk, struct sk_buff *skb, int hdrlen,
> -                 bool *fragstolen)
> +static int __must_check tcp_queue_rcv(struct sock *sk, struct sk_buff *skb,
> +                                     bool *fragstolen)
>  {
>         int eaten;
>         struct sk_buff *tail = skb_peek_tail(&sk->sk_receive_queue);
>
> -       __skb_pull(skb, hdrlen);
>         eaten = (tail &&
>                  tcp_try_coalesce(sk, tail,
>                                   skb, fragstolen)) ? 1 : 0;
> @@ -4659,7 +4658,7 @@ int tcp_send_rcvq(struct sock *sk, struct msghdr *msg, size_t size)
>         TCP_SKB_CB(skb)->end_seq = TCP_SKB_CB(skb)->seq + size;
>         TCP_SKB_CB(skb)->ack_seq = tcp_sk(sk)->snd_una - 1;
>
> -       if (tcp_queue_rcv(sk, skb, 0, &fragstolen)) {
> +       if (tcp_queue_rcv(sk, skb, &fragstolen)) {
>                 WARN_ON_ONCE(fragstolen); /* should not happen */
>                 __kfree_skb(skb);
>         }
> @@ -4719,7 +4718,7 @@ static void tcp_data_queue(struct sock *sk, struct sk_buff *skb)
>                         goto drop;
>                 }
>
> -               eaten = tcp_queue_rcv(sk, skb, 0, &fragstolen);
> +               eaten = tcp_queue_rcv(sk, skb, &fragstolen);
>                 if (skb->len)
>                         tcp_event_data_recv(sk, skb);
>                 if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN)
> @@ -5585,8 +5584,8 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb)
>                         NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPHPHITS);
>
>                         /* Bulk data transfer: receiver */
> -                       eaten = tcp_queue_rcv(sk, skb, tcp_header_len,
> -                                             &fragstolen);
> +                       __skb_pull(skb, tcp_header_len);
> +                       eaten = tcp_queue_rcv(sk, skb, &fragstolen);
>
>                         tcp_event_data_recv(sk, skb);
>
> --
> 2.19.1.1215.g8438c0b245-goog
>


* Re: [PATCH net-next 3/3] tcp: implement head drops in backlog queue
  2018-11-21 22:40   ` Yuchung Cheng
@ 2018-11-21 22:47     ` Eric Dumazet
  2018-11-21 23:46       ` Yuchung Cheng
  0 siblings, 1 reply; 20+ messages in thread
From: Eric Dumazet @ 2018-11-21 22:47 UTC (permalink / raw)
  To: Yuchung Cheng, Eric Dumazet
  Cc: David S . Miller, netdev, Jean-Louis Dupond, Neal Cardwell



On 11/21/2018 02:40 PM, Yuchung Cheng wrote:
> On Wed, Nov 21, 2018 at 9:52 AM, Eric Dumazet <edumazet@google.com> wrote:
>> Under high stress, and if GRO or coalescing does not help,
>> we had better make room in the backlog queue to be able to keep
>> the latest incoming packet.
>>
>> This generally helps fast recovery, given that we often receive
>> packets in order.
> 
> I like the benefit of fast recovery but I am a bit leery about head
> drop causing HoLB on large read, while tail drops can be repaired by
> RACK and TLP already. Hmm -

This is very different pattern here.

We have a train of packets coming, the last packet is not a TLP probe...

Consider this train coming from an old stack without burst control nor pacing.

This patch guarantees last packet will be processed, and either :

1) We are a receiver, we will send a SACK. Sender will typically start recovery

2) We are a sender, we will process the most recent ACK sent by the receiver.


* Re: [PATCH net-next 3/3] tcp: implement head drops in backlog queue
  2018-11-21 22:47     ` Eric Dumazet
@ 2018-11-21 23:46       ` Yuchung Cheng
  2018-11-21 23:52         ` Eric Dumazet
  0 siblings, 1 reply; 20+ messages in thread
From: Yuchung Cheng @ 2018-11-21 23:46 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Eric Dumazet, David S . Miller, netdev, Jean-Louis Dupond, Neal Cardwell

On Wed, Nov 21, 2018 at 2:47 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
>
> On 11/21/2018 02:40 PM, Yuchung Cheng wrote:
>> On Wed, Nov 21, 2018 at 9:52 AM, Eric Dumazet <edumazet@google.com> wrote:
>>> Under high stress, and if GRO or coalescing does not help,
>>> we had better make room in the backlog queue to be able to keep
>>> the latest incoming packet.
>>>
>>> This generally helps fast recovery, given that we often receive
>>> packets in order.
>>
>> I like the benefit of fast recovery but I am a bit leery about head
>> drop causing HoLB on large read, while tail drops can be repaired by
>> RACK and TLP already. Hmm -
>
> This is very different pattern here.
>
> We have a train of packets coming, the last packet is not a TLP probe...
>
> Consider this train coming from an old stack without burst control nor pacing.
>
> This patch guarantees last packet will be processed, and either :
>
> 1) We are a receiver, we will send a SACK. Sender will typically start recovery
>
> 2) We are a sender, we will process the most recent ACK sent by the receiver.
>

Sure on the sender it's universally good.

On the receiver side my scenario was not the last packet being TLP.
AFAIU the new design will first try coalesce the incoming skb to the
tail one then exit. Otherwise it's queued to the back with an
additional 64KB space credit. This patch checks the space w/o the
extra credit and drops the head skb. If the head skb has been
coalesced, we might end dropping the first big chunk that may need a
few rounds of fast recovery to repair. But I am likely to
misunderstand the patch :-)

Would it make sense to check the space first before the coalesce
operation, and drop just enough bytes of the head to make room?


* Re: [PATCH net-next 3/3] tcp: implement head drops in backlog queue
  2018-11-21 23:46       ` Yuchung Cheng
@ 2018-11-21 23:52         ` Eric Dumazet
  2018-11-22  0:18           ` Eric Dumazet
  0 siblings, 1 reply; 20+ messages in thread
From: Eric Dumazet @ 2018-11-21 23:52 UTC (permalink / raw)
  To: Yuchung Cheng
  Cc: Eric Dumazet, David Miller, netdev, Jean-Louis Dupond, Neal Cardwell

On Wed, Nov 21, 2018 at 3:47 PM Yuchung Cheng <ycheng@google.com> wrote:
>
> On Wed, Nov 21, 2018 at 2:47 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >
> >
> > On 11/21/2018 02:40 PM, Yuchung Cheng wrote:
> >> On Wed, Nov 21, 2018 at 9:52 AM, Eric Dumazet <edumazet@google.com> wrote:
> >>> Under high stress, and if GRO or coalescing does not help,
> >>> we had better make room in the backlog queue to be able to keep
> >>> the latest incoming packet.
> >>>
> >>> This generally helps fast recovery, given that we often receive
> >>> packets in order.
> >>
> >> I like the benefit of fast recovery but I am a bit leery about head
> >> drop causing HoLB on large read, while tail drops can be repaired by
> >> RACK and TLP already. Hmm -
> >
> > This is very different pattern here.
> >
> > We have a train of packets coming, the last packet is not a TLP probe...
> >
> > Consider this train coming from an old stack without burst control nor pacing.
> >
> > This patch guarantees last packet will be processed, and either :
> >
> > 1) We are a receiver, we will send a SACK. Sender will typically start recovery
> >
> > 2) We are a sender, we will process the most recent ACK sent by the receiver.
> >
>
> Sure on the sender it's universally good.
>
> On the receiver side my scenario was not the last packet being TLP.
> AFAIU the new design will first try coalesce the incoming skb to the
> tail one then exit. Otherwise it's queued to the back with an
> additional 64KB space credit. This patch checks the space w/o the
> extra credit and drops the head skb. If the head skb has been
> coalesced, we might end dropping the first big chunk that may need a
> few rounds of fast recovery to repair. But I am likely to
> misunderstand the patch :-)

This situation happens only when we are largely overcommitting
the memory, due to a bad skb->len/skb->truesize ratio.

Think about MTU=9000 and some drivers allocating 16 KB of memory per frame,
even if the packet has 1500 bytes in it.

>
> Would it make sense to check the space first before the coalesce
> operation, and drop just enough bytes of the head to make room?

This is basically what the patch does: the while loop breaks when we have freed
just enough skbs.
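
For reference, the check driving that loop, sk_rcvqueues_full(), is
based on truesize accounting; it is roughly the following simplified
sketch (not verbatim kernel source):

static inline bool sk_rcvqueues_full(const struct sock *sk, unsigned int limit)
{
        /* skb->truesize (the buffer allocation size) is what gets charged,
         * not the payload length: a 1500-byte frame sitting in a 16KB
         * driver buffer consumes 16KB of this budget.
         */
        unsigned int qsize = sk->sk_backlog.len + atomic_read(&sk->sk_rmem_alloc);

        return qsize > limit;
}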


* Re: [PATCH net-next 3/3] tcp: implement head drops in backlog queue
  2018-11-21 23:52         ` Eric Dumazet
@ 2018-11-22  0:18           ` Eric Dumazet
  2018-11-22  0:54             ` Yuchung Cheng
  0 siblings, 1 reply; 20+ messages in thread
From: Eric Dumazet @ 2018-11-22  0:18 UTC (permalink / raw)
  To: Yuchung Cheng
  Cc: Eric Dumazet, David Miller, netdev, Jean-Louis Dupond, Neal Cardwell

On Wed, Nov 21, 2018 at 3:52 PM Eric Dumazet <edumazet@google.com> wrote:
> This is basically what the patch does, the while loop breaks when we have freed
> just enough skbs.

Also, this is the patch we tested with Jean-Louis on his host; it brings
very nice results,
even from an old stack sender (the one that had problems with the SACK
compression we just fixed).

Keep in mind we are dealing here with the exception; I would not spend
too much time testing another variant if this one simply works.


* Re: [PATCH net-next 3/3] tcp: implement head drops in backlog queue
  2018-11-22  0:18           ` Eric Dumazet
@ 2018-11-22  0:54             ` Yuchung Cheng
  2018-11-22  1:01               ` Eric Dumazet
  0 siblings, 1 reply; 20+ messages in thread
From: Yuchung Cheng @ 2018-11-22  0:54 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Eric Dumazet, David Miller, netdev, Jean-Louis Dupond, Neal Cardwell

On Wed, Nov 21, 2018 at 4:18 PM, Eric Dumazet <edumazet@google.com> wrote:
> On Wed, Nov 21, 2018 at 3:52 PM Eric Dumazet <edumazet@google.com> wrote:
>> This is basically what the patch does: the while loop breaks when we have freed
>> just enough skbs.
>
> Also, this is the patch we tested with Jean-Louis on his host; it brings
> very nice results,
> even from an old stack sender (the one that had problems with the SACK
> compression we just fixed).
>
> Keep in mind we are dealing here with the exception; I would not spend
> too much time testing another variant if this one simply works.
To clarify I do think this patch set is overall useful so I only
wanted to discuss the specifics of the head drop.

It occurs to me we check the limit differently (one w/ 64KB more), so
we may overcommit and trim more often than necessary?


* Re: [PATCH net-next 3/3] tcp: implement head drops in backlog queue
  2018-11-22  0:54             ` Yuchung Cheng
@ 2018-11-22  1:01               ` Eric Dumazet
  0 siblings, 0 replies; 20+ messages in thread
From: Eric Dumazet @ 2018-11-22  1:01 UTC (permalink / raw)
  To: Yuchung Cheng
  Cc: Eric Dumazet, David Miller, netdev, Jean-Louis Dupond, Neal Cardwell

On Wed, Nov 21, 2018 at 4:55 PM Yuchung Cheng <ycheng@google.com> wrote:
>
> To clarify I do think this patch set is overall useful so I only
> wanted to discuss the specifics of the head drop.
>
> It occurs to me we check the limit differently (one w/ 64KB more), so
> we may overcommit and trim more often than necessary?

Keep in mind that the normal limit might be hit while backlog is empty.

So the loop might have nothing to remove.

We only have an extra 64KB credit for this very particular case, so
that we still can queue this packet
_if_ the additional credit was not already consumed.
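
For reference, the final queueing step is sk_add_backlog(), called here
with limit = sk->sk_rcvbuf + sk->sk_sndbuf + 64KB. A simplified sketch
(the pfmemalloc handling of the real helper is omitted):

static inline __must_check int sk_add_backlog(struct sock *sk, struct sk_buff *skb,
                                              unsigned int limit)
{
        if (sk_rcvqueues_full(sk, limit))       /* credit already consumed */
                return -ENOBUFS;                /* caller will drop the packet */

        __sk_add_backlog(sk, skb);              /* link skb at sk->sk_backlog.tail */
        sk->sk_backlog.len += skb->truesize;
        return 0;
}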

We do not want to add back the terrible bug fixed by:

commit 8eae939f1400326b06d0c9afe53d2a484a326871
Author: Zhu Yi <yi.zhu@intel.com>
Date:   Thu Mar 4 18:01:40 2010 +0000

    net: add limit for socket backlog

    We got system OOM while running some UDP netperf testing on the loopback
    device. The case is multiple senders sent stream UDP packets to a single
    receiver via loopback on local host. Of course, the receiver is not able
    to handle all the packets in time. But we surprisingly found that these
    packets were not discarded due to the receiver's sk->sk_rcvbuf limit.
    Instead, they are kept queuing to sk->sk_backlog and finally ate up all
    the memory. We believe this is a secure hole that a none privileged user
    can crash the system.

    The root cause for this problem is, when the receiver is doing
    __release_sock() (i.e. after userspace recv, kernel udp_recvmsg ->
    skb_free_datagram_locked -> release_sock), it moves skbs from backlog to
    sk_receive_queue with the softirq enabled. In the above case, multiple
    busy senders will almost make it an endless loop. The skbs in the
    backlog end up eat all the system memory.

    The issue is not only for UDP. Any protocols using socket backlog is
    potentially affected. The patch adds limit for socket backlog so that
    the backlog size cannot be expanded endlessly.

    Reported-by: Alex Shi <alex.shi@intel.com>
    Cc: David Miller <davem@davemloft.net>
    Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
    Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru
    Cc: "Pekka Savola (ipv6)" <pekkas@netcore.fi>
    Cc: Patrick McHardy <kaber@trash.net>
    Cc: Vlad Yasevich <vladislav.yasevich@hp.com>
    Cc: Sridhar Samudrala <sri@us.ibm.com>
    Cc: Jon Maloy <jon.maloy@ericsson.com>
    Cc: Allan Stephens <allan.stephens@windriver.com>
    Cc: Andrew Hendry <andrew.hendry@gmail.com>
    Signed-off-by: Zhu Yi <yi.zhu@intel.com>
    Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
    Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>


* Re: [PATCH net-next 2/3] tcp: implement coalescing on backlog queue
  2018-11-21 22:40     ` Eric Dumazet
@ 2018-11-22 16:34       ` Yuchung Cheng
  0 siblings, 0 replies; 20+ messages in thread
From: Yuchung Cheng @ 2018-11-22 16:34 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Eric Dumazet, David S . Miller, netdev, Jean-Louis Dupond, Neal Cardwell

On Wed, Nov 21, 2018 at 2:40 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
>
> On 11/21/2018 02:31 PM, Yuchung Cheng wrote:
>> On Wed, Nov 21, 2018 at 9:52 AM, Eric Dumazet <edumazet@google.com> wrote:
>
>>> +
>> Really nice! would it make sense to re-use (some of) the similar
>> tcp_try_coalesce()?
>>
>
> Maybe, but it is a bit complex, since skbs in receive queues (regular or out of order)
> are accounted differently (they have skb->destructor set)
>
> Also they had the TCP header pulled already, while the backlog coalescing also has
> to make sure TCP options match.
>
> Not sure if we want to add extra parameters and conditional checks...
Makes sense.

Acked-by: Yuchung Cheng <ycheng@google.com>

>
>


* Re: [PATCH net-next 2/3] tcp: implement coalescing on backlog queue
  2018-11-21 17:52 ` [PATCH net-next 2/3] tcp: implement coalescing on backlog queue Eric Dumazet
  2018-11-21 22:31   ` Yuchung Cheng
@ 2018-11-22 18:01   ` Neal Cardwell
  2018-11-22 18:16     ` Eric Dumazet
  1 sibling, 1 reply; 20+ messages in thread
From: Neal Cardwell @ 2018-11-22 18:01 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, Netdev, jean-louis, Yuchung Cheng, Eric Dumazet

On Wed, Nov 21, 2018 at 12:52 PM Eric Dumazet <edumazet@google.com> wrote:
>
> In case GRO is not as efficient as it should be or disabled,
> we might have a user thread trapped in __release_sock() while
> the softirq handler floods packets up to the point we have to drop.
>
> This patch balances work done from user thread and softirq,
> to give more chances to __release_sock() to complete its work.
>
> This also helps if we receive many ACK packets, since GRO
> does not aggregate them.

Would this coalesce duplicate incoming ACK packets? Is there a risk
that this would eliminate incoming dupacks needed for fast recovery in
non-SACK connections? Perhaps pure ACKs should only be coalesced if
the ACK field is different?

neal


* Re: [PATCH net-next 2/3] tcp: implement coalescing on backlog queue
  2018-11-22 18:01   ` Neal Cardwell
@ 2018-11-22 18:16     ` Eric Dumazet
  2018-11-22 18:21       ` Eric Dumazet
  0 siblings, 1 reply; 20+ messages in thread
From: Eric Dumazet @ 2018-11-22 18:16 UTC (permalink / raw)
  To: Neal Cardwell
  Cc: David Miller, netdev, Jean-Louis Dupond, Yuchung Cheng, Eric Dumazet

On Thu, Nov 22, 2018 at 10:01 AM Neal Cardwell <ncardwell@google.com> wrote:
>
> On Wed, Nov 21, 2018 at 12:52 PM Eric Dumazet <edumazet@google.com> wrote:
> >
> > In case GRO is not as efficient as it should be or disabled,
> > we might have a user thread trapped in __release_sock() while
> > the softirq handler floods packets up to the point we have to drop.
> >
> > This patch balances work done from user thread and softirq,
> > to give more chances to __release_sock() to complete its work.
> >
> > This also helps if we receive many ACK packets, since GRO
> > does not aggregate them.
>
> Would this coalesce duplicate incoming ACK packets? Is there a risk
> that this would eliminate incoming dupacks needed for fast recovery in
> non-SACK connections? Perhaps pure ACKs should only be coalesced if
> the ACK field is different?

Yes, I was considering properly filtering SACK as a refinement later [1]
but you raise a valid point for alien stacks that are not yet using SACK :/

[1] This version of the patch will not aggregate sacks since the
memcmp() on tcp options would fail.

Neal, can you double check that cake_ack_filter() does not have the issue
you just mentioned?


* Re: [PATCH net-next 2/3] tcp: implement coalescing on backlog queue
  2018-11-22 18:16     ` Eric Dumazet
@ 2018-11-22 18:21       ` Eric Dumazet
  0 siblings, 0 replies; 20+ messages in thread
From: Eric Dumazet @ 2018-11-22 18:21 UTC (permalink / raw)
  To: Neal Cardwell
  Cc: David Miller, netdev, Jean-Louis Dupond, Yuchung Cheng, Eric Dumazet

On Thu, Nov 22, 2018 at 10:16 AM Eric Dumazet <edumazet@google.com> wrote:

> Yes, I was considering properly filtering SACK as a refinement later [1]
> but you raise a valid point for alien stacks that are not yet using SACK :/
>
> [1] This version of the patch will not aggregate sacks since the
> memcmp() on tcp options would fail.
>
> Neal, can you double check that cake_ack_filter() does not have the issue
> you just mentioned?

Note that aggregated pure acks will have gso_segs set to the number
of aggregated acks; we might simply use this value later in the stack,
instead of keeping X pure acks in the backlog and increasing memory
needs and cpu costs.
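
Purely as an illustration of that idea (not part of this series), a
later consumer could read the aggregated count back with a hypothetical
helper along these lines:

static u32 tcp_backlog_ack_count(const struct sk_buff *skb)
{
        /* A coalesced pure-ACK skb carries the number of original ACKs
         * in gso_segs (set by tcp_add_backlog() in patch 2/3).
         */
        return max_t(u32, 1, skb_shinfo(skb)->gso_segs);
}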

Then I guess I need this fix:

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 36c9d715bf2aa7eb7bf58b045bfeb85a2ec1a696..736f7f24cdb4fe61769faaa1644c8bff01c746c4
100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1669,7 +1669,8 @@ bool tcp_add_backlog(struct sock *sk, struct sk_buff *skb)
                __skb_pull(skb, hdrlen);
                if (skb_try_coalesce(tail, skb, &fragstolen, &delta)) {
                        TCP_SKB_CB(tail)->end_seq = TCP_SKB_CB(skb)->end_seq;
-                       TCP_SKB_CB(tail)->ack_seq = TCP_SKB_CB(skb)->ack_seq;
+                       if (after(TCP_SKB_CB(skb)->ack_seq,
TCP_SKB_CB(tail)->ack_seq))
+                               TCP_SKB_CB(tail)->ack_seq =
TCP_SKB_CB(skb)->ack_seq;
                        TCP_SKB_CB(tail)->tcp_flags |=
TCP_SKB_CB(skb)->tcp_flags;

                        if (TCP_SKB_CB(skb)->has_rxtstamp) {


* Re: [PATCH net-next 0/3] tcp: take a bit more care of backlog stress
  2018-11-21 17:52 [PATCH net-next 0/3] tcp: take a bit more care of backlog stress Eric Dumazet
                   ` (2 preceding siblings ...)
  2018-11-21 17:52 ` [PATCH net-next 3/3] tcp: implement head drops in " Eric Dumazet
@ 2018-11-23 19:25 ` David Miller
  2018-11-23 19:27   ` Eric Dumazet
  3 siblings, 1 reply; 20+ messages in thread
From: David Miller @ 2018-11-23 19:25 UTC (permalink / raw)
  To: edumazet; +Cc: netdev, jean-louis, ncardwell, ycheng, eric.dumazet

From: Eric Dumazet <edumazet@google.com>
Date: Wed, 21 Nov 2018 09:52:37 -0800

> While working on the SACK compression issue Jean-Louis Dupond
> reported, we found that his Linux box was suffering badly
> from tail drops on the socket backlog queue, because the remote
> TCP stack was not implementing the latest RFC recommendations.
>
> The first patch is a cleanup.
>
> The second patch attempts coalescing when a new packet must
> be added to the backlog queue. Cooking bigger skbs helps
> to keep the backlog list smaller and speeds up its handling when
> the user thread finally releases the socket lock.
>
> The third patch implements head drop as a last resort.
> Head drops are generally better for optimal TCP behavior.

My impression is that patch #2 needs some fixes in order to not
lose dupacks.  So there will be a respin of this.

Thanks.


* Re: [PATCH net-next 0/3] tcp: take a bit more care of backlog stress
  2018-11-23 19:25 ` [PATCH net-next 0/3] tcp: take a bit more care of backlog stress David Miller
@ 2018-11-23 19:27   ` Eric Dumazet
  0 siblings, 0 replies; 20+ messages in thread
From: Eric Dumazet @ 2018-11-23 19:27 UTC (permalink / raw)
  To: David Miller
  Cc: netdev, Jean-Louis Dupond, Neal Cardwell, Yuchung Cheng, Eric Dumazet

On Fri, Nov 23, 2018 at 11:25 AM David Miller <davem@davemloft.net> wrote:
> My impression is that patch #2 needs some fixes in order to not
> lose dupacks.  So there will be a respin of this.
>
> Thanks.

You are absolutely right, we will submit a v2 next week after TG holidays.

Thanks.

