[PATCH v3 net-next 0/4] tcp: take a bit more care of backlog stress

* [PATCH v3 net-next 0/4] tcp: take a bit more care of backlog stress
@ 2018-11-27 22:41 Eric Dumazet
  2018-11-27 22:42 ` [PATCH v3 net-next 1/4] tcp: hint compiler about sack flows Eric Dumazet
                   ` (4 more replies)
  0 siblings, 5 replies; 9+ messages in thread
From: Eric Dumazet @ 2018-11-27 22:41 UTC (permalink / raw)
  To: David S . Miller
  Cc: netdev, Jean-Louis Dupond, Neal Cardwell, Yuchung Cheng,
	Eric Dumazet, Eric Dumazet

While working on the SACK compression issue Jean-Louis Dupond
reported, we found that his Linux box was suffering badly
from tail drops on the socket backlog queue.

The first patch hints to the compiler that SACK flows are the norm.

The second patch changes the non-SACK code in preparation for ACK
compression.

The third patch fixes tcp_space() to take the backlog into account.

The fourth patch attempts coalescing when a new packet must
be added to the backlog queue. Building bigger skbs keeps the
backlog list shorter and speeds up its handling when the
user thread finally releases the socket lock.

v3: Neal/Yuchung feedback addressed:
    Do not aggregate if any skb has URG bit set.
    Do not aggregate if the skbs have different ECE/CWR bits.

v2: added feedback from Neal: tcp: take care of compressed acks in tcp_add_reno_sack()
    added: tcp: hint compiler about sack flows
    added: tcp: make tcp_space() aware of socket backlog

Eric Dumazet (4):
  tcp: hint compiler about sack flows
  tcp: take care of compressed acks in tcp_add_reno_sack()
  tcp: make tcp_space() aware of socket backlog
  tcp: implement coalescing on backlog queue

 include/net/tcp.h         |  4 +-
 include/uapi/linux/snmp.h |  1 +
 net/ipv4/proc.c           |  1 +
 net/ipv4/tcp_input.c      | 58 +++++++++++++-----------
 net/ipv4/tcp_ipv4.c       | 92 ++++++++++++++++++++++++++++++++++++---
 5 files changed, 123 insertions(+), 33 deletions(-)

-- 
2.20.0.rc0.387.gc7a69e6b6c-goog
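
The two v3 rules listed in the cover letter (no aggregation when either skb carries URG, none when the ECE/CWR bits differ) can be illustrated with a small userspace sketch. The DEMO_* constants mirror the usual TCP header flag bit values, and demo_flags_allow_coalesce() is a hypothetical helper written for this illustration; it is not a function from the series.

```c
#include <stdbool.h>
#include <stdint.h>

/* TCP header flag bits (standard values; names are local to this sketch). */
#define DEMO_TCPHDR_PSH 0x08
#define DEMO_TCPHDR_ACK 0x10
#define DEMO_TCPHDR_URG 0x20
#define DEMO_TCPHDR_ECE 0x40
#define DEMO_TCPHDR_CWR 0x80

/* Hypothetical helper: may two packets with these flags be merged? */
static bool demo_flags_allow_coalesce(uint8_t tail_flags, uint8_t skb_flags)
{
	/* OR: reject if either packet has URG set. */
	if ((tail_flags | skb_flags) & DEMO_TCPHDR_URG)
		return false;

	/* XOR: reject if the ECE or CWR bits differ between the two. */
	if ((tail_flags ^ skb_flags) & (DEMO_TCPHDR_ECE | DEMO_TCPHDR_CWR))
		return false;

	return true;
}
```

Using OR for URG and XOR for ECE/CWR encodes the difference between the two rules: URG blocks merging whenever it appears at all, while ECE/CWR only block merging when the two packets disagree.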

* [PATCH v3 net-next 1/4] tcp: hint compiler about sack flows
  2018-11-27 22:41 [PATCH v3 net-next 0/4] tcp: take a bit more care of backlog stress Eric Dumazet
@ 2018-11-27 22:42 ` Eric Dumazet
  2018-11-27 22:42 ` [PATCH v3 net-next 2/4] tcp: take care of compressed acks in tcp_add_reno_sack() Eric Dumazet
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 9+ messages in thread
From: Eric Dumazet @ 2018-11-27 22:42 UTC (permalink / raw)
  To: David S . Miller
  Cc: netdev, Jean-Louis Dupond, Neal Cardwell, Yuchung Cheng,
	Eric Dumazet, Eric Dumazet

Tell the compiler that most TCP flows are using SACK these days.

There is no need to add the unlikely() clause in tcp_is_reno(),
since the compiler is able to infer it.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
---
 include/net/tcp.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 63e37dd1c274cc396e41ea9612cf67a5b7c89776..0c61bf0a06dac95268c26b6302a2afbaef4c88b3 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1124,7 +1124,7 @@ void tcp_rate_check_app_limited(struct sock *sk);
  */
 static inline int tcp_is_sack(const struct tcp_sock *tp)
 {
-	return tp->rx_opt.sack_ok;
+	return likely(tp->rx_opt.sack_ok);
 }
 
 static inline bool tcp_is_reno(const struct tcp_sock *tp)
-- 
2.20.0.rc0.387.gc7a69e6b6c-goog
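
For readers outside the kernel tree, the hint in the patch above can be approximated in userspace. This is a minimal sketch that assumes a GCC or Clang compiler (for __builtin_expect); struct demo_tcp_sock and the demo_* functions are hypothetical stand-ins, not the kernel definitions.

```c
#include <stdbool.h>

/* Userspace stand-ins for the kernel's branch-prediction macros. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical, trimmed-down stand-in for struct tcp_sock. */
struct demo_tcp_sock {
	unsigned int sack_ok;	/* non-zero when SACK was negotiated */
};

static inline int demo_tcp_is_sack(const struct demo_tcp_sock *tp)
{
	/* Tell the compiler the SACK case is the expected one. */
	return likely(tp->sack_ok);
}

static inline bool demo_tcp_is_reno(const struct demo_tcp_sock *tp)
{
	/* No explicit unlikely() here: the hint propagates through inlining. */
	return !demo_tcp_is_sack(tp);
}
```

Note that with this definition of likely(), the helper returns 0 or 1 rather than the raw field value, which is harmless for callers that only test the result for truth.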

* [PATCH v3 net-next 2/4] tcp: take care of compressed acks in tcp_add_reno_sack()
  2018-11-27 22:41 [PATCH v3 net-next 0/4] tcp: take a bit more care of backlog stress Eric Dumazet
  2018-11-27 22:42 ` [PATCH v3 net-next 1/4] tcp: hint compiler about sack flows Eric Dumazet
@ 2018-11-27 22:42 ` Eric Dumazet
  2018-11-28 16:41   ` Neal Cardwell
  2018-11-27 22:42 ` [PATCH v3 net-next 3/4] tcp: make tcp_space() aware of socket backlog Eric Dumazet
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 9+ messages in thread
From: Eric Dumazet @ 2018-11-27 22:42 UTC (permalink / raw)
  To: David S . Miller
  Cc: netdev, Jean-Louis Dupond, Neal Cardwell, Yuchung Cheng,
	Eric Dumazet, Eric Dumazet

Neal pointed out that non-SACK flows might suffer from the ACK compression
added in the following patch ("tcp: implement coalescing on backlog queue").

Instead of tweaking tcp_add_backlog(), we can take into
account how many ACKs were coalesced; this information
will be available in skb_shinfo(skb)->gso_segs.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/ipv4/tcp_input.c | 58 +++++++++++++++++++++++++-------------------
 1 file changed, 33 insertions(+), 25 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index f32397890b6dcbc34976954c4be142108efa04d8..e5f3819ad859f6e6ca28e09c0f2dbdb7052708ee 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -1863,16 +1863,20 @@ static void tcp_check_reno_reordering(struct sock *sk, const int addend)
 
 /* Emulate SACKs for SACKless connection: account for a new dupack. */
 
-static void tcp_add_reno_sack(struct sock *sk)
+static void tcp_add_reno_sack(struct sock *sk, int num_dupack)
 {
-	struct tcp_sock *tp = tcp_sk(sk);
-	u32 prior_sacked = tp->sacked_out;
+	if (num_dupack) {
+		struct tcp_sock *tp = tcp_sk(sk);
+		u32 prior_sacked = tp->sacked_out;
+		s32 delivered;
 
-	tp->sacked_out++;
-	tcp_check_reno_reordering(sk, 0);
-	if (tp->sacked_out > prior_sacked)
-		tp->delivered++; /* Some out-of-order packet is delivered */
-	tcp_verify_left_out(tp);
+		tp->sacked_out += num_dupack;
+		tcp_check_reno_reordering(sk, 0);
+		delivered = tp->sacked_out - prior_sacked;
+		if (delivered > 0)
+			tp->delivered += delivered;
+		tcp_verify_left_out(tp);
+	}
 }
 
 /* Account for ACK, ACKing some data in Reno Recovery phase. */
@@ -2634,7 +2638,7 @@ void tcp_enter_recovery(struct sock *sk, bool ece_ack)
 /* Process an ACK in CA_Loss state. Move to CA_Open if lost data are
  * recovered or spurious. Otherwise retransmits more on partial ACKs.
  */
-static void tcp_process_loss(struct sock *sk, int flag, bool is_dupack,
+static void tcp_process_loss(struct sock *sk, int flag, int num_dupack,
 			     int *rexmit)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
@@ -2653,7 +2657,7 @@ static void tcp_process_loss(struct sock *sk, int flag, bool is_dupack,
 			return;
 
 		if (after(tp->snd_nxt, tp->high_seq)) {
-			if (flag & FLAG_DATA_SACKED || is_dupack)
+			if (flag & FLAG_DATA_SACKED || num_dupack)
 				tp->frto = 0; /* Step 3.a. loss was real */
 		} else if (flag & FLAG_SND_UNA_ADVANCED && !recovered) {
 			tp->high_seq = tp->snd_nxt;
@@ -2679,8 +2683,8 @@ static void tcp_process_loss(struct sock *sk, int flag, bool is_dupack,
 		/* A Reno DUPACK means new data in F-RTO step 2.b above are
 		 * delivered. Lower inflight to clock out (re)tranmissions.
 		 */
-		if (after(tp->snd_nxt, tp->high_seq) && is_dupack)
-			tcp_add_reno_sack(sk);
+		if (after(tp->snd_nxt, tp->high_seq) && num_dupack)
+			tcp_add_reno_sack(sk, num_dupack);
 		else if (flag & FLAG_SND_UNA_ADVANCED)
 			tcp_reset_reno_sack(tp);
 	}
@@ -2757,13 +2761,13 @@ static bool tcp_force_fast_retransmit(struct sock *sk)
  * tcp_xmit_retransmit_queue().
  */
 static void tcp_fastretrans_alert(struct sock *sk, const u32 prior_snd_una,
-				  bool is_dupack, int *ack_flag, int *rexmit)
+				  int num_dupack, int *ack_flag, int *rexmit)
 {
 	struct inet_connection_sock *icsk = inet_csk(sk);
 	struct tcp_sock *tp = tcp_sk(sk);
 	int fast_rexmit = 0, flag = *ack_flag;
-	bool do_lost = is_dupack || ((flag & FLAG_DATA_SACKED) &&
-				     tcp_force_fast_retransmit(sk));
+	bool do_lost = num_dupack || ((flag & FLAG_DATA_SACKED) &&
+				      tcp_force_fast_retransmit(sk));
 
 	if (!tp->packets_out && tp->sacked_out)
 		tp->sacked_out = 0;
@@ -2810,8 +2814,8 @@ static void tcp_fastretrans_alert(struct sock *sk, const u32 prior_snd_una,
 	switch (icsk->icsk_ca_state) {
 	case TCP_CA_Recovery:
 		if (!(flag & FLAG_SND_UNA_ADVANCED)) {
-			if (tcp_is_reno(tp) && is_dupack)
-				tcp_add_reno_sack(sk);
+			if (tcp_is_reno(tp))
+				tcp_add_reno_sack(sk, num_dupack);
 		} else {
 			if (tcp_try_undo_partial(sk, prior_snd_una))
 				return;
@@ -2826,7 +2830,7 @@ static void tcp_fastretrans_alert(struct sock *sk, const u32 prior_snd_una,
 		tcp_identify_packet_loss(sk, ack_flag);
 		break;
 	case TCP_CA_Loss:
-		tcp_process_loss(sk, flag, is_dupack, rexmit);
+		tcp_process_loss(sk, flag, num_dupack, rexmit);
 		tcp_identify_packet_loss(sk, ack_flag);
 		if (!(icsk->icsk_ca_state == TCP_CA_Open ||
 		      (*ack_flag & FLAG_LOST_RETRANS)))
@@ -2837,8 +2841,7 @@ static void tcp_fastretrans_alert(struct sock *sk, const u32 prior_snd_una,
 		if (tcp_is_reno(tp)) {
 			if (flag & FLAG_SND_UNA_ADVANCED)
 				tcp_reset_reno_sack(tp);
-			if (is_dupack)
-				tcp_add_reno_sack(sk);
+			tcp_add_reno_sack(sk, num_dupack);
 		}
 
 		if (icsk->icsk_ca_state <= TCP_CA_Disorder)
@@ -3558,7 +3561,7 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
 	bool is_sack_reneg = tp->is_sack_reneg;
 	u32 ack_seq = TCP_SKB_CB(skb)->seq;
 	u32 ack = TCP_SKB_CB(skb)->ack_seq;
-	bool is_dupack = false;
+	int num_dupack = 0;
 	int prior_packets = tp->packets_out;
 	u32 delivered = tp->delivered;
 	u32 lost = tp->lost;
@@ -3669,8 +3672,13 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
 		tcp_set_xmit_timer(sk);
 
 	if (tcp_ack_is_dubious(sk, flag)) {
-		is_dupack = !(flag & (FLAG_SND_UNA_ADVANCED | FLAG_NOT_DUP));
-		tcp_fastretrans_alert(sk, prior_snd_una, is_dupack, &flag,
+		if (!(flag & (FLAG_SND_UNA_ADVANCED | FLAG_NOT_DUP))) {
+			num_dupack = 1;
+			/* Consider if pure acks were aggregated in tcp_add_backlog() */
+			if (!(flag & FLAG_DATA))
+				num_dupack = max_t(u16, 1, skb_shinfo(skb)->gso_segs);
+		}
+		tcp_fastretrans_alert(sk, prior_snd_una, num_dupack, &flag,
 				      &rexmit);
 	}
 
@@ -3688,7 +3696,7 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
 no_queue:
 	/* If data was DSACKed, see if we can undo a cwnd reduction. */
 	if (flag & FLAG_DSACKING_ACK) {
-		tcp_fastretrans_alert(sk, prior_snd_una, is_dupack, &flag,
+		tcp_fastretrans_alert(sk, prior_snd_una, num_dupack, &flag,
 				      &rexmit);
 		tcp_newly_delivered(sk, delivered, flag);
 	}
@@ -3713,7 +3721,7 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
 	if (TCP_SKB_CB(skb)->sacked) {
 		flag |= tcp_sacktag_write_queue(sk, skb, prior_snd_una,
 						&sack_state);
-		tcp_fastretrans_alert(sk, prior_snd_una, is_dupack, &flag,
+		tcp_fastretrans_alert(sk, prior_snd_una, num_dupack, &flag,
 				      &rexmit);
 		tcp_newly_delivered(sk, delivered, flag);
 		tcp_xmit_recovery(sk, rexmit);
-- 
2.20.0.rc0.387.gc7a69e6b6c-goog
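
The accounting change in the patch above can be modelled in userspace. The sketch below is a simplified illustration, not the kernel code: struct demo_tp keeps only four counters, and demo_limit_reno_sacked() is an assumed, reduced stand-in for the clamping that tcp_check_reno_reordering() performs in the real stack.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical subset of the counters kept in struct tcp_sock. */
struct demo_tp {
	uint32_t packets_out;	/* packets in flight */
	uint32_t lost_out;	/* packets marked lost */
	uint32_t sacked_out;	/* emulated SACK count for Reno flows */
	uint32_t delivered;	/* total packets known delivered */
};

/* Assumed clamp: sacked_out plus at least one hole must fit in packets_out. */
static void demo_limit_reno_sacked(struct demo_tp *tp)
{
	uint32_t holes = tp->lost_out ? tp->lost_out : 1;

	if (holes > tp->packets_out)
		holes = tp->packets_out;
	if (tp->sacked_out + holes > tp->packets_out)
		tp->sacked_out = tp->packets_out - holes;
}

/* Account for num_dupack duplicate ACKs at once instead of exactly one. */
static void demo_add_reno_sack(struct demo_tp *tp, int num_dupack)
{
	if (num_dupack) {
		uint32_t prior_sacked = tp->sacked_out;
		int32_t delivered;

		tp->sacked_out += (uint32_t)num_dupack;
		demo_limit_reno_sacked(tp);
		delivered = (int32_t)(tp->sacked_out - prior_sacked);
		if (delivered > 0)
			tp->delivered += (uint32_t)delivered;
	}
}

/* How many dupacks does one (possibly aggregated) skb stand for? */
static int demo_num_dupack(bool is_dupack, bool has_data, uint16_t gso_segs)
{
	if (!is_dupack)
		return 0;
	if (has_data)
		return 1;	/* only pure ACKs are counted per segment */
	return gso_segs > 1 ? gso_segs : 1;
}
```

Passing a count rather than a boolean lets the callers drop their own is_dupack checks, since a count of zero makes the function a no-op.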

* Re: [PATCH v3 net-next 2/4] tcp: take care of compressed acks in tcp_add_reno_sack()
  2018-11-27 22:42 ` [PATCH v3 net-next 2/4] tcp: take care of compressed acks in tcp_add_reno_sack() Eric Dumazet
@ 2018-11-28 16:41   ` Neal Cardwell
  0 siblings, 0 replies; 9+ messages in thread
From: Neal Cardwell @ 2018-11-28 16:41 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, Netdev, jean-louis, Yuchung Cheng, Eric Dumazet

On Tue, Nov 27, 2018 at 5:42 PM Eric Dumazet <edumazet@google.com> wrote:
>
> Neal pointed out that non-SACK flows might suffer from the ACK compression
> added in the following patch ("tcp: implement coalescing on backlog queue").
>
> Instead of tweaking tcp_add_backlog(), we can take into
> account how many ACKs were coalesced; this information
> will be available in skb_shinfo(skb)->gso_segs.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> ---

Acked-by: Neal Cardwell <ncardwell@google.com>

Thanks!

neal

* [PATCH v3 net-next 3/4] tcp: make tcp_space() aware of socket backlog
  2018-11-27 22:41 [PATCH v3 net-next 0/4] tcp: take a bit more care of backlog stress Eric Dumazet
  2018-11-27 22:42 ` [PATCH v3 net-next 1/4] tcp: hint compiler about sack flows Eric Dumazet
  2018-11-27 22:42 ` [PATCH v3 net-next 2/4] tcp: take care of compressed acks in tcp_add_reno_sack() Eric Dumazet
@ 2018-11-27 22:42 ` Eric Dumazet
  2018-11-28 14:54   ` Jean-Louis Dupond
  2018-11-27 22:42 ` [PATCH v3 net-next 4/4] tcp: implement coalescing on backlog queue Eric Dumazet
  2018-11-30 21:27 ` [PATCH v3 net-next 0/4] tcp: take a bit more care of backlog stress David Miller
  4 siblings, 1 reply; 9+ messages in thread
From: Eric Dumazet @ 2018-11-27 22:42 UTC (permalink / raw)
  To: David S . Miller
  Cc: netdev, Jean-Louis Dupond, Neal Cardwell, Yuchung Cheng,
	Eric Dumazet, Eric Dumazet

Jean-Louis Dupond reported poor iSCSI TCP receive performance
that we tracked to backlog drops.

Apparently we fail to send window updates reflecting the
fact that we are under stress.

Note that we might lack a proper window increase when
backlog is fully processed, since __release_sock() clears
sk->sk_backlog.len _after_ all skbs have been processed.

This should not matter in practice. If we had a significant
load through socket backlog, we are in a dangerous
situation.

Reported-by: Jean-Louis Dupond <jean-louis@dupond.be>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
---
 include/net/tcp.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 0c61bf0a06dac95268c26b6302a2afbaef4c88b3..3b522259da7d5a54d7d3730ddd8d8c9ef24313e1 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1368,7 +1368,7 @@ static inline int tcp_win_from_space(const struct sock *sk, int space)
 /* Note: caller must be prepared to deal with negative returns */
 static inline int tcp_space(const struct sock *sk)
 {
-	return tcp_win_from_space(sk, sk->sk_rcvbuf -
+	return tcp_win_from_space(sk, sk->sk_rcvbuf - sk->sk_backlog.len -
 				  atomic_read(&sk->sk_rmem_alloc));
 }
 
-- 
2.20.0.rc0.387.gc7a69e6b6c-goog
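
The effect of the one-line change above can be seen with a small userspace model. This is a sketch under stated assumptions: struct demo_sock holds plain integers instead of the kernel's atomic counters, and demo_win_from_space() takes the window scaling factor (the tcp_adv_win_scale sysctl in the kernel) as an explicit parameter.

```c
/* Hypothetical stand-in for the few struct sock fields involved. */
struct demo_sock {
	int rcvbuf;		/* models sk->sk_rcvbuf */
	int rmem_alloc;		/* models sk->sk_rmem_alloc */
	int backlog_len;	/* models sk->sk_backlog.len */
};

/* Reserve part of the buffer space for skb overhead. */
static int demo_win_from_space(int space, int adv_win_scale)
{
	return adv_win_scale <= 0 ?
		(space >> (-adv_win_scale)) :
		space - (space >> adv_win_scale);
}

/* Before the patch: bytes queued on the backlog are ignored. */
static int demo_tcp_space_old(const struct demo_sock *sk, int adv_win_scale)
{
	return demo_win_from_space(sk->rcvbuf - sk->rmem_alloc, adv_win_scale);
}

/* After the patch: backlog bytes reduce the advertised space. */
static int demo_tcp_space_new(const struct demo_sock *sk, int adv_win_scale)
{
	return demo_win_from_space(sk->rcvbuf - sk->backlog_len -
				   sk->rmem_alloc, adv_win_scale);
}
```

With rcvbuf = 100000, rmem_alloc = 20000, backlog_len = 60000 and a scale of 1, the old computation yields 40000 while the new one yields 10000, so the receiver advertises a much smaller window while the backlog is loaded. As the comment in the patch says, callers must still be prepared for negative results.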

* Re: [PATCH v3 net-next 3/4] tcp: make tcp_space() aware of socket backlog
  2018-11-27 22:42 ` [PATCH v3 net-next 3/4] tcp: make tcp_space() aware of socket backlog Eric Dumazet
@ 2018-11-28 14:54   ` Jean-Louis Dupond
  0 siblings, 0 replies; 9+ messages in thread
From: Jean-Louis Dupond @ 2018-11-28 14:54 UTC (permalink / raw)
  To: Eric Dumazet, David S . Miller
  Cc: netdev, Neal Cardwell, Yuchung Cheng, Eric Dumazet

On 27/11/18 23:42, Eric Dumazet wrote:
> Jean-Louis Dupond reported poor iSCSI TCP receive performance
> that we tracked to backlog drops.
>
> Apparently we fail to send window updates reflecting the
> fact that we are under stress.
>
> Note that we might lack a proper window increase when
> backlog is fully processed, since __release_sock() clears
> sk->sk_backlog.len _after_ all skbs have been processed.
>
> This should not matter in practice. If we had a significant
> load through socket backlog, we are in a dangerous
> situation.
>
> Reported-by: Jean-Louis Dupond <jean-louis@dupond.be>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Acked-by: Neal Cardwell <ncardwell@google.com>
> Acked-by: Yuchung Cheng <ycheng@google.com>
> ---
>   include/net/tcp.h | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index 0c61bf0a06dac95268c26b6302a2afbaef4c88b3..3b522259da7d5a54d7d3730ddd8d8c9ef24313e1 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -1368,7 +1368,7 @@ static inline int tcp_win_from_space(const struct sock *sk, int space)
>   /* Note: caller must be prepared to deal with negative returns */
>   static inline int tcp_space(const struct sock *sk)
>   {
> -	return tcp_win_from_space(sk, sk->sk_rcvbuf -
> +	return tcp_win_from_space(sk, sk->sk_rcvbuf - sk->sk_backlog.len -
>   				  atomic_read(&sk->sk_rmem_alloc));
>   }
>   

Tested-by: Jean-Louis Dupond <jean-louis@dupond.be>

Big difference in performance :)
Thanks a lot!

* [PATCH v3 net-next 4/4] tcp: implement coalescing on backlog queue
  2018-11-27 22:41 [PATCH v3 net-next 0/4] tcp: take a bit more care of backlog stress Eric Dumazet
                   ` (2 preceding siblings ...)
  2018-11-27 22:42 ` [PATCH v3 net-next 3/4] tcp: make tcp_space() aware of socket backlog Eric Dumazet
@ 2018-11-27 22:42 ` Eric Dumazet
  2018-11-28 16:46   ` Neal Cardwell
  2018-11-30 21:27 ` [PATCH v3 net-next 0/4] tcp: take a bit more care of backlog stress David Miller
  4 siblings, 1 reply; 9+ messages in thread
From: Eric Dumazet @ 2018-11-27 22:42 UTC (permalink / raw)
  To: David S . Miller
  Cc: netdev, Jean-Louis Dupond, Neal Cardwell, Yuchung Cheng,
	Eric Dumazet, Eric Dumazet

If GRO is disabled or not as efficient as it should be,
we might have a user thread trapped in __release_sock() while
the softirq handler floods packets up to the point where we have to drop.

This patch balances the work done by the user thread and softirq,
to give __release_sock() more chances to complete its work
before new packets are added to the backlog.

This also helps if we receive many ACK packets, since GRO
does not aggregate them.

This patch brings a ~60% throughput increase on a receiver
without GRO, but the spectacular gain is really the
1000x release_sock() latency reduction I have measured.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
---
 include/uapi/linux/snmp.h |  1 +
 net/ipv4/proc.c           |  1 +
 net/ipv4/tcp_ipv4.c       | 92 ++++++++++++++++++++++++++++++++++++---
 3 files changed, 88 insertions(+), 6 deletions(-)

diff --git a/include/uapi/linux/snmp.h b/include/uapi/linux/snmp.h
index f80135e5feaa886000009db6dff75b2bc2d637b2..86dc24a96c90ab047d5173d625450facd6c6dd79 100644
--- a/include/uapi/linux/snmp.h
+++ b/include/uapi/linux/snmp.h
@@ -243,6 +243,7 @@ enum
 	LINUX_MIB_TCPREQQFULLDROP,		/* TCPReqQFullDrop */
 	LINUX_MIB_TCPRETRANSFAIL,		/* TCPRetransFail */
 	LINUX_MIB_TCPRCVCOALESCE,		/* TCPRcvCoalesce */
+	LINUX_MIB_TCPBACKLOGCOALESCE,		/* TCPBacklogCoalesce */
 	LINUX_MIB_TCPOFOQUEUE,			/* TCPOFOQueue */
 	LINUX_MIB_TCPOFODROP,			/* TCPOFODrop */
 	LINUX_MIB_TCPOFOMERGE,			/* TCPOFOMerge */
diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
index 70289682a6701438aed99a00a9705c39fa4394d3..c3610b37bb4ce665b1976d8cc907b6dd0de42ab9 100644
--- a/net/ipv4/proc.c
+++ b/net/ipv4/proc.c
@@ -219,6 +219,7 @@ static const struct snmp_mib snmp4_net_list[] = {
 	SNMP_MIB_ITEM("TCPRenoRecoveryFail", LINUX_MIB_TCPRENORECOVERYFAIL),
 	SNMP_MIB_ITEM("TCPSackRecoveryFail", LINUX_MIB_TCPSACKRECOVERYFAIL),
 	SNMP_MIB_ITEM("TCPRcvCollapsed", LINUX_MIB_TCPRCVCOLLAPSED),
+	SNMP_MIB_ITEM("TCPBacklogCoalesce", LINUX_MIB_TCPBACKLOGCOALESCE),
 	SNMP_MIB_ITEM("TCPDSACKOldSent", LINUX_MIB_TCPDSACKOLDSENT),
 	SNMP_MIB_ITEM("TCPDSACKOfoSent", LINUX_MIB_TCPDSACKOFOSENT),
 	SNMP_MIB_ITEM("TCPDSACKRecv", LINUX_MIB_TCPDSACKRECV),
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 795605a2327504b8a025405826e7e0ca8dc8501d..4904250a9aac5001410f9454258cbb8978bb8202 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1619,12 +1619,14 @@ int tcp_v4_early_demux(struct sk_buff *skb)
 bool tcp_add_backlog(struct sock *sk, struct sk_buff *skb)
 {
 	u32 limit = sk->sk_rcvbuf + sk->sk_sndbuf;
-
-	/* Only socket owner can try to collapse/prune rx queues
-	 * to reduce memory overhead, so add a little headroom here.
-	 * Few sockets backlog are possibly concurrently non empty.
-	 */
-	limit += 64*1024;
+	struct skb_shared_info *shinfo;
+	const struct tcphdr *th;
+	struct tcphdr *thtail;
+	struct sk_buff *tail;
+	unsigned int hdrlen;
+	bool fragstolen;
+	u32 gso_segs;
+	int delta;
 
 	/* In case all data was pulled from skb frags (in __pskb_pull_tail()),
 	 * we can fix skb->truesize to its real value to avoid future drops.
@@ -1636,6 +1638,84 @@ bool tcp_add_backlog(struct sock *sk, struct sk_buff *skb)
 
 	skb_dst_drop(skb);
 
+	if (unlikely(tcp_checksum_complete(skb))) {
+		bh_unlock_sock(sk);
+		__TCP_INC_STATS(sock_net(sk), TCP_MIB_CSUMERRORS);
+		__TCP_INC_STATS(sock_net(sk), TCP_MIB_INERRS);
+		return true;
+	}
+
+	/* Attempt coalescing to last skb in backlog, even if we are
+	 * above the limits.
+	 * This is okay because skb capacity is limited to MAX_SKB_FRAGS.
+	 */
+	th = (const struct tcphdr *)skb->data;
+	hdrlen = th->doff * 4;
+	shinfo = skb_shinfo(skb);
+
+	if (!shinfo->gso_size)
+		shinfo->gso_size = skb->len - hdrlen;
+
+	if (!shinfo->gso_segs)
+		shinfo->gso_segs = 1;
+
+	tail = sk->sk_backlog.tail;
+	if (!tail)
+		goto no_coalesce;
+	thtail = (struct tcphdr *)tail->data;
+
+	if (TCP_SKB_CB(tail)->end_seq != TCP_SKB_CB(skb)->seq ||
+	    TCP_SKB_CB(tail)->ip_dsfield != TCP_SKB_CB(skb)->ip_dsfield ||
+	    ((TCP_SKB_CB(tail)->tcp_flags |
+	      TCP_SKB_CB(skb)->tcp_flags) & TCPHDR_URG) ||
+	    ((TCP_SKB_CB(tail)->tcp_flags ^
+	      TCP_SKB_CB(skb)->tcp_flags) & (TCPHDR_ECE | TCPHDR_CWR)) ||
+#ifdef CONFIG_TLS_DEVICE
+	    tail->decrypted != skb->decrypted ||
+#endif
+	    thtail->doff != th->doff ||
+	    memcmp(thtail + 1, th + 1, hdrlen - sizeof(*th)))
+		goto no_coalesce;
+
+	__skb_pull(skb, hdrlen);
+	if (skb_try_coalesce(tail, skb, &fragstolen, &delta)) {
+		thtail->window = th->window;
+
+		TCP_SKB_CB(tail)->end_seq = TCP_SKB_CB(skb)->end_seq;
+
+		if (after(TCP_SKB_CB(skb)->ack_seq, TCP_SKB_CB(tail)->ack_seq))
+			TCP_SKB_CB(tail)->ack_seq = TCP_SKB_CB(skb)->ack_seq;
+
+		TCP_SKB_CB(tail)->tcp_flags |= TCP_SKB_CB(skb)->tcp_flags;
+
+		if (TCP_SKB_CB(skb)->has_rxtstamp) {
+			TCP_SKB_CB(tail)->has_rxtstamp = true;
+			tail->tstamp = skb->tstamp;
+			skb_hwtstamps(tail)->hwtstamp = skb_hwtstamps(skb)->hwtstamp;
+		}
+
+		/* Not as strict as GRO. We only need to carry mss max value */
+		skb_shinfo(tail)->gso_size = max(shinfo->gso_size,
+						 skb_shinfo(tail)->gso_size);
+
+		gso_segs = skb_shinfo(tail)->gso_segs + shinfo->gso_segs;
+		skb_shinfo(tail)->gso_segs = min_t(u32, gso_segs, 0xFFFF);
+
+		sk->sk_backlog.len += delta;
+		__NET_INC_STATS(sock_net(sk),
+				LINUX_MIB_TCPBACKLOGCOALESCE);
+		kfree_skb_partial(skb, fragstolen);
+		return false;
+	}
+	__skb_push(skb, hdrlen);
+
+no_coalesce:
+	/* Only socket owner can try to collapse/prune rx queues
+	 * to reduce memory overhead, so add a little headroom here.
+	 * Few sockets backlog are possibly concurrently non empty.
+	 */
+	limit += 64*1024;
+
 	if (unlikely(sk_add_backlog(sk, skb, limit))) {
 		bh_unlock_sock(sk);
 		__NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPBACKLOGDROP);
-- 
2.20.0.rc0.387.gc7a69e6b6c-goog
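
The metadata bookkeeping done on a successful merge in the patch above can be sketched in userspace. This is a simplified model: struct demo_skb is a hypothetical record holding only the fields the merge touches, payload handling (skb_try_coalesce() in the patch) is left out entirely, and the header and option comparisons are reduced to the sequence-contiguity check.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-packet metadata, loosely modelled on TCP_SKB_CB(). */
struct demo_skb {
	uint32_t seq;		/* first sequence number */
	uint32_t end_seq;	/* sequence number after the last byte */
	uint32_t ack_seq;	/* acknowledgment number */
	uint16_t gso_size;	/* largest segment payload seen */
	uint16_t gso_segs;	/* number of original segments */
	uint8_t flags;		/* TCP header flag bits */
};

/* Wrap-safe "a is after b" comparison of 32-bit sequence numbers. */
static bool demo_after(uint32_t a, uint32_t b)
{
	return (int32_t)(b - a) < 0;
}

/* Fold skb into tail if it directly follows it in sequence space. */
static bool demo_try_coalesce(struct demo_skb *tail, const struct demo_skb *skb)
{
	uint32_t segs;

	if (tail->end_seq != skb->seq)
		return false;

	tail->end_seq = skb->end_seq;
	if (demo_after(skb->ack_seq, tail->ack_seq))
		tail->ack_seq = skb->ack_seq;
	tail->flags |= skb->flags;

	/* Only the maximum segment size needs to be carried. */
	if (skb->gso_size > tail->gso_size)
		tail->gso_size = skb->gso_size;

	/* gso_segs is 16 bits wide, so saturate instead of wrapping. */
	segs = (uint32_t)tail->gso_segs + skb->gso_segs;
	tail->gso_segs = segs > 0xFFFF ? 0xFFFF : (uint16_t)segs;
	return true;
}
```

Pure ACKs have seq equal to end_seq, so consecutive pure ACKs pass the contiguity check and accumulate in gso_segs, which is the count that patch 2/4 reads back in tcp_ack().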

* Re: [PATCH v3 net-next 4/4] tcp: implement coalescing on backlog queue
  2018-11-27 22:42 ` [PATCH v3 net-next 4/4] tcp: implement coalescing on backlog queue Eric Dumazet
@ 2018-11-28 16:46   ` Neal Cardwell
  0 siblings, 0 replies; 9+ messages in thread
From: Neal Cardwell @ 2018-11-28 16:46 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, Netdev, jean-louis, Yuchung Cheng, Eric Dumazet

On Tue, Nov 27, 2018 at 5:42 PM Eric Dumazet <edumazet@google.com> wrote:
>
> If GRO is disabled or not as efficient as it should be,
> we might have a user thread trapped in __release_sock() while
> the softirq handler floods packets up to the point where we have to drop.
>
> This patch balances the work done by the user thread and softirq,
> to give __release_sock() more chances to complete its work
> before new packets are added to the backlog.
>
> This also helps if we receive many ACK packets, since GRO
> does not aggregate them.
>
> This patch brings a ~60% throughput increase on a receiver
> without GRO, but the spectacular gain is really the
> 1000x release_sock() latency reduction I have measured.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Neal Cardwell <ncardwell@google.com>
> Cc: Yuchung Cheng <ycheng@google.com>
> ---

Acked-by: Neal Cardwell <ncardwell@google.com>

Thanks!

neal

* Re: [PATCH v3 net-next 0/4] tcp: take a bit more care of backlog stress
  2018-11-27 22:41 [PATCH v3 net-next 0/4] tcp: take a bit more care of backlog stress Eric Dumazet
                   ` (3 preceding siblings ...)
  2018-11-27 22:42 ` [PATCH v3 net-next 4/4] tcp: implement coalescing on backlog queue Eric Dumazet
@ 2018-11-30 21:27 ` David Miller
  4 siblings, 0 replies; 9+ messages in thread
From: David Miller @ 2018-11-30 21:27 UTC (permalink / raw)
  To: edumazet; +Cc: netdev, jean-louis, ncardwell, ycheng, eric.dumazet

From: Eric Dumazet <edumazet@google.com>
Date: Tue, 27 Nov 2018 14:41:59 -0800

> While working on the SACK compression issue Jean-Louis Dupond
> reported, we found that his Linux box was suffering badly
> from tail drops on the socket backlog queue.
> 
> The first patch hints to the compiler that SACK flows are the norm.
> 
> The second patch changes the non-SACK code in preparation for ACK
> compression.
> 
> The third patch fixes tcp_space() to take the backlog into account.
> 
> The fourth patch attempts coalescing when a new packet must
> be added to the backlog queue. Building bigger skbs keeps the
> backlog list shorter and speeds up its handling when the
> user thread finally releases the socket lock.
> 
> v3: Neal/Yuchung feedback addressed:
>     Do not aggregate if any skb has URG bit set.
>     Do not aggregate if the skbs have different ECE/CWR bits.
> 
> v2: added feedback from Neal: tcp: take care of compressed acks in tcp_add_reno_sack()
>     added: tcp: hint compiler about sack flows
>     added: tcp: make tcp_space() aware of socket backlog

Series applied, thanks Eric.

I'll push this out after the build check finishes.
