* [PATCH net-next 0/3] tcp: avoid sending too small packets
@ 2024-04-18 21:45 Eric Dumazet
  2024-04-18 21:45 ` [PATCH net-next 1/3] tcp: remove dubious FIN exception from tcp_cwnd_test() Eric Dumazet
                   ` (3 more replies)
  0 siblings, 4 replies; 5+ messages in thread
From: Eric Dumazet @ 2024-04-18 21:45 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni
  Cc: netdev, Neal Cardwell, Kevin Yang, eric.dumazet, Eric Dumazet

tcp_sendmsg() cooks 'large' skbs, which are later split
if needed by tcp_write_xmit().

After a split, the leftover skb is smaller than the optimal
size, and this causes a performance drop.

In this series, a tcp_grow_skb() helper is added to shift
payload from the second skb in the write queue to the first
skb, so that we always send optimally sized skbs.

This increases TSO efficiency and decreases the number
of ACK packets.
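
To illustrate the idea outside the kernel, here is a minimal
userspace C sketch of growing the head buffer from the next one
in the queue; the toy struct buf stands in for struct sk_buff,
and all names and sizes are illustrative, not the kernel API:

#include <stdio.h>
#include <string.h>

/* Toy write-queue entry; the kernel equivalent is struct sk_buff. */
struct buf {
	char data[180224];
	size_t len;
};

/* Move up to 'missing' bytes from 'next' into 'head', so the head
 * reaches the optimal send size, as tcp_grow_skb() does with
 * skb_shift().
 */
static void grow_head(struct buf *head, struct buf *next, size_t missing)
{
	size_t nlen = missing < next->len ? missing : next->len;

	memcpy(head->data + head->len, next->data, nlen);
	memmove(next->data, next->data + nlen, next->len - nlen);
	head->len += nlen;
	next->len -= nlen;
}

int main(void)
{
	static struct buf head = { .len = 172032 };
	static struct buf next = { .len = 8192 };

	grow_head(&head, &next, 180224 - head.len);
	/* head=180224 next=0 : one optimally sized packet. */
	printf("head=%zu next=%zu\n", head.len, next.len);
	return 0;
}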

Eric Dumazet (3):
  tcp: remove dubious FIN exception from tcp_cwnd_test()
  tcp: call tcp_set_skb_tso_segs() from tcp_write_xmit()
  tcp: try to send bigger TSO packets

 net/ipv4/tcp_output.c | 78 +++++++++++++++++++++++++++++--------------
 1 file changed, 53 insertions(+), 25 deletions(-)

-- 
2.44.0.769.g3c40516874-goog


* [PATCH net-next 1/3] tcp: remove dubious FIN exception from tcp_cwnd_test()
  2024-04-18 21:45 [PATCH net-next 0/3] tcp: avoid sending too small packets Eric Dumazet
@ 2024-04-18 21:45 ` Eric Dumazet
  2024-04-18 21:45 ` [PATCH net-next 2/3] tcp: call tcp_set_skb_tso_segs() from tcp_write_xmit() Eric Dumazet
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 5+ messages in thread
From: Eric Dumazet @ 2024-04-18 21:45 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni
  Cc: netdev, Neal Cardwell, Kevin Yang, eric.dumazet, Eric Dumazet

tcp_cwnd_test() has a special handling for the last packet in
the write queue, if it is smaller than one MSS and has the FIN flag.

This is in violation of the TCP RFC, and seems quite dubious.

This packet can be sent only if the current CWND is bigger
than the number of packets in flight.

Making the tcp_cwnd_test() result independent of the first skb
in the write queue is needed for the last patch of the series.
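
The quota computed by the remainder of the function (the part not
shown in the hunk below) is pure arithmetic on cwnd and in_flight;
a standalone C sketch of that logic, assuming the upstream body caps
the result at half the cwnd:

#include <stdio.h>

/* Segments we may send now: 0 if the window is already full,
 * otherwise the remaining window, capped at half the cwnd.
 */
static unsigned int cwnd_quota(unsigned int cwnd, unsigned int in_flight)
{
	unsigned int halfcwnd, left;

	if (in_flight >= cwnd)
		return 0;
	halfcwnd = cwnd / 2 > 1 ? cwnd / 2 : 1;
	left = cwnd - in_flight;
	return halfcwnd < left ? halfcwnd : left;
}

int main(void)
{
	/* cwnd=10 with 7 packets in flight: min(5, 3) = 3 segments. */
	printf("%u\n", cwnd_quota(10, 7));
	return 0;
}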

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/ipv4/tcp_output.c | 18 +++++-------------
 1 file changed, 5 insertions(+), 13 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 61119d42b0fd27a3736e136b1e81f6fc2d4cb44b..acbc76ca3e640354880c62c2423cfe4ba99f0be3 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2073,16 +2073,10 @@ static unsigned int tcp_mss_split_point(const struct sock *sk,
 /* Can at least one segment of SKB be sent right now, according to the
  * congestion window rules?  If so, return how many segments are allowed.
  */
-static inline unsigned int tcp_cwnd_test(const struct tcp_sock *tp,
-					 const struct sk_buff *skb)
+static u32 tcp_cwnd_test(const struct tcp_sock *tp)
 {
 	u32 in_flight, cwnd, halfcwnd;
 
-	/* Don't be strict about the congestion window for the final FIN.  */
-	if ((TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN) &&
-	    tcp_skb_pcount(skb) == 1)
-		return 1;
-
 	in_flight = tcp_packets_in_flight(tp);
 	cwnd = tcp_snd_cwnd(tp);
 	if (in_flight >= cwnd)
@@ -2706,10 +2700,9 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct sk_buff *skb;
 	unsigned int tso_segs, sent_pkts;
-	int cwnd_quota;
+	u32 cwnd_quota, max_segs;
 	int result;
 	bool is_cwnd_limited = false, is_rwnd_limited = false;
-	u32 max_segs;
 
 	sent_pkts = 0;
 
@@ -2743,7 +2736,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 		tso_segs = tcp_init_tso_segs(skb, mss_now);
 		BUG_ON(!tso_segs);
 
-		cwnd_quota = tcp_cwnd_test(tp, skb);
+		cwnd_quota = tcp_cwnd_test(tp);
 		if (!cwnd_quota) {
 			if (push_one == 2)
 				/* Force out a loss probe pkt. */
@@ -2772,9 +2765,8 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 		limit = mss_now;
 		if (tso_segs > 1 && !tcp_urg_mode(tp))
 			limit = tcp_mss_split_point(sk, skb, mss_now,
-						    min_t(unsigned int,
-							  cwnd_quota,
-							  max_segs),
+						    min(cwnd_quota,
+							max_segs),
 						    nonagle);
 
 		if (skb->len > limit &&
-- 
2.44.0.769.g3c40516874-goog


* [PATCH net-next 2/3] tcp: call tcp_set_skb_tso_segs() from tcp_write_xmit()
  2024-04-18 21:45 [PATCH net-next 0/3] tcp: avoid sending too small packets Eric Dumazet
  2024-04-18 21:45 ` [PATCH net-next 1/3] tcp: remove dubious FIN exception from tcp_cwnd_test() Eric Dumazet
@ 2024-04-18 21:45 ` Eric Dumazet
  2024-04-18 21:46 ` [PATCH net-next 3/3] tcp: try to send bigger TSO packets Eric Dumazet
  2024-04-22 21:30 ` [PATCH net-next 0/3] tcp: avoid sending too small packets patchwork-bot+netdevbpf
  3 siblings, 0 replies; 5+ messages in thread
From: Eric Dumazet @ 2024-04-18 21:45 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni
  Cc: netdev, Neal Cardwell, Kevin Yang, eric.dumazet, Eric Dumazet

tcp_write_xmit() calls tcp_init_tso_segs()
to set gso_size and gso_segs on the packet.

tcp_init_tso_segs() requires the stack to maintain
an up-to-date tcp_skb_pcount(), which makes sense
for packets in the rtx queue, but not so much for
packets still in the write queue.

In the following patch, we don't want to deal with
tcp_skb_pcount() when moving payload from the second
skb to the first skb in the write queue.
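
For reference, the pcount rule this relies on is just a rounded-up
division of payload by MSS; a small illustrative C sketch of the
rule (not the kernel code itself):

#include <stdio.h>

/* gso_segs for a packet: 1 for a packet of at most one MSS
 * (gso_size stays 0), otherwise ceil(len / mss), matching
 * tcp_set_skb_tso_segs() in the diff below.
 */
static int tso_segs(unsigned int len, unsigned int mss)
{
	if (len <= mss)
		return 1;
	return (len + mss - 1) / mss;	/* DIV_ROUND_UP */
}

int main(void)
{
	printf("%d\n", tso_segs(49980, 4096));	/* 13 segments */
	printf("%d\n", tso_segs(2048, 4096));	/* 1 segment */
	return 0;
}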

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/ipv4/tcp_output.c | 26 ++++++++++++++------------
 1 file changed, 14 insertions(+), 12 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index acbc76ca3e640354880c62c2423cfe4ba99f0be3..5e8665241f9345f38ce56afffe473948aef66786 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1502,18 +1502,22 @@ static void tcp_queue_skb(struct sock *sk, struct sk_buff *skb)
 }
 
 /* Initialize TSO segments for a packet. */
-static void tcp_set_skb_tso_segs(struct sk_buff *skb, unsigned int mss_now)
+static int tcp_set_skb_tso_segs(struct sk_buff *skb, unsigned int mss_now)
 {
+	int tso_segs;
+
 	if (skb->len <= mss_now) {
 		/* Avoid the costly divide in the normal
 		 * non-TSO case.
 		 */
-		tcp_skb_pcount_set(skb, 1);
 		TCP_SKB_CB(skb)->tcp_gso_size = 0;
-	} else {
-		tcp_skb_pcount_set(skb, DIV_ROUND_UP(skb->len, mss_now));
-		TCP_SKB_CB(skb)->tcp_gso_size = mss_now;
+		tcp_skb_pcount_set(skb, 1);
+		return 1;
 	}
+	TCP_SKB_CB(skb)->tcp_gso_size = mss_now;
+	tso_segs = DIV_ROUND_UP(skb->len, mss_now);
+	tcp_skb_pcount_set(skb, tso_segs);
+	return tso_segs;
 }
 
 /* Pcount in the middle of the write queue got changed, we need to do various
@@ -2097,10 +2101,9 @@ static int tcp_init_tso_segs(struct sk_buff *skb, unsigned int mss_now)
 {
 	int tso_segs = tcp_skb_pcount(skb);
 
-	if (!tso_segs || (tso_segs > 1 && tcp_skb_mss(skb) != mss_now)) {
-		tcp_set_skb_tso_segs(skb, mss_now);
-		tso_segs = tcp_skb_pcount(skb);
-	}
+	if (!tso_segs || (tso_segs > 1 && tcp_skb_mss(skb) != mss_now))
+		return tcp_set_skb_tso_segs(skb, mss_now);
+
 	return tso_segs;
 }
 
@@ -2733,9 +2736,6 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 		if (tcp_pacing_check(sk))
 			break;
 
-		tso_segs = tcp_init_tso_segs(skb, mss_now);
-		BUG_ON(!tso_segs);
-
 		cwnd_quota = tcp_cwnd_test(tp);
 		if (!cwnd_quota) {
 			if (push_one == 2)
@@ -2745,6 +2745,8 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 				break;
 		}
 
+		tso_segs = tcp_set_skb_tso_segs(skb, mss_now);
+
 		if (unlikely(!tcp_snd_wnd_test(tp, skb, mss_now))) {
 			is_rwnd_limited = true;
 			break;
-- 
2.44.0.769.g3c40516874-goog


* [PATCH net-next 3/3] tcp: try to send bigger TSO packets
  2024-04-18 21:45 [PATCH net-next 0/3] tcp: avoid sending too small packets Eric Dumazet
  2024-04-18 21:45 ` [PATCH net-next 1/3] tcp: remove dubious FIN exception from tcp_cwnd_test() Eric Dumazet
  2024-04-18 21:45 ` [PATCH net-next 2/3] tcp: call tcp_set_skb_tso_segs() from tcp_write_xmit() Eric Dumazet
@ 2024-04-18 21:46 ` Eric Dumazet
  2024-04-22 21:30 ` [PATCH net-next 0/3] tcp: avoid sending too small packets patchwork-bot+netdevbpf
  3 siblings, 0 replies; 5+ messages in thread
From: Eric Dumazet @ 2024-04-18 21:46 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni
  Cc: netdev, Neal Cardwell, Kevin Yang, eric.dumazet, Eric Dumazet

While investigating TCP performance, I found that TCP would
sometimes send big skbs followed by a single-MSS skb,
in a 'locked' pattern.

For instance, with BIG TCP enabled, the MSS is set to carry 4096 bytes
of payload per segment, and gso_max_size is set to 181000.

This means that an optimal TCP packet should contain
44 * 4096 = 180224 bytes of payload.

However, I was seeing packet sizes interleaved in this pattern:

172032, 8192, 172032, 8192, 172032, 8192, <repeat>

The tcp_tso_should_defer() heuristic is defeated, because after a
packet in the write queue is split for whatever reason (this might
be a too-small CWND or a small enough pacing_rate), the leftover
packet in the queue is smaller than the optimal size.

It is time to try to make 'leftover packets' bigger, so that
tcp_tso_should_defer() can reach its full potential.
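
The trigger is simple arithmetic: grow the head skb only when it is
short of the budget allowed by the congestion window and max_segs.
A plain C sketch with the illustrative values above:

#include <stdio.h>

int main(void)
{
	unsigned int mss = 4096, cwnd_quota = 44, max_segs = 44;
	unsigned int skb_len = 172032;	/* leftover after an earlier split */
	unsigned int quota = cwnd_quota < max_segs ? cwnd_quota : max_segs;
	int missing = (int)(quota * mss) - (int)skb_len;

	/* 44 * 4096 - 172032 = 8192: pull 8192 bytes from the next skb. */
	if (missing > 0)
		printf("grow head skb by %d bytes\n", missing);
	return 0;
}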

After this patch, we can see the following output:

14:13:34.009273 IP6 sender > receiver: Flags [P.], seq 4048380:4098360, ack 1, win 256, options [nop,nop,TS val 3425678144 ecr 1561784500], length 49980
14:13:34.010272 IP6 sender > receiver: Flags [P.], seq 4098360:4148340, ack 1, win 256, options [nop,nop,TS val 3425678145 ecr 1561784501], length 49980
14:13:34.011271 IP6 sender > receiver: Flags [P.], seq 4148340:4198320, ack 1, win 256, options [nop,nop,TS val 3425678146 ecr 1561784502], length 49980
14:13:34.012271 IP6 sender > receiver: Flags [P.], seq 4198320:4248300, ack 1, win 256, options [nop,nop,TS val 3425678147 ecr 1561784503], length 49980
14:13:34.013272 IP6 sender > receiver: Flags [P.], seq 4248300:4298280, ack 1, win 256, options [nop,nop,TS val 3425678148 ecr 1561784504], length 49980
14:13:34.014271 IP6 sender > receiver: Flags [P.], seq 4298280:4348260, ack 1, win 256, options [nop,nop,TS val 3425678149 ecr 1561784505], length 49980
14:13:34.015272 IP6 sender > receiver: Flags [P.], seq 4348260:4398240, ack 1, win 256, options [nop,nop,TS val 3425678150 ecr 1561784506], length 49980
14:13:34.016270 IP6 sender > receiver: Flags [P.], seq 4398240:4448220, ack 1, win 256, options [nop,nop,TS val 3425678151 ecr 1561784507], length 49980
14:13:34.017269 IP6 sender > receiver: Flags [P.], seq 4448220:4498200, ack 1, win 256, options [nop,nop,TS val 3425678152 ecr 1561784508], length 49980
14:13:34.018276 IP6 sender > receiver: Flags [P.], seq 4498200:4548180, ack 1, win 256, options [nop,nop,TS val 3425678153 ecr 1561784509], length 49980
14:13:34.019259 IP6 sender > receiver: Flags [P.], seq 4548180:4598160, ack 1, win 256, options [nop,nop,TS val 3425678154 ecr 1561784510], length 49980

Each packet above carries 49980 bytes of payload, i.e. 35 MSS of
1428 bytes (presumably the IPv6 MSS with TCP timestamps over a
1500-byte MTU).

With 200 concurrent flows on a 100Gbit NIC, we can see a reduction
of about 30% in TSO packets (and ACK packets).

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/ipv4/tcp_output.c | 38 ++++++++++++++++++++++++++++++++++++--
 1 file changed, 36 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 5e8665241f9345f38ce56afffe473948aef66786..99a1d88f7f47b9ef0334efe62f8fd34c0d693ced 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2683,6 +2683,36 @@ void tcp_chrono_stop(struct sock *sk, const enum tcp_chrono type)
 		tcp_chrono_set(tp, TCP_CHRONO_BUSY);
 }
 
+/* First skb in the write queue is smaller than ideal packet size.
+ * Check if we can move payload from the second skb in the queue.
+ */
+static void tcp_grow_skb(struct sock *sk, struct sk_buff *skb, int amount)
+{
+	struct sk_buff *next_skb = skb->next;
+	unsigned int nlen;
+
+	if (tcp_skb_is_last(sk, skb))
+		return;
+
+	if (!tcp_skb_can_collapse(skb, next_skb))
+		return;
+
+	nlen = min_t(u32, amount, next_skb->len);
+	if (!nlen || !skb_shift(skb, next_skb, nlen))
+		return;
+
+	TCP_SKB_CB(skb)->end_seq += nlen;
+	TCP_SKB_CB(next_skb)->seq += nlen;
+
+	if (!next_skb->len) {
+		TCP_SKB_CB(skb)->end_seq = TCP_SKB_CB(next_skb)->end_seq;
+		TCP_SKB_CB(skb)->eor = TCP_SKB_CB(next_skb)->eor;
+		TCP_SKB_CB(skb)->tcp_flags |= TCP_SKB_CB(next_skb)->tcp_flags;
+		tcp_unlink_write_queue(next_skb, sk);
+		tcp_wmem_free_skb(sk, next_skb);
+	}
+}
+
 /* This routine writes packets to the network.  It advances the
  * send_head.  This happens as incoming acks open up the remote
  * window for us.
@@ -2723,6 +2753,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 	max_segs = tcp_tso_segs(sk, mss_now);
 	while ((skb = tcp_send_head(sk))) {
 		unsigned int limit;
+		int missing_bytes;
 
 		if (unlikely(tp->repair) && tp->repair_queue == TCP_SEND_QUEUE) {
 			/* "skb_mstamp_ns" is used as a start point for the retransmit timer */
@@ -2744,6 +2775,10 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 			else
 				break;
 		}
+		cwnd_quota = min(cwnd_quota, max_segs);
+		missing_bytes = cwnd_quota * mss_now - skb->len;
+		if (missing_bytes > 0)
+			tcp_grow_skb(sk, skb, missing_bytes);
 
 		tso_segs = tcp_set_skb_tso_segs(skb, mss_now);
 
@@ -2767,8 +2802,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 		limit = mss_now;
 		if (tso_segs > 1 && !tcp_urg_mode(tp))
 			limit = tcp_mss_split_point(sk, skb, mss_now,
-						    min(cwnd_quota,
-							max_segs),
+						    cwnd_quota,
 						    nonagle);
 
 		if (skb->len > limit &&
-- 
2.44.0.769.g3c40516874-goog


* Re: [PATCH net-next 0/3] tcp: avoid sending too small packets
  2024-04-18 21:45 [PATCH net-next 0/3] tcp: avoid sending too small packets Eric Dumazet
                   ` (2 preceding siblings ...)
  2024-04-18 21:46 ` [PATCH net-next 3/3] tcp: try to send bigger TSO packets Eric Dumazet
@ 2024-04-22 21:30 ` patchwork-bot+netdevbpf
  3 siblings, 0 replies; 5+ messages in thread
From: patchwork-bot+netdevbpf @ 2024-04-22 21:30 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: davem, kuba, pabeni, netdev, ncardwell, yyd, eric.dumazet

Hello:

This series was applied to netdev/net-next.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Thu, 18 Apr 2024 21:45:57 +0000 you wrote:
> tcp_sendmsg() cooks 'large' skbs, that are later split
> if needed from tcp_write_xmit().
> 
> After a split, the leftover skb size is smaller than the optimal
> size, and this causes a performance drop.
> 
> In this series, tcp_grow_skb() helper is added to shift
> payload from the second skb in the write queue to the first
> skb to always send optimal sized skbs.
> 
> [...]

Here is the summary with links:
  - [net-next,1/3] tcp: remove dubious FIN exception from tcp_cwnd_test()
    https://git.kernel.org/netdev/net-next/c/22555032c513
  - [net-next,2/3] tcp: call tcp_set_skb_tso_segs() from tcp_write_xmit()
    https://git.kernel.org/netdev/net-next/c/d5b38a71d333
  - [net-next,3/3] tcp: try to send bigger TSO packets
    https://git.kernel.org/netdev/net-next/c/8ee602c63520

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html


