* [PATCH net-next] tcp: refine TSO autosizing
@ 2014-12-05 14:15 Eric Dumazet
  2014-12-05 15:32 ` Neal Cardwell
  2014-12-06  2:22 ` [PATCH v2 " Eric Dumazet
  0 siblings, 2 replies; 9+ messages in thread
From: Eric Dumazet @ 2014-12-05 14:15 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Neal Cardwell, Yuchung Cheng, Nandita Dukkipati

From: Eric Dumazet <edumazet@google.com>

Commit 95bd09eb2750 ("tcp: TSO packets automatic sizing") tried to
control TSO size, but did this at the wrong place (sendmsg() time)

At sendmsg() time, we might have a pessimistic view of flow rate,
and we end up building very small skbs (with 2 MSS per skb).

This is bad because :

 - It sends small TSO packets even in Slow Start where rate quickly
   increases.
 - It tends to make socket write queue very big, increasing tcp_ack()
   processing time, but also increasing memory needs, not necessarily
   accounted for, as fast clones overhead is currently ignored.
 - Lower GRO efficiency and more ACK packets.

Servers with a lot of short-lived connections suffer from this.

Let's instead fill skbs as much as possible (64KB of payload), but split
them at xmit time, when we have a precise idea of the flow rate.
skb split is actually quite efficient.

Patch looks bigger than necessary, because TCP Small Queue decision now
has to take place after the eventual split.
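
For illustration, here is a minimal userspace sketch (plain C, hypothetical
function and parameter names, not the kernel code itself) of the per-ms
sizing math this patch moves to xmit time: roughly one millisecond worth of
the pacing rate (rate >> 10), floored at the tcp_min_tso_segs sysctl and
capped by the GSO segment limit.

#include <stdio.h>
#include <stdint.h>

static uint32_t tso_segs_goal(uint64_t pacing_rate_Bps, uint32_t mss,
			      uint32_t min_tso_segs, uint32_t gso_max_segs)
{
	uint32_t bytes = pacing_rate_Bps >> 10;	/* ~1 ms worth of payload */
	uint32_t segs = bytes / mss;

	if (segs < min_tso_segs)
		segs = min_tso_segs;
	if (segs > gso_max_segs)
		segs = gso_max_segs;
	return segs;
}

int main(void)
{
	/* 1 Gbit/s pacing, 1448 byte mss : about 84 segs per TSO packet */
	printf("%u\n", (unsigned int)tso_segs_goal(125000000ULL, 1448, 2, 65535));
	/* 10 Mbit/s pacing : the 2 segs floor applies */
	printf("%u\n", (unsigned int)tso_segs_goal(1250000ULL, 1448, 2, 65535));
	return 0;
}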

Tested:

40 ms rtt link

nstat >/dev/null
netperf -H remote -Cc -l -2000000 -- -s 1000000
nstat | egrep "IpInReceives|IpOutRequests|TcpOutSegs|IpExtOutOctets"

Before patch :

Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

 87380 2000000 2000000    0.36         44.22   0.00     0.06     0.000   5.007  
IpInReceives                    600                0.0
IpOutRequests                   599                0.0
TcpOutSegs                      1397               0.0
IpExtOutOctets                  2033249            0.0


After patch :

Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

 87380 2000000 2000000    0.36         44.09   0.00     0.00     0.000   0.000  
IpInReceives                    257                0.0
IpOutRequests                   226                0.0
TcpOutSegs                      1399               0.0
IpExtOutOctets                  2013777            0.0


Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/ipv4/tcp.c        |   14 ++------------
 net/ipv4/tcp_output.c |   38 ++++++++++++++++++++++++--------------
 2 files changed, 26 insertions(+), 26 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index dc13a3657e8e..91e6c1406313 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -840,24 +840,14 @@ static unsigned int tcp_xmit_size_goal(struct sock *sk, u32 mss_now,
 	xmit_size_goal = mss_now;
 
 	if (large_allowed && sk_can_gso(sk)) {
-		u32 gso_size, hlen;
+		u32 hlen;
 
 		/* Maybe we should/could use sk->sk_prot->max_header here ? */
 		hlen = inet_csk(sk)->icsk_af_ops->net_header_len +
 		       inet_csk(sk)->icsk_ext_hdr_len +
 		       tp->tcp_header_len;
 
-		/* Goal is to send at least one packet per ms,
-		 * not one big TSO packet every 100 ms.
-		 * This preserves ACK clocking and is consistent
-		 * with tcp_tso_should_defer() heuristic.
-		 */
-		gso_size = sk->sk_pacing_rate / (2 * MSEC_PER_SEC);
-		gso_size = max_t(u32, gso_size,
-				 sysctl_tcp_min_tso_segs * mss_now);
-
-		xmit_size_goal = min_t(u32, gso_size,
-				       sk->sk_gso_max_size - 1 - hlen);
+		xmit_size_goal = sk->sk_gso_max_size - 1 - hlen;
 
 		xmit_size_goal = tcp_bound_to_half_wnd(tp, xmit_size_goal);
 
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index f5bd4bd3f7e6..dac9991bccbf 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1533,6 +1533,16 @@ static unsigned int tcp_mss_split_point(const struct sock *sk,
 {
 	const struct tcp_sock *tp = tcp_sk(sk);
 	u32 partial, needed, window, max_len;
+	u32 bytes = sk->sk_pacing_rate >> 10;
+	u32 segs;
+
+	/* Goal is to send at least one packet per ms,
+	 * not one big TSO packet every 100 ms.
+	 * This preserves ACK clocking and is consistent
+	 * with tcp_tso_should_defer() heuristic.
+	 */
+	segs = max_t(u32, bytes / mss_now, sysctl_tcp_min_tso_segs);
+	max_segs = min_t(u32, max_segs, segs);
 
 	window = tcp_wnd_end(tp) - TCP_SKB_CB(skb)->seq;
 	max_len = mss_now * max_segs;
@@ -2008,6 +2018,18 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 				break;
 		}
 
+		limit = mss_now;
+		if (tso_segs > 1 && !tcp_urg_mode(tp))
+			limit = tcp_mss_split_point(sk, skb, mss_now,
+						    min_t(unsigned int,
+							  cwnd_quota,
+							  sk->sk_gso_max_segs),
+						    nonagle);
+
+		if (skb->len > limit &&
+		    unlikely(tso_fragment(sk, skb, limit, mss_now, gfp)))
+			break;
+
 		/* TCP Small Queues :
 		 * Control number of packets in qdisc/devices to two packets / or ~1 ms.
 		 * This allows for :
@@ -2018,8 +2040,8 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 		 * of queued bytes to ensure line rate.
 		 * One example is wifi aggregation (802.11 AMPDU)
 		 */
-		limit = max_t(unsigned int, sysctl_tcp_limit_output_bytes,
-			      sk->sk_pacing_rate >> 10);
+		limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 10);
+		limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes);
 
 		if (atomic_read(&sk->sk_wmem_alloc) > limit) {
 			set_bit(TSQ_THROTTLED, &tp->tsq_flags);
@@ -2032,18 +2054,6 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 				break;
 		}
 
-		limit = mss_now;
-		if (tso_segs > 1 && !tcp_urg_mode(tp))
-			limit = tcp_mss_split_point(sk, skb, mss_now,
-						    min_t(unsigned int,
-							  cwnd_quota,
-							  sk->sk_gso_max_segs),
-						    nonagle);
-
-		if (skb->len > limit &&
-		    unlikely(tso_fragment(sk, skb, limit, mss_now, gfp)))
-			break;
-
 		if (unlikely(tcp_transmit_skb(sk, skb, 1, gfp)))
 			break;
 


* Re: [PATCH net-next] tcp: refine TSO autosizing
  2014-12-05 14:15 [PATCH net-next] tcp: refine TSO autosizing Eric Dumazet
@ 2014-12-05 15:32 ` Neal Cardwell
  2014-12-05 17:06   ` Eric Dumazet
  2014-12-06  2:22 ` [PATCH v2 " Eric Dumazet
  1 sibling, 1 reply; 9+ messages in thread
From: Neal Cardwell @ 2014-12-05 15:32 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev, Yuchung Cheng, Nandita Dukkipati

On Fri, Dec 5, 2014 at 9:15 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> Commit 95bd09eb2750 ("tcp: TSO packets automatic sizing") tried to
> control TSO size, but did this at the wrong place (sendmsg() time)
>
> At sendmsg() time, we might have a pessimistic view of flow rate,
> and we end up building very small skbs (with 2 MSS per skb).
>
> This is bad because :
>
>  - It sends small TSO packets even in Slow Start where rate quickly
>    increases.
>  - It tends to make socket write queue very big, increasing tcp_ack()
>    processing time, but also increasing memory needs, not necessarily
>    accounted for, as fast clones overhead is currently ignored.
>  - Lower GRO efficiency and more ACK packets.
>
> Servers with a lot of short-lived connections suffer from this.
>
> Let's instead fill skbs as much as possible (64KB of payload), but split
> them at xmit time, when we have a precise idea of the flow rate.
> skb split is actually quite efficient.

Nice. I definitely agree this is the right direction.

However, from my experience testing a variant of this approach, this
kind of late decision about packet size was sometimes causing
performance shortfalls on long-RTT, medium-bandwidth paths unless
tcp_tso_should_defer() was also modified to use the new/smaller packet
size goal.

The issue is that tcp_tso_should_defer() uses tp->xmit_size_goal_segs
as a yardstick, and says, "hey, if cwnd and rwin allow us to send
tp->xmit_size_goal_segs * tp->mss_cache then let's go ahead and send
it now."

But if we remove the sendmsg-time autosizing logic that was tuning
tp->xmit_size_goal_segs, then tcp_tso_should_defer() is now going to
be waiting to try to accumulate permission to send a big skb with
tp->xmit_size_goal_segs (e.g. ~40) MSS in it.

In my tests I was able to fix this issue by making
tcp_tso_should_defer() use the latest size goal instead of
tp->xmit_size_goal_segs.

So, how about making the rate-based TSO autosizing goal (stored in
"segs" in this patch) at the top of tcp_write_xmit()? Then we could
pass in that segment goal to tcp_tso_should_defer() for use instead of
tp->xmit_size_goal_segs in deciding whether we have a big enough chunk
to send now. Similarly, that segment goal could be passed in to
tcp_mss_split_point instead of sk->sk_gso_max_segs.

(The autosizing calculation could be in a helper function to keep
tcp_write_xmit() manageable.)
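
(For illustration only, a rough sketch of the deferral test described above,
with hypothetical names: send now only when the available window covers the
current per-flow segment goal rather than the fixed tp->xmit_size_goal_segs.)

#include <stdint.h>
#include <stdbool.h>

static bool should_send_now(uint32_t send_win, uint32_t cong_win,
			    uint32_t seg_goal, uint32_t mss)
{
	uint32_t limit = send_win < cong_win ? send_win : cong_win;

	/* enough room for a full goal-sized TSO skb -> do not defer */
	return limit >= seg_goal * mss;
}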

neal


* Re: [PATCH net-next] tcp: refine TSO autosizing
  2014-12-05 15:32 ` Neal Cardwell
@ 2014-12-05 17:06   ` Eric Dumazet
  0 siblings, 0 replies; 9+ messages in thread
From: Eric Dumazet @ 2014-12-05 17:06 UTC (permalink / raw)
  To: Neal Cardwell; +Cc: David Miller, netdev, Yuchung Cheng, Nandita Dukkipati

On Fri, 2014-12-05 at 10:32 -0500, Neal Cardwell wrote:
> On Fri, Dec 5, 2014 at 9:15 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > From: Eric Dumazet <edumazet@google.com>
> >
> > Commit 95bd09eb2750 ("tcp: TSO packets automatic sizing") tried to
> > control TSO size, but did this at the wrong place (sendmsg() time)
> >
> > At sendmsg() time, we might have a pessimistic view of flow rate,
> > and we end up building very small skbs (with 2 MSS per skb).
> >
> > This is bad because :
> >
> >  - It sends small TSO packets even in Slow Start where rate quickly
> >    increases.
> >  - It tends to make socket write queue very big, increasing tcp_ack()
> >    processing time, but also increasing memory needs, not necessarily
> >    accounted for, as fast clones overhead is currently ignored.
> >  - Lower GRO efficiency and more ACK packets.
> >
> > Servers with a lot of short-lived connections suffer from this.
> >
> > Let's instead fill skbs as much as possible (64KB of payload), but split
> > them at xmit time, when we have a precise idea of the flow rate.
> > skb split is actually quite efficient.
> 
> Nice. I definitely agree this is the right direction.
> 
> However, from my experience testing a variant of this approach, this
> kind of late decision about packet size was sometimes causing
> performance shortfalls on long-RTT, medium-bandwidth paths unless
> tcp_tso_should_defer() was also modified to use the new/smaller packet
> size goal.
> 
> The issue is that tcp_tso_should_defer() uses tp->xmit_size_goal_segs
> as a yardstick, and says, "hey, if cwnd and rwin allow us to send
> tp->xmit_size_goal_segs * tp->mss_cache then let's go ahead and send
> it now."
> 
> But if we remove the sendmsg-time autosizing logic that was tuning
> tp->xmit_size_goal_segs, then tcp_tso_should_defer() is now going to
> be waiting to try to accumulate permission to send a big skb with
> tp->xmit_size_goal_segs (e.g. ~40) MSS in it.
> 
> In my tests I was able to fix this issue by making
> tcp_tso_should_defer() use the latest size goal instead of
> tp->xmit_size_goal_segs.
> 
> So, how about making the rate-based TSO autosizing goal (stored in
> "segs" in this patch) at the top of tcp_write_xmit()? Then we could
> pass in that segment goal to tcp_tso_should_defer() for use instead of
> tp->xmit_size_goal_segs in deciding whether we have a big enough chunk
> to send now. Similarly, that segment goal could be passed in to
> tcp_mss_split_point instead of sk->sk_gso_max_segs.
> 
> (The autosizing calculation could be in a helper function to keep
> tcp_write_xmit() manageable.)
> 

Sounds like an awesome suggestion indeed, I am working on it.

Thanks Neal !


* [PATCH v2 net-next] tcp: refine TSO autosizing
  2014-12-05 14:15 [PATCH net-next] tcp: refine TSO autosizing Eric Dumazet
  2014-12-05 15:32 ` Neal Cardwell
@ 2014-12-06  2:22 ` Eric Dumazet
  2014-12-06 16:58   ` Eric Dumazet
  2014-12-07 20:22   ` [PATCH v3 " Eric Dumazet
  1 sibling, 2 replies; 9+ messages in thread
From: Eric Dumazet @ 2014-12-06  2:22 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Neal Cardwell, Yuchung Cheng, Nandita Dukkipati

From: Eric Dumazet <edumazet@google.com>

Commit 95bd09eb2750 ("tcp: TSO packets automatic sizing") tried to
control TSO size, but did this at the wrong place (sendmsg() time)

At sendmsg() time, we might have a pessimistic view of flow rate,
and we end up building very small skbs (with 2 MSS per skb).

This is bad because :

 - It sends small TSO packets even in Slow Start where rate quickly
   increases.
 - It tends to make socket write queue very big, increasing tcp_ack()
   processing time, but also increasing memory needs, not necessarily
   accounted for, as fast clones overhead is currently ignored.
 - Lower GRO efficiency and more ACK packets.

Servers with a lot of short-lived connections suffer from this.

Let's instead fill skbs as much as possible (64KB of payload), but split
them at xmit time, when we have a precise idea of the flow rate.
skb split is actually quite efficient.

Patch looks bigger than necessary, because TCP Small Queue decision now
has to take place after the eventual split.

As Neal suggested, get rid of tp->xmit_size_goal_segs, and
introduce a new tcp_tso_autosize() helper, so that
tcp_tso_should_defer() can be synchronized on same goal.

Tested:

40 ms rtt link

nstat >/dev/null
netperf -H remote -Cc -l -2000000 -- -s 1000000
nstat | egrep "IpInReceives|IpOutRequests|TcpOutSegs|IpExtOutOctets"

Before patch :

Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

 87380 2000000 2000000    0.36         44.22   0.00     0.06     0.000   5.007  
IpInReceives                    600                0.0
IpOutRequests                   599                0.0
TcpOutSegs                      1397               0.0
IpExtOutOctets                  2033249            0.0


After patch :

Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

 87380 2000000 2000000    0.36         44.09   0.00     0.00     0.000   0.000  
IpInReceives                    257                0.0
IpOutRequests                   226                0.0
TcpOutSegs                      1399               0.0
IpExtOutOctets                  2013777            0.0


Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
---
v2: added tcp_tso_autosize() helper and removed tp->xmit_size_goal_segs

 include/linux/tcp.h   |    2 -
 net/ipv4/tcp.c        |   44 ++++++-----------------------
 net/ipv4/tcp_output.c |   59 +++++++++++++++++++++++++++-------------
 3 files changed, 51 insertions(+), 54 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index f566b8567892ef0bb213de0540b37cfc6ac03ca0..a2944717f1b7d1d12b4e2397c20457931ad71bac 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -130,7 +130,7 @@ struct tcp_sock {
 	/* inet_connection_sock has to be the first member of tcp_sock */
 	struct inet_connection_sock	inet_conn;
 	u16	tcp_header_len;	/* Bytes of tcp header to send		*/
-	u16	xmit_size_goal_segs; /* Goal for segmenting output packets */
+	/* 2 bytes hole */
 
 /*
  *	Header prediction flags
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index dc13a3657e8e1b81ba0cb1fcd5386a9d0b106168..6e692f7ac62880415093ca7a43c782c7fd6a049b 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -835,45 +835,19 @@ static unsigned int tcp_xmit_size_goal(struct sock *sk, u32 mss_now,
 				       int large_allowed)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
-	u32 xmit_size_goal, old_size_goal;
+	u32 hlen, xmit_size_goal;
 
-	xmit_size_goal = mss_now;
+	if (!large_allowed || !sk_can_gso(sk))
+		return mss_now;
 
-	if (large_allowed && sk_can_gso(sk)) {
-		u32 gso_size, hlen;
+	/* Maybe we should/could use sk->sk_prot->max_header here ? */
+	hlen = inet_csk(sk)->icsk_af_ops->net_header_len +
+	       inet_csk(sk)->icsk_ext_hdr_len +
+	       tp->tcp_header_len;
 
-		/* Maybe we should/could use sk->sk_prot->max_header here ? */
-		hlen = inet_csk(sk)->icsk_af_ops->net_header_len +
-		       inet_csk(sk)->icsk_ext_hdr_len +
-		       tp->tcp_header_len;
+	xmit_size_goal = sk->sk_gso_max_size - 1 - hlen;
 
-		/* Goal is to send at least one packet per ms,
-		 * not one big TSO packet every 100 ms.
-		 * This preserves ACK clocking and is consistent
-		 * with tcp_tso_should_defer() heuristic.
-		 */
-		gso_size = sk->sk_pacing_rate / (2 * MSEC_PER_SEC);
-		gso_size = max_t(u32, gso_size,
-				 sysctl_tcp_min_tso_segs * mss_now);
-
-		xmit_size_goal = min_t(u32, gso_size,
-				       sk->sk_gso_max_size - 1 - hlen);
-
-		xmit_size_goal = tcp_bound_to_half_wnd(tp, xmit_size_goal);
-
-		/* We try hard to avoid divides here */
-		old_size_goal = tp->xmit_size_goal_segs * mss_now;
-
-		if (likely(old_size_goal <= xmit_size_goal &&
-			   old_size_goal + mss_now > xmit_size_goal)) {
-			xmit_size_goal = old_size_goal;
-		} else {
-			tp->xmit_size_goal_segs =
-				min_t(u16, xmit_size_goal / mss_now,
-				      sk->sk_gso_max_segs);
-			xmit_size_goal = tp->xmit_size_goal_segs * mss_now;
-		}
-	}
+	xmit_size_goal = tcp_bound_to_half_wnd(tp, xmit_size_goal);
 
 	return max(xmit_size_goal, mss_now);
 }
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index f5bd4bd3f7e669b3fd48a843d55e7313a30a3409..f37ecf53ee8a96827fc08bd203b0ca8857f8fc34 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1524,6 +1524,27 @@ static bool tcp_nagle_check(bool partial, const struct tcp_sock *tp,
 		((nonagle & TCP_NAGLE_CORK) ||
 		 (!nonagle && tp->packets_out && tcp_minshall_check(tp)));
 }
+
+/* Return how many segs we'd like on a TSO packet,
+ * to send one TSO packet per ms.
+ */
+static u32 tcp_tso_autosize(const struct sock *sk, unsigned int mss_now)
+{
+	u32 bytes, segs;
+
+	bytes = min(sk->sk_pacing_rate >> 10,
+		    sk->sk_gso_max_size - 1 - MAX_TCP_HEADER);
+
+	/* Goal is to send at least one packet per ms,
+	 * not one big TSO packet every 100 ms.
+	 * This preserves ACK clocking and is consistent
+	 * with tcp_tso_should_defer() heuristic.
+	 */
+	segs = max_t(u32, bytes / mss_now, sysctl_tcp_min_tso_segs);
+
+	return min_t(u32, segs, sk->sk_gso_max_segs);
+}
+
 /* Returns the portion of skb which can be sent right away */
 static unsigned int tcp_mss_split_point(const struct sock *sk,
 					const struct sk_buff *skb,
@@ -1731,7 +1752,7 @@ static int tso_fragment(struct sock *sk, struct sk_buff *skb, unsigned int len,
  * This algorithm is from John Heffner.
  */
 static bool tcp_tso_should_defer(struct sock *sk, struct sk_buff *skb,
-				 bool *is_cwnd_limited)
+				 bool *is_cwnd_limited, u32 max_segs)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	const struct inet_connection_sock *icsk = inet_csk(sk);
@@ -1761,8 +1782,7 @@ static bool tcp_tso_should_defer(struct sock *sk, struct sk_buff *skb,
 	limit = min(send_win, cong_win);
 
 	/* If a full-sized TSO skb can be sent, do it. */
-	if (limit >= min_t(unsigned int, sk->sk_gso_max_size,
-			   tp->xmit_size_goal_segs * tp->mss_cache))
+	if (limit >= max_segs * tp->mss_cache)
 		goto send_now;
 
 	/* Middle in queue won't get any more data, full sendable already? */
@@ -1959,6 +1979,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 	int cwnd_quota;
 	int result;
 	bool is_cwnd_limited = false;
+	u32 max_segs;
 
 	sent_pkts = 0;
 
@@ -1972,6 +1993,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 		}
 	}
 
+	max_segs = tcp_tso_autosize(sk, mss_now);
 	while ((skb = tcp_send_head(sk))) {
 		unsigned int limit;
 
@@ -2004,10 +2026,23 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 				break;
 		} else {
 			if (!push_one &&
-			    tcp_tso_should_defer(sk, skb, &is_cwnd_limited))
+			    tcp_tso_should_defer(sk, skb, &is_cwnd_limited,
+						 max_segs))
 				break;
 		}
 
+		limit = mss_now;
+		if (tso_segs > 1 && !tcp_urg_mode(tp))
+			limit = tcp_mss_split_point(sk, skb, mss_now,
+						    min_t(unsigned int,
+							  cwnd_quota,
+							  max_segs),
+						    nonagle);
+
+		if (skb->len > limit &&
+		    unlikely(tso_fragment(sk, skb, limit, mss_now, gfp)))
+			break;
+
 		/* TCP Small Queues :
 		 * Control number of packets in qdisc/devices to two packets / or ~1 ms.
 		 * This allows for :
@@ -2018,8 +2053,8 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 		 * of queued bytes to ensure line rate.
 		 * One example is wifi aggregation (802.11 AMPDU)
 		 */
-		limit = max_t(unsigned int, sysctl_tcp_limit_output_bytes,
-			      sk->sk_pacing_rate >> 10);
+		limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 10);
+		limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes);
 
 		if (atomic_read(&sk->sk_wmem_alloc) > limit) {
 			set_bit(TSQ_THROTTLED, &tp->tsq_flags);
@@ -2032,18 +2067,6 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 				break;
 		}
 
-		limit = mss_now;
-		if (tso_segs > 1 && !tcp_urg_mode(tp))
-			limit = tcp_mss_split_point(sk, skb, mss_now,
-						    min_t(unsigned int,
-							  cwnd_quota,
-							  sk->sk_gso_max_segs),
-						    nonagle);
-
-		if (skb->len > limit &&
-		    unlikely(tso_fragment(sk, skb, limit, mss_now, gfp)))
-			break;
-
 		if (unlikely(tcp_transmit_skb(sk, skb, 1, gfp)))
 			break;
 


* Re: [PATCH v2 net-next] tcp: refine TSO autosizing
  2014-12-06  2:22 ` [PATCH v2 " Eric Dumazet
@ 2014-12-06 16:58   ` Eric Dumazet
  2014-12-07 20:22   ` [PATCH v3 " Eric Dumazet
  1 sibling, 0 replies; 9+ messages in thread
From: Eric Dumazet @ 2014-12-06 16:58 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Neal Cardwell, Yuchung Cheng, Nandita Dukkipati

On Fri, 2014-12-05 at 18:22 -0800, Eric Dumazet wrote:
> From: Eric Dumazet <edumazet@google.com>
> 

...

> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Neal Cardwell <ncardwell@google.com>
> ---
> v2: added tcp_tso_autosize() helper and removed tp->xmit_size_goal_segs

Please do not apply this patch. I'll send a v3.

We still need to cook skb with a multiple of mss bytes.
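
To make that concrete, here is a small sketch (userspace C, hypothetical
names, not the kernel code) of the multiple-of-mss rounding that v3 below
restores: keep the sendmsg() size goal a multiple of mss and only recompute
the cached segment count when the goal drifts by at least one mss, so the
common path avoids a divide.

#include <stdint.h>

struct goal_cache {
	uint16_t gso_segs;	/* cached segments per GSO packet */
};

static uint32_t xmit_size_goal(struct goal_cache *c, uint32_t new_goal,
			       uint32_t mss, uint16_t gso_max_segs)
{
	uint32_t size_goal = (uint32_t)c->gso_segs * mss;

	if (new_goal < size_goal || new_goal >= size_goal + mss) {
		uint32_t segs = new_goal / mss;

		if (segs > gso_max_segs)
			segs = gso_max_segs;
		c->gso_segs = (uint16_t)segs;
		size_goal = segs * mss;
	}
	return size_goal > mss ? size_goal : mss;
}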


* [PATCH v3 net-next] tcp: refine TSO autosizing
  2014-12-06  2:22 ` [PATCH v2 " Eric Dumazet
  2014-12-06 16:58   ` Eric Dumazet
@ 2014-12-07 20:22   ` Eric Dumazet
  2014-12-07 21:24     ` Yuchung Cheng
  2014-12-09 21:39     ` David Miller
  1 sibling, 2 replies; 9+ messages in thread
From: Eric Dumazet @ 2014-12-07 20:22 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Neal Cardwell, Yuchung Cheng, Nandita Dukkipati

From: Eric Dumazet <edumazet@google.com>

Commit 95bd09eb2750 ("tcp: TSO packets automatic sizing") tried to
control TSO size, but did this at the wrong place (sendmsg() time)

At sendmsg() time, we might have a pessimistic view of flow rate,
and we end up building very small skbs (with 2 MSS per skb).

This is bad because :

 - It sends small TSO packets even in Slow Start where rate quickly
   increases.
 - It tends to make socket write queue very big, increasing tcp_ack()
   processing time, but also increasing memory needs, not necessarily
   accounted for, as fast clones overhead is currently ignored.
 - Lower GRO efficiency and more ACK packets.

Servers with a lot of short-lived connections suffer from this.

Let's instead fill skbs as much as possible (64KB of payload), but split
them at xmit time, when we have a precise idea of the flow rate.
skb split is actually quite efficient.

Patch looks bigger than necessary, because TCP Small Queue decision now
has to take place after the eventual split.

As Neal suggested, introduce a new tcp_tso_autosize() helper, so that
tcp_tso_should_defer() can be synchronized on same goal.

Rename tp->xmit_size_goal_segs to tp->gso_segs, as this variable
contains number of mss that we can put in GSO packet, and is not
related to the autosizing goal anymore.


Tested:

40 ms rtt link

nstat >/dev/null
netperf -H remote -l -2000000 -- -s 1000000
nstat | egrep "IpInReceives|IpOutRequests|TcpOutSegs|IpExtOutOctets"

Before patch :

Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/s

 87380 2000000 2000000    0.36         44.22  
IpInReceives                    600                0.0
IpOutRequests                   599                0.0
TcpOutSegs                      1397               0.0
IpExtOutOctets                  2033249            0.0


After patch :

Recv   Send    Send                          
Socket Socket  Message  Elapsed              
Size   Size    Size     Time     Throughput  
bytes  bytes   bytes    secs.    10^6bits/sec  

 87380 2000000 2000000    0.36       44.27   
IpInReceives                    221                0.0
IpOutRequests                   232                0.0
TcpOutSegs                      1397               0.0
IpExtOutOctets                  2013953            0.0


Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
---
v3: tcp_xmit_size_goal() still needs to return a multiple of mss.
v2: added tcp_tso_autosize() helper and removed tp->xmit_size_goal_segs

 include/linux/tcp.h   |    2 -
 net/ipv4/tcp.c        |   60 ++++++++++++++--------------------------
 net/ipv4/tcp_output.c |   59 +++++++++++++++++++++++++++------------
 3 files changed, 63 insertions(+), 58 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index f566b8567892ef0bb213de0540b37cfc6ac03ca0..3fa0a9669a3a662be81d4b04f7d117b11012257c 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -130,7 +130,7 @@ struct tcp_sock {
 	/* inet_connection_sock has to be the first member of tcp_sock */
 	struct inet_connection_sock	inet_conn;
 	u16	tcp_header_len;	/* Bytes of tcp header to send		*/
-	u16	xmit_size_goal_segs; /* Goal for segmenting output packets */
+	u16	gso_segs;	/* Max number of segs per GSO packet	*/
 
 /*
  *	Header prediction flags
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index dc13a3657e8e1b81ba0cb1fcd5386a9d0b106168..427aee33ffc04ad189d9d0ec24ab8004c25961ec 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -835,47 +835,29 @@ static unsigned int tcp_xmit_size_goal(struct sock *sk, u32 mss_now,
 				       int large_allowed)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
-	u32 xmit_size_goal, old_size_goal;
-
-	xmit_size_goal = mss_now;
-
-	if (large_allowed && sk_can_gso(sk)) {
-		u32 gso_size, hlen;
-
-		/* Maybe we should/could use sk->sk_prot->max_header here ? */
-		hlen = inet_csk(sk)->icsk_af_ops->net_header_len +
-		       inet_csk(sk)->icsk_ext_hdr_len +
-		       tp->tcp_header_len;
-
-		/* Goal is to send at least one packet per ms,
-		 * not one big TSO packet every 100 ms.
-		 * This preserves ACK clocking and is consistent
-		 * with tcp_tso_should_defer() heuristic.
-		 */
-		gso_size = sk->sk_pacing_rate / (2 * MSEC_PER_SEC);
-		gso_size = max_t(u32, gso_size,
-				 sysctl_tcp_min_tso_segs * mss_now);
-
-		xmit_size_goal = min_t(u32, gso_size,
-				       sk->sk_gso_max_size - 1 - hlen);
-
-		xmit_size_goal = tcp_bound_to_half_wnd(tp, xmit_size_goal);
-
-		/* We try hard to avoid divides here */
-		old_size_goal = tp->xmit_size_goal_segs * mss_now;
-
-		if (likely(old_size_goal <= xmit_size_goal &&
-			   old_size_goal + mss_now > xmit_size_goal)) {
-			xmit_size_goal = old_size_goal;
-		} else {
-			tp->xmit_size_goal_segs =
-				min_t(u16, xmit_size_goal / mss_now,
-				      sk->sk_gso_max_segs);
-			xmit_size_goal = tp->xmit_size_goal_segs * mss_now;
-		}
+	u32 new_size_goal, size_goal, hlen;
+
+	if (!large_allowed || !sk_can_gso(sk))
+		return mss_now;
+
+	/* Maybe we should/could use sk->sk_prot->max_header here ? */
+	hlen = inet_csk(sk)->icsk_af_ops->net_header_len +
+	       inet_csk(sk)->icsk_ext_hdr_len +
+	       tp->tcp_header_len;
+
+	new_size_goal = sk->sk_gso_max_size - 1 - hlen;
+	new_size_goal = tcp_bound_to_half_wnd(tp, new_size_goal);
+
+	/* We try hard to avoid divides here */
+	size_goal = tp->gso_segs * mss_now;
+	if (unlikely(new_size_goal < size_goal ||
+		     new_size_goal >= size_goal + mss_now)) {
+		tp->gso_segs = min_t(u16, new_size_goal / mss_now,
+				     sk->sk_gso_max_segs);
+		size_goal = tp->gso_segs * mss_now;
 	}
 
-	return max(xmit_size_goal, mss_now);
+	return max(size_goal, mss_now);
 }
 
 static int tcp_send_mss(struct sock *sk, int *size_goal, int flags)
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index f5bd4bd3f7e669b3fd48a843d55e7313a30a3409..f37ecf53ee8a96827fc08bd203b0ca8857f8fc34 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1524,6 +1524,27 @@ static bool tcp_nagle_check(bool partial, const struct tcp_sock *tp,
 		((nonagle & TCP_NAGLE_CORK) ||
 		 (!nonagle && tp->packets_out && tcp_minshall_check(tp)));
 }
+
+/* Return how many segs we'd like on a TSO packet,
+ * to send one TSO packet per ms
+ */
+static u32 tcp_tso_autosize(const struct sock *sk, unsigned int mss_now)
+{
+	u32 bytes, segs;
+
+	bytes = min(sk->sk_pacing_rate >> 10,
+		    sk->sk_gso_max_size - 1 - MAX_TCP_HEADER);
+
+	/* Goal is to send at least one packet per ms,
+	 * not one big TSO packet every 100 ms.
+	 * This preserves ACK clocking and is consistent
+	 * with tcp_tso_should_defer() heuristic.
+	 */
+	segs = max_t(u32, bytes / mss_now, sysctl_tcp_min_tso_segs);
+
+	return min_t(u32, segs, sk->sk_gso_max_segs);
+}
+
 /* Returns the portion of skb which can be sent right away */
 static unsigned int tcp_mss_split_point(const struct sock *sk,
 					const struct sk_buff *skb,
@@ -1731,7 +1752,7 @@ static int tso_fragment(struct sock *sk, struct sk_buff *skb, unsigned int len,
  * This algorithm is from John Heffner.
  */
 static bool tcp_tso_should_defer(struct sock *sk, struct sk_buff *skb,
-				 bool *is_cwnd_limited)
+				 bool *is_cwnd_limited, u32 max_segs)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	const struct inet_connection_sock *icsk = inet_csk(sk);
@@ -1761,8 +1782,7 @@ static bool tcp_tso_should_defer(struct sock *sk, struct sk_buff *skb,
 	limit = min(send_win, cong_win);
 
 	/* If a full-sized TSO skb can be sent, do it. */
-	if (limit >= min_t(unsigned int, sk->sk_gso_max_size,
-			   tp->xmit_size_goal_segs * tp->mss_cache))
+	if (limit >= max_segs * tp->mss_cache)
 		goto send_now;
 
 	/* Middle in queue won't get any more data, full sendable already? */
@@ -1959,6 +1979,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 	int cwnd_quota;
 	int result;
 	bool is_cwnd_limited = false;
+	u32 max_segs;
 
 	sent_pkts = 0;
 
@@ -1972,6 +1993,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 		}
 	}
 
+	max_segs = tcp_tso_autosize(sk, mss_now);
 	while ((skb = tcp_send_head(sk))) {
 		unsigned int limit;
 
@@ -2004,10 +2026,23 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 				break;
 		} else {
 			if (!push_one &&
-			    tcp_tso_should_defer(sk, skb, &is_cwnd_limited))
+			    tcp_tso_should_defer(sk, skb, &is_cwnd_limited,
+						 max_segs))
 				break;
 		}
 
+		limit = mss_now;
+		if (tso_segs > 1 && !tcp_urg_mode(tp))
+			limit = tcp_mss_split_point(sk, skb, mss_now,
+						    min_t(unsigned int,
+							  cwnd_quota,
+							  max_segs),
+						    nonagle);
+
+		if (skb->len > limit &&
+		    unlikely(tso_fragment(sk, skb, limit, mss_now, gfp)))
+			break;
+
 		/* TCP Small Queues :
 		 * Control number of packets in qdisc/devices to two packets / or ~1 ms.
 		 * This allows for :
@@ -2018,8 +2053,8 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 		 * of queued bytes to ensure line rate.
 		 * One example is wifi aggregation (802.11 AMPDU)
 		 */
-		limit = max_t(unsigned int, sysctl_tcp_limit_output_bytes,
-			      sk->sk_pacing_rate >> 10);
+		limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 10);
+		limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes);
 
 		if (atomic_read(&sk->sk_wmem_alloc) > limit) {
 			set_bit(TSQ_THROTTLED, &tp->tsq_flags);
@@ -2032,18 +2067,6 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 				break;
 		}
 
-		limit = mss_now;
-		if (tso_segs > 1 && !tcp_urg_mode(tp))
-			limit = tcp_mss_split_point(sk, skb, mss_now,
-						    min_t(unsigned int,
-							  cwnd_quota,
-							  sk->sk_gso_max_segs),
-						    nonagle);
-
-		if (skb->len > limit &&
-		    unlikely(tso_fragment(sk, skb, limit, mss_now, gfp)))
-			break;
-
 		if (unlikely(tcp_transmit_skb(sk, skb, 1, gfp)))
 			break;
 


* Re: [PATCH v3 net-next] tcp: refine TSO autosizing
  2014-12-07 20:22   ` [PATCH v3 " Eric Dumazet
@ 2014-12-07 21:24     ` Yuchung Cheng
  2014-12-08  6:32       ` Eric Dumazet
  2014-12-09 21:39     ` David Miller
  1 sibling, 1 reply; 9+ messages in thread
From: Yuchung Cheng @ 2014-12-07 21:24 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev, Neal Cardwell, Nandita Dukkipati

On Sun, Dec 7, 2014 at 12:22 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> Commit 95bd09eb2750 ("tcp: TSO packets automatic sizing") tried to
> control TSO size, but did this at the wrong place (sendmsg() time)
>
> At sendmsg() time, we might have a pessimistic view of flow rate,
> and we end up building very small skbs (with 2 MSS per skb).
>
> This is bad because :
>
>  - It sends small TSO packets even in Slow Start where rate quickly
>    increases.
>  - It tends to make socket write queue very big, increasing tcp_ack()
>    processing time, but also increasing memory needs, not necessarily
>    accounted for, as fast clones overhead is currently ignored.
>  - Lower GRO efficiency and more ACK packets.
>
> Servers with a lot of short-lived connections suffer from this.
>
> Let's instead fill skbs as much as possible (64KB of payload), but split
> them at xmit time, when we have a precise idea of the flow rate.
> skb split is actually quite efficient.
>
> Patch looks bigger than necessary, because TCP Small Queue decision now
> has to take place after the eventual split.
>
> As Neal suggested, introduce a new tcp_tso_autosize() helper, so that
> tcp_tso_should_defer() can be synchronized on same goal.
>
> Rename tp->xmit_size_goal_segs to tp->gso_segs, as this variable
> contains number of mss that we can put in GSO packet, and is not
> related to the autosizing goal anymore.
>
>
> Tested:
>
> 40 ms rtt link
>
> nstat >/dev/null
> netperf -H remote -l -2000000 -- -s 1000000
> nstat | egrep "IpInReceives|IpOutRequests|TcpOutSegs|IpExtOutOctets"
>
> Before patch :
>
> Recv   Send    Send
> Socket Socket  Message  Elapsed
> Size   Size    Size     Time     Throughput
> bytes  bytes   bytes    secs.    10^6bits/s
>
>  87380 2000000 2000000    0.36         44.22
> IpInReceives                    600                0.0
> IpOutRequests                   599                0.0
> TcpOutSegs                      1397               0.0
> IpExtOutOctets                  2033249            0.0
>
>
> After patch :
>
> Recv   Send    Send
> Socket Socket  Message  Elapsed
> Size   Size    Size     Time     Throughput
> bytes  bytes   bytes    secs.    10^6bits/sec
>
>  87380 2000000 2000000    0.36       44.27
> IpInReceives                    221                0.0
> IpOutRequests                   232                0.0
> TcpOutSegs                      1397               0.0
> IpExtOutOctets                  2013953            0.0
>
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>

Very nice Eric. This would both improve performance and simplify the
complex TSO logic. I have a small question below.


> ---
> v3: tcp_xmit_size_goal() still needs to return a multiple of mss.
> v2: added tcp_tso_autosize() helper and removed tp->xmit_size_goal_segs
>
>  include/linux/tcp.h   |    2 -
>  net/ipv4/tcp.c        |   60 ++++++++++++++--------------------------
>  net/ipv4/tcp_output.c |   59 +++++++++++++++++++++++++++------------
>  3 files changed, 63 insertions(+), 58 deletions(-)
>
> diff --git a/include/linux/tcp.h b/include/linux/tcp.h
> index f566b8567892ef0bb213de0540b37cfc6ac03ca0..3fa0a9669a3a662be81d4b04f7d117b11012257c 100644
> --- a/include/linux/tcp.h
> +++ b/include/linux/tcp.h
> @@ -130,7 +130,7 @@ struct tcp_sock {
>         /* inet_connection_sock has to be the first member of tcp_sock */
>         struct inet_connection_sock     inet_conn;
>         u16     tcp_header_len; /* Bytes of tcp header to send          */
> -       u16     xmit_size_goal_segs; /* Goal for segmenting output packets */
> +       u16     gso_segs;       /* Max number of segs per GSO packet    */
>
>  /*
>   *     Header prediction flags
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index dc13a3657e8e1b81ba0cb1fcd5386a9d0b106168..427aee33ffc04ad189d9d0ec24ab8004c25961ec 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -835,47 +835,29 @@ static unsigned int tcp_xmit_size_goal(struct sock *sk, u32 mss_now,
>                                        int large_allowed)
>  {
>         struct tcp_sock *tp = tcp_sk(sk);
> -       u32 xmit_size_goal, old_size_goal;
> -
> -       xmit_size_goal = mss_now;
> -
> -       if (large_allowed && sk_can_gso(sk)) {
> -               u32 gso_size, hlen;
> -
> -               /* Maybe we should/could use sk->sk_prot->max_header here ? */
> -               hlen = inet_csk(sk)->icsk_af_ops->net_header_len +
> -                      inet_csk(sk)->icsk_ext_hdr_len +
> -                      tp->tcp_header_len;
> -
> -               /* Goal is to send at least one packet per ms,
> -                * not one big TSO packet every 100 ms.
> -                * This preserves ACK clocking and is consistent
> -                * with tcp_tso_should_defer() heuristic.
> -                */
> -               gso_size = sk->sk_pacing_rate / (2 * MSEC_PER_SEC);
> -               gso_size = max_t(u32, gso_size,
> -                                sysctl_tcp_min_tso_segs * mss_now);
> -
> -               xmit_size_goal = min_t(u32, gso_size,
> -                                      sk->sk_gso_max_size - 1 - hlen);
> -
> -               xmit_size_goal = tcp_bound_to_half_wnd(tp, xmit_size_goal);
> -
> -               /* We try hard to avoid divides here */
> -               old_size_goal = tp->xmit_size_goal_segs * mss_now;
> -
> -               if (likely(old_size_goal <= xmit_size_goal &&
> -                          old_size_goal + mss_now > xmit_size_goal)) {
> -                       xmit_size_goal = old_size_goal;
> -               } else {
> -                       tp->xmit_size_goal_segs =
> -                               min_t(u16, xmit_size_goal / mss_now,
> -                                     sk->sk_gso_max_segs);
> -                       xmit_size_goal = tp->xmit_size_goal_segs * mss_now;
> -               }
> +       u32 new_size_goal, size_goal, hlen;
> +
> +       if (!large_allowed || !sk_can_gso(sk))
> +               return mss_now;
> +
> +       /* Maybe we should/could use sk->sk_prot->max_header here ? */
> +       hlen = inet_csk(sk)->icsk_af_ops->net_header_len +
> +              inet_csk(sk)->icsk_ext_hdr_len +
> +              tp->tcp_header_len;
> +
> +       new_size_goal = sk->sk_gso_max_size - 1 - hlen;
> +       new_size_goal = tcp_bound_to_half_wnd(tp, new_size_goal);
> +
> +       /* We try hard to avoid divides here */
> +       size_goal = tp->gso_segs * mss_now;
> +       if (unlikely(new_size_goal < size_goal ||
> +                    new_size_goal >= size_goal + mss_now)) {
> +               tp->gso_segs = min_t(u16, new_size_goal / mss_now,
> +                                    sk->sk_gso_max_segs);
> +               size_goal = tp->gso_segs * mss_now;
>         }
>
> -       return max(xmit_size_goal, mss_now);
> +       return max(size_goal, mss_now);
>  }
>
>  static int tcp_send_mss(struct sock *sk, int *size_goal, int flags)
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index f5bd4bd3f7e669b3fd48a843d55e7313a30a3409..f37ecf53ee8a96827fc08bd203b0ca8857f8fc34 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -1524,6 +1524,27 @@ static bool tcp_nagle_check(bool partial, const struct tcp_sock *tp,
>                 ((nonagle & TCP_NAGLE_CORK) ||
>                  (!nonagle && tp->packets_out && tcp_minshall_check(tp)));
>  }
> +
> +/* Return how many segs we'd like on a TSO packet,
> + * to send one TSO packet per ms
> + */
> +static u32 tcp_tso_autosize(const struct sock *sk, unsigned int mss_now)
> +{
> +       u32 bytes, segs;
> +
> +       bytes = min(sk->sk_pacing_rate >> 10,
> +                   sk->sk_gso_max_size - 1 - MAX_TCP_HEADER);
> +
> +       /* Goal is to send at least one packet per ms,
> +        * not one big TSO packet every 100 ms.
> +        * This preserves ACK clocking and is consistent
> +        * with tcp_tso_should_defer() heuristic.
> +        */
> +       segs = max_t(u32, bytes / mss_now, sysctl_tcp_min_tso_segs);
> +
> +       return min_t(u32, segs, sk->sk_gso_max_segs);
> +}
> +
>  /* Returns the portion of skb which can be sent right away */
>  static unsigned int tcp_mss_split_point(const struct sock *sk,
>                                         const struct sk_buff *skb,
> @@ -1731,7 +1752,7 @@ static int tso_fragment(struct sock *sk, struct sk_buff *skb, unsigned int len,
>   * This algorithm is from John Heffner.
>   */
>  static bool tcp_tso_should_defer(struct sock *sk, struct sk_buff *skb,
> -                                bool *is_cwnd_limited)
> +                                bool *is_cwnd_limited, u32 max_segs)
>  {
>         struct tcp_sock *tp = tcp_sk(sk);
>         const struct inet_connection_sock *icsk = inet_csk(sk);
> @@ -1761,8 +1782,7 @@ static bool tcp_tso_should_defer(struct sock *sk, struct sk_buff *skb,
>         limit = min(send_win, cong_win);
>
>         /* If a full-sized TSO skb can be sent, do it. */
> -       if (limit >= min_t(unsigned int, sk->sk_gso_max_size,
> -                          tp->xmit_size_goal_segs * tp->mss_cache))
> +       if (limit >= max_segs * tp->mss_cache)
>                 goto send_now;
>
>         /* Middle in queue won't get any more data, full sendable already? */
> @@ -1959,6 +1979,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
>         int cwnd_quota;
>         int result;
>         bool is_cwnd_limited = false;
> +       u32 max_segs;
>
>         sent_pkts = 0;
>
> @@ -1972,6 +1993,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
>                 }
>         }
>
> +       max_segs = tcp_tso_autosize(sk, mss_now);
>         while ((skb = tcp_send_head(sk))) {
>                 unsigned int limit;
>
> @@ -2004,10 +2026,23 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
>                                 break;
>                 } else {
>                         if (!push_one &&
> -                           tcp_tso_should_defer(sk, skb, &is_cwnd_limited))
> +                           tcp_tso_should_defer(sk, skb, &is_cwnd_limited,
> +                                                max_segs))
>                                 break;
>                 }
>
> +               limit = mss_now;
> +               if (tso_segs > 1 && !tcp_urg_mode(tp))
> +                       limit = tcp_mss_split_point(sk, skb, mss_now,
> +                                                   min_t(unsigned int,
> +                                                         cwnd_quota,
> +                                                         max_segs),
> +                                                   nonagle);
> +
> +               if (skb->len > limit &&
> +                   unlikely(tso_fragment(sk, skb, limit, mss_now, gfp)))
> +                       break;
> +
>                 /* TCP Small Queues :
>                  * Control number of packets in qdisc/devices to two packets / or ~1 ms.
>                  * This allows for :
> @@ -2018,8 +2053,8 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
>                  * of queued bytes to ensure line rate.
>                  * One example is wifi aggregation (802.11 AMPDU)
>                  */
> -               limit = max_t(unsigned int, sysctl_tcp_limit_output_bytes,
> -                             sk->sk_pacing_rate >> 10);
> +               limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 10);
is this capping necessary if skb->truesize already takes the pacing
rate into account from the new logic above?

> +               limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes);
>
>                 if (atomic_read(&sk->sk_wmem_alloc) > limit) {
>                         set_bit(TSQ_THROTTLED, &tp->tsq_flags);
> @@ -2032,18 +2067,6 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
>                                 break;
>                 }
>
> -               limit = mss_now;
> -               if (tso_segs > 1 && !tcp_urg_mode(tp))
> -                       limit = tcp_mss_split_point(sk, skb, mss_now,
> -                                                   min_t(unsigned int,
> -                                                         cwnd_quota,
> -                                                         sk->sk_gso_max_segs),
> -                                                   nonagle);
> -
> -               if (skb->len > limit &&
> -                   unlikely(tso_fragment(sk, skb, limit, mss_now, gfp)))
> -                       break;
> -
>                 if (unlikely(tcp_transmit_skb(sk, skb, 1, gfp)))
>                         break;
>
>
>


* Re: [PATCH v3 net-next] tcp: refine TSO autosizing
  2014-12-07 21:24     ` Yuchung Cheng
@ 2014-12-08  6:32       ` Eric Dumazet
  0 siblings, 0 replies; 9+ messages in thread
From: Eric Dumazet @ 2014-12-08  6:32 UTC (permalink / raw)
  To: Yuchung Cheng; +Cc: David Miller, netdev, Neal Cardwell, Nandita Dukkipati

On Sun, 2014-12-07 at 13:24 -0800, Yuchung Cheng wrote:
> >                  */
> > -               limit = max_t(unsigned int, sysctl_tcp_limit_output_bytes,
> > -                             sk->sk_pacing_rate >> 10);
> > +               limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 10);
> is this capping necessary if skb->truesize already takes the pacing
> rate into account from the new logic above?


This is needed for flows not using GSO/TSO : For such flows,
skb->truesize is a constant (2048 + sizeof(struct sk_buff))


* Re: [PATCH v3 net-next] tcp: refine TSO autosizing
  2014-12-07 20:22   ` [PATCH v3 " Eric Dumazet
  2014-12-07 21:24     ` Yuchung Cheng
@ 2014-12-09 21:39     ` David Miller
  1 sibling, 0 replies; 9+ messages in thread
From: David Miller @ 2014-12-09 21:39 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, ncardwell, ycheng, nanditad

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Sun, 07 Dec 2014 12:22:18 -0800

> From: Eric Dumazet <edumazet@google.com>
> 
> Commit 95bd09eb2750 ("tcp: TSO packets automatic sizing") tried to
> control TSO size, but did this at the wrong place (sendmsg() time)
> 
> At sendmsg() time, we might have a pessimistic view of flow rate,
> and we end up building very small skbs (with 2 MSS per skb).
> 
> This is bad because :
> 
>  - It sends small TSO packets even in Slow Start where rate quickly
>    increases.
>  - It tends to make socket write queue very big, increasing tcp_ack()
>    processing time, but also increasing memory needs, not necessarily
>    accounted for, as fast clones overhead is currently ignored.
>  - Lower GRO efficiency and more ACK packets.
> 
> Servers with a lot of short-lived connections suffer from this.
> 
> Let's instead fill skbs as much as possible (64KB of payload), but split
> them at xmit time, when we have a precise idea of the flow rate.
> skb split is actually quite efficient.
> 
> Patch looks bigger than necessary, because TCP Small Queue decision now
> has to take place after the eventual split.
> 
> As Neal suggested, introduce a new tcp_tso_autosize() helper, so that
> tcp_tso_should_defer() can be synchronized on same goal.
> 
> Rename tp->xmit_size_goal_segs to tp->gso_segs, as this variable
> contains number of mss that we can put in GSO packet, and is not
> related to the autosizing goal anymore.
 ...
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Neal Cardwell <ncardwell@google.com>
> ---
> v3: tcp_xmit_size_goal() still needs to return a multiple of mss.
> v2: added tcp_tso_autosize() helper and removed tp->xmit_size_goal_segs

Applied, thanks Eric.

