* [PATCH net-next] tcp: refine TSO autosizing

From: Eric Dumazet
Date: 2014-12-05 14:15 UTC
To: David Miller
Cc: netdev, Neal Cardwell, Yuchung Cheng, Nandita Dukkipati

From: Eric Dumazet <edumazet@google.com>

Commit 95bd09eb2750 ("tcp: TSO packets automatic sizing") tried to
control TSO size, but did it in the wrong place (at sendmsg() time).

At sendmsg() time, we might have a pessimistic view of flow rate,
and we end up building very small skbs (with 2 MSS per skb).

This is bad because:

- It sends small TSO packets even in Slow Start, where the rate quickly
  increases.
- It tends to make the socket write queue very big, increasing tcp_ack()
  processing time, but also increasing memory needs, not necessarily
  accounted for, as fast clone overhead is currently ignored.
- It lowers GRO efficiency and generates more ACK packets.

Servers with a lot of short-lived connections suffer from this.

Let's instead fill skbs as much as possible (64KB of payload), but split
them at xmit time, when we have a precise idea of the flow rate.
skb split is actually quite efficient.

The patch looks bigger than necessary because the TCP Small Queues
decision now has to take place after the eventual split.

Tested:

40 ms rtt link

nstat >/dev/null
netperf -H remote -Cc -l -2000000 -- -s 1000000
nstat | egrep "IpInReceives|IpOutRequests|TcpOutSegs|IpExtOutOctets"

Before patch:

Recv   Send    Send                        Utilization      Service Demand
Socket Socket  Message  Elapsed            Send    Recv     Send    Recv
Size   Size    Size     Time   Throughput  local   remote   local   remote
bytes  bytes   bytes    secs.  10^6bits/s  % S     % S      us/KB   us/KB

 87380 2000000 2000000   0.36       44.22  0.00    0.06     0.000   5.007
IpInReceives                    600                0.0
IpOutRequests                   599                0.0
TcpOutSegs                      1397               0.0
IpExtOutOctets                  2033249            0.0

After patch:

Recv   Send    Send                        Utilization      Service Demand
Socket Socket  Message  Elapsed            Send    Recv     Send    Recv
Size   Size    Size     Time   Throughput  local   remote   local   remote
bytes  bytes   bytes    secs.  10^6bits/s  % S     % S      us/KB   us/KB

 87380 2000000 2000000   0.36       44.09  0.00    0.00     0.000   0.000
IpInReceives                    257                0.0
IpOutRequests                   226                0.0
TcpOutSegs                      1399               0.0
IpExtOutOctets                  2013777            0.0

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/ipv4/tcp.c        | 14 ++------------
 net/ipv4/tcp_output.c | 38 ++++++++++++++++++++++++--------------
 2 files changed, 26 insertions(+), 26 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index dc13a3657e8e..91e6c1406313 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -840,24 +840,14 @@ static unsigned int tcp_xmit_size_goal(struct sock *sk, u32 mss_now,
 	xmit_size_goal = mss_now;
 
 	if (large_allowed && sk_can_gso(sk)) {
-		u32 gso_size, hlen;
+		u32 hlen;
 
 		/* Maybe we should/could use sk->sk_prot->max_header here ? */
 		hlen = inet_csk(sk)->icsk_af_ops->net_header_len +
 		       inet_csk(sk)->icsk_ext_hdr_len +
 		       tp->tcp_header_len;
 
-		/* Goal is to send at least one packet per ms,
-		 * not one big TSO packet every 100 ms.
-		 * This preserves ACK clocking and is consistent
-		 * with tcp_tso_should_defer() heuristic.
-		 */
-		gso_size = sk->sk_pacing_rate / (2 * MSEC_PER_SEC);
-		gso_size = max_t(u32, gso_size,
-				 sysctl_tcp_min_tso_segs * mss_now);
-
-		xmit_size_goal = min_t(u32, gso_size,
-				       sk->sk_gso_max_size - 1 - hlen);
+		xmit_size_goal = sk->sk_gso_max_size - 1 - hlen;
 
 		xmit_size_goal = tcp_bound_to_half_wnd(tp, xmit_size_goal);
 
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index f5bd4bd3f7e6..dac9991bccbf 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1533,6 +1533,16 @@ static unsigned int tcp_mss_split_point(const struct sock *sk,
 {
 	const struct tcp_sock *tp = tcp_sk(sk);
 	u32 partial, needed, window, max_len;
+	u32 bytes = sk->sk_pacing_rate >> 10;
+	u32 segs;
+
+	/* Goal is to send at least one packet per ms,
+	 * not one big TSO packet every 100 ms.
+	 * This preserves ACK clocking and is consistent
+	 * with tcp_tso_should_defer() heuristic.
+	 */
+	segs = max_t(u32, bytes / mss_now, sysctl_tcp_min_tso_segs);
+	max_segs = min_t(u32, max_segs, segs);
 
 	window = tcp_wnd_end(tp) - TCP_SKB_CB(skb)->seq;
 	max_len = mss_now * max_segs;
@@ -2008,6 +2018,18 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 			break;
 		}
 
+		limit = mss_now;
+		if (tso_segs > 1 && !tcp_urg_mode(tp))
+			limit = tcp_mss_split_point(sk, skb, mss_now,
+						    min_t(unsigned int,
+							  cwnd_quota,
+							  sk->sk_gso_max_segs),
+						    nonagle);
+
+		if (skb->len > limit &&
+		    unlikely(tso_fragment(sk, skb, limit, mss_now, gfp)))
+			break;
+
 		/* TCP Small Queues :
 		 * Control number of packets in qdisc/devices to two packets / or ~1 ms.
 		 * This allows for :
@@ -2018,8 +2040,8 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 		 * of queued bytes to ensure line rate.
 		 * One example is wifi aggregation (802.11 AMPDU)
 		 */
-		limit = max_t(unsigned int, sysctl_tcp_limit_output_bytes,
-			      sk->sk_pacing_rate >> 10);
+		limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 10);
+		limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes);
 
 		if (atomic_read(&sk->sk_wmem_alloc) > limit) {
 			set_bit(TSQ_THROTTLED, &tp->tsq_flags);
@@ -2032,18 +2054,6 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 			break;
 		}
 
-		limit = mss_now;
-		if (tso_segs > 1 && !tcp_urg_mode(tp))
-			limit = tcp_mss_split_point(sk, skb, mss_now,
-						    min_t(unsigned int,
-							  cwnd_quota,
-							  sk->sk_gso_max_segs),
-						    nonagle);
-
-		if (skb->len > limit &&
-		    unlikely(tso_fragment(sk, skb, limit, mss_now, gfp)))
-			break;
-
 		if (unlikely(tcp_transmit_skb(sk, skb, 1, gfp)))
 			break;
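A note on the arithmetic being moved: sk->sk_pacing_rate is expressed in
bytes per second, so sk->sk_pacing_rate >> 10 (a divide by 1024)
approximates the number of bytes the flow may send in one millisecond.
The standalone sketch below is not kernel code; the 1448-byte MSS and
the min_tso_segs = 2 / gso_max_segs = 65535 defaults are assumed values
chosen for illustration. It shows how the per-skb segment budget scales
with pacing rate:

    /* Sketch of the xmit-time autosizing arithmetic; names and default
     * values mirror the kernel's but are assumptions here, not the
     * kernel sources.
     */
    #include <stdio.h>

    static unsigned int tso_autosize(unsigned long long pacing_rate, /* bytes/sec */
                                     unsigned int mss,
                                     unsigned int min_tso_segs,
                                     unsigned int gso_max_segs)
    {
            /* >> 10 divides by 1024: roughly the bytes sent per millisecond */
            unsigned long long bytes = pacing_rate >> 10;
            unsigned long long segs = bytes / mss;

            if (segs < min_tso_segs)
                    segs = min_tso_segs;
            if (segs > gso_max_segs)
                    segs = gso_max_segs;
            return (unsigned int)segs;
    }

    int main(void)
    {
            unsigned long long rates[] = {  /* bytes per second */
                    125000ULL,              /* 1 Mbit/s  */
                    1250000ULL,             /* 10 Mbit/s */
                    125000000ULL,           /* 1 Gbit/s  */
                    1250000000ULL,          /* 10 Gbit/s */
            };
            unsigned int i;

            for (i = 0; i < 4; i++)
                    printf("pacing %11llu B/s -> %5u segs per TSO packet\n",
                           rates[i], tso_autosize(rates[i], 1448, 2, 65535));
            return 0;
    }

At 1 Gbit/s this yields roughly 84 segments, more payload than a single
64KB skb holds, so a fast flow is never split; a flow still pacing at
1 or 10 Mbit/s stays at the 2-segment floor, which keeps slow flows
from pushing oversized bursts into the network.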
* Re: [PATCH net-next] tcp: refine TSO autosizing

From: Neal Cardwell
Date: 2014-12-05 15:32 UTC
To: Eric Dumazet
Cc: David Miller, netdev, Yuchung Cheng, Nandita Dukkipati

On Fri, Dec 5, 2014 at 9:15 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> Commit 95bd09eb2750 ("tcp: TSO packets automatic sizing") tried to
> control TSO size, but did it in the wrong place (at sendmsg() time).
>
> At sendmsg() time, we might have a pessimistic view of flow rate,
> and we end up building very small skbs (with 2 MSS per skb).
>
> This is bad because:
>
> - It sends small TSO packets even in Slow Start, where the rate quickly
>   increases.
> - It tends to make the socket write queue very big, increasing tcp_ack()
>   processing time, but also increasing memory needs, not necessarily
>   accounted for, as fast clone overhead is currently ignored.
> - It lowers GRO efficiency and generates more ACK packets.
>
> Servers with a lot of short-lived connections suffer from this.
>
> Let's instead fill skbs as much as possible (64KB of payload), but split
> them at xmit time, when we have a precise idea of the flow rate.
> skb split is actually quite efficient.

Nice. I definitely agree this is the right direction.

However, from my experience testing a variant of this approach, this
kind of late decision about packet size was sometimes causing
performance shortfalls on long-RTT, medium-bandwidth paths unless
tcp_tso_should_defer() was also modified to use the new/smaller packet
size goal.

The issue is that tcp_tso_should_defer() uses tp->xmit_size_goal_segs
as a yardstick, and says, "hey, if cwnd and rwin allow us to send
tp->xmit_size_goal_segs * tp->mss_cache, then let's go ahead and send
it now."

But if we remove the sendmsg-time autosizing logic that was tuning
tp->xmit_size_goal_segs, then tcp_tso_should_defer() waits to
accumulate permission to send a big skb with
tp->xmit_size_goal_segs (e.g. ~40) MSS in it.

In my tests I was able to fix this issue by making
tcp_tso_should_defer() use the latest size goal instead of
tp->xmit_size_goal_segs.

So, how about computing the rate-based TSO autosizing goal (stored in
"segs" in this patch) at the top of tcp_write_xmit()? Then we could
pass that segment goal to tcp_tso_should_defer() for use instead of
tp->xmit_size_goal_segs in deciding whether we have a big enough chunk
to send now. Similarly, that segment goal could be passed in to
tcp_mss_split_point() instead of sk->sk_gso_max_segs. (The autosizing
calculation could live in a helper function to keep tcp_write_xmit()
manageable.)

neal
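Neal's yardstick concern is easy to quantify. The back-of-the-envelope
sketch below (assumed values, not kernel code) compares what the defer
heuristic waits for before and after his suggestion on a 10 Mbit/s path:

    /* What threshold does tcp_tso_should_defer() wait for before
     * sending?  Illustrative values only.
     */
    #include <stdio.h>

    int main(void)
    {
            unsigned int mss = 1448;
            unsigned int old_goal_segs = 40;        /* stale tp->xmit_size_goal_segs */
            unsigned long long pacing = 1250000;    /* 10 Mbit/s path, bytes/sec */
            unsigned int new_goal_segs = (unsigned int)((pacing >> 10) / mss);

            if (new_goal_segs < 2)
                    new_goal_segs = 2;              /* tcp_min_tso_segs floor */

            printf("old defer threshold: %u bytes\n", old_goal_segs * mss); /* ~58KB */
            printf("new defer threshold: %u bytes\n", new_goal_segs * mss); /* ~3KB  */
            return 0;
    }

On a long-RTT path where cwnd and rwin rarely open 58KB at once, the
stale goal can make the sender defer over and over, which is the
shortfall Neal describes; the rate-based goal is satisfied by the
2-segment minimum.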
* Re: [PATCH net-next] tcp: refine TSO autosizing

From: Eric Dumazet
Date: 2014-12-05 17:06 UTC
To: Neal Cardwell
Cc: David Miller, netdev, Yuchung Cheng, Nandita Dukkipati

On Fri, 2014-12-05 at 10:32 -0500, Neal Cardwell wrote:
> ...
> In my tests I was able to fix this issue by making
> tcp_tso_should_defer() use the latest size goal instead of
> tp->xmit_size_goal_segs.
>
> So, how about computing the rate-based TSO autosizing goal (stored in
> "segs" in this patch) at the top of tcp_write_xmit()? Then we could
> pass that segment goal to tcp_tso_should_defer() for use instead of
> tp->xmit_size_goal_segs in deciding whether we have a big enough chunk
> to send now. Similarly, that segment goal could be passed in to
> tcp_mss_split_point() instead of sk->sk_gso_max_segs.
>
> (The autosizing calculation could live in a helper function to keep
> tcp_write_xmit() manageable.)

That sounds like an excellent suggestion indeed; I am working on it.

Thanks, Neal!
* [PATCH v2 net-next] tcp: refine TSO autosizing

From: Eric Dumazet
Date: 2014-12-06 2:22 UTC
To: David Miller
Cc: netdev, Neal Cardwell, Yuchung Cheng, Nandita Dukkipati

From: Eric Dumazet <edumazet@google.com>

Commit 95bd09eb2750 ("tcp: TSO packets automatic sizing") tried to
control TSO size, but did it in the wrong place (at sendmsg() time).

At sendmsg() time, we might have a pessimistic view of flow rate,
and we end up building very small skbs (with 2 MSS per skb).

This is bad because:

- It sends small TSO packets even in Slow Start, where the rate quickly
  increases.
- It tends to make the socket write queue very big, increasing tcp_ack()
  processing time, but also increasing memory needs, not necessarily
  accounted for, as fast clone overhead is currently ignored.
- It lowers GRO efficiency and generates more ACK packets.

Servers with a lot of short-lived connections suffer from this.

Let's instead fill skbs as much as possible (64KB of payload), but split
them at xmit time, when we have a precise idea of the flow rate.
skb split is actually quite efficient.

The patch looks bigger than necessary because the TCP Small Queues
decision now has to take place after the eventual split.

As Neal suggested, get rid of tp->xmit_size_goal_segs and introduce a
new tcp_tso_autosize() helper, so that tcp_tso_should_defer() can be
synchronized on the same goal.

Tested:

40 ms rtt link

nstat >/dev/null
netperf -H remote -Cc -l -2000000 -- -s 1000000
nstat | egrep "IpInReceives|IpOutRequests|TcpOutSegs|IpExtOutOctets"

Before patch:

Recv   Send    Send                        Utilization      Service Demand
Socket Socket  Message  Elapsed            Send    Recv     Send    Recv
Size   Size    Size     Time   Throughput  local   remote   local   remote
bytes  bytes   bytes    secs.  10^6bits/s  % S     % S      us/KB   us/KB

 87380 2000000 2000000   0.36       44.22  0.00    0.06     0.000   5.007
IpInReceives                    600                0.0
IpOutRequests                   599                0.0
TcpOutSegs                      1397               0.0
IpExtOutOctets                  2033249            0.0

After patch:

Recv   Send    Send                        Utilization      Service Demand
Socket Socket  Message  Elapsed            Send    Recv     Send    Recv
Size   Size    Size     Time   Throughput  local   remote   local   remote
bytes  bytes   bytes    secs.  10^6bits/s  % S     % S      us/KB   us/KB

 87380 2000000 2000000   0.36       44.09  0.00    0.00     0.000   0.000
IpInReceives                    257                0.0
IpOutRequests                   226                0.0
TcpOutSegs                      1399               0.0
IpExtOutOctets                  2013777            0.0

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
---
v2: added tcp_tso_autosize() helper and removed tp->xmit_size_goal_segs

 include/linux/tcp.h   |  2 -
 net/ipv4/tcp.c        | 44 ++++++-----------------
 net/ipv4/tcp_output.c | 59 +++++++++++++++++++++++++-------------
 3 files changed, 51 insertions(+), 54 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index f566b8567892ef0bb213de0540b37cfc6ac03ca0..a2944717f1b7d1d12b4e2397c20457931ad71bac 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -130,7 +130,7 @@ struct tcp_sock {
 	/* inet_connection_sock has to be the first member of tcp_sock */
 	struct inet_connection_sock	inet_conn;
 	u16	tcp_header_len;	/* Bytes of tcp header to send */
-	u16	xmit_size_goal_segs; /* Goal for segmenting output packets */
+	/* 2 bytes hole */
 
 	/*
	 *	Header prediction flags
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index dc13a3657e8e1b81ba0cb1fcd5386a9d0b106168..6e692f7ac62880415093ca7a43c782c7fd6a049b 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -835,45 +835,19 @@ static unsigned int tcp_xmit_size_goal(struct sock *sk, u32 mss_now,
 				       int large_allowed)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
-	u32 xmit_size_goal, old_size_goal;
+	u32 hlen, xmit_size_goal;
 
-	xmit_size_goal = mss_now;
+	if (!large_allowed || !sk_can_gso(sk))
+		return mss_now;
 
-	if (large_allowed && sk_can_gso(sk)) {
-		u32 gso_size, hlen;
+	/* Maybe we should/could use sk->sk_prot->max_header here ? */
+	hlen = inet_csk(sk)->icsk_af_ops->net_header_len +
+	       inet_csk(sk)->icsk_ext_hdr_len +
+	       tp->tcp_header_len;
 
-		/* Maybe we should/could use sk->sk_prot->max_header here ? */
-		hlen = inet_csk(sk)->icsk_af_ops->net_header_len +
-		       inet_csk(sk)->icsk_ext_hdr_len +
-		       tp->tcp_header_len;
+	xmit_size_goal = sk->sk_gso_max_size - 1 - hlen;
 
-		/* Goal is to send at least one packet per ms,
-		 * not one big TSO packet every 100 ms.
-		 * This preserves ACK clocking and is consistent
-		 * with tcp_tso_should_defer() heuristic.
-		 */
-		gso_size = sk->sk_pacing_rate / (2 * MSEC_PER_SEC);
-		gso_size = max_t(u32, gso_size,
-				 sysctl_tcp_min_tso_segs * mss_now);
-
-		xmit_size_goal = min_t(u32, gso_size,
-				       sk->sk_gso_max_size - 1 - hlen);
-
-		xmit_size_goal = tcp_bound_to_half_wnd(tp, xmit_size_goal);
-
-		/* We try hard to avoid divides here */
-		old_size_goal = tp->xmit_size_goal_segs * mss_now;
-
-		if (likely(old_size_goal <= xmit_size_goal &&
-			   old_size_goal + mss_now > xmit_size_goal)) {
-			xmit_size_goal = old_size_goal;
-		} else {
-			tp->xmit_size_goal_segs =
-				min_t(u16, xmit_size_goal / mss_now,
-				      sk->sk_gso_max_segs);
-			xmit_size_goal = tp->xmit_size_goal_segs * mss_now;
-		}
-	}
+	xmit_size_goal = tcp_bound_to_half_wnd(tp, xmit_size_goal);
 
 	return max(xmit_size_goal, mss_now);
 }
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index f5bd4bd3f7e669b3fd48a843d55e7313a30a3409..f37ecf53ee8a96827fc08bd203b0ca8857f8fc34 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1524,6 +1524,27 @@ static bool tcp_nagle_check(bool partial, const struct tcp_sock *tp,
 		((nonagle & TCP_NAGLE_CORK) ||
 		 (!nonagle && tp->packets_out && tcp_minshall_check(tp)));
 }
+
+/* Return how many segs we'd like on a TSO packet,
+ * to send one TSO packet per ms.
+ */
+static u32 tcp_tso_autosize(const struct sock *sk, unsigned int mss_now)
+{
+	u32 bytes, segs;
+
+	bytes = min(sk->sk_pacing_rate >> 10,
+		    sk->sk_gso_max_size - 1 - MAX_TCP_HEADER);
+
+	/* Goal is to send at least one packet per ms,
+	 * not one big TSO packet every 100 ms.
+	 * This preserves ACK clocking and is consistent
+	 * with tcp_tso_should_defer() heuristic.
+	 */
+	segs = max_t(u32, bytes / mss_now, sysctl_tcp_min_tso_segs);
+
+	return min_t(u32, segs, sk->sk_gso_max_segs);
+}
+
 /* Returns the portion of skb which can be sent right away */
 static unsigned int tcp_mss_split_point(const struct sock *sk,
 					const struct sk_buff *skb,
@@ -1731,7 +1752,7 @@ static int tso_fragment(struct sock *sk, struct sk_buff *skb, unsigned int len,
  * This algorithm is from John Heffner.
  */
 static bool tcp_tso_should_defer(struct sock *sk, struct sk_buff *skb,
-				 bool *is_cwnd_limited)
+				 bool *is_cwnd_limited, u32 max_segs)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	const struct inet_connection_sock *icsk = inet_csk(sk);
@@ -1761,8 +1782,7 @@ static bool tcp_tso_should_defer(struct sock *sk, struct sk_buff *skb,
 	limit = min(send_win, cong_win);
 
 	/* If a full-sized TSO skb can be sent, do it. */
-	if (limit >= min_t(unsigned int, sk->sk_gso_max_size,
-			   tp->xmit_size_goal_segs * tp->mss_cache))
+	if (limit >= max_segs * tp->mss_cache)
 		goto send_now;
 
 	/* Middle in queue won't get any more data, full sendable already? */
@@ -1959,6 +1979,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 	int cwnd_quota;
 	int result;
 	bool is_cwnd_limited = false;
+	u32 max_segs;
 
 	sent_pkts = 0;
 
@@ -1972,6 +1993,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 		}
 	}
 
+	max_segs = tcp_tso_autosize(sk, mss_now);
 	while ((skb = tcp_send_head(sk))) {
 		unsigned int limit;
 
@@ -2004,10 +2026,23 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 				break;
 		} else {
 			if (!push_one &&
-			    tcp_tso_should_defer(sk, skb, &is_cwnd_limited))
+			    tcp_tso_should_defer(sk, skb, &is_cwnd_limited,
+						 max_segs))
 				break;
 		}
 
+		limit = mss_now;
+		if (tso_segs > 1 && !tcp_urg_mode(tp))
+			limit = tcp_mss_split_point(sk, skb, mss_now,
+						    min_t(unsigned int,
+							  cwnd_quota,
+							  max_segs),
+						    nonagle);
+
+		if (skb->len > limit &&
+		    unlikely(tso_fragment(sk, skb, limit, mss_now, gfp)))
+			break;
+
 		/* TCP Small Queues :
 		 * Control number of packets in qdisc/devices to two packets / or ~1 ms.
 		 * This allows for :
@@ -2018,8 +2053,8 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 		 * of queued bytes to ensure line rate.
 		 * One example is wifi aggregation (802.11 AMPDU)
 		 */
-		limit = max_t(unsigned int, sysctl_tcp_limit_output_bytes,
-			      sk->sk_pacing_rate >> 10);
+		limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 10);
+		limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes);
 
 		if (atomic_read(&sk->sk_wmem_alloc) > limit) {
 			set_bit(TSQ_THROTTLED, &tp->tsq_flags);
@@ -2032,18 +2067,6 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 			break;
 		}
 
-		limit = mss_now;
-		if (tso_segs > 1 && !tcp_urg_mode(tp))
-			limit = tcp_mss_split_point(sk, skb, mss_now,
-						    min_t(unsigned int,
-							  cwnd_quota,
-							  sk->sk_gso_max_segs),
-						    nonagle);
-
-		if (skb->len > limit &&
-		    unlikely(tso_fragment(sk, skb, limit, mss_now, gfp)))
-			break;
-
 		if (unlikely(tcp_transmit_skb(sk, skb, 1, gfp)))
 			break;
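Why the "bigger than necessary" reshuffle: the TCP Small Queues budget
now takes a floor of 2 * skb->truesize, so it has to be evaluated
against the skb that is actually about to be transmitted, i.e. after
the eventual tso_fragment() split. A rough userspace rendering of the
budget (the truesize figures are assumptions for illustration, not
measured kernel values):

    /* Sketch of the TSQ byte budget; shows why it must see the
     * post-split skb.  Illustrative numbers only.
     */
    #include <stdio.h>

    static unsigned int tsq_limit(unsigned int truesize,
                                  unsigned long long pacing_rate,
                                  unsigned int sysctl_limit_output_bytes)
    {
            unsigned long long limit = pacing_rate >> 10;   /* ~bytes per ms */

            if (limit < 2ULL * truesize)                    /* floor: 2 packets */
                    limit = 2ULL * truesize;
            if (limit > sysctl_limit_output_bytes)          /* sysctl cap */
                    limit = sysctl_limit_output_bytes;
            return (unsigned int)limit;
    }

    int main(void)
    {
            unsigned long long pacing = 1250000;    /* slow 10 Mbit/s flow */
            unsigned int sysctl_cap = 131072;       /* assumed 128KB default */

            /* Before the split, the queued skb may hold 64KB of payload... */
            printf("limit on unsplit 64KB skb: %u\n",
                   tsq_limit(66816, pacing, sysctl_cap));
            /* ...after tso_fragment() we account only the small piece sent. */
            printf("limit on split 2-MSS skb:  %u\n",
                   tsq_limit(5888, pacing, sysctl_cap));
            return 0;
    }

Evaluated before the split, the floor would be derived from the 64KB
skb and let on the order of 128KB sit in the qdisc; evaluated after,
a slow flow keeps only about two small packets in flight.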
* Re: [PATCH v2 net-next] tcp: refine TSO autosizing

From: Eric Dumazet
Date: 2014-12-06 16:58 UTC
To: David Miller
Cc: netdev, Neal Cardwell, Yuchung Cheng, Nandita Dukkipati

On Fri, 2014-12-05 at 18:22 -0800, Eric Dumazet wrote:
> From: Eric Dumazet <edumazet@google.com>
> ...
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Neal Cardwell <ncardwell@google.com>
> ---
> v2: added tcp_tso_autosize() helper and removed tp->xmit_size_goal_segs

Please do not apply this patch; I'll send a v3.

We still need to cook skbs that are a multiple of MSS bytes.
* [PATCH v3 net-next] tcp: refine TSO autosizing

From: Eric Dumazet
Date: 2014-12-07 20:22 UTC
To: David Miller
Cc: netdev, Neal Cardwell, Yuchung Cheng, Nandita Dukkipati

From: Eric Dumazet <edumazet@google.com>

Commit 95bd09eb2750 ("tcp: TSO packets automatic sizing") tried to
control TSO size, but did it in the wrong place (at sendmsg() time).

At sendmsg() time, we might have a pessimistic view of flow rate,
and we end up building very small skbs (with 2 MSS per skb).

This is bad because:

- It sends small TSO packets even in Slow Start, where the rate quickly
  increases.
- It tends to make the socket write queue very big, increasing tcp_ack()
  processing time, but also increasing memory needs, not necessarily
  accounted for, as fast clone overhead is currently ignored.
- It lowers GRO efficiency and generates more ACK packets.

Servers with a lot of short-lived connections suffer from this.

Let's instead fill skbs as much as possible (64KB of payload), but split
them at xmit time, when we have a precise idea of the flow rate.
skb split is actually quite efficient.

The patch looks bigger than necessary because the TCP Small Queues
decision now has to take place after the eventual split.

As Neal suggested, introduce a new tcp_tso_autosize() helper, so that
tcp_tso_should_defer() can be synchronized on the same goal.

Rename tp->xmit_size_goal_segs to tp->gso_segs, as this variable now
holds the number of MSS we can put in a GSO packet and is no longer
related to the autosizing goal.

Tested:

40 ms rtt link

nstat >/dev/null
netperf -H remote -l -2000000 -- -s 1000000
nstat | egrep "IpInReceives|IpOutRequests|TcpOutSegs|IpExtOutOctets"

Before patch:

Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/s

 87380 2000000 2000000    0.36        44.22
IpInReceives                    600                0.0
IpOutRequests                   599                0.0
TcpOutSegs                      1397               0.0
IpExtOutOctets                  2033249            0.0

After patch:

Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 87380 2000000 2000000    0.36        44.27
IpInReceives                    221                0.0
IpOutRequests                   232                0.0
TcpOutSegs                      1397               0.0
IpExtOutOctets                  2013953            0.0

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
---
v3: tcp_xmit_size_goal() still needs to return a multiple of mss.
v2: added tcp_tso_autosize() helper and removed tp->xmit_size_goal_segs

 include/linux/tcp.h   |  2 -
 net/ipv4/tcp.c        | 60 ++++++++++++++--------------------
 net/ipv4/tcp_output.c | 59 +++++++++++++++++++++++++------------
 3 files changed, 63 insertions(+), 58 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index f566b8567892ef0bb213de0540b37cfc6ac03ca0..3fa0a9669a3a662be81d4b04f7d117b11012257c 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -130,7 +130,7 @@ struct tcp_sock {
 	/* inet_connection_sock has to be the first member of tcp_sock */
 	struct inet_connection_sock	inet_conn;
 	u16	tcp_header_len;	/* Bytes of tcp header to send */
-	u16	xmit_size_goal_segs; /* Goal for segmenting output packets */
+	u16	gso_segs;	/* Max number of segs per GSO packet */
 
 	/*
	 *	Header prediction flags
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index dc13a3657e8e1b81ba0cb1fcd5386a9d0b106168..427aee33ffc04ad189d9d0ec24ab8004c25961ec 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -835,47 +835,29 @@ static unsigned int tcp_xmit_size_goal(struct sock *sk, u32 mss_now,
 				       int large_allowed)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
-	u32 xmit_size_goal, old_size_goal;
-
-	xmit_size_goal = mss_now;
-
-	if (large_allowed && sk_can_gso(sk)) {
-		u32 gso_size, hlen;
-
-		/* Maybe we should/could use sk->sk_prot->max_header here ? */
-		hlen = inet_csk(sk)->icsk_af_ops->net_header_len +
-		       inet_csk(sk)->icsk_ext_hdr_len +
-		       tp->tcp_header_len;
-
-		/* Goal is to send at least one packet per ms,
-		 * not one big TSO packet every 100 ms.
-		 * This preserves ACK clocking and is consistent
-		 * with tcp_tso_should_defer() heuristic.
-		 */
-		gso_size = sk->sk_pacing_rate / (2 * MSEC_PER_SEC);
-		gso_size = max_t(u32, gso_size,
-				 sysctl_tcp_min_tso_segs * mss_now);
-
-		xmit_size_goal = min_t(u32, gso_size,
-				       sk->sk_gso_max_size - 1 - hlen);
-
-		xmit_size_goal = tcp_bound_to_half_wnd(tp, xmit_size_goal);
-
-		/* We try hard to avoid divides here */
-		old_size_goal = tp->xmit_size_goal_segs * mss_now;
-
-		if (likely(old_size_goal <= xmit_size_goal &&
-			   old_size_goal + mss_now > xmit_size_goal)) {
-			xmit_size_goal = old_size_goal;
-		} else {
-			tp->xmit_size_goal_segs =
-				min_t(u16, xmit_size_goal / mss_now,
-				      sk->sk_gso_max_segs);
-			xmit_size_goal = tp->xmit_size_goal_segs * mss_now;
-		}
+	u32 new_size_goal, size_goal, hlen;
+
+	if (!large_allowed || !sk_can_gso(sk))
+		return mss_now;
+
+	/* Maybe we should/could use sk->sk_prot->max_header here ? */
+	hlen = inet_csk(sk)->icsk_af_ops->net_header_len +
+	       inet_csk(sk)->icsk_ext_hdr_len +
+	       tp->tcp_header_len;
+
+	new_size_goal = sk->sk_gso_max_size - 1 - hlen;
+	new_size_goal = tcp_bound_to_half_wnd(tp, new_size_goal);
+
+	/* We try hard to avoid divides here */
+	size_goal = tp->gso_segs * mss_now;
+	if (unlikely(new_size_goal < size_goal ||
+		     new_size_goal >= size_goal + mss_now)) {
+		tp->gso_segs = min_t(u16, new_size_goal / mss_now,
+				     sk->sk_gso_max_segs);
+		size_goal = tp->gso_segs * mss_now;
 	}
 
-	return max(xmit_size_goal, mss_now);
+	return max(size_goal, mss_now);
 }
 
 static int tcp_send_mss(struct sock *sk, int *size_goal, int flags)
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index f5bd4bd3f7e669b3fd48a843d55e7313a30a3409..f37ecf53ee8a96827fc08bd203b0ca8857f8fc34 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1524,6 +1524,27 @@ static bool tcp_nagle_check(bool partial, const struct tcp_sock *tp,
 		((nonagle & TCP_NAGLE_CORK) ||
 		 (!nonagle && tp->packets_out && tcp_minshall_check(tp)));
 }
+
+/* Return how many segs we'd like on a TSO packet,
+ * to send one TSO packet per ms
+ */
+static u32 tcp_tso_autosize(const struct sock *sk, unsigned int mss_now)
+{
+	u32 bytes, segs;
+
+	bytes = min(sk->sk_pacing_rate >> 10,
+		    sk->sk_gso_max_size - 1 - MAX_TCP_HEADER);
+
+	/* Goal is to send at least one packet per ms,
+	 * not one big TSO packet every 100 ms.
+	 * This preserves ACK clocking and is consistent
+	 * with tcp_tso_should_defer() heuristic.
+	 */
+	segs = max_t(u32, bytes / mss_now, sysctl_tcp_min_tso_segs);
+
+	return min_t(u32, segs, sk->sk_gso_max_segs);
+}
+
 /* Returns the portion of skb which can be sent right away */
 static unsigned int tcp_mss_split_point(const struct sock *sk,
 					const struct sk_buff *skb,
@@ -1731,7 +1752,7 @@ static int tso_fragment(struct sock *sk, struct sk_buff *skb, unsigned int len,
  * This algorithm is from John Heffner.
  */
 static bool tcp_tso_should_defer(struct sock *sk, struct sk_buff *skb,
-				 bool *is_cwnd_limited)
+				 bool *is_cwnd_limited, u32 max_segs)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	const struct inet_connection_sock *icsk = inet_csk(sk);
@@ -1761,8 +1782,7 @@ static bool tcp_tso_should_defer(struct sock *sk, struct sk_buff *skb,
 	limit = min(send_win, cong_win);
 
 	/* If a full-sized TSO skb can be sent, do it. */
-	if (limit >= min_t(unsigned int, sk->sk_gso_max_size,
-			   tp->xmit_size_goal_segs * tp->mss_cache))
+	if (limit >= max_segs * tp->mss_cache)
 		goto send_now;
 
 	/* Middle in queue won't get any more data, full sendable already? */
@@ -1959,6 +1979,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 	int cwnd_quota;
 	int result;
 	bool is_cwnd_limited = false;
+	u32 max_segs;
 
 	sent_pkts = 0;
 
@@ -1972,6 +1993,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 		}
 	}
 
+	max_segs = tcp_tso_autosize(sk, mss_now);
 	while ((skb = tcp_send_head(sk))) {
 		unsigned int limit;
 
@@ -2004,10 +2026,23 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 				break;
 		} else {
 			if (!push_one &&
-			    tcp_tso_should_defer(sk, skb, &is_cwnd_limited))
+			    tcp_tso_should_defer(sk, skb, &is_cwnd_limited,
+						 max_segs))
 				break;
 		}
 
+		limit = mss_now;
+		if (tso_segs > 1 && !tcp_urg_mode(tp))
+			limit = tcp_mss_split_point(sk, skb, mss_now,
+						    min_t(unsigned int,
+							  cwnd_quota,
+							  max_segs),
+						    nonagle);
+
+		if (skb->len > limit &&
+		    unlikely(tso_fragment(sk, skb, limit, mss_now, gfp)))
+			break;
+
 		/* TCP Small Queues :
 		 * Control number of packets in qdisc/devices to two packets / or ~1 ms.
 		 * This allows for :
@@ -2018,8 +2053,8 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 		 * of queued bytes to ensure line rate.
 		 * One example is wifi aggregation (802.11 AMPDU)
 		 */
-		limit = max_t(unsigned int, sysctl_tcp_limit_output_bytes,
-			      sk->sk_pacing_rate >> 10);
+		limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 10);
+		limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes);
 
 		if (atomic_read(&sk->sk_wmem_alloc) > limit) {
 			set_bit(TSQ_THROTTLED, &tp->tsq_flags);
@@ -2032,18 +2067,6 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 			break;
 		}
 
-		limit = mss_now;
-		if (tso_segs > 1 && !tcp_urg_mode(tp))
-			limit = tcp_mss_split_point(sk, skb, mss_now,
-						    min_t(unsigned int,
-							  cwnd_quota,
-							  sk->sk_gso_max_segs),
-						    nonagle);
-
-		if (skb->len > limit &&
-		    unlikely(tso_fragment(sk, skb, limit, mss_now, gfp)))
-			break;
-
 		if (unlikely(tcp_transmit_skb(sk, skb, 1, gfp)))
 			break;
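The part that changed between v2 and v3 is the divide-avoidance cache in
tcp_xmit_size_goal(): tp->gso_segs remembers the last division result,
the goal is recomputed only when the window-bounded target drifts by at
least one MSS, and the returned goal is always a whole multiple of
mss_now, the property Eric noted v2 had lost. A userspace sketch of
that hysteresis (the struct is a stand-in for the tcp_sock fields;
values are illustrative, not kernel code):

    #include <stdio.h>

    struct goal_cache {
            unsigned short gso_segs;        /* stands in for tp->gso_segs */
    };

    static unsigned int xmit_size_goal(struct goal_cache *tp,
                                       unsigned int new_size_goal, /* window-bounded */
                                       unsigned int mss_now,
                                       unsigned int gso_max_segs)
    {
            unsigned int size_goal = tp->gso_segs * mss_now;

            /* Only divide when the cached goal is stale by >= 1 MSS */
            if (new_size_goal < size_goal ||
                new_size_goal >= size_goal + mss_now) {
                    unsigned int segs = new_size_goal / mss_now;

                    tp->gso_segs = segs < gso_max_segs ? segs : gso_max_segs;
                    size_goal = tp->gso_segs * mss_now;
            }
            return size_goal > mss_now ? size_goal : mss_now;
    }

    int main(void)
    {
            struct goal_cache tp = { 0 };
            unsigned int goals[] = { 63000, 63500, 64000, 20000 };
            unsigned int i;

            for (i = 0; i < 4; i++) {
                    unsigned int goal = xmit_size_goal(&tp, goals[i], 1448, 65535);

                    printf("target %5u -> goal %5u (cached segs %u)\n",
                           goals[i], goal, (unsigned int)tp.gso_segs);
            }
            return 0;
    }

Running it, the 63500-byte target reuses the cached 43-segment goal
without dividing, while the 64000 and 20000 targets fall outside the
one-MSS band and trigger a recompute.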
* Re: [PATCH v3 net-next] tcp: refine TSO autosizing

From: Yuchung Cheng
Date: 2014-12-07 21:24 UTC
To: Eric Dumazet
Cc: David Miller, netdev, Neal Cardwell, Nandita Dukkipati

On Sun, Dec 7, 2014 at 12:22 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> Commit 95bd09eb2750 ("tcp: TSO packets automatic sizing") tried to
> control TSO size, but did it in the wrong place (at sendmsg() time).
> ...
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Neal Cardwell <ncardwell@google.com>

Acked-by: Yuchung Cheng <ycheng@google.com>

Very nice, Eric. This would both improve performance and simplify the
complex TSO logic. I have a small question below.

> ...
> @@ -2018,8 +2053,8 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
>  		 * of queued bytes to ensure line rate.
>  		 * One example is wifi aggregation (802.11 AMPDU)
>  		 */
> -		limit = max_t(unsigned int, sysctl_tcp_limit_output_bytes,
> -			      sk->sk_pacing_rate >> 10);
> +		limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 10);

Is this capping necessary if skb->truesize already takes the pacing
rate into account from the new logic above?

> +		limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes);
> ...
* Re: [PATCH v3 net-next] tcp: refine TSO autosizing

From: Eric Dumazet
Date: 2014-12-08 6:32 UTC
To: Yuchung Cheng
Cc: David Miller, netdev, Neal Cardwell, Nandita Dukkipati

On Sun, 2014-12-07 at 13:24 -0800, Yuchung Cheng wrote:
> >  		 */
> > -		limit = max_t(unsigned int, sysctl_tcp_limit_output_bytes,
> > -			      sk->sk_pacing_rate >> 10);
> > +		limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 10);
>
> Is this capping necessary if skb->truesize already takes the pacing
> rate into account from the new logic above?

This is needed for flows not using GSO/TSO: for such flows,
skb->truesize is a constant (2048 + sizeof(struct sk_buff)).
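Plugging rough numbers into Eric's answer makes the point visible (a
sketch; the 256-byte skb overhead is an assumption standing in for
sizeof(struct sk_buff) on the running kernel): a non-GSO flow queues
skbs whose truesize is fixed at about 2048 + 256 bytes, so without the
2 * skb->truesize floor, a slow flow's budget would collapse to the
pacing term alone, around a hundred bytes per millisecond:

    /* Sketch of the TSQ floor for a non-GSO flow.  The skb overhead
     * value is an assumption for illustration, not the kernel's
     * exact sizeof().
     */
    #include <stdio.h>

    int main(void)
    {
            unsigned int truesize = 2048 + 256;           /* data + assumed sk_buff */
            unsigned long long pacing = 125000;           /* 1 Mbit/s in bytes/sec */
            unsigned long long rate_term = pacing >> 10;  /* ~122 bytes per ms */
            unsigned long long limit;

            limit = rate_term > 2ULL * truesize ? rate_term : 2ULL * truesize;
            printf("rate term %llu, floor %u -> limit %llu bytes (~2 packets)\n",
                   rate_term, 2 * truesize, limit);
            return 0;
    }

The floor guarantees room for two such packets in the qdisc regardless
of how small the pacing term gets.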
* Re: [PATCH v3 net-next] tcp: refine TSO autosizing

From: David Miller
Date: 2014-12-09 21:39 UTC
To: eric.dumazet
Cc: netdev, ncardwell, ycheng, nanditad

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Sun, 07 Dec 2014 12:22:18 -0800

> From: Eric Dumazet <edumazet@google.com>
> ...
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Neal Cardwell <ncardwell@google.com>
> ---
> v3: tcp_xmit_size_goal() still needs to return a multiple of mss.
> v2: added tcp_tso_autosize() helper and removed tp->xmit_size_goal_segs

Applied, thanks Eric.