* [PATCH RFC] tcp: Allow sk_wmem_alloc to exceed sysctl_tcp_limit_output_bytes
@ 2015-03-26 16:46 Jonathan Davies
  2015-03-26 17:14 ` Eric Dumazet
  0 siblings, 2 replies; 17+ messages in thread
From: Jonathan Davies @ 2015-03-26 16:46 UTC (permalink / raw)
  To: David S. Miller, Alexey Kuznetsov, James Morris,
	Hideaki YOSHIFUJI, Patrick McHardy, netdev
  Cc: Jonathan Davies, xen-devel, Konrad Rzeszutek Wilk,
	Boris Ostrovsky, David Vrabel, Eric Dumazet

Network drivers with slow TX completion can experience poor transmit
throughput because they hit the sk_wmem_alloc limit check in tcp_write_xmit().
The limit is 128 KB by default, which allows only two 64 KB skbs in flight.
This has been observed to limit transmit throughput with xen-netfront, whose
TX completion can be slow compared to physical NIC drivers.

There have been several modifications to the calculation of the sk_wmem_alloc
limit in the past. Here is a brief history:

 * When TSQ was introduced, the queue size limit was
       sysctl_tcp_limit_output_bytes.

 * Commit c9eeec26 ("tcp: TSQ can use a dynamic limit") made the limit
       max(skb->truesize, sk->sk_pacing_rate >> 10).
   This allowed more packets in flight according to the estimated rate.

 * Commit 98e09386 ("tcp: tsq: restore minimal amount of queueing") made the
   limit
       max_t(unsigned int, sysctl_tcp_limit_output_bytes,
                           sk->sk_pacing_rate >> 10).
   This ensured at least sysctl_tcp_limit_output_bytes in flight but allowed
   more if rate estimation showed this to be worthwhile.

 * Commit 605ad7f1 ("tcp: refine TSO autosizing") made the limit
       min_t(u32, max(2 * skb->truesize, sk->sk_pacing_rate >> 10),
                  sysctl_tcp_limit_output_bytes).
   This means that the limit can never exceed sysctl_tcp_limit_output_bytes,
   regardless of what rate estimation suggests. The commit message does not
   explain why sysctl_tcp_limit_output_bytes was changed from a lower bound
   to an upper bound, which is a significant change in behaviour.
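
For comparison, the limit calculations listed above can be modelled in user
space. This is only a sketch: struct tsq_inputs and the limit_*() function
names are hypothetical and exist purely for illustration, and the model assumes
all three inputs fit in 32 bits. It is not the kernel code itself.

```c
#include <stdint.h>

/* Hypothetical user-space model of the inputs; not a kernel structure. */
struct tsq_inputs {
	uint32_t truesize;     /* stands in for skb->truesize, in bytes */
	uint32_t pacing_rate;  /* stands in for sk->sk_pacing_rate, bytes/sec */
	uint32_t sysctl_limit; /* stands in for sysctl_tcp_limit_output_bytes */
};

static inline uint32_t max_u32(uint32_t a, uint32_t b) { return a > b ? a : b; }
static inline uint32_t min_u32(uint32_t a, uint32_t b) { return a < b ? a : b; }

/* Original TSQ: the sysctl value is the limit. */
uint32_t limit_original(const struct tsq_inputs *in)
{
	return in->sysctl_limit;
}

/* After c9eeec26: purely dynamic, the sysctl value is not consulted. */
uint32_t limit_dynamic(const struct tsq_inputs *in)
{
	return max_u32(in->truesize, in->pacing_rate >> 10);
}

/* After 98e09386: the sysctl value acts as a lower bound. */
uint32_t limit_min_queue(const struct tsq_inputs *in)
{
	return max_u32(in->sysctl_limit, in->pacing_rate >> 10);
}

/* After 605ad7f1: the sysctl value acts as an upper bound. */
uint32_t limit_autosizing(const struct tsq_inputs *in)
{
	uint32_t limit = max_u32(2 * in->truesize, in->pacing_rate >> 10);

	return min_u32(limit, in->sysctl_limit);
}

/* With this patch: the sysctl value is a lower bound again. */
uint32_t limit_patched(const struct tsq_inputs *in)
{
	uint32_t limit = max_u32(2 * in->truesize, in->pacing_rate >> 10);

	return max_u32(limit, in->sysctl_limit);
}
```

As an illustration with assumed inputs (truesize of 65536, a pacing rate of
1000000000 bytes/sec and the default sysctl value of 131072), limit_autosizing()
stays at 131072 while limit_patched() follows the rate estimate instead.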

This patch restores the behaviour that allows the limit to grow above
sysctl_tcp_limit_output_bytes according to the rate estimation.

This has been measured to improve xen-netfront throughput from a domU to dom0
from 5.5 Gb/s to 8.0 Gb/s. When transmitting from one domU to another on the
same host, throughput rose from 2.8 Gb/s to 8.0 Gb/s; TX completion is
especially slow in that case, which explains the larger improvement. These
values were measured against 4.0-rc5 with "iperf -c <ip> -i 1" in CentOS 7.0
VM(s) on Citrix XenServer 6.5 on a Dell R730 host with a pair of Xeon
E5-2650 v3 CPUs.

Fixes: 605ad7f184b6 ("tcp: refine TSO autosizing")
Signed-off-by: Jonathan Davies <jonathan.davies@citrix.com>
---
 net/ipv4/tcp_output.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 1db253e..3a49af8 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2052,7 +2052,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 		 * One example is wifi aggregation (802.11 AMPDU)
 		 */
 		limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 10);
-		limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes);
+		limit = max_t(u32, limit, sysctl_tcp_limit_output_bytes);
 
 		if (atomic_read(&sk->sk_wmem_alloc) > limit) {
 			set_bit(TSQ_THROTTLED, &tp->tsq_flags);
-- 
1.9.1


end of thread, other threads:[~2015-04-15 16:42 UTC | newest]

Thread overview: 17+ messages
2015-03-26 16:46 [PATCH RFC] tcp: Allow sk_wmem_alloc to exceed sysctl_tcp_limit_output_bytes Jonathan Davies
2015-03-26 17:14 ` Eric Dumazet
2015-03-26 17:14 ` Eric Dumazet
2015-03-27 13:06   ` Jonathan Davies
2015-04-13 13:46     ` David Vrabel
2015-04-13 13:46     ` [Xen-devel] " David Vrabel
2015-04-13 14:05       ` Eric Dumazet
2015-04-13 15:03         ` Malcolm Crossley
2015-04-13 15:03         ` [Xen-devel] " Malcolm Crossley
2015-04-15 14:19           ` George Dunlap
2015-04-15 14:19           ` [Xen-devel] " George Dunlap
2015-04-15 14:36             ` Ian Campbell
2015-04-15 14:36             ` [Xen-devel] " Ian Campbell
2015-04-15 16:42               ` Eric Dumazet
2015-04-13 14:05       ` Eric Dumazet
2015-03-27 13:06   ` Jonathan Davies
