From: Jonathan Davies <jonathan.davies@citrix.com>
To: "David S. Miller" <davem@davemloft.net>,
	Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>,
	James Morris <jmorris@namei.org>,
	Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>,
	Patrick McHardy <kaber@trash.net>, <netdev@vger.kernel.org>
Cc: Jonathan Davies <jonathan.davies@citrix.com>,
	<xen-devel@lists.xenproject.org>,
	Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>,
	Boris Ostrovsky <boris.ostrovsky@oracle.com>,
	"David Vrabel" <david.vrabel@citrix.com>,
	Eric Dumazet <edumazet@google.com>
Subject: [PATCH RFC] tcp: Allow sk_wmem_alloc to exceed sysctl_tcp_limit_output_bytes
Date: Thu, 26 Mar 2015 16:46:53 +0000
Message-ID: <1427388414-31077-1-git-send-email-jonathan.davies@citrix.com>

Network drivers with slow TX completion can experience poor network transmit
throughput, limited by hitting the sk_wmem_alloc limit check in tcp_write_xmit.
The limit is 128 KB (by default), which means we are limited to two 64 KB skbs
in-flight. This has been observed to limit transmit throughput with xen-netfront
because its TX completion can be relatively slow compared to physical NIC
drivers.

There have been several modifications to the calculation of the sk_wmem_alloc
limit in the past. Here is a brief history:

 * Since TSQ was introduced, the queue size limit was
       sysctl_tcp_limit_output_bytes.

 * Commit c9eeec26 ("tcp: TSQ can use a dynamic limit") made the limit
       max(skb->truesize, sk->sk_pacing_rate >> 10).
   This allows more packets in-flight according to the estimated rate.

 * Commit 98e09386 ("tcp: tsq: restore minimal amount of queueing") made the
   limit
       max_t(unsigned int, sysctl_tcp_limit_output_bytes,
                           sk->sk_pacing_rate >> 10).
   This ensures at least sysctl_tcp_limit_output_bytes in flight but allows
   more if rate estimation shows this to be worthwhile.

 * Commit 605ad7f1 ("tcp: refine TSO autosizing") made the limit
       min_t(u32, max(2 * skb->truesize, sk->sk_pacing_rate >> 10),
                  sysctl_tcp_limit_output_bytes).
   This means that the limit can never exceed sysctl_tcp_limit_output_bytes,
   regardless of what rate estimation suggests. It's not clear from the commit
   message why this significant change, which turned
   sysctl_tcp_limit_output_bytes from a lower bound into an upper bound, was
   justified.

This patch restores the behaviour that allows the limit to grow above
sysctl_tcp_limit_output_bytes according to the rate estimation.
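
To make the difference between the last two forms of the limit concrete, here
is a minimal user-space sketch, not kernel code. The helper names
(max_u32, min_u32, tsq_limit_capped, tsq_limit_floor) are hypothetical and the
socket fields are passed in as plain integers; it assumes sk_pacing_rate is in
bytes per second, so that the shift by 10 gives roughly one millisecond's
worth of data.

```c
#include <stdint.h>

/* Hypothetical helpers standing in for the kernel's max()/min_t() macros. */
static uint32_t max_u32(uint32_t a, uint32_t b) { return a > b ? a : b; }
static uint32_t min_u32(uint32_t a, uint32_t b) { return a < b ? a : b; }

/*
 * Limit as computed since commit 605ad7f1: the sysctl value is an upper
 * bound, so the rate estimate can never raise the limit above it.
 */
static uint32_t tsq_limit_capped(uint32_t truesize, uint32_t pacing_rate,
				 uint32_t sysctl_limit)
{
	uint32_t limit = max_u32(2 * truesize, pacing_rate >> 10);

	return min_u32(limit, sysctl_limit);
}

/*
 * Limit as computed with this patch: the sysctl value is a lower bound,
 * so the rate estimate can raise the limit above it.
 */
static uint32_t tsq_limit_floor(uint32_t truesize, uint32_t pacing_rate,
				uint32_t sysctl_limit)
{
	uint32_t limit = max_u32(2 * truesize, pacing_rate >> 10);

	return max_u32(limit, sysctl_limit);
}
```

With illustrative inputs of a 65536-byte truesize, a pacing rate of
1000000000 bytes/s and the default limit of 131072 bytes, the capped form
returns 131072 and the floor form returns 976562.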

This has been measured to improve xen-netfront throughput from a domU to dom0
from 5.5 Gb/s to 8.0 Gb/s. When transmitting from one domU to another (on the
same host), throughput rose from 2.8 Gb/s to 8.0 Gb/s. In the latter case, TX
completion is especially slow, which explains the larger improvement. These
values were measured against 4.0-rc5 with "iperf -c <ip> -i 1", using
CentOS 7.0 VM(s) on Citrix XenServer 6.5 on a Dell R730 host with a pair of Xeon
E5-2650 v3 CPUs.

Fixes: 605ad7f184b6 ("tcp: refine TSO autosizing")
Signed-off-by: Jonathan Davies <jonathan.davies@citrix.com>
---
 net/ipv4/tcp_output.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 1db253e..3a49af8 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2052,7 +2052,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 		 * One example is wifi aggregation (802.11 AMPDU)
 		 */
 		limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 10);
-		limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes);
+		limit = max_t(u32, limit, sysctl_tcp_limit_output_bytes);
 
 		if (atomic_read(&sk->sk_wmem_alloc) > limit) {
 			set_bit(TSQ_THROTTLED, &tp->tsq_flags);
-- 
1.9.1
