* [PATCH RFC] tcp: Allow sk_wmem_alloc to exceed sysctl_tcp_limit_output_bytes
@ 2015-03-26 16:46 Jonathan Davies
  2015-03-26 17:14 ` Eric Dumazet
  2015-03-26 17:14 ` Eric Dumazet
  0 siblings, 2 replies; 17+ messages in thread
From: Jonathan Davies @ 2015-03-26 16:46 UTC (permalink / raw)
  To: David S. Miller, Alexey Kuznetsov, James Morris,
	Hideaki YOSHIFUJI, Patrick McHardy, netdev
  Cc: Jonathan Davies, xen-devel, Konrad Rzeszutek Wilk,
	Boris Ostrovsky, David Vrabel, Eric Dumazet

Network drivers with slow TX completion can experience poor network transmit
throughput, limited by hitting the sk_wmem_alloc limit check in tcp_write_xmit.
The limit is 128 KB (by default), which means we are limited to two 64 KB skbs
in-flight. This has been observed to limit transmit throughput with xen-netfront
because its TX completion can be relatively slow compared to physical NIC
drivers.

There have been several modifications to the calculation of the sk_wmem_alloc
limit in the past. Here is a brief history:

 * Since TSQ was introduced, the queue size limit was
       sysctl_tcp_limit_output_bytes.

 * Commit c9eeec26 ("tcp: TSQ can use a dynamic limit") made the limit
       max(skb->truesize, sk->sk_pacing_rate >> 10).
   This allows more packets in-flight according to the estimated rate.

 * Commit 98e09386 ("tcp: tsq: restore minimal amount of queueing") made the
   limit
       max_t(unsigned int, sysctl_tcp_limit_output_bytes,
                           sk->sk_pacing_rate >> 10).
   This ensures at least sysctl_tcp_limit_output_bytes in flight but allows
   more if rate estimation shows this to be worthwhile.

 * Commit 605ad7f1 ("tcp: refine TSO autosizing") made the limit
       min_t(u32, max(2 * skb->truesize, sk->sk_pacing_rate >> 10),
                  sysctl_tcp_limit_output_bytes).
   This means the limit can never exceed sysctl_tcp_limit_output_bytes,
   regardless of what rate estimation suggests. It's not clear from the commit
   message how this significant change was justified, since it turns
   sysctl_tcp_limit_output_bytes from a lower bound into an upper bound.

This patch restores the behaviour that allows the limit to grow above
sysctl_tcp_limit_output_bytes according to the rate estimation.
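
As an illustration (the figures here are hypothetical, not measurements):
with 64 KB GSO skbs, 2 * skb->truesize is roughly 128 KB, while at an
estimated pacing rate of 20 Gb/s (about 2.5 GB/s) sk->sk_pacing_rate >> 10
is roughly 2.4 MB. The pre-605ad7f1 code would allow ~2.4 MB in flight, but
the min() against sysctl_tcp_limit_output_bytes clamps this back to 128 KB;
with the max() in this patch the ~2.4 MB is allowed again.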

This has been measured to improve xen-netfront throughput from a domU to dom0
from 5.5 Gb/s to 8.0 Gb/s. Or, in the case of transmitting from one domU to
another (on the same host), throughput rose from 2.8 Gb/s to 8.0 Gb/s. In the
latter case, TX completion is especially slow, explaining the large improvement.
These values were measured against 4.0-rc5 with "iperf -c <ip> -i 1", using
CentOS 7.0 VM(s) on Citrix XenServer 6.5 on a Dell R730 host with a pair of Xeon
E5-2650 v3 CPUs.

Fixes: 605ad7f184b6 ("tcp: refine TSO autosizing")
Signed-off-by: Jonathan Davies <jonathan.davies@citrix.com>
---
 net/ipv4/tcp_output.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 1db253e..3a49af8 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2052,7 +2052,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 		 * One example is wifi aggregation (802.11 AMPDU)
 		 */
 		limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 10);
-		limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes);
+		limit = max_t(u32, limit, sysctl_tcp_limit_output_bytes);
 
 		if (atomic_read(&sk->sk_wmem_alloc) > limit) {
 			set_bit(TSQ_THROTTLED, &tp->tsq_flags);
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH RFC] tcp: Allow sk_wmem_alloc to exceed sysctl_tcp_limit_output_bytes
  2015-03-26 16:46 [PATCH RFC] tcp: Allow sk_wmem_alloc to exceed sysctl_tcp_limit_output_bytes Jonathan Davies
  2015-03-26 17:14 ` Eric Dumazet
@ 2015-03-26 17:14 ` Eric Dumazet
  2015-03-27 13:06   ` Jonathan Davies
  2015-03-27 13:06   ` Jonathan Davies
  1 sibling, 2 replies; 17+ messages in thread
From: Eric Dumazet @ 2015-03-26 17:14 UTC (permalink / raw)
  To: Jonathan Davies
  Cc: David S. Miller, Alexey Kuznetsov, James Morris,
	Hideaki YOSHIFUJI, Patrick McHardy, netdev, xen-devel,
	Konrad Rzeszutek Wilk, Boris Ostrovsky, David Vrabel,
	Eric Dumazet

On Thu, 2015-03-26 at 16:46 +0000, Jonathan Davies wrote:
> Network drivers with slow TX completion can experience poor network transmit
> throughput, limited by hitting the sk_wmem_alloc limit check in tcp_write_xmit.
> The limit is 128 KB (by default), which means we are limited to two 64 KB skbs
> in-flight. This has been observed to limit transmit throughput with xen-netfront
> because its TX completion can be relatively slow compared to physical NIC
> drivers.
> 
> There have been several modifications to the calculation of the sk_wmem_alloc
> limit in the past. Here is a brief history:
> 
>  * Since TSQ was introduced, the queue size limit was
>        sysctl_tcp_limit_output_bytes.
> 
>  * Commit c9eeec26 ("tcp: TSQ can use a dynamic limit") made the limit
>        max(skb->truesize, sk->sk_pacing_rate >> 10).
>    This allows more packets in-flight according to the estimated rate.
> 
>  * Commit 98e09386 ("tcp: tsq: restore minimal amount of queueing") made the
>    limit
>        max_t(unsigned int, sysctl_tcp_limit_output_bytes,
>                            sk->sk_pacing_rate >> 10).
>    This ensures at least sysctl_tcp_limit_output_bytes in flight but allowed
>    more if rate estimation shows this to be worthwhile.
> 
>  * Commit 605ad7f1 ("tcp: refine TSO autosizing") made the limit
>        min_t(u32, max(2 * skb->truesize, sk->sk_pacing_rate >> 10),
>                   sysctl_tcp_limit_output_bytes).
>    This meant that the limit can never exceed sysctl_tcp_limit_output_bytes,
>    regardless of what rate estimation suggests. It's not clear from the commit
>    message why this significant change was justified, changing
>    sysctl_tcp_limit_output_bytes from being a lower bound to an upper bound.
> 
> This patch restores the behaviour that allows the limit to grow above
> sysctl_tcp_limit_output_bytes according to the rate estimation.
> 
> This has been measured to improve xen-netfront throughput from a domU to dom0
> from 5.5 Gb/s to 8.0 Gb/s. Or, in the case of transmitting from one domU to
> another (on the same host), throughput rose from 2.8 Gb/s to 8.0 Gb/s. In the
> latter case, TX completion is especially slow, explaining the large improvement.
> These values were measured against 4.0-rc5 using "iperf -c <ip> -i 1" using
> CentOS 7.0 VM(s) on Citrix XenServer 6.5 on a Dell R730 host with a pair of Xeon
> E5-2650 v3 CPUs.
> 
> Fixes: 605ad7f184b6 ("tcp: refine TSO autosizing")
> Signed-off-by: Jonathan Davies <jonathan.davies@citrix.com>
> ---
>  net/ipv4/tcp_output.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 1db253e..3a49af8 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -2052,7 +2052,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
>  		 * One example is wifi aggregation (802.11 AMPDU)
>  		 */
>  		limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 10);
> -		limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes);
> +		limit = max_t(u32, limit, sysctl_tcp_limit_output_bytes);
>  
>  		if (atomic_read(&sk->sk_wmem_alloc) > limit) {
>  			set_bit(TSQ_THROTTLED, &tp->tsq_flags);


That will get a NACK from me and the Google TCP team, of course.

Try your patch with low throughput flows, like 64kbps, GSO off...

And observe TCP timestamps and rtt estimations, critical for vegas or
any CC based on delay.
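
(One way to watch these while a flow runs, assuming a reasonably recent
iproute2: "ss -ti dst <ip>" prints the per-socket rtt and pacing estimates.)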

I get line rate on a 40Gbps NIC using this (low) limit of 2 TSO packets.

This topic was discussed for wifi recently, and I suggested multiple
ways to solve the problem on problematic drivers.

There is a reason sysctl_tcp_limit_output_bytes exists: you can change
its value to whatever you want.
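
For example, to double the cap to 256 KB on a running system (the value is
in bytes; no kernel change needed):

	sysctl -w net.ipv4.tcp_limit_output_bytes=262144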

vi +652 Documentation/networking/ip-sysctl.txt

tcp_limit_output_bytes - INTEGER
        Controls TCP Small Queue limit per tcp socket.
        TCP bulk sender tends to increase packets in flight until it
        gets losses notifications. With SNDBUF autotuning, this can
        result in a large amount of packets queued in qdisc/device
        on the local machine, hurting latency of other flows, for
        typical pfifo_fast qdiscs.
        tcp_limit_output_bytes limits the number of bytes on qdisc
        or device to reduce artificial RTT/cwnd and reduce bufferbloat.
        Default: 131072

Documentation is pretty clear: this is the max value, not a min one.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH RFC] tcp: Allow sk_wmem_alloc to exceed sysctl_tcp_limit_output_bytes
  2015-03-26 17:14 ` Eric Dumazet
@ 2015-03-27 13:06   ` Jonathan Davies
  2015-04-13 13:46     ` David Vrabel
  2015-04-13 13:46     ` [Xen-devel] " David Vrabel
  2015-03-27 13:06   ` Jonathan Davies
  1 sibling, 2 replies; 17+ messages in thread
From: Jonathan Davies @ 2015-03-27 13:06 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S. Miller, Alexey Kuznetsov, James Morris,
	Hideaki YOSHIFUJI, Patrick McHardy, netdev, xen-devel,
	Konrad Rzeszutek Wilk, Boris Ostrovsky, David Vrabel

On 26/03/15 17:14, Eric Dumazet wrote:
> On Thu, 2015-03-26 at 16:46 +0000, Jonathan Davies wrote:
>> Network drivers with slow TX completion can experience poor network transmit
>> throughput, limited by hitting the sk_wmem_alloc limit check in tcp_write_xmit.
>> The limit is 128 KB (by default), which means we are limited to two 64 KB skbs
>> in-flight. This has been observed to limit transmit throughput with xen-netfront
>> because its TX completion can be relatively slow compared to physical NIC
>> drivers.
>>
>> There have been several modifications to the calculation of the sk_wmem_alloc
>> limit in the past. Here is a brief history:
>>
>>   * Since TSQ was introduced, the queue size limit was
>>         sysctl_tcp_limit_output_bytes.
>>
>>   * Commit c9eeec26 ("tcp: TSQ can use a dynamic limit") made the limit
>>         max(skb->truesize, sk->sk_pacing_rate >> 10).
>>     This allows more packets in-flight according to the estimated rate.
>>
>>   * Commit 98e09386 ("tcp: tsq: restore minimal amount of queueing") made the
>>     limit
>>         max_t(unsigned int, sysctl_tcp_limit_output_bytes,
>>                             sk->sk_pacing_rate >> 10).
>>     This ensures at least sysctl_tcp_limit_output_bytes in flight but allowed
>>     more if rate estimation shows this to be worthwhile.
>>
>>   * Commit 605ad7f1 ("tcp: refine TSO autosizing") made the limit
>>         min_t(u32, max(2 * skb->truesize, sk->sk_pacing_rate >> 10),
>>                    sysctl_tcp_limit_output_bytes).
>>     This meant that the limit can never exceed sysctl_tcp_limit_output_bytes,
>>     regardless of what rate estimation suggests. It's not clear from the commit
>>     message why this significant change was justified, changing
>>     sysctl_tcp_limit_output_bytes from being a lower bound to an upper bound.
>>
>> This patch restores the behaviour that allows the limit to grow above
>> sysctl_tcp_limit_output_bytes according to the rate estimation.
>>
>> This has been measured to improve xen-netfront throughput from a domU to dom0
>> from 5.5 Gb/s to 8.0 Gb/s. Or, in the case of transmitting from one domU to
>> another (on the same host), throughput rose from 2.8 Gb/s to 8.0 Gb/s. In the
>> latter case, TX completion is especially slow, explaining the large improvement.
>> These values were measured against 4.0-rc5 using "iperf -c <ip> -i 1" using
>> CentOS 7.0 VM(s) on Citrix XenServer 6.5 on a Dell R730 host with a pair of Xeon
>> E5-2650 v3 CPUs.
>>
>> Fixes: 605ad7f184b6 ("tcp: refine TSO autosizing")
>> Signed-off-by: Jonathan Davies <jonathan.davies@citrix.com>
>> ---
>>   net/ipv4/tcp_output.c | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
>> index 1db253e..3a49af8 100644
>> --- a/net/ipv4/tcp_output.c
>> +++ b/net/ipv4/tcp_output.c
>> @@ -2052,7 +2052,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
>>   		 * One example is wifi aggregation (802.11 AMPDU)
>>   		 */
>>   		limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 10);
>> -		limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes);
>> +		limit = max_t(u32, limit, sysctl_tcp_limit_output_bytes);
>>
>>   		if (atomic_read(&sk->sk_wmem_alloc) > limit) {
>>   			set_bit(TSQ_THROTTLED, &tp->tsq_flags);
>
>
> That will get a NACK from me and Google TCP team of course.
>
> Try your patch with low throughput flows, like 64kbps, GSO off...
>
> And observe TCP timestamps and rtt estimations, critical for vegas or
> any CC based on delay.
>
> I get line rate on 40Gbps NIC using this (low) limit of 2 TSO packets.
>
> This topic was discussed for wifi recently, and I suggested multiple
> ways to solve the problem on problematic drivers.
>
> There is a reason sysctl_tcp_limit_output_bytes exists : You can change
> its value to whatever you want.
>
> vi +652 Documentation/networking/ip-sysctl.txt
>
> tcp_limit_output_bytes - INTEGER
>          Controls TCP Small Queue limit per tcp socket.
>          TCP bulk sender tends to increase packets in flight until it
>          gets losses notifications. With SNDBUF autotuning, this can
>          result in a large amount of packets queued in qdisc/device
>          on the local machine, hurting latency of other flows, for
>          typical pfifo_fast qdiscs.
>          tcp_limit_output_bytes limits the number of bytes on qdisc
>          or device to reduce artificial RTT/cwnd and reduce bufferbloat.
>          Default: 131072
>
> Documentation is pretty clear : This is the max value, not a min one.

Okay, thanks for your feedback. It wasn't clear to me from the commit 
message in 605ad7f1 ("tcp: refine TSO autosizing") why 
tcp_limit_output_bytes changed from being a lower bound to an upper 
bound in that patch. But it's now clear from your reply that that's the 
behaviour you intend as correct.

Thanks for drawing my attention to the wifi thread, which is helpful. As 
I understand it, the only available options for fixing the performance 
regression experienced by xen-netfront are:
   1. Reduce TX completion latency, or
   2. Bypass TSQ to allow buffering of more than tcp_limit_output_bytes, 
finding a way to work around that limit in the driver, e.g.
        (a) pretending the sent packets are smaller than in reality, so 
sk_wmem_alloc doesn't grow as fast;
        (b) using skb_orphan in xennet_start_xmit to allow the skb to be 
freed without waiting for TX completion.

Do you agree, or have I missed something?

Option 1 seems to me like the right long-term goal, but this would 
require substantial changes to xen-netback and probably can't be 
addressed in the near-term to fix this performance regression.

Option 2 risks the problems associated with bufferbloat but may be the 
only way to sustain high throughput. I have briefly tried out ideas (a) 
and (b); both look like they can give almost the same throughput as the 
patch above (around 7 Gb/s domU-to-domU) without the need to modify 
tcp_write_xmit. (I don't know much about the implications of using 
skb_orphan in ndo_start_xmit, so am not sure whether that may cause 
other problems.)

I would like to seek the opinion of the xen-netfront maintainers (cc'd). 
You have a significant performance regression since commit 605ad7f1 
("tcp: refine TSO autosizing"). Are either of options (a) or (b) viable 
to fix it? Or is there a better solution?

For your reference, the proof-of-concept patch for idea (a) I used was:

@@ -630,6 +630,16 @@ static int xennet_start_xmit(struct sk_buff *skb, struct net_device *dev)
 
 	spin_unlock_irqrestore(&queue->tx_lock, flags);
 
+	/* Pretend we sent less than we did, so we can squeeze more out
+	 * of the available tcp_limit_output_bytes.
+	 */
+	if (skb->sk) {
+		u32 fraction = (7 * skb->truesize) / 8;
+
+		skb->truesize -= fraction;
+		atomic_sub(fraction, &skb->sk->sk_wmem_alloc);
+	}
+
 	return NETDEV_TX_OK;
 
  drop:

And the proof-of-concept patch for idea (b) I used was:

@@ -552,6 +552,8 @@ static int xennet_start_xmit(struct sk_buff *skb, struct net_device *dev)
 			goto drop;
 	}
 
+	skb_orphan(skb);
+
 	page = virt_to_page(skb->data);
 	offset = offset_in_page(skb->data);
 	len = skb_headlen(skb);

Thanks,
Jonathan

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [Xen-devel] [PATCH RFC] tcp: Allow sk_wmem_alloc to exceed sysctl_tcp_limit_output_bytes
  2015-03-27 13:06   ` Jonathan Davies
  2015-04-13 13:46     ` David Vrabel
@ 2015-04-13 13:46     ` David Vrabel
  2015-04-13 14:05       ` Eric Dumazet
  2015-04-13 14:05       ` Eric Dumazet
  1 sibling, 2 replies; 17+ messages in thread
From: David Vrabel @ 2015-04-13 13:46 UTC (permalink / raw)
  To: Jonathan Davies, Eric Dumazet
  Cc: Hideaki YOSHIFUJI, netdev, James Morris, David S. Miller,
	David Vrabel, xen-devel, Alexey Kuznetsov, Boris Ostrovsky,
	Patrick McHardy

On 27/03/15 13:06, Jonathan Davies wrote:
> On 26/03/15 17:14, Eric Dumazet wrote:
>> On Thu, 2015-03-26 at 16:46 +0000, Jonathan Davies wrote:
>>> Network drivers with slow TX completion can experience poor network
>>> transmit
>>> throughput, limited by hitting the sk_wmem_alloc limit check in
>>> tcp_write_xmit.
>>> The limit is 128 KB (by default), which means we are limited to two
>>> 64 KB skbs
>>> in-flight. This has been observed to limit transmit throughput with
>>> xen-netfront
>>> because its TX completion can be relatively slow compared to physical
>>> NIC
>>> drivers.
>>>
>>> There have been several modifications to the calculation of the
>>> sk_wmem_alloc
>>> limit in the past. Here is a brief history:
>>>
>>>   * Since TSQ was introduced, the queue size limit was
>>>         sysctl_tcp_limit_output_bytes.
>>>
>>>   * Commit c9eeec26 ("tcp: TSQ can use a dynamic limit") made the limit
>>>         max(skb->truesize, sk->sk_pacing_rate >> 10).
>>>     This allows more packets in-flight according to the estimated rate.
>>>
>>>   * Commit 98e09386 ("tcp: tsq: restore minimal amount of queueing")
>>> made the
>>>     limit
>>>         max_t(unsigned int, sysctl_tcp_limit_output_bytes,
>>>                             sk->sk_pacing_rate >> 10).
>>>     This ensures at least sysctl_tcp_limit_output_bytes in flight but
>>> allowed
>>>     more if rate estimation shows this to be worthwhile.
>>>
>>>   * Commit 605ad7f1 ("tcp: refine TSO autosizing") made the limit
>>>         min_t(u32, max(2 * skb->truesize, sk->sk_pacing_rate >> 10),
>>>                    sysctl_tcp_limit_output_bytes).
>>>     This meant that the limit can never exceed
>>> sysctl_tcp_limit_output_bytes,
>>>     regardless of what rate estimation suggests. It's not clear from
>>> the commit
>>>     message why this significant change was justified, changing
>>>     sysctl_tcp_limit_output_bytes from being a lower bound to an
>>> upper bound.
>>>
>>> This patch restores the behaviour that allows the limit to grow above
>>> sysctl_tcp_limit_output_bytes according to the rate estimation.
>>>
>>> This has been measured to improve xen-netfront throughput from a domU
>>> to dom0
>>> from 5.5 Gb/s to 8.0 Gb/s. Or, in the case of transmitting from one
>>> domU to
>>> another (on the same host), throughput rose from 2.8 Gb/s to 8.0
>>> Gb/s. In the
>>> latter case, TX completion is especially slow, explaining the large
>>> improvement.
>>> These values were measured against 4.0-rc5 using "iperf -c <ip> -i 1"
>>> using
>>> CentOS 7.0 VM(s) on Citrix XenServer 6.5 on a Dell R730 host with a
>>> pair of Xeon
>>> E5-2650 v3 CPUs.
>>>
>>> Fixes: 605ad7f184b6 ("tcp: refine TSO autosizing")
>>> Signed-off-by: Jonathan Davies <jonathan.davies@citrix.com>
>>> ---
>>>   net/ipv4/tcp_output.c | 2 +-
>>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
>>> index 1db253e..3a49af8 100644
>>> --- a/net/ipv4/tcp_output.c
>>> +++ b/net/ipv4/tcp_output.c
>>> @@ -2052,7 +2052,7 @@ static bool tcp_write_xmit(struct sock *sk,
>>> unsigned int mss_now, int nonagle,
>>>            * One example is wifi aggregation (802.11 AMPDU)
>>>            */
>>>           limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 10);
>>> -        limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes);
>>> +        limit = max_t(u32, limit, sysctl_tcp_limit_output_bytes);
>>>
>>>           if (atomic_read(&sk->sk_wmem_alloc) > limit) {
>>>               set_bit(TSQ_THROTTLED, &tp->tsq_flags);
>>
>>
>> That will get a NACK from me and Google TCP team of course.
>>
>> Try your patch with low throughput flows, like 64kbps, GSO off...
>>
>> And observe TCP timestamps and rtt estimations, critical for vegas or
>> any CC based on delay.
>>
>> I get line rate on 40Gbps NIC using this (low) limit of 2 TSO packets.
>>
>> This topic was discussed for wifi recently, and I suggested multiple
>> ways to solve the problem on problematic drivers.
>>
>> There is a reason sysctl_tcp_limit_output_bytes exists : You can change
>> its value to whatever you want.
>>
>> vi +652 Documentation/networking/ip-sysctl.txt
>>
>> tcp_limit_output_bytes - INTEGER
>>          Controls TCP Small Queue limit per tcp socket.
>>          TCP bulk sender tends to increase packets in flight until it
>>          gets losses notifications. With SNDBUF autotuning, this can
>>          result in a large amount of packets queued in qdisc/device
>>          on the local machine, hurting latency of other flows, for
>>          typical pfifo_fast qdiscs.
>>          tcp_limit_output_bytes limits the number of bytes on qdisc
>>          or device to reduce artificial RTT/cwnd and reduce bufferbloat.
>>          Default: 131072
>>
>> Documentation is pretty clear : This is the max value, not a min one.
> 
> Okay, thanks for your feedback. It wasn't clear to me from the commit
> message in 605ad7f1 ("tcp: refine TSO autosizing") why
> tcp_limit_output_bytes changed from being a lower bound to an upper
> bound in that patch. But it's now clear from your reply that that's the
> behaviour you intend as correct.
> 
> Thanks for drawing my attention to the wifi thread, which is helpful. As
> I understand it, the only available options for fixing the performance
> regression experienced by xen-netfront are:
>   1. Reduce TX completion latency, or
>   2. Bypass TSQ to allow buffering of more than tcp_limit_output_bytes,
> finding a way to work around that limit in the driver, e.g.
>        (a) pretending the sent packets are smaller than in reality, so
> sk_wmem_alloc doesn't grow as fast;
>        (b) using skb_orphan in xennet_start_xmit to allow the skb to be
> freed without waiting for TX completion.
> 
> Do you agree, or have I missed something?
> 
> Option 1 seems to me like the right long-term goal, but this would
> require substantial changes to xen-netback and probably can't be
> addressed in the near-term to fix this performance regression.
> 
> Option 2 risks the problems associated with bufferbloat but may be the
> only way to sustain high throughput. I have briefly tried out ideas (a)
> and (b); both look like they can give almost the same throughput as the
> patch above (around 7 Gb/s domU-to-domU) without the need to modify
> tcp_write_xmit. (I don't know much about the implications of using
> skb_orphan in ndo_start_xmit, so am not sure whether that may cause
> other problems.)
> 
> I would like to seek the opinion of the xen-netfront maintainers (cc'd).
> You have a significant performance regression since commit 605ad7f1
> ("tcp: refine TSO autosizing"). Are either of options (a) or (b) viable
> to fix it? Or is there a better solution?
> 
> For your reference, the proof-of-concept patch for idea (a) I used was:
> 
> @@ -630,6 +630,16 @@ static int xennet_start_xmit(struct sk_buff *skb,
> struct net_device *dev)
> 
>      spin_unlock_irqrestore(&queue->tx_lock, flags);
> 
> +    /* Pretend we sent less than we did, so we can squeeze more out
> +         * of the available tcp_limit_output_bytes.
> +         */
> +    if (skb->sk) {
> +        u32 fraction = (7 * skb->truesize) / 8;
> +
> +        skb->truesize -= fraction;
> +        atomic_sub(fraction, &skb->sk->sk_wmem_alloc);
> +    }

Playing games like this with skb->truesize looks dodgy to me.

> And the proof-of-concept patch for idea (b) I used was:
> 
> @@ -552,6 +552,8 @@ static int xennet_start_xmit(struct sk_buff *skb,
> struct net_device *dev)
>              goto drop;
>      }
> 
> +    skb_orphan(skb);
> +
>      page = virt_to_page(skb->data);
>      offset = offset_in_page(skb->data);
>      len = skb_headlen(skb);

No. This is a bunch of allocations and a full memcpy of all the frags.

David

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [Xen-devel] [PATCH RFC] tcp: Allow sk_wmem_alloc to exceed sysctl_tcp_limit_output_bytes
  2015-04-13 13:46     ` [Xen-devel] " David Vrabel
@ 2015-04-13 14:05       ` Eric Dumazet
  2015-04-13 15:03         ` Malcolm Crossley
  2015-04-13 15:03         ` [Xen-devel] " Malcolm Crossley
  2015-04-13 14:05       ` Eric Dumazet
  1 sibling, 2 replies; 17+ messages in thread
From: Eric Dumazet @ 2015-04-13 14:05 UTC (permalink / raw)
  To: David Vrabel
  Cc: Jonathan Davies, Hideaki YOSHIFUJI, netdev, James Morris,
	David S. Miller, xen-devel, Alexey Kuznetsov, Boris Ostrovsky,
	Patrick McHardy

On Mon, 2015-04-13 at 14:46 +0100, David Vrabel wrote:

> > And the proof-of-concept patch for idea (b) I used was:
> > 
> > @@ -552,6 +552,8 @@ static int xennet_start_xmit(struct sk_buff *skb,
> > struct net_device *dev)
> >              goto drop;
> >      }
> > 
> > +    skb_orphan(skb);
> > +
> >      page = virt_to_page(skb->data);
> >      offset = offset_in_page(skb->data);
> >      len = skb_headlen(skb);
> 
> No. This a bunch of allocations and  a full memcpy of all the frags.

skb_orphan() does nothing like that.
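
For reference, skb_orphan() is roughly the following: it runs the skb's
destructor (e.g. tcp_wfree(), which releases sk_wmem_alloc) and detaches
the skb from its socket; no data or frags are touched.

	static inline void skb_orphan(struct sk_buff *skb)
	{
		if (skb->destructor) {
			/* destructor (e.g. tcp_wfree) gives sk_wmem_alloc back */
			skb->destructor(skb);
			skb->destructor = NULL;
			skb->sk		= NULL;
		}
	}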

But the main concern here is that it basically breaks back pressure.

And you do not want this, unless there is no other choice.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [Xen-devel] [PATCH RFC] tcp: Allow sk_wmem_alloc to exceed sysctl_tcp_limit_output_bytes
  2015-04-13 14:05       ` Eric Dumazet
  2015-04-13 15:03         ` Malcolm Crossley
@ 2015-04-13 15:03         ` Malcolm Crossley
  2015-04-15 14:19           ` George Dunlap
  2015-04-15 14:19           ` [Xen-devel] " George Dunlap
  1 sibling, 2 replies; 17+ messages in thread
From: Malcolm Crossley @ 2015-04-13 15:03 UTC (permalink / raw)
  To: Eric Dumazet, David Vrabel
  Cc: Jonathan Davies, Hideaki YOSHIFUJI, netdev, James Morris,
	Patrick McHardy, Alexey Kuznetsov, xen-devel, Boris Ostrovsky,
	David S. Miller

On 13/04/15 15:05, Eric Dumazet wrote:
> On Mon, 2015-04-13 at 14:46 +0100, David Vrabel wrote:
> 
>>> And the proof-of-concept patch for idea (b) I used was:
>>>
>>> @@ -552,6 +552,8 @@ static int xennet_start_xmit(struct sk_buff *skb,
>>> struct net_device *dev)
>>>              goto drop;
>>>      }
>>>
>>> +    skb_orphan(skb);
>>> +
>>>      page = virt_to_page(skb->data);
>>>      offset = offset_in_page(skb->data);
>>>      len = skb_headlen(skb);
>>
>> No. This a bunch of allocations and  a full memcpy of all the frags.
> 
> skb_orphan() does nothing like that.

I think David accidentally read the skb_orphan_frags() code.

> 
> But the main concern here is it basically breaks back pressure.
> 
> And you do not want this, unless there is no other choice.
> 

virtio_net already uses skb_orphan() in its transmit path. It seems
only fair that other virtual network drivers behave in the same way.
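
For reference, the virtio_net transmit path at around this time contains,
roughly (a sketch from memory rather than an exact quote):

	/* Don't wait up for transmitted skbs to be freed. */
	skb_orphan(skb);
	nf_reset(skb);

so the skb is detached from its socket as soon as it has been handed to the
host, without waiting for the host-side completion.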

There are no easy solutions to decrease the transmit latency for
netback/netfront. We map the guest memory through to the backend to
avoid memory copies. The frontend memory can only be freed once the
network driver has completed transmitting the packet in the backend.

Modern network drivers can be quite slow at freeing the skbs once
transmitted (the packet is already on the wire as far as they are
concerned) and this delay is compounded by needing to signal the
completion of the transmit back to the frontend (by IPI in the worst case).

From a networking point of view, the backend is a switch. Is it OK to
consider the packet to have been transmitted from the guest point of
view once the backend is aware of the packet?

This would help justify the skb_orphan() in the frontend.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [Xen-devel] [PATCH RFC] tcp: Allow sk_wmem_alloc to exceed sysctl_tcp_limit_output_bytes
  2015-04-13 15:03         ` [Xen-devel] " Malcolm Crossley
  2015-04-15 14:19           ` George Dunlap
@ 2015-04-15 14:19           ` George Dunlap
  2015-04-15 14:36             ` Ian Campbell
  2015-04-15 14:36             ` [Xen-devel] " Ian Campbell
  1 sibling, 2 replies; 17+ messages in thread
From: George Dunlap @ 2015-04-15 14:19 UTC (permalink / raw)
  To: Malcolm Crossley
  Cc: Eric Dumazet, David Vrabel, Jonathan Davies, Hideaki YOSHIFUJI,
	netdev, James Morris, David S. Miller, xen-devel,
	Alexey Kuznetsov, Boris Ostrovsky, Patrick McHardy, Wei Liu

On Mon, Apr 13, 2015 at 4:03 PM, Malcolm Crossley
<malcolm.crossley@citrix.com> wrote:
>>
>> But the main concern here is it basically breaks back pressure.
>>
>> And you do not want this, unless there is no other choice.
>>
>
> virtio_net already use's skb_orphan() in it's transmit path. It seems
> only fair that other virtual network drivers behave in the same way.
>
> There are no easy solutions to decrease the transmit latency for
> netback/netfront. We map the guest memory through to the backend to
> avoid memory copies. The frontend memory can only be freed once the
> network driver has completed transmitting the packet in the backend.
>
> Modern network drivers can be quite slow at freeing the skb's once
> transmitted (the packet is already on the wire as far as they are
> concerned) and this delay is compounded by needing the signal the
> completion of the transmit back to the frontend (by IPI in worst case).
>
> From a networking point of view, the backend is a switch. Is it OK to
> consider the packet to have been transmitted from the guest point of
> view once the backend is aware of the packet?
>
> This would help justify the skb_orphan() in the frontend.

This sounds sensible to me, particularly if virtio_net is already doing it.

 -George

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH RFC] tcp: Allow sk_wmem_alloc to exceed sysctl_tcp_limit_output_bytes
  2015-04-13 15:03         ` [Xen-devel] " Malcolm Crossley
@ 2015-04-15 14:19           ` George Dunlap
  2015-04-15 14:19           ` [Xen-devel] " George Dunlap
  1 sibling, 0 replies; 17+ messages in thread
From: George Dunlap @ 2015-04-15 14:19 UTC (permalink / raw)
  To: Malcolm Crossley
  Cc: Jonathan Davies, Wei Liu, Eric Dumazet, Hideaki YOSHIFUJI,
	netdev, James Morris, Patrick McHardy, David Vrabel,
	Alexey Kuznetsov, xen-devel, Boris Ostrovsky, David S. Miller

On Mon, Apr 13, 2015 at 4:03 PM, Malcolm Crossley
<malcolm.crossley@citrix.com> wrote:
>>
>> But the main concern here is it basically breaks back pressure.
>>
>> And you do not want this, unless there is no other choice.
>>
>
> virtio_net already use's skb_orphan() in it's transmit path. It seems
> only fair that other virtual network drivers behave in the same way.
>
> There are no easy solutions to decrease the transmit latency for
> netback/netfront. We map the guest memory through to the backend to
> avoid memory copies. The frontend memory can only be freed once the
> network driver has completed transmitting the packet in the backend.
>
> Modern network drivers can be quite slow at freeing the skbs once
> transmitted (the packet is already on the wire as far as they are
> concerned), and this delay is compounded by needing to signal the
> completion of the transmit back to the frontend (by IPI in the worst case).
>
> From a networking point of view, the backend is a switch. Is it OK to
> consider the packet to have been transmitted from the guest point of
> view once the backend is aware of the packet?
>
> This would help justify the skb_orphan() in the frontend.

This sounds sensible to me, particularly if virtio_net is already doing it.

 -George

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [Xen-devel] [PATCH RFC] tcp: Allow sk_wmem_alloc to exceed sysctl_tcp_limit_output_bytes
  2015-04-15 14:19           ` [Xen-devel] " George Dunlap
  2015-04-15 14:36             ` Ian Campbell
@ 2015-04-15 14:36             ` Ian Campbell
  2015-04-15 16:42               ` Eric Dumazet
  1 sibling, 1 reply; 17+ messages in thread
From: Ian Campbell @ 2015-04-15 14:36 UTC (permalink / raw)
  To: George Dunlap
  Cc: Malcolm Crossley, Jonathan Davies, Wei Liu, Eric Dumazet,
	Hideaki YOSHIFUJI, netdev, James Morris, Patrick McHardy,
	David Vrabel, Alexey Kuznetsov, xen-devel, Boris Ostrovsky,
	David S. Miller

On Wed, 2015-04-15 at 15:19 +0100, George Dunlap wrote:
> On Mon, Apr 13, 2015 at 4:03 PM, Malcolm Crossley
[...]
> > From a networking point of view, the backend is a switch. Is it OK to
> > consider the packet to have been transmitted from the guest point of
> > view once the backend is aware of the packet?
> >
> > This would help justify the skb_orphan() in the frontend.
> 
> This sounds sensible to me, particularly if virtio_net is already doing it.

I also find Malcolm's argument above pretty compelling.

Ian.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH RFC] tcp: Allow sk_wmem_alloc to exceed sysctl_tcp_limit_output_bytes
  2015-04-15 14:19           ` [Xen-devel] " George Dunlap
@ 2015-04-15 14:36             ` Ian Campbell
  2015-04-15 14:36             ` [Xen-devel] " Ian Campbell
  1 sibling, 0 replies; 17+ messages in thread
From: Ian Campbell @ 2015-04-15 14:36 UTC (permalink / raw)
  To: George Dunlap
  Cc: Boris Ostrovsky, Jonathan Davies, Wei Liu, Eric Dumazet,
	Hideaki YOSHIFUJI, netdev, James Morris, David S. Miller,
	David Vrabel, xen-devel, Alexey Kuznetsov, Malcolm Crossley,
	Patrick McHardy

On Wed, 2015-04-15 at 15:19 +0100, George Dunlap wrote:
> On Mon, Apr 13, 2015 at 4:03 PM, Malcolm Crossley
[...]
> > From a networking point of view, the backend is a switch. Is it OK to
> > consider the packet to have been transmitted from the guest point of
> > view once the backend is aware of the packet?
> >
> > This would help justify the skb_orphan() in the frontend.
> 
> This sounds sensible to me, particularly if virtio_net is already doing it.

I also find Malcolm's argument above pretty compelling.

Ian.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH RFC] tcp: Allow sk_wmem_alloc to exceed sysctl_tcp_limit_output_bytes
  2015-04-15 14:36             ` [Xen-devel] " Ian Campbell
@ 2015-04-15 16:42               ` Eric Dumazet
  0 siblings, 0 replies; 17+ messages in thread
From: Eric Dumazet @ 2015-04-15 16:42 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Boris Ostrovsky, Jonathan Davies, Wei Liu, George Dunlap, netdev,
	James Morris, David S. Miller, Hideaki YOSHIFUJI, David Vrabel,
	xen-devel, Alexey Kuznetsov, Malcolm Crossley, Patrick McHardy

On Wed, 2015-04-15 at 15:36 +0100, Ian Campbell wrote:
> On Wed, 2015-04-15 at 15:19 +0100, George Dunlap wrote:
> > On Mon, Apr 13, 2015 at 4:03 PM, Malcolm Crossley
> [...]
> > > From a networking point of view, the backend is a switch. Is it OK to
> > > consider the packet to have been transmitted from the guest point of
> > > view once the backend is aware of the packet?
> > >
> > > This would help justify the skb_orphan() in the frontend.
> > 
> > This sounds sensible to me, particularly if virtio_net is already doing it.
> 
> I also find Malcolm's argument above pretty compelling.

Yes, and then you'll have to help the ongoing virtio effort to get rid
of this skb_orphan().

Basically you're adding head-of-line blocking, as a single flow is able
to fill your queue.
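
For context, a simplified sketch of what skb_orphan() amounts to (closely
following include/linux/skbuff.h, but renamed here to make clear this is an
illustration rather than the exact source):

	#include <linux/skbuff.h>

	/* The destructor (typically sock_wfree) uncharges the skb from the
	 * sending socket's sk_wmem_alloc and the socket reference is dropped.
	 * After this, TSQ no longer counts the data as in flight, so a single
	 * flow can keep queueing until the device queue itself is full; that
	 * is the head-of-line blocking concern raised above.
	 */
	static inline void skb_orphan_sketch(struct sk_buff *skb)
	{
		if (skb->destructor) {
			skb->destructor(skb);
			skb->destructor = NULL;
			skb->sk = NULL;
		}
	}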

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH RFC] tcp: Allow sk_wmem_alloc to exceed sysctl_tcp_limit_output_bytes
@ 2015-03-26 16:46 Jonathan Davies
  0 siblings, 0 replies; 17+ messages in thread
From: Jonathan Davies @ 2015-03-26 16:46 UTC (permalink / raw)
  To: David S. Miller, Alexey Kuznetsov, James Morris,
	Hideaki YOSHIFUJI, Patrick McHardy, netdev
  Cc: Jonathan Davies, Eric Dumazet, David Vrabel, xen-devel, Boris Ostrovsky

Network drivers with slow TX completion can experience poor network transmit
throughput, limited by hitting the sk_wmem_alloc limit check in tcp_write_xmit.
The limit is 128 KB (by default), which means we are limited to two 64 KB skbs
in-flight. This has been observed to limit transmit throughput with xen-netfront
because its TX completion can be relatively slow compared to physical NIC
drivers.

There have been several modifications to the calculation of the sk_wmem_alloc
limit in the past. Here is a brief history:

 * Since TSQ was introduced, the queue size limit was
       sysctl_tcp_limit_output_bytes.

 * Commit c9eeec26 ("tcp: TSQ can use a dynamic limit") made the limit
       max(skb->truesize, sk->sk_pacing_rate >> 10).
   This allows more packets in-flight according to the estimated rate.

 * Commit 98e09386 ("tcp: tsq: restore minimal amount of queueing") made the
   limit
       max_t(unsigned int, sysctl_tcp_limit_output_bytes,
                           sk->sk_pacing_rate >> 10).
   This ensures at least sysctl_tcp_limit_output_bytes in flight but allowed
   more if rate estimation shows this to be worthwhile.

 * Commit 605ad7f1 ("tcp: refine TSO autosizing") made the limit
       min_t(u32, max(2 * skb->truesize, sk->sk_pacing_rate >> 10),
                  sysctl_tcp_limit_output_bytes).
   This meant that the limit can never exceed sysctl_tcp_limit_output_bytes,
   regardless of what rate estimation suggests. It's not clear from the commit
   message why this significant change was justified, changing
   sysctl_tcp_limit_output_bytes from being a lower bound to an upper bound.

This patch restores the behaviour that allows the limit to grow above
sysctl_tcp_limit_output_bytes according to the rate estimation.

This has been measured to improve xen-netfront throughput from a domU to dom0
from 5.5 Gb/s to 8.0 Gb/s. Or, in the case of transmitting from one domU to
another (on the same host), throughput rose from 2.8 Gb/s to 8.0 Gb/s. In the
latter case, TX completion is especially slow, explaining the large improvement.
These values were measured against 4.0-rc5 using "iperf -c <ip> -i 1" using
CentOS 7.0 VM(s) on Citrix XenServer 6.5 on a Dell R730 host with a pair of Xeon
E5-2650 v3 CPUs.

Fixes: 605ad7f184b6 ("tcp: refine TSO autosizing")
Signed-off-by: Jonathan Davies <jonathan.davies@citrix.com>
---
 net/ipv4/tcp_output.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 1db253e..3a49af8 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2052,7 +2052,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 		 * One example is wifi aggregation (802.11 AMPDU)
 		 */
 		limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 10);
-		limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes);
+		limit = max_t(u32, limit, sysctl_tcp_limit_output_bytes);
 
 		if (atomic_read(&sk->sk_wmem_alloc) > limit) {
 			set_bit(TSQ_THROTTLED, &tp->tsq_flags);
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2015-04-15 16:42 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-03-26 16:46 [PATCH RFC] tcp: Allow sk_wmem_alloc to exceed sysctl_tcp_limit_output_bytes Jonathan Davies
2015-03-26 17:14 ` Eric Dumazet
2015-03-26 17:14 ` Eric Dumazet
2015-03-27 13:06   ` Jonathan Davies
2015-04-13 13:46     ` David Vrabel
2015-04-13 13:46     ` [Xen-devel] " David Vrabel
2015-04-13 14:05       ` Eric Dumazet
2015-04-13 15:03         ` Malcolm Crossley
2015-04-13 15:03         ` [Xen-devel] " Malcolm Crossley
2015-04-15 14:19           ` George Dunlap
2015-04-15 14:19           ` [Xen-devel] " George Dunlap
2015-04-15 14:36             ` Ian Campbell
2015-04-15 14:36             ` [Xen-devel] " Ian Campbell
2015-04-15 16:42               ` Eric Dumazet
2015-04-13 14:05       ` Eric Dumazet
2015-03-27 13:06   ` Jonathan Davies
  -- strict thread matches above, loose matches on Subject: below --
2015-03-26 16:46 Jonathan Davies
