* "tcp: refine TSO autosizing" causes performance regression on Xen
@ 2015-04-09 15:46 ` Stefano Stabellini
From: Stefano Stabellini @ 2015-04-09 15:46 UTC (permalink / raw)
  To: xen-devel
  Cc: linux-arm-kernel, linux-kernel, Stefano Stabellini,
	christoffer.dall, Ian Campbell, Wei Liu, David Vrabel, edumazet,
	Konrad Rzeszutek Wilk

Hi all,

I found a performance regression when running netperf -t TCP_MAERTS from
an external host to a Xen VM on ARM64: v3.19 and v4.0-rc4 running in the
virtual machine are 30% slower than v3.18.
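
The exact invocation is not given in the report, so the following is only a
minimal sketch of the kind of measurement described, assuming netserver runs
in the guest and <guest-ip> is a placeholder. With TCP_MAERTS the remote end
(the guest) transmits, so the guest kernel's TX path -- where the TSO
autosizing change applies -- is the one being exercised.

    # in the Xen VM (guest)
    netserver

    # on the external host; the guest sends, the host receives
    netperf -H <guest-ip> -t TCP_MAERTS -l 30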

Through bisection I found that the performance regression is caused by the
presence of the following commit in the guest kernel:


commit 605ad7f184b60cfaacbc038aa6c55ee68dee3c89
Author: Eric Dumazet <edumazet@google.com>
Date:   Sun Dec 7 12:22:18 2014 -0800

    tcp: refine TSO autosizing


A simple revert would fix the issue.

Does anybody have any ideas on what could be the cause of the problem?
Suggestions on what to do to fix it?

Cheers,

Stefano

* Re: "tcp: refine TSO autosizing" causes performance regression on Xen
  2015-04-09 15:46 ` Stefano Stabellini
@ 2015-04-09 16:16   ` Eric Dumazet
From: Eric Dumazet @ 2015-04-09 16:16 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: xen-devel, linux-arm-kernel, linux-kernel, christoffer.dall,
	Ian Campbell, Wei Liu, David Vrabel, edumazet,
	Konrad Rzeszutek Wilk, netdev

On Thu, 2015-04-09 at 16:46 +0100, Stefano Stabellini wrote:
> Hi all,
> 
> I found a performance regression when running netperf -t TCP_MAERTS from
> an external host to a Xen VM on ARM64: v3.19 and v4.0-rc4 running in the
> virtual machine are 30% slower than v3.18.
> 
> Through bisection I found that the performance regression is caused by the
> presence of the following commit in the guest kernel:
> 
> 
> commit 605ad7f184b60cfaacbc038aa6c55ee68dee3c89
> Author: Eric Dumazet <edumazet@google.com>
> Date:   Sun Dec 7 12:22:18 2014 -0800
> 
>     tcp: refine TSO autosizing
> 
> 
> A simple revert would fix the issue.
> 
> Does anybody have any ideas on what could be the cause of the problem?
> Suggestions on what to do to fix it?

You sent this to lkml while networking discussions are on netdev.

This topic had been discussed on netdev multiple times.

This commit restored original TCP Small Queue behavior, which is the
first step to fight bufferbloat.

Some network drivers are known to be problematic because of a delayed TX
completion.

So far this commit did not impact max single flow throughput on 40Gb
mlx4 NIC. (ie : line rate is possible)

Try to tweak /proc/sys/net/ipv4/tcp_limit_output_bytes to see if it
makes a difference ?
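
For reference, a minimal sketch of checking and adjusting that sysctl (the
262144 value is the one tried later in this thread; the default in these
kernels was 131072, i.e. 128 KB):

    # read the current limit
    cat /proc/sys/net/ipv4/tcp_limit_output_bytes

    # raise it for testing (not persistent across reboots)
    echo 262144 > /proc/sys/net/ipv4/tcp_limit_output_bytes

    # equivalent via sysctl; add the key to /etc/sysctl.conf to persist it
    sysctl -w net.ipv4.tcp_limit_output_bytes=262144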




* Re: "tcp: refine TSO autosizing" causes performance regression on Xen
  2015-04-09 16:16   ` Eric Dumazet
@ 2015-04-09 16:36     ` Stefano Stabellini
From: Stefano Stabellini @ 2015-04-09 16:36 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Stefano Stabellini, xen-devel, linux-arm-kernel, linux-kernel,
	christoffer.dall, Ian Campbell, Wei Liu, David Vrabel, edumazet,
	Konrad Rzeszutek Wilk, netdev

On Thu, 9 Apr 2015, Eric Dumazet wrote:
> On Thu, 2015-04-09 at 16:46 +0100, Stefano Stabellini wrote:
> > Hi all,
> > 
> > I found a performance regression when running netperf -t TCP_MAERTS from
> > an external host to a Xen VM on ARM64: v3.19 and v4.0-rc4 running in the
> > virtual machine are 30% slower than v3.18.
> > 
> > Through bisection I found that the performance regression is caused by the
> > presence of the following commit in the guest kernel:
> > 
> > 
> > commit 605ad7f184b60cfaacbc038aa6c55ee68dee3c89
> > Author: Eric Dumazet <edumazet@google.com>
> > Date:   Sun Dec 7 12:22:18 2014 -0800
> > 
> >     tcp: refine TSO autosizing
> > 
> > 
> > A simple revert would fix the issue.
> > 
> > Does anybody have any ideas on what could be the cause of the problem?
> > Suggestions on what to do to fix it?
> 
> You sent this to lkml while networking discussions are on netdev.
> 
> This topic had been discussed on netdev multiple times.

Sorry, and many thanks for the quick reply!


> This commit restored original TCP Small Queue behavior, which is the
> first step to fight bufferbloat.
> 
> Some network drivers are known to be problematic because of a delayed TX
> completion.
> 
> So far this commit did not impact max single flow throughput on 40Gb
> mlx4 NIC. (ie : line rate is possible)
> 
> Try to tweak /proc/sys/net/ipv4/tcp_limit_output_bytes to see if it
> makes a difference ?

A very big difference:

echo 262144 > /proc/sys/net/ipv4/tcp_limit_output_bytes
brings us much closer to the original performance, the slowdown is just
8%

echo 1048576 > /proc/sys/net/ipv4/tcp_limit_output_bytes
fills the gap entirely, same performance as before "refine TSO
autosizing"


What would be the next step from here?  Should I just document this as an
important performance tweaking step for Xen, or is there something else
we can do?
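
One way to quantify the latency side of this trade-off (which comes up below)
would be a request/response run alongside the bulk transfer; a minimal sketch,
assuming the same netserver is still running in the guest and <guest-ip> is a
placeholder:

    # round-trip latency under load: fewer transactions/sec means higher RTT
    netperf -H <guest-ip> -t TCP_RR -l 30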

* Re: "tcp: refine TSO autosizing" causes performance regression on Xen
  2015-04-09 16:36     ` Stefano Stabellini
@ 2015-04-09 17:07       ` Eric Dumazet
From: Eric Dumazet @ 2015-04-09 17:07 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: xen-devel, linux-arm-kernel, linux-kernel, christoffer.dall,
	Ian Campbell, Wei Liu, David Vrabel, edumazet,
	Konrad Rzeszutek Wilk, netdev

On Thu, 2015-04-09 at 17:36 +0100, Stefano Stabellini wrote:

> A very big difference:
> 
> echo 262144 > /proc/sys/net/ipv4/tcp_limit_output_bytes
> brings us much closer to the original performance, the slowdown is just
> 8%

Cool.

> 
> echo 1048576 > /proc/sys/net/ipv4/tcp_limit_output_bytes
> fills the gap entirely, same performance as before "refine TSO
> autosizing"


Sure, this basically disables TCP Small Queues and selects the opposite:
favoring single-flow throughput and huge latencies (bufferbloat).


> 
> 
> What would be the next step for here?  Should I just document this as an
> important performance tweaking step for Xen, or is there something else
> we can do?

I guess this is a reasonable choice.

Note that /proc/sys/net/ipv4/tcp_limit_output_bytes is already
documented.




* Re: [Xen-devel] "tcp: refine TSO autosizing" causes performance regression on Xen
  2015-04-09 16:36     ` Stefano Stabellini
@ 2015-04-13 10:56       ` George Dunlap
From: George Dunlap @ 2015-04-13 10:56 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Eric Dumazet, xen-devel, Wei Liu, Ian Campbell, netdev,
	Linux Kernel Mailing List, edumazet, linux-arm-kernel,
	Christoffer Dall, David Vrabel, Jonathan Davies,
	Felipe Franciosi, Paul Durrant

On Thu, Apr 9, 2015 at 5:36 PM, Stefano Stabellini
<stefano.stabellini@eu.citrix.com> wrote:
> On Thu, 9 Apr 2015, Eric Dumazet wrote:
>> On Thu, 2015-04-09 at 16:46 +0100, Stefano Stabellini wrote:
>> > Hi all,
>> >
>> > I found a performance regression when running netperf -t TCP_MAERTS from
>> > an external host to a Xen VM on ARM64: v3.19 and v4.0-rc4 running in the
>> > virtual machine are 30% slower than v3.18.
>> >
>> > Through bisection I found that the performance regression is caused by the
>> > presence of the following commit in the guest kernel:
>> >
>> >
>> > commit 605ad7f184b60cfaacbc038aa6c55ee68dee3c89
>> > Author: Eric Dumazet <edumazet@google.com>
>> > Date:   Sun Dec 7 12:22:18 2014 -0800
>> >
>> >     tcp: refine TSO autosizing

[snip]

>> This commit restored original TCP Small Queue behavior, which is the
>> first step to fight bufferbloat.
>>
>> Some network drivers are known to be problematic because of a delayed TX
>> completion.

[snip]

>> Try to tweak /proc/sys/net/ipv4/tcp_limit_output_bytes to see if it
>> makes a difference ?
>
> A very big difference:
>
> echo 262144 > /proc/sys/net/ipv4/tcp_limit_output_bytes
> brings us much closer to the original performance, the slowdown is just
> 8%
>
> echo 1048576 > /proc/sys/net/ipv4/tcp_limit_output_bytes
> fills the gap entirely, same performance as before "refine TSO
> autosizing"
>
>
> What would be the next step from here?  Should I just document this as an
> important performance tweaking step for Xen, or is there something else
> we can do?

Is the problem perhaps that netback/netfront delays TX completion?
Would it be better to see if that can be addressed properly, so that
the original purpose of the patch (fighting bufferbloat) can be
achieved while not degrading performance for Xen?  Or at least, so
that people get decent performance out of the box without having to
tweak TCP parameters?

 -George

* Re: [Xen-devel] "tcp: refine TSO autosizing" causes performance regression on Xen
  2015-04-13 10:56       ` George Dunlap
@ 2015-04-13 13:38         ` Jonathan Davies
From: Jonathan Davies @ 2015-04-13 13:38 UTC (permalink / raw)
  To: George Dunlap, Stefano Stabellini
  Cc: Eric Dumazet, xen-devel, Wei Liu, Ian Campbell, netdev,
	Linux Kernel Mailing List, edumazet, linux-arm-kernel,
	Christoffer Dall, David Vrabel, Felipe Franciosi, Paul Durrant

On 13/04/15 11:56, George Dunlap wrote:
> On Thu, Apr 9, 2015 at 5:36 PM, Stefano Stabellini
> <stefano.stabellini@eu.citrix.com> wrote:
>> On Thu, 9 Apr 2015, Eric Dumazet wrote:
>>> On Thu, 2015-04-09 at 16:46 +0100, Stefano Stabellini wrote:
>>>> Hi all,
>>>>
>>>> I found a performance regression when running netperf -t TCP_MAERTS from
>>>> an external host to a Xen VM on ARM64: v3.19 and v4.0-rc4 running in the
>>>> virtual machine are 30% slower than v3.18.
>>>>
>>>> Through bisection I found that the performance regression is caused by the
>>>> presence of the following commit in the guest kernel:
>>>>
>>>>
>>>> commit 605ad7f184b60cfaacbc038aa6c55ee68dee3c89
>>>> Author: Eric Dumazet <edumazet@google.com>
>>>> Date:   Sun Dec 7 12:22:18 2014 -0800
>>>>
>>>>      tcp: refine TSO autosizing
>
> [snip]

I recently discussed this issue on netdev in the following thread:

https://www.marc.info/?l=linux-netdev&m=142738853820517

>>> This commit restored original TCP Small Queue behavior, which is the
>>> first step to fight bufferbloat.
>>>
>>> Some network drivers are known to be problematic because of a delayed TX
>>> completion.
>
> [snip]
>
>>> Try to tweak /proc/sys/net/ipv4/tcp_limit_output_bytes to see if it
>>> makes a difference ?
>>
>> A very big difference:
>>
>> echo 262144 > /proc/sys/net/ipv4/tcp_limit_output_bytes
>> brings us much closer to the original performance, the slowdown is just
>> 8%
>>
>> echo 1048576 > /proc/sys/net/ipv4/tcp_limit_output_bytes
>> fills the gap entirely, same performance as before "refine TSO
>> autosizing"
>>
>>
>> What would be the next step from here?  Should I just document this as an
>> important performance tweaking step for Xen, or is there something else
>> we can do?
>
> Is the problem perhaps that netback/netfront delays TX completion?
> Would it be better to see if that can be addressed properly, so that
> the original purpose of the patch (fighting bufferbloat) can be
> achieved while not degrading performance for Xen?  Or at least, so
> that people get decent performance out of the box without having to
> tweak TCP parameters?

I agree; reducing the completion latency should be the ultimate goal. 
However, that won't be easy, so we need a work-around in the short term. 
I don't like the idea of relying on documenting the recommendation to 
change tcp_limit_output_bytes; too many people won't read this advice 
and will expect the out-of-the-box defaults to be reasonable.

Following Eric's pointers to where a similar problem had been 
experienced in wifi drivers, I came up with two proof-of-concept patches 
that gave a similar performance gain without any changes to sysctl 
parameters or core tcp/ip code. See 
https://www.marc.info/?l=linux-netdev&m=142746161307283.

I haven't yet received any feedback from the xen-netfront maintainers 
about whether those ideas could be reasonably adopted.

Jonathan

* Re: [Xen-devel] "tcp: refine TSO autosizing" causes performance regression on Xen
  2015-04-13 10:56       ` George Dunlap
@ 2015-04-13 13:49         ` Eric Dumazet
From: Eric Dumazet @ 2015-04-13 13:49 UTC (permalink / raw)
  To: George Dunlap
  Cc: Stefano Stabellini, xen-devel, Wei Liu, Ian Campbell, netdev,
	Linux Kernel Mailing List, edumazet, linux-arm-kernel,
	Christoffer Dall, David Vrabel, Jonathan Davies,
	Felipe Franciosi, Paul Durrant

On Mon, 2015-04-13 at 11:56 +0100, George Dunlap wrote:

> Is the problem perhaps that netback/netfront delays TX completion?
> Would it be better to see if that can be addressed properly, so that
> the original purpose of the patch (fighting bufferbloat) can be
> achieved while not degrading performance for Xen?  Or at least, so
> that people get decent performance out of the box without having to
> tweak TCP parameters?

Sure, please provide a patch that does not break back pressure.

But just in case, if Xen performance relied on bufferbloat, it might be
very difficult to reach a stable equilibrium : Any small change in stack
or scheduling might introduce a significant difference in 'raw
performance'.




* Re: [Xen-devel] "tcp: refine TSO autosizing" causes performance regression on Xen
  2015-04-13 13:49         ` Eric Dumazet
@ 2015-04-15 13:43           ` George Dunlap
From: George Dunlap @ 2015-04-15 13:43 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Jonathan Davies, xen-devel, Wei Liu, Ian Campbell,
	Stefano Stabellini, netdev, Linux Kernel Mailing List,
	Eric Dumazet, Paul Durrant, Christoffer Dall, Felipe Franciosi,
	linux-arm-kernel, David Vrabel

On Mon, Apr 13, 2015 at 2:49 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Mon, 2015-04-13 at 11:56 +0100, George Dunlap wrote:
>
>> Is the problem perhaps that netback/netfront delays TX completion?
>> Would it be better to see if that can be addressed properly, so that
>> the original purpose of the patch (fighting bufferbloat) can be
>> achieved while not degrading performance for Xen?  Or at least, so
>> that people get decent performance out of the box without having to
>> tweak TCP parameters?
>
> Sure, please provide a patch that does not break back pressure.
>
> But just in case, if Xen performance relied on bufferbloat, it might be
> very difficult to reach a stable equilibrium : Any small change in stack
> or scheduling might introduce a significant difference in 'raw
> performance'.

So help me understand this a little bit here.  tcp_limit_output_bytes
limits the amount of data allowed to be "in-transit" between a send()
and the wire, is that right?

And so the "bufferbloat" problem you're talking about here are TCP
buffers inside the kernel, and/or buffers in the NIC, is that right?

So ideally, you want this to be large enough to fill the "pipeline"
all the way from send() down to actually getting out on the wire;
otherwise, you'll have gaps in the pipeline, and the machinery won't
be working at full throttle.

And the reason it's a problem is that many NICs now come with large
send buffers; and effectively what happens then is that this makes the
"pipeline" longer -- as the buffer fills up, the time between send()
and the wire is increased.  This increased latency causes delays in
round-trip-times and interferes with the mechanisms TCP uses to try to
determine what the actual sustainable rate of data transmission is.

By limiting the number of "in-transit" bytes, you make sure that
neither the kernel nor the NIC is going to have packets queued up for
long lengths of time in buffers, and you keep this "pipeline" as close
to the actual minimal length of the pipeline as possible.  And it
sounds like for your 40G NIC, 128k is big enough to fill the pipeline
without unduly making it longer by introducing buffering.

Is that an accurate picture of what you're trying to achieve?

But the problem for xennet (and a number of other drivers), as I
understand it, is that at the moment the "pipeline" itself is just
longer -- it just takes a longer time from the time you send a packet
to the time it actually gets out on the wire.

So it's not actually accurate to say that "Xen performance relies on
bufferbloat".  There's no buffering involved -- the pipeline is just
longer, and so to fill up the pipeline you need more data.

Basically, to maximize throughput while minimizing buffering, for
*any* connection, tcp_limit_output_bytes should ideally be around
(min_tx_latency * max_bandwidth).  For physical NICs, the minimum
latency is really small, but for xennet -- and I'm guessing for a lot
of virtualized cards -- the min_tx_latency will be a lot higher,
requiring a much higher ideal tcp_limit_output value.
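
As a rough illustration of that rule of thumb (the rates and latencies below
are illustrative assumptions, not measurements from this thread):

    # bytes needed in flight ~= max_bandwidth (bytes/s) * min_tx_latency (s)
    awk 'BEGIN {
        printf "40 Gbit/s, 25 us TX latency: %d bytes\n", 40e9/8 * 25e-6;  # ~125 KB, about the 128k default
        printf "10 Gbit/s,  1 ms TX latency: %d bytes\n", 10e9/8 * 1e-3;   # ~1.25 MB, about the 1048576 value above
    }'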

Rather than trying to pick a single value which will be good for all
NICs, it seems like it would make more sense to have this vary
depending on the parameters of the NIC.  After all, for NICs that have
low throughput -- say, old 100Mbit NICs -- even 128k may still
introduce a significant amount of buffering.

Obviously one solution would be to allow the drivers themselves to set
the tcp_limit_output_bytes, but that seems like a maintenance
nightmare.

Another simple solution would be to allow drivers to indicate whether
they have a high transmit latency, and have the kernel use a higher
value by default when that's the case.

Probably the most sustainable solution would be to have the networking
layer keep track of the average and minimum transmit latencies, and
automatically adjust tcp_limit_output_bytes based on that.  (Keeping
the minimum as well as the average because the whole problem with
bufferbloat is that the more data you give it, the longer the apparent
"pipeline" becomes.)

Thoughts?

 -George

* Re: [Xen-devel] "tcp: refine TSO autosizing" causes performance regression on Xen
  2015-04-15 13:43           ` George Dunlap
@ 2015-04-15 16:38             ` Eric Dumazet
From: Eric Dumazet @ 2015-04-15 16:38 UTC (permalink / raw)
  To: George Dunlap
  Cc: Jonathan Davies, xen-devel, Wei Liu, Ian Campbell,
	Stefano Stabellini, netdev, Linux Kernel Mailing List,
	Eric Dumazet, Paul Durrant, Christoffer Dall, Felipe Franciosi,
	linux-arm-kernel, David Vrabel

On Wed, 2015-04-15 at 14:43 +0100, George Dunlap wrote:
> On Mon, Apr 13, 2015 at 2:49 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > On Mon, 2015-04-13 at 11:56 +0100, George Dunlap wrote:
> >
> >> Is the problem perhaps that netback/netfront delays TX completion?
> >> Would it be better to see if that can be addressed properly, so that
> >> the original purpose of the patch (fighting bufferbloat) can be
> >> achieved while not degrading performance for Xen?  Or at least, so
> >> that people get decent performance out of the box without having to
> >> tweak TCP parameters?
> >
> > Sure, please provide a patch that does not break back pressure.
> >
> > But just in case, if Xen performance relied on bufferbloat, it might be
> > very difficult to reach a stable equilibrium : Any small change in stack
> > or scheduling might introduce a significant difference in 'raw
> > performance'.
> 
> So help me understand this a little bit here.  tcp_limit_output_bytes
> limits the amount of data allowed to be "in-transit" between a send()
> and the wire, is that right?
> 
> And so the "bufferbloat" problem you're talking about here are TCP
> buffers inside the kernel, and/or buffers in the NIC, is that right?
> 
> So ideally, you want this to be large enough to fill the "pipeline"
> all the way from send() down to actually getting out on the wire;
> otherwise, you'll have gaps in the pipeline, and the machinery won't
> be working at full throttle.
> 
> And the reason it's a problem is that many NICs now come with large
> send buffers; and effectively what happens then is that this makes the
> "pipeline" longer -- as the buffer fills up, the time between send()
> and the wire is increased.  This increased latency causes delays in
> round-trip-times and interferes with the mechanisms TCP uses to try to
> determine what the actual sustainable rate of data transmission is.
> 
> By limiting the number of "in-transit" bytes, you make sure that
> neither the kernel nor the NIC is going to have packets queued up for
> long lengths of time in buffers, and you keep this "pipeline" as close
> to the actual minimal length of the pipeline as possible.  And it
> sounds like for your 40G NIC, 128k is big enough to fill the pipeline
> without unduly making it longer by introducing buffering.
> 
> Is that an accurate picture of what you're trying to achieve?
> 
> But the problem for xennet (and a number of other drivers), as I
> understand it, is that at the moment the "pipeline" itself is just
> longer -- it just takes a longer time from the time you send a packet
> to the time it actually gets out on the wire.
> 
> So it's not actually accurate to say that "Xen performance relies on
> bufferbloat".  There's no buffering involved -- the pipeline is just
> longer, and so to fill up the pipeline you need more data.
> 
> Basically, to maximize throughput while minimizing buffering, for
> *any* connection, tcp_limit_output_bytes should ideally be around
> (min_tx_latency * max_bandwidth).  For physical NICs, the minimum
> latency is really small, but for xennet -- and I'm guessing for a lot
> of virtualized cards -- the min_tx_latency will be a lot higher,
> requiring a much higher ideal tcp_limit_output value.
> 
> Rather than trying to pick a single value which will be good for all
> NICs, it seems like it would make more sense to have this vary
> depending on the parameters of the NIC.  After all, for NICs that have
> low throughput -- say, old 100Mbit NICs -- even 128k may still
> introduce a significant amount of buffering.
> 
> Obviously one solution would be to allow the drivers themselves to set
> the tcp_limit_output_bytes, but that seems like a maintenance
> nightmare.
> 
> Another simple solution would be to allow drivers to indicate whether
> they have a high transmit latency, and have the kernel use a higher
> value by default when that's the case.
> 
> Probably the most sustainable solution would be to have the networking
> layer keep track of the average and minimum transmit latencies, and
> automatically adjust tcp_limit_output_bytes based on that.  (Keeping
> the minimum as well as the average because the whole problem with
> bufferbloat is that the more data you give it, the longer the apparent
> "pipeline" becomes.)
> 
> Thoughts?

My thought is that instead of these long talks you guys should read the
code:

                /* TCP Small Queues :
                 * Control number of packets in qdisc/devices to two packets / or ~1 ms.
                 * This allows for :
                 *  - better RTT estimation and ACK scheduling
                 *  - faster recovery
                 *  - high rates
                 * Alas, some drivers / subsystems require a fair amount
                 * of queued bytes to ensure line rate.
                 * One example is wifi aggregation (802.11 AMPDU)
                 */
                limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 10);
                limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes);


Then you'll see that most of your questions are already answered.

Feel free to try to improve the behavior, if it does not hurt critical workloads
like TCP_RR, where we send very small messages, millions of times per second.

Thanks
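
For a concrete reading of the quoted computation (illustrative numbers, and
ignoring the max(2 * skb->truesize, ...) lower bound): sk_pacing_rate is in
bytes per second, so the >> 10 term is roughly one millisecond's worth of
data, which the sysctl then caps.

    awk 'BEGIN {
        pacing = 10e9 / 8;            # assume a 10 Gbit/s pacing rate, in bytes/s
        limit  = int(pacing / 1024);  # sk_pacing_rate >> 10 ~= bytes per ~1 ms (~1.2 MB here)
        cap    = 131072;              # tcp_limit_output_bytes default at the time
        printf "allowed in qdisc/device: %d bytes\n", (limit < cap ? limit : cap);
    }'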




* Re: [Xen-devel] "tcp: refine TSO autosizing" causes performance regression on Xen
  2015-04-15 16:38             ` Eric Dumazet
  (?)
@ 2015-04-15 17:23               ` George Dunlap
  -1 siblings, 0 replies; 92+ messages in thread
From: George Dunlap @ 2015-04-15 17:23 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Jonathan Davies, xen-devel, Wei Liu, Ian Campbell,
	Stefano Stabellini, netdev, Linux Kernel Mailing List,
	Eric Dumazet, Paul Durrant, Christoffer Dall, Felipe Franciosi,
	linux-arm-kernel, David Vrabel

On 04/15/2015 05:38 PM, Eric Dumazet wrote:
> My thoughts that instead of these long talks you should guys read the
> code :
> 
>                 /* TCP Small Queues :
>                  * Control number of packets in qdisc/devices to two packets / or ~1 ms.
>                  * This allows for :
>                  *  - better RTT estimation and ACK scheduling
>                  *  - faster recovery
>                  *  - high rates
>                  * Alas, some drivers / subsystems require a fair amount
>                  * of queued bytes to ensure line rate.
>                  * One example is wifi aggregation (802.11 AMPDU)
>                  */
>                 limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 10);
>                 limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes);
> 
> 
> Then you'll see that most of your questions are already answered.
> 
> Feel free to try to improve the behavior, if it does not hurt critical workloads
> like TCP_RR, where we we send very small messages, millions times per second.

First of all, with regard to critical workloads, once this patch gets
into distros, *normal TCP streams* on every VM running on Amazon,
Rackspace, Linode, &c will get a 30% hit in performance *by default*.
Normal TCP streams on xennet *are* a critical workload, and deserve the
same kind of accommodation as TCP_RR (if not more).  The same goes for
virtio_net.

Secondly, according to Stefano's and Jonathan's tests,
raising tcp_limit_output_bytes completely fixes the problem for Xen.

Which means that max(2*skb->truesize, sk->sk_pacing_rate >>10) is
*already* larger for Xen; that calculation mentioned in the comment is
*already* doing the right thing.

As Jonathan pointed out, sysctl_tcp_limit_output_bytes is overriding an
automatic TSQ calculation which is actually choosing an effective value
for xennet.

It certainly makes sense for sysctl_tcp_limit_output_bytes to be an
actual maximum limit.  I went back and looked at the original patch
which introduced it (46d3ceabd), and it looks to me like it was designed
to be a rough, quick estimate of "two packets outstanding" (by choosing
the maximum size of the packet, 64k, and multiplying it by two).

Now that you have a better algorithm -- the size of 2 actual packets or
the amount transmitted in 1ms -- it seems like the default
sysctl_tcp_limit_output_bytes should be higher, and let the automatic
TSQ you have on the first line throttle things down when necessary.

 -George
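
(As a concrete illustration of the mismatch described above, here is a rough
userspace sketch; the 131072-byte default and the max/min calculation are taken
from the thread, while the ~9 Gbit/s pacing rate and ~2 ms transmit-completion
latency are assumptions chosen only to show the shape of the problem.)

/* Rough sketch only: why a 128 KB static cap can starve a TX path with a
 * long completion latency.  The rate and latency below are assumptions. */
#include <stdio.h>

int main(void)
{
	unsigned long long pacing_rate   = 9ULL * 1000 * 1000 * 1000 / 8; /* ~9 Gbit/s in bytes/s (assumed) */
	unsigned long long tx_latency_us = 2000;                          /* ~2 ms TX completion latency (assumed) */
	unsigned long long sysctl_cap    = 131072;                        /* default tcp_limit_output_bytes */

	/* Dynamic term from tcp_write_xmit(): roughly 1 ms worth of bytes. */
	unsigned long long dynamic = pacing_rate >> 10;

	/* Bytes that must be in flight to cover the completion latency. */
	unsigned long long needed = pacing_rate * tx_latency_us / 1000000;

	unsigned long long limit = dynamic < sysctl_cap ? dynamic : sysctl_cap;

	printf("~1 ms dynamic term : %llu bytes\n", dynamic); /* ~1.1 MB  */
	printf("needed to fill pipe: %llu bytes\n", needed);  /* ~2.25 MB */
	printf("effective limit    : %llu bytes\n", limit);   /* 131072   */
	return 0;
}

With those assumed numbers the dynamic term alone would already allow about
1 MB in flight, but the 131072-byte cap wins the min().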

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [Xen-devel] "tcp: refine TSO autosizing" causes performance regression on Xen
  2015-04-15 17:23               ` George Dunlap
@ 2015-04-15 17:29                 ` Eric Dumazet
  -1 siblings, 0 replies; 92+ messages in thread
From: Eric Dumazet @ 2015-04-15 17:29 UTC (permalink / raw)
  To: George Dunlap
  Cc: Jonathan Davies, xen-devel, Wei Liu, Ian Campbell,
	Stefano Stabellini, netdev, Linux Kernel Mailing List,
	Eric Dumazet, Paul Durrant, Christoffer Dall, Felipe Franciosi,
	linux-arm-kernel, David Vrabel

On Wed, 2015-04-15 at 18:23 +0100, George Dunlap wrote:
> On 04/15/2015 05:38 PM, Eric Dumazet wrote:
> > My thoughts that instead of these long talks you should guys read the
> > code :
> > 
> >                 /* TCP Small Queues :
> >                  * Control number of packets in qdisc/devices to two packets / or ~1 ms.
> >                  * This allows for :
> >                  *  - better RTT estimation and ACK scheduling
> >                  *  - faster recovery
> >                  *  - high rates
> >                  * Alas, some drivers / subsystems require a fair amount
> >                  * of queued bytes to ensure line rate.
> >                  * One example is wifi aggregation (802.11 AMPDU)
> >                  */
> >                 limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 10);
> >                 limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes);
> > 
> > 
> > Then you'll see that most of your questions are already answered.
> > 
> > Feel free to try to improve the behavior, if it does not hurt critical workloads
> > like TCP_RR, where we we send very small messages, millions times per second.
> 
> First of all, with regard to critical workloads, once this patch gets
> into distros, *normal TCP streams* on every VM running on Amazon,
> Rackspace, Linode, &c will get a 30% hit in performance *by default*.
> Normal TCP streams on xennet *are* a critical workload, and deserve the
> same kind of accommodation as TCP_RR (if not more).  The same goes for
> virtio_net.
> 
> Secondly, according to Stefano's and Jonathan's tests,
> tcp_limit_output_bytes completely fixes the problem for Xen.
> 
> Which means that max(2*skb->truesize, sk->sk_pacing_rate >>10) is
> *already* larger for Xen; that calculation mentioned in the comment is
> *already* doing the right thing.
> 
> As Jonathan pointed out, sysctl_tcp_limit_output_bytes is overriding an
> automatic TSQ calculation which is actually choosing an effective value
> for xennet.
> 
> It certainly makes sense for sysctl_tcp_limit_output_bytes to be an
> actual maximum limit.  I went back and looked at the original patch
> which introduced it (46d3ceabd), and it looks to me like it was designed
> to be a rough, quick estimate of "two packets outstanding" (by choosing
> the maximum size of the packet, 64k, and multiplying it by two).
> 
> Now that you have a better algorithm -- the size of 2 actual packets or
> the amount transmitted in 1ms -- it seems like the default
> sysctl_tcp_limit_output_bytes should be higher, and let the automatic
> TSQ you have on the first line throttle things down when necessary.


I asked you guys to make a test by increasing
sysctl_tcp_limit_output_bytes

You have no need to explain me the code I wrote, thank you.



^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [Xen-devel] "tcp: refine TSO autosizing" causes performance regression on Xen
  2015-04-15 17:29                 ` Eric Dumazet
  (?)
@ 2015-04-15 17:41                   ` George Dunlap
  -1 siblings, 0 replies; 92+ messages in thread
From: George Dunlap @ 2015-04-15 17:41 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Jonathan Davies, xen-devel, Wei Liu, Ian Campbell,
	Stefano Stabellini, netdev, Linux Kernel Mailing List,
	Eric Dumazet, Paul Durrant, Christoffer Dall, Felipe Franciosi,
	linux-arm-kernel, David Vrabel

[-- Attachment #1: Type: text/plain, Size: 4011 bytes --]

On 04/15/2015 06:29 PM, Eric Dumazet wrote:
> On Wed, 2015-04-15 at 18:23 +0100, George Dunlap wrote:
>> On 04/15/2015 05:38 PM, Eric Dumazet wrote:
>>> My thoughts that instead of these long talks you should guys read the
>>> code :
>>>
>>>                 /* TCP Small Queues :
>>>                  * Control number of packets in qdisc/devices to two packets / or ~1 ms.
>>>                  * This allows for :
>>>                  *  - better RTT estimation and ACK scheduling
>>>                  *  - faster recovery
>>>                  *  - high rates
>>>                  * Alas, some drivers / subsystems require a fair amount
>>>                  * of queued bytes to ensure line rate.
>>>                  * One example is wifi aggregation (802.11 AMPDU)
>>>                  */
>>>                 limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 10);
>>>                 limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes);
>>>
>>>
>>> Then you'll see that most of your questions are already answered.
>>>
>>> Feel free to try to improve the behavior, if it does not hurt critical workloads
>>> like TCP_RR, where we we send very small messages, millions times per second.
>>
>> First of all, with regard to critical workloads, once this patch gets
>> into distros, *normal TCP streams* on every VM running on Amazon,
>> Rackspace, Linode, &c will get a 30% hit in performance *by default*.
>> Normal TCP streams on xennet *are* a critical workload, and deserve the
>> same kind of accommodation as TCP_RR (if not more).  The same goes for
>> virtio_net.
>>
>> Secondly, according to Stefano's and Jonathan's tests,
>> tcp_limit_output_bytes completely fixes the problem for Xen.
>>
>> Which means that max(2*skb->truesize, sk->sk_pacing_rate >>10) is
>> *already* larger for Xen; that calculation mentioned in the comment is
>> *already* doing the right thing.
>>
>> As Jonathan pointed out, sysctl_tcp_limit_output_bytes is overriding an
>> automatic TSQ calculation which is actually choosing an effective value
>> for xennet.
>>
>> It certainly makes sense for sysctl_tcp_limit_output_bytes to be an
>> actual maximum limit.  I went back and looked at the original patch
>> which introduced it (46d3ceabd), and it looks to me like it was designed
>> to be a rough, quick estimate of "two packets outstanding" (by choosing
>> the maximum size of the packet, 64k, and multiplying it by two).
>>
>> Now that you have a better algorithm -- the size of 2 actual packets or
>> the amount transmitted in 1ms -- it seems like the default
>> sysctl_tcp_limit_output_bytes should be higher, and let the automatic
>> TSQ you have on the first line throttle things down when necessary.
> 
> 
> I asked you guys to make a test by increasing
> sysctl_tcp_limit_output_bytes

So you'd be OK with a patch like this?  (With perhaps a better changelog?)

 -George

---
TSQ: Raise default static TSQ limit

A new dynamic TSQ limit was introduced in c/s 605ad7f18 based on the
size of actual packets and the amount of data being transmitted.
Raise the default static limit to allow that new limit to actually
come into effect.

This fixes a regression where NICs with large transmit completion
times (such as xennet) had a 30% hit unless the user manually tweaked
the value in /proc.

Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 1db253e..8ad7cdf 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -50,8 +50,8 @@ int sysctl_tcp_retrans_collapse __read_mostly = 1;
  */
 int sysctl_tcp_workaround_signed_windows __read_mostly = 0;

-/* Default TSQ limit of two TSO segments */
-int sysctl_tcp_limit_output_bytes __read_mostly = 131072;
+/* Static TSQ limit.  A more dynamic limit is calculated in tcp_write_xmit. */
+int sysctl_tcp_limit_output_bytes __read_mostly = 1048576;

 /* This limits the percentage of the congestion window which we
  * will allow a single TSO frame to consume.  Building TSO frames


[-- Attachment #2: tsq-raise-default-static.diff --]
[-- Type: text/x-patch, Size: 1143 bytes --]

TSQ: Raise default static TSQ limit
    
A new dynamic TSQ limit was introduced in c/s 605ad7f18 based on the
size of actual packets and the amount of data being transmitted.
Raise the default static limit to allow that new limit to actually
come into effect.
 
This fixes a regression where NICs with large transmit completion
times (such as xennet) had a 30% hit unless the user manually tweaked
the value in /proc.
    
Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 1db253e..8ad7cdf 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -50,8 +50,8 @@ int sysctl_tcp_retrans_collapse __read_mostly = 1;
  */
 int sysctl_tcp_workaround_signed_windows __read_mostly = 0;
 
-/* Default TSQ limit of two TSO segments */
-int sysctl_tcp_limit_output_bytes __read_mostly = 131072;
+/* Static TSQ limit.  A more dynamic limit is calculated in tcp_write_xmit. */
+int sysctl_tcp_limit_output_bytes __read_mostly = 1048576;
 
 /* This limits the percentage of the congestion window which we
  * will allow a single TSO frame to consume.  Building TSO frames

^ permalink raw reply related	[flat|nested] 92+ messages in thread

* Re: [Xen-devel] "tcp: refine TSO autosizing" causes performance regression on Xen
  2015-04-15 17:23               ` George Dunlap
@ 2015-04-15 17:41                 ` Eric Dumazet
  -1 siblings, 0 replies; 92+ messages in thread
From: Eric Dumazet @ 2015-04-15 17:41 UTC (permalink / raw)
  To: George Dunlap
  Cc: Jonathan Davies, xen-devel, Wei Liu, Ian Campbell,
	Stefano Stabellini, netdev, Linux Kernel Mailing List,
	Eric Dumazet, Paul Durrant, Christoffer Dall, Felipe Franciosi,
	linux-arm-kernel, David Vrabel

On Wed, 2015-04-15 at 18:23 +0100, George Dunlap wrote:

> Which means that max(2*skb->truesize, sk->sk_pacing_rate >>10) is
> *already* larger for Xen; that calculation mentioned in the comment is
> *already* doing the right thing.

Sigh.

1ms of traffic at 40Gbit is 5 MBytes

The reason for the cap in /proc/sys/net/ipv4/tcp_limit_output_bytes is
to provide a limit of ~2 TSO packets, which is _also_ documented.

Without this limitation, 5 MBytes could translate to: fill the queue,
do not limit.

If a particular driver needs to extend the limit, fine: document it and
take action.
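
(For reference, a small sketch of the numbers in the argument above; the
40 Gbit/s rate, the ~1 ms scaling and the 131072-byte default come from the
thread, the program itself is only an illustration.)

/* Rough sketch only: what 1 ms of traffic at 40 Gbit/s means in packets,
 * versus the ~2-TSO-packet default cap. */
#include <stdio.h>

int main(void)
{
	unsigned long long pacing_rate = 40ULL * 1000 * 1000 * 1000 / 8; /* 40 Gbit/s -> 5e9 bytes/s */
	unsigned long long one_ms      = pacing_rate >> 10;              /* ~1 ms worth: ~4.9 MB */
	unsigned long long cap         = 131072;                         /* default tcp_limit_output_bytes */

	printf("~1 ms at 40 Gbit/s  : %llu bytes\n", one_ms);
	printf("as 64 KB TSO packets: %llu\n", one_ms / 65536);  /* ~74 */
	printf("capped limit        : %llu bytes (~2 TSO packets)\n",
	       one_ms < cap ? one_ms : cap);
	return 0;
}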




^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [Xen-devel] "tcp: refine TSO autosizing" causes performance regression on Xen
  2015-04-15 17:41                   ` George Dunlap
@ 2015-04-15 17:52                     ` Eric Dumazet
  -1 siblings, 0 replies; 92+ messages in thread
From: Eric Dumazet @ 2015-04-15 17:52 UTC (permalink / raw)
  To: George Dunlap
  Cc: Jonathan Davies, xen-devel, Wei Liu, Ian Campbell,
	Stefano Stabellini, netdev, Linux Kernel Mailing List,
	Eric Dumazet, Paul Durrant, Christoffer Dall, Felipe Franciosi,
	linux-arm-kernel, David Vrabel

On Wed, 2015-04-15 at 18:41 +0100, George Dunlap wrote:

> So you'd be OK with a patch like this?  (With perhaps a better changelog?)
> 
>  -George
> 
> ---
> TSQ: Raise default static TSQ limit
> 
> A new dynamic TSQ limit was introduced in c/s 605ad7f18 based on the
> size of actual packets and the amount of data being transmitted.
> Raise the default static limit to allow that new limit to actually
> come into effect.
> 
> This fixes a regression where NICs with large transmit completion
> times (such as xennet) had a 30% hit unless the user manually tweaked
> the value in /proc.
> 
> Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>
> 
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 1db253e..8ad7cdf 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -50,8 +50,8 @@ int sysctl_tcp_retrans_collapse __read_mostly = 1;
>   */
>  int sysctl_tcp_workaround_signed_windows __read_mostly = 0;
> 
> -/* Default TSQ limit of two TSO segments */
> -int sysctl_tcp_limit_output_bytes __read_mostly = 131072;
> +/* Static TSQ limit.  A more dynamic limit is calculated in
> tcp_write_xmit. */
> +int sysctl_tcp_limit_output_bytes __read_mostly = 1048576;
> 
>  /* This limits the percentage of the congestion window which we
>   * will allow a single TSO frame to consume.  Building TSO frames
> 

Have you tested this patch on a NIC without GSO/TSO ?

This would allow more than 500 packets for a single flow.

Hello bufferbloat.

So my answer to this patch is a no.
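
(A back-of-the-envelope check on that figure; the ~2 KB of skb truesize per
1500-byte frame is an assumption, not a number from the thread.)

/* Rough sketch only: packets a 1 MB static limit can keep queued for one
 * flow on a NIC without GSO/TSO. */
#include <stdio.h>

int main(void)
{
	unsigned int limit    = 1048576; /* proposed tcp_limit_output_bytes */
	unsigned int truesize = 2048;    /* assumed truesize of a 1500-byte frame */

	printf("packets in flight: %u\n", limit / truesize); /* 512 */
	return 0;
}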




^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [Xen-devel] "tcp: refine TSO autosizing" causes performance regression on Xen
  2015-04-15 17:52                     ` Eric Dumazet
@ 2015-04-15 17:55                       ` Rick Jones
  -1 siblings, 0 replies; 92+ messages in thread
From: Rick Jones @ 2015-04-15 17:55 UTC (permalink / raw)
  To: Eric Dumazet, George Dunlap
  Cc: Jonathan Davies, xen-devel, Wei Liu, Ian Campbell,
	Stefano Stabellini, netdev, Linux Kernel Mailing List,
	Eric Dumazet, Paul Durrant, Christoffer Dall, Felipe Franciosi,
	linux-arm-kernel, David Vrabel

>
> Have you tested this patch on a NIC without GSO/TSO ?
>
> This would allow more than 500 packets for a single flow.
>
> Hello bufferbloat.

Wouldn't the fq_codel qdisc on that interface address that problem?

rick


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [Xen-devel] "tcp: refine TSO autosizing" causes performance regression on Xen
  2015-04-15 17:41                 ` Eric Dumazet
  (?)
@ 2015-04-15 17:58                   ` Stefano Stabellini
  -1 siblings, 0 replies; 92+ messages in thread
From: Stefano Stabellini @ 2015-04-15 17:58 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: George Dunlap, Jonathan Davies, xen-devel, Wei Liu, Ian Campbell,
	Stefano Stabellini, netdev, Linux Kernel Mailing List,
	Eric Dumazet, Paul Durrant, Christoffer Dall, Felipe Franciosi,
	linux-arm-kernel, David Vrabel

On Wed, 15 Apr 2015, Eric Dumazet wrote:
> On Wed, 2015-04-15 at 18:23 +0100, George Dunlap wrote:
> 
> > Which means that max(2*skb->truesize, sk->sk_pacing_rate >>10) is
> > *already* larger for Xen; that calculation mentioned in the comment is
> > *already* doing the right thing.
> 
> Sigh.
> 
> 1ms of traffic at 40Gbit is 5 MBytes
> 
> The reason for the cap to /proc/sys/net/ipv4/tcp_limit_output_bytes is
> to provide the limitation of ~2 TSO packets, which _also_ is documented.
> 
> Without this limitation, 5 MBytes could translate to : Fill the queue,
> do not limit.
> 
> If a particular driver needs to extend the limit, fine, document it and
> take actions.

What actions do you have in mind exactly?  It would be great if you
could suggest how to move forward from here, besides documentation.

I don't think we can really expect every user that spawns a new VM in
the cloud to manually add echo blah >
/proc/sys/net/ipv4/tcp_limit_output_bytes to an init script.  I cannot
imagine that would work well.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [Xen-devel] "tcp: refine TSO autosizing" causes performance regression on Xen
  2015-04-15 17:52                     ` Eric Dumazet
  (?)
@ 2015-04-15 18:04                       ` George Dunlap
  -1 siblings, 0 replies; 92+ messages in thread
From: George Dunlap @ 2015-04-15 18:04 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Jonathan Davies, xen-devel, Wei Liu, Ian Campbell,
	Stefano Stabellini, netdev, Linux Kernel Mailing List,
	Eric Dumazet, Paul Durrant, Christoffer Dall, Felipe Franciosi,
	linux-arm-kernel, David Vrabel

On 04/15/2015 06:52 PM, Eric Dumazet wrote:
> On Wed, 2015-04-15 at 18:41 +0100, George Dunlap wrote:
> 
>> So you'd be OK with a patch like this?  (With perhaps a better changelog?)
>>
>>  -George
>>
>> ---
>> TSQ: Raise default static TSQ limit
>>
>> A new dynamic TSQ limit was introduced in c/s 605ad7f18 based on the
>> size of actual packets and the amount of data being transmitted.
>> Raise the default static limit to allow that new limit to actually
>> come into effect.
>>
>> This fixes a regression where NICs with large transmit completion
>> times (such as xennet) had a 30% hit unless the user manually tweaked
>> the value in /proc.
>>
>> Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>
>>
>> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
>> index 1db253e..8ad7cdf 100644
>> --- a/net/ipv4/tcp_output.c
>> +++ b/net/ipv4/tcp_output.c
>> @@ -50,8 +50,8 @@ int sysctl_tcp_retrans_collapse __read_mostly = 1;
>>   */
>>  int sysctl_tcp_workaround_signed_windows __read_mostly = 0;
>>
>> -/* Default TSQ limit of two TSO segments */
>> -int sysctl_tcp_limit_output_bytes __read_mostly = 131072;
>> +/* Static TSQ limit.  A more dynamic limit is calculated in
>> tcp_write_xmit. */
>> +int sysctl_tcp_limit_output_bytes __read_mostly = 1048576;
>>
>>  /* This limits the percentage of the congestion window which we
>>   * will allow a single TSO frame to consume.  Building TSO frames
>>
> 
> Have you tested this patch on a NIC without GSO/TSO ?
> 
> This would allow more than 500 packets for a single flow.
> 
> Hello bufferbloat.
> 
> So my answer to this patch is a no.

You said:

"I asked you guys to make a test by increasing
sysctl_tcp_limit_output_bytes  You have no need to explain me the code I
wrote, thank you."

Which implies to me that you think you've already pointed us to the
answer you want and we're just not getting it.

Maybe you should stop wasting all of our time and just tell us what
you're thinking.

 -George

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [Xen-devel] "tcp: refine TSO autosizing" causes performance regression on Xen
  2015-04-15 17:55                       ` Rick Jones
@ 2015-04-15 18:08                         ` Eric Dumazet
  -1 siblings, 0 replies; 92+ messages in thread
From: Eric Dumazet @ 2015-04-15 18:08 UTC (permalink / raw)
  To: Rick Jones
  Cc: George Dunlap, Jonathan Davies, xen-devel, Wei Liu, Ian Campbell,
	Stefano Stabellini, netdev, Linux Kernel Mailing List,
	Eric Dumazet, Paul Durrant, Christoffer Dall, Felipe Franciosi,
	linux-arm-kernel, David Vrabel

On Wed, 2015-04-15 at 10:55 -0700, Rick Jones wrote:
> >
> > Have you tested this patch on a NIC without GSO/TSO ?
> >
> > This would allow more than 500 packets for a single flow.
> >
> > Hello bufferbloat.
> 
> Woudln't the fq_codel qdisc on that interface address that problem?

Last time I checked, the default qdisc was pfifo_fast.

These guys do not want to change a sysctl, so how will pfifo_fast magically
become fq_codel?




^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [Xen-devel] "tcp: refine TSO autosizing" causes performance regression on Xen
  2015-04-15 17:58                   ` Stefano Stabellini
@ 2015-04-15 18:17                     ` Eric Dumazet
  -1 siblings, 0 replies; 92+ messages in thread
From: Eric Dumazet @ 2015-04-15 18:17 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: George Dunlap, Jonathan Davies, xen-devel, Wei Liu, Ian Campbell,
	netdev, Linux Kernel Mailing List, Eric Dumazet, Paul Durrant,
	Christoffer Dall, Felipe Franciosi, linux-arm-kernel,
	David Vrabel

On Wed, 2015-04-15 at 18:58 +0100, Stefano Stabellini wrote:
> On Wed, 15 Apr 2015, Eric Dumazet wrote:
> > On Wed, 2015-04-15 at 18:23 +0100, George Dunlap wrote:
> > 
> > > Which means that max(2*skb->truesize, sk->sk_pacing_rate >>10) is
> > > *already* larger for Xen; that calculation mentioned in the comment is
> > > *already* doing the right thing.
> > 
> > Sigh.
> > 
> > 1ms of traffic at 40Gbit is 5 MBytes
> > 
> > The reason for the cap to /proc/sys/net/ipv4/tcp_limit_output_bytes is
> > to provide the limitation of ~2 TSO packets, which _also_ is documented.
> > 
> > Without this limitation, 5 MBytes could translate to : Fill the queue,
> > do not limit.
> > 
> > If a particular driver needs to extend the limit, fine, document it and
> > take actions.
> 
> What actions do you have in mind exactly?  It would be great if you
> could suggest how to move forward from here, beside documentation.
> 
> I don't think we can really expect every user that spawns a new VM in
> the cloud to manually echo blah >
> /proc/sys/net/ipv4/tcp_limit_output_bytes to an init script.  I cannot
> imagine that would work well.

I already pointed a discussion on the same topic for wireless adapters.

Some adapters have a ~3 ms TX completion delay, so the 1ms assumption in
the TCP stack is limiting the max throughput.

All I hear here are unreasonable requests, marketing driven.

If a global sysctl is not good enough, make it a per device value.

We already have netdev->gso_max_size and netdev->gso_max_segs
which are cached into sk->sk_gso_max_size & sk->sk_gso_max_segs

What about you guys providing a new 
netdev->I_need_to_have_big_buffers_to_cope_with_my_latencies.

Do not expect me to fight bufferbloat alone. Be part of the challenge,
instead of trying to get back to proven bad solutions.
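
For concreteness, here is a minimal user-space sketch of the limit being
discussed: roughly 1ms worth of bytes at the pacing rate, floored at two
skbs and capped by tcp_limit_output_bytes. The arithmetic follows the
max()/min() quoted above; the constant and helper names are illustrative,
not the exact kernel source.

/* Standalone model of the per-flow TSQ limit discussed above.
 * Only the max()/min() arithmetic is mirrored; names are illustrative.
 */
#include <stdio.h>

#define TCP_LIMIT_OUTPUT_BYTES	(2 * 65536)	/* default sysctl: ~2 full-size TSO packets */

static unsigned long long tsq_limit(unsigned int skb_truesize,
				    unsigned long long pacing_rate_Bps)
{
	/* ~1 ms worth of bytes at the current pacing rate (rate >> 10) */
	unsigned long long limit = pacing_rate_Bps >> 10;

	if (limit < 2ULL * skb_truesize)	/* never less than two in-flight skbs */
		limit = 2ULL * skb_truesize;
	if (limit > TCP_LIMIT_OUTPUT_BYTES)	/* ...but capped by the sysctl */
		limit = TCP_LIMIT_OUTPUT_BYTES;
	return limit;
}

int main(void)
{
	/* 40 Gbit/s is ~5e9 bytes/s, so 1 ms is ~5 MB, but the sysctl caps
	 * the flow at 128 KB of not-yet-completed data (~2 64 KB TSO packets).
	 */
	printf("40Gbit/s : %llu bytes\n", tsq_limit(66000, 5000000000ULL));

	/* 100 Mbit/s is ~12.5e6 bytes/s: here the ~1 ms term (~12 KB) is what
	 * binds, subject to the floor of two skb truesizes.
	 */
	printf("100Mbit/s: %llu bytes\n", tsq_limit(2304, 12500000ULL));
	return 0;
}

Running this shows why a single flow on a fast path is still held to
~128 KB of data queued below the stack: at high rates it is the sysctl
cap, not the 1ms term, that binds.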



^ permalink raw reply	[flat|nested] 92+ messages in thread

* [Xen-devel] "tcp: refine TSO autosizing" causes performance regression on Xen
@ 2015-04-15 18:17                     ` Eric Dumazet
  0 siblings, 0 replies; 92+ messages in thread
From: Eric Dumazet @ 2015-04-15 18:17 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, 2015-04-15 at 18:58 +0100, Stefano Stabellini wrote:
> On Wed, 15 Apr 2015, Eric Dumazet wrote:
> > On Wed, 2015-04-15 at 18:23 +0100, George Dunlap wrote:
> > 
> > > Which means that max(2*skb->truesize, sk->sk_pacing_rate >>10) is
> > > *already* larger for Xen; that calculation mentioned in the comment is
> > > *already* doing the right thing.
> > 
> > Sigh.
> > 
> > 1ms of traffic at 40Gbit is 5 MBytes
> > 
> > The reason for the cap to /proc/sys/net/ipv4/tcp_limit_output_bytes is
> > to provide the limitation of ~2 TSO packets, which _also_ is documented.
> > 
> > Without this limitation, 5 MBytes could translate to : Fill the queue,
> > do not limit.
> > 
> > If a particular driver needs to extend the limit, fine, document it and
> > take actions.
> 
> What actions do you have in mind exactly?  It would be great if you
> could suggest how to move forward from here, beside documentation.
> 
> I don't think we can really expect every user that spawns a new VM in
> the cloud to manually echo blah >
> /proc/sys/net/ipv4/tcp_limit_output_bytes to an init script.  I cannot
> imagine that would work well.

I already pointed a discussion on the same topic for wireless adapters.

Some adapters have a ~3 ms TX completion delay, so the 1ms assumption in
the TCP stack is limiting the max throughput.

All I hear here are unreasonable requests, marketing driven.

If a global sysctl is not good enough, make it a per device value.

We already have netdev->gso_max_size and netdev->gso_max_segs
which are cached into sk->sk_gso_max_size & sk->sk_gso_max_segs

What about you guys providing a new 
netdev->I_need_to_have_big_buffers_to_cope_with_my_latencies.

Do not expect me to fight bufferbloat alone. Be part of the challenge,
instead of trying to get back to proven bad solutions.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [Xen-devel] "tcp: refine TSO autosizing" causes performance regression on Xen
  2015-04-15 18:08                         ` Eric Dumazet
@ 2015-04-15 18:19                           ` Rick Jones
  -1 siblings, 0 replies; 92+ messages in thread
From: Rick Jones @ 2015-04-15 18:19 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: George Dunlap, Jonathan Davies, xen-devel, Wei Liu, Ian Campbell,
	Stefano Stabellini, netdev, Linux Kernel Mailing List,
	Eric Dumazet, Paul Durrant, Christoffer Dall, Felipe Franciosi,
	linux-arm-kernel, David Vrabel

On 04/15/2015 11:08 AM, Eric Dumazet wrote:
> On Wed, 2015-04-15 at 10:55 -0700, Rick Jones wrote:
>>>
>>> Have you tested this patch on a NIC without GSO/TSO ?
>>>
>>> This would allow more than 500 packets for a single flow.
>>>
>>> Hello bufferbloat.
>>
>> Wouldn't the fq_codel qdisc on that interface address that problem?
>
> Last time I checked, the default qdisc was pfifo_fast.

Bummer.

> These guys do not want to change a sysctl, so how will pfifo_fast magically
> become fq_codel?

Well, I'm not sure that it is George and Jonathan themselves who don't 
want to change a sysctl, but the customers who would have to tweak that 
in their VMs?

rick

^ permalink raw reply	[flat|nested] 92+ messages in thread

* [Xen-devel] "tcp: refine TSO autosizing" causes performance regression on Xen
@ 2015-04-15 18:19                           ` Rick Jones
  0 siblings, 0 replies; 92+ messages in thread
From: Rick Jones @ 2015-04-15 18:19 UTC (permalink / raw)
  To: linux-arm-kernel

On 04/15/2015 11:08 AM, Eric Dumazet wrote:
> On Wed, 2015-04-15 at 10:55 -0700, Rick Jones wrote:
>>>
>>> Have you tested this patch on a NIC without GSO/TSO ?
>>>
>>> This would allow more than 500 packets for a single flow.
>>>
>>> Hello bufferbloat.
>>
>> Wouldn't the fq_codel qdisc on that interface address that problem?
>
> Last time I checked, the default qdisc was pfifo_fast.

Bummer.

> These guys do not want to change a sysctl, so how will pfifo_fast magically
> become fq_codel?

Well, I'm not sure that it is George and Jonathan themselves who don't 
want to change a sysctl, but the customers who would have to tweak that 
in their VMs?

rick

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [Xen-devel] "tcp: refine TSO autosizing" causes performance regression on Xen
  2015-04-15 18:04                       ` George Dunlap
@ 2015-04-15 18:19                         ` Eric Dumazet
  -1 siblings, 0 replies; 92+ messages in thread
From: Eric Dumazet @ 2015-04-15 18:19 UTC (permalink / raw)
  To: George Dunlap
  Cc: Jonathan Davies, xen-devel, Wei Liu, Ian Campbell,
	Stefano Stabellini, netdev, Linux Kernel Mailing List,
	Eric Dumazet, Paul Durrant, Christoffer Dall, Felipe Franciosi,
	linux-arm-kernel, David Vrabel

On Wed, 2015-04-15 at 19:04 +0100, George Dunlap wrote:

> Maybe you should stop wasting all of our time and just tell us what
> you're thinking.

I think you are making me waste my time.

I already gave all the hints in prior discussions.

Rome was not built in one day.



^ permalink raw reply	[flat|nested] 92+ messages in thread

* [Xen-devel] "tcp: refine TSO autosizing" causes performance regression on Xen
@ 2015-04-15 18:19                         ` Eric Dumazet
  0 siblings, 0 replies; 92+ messages in thread
From: Eric Dumazet @ 2015-04-15 18:19 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, 2015-04-15 at 19:04 +0100, George Dunlap wrote:

> Maybe you should stop wasting all of our time and just tell us what
> you're thinking.

I think you are making me waste my time.

I already gave all the hints in prior discussions.

Rome was not built in one day.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [Xen-devel] "tcp: refine TSO autosizing" causes performance regression on Xen
  2015-04-15 18:19                           ` Rick Jones
  (?)
@ 2015-04-15 18:32                             ` Eric Dumazet
  -1 siblings, 0 replies; 92+ messages in thread
From: Eric Dumazet @ 2015-04-15 18:32 UTC (permalink / raw)
  To: Rick Jones
  Cc: George Dunlap, Jonathan Davies, xen-devel, Wei Liu, Ian Campbell,
	Stefano Stabellini, netdev, Linux Kernel Mailing List,
	Eric Dumazet, Paul Durrant, Christoffer Dall, Felipe Franciosi,
	linux-arm-kernel, David Vrabel

On Wed, 2015-04-15 at 11:19 -0700, Rick Jones wrote:

> Well, I'm not sure that it is George and Jonathan themselves who don't 
> want to change a sysctl, but the customers who would have to tweak that 
> in their VMs?

Keep in mind some VM users install custom qdisc, or even custom TCP
sysctls.



^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: "tcp: refine TSO autosizing" causes performance regression on Xen
@ 2015-04-15 18:32                             ` Eric Dumazet
  0 siblings, 0 replies; 92+ messages in thread
From: Eric Dumazet @ 2015-04-15 18:32 UTC (permalink / raw)
  To: Rick Jones
  Cc: Jonathan Davies, xen-devel, Wei Liu, Ian Campbell,
	Stefano Stabellini, George Dunlap, netdev,
	Linux Kernel Mailing List, Eric Dumazet, Paul Durrant,
	linux-arm-kernel, Felipe Franciosi, Christoffer Dall,
	David Vrabel

On Wed, 2015-04-15 at 11:19 -0700, Rick Jones wrote:

> Well, I'm not sure that it is George and Jonathan themselves who don't 
> want to change a sysctl, but the customers who would have to tweak that 
> in their VMs?

Keep in mind some VM users install custom qdisc, or even custom TCP
sysctls.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* [Xen-devel] "tcp: refine TSO autosizing" causes performance regression on Xen
@ 2015-04-15 18:32                             ` Eric Dumazet
  0 siblings, 0 replies; 92+ messages in thread
From: Eric Dumazet @ 2015-04-15 18:32 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, 2015-04-15 at 11:19 -0700, Rick Jones wrote:

> Well, I'm not sure that it is George and Jonathan themselves who don't 
> want to change a sysctl, but the customers who would have to tweak that 
> in their VMs?

Keep in mind some VM users install custom qdisc, or even custom TCP
sysctls.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [Xen-devel] "tcp: refine TSO autosizing" causes performance regression on Xen
  2015-04-15 18:32                             ` Eric Dumazet
  (?)
@ 2015-04-15 20:08                               ` Rick Jones
  -1 siblings, 0 replies; 92+ messages in thread
From: Rick Jones @ 2015-04-15 20:08 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: George Dunlap, Jonathan Davies, xen-devel, Wei Liu, Ian Campbell,
	Stefano Stabellini, netdev, Linux Kernel Mailing List,
	Eric Dumazet, Paul Durrant, Christoffer Dall, Felipe Franciosi,
	linux-arm-kernel, David Vrabel

On 04/15/2015 11:32 AM, Eric Dumazet wrote:
> On Wed, 2015-04-15 at 11:19 -0700, Rick Jones wrote:
>
>> Well, I'm not sure that it is George and Jonathan themselves who don't
>> want to change a sysctl, but the customers who would have to tweak that
>> in their VMs?
>
> Keep in mind some VM users install custom qdisc, or even custom TCP
> sysctls.

That could very well be, though I confess I've not seen that happening 
in my little corner of the cloud.  They tend to want to launch the VM 
and go.  Some of the more advanced/sophisticated ones might tweak a few 
things but my (admittedly limited) experience has been they are few in 
number.  They just expect it to work "out of the box" (to the extent one 
can use that phrase still).

It's kind of ironic - go back to the (early) 1990s when NICs generated a 
completion interrupt for every individual tx completion (and incoming 
packet) and all anyone wanted to do was coalesce/avoid interrupts.  I 
guess that has gone rather far.  And today to fight bufferbloat TCP gets 
tweaked to favor quick tx completions.  Call it cycles, or pendulums or 
whatever I guess.

I wonder just how consistent tx completion timings are for a VM so a 
virtio_net or whatnot in the VM can pick a per-device setting to 
advertise to TCP?  Hopefully, full NIC emulation is no longer a thing 
and VMs "universally" use a virtual NIC interface. At least in my little 
corner of the cloud, emulated NICs are gone, and good riddance.

rick

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [Xen-devel] "tcp: refine TSO autosizing" causes performance regression on Xen
@ 2015-04-15 20:08                               ` Rick Jones
  0 siblings, 0 replies; 92+ messages in thread
From: Rick Jones @ 2015-04-15 20:08 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: George Dunlap, Jonathan Davies, xen-devel, Wei Liu, Ian Campbell,
	Stefano Stabellini, netdev, Linux Kernel Mailing List,
	Eric Dumazet, Paul Durrant, Christoffer Dall, Felipe Franciosi,
	linux-arm-kernel, David Vrabel

On 04/15/2015 11:32 AM, Eric Dumazet wrote:
> On Wed, 2015-04-15 at 11:19 -0700, Rick Jones wrote:
>
>> Well, I'm not sure that it is George and Jonathan themselves who don't
>> want to change a sysctl, but the customers who would have to tweak that
>> in their VMs?
>
> Keep in mind some VM users install custom qdisc, or even custom TCP
> sysctls.

That could very well be, though I confess I've not seen that happening 
in my little corner of the cloud.  They tend to want to launch the VM 
and go.  Some of the more advanced/sophisticated ones might tweak a few 
things but my (admittedly limited) experience has been they are few in 
number.  They just expect it to work "out of the box" (to the extent one 
can use that phrase still).

It's kind of ironic - go back to the (early) 1990s when NICs generated a 
completion interrupt for every individual tx completion (and incoming 
packet) and all anyone wanted to do was coalesce/avoid interrupts.  I 
guess that has gone rather far.  And today to fight bufferbloat TCP gets 
tweaked to favor quick tx completions.  Call it cycles, or pendulums or 
whatever I guess.

I wonder just how consistent tx completion timings are for a VM so a 
virtio_net or whatnot in the VM can pick a per-device setting to 
advertise to TCP?  Hopefully, full NIC emulation is no longer a thing 
and VMs "universally" use a virtual NIC interface. At least in my little 
corner of the cloud, emulated NICs are gone, and good riddance.

rick

^ permalink raw reply	[flat|nested] 92+ messages in thread

* [Xen-devel] "tcp: refine TSO autosizing" causes performance regression on Xen
@ 2015-04-15 20:08                               ` Rick Jones
  0 siblings, 0 replies; 92+ messages in thread
From: Rick Jones @ 2015-04-15 20:08 UTC (permalink / raw)
  To: linux-arm-kernel

On 04/15/2015 11:32 AM, Eric Dumazet wrote:
> On Wed, 2015-04-15 at 11:19 -0700, Rick Jones wrote:
>
>> Well, I'm not sure that it is George and Jonathan themselves who don't
>> want to change a sysctl, but the customers who would have to tweak that
>> in their VMs?
>
> Keep in mind some VM users install custom qdisc, or even custom TCP
> sysctls.

That could very well be, though I confess I've not seen that happening 
in my little corner of the cloud.  They tend to want to launch the VM 
and go.  Some of the more advanced/sophisticated ones might tweak a few 
things but my (admittedly limited) experience has been they are few in 
number.  They just expect it to work "out of the box" (to the extent one 
can use that phrase still).

It's kind of ironic - go back to the (early) 1990s when NICs generated a 
completion interrupt for every individual tx completion (and incoming 
packet) and all anyone wanted to do was coalesce/avoid interrupts.  I 
guess that has gone rather far.  And today to fight bufferbloat TCP gets 
tweaked to favor quick tx completions.  Call it cycles, or pendulums or 
whatever I guess.

I wonder just how consistent tx completion timings are for a VM so a 
virtio_net or whatnot in the VM can pick a per-device setting to 
advertise to TCP?  Hopefully, full NIC emulation is no longer a thing 
and VMs "universally" use a virtual NIC interface. At least in my little 
corner of the cloud, emulated NICs are gone, and good riddance.

rick

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [Xen-devel] "tcp: refine TSO autosizing" causes performance regression on Xen
  2015-04-15 18:17                     ` Eric Dumazet
@ 2015-04-16  4:20                       ` Herbert Xu
  -1 siblings, 0 replies; 92+ messages in thread
From: Herbert Xu @ 2015-04-16  4:20 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: stefano.stabellini, george.dunlap, Jonathan.Davies, xen-devel,
	wei.liu2, Ian.Campbell, netdev, linux-kernel, edumazet,
	paul.durrant, christoffer.dall, felipe.franciosi,
	linux-arm-kernel, david.vrabel

Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> We already have netdev->gso_max_size and netdev->gso_max_segs
> which are cached into sk->sk_gso_max_size & sk->sk_gso_max_segs

It is quite dangerous to attempt tricks like this because a
tc redirection or netfilter nat could change the destination
device rendering such hints incorrect.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 92+ messages in thread

* [Xen-devel] "tcp: refine TSO autosizing" causes performance regression on Xen
@ 2015-04-16  4:20                       ` Herbert Xu
  0 siblings, 0 replies; 92+ messages in thread
From: Herbert Xu @ 2015-04-16  4:20 UTC (permalink / raw)
  To: linux-arm-kernel

Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> We already have netdev->gso_max_size and netdev->gso_max_segs
> which are cached into sk->sk_gso_max_size & sk->sk_gso_max_segs

It is quite dangerous to attempt tricks like this because a
tc redirection or netfilter nat could change the destination
device rendering such hints incorrect.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [Xen-devel] "tcp: refine TSO autosizing" causes performance regression on Xen
  2015-04-16  4:20                       ` Herbert Xu
@ 2015-04-16  4:30                         ` Eric Dumazet
  -1 siblings, 0 replies; 92+ messages in thread
From: Eric Dumazet @ 2015-04-16  4:30 UTC (permalink / raw)
  To: Herbert Xu
  Cc: stefano.stabellini, george.dunlap, Jonathan.Davies, xen-devel,
	wei.liu2, Ian.Campbell, netdev, linux-kernel, edumazet,
	paul.durrant, christoffer.dall, felipe.franciosi,
	linux-arm-kernel, david.vrabel

On Thu, 2015-04-16 at 12:20 +0800, Herbert Xu wrote:
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >
> > We already have netdev->gso_max_size and netdev->gso_max_segs
> > which are cached into sk->sk_gso_max_size & sk->sk_gso_max_segs
> 
> It is quite dangerous to attempt tricks like this because a
> tc redirection or netfilter nat could change the destination
> device rendering such hints incorrect.

Right, but we are talking about performance hints on a quite basic VM setup.

Here the guest would use xen and this hint would apply.





^ permalink raw reply	[flat|nested] 92+ messages in thread

* [Xen-devel] "tcp: refine TSO autosizing" causes performance regression on Xen
@ 2015-04-16  4:30                         ` Eric Dumazet
  0 siblings, 0 replies; 92+ messages in thread
From: Eric Dumazet @ 2015-04-16  4:30 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, 2015-04-16 at 12:20 +0800, Herbert Xu wrote:
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >
> > We already have netdev->gso_max_size and netdev->gso_max_segs
> > which are cached into sk->sk_gso_max_size & sk->sk_gso_max_segs
> 
> It is quite dangerous to attempt tricks like this because a
> tc redirection or netfilter nat could change the destination
> device rendering such hints incorrect.

Right, but we are talking about performance hints on a quite basic VM setup.

Here the guest would use xen and this hint would apply.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [Xen-devel] "tcp: refine TSO autosizing" causes performance regression on Xen
  2015-04-15 18:19                         ` Eric Dumazet
  (?)
@ 2015-04-16  8:56                           ` George Dunlap
  -1 siblings, 0 replies; 92+ messages in thread
From: George Dunlap @ 2015-04-16  8:56 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Jonathan Davies, xen-devel, Wei Liu, Ian Campbell,
	Stefano Stabellini, netdev, Linux Kernel Mailing List,
	Eric Dumazet, Paul Durrant, Christoffer Dall, Felipe Franciosi,
	linux-arm-kernel, David Vrabel

On 04/15/2015 07:19 PM, Eric Dumazet wrote:
> On Wed, 2015-04-15 at 19:04 +0100, George Dunlap wrote:
> 
>> Maybe you should stop wasting all of our time and just tell us what
>> you're thinking.
> 
> I think you are making me waste my time.
> 
> I already gave all the hints in prior discussions.

Right, and I suggested these two options:

"Obviously one solution would be to allow the drivers themselves to set
the tcp_limit_output_bytes, but that seems like a maintenance
nightmare.

"Another simple solution would be to allow drivers to indicate whether
they have a high transmit latency, and have the kernel use a higher
value by default when that's the case." [1]

Neither of which you commented on.  Instead you pointed me to a comment
that only partially described what the limitations were. (I.e., it
described the "two packets or 1ms", but not how they related, nor how
they related to the "max of 2 64k packets outstanding" of the default
tcp_limit_output_bytes setting.)

 -George


[1]
http://marc.info/?i=<CAFLBxZYt7-v29ysm=f+5QMOw64_QhESjzj98udba+1cS-PfObA@mail.gmail.com>

^ permalink raw reply	[flat|nested] 92+ messages in thread

* [Xen-devel] "tcp: refine TSO autosizing" causes performance regression on Xen
@ 2015-04-16  8:56                           ` George Dunlap
  0 siblings, 0 replies; 92+ messages in thread
From: George Dunlap @ 2015-04-16  8:56 UTC (permalink / raw)
  To: linux-arm-kernel

On 04/15/2015 07:19 PM, Eric Dumazet wrote:
> On Wed, 2015-04-15 at 19:04 +0100, George Dunlap wrote:
> 
>> Maybe you should stop wasting all of our time and just tell us what
>> you're thinking.
> 
> I think you are making me waste my time.
> 
> I already gave all the hints in prior discussions.

Right, and I suggested these two options:

"Obviously one solution would be to allow the drivers themselves to set
the tcp_limit_output_bytes, but that seems like a maintenance
nightmare.

"Another simple solution would be to allow drivers to indicate whether
they have a high transmit latency, and have the kernel use a higher
value by default when that's the case." [1]

Neither of which you commented on.  Instead you pointed me to a comment
that only partially described what the limitations were. (I.e., it
described the "two packets or 1ms", but not how they related, nor how
they related to the "max of 2 64k packets outstanding" of the default
tcp_limit_output_bytes setting.)

 -George


[1]
http://marc.info/?i=<CAFLBxZYt7-v29ysm=f+5QMOw64_QhESjzj98udba+1cS-PfObA@mail.gmail.com>

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [Xen-devel] "tcp: refine TSO autosizing" causes performance regression on Xen
@ 2015-04-16  8:56                           ` George Dunlap
  0 siblings, 0 replies; 92+ messages in thread
From: George Dunlap @ 2015-04-16  8:56 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Jonathan Davies, xen-devel, Wei Liu, Ian Campbell,
	Stefano Stabellini, netdev, Linux Kernel Mailing List,
	Eric Dumazet, Paul Durrant, Christoffer Dall, Felipe Franciosi,
	linux-arm-kernel, David Vrabel

On 04/15/2015 07:19 PM, Eric Dumazet wrote:
> On Wed, 2015-04-15 at 19:04 +0100, George Dunlap wrote:
> 
>> Maybe you should stop wasting all of our time and just tell us what
>> you're thinking.
> 
> I think you are making me waste my time.
> 
> I already gave all the hints in prior discussions.

Right, and I suggested these two options:

"Obviously one solution would be to allow the drivers themselves to set
the tcp_limit_output_bytes, but that seems like a maintenance
nightmare.

"Another simple solution would be to allow drivers to indicate whether
they have a high transmit latency, and have the kernel use a higher
value by default when that's the case." [1]

Neither of which you commented on.  Instead you pointed me to a comment
that only partially described what the limitations were. (I.e., it
described the "two packets or 1ms", but not how they related, nor how
they related to the "max of 2 64k packets outstanding" of the default
tcp_limit_output_bytes setting.)

 -George


[1]
http://marc.info/?i=<CAFLBxZYt7-v29ysm=f+5QMOw64_QhESjzj98udba+1cS-PfObA@mail.gmail.com>

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [Xen-devel] "tcp: refine TSO autosizing" causes performance regression on Xen
  2015-04-16  8:56                           ` George Dunlap
  (?)
@ 2015-04-16  9:20                             ` Daniel Borkmann
  -1 siblings, 0 replies; 92+ messages in thread
From: Daniel Borkmann @ 2015-04-16  9:20 UTC (permalink / raw)
  To: George Dunlap, Eric Dumazet
  Cc: Jonathan Davies, xen-devel, Wei Liu, Ian Campbell,
	Stefano Stabellini, netdev, Linux Kernel Mailing List,
	Eric Dumazet, Paul Durrant, Christoffer Dall, Felipe Franciosi,
	linux-arm-kernel, David Vrabel

On 04/16/2015 10:56 AM, George Dunlap wrote:
> On 04/15/2015 07:19 PM, Eric Dumazet wrote:
>> On Wed, 2015-04-15 at 19:04 +0100, George Dunlap wrote:
>>
>>> Maybe you should stop wasting all of our time and just tell us what
>>> you're thinking.
>>
>> I think you are making me waste my time.
>>
>> I already gave all the hints in prior discussions.
>
> Right, and I suggested these two options:

So mid-term, it would be much more beneficial if you attempt to fix the
underlying driver issues that actually cause high tx completion delays,
instead of reintroducing bufferbloat. So that we all can move forward
and not backwards in time.

> "Obviously one solution would be to allow the drivers themselves to set
> the tcp_limit_output_bytes, but that seems like a maintenance
> nightmare.

Possible, but very hacky, as you penalize globally.

> "Another simple solution would be to allow drivers to indicate whether
> they have a high transmit latency, and have the kernel use a higher
> value by default when that's the case." [1]
>
> Neither of which you commented on.  Instead you pointed me to a comment

What Eric described to you was that you introduce a new netdev member
like netdev->needs_bufferbloat, set that indication from the driver side,
and cache that in the socket that binds to it, so you can adjust the
test in tcp_xmit_size_goal(). It should merely be seen as a hint/indication
for such devices. Hmm?

> that only partially described what the limitations were. (I.e., it
> described the "two packets or 1ms", but not how they related, nor how
> they related to the "max of 2 64k packets outstanding" of the default
> tcp_limit_output_bytes setting.)
>
>   -George
>
> [1]
> http://marc.info/?i=<CAFLBxZYt7-v29ysm=f+5QMOw64_QhESjzj98udba+1cS-PfObA@mail.gmail.com>
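
To make that concrete, here is a small user-space model of the hint being
described: the driver advertises a worst-case tx completion delay, the
socket caches it when it binds to the device (much as gso_max_size is
cached into sk_gso_max_size), and the per-flow budget scales with that
delay instead of a fixed ~1ms. Every field and function name below is
hypothetical; no such member exists in mainline.

/* Toy model of a per-device "tx completion latency" hint; all names here
 * (tx_completion_usecs, model_*) are made up for illustration only.
 */
#include <stdio.h>

struct model_netdev {
	unsigned int tx_completion_usecs;	/* hint set by the driver; 0 = default */
};

struct model_sock {
	unsigned long long pacing_rate;		/* bytes per second */
	unsigned int tx_completion_usecs;	/* cached from the bound device */
};

/* analogous to caching netdev->gso_max_size into sk->sk_gso_max_size */
static void model_sk_setup_caps(struct model_sock *sk,
				const struct model_netdev *dev)
{
	sk->tx_completion_usecs =
		dev->tx_completion_usecs ? dev->tx_completion_usecs : 1000;
}

/* budget: one device-latency worth of bytes instead of a fixed ~1 ms */
static unsigned long long model_tsq_budget(const struct model_sock *sk)
{
	return sk->pacing_rate * sk->tx_completion_usecs / 1000000;
}

int main(void)
{
	struct model_netdev plain  = { .tx_completion_usecs = 0 };	/* ordinary NIC */
	struct model_netdev xennet = { .tx_completion_usecs = 3000 };	/* slow completions */
	struct model_sock sk = { .pacing_rate = 1250000000ULL };	/* ~10 Gbit/s */

	model_sk_setup_caps(&sk, &plain);
	printf("default device     : %llu bytes\n", model_tsq_budget(&sk));
	model_sk_setup_caps(&sk, &xennet);
	printf("high-latency device: %llu bytes\n", model_tsq_budget(&sk));
	return 0;
}

As Herbert points out earlier in the thread, tc redirection or netfilter
nat can reroute a flow to a different device after such a hint has been
cached, so it could only ever be a best-effort indication.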


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [Xen-devel] "tcp: refine TSO autosizing" causes performance regression on Xen
@ 2015-04-16  9:20                             ` Daniel Borkmann
  0 siblings, 0 replies; 92+ messages in thread
From: Daniel Borkmann @ 2015-04-16  9:20 UTC (permalink / raw)
  To: George Dunlap, Eric Dumazet
  Cc: Jonathan Davies, xen-devel, Wei Liu, Ian Campbell,
	Stefano Stabellini, netdev, Linux Kernel Mailing List,
	Eric Dumazet, Paul Durrant, Christoffer Dall, Felipe Franciosi,
	linux-arm-kernel, David Vrabel

On 04/16/2015 10:56 AM, George Dunlap wrote:
> On 04/15/2015 07:19 PM, Eric Dumazet wrote:
>> On Wed, 2015-04-15 at 19:04 +0100, George Dunlap wrote:
>>
>>> Maybe you should stop wasting all of our time and just tell us what
>>> you're thinking.
>>
>> I think you are making me waste my time.
>>
>> I already gave all the hints in prior discussions.
>
> Right, and I suggested these two options:

So mid-term, it would be much more beneficial if you attempt to fix the
underlying driver issues that actually cause high tx completion delays,
instead of reintroducing bufferbloat. So that we all can move forward
and not backwards in time.

> "Obviously one solution would be to allow the drivers themselves to set
> the tcp_limit_output_bytes, but that seems like a maintenance
> nightmare.

Possible, but very hacky, as you penalize globally.

> "Another simple solution would be to allow drivers to indicate whether
> they have a high transmit latency, and have the kernel use a higher
> value by default when that's the case." [1]
>
> Neither of which you commented on.  Instead you pointed me to a comment

What Eric described to you was that you introduce a new netdev member
like netdev->needs_bufferbloat, set that indication from the driver side,
and cache that in the socket that binds to it, so you can adjust the
test in tcp_xmit_size_goal(). It should merely be seen as a hint/indication
for such devices. Hmm?

> that only partially described what the limitations were. (I.e., it
> described the "two packets or 1ms", but not how they related, nor how
> they related to the "max of 2 64k packets outstanding" of the default
> tcp_limit_output_bytes setting.)
>
>   -George
>
> [1]
> http://marc.info/?i=<CAFLBxZYt7-v29ysm=f+5QMOw64_QhESjzj98udba+1cS-PfObA@mail.gmail.com>

^ permalink raw reply	[flat|nested] 92+ messages in thread

* [Xen-devel] "tcp: refine TSO autosizing" causes performance regression on Xen
@ 2015-04-16  9:20                             ` Daniel Borkmann
  0 siblings, 0 replies; 92+ messages in thread
From: Daniel Borkmann @ 2015-04-16  9:20 UTC (permalink / raw)
  To: linux-arm-kernel

On 04/16/2015 10:56 AM, George Dunlap wrote:
> On 04/15/2015 07:19 PM, Eric Dumazet wrote:
>> On Wed, 2015-04-15 at 19:04 +0100, George Dunlap wrote:
>>
>>> Maybe you should stop wasting all of our time and just tell us what
>>> you're thinking.
>>
>> I think you are making me waste my time.
>>
>> I already gave all the hints in prior discussions.
>
> Right, and I suggested these two options:

So mid-term, it would be much more beneficial if you attempt to fix the
underlying driver issues that actually cause high tx completion delays,
instead of reintroducing bufferbloat. So that we all can move forward
and not backwards in time.

> "Obviously one solution would be to allow the drivers themselves to set
> the tcp_limit_output_bytes, but that seems like a maintenance
> nightmare.

Possible, but very hacky, as you penalize globally.

> "Another simple solution would be to allow drivers to indicate whether
> they have a high transmit latency, and have the kernel use a higher
> value by default when that's the case." [1]
>
> Neither of which you commented on.  Instead you pointed me to a comment

What Eric described to you was that you introduce a new netdev member
like netdev->needs_bufferbloat, set that indication from the driver side,
and cache that in the socket that binds to it, so you can adjust the
test in tcp_xmit_size_goal(). It should merely be seen as a hint/indication
for such devices. Hmm?

> that only partially described what the limitations were. (I.e., it
> described the "two packets or 1ms", but not how they related, nor how
> they related to the "max of 2 64k packets outstanding" of the default
> tcp_limit_output_bytes setting.)
>
>   -George
>
> [1]
> http://marc.info/?i=<CAFLBxZYt7-v29ysm=f+5QMOw64_QhESjzj98udba+1cS-PfObA@mail.gmail.com>

^ permalink raw reply	[flat|nested] 92+ messages in thread

* RE: [Xen-devel] "tcp: refine TSO autosizing" causes performance regression on Xen
  2015-04-16  8:56                           ` George Dunlap
  (?)
@ 2015-04-16  9:22                             ` David Laight
  -1 siblings, 0 replies; 92+ messages in thread
From: David Laight @ 2015-04-16  9:22 UTC (permalink / raw)
  To: 'George Dunlap', Eric Dumazet
  Cc: Jonathan Davies, xen-devel, Wei Liu, Ian Campbell,
	Stefano Stabellini, netdev, Linux Kernel Mailing List,
	Eric Dumazet, Paul Durrant, Christoffer Dall, Felipe Franciosi,
	linux-arm-kernel, David Vrabel

From: George Dunlap
> Sent: 16 April 2015 09:56
> On 04/15/2015 07:19 PM, Eric Dumazet wrote:
> > On Wed, 2015-04-15 at 19:04 +0100, George Dunlap wrote:
> >
> >> Maybe you should stop wasting all of our time and just tell us what
> >> you're thinking.
> >
> > I think you are making me waste my time.
> >
> > I already gave all the hints in prior discussions.
> 
> Right, and I suggested these two options:
> 
> "Obviously one solution would be to allow the drivers themselves to set
> the tcp_limit_output_bytes, but that seems like a maintenance
> nightmare.
> 
> "Another simple solution would be to allow drivers to indicate whether
> they have a high transmit latency, and have the kernel use a higher
> value by default when that's the case." [1]
> 
> Neither of which you commented on.  Instead you pointed me to a comment
> that only partially described what the limitations were. (I.e., it
> described the "two packets or 1ms", but not how they related, nor how
> they related to the "max of 2 64k packets outstanding" of the default
> tcp_limit_output_bytes setting.)

ISTM that you are changing the wrong knob.
You need to change something that affects the global amount of pending tx data,
not the amount that can be buffered by a single connection.

If you change tcp_limit_output_bytes and then have 1000 connections trying
to send data you'll suffer 'bufferbloat'.

If you call skb_orphan() in the tx setup path then the total number of
buffers is limited, but a single connection can (and will) fill the tx
ring, leading to incorrect RTT calculations and additional latency for
other connections.
This will give high single connection throughput but isn't ideal.

One possibility might be to call skb_orphan() when enough time has
elapsed since the packet was queued for transmit that it is very likely
to have actually been transmitted - even though 'transmit done' has
not yet been signalled.
Not at all sure how this would fit in though...

	David
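
For reference, a pseudocode-level sketch of the skb_orphan() approach
described above, as it might look in a hypothetical driver's xmit path
(this is not xen-netfront; mapping, ring handling and error paths are
elided):

/* Hypothetical driver xmit path, only to illustrate the skb_orphan()
 * idea discussed above.
 */
#include <linux/netdevice.h>
#include <linux/skbuff.h>

static netdev_tx_t example_start_xmit(struct sk_buff *skb,
				      struct net_device *dev)
{
	/* ... queue the skb on the device tx ring here ... */

	/*
	 * Dropping the socket reference returns skb->truesize to the
	 * sender's sk_wmem_alloc immediately, so TSQ stops throttling the
	 * flow while the buffer waits for a (slow) tx completion.  The
	 * cost is exactly the one described above: a single flow can then
	 * fill the tx ring, and the stack loses its completion feedback
	 * for RTT estimation and fairness between connections.
	 */
	skb_orphan(skb);

	return NETDEV_TX_OK;
}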



^ permalink raw reply	[flat|nested] 92+ messages in thread

* RE: [Xen-devel] "tcp: refine TSO autosizing" causes performance regression on Xen
@ 2015-04-16  9:22                             ` David Laight
  0 siblings, 0 replies; 92+ messages in thread
From: David Laight @ 2015-04-16  9:22 UTC (permalink / raw)
  To: 'George Dunlap', Eric Dumazet
  Cc: Jonathan Davies, xen-devel, Wei Liu, Ian Campbell,
	Stefano Stabellini, netdev, Linux Kernel Mailing List,
	Eric Dumazet, Paul Durrant, Christoffer Dall, Felipe Franciosi,
	linux-arm-kernel, David Vrabel

From: George Dunlap
> Sent: 16 April 2015 09:56
> On 04/15/2015 07:19 PM, Eric Dumazet wrote:
> > On Wed, 2015-04-15 at 19:04 +0100, George Dunlap wrote:
> >
> >> Maybe you should stop wasting all of our time and just tell us what
> >> you're thinking.
> >
> > I think you are making me waste my time.
> >
> > I already gave all the hints in prior discussions.
> 
> Right, and I suggested these two options:
> 
> "Obviously one solution would be to allow the drivers themselves to set
> the tcp_limit_output_bytes, but that seems like a maintenance
> nightmare.
> 
> "Another simple solution would be to allow drivers to indicate whether
> they have a high transmit latency, and have the kernel use a higher
> value by default when that's the case." [1]
> 
> Neither of which you commented on.  Instead you pointed me to a comment
> that only partially described what the limitations were. (I.e., it
> described the "two packets or 1ms", but not how they related, nor how
> they related to the "max of 2 64k packets outstanding" of the default
> tcp_limit_output_bytes setting.)

ISTM that you are changing the wrong knob.
You need to change something that affects the global amount of pending tx data,
not the amount that can be buffered by a single connection.

If you change tcp_limit_output_bytes and then have 1000 connections trying
to send data you'll suffer 'bufferbloat'.

If you call skb_orphan() in the tx setup path then the total number of
buffers is limited, but a single connection can (and will) fill the tx
ring, leading to incorrect RTT calculations and additional latency for
other connections.
This will give high single connection throughput but isn't ideal.

One possibility might be to call skb_orphan() when enough time has
elapsed since the packet was queued for transmit that it is very likely
to have actually been transmitted - even though 'transmit done' has
not yet been signalled.
Not at all sure how this would fit in though...

	David



^ permalink raw reply	[flat|nested] 92+ messages in thread

* [Xen-devel] "tcp: refine TSO autosizing" causes performance regression on Xen
@ 2015-04-16  9:22                             ` David Laight
  0 siblings, 0 replies; 92+ messages in thread
From: David Laight @ 2015-04-16  9:22 UTC (permalink / raw)
  To: linux-arm-kernel

From: George Dunlap
> Sent: 16 April 2015 09:56
> On 04/15/2015 07:19 PM, Eric Dumazet wrote:
> > On Wed, 2015-04-15 at 19:04 +0100, George Dunlap wrote:
> >
> >> Maybe you should stop wasting all of our time and just tell us what
> >> you're thinking.
> >
> > I think you are making me waste my time.
> >
> > I already gave all the hints in prior discussions.
> 
> Right, and I suggested these two options:
> 
> "Obviously one solution would be to allow the drivers themselves to set
> the tcp_limit_output_bytes, but that seems like a maintenance
> nightmare.
> 
> "Another simple solution would be to allow drivers to indicate whether
> they have a high transmit latency, and have the kernel use a higher
> value by default when that's the case." [1]
> 
> Neither of which you commented on.  Instead you pointed me to a comment
> that only partially described what the limitations were. (I.e., it
> described the "two packets or 1ms", but not how they related, nor how
> they related to the "max of 2 64k packets outstanding" of the default
> tcp_limit_output_bytes setting.)

ISTM that you are changing the wrong knob.
You need to change something that affects the global amount of pending tx data,
not the amount that can be buffered by a single connection.

If you change tcp_limit_output_bytes and then have 1000 connections trying
to send data you'll suffer 'bufferbloat'.

If you call skb_orphan() in the tx setup path then the total number of
buffers is limited, but a single connection can (and will) fill the tx
ring, leading to incorrect RTT calculations and additional latency for
other connections.
This will give high single connection throughput but isn't ideal.

One possibility might be to call skb_orphan() when enough time has
elapsed since the packet was queued for transmit that it is very likely
to have actually been transmitted - even though 'transmit done' has
not yet been signalled.
Not at all sure how this would fit in though...

	David

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [Xen-devel] "tcp: refine TSO autosizing" causes performance regression on Xen
  2015-04-16  9:20                             ` Daniel Borkmann
  (?)
@ 2015-04-16 10:01                               ` George Dunlap
  -1 siblings, 0 replies; 92+ messages in thread
From: George Dunlap @ 2015-04-16 10:01 UTC (permalink / raw)
  To: Daniel Borkmann, Eric Dumazet
  Cc: Jonathan Davies, xen-devel, Wei Liu, Ian Campbell,
	Stefano Stabellini, netdev, Linux Kernel Mailing List,
	Eric Dumazet, Paul Durrant, Christoffer Dall, Felipe Franciosi,
	linux-arm-kernel, David Vrabel

On 04/16/2015 10:20 AM, Daniel Borkmann wrote:
> So mid-term, it would be much more beneficial if you attempt to fix the
> underlying driver issues that actually cause high tx completion delays,
> instead of reintroducing bufferbloat. So that we all can move forward
> and not backwards in time.

Yes, I think we definitely see the need for this.  I think we certainly
agree that bufferbloat needs to be reduced, and minimizing the data we
need "in the pipe" for full performance on xennet is an important part
of that.

It should be said, however, that any virtual device is always going to
have higher latency than a physical device.  Hopefully we'll be able to
get the latency of xennet down to something that's more "reasonable",
but it may just not be possible.  And in any case, if we're going to be
cranking down these limits to just barely within the tolerance of
physical NICs, virtual devices (either xennet or virtio_net) are never
going to be able to catch up.  (Without cheating that is.)

> What Eric described to you was that you introduce a new netdev member
> like netdev->needs_bufferbloat, set that indication from the driver side,
> and cache that in the socket that binds to it, so you can adjust the
> test in tcp_xmit_size_goal(). It should merely be seen as a hint/indication
> for such devices. Hmm?

He suggested that after he'd been prodded by 4 more e-mails in which two
of us guessed what he was trying to get at.  That's what I was
complaining about.

Having a per-device "long transmit latency" hint sounds like a sensible
short-term solution to me.

 -George

^ permalink raw reply	[flat|nested] 92+ messages in thread

* [Xen-devel] "tcp: refine TSO autosizing" causes performance regression on Xen
@ 2015-04-16 10:01                               ` George Dunlap
  0 siblings, 0 replies; 92+ messages in thread
From: George Dunlap @ 2015-04-16 10:01 UTC (permalink / raw)
  To: linux-arm-kernel

On 04/16/2015 10:20 AM, Daniel Borkmann wrote:
> So mid-term, it would be much more beneficial if you attempt to fix the
> underlying driver issues that actually cause high tx completion delays,
> instead of reintroducing bufferbloat. So that we all can move forward
> and not backwards in time.

Yes, I think we definitely see the need for this.  I think we certainly
agree that bufferbloat needs to be reduced, and minimizing the data we
need "in the pipe" for full performance on xennet is an important part
of that.

It should be said, however, that any virtual device is always going to
have higher latency than a physical device.  Hopefully we'll be able to
get the latency of xennet down to something that's more "reasonable",
but it may just not be possible.  And in any case, if we're going to be
cranking down these limits to just barely within the tolerance of
physical NICs, virtual devices (either xennet or virtio_net) are never
going to be able to catch up.  (Without cheating that is.)

> What Eric described to you was that you introduce a new netdev member
> like netdev->needs_bufferbloat, set that indication from the driver side,
> and cache that in the socket that binds to it, so you can adjust the
> test in tcp_xmit_size_goal(). It should merely be seen as a hint/indication
> for such devices. Hmm?

He suggested that after he'd been prodded by 4 more e-mails in which two
of us guessed what he was trying to get at.  That's what I was
complaining about.

Having a per-device "long transmit latency" hint sounds like a sensible
short-term solution to me.

 -George

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [Xen-devel] "tcp: refine TSO autosizing" causes performance regression on Xen
@ 2015-04-16 10:01                               ` George Dunlap
  0 siblings, 0 replies; 92+ messages in thread
From: George Dunlap @ 2015-04-16 10:01 UTC (permalink / raw)
  To: Daniel Borkmann, Eric Dumazet
  Cc: Jonathan Davies, xen-devel, Wei Liu, Ian Campbell,
	Stefano Stabellini, netdev, Linux Kernel Mailing List,
	Eric Dumazet, Paul Durrant, Christoffer Dall, Felipe Franciosi,
	linux-arm-kernel, David Vrabel

On 04/16/2015 10:20 AM, Daniel Borkmann wrote:
> So mid-term, it would be much more beneficial if you attempt to fix the
> underlying driver issues that actually cause high tx completion delays,
> instead of reintroducing bufferbloat. So that we all can move forward
> and not backwards in time.

Yes, I think we definitely see the need for this.  I think we certainly
agree that bufferbloat needs to be reduced, and minimizing the data we
need "in the pipe" for full performance on xennet is an important part
of that.

It should be said, however, that any virtual device is always going to
have higher latency than a physical device.  Hopefully we'll be able to
get the latency of xennet down to something that's more "reasonable",
but it may just not be possible.  And in any case, if we're going to be
cranking down these limits to just barely within the tolerance of
physical NICs, virtual devices (either xennet or virtio_net) are never
going to be able to catch up.  (Without cheating that is.)

> What Eric described to you was that you introduce a new netdev member
> like netdev->needs_bufferbloat, set that indication from the driver side,
> and cache that in the socket that binds to it, so you can adjust the
> test in tcp_xmit_size_goal(). It should merely be seen as a hint/indication
> for such devices. Hmm?

He suggested that after he'd been prodded by 4 more e-mails in which two
of us guessed what he was trying to get at.  That's what I was
complaining about.

Having a per-device "long transmit latency" hint sounds like a sensible
short-term solution to me.

 -George

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [Xen-devel] "tcp: refine TSO autosizing" causes performance regression on Xen
  2015-04-16  9:22                             ` David Laight
  (?)
@ 2015-04-16 10:57                               ` George Dunlap
  -1 siblings, 0 replies; 92+ messages in thread
From: George Dunlap @ 2015-04-16 10:57 UTC (permalink / raw)
  To: David Laight
  Cc: Eric Dumazet, Jonathan Davies, xen-devel, Wei Liu, Ian Campbell,
	Stefano Stabellini, netdev, Linux Kernel Mailing List,
	Eric Dumazet, Paul Durrant, linux-arm-kernel, Felipe Franciosi,
	Christoffer Dall, David Vrabel

On Thu, Apr 16, 2015 at 10:22 AM, David Laight <David.Laight@aculab.com> wrote:
> ISTM that you are changing the wrong knob.
> You need to change something that affects the global amount of pending tx data,
> not the amount that can be buffered by a single connection.

Well it seems like the problem is that the global amount of pending tx
data is high enough, but that the per-stream amount is too low for
only a single stream.

> If you change tcp_limit_output_bytes and then have 1000 connections trying
> to send data you'll suffer 'bufferbloat'.

Right -- so are you worried about the buffers in the local device
here, or are you worried about buffers elsewhere in the network?

If you're worried about buffers on the local device, don't you have a
similar problem for physical NICs?  i.e., if a NIC has a big buffer
that you're trying to keep mostly empty, limiting a single TCP stream
may keep that buffer empty, but if you have 1000 connections,
1000*limit will still fill up the buffer.

Or am I missing something?

> If you call skb_orphan() in the tx setup path then the total number of
> buffers is limited, but a single connection can (and will) fill the tx
> ring, leading to incorrect RTT calculations and additional latency for
> other connections.
> This will give high single connection throughput but isn't ideal.
>
> One possibility might be to call skb_orphan() when enough time has
> elapsed since the packet was queued for transmit that it is very likely
> to have actually been transmitted - even though 'transmit done' has
> not yet been signalled.
> Not at all sure how this would fit in though...

Right -- so it sounds like the problem with skb_orphan() is making
sure that the tx ring is shared properly between different streams.
That would mean that ideally we wouldn't call it until the tx ring
actually had space to add more packets onto it.

The Xen project is having a sort of developer meeting in a few weeks;
if we can get a good picture of all the constraints, maybe we can hash
out a solution that works for everyone.

 -George
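
A trivial calculation of the per-flow versus aggregate point above (the
flow count is an arbitrary example):

/* Per-flow TSQ limits bound each connection, not the aggregate: with the
 * default 2*64 KB cap, 1000 flows can still park on the order of 100 MB
 * in the qdisc or device queue.
 */
#include <stdio.h>

int main(void)
{
	unsigned long per_flow_limit = 2UL * 65536;	/* default tcp_limit_output_bytes */
	unsigned long flows = 1000;

	printf("aggregate queued data: up to %lu MB\n",
	       flows * per_flow_limit >> 20);
	return 0;
}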

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [Xen-devel] "tcp: refine TSO autosizing" causes performance regression on Xen
@ 2015-04-16 10:57                               ` George Dunlap
  0 siblings, 0 replies; 92+ messages in thread
From: George Dunlap @ 2015-04-16 10:57 UTC (permalink / raw)
  To: David Laight
  Cc: Eric Dumazet, Jonathan Davies, xen-devel, Wei Liu, Ian Campbell,
	Stefano Stabellini, netdev, Linux Kernel Mailing List,
	Eric Dumazet, Paul Durrant, linux-arm-kernel, Felipe Franciosi,
	Christoffer Dall, David Vrabel

On Thu, Apr 16, 2015 at 10:22 AM, David Laight <David.Laight@aculab.com> wrote:
> ISTM that you are changing the wrong knob.
> You need to change something that affects the global amount of pending tx data,
> not the amount that can be buffered by a single connection.

Well it seems like the problem is that the global amount of pending tx
data is high enough, but that the per-stream amount is too low for
only a single stream.

> If you change tcp_limit_output_bytes and then have 1000 connections trying
> to send data you'll suffer 'bufferbloat'.

Right -- so are you worried about the buffers in the local device
here, or are you worried about buffers elsewhere in the network?

If you're worried about buffers on the local device, don't you have a
similar problem for physical NICs?  i.e., if a NIC has a big buffer
that you're trying to keep mostly empty, limiting a single TCP stream
may keep that buffer empty, but if you have 1000 connections,
1000*limit will still fill up the buffer.

Or am I missing something?

> If you call skb_orphan() in the tx setup path then the total number of
> buffers is limited, but a single connection can (and will) fill the tx
> ring, leading to incorrect RTT calculations and additional latency for
> other connections.
> This will give high single connection throughput but isn't ideal.
>
> One possibility might be to call skb_orphan() when enough time has
> elapsed since the packet was queued for transmit that it is very likely
> to have actually been transmitted - even though 'transmit done' has
> not yet been signalled.
> Not at all sure how this would fit in though...

Right -- so it sounds like the problem with skb_orphan() is making
sure that the tx ring is shared properly between different streams.
That would mean that ideally we wouldn't call it until the tx ring
actually had space to add more packets onto it.

The Xen project is having a sort of developer meeting in a few weeks;
if we can get a good picture of all the constraints, maybe we can hash
out a solution that works for everyone.

 -George

^ permalink raw reply	[flat|nested] 92+ messages in thread

* [Xen-devel] "tcp: refine TSO autosizing" causes performance regression on Xen
@ 2015-04-16 10:57                               ` George Dunlap
  0 siblings, 0 replies; 92+ messages in thread
From: George Dunlap @ 2015-04-16 10:57 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Apr 16, 2015 at 10:22 AM, David Laight <David.Laight@aculab.com> wrote:
> ISTM that you are changing the wrong knob.
> You need to change something that affects the global amount of pending tx data,
> not the amount that can be buffered by a single connection.

Well it seems like the problem is that the global amount of pending tx
data is high enough, but that the per-stream amount is too low for
only a single stream.

> If you change tcp_limit_output_bytes and then have 1000 connections trying
> to send data you'll suffer 'bufferbloat'.

Right -- so are you worried about the buffers in the local device
here, or are you worried about buffers elsewhere in the network?

If you're worried about buffers on the local device, don't you have a
similar problem for physical NICs?  i.e., if a NIC has a big buffer
that you're trying to keep mostly empty, limiting a single TCP stream
may keep that buffer empty, but if you have 1000 connections,
1000*limit will still fill up the buffer.

Or am I missing something?

> If you call skb_orphan() in the tx setup path then the total number of
> buffers is limited, but a single connection can (and will) fill the tx
> ring, leading to incorrect RTT calculations and additional latency for
> other connections.
> This will give high single connection throughput but isn't ideal.
>
> One possibility might be to call skb_orphan() when enough time has
> elapsed since the packet was queued for transmit that it is very likely
> to have actually been transmitted - even though 'transmit done' has
> not yet been signalled.
> Not at all sure how this would fit in though...

Right -- so it sounds like the problem with skb_orphan() is making
sure that the tx ring is shared properly between different streams.
That would mean that ideally we wouldn't call it until the tx ring
actually had space to add more packets onto it.

The Xen project is having a sort of developer meeting in a few weeks;
if we can get a good picture of all the constraints, maybe we can hash
out a solution that works for everyone.

 -George

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [Xen-devel] "tcp: refine TSO autosizing" causes performance regression on Xen
  2015-04-15 18:17                     ` Eric Dumazet
  (?)
@ 2015-04-16 11:39                       ` George Dunlap
  -1 siblings, 0 replies; 92+ messages in thread
From: George Dunlap @ 2015-04-16 11:39 UTC (permalink / raw)
  To: Eric Dumazet, Stefano Stabellini
  Cc: Jonathan Davies, xen-devel, Wei Liu, Ian Campbell, netdev,
	Linux Kernel Mailing List, Eric Dumazet, Paul Durrant,
	Christoffer Dall, Felipe Franciosi, linux-arm-kernel,
	David Vrabel

On 04/15/2015 07:17 PM, Eric Dumazet wrote:
> Do not expect me to fight bufferbloat alone. Be part of the challenge,
> instead of trying to get back to proven bad solutions.

I tried that.  I wrote a description of what I thought the situation
was, so that you could correct me if my understanding was wrong, and
then what I thought we could do about it.  You apparently didn't even
read it, but just pointed me to a single cryptic comment that doesn't
give me enough information to actually figure out what the situation is.

We all agree that bufferbloat is a problem for everybody, and I can
definitely understand the desire to actually make the situation better
rather than dying the death of a thousand exceptions.

If you want help fighting bufferbloat, you have to educate people to
help you; or alternately, if you don't want to bother educating people,
you have to fight it alone -- or lose the battle due to having a
thousand exceptions.

So, back to TSQ limits.  What's so magical about 2 packets being *in the
device itself*?  And what do 1ms, or 2*64k packets (the default for
tcp_limit_output_bytes), have to do with it?

Your comment lists three benefits:
1. better RTT estimation
2. faster recovery
3. high rates

#3 is just marketing fluff; it's also contradicted by the statement that
immediately follows it -- i.e., there are drivers for which the
limitation does *not* give high rates.

#1, as far as I can tell, has to do with measuring the *actual* minimal
round trip time of an empty pipe, rather than the round trip time you
get when there's 512MB of packets in the device buffer.  If a device has
a large internal buffer, then having a large number of packets
outstanding means that the measured RTT is skewed.

The goal here, I take it, is to have this "pipe" *exactly* full; having
it significantly more than "full" is what leads to bufferbloat.

#2 sounds like you're saying that if there are too many packets
outstanding when you discover that you need to adjust things, it
takes a long time for your changes to have an effect; i.e., if you have
5ms of data in the pipe, it will take at least 5ms for your reduced
transmission rate to actually have an effect.

Is that accurate, or have I misunderstood something?

 -George

^ permalink raw reply	[flat|nested] 92+ messages in thread

* [Xen-devel] "tcp: refine TSO autosizing" causes performance regression on Xen
@ 2015-04-16 11:39                       ` George Dunlap
  0 siblings, 0 replies; 92+ messages in thread
From: George Dunlap @ 2015-04-16 11:39 UTC (permalink / raw)
  To: linux-arm-kernel

On 04/15/2015 07:17 PM, Eric Dumazet wrote:
> Do not expect me to fight bufferbloat alone. Be part of the challenge,
> instead of trying to get back to proven bad solutions.

I tried that.  I wrote a description of what I thought the situation
was, so that you could correct me if my understanding was wrong, and
then what I thought we could do about it.  You apparently didn't even
read it, but just pointed me to a single cryptic comment that doesn't
give me enough information to actually figure out what the situation is.

We all agree that bufferbloat is a problem for everybody, and I can
definitely understand the desire to actually make the situation better
rather than dying the death of a thousand exceptions.

If you want help fighting bufferbloat, you have to educate people to
help you; or alternately, if you don't want to bother educating people,
you have to fight it alone -- or lose the battle due to having a
thousand exceptions.

So, back to TSQ limits.  What's so magical about 2 packets being *in the
device itself*?  And what does 1ms, or 2*64k packets (the default for
tcp_limit_output_bytes), have anything to do with it?

Your comment lists three benefits:
1. better RTT estimation
2. faster recovery
3. high rates

#3 is just marketing fluff; it's also contradicted by the statement that
immediately follows it -- i.e., there are drivers for which the
limitation does *not* give high rates.

#1, as far as I can tell, has to do with measuring the *actual* minimal
round trip time of an empty pipe, rather than the round trip time you
get when there's 512MB of packets in the device buffer.  If a device has
a large internal buffer, then having a large number of packets
outstanding means that the measured RTT is skewed.

The goal here, I take it, is to have this "pipe" *exactly* full; having
it significantly more than "full" is what leads to bufferbloat.

#2 sounds like you're saying that if there are too many packets
outstanding when you discover that you need to adjust things, that it
takes a long time for your changes to have an effect; i.e., if you have
5ms of data in the pipe, it will take at least 5ms for your reduced
transmmission rate to actually have an effect.

Is that accurate, or have I misunderstood something?

 -George

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [Xen-devel] "tcp: refine TSO autosizing" causes performance regression on Xen
@ 2015-04-16 11:39                       ` George Dunlap
  0 siblings, 0 replies; 92+ messages in thread
From: George Dunlap @ 2015-04-16 11:39 UTC (permalink / raw)
  To: Eric Dumazet, Stefano Stabellini
  Cc: Jonathan Davies, xen-devel, Wei Liu, Ian Campbell, netdev,
	Linux Kernel Mailing List, Eric Dumazet, Paul Durrant,
	Christoffer Dall, Felipe Franciosi, linux-arm-kernel,
	David Vrabel

On 04/15/2015 07:17 PM, Eric Dumazet wrote:
> Do not expect me to fight bufferbloat alone. Be part of the challenge,
> instead of trying to get back to proven bad solutions.

I tried that.  I wrote a description of what I thought the situation
was, so that you could correct me if my understanding was wrong, and
then what I thought we could do about it.  You apparently didn't even
read it, but just pointed me to a single cryptic comment that doesn't
give me enough information to actually figure out what the situation is.

We all agree that bufferbloat is a problem for everybody, and I can
definitely understand the desire to actually make the situation better
rather than dying the death of a thousand exceptions.

If you want help fighting bufferbloat, you have to educate people to
help you; or alternately, if you don't want to bother educating people,
you have to fight it alone -- or lose the battle due to having a
thousand exceptions.

So, back to TSQ limits.  What's so magical about 2 packets being *in the
device itself*?  And what does 1ms, or 2*64k packets (the default for
tcp_limit_output_bytes), have anything to do with it?

Your comment lists three benefits:
1. better RTT estimation
2. faster recovery
3. high rates

#3 is just marketing fluff; it's also contradicted by the statement that
immediately follows it -- i.e., there are drivers for which the
limitation does *not* give high rates.

#1, as far as I can tell, has to do with measuring the *actual* minimal
round trip time of an empty pipe, rather than the round trip time you
get when there's 512MB of packets in the device buffer.  If a device has
a large internal buffer, then having a large number of packets
outstanding means that the measured RTT is skewed.

The goal here, I take it, is to have this "pipe" *exactly* full; having
it significantly more than "full" is what leads to bufferbloat.

#2 sounds like you're saying that if there are too many packets
outstanding when you discover that you need to adjust things, that it
takes a long time for your changes to have an effect; i.e., if you have
5ms of data in the pipe, it will take at least 5ms for your reduced
transmmission rate to actually have an effect.

Is that accurate, or have I misunderstood something?

 -George

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [Xen-devel] "tcp: refine TSO autosizing" causes performance regression on Xen
  2015-04-16 11:39                       ` George Dunlap
@ 2015-04-16 12:16                         ` Eric Dumazet
  -1 siblings, 0 replies; 92+ messages in thread
From: Eric Dumazet @ 2015-04-16 12:16 UTC (permalink / raw)
  To: George Dunlap
  Cc: Stefano Stabellini, Jonathan Davies, xen-devel, Wei Liu,
	Ian Campbell, netdev, Linux Kernel Mailing List, Eric Dumazet,
	Paul Durrant, Christoffer Dall, Felipe Franciosi,
	linux-arm-kernel, David Vrabel

On Thu, 2015-04-16 at 12:39 +0100, George Dunlap wrote:
> On 04/15/2015 07:17 PM, Eric Dumazet wrote:
> > Do not expect me to fight bufferbloat alone. Be part of the challenge,
> > instead of trying to get back to proven bad solutions.
> 
> I tried that.  I wrote a description of what I thought the situation
> was, so that you could correct me if my understanding was wrong, and
> then what I thought we could do about it.  You apparently didn't even
> read it, but just pointed me to a single cryptic comment that doesn't
> give me enough information to actually figure out what the situation is.
> 
> We all agree that bufferbloat is a problem for everybody, and I can
> definitely understand the desire to actually make the situation better
> rather than dying the death of a thousand exceptions.
> 
> If you want help fighting bufferbloat, you have to educate people to
> help you; or alternately, if you don't want to bother educating people,
> you have to fight it alone -- or lose the battle due to having a
> thousand exceptions.
> 
> So, back to TSQ limits.  What's so magical about 2 packets being *in the
> device itself*?  And what do 1ms, or 2*64k packets (the default for
> tcp_limit_output_bytes), have to do with it?
> 
> Your comment lists three benefits:
> 1. better RTT estimation
> 2. faster recovery
> 3. high rates
> 
> #3 is just marketing fluff; it's also contradicted by the statement that
> immediately follows it -- i.e., there are drivers for which the
> limitation does *not* give high rates.
> 
> #1, as far as I can tell, has to do with measuring the *actual* minimal
> round trip time of an empty pipe, rather than the round trip time you
> get when there's 512MB of packets in the device buffer.  If a device has
> a large internal buffer, then having a large number of packets
> outstanding means that the measured RTT is skewed.
> 
> The goal here, I take it, is to have this "pipe" *exactly* full; having
> it significantly more than "full" is what leads to bufferbloat.
> 
> #2 sounds like you're saying that if there are too many packets
> outstanding when you discover that you need to adjust things, that it
> takes a long time for your changes to have an effect; i.e., if you have
> 5ms of data in the pipe, it will take at least 5ms for your reduced
> transmission rate to actually have an effect.
> 
> Is that accurate, or have I misunderstood something?

#2 means that:

Suppose you have an outstanding queue of 500 packets for a flow in the qdisc.

A rtx has to be done, because we receive a SACK.

The rtx is queued _after_ the previous 500 packets.

Those 500 packets have to be drained before the rtx can be sent and
eventually reach the destination.


These 500 packets will likely be dropped because the destination cannot
process them before the rtx.

2 TSO packets are already 90 packets (MSS=1448). It is not small, but a
good compromise allowing line rate even on a 40Gbit NIC.
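
To put rough numbers on both points (assuming MSS=1448 as above and
roughly 1500 bytes per frame on the wire; the 500-packet backlog is the
example figure, not a measurement):

# Back-of-the-envelope numbers for the two claims above.
MSS = 1448
TSO_BYTES = 64 * 1024
print("MSS-sized segments in 2 TSO packets:", 2 * TSO_BYTES // MSS)  # -> 90

BACKLOG_PKTS = 500           # the example qdisc backlog
WIRE_BYTES = 1500            # rough on-the-wire frame size
for gbit in (1, 10, 40):
    drain_us = BACKLOG_PKTS * WIRE_BYTES * 8 / (gbit * 1e9) * 1e6
    print(f"{gbit:>2} Gbit/s: the rtx waits ~{drain_us:.0f} us behind the backlog")

Even at 40Gbit the rtx sits behind the whole backlog; on slower links
that is milliseconds of extra recovery time.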


#1 is not marketing. It is hugely relevant.

You might use cubic as the default congestion control, but you have to
understand that we work hard on delay-based cc, as losses are no longer a
reliable way to measure congestion in modern networks.

Vegas and delay-gradient congestion control depend on precise RTT measurements.

I added usec RTT estimations (instead of jiffies-based rtt samples) to
increase resolution by 3 orders of magnitude, not for marketing, but
because it had to be done now that DC communications have typical rtts
of 25 usec these days.
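
For people not familiar with delay-based cc, the textbook Vegas rule is
a reasonable mental model of why this matters: it compares expected and
actual throughput using the base RTT, so microsecond-scale queueing
noise is enough to flip its decision.  The sketch below is the textbook
idea only, not the Linux tcp_vegas module; alpha/beta are typical
values.

# Textbook-style Vegas decision, one adjustment per RTT (idea only).
ALPHA, BETA = 2.0, 4.0       # packets of "extra" in-flight data tolerated

def vegas_update(cwnd, base_rtt, current_rtt):
    expected = cwnd / base_rtt             # throughput if the path were empty
    actual = cwnd / current_rtt            # throughput actually observed
    diff = (expected - actual) * base_rtt  # ~ packets sitting in queues
    if diff < ALPHA:
        return cwnd + 1                    # path looks empty: grow
    if diff > BETA:
        return cwnd - 1                    # queue building up: back off
    return cwnd

# With 25 usec base RTTs, jiffies-granularity samples could not even
# tell these three cases apart:
print(vegas_update(100, 25e-6, 25.3e-6))   # ~1 pkt queued  -> 101
print(vegas_update(100, 25e-6, 26e-6))     # ~4 pkts queued -> 100
print(vegas_update(100, 25e-6, 30e-6))     # ~17 pkts queued -> 99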

And jitter in host queues is not nice and must be kept to a minimum.

You do not have the whole picture, but this tight bufferbloat control is
one step before we can replace cubic with new upcoming cc modules, which
many companies are actively developing and testing.

The steps are the following:

1) TCP Small queues
2) FQ/pacing
3) TSO auto sizing
4) usec rtt estimations
5) New revolutionary cc module currently under test at Google,
   but others have alternatives.


The fact that a few drivers have bugs should not stop this effort.

If you guys are in the Bay Area, we would be happy to host a meeting
where we can show you how our work reduced packet drops in our networks
by 2 orders of magnitude and increased capacity by 40 or 50%.




^ permalink raw reply	[flat|nested] 92+ messages in thread

* [Xen-devel] "tcp: refine TSO autosizing" causes performance regression on Xen
@ 2015-04-16 12:16                         ` Eric Dumazet
  0 siblings, 0 replies; 92+ messages in thread
From: Eric Dumazet @ 2015-04-16 12:16 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, 2015-04-16 at 12:39 +0100, George Dunlap wrote:
> On 04/15/2015 07:17 PM, Eric Dumazet wrote:
> > Do not expect me to fight bufferbloat alone. Be part of the challenge,
> > instead of trying to get back to proven bad solutions.
> 
> I tried that.  I wrote a description of what I thought the situation
> was, so that you could correct me if my understanding was wrong, and
> then what I thought we could do about it.  You apparently didn't even
> read it, but just pointed me to a single cryptic comment that doesn't
> give me enough information to actually figure out what the situation is.
> 
> We all agree that bufferbloat is a problem for everybody, and I can
> definitely understand the desire to actually make the situation better
> rather than dying the death of a thousand exceptions.
> 
> If you want help fighting bufferbloat, you have to educate people to
> help you; or alternately, if you don't want to bother educating people,
> you have to fight it alone -- or lose the battle due to having a
> thousand exceptions.
> 
> So, back to TSQ limits.  What's so magical about 2 packets being *in the
> device itself*?  And what does 1ms, or 2*64k packets (the default for
> tcp_limit_output_bytes), have anything to do with it?
> 
> Your comment lists three benefits:
> 1. better RTT estimation
> 2. faster recovery
> 3. high rates
> 
> #3 is just marketing fluff; it's also contradicted by the statement that
> immediately follows it -- i.e., there are drivers for which the
> limitation does *not* give high rates.
> 
> #1, as far as I can tell, has to do with measuring the *actual* minimal
> round trip time of an empty pipe, rather than the round trip time you
> get when there's 512MB of packets in the device buffer.  If a device has
> a large internal buffer, then having a large number of packets
> outstanding means that the measured RTT is skewed.
> 
> The goal here, I take it, is to have this "pipe" *exactly* full; having
> it significantly more than "full" is what leads to bufferbloat.
> 
> #2 sounds like you're saying that if there are too many packets
> outstanding when you discover that you need to adjust things, that it
> takes a long time for your changes to have an effect; i.e., if you have
> 5ms of data in the pipe, it will take at least 5ms for your reduced
> transmmission rate to actually have an effect.
> 
> Is that accurate, or have I misunderstood something?

#2 means that :

If you have an outstanding queue of 500 packets for a flow in qdisc.

A rtx has to be done, because we receive a SACK.

The rtx is queued _after_ the previous 500 packets.

500 packets have to be drained before rtx can be sent and eventually
reach destination.


These 500 packets will likely be dropped because the destination cannot
process them before the rtx.

2 TSO packets are already 90 packets (MSS=1448). It is not small, but a
good compromise allowing line rate even on 40Gbit NIC.


#1 is not marketing. It is hugely relevant.

You might use cubic as the default congestion control, you have to
understand we work hard on delay based cc, as losses are no longer a way
to measure congestion in modern networks.

Vegas and delay gradient congestion depends on precise RTT measures.

I added usec RTT estimations (instead of jiffies based rtt samples) to
increase resolution by 3 order of magnitude, not for marketing, but
because it had to be done when DC communications have typical rtt of 25
usec these days.

And jitter in host queues is not nice and must be kept at the minimum.

You do not have the whole picture, but this tight bufferbloat control is
one step before we can replace cubic by new upcoming cc, that many
companies are actively developing and testing.

The steps are the following :

1) TCP Small queues
2) FQ/pacing
3) TSO auto sizing
3) usec rtt estimations
4) New revolutionary cc module currently under test at Google,
   but others have alternatives.


The fact that few drivers have bugs should not stop this effort.

If you guys are in the Bay area, we would be happy to host a meeting
where we can present you how our work reduced packet drops in our
networks by 2 order of magnitude, and increased capacity by 40 or 50%.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [Xen-devel] "tcp: refine TSO autosizing" causes performance regression on Xen
  2015-04-16 10:01                               ` George Dunlap
@ 2015-04-16 12:42                                 ` Eric Dumazet
  -1 siblings, 0 replies; 92+ messages in thread
From: Eric Dumazet @ 2015-04-16 12:42 UTC (permalink / raw)
  To: George Dunlap
  Cc: Daniel Borkmann, Jonathan Davies, xen-devel, Wei Liu,
	Ian Campbell, Stefano Stabellini, netdev,
	Linux Kernel Mailing List, Eric Dumazet, Paul Durrant,
	Christoffer Dall, Felipe Franciosi, linux-arm-kernel,
	David Vrabel

On Thu, 2015-04-16 at 11:01 +0100, George Dunlap wrote:

> He suggested that after he'd been prodded by 4 more e-mails in which two
> of us guessed what he was trying to get at.  That's what I was
> complaining about.

My big complaint is that I suggested testing a doubling of the sysctl,
which gave good results.

Then you provided a patch using an 8x factor. How does that sound?

Next time I ask for a raise, I should try an 8x factor as well; who
knows, it might be accepted.




^ permalink raw reply	[flat|nested] 92+ messages in thread

* [Xen-devel] "tcp: refine TSO autosizing" causes performance regression on Xen
@ 2015-04-16 12:42                                 ` Eric Dumazet
  0 siblings, 0 replies; 92+ messages in thread
From: Eric Dumazet @ 2015-04-16 12:42 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, 2015-04-16 at 11:01 +0100, George Dunlap wrote:

> He suggested that after he'd been prodded by 4 more e-mails in which two
> of us guessed what he was trying to get at.  That's what I was
> complaining about.

My big complain is that I suggested to test to double the sysctl, which
gave good results.

Then you provided a patch using a 8x factor. How does that sound ?

Next time I ask a raise, I should try a 8x factor as well, who knows,
it might be accepted.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [Xen-devel] "tcp: refine TSO autosizing" causes performance regression on Xen
  2015-04-16 11:39                       ` George Dunlap
  (?)
@ 2015-04-16 13:00                         ` Tim Deegan
  -1 siblings, 0 replies; 92+ messages in thread
From: Tim Deegan @ 2015-04-16 13:00 UTC (permalink / raw)
  To: George Dunlap
  Cc: Eric Dumazet, Stefano Stabellini, Jonathan Davies, xen-devel,
	Wei Liu, Ian Campbell, netdev, Linux Kernel Mailing List,
	Eric Dumazet, Paul Durrant, linux-arm-kernel, Felipe Franciosi,
	Christoffer Dall, David Vrabel

At 12:39 +0100 on 16 Apr (1429187952), George Dunlap wrote:
> Your comment lists three benefits:
> 1. better RTT estimation
> 2. faster recovery
> 3. high rates
> 
> #3 is just marketing fluff; it's also contradicted by the statement that
> immediately follows it -- i.e., there are drivers for which the
> limitation does *not* give high rates.

AFAICT #3 is talking about throughput _under TCP_, where inflating the
RTT will absolutely cause problems.
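
i.e. a window-limited sender can never exceed window/RTT, so any
queueing delay added to the RTT comes straight off the achievable
throughput.  A quick illustration with an arbitrary 1MB window:

# Window-limited TCP throughput is bounded by window / RTT.
WINDOW_BYTES = 1 * 1024 * 1024            # 1MB in flight (arbitrary)

for rtt_ms in (1, 5, 20):
    max_gbit = WINDOW_BYTES * 8 / (rtt_ms / 1000) / 1e9
    print(f"RTT {rtt_ms:>2} ms -> at most ~{max_gbit:.1f} Gbit/s")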

Tim.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [Xen-devel] "tcp: refine TSO autosizing" causes performance regression on Xen
  2015-04-16 12:42                                 ` Eric Dumazet
@ 2015-04-20 11:03                                   ` George Dunlap
  -1 siblings, 0 replies; 92+ messages in thread
From: George Dunlap @ 2015-04-20 11:03 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Jonathan Davies, xen-devel, Wei Liu, Ian Campbell,
	Daniel Borkmann, Stefano Stabellini, netdev,
	Linux Kernel Mailing List, Eric Dumazet, Paul Durrant,
	linux-arm-kernel, Felipe Franciosi, Christoffer Dall,
	David Vrabel

On Thu, Apr 16, 2015 at 1:42 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Thu, 2015-04-16 at 11:01 +0100, George Dunlap wrote:
>
>> He suggested that after he'd been prodded by 4 more e-mails in which two
>> of us guessed what he was trying to get at.  That's what I was
>> complaining about.
>
> My big complaint is that I suggested testing a doubling of the sysctl,
> which gave good results.
>
> Then you provided a patch using an 8x factor. How does that sound?
>
> Next time I ask for a raise, I should try an 8x factor as well; who
> knows, it might be accepted.

I see.  I chose the value that Stefano had determined had completely
eliminated the overhead.  Doubling the value reduces the overhead to
8%, which should be fine for a short-term fix while we get a proper
mid/long-term fix.

 -George

^ permalink raw reply	[flat|nested] 92+ messages in thread

* [Xen-devel] "tcp: refine TSO autosizing" causes performance regression on Xen
@ 2015-04-20 11:03                                   ` George Dunlap
  0 siblings, 0 replies; 92+ messages in thread
From: George Dunlap @ 2015-04-20 11:03 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Apr 16, 2015 at 1:42 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Thu, 2015-04-16 at 11:01 +0100, George Dunlap wrote:
>
>> He suggested that after he'd been prodded by 4 more e-mails in which two
>> of us guessed what he was trying to get at.  That's what I was
>> complaining about.
>
> My big complain is that I suggested to test to double the sysctl, which
> gave good results.
>
> Then you provided a patch using a 8x factor. How does that sound ?
>
> Next time I ask a raise, I should try a 8x factor as well, who knows,
> it might be accepted.

I see.  I chose the value that Stefano had determined had completely
eliminated the overhead.  Doubling the value reduces the overhead to
8%, which should be fine for a short-term fix while we git a proper
mid/long-term fix.

 -George

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [Xen-devel] "tcp: refine TSO autosizing" causes performance regression on Xen
  2015-04-16 12:42                                 ` Eric Dumazet
  (?)
@ 2015-06-02  9:52                                   ` Wei Liu
  -1 siblings, 0 replies; 92+ messages in thread
From: Wei Liu @ 2015-06-02  9:52 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: George Dunlap, Daniel Borkmann, Jonathan Davies, xen-devel,
	Wei Liu, Ian Campbell, Stefano Stabellini, netdev,
	Linux Kernel Mailing List, Eric Dumazet, Paul Durrant,
	Christoffer Dall, Felipe Franciosi, linux-arm-kernel,
	David Vrabel

Hi Eric

Sorry for coming late to the discussion.

On Thu, Apr 16, 2015 at 05:42:16AM -0700, Eric Dumazet wrote:
> On Thu, 2015-04-16 at 11:01 +0100, George Dunlap wrote:
> 
> > He suggested that after he'd been prodded by 4 more e-mails in which two
> > of us guessed what he was trying to get at.  That's what I was
> > complaining about.
> 
> My big complaint is that I suggested testing a doubling of the sysctl,
> which gave good results.
> 

Do I understand correctly that it's acceptable to you to double the size
of the buffer? If so I will send a patch to do that.
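
For concreteness, the default discussed earlier in the thread is
2*64k = 131072 bytes, so doubling it means 262144.  The snippet below
is only an illustration of checking and doubling the runtime value, not
the patch itself; the /proc path is the usual location of this sysctl.

# Illustration only, not the proposed patch.
SYSCTL = "/proc/sys/net/ipv4/tcp_limit_output_bytes"

with open(SYSCTL) as f:
    current = int(f.read())
print(f"current: {current} bytes, doubled: {current * 2} bytes")

# Applying it at runtime (needs root) would be:
#   with open(SYSCTL, "w") as f:
#       f.write(str(current * 2))
# whereas the patch would change the kernel default instead.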

Wei.

> Then you provided a patch using an 8x factor. How does that sound?
> 
> Next time I ask for a raise, I should try an 8x factor as well; who
> knows, it might be accepted.
> 
> 

^ permalink raw reply	[flat|nested] 92+ messages in thread

* [Xen-devel] "tcp: refine TSO autosizing" causes performance regression on Xen
@ 2015-06-02  9:52                                   ` Wei Liu
  0 siblings, 0 replies; 92+ messages in thread
From: Wei Liu @ 2015-06-02  9:52 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Eric

Sorry for coming late to the discussion.

On Thu, Apr 16, 2015 at 05:42:16AM -0700, Eric Dumazet wrote:
> On Thu, 2015-04-16 at 11:01 +0100, George Dunlap wrote:
> 
> > He suggested that after he'd been prodded by 4 more e-mails in which two
> > of us guessed what he was trying to get at.  That's what I was
> > complaining about.
> 
> My big complain is that I suggested to test to double the sysctl, which
> gave good results.
> 

Do I understand correctly that it's acceptable to you to double the size
of the buffer? If so I will send a patch to do that.

Wei.

> Then you provided a patch using a 8x factor. How does that sound ?
> 
> Next time I ask a raise, I should try a 8x factor as well, who knows,
> it might be accepted.
> 
> 

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [Xen-devel] "tcp: refine TSO autosizing" causes performance regression on Xen
@ 2015-06-02  9:52                                   ` Wei Liu
  0 siblings, 0 replies; 92+ messages in thread
From: Wei Liu @ 2015-06-02  9:52 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: George Dunlap, Daniel Borkmann, Jonathan Davies, xen-devel,
	Wei Liu, Ian Campbell, Stefano Stabellini, netdev,
	Linux Kernel Mailing List, Eric Dumazet, Paul Durrant,
	Christoffer Dall, Felipe Franciosi, linux-arm-kernel,
	David Vrabel

Hi Eric

Sorry for coming late to the discussion.

On Thu, Apr 16, 2015 at 05:42:16AM -0700, Eric Dumazet wrote:
> On Thu, 2015-04-16 at 11:01 +0100, George Dunlap wrote:
> 
> > He suggested that after he'd been prodded by 4 more e-mails in which two
> > of us guessed what he was trying to get at.  That's what I was
> > complaining about.
> 
> My big complain is that I suggested to test to double the sysctl, which
> gave good results.
> 

Do I understand correctly that it's acceptable to you to double the size
of the buffer? If so I will send a patch to do that.

Wei.

> Then you provided a patch using a 8x factor. How does that sound ?
> 
> Next time I ask a raise, I should try a 8x factor as well, who knows,
> it might be accepted.
> 
> 

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [Xen-devel] "tcp: refine TSO autosizing" causes performance regression on Xen
  2015-06-02  9:52                                   ` Wei Liu
@ 2015-06-02 16:16                                     ` Eric Dumazet
  -1 siblings, 0 replies; 92+ messages in thread
From: Eric Dumazet @ 2015-06-02 16:16 UTC (permalink / raw)
  To: Wei Liu
  Cc: George Dunlap, Daniel Borkmann, Jonathan Davies, xen-devel,
	Ian Campbell, Stefano Stabellini, netdev,
	Linux Kernel Mailing List, Eric Dumazet, Paul Durrant,
	Christoffer Dall, Felipe Franciosi, linux-arm-kernel,
	David Vrabel

On Tue, 2015-06-02 at 10:52 +0100, Wei Liu wrote:
> Hi Eric
> 
> Sorry for coming late to the discussion.
> 
> On Thu, Apr 16, 2015 at 05:42:16AM -0700, Eric Dumazet wrote:
> > On Thu, 2015-04-16 at 11:01 +0100, George Dunlap wrote:
> > 
> > > He suggested that after he'd been prodded by 4 more e-mails in which two
> > > of us guessed what he was trying to get at.  That's what I was
> > > complaining about.
> > 
> > My big complaint is that I suggested testing a doubling of the sysctl,
> > which gave good results.
> > 
> 
> Do I understand correctly that it's acceptable to you to double the size
> of the buffer? If so I will send a patch to do that.

Absolutely.



^ permalink raw reply	[flat|nested] 92+ messages in thread

* [Xen-devel] "tcp: refine TSO autosizing" causes performance regression on Xen
@ 2015-06-02 16:16                                     ` Eric Dumazet
  0 siblings, 0 replies; 92+ messages in thread
From: Eric Dumazet @ 2015-06-02 16:16 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, 2015-06-02 at 10:52 +0100, Wei Liu wrote:
> Hi Eric
> 
> Sorry for coming late to the discussion.
> 
> On Thu, Apr 16, 2015 at 05:42:16AM -0700, Eric Dumazet wrote:
> > On Thu, 2015-04-16 at 11:01 +0100, George Dunlap wrote:
> > 
> > > He suggested that after he'd been prodded by 4 more e-mails in which two
> > > of us guessed what he was trying to get at.  That's what I was
> > > complaining about.
> > 
> > My big complain is that I suggested to test to double the sysctl, which
> > gave good results.
> > 
> 
> Do I understand correctly that it's acceptable to you to double the size
> of the buffer? If so I will send a patch to do that.

Absolutely.

^ permalink raw reply	[flat|nested] 92+ messages in thread
