* TCP and BBR: reproducibly low cwnd and bandwidth
@ 2018-02-15 20:42 Oleksandr Natalenko
  2018-02-16 15:15 ` Oleksandr Natalenko
  2018-02-16 16:21 ` Eric Dumazet
  0 siblings, 2 replies; 26+ messages in thread
From: Oleksandr Natalenko @ 2018-02-15 20:42 UTC (permalink / raw)
  To: David S. Miller; +Cc: Alexey Kuznetsov, Hideaki YOSHIFUJI, netdev, linux-kernel

Hello.

I've run into an issue with limited TCP bandwidth between my laptop and a 
server in my 1 Gbps LAN while using BBR as the congestion control mechanism. To 
verify my observations, I've set up 2 KVM VMs with the following parameters:

1) Linux v4.15.3
2) virtio NICs
3) 128 MiB of RAM
4) 2 vCPUs
5) tested on both non-PREEMPT/100 Hz and PREEMPT/1000 Hz

The VMs are interconnected via host bridge (-netdev bridge). I was running 
iperf3 in the default and reverse mode. Here are the results:

1) BBR on both VMs

upload: 3.42 Gbits/sec, cwnd ~ 320 KBytes
download: 3.39 Gbits/sec, cwnd ~ 320 KBytes

2) Reno on both VMs

upload: 5.50 Gbits/sec, cwnd = 976 KBytes (constant)
download: 5.22 Gbits/sec, cwnd = 1.20 MBytes (constant)

3) Reno on client, BBR on server

upload: 5.29 Gbits/sec, cwnd = 952 KBytes (constant)
download: 3.45 Gbits/sec, cwnd ~ 320 KBytes

4) BBR on client, Reno on server

upload: 3.36 Gbits/sec, cwnd ~ 370 KBytes
download: 5.21 Gbits/sec, cwnd = 887 KBytes (constant)

So, as you can see, when BBR is in use, the upload rate is bad and cwnd is low. 
On real HW (1 Gbps LAN, laptop and server), BBR limits the throughput to 
~100 Mbps (verifiable not only with iperf3, but also with scp while transferring 
some files between hosts).

Also, I've tried to use YeAH instead of Reno, and it gives me the same results 
as Reno (IOW, YeAH works fine too).

Questions:

1) is this expected?
2) or am I missing some extra BBR tuneable?
3) if it is not a regression (I don't have any previous data to compare with), 
how can I fix this?
4) if it is a bug in BBR, what else should I provide or check for a proper 
investigation?

Thanks.

Regards,
  Oleksandr


* Re: TCP and BBR: reproducibly low cwnd and bandwidth
  2018-02-15 20:42 TCP and BBR: reproducibly low cwnd and bandwidth Oleksandr Natalenko
@ 2018-02-16 15:15 ` Oleksandr Natalenko
  2018-02-16 16:25   ` Eric Dumazet
  2018-02-16 16:26   ` Holger Hoffstätte
  2018-02-16 16:21 ` Eric Dumazet
  1 sibling, 2 replies; 26+ messages in thread
From: Oleksandr Natalenko @ 2018-02-16 15:15 UTC (permalink / raw)
  To: David S. Miller
  Cc: Alexey Kuznetsov, Hideaki YOSHIFUJI, netdev, linux-kernel,
	Eric Dumazet, Soheil Hassas Yeganeh, Neal Cardwell,
	Yuchung Cheng, Van Jacobson, Jerry Chu

Hi, David, Eric, Neal et al.

On Thursday, 15 February 2018 21:42:26 CET Oleksandr Natalenko wrote:
> I've faced an issue with a limited TCP bandwidth between my laptop and a
> server in my 1 Gbps LAN while using BBR as a congestion control mechanism.
> To verify my observations, I've set up 2 KVM VMs with the following
> parameters:
> 
> 1) Linux v4.15.3
> 2) virtio NICs
> 3) 128 MiB of RAM
> 4) 2 vCPUs
> 5) tested on both non-PREEMPT/100 Hz and PREEMPT/1000 Hz
> 
> The VMs are interconnected via host bridge (-netdev bridge). I was running
> iperf3 in the default and reverse mode. Here are the results:
> 
> 1) BBR on both VMs
> 
> upload: 3.42 Gbits/sec, cwnd ~ 320 KBytes
> download: 3.39 Gbits/sec, cwnd ~ 320 KBytes
> 
> 2) Reno on both VMs
> 
> upload: 5.50 Gbits/sec, cwnd = 976 KBytes (constant)
> download: 5.22 Gbits/sec, cwnd = 1.20 MBytes (constant)
> 
> 3) Reno on client, BBR on server
> 
> upload: 5.29 Gbits/sec, cwnd = 952 KBytes (constant)
> download: 3.45 Gbits/sec, cwnd ~ 320 KBytes
> 
> 4) BBR on client, Reno on server
> 
> upload: 3.36 Gbits/sec, cwnd ~ 370 KBytes
> download: 5.21 Gbits/sec, cwnd = 887 KBytes (constant)
> 
> So, as you may see, when BBR is in use, upload rate is bad and cwnd is low.
> If using real HW (1 Gbps LAN, laptop and server), BBR limits the throughput
> to ~100 Mbps (verifiable not only by iperf3, but also by scp while
> transferring some files between hosts).
> 
> Also, I've tried to use YeAH instead of Reno, and it gives me the same
> results as Reno (IOW, YeAH works fine too).
> 
> Questions:
> 
> 1) is this expected?
> 2) or am I missing some extra BBR tuneable?
> 3) if it is not a regression (I don't have any previous data to compare
> with), how can I fix this?
> 4) if it is a bug in BBR, what else should I provide or check for a proper
> investigation?

I've played with BBR a little bit more and managed to narrow the issue down to 
the changes between v4.12 and v4.13. Here are my observations:

v4.12 + BBR + fq_codel == OK
v4.12 + BBR + fq       == OK
v4.13 + BBR + fq_codel == Not OK
v4.13 + BBR + fq       == OK

I think this has something to do with the internal TCP pacing implementation 
that was introduced in v4.13 (commit 218af599fa63) specifically to allow using 
BBR together with non-fq qdiscs. When BBR relies on fq, the throughput is high 
and saturates the link, but if another qdisc is in use, for instance fq_codel, 
the throughput drops. Just to be sure, I've also tried pfifo_fast instead of 
fq_codel, with the same low-throughput outcome.
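
When comparing such combinations, it helps to record which congestion control 
and default qdisc a host is actually configured with. The following is a 
minimal sketch, assuming a Linux host that exposes the usual sysctl files 
under /proc/sys; the function name and the proc_root parameter are 
illustrative and not part of any tool mentioned in this thread. Note that 
default_qdisc only affects qdiscs created afterwards, so the qdisc attached 
to a given interface may differ.

```python
from pathlib import Path
from typing import Dict, Optional

# Standard Linux sysctl files, relative to /proc/sys.
SYSCTL_FILES = {
    "congestion_control": "net/ipv4/tcp_congestion_control",
    "default_qdisc": "net/core/default_qdisc",
}


def read_tcp_settings(proc_root: str = "/proc/sys") -> Dict[str, Optional[str]]:
    """Return the configured congestion control and default qdisc.

    A value is None when the corresponding file cannot be read
    (for example on a non-Linux system).
    """
    settings: Dict[str, Optional[str]] = {}
    for name, rel_path in SYSCTL_FILES.items():
        try:
            settings[name] = (Path(proc_root) / rel_path).read_text().strip()
        except OSError:
            settings[name] = None
    return settings


if __name__ == "__main__":
    print(read_tcp_settings())
```

Logging this output next to each iperf3 run makes it easier to tell later 
which configuration produced which result.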

Unfortunately, I do not know whether this is expected or should be 
considered a regression. Thus, I'm asking for advice.

Ideas?

Thanks.

Regards,
  Oleksandr


* Re: TCP and BBR: reproducibly low cwnd and bandwidth
  2018-02-15 20:42 TCP and BBR: reproducibly low cwnd and bandwidth Oleksandr Natalenko
  2018-02-16 15:15 ` Oleksandr Natalenko
@ 2018-02-16 16:21 ` Eric Dumazet
       [not found]   ` <CADVnQymiswHBp32dcMvWd1WfYLpFqY4QTas8yABFQE7KKKc5ag@mail.gmail.com>
  1 sibling, 1 reply; 26+ messages in thread
From: Eric Dumazet @ 2018-02-16 16:21 UTC (permalink / raw)
  To: Oleksandr Natalenko, David S. Miller
  Cc: netdev, Yuchung Cheng, Soheil Hassas Yeganeh, Neal Cardwell,
	Jerry Chu, Eric Dumazet

Let's CC the BBR folks at Google, and remove the ones that probably have no
idea.



On Thu, 2018-02-15 at 21:42 +0100, Oleksandr Natalenko wrote:
> Hello.
> 
> I've faced an issue with a limited TCP bandwidth between my laptop and a 
> server in my 1 Gbps LAN while using BBR as a congestion control mechanism. To 
> verify my observations, I've set up 2 KVM VMs with the following parameters:
> 
> 1) Linux v4.15.3
> 2) virtio NICs
> 3) 128 MiB of RAM
> 4) 2 vCPUs
> 5) tested on both non-PREEMPT/100 Hz and PREEMPT/1000 Hz
> 
> The VMs are interconnected via host bridge (-netdev bridge). I was running 
> iperf3 in the default and reverse mode. Here are the results:
> 
> 1) BBR on both VMs
> 
> upload: 3.42 Gbits/sec, cwnd ~ 320 KBytes
> download: 3.39 Gbits/sec, cwnd ~ 320 KBytes
> 
> 2) Reno on both VMs
> 
> upload: 5.50 Gbits/sec, cwnd = 976 KBytes (constant)
> download: 5.22 Gbits/sec, cwnd = 1.20 MBytes (constant)
> 
> 3) Reno on client, BBR on server
> 
> upload: 5.29 Gbits/sec, cwnd = 952 KBytes (constant)
> download: 3.45 Gbits/sec, cwnd ~ 320 KBytes
> 
> 4) BBR on client, Reno on server
> 
> upload: 3.36 Gbits/sec, cwnd ~ 370 KBytes
> download: 5.21 Gbits/sec, cwnd = 887 KBytes (constant)
> 
> So, as you may see, when BBR is in use, upload rate is bad and cwnd is low. If 
> using real HW (1 Gbps LAN, laptop and server), BBR limits the throughput to 
> ~100 Mbps (verifiable not only by iperf3, but also by scp while transferring 
> some files between hosts).
> 
> Also, I've tried to use YeAH instead of Reno, and it gives me the same results 
> as Reno (IOW, YeAH works fine too).
> 
> Questions:
> 
> 1) is this expected?
> 2) or am I missing some extra BBR tuneable?
> 3) if it is not a regression (I don't have any previous data to compare with), 
> how can I fix this?
> 4) if it is a bug in BBR, what else should I provide or check for a proper 
> investigation?
> 
> Thanks.
> 
> Regards,
>   Oleksandr
> 
> 


* Re: TCP and BBR: reproducibly low cwnd and bandwidth
  2018-02-16 15:15 ` Oleksandr Natalenko
@ 2018-02-16 16:25   ` Eric Dumazet
  2018-02-16 17:37     ` Oleksandr Natalenko
  2018-02-16 16:26   ` Holger Hoffstätte
  1 sibling, 1 reply; 26+ messages in thread
From: Eric Dumazet @ 2018-02-16 16:25 UTC (permalink / raw)
  To: Oleksandr Natalenko
  Cc: David S. Miller, Alexey Kuznetsov, Hideaki YOSHIFUJI, netdev,
	LKML, Soheil Hassas Yeganeh, Neal Cardwell, Yuchung Cheng,
	Van Jacobson, Jerry Chu

On Fri, Feb 16, 2018 at 7:15 AM, Oleksandr Natalenko
<oleksandr@natalenko.name> wrote:
> Hi, David, Eric, Neal et al.
>
> On Thursday, 15 February 2018 21:42:26 CET Oleksandr Natalenko wrote:
>> I've faced an issue with a limited TCP bandwidth between my laptop and a
>> server in my 1 Gbps LAN while using BBR as a congestion control mechanism.
>> To verify my observations, I've set up 2 KVM VMs with the following
>> parameters:
>>
>> 1) Linux v4.15.3
>> 2) virtio NICs
>> 3) 128 MiB of RAM
>> 4) 2 vCPUs
>> 5) tested on both non-PREEMPT/100 Hz and PREEMPT/1000 Hz
>>
>> The VMs are interconnected via host bridge (-netdev bridge). I was running
>> iperf3 in the default and reverse mode. Here are the results:
>>
>> 1) BBR on both VMs
>>
>> upload: 3.42 Gbits/sec, cwnd ~ 320 KBytes
>> download: 3.39 Gbits/sec, cwnd ~ 320 KBytes
>>
>> 2) Reno on both VMs
>>
>> upload: 5.50 Gbits/sec, cwnd = 976 KBytes (constant)
>> download: 5.22 Gbits/sec, cwnd = 1.20 MBytes (constant)
>>
>> 3) Reno on client, BBR on server
>>
>> upload: 5.29 Gbits/sec, cwnd = 952 KBytes (constant)
>> download: 3.45 Gbits/sec, cwnd ~ 320 KBytes
>>
>> 4) BBR on client, Reno on server
>>
>> upload: 3.36 Gbits/sec, cwnd ~ 370 KBytes
>> download: 5.21 Gbits/sec, cwnd = 887 KBytes (constant)
>>
>> So, as you may see, when BBR is in use, upload rate is bad and cwnd is low.
>> If using real HW (1 Gbps LAN, laptop and server), BBR limits the throughput
>> to ~100 Mbps (verifiable not only by iperf3, but also by scp while
>> transferring some files between hosts).
>>
>> Also, I've tried to use YeAH instead of Reno, and it gives me the same
>> results as Reno (IOW, YeAH works fine too).
>>
>> Questions:
>>
>> 1) is this expected?
>> 2) or am I missing some extra BBR tuneable?
>> 3) if it is not a regression (I don't have any previous data to compare
>> with), how can I fix this?
>> 4) if it is a bug in BBR, what else should I provide or check for a proper
>> investigation?
>
> I've played with BBR a little bit more and managed to narrow the issue down to
> the changes between v4.12 and v4.13. Here are my observations:
>
> v4.12 + BBR + fq_codel == OK
> v4.12 + BBR + fq       == OK
> v4.13 + BBR + fq_codel == Not OK
> v4.13 + BBR + fq       == OK
>
> I think this has something to do with an internal TCP implementation for
> pacing, that was introduced in v4.13 (commit 218af599fa63) specifically to
> allow using BBR together with non-fq qdiscs. Once BBR relies on fq, the
> throughput is high and saturates the link, but if another qdisc is in use, for
> instance, fq_codel, the throughput drops. Just to be sure, I've also tried
> pfifo_fast instead of fq_codel with the same outcome resulting in the low
> throughput.
>
> Unfortunately, I do not know if this is something expected or should be
> considered as a regression. Thus, asking for an advice.
>
> Ideas?

The way TCP pacing works, it defaults to internal pacing using a hint
stored in the socket.

If you change the qdisc while a flow is alive, the result could be unexpected.

(The TCP socket remembers that FQ was supposed to handle the pacing.)

What results do you get if you use standard pfifo_fast?

I am asking because TCP pacing relies on high-resolution timers, and
that might be weak on your VM.
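
As a rough illustration of why timer resolution matters for pacing, the 
sketch below computes the gap between packet departures at a few pacing 
rates and compares it with the length of one scheduler tick. It is a 
simplified model under stated assumptions: one 1500-byte packet per timer 
expiry, no TSO/GSO bursts, no header overhead. The function names are 
illustrative only.

```python
def pacing_gap_ns(packet_bytes: int, rate_bps: int) -> int:
    """Nanoseconds between packet departures at the given pacing rate."""
    return packet_bytes * 8 * 1_000_000_000 // rate_bps


def tick_ns(hz: int) -> int:
    """Length of one scheduler tick in nanoseconds for a given CONFIG_HZ."""
    return 1_000_000_000 // hz


PACKET_BYTES = 1500  # assumed Ethernet-sized packet

for label, rate in [
    ("100 Mbit/s", 100_000_000),
    ("1 Gbit/s", 1_000_000_000),
    ("5 Gbit/s", 5_000_000_000),
]:
    print(f"{label}: {pacing_gap_ns(PACKET_BYTES, rate)} ns between packets")

for hz in (100, 1000):
    print(f"HZ={hz}: one tick is {tick_ns(hz)} ns")
```

In this model the per-packet gaps are orders of magnitude shorter than a 
tick at either HZ setting, which is why pacing depends on timers with much 
finer resolution than the tick.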


* Re: TCP and BBR: reproducibly low cwnd and bandwidth
  2018-02-16 15:15 ` Oleksandr Natalenko
  2018-02-16 16:25   ` Eric Dumazet
@ 2018-02-16 16:26   ` Holger Hoffstätte
  2018-02-16 16:56     ` Neal Cardwell
  2018-02-16 17:35     ` Oleksandr Natalenko
  1 sibling, 2 replies; 26+ messages in thread
From: Holger Hoffstätte @ 2018-02-16 16:26 UTC (permalink / raw)
  To: Oleksandr Natalenko, David S. Miller
  Cc: Alexey Kuznetsov, Hideaki YOSHIFUJI, netdev, linux-kernel,
	Eric Dumazet, Soheil Hassas Yeganeh, Neal Cardwell,
	Yuchung Cheng, Van Jacobson, Jerry Chu

On 02/16/18 16:15, Oleksandr Natalenko wrote:
> Hi, David, Eric, Neal et al.
> 
> On Thursday, 15 February 2018 21:42:26 CET Oleksandr Natalenko wrote:
>> I've faced an issue with a limited TCP bandwidth between my laptop and a
>> server in my 1 Gbps LAN while using BBR as a congestion control mechanism.
>> To verify my observations, I've set up 2 KVM VMs with the following
>> parameters:
>>
>> 1) Linux v4.15.3
>> 2) virtio NICs
>> 3) 128 MiB of RAM
>> 4) 2 vCPUs
>> 5) tested on both non-PREEMPT/100 Hz and PREEMPT/1000 Hz

These are very odd configurations. :)
Non-preempt/100 might well be too slow, whereas PREEMPT/1000 might simply
have too much overhead.

>> The VMs are interconnected via host bridge (-netdev bridge). I was running
>> iperf3 in the default and reverse mode. Here are the results:
>>
>> 1) BBR on both VMs
>>
>> upload: 3.42 Gbits/sec, cwnd ~ 320 KBytes
>> download: 3.39 Gbits/sec, cwnd ~ 320 KBytes
>>
>> 2) Reno on both VMs
>>
>> upload: 5.50 Gbits/sec, cwnd = 976 KBytes (constant)
>> download: 5.22 Gbits/sec, cwnd = 1.20 MBytes (constant)
>>
>> 3) Reno on client, BBR on server
>>
>> upload: 5.29 Gbits/sec, cwnd = 952 KBytes (constant)
>> download: 3.45 Gbits/sec, cwnd ~ 320 KBytes
>>
>> 4) BBR on client, Reno on server
>>
>> upload: 3.36 Gbits/sec, cwnd ~ 370 KBytes
>> download: 5.21 Gbits/sec, cwnd = 887 KBytes (constant)
>>
>> So, as you may see, when BBR is in use, upload rate is bad and cwnd is low.

BBR in general will run with lower cwnd than e.g. Cubic or others.
That's a feature and necessary for WAN transfers.

>> If using real HW (1 Gbps LAN, laptop and server), BBR limits the throughput
>> to ~100 Mbps (verifiable not only by iperf3, but also by scp while
>> transferring some files between hosts).

Something seems really wrong with your setup. I get completely
expected throughput on wired 1Gb between two hosts:

Connecting to host tux, port 5201
[  5] local 192.168.100.223 port 48718 connected to 192.168.100.222 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   113 MBytes   948 Mbits/sec    0    204 KBytes       
[  5]   1.00-2.00   sec   112 MBytes   941 Mbits/sec    0    204 KBytes       
[  5]   2.00-3.00   sec   112 MBytes   941 Mbits/sec    0    204 KBytes       
[...]

Running it locally gives the more or less expected results as well:

Connecting to host ragnarok, port 5201
[  5] local 192.168.100.223 port 54090 connected to 192.168.100.223 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  8.09 GBytes  69.5 Gbits/sec    0    512 KBytes       
[  5]   1.00-2.00   sec  8.14 GBytes  69.9 Gbits/sec    0    512 KBytes       
[  5]   2.00-3.00   sec  8.43 GBytes  72.4 Gbits/sec    0    512 KBytes       
[...]

Both hosts running 4.14.x with bbr and fq_codel (default qdisc everywhere).
In the past I only used BBR briefly for testing since at 1Gb speeds on my
LAN it was actually slightly slower than Cubic (some of those bugs were
recently addressed) and made no difference otherwise, even for uploads -
which are capped by my 50/10 DSL anyway.

Please note that BBR was developed to address the case of WAN transfers
(or more precisely high BDP paths) which often suffer from TCP throughput
collapse due to single packet loss events. While it might "work" in other
scenarios as well, strictly speaking delay-based anything is increasingly
less likely to work when there is no meaningful notion of delay - such
as on a LAN. (yes, this is very simplified..)

The BBR mailing list has several nice reports why the current BBR
implementation (dubbed v1) has a few - sometimes severe - problems.
These are being addressed as we speak.

(let me know if you want some of those tech reports by email. :)

>> Also, I've tried to use YeAH instead of Reno, and it gives me the same
>> results as Reno (IOW, YeAH works fine too).
>>
>> Questions:
>>
>> 1) is this expected?
>> 2) or am I missing some extra BBR tuneable?

No, it should work out of the box.

>> 3) if it is not a regression (I don't have any previous data to compare
>> with), how can I fix this?
>> 4) if it is a bug in BBR, what else should I provide or check for a proper
>> investigation?
> 
> I've played with BBR a little bit more and managed to narrow the issue down to 
> the changes between v4.12 and v4.13. Here are my observations:
> 
> v4.12 + BBR + fq_codel == OK
> v4.12 + BBR + fq       == OK
> v4.13 + BBR + fq_codel == Not OK
> v4.13 + BBR + fq       == OK
> 
> I think this has something to do with an internal TCP implementation for 
> pacing, that was introduced in v4.13 (commit 218af599fa63) specifically to 
> allow using BBR together with non-fq qdiscs. Once BBR relies on fq, the 
> throughput is high and saturates the link, but if another qdisc is in use, for 
> instance, fq_codel, the throughput drops. Just to be sure, I've also tried 
> pfifo_fast instead of fq_codel with the same outcome resulting in the low 
> throughput.

I'm not sure testing the old version without builtin pacing is going to help
matters in finding the actual problem. :)
Several people have reported severe performance regressions with 4.15.x,
maybe that's related. Can you test latest 4.14.x?

Out of curiosity, what is the expected use case for BBR here?

cheers
Holger


* Re: TCP and BBR: reproducibly low cwnd and bandwidth
       [not found]   ` <CADVnQymiswHBp32dcMvWd1WfYLpFqY4QTas8yABFQE7KKKc5ag@mail.gmail.com>
@ 2018-02-16 16:43     ` Eric Dumazet
  2018-02-16 16:45       ` Neal Cardwell
  2018-02-16 17:25     ` Oleksandr Natalenko
  1 sibling, 1 reply; 26+ messages in thread
From: Eric Dumazet @ 2018-02-16 16:43 UTC (permalink / raw)
  To: Neal Cardwell
  Cc: Eric Dumazet, Oleksandr Natalenko, David S. Miller, Netdev,
	Yuchung Cheng, Soheil Hassas Yeganeh, Jerry Chu

On Fri, Feb 16, 2018 at 8:33 AM, Neal Cardwell <ncardwell@google.com> wrote:
> Oleksandr,
>
> Thanks for the detailed report! Yes, this sounds like an issue in BBR. We
> have not run into this one in our team, but we will try to work with you to
> fix this.
>
> Would you be able to take a sender-side tcpdump trace of the slow BBR
> transfer ("v4.13 + BBR + fq_codel == Not OK")? Packet headers only would be
> fine. Maybe something like:
>
>   tcpdump -w /tmp/test.pcap -c1000000 -s 100 -i eth0 port $PORT
>
> Thanks!
> neal

On bare metal with the latest net tree, I get pretty normal results, at
least on a 40Gbit NIC, with pfifo_fast, fq and fq_codel.

# tc qd replace dev eth0 root fq
# ./super_netperf 1 -H lpaa24 -- -K cubic
  25627
# ./super_netperf 1 -H lpaa24 -- -K bbr
  25897
# tc qd replace dev eth0 root fq_codel
# ./super_netperf 1 -H lpaa24 -- -K cubic
  22246
# ./super_netperf 1 -H lpaa24 -- -K bbr
  25228
# tc qd replace dev eth0 root pfifo_fast
# ./super_netperf 1 -H lpaa24 -- -K cubic
  25454
# ./super_netperf 1 -H lpaa24 -- -K bbr
  25508


* Re: TCP and BBR: reproducibly low cwnd and bandwidth
  2018-02-16 16:43     ` Eric Dumazet
@ 2018-02-16 16:45       ` Neal Cardwell
  2018-02-16 17:00         ` Oleksandr Natalenko
  0 siblings, 1 reply; 26+ messages in thread
From: Neal Cardwell @ 2018-02-16 16:45 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Eric Dumazet, Oleksandr Natalenko, David S. Miller, Netdev,
	Yuchung Cheng, Soheil Hassas Yeganeh, Jerry Chu

On Fri, Feb 16, 2018 at 11:43 AM, Eric Dumazet <edumazet@google.com> wrote:
>
> On Fri, Feb 16, 2018 at 8:33 AM, Neal Cardwell <ncardwell@google.com> wrote:
> > Oleksandr,
> >
> > Thanks for the detailed report! Yes, this sounds like an issue in BBR. We
> > have not run into this one in our team, but we will try to work with you to
> > fix this.
> >
> > Would you be able to take a sender-side tcpdump trace of the slow BBR
> > transfer ("v4.13 + BBR + fq_codel == Not OK")? Packet headers only would be
> > fine. Maybe something like:
> >
> >   tcpdump -w /tmp/test.pcap -c1000000 -s 100 -i eth0 port $PORT
> >
> > Thanks!
> > neal
>
> On baremetal and using latest net tree, I get pretty normal results at
> least, on 40Gbit NIC,

Eric raises a good question: bare metal vs VMs.

Oleksandr, your first email mentioned KVM VMs and virtio NICs. Your
second e-mail did not seem to mention if those results were for bare
metal or a VM scenario: can you please clarify the details on your
second set of tests?

Thanks!


* Re: TCP and BBR: reproducibly low cwnd and bandwidth
  2018-02-16 16:26   ` Holger Hoffstätte
@ 2018-02-16 16:56     ` Neal Cardwell
  2018-02-16 17:13       ` Holger Hoffstätte
  2018-02-16 17:35     ` Oleksandr Natalenko
  1 sibling, 1 reply; 26+ messages in thread
From: Neal Cardwell @ 2018-02-16 16:56 UTC (permalink / raw)
  To: Holger Hoffstätte
  Cc: Oleksandr Natalenko, David S. Miller, Alexey Kuznetsov,
	Hideaki YOSHIFUJI, Netdev, LKML, Eric Dumazet,
	Soheil Hassas Yeganeh, Yuchung Cheng, Van Jacobson, Jerry Chu

On Fri, Feb 16, 2018 at 11:26 AM, Holger Hoffstätte
<holger@applied-asynchrony.com> wrote:
>
> BBR in general will run with lower cwnd than e.g. Cubic or others.
> That's a feature and necessary for WAN transfers.

Please note that there's no general rule about whether BBR will run
with a lower or higher cwnd than CUBIC, Reno, or other loss-based
congestion control algorithms. Whether BBR's cwnd will be lower or
higher depends on the BDP of the path, the amount of buffering in the
bottleneck, and the number of flows. BBR tries to match the amount of
in-flight data to the BDP based on the available bandwidth and the
two-way propagation delay. This will usually produce an amount of data
in flight that is smaller than CUBIC/Reno (yielding lower latency) if
the path has deep buffers (bufferbloat), but can be larger than
CUBIC/Reno (yielding higher throughput) if the buffers are shallow and
the traffic is suffering burst losses.
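
To make the BDP (bandwidth-delay product) concrete, here is a small worked 
sketch. The RTT values are assumptions chosen for illustration (200 
microseconds for a LAN, 100 ms for a WAN path); they were not measured in 
this thread, and the function name is hypothetical.

```python
def bdp_bytes(bandwidth_bps: int, rtt_us: int) -> int:
    """Bandwidth-delay product in bytes: bandwidth times round-trip time."""
    return bandwidth_bps * rtt_us // (8 * 1_000_000)


GIGABIT = 1_000_000_000  # bits per second

# Assumed RTTs, for illustration only.
lan_bdp = bdp_bytes(GIGABIT, 200)        # 200 us LAN round trip
wan_bdp = bdp_bytes(GIGABIT, 100_000)    # 100 ms WAN round trip

print(f"LAN BDP: {lan_bdp} bytes")
print(f"WAN BDP: {wan_bdp} bytes")
```

Under these assumptions the same 1 Gbit/s bottleneck needs 500 times more 
data in flight on the WAN path than on the LAN, which is why the amount of 
in-flight data that matches the BDP varies so much between paths.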

>
>>> If using real HW (1 Gbps LAN, laptop and server), BBR limits the throughput
>>> to ~100 Mbps (verifiable not only by iperf3, but also by scp while
>>> transferring some files between hosts).
>
> Something seems really wrong with your setup. I get completely
> expected throughput on wired 1Gb between two hosts:
>
> Connecting to host tux, port 5201
> [  5] local 192.168.100.223 port 48718 connected to 192.168.100.222 port 5201
> [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> [  5]   0.00-1.00   sec   113 MBytes   948 Mbits/sec    0    204 KBytes
> [  5]   1.00-2.00   sec   112 MBytes   941 Mbits/sec    0    204 KBytes
> [  5]   2.00-3.00   sec   112 MBytes   941 Mbits/sec    0    204 KBytes
> [...]
>
> Running it locally gives the more or less expected results as well:
>
> Connecting to host ragnarok, port 5201
> [  5] local 192.168.100.223 port 54090 connected to 192.168.100.223 port 5201
> [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> [  5]   0.00-1.00   sec  8.09 GBytes  69.5 Gbits/sec    0    512 KBytes
> [  5]   1.00-2.00   sec  8.14 GBytes  69.9 Gbits/sec    0    512 KBytes
> [  5]   2.00-3.00   sec  8.43 GBytes  72.4 Gbits/sec    0    512 KBytes
> [...]
>
> Both hosts running 4.14.x with bbr and fq_codel (default qdisc everywhere).

Can you please clarify if this is over bare metal or between VM
guests? It sounds like Oleksandr's initial report was between KVM VMs,
so the virtualization may be an ingredient here.

thanks,
neal


* Re: TCP and BBR: reproducibly low cwnd and bandwidth
  2018-02-16 16:45       ` Neal Cardwell
@ 2018-02-16 17:00         ` Oleksandr Natalenko
  0 siblings, 0 replies; 26+ messages in thread
From: Oleksandr Natalenko @ 2018-02-16 17:00 UTC (permalink / raw)
  To: Neal Cardwell
  Cc: Eric Dumazet, Eric Dumazet, David S. Miller, Netdev,
	Yuchung Cheng, Soheil Hassas Yeganeh, Jerry Chu

Hi!

On Friday, 16 February 2018 17:45:56 CET Neal Cardwell wrote:
> Eric raises a good question: bare metal vs VMs.
> 
> Oleksandr, your first email mentioned KVM VMs and virtio NICs. Your
> second e-mail did not seem to mention if those results were for bare
> metal or a VM scenario: can you please clarify the details on your
> second set of tests?

Ugh, so many letters simultaneously… I'll answer them one by one if you don't 
mind :).

Both the first and the second set of tests were performed on 2 KVM VMs, but 
from now on I'll test everything using real HW only to exclude the potential 
influence of virtualisation. Also, as I've already pointed out, on the real HW 
the difference is even bigger (~10 times).

Now, I'm going to answer your other emails, including the actual results 
from the real HW and the tcpdump output as requested.

Thanks!

Regards,
  Oleksandr


* Re: TCP and BBR: reproducibly low cwnd and bandwidth
  2018-02-16 16:56     ` Neal Cardwell
@ 2018-02-16 17:13       ` Holger Hoffstätte
  0 siblings, 0 replies; 26+ messages in thread
From: Holger Hoffstätte @ 2018-02-16 17:13 UTC (permalink / raw)
  To: Neal Cardwell
  Cc: Oleksandr Natalenko, David S. Miller, Alexey Kuznetsov,
	Hideaki YOSHIFUJI, Netdev, LKML, Eric Dumazet,
	Soheil Hassas Yeganeh, Yuchung Cheng, Van Jacobson, Jerry Chu

On 02/16/18 17:56, Neal Cardwell wrote:
> On Fri, Feb 16, 2018 at 11:26 AM, Holger Hoffstätte
> <holger@applied-asynchrony.com> wrote:
>>
>> BBR in general will run with lower cwnd than e.g. Cubic or others.
>> That's a feature and necessary for WAN transfers.
> 
> Please note that there's no general rule about whether BBR will run
> with a lower or higher cwnd than CUBIC, Reno, or other loss-based
> congestion control algorithms. Whether BBR's cwnd will be lower or
> higher depends on the BDP of the path, the amount of buffering in the
> bottleneck, and the number of flows. BBR tries to match the amount of
> in-flight data to the BDP based on the available bandwidth and the
> two-way propagation delay. This will usually produce an amount of data
> in flight that is smaller than CUBIC/Reno (yielding lower latency) if
> the path has deep buffers (bufferbloat), but can be larger than
> CUBIC/Reno (yielding higher throughput) if the buffers are shallow and
> the traffic is suffering burst losses.

In all my tests I've never seen it larger, but OK. Thanks for the
explanation. :)
On second reading the "necessary for WAN transfers" was phrased a bit
unfortunately, but it likely doesn't matter for Oleksandr's case
anyway..

(snip)

>> Something seems really wrong with your setup. I get completely
>> expected throughput on wired 1Gb between two hosts:
>>
>> Connecting to host tux, port 5201
>> [  5] local 192.168.100.223 port 48718 connected to 192.168.100.222 port 5201
>> [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
>> [  5]   0.00-1.00   sec   113 MBytes   948 Mbits/sec    0    204 KBytes
>> [  5]   1.00-2.00   sec   112 MBytes   941 Mbits/sec    0    204 KBytes
>> [  5]   2.00-3.00   sec   112 MBytes   941 Mbits/sec    0    204 KBytes
>> [...]
>>
>> Running it locally gives the more or less expected results as well:
>>
>> Connecting to host ragnarok, port 5201
>> [  5] local 192.168.100.223 port 54090 connected to 192.168.100.223 port 5201
>> [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
>> [  5]   0.00-1.00   sec  8.09 GBytes  69.5 Gbits/sec    0    512 KBytes
>> [  5]   1.00-2.00   sec  8.14 GBytes  69.9 Gbits/sec    0    512 KBytes
>> [  5]   2.00-3.00   sec  8.43 GBytes  72.4 Gbits/sec    0    512 KBytes
>> [...]
>>
>> Both hosts running 4.14.x with bbr and fq_codel (default qdisc everywhere).
> 
> Can you please clarify if this is over bare metal or between VM
> guests? It sounds like Oleksandr's initial report was between KVM VMs,
> so the virtualization may be an ingredient here.

These are real hosts, not VMs, wired by 1Gbit Ethernet (home office).
Like Eric said it's probably weird HZ, slow host, iffy high-res timer
(bad for both fq and fq_codel), overhead of retpoline in a VM or whatnot.

cheers
Holger


* Re: TCP and BBR: reproducibly low cwnd and bandwidth
       [not found]   ` <CADVnQymiswHBp32dcMvWd1WfYLpFqY4QTas8yABFQE7KKKc5ag@mail.gmail.com>
  2018-02-16 16:43     ` Eric Dumazet
@ 2018-02-16 17:25     ` Oleksandr Natalenko
  2018-02-16 17:56       ` Holger Hoffstätte
  2018-02-16 20:54       ` Eric Dumazet
  1 sibling, 2 replies; 26+ messages in thread
From: Oleksandr Natalenko @ 2018-02-16 17:25 UTC (permalink / raw)
  To: Neal Cardwell
  Cc: Eric Dumazet, David S. Miller, Netdev, Yuchung Cheng,
	Soheil Hassas Yeganeh, Jerry Chu, Eric Dumazet, Dave Taht

Hi.

On Friday, 16 February 2018 17:33:48 CET Neal Cardwell wrote:
> Thanks for the detailed report! Yes, this sounds like an issue in BBR. We
> have not run into this one in our team, but we will try to work with you to
> fix this.
> 
> Would you be able to take a sender-side tcpdump trace of the slow BBR
> transfer ("v4.13 + BBR + fq_codel == Not OK")? Packet headers only would be
> fine. Maybe something like:
> 
>   tcpdump -w /tmp/test.pcap -c1000000 -s 100 -i eth0 port $PORT

So, going on with two real HW hosts. They are both running the latest stock 
Arch Linux kernel (4.15.3-1-ARCH, CONFIG_PREEMPT=y, CONFIG_HZ=1000) and are 
interconnected with a 1 Gbps link (via a switch, if that matters). Using 
iperf3, running each test for 20 seconds.

Having BBR+fq_codel (or pfifo_fast, same result) on both hosts:

Client to server: 112 Mbits/sec
Server to client: 96.1 Mbits/sec

Having BBR+fq on both hosts:

Client to server: 347 Mbits/sec
Server to client: 397 Mbits/sec

Having YeAH+fq on both hosts:

Client to server: 928 Mbits/sec
Server to client: 711 Mbits/sec

(when the server generates traffic, the throughput is a little bit lower, as 
you can see, but I assume that's because the server has a low-power Silvermont 
CPU, whereas the client has an Ivy Bridge beast)
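
For reference, the relative slowdowns implied by the numbers above can be 
computed directly. This is only arithmetic on the iperf3 figures reported 
in this message; the dictionary layout and the function name are 
illustrative.

```python
# iperf3 results from this message, in Mbit/s:
# (client to server, server to client)
RESULTS = {
    "BBR+fq_codel": (112.0, 96.1),
    "BBR+fq": (347.0, 397.0),
    "YeAH+fq": (928.0, 711.0),
}


def slowdown(baseline: str, other: str) -> tuple:
    """How many times slower `other` is than `baseline`, per direction."""
    base_c2s, base_s2c = RESULTS[baseline]
    other_c2s, other_s2c = RESULTS[other]
    return (base_c2s / other_c2s, base_s2c / other_s2c)


for config in ("BBR+fq_codel", "BBR+fq"):
    c2s, s2c = slowdown("YeAH+fq", config)
    print(f"{config} vs YeAH+fq: {c2s:.1f}x slower (c2s), {s2c:.1f}x slower (s2c)")
```

The ratios show that BBR with fq is still well below YeAH with fq on this 
hardware, although the gap is much smaller than with fq_codel.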

Now, to tcpdump. I've captured two traces, one for the client-to-server flow 
(c2s) and one for the server-to-client flow (s2c), while using BBR + pfifo_fast:

# tcpdump -w test_XXX.pcap -c1000000 -s 100 -i enp2s0 port 5201

I've uploaded both files here [1].

Thanks.

Oleksandr

[1] https://natalenko.name/myfiles/bbr/


* Re: TCP and BBR: reproducibly low cwnd and bandwidth
  2018-02-16 16:26   ` Holger Hoffstätte
  2018-02-16 16:56     ` Neal Cardwell
@ 2018-02-16 17:35     ` Oleksandr Natalenko
  1 sibling, 0 replies; 26+ messages in thread
From: Oleksandr Natalenko @ 2018-02-16 17:35 UTC (permalink / raw)
  To: Holger Hoffstätte
  Cc: David S. Miller, Alexey Kuznetsov, Hideaki YOSHIFUJI, netdev,
	linux-kernel, Eric Dumazet, Soheil Hassas Yeganeh, Neal Cardwell,
	Yuchung Cheng, Van Jacobson, Jerry Chu

Hi.

On Friday, 16 February 2018 17:26:11 CET Holger Hoffstätte wrote:
> These are very odd configurations. :)
> Non-preempt/100 might well be too slow, whereas PREEMPT/1000 might simply
> have too much overhead.

Since the pacing is based on hrtimers, should HZ matter at all? Even if it 
does, a poor 1 Gbps link surely shouldn't drop below 100 Mbps.

> BBR in general will run with lower cwnd than e.g. Cubic or others.
> That's a feature and necessary for WAN transfers.

Okay, got it.

> Something seems really wrong with your setup. I get completely
> expected throughput on wired 1Gb between two hosts:
> /* snip */

Yes, and that's strange :/. And that's why I'm wondering what I am missing 
since things cannot be *that* bad.

> /* snip */
> Please note that BBR was developed to address the case of WAN transfers
> (or more precisely high BDP paths) which often suffer from TCP throughput
> collapse due to single packet loss events. While it might "work" in other
> scenarios as well, strictly speaking delay-based anything is increasingly
> less likely to work when there is no meaningful notion of delay - such
> as on a LAN. (yes, this is very simplified..)
> 
> The BBR mailing list has several nice reports why the current BBR
> implementation (dubbed v1) has a few - sometimes severe - problems.
> These are being addressed as we speak.
> 
> (let me know if you want some of those tech reports by email. :)

Well, yes, please, why not :).

> /* snip */
> I'm not sure testing the old version without builtin pacing is going to help
> matters in finding the actual problem. :)
> Several people have reported severe performance regressions with 4.15.x,
> maybe that's related. Can you test latest 4.14.x?

I observed this on v4.14 too, but didn't pay much attention until I realised 
that things look definitely wrong.

> Out of curiosity, what is the expected use case for BBR here?

Nothing special, just assumed it could be set as a default for both WAN and 
LAN usage.

Regards,
  Oleksandr


* Re: TCP and BBR: reproducibly low cwnd and bandwidth
  2018-02-16 16:25   ` Eric Dumazet
@ 2018-02-16 17:37     ` Oleksandr Natalenko
  0 siblings, 0 replies; 26+ messages in thread
From: Oleksandr Natalenko @ 2018-02-16 17:37 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S. Miller, Alexey Kuznetsov, Hideaki YOSHIFUJI, netdev,
	LKML, Soheil Hassas Yeganeh, Neal Cardwell, Yuchung Cheng,
	Van Jacobson, Jerry Chu

Hi.

On Friday, 16 February 2018 17:25:58 CET Eric Dumazet wrote:
> The way TCP pacing works, it defaults to internal pacing using a hint
> stored in the socket.
> 
> If you change the qdisc while flow is alive, result could be unexpected.

I don't change a qdisc while flow is alive. Either the VM is completely 
restarted, or iperf3 is restarted on both sides.

> (TCP socket remembers that one FQ was supposed to handle the pacing)
> 
> What results do you have if you use standard pfifo_fast ?

Almost the same as with fq_codel (see my previous email with numbers).

> I am asking because TCP pacing relies on High resolution timers, and
> that might be weak on your VM.

Also, I've switched to measuring things on real HW only (see also my previous 
email with numbers).

Thanks.

Regards,
  Oleksandr


* Re: TCP and BBR: reproducibly low cwnd and bandwidth
  2018-02-16 17:25     ` Oleksandr Natalenko
@ 2018-02-16 17:56       ` Holger Hoffstätte
  2018-02-16 19:54         ` Oleksandr Natalenko
  2018-02-16 20:54       ` Eric Dumazet
  1 sibling, 1 reply; 26+ messages in thread
From: Holger Hoffstätte @ 2018-02-16 17:56 UTC (permalink / raw)
  To: Oleksandr Natalenko, Neal Cardwell
  Cc: Eric Dumazet, David S. Miller, Netdev, Yuchung Cheng,
	Soheil Hassas Yeganeh, Jerry Chu, Eric Dumazet, Dave Taht

On 02/16/18 18:25, Oleksandr Natalenko wrote:
> So, going on with two real HW hosts. They are both running latest stock Arch 
> Linux kernel (4.15.3-1-ARCH, CONFIG_PREEMPT=y, CONFIG_HZ=1000) and are 
> interconnected with 1 Gbps link (via switch if that matters). Using iperf3, 
> running each test for 20 seconds.
> 
> Having BBR+fq_codel (or pfifo_fast, same result) on both hosts:
> 
> Client to server: 112 Mbits/sec
> Server to client: 96.1 Mbits/sec
> 
> Having BBR+fq on both hosts:
> 
> Client to server: 347 Mbits/sec
> Server to client: 397 Mbits/sec
> 
> Having YeAH+fq on both hosts:
> 
> Client to server: 928 Mbits/sec
> Server to client: 711 Mbits/sec
> 
> (when the server generates traffic, the throughput is a little bit lower, as 
> you can see, but I assume that's because I have there low-power Silvermont 
> CPU, when the client has Ivy Bridge beast)

There is simply no reason why you shouldn't get approx. line rate (~920+-ish)
Mbit over wired 1GBit Ethernet; even my broken 10-year old Core2Duo laptop can
do that. Can you boot with spectre_v2=off and try "the simplest case" with the
defaults cubic/pfifo_fast? spectre_v2 has terrible performance impact esp.
on small/older processors.

When I last benchmarked full PREEMPT with 4.9.x it was similarly bad and also
had a noticeable network throughput impact even on my i7.

Also congratulations for being the only other person I know who ever tried
YeAH. :-)

cheers
Holger


* Re: TCP and BBR: reproducibly low cwnd and bandwidth
  2018-02-16 17:56       ` Holger Hoffstätte
@ 2018-02-16 19:54         ` Oleksandr Natalenko
  0 siblings, 0 replies; 26+ messages in thread
From: Oleksandr Natalenko @ 2018-02-16 19:54 UTC (permalink / raw)
  To: Holger Hoffstätte
  Cc: Neal Cardwell, Eric Dumazet, David S. Miller, Netdev,
	Yuchung Cheng, Soheil Hassas Yeganeh, Jerry Chu, Eric Dumazet,
	Dave Taht

Hi.

On Friday, 16 February 2018 18:56:12 CET Holger Hoffstätte wrote:
> There is simply no reason why you shouldn't get approx. line rate
> (~920+-ish) Mbit over wired 1GBit Ethernet; even my broken 10-year old
> Core2Duo laptop can do that. Can you boot with spectre_v2=off and try "the
> simplest case" with the defaults cubic/pfifo_fast? spectre_v2 has terrible
> performance impact esp. on small/older processors.

I've just tried it. No visible difference.

> When I last benchmarked full PREEMPT with 4.9.x it was similarly bad and
> also had a noticeable network throughput impact even on my i7.
> 
> Also congratulations for being the only other person I know who ever tried
> YeAH. :-)

Well, according to the git log on tcp_yeah.c and Reported-by tag, I was not 
the only one there ;).

Regards,
  Oleksandr


* Re: TCP and BBR: reproducibly low cwnd and bandwidth
  2018-02-16 17:25     ` Oleksandr Natalenko
  2018-02-16 17:56       ` Holger Hoffstätte
@ 2018-02-16 20:54       ` Eric Dumazet
  2018-02-16 22:50         ` Eric Dumazet
  2018-02-16 22:50         ` Oleksandr Natalenko
  1 sibling, 2 replies; 26+ messages in thread
From: Eric Dumazet @ 2018-02-16 20:54 UTC (permalink / raw)
  To: Oleksandr Natalenko
  Cc: Neal Cardwell, Eric Dumazet, David S. Miller, Netdev,
	Yuchung Cheng, Soheil Hassas Yeganeh, Jerry Chu, Dave Taht

On Fri, Feb 16, 2018 at 9:25 AM, Oleksandr Natalenko
<oleksandr@natalenko.name> wrote:
> Hi.
>
> On Friday, 16 February 2018 17:33:48 CET Neal Cardwell wrote:
>> Thanks for the detailed report! Yes, this sounds like an issue in BBR. We
>> have not run into this one in our team, but we will try to work with you to
>> fix this.
>>
>> Would you be able to take a sender-side tcpdump trace of the slow BBR
>> transfer ("v4.13 + BBR + fq_codel == Not OK")? Packet headers only would be
>> fine. Maybe something like:
>>
>>   tcpdump -w /tmp/test.pcap -c1000000 -s 100 -i eth0 port $PORT
>
> So, going on with two real HW hosts. They are both running latest stock Arch
> Linux kernel (4.15.3-1-ARCH, CONFIG_PREEMPT=y, CONFIG_HZ=1000) and are
> interconnected with 1 Gbps link (via switch if that matters). Using iperf3,
> running each test for 20 seconds.
>
> Having BBR+fq_codel (or pfifo_fast, same result) on both hosts:
>
> Client to server: 112 Mbits/sec
> Server to client: 96.1 Mbits/sec
>
> Having BBR+fq on both hosts:
>
> Client to server: 347 Mbits/sec
> Server to client: 397 Mbits/sec
>
> Having YeAH+fq on both hosts:

> [1] https://natalenko.name/myfiles/bbr/
>

Something fishy really :

09:18:31.449903 IP 172.29.28.1.5201 > 172.29.28.55.14936: Flags [P.],
seq 76745:79641, ack 38, win 227, options [nop,nop,TS val 2327043753
ecr 3190508870], length 2896
09:18:31.449916 IP 172.29.28.55.14936 > 172.29.28.1.5201: Flags [.],
ack 79641, win 1011, options [nop,nop,TS val 3190508870 ecr
2327043753], length 0
09:18:31.449925 IP 172.29.28.1.5201 > 172.29.28.55.14936: Flags [.],
seq 79641:83985, ack 38, win 227, options [nop,nop,TS val 2327043753
ecr 3190508870], length 4344
09:18:31.449936 IP 172.29.28.55.14936 > 172.29.28.1.5201: Flags [.],
ack 83985, win 987, options [nop,nop,TS val 3190508870 ecr
2327043753], length 0
09:18:31.450112 IP 172.29.28.1.5201 > 172.29.28.55.14936: Flags [.],
seq 83985:86881, ack 38, win 227, options [nop,nop,TS val 2327043753
ecr 3190508870], length 2896
09:18:31.450124 IP 172.29.28.55.14936 > 172.29.28.1.5201: Flags [.],
ack 86881, win 971, options [nop,nop,TS val 3190508871 ecr
2327043753], length 0
09:18:31.450299 IP 172.29.28.1.5201 > 172.29.28.55.14936: Flags [.],
seq 86881:91225, ack 38, win 227, options [nop,nop,TS val 2327043753
ecr 3190508870], length 4344
09:18:31.450313 IP 172.29.28.55.14936 > 172.29.28.1.5201: Flags [.],
ack 91225, win 947, options [nop,nop,TS val 3190508871 ecr
2327043753], length 0
09:18:31.450491 IP 172.29.28.1.5201 > 172.29.28.55.14936: Flags [P.],
seq 91225:92673, ack 38, win 227, options [nop,nop,TS val 2327043753
ecr 3190508870], length 1448
09:18:31.450505 IP 172.29.28.1.5201 > 172.29.28.55.14936: Flags [.],
seq 92673:94121, ack 38, win 227, options [nop,nop,TS val 2327043753
ecr 3190508871], length 1448
09:18:31.450511 IP 172.29.28.1.5201 > 172.29.28.55.14936: Flags [P.],
seq 94121:95569, ack 38, win 227, options [nop,nop,TS val 2327043754
ecr 3190508871], length 1448
09:18:31.450720 IP 172.29.28.1.5201 > 172.29.28.55.14936: Flags [.],
seq 95569:101361, ack 38, win 227, options [nop,nop,TS val 2327043754
ecr 3190508871], length 5792
09:18:31.450932 IP 172.29.28.1.5201 > 172.29.28.55.14936: Flags [.],
seq 101361:105705, ack 38, win 227, options [nop,nop,TS val 2327043754
ecr 3190508871], length 4344
09:18:31.451132 IP 172.29.28.1.5201 > 172.29.28.55.14936: Flags [.],
seq 105705:110049, ack 38, win 227, options [nop,nop,TS val 2327043754
ecr 3190508871], length 4344
09:18:31.451342 IP 172.29.28.1.5201 > 172.29.28.55.14936: Flags [.],
seq 110049:111497, ack 38, win 227, options [nop,nop,TS val 2327043754
ecr 3190508871], length 1448
09:18:31.455841 IP 172.29.28.1.5201 > 172.29.28.55.14936: Flags [.],
seq 111497:112945, ack 38, win 227, options [nop,nop,TS val 2327043759
ecr 3190508871], length 1448

Not only does the receiver suddenly add a delay of about 25 ms, but note also
that it acknowledges all prior segments (ack 112945) with a wrong ecr
value (2327043753) instead of 2327043759:

09:18:31.482238 IP 172.29.28.55.14936 > 172.29.28.1.5201: Flags [.],
ack 112945, win 1111, options [nop,nop,TS val 3190508903 ecr
2327043753], length 0
09:18:31.482704 IP 172.29.28.1.5201 > 172.29.28.55.14936: Flags [.],
seq 112945:114393, ack 38, win 227, options [nop,nop,TS val 2327043786
ecr 3190508903], length 1448
09:18:31.482734 IP 172.29.28.55.14936 > 172.29.28.1.5201: Flags [.],
ack 114393, win 1134, options [nop,nop,TS val 3190508903 ecr
2327043786], length 0
09:18:31.482802 IP 172.29.28.1.5201 > 172.29.28.55.14936: Flags [.],
seq 114393:117289, ack 38, win 227, options [nop,nop,TS val 2327043786
ecr 3190508903], length 2896
09:18:31.482813 IP 172.29.28.55.14936 > 172.29.28.1.5201: Flags [.],
ack 117289, win 1179, options [nop,nop,TS val 3190508903 ecr
2327043786], length 0
09:18:31.483138 IP 172.29.28.1.5201 > 172.29.28.55.14936: Flags [.],
seq 117289:120185, ack 38, win 227, options [nop,nop,TS val 2327043786
ecr 3190508903], length 2896
09:18:31.483158 IP 172.29.28.55.14936 > 172.29.28.1.5201: Flags [.],
ack 120185, win 1224, options [nop,nop,TS val 3190508904 ecr
2327043786], length 0
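
The stall in the trace above can be quantified from two of its timestamps:
the last data segment leaves at 09:18:31.455841 and the cumulative ACK that
covers it shows up at 09:18:31.482238. A minimal Python sketch of that
arithmetic follows. The helper name gap_ms is hypothetical, and reading the
TS val/ecr difference as milliseconds assumes a 1 ms TCP timestamp clock.

```python
from datetime import datetime, timedelta

def gap_ms(earlier: str, later: str) -> float:
    """Milliseconds between two tcpdump wall-clock stamps (HH:MM:SS.ffffff)."""
    fmt = "%H:%M:%S.%f"
    delta = datetime.strptime(later, fmt) - datetime.strptime(earlier, fmt)
    return delta / timedelta(milliseconds=1)

# Last data segment (seq 111497:112945) and the ACK that finally covers it.
ack_delay = gap_ms("09:18:31.455841", "09:18:31.482238")

# The ACK echoes TS val 2327043753, while the newest segment carried 2327043759.
# Assumption: one timestamp tick equals 1 ms.
ecr_lag_ticks = 2327043759 - 2327043753

print(f"ACK delay: {ack_delay:.3f} ms, ecr lag: {ecr_lag_ticks} ticks")
```

The gap comes out at roughly 26.4 ms, and the echoed timestamp is 6 ticks
behind the newest one sent.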


* Re: TCP and BBR: reproducibly low cwnd and bandwidth
  2018-02-16 20:54       ` Eric Dumazet
@ 2018-02-16 22:50         ` Eric Dumazet
  2018-02-16 23:06           ` Oleksandr Natalenko
  2018-02-16 22:50         ` Oleksandr Natalenko
  1 sibling, 1 reply; 26+ messages in thread
From: Eric Dumazet @ 2018-02-16 22:50 UTC (permalink / raw)
  To: Eric Dumazet, Oleksandr Natalenko
  Cc: Neal Cardwell, David S. Miller, Netdev, Yuchung Cheng,
	Soheil Hassas Yeganeh, Jerry Chu, Dave Taht

On Fri, 2018-02-16 at 12:54 -0800, Eric Dumazet wrote:
> On Fri, Feb 16, 2018 at 9:25 AM, Oleksandr Natalenko
> <oleksandr@natalenko.name> wrote:
> > Hi.
> > 
> > On Friday, 16 February 2018 17:33:48 CET Neal Cardwell wrote:
> > > Thanks for the detailed report! Yes, this sounds like an issue in BBR. We
> > > have not run into this one in our team, but we will try to work with you to
> > > fix this.
> > > 
> > > Would you be able to take a sender-side tcpdump trace of the slow BBR
> > > transfer ("v4.13 + BBR + fq_codel == Not OK")? Packet headers only would be
> > > fine. Maybe something like:
> > > 
> > >   tcpdump -w /tmp/test.pcap -c1000000 -s 100 -i eth0 port $PORT
> > 
> > So, going on with two real HW hosts. They are both running latest stock Arch
> > Linux kernel (4.15.3-1-ARCH, CONFIG_PREEMPT=y, CONFIG_HZ=1000) and are
> > interconnected with 1 Gbps link (via switch if that matters). Using iperf3,
> > running each test for 20 seconds.
> > 
> > Having BBR+fq_codel (or pfifo_fast, same result) on both hosts:
> > 
> > Client to server: 112 Mbits/sec
> > Server to client: 96.1 Mbits/sec
> > 
> > Having BBR+fq on both hosts:
> > 
> > Client to server: 347 Mbits/sec
> > Server to client: 397 Mbits/sec
> > 
> > Having YeAH+fq on both hosts:
> > [1] https://natalenko.name/myfiles/bbr/
> > 
> 
> Something fishy really :
> 
> 09:18:31.449903 IP 172.29.28.1.5201 > 172.29.28.55.14936: Flags [P.],
> seq 76745:79641, ack 38, win 227, options [nop,nop,TS val 2327043753
> ecr 3190508870], length 2896
> 09:18:31.449916 IP 172.29.28.55.14936 > 172.29.28.1.5201: Flags [.],
> ack 79641, win 1011, options [nop,nop,TS val 3190508870 ecr
> 2327043753], length 0
> 09:18:31.449925 IP 172.29.28.1.5201 > 172.29.28.55.14936: Flags [.],
> seq 79641:83985, ack 38, win 227, options [nop,nop,TS val 2327043753
> ecr 3190508870], length 4344
> 09:18:31.449936 IP 172.29.28.55.14936 > 172.29.28.1.5201: Flags [.],
> ack 83985, win 987, options [nop,nop,TS val 3190508870 ecr
> 2327043753], length 0
> 09:18:31.450112 IP 172.29.28.1.5201 > 172.29.28.55.14936: Flags [.],
> seq 83985:86881, ack 38, win 227, options [nop,nop,TS val 2327043753
> ecr 3190508870], length 2896
> 09:18:31.450124 IP 172.29.28.55.14936 > 172.29.28.1.5201: Flags [.],
> ack 86881, win 971, options [nop,nop,TS val 3190508871 ecr
> 2327043753], length 0
> 09:18:31.450299 IP 172.29.28.1.5201 > 172.29.28.55.14936: Flags [.],
> seq 86881:91225, ack 38, win 227, options [nop,nop,TS val 2327043753
> ecr 3190508870], length 4344
> 09:18:31.450313 IP 172.29.28.55.14936 > 172.29.28.1.5201: Flags [.],
> ack 91225, win 947, options [nop,nop,TS val 3190508871 ecr
> 2327043753], length 0
> 09:18:31.450491 IP 172.29.28.1.5201 > 172.29.28.55.14936: Flags [P.],
> seq 91225:92673, ack 38, win 227, options [nop,nop,TS val 2327043753
> ecr 3190508870], length 1448
> 09:18:31.450505 IP 172.29.28.1.5201 > 172.29.28.55.14936: Flags [.],
> seq 92673:94121, ack 38, win 227, options [nop,nop,TS val 2327043753
> ecr 3190508871], length 1448
> 09:18:31.450511 IP 172.29.28.1.5201 > 172.29.28.55.14936: Flags [P.],
> seq 94121:95569, ack 38, win 227, options [nop,nop,TS val 2327043754
> ecr 3190508871], length 1448
> 09:18:31.450720 IP 172.29.28.1.5201 > 172.29.28.55.14936: Flags [.],
> seq 95569:101361, ack 38, win 227, options [nop,nop,TS val 2327043754
> ecr 3190508871], length 5792
> 09:18:31.450932 IP 172.29.28.1.5201 > 172.29.28.55.14936: Flags [.],
> seq 101361:105705, ack 38, win 227, options [nop,nop,TS val 2327043754
> ecr 3190508871], length 4344
> 09:18:31.451132 IP 172.29.28.1.5201 > 172.29.28.55.14936: Flags [.],
> seq 105705:110049, ack 38, win 227, options [nop,nop,TS val 2327043754
> ecr 3190508871], length 4344
> 09:18:31.451342 IP 172.29.28.1.5201 > 172.29.28.55.14936: Flags [.],
> seq 110049:111497, ack 38, win 227, options [nop,nop,TS val 2327043754
> ecr 3190508871], length 1448
> 09:18:31.455841 IP 172.29.28.1.5201 > 172.29.28.55.14936: Flags [.],
> seq 111497:112945, ack 38, win 227, options [nop,nop,TS val 2327043759
> ecr 3190508871], length 1448
> 
> Not only the receiver suddenly adds a 25 ms delay, but also note that
> it acknowledges all prior segments (ack 112945), but with a wrong ecr
> value ( 2327043753 )
> instead of 2327043759

If you use 

tcptrace -R test_s2c.pcap
xplot.org d2c_rtt.xpl

Then you'll see plenty of suspect 40ms rtt samples.

It looks like receiver misses wakeups for some reason,
and only the TCP delayed ACK timer is helping.

So it does not look like a sender side issue to me.
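
For context, Linux's minimum delayed-ACK timeout (TCP_DELACK_MIN, HZ/25) is
40 ms, which is why RTT samples clustered around that value point at the
delayed ACK timer. A rough Python sketch of how such samples could be picked
out of a list of RTTs follows. The function name, the 5 ms tolerance and the
sample values are all made up for illustration; none of them come from
test_s2c.pcap.

```python
DELACK_MIN_MS = 40.0  # Linux TCP_DELACK_MIN is HZ/25, i.e. 40 ms

def delack_like(rtts_ms, tol_ms=5.0):
    """Return the RTT samples that sit within tol_ms of the delayed-ACK minimum."""
    return [r for r in rtts_ms if abs(r - DELACK_MIN_MS) <= tol_ms]

# Synthetic samples for illustration only, NOT taken from the capture.
samples = [0.3, 0.4, 40.1, 0.5, 39.8, 41.2, 0.3]
suspect = delack_like(samples)
print(f"{len(suspect)} of {len(samples)} samples look delayed-ACK driven")
```

On a LAN where a normal RTT is well under a millisecond, anything near 40 ms
stands out clearly, so the tolerance does not need to be tight.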


* Re: TCP and BBR: reproducibly low cwnd and bandwidth
  2018-02-16 20:54       ` Eric Dumazet
  2018-02-16 22:50         ` Eric Dumazet
@ 2018-02-16 22:50         ` Oleksandr Natalenko
  2018-02-16 22:59           ` Eric Dumazet
  1 sibling, 1 reply; 26+ messages in thread
From: Oleksandr Natalenko @ 2018-02-16 22:50 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Neal Cardwell, Eric Dumazet, David S. Miller, Netdev,
	Yuchung Cheng, Soheil Hassas Yeganeh, Jerry Chu, Dave Taht

Hi.

On Friday, 16 February 2018 21:54:05 CET Eric Dumazet wrote:
> /* snip */
> Something fishy really :
> /* snip */
> Not only the receiver suddenly adds a 25 ms delay, but also note that
> it acknowledges all prior segments (ack 112945), but with a wrong ecr
> value ( 2327043753 )
> instead of 2327043759
> /* snip */

Eric has encouraged me to look closer at what ethtool reports, and I've 
just had some free time to play with it. I've found out that enabling 
scatter-gather (ethtool -K enp3s0 sg on; it is disabled by default on both 
hosts) brings the throughput back to normal even with BBR+fq_codel.

Whyyyy? What's the deal BBR has with sg?

Oleksandr


* Re: TCP and BBR: reproducibly low cwnd and bandwidth
  2018-02-16 22:50         ` Oleksandr Natalenko
@ 2018-02-16 22:59           ` Eric Dumazet
  2018-02-17 10:01             ` Oleksandr Natalenko
  0 siblings, 1 reply; 26+ messages in thread
From: Eric Dumazet @ 2018-02-16 22:59 UTC (permalink / raw)
  To: Oleksandr Natalenko
  Cc: Neal Cardwell, Eric Dumazet, David S. Miller, Netdev,
	Yuchung Cheng, Soheil Hassas Yeganeh, Jerry Chu, Dave Taht

On Fri, Feb 16, 2018 at 2:50 PM, Oleksandr Natalenko
<oleksandr@natalenko.name> wrote:
> Hi.
>
> On Friday, 16 February 2018 21:54:05 CET Eric Dumazet wrote:
>> /* snip */
>> Something fishy really :
>> /* snip */
>> Not only the receiver suddenly adds a 25 ms delay, but also note that
>> it acknowledges all prior segments (ack 112945), but with a wrong ecr
>> value ( 2327043753 )
>> instead of 2327043759
>> /* snip */
>
> Eric has encouraged me to look closer at what's there in the ethtool, and I've
> just had a free time to play with it. I've found out that enabling scatter-
> gather (ethtool -K enp3s0 sg on, it is disabled by default on both hosts)
> brings the throughput back to normal even with BBR+fq_codel.
>
> Whyyyy? What's the deal BBR has with sg?

Well, no effect  here on e1000e (1 Gbit) at least

# ethtool -K eth3 sg off
Actual changes:
scatter-gather: off
tx-scatter-gather: off
tcp-segmentation-offload: off
tx-tcp-segmentation: off [requested on]
tx-tcp6-segmentation: off [requested on]
generic-segmentation-offload: off [requested on]

# tc qd replace dev eth3 root pfifo_fast
# ./super_netperf 1 -H 7.7.7.84 -- -K cubic
    941
# ./super_netperf 1 -H 7.7.7.84 -- -K bbr
    941
# tc qd replace dev eth3 root fq
# ./super_netperf 1 -H 7.7.7.84 -- -K cubic
    941
# ./super_netperf 1 -H 7.7.7.84 -- -K bbr
    941
# tc qd replace dev eth3 root fq_codel
# ./super_netperf 1 -H 7.7.7.84 -- -K cubic
    941
# ./super_netperf 1 -H 7.7.7.84 -- -K bbr
    941
#


* Re: TCP and BBR: reproducibly low cwnd and bandwidth
  2018-02-16 22:50         ` Eric Dumazet
@ 2018-02-16 23:06           ` Oleksandr Natalenko
  0 siblings, 0 replies; 26+ messages in thread
From: Oleksandr Natalenko @ 2018-02-16 23:06 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Eric Dumazet, Neal Cardwell, David S. Miller, Netdev,
	Yuchung Cheng, Soheil Hassas Yeganeh, Jerry Chu, Dave Taht

On Friday, 16 February 2018 23:50:35 CET Eric Dumazet wrote:
> /* snip */
> If you use
> 
> tcptrace -R test_s2c.pcap
> xplot.org d2c_rtt.xpl
> 
> Then you'll see plenty of suspect 40ms rtt samples.

That's odd, including how uniform they look.

> It looks like receiver misses wakeups for some reason,
> and only the TCP delayed ACK timer is helping.
> 
> So it does not look like a sender side issue to me.

To make things even more complicated, I've disabled sg on the server, leaving 
it enabled on the client:

client to server flow: 935 Mbits/sec
server to client flow: 72.5 Mbits/sec

So still, to me it looks like a sender issue. No?


* Re: TCP and BBR: reproducibly low cwnd and bandwidth
  2018-02-16 22:59           ` Eric Dumazet
@ 2018-02-17 10:01             ` Oleksandr Natalenko
  2018-02-17 18:52               ` Eric Dumazet
  0 siblings, 1 reply; 26+ messages in thread
From: Oleksandr Natalenko @ 2018-02-17 10:01 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Neal Cardwell, Eric Dumazet, David S. Miller, Netdev,
	Yuchung Cheng, Soheil Hassas Yeganeh, Jerry Chu, Dave Taht

Hi.

On Friday, 16 February 2018 23:59:52 CET Eric Dumazet wrote:
> Well, no effect  here on e1000e (1 Gbit) at least
> 
> # ethtool -K eth3 sg off
> Actual changes:
> scatter-gather: off
> tx-scatter-gather: off
> tcp-segmentation-offload: off
> tx-tcp-segmentation: off [requested on]
> tx-tcp6-segmentation: off [requested on]
> generic-segmentation-offload: off [requested on]
> 
> # tc qd replace dev eth3 root pfifo_fast
> # ./super_netperf 1 -H 7.7.7.84 -- -K cubic
>     941
> # ./super_netperf 1 -H 7.7.7.84 -- -K bbr
>     941
> # tc qd replace dev eth3 root fq
> # ./super_netperf 1 -H 7.7.7.84 -- -K cubic
>     941
> # ./super_netperf 1 -H 7.7.7.84 -- -K bbr
>     941
> # tc qd replace dev eth3 root fq_codel
> # ./super_netperf 1 -H 7.7.7.84 -- -K cubic
>     941
> # ./super_netperf 1 -H 7.7.7.84 -- -K bbr
>     941
> #

That really looks strange to me. I'm able to reproduce the effect caused by 
disabling scatter-gather even on the VM (using iperf3, as usual):

BBR+fq_codel:
sg on:  4.23 Gbits/sec
sg off: 121 Mbits/sec

BBR+fq:
sg on:  6.38 Gbits/sec
sg off: 437 Mbits/sec

Reno+fq_codel:
sg on:  6.74 Gbits/sec
sg off: 1.37 Gbits/sec

Reno+fq:
sg on:  6.53 Gbits/sec
sg off: 1.19 Gbits/sec

Regardless of which congestion control algorithm and qdisc are in use, the 
throughput drops when sg is disabled, but with BBR, especially with a non-fq 
qdisc, it drops the most.
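
To put numbers on "drops the most", the slowdown factors can be worked out
directly from the figures above. This is just arithmetic on the reported
iperf3 results (converted to Mbits/sec), sketched in Python:

```python
# Throughput in Mbits/sec, copied from the figures above: (sg on, sg off).
results = {
    "BBR+fq_codel": (4230, 121),
    "BBR+fq": (6380, 437),
    "Reno+fq_codel": (6740, 1370),
    "Reno+fq": (6530, 1190),
}

# Slowdown factor caused by disabling scatter-gather.
slowdown = {name: on / off for name, (on, off) in results.items()}

for name, factor in sorted(slowdown.items(), key=lambda kv: -kv[1]):
    print(f"{name:14s} {factor:5.1f}x slower with sg off")
```

That is about 35x for BBR+fq_codel and 14.6x for BBR+fq, against roughly 5x
for both Reno cases.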

Oleksandr


* Re: TCP and BBR: reproducibly low cwnd and bandwidth
  2018-02-17 10:01             ` Oleksandr Natalenko
@ 2018-02-17 18:52               ` Eric Dumazet
  2018-02-18 21:04                 ` Eric Dumazet
  0 siblings, 1 reply; 26+ messages in thread
From: Eric Dumazet @ 2018-02-17 18:52 UTC (permalink / raw)
  To: Oleksandr Natalenko, Eric Dumazet
  Cc: Neal Cardwell, David S. Miller, Netdev, Yuchung Cheng,
	Soheil Hassas Yeganeh, Jerry Chu, Dave Taht

On Sat, 2018-02-17 at 11:01 +0100, Oleksandr Natalenko wrote:
> Hi.
> 
> On Friday, 16 February 2018 23:59:52 CET Eric Dumazet wrote:
> > Well, no effect  here on e1000e (1 Gbit) at least
> > 
> > # ethtool -K eth3 sg off
> > Actual changes:
> > scatter-gather: off
> > tx-scatter-gather: off
> > tcp-segmentation-offload: off
> > tx-tcp-segmentation: off [requested on]
> > tx-tcp6-segmentation: off [requested on]
> > generic-segmentation-offload: off [requested on]
> > 
> > # tc qd replace dev eth3 root pfifo_fast
> > # ./super_netperf 1 -H 7.7.7.84 -- -K cubic
> >     941
> > # ./super_netperf 1 -H 7.7.7.84 -- -K bbr
> >     941
> > # tc qd replace dev eth3 root fq
> > # ./super_netperf 1 -H 7.7.7.84 -- -K cubic
> >     941
> > # ./super_netperf 1 -H 7.7.7.84 -- -K bbr
> >     941
> > # tc qd replace dev eth3 root fq_codel
> > # ./super_netperf 1 -H 7.7.7.84 -- -K cubic
> >     941
> > # ./super_netperf 1 -H 7.7.7.84 -- -K bbr
> >     941
> > #
> 
> That really looks strange to me. I'm able to reproduce the effect caused by 
> disabling scatter-gather even on the VM (using iperf3, as usual):

This must be some race condition in the code I added in TCP for
self-pacing, when a short timeout is programmed.

Disabling SG means TCP cooks 1-MSS packets.
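
To give a feel for what 1-MSS packets mean for pacing, here is a
back-of-the-envelope sketch in Python. It assumes a 1448-byte MSS (the
segment size seen in the tcpdump output), a pacing rate equal to the
1 Gbit/s link rate, and a hypothetical 64 KiB burst for the case where
segmentation offload is available. It is not how the kernel actually sizes
its bursts.

```python
def pacing_gap_us(burst_bytes: int, rate_bps: float) -> float:
    """Time between departures when each departure carries burst_bytes at rate_bps."""
    return burst_bytes * 8 / rate_bps * 1e6

RATE_BPS = 1e9          # assumed pacing rate: 1 Gbit/s line rate
MSS = 1448              # payload size seen in the tcpdump output
BURST = 64 * 1024       # hypothetical GSO/TSO burst size

one_mss_gap = pacing_gap_us(MSS, RATE_BPS)
burst_gap = pacing_gap_us(BURST, RATE_BPS)

print(f"1-MSS packets: one departure every {one_mss_gap:.1f} us")
print(f"64 KiB bursts: one departure every {burst_gap:.1f} us")
```

Under these assumptions the pacing interval shrinks from about 524 us to
under 12 us, so each programmed timeout becomes very short.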

I will take a look, probably after the (long) week-end : Tuesday.

Thanks !


* Re: TCP and BBR: reproducibly low cwnd and bandwidth
  2018-02-17 18:52               ` Eric Dumazet
@ 2018-02-18 21:04                 ` Eric Dumazet
  2018-02-18 21:06                   ` Eric Dumazet
  2018-02-18 21:49                   ` Oleksandr Natalenko
  0 siblings, 2 replies; 26+ messages in thread
From: Eric Dumazet @ 2018-02-18 21:04 UTC (permalink / raw)
  To: Oleksandr Natalenko, Eric Dumazet
  Cc: Neal Cardwell, David S. Miller, Netdev, Yuchung Cheng,
	Soheil Hassas Yeganeh, Jerry Chu, Dave Taht

On Sat, 2018-02-17 at 10:52 -0800, Eric Dumazet wrote:
> 
> This must be some race condition in the code I added in TCP for
> self-pacing, when a short timeout is programmed.
> 
> Disabling SG means TCP cooks 1-MSS packets.
> 
> I will take a look, probably after the (long) week-end : Tuesday.

I was able to take a look today, and I believe this is the time to
switch TCP to GSO being always on.

As a bonus, we get a speed boost for cubic as well.

Today's high BDP and recent TCP improvements (rtx queue as rb-tree, sack
coalescing, TCP pacing...) were all developed/tested/maintained with
GSO/TSO being the norm.

Can you please test the following patch ?

Note that some cleanups can be done later in TCP stack, removing lots
of legacy stuff.

Also TCP internal-pacing could benefit from something similar to this
fq patch eventually, although there is no hurry.
https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=fefa569a9d4bc4b7758c0fddd75bb0382c95da77  

Of course, you have to consider why SG was disabled on your device;
this looks very pessimistic.

Thanks !

 include/net/sock.h  |    1 +
 net/core/sock.c     |    2 +-
 net/ipv4/tcp_ipv4.c |    1 +
 net/ipv6/tcp_ipv6.c |    1 +
 4 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 169c92afcafa3d548f8238e91606b87c187559f4..df4ac691870ff9f779f1782ded58140eb4d3a961 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -417,6 +417,7 @@ struct sock {
 	struct page_frag	sk_frag;
 	netdev_features_t	sk_route_caps;
 	netdev_features_t	sk_route_nocaps;
+	netdev_features_t	sk_route_forced_caps;
 	int			sk_gso_type;
 	unsigned int		sk_gso_max_size;
 	gfp_t			sk_allocation;
diff --git a/net/core/sock.c b/net/core/sock.c
index c501499a04fe973e80e18655b306d762d348ff44..b084acb3b3b96791663b731788a392041148416c 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1773,7 +1773,7 @@ void sk_setup_caps(struct sock *sk, struct dst_entry *dst)
 	u32 max_segs = 1;
 
 	sk_dst_set(sk, dst);
-	sk->sk_route_caps = dst->dev->features;
+	sk->sk_route_caps = dst->dev->features | sk->sk_route_forced_caps;
 	if (sk->sk_route_caps & NETIF_F_GSO)
 		sk->sk_route_caps |= NETIF_F_GSO_SOFTWARE;
 	sk->sk_route_caps &= ~sk->sk_route_nocaps;
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index f8ad397e285e9b8db0b04f8abc30a42f22294ef9..eaf1e30fc5af879442f5f33ed4bd69f89dff8cfb 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -233,6 +233,7 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 	}
 	/* OK, now commit destination to socket.  */
 	sk->sk_gso_type = SKB_GSO_TCPV4;
+	sk->sk_route_forced_caps = NETIF_F_GSO;
 	sk_setup_caps(sk, &rt->dst);
 	rt = NULL;
 
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 412139f4eccd96923daaea064cd9fb8be13f5916..4a461e8e2d654aa341d525a0df609a294c2040df 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -269,6 +269,7 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
 	inet->inet_rcv_saddr = LOOPBACK4_IPV6;
 
 	sk->sk_gso_type = SKB_GSO_TCPV6;
+	sk->sk_route_forced_caps = NETIF_F_GSO;
 	ip6_dst_store(sk, dst, NULL, NULL);
 
 	icsk->icsk_ext_hdr_len = 0;
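
The effect of the sk_setup_caps() change in the diff above can be modelled
as plain bitmask logic. The Python sketch below mirrors only the three lines
visible in that hunk. The flag values are hypothetical single-bit stand-ins
(the real NETIF_F_* values differ, and NETIF_F_GSO_SOFTWARE is really a
multi-bit mask), and the rest of the function is not modelled.

```python
# Hypothetical single-bit stand-ins for the kernel's netdev feature flags.
NETIF_F_GSO = 1 << 0
NETIF_F_GSO_SOFTWARE = 1 << 1

def route_caps(dev_features: int, forced_caps: int = 0, nocaps: int = 0) -> int:
    """Mirror the three sk_setup_caps() lines shown in the diff."""
    caps = dev_features | forced_caps      # the line the patch changes
    if caps & NETIF_F_GSO:
        caps |= NETIF_F_GSO_SOFTWARE       # software segmentation becomes usable
    return caps & ~nocaps                  # per-socket opt-outs still win

# A device with sg off, where generic-segmentation-offload ends up off too
# (as in the ethtool output earlier in the thread).
before = route_caps(dev_features=0)
after = route_caps(dev_features=0, forced_caps=NETIF_F_GSO)
print(f"GSO usable before: {bool(before & NETIF_F_GSO)}, after: {bool(after & NETIF_F_GSO)}")
```

In this model the socket gets GSO regardless of what the device advertises,
while sk_route_nocaps can still mask it out.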


* Re: TCP and BBR: reproducibly low cwnd and bandwidth
  2018-02-18 21:04                 ` Eric Dumazet
@ 2018-02-18 21:06                   ` Eric Dumazet
  2018-02-18 21:49                   ` Oleksandr Natalenko
  1 sibling, 0 replies; 26+ messages in thread
From: Eric Dumazet @ 2018-02-18 21:06 UTC (permalink / raw)
  To: Oleksandr Natalenko, Eric Dumazet
  Cc: Neal Cardwell, David S. Miller, Netdev, Yuchung Cheng,
	Soheil Hassas Yeganeh, Jerry Chu, Dave Taht

On Sun, 2018-02-18 at 13:04 -0800, Eric Dumazet wrote:
> 
> Can you please test the following patch ?
> 
> Note that some cleanups can be done later in TCP stack, removing lots
> of legacy stuff.
> 
> Also TCP internal-pacing could benefit from something similar to this
> fq patch eventually, although there is no hurry.
> https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=fefa569a9d4bc4b7758c0fddd75bb0382c95da77  
> 
> Of course, you have to consider why SG was disabled on your device,
> this looks very pessimistic.
> 
> Thanks !
> 
>  include/net/sock.h  |    1 +
>  net/core/sock.c     |    2 +-
>  net/ipv4/tcp_ipv4.c |    1 +
>  net/ipv6/tcp_ipv6.c |    1 +
>  4 files changed, 4 insertions(+), 1 deletion(-)

Also note that the patch only deals with active connections.

My official patch will also take care of passive ones of course.


* Re: TCP and BBR: reproducibly low cwnd and bandwidth
  2018-02-18 21:04                 ` Eric Dumazet
  2018-02-18 21:06                   ` Eric Dumazet
@ 2018-02-18 21:49                   ` Oleksandr Natalenko
  2018-02-18 22:24                     ` Eric Dumazet
  1 sibling, 1 reply; 26+ messages in thread
From: Oleksandr Natalenko @ 2018-02-18 21:49 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Eric Dumazet, Neal Cardwell, David S. Miller, Netdev,
	Yuchung Cheng, Soheil Hassas Yeganeh, Jerry Chu, Dave Taht

Hi.

On Sunday, 18 February 2018 22:04:27 CET Eric Dumazet wrote:
> I was able to take a look today, and I believe this is the time to
> switch TCP to GSO being always on.
> 
> As a bonus, we get speed boost for cubic as well.
> 
> Today's high BDP and recent TCP improvements (rtx queue as rb-tree, sack
> coalescing, TCP pacing...) all were developed/tested/maintained with
> GSO/TSO being the norm.
> 
> Can you please test the following patch ?

Yes, results below:

BBR+fq:
sg on:  6.02 Gbits/sec
sg off: 1.33 Gbits/sec

BBR+pfifo_fast:
sg on:  4.13 Gbits/sec
sg off: 1.34 Gbits/sec

BBR+fq_codel:
sg on:  4.16 Gbits/sec
sg off: 1.35 Gbits/sec

Reno+fq:
sg on:  6.44 Gbits/sec
sg off: 1.39 Gbits/sec

Reno+pfifo_fast:
sg on:  6.36 Gbits/sec
sg off: 1.39 Gbits/sec

Reno+fq_codel:
sg on:  6.41 Gbits/sec
sg off: 1.38 Gbits/sec

While BBR still suffers when fq is not used, disabling sg no longer causes a 
drastic throughput drop. So, looks good to me, eh?
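
For a rough before/after view of the sg-off case, the numbers above can be
set against the earlier VM measurements in this thread. The Python sketch
below assumes both runs used the same VM setup, which may not hold exactly:

```python
# sg-off throughput in Mbits/sec: (before patch, after patch).
# "Before" values come from the earlier VM measurements in this thread.
sg_off = {
    "BBR+fq_codel": (121, 1350),
    "BBR+fq": (437, 1330),
    "Reno+fq_codel": (1370, 1380),
    "Reno+fq": (1190, 1390),
}

gain = {name: after / before for name, (before, after) in sg_off.items()}

for name, factor in gain.items():
    print(f"{name:14s} {factor:5.2f}x")
```

BBR+fq_codel gains about 11x and BBR+fq about 3x, while the Reno figures
barely move.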

> Note that some cleanups can be done later in TCP stack, removing lots
> of legacy stuff.
> 
> Also TCP internal-pacing could benefit from something similar to this
> fq patch eventually, although there is no hurry.
> https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=fefa569a9d4bc4b7758c0fddd75bb0382c95da77

Feel free to ping me if you have something else to test then ;).

> Of course, you have to consider why SG was disabled on your device,
> this looks very pessimistic.

Dunno why that happens, but I've managed to just enable it automatically on 
interface up.

Thanks.

Oleksandr


* Re: TCP and BBR: reproducibly low cwnd and bandwidth
  2018-02-18 21:49                   ` Oleksandr Natalenko
@ 2018-02-18 22:24                     ` Eric Dumazet
  0 siblings, 0 replies; 26+ messages in thread
From: Eric Dumazet @ 2018-02-18 22:24 UTC (permalink / raw)
  To: Oleksandr Natalenko
  Cc: Eric Dumazet, Neal Cardwell, David S. Miller, Netdev,
	Yuchung Cheng, Soheil Hassas Yeganeh, Jerry Chu, Dave Taht

On Sun, 2018-02-18 at 22:49 +0100, Oleksandr Natalenko wrote:
> Hi.
> 
> On Sunday, 18 February 2018 22:04:27 CET Eric Dumazet wrote:
> > I was able to take a look today, and I believe this is the time to
> > switch TCP to GSO being always on.
> > 
> > As a bonus, we get speed boost for cubic as well.
> > 
> > Today's high BDP and recent TCP improvements (rtx queue as rb-tree, sack
> > coalescing, TCP pacing...) all were developed/tested/maintained with
> > GSO/TSO being the norm.
> > 
> > Can you please test the following patch ?
> 
> Yes, results below:
> 
> BBR+fq:
> sg on:  6.02 Gbits/sec
> sg off: 1.33 Gbits/sec
> 
> BBR+pfifo_fast:
> sg on:  4.13 Gbits/sec
> sg off: 1.34 Gbits/sec
> 
> BBR+fq_codel:
> sg on:  4.16 Gbits/sec
> sg off: 1.35 Gbits/sec
> 
> Reno+fq:
> sg on:  6.44 Gbits/sec
> sg off: 1.39 Gbits/sec
> 
> Reno+pfifo_fast:
> sg on:  6.36 Gbits/sec
> sg off: 1.39 Gbits/sec
> 
> Reno+fq_codel:
> sg on:  6.41 Gbits/sec
> sg off: 1.38 Gbits/sec
> 
> While BBR still suffers when fq is not used, disabling sg no longer causes a
> drastic throughput drop. So, looks good to me, eh?
> 

Indeed :)

Here are my results on a 40Gbit link (mlx4):

BBR+fq:
sg on:  26 Gbits/sec
sg off: 15.7 Gbits/sec   (was 2.3 Gbit before patch)

BBR+pfifo_fast:
sg on:  24.2 Gbits/sec
sg off: 14.9 Gbits/sec  (was 0.66 Gbit before patch !!! )

BBR+fq_codel:
sg on:  24.4 Gbits/sec
sg off: 15 Gbits/sec  (was 0.66 Gbit before patch !!! )

Reno+fq:
sg on:  20 Gbits/sec
sg off: 15.7 Gbits/sec  (was 6 Gbit)

Reno+pfifo_fast:
sg on:  25.7 Gbits/sec
sg off: 15.5 Gbits/sec  (was 7 Gbit)

Reno+fq_codel:
sg on:  25.8 Gbits/sec
sg off: 16 Gbits/sec    (was 7 Gbit)

Definitely worth it ;)
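
To make the gain explicit, the before/after figures for the sg-off case can be
turned into speedup factors. This is only a rough Python sketch: the numbers
are copied from the results above, and the names `sg_off`, `speedup` and
`speedups` are made up for illustration.

```python
# Throughput in Gbits/sec on the 40Gbit mlx4 link with sg off,
# copied from the results above as (before patch, after patch).
sg_off = {
    "BBR+fq": (2.3, 15.7),
    "BBR+pfifo_fast": (0.66, 14.9),
    "BBR+fq_codel": (0.66, 15.0),
    "Reno+fq": (6.0, 15.7),
    "Reno+pfifo_fast": (7.0, 15.5),
    "Reno+fq_codel": (7.0, 16.0),
}


def speedup(before, after):
    """Return the factor by which throughput improved with the patch."""
    return after / before


speedups = {name: speedup(b, a) for name, (b, a) in sg_off.items()}

for name, factor in speedups.items():
    print(f"{name:16s} {factor:5.1f}x")
```

Under these numbers, BBR with pfifo_fast or fq_codel gains the most (more than
20x), BBR with fq gains close to 7x, and the Reno cases gain a bit more than 2x.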

Thanks !
