* NAT performance regression caused by vlan GRO support
@ 2019-04-04 12:57 Rafał Miłecki
  2019-04-04 15:17 ` Toshiaki Makita
  2019-04-07 11:53 ` Rafał Miłecki
  0 siblings, 2 replies; 16+ messages in thread
From: Rafał Miłecki @ 2019-04-04 12:57 UTC (permalink / raw)
  To: netdev, David S. Miller, Toshiaki Makita
  Cc: Stefano Brivio, Sabrina Dubroca, David Ahern, Felix Fietkau,
	Jo-Philipp Wich, Koen Vandeputte


Hello,

I'd like to report a regression that goes back to 2015. I know it's damn
late, but the good thing is that the regression is still easy to reproduce,
verify & revert.

Long story short: starting with commit 66e5133f19e9 ("vlan: Add GRO support
for non hardware accelerated vlan") - which first shipped in kernel 4.2 - the
NAT performance of my router dropped by 30-40%.

My hardware is a BCM47094 SoC (dual-core ARM) with an integrated network
controller and an external BCM53012 switch.

Relevant setup:
* SoC network controller is wired to the hardware switch
* Switch passes 802.1q frames with VID 1 to four LAN ports
* Switch passes 802.1q frames with VID 2 to WAN port
* Linux does NAT for LAN (eth0.1) to WAN (eth0.2)
* Linux uses pfifo and "echo 2 > rps_cpus"
* Ryzen 5 PRO 2500U (x86_64) laptop connected to a LAN port
* Intel i7-2670QM laptop connected to a WAN port
* Speed of LAN to WAN measured using iperf & TCP over 10 minutes
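
In case it helps, my setup corresponds roughly to the following commands (a
sketch only - the real configuration is generated by OpenWrt, so the exact
invocations differ):

  ip link add link eth0 name eth0.1 type vlan id 1     # LAN
  ip link add link eth0 name eth0.2 type vlan id 2     # WAN
  iptables -t nat -A POSTROUTING -o eth0.2 -j MASQUERADE
  echo 2 > /sys/class/net/eth0/queues/rx-0/rps_cpus    # RPS on CPU 1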

1) 5.1.0-rc3
[  6]  0.0-600.0 sec  39.9 GBytes   572 Mbits/sec

2) 5.1.0-rc3 + rtcache patch
[  6]  0.0-600.0 sec  40.0 GBytes   572 Mbits/sec

3) 5.1.0-rc3 + disable GRO support
[  6]  0.0-300.4 sec  27.5 GBytes   786 Mbits/sec

4) 5.1.0-rc3 + rtcache patch + disable GRO support
[  6]  0.0-600.0 sec  65.6 GBytes   939 Mbits/sec

5) 4.1.15 + rtcache patch
934 Mb/s

6) 4.3.4 + rtcache patch
565 Mb/s

As you can see, I can achieve a big performance gain by disabling/reverting
GRO support. Getting up to 65% faster NAT makes a huge difference, and ideally
I'd like to get that with upstream Linux code.

Could someone help me and check the reported commit/code, please? Is there
any other info I can provide or anything I can test for you?


--- a/net/8021q/vlan_core.c
+++ b/net/8021q/vlan_core.c
@@ -545,6 +545,8 @@ static int __init vlan_offload_init(void)
  {
  	unsigned int i;

+	return -ENOTSUPP;
+
  	for (i = 0; i < ARRAY_SIZE(vlan_packet_offloads); i++)
  		dev_add_offload(&vlan_packet_offloads[i]);


[-- Attachment #2: .config --]
[-- Type: application/x-config, Size: 77368 bytes --]


* Re: NAT performance regression caused by vlan GRO support
  2019-04-04 12:57 NAT performance regression caused by vlan GRO support Rafał Miłecki
@ 2019-04-04 15:17 ` Toshiaki Makita
  2019-04-04 20:22   ` Rafał Miłecki
  2019-04-07 11:53 ` Rafał Miłecki
  1 sibling, 1 reply; 16+ messages in thread
From: Toshiaki Makita @ 2019-04-04 15:17 UTC (permalink / raw)
  To: Rafał Miłecki
  Cc: netdev, David S. Miller, Toshiaki Makita, Stefano Brivio,
	Sabrina Dubroca, David Ahern, Felix Fietkau, Jo-Philipp Wich,
	Koen Vandeputte

Hi Rafał,

On 19/04/04 (Thu) 21:57:15, Rafał Miłecki wrote:
> Hello,
> 
> I'd like to report a regression that goes back to 2015. I know it's damn
> late, but the good thing is that the regression is still easy to reproduce,
> verify & revert.
> 
> Long story short: starting with commit 66e5133f19e9 ("vlan: Add GRO support
> for non hardware accelerated vlan") - which first shipped in kernel 4.2 - the
> NAT performance of my router dropped by 30-40%.
> 
> My hardware is a BCM47094 SoC (dual-core ARM) with an integrated network
> controller and an external BCM53012 switch.
> 
> Relevant setup:
> * SoC network controller is wired to the hardware switch
> * Switch passes 802.1q frames with VID 1 to four LAN ports
> * Switch passes 802.1q frames with VID 2 to WAN port
> * Linux does NAT for LAN (eth0.1) to WAN (eth0.2)
> * Linux uses pfifo and "echo 2 > rps_cpus"
> * Ryzen 5 PRO 2500U (x86_64) laptop connected to a LAN port
> * Intel i7-2670QM laptop connected to a WAN port
> * Speed of LAN to WAN measured using iperf & TCP over 10 minutes
> 
> 1) 5.1.0-rc3
> [  6]  0.0-600.0 sec  39.9 GBytes   572 Mbits/sec
> 
> 2) 5.1.0-rc3 + rtcache patch
> [  6]  0.0-600.0 sec  40.0 GBytes   572 Mbits/sec
> 
> 3) 5.1.0-rc3 + disable GRO support
> [  6]  0.0-300.4 sec  27.5 GBytes   786 Mbits/sec
> 
> 4) 5.1.0-rc3 + rtcache patch + disable GRO support
> [  6]  0.0-600.0 sec  65.6 GBytes   939 Mbits/sec

Did you test it with GRO disabled by ethtool -K?
Is this the result with your revert patch?

It's late night in Japan so I think I will try to reproduce it tomorrow.

Thanks.

> 
> 5) 4.1.15 + rtcache patch
> 934 Mb/s
> 
> 6) 4.3.4 + rtcache patch
> 565 Mb/s
> 
> As you can see, I can achieve a big performance gain by disabling/reverting
> GRO support. Getting up to 65% faster NAT makes a huge difference, and ideally
> I'd like to get that with upstream Linux code.
> 
> Could someone help me and check the reported commit/code, please? Is there
> any other info I can provide or anything I can test for you?
> 
> 
> --- a/net/8021q/vlan_core.c
> +++ b/net/8021q/vlan_core.c
> @@ -545,6 +545,8 @@ static int __init vlan_offload_init(void)
>   {
>       unsigned int i;
> 
> +    return -ENOTSUPP;
> +
>       for (i = 0; i < ARRAY_SIZE(vlan_packet_offloads); i++)
>           dev_add_offload(&vlan_packet_offloads[i]);


* Re: NAT performance regression caused by vlan GRO support
  2019-04-04 15:17 ` Toshiaki Makita
@ 2019-04-04 20:22   ` Rafał Miłecki
  2019-04-05  4:26     ` Toshiaki Makita
  0 siblings, 1 reply; 16+ messages in thread
From: Rafał Miłecki @ 2019-04-04 20:22 UTC (permalink / raw)
  To: Toshiaki Makita
  Cc: netdev, David S. Miller, Toshiaki Makita, Stefano Brivio,
	Sabrina Dubroca, David Ahern, Felix Fietkau, Jo-Philipp Wich,
	Koen Vandeputte

On 04.04.2019 17:17, Toshiaki Makita wrote:
> On 19/04/04 (Thu) 21:57:15, Rafał Miłecki wrote:
>> [...]
>>
>> 1) 5.1.0-rc3
>> [  6]  0.0-600.0 sec  39.9 GBytes   572 Mbits/sec
>>
>> 2) 5.1.0-rc3 + rtcache patch
>> [  6]  0.0-600.0 sec  40.0 GBytes   572 Mbits/sec
>>
>> 3) 5.1.0-rc3 + disable GRO support
>> [  6]  0.0-300.4 sec  27.5 GBytes   786 Mbits/sec
>>
>> 4) 5.1.0-rc3 + rtcache patch + disable GRO support
>> [  6]  0.0-600.0 sec  65.6 GBytes   939 Mbits/sec
> 
> Did you test it with GRO disabled by ethtool -K?

Oh, I didn't know about that possibility! I just tested:
1) Kernel with GRO support left in place (no local patch disabling it)
2) ethtool -K eth0 gro off
and it bumped my NAT performance from 576 Mb/s to 939 Mb/s. I can reliably
break/fix NAT performance by just calling ethtool -K eth0 gro on/off.
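
For the record, that is (throughput numbers taken from the iperf runs above):

  ethtool -K eth0 gro off                           # ~940 Mb/s NAT throughput
  ethtool -K eth0 gro on                            # ~575 Mb/s NAT throughput
  ethtool -k eth0 | grep generic-receive-offload    # verify the current state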


> Is this the result with your revert patch?

Previous results came from a kernel with a patched vlan_offload_init() - see
the diff at the end of my first e-mail.


> It's late night in Japan so I think I will try to reproduce it tomorrow.

Thank you!


* Re: NAT performance regression caused by vlan GRO support
  2019-04-04 20:22   ` Rafał Miłecki
@ 2019-04-05  4:26     ` Toshiaki Makita
  2019-04-05  5:48       ` Rafał Miłecki
  0 siblings, 1 reply; 16+ messages in thread
From: Toshiaki Makita @ 2019-04-05  4:26 UTC (permalink / raw)
  To: Rafał Miłecki
  Cc: Toshiaki Makita, netdev, David S. Miller, Stefano Brivio,
	Sabrina Dubroca, David Ahern, Felix Fietkau, Jo-Philipp Wich,
	Koen Vandeputte

On 2019/04/05 5:22, Rafał Miłecki wrote:
> On 04.04.2019 17:17, Toshiaki Makita wrote:
>> On 19/04/04 (Thu) 21:57:15, Rafał Miłecki wrote:
>>> [...]
>>>
>>> 1) 5.1.0-rc3
>>> [  6]  0.0-600.0 sec  39.9 GBytes   572 Mbits/sec
>>>
>>> 2) 5.1.0-rc3 + rtcache patch
>>> [  6]  0.0-600.0 sec  40.0 GBytes   572 Mbits/sec
>>>
>>> 3) 5.1.0-rc3 + disable GRO support
>>> [  6]  0.0-300.4 sec  27.5 GBytes   786 Mbits/sec
>>>
>>> 4) 5.1.0-rc3 + rtcache patch + disable GRO support
>>> [  6]  0.0-600.0 sec  65.6 GBytes   939 Mbits/sec
>>
>> Did you test it with GRO disabled by ethtool -K?
> 
> Oh, I didn't know about that possibility! I just tested:
> 1) Kernel with GRO support left in place (no local patch disabling it)
> 2) ethtool -K eth0 gro off
> and it bumped my NAT performance from 576 Mb/s to 939 Mb/s. I can reliably
> break/fix NAT performance by just calling ethtool -K eth0 gro on/off.
> 
> 
>> Is this the result with your revert patch?
> 
> Previous results came from a kernel with a patched vlan_offload_init() - see
> the diff at the end of my first e-mail.
> 
> 
>> It's late night in Japan so I think I will try to reproduce it tomorrow.

My test results:

Receiving packets from eth0.10, forwarding them to eth0.20 and applying
MASQUERADE on eth0.20, using i40e 25G NIC on kernel 4.20.13.
Disabled rxvlan by ethtool -K to exercise vlan_gro_receive().
Measured TCP throughput by netperf.
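
Roughly the following, as a sketch rather than my exact script (<peer> being
the netperf server):

  ip link add link eth0 name eth0.10 type vlan id 10
  ip link add link eth0 name eth0.20 type vlan id 20
  iptables -t nat -A POSTROUTING -o eth0.20 -j MASQUERADE
  ethtool -K eth0 rxvlan off       # force the software vlan_gro_receive() path
  netperf -H <peer> -t TCP_STREAM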

GRO on : 17 Gbps
GRO off:  5 Gbps

So I failed to reproduce your problem.

Would you check the CPU usage by "mpstat -P ALL" or similar (like "sar
-u ALL -P ALL") to check if the traffic is able to consume 100% CPU on
your machine?

If CPU is 100%, perf may help us analyze your problem. If it's
available, try running below while testing:
# perf record -a -g -- sleep 5

And then run this after testing:
# perf report --no-child

-- 
Toshiaki Makita



* Re: NAT performance regression caused by vlan GRO support
  2019-04-05  4:26     ` Toshiaki Makita
@ 2019-04-05  5:48       ` Rafał Miłecki
  2019-04-05  7:11         ` Rafał Miłecki
  0 siblings, 1 reply; 16+ messages in thread
From: Rafał Miłecki @ 2019-04-05  5:48 UTC (permalink / raw)
  To: Toshiaki Makita
  Cc: Toshiaki Makita, netdev, David S. Miller, Stefano Brivio,
	Sabrina Dubroca, David Ahern, Felix Fietkau, Jo-Philipp Wich,
	Koen Vandeputte

On 05.04.2019 06:26, Toshiaki Makita wrote:
> On 2019/04/05 5:22, Rafał Miłecki wrote:
>> On 04.04.2019 17:17, Toshiaki Makita wrote:
>>> On 19/04/04 (Thu) 21:57:15, Rafał Miłecki wrote:
>>>> [...]
>>>
>>> Did you test it with GRO disabled by ethtool -K?
>>
>> Oh, I didn't know about that possibility! I just tested:
>> 1) Kernel with GRO support left in place (no local patch disabling it)
>> 2) ethtool -K eth0 gro off
>> and it bumped my NAT performance from 576 Mb/s to 939 Mb/s. I can reliably
>> break/fix NAT performance by just calling ethtool -K eth0 gro on/off.
>>
>>
>>> Is this the result with your revert patch?
>>
>> Previous results came from a kernel with a patched vlan_offload_init() - see
>> the diff at the end of my first e-mail.
>>
>>
>>> It's late night in Japan so I think I will try to reproduce it tomorrow.
> 
> My test results:
> 
> Receiving packets from eth0.10, forwarding them to eth0.20 and applying
> MASQUERADE on eth0.20, using i40e 25G NIC on kernel 4.20.13.
> Disabled rxvlan by ethtool -K to exercise vlan_gro_receive().
> Measured TCP throughput by netperf.
> 
> GRO on : 17 Gbps
> GRO off:  5 Gbps
> 
> So I failed to reproduce your problem.

:( Thanks for trying & checking that!


> Would you check the CPU usage by "mpstat -P ALL" or similar (like "sar
> -u ALL -P ALL") to check if the traffic is able to consume 100% CPU on
> your machine?

1) ethtool -K eth0 gro on + iperf running (577 Mb/s)
root@OpenWrt:/# mpstat -P ALL 10 3
Linux 5.1.0-rc3+ (OpenWrt)      03/27/19        _armv7l_        (2 CPU)

16:33:40     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
16:33:50     all    0.00    0.00    0.00    0.00    0.00   58.79    0.00    0.00   41.21
16:33:50       0    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00
16:33:50       1    0.00    0.00    0.00    0.00    0.00   17.58    0.00    0.00   82.42

16:33:50     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
16:34:00     all    0.00    0.00    0.05    0.00    0.00   59.44    0.00    0.00   40.51
16:34:00       0    0.00    0.00    0.10    0.00    0.00   99.90    0.00    0.00    0.00
16:34:00       1    0.00    0.00    0.00    0.00    0.00   18.98    0.00    0.00   81.02

16:34:00     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
16:34:10     all    0.00    0.00    0.00    0.00    0.00   59.59    0.00    0.00   40.41
16:34:10       0    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00
16:34:10       1    0.00    0.00    0.00    0.00    0.00   19.18    0.00    0.00   80.82

Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
Average:     all    0.00    0.00    0.02    0.00    0.00   59.27    0.00    0.00   40.71
Average:       0    0.00    0.00    0.03    0.00    0.00   99.97    0.00    0.00    0.00
Average:       1    0.00    0.00    0.00    0.00    0.00   18.58    0.00    0.00   81.42


2) ethtool -K eth0 gro off + iperf running (941 Mb/s)
root@OpenWrt:/# mpstat -P ALL 10 3
Linux 5.1.0-rc3+ (OpenWrt)      03/27/19        _armv7l_        (2 CPU)

16:34:39     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
16:34:49     all    0.00    0.00    0.05    0.00    0.00   86.91    0.00    0.00   13.04
16:34:49       0    0.00    0.00    0.10    0.00    0.00   78.22    0.00    0.00   21.68
16:34:49       1    0.00    0.00    0.00    0.00    0.00   95.60    0.00    0.00    4.40

16:34:49     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
16:34:59     all    0.00    0.00    0.10    0.00    0.00   87.06    0.00    0.00   12.84
16:34:59       0    0.00    0.00    0.20    0.00    0.00   79.72    0.00    0.00   20.08
16:34:59       1    0.00    0.00    0.00    0.00    0.00   94.41    0.00    0.00    5.59

16:34:59     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
16:35:09     all    0.00    0.00    0.05    0.00    0.00   85.71    0.00    0.00   14.24
16:35:09       0    0.00    0.00    0.10    0.00    0.00   79.42    0.00    0.00   20.48
16:35:09       1    0.00    0.00    0.00    0.00    0.00   92.01    0.00    0.00    7.99

Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
Average:     all    0.00    0.00    0.07    0.00    0.00   86.56    0.00    0.00   13.37
Average:       0    0.00    0.00    0.13    0.00    0.00   79.12    0.00    0.00   20.75
Average:       1    0.00    0.00    0.00    0.00    0.00   94.01    0.00    0.00    5.99


3) System idle (no iperf)
root@OpenWrt:/# mpstat -P ALL 10 1
Linux 5.1.0-rc3+ (OpenWrt)      03/27/19        _armv7l_        (2 CPU)

16:35:31     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
16:35:41     all    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
16:35:41       0    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
16:35:41       1    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00

Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
Average:     all    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:       0    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:       1    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00


> If CPU is 100%, perf may help us analyze your problem. If it's
> available, try running below while testing:
> # perf record -a -g -- sleep 5
> 
> And then run this after testing:
> # perf report --no-child

I can see my CPU 0 is fully loaded when using "gro on". I'll try perf now.


* Re: NAT performance regression caused by vlan GRO support
  2019-04-05  5:48       ` Rafał Miłecki
@ 2019-04-05  7:11         ` Rafał Miłecki
  2019-04-05  7:14           ` Felix Fietkau
  0 siblings, 1 reply; 16+ messages in thread
From: Rafał Miłecki @ 2019-04-05  7:11 UTC (permalink / raw)
  To: Toshiaki Makita
  Cc: Toshiaki Makita, netdev, David S. Miller, Stefano Brivio,
	Sabrina Dubroca, David Ahern, Felix Fietkau, Jo-Philipp Wich,
	Koen Vandeputte

On 05.04.2019 07:48, Rafał Miłecki wrote:
> On 05.04.2019 06:26, Toshiaki Makita wrote:
>> My test results:
>>
>> Receiving packets from eth0.10, forwarding them to eth0.20 and applying
>> MASQUERADE on eth0.20, using i40e 25G NIC on kernel 4.20.13.
>> Disabled rxvlan by ethtool -K to exercise vlan_gro_receive().
>> Measured TCP throughput by netperf.
>>
>> GRO on : 17 Gbps
>> GRO off:  5 Gbps
>>
>> So I failed to reproduce your problem.
> 
> :( Thanks for trying & checking that!
> 
> 
>> Would you check the CPU usage by "mpstat -P ALL" or similar (like "sar
>> -u ALL -P ALL") to check if the traffic is able to consume 100% CPU on
>> your machine?
> 
> [...]
> 
> 
>> If CPU is 100%, perf may help us analyze your problem. If it's
>> available, try running below while testing:
>> # perf record -a -g -- sleep 5
>>
>> And then run this after testing:
>> # perf report --no-child
> 
> I can see my CPU 0 is fully loaded when using "gro on". I'll try perf now.

I guess it's GRO + csum_partial() that's to blame for this performance drop.

Maybe csum_partial() is very fast on your powerful machine and a few extra
calls don't make a difference? I can imagine it affecting a much slower home
router with ARM cores.


1) ethtool -K eth0 gro on

Samples: 34K of event 'cycles', Event count (approx.): 10041345370
   Overhead  Command          Shared Object           Symbol
+   25,46%  ksoftirqd/0      [kernel.kallsyms]       [k] csum_partial
+    8,82%  ksoftirqd/0      [kernel.kallsyms]       [k] v7_dma_inv_range
+    6,03%  swapper          [kernel.kallsyms]       [k] arch_cpu_idle
+    4,08%  ksoftirqd/0      [kernel.kallsyms]       [k] v7_dma_clean_range
+    3,82%  ksoftirqd/0      [kernel.kallsyms]       [k] l2c210_inv_range
+    3,14%  swapper          [kernel.kallsyms]       [k] rcu_idle_exit
+    3,00%  ksoftirqd/0      [kernel.kallsyms]       [k] l2c210_clean_range
+    2,43%  ksoftirqd/0      [kernel.kallsyms]       [k] bgmac_start_xmit
+    1,24%  swapper          [kernel.kallsyms]       [k] csum_partial
+    1,20%  swapper          [kernel.kallsyms]       [k] do_idle
+    1,19%  swapper          [kernel.kallsyms]       [k] skb_segment
+    1,19%  ksoftirqd/0      [kernel.kallsyms]       [k] arm_dma_unmap_page
+    1,00%  ksoftirqd/0      [kernel.kallsyms]       [k] bgmac_poll
+    0,95%  ksoftirqd/0      [kernel.kallsyms]       [k] __slab_free.constprop.3
+    0,80%  ksoftirqd/0      [kernel.kallsyms]       [k] skb_release_data
+    0,77%  swapper          [kernel.kallsyms]       [k] __dev_queue_xmit
+    0,73%  ksoftirqd/0      [kernel.kallsyms]       [k] build_skb
+    0,68%  ksoftirqd/0      [kernel.kallsyms]       [k] skb_segment
+    0,66%  ksoftirqd/0      [kernel.kallsyms]       [k] mmiocpy
+    0,66%  ksoftirqd/0      [kernel.kallsyms]       [k] skb_checksum_help
+    0,65%  ksoftirqd/0      [kernel.kallsyms]       [k] dev_gro_receive
+    0,64%  ksoftirqd/0      [kernel.kallsyms]       [k] page_address
+    0,62%  ksoftirqd/0      [kernel.kallsyms]       [k] __qdisc_run
+    0,62%  ksoftirqd/0      [kernel.kallsyms]       [k] dma_cache_maint_page
+    0,59%  swapper          [kernel.kallsyms]       [k] __kmalloc_track_caller
+    0,59%  swapper          [kernel.kallsyms]       [k] mmiocpy
+    0,58%  ksoftirqd/0      [kernel.kallsyms]       [k] sch_direct_xmit
+    0,55%  ksoftirqd/0      [kernel.kallsyms]       [k] mmioset
+    0,52%  ksoftirqd/0      [kernel.kallsyms]       [k] inet_gro_receive
      0,49%  ksoftirqd/0      [kernel.kallsyms]       [k] netdev_alloc_frag
      0,47%  swapper          [kernel.kallsyms]       [k] __netif_receive_skb_core
      0,45%  swapper          [kernel.kallsyms]       [k] kmem_cache_alloc
      0,45%  ksoftirqd/0      [kernel.kallsyms]       [k] __skb_checksum
      0,43%  swapper          [kernel.kallsyms]       [k] v7_dma_clean_range
      0,39%  ksoftirqd/0      [kernel.kallsyms]       [k] kmem_cache_alloc
      0,36%  ksoftirqd/0      [kernel.kallsyms]       [k] qdisc_dequeue_head
      0,36%  ksoftirqd/0      [kernel.kallsyms]       [k] arm_dma_map_page
      0,35%  swapper          [kernel.kallsyms]       [k] mmioset
      0,34%  ksoftirqd/0      [kernel.kallsyms]       [k] tcp_gro_receive
      0,33%  swapper          [kernel.kallsyms]       [k] __copy_skb_header
      0,33%  ksoftirqd/0      [kernel.kallsyms]       [k] kmem_cache_free
      0,32%  ksoftirqd/0      [kernel.kallsyms]       [k] netif_skb_features
      0,30%  swapper          [kernel.kallsyms]       [k] netif_skb_features
      0,30%  ksoftirqd/0      [kernel.kallsyms]       [k] __skb_flow_dissect

2) ethtool -K eth0 gro off

Samples: 39K of event 'cycles', Event count (approx.): 13065826851
   Overhead  Command          Shared Object           Symbol
+   11,09%  swapper          [kernel.kallsyms]       [k] v7_dma_inv_range
+    5,86%  ksoftirqd/1      [kernel.kallsyms]       [k] v7_dma_clean_range
+    5,77%  swapper          [kernel.kallsyms]       [k] l2c210_inv_range
+    5,38%  swapper          [kernel.kallsyms]       [k] __irqentry_text_end
+    4,44%  swapper          [kernel.kallsyms]       [k] bcma_host_soc_read32
+    3,28%  ksoftirqd/1      [kernel.kallsyms]       [k] __netif_receive_skb_core
+    3,25%  ksoftirqd/1      [kernel.kallsyms]       [k] l2c210_clean_range
+    2,70%  swapper          [kernel.kallsyms]       [k] arch_cpu_idle
+    2,25%  swapper          [kernel.kallsyms]       [k] bgmac_poll
+    2,14%  ksoftirqd/1      [kernel.kallsyms]       [k] bgmac_start_xmit
+    1,79%  ksoftirqd/1      [kernel.kallsyms]       [k] __dev_queue_xmit
+    1,36%  ksoftirqd/1      [kernel.kallsyms]       [k] skb_vlan_untag
+    1,11%  swapper          [kernel.kallsyms]       [k] __skb_flow_dissect
+    1,07%  ksoftirqd/1      [kernel.kallsyms]       [k] netif_skb_features
+    0,98%  ksoftirqd/1      [kernel.kallsyms]       [k] ip_rcv_core.constprop.3
+    0,92%  ksoftirqd/1      [kernel.kallsyms]       [k] sch_direct_xmit
+    0,90%  ksoftirqd/1      [kernel.kallsyms]       [k] __local_bh_enable_ip
+    0,86%  ksoftirqd/1      [kernel.kallsyms]       [k] nf_hook_slow
+    0,82%  swapper          [kernel.kallsyms]       [k] net_rx_action
+    0,80%  ksoftirqd/1      [kernel.kallsyms]       [k] validate_xmit_skb.constprop.30
+    0,75%  swapper          [kernel.kallsyms]       [k] build_skb
+    0,72%  ksoftirqd/1      [kernel.kallsyms]       [k] ip_forward
+    0,71%  ksoftirqd/1      [kernel.kallsyms]       [k] br_handle_frame_finish
+    0,71%  ksoftirqd/1      [kernel.kallsyms]       [k] skb_pull_rcsum
+    0,65%  swapper          [kernel.kallsyms]       [k] arm_dma_unmap_page
+    0,59%  ksoftirqd/1      [kernel.kallsyms]       [k] ip_finish_output2
+    0,59%  swapper          [kernel.kallsyms]       [k] __skb_get_hash
+    0,58%  swapper          [kernel.kallsyms]       [k] dma_cache_maint_page
+    0,55%  ksoftirqd/1      [kernel.kallsyms]       [k] fdb_find_rcu
+    0,54%  swapper          [kernel.kallsyms]       [k] bcma_host_soc_write32
+    0,53%  ksoftirqd/1      [kernel.kallsyms]       [k] vlan_do_receive
+    0,52%  ksoftirqd/1      [kernel.kallsyms]       [k] memmove
+    0,52%  swapper          [kernel.kallsyms]       [k] rcu_idle_exit
+    0,51%  ksoftirqd/1      [kernel.kallsyms]       [k] ip_rcv
+    0,51%  ksoftirqd/1      [kernel.kallsyms]       [k] dev_hard_start_xmit
      0,49%  ksoftirqd/1      [kernel.kallsyms]       [k] ip_output
      0,46%  ksoftirqd/1      [kernel.kallsyms]       [k] vlan_dev_hard_start_xmit
      0,45%  swapper          [kernel.kallsyms]       [k] enqueue_to_backlog
      0,42%  swapper          [kernel.kallsyms]       [k] netdev_alloc_frag
      0,42%  swapper          [kernel.kallsyms]       [k] skb_release_data
      0,41%  ksoftirqd/1      [kernel.kallsyms]       [k] ip_forward_finish
      0,40%  ksoftirqd/1      [kernel.kallsyms]       [k] br_handle_frame
      0,37%  ksoftirqd/1      [kernel.kallsyms]       [k] mmiocpy
      0,37%  ksoftirqd/1      [kernel.kallsyms]       [k] page_address
      0,36%  ksoftirqd/0      [kernel.kallsyms]       [k] v7_dma_inv_range
      0,36%  ksoftirqd/1      [kernel.kallsyms]       [k] memcmp
      0,36%  ksoftirqd/1      [kernel.kallsyms]       [k] netif_receive_skb_internal
      0,34%  swapper          [kernel.kallsyms]       [k] page_address
      0,34%  swapper          [kernel.kallsyms]       [k] mmioset
      0,33%  ksoftirqd/1      [kernel.kallsyms]       [k] br_pass_frame_up
      0,33%  ksoftirqd/1      [kernel.kallsyms]       [k] neigh_connected_output
      0,33%  swapper          [kernel.kallsyms]       [k] kmem_cache_alloc
      0,31%  ksoftirqd/1      [kernel.kallsyms]       [k] mmioset
      0,30%  ksoftirqd/1      [kernel.kallsyms]       [k] ip_finish_output
      0,30%  ksoftirqd/1      [kernel.kallsyms]       [k] bcma_bgmac_write


* Re: NAT performance regression caused by vlan GRO support
  2019-04-05  7:11         ` Rafał Miłecki
@ 2019-04-05  7:14           ` Felix Fietkau
  2019-04-05  7:58             ` Toshiaki Makita
  0 siblings, 1 reply; 16+ messages in thread
From: Felix Fietkau @ 2019-04-05  7:14 UTC (permalink / raw)
  To: Rafał Miłecki, Toshiaki Makita
  Cc: Toshiaki Makita, netdev, David S. Miller, Stefano Brivio,
	Sabrina Dubroca, David Ahern, Jo-Philipp Wich, Koen Vandeputte

On 2019-04-05 09:11, Rafał Miłecki wrote:
> On 05.04.2019 07:48, Rafał Miłecki wrote:
>> On 05.04.2019 06:26, Toshiaki Makita wrote:
>>> My test results:
>>>
>>> Receiving packets from eth0.10, forwarding them to eth0.20 and applying
>>> MASQUERADE on eth0.20, using i40e 25G NIC on kernel 4.20.13.
>>> Disabled rxvlan by ethtool -K to exercise vlan_gro_receive().
>>> Measured TCP throughput by netperf.
>>>
>>> GRO on : 17 Gbps
>>> GRO off:  5 Gbps
>>>
>>> So I failed to reproduce your problem.
>> 
>> :( Thanks for trying & checking that!
>> 
>> 
>>> Would you check the CPU usage by "mpstat -P ALL" or similar (like "sar
>>> -u ALL -P ALL") to check if the traffic is able to consume 100% CPU on
>>> your machine?
>> 
>> [...]
>> 
>> I can see my CPU 0 is fully loaded when using "gro on". I'll try perf now.
> 
> I guess it's GRO + csum_partial() that's to blame for this performance drop.
>
> Maybe csum_partial() is very fast on your powerful machine and a few extra
> calls don't make a difference? I can imagine it affecting a much slower home
> router with ARM cores.
Most high performance Ethernet devices implement hardware checksum
offload, which completely gets rid of this overhead.
Unfortunately, the BCM53xx/47xx Ethernet MAC doesn't have this, which is
why you're getting such crappy performance.
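
That's easy to check with ethtool (a sketch - on bgmac I would expect the
checksum features to be reported as fixed-off, i.e. not supported by the
hardware):

  ethtool -k eth0 | grep -i checksum
  # e.g. "rx-checksumming: off [fixed]" when the MAC can't do it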

- Felix


* Re: NAT performance regression caused by vlan GRO support
  2019-04-05  7:14           ` Felix Fietkau
@ 2019-04-05  7:58             ` Toshiaki Makita
  2019-04-05  8:12               ` Rafał Miłecki
  2019-04-05 10:18               ` Toke Høiland-Jørgensen
  0 siblings, 2 replies; 16+ messages in thread
From: Toshiaki Makita @ 2019-04-05  7:58 UTC (permalink / raw)
  To: Felix Fietkau, Rafał Miłecki
  Cc: Toshiaki Makita, netdev, David S. Miller, Stefano Brivio,
	Sabrina Dubroca, David Ahern, Jo-Philipp Wich, Koen Vandeputte

On 2019/04/05 16:14, Felix Fietkau wrote:
> On 2019-04-05 09:11, Rafał Miłecki wrote:
>> On 05.04.2019 07:48, Rafał Miłecki wrote:
>>> On 05.04.2019 06:26, Toshiaki Makita wrote:
>>>> My test results:
>>>>
>>>> Receiving packets from eth0.10, forwarding them to eth0.20 and applying
>>>> MASQUERADE on eth0.20, using i40e 25G NIC on kernel 4.20.13.
>>>> Disabled rxvlan by ethtool -K to exercise vlan_gro_receive().
>>>> Measured TCP throughput by netperf.
>>>>
>>>> GRO on : 17 Gbps
>>>> GRO off:  5 Gbps
>>>>
>>>> So I failed to reproduce your problem.
>>>
>>> :( Thanks for trying & checking that!
>>>
>>>
>>>> Would you check the CPU usage by "mpstat -P ALL" or similar (like "sar
>>>> -u ALL -P ALL") to check if the traffic is able to consume 100% CPU on
>>>> your machine?
>>>
>>> [...]
>>>
>>> I can see my CPU 0 is fully loaded when using "gro on". I'll try perf now.
>>
>> I guess it's GRO + csum_partial() that's to blame for this performance drop.
>>
>> Maybe csum_partial() is very fast on your powerful machine and a few extra
>> calls don't make a difference? I can imagine it affecting a much slower home
>> router with ARM cores.
> Most high performance Ethernet devices implement hardware checksum
> offload, which completely gets rid of this overhead.
> Unfortunately, the BCM53xx/47xx Ethernet MAC doesn't have this, which is
> why you're getting such crappy performance.

Hmm... now I disabled rx checksum and tried the test again, and indeed I
see csum_partial from the GRO path. But I also see csum_partial even without
GRO, from nf_conntrack_in -> tcp_packet -> __skb_checksum_complete.
Probably Rafał disabled the nf_conntrack_checksum sysctl knob?

But anyway, even with rx csum offload disabled, my machine has better
performance with GRO. I'm sure in some cases GRO should be disabled, but
I guess it's difficult to determine automatically whether GRO should be
disabled when csum offload is not available.

-- 
Toshiaki Makita



* Re: NAT performance regression caused by vlan GRO support
  2019-04-05  7:58             ` Toshiaki Makita
@ 2019-04-05  8:12               ` Rafał Miłecki
  2019-04-05  8:24                 ` Rafał Miłecki
  2019-04-05 10:18               ` Toke Høiland-Jørgensen
  1 sibling, 1 reply; 16+ messages in thread
From: Rafał Miłecki @ 2019-04-05  8:12 UTC (permalink / raw)
  To: Toshiaki Makita, Felix Fietkau
  Cc: Toshiaki Makita, netdev, David S. Miller, Stefano Brivio,
	Sabrina Dubroca, David Ahern, Jo-Philipp Wich, Koen Vandeputte

On 05.04.2019 09:58, Toshiaki Makita wrote:
> On 2019/04/05 16:14, Felix Fietkau wrote:
>> On 2019-04-05 09:11, Rafał Miłecki wrote:
>>> I guess it's GRO + csum_partial() that's to blame for this performance drop.
>>>
>>> Maybe csum_partial() is very fast on your powerful machine and a few extra
>>> calls don't make a difference? I can imagine it affecting a much slower home
>>> router with ARM cores.
>> Most high performance Ethernet devices implement hardware checksum
>> offload, which completely gets rid of this overhead.
>> Unfortunately, the BCM53xx/47xx Ethernet MAC doesn't have this, which is
>> why you're getting such crappy performance.
> 
> Hmm... now I disabled rx checksum and tried the test again, and indeed I
> see csum_partial from the GRO path. But I also see csum_partial even without
> GRO, from nf_conntrack_in -> tcp_packet -> __skb_checksum_complete.
> Probably Rafał disabled the nf_conntrack_checksum sysctl knob?
>
> But anyway, even with rx csum offload disabled, my machine has better
> performance with GRO. I'm sure in some cases GRO should be disabled, but
> I guess it's difficult to determine automatically whether GRO should be
> disabled when csum offload is not available.

A few testing results:

1) ethtool -K eth0 gro off; echo 0 > /proc/sys/net/netfilter/nf_conntrack_checksum
[  6]  0.0-60.0 sec  6.57 GBytes   940 Mbits/sec

2) ethtool -K eth0 gro off; echo 1 > /proc/sys/net/netfilter/nf_conntrack_checksum
[  6]  0.0-60.0 sec  4.65 GBytes   666 Mbits/sec

3) ethtool -K eth0 gro on; echo 0 > /proc/sys/net/netfilter/nf_conntrack_checksum
[  6]  0.0-60.0 sec  4.02 GBytes   575 Mbits/sec

4) ethtool -K eth0 gro on; echo 1 > /proc/sys/net/netfilter/nf_conntrack_checksum
[  6]  0.0-60.0 sec  4.04 GBytes   579 Mbits/sec


* Re: NAT performance regression caused by vlan GRO support
  2019-04-05  8:12               ` Rafał Miłecki
@ 2019-04-05  8:24                 ` Rafał Miłecki
  0 siblings, 0 replies; 16+ messages in thread
From: Rafał Miłecki @ 2019-04-05  8:24 UTC (permalink / raw)
  To: Toshiaki Makita, Felix Fietkau
  Cc: Toshiaki Makita, netdev, David S. Miller, Stefano Brivio,
	Sabrina Dubroca, David Ahern, Jo-Philipp Wich, Koen Vandeputte

On 05.04.2019 10:12, Rafał Miłecki wrote:
> On 05.04.2019 09:58, Toshiaki Makita wrote:
>> On 2019/04/05 16:14, Felix Fietkau wrote:
>>> On 2019-04-05 09:11, Rafał Miłecki wrote:
>>>> I guess it's GRO + csum_partial() that's to blame for this performance drop.
>>>>
>>>> Maybe csum_partial() is very fast on your powerful machine and a few extra
>>>> calls don't make a difference? I can imagine it affecting a much slower home
>>>> router with ARM cores.
>>> Most high performance Ethernet devices implement hardware checksum
>>> offload, which completely gets rid of this overhead.
>>> Unfortunately, the BCM53xx/47xx Ethernet MAC doesn't have this, which is
>>> why you're getting such crappy performance.
>>
>> Hmm... now I disabled rx checksum and tried the test again, and indeed I
>> see csum_partial from the GRO path. But I also see csum_partial even without
>> GRO, from nf_conntrack_in -> tcp_packet -> __skb_checksum_complete.
>> Probably Rafał disabled the nf_conntrack_checksum sysctl knob?
>>
>> But anyway, even with rx csum offload disabled, my machine has better
>> performance with GRO. I'm sure in some cases GRO should be disabled, but
>> I guess it's difficult to determine automatically whether GRO should be
>> disabled when csum offload is not available.
> 
> A few testing results:
> 
> 1) ethtool -K eth0 gro off; echo 0 > /proc/sys/net/netfilter/nf_conntrack_checksum
> [  6]  0.0-60.0 sec  6.57 GBytes   940 Mbits/sec
> 
> 2) ethtool -K eth0 gro off; echo 1 > /proc/sys/net/netfilter/nf_conntrack_checksum
> [  6]  0.0-60.0 sec  4.65 GBytes   666 Mbits/sec

For this case (GRO off and nf_conntrack_checksum enabled) I can confirm I see
csum_partial() in the perf output. It's taking 13,14% instead of 25,46% (as when
using GRO) though.

Samples: 38K of event 'cycles', Event count (approx.): 12209908413
   Overhead  Command          Shared Object           Symbol
+   13,14%  ksoftirqd/1      [kernel.kallsyms]       [k] csum_partial
+   10,16%  swapper          [kernel.kallsyms]       [k] v7_dma_inv_range
+    6,36%  swapper          [kernel.kallsyms]       [k] l2c210_inv_range
+    4,89%  swapper          [kernel.kallsyms]       [k] __irqentry_text_end
+    4,12%  ksoftirqd/1      [kernel.kallsyms]       [k] v7_dma_clean_range
+    3,78%  swapper          [kernel.kallsyms]       [k] bcma_host_soc_read32
+    2,76%  swapper          [kernel.kallsyms]       [k] arch_cpu_idle
+    2,45%  ksoftirqd/1      [kernel.kallsyms]       [k] __netif_receive_skb_core
+    2,37%  ksoftirqd/1      [kernel.kallsyms]       [k] l2c210_clean_range
+    1,76%  ksoftirqd/1      [kernel.kallsyms]       [k] bgmac_start_xmit
+    1,66%  swapper          [kernel.kallsyms]       [k] bgmac_poll
+    1,55%  ksoftirqd/1      [kernel.kallsyms]       [k] __dev_queue_xmit
+    1,11%  ksoftirqd/1      [kernel.kallsyms]       [k] skb_vlan_untag


> 3) ethtool -K eth0 gro on; echo 0 > /proc/sys/net/netfilter/nf_conntrack_checksum
> [  6]  0.0-60.0 sec  4.02 GBytes   575 Mbits/sec
> 
> 4) ethtool -K eth0 gro on; echo 1 > /proc/sys/net/netfilter/nf_conntrack_checksum
> [  6]  0.0-60.0 sec  4.04 GBytes   579 Mbits/sec


* Re: NAT performance regression caused by vlan GRO support
  2019-04-05  7:58             ` Toshiaki Makita
  2019-04-05  8:12               ` Rafał Miłecki
@ 2019-04-05 10:18               ` Toke Høiland-Jørgensen
  2019-04-05 10:51                 ` Florian Westphal
  1 sibling, 1 reply; 16+ messages in thread
From: Toke Høiland-Jørgensen @ 2019-04-05 10:18 UTC (permalink / raw)
  To: Toshiaki Makita, Felix Fietkau, Rafał Miłecki
  Cc: Toshiaki Makita, netdev, David S. Miller, Stefano Brivio,
	Sabrina Dubroca, David Ahern, Jo-Philipp Wich, Koen Vandeputte

Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> writes:

> On 2019/04/05 16:14, Felix Fietkau wrote:
>> On 2019-04-05 09:11, Rafał Miłecki wrote:
>>> On 05.04.2019 07:48, Rafał Miłecki wrote:
>>>> On 05.04.2019 06:26, Toshiaki Makita wrote:
>>>>> My test results:
>>>>>
>>>>> Receiving packets from eth0.10, forwarding them to eth0.20 and applying
>>>>> MASQUERADE on eth0.20, using i40e 25G NIC on kernel 4.20.13.
>>>>> Disabled rxvlan by ethtool -K to exercise vlan_gro_receive().
>>>>> Measured TCP throughput by netperf.
>>>>>
>>>>> GRO on : 17 Gbps
>>>>> GRO off:  5 Gbps
>>>>>
>>>>> So I failed to reproduce your problem.
>>>>
>>>> :( Thanks for trying & checking that!
>>>>
>>>>
>>>>> Would you check the CPU usage by "mpstat -P ALL" or similar (like "sar
>>>>> -u ALL -P ALL") to check if the traffic is able to consume 100% CPU on
>>>>> your machine?
>>>>
>>>> [...]
>>>>
>>>> I can see my CPU 0 is fully loaded when using "gro on". I'll try perf now.
>>>
>>> I guess it's GRO + csum_partial() that is to blame for this performance drop.
>>>
>>> Maybe csum_partial() is very fast on your powerful machine and a few extra
>>> calls don't make a difference? I can imagine it affecting a much slower
>>> home router with ARM cores.
>> Most high-performance Ethernet devices implement hardware checksum
>> offload, which completely gets rid of this overhead.
>> Unfortunately, the BCM53xx/47xx Ethernet MAC doesn't have this, which is
>> why you're getting such crappy performance.
>
> Hmm... now I disabled rx checksum and tried the test again, and indeed I
> see csum_partial in the GRO path. But I also see csum_partial even without
> GRO, from nf_conntrack_in -> tcp_packet -> __skb_checksum_complete.
> Probably Rafał disabled the nf_conntrack_checksum sysctl knob?
>
> But anyway, even with rx csum offload disabled, my machine has better
> performance with GRO.

But you're also running at way higher speeds, where the benefit of GRO
is higher.

> I'm sure in some cases GRO should be disabled, but I guess it's
> difficult to automatically determine whether we should disable GRO
> when csum offload is not available.

As a first approximation, maybe just:

if (!has_hardware_cksum_offload(netdev) && link_rate(netdev) <= 1Gbps)
  disable_gro();

We used 1Gbps as the threshold for when to split GRO packets by default
in sch_cake as well...
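
To make that a bit more concrete, a rough, hypothetical sketch using
existing helpers (gro_worth_it() doesn't exist anywhere, and
__ethtool_get_link_ksettings() needs RTNL held):

static bool gro_worth_it(struct net_device *dev)
{
	struct ethtool_link_ksettings cmd;

	/* HW verifies checksums, so GRO validation is nearly free */
	if (dev->features & NETIF_F_RXCSUM)
		return true;

	/* No link settings available: keep GRO on by default */
	if (__ethtool_get_link_ksettings(dev, &cmd))
		return true;

	/* SPEED_UNKNOWN is (u32)-1, so unknown speeds also keep GRO */
	return cmd.base.speed > SPEED_1000;
}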

-Toke

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: NAT performance regression caused by vlan GRO support
  2019-04-05 10:18               ` Toke Høiland-Jørgensen
@ 2019-04-05 10:51                 ` Florian Westphal
  2019-04-05 11:00                   ` Eric Dumazet
  0 siblings, 1 reply; 16+ messages in thread
From: Florian Westphal @ 2019-04-05 10:51 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Toshiaki Makita, Felix Fietkau, Rafał Miłecki,
	Toshiaki Makita, netdev, David S. Miller, Stefano Brivio,
	Sabrina Dubroca, David Ahern, Jo-Philipp Wich, Koen Vandeputte

Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> As a first approximation, maybe just:
> 
> if (!has_hardware_cksum_offload(netdev) && link_rate(netdev) <= 1Gbps)
>   disable_gro();

I don't think it's a good idea.  For the local delivery case, there is no
way to avoid the checksum cost, so we might as well have GRO enabled.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: NAT performance regression caused by vlan GRO support
  2019-04-05 10:51                 ` Florian Westphal
@ 2019-04-05 11:00                   ` Eric Dumazet
  0 siblings, 0 replies; 16+ messages in thread
From: Eric Dumazet @ 2019-04-05 11:00 UTC (permalink / raw)
  To: Florian Westphal, Toke Høiland-Jørgensen
  Cc: Toshiaki Makita, Felix Fietkau, Rafał Miłecki,
	Toshiaki Makita, netdev, David S. Miller, Stefano Brivio,
	Sabrina Dubroca, David Ahern, Jo-Philipp Wich, Koen Vandeputte



On 04/05/2019 03:51 AM, Florian Westphal wrote:
> Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>> As a first approximation, maybe just:
>>
>> if (!has_hardware_cksum_offload(netdev) && link_rate(netdev) <= 1Gbps)
>>   disable_gro();
> 
> I don't think it's a good idea.  For the local delivery case, there is no
> way to avoid the checksum cost, so we might as well have GRO enabled.
> 

We might add a sysctl or a way to tell the GRO layer:

Do not attempt checksumming if forwarding is enabled on the host.

Basically, only GRO if the NIC has provided checksum offload.


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: NAT performance regression caused by vlan GRO support
  2019-04-04 12:57 NAT performance regression caused by vlan GRO support Rafał Miłecki
  2019-04-04 15:17 ` Toshiaki Makita
@ 2019-04-07 11:53 ` Rafał Miłecki
  2019-04-07 11:54   ` Rafał Miłecki
  1 sibling, 1 reply; 16+ messages in thread
From: Rafał Miłecki @ 2019-04-07 11:53 UTC (permalink / raw)
  To: netdev, David S. Miller, Toshiaki Makita,
	Toke Høiland-Jørgensen, Florian Westphal, Eric Dumazet
  Cc: Stefano Brivio, Sabrina Dubroca, David Ahern, Felix Fietkau,
	Jo-Philipp Wich, Koen Vandeputte

On 04.04.2019 14:57, Rafał Miłecki wrote:
> Long story short, starting with the commit 66e5133f19e9 ("vlan: Add GRO support
> for non hardware accelerated vlan") - which first hit kernel 4.2 - NAT
> performance of my router dropped by 30% - 40%.

I'll try to provide a summary of this issue. I'll focus on TCP traffic as
that's what I happened to test.

Basically all slowdowns are related to csum_partial(). Calculating checksums
has a significant impact on NAT performance on devices with less powerful CPUs.

**********

GRO disabled

Without GRO, csum_partial() is used only when validating TCP packets in
nf_conntrack_tcp_packet() (known as tcp_packet() in kernels older than 5.1).

Simplified forward trace for that case:
nf_conntrack_in
	nf_conntrack_tcp_packet
		tcp_error
			if (state->net->ct.sysctl_checksum)
				nf_checksum
					nf_ip_checksum
						__skb_checksum_complete

That validation can be disabled using the nf_conntrack_checksum sysctl, and
doing so bumps NAT speed for me from 666 Mb/s to 940 Mb/s (+41%).
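
For the record, that knob is:

echo 0 > /proc/sys/net/netfilter/nf_conntrack_checksum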

**********

GRO enabled

First of all, GRO also performs TCP validation, which requires calculating a
checksum.

Simplified forward trace for that case:
vlan_gro_receive
	call_gro_receive
		inet_gro_receive
			indirect_call_gro_receive
				tcp4_gro_receive
					skb_gro_checksum_validate
					tcp_gro_receive

*If* we had a way to disable that validation, it *would* bump NAT speed for
me from 577 Mb/s to 825 Mb/s (+43%).


Secondly, using GRO means we need to calculate a checksum before transmitting
packets (this applies to devices without HW checksum offloading). I think it's
related to packets being merged in skb_gro_receive() and then marked
CHECKSUM_PARTIAL:

vlan_gro_complete
	inet_gro_complete
		tcp4_gro_complete
			tcp_gro_complete
				skb->ip_summed = CHECKSUM_PARTIAL;

That results in bgmac calculating a checksum from scratch; take a look at
bgmac_dma_tx_add(), which does:

if (skb->ip_summed == CHECKSUM_PARTIAL)
	skb_checksum_help(skb);

Performing that whole checksum calculation will always result in GRO slowing
down NAT for me on the BCM47094 SoC with its not-so-powerful ARM CPUs.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: NAT performance regression caused by vlan GRO support
  2019-04-07 11:53 ` Rafał Miłecki
@ 2019-04-07 11:54   ` Rafał Miłecki
  2019-04-08 13:31     ` David Laight
  0 siblings, 1 reply; 16+ messages in thread
From: Rafał Miłecki @ 2019-04-07 11:54 UTC (permalink / raw)
  To: netdev, David S. Miller, Toshiaki Makita,
	Toke Høiland-Jørgensen, Florian Westphal, Eric Dumazet
  Cc: Stefano Brivio, Sabrina Dubroca, David Ahern, Felix Fietkau,
	Jo-Philipp Wich, Koen Vandeputte

Now I have some questions regarding possible optimizations. Note I'm not too
familiar with the net subsystem, so maybe some of my ideas are wrong.

On 07.04.2019 13:53, Rafał Miłecki wrote:
> On 04.04.2019 14:57, Rafał Miłecki wrote:
>> Long story short, starting with the commit 66e5133f19e9 ("vlan: Add GRO support
>> for non hardware accelerated vlan") - which first hit kernel 4.2 - NAT
>> performance of my router dropped by 30% - 40%.
> 
> I'll try to provide a summary of this issue. I'll focus on TCP traffic as
> that's what I happened to test.
> 
> Basically all slowdowns are related to csum_partial(). Calculating checksums
> has a significant impact on NAT performance on devices with less powerful CPUs.
> 
> **********
> 
> GRO disabled
> 
> Without GRO, csum_partial() is used only when validating TCP packets in
> nf_conntrack_tcp_packet() (known as tcp_packet() in kernels older than 5.1).
> 
> Simplified forward trace for that case:
> nf_conntrack_in
>      nf_conntrack_tcp_packet
>          tcp_error
>              if (state->net->ct.sysctl_checksum)
>                  nf_checksum
>                      nf_ip_checksum
>                          __skb_checksum_complete
> 
> That validation can be disabled using the nf_conntrack_checksum sysctl, and
> doing so bumps NAT speed for me from 666 Mb/s to 940 Mb/s (+41%).
> 
> **********
> 
> GRO enabled
> 
> First of all, GRO also performs TCP validation, which requires calculating a
> checksum.
> 
> Simplified forward trace for that case:
> vlan_gro_receive
>      call_gro_receive
>          inet_gro_receive
>              indirect_call_gro_receive
>                  tcp4_gro_receive
>                      skb_gro_checksum_validate
>                      tcp_gro_receive
> 
> *If* we had a way to disable that validation, it *would* bump NAT speed for
> me from 577 Mb/s to 825 Mb/s (+43%).

Could we have tcp4_gro_receive() behave similarly to tcp_error() and make it
respect the nf_conntrack_checksum sysctl value?

Could we simply add something like:
if (dev_net(skb->dev)->ct.sysctl_checksum)
to it (to additionally guard the skb_gro_checksum_validate() call)?
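
A minimal, untested sketch of what I mean, against net/ipv4/tcp_offload.c
(it reaches into netfilter state from the GRO layer, so it would at least
need CONFIG_NF_CONNTRACK guards and is probably a layering violation as-is):

static struct sk_buff *tcp4_gro_receive(struct list_head *head,
					struct sk_buff *skb)
{
	/* Skip checksum validation when nf_conntrack_checksum is 0,
	 * mirroring what tcp_error() already does for conntrack. */
	if (!NAPI_GRO_CB(skb)->flush &&
	    dev_net(skb->dev)->ct.sysctl_checksum &&
	    skb_gro_checksum_validate(skb, IPPROTO_TCP,
				      inet_gro_compute_pseudo)) {
		NAPI_GRO_CB(skb)->flush = 1;
		return NULL;
	}

	return tcp_gro_receive(head, skb);
}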


> Secondly, using GRO means we need to calculate a checksum before transmitting
> packets (this applies to devices without HW checksum offloading). I think it's
> related to packets being merged in skb_gro_receive() and then marked
> CHECKSUM_PARTIAL:
> 
> vlan_gro_complete
>      inet_gro_complete
>          tcp4_gro_complete
>              tcp_gro_complete
>                  skb->ip_summed = CHECKSUM_PARTIAL;
> 
> That results in bgmac calculating a checksum from scratch; take a look at
> bgmac_dma_tx_add(), which does:
> 
> if (skb->ip_summed == CHECKSUM_PARTIAL)
>      skb_checksum_help(skb);
> 
> Performing that whole checksum calculation will always result in GRO slowing
> down NAT for me on the BCM47094 SoC with its not-so-powerful ARM CPUs.

Is it possible to avoid CHECKSUM_PARTIAL & skb_checksum_help(), which has to
calculate a whole checksum? It's definitely possible to *update* a checksum
after simple packet changes (e.g. rewriting an IP or a port). Would it be
possible to use a similar method when dealing with packets with GRO enabled?
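
For comparison, this is roughly how such incremental updates look with the
helpers from include/net/checksum.h (just an illustration - iph/tcph stand
for the parsed headers and old_addr/new_addr for the rewritten source
address):

	/* Fix up the IP header checksum and the TCP checksum (which
	 * covers the pseudo-header) without recomputing from scratch. */
	csum_replace4(&iph->check, old_addr, new_addr);
	inet_proto_csum_replace4(&tcph->check, skb, old_addr, new_addr, true);
	iph->saddr = new_addr;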

If not, maybe we really need to think about some good & clever condition for
disabling GRO by default on hw without checksum offloading.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* RE: NAT performance regression caused by vlan GRO support
  2019-04-07 11:54   ` Rafał Miłecki
@ 2019-04-08 13:31     ` David Laight
  0 siblings, 0 replies; 16+ messages in thread
From: David Laight @ 2019-04-08 13:31 UTC (permalink / raw)
  To: 'Rafał Miłecki',
	netdev, David S. Miller, Toshiaki Makita,
	Toke Høiland-Jørgensen, Florian Westphal, Eric Dumazet
  Cc: Stefano Brivio, Sabrina Dubroca, David Ahern, Felix Fietkau,
	Jo-Philipp Wich, Koen Vandeputte

From: Rafal Milecki
> Sent: 07 April 2019 12:55
...
> If not, maybe we really need to think about some good & clever condition for
> disabling GRO by default on hw without checksum offloading.

Maybe GRO could assume the checksums are valid, so that the checksum
would only be verified when the packet is delivered locally.

If the packet is forwarded then, provided the same packet
boundaries are used, the original checksums (maybe modified
by NAT) can be used.

No idea how easy this might be :-)

	David


^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2019-04-08 13:30 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-04-04 12:57 NAT performance regression caused by vlan GRO support Rafał Miłecki
2019-04-04 15:17 ` Toshiaki Makita
2019-04-04 20:22   ` Rafał Miłecki
2019-04-05  4:26     ` Toshiaki Makita
2019-04-05  5:48       ` Rafał Miłecki
2019-04-05  7:11         ` Rafał Miłecki
2019-04-05  7:14           ` Felix Fietkau
2019-04-05  7:58             ` Toshiaki Makita
2019-04-05  8:12               ` Rafał Miłecki
2019-04-05  8:24                 ` Rafał Miłecki
2019-04-05 10:18               ` Toke Høiland-Jørgensen
2019-04-05 10:51                 ` Florian Westphal
2019-04-05 11:00                   ` Eric Dumazet
2019-04-07 11:53 ` Rafał Miłecki
2019-04-07 11:54   ` Rafał Miłecki
2019-04-08 13:31     ` David Laight
