* NAT performance regression caused by vlan GRO support
@ 2019-04-04 12:57 Rafał Miłecki
  2019-04-04 15:17 ` Toshiaki Makita
  2019-04-07 11:53 ` Rafał Miłecki
  0 siblings, 2 replies; 16+ messages in thread
From: Rafał Miłecki @ 2019-04-04 12:57 UTC (permalink / raw)
  To: netdev, David S. Miller, Toshiaki Makita
  Cc: Stefano Brivio, Sabrina Dubroca, David Ahern, Felix Fietkau,
      Jo-Philipp Wich, Koen Vandeputte

[-- Attachment #1: Type: text/plain, Size: 1951 bytes --]

Hello,

I'd like to report a regression that goes back to 2015. I know it's damn
late, but the good thing is that the regression is still easy to reproduce,
verify & revert.

Long story short: starting with commit 66e5133f19e9 ("vlan: Add GRO support
for non hardware accelerated vlan") - which first hit kernel 4.2 - the NAT
performance of my router dropped by 30%-40%.

My hardware is a BCM47094 SoC (dual-core ARM) with an integrated network
controller and an external BCM53012 switch.

Relevant setup:
* SoC network controller is wired to the hardware switch
* Switch passes 802.1q frames with VID 1 to four LAN ports
* Switch passes 802.1q frames with VID 2 to the WAN port
* Linux does NAT for LAN (eth0.1) to WAN (eth0.2)
* Linux uses pfifo and "echo 2 > rps_cpus"
* Ryzen 5 PRO 2500U (x86_64) laptop connected to a LAN port
* Intel i7-2670QM laptop connected to the WAN port
* Speed of LAN to WAN measured using iperf & TCP over 10 minutes

1) 5.1.0-rc3
[ 6]  0.0-600.0 sec  39.9 GBytes   572 Mbits/sec

2) 5.1.0-rc3 + rtcache patch
[ 6]  0.0-600.0 sec  40.0 GBytes   572 Mbits/sec

3) 5.1.0-rc3 + disabled GRO support
[ 6]  0.0-300.4 sec  27.5 GBytes   786 Mbits/sec

4) 5.1.0-rc3 + rtcache patch + disabled GRO support
[ 6]  0.0-600.0 sec  65.6 GBytes   939 Mbits/sec

5) 4.1.15 + rtcache patch
934 Mb/s

6) 4.3.4 + rtcache patch
565 Mb/s

As you can see, I can achieve a big performance gain by disabling/reverting
the GRO support. Getting up to 65% faster NAT makes a huge difference, and
ideally I'd like to get that with upstream Linux code.
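[Editorial note, not part of the original report: the quoted percentages can be
sanity-checked from the iperf summaries above. iperf reports "GBytes" as GiB
and rates in SI Mbits/sec; a minimal sketch of the arithmetic:]

```python
def iperf_rate_mbits(gbytes: float, seconds: float) -> float:
    """Recompute iperf's average rate: GiB transferred -> SI Mbit/s."""
    bits = gbytes * 2**30 * 8      # iperf's "GBytes" are GiB
    return bits / seconds / 1e6    # SI megabits per second

# Result 4: 65.6 GiB over 600 s matches iperf's own 939 Mbits/sec summary.
print(round(iperf_rate_mbits(65.6, 600)))   # -> 939

# Relative figures quoted in the report:
drop = 1 - 572 / 939      # ~0.39 -> the "30%-40%" regression with GRO on
gain = 939 / 572 - 1      # ~0.64 -> "up to 65% faster" with GRO disabled
print(f"{drop:.0%} {gain:.0%}")             # -> 39% 64%
```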
Could someone help me and check the reported commit/code, please? Is there
any other info I can provide or anything I can test for you?


--- a/net/8021q/vlan_core.c
+++ b/net/8021q/vlan_core.c
@@ -545,6 +545,8 @@ static int __init vlan_offload_init(void)
 {
 	unsigned int i;
 
+	return -ENOTSUPP;
+
 	for (i = 0; i < ARRAY_SIZE(vlan_packet_offloads); i++)
 		dev_add_offload(&vlan_packet_offloads[i]);

[-- Attachment #2: .config --]
[-- Type: application/x-config, Size: 77368 bytes --]

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: NAT performance regression caused by vlan GRO support
  2019-04-04 12:57 NAT performance regression caused by vlan GRO support Rafał Miłecki
@ 2019-04-04 15:17 ` Toshiaki Makita
  2019-04-04 20:22   ` Rafał Miłecki
  2019-04-07 11:53 ` Rafał Miłecki
  1 sibling, 1 reply; 16+ messages in thread
From: Toshiaki Makita @ 2019-04-04 15:17 UTC (permalink / raw)
  To: Rafał Miłecki
  Cc: netdev, David S. Miller, Toshiaki Makita, Stefano Brivio,
      Sabrina Dubroca, David Ahern, Felix Fietkau, Jo-Philipp Wich,
      Koen Vandeputte

Hi Rafał,

On 19/04/04 (木) 21:57:15, Rafał Miłecki wrote:
> Hello,
>
> I'd like to report a regression that goes back to 2015. I know it's damn
> late, but the good thing is, the regression is still easy to reproduce,
> verify & revert.
>
> Long story short, starting with the commit 66e5133f19e9 ("vlan: Add GRO
> support for non hardware accelerated vlan") - which first hit kernel 4.2 -
> NAT performance of my router dropped by 30% - 40%.
>
> My hardware is BCM47094 SoC (dual core ARM) with integrated network
> controller and external BCM53012 switch.
>
> Relevant setup:
> * SoC network controller is wired to the hardware switch
> * Switch passes 802.1q frames with VID 1 to four LAN ports
> * Switch passes 802.1q frames with VID 2 to WAN port
> * Linux does NAT for LAN (eth0.1) to WAN (eth0.2)
> * Linux uses pfifo and "echo 2 > rps_cpus"
> * Ryzen 5 PRO 2500U (x86_64) laptop connected to a LAN port
> * Intel i7-2670QM laptop connected to a WAN port
> * Speed of LAN to WAN measured using iperf & TCP over 10 minutes
>
> 1) 5.1.0-rc3
> [ 6]  0.0-600.0 sec  39.9 GBytes   572 Mbits/sec
>
> 2) 5.1.0-rc3 + rtcache patch
> [ 6]  0.0-600.0 sec  40.0 GBytes   572 Mbits/sec
>
> 3) 5.1.0-rc3 + disable GRO support
> [ 6]  0.0-300.4 sec  27.5 GBytes   786 Mbits/sec
>
> 4) 5.1.0-rc3 + rtcache patch + disable GRO support
> [ 6]  0.0-600.0 sec  65.6 GBytes   939 Mbits/sec

Did you test it with disabling GRO by ethtool -K?
Is this the result with your reverting patch?
It's late night in Japan so I think I will try to reproduce it tomorrow.

Thanks.

>
> 5) 4.1.15 + rtcache patch
> 934 Mb/s
>
> 6) 4.3.4 + rtcache patch
> 565 Mb/s
>
> As you can see I can achieve a big performance gain by disabling/reverting a
> GRO support. Getting up to 65% faster NAT makes a huge difference and ideally
> I'd like to get that with upstream Linux code.
>
> Could someone help me and check the reported commit/code, please? Is there
> any other info I can provide or anything I can test for you?
>
>
> --- a/net/8021q/vlan_core.c
> +++ b/net/8021q/vlan_core.c
> @@ -545,6 +545,8 @@ static int __init vlan_offload_init(void)
>  {
>  	unsigned int i;
>
> +	return -ENOTSUPP;
> +
>  	for (i = 0; i < ARRAY_SIZE(vlan_packet_offloads); i++)
>  		dev_add_offload(&vlan_packet_offloads[i]);
* Re: NAT performance regression caused by vlan GRO support 2019-04-04 15:17 ` Toshiaki Makita @ 2019-04-04 20:22 ` Rafał Miłecki 2019-04-05 4:26 ` Toshiaki Makita 0 siblings, 1 reply; 16+ messages in thread From: Rafał Miłecki @ 2019-04-04 20:22 UTC (permalink / raw) To: Toshiaki Makita Cc: netdev, David S. Miller, Toshiaki Makita, Stefano Brivio, Sabrina Dubroca, David Ahern, Felix Fietkau, Jo-Philipp Wich, Koen Vandeputte On 04.04.2019 17:17, Toshiaki Makita wrote: > On 19/04/04 (木) 21:57:15, Rafał Miłecki wrote: >> I'd like to report a regression that goes back to the 2015. I know it's damn >> late, but the good thing is, the regression is still easy to reproduce, verify & >> revert. >> >> Long story short, starting with the commit 66e5133f19e9 ("vlan: Add GRO support >> for non hardware accelerated vlan") - which first hit kernel 4.2 - NAT >> performance of my router dropped by 30% - 40%. >> >> My hardware is BCM47094 SoC (dual core ARM) with integrated network controller >> and external BCM53012 switch. >> >> Relevant setup: >> * SoC network controller is wired to the hardware switch >> * Switch passes 802.1q frames with VID 1 to four LAN ports >> * Switch passes 802.1q frames with VID 2 to WAN port >> * Linux does NAT for LAN (eth0.1) to WAN (eth0.2) >> * Linux uses pfifo and "echo 2 > rps_cpus" >> * Ryzen 5 PRO 2500U (x86_64) laptop connected to a LAN port >> * Intel i7-2670QM laptop connected to a WAN port >> * Speed of LAN to WAN measured using iperf & TCP over 10 minutes >> >> 1) 5.1.0-rc3 >> [ 6] 0.0-600.0 sec 39.9 GBytes 572 Mbits/sec >> >> 2) 5.1.0-rc3 + rtcache patch >> [ 6] 0.0-600.0 sec 40.0 GBytes 572 Mbits/sec >> >> 3) 5.1.0-rc3 + disable GRO support >> [ 6] 0.0-300.4 sec 27.5 GBytes 786 Mbits/sec >> >> 4) 5.1.0-rc3 + rtcache patch + disable GRO support >> [ 6] 0.0-600.0 sec 65.6 GBytes 939 Mbits/sec > > Did you test it with disabling GRO by ethtool -K? Oh, I didn't know about such possibility! 
I just tested:
1) Kernel with GRO support left in place (no local patch disabling it)
2) ethtool -K eth0 gro off
and it bumped my NAT performance from 576 Mb/s to 939 Mb/s. I can reliably
break/fix NAT performance by just calling ethtool -K eth0 gro on/off.


> Is this the result with your reverting patch?

Previous results were coming from a kernel with a patched
vlan_offload_init() - see the diff at the end of my first e-mail.


> It's late night in Japan so I think I will try to reproduce it tomorrow.

Thank you!
* Re: NAT performance regression caused by vlan GRO support 2019-04-04 20:22 ` Rafał Miłecki @ 2019-04-05 4:26 ` Toshiaki Makita 2019-04-05 5:48 ` Rafał Miłecki 0 siblings, 1 reply; 16+ messages in thread From: Toshiaki Makita @ 2019-04-05 4:26 UTC (permalink / raw) To: Rafał Miłecki Cc: Toshiaki Makita, netdev, David S. Miller, Stefano Brivio, Sabrina Dubroca, David Ahern, Felix Fietkau, Jo-Philipp Wich, Koen Vandeputte On 2019/04/05 5:22, Rafał Miłecki wrote: > On 04.04.2019 17:17, Toshiaki Makita wrote: >> On 19/04/04 (木) 21:57:15, Rafał Miłecki wrote: >>> I'd like to report a regression that goes back to the 2015. I know >>> it's damn >>> late, but the good thing is, the regression is still easy to >>> reproduce, verify & >>> revert. >>> >>> Long story short, starting with the commit 66e5133f19e9 ("vlan: Add >>> GRO support >>> for non hardware accelerated vlan") - which first hit kernel 4.2 - NAT >>> performance of my router dropped by 30% - 40%. >>> >>> My hardware is BCM47094 SoC (dual core ARM) with integrated network >>> controller >>> and external BCM53012 switch. >>> >>> Relevant setup: >>> * SoC network controller is wired to the hardware switch >>> * Switch passes 802.1q frames with VID 1 to four LAN ports >>> * Switch passes 802.1q frames with VID 2 to WAN port >>> * Linux does NAT for LAN (eth0.1) to WAN (eth0.2) >>> * Linux uses pfifo and "echo 2 > rps_cpus" >>> * Ryzen 5 PRO 2500U (x86_64) laptop connected to a LAN port >>> * Intel i7-2670QM laptop connected to a WAN port >>> * Speed of LAN to WAN measured using iperf & TCP over 10 minutes >>> >>> 1) 5.1.0-rc3 >>> [ 6] 0.0-600.0 sec 39.9 GBytes 572 Mbits/sec >>> >>> 2) 5.1.0-rc3 + rtcache patch >>> [ 6] 0.0-600.0 sec 40.0 GBytes 572 Mbits/sec >>> >>> 3) 5.1.0-rc3 + disable GRO support >>> [ 6] 0.0-300.4 sec 27.5 GBytes 786 Mbits/sec >>> >>> 4) 5.1.0-rc3 + rtcache patch + disable GRO support >>> [ 6] 0.0-600.0 sec 65.6 GBytes 939 Mbits/sec >> >> Did you test it with disabling GRO by ethtool -K? 
>
> Oh, I didn't know about such possibility! I just tested:
> 1) Kernel with GRO support left in place (no local patch disabling it)
> 2) ethtool -K eth0 gro off
> and it bumped my NAT performance from 576 Mb/s to 939 Mb/s. I can reliably
> break/fix NAT performance by just calling ethtool -K eth0 gro on/off.
>
>
>> Is this the result with your reverting patch?
>
> Previous results were coming from kernel with patched
> vlan_offload_init() - see
> diff at the end of my first e-mail.
>
>
>> It's late night in Japan so I think I will try to reproduce it tomorrow.

My test results:

Receiving packets from eth0.10, forwarding them to eth0.20 and applying
MASQUERADE on eth0.20, using i40e 25G NIC on kernel 4.20.13.
Disabled rxvlan by ethtool -K to exercise vlan_gro_receive().
Measured TCP throughput by netperf.

GRO on : 17 Gbps
GRO off:  5 Gbps

So I failed to reproduce your problem.

Would you check the CPU usage by "mpstat -P ALL" or similar (like "sar
-u ALL -P ALL") to check if the traffic is able to consume 100% CPU on
your machine?

If CPU is 100%, perf may help us analyze your problem. If it's
available, try running this while testing:

# perf record -a -g -- sleep 5

And then run this after testing:

# perf report --no-child

-- 
Toshiaki Makita
* Re: NAT performance regression caused by vlan GRO support 2019-04-05 4:26 ` Toshiaki Makita @ 2019-04-05 5:48 ` Rafał Miłecki 2019-04-05 7:11 ` Rafał Miłecki 0 siblings, 1 reply; 16+ messages in thread From: Rafał Miłecki @ 2019-04-05 5:48 UTC (permalink / raw) To: Toshiaki Makita Cc: Toshiaki Makita, netdev, David S. Miller, Stefano Brivio, Sabrina Dubroca, David Ahern, Felix Fietkau, Jo-Philipp Wich, Koen Vandeputte On 05.04.2019 06:26, Toshiaki Makita wrote: > On 2019/04/05 5:22, Rafał Miłecki wrote: >> On 04.04.2019 17:17, Toshiaki Makita wrote: >>> On 19/04/04 (木) 21:57:15, Rafał Miłecki wrote: >>>> I'd like to report a regression that goes back to the 2015. I know >>>> it's damn >>>> late, but the good thing is, the regression is still easy to >>>> reproduce, verify & >>>> revert. >>>> >>>> Long story short, starting with the commit 66e5133f19e9 ("vlan: Add >>>> GRO support >>>> for non hardware accelerated vlan") - which first hit kernel 4.2 - NAT >>>> performance of my router dropped by 30% - 40%. >>>> >>>> My hardware is BCM47094 SoC (dual core ARM) with integrated network >>>> controller >>>> and external BCM53012 switch. 
>>>> >>>> Relevant setup: >>>> * SoC network controller is wired to the hardware switch >>>> * Switch passes 802.1q frames with VID 1 to four LAN ports >>>> * Switch passes 802.1q frames with VID 2 to WAN port >>>> * Linux does NAT for LAN (eth0.1) to WAN (eth0.2) >>>> * Linux uses pfifo and "echo 2 > rps_cpus" >>>> * Ryzen 5 PRO 2500U (x86_64) laptop connected to a LAN port >>>> * Intel i7-2670QM laptop connected to a WAN port >>>> * Speed of LAN to WAN measured using iperf & TCP over 10 minutes >>>> >>>> 1) 5.1.0-rc3 >>>> [ 6] 0.0-600.0 sec 39.9 GBytes 572 Mbits/sec >>>> >>>> 2) 5.1.0-rc3 + rtcache patch >>>> [ 6] 0.0-600.0 sec 40.0 GBytes 572 Mbits/sec >>>> >>>> 3) 5.1.0-rc3 + disable GRO support >>>> [ 6] 0.0-300.4 sec 27.5 GBytes 786 Mbits/sec >>>> >>>> 4) 5.1.0-rc3 + rtcache patch + disable GRO support >>>> [ 6] 0.0-600.0 sec 65.6 GBytes 939 Mbits/sec >>> >>> Did you test it with disabling GRO by ethtool -K? >> >> Oh, I didn't know about such possibility! I just tested: >> 1) Kernel with GRO support left in place (no local patch disabling it) >> 2) ethtool -K eth0 gro off >> and it bumped my NAT performance from 576 Mb/s to 939 Mb/s. I can reliably >> break/fix NAT performance by just calling ethtool -K eth0 gro on/off. >> >> >>> Is this the result with your reverting patch? >> >> Previous results were coming from kernel with patched >> vlan_offload_init() - see >> diff at the end of my first e-mail. >> >> >>> It's late night in Japan so I think I will try to reproduce it tomorrow. > > My test results: > > Receiving packets from eth0.10, forwarding them to eth0.20 and applying > MASQUERADE on eth0.20, using i40e 25G NIC on kernel 4.20.13. > Disabled rxvlan by ethtool -K to exercise vlan_gro_receive(). > Measured TCP throughput by netperf. > > GRO on : 17 Gbps > GRO off: 5 Gbps > > So I failed to reproduce your problem. :( Thanks for trying & checking that! 
> Would you check the CPU usage by "mpstat -P ALL" or similar (like "sar
> -u ALL -P ALL") to check if the traffic is able to consume 100% CPU on
> your machine?

1) ethtool -K eth0 gro on + iperf running (577 Mb/s)
root@OpenWrt:/# mpstat -P ALL 10 3
Linux 5.1.0-rc3+ (OpenWrt)      03/27/19        _armv7l_        (2 CPU)

16:33:40  CPU   %usr  %nice   %sys  %iowait  %irq  %soft  %steal  %guest  %idle
16:33:50  all   0.00   0.00   0.00     0.00  0.00  58.79    0.00    0.00  41.21
16:33:50    0   0.00   0.00   0.00     0.00  0.00 100.00    0.00    0.00   0.00
16:33:50    1   0.00   0.00   0.00     0.00  0.00  17.58    0.00    0.00  82.42

16:33:50  CPU   %usr  %nice   %sys  %iowait  %irq  %soft  %steal  %guest  %idle
16:34:00  all   0.00   0.00   0.05     0.00  0.00  59.44    0.00    0.00  40.51
16:34:00    0   0.00   0.00   0.10     0.00  0.00  99.90    0.00    0.00   0.00
16:34:00    1   0.00   0.00   0.00     0.00  0.00  18.98    0.00    0.00  81.02

16:34:00  CPU   %usr  %nice   %sys  %iowait  %irq  %soft  %steal  %guest  %idle
16:34:10  all   0.00   0.00   0.00     0.00  0.00  59.59    0.00    0.00  40.41
16:34:10    0   0.00   0.00   0.00     0.00  0.00 100.00    0.00    0.00   0.00
16:34:10    1   0.00   0.00   0.00     0.00  0.00  19.18    0.00    0.00  80.82

Average:  CPU   %usr  %nice   %sys  %iowait  %irq  %soft  %steal  %guest  %idle
Average:  all   0.00   0.00   0.02     0.00  0.00  59.27    0.00    0.00  40.71
Average:    0   0.00   0.00   0.03     0.00  0.00  99.97    0.00    0.00   0.00
Average:    1   0.00   0.00   0.00     0.00  0.00  18.58    0.00    0.00  81.42


2) ethtool -K eth0 gro off + iperf running (941 Mb/s)
root@OpenWrt:/# mpstat -P ALL 10 3
Linux 5.1.0-rc3+ (OpenWrt)      03/27/19        _armv7l_        (2 CPU)

16:34:39  CPU   %usr  %nice   %sys  %iowait  %irq  %soft  %steal  %guest  %idle
16:34:49  all   0.00   0.00   0.05     0.00  0.00  86.91    0.00    0.00  13.04
16:34:49    0   0.00   0.00   0.10     0.00  0.00  78.22    0.00    0.00  21.68
16:34:49    1   0.00   0.00   0.00     0.00  0.00  95.60    0.00    0.00   4.40

16:34:49  CPU   %usr  %nice   %sys  %iowait  %irq  %soft  %steal  %guest  %idle
16:34:59  all   0.00   0.00   0.10     0.00  0.00  87.06    0.00    0.00  12.84
16:34:59    0   0.00   0.00   0.20     0.00  0.00  79.72    0.00    0.00  20.08
16:34:59    1   0.00   0.00   0.00     0.00  0.00  94.41    0.00    0.00   5.59

16:34:59  CPU   %usr  %nice   %sys  %iowait  %irq  %soft  %steal  %guest  %idle
16:35:09  all   0.00   0.00   0.05     0.00  0.00  85.71    0.00    0.00  14.24
16:35:09    0   0.00   0.00   0.10     0.00  0.00  79.42    0.00    0.00  20.48
16:35:09    1   0.00   0.00   0.00     0.00  0.00  92.01    0.00    0.00   7.99

Average:  CPU   %usr  %nice   %sys  %iowait  %irq  %soft  %steal  %guest  %idle
Average:  all   0.00   0.00   0.07     0.00  0.00  86.56    0.00    0.00  13.37
Average:    0   0.00   0.00   0.13     0.00  0.00  79.12    0.00    0.00  20.75
Average:    1   0.00   0.00   0.00     0.00  0.00  94.01    0.00    0.00   5.99


3) System idle (no iperf)
root@OpenWrt:/# mpstat -P ALL 10 1
Linux 5.1.0-rc3+ (OpenWrt)      03/27/19        _armv7l_        (2 CPU)

16:35:31  CPU   %usr  %nice   %sys  %iowait  %irq  %soft  %steal  %guest  %idle
16:35:41  all   0.00   0.00   0.00     0.00  0.00   0.00    0.00    0.00 100.00
16:35:41    0   0.00   0.00   0.00     0.00  0.00   0.00    0.00    0.00 100.00
16:35:41    1   0.00   0.00   0.00     0.00  0.00   0.00    0.00    0.00 100.00

Average:  CPU   %usr  %nice   %sys  %iowait  %irq  %soft  %steal  %guest  %idle
Average:  all   0.00   0.00   0.00     0.00  0.00   0.00    0.00    0.00 100.00
Average:    0   0.00   0.00   0.00     0.00  0.00   0.00    0.00    0.00 100.00
Average:    1   0.00   0.00   0.00     0.00  0.00   0.00    0.00    0.00 100.00

> If CPU is 100%, perf may help us analyze your problem. If it's
> available, try running below while testing:
> # perf record -a -g -- sleep 5
>
> And then run this after testing:
> # perf report --no-child

I can see my CPU 0 is fully loaded when using "gro on". I'll try perf now.
* Re: NAT performance regression caused by vlan GRO support 2019-04-05 5:48 ` Rafał Miłecki @ 2019-04-05 7:11 ` Rafał Miłecki 2019-04-05 7:14 ` Felix Fietkau 0 siblings, 1 reply; 16+ messages in thread From: Rafał Miłecki @ 2019-04-05 7:11 UTC (permalink / raw) To: Toshiaki Makita Cc: Toshiaki Makita, netdev, David S. Miller, Stefano Brivio, Sabrina Dubroca, David Ahern, Felix Fietkau, Jo-Philipp Wich, Koen Vandeputte On 05.04.2019 07:48, Rafał Miłecki wrote: > On 05.04.2019 06:26, Toshiaki Makita wrote: >> My test results: >> >> Receiving packets from eth0.10, forwarding them to eth0.20 and applying >> MASQUERADE on eth0.20, using i40e 25G NIC on kernel 4.20.13. >> Disabled rxvlan by ethtool -K to exercise vlan_gro_receive(). >> Measured TCP throughput by netperf. >> >> GRO on : 17 Gbps >> GRO off: 5 Gbps >> >> So I failed to reproduce your problem. > > :( Thanks for trying & checking that! > > >> Would you check the CPU usage by "mpstat -P ALL" or similar (like "sar >> -u ALL -P ALL") to check if the traffic is able to consume 100% CPU on >> your machine? 
> > 1) ethtool -K eth0 gro on + iperf running (577 Mb/s) > root@OpenWrt:/# mpstat -P ALL 10 3 > Linux 5.1.0-rc3+ (OpenWrt) 03/27/19 _armv7l_ (2 CPU) > > 16:33:40 CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle > 16:33:50 all 0.00 0.00 0.00 0.00 0.00 58.79 0.00 0.00 41.21 > 16:33:50 0 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 > 16:33:50 1 0.00 0.00 0.00 0.00 0.00 17.58 0.00 0.00 82.42 > > 16:33:50 CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle > 16:34:00 all 0.00 0.00 0.05 0.00 0.00 59.44 0.00 0.00 40.51 > 16:34:00 0 0.00 0.00 0.10 0.00 0.00 99.90 0.00 0.00 0.00 > 16:34:00 1 0.00 0.00 0.00 0.00 0.00 18.98 0.00 0.00 81.02 > > 16:34:00 CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle > 16:34:10 all 0.00 0.00 0.00 0.00 0.00 59.59 0.00 0.00 40.41 > 16:34:10 0 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 > 16:34:10 1 0.00 0.00 0.00 0.00 0.00 19.18 0.00 0.00 80.82 > > Average: CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle > Average: all 0.00 0.00 0.02 0.00 0.00 59.27 0.00 0.00 40.71 > Average: 0 0.00 0.00 0.03 0.00 0.00 99.97 0.00 0.00 0.00 > Average: 1 0.00 0.00 0.00 0.00 0.00 18.58 0.00 0.00 81.42 > > > 2) ethtool -K eth0 gro off + iperf running (941 Mb/s) > root@OpenWrt:/# mpstat -P ALL 10 3 > Linux 5.1.0-rc3+ (OpenWrt) 03/27/19 _armv7l_ (2 CPU) > > 16:34:39 CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle > 16:34:49 all 0.00 0.00 0.05 0.00 0.00 86.91 0.00 0.00 13.04 > 16:34:49 0 0.00 0.00 0.10 0.00 0.00 78.22 0.00 0.00 21.68 > 16:34:49 1 0.00 0.00 0.00 0.00 0.00 95.60 0.00 0.00 4.40 > > 16:34:49 CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle > 16:34:59 all 0.00 0.00 0.10 0.00 0.00 87.06 0.00 0.00 12.84 > 16:34:59 0 0.00 0.00 0.20 0.00 0.00 79.72 0.00 0.00 20.08 > 16:34:59 1 0.00 0.00 0.00 0.00 0.00 94.41 0.00 0.00 5.59 > > 16:34:59 CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle > 16:35:09 all 0.00 0.00 0.05 0.00 0.00 85.71 0.00 0.00 14.24 > 16:35:09 0 0.00 0.00 0.10 0.00 0.00 79.42 
0.00   0.00  20.48
> 16:35:09    1   0.00   0.00   0.00     0.00  0.00  92.01    0.00    0.00   7.99
>
> Average:  CPU   %usr  %nice   %sys  %iowait  %irq  %soft  %steal  %guest  %idle
> Average:  all   0.00   0.00   0.07     0.00  0.00  86.56    0.00    0.00  13.37
> Average:    0   0.00   0.00   0.13     0.00  0.00  79.12    0.00    0.00  20.75
> Average:    1   0.00   0.00   0.00     0.00  0.00  94.01    0.00    0.00   5.99
>
>
> 3) System idle (no iperf)
> root@OpenWrt:/# mpstat -P ALL 10 1
> Linux 5.1.0-rc3+ (OpenWrt)      03/27/19        _armv7l_        (2 CPU)
>
> 16:35:31  CPU   %usr  %nice   %sys  %iowait  %irq  %soft  %steal  %guest  %idle
> 16:35:41  all   0.00   0.00   0.00     0.00  0.00   0.00    0.00    0.00 100.00
> 16:35:41    0   0.00   0.00   0.00     0.00  0.00   0.00    0.00    0.00 100.00
> 16:35:41    1   0.00   0.00   0.00     0.00  0.00   0.00    0.00    0.00 100.00
>
> Average:  CPU   %usr  %nice   %sys  %iowait  %irq  %soft  %steal  %guest  %idle
> Average:  all   0.00   0.00   0.00     0.00  0.00   0.00    0.00    0.00 100.00
> Average:    0   0.00   0.00   0.00     0.00  0.00   0.00    0.00    0.00 100.00
> Average:    1   0.00   0.00   0.00     0.00  0.00   0.00    0.00    0.00 100.00
>
>
>> If CPU is 100%, perf may help us analyze your problem. If it's
>> available, try running below while testing:
>> # perf record -a -g -- sleep 5
>>
>> And then run this after testing:
>> # perf report --no-child
>
> I can see my CPU 0 is fully loaded when using "gro on". I'll try perf now.

I guess it's GRO + csum_partial() that is to blame for this performance drop.

Maybe csum_partial() is very fast on your powerful machine and a few extra
calls don't make a difference? I can imagine it affecting a much slower home
router with ARM cores.
1) ethtool -K eth0 gro on

Samples: 34K of event 'cycles', Event count (approx.): 10041345370
  Overhead  Command      Shared Object      Symbol
+   25,46%  ksoftirqd/0  [kernel.kallsyms]  [k] csum_partial
+    8,82%  ksoftirqd/0  [kernel.kallsyms]  [k] v7_dma_inv_range
+    6,03%  swapper      [kernel.kallsyms]  [k] arch_cpu_idle
+    4,08%  ksoftirqd/0  [kernel.kallsyms]  [k] v7_dma_clean_range
+    3,82%  ksoftirqd/0  [kernel.kallsyms]  [k] l2c210_inv_range
+    3,14%  swapper      [kernel.kallsyms]  [k] rcu_idle_exit
+    3,00%  ksoftirqd/0  [kernel.kallsyms]  [k] l2c210_clean_range
+    2,43%  ksoftirqd/0  [kernel.kallsyms]  [k] bgmac_start_xmit
+    1,24%  swapper      [kernel.kallsyms]  [k] csum_partial
+    1,20%  swapper      [kernel.kallsyms]  [k] do_idle
+    1,19%  swapper      [kernel.kallsyms]  [k] skb_segment
+    1,19%  ksoftirqd/0  [kernel.kallsyms]  [k] arm_dma_unmap_page
+    1,00%  ksoftirqd/0  [kernel.kallsyms]  [k] bgmac_poll
+    0,95%  ksoftirqd/0  [kernel.kallsyms]  [k] __slab_free.constprop.3
+    0,80%  ksoftirqd/0  [kernel.kallsyms]  [k] skb_release_data
+    0,77%  swapper      [kernel.kallsyms]  [k] __dev_queue_xmit
+    0,73%  ksoftirqd/0  [kernel.kallsyms]  [k] build_skb
+    0,68%  ksoftirqd/0  [kernel.kallsyms]  [k] skb_segment
+    0,66%  ksoftirqd/0  [kernel.kallsyms]  [k] mmiocpy
+    0,66%  ksoftirqd/0  [kernel.kallsyms]  [k] skb_checksum_help
+    0,65%  ksoftirqd/0  [kernel.kallsyms]  [k] dev_gro_receive
+    0,64%  ksoftirqd/0  [kernel.kallsyms]  [k] page_address
+    0,62%  ksoftirqd/0  [kernel.kallsyms]  [k] __qdisc_run
+    0,62%  ksoftirqd/0  [kernel.kallsyms]  [k] dma_cache_maint_page
+    0,59%  swapper      [kernel.kallsyms]  [k] __kmalloc_track_caller
+    0,59%  swapper      [kernel.kallsyms]  [k] mmiocpy
+    0,58%  ksoftirqd/0  [kernel.kallsyms]  [k] sch_direct_xmit
+    0,55%  ksoftirqd/0  [kernel.kallsyms]  [k] mmioset
+    0,52%  ksoftirqd/0  [kernel.kallsyms]  [k] inet_gro_receive
     0,49%  ksoftirqd/0  [kernel.kallsyms]  [k] netdev_alloc_frag
     0,47%  swapper      [kernel.kallsyms]  [k] __netif_receive_skb_core
     0,45%  swapper      [kernel.kallsyms]  [k] kmem_cache_alloc
     0,45%  ksoftirqd/0  [kernel.kallsyms]  [k] __skb_checksum
     0,43%  swapper      [kernel.kallsyms]  [k] v7_dma_clean_range
     0,39%  ksoftirqd/0  [kernel.kallsyms]  [k] kmem_cache_alloc
     0,36%  ksoftirqd/0  [kernel.kallsyms]  [k] qdisc_dequeue_head
     0,36%  ksoftirqd/0  [kernel.kallsyms]  [k] arm_dma_map_page
     0,35%  swapper      [kernel.kallsyms]  [k] mmioset
     0,34%  ksoftirqd/0  [kernel.kallsyms]  [k] tcp_gro_receive
     0,33%  swapper      [kernel.kallsyms]  [k] __copy_skb_header
     0,33%  ksoftirqd/0  [kernel.kallsyms]  [k] kmem_cache_free
     0,32%  ksoftirqd/0  [kernel.kallsyms]  [k] netif_skb_features
     0,30%  swapper      [kernel.kallsyms]  [k] netif_skb_features
     0,30%  ksoftirqd/0  [kernel.kallsyms]  [k] __skb_flow_dissect

2) ethtool -K eth0 gro off

Samples: 39K of event 'cycles', Event count (approx.): 13065826851
  Overhead  Command      Shared Object      Symbol
+   11,09%  swapper      [kernel.kallsyms]  [k] v7_dma_inv_range
+    5,86%  ksoftirqd/1  [kernel.kallsyms]  [k] v7_dma_clean_range
+    5,77%  swapper      [kernel.kallsyms]  [k] l2c210_inv_range
+    5,38%  swapper      [kernel.kallsyms]  [k] __irqentry_text_end
+    4,44%  swapper      [kernel.kallsyms]  [k] bcma_host_soc_read32
+    3,28%  ksoftirqd/1  [kernel.kallsyms]  [k] __netif_receive_skb_core
+    3,25%  ksoftirqd/1  [kernel.kallsyms]  [k] l2c210_clean_range
+    2,70%  swapper      [kernel.kallsyms]  [k] arch_cpu_idle
+    2,25%  swapper      [kernel.kallsyms]  [k] bgmac_poll
+    2,14%  ksoftirqd/1  [kernel.kallsyms]  [k] bgmac_start_xmit
+    1,79%  ksoftirqd/1  [kernel.kallsyms]  [k] __dev_queue_xmit
+    1,36%  ksoftirqd/1  [kernel.kallsyms]  [k] skb_vlan_untag
+    1,11%  swapper      [kernel.kallsyms]  [k] __skb_flow_dissect
+    1,07%  ksoftirqd/1  [kernel.kallsyms]  [k] netif_skb_features
+    0,98%  ksoftirqd/1  [kernel.kallsyms]  [k] ip_rcv_core.constprop.3
+    0,92%  ksoftirqd/1  [kernel.kallsyms]  [k] sch_direct_xmit
+    0,90%  ksoftirqd/1  [kernel.kallsyms]  [k] __local_bh_enable_ip
+    0,86%  ksoftirqd/1  [kernel.kallsyms]  [k] nf_hook_slow
+    0,82%  swapper      [kernel.kallsyms]  [k] net_rx_action
+    0,80%  ksoftirqd/1  [kernel.kallsyms]  [k] validate_xmit_skb.constprop.30
+    0,75%  swapper      [kernel.kallsyms]  [k] build_skb
+    0,72%  ksoftirqd/1  [kernel.kallsyms]  [k] ip_forward
+    0,71%  ksoftirqd/1  [kernel.kallsyms]  [k] br_handle_frame_finish
+    0,71%  ksoftirqd/1  [kernel.kallsyms]  [k] skb_pull_rcsum
+    0,65%  swapper      [kernel.kallsyms]  [k] arm_dma_unmap_page
+    0,59%  ksoftirqd/1  [kernel.kallsyms]  [k] ip_finish_output2
+    0,59%  swapper      [kernel.kallsyms]  [k] __skb_get_hash
+    0,58%  swapper      [kernel.kallsyms]  [k] dma_cache_maint_page
+    0,55%  ksoftirqd/1  [kernel.kallsyms]  [k] fdb_find_rcu
+    0,54%  swapper      [kernel.kallsyms]  [k] bcma_host_soc_write32
+    0,53%  ksoftirqd/1  [kernel.kallsyms]  [k] vlan_do_receive
+    0,52%  ksoftirqd/1  [kernel.kallsyms]  [k] memmove
+    0,52%  swapper      [kernel.kallsyms]  [k] rcu_idle_exit
+    0,51%  ksoftirqd/1  [kernel.kallsyms]  [k] ip_rcv
+    0,51%  ksoftirqd/1  [kernel.kallsyms]  [k] dev_hard_start_xmit
     0,49%  ksoftirqd/1  [kernel.kallsyms]  [k] ip_output
     0,46%  ksoftirqd/1  [kernel.kallsyms]  [k] vlan_dev_hard_start_xmit
     0,45%  swapper      [kernel.kallsyms]  [k] enqueue_to_backlog
     0,42%  swapper      [kernel.kallsyms]  [k] netdev_alloc_frag
     0,42%  swapper      [kernel.kallsyms]  [k] skb_release_data
     0,41%  ksoftirqd/1  [kernel.kallsyms]  [k] ip_forward_finish
     0,40%  ksoftirqd/1  [kernel.kallsyms]  [k] br_handle_frame
     0,37%  ksoftirqd/1  [kernel.kallsyms]  [k] mmiocpy
     0,37%  ksoftirqd/1  [kernel.kallsyms]  [k] page_address
     0,36%  ksoftirqd/0  [kernel.kallsyms]  [k] v7_dma_inv_range
     0,36%  ksoftirqd/1  [kernel.kallsyms]  [k] memcmp
     0,36%  ksoftirqd/1  [kernel.kallsyms]  [k] netif_receive_skb_internal
     0,34%  swapper      [kernel.kallsyms]  [k] page_address
     0,34%  swapper      [kernel.kallsyms]  [k] mmioset
     0,33%  ksoftirqd/1  [kernel.kallsyms]  [k] br_pass_frame_up
     0,33%  ksoftirqd/1  [kernel.kallsyms]  [k] neigh_connected_output
     0,33%  swapper      [kernel.kallsyms]  [k] kmem_cache_alloc
     0,31%  ksoftirqd/1  [kernel.kallsyms]  [k] mmioset
     0,30%  ksoftirqd/1  [kernel.kallsyms]  [k] ip_finish_output
     0,30%  ksoftirqd/1  [kernel.kallsyms]  [k] bcma_bgmac_write
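[Editorial note, not from the thread: csum_partial(), which dominates the
GRO-on profile above, computes the 16-bit ones' complement sum that IP/TCP
checksums are built from; without hardware offload the CPU must touch every
payload byte. A minimal Python sketch of the same arithmetic, following the
folding procedure of RFC 1071:]

```python
def ones_complement_sum(data: bytes) -> int:
    """16-bit ones' complement sum over data (the value csum_partial() folds to)."""
    if len(data) % 2:
        data += b"\x00"                    # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)   # fold the carry back in
    return total

# Example bytes from RFC 1071, section 3: the sum folds to 0xddf2;
# a header checksum field would hold its complement, 0x220d.
s = ones_complement_sum(bytes([0x00, 0x01, 0xF2, 0x03, 0xF4, 0xF5, 0xF6, 0xF7]))
print(hex(s), hex(~s & 0xFFFF))   # -> 0xddf2 0x220d
```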
* Re: NAT performance regression caused by vlan GRO support 2019-04-05 7:11 ` Rafał Miłecki @ 2019-04-05 7:14 ` Felix Fietkau 2019-04-05 7:58 ` Toshiaki Makita 0 siblings, 1 reply; 16+ messages in thread From: Felix Fietkau @ 2019-04-05 7:14 UTC (permalink / raw) To: Rafał Miłecki, Toshiaki Makita Cc: Toshiaki Makita, netdev, David S. Miller, Stefano Brivio, Sabrina Dubroca, David Ahern, Jo-Philipp Wich, Koen Vandeputte On 2019-04-05 09:11, Rafał Miłecki wrote: > On 05.04.2019 07:48, Rafał Miłecki wrote: >> On 05.04.2019 06:26, Toshiaki Makita wrote: >>> My test results: >>> >>> Receiving packets from eth0.10, forwarding them to eth0.20 and applying >>> MASQUERADE on eth0.20, using i40e 25G NIC on kernel 4.20.13. >>> Disabled rxvlan by ethtool -K to exercise vlan_gro_receive(). >>> Measured TCP throughput by netperf. >>> >>> GRO on : 17 Gbps >>> GRO off: 5 Gbps >>> >>> So I failed to reproduce your problem. >> >> :( Thanks for trying & checking that! >> >> >>> Would you check the CPU usage by "mpstat -P ALL" or similar (like "sar >>> -u ALL -P ALL") to check if the traffic is able to consume 100% CPU on >>> your machine? 
>> >> 1) ethtool -K eth0 gro on + iperf running (577 Mb/s) >> root@OpenWrt:/# mpstat -P ALL 10 3 >> Linux 5.1.0-rc3+ (OpenWrt) 03/27/19 _armv7l_ (2 CPU) >> >> 16:33:40 CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle >> 16:33:50 all 0.00 0.00 0.00 0.00 0.00 58.79 0.00 0.00 41.21 >> 16:33:50 0 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 >> 16:33:50 1 0.00 0.00 0.00 0.00 0.00 17.58 0.00 0.00 82.42 >> >> 16:33:50 CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle >> 16:34:00 all 0.00 0.00 0.05 0.00 0.00 59.44 0.00 0.00 40.51 >> 16:34:00 0 0.00 0.00 0.10 0.00 0.00 99.90 0.00 0.00 0.00 >> 16:34:00 1 0.00 0.00 0.00 0.00 0.00 18.98 0.00 0.00 81.02 >> >> 16:34:00 CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle >> 16:34:10 all 0.00 0.00 0.00 0.00 0.00 59.59 0.00 0.00 40.41 >> 16:34:10 0 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 >> 16:34:10 1 0.00 0.00 0.00 0.00 0.00 19.18 0.00 0.00 80.82 >> >> Average: CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle >> Average: all 0.00 0.00 0.02 0.00 0.00 59.27 0.00 0.00 40.71 >> Average: 0 0.00 0.00 0.03 0.00 0.00 99.97 0.00 0.00 0.00 >> Average: 1 0.00 0.00 0.00 0.00 0.00 18.58 0.00 0.00 81.42 >> >> >> 2) ethtool -K eth0 gro off + iperf running (941 Mb/s) >> root@OpenWrt:/# mpstat -P ALL 10 3 >> Linux 5.1.0-rc3+ (OpenWrt) 03/27/19 _armv7l_ (2 CPU) >> >> 16:34:39 CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle >> 16:34:49 all 0.00 0.00 0.05 0.00 0.00 86.91 0.00 0.00 13.04 >> 16:34:49 0 0.00 0.00 0.10 0.00 0.00 78.22 0.00 0.00 21.68 >> 16:34:49 1 0.00 0.00 0.00 0.00 0.00 95.60 0.00 0.00 4.40 >> >> 16:34:49 CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle >> 16:34:59 all 0.00 0.00 0.10 0.00 0.00 87.06 0.00 0.00 12.84 >> 16:34:59 0 0.00 0.00 0.20 0.00 0.00 79.72 0.00 0.00 20.08 >> 16:34:59 1 0.00 0.00 0.00 0.00 0.00 94.41 0.00 0.00 5.59 >> >> 16:34:59 CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle >> 16:35:09 all 0.00 0.00 0.05 0.00 0.00 85.71 0.00 0.00 14.24 >> 
16:35:09    0   0.00   0.00   0.10     0.00  0.00  79.42    0.00    0.00  20.48
>> 16:35:09    1   0.00   0.00   0.00     0.00  0.00  92.01    0.00    0.00   7.99
>>
>> Average:  CPU   %usr  %nice   %sys  %iowait  %irq  %soft  %steal  %guest  %idle
>> Average:  all   0.00   0.00   0.07     0.00  0.00  86.56    0.00    0.00  13.37
>> Average:    0   0.00   0.00   0.13     0.00  0.00  79.12    0.00    0.00  20.75
>> Average:    1   0.00   0.00   0.00     0.00  0.00  94.01    0.00    0.00   5.99
>>
>>
>> 3) System idle (no iperf)
>> root@OpenWrt:/# mpstat -P ALL 10 1
>> Linux 5.1.0-rc3+ (OpenWrt)      03/27/19        _armv7l_        (2 CPU)
>>
>> 16:35:31  CPU   %usr  %nice   %sys  %iowait  %irq  %soft  %steal  %guest  %idle
>> 16:35:41  all   0.00   0.00   0.00     0.00  0.00   0.00    0.00    0.00 100.00
>> 16:35:41    0   0.00   0.00   0.00     0.00  0.00   0.00    0.00    0.00 100.00
>> 16:35:41    1   0.00   0.00   0.00     0.00  0.00   0.00    0.00    0.00 100.00
>>
>> Average:  CPU   %usr  %nice   %sys  %iowait  %irq  %soft  %steal  %guest  %idle
>> Average:  all   0.00   0.00   0.00     0.00  0.00   0.00    0.00    0.00 100.00
>> Average:    0   0.00   0.00   0.00     0.00  0.00   0.00    0.00    0.00 100.00
>> Average:    1   0.00   0.00   0.00     0.00  0.00   0.00    0.00    0.00 100.00
>>
>>
>>> If CPU is 100%, perf may help us analyze your problem. If it's
>>> available, try running below while testing:
>>> # perf record -a -g -- sleep 5
>>>
>>> And then run this after testing:
>>> # perf report --no-child
>>
>> I can see my CPU 0 is fully loaded when using "gro on". I'll try perf now.
>
> I guess its GRO + csum_partial() to be blamed for this performance drop.
>
> Maybe csum_partial() is very fast on your powerful machine and few extra calls
> don't make a difference? I can imagine it affecting much slower home router with
> ARM cores.

Most high performance Ethernet devices implement hardware checksum offload,
which completely gets rid of this overhead. Unfortunately, the BCM53xx/47xx
Ethernet MAC doesn't have this, which is why you're getting such crappy
performance.

- Felix
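[Editorial note, not from the thread: whether a NIC exposes checksum (and
other) offloads can be checked by parsing `ethtool -k <dev>` output. A hedged
sketch of such a parser; the sample output below is illustrative for a MAC
without hardware checksum offload, not captured from the BCM47094:]

```python
def parse_ethtool_features(output: str) -> dict:
    """Parse `ethtool -k <dev>` output into a {feature: enabled} mapping."""
    features = {}
    for line in output.splitlines():
        name, _, value = line.partition(":")
        value = value.strip()
        if not value:
            continue                      # skip banner lines like "Features for eth0:"
        features[name.strip()] = value.startswith("on")
    return features

# Illustrative output; "[fixed]" marks features the driver cannot toggle.
sample = """Features for eth0:
rx-checksumming: off [fixed]
tx-checksumming: off [fixed]
generic-receive-offload: on
"""
feats = parse_ethtool_features(sample)
print(feats["rx-checksumming"], feats["generic-receive-offload"])  # -> False True
```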
* Re: NAT performance regression caused by vlan GRO support 2019-04-05 7:14 ` Felix Fietkau @ 2019-04-05 7:58 ` Toshiaki Makita 2019-04-05 8:12 ` Rafał Miłecki 2019-04-05 10:18 ` Toke Høiland-Jørgensen 0 siblings, 2 replies; 16+ messages in thread From: Toshiaki Makita @ 2019-04-05 7:58 UTC (permalink / raw) To: Felix Fietkau, Rafał Miłecki Cc: Toshiaki Makita, netdev, David S. Miller, Stefano Brivio, Sabrina Dubroca, David Ahern, Jo-Philipp Wich, Koen Vandeputte On 2019/04/05 16:14, Felix Fietkau wrote: > On 2019-04-05 09:11, Rafał Miłecki wrote: >> On 05.04.2019 07:48, Rafał Miłecki wrote: >>> On 05.04.2019 06:26, Toshiaki Makita wrote: >>>> My test results: >>>> >>>> Receiving packets from eth0.10, forwarding them to eth0.20 and applying >>>> MASQUERADE on eth0.20, using i40e 25G NIC on kernel 4.20.13. >>>> Disabled rxvlan by ethtool -K to exercise vlan_gro_receive(). >>>> Measured TCP throughput by netperf. >>>> >>>> GRO on : 17 Gbps >>>> GRO off: 5 Gbps >>>> >>>> So I failed to reproduce your problem. >>> >>> :( Thanks for trying & checking that! >>> >>> >>>> Would you check the CPU usage by "mpstat -P ALL" or similar (like "sar >>>> -u ALL -P ALL") to check if the traffic is able to consume 100% CPU on >>>> your machine? 
>>>
>>> 1) ethtool -K eth0 gro on + iperf running (577 Mb/s)
>>> root@OpenWrt:/# mpstat -P ALL 10 3
>>> Linux 5.1.0-rc3+ (OpenWrt)  03/27/19  _armv7l_  (2 CPU)
>>>
>>> 16:33:40  CPU  %usr  %nice  %sys  %iowait  %irq  %soft   %steal  %guest  %idle
>>> 16:33:50  all  0.00  0.00   0.00  0.00     0.00  58.79   0.00    0.00    41.21
>>> 16:33:50    0  0.00  0.00   0.00  0.00     0.00  100.00  0.00    0.00    0.00
>>> 16:33:50    1  0.00  0.00   0.00  0.00     0.00  17.58   0.00    0.00    82.42
>>>
>>> 16:33:50  CPU  %usr  %nice  %sys  %iowait  %irq  %soft   %steal  %guest  %idle
>>> 16:34:00  all  0.00  0.00   0.05  0.00     0.00  59.44   0.00    0.00    40.51
>>> 16:34:00    0  0.00  0.00   0.10  0.00     0.00  99.90   0.00    0.00    0.00
>>> 16:34:00    1  0.00  0.00   0.00  0.00     0.00  18.98   0.00    0.00    81.02
>>>
>>> 16:34:00  CPU  %usr  %nice  %sys  %iowait  %irq  %soft   %steal  %guest  %idle
>>> 16:34:10  all  0.00  0.00   0.00  0.00     0.00  59.59   0.00    0.00    40.41
>>> 16:34:10    0  0.00  0.00   0.00  0.00     0.00  100.00  0.00    0.00    0.00
>>> 16:34:10    1  0.00  0.00   0.00  0.00     0.00  19.18   0.00    0.00    80.82
>>>
>>> Average:  CPU  %usr  %nice  %sys  %iowait  %irq  %soft   %steal  %guest  %idle
>>> Average:  all  0.00  0.00   0.02  0.00     0.00  59.27   0.00    0.00    40.71
>>> Average:    0  0.00  0.00   0.03  0.00     0.00  99.97   0.00    0.00    0.00
>>> Average:    1  0.00  0.00   0.00  0.00     0.00  18.58   0.00    0.00    81.42
>>>
>>>
>>> 2) ethtool -K eth0 gro off + iperf running (941 Mb/s)
>>> root@OpenWrt:/# mpstat -P ALL 10 3
>>> Linux 5.1.0-rc3+ (OpenWrt)  03/27/19  _armv7l_  (2 CPU)
>>>
>>> 16:34:39  CPU  %usr  %nice  %sys  %iowait  %irq  %soft   %steal  %guest  %idle
>>> 16:34:49  all  0.00  0.00   0.05  0.00     0.00  86.91   0.00    0.00    13.04
>>> 16:34:49    0  0.00  0.00   0.10  0.00     0.00  78.22   0.00    0.00    21.68
>>> 16:34:49    1  0.00  0.00   0.00  0.00     0.00  95.60   0.00    0.00    4.40
>>>
>>> 16:34:49  CPU  %usr  %nice  %sys  %iowait  %irq  %soft   %steal  %guest  %idle
>>> 16:34:59  all  0.00  0.00   0.10  0.00     0.00  87.06   0.00    0.00    12.84
>>> 16:34:59    0  0.00  0.00   0.20  0.00     0.00  79.72   0.00    0.00    20.08
>>> 16:34:59    1  0.00  0.00   0.00  0.00     0.00  94.41   0.00    0.00    5.59
>>>
>>> 16:34:59  CPU  %usr  %nice  %sys  %iowait  %irq  %soft   %steal  %guest  %idle
>>> 16:35:09  all  0.00  0.00   0.05  0.00     0.00  85.71   0.00    0.00    14.24
>>> 16:35:09    0  0.00  0.00   0.10  0.00     0.00  79.42   0.00    0.00    20.48
>>> 16:35:09    1  0.00  0.00   0.00  0.00     0.00  92.01   0.00    0.00    7.99
>>>
>>> Average:  CPU  %usr  %nice  %sys  %iowait  %irq  %soft   %steal  %guest  %idle
>>> Average:  all  0.00  0.00   0.07  0.00     0.00  86.56   0.00    0.00    13.37
>>> Average:    0  0.00  0.00   0.13  0.00     0.00  79.12   0.00    0.00    20.75
>>> Average:    1  0.00  0.00   0.00  0.00     0.00  94.01   0.00    0.00    5.99
>>>
>>>
>>> 3) System idle (no iperf)
>>> root@OpenWrt:/# mpstat -P ALL 10 1
>>> Linux 5.1.0-rc3+ (OpenWrt)  03/27/19  _armv7l_  (2 CPU)
>>>
>>> 16:35:31  CPU  %usr  %nice  %sys  %iowait  %irq  %soft   %steal  %guest  %idle
>>> 16:35:41  all  0.00  0.00   0.00  0.00     0.00  0.00    0.00    0.00    100.00
>>> 16:35:41    0  0.00  0.00   0.00  0.00     0.00  0.00    0.00    0.00    100.00
>>> 16:35:41    1  0.00  0.00   0.00  0.00     0.00  0.00    0.00    0.00    100.00
>>>
>>> Average:  CPU  %usr  %nice  %sys  %iowait  %irq  %soft   %steal  %guest  %idle
>>> Average:  all  0.00  0.00   0.00  0.00     0.00  0.00    0.00    0.00    100.00
>>> Average:    0  0.00  0.00   0.00  0.00     0.00  0.00    0.00    0.00    100.00
>>> Average:    1  0.00  0.00   0.00  0.00     0.00  0.00    0.00    0.00    100.00
>>>
>>>
>>>> If CPU is 100%, perf may help us analyze your problem. If it's
>>>> available, try running below while testing:
>>>> # perf record -a -g -- sleep 5
>>>>
>>>> And then run this after testing:
>>>> # perf report --no-child
>>>
>>> I can see my CPU 0 is fully loaded when using "gro on". I'll try perf now.
>>
>> I guess its GRO + csum_partial() to be blamed for this performance drop.
>>
>> Maybe csum_partial() is very fast on your powerful machine and few extra calls
>> don't make a difference? I can imagine it affecting much slower home router with
>> ARM cores.
> Most high performance Ethernet devices implement hardware checksum
> offload, which completely gets rid of this overhead.
> Unfortunately, the BCM53xx/47xx Ethernet MAC doesn't have this, which is
> why you're getting such crappy performance.

Hmm...
now I disabled rx checksum and tried the test again, and indeed I see
csum_partial from the GRO path. But I also see csum_partial even without GRO,
from nf_conntrack_in -> tcp_packet -> __skb_checksum_complete. Probably Rafał
disabled the nf_conntrack_checksum sysctl knob?

But anyway, even with rx csum offload disabled, my machine has better
performance with GRO. I'm sure in some cases GRO should be disabled, but I
guess it's difficult to determine automatically whether we should disable GRO
when csum offload is not available.

-- 
Toshiaki Makita

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: NAT performance regression caused by vlan GRO support 2019-04-05 7:58 ` Toshiaki Makita @ 2019-04-05 8:12 ` Rafał Miłecki 2019-04-05 8:24 ` Rafał Miłecki 2019-04-05 10:18 ` Toke Høiland-Jørgensen 1 sibling, 1 reply; 16+ messages in thread From: Rafał Miłecki @ 2019-04-05 8:12 UTC (permalink / raw) To: Toshiaki Makita, Felix Fietkau Cc: Toshiaki Makita, netdev, David S. Miller, Stefano Brivio, Sabrina Dubroca, David Ahern, Jo-Philipp Wich, Koen Vandeputte On 05.04.2019 09:58, Toshiaki Makita wrote: > On 2019/04/05 16:14, Felix Fietkau wrote: >> On 2019-04-05 09:11, Rafał Miłecki wrote: >>> I guess its GRO + csum_partial() to be blamed for this performance drop. >>> >>> Maybe csum_partial() is very fast on your powerful machine and few extra calls >>> don't make a difference? I can imagine it affecting much slower home router with >>> ARM cores. >> Most high performance Ethernet devices implement hardware checksum >> offload, which completely gets rid of this overhead. >> Unfortunately, the BCM53xx/47xx Ethernet MAC doesn't have this, which is >> why you're getting such crappy performance. > > Hmm... now I disabled rx checksum and tried the test again, and indeed I > see csum_partial from GRO path. But I also see csum_partial even without > GRO from nf_conntrack_in -> tcp_packet -> __skb_checksum_complete. > Probably Rafał disabled nf_conntrack_checksum sysctl knob? > > But anyway even with disabling rx csum offload my machine has better > performance with GRO. I'm sure in some cases GRO should be disabled, but > I guess it's difficult to determine whether we should disable GRO or not > automatically when csum offload is not available. 
Few testing results:

1) ethtool -K eth0 gro off; echo 0 > /proc/sys/net/netfilter/nf_conntrack_checksum
[  6]  0.0-60.0 sec  6.57 GBytes  940 Mbits/sec

2) ethtool -K eth0 gro off; echo 1 > /proc/sys/net/netfilter/nf_conntrack_checksum
[  6]  0.0-60.0 sec  4.65 GBytes  666 Mbits/sec

3) ethtool -K eth0 gro on; echo 0 > /proc/sys/net/netfilter/nf_conntrack_checksum
[  6]  0.0-60.0 sec  4.02 GBytes  575 Mbits/sec

4) ethtool -K eth0 gro on; echo 1 > /proc/sys/net/netfilter/nf_conntrack_checksum
[  6]  0.0-60.0 sec  4.04 GBytes  579 Mbits/sec

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: NAT performance regression caused by vlan GRO support 2019-04-05 8:12 ` Rafał Miłecki @ 2019-04-05 8:24 ` Rafał Miłecki 0 siblings, 0 replies; 16+ messages in thread From: Rafał Miłecki @ 2019-04-05 8:24 UTC (permalink / raw) To: Toshiaki Makita, Felix Fietkau Cc: Toshiaki Makita, netdev, David S. Miller, Stefano Brivio, Sabrina Dubroca, David Ahern, Jo-Philipp Wich, Koen Vandeputte On 05.04.2019 10:12, Rafał Miłecki wrote: > On 05.04.2019 09:58, Toshiaki Makita wrote: >> On 2019/04/05 16:14, Felix Fietkau wrote: >>> On 2019-04-05 09:11, Rafał Miłecki wrote: >>>> I guess its GRO + csum_partial() to be blamed for this performance drop. >>>> >>>> Maybe csum_partial() is very fast on your powerful machine and few extra calls >>>> don't make a difference? I can imagine it affecting much slower home router with >>>> ARM cores. >>> Most high performance Ethernet devices implement hardware checksum >>> offload, which completely gets rid of this overhead. >>> Unfortunately, the BCM53xx/47xx Ethernet MAC doesn't have this, which is >>> why you're getting such crappy performance. >> >> Hmm... now I disabled rx checksum and tried the test again, and indeed I >> see csum_partial from GRO path. But I also see csum_partial even without >> GRO from nf_conntrack_in -> tcp_packet -> __skb_checksum_complete. >> Probably Rafał disabled nf_conntrack_checksum sysctl knob? >> >> But anyway even with disabling rx csum offload my machine has better >> performance with GRO. I'm sure in some cases GRO should be disabled, but >> I guess it's difficult to determine whether we should disable GRO or not >> automatically when csum offload is not available. 
>
> Few testing results:
>
> 1) ethtool -K eth0 gro off; echo 0 > /proc/sys/net/netfilter/nf_conntrack_checksum
> [  6]  0.0-60.0 sec  6.57 GBytes  940 Mbits/sec
>
> 2) ethtool -K eth0 gro off; echo 1 > /proc/sys/net/netfilter/nf_conntrack_checksum
> [  6]  0.0-60.0 sec  4.65 GBytes  666 Mbits/sec

For this case (GRO off and nf_conntrack_checksum enabled) I can confirm I see
csum_partial() in the perf output. It's taking 13,14% instead of 25,46% (as
when using GRO) though.

Samples: 38K of event 'cycles', Event count (approx.): 12209908413
Overhead  Command      Shared Object      Symbol
+ 13,14%  ksoftirqd/1  [kernel.kallsyms]  [k] csum_partial
+ 10,16%  swapper      [kernel.kallsyms]  [k] v7_dma_inv_range
+  6,36%  swapper      [kernel.kallsyms]  [k] l2c210_inv_range
+  4,89%  swapper      [kernel.kallsyms]  [k] __irqentry_text_end
+  4,12%  ksoftirqd/1  [kernel.kallsyms]  [k] v7_dma_clean_range
+  3,78%  swapper      [kernel.kallsyms]  [k] bcma_host_soc_read32
+  2,76%  swapper      [kernel.kallsyms]  [k] arch_cpu_idle
+  2,45%  ksoftirqd/1  [kernel.kallsyms]  [k] __netif_receive_skb_core
+  2,37%  ksoftirqd/1  [kernel.kallsyms]  [k] l2c210_clean_range
+  1,76%  ksoftirqd/1  [kernel.kallsyms]  [k] bgmac_start_xmit
+  1,66%  swapper      [kernel.kallsyms]  [k] bgmac_poll
+  1,55%  ksoftirqd/1  [kernel.kallsyms]  [k] __dev_queue_xmit
+  1,11%  ksoftirqd/1  [kernel.kallsyms]  [k] skb_vlan_untag

> 3) ethtool -K eth0 gro on; echo 0 > /proc/sys/net/netfilter/nf_conntrack_checksum
> [  6]  0.0-60.0 sec  4.02 GBytes  575 Mbits/sec
>
> 4) ethtool -K eth0 gro on; echo 1 > /proc/sys/net/netfilter/nf_conntrack_checksum
> [  6]  0.0-60.0 sec  4.04 GBytes  579 Mbits/sec

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: NAT performance regression caused by vlan GRO support 2019-04-05 7:58 ` Toshiaki Makita 2019-04-05 8:12 ` Rafał Miłecki @ 2019-04-05 10:18 ` Toke Høiland-Jørgensen 2019-04-05 10:51 ` Florian Westphal 1 sibling, 1 reply; 16+ messages in thread From: Toke Høiland-Jørgensen @ 2019-04-05 10:18 UTC (permalink / raw) To: Toshiaki Makita, Felix Fietkau, Rafał Miłecki Cc: Toshiaki Makita, netdev, David S. Miller, Stefano Brivio, Sabrina Dubroca, David Ahern, Jo-Philipp Wich, Koen Vandeputte Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> writes: > On 2019/04/05 16:14, Felix Fietkau wrote: >> On 2019-04-05 09:11, Rafał Miłecki wrote: >>> On 05.04.2019 07:48, Rafał Miłecki wrote: >>>> On 05.04.2019 06:26, Toshiaki Makita wrote: >>>>> My test results: >>>>> >>>>> Receiving packets from eth0.10, forwarding them to eth0.20 and applying >>>>> MASQUERADE on eth0.20, using i40e 25G NIC on kernel 4.20.13. >>>>> Disabled rxvlan by ethtool -K to exercise vlan_gro_receive(). >>>>> Measured TCP throughput by netperf. >>>>> >>>>> GRO on : 17 Gbps >>>>> GRO off: 5 Gbps >>>>> >>>>> So I failed to reproduce your problem. >>>> >>>> :( Thanks for trying & checking that! >>>> >>>> >>>>> Would you check the CPU usage by "mpstat -P ALL" or similar (like "sar >>>>> -u ALL -P ALL") to check if the traffic is able to consume 100% CPU on >>>>> your machine? 
>>>>
>>>> 1) ethtool -K eth0 gro on + iperf running (577 Mb/s)
>>>> root@OpenWrt:/# mpstat -P ALL 10 3
>>>> Linux 5.1.0-rc3+ (OpenWrt)  03/27/19  _armv7l_  (2 CPU)
>>>>
>>>> 16:33:40  CPU  %usr  %nice  %sys  %iowait  %irq  %soft   %steal  %guest  %idle
>>>> 16:33:50  all  0.00  0.00   0.00  0.00     0.00  58.79   0.00    0.00    41.21
>>>> 16:33:50    0  0.00  0.00   0.00  0.00     0.00  100.00  0.00    0.00    0.00
>>>> 16:33:50    1  0.00  0.00   0.00  0.00     0.00  17.58   0.00    0.00    82.42
>>>>
>>>> 16:33:50  CPU  %usr  %nice  %sys  %iowait  %irq  %soft   %steal  %guest  %idle
>>>> 16:34:00  all  0.00  0.00   0.05  0.00     0.00  59.44   0.00    0.00    40.51
>>>> 16:34:00    0  0.00  0.00   0.10  0.00     0.00  99.90   0.00    0.00    0.00
>>>> 16:34:00    1  0.00  0.00   0.00  0.00     0.00  18.98   0.00    0.00    81.02
>>>>
>>>> 16:34:00  CPU  %usr  %nice  %sys  %iowait  %irq  %soft   %steal  %guest  %idle
>>>> 16:34:10  all  0.00  0.00   0.00  0.00     0.00  59.59   0.00    0.00    40.41
>>>> 16:34:10    0  0.00  0.00   0.00  0.00     0.00  100.00  0.00    0.00    0.00
>>>> 16:34:10    1  0.00  0.00   0.00  0.00     0.00  19.18   0.00    0.00    80.82
>>>>
>>>> Average:  CPU  %usr  %nice  %sys  %iowait  %irq  %soft   %steal  %guest  %idle
>>>> Average:  all  0.00  0.00   0.02  0.00     0.00  59.27   0.00    0.00    40.71
>>>> Average:    0  0.00  0.00   0.03  0.00     0.00  99.97   0.00    0.00    0.00
>>>> Average:    1  0.00  0.00   0.00  0.00     0.00  18.58   0.00    0.00    81.42
>>>>
>>>>
>>>> 2) ethtool -K eth0 gro off + iperf running (941 Mb/s)
>>>> root@OpenWrt:/# mpstat -P ALL 10 3
>>>> Linux 5.1.0-rc3+ (OpenWrt)  03/27/19  _armv7l_  (2 CPU)
>>>>
>>>> 16:34:39  CPU  %usr  %nice  %sys  %iowait  %irq  %soft   %steal  %guest  %idle
>>>> 16:34:49  all  0.00  0.00   0.05  0.00     0.00  86.91   0.00    0.00    13.04
>>>> 16:34:49    0  0.00  0.00   0.10  0.00     0.00  78.22   0.00    0.00    21.68
>>>> 16:34:49    1  0.00  0.00   0.00  0.00     0.00  95.60   0.00    0.00    4.40
>>>>
>>>> 16:34:49  CPU  %usr  %nice  %sys  %iowait  %irq  %soft   %steal  %guest  %idle
>>>> 16:34:59  all  0.00  0.00   0.10  0.00     0.00  87.06   0.00    0.00    12.84
>>>> 16:34:59    0  0.00  0.00   0.20  0.00     0.00  79.72   0.00    0.00    20.08
>>>> 16:34:59    1  0.00  0.00   0.00  0.00     0.00  94.41   0.00    0.00    5.59
>>>>
>>>> 16:34:59  CPU  %usr  %nice  %sys  %iowait  %irq  %soft   %steal  %guest  %idle
>>>> 16:35:09  all  0.00  0.00   0.05  0.00     0.00  85.71   0.00    0.00    14.24
>>>> 16:35:09    0  0.00  0.00   0.10  0.00     0.00  79.42   0.00    0.00    20.48
>>>> 16:35:09    1  0.00  0.00   0.00  0.00     0.00  92.01   0.00    0.00    7.99
>>>>
>>>> Average:  CPU  %usr  %nice  %sys  %iowait  %irq  %soft   %steal  %guest  %idle
>>>> Average:  all  0.00  0.00   0.07  0.00     0.00  86.56   0.00    0.00    13.37
>>>> Average:    0  0.00  0.00   0.13  0.00     0.00  79.12   0.00    0.00    20.75
>>>> Average:    1  0.00  0.00   0.00  0.00     0.00  94.01   0.00    0.00    5.99
>>>>
>>>>
>>>> 3) System idle (no iperf)
>>>> root@OpenWrt:/# mpstat -P ALL 10 1
>>>> Linux 5.1.0-rc3+ (OpenWrt)  03/27/19  _armv7l_  (2 CPU)
>>>>
>>>> 16:35:31  CPU  %usr  %nice  %sys  %iowait  %irq  %soft   %steal  %guest  %idle
>>>> 16:35:41  all  0.00  0.00   0.00  0.00     0.00  0.00    0.00    0.00    100.00
>>>> 16:35:41    0  0.00  0.00   0.00  0.00     0.00  0.00    0.00    0.00    100.00
>>>> 16:35:41    1  0.00  0.00   0.00  0.00     0.00  0.00    0.00    0.00    100.00
>>>>
>>>> Average:  CPU  %usr  %nice  %sys  %iowait  %irq  %soft   %steal  %guest  %idle
>>>> Average:  all  0.00  0.00   0.00  0.00     0.00  0.00    0.00    0.00    100.00
>>>> Average:    0  0.00  0.00   0.00  0.00     0.00  0.00    0.00    0.00    100.00
>>>> Average:    1  0.00  0.00   0.00  0.00     0.00  0.00    0.00    0.00    100.00
>>>>
>>>>
>>>>> If CPU is 100%, perf may help us analyze your problem. If it's
>>>>> available, try running below while testing:
>>>>> # perf record -a -g -- sleep 5
>>>>>
>>>>> And then run this after testing:
>>>>> # perf report --no-child
>>>>
>>>> I can see my CPU 0 is fully loaded when using "gro on". I'll try perf now.
>>>
>>> I guess its GRO + csum_partial() to be blamed for this performance drop.
>>>
>>> Maybe csum_partial() is very fast on your powerful machine and few extra calls
>>> don't make a difference? I can imagine it affecting much slower home router with
>>> ARM cores.
>> Most high performance Ethernet devices implement hardware checksum
>> offload, which completely gets rid of this overhead.
>> Unfortunately, the BCM53xx/47xx Ethernet MAC doesn't have this, which is
>> why you're getting such crappy performance.
>
> Hmm... now I disabled rx checksum and tried the test again, and indeed I
> see csum_partial from GRO path. But I also see csum_partial even without
> GRO from nf_conntrack_in -> tcp_packet -> __skb_checksum_complete.
> Probably Rafał disabled nf_conntrack_checksum sysctl knob?
>
> But anyway even with disabling rx csum offload my machine has better
> performance with GRO.

But you're also running at way higher speeds, where the benefit of GRO is
higher.

> I'm sure in some cases GRO should be disabled, but I guess it's
> difficult to determine whether we should disable GRO or not
> automatically when csum offload is not available.

As a first approximation, maybe just:

  if (!has_hardware_cksum_offload(netdev) && link_rate(netdev) <= 1Gbps)
          disable_gro();

We used 1Gbps as the threshold for when to split GRO packets by default in
sch_cake as well...

-Toke

^ permalink raw reply	[flat|nested] 16+ messages in thread
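[Editorial aside: the pseudo-code heuristic above can be restated as a tiny runnable predicate. Note that has_hardware_cksum_offload() and link_rate() are not real kernel APIs, and the helper below is likewise only an illustrative sketch of the proposed policy, with both inputs supplied by the caller rather than read from a real netdev.]

```c
#include <stdbool.h>

/* Toy sketch of the heuristic proposed above: turn GRO off only when
 * the NIC cannot offload checksums (so GRO must run csum_partial() in
 * software) AND the link is at or below 1 Gbit/s, where GRO's batching
 * benefit is smallest. Purely illustrative policy, not kernel code. */
bool should_disable_gro(bool hw_csum_offload, unsigned int link_mbps)
{
	return !hw_csum_offload && link_mbps <= 1000;
}
```

Under this policy Rafał's BCM47094 (no hardware checksum offload, 1 Gbit/s links) would get GRO disabled, while Toshiaki's 25G i40e setup would keep it.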
* Re: NAT performance regression caused by vlan GRO support 2019-04-05 10:18 ` Toke Høiland-Jørgensen @ 2019-04-05 10:51 ` Florian Westphal 2019-04-05 11:00 ` Eric Dumazet 0 siblings, 1 reply; 16+ messages in thread From: Florian Westphal @ 2019-04-05 10:51 UTC (permalink / raw) To: Toke Høiland-Jørgensen Cc: Toshiaki Makita, Felix Fietkau, Rafał Miłecki, Toshiaki Makita, netdev, David S. Miller, Stefano Brivio, Sabrina Dubroca, David Ahern, Jo-Philipp Wich, Koen Vandeputte

Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> As a first approximation, maybe just:
>
>   if (!has_hardware_cksum_offload(netdev) && link_rate(netdev) <= 1Gbps)
>           disable_gro();

I don't think it's a good idea. For the local delivery case, there is no
way to avoid the checksum cost, so we might as well have GRO enabled.

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: NAT performance regression caused by vlan GRO support 2019-04-05 10:51 ` Florian Westphal @ 2019-04-05 11:00 ` Eric Dumazet 0 siblings, 0 replies; 16+ messages in thread From: Eric Dumazet @ 2019-04-05 11:00 UTC (permalink / raw) To: Florian Westphal, Toke Høiland-Jørgensen Cc: Toshiaki Makita, Felix Fietkau, Rafał Miłecki, Toshiaki Makita, netdev, David S. Miller, Stefano Brivio, Sabrina Dubroca, David Ahern, Jo-Philipp Wich, Koen Vandeputte

On 04/05/2019 03:51 AM, Florian Westphal wrote:
> Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>> As a first approximation, maybe just:
>>
>>   if (!has_hardware_cksum_offload(netdev) && link_rate(netdev) <= 1Gbps)
>>           disable_gro();
>
> I don't think its a good idea. For local delivery case, there is no
> way to avoid the checksum cost, so might as well have GRO enabled.

We might add a sysctl, or a way to tell the GRO layer: do not attempt
checksumming if forwarding is enabled on the host.

Basically, only do GRO if the NIC has provided checksum offload.

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: NAT performance regression caused by vlan GRO support 2019-04-04 12:57 NAT performance regression caused by vlan GRO support Rafał Miłecki 2019-04-04 15:17 ` Toshiaki Makita @ 2019-04-07 11:53 ` Rafał Miłecki 2019-04-07 11:54 ` Rafał Miłecki 1 sibling, 1 reply; 16+ messages in thread From: Rafał Miłecki @ 2019-04-07 11:53 UTC (permalink / raw) To: netdev, David S. Miller, Toshiaki Makita, Toke Høiland-Jørgensen, Florian Westphal, Eric Dumazet Cc: Stefano Brivio, Sabrina Dubroca, David Ahern, Felix Fietkau, Jo-Philipp Wich, Koen Vandeputte

On 04.04.2019 14:57, Rafał Miłecki wrote:
> Long story short, starting with the commit 66e5133f19e9 ("vlan: Add GRO support
> for non hardware accelerated vlan") - which first hit kernel 4.2 - NAT
> performance of my router dropped by 30% - 40%.

I'll try to provide a summary of this issue. I'll focus on TCP traffic, as
that's what I happened to test.

Basically all the slowdowns are related to csum_partial(). Calculating the
checksum has a significant impact on NAT performance on devices with less
powerful CPUs.

**********

GRO disabled

Without GRO, csum_partial() is used only when validating TCP packets in
nf_conntrack_tcp_packet() (known as tcp_packet() in kernels older than 5.1).

Simplified forward trace for that case:
nf_conntrack_in
  nf_conntrack_tcp_packet
    tcp_error
      if (state->net->ct.sysctl_checksum)
        nf_checksum
          nf_ip_checksum
            __skb_checksum_complete

That validation can be disabled using the nf_conntrack_checksum sysctl, and
doing so bumps NAT speed for me from 666 Mb/s to 940 Mb/s (+41%).

**********

GRO enabled

First of all, GRO also includes TCP validation that requires calculating a
checksum.

Simplified forward trace for that case:
vlan_gro_receive
  call_gro_receive
    inet_gro_receive
      indirect_call_gro_receive
        tcp4_gro_receive
          skb_gro_checksum_validate
          tcp_gro_receive

*If* we had a way to disable that validation, it *would* result in bumping NAT
speed for me from 577 Mb/s to 825 Mb/s (+43%).

Secondly, using GRO means we need to calculate a checksum before transmitting
packets (this applies to devices without HW checksum offloading). I think it's
related to packet merging in skb_gro_receive() and then setting
CHECKSUM_PARTIAL:

vlan_gro_complete
  inet_gro_complete
    tcp4_gro_complete
      tcp_gro_complete
        skb->ip_summed = CHECKSUM_PARTIAL;

That results in bgmac calculating a checksum from scratch; take a look at
bgmac_dma_tx_add(), which does:

	if (skb->ip_summed == CHECKSUM_PARTIAL)
		skb_checksum_help(skb);

Performing that whole checksum calculation will always result in GRO slowing
down NAT for me when using the BCM47094 SoC with its not-so-powerful ARM CPUs.

^ permalink raw reply	[flat|nested] 16+ messages in thread
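[Editorial aside: the hot path being blamed here, csum_partial(), computes the RFC 1071 Internet checksum, an O(n) pass over every byte of the packet, which is exactly the per-packet cost an ARM SoC without hardware checksum offload pays on both the validate and the rebuild side. A plain-C user-space sketch for illustration only; the kernel's real csum_partial() is arch-optimized assembly:]

```c
#include <stddef.h>
#include <stdint.h>

/* One's-complement Internet checksum (RFC 1071) over a byte buffer.
 * Every byte is touched once, so the cost grows linearly with packet
 * size -- the overhead measured above when no NIC offload exists. */
uint16_t csum16(const uint8_t *data, size_t len)
{
	uint32_t sum = 0;

	while (len > 1) {
		/* accumulate 16-bit big-endian words */
		sum += ((uint32_t)data[0] << 8) | data[1];
		data += 2;
		len -= 2;
	}
	if (len)			/* odd trailing byte */
		sum += (uint32_t)data[0] << 8;
	while (sum >> 16)		/* fold carries back in */
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)~sum;
}
```

A buffer carrying its own correct checksum folds to 0xffff before the final complement, so validating it yields 0; that zero test is what the tcp_error() and skb_gro_checksum_validate() paths ultimately boil down to.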
* Re: NAT performance regression caused by vlan GRO support 2019-04-07 11:53 ` Rafał Miłecki @ 2019-04-07 11:54 ` Rafał Miłecki 2019-04-08 13:31 ` David Laight 0 siblings, 1 reply; 16+ messages in thread From: Rafał Miłecki @ 2019-04-07 11:54 UTC (permalink / raw) To: netdev, David S. Miller, Toshiaki Makita, Toke Høiland-Jørgensen, Florian Westphal, Eric Dumazet Cc: Stefano Brivio, Sabrina Dubroca, David Ahern, Felix Fietkau, Jo-Philipp Wich, Koen Vandeputte

Now I have some questions regarding possible optimizations. Note I'm not too
familiar with the net subsystem, so maybe I've got some ideas wrong.

On 07.04.2019 13:53, Rafał Miłecki wrote:
> On 04.04.2019 14:57, Rafał Miłecki wrote:
>> Long story short, starting with the commit 66e5133f19e9 ("vlan: Add GRO support
>> for non hardware accelerated vlan") - which first hit kernel 4.2 - NAT
>> performance of my router dropped by 30% - 40%.
>
> I'll try to provide a summary of this issue. I'll focus on TCP traffic, as
> that's what I happened to test.
>
> Basically all the slowdowns are related to csum_partial(). Calculating the
> checksum has a significant impact on NAT performance on devices with less
> powerful CPUs.
>
> **********
>
> GRO disabled
>
> Without GRO, csum_partial() is used only when validating TCP packets in
> nf_conntrack_tcp_packet() (known as tcp_packet() in kernels older than 5.1).
>
> Simplified forward trace for that case:
> nf_conntrack_in
>   nf_conntrack_tcp_packet
>     tcp_error
>       if (state->net->ct.sysctl_checksum)
>         nf_checksum
>           nf_ip_checksum
>             __skb_checksum_complete
>
> That validation can be disabled using the nf_conntrack_checksum sysctl, and
> doing so bumps NAT speed for me from 666 Mb/s to 940 Mb/s (+41%).
>
> **********
>
> GRO enabled
>
> First of all, GRO also includes TCP validation that requires calculating a
> checksum.
>
> Simplified forward trace for that case:
> vlan_gro_receive
>   call_gro_receive
>     inet_gro_receive
>       indirect_call_gro_receive
>         tcp4_gro_receive
>           skb_gro_checksum_validate
>           tcp_gro_receive
>
> *If* we had a way to disable that validation, it *would* result in bumping NAT
> speed for me from 577 Mb/s to 825 Mb/s (+43%).

Could we have tcp4_gro_receive() behave similarly to tcp_error() and make it
respect the nf_conntrack_checksum sysctl value? Could we simply add something
like:

	if (dev_net(skb->dev)->ct.sysctl_checksum)

to it (to additionally protect the skb_gro_checksum_validate() call)?

> Secondly, using GRO means we need to calculate a checksum before transmitting
> packets (this applies to devices without HW checksum offloading). I think it's
> related to packet merging in skb_gro_receive() and then setting
> CHECKSUM_PARTIAL:
>
> vlan_gro_complete
>   inet_gro_complete
>     tcp4_gro_complete
>       tcp_gro_complete
>         skb->ip_summed = CHECKSUM_PARTIAL;
>
> That results in bgmac calculating a checksum from scratch; take a look at
> bgmac_dma_tx_add(), which does:
>
> 	if (skb->ip_summed == CHECKSUM_PARTIAL)
> 		skb_checksum_help(skb);
>
> Performing that whole checksum calculation will always result in GRO slowing
> down NAT for me when using the BCM47094 SoC with its not-so-powerful ARM CPUs.

Is it possible to avoid CHECKSUM_PARTIAL & skb_checksum_help(), which has to
calculate a whole checksum? It's definitely possible to *update* a checksum
after simple packet changes (e.g. amending an IP or a port). Would it be
possible to use a similar method when dealing with packets with GRO enabled?

If not, maybe we really need to think about some good & clever condition for
disabling GRO by default on hw without checksum offloading.

^ permalink raw reply	[flat|nested] 16+ messages in thread
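[Editorial aside: the "update instead of recompute" that NAT does for plain non-GRO packets is incremental checksum adjustment per RFC 1624: when one 16-bit word of a header changes (part of an IP address or a port), the new checksum is derived from the old one in O(1), which is why the full skb_checksum_help() pass after GRO merging hurts so much by comparison. A hedged user-space sketch; the kernel's rough equivalents are helpers such as csum_replace2()/inet_proto_csum_replace2(), and the names below are illustrative:]

```c
#include <stddef.h>
#include <stdint.h>

/* Full RFC 1071 one's-complement checksum, O(len), for comparison. */
uint16_t csum16(const uint8_t *data, size_t len)
{
	uint32_t sum = 0;

	while (len > 1) {
		sum += ((uint32_t)data[0] << 8) | data[1];
		data += 2;
		len -= 2;
	}
	if (len)
		sum += (uint32_t)data[0] << 8;
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)~sum;
}

/* O(1) incremental update when one 16-bit word changes from old_word
 * to new_word:  HC' = ~(~HC + ~m + m')   (RFC 1624, Eqn. 3). */
uint16_t csum16_update(uint16_t old_csum, uint16_t old_word,
		       uint16_t new_word)
{
	uint32_t sum = (uint16_t)~old_csum;

	sum += (uint16_t)~old_word;
	sum += new_word;
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)~sum;
}
```

Rewriting a NAT'd port or address word then costs two additions and a fold instead of re-reading the whole merged super-packet.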
* RE: NAT performance regression caused by vlan GRO support 2019-04-07 11:54 ` Rafał Miłecki @ 2019-04-08 13:31 ` David Laight 0 siblings, 0 replies; 16+ messages in thread From: David Laight @ 2019-04-08 13:31 UTC (permalink / raw) To: 'Rafał Miłecki', netdev, David S. Miller, Toshiaki Makita, Toke Høiland-Jørgensen, Florian Westphal, Eric Dumazet Cc: Stefano Brivio, Sabrina Dubroca, David Ahern, Felix Fietkau, Jo-Philipp Wich, Koen Vandeputte

From: Rafal Milecki
> Sent: 07 April 2019 12:55
...
> If not, maybe we really need to think about some good & clever condition for
> disabling GRO by default on hw without checksum offloading.

Maybe GRO could assume the checksums are valid, so the checksum would only be
verified when the packet is delivered locally. If the packet is forwarded
then, provided the same packet boundaries are used, the original checksums
(maybe modified by NAT) can be used.

No idea how easy this might be :-)

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 16+ messages in thread
end of thread, other threads:[~2019-04-08 13:30 UTC | newest] Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2019-04-04 12:57 NAT performance regression caused by vlan GRO support Rafał Miłecki 2019-04-04 15:17 ` Toshiaki Makita 2019-04-04 20:22 ` Rafał Miłecki 2019-04-05 4:26 ` Toshiaki Makita 2019-04-05 5:48 ` Rafał Miłecki 2019-04-05 7:11 ` Rafał Miłecki 2019-04-05 7:14 ` Felix Fietkau 2019-04-05 7:58 ` Toshiaki Makita 2019-04-05 8:12 ` Rafał Miłecki 2019-04-05 8:24 ` Rafał Miłecki 2019-04-05 10:18 ` Toke Høiland-Jørgensen 2019-04-05 10:51 ` Florian Westphal 2019-04-05 11:00 ` Eric Dumazet 2019-04-07 11:53 ` Rafał Miłecki 2019-04-07 11:54 ` Rafał Miłecki 2019-04-08 13:31 ` David Laight