Linux-ARM-Kernel Archive on lore.kernel.org
* ARM router NAT performance affected by random/unrelated commits
@ 2019-05-21 10:28 Rafał Miłecki
  2019-05-21 10:45 ` Russell King - ARM Linux admin
  2019-05-21 13:01 ` Andrew Lunn
  0 siblings, 2 replies; 8+ messages in thread
From: Rafał Miłecki @ 2019-05-21 10:28 UTC (permalink / raw)
  To: Network Development, linux-arm-kernel, Linux Kernel Mailing List
  Cc: linux-block, John Crispin, Jonas Gorski, Jo-Philipp Wich, Felix Fietkau

[-- Attachment #1: Type: text/plain, Size: 5928 bytes --]

Hi,

I work on home routers based on Broadcom's Northstar SoCs. Those devices
have ARM Cortex-A9 and most of them are dual-core.

As for home routers, my main concern is network performance. That CPU
isn't powerful enough to handle gigabit traffic so all kinds of
optimizations matter. I noticed some unexpected changes in NAT
performance when switching between kernels.

My hardware is BCM47094 SoC (dual core ARM) with integrated network
controller and external BCM53012 switch.

Relevant setup:
* SoC network controller is wired to the hardware switch
* Switch passes 802.1q frames with VID 1 to four LAN ports
* Switch passes 802.1q frames with VID 2 to WAN port
* Linux does NAT for LAN (eth0.1) to WAN (eth0.2)
* Linux uses pfifo and "echo 2 > rps_cpus"
* Ryzen 5 PRO 2500U (x86_64) laptop connected to a LAN port
* Intel i7-2670QM laptop connected to a WAN port
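
For reference, rps_cpus takes a hexadecimal CPU bitmask, so writing 2
steers receive processing to CPU1 only. A minimal Python sketch of how
such a mask is built (rps_mask is a made-up helper name for
illustration, not an existing tool):

```python
def rps_mask(cpus):
    """Return the hex bitmask string for a list of CPU indices."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu  # bit N selects CPU N
    return format(mask, "x")

print(rps_mask([1]))     # mask selecting CPU1 only, as used above
print(rps_mask([0, 1]))  # mask selecting both cores
```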

*****

I found a very nice example of a commit that does /nothing/ yet affects
NAT performance: 9316a9ed6895 ("blk-mq: provide helper for setting up an
SQ queue and tag set")
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9316a9ed6895c4ad2f0cde171d486f80c55d8283
All it does is export an unused symbol (function).

Let me share some numbers (I use iperf for testing):

git reset --hard v4.19
git am OpenWrt-mtd-chages.patch
[  3]  0.0-30.0 sec  2.60 GBytes   745 Mbits/sec
[  3]  0.0-30.0 sec  2.60 GBytes   745 Mbits/sec
[  3]  0.0-30.0 sec  2.60 GBytes   744 Mbits/sec
[  3]  0.0-30.0 sec  2.59 GBytes   742 Mbits/sec
[  3]  0.0-30.0 sec  2.59 GBytes   740 Mbits/sec
[  3]  0.0-30.0 sec  2.59 GBytes   740 Mbits/sec
[  3]  0.0-30.0 sec  2.58 GBytes   738 Mbits/sec
[  3]  0.0-30.0 sec  2.58 GBytes   738 Mbits/sec
[  3]  0.0-30.0 sec  2.58 GBytes   738 Mbits/sec
[  3]  0.0-30.0 sec  2.57 GBytes   735 Mbits/sec
Average: 741 Mb/s

git reset --hard v4.19
git am OpenWrt-mtd-chages.patch
git cherry-pick -x 9316a9ed6895
[  3]  0.0-30.0 sec  2.73 GBytes   780 Mbits/sec
[  3]  0.0-30.0 sec  2.72 GBytes   777 Mbits/sec
[  3]  0.0-30.0 sec  2.71 GBytes   775 Mbits/sec
[  3]  0.0-30.0 sec  2.70 GBytes   773 Mbits/sec
[  3]  0.0-30.0 sec  2.70 GBytes   771 Mbits/sec
[  3]  0.0-30.0 sec  2.69 GBytes   771 Mbits/sec
[  3]  0.0-30.0 sec  2.69 GBytes   771 Mbits/sec
[  3]  0.0-30.0 sec  2.69 GBytes   770 Mbits/sec
[  3]  0.0-30.0 sec  2.69 GBytes   769 Mbits/sec
[  3]  0.0-30.0 sec  2.68 GBytes   768 Mbits/sec
Average: 773 Mb/s

As you can see, cherry-picking (on top of Linux 4.19) a single commit
that does /nothing/ can improve NAT performance by about 4.3%
(741 Mb/s → 773 Mb/s).
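
The averages and the relative gain can be recomputed from the per-run
figures. A minimal Python sketch, with the Mbits/sec values copied by
hand from the two iperf lists above (the helper names are mine):

```python
# Per-run iperf results in Mbits/sec, copied from the two lists above.
baseline = [745, 745, 744, 742, 740, 740, 738, 738, 738, 735]
cherry_picked = [780, 777, 775, 773, 771, 771, 771, 770, 769, 768]

def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)

def gain_percent(before, after):
    """Relative change of mean(after) versus mean(before), in percent."""
    return (mean(after) - mean(before)) / mean(before) * 100

print(f"baseline:      {mean(baseline):.1f} Mb/s")
print(f"cherry-picked: {mean(cherry_picked):.1f} Mb/s")
print(f"gain:          {gain_percent(baseline, cherry_picked):.1f}%")
```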

*****

I was hoping to learn something from profiling the kernel with the "perf"
tool. Enabling CONFIG_PERF_EVENTS resulted in a smaller NAT performance
gain: 741 Mb/s → 750 Mb/s. I tried it anyway.

Without cherry-picking I got:
+    9,04%  swapper          [kernel.kallsyms]  [k] v7_dma_inv_range
+    5,54%  swapper          [kernel.kallsyms]  [k] __irqentry_text_end
+    5,12%  swapper          [kernel.kallsyms]  [k] l2c210_inv_range
+    4,30%  ksoftirqd/1      [kernel.kallsyms]  [k] v7_dma_clean_range
+    4,02%  swapper          [kernel.kallsyms]  [k] bcma_host_soc_read32
+    3,13%  swapper          [kernel.kallsyms]  [k] arch_cpu_idle
+    2,88%  ksoftirqd/1      [kernel.kallsyms]  [k] __netif_receive_skb_core
+    2,51%  ksoftirqd/1      [kernel.kallsyms]  [k] l2c210_clean_range
+    1,88%  ksoftirqd/1      [kernel.kallsyms]  [k] fib_table_lookup
(741 Mb/s while *not* running perf)

With cherry-picked 9316a9ed6895 I got:
+    9,16%  swapper          [kernel.kallsyms]  [k] v7_dma_inv_range
+    5,64%  swapper          [kernel.kallsyms]  [k] __irqentry_text_end
+    5,05%  swapper          [kernel.kallsyms]  [k] l2c210_inv_range
+    4,25%  ksoftirqd/1      [kernel.kallsyms]  [k] v7_dma_clean_range
+    4,10%  swapper          [kernel.kallsyms]  [k] bcma_host_soc_read32
+    3,35%  ksoftirqd/1      [kernel.kallsyms]  [k] __netif_receive_skb_core
+    3,17%  swapper          [kernel.kallsyms]  [k] arch_cpu_idle
+    2,49%  ksoftirqd/1      [kernel.kallsyms]  [k] l2c210_clean_range
+    2,03%  ksoftirqd/1      [kernel.kallsyms]  [k] fib_table_lookup
(750 Mb/s while *not* running perf)

The differences seem quite minimal and I'm not sure they tell what is
causing that NAT performance change at all.

*****

I also tried running cachestat but didn't get anything interesting:
Counting cache functions... Output every 1 seconds.
TIME         HITS   MISSES  DIRTIES    RATIO   BUFFERS_MB   CACHE_MB
10:06:59     1020        5        0    99.5%            0          2
10:07:00     1029        0        0   100.0%            0          2
10:07:01     1013        0        0   100.0%            0          2
10:07:02     1029        0        0   100.0%            0          2
10:07:03     1029        0        0   100.0%            0          2
10:07:04      997        0        0   100.0%            0          2
10:07:05     1013        0        0   100.0%            0          2
(I started iperf at 10:07:00).

*****

There were more situations with such unexpected performance changes.
Another example: cherry-picking 5b0890a97204 ("flow_dissector: Parse
batman-adv unicast headers")
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5b0890a97204627d75a333fc30f29f737e2bfad6
onto some Linux 4.14.x release lowered NAT performance by 55 Mb/s.

The tricky part is there aren't any ETH_P_BATMAN packets in my traffic.
Extra tests revealed that any __skb_flow_dissect() modification was
lowering my NAT performance (e.g. commenting out ETH_P_TIPC or
ETH_P_FCOE switch cases).

*****

I would like every kernel to provide maximum NAT performance, no
matter what random commits it contains.

Suffering from such random changes also makes it really hard to notice
a real performance regression.

Do you have any idea what is causing those performance changes? Can I
provide any extra info to help debugging this?

[-- Attachment #2: openwrt-mtd-patches.txt --]
[-- Type: text/plain, Size: 1435 bytes --]

047-v4.21-mtd-keep-original-flags-for-every-struct-mtd_info.patch
048-v4.21-mtd-improve-calculating-partition-boundaries-when-ch.patch
080-v5.1-0001-bcma-keep-a-direct-pointer-to-the-struct-device.patch
080-v5.1-0002-bcma-use-dev_-printing-functions.patch
095-Allow-class-e-address-assignment-via-ifconfig-ioctl.patch

140-jffs2-use-.rename2-and-add-RENAME_WHITEOUT-support.patch
141-jffs2-add-RENAME_EXCHANGE-support.patch
400-mtd-add-rootfs-split-support.patch
401-mtd-add-support-for-different-partition-parser-types.patch
402-mtd-use-typed-mtd-parsers-for-rootfs-and-firmware-split.patch
403-mtd-hook-mtdsplit-to-Kbuild.patch
404-mtd-add-more-helper-functions.patch
431-mtd-bcm47xxpart-check-for-bad-blocks-when-calculatin.patch
432-mtd-bcm47xxpart-detect-T_Meter-partition.patch
480-mtd-set-rootfs-to-be-root-dev.patch
490-ubi-auto-attach-mtd-device-named-ubi-or-data-on-boot.patch
491-ubi-auto-create-ubiblock-device-for-rootfs.patch
492-try-auto-mounting-ubi0-rootfs-in-init-do_mounts.c.patch
493-ubi-set-ROOT_DEV-to-ubiblock-rootfs-if-unset.patch
530-jffs2_make_lzma_available.patch
532-jffs2_eofdetect.patch
500-v4.20-ubifs-Fix-default-compression-selection-in-ubifs.patch
553-ubifs-Add-option-to-create-UBI-FS-version-4-on-empty.patch

700-swconfig_switch_drivers.patch
702-phy_add_aneg_done_function.patch
721-phy_packets.patch
773-bgmac-add-srab-switch.patch
910-kobject_uevent.patch
911-kobject_add_broadcast_uevent.patch

[-- Attachment #3: Type: text/plain, Size: 176 bytes --]

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

* Re: ARM router NAT performance affected by random/unrelated commits
  2019-05-21 10:28 ARM router NAT performance affected by random/unrelated commits Rafał Miłecki
@ 2019-05-21 10:45 ` Russell King - ARM Linux admin
  2019-05-21 11:16   ` Rafał Miłecki
  2019-05-22 11:51   ` Rafał Miłecki
  2019-05-21 13:01 ` Andrew Lunn
  1 sibling, 2 replies; 8+ messages in thread
From: Russell King - ARM Linux admin @ 2019-05-21 10:45 UTC (permalink / raw)
  To: Rafał Miłecki
  Cc: Jo-Philipp Wich, Network Development, John Crispin,
	Linux Kernel Mailing List, linux-block, Jonas Gorski,
	linux-arm-kernel, Felix Fietkau

On Tue, May 21, 2019 at 12:28:48PM +0200, Rafał Miłecki wrote:
> Hi,
> 
> I work on home routers based on Broadcom's Northstar SoCs. Those devices
> have ARM Cortex-A9 and most of them are dual-core.
> 
> As for home routers, my main concern is network performance. That CPU
> isn't powerful enough to handle gigabit traffic so all kinds of
> optimizations matter. I noticed some unexpected changes in NAT
> performance when switching between kernels.
> 
> My hardware is BCM47094 SoC (dual core ARM) with integrated network
> controller and external BCM53012 switch.

Guessing, I'd say it's to do with the placement of code wrt cachelines.
You could try aligning some of the cache flushing code to a cache line
and see what effect that has.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up


* Re: ARM router NAT performance affected by random/unrelated commits
  2019-05-21 10:45 ` Russell King - ARM Linux admin
@ 2019-05-21 11:16   ` Rafał Miłecki
  2019-05-21 11:19     ` Russell King - ARM Linux admin
  2019-05-22 11:51   ` Rafał Miłecki
  1 sibling, 1 reply; 8+ messages in thread
From: Rafał Miłecki @ 2019-05-21 11:16 UTC (permalink / raw)
  To: Russell King - ARM Linux admin
  Cc: Jo-Philipp Wich, Network Development, John Crispin,
	Linux Kernel Mailing List, linux-block, Jonas Gorski,
	linux-arm-kernel, Felix Fietkau

On Tue, 21 May 2019 at 12:45, Russell King - ARM Linux admin
<linux@armlinux.org.uk> wrote:
> On Tue, May 21, 2019 at 12:28:48PM +0200, Rafał Miłecki wrote:
> > I work on home routers based on Broadcom's Northstar SoCs. Those devices
> > have ARM Cortex-A9 and most of them are dual-core.
> >
> > As for home routers, my main concern is network performance. That CPU
> > isn't powerful enough to handle gigabit traffic so all kinds of
> > optimizations matter. I noticed some unexpected changes in NAT
> > performance when switching between kernels.
> >
> > My hardware is BCM47094 SoC (dual core ARM) with integrated network
> > controller and external BCM53012 switch.
>
> Guessing, I'd say it's to do with the placement of code wrt cachelines.

That was my guess as well, that's why I tried "cachestat" tool.


> You could try aligning some of the cache flushing code to a cache line
> and see what effect that has.

Can you give me some extra hint on how to do that, please? I tried
searching for it a bit but I didn't find any clear article on that
matter.

-- 
Rafał

* Re: ARM router NAT performance affected by random/unrelated commits
  2019-05-21 11:16   ` Rafał Miłecki
@ 2019-05-21 11:19     ` Russell King - ARM Linux admin
  0 siblings, 0 replies; 8+ messages in thread
From: Russell King - ARM Linux admin @ 2019-05-21 11:19 UTC (permalink / raw)
  To: Rafał Miłecki
  Cc: Jo-Philipp Wich, Network Development, John Crispin,
	Linux Kernel Mailing List, linux-block, Jonas Gorski,
	linux-arm-kernel, Felix Fietkau

On Tue, May 21, 2019 at 01:16:12PM +0200, Rafał Miłecki wrote:
> On Tue, 21 May 2019 at 12:45, Russell King - ARM Linux admin
> <linux@armlinux.org.uk> wrote:
> > On Tue, May 21, 2019 at 12:28:48PM +0200, Rafał Miłecki wrote:
> > > I work on home routers based on Broadcom's Northstar SoCs. Those devices
> > > have ARM Cortex-A9 and most of them are dual-core.
> > >
> > > As for home routers, my main concern is network performance. That CPU
> > > isn't powerful enough to handle gigabit traffic so all kinds of
> > > optimizations matter. I noticed some unexpected changes in NAT
> > > performance when switching between kernels.
> > >
> > > My hardware is BCM47094 SoC (dual core ARM) with integrated network
> > > controller and external BCM53012 switch.
> >
> > Guessing, I'd say it's to do with the placement of code wrt cachelines.
> 
> That was my guess as well, that's why I tried "cachestat" tool.
> 
> 
> > You could try aligning some of the cache flushing code to a cache line
> > and see what effect that has.
> 
> Can you give me some extra hint on how to do that, please? I tried
> searching for it a bit but I didn't find any clear article on that
> matter.

IIRC, the cache line size on Cortex A9 is 32 bytes, so the assembler
directive would be ".align 5".  Place that in arch/arm/mm/cache-v7.S
before v7_dma_clean_range and v7_dma_inv_range.
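
To spell out the arithmetic: on ARM the argument of .align is a power
of two, so ".align 5" pads to the next 2^5 = 32-byte boundary. A small
Python sketch of that rounding (align_up and the addresses are made up
for illustration):

```python
def align_up(addr, power):
    """Round addr up to the next multiple of 2**power."""
    boundary = 1 << power
    return (addr + boundary - 1) & ~(boundary - 1)

# Hypothetical addresses, only to show the effect of ".align 5".
print(hex(align_up(0xc0100014, 5)))  # padded up to the next 32-byte line
print(hex(align_up(0xc0100040, 5)))  # already aligned, left unchanged
```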

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up

* Re: ARM router NAT performance affected by random/unrelated commits
  2019-05-21 10:28 ARM router NAT performance affected by random/unrelated commits Rafał Miłecki
  2019-05-21 10:45 ` Russell King - ARM Linux admin
@ 2019-05-21 13:01 ` Andrew Lunn
  1 sibling, 0 replies; 8+ messages in thread
From: Andrew Lunn @ 2019-05-21 13:01 UTC (permalink / raw)
  To: Rafał Miłecki
  Cc: Jo-Philipp Wich, Network Development, John Crispin,
	Linux Kernel Mailing List, linux-block, Jonas Gorski,
	linux-arm-kernel, Felix Fietkau

> I also tried running cachestat but didn't get anything interesting:
> Counting cache functions... Output every 1 seconds.
> TIME         HITS   MISSES  DIRTIES    RATIO   BUFFERS_MB   CACHE_MB
> 10:06:59     1020        5        0    99.5%            0          2
> 10:07:00     1029        0        0   100.0%            0          2
> 10:07:01     1013        0        0   100.0%            0          2
> 10:07:02     1029        0        0   100.0%            0          2
> 10:07:03     1029        0        0   100.0%            0          2
> 10:07:04      997        0        0   100.0%            0          2
> 10:07:05     1013        0        0   100.0%            0          2
> (I started iperf at 10:07:00).

Try looking at the L1 cache performance. For this class of device, the
L1 code cache is probably too small to contain the active parts of the
network stack. The less cache thrashing you have, the faster the stack
will go.

Maybe try compiling with -Os so it optimises for size.

Build a custom kernel with everything you don't need turned off.

Look at the work being done to batch process packets. Rather than
passing one packet at a time through the network stack, it passes a
linked list of packets to each stage in the stack. That should result
in fewer cache misses per packet. But not all layers in the stack
support this batching. See if you can find out where it is being
unbatched, and why. Can you influence this, disable build options, or
work on the code to pass batches further along the stack?

     Andrew

* Re: ARM router NAT performance affected by random/unrelated commits
  2019-05-21 10:45 ` Russell King - ARM Linux admin
  2019-05-21 11:16   ` Rafał Miłecki
@ 2019-05-22 11:51   ` Rafał Miłecki
  2019-05-22 12:17     ` Russell King - ARM Linux admin
  1 sibling, 1 reply; 8+ messages in thread
From: Rafał Miłecki @ 2019-05-22 11:51 UTC (permalink / raw)
  To: Russell King - ARM Linux admin
  Cc: Jo-Philipp Wich, Network Development, John Crispin,
	Linux Kernel Mailing List, linux-block, Jonas Gorski,
	linux-arm-kernel, Felix Fietkau

[-- Attachment #1: Type: text/plain, Size: 5777 bytes --]

On 21.05.2019 12:45, Russell King - ARM Linux admin wrote:
 > On Tue, May 21, 2019 at 12:28:48PM +0200, Rafał Miłecki wrote:
 >> I work on home routers based on Broadcom's Northstar SoCs. Those devices
 >> have ARM Cortex-A9 and most of them are dual-core.
 >>
 >> As for home routers, my main concern is network performance. That CPU
 >> isn't powerful enough to handle gigabit traffic so all kinds of
 >> optimizations matter. I noticed some unexpected changes in NAT
 >> performance when switching between kernels.
 >>
 >> My hardware is BCM47094 SoC (dual core ARM) with integrated network
 >> controller and external BCM53012 switch.
 >
 > Guessing, I'd say it's to do with the placement of code wrt cachelines.
 > You could try aligning some of the cache flushing code to a cache line
 > and see what effect that has.

Is System.map a good place to check function code alignment?

With Linux 4.19 + OpenWrt mtd patches I have:
(...)
c010ea94 t v7_dma_inv_range
c010eae0 t v7_dma_clean_range
(...)
c02ca3d0 T blk_mq_update_nr_hw_queues
c02ca69c T blk_mq_alloc_tag_set
c02ca94c T blk_mq_release
c02ca9b4 T blk_mq_free_queue
c02caa88 T blk_mq_update_nr_requests
c02cab50 T blk_mq_unique_tag
(...)

After cherry-picking 9316a9ed6895 ("blk-mq: provide helper for setting
up an SQ queue and tag set"):
(...)
c010ea94 t v7_dma_inv_range
c010eae0 t v7_dma_clean_range
(...)
c02ca3d0 T blk_mq_update_nr_hw_queues
c02ca69c T blk_mq_alloc_tag_set
c02ca94c T blk_mq_init_sq_queue <-- NEW
c02ca9c0 T blk_mq_release <-- Different address of this & all below
c02caa28 T blk_mq_free_queue
c02caafc T blk_mq_update_nr_requests
c02cabc4 T blk_mq_unique_tag
(...)

As you can see blk_mq_init_sq_queue has appeared in the System.map and
it affected addresses of ~30000 symbols. I can believe some frequently
used symbols got luckily aligned and that improved overall performance.

Interestingly v7_dma_inv_range() and v7_dma_clean_range() were not
relocated.
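
One way to read alignment off System.map is to take each address modulo
the cache line size. A minimal Python sketch, assuming the 32-byte
Cortex-A9 line size mentioned earlier in the thread (line_offsets is a
made-up helper; the sample lines are copied from the first listing
above):

```python
CACHE_LINE = 32  # bytes; assumed Cortex-A9 line size

def line_offsets(system_map_text):
    """Map each symbol name to its offset within a cache line."""
    offsets = {}
    for line in system_map_text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip "(...)" markers and blank lines
        addr, _kind, name = parts
        offsets[name] = int(addr, 16) % CACHE_LINE
    return offsets

sample = """\
c010ea94 t v7_dma_inv_range
c010eae0 t v7_dma_clean_range
(...)
c02ca94c T blk_mq_release
"""

for name, offset in line_offsets(sample).items():
    state = "aligned" if offset == 0 else f"{offset} bytes into a line"
    print(f"{name}: {state}")
```

For the two DMA helpers this reports v7_dma_clean_range as already
sitting on a line boundary and v7_dma_inv_range as 20 bytes into a line.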

*****

I followed Russell's suggestion and added .align 5 to cache-v7.S (see
two attached diffs).

1) v4.19 + OpenWrt mtd patches
 > egrep -B 1 -A 1 "v7_dma_(inv|clean)_range" System.map
c010ea58 T v7_flush_kern_dcache_area
c010ea94 t v7_dma_inv_range
c010eae0 t v7_dma_clean_range
c010eb18 T b15_dma_flush_range

2) v4.19 + OpenWrt mtd patches + two .align 5 in cache-v7.S
c010ea6c T v7_flush_kern_dcache_area
c010eac0 t v7_dma_inv_range
c010eb20 t v7_dma_clean_range
c010eb58 T b15_dma_flush_range
(actually 15 symbols above v7_dma_inv_range were relocated)

This method seems to be somehow working (at least affects addresses in
System.map).

*****

I ran 2 tests for each combination of changes. Each test consisted of
10 sequences of: 30 seconds iperf session + reboot.


 > git reset --hard v4.19
 > git am OpenWrt-mtd-chages.patch
Test #1: 738 Mb/s
Test #2: 737 Mb/s

 > git reset --hard v4.19
 > git am OpenWrt-mtd-chages.patch
 > patch -p1 < v7_dma_clean_range-align.diff
Test #1: 746 Mb/s
Test #2: 747 Mb/s

 > git reset --hard v4.19
 > git am OpenWrt-mtd-chages.patch
 > patch -p1 < v7_dma_inv_range-align.diff
Test #1: 745 Mb/s
Test #2: 746 Mb/s

 > git reset --hard v4.19
 > git am OpenWrt-mtd-chages.patch
 > patch -p1 < v7_dma_clean_range-align.diff
 > patch -p1 < v7_dma_inv_range-align.diff
Test #1: 762 Mb/s
Test #2: 761 Mb/s

As you can see I got a quite nice performance improvement after aligning
both: v7_dma_clean_range() and v7_dma_inv_range().

It still wasn't as good as with 9316a9ed6895 cherry-picked but pretty
close.


 > git reset --hard v4.19
 > git am OpenWrt-mtd-chages.patch
 > git cherry-pick -x 9316a9ed6895
Test #1: 770 Mb/s
Test #2: 766 Mb/s

 > git reset --hard v4.19
 > git am OpenWrt-mtd-chages.patch
 > git cherry-pick -x 9316a9ed6895
 > patch -p1 < v7_dma_clean_range-align.diff
Test #1: 756 Mb/s
Test #2: 759 Mb/s

 > git reset --hard v4.19
 > git am OpenWrt-mtd-chages.patch
 > git cherry-pick -x 9316a9ed6895
 > patch -p1 < v7_dma_inv_range-align.diff
Test #1: 758 Mb/s
Test #2: 759 Mb/s

 > git reset --hard v4.19
 > git am OpenWrt-mtd-chages.patch
 > git cherry-pick -x 9316a9ed6895
 > patch -p1 < v7_dma_clean_range-align.diff
 > patch -p1 < v7_dma_inv_range-align.diff
Test #1: 767 Mb/s
Test #2: 763 Mb/s

Now you can see how unpredictable it is. If I cherry-pick 9316a9ed6895
and do an extra alignment of v7_dma_clean_range() and v7_dma_inv_range()
that extra alignment can actually *hurt* NAT performance.

My guess is that:
1) 9316a9ed6895 provides alignment of some very important function(s)
2) DMA alignments on top ^^ provide some gain but also break some alignment

*****

SUMMARY

It seems that for Linux 4.19 + my .config I can get a very lucky &
optimal alignment of functions by cherry-picking 9316a9ed6895.

I thought of checking functions reported by the "perf" tool with CPU
usage of 2%+.

All following functions keep their original address with 9316a9ed6895:
__irqentry_text_end
arch_cpu_idle
l2c210_clean_range
l2c210_inv_range
v7_dma_clean_range
v7_dma_inv_range

Remaining 3 functions got relocated:
-c03e5038 t __netif_receive_skb_core
+c03e50b0 t __netif_receive_skb_core
-c03c8b1c t bcma_host_soc_read32
+c03c8b94 t bcma_host_soc_read32
-c0475620 T fib_table_lookup
+c0475698 T fib_table_lookup

I tried aligning all 3 above functions using:
__attribute__((aligned(32)))
and got 756 Mb/s. It's better but still not ~770 Mb/s.
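
To see what the cherry-pick did to these three symbols, their position
within a 32-byte cache line can be compared before and after. A minimal
Python sketch using the addresses listed above (describe_move is a
made-up helper; 32 bytes is the assumed Cortex-A9 line size):

```python
CACHE_LINE = 32  # bytes; assumed Cortex-A9 line size

def describe_move(old, new, line=CACHE_LINE):
    """Return (shift in bytes, old offset in line, new offset in line)."""
    return new - old, old % line, new % line

moved = {
    "__netif_receive_skb_core": (0xc03e5038, 0xc03e50b0),
    "bcma_host_soc_read32": (0xc03c8b1c, 0xc03c8b94),
    "fib_table_lookup": (0xc0475620, 0xc0475698),
}

for name, (old, new) in moved.items():
    shift, before, after = describe_move(old, new)
    print(f"{name}: moved {shift} bytes, line offset {before} -> {after}")
```

With these addresses all three symbols move by the same 120 bytes, and
none of them ends up on a 32-byte boundary afterwards.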

Is there any easy way of identifying which function alignments have
such a big impact on NAT performance? I'd like to get those functions
explicitly aligned using assembler/__attribute__/something.

What I'm also afraid of are false positives. I may end up aligning some
unrelated function that just happens to align other ones. Just like
cherry-picking 9316a9ed6895 having side-effects and not really fixing
anything explicitly.

[-- Attachment #2: v7_dma_clean_range-align.diff --]
[-- Type: text/x-patch, Size: 334 bytes --]

[-- Attachment #3: v7_dma_inv_range-align.diff --]
[-- Type: text/x-patch, Size: 312 bytes --]

* Re: ARM router NAT performance affected by random/unrelated commits
  2019-05-22 11:51   ` Rafał Miłecki
@ 2019-05-22 12:17     ` Russell King - ARM Linux admin
  2019-05-22 21:12       ` Rafał Miłecki
  0 siblings, 1 reply; 8+ messages in thread
From: Russell King - ARM Linux admin @ 2019-05-22 12:17 UTC (permalink / raw)
  To: Rafał Miłecki
  Cc: Jo-Philipp Wich, Network Development, John Crispin,
	Linux Kernel Mailing List, linux-block, Jonas Gorski,
	linux-arm-kernel, Felix Fietkau

On Wed, May 22, 2019 at 01:51:01PM +0200, Rafał Miłecki wrote:
> On 21.05.2019 12:45, Russell King - ARM Linux admin wrote:
> > On Tue, May 21, 2019 at 12:28:48PM +0200, Rafał Miłecki wrote:
> >> I work on home routers based on Broadcom's Northstar SoCs. Those devices
> >> have ARM Cortex-A9 and most of them are dual-core.
> >>
> >> As for home routers, my main concern is network performance. That CPU
> >> isn't powerful enough to handle gigabit traffic so all kinds of
> >> optimizations matter. I noticed some unexpected changes in NAT
> >> performance when switching between kernels.
> >>
> >> My hardware is BCM47094 SoC (dual core ARM) with integrated network
> >> controller and external BCM53012 switch.
> >
> > Guessing, I'd say it's to do with the placement of code wrt cachelines.
> > You could try aligning some of the cache flushing code to a cache line
> > and see what effect that has.
> 
> Is System.map a good place to check function code alignment?
> 
> With Linux 4.19 + OpenWrt mtd patches I have:
> (...)
> c010ea94 t v7_dma_inv_range
> c010eae0 t v7_dma_clean_range
> (...)
> c02ca3d0 T blk_mq_update_nr_hw_queues
> c02ca69c T blk_mq_alloc_tag_set
> c02ca94c T blk_mq_release
> c02ca9b4 T blk_mq_free_queue
> c02caa88 T blk_mq_update_nr_requests
> c02cab50 T blk_mq_unique_tag
> (...)
> 
> After cherry-picking 9316a9ed6895 ("blk-mq: provide helper for setting
> up an SQ queue and tag set"):
> (...)
> c010ea94 t v7_dma_inv_range
> c010eae0 t v7_dma_clean_range
> (...)
> c02ca3d0 T blk_mq_update_nr_hw_queues
> c02ca69c T blk_mq_alloc_tag_set
> c02ca94c T blk_mq_init_sq_queue <-- NEW
> c02ca9c0 T blk_mq_release <-- Different address of this & all below
> c02caa28 T blk_mq_free_queue
> c02caafc T blk_mq_update_nr_requests
> c02cabc4 T blk_mq_unique_tag
> (...)
> 
> As you can see blk_mq_init_sq_queue has appeared in the System.map and
> it affected addresses of ~30000 symbols. I can believe some frequently
> used symbols got luckily aligned and that improved overall performance.
> 
> Interestingly v7_dma_inv_range() and v7_dma_clean_range() were not
> relocated.
> 
> *****
> 
> I followed Russell's suggestion and added .align 5 to cache-v7.S (see
> two attached diffs).
> 
> 1) v4.19 + OpenWrt mtd patches
> > egrep -B 1 -A 1 "v7_dma_(inv|clean)_range" System.map
> c010ea58 T v7_flush_kern_dcache_area
> c010ea94 t v7_dma_inv_range
> c010eae0 t v7_dma_clean_range
> c010eb18 T b15_dma_flush_range
> 
> 2) v4.19 + OpenWrt mtd patches + two .align 5 in cache-v7.S
> c010ea6c T v7_flush_kern_dcache_area
> c010eac0 t v7_dma_inv_range
> c010eb20 t v7_dma_clean_range
> c010eb58 T b15_dma_flush_range
> (actually 15 symbols above v7_dma_inv_range were relocated)
> 
> This method seems to be somehow working (at least affects addresses in
> System.map).
> 
> *****
> 
> I ran 2 tests for each combination of changes. Each test consisted of
> 10 sequences of: 30 seconds iperf session + reboot.
> 
> 
> > git reset --hard v4.19
> > git am OpenWrt-mtd-chages.patch
> Test #1: 738 Mb/s
> Test #2: 737 Mb/s
> 
> > git reset --hard v4.19
> > git am OpenWrt-mtd-chages.patch
> > patch -p1 < v7_dma_clean_range-align.diff
> Test #1: 746 Mb/s
> Test #2: 747 Mb/s
> 
> > git reset --hard v4.19
> > git am OpenWrt-mtd-chages.patch
> > patch -p1 < v7_dma_inv_range-align.diff
> Test #1: 745 Mb/s
> Test #2: 746 Mb/s
> 
> > git reset --hard v4.19
> > git am OpenWrt-mtd-chages.patch
> > patch -p1 < v7_dma_clean_range-align.diff
> > patch -p1 < v7_dma_inv_range-align.diff
> Test #1: 762 Mb/s
> Test #2: 761 Mb/s
> 
> As you can see I got a quite nice performance improvement after aligning
> both: v7_dma_clean_range() and v7_dma_inv_range().

This is an improvement of about 3.3%.

> It still wasn't as good as with 9316a9ed6895 cherry-picked but pretty
> close.
> 
> 
> > git reset --hard v4.19
> > git am OpenWrt-mtd-chages.patch
> > git cherry-pick -x 9316a9ed6895
> Test #1: 770 Mb/s
> Test #2: 766 Mb/s
> 
> > git reset --hard v4.19
> > git am OpenWrt-mtd-chages.patch
> > git cherry-pick -x 9316a9ed6895
> > patch -p1 < v7_dma_clean_range-align.diff
> Test #1: 756 Mb/s
> Test #2: 759 Mb/s
> 
> > git reset --hard v4.19
> > git am OpenWrt-mtd-chages.patch
> > git cherry-pick -x 9316a9ed6895
> > patch -p1 < v7_dma_inv_range-align.diff
> Test #1: 758 Mb/s
> Test #2: 759 Mb/s
> 
> > git reset --hard v4.19
> > git am OpenWrt-mtd-chages.patch
> > git cherry-pick -x 9316a9ed6895
> > patch -p1 < v7_dma_clean_range-align.diff
> > patch -p1 < v7_dma_inv_range-align.diff
> Test #1: 767 Mb/s
> Test #2: 763 Mb/s
> 
> Now you can see how unpredictable it is. If I cherry-pick 9316a9ed6895
> and do an extra alignment of v7_dma_clean_range() and v7_dma_inv_range()
> that extra alignment can actually *hurt* NAT performance.

You have a maximum variance of 4Mb/s in your tests which is around
0.5%, and this shows a reduction of 3Mb/s, or 0.4%.

If we look at it a different way:
- Without the alignment patches, there is a difference of 4% in
  performance depending on whether 9316a9ed6895 is applied.
- With the alignment patches, there is a difference of 0.4% in
  performance depending on whether 9316a9ed6895 is applied.

How can this not be beneficial?
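
These percentages can be reproduced from the quoted per-test figures.
A quick Python sketch, taking the mean of Test #1 and Test #2 for each
configuration (that averaging and the helper names are choices made for
illustration):

```python
def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)

def diff_percent(before, after):
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100

# Mb/s results quoted above, Test #1 and Test #2 per configuration.
plain = mean([738, 737])           # v4.19 + OpenWrt mtd patches
aligned = mean([762, 761])         # + both .align 5 patches
picked = mean([770, 766])          # + 9316a9ed6895 only
picked_aligned = mean([767, 763])  # + 9316a9ed6895 + both .align 5

print(f"alignment alone:              {diff_percent(plain, aligned):+.2f}%")
print(f"9316a9ed6895, no alignment:   {diff_percent(plain, picked):+.2f}%")
print(f"9316a9ed6895, with alignment: "
      f"{diff_percent(aligned, picked_aligned):+.2f}%")
```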

> 
> My guess is that:
> 1) 9316a9ed6895 provides alignment of some very important function(s)
> 2) DMA alignments on top ^^ provide some gain but also break some alignment
> 
> *****
> 
> SUMMARY
> 
> It seems that for Linux 4.19 + my .config I can get a very lucky &
> optimal alignment of functions by cherry-picking 9316a9ed6895.
> 
> I thought of checking functions reported by the "perf" tool with CPU
> usage of 2%+.
> 
> All following functions keep their original address with 9316a9ed6895:
> __irqentry_text_end
> arch_cpu_idle
> l2c210_clean_range
> l2c210_inv_range
> v7_dma_clean_range
> v7_dma_inv_range
> 
> Remaining 3 functions got relocated:
> -c03e5038 t __netif_receive_skb_core
> +c03e50b0 t __netif_receive_skb_core
> -c03c8b1c t bcma_host_soc_read32
> +c03c8b94 t bcma_host_soc_read32
> -c0475620 T fib_table_lookup
> +c0475698 T fib_table_lookup
> 
> I tried aligning all 3 above functions using:
> __attribute__((aligned(32)))
> and got 756 Mb/s. It's better but still not ~770 Mb/s.
> 
> Is there any easy way of identifying which function alignments have
> such a big impact on NAT performance? I'd like to get those functions
> explicitly aligned using assembler/__attribute__/something.
> 
> What I'm also afraid of are false positives. I may end up aligning some
> unrelated function that just happens to align other ones. Just like
> cherry-picking 9316a9ed6895 having side-effects and not really fixing
> anything explicitly.

> diff --git a/arch/arm/mm/cache-v7.S b/arch/arm/mm/cache-v7.S
> index 215df435bfb9..c60046cd34aa 100644
> --- a/arch/arm/mm/cache-v7.S
> +++ b/arch/arm/mm/cache-v7.S
> @@ -373,6 +373,8 @@ v7_dma_inv_range:
>  	ret	lr
>  ENDPROC(v7_dma_inv_range)
>  
> +	.align	5
> +
>  /*
>   *	v7_dma_clean_range(start,end)
>   *	- start   - virtual start address of region

> diff --git a/arch/arm/mm/cache-v7.S b/arch/arm/mm/cache-v7.S
> index 215df435bfb9..0c3999f219ab 100644
> --- a/arch/arm/mm/cache-v7.S
> +++ b/arch/arm/mm/cache-v7.S
> @@ -340,6 +340,8 @@ ENTRY(v7_flush_kern_dcache_area)
>  	ret	lr
>  ENDPROC(v7_flush_kern_dcache_area)
>  
> +	.align	5
> +
>  /*
>   *	v7_dma_inv_range(start,end)
>   *


-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up

* Re: ARM router NAT performance affected by random/unrelated commits
  2019-05-22 12:17     ` Russell King - ARM Linux admin
@ 2019-05-22 21:12       ` Rafał Miłecki
  0 siblings, 0 replies; 8+ messages in thread
From: Rafał Miłecki @ 2019-05-22 21:12 UTC (permalink / raw)
  To: Russell King - ARM Linux admin
  Cc: Jo-Philipp Wich, Network Development, John Crispin,
	Linux Kernel Mailing List, linux-block, Jonas Gorski,
	linux-arm-kernel

On 22.05.2019 14:17, Russell King - ARM Linux admin wrote:
> On Wed, May 22, 2019 at 01:51:01PM +0200, Rafał Miłecki wrote:
>> On 21.05.2019 12:45, Russell King - ARM Linux admin wrote:
>>> On Tue, May 21, 2019 at 12:28:48PM +0200, Rafał Miłecki wrote:
>>>> I work on home routers based on Broadcom's Northstar SoCs. Those devices
>>>> have ARM Cortex-A9 and most of them are dual-core.
>>>>
>>>> As for home routers, my main concern is network performance. That CPU
>>>> isn't powerful enough to handle gigabit traffic so all kind of
>>>> optimizations do matter. I noticed some unexpected changes in NAT
>>>> performance when switching between kernels.
>>>>
>>>> My hardware is BCM47094 SoC (dual core ARM) with integrated network
>>>> controller and external BCM53012 switch.
>>>
>>> Guessing, I'd say it's to do with the placement of code wrt cachelines.
>>> You could try aligning some of the cache flushing code to a cache line
>>> and see what effect that has.
>>
>> Is System.map a good place to check for functions code alignment?
>>
>> With Linux 4.19 + OpenWrt mtd patches I have:
>> (...)
>> c010ea94 t v7_dma_inv_range
>> c010eae0 t v7_dma_clean_range
>> (...)
>> c02ca3d0 T blk_mq_update_nr_hw_queues
>> c02ca69c T blk_mq_alloc_tag_set
>> c02ca94c T blk_mq_release
>> c02ca9b4 T blk_mq_free_queue
>> c02caa88 T blk_mq_update_nr_requests
>> c02cab50 T blk_mq_unique_tag
>> (...)
>>
>> After cherry-picking 9316a9ed6895 ("blk-mq: provide helper for setting
>> up an SQ queue and tag set"):
>> (...)
>> c010ea94 t v7_dma_inv_range
>> c010eae0 t v7_dma_clean_range
>> (...)
>> c02ca3d0 T blk_mq_update_nr_hw_queues
>> c02ca69c T blk_mq_alloc_tag_set
>> c02ca94c T blk_mq_init_sq_queue <-- NEW
>> c02ca9c0 T blk_mq_release <-- Different address of this & all below
>> c02caa28 T blk_mq_free_queue
>> c02caafc T blk_mq_update_nr_requests
>> c02cabc4 T blk_mq_unique_tag
>> (...)
>>
>> As you can see blk_mq_init_sq_queue has appeared in the System.map and
>> it affected addresses of ~30000 symbols. I can believe some frequently
>> used symbols got luckily aligned and that improved overall performance.
>>
>> Interestingly v7_dma_inv_range() and v7_dma_clean_range() were not
>> relocated.
>>
>> *****
>>
>> I followed Russell's suggestion and added .align 5 to cache-v7.S (see
>> two attached diffs).
>>
>> 1) v4.19 + OpenWrt mtd patches
>>> egrep -B 1 -A 1 "v7_dma_(inv|clean)_range" System.map
>> c010ea58 T v7_flush_kern_dcache_area
>> c010ea94 t v7_dma_inv_range
>> c010eae0 t v7_dma_clean_range
>> c010eb18 T b15_dma_flush_range
>>
>> 2) v4.19 + OpenWrt mtd patches + two .align 5 in cache-v7.S
>> c010ea6c T v7_flush_kern_dcache_area
>> c010eac0 t v7_dma_inv_range
>> c010eb20 t v7_dma_clean_range
>> c010eb58 T b15_dma_flush_range
>> (actually 15 symbols above v7_dma_inv_range were relocated)
>>
>> This method seems to work (at least it affects addresses in
>> System.map).
>>
>> *****
>>
>> I ran 2 tests for each combination of changes. Each test consisted of
>> 10 iterations of a 30-second iperf session followed by a reboot.
>>
>>
>>> git reset --hard v4.19
>>> git am OpenWrt-mtd-chages.patch
>> Test #1: 738 Mb/s
>> Test #2: 737 Mb/s
>>
>>> git reset --hard v4.19
>>> git am OpenWrt-mtd-chages.patch
>>> patch -p1 < v7_dma_clean_range-align.diff
>> Test #1: 746 Mb/s
>> Test #2: 747 Mb/s
>>
>>> git reset --hard v4.19
>>> git am OpenWrt-mtd-chages.patch
>>> patch -p1 < v7_dma_inv_range-align.diff
>> Test #1: 745 Mb/s
>> Test #2: 746 Mb/s
>>
>>> git reset --hard v4.19
>>> git am OpenWrt-mtd-chages.patch
>>> patch -p1 < v7_dma_clean_range-align.diff
>>> patch -p1 < v7_dma_inv_range-align.diff
>> Test #1: 762 Mb/s
>> Test #2: 761 Mb/s
>>
>> As you can see I got a quite nice performance improvement after aligning
>> both: v7_dma_clean_range() and v7_dma_inv_range().
> 
> This is an improvement of about 3.3%.
> 
>> It still wasn't as good as with 9316a9ed6895 cherry-picked but pretty
>> close.
>>
>>
>>> git reset --hard v4.19
>>> git am OpenWrt-mtd-chages.patch
>>> git cherry-pick -x 9316a9ed6895
>> Test #1: 770 Mb/s
>> Test #2: 766 Mb/s
>>
>>> git reset --hard v4.19
>>> git am OpenWrt-mtd-chages.patch
>>> git cherry-pick -x 9316a9ed6895
>>> patch -p1 < v7_dma_clean_range-align.diff
>> Test #1: 756 Mb/s
>> Test #2: 759 Mb/s
>>
>>> git reset --hard v4.19
>>> git am OpenWrt-mtd-chages.patch
>>> git cherry-pick -x 9316a9ed6895
>>> patch -p1 < v7_dma_inv_range-align.diff
>> Test #1: 758 Mb/s
>> Test #2: 759 Mb/s
>>
>>> git reset --hard v4.19
>>> git am OpenWrt-mtd-chages.patch
>>> git cherry-pick -x 9316a9ed6895
>>> patch -p1 < v7_dma_clean_range-align.diff
>>> patch -p1 < v7_dma_inv_range-align.diff
>> Test #1: 767 Mb/s
>> Test #2: 763 Mb/s
>>
>> Now you can see how unpredictable it is. If I cherry-pick 9316a9ed6895
>> and do an extra alignment of v7_dma_clean_range() and v7_dma_inv_range()
>> that extra alignment can actually *hurt* NAT performance.
> 
> You have a maximum variance of 4Mb/s in your tests which is around
> 0.5%, and this shows a reduction of 3Mb/s, or 0.4%.
> 
> If we look at it a different way:
> - Without the alignment patches, there is a difference of 4% in
>    performance depending on whether 9316a9ed6895 is applied.
> - With the alignment patches, there is a difference of 0.4% in
>    performance depending on whether 9316a9ed6895 is applied.
> 
> How can this not be beneficial?
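
The percentages quoted above can be roughly re-derived from the two-test
results. This is only a sketch: the helper and variable names are made up
for illustration, and the inputs are the Mb/s figures quoted in this thread.

```python
def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)

def pct_change(new, old):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0

# Two-test results quoted above, in Mb/s.
baseline = mean([738, 737])          # v4.19 + OpenWrt mtd patches
aligned = mean([762, 761])           # + both .align 5 diffs
cherry = mean([770, 766])            # + 9316a9ed6895 only
cherry_aligned = mean([767, 763])    # + 9316a9ed6895 + both diffs

# Effect of the alignment diffs alone.
print(f"alignment vs baseline:        {pct_change(aligned, baseline):+.2f}%")
# Effect of 9316a9ed6895 without and with the alignment diffs.
print(f"cherry-pick, no alignment:    {pct_change(cherry, baseline):+.2f}%")
print(f"cherry-pick, with alignment:  {pct_change(cherry_aligned, aligned):+.2f}%")
```

With only two tests per configuration these figures are indicative at best;
they come out close to the "about 3.3%", "4%" and "0.4%" mentioned above.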

Aligning v7_dma_clean_range() and v7_dma_inv_range() is definitely
beneficial! I'm sorry I wasn't clear enough.

I reran the tests for the 2 most important setups with more iterations.

 > git reset --hard v4.19
 > git am OpenWrt-mtd-chages.patch
 > git cherry-pick -x 9316a9ed6895
[  3]  0.0-30.0 sec  2.71 GBytes   776 Mbits/sec
[  3]  0.0-30.0 sec  2.71 GBytes   775 Mbits/sec
[  3]  0.0-30.0 sec  2.70 GBytes   774 Mbits/sec
[  3]  0.0-30.0 sec  2.70 GBytes   774 Mbits/sec
[  3]  0.0-30.0 sec  2.70 GBytes   773 Mbits/sec
[  3]  0.0-30.0 sec  2.70 GBytes   773 Mbits/sec
[  3]  0.0-30.0 sec  2.70 GBytes   773 Mbits/sec
[  3]  0.0-30.0 sec  2.69 GBytes   771 Mbits/sec
[  3]  0.0-30.0 sec  2.69 GBytes   771 Mbits/sec
[  3]  0.0-30.0 sec  2.69 GBytes   771 Mbits/sec
[  3]  0.0-30.0 sec  2.69 GBytes   771 Mbits/sec
[  3]  0.0-30.0 sec  2.69 GBytes   770 Mbits/sec
[  3]  0.0-30.0 sec  2.69 GBytes   770 Mbits/sec
[  3]  0.0-30.0 sec  2.69 GBytes   770 Mbits/sec
[  3]  0.0-30.0 sec  2.69 GBytes   770 Mbits/sec
[  3]  0.0-30.0 sec  2.69 GBytes   769 Mbits/sec
[  3]  0.0-30.0 sec  2.69 GBytes   769 Mbits/sec
[  3]  0.0-30.0 sec  2.68 GBytes   768 Mbits/sec
[  3]  0.0-30.0 sec  2.68 GBytes   768 Mbits/sec
[  3]  0.0-30.0 sec  2.68 GBytes   767 Mbits/sec
[  3]  0.0-30.0 sec  2.68 GBytes   767 Mbits/sec
[  3]  0.0-30.0 sec  2.68 GBytes   767 Mbits/sec
[  3]  0.0-30.0 sec  2.67 GBytes   765 Mbits/sec
[  3]  0.0-30.0 sec  2.67 GBytes   765 Mbits/sec
[  3]  0.0-30.0 sec  2.67 GBytes   764 Mbits/sec
[  3]  0.0-30.0 sec  2.67 GBytes   763 Mbits/sec
[  3]  0.0-30.0 sec  2.67 GBytes   763 Mbits/sec
[  3]  0.0-30.0 sec  2.66 GBytes   762 Mbits/sec
[  3]  0.0-30.0 sec  2.66 GBytes   760 Mbits/sec
[  3]  0.0-30.0 sec  2.66 GBytes   760 Mbits/sec
Average: 769 Mb/s (+4.10%)
Previous results: 773 Mb/s, 770 Mb/s, 766 Mb/s

 > git reset --hard v4.19
 > git am OpenWrt-mtd-chages.patch
 > patch -p1 < v7_dma_clean_range-align.diff
 > patch -p1 < v7_dma_inv_range-align.diff
[  3]  0.0-30.0 sec  2.69 GBytes   769 Mbits/sec
[  3]  0.0-30.0 sec  2.69 GBytes   769 Mbits/sec
[  3]  0.0-30.0 sec  2.68 GBytes   767 Mbits/sec
[  3]  0.0-30.0 sec  2.68 GBytes   766 Mbits/sec
[  3]  0.0-30.0 sec  2.67 GBytes   766 Mbits/sec
[  3]  0.0-30.0 sec  2.67 GBytes   765 Mbits/sec
[  3]  0.0-30.0 sec  2.67 GBytes   765 Mbits/sec
[  3]  0.0-30.0 sec  2.67 GBytes   765 Mbits/sec
[  3]  0.0-30.0 sec  2.67 GBytes   764 Mbits/sec
[  3]  0.0-30.0 sec  2.67 GBytes   763 Mbits/sec
[  3]  0.0-30.0 sec  2.67 GBytes   763 Mbits/sec
[  3]  0.0-30.0 sec  2.66 GBytes   762 Mbits/sec
[  3]  0.0-30.0 sec  2.66 GBytes   762 Mbits/sec
[  3]  0.0-30.0 sec  2.66 GBytes   762 Mbits/sec
[  3]  0.0-30.0 sec  2.66 GBytes   762 Mbits/sec
[  3]  0.0-30.0 sec  2.66 GBytes   761 Mbits/sec
[  3]  0.0-30.0 sec  2.66 GBytes   761 Mbits/sec
[  3]  0.0-30.0 sec  2.66 GBytes   760 Mbits/sec
[  3]  0.0-30.0 sec  2.66 GBytes   760 Mbits/sec
[  3]  0.0-30.0 sec  2.66 GBytes   760 Mbits/sec
[  3]  0.0-30.0 sec  2.65 GBytes   760 Mbits/sec
[  3]  0.0-30.0 sec  2.65 GBytes   759 Mbits/sec
[  3]  0.0-30.0 sec  2.65 GBytes   759 Mbits/sec
[  3]  0.0-30.0 sec  2.65 GBytes   758 Mbits/sec
[  3]  0.0-30.0 sec  2.65 GBytes   758 Mbits/sec
[  3]  0.0-30.0 sec  2.65 GBytes   757 Mbits/sec
[  3]  0.0-30.0 sec  2.65 GBytes   757 Mbits/sec
[  3]  0.0-30.0 sec  2.65 GBytes   757 Mbits/sec
[  3]  0.0-30.0 sec  2.64 GBytes   757 Mbits/sec
[  3]  0.0-30.0 sec  2.64 GBytes   756 Mbits/sec
Average: 762 Mb/s (+3.16%)
Previous results: 767 Mb/s, 763 Mb/s
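
The averages above can be reproduced from the iperf summary lines. The
following is a minimal sketch, not the script actually used for these
tests: parse_rates() and the list names are hypothetical, and the rates are
copied by hand from the two listings above.

```python
import re
from statistics import mean, stdev

# Matches the rate in iperf summary lines such as
# "[  3]  0.0-30.0 sec  2.71 GBytes   776 Mbits/sec"
RATE_RE = re.compile(r"([\d.]+)\s+Mbits/sec")

def parse_rates(text):
    """Return the Mbits/sec figure from every iperf summary line in text."""
    return [float(m.group(1)) for m in RATE_RE.finditer(text)]

# Rates from the two listings above (Mbits/sec), 30 runs each.
cherry_pick = [776, 775, 774, 774, 773, 773, 773, 771, 771, 771,
               771, 770, 770, 770, 770, 769, 769, 768, 768, 767,
               767, 767, 765, 765, 764, 763, 763, 762, 760, 760]
aligned = [769, 769, 767, 766, 766, 765, 765, 765, 764, 763,
           763, 762, 762, 762, 762, 761, 761, 760, 760, 760,
           760, 759, 759, 758, 758, 757, 757, 757, 757, 756]

for name, rates in (("cherry-pick", cherry_pick), ("aligned", aligned)):
    print(f"{name}: n={len(rates)} mean={mean(rates):.1f} "
          f"stdev={stdev(rates):.1f}")
```

Printing the standard deviation next to the mean gives a rough idea of
whether the gap between the two setups is larger than the run-to-run spread.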

Let me explain why I keep researching this. There are two reasons:

1) The realignment caused by cherry-picking 9316a9ed6895 provided
*marginally* better performance than aligning v7_dma_clean_range() and
v7_dma_inv_range(). It's a *very* small difference but I can't stop
thinking I can still do better.

2) Cherry-picking 9316a9ed6895 doesn't change v7_dma_clean_range or
v7_dma_inv_range addresses at all. Yet it still improves NAT
performance. That makes me believe there are more functions that (if
properly aligned) can bump NAT performance.
I hope that aligning all:
* v7_dma_clean_range
* v7_dma_inv_range
* [some unrevealed functions]
could result in even better NAT performance.
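
One way to look for such candidates is to check where each symbol falls
within a cache line, using the System.map addresses. The sketch below
assumes 32-byte lines (what ".align 5" requests); parse_system_map() and
offset_in_line() are hypothetical helpers, and the input is the System.map
excerpt quoted earlier in this thread.

```python
CACHE_LINE = 32  # bytes; ".align 5" asks for 2**5 = 32-byte alignment

def parse_system_map(text):
    """Map symbol name -> address from 'addr type name' lines."""
    symbols = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 3:
            addr, _sym_type, name = parts
            symbols[name] = int(addr, 16)
    return symbols

def offset_in_line(addr, line_size=CACHE_LINE):
    """Byte offset of addr within its cache line (0 means aligned)."""
    return addr % line_size

# 1) v4.19 + OpenWrt mtd patches
before = parse_system_map("""\
c010ea58 T v7_flush_kern_dcache_area
c010ea94 t v7_dma_inv_range
c010eae0 t v7_dma_clean_range
c010eb18 T b15_dma_flush_range
""")
# 2) the same tree + two .align 5 in cache-v7.S
after = parse_system_map("""\
c010ea6c T v7_flush_kern_dcache_area
c010eac0 t v7_dma_inv_range
c010eb20 t v7_dma_clean_range
c010eb58 T b15_dma_flush_range
""")

for name, addr in before.items():
    print(f"{name}: offset {offset_in_line(addr)} -> "
          f"{offset_in_line(after[name])}, moved by {after[name] - addr} bytes")
```

Running the same comparison over two complete System.map files would list
every symbol whose offset within a cache line changed between two builds,
which narrows down the functions worth aligning explicitly.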

Thread overview: 8+ messages
2019-05-21 10:28 ARM router NAT performance affected by random/unrelated commits Rafał Miłecki
2019-05-21 10:45 ` Russell King - ARM Linux admin
2019-05-21 11:16   ` Rafał Miłecki
2019-05-21 11:19     ` Russell King - ARM Linux admin
2019-05-22 11:51   ` Rafał Miłecki
2019-05-22 12:17     ` Russell King - ARM Linux admin
2019-05-22 21:12       ` Rafał Miłecki
2019-05-21 13:01 ` Andrew Lunn
