Re: ARM router NAT performance affected by random/unrelated commits

From: "Rafał Miłecki" <zajec5@gmail.com>
To: Russell King - ARM Linux admin <linux@armlinux.org.uk>
Cc: Jo-Philipp Wich <jo@mein.io>,
	Network Development <netdev@vger.kernel.org>,
	John Crispin <john@phrozen.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	linux-block@vger.kernel.org,
	Jonas Gorski <jonas.gorski@gmail.com>,
	linux-arm-kernel <linux-arm-kernel@lists.infradead.org>,
	Felix Fietkau <nbd@nbd.name>
Subject: Re: ARM router NAT performance affected by random/unrelated commits
Date: Wed, 22 May 2019 13:51:01 +0200
Message-ID: <de262f71-748f-d242-f1d4-ea10188a0438@gmail.com>
In-Reply-To: <20190521104512.2r67fydrgniwqaja@shell.armlinux.org.uk>

[-- Attachment #1: Type: text/plain, Size: 5777 bytes --]

On 21.05.2019 12:45, Russell King - ARM Linux admin wrote:
 > On Tue, May 21, 2019 at 12:28:48PM +0200, Rafał Miłecki wrote:
 >> I work on home routers based on Broadcom's Northstar SoCs. Those devices
 >> have ARM Cortex-A9 and most of them are dual-core.
 >>
 >> As for home routers, my main concern is network performance. That CPU
 >> isn't powerful enough to handle gigabit traffic so all kind of
 >> optimizations do matter. I noticed some unexpected changes in NAT
 >> performance when switching between kernels.
 >>
 >> My hardware is BCM47094 SoC (dual core ARM) with integrated network
 >> controller and external BCM53012 switch.
 >
 > Guessing, I'd say it's to do with the placement of code wrt cachelines.
 > You could try aligning some of the cache flushing code to a cache line
 > and see what effect that has.

Is System.map a good place to check the alignment of function code?

With Linux 4.19 + OpenWrt mtd patches I have:
(...)
c010ea94 t v7_dma_inv_range
c010eae0 t v7_dma_clean_range
(...)
c02ca3d0 T blk_mq_update_nr_hw_queues
c02ca69c T blk_mq_alloc_tag_set
c02ca94c T blk_mq_release
c02ca9b4 T blk_mq_free_queue
c02caa88 T blk_mq_update_nr_requests
c02cab50 T blk_mq_unique_tag
(...)

After cherry-picking 9316a9ed6895 ("blk-mq: provide helper for setting
up an SQ queue and tag set"):
(...)
c010ea94 t v7_dma_inv_range
c010eae0 t v7_dma_clean_range
(...)
c02ca3d0 T blk_mq_update_nr_hw_queues
c02ca69c T blk_mq_alloc_tag_set
c02ca94c T blk_mq_init_sq_queue <-- NEW
c02ca9c0 T blk_mq_release <-- Different address of this & all below
c02caa28 T blk_mq_free_queue
c02caafc T blk_mq_update_nr_requests
c02cabc4 T blk_mq_unique_tag
(...)

As you can see, blk_mq_init_sq_queue has appeared in System.map and
shifted the addresses of ~30000 symbols. I can believe some frequently
used symbols happened to end up better aligned, which improved overall
performance.

Interestingly v7_dma_inv_range() and v7_dma_clean_range() were not
relocated.
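
To quantify how far such a shift propagates, the two System.map files
can be compared symbol by symbol. Below is a minimal Python sketch, not
part of the original test setup. The helper names parse_system_map and
moved_symbols are made up for illustration, and the sample input is only
the excerpt quoted above. Static symbols that share a name collapse into
one entry here, so a real run would need to handle duplicates.

```python
def parse_system_map(text):
    """Map symbol name -> address for lines like 'c010ea94 t v7_dma_inv_range'."""
    symbols = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 3:
            continue  # skips "(...)" markers and blank lines
        try:
            symbols[parts[2]] = int(parts[0], 16)
        except ValueError:
            continue  # first column was not a hex address
    return symbols


def moved_symbols(before, after):
    """Return {name: (old, new)} for symbols whose address differs."""
    return {
        name: (before[name], after[name])
        for name in before
        if name in after and before[name] != after[name]
    }


# Excerpts copied from the two listings above.
BEFORE = """\
c010ea94 t v7_dma_inv_range
c010eae0 t v7_dma_clean_range
(...)
c02ca69c T blk_mq_alloc_tag_set
c02ca94c T blk_mq_release
c02ca9b4 T blk_mq_free_queue
"""

AFTER = """\
c010ea94 t v7_dma_inv_range
c010eae0 t v7_dma_clean_range
(...)
c02ca69c T blk_mq_alloc_tag_set
c02ca94c T blk_mq_init_sq_queue
c02ca9c0 T blk_mq_release
c02caa28 T blk_mq_free_queue
"""

moved = moved_symbols(parse_system_map(BEFORE), parse_system_map(AFTER))
for name, (old, new) in sorted(moved.items()):
    print(f"{name}: {old:08x} -> {new:08x} ({new - old:+#x})")
```

On the full files, len(moved) would give the number of shifted symbols
directly instead of an estimate.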

*****

I followed Russell's suggestion and added .align 5 to cache-v7.S (see
two attached diffs).

1) v4.19 + OpenWrt mtd patches
 > egrep -B 1 -A 1 "v7_dma_(inv|clean)_range" System.map
c010ea58 T v7_flush_kern_dcache_area
c010ea94 t v7_dma_inv_range
c010eae0 t v7_dma_clean_range
c010eb18 T b15_dma_flush_range

2) v4.19 + OpenWrt mtd patches + two .align 5 in cache-v7.S
c010ea6c T v7_flush_kern_dcache_area
c010eac0 t v7_dma_inv_range
c010eb20 t v7_dma_clean_range
c010eb58 T b15_dma_flush_range
(actually 15 symbols above v7_dma_inv_range were moved as well)

This method seems to work (at least it affects the addresses in
System.map).
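
Whether an address from System.map sits on a cache-line boundary can be
checked with simple arithmetic: on ARM, .align 5 asks the assembler for
2^5 = 32-byte alignment, which matches the 32-byte L1 cache line of the
Cortex-A9. A small Python sketch (the helper name line_offset is made up
here), fed with the v7_dma_inv_range addresses from the two listings
above:

```python
CACHE_LINE = 1 << 5  # .align 5 -> 32 bytes


def line_offset(address, line_size=CACHE_LINE):
    """Return how many bytes past a cache-line boundary the address lies."""
    return address % line_size


# v7_dma_inv_range, taken from listings 1) and 2) above.
for label, address in (("1) unpatched", 0xc010ea94),
                       ("2) with .align 5", 0xc010eac0)):
    offset = line_offset(address)
    state = "aligned" if offset == 0 else f"{offset} bytes into a line"
    print(f"{label}: {address:08x} is {state}")
```

This only looks at the entry point of a function. It does not tell how
many cache lines the function body spans.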

*****

I ran 2 tests for each combination of changes. Each test consisted of
10 sequences of: 30 seconds iperf session + reboot.


 > git reset --hard v4.19
 > git am OpenWrt-mtd-chages.patch
Test #1: 738 Mb/s
Test #2: 737 Mb/s

 > git reset --hard v4.19
 > git am OpenWrt-mtd-chages.patch
 > patch -p1 < v7_dma_clean_range-align.diff
Test #1: 746 Mb/s
Test #2: 747 Mb/s

 > git reset --hard v4.19
 > git am OpenWrt-mtd-chages.patch
 > patch -p1 < v7_dma_inv_range-align.diff
Test #1: 745 Mb/s
Test #2: 746 Mb/s

 > git reset --hard v4.19
 > git am OpenWrt-mtd-chages.patch
 > patch -p1 < v7_dma_clean_range-align.diff
 > patch -p1 < v7_dma_inv_range-align.diff
Test #1: 762 Mb/s
Test #2: 761 Mb/s

As you can see, I got quite a nice performance improvement after aligning
both v7_dma_clean_range() and v7_dma_inv_range().

It still wasn't as good as with 9316a9ed6895 cherry-picked but pretty
close.


 > git reset --hard v4.19
 > git am OpenWrt-mtd-chages.patch
 > git cherry-pick -x 9316a9ed6895
Test #1: 770 Mb/s
Test #2: 766 Mb/s

 > git reset --hard v4.19
 > git am OpenWrt-mtd-chages.patch
 > git cherry-pick -x 9316a9ed6895
 > patch -p1 < v7_dma_clean_range-align.diff
Test #1: 756 Mb/s
Test #2: 759 Mb/s

 > git reset --hard v4.19
 > git am OpenWrt-mtd-chages.patch
 > git cherry-pick -x 9316a9ed6895
 > patch -p1 < v7_dma_inv_range-align.diff
Test #1: 758 Mb/s
Test #2: 759 Mb/s

 > git reset --hard v4.19
 > git am OpenWrt-mtd-chages.patch
 > git cherry-pick -x 9316a9ed6895
 > patch -p1 < v7_dma_clean_range-align.diff
 > patch -p1 < v7_dma_inv_range-align.diff
Test #1: 767 Mb/s
Test #2: 763 Mb/s

Now you can see how unpredictable it is. If I cherry-pick 9316a9ed6895
and do an extra alignment of v7_dma_clean_range() and v7_dma_inv_range()
that extra alignment can actually *hurt* NAT performance.

My guess is that:
1) 9316a9ed6895 provides alignment of some very important function(s)
2) DMA alignments on top ^^ provide some gain but also break some other
   alignment
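
The measurements above can be put side by side by averaging the two
tests of each combination and expressing the result relative to the
unpatched baseline. This is only a sketch for reading the numbers; the
labels and the summarize helper are made up here, and the values are
copied from the test runs above.

```python
from statistics import mean

# Mb/s from the runs above: (Test #1, Test #2).
RESULTS = {
    "baseline": (738, 737),
    "clean align": (746, 747),
    "inv align": (745, 746),
    "both aligns": (762, 761),
    "9316a9ed6895": (770, 766),
    "9316a9ed6895 + clean align": (756, 759),
    "9316a9ed6895 + inv align": (758, 759),
    "9316a9ed6895 + both aligns": (767, 763),
}


def summarize(results, reference="baseline"):
    """Return {label: (mean Mb/s, gain in % over the reference)}."""
    ref = mean(results[reference])
    return {
        label: (mean(runs), 100.0 * (mean(runs) - ref) / ref)
        for label, runs in results.items()
    }


summary = summarize(RESULTS)
for label, (avg, gain) in summary.items():
    print(f"{label:28s} {avg:6.1f} Mb/s  {gain:+5.1f} %")
```

With only two tests per combination, differences of a few Mb/s between
rows should be read with care.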

*****

SUMMARY

It seems that for Linux 4.19 + my .config I can get a very lucky &
optimal alignment of functions by cherry-picking 9316a9ed6895.

I thought of checking functions reported by the "perf" tool with CPU
usage of 2%+.

All of the following functions keep their original address with 9316a9ed6895:
__irqentry_text_end
arch_cpu_idle
l2c210_clean_range
l2c210_inv_range
v7_dma_clean_range
v7_dma_inv_range

The remaining 3 functions got relocated:
-c03e5038 t __netif_receive_skb_core
+c03e50b0 t __netif_receive_skb_core
-c03c8b1c t bcma_host_soc_read32
+c03c8b94 t bcma_host_soc_read32
-c0475620 T fib_table_lookup
+c0475698 T fib_table_lookup

I tried aligning all 3 above functions using:
__attribute__((aligned(32)))
and got 756 Mb/s. It's better but still not ~770 Mb/s.
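
For the three relocated functions it may help to look at both the size
of the shift and where each entry point falls within a 32-byte cache
line before and after the cherry-pick. A minimal Python sketch follows;
the describe helper is a made-up name, the 32-byte line size is the
Cortex-A9 L1 line length, and the addresses are the -/+ pairs listed
above.

```python
CACHE_LINE = 32  # bytes, Cortex-A9 L1 cache line

# (old address, new address) copied from the -/+ lines above.
RELOCATED = {
    "__netif_receive_skb_core": (0xc03e5038, 0xc03e50b0),
    "bcma_host_soc_read32": (0xc03c8b1c, 0xc03c8b94),
    "fib_table_lookup": (0xc0475620, 0xc0475698),
}


def describe(old, new, line_size=CACHE_LINE):
    """Return the shift and the offset within a cache line before/after."""
    return {
        "shift": new - old,
        "old_offset": old % line_size,
        "new_offset": new % line_size,
    }


report = {name: describe(old, new) for name, (old, new) in RELOCATED.items()}
for name, info in report.items():
    print(f"{name}: shift {info['shift']:+#x}, "
          f"line offset {info['old_offset']} -> {info['new_offset']}")
```

All three entry points move by the same 0x78 bytes. The sketch only
covers entry addresses, so it cannot show how the hot parts of each
function body are laid out across cache lines.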

Is there any easy way of identifying which function alignments have
such a big impact on NAT performance? I'd like to get those functions
explicitly aligned using assembler/__attribute__/something.

What I'm also afraid of are false positives. I may end up aligning some
unrelated function that just happens to align other ones, just like
cherry-picking 9316a9ed6895 has side effects without really fixing
anything explicitly.

[-- Attachment #2: v7_dma_clean_range-align.diff --]
[-- Type: text/x-patch, Size: 334 bytes --]

diff --git a/arch/arm/mm/cache-v7.S b/arch/arm/mm/cache-v7.S
index 215df435bfb9..c60046cd34aa 100644
--- a/arch/arm/mm/cache-v7.S
+++ b/arch/arm/mm/cache-v7.S
@@ -373,6 +373,8 @@ v7_dma_inv_range:
 	ret	lr
 ENDPROC(v7_dma_inv_range)
 
+	.align	5
+
 /*
  *	v7_dma_clean_range(start,end)
  *	- start   - virtual start address of region

[-- Attachment #3: v7_dma_inv_range-align.diff --]
[-- Type: text/x-patch, Size: 312 bytes --]

diff --git a/arch/arm/mm/cache-v7.S b/arch/arm/mm/cache-v7.S
index 215df435bfb9..0c3999f219ab 100644
--- a/arch/arm/mm/cache-v7.S
+++ b/arch/arm/mm/cache-v7.S
@@ -340,6 +340,8 @@ ENTRY(v7_flush_kern_dcache_area)
 	ret	lr
 ENDPROC(v7_flush_kern_dcache_area)
 
+	.align	5
+
 /*
  *	v7_dma_inv_range(start,end)
  *
