On 21.05.2019 12:45, Russell King - ARM Linux admin wrote:
> On Tue, May 21, 2019 at 12:28:48PM +0200, Rafał Miłecki wrote:
>> I work on home routers based on Broadcom's Northstar SoCs. Those devices
>> have ARM Cortex-A9 and most of them are dual-core.
>>
>> As for home routers, my main concern is network performance. That CPU
>> isn't powerful enough to handle gigabit traffic so all kind of
>> optimizations do matter. I noticed some unexpected changes in NAT
>> performance when switching between kernels.
>>
>> My hardware is BCM47094 SoC (dual core ARM) with integrated network
>> controller and external BCM53012 switch.
>
> Guessing, I'd say it's to do with the placement of code wrt cachelines.
> You could try aligning some of the cache flushing code to a cache line
> and see what effect that has.

Is System.map a good place to check function code alignment?

With Linux 4.19 + OpenWrt mtd patches I have:
(...)
c010ea94 t v7_dma_inv_range
c010eae0 t v7_dma_clean_range
(...)
c02ca3d0 T blk_mq_update_nr_hw_queues
c02ca69c T blk_mq_alloc_tag_set
c02ca94c T blk_mq_release
c02ca9b4 T blk_mq_free_queue
c02caa88 T blk_mq_update_nr_requests
c02cab50 T blk_mq_unique_tag
(...)

After cherry-picking 9316a9ed6895 ("blk-mq: provide helper for setting
up an SQ queue and tag set"):
(...)
c010ea94 t v7_dma_inv_range
c010eae0 t v7_dma_clean_range
(...)
c02ca3d0 T blk_mq_update_nr_hw_queues
c02ca69c T blk_mq_alloc_tag_set
c02ca94c T blk_mq_init_sq_queue		<-- NEW
c02ca9c0 T blk_mq_release		<-- Different address of this & all below
c02caa28 T blk_mq_free_queue
c02caafc T blk_mq_update_nr_requests
c02cabc4 T blk_mq_unique_tag
(...)

As you can see, blk_mq_init_sq_queue has appeared in the System.map and
it changed the addresses of ~30000 symbols. I can believe some frequently
used symbols got luckily aligned and that improved overall performance.

Interestingly, v7_dma_inv_range() and v7_dma_clean_range() were not
relocated.
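The comparison of the two System.map files above can also be done mechanically. Below is a minimal Python sketch, not part of the original tests. The helper names (parse_system_map, moved_symbols) are made up for illustration, and the sample data is just the blk_mq excerpt quoted above with the "<-- ..." annotations removed.

```python
def parse_system_map(text):
    """Parse 'address type name' lines into {name: address}.

    Lines without exactly three fields (blank lines, "(...)" markers)
    are skipped. Duplicate names, which do occur for static functions
    in a real System.map, keep the last address seen.
    """
    symbols = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue
        address, _kind, name = fields
        symbols[name] = int(address, 16)
    return symbols


def moved_symbols(before, after):
    """Return (name, old, new) for symbols present in both maps at different addresses."""
    common = before.keys() & after.keys()
    moved = [(n, before[n], after[n]) for n in common if before[n] != after[n]]
    return sorted(moved, key=lambda item: item[1])  # sort by old address


# Excerpts copied from the two System.map listings in this mail.
BEFORE = """\
c010ea94 t v7_dma_inv_range
c010eae0 t v7_dma_clean_range
(...)
c02ca3d0 T blk_mq_update_nr_hw_queues
c02ca69c T blk_mq_alloc_tag_set
c02ca94c T blk_mq_release
c02ca9b4 T blk_mq_free_queue
c02caa88 T blk_mq_update_nr_requests
c02cab50 T blk_mq_unique_tag
"""

AFTER = """\
c010ea94 t v7_dma_inv_range
c010eae0 t v7_dma_clean_range
(...)
c02ca3d0 T blk_mq_update_nr_hw_queues
c02ca69c T blk_mq_alloc_tag_set
c02ca94c T blk_mq_init_sq_queue
c02ca9c0 T blk_mq_release
c02caa28 T blk_mq_free_queue
c02caafc T blk_mq_update_nr_requests
c02cabc4 T blk_mq_unique_tag
"""

before = parse_system_map(BEFORE)
after = parse_system_map(AFTER)

print("new symbols:", sorted(after.keys() - before.keys()))
for name, old, new in moved_symbols(before, after):
    print(f"{name}: {old:08x} -> {new:08x} (shift {new - old:#x})")
```

On a full System.map the same two functions should work unchanged by reading the files with open(path).read(), with the caveat about duplicate static symbol names noted in the docstring.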
*****

I followed Russell's suggestion and added .align 5 to cache-v7.S (see
two attached diffs).

1) v4.19 + OpenWrt mtd patches
> egrep -B 1 -A 1 "v7_dma_(inv|clean)_range" System.map
c010ea58 T v7_flush_kern_dcache_area
c010ea94 t v7_dma_inv_range
c010eae0 t v7_dma_clean_range
c010eb18 T b15_dma_flush_range

2) v4.19 + OpenWrt mtd patches + two .align 5 in cache-v7.S
c010ea6c T v7_flush_kern_dcache_area
c010eac0 t v7_dma_inv_range
c010eb20 t v7_dma_clean_range
c010eb58 T b15_dma_flush_range
(actually 15 symbols above v7_dma_inv_range were relocated too)

This method seems to work (at least it affects the addresses in
System.map).

*****

I ran 2 tests for each combination of changes. Each test consisted of
10 sequences of: 30 seconds iperf session + reboot.

> git reset --hard v4.19
> git am OpenWrt-mtd-chages.patch
Test #1: 738 Mb/s
Test #2: 737 Mb/s

> git reset --hard v4.19
> git am OpenWrt-mtd-chages.patch
> patch -p1 < v7_dma_clean_range-align.diff
Test #1: 746 Mb/s
Test #2: 747 Mb/s

> git reset --hard v4.19
> git am OpenWrt-mtd-chages.patch
> patch -p1 < v7_dma_inv_range-align.diff
Test #1: 745 Mb/s
Test #2: 746 Mb/s

> git reset --hard v4.19
> git am OpenWrt-mtd-chages.patch
> patch -p1 < v7_dma_clean_range-align.diff
> patch -p1 < v7_dma_inv_range-align.diff
Test #1: 762 Mb/s
Test #2: 761 Mb/s

As you can see, I got quite a nice performance improvement after
aligning both v7_dma_clean_range() and v7_dma_inv_range(). It still
wasn't as good as with 9316a9ed6895 cherry-picked, but pretty close.
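To double-check that the two .align 5 directives did what was intended, the addresses from the System.map excerpts 1) and 2) above can be tested for 32-byte alignment. This is only a sketch: it assumes that .align 5 on ARM means 2**5 = 32 bytes (the GNU assembler treats the argument as a power of two on ARM), and the helper name is_aligned is made up here.

```python
ALIGN_SHIFT = 5  # ".align 5": assumed to mean 2**5 = 32-byte alignment on ARM


def is_aligned(address, shift=ALIGN_SHIFT):
    """True if the low `shift` bits of the address are all zero."""
    return address & ((1 << shift) - 1) == 0


# Addresses copied from the two System.map excerpts in this mail.
unpatched = {
    "v7_dma_inv_range": 0xC010EA94,
    "v7_dma_clean_range": 0xC010EAE0,
}
with_align = {
    "v7_dma_inv_range": 0xC010EAC0,
    "v7_dma_clean_range": 0xC010EB20,
}

for label, symbols in (("unpatched", unpatched), ("with .align 5", with_align)):
    for name, address in symbols.items():
        state = "aligned" if is_aligned(address) else "NOT aligned"
        print(f"{label:14} {name:20} {address:08x} {state}")
```

By this check both functions sit on a 32-byte boundary in the patched build. In the unpatched build only v7_dma_inv_range is off the boundary; c010eae0 already happens to be a multiple of 32.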

> git reset --hard v4.19
> git am OpenWrt-mtd-chages.patch
> git cherry-pick -x 9316a9ed6895
Test #1: 770 Mb/s
Test #2: 766 Mb/s

> git reset --hard v4.19
> git am OpenWrt-mtd-chages.patch
> git cherry-pick -x 9316a9ed6895
> patch -p1 < v7_dma_clean_range-align.diff
Test #1: 756 Mb/s
Test #2: 759 Mb/s

> git reset --hard v4.19
> git am OpenWrt-mtd-chages.patch
> git cherry-pick -x 9316a9ed6895
> patch -p1 < v7_dma_inv_range-align.diff
Test #1: 758 Mb/s
Test #2: 759 Mb/s

> git reset --hard v4.19
> git am OpenWrt-mtd-chages.patch
> git cherry-pick -x 9316a9ed6895
> patch -p1 < v7_dma_clean_range-align.diff
> patch -p1 < v7_dma_inv_range-align.diff
Test #1: 767 Mb/s
Test #2: 763 Mb/s

Now you can see how unpredictable it is. If I cherry-pick 9316a9ed6895
and then additionally align v7_dma_clean_range() and v7_dma_inv_range(),
that extra alignment can actually *hurt* NAT performance.

My guess is that:
1) 9316a9ed6895 provides alignment of some very important function(s)
2) DMA alignments on top ^^ provide some gain but also break some other
   alignment

*****

SUMMARY

It seems that for Linux 4.19 + my .config I can get a very lucky &
optimal alignment of functions by cherry-picking 9316a9ed6895.

I thought of checking the functions reported by the "perf" tool with CPU
usage of 2%+.

All of the following functions keep their original address with
9316a9ed6895:
__irqentry_text_end
arch_cpu_idle
l2c210_clean_range
l2c210_inv_range
v7_dma_clean_range
v7_dma_inv_range

The remaining 3 functions got relocated:
-c03e5038 t __netif_receive_skb_core
+c03e50b0 t __netif_receive_skb_core
-c03c8b1c t bcma_host_soc_read32
+c03c8b94 t bcma_host_soc_read32
-c0475620 T fib_table_lookup
+c0475698 T fib_table_lookup

I tried aligning all 3 of the above functions using:
__attribute__((aligned(32)))
and got 756 Mb/s. It's better than the 738 Mb/s baseline but still not
~770 Mb/s.

Is there any easy way of identifying which function alignments have
such a big impact on NAT performance?
I'd like to get those functions explicitly aligned using
assembler/__attribute__/something.

What I'm also afraid of are false positives. I may end up aligning some
unrelated function that just happens to align other ones, just like
cherry-picking 9316a9ed6895 has side effects and doesn't really fix
anything explicitly.
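One cheap check on the three relocated functions is to look at where each entry point falls inside a cache line before and after the cherry-pick. The sketch below is Python and not part of the original tests. It assumes a 32-byte cache line (the size that .align 5 and aligned(32) target); the names line_offset and offset_report are made up for illustration.

```python
CACHE_LINE = 32  # bytes; assumed line size, matching ".align 5" / aligned(32)

# (address without 9316a9ed6895, address with it), copied from the
# System.map diff in this mail.
HOT_FUNCTIONS = {
    "__netif_receive_skb_core": (0xC03E5038, 0xC03E50B0),
    "bcma_host_soc_read32": (0xC03C8B1C, 0xC03C8B94),
    "fib_table_lookup": (0xC0475620, 0xC0475698),
}


def line_offset(address, line=CACHE_LINE):
    """Byte offset of an address within its cache line (0 means line-aligned)."""
    return address % line


def offset_report(functions):
    """Map name -> (offset before, offset after, address shift in bytes)."""
    return {
        name: (line_offset(old), line_offset(new), new - old)
        for name, (old, new) in functions.items()
    }


for name, (off_old, off_new, shift) in offset_report(HOT_FUNCTIONS).items():
    print(f"{name:26} offset {off_old:2} -> {off_new:2} (moved by {shift} bytes)")
```

For these addresses all three functions move by the same 120 bytes, and none of them starts on a 32-byte boundary after the cherry-pick (fib_table_lookup actually starts on one only before it). That hints, without proving anything, that the entry-point alignment of these three may not be what the cherry-pick improved; the effect could depend on where code inside the functions falls, or on other functions entirely.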