* Optimizing kernel compilation / alignments for network performance
@ 2022-04-27 12:04 Rafał Miłecki
  2022-04-27 12:56 ` Alexander Lobakin
  0 siblings, 1 reply; 22+ messages in thread
From: Rafał Miłecki @ 2022-04-27 12:04 UTC (permalink / raw)
To: Network Development, linux-arm-kernel, Russell King, Andrew Lunn,
    Felix Fietkau
Cc: openwrt-devel, Florian Fainelli

Hi,

I noticed years ago that kernel changes touching code - that I don't use
at all - can affect network performance for me.

I work with home routers based on the Broadcom Northstar platform. Those
are SoCs with not-so-powerful 2 x ARM Cortex-A9 CPU cores. The main task
of those devices is NAT masquerade and that is what I test with iperf
running on two x86 machines.

***

Example of such an unused-code change:
ce5013ff3bec ("mtd: spi-nor: Add support for XM25QH64A and XM25QH128A").
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ce5013ff3bec05cf2a8a05c75fcd520d9914d92b
It lowered my NAT speed from 381 Mb/s to 367 Mb/s (-3.5%).

I first reported that issue in the e-mail thread:
ARM router NAT performance affected by random/unrelated commits
https://lkml.org/lkml/2019/5/21/349
https://www.spinics.net/lists/linux-block/msg40624.html

Back then it was commit 5b0890a97204 ("flow_dissector: Parse batman-adv
unicast headers")
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9316a9ed6895c4ad2f0cde171d486f80c55d8283
that increased my NAT speed from 741 Mb/s to 773 Mb/s (+4.3%).

***

It appears Northstar CPUs have small caches, so any change in the
location of kernel symbols can affect NAT performance. That explains why
changing unrelated code affects anything, and it has been partially
confirmed by aligning some of the cache-v7.S code.

My question is: is there a way to find out & force optimal symbol
locations?

Adding .align 5 to the cache-v7.S code is a partial success.
I'd like to find out what other functions are worth optimizing
(aligning) and force that (I guess __attribute__((aligned(32))) could
be used).

I can't really draw any conclusions from comparing System.map before and
after the above commits, as they relocate thousands of symbols in one go.

Optimizing is pretty important for me for two reasons:
1. I want to reach maximum possible NAT masquerade performance
2. I need stable performance across random commits to detect regressions

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: Optimizing kernel compilation / alignments for network performance 2022-04-27 12:04 Optimizing kernel compilation / alignments for network performance Rafał Miłecki @ 2022-04-27 12:56 ` Alexander Lobakin 2022-04-27 17:31 ` Rafał Miłecki 0 siblings, 1 reply; 22+ messages in thread From: Alexander Lobakin @ 2022-04-27 12:56 UTC (permalink / raw) To: Rafał Miłecki Cc: Alexander Lobakin, Network Development, linux-arm-kernel, Russell King, Andrew Lunn, Felix Fietkau, openwrt-devel, Florian Fainelli From: Rafał Miłecki <zajec5@gmail.com> Date: Wed, 27 Apr 2022 14:04:54 +0200 > Hi, Hej, > > I noticed years ago that kernel changes touching code - that I don't use > at all - can affect network performance for me. > > I work with home routers based on Broadcom Northstar platform. Those > are SoCs with not-so-powerful 2 x ARM Cortex-A9 CPU cores. Main task of > those devices is NAT masquerade and that is what I test with iperf > running on two x86 machines. > > *** > > Example of such unused code change: > ce5013ff3bec ("mtd: spi-nor: Add support for XM25QH64A and XM25QH128A"). > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ce5013ff3bec05cf2a8a05c75fcd520d9914d92b > It lowered my NAT speed from 381 Mb/s to 367 Mb/s (-3,5%). > > I first reported that issue it in the e-mail thread: > ARM router NAT performance affected by random/unrelated commits > https://lkml.org/lkml/2019/5/21/349 > https://www.spinics.net/lists/linux-block/msg40624.html > > Back then it was commit 5b0890a97204 ("flow_dissector: Parse batman-adv > unicast headers") > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9316a9ed6895c4ad2f0cde171d486f80c55d8283 > that increased my NAT speed from 741 Mb/s to 773 Mb/s (+4,3%). > > *** > > It appears Northstar CPUs have little cache size and so any change in > location of kernel symbols can affect NAT performance. 
> That explains why
> changing unrelated code affects anything & it has been partially proven
> aligning some of cache-v7.S code.
>
> My question is: is there a way to find out & force an optimal symbols
> locations?

Take a look at CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B[0]. I've been
fighting with the same issue on some Realtek MIPS boards: random
code changes in random kernel core parts were affecting NAT /
network performance. This option resolved it, I'd say, for the cost
of a slightly increased vmlinux size (almost no change in vmlinuz
size).
The only thing is that it was recently restricted to a set of
architectures, and MIPS and ARM32 are not included now lol. So it's
either a matter of expanding the list (since it was restricted only
because `-falign-functions=` is not supported on some architectures)
or you can just do:

	make KCFLAGS=-falign-functions=64 # replace 64 with your I-cache line size

The actual alignment is something to play with; I stopped at the
cacheline size, 32 in my case.
Also, this does not provide any guarantees that you won't suffer
from random data cacheline changes. There were some initiatives to
introduce debug alignment of data as well, but since functions are
often bigger than 32 bytes, while variables are usually much smaller,
it was increasing the vmlinux size by a ton (imagine each u32 variable
occupying 32-64 bytes instead of 4). But the chance of catching this
is much lower than suffering from I-cache function misplacement.

>
> Adding .align 5 to the cache-v7.S is a partial success. I'd like to find
> out what other functions are worth optimizing (aligning) and force that
> (I guess __attribute__((aligned(32))) could be used).
>
> I can't really draw any conclusions from comparing System.map before and
> after above commits as they relocate thousands of symbols in one go.
>
> Optimizing is pretty important for me for two reasons:
> 1. I want to reach maximum possible NAT masquerade performance
> 2.
> I need stable performance across random commits to detect regressions

[0] https://elixir.bootlin.com/linux/v5.18-rc4/K/ident/CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B

Thanks,
Al

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: Optimizing kernel compilation / alignments for network performance 2022-04-27 12:56 ` Alexander Lobakin @ 2022-04-27 17:31 ` Rafał Miłecki 2022-04-29 14:18 ` Rafał Miłecki 2022-04-29 14:49 ` Arnd Bergmann 0 siblings, 2 replies; 22+ messages in thread From: Rafał Miłecki @ 2022-04-27 17:31 UTC (permalink / raw) To: Alexander Lobakin Cc: Network Development, linux-arm-kernel, Russell King, Andrew Lunn, Felix Fietkau, openwrt-devel, Florian Fainelli On 27.04.2022 14:56, Alexander Lobakin wrote: > From: Rafał Miłecki <zajec5@gmail.com> > Date: Wed, 27 Apr 2022 14:04:54 +0200 > >> I noticed years ago that kernel changes touching code - that I don't use >> at all - can affect network performance for me. >> >> I work with home routers based on Broadcom Northstar platform. Those >> are SoCs with not-so-powerful 2 x ARM Cortex-A9 CPU cores. Main task of >> those devices is NAT masquerade and that is what I test with iperf >> running on two x86 machines. >> >> *** >> >> Example of such unused code change: >> ce5013ff3bec ("mtd: spi-nor: Add support for XM25QH64A and XM25QH128A"). >> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ce5013ff3bec05cf2a8a05c75fcd520d9914d92b >> It lowered my NAT speed from 381 Mb/s to 367 Mb/s (-3,5%). >> >> I first reported that issue it in the e-mail thread: >> ARM router NAT performance affected by random/unrelated commits >> https://lkml.org/lkml/2019/5/21/349 >> https://www.spinics.net/lists/linux-block/msg40624.html >> >> Back then it was commit 5b0890a97204 ("flow_dissector: Parse batman-adv >> unicast headers") >> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9316a9ed6895c4ad2f0cde171d486f80c55d8283 >> that increased my NAT speed from 741 Mb/s to 773 Mb/s (+4,3%). >> >> *** >> >> It appears Northstar CPUs have little cache size and so any change in >> location of kernel symbols can affect NAT performance. 
That explains why >> changing unrelated code affects anything & it has been partially proven >> aligning some of cache-v7.S code. >> >> My question is: is there a way to find out & force an optimal symbols >> locations? > > Take a look at CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B[0]. I've been > fighting with the same issue on some Realtek MIPS boards: random > code changes in random kernel core parts were affecting NAT / > network performance. This option resolved this I'd say, for the cost > of slightly increased vmlinux size (almost no change in vmlinuz > size). > The only thing is that it was recently restricted to a set of > architectures and MIPS and ARM32 are not included now lol. So it's > either a matter of expanding the list (since it was restricted only > because `-falign-functions=` is not supported on some architectures) > or you can just do: > > make KCFLAGS=-falign-functions=64 # replace 64 with your I-cache size > > The actual alignment is something to play with, I stopped on the > cacheline size, 32 in my case. > Also, this does not provide any guarantees that you won't suffer > from random data cacheline changes. There were some initiatives to > introduce debug alignment of data as well, but since function are > often bigger than 32, while variables are usually much smaller, it > was increasing the vmlinux size by a ton (imagine each u32 variable > occupying 32-64 bytes instead of 4). But the chance of catching this > is much lower than to suffer from I-cache function misplacement. Thank you Alexander, this appears to be helpful! I decided to ignore CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B for now and just adjust CFLAGS manually. 1. Without ce5013ff3bec and with -falign-functions=32 387 Mb/s 2. Without ce5013ff3bec and with -falign-functions=64 377 Mb/s 3. With ce5013ff3bec and with -falign-functions=32 384 Mb/s 4. With ce5013ff3bec and with -falign-functions=64 377 Mb/s So it seems that: 1. -falign-functions=32 = pretty stable high speed 2. 
-falign-functions=64 = very stable slightly lower speed I'm going to perform tests on more commits but if it stays so reliable as above that will be a huge success for me. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Optimizing kernel compilation / alignments for network performance 2022-04-27 17:31 ` Rafał Miłecki @ 2022-04-29 14:18 ` Rafał Miłecki 2022-04-29 14:49 ` Arnd Bergmann 1 sibling, 0 replies; 22+ messages in thread From: Rafał Miłecki @ 2022-04-29 14:18 UTC (permalink / raw) To: Alexander Lobakin Cc: Network Development, linux-arm-kernel, Russell King, Andrew Lunn, Felix Fietkau, openwrt-devel, Florian Fainelli On 27.04.2022 19:31, Rafał Miłecki wrote: > On 27.04.2022 14:56, Alexander Lobakin wrote: >> From: Rafał Miłecki <zajec5@gmail.com> >> Date: Wed, 27 Apr 2022 14:04:54 +0200 >> >>> I noticed years ago that kernel changes touching code - that I don't use >>> at all - can affect network performance for me. >>> >>> I work with home routers based on Broadcom Northstar platform. Those >>> are SoCs with not-so-powerful 2 x ARM Cortex-A9 CPU cores. Main task of >>> those devices is NAT masquerade and that is what I test with iperf >>> running on two x86 machines. >>> >>> *** >>> >>> Example of such unused code change: >>> ce5013ff3bec ("mtd: spi-nor: Add support for XM25QH64A and XM25QH128A"). >>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ce5013ff3bec05cf2a8a05c75fcd520d9914d92b >>> It lowered my NAT speed from 381 Mb/s to 367 Mb/s (-3,5%). >>> >>> I first reported that issue it in the e-mail thread: >>> ARM router NAT performance affected by random/unrelated commits >>> https://lkml.org/lkml/2019/5/21/349 >>> https://www.spinics.net/lists/linux-block/msg40624.html >>> >>> Back then it was commit 5b0890a97204 ("flow_dissector: Parse batman-adv >>> unicast headers") >>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9316a9ed6895c4ad2f0cde171d486f80c55d8283 >>> that increased my NAT speed from 741 Mb/s to 773 Mb/s (+4,3%). >>> >>> *** >>> >>> It appears Northstar CPUs have little cache size and so any change in >>> location of kernel symbols can affect NAT performance. 
That explains why >>> changing unrelated code affects anything & it has been partially proven >>> aligning some of cache-v7.S code. >>> >>> My question is: is there a way to find out & force an optimal symbols >>> locations? >> >> Take a look at CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B[0]. I've been >> fighting with the same issue on some Realtek MIPS boards: random >> code changes in random kernel core parts were affecting NAT / >> network performance. This option resolved this I'd say, for the cost >> of slightly increased vmlinux size (almost no change in vmlinuz >> size). >> The only thing is that it was recently restricted to a set of >> architectures and MIPS and ARM32 are not included now lol. So it's >> either a matter of expanding the list (since it was restricted only >> because `-falign-functions=` is not supported on some architectures) >> or you can just do: >> >> make KCFLAGS=-falign-functions=64 # replace 64 with your I-cache size >> >> The actual alignment is something to play with, I stopped on the >> cacheline size, 32 in my case. >> Also, this does not provide any guarantees that you won't suffer >> from random data cacheline changes. There were some initiatives to >> introduce debug alignment of data as well, but since function are >> often bigger than 32, while variables are usually much smaller, it >> was increasing the vmlinux size by a ton (imagine each u32 variable >> occupying 32-64 bytes instead of 4). But the chance of catching this >> is much lower than to suffer from I-cache function misplacement. > > Thank you Alexander, this appears to be helpful! I decided to ignore > CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B for now and just adjust CFLAGS > manually. > > > 1. Without ce5013ff3bec and with -falign-functions=32 > 387 Mb/s > > 2. Without ce5013ff3bec and with -falign-functions=64 > 377 Mb/s > > 3. With ce5013ff3bec and with -falign-functions=32 > 384 Mb/s > > 4. 
> With ce5013ff3bec and with -falign-functions=64
> 377 Mb/s
>
> So it seems that:
> 1. -falign-functions=32 = pretty stable high speed
> 2. -falign-functions=64 = very stable slightly lower speed
>
> I'm going to perform tests on more commits but if it stays so reliable
> as above that will be a huge success for me.

So sadly that doesn't work all the time. Or maybe it just works randomly.

I tried multiple commits with both -falign-functions=32 and
-falign-functions=64. I still get speed variations of about 30 Mb/s in
total. From commit to commit it's usually about 3%, but skipping a few
commits can result in up to 30 Mb/s of difference (almost 10%).

Similarly to code changes, performance also gets affected by enabling /
disabling kernel config options. I noticed that enabling
CONFIG_CRYPTO_PCRYPT may decrease *or* increase speed depending on
-falign-functions (and surely depending on the kernel commit too).

┌──────────────────────┬───────────┬──────────┬───────┐
│                      │ no PCRYPT │ PCRYPT=y │ diff  │
├──────────────────────┼───────────┼──────────┼───────┤
│ No -falign-functions │ 363 Mb/s  │ 370 Mb/s │ +2%   │
│ -falign-functions=32 │ 364 Mb/s  │ 370 Mb/s │ +1.7% │
│ -falign-functions=64 │ 372 Mb/s  │ 365 Mb/s │ -2%   │
└──────────────────────┴───────────┴──────────┴───────┘

So I still don't have a reliable way of testing kernel changes for
speed regressions :(

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: Optimizing kernel compilation / alignments for network performance
  2022-04-27 17:31 ` Rafał Miłecki
  2022-04-29 14:18   ` Rafał Miłecki
@ 2022-04-29 14:49   ` Arnd Bergmann
  2022-05-05 15:42     ` Rafał Miłecki
  1 sibling, 1 reply; 22+ messages in thread
From: Arnd Bergmann @ 2022-04-29 14:49 UTC (permalink / raw)
To: Rafał Miłecki
Cc: Alexander Lobakin, Network Development, linux-arm-kernel,
    Russell King, Andrew Lunn, Felix Fietkau, openwrt-devel,
    Florian Fainelli

On Wed, Apr 27, 2022 at 7:31 PM Rafał Miłecki <zajec5@gmail.com> wrote:
> On 27.04.2022 14:56, Alexander Lobakin wrote:

> Thank you Alexander, this appears to be helpful! I decided to ignore
> CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B for now and just adjust CFLAGS
> manually.
>
> 1. Without ce5013ff3bec and with -falign-functions=32
> 387 Mb/s
>
> 2. Without ce5013ff3bec and with -falign-functions=64
> 377 Mb/s
>
> 3. With ce5013ff3bec and with -falign-functions=32
> 384 Mb/s
>
> 4. With ce5013ff3bec and with -falign-functions=64
> 377 Mb/s
>
> So it seems that:
> 1. -falign-functions=32 = pretty stable high speed
> 2. -falign-functions=64 = very stable slightly lower speed
>
> I'm going to perform tests on more commits but if it stays so reliable
> as above that will be a huge success for me.

Note that the problem may not just be the alignment of a particular
function, but also how different functions map into your cache.
The Cortex-A9 has a 4-way set-associative L1 cache of 16KB, 32KB or
64KB, with a line size of 32 bytes. If you are unlucky and you get
five different frequently called functions spaced at exactly the wrong
stride, so that they need more than four ways of the same set, calling
them in sequence would always evict the other ones. The same could of
course happen if the problem is the D-cache or the L2.

Can you try to get a profile using 'perf record' to see where most
time is spent, in both the slowest and the fastest versions?
If the instruction cache is the issue, you should see how the hottest addresses line up. Arnd ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Optimizing kernel compilation / alignments for network performance 2022-04-29 14:49 ` Arnd Bergmann @ 2022-05-05 15:42 ` Rafał Miłecki 2022-05-05 16:04 ` Andrew Lunn 0 siblings, 1 reply; 22+ messages in thread From: Rafał Miłecki @ 2022-05-05 15:42 UTC (permalink / raw) To: Arnd Bergmann Cc: Alexander Lobakin, Network Development, linux-arm-kernel, Russell King, Andrew Lunn, Felix Fietkau, openwrt-devel, Florian Fainelli On 29.04.2022 16:49, Arnd Bergmann wrote: > On Wed, Apr 27, 2022 at 7:31 PM Rafał Miłecki <zajec5@gmail.com> wrote: >> On 27.04.2022 14:56, Alexander Lobakin wrote: > >> Thank you Alexander, this appears to be helpful! I decided to ignore >> CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B for now and just adjust CFLAGS >> manually. >> >> >> 1. Without ce5013ff3bec and with -falign-functions=32 >> 387 Mb/s >> >> 2. Without ce5013ff3bec and with -falign-functions=64 >> 377 Mb/s >> >> 3. With ce5013ff3bec and with -falign-functions=32 >> 384 Mb/s >> >> 4. With ce5013ff3bec and with -falign-functions=64 >> 377 Mb/s >> >> >> So it seems that: >> 1. -falign-functions=32 = pretty stable high speed >> 2. -falign-functions=64 = very stable slightly lower speed >> >> >> I'm going to perform tests on more commits but if it stays so reliable >> as above that will be a huge success for me. > > Note that the problem may not just be the alignment of a particular > function, but also how different function map into your cache. > The Cortex-A9 has a 4-way set-associative L1 cache of 16KB, 32KB or > 64KB, with a line size of 32 bytes. If you are unlucky and you get > five different functions that are frequently called and are a multiple > functions are exactly the wrong spacing that they need more than > four ways, calling them in sequence would always evict the other > ones. The same could of course happen if the problem is the D-cache > or the L2. 
>
> Can you try to get a profile using 'perf record' to see where most
> time is spent, in both the slowest and the fastest versions?
> If the instruction cache is the issue, you should see how the hottest
> addresses line up.

Your explanation sounds sane, of course. If you take a look at my old
e-mail:

ARM router NAT performance affected by random/unrelated commits
https://lkml.org/lkml/2019/5/21/349
https://www.spinics.net/lists/linux-block/msg40624.html

you'll see that the most used functions are:
v7_dma_inv_range
__irqentry_text_end
l2c210_inv_range
v7_dma_clean_range
bcma_host_soc_read32
__netif_receive_skb_core
arch_cpu_idle
l2c210_clean_range
fib_table_lookup

Is there a way to optimize the kernel for optimal cache usage of the
selected (above) functions?

Meanwhile I was testing -fno-reorder-blocks, which some OpenWrt folks
reported as worth trying. It's another source of randomness: it
stabilizes NAT performance across some commits and breaks stability
across others.

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: Optimizing kernel compilation / alignments for network performance 2022-05-05 15:42 ` Rafał Miłecki @ 2022-05-05 16:04 ` Andrew Lunn 2022-05-05 16:46 ` Felix Fietkau 2022-05-06 7:44 ` Rafał Miłecki 0 siblings, 2 replies; 22+ messages in thread From: Andrew Lunn @ 2022-05-05 16:04 UTC (permalink / raw) To: Rafał Miłecki Cc: Arnd Bergmann, Alexander Lobakin, Network Development, linux-arm-kernel, Russell King, Felix Fietkau, openwrt-devel, Florian Fainelli > you'll see that most used functions are: > v7_dma_inv_range > __irqentry_text_end > l2c210_inv_range > v7_dma_clean_range > bcma_host_soc_read32 > __netif_receive_skb_core > arch_cpu_idle > l2c210_clean_range > fib_table_lookup There is a lot of cache management functions here. Might sound odd, but have you tried disabling SMP? These cache functions need to operate across all CPUs, and the communication between CPUs can slow them down. If there is only one CPU, these cache functions get simpler and faster. It just depends on your workload. If you have 1 CPU loaded to 100% and the other 3 idle, you might see an improvement. If you actually need more than one CPU, it will probably be worse. I've also found that some Ethernet drivers invalidate or flush too much. If you are sending a 64 byte TCP ACK, all you need to flush is 64 bytes, not the full 1500 MTU. If you receive a TCP ACK, and then recycle the buffer, all you need to invalidate is the size of the ACK, so long as you can guarantee nothing has touched the memory above it. But you need to be careful when implementing tricks like this, or you can get subtle corruption bugs when you get it wrong. Andrew ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Optimizing kernel compilation / alignments for network performance 2022-05-05 16:04 ` Andrew Lunn @ 2022-05-05 16:46 ` Felix Fietkau 2022-05-06 7:47 ` Rafał Miłecki 2022-05-06 7:44 ` Rafał Miłecki 1 sibling, 1 reply; 22+ messages in thread From: Felix Fietkau @ 2022-05-05 16:46 UTC (permalink / raw) To: Andrew Lunn, Rafał Miłecki Cc: Arnd Bergmann, Alexander Lobakin, Network Development, linux-arm-kernel, Russell King, openwrt-devel, Florian Fainelli On 05.05.22 18:04, Andrew Lunn wrote: >> you'll see that most used functions are: >> v7_dma_inv_range >> __irqentry_text_end >> l2c210_inv_range >> v7_dma_clean_range >> bcma_host_soc_read32 >> __netif_receive_skb_core >> arch_cpu_idle >> l2c210_clean_range >> fib_table_lookup > > There is a lot of cache management functions here. Might sound odd, > but have you tried disabling SMP? These cache functions need to > operate across all CPUs, and the communication between CPUs can slow > them down. If there is only one CPU, these cache functions get simpler > and faster. > > It just depends on your workload. If you have 1 CPU loaded to 100% and > the other 3 idle, you might see an improvement. If you actually need > more than one CPU, it will probably be worse. > > I've also found that some Ethernet drivers invalidate or flush too > much. If you are sending a 64 byte TCP ACK, all you need to flush is > 64 bytes, not the full 1500 MTU. If you receive a TCP ACK, and then > recycle the buffer, all you need to invalidate is the size of the ACK, > so long as you can guarantee nothing has touched the memory above it. > But you need to be careful when implementing tricks like this, or you > can get subtle corruption bugs when you get it wrong. I just took a quick look at the driver. It allocates and maps rx buffers that can cover a packet size of BGMAC_RX_MAX_FRAME_SIZE = 9724. This seems rather excessive, especially since most people are going to use a MTU of 1500. 
My proposal would be to add support for making rx buffer size dependent on MTU, reallocating the ring on MTU changes. This should significantly reduce the time spent on flushing caches. - Felix ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Optimizing kernel compilation / alignments for network performance 2022-05-05 16:46 ` Felix Fietkau @ 2022-05-06 7:47 ` Rafał Miłecki 2022-05-06 12:42 ` Andrew Lunn 0 siblings, 1 reply; 22+ messages in thread From: Rafał Miłecki @ 2022-05-06 7:47 UTC (permalink / raw) To: Felix Fietkau, Andrew Lunn Cc: Arnd Bergmann, Alexander Lobakin, Network Development, linux-arm-kernel, Russell King, openwrt-devel, Florian Fainelli On 5.05.2022 18:46, Felix Fietkau wrote: > > On 05.05.22 18:04, Andrew Lunn wrote: >>> you'll see that most used functions are: >>> v7_dma_inv_range >>> __irqentry_text_end >>> l2c210_inv_range >>> v7_dma_clean_range >>> bcma_host_soc_read32 >>> __netif_receive_skb_core >>> arch_cpu_idle >>> l2c210_clean_range >>> fib_table_lookup >> >> There is a lot of cache management functions here. Might sound odd, >> but have you tried disabling SMP? These cache functions need to >> operate across all CPUs, and the communication between CPUs can slow >> them down. If there is only one CPU, these cache functions get simpler >> and faster. >> >> It just depends on your workload. If you have 1 CPU loaded to 100% and >> the other 3 idle, you might see an improvement. If you actually need >> more than one CPU, it will probably be worse. >> >> I've also found that some Ethernet drivers invalidate or flush too >> much. If you are sending a 64 byte TCP ACK, all you need to flush is >> 64 bytes, not the full 1500 MTU. If you receive a TCP ACK, and then >> recycle the buffer, all you need to invalidate is the size of the ACK, >> so long as you can guarantee nothing has touched the memory above it. >> But you need to be careful when implementing tricks like this, or you >> can get subtle corruption bugs when you get it wrong. > I just took a quick look at the driver. It allocates and maps rx buffers that can cover a packet size of BGMAC_RX_MAX_FRAME_SIZE = 9724. > This seems rather excessive, especially since most people are going to use a MTU of 1500. 
> My proposal would be to add support for making rx buffer size dependent on MTU, reallocating the ring on MTU changes. > This should significantly reduce the time spent on flushing caches. Oh, that's important too, it was changed by commit 8c7da63978f1 ("bgmac: configure MTU and add support for frames beyond 8192 byte size"): https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8c7da63978f1672eb4037bbca6e7eac73f908f03 It lowered NAT speed with bgmac by 60% (362 Mbps → 140 Mbps). I do all my testing with #define BGMAC_RX_MAX_FRAME_SIZE 1536 ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Optimizing kernel compilation / alignments for network performance
  2022-05-06  7:47 ` Rafał Miłecki
@ 2022-05-06 12:42   ` Andrew Lunn
  2022-05-10 10:29     ` Rafał Miłecki
  0 siblings, 1 reply; 22+ messages in thread
From: Andrew Lunn @ 2022-05-06 12:42 UTC (permalink / raw)
To: Rafał Miłecki
Cc: Felix Fietkau, Arnd Bergmann, Alexander Lobakin,
    Network Development, linux-arm-kernel, Russell King,
    openwrt-devel, Florian Fainelli

> > I just took a quick look at the driver. It allocates and maps rx
> > buffers that can cover a packet size of BGMAC_RX_MAX_FRAME_SIZE = 9724.
> > This seems rather excessive, especially since most people are going
> > to use a MTU of 1500.
> > My proposal would be to add support for making rx buffer size
> > dependent on MTU, reallocating the ring on MTU changes.
> > This should significantly reduce the time spent on flushing caches.
>
> Oh, that's important too, it was changed by commit 8c7da63978f1 ("bgmac:
> configure MTU and add support for frames beyond 8192 byte size"):
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8c7da63978f1672eb4037bbca6e7eac73f908f03
>
> It lowered NAT speed with bgmac by 60% (362 Mbps → 140 Mbps).
>
> I do all my testing with
> #define BGMAC_RX_MAX_FRAME_SIZE 1536

That helps show that cache operations are part of your bottleneck.

Taking a quick look at the driver, on the receive side:

	/* Unmap buffer to make it accessible to the CPU */
	dma_unmap_single(dma_dev, dma_addr,
			 BGMAC_RX_BUF_SIZE, DMA_FROM_DEVICE);

Here the data is mapped, ready for the CPU to use it.
	/* Get info from the header */
	len = le16_to_cpu(rx->len);
	flags = le16_to_cpu(rx->flags);

	/* Check for poison and drop or pass the packet */
	if (len == 0xdead && flags == 0xbeef) {
		netdev_err(bgmac->net_dev, "Found poisoned packet at slot %d, DMA issue!\n",
			   ring->start);
		put_page(virt_to_head_page(buf));
		bgmac->net_dev->stats.rx_errors++;
		break;
	}

	if (len > BGMAC_RX_ALLOC_SIZE) {
		netdev_err(bgmac->net_dev, "Found oversized packet at slot %d, DMA issue!\n",
			   ring->start);
		put_page(virt_to_head_page(buf));
		bgmac->net_dev->stats.rx_length_errors++;
		bgmac->net_dev->stats.rx_errors++;
		break;
	}

	/* Omit CRC. */
	len -= ETH_FCS_LEN;

	skb = build_skb(buf, BGMAC_RX_ALLOC_SIZE);
	if (unlikely(!skb)) {
		netdev_err(bgmac->net_dev, "build_skb failed\n");
		put_page(virt_to_head_page(buf));
		bgmac->net_dev->stats.rx_errors++;
		break;
	}
	skb_put(skb, BGMAC_RX_FRAME_OFFSET +
		BGMAC_RX_BUF_OFFSET + len);
	skb_pull(skb, BGMAC_RX_FRAME_OFFSET +
		 BGMAC_RX_BUF_OFFSET);

	skb_checksum_none_assert(skb);
	skb->protocol = eth_type_trans(skb, bgmac->net_dev);

and this is the first access of the actual data. You can make the
cache actually work for you, rather than against you, by adding a call
to

	prefetch(buf);

just after the dma_unmap_single(). That will start getting the frame
header from DRAM into cache, so hopefully it is available by the time
eth_type_trans() is called and you don't have a cache miss.

	Andrew

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: Optimizing kernel compilation / alignments for network performance 2022-05-06 12:42 ` Andrew Lunn @ 2022-05-10 10:29 ` Rafał Miłecki 2022-05-10 14:09 ` Dave Taht 0 siblings, 1 reply; 22+ messages in thread From: Rafał Miłecki @ 2022-05-10 10:29 UTC (permalink / raw) To: Andrew Lunn Cc: Felix Fietkau, Arnd Bergmann, Alexander Lobakin, Network Development, linux-arm-kernel, Russell King, openwrt-devel, Florian Fainelli On 6.05.2022 14:42, Andrew Lunn wrote: >>> I just took a quick look at the driver. It allocates and maps rx buffers that can cover a packet size of BGMAC_RX_MAX_FRAME_SIZE = 9724. >>> This seems rather excessive, especially since most people are going to use a MTU of 1500. >>> My proposal would be to add support for making rx buffer size dependent on MTU, reallocating the ring on MTU changes. >>> This should significantly reduce the time spent on flushing caches. >> >> Oh, that's important too, it was changed by commit 8c7da63978f1 ("bgmac: >> configure MTU and add support for frames beyond 8192 byte size"): >> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8c7da63978f1672eb4037bbca6e7eac73f908f03 >> >> It lowered NAT speed with bgmac by 60% (362 Mbps → 140 Mbps). >> >> I do all my testing with >> #define BGMAC_RX_MAX_FRAME_SIZE 1536 > > That helps show that cache operations are part of your bottleneck. > > Taking a quick look at the driver. On the receive side: > > /* Unmap buffer to make it accessible to the CPU */ > dma_unmap_single(dma_dev, dma_addr, > BGMAC_RX_BUF_SIZE, DMA_FROM_DEVICE); > > Here is data is mapped read for the CPU to use it. 
> [snip: quoted bgmac rx path listing]
>
> and this is the first access of the actual data. You can make the
> cache actually work for you, rather than against you, by adding a call to
>
> 	prefetch(buf);
>
> just after the dma_unmap_single(). That will start getting the frame
> header from DRAM into cache, so hopefully it is available by the time
> eth_type_trans() is called and you don't have a cache miss.

I don't think that analysis is correct.

Please take a look at the following lines:
	struct bgmac_rx_header *rx = slot->buf + BGMAC_RX_BUF_OFFSET;
	void *buf = slot->buf;

The first thing we do after the dma_unmap_single() call is read rx->len,
which points straight at the DMA data. There is nothing to keep the CPU
busy with while prefetching the data.

FWIW I tried adding prefetch(buf); anyway. It didn't change NAT speed by
a single Mb/s. Speed was exactly the same as without the prefetch() call.
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Optimizing kernel compilation / alignments for network performance 2022-05-10 10:29 ` Rafał Miłecki @ 2022-05-10 14:09 ` Dave Taht 2022-05-10 19:15 ` Dave Taht 0 siblings, 1 reply; 22+ messages in thread From: Dave Taht @ 2022-05-10 14:09 UTC (permalink / raw) To: Rafał Miłecki Cc: Andrew Lunn, Felix Fietkau, Arnd Bergmann, Alexander Lobakin, Network Development, linux-arm-kernel, Russell King, openwrt-devel, Florian Fainelli I might have mentioned this before. but I'm really big on using the flent tool to drive test runs. The comparison plots are to die for, and it can also sample cpu and other statistics over time. Also I'm big on testing bidirectional functionality. client$ flent -H server -t what_test_conditions_you_have --step-size=.05 --te=upload_streams=4 -x --socket-stats tcp_nup Gathers a lot of data about everything. The rrul test is one of my favorites for creating a bittorrent like load. flent is usually available in apt/rpm/etc. there are scripts that can run on routers, openwrt has opkg install flent-tools, you use ssh to fire these off. there are a few python dependencies for the flent-gui, that aren't needed for the flent server or client sometimes you have to install and compile netperf on your own with ./configure --enable-demo Please see flent.org for more details, and/or hit the flent-users list for questions. On Tue, May 10, 2022 at 5:03 AM Rafał Miłecki <zajec5@gmail.com> wrote: > > On 6.05.2022 14:42, Andrew Lunn wrote: > >>> I just took a quick look at the driver. It allocates and maps rx buffers that can cover a packet size of BGMAC_RX_MAX_FRAME_SIZE = 9724. > >>> This seems rather excessive, especially since most people are going to use a MTU of 1500. > >>> My proposal would be to add support for making rx buffer size dependent on MTU, reallocating the ring on MTU changes. > >>> This should significantly reduce the time spent on flushing caches. 
> [snip: full quote of Rafał's reply, unchanged from above]

--
FQ World Domination pending: https://blog.cerowrt.org/post/state_of_fq_codel/
Dave Täht CEO, TekLibre, LLC

^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Optimizing kernel compilation / alignments for network performance 2022-05-10 14:09 ` Dave Taht @ 2022-05-10 19:15 ` Dave Taht 0 siblings, 0 replies; 22+ messages in thread From: Dave Taht @ 2022-05-10 19:15 UTC (permalink / raw) To: Rafał Miłecki Cc: Andrew Lunn, Felix Fietkau, Arnd Bergmann, Alexander Lobakin, Network Development, linux-arm-kernel, Russell King, openwrt-devel, Florian Fainelli

While I'm kibitzing kind of randomly on this thread... Richard Sites'
just-published book, "Understanding Software Dynamics", is the first
book I've been compelled to buy on paper in many years, due to the
extensive use of useful color graphs and analogies, as well as
explaining the KUtrace tool and so many other wonderful modern things
I'd missed.

https://www.amazon.com/Understanding-Software-Addison-Wesley-Professional-Computing/dp/0137589735

^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Optimizing kernel compilation / alignments for network performance 2022-05-05 16:04 ` Andrew Lunn 2022-05-05 16:46 ` Felix Fietkau @ 2022-05-06 7:44 ` Rafał Miłecki 2022-05-06 8:45 ` Arnd Bergmann 2022-05-08 9:53 ` Rafał Miłecki 1 sibling, 2 replies; 22+ messages in thread From: Rafał Miłecki @ 2022-05-06 7:44 UTC (permalink / raw) To: Andrew Lunn Cc: Arnd Bergmann, Alexander Lobakin, Network Development, linux-arm-kernel, Russell King, Felix Fietkau, openwrt-devel, Florian Fainelli On 5.05.2022 18:04, Andrew Lunn wrote: >> you'll see that most used functions are: >> v7_dma_inv_range >> __irqentry_text_end >> l2c210_inv_range >> v7_dma_clean_range >> bcma_host_soc_read32 >> __netif_receive_skb_core >> arch_cpu_idle >> l2c210_clean_range >> fib_table_lookup > > There is a lot of cache management functions here. Might sound odd, > but have you tried disabling SMP? These cache functions need to > operate across all CPUs, and the communication between CPUs can slow > them down. If there is only one CPU, these cache functions get simpler > and faster. > > It just depends on your workload. If you have 1 CPU loaded to 100% and > the other 3 idle, you might see an improvement. If you actually need > more than one CPU, it will probably be worse. It seems to lower my NAT speed from ~362 Mb/s to 320 Mb/s but it feels more stable now (lower variations). Let me spend some time on more testing. 
FWIW during all my tests I was using:
echo 2 > /sys/class/net/eth0/queues/rx-0/rps_cpus
which is what I need to get similar speeds across iperf sessions.

With
echo 0 > /sys/class/net/eth0/queues/rx-0/rps_cpus
my NAT speeds were jumping between 4 speeds:
273 Mbps / 315 Mbps / 353 Mbps / 425 Mbps
(every time I started iperf the kernel jumped into one state and kept
the same iperf speed until I stopped it and started another session)

With
echo 1 > /sys/class/net/eth0/queues/rx-0/rps_cpus
my NAT speeds were jumping between 2 speeds:
284 Mbps / 408 Mbps

> I've also found that some Ethernet drivers invalidate or flush too
> much. If you are sending a 64 byte TCP ACK, all you need to flush is
> 64 bytes, not the full 1500 MTU. If you receive a TCP ACK, and then
> recycle the buffer, all you need to invalidate is the size of the ACK,
> so long as you can guarantee nothing has touched the memory above it.
> But you need to be careful when implementing tricks like this, or you
> can get subtle corruption bugs when you get it wrong.

That was actually bgmac's initial behaviour, see commit 92b9ccd34a90
("bgmac: pass received packet to the netif instead of copying it"):
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=92b9ccd34a9053c628d230fe27a7e0c10179910f

I think it was Felix who suggested that I avoid skb_copy*(), and it
does seem to have improved performance.

^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Optimizing kernel compilation / alignments for network performance 2022-05-06 7:44 ` Rafał Miłecki @ 2022-05-06 8:45 ` Arnd Bergmann 2022-05-06 8:55 ` Rafał Miłecki 2022-05-10 11:23 ` Rafał Miłecki 1 sibling, 2 replies; 22+ messages in thread From: Arnd Bergmann @ 2022-05-06 8:45 UTC (permalink / raw) To: Rafał Miłecki Cc: Andrew Lunn, Arnd Bergmann, Alexander Lobakin, Network Development, linux-arm-kernel, Russell King, Felix Fietkau, openwrt-devel, Florian Fainelli

On Fri, May 6, 2022 at 9:44 AM Rafał Miłecki <zajec5@gmail.com> wrote:
>
> On 5.05.2022 18:04, Andrew Lunn wrote:
> >> you'll see that most used functions are:
> >> v7_dma_inv_range
> >> __irqentry_text_end
> >> l2c210_inv_range
> >> v7_dma_clean_range
> >> bcma_host_soc_read32
> >> __netif_receive_skb_core
> >> arch_cpu_idle
> >> l2c210_clean_range
> >> fib_table_lookup
> >
> > There is a lot of cache management functions here.

Indeed, so optimizing the coherency management (see Felix' reply)
is likely to help most in making the driver faster, but that does not
explain why the alignment of the object code has such a big impact
on performance.

To investigate the alignment further, what I was actually looking for
is a comparison of the profiles of the slow and fast cases. Here I would
expect that the slow case spends more time in one of the functions
that don't deal with cache management (maybe fib_table_lookup or
__netif_receive_skb_core).

A few other thoughts:

- bcma_host_soc_read32() is a fundamentally slow operation, maybe
some of the calls can be turned into a relaxed read, like the readback
in bgmac_chip_intrs_off() or the 'poll again' at the end of bgmac_poll(),
though obviously not the one in bgmac_dma_rx_read().
It may be possible to even avoid some of the reads entirely; checking
for more data in bgmac_poll() may actually be counterproductive
depending on the workload.
- The higher-end networking SoCs are usually cache-coherent and can avoid the cache management entirely. There is a slim chance that this chip is designed that way and it just needs to be enabled properly. Most low-end chips don't implement the coherent interconnect though, and I suppose you have checked this already. - bgmac_dma_rx_update_index() and bgmac_dma_tx_add() appear to have an extraneous dma_wmb(), which should be implied by the non-relaxed writel() in bgmac_write(). - accesses to the DMA descriptor don't show up in the profile here, but look like they can get misoptimized by the compiler. I would generally use READ_ONCE() and WRITE_ONCE() for these to ensure that you don't end up with extra or out-of-order accesses. This also makes it clearer to the reader that something special happens here. > > Might sound odd, > > but have you tried disabling SMP? These cache functions need to > > operate across all CPUs, and the communication between CPUs can slow > > them down. If there is only one CPU, these cache functions get simpler > > and faster. > > > > It just depends on your workload. If you have 1 CPU loaded to 100% and > > the other 3 idle, you might see an improvement. If you actually need > > more than one CPU, it will probably be worse. > > It seems to lower my NAT speed from ~362 Mb/s to 320 Mb/s but it feels > more stable now (lower variations). Let me spend some time on more > testing. 
> > > FWIW during all my tests I was using: > echo 2 > /sys/class/net/eth0/queues/rx-0/rps_cpus > that is what I need to get similar speeds across iperf sessions > > With > echo 0 > /sys/class/net/eth0/queues/rx-0/rps_cpus > my NAT speeds were jumping between 4 speeds: > 273 Mbps / 315 Mbps / 353 Mbps / 425 Mbps > (every time I started iperf kernel jumped into one state and kept the > same iperf speed until stopping it and starting another session) > > With > echo 1 > /sys/class/net/eth0/queues/rx-0/rps_cpus > my NAT speeds were jumping between 2 speeds: > 284 Mbps / 408 Mbps Can you try using 'numactl -C' to pin the iperf processes to a particular CPU core? This may be related to the locality of the user process relative to where the interrupts end up. Arnd ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Optimizing kernel compilation / alignments for network performance 2022-05-06 8:45 ` Arnd Bergmann @ 2022-05-06 8:55 ` Rafał Miłecki 2022-05-06 9:44 ` Arnd Bergmann 2022-05-10 11:23 ` Rafał Miłecki 1 sibling, 1 reply; 22+ messages in thread From: Rafał Miłecki @ 2022-05-06 8:55 UTC (permalink / raw) To: Arnd Bergmann Cc: Andrew Lunn, Alexander Lobakin, Network Development, linux-arm-kernel, Russell King, Felix Fietkau, openwrt-devel, Florian Fainelli On 6.05.2022 10:45, Arnd Bergmann wrote: > On Fri, May 6, 2022 at 9:44 AM Rafał Miłecki <zajec5@gmail.com> wrote: >> >> On 5.05.2022 18:04, Andrew Lunn wrote: >>>> you'll see that most used functions are: >>>> v7_dma_inv_range >>>> __irqentry_text_end >>>> l2c210_inv_range >>>> v7_dma_clean_range >>>> bcma_host_soc_read32 >>>> __netif_receive_skb_core >>>> arch_cpu_idle >>>> l2c210_clean_range >>>> fib_table_lookup >>> >>> There is a lot of cache management functions here. > > Indeed, so optimizing the coherency management (see Felix' reply) > is likely to help most in making the driver faster, but that does not > explain why the alignment of the object code has such a big impact > on performance. > > To investigate the alignment further, what I was actually looking for > is a comparison of the profile of the slow and fast case. Here I would > expect that the slow case spends more time in one of the functions > that don't deal with cache management (maybe fib_table_lookup or > __netif_receive_skb_core). > > A few other thoughts: > > - bcma_host_soc_read32() is a fundamentally slow operation, maybe > some of the calls can turned into a relaxed read, like the readback > in bgmac_chip_intrs_off() or the 'poll again' at the end bgmac_poll(), > though obviously not the one in bgmac_dma_rx_read(). > It may be possible to even avoid some of the reads entirely, checking > for more data in bgmac_poll() may actually be counterproductive > depending on the workload. 
> > - The higher-end networking SoCs are usually cache-coherent and > can avoid the cache management entirely. There is a slim chance > that this chip is designed that way and it just needs to be enabled > properly. Most low-end chips don't implement the coherent > interconnect though, and I suppose you have checked this already. > > - bgmac_dma_rx_update_index() and bgmac_dma_tx_add() appear > to have an extraneous dma_wmb(), which should be implied by the > non-relaxed writel() in bgmac_write(). > > - accesses to the DMA descriptor don't show up in the profile here, > but look like they can get misoptimized by the compiler. I would > generally use READ_ONCE() and WRITE_ONCE() for these to > ensure that you don't end up with extra or out-of-order accesses. > This also makes it clearer to the reader that something special > happens here. > >>> Might sound odd, >>> but have you tried disabling SMP? These cache functions need to >>> operate across all CPUs, and the communication between CPUs can slow >>> them down. If there is only one CPU, these cache functions get simpler >>> and faster. >>> >>> It just depends on your workload. If you have 1 CPU loaded to 100% and >>> the other 3 idle, you might see an improvement. If you actually need >>> more than one CPU, it will probably be worse. >> >> It seems to lower my NAT speed from ~362 Mb/s to 320 Mb/s but it feels >> more stable now (lower variations). Let me spend some time on more >> testing. 
>>
>> FWIW during all my tests I was using:
>> echo 2 > /sys/class/net/eth0/queues/rx-0/rps_cpus
>> that is what I need to get similar speeds across iperf sessions
>>
>> With
>> echo 0 > /sys/class/net/eth0/queues/rx-0/rps_cpus
>> my NAT speeds were jumping between 4 speeds:
>> 273 Mbps / 315 Mbps / 353 Mbps / 425 Mbps
>> (every time I started iperf kernel jumped into one state and kept the
>> same iperf speed until stopping it and starting another session)
>>
>> With
>> echo 1 > /sys/class/net/eth0/queues/rx-0/rps_cpus
>> my NAT speeds were jumping between 2 speeds:
>> 284 Mbps / 408 Mbps
>
> Can you try using 'numactl -C' to pin the iperf processes to
> a particular CPU core? This may be related to the locality of
> the user process relative to where the interrupts end up.

I run iperf on x86 machines connected to the router's WAN and LAN ports.
It's meant to emulate an end user just downloading data from / uploading
data to the Internet.

The router's only task here is masquerade NAT.

^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Optimizing kernel compilation / alignments for network performance 2022-05-06 8:55 ` Rafał Miłecki @ 2022-05-06 9:44 ` Arnd Bergmann 2022-05-10 12:51 ` Rafał Miłecki 0 siblings, 1 reply; 22+ messages in thread From: Arnd Bergmann @ 2022-05-06 9:44 UTC (permalink / raw) To: Rafał Miłecki Cc: Arnd Bergmann, Andrew Lunn, Alexander Lobakin, Network Development, linux-arm-kernel, Russell King, Felix Fietkau, openwrt-devel, Florian Fainelli On Fri, May 6, 2022 at 10:55 AM Rafał Miłecki <zajec5@gmail.com> wrote: > On 6.05.2022 10:45, Arnd Bergmann wrote: > > On Fri, May 6, 2022 at 9:44 AM Rafał Miłecki <zajec5@gmail.com> wrote: > >> With > >> echo 1 > /sys/class/net/eth0/queues/rx-0/rps_cpus > >> my NAT speeds were jumping between 2 speeds: > >> 284 Mbps / 408 Mbps > > > > Can you try using 'numactl -C' to pin the iperf processes to > > a particular CPU core? This may be related to the locality of > > the user process relative to where the interrupts end up. > > I run iperf on x86 machines connected to router's WAN and LAN ports. > It's meant to emulate end user just downloading from / uploading to > Internet some data. > > Router's only task is doing masquarade NAT here. Ah, makes sense. Can you observe the CPU usage to be on a particular core in the slow vs fast case then? Arnd ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Optimizing kernel compilation / alignments for network performance 2022-05-06 9:44 ` Arnd Bergmann @ 2022-05-10 12:51 ` Rafał Miłecki 2022-05-10 13:19 ` Arnd Bergmann 0 siblings, 1 reply; 22+ messages in thread From: Rafał Miłecki @ 2022-05-10 12:51 UTC (permalink / raw) To: Arnd Bergmann Cc: Andrew Lunn, Alexander Lobakin, Network Development, linux-arm-kernel, Russell King, Felix Fietkau, openwrt-devel, Florian Fainelli On 6.05.2022 11:44, Arnd Bergmann wrote: > On Fri, May 6, 2022 at 10:55 AM Rafał Miłecki <zajec5@gmail.com> wrote: >> On 6.05.2022 10:45, Arnd Bergmann wrote: >>> On Fri, May 6, 2022 at 9:44 AM Rafał Miłecki <zajec5@gmail.com> wrote: >>>> With >>>> echo 1 > /sys/class/net/eth0/queues/rx-0/rps_cpus >>>> my NAT speeds were jumping between 2 speeds: >>>> 284 Mbps / 408 Mbps >>> >>> Can you try using 'numactl -C' to pin the iperf processes to >>> a particular CPU core? This may be related to the locality of >>> the user process relative to where the interrupts end up. >> >> I run iperf on x86 machines connected to router's WAN and LAN ports. >> It's meant to emulate end user just downloading from / uploading to >> Internet some data. >> >> Router's only task is doing masquarade NAT here. > > Ah, makes sense. Can you observe the CPU usage to be on > a particular core in the slow vs fast case then? 
With
echo 0 > /sys/class/net/eth0/queues/rx-0/rps_cpus
NAT speed was varying between:
a) 311 Mb/s (CPUs load: 100% + 0%)
b) 408 Mb/s (CPUs load: 100% + 62%)

With
echo 1 > /sys/class/net/eth0/queues/rx-0/rps_cpus
NAT speed was varying between:
a) 290 Mb/s (CPUs load: 100% + 0%)
b) 410 Mb/s (CPUs load: 100% + 63%)

With
echo 2 > /sys/class/net/eth0/queues/rx-0/rps_cpus
NAT speed was stable:
a) 372 Mb/s (CPUs load: 100% + 26%)
b) 375 Mb/s (CPUs load: 82% + 100%)

With
echo 3 > /sys/class/net/eth0/queues/rx-0/rps_cpus
NAT speed was varying between:
a) 293 Mb/s (CPUs load: 100% + 0%)
b) 332 Mb/s (CPUs load: 100% + 17%)
c) 374 Mb/s (CPUs load: 81% + 100%)
d) 442 Mb/s (CPUs load: 100% + 75%)

After some extra debugging I found the reason for the varying CPU usage
& varying NAT speeds.

My router has a single switch so I use two VLANs:
eth0.1 - LAN
eth0.2 - WAN
(VLAN traffic is routed to the correct ports by the switch). On top of
that I have a "br-lan" bridge interface bridging eth0.1 and the wireless
interfaces.

For all that time I had /sys/class/net/br-lan/queues/rx-0/rps_cpus set
to 3. So bridge traffic was randomly handled by CPU 0 or CPU 1.

So if I assign a specific CPU core to each of the two interfaces, e.g.:
echo 1 > /sys/class/net/eth0/queues/rx-0/rps_cpus
echo 2 > /sys/class/net/br-lan/queues/rx-0/rps_cpus
things get stable.

With the above I get a stable 419 Mb/s (CPUs load: 100% + 64%) on every
iperf session.

^ permalink raw reply [flat|nested] 22+ messages in thread
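The rps_cpus values used in the thread are hex bitmasks of CPU cores (bit N selects CPU N), which is why 1 pins a queue to CPU 0, 2 to CPU 1, and 3 lets either core handle it. A small helper for building such masks (hypothetical script, not part of the thread):

```shell
# rps_cpus takes a hex bitmask of CPU cores: bit N selects CPU N.
# Build such a mask from a list of CPU indices.
rps_mask() {
	local mask=0 cpu
	for cpu in "$@"; do
		mask=$((mask | (1 << cpu)))
	done
	printf '%x\n' "$mask"
}

rps_mask 0      # -> 1 (CPU 0 only)
rps_mask 1      # -> 2 (CPU 1 only)
rps_mask 0 1    # -> 3 (either core)

# Then, matching the settings that turned out stable above:
#   echo "$(rps_mask 0)" > /sys/class/net/eth0/queues/rx-0/rps_cpus
#   echo "$(rps_mask 1)" > /sys/class/net/br-lan/queues/rx-0/rps_cpus
```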
* Re: Optimizing kernel compilation / alignments for network performance 2022-05-10 12:51 ` Rafał Miłecki @ 2022-05-10 13:19 ` Arnd Bergmann 0 siblings, 0 replies; 22+ messages in thread From: Arnd Bergmann @ 2022-05-10 13:19 UTC (permalink / raw) To: Rafał Miłecki Cc: Arnd Bergmann, Andrew Lunn, Alexander Lobakin, Network Development, linux-arm-kernel, Russell King, Felix Fietkau, openwrt-devel, Florian Fainelli On Tue, May 10, 2022 at 2:51 PM Rafał Miłecki <zajec5@gmail.com> wrote: > On 6.05.2022 11:44, Arnd Bergmann wrote: > > My router has a single swtich so I use two VLANs: > eth0.1 - LAN > eth0.2 - WAN > (VLAN traffic is routed to correct ports by switch). On top of that I > have "br-lan" bridge interface briding eth0.1 and wireless interfaces. > > For all that time I had /sys/class/net/br-lan/queues/rx-0/rps_cpus set > to 3. So bridge traffic was randomly handled by CPU 0 or CPU 1. > > So if I assign specific CPU core to each of two interfaces, e.g.: > echo 1 > /sys/class/net/eth0/queues/rx-0/rps_cpus > echo 2 > /sys/class/net/br-lan/queues/rx-0/rps_cpus > things get stable. > > With above I get stable 419 Mb/s (CPUs load: 100% + 64%) on every iperf > session. Ah, very nice! One part of the mystery is solved then I guess. Arnd ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Optimizing kernel compilation / alignments for network performance 2022-05-06 8:45 ` Arnd Bergmann 2022-05-06 8:55 ` Rafał Miłecki @ 2022-05-10 11:23 ` Rafał Miłecki 2022-05-10 13:18 ` Arnd Bergmann 1 sibling, 1 reply; 22+ messages in thread From: Rafał Miłecki @ 2022-05-10 11:23 UTC (permalink / raw) To: Arnd Bergmann Cc: Andrew Lunn, Alexander Lobakin, Network Development, linux-arm-kernel, Russell King, Felix Fietkau, openwrt-devel, Florian Fainelli On 6.05.2022 10:45, Arnd Bergmann wrote: > On Fri, May 6, 2022 at 9:44 AM Rafał Miłecki <zajec5@gmail.com> wrote: >> >> On 5.05.2022 18:04, Andrew Lunn wrote: >>>> you'll see that most used functions are: >>>> v7_dma_inv_range >>>> __irqentry_text_end >>>> l2c210_inv_range >>>> v7_dma_clean_range >>>> bcma_host_soc_read32 >>>> __netif_receive_skb_core >>>> arch_cpu_idle >>>> l2c210_clean_range >>>> fib_table_lookup >>> >>> There is a lot of cache management functions here. > > Indeed, so optimizing the coherency management (see Felix' reply) > is likely to help most in making the driver faster, but that does not > explain why the alignment of the object code has such a big impact > on performance. > > To investigate the alignment further, what I was actually looking for > is a comparison of the profile of the slow and fast case. Here I would > expect that the slow case spends more time in one of the functions > that don't deal with cache management (maybe fib_table_lookup or > __netif_receive_skb_core). > > A few other thoughts: > > - bcma_host_soc_read32() is a fundamentally slow operation, maybe > some of the calls can turned into a relaxed read, like the readback > in bgmac_chip_intrs_off() or the 'poll again' at the end bgmac_poll(), > though obviously not the one in bgmac_dma_rx_read(). > It may be possible to even avoid some of the reads entirely, checking > for more data in bgmac_poll() may actually be counterproductive > depending on the workload. 
I'll experiment with that, hopefully I can optimize it a bit.

> - The higher-end networking SoCs are usually cache-coherent and
> can avoid the cache management entirely. There is a slim chance
> that this chip is designed that way and it just needs to be enabled
> properly. Most low-end chips don't implement the coherent
> interconnect though, and I suppose you have checked this already.

To the best of my knowledge the Northstar platform doesn't support hw
coherency.

I just took an extra look at Broadcom's SDK and they seem to have some
driver for selected chipsets, but BCM708 isn't there.

config BCM_GLB_COHERENCY
	bool "Global Hardware Cache Coherency"
	default n
	depends on BCM963158 || BCM96846 || BCM96858 || BCM96856 || BCM963178 || BCM947622 || BCM963146 || BCM94912 || BCM96813 || BCM96756 || BCM96855

> - bgmac_dma_rx_update_index() and bgmac_dma_tx_add() appear
> to have an extraneous dma_wmb(), which should be implied by the
> non-relaxed writel() in bgmac_write().

I tried dropping the wmb() calls.
With wmb(): 421 Mb/s
Without: 418 Mb/s

I also tried dropping the bgmac_read() from bgmac_chip_intrs_off() which
seems to be a flushing readback.

With bgmac_read(): 421 Mb/s
Without: 413 Mb/s

> - accesses to the DMA descriptor don't show up in the profile here,
> but look like they can get misoptimized by the compiler. I would
> generally use READ_ONCE() and WRITE_ONCE() for these to
> ensure that you don't end up with extra or out-of-order accesses.
> This also makes it clearer to the reader that something special
> happens here.

Should I use something like below?

FWIW it doesn't seem to change NAT performance.
Without WRITE_ONCE: 421 Mb/s
With: 419 Mb/s

diff --git a/drivers/net/ethernet/broadcom/bgmac.c b/drivers/net/ethernet/broadcom/bgmac.c
index 87700072..ce98f2a9 100644
--- a/drivers/net/ethernet/broadcom/bgmac.c
+++ b/drivers/net/ethernet/broadcom/bgmac.c
@@ -119,10 +119,10 @@ bgmac_dma_tx_add_buf(struct bgmac *bgmac, struct bgmac_dma_ring *ring,
 	slot = &ring->slots[i];
 	dma_desc = &ring->cpu_base[i];
 
-	dma_desc->addr_low = cpu_to_le32(lower_32_bits(slot->dma_addr));
-	dma_desc->addr_high = cpu_to_le32(upper_32_bits(slot->dma_addr));
-	dma_desc->ctl0 = cpu_to_le32(ctl0);
-	dma_desc->ctl1 = cpu_to_le32(ctl1);
+	WRITE_ONCE(dma_desc->addr_low, cpu_to_le32(lower_32_bits(slot->dma_addr)));
+	WRITE_ONCE(dma_desc->addr_high, cpu_to_le32(upper_32_bits(slot->dma_addr)));
+	WRITE_ONCE(dma_desc->ctl0, cpu_to_le32(ctl0));
+	WRITE_ONCE(dma_desc->ctl1, cpu_to_le32(ctl1));
 }
 
 static netdev_tx_t bgmac_dma_tx_add(struct bgmac *bgmac,
@@ -387,10 +387,10 @@ static void bgmac_dma_rx_setup_desc(struct bgmac *bgmac,
 	 * B43_DMA64_DCTL1_ADDREXT_MASK;
 	 */
 
-	dma_desc->addr_low = cpu_to_le32(lower_32_bits(ring->slots[desc_idx].dma_addr));
-	dma_desc->addr_high = cpu_to_le32(upper_32_bits(ring->slots[desc_idx].dma_addr));
-	dma_desc->ctl0 = cpu_to_le32(ctl0);
-	dma_desc->ctl1 = cpu_to_le32(ctl1);
+	WRITE_ONCE(dma_desc->addr_low, cpu_to_le32(lower_32_bits(ring->slots[desc_idx].dma_addr)));
+	WRITE_ONCE(dma_desc->addr_high, cpu_to_le32(upper_32_bits(ring->slots[desc_idx].dma_addr)));
+	WRITE_ONCE(dma_desc->ctl0, cpu_to_le32(ctl0));
+	WRITE_ONCE(dma_desc->ctl1, cpu_to_le32(ctl1));
 
 	ring->end = desc_idx;
 }

^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: Optimizing kernel compilation / alignments for network performance 2022-05-10 11:23 ` Rafał Miłecki @ 2022-05-10 13:18 ` Arnd Bergmann 0 siblings, 0 replies; 22+ messages in thread From: Arnd Bergmann @ 2022-05-10 13:18 UTC (permalink / raw) To: Rafał Miłecki Cc: Arnd Bergmann, Andrew Lunn, Alexander Lobakin, Network Development, linux-arm-kernel, Russell King, Felix Fietkau, openwrt-devel, Florian Fainelli On Tue, May 10, 2022 at 1:23 PM Rafał Miłecki <zajec5@gmail.com> wrote: > On 6.05.2022 10:45, Arnd Bergmann wrote: > > - The higher-end networking SoCs are usually cache-coherent and > > can avoid the cache management entirely. There is a slim chance > > that this chip is designed that way and it just needs to be enabled > > properly. Most low-end chips don't implement the coherent > > interconnect though, and I suppose you have checked this already. > > To my best knowledge Northstar platform doesn't support hw coherency. > > I just took an extra look at Broadcom's SDK and them seem to have some > driver for selected chipsets but BCM708 isn't there. > > config BCM_GLB_COHERENCY > bool "Global Hardware Cache Coherency" > default n > depends on BCM963158 || BCM96846 || BCM96858 || BCM96856 || BCM963178 || BCM947622 || BCM963146 || BCM94912 || BCM96813 || BCM96756 || BCM96855 Ok > > - bgmac_dma_rx_update_index() and bgmac_dma_tx_add() appear > > to have an extraneous dma_wmb(), which should be implied by the > > non-relaxed writel() in bgmac_write(). > > I tried dropping wmb() calls. > With wmb(): 421 Mb/s > Without: 418 Mb/s That's probably within the noise here. I suppose doing two wmb() calls in a row is not that expensive because there is nothing left to wait for. If the extra wmb() is measurably faster than no wmb(), there is something else going wrong ;-) > I also tried dropping bgmac_read() from bgmac_chip_intrs_off() which > seems to be a flushing readback. 
>
> With bgmac_read(): 421 Mb/s
> Without: 413 Mb/s

Interesting, so this is statistically significant, right? It could be
that this changes the interrupt timing just enough that it ends up doing
more work at once some of the time.

> > - accesses to the DMA descriptor don't show up in the profile here,
> > but look like they can get misoptimized by the compiler. I would
> > generally use READ_ONCE() and WRITE_ONCE() for these to
> > ensure that you don't end up with extra or out-of-order accesses.
> > This also makes it clearer to the reader that something special
> > happens here.
>
> Should I use something like below?
>
> FWIW it doesn't seem to change NAT performance.
> Without WRITE_ONCE: 421 Mb/s
> With: 419 Mb/s

This one depends on the compiler. What I would expect here is that it
often makes no difference, but if the compiler does something odd, then
the WRITE_ONCE() would prevent this and make it behave as before. I
would suggest adding this part regardless.

The other suggestion I had was this, I think you did not test it:

--- a/drivers/net/ethernet/broadcom/bgmac.c
+++ b/drivers/net/ethernet/broadcom/bgmac.c
@@ -1156,11 +1156,12 @@ static int bgmac_poll(struct napi_struct *napi, int weight)
 	bgmac_dma_tx_free(bgmac, &bgmac->tx_ring[0]);
 	handled += bgmac_dma_rx_read(bgmac, &bgmac->rx_ring[0], weight);
 
-	/* Poll again if more events arrived in the meantime */
-	if (bgmac_read(bgmac, BGMAC_INT_STATUS) & (BGMAC_IS_TX0 | BGMAC_IS_RX))
-		return weight;
-
 	if (handled < weight) {
+		/* Poll again if more events arrived in the meantime */
+		if (bgmac_read(bgmac, BGMAC_INT_STATUS) &
+		    (BGMAC_IS_TX0 | BGMAC_IS_RX))
+			return weight;
+
 		napi_complete_done(napi, handled);
 		bgmac_chip_intrs_on(bgmac);
 	}

Or possibly, remove that extra check entirely and just rely on the irq
to do this after it gets turned on again.

       Arnd

^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Optimizing kernel compilation / alignments for network performance
  2022-05-06  7:44 ` Rafał Miłecki
  2022-05-06  8:45 ` Arnd Bergmann
@ 2022-05-08  9:53 ` Rafał Miłecki
  1 sibling, 0 replies; 22+ messages in thread

From: Rafał Miłecki @ 2022-05-08 9:53 UTC (permalink / raw)
To: Andrew Lunn
Cc: Arnd Bergmann, Alexander Lobakin, Network Development,
    linux-arm-kernel, Russell King, Felix Fietkau, openwrt-devel,
    Florian Fainelli

On 6.05.2022 09:44, Rafał Miłecki wrote:
> On 5.05.2022 18:04, Andrew Lunn wrote:
>>> you'll see that most used functions are:
>>> v7_dma_inv_range
>>> __irqentry_text_end
>>> l2c210_inv_range
>>> v7_dma_clean_range
>>> bcma_host_soc_read32
>>> __netif_receive_skb_core
>>> arch_cpu_idle
>>> l2c210_clean_range
>>> fib_table_lookup
>>
>> There are a lot of cache management functions here. It might sound odd,
>> but have you tried disabling SMP? These cache functions need to
>> operate across all CPUs, and the communication between CPUs can slow
>> them down. If there is only one CPU, these cache functions get simpler
>> and faster.
>>
>> It just depends on your workload. If you have 1 CPU loaded to 100% and
>> the other 3 idle, you might see an improvement. If you actually need
>> more than one CPU, it will probably be worse.
>
> It seems to lower my NAT speed from ~362 Mb/s to 320 Mb/s, but it feels
> more stable now (less variation). Let me spend some time on more
> testing.

For context, I test various kernel commits / configs using:
iperf -t 120 -i 10 -c 192.168.13.1

I did more testing with:
# CONFIG_SMP is not set

Good thing:
During a single iperf session I get noticeably more stable speed.
With SMP: x ± 2.86%
Without SMP: x ± 0.96%

Bad thing:
Across kernel commits / config changes the speed still varies, so
disabling CONFIG_SMP won't help me look for kernel regressions.

^ permalink raw reply	[flat|nested] 22+ messages in thread
end of thread, other threads:[~2022-05-10 19:15 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-04-27 12:04 Optimizing kernel compilation / alignments for network performance Rafał Miłecki
2022-04-27 12:56 ` Alexander Lobakin
2022-04-27 17:31 ` Rafał Miłecki
2022-04-29 14:18 ` Rafał Miłecki
2022-04-29 14:49 ` Arnd Bergmann
2022-05-05 15:42 ` Rafał Miłecki
2022-05-05 16:04 ` Andrew Lunn
2022-05-05 16:46 ` Felix Fietkau
2022-05-06 7:47 ` Rafał Miłecki
2022-05-06 12:42 ` Andrew Lunn
2022-05-10 10:29 ` Rafał Miłecki
2022-05-10 14:09 ` Dave Taht
2022-05-10 19:15 ` Dave Taht
2022-05-06 7:44 ` Rafał Miłecki
2022-05-06 8:45 ` Arnd Bergmann
2022-05-06 8:55 ` Rafał Miłecki
2022-05-06 9:44 ` Arnd Bergmann
2022-05-10 12:51 ` Rafał Miłecki
2022-05-10 13:19 ` Arnd Bergmann
2022-05-10 11:23 ` Rafał Miłecki
2022-05-10 13:18 ` Arnd Bergmann
2022-05-08 9:53 ` Rafał Miłecki