From: "Rafał Miłecki" <zajec5@gmail.com>
To: Arnd Bergmann <arnd@arndb.de>
Cc: Andrew Lunn <andrew@lunn.ch>,
	Alexander Lobakin <alexandr.lobakin@intel.com>,
	Network Development <netdev@vger.kernel.org>,
	linux-arm-kernel <linux-arm-kernel@lists.infradead.org>,
	Russell King <linux@armlinux.org.uk>,
	Felix Fietkau <nbd@nbd.name>,
	"openwrt-devel@lists.openwrt.org"
	<openwrt-devel@lists.openwrt.org>,
	Florian Fainelli <f.fainelli@gmail.com>
Subject: Re: Optimizing kernel compilation / alignments for network performance
Date: Fri, 6 May 2022 10:55:52 +0200
Message-ID: <306e9713-5c37-8c6a-488b-bc07f8b8b274@gmail.com>
In-Reply-To: <CAK8P3a0Rouw8jHHqGhKtMu-ks--bqpVYj_+u4-Pt9VoFOK7nMw@mail.gmail.com>

On 6.05.2022 10:45, Arnd Bergmann wrote:
> On Fri, May 6, 2022 at 9:44 AM Rafał Miłecki <zajec5@gmail.com> wrote:
>>
>> On 5.05.2022 18:04, Andrew Lunn wrote:
>>>> you'll see that most used functions are:
>>>> v7_dma_inv_range
>>>> __irqentry_text_end
>>>> l2c210_inv_range
>>>> v7_dma_clean_range
>>>> bcma_host_soc_read32
>>>> __netif_receive_skb_core
>>>> arch_cpu_idle
>>>> l2c210_clean_range
>>>> fib_table_lookup
>>>
>>> There are a lot of cache management functions here.
> 
> Indeed, so optimizing the coherency management (see Felix' reply)
> is likely to help most in making the driver faster, but that does not
> explain why the alignment of the object code has such a big impact
> on performance.
> 
> To investigate the alignment further, what I was actually looking for
> is a comparison of the profile of the slow and fast case. Here I would
> expect that the slow case spends more time in one of the functions
> that don't deal with cache management (maybe fib_table_lookup or
> __netif_receive_skb_core).
> 
> A few other thoughts:
> 
> - bcma_host_soc_read32() is a fundamentally slow operation, maybe
>    some of the calls can be turned into a relaxed read, like the readback
>    in bgmac_chip_intrs_off() or the 'poll again' at the end of bgmac_poll(),
>    though obviously not the one in bgmac_dma_rx_read().
>    It may even be possible to avoid some of the reads entirely; checking
>    for more data in bgmac_poll() may actually be counterproductive
>    depending on the workload.
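
(For reference, a rough sketch of the relaxed-read idea; the *_relaxed
helper below is hypothetical and I haven't checked every call site:)

/* Sketch only: a relaxed variant of the bcma SoC read op. On ARM, readl()
 * implies a barrier that orders the MMIO read before later accesses to DMA
 * memory; readl_relaxed() drops it, which should be fine for readbacks that
 * only flush a posted write (e.g. the one in bgmac_chip_intrs_off()), but
 * not for bgmac_dma_rx_read().
 */
static u32 bcma_host_soc_read32_relaxed(struct bcma_device *core, u16 offset)
{
	return readl_relaxed(core->io_addr + offset);
}
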
> 
> - The higher-end networking SoCs are usually cache-coherent and
>    can avoid the cache management entirely. There is a slim chance
>    that this chip is designed that way and it just needs to be enabled
>    properly. Most low-end chips don't implement the coherent
>    interconnect though, and I suppose you have checked this already.
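
(If it did turn out to be coherent, my understanding is that the
"dma-coherent" DT property is what flips the switch: the DMA API then skips
the cache maintenance and the driver itself doesn't change. Illustrative
check only, assuming a DT-based platform and that core->dev.of_node is
populated:)

	/* With "dma-coherent" set on the node, dma_map_*()/dma_sync_*()
	 * no longer need any cache maintenance.
	 */
	if (of_dma_is_coherent(core->dev.of_node))
		dev_info(&core->dev, "coherent DMA\n");
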
> 
> - bgmac_dma_rx_update_index() and bgmac_dma_tx_add() appear
>    to have an extraneous dma_wmb(), which should be implied by the
>    non-relaxed writel() in bgmac_write().
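
(To make sure I follow, an annotated sketch of that pattern; this is not
the exact bgmac code, the ring fields are from memory:)

static void bgmac_dma_rx_update_index(struct bgmac *bgmac,
				      struct bgmac_dma_ring *ring)
{
	/* A dma_wmb() here would order the descriptor stores before the
	 * doorbell write, but bgmac_write() ends in a non-relaxed writel(),
	 * which already implies that barrier on ARM, so the explicit
	 * dma_wmb() looks redundant.
	 */
	bgmac_write(bgmac, ring->mmio_base + BGMAC_DMA_RX_INDEX,
		    ring->index_base +
		    ring->end * sizeof(struct bgmac_dma_desc));
}
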
> 
> - accesses to the DMA descriptor don't show up in the profile here,
>    but look like they can get misoptimized by the compiler. I would
>    generally use READ_ONCE() and WRITE_ONCE() for these to
>    ensure that you don't end up with extra or out-of-order accesses.
>    This also makes it clearer to the reader that something special
>    happens here.
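
(Something along these lines, I guess; the descriptor field names below are
illustrative rather than checked against bgmac.h:)

	struct bgmac_dma_desc *dma_desc = ring->cpu_base + ring->start;
	u32 ctl0;

	/* READ_ONCE()/WRITE_ONCE() force single, in-order accesses to the
	 * descriptor words, so the compiler cannot merge, split or re-read
	 * them behind our back.
	 */
	ctl0 = le32_to_cpu(READ_ONCE(dma_desc->ctl0));

	WRITE_ONCE(dma_desc->addr_low, cpu_to_le32(lower_32_bits(dma_addr)));
	WRITE_ONCE(dma_desc->addr_high, cpu_to_le32(upper_32_bits(dma_addr)));
	WRITE_ONCE(dma_desc->ctl1, cpu_to_le32(ctl1));
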
> 
>>> Might sound odd,
>>> but have you tried disabling SMP? These cache functions need to
>>> operate across all CPUs, and the communication between CPUs can slow
>>> them down. If there is only one CPU, these cache functions get simpler
>>> and faster.
>>>
>>> It just depends on your workload. If you have 1 CPU loaded to 100% and
>>> the other 3 idle, you might see an improvement. If you actually need
>>> more than one CPU, it will probably be worse.
>>
>> It seems to lower my NAT speed from ~362 Mbps to 320 Mbps, but it feels
>> more stable now (less variation). Let me spend some time on more
>> testing.
>>
>>
>> FWIW, during all my tests I was using:
>> echo 2 > /sys/class/net/eth0/queues/rx-0/rps_cpus
>> as that is what I need to get similar speeds across iperf sessions.
>>
>> With
>> echo 0 > /sys/class/net/eth0/queues/rx-0/rps_cpus
>> my NAT speeds were jumping between 4 speeds:
>> 273 Mbps / 315 Mbps / 353 Mbps / 425 Mbps
>> (every time I started iperf, the kernel jumped into one of those states and
>> kept the same speed until I stopped it and started another session)
>>
>> With
>> echo 1 > /sys/class/net/eth0/queues/rx-0/rps_cpus
>> my NAT speeds were jumping between 2 speeds:
>> 284 Mbps / 408 Mbps
> 
> Can you try using 'numactl -C' to pin the iperf processes to
> a particular CPU core? This may be related to the locality of
> the user process relative to where the interrupts end up.

I run iperf on x86 machines connected to the router's WAN and LAN ports.
It's meant to emulate an end user simply downloading data from / uploading
data to the Internet.

The router's only task here is masquerade NAT.

