* Optimizing kernel compilation / alignments for network performance
@ 2022-04-27 12:04 ` Rafał Miłecki
From: Rafał Miłecki @ 2022-04-27 12:04 UTC (permalink / raw)
  To: Network Development, linux-arm-kernel, Russell King, Andrew Lunn,
	Felix Fietkau
  Cc: openwrt-devel, Florian Fainelli

Hi,

I noticed years ago that kernel changes touching code that I don't use
at all can affect network performance for me.

I work with home routers based on the Broadcom Northstar platform. Those
are SoCs with two not-so-powerful ARM Cortex-A9 CPU cores. The main task
of those devices is NAT masquerading, and that is what I test with iperf
running on two x86 machines.

***

An example of such an unused-code change:
ce5013ff3bec ("mtd: spi-nor: Add support for XM25QH64A and XM25QH128A").
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ce5013ff3bec05cf2a8a05c75fcd520d9914d92b
It lowered my NAT speed from 381 Mb/s to 367 Mb/s (-3.5%).

I first reported that issue in the e-mail thread:
ARM router NAT performance affected by random/unrelated commits
https://lkml.org/lkml/2019/5/21/349
https://www.spinics.net/lists/linux-block/msg40624.html

Back then it was commit 5b0890a97204 ("flow_dissector: Parse batman-adv
unicast headers")
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9316a9ed6895c4ad2f0cde171d486f80c55d8283
that increased my NAT speed from 741 Mb/s to 773 Mb/s (+4.3%).

***

It appears Northstar CPUs have small caches, so any change in the
location of kernel symbols can affect NAT performance. That explains why
changing unrelated code affects anything, and it has been partially
proven by aligning some of the cache-v7.S code.

My question is: is there a way to find out & force optimal symbol
locations?

Adding .align 5 to cache-v7.S is a partial success. I'd like to find
out which other functions are worth optimizing (aligning) and force that
(I guess  __attribute__((aligned(32))) could be used).
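
To illustrate what I mean in C (just a sketch with a made-up function
name, not actual kernel code):

static __attribute__((aligned(32))) void hot_path_handler(void)
{
	/* force this symbol to start on a 32-byte (cache line) boundary,
	 * the same idea as .align 5 in assembly */
}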

I can't really draw any conclusions from comparing System.map before and
after the above commits, as they relocate thousands of symbols in one go.

Optimizing is pretty important for me for two reasons:
1. I want to reach maximum possible NAT masquerade performance
2. I need stable performance across random commits to detect regressions

* Re: Optimizing kernel compilation / alignments for network performance
  2022-04-27 12:04 ` Rafał Miłecki
@ 2022-04-27 12:56   ` Alexander Lobakin
From: Alexander Lobakin @ 2022-04-27 12:56 UTC (permalink / raw)
  To: Rafał Miłecki
  Cc: Alexander Lobakin, Network Development, linux-arm-kernel,
	Russell King, Andrew Lunn, Felix Fietkau, openwrt-devel,
	Florian Fainelli

From: Rafał Miłecki <zajec5@gmail.com>
Date: Wed, 27 Apr 2022 14:04:54 +0200

> Hi,

Hej,

> 
> I noticed years ago that kernel changes touching code - that I don't use
> at all - can affect network performance for me.
> 
> I work with home routers based on Broadcom Northstar platform. Those
> are SoCs with not-so-powerful 2 x ARM Cortex-A9 CPU cores. Main task of
> those devices is NAT masquerade and that is what I test with iperf
> running on two x86 machines.
> 
> ***
> 
> Example of such unused code change:
> ce5013ff3bec ("mtd: spi-nor: Add support for XM25QH64A and XM25QH128A").
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ce5013ff3bec05cf2a8a05c75fcd520d9914d92b
> It lowered my NAT speed from 381 Mb/s to 367 Mb/s (-3,5%).
> 
> I first reported that issue it in the e-mail thread:
> ARM router NAT performance affected by random/unrelated commits
> https://lkml.org/lkml/2019/5/21/349
> https://www.spinics.net/lists/linux-block/msg40624.html
> 
> Back then it was commit 5b0890a97204 ("flow_dissector: Parse batman-adv
> unicast headers")
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9316a9ed6895c4ad2f0cde171d486f80c55d8283
> that increased my NAT speed from 741 Mb/s to 773 Mb/s (+4,3%).
> 
> ***
> 
> It appears Northstar CPUs have little cache size and so any change in
> location of kernel symbols can affect NAT performance. That explains why
> changing unrelated code affects anything & it has been partially proven
> aligning some of cache-v7.S code.
> 
> My question is: is there a way to find out & force an optimal symbols
> locations?

Take a look at CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B[0]. I've been
fighting with the same issue on some Realtek MIPS boards: random
code changes in random kernel core parts were affecting NAT /
network performance. This option resolved it, I'd say, at the cost
of a slightly increased vmlinux size (almost no change in vmlinuz
size).
The only catch is that it was recently restricted to a set of
architectures, and MIPS and ARM32 are not included now. So it's
either a matter of expanding the list (since it was restricted only
because `-falign-functions=` is not supported on some architectures),
or you can just do:

make KCFLAGS=-falign-functions=64 # replace 64 with your I-cache line size

The actual alignment is something to play with; I stopped at the
cacheline size, 32 in my case.
Also, this does not provide any guarantee that you won't suffer
from random data cacheline changes. There were some initiatives to
introduce debug alignment of data as well, but since functions are
often bigger than 32 bytes while variables are usually much smaller,
it was increasing the vmlinux size by a ton (imagine each u32 variable
occupying 32-64 bytes instead of 4). But the chance of hitting this
is much lower than that of suffering from I-cache function misplacement.
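
For reference, expanding the list would look roughly like this (just a
sketch, the exact upstream dependency line may differ; ARM and MIPS are
the additions here):

config DEBUG_FORCE_FUNCTION_ALIGN_64B
	bool "Force all function address 64B aligned"
	depends on EXPERT && (X86_64 || ARM64 || PPC32 || PPC64 || ARC || ARM || MIPS)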

> 
> Adding .align 5 to the cache-v7.S is a partial success. I'd like to find
> out what other functions are worth optimizing (aligning) and force that
> (I guess  __attribute__((aligned(32))) could be used).
> 
> I can't really draw any conclusions from comparing System.map before and
> after above commits as they relocate thousands of symbols in one go.
> 
> Optimizing is pretty important for me for two reasons:
> 1. I want to reach maximum possible NAT masquerade performance
> 2. I need stable performance across random commits to detect regressions

[0] https://elixir.bootlin.com/linux/v5.18-rc4/K/ident/CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B

Thanks,
Al

* Re: Optimizing kernel compilation / alignments for network performance
  2022-04-27 12:56   ` Alexander Lobakin
@ 2022-04-27 17:31     ` Rafał Miłecki
From: Rafał Miłecki @ 2022-04-27 17:31 UTC (permalink / raw)
  To: Alexander Lobakin
  Cc: Network Development, linux-arm-kernel, Russell King, Andrew Lunn,
	Felix Fietkau, openwrt-devel, Florian Fainelli

On 27.04.2022 14:56, Alexander Lobakin wrote:
> From: Rafał Miłecki <zajec5@gmail.com>
> Date: Wed, 27 Apr 2022 14:04:54 +0200
> 
>> I noticed years ago that kernel changes touching code - that I don't use
>> at all - can affect network performance for me.
>>
>> I work with home routers based on Broadcom Northstar platform. Those
>> are SoCs with not-so-powerful 2 x ARM Cortex-A9 CPU cores. Main task of
>> those devices is NAT masquerade and that is what I test with iperf
>> running on two x86 machines.
>>
>> ***
>>
>> Example of such unused code change:
>> ce5013ff3bec ("mtd: spi-nor: Add support for XM25QH64A and XM25QH128A").
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ce5013ff3bec05cf2a8a05c75fcd520d9914d92b
>> It lowered my NAT speed from 381 Mb/s to 367 Mb/s (-3,5%).
>>
>> I first reported that issue it in the e-mail thread:
>> ARM router NAT performance affected by random/unrelated commits
>> https://lkml.org/lkml/2019/5/21/349
>> https://www.spinics.net/lists/linux-block/msg40624.html
>>
>> Back then it was commit 5b0890a97204 ("flow_dissector: Parse batman-adv
>> unicast headers")
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9316a9ed6895c4ad2f0cde171d486f80c55d8283
>> that increased my NAT speed from 741 Mb/s to 773 Mb/s (+4,3%).
>>
>> ***
>>
>> It appears Northstar CPUs have little cache size and so any change in
>> location of kernel symbols can affect NAT performance. That explains why
>> changing unrelated code affects anything & it has been partially proven
>> aligning some of cache-v7.S code.
>>
>> My question is: is there a way to find out & force an optimal symbols
>> locations?
> 
> Take a look at CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B[0]. I've been
> fighting with the same issue on some Realtek MIPS boards: random
> code changes in random kernel core parts were affecting NAT /
> network performance. This option resolved this I'd say, for the cost
> of slightly increased vmlinux size (almost no change in vmlinuz
> size).
> The only thing is that it was recently restricted to a set of
> architectures and MIPS and ARM32 are not included now lol. So it's
> either a matter of expanding the list (since it was restricted only
> because `-falign-functions=` is not supported on some architectures)
> or you can just do:
> 
> make KCFLAGS=-falign-functions=64 # replace 64 with your I-cache size
> 
> The actual alignment is something to play with, I stopped on the
> cacheline size, 32 in my case.
> Also, this does not provide any guarantees that you won't suffer
> from random data cacheline changes. There were some initiatives to
> introduce debug alignment of data as well, but since function are
> often bigger than 32, while variables are usually much smaller, it
> was increasing the vmlinux size by a ton (imagine each u32 variable
> occupying 32-64 bytes instead of 4). But the chance of catching this
> is much lower than to suffer from I-cache function misplacement.

Thank you Alexander, this appears to be helpful! I decided to ignore
CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B for now and just adjust CFLAGS
manually.


1. Without ce5013ff3bec and with -falign-functions=32
387 Mb/s

2. Without ce5013ff3bec and with -falign-functions=64
377 Mb/s

3. With ce5013ff3bec and with -falign-functions=32
384 Mb/s

4. With ce5013ff3bec and with -falign-functions=64
377 Mb/s


So it seems that:
1. -falign-functions=32 = pretty stable, high speed
2. -falign-functions=64 = very stable, slightly lower speed


I'm going to perform tests on more commits, but if it stays as reliable
as above, that will be a huge success for me.

* Re: Optimizing kernel compilation / alignments for network performance
  2022-04-27 17:31     ` Rafał Miłecki
@ 2022-04-29 14:18       ` Rafał Miłecki
From: Rafał Miłecki @ 2022-04-29 14:18 UTC (permalink / raw)
  To: Alexander Lobakin
  Cc: Network Development, linux-arm-kernel, Russell King, Andrew Lunn,
	Felix Fietkau, openwrt-devel, Florian Fainelli

On 27.04.2022 19:31, Rafał Miłecki wrote:
> On 27.04.2022 14:56, Alexander Lobakin wrote:
>> From: Rafał Miłecki <zajec5@gmail.com>
>> Date: Wed, 27 Apr 2022 14:04:54 +0200
>>
>>> I noticed years ago that kernel changes touching code - that I don't use
>>> at all - can affect network performance for me.
>>>
>>> I work with home routers based on Broadcom Northstar platform. Those
>>> are SoCs with not-so-powerful 2 x ARM Cortex-A9 CPU cores. Main task of
>>> those devices is NAT masquerade and that is what I test with iperf
>>> running on two x86 machines.
>>>
>>> ***
>>>
>>> Example of such unused code change:
>>> ce5013ff3bec ("mtd: spi-nor: Add support for XM25QH64A and XM25QH128A").
>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ce5013ff3bec05cf2a8a05c75fcd520d9914d92b
>>> It lowered my NAT speed from 381 Mb/s to 367 Mb/s (-3,5%).
>>>
>>> I first reported that issue it in the e-mail thread:
>>> ARM router NAT performance affected by random/unrelated commits
>>> https://lkml.org/lkml/2019/5/21/349
>>> https://www.spinics.net/lists/linux-block/msg40624.html
>>>
>>> Back then it was commit 5b0890a97204 ("flow_dissector: Parse batman-adv
>>> unicast headers")
>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9316a9ed6895c4ad2f0cde171d486f80c55d8283
>>> that increased my NAT speed from 741 Mb/s to 773 Mb/s (+4,3%).
>>>
>>> ***
>>>
>>> It appears Northstar CPUs have little cache size and so any change in
>>> location of kernel symbols can affect NAT performance. That explains why
>>> changing unrelated code affects anything & it has been partially proven
>>> aligning some of cache-v7.S code.
>>>
>>> My question is: is there a way to find out & force an optimal symbols
>>> locations?
>>
>> Take a look at CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B[0]. I've been
>> fighting with the same issue on some Realtek MIPS boards: random
>> code changes in random kernel core parts were affecting NAT /
>> network performance. This option resolved this I'd say, for the cost
>> of slightly increased vmlinux size (almost no change in vmlinuz
>> size).
>> The only thing is that it was recently restricted to a set of
>> architectures and MIPS and ARM32 are not included now lol. So it's
>> either a matter of expanding the list (since it was restricted only
>> because `-falign-functions=` is not supported on some architectures)
>> or you can just do:
>>
>> make KCFLAGS=-falign-functions=64 # replace 64 with your I-cache size
>>
>> The actual alignment is something to play with, I stopped on the
>> cacheline size, 32 in my case.
>> Also, this does not provide any guarantees that you won't suffer
>> from random data cacheline changes. There were some initiatives to
>> introduce debug alignment of data as well, but since function are
>> often bigger than 32, while variables are usually much smaller, it
>> was increasing the vmlinux size by a ton (imagine each u32 variable
>> occupying 32-64 bytes instead of 4). But the chance of catching this
>> is much lower than to suffer from I-cache function misplacement.
> 
> Thank you Alexander, this appears to be helpful! I decided to ignore
> CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B for now and just adjust CFLAGS
> manually.
> 
> 
> 1. Without ce5013ff3bec and with -falign-functions=32
> 387 Mb/s
> 
> 2. Without ce5013ff3bec and with -falign-functions=64
> 377 Mb/s
> 
> 3. With ce5013ff3bec and with -falign-functions=32
> 384 Mb/s
> 
> 4. With ce5013ff3bec and with -falign-functions=64
> 377 Mb/s
> 
> 
> So it seems that:
> 1. -falign-functions=32 = pretty stable high speed
> 2. -falign-functions=64 = very stable slightly lower speed
> 
> 
> I'm going to perform tests on more commits but if it stays so reliable
> as above that will be a huge success for me.

So sadly that doesn't work all the time. Or maybe it just works randomly.

I tried multiple commits with both -falign-functions=32 and
-falign-functions=64. I still get speed variations, about 30 Mb/s in
total. From commit to commit it's usually about 3%, but skipping a few
commits can result in up to 30 Mb/s (almost 10%).

Similarly to code changes, performance also gets affected by enabling /
disabling kernel config options. I noticed that enabling
CONFIG_CRYPTO_PCRYPT may decrease *or* increase speed depending on
-falign-functions (and surely on the kernel commit too).

┌──────────────────────┬───────────┬──────────┬───────┐
│                      │ no PCRYPT │ PCRYPT=y │ diff  │
├──────────────────────┼───────────┼──────────┼───────┤
│ No -falign-functions │ 363 Mb/s  │ 370 Mb/s │ +2%   │
│ -falign-functions=32 │ 364 Mb/s  │ 370 Mb/s │ +1.7% │
│ -falign-functions=64 │ 372 Mb/s  │ 365 Mb/s │ -2%   │
└──────────────────────┴───────────┴──────────┴───────┘

So I still don't have a reliable way of testing kernel changes for speed
regressions :(

* Re: Optimizing kernel compilation / alignments for network performance
  2022-04-27 17:31     ` Rafał Miłecki
@ 2022-04-29 14:49       ` Arnd Bergmann
From: Arnd Bergmann @ 2022-04-29 14:49 UTC (permalink / raw)
  To: Rafał Miłecki
  Cc: Alexander Lobakin, Network Development, linux-arm-kernel,
	Russell King, Andrew Lunn, Felix Fietkau, openwrt-devel,
	Florian Fainelli

On Wed, Apr 27, 2022 at 7:31 PM Rafał Miłecki <zajec5@gmail.com> wrote:
> On 27.04.2022 14:56, Alexander Lobakin wrote:

> Thank you Alexander, this appears to be helpful! I decided to ignore
> CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B for now and just adjust CFLAGS
> manually.
>
>
> 1. Without ce5013ff3bec and with -falign-functions=32
> 387 Mb/s
>
> 2. Without ce5013ff3bec and with -falign-functions=64
> 377 Mb/s
>
> 3. With ce5013ff3bec and with -falign-functions=32
> 384 Mb/s
>
> 4. With ce5013ff3bec and with -falign-functions=64
> 377 Mb/s
>
>
> So it seems that:
> 1. -falign-functions=32 = pretty stable high speed
> 2. -falign-functions=64 = very stable slightly lower speed
>
>
> I'm going to perform tests on more commits but if it stays so reliable
> as above that will be a huge success for me.

Note that the problem may not just be the alignment of a particular
function, but also how different functions map into your cache.
The Cortex-A9 has a 4-way set-associative L1 cache of 16KB, 32KB or
64KB, with a line size of 32 bytes. If you are unlucky and five
frequently called functions are spaced exactly wrong, so that they map
to the same cache set and need more than four ways, calling them in
sequence would always evict one another. The same could of course
happen if the problem is the D-cache or the L2.
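
To put numbers on it: with the 32 KB configuration, for example, there
are 32 KB / (4 ways * 32 B) = 256 sets, i.e. 8 KB per way, so code
addresses that are a multiple of 8 KB apart all compete for the same
four ways.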

Can you try to get a profile using 'perf record' to see where most
time is spent, in both the slowest and the fastest versions?
If the instruction cache is the issue, you should see how the hottest
addresses line up.
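
Roughly something like this should do, adjusting the duration to your
iperf run:

  perf record -a -g -- sleep 30
  perf report --sort symbol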

        Arnd

* Re: Optimizing kernel compilation / alignments for network performance
  2022-04-29 14:49       ` Arnd Bergmann
@ 2022-05-05 15:42         ` Rafał Miłecki
From: Rafał Miłecki @ 2022-05-05 15:42 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Alexander Lobakin, Network Development, linux-arm-kernel,
	Russell King, Andrew Lunn, Felix Fietkau, openwrt-devel,
	Florian Fainelli

On 29.04.2022 16:49, Arnd Bergmann wrote:
> On Wed, Apr 27, 2022 at 7:31 PM Rafał Miłecki <zajec5@gmail.com> wrote:
>> On 27.04.2022 14:56, Alexander Lobakin wrote:
> 
>> Thank you Alexander, this appears to be helpful! I decided to ignore
>> CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B for now and just adjust CFLAGS
>> manually.
>>
>>
>> 1. Without ce5013ff3bec and with -falign-functions=32
>> 387 Mb/s
>>
>> 2. Without ce5013ff3bec and with -falign-functions=64
>> 377 Mb/s
>>
>> 3. With ce5013ff3bec and with -falign-functions=32
>> 384 Mb/s
>>
>> 4. With ce5013ff3bec and with -falign-functions=64
>> 377 Mb/s
>>
>>
>> So it seems that:
>> 1. -falign-functions=32 = pretty stable high speed
>> 2. -falign-functions=64 = very stable slightly lower speed
>>
>>
>> I'm going to perform tests on more commits but if it stays so reliable
>> as above that will be a huge success for me.
> 
> Note that the problem may not just be the alignment of a particular
> function, but also how different function map into your cache.
> The Cortex-A9 has a 4-way set-associative L1 cache of 16KB, 32KB or
> 64KB, with a line size of 32 bytes. If you are unlucky and you get
> five different functions that are frequently called and are a multiple
> functions are exactly the wrong spacing that they need more than
> four ways, calling them in sequence would always evict the other
> ones. The same could of course happen if the problem is the D-cache
> or the L2.
> 
> Can you try to get a profile using 'perf record' to see where most
> time is spent, in both the slowest and the fastest versions?
> If the instruction cache is the issue, you should see how the hottest
> addresses line up.

Your explanation sounds sane of course.

If you take a look at my old e-mail
ARM router NAT performance affected by random/unrelated commits
https://lkml.org/lkml/2019/5/21/349
https://www.spinics.net/lists/linux-block/msg40624.html

you'll see that the most used functions are:
v7_dma_inv_range
__irqentry_text_end
l2c210_inv_range
v7_dma_clean_range
bcma_host_soc_read32
__netif_receive_skb_core
arch_cpu_idle
l2c210_clean_range
fib_table_lookup

Is there a way to optimize the kernel for optimal cache usage of the
selected (above) functions?


Meanwhile I was testing -fno-reorder-blocks, which some OpenWrt folks
reported as worth trying. It's another source of randomness: it
stabilizes NAT performance across some commits and breaks stability
across others.
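
For reference, it can be passed the same way as the alignment flags,
e.g.:

make KCFLAGS=-fno-reorder-blocks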

* Re: Optimizing kernel compilation / alignments for network performance
  2022-05-05 15:42         ` Rafał Miłecki
@ 2022-05-05 16:04           ` Andrew Lunn
From: Andrew Lunn @ 2022-05-05 16:04 UTC (permalink / raw)
  To: Rafał Miłecki
  Cc: Arnd Bergmann, Alexander Lobakin, Network Development,
	linux-arm-kernel, Russell King, Felix Fietkau, openwrt-devel,
	Florian Fainelli

> you'll see that most used functions are:
> v7_dma_inv_range
> __irqentry_text_end
> l2c210_inv_range
> v7_dma_clean_range
> bcma_host_soc_read32
> __netif_receive_skb_core
> arch_cpu_idle
> l2c210_clean_range
> fib_table_lookup

There are a lot of cache management functions here. Might sound odd,
but have you tried disabling SMP? These cache functions need to
operate across all CPUs, and the communication between CPUs can slow
them down. If there is only one CPU, these cache functions get simpler
and faster.

It just depends on your workload. If you have 1 CPU loaded to 100% and
the other 3 idle, you might see an improvement. If you actually need
more than one CPU, it will probably be worse.

I've also found that some Ethernet drivers invalidate or flush too
much. If you are sending a 64 byte TCP ACK, all you need to flush is
64 bytes, not the full 1500 MTU. If you receive a TCP ACK, and then
recycle the buffer, all you need to invalidate is the size of the ACK,
so long as you can guarantee nothing has touched the memory above it.
But you need to be careful when implementing tricks like this, or you
can get subtle corruption bugs when you get it wrong.
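
Roughly the idea, with made-up field names rather than code from any
specific driver:

	/* TX: clean only the bytes of this frame, not the whole buffer */
	dma_sync_single_for_device(dev, slot->dma_addr, skb->len, DMA_TO_DEVICE);

	/* RX recycle: invalidate only what the hardware actually wrote */
	dma_sync_single_for_cpu(dev, slot->dma_addr, pkt_len, DMA_FROM_DEVICE);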

    Andrew

* Re: Optimizing kernel compilation / alignments for network performance
  2022-05-05 16:04           ` Andrew Lunn
@ 2022-05-05 16:46             ` Felix Fietkau
From: Felix Fietkau @ 2022-05-05 16:46 UTC (permalink / raw)
  To: Andrew Lunn, Rafał Miłecki
  Cc: Arnd Bergmann, Alexander Lobakin, Network Development,
	linux-arm-kernel, Russell King, openwrt-devel, Florian Fainelli


On 05.05.22 18:04, Andrew Lunn wrote:
>> you'll see that most used functions are:
>> v7_dma_inv_range
>> __irqentry_text_end
>> l2c210_inv_range
>> v7_dma_clean_range
>> bcma_host_soc_read32
>> __netif_receive_skb_core
>> arch_cpu_idle
>> l2c210_clean_range
>> fib_table_lookup
> 
> There is a lot of cache management functions here. Might sound odd,
> but have you tried disabling SMP? These cache functions need to
> operate across all CPUs, and the communication between CPUs can slow
> them down. If there is only one CPU, these cache functions get simpler
> and faster.
> 
> It just depends on your workload. If you have 1 CPU loaded to 100% and
> the other 3 idle, you might see an improvement. If you actually need
> more than one CPU, it will probably be worse.
> 
> I've also found that some Ethernet drivers invalidate or flush too
> much. If you are sending a 64 byte TCP ACK, all you need to flush is
> 64 bytes, not the full 1500 MTU. If you receive a TCP ACK, and then
> recycle the buffer, all you need to invalidate is the size of the ACK,
> so long as you can guarantee nothing has touched the memory above it.
> But you need to be careful when implementing tricks like this, or you
> can get subtle corruption bugs when you get it wrong.
I just took a quick look at the driver. It allocates and maps rx buffers
that can cover a packet size of BGMAC_RX_MAX_FRAME_SIZE = 9724.
This seems rather excessive, especially since most people are going to
use an MTU of 1500.
My proposal would be to add support for making the rx buffer size
dependent on the MTU, reallocating the ring on MTU changes.
This should significantly reduce the time spent on flushing caches.
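
Roughly what I have in mind (made-up names, not actual bgmac code):

static int foo_change_mtu(struct net_device *net_dev, int new_mtu)
{
	struct foo_priv *priv = netdev_priv(net_dev);

	net_dev->mtu = new_mtu;
	priv->rx_buf_size = SKB_DATA_ALIGN(ETH_HLEN + new_mtu + ETH_FCS_LEN);

	if (netif_running(net_dev)) {
		/* hypothetical helpers that tear down and rebuild the rx
		 * ring with buffers sized for the new MTU */
		foo_dma_rx_ring_free(priv);
		return foo_dma_rx_ring_alloc(priv);
	}

	return 0;
}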

- Felix

* Re: Optimizing kernel compilation / alignments for network performance
  2022-05-05 16:04           ` Andrew Lunn
@ 2022-05-06  7:44             ` Rafał Miłecki
From: Rafał Miłecki @ 2022-05-06  7:44 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Arnd Bergmann, Alexander Lobakin, Network Development,
	linux-arm-kernel, Russell King, Felix Fietkau, openwrt-devel,
	Florian Fainelli

On 5.05.2022 18:04, Andrew Lunn wrote:
>> you'll see that most used functions are:
>> v7_dma_inv_range
>> __irqentry_text_end
>> l2c210_inv_range
>> v7_dma_clean_range
>> bcma_host_soc_read32
>> __netif_receive_skb_core
>> arch_cpu_idle
>> l2c210_clean_range
>> fib_table_lookup
> 
> There is a lot of cache management functions here. Might sound odd,
> but have you tried disabling SMP? These cache functions need to
> operate across all CPUs, and the communication between CPUs can slow
> them down. If there is only one CPU, these cache functions get simpler
> and faster.
> 
> It just depends on your workload. If you have 1 CPU loaded to 100% and
> the other 3 idle, you might see an improvement. If you actually need
> more than one CPU, it will probably be worse.

It seems to lower my NAT speed from ~362 Mb/s to 320 Mb/s but it feels
more stable now (lower variations). Let me spend some time on more
testing.


FWIW during all my tests I was using:
echo 2 > /sys/class/net/eth0/queues/rx-0/rps_cpus
that is what I need to get similar speeds across iperf sessions

With
echo 0 > /sys/class/net/eth0/queues/rx-0/rps_cpus
my NAT speeds were jumping between 4 speeds:
273 Mbps / 315 Mbps / 353 Mbps / 425 Mbps
(every time I started iperf the kernel jumped into one state and kept
  the same iperf speed until I stopped it and started another session)

With
echo 1 > /sys/class/net/eth0/queues/rx-0/rps_cpus
my NAT speeds were jumping between 2 speeds:
284 Mbps / 408 Mbps


> I've also found that some Ethernet drivers invalidate or flush too
> much. If you are sending a 64 byte TCP ACK, all you need to flush is
> 64 bytes, not the full 1500 MTU. If you receive a TCP ACK, and then
> recycle the buffer, all you need to invalidate is the size of the ACK,
> so long as you can guarantee nothing has touched the memory above it.
> But you need to be careful when implementing tricks like this, or you
> can get subtle corruption bugs when you get it wrong.

That was actually bgmac's initial behaviour, see commit 92b9ccd34a90
("bgmac: pass received packet to the netif instead of copying it"):
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=92b9ccd34a9053c628d230fe27a7e0c10179910f

I think it was Felix who suggested that I avoid skb_copy*(), and it
seems it improved performance indeed.

* Re: Optimizing kernel compilation / alignments for network performance
  2022-05-05 16:46             ` Felix Fietkau
@ 2022-05-06  7:47               ` Rafał Miłecki
From: Rafał Miłecki @ 2022-05-06  7:47 UTC (permalink / raw)
  To: Felix Fietkau, Andrew Lunn
  Cc: Arnd Bergmann, Alexander Lobakin, Network Development,
	linux-arm-kernel, Russell King, openwrt-devel, Florian Fainelli

On 5.05.2022 18:46, Felix Fietkau wrote:
> 
> On 05.05.22 18:04, Andrew Lunn wrote:
>>> you'll see that most used functions are:
>>> v7_dma_inv_range
>>> __irqentry_text_end
>>> l2c210_inv_range
>>> v7_dma_clean_range
>>> bcma_host_soc_read32
>>> __netif_receive_skb_core
>>> arch_cpu_idle
>>> l2c210_clean_range
>>> fib_table_lookup
>>
>> There is a lot of cache management functions here. Might sound odd,
>> but have you tried disabling SMP? These cache functions need to
>> operate across all CPUs, and the communication between CPUs can slow
>> them down. If there is only one CPU, these cache functions get simpler
>> and faster.
>>
>> It just depends on your workload. If you have 1 CPU loaded to 100% and
>> the other 3 idle, you might see an improvement. If you actually need
>> more than one CPU, it will probably be worse.
>>
>> I've also found that some Ethernet drivers invalidate or flush too
>> much. If you are sending a 64 byte TCP ACK, all you need to flush is
>> 64 bytes, not the full 1500 MTU. If you receive a TCP ACK, and then
>> recycle the buffer, all you need to invalidate is the size of the ACK,
>> so long as you can guarantee nothing has touched the memory above it.
>> But you need to be careful when implementing tricks like this, or you
>> can get subtle corruption bugs when you get it wrong.
> I just took a quick look at the driver. It allocates and maps rx buffers that can cover a packet size of BGMAC_RX_MAX_FRAME_SIZE = 9724.
> This seems rather excessive, especially since most people are going to use a MTU of 1500.
> My proposal would be to add support for making rx buffer size dependent on MTU, reallocating the ring on MTU changes.
> This should significantly reduce the time spent on flushing caches.

Oh, that's important too, it was changed by commit 8c7da63978f1 ("bgmac:
configure MTU and add support for frames beyond 8192 byte size"):
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8c7da63978f1672eb4037bbca6e7eac73f908f03

It lowered NAT speed with bgmac by 60% (362 Mbps → 140 Mbps).

I do all my testing with
#define BGMAC_RX_MAX_FRAME_SIZE			1536
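
A rough sketch of what Felix's MTU-dependent sizing could look like (the
helper below is hypothetical, not an existing bgmac function; the ring
would still need to be reallocated from .ndo_change_mtu):

	static unsigned int bgmac_rx_buf_size(unsigned int mtu)
	{
		unsigned int frame = ETH_HLEN + VLAN_HLEN + mtu + ETH_FCS_LEN;

		return SKB_DATA_ALIGN(BGMAC_RX_FRAME_OFFSET +
				      BGMAC_RX_BUF_OFFSET + frame) +
		       SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
	}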

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Optimizing kernel compilation / alignments for network performance
  2022-05-06  7:44             ` Rafał Miłecki
@ 2022-05-06  8:45               ` Arnd Bergmann
  -1 siblings, 0 replies; 44+ messages in thread
From: Arnd Bergmann @ 2022-05-06  8:45 UTC (permalink / raw)
  To: Rafał Miłecki
  Cc: Andrew Lunn, Arnd Bergmann, Alexander Lobakin,
	Network Development, linux-arm-kernel, Russell King,
	Felix Fietkau, openwrt-devel, Florian Fainelli

On Fri, May 6, 2022 at 9:44 AM Rafał Miłecki <zajec5@gmail.com> wrote:
>
> On 5.05.2022 18:04, Andrew Lunn wrote:
> >> you'll see that most used functions are:
> >> v7_dma_inv_range
> >> __irqentry_text_end
> >> l2c210_inv_range
> >> v7_dma_clean_range
> >> bcma_host_soc_read32
> >> __netif_receive_skb_core
> >> arch_cpu_idle
> >> l2c210_clean_range
> >> fib_table_lookup
> >
> > There is a lot of cache management functions here.

Indeed, so optimizing the coherency management (see Felix' reply)
is likely to help most in making the driver faster, but that does not
explain why the alignment of the object code has such a big impact
on performance.

To investigate the alignment further, what I was actually looking for
is a comparison of the profile of the slow and fast case. Here I would
expect that the slow case spends more time in one of the functions
that don't deal with cache management (maybe fib_table_lookup or
__netif_receive_skb_core).

A few other thoughts:

- bcma_host_soc_read32() is a fundamentally slow operation, maybe
  some of the calls can be turned into a relaxed read, like the readback
  in bgmac_chip_intrs_off() or the 'poll again' at the end of bgmac_poll(),
  though obviously not the one in bgmac_dma_rx_read() (see the sketch
  after this list).
  It may be possible to even avoid some of the reads entirely; checking
  for more data in bgmac_poll() may actually be counterproductive
  depending on the workload.

- The higher-end networking SoCs are usually cache-coherent and
  can avoid the cache management entirely. There is a slim chance
  that this chip is designed that way and it just needs to be enabled
  properly. Most low-end chips don't implement the coherent
  interconnect though, and I suppose you have checked this already.

- bgmac_dma_rx_update_index() and bgmac_dma_tx_add() appear
  to have an extraneous dma_wmb(), which should be implied by the
  non-relaxed writel() in bgmac_write().

- accesses to the DMA descriptor don't show up in the profile here,
  but look like they can get misoptimized by the compiler. I would
  generally use READ_ONCE() and WRITE_ONCE() for these to
  ensure that you don't end up with extra or out-of-order accesses.
  This also makes it clearer to the reader that something special
  happens here.
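
A minimal sketch of the relaxed-read idea, assuming a plain ioremap()ed
MMIO window (base and REG_STATUS here are hypothetical; bgmac really goes
through the bcma bus accessors):

	u32 val;

	val = readl(base + REG_STATUS);		/* ordered MMIO read, includes barriers */
	val = readl_relaxed(base + REG_STATUS);	/* no barriers: fine for readbacks/polling */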

> > Might sound odd,
> > but have you tried disabling SMP? These cache functions need to
> > operate across all CPUs, and the communication between CPUs can slow
> > them down. If there is only one CPU, these cache functions get simpler
> > and faster.
> >
> > It just depends on your workload. If you have 1 CPU loaded to 100% and
> > the other 3 idle, you might see an improvement. If you actually need
> > more than one CPU, it will probably be worse.
>
> It seems to lower my NAT speed from ~362 Mb/s to 320 Mb/s but it feels
> more stable now (lower variations). Let me spend some time on more
> testing.
>
>
> FWIW during all my tests I was using:
> echo 2 > /sys/class/net/eth0/queues/rx-0/rps_cpus
> that is what I need to get similar speeds across iperf sessions
>
> With
> echo 0 > /sys/class/net/eth0/queues/rx-0/rps_cpus
> my NAT speeds were jumping between 4 speeds:
> 273 Mbps / 315 Mbps / 353 Mbps / 425 Mbps
> (every time I started iperf kernel jumped into one state and kept the
>   same iperf speed until stopping it and starting another session)
>
> With
> echo 1 > /sys/class/net/eth0/queues/rx-0/rps_cpus
> my NAT speeds were jumping between 2 speeds:
> 284 Mbps / 408 Mbps

Can you try using 'numactl -C' to pin the iperf processes to
a particular CPU core? This may be related to the locality of
the user process relative to where the interrupts end up.

        Arnd

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Optimizing kernel compilation / alignments for network performance
  2022-05-06  8:45               ` Arnd Bergmann
@ 2022-05-06  8:55                 ` Rafał Miłecki
  -1 siblings, 0 replies; 44+ messages in thread
From: Rafał Miłecki @ 2022-05-06  8:55 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Andrew Lunn, Alexander Lobakin, Network Development,
	linux-arm-kernel, Russell King, Felix Fietkau, openwrt-devel,
	Florian Fainelli

On 6.05.2022 10:45, Arnd Bergmann wrote:
> On Fri, May 6, 2022 at 9:44 AM Rafał Miłecki <zajec5@gmail.com> wrote:
>>
>> On 5.05.2022 18:04, Andrew Lunn wrote:
>>>> you'll see that most used functions are:
>>>> v7_dma_inv_range
>>>> __irqentry_text_end
>>>> l2c210_inv_range
>>>> v7_dma_clean_range
>>>> bcma_host_soc_read32
>>>> __netif_receive_skb_core
>>>> arch_cpu_idle
>>>> l2c210_clean_range
>>>> fib_table_lookup
>>>
>>> There is a lot of cache management functions here.
> 
> Indeed, so optimizing the coherency management (see Felix' reply)
> is likely to help most in making the driver faster, but that does not
> explain why the alignment of the object code has such a big impact
> on performance.
> 
> To investigate the alignment further, what I was actually looking for
> is a comparison of the profile of the slow and fast case. Here I would
> expect that the slow case spends more time in one of the functions
> that don't deal with cache management (maybe fib_table_lookup or
> __netif_receive_skb_core).
> 
> A few other thoughts:
> 
> - bcma_host_soc_read32() is a fundamentally slow operation, maybe
>    some of the calls can be turned into a relaxed read, like the readback
>    in bgmac_chip_intrs_off() or the 'poll again' at the end of bgmac_poll(),
>    though obviously not the one in bgmac_dma_rx_read().
>    It may be possible to even avoid some of the reads entirely, checking
>    for more data in bgmac_poll() may actually be counterproductive
>    depending on the workload.
> 
> - The higher-end networking SoCs are usually cache-coherent and
>    can avoid the cache management entirely. There is a slim chance
>    that this chip is designed that way and it just needs to be enabled
>    properly. Most low-end chips don't implement the coherent
>    interconnect though, and I suppose you have checked this already.
> 
> - bgmac_dma_rx_update_index() and bgmac_dma_tx_add() appear
>    to have an extraneous dma_wmb(), which should be implied by the
>    non-relaxed writel() in bgmac_write().
> 
> - accesses to the DMA descriptor don't show up in the profile here,
>    but look like they can get misoptimized by the compiler. I would
>    generally use READ_ONCE() and WRITE_ONCE() for these to
>    ensure that you don't end up with extra or out-of-order accesses.
>    This also makes it clearer to the reader that something special
>    happens here.
> 
>>> Might sound odd,
>>> but have you tried disabling SMP? These cache functions need to
>>> operate across all CPUs, and the communication between CPUs can slow
>>> them down. If there is only one CPU, these cache functions get simpler
>>> and faster.
>>>
>>> It just depends on your workload. If you have 1 CPU loaded to 100% and
>>> the other 3 idle, you might see an improvement. If you actually need
>>> more than one CPU, it will probably be worse.
>>
>> It seems to lower my NAT speed from ~362 Mb/s to 320 Mb/s but it feels
>> more stable now (lower variations). Let me spend some time on more
>> testing.
>>
>>
>> FWIW during all my tests I was using:
>> echo 2 > /sys/class/net/eth0/queues/rx-0/rps_cpus
>> that is what I need to get similar speeds across iperf sessions
>>
>> With
>> echo 0 > /sys/class/net/eth0/queues/rx-0/rps_cpus
>> my NAT speeds were jumping between 4 speeds:
>> 273 Mbps / 315 Mbps / 353 Mbps / 425 Mbps
>> (every time I started iperf kernel jumped into one state and kept the
>>    same iperf speed until stopping it and starting another session)
>>
>> With
>> echo 1 > /sys/class/net/eth0/queues/rx-0/rps_cpus
>> my NAT speeds were jumping between 2 speeds:
>> 284 Mbps / 408 Mbps
> 
> Can you try using 'numactl -C' to pin the iperf processes to
> a particular CPU core? This may be related to the locality of
> the user process relative to where the interrupts end up.

I run iperf on x86 machines connected to the router's WAN and LAN ports.
It's meant to emulate an end user just downloading some data from /
uploading it to the Internet.

The router's only task here is doing masquerade NAT.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Optimizing kernel compilation / alignments for network performance
  2022-05-06  8:55                 ` Rafał Miłecki
@ 2022-05-06  9:44                   ` Arnd Bergmann
  -1 siblings, 0 replies; 44+ messages in thread
From: Arnd Bergmann @ 2022-05-06  9:44 UTC (permalink / raw)
  To: Rafał Miłecki
  Cc: Arnd Bergmann, Andrew Lunn, Alexander Lobakin,
	Network Development, linux-arm-kernel, Russell King,
	Felix Fietkau, openwrt-devel, Florian Fainelli

On Fri, May 6, 2022 at 10:55 AM Rafał Miłecki <zajec5@gmail.com> wrote:
> On 6.05.2022 10:45, Arnd Bergmann wrote:
> > On Fri, May 6, 2022 at 9:44 AM Rafał Miłecki <zajec5@gmail.com> wrote:
> >> With
> >> echo 1 > /sys/class/net/eth0/queues/rx-0/rps_cpus
> >> my NAT speeds were jumping between 2 speeds:
> >> 284 Mbps / 408 Mbps
> >
> > Can you try using 'numactl -C' to pin the iperf processes to
> > a particular CPU core? This may be related to the locality of
> > the user process relative to where the interrupts end up.
>
> I run iperf on x86 machines connected to the router's WAN and LAN ports.
> It's meant to emulate an end user just downloading some data from /
> uploading it to the Internet.
>
> The router's only task here is doing masquerade NAT.

Ah, makes sense. Can you observe the CPU usage to be on
a particular core in the slow vs fast case then?

        Arnd

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Optimizing kernel compilation / alignments for network performance
  2022-05-06  7:47               ` Rafał Miłecki
@ 2022-05-06 12:42                 ` Andrew Lunn
  -1 siblings, 0 replies; 44+ messages in thread
From: Andrew Lunn @ 2022-05-06 12:42 UTC (permalink / raw)
  To: Rafał Miłecki
  Cc: Felix Fietkau, Arnd Bergmann, Alexander Lobakin,
	Network Development, linux-arm-kernel, Russell King,
	openwrt-devel, Florian Fainelli

> > I just took a quick look at the driver. It allocates and maps rx buffers that can cover a packet size of BGMAC_RX_MAX_FRAME_SIZE = 9724.
> > This seems rather excessive, especially since most people are going to use a MTU of 1500.
> > My proposal would be to add support for making rx buffer size dependent on MTU, reallocating the ring on MTU changes.
> > This should significantly reduce the time spent on flushing caches.
> 
> Oh, that's important too, it was changed by commit 8c7da63978f1 ("bgmac:
> configure MTU and add support for frames beyond 8192 byte size"):
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8c7da63978f1672eb4037bbca6e7eac73f908f03
> 
> It lowered NAT speed with bgmac by 60% (362 Mbps → 140 Mbps).
> 
> I do all my testing with
> #define BGMAC_RX_MAX_FRAME_SIZE			1536

That helps show that cache operations are part of your bottleneck.

Taking a quick look at the driver. On the receive side:

                       /* Unmap buffer to make it accessible to the CPU */
                        dma_unmap_single(dma_dev, dma_addr,
                                         BGMAC_RX_BUF_SIZE, DMA_FROM_DEVICE);

Here the data is made ready for the CPU to use.

			/* Get info from the header */
                        len = le16_to_cpu(rx->len);
                        flags = le16_to_cpu(rx->flags);

                        /* Check for poison and drop or pass the packet */
                        if (len == 0xdead && flags == 0xbeef) {
                                netdev_err(bgmac->net_dev, "Found poisoned packet at slot %d, DMA issue!\n",
                                           ring->start);
                                put_page(virt_to_head_page(buf));
                                bgmac->net_dev->stats.rx_errors++;
                                break;
                        }

                        if (len > BGMAC_RX_ALLOC_SIZE) {
                                netdev_err(bgmac->net_dev, "Found oversized packet at slot %d, DMA issue!\n",
                                           ring->start);
                                put_page(virt_to_head_page(buf));
                                bgmac->net_dev->stats.rx_length_errors++;
                                bgmac->net_dev->stats.rx_errors++;
                                break;
                        }

                        /* Omit CRC. */
                        len -= ETH_FCS_LEN;

                        skb = build_skb(buf, BGMAC_RX_ALLOC_SIZE);
                        if (unlikely(!skb)) {
                                netdev_err(bgmac->net_dev, "build_skb failed\n");
                                put_page(virt_to_head_page(buf));
                                bgmac->net_dev->stats.rx_errors++;
                                break;
                        }
                        skb_put(skb, BGMAC_RX_FRAME_OFFSET +
                                BGMAC_RX_BUF_OFFSET + len);
                        skb_pull(skb, BGMAC_RX_FRAME_OFFSET +
                                 BGMAC_RX_BUF_OFFSET);

                        skb_checksum_none_assert(skb);
                        skb->protocol = eth_type_trans(skb, bgmac->net_dev);

and this is the first access of the actual data. You can make the
cache actually work for you, rather than against you, by adding a call to

	prefetch(buf);

just after the dma_unmap_single(). That will start getting the frame
header from DRAM into cache, so hopefully it is available by the time
eth_type_trans() is called and you don't have a cache miss.
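
Roughly (placement sketch only; prefetch() comes from <linux/prefetch.h>):

	dma_unmap_single(dma_dev, dma_addr,
			 BGMAC_RX_BUF_SIZE, DMA_FROM_DEVICE);
	prefetch(buf);	/* start pulling the frame header into cache */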

	Andrew

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Optimizing kernel compilation / alignments for network performance
  2022-05-06  7:44             ` Rafał Miłecki
@ 2022-05-08  9:53               ` Rafał Miłecki
  -1 siblings, 0 replies; 44+ messages in thread
From: Rafał Miłecki @ 2022-05-08  9:53 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Arnd Bergmann, Alexander Lobakin, Network Development,
	linux-arm-kernel, Russell King, Felix Fietkau, openwrt-devel,
	Florian Fainelli

On 6.05.2022 09:44, Rafał Miłecki wrote:
> On 5.05.2022 18:04, Andrew Lunn wrote:
>>> you'll see that most used functions are:
>>> v7_dma_inv_range
>>> __irqentry_text_end
>>> l2c210_inv_range
>>> v7_dma_clean_range
>>> bcma_host_soc_read32
>>> __netif_receive_skb_core
>>> arch_cpu_idle
>>> l2c210_clean_range
>>> fib_table_lookup
>>
>> There is a lot of cache management functions here. Might sound odd,
>> but have you tried disabling SMP? These cache functions need to
>> operate across all CPUs, and the communication between CPUs can slow
>> them down. If there is only one CPU, these cache functions get simpler
>> and faster.
>>
>> It just depends on your workload. If you have 1 CPU loaded to 100% and
>> the other 3 idle, you might see an improvement. If you actually need
>> more than one CPU, it will probably be worse.
> 
> It seems to lower my NAT speed from ~362 Mb/s to 320 Mb/s but it feels
> more stable now (lower variations). Let me spend some time on more
> testing.

For context, I test various kernel commits / configs using:
iperf -t 120 -i 10 -c 192.168.13.1


I did more testing with # CONFIG_SMP is not set

Good thing:
During a single iperf session I get noticeably more stable speed.
With SMP: x ± 2,86%
Without SMP: x ± 0,96%

Bad thing:
Across kernel commits / config changes speed still varies.


So disabling CONFIG_SMP won't help me look for kernel regressions.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Optimizing kernel compilation / alignments for network performance
  2022-05-06 12:42                 ` Andrew Lunn
@ 2022-05-10 10:29                   ` Rafał Miłecki
  -1 siblings, 0 replies; 44+ messages in thread
From: Rafał Miłecki @ 2022-05-10 10:29 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Felix Fietkau, Arnd Bergmann, Alexander Lobakin,
	Network Development, linux-arm-kernel, Russell King,
	openwrt-devel, Florian Fainelli

On 6.05.2022 14:42, Andrew Lunn wrote:
>>> I just took a quick look at the driver. It allocates and maps rx buffers that can cover a packet size of BGMAC_RX_MAX_FRAME_SIZE = 9724.
>>> This seems rather excessive, especially since most people are going to use a MTU of 1500.
>>> My proposal would be to add support for making rx buffer size dependent on MTU, reallocating the ring on MTU changes.
>>> This should significantly reduce the time spent on flushing caches.
>>
>> Oh, that's important too, it was changed by commit 8c7da63978f1 ("bgmac:
>> configure MTU and add support for frames beyond 8192 byte size"):
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8c7da63978f1672eb4037bbca6e7eac73f908f03
>>
>> It lowered NAT speed with bgmac by 60% (362 Mbps → 140 Mbps).
>>
>> I do all my testing with
>> #define BGMAC_RX_MAX_FRAME_SIZE			1536
> 
> That helps show that cache operations are part of your bottleneck.
> 
> Taking a quick look at the driver. On the receive side:
> 
>                         /* Unmap buffer to make it accessible to the CPU */
>                          dma_unmap_single(dma_dev, dma_addr,
>                                           BGMAC_RX_BUF_SIZE, DMA_FROM_DEVICE);
> 
> Here the data is made ready for the CPU to use.
> 
> 			/* Get info from the header */
>                          len = le16_to_cpu(rx->len);
>                          flags = le16_to_cpu(rx->flags);
> 
>                          /* Check for poison and drop or pass the packet */
>                          if (len == 0xdead && flags == 0xbeef) {
>                                  netdev_err(bgmac->net_dev, "Found poisoned packet at slot %d, DMA issue!\n",
>                                             ring->start);
>                                  put_page(virt_to_head_page(buf));
>                                  bgmac->net_dev->stats.rx_errors++;
>                                  break;
>                          }
> 
>                          if (len > BGMAC_RX_ALLOC_SIZE) {
>                                  netdev_err(bgmac->net_dev, "Found oversized packet at slot %d, DMA issue!\n",
>                                             ring->start);
>                                  put_page(virt_to_head_page(buf));
>                                  bgmac->net_dev->stats.rx_length_errors++;
>                                  bgmac->net_dev->stats.rx_errors++;
>                                  break;
>                          }
> 
>                          /* Omit CRC. */
>                          len -= ETH_FCS_LEN;
> 
>                          skb = build_skb(buf, BGMAC_RX_ALLOC_SIZE);
>                          if (unlikely(!skb)) {
>                                  netdev_err(bgmac->net_dev, "build_skb failed\n");
>                                  put_page(virt_to_head_page(buf));
>                                  bgmac->net_dev->stats.rx_errors++;
>                                  break;
>                          }
>                          skb_put(skb, BGMAC_RX_FRAME_OFFSET +
>                                  BGMAC_RX_BUF_OFFSET + len);
>                          skb_pull(skb, BGMAC_RX_FRAME_OFFSET +
>                                   BGMAC_RX_BUF_OFFSET);
> 
>                          skb_checksum_none_assert(skb);
>                          skb->protocol = eth_type_trans(skb, bgmac->net_dev);
> 
> and this is the first access of the actual data. You can make the
> cache actually work for you, rather than against you, by adding a call to
> 
> 	prefetch(buf);
> 
> just after the dma_unmap_single(). That will start getting the frame
> header from DRAM into cache, so hopefully it is available by the time
> eth_type_trans() is called and you don't have a cache miss.


I don't think that analysis is correct.

Please take a look at the following lines:
struct bgmac_rx_header *rx = slot->buf + BGMAC_RX_BUF_OFFSET;
void *buf = slot->buf;

The first thing we do after the dma_unmap_single() call is read rx->len.
That already touches the DMA data. There is nothing we could keep the CPU
busy with while prefetching the data.

FWIW I tried adding prefetch(buf); anyway. It didn't change NAT speed by
even 1 Mb/s. Speed was exactly the same as without the prefetch() call.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Optimizing kernel compilation / alignments for network performance
  2022-05-06  8:45               ` Arnd Bergmann
@ 2022-05-10 11:23                 ` Rafał Miłecki
  -1 siblings, 0 replies; 44+ messages in thread
From: Rafał Miłecki @ 2022-05-10 11:23 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Andrew Lunn, Alexander Lobakin, Network Development,
	linux-arm-kernel, Russell King, Felix Fietkau, openwrt-devel,
	Florian Fainelli

On 6.05.2022 10:45, Arnd Bergmann wrote:
> On Fri, May 6, 2022 at 9:44 AM Rafał Miłecki <zajec5@gmail.com> wrote:
>>
>> On 5.05.2022 18:04, Andrew Lunn wrote:
>>>> you'll see that most used functions are:
>>>> v7_dma_inv_range
>>>> __irqentry_text_end
>>>> l2c210_inv_range
>>>> v7_dma_clean_range
>>>> bcma_host_soc_read32
>>>> __netif_receive_skb_core
>>>> arch_cpu_idle
>>>> l2c210_clean_range
>>>> fib_table_lookup
>>>
>>> There is a lot of cache management functions here.
> 
> Indeed, so optimizing the coherency management (see Felix' reply)
> is likely to help most in making the driver faster, but that does not
> explain why the alignment of the object code has such a big impact
> on performance.
> 
> To investigate the alignment further, what I was actually looking for
> is a comparison of the profile of the slow and fast case. Here I would
> expect that the slow case spends more time in one of the functions
> that don't deal with cache management (maybe fib_table_lookup or
> __netif_receive_skb_core).
> 
> A few other thoughts:
> 
> - bcma_host_soc_read32() is a fundamentally slow operation, maybe
>    some of the calls can be turned into a relaxed read, like the readback
>    in bgmac_chip_intrs_off() or the 'poll again' at the end of bgmac_poll(),
>    though obviously not the one in bgmac_dma_rx_read().
>    It may be possible to even avoid some of the reads entirely, checking
>    for more data in bgmac_poll() may actually be counterproductive
>    depending on the workload.

I'll experiment with that, hopefully I can optimize it a bit.


> - The higher-end networking SoCs are usually cache-coherent and
>    can avoid the cache management entirely. There is a slim chance
>    that this chip is designed that way and it just needs to be enabled
>    properly. Most low-end chips don't implement the coherent
>    interconnect though, and I suppose you have checked this already.

To the best of my knowledge the Northstar platform doesn't support hw
coherency.

I just took an extra look at Broadcom's SDK and they seem to have such a
driver for selected chipsets, but BCM708 isn't there.

config BCM_GLB_COHERENCY
	bool "Global Hardware Cache Coherency"
	default n
	depends on BCM963158 || BCM96846 || BCM96858 || BCM96856 || BCM963178 || BCM947622 || BCM963146  || BCM94912 || BCM96813 || BCM96756 || BCM96855


> - bgmac_dma_rx_update_index() and bgmac_dma_tx_add() appear
>    to have an extraneous dma_wmb(), which should be implied by the
>    non-relaxed writel() in bgmac_write().

I tried dropping wmb() calls.
With wmb(): 421 Mb/s
Without: 418 Mb/s


I also tried dropping bgmac_read() from bgmac_chip_intrs_off() which
seems to be a flushing readback.

With bgmac_read(): 421 Mb/s
Without: 413 Mb/s
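
For context, the readback being dropped here is the usual way of flushing
a posted MMIO write, roughly (the register constant below is illustrative):

	bgmac_write(bgmac, BGMAC_INT_MASK, 0);	/* mask interrupts */
	bgmac_read(bgmac, BGMAC_INT_MASK);	/* readback forces the write to complete */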


> - accesses to the DMA descriptor don't show up in the profile here,
>    but look like they can get misoptimized by the compiler. I would
>    generally use READ_ONCE() and WRITE_ONCE() for these to
>    ensure that you don't end up with extra or out-of-order accesses.
>    This also makes it clearer to the reader that something special
>    happens here.

Should I use something like the following?

FWIW it doesn't seem to change NAT performance.
Without WRITE_ONCE: 421 Mb/s
With: 419 Mb/s


diff --git a/drivers/net/ethernet/broadcom/bgmac.c b/drivers/net/ethernet/broadcom/bgmac.c
index 87700072..ce98f2a9 100644
--- a/drivers/net/ethernet/broadcom/bgmac.c
+++ b/drivers/net/ethernet/broadcom/bgmac.c
@@ -119,10 +119,10 @@ bgmac_dma_tx_add_buf(struct bgmac *bgmac, struct bgmac_dma_ring *ring,

  	slot = &ring->slots[i];
  	dma_desc = &ring->cpu_base[i];
-	dma_desc->addr_low = cpu_to_le32(lower_32_bits(slot->dma_addr));
-	dma_desc->addr_high = cpu_to_le32(upper_32_bits(slot->dma_addr));
-	dma_desc->ctl0 = cpu_to_le32(ctl0);
-	dma_desc->ctl1 = cpu_to_le32(ctl1);
+	WRITE_ONCE(dma_desc->addr_low, cpu_to_le32(lower_32_bits(slot->dma_addr)));
+	WRITE_ONCE(dma_desc->addr_high, cpu_to_le32(upper_32_bits(slot->dma_addr)));
+	WRITE_ONCE(dma_desc->ctl0, cpu_to_le32(ctl0));
+	WRITE_ONCE(dma_desc->ctl1, cpu_to_le32(ctl1));
  }

  static netdev_tx_t bgmac_dma_tx_add(struct bgmac *bgmac,
@@ -387,10 +387,10 @@ static void bgmac_dma_rx_setup_desc(struct bgmac *bgmac,
  	 * B43_DMA64_DCTL1_ADDREXT_MASK;
  	 */

-	dma_desc->addr_low = cpu_to_le32(lower_32_bits(ring->slots[desc_idx].dma_addr));
-	dma_desc->addr_high = cpu_to_le32(upper_32_bits(ring->slots[desc_idx].dma_addr));
-	dma_desc->ctl0 = cpu_to_le32(ctl0);
-	dma_desc->ctl1 = cpu_to_le32(ctl1);
+	WRITE_ONCE(dma_desc->addr_low, cpu_to_le32(lower_32_bits(ring->slots[desc_idx].dma_addr)));
+	WRITE_ONCE(dma_desc->addr_high, cpu_to_le32(upper_32_bits(ring->slots[desc_idx].dma_addr)));
+	WRITE_ONCE(dma_desc->ctl0, cpu_to_le32(ctl0));
+	WRITE_ONCE(dma_desc->ctl1, cpu_to_le32(ctl1));

  	ring->end = desc_idx;
  }

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: Optimizing kernel compilation / alignments for network performance
@ 2022-05-10 11:23                 ` Rafał Miłecki
  0 siblings, 0 replies; 44+ messages in thread
From: Rafał Miłecki @ 2022-05-10 11:23 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Andrew Lunn, Alexander Lobakin, Network Development,
	linux-arm-kernel, Russell King, Felix Fietkau, openwrt-devel,
	Florian Fainelli

On 6.05.2022 10:45, Arnd Bergmann wrote:
> On Fri, May 6, 2022 at 9:44 AM Rafał Miłecki <zajec5@gmail.com> wrote:
>>
>> On 5.05.2022 18:04, Andrew Lunn wrote:
>>>> you'll see that most used functions are:
>>>> v7_dma_inv_range
>>>> __irqentry_text_end
>>>> l2c210_inv_range
>>>> v7_dma_clean_range
>>>> bcma_host_soc_read32
>>>> __netif_receive_skb_core
>>>> arch_cpu_idle
>>>> l2c210_clean_range
>>>> fib_table_lookup
>>>
>>> There is a lot of cache management functions here.
> 
> Indeed, so optimizing the coherency management (see Felix' reply)
> is likely to help most in making the driver faster, but that does not
> explain why the alignment of the object code has such a big impact
> on performance.
> 
> To investigate the alignment further, what I was actually looking for
> is a comparison of the profile of the slow and fast case. Here I would
> expect that the slow case spends more time in one of the functions
> that don't deal with cache management (maybe fib_table_lookup or
> __netif_receive_skb_core).
> 
> A few other thoughts:
> 
> - bcma_host_soc_read32() is a fundamentally slow operation, maybe
>    some of the calls can turned into a relaxed read, like the readback
>    in bgmac_chip_intrs_off() or the 'poll again' at the end bgmac_poll(),
>    though obviously not the one in bgmac_dma_rx_read().
>    It may be possible to even avoid some of the reads entirely, checking
>    for more data in bgmac_poll() may actually be counterproductive
>    depending on the workload.

I'll experiment with that, hopefully I can optimize it a bit.


> - The higher-end networking SoCs are usually cache-coherent and
>    can avoid the cache management entirely. There is a slim chance
>    that this chip is designed that way and it just needs to be enabled
>    properly. Most low-end chips don't implement the coherent
>    interconnect though, and I suppose you have checked this already.

To my best knowledge Northstar platform doesn't support hw coherency.

I just took an extra look at Broadcom's SDK and them seem to have some
driver for selected chipsets but BCM708 isn't there.

config BCM_GLB_COHERENCY
	bool "Global Hardware Cache Coherency"
	default n
	depends on BCM963158 || BCM96846 || BCM96858 || BCM96856 || BCM963178 || BCM947622 || BCM963146  || BCM94912 || BCM96813 || BCM96756 || BCM96855


> - bgmac_dma_rx_update_index() and bgmac_dma_tx_add() appear
>    to have an extraneous dma_wmb(), which should be implied by the
>    non-relaxed writel() in bgmac_write().

I tried dropping wmb() calls.
With wmb(): 421 Mb/s
Without: 418 Mb/s


I also tried dropping bgmac_read() from bgmac_chip_intrs_off() which
seems to be a flushing readback.

With bgmac_read(): 421 Mb/s
Without: 413 Mb/s


> - accesses to the DMA descriptor don't show up in the profile here,
>    but look like they can get misoptimized by the compiler. I would
>    generally use READ_ONCE() and WRITE_ONCE() for these to
>    ensure that you don't end up with extra or out-of-order accesses.
>    This also makes it clearer to the reader that something special
>    happens here.

Should I use something as below?

FWIW it doesn't seem to change NAT performance.
Without WRITE_ONCE: 421 Mb/s
With: 419 Mb/s


diff --git a/drivers/net/ethernet/broadcom/bgmac.c b/drivers/net/ethernet/broadcom/bgmac.c
index 87700072..ce98f2a9 100644
--- a/drivers/net/ethernet/broadcom/bgmac.c
+++ b/drivers/net/ethernet/broadcom/bgmac.c
@@ -119,10 +119,10 @@ bgmac_dma_tx_add_buf(struct bgmac *bgmac, struct bgmac_dma_ring *ring,

  	slot = &ring->slots[i];
  	dma_desc = &ring->cpu_base[i];
-	dma_desc->addr_low = cpu_to_le32(lower_32_bits(slot->dma_addr));
-	dma_desc->addr_high = cpu_to_le32(upper_32_bits(slot->dma_addr));
-	dma_desc->ctl0 = cpu_to_le32(ctl0);
-	dma_desc->ctl1 = cpu_to_le32(ctl1);
+	WRITE_ONCE(dma_desc->addr_low, cpu_to_le32(lower_32_bits(slot->dma_addr)));
+	WRITE_ONCE(dma_desc->addr_high, cpu_to_le32(upper_32_bits(slot->dma_addr)));
+	WRITE_ONCE(dma_desc->ctl0, cpu_to_le32(ctl0));
+	WRITE_ONCE(dma_desc->ctl1, cpu_to_le32(ctl1));
  }

  static netdev_tx_t bgmac_dma_tx_add(struct bgmac *bgmac,
@@ -387,10 +387,10 @@ static void bgmac_dma_rx_setup_desc(struct bgmac *bgmac,
  	 * B43_DMA64_DCTL1_ADDREXT_MASK;
  	 */

-	dma_desc->addr_low = cpu_to_le32(lower_32_bits(ring->slots[desc_idx].dma_addr));
-	dma_desc->addr_high = cpu_to_le32(upper_32_bits(ring->slots[desc_idx].dma_addr));
-	dma_desc->ctl0 = cpu_to_le32(ctl0);
-	dma_desc->ctl1 = cpu_to_le32(ctl1);
+	WRITE_ONCE(dma_desc->addr_low, cpu_to_le32(lower_32_bits(ring->slots[desc_idx].dma_addr)));
+	WRITE_ONCE(dma_desc->addr_high, cpu_to_le32(upper_32_bits(ring->slots[desc_idx].dma_addr)));
+	WRITE_ONCE(dma_desc->ctl0, cpu_to_le32(ctl0));
+	WRITE_ONCE(dma_desc->ctl1, cpu_to_le32(ctl1));

  	ring->end = desc_idx;
  }

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: Optimizing kernel compilation / alignments for network performance
  2022-05-06  9:44                   ` Arnd Bergmann
@ 2022-05-10 12:51                     ` Rafał Miłecki
  -1 siblings, 0 replies; 44+ messages in thread
From: Rafał Miłecki @ 2022-05-10 12:51 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Andrew Lunn, Alexander Lobakin, Network Development,
	linux-arm-kernel, Russell King, Felix Fietkau, openwrt-devel,
	Florian Fainelli

On 6.05.2022 11:44, Arnd Bergmann wrote:
> On Fri, May 6, 2022 at 10:55 AM Rafał Miłecki <zajec5@gmail.com> wrote:
>> On 6.05.2022 10:45, Arnd Bergmann wrote:
>>> On Fri, May 6, 2022 at 9:44 AM Rafał Miłecki <zajec5@gmail.com> wrote:
>>>> With
>>>> echo 1 > /sys/class/net/eth0/queues/rx-0/rps_cpus
>>>> my NAT speeds were jumping between 2 speeds:
>>>> 284 Mbps / 408 Mbps
>>>
>>> Can you try using 'numactl -C' to pin the iperf processes to
>>> a particular CPU core? This may be related to the locality of
>>> the user process relative to where the interrupts end up.
>>
>> I run iperf on x86 machines connected to the router's WAN and LAN ports.
>> It's meant to emulate an end user just downloading from / uploading to
>> the Internet some data.
>>
>> The router's only task is doing masquerade NAT here.
> 
> Ah, makes sense. Can you observe the CPU usage to be on
> a particular core in the slow vs fast case then?

With echo 0 > /sys/class/net/eth0/queues/rx-0/rps_cpus
NAT speed was varying between:
a) 311 Mb/s (CPUs load: 100% + 0%)
b) 408 Mb/s (CPUs load: 100% + 62%)

With echo 1 > /sys/class/net/eth0/queues/rx-0/rps_cpus
NAT speed was varying between:
a) 290 Mb/s (CPUs load: 100% + 0%)
b) 410 Mb/s (CPUs load: 100% + 63%)

With echo 2 > /sys/class/net/eth0/queues/rx-0/rps_cpus
NAT speed was stable:
a) 372 Mb/s (CPUs load: 100% + 26%)
b) 375 Mb/s (CPUs load: 82% + 100%)

With echo 3 > /sys/class/net/eth0/queues/rx-0/rps_cpus
NAT speed was varying between:
a) 293 Mb/s (CPUs load: 100% + 0%)
b) 332 Mb/s (CPUs load: 100% + 17%)
c) 374 Mb/s (CPUs load: 81% + 100%)
d) 442 Mb/s (CPUs load: 100% + 75%)



After some extra debugging I found the reason for the varying CPU usage &
varying NAT speeds.

My router has a single switch so I use two VLANs:
eth0.1 - LAN
eth0.2 - WAN
(VLAN traffic is routed to the correct ports by the switch). On top of that
I have a "br-lan" bridge interface bridging eth0.1 and the wireless
interfaces.

For all that time I had /sys/class/net/br-lan/queues/rx-0/rps_cpus set
to 3. So bridge traffic was randomly handled by CPU 0 or CPU 1.

So if I assign a specific CPU core to each of the two interfaces, e.g.:
echo 1 > /sys/class/net/eth0/queues/rx-0/rps_cpus
echo 2 > /sys/class/net/br-lan/queues/rx-0/rps_cpus
things get stable.

With the above I get a stable 419 Mb/s (CPUs load: 100% + 64%) on every
iperf session.
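
For completeness, this is how I double-check where the RX softirq work
ends up (nothing bgmac specific, just a sketch using standard procfs /
sysfs reads):

# per-CPU NET_RX softirq counters - run twice and compare the deltas
grep NET_RX /proc/softirqs
# current RPS masks
cat /sys/class/net/eth0/queues/rx-0/rps_cpus
cat /sys/class/net/br-lan/queues/rx-0/rps_cpus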

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Optimizing kernel compilation / alignments for network performance
  2022-05-10 11:23                 ` Rafał Miłecki
@ 2022-05-10 13:18                   ` Arnd Bergmann
  -1 siblings, 0 replies; 44+ messages in thread
From: Arnd Bergmann @ 2022-05-10 13:18 UTC (permalink / raw)
  To: Rafał Miłecki
  Cc: Arnd Bergmann, Andrew Lunn, Alexander Lobakin,
	Network Development, linux-arm-kernel, Russell King,
	Felix Fietkau, openwrt-devel, Florian Fainelli

On Tue, May 10, 2022 at 1:23 PM Rafał Miłecki <zajec5@gmail.com> wrote:
> On 6.05.2022 10:45, Arnd Bergmann wrote:
> > - The higher-end networking SoCs are usually cache-coherent and
> >    can avoid the cache management entirely. There is a slim chance
> >    that this chip is designed that way and it just needs to be enabled
> >    properly. Most low-end chips don't implement the coherent
> >    interconnect though, and I suppose you have checked this already.
>
> To the best of my knowledge the Northstar platform doesn't support hw
> coherency.
>
> I just took an extra look at Broadcom's SDK and they seem to have some
> driver for selected chipsets, but BCM708 isn't there.
>
> config BCM_GLB_COHERENCY
>         bool "Global Hardware Cache Coherency"
>         default n
>         depends on BCM963158 || BCM96846 || BCM96858 || BCM96856 || BCM963178 || BCM947622 || BCM963146  || BCM94912 || BCM96813 || BCM96756 || BCM96855

Ok

> > - bgmac_dma_rx_update_index() and bgmac_dma_tx_add() appear
> >    to have an extraneous dma_wmb(), which should be implied by the
> >    non-relaxed writel() in bgmac_write().
>
> I tried dropping wmb() calls.
> With wmb(): 421 Mb/s
> Without: 418 Mb/s

That's probably within the noise here. I suppose doing two wmb()
calls in a row is not that expensive because there is nothing left to
wait for. If the extra wmb() is measurably faster than no wmb(), there
is something else going wrong ;-)

> I also tried dropping bgmac_read() from bgmac_chip_intrs_off() which
> seems to be a flushing readback.
>
> With bgmac_read(): 421 Mb/s
> Without: 413 Mb/s

Interesting, so this is statistically significant, right? It could be that
this changes the interrupt timing just enough that it ends up doing
more work at once some of the time.

> > - accesses to the DMA descriptor don't show up in the profile here,
> >    but look like they can get misoptimized by the compiler. I would
> >    generally use READ_ONCE() and WRITE_ONCE() for these to
> >    ensure that you don't end up with extra or out-of-order accesses.
> >    This also makes it clearer to the reader that something special
> >    happens here.
>
> Should I use something like below?
>
> FWIW it doesn't seem to change NAT performance.
> Without WRITE_ONCE: 421 Mb/s
> With: 419 Mb/s

This one depends on the compiler. What I would expect here is that
it often makes no difference, but if the compiler does something
odd, then the WRITE_ONCE() would prevent this and make it behave
as before. I would suggest adding this part regardless.

The other suggestion I had was this one, which I think you did not test yet:

--- a/drivers/net/ethernet/broadcom/bgmac.c
+++ b/drivers/net/ethernet/broadcom/bgmac.c
@@ -1156,11 +1156,12 @@ static int bgmac_poll(struct napi_struct *napi, int weight)
        bgmac_dma_tx_free(bgmac, &bgmac->tx_ring[0]);
        handled += bgmac_dma_rx_read(bgmac, &bgmac->rx_ring[0], weight);

-       /* Poll again if more events arrived in the meantime */
-       if (bgmac_read(bgmac, BGMAC_INT_STATUS) & (BGMAC_IS_TX0 | BGMAC_IS_RX))
-               return weight;
-
        if (handled < weight) {
+               /* Poll again if more events arrived in the meantime */
+               if (bgmac_read(bgmac, BGMAC_INT_STATUS) &
+                               (BGMAC_IS_TX0 | BGMAC_IS_RX))
+                       return weight;
+
                napi_complete_done(napi, handled);
                bgmac_chip_intrs_on(bgmac);
        }

Or possibly, remove that extra check entirely and just rely on the irq to do
this after it gets turned on again.
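
That variant would look roughly like this (untested sketch, only the tail
of bgmac_poll() shown; it relies on the interrupt firing again for events
that arrive after the rings have been drained):

	bgmac_dma_tx_free(bgmac, &bgmac->tx_ring[0]);
	handled += bgmac_dma_rx_read(bgmac, &bgmac->rx_ring[0], weight);

	/* No extra BGMAC_INT_STATUS readback here: anything that arrived
	 * in the meantime is left for the re-enabled interrupt to handle.
	 */
	if (handled < weight) {
		napi_complete_done(napi, handled);
		bgmac_chip_intrs_on(bgmac);
	}

	return handled;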

         Arnd

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Optimizing kernel compilation / alignments for network performance
  2022-05-10 12:51                     ` Rafał Miłecki
@ 2022-05-10 13:19                       ` Arnd Bergmann
  -1 siblings, 0 replies; 44+ messages in thread
From: Arnd Bergmann @ 2022-05-10 13:19 UTC (permalink / raw)
  To: Rafał Miłecki
  Cc: Arnd Bergmann, Andrew Lunn, Alexander Lobakin,
	Network Development, linux-arm-kernel, Russell King,
	Felix Fietkau, openwrt-devel, Florian Fainelli

On Tue, May 10, 2022 at 2:51 PM Rafał Miłecki <zajec5@gmail.com> wrote:
> On 6.05.2022 11:44, Arnd Bergmann wrote:
>
> My router has a single switch so I use two VLANs:
> eth0.1 - LAN
> eth0.2 - WAN
> (VLAN traffic is routed to the correct ports by the switch). On top of that
> I have a "br-lan" bridge interface bridging eth0.1 and the wireless
> interfaces.
>
> For all that time I had /sys/class/net/br-lan/queues/rx-0/rps_cpus set
> to 3. So bridge traffic was randomly handled by CPU 0 or CPU 1.
>
> So if I assign a specific CPU core to each of the two interfaces, e.g.:
> echo 1 > /sys/class/net/eth0/queues/rx-0/rps_cpus
> echo 2 > /sys/class/net/br-lan/queues/rx-0/rps_cpus
> things get stable.
>
> With the above I get a stable 419 Mb/s (CPUs load: 100% + 64%) on every
> iperf session.

Ah, very nice! One part of the mystery is solved then I guess.

       Arnd

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Optimizing kernel compilation / alignments for network performance
  2022-05-10 10:29                   ` Rafał Miłecki
@ 2022-05-10 14:09                     ` Dave Taht
  -1 siblings, 0 replies; 44+ messages in thread
From: Dave Taht @ 2022-05-10 14:09 UTC (permalink / raw)
  To: Rafał Miłecki
  Cc: Andrew Lunn, Felix Fietkau, Arnd Bergmann, Alexander Lobakin,
	Network Development, linux-arm-kernel, Russell King,
	openwrt-devel, Florian Fainelli

I might have mentioned this before, but I'm really big on using the
flent tool to drive test runs. The comparison plots are to die for,
and it can also sample CPU and other statistics over time. Also I'm
big on testing bidirectional functionality.

client$ flent -H server -t what_test_conditions_you_have
--step-size=.05 --te=upload_streams=4 -x --socket-stats tcp_nup

It gathers a lot of data about everything. The rrul test is one of my
favorites for creating a bittorrent-like load.
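
For example, something along these lines (hostname and title are just
placeholders, adjust to your setup):

client$ flent -H server -t rrul_baseline --step-size=.05 -x rrul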

flent is usually available in apt/rpm/etc. There are scripts that can
run on routers; openwrt has opkg install flent-tools, and you use ssh
to fire these off.

There are a few Python dependencies for the flent-gui that aren't
needed for the flent server or client.
Sometimes you have to install and compile netperf on your own with
./configure --enable-demo.

Please see flent.org for more details, and/or hit the flent-users list
for questions.

On Tue, May 10, 2022 at 5:03 AM Rafał Miłecki <zajec5@gmail.com> wrote:
>
> On 6.05.2022 14:42, Andrew Lunn wrote:
> >>> I just took a quick look at the driver. It allocates and maps rx buffers that can cover a packet size of BGMAC_RX_MAX_FRAME_SIZE = 9724.
> >>> This seems rather excessive, especially since most people are going to use a MTU of 1500.
> >>> My proposal would be to add support for making rx buffer size dependent on MTU, reallocating the ring on MTU changes.
> >>> This should significantly reduce the time spent on flushing caches.
> >>
> >> Oh, that's important too, it was changed by commit 8c7da63978f1 ("bgmac:
> >> configure MTU and add support for frames beyond 8192 byte size"):
> >> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8c7da63978f1672eb4037bbca6e7eac73f908f03
> >>
> >> It lowered NAT speed with bgmac by 60% (362 Mbps → 140 Mbps).
> >>
> >> I do all my testing with
> >> #define BGMAC_RX_MAX_FRAME_SIZE                      1536
> >
> > That helps show that cache operations are part of your bottleneck.
> >
> > Taking a quick look at the driver. On the receive side:
> >
> >                         /* Unmap buffer to make it accessible to the CPU */
> >                          dma_unmap_single(dma_dev, dma_addr,
> >                                           BGMAC_RX_BUF_SIZE, DMA_FROM_DEVICE);
> >
> > Here the data is mapped ready for the CPU to use it.
> >
> >                       /* Get info from the header */
> >                          len = le16_to_cpu(rx->len);
> >                          flags = le16_to_cpu(rx->flags);
> >
> >                          /* Check for poison and drop or pass the packet */
> >                          if (len == 0xdead && flags == 0xbeef) {
> >                                  netdev_err(bgmac->net_dev, "Found poisoned packet at slot %d, DMA issue!\n",
> >                                             ring->start);
> >                                  put_page(virt_to_head_page(buf));
> >                                  bgmac->net_dev->stats.rx_errors++;
> >                                  break;
> >                          }
> >
> >                          if (len > BGMAC_RX_ALLOC_SIZE) {
> >                                  netdev_err(bgmac->net_dev, "Found oversized packet at slot %d, DMA issue!\n",
> >                                             ring->start);
> >                                  put_page(virt_to_head_page(buf));
> >                                  bgmac->net_dev->stats.rx_length_errors++;
> >                                  bgmac->net_dev->stats.rx_errors++;
> >                                  break;
> >                          }
> >
> >                          /* Omit CRC. */
> >                          len -= ETH_FCS_LEN;
> >
> >                          skb = build_skb(buf, BGMAC_RX_ALLOC_SIZE);
> >                          if (unlikely(!skb)) {
> >                                  netdev_err(bgmac->net_dev, "build_skb failed\n");
> >                                  put_page(virt_to_head_page(buf));
> >                                  bgmac->net_dev->stats.rx_errors++;
> >                                  break;
> >                          }
> >                          skb_put(skb, BGMAC_RX_FRAME_OFFSET +
> >                                  BGMAC_RX_BUF_OFFSET + len);
> >                          skb_pull(skb, BGMAC_RX_FRAME_OFFSET +
> >                                   BGMAC_RX_BUF_OFFSET);
> >
> >                          skb_checksum_none_assert(skb);
> >                          skb->protocol = eth_type_trans(skb, bgmac->net_dev);
> >
> > and this is the first access of the actual data. You can make the
> > cache actually work for you, rather than against you, by adding a call to
> >
> >       prefetch(buf);
> >
> > just after the dma_unmap_single(). That will start getting the frame
> > header from DRAM into cache, so hopefully it is available by the time
> > eth_type_trans() is called and you don't have a cache miss.
>
>
> I don't think that analysis is correct.
>
> Please take a look at the following lines:
> struct bgmac_rx_header *rx = slot->buf + BGMAC_RX_BUF_OFFSET;
> void *buf = slot->buf;
>
> The first thing we do after the dma_unmap_single() call is read rx->len.
> That actually points to the DMA data. There is nothing we could keep the
> CPU busy with while prefetching data.
>
> FWIW I tried adding prefetch(buf); anyway. It didn't change NAT speed by
> a single Mb/s. Speed was exactly the same as without the prefetch() call.



-- 
FQ World Domination pending: https://blog.cerowrt.org/post/state_of_fq_codel/
Dave Täht CEO, TekLibre, LLC

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Optimizing kernel compilation / alignments for network performance
  2022-05-10 14:09                     ` Dave Taht
@ 2022-05-10 19:15                       ` Dave Taht
  -1 siblings, 0 replies; 44+ messages in thread
From: Dave Taht @ 2022-05-10 19:15 UTC (permalink / raw)
  To: Rafał Miłecki
  Cc: Andrew Lunn, Felix Fietkau, Arnd Bergmann, Alexander Lobakin,
	Network Development, linux-arm-kernel, Russell King,
	openwrt-devel, Florian Fainelli

While I'm kibitzing kind of randomly on this thread... Richard Sites'
just-published book, "Understanding Software Dynamics", is the first
book I've been compelled to buy on paper in many years, due to the
extensive use of useful color graphs and analogies, as well as its
explanation of the KUtrace tool and so many other wonderful modern
things I'd missed.

https://www.amazon.com/Understanding-Software-Addison-Wesley-Professional-Computing/dp/0137589735

^ permalink raw reply	[flat|nested] 44+ messages in thread

end of thread, other threads:[~2022-05-10 19:16 UTC | newest]

Thread overview: 44+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-04-27 12:04 Optimizing kernel compilation / alignments for network performance Rafał Miłecki
2022-04-27 12:04 ` Rafał Miłecki
2022-04-27 12:56 ` Alexander Lobakin
2022-04-27 12:56   ` Alexander Lobakin
2022-04-27 17:31   ` Rafał Miłecki
2022-04-27 17:31     ` Rafał Miłecki
2022-04-29 14:18     ` Rafał Miłecki
2022-04-29 14:18       ` Rafał Miłecki
2022-04-29 14:49     ` Arnd Bergmann
2022-04-29 14:49       ` Arnd Bergmann
2022-05-05 15:42       ` Rafał Miłecki
2022-05-05 15:42         ` Rafał Miłecki
2022-05-05 16:04         ` Andrew Lunn
2022-05-05 16:04           ` Andrew Lunn
2022-05-05 16:46           ` Felix Fietkau
2022-05-05 16:46             ` Felix Fietkau
2022-05-06  7:47             ` Rafał Miłecki
2022-05-06  7:47               ` Rafał Miłecki
2022-05-06 12:42               ` Andrew Lunn
2022-05-06 12:42                 ` Andrew Lunn
2022-05-10 10:29                 ` Rafał Miłecki
2022-05-10 10:29                   ` Rafał Miłecki
2022-05-10 14:09                   ` Dave Taht
2022-05-10 14:09                     ` Dave Taht
2022-05-10 19:15                     ` Dave Taht
2022-05-10 19:15                       ` Dave Taht
2022-05-06  7:44           ` Rafał Miłecki
2022-05-06  7:44             ` Rafał Miłecki
2022-05-06  8:45             ` Arnd Bergmann
2022-05-06  8:45               ` Arnd Bergmann
2022-05-06  8:55               ` Rafał Miłecki
2022-05-06  8:55                 ` Rafał Miłecki
2022-05-06  9:44                 ` Arnd Bergmann
2022-05-06  9:44                   ` Arnd Bergmann
2022-05-10 12:51                   ` Rafał Miłecki
2022-05-10 12:51                     ` Rafał Miłecki
2022-05-10 13:19                     ` Arnd Bergmann
2022-05-10 13:19                       ` Arnd Bergmann
2022-05-10 11:23               ` Rafał Miłecki
2022-05-10 11:23                 ` Rafał Miłecki
2022-05-10 13:18                 ` Arnd Bergmann
2022-05-10 13:18                   ` Arnd Bergmann
2022-05-08  9:53             ` Rafał Miłecki
2022-05-08  9:53               ` Rafał Miłecki
