linux-kernel.vger.kernel.org archive mirror
* [PATCH] Revert "arm64: Increase the max granular size"
@ 2016-03-16  9:32 Ganesh Mahendran
  2016-03-16 10:07 ` Will Deacon
  2016-03-18 21:05 ` Chalamarla, Tirumalesh
  0 siblings, 2 replies; 36+ messages in thread
From: Ganesh Mahendran @ 2016-03-16  9:32 UTC (permalink / raw)
  To: catalin.marinas, will.deacon
  Cc: linux-kernel, linux-arm-kernel, Ganesh Mahendran, stable

This reverts commit 97303480753e ("arm64: Increase the max granular size").

Commit 97303480753e ("arm64: Increase the max granular size") degrades
system performance on some CPUs.

We tested Wi-Fi network throughput with iperf on a Qualcomm msm8996 CPU:
----------------
run on host:
  # iperf -s
run on device:
  # iperf -c <device-ip-addr> -t 100 -i 1
----------------

Test result:
----------------
with commit 97303480753e ("arm64: Increase the max granular size"):
    172MBits/sec

without commit 97303480753e ("arm64: Increase the max granular size"):
    230MBits/sec
----------------

Modules such as slab and net use L1_CACHE_SHIFT, so if we do not set
this parameter correctly, system performance may suffer.

So revert the commit.

Cc: stable@vger.kernel.org
Signed-off-by: Ganesh Mahendran <opensource.ganesh@gmail.com>
---
 arch/arm64/include/asm/cache.h |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/arm64/include/asm/cache.h b/arch/arm64/include/asm/cache.h
index 5082b30..bde4499 100644
--- a/arch/arm64/include/asm/cache.h
+++ b/arch/arm64/include/asm/cache.h
@@ -18,7 +18,7 @@
 
 #include <asm/cachetype.h>
 
-#define L1_CACHE_SHIFT		7
+#define L1_CACHE_SHIFT		6
 #define L1_CACHE_BYTES		(1 << L1_CACHE_SHIFT)
 
 /*
-- 
1.7.9.5
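
For context, L1_CACHE_SHIFT also sets ARCH_DMA_MINALIGN in this header,
which include/linux/slab.h picks up as the minimum kmalloc() alignment
and smallest object size (KMALLOC_MIN_SIZE), so a larger shift inflates
every small allocation. A minimal user-space sketch of that arithmetic,
not part of the patch:

----------------
#include <stdio.h>

/* Mirrors the macro chain at the time: L1_CACHE_BYTES = 1 << L1_CACHE_SHIFT,
 * ARCH_DMA_MINALIGN = L1_CACHE_BYTES and, in slab.h,
 * KMALLOC_MIN_SIZE = ARCH_DMA_MINALIGN. */
static unsigned int kmalloc_min_size(unsigned int l1_cache_shift)
{
	return 1u << l1_cache_shift;
}

int main(void)
{
	/* a kmalloc(8) object occupies at least KMALLOC_MIN_SIZE bytes */
	printf("shift 6: smallest kmalloc object = %u bytes\n",
	       kmalloc_min_size(6));
	printf("shift 7: smallest kmalloc object = %u bytes\n",
	       kmalloc_min_size(7));
	return 0;
}
----------------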

* Re: [PATCH] Revert "arm64: Increase the max granular size"
  2016-03-16  9:32 [PATCH] Revert "arm64: Increase the max granular size" Ganesh Mahendran
@ 2016-03-16 10:07 ` Will Deacon
  2016-03-16 13:06   ` Timur Tabi
  2016-03-18 21:05 ` Chalamarla, Tirumalesh
  1 sibling, 1 reply; 36+ messages in thread
From: Will Deacon @ 2016-03-16 10:07 UTC (permalink / raw)
  To: Ganesh Mahendran
  Cc: catalin.marinas, stable, linux-kernel, linux-arm-kernel,
	tchalamarla, rrichter, apinski, timur

[adding Cavium folk and Timur]

On Wed, Mar 16, 2016 at 05:32:23PM +0800, Ganesh Mahendran wrote:
> This reverts commit 97303480753e ("arm64: Increase the max granular size").
> 
> Commit 97303480753e ("arm64: Increase the max granular size") degrades
> system performance on some CPUs.
> 
> We tested Wi-Fi network throughput with iperf on a Qualcomm msm8996 CPU:
> ----------------
> run on host:
>   # iperf -s
> run on device:
>   # iperf -c <device-ip-addr> -t 100 -i 1
> ----------------
> 
> Test result:
> ----------------
> with commit 97303480753e ("arm64: Increase the max granular size"):
>     172MBits/sec
> 
> without commit 97303480753e ("arm64: Increase the max granular size"):
>     230MBits/sec
> ----------------
> 
> Modules such as slab and net use L1_CACHE_SHIFT, so if we do not set
> this parameter correctly, system performance may suffer.
> 
> So revert the commit.

Unfortunately, the original patch is required to support the 128-byte L1
cache lines of Cavium ThunderX, so we can't simply revert it like this.
Similarly, the desire for a single, multiplatform kernel image prevents
us from reasonably fixing this at compile time to anything other than
the expected maximum value.

Furthermore, Timur previously said that the change is also required
"on our [Qualcomm] silicon", but I'm not sure if this is msm8996 or not:

http://lkml.kernel.org/r/CAOZdJXUiRMAguDV+HEJqPg57MyBNqEcTyaH+ya=U93NHb-pdJA@mail.gmail.com

You could look into making ARCH_DMA_MINALIGN a runtime value, but that
looks like an uphill struggle to me. Alternatively, we could only warn
if the CWG is bigger than L1_CACHE_BYTES *and* we have a non-coherent
DMA master, but that doesn't solve any performance issues from having
things like locks sharing cachelines, not that I think we ever got any
data on that (afaik, we don't pad locks to cacheline boundaries anyway).
I'm also not sure what it would mean for PCI NoSnoop transactions.
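
A minimal user-space sketch of the CWG decode such a check would rely
on, assuming only the architectural encoding of CTR_EL0[27:24] (CWG):
log2 of the writeback granule in 4-byte words, with 0 meaning "not
provided":

----------------
#include <stdio.h>
#include <stdint.h>

#define L1_CACHE_BYTES	64	/* the value this revert restores */

/* CTR_EL0[27:24] (CWG) encodes the Cache Writeback Granule as
 * log2(number of 4-byte words); 0b0000 means "not provided". */
static unsigned int cwg_bytes(uint64_t ctr_el0)
{
	unsigned int cwg = (ctr_el0 >> 24) & 0xf;

	return cwg ? 4u << cwg : 0;
}

int main(void)
{
	uint64_t ctr[] = { 4ull << 24, 5ull << 24 }; /* 64- and 128-byte CWG */
	int i;

	for (i = 0; i < 2; i++)
		printf("CWG = %3u bytes%s\n", cwg_bytes(ctr[i]),
		       cwg_bytes(ctr[i]) > L1_CACHE_BYTES ?
		       " (would warn given a non-coherent DMA master)" : "");
	return 0;
}
----------------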

Will

* Re: [PATCH] Revert "arm64: Increase the max granular size"
  2016-03-16 10:07 ` Will Deacon
@ 2016-03-16 13:06   ` Timur Tabi
  2016-03-16 14:03     ` Mark Rutland
  2016-03-16 14:18     ` Catalin Marinas
  0 siblings, 2 replies; 36+ messages in thread
From: Timur Tabi @ 2016-03-16 13:06 UTC (permalink / raw)
  To: Will Deacon, Ganesh Mahendran
  Cc: catalin.marinas, stable, linux-kernel, linux-arm-kernel,
	tchalamarla, rrichter, apinski, Shanker Donthineni

Will Deacon wrote:
> [adding Cavium folk and Timur]
>
> On Wed, Mar 16, 2016 at 05:32:23PM +0800, Ganesh Mahendran wrote:
>> This reverts commit 97303480753e ("arm64: Increase the max granular size").
>>
>> Commit 97303480753e ("arm64: Increase the max granular size") degrades
>> system performance on some CPUs.
>>
>> We tested Wi-Fi network throughput with iperf on a Qualcomm msm8996 CPU:
>> ----------------
>> run on host:
>>    # iperf -s
>> run on device:
>>    # iperf -c <device-ip-addr> -t 100 -i 1
>> ----------------
>>
>> Test result:
>> ----------------
>> with commit 97303480753e ("arm64: Increase the max granular size"):
>>      172MBits/sec
>>
>> without commit 97303480753e ("arm64: Increase the max granular size"):
>>      230MBits/sec
>> ----------------
>>
>> Modules such as slab and net use L1_CACHE_SHIFT, so if we do not set
>> this parameter correctly, system performance may suffer.
>>
>> So revert the commit.
>
> Unfortunately, the original patch is required to support the 128-byte L1
> cache lines of Cavium ThunderX, so we can't simply revert it like this.
> Similarly, the desire for a single, multiplatform kernel image prevents
> us from reasonably fixing this at compile time to anything other than
> the expected maximum value.
>
> Furthermore, Timur previously said that the change is also required
> "on our [Qualcomm] silicon", but I'm not sure if this is msm8996 or not:
>
> http://lkml.kernel.org/r/CAOZdJXUiRMAguDV+HEJqPg57MyBNqEcTyaH+ya=U93NHb-pdJA@mail.gmail.com

I was talking about our server part, the QDF2432.  At the time, I wasn't 
allowed to mention it by name.

> You could look into making ARCH_DMA_MINALIGN a runtime value, but that
> looks like an uphill struggle to me. Alternatively, we could only warn
> if the CWG is bigger than L1_CACHE_BYTES *and* we have a non-coherent
> DMA master, but that doesn't solve any performance issues from having
> things like locks sharing cachelines, not that I think we ever got any
> data on that (afaik, we don't pad locks to cacheline boundaries anyway).
> I'm also not sure what it would mean for PCI NoSnoop transactions.

Our internal version of this patch made it a Kconfig option.  Perhaps 
that would at least be an improvement over just reverting it?  We 
already have to have our own defconfig for the QDF2432.


-- 
Sent by an employee of the Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the
Code Aurora Forum, hosted by The Linux Foundation.

* Re: [PATCH] Revert "arm64: Increase the max granular size"
  2016-03-16 13:06   ` Timur Tabi
@ 2016-03-16 14:03     ` Mark Rutland
  2016-03-16 14:35       ` Will Deacon
  2016-03-16 14:18     ` Catalin Marinas
  1 sibling, 1 reply; 36+ messages in thread
From: Mark Rutland @ 2016-03-16 14:03 UTC (permalink / raw)
  To: Timur Tabi
  Cc: Will Deacon, Ganesh Mahendran, catalin.marinas, linux-kernel,
	stable, rrichter, tchalamarla, Shanker Donthineni, apinski,
	linux-arm-kernel

On Wed, Mar 16, 2016 at 08:06:22AM -0500, Timur Tabi wrote:
> Will Deacon wrote:
> >Unfortunately, the original patch is required to support the 128-byte L1
> >cache lines of Cavium ThunderX, so we can't simply revert it like this.
> >Similarly, the desire for a single, multiplatform kernel image prevents
> >us from reasonably fixing this at compile time to anything other than
> >the expected maximum value.
> >
> >Furthermore, Timur previously said that the change is also required
> >"on our [Qualcomm] silicon", but I'm not sure if this is msm8996 or not:
> >
> >http://lkml.kernel.org/r/CAOZdJXUiRMAguDV+HEJqPg57MyBNqEcTyaH+ya=U93NHb-pdJA@mail.gmail.com
> 
> I was talking about our server part, the QDF2432.  At the time, I
> wasn't allowed to mention it by name.
> 
> >You could look into making ARCH_DMA_MINALIGN a runtime value, but that
> >looks like an uphill struggle to me. Alternatively, we could only warn
> >if the CWG is bigger than L1_CACHE_BYTES *and* we have a non-coherent
> >DMA master, but that doesn't solve any performance issues from having
> >things like locks sharing cachelines, not that I think we ever got any
> >data on that (afaik, we don't pad locks to cacheline boundaries anyway).
> >I'm also not sure what it would mean for PCI NoSnoop transactions.
> 
> Our internal version of this patch made it a Kconfig option.
> Perhaps that would at least be an improvement over just reverting
> it?  We already have to have our own defconfig for the QDF2432.

While having an option for producing a less-portable, performance-tuned kernel
might not be the end of the world, the defconfig is intended to function
correctly on all platforms (assuming LE and 4K page support).

Even if we were to add the option, the default would have to be the maximum
size known to be implemented.

If I understand correctly, the main reason that we need this for correctness is
non-coherent DMA to/from SLAB caches.

A more general approach (and more invasive, but perhaps less so than making
ARCH_DMA_MINALIGN usage completely dynamic) would be to determine at runtime
whether the CWG is larger than the configured ARCH_DMA_MINALIGN, and if so,
force the use of bounce buffers (which could be padded to the architectural
maximum of 2K) for non-coherent DMA. That nicely degrades to not mattering for
the case of coherent DMA.
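
A hypothetical sketch of that fallback; the helpers named here
(system_cwg_bytes() and force_dma_bounce()) are invented for
illustration and are not existing kernel interfaces:

----------------
/* Hypothetical: decide once at boot whether non-coherent DMA must use
 * bounce buffers because the hardware CWG exceeds the compile-time
 * ARCH_DMA_MINALIGN. Helper names are illustrative only. */
static int __init arm64_dma_align_check(void)
{
	unsigned int cwg = system_cwg_bytes();	/* from CTR_EL0.CWG */

	if (cwg > ARCH_DMA_MINALIGN)
		/* Pad bounce slots to the 2K architectural maximum so any
		 * CWG is covered; coherent DMA masters are unaffected. */
		force_dma_bounce(SZ_2K);

	return 0;
}
arch_initcall(arm64_dma_align_check);
----------------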

I would consider NoSnoop a separate case. It's closer to "negatively coherent",
and always requires page-aligned buffers anyway due to MMU behaviour.

Thanks,
Mark.

* Re: [PATCH] Revert "arm64: Increase the max granular size"
  2016-03-16 13:06   ` Timur Tabi
  2016-03-16 14:03     ` Mark Rutland
@ 2016-03-16 14:18     ` Catalin Marinas
  2016-03-16 15:26       ` Timur Tabi
  1 sibling, 1 reply; 36+ messages in thread
From: Catalin Marinas @ 2016-03-16 14:18 UTC (permalink / raw)
  To: Timur Tabi
  Cc: Will Deacon, Ganesh Mahendran, linux-kernel, stable, rrichter,
	tchalamarla, Shanker Donthineni, apinski, linux-arm-kernel

On Wed, Mar 16, 2016 at 08:06:22AM -0500, Timur Tabi wrote:
> Will Deacon wrote:
> >You could look into making ARCH_DMA_MINALIGN a runtime value, but that
> >looks like an uphill struggle to me. Alternatively, we could only warn
> >if the CWG is bigger than L1_CACHE_BYTES *and* we have a non-coherent
> >DMA master, but that doesn't solve any performance issues from having
> >things like locks sharing cachelines, not that I think we ever got any
> >data on that (afaik, we don't pad locks to cacheline boundaries anyway).
> >I'm also not sure what it would mean for PCI NoSnoop transactions.
> 
> Our internal version of this patch made it a Kconfig option.  Perhaps that
> would at least be an improvement over just reverting it?  We already have to
> have our own defconfig for the QDF2432.

Why do you need your own defconfig? If it's just on the short term until
all your code is upstream, that's fine, but this goes against the single
Image aim. I would like defconfig to cover all supported SoCs (and yes,
ACPI on by default once we deem it !EXPERT anymore), though at some
point we may need a server/mobile split (if the generated image is too
large, maybe more stuff being built as modules).

-- 
Catalin

* Re: [PATCH] Revert "arm64: Increase the max granular size"
  2016-03-16 14:03     ` Mark Rutland
@ 2016-03-16 14:35       ` Will Deacon
  2016-03-16 14:54         ` Mark Rutland
  0 siblings, 1 reply; 36+ messages in thread
From: Will Deacon @ 2016-03-16 14:35 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Timur Tabi, Ganesh Mahendran, catalin.marinas, linux-kernel,
	stable, rrichter, tchalamarla, Shanker Donthineni, apinski,
	linux-arm-kernel

On Wed, Mar 16, 2016 at 02:03:35PM +0000, Mark Rutland wrote:
> If I understand correctly, the main reason that we need this for correctness is
> non-coherent DMA to/from SLAB caches.
> 
> A more general approach (and more invasive, but perhaps less so than making
> ARCH_DMA_MINALIGN usage completely dynamic) would be to determine at runtime
> whether the CWG is larger than the configured ARCH_DMA_MINALIGN, and if so,
> force the use of bounce buffers (which could be padded to the architectural
> maximum of 2K) for non-coherent DMA. That nicely degrades to not mattering for
> the case of coherent DMA.
>
> I would consider NoSnoop a separate case. It's closer to "negatively coherent",
> and always requires page-aligned buffers anyway due to MMU behaviour.

What makes you say that? There are no such alignment requirements for
buffers that may be accessed with a NoSnoop transaction. On ARM, we'll
have a mismatched alias, but we'd need to solve that with explicit
cache maintenance (and my understanding is that's what things like GPU
drivers already do on x86).

Will

* Re: [PATCH] Revert "arm64: Increase the max granular size"
  2016-03-16 14:35       ` Will Deacon
@ 2016-03-16 14:54         ` Mark Rutland
  0 siblings, 0 replies; 36+ messages in thread
From: Mark Rutland @ 2016-03-16 14:54 UTC (permalink / raw)
  To: Will Deacon
  Cc: Timur Tabi, Ganesh Mahendran, catalin.marinas, linux-kernel,
	stable, rrichter, tchalamarla, Shanker Donthineni, apinski,
	linux-arm-kernel

On Wed, Mar 16, 2016 at 02:35:35PM +0000, Will Deacon wrote:
> On Wed, Mar 16, 2016 at 02:03:35PM +0000, Mark Rutland wrote:
> > If I understand correctly, the main reason that we need this for correctness is
> > non-coherent DMA to/from SLAB caches.
> > 
> > A more general approach (and more invasive, but perhaps less so than making
> > ARCH_DMA_MINALIGN usage completely dynamic) would be to determine at runtime
> > whether the CWG is larger than the configured ARCH_DMA_MINALIGN, and if so,
> > force the use of bounce buffers (which could be padded to the architectural
> > maximum of 2K) for non-coherent DMA. That nicely degrades to not mattering for
> > the case of coherent DMA.
> >
> > I would consider NoSnoop a separate case. It's closer to "negatively coherent",
> > and always requires page-aligned buffers anyway due to MMU behaviour.
> 
> What makes you say that? There are no such alignment requirements for
> buffers that may be accessed with a NoSnoop transaction. On ARM, we'll
> have a mismatched alias, but we'd need to solve that with explicit
> cache maintenance (and my understanding is that's what things like GPU
> drivers already do on x86).

I was under the impression that NoSnoop transactions were permitted to be
Cacheable, even if non-snooping (e.g. allowing them to allocate and hit in a
system cache).

If that is permitted, then data corruption could potentially occur in the
presence of another cacheable alias due to things like line migration (e.g. a
CPU making a speculative fetch and taking ownership of a line that was in the
system cache). To avoid that, you'd have to remove any cacheable alias, for
which we only have page-granular control.

If that is not permitted, then no-snoop is effectively non-cacheable and
non-coherent, and my comment doesn't hold.

Thanks,
Mark.

* Re: [PATCH] Revert "arm64: Increase the max granular size"
  2016-03-16 14:18     ` Catalin Marinas
@ 2016-03-16 15:26       ` Timur Tabi
  2016-03-17 14:27         ` Catalin Marinas
  0 siblings, 1 reply; 36+ messages in thread
From: Timur Tabi @ 2016-03-16 15:26 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Will Deacon, Ganesh Mahendran, linux-kernel, stable, rrichter,
	tchalamarla, Shanker Donthineni, apinski, linux-arm-kernel

Catalin Marinas wrote:
> Why do you need your own defconfig? If it's just on the short term until
> all your code is upstream, that's fine, but this goes against the single
> Image aim. I would like defconfig to cover all supported SoCs (and yes,
> ACPI on by default once we deem it !EXPERT), though at some
> point we may need a server/mobile split (if the generated image is too
> large, maybe more stuff being built as modules).

Yes, that's exactly it.  Ours is an ACPI system, and so we have to have 
our own defconfig for now.  We're holding off on pushing our own 
defconfig changes (enabling drivers, etc) until ACPI is enabled in 
arch/arm64/configs/defconfig.

My understanding is that ACPI won't be enabled by default until at least 
after the GIC driver is fully ACPI-enabled.

-- 
Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora
Forum, a Linux Foundation collaborative project.

* Re: [PATCH] Revert "arm64: Increase the max granular size"
  2016-03-16 15:26       ` Timur Tabi
@ 2016-03-17 14:27         ` Catalin Marinas
  2016-03-17 14:49           ` Timur Tabi
  2016-03-17 18:07           ` Andrew Pinski
  0 siblings, 2 replies; 36+ messages in thread
From: Catalin Marinas @ 2016-03-17 14:27 UTC (permalink / raw)
  To: Timur Tabi
  Cc: linux-arm-kernel, Ganesh Mahendran, Will Deacon, linux-kernel,
	stable, rrichter, tchalamarla, apinski, Shanker Donthineni

On Wed, Mar 16, 2016 at 10:26:08AM -0500, Timur Tabi wrote:
> Catalin Marinas wrote:
> >Why do you need your own defconfig? If it's just on the short term until
> >all your code is upstream, that's fine, but this goes against the single
> >Image aim. I would like defconfig to cover all supported SoCs (and yes,
> >ACPI on by default once we deem it !EXPERT anymore), though at some
> >point we may need a server/mobile split (if the generated image is too
> >large, maybe more stuff being built as modules).
> 
> Yes, that's exactly it.  Ours is an ACPI system, and so we have to have our
> own defconfig for now.  We're holding off on pushing our own defconfig
> changes (enabling drivers, etc) until ACPI is enabled in
> arch/arm64/configs/defconfig.

Is there anything that prevents you from providing a dtb/dts for this
SoC?

-- 
Catalin

* Re: [PATCH] Revert "arm64: Increase the max granular size"
  2016-03-17 14:27         ` Catalin Marinas
@ 2016-03-17 14:49           ` Timur Tabi
  2016-03-17 15:37             ` Catalin Marinas
  2016-03-17 18:07           ` Andrew Pinski
  1 sibling, 1 reply; 36+ messages in thread
From: Timur Tabi @ 2016-03-17 14:49 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: linux-arm-kernel, Ganesh Mahendran, Will Deacon, linux-kernel,
	stable, rrichter, tchalamarla, apinski, Shanker Donthineni

Catalin Marinas wrote:
>> >Yes, that's exactly it.  Ours is an ACPI system, and so we have to have our
>> >own defconfig for now.  We're holding off on pushing our own defconfig
>> >changes (enabling drivers, etc) until ACPI is enabled in
>> >arch/arm64/configs/defconfig.

> Is there anything that prevents you from providing a dtb/dts for this
> SoC?

We don't have a boot loader capable of passing a device tree to the
kernel.  It's an ARM server chip.  It doesn't do device tree.  It's 100%
ACPI.  We boot with UEFI, which configures the system and generates ACPI
tables.

I just want to make this crystal clear, because it comes up every now 
and then.  The QDF2432 is an ACPI-only SOC with no device tree support 
whatsoever.  Zero.  Zip.  Nada.  It's not an option.

Keep in mind that on an ACPI system like ours, the boot loader (UEFI in 
our case) configures the system extensively.  It does a lot of things 
that the kernel would normally do on a device tree system.  For example, 
pin control is handled completely by UEFI.  The kernel does not set the 
pin muxes or GPIO directions.  That means we don't support dynamic pin 
muxing.  Before the kernel is booted, the GPIO pins are fixed.

We're not going to create an entire device tree from scratch for this 
chip, and then make the extensive modifications necessary to our boot 
loader for parsing and modifying that device tree.  That would take 
months of work, and it would be all throw-away code.

-- 
Sent by an employee of the Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the
Code Aurora Forum, hosted by The Linux Foundation.

* Re: [PATCH] Revert "arm64: Increase the max granular size"
  2016-03-17 14:49           ` Timur Tabi
@ 2016-03-17 15:37             ` Catalin Marinas
  2016-03-17 16:03               ` Marc Zyngier
  0 siblings, 1 reply; 36+ messages in thread
From: Catalin Marinas @ 2016-03-17 15:37 UTC (permalink / raw)
  To: Timur Tabi
  Cc: Ganesh Mahendran, Will Deacon, linux-kernel, stable, rrichter,
	tchalamarla, Shanker Donthineni, apinski, linux-arm-kernel

On Thu, Mar 17, 2016 at 09:49:51AM -0500, Timur Tabi wrote:
> Catalin Marinas wrote:
> >>>Yes, that's exactly it.  Ours is an ACPI system, and so we have to have our
> >>>own defconfig for now.  We're holding off on pushing our own defconfig
> >>>changes (enabling drivers, etc) until ACPI is enabled in
> >>>arch/arm64/configs/defconfig.
> 
> >Is there anything that prevents you from providing a dtb/dts for this
> >SoC?
> 
> We don't have a boot loader capable of passing a device tree to the kernel.
> It's an ARM Server chip.  It doesn't do device tree.  It's 100% ACPI.  We
> boot with UEFI that configures the system and generates ACPI tables.

Well, I disagree with the idea that server == ACPI. But I guess you knew
this already.

> I just want to make this crystal clear, because it comes up every now and
> then.  The QDF2432 is an ACPI-only SOC with no device tree support
> whatsoever.  Zero.  Zip.  Nada.  It's not an option.

That's your choice really, I don't care much (as long as current ACPI
support has all the features you need; if it doesn't, there is a good
chance that your SoC won't be supported in mainline until it's sorted).

> Keep in mind that on an ACPI system like ours, the boot loader (UEFI in our
> case) configures the system extensively.  It does a lot of things that the
> kernel would normally do on a device tree system.  For example, pin control
> is handled completely by UEFI.  The kernel does not set the pin muxes or
> GPIO directions.  That means we don't support dynamic pin muxing.  Before
> the kernel is booted, the GPIO pins are fixed.

And that's great. But you are mistaken in thinking that DT requires lots
of drivers in the kernel and prevents the firmware from doing sane
stuff. DT rather gained additional features out of necessity since the
firmware was not always doing a proper job at hardware initialisation.
A DT-enabled kernel does not impose restrictions on such firmware
features. With ACPI, the choice is not as wide and forces vendors to
look into their firmware story from a different angle (until they figure
out _DSD+PRP0001 and we end up with DT emulated in ACPI).

If the GPIO pins are fixed at boot-time, they don't even need to be
described in the DT, just let the firmware configure them. However, if
they need to be changed at run-time (which does not seem to be your
case), that's where you have a choice of either kernel driver (DT) or
AML code (ACPI) (or both). Otherwise, without this run-time aspect, both
DT and ACPI are just slightly different ways to provide a static
platform description.

> We're not going to create an entire device tree from scratch for this chip,
> and then make the extensive modifications necessary to our boot loader for
> parsing and modifying that device tree.  That would take months of work, and
> it would be all throw-away code.

As I said above, that's your choice and it depends on your timeline and
mainline support requirements. I however disagree with this being
"throw-away code" (or even taking "months of work").

-- 
Catalin

* Re: [PATCH] Revert "arm64: Increase the max granular size"
  2016-03-17 15:37             ` Catalin Marinas
@ 2016-03-17 16:03               ` Marc Zyngier
  0 siblings, 0 replies; 36+ messages in thread
From: Marc Zyngier @ 2016-03-17 16:03 UTC (permalink / raw)
  To: Catalin Marinas, Timur Tabi
  Cc: Ganesh Mahendran, Will Deacon, linux-kernel, stable, rrichter,
	tchalamarla, Shanker Donthineni, apinski, linux-arm-kernel

On 17/03/16 15:37, Catalin Marinas wrote:
> On Thu, Mar 17, 2016 at 09:49:51AM -0500, Timur Tabi wrote:

[...]

>> Keep in mind that on an ACPI system like ours, the boot loader (UEFI in our
>> case) configures the system extensively.  It does a lot of things that the
>> kernel would normally do on a device tree system.  For example, pin control
>> is handled completely by UEFI.  The kernel does not set the pin muxes or
>> GPIO directions.  That means we don't support dynamic pin muxing.  Before
>> the kernel is booted, the GPIO pins are fixed.
> 
> And that's great. But you are mistaken in thinking that DT requires lots
> of drivers in the kernel and prevents the firmware from doing sane
> stuff. DT rather gained additional features out of necessity since the
> firmware was not always doing a proper job at hardware initialisation.
> A DT-enabled kernel does not impose restrictions on such firmware
> features. With ACPI, the choice is not as wide and forces vendors to
> look into their firmware story from a different angle (until they figure
> the _DSD+PRP0001 out and we end up with DT emulated in ACPI).

Or even worse, perverting the couple of things that were actually OK in
ACPI by inventing a new layer of broken stuff:

http://thread.gmane.org/gmane.linux.ports.arm.msm/17828

Once we've reached that level, _DSD feels all nice and cuddly.

	M.
-- 
Jazz is not dead. It just smells funny...

* Re: [PATCH] Revert "arm64: Increase the max granular size"
  2016-03-17 14:27         ` Catalin Marinas
  2016-03-17 14:49           ` Timur Tabi
@ 2016-03-17 18:07           ` Andrew Pinski
  2016-03-17 18:34             ` Timur Tabi
  2016-03-17 18:37             ` Catalin Marinas
  1 sibling, 2 replies; 36+ messages in thread
From: Andrew Pinski @ 2016-03-17 18:07 UTC (permalink / raw)
  To: Catalin Marinas, Timur Tabi
  Cc: linux-arm-kernel, Ganesh Mahendran, Will Deacon, linux-kernel,
	stable, rrichter, tchalamarla, apinski, Shanker Donthineni

On 3/17/2016 7:27 AM, Catalin Marinas wrote:
> On Wed, Mar 16, 2016 at 10:26:08AM -0500, Timur Tabi wrote:
>> Catalin Marinas wrote:
>>> Why do you need your own defconfig? If it's just on the short term until
>>> all your code is upstream, that's fine, but this goes against the single
>>> Image aim. I would like defconfig to cover all supported SoCs (and yes,
>>> ACPI on by default once we deem it !EXPERT anymore), though at some
>>> point we may need a server/mobile split (if the generated image is too
>>> large, maybe more stuff being built as modules).
>> Yes, that's exactly it.  Ours is an ACPI system, and so we have to have our
>> own defconfig for now.  We're holding off on pushing our own defconfig
>> changes (enabling drivers, etc) until ACPI is enabled in
>> arch/arm64/configs/defconfig.
> Is there anything that prevents you from providing a dtb/dts for this
> SoC?

Note that ThunderX SoCs have both embedded customers (U-Boot) and
server customers (UEFI).  The cores always have a 128-byte cacheline
size.  So please don't make this dependent on ACPI.  Note that the
defconfig works correctly on T88.

Thanks,
Andrew

* Re: [PATCH] Revert "arm64: Increase the max granular size"
  2016-03-17 18:07           ` Andrew Pinski
@ 2016-03-17 18:34             ` Timur Tabi
  2016-03-17 18:37             ` Catalin Marinas
  1 sibling, 0 replies; 36+ messages in thread
From: Timur Tabi @ 2016-03-17 18:34 UTC (permalink / raw)
  To: Andrew Pinski, Catalin Marinas
  Cc: linux-arm-kernel, Ganesh Mahendran, Will Deacon, linux-kernel,
	stable, rrichter, tchalamarla, apinski, Shanker Donthineni

Andrew Pinski wrote:
> Note that ThunderX SoCs have both embedded customers (U-Boot) and
> server customers (UEFI).  The cores always have a 128-byte cacheline
> size.  So please don't make this dependent on ACPI.  Note that the
> defconfig works correctly on T88.

This thread is getting off-topic.  There's nothing about the cacheline 
size that is dependent on ACPI or DT.  Catalin was wondering why we have 
our own defconfig for our ARM64 SOC, and I replied that it's because 
ACPI is not enabled yet in the upstream defconfig.

-- 
Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora
Forum, a Linux Foundation collaborative project.

* Re: [PATCH] Revert "arm64: Increase the max granular size"
  2016-03-17 18:07           ` Andrew Pinski
  2016-03-17 18:34             ` Timur Tabi
@ 2016-03-17 18:37             ` Catalin Marinas
  1 sibling, 0 replies; 36+ messages in thread
From: Catalin Marinas @ 2016-03-17 18:37 UTC (permalink / raw)
  To: Andrew Pinski
  Cc: Timur Tabi, Ganesh Mahendran, Will Deacon, linux-kernel, stable,
	rrichter, tchalamarla, Shanker Donthineni, apinski,
	linux-arm-kernel

On Thu, Mar 17, 2016 at 11:07:00AM -0700, Andrew Pinski wrote:
> On 3/17/2016 7:27 AM, Catalin Marinas wrote:
> >On Wed, Mar 16, 2016 at 10:26:08AM -0500, Timur Tabi wrote:
> >>Catalin Marinas wrote:
> >>>Why do you need your own defconfig? If it's just on the short term until
> >>>all your code is upstream, that's fine, but this goes against the single
> >>>Image aim. I would like defconfig to cover all supported SoCs (and yes,
> >>>ACPI on by default once we deem it !EXPERT anymore), though at some
> >>>point we may need a server/mobile split (if the generated image is too
> >>>large, maybe more stuff being built as modules).
> >>Yes, that's exactly it.  Ours is an ACPI system, and so we have to have our
> >>own defconfig for now.  We're holding off on pushing our own defconfig
> >>changes (enabling drivers, etc) until ACPI is enabled in
> >>arch/arm64/configs/defconfig.
> >Is there anything that prevents you from providing a dtb/dts for this
> >SoC?
> 
> Note that ThunderX SoCs have both embedded customers (U-Boot) and server
> customers (UEFI).  The cores always have a 128-byte cacheline size.  So
> please don't make this dependent on ACPI.

Definitely not, this has nothing to do with ACPI or servers. My comment
on a different defconfig was more about things like 64K pages vs 4K, if
the former ever proves useful in practice. Who knows, we may even see
ACPI for IoT ;) (with MS involvement in Raspberry Pi)

> Note that the defconfig works correctly on T88.

We have two aspects to address: one is correctness and the other is
performance. But we bundle everything under L1_CACHE_BYTES, which affects
platforms that don't have such large cache lines (actually, it may even
affect those that do; has anyone done actual benchmarks?)

As Will suggested, we could try to revert L1_CACHE_BYTES back to 64 and
make ARCH_DMA_MINALIGN run-time based on CWG for correctness. Would this
work on the Cavium hardware?

-- 
Catalin

* Re: [PATCH] Revert "arm64: Increase the max granular size"
  2016-03-16  9:32 [PATCH] Revert "arm64: Increase the max granular size" Ganesh Mahendran
  2016-03-16 10:07 ` Will Deacon
@ 2016-03-18 21:05 ` Chalamarla, Tirumalesh
  2016-03-21  1:56   ` Ganesh Mahendran
  2016-03-21 17:14   ` Catalin Marinas
  1 sibling, 2 replies; 36+ messages in thread
From: Chalamarla, Tirumalesh @ 2016-03-18 21:05 UTC (permalink / raw)
  To: Ganesh Mahendran, catalin.marinas, will.deacon
  Cc: stable, linux-kernel, linux-arm-kernel

On 3/16/16, 2:32 AM, "linux-arm-kernel on behalf of Ganesh Mahendran" <linux-arm-kernel-bounces@lists.infradead.org on behalf of opensource.ganesh@gmail.com> wrote:

>This reverts commit 97303480753e ("arm64: Increase the max granular size").
>
>Commit 97303480753e ("arm64: Increase the max granular size") degrades
>system performance on some CPUs.
>
>We tested Wi-Fi network throughput with iperf on a Qualcomm msm8996 CPU:
>----------------
>run on host:
>  # iperf -s
>run on device:
>  # iperf -c <device-ip-addr> -t 100 -i 1
>----------------
>
>Test result:
>----------------
>with commit 97303480753e ("arm64: Increase the max granular size"):
>    172MBits/sec
>
>without commit 97303480753e ("arm64: Increase the max granular size"):
>    230MBits/sec
>----------------
>
>Modules such as slab and net use L1_CACHE_SHIFT, so if we do not set
>this parameter correctly, system performance may suffer.
>
>So revert the commit.

Is there any explanation why this is so? Maybe there is an alternative to this, apart from reverting the commit.

Until now, it seems L1_CACHE_SHIFT has been the max across supported chips. But now that we are making it 64 bytes, is there any reason why not 32?

Thanks,
Tirumalesh. 
>
>Cc: stable@vger.kernel.org
>Signed-off-by: Ganesh Mahendran <opensource.ganesh@gmail.com>
>---
> arch/arm64/include/asm/cache.h |    2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
>diff --git a/arch/arm64/include/asm/cache.h b/arch/arm64/include/asm/cache.h
>index 5082b30..bde4499 100644
>--- a/arch/arm64/include/asm/cache.h
>+++ b/arch/arm64/include/asm/cache.h
>@@ -18,7 +18,7 @@
> 
> #include <asm/cachetype.h>
> 
>-#define L1_CACHE_SHIFT		7
>+#define L1_CACHE_SHIFT		6
> #define L1_CACHE_BYTES		(1 << L1_CACHE_SHIFT)
> 
> /*
>-- 
>1.7.9.5

* Re: [PATCH] Revert "arm64: Increase the max granular size"
  2016-03-18 21:05 ` Chalamarla, Tirumalesh
@ 2016-03-21  1:56   ` Ganesh Mahendran
  2016-03-21 17:14   ` Catalin Marinas
  1 sibling, 0 replies; 36+ messages in thread
From: Ganesh Mahendran @ 2016-03-21  1:56 UTC (permalink / raw)
  To: Chalamarla, Tirumalesh
  Cc: catalin.marinas, will.deacon, stable, linux-kernel, linux-arm-kernel

Hello, Tirumalesh:

2016-03-19 5:05 GMT+08:00 Chalamarla, Tirumalesh
<Tirumalesh.Chalamarla@caviumnetworks.com>:
>
> On 3/16/16, 2:32 AM, "linux-arm-kernel on behalf of Ganesh Mahendran" <linux-arm-kernel-bounces@lists.infradead.org on behalf of opensource.ganesh@gmail.com> wrote:
>
>>This reverts commit 97303480753e ("arm64: Increase the max granular size").
>>
>>Commit 97303480753e ("arm64: Increase the max granular size") degrades
>>system performance on some CPUs.
>>
>>We tested Wi-Fi network throughput with iperf on a Qualcomm msm8996 CPU:
>>----------------
>>run on host:
>>  # iperf -s
>>run on device:
>>  # iperf -c <device-ip-addr> -t 100 -i 1
>>----------------
>>
>>Test result:
>>----------------
>>with commit 97303480753e ("arm64: Increase the max granular size"):
>>    172MBits/sec
>>
>>without commit 97303480753e ("arm64: Increase the max granular size"):
>>    230MBits/sec
>>----------------
>>
>>Modules such as slab and net use L1_CACHE_SHIFT, so if we do not set
>>this parameter correctly, system performance may suffer.
>>
>>So revert the commit.
>
> Is there any explanation why this is so? Maybe there is an alternative to this, apart from reverting the commit.
>

I just think that commit 97303480753e ("arm64: Increase the max
granular size") introduced a new problem for other SoCs whose L1 cache
line size is not 128 bytes. So I wanted to revert this commit.

> Until now, it seems L1_CACHE_SHIFT has been the max across supported chips. But now that we are making it 64 bytes, is there any reason why not 32?
>

We cannot simply set L1_CACHE_SHIFT to the maximum; there are other
places which use the L1 cache line size. If we just set the L1 cache
line size to the max, the memory footprint and system performance will
be affected.
For example:
------
#define SMP_CACHE_BYTES L1_CACHE_BYTES
#define SKB_DATA_ALIGN(X) ALIGN(X, SMP_CACHE_BYTES)
------
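
To make the footprint effect concrete, here is a small user-space
sketch of the same ALIGN() arithmetic, parameterised on the line size
(the kernel macro hard-codes SMP_CACHE_BYTES); 320 is roughly
sizeof(struct skb_shared_info) on arm64:

------
#include <stdio.h>

#define ALIGN(x, a)	(((x) + (a) - 1) & ~((a) - 1))

int main(void)
{
	unsigned int sizes[] = { 64, 320, 1500 };
	int i;

	/* padding overhead per buffer with 64- vs 128-byte cache lines */
	for (i = 0; i < 3; i++)
		printf("%4u bytes -> %4u (64B lines) vs %4u (128B lines)\n",
		       sizes[i], ALIGN(sizes[i], 64u), ALIGN(sizes[i], 128u));
	return 0;
}
------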

Thanks.

> Thanks,
> Tirumalesh.
>>
>>Cc: stable@vger.kernel.org
>>Signed-off-by: Ganesh Mahendran <opensource.ganesh@gmail.com>
>>---
>> arch/arm64/include/asm/cache.h |    2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>>diff --git a/arch/arm64/include/asm/cache.h b/arch/arm64/include/asm/cache.h
>>index 5082b30..bde4499 100644
>>--- a/arch/arm64/include/asm/cache.h
>>+++ b/arch/arm64/include/asm/cache.h
>>@@ -18,7 +18,7 @@
>>
>> #include <asm/cachetype.h>
>>
>>-#define L1_CACHE_SHIFT                7
>>+#define L1_CACHE_SHIFT                6
>> #define L1_CACHE_BYTES                (1 << L1_CACHE_SHIFT)
>>
>> /*
>>--
>>1.7.9.5

* Re: [PATCH] Revert "arm64: Increase the max granular size"
  2016-03-18 21:05 ` Chalamarla, Tirumalesh
  2016-03-21  1:56   ` Ganesh Mahendran
@ 2016-03-21 17:14   ` Catalin Marinas
  2016-03-21 17:23     ` Will Deacon
       [not found]     ` <CAPub14-sFgx=oCHzJPb9h9b_V0rbn5UAMDNJ-yTkjhz38JPqMQ@mail.gmail.com>
  1 sibling, 2 replies; 36+ messages in thread
From: Catalin Marinas @ 2016-03-21 17:14 UTC (permalink / raw)
  To: Chalamarla, Tirumalesh
  Cc: Ganesh Mahendran, will.deacon, linux-arm-kernel, linux-kernel, stable

On Fri, Mar 18, 2016 at 09:05:37PM +0000, Chalamarla, Tirumalesh wrote:
> On 3/16/16, 2:32 AM, "linux-arm-kernel on behalf of Ganesh Mahendran" <linux-arm-kernel-bounces@lists.infradead.org on behalf of opensource.ganesh@gmail.com> wrote:
> >This reverts commit 97303480753e ("arm64: Increase the max granular size").
> >
> >Commit 97303480753e ("arm64: Increase the max granular size") degrades
> >system performance on some CPUs.
> >
> >We tested Wi-Fi network throughput with iperf on a Qualcomm msm8996 CPU:
> >----------------
> >run on host:
> >  # iperf -s
> >run on device:
> >  # iperf -c <device-ip-addr> -t 100 -i 1
> >----------------
> >
> >Test result:
> >----------------
> >with commit 97303480753e ("arm64: Increase the max granular size"):
> >    172MBits/sec
> >
> >without commit 97303480753e ("arm64: Increase the max granular size"):
> >    230MBits/sec
> >----------------
> >
> >Modules such as slab and net use L1_CACHE_SHIFT, so if we do not set
> >this parameter correctly, system performance may suffer.
> >
> >So revert the commit.
> 
> Is there any explanation why this is so? Maybe there is an
> alternative to this, apart from reverting the commit.

I agree we need an explanation but in the meantime, this patch has
caused a regression on certain systems.

> Until now, it seems L1_CACHE_SHIFT has been the max across supported
> chips. But now that we are making it 64 bytes, is there any reason why not 32?

We may have to revisit this logic and consider L1_CACHE_BYTES the
_minimum_ of cache line sizes in arm64 systems supported by the kernel.
Do you have any benchmarks on Cavium boards that would show significant
degradation with 64-byte L1_CACHE_BYTES vs 128?

For non-coherent DMA, the simplest is to make ARCH_DMA_MINALIGN the
_maximum_ of the supported systems:

diff --git a/arch/arm64/include/asm/cache.h b/arch/arm64/include/asm/cache.h
index 5082b30bc2c0..4b5d7b27edaf 100644
--- a/arch/arm64/include/asm/cache.h
+++ b/arch/arm64/include/asm/cache.h
@@ -18,17 +18,17 @@
 
 #include <asm/cachetype.h>
 
-#define L1_CACHE_SHIFT		7
+#define L1_CACHE_SHIFT		6
 #define L1_CACHE_BYTES		(1 << L1_CACHE_SHIFT)
 
 /*
  * Memory returned by kmalloc() may be used for DMA, so we must make
- * sure that all such allocations are cache aligned. Otherwise,
- * unrelated code may cause parts of the buffer to be read into the
- * cache before the transfer is done, causing old data to be seen by
- * the CPU.
+ * sure that all such allocations are aligned to the maximum *known*
+ * cache line size on ARMv8 systems. Otherwise, unrelated code may cause
+ * parts of the buffer to be read into the cache before the transfer is
+ * done, causing old data to be seen by the CPU.
  */
-#define ARCH_DMA_MINALIGN	L1_CACHE_BYTES
+#define ARCH_DMA_MINALIGN	(128)
 
 #ifndef __ASSEMBLY__
 
diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index 392c67eb9fa6..30bafca1aebf 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -976,9 +976,9 @@ void __init setup_cpu_features(void)
 	if (!cwg)
 		pr_warn("No Cache Writeback Granule information, assuming cache line size %d\n",
 			cls);
-	if (L1_CACHE_BYTES < cls)
-		pr_warn("L1_CACHE_BYTES smaller than the Cache Writeback Granule (%d < %d)\n",
-			L1_CACHE_BYTES, cls);
+	if (ARCH_DMA_MINALIGN < cls)
+		pr_warn("ARCH_DMA_MINALIGN smaller than the Cache Writeback Granule (%d < %d)\n",
+			ARCH_DMA_MINALIGN, cls);
 }
 
 static bool __maybe_unused

* Re: [PATCH] Revert "arm64: Increase the max granular size"
  2016-03-21 17:14   ` Catalin Marinas
@ 2016-03-21 17:23     ` Will Deacon
  2016-03-21 17:33       ` Catalin Marinas
       [not found]     ` <CAPub14-sFgx=oCHzJPb9h9b_V0rbn5UAMDNJ-yTkjhz38JPqMQ@mail.gmail.com>
  1 sibling, 1 reply; 36+ messages in thread
From: Will Deacon @ 2016-03-21 17:23 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Chalamarla, Tirumalesh, Ganesh Mahendran, linux-arm-kernel,
	linux-kernel, stable

On Mon, Mar 21, 2016 at 05:14:03PM +0000, Catalin Marinas wrote:
> On Fri, Mar 18, 2016 at 09:05:37PM +0000, Chalamarla, Tirumalesh wrote:
> > On 3/16/16, 2:32 AM, "linux-arm-kernel on behalf of Ganesh Mahendran" <linux-arm-kernel-bounces@lists.infradead.org on behalf of opensource.ganesh@gmail.com> wrote:
> > >This reverts commit 97303480753e ("arm64: Increase the max granular size").
> > >
> > >Commit 97303480753e ("arm64: Increase the max granular size") degrades
> > >system performance on some CPUs.
> > >
> > >We tested Wi-Fi network throughput with iperf on a Qualcomm msm8996 CPU:
> > >----------------
> > >run on host:
> > >  # iperf -s
> > >run on device:
> > >  # iperf -c <device-ip-addr> -t 100 -i 1
> > >----------------
> > >
> > >Test result:
> > >----------------
> > >with commit 97303480753e ("arm64: Increase the max granular size"):
> > >    172MBits/sec
> > >
> > >without commit 97303480753e ("arm64: Increase the max granular size"):
> > >    230MBits/sec
> > >----------------
> > >
> > >Modules such as slab and net use L1_CACHE_SHIFT, so if we do not set
> > >this parameter correctly, system performance may suffer.
> > >
> > >So revert the commit.
> > 
> > Is there any explanation why this is so? Maybe there is an
> > alternative to this, apart from reverting the commit.
> 
> I agree we need an explanation but in the meantime, this patch has
> caused a regression on certain systems.
> 
> > Until now, it seems L1_CACHE_SHIFT has been the max across supported
> > chips. But now that we are making it 64 bytes, is there any reason why not 32?
> 
> We may have to revisit this logic and consider L1_CACHE_BYTES the
> _minimum_ of cache line sizes in arm64 systems supported by the kernel.
> Do you have any benchmarks on Cavium boards that would show significant
> degradation with 64-byte L1_CACHE_BYTES vs 128?
> 
> For non-coherent DMA, the simplest is to make ARCH_DMA_MINALIGN the
> _maximum_ of the supported systems:
> 
> diff --git a/arch/arm64/include/asm/cache.h b/arch/arm64/include/asm/cache.h
> index 5082b30bc2c0..4b5d7b27edaf 100644
> --- a/arch/arm64/include/asm/cache.h
> +++ b/arch/arm64/include/asm/cache.h
> @@ -18,17 +18,17 @@
>  
>  #include <asm/cachetype.h>
>  
> -#define L1_CACHE_SHIFT		7
> +#define L1_CACHE_SHIFT		6
>  #define L1_CACHE_BYTES		(1 << L1_CACHE_SHIFT)
>  
>  /*
>   * Memory returned by kmalloc() may be used for DMA, so we must make
> - * sure that all such allocations are cache aligned. Otherwise,
> - * unrelated code may cause parts of the buffer to be read into the
> - * cache before the transfer is done, causing old data to be seen by
> - * the CPU.
> + * sure that all such allocations are aligned to the maximum *known*
> + * cache line size on ARMv8 systems. Otherwise, unrelated code may cause
> + * parts of the buffer to be read into the cache before the transfer is
> + * done, causing old data to be seen by the CPU.
>   */
> -#define ARCH_DMA_MINALIGN	L1_CACHE_BYTES
> +#define ARCH_DMA_MINALIGN	(128)

Does this actually fix the reported iperf regression? My assumption was
that ARCH_DMA_MINALIGN is the problem, but I could be wrong.

Will

* Re: [PATCH] Revert "arm64: Increase the max granular size"
  2016-03-21 17:23     ` Will Deacon
@ 2016-03-21 17:33       ` Catalin Marinas
  2016-03-21 17:39         ` Chalamarla, Tirumalesh
  0 siblings, 1 reply; 36+ messages in thread
From: Catalin Marinas @ 2016-03-21 17:33 UTC (permalink / raw)
  To: Will Deacon
  Cc: Ganesh Mahendran, stable, Chalamarla, Tirumalesh, linux-kernel,
	linux-arm-kernel

On Mon, Mar 21, 2016 at 05:23:01PM +0000, Will Deacon wrote:
> On Mon, Mar 21, 2016 at 05:14:03PM +0000, Catalin Marinas wrote:
> > diff --git a/arch/arm64/include/asm/cache.h b/arch/arm64/include/asm/cache.h
> > index 5082b30bc2c0..4b5d7b27edaf 100644
> > --- a/arch/arm64/include/asm/cache.h
> > +++ b/arch/arm64/include/asm/cache.h
> > @@ -18,17 +18,17 @@
> >  
> >  #include <asm/cachetype.h>
> >  
> > -#define L1_CACHE_SHIFT		7
> > +#define L1_CACHE_SHIFT		6
> >  #define L1_CACHE_BYTES		(1 << L1_CACHE_SHIFT)
> >  
> >  /*
> >   * Memory returned by kmalloc() may be used for DMA, so we must make
> > - * sure that all such allocations are cache aligned. Otherwise,
> > - * unrelated code may cause parts of the buffer to be read into the
> > - * cache before the transfer is done, causing old data to be seen by
> > - * the CPU.
> > + * sure that all such allocations are aligned to the maximum *known*
> > + * cache line size on ARMv8 systems. Otherwise, unrelated code may cause
> > + * parts of the buffer to be read into the cache before the transfer is
> > + * done, causing old data to be seen by the CPU.
> >   */
> > -#define ARCH_DMA_MINALIGN	L1_CACHE_BYTES
> > +#define ARCH_DMA_MINALIGN	(128)
> 
> Does this actually fix the reported iperf regression? My assumption was
> that ARCH_DMA_MINALIGN is the problem, but I could be wrong.

I can't tell. But since I haven't seen any better explanation in this
thread yet, I hope that at least someone will try this patch and come
back with numbers.

For networking, SKB_DATA_ALIGN() uses SMP_CACHE_BYTES (== L1_CACHE_BYTES).
I think (hope) this alignment is not meant for non-coherent DMA,
otherwise using SMP_CACHE_BYTES wouldn't make sense.

-- 
Catalin

* Re: [PATCH] Revert "arm64: Increase the max granular size"
  2016-03-21 17:33       ` Catalin Marinas
@ 2016-03-21 17:39         ` Chalamarla, Tirumalesh
  0 siblings, 0 replies; 36+ messages in thread
From: Chalamarla, Tirumalesh @ 2016-03-21 17:39 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon
  Cc: Ganesh Mahendran, stable, linux-kernel, linux-arm-kernel

On 3/21/16, 10:33 AM, "Catalin Marinas" <catalin.marinas@arm.com> wrote:

>On Mon, Mar 21, 2016 at 05:23:01PM +0000, Will Deacon wrote:
>> On Mon, Mar 21, 2016 at 05:14:03PM +0000, Catalin Marinas wrote:
>> > diff --git a/arch/arm64/include/asm/cache.h b/arch/arm64/include/asm/cache.h
>> > index 5082b30bc2c0..4b5d7b27edaf 100644
>> > --- a/arch/arm64/include/asm/cache.h
>> > +++ b/arch/arm64/include/asm/cache.h
>> > @@ -18,17 +18,17 @@
>> >  
>> >  #include <asm/cachetype.h>
>> >  
>> > -#define L1_CACHE_SHIFT		7
>> > +#define L1_CACHE_SHIFT		6
>> >  #define L1_CACHE_BYTES		(1 << L1_CACHE_SHIFT)
>> >  
>> >  /*
>> >   * Memory returned by kmalloc() may be used for DMA, so we must make
>> > - * sure that all such allocations are cache aligned. Otherwise,
>> > - * unrelated code may cause parts of the buffer to be read into the
>> > - * cache before the transfer is done, causing old data to be seen by
>> > - * the CPU.
>> > + * sure that all such allocations are aligned to the maximum *known*
>> > + * cache line size on ARMv8 systems. Otherwise, unrelated code may cause
>> > + * parts of the buffer to be read into the cache before the transfer is
>> > + * done, causing old data to be seen by the CPU.
>> >   */
>> > -#define ARCH_DMA_MINALIGN	L1_CACHE_BYTES
>> > +#define ARCH_DMA_MINALIGN	(128)
>> 
>> Does this actually fix the reported iperf regression? My assumption was
>> that ARCH_DMA_MINALIGN is the problem, but I could be wrong.
>
>I can't tell. But since I haven't seen any better explanation in this
>thread yet, I hope that at least someone would try this patch and come
>back with numbers.
>
>For networking, SKB_DATA_ALIGN() uses SMP_CACHE_BYTES (== L1_CACHE_BYTES).
>I think (hope) this alignment is not meant for non-coherent DMA,
>otherwise using SMP_CACHE_BYTES wouldn't make sense.

As I see the problem, maybe it is because fewer buffers are available due to this alignment requirement.
But that should be fixed by making slab alignment a run-time thing. I may be totally wrong.

We are still coming up with a decent benchmark that shows degradation.  
>
>-- 
>Catalin

* Re: [PATCH] Revert "arm64: Increase the max granular size"
       [not found]       ` <10fef112-37f1-0a1b-b5af-435acd032f01@codeaurora.org>
@ 2017-04-06  7:22         ` Imran Khan
  2017-04-06 15:58           ` Catalin Marinas
  0 siblings, 1 reply; 36+ messages in thread
From: Imran Khan @ 2017-04-06  7:22 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: "Chalamarla, Tirumalesh" <Tirumalesh.Chalamarla@caviumnetworks.com>,
	Ganesh Mahendran, open list:ARM/QUALCOMM SUPPORT, open list,
	linux-arm-kernel

Hi Catalin,

On 4/5/2017 10:13 AM, Imran Khan wrote:
> Hi Catalin,
> 
>> From: Catalin Marinas <catalin.marinas@arm.com>
>> Date: Mon, Mar 21, 2016 at 10:44 PM
>> Subject: Re: [PATCH] Revert "arm64: Increase the max granular size"
>> To: "Chalamarla, Tirumalesh" <Tirumalesh.Chalamarla@caviumnetworks.com>
>> Cc: Ganesh Mahendran <opensource.ganesh@gmail.com>, "will.deacon@arm.com"
>> <will.deacon@arm.com>, "linux-arm-kernel@lists.infradead.org"
>> <linux-arm-kernel@lists.infradead.org>, "linux-kernel@vger.kernel.org"
>> <linux-kernel@vger.kernel.org>, "stable@vger.kernel.org"
>> <stable@vger.kernel.org>
>>
>>
>> On Fri, Mar 18, 2016 at 09:05:37PM +0000, Chalamarla, Tirumalesh wrote:
>>> On 3/16/16, 2:32 AM, "linux-arm-kernel on behalf of Ganesh Mahendran" <
>> linux-arm-kernel-bounces@lists.infradead.org on behalf of
>> opensource.ganesh@gmail.com> wrote:
>>>> This reverts commit 97303480753e ("arm64: Increase the max granular size").
>>>>
>>>> Commit 97303480753e ("arm64: Increase the max granular size") degrades
>>>> system performance on some CPUs.
>>>>
>>>> We tested Wi-Fi network throughput with iperf on a Qualcomm msm8996 CPU:
>>>> ----------------
>>>> run on host:
>>>>  # iperf -s
>>>> run on device:
>>>>  # iperf -c <device-ip-addr> -t 100 -i 1
>>>> ----------------
>>>>
>>>> Test result:
>>>> ----------------
>>>> with commit 97303480753e ("arm64: Increase the max granular size"):
>>>>    172MBits/sec
>>>>
>>>> without commit 97303480753e ("arm64: Increase the max granular size"):
>>>>    230MBits/sec
>>>> ----------------
>>>>
>>>> Modules such as slab and net use L1_CACHE_SHIFT, so if we do not set
>>>> this parameter correctly, system performance may suffer.
>>>>
>>>> So revert the commit.
>>>
>>> Is there any explanation why this is so? Maybe there is an
>>> alternative to this, apart from reverting the commit.
>>
>> I agree we need an explanation but in the meantime, this patch has
>> caused a regression on certain systems.
>>
>>> Until now, it seems L1_CACHE_SHIFT has been the max across supported
>>> chips. But now that we are making it 64 bytes, is there any reason why not 32?
>>
>> We may have to revisit this logic and consider L1_CACHE_BYTES the
>> _minimum_ of cache line sizes in arm64 systems supported by the kernel.
>> Do you have any benchmarks on Cavium boards that would show significant
>> degradation with 64-byte L1_CACHE_BYTES vs 128?
>>
>> For non-coherent DMA, the simplest is to make ARCH_DMA_MINALIGN the
>> _maximum_ of the supported systems:
>>
>> diff --git a/arch/arm64/include/asm/cache.h b/arch/arm64/include/asm/cache.h
>> index 5082b30bc2c0..4b5d7b27edaf 100644
>> --- a/arch/arm64/include/asm/cache.h
>> +++ b/arch/arm64/include/asm/cache.h
>> @@ -18,17 +18,17 @@
>>
>>  #include <asm/cachetype.h>
>>
>> -#define L1_CACHE_SHIFT         7
>> +#define L1_CACHE_SHIFT         6
>>  #define L1_CACHE_BYTES         (1 << L1_CACHE_SHIFT)
>>
>>  /*
>>   * Memory returned by kmalloc() may be used for DMA, so we must make
>> - * sure that all such allocations are cache aligned. Otherwise,
>> - * unrelated code may cause parts of the buffer to be read into the
>> - * cache before the transfer is done, causing old data to be seen by
>> - * the CPU.
>> + * sure that all such allocations are aligned to the maximum *known*
>> + * cache line size on ARMv8 systems. Otherwise, unrelated code may cause
>> + * parts of the buffer to be read into the cache before the transfer is
>> + * done, causing old data to be seen by the CPU.
>>   */
>> -#define ARCH_DMA_MINALIGN      L1_CACHE_BYTES
>> +#define ARCH_DMA_MINALIGN      (128)
>>
>>  #ifndef __ASSEMBLY__
>>
>> diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
>> index 392c67eb9fa6..30bafca1aebf 100644
>> --- a/arch/arm64/kernel/cpufeature.c
>> +++ b/arch/arm64/kernel/cpufeature.c
>> @@ -976,9 +976,9 @@ void __init setup_cpu_features(void)
>>         if (!cwg)
>>                 pr_warn("No Cache Writeback Granule information, assuming
>> cache line size %d\n",
>>                         cls);
>> -       if (L1_CACHE_BYTES < cls)
>> -               pr_warn("L1_CACHE_BYTES smaller than the Cache Writeback
>> Granule (%d < %d)\n",
>> -                       L1_CACHE_BYTES, cls);
>> +       if (ARCH_DMA_MINALIGN < cls)
>> +               pr_warn("ARCH_DMA_MINALIGN smaller than the Cache Writeback
>> Granule (%d < %d)\n",
>> +                       ARCH_DMA_MINALIGN, cls);
>>  }
>>
>>  static bool __maybe_unused
>>
> 
> This change was discussed at [1] but the discussion was not concluded,
> as apparently no one came back with test reports and numbers. After
> including this change in our local kernel, we are seeing a significant
> throughput improvement. For example, with:
> 
> iperf -c 192.168.1.181 -i 1 -w 128K -t 60
> 
> The average throughput improves by about 30% (230Mbps from 180Mbps).
> Could you please let us know if this change can be included in the upstream kernel.
> 
> [1]: https://groups.google.com/forum/#!topic/linux.kernel/P40yDB90ePs
> 

Could you please provide some feedback about the above-mentioned query?


> Thanks and Regards,
> Imran


-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
member of the Code Aurora Forum, hosted by The Linux Foundation

* Re: [PATCH] Revert "arm64: Increase the max granular size"
  2017-04-06  7:22         ` Imran Khan
@ 2017-04-06 15:58           ` Catalin Marinas
  2017-04-07  2:06             ` Ganesh Mahendran
  2017-04-11  4:40             ` Jon Masters
  0 siblings, 2 replies; 36+ messages in thread
From: Catalin Marinas @ 2017-04-06 15:58 UTC (permalink / raw)
  To: Imran Khan
  Cc: "Chalamarla, Tirumalesh" <Tirumalesh.Chalamarla@caviumnetworks.com>,
	Ganesh Mahendran, open list, linux-arm-kernel,
	open list:ARM/QUALCOMM SUPPORT

On Thu, Apr 06, 2017 at 12:52:13PM +0530, Imran Khan wrote:
> On 4/5/2017 10:13 AM, Imran Khan wrote:
> >> We may have to revisit this logic and consider L1_CACHE_BYTES the
> >> _minimum_ of cache line sizes in arm64 systems supported by the kernel.
> >> Do you have any benchmarks on Cavium boards that would show significant
> >> degradation with 64-byte L1_CACHE_BYTES vs 128?
> >>
> >> For non-coherent DMA, the simplest is to make ARCH_DMA_MINALIGN the
> >> _maximum_ of the supported systems:
> >>
> >> diff --git a/arch/arm64/include/asm/cache.h b/arch/arm64/include/asm/cache.h
> >> index 5082b30bc2c0..4b5d7b27edaf 100644
> >> --- a/arch/arm64/include/asm/cache.h
> >> +++ b/arch/arm64/include/asm/cache.h
> >> @@ -18,17 +18,17 @@
> >>
> >>  #include <asm/cachetype.h>
> >>
> >> -#define L1_CACHE_SHIFT         7
> >> +#define L1_CACHE_SHIFT         6
> >>  #define L1_CACHE_BYTES         (1 << L1_CACHE_SHIFT)
> >>
> >>  /*
> >>   * Memory returned by kmalloc() may be used for DMA, so we must make
> >> - * sure that all such allocations are cache aligned. Otherwise,
> >> - * unrelated code may cause parts of the buffer to be read into the
> >> - * cache before the transfer is done, causing old data to be seen by
> >> - * the CPU.
> >> + * sure that all such allocations are aligned to the maximum *known*
> >> + * cache line size on ARMv8 systems. Otherwise, unrelated code may cause
> >> + * parts of the buffer to be read into the cache before the transfer is
> >> + * done, causing old data to be seen by the CPU.
> >>   */
> >> -#define ARCH_DMA_MINALIGN      L1_CACHE_BYTES
> >> +#define ARCH_DMA_MINALIGN      (128)
> >>
> >>  #ifndef __ASSEMBLY__
> >>
> >> diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
> >> index 392c67eb9fa6..30bafca1aebf 100644
> >> --- a/arch/arm64/kernel/cpufeature.c
> >> +++ b/arch/arm64/kernel/cpufeature.c
> >> @@ -976,9 +976,9 @@ void __init setup_cpu_features(void)
> >>         if (!cwg)
> >>                 pr_warn("No Cache Writeback Granule information, assuming
> >> cache line size %d\n",
> >>                         cls);
> >> -       if (L1_CACHE_BYTES < cls)
> >> -               pr_warn("L1_CACHE_BYTES smaller than the Cache Writeback Granule (%d < %d)\n",
> >> -                       L1_CACHE_BYTES, cls);
> >> +       if (ARCH_DMA_MINALIGN < cls)
> >> +               pr_warn("ARCH_DMA_MINALIGN smaller than the Cache Writeback Granule (%d < %d)\n",
> >> +                       ARCH_DMA_MINALIGN, cls);
> >>  }
> >>
> >>  static bool __maybe_unused
> > 
> > This change was discussed at: [1] but was not concluded as apparently no one
> > came back with test report and numbers. After including this change in our 
> > local kernel we are seeing significant throughput improvement. For example with:
> > 
> > iperf -c 192.168.1.181 -i 1 -w 128K -t 60
> > 
> > The average throughput is improving by about 30% (230Mbps from 180Mbps).
> > Could you please let us know if this change can be included in upstream kernel.
> > 
> > [1]: https://groups.google.com/forum/#!topic/linux.kernel/P40yDB90ePs
> 
> Could you please provide some feedback about the above mentioned query ?

Do you have an explanation on the performance variation when
L1_CACHE_BYTES is changed? We'd need to understand how the network stack
is affected by L1_CACHE_BYTES, in which context it uses it (is it for
non-coherent DMA?).

The Cavium guys haven't shown any numbers (IIUC) to back the
L1_CACHE_BYTES performance improvement but I would not revert the
original commit since ARCH_DMA_MINALIGN definitely needs to cover the
maximum available cache line size, which is 128 for them.

-- 
Catalin

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] Revert "arm64: Increase the max granular size"
  2017-04-06 15:58           ` Catalin Marinas
@ 2017-04-07  2:06             ` Ganesh Mahendran
  2017-04-07  8:59               ` Catalin Marinas
  2017-04-12  5:13               ` Imran Khan
  2017-04-11  4:40             ` Jon Masters
  1 sibling, 2 replies; 36+ messages in thread
From: Ganesh Mahendran @ 2017-04-07  2:06 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Imran Khan, open list, linux-arm-kernel, open list:ARM/QUALCOMM SUPPORT

2017-04-06 23:58 GMT+08:00 Catalin Marinas <catalin.marinas@arm.com>:
> On Thu, Apr 06, 2017 at 12:52:13PM +0530, Imran Khan wrote:
>> On 4/5/2017 10:13 AM, Imran Khan wrote:
>> >> We may have to revisit this logic and consider L1_CACHE_BYTES the
>> >> _minimum_ of cache line sizes in arm64 systems supported by the kernel.
>> >> Do you have any benchmarks on Cavium boards that would show significant
>> >> degradation with 64-byte L1_CACHE_BYTES vs 128?
>> >>
>> >> For non-coherent DMA, the simplest is to make ARCH_DMA_MINALIGN the
>> >> _maximum_ of the supported systems:
>> >>
>> >> diff --git a/arch/arm64/include/asm/cache.h b/arch/arm64/include/asm/cache.h
>> >> index 5082b30bc2c0..4b5d7b27edaf 100644
>> >> --- a/arch/arm64/include/asm/cache.h
>> >> +++ b/arch/arm64/include/asm/cache.h
>> >> @@ -18,17 +18,17 @@
>> >>
>> >>  #include <asm/cachetype.h>
>> >>
>> >> -#define L1_CACHE_SHIFT         7
>> >> +#define L1_CACHE_SHIFT         6
>> >>  #define L1_CACHE_BYTES         (1 << L1_CACHE_SHIFT)
>> >>
>> >>  /*
>> >>   * Memory returned by kmalloc() may be used for DMA, so we must make
>> >> - * sure that all such allocations are cache aligned. Otherwise,
>> >> - * unrelated code may cause parts of the buffer to be read into the
>> >> - * cache before the transfer is done, causing old data to be seen by
>> >> - * the CPU.
>> >> + * sure that all such allocations are aligned to the maximum *known*
>> >> + * cache line size on ARMv8 systems. Otherwise, unrelated code may cause
>> >> + * parts of the buffer to be read into the cache before the transfer is
>> >> + * done, causing old data to be seen by the CPU.
>> >>   */
>> >> -#define ARCH_DMA_MINALIGN      L1_CACHE_BYTES
>> >> +#define ARCH_DMA_MINALIGN      (128)
>> >>
>> >>  #ifndef __ASSEMBLY__
>> >>
>> >> diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
>> >> index 392c67eb9fa6..30bafca1aebf 100644
>> >> --- a/arch/arm64/kernel/cpufeature.c
>> >> +++ b/arch/arm64/kernel/cpufeature.c
>> >> @@ -976,9 +976,9 @@ void __init setup_cpu_features(void)
>> >>         if (!cwg)
>> >>                 pr_warn("No Cache Writeback Granule information, assuming
>> >> cache line size %d\n",
>> >>                         cls);
>> >> -       if (L1_CACHE_BYTES < cls)
>> >> -               pr_warn("L1_CACHE_BYTES smaller than the Cache Writeback Granule (%d < %d)\n",
>> >> -                       L1_CACHE_BYTES, cls);
>> >> +       if (ARCH_DMA_MINALIGN < cls)
>> >> +               pr_warn("ARCH_DMA_MINALIGN smaller than the Cache Writeback Granule (%d < %d)\n",
>> >> +                       ARCH_DMA_MINALIGN, cls);
>> >>  }
>> >>
>> >>  static bool __maybe_unused
>> >
>> > This change was discussed at: [1] but was not concluded as apparently no one
>> > came back with test report and numbers. After including this change in our
>> > local kernel we are seeing significant throughput improvement. For example with:
>> >
>> > iperf -c 192.168.1.181 -i 1 -w 128K -t 60
>> >
>> > The average throughput is improving by about 30% (230Mbps from 180Mbps).
>> > Could you please let us know if this change can be included in upstream kernel.
>> >
>> > [1]: https://groups.google.com/forum/#!topic/linux.kernel/P40yDB90ePs
>>
>> Could you please provide some feedback about the above mentioned query ?
>
> Do you have an explanation on the performance variation when
> L1_CACHE_BYTES is changed? We'd need to understand how the network stack
> is affected by L1_CACHE_BYTES, in which context it uses it (is it for
> non-coherent DMA?).

The network stack uses SKB_DATA_ALIGN to align:
---
#define SKB_DATA_ALIGN(X) (((X) + (SMP_CACHE_BYTES - 1)) & \
			   ~(SMP_CACHE_BYTES - 1))

#define SMP_CACHE_BYTES L1_CACHE_BYTES
---
I think this is the reason for the performance regression.
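
To make the effect concrete, below is a minimal userspace sketch of the
same arithmetic (align_to mirrors the kernel's SKB_DATA_ALIGN; the
1720-byte request and the 320-byte skb_shared_info are illustrative
figures, not measurements from this platform):
---
#include <stdio.h>

/* Round x up to the next multiple of the cache line size. */
static unsigned long align_to(unsigned long x, unsigned long line)
{
	return (x + line - 1) & ~(line - 1);
}

int main(void)
{
	unsigned long req = 1720;	/* example skb data request */
	unsigned long shinfo = 320;	/* sizeof(struct skb_shared_info) */
	unsigned long line;

	for (line = 64; line <= 128; line *= 2)
		printf("line %3lu: data %4lu + shinfo %3lu = %4lu\n",
		       line, align_to(req, line), align_to(shinfo, line),
		       align_to(req, line) + align_to(shinfo, line));
	return 0;
}
---
With 64-byte lines the total is 2048, which fits kmalloc-2048; with
128-byte lines it is 2176, which spills into kmalloc-4096.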

>
> The Cavium guys haven't shown any numbers (IIUC) to back the
> L1_CACHE_BYTES performance improvement but I would not revert the
> original commit since ARCH_DMA_MINALIGN definitely needs to cover the
> maximum available cache line size, which is 128 for them.

How about defining L1_CACHE_SHIFT like below:
---
#ifdef CONFIG_ARM64_L1_CACHE_SHIFT
#define L1_CACHE_SHIFT CONFIG_ARM64_L1_CACHE_SHIFT
#else
#define L1_CACHE_SHIFT 7
#endif
---

Thanks

>
> --
> Catalin

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] Revert "arm64: Increase the max granular size"
  2017-04-07  2:06             ` Ganesh Mahendran
@ 2017-04-07  8:59               ` Catalin Marinas
  2017-04-12  5:13               ` Imran Khan
  1 sibling, 0 replies; 36+ messages in thread
From: Catalin Marinas @ 2017-04-07  8:59 UTC (permalink / raw)
  To: Ganesh Mahendran
  Cc: open list:ARM/QUALCOMM SUPPORT, Imran Khan, open list, linux-arm-kernel

On Fri, Apr 07, 2017 at 10:06:31AM +0800, Ganesh Mahendran wrote:
> 2017-04-06 23:58 GMT+08:00 Catalin Marinas <catalin.marinas@arm.com>:
> > On Thu, Apr 06, 2017 at 12:52:13PM +0530, Imran Khan wrote:
> >> On 4/5/2017 10:13 AM, Imran Khan wrote:
> >> >> We may have to revisit this logic and consider L1_CACHE_BYTES the
> >> >> _minimum_ of cache line sizes in arm64 systems supported by the kernel.
> >> >> Do you have any benchmarks on Cavium boards that would show significant
> >> >> degradation with 64-byte L1_CACHE_BYTES vs 128?
> >> >>
> >> >> For non-coherent DMA, the simplest is to make ARCH_DMA_MINALIGN the
> >> >> _maximum_ of the supported systems:
> >> >>
> >> >> diff --git a/arch/arm64/include/asm/cache.h b/arch/arm64/include/asm/cache.h
> >> >> index 5082b30bc2c0..4b5d7b27edaf 100644
> >> >> --- a/arch/arm64/include/asm/cache.h
> >> >> +++ b/arch/arm64/include/asm/cache.h
> >> >> @@ -18,17 +18,17 @@
> >> >>
> >> >>  #include <asm/cachetype.h>
> >> >>
> >> >> -#define L1_CACHE_SHIFT         7
> >> >> +#define L1_CACHE_SHIFT         6
> >> >>  #define L1_CACHE_BYTES         (1 << L1_CACHE_SHIFT)
> >> >>
> >> >>  /*
> >> >>   * Memory returned by kmalloc() may be used for DMA, so we must make
> >> >> - * sure that all such allocations are cache aligned. Otherwise,
> >> >> - * unrelated code may cause parts of the buffer to be read into the
> >> >> - * cache before the transfer is done, causing old data to be seen by
> >> >> - * the CPU.
> >> >> + * sure that all such allocations are aligned to the maximum *known*
> >> >> + * cache line size on ARMv8 systems. Otherwise, unrelated code may cause
> >> >> + * parts of the buffer to be read into the cache before the transfer is
> >> >> + * done, causing old data to be seen by the CPU.
> >> >>   */
> >> >> -#define ARCH_DMA_MINALIGN      L1_CACHE_BYTES
> >> >> +#define ARCH_DMA_MINALIGN      (128)
> >> >>
> >> >>  #ifndef __ASSEMBLY__
> >> >>
> >> >> diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
> >> >> index 392c67eb9fa6..30bafca1aebf 100644
> >> >> --- a/arch/arm64/kernel/cpufeature.c
> >> >> +++ b/arch/arm64/kernel/cpufeature.c
> >> >> @@ -976,9 +976,9 @@ void __init setup_cpu_features(void)
> >> >>         if (!cwg)
> >> >>                 pr_warn("No Cache Writeback Granule information, assuming
> >> >> cache line size %d\n",
> >> >>                         cls);
> >> >> -       if (L1_CACHE_BYTES < cls)
> >> >> -               pr_warn("L1_CACHE_BYTES smaller than the Cache Writeback Granule (%d < %d)\n",
> >> >> -                       L1_CACHE_BYTES, cls);
> >> >> +       if (ARCH_DMA_MINALIGN < cls)
> >> >> +               pr_warn("ARCH_DMA_MINALIGN smaller than the Cache Writeback Granule (%d < %d)\n",
> >> >> +                       ARCH_DMA_MINALIGN, cls);
> >> >>  }
> >> >>
> >> >>  static bool __maybe_unused
> >> >
> >> > This change was discussed at: [1] but was not concluded as apparently no one
> >> > came back with test report and numbers. After including this change in our
> >> > local kernel we are seeing significant throughput improvement. For example with:
> >> >
> >> > iperf -c 192.168.1.181 -i 1 -w 128K -t 60
> >> >
> >> > The average throughput is improving by about 30% (230Mbps from 180Mbps).
> >> > Could you please let us know if this change can be included in upstream kernel.
> >> >
> >> > [1]: https://groups.google.com/forum/#!topic/linux.kernel/P40yDB90ePs
> >>
> >> Could you please provide some feedback about the above mentioned query ?
> >
> > Do you have an explanation on the performance variation when
> > L1_CACHE_BYTES is changed? We'd need to understand how the network stack
> > is affected by L1_CACHE_BYTES, in which context it uses it (is it for
> > non-coherent DMA?).
> 
> network stack use SKB_DATA_ALIGN to align.
> ---
> #define SKB_DATA_ALIGN(X) (((X) + (SMP_CACHE_BYTES - 1)) & \
> ~(SMP_CACHE_BYTES - 1))
> 
> #define SMP_CACHE_BYTES L1_CACHE_BYTES
> ---
> I think this is the reason of performance regression.

And is this alignment required for DMA coherency? (I hope not since
SMP_CACHE_BYTES doesn't give this).

Anyway, to be sure, it's worth changing SKB_DATA_ALIGN to use 64 bytes
(hard-coded) and check the results again.
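
Something along these lines would do for the experiment (an untested
sketch against include/linux/skbuff.h, for the measurement only, not a
proposed fix):
---
/* Experiment only: decouple the skb layout from L1_CACHE_BYTES by
 * hard-coding a 64-byte boundary, then re-run the iperf test.
 */
#define SKB_DATA_ALIGN(X)	(((X) + (64 - 1)) & ~(64 - 1))
---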

> > The Cavium guys haven't shown any numbers (IIUC) to back the
> > L1_CACHE_BYTES performance improvement but I would not revert the
> > original commit since ARCH_DMA_MINALIGN definitely needs to cover the
> > maximum available cache line size, which is 128 for them.
> 
> how about define L1_CACHE_SHIFT like below:
> ---
> #ifdef CONFIG_ARM64_L1_CACHE_SHIFT
> #define L1_CACHE_SHIFT CONFIG_ARM64_L1_CACHE_SHIFT
> #else
> #define L1_CACHE_SHIFT 7
> #endif
> ---

I'd very much like to avoid this, still aiming for a single kernel
image.

My suggestion is to check whether L1_CACHE_BYTES is wrongly used for DMA
buffer alignment in the network code and, if not, move it back to 64
bytes while keeping ARCH_DMA_MINALIGN at 128 (as per my patch above). If
the performance on the Cavium system is affected by the L1_CACHE_BYTES
change, further patches would need to be backed by some numbers.

-- 
Catalin

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] Revert "arm64: Increase the max granular size"
  2017-04-06 15:58           ` Catalin Marinas
  2017-04-07  2:06             ` Ganesh Mahendran
@ 2017-04-11  4:40             ` Jon Masters
  1 sibling, 0 replies; 36+ messages in thread
From: Jon Masters @ 2017-04-11  4:40 UTC (permalink / raw)
  To: Catalin Marinas, Imran Khan
  Cc: "Chalamarla,
	Tirumalesh"
	<Tirumalesh.Chalamarla@caviumnetworks.com>; Ganesh
	Mahendran, open list, linux-arm-kernel,
	open list:ARM/QUALCOMM SUPPORT

On 04/06/2017 11:58 AM, Catalin Marinas wrote:

> The Cavium guys haven't shown any numbers (IIUC) to back the
> L1_CACHE_BYTES performance improvement but I would not revert the
> original commit since ARCH_DMA_MINALIGN definitely needs to cover the
> maximum available cache line size, which is 128 for them.

I am pinging them via other channels to ensure they've seen this.

Jon.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] Revert "arm64: Increase the max granular size"
  2017-04-07  2:06             ` Ganesh Mahendran
  2017-04-07  8:59               ` Catalin Marinas
@ 2017-04-12  5:13               ` Imran Khan
  2017-04-12 14:00                 ` Chalamarla, Tirumalesh
  1 sibling, 1 reply; 36+ messages in thread
From: Imran Khan @ 2017-04-12  5:13 UTC (permalink / raw)
  To: Ganesh Mahendran, Catalin Marinas
  Cc: open list, linux-arm-kernel, open list:ARM/QUALCOMM SUPPORT

On 4/7/2017 7:36 AM, Ganesh Mahendran wrote:
> 2017-04-06 23:58 GMT+08:00 Catalin Marinas <catalin.marinas@arm.com>:
>> On Thu, Apr 06, 2017 at 12:52:13PM +0530, Imran Khan wrote:
>>> On 4/5/2017 10:13 AM, Imran Khan wrote:
>>>>> We may have to revisit this logic and consider L1_CACHE_BYTES the
>>>>> _minimum_ of cache line sizes in arm64 systems supported by the kernel.
>>>>> Do you have any benchmarks on Cavium boards that would show significant
>>>>> degradation with 64-byte L1_CACHE_BYTES vs 128?
>>>>>
>>>>> For non-coherent DMA, the simplest is to make ARCH_DMA_MINALIGN the
>>>>> _maximum_ of the supported systems:
>>>>>
>>>>> diff --git a/arch/arm64/include/asm/cache.h b/arch/arm64/include/asm/cache.h
>>>>> index 5082b30bc2c0..4b5d7b27edaf 100644
>>>>> --- a/arch/arm64/include/asm/cache.h
>>>>> +++ b/arch/arm64/include/asm/cache.h
>>>>> @@ -18,17 +18,17 @@
>>>>>
>>>>>  #include <asm/cachetype.h>
>>>>>
>>>>> -#define L1_CACHE_SHIFT         7
>>>>> +#define L1_CACHE_SHIFT         6
>>>>>  #define L1_CACHE_BYTES         (1 << L1_CACHE_SHIFT)
>>>>>
>>>>>  /*
>>>>>   * Memory returned by kmalloc() may be used for DMA, so we must make
>>>>> - * sure that all such allocations are cache aligned. Otherwise,
>>>>> - * unrelated code may cause parts of the buffer to be read into the
>>>>> - * cache before the transfer is done, causing old data to be seen by
>>>>> - * the CPU.
>>>>> + * sure that all such allocations are aligned to the maximum *known*
>>>>> + * cache line size on ARMv8 systems. Otherwise, unrelated code may cause
>>>>> + * parts of the buffer to be read into the cache before the transfer is
>>>>> + * done, causing old data to be seen by the CPU.
>>>>>   */
>>>>> -#define ARCH_DMA_MINALIGN      L1_CACHE_BYTES
>>>>> +#define ARCH_DMA_MINALIGN      (128)
>>>>>
>>>>>  #ifndef __ASSEMBLY__
>>>>>
>>>>> diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
>>>>> index 392c67eb9fa6..30bafca1aebf 100644
>>>>> --- a/arch/arm64/kernel/cpufeature.c
>>>>> +++ b/arch/arm64/kernel/cpufeature.c
>>>>> @@ -976,9 +976,9 @@ void __init setup_cpu_features(void)
>>>>>         if (!cwg)
>>>>>                 pr_warn("No Cache Writeback Granule information, assuming
>>>>> cache line size %d\n",
>>>>>                         cls);
>>>>> -       if (L1_CACHE_BYTES < cls)
>>>>> -               pr_warn("L1_CACHE_BYTES smaller than the Cache Writeback Granule (%d < %d)\n",
>>>>> -                       L1_CACHE_BYTES, cls);
>>>>> +       if (ARCH_DMA_MINALIGN < cls)
>>>>> +               pr_warn("ARCH_DMA_MINALIGN smaller than the Cache Writeback Granule (%d < %d)\n",
>>>>> +                       ARCH_DMA_MINALIGN, cls);
>>>>>  }
>>>>>
>>>>>  static bool __maybe_unused
>>>>
>>>> This change was discussed at: [1] but was not concluded as apparently no one
>>>> came back with test report and numbers. After including this change in our
>>>> local kernel we are seeing significant throughput improvement. For example with:
>>>>
>>>> iperf -c 192.168.1.181 -i 1 -w 128K -t 60
>>>>
>>>> The average throughput is improving by about 30% (230Mbps from 180Mbps).
>>>> Could you please let us know if this change can be included in upstream kernel.
>>>>
>>>> [1]: https://groups.google.com/forum/#!topic/linux.kernel/P40yDB90ePs
>>>
>>> Could you please provide some feedback about the above mentioned query ?
>>
>> Do you have an explanation on the performance variation when
>> L1_CACHE_BYTES is changed? We'd need to understand how the network stack
>> is affected by L1_CACHE_BYTES, in which context it uses it (is it for
>> non-coherent DMA?).
> 
> network stack use SKB_DATA_ALIGN to align.
> ---
> #define SKB_DATA_ALIGN(X) (((X) + (SMP_CACHE_BYTES - 1)) & \
> ~(SMP_CACHE_BYTES - 1))
> 
> #define SMP_CACHE_BYTES L1_CACHE_BYTES
> ---
> I think this is the reason of performance regression.
> 

Yes, this is the reason for the performance regression. Due to the increased L1 cache alignment, the
object comes from the next kmalloc slab and skb->truesize changes from 2304 bytes to
4352 bytes. This in turn increases sk_wmem_alloc, which causes fewer send buffers to be queued.
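
A sketch of the accounting behind those numbers (the 1720-byte request
size is illustrative, and sizeof(struct sk_buff) is assumed to round up
to 256 bytes under either alignment):
---
/* 64-byte lines:  align(1720) + align(320) = 1728 + 320  = 2048
 *                 -> kmalloc-2048, truesize = 2048 + 256 = 2304
 * 128-byte lines: align(1720) + align(320) = 1792 + 384  = 2176
 *                 -> kmalloc-4096, truesize = 4096 + 256 = 4352
 */
---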

>>
>> The Cavium guys haven't shown any numbers (IIUC) to back the
>> L1_CACHE_BYTES performance improvement but I would not revert the
>> original commit since ARCH_DMA_MINALIGN definitely needs to cover the
>> maximum available cache line size, which is 128 for them.
> 
> how about define L1_CACHE_SHIFT like below:
> ---
> #ifdef CONFIG_ARM64_L1_CACHE_SHIFT
> #define L1_CACHE_SHIFT CONFIG_ARM64_L1_CACHE_SHIFT
> #else
> #define L1_CACHE_SHIFT 7
> #endif
> ---
> 
> Thanks
> 
>>
>> --
>> Catalin


-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] Revert "arm64: Increase the max granular size"
  2017-04-12  5:13               ` Imran Khan
@ 2017-04-12 14:00                 ` Chalamarla, Tirumalesh
  2017-04-17  7:35                   ` Imran Khan
  0 siblings, 1 reply; 36+ messages in thread
From: Chalamarla, Tirumalesh @ 2017-04-12 14:00 UTC (permalink / raw)
  To: Imran Khan, Ganesh Mahendran, Catalin Marinas
  Cc: open list:ARM/QUALCOMM SUPPORT, open list, linux-arm-kernel



On 4/11/17, 10:13 PM, "linux-arm-kernel on behalf of Imran Khan" <linux-arm-kernel-bounces@lists.infradead.org on behalf of kimran@codeaurora.org> wrote:

    On 4/7/2017 7:36 AM, Ganesh Mahendran wrote:
    > 2017-04-06 23:58 GMT+08:00 Catalin Marinas <catalin.marinas@arm.com>:
    >> On Thu, Apr 06, 2017 at 12:52:13PM +0530, Imran Khan wrote:
    >>> On 4/5/2017 10:13 AM, Imran Khan wrote:
    >>>>> We may have to revisit this logic and consider L1_CACHE_BYTES the
    >>>>> _minimum_ of cache line sizes in arm64 systems supported by the kernel.
    >>>>> Do you have any benchmarks on Cavium boards that would show significant
    >>>>> degradation with 64-byte L1_CACHE_BYTES vs 128?
    >>>>>
    >>>>> For non-coherent DMA, the simplest is to make ARCH_DMA_MINALIGN the
    >>>>> _maximum_ of the supported systems:
    >>>>>
    >>>>> diff --git a/arch/arm64/include/asm/cache.h b/arch/arm64/include/asm/cache.h
    >>>>> index 5082b30bc2c0..4b5d7b27edaf 100644
    >>>>> --- a/arch/arm64/include/asm/cache.h
    >>>>> +++ b/arch/arm64/include/asm/cache.h
    >>>>> @@ -18,17 +18,17 @@
    >>>>>
    >>>>>  #include <asm/cachetype.h>
    >>>>>
    >>>>> -#define L1_CACHE_SHIFT         7
    >>>>> +#define L1_CACHE_SHIFT         6
    >>>>>  #define L1_CACHE_BYTES         (1 << L1_CACHE_SHIFT)
    >>>>>
    >>>>>  /*
    >>>>>   * Memory returned by kmalloc() may be used for DMA, so we must make
    >>>>> - * sure that all such allocations are cache aligned. Otherwise,
    >>>>> - * unrelated code may cause parts of the buffer to be read into the
    >>>>> - * cache before the transfer is done, causing old data to be seen by
    >>>>> - * the CPU.
    >>>>> + * sure that all such allocations are aligned to the maximum *known*
    >>>>> + * cache line size on ARMv8 systems. Otherwise, unrelated code may cause
    >>>>> + * parts of the buffer to be read into the cache before the transfer is
    >>>>> + * done, causing old data to be seen by the CPU.
    >>>>>   */
    >>>>> -#define ARCH_DMA_MINALIGN      L1_CACHE_BYTES
    >>>>> +#define ARCH_DMA_MINALIGN      (128)
    >>>>>
    >>>>>  #ifndef __ASSEMBLY__
    >>>>>
    >>>>> diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
    >>>>> index 392c67eb9fa6..30bafca1aebf 100644
    >>>>> --- a/arch/arm64/kernel/cpufeature.c
    >>>>> +++ b/arch/arm64/kernel/cpufeature.c
    >>>>> @@ -976,9 +976,9 @@ void __init setup_cpu_features(void)
    >>>>>         if (!cwg)
    >>>>>                 pr_warn("No Cache Writeback Granule information, assuming
    >>>>> cache line size %d\n",
    >>>>>                         cls);
    >>>>> -       if (L1_CACHE_BYTES < cls)
    >>>>> -               pr_warn("L1_CACHE_BYTES smaller than the Cache Writeback Granule (%d < %d)\n",
    >>>>> -                       L1_CACHE_BYTES, cls);
    >>>>> +       if (ARCH_DMA_MINALIGN < cls)
    >>>>> +               pr_warn("ARCH_DMA_MINALIGN smaller than the Cache Writeback Granule (%d < %d)\n",
    >>>>> +                       ARCH_DMA_MINALIGN, cls);
    >>>>>  }
    >>>>>
    >>>>>  static bool __maybe_unused
    >>>>
    >>>> This change was discussed at: [1] but was not concluded as apparently no one
    >>>> came back with test report and numbers. After including this change in our
    >>>> local kernel we are seeing significant throughput improvement. For example with:
    >>>>
    >>>> iperf -c 192.168.1.181 -i 1 -w 128K -t 60
    >>>>
    >>>> The average throughput is improving by about 30% (230Mbps from 180Mbps).
    >>>> Could you please let us know if this change can be included in upstream kernel.
    >>>>
    >>>> [1]: https://groups.google.com/forum/#!topic/linux.kernel/P40yDB90ePs
    >>>
    >>> Could you please provide some feedback about the above mentioned query ?
    >>
    >> Do you have an explanation on the performance variation when
    >> L1_CACHE_BYTES is changed? We'd need to understand how the network stack
    >> is affected by L1_CACHE_BYTES, in which context it uses it (is it for
    >> non-coherent DMA?).
    > 
    > network stack use SKB_DATA_ALIGN to align.
    > ---
    > #define SKB_DATA_ALIGN(X) (((X) + (SMP_CACHE_BYTES - 1)) & \
    > ~(SMP_CACHE_BYTES - 1))
    > 
    > #define SMP_CACHE_BYTES L1_CACHE_BYTES
    > ---
    > I think this is the reason of performance regression.
    > 
    
    Yes this is the reason for performance regression. Due to increases L1 cache alignment the 
    object is coming from next kmalloc slab and skb->truesize is changing from 2304 bytes to 
    4352 bytes. This in turn increases sk_wmem_alloc which causes queuing of less send buffers.
    
We tried different benchmarks and found none that is really affected by the cache line change. If there is no
correctness issue, I think we are fine with reverting the patch.

Though I still think it is beneficial to do some more investigation into the perf loss; who knows, 32-bit alignment
or no alignment might give even more perf benefit.


Thanks,
Tirumalesh.  
    >>
    >> The Cavium guys haven't shown any numbers (IIUC) to back the
    >> L1_CACHE_BYTES performance improvement but I would not revert the
    >> original commit since ARCH_DMA_MINALIGN definitely needs to cover the
    >> maximum available cache line size, which is 128 for them.
    > 
    > how about define L1_CACHE_SHIFT like below:
    > ---
    > #ifdef CONFIG_ARM64_L1_CACHE_SHIFT
    > #define L1_CACHE_SHIFT CONFIG_ARM64_L1_CACHE_SHIFT
    > #else
    > #define L1_CACHE_SHIFT 7
    > #endif
    > ---
    > 
    > Thanks
    > 
    >>
    >> --
    >> Catalin
    
    
    -- 
    QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation
    
    _______________________________________________
    linux-arm-kernel mailing list
    linux-arm-kernel@lists.infradead.org
    http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
    

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] Revert "arm64: Increase the max granular size"
  2017-04-12 14:00                 ` Chalamarla, Tirumalesh
@ 2017-04-17  7:35                   ` Imran Khan
  2017-04-17 10:38                     ` Sunil Kovvuri
  2017-04-18 18:21                     ` Chalamarla, Tirumalesh
  0 siblings, 2 replies; 36+ messages in thread
From: Imran Khan @ 2017-04-17  7:35 UTC (permalink / raw)
  To: Chalamarla, Tirumalesh, Ganesh Mahendran, Catalin Marinas
  Cc: open list:ARM/QUALCOMM SUPPORT, open list, linux-arm-kernel

On 4/12/2017 7:30 PM, Chalamarla, Tirumalesh wrote:
> 
> 
> On 4/11/17, 10:13 PM, "linux-arm-kernel on behalf of Imran Khan" <linux-arm-kernel-bounces@lists.infradead.org on behalf of kimran@codeaurora.org> wrote:
> 
>     On 4/7/2017 7:36 AM, Ganesh Mahendran wrote:
>     > 2017-04-06 23:58 GMT+08:00 Catalin Marinas <catalin.marinas@arm.com>:
>     >> On Thu, Apr 06, 2017 at 12:52:13PM +0530, Imran Khan wrote:
>     >>> On 4/5/2017 10:13 AM, Imran Khan wrote:
>     >>>>> We may have to revisit this logic and consider L1_CACHE_BYTES the
>     >>>>> _minimum_ of cache line sizes in arm64 systems supported by the kernel.
>     >>>>> Do you have any benchmarks on Cavium boards that would show significant
>     >>>>> degradation with 64-byte L1_CACHE_BYTES vs 128?
>     >>>>>
>     >>>>> For non-coherent DMA, the simplest is to make ARCH_DMA_MINALIGN the
>     >>>>> _maximum_ of the supported systems:
>     >>>>>
>     >>>>> diff --git a/arch/arm64/include/asm/cache.h b/arch/arm64/include/asm/cache.h
>     >>>>> index 5082b30bc2c0..4b5d7b27edaf 100644
>     >>>>> --- a/arch/arm64/include/asm/cache.h
>     >>>>> +++ b/arch/arm64/include/asm/cache.h
>     >>>>> @@ -18,17 +18,17 @@
>     >>>>>
>     >>>>>  #include <asm/cachetype.h>
>     >>>>>
>     >>>>> -#define L1_CACHE_SHIFT         7
>     >>>>> +#define L1_CACHE_SHIFT         6
>     >>>>>  #define L1_CACHE_BYTES         (1 << L1_CACHE_SHIFT)
>     >>>>>
>     >>>>>  /*
>     >>>>>   * Memory returned by kmalloc() may be used for DMA, so we must make
>     >>>>> - * sure that all such allocations are cache aligned. Otherwise,
>     >>>>> - * unrelated code may cause parts of the buffer to be read into the
>     >>>>> - * cache before the transfer is done, causing old data to be seen by
>     >>>>> - * the CPU.
>     >>>>> + * sure that all such allocations are aligned to the maximum *known*
>     >>>>> + * cache line size on ARMv8 systems. Otherwise, unrelated code may cause
>     >>>>> + * parts of the buffer to be read into the cache before the transfer is
>     >>>>> + * done, causing old data to be seen by the CPU.
>     >>>>>   */
>     >>>>> -#define ARCH_DMA_MINALIGN      L1_CACHE_BYTES
>     >>>>> +#define ARCH_DMA_MINALIGN      (128)
>     >>>>>
>     >>>>>  #ifndef __ASSEMBLY__
>     >>>>>
>     >>>>> diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
>     >>>>> index 392c67eb9fa6..30bafca1aebf 100644
>     >>>>> --- a/arch/arm64/kernel/cpufeature.c
>     >>>>> +++ b/arch/arm64/kernel/cpufeature.c
>     >>>>> @@ -976,9 +976,9 @@ void __init setup_cpu_features(void)
>     >>>>>         if (!cwg)
>     >>>>>                 pr_warn("No Cache Writeback Granule information, assuming
>     >>>>> cache line size %d\n",
>     >>>>>                         cls);
>     >>>>> -       if (L1_CACHE_BYTES < cls)
>     >>>>> -               pr_warn("L1_CACHE_BYTES smaller than the Cache Writeback Granule (%d < %d)\n",
>     >>>>> -                       L1_CACHE_BYTES, cls);
>     >>>>> +       if (ARCH_DMA_MINALIGN < cls)
>     >>>>> +               pr_warn("ARCH_DMA_MINALIGN smaller than the Cache Writeback Granule (%d < %d)\n",
>     >>>>> +                       ARCH_DMA_MINALIGN, cls);
>     >>>>>  }
>     >>>>>
>     >>>>>  static bool __maybe_unused
>     >>>>
>     >>>> This change was discussed at: [1] but was not concluded as apparently no one
>     >>>> came back with test report and numbers. After including this change in our
>     >>>> local kernel we are seeing significant throughput improvement. For example with:
>     >>>>
>     >>>> iperf -c 192.168.1.181 -i 1 -w 128K -t 60
>     >>>>
>     >>>> The average throughput is improving by about 30% (230Mbps from 180Mbps).
>     >>>> Could you please let us know if this change can be included in upstream kernel.
>     >>>>
>     >>>> [1]: https://groups.google.com/forum/#!topic/linux.kernel/P40yDB90ePs
>     >>>
>     >>> Could you please provide some feedback about the above mentioned query ?
>     >>
>     >> Do you have an explanation on the performance variation when
>     >> L1_CACHE_BYTES is changed? We'd need to understand how the network stack
>     >> is affected by L1_CACHE_BYTES, in which context it uses it (is it for
>     >> non-coherent DMA?).
>     > 
>     > network stack use SKB_DATA_ALIGN to align.
>     > ---
>     > #define SKB_DATA_ALIGN(X) (((X) + (SMP_CACHE_BYTES - 1)) & \
>     > ~(SMP_CACHE_BYTES - 1))
>     > 
>     > #define SMP_CACHE_BYTES L1_CACHE_BYTES
>     > ---
>     > I think this is the reason of performance regression.
>     > 
>     
>     Yes this is the reason for performance regression. Due to increases L1 cache alignment the 
>     object is coming from next kmalloc slab and skb->truesize is changing from 2304 bytes to 
>     4352 bytes. This in turn increases sk_wmem_alloc which causes queuing of less send buffers.
>     
> We tried different benchmarks and found none which really affects with Cache line change. If there is no correctness issue,
> I think we are fine with reverting the patch.
> 
So, can we revert the patch that makes L1_CACHE_SHIFT 7, or should the patch suggested by Catalin be mainlined?
We have verified the throughput degradation on 3.18 and 4.4, but I am afraid that this issue will be seen on other
kernels too.
> Though I still think it is beneficiary to do some more investigation for the perf loss, who knows 32 bit align or no align might 
> Give even more perf benefit. 
> 
Which perf loss are you referring to here? Did you mean the throughput loss or some other perf benchmark?

Thanks,
Imran

> 
> Thanks,
> Tirumalesh.  
>     >>
>     >> The Cavium guys haven't shown any numbers (IIUC) to back the
>     >> L1_CACHE_BYTES performance improvement but I would not revert the
>     >> original commit since ARCH_DMA_MINALIGN definitely needs to cover the
>     >> maximum available cache line size, which is 128 for them.
>     > 
>     > how about define L1_CACHE_SHIFT like below:
>     > ---
>     > #ifdef CONFIG_ARM64_L1_CACHE_SHIFT
>     > #define L1_CACHE_SHIFT CONFIG_ARM64_L1_CACHE_SHIFT
>     > #else
>     > #define L1_CACHE_SHIFT 7
>     > #endif
>     > ---
>     > 
>     > Thanks
>     > 
>     >>
>     >> --
>     >> Catalin
>     
>     
>     -- 
>     QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation
>     
>     _______________________________________________
>     linux-arm-kernel mailing list
>     linux-arm-kernel@lists.infradead.org
>     http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
>     
> 


-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] Revert "arm64: Increase the max granular size"
  2017-04-17  7:35                   ` Imran Khan
@ 2017-04-17 10:38                     ` Sunil Kovvuri
  2017-04-18 14:48                       ` Catalin Marinas
  2017-04-18 18:21                     ` Chalamarla, Tirumalesh
  1 sibling, 1 reply; 36+ messages in thread
From: Sunil Kovvuri @ 2017-04-17 10:38 UTC (permalink / raw)
  To: Imran Khan
  Cc: Chalamarla, Tirumalesh, Ganesh Mahendran, Catalin Marinas,
	open list:ARM/QUALCOMM SUPPORT, open list, linux-arm-kernel

>>     >> Do you have an explanation on the performance variation when
>>     >> L1_CACHE_BYTES is changed? We'd need to understand how the network stack
>>     >> is affected by L1_CACHE_BYTES, in which context it uses it (is it for
>>     >> non-coherent DMA?).
>>     >
>>     > network stack use SKB_DATA_ALIGN to align.
>>     > ---
>>     > #define SKB_DATA_ALIGN(X) (((X) + (SMP_CACHE_BYTES - 1)) & \
>>     > ~(SMP_CACHE_BYTES - 1))
>>     >
>>     > #define SMP_CACHE_BYTES L1_CACHE_BYTES
>>     > ---
>>     > I think this is the reason of performance regression.
>>     >
>>
>>     Yes this is the reason for performance regression. Due to increases L1 cache alignment the
>>     object is coming from next kmalloc slab and skb->truesize is changing from 2304 bytes to
>>     4352 bytes. This in turn increases sk_wmem_alloc which causes queuing of less send buffers.

With what traffic did you check 'skb->truesize'?
An increase from 2304 to 4352 bytes doesn't seem to be real. I checked
with ICMP pkts of the maximum size possible with a 1500-byte MTU and I
don't see such a bump. If the bump is observed with iperf sending TCP
packets, then I suggest checking whether TSO is playing a part over here.

And for 'sk_wmem_alloc', I have done iperf benchmarking on a 40G
interface and I hit line rate irrespective of the cache line size being
64 or 128 bytes. I guess the transmit completion latency on your HW or
driver is very high, and that seems to be the real issue for the low
performance, not the cache line size; basically you are not able to
free up skbs/buffers fast enough for new ones to get queued up.

Doesn't skb_orphan() solve your issue?
FYI,
https://patchwork.ozlabs.org/patch/455134/
http://lxr.free-electrons.com/source/drivers/net/ethernet/chelsio/cxgb3/sge.c#L1288
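
For reference, a minimal sketch of what the skb_orphan() approach looks
like in a driver's xmit path (foo_xmit is hypothetical; see the cxgb3
link above for a real user):
---
#include <linux/netdevice.h>

static netdev_tx_t foo_xmit(struct sk_buff *skb, struct net_device *dev)
{
	/*
	 * Drop the skb's socket reference early so its truesize is
	 * uncharged from sk_wmem_alloc before TX completion; the socket
	 * can then queue more data while completions are still pending.
	 * Note this also weakens TCP back-pressure on this queue.
	 */
	skb_orphan(skb);

	/* ... map buffers and post the TX descriptor as usual ... */
	return NETDEV_TX_OK;
}
---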


>>
>> We tried different benchmarks and found none which really affects with Cache line change. If there is no correctness issue,
>> I think we are fine with reverting the patch.
>>
> So, can we revert the patch that makes L1_CACHE_SHIFT 7 or should the patch suggested by Catalin should be mainlined.

This doesn't seem right; as someone said earlier, what if there is
another arm64 platform with a 32-byte cache line size that wants to
reduce this further? Either this should be made platform dependent or
left as is, i.e. the maximum of all.

Thanks,
Sunil.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] Revert "arm64: Increase the max granular size"
  2017-04-17 10:38                     ` Sunil Kovvuri
@ 2017-04-18 14:48                       ` Catalin Marinas
  2017-04-18 17:05                         ` Sunil Kovvuri
  0 siblings, 1 reply; 36+ messages in thread
From: Catalin Marinas @ 2017-04-18 14:48 UTC (permalink / raw)
  To: Sunil Kovvuri
  Cc: Imran Khan, Ganesh Mahendran, open list, Chalamarla, Tirumalesh,
	open list:ARM/QUALCOMM SUPPORT, linux-arm-kernel

On Mon, Apr 17, 2017 at 04:08:52PM +0530, Sunil Kovvuri wrote:
> >>     >> Do you have an explanation on the performance variation when
> >>     >> L1_CACHE_BYTES is changed? We'd need to understand how the network stack
> >>     >> is affected by L1_CACHE_BYTES, in which context it uses it (is it for
> >>     >> non-coherent DMA?).
> >>     >
> >>     > network stack use SKB_DATA_ALIGN to align.
> >>     > ---
> >>     > #define SKB_DATA_ALIGN(X) (((X) + (SMP_CACHE_BYTES - 1)) & \
> >>     > ~(SMP_CACHE_BYTES - 1))
> >>     >
> >>     > #define SMP_CACHE_BYTES L1_CACHE_BYTES
> >>     > ---
> >>     > I think this is the reason of performance regression.
> >>     >
> >>
> >>     Yes this is the reason for performance regression. Due to increases L1 cache alignment the
> >>     object is coming from next kmalloc slab and skb->truesize is changing from 2304 bytes to
> >>     4352 bytes. This in turn increases sk_wmem_alloc which causes queuing of less send buffers.
> 
> With what traffic did you check 'skb->truesize' ?
> Increase from 2304 to 4352 bytes doesn't seem to be real. I checked
> with ICMP pkts with maximum
> size possible with 1500byte MTU and I don't see such a bump. If the
> bump is observed with Iperf
> sending TCP packets then I suggest to check if TSO is playing a part over here.

I haven't checked truesize but I added some printks to __alloc_skb() (on
a Juno platform) and the size argument to this function is 1720 on many
occasions. With sizeof(struct skb_shared_info) of 320, the actual data
allocation is exactly 2048 when L1_CACHE_BYTES is 64 bytes. With a
128-byte cache size, it goes slightly over 2K, hence the 4K slab
allocation. The 1720 figure surprised me a bit as well since I was
expecting something close to 1500.

The thing that worries me is that skb->data may be used as a buffer to
DMA into. If that's the case, skb_shared_info is wrongly aligned based
on SMP_CACHE_BYTES only and can lead to corruption on a non-DMA-coherent
platform. It should really be ARCH_DMA_MINALIGN.

IIUC, the Cavium platform has coherent DMA, so it shouldn't be an issue
if we go back to 64 byte cache lines. However, we don't really have an
easy way to check (maybe taint the kernel if CWG is different from
ARCH_DMA_MINALIGN *and* the non-coherent DMA API is called).
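
A rough sketch of what such a check could look like (the hook point and
the `coherent' flag are assumptions, e.g. the argument passed to
arch_setup_dma_ops()):
---
	/* Sketch only: taint if a device does non-coherent DMA while
	 * the CPU's CWG exceeds what the kernel aligns kmalloc() for.
	 */
	WARN_TAINT(!coherent && cache_line_size() > ARCH_DMA_MINALIGN,
		   TAINT_CPU_OUT_OF_SPEC,
		   "ARCH_DMA_MINALIGN smaller than CTR_EL0.CWG (%d < %d)",
		   ARCH_DMA_MINALIGN, cache_line_size());
---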

-- 
Catalin

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] Revert "arm64: Increase the max granular size"
  2017-04-18 14:48                       ` Catalin Marinas
@ 2017-04-18 17:05                         ` Sunil Kovvuri
  2017-04-19 12:01                           ` Catalin Marinas
  0 siblings, 1 reply; 36+ messages in thread
From: Sunil Kovvuri @ 2017-04-18 17:05 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Imran Khan, Ganesh Mahendran, open list, Chalamarla, Tirumalesh,
	open list:ARM/QUALCOMM SUPPORT, linux-arm-kernel

On Tue, Apr 18, 2017 at 8:18 PM, Catalin Marinas
<catalin.marinas@arm.com> wrote:
> On Mon, Apr 17, 2017 at 04:08:52PM +0530, Sunil Kovvuri wrote:
>> >>     >> Do you have an explanation on the performance variation when
>> >>     >> L1_CACHE_BYTES is changed? We'd need to understand how the network stack
>> >>     >> is affected by L1_CACHE_BYTES, in which context it uses it (is it for
>> >>     >> non-coherent DMA?).
>> >>     >
>> >>     > network stack use SKB_DATA_ALIGN to align.
>> >>     > ---
>> >>     > #define SKB_DATA_ALIGN(X) (((X) + (SMP_CACHE_BYTES - 1)) & \
>> >>     > ~(SMP_CACHE_BYTES - 1))
>> >>     >
>> >>     > #define SMP_CACHE_BYTES L1_CACHE_BYTES
>> >>     > ---
>> >>     > I think this is the reason of performance regression.
>> >>     >
>> >>
>> >>     Yes this is the reason for performance regression. Due to increases L1 cache alignment the
>> >>     object is coming from next kmalloc slab and skb->truesize is changing from 2304 bytes to
>> >>     4352 bytes. This in turn increases sk_wmem_alloc which causes queuing of less send buffers.
>>
>> With what traffic did you check 'skb->truesize' ?
>> Increase from 2304 to 4352 bytes doesn't seem to be real. I checked
>> with ICMP pkts with maximum
>> size possible with 1500byte MTU and I don't see such a bump. If the
>> bump is observed with Iperf
>> sending TCP packets then I suggest to check if TSO is playing a part over here.
>
> I haven't checked truesize but I added some printks to __alloc_skb() (on
> a Juno platform) and the size argument to this function is 1720 on many
> occasions. With sizeof(struct skb_shared_info) of 320, the actual data
> allocation is exactly 2048 when using 64 byte L1_CACHE_SIZE. With a
> 128 byte cache size, it goes slightly over 2K, hence the 4K slab
> allocation.

Understood, but still in my opinion this '4K slab allocation' cannot be
considered an issue with cache line size; there are many network
drivers out there which do receive-buffer or page recycling to
minimize (sometimes almost to zero) the cost of buffer allocation.

>The 1720 figure surprised me a bit as well since I was
> expecting something close to 1500.
>
> The thing that worries me is that skb->data may be used as a buffer to
> DMA into. If that's the case, skb_shared_info is wrongly aligned based
> on SMP_CACHE_BYTES only and can lead to corruption on a non-DMA-coherent
> platform. It should really be ARCH_DMA_MINALIGN.

I didn't get this; if you see __alloc_skb():

229         size = SKB_DATA_ALIGN(size);
230         size += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));

both the DMA buffer and skb_shared_info are aligned to a cache line
separately; considering that 128-byte alignment guarantees 64-byte
alignment as well, how will this lead to corruption?

And if the platform is non-DMA-coherent, then again it's the driver
which should take care of coherency by using the appropriate map/unmap
APIs, and it should avoid any cache line sharing between the DMA buffer
and skb_shared_info.
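
For completeness, the streaming-API pattern being referred to; a generic
sketch, with foo_rx_map hypothetical and dev/skb/len coming from the
surrounding driver:
---
#include <linux/dma-mapping.h>
#include <linux/skbuff.h>

static int foo_rx_map(struct device *dev, struct sk_buff *skb, size_t len)
{
	/* Hand the buffer to the device; the map performs whatever
	 * cache maintenance the platform needs.
	 */
	dma_addr_t da = dma_map_single(dev, skb->data, len, DMA_FROM_DEVICE);

	if (dma_mapping_error(dev, da))
		return -ENOMEM;

	/* ... the device DMAs into skb->data, raises an interrupt ... */

	dma_unmap_single(dev, da, len, DMA_FROM_DEVICE);
	return 0;	/* CPU may read skb->data only after the unmap */
}
---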

>
> IIUC, the Cavium platform has coherent DMA, so it shouldn't be an issue
> if we go back to 64 byte cache lines.

Yes, the Cavium platform is DMA coherent and there is no issue with
reverting back to 64-byte cache lines. But do we want to do this just
because some platform has a performance issue and this is an easy way
to solve it? IMHO there seem to be many ways to solve the performance
degradation within the driver itself, and if those don't work then it
probably makes sense to revert this.

>However, we don't really have an
> easy way to check (maybe taint the kernel if CWG is different from
> ARCH_DMA_MINALIGN *and* the non-coherent DMA API is called).
>
> --
> Catalin

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] Revert "arm64: Increase the max granular size"
  2017-04-17  7:35                   ` Imran Khan
  2017-04-17 10:38                     ` Sunil Kovvuri
@ 2017-04-18 18:21                     ` Chalamarla, Tirumalesh
  1 sibling, 0 replies; 36+ messages in thread
From: Chalamarla, Tirumalesh @ 2017-04-18 18:21 UTC (permalink / raw)
  To: Imran Khan, Ganesh Mahendran, Catalin Marinas
  Cc: open list:ARM/QUALCOMM SUPPORT, open list, linux-arm-kernel



On 4/17/17, 12:35 AM, "Imran Khan" <kimran@codeaurora.org> wrote:

    On 4/12/2017 7:30 PM, Chalamarla, Tirumalesh wrote:
    > 
    > 
    > On 4/11/17, 10:13 PM, "linux-arm-kernel on behalf of Imran Khan" <linux-arm-kernel-bounces@lists.infradead.org on behalf of kimran@codeaurora.org> wrote:
    > 
    >     On 4/7/2017 7:36 AM, Ganesh Mahendran wrote:
    >     > 2017-04-06 23:58 GMT+08:00 Catalin Marinas <catalin.marinas@arm.com>:
    >     >> On Thu, Apr 06, 2017 at 12:52:13PM +0530, Imran Khan wrote:
    >     >>> On 4/5/2017 10:13 AM, Imran Khan wrote:
    >     >>>>> We may have to revisit this logic and consider L1_CACHE_BYTES the
    >     >>>>> _minimum_ of cache line sizes in arm64 systems supported by the kernel.
    >     >>>>> Do you have any benchmarks on Cavium boards that would show significant
    >     >>>>> degradation with 64-byte L1_CACHE_BYTES vs 128?
    >     >>>>>
    >     >>>>> For non-coherent DMA, the simplest is to make ARCH_DMA_MINALIGN the
    >     >>>>> _maximum_ of the supported systems:
    >     >>>>>
    >     >>>>> diff --git a/arch/arm64/include/asm/cache.h b/arch/arm64/include/asm/cache.h
    >     >>>>> index 5082b30bc2c0..4b5d7b27edaf 100644
    >     >>>>> --- a/arch/arm64/include/asm/cache.h
    >     >>>>> +++ b/arch/arm64/include/asm/cache.h
    >     >>>>> @@ -18,17 +18,17 @@
    >     >>>>>
    >     >>>>>  #include <asm/cachetype.h>
    >     >>>>>
    >     >>>>> -#define L1_CACHE_SHIFT         7
    >     >>>>> +#define L1_CACHE_SHIFT         6
    >     >>>>>  #define L1_CACHE_BYTES         (1 << L1_CACHE_SHIFT)
    >     >>>>>
    >     >>>>>  /*
    >     >>>>>   * Memory returned by kmalloc() may be used for DMA, so we must make
    >     >>>>> - * sure that all such allocations are cache aligned. Otherwise,
    >     >>>>> - * unrelated code may cause parts of the buffer to be read into the
    >     >>>>> - * cache before the transfer is done, causing old data to be seen by
    >     >>>>> - * the CPU.
    >     >>>>> + * sure that all such allocations are aligned to the maximum *known*
    >     >>>>> + * cache line size on ARMv8 systems. Otherwise, unrelated code may cause
    >     >>>>> + * parts of the buffer to be read into the cache before the transfer is
    >     >>>>> + * done, causing old data to be seen by the CPU.
    >     >>>>>   */
    >     >>>>> -#define ARCH_DMA_MINALIGN      L1_CACHE_BYTES
    >     >>>>> +#define ARCH_DMA_MINALIGN      (128)
    >     >>>>>
    >     >>>>>  #ifndef __ASSEMBLY__
    >     >>>>>
    >     >>>>> diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
    >     >>>>> index 392c67eb9fa6..30bafca1aebf 100644
    >     >>>>> --- a/arch/arm64/kernel/cpufeature.c
    >     >>>>> +++ b/arch/arm64/kernel/cpufeature.c
    >     >>>>> @@ -976,9 +976,9 @@ void __init setup_cpu_features(void)
    >     >>>>>         if (!cwg)
    >     >>>>>                 pr_warn("No Cache Writeback Granule information, assuming
    >     >>>>> cache line size %d\n",
    >     >>>>>                         cls);
    >     >>>>> -       if (L1_CACHE_BYTES < cls)
    >     >>>>> -               pr_warn("L1_CACHE_BYTES smaller than the Cache Writeback Granule (%d < %d)\n",
    >     >>>>> -                       L1_CACHE_BYTES, cls);
    >     >>>>> +       if (ARCH_DMA_MINALIGN < cls)
    >     >>>>> +               pr_warn("ARCH_DMA_MINALIGN smaller than the Cache Writeback Granule (%d < %d)\n",
    >     >>>>> +                       ARCH_DMA_MINALIGN, cls);
    >     >>>>>  }
    >     >>>>>
    >     >>>>>  static bool __maybe_unused
    >     >>>>
    >     >>>> This change was discussed at: [1] but was not concluded as apparently no one
    >     >>>> came back with test report and numbers. After including this change in our
    >     >>>> local kernel we are seeing significant throughput improvement. For example with:
    >     >>>>
    >     >>>> iperf -c 192.168.1.181 -i 1 -w 128K -t 60
    >     >>>>
    >     >>>> The average throughput is improving by about 30% (230Mbps from 180Mbps).
    >     >>>> Could you please let us know if this change can be included in upstream kernel.
    >     >>>>
    >     >>>> [1]: https://groups.google.com/forum/#!topic/linux.kernel/P40yDB90ePs
    >     >>>
    >     >>> Could you please provide some feedback about the above mentioned query ?
    >     >>
    >     >> Do you have an explanation on the performance variation when
    >     >> L1_CACHE_BYTES is changed? We'd need to understand how the network stack
    >     >> is affected by L1_CACHE_BYTES, in which context it uses it (is it for
    >     >> non-coherent DMA?).
    >     > 
    >     > network stack use SKB_DATA_ALIGN to align.
    >     > ---
    >     > #define SKB_DATA_ALIGN(X) (((X) + (SMP_CACHE_BYTES - 1)) & \
    >     > ~(SMP_CACHE_BYTES - 1))
    >     > 
    >     > #define SMP_CACHE_BYTES L1_CACHE_BYTES
    >     > ---
    >     > I think this is the reason of performance regression.
    >     > 
    >     
    >     Yes this is the reason for performance regression. Due to increases L1 cache alignment the 
    >     object is coming from next kmalloc slab and skb->truesize is changing from 2304 bytes to 
    >     4352 bytes. This in turn increases sk_wmem_alloc which causes queuing of less send buffers.
    >     
    > We tried different benchmarks and found none which really affects with Cache line change. If there is no correctness issue,
    > I think we are fine with reverting the patch.
    > 
    So, can we revert the patch that makes L1_CACHE_SHIFT 7 or should the patch suggested by Catalin should be mainlined.
    We have verified the throughput degradation on 3.18 and 4.4 but I am afraid that this issue will be seen on other
    kernels too.
    > Though I still think it is beneficiary to do some more investigation for the perf loss, who knows 32 bit align or no align might 
    > Give even more perf benefit. 
    > 
    Which perf loss you are referring to here. Did you mean throughput loss here or some other perf benchmarking ?
    
The iperf issue mentioned here looks incomplete to me.

    Thanks,
    Imran
    
    > 
    > Thanks,
    > Tirumalesh.  
    >     >>
    >     >> The Cavium guys haven't shown any numbers (IIUC) to back the
    >     >> L1_CACHE_BYTES performance improvement but I would not revert the
    >     >> original commit since ARCH_DMA_MINALIGN definitely needs to cover the
    >     >> maximum available cache line size, which is 128 for them.
    >     > 
    >     > how about define L1_CACHE_SHIFT like below:
    >     > ---
    >     > #ifdef CONFIG_ARM64_L1_CACHE_SHIFT
    >     > #define L1_CACHE_SHIFT CONFIG_ARM64_L1_CACHE_SHIFT
    >     > #else
    >     > #define L1_CACHE_SHIFT 7
    >     > #endif
    >     > ---
    >     > 
    >     > Thanks
    >     > 
    >     >>
    >     >> --
    >     >> Catalin
    >     
    >     
    >     -- 
    >     QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation
    >     
    >     _______________________________________________
    >     linux-arm-kernel mailing list
    >     linux-arm-kernel@lists.infradead.org
    >     http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
    >     
    > 
    
    
    -- 
    QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation
    

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] Revert "arm64: Increase the max granular size"
  2017-04-18 17:05                         ` Sunil Kovvuri
@ 2017-04-19 12:01                           ` Catalin Marinas
  2017-04-19 13:11                             ` Sunil Kovvuri
  0 siblings, 1 reply; 36+ messages in thread
From: Catalin Marinas @ 2017-04-19 12:01 UTC (permalink / raw)
  To: Sunil Kovvuri
  Cc: Ganesh Mahendran, open list, Chalamarla, Tirumalesh,
	open list:ARM/QUALCOMM SUPPORT, Imran Khan, linux-arm-kernel

On Tue, Apr 18, 2017 at 10:35:02PM +0530, Sunil Kovvuri wrote:
> On Tue, Apr 18, 2017 at 8:18 PM, Catalin Marinas
> <catalin.marinas@arm.com> wrote:
> > On Mon, Apr 17, 2017 at 04:08:52PM +0530, Sunil Kovvuri wrote:
> >> >>     >> Do you have an explanation on the performance variation when
> >> >>     >> L1_CACHE_BYTES is changed? We'd need to understand how the network stack
> >> >>     >> is affected by L1_CACHE_BYTES, in which context it uses it (is it for
> >> >>     >> non-coherent DMA?).
> >> >>     >
> >> >>     > network stack use SKB_DATA_ALIGN to align.
> >> >>     > ---
> >> >>     > #define SKB_DATA_ALIGN(X) (((X) + (SMP_CACHE_BYTES - 1)) & \
> >> >>     > ~(SMP_CACHE_BYTES - 1))
> >> >>     >
> >> >>     > #define SMP_CACHE_BYTES L1_CACHE_BYTES
> >> >>     > ---
> >> >>     > I think this is the reason of performance regression.
> >> >>     >
> >> >>
> >> >>     Yes this is the reason for performance regression. Due to increases L1 cache alignment the
> >> >>     object is coming from next kmalloc slab and skb->truesize is changing from 2304 bytes to
> >> >>     4352 bytes. This in turn increases sk_wmem_alloc which causes queuing of less send buffers.
> >>
> >> With what traffic did you check 'skb->truesize' ? Increase from
> >> 2304 to 4352 bytes doesn't seem to be real. I checked with ICMP
> >> pkts with maximum size possible with 1500byte MTU and I don't see
> >> such a bump. If the bump is observed with Iperf sending TCP packets
> >> then I suggest to check if TSO is playing a part over here.
> >
> > I haven't checked truesize but I added some printks to __alloc_skb() (on
> > a Juno platform) and the size argument to this function is 1720 on many
> > occasions. With sizeof(struct skb_shared_info) of 320, the actual data
> > allocation is exactly 2048 when using 64 byte L1_CACHE_SIZE. With a
> > 128 byte cache size, it goes slightly over 2K, hence the 4K slab
> > allocation.
> 
> Understood but still in my opinion this '4K slab allocation' cannot be
> considered as an issue with cache line size, there are many network
> drivers out there which do receive buffer or page recycling to
> minimize (sometimes almost to zero) the cost of buffer allocation.

The slab allocation shouldn't make much difference (unless you are
running on a memory constrained system) but I don't understand how
skb->truesize (which is almost half unused) affects the sk_wmem_alloc
and its interaction with other bits in the network stack (e.g.
tcp_limit_output_bytes).

However, I do think it's worth investigating further to fully understand
the issue.

> >The 1720 figure surprised me a bit as well since I was
> > expecting something close to 1500.
> >
> > The thing that worries me is that skb->data may be used as a buffer to
> > DMA into. If that's the case, skb_shared_info is wrongly aligned based
> > on SMP_CACHE_BYTES only and can lead to corruption on a non-DMA-coherent
> > platform. It should really be ARCH_DMA_MINALIGN.
> 
> I didn't get this, if you see __alloc_skb()
> 
> 229         size = SKB_DATA_ALIGN(size);
> 230         size += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> 
> both DMA buffer and skb_shared_info are aligned to a cacheline separately,
> considering 128byte alignment guarantees 64byte alignment as well, how
> will this lead to corruption ?

It's the other way around: you align only to 64 bytes while running on a
platform with 128-byte cache lines and non-coherent DMA.

> And if platform is non-DMA-coherent then again it's the driver which
> should take of coherency by using appropriate map/unmap APIs and
> should avoid any cacheline sharing btw DMA buffer and skb_shared_info.

The problem is that the streaming DMA API can only work correctly on
cacheline-aligned buffers (because of the cache invalidation it performs
for DMA ops; even with clean&invalidate, the operation isn't always safe
if a cacheline is shared between DMA and CPU buffers). In the skb case,
we could have the data potentially sharing the last addresses of a DMA
buffer with struct skb_shared_info.

We may be able to get away with SKB_DATA_ALIGN not using
ARCH_DMA_MINALIGN *if* skb_shared_info is *not* written before or during
an inbound DMA transfer (though such tricks are arch specific).
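
Concretely, the hazard described above, sketched for a 128-byte CWG
with a 64-byte SKB_DATA_ALIGN:
---
/*
 * skb buffer with 64-byte-aligned sizes on 128-byte cache lines:
 *
 *   | ... DMA'd packet data ... | skb_shared_info ... |
 *                          ^^^^^^^^^^^
 *              one 128-byte line spanning both regions
 *
 * An invalidate for the inbound DMA on that line discards CPU writes
 * to skb_shared_info; a clean+invalidate instead risks writing stale
 * CPU data over what the device just DMA'd.
 */
---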

> > IIUC, the Cavium platform has coherent DMA, so it shouldn't be an issue
> > if we go back to 64 byte cache lines.
> 
> Yes, the Cavium platform is DMA-coherent and there is no issue with
> reverting back to 64-byte cachelines. But do we want to do this just
> because some platform has a performance issue and this is an easy way to
> solve it? IMHO there seem to be many ways to solve the performance
> degradation within the driver itself, and if those don't work then it
> probably makes sense to revert this.

My initial thought was to revert the change because it was causing a
significant performance regression on certain SoCs. But given that it
took over a year for people to follow up, it doesn't seem too urgent, so
we should rather try to understand the issue and the potential side
effects of moving back to a 64-byte cache line.

-- 
Catalin

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] Revert "arm64: Increase the max granular size"
  2017-04-19 12:01                           ` Catalin Marinas
@ 2017-04-19 13:11                             ` Sunil Kovvuri
  2017-04-25  6:42                               ` Ding Tianhong
  0 siblings, 1 reply; 36+ messages in thread
From: Sunil Kovvuri @ 2017-04-19 13:11 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Ganesh Mahendran, open list, Chalamarla, Tirumalesh,
	open list:ARM/QUALCOMM SUPPORT, Imran Khan, linux-arm-kernel,
	sgoutham

On Wed, Apr 19, 2017 at 5:31 PM, Catalin Marinas
<catalin.marinas@arm.com> wrote:
> On Tue, Apr 18, 2017 at 10:35:02PM +0530, Sunil Kovvuri wrote:
>> On Tue, Apr 18, 2017 at 8:18 PM, Catalin Marinas
>> <catalin.marinas@arm.com> wrote:
>> > On Mon, Apr 17, 2017 at 04:08:52PM +0530, Sunil Kovvuri wrote:
>> >> >>     >> Do you have an explanation for the performance variation when
>> >> >>     >> L1_CACHE_BYTES is changed? We'd need to understand how the network stack
>> >> >>     >> is affected by L1_CACHE_BYTES and in which context it uses it (is it for
>> >> >>     >> non-coherent DMA?).
>> >> >>     >
>> >> >>     > The network stack uses SKB_DATA_ALIGN to align:
>> >> >>     > ---
>> >> >>     > #define SKB_DATA_ALIGN(X) (((X) + (SMP_CACHE_BYTES - 1)) & \
>> >> >>     > ~(SMP_CACHE_BYTES - 1))
>> >> >>     >
>> >> >>     > #define SMP_CACHE_BYTES L1_CACHE_BYTES
>> >> >>     > ---
>> >> >>     > I think this is the reason for the performance regression.
>> >> >>     >
>> >> >>
>> >> >>     Yes, this is the reason for the performance regression. Due to the
>> >> >>     increased L1 cache alignment, the object comes from the next kmalloc
>> >> >>     slab and skb->truesize changes from 2304 bytes to 4352 bytes. This in
>> >> >>     turn increases sk_wmem_alloc, which causes fewer send buffers to be
>> >> >>     queued.
>> >>
>> >> With what traffic did you check 'skb->truesize'? An increase from
>> >> 2304 to 4352 bytes doesn't seem real. I checked with ICMP packets
>> >> of the maximum size possible with a 1500-byte MTU and I don't see
>> >> such a bump. If the bump is observed with iperf sending TCP packets,
>> >> then I suggest checking whether TSO is playing a part here.
>> >
>> > I haven't checked truesize, but I added some printks to __alloc_skb() (on
>> > a Juno platform) and the size argument to this function is 1720 on many
>> > occasions. With sizeof(struct skb_shared_info) of 320, the actual data
>> > allocation is exactly 2048 when using a 64-byte L1_CACHE_BYTES. With a
>> > 128-byte cache line size, it goes slightly over 2K, hence the 4K slab
>> > allocation.
>>
>> Understood, but still, in my opinion this '4K slab allocation' cannot be
>> considered an issue with the cache line size; there are many network
>> drivers out there which do receive-buffer or page recycling to
>> minimize (sometimes almost to zero) the cost of buffer allocation.
>
> The slab allocation shouldn't make much difference (unless you are
> running on a memory-constrained system), but I don't understand how
> skb->truesize (which is almost half unused) affects sk_wmem_alloc, and
> how that interacts with other bits in the network stack (e.g.
> tcp_limit_output_bytes).
>
> However, I do think it's worth investigating further to fully understand
> the issue.

Absolutely.

>
>> > The 1720 figure surprised me a bit as well since I was
>> > expecting something close to 1500.
>> >
>> > The thing that worries me is that skb->data may be used as a buffer to
>> > DMA into. If that's the case, skb_shared_info is wrongly aligned (based
>> > on SMP_CACHE_BYTES only), which can lead to corruption on a
>> > non-DMA-coherent platform. It should really be ARCH_DMA_MINALIGN.
>>
>> I didn't get this; if you look at __alloc_skb():
>>
>> 	size = SKB_DATA_ALIGN(size);
>> 	size += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
>>
>> both the DMA buffer and skb_shared_info are aligned to a cacheline
>> separately; considering that 128-byte alignment guarantees 64-byte
>> alignment as well, how will this lead to corruption?
>
> It's the other way around: you align only to 64 bytes while running on a
> platform with 128-byte cache lines and non-coherent DMA.

Okay, I misread your statement. That is indeed a valid point.

>> And if the platform is non-DMA-coherent, then again it's the driver which
>> should take care of coherency by using the appropriate map/unmap APIs, and
>> it should avoid any cacheline sharing between the DMA buffer and
>> skb_shared_info.
>
> The problem is that the streaming DMA API can only work correctly on
> cacheline-aligned buffers (because of the cache invalidation it performs
> for DMA ops; even with clean&invalidate, the operation isn't always safe
> if a cacheline is shared between DMA and CPU buffers). In the skb case,
> we could have the data potentially sharing the last addresses of a DMA
> buffer with struct skb_shared_info.
>
> We may be able to get away with SKB_DATA_ALIGN not using
> ARCH_DMA_MINALIGN *if* skb_shared_info is *not* written before or during
> an inbound DMA transfer (though such tricks are arch specific).
>
>> > IIUC, the Cavium platform has coherent DMA, so it shouldn't be an issue
>> > if we go back to 64 byte cache lines.
>>
>> Yes, the Cavium platform is DMA-coherent and there is no issue with
>> reverting back to 64-byte cachelines. But do we want to do this just
>> because some platform has a performance issue and this is an easy way to
>> solve it? IMHO there seem to be many ways to solve the performance
>> degradation within the driver itself, and if those don't work then it
>> probably makes sense to revert this.
>
> My initial thought was to revert the change because it was causing a
> significant performance regression on certain SoCs. But given that it
> took over a year for people to follow up, it doesn't seem too urgent, so
> we should rather try to understand the issue and the potential side
> effects of moving back to a 64-byte cache line.

Yes.

Thanks,
Sunil.

>
> --
> Catalin

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] Revert "arm64: Increase the max granular size"
  2017-04-19 13:11                             ` Sunil Kovvuri
@ 2017-04-25  6:42                               ` Ding Tianhong
  0 siblings, 0 replies; 36+ messages in thread
From: Ding Tianhong @ 2017-04-25  6:42 UTC (permalink / raw)
  To: Sunil Kovvuri, Catalin Marinas
  Cc: Ganesh Mahendran, open list, Chalamarla, Tirumalesh, sgoutham,
	open list:ARM/QUALCOMM SUPPORT, Imran Khan, linux-arm-kernel,
	LinuxArm, Gabriele Paoloni, John Garry



On 2017/4/19 21:11, Sunil Kovvuri wrote:
> On Wed, Apr 19, 2017 at 5:31 PM, Catalin Marinas
> <catalin.marinas@arm.com> wrote:
>> On Tue, Apr 18, 2017 at 10:35:02PM +0530, Sunil Kovvuri wrote:
>>> On Tue, Apr 18, 2017 at 8:18 PM, Catalin Marinas
>>> <catalin.marinas@arm.com> wrote:
>>>> On Mon, Apr 17, 2017 at 04:08:52PM +0530, Sunil Kovvuri wrote:
>>>>>>>     >> Do you have an explanation for the performance variation when
>>>>>>>     >> L1_CACHE_BYTES is changed? We'd need to understand how the network stack
>>>>>>>     >> is affected by L1_CACHE_BYTES and in which context it uses it (is it for
>>>>>>>     >> non-coherent DMA?).
>>>>>>>     >
>>>>>>>     > The network stack uses SKB_DATA_ALIGN to align:
>>>>>>>     > ---
>>>>>>>     > #define SKB_DATA_ALIGN(X) (((X) + (SMP_CACHE_BYTES - 1)) & \
>>>>>>>     > ~(SMP_CACHE_BYTES - 1))
>>>>>>>     >
>>>>>>>     > #define SMP_CACHE_BYTES L1_CACHE_BYTES
>>>>>>>     > ---
>>>>>>>     > I think this is the reason for the performance regression.
>>>>>>>     >
>>>>>>>
>>>>>>>     Yes, this is the reason for the performance regression. Due to the
>>>>>>>     increased L1 cache alignment, the object comes from the next kmalloc
>>>>>>>     slab and skb->truesize changes from 2304 bytes to 4352 bytes. This in
>>>>>>>     turn increases sk_wmem_alloc, which causes fewer send buffers to be
>>>>>>>     queued.
>>>>>
>>>>> With what traffic did you check 'skb->truesize'? An increase from
>>>>> 2304 to 4352 bytes doesn't seem real. I checked with ICMP packets
>>>>> of the maximum size possible with a 1500-byte MTU and I don't see
>>>>> such a bump. If the bump is observed with iperf sending TCP packets,
>>>>> then I suggest checking whether TSO is playing a part here.
>>>>
>>>> I haven't checked truesize, but I added some printks to __alloc_skb() (on
>>>> a Juno platform) and the size argument to this function is 1720 on many
>>>> occasions. With sizeof(struct skb_shared_info) of 320, the actual data
>>>> allocation is exactly 2048 when using a 64-byte L1_CACHE_BYTES. With a
>>>> 128-byte cache line size, it goes slightly over 2K, hence the 4K slab
>>>> allocation.
>>>
>>> Understood, but still, in my opinion this '4K slab allocation' cannot be
>>> considered an issue with the cache line size; there are many network
>>> drivers out there which do receive-buffer or page recycling to
>>> minimize (sometimes almost to zero) the cost of buffer allocation.
>>
>> The slab allocation shouldn't make much difference (unless you are
>> running on a memory-constrained system), but I don't understand how
>> skb->truesize (which is almost half unused) affects sk_wmem_alloc, and
>> how that interacts with other bits in the network stack (e.g.
>> tcp_limit_output_bytes).
>>
>> However, I do think it's worth investigating further to fully understand
>> the issue.
> 
> Absolutely.
> 
>>
>>>> The 1720 figure surprised me a bit as well since I was
>>>> expecting something close to 1500.
>>>>
>>>> The thing that worries me is that skb->data may be used as a buffer to
>>>> DMA into. If that's the case, skb_shared_info is wrongly aligned (based
>>>> on SMP_CACHE_BYTES only), which can lead to corruption on a
>>>> non-DMA-coherent platform. It should really be ARCH_DMA_MINALIGN.
>>>
>>> I didn't get this; if you look at __alloc_skb():
>>>
>>> 	size = SKB_DATA_ALIGN(size);
>>> 	size += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
>>>
>>> both the DMA buffer and skb_shared_info are aligned to a cacheline
>>> separately; considering that 128-byte alignment guarantees 64-byte
>>> alignment as well, how will this lead to corruption?
>>
>> It's the other way around: you align only to 64 bytes while running on a
>> platform with 128-byte cache lines and non-coherent DMA.
> 
> Okay, I misread your statement. That is indeed a valid point.
> 
>>> And if the platform is non-DMA-coherent, then again it's the driver which
>>> should take care of coherency by using the appropriate map/unmap APIs, and
>>> it should avoid any cacheline sharing between the DMA buffer and
>>> skb_shared_info.
>>
>> The problem is that the streaming DMA API can only work correctly on
>> cacheline-aligned buffers (because of the cache invalidation it performs
>> for DMA ops; even with clean&invalidate, the operation isn't always safe
>> if a cacheline is shared between DMA and CPU buffers). In the skb case,
>> we could have the data potentially sharing the last addresses of a DMA
>> buffer with struct skb_shared_info.
>>
>> We may be able to get away with SKB_DATA_ALIGN not using
>> ARCH_DMA_MINALIGN *if* skb_shared_info is *not* written before or during
>> an inbound DMA transfer (though such tricks are arch specific).
>>
>>>> IIUC, the Cavium platform has coherent DMA, so it shouldn't be an issue
>>>> if we go back to 64 byte cache lines.
>>>
>>> Yes, the Cavium platform is DMA-coherent and there is no issue with
>>> reverting back to 64-byte cachelines. But do we want to do this just
>>> because some platform has a performance issue and this is an easy way to
>>> solve it? IMHO there seem to be many ways to solve the performance
>>> degradation within the driver itself, and if those don't work then it
>>> probably makes sense to revert this.
>>
>> My initial thought was to revert the change because it was causing a
>> significant performance regression on certain SoCs. But given that it
>> took over a year for people to follow up, it doesn't seem too urgent, so
>> we should rather try to understand the issue and the potential side
>> effects of moving back to a 64-byte cache line.
> 
> Yes.
> 
> Thanks,
> Sunil.
> 

Hi, continuing this discussion: I have tested this on a HiSilicon D05 board,
and the lmbench results look much better with the 128B max granular size.
My kernel version is 4.1:

Lmbench (MB/s):
         128B/1core  128B/4core  128B/16core  128B/32core  64B/1core  64B/4core  64B/16core  64B/32core
rd          9393.81     9118.1     23693.96     42026.32    8433.26    8635.96    11563.52    20142.34
frd         7748.18     7003.02    23336.95     42271.16    7269.42    7091.55    11211.67    19188.03
wr          7880.8      7403.61    12471.28     22545.83    7453.13    7066.01    11140.08    19625.04
fwr         8783.97     8911.85    25318.05     46540       8953.55    8894.28    19094.92    30646.38
bzero       8848.33     8873.43    25416.46     46620.68    8998.88    8877.64    19092.39    32537.9
rdwr        5854.37     5759.17    12346.27     22928.25    5601.26    5517.21     6608.84    11717.91
cp          5251.07     5118.19     8324.99     15616.85    4997.66    4964.38     5833.04     9600.68
fcp         6883.94     5446.19    11588.7      21534.58    6766.76    6625.98     8757.97    15844.67
bcopy       6561.41     6402.99     9070.05     17076.24    6839.07    6959.03     8900.73    15868.25

You can see that with 32 cores the 128B L1 cache line is much better than
64B, roughly doubling the bandwidth. I think the performance will differ
from chip to chip, so I suggest not reverting this patch and instead
testing it further.

Some details about the HiSilicon D05 board's chip:
L1/L2 cache line:  64B
L3 cache line:     128B
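
(For reference, the operations in the table above match lmbench's bw_mem
benchmark, so a run along the following lines should reproduce them. This
is only a sketch: the 64m buffer size and the exact lmbench version are
assumptions, not details taken from the report.)

# bw_mem from lmbench3; -P sets the number of parallel
# processes (1/4/16/32 in the table above)
for op in rd frd wr fwr bzero rdwr cp fcp bcopy; do
	bw_mem -P 32 64m $op
done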

Thanks
Ding

>>
>> --
>> Catalin

^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2017-04-25  6:43 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-03-16  9:32 [PATCH] Revert "arm64: Increase the max granular size" Ganesh Mahendran
2016-03-16 10:07 ` Will Deacon
2016-03-16 13:06   ` Timur Tabi
2016-03-16 14:03     ` Mark Rutland
2016-03-16 14:35       ` Will Deacon
2016-03-16 14:54         ` Mark Rutland
2016-03-16 14:18     ` Catalin Marinas
2016-03-16 15:26       ` Timur Tabi
2016-03-17 14:27         ` Catalin Marinas
2016-03-17 14:49           ` Timur Tabi
2016-03-17 15:37             ` Catalin Marinas
2016-03-17 16:03               ` Marc Zyngier
2016-03-17 18:07           ` Andrew Pinski
2016-03-17 18:34             ` Timur Tabi
2016-03-17 18:37             ` Catalin Marinas
2016-03-18 21:05 ` Chalamarla, Tirumalesh
2016-03-21  1:56   ` Ganesh Mahendran
2016-03-21 17:14   ` Catalin Marinas
2016-03-21 17:23     ` Will Deacon
2016-03-21 17:33       ` Catalin Marinas
2016-03-21 17:39         ` Chalamarla, Tirumalesh
     [not found]     ` <CAPub14-sFgx=oCHzJPb9h9b_V0rbn5UAMDNJ-yTkjhz38JPqMQ@mail.gmail.com>
     [not found]       ` <10fef112-37f1-0a1b-b5af-435acd032f01@codeaurora.org>
2017-04-06  7:22         ` Imran Khan
2017-04-06 15:58           ` Catalin Marinas
2017-04-07  2:06             ` Ganesh Mahendran
2017-04-07  8:59               ` Catalin Marinas
2017-04-12  5:13               ` Imran Khan
2017-04-12 14:00                 ` Chalamarla, Tirumalesh
2017-04-17  7:35                   ` Imran Khan
2017-04-17 10:38                     ` Sunil Kovvuri
2017-04-18 14:48                       ` Catalin Marinas
2017-04-18 17:05                         ` Sunil Kovvuri
2017-04-19 12:01                           ` Catalin Marinas
2017-04-19 13:11                             ` Sunil Kovvuri
2017-04-25  6:42                               ` Ding Tianhong
2017-04-18 18:21                     ` Chalamarla, Tirumalesh
2017-04-11  4:40             ` Jon Masters
