linux-arm-msm.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Re: [PATCH] arm64: cache: Lower ARCH_DMA_MINALIGN to 64 (L1_CACHE_BYTES)
@ 2021-07-06  9:26 ` Yassine Oudjana
  2021-07-06 10:26   ` Catalin Marinas
  0 siblings, 1 reply; 20+ messages in thread
From: Yassine Oudjana @ 2021-07-06  9:26 UTC (permalink / raw)
  To: will
  Cc: ardb, arnd, catalin.marinas, kernel-team, linux-arm-kernel,
	mark.rutland, vincent.whitchurch, linux-arm-msm

In-Reply-To: <20210527124356.22367-1-will@kernel.org>

> Reduce ARCH_DMA_MINALIGN to 64 bytes and allow the warning/taint to
> indicate if there are machines that unknowingly rely on this.

The warning is being triggered on Qualcomm MSM8996, as well as the out-of-spec taint:

------------[ cut here ]------------
rtc-pm8xxx 400f000.qcom,spmi:pmic@0:rtc@6000: ARCH_DMA_MINALIGN smaller than CTR_EL0.CWG (64 < 128)
WARNING: CPU: 0 PID: 1 at arch/arm64/mm/dma-mapping.c:45 arch_setup_dma_ops+0xf8/0x10c
Modules linked in:
CPU: 0 PID: 1 Comm: swapper/0 Tainted: G S                5.13.0+ #43
Hardware name: Xiaomi Mi Note 2 (DT)
pstate: 60000005 (nZCv daif -PAN -UAO -TCO BTYPE=--)
pc : arch_setup_dma_ops+0xf8/0x10c
lr : arch_setup_dma_ops+0xf8/0x10c
sp : ffff80001003bb50
x29: ffff80001003bb50 x28: 0000000000000000 x27: ffff800010ee04c0
x26: ffff800010f41060 x25: 00000000ffffffff x24: 0000000000000000
x23: 0000000000000080 x22: 0000000000000000 x21: 0000000000000000
x20: 0000000100000000 x19: ffff00008143e010 x18: ffffffffffffffff
x17: 696c206f74205d72 x16: 656874655f675b20 x15: ffff800011212d34
x14: 0000000000000000 x13: ffff8000110d4620 x12: 0000000000001047
x11: 000000000000056d x10: ffff8000110d4620 x9 : ffff8000110d4620
x8 : 00000000ffffefff x7 : ffff80001112c620 x6 : ffff80001112c620
x5 : 0000000000000000 x4 : 0000000000000000 x3 : 00000000ffffffff
x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff000080098000
Call trace:
arch_setup_dma_ops+0xf8/0x10c
of_dma_configure_id+0x120/0x2a4
platform_dma_configure+0x24/0x40
really_probe.part.0+0x50/0x2fc
__driver_probe_device+0x98/0x144
driver_probe_device+0xc8/0x15c
__driver_attach+0xf8/0x190
bus_for_each_dev+0x70/0xd0
driver_attach+0x24/0x30
bus_add_driver+0x104/0x1ec
driver_register+0x78/0x130
__platform_driver_register+0x28/0x34
pm8xxx_rtc_driver_init+0x1c/0x28
do_one_initcall+0x50/0x1b0
kernel_init_freeable+0x214/0x298
kernel_init+0x28/0x12c
ret_from_fork+0x10/0x18
---[ end trace 09b7616a3af9b190 ]---

This warning is triggered with nearly every driver probe, not only rtc-pm8xxx.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] arm64: cache: Lower ARCH_DMA_MINALIGN to 64 (L1_CACHE_BYTES)
  2021-07-06  9:26 ` [PATCH] arm64: cache: Lower ARCH_DMA_MINALIGN to 64 (L1_CACHE_BYTES) Yassine Oudjana
@ 2021-07-06 10:26   ` Catalin Marinas
  2021-07-06 13:29     ` Robin Murphy
  0 siblings, 1 reply; 20+ messages in thread
From: Catalin Marinas @ 2021-07-06 10:26 UTC (permalink / raw)
  To: Yassine Oudjana
  Cc: will, ardb, arnd, kernel-team, linux-arm-kernel, mark.rutland,
	vincent.whitchurch, linux-arm-msm

On Tue, Jul 06, 2021 at 09:26:59AM +0000, Yassine Oudjana wrote:
> In-Reply-To: <20210527124356.22367-1-will@kernel.org>
> > Reduce ARCH_DMA_MINALIGN to 64 bytes and allow the warning/taint to
> > indicate if there are machines that unknowingly rely on this.
> 
> The warning is being triggered on Qualcomm MSM8996, as well as the out-of-spec taint:

Is this booting with ACPI or DT?

> ------------[ cut here ]------------
> rtc-pm8xxx 400f000.qcom,spmi:pmic@0:rtc@6000: ARCH_DMA_MINALIGN smaller than CTR_EL0.CWG (64 < 128)
> WARNING: CPU: 0 PID: 1 at arch/arm64/mm/dma-mapping.c:45 arch_setup_dma_ops+0xf8/0x10c
[...]
> This warning is triggered with nearly every driver probe, not only rtc-pm8xxx.

I have a suspicion none of the reported devices actually do any DMA, so
in practice it should be safe but we need to figure out why
arch_setup_dma_ops() gets called.

-- 
Catalin

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] arm64: cache: Lower ARCH_DMA_MINALIGN to 64 (L1_CACHE_BYTES)
  2021-07-06 10:26   ` Catalin Marinas
@ 2021-07-06 13:29     ` Robin Murphy
  2021-07-06 13:33       ` Will Deacon
  0 siblings, 1 reply; 20+ messages in thread
From: Robin Murphy @ 2021-07-06 13:29 UTC (permalink / raw)
  To: Catalin Marinas, Yassine Oudjana
  Cc: will, ardb, arnd, kernel-team, linux-arm-kernel, mark.rutland,
	vincent.whitchurch, linux-arm-msm

On 2021-07-06 11:26, Catalin Marinas wrote:
> On Tue, Jul 06, 2021 at 09:26:59AM +0000, Yassine Oudjana wrote:
>> In-Reply-To: <20210527124356.22367-1-will@kernel.org>
>>> Reduce ARCH_DMA_MINALIGN to 64 bytes and allow the warning/taint to
>>> indicate if there are machines that unknowingly rely on this.
>>
>> The warning is being triggered on Qualcomm MSM8996, as well as the out-of-spec taint:
> 
> Is this booting with ACPI or DT?
> 
>> ------------[ cut here ]------------
>> rtc-pm8xxx 400f000.qcom,spmi:pmic@0:rtc@6000: ARCH_DMA_MINALIGN smaller than CTR_EL0.CWG (64 < 128)
>> WARNING: CPU: 0 PID: 1 at arch/arm64/mm/dma-mapping.c:45 arch_setup_dma_ops+0xf8/0x10c
> [...]
>> This warning is triggered with nearly every driver probe, not only rtc-pm8xxx.
> 
> I have a suspicion none of the reported devices actually do any DMA, so
> in practice it should be safe but we need to figure out why
> arch_setup_dma_ops() gets called.

It gets called because there's no straightforward way to know that a 
platform device *isn't* DMA-capable, so we have to assume they are.

I would also assume that in a Qcom SoC there really are at least some 
things doing non-coherent DMA :(

Robin.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] arm64: cache: Lower ARCH_DMA_MINALIGN to 64 (L1_CACHE_BYTES)
  2021-07-06 13:29     ` Robin Murphy
@ 2021-07-06 13:33       ` Will Deacon
  2021-07-06 13:44         ` Marc Zyngier
  0 siblings, 1 reply; 20+ messages in thread
From: Will Deacon @ 2021-07-06 13:33 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Catalin Marinas, Yassine Oudjana, ardb, arnd, kernel-team,
	linux-arm-kernel, mark.rutland, vincent.whitchurch,
	linux-arm-msm

On Tue, Jul 06, 2021 at 02:29:07PM +0100, Robin Murphy wrote:
> On 2021-07-06 11:26, Catalin Marinas wrote:
> > On Tue, Jul 06, 2021 at 09:26:59AM +0000, Yassine Oudjana wrote:
> > > In-Reply-To: <20210527124356.22367-1-will@kernel.org>
> > > > Reduce ARCH_DMA_MINALIGN to 64 bytes and allow the warning/taint to
> > > > indicate if there are machines that unknowingly rely on this.
> > > 
> > > The warning is being triggered on Qualcomm MSM8996, as well as the out-of-spec taint:
> > 
> > Is this booting with ACPI or DT?
> > 
> > > ------------[ cut here ]------------
> > > rtc-pm8xxx 400f000.qcom,spmi:pmic@0:rtc@6000: ARCH_DMA_MINALIGN smaller than CTR_EL0.CWG (64 < 128)
> > > WARNING: CPU: 0 PID: 1 at arch/arm64/mm/dma-mapping.c:45 arch_setup_dma_ops+0xf8/0x10c
> > [...]
> > > This warning is triggered with nearly every driver probe, not only rtc-pm8xxx.
> > 
> > I have a suspicion none of the reported devices actually do any DMA, so
> > in practice it should be safe but we need to figure out why
> > arch_setup_dma_ops() gets called.
> 
> It gets called because there's no straightforward way to know that a
> platform device *isn't* DMA-capable, so we have to assume they are.
> 
> I would also assume that in a Qcom SoC there really are at least some things
> doing non-coherent DMA :(

Agreed, unless this is a CPU erratum and the line size is being reported for
a cache beyond the PoC, then I think we're going to have to revert the patch
reducing ARCH_DMA_MINALIGN after all.

I can't find much information about the original Kryo core at all...

Will

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] arm64: cache: Lower ARCH_DMA_MINALIGN to 64 (L1_CACHE_BYTES)
  2021-07-06 13:33       ` Will Deacon
@ 2021-07-06 13:44         ` Marc Zyngier
  2021-07-06 14:21           ` Robin Murphy
  2021-07-06 14:30           ` Arnd Bergmann
  0 siblings, 2 replies; 20+ messages in thread
From: Marc Zyngier @ 2021-07-06 13:44 UTC (permalink / raw)
  To: Will Deacon
  Cc: Robin Murphy, Catalin Marinas, Yassine Oudjana, ardb, arnd,
	kernel-team, linux-arm-kernel, mark.rutland, vincent.whitchurch,
	linux-arm-msm

On 2021-07-06 14:33, Will Deacon wrote:
> On Tue, Jul 06, 2021 at 02:29:07PM +0100, Robin Murphy wrote:
>> On 2021-07-06 11:26, Catalin Marinas wrote:
>> > On Tue, Jul 06, 2021 at 09:26:59AM +0000, Yassine Oudjana wrote:
>> > > In-Reply-To: <20210527124356.22367-1-will@kernel.org>
>> > > > Reduce ARCH_DMA_MINALIGN to 64 bytes and allow the warning/taint to
>> > > > indicate if there are machines that unknowingly rely on this.
>> > >
>> > > The warning is being triggered on Qualcomm MSM8996, as well as the out-of-spec taint:
>> >
>> > Is this booting with ACPI or DT?
>> >
>> > > ------------[ cut here ]------------
>> > > rtc-pm8xxx 400f000.qcom,spmi:pmic@0:rtc@6000: ARCH_DMA_MINALIGN smaller than CTR_EL0.CWG (64 < 128)
>> > > WARNING: CPU: 0 PID: 1 at arch/arm64/mm/dma-mapping.c:45 arch_setup_dma_ops+0xf8/0x10c
>> > [...]
>> > > This warning is triggered with nearly every driver probe, not only rtc-pm8xxx.
>> >
>> > I have a suspicion none of the reported devices actually do any DMA, so
>> > in practice it should be safe but we need to figure out why
>> > arch_setup_dma_ops() gets called.
>> 
>> It gets called because there's no straightforward way to know that a
>> platform device *isn't* DMA-capable, so we have to assume they are.
>> 
>> I would also assume that in a Qcom SoC there really are at least some 
>> things
>> doing non-coherent DMA :(
> 
> Agreed, unless this is a CPU erratum and the line size is being 
> reported for
> a cache beyond the PoC, then I think we're going to have to revert the 
> patch
> reducing ARCH_DMA_MINALIGN after all.
> 
> I can't find much information about the original Kryo core at all...

I have similar issues with my QDF2400. The UART, RTC and DMA controllers
are all screaming at me. I'm confident that the UART doesn't do any
DMA (it is handled by the SBSA driver), but the DMA controllers are
probably doing what it says on the tin.

Do we know whether Falkor and Kryo share any part of their design?

         M.
-- 
Jazz is not dead. It just smells funny...

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] arm64: cache: Lower ARCH_DMA_MINALIGN to 64 (L1_CACHE_BYTES)
  2021-07-06 13:44         ` Marc Zyngier
@ 2021-07-06 14:21           ` Robin Murphy
  2021-07-06 14:30           ` Arnd Bergmann
  1 sibling, 0 replies; 20+ messages in thread
From: Robin Murphy @ 2021-07-06 14:21 UTC (permalink / raw)
  To: Marc Zyngier, Will Deacon
  Cc: Catalin Marinas, Yassine Oudjana, ardb, arnd, kernel-team,
	linux-arm-kernel, mark.rutland, vincent.whitchurch,
	linux-arm-msm

On 2021-07-06 14:44, Marc Zyngier wrote:
> On 2021-07-06 14:33, Will Deacon wrote:
>> On Tue, Jul 06, 2021 at 02:29:07PM +0100, Robin Murphy wrote:
>>> On 2021-07-06 11:26, Catalin Marinas wrote:
>>> > On Tue, Jul 06, 2021 at 09:26:59AM +0000, Yassine Oudjana wrote:
>>> > > In-Reply-To: <20210527124356.22367-1-will@kernel.org>
>>> > > > Reduce ARCH_DMA_MINALIGN to 64 bytes and allow the 
>>> warning/taint to
>>> > > > indicate if there are machines that unknowingly rely on this.
>>> > >
>>> > > The warning is being triggered on Qualcomm MSM8996, as well as 
>>> the out-of-spec taint:
>>> >
>>> > Is this booting with ACPI or DT?
>>> >
>>> > > ------------[ cut here ]------------
>>> > > rtc-pm8xxx 400f000.qcom,spmi:pmic@0:rtc@6000: ARCH_DMA_MINALIGN 
>>> smaller than CTR_EL0.CWG (64 < 128)
>>> > > WARNING: CPU: 0 PID: 1 at arch/arm64/mm/dma-mapping.c:45 
>>> arch_setup_dma_ops+0xf8/0x10c
>>> > [...]
>>> > > This warning is triggered with nearly every driver probe, not 
>>> only rtc-pm8xxx.
>>> >
>>> > I have a suspicion none of the reported devices actually do any 
>>> DMA, so
>>> > in practice it should be safe but we need to figure out why
>>> > arch_setup_dma_ops() gets called.
>>>
>>> It gets called because there's no straightforward way to know that a
>>> platform device *isn't* DMA-capable, so we have to assume they are.
>>>
>>> I would also assume that in a Qcom SoC there really are at least some 
>>> things
>>> doing non-coherent DMA :(
>>
>> Agreed, unless this is a CPU erratum and the line size is being 
>> reported for
>> a cache beyond the PoC, then I think we're going to have to revert the 
>> patch
>> reducing ARCH_DMA_MINALIGN after all.
>>
>> I can't find much information about the original Kryo core at all...
> 
> I have similar issues with my QDF2400. The UART, RTC and DMA controllers
> are all screaming at me. I'm confident that the UART doesn't do any
> DMA (it is handled by the SBSA driver), but the DMA controllers are
> probably doing what it says on the tin.
> 
> Do we know whether Falkor and Kryo share any part of their design?

According to [1], not literally, but some of the same people being 
involved in both could plausibly imply that some basic design decisions 
might have carried over.

Robin.

[1] 
https://www.anandtech.com/show/11737/analyzing-falkors-microarchitecture-a-deep-dive-into-qualcomms-centriq-2400-for-windows-server-and-linux

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] arm64: cache: Lower ARCH_DMA_MINALIGN to 64 (L1_CACHE_BYTES)
  2021-07-06 13:44         ` Marc Zyngier
  2021-07-06 14:21           ` Robin Murphy
@ 2021-07-06 14:30           ` Arnd Bergmann
  2021-07-06 14:46             ` Marc Zyngier
  2021-07-06 16:20             ` Will Deacon
  1 sibling, 2 replies; 20+ messages in thread
From: Arnd Bergmann @ 2021-07-06 14:30 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Will Deacon, Robin Murphy, Catalin Marinas, Yassine Oudjana,
	Ard Biesheuvel, Android Kernel Team, Linux ARM, Mark Rutland,
	Vincent Whitchurch, linux-arm-msm

On Tue, Jul 6, 2021 at 3:44 PM Marc Zyngier <maz@kernel.org> wrote:
>
> On 2021-07-06 14:33, Will Deacon wrote:
> > On Tue, Jul 06, 2021 at 02:29:07PM +0100, Robin Murphy wrote:
> >
> > I can't find much information about the original Kryo core at all...
>
> I have similar issues with my QDF2400. The UART, RTC and DMA controllers
> are all screaming at me. I'm confident that the UART doesn't do any
> DMA (it is handled by the SBSA driver), but the DMA controllers are
> probably doing what it says on the tin.

But that's a server chip, surely the DMA controller is fully cache coherent
as required by SBSA? (please?)

Maybe just a misannotation on the device node?

> Do we know whether Falkor and Kryo share any part of their design?

I'm fairly sure the Snapdragon 821 / msm8996 is not cache coherent.

I can only speculate on how much got reused between the two, but
as Falkor was released only after they had already given up on
the full-custom Kryo core, it's plausible that it incorporates bits from
that one. In particular the cache controller is probably easy to reuse
even if the rest of it was a new design.

      Arnd

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] arm64: cache: Lower ARCH_DMA_MINALIGN to 64 (L1_CACHE_BYTES)
  2021-07-06 14:30           ` Arnd Bergmann
@ 2021-07-06 14:46             ` Marc Zyngier
  2021-07-06 15:43               ` Arnd Bergmann
  2021-07-06 16:20             ` Will Deacon
  1 sibling, 1 reply; 20+ messages in thread
From: Marc Zyngier @ 2021-07-06 14:46 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Will Deacon, Robin Murphy, Catalin Marinas, Yassine Oudjana,
	Ard Biesheuvel, Android Kernel Team, Linux ARM, Mark Rutland,
	Vincent Whitchurch, linux-arm-msm

On Tue, 06 Jul 2021 15:30:34 +0100,
Arnd Bergmann <arnd@arndb.de> wrote:
> 
> On Tue, Jul 6, 2021 at 3:44 PM Marc Zyngier <maz@kernel.org> wrote:
> >
> > On 2021-07-06 14:33, Will Deacon wrote:
> > > On Tue, Jul 06, 2021 at 02:29:07PM +0100, Robin Murphy wrote:
> > >
> > > I can't find much information about the original Kryo core at all...
> >
> > I have similar issues with my QDF2400. The UART, RTC and DMA controllers
> > are all screaming at me. I'm confident that the UART doesn't do any
> > DMA (it is handled by the SBSA driver), but the DMA controllers are
> > probably doing what it says on the tin.
> 
> But that's a server chip, surely the DMA controller is fully cache
> coherent as required by SBSA? (please?)

Remember that there is a SBSA level for each broken implementation
(even my XGene is SBSA compliant!), so that doesn't mean much.

> Maybe just a misannotation on the device node?

Maybe. But given that whoever wrote the ACPI tables made sure that
everything else was annotated as coherent, I doubt the DMA controllers
being advertised as non-coherent is an accident.

> > Do we know whether Falkor and Kryo share any part of their design?
> 
> I'm fairly sure the Snapdragon 821 / msm8996 is not cache coherent.
> 
> I can only speculate on how much got reused between the two, but
> as Falkor was released only after they had already given up on
> the full-custom Kryo core, it's plausible that it incorporates bits from
> that one. In particular the cache controller is probably easy to reuse
> even if the rest of it was a new design.

I guess we'll never find out, and I'm probably one of the few still
having some access to this HW (not even sure for how long anyway).

I won't cry if we decide to pull the plug on it.

	M.

-- 
Without deviation from the norm, progress is not possible.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] arm64: cache: Lower ARCH_DMA_MINALIGN to 64 (L1_CACHE_BYTES)
  2021-07-06 14:46             ` Marc Zyngier
@ 2021-07-06 15:43               ` Arnd Bergmann
  2021-07-06 17:15                 ` Yassine Oudjana
  0 siblings, 1 reply; 20+ messages in thread
From: Arnd Bergmann @ 2021-07-06 15:43 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Will Deacon, Robin Murphy, Catalin Marinas, Yassine Oudjana,
	Ard Biesheuvel, Android Kernel Team, Linux ARM, Mark Rutland,
	Vincent Whitchurch, linux-arm-msm

On Tue, Jul 6, 2021 at 4:46 PM Marc Zyngier <maz@kernel.org> wrote:
> On Tue, 06 Jul 2021 15:30:34 +0100, Arnd Bergmann <arnd@arndb.de> wrote:
> > I can only speculate on how much got reused between the two, but
> > as Falkor was released only after they had already given up on
> > the full-custom Kryo core, it's plausible that it incorporates bits from
> > that one. In particular the cache controller is probably easy to reuse
> > even if the rest of it was a new design.
>
> I guess we'll never find out, and I'm probably one of the few still
> having some access to this HW (not even sure for how long anyway).
>
> I won't cry if we decide to pull the plug on it.

Sure, but the Snapdragon 820E is one we do need to worry about.

While the internet pretty much agrees on Falkor having 128 bytes
L1 cache line, it might be good to rule out that Kryo just misreports
it before we revert the patch.

Yassine, could you run the 'line' and 'cache' helper from lmbench
to determine what the cache topology appears to be and if that
matches the CTR_EL0 contents?

Something like

numactl -C 0 line -M 1M
numactl -C 3 line -M 1M
numactl -C 0 cache
numactl -C 3 cache

(the numactl command helps run this both on the 'big' and 'little'
cores without running into migration)

      Arnd

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] arm64: cache: Lower ARCH_DMA_MINALIGN to 64 (L1_CACHE_BYTES)
  2021-07-06 14:30           ` Arnd Bergmann
  2021-07-06 14:46             ` Marc Zyngier
@ 2021-07-06 16:20             ` Will Deacon
  1 sibling, 0 replies; 20+ messages in thread
From: Will Deacon @ 2021-07-06 16:20 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Marc Zyngier, Robin Murphy, Catalin Marinas, Yassine Oudjana,
	Ard Biesheuvel, Android Kernel Team, Linux ARM, Mark Rutland,
	Vincent Whitchurch, linux-arm-msm

On Tue, Jul 06, 2021 at 04:30:34PM +0200, Arnd Bergmann wrote:
> On Tue, Jul 6, 2021 at 3:44 PM Marc Zyngier <maz@kernel.org> wrote:
> >
> > On 2021-07-06 14:33, Will Deacon wrote:
> > > On Tue, Jul 06, 2021 at 02:29:07PM +0100, Robin Murphy wrote:
> > >
> > > I can't find much information about the original Kryo core at all...
> >
> > I have similar issues with my QDF2400. The UART, RTC and DMA controllers
> > are all screaming at me. I'm confident that the UART doesn't do any
> > DMA (it is handled by the SBSA driver), but the DMA controllers are
> > probably doing what it says on the tin.
> 
> But that's a server chip, surely the DMA controller is fully cache coherent
> as required by SBSA? (please?)
> 
> Maybe just a misannotation on the device node?
> 
> > Do we know whether Falkor and Kryo share any part of their design?
> 
> I'm fairly sure the Snapdragon 821 / msm8996 is not cache coherent.
> 
> I can only speculate on how much got reused between the two, but
> as Falkor was released only after they had already given up on
> the full-custom Kryo core, it's plausible that it incorporates bits from
> that one. In particular the cache controller is probably easy to reuse
> even if the rest of it was a new design.

I think the million dollar question is whether the 128-byte cache-lines
live in a cache above the PoC or not. If it's just a system level cache
through which all DMA is "coherent", then it doesn't matter.

Digging around on the web, it's unclear whether msm8996 has an L3 or not.

Will

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] arm64: cache: Lower ARCH_DMA_MINALIGN to 64 (L1_CACHE_BYTES)
  2021-07-06 15:43               ` Arnd Bergmann
@ 2021-07-06 17:15                 ` Yassine Oudjana
  2021-07-06 20:33                   ` Arnd Bergmann
  0 siblings, 1 reply; 20+ messages in thread
From: Yassine Oudjana @ 2021-07-06 17:15 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Marc Zyngier, Will Deacon, Robin Murphy, Catalin Marinas,
	Ard Biesheuvel, Android Kernel Team, Linux ARM, Mark Rutland,
	Vincent Whitchurch, linux-arm-msm

On Tuesday, July 6th, 2021 at 7:43 PM, Arnd Bergmann <arnd@arndb.de> wrote:
> On Tue, Jul 6, 2021 at 4:46 PM Marc Zyngier maz@kernel.org wrote:
> > On Tue, 06 Jul 2021 15:30:34 +0100, Arnd Bergmann arnd@arndb.de wrote:
> > > I can only speculate on how much got reused between the two, but
> > > as Falkor was released only after they had already given up on
> > > the full-custom Kryo core, it's plausible that it incorporates bits from
> > > that one. In particular the cache controller is probably easy to reuse
> > > even if the rest of it was a new design.
> >
> > I guess we'll never find out, and I'm probably one of the few still
> > having some access to this HW (not even sure for how long anyway).
> >
> > I won't cry if we decide to pull the plug on it.
>
> Sure, but the Snapdragon 820E is one we do need to worry about.
> While the internet pretty much agrees on Falkor having 128 bytes
> L1 cache line, it might be good to rule out that Kryo just misreports
> it before we revert the patch.
>
> Yassine, could you run the 'line' and 'cache' helper from lmbench
> to determine what the cache topology appears to be and if that
> matches the CTR_EL0 contents?
>
> Something like
>
> numactl -C 0 line -M 1M
> numactl -C 3 line -M 1M
> numactl -C 0 cache
> numactl -C 3 cache
>
> (the numactl command helps run this both on the 'big' and 'little'
> cores without running into migration)
>
> Arnd

Here are the results:

$ numactl -C 0 line -M 1M
128
$ numactl -C 3 line -M 1M
128
$ numactl -C 0 cache
L1 cache: 512 bytes 1.37 nanoseconds 64 linesize -1.00 parallelism
L2 cache: 24576 bytes 2.75 nanoseconds 64 linesize 5.06 parallelism
L3 cache: 131072 bytes 7.89 nanoseconds 64 linesize 3.85 parallelism
L4 cache: 524288 bytes 15.86 nanoseconds 128 linesize 3.48 parallelism
Memory latency: 145.93 nanoseconds 4.88 parallelism
$ numactl -C 3 cache
L1 cache: 24576 bytes 1.29 nanoseconds 64 linesize 5.00 parallelism
L2 cache: 1048576 bytes 8.60 nanoseconds 128 linesize 3.07 parallelism
Memory latency: 143.29 nanoseconds 5.37 parallelism


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] arm64: cache: Lower ARCH_DMA_MINALIGN to 64 (L1_CACHE_BYTES)
  2021-07-06 17:15                 ` Yassine Oudjana
@ 2021-07-06 20:33                   ` Arnd Bergmann
  2021-07-06 22:27                     ` Bjorn Andersson
  2021-07-07  8:24                     ` Yassine Oudjana
  0 siblings, 2 replies; 20+ messages in thread
From: Arnd Bergmann @ 2021-07-06 20:33 UTC (permalink / raw)
  To: Yassine Oudjana
  Cc: Marc Zyngier, Will Deacon, Robin Murphy, Catalin Marinas,
	Ard Biesheuvel, Android Kernel Team, Linux ARM, Mark Rutland,
	Vincent Whitchurch, linux-arm-msm

On Tue, Jul 6, 2021 at 7:15 PM Yassine Oudjana <y.oudjana@protonmail.com> wrote:
> > (the numactl command helps run this both on the 'big' and 'little'
> > cores without running into migration)
> >
> > Arnd
>
> Here are the results:

Thanks, that was quick

> $ numactl -C 0 line -M 1M
> 128
> $ numactl -C 3 line -M 1M
> 128
> $ numactl -C 0 cache
> L1 cache: 512 bytes 1.37 nanoseconds 64 linesize -1.00 parallelism
> L2 cache: 24576 bytes 2.75 nanoseconds 64 linesize 5.06 parallelism
> L3 cache: 131072 bytes 7.89 nanoseconds 64 linesize 3.85 parallelism
> L4 cache: 524288 bytes 15.86 nanoseconds 128 linesize 3.48 parallelism
> Memory latency: 145.93 nanoseconds 4.88 parallelism
> $ numactl -C 3 cache
> L1 cache: 24576 bytes 1.29 nanoseconds 64 linesize 5.00 parallelism
> L2 cache: 1048576 bytes 8.60 nanoseconds 128 linesize 3.07 parallelism
> Memory latency: 143.29 nanoseconds 5.37 parallelism

This is still somewhat inconclusive, but it does give some hope. The data that
I found on random web sites was

- 32KB L1, 2MB/1MB L2 [1][2]
- 16KB L1, 1.5MB L2 [3]
- 32KB L1, 1MB/512KB L2 [4]

so none of the sizes really line up. My best guess is that the actual hierarchy

1MB per-core L2 cache on the two big CPU, 512KB per-core L2 cache on
the two little ones, but no shared L2 or L3. The older Krait had a 4KB L0
cache, which could explain the 512-byte L1 output.

Can you rerun the the 'line' test with '-M 128K' to see if that confirms the 64
byte L1 line size that the 'cache' test reported?
      Arnd

[1] https://en.wikipedia.org/wiki/List_of_Qualcomm_Snapdragon_processors#Snapdragon_820_and_821_(2016)
[2] https://en.wikipedia.org/wiki/Kryo
[3] https://www.geektopia.es/es/product/qualcomm/snapdragon-820/
[4] https://www.anandtech.com/show/9837/snapdragon-820-preview/2

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] arm64: cache: Lower ARCH_DMA_MINALIGN to 64 (L1_CACHE_BYTES)
  2021-07-06 20:33                   ` Arnd Bergmann
@ 2021-07-06 22:27                     ` Bjorn Andersson
  2021-07-07  9:27                       ` Will Deacon
  2021-07-07  8:24                     ` Yassine Oudjana
  1 sibling, 1 reply; 20+ messages in thread
From: Bjorn Andersson @ 2021-07-06 22:27 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Yassine Oudjana, Marc Zyngier, Will Deacon, Robin Murphy,
	Catalin Marinas, Ard Biesheuvel, Android Kernel Team, Linux ARM,
	Mark Rutland, Vincent Whitchurch, linux-arm-msm

On Tue 06 Jul 15:33 CDT 2021, Arnd Bergmann wrote:

> On Tue, Jul 6, 2021 at 7:15 PM Yassine Oudjana <y.oudjana@protonmail.com> wrote:
> > > (the numactl command helps run this both on the 'big' and 'little'
> > > cores without running into migration)
> > >
> > > Arnd
> >
> > Here are the results:
> 
> Thanks, that was quick
> 
> > $ numactl -C 0 line -M 1M
> > 128
> > $ numactl -C 3 line -M 1M
> > 128
> > $ numactl -C 0 cache
> > L1 cache: 512 bytes 1.37 nanoseconds 64 linesize -1.00 parallelism
> > L2 cache: 24576 bytes 2.75 nanoseconds 64 linesize 5.06 parallelism
> > L3 cache: 131072 bytes 7.89 nanoseconds 64 linesize 3.85 parallelism
> > L4 cache: 524288 bytes 15.86 nanoseconds 128 linesize 3.48 parallelism
> > Memory latency: 145.93 nanoseconds 4.88 parallelism
> > $ numactl -C 3 cache
> > L1 cache: 24576 bytes 1.29 nanoseconds 64 linesize 5.00 parallelism
> > L2 cache: 1048576 bytes 8.60 nanoseconds 128 linesize 3.07 parallelism
> > Memory latency: 143.29 nanoseconds 5.37 parallelism
> 
> This is still somewhat inconclusive, but it does give some hope. The data that
> I found on random web sites was
> 
> - 32KB L1, 2MB/1MB L2 [1][2]
> - 16KB L1, 1.5MB L2 [3]
> - 32KB L1, 1MB/512KB L2 [4]
> 
> so none of the sizes really line up. My best guess is that the actual hierarchy
> 
> 1MB per-core L2 cache on the two big CPU, 512KB per-core L2 cache on
> the two little ones, but no shared L2 or L3. The older Krait had a 4KB L0
> cache, which could explain the 512-byte L1 output.
> 
> Can you rerun the the 'line' test with '-M 128K' to see if that confirms the 64
> byte L1 line size that the 'cache' test reported?

I can confirm that MSM8996, and a few derivatives, has 128 byte cache
lines.

Regards,
Bjorn

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] arm64: cache: Lower ARCH_DMA_MINALIGN to 64 (L1_CACHE_BYTES)
  2021-07-06 20:33                   ` Arnd Bergmann
  2021-07-06 22:27                     ` Bjorn Andersson
@ 2021-07-07  8:24                     ` Yassine Oudjana
  2021-07-07  9:29                       ` Arnd Bergmann
  1 sibling, 1 reply; 20+ messages in thread
From: Yassine Oudjana @ 2021-07-07  8:24 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Marc Zyngier, Will Deacon, Robin Murphy, Catalin Marinas,
	Ard Biesheuvel, Android Kernel Team, Linux ARM, Mark Rutland,
	Vincent Whitchurch, linux-arm-msm, Bjorn Andersson

On Wednesday, July 7th, 2021 at 12:33 AM, Arnd Bergmann <arnd@arndb.de> wrote:
> On Tue, Jul 6, 2021 at 7:15 PM Yassine Oudjana y.oudjana@protonmail.com wrote:
> > ...
>
> This is still somewhat inconclusive, but it does give some hope. The data that
>
> I found on random web sites was
>
> -   32KB L1, 2MB/1MB L2 [1][2]
> -   16KB L1, 1.5MB L2 [3]
> -   32KB L1, 1MB/512KB L2 [4]
>
>     so none of the sizes really line up. My best guess is that the actual hierarchy
>     1MB per-core L2 cache on the two big CPU, 512KB per-core L2 cache on
>     the two little ones, but no shared L2 or L3. The older Krait had a 4KB L0
>     cache, which could explain the 512-byte L1 output.
>
>     Can you rerun the the 'line' test with '-M 128K' to see if that confirms the 64
>     byte L1 line size that the 'cache' test reported?
>
>     Arnd
>
>     [1] https://en.wikipedia.org/wiki/List_of_Qualcomm_Snapdragon_processors#Snapdragon_820_and_821_(2016)
>     [2] https://en.wikipedia.org/wiki/Kryo
>     [3] https://www.geektopia.es/es/product/qualcomm/snapdragon-820/
>     [4] https://www.anandtech.com/show/9837/snapdragon-820-preview/2

$ numactl -C 0 line -M 128K
64
$ numactl -C 3 line -M 128K
64


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] arm64: cache: Lower ARCH_DMA_MINALIGN to 64 (L1_CACHE_BYTES)
  2021-07-06 22:27                     ` Bjorn Andersson
@ 2021-07-07  9:27                       ` Will Deacon
  0 siblings, 0 replies; 20+ messages in thread
From: Will Deacon @ 2021-07-07  9:27 UTC (permalink / raw)
  To: Bjorn Andersson
  Cc: Arnd Bergmann, Yassine Oudjana, Marc Zyngier, Robin Murphy,
	Catalin Marinas, Ard Biesheuvel, Android Kernel Team, Linux ARM,
	Mark Rutland, Vincent Whitchurch, linux-arm-msm

On Tue, Jul 06, 2021 at 05:27:53PM -0500, Bjorn Andersson wrote:
> On Tue 06 Jul 15:33 CDT 2021, Arnd Bergmann wrote:
> 
> > On Tue, Jul 6, 2021 at 7:15 PM Yassine Oudjana <y.oudjana@protonmail.com> wrote:
> > > > (the numactl command helps run this both on the 'big' and 'little'
> > > > cores without running into migration)
> > > >
> > > > Arnd
> > >
> > > Here are the results:
> > 
> > Thanks, that was quick
> > 
> > > $ numactl -C 0 line -M 1M
> > > 128
> > > $ numactl -C 3 line -M 1M
> > > 128
> > > $ numactl -C 0 cache
> > > L1 cache: 512 bytes 1.37 nanoseconds 64 linesize -1.00 parallelism
> > > L2 cache: 24576 bytes 2.75 nanoseconds 64 linesize 5.06 parallelism
> > > L3 cache: 131072 bytes 7.89 nanoseconds 64 linesize 3.85 parallelism
> > > L4 cache: 524288 bytes 15.86 nanoseconds 128 linesize 3.48 parallelism
> > > Memory latency: 145.93 nanoseconds 4.88 parallelism
> > > $ numactl -C 3 cache
> > > L1 cache: 24576 bytes 1.29 nanoseconds 64 linesize 5.00 parallelism
> > > L2 cache: 1048576 bytes 8.60 nanoseconds 128 linesize 3.07 parallelism
> > > Memory latency: 143.29 nanoseconds 5.37 parallelism
> > 
> > This is still somewhat inconclusive, but it does give some hope. The data that
> > I found on random web sites was
> > 
> > - 32KB L1, 2MB/1MB L2 [1][2]
> > - 16KB L1, 1.5MB L2 [3]
> > - 32KB L1, 1MB/512KB L2 [4]
> > 
> > so none of the sizes really line up. My best guess is that the actual hierarchy
> > 
> > 1MB per-core L2 cache on the two big CPU, 512KB per-core L2 cache on
> > the two little ones, but no shared L2 or L3. The older Krait had a 4KB L0
> > cache, which could explain the 512-byte L1 output.
> > 
> > Can you rerun the the 'line' test with '-M 128K' to see if that confirms the 64
> > byte L1 line size that the 'cache' test reported?
> 
> I can confirm that MSM8996, and a few derivatives, has 128 byte cache
> lines.

Do you know if the caches with 128-byte lines sit above the PoC? i.e. is
non-coherent DMA coherent with this cache or not?

Will

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] arm64: cache: Lower ARCH_DMA_MINALIGN to 64 (L1_CACHE_BYTES)
  2021-07-07  8:24                     ` Yassine Oudjana
@ 2021-07-07  9:29                       ` Arnd Bergmann
  2021-07-07 14:41                         ` Jeffrey Hugo
  0 siblings, 1 reply; 20+ messages in thread
From: Arnd Bergmann @ 2021-07-07  9:29 UTC (permalink / raw)
  To: Yassine Oudjana
  Cc: Marc Zyngier, Will Deacon, Robin Murphy, Catalin Marinas,
	Ard Biesheuvel, Android Kernel Team, Linux ARM, Mark Rutland,
	Vincent Whitchurch, linux-arm-msm, Bjorn Andersson

On Tue, Jul 6, 2021 at 6:20 PM Will Deacon <will@kernel.org> wrote:
>
> I think the million dollar question is whether the 128-byte cache-lines
> live in a cache above the PoC or not. If it's just a system level cache
> through which all DMA is "coherent", then it doesn't matter.

On Wed, Jul 7, 2021 at 10:24 AM Yassine Oudjana
<y.oudjana@protonmail.com> wrote:
>
> On Wednesday, July 7th, 2021 at 12:33 AM, Arnd Bergmann <arnd@arndb.de> wrote:
> > On Tue, Jul 6, 2021 at 7:15 PM Yassine Oudjana y.oudjana@protonmail.com wrote:
> > >
> > > $ numactl -C 0 line -M 1M
> > > 128
> > > $ numactl -C 3 line -M 1M
> > > 128
> >
> >     Can you rerun the the 'line' test with '-M 128K' to see if that confirms the 64
> >     byte L1 line size that the 'cache' test reported?
>
> $ numactl -C 0 line -M 128K
> 64
> $ numactl -C 3 line -M 128K
> 64

Ok, so this seems to confirm that the L1 uses 64 byte lines, while the L2 (or
possibly L3) uses 128 byte lines.

On Wed, Jul 7, 2021 at 12:27 AM Bjorn Andersson
<bjorn.andersson@linaro.org> wrote:
>
> I can confirm that MSM8996, and a few derivatives, has 128 byte cache lines.

Ok, thanks. Assuming this is an outer cache and the L1 indeed has a smaller line
size, can you also confirm that this 128 byte lines are north of the point of
coherency?

In other words, does the CTR_EL0.DminLine field also show 128 bytes
(in which case
it seems we already lost)? And if not, does a CPU store to the second half of a
128 byte L2 line cause DMA data in the first half to be clobbered?

       Arnd

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] arm64: cache: Lower ARCH_DMA_MINALIGN to 64 (L1_CACHE_BYTES)
  2021-07-07  9:29                       ` Arnd Bergmann
@ 2021-07-07 14:41                         ` Jeffrey Hugo
  2021-07-08 20:59                           ` Jeffrey Hugo
  0 siblings, 1 reply; 20+ messages in thread
From: Jeffrey Hugo @ 2021-07-07 14:41 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Yassine Oudjana, Marc Zyngier, Will Deacon, Robin Murphy,
	Catalin Marinas, Ard Biesheuvel, Android Kernel Team, Linux ARM,
	Mark Rutland, Vincent Whitchurch, linux-arm-msm, Bjorn Andersson

On Wed, Jul 7, 2021 at 3:30 AM Arnd Bergmann <arnd@arndb.de> wrote:
>
> On Tue, Jul 6, 2021 at 6:20 PM Will Deacon <will@kernel.org> wrote:
> >
> > I think the million dollar question is whether the 128-byte cache-lines
> > live in a cache above the PoC or not. If it's just a system level cache
> > through which all DMA is "coherent", then it doesn't matter.
>
> On Wed, Jul 7, 2021 at 10:24 AM Yassine Oudjana
> <y.oudjana@protonmail.com> wrote:
> >
> > On Wednesday, July 7th, 2021 at 12:33 AM, Arnd Bergmann <arnd@arndb.de> wrote:
> > > On Tue, Jul 6, 2021 at 7:15 PM Yassine Oudjana y.oudjana@protonmail.com wrote:
> > > >
> > > > $ numactl -C 0 line -M 1M
> > > > 128
> > > > $ numactl -C 3 line -M 1M
> > > > 128
> > >
> > >     Can you rerun the the 'line' test with '-M 128K' to see if that confirms the 64
> > >     byte L1 line size that the 'cache' test reported?
> >
> > $ numactl -C 0 line -M 128K
> > 64
> > $ numactl -C 3 line -M 128K
> > 64
>
> Ok, so this seems to confirm that the L1 uses 64 byte lines, while the L2 (or
> possibly L3) uses 128 byte lines.
>
> On Wed, Jul 7, 2021 at 12:27 AM Bjorn Andersson
> <bjorn.andersson@linaro.org> wrote:
> >
> > I can confirm that MSM8996, and a few derivatives, has 128 byte cache lines.
>
> Ok, thanks. Assuming this is an outer cache and the L1 indeed has a smaller line
> size, can you also confirm that this 128 byte lines are north of the point of
> coherency?

Finding this old documentation has been painful  :)

L0 I 64 byte cacheline
L1 I 64
L1 D 64
L2 unified 128 (shared between the CPUs of a duplex)

I believe L2 is within the POC, but I'm trying to dig up the old
documentation to confirm.

>
> In other words, does the CTR_EL0.DminLine field also show 128 bytes
> (in which case
> it seems we already lost)? And if not, does a CPU store to the second half of a
> 128 byte L2 line cause DMA data in the first half to be clobbered?

Per the documentation I'm seeing, CTR_EL0.DminLine should show 128
bytes.  I don't have hardware handy to confirm.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] arm64: cache: Lower ARCH_DMA_MINALIGN to 64 (L1_CACHE_BYTES)
  2021-07-07 14:41                         ` Jeffrey Hugo
@ 2021-07-08 20:59                           ` Jeffrey Hugo
  2021-07-09  8:48                             ` Will Deacon
  0 siblings, 1 reply; 20+ messages in thread
From: Jeffrey Hugo @ 2021-07-08 20:59 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Yassine Oudjana, Marc Zyngier, Will Deacon, Robin Murphy,
	Catalin Marinas, Ard Biesheuvel, Android Kernel Team, Linux ARM,
	Mark Rutland, Vincent Whitchurch, linux-arm-msm, Bjorn Andersson

On Wed, Jul 7, 2021 at 8:41 AM Jeffrey Hugo <jeffrey.l.hugo@gmail.com> wrote:
>
> On Wed, Jul 7, 2021 at 3:30 AM Arnd Bergmann <arnd@arndb.de> wrote:
> >
> > On Tue, Jul 6, 2021 at 6:20 PM Will Deacon <will@kernel.org> wrote:
> > >
> > > I think the million dollar question is whether the 128-byte cache-lines
> > > live in a cache above the PoC or not. If it's just a system level cache
> > > through which all DMA is "coherent", then it doesn't matter.
> >
> > On Wed, Jul 7, 2021 at 10:24 AM Yassine Oudjana
> > <y.oudjana@protonmail.com> wrote:
> > >
> > > On Wednesday, July 7th, 2021 at 12:33 AM, Arnd Bergmann <arnd@arndb.de> wrote:
> > > > On Tue, Jul 6, 2021 at 7:15 PM Yassine Oudjana y.oudjana@protonmail.com wrote:
> > > > >
> > > > > $ numactl -C 0 line -M 1M
> > > > > 128
> > > > > $ numactl -C 3 line -M 1M
> > > > > 128
> > > >
> > > >     Can you rerun the the 'line' test with '-M 128K' to see if that confirms the 64
> > > >     byte L1 line size that the 'cache' test reported?
> > >
> > > $ numactl -C 0 line -M 128K
> > > 64
> > > $ numactl -C 3 line -M 128K
> > > 64
> >
> > Ok, so this seems to confirm that the L1 uses 64 byte lines, while the L2 (or
> > possibly L3) uses 128 byte lines.
> >
> > On Wed, Jul 7, 2021 at 12:27 AM Bjorn Andersson
> > <bjorn.andersson@linaro.org> wrote:
> > >
> > > I can confirm that MSM8996, and a few derivatives, has 128 byte cache lines.
> >
> > Ok, thanks. Assuming this is an outer cache and the L1 indeed has a smaller line
> > size, can you also confirm that this 128 byte lines are north of the point of
> > coherency?
>
> Finding this old documentation has been painful  :)
>
> L0 I 64 byte cacheline
> L1 I 64
> L1 D 64
> L2 unified 128 (shared between the CPUs of a duplex)
>
> I believe L2 is within the POC, but I'm trying to dig up the old
> documentation to confirm.

Was able to track down a friendly hardware designer.  The POC lies
between L2 and L3.  Hope this helps.

> > In other words, does the CTR_EL0.DminLine field also show 128 bytes
> > (in which case
> > it seems we already lost)? And if not, does a CPU store to the second half of a
> > 128 byte L2 line cause DMA data in the first half to be clobbered?
>
> Per the documentation I'm seeing, CTR_EL0.DminLine should show 128
> bytes.  I don't have hardware handy to confirm.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] arm64: cache: Lower ARCH_DMA_MINALIGN to 64 (L1_CACHE_BYTES)
  2021-07-08 20:59                           ` Jeffrey Hugo
@ 2021-07-09  8:48                             ` Will Deacon
  2021-07-09 17:10                               ` Catalin Marinas
  0 siblings, 1 reply; 20+ messages in thread
From: Will Deacon @ 2021-07-09  8:48 UTC (permalink / raw)
  To: Jeffrey Hugo
  Cc: Arnd Bergmann, Yassine Oudjana, Marc Zyngier, Robin Murphy,
	Catalin Marinas, Ard Biesheuvel, Android Kernel Team, Linux ARM,
	Mark Rutland, Vincent Whitchurch, linux-arm-msm, Bjorn Andersson

On Thu, Jul 08, 2021 at 02:59:28PM -0600, Jeffrey Hugo wrote:
> On Wed, Jul 7, 2021 at 8:41 AM Jeffrey Hugo <jeffrey.l.hugo@gmail.com> wrote:
> >
> > On Wed, Jul 7, 2021 at 3:30 AM Arnd Bergmann <arnd@arndb.de> wrote:
> > >
> > > On Tue, Jul 6, 2021 at 6:20 PM Will Deacon <will@kernel.org> wrote:
> > > >
> > > > I think the million dollar question is whether the 128-byte cache-lines
> > > > live in a cache above the PoC or not. If it's just a system level cache
> > > > through which all DMA is "coherent", then it doesn't matter.
> > >
> > > On Wed, Jul 7, 2021 at 10:24 AM Yassine Oudjana
> > > <y.oudjana@protonmail.com> wrote:
> > > >
> > > > On Wednesday, July 7th, 2021 at 12:33 AM, Arnd Bergmann <arnd@arndb.de> wrote:
> > > > > On Tue, Jul 6, 2021 at 7:15 PM Yassine Oudjana y.oudjana@protonmail.com wrote:
> > > > > >
> > > > > > $ numactl -C 0 line -M 1M
> > > > > > 128
> > > > > > $ numactl -C 3 line -M 1M
> > > > > > 128
> > > > >
> > > > >     Can you rerun the the 'line' test with '-M 128K' to see if that confirms the 64
> > > > >     byte L1 line size that the 'cache' test reported?
> > > >
> > > > $ numactl -C 0 line -M 128K
> > > > 64
> > > > $ numactl -C 3 line -M 128K
> > > > 64
> > >
> > > Ok, so this seems to confirm that the L1 uses 64 byte lines, while the L2 (or
> > > possibly L3) uses 128 byte lines.
> > >
> > > On Wed, Jul 7, 2021 at 12:27 AM Bjorn Andersson
> > > <bjorn.andersson@linaro.org> wrote:
> > > >
> > > > I can confirm that MSM8996, and a few derivatives, has 128 byte cache lines.
> > >
> > > Ok, thanks. Assuming this is an outer cache and the L1 indeed has a smaller line
> > > size, can you also confirm that this 128 byte lines are north of the point of
> > > coherency?
> >
> > Finding this old documentation has been painful  :)
> >
> > L0 I 64 byte cacheline
> > L1 I 64
> > L1 D 64
> > L2 unified 128 (shared between the CPUs of a duplex)
> >
> > I believe L2 is within the POC, but I'm trying to dig up the old
> > documentation to confirm.
> 
> Was able to track down a friendly hardware designer.  The POC lies
> between L2 and L3.  Hope this helps.

Damn, yes, it's bad news but thanks for chasing it up. I'll revert the patch
at -rc1 and add a comment about MSM8996.

Will

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] arm64: cache: Lower ARCH_DMA_MINALIGN to 64 (L1_CACHE_BYTES)
  2021-07-09  8:48                             ` Will Deacon
@ 2021-07-09 17:10                               ` Catalin Marinas
  0 siblings, 0 replies; 20+ messages in thread
From: Catalin Marinas @ 2021-07-09 17:10 UTC (permalink / raw)
  To: Will Deacon
  Cc: Jeffrey Hugo, Arnd Bergmann, Yassine Oudjana, Marc Zyngier,
	Robin Murphy, Ard Biesheuvel, Android Kernel Team, Linux ARM,
	Mark Rutland, Vincent Whitchurch, linux-arm-msm, Bjorn Andersson

On Fri, Jul 09, 2021 at 09:48:42AM +0100, Will Deacon wrote:
> On Thu, Jul 08, 2021 at 02:59:28PM -0600, Jeffrey Hugo wrote:
> > On Wed, Jul 7, 2021 at 8:41 AM Jeffrey Hugo <jeffrey.l.hugo@gmail.com> wrote:
> > > L0 I 64 byte cacheline
> > > L1 I 64
> > > L1 D 64
> > > L2 unified 128 (shared between the CPUs of a duplex)
> > >
> > > I believe L2 is within the POC, but I'm trying to dig up the old
> > > documentation to confirm.
> > 
> > Was able to track down a friendly hardware designer.  The POC lies
> > between L2 and L3.  Hope this helps.
> 
> Damn, yes, it's bad news but thanks for chasing it up. I'll revert the patch
> at -rc1 and add a comment about MSM8996.

It's a shame but we can't do much for this platform.

Longer term, we should look at making kmalloc() cache selection more
dynamic. Probably still starting with a 128 byte minimum size but, after
initialising all the devices during boot, if we can't find any
non-coherent one just relax the kmalloc() allocations. We still have the
issue with platform devices with DT assumed to be non-coherent and any
late call (after boot) to arch_setup_dma_ops().

Some bodge below to get an idea, not a final patch (not even the
beginning of one). It initialises the kmalloc caches to size 8 but
limits the allocation size to a kmalloc_dyn_min_size, initially set to
128 on arm64. In a device_initcall_sync(), if we didn't find any
non-coherent device, we lower this to KMALLOC_MIN_SIZE (8 with slub).

----------------8<----------------------------
diff --git a/arch/arm64/include/asm/cache.h b/arch/arm64/include/asm/cache.h
index a074459f8f2f..bed65db3c42e 100644
--- a/arch/arm64/include/asm/cache.h
+++ b/arch/arm64/include/asm/cache.h
@@ -40,15 +40,6 @@
 #define CLIDR_LOC(clidr)	(((clidr) >> CLIDR_LOC_SHIFT) & 0x7)
 #define CLIDR_LOUIS(clidr)	(((clidr) >> CLIDR_LOUIS_SHIFT) & 0x7)
 
-/*
- * Memory returned by kmalloc() may be used for DMA, so we must make
- * sure that all such allocations are cache aligned. Otherwise,
- * unrelated code may cause parts of the buffer to be read into the
- * cache before the transfer is done, causing old data to be seen by
- * the CPU.
- */
-#define ARCH_DMA_MINALIGN	(128)
-
 #ifdef CONFIG_KASAN_SW_TAGS
 #define ARCH_SLAB_MINALIGN	(1ULL << KASAN_SHADOW_SCALE_SHIFT)
 #elif defined(CONFIG_KASAN_HW_TAGS)
@@ -59,6 +50,9 @@
 
 #include <linux/bitops.h>
 
+extern int kmalloc_dyn_min_size;
+#define __HAVE_ARCH_KMALLOC_DYN_MIN_SIZE
+
 #define ICACHEF_ALIASING	0
 #define ICACHEF_VPIPT		1
 extern unsigned long __icache_flags;
@@ -88,7 +82,7 @@ static inline int cache_line_size_of_cpu(void)
 {
 	u32 cwg = cache_type_cwg();
 
-	return cwg ? 4 << cwg : ARCH_DMA_MINALIGN;
+	return cwg ? 4 << cwg : __alignof__(unsigned long long);
 }
 
 int cache_line_size(void);
diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index efed2830d141..a25813377187 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -2808,8 +2808,8 @@ void __init setup_cpu_features(void)
 	 */
 	cwg = cache_type_cwg();
 	if (!cwg)
-		pr_warn("No Cache Writeback Granule information, assuming %d\n",
-			ARCH_DMA_MINALIGN);
+		pr_warn("No Cache Writeback Granule information, assuming %ld\n",
+			__alignof__(unsigned long long));
 }
 
 static void __maybe_unused cpu_enable_cnp(struct arm64_cpu_capabilities const *cap)
diff --git a/arch/arm64/mm/dma-mapping.c b/arch/arm64/mm/dma-mapping.c
index 4bf1dd3eb041..9a30d1beb3ea 100644
--- a/arch/arm64/mm/dma-mapping.c
+++ b/arch/arm64/mm/dma-mapping.c
@@ -13,6 +13,18 @@
 
 #include <asm/cacheflush.h>
 
+/*
+ * Memory returned by kmalloc() may be used for DMA, so we must make
+ * sure that all such allocations are cache aligned. Otherwise,
+ * unrelated code may cause parts of the buffer to be read into the
+ * cache before the transfer is done, causing old data to be seen by
+ * the CPU.
+ */
+int kmalloc_dyn_min_size = 128;
+EXPORT_SYMBOL(kmalloc_dyn_min_size);
+
+static bool non_coherent_devices;
+
 void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
 		enum dma_data_direction dir)
 {
@@ -42,11 +54,14 @@ void arch_setup_dma_ops(struct device *dev, u64 dma_base, u64 size,
 {
 	int cls = cache_line_size_of_cpu();
 
-	WARN_TAINT(!coherent && cls > ARCH_DMA_MINALIGN,
+	WARN_TAINT(!coherent && cls > kmalloc_dyn_min_size,
 		   TAINT_CPU_OUT_OF_SPEC,
-		   "%s %s: ARCH_DMA_MINALIGN smaller than CTR_EL0.CWG (%d < %d)",
+		   "%s %s: kmalloc() minimum size smaller than CTR_EL0.CWG (%d < %d)",
 		   dev_driver_string(dev), dev_name(dev),
-		   ARCH_DMA_MINALIGN, cls);
+		   kmalloc_dyn_min_size, cls);
+
+	if (!coherent)
+		non_coherent_devices = true;
 
 	dev->dma_coherent = coherent;
 	if (iommu)
@@ -57,3 +72,12 @@ void arch_setup_dma_ops(struct device *dev, u64 dma_base, u64 size,
 		dev->dma_ops = &xen_swiotlb_dma_ops;
 #endif
 }
+
+static int __init adjust_kmalloc_dyn_min_size(void)
+{
+	if (!non_coherent_devices)
+		kmalloc_dyn_min_size = KMALLOC_MIN_SIZE;
+
+	return 0;
+}
+device_initcall_sync(adjust_kmalloc_dyn_min_size);
diff --git a/include/linux/slab.h b/include/linux/slab.h
index 0c97d788762c..e40c7899cb07 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -349,15 +349,21 @@ static __always_inline enum kmalloc_cache_type kmalloc_type(gfp_t flags)
  */
 static __always_inline unsigned int kmalloc_index(size_t size)
 {
+	int min_size = KMALLOC_MIN_SIZE;
+
 	if (!size)
 		return 0;
 
-	if (size <= KMALLOC_MIN_SIZE)
-		return KMALLOC_SHIFT_LOW;
+#ifdef __HAVE_ARCH_KMALLOC_DYN_MIN_SIZE
+	min_size = kmalloc_dyn_min_size;
+#endif
+
+	if (size <= min_size)
+		return ilog2(min_size);
 
-	if (KMALLOC_MIN_SIZE <= 32 && size > 64 && size <= 96)
+	if (min_size <= 32 && size > 64 && size <= 96)
 		return 1;
-	if (KMALLOC_MIN_SIZE <= 64 && size > 128 && size <= 192)
+	if (min_size <= 64 && size > 128 && size <= 192)
 		return 2;
 	if (size <=          8) return 3;
 	if (size <=         16) return 4;
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 7cab77655f11..2666237c84c4 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -725,6 +725,10 @@ struct kmem_cache *kmalloc_slab(size_t size, gfp_t flags)
 		if (!size)
 			return ZERO_SIZE_PTR;
 
+#ifdef __HAVE_ARCH_KMALLOC_DYN_MIN_SIZE
+		if (size < kmalloc_dyn_min_size)
+			size = kmalloc_dyn_min_size;
+#endif
 		index = size_index[size_index_elem(size)];
 	} else {
 		if (WARN_ON_ONCE(size > KMALLOC_MAX_CACHE_SIZE))

-- 
Catalin

^ permalink raw reply related	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2021-07-09 17:11 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20210527124356.22367-1-will@kernel.org>
2021-07-06  9:26 ` [PATCH] arm64: cache: Lower ARCH_DMA_MINALIGN to 64 (L1_CACHE_BYTES) Yassine Oudjana
2021-07-06 10:26   ` Catalin Marinas
2021-07-06 13:29     ` Robin Murphy
2021-07-06 13:33       ` Will Deacon
2021-07-06 13:44         ` Marc Zyngier
2021-07-06 14:21           ` Robin Murphy
2021-07-06 14:30           ` Arnd Bergmann
2021-07-06 14:46             ` Marc Zyngier
2021-07-06 15:43               ` Arnd Bergmann
2021-07-06 17:15                 ` Yassine Oudjana
2021-07-06 20:33                   ` Arnd Bergmann
2021-07-06 22:27                     ` Bjorn Andersson
2021-07-07  9:27                       ` Will Deacon
2021-07-07  8:24                     ` Yassine Oudjana
2021-07-07  9:29                       ` Arnd Bergmann
2021-07-07 14:41                         ` Jeffrey Hugo
2021-07-08 20:59                           ` Jeffrey Hugo
2021-07-09  8:48                             ` Will Deacon
2021-07-09 17:10                               ` Catalin Marinas
2021-07-06 16:20             ` Will Deacon

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).