* [PATCH] arm64: mm: take CWG into account in __inval_cache_range()
@ 2016-04-19 10:29 ` Ard Biesheuvel
  0 siblings, 0 replies; 16+ messages in thread
From: Ard Biesheuvel @ 2016-04-19 10:29 UTC (permalink / raw)
  To: linux-arm-kernel, linux-kernel, catalin.marinas, will.deacon,
	mark.rutland, james.morse
  Cc: Ard Biesheuvel

Currently, the arm64 implementation of __inval_cache_range() [aka
__dma_inv_range()] takes CTR_EL0.Dminline into account for two purposes:
- the stride to use for doing by-VA cache maintenance,
- to check whether the start and end arguments are unaligned with respect
  to the cache line size, in which case the unaligned extremes need to be
  cleaned before being invalidated, to avoid corrupting adjacent unrelated
  memory contents.

In the second case, using Dminline is incorrect: the CWG field should be
used instead, since an invalidate operation could cause cache lines larger
than Dminline to be evicted at any level of the cache hierarchy.

So introduce a macro cache_cwg_size to retrieve the CWG value, and use it
to clean as many cachelines as required on either end of the [start, end)
interval.
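
For reference, the two CTR_EL0 fields involved decode as below (C sketch
for illustration only; field positions are per the ARM ARM, and SZ_2K is
assumed to be available):

static inline unsigned int ctr_dminline_bytes(unsigned long ctr)
{
	/* DminLine: CTR_EL0[19:16], log2 of the line size in words */
	return 4U << ((ctr >> 16) & 0xf);
}

static inline unsigned int ctr_cwg_bytes(unsigned long ctr)
{
	/* CWG: CTR_EL0[27:24], log2 of the writeback granule in words */
	unsigned int cwg = (ctr >> 24) & 0xf;

	/* 0 means "not provided"; fall back to the 2 KB architectural maximum */
	return cwg ? 4U << cwg : SZ_2K;
}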

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 arch/arm64/mm/cache.S       | 34 ++++++++++++++++++++++------------
 arch/arm64/mm/proc-macros.S | 13 +++++++++++++
 2 files changed, 35 insertions(+), 12 deletions(-)

diff --git a/arch/arm64/mm/cache.S b/arch/arm64/mm/cache.S
index 6df07069a025..e5067e87e1b5 100644
--- a/arch/arm64/mm/cache.S
+++ b/arch/arm64/mm/cache.S
@@ -120,19 +120,29 @@ ENTRY(__inval_cache_range)
  *	- end     - virtual end address of region
  */
 __dma_inv_range:
-	dcache_line_size x2, x3
-	sub	x3, x2, #1
-	tst	x1, x3				// end cache line aligned?
-	bic	x1, x1, x3
-	b.eq	1f
-	dc	civac, x1			// clean & invalidate D / U line
-1:	tst	x0, x3				// start cache line aligned?
-	bic	x0, x0, x3
+	dcache_line_size x2, x3			// get Dminline in x2
+	sub	x3, x2, #1			// Dminline mask in x3
+	bic	x0, x0, x3			// align start down to line size
+
+	cache_cwg_size x4, x3			// get CWG
+	sub	x3, x4, #1			// CWG mask in x3
+
+	tst	x1, x3				// end CWG aligned?
 	b.eq	2f
-	dc	civac, x0			// clean & invalidate D / U line
-	b	3f
-2:	dc	ivac, x0			// invalidate D / U line
-3:	add	x0, x0, x2
+	bic	x5, x1, x3
+0:	dc	civac, x5			// clean & invalidate D / U line
+	add	x5, x5, x2
+	tst	x5, x3
+	b.ne	0b
+	b	2f
+
+1:	dc	civac, x0			// clean & invalidate D / U line
+	add	x0, x0, x2
+2:	tst	x0, x3				// start CWG aligned?
+	b.ne	1b
+
+	dc	ivac, x0			// invalidate D / U line
+	add	x0, x0, x2
 	cmp	x0, x1
 	b.lo	2b
 	dsb	sy
diff --git a/arch/arm64/mm/proc-macros.S b/arch/arm64/mm/proc-macros.S
index e6a30e1268a8..872299ce3081 100644
--- a/arch/arm64/mm/proc-macros.S
+++ b/arch/arm64/mm/proc-macros.S
@@ -54,6 +54,19 @@
 	.endm
 
 /*
+ * cache_cwg_size - get the maximum cache line size from the CTR register
+ */
+	.macro	cache_cwg_size, reg, tmp
+	mrs	\tmp, ctr_el0			// read CTR
+	ubfm	\tmp, \tmp, #24, #27		// CTR_EL0.CWG [27:24]
+	mov	\reg, #9			// use architectural default of
+	cmp	\tmp, xzr			// 2 KB (2^9 words) if CWG is
+	csel	\tmp, \tmp, \reg, ne		// not provided
+	mov	\reg, #4			// bytes per word
+	lsl	\reg, \reg, \tmp		// actual cache line size
+	.endm
+
+/*
  * tcr_set_idmap_t0sz - update TCR.T0SZ so that we can load the ID map
  */
 	.macro	tcr_set_idmap_t0sz, valreg, tmpreg
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH] arm64: mm: take CWG into account in __inval_cache_range()
  2016-04-19 10:29 ` Ard Biesheuvel
@ 2016-04-19 12:56   ` Mark Rutland
  -1 siblings, 0 replies; 16+ messages in thread
From: Mark Rutland @ 2016-04-19 12:56 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: linux-arm-kernel, linux-kernel, catalin.marinas, will.deacon,
	james.morse, robin.murphy

Hi,

On Tue, Apr 19, 2016 at 12:29:33PM +0200, Ard Biesheuvel wrote:
> Currently, the arm64 implementation of __inval_cache_range() [aka
> __dma_inv_range()] takes CTR_EL0.Dminline into account for two purposes:
> - the stride to use for doing by-VA cache maintenance,
> - to check whether the start and end arguments are unaligned with respect
>   to the cache line size, in which case the unaligned extremes need to be
>   cleaned before being invalidated, to avoid corrupting adjacent unrelated
>   memory contents.
>
> In the second case, the use of Dminline is incorrect, and should use the
> CWG field instead, since an invalidate operation could result in cache
> lines that are larger than Dminline to be evicted at any level of the
> cache hierarchy.

Have you seen this in practice, or was this found by inspection?

I agree that we need to round addresses to CWG boundaries when
performing maintenance to the PoC to prevent subsequent asynchronous
writebacks of data falling in the same CWG, which could clobber data at
the PoC.

However, if we have unrelated data in the same CWG, surely we have no
guarantee that said data will not be dirtied in caches by other kernel
code, and thus we may still have issues with asynchronous writebacks?

Is sharing a CWG broken by design, or is there some caveat I'm missing
that prevents/prohibits unrelated data from being dirtied?

Thanks,
Mark.

> So introduce a macro cache_cwg_size to retrieve the CWG value, and use it
> to clean as many cachelines as required on either end of the [start, end)
> interval.
> 
> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
> ---
>  arch/arm64/mm/cache.S       | 34 ++++++++++++++++++++++------------
>  arch/arm64/mm/proc-macros.S | 13 +++++++++++++
>  2 files changed, 35 insertions(+), 12 deletions(-)
> 
> diff --git a/arch/arm64/mm/cache.S b/arch/arm64/mm/cache.S
> index 6df07069a025..e5067e87e1b5 100644
> --- a/arch/arm64/mm/cache.S
> +++ b/arch/arm64/mm/cache.S
> @@ -120,19 +120,29 @@ ENTRY(__inval_cache_range)
>   *	- end     - virtual end address of region
>   */
>  __dma_inv_range:
> -	dcache_line_size x2, x3
> -	sub	x3, x2, #1
> -	tst	x1, x3				// end cache line aligned?
> -	bic	x1, x1, x3
> -	b.eq	1f
> -	dc	civac, x1			// clean & invalidate D / U line
> -1:	tst	x0, x3				// start cache line aligned?
> -	bic	x0, x0, x3
> +	dcache_line_size x2, x3			// get Dminline in x2
> +	sub	x3, x2, #1			// Dminline mask in x3
> +	bic	x0, x0, x3			// align start down to line size
> +
> +	cache_cwg_size x4, x3			// get CWG
> +	sub	x3, x4, #1			// CWG mask in x3
> +
> +	tst	x1, x3				// end CWG aligned?
>  	b.eq	2f
> -	dc	civac, x0			// clean & invalidate D / U line
> -	b	3f
> -2:	dc	ivac, x0			// invalidate D / U line
> -3:	add	x0, x0, x2
> +	bic	x5, x1, x3
> +0:	dc	civac, x5			// clean & invalidate D / U line
> +	add	x5, x5, x2
> +	tst	x5, x3
> +	b.ne	0b
> +	b	2f
> +
> +1:	dc	civac, x0			// clean & invalidate D / U line
> +	add	x0, x0, x2
> +2:	tst	x0, x3				// start CWG aligned?
> +	b.ne	1b
> +
> +	dc	ivac, x0			// invalidate D / U line
> +	add	x0, x0, x2
>  	cmp	x0, x1
>  	b.lo	2b
>  	dsb	sy
> diff --git a/arch/arm64/mm/proc-macros.S b/arch/arm64/mm/proc-macros.S
> index e6a30e1268a8..872299ce3081 100644
> --- a/arch/arm64/mm/proc-macros.S
> +++ b/arch/arm64/mm/proc-macros.S
> @@ -54,6 +54,19 @@
>  	.endm
>  
>  /*
> + * cache_cwg_size - get the maximum cache line size from the CTR register
> + */
> +	.macro	cache_cwg_size, reg, tmp
> +	mrs	\tmp, ctr_el0			// read CTR
> +	ubfm	\tmp, \tmp, #24, #27		// CTR_EL0.CWG [27:24]
> +	mov	\reg, #9			// use architectural default of
> +	cmp	\tmp, xzr			// 2 KB (2^9 words) if CWG is
> +	csel	\tmp, \tmp, \reg, ne		// not provided
> +	mov	\reg, #4			// bytes per word
> +	lsl	\reg, \reg, \tmp		// actual cache line size
> +	.endm
> +
> +/*
>   * tcr_set_idmap_t0sz - update TCR.T0SZ so that we can load the ID map
>   */
>  	.macro	tcr_set_idmap_t0sz, valreg, tmpreg
> -- 
> 2.5.0
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] arm64: mm: take CWG into account in __inval_cache_range()
  2016-04-19 12:56   ` Mark Rutland
@ 2016-04-19 13:08     ` Ard Biesheuvel
  -1 siblings, 0 replies; 16+ messages in thread
From: Ard Biesheuvel @ 2016-04-19 13:08 UTC (permalink / raw)
  To: Mark Rutland
  Cc: linux-arm-kernel, linux-kernel, Catalin Marinas, Will Deacon,
	James Morse, robin.murphy

On 19 April 2016 at 14:56, Mark Rutland <mark.rutland@arm.com> wrote:
> Hi,
>
> On Tue, Apr 19, 2016 at 12:29:33PM +0200, Ard Biesheuvel wrote:
>> Currently, the arm64 implementation of __inval_cache_range() [aka
>> __dma_inv_range()] takes CTR_EL0.Dminline into account for two purposes:
>> - the stride to use for doing by-VA cache maintenance,
>> - to check whether the start and end arguments are unaligned with respect
>>   to the cache line size, in which case the unaligned extremes need to be
>>   cleaned before being invalidated, to avoid corrupting adjacent unrelated
>>   memory contents.
>>
>> In the second case, the use of Dminline is incorrect, and should use the
>> CWG field instead, since an invalidate operation could result in cache
>> lines that are larger than Dminline to be evicted at any level of the
>> cache hierarchy.
>
> Have you seen this in practice, or was this found by inspection?
>

Amusingly, I spotted it when a Huawei engineer proposing driver code
for Tianocore questioned my demand to take CWG into account when doing
cache invalidation, because the kernel does not do it either.

> I agree that we need to round addresses to CWG boundaries when
> performing maintenance to the PoC to prevent subsequent asynchronous
> writebacks of data falling in the same CWG, which could clobber data at
> the PoC.
>
> However, if we have unrelated data in the same CWG, surely we have no
> guarantee that said data will not be dirtied in caches by other kernel
> code, and thus we may still have issues with asynchronous writebacks?
>

Indeed.

> Is sharing a CWG broken by design, or is there some caveat I'm missing
> that prevents/prohibits unrelated data from being dirtied?
>

I think sharing a CWG window is broken by design, now that I think of
it. The invalidate is part of the dma_unmap() code path, which means
the cleaning we do on the edges of the buffer may clobber data in
memory written by the device, and not cleaning isn't an option either.
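
To make that concrete (made-up example, assuming CWG is 128 bytes and
L1_CACHE_BYTES is 64): a layout such as

struct foo {
	char		dma_buf[64];	/* mapped for DMA via dma_map_single() */
	spinlock_t	lock;		/* written by the CPU while DMA is in flight */
} __aligned(64);

puts CPU-written data in the same 128-byte writeback granule as the DMA
buffer, so an eviction of that granule can overwrite what the device wrote
to memory, regardless of what dma_unmap() does about the edges.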

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] arm64: mm: take CWG into account in __inval_cache_range()
  2016-04-19 13:08     ` Ard Biesheuvel
@ 2016-04-19 14:13       ` Catalin Marinas
  -1 siblings, 0 replies; 16+ messages in thread
From: Catalin Marinas @ 2016-04-19 14:13 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Mark Rutland, Will Deacon, linux-kernel, James Morse,
	robin.murphy, linux-arm-kernel

On Tue, Apr 19, 2016 at 03:08:27PM +0200, Ard Biesheuvel wrote:
> On 19 April 2016 at 14:56, Mark Rutland <mark.rutland@arm.com> wrote:
> > Is sharing a CWG broken by design, or is there some caveat I'm missing
> > that prevents/prohibits unrelated data from being dirtied?
> 
> I think sharing a CWG window is broken by design, now that I think of
> it. The invalidate is part of the dma_unmap() code path, which means
> the cleaning we do on the edges of the buffer may clobber data in
> memory written by the device, and not cleaning isn't an option either.

Indeed. It only works if the cache line is always clean (e.g. read or
speculatively loaded) but you can't guarantee what's accessing it. I
think we shouldn't even bother with the edges at all in __dma_inv_range.

The best we could do is to warn if ARCH_DMA_MINALIGN is smaller than CWG
(as Robin suggested, we could do this only if we have non-coherent DMA
masters via arch_setup_dma_ops()). Quick hack below:

-------------------------------8<-----------------------------------
diff --git a/arch/arm64/include/asm/cache.h b/arch/arm64/include/asm/cache.h
index 5082b30bc2c0..5967fcbb617a 100644
--- a/arch/arm64/include/asm/cache.h
+++ b/arch/arm64/include/asm/cache.h
@@ -28,7 +28,7 @@
  * cache before the transfer is done, causing old data to be seen by
  * the CPU.
  */
-#define ARCH_DMA_MINALIGN	L1_CACHE_BYTES
+#define ARCH_DMA_MINALIGN	128
 
 #ifndef __ASSEMBLY__
 
@@ -37,7 +37,7 @@
 static inline int cache_line_size(void)
 {
 	u32 cwg = cache_type_cwg();
-	return cwg ? 4 << cwg : L1_CACHE_BYTES;
+	return cwg ? 4 << cwg : ARCH_DMA_MINALIGN;
 }
 
 #endif	/* __ASSEMBLY__ */
diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index 943f5140e0f3..b1d4d1f5b0dd 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -970,7 +970,6 @@ static void __init setup_feature_capabilities(void)
 void __init setup_cpu_features(void)
 {
 	u32 cwg;
-	int cls;
 
 	/* Set the CPU feature capabilies */
 	setup_feature_capabilities();
@@ -983,13 +982,9 @@ void __init setup_cpu_features(void)
 	 * Check for sane CTR_EL0.CWG value.
 	 */
 	cwg = cache_type_cwg();
-	cls = cache_line_size();
 	if (!cwg)
-		pr_warn("No Cache Writeback Granule information, assuming cache line size %d\n",
-			cls);
-	if (L1_CACHE_BYTES < cls)
-		pr_warn("L1_CACHE_BYTES smaller than the Cache Writeback Granule (%d < %d)\n",
-			L1_CACHE_BYTES, cls);
+		pr_warn("No Cache Writeback Granule information, assuming %d\n",
+			ARCH_DMA_MINALIGN);
 }
 
 static bool __maybe_unused
diff --git a/arch/arm64/mm/dma-mapping.c b/arch/arm64/mm/dma-mapping.c
index a6e757cbab77..08232faf38ad 100644
--- a/arch/arm64/mm/dma-mapping.c
+++ b/arch/arm64/mm/dma-mapping.c
@@ -987,6 +987,11 @@ static void __iommu_setup_dma_ops(struct device *dev, u64 dma_base, u64 size,
 void arch_setup_dma_ops(struct device *dev, u64 dma_base, u64 size,
 			struct iommu_ops *iommu, bool coherent)
 {
+	WARN_TAINT_ONCE(!coherent && ARCH_DMA_MINALIGN < cache_line_size(),
+			TAINT_CPU_OUT_OF_SPEC,
+			"ARCH_DMA_MINALIGN smaller than the Cache Writeback Granule (%d < %d)",
+			ARCH_DMA_MINALIGN, cache_line_size());
+
 	if (!dev->archdata.dma_ops)
 		dev->archdata.dma_ops = &swiotlb_dma_ops;
 
-------------------------------8<-----------------------------------

This way we could also reduce L1_CACHE_SHIFT back to 6, as it shouldn't
cause any correctness issues (it may affect performance, but so far
increasing it seems to have made things worse on some SoCs).

The reason for cache_line_size() using ARCH_DMA_MINALIGN instead of
L1_CACHE_BYTES is that I can see it used in the sl*b code when
SLAB_HWCACHE_ALIGN is passed, and on a few occasions for DMA buffers.
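
E.g. (illustrative only, with names invented for the example) a driver
allocating descriptors from a slab cache:

/*
 * SLAB_HWCACHE_ALIGN makes the sl*b allocator align objects to
 * cache_line_size(), which is where the CWG-derived value above
 * ends up being used.
 */
desc_cache = kmem_cache_create("foo_desc", sizeof(struct foo_desc),
			       0, SLAB_HWCACHE_ALIGN, NULL);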

-- 
Catalin

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH] arm64: mm: take CWG into account in __inval_cache_range()
  2016-04-19 14:13       ` Catalin Marinas
@ 2016-04-19 14:48         ` Ard Biesheuvel
  -1 siblings, 0 replies; 16+ messages in thread
From: Ard Biesheuvel @ 2016-04-19 14:48 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Mark Rutland, Will Deacon, linux-kernel, James Morse,
	robin.murphy, linux-arm-kernel

On 19 April 2016 at 16:13, Catalin Marinas <catalin.marinas@arm.com> wrote:
> On Tue, Apr 19, 2016 at 03:08:27PM +0200, Ard Biesheuvel wrote:
>> On 19 April 2016 at 14:56, Mark Rutland <mark.rutland@arm.com> wrote:
>> > Is sharing a CWG broken by design, or is there some caveat I'm missing
>> > that prevents/prohibits unrelated data from being dirtied?
>>
>> I think sharing a CWG window is broken by design, now that I think of
>> it. The invalidate is part of the dma_unmap() code path, which means
>> the cleaning we do on the edges of the buffer may clobber data in
>> memory written by the device, and not cleaning isn't an option either.
>
> Indeed. It only works if the cache line is always clean (e.g. read or
> speculatively loaded) but you can't guarantee what's accessing it. I
> think we shouldn't even bother with the edges at all in __dma_inv_range.
>

Agreed.

> The best we could do is to warn if ARCH_DMA_MINALIGN is smaller than CWG
> (as Robin suggested, we could do this only if we have non-coherent DMA
> masters via arch_setup_dma_ops()). Quick hack below:
>
> -------------------------------8<-----------------------------------
> diff --git a/arch/arm64/include/asm/cache.h b/arch/arm64/include/asm/cache.h
> index 5082b30bc2c0..5967fcbb617a 100644
> --- a/arch/arm64/include/asm/cache.h
> +++ b/arch/arm64/include/asm/cache.h
> @@ -28,7 +28,7 @@
>   * cache before the transfer is done, causing old data to be seen by
>   * the CPU.
>   */
> -#define ARCH_DMA_MINALIGN      L1_CACHE_BYTES
> +#define ARCH_DMA_MINALIGN      128
>
>  #ifndef __ASSEMBLY__
>
> @@ -37,7 +37,7 @@
>  static inline int cache_line_size(void)
>  {
>         u32 cwg = cache_type_cwg();
> -       return cwg ? 4 << cwg : L1_CACHE_BYTES;
> +       return cwg ? 4 << cwg : ARCH_DMA_MINALIGN;

Unrelated, but this does not look right: if the CWG field is zero, we
should either assume 2 KB, or iterate over all the CCSIDR values and
take the maximum linesize.
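
Something along these lines, for instance (untested sketch, using raw
system register accesses rather than any particular kernel helper):

static unsigned int max_ccsidr_line_size(void)
{
	unsigned long clidr, ccsidr;
	unsigned int level, size, max_size = 0;

	asm volatile("mrs %0, clidr_el1" : "=r" (clidr));

	for (level = 0; level < 7; level++) {
		unsigned int ctype = (clidr >> (3 * level)) & 0x7;

		if (ctype < 2)		/* no cache, or I-cache only */
			continue;

		/* select the data/unified cache at this level */
		asm volatile("msr csselr_el1, %0\n\tisb"
			     : : "r" ((unsigned long)level << 1));
		asm volatile("mrs %0, ccsidr_el1" : "=r" (ccsidr));

		/* CCSIDR_EL1.LineSize is log2(words per line) - 2 */
		size = 16U << (ccsidr & 0x7);
		if (size > max_size)
			max_size = size;
	}

	return max_size;
}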

>  }
>
>  #endif /* __ASSEMBLY__ */
> diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
> index 943f5140e0f3..b1d4d1f5b0dd 100644
> --- a/arch/arm64/kernel/cpufeature.c
> +++ b/arch/arm64/kernel/cpufeature.c
> @@ -970,7 +970,6 @@ static void __init setup_feature_capabilities(void)
>  void __init setup_cpu_features(void)
>  {
>         u32 cwg;
> -       int cls;
>
>         /* Set the CPU feature capabilies */
>         setup_feature_capabilities();
> @@ -983,13 +982,9 @@ void __init setup_cpu_features(void)
>          * Check for sane CTR_EL0.CWG value.
>          */
>         cwg = cache_type_cwg();
> -       cls = cache_line_size();
>         if (!cwg)
> -               pr_warn("No Cache Writeback Granule information, assuming cache line size %d\n",
> -                       cls);
> -       if (L1_CACHE_BYTES < cls)
> -               pr_warn("L1_CACHE_BYTES smaller than the Cache Writeback Granule (%d < %d)\n",
> -                       L1_CACHE_BYTES, cls);
> +               pr_warn("No Cache Writeback Granule information, assuming %d\n",
> +                       ARCH_DMA_MINALIGN);
>  }
>
>  static bool __maybe_unused
> diff --git a/arch/arm64/mm/dma-mapping.c b/arch/arm64/mm/dma-mapping.c
> index a6e757cbab77..08232faf38ad 100644
> --- a/arch/arm64/mm/dma-mapping.c
> +++ b/arch/arm64/mm/dma-mapping.c
> @@ -987,6 +987,11 @@ static void __iommu_setup_dma_ops(struct device *dev, u64 dma_base, u64 size,
>  void arch_setup_dma_ops(struct device *dev, u64 dma_base, u64 size,
>                         struct iommu_ops *iommu, bool coherent)
>  {
> +       WARN_TAINT_ONCE(!coherent && ARCH_DMA_MINALIGN < cache_line_size(),
> +                       TAINT_CPU_OUT_OF_SPEC,
> +                       "ARCH_DMA_MINALIGN smaller than the Cache Writeback Granule (%d < %d)",
> +                       ARCH_DMA_MINALIGN, cache_line_size());
> +
>         if (!dev->archdata.dma_ops)
>                 dev->archdata.dma_ops = &swiotlb_dma_ops;
>
> -------------------------------8<-----------------------------------
>
> This way we could also reduce L1_CACHE_SHIFT back to 6 as it shouldn't
> cause any correctness issues (potentially performance but so far it
> seems that increasing it made things worse on some SoCs).
>
> The reason for cache_line_size() using ARCH_DMA_MINALIGN instead of
> L1_CACHE_BYTES is that I can see it used in the sl*b code when
> SLAB_HWCACHE_ALIGN is passed and on few occasions for DMA buffers.
>
> --
> Catalin

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] arm64: mm: take CWG into account in __inval_cache_range()
  2016-04-19 14:48         ` Ard Biesheuvel
@ 2016-04-19 15:32           ` Catalin Marinas
  -1 siblings, 0 replies; 16+ messages in thread
From: Catalin Marinas @ 2016-04-19 15:32 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Mark Rutland, Will Deacon, linux-kernel, James Morse,
	robin.murphy, linux-arm-kernel

On Tue, Apr 19, 2016 at 04:48:32PM +0200, Ard Biesheuvel wrote:
> On 19 April 2016 at 16:13, Catalin Marinas <catalin.marinas@arm.com> wrote:
> > The best we could do is to warn if ARCH_DMA_MINALIGN is smaller than CWG
> > (as Robin suggested, we could do this only if we have non-coherent DMA
> > masters via arch_setup_dma_ops()). Quick hack below:
> >
> > -------------------------------8<-----------------------------------
> > diff --git a/arch/arm64/include/asm/cache.h b/arch/arm64/include/asm/cache.h
> > index 5082b30bc2c0..5967fcbb617a 100644
> > --- a/arch/arm64/include/asm/cache.h
> > +++ b/arch/arm64/include/asm/cache.h
> > @@ -28,7 +28,7 @@
> >   * cache before the transfer is done, causing old data to be seen by
> >   * the CPU.
> >   */
> > -#define ARCH_DMA_MINALIGN      L1_CACHE_BYTES
> > +#define ARCH_DMA_MINALIGN      128
> >
> >  #ifndef __ASSEMBLY__
> >
> > @@ -37,7 +37,7 @@
> >  static inline int cache_line_size(void)
> >  {
> >         u32 cwg = cache_type_cwg();
> > -       return cwg ? 4 << cwg : L1_CACHE_BYTES;
> > +       return cwg ? 4 << cwg : ARCH_DMA_MINALIGN;
> 
> Unrelated, but this does not look right: if the CWG field is zero, we
> should either assume 2 KB, or iterate over all the CCSIDR values and
> take the maximum linesize.

It may be a better guess, but even that is not always relevant, since
CCSIDR may not present the real hardware information. It's only meant to
give enough information to do cache maintenance by set/way, and we've
seen CPU implementations where this has nothing to do with the actual
cache geometry.

So I don't think we can do anything more than hard-code a value and hope
that implementations where CWG is 0 (or higher than 128) are only
deployed in a fully coherent configuration.

-- 
Catalin

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] arm64: mm: take CWG into account in __inval_cache_range()
  2016-04-19 15:32           ` Catalin Marinas
@ 2016-04-19 15:38             ` Ard Biesheuvel
  -1 siblings, 0 replies; 16+ messages in thread
From: Ard Biesheuvel @ 2016-04-19 15:38 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Mark Rutland, Will Deacon, linux-kernel, James Morse,
	robin.murphy, linux-arm-kernel

On 19 April 2016 at 17:32, Catalin Marinas <catalin.marinas@arm.com> wrote:
> On Tue, Apr 19, 2016 at 04:48:32PM +0200, Ard Biesheuvel wrote:
>> On 19 April 2016 at 16:13, Catalin Marinas <catalin.marinas@arm.com> wrote:
>> > The best we could do is to warn if ARCH_DMA_MINALIGN is smaller than CWG
>> > (as Robin suggested, we could do this only if we have non-coherent DMA
>> > masters via arch_setup_dma_ops()). Quick hack below:
>> >
>> > -------------------------------8<-----------------------------------
>> > diff --git a/arch/arm64/include/asm/cache.h b/arch/arm64/include/asm/cache.h
>> > index 5082b30bc2c0..5967fcbb617a 100644
>> > --- a/arch/arm64/include/asm/cache.h
>> > +++ b/arch/arm64/include/asm/cache.h
>> > @@ -28,7 +28,7 @@
>> >   * cache before the transfer is done, causing old data to be seen by
>> >   * the CPU.
>> >   */
>> > -#define ARCH_DMA_MINALIGN      L1_CACHE_BYTES
>> > +#define ARCH_DMA_MINALIGN      128
>> >
>> >  #ifndef __ASSEMBLY__
>> >
>> > @@ -37,7 +37,7 @@
>> >  static inline int cache_line_size(void)
>> >  {
>> >         u32 cwg = cache_type_cwg();
>> > -       return cwg ? 4 << cwg : L1_CACHE_BYTES;
>> > +       return cwg ? 4 << cwg : ARCH_DMA_MINALIGN;
>>
>> Unrelated, but this does not look right: if the CWG field is zero, we
>> should either assume 2 KB, or iterate over all the CCSIDR values and
>> take the maximum linesize.
>
> It may be a better guess but even that is not always relevant since
> CCSIDR may not present the real hardware information. It's only meant to
> give enough information to be able to do cache maintenance by set/way
> and we've seen CPU implementations where this has nothing to do with the
> actual cache geometry.
>

I am aware of that discussion, but that was about inferring aliasing
properties from the way size, which combines the linesize and the number
of sets/ways; the latter are apparently set to 1/1 in some cases so that
any set/way operation simply affects the entire cache.

However, the CCSIDR linesize field itself is mentioned in the
description of CWG in the ARM ARM, as a suitable source of obtaining
the maximum linesize in the system.

> So I don't think we can do anything more than just hard-coding and hope
> that implementations where CWG is 0 (or higher than 128) are only
> deployed in a fully coherent configuration.
>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] arm64: mm: take CWG into account in __inval_cache_range()
  2016-04-19 15:38             ` Ard Biesheuvel
@ 2016-04-19 16:09               ` Catalin Marinas
  -1 siblings, 0 replies; 16+ messages in thread
From: Catalin Marinas @ 2016-04-19 16:09 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Mark Rutland, Will Deacon, linux-kernel, James Morse,
	robin.murphy, linux-arm-kernel

On Tue, Apr 19, 2016 at 05:38:56PM +0200, Ard Biesheuvel wrote:
> On 19 April 2016 at 17:32, Catalin Marinas <catalin.marinas@arm.com> wrote:
> > On Tue, Apr 19, 2016 at 04:48:32PM +0200, Ard Biesheuvel wrote:
> >> On 19 April 2016 at 16:13, Catalin Marinas <catalin.marinas@arm.com> wrote:
> >> > The best we could do is to warn if ARCH_DMA_MINALIGN is smaller than CWG
> >> > (as Robin suggested, we could do this only if we have non-coherent DMA
> >> > masters via arch_setup_dma_ops()). Quick hack below:
> >> >
> >> > -------------------------------8<-----------------------------------
> >> > diff --git a/arch/arm64/include/asm/cache.h b/arch/arm64/include/asm/cache.h
> >> > index 5082b30bc2c0..5967fcbb617a 100644
> >> > --- a/arch/arm64/include/asm/cache.h
> >> > +++ b/arch/arm64/include/asm/cache.h
> >> > @@ -28,7 +28,7 @@
> >> >   * cache before the transfer is done, causing old data to be seen by
> >> >   * the CPU.
> >> >   */
> >> > -#define ARCH_DMA_MINALIGN      L1_CACHE_BYTES
> >> > +#define ARCH_DMA_MINALIGN      128
> >> >
> >> >  #ifndef __ASSEMBLY__
> >> >
> >> > @@ -37,7 +37,7 @@
> >> >  static inline int cache_line_size(void)
> >> >  {
> >> >         u32 cwg = cache_type_cwg();
> >> > -       return cwg ? 4 << cwg : L1_CACHE_BYTES;
> >> > +       return cwg ? 4 << cwg : ARCH_DMA_MINALIGN;
> >>
> >> Unrelated, but this does not look right: if the CWG field is zero, we
> >> should either assume 2 KB, or iterate over all the CCSIDR values and
> >> take the maximum linesize.
> >
> > It may be a better guess but even that is not always relevant since
> > CCSIDR may not present the real hardware information. It's only meant to
> > give enough information to be able to do cache maintenance by set/way
> > and we've seen CPU implementations where this has nothing to do with the
> > actual cache geometry.
> 
> I am aware of that discussion, but that was about inferring aliasing
> properties from the way size, which combines the linesize and the
> number of sets/ways, and the latter are apparently set to 1/1 in some
> cases so that any set/way operation simply affects the entire cache.
> 
> However, the CCSIDR linesize field itself is mentioned in the
> description of CWG in the ARM ARM, as a suitable source of obtaining
> the maximum linesize in the system.

The ARM ARM confuses me. While the CTR_EL0.CWG description indeed points
at CCSIDR as a fallback, the latter has a note:

  The parameters NumSets, Associativity, and LineSize in these registers
  define the architecturally visible parameters that are required for
  the cache maintenance by Set/Way instructions. They are not guaranteed
  to represent the actual microarchitectural features of a design. You
  cannot make any inference about the actual sizes of caches based on
  these parameters.

So it's not clear to me whether LineSize would give us the information
we need for cache maintenance by VA. But I'm not opposed to using this
on a CPU that doesn't have CWG (once we see one).

-- 
Catalin

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2016-04-19 16:09 UTC | newest]

Thread overview: 8 messages
-- links below jump to the message on this page --
2016-04-19 10:29 [PATCH] arm64: mm: take CWG into account in __inval_cache_range() Ard Biesheuvel
2016-04-19 12:56 ` Mark Rutland
2016-04-19 13:08   ` Ard Biesheuvel
2016-04-19 14:13     ` Catalin Marinas
2016-04-19 14:48       ` Ard Biesheuvel
2016-04-19 15:32         ` Catalin Marinas
2016-04-19 15:38           ` Ard Biesheuvel
2016-04-19 16:09             ` Catalin Marinas