* [PATCH v3] arm64: enable EDAC on arm64
@ 2014-04-21 16:09 Rob Herring
  2014-04-22 10:24 ` Will Deacon
  0 siblings, 1 reply; 10+ messages in thread
From: Rob Herring @ 2014-04-21 16:09 UTC (permalink / raw)
  To: linux-arm-kernel

From: Rob Herring <robh@kernel.org>

Implement atomic_scrub and enable EDAC for arm64.

Signed-off-by: Rob Herring <robh@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
---
v3:
- Drop "cc" clobber annotation.
v2:
- Add loop for exclusive store success
- Fix size to be 32 bits at a time. The framework gives no alignment
  guarantees.

 arch/arm64/Kconfig            |  1 +
 arch/arm64/include/asm/edac.h | 38 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 39 insertions(+)
 create mode 100644 arch/arm64/include/asm/edac.h

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index e6e4d37..4c1f857 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -14,6 +14,7 @@ config ARM64
 	select COMMON_CLK
 	select CPU_PM if (SUSPEND || CPU_IDLE)
 	select DCACHE_WORD_ACCESS
+	select EDAC_SUPPORT
 	select GENERIC_CLOCKEVENTS
 	select GENERIC_CLOCKEVENTS_BROADCAST if SMP
 	select GENERIC_CPU_AUTOPROBE
diff --git a/arch/arm64/include/asm/edac.h b/arch/arm64/include/asm/edac.h
new file mode 100644
index 0000000..8a3d176
--- /dev/null
+++ b/arch/arm64/include/asm/edac.h
@@ -0,0 +1,38 @@
+/*
+ * Copyright 2013 Calxeda, Inc.
+ * Based on PPC version Copyright 2007 MontaVista Software, Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+#ifndef ASM_EDAC_H
+#define ASM_EDAC_H
+/*
+ * ECC atomic, DMA, SMP and interrupt safe scrub function.
+ * Implements the per-arch atomic_scrub() that EDAC uses for software
+ * ECC scrubbing.  It reads memory and then writes back the original
+ * value, allowing the hardware to detect and correct memory errors.
+ */
+static inline void atomic_scrub(void *va, u32 size)
+{
+	unsigned int *virt_addr = va;
+	unsigned int i;
+
+	for (i = 0; i < size / sizeof(*virt_addr); i++, virt_addr++) {
+		long result;
+		unsigned long tmp;
+
+		asm volatile("/* atomic_scrub */\n"
+		"1:     ldxr    %w0, %2\n"
+		"       stxr    %w1, %w0, %2\n"
+		"       cbnz    %w1, 1b"
+			: "=&r" (result), "=&r" (tmp), "+Q" (*virt_addr) : : );
+	}
+}
+#endif
-- 
1.9.1
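
For illustration only (this is not part of the patch), the LDXR/STXR loop
above is roughly equivalent to the following plain-C sketch using GCC's
__atomic builtins: each 32-bit word is read and the same value atomically
written back, retrying if another observer modified the word in between.

/*
 * Illustrative analogue of the asm loop in atomic_scrub(); not kernel code.
 */
static inline void atomic_scrub_c_sketch(void *va, unsigned int size)
{
	unsigned int *p = va;
	unsigned int i;

	for (i = 0; i < size / sizeof(*p); i++, p++) {
		unsigned int old = __atomic_load_n(p, __ATOMIC_RELAXED);

		/*
		 * On failure the compare-exchange refreshes 'old' with the
		 * current value, so the retry writes back the latest data,
		 * just as a failed stxr causes the ldxr/stxr pair to rerun.
		 */
		while (!__atomic_compare_exchange_n(p, &old, old, 0,
						    __ATOMIC_RELAXED,
						    __ATOMIC_RELAXED))
			;
	}
}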


* [PATCH v3] arm64: enable EDAC on arm64
  2014-04-21 16:09 [PATCH v3] arm64: enable EDAC on arm64 Rob Herring
@ 2014-04-22 10:24 ` Will Deacon
  2014-04-22 12:54   ` Rob Herring
  0 siblings, 1 reply; 10+ messages in thread
From: Will Deacon @ 2014-04-22 10:24 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Rob,

On Mon, Apr 21, 2014 at 05:09:16PM +0100, Rob Herring wrote:
> From: Rob Herring <robh@kernel.org>
> 
> Implement atomic_scrub and enable EDAC for arm64.
> 
> Signed-off-by: Rob Herring <robh@kernel.org>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Will Deacon <will.deacon@arm.com>

[...]

> diff --git a/arch/arm64/include/asm/edac.h b/arch/arm64/include/asm/edac.h
> new file mode 100644
> index 0000000..8a3d176
> --- /dev/null
> +++ b/arch/arm64/include/asm/edac.h
> @@ -0,0 +1,38 @@
> +/*
> + * Copyright 2013 Calxeda, Inc.
> + * Based on PPC version Copyright 2007 MontaVista Software, Inc.
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms and conditions of the GNU General Public License,
> + * version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope it will be useful, but WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
> + * more details.
> + */
> +#ifndef ASM_EDAC_H
> +#define ASM_EDAC_H
> +/*
> + * ECC atomic, DMA, SMP and interrupt safe scrub function.

What do you mean by `DMA safe'? For coherent (cacheable) DMA buffers, this
should work fine, but for non-coherent (and potentially non-cacheable)
buffers, I think we'll have problems both due to the lack of guaranteed
exclusive monitor support and also eviction of dirty lines.

Will

> + * Implements the per arch atomic_scrub() that EDAC use for software
> + * ECC scrubbing.  It reads memory and then writes back the original
> + * value, allowing the hardware to detect and correct memory errors.
> + */
> +static inline void atomic_scrub(void *va, u32 size)
> +{
> +	unsigned int *virt_addr = va;
> +	unsigned int i;
> +
> +	for (i = 0; i < size / sizeof(*virt_addr); i++, virt_addr++) {
> +		long result;
> +		unsigned long tmp;
> +
> +		asm volatile("/* atomic_scrub */\n"
> +		"1:     ldxr    %w0, %2\n"
> +		"       stxr    %w1, %w0, %2\n"
> +		"       cbnz    %w1, 1b"
> +			: "=&r" (result), "=&r" (tmp), "+Q" (*virt_addr) : : );
> +	}
> +}
> +#endif
> -- 
> 1.9.1
> 
> 


* [PATCH v3] arm64: enable EDAC on arm64
  2014-04-22 10:24 ` Will Deacon
@ 2014-04-22 12:54   ` Rob Herring
  2014-04-22 13:26     ` Will Deacon
  0 siblings, 1 reply; 10+ messages in thread
From: Rob Herring @ 2014-04-22 12:54 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, Apr 22, 2014 at 5:24 AM, Will Deacon <will.deacon@arm.com> wrote:
> Hi Rob,
>
> On Mon, Apr 21, 2014 at 05:09:16PM +0100, Rob Herring wrote:
>> From: Rob Herring <robh@kernel.org>
>>
>> Implement atomic_scrub and enable EDAC for arm64.
>>
>> Signed-off-by: Rob Herring <robh@kernel.org>
>> Cc: Catalin Marinas <catalin.marinas@arm.com>
>> Cc: Will Deacon <will.deacon@arm.com>
>
> [...]
>
>> diff --git a/arch/arm64/include/asm/edac.h b/arch/arm64/include/asm/edac.h
>> new file mode 100644
>> index 0000000..8a3d176
>> --- /dev/null
>> +++ b/arch/arm64/include/asm/edac.h
>> @@ -0,0 +1,38 @@
>> +/*
>> + * Copyright 2013 Calxeda, Inc.
>> + * Based on PPC version Copyright 2007 MontaVista Software, Inc.
>> + *
>> + * This program is free software; you can redistribute it and/or modify it
>> + * under the terms and conditions of the GNU General Public License,
>> + * version 2, as published by the Free Software Foundation.
>> + *
>> + * This program is distributed in the hope it will be useful, but WITHOUT
>> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
>> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
>> + * more details.
>> + */
>> +#ifndef ASM_EDAC_H
>> +#define ASM_EDAC_H
>> +/*
>> + * ECC atomic, DMA, SMP and interrupt safe scrub function.
>
> What do you mean by `DMA safe'? For coherent (cacheable) DMA buffers, this
> should work fine, but for non-coherent (and potentially non-cacheable)
> buffers, I think we'll have problems both due to the lack of guaranteed
> exclusive monitor support and also eviction of dirty lines.

That's just copied from other implementations. I agree you could have
a problem here, although I don't see why dirty line eviction would be one.
There's not really a solution other than not doing s/w scrubbing or
doing it in h/w. So it is up to individual drivers to decide what to
do, but we have to provide this function just to enable EDAC.

Rob

>
> Will
>
>> + * Implements the per arch atomic_scrub() that EDAC use for software
>> + * ECC scrubbing.  It reads memory and then writes back the original
>> + * value, allowing the hardware to detect and correct memory errors.
>> + */
>> +static inline void atomic_scrub(void *va, u32 size)
>> +{
>> +     unsigned int *virt_addr = va;
>> +     unsigned int i;
>> +
>> +     for (i = 0; i < size / sizeof(*virt_addr); i++, virt_addr++) {
>> +             long result;
>> +             unsigned long tmp;
>> +
>> +             asm volatile("/* atomic_scrub */\n"
>> +             "1:     ldxr    %w0, %2\n"
>> +             "       stxr    %w1, %w0, %2\n"
>> +             "       cbnz    %w1, 1b"
>> +                     : "=&r" (result), "=&r" (tmp), "+Q" (*virt_addr) : : );
>> +     }
>> +}
>> +#endif
>> --
>> 1.9.1
>>
>>


* [PATCH v3] arm64: enable EDAC on arm64
  2014-04-22 12:54   ` Rob Herring
@ 2014-04-22 13:26     ` Will Deacon
  2014-04-22 15:23       ` Rob Herring
  0 siblings, 1 reply; 10+ messages in thread
From: Will Deacon @ 2014-04-22 13:26 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, Apr 22, 2014 at 01:54:12PM +0100, Rob Herring wrote:
> On Tue, Apr 22, 2014 at 5:24 AM, Will Deacon <will.deacon@arm.com> wrote:
> > On Mon, Apr 21, 2014 at 05:09:16PM +0100, Rob Herring wrote:
> >> +#ifndef ASM_EDAC_H
> >> +#define ASM_EDAC_H
> >> +/*
> >> + * ECC atomic, DMA, SMP and interrupt safe scrub function.
> >
> > What do you mean by `DMA safe'? For coherent (cacheable) DMA buffers, this
> > should work fine, but for non-coherent (and potentially non-cacheable)
> > buffers, I think we'll have problems both due to the lack of guaranteed
> > exclusive monitor support and also eviction of dirty lines.
> 
> That's just copied from other implementations. I agree you could have
> a problem here although I don't see why dirty line eviction would be.

I was thinking of the case where you have an ongoing, non-coherent DMA
transfer from a device and then the atomic_scrub routine runs in parallel
on the CPU, targeting the same buffer. In this case, the stxr could store
stale data back to the buffer, leading to corruption (since the monitor
won't help). This differs from the case where the monitor could always
report failure for non-cacheable regions, causing atomic_scrub to livelock.

> There's not really a solution other than not doing s/w scrubbing or
> doing it in h/w. So it is up to individual drivers to decide what to
> do, but we have to provide this function just to enable EDAC.

I think we need to avoid s/w scrubbing of non-cacheable memory altogether.

Will


* [PATCH v3] arm64: enable EDAC on arm64
  2014-04-22 13:26     ` Will Deacon
@ 2014-04-22 15:23       ` Rob Herring
  2014-04-22 16:01         ` Will Deacon
  0 siblings, 1 reply; 10+ messages in thread
From: Rob Herring @ 2014-04-22 15:23 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, Apr 22, 2014 at 8:26 AM, Will Deacon <will.deacon@arm.com> wrote:
> On Tue, Apr 22, 2014 at 01:54:12PM +0100, Rob Herring wrote:
>> On Tue, Apr 22, 2014 at 5:24 AM, Will Deacon <will.deacon@arm.com> wrote:
>> > On Mon, Apr 21, 2014 at 05:09:16PM +0100, Rob Herring wrote:
>> >> +#ifndef ASM_EDAC_H
>> >> +#define ASM_EDAC_H
>> >> +/*
>> >> + * ECC atomic, DMA, SMP and interrupt safe scrub function.
>> >
>> > What do you mean by `DMA safe'? For coherent (cacheable) DMA buffers, this
>> > should work fine, but for non-coherent (and potentially non-cacheable)
>> > buffers, I think we'll have problems both due to the lack of guaranteed
>> > exclusive monitor support and also eviction of dirty lines.
>>
>> That's just copied from other implementations. I agree you could have
>> a problem here although I don't see why dirty line eviction would be.
>
> I was thinking of the case where you have an ongoing, non-coherent DMA
> transfer from a device and then the atomic_scrub routine runs in parallel
> on the CPU, targetting the same buffer. In this case, the stxr could store
> stale data back to the buffer, leading to corruption (since the monitor
> won't help). This differs from the case where the monitor could always
> report failure for non-cacheable regions, causing atomic_scrub to livelock.

It is only reads that will trigger an error and scrubbing. If the DMA
is continuously reading (such as a framebuffer), then there would not
be an issue. What would be the use case where a DMA continuously writes
to the same location without any synchronization with the CPU? I
suppose one core could re-trigger a DMA while another core is doing
the scrubbing. You would have to read the DMA data and be finished
with it quicker than the scrubbing could get handled. I just wonder
whether this is only a theoretical problem rather than one in
practice.

>> There's not really a solution other than not doing s/w scrubbing or
>> doing it in h/w. So it is up to individual drivers to decide what to
>> do, but we have to provide this function just to enable EDAC.
>
> I think we need to avoid s/w scrubbing of non-cacheable memory altogether.

There's not really a way to determine the memory attributes easily
though. Whether it works depends on the h/w. Calxeda's memory
controller did have an exclusive monitor so I think this would have
worked even in the non-coherent case.

What exactly is your proposal to do here? I think we should assume the
h/w is designed correctly until we have a case where it is not.

Rob


* [PATCH v3] arm64: enable EDAC on arm64
  2014-04-22 15:23       ` Rob Herring
@ 2014-04-22 16:01         ` Will Deacon
  2014-04-22 16:29           ` Rob Herring
  0 siblings, 1 reply; 10+ messages in thread
From: Will Deacon @ 2014-04-22 16:01 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, Apr 22, 2014 at 04:23:20PM +0100, Rob Herring wrote:
> On Tue, Apr 22, 2014 at 8:26 AM, Will Deacon <will.deacon@arm.com> wrote:
> > On Tue, Apr 22, 2014 at 01:54:12PM +0100, Rob Herring wrote:
> >> On Tue, Apr 22, 2014 at 5:24 AM, Will Deacon <will.deacon@arm.com> wrote:
> >> > On Mon, Apr 21, 2014 at 05:09:16PM +0100, Rob Herring wrote:
> >> >> +#ifndef ASM_EDAC_H
> >> >> +#define ASM_EDAC_H
> >> >> +/*
> >> >> + * ECC atomic, DMA, SMP and interrupt safe scrub function.
> >> >
> >> > What do you mean by `DMA safe'? For coherent (cacheable) DMA buffers, this
> >> > should work fine, but for non-coherent (and potentially non-cacheable)
> >> > buffers, I think we'll have problems both due to the lack of guaranteed
> >> > exclusive monitor support and also eviction of dirty lines.
> >>
> >> That's just copied from other implementations. I agree you could have
> >> a problem here although I don't see why dirty line eviction would be.
> >
> > I was thinking of the case where you have an ongoing, non-coherent DMA
> > transfer from a device and then the atomic_scrub routine runs in parallel
> > on the CPU, targetting the same buffer. In this case, the stxr could store
> > stale data back to the buffer, leading to corruption (since the monitor
> > won't help). This differs from the case where the monitor could always
> > report failure for non-cacheable regions, causing atomic_scrub to livelock.
> 
> It is only reads that will trigger an error and scrubbing. If the DMA
> is continuously reading (such as a framebuffer), then there would not
> be an issue. What would be the usecase where a DMA continously writes
> to the same location without any synchronization with the cpu? I
> suppose one core could re-trigger a DMA while another core is doing
> the scrubbing. You would have to read the DMA data and be finished
> with it quicker than the scrubbing could get handled. I just wonder
> whether this is really only a theoretical problem, but not one in
> practice.

I don't think it's all that complicated if you consider speculative reads
from the CPU triggering the error. However, discussion with Catalin raised
another question (see below).

> >> There's not really a solution other than not doing s/w scrubbing or
> >> doing it in h/w. So it is up to individual drivers to decide what to
> >> do, but we have to provide this function just to enable EDAC.
> >
> > I think we need to avoid s/w scrubbing of non-cacheable memory altogether.
> 
> There's not really a way to determine the memory attributes easily
> though. Whether it works depends on the h/w. Calxeda's memory
> controller did have an exclusive monitor so I think this would have
> worked even in the non-coherent case.
> 
> What exactly is your proposal to do here? I think we should assume the
> h/w is designed correctly until we have a case that it is not.

Looking at the edac_mc_scrub_block code, atomic_scrub is always called with
a normal, cacheable mapping (kmap_atomic) so that doesn't help us (although
it means the exclusives will at least succeed).

The problem of speculative reads by the CPU could be solved by unmapping the
DMA buffer when we transfer ownership over to the device (instead of
invalidating it after the transfer). However, I'm now slightly confused as
to how atomic_scrub fixes errors reported at any cache level higher than
L1. Do we need cache-flushing to ensure that the exclusive-store propagates
to the point of failure?

Will
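
For reference, the call path being described looks roughly like the sketch
below. This is a condensed paraphrase of the EDAC core's
edac_mc_scrub_block(), not a verbatim copy of drivers/edac/edac_mc.c
(highmem and IRQ handling are omitted); the point it illustrates is that the
arch hook is always handed a normal, cacheable kmap_atomic() mapping of the
affected page.

#include <linux/mm.h>
#include <linux/highmem.h>
#include <asm/edac.h>

/* Condensed sketch of edac_mc_scrub_block(); paraphrased, not verbatim. */
static void edac_mc_scrub_block_sketch(unsigned long page,
				       unsigned long offset, u32 size)
{
	struct page *pg;
	void *virt_addr;

	/* Ignore errors reported for pages outside our memory. */
	if (!pfn_valid(page))
		return;

	pg = pfn_to_page(page);
	virt_addr = kmap_atomic(pg);		/* normal, cacheable mapping */

	/* Arch-specific rewrite of the affected words. */
	atomic_scrub(virt_addr + offset, size);

	kunmap_atomic(virt_addr);
}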


* [PATCH v3] arm64: enable EDAC on arm64
  2014-04-22 16:01         ` Will Deacon
@ 2014-04-22 16:29           ` Rob Herring
  2014-04-23 17:04             ` Will Deacon
  0 siblings, 1 reply; 10+ messages in thread
From: Rob Herring @ 2014-04-22 16:29 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, Apr 22, 2014 at 11:01 AM, Will Deacon <will.deacon@arm.com> wrote:
> On Tue, Apr 22, 2014 at 04:23:20PM +0100, Rob Herring wrote:
>> On Tue, Apr 22, 2014 at 8:26 AM, Will Deacon <will.deacon@arm.com> wrote:
>> > On Tue, Apr 22, 2014 at 01:54:12PM +0100, Rob Herring wrote:
>> >> On Tue, Apr 22, 2014 at 5:24 AM, Will Deacon <will.deacon@arm.com> wrote:
>> >> > On Mon, Apr 21, 2014 at 05:09:16PM +0100, Rob Herring wrote:
>> >> >> +#ifndef ASM_EDAC_H
>> >> >> +#define ASM_EDAC_H
>> >> >> +/*
>> >> >> + * ECC atomic, DMA, SMP and interrupt safe scrub function.
>> >> >
>> >> > What do you mean by `DMA safe'? For coherent (cacheable) DMA buffers, this
>> >> > should work fine, but for non-coherent (and potentially non-cacheable)
>> >> > buffers, I think we'll have problems both due to the lack of guaranteed
>> >> > exclusive monitor support and also eviction of dirty lines.
>> >>
>> >> That's just copied from other implementations. I agree you could have
>> >> a problem here although I don't see why dirty line eviction would be.
>> >
>> > I was thinking of the case where you have an ongoing, non-coherent DMA
>> > transfer from a device and then the atomic_scrub routine runs in parallel
>> > on the CPU, targetting the same buffer. In this case, the stxr could store
>> > stale data back to the buffer, leading to corruption (since the monitor
>> > won't help). This differs from the case where the monitor could always
>> > report failure for non-cacheable regions, causing atomic_scrub to livelock.
>>
>> It is only reads that will trigger an error and scrubbing. If the DMA
>> is continuously reading (such as a framebuffer), then there would not
>> be an issue. What would be the usecase where a DMA continously writes
>> to the same location without any synchronization with the cpu? I
>> suppose one core could re-trigger a DMA while another core is doing
>> the scrubbing. You would have to read the DMA data and be finished
>> with it quicker than the scrubbing could get handled. I just wonder
>> whether this is really only a theoretical problem, but not one in
>> practice.
>
> I don't think it's all that complicated if you consider speculative reads
> from the CPU triggering the error. However, discussion with Catalin raised
> another question (see below).
>
>> >> There's not really a solution other than not doing s/w scrubbing or
>> >> doing it in h/w. So it is up to individual drivers to decide what to
>> >> do, but we have to provide this function just to enable EDAC.
>> >
>> > I think we need to avoid s/w scrubbing of non-cacheable memory altogether.
>>
>> There's not really a way to determine the memory attributes easily
>> though. Whether it works depends on the h/w. Calxeda's memory
>> controller did have an exclusive monitor so I think this would have
>> worked even in the non-coherent case.
>>
>> What exactly is your proposal to do here? I think we should assume the
>> h/w is designed correctly until we have a case that it is not.
>
> Looking at the edac_mc_scrub_block code, atomic_scrub is always called with
> a normal, cacheable mapping (kmap_atomic) so that doesn't help us (although
> it means the exclusives will at least succeed).
>
> The problem of speculative reads by the CPU could be solved by unmapped the
> DMA buffer when we transfer the ownership over to the device (instead of
> invalidating it after the transfer). However, I'm now slightly confused as
> to how atomic_scrub fixes errors reported at any cache level higher than
> L1. Do we need cache-flushing to ensure that the exclusive-store propagates
> to the point of failure?

The whole point of scrubbing is to stop repeated error reporting of
correctable errors. For example, you do a write to memory and the ECC
code is added to it. Suppose the data stored in the memory gets
corrupted either on the write or some time later you get a bit flip in
the memory cell. Then when the data is read from memory, the memory
controller will detect the error, correct it, and trigger an ECC
correctable error interrupt. It will do this every time you read that
memory location because the error occurred on the write. The only way
to clear the error is re-writing memory. As long as that cache line is
dirty, no reads from that memory location will occur as other readers
will get the line from other cores, the L2, or the line will get
pushed out to memory first. I guess you could see an invalidate on DMA
memory causing the scrub to get lost, but that doesn't really matter.
It would be harmless to get the error again other than making your
error rate seem higher (which is something OEMs are very sensitive
to). You are doing the invalidate so that DMA can write new data
anyway.

Rob


* [PATCH v3] arm64: enable EDAC on arm64
  2014-04-22 16:29           ` Rob Herring
@ 2014-04-23 17:04             ` Will Deacon
  2014-05-09 17:33               ` Catalin Marinas
  0 siblings, 1 reply; 10+ messages in thread
From: Will Deacon @ 2014-04-23 17:04 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, Apr 22, 2014 at 05:29:52PM +0100, Rob Herring wrote:
> On Tue, Apr 22, 2014 at 11:01 AM, Will Deacon <will.deacon@arm.com> wrote:
> > Looking at the edac_mc_scrub_block code, atomic_scrub is always called with
> > a normal, cacheable mapping (kmap_atomic) so that doesn't help us (although
> > it means the exclusives will at least succeed).
> >
> > The problem of speculative reads by the CPU could be solved by unmapped the
> > DMA buffer when we transfer the ownership over to the device (instead of
> > invalidating it after the transfer). However, I'm now slightly confused as
> > to how atomic_scrub fixes errors reported at any cache level higher than
> > L1. Do we need cache-flushing to ensure that the exclusive-store propagates
> > to the point of failure?
> 
> The whole point of scrubbing is to stop repeated error reporting of
> correctable errors. For example, you do a write to memory and the ECC
> code is added to it. Suppose the data stored in the memory gets
> corrupted either on the write or some time later you get a bit flip in
> the memory cell. Then when the data is read from memory, the memory
> controller will detect the error, correct it, and trigger and ECC
> correctable error interrupt. It will do this every time you read that
> memory location because the error occurred on the write. The only way
> to clear the error is re-writing memory.

Thanks for the explanation.

> As long as that cache line is dirty, no reads from that memory location
> will occur as other readers will get the line from other cores, the L2, or
> the line will get pushed out to memory first.

Agreed, if all of the readers are coherent.

> I guess you could see an invalidate on DMA memory causing the scrub to get
> lost, but that doesn't really matter.  It would be harmless to get the
> error again other than making your error rate seem higher (which is
> something OEMs are very sensitive to). You are doing the invalidate so
> that DMA can write new data anyway.

Also agreed that the error-rate could be higher, but I still think there's
a corruption issue here as well.

To be clear:

 (1) The CPU maps a non-coherent, streaming DMA buffer for a device to
     populate (i.e. cache cleaning).

 (2) The device starts writing to the buffer

 (3) Whilst the device is writing, the CPU performs a speculative read
     from the buffer and an ECC error occurs. The error is corrected and
     the CPU gets a clean line, whilst an interrupt is pended at the GIC
     to inform the CPU about the error.

 (4) The CPU takes the interrupt and starts scrubbing the line. It issues
     the exclusive load but then...

 (5) The device writes the location in question. The error is cleared (not
     that we really care) and the memory location now contains new data

 (6) The CPU continues with its scrub, executing a successful
     exclusive-store of *stale* data back to the memory location, but
     allocating into L1.

 (7) Before the DMA completes, the line gets evicted from L1 and back to
     main memory, corrupting the DMA transfer.

So that's more serious than inflated reports -- we're turning a corrected
error into a data corruption.

Will


* [PATCH v3] arm64: enable EDAC on arm64
  2014-04-23 17:04             ` Will Deacon
@ 2014-05-09 17:33               ` Catalin Marinas
  2014-05-09 17:55                 ` Will Deacon
  0 siblings, 1 reply; 10+ messages in thread
From: Catalin Marinas @ 2014-05-09 17:33 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, Apr 23, 2014 at 06:04:45PM +0100, Will Deacon wrote:
> On Tue, Apr 22, 2014 at 05:29:52PM +0100, Rob Herring wrote:
> > On Tue, Apr 22, 2014 at 11:01 AM, Will Deacon <will.deacon@arm.com> wrote:
> > > Looking at the edac_mc_scrub_block code, atomic_scrub is always called with
> > > a normal, cacheable mapping (kmap_atomic) so that doesn't help us (although
> > > it means the exclusives will at least succeed).
> > >
> > > The problem of speculative reads by the CPU could be solved by unmapped the
> > > DMA buffer when we transfer the ownership over to the device (instead of
> > > invalidating it after the transfer). However, I'm now slightly confused as
> > > to how atomic_scrub fixes errors reported at any cache level higher than
> > > L1. Do we need cache-flushing to ensure that the exclusive-store propagates
> > > to the point of failure?
> > 
> > The whole point of scrubbing is to stop repeated error reporting of
> > correctable errors. For example, you do a write to memory and the ECC
> > code is added to it. Suppose the data stored in the memory gets
> > corrupted either on the write or some time later you get a bit flip in
> > the memory cell. Then when the data is read from memory, the memory
> > controller will detect the error, correct it, and trigger and ECC
> > correctable error interrupt. It will do this every time you read that
> > memory location because the error occurred on the write. The only way
> > to clear the error is re-writing memory.
> 
> Thanks for the explanation.
> 
> > As long as that cache line is dirty, no reads from that memory location
> > will occur as other readers will get the line from other cores, the L2, or
> > the line will get pushed out to memory first.
> 
> Agreed, if all of the readers are coherent.

Just to get things moving on this patch, do we agree that it is only safe
for coherent DMA? If so, do we merge it on the grounds that people
needing EDAC only use it with DMA-coherent memory? The comment for
atomic_scrub should be updated to state coherent DMA only.

We could check the dma_ops in atomic_scrub but I don't think it's worth
it.

-- 
Catalin


* [PATCH v3] arm64: enable EDAC on arm64
  2014-05-09 17:33               ` Catalin Marinas
@ 2014-05-09 17:55                 ` Will Deacon
  0 siblings, 0 replies; 10+ messages in thread
From: Will Deacon @ 2014-05-09 17:55 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, May 09, 2014 at 06:33:40PM +0100, Catalin Marinas wrote:
> On Wed, Apr 23, 2014 at 06:04:45PM +0100, Will Deacon wrote:
> > On Tue, Apr 22, 2014 at 05:29:52PM +0100, Rob Herring wrote:
> > > On Tue, Apr 22, 2014 at 11:01 AM, Will Deacon <will.deacon@arm.com> wrote:
> > > > Looking at the edac_mc_scrub_block code, atomic_scrub is always called with
> > > > a normal, cacheable mapping (kmap_atomic) so that doesn't help us (although
> > > > it means the exclusives will at least succeed).
> > > >
> > > > The problem of speculative reads by the CPU could be solved by unmapped the
> > > > DMA buffer when we transfer the ownership over to the device (instead of
> > > > invalidating it after the transfer). However, I'm now slightly confused as
> > > > to how atomic_scrub fixes errors reported at any cache level higher than
> > > > L1. Do we need cache-flushing to ensure that the exclusive-store propagates
> > > > to the point of failure?
> > > 
> > > The whole point of scrubbing is to stop repeated error reporting of
> > > correctable errors. For example, you do a write to memory and the ECC
> > > code is added to it. Suppose the data stored in the memory gets
> > > corrupted either on the write or some time later you get a bit flip in
> > > the memory cell. Then when the data is read from memory, the memory
> > > controller will detect the error, correct it, and trigger and ECC
> > > correctable error interrupt. It will do this every time you read that
> > > memory location because the error occurred on the write. The only way
> > > to clear the error is re-writing memory.
> > 
> > Thanks for the explanation.
> > 
> > > As long as that cache line is dirty, no reads from that memory location
> > > will occur as other readers will get the line from other cores, the L2, or
> > > the line will get pushed out to memory first.
> > 
> > Agreed, if all of the readers are coherent.
> 
> Just to get things moving on this patch, do we agree that it is only safe
> for coherent DMA? If so, do we merge it on the grounds that people
> needing EDAC only use it with DMA-coherent memory? The comment for
> atomic_scrub should be updated to state coherent DMA only.
> 
> We could check the dma_ops in atomic_scrub but I don't think it's worth
> it.

Well, I think we should do *something* for the non-coherent case, otherwise
we're going to have fun debugging random buffer corruption. Could we disable
scrubbing from the dma_bus_notifier the moment we find a non-coherent
device?

Will
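
A rough sketch of the kind of hook being suggested is shown below. It is an
illustration under stated assumptions only: device_is_dma_coherent() is a
hypothetical helper standing in for whatever per-device coherence
information is actually available, and whether a bus notifier is the right
place to gate scrubbing is exactly the open question here.

#include <linux/device.h>
#include <linux/init.h>
#include <linux/notifier.h>
#include <linux/platform_device.h>

/* Hypothetical helper: assumed to report per-device DMA coherence. */
extern bool device_is_dma_coherent(struct device *dev);

/* Set once any non-coherent DMA master appears; an EDAC driver could
 * check this before attempting software scrubbing. */
static bool edac_scrub_unsafe;

static int edac_dma_notifier(struct notifier_block *nb,
			     unsigned long action, void *data)
{
	struct device *dev = data;

	if (action == BUS_NOTIFY_ADD_DEVICE && !device_is_dma_coherent(dev))
		edac_scrub_unsafe = true;

	return NOTIFY_OK;
}

static struct notifier_block edac_dma_nb = {
	.notifier_call = edac_dma_notifier,
};

static int __init edac_dma_notifier_init(void)
{
	return bus_register_notifier(&platform_bus_type, &edac_dma_nb);
}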


Thread overview: 10+ messages
2014-04-21 16:09 [PATCH v3] arm64: enable EDAC on arm64 Rob Herring
2014-04-22 10:24 ` Will Deacon
2014-04-22 12:54   ` Rob Herring
2014-04-22 13:26     ` Will Deacon
2014-04-22 15:23       ` Rob Herring
2014-04-22 16:01         ` Will Deacon
2014-04-22 16:29           ` Rob Herring
2014-04-23 17:04             ` Will Deacon
2014-05-09 17:33               ` Catalin Marinas
2014-05-09 17:55                 ` Will Deacon
