linux-kernel.vger.kernel.org archive mirror
* [PATCH] arm: mm: fault: check ADFSR in case of abort
@ 2018-10-29 14:20 Wiebe, Wladislav (Nokia - DE/Ulm)
  2018-10-29 14:52 ` Robin Murphy
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Wiebe, Wladislav (Nokia - DE/Ulm) @ 2018-10-29 14:20 UTC
  To: linux, tony, akpm, ebiederm, jrdr.linux, linux-arm-kernel; +Cc: linux-kernel

When running into situations like:
"Unhandled fault: synchronous external abort (0x210) at 0xXXX"
or
"Unhandled prefetch abort: synchronous external abort (0x210) at 0xXXX"
it is useful to know the content of ADFSR (Auxiliary Data Fault Status
Register) to indicate an ECC double-bit error in L1 or L2 cache.

Refer to:
Cortex-A15 Technical Reference Manual, Revision: r2p1
[6.4.8. Error Correction Code]

Signed-off-by: Wladislav Wiebe <wladislav.wiebe@nokia.com>
---
 arch/arm/mm/fault.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
index 3232afb6fdc0..5e240deb6ed6 100644
--- a/arch/arm/mm/fault.c
+++ b/arch/arm/mm/fault.c
@@ -547,6 +547,22 @@ hook_fault_code(int nr, int (*fn)(unsigned long, unsigned int, struct pt_regs *)
 	fsr_info[nr].name = name;
 }
 
+/*
+ * Check for ECC double-bit errors in Auxiliary Data Fault Status Register
+ */
+static void check_adfsr_for_ecc(void)
+{
+	u32 adfsr = 0;
+
+	asm("mrc p15, 0, %0, c5, c1, 0" : "=r" (adfsr));
+
+	if (adfsr & (BIT(31) | BIT(23))) {
+		pr_alert("ADFSR status 0x%x indicates that an L1 or L2 cache\n"
+			 "ECC double-bit error occurred at some time.\n",
+			  adfsr);
+	}
+}
+
 /*
  * Dispatch a data abort to the relevant handler.
  */
@@ -559,6 +575,7 @@ do_DataAbort(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
 	if (!inf->fn(addr, fsr & ~FSR_LNX_PF, regs))
 		return;
 
+	check_adfsr_for_ecc();
 	pr_alert("Unhandled fault: %s (0x%03x) at 0x%08lx\n",
 		inf->name, fsr, addr);
 	show_pte(current->mm, addr);
@@ -593,6 +610,7 @@ do_PrefetchAbort(unsigned long addr, unsigned int ifsr, struct pt_regs *regs)
 	if (!inf->fn(addr, ifsr | FSR_LNX_PF, regs))
 		return;
 
+	check_adfsr_for_ecc();
 	pr_alert("Unhandled prefetch abort: %s (0x%03x) at 0x%08lx\n",
 		inf->name, ifsr, addr);
 
-- 
2.16.1


* Re: [PATCH] arm: mm: fault: check ADFSR in case of abort
  2018-10-29 14:20 [PATCH] arm: mm: fault: check ADFSR in case of abort Wiebe, Wladislav (Nokia - DE/Ulm)
@ 2018-10-29 14:52 ` Robin Murphy
  2018-10-29 15:30   ` Wiebe, Wladislav (Nokia - DE/Ulm)
  2018-10-29 15:12 ` Russell King - ARM Linux
  2018-10-29 15:54 ` Mark Rutland
  2 siblings, 1 reply; 6+ messages in thread
From: Robin Murphy @ 2018-10-29 14:52 UTC
  To: Wiebe, Wladislav (Nokia - DE/Ulm),
	linux, tony, akpm, ebiederm, jrdr.linux, linux-arm-kernel
  Cc: linux-kernel

On 29/10/2018 14:20, Wiebe, Wladislav (Nokia - DE/Ulm) wrote:
> When running into situations like:
> "Unhandled fault: synchronous external abort (0x210) at 0xXXX"
> or
> "Unhandled prefetch abort: synchronous external abort (0x210) at 0xXXX"
> it is useful to know the content of ADFSR (Auxiliary Data Fault Status
> Register) to indicate an ECC double-bit error in L1 or L2 cache.
> 
> Refer to:
> Cortex-A15 Technical Reference Manual, Revision: r2p1
> [6.4.8. Error Correction Code]

The contents of ADFSR are implementation-defined, though, so this 
interpretation is *only* valid on Cortex-A15. Other processors may use 
those bit positions to report something else, at which point printing a 
message about ECC errors would be totally misleading.

Robin.

> Signed-off-by: Wladislav Wiebe <wladislav.wiebe@nokia.com>
> ---
>   arch/arm/mm/fault.c | 18 ++++++++++++++++++
>   1 file changed, 18 insertions(+)
> 
> diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
> index 3232afb6fdc0..5e240deb6ed6 100644
> --- a/arch/arm/mm/fault.c
> +++ b/arch/arm/mm/fault.c
> @@ -547,6 +547,22 @@ hook_fault_code(int nr, int (*fn)(unsigned long, unsigned int, struct pt_regs *)
>   	fsr_info[nr].name = name;
>   }
>   
> +/*
> + * Check for ECC double-bit errors in Auxiliary Data Fault Status Register
> + */
> +static void check_adfsr_for_ecc(void)
> +{
> +	u32 adfsr = 0;
> +
> +	asm("mrc p15, 0, %0, c5, c1, 0" : "=r" (adfsr));
> +
> +	if (adfsr & (BIT(31) | BIT(23))) {
> +		pr_alert("ADFSR status 0x%x indicates that an L1 or L2 cache\n"
> +			 "ECC double-bit error occurred at some time.\n",
> +			  adfsr);
> +	}
> +}
> +
>   /*
>    * Dispatch a data abort to the relevant handler.
>    */
> @@ -559,6 +575,7 @@ do_DataAbort(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
>   	if (!inf->fn(addr, fsr & ~FSR_LNX_PF, regs))
>   		return;
>   
> +	check_adfsr_for_ecc();
>   	pr_alert("Unhandled fault: %s (0x%03x) at 0x%08lx\n",
>   		inf->name, fsr, addr);
>   	show_pte(current->mm, addr);
> @@ -593,6 +610,7 @@ do_PrefetchAbort(unsigned long addr, unsigned int ifsr, struct pt_regs *regs)
>   	if (!inf->fn(addr, ifsr | FSR_LNX_PF, regs))
>   		return;
>   
> +	check_adfsr_for_ecc();
>   	pr_alert("Unhandled prefetch abort: %s (0x%03x) at 0x%08lx\n",
>   		inf->name, ifsr, addr);
>   
> 


* Re: [PATCH] arm: mm: fault: check ADFSR in case of abort
  2018-10-29 14:20 [PATCH] arm: mm: fault: check ADFSR in case of abort Wiebe, Wladislav (Nokia - DE/Ulm)
  2018-10-29 14:52 ` Robin Murphy
@ 2018-10-29 15:12 ` Russell King - ARM Linux
  2018-10-29 15:54 ` Mark Rutland
  2 siblings, 0 replies; 6+ messages in thread
From: Russell King - ARM Linux @ 2018-10-29 15:12 UTC
  To: Wiebe, Wladislav (Nokia - DE/Ulm)
  Cc: tony, akpm, ebiederm, jrdr.linux, linux-arm-kernel, linux-kernel

On Mon, Oct 29, 2018 at 02:20:51PM +0000, Wiebe, Wladislav (Nokia - DE/Ulm) wrote:
> When running into situations like:
> "Unhandled fault: synchronous external abort (0x210) at 0xXXX"
> or
> "Unhandled prefetch abort: synchronous external abort (0x210) at 0xXXX"
> it is useful to know the content of ADFSR (Auxiliary Data Fault Status
> Register) to indicate an ECC double-bit error in L1 or L2 cache.
> 
> Refer to:
> Cortex-A15 Technical Reference Manual, Revision: r2p1
> [6.4.8. Error Correction Code]

This is CPU independent code, and so must only access registers that are
present on all CPUs which may run that code.

Here's the extract from the ARM ARM for the ADFSR and AIFSR:

  The position of these registers is architecturally-defined, but the
  content and use of the registers is IMPLEMENTATION DEFINED. An
  implementation can use these registers to return additional fault
  status information. An example use of these registers is to return
  more information for diagnosing parity errors.

So by testing bits in this register, you are making use of
implementation defined values.

It also goes on to say:

  These registers are not implemented in architecture versions before
  ARMv7.

So before ARMv7, we have to take note of the unimplemented CP15 rules:

2. In an allocated CP15 primary register, accesses to all unallocated
   encodings are UNPREDICTABLE for accesses at PL1 or higher.  This
   means that any MCR or MRC access from PL1 or higher with a
   combination of <CRn>, <opc1>, <CRm> and <opc2> values not shown in,
   or referenced from, Full list of VMSA CP15 registers, by coprocessor
   register number on page B3-1481, that would access an allocated
   CP15 primary register, is UNPREDICTABLE. As indicated by rule 1, for
   the ARMv7-A architecture, the allocated CP15 primary registers are:
   • in any VMSA implementation, c0-c3, c5-c11, c13, and c15
   ...

So I'd prefer if we didn't attempt to read this register on CPUs where
this isn't explicitly implemented.

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up


* RE: [PATCH] arm: mm: fault: check ADFSR in case of abort
  2018-10-29 14:52 ` Robin Murphy
@ 2018-10-29 15:30   ` Wiebe, Wladislav (Nokia - DE/Ulm)
  0 siblings, 0 replies; 6+ messages in thread
From: Wiebe, Wladislav (Nokia - DE/Ulm) @ 2018-10-29 15:30 UTC
  To: Robin Murphy, linux, tony, akpm, ebiederm, jrdr.linux, linux-arm-kernel
  Cc: linux-kernel

Hi Robin, Russell,

> -----Original Message-----
> From: Robin Murphy <robin.murphy@arm.com>
> Sent: Monday, October 29, 2018 3:52 PM
[..]
> On 29/10/2018 14:20, Wiebe, Wladislav (Nokia - DE/Ulm) wrote:
> > When running into situations like:
> > "Unhandled fault: synchronous external abort (0x210) at 0xXXX"
> > or
> > "Unhandled prefetch abort: synchronous external abort (0x210) at 0xXXX"
> > it is useful to know the content of ADFSR (Auxiliary Data Fault Status
> > Register) to indicate an ECC double-bit error in L1 or L2 cache.
> >
> > Refer to:
> > Cortex-A15 Technical Reference Manual, Revision: r2p1
> > [6.4.8. Error Correction Code]
> 
> The contents of ADFSR are implementation-defined, though, so this
> interpretation is *only* valid on Cortex-A15. Other processors may use those
> bit positions to report something else, at which point printing a message
> about ECC errors would be totally misleading.

Good point; I initially thought it was valid for other processors as well.

Do you think we can go with this approach:
	if (read_cpuid_part() == ARM_CPU_PART_CORTEX_A15) {
		asm("mrc p15, 0, %0, c5, c1, 0" : "=r" (adfsr));
		xxxx
	}

?
Thanks a lot for the fast feedback!

- Wladislav
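
A minimal user-space sketch of the gating idea proposed above, for
illustration only. It models the check as a pure function of a MIDR value
and an ADFSR value so that it can be compiled and tested off-target. The
names CPU_PART_MASK, CPU_PART_CORTEX_A15, ADFSR_A15_ECC_MASK,
is_cortex_a15() and adfsr_reports_ecc() are hypothetical. The two part
constants are assumed to match the kernel's ARM_CPU_PART_MASK and
ARM_CPU_PART_CORTEX_A15, and the ECC bits are the ones tested in the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed to mirror the kernel's implementer + part-number mask. */
#define CPU_PART_MASK       0xff00fff0u
/* Assumed to mirror the kernel's Cortex-A15 part identifier. */
#define CPU_PART_CORTEX_A15 0x4100c0f0u
/* Bits 31 and 23, as tested in the patch under review. */
#define ADFSR_A15_ECC_MASK  ((1u << 31) | (1u << 23))

/* True if the MIDR value identifies a Cortex-A15. */
static bool is_cortex_a15(uint32_t midr)
{
	return (midr & CPU_PART_MASK) == CPU_PART_CORTEX_A15;
}

/*
 * Interpret ADFSR as an ECC double-bit error report only on Cortex-A15.
 * On any other part the register content is implementation defined, so
 * no interpretation is attempted.
 */
static bool adfsr_reports_ecc(uint32_t midr, uint32_t adfsr)
{
	if (!is_cortex_a15(midr))
		return false;

	return (adfsr & ADFSR_A15_ECC_MASK) != 0;
}
```

In the kernel, the ADFSR read itself would sit inside the same
conditional, so the register is never accessed on other parts. This sketch
says nothing about when the register is read relative to the fault.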

> 
> Robin.
> 
> > Signed-off-by: Wladislav Wiebe <wladislav.wiebe@nokia.com>
> > ---
> >   arch/arm/mm/fault.c | 18 ++++++++++++++++++
> >   1 file changed, 18 insertions(+)
> >
> > diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
> > index 3232afb6fdc0..5e240deb6ed6 100644
> > --- a/arch/arm/mm/fault.c
> > +++ b/arch/arm/mm/fault.c
> > @@ -547,6 +547,22 @@ hook_fault_code(int nr, int (*fn)(unsigned long, unsigned int, struct pt_regs *)
> >   	fsr_info[nr].name = name;
> >   }
> >
> > +/*
> > + * Check for ECC double-bit errors in Auxiliary Data Fault Status Register
> > + */
> > +static void check_adfsr_for_ecc(void)
> > +{
> > +	u32 adfsr = 0;
> > +
> > +	asm("mrc p15, 0, %0, c5, c1, 0" : "=r" (adfsr));
> > +
> > +	if (adfsr & (BIT(31) | BIT(23))) {
> > +		pr_alert("ADFSR status 0x%x indicates that an L1 or L2 cache\n"
> > +			 "ECC double-bit error occurred at some time.\n",
> > +			  adfsr);
> > +	}
> > +}
> > +
> >   /*
> >    * Dispatch a data abort to the relevant handler.
> >    */
> > @@ -559,6 +575,7 @@ do_DataAbort(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
> >   	if (!inf->fn(addr, fsr & ~FSR_LNX_PF, regs))
> >   		return;
> >
> > +	check_adfsr_for_ecc();
> >   	pr_alert("Unhandled fault: %s (0x%03x) at 0x%08lx\n",
> >   		inf->name, fsr, addr);
> >   	show_pte(current->mm, addr);
> > @@ -593,6 +610,7 @@ do_PrefetchAbort(unsigned long addr, unsigned int ifsr, struct pt_regs *regs)
> >   	if (!inf->fn(addr, ifsr | FSR_LNX_PF, regs))
> >   		return;
> >
> > +	check_adfsr_for_ecc();
> >   	pr_alert("Unhandled prefetch abort: %s (0x%03x) at 0x%08lx\n",
> >   		inf->name, ifsr, addr);
> >
> >


* Re: [PATCH] arm: mm: fault: check ADFSR in case of abort
  2018-10-29 14:20 [PATCH] arm: mm: fault: check ADFSR in case of abort Wiebe, Wladislav (Nokia - DE/Ulm)
  2018-10-29 14:52 ` Robin Murphy
  2018-10-29 15:12 ` Russell King - ARM Linux
@ 2018-10-29 15:54 ` Mark Rutland
  2018-10-29 16:43   ` Russell King - ARM Linux
  2 siblings, 1 reply; 6+ messages in thread
From: Mark Rutland @ 2018-10-29 15:54 UTC
  To: Wiebe, Wladislav (Nokia - DE/Ulm)
  Cc: linux, tony, akpm, ebiederm, jrdr.linux, linux-arm-kernel, linux-kernel

On Mon, Oct 29, 2018 at 02:20:51PM +0000, Wiebe, Wladislav (Nokia - DE/Ulm) wrote:
> When running into situations like:
> "Unhandled fault: synchronous external abort (0x210) at 0xXXX"
> or
> "Unhandled prefetch abort: synchronous external abort (0x210) at 0xXXX"
> it is useful to know the content of ADFSR (Auxiliary Data Fault Status
> Register) to indicate an ECC double-bit error in L1 or L2 cache.
> 
> Refer to:
> Cortex-A15 Technical Reference Manual, Revision: r2p1
> [6.4.8. Error Correction Code]
> 
> Signed-off-by: Wladislav Wiebe <wladislav.wiebe@nokia.com>
> ---
>  arch/arm/mm/fault.c | 18 ++++++++++++++++++
>  1 file changed, 18 insertions(+)
> 
> diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
> index 3232afb6fdc0..5e240deb6ed6 100644
> --- a/arch/arm/mm/fault.c
> +++ b/arch/arm/mm/fault.c
> @@ -547,6 +547,22 @@ hook_fault_code(int nr, int (*fn)(unsigned long, unsigned int, struct pt_regs *)
>  	fsr_info[nr].name = name;
>  }
>  
> +/*
> + * Check for ECC double-bit errors in Auxiliary Data Fault Status Register
> + */
> +static void check_adfsr_for_ecc(void)
> +{
> +	u32 adfsr = 0;
> +
> +	asm("mrc p15, 0, %0, c5, c1, 0" : "=r" (adfsr));
> +
> +	if (adfsr & (BIT(31) | BIT(23))) {
> +		pr_alert("ADFSR status 0x%x indicates that an L1 or L2 cache\n"
> +			 "ECC double-bit error occurred at some time.\n",
> +			  adfsr);
> +	}
> +}
> +
>  /*
>   * Dispatch a data abort to the relevant handler.
>   */
> @@ -559,6 +575,7 @@ do_DataAbort(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
>  	if (!inf->fn(addr, fsr & ~FSR_LNX_PF, regs))
>  		return;
>  
> +	check_adfsr_for_ecc();
>  	pr_alert("Unhandled fault: %s (0x%03x) at 0x%08lx\n",
>  		inf->name, fsr, addr);
>  	show_pte(current->mm, addr);
> @@ -593,6 +610,7 @@ do_PrefetchAbort(unsigned long addr, unsigned int ifsr, struct pt_regs *regs)
>  	if (!inf->fn(addr, ifsr | FSR_LNX_PF, regs))
>  		return;
>  
> +	check_adfsr_for_ecc();
>  	pr_alert("Unhandled prefetch abort: %s (0x%03x) at 0x%08lx\n",
>  		inf->name, ifsr, addr);

IIUC at this point the task is preemptible (and interruptible), so I
believe this is too late to snapshot the ADFSR. The task could have been
migrated to a different core, with an irrelevant ADFSR, or a fault could
have occurred within an interrupt handler, etc.

Thanks,
Mark.


* Re: [PATCH] arm: mm: fault: check ADFSR in case of abort
  2018-10-29 15:54 ` Mark Rutland
@ 2018-10-29 16:43   ` Russell King - ARM Linux
  0 siblings, 0 replies; 6+ messages in thread
From: Russell King - ARM Linux @ 2018-10-29 16:43 UTC
  To: Mark Rutland
  Cc: Wiebe, Wladislav (Nokia - DE/Ulm),
	tony, akpm, ebiederm, jrdr.linux, linux-arm-kernel, linux-kernel

On Mon, Oct 29, 2018 at 03:54:36PM +0000, Mark Rutland wrote:
> On Mon, Oct 29, 2018 at 02:20:51PM +0000, Wiebe, Wladislav (Nokia - DE/Ulm) wrote:
> > When running into situations like:
> > "Unhandled fault: synchronous external abort (0x210) at 0xXXX"
> > or
> > "Unhandled prefetch abort: synchronous external abort (0x210) at 0xXXX"
> > it is useful to know the content of ADFSR (Auxiliary Data Fault Status
> > Register) to indicate an ECC double-bit error in L1 or L2 cache.
> > 
> > Refer to:
> > Cortex-A15 Technical Reference Manual, Revision: r2p1
> > [6.4.8. Error Correction Code]
> > 
> > Signed-off-by: Wladislav Wiebe <wladislav.wiebe@nokia.com>
> > ---
> >  arch/arm/mm/fault.c | 18 ++++++++++++++++++
> >  1 file changed, 18 insertions(+)
> > 
> > diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
> > index 3232afb6fdc0..5e240deb6ed6 100644
> > --- a/arch/arm/mm/fault.c
> > +++ b/arch/arm/mm/fault.c
> > @@ -547,6 +547,22 @@ hook_fault_code(int nr, int (*fn)(unsigned long, unsigned int, struct pt_regs *)
> >  	fsr_info[nr].name = name;
> >  }
> >  
> > +/*
> > + * Check for ECC double-bit errors in Auxiliary Data Fault Status Register
> > + */
> > +static void check_adfsr_for_ecc(void)
> > +{
> > +	u32 adfsr = 0;
> > +
> > +	asm("mrc p15, 0, %0, c5, c1, 0" : "=r" (adfsr));
> > +
> > +	if (adfsr & (BIT(31) | BIT(23))) {
> > +		pr_alert("ADFSR status 0x%x indicates that an L1 or L2 cache\n"
> > +			 "ECC double-bit error occurred at some time.\n",
> > +			  adfsr);
> > +	}
> > +}
> > +
> >  /*
> >   * Dispatch a data abort to the relevant handler.
> >   */
> > @@ -559,6 +575,7 @@ do_DataAbort(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
> >  	if (!inf->fn(addr, fsr & ~FSR_LNX_PF, regs))
> >  		return;
> >  
> > +	check_adfsr_for_ecc();
> >  	pr_alert("Unhandled fault: %s (0x%03x) at 0x%08lx\n",
> >  		inf->name, fsr, addr);
> >  	show_pte(current->mm, addr);
> > @@ -593,6 +610,7 @@ do_PrefetchAbort(unsigned long addr, unsigned int ifsr, struct pt_regs *regs)
> >  	if (!inf->fn(addr, ifsr | FSR_LNX_PF, regs))
> >  		return;
> >  
> > +	check_adfsr_for_ecc();
> >  	pr_alert("Unhandled prefetch abort: %s (0x%03x) at 0x%08lx\n",
> >  		inf->name, ifsr, addr);
> 
> IIUC at this point the task is preemptible (and interruptible),

It may be preemptible, but isn't necessarily so.  It depends on whether the
called FSR-specific function enabled interrupts or not.

So, it would be better to read the ADFSR before calling the FSR-specific
function to guarantee that we read the values that correspond to _this_
fault.
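
The ordering point can be illustrated with a small user-space model; this
is a sketch under stated assumptions, not kernel code. A plain variable
stands in for the per-CPU register, and the FSR-specific handler is assumed
to let the register content change, as it could after preemption or
migration. All names here (fake_adfsr, read_adfsr(), fsr_handler(),
abort_snapshot_first(), abort_snapshot_late()) are hypothetical.

```c
#include <stdint.h>

/* Stand-in for the per-CPU ADFSR; hypothetical, for illustration only. */
static uint32_t fake_adfsr;

static uint32_t read_adfsr(void)
{
	return fake_adfsr;
}

/*
 * Models an FSR-specific handler that enables interrupts: by the time it
 * returns, the register no longer describes the original fault.
 * Returns nonzero to signal that the fault was not handled.
 */
static int fsr_handler(void)
{
	fake_adfsr = 0;
	return 1;
}

/* Snapshot taken before the handler runs: value belongs to this fault. */
static uint32_t abort_snapshot_first(void)
{
	uint32_t adfsr = read_adfsr();

	if (!fsr_handler())
		return 0;

	return adfsr;
}

/* Snapshot taken after the handler runs, as in the posted patch. */
static uint32_t abort_snapshot_late(void)
{
	if (!fsr_handler())
		return 0;

	return read_adfsr();
}
```

With the register preset to a nonzero value, abort_snapshot_first()
returns that value, while abort_snapshot_late() returns whatever the
register holds after the handler has run.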


