* Re: Bug in EDAC error handling of Marvell Armada XP
       [not found] <AM8PR07MB8172B5AB7AD5DAE78CA4624CFEAA9@AM8PR07MB8172.eurprd07.prod.outlook.com>
@ 2021-09-30 14:25 ` Jan Lübbe
  2021-10-06  9:40   ` Potsch, Hans (Nokia - DE/Stuttgart)
  0 siblings, 1 reply; 4+ messages in thread
From: Jan Lübbe @ 2021-09-30 14:25 UTC (permalink / raw)
  To: Potsch, Hans (Nokia - DE/Stuttgart), linux-edac
  Cc: Glock, Harald (Nokia - DE/Stuttgart)

Hi Hans,

On Thu, 2021-09-30 at 12:55 +0000, Potsch, Hans (Nokia - DE/Stuttgart) wrote:
> Hi Jan, all,
>  
> We recently noticed strange traces when our system experiences ECC
> errors.
>  
> When the system detects ECC errors, it sometimes reports the same number of CE
> and UE (correctable and uncorrectable errors), and sometimes even 0 UE.
> Please see these examples:
>  
> EDAC MC0: 3 CE marvell,armada-xp-sdram-controller on any memory ( page:0x0
> offset:0x0 grain:8 syndrome:0x0 - details unavailable (multiple errors))
> EDAC MC0: 3 UE marvell,armada-xp-sdram-controller on any memory ( page:0x0
> offset:0x0 grain:8 - details unavailable (multiple errors))
>  
> EDAC MC0: 0 UE marvell,armada-xp-sdram-controller on any memory ( page:0x0
> offset:0x0 grain:8 - details unavailable (multiple errors))
>  
> While looking at the code, we noticed that drivers/edac/armada_xp_edac.c
> probably passes the wrong parameter to the edac_mc_handle_error()
> function.
>
> The following git diff should pass the correct parameter:
> diff --git a/drivers/edac/armada_xp_edac.c b/drivers/edac/armada_xp_edac.c
> index e3e757513d1b..b1f46a974b9e 100644
> --- a/drivers/edac/armada_xp_edac.c
> +++ b/drivers/edac/armada_xp_edac.c
> @@ -178,7 +178,7 @@ static void axp_mc_check(struct mem_ctl_info *mci)
>                                      "details unavailable (multiple errors)");
>         if (cnt_dbe)
>                 edac_mc_handle_error(HW_EVENT_ERR_UNCORRECTED, mci,
> -                                    cnt_sbe, /* error count */
> +                                    cnt_dbe, /* error count */
>                                      0, 0, 0, /* pfn, offset, syndrome */
>                                      -1, -1, -1, /* top, mid, low layer */
>                                      mci->ctl_name,

Yes, this is a bug and your fix looks correct.
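
Both symptoms in the log above follow from the swapped counter. Here is a minimal, self-contained sketch of the reporting step, assuming the correctable branch mirrors the uncorrectable one. The names `handle_error`, `report_counts` and `struct report_log` are hypothetical stand-ins for illustration, not the kernel EDAC API:

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-ins for illustration only; not the kernel EDAC API. */
enum event_type { EVENT_CORRECTED, EVENT_UNCORRECTED };

struct report_log {
	unsigned int ce_reported; /* sum of counts logged as CE */
	unsigned int ue_reported; /* sum of counts logged as UE */
	unsigned int ue_lines;    /* number of UE log lines emitted */
};

/* Plays the role of edac_mc_handle_error(): log one line, record the count. */
static void handle_error(struct report_log *log, enum event_type type,
			 unsigned int count)
{
	assert(log != NULL);
	if (type == EVENT_CORRECTED) {
		log->ce_reported += count;
	} else {
		log->ue_reported += count;
		log->ue_lines++;
	}
	printf("MC0: %u %s\n", count, type == EVENT_CORRECTED ? "CE" : "UE");
}

/*
 * Models the reporting step: one call per non-zero counter.
 * use_fix == 0 reproduces the bug (the UE line gets cnt_sbe),
 * use_fix != 0 applies the one-line fix (the UE line gets cnt_dbe).
 */
static void report_counts(struct report_log *log, unsigned int cnt_sbe,
			  unsigned int cnt_dbe, int use_fix)
{
	if (cnt_sbe)
		handle_error(log, EVENT_CORRECTED, cnt_sbe);
	if (cnt_dbe)
		handle_error(log, EVENT_UNCORRECTED,
			     use_fix ? cnt_dbe : cnt_sbe);
}
```

With the bug, report_counts(&log, 3, 1, 0) logs a UE count of 3 (the CE count), and report_counts(&log, 0, 2, 0) logs a UE line with a count of 0. With use_fix set, the UE lines carry 1 and 2 respectively.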

> Our system currently runs kernel version 5.10.49.
> Our investigation showed that the latest RC (v5.15-rc3) still has this bug;
> see drivers/edac/armada_xp_edac.c in the Bootlin Linux source browser.

Were you able to verify that your change fixes the issue? If so, I'd ack a
properly formatted patch submission.

I don't have access to the hardware at the moment, so I can't easily test it
myself.

Thanks,
Jan
-- 
Pengutronix e.K.                           |                             |
Steuerwalder Str. 21                       | http://www.pengutronix.de/  |
31137 Hildesheim, Germany                  | Phone: +49-5121-206917-0    |
Amtsgericht Hildesheim, HRA 2686           | Fax:   +49-5121-206917-5555 |

* RE: Bug in EDAC error handling of Marvell Armada XP
  2021-09-30 14:25 ` Bug in EDAC error handling of Marvell Armada XP Jan Lübbe
@ 2021-10-06  9:40   ` Potsch, Hans (Nokia - DE/Stuttgart)
  2021-10-06 12:13     ` [PATCH] EDAC/armada-xp: Fix output of uncorrectable error counter Hans Potsch
  0 siblings, 1 reply; 4+ messages in thread
From: Potsch, Hans (Nokia - DE/Stuttgart) @ 2021-10-06  9:40 UTC (permalink / raw)
  To: Jan Lübbe, linux-edac; +Cc: Glock, Harald (Nokia - DE/Stuttgart)

Hey Jan,

We tested the patch and can confirm that it has no side effects.
Unfortunately, we could not reproduce the error trace because we do not have a test bench that triggers these errors.

I will send the patch in a separate mail.

Is this ok with you?

Best Regards,
Hans


* [PATCH] EDAC/armada-xp: Fix output of uncorrectable error counter
  2021-10-06  9:40   ` Potsch, Hans (Nokia - DE/Stuttgart)
@ 2021-10-06 12:13     ` Hans Potsch
  2021-10-14  9:56       ` Borislav Petkov
  0 siblings, 1 reply; 4+ messages in thread
From: Hans Potsch @ 2021-10-06 12:13 UTC (permalink / raw)
  To: jlu, linux-edac; +Cc: harald.glock, Hans Potsch

The wrong parameter is passed to edac_mc_handle_error() when reporting
uncorrectable errors: cnt_sbe instead of cnt_dbe. As a result, the number
of correctable errors is displayed as the number of uncorrectable errors.
Pass cnt_dbe instead.

Signed-off-by: Hans Potsch <hans.potsch@nokia.com>
---
 drivers/edac/armada_xp_edac.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/edac/armada_xp_edac.c b/drivers/edac/armada_xp_edac.c
index e3e757513d1b..b1f46a974b9e 100644
--- a/drivers/edac/armada_xp_edac.c
+++ b/drivers/edac/armada_xp_edac.c
@@ -178,7 +178,7 @@ static void axp_mc_check(struct mem_ctl_info *mci)
 				     "details unavailable (multiple errors)");
 	if (cnt_dbe)
 		edac_mc_handle_error(HW_EVENT_ERR_UNCORRECTED, mci,
-				     cnt_sbe, /* error count */
+				     cnt_dbe, /* error count */
 				     0, 0, 0, /* pfn, offset, syndrome */
 				     -1, -1, -1, /* top, mid, low layer */
 				     mci->ctl_name,
-- 
2.32.0



* Re: [PATCH] EDAC/armada-xp: Fix output of uncorrectable error counter
  2021-10-06 12:13     ` [PATCH] EDAC/armada-xp: Fix output of uncorrectable error counter Hans Potsch
@ 2021-10-14  9:56       ` Borislav Petkov
  0 siblings, 0 replies; 4+ messages in thread
From: Borislav Petkov @ 2021-10-14  9:56 UTC (permalink / raw)
  To: Hans Potsch; +Cc: jlu, linux-edac, harald.glock

On Wed, Oct 06, 2021 at 02:13:32PM +0200, Hans Potsch wrote:
> The wrong parameter is passed to edac_mc_handle_error() when reporting
> uncorrectable errors: cnt_sbe instead of cnt_dbe. As a result, the number
> of correctable errors is displayed as the number of uncorrectable errors.
> Pass cnt_dbe instead.
> 
> Signed-off-by: Hans Potsch <hans.potsch@nokia.com>
> ---
>  drivers/edac/armada_xp_edac.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/edac/armada_xp_edac.c b/drivers/edac/armada_xp_edac.c
> index e3e757513d1b..b1f46a974b9e 100644
> --- a/drivers/edac/armada_xp_edac.c
> +++ b/drivers/edac/armada_xp_edac.c
> @@ -178,7 +178,7 @@ static void axp_mc_check(struct mem_ctl_info *mci)
>  				     "details unavailable (multiple errors)");
>  	if (cnt_dbe)
>  		edac_mc_handle_error(HW_EVENT_ERR_UNCORRECTED, mci,
> -				     cnt_sbe, /* error count */
> +				     cnt_dbe, /* error count */
>  				     0, 0, 0, /* pfn, offset, syndrome */
>  				     -1, -1, -1, /* top, mid, low layer */
>  				     mci->ctl_name,
> -- 

Applied, thanks.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette
