* EDAC igen6 error messages at boot
@ 2022-10-21 16:08 Orion Poplawski
  2022-10-25  2:46 ` Zhuo, Qiuxu
  0 siblings, 1 reply; 8+ messages in thread
From: Orion Poplawski @ 2022-10-21 16:08 UTC (permalink / raw)
  To: linux-edac

[-- Attachment #1: Type: text/plain, Size: 1314 bytes --]

We have a Dell XPS 15 9520 running 5.15.0-52-generic (Ubuntu 20.04) and get
the following at boot:

[    0.981641] EDAC MC: Ver: 3.0.0
[   31.801126] caller igen6_probe+0x176/0x7b0 [igen6_edac] mapping multiple BARs
[   31.805272] EDAC MC0: Giving out device to module igen6_edac controller
Intel_client_SoC MC#0: DEV 0000:00:00.0 (INTERRUPT)
[   31.810599] EDAC MC1: Giving out device to module igen6_edac controller
Intel_client_SoC MC#1: DEV 0000:00:00.0 (INTERRUPT)
[   31.810616] EDAC igen6 MC1: HANDLING IBECC MEMORY ERROR
[   31.810617] EDAC igen6 MC1: ADDR 0x7fffffffe0
[   31.810619] EDAC igen6 MC0: HANDLING IBECC MEMORY ERROR
[   31.810620] EDAC igen6 MC0: ADDR 0x7fffffffe0
[   31.811957] EDAC igen6: v2.5

logwatch triggers on the ERROR and reports them.

However, from some searching around, this seems to be fairly common, so I'm
guessing they are spurious.  Unfortunately, the messages look similar to what
you would see with an actual memory error, so I don't want to ignore them
completely.

Could anyone shed some more light here?

Thanks!

-- 
Orion Poplawski
IT Systems Manager                         720-772-5637
NWRA, Boulder/CoRA Office             FAX: 303-415-9702
3380 Mitchell Lane                       orion@nwra.com
Boulder, CO 80301                 https://www.nwra.com/


* RE: EDAC igen6 error messages at boot
  2022-10-21 16:08 EDAC igen6 error messages at boot Orion Poplawski
@ 2022-10-25  2:46 ` Zhuo, Qiuxu
  2022-10-25  2:58   ` Orion Poplawski
  0 siblings, 1 reply; 8+ messages in thread
From: Zhuo, Qiuxu @ 2022-10-25  2:46 UTC (permalink / raw)
  To: Orion Poplawski, linux-edac



> From: Orion Poplawski <orion@nwra.com>
> Sent: Saturday, October 22, 2022 12:08 AM
> To: linux-edac@vger.kernel.org
> Subject: EDAC igen6 error messages at boot
> 
> We have a Dell XPS 15 9520 running 5.15.0-52-generic (Ubuntu 20.04) and get
> the following at boot:
> 
> [    0.981641] EDAC MC: Ver: 3.0.0
> [   31.801126] caller igen6_probe+0x176/0x7b0 [igen6_edac] mapping multiple
> BARs
> [   31.805272] EDAC MC0: Giving out device to module igen6_edac controller
> Intel_client_SoC MC#0: DEV 0000:00:00.0 (INTERRUPT)
> [   31.810599] EDAC MC1: Giving out device to module igen6_edac controller
> Intel_client_SoC MC#1: DEV 0000:00:00.0 (INTERRUPT)
> [   31.810616] EDAC igen6 MC1: HANDLING IBECC MEMORY ERROR
> [   31.810617] EDAC igen6 MC1: ADDR 0x7fffffffe0
> [   31.810619] EDAC igen6 MC0: HANDLING IBECC MEMORY ERROR
> [   31.810620] EDAC igen6 MC0: ADDR 0x7fffffffe0

Did you still see the error log after you re-boot the machine?

> [   31.811957] EDAC igen6: v2.5
> 
> logwatch triggers on the ERROR and reports them.
> 
> However, from some searching around this seems to be fairly common, so I'm
> guessing they are somewhat spurious.  Unfortunately the messages seem to be
> similar to what you would see with an actual memory error so don't want to
> completely ignore them.
> 
> Could anyone shed some more light here?
> 
> Thanks!
> 



* Re: EDAC igen6 error messages at boot
  2022-10-25  2:46 ` Zhuo, Qiuxu
@ 2022-10-25  2:58   ` Orion Poplawski
  2022-10-25  3:13     ` Zhuo, Qiuxu
  0 siblings, 1 reply; 8+ messages in thread
From: Orion Poplawski @ 2022-10-25  2:58 UTC (permalink / raw)
  To: Zhuo, Qiuxu, linux-edac

[-- Attachment #1: Type: text/plain, Size: 1874 bytes --]

On 10/24/22 20:46, Zhuo, Qiuxu wrote:
> 
> 
>> From: Orion Poplawski <orion@nwra.com>
>> Sent: Saturday, October 22, 2022 12:08 AM
>> To: linux-edac@vger.kernel.org
>> Subject: EDAC igen6 error messages at boot
>>
>> We have a Dell XPS 15 9520 running 5.15.0-52-generic (Ubuntu 20.04) and get
>> the following at boot:
>>
>> [    0.981641] EDAC MC: Ver: 3.0.0
>> [   31.801126] caller igen6_probe+0x176/0x7b0 [igen6_edac] mapping multiple
>> BARs
>> [   31.805272] EDAC MC0: Giving out device to module igen6_edac controller
>> Intel_client_SoC MC#0: DEV 0000:00:00.0 (INTERRUPT)
>> [   31.810599] EDAC MC1: Giving out device to module igen6_edac controller
>> Intel_client_SoC MC#1: DEV 0000:00:00.0 (INTERRUPT)
>> [   31.810616] EDAC igen6 MC1: HANDLING IBECC MEMORY ERROR
>> [   31.810617] EDAC igen6 MC1: ADDR 0x7fffffffe0
>> [   31.810619] EDAC igen6 MC0: HANDLING IBECC MEMORY ERROR
>> [   31.810620] EDAC igen6 MC0: ADDR 0x7fffffffe0
> 
> Did you still see the error log after you re-boot the machine?

Not quite sure what you mean.  I see it on every boot in the logs.  Are you
interested in the difference between a reboot and a power cycle?


>> [   31.811957] EDAC igen6: v2.5
>>
>> logwatch triggers on the ERROR and reports them.
>>
>> However, from some searching around this seems to be fairly common, so I'm
>> guessing they are somewhat spurious.  Unfortunately the messages seem to be
>> similar to what you would see with an actual memory error so don't want to
>> completely ignore them.
>>
>> Could anyone shed some more light here?
>>
>> Thanks!
>>
> 


-- 
Orion Poplawski
he/him/his  - surely the least important thing about me
IT Systems Manager                         720-772-5637
NWRA, Boulder/CoRA Office             FAX: 303-415-9702
3380 Mitchell Lane                       orion@nwra.com
Boulder, CO 80301                 https://www.nwra.com/


* RE: EDAC igen6 error messages at boot
  2022-10-25  2:58   ` Orion Poplawski
@ 2022-10-25  3:13     ` Zhuo, Qiuxu
  2022-10-25 15:28       ` Orion Poplawski
  0 siblings, 1 reply; 8+ messages in thread
From: Zhuo, Qiuxu @ 2022-10-25  3:13 UTC (permalink / raw)
  To: Orion Poplawski, linux-edac

> From: Orion Poplawski <orion@nwra.com>
> Sent: Tuesday, October 25, 2022 10:59 AM
> ...
> >> [    0.981641] EDAC MC: Ver: 3.0.0
> >> [   31.801126] caller igen6_probe+0x176/0x7b0 [igen6_edac] mapping
> multiple
> >> BARs
> >> [   31.805272] EDAC MC0: Giving out device to module igen6_edac controller
> >> Intel_client_SoC MC#0: DEV 0000:00:00.0 (INTERRUPT)
> >> [   31.810599] EDAC MC1: Giving out device to module igen6_edac controller
> >> Intel_client_SoC MC#1: DEV 0000:00:00.0 (INTERRUPT)
> >> [   31.810616] EDAC igen6 MC1: HANDLING IBECC MEMORY ERROR
> >> [   31.810617] EDAC igen6 MC1: ADDR 0x7fffffffe0
> >> [   31.810619] EDAC igen6 MC0: HANDLING IBECC MEMORY ERROR
> >> [   31.810620] EDAC igen6 MC0: ADDR 0x7fffffffe0
> >
> > Did you still see the error log after you re-boot the machine?
> 
> Not quite sure what you mean.  I see it on every boot in the logs.  Are you
> interested in the difference between a reboot and a power cycle?

Yes, can you try a power cycle on the machine and check whether the errors still occur in the log?
Thanks!


* Re: EDAC igen6 error messages at boot
  2022-10-25  3:13     ` Zhuo, Qiuxu
@ 2022-10-25 15:28       ` Orion Poplawski
  2022-10-26  5:58         ` Zhuo, Qiuxu
  0 siblings, 1 reply; 8+ messages in thread
From: Orion Poplawski @ 2022-10-25 15:28 UTC (permalink / raw)
  To: Zhuo, Qiuxu, linux-edac

[-- Attachment #1: Type: text/plain, Size: 1930 bytes --]

On 10/24/22 21:13, Zhuo, Qiuxu wrote:
>> From: Orion Poplawski <orion@nwra.com>
>> Sent: Tuesday, October 25, 2022 10:59 AM
>> ...
>>>> [    0.981641] EDAC MC: Ver: 3.0.0
>>>> [   31.801126] caller igen6_probe+0x176/0x7b0 [igen6_edac] mapping
>> multiple
>>>> BARs
>>>> [   31.805272] EDAC MC0: Giving out device to module igen6_edac controller
>>>> Intel_client_SoC MC#0: DEV 0000:00:00.0 (INTERRUPT)
>>>> [   31.810599] EDAC MC1: Giving out device to module igen6_edac controller
>>>> Intel_client_SoC MC#1: DEV 0000:00:00.0 (INTERRUPT)
>>>> [   31.810616] EDAC igen6 MC1: HANDLING IBECC MEMORY ERROR
>>>> [   31.810617] EDAC igen6 MC1: ADDR 0x7fffffffe0
>>>> [   31.810619] EDAC igen6 MC0: HANDLING IBECC MEMORY ERROR
>>>> [   31.810620] EDAC igen6 MC0: ADDR 0x7fffffffe0
>>>
>>> Did you still see the error log after you re-boot the machine?
>>
>> Not quite sure what you mean.  I see it on every boot in the logs.  Are you
>> interested in the difference between a reboot and a power cycle?
> 
> Yes, can you try a power cycle on the machine and check whether the errors still occur in the log?
> Thanks!

Still happens.  Seems to happen all the time.

[    0.975306] EDAC MC: Ver: 3.0.0
[   17.052613] EDAC MC0: Giving out device to module igen6_edac controller
Intel_client_SoC MC#0: DEV 0000:00:00.0 (INTERRUPT)
[   17.054293] EDAC MC1: Giving out device to module igen6_edac controller
Intel_client_SoC MC#1: DEV 0000:00:00.0 (INTERRUPT)
[   17.054311] EDAC igen6 MC1: HANDLING IBECC MEMORY ERROR
[   17.054312] EDAC igen6 MC1: ADDR 0x7fffffffe0
[   17.054313] EDAC igen6 MC0: HANDLING IBECC MEMORY ERROR
[   17.054314] EDAC igen6 MC0: ADDR 0x7fffffffe0
[   17.056192] EDAC igen6: v2.5

-- 
Orion Poplawski
IT Systems Manager                         720-772-5637
NWRA, Boulder/CoRA Office             FAX: 303-415-9702
3380 Mitchell Lane                       orion@nwra.com
Boulder, CO 80301                 https://www.nwra.com/



* RE: EDAC igen6 error messages at boot
  2022-10-25 15:28       ` Orion Poplawski
@ 2022-10-26  5:58         ` Zhuo, Qiuxu
  2022-11-02 14:36           ` Orion Poplawski
  0 siblings, 1 reply; 8+ messages in thread
From: Zhuo, Qiuxu @ 2022-10-26  5:58 UTC (permalink / raw)
  To: Orion Poplawski, linux-edac

[-- Attachment #1: Type: text/plain, Size: 633 bytes --]

> From: Orion Poplawski <orion@nwra.com>
>> ...
> >
> > Yes, can you try a power cycle on the machine and check whether the
> errors still occur in the log?
> > Thanks!
> 
> Still happens.  Seems to happen all the time.

Thanks for the feedback.

It looks like the In-band ECC error log register was initialized with the
value ~0ULL, which resulted in the spurious errors.
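
As a rough illustration of why an all-ones register would produce exactly the
address seen in the log, here is a minimal user-space sketch. It assumes the
error address occupies bits 5 through 38 of the 64-bit error log value; that
layout is an assumption made for this example (it is consistent with the
reported ADDR 0x7fffffffe0), and the helper names get_bitfield(),
ecclog_addr() and ecclog_should_skip() are hypothetical, not the driver's own.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout: the error address sits in bits 5..38 of the error log. */
#define ADDR_LO_BIT 5
#define ADDR_HI_BIT 38

/* Extract bits lo..hi (inclusive) of v, shifted down to bit 0. */
static uint64_t get_bitfield(uint64_t v, unsigned int lo, unsigned int hi)
{
	unsigned int width = hi - lo + 1;
	uint64_t mask = (width == 64) ? ~0ULL : ((1ULL << width) - 1);

	return (v >> lo) & mask;
}

/* Rebuild the reported address from the assumed address field. */
static uint64_t ecclog_addr(uint64_t ecclog)
{
	return get_bitfield(ecclog, ADDR_LO_BIT, ADDR_HI_BIT) << ADDR_LO_BIT;
}

/*
 * Skip a log value that carries no real error: zero means nothing was
 * logged, and all-ones is treated as an uninitialized register.
 */
static bool ecclog_should_skip(uint64_t ecclog)
{
	return ecclog == 0 || ecclog == ~0ULL;
}
```

Under that assumption, ecclog_addr(~0ULL) evaluates to 0x7fffffffe0, which
matches the address in both MC0 and MC1 messages, and ecclog_should_skip()
returns true for that value, so it would never be queued for reporting.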

I don't have such a machine for debugging. Could you please try the attached patch to see
whether it fixes the spurious errors on driver load? If possible, please also enable the
"CONFIG_EDAC_DEBUG=y" kernel configuration option for more EDAC debug logs.

Thanks!
-Qiuxu 

[-- Attachment #2: 0001-EDAC-igen6-Fix-spurious-errors-on-driver-load.patch --]
[-- Type: application/octet-stream, Size: 1228 bytes --]

From 6abdd39b626299ac59d89ee35b4d17686d81f62d Mon Sep 17 00:00:00 2001
From: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Date: Wed, 26 Oct 2022 13:34:26 +0800
Subject: [PATCH 1/1] EDAC/igen6: Fix spurious errors on driver load

The In-band ECC error log register may be initialized with the value
~0ULL that results in reporting spurious errors on driver load as below.

  EDAC igen6 MC1: HANDLING IBECC MEMORY ERROR
  EDAC igen6 MC1: ADDR 0x7fffffffe0
  EDAC igen6 MC0: HANDLING IBECC MEMORY ERROR
  EDAC igen6 MC0: ADDR 0x7fffffffe0
  EDAC igen6: v2.5

Fix these spurious errors by filtering them out.

Reported-by: Orion Poplawski <orion@nwra.com>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
---
 drivers/edac/igen6_edac.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/edac/igen6_edac.c b/drivers/edac/igen6_edac.c
index a07bbfd075d0..d700144957da 100644
--- a/drivers/edac/igen6_edac.c
+++ b/drivers/edac/igen6_edac.c
@@ -653,7 +653,7 @@ static int ecclog_handler(void)
 		/* errsts_clear() isn't NMI-safe. Delay it in the IRQ context */
 
 		ecclog = ecclog_read_and_clear(imc);
-		if (!ecclog)
+		if (!ecclog || ecclog == ~0ULL)
 			continue;
 
 		if (!ecclog_gen_pool_add(i, ecclog))
-- 
2.17.1



* Re: EDAC igen6 error messages at boot
  2022-10-26  5:58         ` Zhuo, Qiuxu
@ 2022-11-02 14:36           ` Orion Poplawski
  2022-11-04  5:07             ` Zhuo, Qiuxu
  0 siblings, 1 reply; 8+ messages in thread
From: Orion Poplawski @ 2022-11-02 14:36 UTC (permalink / raw)
  To: Zhuo, Qiuxu, linux-edac

[-- Attachment #1: Type: text/plain, Size: 1562 bytes --]

On 10/25/22 23:58, Zhuo, Qiuxu wrote:
>> From: Orion Poplawski <orion@nwra.com>
>>> ...
>>>
>>> Yes, can you try a power cycle on the machine and check whether the
>> errors still occur in the log?
>>> Thanks!
>>
>> Still happens.  Seems to happen all the time.
> 
> Thanks for the feedback.
> 
> It looks like the In-band ECC error log register was initialized with the
> value ~0ULL, which resulted in the spurious errors.
> 
> I don't have such a machine for debugging. Could you please try the attached patch to see
> whether it fixes the spurious errors on driver load? If possible, please also enable the
> "CONFIG_EDAC_DEBUG=y" kernel configuration option for more EDAC debug logs.
> 
> Thanks!
> -Qiuxu

I can confirm that the patch removes the errors from the log:

# dmesg | grep -Fi edac
[    0.979497] EDAC MC: Ver: 3.0.0
[  417.538823] caller igen6_probe+0x176/0x7b0 [igen6_edac] mapping 
multiple BARs
[  417.538876] EDAC MC0: Giving out device to module igen6_edac 
controller Intel_client_SoC MC#0: DEV 0000:00:00.0 (INTERRUPT)
[  417.541597] EDAC MC1: Giving out device to module igen6_edac 
controller Intel_client_SoC MC#1: DEV 0000:00:00.0 (INTERRUPT)
[  417.542186] EDAC igen6: v2.5

However, I forgot to enable the CONFIG_EDAC_DEBUG option - do you still 
need that output?

-- 
Orion Poplawski
he/him/his  - surely the least important thing about me
IT Systems Manager                         720-772-5637
NWRA, Boulder/CoRA Office             FAX: 303-415-9702
3380 Mitchell Lane                       orion@nwra.com
Boulder, CO 80301                 https://www.nwra.com/


* RE: EDAC igen6 error messages at boot
  2022-11-02 14:36           ` Orion Poplawski
@ 2022-11-04  5:07             ` Zhuo, Qiuxu
  0 siblings, 0 replies; 8+ messages in thread
From: Zhuo, Qiuxu @ 2022-11-04  5:07 UTC (permalink / raw)
  To: Orion Poplawski, linux-edac

> From: Orion Poplawski <orion@nwra.com>
> ...
> 
> I can confirm that the patch removes the errors from the log:

Thanks for the confirmation.

> # dmesg | grep -Fi edac
> [    0.979497] EDAC MC: Ver: 3.0.0
> [  417.538823] caller igen6_probe+0x176/0x7b0 [igen6_edac] mapping
> multiple BARs
> [  417.538876] EDAC MC0: Giving out device to module igen6_edac
> controller Intel_client_SoC MC#0: DEV 0000:00:00.0 (INTERRUPT)
> [  417.541597] EDAC MC1: Giving out device to module igen6_edac
> controller Intel_client_SoC MC#1: DEV 0000:00:00.0 (INTERRUPT)
> [  417.542186] EDAC igen6: v2.5
> 
> However, I forgot to enable the CONFIG_EDAC_DEBUG option - do you still
> need that output?

No. 
Your confirmation above was enough. 😊

-Qiuxu
