linux-edac.vger.kernel.org archive mirror
* Qumulo: a question about UECC detection from the ie31200_edac ko
@ 2020-02-04  1:25 Dan Pehush
  2020-02-05 18:25 ` Tony Luck
  0 siblings, 1 reply; 2+ messages in thread
From: Dan Pehush @ 2020-02-04  1:25 UTC
  To: linux-edac

Hi All,

   My name is Daniel Pehush, and I work on the hardware team at an
enterprise data storage company called Qumulo Inc. We want our server
systems' kernel to panic when a UECC (uncorrectable ECC) error occurs.
A UECC should be handled as an interrupt rather than found by polling.
We were working with Intel to resolve this, and they directed us to ask
the developers of this kernel module for guidance. Our current
configuration is the following:

OS: Ubuntu 18.04, Linux du108-r2145-3 4.4.0-142-generic #168 SMP Wed
Jul 24 18:19:09 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Motherboard: Intel S1200SPL
CPUs: Intel(R) Xeon(R) CPU E3-1230 v6 @ 3.50GHz or Intel(R) Xeon(R)
CPU E3-1270 v5 @ 3.60GHz

We see that the following kernel modules are loaded. As far as we
understand, though, there is no way to get this ko to operate in
interrupt mode instead of polling, and interrupt mode is what we want.
We are open to kernel patches or to moving to a later kernel version to
get this critical EDAC feature working on systems that use the
following:

root@du108-r2145-3:~# modinfo ie31200_edac
filename:       /lib/modules/4.4.0-142-generic/kernel/drivers/edac/ie31200_edac.ko
description:    MC support for Intel Processor E31200 memory hub controllers
author:         Jason Baron <jbaron@akamai.com>
license:        GPL
srcversion:     340329DA0015F03633253D0
alias:          pci:v00008086d00005918sv*sd*bc*sc*i*
alias:          pci:v00008086d00001918sv*sd*bc*sc*i*
alias:          pci:v00008086d00000C08sv*sd*bc*sc*i*
alias:          pci:v00008086d00000C04sv*sd*bc*sc*i*
alias:          pci:v00008086d0000015Csv*sd*bc*sc*i*
alias:          pci:v00008086d00000158sv*sd*bc*sc*i*
alias:          pci:v00008086d00000150sv*sd*bc*sc*i*
alias:          pci:v00008086d0000010Csv*sd*bc*sc*i*
alias:          pci:v00008086d00000108sv*sd*bc*sc*i*
depends:        edac_core
retpoline:      Y
intree:         Y
vermagic:       4.4.0-142-generic SMP mod_unload modversions retpoline
signat:         PKCS#7
signer:
sig_key:
sig_hashalgo:   md4
root@du108-r2145-3:~# modinfo edac_core
filename:       /lib/modules/4.4.0-142-generic/kernel/drivers/edac/edac_core.ko
description:    Core library routines for EDAC reporting
author:         Doug Thompson www.softwarebitmaker.com, et al
license:        GPL
srcversion:     60FF3CE149817D76BF414C7
depends:
retpoline:      Y
intree:         Y
vermagic:       4.4.0-142-generic SMP mod_unload modversions retpoline
signat:         PKCS#7
signer:
sig_key:
sig_hashalgo:   md4
parm:           check_pci_errors:Check for PCI bus parity errors: 0=off 1=on (int)
parm:           edac_pci_panic_on_pe:Panic on PCI Bus Parity error: 0=off 1=on (int)
parm:           edac_mc_panic_on_ue:Panic on uncorrected error: 0=off 1=on (int)
parm:           edac_mc_log_ue:Log uncorrectable error to console: 0=off 1=on (int)
parm:           edac_mc_log_ce:Log correctable error to console: 0=off 1=on (int)
parm:           edac_mc_poll_msec:Polling period in milliseconds
root@du108-r2145-3:~# uname -ra
Linux du108-r2145-3 4.4.0-142-generic #168 SMP Wed Jul 24 18:19:09 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
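
The parm: lines above include edac_mc_panic_on_ue and edac_mc_poll_msec,
which look like the closest existing knobs while the driver stays in
polling mode. A minimal Python sketch for inspecting and setting them is
given here. It assumes the parameters are exposed under
/sys/module/edac_core/parameters, the usual sysfs location for module
parameters, which has not been confirmed on this system. The helper
names (param_path, read_param, write_param) are hypothetical.

```python
from pathlib import Path

# Assumed sysfs location of the edac_core module parameters.
PARAM_DIR = Path("/sys/module/edac_core/parameters")


def param_path(name, base=PARAM_DIR):
    """Build the sysfs path for one module parameter."""
    return Path(base) / name


def read_param(name, base=PARAM_DIR):
    """Return the parameter value as a string, or None if it cannot be read."""
    try:
        return param_path(name, base).read_text().strip()
    except OSError:
        return None


def write_param(name, value, base=PARAM_DIR):
    """Write a new value; on a real system this needs root."""
    param_path(name, base).write_text(f"{value}\n")
```

Run as root, write_param("edac_mc_panic_on_ue", 1) would request a panic
when the EDAC core is told about an uncorrected error. As I understand
it, in polling mode that only happens at the next poll, so the delay is
bounded by edac_mc_poll_msec rather than removed.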

For example, I can boot kernel 4.15 and see that the kernel module is
loaded as shown below, but I am unsure whether the driver is in
interrupt mode and able to react when a UECC error occurs.
root@qkiosk:~# modinfo ie31200_edac
filename:       /lib/modules/4.15.0-46-generic/kernel/drivers/edac/ie31200_edac.ko
description:    MC support for Intel Processor E31200 memory hub controllers
author:         Jason Baron <jbaron@akamai.com>
license:        GPL
srcversion:     39D6D5F1A63B6CF65CF5F51
alias:          pci:v00008086d00005918sv*sd*bc*sc*i*
alias:          pci:v00008086d00001918sv*sd*bc*sc*i*
alias:          pci:v00008086d00000C08sv*sd*bc*sc*i*
alias:          pci:v00008086d00000C04sv*sd*bc*sc*i*
alias:          pci:v00008086d0000015Csv*sd*bc*sc*i*
alias:          pci:v00008086d00000158sv*sd*bc*sc*i*
alias:          pci:v00008086d00000150sv*sd*bc*sc*i*
alias:          pci:v00008086d0000010Csv*sd*bc*sc*i*
alias:          pci:v00008086d00000108sv*sd*bc*sc*i*
depends:
retpoline:      Y
intree:         Y
name:           ie31200_edac
vermagic:       4.15.0-46-generic SMP mod_unload
signat:         PKCS#7
signer:
sig_key:
sig_hashalgo:   md4

Respectfully,
   Dan P.


* Re: Qumulo: a question about UECC detection from the ie31200_edac ko
  2020-02-04  1:25 Qumulo: a question about UECC detection from the ie31200_edac ko Dan Pehush
@ 2020-02-05 18:25 ` Tony Luck
  0 siblings, 0 replies; 2+ messages in thread
From: Tony Luck @ 2020-02-05 18:25 UTC
  To: Dan Pehush; +Cc: Linux Edac Mailing List

On Mon, Feb 3, 2020 at 5:27 PM Dan Pehush <dpehush@qumulo.com> wrote:
>
> Hi All,
>
>    My name is Daniel Pehush, and I work on the hardware team at an
> enterprise data storage company called Qumulo Inc. We want our server
> systems' kernel to panic when a UECC (uncorrectable ECC) error occurs.
> A UECC should be handled as an interrupt rather than found by polling.
> We were working with Intel to resolve this, and they directed us to ask
> the developers of this kernel module for guidance. Our current
> configuration is the following:

I haven't done much with the E3 systems.  Do you know if you
get CMCI interrupts for corrected errors?  If you do, then it is
likely that you'd get a CMCI for an uncorrected error too.
[Worst acronym ever ... Corrected Machine Check Interrupt, yet it can
happen for uncorrected errors. It is totally separate from the "Machine
Check" exception, INT#18].

Clues to check:
1) Is MCG_CAP (MSR 0x179) bit 10 (MCG_CMCI_P) set?
2) If so, use rdmsr(8) to look at each MCi_CTL2 (0x280, 0x281, ... 0x280+nbanks-1)
to see if bit 30 (CMCI_EN) is set.
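
The two checks above could be scripted roughly as follows. This is a
sketch, not a tested tool: it assumes the msr driver is loaded so that
/dev/cpu/N/msr exists, that it runs as root, and that the bank count is
taken from MCG_CAP bits 7:0. The function names are hypothetical.

```python
import os
import struct

MSR_IA32_MCG_CAP = 0x179   # global machine-check capabilities
MSR_IA32_MC0_CTL2 = 0x280  # MCi_CTL2 for bank i lives at 0x280 + i
MCG_BANKCNT_MASK = 0xFF    # MCG_CAP bits 7:0: number of banks
MCG_CMCI_P = 1 << 10       # MCG_CAP bit 10: CMCI supported
CMCI_EN = 1 << 30          # MCi_CTL2 bit 30: CMCI enabled for this bank


def cmci_supported(mcg_cap):
    return bool(mcg_cap & MCG_CMCI_P)


def ctl2_addresses(mcg_cap):
    """MSR addresses of MCi_CTL2 for banks 0 .. nbanks-1."""
    nbanks = mcg_cap & MCG_BANKCNT_MASK
    return [MSR_IA32_MC0_CTL2 + i for i in range(nbanks)]


def cmci_enabled(ctl2):
    return bool(ctl2 & CMCI_EN)


def read_msr(addr, cpu=0):
    """Read one 64-bit MSR via the msr driver (root, `modprobe msr`)."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        return struct.unpack("<Q", os.pread(fd, 8, addr))[0]
    finally:
        os.close(fd)


def cmci_report(cpu=0, read=read_msr):
    """Return (cmci_supported, {bank: cmci_enabled}) for one CPU."""
    cap = read(MSR_IA32_MCG_CAP, cpu)
    if not cmci_supported(cap):
        return False, {}
    banks = {i: cmci_enabled(read(addr, cpu))
             for i, addr in enumerate(ctl2_addresses(cap))}
    return True, banks
```

Calling cmci_report() on the target machine gives a per-bank view of
CMCI_EN. The read argument is there so the decoding can be exercised
with canned values on a machine without MSR access.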

If that's the case, then you may just need to modify your EDAC driver
to panic if it sees MCi_STATUS.UC == 1.
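
To make the MCi_STATUS.UC test concrete, here is a small userspace
sketch of the decision only. It is not the driver change itself, which
would be C inside the kernel. It assumes the architectural layout of
MCi_STATUS (VAL in bit 63, UC in bit 61, bank i at MSR 0x401 + 4*i);
the function names are hypothetical.

```python
MSR_IA32_MC0_STATUS = 0x401  # MCi_STATUS for bank i is 0x401 + 4*i
MCI_STATUS_VAL = 1 << 63     # the register holds a valid error record
MCI_STATUS_UC = 1 << 61      # the logged error was not corrected


def status_address(bank):
    """MSR address of MCi_STATUS for the given bank."""
    return MSR_IA32_MC0_STATUS + 4 * bank


def is_uncorrected(status):
    """True only for a valid record with the UC bit set."""
    return bool(status & MCI_STATUS_VAL) and bool(status & MCI_STATUS_UC)


def should_panic(statuses):
    """True if any raw MCi_STATUS value in the iterable is a valid UC error."""
    return any(is_uncorrected(s) for s in statuses)
```

Checking VAL together with UC avoids acting on stale bits in a bank
that holds no valid record.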

Note that this doesn't give you complete containment of the error.
Whatever read the uncorrected data is going to keep using it until the
CMCI is delivered and your driver calls panic.  If the reader is an
application, or kernel code with interrupts enabled, then the window is
tiny. If the kernel accessed the data with interrupts off, then a lot
may happen to that bad data before the plug is pulled.

-Tony
