From: Tony Luck <tony.luck@gmail.com>
To: Dan Pehush <dpehush@qumulo.com>
Cc: Linux Edac Mailing List <linux-edac@vger.kernel.org>
Subject: Re: Qumulo: a question about UECC detection from the ie31200_edac ko
Date: Wed, 5 Feb 2020 10:25:07 -0800
Message-ID: <CA+8MBb+R4V-uesUbsy=5y2FOxHV11k6e=G2uFQe0yV13wCQ3RQ@mail.gmail.com>
In-Reply-To: <CACNqQuQNsVyqxW2yq_W=EN2f0q7oP-Fkfe9vXWV4wMznZ093jA@mail.gmail.com>
On Mon, Feb 3, 2020 at 5:27 PM Dan Pehush <dpehush@qumulo.com> wrote:
>
> Hi All,
>
> My name is Daniel Pehush, and I work on the hardware team at an
> enterprise data storage company called Qumulo Inc. We want our
> server systems' kernel to panic on the occurrence of a UECC
> error. A UECC should be treated as an interrupt. We were working with
> Intel to get resolution for this desired behavior, and they have
> directed us to ask for guidance from the developers of this kernel
> module. Our current configuration is the following ...
I haven't done much with the E3 systems. Do you know if you
get CMCI interrupts for corrected errors? If you do, then it is
likely that you'd also get a CMCI for an uncorrected error too.
[Worst acronym ever ... Corrected Machine Check Interrupt, can
happen for uncorrected errors. Totally separate from the "Machine
Check" INT#18].
Clues to check:
1) Is MCG_CAP bit 10 (MCG_CMCI) set?
2) If so, use rdmsr(8) to look at each MCi_CTL2 (0x280, 0x281, ... 0x280+nbanks-1)
   to see if bit 30 (CMCI_EN) is set.
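A minimal sketch of those two checks in Python, as an alternative to running rdmsr(8) by hand. It assumes the msr driver is loaded (modprobe msr), that /dev/cpu/0/msr is readable as root, and that the bank count is taken from MCG_CAP bits 7:0. The function names are made up for illustration and are not part of any EDAC code.

```python
import os
import struct

IA32_MCG_CAP = 0x179    # machine-check global capability MSR
IA32_MC0_CTL2 = 0x280   # CTL2 for bank 0; bank i lives at 0x280 + i
MCG_CMCI_P = 1 << 10    # MCG_CAP bit 10: CMCI is supported
CMCI_EN = 1 << 30       # MCi_CTL2 bit 30: CMCI enabled for that bank


def bank_count(mcg_cap):
    """Number of machine-check banks, from MCG_CAP bits 7:0."""
    return mcg_cap & 0xFF


def cmci_supported(mcg_cap):
    """True if MCG_CAP advertises CMCI (bit 10)."""
    return bool(mcg_cap & MCG_CMCI_P)


def ctl2_addresses(mcg_cap):
    """MSR addresses of MCi_CTL2 for every bank: 0x280 .. 0x280+nbanks-1."""
    return [IA32_MC0_CTL2 + i for i in range(bank_count(mcg_cap))]


def cmci_enabled(ctl2):
    """True if a bank's MCi_CTL2 value has CMCI_EN (bit 30) set."""
    return bool(ctl2 & CMCI_EN)


def read_msr(addr, cpu=0):
    """Read one 64-bit MSR through the msr driver (assumed: root, msr loaded)."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        return struct.unpack("<Q", os.pread(fd, 8, addr))[0]
    finally:
        os.close(fd)


def report(cpu=0):
    """Print CMCI support and the per-bank CMCI_EN state for one CPU."""
    cap = read_msr(IA32_MCG_CAP, cpu)
    print(f"MCG_CAP={cap:#x} banks={bank_count(cap)} cmci={cmci_supported(cap)}")
    if not cmci_supported(cap):
        return
    for bank, addr in enumerate(ctl2_addresses(cap)):
        ctl2 = read_msr(addr, cpu)
        print(f"bank {bank}: MC{bank}_CTL2={ctl2:#x} CMCI_EN={cmci_enabled(ctl2)}")
```

Calling report() as root prints one line per bank. The decoding helpers take plain integers, so they can also be fed values copied from rdmsr output.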
If that's the case, then you may just need to modify your EDAC driver
to panic if it sees MCi_STATUS.UC == 1
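For reference, the status test itself is a pair of bit checks. The sketch below is user-space Python, not the ie31200_edac driver. It assumes the architectural layout where MCi_STATUS for bank i is at MSR 0x401 + 4*i, bit 63 is VAL and bit 61 is UC, and it treats a record as uncorrected only when both are set. The function names are hypothetical.

```python
IA32_MC0_STATUS = 0x401     # MCi_STATUS for bank 0; bank i is at 0x401 + 4*i
MCI_STATUS_VAL = 1 << 63    # the status register holds a valid error record
MCI_STATUS_UC = 1 << 61     # the logged error was not corrected


def status_address(bank):
    """MSR address of MCi_STATUS for the given bank."""
    return IA32_MC0_STATUS + 4 * bank


def is_uncorrected(status):
    """True if the bank logged a valid, uncorrected error (VAL and UC set)."""
    return bool(status & MCI_STATUS_VAL) and bool(status & MCI_STATUS_UC)
```

In a driver the equivalent condition would gate the call to panic.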
Note that this doesn't give you complete containment of the error. Whatever
read the uncorrected data is going to use it until the CMCI is delivered
and your driver calls panic. If this is an application, or kernel code with
interrupts enabled, then the window is tiny. If the kernel accessed the data
with interrupts off, then a lot may happen to that bad data before the plug is
pulled.
-Tony