linux-edac.vger.kernel.org archive mirror
From: Tony Luck <tony.luck@gmail.com>
To: Dan Pehush <dpehush@qumulo.com>
Cc: Linux Edac Mailing List <linux-edac@vger.kernel.org>
Subject: Re: Qumulo: a question about UECC detection from the ie31200_edac ko
Date: Wed, 5 Feb 2020 10:25:07 -0800
Message-ID: <CA+8MBb+R4V-uesUbsy=5y2FOxHV11k6e=G2uFQe0yV13wCQ3RQ@mail.gmail.com>
In-Reply-To: <CACNqQuQNsVyqxW2yq_W=EN2f0q7oP-Fkfe9vXWV4wMznZ093jA@mail.gmail.com>

On Mon, Feb 3, 2020 at 5:27 PM Dan Pehush <dpehush@qumulo.com> wrote:
>
> Hi All,
>
>    My name is Daniel Pehush, I work on the hardware team at an
> enterprise data storage company called Qumulo Inc. We want to be able
> to have our server systems kernel PANIC on the occurrence of a UECC
> error. A UECC should be treated as an interrupt. We were working with
> Intel to get resolution for this desired behavior, and they have
> directed us to ask for guidance from the developers of this kernel
> module. Our current configuration is the following ...

I haven't done much with the E3 systems.  Do you know if you
get CMCI interrupts for corrected errors?  If you do, then it is
likely that you'd get a CMCI for an uncorrected error too.
[Worst acronym ever ... Corrected Machine Check Interrupt, yet it can
fire for uncorrected errors. It is totally separate from the "Machine
Check" exception, INT#18].

Clues to check:
1) Is MCG_CAP bit 10 (MCG_CMCI_P) set?
2) If so, use rdmsr(8) to look at each MCi_CTL2 (0x280, 0x281, ... 0x280+nbanks-1)
to see if bit 30 (CMCI_EN) is set.
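
The two checks above can be sketched in user space roughly as follows.
This is only an illustration, not driver code: it assumes the msr
module is loaded so that /dev/cpu/N/msr exists, that the script runs as
root, and that the bank count is taken from MCG_CAP bits 7:0. The
helper names (bank_count, cmci_supported, ctl2_addresses, cmci_enabled,
read_msr) are made up for this example.

```python
import os
import struct

IA32_MCG_CAP = 0x179    # global machine-check capabilities MSR
IA32_MC0_CTL2 = 0x280   # CTL2 for bank 0; bank i is at 0x280 + i
MCG_CMCI_P = 1 << 10    # MCG_CAP bit 10: CMCI is supported
CMCI_EN = 1 << 30       # MCi_CTL2 bit 30: CMCI enabled for this bank


def bank_count(mcg_cap):
    """Number of machine-check banks (MCG_CAP bits 7:0)."""
    return mcg_cap & 0xFF


def cmci_supported(mcg_cap):
    """Check 1: is MCG_CAP bit 10 set?"""
    return bool(mcg_cap & MCG_CMCI_P)


def ctl2_addresses(mcg_cap):
    """MSR addresses 0x280 .. 0x280 + nbanks - 1."""
    return [IA32_MC0_CTL2 + i for i in range(bank_count(mcg_cap))]


def cmci_enabled(ctl2):
    """Check 2: is bit 30 set in a given MCi_CTL2 value?"""
    return bool(ctl2 & CMCI_EN)


def read_msr(cpu, reg):
    """Read one 64-bit MSR via /dev/cpu/<cpu>/msr (assumption: root + msr module)."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        return struct.unpack("<Q", os.pread(fd, 8, reg))[0]
    finally:
        os.close(fd)
```

On a live system one would call read_msr(0, IA32_MCG_CAP) first and
then read each address returned by ctl2_addresses() and pass the value
to cmci_enabled().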

If that's the case, then you may just need to modify your EDAC driver
to panic if it sees MCi_STATUS.UC == 1.
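
As a rough illustration of the bit test involved (the real change would
be C code in the EDAC driver, which is not shown here), the decision
can be sketched like this. It assumes the usual MCi_STATUS layout with
VAL in bit 63 and UC in bit 61, and bank i's status MSR at 0x401 + 4*i.
The names status_address and should_panic are hypothetical.

```python
IA32_MC0_STATUS = 0x401   # STATUS for bank 0; bank i is at 0x401 + 4*i
MCI_STATUS_VAL = 1 << 63  # the bank holds a valid logged error
MCI_STATUS_UC = 1 << 61   # the logged error was uncorrected


def status_address(bank):
    """MSR address of MCi_STATUS for the given bank."""
    return IA32_MC0_STATUS + 4 * bank


def should_panic(status):
    """True only for a valid log entry with UC == 1."""
    return bool(status & MCI_STATUS_VAL) and bool(status & MCI_STATUS_UC)
```

Checking VAL as well as UC is a choice made for this sketch, so that
stale bits in an otherwise invalid bank are ignored.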

Note that this doesn't give you complete containment of the error. Whatever
read the uncorrected data is going to use it until the CMCI is delivered
and your driver calls panic.  If this is an application, or kernel code with
interrupts enabled, then the window is tiny. If the kernel accessed the data
with interrupts off, then a lot may happen to that bad data before the plug
is pulled.

-Tony
