linux-edac.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Borislav Petkov <bp@alien8.de>
To: William Roche <william.roche@oracle.com>
Cc: linux-kernel@vger.kernel.org, Tony Luck <tony.luck@intel.com>,
	linux-edac@vger.kernel.org
Subject: Re: [PATCH v1] RAS/CEC: Memory Corrected Errors consistent event filtering
Date: Thu, 1 Apr 2021 18:12:37 +0200	[thread overview]
Message-ID: <20210401161237.GC28954@zn.tnic> (raw)
In-Reply-To: <3ee5551c-d311-1939-315f-a4712e3821ff@oracle.com>

On Mon, Mar 29, 2021 at 11:44:05AM +0200, William Roche wrote:
> I totally agree with you, and in order to schedule a replacement, MCEs
> information (enriched by the notifiers chain) are more meaningful than
> only PFN values.

Well, if you want to collect errors and analyze patterns in order to
detect hw going bad, you're probably better off disabling the CEC
altogether - either disable it in Kconfig or boot with ras=cec_disable.

> 1/ Giving back ras_cec a consistent behavior where the first occurrence
> of a CE doesn't generate an MCE message from the MCE_HANDLED_CEC
> notifiers, and a consistent behavior between the slot 0 and the other
> pfn slots.

If by this you mean the issue with the return value, then sure.

If you mean something else, you'd have to be more specific.

> 2/ Give the CE MCE information when the action threshold is reached to
> help the administrator identify what generated the PFN "Soft-offlining"
> or "Invalid pfn" message.
> 
> When ras_cec is enabled it hides most of the CE errors, but when the
> action threshold is reached all notifiers can generate their indication
> about the error that appeared too often.
> 
> An administrator getting too many action threshold CE errors can
> schedule a replacement based on the indications provided by his EDAC
> module etc...

Well, this works probably only in theory.

First of all, the CEC sees the error first, before the EDAC drivers.

But, in order to map from the virtual address to the actual DIMM, you
need the EDAC drivers to have a go at the error. In many cases not even
the EDAC drivers can give you that mapping because, well, hw/fw does its
own stuff underneath, predictive fault bla, added value crap, whatever,
so that we can't even get a "DIMM X on processor Y caused the error."

I know, your assumption is that if a page gets offlined by the CEC, then
all the errors' addresses are coming from the same physical DIMM. And
that is probably correct in most cases but I'm not convinced for all.

In any case, what we could do - which is pretty easy and cheap - is to
fix the retval of cec_add_elem() to communicate to the caller that it
offlined a page and this way tell the notifier chain that the error
needs to be printed into dmesg with a statement sayin that DIMM Y got
just one more page offlined.

Over time, if a DIMM is going bad, one should be able to grep dmesg and
correlate all those offlined pages to DIMMs and then maybe see a pattern
and eventually schedule a downtime.

A lot of ifs, I know. :-\

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

  reply	other threads:[~2021-04-01 17:58 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-03-26 18:30 [PATCH v1] RAS/CEC: Memory Corrected Errors consistent event filtering “William Roche
2021-03-26 19:02 ` Borislav Petkov
2021-03-26 22:24   ` William Roche
2021-03-26 22:43     ` Borislav Petkov
2021-03-29  9:44       ` William Roche
2021-04-01 16:12         ` Borislav Petkov [this message]
2021-04-02 16:00           ` William Roche
2021-04-02 17:07             ` Borislav Petkov
2021-04-06 15:28               ` [PATCH v2] " “William Roche

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20210401161237.GC28954@zn.tnic \
    --to=bp@alien8.de \
    --cc=linux-edac@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=tony.luck@intel.com \
    --cc=william.roche@oracle.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).