linux-edac.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: William Roche <william.roche@oracle.com>
To: Borislav Petkov <bp@alien8.de>
Cc: linux-kernel@vger.kernel.org, Tony Luck <tony.luck@intel.com>,
	linux-edac@vger.kernel.org
Subject: Re: [PATCH v1] RAS/CEC: Memory Corrected Errors consistent event filtering
Date: Fri, 2 Apr 2021 18:00:42 +0200	[thread overview]
Message-ID: <5ba128f6-62f3-beb2-9f04-fdebaf411414@oracle.com> (raw)
In-Reply-To: <20210401161237.GC28954@zn.tnic>

On 01/04/2021 18:12, Borislav Petkov wrote:
> On Mon, Mar 29, 2021 at 11:44:05AM +0200, William Roche wrote:
>> I totally agree with you, and in order to schedule a replacement, MCEs
>> information (enriched by the notifiers chain) are more meaningful than
>> only PFN values.
> 
> Well, if you want to collect errors and analyze patterns in order to
> detect hw going bad, you're probably better off disabling the CEC
> altogether - either disable it in Kconfig or boot with ras=cec_disable.

Corrected Errors are not the best indicators for a failing DIMM, and I
agree that enabling ras_cec is a good thing to have in production.


> 
>> 1/ Giving back ras_cec a consistent behavior where the first occurrence
>> of a CE doesn't generate an MCE message from the MCE_HANDLED_CEC
>> notifiers, and a consistent behavior between the slot 0 and the other
>> pfn slots.
> 
> If by this you mean the issue with the return value, then sure.
> 
> If you mean something else, you'd have to be more specific.

No I just want to fix the return value.
And I expressed the consequences of this fix.


> 
>> 2/ Give the CE MCE information when the action threshold is reached to
>> help the administrator identify what generated the PFN "Soft-offlining"
>> or "Invalid pfn" message.
>>
>> When ras_cec is enabled it hides most of the CE errors, but when the
>> action threshold is reached all notifiers can generate their indication
>> about the error that appeared too often.
>>
>> An administrator getting too many action threshold CE errors can
>> schedule a replacement based on the indications provided by his EDAC
>> module etc...
> 
> Well, this works probably only in theory.
> 
> First of all, the CEC sees the error first, before the EDAC drivers.
> 
> But, in order to map from the virtual address to the actual DIMM, you
> need the EDAC drivers to have a go at the error. In many cases not even
> the EDAC drivers can give you that mapping because, well, hw/fw does its
> own stuff underneath, predictive fault bla, added value crap, whatever,
> so that we can't even get a "DIMM X on processor Y caused the error."
> 
> I know, your assumption is that if a page gets offlined by the CEC, then
> all the errors' addresses are coming from the same physical DIMM. And
> that is probably correct in most cases but I'm not convinced for all.
> 
> In any case, what we could do - which is pretty easy and cheap - is to
> fix the retval of cec_add_elem() to communicate to the caller that it
> offlined a page and this way tell the notifier chain that the error
> needs to be printed into dmesg with a statement sayin that DIMM Y got
> just one more page offlined.

Let's just fix cec_add_elem() with this patch first.

I would love to have a mechanism indicating what physical DIMM had the
page off-lined but I know, as you said earlier, that "hw/fw does its
own stuff underneath" and that we may not have the right information in
the kernel.


> 
> Over time, if a DIMM is going bad, one should be able to grep dmesg and
> correlate all those offlined pages to DIMMs and then maybe see a pattern
> and eventually schedule a downtime.
> 
> A lot of ifs, I know. :-\
> 

For the moment we will have the CE MCE handled my the MCE_HANDLED_CEC
aware notifiers only when a page is off-lined, like it used to be.

Can we start with that small fix ?

  reply	other threads:[~2021-04-02 16:01 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-03-26 18:30 [PATCH v1] RAS/CEC: Memory Corrected Errors consistent event filtering “William Roche
2021-03-26 19:02 ` Borislav Petkov
2021-03-26 22:24   ` William Roche
2021-03-26 22:43     ` Borislav Petkov
2021-03-29  9:44       ` William Roche
2021-04-01 16:12         ` Borislav Petkov
2021-04-02 16:00           ` William Roche [this message]
2021-04-02 17:07             ` Borislav Petkov
2021-04-06 15:28               ` [PATCH v2] " “William Roche

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=5ba128f6-62f3-beb2-9f04-fdebaf411414@oracle.com \
    --to=william.roche@oracle.com \
    --cc=bp@alien8.de \
    --cc=linux-edac@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=tony.luck@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).