From: William Roche <email@example.com>
To: Borislav Petkov <firstname.lastname@example.org>
Cc: email@example.com, Tony Luck <firstname.lastname@example.org>,
Subject: Re: [PATCH v1] RAS/CEC: Memory Corrected Errors consistent event filtering
Date: Fri, 2 Apr 2021 18:00:42 +0200
Message-ID: <email@example.com>
On 01/04/2021 18:12, Borislav Petkov wrote:
> On Mon, Mar 29, 2021 at 11:44:05AM +0200, William Roche wrote:
>> I totally agree with you, and in order to schedule a replacement, MCE
>> information (enriched by the notifier chain) is more meaningful than
>> PFN values alone.
> Well, if you want to collect errors and analyze patterns in order to
> detect hw going bad, you're probably better off disabling the CEC
> altogether - either disable it in Kconfig or boot with ras=cec_disable.
Corrected Errors are not the best indicators for a failing DIMM, and I
agree that enabling ras_cec is a good thing to have in production.
>> 1/ Giving back ras_cec a consistent behavior where the first occurrence
>> of a CE doesn't generate an MCE message from the MCE_HANDLED_CEC
>> notifiers, and a consistent behavior between the slot 0 and the other
>> pfn slots.
> If by this you mean the issue with the return value, then sure.
> If you mean something else, you'd have to be more specific.
No, I just want to fix the return value.
And I described the consequences of this fix.
>> 2/ Give the CE MCE information when the action threshold is reached to
>> help the administrator identify what generated the PFN "Soft-offlining"
>> or "Invalid pfn" message.
>> When ras_cec is enabled it hides most of the CE errors, but when the
>> action threshold is reached all notifiers can generate their indication
>> about the error that appeared too often.
>> An administrator getting too many action threshold CE errors can
>> schedule a replacement based on the indications provided by his EDAC
>> module etc...
> Well, this works probably only in theory.
> First of all, the CEC sees the error first, before the EDAC drivers.
> But, in order to map from the virtual address to the actual DIMM, you
> need the EDAC drivers to have a go at the error. In many cases not even
> the EDAC drivers can give you that mapping because, well, hw/fw does its
> own stuff underneath, predictive fault bla, added value crap, whatever,
> so that we can't even get a "DIMM X on processor Y caused the error."
> I know, your assumption is that if a page gets offlined by the CEC, then
> all the errors' addresses are coming from the same physical DIMM. And
> that is probably correct in most cases but I'm not convinced for all.
> In any case, what we could do - which is pretty easy and cheap - is to
> fix the retval of cec_add_elem() to communicate to the caller that it
> offlined a page and this way tell the notifier chain that the error
> needs to be printed into dmesg with a statement saying that DIMM Y got
> just one more page offlined.
Let's just fix cec_add_elem() with this patch first.
I would love to have a mechanism indicating what physical DIMM had the
page off-lined but I know, as you said earlier, that "hw/fw does its
own stuff underneath" and that we may not have the right information in
> Over time, if a DIMM is going bad, one should be able to grep dmesg and
> correlate all those offlined pages to DIMMs and then maybe see a pattern
> and eventually schedule a downtime.
> A lot of ifs, I know. :-\
For the moment we will have the CE MCE handled by the
MCE_HANDLED_CEC-aware notifiers only when a page is off-lined, like it
used to be.
Can we start with that small fix?
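
To make the intended behavior concrete, here is a minimal user-space
sketch. It is not the kernel code and not the patch itself: the toy_*
names, the table size and the threshold are hypothetical, and "offlining"
is reduced to dropping the slot. It only shows the contract discussed
above: the add-element helper returns non-zero solely when the action
threshold is reached, whatever slot the pfn sits in, and the notifier
marks the error as handled by the CEC in every other case.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_SLOTS       8     /* hypothetical table size */
#define TOY_THRESHOLD   3     /* hypothetical action threshold, must be >= 2 */
#define TOY_HANDLED_CEC 0x1u  /* stand-in for the MCE_HANDLED_CEC flag */

struct toy_cec {
	uint64_t pfn[TOY_SLOTS];
	unsigned int count[TOY_SLOTS];
	size_t used;
};

struct toy_mce {
	uint64_t pfn;
	unsigned int kflags;
};

/*
 * Record one corrected error for @pfn.
 * Returns 1 only when the threshold is reached and the page is "offlined",
 * 0 in every other case, independent of the slot index.
 */
static int toy_cec_add_elem(struct toy_cec *ca, uint64_t pfn)
{
	size_t i;

	assert(ca->used <= TOY_SLOTS);

	for (i = 0; i < ca->used; i++) {
		if (ca->pfn[i] == pfn)
			break;
	}

	if (i == ca->used) {
		/* First occurrence: if the table is full, reuse the last slot. */
		if (ca->used == TOY_SLOTS) {
			ca->used--;
			i = ca->used;
		}
		ca->pfn[i] = pfn;
		ca->count[i] = 1;
		ca->used++;
		return 0;
	}

	if (++ca->count[i] < TOY_THRESHOLD)
		return 0;

	/* Threshold reached: pretend to offline the page and free the slot. */
	ca->used--;
	ca->pfn[i] = ca->pfn[ca->used];
	ca->count[i] = ca->count[ca->used];
	return 1;
}

/*
 * Returns true when the CEC swallowed the error (flag set), false when
 * the error must stay visible to the later notifiers.
 */
static bool toy_cec_notifier(struct toy_cec *ca, struct toy_mce *m)
{
	if (!toy_cec_add_elem(ca, m->pfn)) {
		m->kflags |= TOY_HANDLED_CEC;
		return true;
	}
	return false;
}
```

With a threshold of 3, the first two errors on a given pfn are swallowed
and flagged, and the third one is left unflagged so that the other
notifiers can report it, whether the pfn is in slot 0 or in any other
slot.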
Thread overview: 9+ messages
2021-03-26 18:30 [PATCH v1] RAS/CEC: Memory Corrected Errors consistent event filtering William Roche
2021-03-26 19:02 ` Borislav Petkov
2021-03-26 22:24 ` William Roche
2021-03-26 22:43 ` Borislav Petkov
2021-03-29 9:44 ` William Roche
2021-04-01 16:12 ` Borislav Petkov
2021-04-02 16:00 ` William Roche [this message]
2021-04-02 17:07 ` Borislav Petkov
2021-04-06 15:28 ` [PATCH v2] " William Roche