From: William Roche <firstname.lastname@example.org>
To: Borislav Petkov <email@example.com>
Cc: firstname.lastname@example.org, Tony Luck <email@example.com>,
Subject: Re: [PATCH v1] RAS/CEC: Memory Corrected Errors consistent event filtering
Date: Mon, 29 Mar 2021 11:44:05 +0200
Message-ID: <firstname.lastname@example.org>
On 26/03/2021 23:43, Borislav Petkov wrote:
> On Fri, Mar 26, 2021 at 11:24:43PM +0100, William Roche wrote:
>> What we want is to make cec_add_elem() to return !0 value only
>> when the given pfn triggered an action, so that its callers should
>> log the error.
> No, this is not what the CEC does - it collects those errors and when it
> reaches the threshold for any pfn, it offlines the corresponding page. I
> know, the comment above talks about:
> * That error event entry causes cec_add_elem() to return !0 value and thus
> * signal to its callers to log the error.
> but it doesn't do that. Frankly, I don't see the point of logging the
> error - it already says
> pr_err("Soft-offlining pfn: 0x%llx\n", pfn);
> which pfn it has offlined. And that is probably only mildly interesting
> to people - so what, 4K got offlined, servers have so much memory
> The only moment one should start worrying is if one gets those pretty
> often but then you're probably better off simply scheduling maintenance
> and replacing the faulty DIMM - problem solved.
I totally agree with you, and in order to schedule a replacement, the MCE
information (enriched by the notifier chain) is more meaningful than
PFN values alone.
>> What I'm expecting from ras_cec is to "hide" CEs until they reach the
>> action threshold where an action is tried against the impacted PFN,
> That it does.
>> and it's now the time to log the error with the entire notifiers
> And I'm not sure why we'd want to do that. It simply offlines the page.
> But maybe you could explain what you're trying to achieve...
My small proposed fix has 2 goals:
1/ Give ras_cec back a consistent behavior, where the first occurrence
of a CE doesn't generate an MCE message from the MCE_HANDLED_CEC
notifiers, and where slot 0 behaves the same as the other slots.
2/ Provide the CE MCE information when the action threshold is reached, to
help the administrator identify what generated the PFN "Soft-offlining"
or "Invalid pfn" message.
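To illustrate the return-value semantics I have in mind, here is a minimal
user-space sketch. It is not the drivers/ras/cec.c code: struct cec_model,
cec_model_add(), MODEL_SLOTS and MODEL_THRESHOLD are hypothetical names and
values made up for this example, and decay and the actual page offlining
are left out.

```c
#include <stddef.h>
#include <stdint.h>

#define MODEL_SLOTS     8	/* hypothetical capacity, not the kernel's */
#define MODEL_THRESHOLD 3	/* hypothetical action threshold */

/* Toy collector: one occurrence counter per tracked pfn. */
struct cec_model {
	uint64_t pfn[MODEL_SLOTS];
	unsigned int count[MODEL_SLOTS];
	size_t n;		/* number of slots in use */
};

/*
 * Returns 0 when the CE is absorbed silently (the caller should not log),
 * and 1 only when this occurrence reaches the action threshold for its
 * pfn (the caller should let the remaining notifiers log the error).
 *
 * Every slot, including slot 0, goes through the same path.
 * A full table simply drops the new pfn and returns 0; how the real CEC
 * handles a full array is not modeled here.
 */
static int cec_model_add(struct cec_model *c, uint64_t pfn)
{
	size_t i;

	/* Look for an existing entry for this pfn. */
	for (i = 0; i < c->n; i++)
		if (c->pfn[i] == pfn)
			break;

	/* First occurrence: start tracking it, but stay silent. */
	if (i == c->n) {
		if (c->n == MODEL_SLOTS)
			return 0;
		c->pfn[i] = pfn;
		c->count[i] = 0;
		c->n++;
	}

	if (++c->count[i] < MODEL_THRESHOLD)
		return 0;

	/*
	 * Threshold reached: the real code would try to offline the page
	 * here. The model just forgets the pfn by moving the last entry
	 * into its slot.
	 */
	c->n--;
	c->pfn[i] = c->pfn[c->n];
	c->count[i] = c->count[c->n];
	return 1;
}
```

With MODEL_THRESHOLD set to 3, the first two CEs on a given pfn return 0 and
only the third returns 1, whichever slot the pfn landed in.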
When ras_cec is enabled, it hides most CEs, but once the action threshold
is reached, every notifier can report on the error that occurred too
often.
An administrator getting too many action threshold CE errors can
schedule a replacement based on the indications provided by his EDAC
I think it could be more useful this way than systematically hiding all CEs.
But this decision is yours :)