Re: [PATCH v1] RAS/CEC: Memory Corrected Errors consistent event filtering

From: William Roche <william.roche@oracle.com>
To: Borislav Petkov <bp@alien8.de>
Cc: linux-kernel@vger.kernel.org, Tony Luck <tony.luck@intel.com>,
	linux-edac@vger.kernel.org
Subject: Re: [PATCH v1] RAS/CEC: Memory Corrected Errors consistent event filtering
Date: Fri, 2 Apr 2021 18:00:42 +0200
Message-ID: <5ba128f6-62f3-beb2-9f04-fdebaf411414@oracle.com>
In-Reply-To: <20210401161237.GC28954@zn.tnic>

On 01/04/2021 18:12, Borislav Petkov wrote:
> On Mon, Mar 29, 2021 at 11:44:05AM +0200, William Roche wrote:
>> I totally agree with you, and in order to schedule a replacement, MCEs
>> information (enriched by the notifiers chain) are more meaningful than
>> only PFN values.
> 
> Well, if you want to collect errors and analyze patterns in order to
> detect hw going bad, you're probably better off disabling the CEC
> altogether - either disable it in Kconfig or boot with ras=cec_disable.

Corrected Errors are not the best indicators for a failing DIMM, and I
agree that enabling ras_cec is a good thing to have in production.

> 
>> 1/ Giving back ras_cec a consistent behavior where the first occurrence
>> of a CE doesn't generate an MCE message from the MCE_HANDLED_CEC
>> notifiers, and a consistent behavior between the slot 0 and the other
>> pfn slots.
> 
> If by this you mean the issue with the return value, then sure.
> 
> If you mean something else, you'd have to be more specific.

No, I just want to fix the return value.
And I described the consequences of this fix.

> 
>> 2/ Give the CE MCE information when the action threshold is reached to
>> help the administrator identify what generated the PFN "Soft-offlining"
>> or "Invalid pfn" message.
>>
>> When ras_cec is enabled it hides most of the CE errors, but when the
>> action threshold is reached all notifiers can generate their indication
>> about the error that appeared too often.
>>
>> An administrator getting too many action threshold CE errors can
>> schedule a replacement based on the indications provided by his EDAC
>> module etc...
> 
> Well, this works probably only in theory.
> 
> First of all, the CEC sees the error first, before the EDAC drivers.
> 
> But, in order to map from the virtual address to the actual DIMM, you
> need the EDAC drivers to have a go at the error. In many cases not even
> the EDAC drivers can give you that mapping because, well, hw/fw does its
> own stuff underneath, predictive fault bla, added value crap, whatever,
> so that we can't even get a "DIMM X on processor Y caused the error."
> 
> I know, your assumption is that if a page gets offlined by the CEC, then
> all the errors' addresses are coming from the same physical DIMM. And
> that is probably correct in most cases but I'm not convinced for all.
> 
> In any case, what we could do - which is pretty easy and cheap - is to
> fix the retval of cec_add_elem() to communicate to the caller that it
> offlined a page and this way tell the notifier chain that the error
> needs to be printed into dmesg with a statement saying that DIMM Y got
> just one more page offlined.

Let's just fix cec_add_elem() with this patch first.

I would love to have a mechanism indicating what physical DIMM had the
page off-lined but I know, as you said earlier, that "hw/fw does its
own stuff underneath" and that we may not have the right information in
the kernel.

> 
> Over time, if a DIMM is going bad, one should be able to grep dmesg and
> correlate all those offlined pages to DIMMs and then maybe see a pattern
> and eventually schedule a downtime.
> 
> A lot of ifs, I know. :-\
> 

For the moment we will have the CE MCE handled by the MCE_HANDLED_CEC-aware
notifiers only when a page is off-lined, like it used to be.
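
To illustrate the return convention I have in mind, here is a minimal
userspace sketch. It is not the kernel's cec_add_elem(): the names
toy_cec_add(), struct toy_cec, TOY_SLOTS and TOY_THRESHOLD are all
hypothetical, and the table handling is deliberately simplified. The only
point is that the return value depends on whether the action threshold was
reached, never on which slot the PFN happens to occupy:

```c
#include <stddef.h>

#define TOY_SLOTS      4	/* hypothetical table size */
#define TOY_THRESHOLD  3	/* hypothetical action threshold */

struct toy_slot {
	unsigned long pfn;
	unsigned int count;
	int used;
};

struct toy_cec {
	struct toy_slot slots[TOY_SLOTS];
};

/*
 * Record one corrected error for @pfn.
 *
 * Returns 0 when the error was simply counted (callers stay quiet),
 * 1 when the threshold was reached and the page would be off-lined,
 * and -1 when the toy table is full (a simplification of this sketch).
 */
static int toy_cec_add(struct toy_cec *c, unsigned long pfn)
{
	size_t i, free_idx = TOY_SLOTS;

	/* Look for an existing entry, remembering the first free slot. */
	for (i = 0; i < TOY_SLOTS; i++) {
		if (c->slots[i].used && c->slots[i].pfn == pfn)
			break;
		if (!c->slots[i].used && free_idx == TOY_SLOTS)
			free_idx = i;
	}

	/* First occurrence of this PFN: create the entry. */
	if (i == TOY_SLOTS) {
		if (free_idx == TOY_SLOTS)
			return -1;
		i = free_idx;
		c->slots[i].pfn = pfn;
		c->slots[i].count = 0;
		c->slots[i].used = 1;
	}

	/* Below the threshold: absorbed, same answer for every slot. */
	if (++c->slots[i].count < TOY_THRESHOLD)
		return 0;

	/* Threshold reached: drop the entry and report the off-lining. */
	c->slots[i].used = 0;
	return 1;
}
```

With this convention a caller would treat 0 as "absorbed, stay quiet" and 1
as "threshold reached, let the remaining notifiers report the error".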

Can we start with that small fix?