linux-acpi.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Shiju Jose <shiju.jose@huawei.com>
To: Borislav Petkov <bp@alien8.de>
Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org>,
	"linux-acpi@vger.kernel.org" <linux-acpi@vger.kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"tony.luck@intel.com" <tony.luck@intel.com>,
	"rjw@rjwysocki.net" <rjw@rjwysocki.net>,
	"james.morse@arm.com" <james.morse@arm.com>,
	"lenb@kernel.org" <lenb@kernel.org>,
	Linuxarm <linuxarm@huawei.com>
Subject: RE: [PATCH 1/1] RAS: Add CPU Correctable Error Collector to isolate an erroneous CPU core
Date: Thu, 10 Sep 2020 15:29:56 +0000	[thread overview]
Message-ID: <50714e083d55491a8ccf5ad847682d1e@huawei.com> (raw)
In-Reply-To: <20200909120203.GB12237@zn.tnic>

Hello Boris,

>-----Original Message-----
>From: Borislav Petkov [mailto:bp@alien8.de]
>Sent: 09 September 2020 13:02
>To: Shiju Jose <shiju.jose@huawei.com>
>Cc: linux-edac@vger.kernel.org; linux-acpi@vger.kernel.org; linux-
>kernel@vger.kernel.org; tony.luck@intel.com; rjw@rjwysocki.net;
>james.morse@arm.com; lenb@kernel.org; Linuxarm
><linuxarm@huawei.com>
>Subject: Re: [PATCH 1/1] RAS: Add CPU Correctable Error Collector to isolate
>an erroneous CPU core
>
>On Tue, Sep 01, 2020 at 04:20:54PM +0000, Shiju Jose wrote:
>> CPU CEC derived the infrastructure of the CEC only and the logic used
>> in the CEC for CE count storage, CE count calculation and page
>> isolation is very unique for the memory pages, which seems cannot be
>> reusable for the CPU CEs.
>
>Oh, because it saves the reported error's PFN and you want to save
>
>[CPU num | error count]
>
>?
Yes. 

>
>Well, you can easily change that by extending the existing CEC to have a
>different storage format for CPU errors, i.e., use a different ce_array which
>gets passed to the functions anyway.
Ok. However the functions such as __find_elem() use
memory specific PFN() and PAGE_SHIFT.

>
>> Also the values set for the parameters such as threshold, time period
>> for the memory errors and CPU errors would be different.
>
>And your implementation with sliding windows is so totally different that it
>warrants the duplication of the code? I don't think so.
>
>You can use the current CEC to do exactly what you wanna do, with the
>decaying and so on.
I will check this.
For CPU, the corrected errors count for a short time period to be checked.
Thus old errors outside this period would not be considered and would be cleared. 
It is not clear to me whether in the current CEC, the count for the old errors outside
a time period would be excluded for the threshold check or removed?

>
>Because all you wanna do is count the errors a CPU triggered.
>
>However, a CPU can trigger a *lot* of different types of errors.
>You're putting them all in the same basket by doing:
>
>                else if (guid_equal(sec_type, &CPER_SEC_PROC_ARM))
>			/* add to CEC */
>
>and only for correctable.
>
>What type of errors get reported in CPER_SEC_PROC_ARM?
According to the ARM Processor CPER definition the error types reported are
Cache Error, TLB Error, Bus Error and micro-architectural Error.

>
>If they're all lumped together and if some functional unit generates a lot of
>errors, instead of disabling that unit only, you'll go and remove the whole
>CPU?
>
Few thoughts on this,
1. Not sure will a CPU core would work/perform as normal after disabling
a functional unit?
2. Support in the HW to disable a function unit alone may not available.
3. If it is require to store and retrieve the error count based on functional unit,
    then CEC will become more complex?

>Doesn't make a whole lot of sense to me.
>
>How about you define what exactly you're trying to solve, maybe give an
>example of a real issue someone is encountering and you're trying to
>address? Because there was never a necessity so far to disable CPUs on
>x86 due to correctable errors. Why is that needed on ARM?
>
This requirement is the part of the early fault prediction by taking action
when large number of corrected errors reported on a CPU core
before it causing serious faults. 
We are mainly looking for disable CPU core on large number of L1/L2 cache
corrected errors  reported on a CPU core. Can we add atleast removing CPU core
for the CPU cache corrected errors filtering out other error types?
 
[...]

Thanks,
Shiju


  reply	other threads:[~2020-09-10 15:31 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-09-01 14:01 [PATCH 1/1] RAS: Add CPU Correctable Error Collector to isolate an erroneous CPU core Shiju Jose
2020-09-01 14:35 ` Borislav Petkov
2020-09-01 16:20   ` Shiju Jose
2020-09-09 12:02     ` Borislav Petkov
2020-09-10 15:29       ` Shiju Jose [this message]
2020-09-17  8:40         ` Borislav Petkov
2020-10-01 17:16           ` James Morse
2020-10-01 17:30             ` Borislav Petkov
2020-10-02 12:23               ` Shiju Jose
2020-09-01 18:51 ` kernel test robot

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=50714e083d55491a8ccf5ad847682d1e@huawei.com \
    --to=shiju.jose@huawei.com \
    --cc=bp@alien8.de \
    --cc=james.morse@arm.com \
    --cc=lenb@kernel.org \
    --cc=linux-acpi@vger.kernel.org \
    --cc=linux-edac@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linuxarm@huawei.com \
    --cc=rjw@rjwysocki.net \
    --cc=tony.luck@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).