From: James Morse <james.morse@arm.com>
To: Borislav Petkov <bp@alien8.de>, Shiju Jose <shiju.jose@huawei.com>
Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org>,
"linux-acpi@vger.kernel.org" <linux-acpi@vger.kernel.org>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
"tony.luck@intel.com" <tony.luck@intel.com>,
"rjw@rjwysocki.net" <rjw@rjwysocki.net>,
"lenb@kernel.org" <lenb@kernel.org>,
Linuxarm <linuxarm@huawei.com>
Subject: Re: [PATCH 1/1] RAS: Add CPU Correctable Error Collector to isolate an erroneous CPU core
Date: Thu, 1 Oct 2020 18:16:03 +0100
Message-ID: <91e71fe9-b002-0f1f-3237-62cea49e083a@arm.com>
In-Reply-To: <20200917084038.GE31960@zn.tnic>

Hi guys,

On 17/09/2020 09:40, Borislav Petkov wrote:
> On Thu, Sep 10, 2020 at 03:29:56PM +0000, Shiju Jose wrote:
> You can't know what exactly you wanna do if you don't have a use case
> you're trying to address.
>
>> According to the ARM Processor CPER definition the error types
>> reported are Cache Error, TLB Error, Bus Error and micro-architectural
>> Error.
>
> Bus error sounds like not even originating in the CPU but the CPU only
> reporting it. Imagine if that really were the case, and you go disable
> the CPU but the error source is still there. You've just disabled the
> reporting of the error only and now you don't even know anymore that
> you're getting errors.
>
>> A few thoughts on this:
>> 1. Not sure whether a CPU core would work/perform as normal after
>> disabling a functional unit?
>
> You can disable parts of caches, etc, so that you can have a somewhat
> functioning CPU until the replacement maintenance can take place.

This is implementation-specific stuff that only firmware can do...

>> 2. Support in the HW to disable a functional unit alone may not be available.
>
> Yes.
>
>> 3. If it is required to store and retrieve the error count per
>> functional unit, then CEC will become more complex?
>
> Depends on how it is designed. That's why we're first talking about what
> needs to be done exactly before going off and doing something.
>
>> This requirement is part of early fault prediction: taking action
>> when a large number of corrected errors is reported on a CPU core,
>> before it causes serious faults.
>
> And do you know of actual real-life examples where this is really the
> case? Do you have any users who report a large error count on ARM CPUs,
> originating from the caches and that something like that would really
> help?
>
> Because from my limited experience with x86 CPUs, the cache arrays are mostly
> fine and errors reported there are not something that happens very
> frequently so we don't even need to collect and count those.
>
> So is this something which you need to have in order to check a box
> somewhere that there is some functionality or is there an actual
> real-life use case behind it which a customer has requested?

If the corrected-error count is available somewhere, can't this policy be made in user-space?

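As a rough sketch of what such a user-space policy might look like: the snippet below counts corrected errors per CPU and offlines a core once it crosses a threshold. Everything here is an assumption for illustration, not an existing tool: the names `cpus_to_offline()` and `offline_cpu()`, the threshold value, and the idea that corrected-error events arrive as `(cpu, error_type)` tuples from wherever the counts are exposed. The only real interface used is the CPU hotplug sysfs file `/sys/devices/system/cpu/cpuN/online`. Bus errors are deliberately not counted, following the point above that they may not originate in the reporting CPU.

```python
#!/usr/bin/env python3
"""Hypothetical user-space corrected-error policy (illustrative sketch only)."""
from collections import Counter
from pathlib import Path

# Assumption: an arbitrary threshold, picked only for illustration.
CE_THRESHOLD = 100

# Error types named in the ARM Processor CPER definition quoted above.
# "bus" is left out on purpose: the error source may be outside the CPU,
# so offlining the reporting core would only hide the reports.
COUNTED_TYPES = {"cache", "tlb", "micro-architectural"}


def cpus_to_offline(events, threshold=CE_THRESHOLD):
    """Return the sorted CPU numbers whose corrected-error count reached threshold.

    events: iterable of (cpu, error_type) tuples; how these are obtained
    is an assumption and is not defined by this sketch.
    """
    counts = Counter(cpu for cpu, etype in events if etype in COUNTED_TYPES)
    return sorted(cpu for cpu, n in counts.items() if n >= threshold)


def offline_cpu(cpu, sysfs_root="/sys/devices/system/cpu"):
    """Take a CPU offline via the CPU hotplug sysfs file (needs root)."""
    Path(sysfs_root, f"cpu{cpu}", "online").write_text("0")
```

A caller would feed the collected events to `cpus_to_offline()` and pass each returned CPU number to `offline_cpu()`. Keeping the decision separate from the sysfs write means the threshold and the set of counted error types can be changed without touching the kernel.
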
Thanks,
James

Thread overview: 10+ messages
2020-09-01 14:01 [PATCH 1/1] RAS: Add CPU Correctable Error Collector to isolate an erroneous CPU core Shiju Jose
2020-09-01 14:35 ` Borislav Petkov
2020-09-01 16:20 ` Shiju Jose
2020-09-09 12:02 ` Borislav Petkov
2020-09-10 15:29 ` Shiju Jose
2020-09-17 8:40 ` Borislav Petkov
2020-10-01 17:16 ` James Morse [this message]
2020-10-01 17:30 ` Borislav Petkov
2020-10-02 12:23 ` Shiju Jose
2020-09-01 18:51 ` kernel test robot