linux-edac.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Shiju Jose <shiju.jose@huawei.com>
To: Borislav Petkov <bp@alien8.de>
Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org>,
	"linux-acpi@vger.kernel.org" <linux-acpi@vger.kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"james.morse@arm.com" <james.morse@arm.com>,
	"mchehab+huawei@kernel.org" <mchehab+huawei@kernel.org>,
	"tony.luck@intel.com" <tony.luck@intel.com>,
	"rjw@rjwysocki.net" <rjw@rjwysocki.net>,
	"lenb@kernel.org" <lenb@kernel.org>,
	"rrichter@marvell.com" <rrichter@marvell.com>,
	Jonathan Cameron <jonathan.cameron@huawei.com>,
	tanxiaofei <tanxiaofei@huawei.com>,
	"linuxarm@openeuler.org" <linuxarm@openeuler.org>
Subject: RE: [RFC PATCH 1/2] EDAC/ghes: Add EDAC device for the CPU caches
Date: Fri, 15 Jan 2021 11:06:30 +0000	[thread overview]
Message-ID: <a5745b56831c461bbb2cde4afc7ee295@huawei.com> (raw)
In-Reply-To: <20201231164409.GC4504@zn.tnic>

Hi Boris,

Thanks for the feedback.
Apologies for the delay.

>-----Original Message-----
>From: Borislav Petkov [mailto:bp@alien8.de]
>Sent: 31 December 2020 16:44
>To: Shiju Jose <shiju.jose@huawei.com>
>Cc: linux-edac@vger.kernel.org; linux-acpi@vger.kernel.org; linux-
>kernel@vger.kernel.org; james.morse@arm.com;
>mchehab+huawei@kernel.org; tony.luck@intel.com; rjw@rjwysocki.net;
>lenb@kernel.org; rrichter@marvell.com; Linuxarm <linuxarm@huawei.com>;
>xuwei (O) <xuwei5@huawei.com>; Jonathan Cameron
><jonathan.cameron@huawei.com>; John Garry <john.garry@huawei.com>;
>tanxiaofei <tanxiaofei@huawei.com>; Shameerali Kolothum Thodi
><shameerali.kolothum.thodi@huawei.com>; Salil Mehta
><salil.mehta@huawei.com>
>Subject: Re: [RFC PATCH 1/2] EDAC/ghes: Add EDAC device for the CPU
>caches
>
>On Tue, Dec 08, 2020 at 05:29:58PM +0000, Shiju Jose wrote:
>> The corrected error count on the CPU caches required reporting to the
>> user-space for the predictive failure analysis. For this purpose, add
>> an EDAC device and device blocks for the CPU caches found.
>> The cache's corrected error count would be stored in the
>> /sys/devices/system/edac/cpu/cpu*/cache*/ce_count.
>
>This still doesn't begin to explain why the kernel needs this. I had already
>asked whether errors in CPU caches are something which happen often
>enough so that software should count them but nothing came. So pls justify
>first why this wants to be added to the kernel.

L2 cache corrected errors are detected occasionally on few of
our ARM64 hardware boards. Though it is rare, the probability of
the CPU cache errors frequently occurring can't be avoided.
The earlier failure detection by monitoring the cache corrected
errors for the frequent occurrences and taking preventive
action could prevent more serious hardware faults.

On Intel architectures, cache corrected errors are reported and
the affected cores are offline in the architecture specific method.
http://www.mcelog.org/cache.html

However for the firmware-first error reporting, specifically on
ARM64 architectures, there is no provision present for reporting
the cache corrected error count to the user-space and taking
preventive action such as offline the affected cores.
>
>> diff --git a/drivers/edac/Kconfig b/drivers/edac/Kconfig index
>> 7a47680d6f07..c73eeab27ac9 100644
>> --- a/drivers/edac/Kconfig
>> +++ b/drivers/edac/Kconfig
>> @@ -74,6 +74,16 @@ config EDAC_GHES
>>
>>  	  In doubt, say 'Y'.
>>
>> +config EDAC_GHES_CPU_ERROR
>> +	bool "EDAC device for reporting firmware-first BIOS detected CPU
>error count"
>
>Why a separate Kconfig item?
CONFIG_EDAC_GHES_CPU_CACHE_ERROR is added to make this
feature optional only for the platforms which need this and supported.

>
>> +	depends on EDAC_GHES
>> +	help
>> +	  EDAC device for the firmware-first BIOS detected CPU error count
>> +reported
>
>Well this is not what it is doing - you're talking about cache errors.
>"CPU errors" can be a lot more than just cache errors.
Sure. I will change.

>
>> +static void ghes_edac_create_cpu_device(struct device *dev) {
>> +	int cpu;
>> +	struct cpu_cacheinfo *this_cpu_ci;
>> +
>> +	/*
>> +	 * Find the maximum number of caches present in the CPU heirarchy
>> +	 * among the online CPUs.
>> +	 */
>> +	for_each_online_cpu(cpu) {
>> +		this_cpu_ci = get_cpu_cacheinfo(cpu);
>> +		if (!this_cpu_ci)
>> +			continue;
>> +		if (max_number_of_caches < this_cpu_ci->num_leaves)
>> +			max_number_of_caches = this_cpu_ci->num_leaves;
>
>So this is counting the number of cache levels on the system? So you want to
>count the errors per cache levels?
Yes. This was the suggestion from James and to offline the affected cores for the shared cache.

>
>> +	}
>> +	if (!max_number_of_caches)
>> +		return;
>> +
>> +	/*
>> +	 * EDAC device interface only supports creating the CPU cache
>hierarchy for alls
>> +	 * the CPUs together. Thus need to allocate cpu_edac_block_list for
>the
>> +	 * max_number_of_caches among all the CPUs irrespective of the
>number of caches
>> +	 * per CPU might vary.
>> +	 */
>
>So this is lumping all the caches together into a single list? What for?
>To untangle to the proper ones when the error gets reported?
>
>Have you heard of percpu variables?
Yes. Changed the list to the percpu variable.

>
>> @@ -624,6 +787,10 @@ int ghes_edac_register(struct ghes *ghes, struct
>device *dev)
>>  	ghes_pvt = pvt;
>>  	spin_unlock_irqrestore(&ghes_lock, flags);
>>
>> +#if defined(CONFIG_EDAC_GHES_CPU_ERROR)
>> +	ghes_edac_create_cpu_device(dev);
>> +#endif
>> +
>
>Init stuff belongs into ghes_scan_system().
>
Did you mean calling  ghes_edac_create_cpu_device() in the ghes_scan_system()?

>...
>
>Ok, I've seen enough. "required reporting to the user-space for the predictive
>failure analysis." is not even trying to explain *why* you're doing this, what
>*actual* problem it is addressing and why should the kernel get it.
>
>And without a proper problem definition of what you're trying to solve, this
>is not going anywhere.
>
>--
>Regards/Gruss,
>    Boris.
>

Thanks,
Shiju

  reply	other threads:[~2021-01-15 11:07 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-12-08 17:29 [RFC PATCH 0/2] EDAC/ghes: Add EDAC device for recording the CPU error count Shiju Jose
2020-12-08 17:29 ` [RFC PATCH 1/2] EDAC/ghes: Add EDAC device for the CPU caches Shiju Jose
2020-12-31 16:44   ` Borislav Petkov
2021-01-15 11:06     ` Shiju Jose [this message]
2021-01-18 18:36       ` Borislav Petkov
2021-01-19 10:04         ` Shiju Jose
2021-01-19 10:16           ` Borislav Petkov
2020-12-08 17:29 ` [RFC PATCH 2/2] ACPI / APEI: Add reporting ARM64 CPU cache corrected error count Shiju Jose

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=a5745b56831c461bbb2cde4afc7ee295@huawei.com \
    --to=shiju.jose@huawei.com \
    --cc=bp@alien8.de \
    --cc=james.morse@arm.com \
    --cc=jonathan.cameron@huawei.com \
    --cc=lenb@kernel.org \
    --cc=linux-acpi@vger.kernel.org \
    --cc=linux-edac@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linuxarm@openeuler.org \
    --cc=mchehab+huawei@kernel.org \
    --cc=rjw@rjwysocki.net \
    --cc=rrichter@marvell.com \
    --cc=tanxiaofei@huawei.com \
    --cc=tony.luck@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).