RE: [RFC PATCH 0/7] RAS/CEC: Extend CEC for errors count check on short time period

From: "Luck, Tony" <tony.luck@intel.com>
To: Borislav Petkov <bp@alien8.de>, Shiju Jose <shiju.jose@huawei.com>
Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org>,
	"linux-acpi@vger.kernel.org" <linux-acpi@vger.kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"rjw@rjwysocki.net" <rjw@rjwysocki.net>,
	"james.morse@arm.com" <james.morse@arm.com>,
	"lenb@kernel.org" <lenb@kernel.org>,
	"linuxarm@huawei.com" <linuxarm@huawei.com>
Subject: RE: [RFC PATCH 0/7] RAS/CEC: Extend CEC for errors count check on short time period
Date: Fri, 2 Oct 2020 16:04:11 +0000
Message-ID: <fd12bc3222784e06bb5b0ca1d1f1e1ae@intel.com>
In-Reply-To: <20201002124352.GC17436@zn.tnic>

> Because from my x86 CPUs limited experience, the cache arrays are mostly
> fine and errors reported there are not something that happens very
> frequently so we don't even need to collect and count those.

On Intel x86 we leave the counting and threshold decisions about cache
health to the hardware. When a cache reaches the limit, it logs a "yellow"
status instead of "green" in the machine check bank (the error is still
marked as "corrected"). The mcelog(8) daemon may attempt to take the CPUs
that share that cache offline.

See Intel SDM volume 3B, "15.4 Enhanced Cache Error Reporting".
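
For illustration only, here is a minimal sketch of how software could read
that green/yellow indication from a logged bank status value. It assumes
the architectural layout as I understand it from the SDM: the
threshold-based error status sits in IA32_MCi_STATUS bits 54:53 and is
only meaningful when IA32_MCG_CAP bit 11 is set and the error is a
corrected one. The macro, enum and function names are made up for this
example; this is not the mcelog(8) or kernel code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative names; bit positions follow the architectural layout. */
#define EX_MCG_CAP_TES_P     (1ULL << 11) /* threshold status implemented */
#define EX_MCI_STATUS_VAL    (1ULL << 63) /* bank holds a valid record */
#define EX_MCI_STATUS_UC     (1ULL << 61) /* set when error was NOT corrected */
#define EX_MCI_TES_SHIFT     53           /* status field is bits 54:53 */
#define EX_MCI_TES_MASK      0x3ULL

enum ex_tes_state {
	EX_TES_NONE     = 0, /* no tracking, or field not applicable */
	EX_TES_GREEN    = 1, /* corrected errors below hardware threshold */
	EX_TES_YELLOW   = 2, /* corrected errors reached hardware threshold */
	EX_TES_RESERVED = 3,
};

/* Decode the threshold-based error status from one bank's status value. */
static inline enum ex_tes_state ex_mci_tes(uint64_t mcg_cap, uint64_t status)
{
	/* The field is only defined when the capability bit is present. */
	if (!(mcg_cap & EX_MCG_CAP_TES_P))
		return EX_TES_NONE;

	/* It only applies to valid, corrected (UC == 0) records. */
	if (!(status & EX_MCI_STATUS_VAL) || (status & EX_MCI_STATUS_UC))
		return EX_TES_NONE;

	return (enum ex_tes_state)((status >> EX_MCI_TES_SHIFT) & EX_MCI_TES_MASK);
}

/* True when the hardware says the corrected-error threshold was reached. */
static inline bool ex_mci_is_yellow(uint64_t mcg_cap, uint64_t status)
{
	return ex_mci_tes(mcg_cap, status) == EX_TES_YELLOW;
}
```

A real consumer would also look at the error code in the low bits of the
status to confirm the record is a cache error before acting on "yellow",
which this sketch leaves out.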

-Tony