linux-edac.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH 0/7] RAS/CEC: Extend CEC for errors count check on short time period
@ 2020-10-02 12:22 Shiju Jose
  2020-10-02 12:22 ` [RFC PATCH 1/7] RAS/CEC: Replace the macro PFN with ELEM_NO Shiju Jose
                   ` (7 more replies)
  0 siblings, 8 replies; 15+ messages in thread
From: Shiju Jose @ 2020-10-02 12:22 UTC (permalink / raw)
  To: linux-edac, linux-acpi, linux-kernel, bp, tony.luck, rjw,
	james.morse, lenb
  Cc: linuxarm, shiju.jose

In ARM64 hardware platforms, for example our Kunpeng platforms, CPU L1/L2
cache corrected errors are reported in the ARM processor error section.
The situations the CPU CE errors are reported too often is not unlikely
and may need to isolate that CPU core to prevent leading to more 
serious faults. This is important for the early fault prediction.

Extend the existing RAS CEC to support the errors count check on short
time period with the threshold. The decay interval is divided into a
number of time slots for these elements. The CEC calculates the average
error count for each element at the end of each decay interval. Then the
average count would be subtracted from the total count in each of the
following time slots within the decay interval. The work function for
the decay interval would be set for a reduced time period = decay
interval/ number of time slots. When the new CE count for a CPU is added,
the element would try to offline when the sum of the most recent CEs
counts exceeded the CEs threshold value. More implementation details is
added in the file.

Add collection of CPU correctable errors, for example ARM64 L1/L2 cache
errors and isolation of the CPU core when the errors count in the short
time interval exceed the threshold value.

Open Questions based on the feedback from Boris,
1. ARM processor error types are cache/TLB/bus errors.
   [Reference N2.4.4.1 ARM Processor Error Information UEFI Spec v2.8]
Any of the above error types should not be consider for the
error collection and CPU core isolation?

2.If disabling entire CPU core is not acceptable,
please suggest method to disable L1 and L2 cache on ARM64 core?

Shiju Jose (7):
  RAS/CEC: Replace the macro PFN with ELEM_NO
  RAS/CEC: Replace pfns_poisoned with elems_poisoned
  RAS/CEC: Move X86 MCE specific code under CONFIG_X86_MCE
  RAS/CEC: Modify cec_mod_work() for common use
  RAS/CEC: Add support for errors count check on short time period
  RAS/CEC: Add CPU Correctable Error Collector to isolate an erroneous
    CPU core
  ACPI / APEI: Add reporting ARM64 CPU correctable errors to the CEC

 arch/arm64/ras/Kconfig   |  17 ++
 drivers/acpi/apei/ghes.c |  36 +++-
 drivers/ras/Kconfig      |   1 +
 drivers/ras/cec.c        | 399 +++++++++++++++++++++++++++++++++++----
 include/linux/ras.h      |   9 +
 5 files changed, 418 insertions(+), 44 deletions(-)
 create mode 100644 arch/arm64/ras/Kconfig

-- 
2.17.1



^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2020-10-07 16:45 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-10-02 12:22 [RFC PATCH 0/7] RAS/CEC: Extend CEC for errors count check on short time period Shiju Jose
2020-10-02 12:22 ` [RFC PATCH 1/7] RAS/CEC: Replace the macro PFN with ELEM_NO Shiju Jose
2020-10-02 12:22 ` [RFC PATCH 2/7] RAS/CEC: Replace pfns_poisoned with elems_poisoned Shiju Jose
2020-10-02 12:22 ` [RFC PATCH 3/7] RAS/CEC: Move X86 MCE specific code under CONFIG_X86_MCE Shiju Jose
2020-10-02 12:22 ` [RFC PATCH 4/7] RAS/CEC: Modify cec_mod_work() for common use Shiju Jose
2020-10-02 12:22 ` [RFC PATCH 5/7] RAS/CEC: Add support for errors count check on short time period Shiju Jose
2020-10-02 12:22 ` [RFC PATCH 6/7] RAS/CEC: Add CPU Correctable Error Collector to isolate an erroneous CPU core Shiju Jose
2020-10-02 12:22 ` [RFC PATCH 7/7] ACPI / APEI: Add reporting ARM64 CPU correctable errors to the CEC Shiju Jose
2020-10-02 12:43 ` [RFC PATCH 0/7] RAS/CEC: Extend CEC for errors count check on short time period Borislav Petkov
2020-10-02 15:38   ` Shiju Jose
2020-10-02 17:33     ` James Morse
2020-10-02 18:02       ` Borislav Petkov
2020-10-06 16:13       ` Shiju Jose
2020-10-07 16:45         ` James Morse
2020-10-02 16:04   ` Luck, Tony

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).