[PATCH v2 0/2] ARM Error Source Table V1 Support

* [PATCH v2 0/2] ARM Error Source Table V1 Support
@ 2024-03-21  2:53 Ruidong Tian
  2024-03-21  2:53 ` [PATCH v2 1/2] ACPI/AEST: Initial AEST driver Ruidong Tian
  2024-03-21  2:53 ` [PATCH v2 2/2] trace, ras: add ARM RAS extension trace event Ruidong Tian
  0 siblings, 2 replies; 7+ messages in thread
From: Ruidong Tian @ 2024-03-21  2:53 UTC (permalink / raw)
  To: catalin.marinas, will, lpieralisi, guohanjun, sudeep.holla,
	xueshuai, baolin.wang, linux-kernel, linux-acpi,
	linux-arm-kernel, rafael, lenb, tony.luck, bp, linux-edac
  Cc: tianruidond, Ruidong Tian

This series adds support for the ARM Error Source Table (AEST) based on
version 1.1 of the ACPI for the Armv8 RAS Extensions specification [0].

The Arm Error Source Table (AEST) enables kernel-first handling of errors
in a system that supports the Armv8 RAS extensions. In kernel-first mode,
the kernel controls almost all RAS configuration, including the CE
threshold and interrupt enable/disable. A hardware error triggers a RAS
interrupt to the kernel, which scans all AEST nodes in irq context to find
the node that reported the error and then processes it. The kernel acts
as follows for the different error types:
  - CE, DE: use a workqueue to log the hardware error.
  - UER, UEO: call memory_failure.
  - UC, UEU: panic.
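
For readers skimming the policy above, here is a minimal userspace sketch
of the error-type-to-action mapping. It is an illustration only, not code
from the patches: the enum names and aest_pick_action() are hypothetical,
and the real driver's types and control flow may differ.

```c
/*
 * Hypothetical model of the policy in the cover letter; none of these
 * identifiers are taken from the actual driver.
 */
enum aest_err_type {
	AEST_ERR_CE,	/* corrected error */
	AEST_ERR_DE,	/* deferred error */
	AEST_ERR_UEO,	/* uncorrected, latent or restartable */
	AEST_ERR_UER,	/* uncorrected, signaled or recoverable */
	AEST_ERR_UEU,	/* uncorrected, unrecoverable */
	AEST_ERR_UC,	/* uncorrected, uncontainable */
};

enum aest_action {
	AEST_ACT_LOG_DEFERRED,		/* log later from a workqueue */
	AEST_ACT_MEMORY_FAILURE,	/* hand the page to memory_failure */
	AEST_ACT_PANIC,			/* stop the system */
};

/* Map an error type to the action listed for it in the cover letter. */
enum aest_action aest_pick_action(enum aest_err_type type)
{
	switch (type) {
	case AEST_ERR_CE:
	case AEST_ERR_DE:
		return AEST_ACT_LOG_DEFERRED;
	case AEST_ERR_UEO:
	case AEST_ERR_UER:
		return AEST_ACT_MEMORY_FAILURE;
	case AEST_ERR_UEU:
	case AEST_ERR_UC:
	default:
		/* Assumption: treat anything unknown as fatal. */
		return AEST_ACT_PANIC;
	}
}
```

Treating unknown values as fatal is a choice made for this sketch; the
cover letter does not say how the driver handles unrecognized types.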

I have tested this series on a PTG Yitian710 SoC. Both corrected and
uncorrected errors were tested to verify the non-fatal vs fatal
scenarios.

Future work:
1. Add CE storm mitigation.
2. Support AEST V2.

This series is based on Tyler Baicar's patches [1], for which no v2 has
been sent to the mailing list yet. Changes from the original patches:
1. Add a genpool to collect all AEST errors, and log them in a workqueue
rather than in irq context.
2. Use a single aest_proc function for both the system register interface
and the MMIO interface.
3. Restructure some structures and functions to make them clearer.
4. Address all review comments from the mailing list thread for Tyler
Baicar's patches.
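
The idea behind collecting errors in a preallocated pool and logging them
later can be sketched in userspace as below. This is a simplified,
single-threaded model with hypothetical names (err_ring, err_record,
ERR_RING_SLOTS). It uses a fixed ring buffer in place of the kernel
genpool and a plain pop loop in place of the workqueue, and it has no
locking or memory barriers, so it is not how the driver itself is written.

```c
#include <stddef.h>
#include <stdint.h>

#define ERR_RING_SLOTS 8	/* hypothetical capacity */

/* Hypothetical error record; the real driver's fields may differ. */
struct err_record {
	uint32_t node_id;
	uint64_t status;
};

/* Preallocated storage, so the producer never allocates or blocks. */
struct err_ring {
	struct err_record slots[ERR_RING_SLOTS];
	size_t head;	/* count of records ever pushed */
	size_t tail;	/* count of records ever popped */
};

/*
 * Producer side, standing in for irq context: copy the record into a
 * free slot. Returns 0 on success, -1 if the ring is full.
 */
int err_ring_push(struct err_ring *r, const struct err_record *rec)
{
	if (r->head - r->tail == ERR_RING_SLOTS)
		return -1;
	r->slots[r->head % ERR_RING_SLOTS] = *rec;
	r->head++;
	return 0;
}

/*
 * Consumer side, standing in for the deferred worker: take the oldest
 * record. Returns 0 on success, -1 if the ring is empty.
 */
int err_ring_pop(struct err_ring *r, struct err_record *out)
{
	if (r->tail == r->head)
		return -1;
	*out = r->slots[r->tail % ERR_RING_SLOTS];
	r->tail++;
	return 0;
}
```

A worker would call err_ring_pop() in a loop until it returns -1 and log
each record outside the interrupt path.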

Changes from V1:
https://lore.kernel.org/all/20240304111517.33001-1-tianruidong@linux.alibaba.com/
1. Marc Zyngier
  - Use readq_relaxed/writeq_relaxed instead of readq/writeq for MMIO accesses.
  - Add sync for system register operation.
  - Use irq_is_percpu_devid() helper to identify a per-CPU interrupt.
  - Other fixes.
2. Set RAS CE threshold in AEST driver.
3. Enable RAS interrupt explicitly in driver.
4. UER and UEO trigger memory_failure rather than panic.

[0]: https://developer.arm.com/documentation/den0085/0101/
[1]: https://lore.kernel.org/all/20211124170708.3874-1-baicar@os.amperecomputing.com/

Tyler Baicar (2):
  ACPI/AEST: Initial AEST driver
  trace, ras: add ARM RAS extension trace event

 MAINTAINERS                  |  11 +
 arch/arm64/include/asm/ras.h |  71 +++
 drivers/acpi/arm64/Kconfig   |  10 +
 drivers/acpi/arm64/Makefile  |   1 +
 drivers/acpi/arm64/aest.c    | 839 +++++++++++++++++++++++++++++++++++
 include/linux/acpi_aest.h    |  92 ++++
 include/linux/cpuhotplug.h   |   1 +
 include/ras/ras_event.h      |  55 +++
 8 files changed, 1080 insertions(+)
 create mode 100644 arch/arm64/include/asm/ras.h
 create mode 100644 drivers/acpi/arm64/aest.c
 create mode 100644 include/linux/acpi_aest.h

-- 
2.33.1
