Linux-ARM-Kernel Archive on lore.kernel.org
 help / color / Atom feed
* [PATCH RFC 0/4] ARM Error Source Table Support
@ 2019-07-02 16:51 Tyler Baicar OS
  2019-07-02 16:51 ` [PATCH RFC 1/4] ACPI/AEST: Initial AEST driver Tyler Baicar OS
                   ` (3 more replies)
  0 siblings, 4 replies; 14+ messages in thread
From: Tyler Baicar OS @ 2019-07-02 16:51 UTC (permalink / raw)
  To: Open Source Submission, linux-arm-kernel, linux-kernel,
	linux-acpi, linux-edac, james.morse, catalin.marinas, will,
	lorenzo.pieralisi, guohanjun, sudeep.holla, rjw, lenb,
	mark.rutland, tony.luck, bp, Matteo.Carlini, Andrew.Murray
  Cc: Tyler Baicar OS

This series adds support for the ARM Error Source Table (AEST) based on
the latest version of the AEST from ARM [0].

The AEST driver supports both memory mapped and system register interfaces.
This series assumes system register interfaces are only registered with
private peripheral interrupts (PPIs); otherwise there is no guarantee the
core handling the error is the core which took the error and has the
syndrome info in it's system registers.

SEAs and SEIs could also have syndrome information present in the RAS
extension system registers. That handling is tied into the system register
handling code.

This is meant to be initial support for AEST to address the current gaps
with systems that support ARMv8.2 RAS extensions but don't have
firmware-first support. This series simply logs all the errors it finds
and triggers a kernel panic if there is an UE present.

Future work:
- UER handling to avoid panic
- Looping through all external abort capable (ERR<n>FR.UE != 0) error
   nodes in SEA/SEI handling
- ARMv8.4 extension support

[0] https://static.docs.arm.com/den0085/a/DEN0085_RAS_ACPI_1.0_BETA_1.pdf

Tyler Baicar (4):
  ACPI/AEST: Initial AEST driver
  arm64: mm: Add RAS extension system register check to SEA handling
  arm64: traps: Add RAS extension system register check to serror
    handling
  trace, ras: add ARM RAS extension trace event

 arch/arm64/include/asm/ras.h |  41 +++++
 arch/arm64/kernel/Makefile   |   2 +-
 arch/arm64/kernel/ras.c      |  70 +++++++++
 arch/arm64/kernel/traps.c    |   3 +
 arch/arm64/mm/fault.c        |   3 +
 drivers/acpi/arm64/Kconfig   |   3 +
 drivers/acpi/arm64/Makefile  |   1 +
 drivers/acpi/arm64/aest.c    | 366 +++++++++++++++++++++++++++++++++++++++++++
 include/linux/acpi_aest.h    |  94 +++++++++++
 include/ras/ras_event.h      |  46 ++++++
 10 files changed, 628 insertions(+), 1 deletion(-)
 create mode 100644 arch/arm64/include/asm/ras.h
 create mode 100644 arch/arm64/kernel/ras.c
 create mode 100644 drivers/acpi/arm64/aest.c
 create mode 100644 include/linux/acpi_aest.h

-- 
1.8.3.1


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH RFC 1/4] ACPI/AEST: Initial AEST driver
  2019-07-02 16:51 [PATCH RFC 0/4] ARM Error Source Table Support Tyler Baicar OS
@ 2019-07-02 16:51 ` Tyler Baicar OS
  2019-07-03  9:25   ` Andrew Murray
  2019-07-04 16:02   ` Shiju Jose
  2019-07-02 16:51 ` [PATCH RFC 2/4] arm64: mm: Add RAS extension system register check to SEA handling Tyler Baicar OS
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 14+ messages in thread
From: Tyler Baicar OS @ 2019-07-02 16:51 UTC (permalink / raw)
  To: Open Source Submission, linux-arm-kernel, linux-kernel,
	linux-acpi, linux-edac, james.morse, catalin.marinas, will,
	lorenzo.pieralisi, guohanjun, sudeep.holla, rjw, lenb,
	mark.rutland, tony.luck, bp, Matteo.Carlini, Andrew.Murray
  Cc: Tyler Baicar OS

Add support for parsing the ARM Error Source Table and basic handling of
errors reported through both memory mapped and system register interfaces.

Signed-off-by: Tyler Baicar <baicar@os.amperecomputing.com>
---
 arch/arm64/include/asm/ras.h |  41 +++++
 arch/arm64/kernel/Makefile   |   2 +-
 arch/arm64/kernel/ras.c      |  67 ++++++++
 drivers/acpi/arm64/Kconfig   |   3 +
 drivers/acpi/arm64/Makefile  |   1 +
 drivers/acpi/arm64/aest.c    | 362 +++++++++++++++++++++++++++++++++++++++++++
 include/linux/acpi_aest.h    |  94 +++++++++++
 7 files changed, 569 insertions(+), 1 deletion(-)
 create mode 100644 arch/arm64/include/asm/ras.h
 create mode 100644 arch/arm64/kernel/ras.c
 create mode 100644 drivers/acpi/arm64/aest.c
 create mode 100644 include/linux/acpi_aest.h

diff --git a/arch/arm64/include/asm/ras.h b/arch/arm64/include/asm/ras.h
new file mode 100644
index 0000000..36bfff4
--- /dev/null
+++ b/arch/arm64/include/asm/ras.h
@@ -0,0 +1,41 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __ASM_RAS_H
+#define __ASM_RAS_H
+
+#define ERR_STATUS_AV		BIT(31)
+#define ERR_STATUS_V		BIT(30)
+#define ERR_STATUS_UE		BIT(29)
+#define ERR_STATUS_ER		BIT(28)
+#define ERR_STATUS_OF		BIT(27)
+#define ERR_STATUS_MV		BIT(26)
+#define ERR_STATUS_CE_SHIFT	24
+#define ERR_STATUS_CE_MASK	0x3
+#define ERR_STATUS_DE		BIT(23)
+#define ERR_STATUS_PN		BIT(22)
+#define ERR_STATUS_UET_SHIFT	20
+#define ERR_STATUS_UET_MASK	0x3
+#define ERR_STATUS_IERR_SHIFT	8
+#define ERR_STATUS_IERR_MASK	0xff
+#define ERR_STATUS_SERR_SHIFT	0
+#define ERR_STATUS_SERR_MASK	0xff
+
+#define ERR_FR_CEC_SHIFT	12
+#define ERR_FR_CEC_MASK		0x7
+
+#define ERR_FR_8B_CEC		BIT(1)
+#define ERR_FR_16B_CEC		BIT(2)
+
+struct ras_ext_regs {
+	u64 err_fr;
+	u64 err_ctlr;
+	u64 err_status;
+	u64 err_addr;
+	u64 err_misc0;
+	u64 err_misc1;
+	u64 err_misc2;
+	u64 err_misc3;
+};
+
+void arch_arm_ras_report_error(void);
+
+#endif	/* __ASM_RAS_H */
diff --git a/arch/arm64/kernel/Makefile b/arch/arm64/kernel/Makefile
index 9e7dcb2..294f602 100644
--- a/arch/arm64/kernel/Makefile
+++ b/arch/arm64/kernel/Makefile
@@ -19,7 +19,7 @@ obj-y			:= debug-monitors.o entry.o irq.o fpsimd.o		\
 			   return_address.o cpuinfo.o cpu_errata.o		\
 			   cpufeature.o alternative.o cacheinfo.o		\
 			   smp.o smp_spin_table.o topology.o smccc-call.o	\
-			   syscall.o
+			   syscall.o ras.o
 
 extra-$(CONFIG_EFI)			:= efi-entry.o
 
diff --git a/arch/arm64/kernel/ras.c b/arch/arm64/kernel/ras.c
new file mode 100644
index 0000000..ca47efa
--- /dev/null
+++ b/arch/arm64/kernel/ras.c
@@ -0,0 +1,67 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/kernel.h>
+#include <linux/cpu.h>
+#include <linux/smp.h>
+
+#include <asm/ras.h>
+
+void arch_arm_ras_report_error(void)
+{
+	u64 num_records;
+	unsigned int i, cpu_num;
+	bool fatal = false;
+	struct ras_ext_regs regs;
+
+	if (!this_cpu_has_cap(ARM64_HAS_RAS_EXTN))
+		return;
+
+	cpu_num = get_cpu();
+	num_records = read_sysreg_s(SYS_ERRIDR_EL1);
+
+	for (i = 0; i < num_records; i++) {
+		write_sysreg_s(i, SYS_ERRSELR_EL1);
+		regs.err_status = read_sysreg_s(SYS_ERXSTATUS_EL1);
+
+		if (!(regs.err_status & ERR_STATUS_V))
+			continue;
+
+		pr_err("CPU%u: ERR%uSTATUS: 0x%llx\n", cpu_num, i,
+		       regs.err_status);
+
+		if (regs.err_status & ERR_STATUS_AV) {
+			regs.err_addr = read_sysreg_s(SYS_ERXSTATUS_EL1);
+			pr_err("CPU%u: ERR%uADDR: 0x%llx\n", cpu_num, i,
+			       regs.err_addr);
+		} else
+			regs.err_addr = 0;
+
+		regs.err_fr = read_sysreg_s(SYS_ERXFR_EL1);
+		pr_err("CPU%u: ERR%uFR: 0x%llx\n", cpu_num, i, regs.err_fr);
+		regs.err_ctlr = read_sysreg_s(SYS_ERXCTLR_EL1);
+		pr_err("CPU%u: ERR%uCTLR: 0x%llx\n", cpu_num, i, regs.err_ctlr);
+
+		if (regs.err_status & ERR_STATUS_MV) {
+			regs.err_misc0 = read_sysreg_s(SYS_ERXMISC0_EL1);
+			pr_err("CPU%u: ERR%uMISC0: 0x%llx\n", cpu_num, i,
+			       regs.err_misc0);
+			regs.err_misc1 = read_sysreg_s(SYS_ERXMISC1_EL1);
+			pr_err("CPU%u: ERR%uMISC1: 0x%llx\n", cpu_num, i,
+			       regs.err_misc1);
+		}
+
+		/*
+		 * In the future, we will treat UER conditions as potentially
+		 * recoverable.
+		 */
+		if (regs.err_status & ERR_STATUS_UE)
+			fatal = true;
+
+		write_sysreg_s(regs.err_status, SYS_ERXSTATUS_EL1);
+	}
+
+	if (fatal)
+		panic("uncorrectable error encountered");
+
+	put_cpu();
+}
diff --git a/drivers/acpi/arm64/Kconfig b/drivers/acpi/arm64/Kconfig
index 6dba187..8d5cf99 100644
--- a/drivers/acpi/arm64/Kconfig
+++ b/drivers/acpi/arm64/Kconfig
@@ -8,3 +8,6 @@ config ACPI_IORT
 
 config ACPI_GTDT
 	bool
+
+config ACPI_AEST
+	bool "ARM Error Source Table Support"
diff --git a/drivers/acpi/arm64/Makefile b/drivers/acpi/arm64/Makefile
index 6ff50f4..ea1ba28 100644
--- a/drivers/acpi/arm64/Makefile
+++ b/drivers/acpi/arm64/Makefile
@@ -1,3 +1,4 @@
 # SPDX-License-Identifier: GPL-2.0-only
 obj-$(CONFIG_ACPI_IORT) 	+= iort.o
 obj-$(CONFIG_ACPI_GTDT) 	+= gtdt.o
+obj-$(CONFIG_ACPI_AEST) 	+= aest.o
diff --git a/drivers/acpi/arm64/aest.c b/drivers/acpi/arm64/aest.c
new file mode 100644
index 0000000..fd4f3b5
--- /dev/null
+++ b/drivers/acpi/arm64/aest.c
@@ -0,0 +1,362 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/* ARM Error Source Table Support */
+
+#include <linux/acpi.h>
+#include <linux/acpi_aest.h>
+#include <linux/init.h>
+#include <linux/interrupt.h>
+#include <linux/io.h>
+#include <linux/irq.h>
+#include <linux/kernel.h>
+#include <linux/percpu.h>
+#include <linux/ratelimit.h>
+
+#include <asm/ras.h>
+
+#undef pr_fmt
+#define pr_fmt(fmt) "ACPI AEST: " fmt
+
+static struct acpi_table_header *aest_table;
+
+static struct aest_node_data __percpu **ppi_data;
+static u8 num_ppi;
+static u8 ppi_idx;
+
+static void aest_print(struct aest_node_data *data, struct ras_ext_regs regs,
+		       int index)
+{
+	/* No more than 2 corrected messages every 5 seconds */
+	static DEFINE_RATELIMIT_STATE(ratelimit_corrected, 5*HZ, 2);
+
+	if (regs.err_status & ERR_STATUS_UE ||
+	    regs.err_status & ERR_STATUS_DE ||
+	    __ratelimit(&ratelimit_corrected)) {
+		switch (data->node_type) {
+		case AEST_NODE_TYPE_PROC:
+			pr_err("error from processor 0x%x\n",
+			       data->data.proc.id);
+			break;
+		case AEST_NODE_TYPE_MEM:
+			pr_err("error from memory domain 0x%x\n",
+			       data->data.mem.domain);
+			break;
+		case AEST_NODE_TYPE_VENDOR:
+			pr_err("error from vendor specific source 0x%x\n",
+			       data->data.vendor.id);
+		}
+
+		pr_err("ERR%dSTATUS = 0x%llx\n", index, regs.err_status);
+		if (regs.err_status & ERR_STATUS_AV)
+			pr_err("ERR%dADDR = 0x%llx\n", index, regs.err_addr);
+
+		pr_err("ERR%dFR = 0x%llx\n", index, regs.err_fr);
+		pr_err("ERR%dCTLR = 0x%llx\n", index, regs.err_ctlr);
+
+		if (regs.err_status & ERR_STATUS_MV) {
+			pr_err("ERR%dMISC0 = 0x%llx\n", index, regs.err_misc0);
+			pr_err("ERR%dMISC1 = 0x%llx\n", index, regs.err_misc1);
+		}
+	}
+}
+
+static void aest_proc(struct aest_node_data *data)
+{
+	struct ras_ext_regs *regs_p, regs;
+	int i;
+	bool fatal = false;
+
+	/*
+	 * Currently SR based handling is done through the architected
+	 * discovery exposed through SRs. That may change in the future
+	 * if there is supplemental information in the AEST that is
+	 * needed.
+	 */
+	if (data->interface.type == AEST_SYSTEM_REG_INTERFACE) {
+		arch_arm_ras_report_error();
+		return;
+	}
+
+	regs_p = data->interface.regs;
+
+	for (i = data->interface.start; i < data->interface.end; i++) {
+		regs.err_status = readq(&regs_p[i].err_status);
+		if (!(regs.err_status & ERR_STATUS_V))
+			continue;
+
+		if (regs.err_status & ERR_STATUS_AV)
+			regs.err_addr = readq(&regs_p[i].err_addr);
+		else
+			regs.err_addr = 0;
+
+		regs.err_fr = readq(&regs_p[i].err_fr);
+		regs.err_ctlr = readq(&regs_p[i].err_ctlr);
+
+		if (regs.err_status & ERR_STATUS_MV) {
+			regs.err_misc0 = readq(&regs_p[i].err_misc0);
+			regs.err_misc1 = readq(&regs_p[i].err_misc1);
+		} else {
+			regs.err_misc0 = 0;
+			regs.err_misc1 = 0;
+		}
+
+		aest_print(data, regs, i);
+
+		if (regs.err_status & ERR_STATUS_UE)
+			fatal = true;
+
+		writeq(regs.err_status, &regs_p[i].err_status);
+	}
+
+	if (fatal)
+		panic("AEST: uncorrectable error encountered");
+
+}
+
+static irqreturn_t aest_irq_func(int irq, void *input)
+{
+	struct aest_node_data *data = input;
+
+	aest_proc(data);
+
+	return IRQ_HANDLED;
+}
+
+static int __init aest_register_gsi(u32 gsi, int trigger, void *data)
+{
+	int cpu, irq;
+
+	irq = acpi_register_gsi(NULL, gsi, trigger, ACPI_ACTIVE_HIGH);
+
+	if (irq == -EINVAL) {
+		pr_err("failed to map AEST GSI %d\n", gsi);
+		return -EINVAL;
+	}
+
+	if (gsi < 16) {
+		pr_err("invalid GSI %d\n", gsi);
+		return -EINVAL;
+	} else if (gsi < 32) {
+		if (ppi_idx >= AEST_MAX_PPI) {
+			pr_err("Unable to register PPI %d\n", gsi);
+			return -EINVAL;
+		}
+		enable_percpu_irq(irq, IRQ_TYPE_NONE);
+		for_each_possible_cpu(cpu) {
+			memcpy(per_cpu_ptr(ppi_data[ppi_idx], cpu), data,
+			       sizeof(struct aest_node_data));
+		}
+		if (request_percpu_irq(irq, aest_irq_func, "AEST",
+				       ppi_data[ppi_idx++])) {
+			pr_err("failed to register AEST IRQ %d\n", irq);
+			return -EINVAL;
+		}
+	} else if (gsi < 1020) {
+		if (request_irq(irq, aest_irq_func, IRQF_SHARED, "AEST",
+				data)) {
+			pr_err("failed to register AEST IRQ %d\n", irq);
+			return -EINVAL;
+		}
+	} else {
+		pr_err("invalid GSI %d\n", gsi);
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static int __init aest_init_interrupts(struct aest_type_header *node,
+				       struct aest_node_data *data)
+{
+	struct aest_interrupt *interrupt;
+	int i, trigger, ret = 0;
+
+	interrupt = ACPI_ADD_PTR(struct aest_interrupt, node,
+				 node->interrupt_offset);
+
+	for (i = 0; i < node->interrupt_size; i++, interrupt++) {
+		trigger = (interrupt->flags & AEST_INTERRUPT_MODE) ?
+			  ACPI_LEVEL_SENSITIVE : ACPI_EDGE_SENSITIVE;
+		if (aest_register_gsi(interrupt->gsiv, trigger, data))
+			ret = -EINVAL;
+	}
+
+	return ret;
+}
+
+static int __init aest_init_interface(struct aest_type_header *node,
+				       struct aest_node_data *data)
+{
+	struct aest_interface *interface;
+	struct resource *res;
+	int size;
+
+	interface = ACPI_ADD_PTR(struct aest_interface, node,
+				 node->interface_offset);
+
+	if (interface->type > AEST_MEMORY_MAPPED_INTERFACE) {
+		pr_err("invalid interface type: %d\n", interface->type);
+		return -EINVAL;
+	}
+
+	data->interface.type = interface->type;
+
+	/*
+	 * Currently SR based handling is done through the architected
+	 * discovery exposed through SRs. That may change in the future
+	 * if there is supplemental information in the AEST that is
+	 * needed.
+	 */
+	if (interface->type == AEST_SYSTEM_REG_INTERFACE)
+		return 0;
+
+	res = kzalloc(sizeof(struct resource), GFP_KERNEL);
+	if (!res)
+		return -ENOMEM;
+
+	size = interface->num_records * sizeof(struct ras_ext_regs);
+	res->name = "AEST";
+	res->start = interface->address;
+	res->end = res->start + size;
+	res->flags = IORESOURCE_MEM;
+	if (request_resource_conflict(&iomem_resource, res)) {
+		pr_err("unable to request region starting at 0x%llx\n",
+			res->start);
+		kfree(res);
+		return -EEXIST;
+	}
+
+	data->interface.start = interface->start_index;
+	data->interface.end = interface->start_index + interface->num_records;
+
+	data->interface.regs = ioremap(interface->address, size);
+	if (data->interface.regs == NULL)
+		return -EINVAL;
+
+	return 0;
+}
+
+static int __init aest_init_node(struct aest_type_header *node)
+{
+	struct aest_node_data *data;
+	union aest_node_spec *node_spec;
+	int ret;
+
+	data = kzalloc(sizeof(struct aest_node_data), GFP_KERNEL);
+	if (!data)
+		return -ENOMEM;
+
+	data->node_type = node->type;
+
+	node_spec = ACPI_ADD_PTR(union aest_node_spec, node, node->data_offset);
+
+	switch (node->type) {
+	case AEST_NODE_TYPE_PROC:
+		memcpy(&data->data, node_spec, sizeof(struct aest_proc_data));
+		break;
+	case AEST_NODE_TYPE_MEM:
+		memcpy(&data->data, node_spec, sizeof(struct aest_mem_data));
+		break;
+	case AEST_NODE_TYPE_VENDOR:
+		memcpy(&data->data, node_spec, sizeof(struct aest_vendor_data));
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	ret = aest_init_interface(node, data);
+	if (ret) {
+		kfree(data);
+		return ret;
+	}
+
+	return aest_init_interrupts(node, data);
+}
+
+static void aest_count_ppi(struct aest_type_header *node)
+{
+	struct aest_interrupt *interrupt;
+	int i;
+
+	interrupt = ACPI_ADD_PTR(struct aest_interrupt, node,
+				 node->interrupt_offset);
+
+	for (i = 0; i < node->interrupt_size; i++, interrupt++) {
+		if (interrupt->gsiv >= 16 && interrupt->gsiv < 32)
+			num_ppi++;
+	}
+
+}
+
+int __init acpi_aest_init(void)
+{
+	struct acpi_table_aest *aest;
+	struct aest_type_header *aest_node, *aest_end;
+	int i, ret = 0;
+
+	if (acpi_disabled)
+		return 0;
+
+	if (ACPI_FAILURE(acpi_get_table(ACPI_SIG_AEST, 0, &aest_table)))
+		return -EINVAL;
+
+	aest = (struct acpi_table_aest *)aest_table;
+
+	/* Get the first AEST node */
+	aest_node = ACPI_ADD_PTR(struct aest_type_header, aest,
+				 sizeof(struct acpi_table_aest));
+	/* Pointer to the end of the AEST table */
+	aest_end = ACPI_ADD_PTR(struct aest_type_header, aest,
+				aest_table->length);
+
+	while (aest_node < aest_end) {
+		if (((u64)aest_node + aest_node->length) > (u64)aest_end) {
+			pr_err("AEST node pointer overflow, bad table\n");
+			return -EINVAL;
+		}
+
+		aest_count_ppi(aest_node);
+
+		aest_node = ACPI_ADD_PTR(struct aest_type_header, aest_node,
+					 aest_node->length);
+	}
+
+	if (num_ppi > AEST_MAX_PPI) {
+		pr_err("Limiting PPI support to %d PPIs\n", AEST_MAX_PPI);
+		num_ppi = AEST_MAX_PPI;
+	}
+
+	ppi_data = kcalloc(num_ppi, sizeof(struct aest_node_data *),
+			   GFP_KERNEL);
+
+	for (i = 0; i < num_ppi; i++) {
+		ppi_data[i] = alloc_percpu(struct aest_node_data);
+		if (!ppi_data[i]) {
+			ret = -ENOMEM;
+			break;
+		}
+	}
+
+	if (ret) {
+		pr_err("Failed percpu allocation\n");
+		for (i = 0; i < num_ppi; i++)
+			free_percpu(ppi_data[i]);
+		return ret;
+	}
+
+	aest_node = ACPI_ADD_PTR(struct aest_type_header, aest,
+				 sizeof(struct acpi_table_aest));
+
+	while (aest_node < aest_end) {
+		ret = aest_init_node(aest_node);
+		if (ret)
+			pr_err("failed to init node: %d", ret);
+
+		aest_node = ACPI_ADD_PTR(struct aest_type_header, aest_node,
+					 aest_node->length);
+	}
+
+	return 0;
+}
+
+early_initcall(acpi_aest_init);
diff --git a/include/linux/acpi_aest.h b/include/linux/acpi_aest.h
new file mode 100644
index 0000000..376122b
--- /dev/null
+++ b/include/linux/acpi_aest.h
@@ -0,0 +1,94 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef AEST_H
+#define AEST_H
+
+#include <acpi/actbl.h>
+
+#define ACPI_SIG_AEST			"AEST"	/* ARM Error Source Table */
+
+#define AEST_NODE_TYPE_PROC		0
+#define AEST_NODE_TYPE_MEM		1
+#define AEST_NODE_TYPE_VENDOR		2
+
+#define AEST_SYSTEM_REG_INTERFACE	0x0
+#define AEST_MEMORY_MAPPED_INTERFACE	0x1
+
+#define AEST_INTERRUPT_MODE		BIT(0)
+
+#define AEST_MAX_PPI			4
+
+#pragma pack(1)
+
+struct acpi_table_aest {
+	struct acpi_table_header header;
+};
+
+struct aest_type_header {
+	u8 type;
+	u16 length;
+	u8 reserved;
+	u32 revision;
+	u32 data_offset;
+	u32 interface_offset;
+	u32 interface_size;
+	u32 interrupt_offset;
+	u32 interrupt_size;
+	u64 timestamp_rate;
+	u64 timestamp_start;
+	u64 countdown_rate;
+};
+
+struct aest_proc_data {
+	u32 id;
+	u32 level;
+	u32 cache_type;
+};
+
+struct aest_mem_data {
+	u32 domain;
+};
+
+struct aest_vendor_data {
+	u32 id;
+	u32 data;
+};
+
+struct aest_interface {
+	u8 type;
+	u8 reserved[3];
+	u32 flags;
+	u64 address;
+	u16 start_index;
+	u16 num_records;
+};
+
+struct aest_interrupt {
+	u8 type;
+	u16 reserved;
+	u8 flags;
+	u32 gsiv;
+	u8 iort_id[20];
+};
+
+#pragma pack()
+
+struct aest_interface_data {
+	u8 type;
+	u16 start;
+	u16 end;
+	struct ras_ext_regs *regs;
+};
+
+union aest_node_spec {
+	struct aest_proc_data proc;
+	struct aest_mem_data mem;
+	struct aest_vendor_data vendor;
+};
+
+struct aest_node_data {
+	u8 node_type;
+	struct aest_interface_data interface;
+	union aest_node_spec data;
+};
+
+#endif /* AEST_H */
-- 
1.8.3.1


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH RFC 2/4] arm64: mm: Add RAS extension system register check to SEA handling
  2019-07-02 16:51 [PATCH RFC 0/4] ARM Error Source Table Support Tyler Baicar OS
  2019-07-02 16:51 ` [PATCH RFC 1/4] ACPI/AEST: Initial AEST driver Tyler Baicar OS
@ 2019-07-02 16:51 ` Tyler Baicar OS
  2019-07-08 10:00   ` James Morse
  2019-07-02 16:51 ` [PATCH RFC 3/4] arm64: traps: Add RAS extension system register check to serror handling Tyler Baicar OS
  2019-07-02 16:52 ` [PATCH RFC 4/4] trace, ras: add ARM RAS extension trace event Tyler Baicar OS
  3 siblings, 1 reply; 14+ messages in thread
From: Tyler Baicar OS @ 2019-07-02 16:51 UTC (permalink / raw)
  To: Open Source Submission, linux-arm-kernel, linux-kernel,
	linux-acpi, linux-edac, james.morse, catalin.marinas, will,
	lorenzo.pieralisi, guohanjun, sudeep.holla, rjw, lenb,
	mark.rutland, tony.luck, bp, Matteo.Carlini, Andrew.Murray
  Cc: Tyler Baicar OS

On systems that support the ARM RAS extension, synchronous external
abort syndrome information could be captured in the core's RAS extension
system registers. So, when handling SEAs check the RAS system registers
for error syndrome information.

Signed-off-by: Tyler Baicar <baicar@os.amperecomputing.com>
---
 arch/arm64/mm/fault.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 2d11501..76b42ca 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -37,6 +37,7 @@
 #include <asm/pgtable.h>
 #include <asm/tlbflush.h>
 #include <asm/traps.h>
+#include <asm/ras.h>
 
 struct fault_info {
 	int	(*fn)(unsigned long addr, unsigned int esr,
@@ -632,6 +633,8 @@ static int do_sea(unsigned long addr, unsigned int esr, struct pt_regs *regs)
 
 	inf = esr_to_fault_info(esr);
 
+	arch_arm_ras_report_error();
+
 	/*
 	 * Return value ignored as we rely on signal merging.
 	 * Future patches will make this more robust.
-- 
1.8.3.1


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH RFC 3/4] arm64: traps: Add RAS extension system register check to serror handling
  2019-07-02 16:51 [PATCH RFC 0/4] ARM Error Source Table Support Tyler Baicar OS
  2019-07-02 16:51 ` [PATCH RFC 1/4] ACPI/AEST: Initial AEST driver Tyler Baicar OS
  2019-07-02 16:51 ` [PATCH RFC 2/4] arm64: mm: Add RAS extension system register check to SEA handling Tyler Baicar OS
@ 2019-07-02 16:51 ` Tyler Baicar OS
  2019-07-02 16:52 ` [PATCH RFC 4/4] trace, ras: add ARM RAS extension trace event Tyler Baicar OS
  3 siblings, 0 replies; 14+ messages in thread
From: Tyler Baicar OS @ 2019-07-02 16:51 UTC (permalink / raw)
  To: Open Source Submission, linux-arm-kernel, linux-kernel,
	linux-acpi, linux-edac, james.morse, catalin.marinas, will,
	lorenzo.pieralisi, guohanjun, sudeep.holla, rjw, lenb,
	mark.rutland, tony.luck, bp, Matteo.Carlini, Andrew.Murray
  Cc: Tyler Baicar OS

On systems that support the ARM RAS extension, serror interrupt syndrome
information could be captured in the core's RAS extension system
registers. When handling serrors, check the RAS system registers for
error syndrome information.

Signed-off-by: Tyler Baicar <baicar@os.amperecomputing.com>
---
 arch/arm64/kernel/traps.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/arm64/kernel/traps.c b/arch/arm64/kernel/traps.c
index 985721a..aacfa51 100644
--- a/arch/arm64/kernel/traps.c
+++ b/arch/arm64/kernel/traps.c
@@ -40,6 +40,7 @@
 #include <asm/exception.h>
 #include <asm/system_misc.h>
 #include <asm/sysreg.h>
+#include <asm/ras.h>
 
 static const char *handler[]= {
 	"Synchronous Abort",
@@ -897,6 +898,8 @@ asmlinkage void do_serror(struct pt_regs *regs, unsigned int esr)
 	if (!was_in_nmi)
 		nmi_enter();
 
+	arch_arm_ras_report_error();
+
 	/* non-RAS errors are not containable */
 	if (!arm64_is_ras_serror(esr) || arm64_is_fatal_ras_serror(regs, esr))
 		arm64_serror_panic(regs, esr);
-- 
1.8.3.1


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH RFC 4/4] trace, ras: add ARM RAS extension trace event
  2019-07-02 16:51 [PATCH RFC 0/4] ARM Error Source Table Support Tyler Baicar OS
                   ` (2 preceding siblings ...)
  2019-07-02 16:51 ` [PATCH RFC 3/4] arm64: traps: Add RAS extension system register check to serror handling Tyler Baicar OS
@ 2019-07-02 16:52 ` Tyler Baicar OS
  3 siblings, 0 replies; 14+ messages in thread
From: Tyler Baicar OS @ 2019-07-02 16:52 UTC (permalink / raw)
  To: Open Source Submission, linux-arm-kernel, linux-kernel,
	linux-acpi, linux-edac, james.morse, catalin.marinas, will,
	lorenzo.pieralisi, guohanjun, sudeep.holla, rjw, lenb,
	mark.rutland, tony.luck, bp, Matteo.Carlini, Andrew.Murray
  Cc: Tyler Baicar OS

Add a trace event for hardware errors reported by the ARMv8.2
RAS extension registers.

Signed-off-by: Tyler Baicar <baicar@os.amperecomputing.com>
---
 arch/arm64/kernel/ras.c   |  3 +++
 drivers/acpi/arm64/aest.c |  4 ++++
 include/ras/ras_event.h   | 46 ++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 53 insertions(+)

diff --git a/arch/arm64/kernel/ras.c b/arch/arm64/kernel/ras.c
index ca47efa..4e34d63 100644
--- a/arch/arm64/kernel/ras.c
+++ b/arch/arm64/kernel/ras.c
@@ -5,6 +5,7 @@
 #include <linux/smp.h>
 
 #include <asm/ras.h>
+#include <ras/ras_event.h>
 
 void arch_arm_ras_report_error(void)
 {
@@ -50,6 +51,8 @@ void arch_arm_ras_report_error(void)
 			       regs.err_misc1);
 		}
 
+		trace_arm_ras_ext_event(0, cpu_num, &regs);
+
 		/*
 		 * In the future, we will treat UER conditions as potentially
 		 * recoverable.
diff --git a/drivers/acpi/arm64/aest.c b/drivers/acpi/arm64/aest.c
index fd4f3b5..21ec583 100644
--- a/drivers/acpi/arm64/aest.c
+++ b/drivers/acpi/arm64/aest.c
@@ -13,6 +13,7 @@
 #include <linux/ratelimit.h>
 
 #include <asm/ras.h>
+#include <ras/ras_event.h>
 
 #undef pr_fmt
 #define pr_fmt(fmt) "ACPI AEST: " fmt
@@ -102,6 +103,9 @@ static void aest_proc(struct aest_node_data *data)
 
 		aest_print(data, regs, i);
 
+		trace_arm_ras_ext_event(data->node_type, data->data.proc.id,
+					&regs);
+
 		if (regs.err_status & ERR_STATUS_UE)
 			fatal = true;
 
diff --git a/include/ras/ras_event.h b/include/ras/ras_event.h
index 36c5c5e..8b76cb1 100644
--- a/include/ras/ras_event.h
+++ b/include/ras/ras_event.h
@@ -339,6 +339,52 @@
 );
 
 /*
+ * ARM RAS Extension Events Report
+ *
+ * This event is generated when an error reported by the ARM RAS extension
+ * hardware is detected.
+ */
+
+#ifdef CONFIG_ARM64
+#include <asm/ras.h>
+TRACE_EVENT(arm_ras_ext_event,
+
+	TP_PROTO(u8 type, u32 id, struct ras_ext_regs *regs),
+
+	TP_ARGS(type, id, regs),
+
+	TP_STRUCT__entry(
+		__field(u8,  type)
+		__field(u32, id)
+		__field(u64, err_fr)
+		__field(u64, err_ctlr)
+		__field(u64, err_status)
+		__field(u64, err_addr)
+		__field(u64, err_misc0)
+		__field(u64, err_misc1)
+	),
+
+	TP_fast_assign(
+		__entry->type = type;
+		__entry->id = id;
+		__entry->err_fr = regs->err_fr;
+		__entry->err_ctlr = regs->err_ctlr;
+		__entry->err_status = regs->err_status;
+		__entry->err_addr = regs->err_addr;
+		__entry->err_misc0 = regs->err_misc0;
+		__entry->err_misc1 = regs->err_misc1;
+	),
+
+	TP_printk("type: %d; id: %d; ERR_FR: %llx; ERR_CTLR: %llx; "
+		  "ERR_STATUS: %llx; ERR_ADDR: %llx; ERR_MISC0: %llx; "
+		  "ERR_MISC1: %llx",
+		  __entry->type, __entry->id, __entry->err_fr,
+		  __entry->err_ctlr, __entry->err_status, __entry->err_addr,
+		  __entry->err_misc0, __entry->err_misc1)
+);
+#endif
+
+/*
  * memory-failure recovery action result event
  *
  * unsigned long pfn -	Page Frame Number of the corrupted page
-- 
1.8.3.1


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH RFC 1/4] ACPI/AEST: Initial AEST driver
  2019-07-02 16:51 ` [PATCH RFC 1/4] ACPI/AEST: Initial AEST driver Tyler Baicar OS
@ 2019-07-03  9:25   ` Andrew Murray
  2019-07-03 17:30     ` Tyler Baicar OS
  2019-07-04 16:02   ` Shiju Jose
  1 sibling, 1 reply; 14+ messages in thread
From: Andrew Murray @ 2019-07-03  9:25 UTC (permalink / raw)
  To: Tyler Baicar OS
  Cc: mark.rutland, lorenzo.pieralisi, catalin.marinas, sudeep.holla,
	rjw, linux-kernel, Matteo.Carlini, linux-acpi, tony.luck,
	james.morse, guohanjun, bp, Open Source Submission, lenb, will,
	linux-arm-kernel, linux-edac

On Tue, Jul 02, 2019 at 04:51:38PM +0000, Tyler Baicar OS wrote:
> Add support for parsing the ARM Error Source Table and basic handling of
> errors reported through both memory mapped and system register interfaces.
> 
> Signed-off-by: Tyler Baicar <baicar@os.amperecomputing.com>
> ---
>  arch/arm64/include/asm/ras.h |  41 +++++
>  arch/arm64/kernel/Makefile   |   2 +-
>  arch/arm64/kernel/ras.c      |  67 ++++++++
>  drivers/acpi/arm64/Kconfig   |   3 +
>  drivers/acpi/arm64/Makefile  |   1 +
>  drivers/acpi/arm64/aest.c    | 362 +++++++++++++++++++++++++++++++++++++++++++
>  include/linux/acpi_aest.h    |  94 +++++++++++
>  7 files changed, 569 insertions(+), 1 deletion(-)
>  create mode 100644 arch/arm64/include/asm/ras.h
>  create mode 100644 arch/arm64/kernel/ras.c
>  create mode 100644 drivers/acpi/arm64/aest.c
>  create mode 100644 include/linux/acpi_aest.h
> 
> diff --git a/arch/arm64/include/asm/ras.h b/arch/arm64/include/asm/ras.h
> new file mode 100644
> index 0000000..36bfff4
> --- /dev/null
> +++ b/arch/arm64/include/asm/ras.h
> @@ -0,0 +1,41 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef __ASM_RAS_H
> +#define __ASM_RAS_H
> +
> +#define ERR_STATUS_AV		BIT(31)
> +#define ERR_STATUS_V		BIT(30)
> +#define ERR_STATUS_UE		BIT(29)
> +#define ERR_STATUS_ER		BIT(28)
> +#define ERR_STATUS_OF		BIT(27)
> +#define ERR_STATUS_MV		BIT(26)
> +#define ERR_STATUS_CE_SHIFT	24
> +#define ERR_STATUS_CE_MASK	0x3
> +#define ERR_STATUS_DE		BIT(23)
> +#define ERR_STATUS_PN		BIT(22)
> +#define ERR_STATUS_UET_SHIFT	20
> +#define ERR_STATUS_UET_MASK	0x3
> +#define ERR_STATUS_IERR_SHIFT	8
> +#define ERR_STATUS_IERR_MASK	0xff
> +#define ERR_STATUS_SERR_SHIFT	0
> +#define ERR_STATUS_SERR_MASK	0xff

Some of these (_ER, _OF, _CE*, _PN, _UET*) are not used anywhere in the series,
I'd suggest you drop the unused ones.

There may be some merit in renaming these to match the register names in the
spec, e.g. ERXSTATUS_EL1 instead of ERR_STATUS.

> +
> +#define ERR_FR_CEC_SHIFT	12
> +#define ERR_FR_CEC_MASK		0x7
> +
> +#define ERR_FR_8B_CEC		BIT(1)
> +#define ERR_FR_16B_CEC		BIT(2)

All of these ERR_FR_ defines aren't used anywhere either.

> +
> +struct ras_ext_regs {
> +	u64 err_fr;
> +	u64 err_ctlr;
> +	u64 err_status;
> +	u64 err_addr;
> +	u64 err_misc0;
> +	u64 err_misc1;
> +	u64 err_misc2;
> +	u64 err_misc3;
> +};
> +
> +void arch_arm_ras_report_error(void);
> +
> +#endif	/* __ASM_RAS_H */
> diff --git a/arch/arm64/kernel/Makefile b/arch/arm64/kernel/Makefile
> index 9e7dcb2..294f602 100644
> --- a/arch/arm64/kernel/Makefile
> +++ b/arch/arm64/kernel/Makefile
> @@ -19,7 +19,7 @@ obj-y			:= debug-monitors.o entry.o irq.o fpsimd.o		\
>  			   return_address.o cpuinfo.o cpu_errata.o		\
>  			   cpufeature.o alternative.o cacheinfo.o		\
>  			   smp.o smp_spin_table.o topology.o smccc-call.o	\
> -			   syscall.o
> +			   syscall.o ras.o

Given that arch_arm_ras_report_error depends on the ARM64_HAS_RAS_EXTN
capability, which in turn depends on CONFIG_ARM64_RAS_EXTN - you should
probably conditionally build ras.o only if CONFIG_ARM64_RAS_EXTN is defined
(and provide a stub in the header for when it isn't defined).

>  
>  extra-$(CONFIG_EFI)			:= efi-entry.o
>  
> diff --git a/arch/arm64/kernel/ras.c b/arch/arm64/kernel/ras.c
> new file mode 100644
> index 0000000..ca47efa
> --- /dev/null
> +++ b/arch/arm64/kernel/ras.c
> @@ -0,0 +1,67 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +#include <linux/kernel.h>
> +#include <linux/cpu.h>
> +#include <linux/smp.h>
> +
> +#include <asm/ras.h>
> +
> +void arch_arm_ras_report_error(void)
> +{
> +	u64 num_records;
> +	unsigned int i, cpu_num;
> +	bool fatal = false;
> +	struct ras_ext_regs regs;
> +
> +	if (!this_cpu_has_cap(ARM64_HAS_RAS_EXTN))
> +		return;
> +
> +	cpu_num = get_cpu();
> +	num_records = read_sysreg_s(SYS_ERRIDR_EL1);

This value should be masked to exclude the reserved bits. This will
also prevent you writing to reserved bits in ERRSELR.

> +
> +	for (i = 0; i < num_records; i++) {
> +		write_sysreg_s(i, SYS_ERRSELR_EL1);

There should be an isb here, this will ensure the record selection has
happened before reading the record.

> +		regs.err_status = read_sysreg_s(SYS_ERXSTATUS_EL1);
> +
> +		if (!(regs.err_status & ERR_STATUS_V))
> +			continue;
> +
> +		pr_err("CPU%u: ERR%uSTATUS: 0x%llx\n", cpu_num, i,
> +		       regs.err_status);
> +
> +		if (regs.err_status & ERR_STATUS_AV) {
> +			regs.err_addr = read_sysreg_s(SYS_ERXSTATUS_EL1);

This should be SYS_ERXADDR_EL1 not SYS_ERXSTATUS_EL1!

> +			pr_err("CPU%u: ERR%uADDR: 0x%llx\n", cpu_num, i,
> +			       regs.err_addr);
> +		} else
> +			regs.err_addr = 0;

Or perhaps set "regs = { }" at the start of the function instead?

> +
> +		regs.err_fr = read_sysreg_s(SYS_ERXFR_EL1);
> +		pr_err("CPU%u: ERR%uFR: 0x%llx\n", cpu_num, i, regs.err_fr);
> +		regs.err_ctlr = read_sysreg_s(SYS_ERXCTLR_EL1);
> +		pr_err("CPU%u: ERR%uCTLR: 0x%llx\n", cpu_num, i, regs.err_ctlr);
> +
> +		if (regs.err_status & ERR_STATUS_MV) {
> +			regs.err_misc0 = read_sysreg_s(SYS_ERXMISC0_EL1);
> +			pr_err("CPU%u: ERR%uMISC0: 0x%llx\n", cpu_num, i,
> +			       regs.err_misc0);
> +			regs.err_misc1 = read_sysreg_s(SYS_ERXMISC1_EL1);
> +			pr_err("CPU%u: ERR%uMISC1: 0x%llx\n", cpu_num, i,
> +			       regs.err_misc1);
> +		}
> +
> +		/*
> +		 * In the future, we will treat UER conditions as potentially
> +		 * recoverable.
> +		 */
> +		if (regs.err_status & ERR_STATUS_UE)
> +			fatal = true;
> +
> +		write_sysreg_s(regs.err_status, SYS_ERXSTATUS_EL1);
> +	}
> +
> +	if (fatal)
> +		panic("uncorrectable error encountered");

On the do_serror path, we will already panic if arm64_is_fatal_ras_serror
indicates uncorrectable errors. Is this here for the other paths?

> +
> +	put_cpu();
> +}

Finally, should we clear the errors when we see them?

> diff --git a/drivers/acpi/arm64/Kconfig b/drivers/acpi/arm64/Kconfig
> index 6dba187..8d5cf99 100644
> --- a/drivers/acpi/arm64/Kconfig
> +++ b/drivers/acpi/arm64/Kconfig
> @@ -8,3 +8,6 @@ config ACPI_IORT
>  
>  config ACPI_GTDT
>  	bool
> +
> +config ACPI_AEST
> +	bool "ARM Error Source Table Support"
> diff --git a/drivers/acpi/arm64/Makefile b/drivers/acpi/arm64/Makefile
> index 6ff50f4..ea1ba28 100644
> --- a/drivers/acpi/arm64/Makefile
> +++ b/drivers/acpi/arm64/Makefile
> @@ -1,3 +1,4 @@
>  # SPDX-License-Identifier: GPL-2.0-only
>  obj-$(CONFIG_ACPI_IORT) 	+= iort.o
>  obj-$(CONFIG_ACPI_GTDT) 	+= gtdt.o
> +obj-$(CONFIG_ACPI_AEST) 	+= aest.o
> diff --git a/drivers/acpi/arm64/aest.c b/drivers/acpi/arm64/aest.c
> new file mode 100644
> index 0000000..fd4f3b5
> --- /dev/null
> +++ b/drivers/acpi/arm64/aest.c
> @@ -0,0 +1,362 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +/* ARM Error Source Table Support */
> +
> +#include <linux/acpi.h>
> +#include <linux/acpi_aest.h>
> +#include <linux/init.h>
> +#include <linux/interrupt.h>
> +#include <linux/io.h>
> +#include <linux/irq.h>
> +#include <linux/kernel.h>
> +#include <linux/percpu.h>
> +#include <linux/ratelimit.h>
> +
> +#include <asm/ras.h>
> +
> +#undef pr_fmt
> +#define pr_fmt(fmt) "ACPI AEST: " fmt
> +
> +static struct acpi_table_header *aest_table;
> +
> +static struct aest_node_data __percpu **ppi_data;
> +static u8 num_ppi;
> +static u8 ppi_idx;
> +
> +static void aest_print(struct aest_node_data *data, struct ras_ext_regs regs,
> +		       int index)
> +{
> +	/* No more than 2 corrected messages every 5 seconds */
> +	static DEFINE_RATELIMIT_STATE(ratelimit_corrected, 5*HZ, 2);
> +
> +	if (regs.err_status & ERR_STATUS_UE ||
> +	    regs.err_status & ERR_STATUS_DE ||
> +	    __ratelimit(&ratelimit_corrected)) {
> +		switch (data->node_type) {
> +		case AEST_NODE_TYPE_PROC:
> +			pr_err("error from processor 0x%x\n",
> +			       data->data.proc.id);
> +			break;
> +		case AEST_NODE_TYPE_MEM:
> +			pr_err("error from memory domain 0x%x\n",
> +			       data->data.mem.domain);
> +			break;
> +		case AEST_NODE_TYPE_VENDOR:
> +			pr_err("error from vendor specific source 0x%x\n",
> +			       data->data.vendor.id);
> +		}
> +
> +		pr_err("ERR%dSTATUS = 0x%llx\n", index, regs.err_status);
> +		if (regs.err_status & ERR_STATUS_AV)
> +			pr_err("ERR%dADDR = 0x%llx\n", index, regs.err_addr);
> +
> +		pr_err("ERR%dFR = 0x%llx\n", index, regs.err_fr);
> +		pr_err("ERR%dCTLR = 0x%llx\n", index, regs.err_ctlr);
> +
> +		if (regs.err_status & ERR_STATUS_MV) {
> +			pr_err("ERR%dMISC0 = 0x%llx\n", index, regs.err_misc0);
> +			pr_err("ERR%dMISC1 = 0x%llx\n", index, regs.err_misc1);
> +		}

Given that we have a ras_ext_regs struct, can't we use a single function to
print the error - rather than have duplicate pr_err's here and in
arch_arm_ras_report_error?

Thanks,

Andrew Murray

> +	}
> +}
> +
> +static void aest_proc(struct aest_node_data *data)
> +{
> +	struct ras_ext_regs *regs_p, regs;
> +	int i;
> +	bool fatal = false;
> +
> +	/*
> +	 * Currently SR based handling is done through the architected
> +	 * discovery exposed through SRs. That may change in the future
> +	 * if there is supplemental information in the AEST that is
> +	 * needed.
> +	 */
> +	if (data->interface.type == AEST_SYSTEM_REG_INTERFACE) {
> +		arch_arm_ras_report_error();
> +		return;
> +	}
> +
> +	regs_p = data->interface.regs;
> +
> +	for (i = data->interface.start; i < data->interface.end; i++) {
> +		regs.err_status = readq(&regs_p[i].err_status);
> +		if (!(regs.err_status & ERR_STATUS_V))
> +			continue;
> +
> +		if (regs.err_status & ERR_STATUS_AV)
> +			regs.err_addr = readq(&regs_p[i].err_addr);
> +		else
> +			regs.err_addr = 0;
> +
> +		regs.err_fr = readq(&regs_p[i].err_fr);
> +		regs.err_ctlr = readq(&regs_p[i].err_ctlr);
> +
> +		if (regs.err_status & ERR_STATUS_MV) {
> +			regs.err_misc0 = readq(&regs_p[i].err_misc0);
> +			regs.err_misc1 = readq(&regs_p[i].err_misc1);
> +		} else {
> +			regs.err_misc0 = 0;
> +			regs.err_misc1 = 0;
> +		}
> +
> +		aest_print(data, regs, i);
> +
> +		if (regs.err_status & ERR_STATUS_UE)
> +			fatal = true;
> +
> +		writeq(regs.err_status, &regs_p[i].err_status);
> +	}
> +
> +	if (fatal)
> +		panic("AEST: uncorrectable error encountered");
> +
> +}
> +
> +static irqreturn_t aest_irq_func(int irq, void *input)
> +{
> +	struct aest_node_data *data = input;
> +
> +	aest_proc(data);
> +
> +	return IRQ_HANDLED;
> +}
> +
> +static int __init aest_register_gsi(u32 gsi, int trigger, void *data)
> +{
> +	int cpu, irq;
> +
> +	irq = acpi_register_gsi(NULL, gsi, trigger, ACPI_ACTIVE_HIGH);
> +
> +	if (irq == -EINVAL) {
> +		pr_err("failed to map AEST GSI %d\n", gsi);
> +		return -EINVAL;
> +	}
> +
> +	if (gsi < 16) {
> +		pr_err("invalid GSI %d\n", gsi);
> +		return -EINVAL;
> +	} else if (gsi < 32) {
> +		if (ppi_idx >= AEST_MAX_PPI) {
> +			pr_err("Unable to register PPI %d\n", gsi);
> +			return -EINVAL;
> +		}
> +		enable_percpu_irq(irq, IRQ_TYPE_NONE);
> +		for_each_possible_cpu(cpu) {
> +			memcpy(per_cpu_ptr(ppi_data[ppi_idx], cpu), data,
> +			       sizeof(struct aest_node_data));
> +		}
> +		if (request_percpu_irq(irq, aest_irq_func, "AEST",
> +				       ppi_data[ppi_idx++])) {
> +			pr_err("failed to register AEST IRQ %d\n", irq);
> +			return -EINVAL;
> +		}
> +	} else if (gsi < 1020) {
> +		if (request_irq(irq, aest_irq_func, IRQF_SHARED, "AEST",
> +				data)) {
> +			pr_err("failed to register AEST IRQ %d\n", irq);
> +			return -EINVAL;
> +		}
> +	} else {
> +		pr_err("invalid GSI %d\n", gsi);
> +		return -EINVAL;
> +	}
> +
> +	return 0;
> +}
> +
> +static int __init aest_init_interrupts(struct aest_type_header *node,
> +				       struct aest_node_data *data)
> +{
> +	struct aest_interrupt *interrupt;
> +	int i, trigger, ret = 0;
> +
> +	interrupt = ACPI_ADD_PTR(struct aest_interrupt, node,
> +				 node->interrupt_offset);
> +
> +	for (i = 0; i < node->interrupt_size; i++, interrupt++) {
> +		trigger = (interrupt->flags & AEST_INTERRUPT_MODE) ?
> +			  ACPI_LEVEL_SENSITIVE : ACPI_EDGE_SENSITIVE;
> +		if (aest_register_gsi(interrupt->gsiv, trigger, data))
> +			ret = -EINVAL;
> +	}
> +
> +	return ret;
> +}
> +
> +static int __init aest_init_interface(struct aest_type_header *node,
> +				       struct aest_node_data *data)
> +{
> +	struct aest_interface *interface;
> +	struct resource *res;
> +	int size;
> +
> +	interface = ACPI_ADD_PTR(struct aest_interface, node,
> +				 node->interface_offset);
> +
> +	if (interface->type > AEST_MEMORY_MAPPED_INTERFACE) {
> +		pr_err("invalid interface type: %d\n", interface->type);
> +		return -EINVAL;
> +	}
> +
> +	data->interface.type = interface->type;
> +
> +	/*
> +	 * Currently SR based handling is done through the architected
> +	 * discovery exposed through SRs. That may change in the future
> +	 * if there is supplemental information in the AEST that is
> +	 * needed.
> +	 */
> +	if (interface->type == AEST_SYSTEM_REG_INTERFACE)
> +		return 0;
> +
> +	res = kzalloc(sizeof(struct resource), GFP_KERNEL);
> +	if (!res)
> +		return -ENOMEM;
> +
> +	size = interface->num_records * sizeof(struct ras_ext_regs);
> +	res->name = "AEST";
> +	res->start = interface->address;
> +	res->end = res->start + size;
> +	res->flags = IORESOURCE_MEM;
> +	if (request_resource_conflict(&iomem_resource, res)) {
> +		pr_err("unable to request region starting at 0x%llx\n",
> +			res->start);
> +		kfree(res);
> +		return -EEXIST;
> +	}
> +
> +	data->interface.start = interface->start_index;
> +	data->interface.end = interface->start_index + interface->num_records;
> +
> +	data->interface.regs = ioremap(interface->address, size);
> +	if (data->interface.regs == NULL)
> +		return -EINVAL;
> +
> +	return 0;
> +}
> +
> +static int __init aest_init_node(struct aest_type_header *node)
> +{
> +	struct aest_node_data *data;
> +	union aest_node_spec *node_spec;
> +	int ret;
> +
> +	data = kzalloc(sizeof(struct aest_node_data), GFP_KERNEL);
> +	if (!data)
> +		return -ENOMEM;
> +
> +	data->node_type = node->type;
> +
> +	node_spec = ACPI_ADD_PTR(union aest_node_spec, node, node->data_offset);
> +
> +	switch (node->type) {
> +	case AEST_NODE_TYPE_PROC:
> +		memcpy(&data->data, node_spec, sizeof(struct aest_proc_data));
> +		break;
> +	case AEST_NODE_TYPE_MEM:
> +		memcpy(&data->data, node_spec, sizeof(struct aest_mem_data));
> +		break;
> +	case AEST_NODE_TYPE_VENDOR:
> +		memcpy(&data->data, node_spec, sizeof(struct aest_vendor_data));
> +		break;
> +	default:
> +		return -EINVAL;
> +	}
> +
> +	ret = aest_init_interface(node, data);
> +	if (ret) {
> +		kfree(data);
> +		return ret;
> +	}
> +
> +	return aest_init_interrupts(node, data);
> +}
> +
> +static void aest_count_ppi(struct aest_type_header *node)
> +{
> +	struct aest_interrupt *interrupt;
> +	int i;
> +
> +	interrupt = ACPI_ADD_PTR(struct aest_interrupt, node,
> +				 node->interrupt_offset);
> +
> +	for (i = 0; i < node->interrupt_size; i++, interrupt++) {
> +		if (interrupt->gsiv >= 16 && interrupt->gsiv < 32)
> +			num_ppi++;
> +	}
> +
> +}
> +
> +int __init acpi_aest_init(void)
> +{
> +	struct acpi_table_aest *aest;
> +	struct aest_type_header *aest_node, *aest_end;
> +	int i, ret = 0;
> +
> +	if (acpi_disabled)
> +		return 0;
> +
> +	if (ACPI_FAILURE(acpi_get_table(ACPI_SIG_AEST, 0, &aest_table)))
> +		return -EINVAL;
> +
> +	aest = (struct acpi_table_aest *)aest_table;
> +
> +	/* Get the first AEST node */
> +	aest_node = ACPI_ADD_PTR(struct aest_type_header, aest,
> +				 sizeof(struct acpi_table_aest));
> +	/* Pointer to the end of the AEST table */
> +	aest_end = ACPI_ADD_PTR(struct aest_type_header, aest,
> +				aest_table->length);
> +
> +	while (aest_node < aest_end) {
> +		if (((u64)aest_node + aest_node->length) > (u64)aest_end) {
> +			pr_err("AEST node pointer overflow, bad table\n");
> +			return -EINVAL;
> +		}
> +
> +		aest_count_ppi(aest_node);
> +
> +		aest_node = ACPI_ADD_PTR(struct aest_type_header, aest_node,
> +					 aest_node->length);
> +	}
> +
> +	if (num_ppi > AEST_MAX_PPI) {
> +		pr_err("Limiting PPI support to %d PPIs\n", AEST_MAX_PPI);
> +		num_ppi = AEST_MAX_PPI;
> +	}
> +
> +	ppi_data = kcalloc(num_ppi, sizeof(struct aest_node_data *),
> +			   GFP_KERNEL);
> +
> +	for (i = 0; i < num_ppi; i++) {
> +		ppi_data[i] = alloc_percpu(struct aest_node_data);
> +		if (!ppi_data[i]) {
> +			ret = -ENOMEM;
> +			break;
> +		}
> +	}
> +
> +	if (ret) {
> +		pr_err("Failed percpu allocation\n");
> +		for (i = 0; i < num_ppi; i++)
> +			free_percpu(ppi_data[i]);
> +		return ret;
> +	}
> +
> +	aest_node = ACPI_ADD_PTR(struct aest_type_header, aest,
> +				 sizeof(struct acpi_table_aest));
> +
> +	while (aest_node < aest_end) {
> +		ret = aest_init_node(aest_node);
> +		if (ret)
> +			pr_err("failed to init node: %d", ret);
> +
> +		aest_node = ACPI_ADD_PTR(struct aest_type_header, aest_node,
> +					 aest_node->length);
> +	}
> +
> +	return 0;
> +}
> +
> +early_initcall(acpi_aest_init);
> diff --git a/include/linux/acpi_aest.h b/include/linux/acpi_aest.h
> new file mode 100644
> index 0000000..376122b
> --- /dev/null
> +++ b/include/linux/acpi_aest.h
> @@ -0,0 +1,94 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef AEST_H
> +#define AEST_H
> +
> +#include <acpi/actbl.h>
> +
> +#define ACPI_SIG_AEST			"AEST"	/* ARM Error Source Table */
> +
> +#define AEST_NODE_TYPE_PROC		0
> +#define AEST_NODE_TYPE_MEM		1
> +#define AEST_NODE_TYPE_VENDOR		2
> +
> +#define AEST_SYSTEM_REG_INTERFACE	0x0
> +#define AEST_MEMORY_MAPPED_INTERFACE	0x1
> +
> +#define AEST_INTERRUPT_MODE		BIT(0)
> +
> +#define AEST_MAX_PPI			4
> +
> +#pragma pack(1)
> +
> +struct acpi_table_aest {
> +	struct acpi_table_header header;
> +};
> +
> +struct aest_type_header {
> +	u8 type;
> +	u16 length;
> +	u8 reserved;
> +	u32 revision;
> +	u32 data_offset;
> +	u32 interface_offset;
> +	u32 interface_size;
> +	u32 interrupt_offset;
> +	u32 interrupt_size;
> +	u64 timestamp_rate;
> +	u64 timestamp_start;
> +	u64 countdown_rate;
> +};
> +
> +struct aest_proc_data {
> +	u32 id;
> +	u32 level;
> +	u32 cache_type;
> +};
> +
> +struct aest_mem_data {
> +	u32 domain;
> +};
> +
> +struct aest_vendor_data {
> +	u32 id;
> +	u32 data;
> +};
> +
> +struct aest_interface {
> +	u8 type;
> +	u8 reserved[3];
> +	u32 flags;
> +	u64 address;
> +	u16 start_index;
> +	u16 num_records;
> +};
> +
> +struct aest_interrupt {
> +	u8 type;
> +	u16 reserved;
> +	u8 flags;
> +	u32 gsiv;
> +	u8 iort_id[20];
> +};
> +
> +#pragma pack()
> +
> +struct aest_interface_data {
> +	u8 type;
> +	u16 start;
> +	u16 end;
> +	struct ras_ext_regs *regs;
> +};
> +
> +union aest_node_spec {
> +	struct aest_proc_data proc;
> +	struct aest_mem_data mem;
> +	struct aest_vendor_data vendor;
> +};
> +
> +struct aest_node_data {
> +	u8 node_type;
> +	struct aest_interface_data interface;
> +	union aest_node_spec data;
> +};
> +
> +#endif /* AEST_H */
> -- 
> 1.8.3.1
> 

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH RFC 1/4] ACPI/AEST: Initial AEST driver
  2019-07-03  9:25   ` Andrew Murray
@ 2019-07-03 17:30     ` Tyler Baicar OS
  2019-07-04  9:05       ` Andrew Murray
  0 siblings, 1 reply; 14+ messages in thread
From: Tyler Baicar OS @ 2019-07-03 17:30 UTC (permalink / raw)
  To: Andrew Murray
  Cc: mark.rutland, lorenzo.pieralisi, catalin.marinas, sudeep.holla,
	rjw, linux-kernel, Matteo.Carlini, linux-acpi, tony.luck,
	james.morse, guohanjun, bp, Open Source Submission, lenb, will,
	linux-arm-kernel, linux-edac

Hello Andrew,

Thank you for the feedback!

On Wed, Jul 3, 2019 at 5:26 AM Andrew Murray <andrew.murray@arm.com> wrote:
>
> On Tue, Jul 02, 2019 at 04:51:38PM +0000, Tyler Baicar OS wrote:
> > Add support for parsing the ARM Error Source Table and basic handling of
> > errors reported through both memory mapped and system register interfaces.
> >
> > Signed-off-by: Tyler Baicar <baicar@os.amperecomputing.com>
> > ---
> >  arch/arm64/include/asm/ras.h |  41 +++++
> >  arch/arm64/kernel/Makefile   |   2 +-
> >  arch/arm64/kernel/ras.c      |  67 ++++++++
> >  drivers/acpi/arm64/Kconfig   |   3 +
> >  drivers/acpi/arm64/Makefile  |   1 +
> >  drivers/acpi/arm64/aest.c    | 362 +++++++++++++++++++++++++++++++++++++++++++
> >  include/linux/acpi_aest.h    |  94 +++++++++++
> >  7 files changed, 569 insertions(+), 1 deletion(-)
> >  create mode 100644 arch/arm64/include/asm/ras.h
> >  create mode 100644 arch/arm64/kernel/ras.c
> >  create mode 100644 drivers/acpi/arm64/aest.c
> >  create mode 100644 include/linux/acpi_aest.h
> >
> > diff --git a/arch/arm64/include/asm/ras.h b/arch/arm64/include/asm/ras.h
> > new file mode 100644
> > index 0000000..36bfff4
> > --- /dev/null
> > +++ b/arch/arm64/include/asm/ras.h
> > @@ -0,0 +1,41 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +#ifndef __ASM_RAS_H
> > +#define __ASM_RAS_H
> > +
> > +#define ERR_STATUS_AV                BIT(31)
> > +#define ERR_STATUS_V         BIT(30)
> > +#define ERR_STATUS_UE                BIT(29)
> > +#define ERR_STATUS_ER                BIT(28)
> > +#define ERR_STATUS_OF                BIT(27)
> > +#define ERR_STATUS_MV                BIT(26)
> > +#define ERR_STATUS_CE_SHIFT  24
> > +#define ERR_STATUS_CE_MASK   0x3
> > +#define ERR_STATUS_DE                BIT(23)
> > +#define ERR_STATUS_PN                BIT(22)
> > +#define ERR_STATUS_UET_SHIFT 20
> > +#define ERR_STATUS_UET_MASK  0x3
> > +#define ERR_STATUS_IERR_SHIFT        8
> > +#define ERR_STATUS_IERR_MASK 0xff
> > +#define ERR_STATUS_SERR_SHIFT        0
> > +#define ERR_STATUS_SERR_MASK 0xff
>
> Some of these (_ER, _OF, _CE*, _PN, _UET*) are not used anywhere in the series,
> I'd suggest you drop the unused ones.

Yes, I'll remove them in the next version.

> There may be some merit in renaming these to match the register names in the
> spec, e.g. ERXSTATUS_EL1 instead of ERR_STATUS.

ERX* are the register names for the system registers, but these macros are used
for both system registers and memory mapped registers. The memory mapped
registers have prefix ERR<n>*. Also, the spec definition of the ERX* system
registers is "accesses ERR<n>* for the error record selected by
ERRSELR_EL1.SEL." So really, the registers being accessed in all cases are
ERR<n>*. Either way, if folks think these names should be changed I can change
them.

> > +
> > +#define ERR_FR_CEC_SHIFT     12
> > +#define ERR_FR_CEC_MASK              0x7
> > +
> > +#define ERR_FR_8B_CEC                BIT(1)
> > +#define ERR_FR_16B_CEC               BIT(2)
>
> All of these ERR_FR_ defines aren't used anywhere either.

Will remove.

> > +
> > +struct ras_ext_regs {
> > +     u64 err_fr;
> > +     u64 err_ctlr;
> > +     u64 err_status;
> > +     u64 err_addr;
> > +     u64 err_misc0;
> > +     u64 err_misc1;
> > +     u64 err_misc2;
> > +     u64 err_misc3;
> > +};
> > +
> > +void arch_arm_ras_report_error(void);
> > +
> > +#endif       /* __ASM_RAS_H */
> > diff --git a/arch/arm64/kernel/Makefile b/arch/arm64/kernel/Makefile
> > index 9e7dcb2..294f602 100644
> > --- a/arch/arm64/kernel/Makefile
> > +++ b/arch/arm64/kernel/Makefile
> > @@ -19,7 +19,7 @@ obj-y                       := debug-monitors.o entry.o irq.o fpsimd.o              \
> >                          return_address.o cpuinfo.o cpu_errata.o              \
> >                          cpufeature.o alternative.o cacheinfo.o               \
> >                          smp.o smp_spin_table.o topology.o smccc-call.o       \
> > -                        syscall.o
> > +                        syscall.o ras.o
>
> Given that arch_arm_ras_report_error depends on the ARM64_HAS_RAS_EXTN
> capability, which in turn depends on CONFIG_ARM64_RAS_EXTN - you should
> probably conditionally build ras.o only if CONFIG_ARM64_RAS_EXTN is defined
> (and provide a stub in the header for when it isn't defined).

Yes, I can do that.

> > 
> >  extra-$(CONFIG_EFI)                  := efi-entry.o
> > 
> > diff --git a/arch/arm64/kernel/ras.c b/arch/arm64/kernel/ras.c
> > new file mode 100644
> > index 0000000..ca47efa
> > --- /dev/null
> > +++ b/arch/arm64/kernel/ras.c
> > @@ -0,0 +1,67 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +
> > +#include <linux/kernel.h>
> > +#include <linux/cpu.h>
> > +#include <linux/smp.h>
> > +
> > +#include <asm/ras.h>
> > +
> > +void arch_arm_ras_report_error(void)
> > +{
> > +     u64 num_records;
> > +     unsigned int i, cpu_num;
> > +     bool fatal = false;
> > +     struct ras_ext_regs regs;
> > +
> > +     if (!this_cpu_has_cap(ARM64_HAS_RAS_EXTN))
> > +             return;
> > +
> > +     cpu_num = get_cpu();
> > +     num_records = read_sysreg_s(SYS_ERRIDR_EL1);
>
> This value should be masked to exclude the reserved bits. This will
> also prevent you writing to reserved bits in ERRSELR.

True, I'll add that in the next version.

> > +
> > +     for (i = 0; i < num_records; i++) {
> > +             write_sysreg_s(i, SYS_ERRSELR_EL1);
>
> There should be an isb here, this will ensure the record selection has
> happened before reading the record.

I'll add that in the next version.

> > +             regs.err_status = read_sysreg_s(SYS_ERXSTATUS_EL1);
> > +
> > +             if (!(regs.err_status & ERR_STATUS_V))
> > +                     continue;
> > +
> > +             pr_err("CPU%u: ERR%uSTATUS: 0x%llx\n", cpu_num, i,
> > +                    regs.err_status);
> > +
> > +             if (regs.err_status & ERR_STATUS_AV) {
> > +                     regs.err_addr = read_sysreg_s(SYS_ERXSTATUS_EL1);
>
> This should be SYS_ERXADDR_EL1 not SYS_ERXSTATUS_EL1!

Oops, good catch! I missed this in testing because none of the errors injected
resulted in valid address info in the system registers.

> > +                     pr_err("CPU%u: ERR%uADDR: 0x%llx\n", cpu_num, i,
> > +                            regs.err_addr);
> > +             } else
> > +                     regs.err_addr = 0;
>
> Or perhaps set "regs = { }" at the start of the function instead?

Yes, I could do that.

> > +
> > +             regs.err_fr = read_sysreg_s(SYS_ERXFR_EL1);
> > +             pr_err("CPU%u: ERR%uFR: 0x%llx\n", cpu_num, i, regs.err_fr);
> > +             regs.err_ctlr = read_sysreg_s(SYS_ERXCTLR_EL1);
> > +             pr_err("CPU%u: ERR%uCTLR: 0x%llx\n", cpu_num, i, regs.err_ctlr);
> > +
> > +             if (regs.err_status & ERR_STATUS_MV) {
> > +                     regs.err_misc0 = read_sysreg_s(SYS_ERXMISC0_EL1);
> > +                     pr_err("CPU%u: ERR%uMISC0: 0x%llx\n", cpu_num, i,
> > +                            regs.err_misc0);
> > +                     regs.err_misc1 = read_sysreg_s(SYS_ERXMISC1_EL1);
> > +                     pr_err("CPU%u: ERR%uMISC1: 0x%llx\n", cpu_num, i,
> > +                            regs.err_misc1);
> > +             }
> > +
> > +             /*
> > +              * In the future, we will treat UER conditions as potentially
> > +              * recoverable.
> > +              */
> > +             if (regs.err_status & ERR_STATUS_UE)
> > +                     fatal = true;
> > +
> > +             write_sysreg_s(regs.err_status, SYS_ERXSTATUS_EL1);
> > +     }
> > +
> > +     if (fatal)
> > +             panic("uncorrectable error encountered");
>
> On the do_serror path, we will already panic if arm64_is_fatal_ras_serror
> indicates uncorrectable errors. Is this here for the other paths?

This same function is called for the SEA path and also from AEST for errors
that are reported through the system register interface.

> > +
> > +     put_cpu();
> > +}
>
> Finally, should we clear the errors when we see them?

Each error is being cleared at the end of the loop above by writing the value
read from the status register back to the status register. The status register
bits are write 1 to clear and writing back the same value that was read
guarantees that a higher priority error that occurs between the read and write
isn't cleared.

> > diff --git a/drivers/acpi/arm64/Kconfig b/drivers/acpi/arm64/Kconfig
> > index 6dba187..8d5cf99 100644
> > --- a/drivers/acpi/arm64/Kconfig
> > +++ b/drivers/acpi/arm64/Kconfig
> > @@ -8,3 +8,6 @@ config ACPI_IORT
> > 
> >  config ACPI_GTDT
> >       bool
> > +
> > +config ACPI_AEST
> > +     bool "ARM Error Source Table Support"
> > diff --git a/drivers/acpi/arm64/Makefile b/drivers/acpi/arm64/Makefile
> > index 6ff50f4..ea1ba28 100644
> > --- a/drivers/acpi/arm64/Makefile
> > +++ b/drivers/acpi/arm64/Makefile
> > @@ -1,3 +1,4 @@
> >  # SPDX-License-Identifier: GPL-2.0-only
> >  obj-$(CONFIG_ACPI_IORT)      += iort.o
> >  obj-$(CONFIG_ACPI_GTDT)      += gtdt.o
> > +obj-$(CONFIG_ACPI_AEST)      += aest.o
> > diff --git a/drivers/acpi/arm64/aest.c b/drivers/acpi/arm64/aest.c
> > new file mode 100644
> > index 0000000..fd4f3b5
> > --- /dev/null
> > +++ b/drivers/acpi/arm64/aest.c
> > @@ -0,0 +1,362 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +
> > +/* ARM Error Source Table Support */
> > +
> > +#include <linux/acpi.h>
> > +#include <linux/acpi_aest.h>
> > +#include <linux/init.h>
> > +#include <linux/interrupt.h>
> > +#include <linux/io.h>
> > +#include <linux/irq.h>
> > +#include <linux/kernel.h>
> > +#include <linux/percpu.h>
> > +#include <linux/ratelimit.h>
> > +
> > +#include <asm/ras.h>
> > +
> > +#undef pr_fmt
> > +#define pr_fmt(fmt) "ACPI AEST: " fmt
> > +
> > +static struct acpi_table_header *aest_table;
> > +
> > +static struct aest_node_data __percpu **ppi_data;
> > +static u8 num_ppi;
> > +static u8 ppi_idx;
> > +
> > +static void aest_print(struct aest_node_data *data, struct ras_ext_regs regs,
> > +                    int index)
> > +{
> > +     /* No more than 2 corrected messages every 5 seconds */
> > +     static DEFINE_RATELIMIT_STATE(ratelimit_corrected, 5*HZ, 2);
> > +
> > +     if (regs.err_status & ERR_STATUS_UE ||
> > +         regs.err_status & ERR_STATUS_DE ||
> > +         __ratelimit(&ratelimit_corrected)) {
> > +             switch (data->node_type) {
> > +             case AEST_NODE_TYPE_PROC:
> > +                     pr_err("error from processor 0x%x\n",
> > +                            data->data.proc.id);
> > +                     break;
> > +             case AEST_NODE_TYPE_MEM:
> > +                     pr_err("error from memory domain 0x%x\n",
> > +                            data->data.mem.domain);
> > +                     break;
> > +             case AEST_NODE_TYPE_VENDOR:
> > +                     pr_err("error from vendor specific source 0x%x\n",
> > +                            data->data.vendor.id);
> > +             }
> > +
> > +             pr_err("ERR%dSTATUS = 0x%llx\n", index, regs.err_status);
> > +             if (regs.err_status & ERR_STATUS_AV)
> > +                     pr_err("ERR%dADDR = 0x%llx\n", index, regs.err_addr);
> > +
> > +             pr_err("ERR%dFR = 0x%llx\n", index, regs.err_fr);
> > +             pr_err("ERR%dCTLR = 0x%llx\n", index, regs.err_ctlr);
> > +
> > +             if (regs.err_status & ERR_STATUS_MV) {
> > +                     pr_err("ERR%dMISC0 = 0x%llx\n", index, regs.err_misc0);
> > +                     pr_err("ERR%dMISC1 = 0x%llx\n", index, regs.err_misc1);
> > +             }
>
> Given that we have a ras_ext_regs struct, can't we use a single function to
> print the error - rather than have duplicate pr_err's here and in
> arch_arm_ras_report_error?

That was an option I had thought about, but I left it as is to get other
opinions. Right now the system register reporting prefixes everything with the
CPU number that the error occurred on...but now that I think about it more I
could just have the print function take a prefix as a parameter. I'll unify the
printing into a single function in the next version.

Thanks,
Tyler

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH RFC 1/4] ACPI/AEST: Initial AEST driver
  2019-07-03 17:30     ` Tyler Baicar OS
@ 2019-07-04  9:05       ` Andrew Murray
  0 siblings, 0 replies; 14+ messages in thread
From: Andrew Murray @ 2019-07-04  9:05 UTC (permalink / raw)
  To: Tyler Baicar OS
  Cc: mark.rutland, lorenzo.pieralisi, catalin.marinas, sudeep.holla,
	rjw, linux-kernel, Matteo.Carlini, linux-acpi, tony.luck,
	james.morse, guohanjun, bp, Open Source Submission, lenb, will,
	linux-arm-kernel, linux-edac

On Wed, Jul 03, 2019 at 05:30:53PM +0000, Tyler Baicar OS wrote:
> Hello Andrew,
> 
> Thank you for the feedback!
> 
> On Wed, Jul 3, 2019 at 5:26 AM Andrew Murray <andrew.murray@arm.com> wrote:
> >
> > On Tue, Jul 02, 2019 at 04:51:38PM +0000, Tyler Baicar OS wrote:
> > > Add support for parsing the ARM Error Source Table and basic handling of
> > > errors reported through both memory mapped and system register interfaces.
> > >
> > > Signed-off-by: Tyler Baicar <baicar@os.amperecomputing.com>
> > > ---
> > >  arch/arm64/include/asm/ras.h |  41 +++++
> > >  arch/arm64/kernel/Makefile   |   2 +-
> > >  arch/arm64/kernel/ras.c      |  67 ++++++++
> > >  drivers/acpi/arm64/Kconfig   |   3 +
> > >  drivers/acpi/arm64/Makefile  |   1 +
> > >  drivers/acpi/arm64/aest.c    | 362 +++++++++++++++++++++++++++++++++++++++++++
> > >  include/linux/acpi_aest.h    |  94 +++++++++++
> > >  7 files changed, 569 insertions(+), 1 deletion(-)
> > >  create mode 100644 arch/arm64/include/asm/ras.h
> > >  create mode 100644 arch/arm64/kernel/ras.c
> > >  create mode 100644 drivers/acpi/arm64/aest.c
> > >  create mode 100644 include/linux/acpi_aest.h
> > >
> > > diff --git a/arch/arm64/include/asm/ras.h b/arch/arm64/include/asm/ras.h
> > > new file mode 100644
> > > index 0000000..36bfff4
> > > --- /dev/null
> > > +++ b/arch/arm64/include/asm/ras.h
> > > @@ -0,0 +1,41 @@
> > > +/* SPDX-License-Identifier: GPL-2.0 */
> > > +#ifndef __ASM_RAS_H
> > > +#define __ASM_RAS_H
> > > +
> > > +#define ERR_STATUS_AV                BIT(31)
> > > +#define ERR_STATUS_V         BIT(30)
> > > +#define ERR_STATUS_UE                BIT(29)
> > > +#define ERR_STATUS_ER                BIT(28)
> > > +#define ERR_STATUS_OF                BIT(27)
> > > +#define ERR_STATUS_MV                BIT(26)
> > > +#define ERR_STATUS_CE_SHIFT  24
> > > +#define ERR_STATUS_CE_MASK   0x3
> > > +#define ERR_STATUS_DE                BIT(23)
> > > +#define ERR_STATUS_PN                BIT(22)
> > > +#define ERR_STATUS_UET_SHIFT 20
> > > +#define ERR_STATUS_UET_MASK  0x3
> > > +#define ERR_STATUS_IERR_SHIFT        8
> > > +#define ERR_STATUS_IERR_MASK 0xff
> > > +#define ERR_STATUS_SERR_SHIFT        0
> > > +#define ERR_STATUS_SERR_MASK 0xff
> >
> > Some of these (_ER, _OF, _CE*, _PN, _UET*) are not used anywhere in the series,
> > I'd suggest you drop the unused ones.
> 
> Yes, I'll remove them in the next version.
> 
> > There may be some merit in renaming these to match the register names in the
> > spec, e.g. ERXSTATUS_EL1 instead of ERR_STATUS.
> 
> ERX* are the register names for the system registers, but these macros are used
> for both system registers and memory mapped registers. The memory mapped
> registers have prefix ERR<n>*. Also, the spec definition of the ERX* system
> registers is "accesses ERR<n>* for the error record selected by
> ERRSELR_EL1.SEL." So really, the registers being accessed in all cases are
> ERR<n>*. Either way, if folks think these names should be changed I can change
> them.

Sorry I hadn't considered the memory mapped registers. Your rationale seems
sound so no need to change them.

> 
> > > +
> > > +#define ERR_FR_CEC_SHIFT     12
> > > +#define ERR_FR_CEC_MASK              0x7
> > > +
> > > +#define ERR_FR_8B_CEC                BIT(1)
> > > +#define ERR_FR_16B_CEC               BIT(2)
> >
> > All of these ERR_FR_ defines aren't used anywhere either.
> 
> Will remove.
> 
> > > +
> > > +struct ras_ext_regs {
> > > +     u64 err_fr;
> > > +     u64 err_ctlr;
> > > +     u64 err_status;
> > > +     u64 err_addr;
> > > +     u64 err_misc0;
> > > +     u64 err_misc1;
> > > +     u64 err_misc2;
> > > +     u64 err_misc3;
> > > +};
> > > +
> > > +void arch_arm_ras_report_error(void);
> > > +
> > > +#endif       /* __ASM_RAS_H */
> > > diff --git a/arch/arm64/kernel/Makefile b/arch/arm64/kernel/Makefile
> > > index 9e7dcb2..294f602 100644
> > > --- a/arch/arm64/kernel/Makefile
> > > +++ b/arch/arm64/kernel/Makefile
> > > @@ -19,7 +19,7 @@ obj-y                       := debug-monitors.o entry.o irq.o fpsimd.o              \
> > >                          return_address.o cpuinfo.o cpu_errata.o              \
> > >                          cpufeature.o alternative.o cacheinfo.o               \
> > >                          smp.o smp_spin_table.o topology.o smccc-call.o       \
> > > -                        syscall.o
> > > +                        syscall.o ras.o
> >
> > Given that arch_arm_ras_report_error depends on the ARM64_HAS_RAS_EXTN
> > capability, which in turn depends on CONFIG_ARM64_RAS_EXTN - you should
> > probably conditionally build ras.o only if CONFIG_ARM64_RAS_EXTN is defined
> > (and provide a stub in the header for when it isn't defined).
> 
> Yes, I can do that.
> 
> > > 
> > >  extra-$(CONFIG_EFI)                  := efi-entry.o
> > > 
> > > diff --git a/arch/arm64/kernel/ras.c b/arch/arm64/kernel/ras.c
> > > new file mode 100644
> > > index 0000000..ca47efa
> > > --- /dev/null
> > > +++ b/arch/arm64/kernel/ras.c
> > > @@ -0,0 +1,67 @@
> > > +// SPDX-License-Identifier: GPL-2.0
> > > +
> > > +#include <linux/kernel.h>
> > > +#include <linux/cpu.h>
> > > +#include <linux/smp.h>
> > > +
> > > +#include <asm/ras.h>
> > > +
> > > +void arch_arm_ras_report_error(void)
> > > +{
> > > +     u64 num_records;
> > > +     unsigned int i, cpu_num;
> > > +     bool fatal = false;
> > > +     struct ras_ext_regs regs;
> > > +
> > > +     if (!this_cpu_has_cap(ARM64_HAS_RAS_EXTN))
> > > +             return;
> > > +
> > > +     cpu_num = get_cpu();
> > > +     num_records = read_sysreg_s(SYS_ERRIDR_EL1);
> >
> > This value should be masked to exclude the reserved bits. This will
> > also prevent you writing to reserved bits in ERRSELR.
> 
> True, I'll add that in the next version.
> 
> > > +
> > > +     for (i = 0; i < num_records; i++) {
> > > +             write_sysreg_s(i, SYS_ERRSELR_EL1);
> >
> > There should be an isb here, this will ensure the record selection has
> > happened before reading the record.
> 
> I'll add that in the next version.
> 
> > > +             regs.err_status = read_sysreg_s(SYS_ERXSTATUS_EL1);
> > > +
> > > +             if (!(regs.err_status & ERR_STATUS_V))
> > > +                     continue;
> > > +
> > > +             pr_err("CPU%u: ERR%uSTATUS: 0x%llx\n", cpu_num, i,
> > > +                    regs.err_status);
> > > +
> > > +             if (regs.err_status & ERR_STATUS_AV) {
> > > +                     regs.err_addr = read_sysreg_s(SYS_ERXSTATUS_EL1);
> >
> > This should be SYS_ERXADDR_EL1 not SYS_ERXSTATUS_EL1!
> 
> Oops, good catch! I missed this in testing because none of the errors injected
> resulted in valid address info in the system registers.
> 
> > > +                     pr_err("CPU%u: ERR%uADDR: 0x%llx\n", cpu_num, i,
> > > +                            regs.err_addr);
> > > +             } else
> > > +                     regs.err_addr = 0;
> >
> > Or perhaps set "regs = { }" at the start of the function instead?
> 
> Yes, I could do that.
> 
> > > +
> > > +             regs.err_fr = read_sysreg_s(SYS_ERXFR_EL1);
> > > +             pr_err("CPU%u: ERR%uFR: 0x%llx\n", cpu_num, i, regs.err_fr);
> > > +             regs.err_ctlr = read_sysreg_s(SYS_ERXCTLR_EL1);
> > > +             pr_err("CPU%u: ERR%uCTLR: 0x%llx\n", cpu_num, i, regs.err_ctlr);
> > > +
> > > +             if (regs.err_status & ERR_STATUS_MV) {
> > > +                     regs.err_misc0 = read_sysreg_s(SYS_ERXMISC0_EL1);
> > > +                     pr_err("CPU%u: ERR%uMISC0: 0x%llx\n", cpu_num, i,
> > > +                            regs.err_misc0);
> > > +                     regs.err_misc1 = read_sysreg_s(SYS_ERXMISC1_EL1);
> > > +                     pr_err("CPU%u: ERR%uMISC1: 0x%llx\n", cpu_num, i,
> > > +                            regs.err_misc1);
> > > +             }
> > > +
> > > +             /*
> > > +              * In the future, we will treat UER conditions as potentially
> > > +              * recoverable.
> > > +              */
> > > +             if (regs.err_status & ERR_STATUS_UE)
> > > +                     fatal = true;
> > > +
> > > +             write_sysreg_s(regs.err_status, SYS_ERXSTATUS_EL1);
> > > +     }
> > > +
> > > +     if (fatal)
> > > +             panic("uncorrectable error encountered");
> >
> > On the do_serror path, we will already panic if arm64_is_fatal_ras_serror
> > indicates uncorrectable errors. Is this here for the other paths?
> 
> This same function is called for the SEA path and also from AEST for errors
> that are reported through the system register interface.
> 
> > > +
> > > +     put_cpu();
> > > +}
> >
> > Finally, should we clear the errors when we see them?
> 
> Each error is being cleared at the end of the loop above by writing the value
> read from the status register back to the status register. The status register
> bits are write 1 to clear and writing back the same value that was read
> guarantees that a higher priority error that occurs between the read and write
> isn't cleared.

Ah, I missed that.

When you write back the status to clear the bits, I think you ought to mask
off the bottom 16 bits (SERR, IERR) and arguably the top 32 bits to prevent
writing to res0 or fields that are not error write-1-to-clear bits.

Also the manual states "Software must also write ones to the { ER, PN, UET,
CI } fields when clearing these [valid bits] fields" - we achieve this
implicitly by writing back the value of status... however UET and CE take up
2 bits each and include this comment "Writing a value other than all-zeros or
all-ones sets this field to an UNKNOWN value". Thus we probably should do
the following (or similar) to be inline with the spec:

/* Write-one-to-clear the bits we've seen */
regs.err_status &= ~ERR_STATUS_ERR_MASK;

/* Correctly clear double bit fields */
if (regs.err_status & ERR_STATUS_CE_MASK)
    regs.err_status |= ERR_STATUS_CE_MASK;
if (regs.err_status & ERR_STATUS_UET_MASK)
    regs.err_status |= ERR_STATUS_UET_MASK

write_sysreg_s(regs.err_status, SYS_ERXSTATUS_EL1);
isb();

The isb may not be necessary but ensures we've cleared the errors before
leaving the function.

> 
> > > diff --git a/drivers/acpi/arm64/Kconfig b/drivers/acpi/arm64/Kconfig
> > > index 6dba187..8d5cf99 100644
> > > --- a/drivers/acpi/arm64/Kconfig
> > > +++ b/drivers/acpi/arm64/Kconfig
> > > @@ -8,3 +8,6 @@ config ACPI_IORT
> > > 
> > >  config ACPI_GTDT
> > >       bool
> > > +
> > > +config ACPI_AEST
> > > +     bool "ARM Error Source Table Support"
> > > diff --git a/drivers/acpi/arm64/Makefile b/drivers/acpi/arm64/Makefile
> > > index 6ff50f4..ea1ba28 100644
> > > --- a/drivers/acpi/arm64/Makefile
> > > +++ b/drivers/acpi/arm64/Makefile
> > > @@ -1,3 +1,4 @@
> > >  # SPDX-License-Identifier: GPL-2.0-only
> > >  obj-$(CONFIG_ACPI_IORT)      += iort.o
> > >  obj-$(CONFIG_ACPI_GTDT)      += gtdt.o
> > > +obj-$(CONFIG_ACPI_AEST)      += aest.o
> > > diff --git a/drivers/acpi/arm64/aest.c b/drivers/acpi/arm64/aest.c
> > > new file mode 100644
> > > index 0000000..fd4f3b5
> > > --- /dev/null
> > > +++ b/drivers/acpi/arm64/aest.c
> > > @@ -0,0 +1,362 @@
> > > +// SPDX-License-Identifier: GPL-2.0
> > > +
> > > +/* ARM Error Source Table Support */
> > > +
> > > +#include <linux/acpi.h>
> > > +#include <linux/acpi_aest.h>
> > > +#include <linux/init.h>
> > > +#include <linux/interrupt.h>
> > > +#include <linux/io.h>
> > > +#include <linux/irq.h>
> > > +#include <linux/kernel.h>
> > > +#include <linux/percpu.h>
> > > +#include <linux/ratelimit.h>
> > > +
> > > +#include <asm/ras.h>
> > > +
> > > +#undef pr_fmt
> > > +#define pr_fmt(fmt) "ACPI AEST: " fmt
> > > +
> > > +static struct acpi_table_header *aest_table;
> > > +
> > > +static struct aest_node_data __percpu **ppi_data;
> > > +static u8 num_ppi;
> > > +static u8 ppi_idx;
> > > +
> > > +static void aest_print(struct aest_node_data *data, struct ras_ext_regs regs,
> > > +                    int index)
> > > +{
> > > +     /* No more than 2 corrected messages every 5 seconds */
> > > +     static DEFINE_RATELIMIT_STATE(ratelimit_corrected, 5*HZ, 2);
> > > +
> > > +     if (regs.err_status & ERR_STATUS_UE ||
> > > +         regs.err_status & ERR_STATUS_DE ||
> > > +         __ratelimit(&ratelimit_corrected)) {
> > > +             switch (data->node_type) {
> > > +             case AEST_NODE_TYPE_PROC:
> > > +                     pr_err("error from processor 0x%x\n",
> > > +                            data->data.proc.id);
> > > +                     break;
> > > +             case AEST_NODE_TYPE_MEM:
> > > +                     pr_err("error from memory domain 0x%x\n",
> > > +                            data->data.mem.domain);
> > > +                     break;
> > > +             case AEST_NODE_TYPE_VENDOR:
> > > +                     pr_err("error from vendor specific source 0x%x\n",
> > > +                            data->data.vendor.id);
> > > +             }
> > > +
> > > +             pr_err("ERR%dSTATUS = 0x%llx\n", index, regs.err_status);
> > > +             if (regs.err_status & ERR_STATUS_AV)
> > > +                     pr_err("ERR%dADDR = 0x%llx\n", index, regs.err_addr);
> > > +
> > > +             pr_err("ERR%dFR = 0x%llx\n", index, regs.err_fr);
> > > +             pr_err("ERR%dCTLR = 0x%llx\n", index, regs.err_ctlr);
> > > +
> > > +             if (regs.err_status & ERR_STATUS_MV) {
> > > +                     pr_err("ERR%dMISC0 = 0x%llx\n", index, regs.err_misc0);
> > > +                     pr_err("ERR%dMISC1 = 0x%llx\n", index, regs.err_misc1);
> > > +             }
> >
> > Given that we have a ras_ext_regs struct, can't we use a single function to
> > print the error - rather than have duplicate pr_err's here and in
> > arch_arm_ras_report_error?
> 
> That was an option I had thought about, but I left it as is to get other
> opinions. Right now the system register reporting prefixes everything with the
> CPU number that the error occurred on...but now that I think about it more I
> could just have the print function take a prefix as a parameter. I'll unify the
> printing into a single function in the next version.

This is also helpful for times when you want to clear the error without
printing it. (E.g. to catch guest non-fatal errors and prevent a bad guest from
spaming the log buffer).

Thanks,

Andrew Murray

> 
> Thanks,
> Tyler

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 14+ messages in thread

* RE: [PATCH RFC 1/4] ACPI/AEST: Initial AEST driver
  2019-07-02 16:51 ` [PATCH RFC 1/4] ACPI/AEST: Initial AEST driver Tyler Baicar OS
  2019-07-03  9:25   ` Andrew Murray
@ 2019-07-04 16:02   ` Shiju Jose
  2019-07-10  0:49     ` Tyler Baicar OS
  1 sibling, 1 reply; 14+ messages in thread
From: Shiju Jose @ 2019-07-04 16:02 UTC (permalink / raw)
  To: Tyler Baicar OS, Open Source Submission, linux-arm-kernel,
	linux-kernel, linux-acpi, linux-edac, james.morse,
	catalin.marinas, will, lorenzo.pieralisi, Guohanjun (Hanjun Guo),
	sudeep.holla, rjw, lenb, mark.rutland, tony.luck, bp,
	Matteo.Carlini, Andrew.Murray

Hi Tyler,

>-----Original Message-----
>From: linux-acpi-owner@vger.kernel.org [mailto:linux-acpi-
>owner@vger.kernel.org] On Behalf Of Tyler Baicar OS
>Sent: 02 July 2019 17:52
>To: Open Source Submission <patches@amperecomputing.com>; linux-arm-
>kernel@lists.infradead.org; linux-kernel@vger.kernel.org; linux-
>acpi@vger.kernel.org; linux-edac@vger.kernel.org; james.morse@arm.com;
>catalin.marinas@arm.com; will@kernel.org; lorenzo.pieralisi@arm.com;
>Guohanjun (Hanjun Guo) <guohanjun@huawei.com>; sudeep.holla@arm.com;
>rjw@rjwysocki.net; lenb@kernel.org; mark.rutland@arm.com;
>tony.luck@intel.com; bp@alien8.de; Matteo.Carlini@arm.com;
>Andrew.Murray@arm.com
>Cc: Tyler Baicar OS <baicar@os.amperecomputing.com>
>Subject: [PATCH RFC 1/4] ACPI/AEST: Initial AEST driver
>
>Add support for parsing the ARM Error Source Table and basic handling of
>errors reported through both memory mapped and system register interfaces.
>
>Signed-off-by: Tyler Baicar <baicar@os.amperecomputing.com>
>---
> arch/arm64/include/asm/ras.h |  41 +++++
> arch/arm64/kernel/Makefile   |   2 +-
> arch/arm64/kernel/ras.c      |  67 ++++++++
> drivers/acpi/arm64/Kconfig   |   3 +
> drivers/acpi/arm64/Makefile  |   1 +
> drivers/acpi/arm64/aest.c    | 362
>+++++++++++++++++++++++++++++++++++++++++++
> include/linux/acpi_aest.h    |  94 +++++++++++
> 7 files changed, 569 insertions(+), 1 deletion(-)  create mode 100644
>arch/arm64/include/asm/ras.h  create mode 100644 arch/arm64/kernel/ras.c
>create mode 100644 drivers/acpi/arm64/aest.c  create mode 100644
>include/linux/acpi_aest.h
>
>diff --git a/arch/arm64/include/asm/ras.h b/arch/arm64/include/asm/ras.h
>new file mode 100644 index 0000000..36bfff4
>--- /dev/null
>+++ b/arch/arm64/include/asm/ras.h
>@@ -0,0 +1,41 @@
>+/* SPDX-License-Identifier: GPL-2.0 */
>+#ifndef __ASM_RAS_H
>+#define __ASM_RAS_H
>+
>+#define ERR_STATUS_AV		BIT(31)
>+#define ERR_STATUS_V		BIT(30)
>+#define ERR_STATUS_UE		BIT(29)
>+#define ERR_STATUS_ER		BIT(28)
>+#define ERR_STATUS_OF		BIT(27)
>+#define ERR_STATUS_MV		BIT(26)
>+#define ERR_STATUS_CE_SHIFT	24
>+#define ERR_STATUS_CE_MASK	0x3
>+#define ERR_STATUS_DE		BIT(23)
>+#define ERR_STATUS_PN		BIT(22)
>+#define ERR_STATUS_UET_SHIFT	20
>+#define ERR_STATUS_UET_MASK	0x3
>+#define ERR_STATUS_IERR_SHIFT	8
>+#define ERR_STATUS_IERR_MASK	0xff
>+#define ERR_STATUS_SERR_SHIFT	0
>+#define ERR_STATUS_SERR_MASK	0xff
>+
>+#define ERR_FR_CEC_SHIFT	12
>+#define ERR_FR_CEC_MASK		0x7
>+
>+#define ERR_FR_8B_CEC		BIT(1)
>+#define ERR_FR_16B_CEC		BIT(2)
>+
>+struct ras_ext_regs {
>+	u64 err_fr;
>+	u64 err_ctlr;
>+	u64 err_status;
>+	u64 err_addr;
>+	u64 err_misc0;
>+	u64 err_misc1;
>+	u64 err_misc2;
>+	u64 err_misc3;
err_misc2 and err_misc3 are not used. Are they for the future purpose?

>+};
>+
>+void arch_arm_ras_report_error(void);
>+
>+#endif	/* __ASM_RAS_H */
[...]
>+
>+int __init acpi_aest_init(void)
>+{
>+	struct acpi_table_aest *aest;
>+	struct aest_type_header *aest_node, *aest_end;
>+	int i, ret = 0;
>+
>+	if (acpi_disabled)
>+		return 0;
>+
>+	if (ACPI_FAILURE(acpi_get_table(ACPI_SIG_AEST, 0, &aest_table)))
>+		return -EINVAL;
>+
>+	aest = (struct acpi_table_aest *)aest_table;
>+
>+	/* Get the first AEST node */
>+	aest_node = ACPI_ADD_PTR(struct aest_type_header, aest,
>+				 sizeof(struct acpi_table_aest));
>+	/* Pointer to the end of the AEST table */
>+	aest_end = ACPI_ADD_PTR(struct aest_type_header, aest,
>+				aest_table->length);
>+
>+	while (aest_node < aest_end) {
>+		if (((u64)aest_node + aest_node->length) > (u64)aest_end) {
>+			pr_err("AEST node pointer overflow, bad table\n");
>+			return -EINVAL;
>+		}
>+
>+		aest_count_ppi(aest_node);
>+
>+		aest_node = ACPI_ADD_PTR(struct aest_type_header,
>aest_node,
>+					 aest_node->length);
>+	}
>+
>+	if (num_ppi > AEST_MAX_PPI) {
>+		pr_err("Limiting PPI support to %d PPIs\n", AEST_MAX_PPI);
>+		num_ppi = AEST_MAX_PPI;
>+	}
>+
>+	ppi_data = kcalloc(num_ppi, sizeof(struct aest_node_data *),
>+			   GFP_KERNEL);
>+
>+	for (i = 0; i < num_ppi; i++) {
>+		ppi_data[i] = alloc_percpu(struct aest_node_data);
>+		if (!ppi_data[i]) {
>+			ret = -ENOMEM;
>+			break;
>+		}
>+	}
>+
>+	if (ret) {
>+		pr_err("Failed percpu allocation\n");
>+		for (i = 0; i < num_ppi; i++)
>+			free_percpu(ppi_data[i]);
I think 'ppi_data' to be freed here?

>+		return ret;
[...]
>+
>+#endif /* AEST_H */
>--
>1.8.3.1

Thanks,
Shiju


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH RFC 2/4] arm64: mm: Add RAS extension system register check to SEA handling
  2019-07-02 16:51 ` [PATCH RFC 2/4] arm64: mm: Add RAS extension system register check to SEA handling Tyler Baicar OS
@ 2019-07-08 10:00   ` James Morse
  2019-07-10  0:51     ` Tyler Baicar OS
  0 siblings, 1 reply; 14+ messages in thread
From: James Morse @ 2019-07-08 10:00 UTC (permalink / raw)
  To: Tyler Baicar OS
  Cc: mark.rutland, lorenzo.pieralisi, catalin.marinas, sudeep.holla,
	rjw, linux-kernel, Matteo.Carlini, linux-acpi, tony.luck, bp,
	guohanjun, Andrew.Murray, Open Source Submission, lenb, will,
	linux-arm-kernel, linux-edac

Hey Tyler,

On 02/07/2019 17:51, Tyler Baicar OS wrote:
> On systems that support the ARM RAS extension, synchronous external
> abort syndrome information could be captured in the core's RAS extension
> system registers. So, when handling SEAs check the RAS system registers
> for error syndrome information.

> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
> index 2d11501..76b42ca 100644
> --- a/arch/arm64/mm/fault.c
> +++ b/arch/arm64/mm/fault.c
> @@ -37,6 +37,7 @@
>  #include <asm/pgtable.h>
>  #include <asm/tlbflush.h>
>  #include <asm/traps.h>
> +#include <asm/ras.h>
>  
>  struct fault_info {
>  	int	(*fn)(unsigned long addr, unsigned int esr,
> @@ -632,6 +633,8 @@ static int do_sea(unsigned long addr, unsigned int esr, struct pt_regs *regs)
>  
>  	inf = esr_to_fault_info(esr);
>  
> +	arch_arm_ras_report_error();
> +
>  	/*
>  	 * Return value ignored as we rely on signal merging.
>  	 * Future patches will make this more robust.
> 

If we interrupted a preemptible context, do_sea() is preemptible too... This means we
can't know if we're still running on the same CPU as the one that took the external-abort.
(until this series, it hasn't mattered).

Fixing this means cramming something into entry.S's el1_da, as this may unmask interrupts
before calling do_mem_abort(). But its going to be ugly because some of do_mem_abort()s
ESR values need to be preemptible because they sleep, e.g. page-faults calling
handle_mm_fault().
For do_sea(), do_exit() will 'fix' the preempt count if we kill the thread, but if we
don't, it still needs to be balanced. Doing all this in assembly is going to be unreadable!

Mark Rutland has a series to move the entry assembly into C [0]. Based on that that it
should be possible for the new el1_abort() to spot a Synchronous-External-Abort ESR, and
wrap the do_mem_abort() with preempt enable/disable, before inheriting the flags. (which
for synchronous exceptions, I think we should always do)


Thanks,

James

[0] https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=arm64/entry-deasm

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH RFC 1/4] ACPI/AEST: Initial AEST driver
  2019-07-04 16:02   ` Shiju Jose
@ 2019-07-10  0:49     ` Tyler Baicar OS
  0 siblings, 0 replies; 14+ messages in thread
From: Tyler Baicar OS @ 2019-07-10  0:49 UTC (permalink / raw)
  To: Shiju Jose, Open Source Submission, linux-arm-kernel,
	linux-kernel, linux-acpi, linux-edac, james.morse,
	catalin.marinas, will, lorenzo.pieralisi, Guohanjun (Hanjun Guo),
	sudeep.holla, rjw, lenb, mark.rutland, tony.luck, bp,
	Matteo.Carlini, Andrew.Murray

Hello Shiju,

Thank you for the feedback!

On Thu, Jul 4, 2019 at 12:03 PM Shiju Jose <shiju.jose@huawei.com> wrote:
> >+struct ras_ext_regs {
> >+      u64 err_fr;
> >+      u64 err_ctlr;
> >+      u64 err_status;
> >+      u64 err_addr;
> >+      u64 err_misc0;
> >+      u64 err_misc1;
> >+      u64 err_misc2;
> >+      u64 err_misc3;
> err_misc2 and err_misc3 are not used. Are they for the future purpose?

Yes, these will be for future purpose once ARMv8.4 support is added. I'd like
to keep them in this structure define since that makes iterating through the
memory mapped error records easier. Regardless of ARMv8.2 or ARMv8.4 support,
each error record in memory mapped nodes are 64 bytes apart. Having these in
the structure make the structure 64 bytes long making it possible for me to
increment through error records with the increment ++ operation.

If folks don't agree with that, I can change this structure to just have a
reserved field at the end such as:

+ u64 res0[2];

or

+ u8 res0[8];


> >+      ppi_data = kcalloc(num_ppi, sizeof(struct aest_node_data *),
> >+                         GFP_KERNEL);
> >+
> >+      for (i = 0; i < num_ppi; i++) {
> >+              ppi_data[i] = alloc_percpu(struct aest_node_data);
> >+              if (!ppi_data[i]) {
> >+                      ret = -ENOMEM;
> >+                      break;
> >+              }
> >+      }
> >+
> >+      if (ret) {
> >+              pr_err("Failed percpu allocation\n");
> >+              for (i = 0; i < num_ppi; i++)
> >+                      free_percpu(ppi_data[i]);
> I think 'ppi_data' to be freed here?

Yes it should be! I'll add that in the next version.

Thanks,
Tyler
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH RFC 2/4] arm64: mm: Add RAS extension system register check to SEA handling
  2019-07-08 10:00   ` James Morse
@ 2019-07-10  0:51     ` Tyler Baicar OS
  2019-07-11  4:14       ` Tyler Baicar OS
  0 siblings, 1 reply; 14+ messages in thread
From: Tyler Baicar OS @ 2019-07-10  0:51 UTC (permalink / raw)
  To: James Morse
  Cc: mark.rutland, lorenzo.pieralisi, catalin.marinas, sudeep.holla,
	rjw, linux-kernel, Matteo.Carlini, linux-acpi, tony.luck, bp,
	guohanjun, Andrew.Murray, Open Source Submission, lenb, will,
	linux-arm-kernel, linux-edac

On Mon, Jul 8, 2019 at 10:10 AM James Morse <james.morse@arm.com> wrote:
> On 02/07/2019 17:51, Tyler Baicar OS wrote:
> > On systems that support the ARM RAS extension, synchronous external
> > abort syndrome information could be captured in the core's RAS extension
> > system registers. So, when handling SEAs check the RAS system registers
> > for error syndrome information.
>
> > diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
> > index 2d11501..76b42ca 100644
> > --- a/arch/arm64/mm/fault.c
> > +++ b/arch/arm64/mm/fault.c
> > @@ -37,6 +37,7 @@
> >  #include <asm/pgtable.h>
> >  #include <asm/tlbflush.h>
> >  #include <asm/traps.h>
> > +#include <asm/ras.h>
> > 
> >  struct fault_info {
> >       int     (*fn)(unsigned long addr, unsigned int esr,
> > @@ -632,6 +633,8 @@ static int do_sea(unsigned long addr, unsigned int esr, struct pt_regs *regs)
> > 
> >       inf = esr_to_fault_info(esr);
> > 
> > +     arch_arm_ras_report_error();
> > +
> >       /*
> >        * Return value ignored as we rely on signal merging.
> >        * Future patches will make this more robust.
> >
>
> If we interrupted a preemptible context, do_sea() is preemptible too... This means we
> can't know if we're still running on the same CPU as the one that took the external-abort.
> (until this series, it hasn't mattered).
>
> Fixing this means cramming something into entry.S's el1_da, as this may unmask interrupts
> before calling do_mem_abort(). But its going to be ugly because some of do_mem_abort()s
> ESR values need to be preemptible because they sleep, e.g. page-faults calling
> handle_mm_fault().
> For do_sea(), do_exit() will 'fix' the preempt count if we kill the thread, but if we
> don't, it still needs to be balanced. Doing all this in assembly is going to be unreadable!
>
> Mark Rutland has a series to move the entry assembly into C [0]. Based on that that it
> should be possible for the new el1_abort() to spot a Synchronous-External-Abort ESR, and
> wrap the do_mem_abort() with preempt enable/disable, before inheriting the flags. (which
> for synchronous exceptions, I think we should always do)
>
> [0] https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=arm64/entry-deasm

Hey James,

Good catch! I didn't think the synchronous route was preemptible.

I wasn't seeing this issue when testing this on emulation, but I was able to
test and prove the issue on a Neoverse N1 SDP:

root@genericarmv8:~# echo 0x100000000 > /proc/cached_read
[   42.985622] Reading from address 0x100000000
[   42.989893] WARNING: CPU: 0 PID: 2812 at /home/tyler/neoverse/arm-reference-
platforms/linux/arch/arm64/kernel/cpufeature.c:1940 this_cpu_has_cap+0x68/0x78
[..]
[   43.119083] Call trace:
[   43.121515]  this_cpu_has_cap+0x68/0x78
[   43.125338]  do_sea+0x34/0x70
[   43.128292]  do_mem_abort+0x3c/0x98
[   43.131765]  el1_da+0x20/0x94
[   43.134722]  cached_read+0x30/0x68
[   43.138112]  simple_attr_write+0xbc/0x128
[   43.142109]  proc_reg_write+0x60/0xa8
[   43.145757]  __vfs_write+0x18/0x40
[   43.149145]  vfs_write+0xa4/0x1b8
[   43.152445]  ksys_write+0x64/0xe0
[   43.155746]  __arm64_sys_write+0x14/0x20
[   43.159654]  el0_svc_common.constprop.0+0xa8/0x100
[   43.164430]  el0_svc_handler+0x28/0x78
[   43.168165]  el0_svc+0x8/0xc
[   43.171031] ---[ end trace 2c27619659261a1d ]---
[   43.175647] Internal error: synchronous external abort: 96000410 [#1]
PREEMPT SMP
[..]

That warning is because it's preemptible:

if (!WARN_ON(preemptible()) && n < ARM64_NCAPS) {

I'll pull Mark's series in and add the preempt enable/disable around the call
to do_mem_abort() in el1_abort() and test that out!

Thanks,
Tyler
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH RFC 2/4] arm64: mm: Add RAS extension system register check to SEA handling
  2019-07-10  0:51     ` Tyler Baicar OS
@ 2019-07-11  4:14       ` Tyler Baicar OS
  2019-07-17 17:41         ` James Morse
  0 siblings, 1 reply; 14+ messages in thread
From: Tyler Baicar OS @ 2019-07-11  4:14 UTC (permalink / raw)
  To: James Morse, mark.rutland
  Cc: lorenzo.pieralisi, catalin.marinas, guohanjun, rjw,
	Matteo.Carlini, linux-kernel, linux-acpi, tony.luck, bp,
	sudeep.holla, Andrew.Murray, Open Source Submission, linux-edac,
	will, linux-arm-kernel, lenb

Hi James, Mark,

On Tue, Jul 9, 2019 at 8:52 PM Tyler Baicar OS <baicar@os.amperecomputing.com> wrote:
> On Mon, Jul 8, 2019 at 10:10 AM James Morse <james.morse@arm.com> wrote:
> > On 02/07/2019 17:51, Tyler Baicar OS wrote:
> > > @@ -632,6 +633,8 @@ static int do_sea(unsigned long addr, unsigned int esr, struct pt_regs *regs)
> > >
> > >       inf = esr_to_fault_info(esr);
> > >
> > > +     arch_arm_ras_report_error();
> > > +
> > >       /*
> > >        * Return value ignored as we rely on signal merging.
> > >        * Future patches will make this more robust.
> > >
> >
> > If we interrupted a preemptible context, do_sea() is preemptible too... This means we
> > can't know if we're still running on the same CPU as the one that took the external-abort.
> > (until this series, it hasn't mattered).
> >
> > Fixing this means cramming something into entry.S's el1_da, as this may unmask interrupts
> > before calling do_mem_abort(). But its going to be ugly because some of do_mem_abort()s
> > ESR values need to be preemptible because they sleep, e.g. page-faults calling
> > handle_mm_fault().
> > For do_sea(), do_exit() will 'fix' the preempt count if we kill the thread, but if we
> > don't, it still needs to be balanced. Doing all this in assembly is going to be unreadable!
> >
> > Mark Rutland has a series to move the entry assembly into C [0]. Based on that that it
> > should be possible for the new el1_abort() to spot a Synchronous-External-Abort ESR, and
> > wrap the do_mem_abort() with preempt enable/disable, before inheriting the flags. (which
> > for synchronous exceptions, I think we should always do)
> >
> > [0] https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=arm64/entry-deasm
>
> Hey James,
>
> Good catch! I didn't think the synchronous route was preemptible.
>
> I wasn't seeing this issue when testing this on emulation, but I was able to
> test and prove the issue on a Neoverse N1 SDP:
>
> root@genericarmv8:~# echo 0x100000000 > /proc/cached_read
> [   42.985622] Reading from address 0x100000000
> [   42.989893] WARNING: CPU: 0 PID: 2812 at /home/tyler/neoverse/arm-reference-
> platforms/linux/arch/arm64/kernel/cpufeature.c:1940 this_cpu_has_cap+0x68/0x78
> [..]
> [   43.119083] Call trace:
> [   43.121515]  this_cpu_has_cap+0x68/0x78
> [   43.125338]  do_sea+0x34/0x70
> [   43.128292]  do_mem_abort+0x3c/0x98
> [   43.131765]  el1_da+0x20/0x94
> [   43.134722]  cached_read+0x30/0x68
> [   43.138112]  simple_attr_write+0xbc/0x128
> [   43.142109]  proc_reg_write+0x60/0xa8
> [   43.145757]  __vfs_write+0x18/0x40
> [   43.149145]  vfs_write+0xa4/0x1b8
> [   43.152445]  ksys_write+0x64/0xe0
> [   43.155746]  __arm64_sys_write+0x14/0x20
> [   43.159654]  el0_svc_common.constprop.0+0xa8/0x100
> [   43.164430]  el0_svc_handler+0x28/0x78
> [   43.168165]  el0_svc+0x8/0xc
> [   43.171031] ---[ end trace 2c27619659261a1d ]---
> [   43.175647] Internal error: synchronous external abort: 96000410 [#1]
> PREEMPT SMP
> [..]
>
> That warning is because it's preemptible:
>
> if (!WARN_ON(preemptible()) && n < ARM64_NCAPS) {
>
> I'll pull Mark's series in and add the preempt enable/disable around the call
> to do_mem_abort() in el1_abort() and test that out!

I was able to pull in the series mentioned [0] and add a patch to wrap
do_mem_abort with preempt disable/enable and the warning has gone away.

diff --git a/arch/arm64/kernel/entry-common.c b/arch/arm64/kernel/entry-common.c
index 43aa78331e72..26cdf7db511a 100644
--- a/arch/arm64/kernel/entry-common.c
+++ b/arch/arm64/kernel/entry-common.c
@@ -118,7 +118,25 @@ static void el1_abort(struct pt_regs *regs, unsigned long esr)
 	unsigned long far = read_sysreg(far_el1);
 	local_daif_inherit(regs);
 	far = untagged_addr(far);
-	do_mem_abort(far, esr, regs);
+
+	switch (esr & ESR_ELx_FSC) {
+	case ESR_ELx_FSC_EXTABT:	// Synchronous External Abort
+	case 0x14:			// SEA level 0 translation table walk
+	case 0x15:			// SEA level 1 translation table walk
+	case 0x16:			// SEA level 2 translation table walk
+	case 0x17:			// SEA level 3 translation table walk
+	case 0x18:			// Synchronous ECC error
+	case 0x1c:			// SECC level 0 translation table walk
+	case 0x1d:			// SECC level 1 translation table walk
+	case 0x1e:			// SECC level 2 translation table walk
+	case 0x1f:			// SECC level 3 translation table walk
+		preempt_disable();
+		do_mem_abort(far, esr, regs);
+		preempt_enable();
+		break;
+	default:
+		do_mem_abort(far, esr, regs);
+	};
 }
 
 /* Stack or PC alignment exception handling */
-- 


Is that what you had in mind James?

Has this series [0] been accepted and is just waiting to be pulled now?
Do you want me to add tested-by?

Thanks,
Tyler

[0] https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=arm64/entry-deasm
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH RFC 2/4] arm64: mm: Add RAS extension system register check to SEA handling
  2019-07-11  4:14       ` Tyler Baicar OS
@ 2019-07-17 17:41         ` James Morse
  0 siblings, 0 replies; 14+ messages in thread
From: James Morse @ 2019-07-17 17:41 UTC (permalink / raw)
  To: Tyler Baicar OS, mark.rutland
  Cc: lorenzo.pieralisi, catalin.marinas, guohanjun, rjw,
	Matteo.Carlini, linux-kernel, linux-acpi, tony.luck, bp,
	sudeep.holla, Andrew.Murray, Open Source Submission, linux-edac,
	will, linux-arm-kernel, lenb

Hi Tyler,

On 11/07/2019 05:14, Tyler Baicar OS wrote:
> On Tue, Jul 9, 2019 at 8:52 PM Tyler Baicar OS <baicar@os.amperecomputing.com> wrote:
>> On Mon, Jul 8, 2019 at 10:10 AM James Morse <james.morse@arm.com> wrote:
>>> On 02/07/2019 17:51, Tyler Baicar OS wrote:
>>>> @@ -632,6 +633,8 @@ static int do_sea(unsigned long addr, unsigned int esr, struct pt_regs *regs)
>>>>
>>>>       inf = esr_to_fault_info(esr);
>>>>
>>>> +     arch_arm_ras_report_error();
>>>> +
>>>>       /*
>>>>        * Return value ignored as we rely on signal merging.
>>>>        * Future patches will make this more robust.
>>>>
>>>
>>> If we interrupted a preemptible context, do_sea() is preemptible too... This means we
>>> can't know if we're still running on the same CPU as the one that took the external-abort.
>>> (until this series, it hasn't mattered).
>>>
>>> Fixing this means cramming something into entry.S's el1_da, as this may unmask interrupts
>>> before calling do_mem_abort(). But its going to be ugly because some of do_mem_abort()s
>>> ESR values need to be preemptible because they sleep, e.g. page-faults calling
>>> handle_mm_fault().
>>> For do_sea(), do_exit() will 'fix' the preempt count if we kill the thread, but if we
>>> don't, it still needs to be balanced. Doing all this in assembly is going to be unreadable!
>>>
>>> Mark Rutland has a series to move the entry assembly into C [0]. Based on that that it
>>> should be possible for the new el1_abort() to spot a Synchronous-External-Abort ESR, and
>>> wrap the do_mem_abort() with preempt enable/disable, before inheriting the flags. (which
>>> for synchronous exceptions, I think we should always do)
>>>
>>> [0] https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=arm64/entry-deasm

>> Good catch! I didn't think the synchronous route was preemptible.

>> I wasn't seeing this issue when testing this on emulation, but I was able to
>> test and prove the issue on a Neoverse N1 SDP:
>>
>> root@genericarmv8:~# echo 0x100000000 > /proc/cached_read
>> [   42.985622] Reading from address 0x100000000
>> [   42.989893] WARNING: CPU: 0 PID: 2812 at /home/tyler/neoverse/arm-reference-
>> platforms/linux/arch/arm64/kernel/cpufeature.c:1940 this_cpu_has_cap+0x68/0x78

[...]

>> [   43.175647] Internal error: synchronous external abort: 96000410 [#1]
>> PREEMPT SMP

[...]

>> I'll pull Mark's series in and add the preempt enable/disable around the call
>> to do_mem_abort() in el1_abort() and test that out!
> 
> I was able to pull in the series mentioned [0] and add a patch to wrap
> do_mem_abort with preempt disable/enable and the warning has gone away.

Great.

I spoke to Mark who commented he hadn't had the time to finish rebasing it onto
for-next/core. (which I guess is why it didn't get posted!).

I've taken a stab at picking out the 'synchronous' parts and rebasing it onto arm64's
for-next/core. That tree is here:
http://www.linux-arm.org/git?p=linux-jm.git;a=shortlog;h=refs/heads/deasm_sync_only/v0

(this should save you doing the rebase)

I'll aim to rebase/retest and post it next week.


> diff --git a/arch/arm64/kernel/entry-common.c b/arch/arm64/kernel/entry-common.c
> index 43aa78331e72..26cdf7db511a 100644
> --- a/arch/arm64/kernel/entry-common.c
> +++ b/arch/arm64/kernel/entry-common.c
> @@ -118,7 +118,25 @@ static void el1_abort(struct pt_regs *regs, unsigned long esr)

el0_ia/da will have the same problem.


>  	unsigned long far = read_sysreg(far_el1);
>  	local_daif_inherit(regs);
>  	far = untagged_addr(far);
> -	do_mem_abort(far, esr, regs);
> +
> +	switch (esr & ESR_ELx_FSC) {
> +	case ESR_ELx_FSC_EXTABT:	// Synchronous External Abort
> +	case 0x14:			// SEA level 0 translation table walk
> +	case 0x15:			// SEA level 1 translation table walk
> +	case 0x16:			// SEA level 2 translation table walk
> +	case 0x17:			// SEA level 3 translation table walk
> +	case 0x18:			// Synchronous ECC error
> +	case 0x1c:			// SECC level 0 translation table walk
> +	case 0x1d:			// SECC level 1 translation table walk
> +	case 0x1e:			// SECC level 2 translation table walk
> +	case 0x1f:			// SECC level 3 translation table walk

Hex numbers, lovely. KVM has a helper for this, can we move/clean that so we can use it here?


> +		preempt_disable();

This is still too late. You can take an interrupt between local_daif_inherit() and be
migrated, before you call preempt_disable() here.

The local_daif_inherit() may need to move into the switch() too.

It may be simpler to fold the 'is_extabt(esr)' check into el1_sync, so that these bypass
el1_abort() and call do_sea() directly, which could then handle the far-read,
preempt-count and daif-inherit itself. I prefer ... whichever looks cleanest!


> +		do_mem_abort(far, esr, regs);
> +		preempt_enable();
> +		break;
> +	default:
> +		do_mem_abort(far, esr, regs);
> +	};
>  }


Thanks,

James

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, back to index

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-07-02 16:51 [PATCH RFC 0/4] ARM Error Source Table Support Tyler Baicar OS
2019-07-02 16:51 ` [PATCH RFC 1/4] ACPI/AEST: Initial AEST driver Tyler Baicar OS
2019-07-03  9:25   ` Andrew Murray
2019-07-03 17:30     ` Tyler Baicar OS
2019-07-04  9:05       ` Andrew Murray
2019-07-04 16:02   ` Shiju Jose
2019-07-10  0:49     ` Tyler Baicar OS
2019-07-02 16:51 ` [PATCH RFC 2/4] arm64: mm: Add RAS extension system register check to SEA handling Tyler Baicar OS
2019-07-08 10:00   ` James Morse
2019-07-10  0:51     ` Tyler Baicar OS
2019-07-11  4:14       ` Tyler Baicar OS
2019-07-17 17:41         ` James Morse
2019-07-02 16:51 ` [PATCH RFC 3/4] arm64: traps: Add RAS extension system register check to serror handling Tyler Baicar OS
2019-07-02 16:52 ` [PATCH RFC 4/4] trace, ras: add ARM RAS extension trace event Tyler Baicar OS

Linux-ARM-Kernel Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-arm-kernel/0 linux-arm-kernel/git/0.git
	git clone --mirror https://lore.kernel.org/linux-arm-kernel/1 linux-arm-kernel/git/1.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-arm-kernel linux-arm-kernel/ https://lore.kernel.org/linux-arm-kernel \
		linux-arm-kernel@lists.infradead.org infradead-linux-arm-kernel@archiver.kernel.org
	public-inbox-index linux-arm-kernel

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.infradead.lists.linux-arm-kernel


AGPL code for this site: git clone https://public-inbox.org/ public-inbox