linux-edac.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH 0/6] CCIX Protocol Error reporting
@ 2019-06-06 12:36 Jonathan Cameron
  2019-06-06 12:36 ` [RFC PATCH 1/6] efi / ras: CCIX Memory error reporting Jonathan Cameron
                   ` (7 more replies)
  0 siblings, 8 replies; 12+ messages in thread
From: Jonathan Cameron @ 2019-06-06 12:36 UTC (permalink / raw)
  To: linux-edac, linux-acpi, linux-efi
  Cc: linuxarm, rjw, tony.luck, bp, james.morse, ard.beisheuvel,
	nariman.poushin, Jonathan Cameron

UEFI 2.8 defines a new CPER record Appendix N for CCIX Protocol Error Records
(PER). www.uefi.org

These include Protocol Error Record logs which are defined in the
CCIX 1.0 Base Specification www.ccixconsortium.com.

Handling of coherency protocol errors is complex and how Linux does this
will take some time to evolve.  For now, fatal errors are handled via the
usual means and everything else is reported.

There are 6 types of error defined, covering:
* Memory errors
* Cache errors
* Address translation unit errors
* CCIX port errors 
* CCIX link errors
* Agent internal errors.

The set includes tracepoints to report the errors to RAS Daemon and a patch
set for RAS Daemon will follow shortly.

There are several open questions for this RFC.
1. Reporting of vendor data.  We have little choice but to do this via a
   dynamic array as these blocks can take arbitrary size. I had hoped
   no one would actually use these given the odd mismatch between a
   standard error structure and non standard element, but there are
   already designs out there that do use it.
2. The trade off between explicit tracepoint fields, on which we might
   want to filter, and the simplicity of a blob. I have gone for having
   the whole of the block specific to the PER error type in an opaque blob.
   Perhaps this is not the right balance?
3. Whether defining 6 new tracepoints is sensible. I think it is:
   * They are all defined by the CCIX specification as independant error
     classes.
   * Many of them can only be generated by particular types of agent.
   * The handling required will vary widely depending on types.
     In the kernel some map cleanly onto existing handling. Keeping the
     whole flow separate will aide this. They vary by a similar amount
     in scope to the RAS errors found on an existing system which have
     independent tracepoints.
   * Separating them out allows for filtering on the tracepoints by
     elements that are not shared between them.
   * Muxing the lot into one record type can lead to ugly code both in
     kernel and in userspace.

Rasdaemon patches will follow shortly.

This patch is being distributed by the CCIX Consortium, Inc. (CCIX) to
you and other parties that are paticipating (the "participants") in the
Linux kernel with the understanding that the participants will use CCIX's
name and trademark only when this patch is used in association with the
Linux kernel and associated user space.

CCIX is also distributing this patch to these participants with the
understanding that if any portion of the CCIX specification will be
used or referenced in the Linux kernel, the participants will not modify
the cited portion of the CCIX specification and will give CCIX propery
copyright attribution by including the following copyright notice with
the cited part of the CCIX specification:
"© 2019 CCIX CONSORTIUM, INC. ALL RIGHTS RESERVED."

Jonathan Cameron (6):
  efi / ras: CCIX Memory error reporting
  efi / ras: CCIX Cache error reporting
  efi / ras: CCIX Address Translation Cache error reporting
  efi / ras: CCIX Port error reporting
  efi / ras: CCIX Link error reporting
  efi / ras: CCIX Agent internal error reporting

 drivers/acpi/apei/Kconfig        |   8 +
 drivers/acpi/apei/ghes.c         |  59 ++
 drivers/firmware/efi/Kconfig     |   5 +
 drivers/firmware/efi/Makefile    |   1 +
 drivers/firmware/efi/cper-ccix.c | 916 +++++++++++++++++++++++++++++++
 drivers/firmware/efi/cper.c      |   6 +
 include/linux/cper.h             | 333 +++++++++++
 include/ras/ras_event.h          | 405 ++++++++++++++
 8 files changed, 1733 insertions(+)
 create mode 100644 drivers/firmware/efi/cper-ccix.c

-- 
2.20.1


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [RFC PATCH 1/6] efi / ras: CCIX Memory error reporting
  2019-06-06 12:36 [RFC PATCH 0/6] CCIX Protocol Error reporting Jonathan Cameron
@ 2019-06-06 12:36 ` Jonathan Cameron
  2019-06-21 17:40   ` Jonathan Cameron
  2019-06-06 12:36 ` [RFC PATCH 2/6] efi / ras: CCIX Cache " Jonathan Cameron
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 12+ messages in thread
From: Jonathan Cameron @ 2019-06-06 12:36 UTC (permalink / raw)
  To: linux-edac, linux-acpi, linux-efi
  Cc: linuxarm, rjw, tony.luck, bp, james.morse, ard.beisheuvel,
	nariman.poushin, Jonathan Cameron

CCIX defines a number of different error types
(See CCIX spec 1.0) and UEFI 2.8 defines a CPER record to allow
for them to be reported when firmware first handling is in use.
The last part of that record is a copy of the CCIX protocol
error record which can provide very detailed information.

This patch introduces infrastructure and support for one of those
error types, CCIX Memory Errors.  Later patches will supply
equivalent support for the other error types.

The variable length and content of the different messages makes
a single tracepoint impractical.  As such the current RAS
tracepoint only covers the memory error. Additional trace points
will be introduced for other error types along with their
cper handling in a follow up series.

RAS daemon support to follow shortly. qemu injection patches
also available but not currently planing to upstream those.

Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
---
 drivers/acpi/apei/Kconfig        |   8 +
 drivers/acpi/apei/ghes.c         |  39 ++++
 drivers/firmware/efi/Kconfig     |   5 +
 drivers/firmware/efi/Makefile    |   1 +
 drivers/firmware/efi/cper-ccix.c | 356 +++++++++++++++++++++++++++++++
 drivers/firmware/efi/cper.c      |   6 +
 include/linux/cper.h             | 118 ++++++++++
 include/ras/ras_event.h          |  77 +++++++
 8 files changed, 610 insertions(+)

diff --git a/drivers/acpi/apei/Kconfig b/drivers/acpi/apei/Kconfig
index 6b18f8bc7be35..e687b18dee344 100644
--- a/drivers/acpi/apei/Kconfig
+++ b/drivers/acpi/apei/Kconfig
@@ -68,3 +68,11 @@ config ACPI_APEI_ERST_DEBUG
 	  error information to and from a persistent store. Enable this
 	  if you want to debugging and testing the ERST kernel support
 	  and firmware implementation.
+
+config ACPI_APEI_CCIX
+       bool "APEI CCIX error recovery support"
+       depends on ACPI_APEI && MEMORY_FAILURE
+       help
+	 CCIX has a number of defined error types. This option enables
+	 the handling of CPER records generated by a firmware performing
+	 firmware first error handling of these CCIX errors.
diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 993940d582f50..cfc7dc31a9380 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -477,6 +477,42 @@ static void ghes_handle_aer(struct acpi_hest_generic_data *gdata)
 #endif
 }
 
+static void ghes_handle_ccix_per(struct acpi_hest_generic_data *gdata, int sev)
+{
+#ifdef CONFIG_ACPI_APEI_CCIX
+	struct cper_sec_ccix_header *header = acpi_hest_get_payload(gdata);
+	__u32 *dw;
+	enum ccix_per_type per_type;
+	static u32 err_seq;
+	void *payload;
+
+	/* Check if space for CCIX CPER header and 8 DW of a PER log header */
+	if (gdata->error_data_length <
+	    sizeof(*header) + CCIX_PER_LOG_HEADER_DWS * sizeof(__u32))
+		return;
+
+	if ((header->validation_bits & CPER_CCIX_VALID_PER_LOG) == 0)
+		return;
+
+	dw = (__u32 *)(header + 1);
+
+	per_type = FIELD_GET(CCIX_PER_LOG_DW1_PER_TYPE_M, dw[1]);
+	payload = acpi_hest_get_payload(gdata);
+
+	switch (per_type) {
+	case CCIX_MEMORY_ERROR:
+		trace_ccix_memory_error_event(payload, err_seq, sev,
+					      ccix_mem_err_ven_len_get(payload));
+		break;
+	default:
+		/* Unknown error type */
+		pr_info("CCIX error of unknown or vendor defined type\n");
+		break;
+	}
+	err_seq++;
+#endif
+}
+
 static void ghes_do_proc(struct ghes *ghes,
 			 const struct acpi_hest_generic_status *estatus)
 {
@@ -507,6 +543,9 @@ static void ghes_do_proc(struct ghes *ghes,
 		else if (guid_equal(sec_type, &CPER_SEC_PCIE)) {
 			ghes_handle_aer(gdata);
 		}
+		else if (guid_equal(sec_type, &CPER_SEC_CCIX)) {
+			ghes_handle_ccix_per(gdata, estatus->error_severity);
+		}
 		else if (guid_equal(sec_type, &CPER_SEC_PROC_ARM)) {
 			struct cper_sec_proc_arm *err = acpi_hest_get_payload(gdata);
 
diff --git a/drivers/firmware/efi/Kconfig b/drivers/firmware/efi/Kconfig
index d4ea929e8b344..9ea161f68da8d 100644
--- a/drivers/firmware/efi/Kconfig
+++ b/drivers/firmware/efi/Kconfig
@@ -195,6 +195,11 @@ config UEFI_CPER_X86
 	depends on UEFI_CPER && X86
 	default y
 
+config UEFI_CPER_CCIX
+       bool
+       depends on UEFI_CPER
+       default y
+
 config EFI_DEV_PATH_PARSER
 	bool
 	depends on ACPI
diff --git a/drivers/firmware/efi/Makefile b/drivers/firmware/efi/Makefile
index d2d0d20306200..69287da9664b6 100644
--- a/drivers/firmware/efi/Makefile
+++ b/drivers/firmware/efi/Makefile
@@ -33,3 +33,4 @@ obj-$(CONFIG_EFI_CAPSULE_LOADER)	+= capsule-loader.o
 obj-$(CONFIG_EFI_EARLYCON)		+= earlycon.o
 obj-$(CONFIG_UEFI_CPER_ARM)		+= cper-arm.o
 obj-$(CONFIG_UEFI_CPER_X86)		+= cper-x86.o
+obj-$(CONFIG_UEFI_CPER_CCIX)		+= cper-ccix.o
diff --git a/drivers/firmware/efi/cper-ccix.c b/drivers/firmware/efi/cper-ccix.c
new file mode 100644
index 0000000000000..9856804bdca81
--- /dev/null
+++ b/drivers/firmware/efi/cper-ccix.c
@@ -0,0 +1,356 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * UEFI Common Platform Error Record (CPER) support for CCIX
+ * protocol errors.
+ *
+ * Copyright (C) 2019, Huawei
+ *	Author: Jonathan Cameron <jonathan.cameron@huawei.com>
+ *
+ * CPER is the format used to describe platform hardware error by
+ * various tables, such as ERST, BERT and HEST etc.
+ *
+ * For more information about CPER, please refer to Appendix N of UEFI
+ * Specification version 2.9.
+ *
+ * CCIX defines a number of Protocol Error Messages which for the
+ * main body of the CCIX CPER records.  These are defined in the
+ * CCIX Specification 1.0.
+ */
+
+#include <acpi/ghes.h>
+#include <linux/acpi.h>
+#include <linux/bitfield.h>
+#include <linux/cper.h>
+#include <linux/kernel.h>
+#include <linux/printk.h>
+#include <ras/ras_event.h>
+
+static char rcd_decode_str[CPER_REC_LEN];
+
+static const char * const ccix_comp_type_strs[] = {
+	"Request Agent",
+	"Home Agent",
+	"Slave Agent",
+	"Port",
+	"Link",
+};
+
+const char *cper_ccix_comp_type_str(u8 comp_type)
+{
+	return comp_type < ARRAY_SIZE(ccix_comp_type_strs) ?
+		ccix_comp_type_strs[comp_type] : "Reserved";
+}
+
+static const char * const ccix_per_type_strs[] = {
+	"Memory Error",
+	"Cache Error",
+	"ATC Error",
+	"Port Error",
+	"Link Error",
+	"Agent Internal",
+};
+
+static const char * const ccix_mem_pool_gen_type_strs[] = {
+	"Other, Non-specified",
+	"ROM",
+	"Volatile",
+	"Non-volatile",
+	"Device",
+};
+
+static const char *cper_ccix_mem_err_generic_type_str(u16 type)
+{
+	const char *gen_type_str;
+
+	if (type < ARRAY_SIZE(ccix_mem_pool_gen_type_strs))
+		gen_type_str = ccix_mem_pool_gen_type_strs[type];
+	else if (type >= 0x80)
+		gen_type_str = "Vendor";
+	else
+		gen_type_str = "Reserved";
+
+	return gen_type_str;
+}
+
+static const char * const ccix_mem_op_type_strs[] = {
+	"Generic",
+	"Read",
+	"Write",
+	"Reserved",
+	"Scrub",
+};
+
+static const char *cper_ccix_mem_err_op_str(u8 op_type)
+{
+	return op_type < ARRAY_SIZE(ccix_mem_op_type_strs) ?
+		ccix_mem_op_type_strs[op_type] :
+		"Reserved";
+}
+
+/* Sightly different from the generic version */
+static const char * const ccix_mem_err_type_strs[] = {
+	"Unknown",
+	"No Error",
+	"Single-bit ECC",
+	"Multi-bit ECC",
+	"Single-symbol ChipKill ECC",
+	"Multi-symbol ChipKill ECC",
+	"Master Abort",
+	"Target Abort",
+	"Parity Error",
+	"Watchdog Timeout",
+	"Invalid Address",
+	"Mirror Broken",
+	"Memory Sparing",
+	"Scrub",
+	"Physical Memory Map-out Event",
+};
+
+const char *cper_ccix_mem_err_type_str(unsigned int error_type)
+{
+	return error_type < ARRAY_SIZE(ccix_mem_err_type_strs) ?
+		ccix_mem_err_type_strs[error_type] : "Reserved";
+}
+
+static const char * const ccix_mem_spec_type_strs[] = {
+	"Other, Not-specified",
+	"SRAM",
+	"DDR",
+	"NVDIMM-F",
+	"NVDIMM-N",
+	"HBM",
+	"Flash"
+};
+
+static const char *cper_ccix_mem_err_spec_type_str(u8 specific_type)
+{
+	if (specific_type < ARRAY_SIZE(ccix_mem_spec_type_strs))
+		return ccix_mem_spec_type_strs[specific_type];
+	else if (specific_type >= 0x80)
+		return "Vendor";
+	else
+		return "Reserved";
+}
+
+/*
+ * We pack up everything except those that are needed for software handling:
+ * - error_type, physical_addr
+ * and header values that would require additional validation bits:
+ * - source, component, severity,
+ * implicit: protocol error type (mem)
+ */
+void cper_ccix_mem_err_pack(const struct cper_sec_ccix_mem_error *mem_record,
+			    struct cper_ccix_mem_err_compact *cmem_err,
+			    const u16 vendor_data_len,
+			    u8 *vendor_data)
+{
+	cmem_err->validation_bits = mem_record->validation_bits;
+	cmem_err->mem_err_type = mem_record->memory_error_type;
+	cmem_err->pool_generic_type = mem_record->pool_generic_type;
+	cmem_err->op_type = mem_record->op_type;
+	cmem_err->card = mem_record->card;
+	cmem_err->module = mem_record->module;
+	cmem_err->bank = mem_record->bank;
+	cmem_err->device = mem_record->device;
+	cmem_err->row = mem_record->row;
+	cmem_err->column = mem_record->column;
+	cmem_err->rank = mem_record->rank;
+	cmem_err->bit_pos = mem_record->bit_pos;
+	cmem_err->chip_id = mem_record->chip_id;
+	cmem_err->pool_specific_type = mem_record->pool_specific_type;
+	cmem_err->fru = mem_record->fru;
+	memcpy(vendor_data, &mem_record->vendor_data[1], vendor_data_len);
+}
+
+static int cper_ccix_err_location(struct cper_ccix_mem_err_compact *cmem_err,
+				  char *msg)
+{
+	u32 len = CPER_REC_LEN - 1;
+	u32 n = 0;
+
+	if (!msg)
+		return 0;
+
+	if (cmem_err->validation_bits & CCIX_MEM_ERR_GENERIC_MEM_VALID)
+		n += snprintf(msg + n, len, "Pool Generic Type: %s ",
+			      cper_ccix_mem_err_generic_type_str(cmem_err->pool_generic_type));
+
+	if (cmem_err->validation_bits & CCIX_MEM_ERR_MEM_ERR_TYPE_VALID)
+		n += snprintf(msg + n, len, "Err Type: %s ",
+			      cper_ccix_mem_err_type_str(cmem_err->mem_err_type));
+	if (cmem_err->validation_bits & CCIX_MEM_ERR_OP_VALID)
+		n += snprintf(msg + n, len, "Operation: %s ",
+			     cper_ccix_mem_err_op_str(cmem_err->op_type));
+
+	if (cmem_err->validation_bits & CCIX_MEM_ERR_CARD_VALID)
+		n += snprintf(msg + n, len, "Card: %d ", cmem_err->card);
+	if (cmem_err->validation_bits & CCIX_MEM_ERR_MOD_VALID)
+		n += snprintf(msg + n, len, "Mod: %d ", cmem_err->module);
+	if (cmem_err->validation_bits & CCIX_MEM_ERR_BANK_VALID)
+		n += snprintf(msg + n, len, "Bank: %d ", cmem_err->bank);
+	if (cmem_err->validation_bits & CCIX_MEM_ERR_DEVICE_VALID)
+		n += snprintf(msg + n, len, "Device: %d ", cmem_err->device);
+	if (cmem_err->validation_bits & CCIX_MEM_ERR_ROW_VALID)
+		n += snprintf(msg + n, len, "Row: %d ", cmem_err->row);
+	if (cmem_err->validation_bits & CCIX_MEM_ERR_COL_VALID)
+		n += snprintf(msg + n, len, "Col: %d ", cmem_err->column);
+	if (cmem_err->validation_bits & CCIX_MEM_ERR_RANK_VALID)
+		n += snprintf(msg + n, len, "Rank: %d ", cmem_err->rank);
+	if (cmem_err->validation_bits & CCIX_MEM_ERR_BIT_POS_VALID)
+		n += snprintf(msg + n, len, "BitPos: %d ", cmem_err->bit_pos);
+	if (cmem_err->validation_bits & CCIX_MEM_ERR_CHIP_ID_VALID)
+		n += snprintf(msg + n, len, "ChipID: %d ", cmem_err->chip_id);
+	if (cmem_err->validation_bits & CCIX_MEM_ERR_SPEC_TYPE_VALID)
+		n += snprintf(msg + n, len, "Pool Specific Type: %s ",
+			      cper_ccix_mem_err_spec_type_str(cmem_err->pool_specific_type));
+	n += snprintf(msg + n, len, "FRU: %d ", cmem_err->fru);
+
+	return n;
+}
+
+const char *cper_ccix_mem_err_unpack(struct trace_seq *p,
+				     struct cper_ccix_mem_err_compact *cmem_err)
+{
+	const char *ret = trace_seq_buffer_ptr(p);
+
+	if (cper_ccix_err_location(cmem_err, rcd_decode_str))
+		trace_seq_printf(p, "%s", rcd_decode_str);
+	trace_seq_putc(p, '\0');
+
+	return ret;
+}
+
+static int cper_ccix_mem_err_details(const char *pfx,
+				     struct acpi_hest_generic_data *gdata)
+{
+	struct cper_ccix_mem_error *full_mem_err;
+	struct cper_sec_ccix_mem_error *mem_err;
+	u16 vendor_data_len;
+	int i;
+
+	if (gdata->error_data_length < sizeof(*full_mem_err))
+		return -ENOSPC;
+
+	full_mem_err = acpi_hest_get_payload(gdata);
+
+	mem_err = &full_mem_err->mem_record;
+	printk("%s""FRU ID: %u, Length: %u\n", pfx,
+	       mem_err->fru, mem_err->length);
+	if (mem_err->validation_bits & CCIX_MEM_ERR_GENERIC_MEM_VALID)
+		printk("%s""Pool Generic Type: %s\n", pfx,
+		       cper_ccix_mem_err_generic_type_str(mem_err->pool_generic_type));
+
+	if (mem_err->validation_bits & CCIX_MEM_ERR_OP_VALID)
+		printk("%s""Operation: %s\n", pfx,
+		       cper_ccix_mem_err_op_str(mem_err->op_type));
+
+	if (mem_err->validation_bits & CCIX_MEM_ERR_MEM_ERR_TYPE_VALID) {
+		printk("%s""Mem Error Type: %s\n", pfx,
+		       cper_ccix_mem_err_type_str(mem_err->memory_error_type));
+	}
+	if (mem_err->validation_bits & CCIX_MEM_ERR_CARD_VALID)
+		printk("%s""Card: %u\n", pfx, mem_err->card);
+	if (mem_err->validation_bits & CCIX_MEM_ERR_MOD_VALID)
+		printk("%s""Module: %u\n", pfx, mem_err->module);
+	if (mem_err->validation_bits & CCIX_MEM_ERR_BANK_VALID)
+		printk("%s""Bank: %u\n", pfx, mem_err->bank);
+	if (mem_err->validation_bits & CCIX_MEM_ERR_DEVICE_VALID)
+		printk("%s""Device: %u\n", pfx, mem_err->device);
+	if (mem_err->validation_bits & CCIX_MEM_ERR_ROW_VALID)
+		printk("%s""Row: %u\n", pfx, mem_err->row);
+	if (mem_err->validation_bits & CCIX_MEM_ERR_COL_VALID)
+		printk("%s""Column: %u\n", pfx, mem_err->column);
+	if (mem_err->validation_bits & CCIX_MEM_ERR_RANK_VALID)
+		printk("%s""Rank: %u\n", pfx, mem_err->rank);
+	if (mem_err->validation_bits & CCIX_MEM_ERR_BIT_POS_VALID)
+		printk("%s""Bit Pos: %u\n", pfx, mem_err->bit_pos);
+	if (mem_err->validation_bits & CCIX_MEM_ERR_CHIP_ID_VALID)
+		printk("%s""Chip ID: %u\n", pfx, mem_err->chip_id);
+	if (mem_err->validation_bits & CCIX_MEM_ERR_SPEC_TYPE_VALID)
+		printk("%s""Specific Type: %s\n", pfx,
+		       cper_ccix_mem_err_spec_type_str(mem_err->pool_specific_type));
+
+	if (mem_err->validation_bits & CCIX_MEM_ERR_VENDOR_DATA_VALID) {
+		if (gdata->error_data_length < sizeof(*full_mem_err) + 4)
+			return -ENOSPC;
+
+		vendor_data_len = mem_err->vendor_data[0] & GENMASK(15, 0);
+		if (gdata->error_data_length <
+		    sizeof(*full_mem_err) + vendor_data_len)
+			return -ENOSPC;
+
+		for (i = 0; i < vendor_data_len / 4 - 1; i++)
+			printk("%s""Vendor%d: 0x%08x\n", pfx, i,
+			       mem_err->vendor_data[i + 1]);
+	}
+
+	return 0;
+}
+
+int cper_print_ccix_per(const char *pfx, struct acpi_hest_generic_data *gdata)
+{
+	struct cper_sec_ccix_header *header = acpi_hest_get_payload(gdata);
+	__u32 *dw;
+	__u32 comp_type;
+	enum ccix_per_type per_type;
+	bool vendor_per;
+
+	if (gdata->error_data_length < sizeof(*header))
+		return -ENOSPC;
+
+	printk("%s""CPER Length: %u\n", pfx, header->length);
+	if (header->validation_bits & CPER_CCIX_VALID_SOURCE_ID)
+		printk("%s""Source: %u\n", pfx, header->source_id);
+	if (header->validation_bits & CPER_CCIX_VALID_PORT_ID)
+		printk("%s""Port: %u\n", pfx, header->port_id);
+	/* Not much use if we don't have the per log, in theory it's optional */
+	if ((header->validation_bits & CPER_CCIX_VALID_PER_LOG) == 0)
+		return 0;
+
+	/* The per log header is a packed structure so needs breaking up */
+	if (gdata->error_data_length < sizeof(*header) + 8 * 4)
+		return -ENOSPC;
+
+	dw = (__u32 *)(header + 1);
+
+	printk("%s""PER Rev: %lu, Log Length: %lu\n", pfx,
+	       FIELD_GET(CCIX_PER_LOG_DW0_REV_M, dw[0]),
+	       FIELD_GET(CCIX_PER_LOG_DW0_LEN_M, dw[0]));
+	comp_type = FIELD_GET(CCIX_PER_LOG_DW1_COMP_TYPE_M, dw[1]);
+	printk("%s""Component: %s\n", pfx, cper_ccix_comp_type_str(comp_type));
+	printk("%s""ME: %lu, SevUE: %lu, SevNoComm: %lu, SevDegraded: %lu, SevDeferred %lu\n",
+	       pfx,
+	       FIELD_GET(CCIX_PER_LOG_DW0_ME_M, dw[0]),
+	       FIELD_GET(CCIX_PER_LOG_DW1_SEV_UE_M, dw[1]),
+	       FIELD_GET(CCIX_PER_LOG_DW1_SEV_NO_COMM_M, dw[1]),
+	       FIELD_GET(CCIX_PER_LOG_DW1_SEV_DEGRADED_M, dw[1]),
+	       FIELD_GET(CCIX_PER_LOG_DW1_SEV_DEFFERABLE_M, dw[1]));
+
+	/* per_type is vendor defined if VEN is set */
+	vendor_per = FIELD_GET(CCIX_PER_LOG_DW1_VEN_VAL_M, dw[1]) ?
+		true : false;
+	per_type = FIELD_GET(CCIX_PER_LOG_DW1_PER_TYPE_M, dw[1]);
+	if (vendor_per)
+		printk("%s""Protocol Error Type: Vendor%u", pfx, per_type);
+	else
+		printk("%s""Protocol Error Type: %s\n", pfx,
+		       per_type < ARRAY_SIZE(ccix_per_type_strs) ?
+		       ccix_per_type_strs[per_type] : "Reserved");
+
+	if (FIELD_GET(CCIX_PER_LOG_DW1_ADDR_VAL_M, dw[1]))
+		printk("%s""Address: 0x%llx\n", pfx,
+		       (((__u64)dw[2]) << 32) | (dw[3] & 0xFFFFFFFC));
+
+	/* Vendor defined PER message, perhaps we could print it out */
+	if (vendor_per)
+		return 0;
+
+	switch (per_type) {
+	case CCIX_MEMORY_ERROR:
+		return cper_ccix_mem_err_details(pfx, gdata);
+	default:
+		/* Vendor defined so no formatting be done */
+		break;
+	}
+	return 0;
+}
diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c
index 8fa977c7861f9..52f0954a4577a 100644
--- a/drivers/firmware/efi/cper.c
+++ b/drivers/firmware/efi/cper.c
@@ -459,6 +459,12 @@ cper_estatus_print_section(const char *pfx, struct acpi_hest_generic_data *gdata
 			cper_print_pcie(newpfx, pcie, gdata);
 		else
 			goto err_section_too_small;
+	} else if (guid_equal(sec_type, &CPER_SEC_CCIX)) {
+		int ret;
+		/* CCIX CPER entries are variable length */
+		ret = cper_print_ccix_per(newpfx, gdata);
+		if (ret)
+			goto err_section_too_small;
 #if defined(CONFIG_ARM64) || defined(CONFIG_ARM)
 	} else if (guid_equal(sec_type, &CPER_SEC_PROC_ARM)) {
 		struct cper_sec_proc_arm *arm_err = acpi_hest_get_payload(gdata);
diff --git a/include/linux/cper.h b/include/linux/cper.h
index cc4980bb0f65a..a9a7ef56f4dc8 100644
--- a/include/linux/cper.h
+++ b/include/linux/cper.h
@@ -186,6 +186,9 @@ enum {
 #define CPER_SEC_PCIE							\
 	GUID_INIT(0xD995E954, 0xBBC1, 0x430F, 0xAD, 0x91, 0xB4, 0x4D,	\
 		  0xCB, 0x3C, 0x6F, 0x35)
+#define CPER_SEC_CCIX							\
+	GUID_INIT(0x91335EF6, 0xEBFB, 0x4478, 0xA6, 0xA6, 0x88, 0xB7,	\
+		  0x28, 0xCF, 0x75, 0xD7)
 /* Firmware Error Record Reference */
 #define CPER_SEC_FW_ERR_REC_REF						\
 	GUID_INIT(0x81212A96, 0x09ED, 0x4996, 0x94, 0x71, 0x8D, 0x72,	\
@@ -254,6 +257,10 @@ enum {
 
 #define CPER_PCIE_SLOT_SHIFT			3
 
+#define CPER_CCIX_VALID_SOURCE_ID		BIT(0)
+#define CPER_CCIX_VALID_PORT_ID			BIT(1)
+#define CPER_CCIX_VALID_PER_LOG			BIT(2)
+
 #define CPER_ARM_VALID_MPIDR			BIT(0)
 #define CPER_ARM_VALID_AFFINITY_LEVEL		BIT(1)
 #define CPER_ARM_VALID_RUNNING_STATE		BIT(2)
@@ -533,6 +540,105 @@ struct cper_sec_pcie {
 	u8	aer_info[96];
 };
 
+struct cper_sec_ccix_header {
+	__u32	length;
+	__u64	validation_bits;
+	__u8	source_id;
+	__u8	port_id;
+	__u8	reserved[2];
+};
+
+#define CCIX_PER_LOG_DW0_REV_M			GENMASK(7, 0)
+#define CCIX_PER_LOG_DW0_LEN_M			GENMASK(14, 8)
+#define CCIX_PER_LOG_DW0_ME_M			BIT(15)
+#define CCIX_PER_LOG_DW1_COMP_TYPE_M		GENMASK(15, 12)
+#define CCIX_PER_LOG_DW1_SEV_UE_M		BIT(16)
+#define CCIX_PER_LOG_DW1_SEV_NO_COMM_M		BIT(17)
+#define CCIX_PER_LOG_DW1_SEV_DEGRADED_M		BIT(18)
+#define CCIX_PER_LOG_DW1_SEV_DEFFERABLE_M	BIT(19)
+#define CCIX_PER_LOG_DW1_PER_TYPE_M		GENMASK(27, 24)
+#define CCIX_PER_LOG_DW1_ADDR_VAL_M		BIT(30)
+#define CCIX_PER_LOG_DW1_VEN_VAL_M		BIT(31)
+enum ccix_per_type {
+	CCIX_MEMORY_ERROR = 0,
+	CCIX_CACHE_ERROR = 1,
+	CCIX_ATC_ERROR = 2,
+	CCIX_PORT_ERROR = 3,
+	CCIX_LINK_ERROR = 4,
+	CCIX_AGENT_INTERNAL_ERROR = 5,
+};
+
+#define CCIX_PER_LOG_HEADER_DWS 8
+
+struct cper_sec_ccix_mem_error {
+	__u32	validation_bits;
+#define CCIX_MEM_ERR_GENERIC_MEM_VALID		BIT(0)
+#define CCIX_MEM_ERR_OP_VALID			BIT(1)
+#define CCIX_MEM_ERR_MEM_ERR_TYPE_VALID		BIT(2)
+#define CCIX_MEM_ERR_CARD_VALID			BIT(3)
+#define CCIX_MEM_ERR_BANK_VALID			BIT(4)
+#define CCIX_MEM_ERR_DEVICE_VALID		BIT(5)
+#define CCIX_MEM_ERR_ROW_VALID			BIT(6)
+#define CCIX_MEM_ERR_COL_VALID			BIT(7)
+#define CCIX_MEM_ERR_RANK_VALID			BIT(8)
+#define CCIX_MEM_ERR_BIT_POS_VALID		BIT(9)
+#define CCIX_MEM_ERR_CHIP_ID_VALID		BIT(10)
+#define CCIX_MEM_ERR_VENDOR_DATA_VALID		BIT(11)
+#define CCIX_MEM_ERR_MOD_VALID			BIT(12)
+#define CCIX_MEM_ERR_SPEC_TYPE_VALID		BIT(13)
+
+	__u8	fru;
+	__u8	reserved;
+	__u16	length; /* Includes vendor specific log info */
+	__u8	pool_generic_type;
+	__u8	op_type;
+	__u8	memory_error_type;
+	__u8	card;
+	__u16	module;
+	__u16	bank;
+	__u32	device;
+	__u32	row;
+	__u32	column;
+	__u32	rank;
+	__u8	bit_pos;
+	__u8	chip_id;
+	__u8	pool_specific_type;
+	__u32	vendor_data[];
+};
+
+struct cper_ccix_mem_error {
+	struct cper_sec_ccix_header header;
+	__u32 ccix_header[CCIX_PER_LOG_HEADER_DWS];
+	struct cper_sec_ccix_mem_error mem_record;
+};
+
+static inline u16 ccix_mem_err_ven_len_get(struct cper_ccix_mem_error *mem_err)
+{
+	if (mem_err->mem_record.validation_bits &
+	    CCIX_MEM_ERR_VENDOR_DATA_VALID)
+		return mem_err->mem_record.vendor_data[0] & 0xFFFF;
+	else
+		return 0;
+}
+
+struct cper_ccix_mem_err_compact {
+	__u32	validation_bits;
+	__u8	mem_err_type;
+	__u8	pool_generic_type;
+	__u8	pool_specific_type;
+	__u8	op_type;
+	__u8	card;
+	__u16	module;
+	__u16	bank;
+	__u32	device;
+	__u32	row;
+	__u32	column;
+	__u32	rank;
+	__u8	bit_pos;
+	__u8	chip_id;
+	__u8	fru;
+};
+
 /* Reset to default packing */
 #pragma pack()
 
@@ -547,6 +653,18 @@ void cper_mem_err_pack(const struct cper_sec_mem_err *,
 		       struct cper_mem_err_compact *);
 const char *cper_mem_err_unpack(struct trace_seq *,
 				struct cper_mem_err_compact *);
+void cper_ccix_mem_err_pack(const struct cper_sec_ccix_mem_error *mem_record,
+			    struct cper_ccix_mem_err_compact *cmem_err,
+			    const u16 vendor_data_len,
+			    u8 *vendor_data);
+const char *cper_ccix_mem_err_unpack(struct trace_seq *p,
+				     struct cper_ccix_mem_err_compact *cmem_err);
+const char *cper_ccix_mem_err_type_str(unsigned int error_type);
+const char *cper_ccix_comp_type_str(u8 comp_type);
+struct acpi_hest_generic_data;
+int cper_print_ccix_per(const char *pfx,
+			struct acpi_hest_generic_data *gdata);
+
 void cper_print_proc_arm(const char *pfx,
 			 const struct cper_sec_proc_arm *proc);
 void cper_print_proc_ia(const char *pfx,
diff --git a/include/ras/ras_event.h b/include/ras/ras_event.h
index 36c5c5e38c1d8..128728eaeef41 100644
--- a/include/ras/ras_event.h
+++ b/include/ras/ras_event.h
@@ -14,6 +14,7 @@
 #include <linux/cper.h>
 #include <linux/mm.h>
 
+#include <linux/bitfield.h>
 /*
  * MCE Extended Error Log trace event
  *
@@ -338,6 +339,82 @@ TRACE_EVENT(aer_event,
 			"Not available")
 );
 
+/*
+ * CCIX PER log memory error trace event
+ *
+ * These events are generated when hardware detects a corrected or
+ * uncorrected event.
+ *
+ * Some elements of the record are not included
+ * - PER version (tracepoint should remain compatible across versions)
+ * - Multiple Error
+ */
+TRACE_EVENT(ccix_memory_error_event,
+	TP_PROTO(struct cper_ccix_mem_error *mem,
+		 u32 err_seq,
+		 u8 sev,
+		 u16 ven_len),
+
+	TP_ARGS(mem, err_seq, sev, ven_len),
+
+	TP_STRUCT__entry(
+		__field(u32, err_seq)
+		__field(u8, sev)
+		__field(u8, sevdetail)
+		__field(u8, source)
+		__field(u8, component)
+		__field(u64, pa)
+		__field(u8, pa_mask_lsb)
+		__field_struct(struct cper_ccix_mem_err_compact, data)
+		__field(u16, vendor_data_length)
+		__dynamic_array(u8, vendor_data, ven_len)
+	),
+
+	TP_fast_assign(
+		__entry->err_seq = err_seq;
+		__entry->sev = sev;
+		__entry->sevdetail =
+			FIELD_GET(CCIX_PER_LOG_DW1_SEV_UE_M |
+				  CCIX_PER_LOG_DW1_SEV_NO_COMM_M |
+				  CCIX_PER_LOG_DW1_SEV_DEGRADED_M |
+				  CCIX_PER_LOG_DW1_SEV_DEFFERABLE_M,
+						   mem->ccix_header[1]);
+		if (mem->header.validation_bits & 0x1)
+			__entry->source = mem->header.source_id;
+		else
+			__entry->source = ~0;
+		__entry->component = FIELD_GET(CCIX_PER_LOG_DW1_COMP_TYPE_M,
+					       mem->ccix_header[1]);
+		if (mem->ccix_header[1] & CCIX_PER_LOG_DW1_ADDR_VAL_M) {
+			__entry->pa = (u64)mem->ccix_header[2] << 32 |
+				(mem->ccix_header[3] & 0xfffffffc);
+			__entry->pa_mask_lsb = mem->ccix_header[4] & 0xff;
+		} else {
+			__entry->pa = ~0ull;
+			__entry->pa_mask_lsb = ~0;
+		}
+		__entry->vendor_data_length = ven_len ? ven_len - 4 : 0;
+		cper_ccix_mem_err_pack(&mem->mem_record, &__entry->data,
+				       __entry->vendor_data_length,
+				       __get_dynamic_array(vendor_data));
+	),
+
+	TP_printk("{%d} %s CCIX PER Memory Error in %s SevUE:%d SevNoComm:%d SevDegraded:%d SevDeferred:%d physical addr: %016llx (mask: %x) %s vendor:%s",
+		__entry->err_seq,
+		cper_severity_str(__entry->sev),
+		cper_ccix_comp_type_str(__entry->component),
+		  __entry->sevdetail & BIT(0) ? 1 : 0,
+		  __entry->sevdetail & BIT(1) ? 1 : 0,
+		  __entry->sevdetail & BIT(2) ? 1 : 0,
+		  __entry->sevdetail & BIT(3) ? 1 : 0,
+		__entry->pa,
+		__entry->pa_mask_lsb,
+		cper_ccix_mem_err_unpack(p, &__entry->data),
+		__print_hex(__get_dynamic_array(vendor_data),
+			    __entry->vendor_data_length)
+	)
+);
+
 /*
  * memory-failure recovery action result event
  *
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [RFC PATCH 2/6] efi / ras: CCIX Cache error reporting
  2019-06-06 12:36 [RFC PATCH 0/6] CCIX Protocol Error reporting Jonathan Cameron
  2019-06-06 12:36 ` [RFC PATCH 1/6] efi / ras: CCIX Memory error reporting Jonathan Cameron
@ 2019-06-06 12:36 ` Jonathan Cameron
  2019-06-06 12:36 ` [RFC PATCH 3/6] efi / ras: CCIX Address Translation " Jonathan Cameron
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 12+ messages in thread
From: Jonathan Cameron @ 2019-06-06 12:36 UTC (permalink / raw)
  To: linux-edac, linux-acpi, linux-efi
  Cc: linuxarm, rjw, tony.luck, bp, james.morse, ard.beisheuvel,
	nariman.poushin, Jonathan Cameron

As CCIX Request Agents have fully cache coherent caches,
the CCIX 1.0 Base Specification defines detailed error
reporting for these caches.

A CCIX cache error is reported via a CPER record as defined in the
UEFI 2.8 specification. The PER log section is defined in the
CCIX 1.0 Base Specification.

Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
---
 drivers/acpi/apei/ghes.c         |   4 +
 drivers/firmware/efi/cper-ccix.c | 170 +++++++++++++++++++++++++++++++
 include/linux/cper.h             |  57 +++++++++++
 include/ras/ras_event.h          |  66 ++++++++++++
 4 files changed, 297 insertions(+)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index cfc7dc31a9380..1afe47f7bb5b5 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -504,6 +504,10 @@ static void ghes_handle_ccix_per(struct acpi_hest_generic_data *gdata, int sev)
 		trace_ccix_memory_error_event(payload, err_seq, sev,
 					      ccix_mem_err_ven_len_get(payload));
 		break;
+	case CCIX_CACHE_ERROR:
+		trace_ccix_cache_error_event(payload, err_seq, sev,
+					     ccix_cache_err_ven_len_get(payload));
+		break;
 	default:
 		/* Unknown error type */
 		pr_info("CCIX error of unknown or vendor defined type\n");
diff --git a/drivers/firmware/efi/cper-ccix.c b/drivers/firmware/efi/cper-ccix.c
index 9856804bdca81..53a466eb05b7d 100644
--- a/drivers/firmware/efi/cper-ccix.c
+++ b/drivers/firmware/efi/cper-ccix.c
@@ -287,6 +287,110 @@ static int cper_ccix_mem_err_details(const char *pfx,
 	return 0;
 }
 
+static const char * const ccix_cache_type_strs[] = {
+	"Instruction Cache",
+	"Data Cache",
+	"Generic / Unified Cache",
+	"Snoop Filter Directory",
+};
+
+static const char *cper_ccix_cache_type_str(__u8 type)
+{
+	return type < ARRAY_SIZE(ccix_cache_type_strs) ?
+		ccix_cache_type_strs[type] : "Reserved";
+}
+
+static const char * const ccix_cache_err_type_strs[] = {
+	"Data",
+	"Tag",
+	"Timeout",
+	"Hang",
+	"Data Lost",
+	"Invalid Address",
+};
+
+const char *cper_ccix_cache_err_type_str(__u8 error_type)
+{
+	return error_type < ARRAY_SIZE(ccix_cache_err_type_strs) ?
+		ccix_cache_err_type_strs[error_type] : "Reserved";
+}
+
+static const char * const ccix_cache_err_op_strs[] = {
+	"Generic",
+	"Generic Read",
+	"Generic Write",
+	"Data Read",
+	"Data Write",
+	"Instruction Fetch",
+	"Prefetch",
+	"Eviction",
+	"Snooping",
+	"Snooped",
+	"Management / Command Error",
+};
+
+static const char *cper_ccix_cache_err_op_str(__u8 op)
+{
+	return op < ARRAY_SIZE(ccix_cache_err_op_strs) ?
+		ccix_cache_err_op_strs[op] : "Reserved";
+}
+
+static int cper_ccix_cache_err_details(const char *pfx,
+				     struct acpi_hest_generic_data *gdata)
+{
+	struct cper_ccix_cache_error *full_cache_err;
+	struct cper_sec_ccix_cache_error *cache_err;
+	u16 vendor_data_len;
+	int i;
+
+	if (gdata->error_data_length < sizeof(*full_cache_err))
+		return -ENOSPC;
+
+	full_cache_err = acpi_hest_get_payload(gdata);
+
+	cache_err = &full_cache_err->cache_record;
+
+	if (cache_err->validation_bits & CCIX_CACHE_ERR_TYPE_VALID)
+		printk("%s""Cache Type: %s\n", pfx,
+		       cper_ccix_cache_type_str(cache_err->cache_type));
+
+	if (cache_err->validation_bits & CCIX_CACHE_ERR_OP_VALID)
+		printk("%s""Operation: %s\n", pfx,
+		       cper_ccix_cache_err_op_str(cache_err->op_type));
+
+	if (cache_err->validation_bits & CCIX_CACHE_ERR_CACHE_ERR_TYPE_VALID)
+		printk("%s""Cache Error Type: %s\n", pfx,
+		       cper_ccix_cache_err_type_str(cache_err->cache_error_type));
+
+	if (cache_err->validation_bits & CCIX_CACHE_ERR_LEVEL_VALID)
+		printk("%s""Level: %d\n", pfx, cache_err->cache_level);
+
+	if (cache_err->validation_bits & CCIX_CACHE_ERR_SET_VALID)
+		printk("%s""Set: %d\n", pfx, cache_err->set);
+
+	if (cache_err->validation_bits & CCIX_CACHE_ERR_WAY_VALID)
+		printk("%s""Way: %d\n", pfx, cache_err->way);
+
+	if (cache_err->validation_bits & CCIX_CACHE_ERR_INSTANCE_ID_VALID)
+		printk("%s""Instance ID: %d\n", pfx, cache_err->instance);
+
+	if (cache_err->validation_bits & CCIX_CACHE_ERR_VENDOR_DATA_VALID) {
+		if (gdata->error_data_length < sizeof(*full_cache_err) + 4)
+			return -ENOSPC;
+
+		vendor_data_len = cache_err->vendor_data[0] & GENMASK(15, 0);
+		if (gdata->error_data_length <
+		    sizeof(*full_cache_err) + vendor_data_len)
+			return -ENOSPC;
+
+		for (i = 0; i < vendor_data_len / 4 - 1; i++)
+			printk("%s""Vendor%d: 0x%08x\n", pfx, i,
+			       cache_err->vendor_data[i + 1]);
+	}
+
+	return 0;
+}
+
 int cper_print_ccix_per(const char *pfx, struct acpi_hest_generic_data *gdata)
 {
 	struct cper_sec_ccix_header *header = acpi_hest_get_payload(gdata);
@@ -348,9 +452,75 @@ int cper_print_ccix_per(const char *pfx, struct acpi_hest_generic_data *gdata)
 	switch (per_type) {
 	case CCIX_MEMORY_ERROR:
 		return cper_ccix_mem_err_details(pfx, gdata);
+	case CCIX_CACHE_ERROR:
+		return cper_ccix_cache_err_details(pfx, gdata);
 	default:
 		/* Vendor defined so no formatting be done */
 		break;
 	}
 	return 0;
 }
+
+void cper_ccix_cache_err_pack(const struct cper_sec_ccix_cache_error *cache_record,
+			      struct cper_ccix_cache_err_compact *ccache_err,
+			      const u16 vendor_data_len,
+			      u8 *vendor_data)
+{
+	ccache_err->validation_bits = cache_record->validation_bits;
+	ccache_err->set = cache_record->set;
+	ccache_err->way = cache_record->way;
+	ccache_err->cache_type = cache_record->cache_type;
+	ccache_err->op_type = cache_record->op_type;
+	ccache_err->cache_error_type = cache_record->cache_error_type;
+	ccache_err->cache_level = cache_record->cache_level;
+	ccache_err->instance = cache_record->instance;
+	memcpy(vendor_data, &cache_record->vendor_data[1], vendor_data_len);
+}
+
+static int cper_ccix_err_cache_location(struct cper_ccix_cache_err_compact *ccache_err,
+					char *msg)
+{
+	u32 len = CPER_REC_LEN - 1;
+	u32 n = 0;
+
+	if (!msg)
+		return 0;
+
+	if (ccache_err->validation_bits & CCIX_CACHE_ERR_CACHE_ERR_TYPE_VALID)
+		n += snprintf(msg + n, len, "Error: %s ",
+			     cper_ccix_cache_err_type_str(ccache_err->cache_error_type));
+
+	if (ccache_err->validation_bits & CCIX_CACHE_ERR_TYPE_VALID)
+		n += snprintf(msg + n, len, "Type: %s ",
+			     cper_ccix_cache_type_str(ccache_err->cache_type));
+
+	if (ccache_err->validation_bits & CCIX_CACHE_ERR_OP_VALID)
+		n += snprintf(msg + n, len, "Op: %s ",
+			      cper_ccix_cache_err_op_str(ccache_err->op_type));
+
+	if (ccache_err->validation_bits & CCIX_CACHE_ERR_LEVEL_VALID)
+		n += snprintf(msg + n, len, "Level: %d ",
+			      ccache_err->cache_level);
+	if (ccache_err->validation_bits & CCIX_CACHE_ERR_SET_VALID)
+		n += snprintf(msg + n, len, "Set: %d ", ccache_err->set);
+	if (ccache_err->validation_bits & CCIX_CACHE_ERR_WAY_VALID)
+		n += snprintf(msg + n, len, "Way: %d ", ccache_err->way);
+	if (ccache_err->validation_bits & CCIX_CACHE_ERR_INSTANCE_ID_VALID)
+		n += snprintf(msg + n, len, "Instance: %d ",
+			      ccache_err->instance);
+
+	return n;
+}
+
+const char *cper_ccix_cache_err_unpack(struct trace_seq *p,
+				       struct cper_ccix_cache_err_compact *ccache_err)
+{
+	const char *ret = trace_seq_buffer_ptr(p);
+
+	if (cper_ccix_err_cache_location(ccache_err, rcd_decode_str))
+		trace_seq_printf(p, "%s", rcd_decode_str);
+
+	trace_seq_putc(p, '\0');
+
+	return ret;
+}
diff --git a/include/linux/cper.h b/include/linux/cper.h
index a9a7ef56f4dc8..45af6818942f7 100644
--- a/include/linux/cper.h
+++ b/include/linux/cper.h
@@ -639,6 +639,54 @@ struct cper_ccix_mem_err_compact {
 	__u8	fru;
 };
 
+struct cper_sec_ccix_cache_error {
+	__u32	validation_bits;
+#define CCIX_CACHE_ERR_TYPE_VALID		BIT(0)
+#define CCIX_CACHE_ERR_OP_VALID			BIT(1)
+#define CCIX_CACHE_ERR_CACHE_ERR_TYPE_VALID	BIT(2)
+#define CCIX_CACHE_ERR_LEVEL_VALID		BIT(3)
+#define CCIX_CACHE_ERR_SET_VALID		BIT(4)
+#define CCIX_CACHE_ERR_WAY_VALID		BIT(5)
+#define CCIX_CACHE_ERR_INSTANCE_ID_VALID	BIT(6)
+#define CCIX_CACHE_ERR_VENDOR_DATA_VALID	BIT(7)
+	__u16	length; /* Includes vendor specific log info */
+	__u8	cache_type;
+	__u8	op_type;
+	__u8	cache_error_type;
+	__u8	cache_level;
+	__u32	set;
+	__u32	way;
+	__u8	instance;
+	__u8	reserved;
+	__u32	vendor_data[];
+};
+
+struct cper_ccix_cache_error {
+	struct cper_sec_ccix_header header;
+	__u32 ccix_header[CCIX_PER_LOG_HEADER_DWS];
+	struct cper_sec_ccix_cache_error cache_record;
+};
+
+static inline u16 ccix_cache_err_ven_len_get(struct cper_ccix_cache_error *cache_err)
+{
+	if (cache_err->cache_record.validation_bits &
+	    CCIX_CACHE_ERR_VENDOR_DATA_VALID)
+		return cache_err->cache_record.vendor_data[0] & 0xFFFF;
+	else
+		return 0;
+}
+
+struct cper_ccix_cache_err_compact {
+	__u32	validation_bits;
+	__u32	set;
+	__u32	way;
+	__u8	cache_type;
+	__u8	op_type;
+	__u8	cache_error_type;
+	__u8	cache_level;
+	__u8	instance;
+};
+
 /* Reset to default packing */
 #pragma pack()
 
@@ -661,6 +709,15 @@ const char *cper_ccix_mem_err_unpack(struct trace_seq *p,
 				     struct cper_ccix_mem_err_compact *cmem_err);
 const char *cper_ccix_mem_err_type_str(unsigned int error_type);
 const char *cper_ccix_comp_type_str(u8 comp_type);
+
+void cper_ccix_cache_err_pack(const struct cper_sec_ccix_cache_error *cache_record,
+			      struct cper_ccix_cache_err_compact *ccache_err,
+			      const u16 vendor_data_len,
+			      u8 *vendor_data);
+const char *cper_ccix_cache_err_unpack(struct trace_seq *p,
+				       struct cper_ccix_cache_err_compact *ccache_err);
+const char *cper_ccix_cache_err_type_str(__u8 error_type);
+
 struct acpi_hest_generic_data;
 int cper_print_ccix_per(const char *pfx,
 			struct acpi_hest_generic_data *gdata);
diff --git a/include/ras/ras_event.h b/include/ras/ras_event.h
index 128728eaeef41..55f2c1900c540 100644
--- a/include/ras/ras_event.h
+++ b/include/ras/ras_event.h
@@ -415,6 +415,72 @@ TRACE_EVENT(ccix_memory_error_event,
 	)
 );
 
+TRACE_EVENT(ccix_cache_error_event,
+	TP_PROTO(struct cper_ccix_cache_error *err,
+		 u32 err_seq,
+		 u8 sev,
+		 u16 ven_len),
+
+	TP_ARGS(err, err_seq, sev, ven_len),
+
+	TP_STRUCT__entry(
+		__field(u32, err_seq)
+		__field(u8, sev)
+		__field(u8, sevdetail)
+		__field(u8, source)
+		__field(u8, component)
+		__field(u64, pa)
+		__field(u8, pa_mask_lsb)
+		__field_struct(struct cper_ccix_cache_err_compact, data)
+		__field(u16, vendor_data_length)
+		__dynamic_array(u8, vendor_data, ven_len)
+	),
+
+	TP_fast_assign(
+		__entry->err_seq = err_seq;
+
+		__entry->sev = sev;
+		__entry->sevdetail = FIELD_GET(CCIX_PER_LOG_DW1_SEV_UE_M |
+			CCIX_PER_LOG_DW1_SEV_NO_COMM_M |
+			CCIX_PER_LOG_DW1_SEV_DEGRADED_M |
+			CCIX_PER_LOG_DW1_SEV_DEFFERABLE_M,
+			err->ccix_header[1]);
+		if (err->header.validation_bits & 0x1)
+			__entry->source = err->header.source_id;
+		else
+			__entry->source = ~0;
+		__entry->component = FIELD_GET(CCIX_PER_LOG_DW1_COMP_TYPE_M,
+					       err->ccix_header[1]);
+		if (err->ccix_header[1] & CCIX_PER_LOG_DW1_ADDR_VAL_M) {
+			__entry->pa = (u64)err->ccix_header[2] << 32 |
+				(err->ccix_header[3] & 0xfffffffc);
+			__entry->pa_mask_lsb = err->ccix_header[4] & 0xff;
+		} else {
+			__entry->pa = ~0ull;
+			__entry->pa_mask_lsb = ~0;
+		}
+
+		__entry->vendor_data_length = ven_len ? ven_len - 4 : 0;
+		cper_ccix_cache_err_pack(&err->cache_record, &__entry->data,
+					 __entry->vendor_data_length,
+					 __get_dynamic_array(vendor_data));
+	),
+
+	TP_printk("{%d} %s CCIX PER Cache Error in %s SevUE:%d SevNoComm:%d SevDegraded:%d SevDeferred:%d physical addr: %016llx (mask: %x) %s vendor:%s",
+		__entry->err_seq,
+		cper_severity_str(__entry->sev),
+		cper_ccix_comp_type_str(__entry->component),
+		__entry->sevdetail & BIT(0) ? 1 : 0,
+		__entry->sevdetail & BIT(1) ? 1 : 0,
+		__entry->sevdetail & BIT(2) ? 1 : 0,
+		__entry->sevdetail & BIT(3) ? 1 : 0,
+		__entry->pa,
+		__entry->pa_mask_lsb,
+		cper_ccix_cache_err_unpack(p, &__entry->data),
+		__print_hex(__get_dynamic_array(vendor_data),
+			    __entry->vendor_data_length)
+	)
+);
 /*
  * memory-failure recovery action result event
  *
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [RFC PATCH 3/6] efi / ras: CCIX Address Translation Cache error reporting
  2019-06-06 12:36 [RFC PATCH 0/6] CCIX Protocol Error reporting Jonathan Cameron
  2019-06-06 12:36 ` [RFC PATCH 1/6] efi / ras: CCIX Memory error reporting Jonathan Cameron
  2019-06-06 12:36 ` [RFC PATCH 2/6] efi / ras: CCIX Cache " Jonathan Cameron
@ 2019-06-06 12:36 ` Jonathan Cameron
  2019-06-06 12:36 ` [RFC PATCH 4/6] efi / ras: CCIX Port " Jonathan Cameron
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 12+ messages in thread
From: Jonathan Cameron @ 2019-06-06 12:36 UTC (permalink / raw)
  To: linux-edac, linux-acpi, linux-efi
  Cc: linuxarm, rjw, tony.luck, bp, james.morse, ard.beisheuvel,
	nariman.poushin, Jonathan Cameron

CCIX devices tend to make heavy use of ATCs. The CCIX base
specification defines a protocol error message (PER) that
describes errors reported by such caches. The UEFI 2.8
specification includes a CCIX CPER record for firmware first
handling to report these errors to the operating system.

This patch is very similar to the support previously added
for CCIX Memory Errors and provides both logging and RAS
tracepoint for this error class.

Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
---
 drivers/acpi/apei/ghes.c         |  4 ++
 drivers/firmware/efi/cper-ccix.c | 84 ++++++++++++++++++++++++++++++++
 include/linux/cper.h             | 39 +++++++++++++++
 include/ras/ras_event.h          | 67 +++++++++++++++++++++++++
 4 files changed, 194 insertions(+)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 1afe47f7bb5b5..a412cfb7c8657 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -508,6 +508,10 @@ static void ghes_handle_ccix_per(struct acpi_hest_generic_data *gdata, int sev)
 		trace_ccix_cache_error_event(payload, err_seq, sev,
 					     ccix_cache_err_ven_len_get(payload));
 		break;
+	case CCIX_ATC_ERROR:
+		trace_ccix_atc_error_event(payload, err_seq, sev,
+					   ccix_atc_err_ven_len_get(payload));
+		break;
 	default:
 		/* Unknown error type */
 		pr_info("CCIX error of unknown or vendor defined type\n");
diff --git a/drivers/firmware/efi/cper-ccix.c b/drivers/firmware/efi/cper-ccix.c
index 53a466eb05b7d..1b88458eae32f 100644
--- a/drivers/firmware/efi/cper-ccix.c
+++ b/drivers/firmware/efi/cper-ccix.c
@@ -391,6 +391,43 @@ static int cper_ccix_cache_err_details(const char *pfx,
 	return 0;
 }
 
+static int cper_ccix_atc_err_details(const char *pfx,
+				     struct acpi_hest_generic_data *gdata)
+{
+	struct cper_ccix_atc_error *full_atc_err;
+	struct cper_sec_ccix_atc_error *atc_err;
+	u16 vendor_data_len;
+	int i;
+
+	if (gdata->error_data_length < sizeof(*full_atc_err))
+		return -ENOSPC;
+
+	full_atc_err = acpi_hest_get_payload(gdata);
+
+	atc_err = &full_atc_err->atc_record;
+
+	if (atc_err->validation_bits & CCIX_ATC_ERR_OP_VALID)
+		printk("%s""Operation: %s\n", pfx,
+		       cper_ccix_cache_err_op_str(atc_err->op_type));
+
+	if (atc_err->validation_bits & CCIX_ATC_ERR_INSTANCE_ID_VALID)
+		printk("%s""Instance ID: %d\n", pfx, atc_err->instance);
+
+	if (atc_err->validation_bits & CCIX_ATC_ERR_VENDOR_DATA_VALID) {
+		if (gdata->error_data_length < sizeof(*full_atc_err) + 4)
+			return -ENOSPC;
+		vendor_data_len = atc_err->vendor_data[0] & GENMASK(15, 0);
+		if (gdata->error_data_length < sizeof(*full_atc_err) + vendor_data_len)
+			return -ENOSPC;
+
+		for (i = 0; i < vendor_data_len / 4 - 1; i++)
+			printk("%s""Vendor%d: 0x%08x\n", pfx, i,
+			       atc_err->vendor_data[i + 1]);
+	}
+
+	return 0;
+}
+
 int cper_print_ccix_per(const char *pfx, struct acpi_hest_generic_data *gdata)
 {
 	struct cper_sec_ccix_header *header = acpi_hest_get_payload(gdata);
@@ -454,6 +491,8 @@ int cper_print_ccix_per(const char *pfx, struct acpi_hest_generic_data *gdata)
 		return cper_ccix_mem_err_details(pfx, gdata);
 	case CCIX_CACHE_ERROR:
 		return cper_ccix_cache_err_details(pfx, gdata);
+	case CCIX_ATC_ERROR:
+		return cper_ccix_atc_err_details(pfx, gdata);
 	default:
 		/* Vendor defined so no formatting be done */
 		break;
@@ -524,3 +563,48 @@ const char *cper_ccix_cache_err_unpack(struct trace_seq *p,
 
 	return ret;
 }
+
+void cper_ccix_atc_err_pack(const struct cper_sec_ccix_atc_error *atc_record,
+			    struct cper_ccix_atc_err_compact *catc_err,
+			    const u16 vendor_data_len,
+			    u8 *vendor_data)
+{
+	catc_err->validation_bits = atc_record->validation_bits;
+	catc_err->op_type = atc_record->op_type;
+	catc_err->instance = atc_record->instance;
+	memcpy(vendor_data, &atc_record->vendor_data[1], vendor_data_len);
+}
+
+static int cper_ccix_err_atc_location(struct cper_ccix_atc_err_compact *catc_err,
+				      char *msg)
+{
+	u32 len = CPER_REC_LEN - 1;
+	u32 n = 0;
+
+	if (!msg)
+		return 0;
+
+	if (catc_err->validation_bits & CCIX_ATC_ERR_OP_VALID)
+		n += snprintf(msg + n, len, "Op: %s ",
+			     cper_ccix_cache_err_op_str(catc_err->op_type));
+
+	if (catc_err->validation_bits & CCIX_ATC_ERR_INSTANCE_ID_VALID)
+		n += snprintf(msg + n, len, "Instance: %d ",
+			      catc_err->instance);
+
+	return n;
+}
+
+const char *cper_ccix_atc_err_unpack(struct trace_seq *p,
+				     struct cper_ccix_atc_err_compact *catc_err)
+{
+	const char *ret = trace_seq_buffer_ptr(p);
+
+	if (cper_ccix_err_atc_location(catc_err, rcd_decode_str))
+		trace_seq_printf(p, "%s", rcd_decode_str);
+
+	trace_seq_putc(p, '\0');
+
+	return ret;
+}
+
diff --git a/include/linux/cper.h b/include/linux/cper.h
index 45af6818942f7..77ce1c04dec25 100644
--- a/include/linux/cper.h
+++ b/include/linux/cper.h
@@ -687,6 +687,38 @@ struct cper_ccix_cache_err_compact {
 	__u8	instance;
 };
 
+struct cper_sec_ccix_atc_error {
+	__u32	validation_bits;
+#define CCIX_ATC_ERR_OP_VALID			BIT(0)
+#define CCIX_ATC_ERR_INSTANCE_ID_VALID		BIT(1)
+#define CCIX_ATC_ERR_VENDOR_DATA_VALID		BIT(2)
+	__u16	length; /* Includes vendor specific log info */
+	__u8	op_type;
+	__u8	instance;
+	__u32	reserved;
+	__u32	vendor_data[];
+};
+
+struct cper_ccix_atc_error {
+	struct cper_sec_ccix_header header;
+	__u32 ccix_header[CCIX_PER_LOG_HEADER_DWS];
+	struct cper_sec_ccix_atc_error atc_record;
+};
+
+static inline u16 ccix_atc_err_ven_len_get(struct cper_ccix_atc_error *atc_err)
+{
+	if (atc_err->atc_record.validation_bits & CCIX_ATC_ERR_VENDOR_DATA_VALID)
+		return atc_err->atc_record.vendor_data[0] & 0xFFFF;
+	else
+		return 0;
+}
+
+struct cper_ccix_atc_err_compact {
+	__u32	validation_bits;
+	__u8	op_type;
+	__u8	instance;
+};
+
 /* Reset to default packing */
 #pragma pack()
 
@@ -718,6 +750,13 @@ const char *cper_ccix_cache_err_unpack(struct trace_seq *p,
 				       struct cper_ccix_cache_err_compact *ccache_err);
 const char *cper_ccix_cache_err_type_str(__u8 error_type);
 
+void cper_ccix_atc_err_pack(const struct cper_sec_ccix_atc_error *atc_record,
+			    struct cper_ccix_atc_err_compact *catc_err,
+			    const u16 vendor_data_len,
+			    u8 *vendor_data);
+const char *cper_ccix_atc_err_unpack(struct trace_seq *p,
+				     struct cper_ccix_atc_err_compact *catc_err);
+
 struct acpi_hest_generic_data;
 int cper_print_ccix_per(const char *pfx,
 			struct acpi_hest_generic_data *gdata);
diff --git a/include/ras/ras_event.h b/include/ras/ras_event.h
index 55f2c1900c540..bab49e297551e 100644
--- a/include/ras/ras_event.h
+++ b/include/ras/ras_event.h
@@ -481,6 +481,73 @@ TRACE_EVENT(ccix_cache_error_event,
 			    __entry->vendor_data_length)
 	)
 );
+
+TRACE_EVENT(ccix_atc_error_event,
+	TP_PROTO(struct cper_ccix_atc_error *err,
+		 u32 err_seq,
+		 u8 sev,
+		 u16 ven_len),
+
+	TP_ARGS(err, err_seq, sev, ven_len),
+
+	TP_STRUCT__entry(
+		__field(u32, err_seq)
+		__field(u8, sev)
+		__field(u8, sevdetail)
+		__field(u8, source)
+		__field(u8, component)
+		__field(u64, pa)
+		__field(u8, pa_mask_lsb)
+		__field_struct(struct cper_ccix_atc_err_compact, data)
+		__field(u16, vendor_data_length)
+		__dynamic_array(u8, vendor_data, ven_len)
+	),
+
+	TP_fast_assign(
+		__entry->err_seq = err_seq;
+
+		__entry->sev = sev;
+		__entry->sevdetail = FIELD_GET(CCIX_PER_LOG_DW1_SEV_UE_M |
+			CCIX_PER_LOG_DW1_SEV_NO_COMM_M |
+			CCIX_PER_LOG_DW1_SEV_DEGRADED_M |
+			CCIX_PER_LOG_DW1_SEV_DEFFERABLE_M,
+			err->ccix_header[1]);
+		if (err->header.validation_bits & 0x1)
+			__entry->source = err->header.source_id;
+		else
+			__entry->source = ~0;
+		__entry->component = FIELD_GET(CCIX_PER_LOG_DW1_COMP_TYPE_M,
+					       err->ccix_header[1]);
+		if (err->ccix_header[1] & CCIX_PER_LOG_DW1_ADDR_VAL_M) {
+			__entry->pa = (u64)err->ccix_header[2] << 32 |
+				(err->ccix_header[3] & 0xfffffffc);
+			__entry->pa_mask_lsb = err->ccix_header[4] & 0xff;
+		} else {
+			__entry->pa = ~0ull;
+			__entry->pa_mask_lsb = ~0;
+		}
+
+		__entry->vendor_data_length = ven_len ? ven_len - 4 : 0;
+		cper_ccix_atc_err_pack(&err->atc_record, &__entry->data,
+				       __entry->vendor_data_length,
+				       __get_dynamic_array(vendor_data));
+	),
+
+	TP_printk("{%d} %s CCIX PER ATC Error in %s SevUE:%d SevNoComm:%d SevDegraded:%d SevDeferred:%d physical addr: %016llx (mask: %x) %s vendor:%s",
+		__entry->err_seq,
+		cper_severity_str(__entry->sev),
+		cper_ccix_comp_type_str(__entry->component),
+		__entry->sevdetail & BIT(0) ? 1 : 0,
+		__entry->sevdetail & BIT(1) ? 1 : 0,
+		__entry->sevdetail & BIT(2) ? 1 : 0,
+		__entry->sevdetail & BIT(3) ? 1 : 0,
+		__entry->pa,
+		__entry->pa_mask_lsb,
+		cper_ccix_atc_err_unpack(p, &__entry->data),
+		__print_hex(__get_dynamic_array(vendor_data), __entry->vendor_data_length)
+	)
+);
+
 /*
  * memory-failure recovery action result event
  *
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [RFC PATCH 4/6] efi / ras: CCIX Port error reporting
  2019-06-06 12:36 [RFC PATCH 0/6] CCIX Protocol Error reporting Jonathan Cameron
                   ` (2 preceding siblings ...)
  2019-06-06 12:36 ` [RFC PATCH 3/6] efi / ras: CCIX Address Translation " Jonathan Cameron
@ 2019-06-06 12:36 ` Jonathan Cameron
  2019-06-06 12:36 ` [RFC PATCH 5/6] efi / ras: CCIX Link " Jonathan Cameron
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 12+ messages in thread
From: Jonathan Cameron @ 2019-06-06 12:36 UTC (permalink / raw)
  To: linux-edac, linux-acpi, linux-efi
  Cc: linuxarm, rjw, tony.luck, bp, james.morse, ard.beisheuvel,
	nariman.poushin, Jonathan Cameron

The CCIX 1.0 base specification defines a CCIX protocol layer port.
The specification provides a mechanism for detailed error logging
for these ports.  The UEFI 2.8 specification includes a CCIX CPER
record for firmware first handling to report these errors to the
operating system.

This patch is very similar to the support previously added for
for CCIX Memory Errors and provides both logging and RAS tracepoint
for this error class.

Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
---
 drivers/acpi/apei/ghes.c         |   4 +
 drivers/firmware/efi/cper-ccix.c | 123 +++++++++++++++++++++++++++++++
 include/linux/cper.h             |  42 +++++++++++
 include/ras/ras_event.h          |  66 +++++++++++++++++
 4 files changed, 235 insertions(+)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index a412cfb7c8657..2b2c0df204523 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -512,6 +512,10 @@ static void ghes_handle_ccix_per(struct acpi_hest_generic_data *gdata, int sev)
 		trace_ccix_atc_error_event(payload, err_seq, sev,
 					   ccix_atc_err_ven_len_get(payload));
 		break;
+	case CCIX_PORT_ERROR:
+		trace_ccix_port_error_event(payload, err_seq, sev,
+					    ccix_port_err_ven_len_get(payload));
+		break;
 	default:
 		/* Unknown error type */
 		pr_info("CCIX error of unknown or vendor defined type\n");
diff --git a/drivers/firmware/efi/cper-ccix.c b/drivers/firmware/efi/cper-ccix.c
index 1b88458eae32f..fcf77834b9894 100644
--- a/drivers/firmware/efi/cper-ccix.c
+++ b/drivers/firmware/efi/cper-ccix.c
@@ -428,6 +428,81 @@ static int cper_ccix_atc_err_details(const char *pfx,
 	return 0;
 }
 
+static const char * const ccix_port_err_type_strs[] = {
+	"Generic Bus / Slave Error",
+	"Bus Parity / ECC Error",
+	"BDF Not Present",
+	"Invalid Address",
+	"Invalid AgentID",
+	"Bus Timeout",
+	"Hang",
+	"Egress Blocked",
+};
+
+static const char *cper_ccix_port_err_type_str(__u8 op)
+{
+	return op < ARRAY_SIZE(ccix_port_err_type_strs) ?
+		ccix_port_err_type_strs[op] : "Reserved";
+}
+
+static const char * const ccix_port_err_op_strs[] = {
+	"Command",
+	"Read",
+	"Write",
+};
+
+static const char *cper_ccix_port_err_op_str(__u8 op)
+{
+	return op < ARRAY_SIZE(ccix_port_err_op_strs) ?
+		ccix_port_err_op_strs[op] : "Reserved";
+}
+
+static int cper_ccix_port_err_details(const char *pfx,
+				     struct acpi_hest_generic_data *gdata)
+{
+	struct cper_ccix_port_error *full_port_err;
+	struct cper_sec_ccix_port_error *port_err;
+	u16 vendor_data_len;
+	int i;
+
+	if (gdata->error_data_length < sizeof(*full_port_err))
+		return -ENOSPC;
+
+	full_port_err = acpi_hest_get_payload(gdata);
+
+	port_err = &full_port_err->port_record;
+
+	if (port_err->validation_bits & CCIX_PORT_ERR_TYPE_VALID)
+		printk("%s""Error Type: %s\n", pfx,
+		       cper_ccix_port_err_type_str(port_err->err_type));
+
+	if (port_err->validation_bits & CCIX_PORT_ERR_OP_VALID)
+		printk("%s""Operation: %s\n", pfx,
+		       cper_ccix_port_err_op_str(port_err->op_type));
+
+	/* CHECK THE AER EQUIVALENT */
+	if (port_err->validation_bits & CCIX_PORT_ERR_MESSAGE_VALID) {
+		for (i = 0; i < ARRAY_SIZE(port_err->message); i++)
+			printk("%s""Message%d: 0x%08x\n", pfx, i,
+			       port_err->message[i]);
+	}
+
+	if (port_err->validation_bits & CCIX_PORT_ERR_VENDOR_DATA_VALID) {
+		if (gdata->error_data_length < sizeof(*full_port_err) + 4)
+			return -ENOSPC;
+
+		vendor_data_len = port_err->vendor_data[0] & GENMASK(15, 0);
+		if (gdata->error_data_length < sizeof(*full_port_err) + vendor_data_len)
+			return -ENOSPC;
+
+		for (i = 0; i < vendor_data_len / 4 - 1; i++)
+			printk("%s""Vendor%d: 0x%08x\n", pfx, i,
+			       port_err->vendor_data[i + 1]);
+	}
+
+	return 0;
+}
+
 int cper_print_ccix_per(const char *pfx, struct acpi_hest_generic_data *gdata)
 {
 	struct cper_sec_ccix_header *header = acpi_hest_get_payload(gdata);
@@ -493,6 +568,8 @@ int cper_print_ccix_per(const char *pfx, struct acpi_hest_generic_data *gdata)
 		return cper_ccix_cache_err_details(pfx, gdata);
 	case CCIX_ATC_ERROR:
 		return cper_ccix_atc_err_details(pfx, gdata);
+	case CCIX_PORT_ERROR:
+		return cper_ccix_port_err_details(pfx, gdata);
 	default:
 		/* Vendor defined so no formatting be done */
 		break;
@@ -608,3 +685,49 @@ const char *cper_ccix_atc_err_unpack(struct trace_seq *p,
 	return ret;
 }
 
+void cper_ccix_port_err_pack(const struct cper_sec_ccix_port_error *port_record,
+			     struct cper_ccix_port_err_compact *cport_err,
+			     const u16 vendor_data_len,
+			     u8 *vendor_data)
+{
+	cport_err->validation_bits = port_record->validation_bits;
+	cport_err->err_type = port_record->err_type;
+	cport_err->op_type = port_record->op_type;
+	memcpy(cport_err->message, port_record->message,
+	       sizeof(cport_err->message));
+	memcpy(vendor_data, &port_record->vendor_data[1], vendor_data_len);
+}
+
+static int cper_ccix_err_port_location(struct cper_ccix_port_err_compact *cport_err,
+				       char *msg)
+{
+	u32 len = CPER_REC_LEN - 1;
+	u32 n = 0;
+
+	if (!msg)
+		return 0;
+
+	if (cport_err->validation_bits & CCIX_PORT_ERR_TYPE_VALID)
+		n += snprintf(msg + n, len, "Error Type: %s ",
+			      cper_ccix_port_err_type_str(cport_err->err_type));
+
+
+	if (cport_err->validation_bits & CCIX_PORT_ERR_OP_VALID)
+		n += snprintf(msg + n, len, "Op: %s ",
+			     cper_ccix_port_err_op_str(cport_err->op_type));
+
+	return n;
+}
+
+const char *cper_ccix_port_err_unpack(struct trace_seq *p,
+				      struct cper_ccix_port_err_compact *cport_err)
+{
+	const char *ret = trace_seq_buffer_ptr(p);
+
+	if (cper_ccix_err_port_location(cport_err, rcd_decode_str))
+		trace_seq_printf(p, "%s", rcd_decode_str);
+
+	trace_seq_putc(p, '\0');
+
+	return ret;
+}
diff --git a/include/linux/cper.h b/include/linux/cper.h
index 77ce1c04dec25..cd023e0309bf4 100644
--- a/include/linux/cper.h
+++ b/include/linux/cper.h
@@ -719,6 +719,41 @@ struct cper_ccix_atc_err_compact {
 	__u8	instance;
 };
 
+
+struct cper_sec_ccix_port_error {
+	__u32	validation_bits;
+#define CCIX_PORT_ERR_OP_VALID			BIT(0)
+#define CCIX_PORT_ERR_TYPE_VALID		BIT(1)
+#define CCIX_PORT_ERR_MESSAGE_VALID		BIT(2)
+#define CCIX_PORT_ERR_VENDOR_DATA_VALID		BIT(3)
+	__u16	length; /* Includes vendor specific log info */
+	__u8	op_type;
+	__u8	err_type;
+	__u32	message[8];
+	__u32	vendor_data[];
+};
+
+struct cper_ccix_port_error {
+	struct cper_sec_ccix_header header;
+	__u32 ccix_header[CCIX_PER_LOG_HEADER_DWS];
+	struct cper_sec_ccix_port_error port_record;
+};
+
+static inline u16 ccix_port_err_ven_len_get(struct cper_ccix_port_error *port_err)
+{
+	if (port_err->port_record.validation_bits & CCIX_PORT_ERR_VENDOR_DATA_VALID)
+		return port_err->port_record.vendor_data[0] & 0xFFFF;
+	else
+		return 0;
+}
+
+struct cper_ccix_port_err_compact {
+	__u32	validation_bits;
+	__u32	message[8];
+	__u8	err_type;
+	__u8	op_type;
+};
+
 /* Reset to default packing */
 #pragma pack()
 
@@ -757,6 +792,13 @@ void cper_ccix_atc_err_pack(const struct cper_sec_ccix_atc_error *atc_record,
 const char *cper_ccix_atc_err_unpack(struct trace_seq *p,
 				     struct cper_ccix_atc_err_compact *catc_err);
 
+void cper_ccix_port_err_pack(const struct cper_sec_ccix_port_error *port_record,
+			     struct cper_ccix_port_err_compact *cport_err,
+			     const u16 vendor_data_len,
+			     u8 *vendor_data);
+const char *cper_ccix_port_err_unpack(struct trace_seq *p,
+				      struct cper_ccix_port_err_compact *cport_err);
+
 struct acpi_hest_generic_data;
 int cper_print_ccix_per(const char *pfx,
 			struct acpi_hest_generic_data *gdata);
diff --git a/include/ras/ras_event.h b/include/ras/ras_event.h
index bab49e297551e..52ac3f2d4c117 100644
--- a/include/ras/ras_event.h
+++ b/include/ras/ras_event.h
@@ -548,6 +548,72 @@ TRACE_EVENT(ccix_atc_error_event,
 	)
 );
 
+TRACE_EVENT(ccix_port_error_event,
+	TP_PROTO(struct cper_ccix_port_error *err,
+		 u32 err_seq,
+		 u8 sev,
+		 u16 ven_len),
+
+	TP_ARGS(err, err_seq, sev, ven_len),
+	TP_STRUCT__entry(
+		__field(u32, err_seq)
+		__field(u8, sev)
+		__field(u8, sevdetail)
+		__field(u8, source)
+		__field(u8, component)
+		__field(u64, pa)
+		__field(u8, pa_mask_lsb)
+		__field_struct(struct cper_ccix_port_err_compact, data)
+		__field(u16, vendor_data_length)
+		__dynamic_array(u8, vendor_data, ven_len)
+	),
+
+	TP_fast_assign(
+		__entry->err_seq = err_seq;
+
+		__entry->sev = sev;
+		__entry->sevdetail = FIELD_GET(CCIX_PER_LOG_DW1_SEV_UE_M |
+			CCIX_PER_LOG_DW1_SEV_NO_COMM_M |
+			CCIX_PER_LOG_DW1_SEV_DEGRADED_M |
+			CCIX_PER_LOG_DW1_SEV_DEFFERABLE_M,
+			err->ccix_header[1]);
+		if (err->header.validation_bits & 0x1)
+			__entry->source = err->header.source_id;
+		else
+			__entry->source = ~0;
+
+		__entry->component = FIELD_GET(CCIX_PER_LOG_DW1_COMP_TYPE_M,
+					       err->ccix_header[1]);
+		if (err->ccix_header[1] & CCIX_PER_LOG_DW1_ADDR_VAL_M) {
+			__entry->pa = (u64)err->ccix_header[2] << 32 |
+				(err->ccix_header[3] & 0xfffffffc);
+			__entry->pa_mask_lsb = err->ccix_header[4] & 0xff;
+		} else {
+			__entry->pa = ~0ull;
+			__entry->pa_mask_lsb = ~0;
+		}
+
+		__entry->vendor_data_length = ven_len ? ven_len - 4 : 0;
+		cper_ccix_port_err_pack(&err->port_record, &__entry->data,
+					__entry->vendor_data_length,
+					__get_dynamic_array(vendor_data));
+	),
+
+	TP_printk("{%d} %s CCIX PER Port Error in %s SevUE:%d SevNoComm:%d SevDegraded:%d SevDeferred:%d physical addr: %016llx (mask: %x) %s vendor:%s",
+		__entry->err_seq,
+		cper_severity_str(__entry->sev),
+		cper_ccix_comp_type_str(__entry->component),
+		__entry->sevdetail & BIT(0) ? 1 : 0,
+		__entry->sevdetail & BIT(1) ? 1 : 0,
+		__entry->sevdetail & BIT(2) ? 1 : 0,
+		__entry->sevdetail & BIT(3) ? 1 : 0,
+		__entry->pa,
+		__entry->pa_mask_lsb,
+		cper_ccix_port_err_unpack(p, &__entry->data),
+		__print_hex(__get_dynamic_array(vendor_data), __entry->vendor_data_length)
+	)
+);
+
 /*
  * memory-failure recovery action result event
  *
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [RFC PATCH 5/6] efi / ras: CCIX Link error reporting
  2019-06-06 12:36 [RFC PATCH 0/6] CCIX Protocol Error reporting Jonathan Cameron
                   ` (3 preceding siblings ...)
  2019-06-06 12:36 ` [RFC PATCH 4/6] efi / ras: CCIX Port " Jonathan Cameron
@ 2019-06-06 12:36 ` Jonathan Cameron
  2019-06-06 12:36 ` [RFC PATCH 6/6] efi / ras: CCIX Agent internal " Jonathan Cameron
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 12+ messages in thread
From: Jonathan Cameron @ 2019-06-06 12:36 UTC (permalink / raw)
  To: linux-edac, linux-acpi, linux-efi
  Cc: linuxarm, rjw, tony.luck, bp, james.morse, ard.beisheuvel,
	nariman.poushin, Jonathan Cameron

The CCIX 1.0 Base Specification defines a protocol layer
CCIX link and related error reporting mechanism.
The UEFI 2.8 specification includes a CCIX CPER record for
firmware first handling to report these errors to the
operating system.

This patch is very similar to the support previously added
for CCIX Memory Errors and provides both logging and RAS
tracepoint for this error class.

Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
---
 drivers/acpi/apei/ghes.c         |   4 +
 drivers/firmware/efi/cper-ccix.c | 140 +++++++++++++++++++++++++++++++
 include/linux/cper.h             |  48 +++++++++++
 include/ras/ras_event.h          |  65 ++++++++++++++
 4 files changed, 257 insertions(+)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 2b2c0df204523..2cabac7a344c7 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -516,6 +516,10 @@ static void ghes_handle_ccix_per(struct acpi_hest_generic_data *gdata, int sev)
 		trace_ccix_port_error_event(payload, err_seq, sev,
 					    ccix_port_err_ven_len_get(payload));
 		break;
+	case CCIX_LINK_ERROR:
+		trace_ccix_link_error_event(payload, err_seq, sev,
+					    ccix_link_err_ven_len_get(payload));
+		break;
 	default:
 		/* Unknown error type */
 		pr_info("CCIX error of unknown or vendor defined type\n");
diff --git a/drivers/firmware/efi/cper-ccix.c b/drivers/firmware/efi/cper-ccix.c
index fcf77834b9894..1a9661a6c49f4 100644
--- a/drivers/firmware/efi/cper-ccix.c
+++ b/drivers/firmware/efi/cper-ccix.c
@@ -503,6 +503,86 @@ static int cper_ccix_port_err_details(const char *pfx,
 	return 0;
 }
 
+static const char * const ccix_link_err_type_strs[] = {
+	"Generic Error",
+	"Credit Underflow",
+	"Credit Overflow",
+	"Unusable Credit Received",
+	"Link Credit Timeout",
+};
+
+static const char *cper_ccix_link_err_type_str(__u8 op)
+{
+	return op < ARRAY_SIZE(ccix_link_err_type_strs) ?
+		ccix_link_err_type_strs[op] : "Reserved";
+}
+
+static const char * const ccix_link_credit_type_strs[] = {
+	"Memory",
+	"Snoop",
+	"Data",
+	"Misc",
+};
+
+static const char *cper_ccix_link_credit_type_str(__u8 op)
+{
+	return op < ARRAY_SIZE(ccix_link_credit_type_strs) ?
+		ccix_link_credit_type_strs[op] : "Reserved";
+}
+
+static int cper_ccix_link_err_details(const char *pfx,
+				     struct acpi_hest_generic_data *gdata)
+{
+	struct cper_ccix_link_error *full_link_err;
+	struct cper_sec_ccix_link_error *link_err;
+	u16 vendor_data_len;
+	int i;
+
+	if (gdata->error_data_length < sizeof(*full_link_err))
+		return -ENOSPC;
+
+	full_link_err = acpi_hest_get_payload(gdata);
+
+	link_err = &full_link_err->link_record;
+
+	if (link_err->validation_bits & CCIX_LINK_ERR_TYPE_VALID)
+		printk("%s""Error Type: %s\n", pfx,
+		       cper_ccix_link_err_type_str(link_err->err_type));
+
+	if (link_err->validation_bits & CCIX_LINK_ERR_OP_VALID)
+		printk("%s""Operation: %s\n", pfx,
+		       cper_ccix_port_err_op_str(link_err->op_type));
+
+	if (link_err->validation_bits & CCIX_LINK_ERR_LINK_ID_VALID)
+		printk("%s""Link ID: %d\n", pfx, link_err->link_id);
+
+	if (link_err->validation_bits & CCIX_LINK_ERR_CREDIT_TYPE_VALID)
+		printk("%s""Credit Type: %s\n", pfx,
+		       cper_ccix_link_credit_type_str(link_err->credit_type));
+
+	/* CHECK THE AER EQUIVALENT */
+	if (link_err->validation_bits & CCIX_LINK_ERR_MESSAGE_VALID) {
+		for (i = 0; i < ARRAY_SIZE(link_err->message); i++)
+			printk("%s""Message%d: 0x%08x\n", pfx, i, link_err->message[i]);
+	}
+
+	if (link_err->validation_bits & CCIX_LINK_ERR_VENDOR_DATA_VALID) {
+		if (gdata->error_data_length < sizeof(*full_link_err) + 4)
+			return -ENOSPC;
+
+		vendor_data_len = link_err->vendor_data[0] & GENMASK(15, 0);
+		if (gdata->error_data_length < sizeof(*full_link_err) + vendor_data_len)
+			return -ENOSPC;
+
+		for (i = 0; i < vendor_data_len/4 - 1; i++)
+			printk("%s""Vendor%d: 0x%08x\n", pfx, i,
+			       link_err->vendor_data[i + 1]);
+
+	}
+
+	return 0;
+}
+
 int cper_print_ccix_per(const char *pfx, struct acpi_hest_generic_data *gdata)
 {
 	struct cper_sec_ccix_header *header = acpi_hest_get_payload(gdata);
@@ -570,6 +650,8 @@ int cper_print_ccix_per(const char *pfx, struct acpi_hest_generic_data *gdata)
 		return cper_ccix_atc_err_details(pfx, gdata);
 	case CCIX_PORT_ERROR:
 		return cper_ccix_port_err_details(pfx, gdata);
+	case CCIX_LINK_ERROR:
+		return cper_ccix_link_err_details(pfx, gdata);
 	default:
 		/* Vendor defined so no formatting be done */
 		break;
@@ -731,3 +813,61 @@ const char *cper_ccix_port_err_unpack(struct trace_seq *p,
 
 	return ret;
 }
+
+void cper_ccix_link_err_pack(const struct cper_sec_ccix_link_error *link_record,
+			     struct cper_ccix_link_err_compact *clink_err,
+			     const u16 vendor_data_len,
+			     u8 *vendor_data)
+{
+	clink_err->validation_bits = link_record->validation_bits;
+	clink_err->err_type = link_record->err_type;
+	clink_err->op_type = link_record->op_type;
+	clink_err->link_id = link_record->link_id;
+	clink_err->credit_type = link_record->credit_type;
+	memcpy(clink_err->message, link_record->message,
+	       sizeof(clink_err->message));
+
+	memcpy(vendor_data, &link_record->vendor_data[1], vendor_data_len);
+}
+
+static int cper_ccix_err_link_location(struct cper_ccix_link_err_compact *clink_err,
+				       char *msg)
+{
+	u32 len = CPER_REC_LEN - 1;
+	u32 n = 0;
+
+	if (!msg)
+		return 0;
+
+	if (clink_err->validation_bits & CCIX_LINK_ERR_TYPE_VALID)
+		n += snprintf(msg + n, len, "Error Type: %s ",
+			      cper_ccix_link_err_type_str(clink_err->err_type));
+
+
+	if (clink_err->validation_bits & CCIX_LINK_ERR_OP_VALID)
+		n += snprintf(msg + n, len, "Op: %s ",
+			     cper_ccix_port_err_op_str(clink_err->op_type));
+
+	if (clink_err->validation_bits & CCIX_LINK_ERR_LINK_ID_VALID)
+		n += snprintf(msg + n, len, "Link ID: %d ", clink_err->link_id);
+
+	if (clink_err->validation_bits & CCIX_LINK_ERR_CREDIT_TYPE_VALID)
+		n += snprintf(msg + n, len, "Credit Type: %s ",
+			      cper_ccix_link_credit_type_str(clink_err->credit_type));
+
+	/* MESSAGE TODO */
+	return n;
+}
+
+const char *cper_ccix_link_err_unpack(struct trace_seq *p,
+				      struct cper_ccix_link_err_compact *clink_err)
+{
+	const char *ret = trace_seq_buffer_ptr(p);
+
+	if (cper_ccix_err_link_location(clink_err, rcd_decode_str))
+		trace_seq_printf(p, "%s", rcd_decode_str);
+
+	trace_seq_putc(p, '\0');
+
+	return ret;
+}
diff --git a/include/linux/cper.h b/include/linux/cper.h
index cd023e0309bf4..05d73604dfd2b 100644
--- a/include/linux/cper.h
+++ b/include/linux/cper.h
@@ -754,6 +754,47 @@ struct cper_ccix_port_err_compact {
 	__u8	op_type;
 };
 
+struct cper_sec_ccix_link_error {
+	__u32	validation_bits;
+#define CCIX_LINK_ERR_OP_VALID			BIT(0)
+#define CCIX_LINK_ERR_TYPE_VALID		BIT(1)
+#define CCIX_LINK_ERR_LINK_ID_VALID		BIT(2)
+#define CCIX_LINK_ERR_CREDIT_TYPE_VALID		BIT(3)
+#define CCIX_LINK_ERR_MESSAGE_VALID		BIT(4)
+#define CCIX_LINK_ERR_VENDOR_DATA_VALID		BIT(5)
+	__u16	length; /* Includes vendor specific log info */
+	__u8	op_type;
+	__u8	err_type;
+	__u8	link_id;
+	__u8	credit_type;
+	__u16	reserved;
+	__u32	message[8];
+	__u32	vendor_data[];
+};
+
+struct cper_ccix_link_error {
+	struct cper_sec_ccix_header header;
+	__u32 ccix_header[CCIX_PER_LOG_HEADER_DWS];
+	struct cper_sec_ccix_link_error link_record;
+};
+
+static inline u16 ccix_link_err_ven_len_get(struct cper_ccix_link_error *link_err)
+{
+	if (link_err->link_record.validation_bits & CCIX_LINK_ERR_VENDOR_DATA_VALID)
+		return link_err->link_record.vendor_data[0] & 0xFFFF;
+	else
+		return 0;
+}
+
+struct cper_ccix_link_err_compact {
+	__u32	validation_bits;
+	__u32	message[8];
+	__u8	err_type;
+	__u8	op_type;
+	__u8	link_id;
+	__u8	credit_type;
+};
+
 /* Reset to default packing */
 #pragma pack()
 
@@ -799,6 +840,13 @@ void cper_ccix_port_err_pack(const struct cper_sec_ccix_port_error *port_record,
 const char *cper_ccix_port_err_unpack(struct trace_seq *p,
 				      struct cper_ccix_port_err_compact *cport_err);
 
+void cper_ccix_link_err_pack(const struct cper_sec_ccix_link_error *link_record,
+			     struct cper_ccix_link_err_compact *clink_err,
+			     const u16 vendor_data_len,
+			     u8 *vendor_data);
+const char *cper_ccix_link_err_unpack(struct trace_seq *p,
+				      struct cper_ccix_link_err_compact *clink_err);
+
 struct acpi_hest_generic_data;
 int cper_print_ccix_per(const char *pfx,
 			struct acpi_hest_generic_data *gdata);
diff --git a/include/ras/ras_event.h b/include/ras/ras_event.h
index 52ac3f2d4c117..bfe1c64b9db0b 100644
--- a/include/ras/ras_event.h
+++ b/include/ras/ras_event.h
@@ -614,6 +614,71 @@ TRACE_EVENT(ccix_port_error_event,
 	)
 );
 
+TRACE_EVENT(ccix_link_error_event,
+	TP_PROTO(struct cper_ccix_link_error *err,
+		 u32 err_seq,
+		 u8 sev, u16 ven_len),
+
+	TP_ARGS(err, err_seq, sev, ven_len),
+
+	TP_STRUCT__entry(
+		__field(u32, err_seq)
+		__field(u8, sev)
+		__field(u8, sevdetail)
+		__field(u8, source)
+		__field(u8, component)
+		__field(u64, pa)
+		__field(u8, pa_mask_lsb)
+		__field_struct(struct cper_ccix_link_err_compact, data)
+		__field(u16, vendor_data_length)
+		__dynamic_array(u8, vendor_data, ven_len)
+	),
+
+	TP_fast_assign(
+		__entry->err_seq = err_seq;
+
+		__entry->sev = sev;
+		__entry->sevdetail = FIELD_GET(CCIX_PER_LOG_DW1_SEV_UE_M |
+			CCIX_PER_LOG_DW1_SEV_NO_COMM_M |
+			CCIX_PER_LOG_DW1_SEV_DEGRADED_M |
+			CCIX_PER_LOG_DW1_SEV_DEFFERABLE_M,
+			err->ccix_header[1]);
+		if (err->header.validation_bits & 0x1)
+			__entry->source = err->header.source_id;
+		else
+			__entry->source = ~0;
+		__entry->component = FIELD_GET(CCIX_PER_LOG_DW1_COMP_TYPE_M,
+						   err->ccix_header[1]);
+		if (err->ccix_header[1] & CCIX_PER_LOG_DW1_ADDR_VAL_M) {
+			__entry->pa = (u64)err->ccix_header[2] << 32 |
+				(err->ccix_header[3] & 0xfffffffc);
+			__entry->pa_mask_lsb = err->ccix_header[4] & 0xff;
+		} else {
+			__entry->pa = ~0ull;
+			__entry->pa_mask_lsb = ~0;
+		}
+		/* Do not store the vendor data header length */
+		__entry->vendor_data_length = ven_len ? ven_len - 4 : 0;
+		cper_ccix_link_err_pack(&err->link_record, &__entry->data,
+					__entry->vendor_data_length,
+					__get_dynamic_array(vendor_data));
+	),
+
+	TP_printk("{%d} %s CCIX PER Link Error in %s SevUE:%d SevNoComm:%d SevDegraded:%d SevDeferred:%d physical addr: %016llx (mask: %x) %s vendor:%s",
+		__entry->err_seq,
+		cper_severity_str(__entry->sev),
+		cper_ccix_comp_type_str(__entry->component),
+		__entry->sevdetail & BIT(0) ? 1 : 0,
+		__entry->sevdetail & BIT(1) ? 1 : 0,
+		__entry->sevdetail & BIT(2) ? 1 : 0,
+		__entry->sevdetail & BIT(3) ? 1 : 0,
+		__entry->pa,
+		__entry->pa_mask_lsb,
+		cper_ccix_link_err_unpack(p, &__entry->data),
+		__print_hex(__get_dynamic_array(vendor_data), __entry->vendor_data_length)
+	)
+);
+
 /*
  * memory-failure recovery action result event
  *
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [RFC PATCH 6/6] efi / ras: CCIX Agent internal error reporting
  2019-06-06 12:36 [RFC PATCH 0/6] CCIX Protocol Error reporting Jonathan Cameron
                   ` (4 preceding siblings ...)
  2019-06-06 12:36 ` [RFC PATCH 5/6] efi / ras: CCIX Link " Jonathan Cameron
@ 2019-06-06 12:36 ` Jonathan Cameron
  2019-06-25 11:34 ` [RFC PATCH 0/6] CCIX Protocol Error reporting Jonathan Cameron
  2019-07-03  9:28 ` James Morse
  7 siblings, 0 replies; 12+ messages in thread
From: Jonathan Cameron @ 2019-06-06 12:36 UTC (permalink / raw)
  To: linux-edac, linux-acpi, linux-efi
  Cc: linuxarm, rjw, tony.luck, bp, james.morse, ard.beisheuvel,
	nariman.poushin, Jonathan Cameron

The CCIX 1.0 Base specification defines an internal agent error,
for which the specific data present afte the header is vendor
defined.

Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
---
 drivers/acpi/apei/ghes.c         |  4 ++
 drivers/firmware/efi/cper-ccix.c | 43 +++++++++++++++++++++
 include/linux/cper.h             | 29 +++++++++++++++
 include/ras/ras_event.h          | 64 ++++++++++++++++++++++++++++++++
 4 files changed, 140 insertions(+)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 2cabac7a344c7..0e21b2a9c3c50 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -520,6 +520,10 @@ static void ghes_handle_ccix_per(struct acpi_hest_generic_data *gdata, int sev)
 		trace_ccix_link_error_event(payload, err_seq, sev,
 					    ccix_link_err_ven_len_get(payload));
 		break;
+	case CCIX_AGENT_INTERNAL_ERROR:
+		trace_ccix_agent_error_event(payload, err_seq, sev,
+					     ccix_agent_err_ven_len_get(payload));
+		break;
 	default:
 		/* Unknown error type */
 		pr_info("CCIX error of unknown or vendor defined type\n");
diff --git a/drivers/firmware/efi/cper-ccix.c b/drivers/firmware/efi/cper-ccix.c
index 1a9661a6c49f4..316a7ad3fec69 100644
--- a/drivers/firmware/efi/cper-ccix.c
+++ b/drivers/firmware/efi/cper-ccix.c
@@ -583,6 +583,38 @@ static int cper_ccix_link_err_details(const char *pfx,
 	return 0;
 }
 
+static int cper_ccix_agent_err_details(const char *pfx,
+				       struct acpi_hest_generic_data *gdata)
+{
+	struct cper_ccix_agent_err *full_agent_err;
+	struct cper_sec_ccix_agent_err *agent_err;
+	u16 vendor_data_len;
+	int i;
+
+	if (gdata->error_data_length < sizeof(*full_agent_err))
+		return -ENOSPC;
+
+	full_agent_err = acpi_hest_get_payload(gdata);
+
+	agent_err = &full_agent_err->agent_record;
+
+	if (agent_err->validation_bits & CCIX_AGENT_INTERNAL_ERR_VENDOR_DATA_VALID) {
+		if (gdata->error_data_length < sizeof(*full_agent_err) + 4)
+			return -ENOSPC;
+
+		vendor_data_len = agent_err->vendor_data[0] & GENMASK(15, 0);
+		if (gdata->error_data_length <
+		    sizeof(*full_agent_err) + vendor_data_len)
+			return -ENOSPC;
+
+		for (i = 0; i < vendor_data_len/4 - 1; i++)
+			printk("%s""Vendor%d: 0x%08x\n", pfx, i,
+			       agent_err->vendor_data[i + 1]);
+	}
+
+	return 0;
+}
+
 int cper_print_ccix_per(const char *pfx, struct acpi_hest_generic_data *gdata)
 {
 	struct cper_sec_ccix_header *header = acpi_hest_get_payload(gdata);
@@ -652,6 +684,8 @@ int cper_print_ccix_per(const char *pfx, struct acpi_hest_generic_data *gdata)
 		return cper_ccix_port_err_details(pfx, gdata);
 	case CCIX_LINK_ERROR:
 		return cper_ccix_link_err_details(pfx, gdata);
+	case CCIX_AGENT_INTERNAL_ERROR:
+		return cper_ccix_agent_err_details(pfx, gdata);
 	default:
 		/* Vendor defined so no formatting be done */
 		break;
@@ -871,3 +905,12 @@ const char *cper_ccix_link_err_unpack(struct trace_seq *p,
 
 	return ret;
 }
+
+void cper_ccix_agent_err_pack(const struct cper_sec_ccix_agent_err *agent_record,
+			      struct cper_ccix_agent_err_compact *cagent_err,
+			      const u16 vendor_data_len,
+			      u8 *vendor_data)
+{
+	cagent_err->validation_bits = agent_record->validation_bits;
+	memcpy(vendor_data, &agent_record->vendor_data[1], vendor_data_len);
+}
diff --git a/include/linux/cper.h b/include/linux/cper.h
index 05d73604dfd2b..961c4ff7f0791 100644
--- a/include/linux/cper.h
+++ b/include/linux/cper.h
@@ -795,6 +795,30 @@ struct cper_ccix_link_err_compact {
 	__u8	credit_type;
 };
 
+struct cper_sec_ccix_agent_err {
+	__u32	validation_bits;
+#define CCIX_AGENT_INTERNAL_ERR_VENDOR_DATA_VALID	BIT(0)
+	__u32	vendor_data[];
+};
+
+struct cper_ccix_agent_err {
+	struct cper_sec_ccix_header header;
+	__u32 ccix_header[CCIX_PER_LOG_HEADER_DWS];
+	struct cper_sec_ccix_agent_err agent_record;
+};
+
+static inline u16 ccix_agent_err_ven_len_get(struct cper_ccix_agent_err *agent_err)
+{
+	if (agent_err->agent_record.validation_bits & CCIX_AGENT_INTERNAL_ERR_VENDOR_DATA_VALID)
+		return agent_err->agent_record.vendor_data[0] & 0xFFFF;
+	else
+		return 0;
+}
+
+struct cper_ccix_agent_err_compact {
+	__u32	validation_bits;
+};
+
 /* Reset to default packing */
 #pragma pack()
 
@@ -847,6 +871,11 @@ void cper_ccix_link_err_pack(const struct cper_sec_ccix_link_error *link_record,
 const char *cper_ccix_link_err_unpack(struct trace_seq *p,
 				      struct cper_ccix_link_err_compact *clink_err);
 
+void cper_ccix_agent_err_pack(const struct cper_sec_ccix_agent_err *agent_record,
+			      struct cper_ccix_agent_err_compact *cagent_err,
+			      const u16 vendor_data_len,
+			      u8 *vendor_data);
+
 struct acpi_hest_generic_data;
 int cper_print_ccix_per(const char *pfx,
 			struct acpi_hest_generic_data *gdata);
diff --git a/include/ras/ras_event.h b/include/ras/ras_event.h
index bfe1c64b9db0b..d09e5389a1e64 100644
--- a/include/ras/ras_event.h
+++ b/include/ras/ras_event.h
@@ -679,6 +679,70 @@ TRACE_EVENT(ccix_link_error_event,
 	)
 );
 
+TRACE_EVENT(ccix_agent_error_event,
+	TP_PROTO(struct cper_ccix_agent_err *err,
+		 u32 err_seq,
+		 u8 sev, u16 ven_len),
+
+	TP_ARGS(err, err_seq, sev, ven_len),
+
+	TP_STRUCT__entry(
+		__field(u32, err_seq)
+		__field(u8, sev)
+		__field(u8, sevdetail)
+		__field(u8, source)
+		__field(u8, component)
+		__field(u64, pa)
+		__field(u8, pa_mask_lsb)
+		__field(u16, vendor_data_length)
+		__field_struct(struct cper_ccix_agent_err_compact, data)
+		__dynamic_array(u8, vendor_data, ven_len)
+	),
+
+	TP_fast_assign(
+		__entry->err_seq = err_seq;
+
+		__entry->sev = sev;
+		__entry->sevdetail = FIELD_GET(CCIX_PER_LOG_DW1_SEV_UE_M |
+			CCIX_PER_LOG_DW1_SEV_NO_COMM_M |
+			CCIX_PER_LOG_DW1_SEV_DEGRADED_M |
+			CCIX_PER_LOG_DW1_SEV_DEFFERABLE_M,
+			err->ccix_header[1]);
+		if (err->header.validation_bits & 0x1)
+			__entry->source = err->header.source_id;
+		else
+			__entry->source = ~0;
+		__entry->component = FIELD_GET(CCIX_PER_LOG_DW1_COMP_TYPE_M,
+						   err->ccix_header[1]);
+		if (err->ccix_header[1] & CCIX_PER_LOG_DW1_ADDR_VAL_M) {
+			__entry->pa = (u64)err->ccix_header[2] << 32 |
+				(err->ccix_header[3] & 0xfffffffc);
+			__entry->pa_mask_lsb = err->ccix_header[4] & 0xff;
+		} else {
+			__entry->pa = ~0ull;
+			__entry->pa_mask_lsb = ~0;
+		}
+		/* Do not store the vendor data header length */
+		__entry->vendor_data_length = ven_len ? ven_len - 4 : 0;
+		cper_ccix_agent_err_pack(&err->agent_record, &__entry->data,
+					__entry->vendor_data_length,
+					__get_dynamic_array(vendor_data));
+	),
+
+	TP_printk("{%d} %s CCIX PER Agent Internal Error in %s SevUE:%d SevNoComm:%d SevDegraded:%d SevDeferred:%d physical addr: %016llx (mask: %x) vendor:%s",
+		__entry->err_seq,
+		cper_severity_str(__entry->sev),
+		cper_ccix_comp_type_str(__entry->component),
+		__entry->sevdetail & BIT(0) ? 1 : 0,
+		__entry->sevdetail & BIT(1) ? 1 : 0,
+		__entry->sevdetail & BIT(2) ? 1 : 0,
+		__entry->sevdetail & BIT(3) ? 1 : 0,
+		__entry->pa,
+		__entry->pa_mask_lsb,
+		__print_hex(__get_dynamic_array(vendor_data), __entry->vendor_data_length)
+	)
+);
+
 /*
  * memory-failure recovery action result event
  *
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [RFC PATCH 1/6] efi / ras: CCIX Memory error reporting
  2019-06-06 12:36 ` [RFC PATCH 1/6] efi / ras: CCIX Memory error reporting Jonathan Cameron
@ 2019-06-21 17:40   ` Jonathan Cameron
  0 siblings, 0 replies; 12+ messages in thread
From: Jonathan Cameron @ 2019-06-21 17:40 UTC (permalink / raw)
  To: linux-edac, linux-acpi, linux-efi
  Cc: linuxarm, rjw, tony.luck, bp, james.morse, ard.beisheuvel,
	nariman.poushin

On Thu, 6 Jun 2019 20:36:49 +0800
Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:

> CCIX defines a number of different error types
> (See CCIX spec 1.0) and UEFI 2.8 defines a CPER record to allow
> for them to be reported when firmware first handling is in use.
> The last part of that record is a copy of the CCIX protocol
> error record which can provide very detailed information.
> 
> This patch introduces infrastructure and support for one of those
> error types, CCIX Memory Errors.  Later patches will supply
> equivalent support for the other error types.
> 
> The variable length and content of the different messages makes
> a single tracepoint impractical.  As such the current RAS
> tracepoint only covers the memory error. Additional trace points
> will be introduced for other error types along with their
> cper handling in a follow up series.
> 
> RAS daemon support to follow shortly. qemu injection patches
> also available but not currently planing to upstream those.
> 
> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
As this is still and RFC I'm not going to spin a new version yet,
but we need some ifdef fun in the event header as it's calling
functions much like extlog_mem_event does.

I'll roll that fix into v2 once people have had time to look at
this version.

Thanks,

Jonathan

> ---
>  drivers/acpi/apei/Kconfig        |   8 +
>  drivers/acpi/apei/ghes.c         |  39 ++++
>  drivers/firmware/efi/Kconfig     |   5 +
>  drivers/firmware/efi/Makefile    |   1 +
>  drivers/firmware/efi/cper-ccix.c | 356 +++++++++++++++++++++++++++++++
>  drivers/firmware/efi/cper.c      |   6 +
>  include/linux/cper.h             | 118 ++++++++++
>  include/ras/ras_event.h          |  77 +++++++
>  8 files changed, 610 insertions(+)
> 
> diff --git a/drivers/acpi/apei/Kconfig b/drivers/acpi/apei/Kconfig
> index 6b18f8bc7be35..e687b18dee344 100644
> --- a/drivers/acpi/apei/Kconfig
> +++ b/drivers/acpi/apei/Kconfig
> @@ -68,3 +68,11 @@ config ACPI_APEI_ERST_DEBUG
>  	  error information to and from a persistent store. Enable this
>  	  if you want to debugging and testing the ERST kernel support
>  	  and firmware implementation.
> +
> +config ACPI_APEI_CCIX
> +       bool "APEI CCIX error recovery support"
> +       depends on ACPI_APEI && MEMORY_FAILURE
> +       help
> +	 CCIX has a number of defined error types. This option enables
> +	 the handling of CPER records generated by a firmware performing
> +	 firmware first error handling of these CCIX errors.
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index 993940d582f50..cfc7dc31a9380 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -477,6 +477,42 @@ static void ghes_handle_aer(struct acpi_hest_generic_data *gdata)
>  #endif
>  }
>  
> +static void ghes_handle_ccix_per(struct acpi_hest_generic_data *gdata, int sev)
> +{
> +#ifdef CONFIG_ACPI_APEI_CCIX
> +	struct cper_sec_ccix_header *header = acpi_hest_get_payload(gdata);
> +	__u32 *dw;
> +	enum ccix_per_type per_type;
> +	static u32 err_seq;
> +	void *payload;
> +
> +	/* Check if space for CCIX CPER header and 8 DW of a PER log header */
> +	if (gdata->error_data_length <
> +	    sizeof(*header) + CCIX_PER_LOG_HEADER_DWS * sizeof(__u32))
> +		return;
> +
> +	if ((header->validation_bits & CPER_CCIX_VALID_PER_LOG) == 0)
> +		return;
> +
> +	dw = (__u32 *)(header + 1);
> +
> +	per_type = FIELD_GET(CCIX_PER_LOG_DW1_PER_TYPE_M, dw[1]);
> +	payload = acpi_hest_get_payload(gdata);
> +
> +	switch (per_type) {
> +	case CCIX_MEMORY_ERROR:
> +		trace_ccix_memory_error_event(payload, err_seq, sev,
> +					      ccix_mem_err_ven_len_get(payload));
> +		break;
> +	default:
> +		/* Unknown error type */
> +		pr_info("CCIX error of unknown or vendor defined type\n");
> +		break;
> +	}
> +	err_seq++;
> +#endif
> +}
> +
>  static void ghes_do_proc(struct ghes *ghes,
>  			 const struct acpi_hest_generic_status *estatus)
>  {
> @@ -507,6 +543,9 @@ static void ghes_do_proc(struct ghes *ghes,
>  		else if (guid_equal(sec_type, &CPER_SEC_PCIE)) {
>  			ghes_handle_aer(gdata);
>  		}
> +		else if (guid_equal(sec_type, &CPER_SEC_CCIX)) {
> +			ghes_handle_ccix_per(gdata, estatus->error_severity);
> +		}
>  		else if (guid_equal(sec_type, &CPER_SEC_PROC_ARM)) {
>  			struct cper_sec_proc_arm *err = acpi_hest_get_payload(gdata);
>  
> diff --git a/drivers/firmware/efi/Kconfig b/drivers/firmware/efi/Kconfig
> index d4ea929e8b344..9ea161f68da8d 100644
> --- a/drivers/firmware/efi/Kconfig
> +++ b/drivers/firmware/efi/Kconfig
> @@ -195,6 +195,11 @@ config UEFI_CPER_X86
>  	depends on UEFI_CPER && X86
>  	default y
>  
> +config UEFI_CPER_CCIX
> +       bool
> +       depends on UEFI_CPER
> +       default y
> +
>  config EFI_DEV_PATH_PARSER
>  	bool
>  	depends on ACPI
> diff --git a/drivers/firmware/efi/Makefile b/drivers/firmware/efi/Makefile
> index d2d0d20306200..69287da9664b6 100644
> --- a/drivers/firmware/efi/Makefile
> +++ b/drivers/firmware/efi/Makefile
> @@ -33,3 +33,4 @@ obj-$(CONFIG_EFI_CAPSULE_LOADER)	+= capsule-loader.o
>  obj-$(CONFIG_EFI_EARLYCON)		+= earlycon.o
>  obj-$(CONFIG_UEFI_CPER_ARM)		+= cper-arm.o
>  obj-$(CONFIG_UEFI_CPER_X86)		+= cper-x86.o
> +obj-$(CONFIG_UEFI_CPER_CCIX)		+= cper-ccix.o
> diff --git a/drivers/firmware/efi/cper-ccix.c b/drivers/firmware/efi/cper-ccix.c
> new file mode 100644
> index 0000000000000..9856804bdca81
> --- /dev/null
> +++ b/drivers/firmware/efi/cper-ccix.c
> @@ -0,0 +1,356 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * UEFI Common Platform Error Record (CPER) support for CCIX
> + * protocol errors.
> + *
> + * Copyright (C) 2019, Huawei
> + *	Author: Jonathan Cameron <jonathan.cameron@huawei.com>
> + *
> + * CPER is the format used to describe platform hardware error by
> + * various tables, such as ERST, BERT and HEST etc.
> + *
> + * For more information about CPER, please refer to Appendix N of UEFI
> + * Specification version 2.9.
> + *
> + * CCIX defines a number of Protocol Error Messages which for the
> + * main body of the CCIX CPER records.  These are defined in the
> + * CCIX Specification 1.0.
> + */
> +
> +#include <acpi/ghes.h>
> +#include <linux/acpi.h>
> +#include <linux/bitfield.h>
> +#include <linux/cper.h>
> +#include <linux/kernel.h>
> +#include <linux/printk.h>
> +#include <ras/ras_event.h>
> +
> +static char rcd_decode_str[CPER_REC_LEN];
> +
> +static const char * const ccix_comp_type_strs[] = {
> +	"Request Agent",
> +	"Home Agent",
> +	"Slave Agent",
> +	"Port",
> +	"Link",
> +};
> +
> +const char *cper_ccix_comp_type_str(u8 comp_type)
> +{
> +	return comp_type < ARRAY_SIZE(ccix_comp_type_strs) ?
> +		ccix_comp_type_strs[comp_type] : "Reserved";
> +}
> +
> +static const char * const ccix_per_type_strs[] = {
> +	"Memory Error",
> +	"Cache Error",
> +	"ATC Error",
> +	"Port Error",
> +	"Link Error",
> +	"Agent Internal",
> +};
> +
> +static const char * const ccix_mem_pool_gen_type_strs[] = {
> +	"Other, Non-specified",
> +	"ROM",
> +	"Volatile",
> +	"Non-volatile",
> +	"Device",
> +};
> +
> +static const char *cper_ccix_mem_err_generic_type_str(u16 type)
> +{
> +	const char *gen_type_str;
> +
> +	if (type < ARRAY_SIZE(ccix_mem_pool_gen_type_strs))
> +		gen_type_str = ccix_mem_pool_gen_type_strs[type];
> +	else if (type >= 0x80)
> +		gen_type_str = "Vendor";
> +	else
> +		gen_type_str = "Reserved";
> +
> +	return gen_type_str;
> +}
> +
> +static const char * const ccix_mem_op_type_strs[] = {
> +	"Generic",
> +	"Read",
> +	"Write",
> +	"Reserved",
> +	"Scrub",
> +};
> +
> +static const char *cper_ccix_mem_err_op_str(u8 op_type)
> +{
> +	return op_type < ARRAY_SIZE(ccix_mem_op_type_strs) ?
> +		ccix_mem_op_type_strs[op_type] :
> +		"Reserved";
> +}
> +
> +/* Sightly different from the generic version */
> +static const char * const ccix_mem_err_type_strs[] = {
> +	"Unknown",
> +	"No Error",
> +	"Single-bit ECC",
> +	"Multi-bit ECC",
> +	"Single-symbol ChipKill ECC",
> +	"Multi-symbol ChipKill ECC",
> +	"Master Abort",
> +	"Target Abort",
> +	"Parity Error",
> +	"Watchdog Timeout",
> +	"Invalid Address",
> +	"Mirror Broken",
> +	"Memory Sparing",
> +	"Scrub",
> +	"Physical Memory Map-out Event",
> +};
> +
> +const char *cper_ccix_mem_err_type_str(unsigned int error_type)
> +{
> +	return error_type < ARRAY_SIZE(ccix_mem_err_type_strs) ?
> +		ccix_mem_err_type_strs[error_type] : "Reserved";
> +}
> +
> +static const char * const ccix_mem_spec_type_strs[] = {
> +	"Other, Not-specified",
> +	"SRAM",
> +	"DDR",
> +	"NVDIMM-F",
> +	"NVDIMM-N",
> +	"HBM",
> +	"Flash"
> +};
> +
> +static const char *cper_ccix_mem_err_spec_type_str(u8 specific_type)
> +{
> +	if (specific_type < ARRAY_SIZE(ccix_mem_spec_type_strs))
> +		return ccix_mem_spec_type_strs[specific_type];
> +	else if (specific_type >= 0x80)
> +		return "Vendor";
> +	else
> +		return "Reserved";
> +}
> +
> +/*
> + * We pack up everything except those that are needed for software handling:
> + * - error_type, physical_addr
> + * and header values that would require additional validation bits:
> + * - source, component, severity,
> + * implicit: protocol error type (mem)
> + */
> +void cper_ccix_mem_err_pack(const struct cper_sec_ccix_mem_error *mem_record,
> +			    struct cper_ccix_mem_err_compact *cmem_err,
> +			    const u16 vendor_data_len,
> +			    u8 *vendor_data)
> +{
> +	cmem_err->validation_bits = mem_record->validation_bits;
> +	cmem_err->mem_err_type = mem_record->memory_error_type;
> +	cmem_err->pool_generic_type = mem_record->pool_generic_type;
> +	cmem_err->op_type = mem_record->op_type;
> +	cmem_err->card = mem_record->card;
> +	cmem_err->module = mem_record->module;
> +	cmem_err->bank = mem_record->bank;
> +	cmem_err->device = mem_record->device;
> +	cmem_err->row = mem_record->row;
> +	cmem_err->column = mem_record->column;
> +	cmem_err->rank = mem_record->rank;
> +	cmem_err->bit_pos = mem_record->bit_pos;
> +	cmem_err->chip_id = mem_record->chip_id;
> +	cmem_err->pool_specific_type = mem_record->pool_specific_type;
> +	cmem_err->fru = mem_record->fru;
> +	memcpy(vendor_data, &mem_record->vendor_data[1], vendor_data_len);
> +}
> +
> +static int cper_ccix_err_location(struct cper_ccix_mem_err_compact *cmem_err,
> +				  char *msg)
> +{
> +	u32 len = CPER_REC_LEN - 1;
> +	u32 n = 0;
> +
> +	if (!msg)
> +		return 0;
> +
> +	if (cmem_err->validation_bits & CCIX_MEM_ERR_GENERIC_MEM_VALID)
> +		n += snprintf(msg + n, len, "Pool Generic Type: %s ",
> +			      cper_ccix_mem_err_generic_type_str(cmem_err->pool_generic_type));
> +
> +	if (cmem_err->validation_bits & CCIX_MEM_ERR_MEM_ERR_TYPE_VALID)
> +		n += snprintf(msg + n, len, "Err Type: %s ",
> +			      cper_ccix_mem_err_type_str(cmem_err->mem_err_type));
> +	if (cmem_err->validation_bits & CCIX_MEM_ERR_OP_VALID)
> +		n += snprintf(msg + n, len, "Operation: %s ",
> +			     cper_ccix_mem_err_op_str(cmem_err->op_type));
> +
> +	if (cmem_err->validation_bits & CCIX_MEM_ERR_CARD_VALID)
> +		n += snprintf(msg + n, len, "Card: %d ", cmem_err->card);
> +	if (cmem_err->validation_bits & CCIX_MEM_ERR_MOD_VALID)
> +		n += snprintf(msg + n, len, "Mod: %d ", cmem_err->module);
> +	if (cmem_err->validation_bits & CCIX_MEM_ERR_BANK_VALID)
> +		n += snprintf(msg + n, len, "Bank: %d ", cmem_err->bank);
> +	if (cmem_err->validation_bits & CCIX_MEM_ERR_DEVICE_VALID)
> +		n += snprintf(msg + n, len, "Device: %d ", cmem_err->device);
> +	if (cmem_err->validation_bits & CCIX_MEM_ERR_ROW_VALID)
> +		n += snprintf(msg + n, len, "Row: %d ", cmem_err->row);
> +	if (cmem_err->validation_bits & CCIX_MEM_ERR_COL_VALID)
> +		n += snprintf(msg + n, len, "Col: %d ", cmem_err->column);
> +	if (cmem_err->validation_bits & CCIX_MEM_ERR_RANK_VALID)
> +		n += snprintf(msg + n, len, "Rank: %d ", cmem_err->rank);
> +	if (cmem_err->validation_bits & CCIX_MEM_ERR_BIT_POS_VALID)
> +		n += snprintf(msg + n, len, "BitPos: %d ", cmem_err->bit_pos);
> +	if (cmem_err->validation_bits & CCIX_MEM_ERR_CHIP_ID_VALID)
> +		n += snprintf(msg + n, len, "ChipID: %d ", cmem_err->chip_id);
> +	if (cmem_err->validation_bits & CCIX_MEM_ERR_SPEC_TYPE_VALID)
> +		n += snprintf(msg + n, len, "Pool Specific Type: %s ",
> +			      cper_ccix_mem_err_spec_type_str(cmem_err->pool_specific_type));
> +	n += snprintf(msg + n, len, "FRU: %d ", cmem_err->fru);
> +
> +	return n;
> +}
> +
> +const char *cper_ccix_mem_err_unpack(struct trace_seq *p,
> +				     struct cper_ccix_mem_err_compact *cmem_err)
> +{
> +	const char *ret = trace_seq_buffer_ptr(p);
> +
> +	if (cper_ccix_err_location(cmem_err, rcd_decode_str))
> +		trace_seq_printf(p, "%s", rcd_decode_str);
> +	trace_seq_putc(p, '\0');
> +
> +	return ret;
> +}
> +
> +static int cper_ccix_mem_err_details(const char *pfx,
> +				     struct acpi_hest_generic_data *gdata)
> +{
> +	struct cper_ccix_mem_error *full_mem_err;
> +	struct cper_sec_ccix_mem_error *mem_err;
> +	u16 vendor_data_len;
> +	int i;
> +
> +	if (gdata->error_data_length < sizeof(*full_mem_err))
> +		return -ENOSPC;
> +
> +	full_mem_err = acpi_hest_get_payload(gdata);
> +
> +	mem_err = &full_mem_err->mem_record;
> +	printk("%s""FRU ID: %u, Length: %u\n", pfx,
> +	       mem_err->fru, mem_err->length);
> +	if (mem_err->validation_bits & CCIX_MEM_ERR_GENERIC_MEM_VALID)
> +		printk("%s""Pool Generic Type: %s\n", pfx,
> +		       cper_ccix_mem_err_generic_type_str(mem_err->pool_generic_type));
> +
> +	if (mem_err->validation_bits & CCIX_MEM_ERR_OP_VALID)
> +		printk("%s""Operation: %s\n", pfx,
> +		       cper_ccix_mem_err_op_str(mem_err->op_type));
> +
> +	if (mem_err->validation_bits & CCIX_MEM_ERR_MEM_ERR_TYPE_VALID) {
> +		printk("%s""Mem Error Type: %s\n", pfx,
> +		       cper_ccix_mem_err_type_str(mem_err->memory_error_type));
> +	}
> +	if (mem_err->validation_bits & CCIX_MEM_ERR_CARD_VALID)
> +		printk("%s""Card: %u\n", pfx, mem_err->card);
> +	if (mem_err->validation_bits & CCIX_MEM_ERR_MOD_VALID)
> +		printk("%s""Module: %u\n", pfx, mem_err->module);
> +	if (mem_err->validation_bits & CCIX_MEM_ERR_BANK_VALID)
> +		printk("%s""Bank: %u\n", pfx, mem_err->bank);
> +	if (mem_err->validation_bits & CCIX_MEM_ERR_DEVICE_VALID)
> +		printk("%s""Device: %u\n", pfx, mem_err->device);
> +	if (mem_err->validation_bits & CCIX_MEM_ERR_ROW_VALID)
> +		printk("%s""Row: %u\n", pfx, mem_err->row);
> +	if (mem_err->validation_bits & CCIX_MEM_ERR_COL_VALID)
> +		printk("%s""Column: %u\n", pfx, mem_err->column);
> +	if (mem_err->validation_bits & CCIX_MEM_ERR_RANK_VALID)
> +		printk("%s""Rank: %u\n", pfx, mem_err->rank);
> +	if (mem_err->validation_bits & CCIX_MEM_ERR_BIT_POS_VALID)
> +		printk("%s""Bit Pos: %u\n", pfx, mem_err->bit_pos);
> +	if (mem_err->validation_bits & CCIX_MEM_ERR_CHIP_ID_VALID)
> +		printk("%s""Chip ID: %u\n", pfx, mem_err->chip_id);
> +	if (mem_err->validation_bits & CCIX_MEM_ERR_SPEC_TYPE_VALID)
> +		printk("%s""Specific Type: %s\n", pfx,
> +		       cper_ccix_mem_err_spec_type_str(mem_err->pool_specific_type));
> +
> +	if (mem_err->validation_bits & CCIX_MEM_ERR_VENDOR_DATA_VALID) {
> +		if (gdata->error_data_length < sizeof(*full_mem_err) + 4)
> +			return -ENOSPC;
> +
> +		vendor_data_len = mem_err->vendor_data[0] & GENMASK(15, 0);
> +		if (gdata->error_data_length <
> +		    sizeof(*full_mem_err) + vendor_data_len)
> +			return -ENOSPC;
> +
> +		for (i = 0; i < vendor_data_len / 4 - 1; i++)
> +			printk("%s""Vendor%d: 0x%08x\n", pfx, i,
> +			       mem_err->vendor_data[i + 1]);
> +	}
> +
> +	return 0;
> +}
> +
> +int cper_print_ccix_per(const char *pfx, struct acpi_hest_generic_data *gdata)
> +{
> +	struct cper_sec_ccix_header *header = acpi_hest_get_payload(gdata);
> +	__u32 *dw;
> +	__u32 comp_type;
> +	enum ccix_per_type per_type;
> +	bool vendor_per;
> +
> +	if (gdata->error_data_length < sizeof(*header))
> +		return -ENOSPC;
> +
> +	printk("%s""CPER Length: %u\n", pfx, header->length);
> +	if (header->validation_bits & CPER_CCIX_VALID_SOURCE_ID)
> +		printk("%s""Source: %u\n", pfx, header->source_id);
> +	if (header->validation_bits & CPER_CCIX_VALID_PORT_ID)
> +		printk("%s""Port: %u\n", pfx, header->port_id);
> +	/* Not much use if we don't have the per log, in theory it's optional */
> +	if ((header->validation_bits & CPER_CCIX_VALID_PER_LOG) == 0)
> +		return 0;
> +
> +	/* The per log header is a packed structure so needs breaking up */
> +	if (gdata->error_data_length < sizeof(*header) + 8 * 4)
> +		return -ENOSPC;
> +
> +	dw = (__u32 *)(header + 1);
> +
> +	printk("%s""PER Rev: %lu, Log Length: %lu\n", pfx,
> +	       FIELD_GET(CCIX_PER_LOG_DW0_REV_M, dw[0]),
> +	       FIELD_GET(CCIX_PER_LOG_DW0_LEN_M, dw[0]));
> +	comp_type = FIELD_GET(CCIX_PER_LOG_DW1_COMP_TYPE_M, dw[1]);
> +	printk("%s""Component: %s\n", pfx, cper_ccix_comp_type_str(comp_type));
> +	printk("%s""ME: %lu, SevUE: %lu, SevNoComm: %lu, SevDegraded: %lu, SevDeferred %lu\n",
> +	       pfx,
> +	       FIELD_GET(CCIX_PER_LOG_DW0_ME_M, dw[0]),
> +	       FIELD_GET(CCIX_PER_LOG_DW1_SEV_UE_M, dw[1]),
> +	       FIELD_GET(CCIX_PER_LOG_DW1_SEV_NO_COMM_M, dw[1]),
> +	       FIELD_GET(CCIX_PER_LOG_DW1_SEV_DEGRADED_M, dw[1]),
> +	       FIELD_GET(CCIX_PER_LOG_DW1_SEV_DEFFERABLE_M, dw[1]));
> +
> +	/* per_type is vendor defined if VEN is set */
> +	vendor_per = FIELD_GET(CCIX_PER_LOG_DW1_VEN_VAL_M, dw[1]) ?
> +		true : false;
> +	per_type = FIELD_GET(CCIX_PER_LOG_DW1_PER_TYPE_M, dw[1]);
> +	if (vendor_per)
> +		printk("%s""Protocol Error Type: Vendor%u", pfx, per_type);
> +	else
> +		printk("%s""Protocol Error Type: %s\n", pfx,
> +		       per_type < ARRAY_SIZE(ccix_per_type_strs) ?
> +		       ccix_per_type_strs[per_type] : "Reserved");
> +
> +	if (FIELD_GET(CCIX_PER_LOG_DW1_ADDR_VAL_M, dw[1]))
> +		printk("%s""Address: 0x%llx\n", pfx,
> +		       (((__u64)dw[2]) << 32) | (dw[3] & 0xFFFFFFFC));
> +
> +	/* Vendor defined PER message, perhaps we could print it out */
> +	if (vendor_per)
> +		return 0;
> +
> +	switch (per_type) {
> +	case CCIX_MEMORY_ERROR:
> +		return cper_ccix_mem_err_details(pfx, gdata);
> +	default:
> +		/* Vendor defined so no formatting be done */
> +		break;
> +	}
> +	return 0;
> +}
> diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c
> index 8fa977c7861f9..52f0954a4577a 100644
> --- a/drivers/firmware/efi/cper.c
> +++ b/drivers/firmware/efi/cper.c
> @@ -459,6 +459,12 @@ cper_estatus_print_section(const char *pfx, struct acpi_hest_generic_data *gdata
>  			cper_print_pcie(newpfx, pcie, gdata);
>  		else
>  			goto err_section_too_small;
> +	} else if (guid_equal(sec_type, &CPER_SEC_CCIX)) {
> +		int ret;
> +		/* CCIX CPER entries are variable length */
> +		ret = cper_print_ccix_per(newpfx, gdata);
> +		if (ret)
> +			goto err_section_too_small;
>  #if defined(CONFIG_ARM64) || defined(CONFIG_ARM)
>  	} else if (guid_equal(sec_type, &CPER_SEC_PROC_ARM)) {
>  		struct cper_sec_proc_arm *arm_err = acpi_hest_get_payload(gdata);
> diff --git a/include/linux/cper.h b/include/linux/cper.h
> index cc4980bb0f65a..a9a7ef56f4dc8 100644
> --- a/include/linux/cper.h
> +++ b/include/linux/cper.h
> @@ -186,6 +186,9 @@ enum {
>  #define CPER_SEC_PCIE							\
>  	GUID_INIT(0xD995E954, 0xBBC1, 0x430F, 0xAD, 0x91, 0xB4, 0x4D,	\
>  		  0xCB, 0x3C, 0x6F, 0x35)
> +#define CPER_SEC_CCIX							\
> +	GUID_INIT(0x91335EF6, 0xEBFB, 0x4478, 0xA6, 0xA6, 0x88, 0xB7,	\
> +		  0x28, 0xCF, 0x75, 0xD7)
>  /* Firmware Error Record Reference */
>  #define CPER_SEC_FW_ERR_REC_REF						\
>  	GUID_INIT(0x81212A96, 0x09ED, 0x4996, 0x94, 0x71, 0x8D, 0x72,	\
> @@ -254,6 +257,10 @@ enum {
>  
>  #define CPER_PCIE_SLOT_SHIFT			3
>  
> +#define CPER_CCIX_VALID_SOURCE_ID		BIT(0)
> +#define CPER_CCIX_VALID_PORT_ID			BIT(1)
> +#define CPER_CCIX_VALID_PER_LOG			BIT(2)
> +
>  #define CPER_ARM_VALID_MPIDR			BIT(0)
>  #define CPER_ARM_VALID_AFFINITY_LEVEL		BIT(1)
>  #define CPER_ARM_VALID_RUNNING_STATE		BIT(2)
> @@ -533,6 +540,105 @@ struct cper_sec_pcie {
>  	u8	aer_info[96];
>  };
>  
> +struct cper_sec_ccix_header {
> +	__u32	length;
> +	__u64	validation_bits;
> +	__u8	source_id;
> +	__u8	port_id;
> +	__u8	reserved[2];
> +};
> +
> +#define CCIX_PER_LOG_DW0_REV_M			GENMASK(7, 0)
> +#define CCIX_PER_LOG_DW0_LEN_M			GENMASK(14, 8)
> +#define CCIX_PER_LOG_DW0_ME_M			BIT(15)
> +#define CCIX_PER_LOG_DW1_COMP_TYPE_M		GENMASK(15, 12)
> +#define CCIX_PER_LOG_DW1_SEV_UE_M		BIT(16)
> +#define CCIX_PER_LOG_DW1_SEV_NO_COMM_M		BIT(17)
> +#define CCIX_PER_LOG_DW1_SEV_DEGRADED_M		BIT(18)
> +#define CCIX_PER_LOG_DW1_SEV_DEFFERABLE_M	BIT(19)
> +#define CCIX_PER_LOG_DW1_PER_TYPE_M		GENMASK(27, 24)
> +#define CCIX_PER_LOG_DW1_ADDR_VAL_M		BIT(30)
> +#define CCIX_PER_LOG_DW1_VEN_VAL_M		BIT(31)
> +enum ccix_per_type {
> +	CCIX_MEMORY_ERROR = 0,
> +	CCIX_CACHE_ERROR = 1,
> +	CCIX_ATC_ERROR = 2,
> +	CCIX_PORT_ERROR = 3,
> +	CCIX_LINK_ERROR = 4,
> +	CCIX_AGENT_INTERNAL_ERROR = 5,
> +};
> +
> +#define CCIX_PER_LOG_HEADER_DWS 8
> +
> +struct cper_sec_ccix_mem_error {
> +	__u32	validation_bits;
> +#define CCIX_MEM_ERR_GENERIC_MEM_VALID		BIT(0)
> +#define CCIX_MEM_ERR_OP_VALID			BIT(1)
> +#define CCIX_MEM_ERR_MEM_ERR_TYPE_VALID		BIT(2)
> +#define CCIX_MEM_ERR_CARD_VALID			BIT(3)
> +#define CCIX_MEM_ERR_BANK_VALID			BIT(4)
> +#define CCIX_MEM_ERR_DEVICE_VALID		BIT(5)
> +#define CCIX_MEM_ERR_ROW_VALID			BIT(6)
> +#define CCIX_MEM_ERR_COL_VALID			BIT(7)
> +#define CCIX_MEM_ERR_RANK_VALID			BIT(8)
> +#define CCIX_MEM_ERR_BIT_POS_VALID		BIT(9)
> +#define CCIX_MEM_ERR_CHIP_ID_VALID		BIT(10)
> +#define CCIX_MEM_ERR_VENDOR_DATA_VALID		BIT(11)
> +#define CCIX_MEM_ERR_MOD_VALID			BIT(12)
> +#define CCIX_MEM_ERR_SPEC_TYPE_VALID		BIT(13)
> +
> +	__u8	fru;
> +	__u8	reserved;
> +	__u16	length; /* Includes vendor specific log info */
> +	__u8	pool_generic_type;
> +	__u8	op_type;
> +	__u8	memory_error_type;
> +	__u8	card;
> +	__u16	module;
> +	__u16	bank;
> +	__u32	device;
> +	__u32	row;
> +	__u32	column;
> +	__u32	rank;
> +	__u8	bit_pos;
> +	__u8	chip_id;
> +	__u8	pool_specific_type;
> +	__u32	vendor_data[];
> +};
> +
> +struct cper_ccix_mem_error {
> +	struct cper_sec_ccix_header header;
> +	__u32 ccix_header[CCIX_PER_LOG_HEADER_DWS];
> +	struct cper_sec_ccix_mem_error mem_record;
> +};
> +
> +static inline u16 ccix_mem_err_ven_len_get(struct cper_ccix_mem_error *mem_err)
> +{
> +	if (mem_err->mem_record.validation_bits &
> +	    CCIX_MEM_ERR_VENDOR_DATA_VALID)
> +		return mem_err->mem_record.vendor_data[0] & 0xFFFF;
> +	else
> +		return 0;
> +}
> +
> +struct cper_ccix_mem_err_compact {
> +	__u32	validation_bits;
> +	__u8	mem_err_type;
> +	__u8	pool_generic_type;
> +	__u8	pool_specific_type;
> +	__u8	op_type;
> +	__u8	card;
> +	__u16	module;
> +	__u16	bank;
> +	__u32	device;
> +	__u32	row;
> +	__u32	column;
> +	__u32	rank;
> +	__u8	bit_pos;
> +	__u8	chip_id;
> +	__u8	fru;
> +};
> +
>  /* Reset to default packing */
>  #pragma pack()
>  
> @@ -547,6 +653,18 @@ void cper_mem_err_pack(const struct cper_sec_mem_err *,
>  		       struct cper_mem_err_compact *);
>  const char *cper_mem_err_unpack(struct trace_seq *,
>  				struct cper_mem_err_compact *);
> +void cper_ccix_mem_err_pack(const struct cper_sec_ccix_mem_error *mem_record,
> +			    struct cper_ccix_mem_err_compact *cmem_err,
> +			    const u16 vendor_data_len,
> +			    u8 *vendor_data);
> +const char *cper_ccix_mem_err_unpack(struct trace_seq *p,
> +				     struct cper_ccix_mem_err_compact *cmem_err);
> +const char *cper_ccix_mem_err_type_str(unsigned int error_type);
> +const char *cper_ccix_comp_type_str(u8 comp_type);
> +struct acpi_hest_generic_data;
> +int cper_print_ccix_per(const char *pfx,
> +			struct acpi_hest_generic_data *gdata);
> +
>  void cper_print_proc_arm(const char *pfx,
>  			 const struct cper_sec_proc_arm *proc);
>  void cper_print_proc_ia(const char *pfx,
> diff --git a/include/ras/ras_event.h b/include/ras/ras_event.h
> index 36c5c5e38c1d8..128728eaeef41 100644
> --- a/include/ras/ras_event.h
> +++ b/include/ras/ras_event.h
> @@ -14,6 +14,7 @@
>  #include <linux/cper.h>
>  #include <linux/mm.h>
>  
> +#include <linux/bitfield.h>
>  /*
>   * MCE Extended Error Log trace event
>   *
> @@ -338,6 +339,82 @@ TRACE_EVENT(aer_event,
>  			"Not available")
>  );
>  
> +/*
> + * CCIX PER log memory error trace event
> + *
> + * These events are generated when hardware detects a corrected or
> + * uncorrected event.
> + *
> + * Some elements of the record are not included
> + * - PER version (tracepoint should remain compatible across versions)
> + * - Multiple Error
> + */
> +TRACE_EVENT(ccix_memory_error_event,
> +	TP_PROTO(struct cper_ccix_mem_error *mem,
> +		 u32 err_seq,
> +		 u8 sev,
> +		 u16 ven_len),
> +
> +	TP_ARGS(mem, err_seq, sev, ven_len),
> +
> +	TP_STRUCT__entry(
> +		__field(u32, err_seq)
> +		__field(u8, sev)
> +		__field(u8, sevdetail)
> +		__field(u8, source)
> +		__field(u8, component)
> +		__field(u64, pa)
> +		__field(u8, pa_mask_lsb)
> +		__field_struct(struct cper_ccix_mem_err_compact, data)
> +		__field(u16, vendor_data_length)
> +		__dynamic_array(u8, vendor_data, ven_len)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->err_seq = err_seq;
> +		__entry->sev = sev;
> +		__entry->sevdetail =
> +			FIELD_GET(CCIX_PER_LOG_DW1_SEV_UE_M |
> +				  CCIX_PER_LOG_DW1_SEV_NO_COMM_M |
> +				  CCIX_PER_LOG_DW1_SEV_DEGRADED_M |
> +				  CCIX_PER_LOG_DW1_SEV_DEFFERABLE_M,
> +						   mem->ccix_header[1]);
> +		if (mem->header.validation_bits & 0x1)
> +			__entry->source = mem->header.source_id;
> +		else
> +			__entry->source = ~0;
> +		__entry->component = FIELD_GET(CCIX_PER_LOG_DW1_COMP_TYPE_M,
> +					       mem->ccix_header[1]);
> +		if (mem->ccix_header[1] & CCIX_PER_LOG_DW1_ADDR_VAL_M) {
> +			__entry->pa = (u64)mem->ccix_header[2] << 32 |
> +				(mem->ccix_header[3] & 0xfffffffc);
> +			__entry->pa_mask_lsb = mem->ccix_header[4] & 0xff;
> +		} else {
> +			__entry->pa = ~0ull;
> +			__entry->pa_mask_lsb = ~0;
> +		}
> +		__entry->vendor_data_length = ven_len ? ven_len - 4 : 0;

As the following may or may not be configured in a given kernel, we need
to add some ifdef protections.  As with the extlog_mem_event the sensible option
is probably to just protect the whole tracepoint definition.

> +		cper_ccix_mem_err_pack(&mem->mem_record, &__entry->data,
> +				       __entry->vendor_data_length,
> +				       __get_dynamic_array(vendor_data));
> +	),
> +
> +	TP_printk("{%d} %s CCIX PER Memory Error in %s SevUE:%d SevNoComm:%d SevDegraded:%d SevDeferred:%d physical addr: %016llx (mask: %x) %s vendor:%s",
> +		__entry->err_seq,
> +		cper_severity_str(__entry->sev),
> +		cper_ccix_comp_type_str(__entry->component),
> +		  __entry->sevdetail & BIT(0) ? 1 : 0,
> +		  __entry->sevdetail & BIT(1) ? 1 : 0,
> +		  __entry->sevdetail & BIT(2) ? 1 : 0,
> +		  __entry->sevdetail & BIT(3) ? 1 : 0,
> +		__entry->pa,
> +		__entry->pa_mask_lsb,
> +		cper_ccix_mem_err_unpack(p, &__entry->data),
> +		__print_hex(__get_dynamic_array(vendor_data),
> +			    __entry->vendor_data_length)
> +	)
> +);
> +
>  /*
>   * memory-failure recovery action result event
>   *



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC PATCH 0/6] CCIX Protocol Error reporting
  2019-06-06 12:36 [RFC PATCH 0/6] CCIX Protocol Error reporting Jonathan Cameron
                   ` (5 preceding siblings ...)
  2019-06-06 12:36 ` [RFC PATCH 6/6] efi / ras: CCIX Agent internal " Jonathan Cameron
@ 2019-06-25 11:34 ` Jonathan Cameron
  2019-07-03  9:28 ` James Morse
  7 siblings, 0 replies; 12+ messages in thread
From: Jonathan Cameron @ 2019-06-25 11:34 UTC (permalink / raw)
  To: linux-edac, linux-acpi, linux-efi
  Cc: linuxarm, rjw, tony.luck, bp, james.morse, ard.beisheuvel,
	nariman.poushin

On Thu, 6 Jun 2019 20:36:48 +0800
Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:

Hi All,

I'm looking for some reviews on this series if anyone has time to take
a look.  Rasdaemon patches to match with this are on linux-edac but
are waiting on the tracepoints merging.

I'm not currently planning to upstream the qemu injection patches
used to test this but anyone would like those I can certainly put
a public branch up somewhere.

Thanks,

Jonathan

> UEFI 2.8 defines a new CPER record Appendix N for CCIX Protocol Error Records
> (PER). www.uefi.org
> 
> These include Protocol Error Record logs which are defined in the
> CCIX 1.0 Base Specification www.ccixconsortium.com.
> 
> Handling of coherency protocol errors is complex and how Linux does this
> will take some time to evolve.  For now, fatal errors are handled via the
> usual means and everything else is reported.
> 
> There are 6 types of error defined, covering:
> * Memory errors
> * Cache errors
> * Address translation unit errors
> * CCIX port errors 
> * CCIX link errors
> * Agent internal errors.
> 
> The set includes tracepoints to report the errors to RAS Daemon and a patch
> set for RAS Daemon will follow shortly.
> 
> There are several open questions for this RFC.
> 1. Reporting of vendor data.  We have little choice but to do this via a
>    dynamic array as these blocks can take arbitrary size. I had hoped
>    no one would actually use these given the odd mismatch between a
>    standard error structure and non standard element, but there are
>    already designs out there that do use it.
> 2. The trade off between explicit tracepoint fields, on which we might
>    want to filter, and the simplicity of a blob. I have gone for having
>    the whole of the block specific to the PER error type in an opaque blob.
>    Perhaps this is not the right balance?
> 3. Whether defining 6 new tracepoints is sensible. I think it is:
>    * They are all defined by the CCIX specification as independant error
>      classes.
>    * Many of them can only be generated by particular types of agent.
>    * The handling required will vary widely depending on types.
>      In the kernel some map cleanly onto existing handling. Keeping the
>      whole flow separate will aide this. They vary by a similar amount
>      in scope to the RAS errors found on an existing system which have
>      independent tracepoints.
>    * Separating them out allows for filtering on the tracepoints by
>      elements that are not shared between them.
>    * Muxing the lot into one record type can lead to ugly code both in
>      kernel and in userspace.
> 
> Rasdaemon patches will follow shortly.
> 
> This patch is being distributed by the CCIX Consortium, Inc. (CCIX) to
> you and other parties that are paticipating (the "participants") in the
> Linux kernel with the understanding that the participants will use CCIX's
> name and trademark only when this patch is used in association with the
> Linux kernel and associated user space.
> 
> CCIX is also distributing this patch to these participants with the
> understanding that if any portion of the CCIX specification will be
> used or referenced in the Linux kernel, the participants will not modify
> the cited portion of the CCIX specification and will give CCIX propery
> copyright attribution by including the following copyright notice with
> the cited part of the CCIX specification:
> "© 2019 CCIX CONSORTIUM, INC. ALL RIGHTS RESERVED."
> 
> Jonathan Cameron (6):
>   efi / ras: CCIX Memory error reporting
>   efi / ras: CCIX Cache error reporting
>   efi / ras: CCIX Address Translation Cache error reporting
>   efi / ras: CCIX Port error reporting
>   efi / ras: CCIX Link error reporting
>   efi / ras: CCIX Agent internal error reporting
> 
>  drivers/acpi/apei/Kconfig        |   8 +
>  drivers/acpi/apei/ghes.c         |  59 ++
>  drivers/firmware/efi/Kconfig     |   5 +
>  drivers/firmware/efi/Makefile    |   1 +
>  drivers/firmware/efi/cper-ccix.c | 916 +++++++++++++++++++++++++++++++
>  drivers/firmware/efi/cper.c      |   6 +
>  include/linux/cper.h             | 333 +++++++++++
>  include/ras/ras_event.h          | 405 ++++++++++++++
>  8 files changed, 1733 insertions(+)
>  create mode 100644 drivers/firmware/efi/cper-ccix.c
> 



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC PATCH 0/6] CCIX Protocol Error reporting
  2019-06-06 12:36 [RFC PATCH 0/6] CCIX Protocol Error reporting Jonathan Cameron
                   ` (6 preceding siblings ...)
  2019-06-25 11:34 ` [RFC PATCH 0/6] CCIX Protocol Error reporting Jonathan Cameron
@ 2019-07-03  9:28 ` James Morse
  2019-07-03 13:08   ` Jonathan Cameron
  7 siblings, 1 reply; 12+ messages in thread
From: James Morse @ 2019-07-03  9:28 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: linux-edac, linux-acpi, linux-efi, linuxarm, rjw, tony.luck, bp,
	ard.beisheuvel, nariman.poushin

Hi Jonathan,

(Beware: this CCIX thing is new to me, some of this will be questions on the scope of CCIX)

On 06/06/2019 13:36, Jonathan Cameron wrote:
> UEFI 2.8 defines a new CPER record Appendix N for CCIX Protocol Error Records
> (PER). www.uefi.org

> These include Protocol Error Record logs which are defined in the
> CCIX 1.0 Base Specification www.ccixconsortium.com.

I can't find a public version of this spec, and I don't have an account to look at the
private one. (It looks like I could get one but I'm guessing there is an NDA between me
and the document text!)

Will it ever be public?


> Handling of coherency protocol errors is complex and how Linux does this
> will take some time to evolve.  For now, fatal errors are handled via the
> usual means and everything else is reported.

> There are 6 types of error defined, covering:
> * Memory errors

How does this interact with the vanilla Memory-Error CPER records? Do we always get them
as a pair? (that struct definition looks familiar...)

Is memory described like this never present in the UEFI memory map? (so we never
accidentally use it as host memory, for processes etc)

~

include/linux/memremap.h has:
|  * Device memory that is cache coherent from device and CPU point of view. This
|  * is use on platform that have an advance system bus (like CAPI or CCIX). A
|  * driver can hotplug the device memory using ZONE_DEVICE and with that memory
|  * type. Any page of a process can be migrated to such memory. However no one
|  * should be allow to pin such memory so that it can always be evicted.

Is this the same CCIX?

If process memory can be migrated onto this stuff we need to kick off memory_failure() in
response to memory errors, otherwise we don't mark the pages as poison, signal user-space etc.


> * Cache errors
> * Address translation unit errors

> * CCIX port errors 
> * CCIX link errors

It looks like this stuff operates 'over' PCIe[0]. How does this interact with AER?
Will we always get a vanilla PCIE-AER CPER in addition to these two?

(I see 'CHECK THE AER EQUIVALENT' comments in your series, are these TODO:?)


> * Agent internal errors.
> 
> The set includes tracepoints to report the errors to RAS Daemon and a patch
> set for RAS Daemon will follow shortly.
> 
> There are several open questions for this RFC.
> 1. Reporting of vendor data.  We have little choice but to do this via a
>    dynamic array as these blocks can take arbitrary size. I had hoped
>    no one would actually use these given the odd mismatch between a
>    standard error structure and non standard element, but there are
>    already designs out there that do use it.

I think its okay to spit these blobs out of the trace points, but could we avoid printing
them in the kernel log?


> 2. The trade off between explicit tracepoint fields, on which we might
>    want to filter, and the simplicity of a blob. I have gone for having
>    the whole of the block specific to the PER error type in an opaque blob.
>    Perhaps this is not the right balance?

(I suspect I don't understand this).

The filtering can be done by user-space. Isn't that enough?

I see the memory-error event format file has 'pa', which is all anyone is likely to care
about.
Do 'source/component' indicate the device? (what do I pull out of the machine to stop
these happening?)


> 3. Whether defining 6 new tracepoints is sensible. I think it is:
>    * They are all defined by the CCIX specification as independant error
>      classes.
>    * Many of them can only be generated by particular types of agent.
>    * The handling required will vary widely depending on types.
>      In the kernel some map cleanly onto existing handling. Keeping the
>      whole flow separate will aide this. They vary by a similar amount
>      in scope to the RAS errors found on an existing system which have
>      independent tracepoints.
>    * Separating them out allows for filtering on the tracepoints by
>      elements that are not shared between them.
>    * Muxing the lot into one record type can lead to ugly code both in
>      kernel and in userspace.

Sold!


> This patch is being distributed by the CCIX Consortium, Inc. (CCIX) to
> you and other parties that are paticipating (the "participants") in the
> Linux kernel with the understanding that the participants will use CCIX's
> name and trademark only when this patch is used in association with the
> Linux kernel and associated user space.
> 
> CCIX is also distributing this patch to these participants with the
> understanding that if any portion of the CCIX specification will be
> used or referenced in the Linux kernel, the participants will not modify
> the cited portion of the CCIX specification and will give CCIX propery
> copyright attribution by including the following copyright notice with
> the cited part of the CCIX specification:
> "© 2019 CCIX CONSORTIUM, INC. ALL RIGHTS RESERVED."

Is this text permission from the CCIX-Consortium to reproduce bits of their spec in the
kernel?

I'm guessing any one who can hold of the spec is also prevented from publishing it.

(Nits: participating, proper)


Thanks,

James

[0] https://en.wikichip.org/wiki/ccix

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC PATCH 0/6] CCIX Protocol Error reporting
  2019-07-03  9:28 ` James Morse
@ 2019-07-03 13:08   ` Jonathan Cameron
  2019-08-06 11:14     ` Jonathan Cameron
  0 siblings, 1 reply; 12+ messages in thread
From: Jonathan Cameron @ 2019-07-03 13:08 UTC (permalink / raw)
  To: James Morse
  Cc: linux-edac, linux-acpi, linux-efi, linuxarm, rjw, tony.luck, bp,
	ard.beisheuvel, nariman.poushin, jcm, Charles Garcia-Tobin

On Wed, 3 Jul 2019 10:28:08 +0100
James Morse <james.morse@arm.com> wrote:

> Hi Jonathan,
> 
> (Beware: this CCIX thing is new to me, some of this will be questions on the scope of CCIX)

Sure and welcome to our crazy world.

> 
> On 06/06/2019 13:36, Jonathan Cameron wrote:
> > UEFI 2.8 defines a new CPER record Appendix N for CCIX Protocol Error Records
> > (PER). www.uefi.org  
> 
> > These include Protocol Error Record logs which are defined in the
> > CCIX 1.0 Base Specification www.ccixconsortium.com.  
> 
> I can't find a public version of this spec, and I don't have an account to look at the
> private one. (It looks like I could get one but I'm guessing there is an NDA between me
> and the document text!)
> 
> Will it ever be public?

+cc A few people who might be able to help.

> 
> 
> > Handling of coherency protocol errors is complex and how Linux does this
> > will take some time to evolve.  For now, fatal errors are handled via the
> > usual means and everything else is reported.  
> 
> > There are 6 types of error defined, covering:
> > * Memory errors  
> 
> How does this interact with the vanilla Memory-Error CPER records? Do we always get them
> as a pair? (that struct definition looks familiar...)

The snag is that the standard memory error doesn't provide quite enough information to
actually identify a component once you have buses involved.
The actual memory specific bit is nearly identical, but the header section isn't
as it gives the CCIX agent ID (see below given you asked a more specific question
on that).

We did discuss this in a lot of detail and currently the answer is that the
UEFI spec + ccix allows you to report only the ccix error if you want to.

Note that I believe there is an OSC bit that I'm not currently setting so
right now no firmware compliant with the CCIX SW Guide (information available
from your local Charles) should actually be generating these anyway because
the OS hasn't said it can handle them.  I took the view I wanted to get
round 2 which actually hooks up handling rather than just reporting before
setting that bit.

> Is memory described like this never present in the UEFI memory map? (so we never
> accidentally use it as host memory, for processes etc)
Yes it definitely can.  If you want to you can have all your RAM attached by CCIX.
(probably not a good idea performance wise :)

More likely in the short term is SCM using something like the hmem patches
to hotplug it as if it were RAM.
https://patchwork.kernel.org/cover/10982677/

In order to support legacy OS, it wouldn't be unusual to have a bios switch
to make SCM on CCIX just appear as normal memory in the first place
(if that was true though it is highly unlikely you'd use the CCIX
PER messages.)

> 
> ~
> 
> include/linux/memremap.h has:
> |  * Device memory that is cache coherent from device and CPU point of view. This
> |  * is use on platform that have an advance system bus (like CAPI or CCIX). A
> |  * driver can hotplug the device memory using ZONE_DEVICE and with that memory
> |  * type. Any page of a process can be migrated to such memory. However no one
> |  * should be allow to pin such memory so that it can always be evicted.
> 
> Is this the same CCIX?

Not always, but sometimes.  Typically if it's on an accelerator (GPU etc) then
one software model is to do that.  There are others though. Until we have
more devices out there, its not even clear that model with dominate.

It's a much argued about topic. My personal thoughts are you either have:
1. An accelerator that was designed from the ground up assuming that it had
   tight control of its own memory layout etc - someone bolted on coherency
   later.  This is the current GPU model and fits with ZONE_DEVICE etc. GPUs
   can and do require shuffling memory.

2. An accelerator designed to run without it's own memory.  We just decided
   to move some of the host memory to it's end of the bus for size or
   efficiency reasons.  As it's reasonably happy with fragmentation and
   the assumption that someone else is doing it's memory management for it
   (the OS) you can run in an SVM type mode, with a bit of hinting by
   a userspace app on where it would prefer the accelerator accessed memory
   comes from.  Those have no more constraints than any other memory.  If
   you want to pin for RDMA for example, its fine.

More importantly, it can just be normal DDR on a card with no extra functions.
In those cases it's just a memory only NUMA node.

> 
> If process memory can be migrated onto this stuff we need to kick off memory_failure() in
> response to memory errors, otherwise we don't mark the pages as poison, signal user-space etc.

Agreed.  That was in the follow up patch set (which I haven't written yet)
that actually does some error handling.  I need to do some work on the
emulation side to actually be able to generate sensible errors of that type.

I'll stick some mention of that in the v2 cover letter.
> 
> 
> > * Cache errors
> > * Address translation unit errors  
> 
> > * CCIX port errors 
> > * CCIX link errors  
> 
> It looks like this stuff operates 'over' PCIe[0]. How does this interact with AER?
> Will we always get a vanilla PCIE-AER CPER in addition to these two?

Nope. You'll get an AER for PCIe protocol, transport and data link layers and
PER error for CCIX protocol layer.  We'll ignore the horrible abuse of
Internal Error in AER that people do (pointing no fingers but the spec makes
it clear what this is for and it's not what most people use it for).

So depends on what went wrong.  Some types of error (ripping out a board perhaps)
might cause a cascade but in general they are independent systems.  For extra
fun they aren't even routed over the same physical links in some topologies.
CCIX per is CCIX ID routed (and we have meshes), AER has to follow the PCIe
tree.  In theory the CCIX Error Agent doesn't actually to be in the host,
and it certainly doesn't have to be via the same root port.

> 
> (I see 'CHECK THE AER EQUIVALENT' comments in your series, are these TODO:?)
Oops.  That was a formatting question that I'd forgotten about.

Both PCIe AER and CCIX PER have error cases in which part of the TLP
(at different layers) is captured in the error record.  I just wanted
to sanity check if I'd used similar formatting.

In some CCIX errors you get 32 bytes of the message that was in flight
when the problem occurred.

I'll actually check the formatting and scrub the comment or adjust it for v2.
Sorry about that - these were kicking around internally for far too long
so they have gone in out of my head several times.

>
> 
> > * Agent internal errors.
> > 
> > The set includes tracepoints to report the errors to RAS Daemon and a patch
> > set for RAS Daemon will follow shortly.
> > 
> > There are several open questions for this RFC.
> > 1. Reporting of vendor data.  We have little choice but to do this via a
> >    dynamic array as these blocks can take arbitrary size. I had hoped
> >    no one would actually use these given the odd mismatch between a
> >    standard error structure and non standard element, but there are
> >    already designs out there that do use it.  
> 
> I think its okay to spit these blobs out of the trace points, but could we avoid printing
> them in the kernel log?

Seems like a reasonable compromise.  Leave them in the tp_printk
and if anyone really wants to fill their kernel logs then they can.

> 
> 
> > 2. The trade off between explicit tracepoint fields, on which we might
> >    want to filter, and the simplicity of a blob. I have gone for having
> >    the whole of the block specific to the PER error type in an opaque blob.
> >    Perhaps this is not the right balance?  
> 
> (I suspect I don't understand this).
> 
> The filtering can be done by user-space. Isn't that enough?

It can, but then you have already logged them.  As I understand it you can
apply the filters at the logging stage. Will take another look at this.
Still I'm happy with the current balance, and there it'll be done in
userspace anyway.

> 
> I see the memory-error event format file has 'pa', which is all anyone is likely to care
> about.
> Do 'source/component' indicate the device? (what do I pull out of the machine to stop
> these happening?)

The answer to that is a qualified yes.  If you know the source ID and
the configuration then it gets you to the actual device.
For the various 'agents' Request Agent, Home Agent and Slave Agent,
the sourceID is the globally unique (per type of agent) ID assigned during
the topology configuration bit of firmware enumeration (it ends up as a field
you can read from PCI config space).  For other things sourceID is the device ID
which is assigned at an earlier phase of enumeration.  You may then need the
portID to figure out exactly where it's broken (those are fixed for a given
device).

Ultimately you use these to find out what the BDF is and hopefully figure out
which board to pull and throw out of the window.

> 
> 
> > 3. Whether defining 6 new tracepoints is sensible. I think it is:
> >    * They are all defined by the CCIX specification as independant error
> >      classes.
> >    * Many of them can only be generated by particular types of agent.
> >    * The handling required will vary widely depending on types.
> >      In the kernel some map cleanly onto existing handling. Keeping the
> >      whole flow separate will aide this. They vary by a similar amount
> >      in scope to the RAS errors found on an existing system which have
> >      independent tracepoints.
> >    * Separating them out allows for filtering on the tracepoints by
> >      elements that are not shared between them.
> >    * Muxing the lot into one record type can lead to ugly code both in
> >      kernel and in userspace.  
> 
> Sold!
> 
> 
> > This patch is being distributed by the CCIX Consortium, Inc. (CCIX) to
> > you and other parties that are paticipating (the "participants") in the
> > Linux kernel with the understanding that the participants will use CCIX's
> > name and trademark only when this patch is used in association with the
> > Linux kernel and associated user space.
> > 
> > CCIX is also distributing this patch to these participants with the
> > understanding that if any portion of the CCIX specification will be
> > used or referenced in the Linux kernel, the participants will not modify
> > the cited portion of the CCIX specification and will give CCIX propery
> > copyright attribution by including the following copyright notice with
> > the cited part of the CCIX specification:
> > "© 2019 CCIX CONSORTIUM, INC. ALL RIGHTS RESERVED."  
> 
> Is this text permission from the CCIX-Consortium to reproduce bits of their spec in the
> kernel?
<standard I'm not a lawyer disclaimer>

Hmm. The answer to that is 'sort of'.  It technically says anything in this
patch is fine if it used that statement.  I believe it is kind of a catch all
that just says quote under fair use terms, but as normal give clear copyright
acknowledgment.  Some of the answers that others are hopefully going to give
to earlier questions should clarify why this matters.

The main point of the whole statement being there at all is to give an explicit
trademark grant to the kernel so that we can use CCIX(tm) without worrying about it.
If you want to get scared, google the PCI-SIG(tm) trademark policy and think where
we might be pushing the limits.

We asked the CCIX legal advisors to come up with a clear statement
that meant open source projects could get in trouble for sensible trademark use.
This grant is apparently the best way to do it.
 
> 
> I'm guessing any one who can hold of the spec is also prevented from publishing it.
> 
> (Nits: participating, proper)
Yup. I copy typed this from my windows laptop to the linux machine because
I was too lazy to navigate Huawei security zones and did a really bad job.
Sorry Frank! (hope Frank never see this)

> 
> 
> Thanks,
> 
> James
> 
> [0] https://en.wikichip.org/wiki/ccix



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC PATCH 0/6] CCIX Protocol Error reporting
  2019-07-03 13:08   ` Jonathan Cameron
@ 2019-08-06 11:14     ` Jonathan Cameron
  0 siblings, 0 replies; 12+ messages in thread
From: Jonathan Cameron @ 2019-08-06 11:14 UTC (permalink / raw)
  To: James Morse
  Cc: linux-edac, linux-acpi, linux-efi, linuxarm, rjw, tony.luck, bp,
	ard.beisheuvel, nariman.poushin, jcm, Charles Garcia-Tobin

On Wed, 3 Jul 2019 14:08:49 +0100
Jonathan Cameron <jonathan.cameron@huawei.com> wrote:

> On Wed, 3 Jul 2019 10:28:08 +0100
> James Morse <james.morse@arm.com> wrote:
> 
> > Hi Jonathan,
> > 
> > (Beware: this CCIX thing is new to me, some of this will be questions on the scope of CCIX)  
> 
> Sure and welcome to our crazy world.
> 
> > 
> > On 06/06/2019 13:36, Jonathan Cameron wrote:  
> > > UEFI 2.8 defines a new CPER record Appendix N for CCIX Protocol Error Records
> > > (PER). www.uefi.org    
> >   
> > > These include Protocol Error Record logs which are defined in the
> > > CCIX 1.0 Base Specification www.ccixconsortium.com.    
> > 
> > I can't find a public version of this spec, and I don't have an account to look at the
> > private one. (It looks like I could get one but I'm guessing there is an NDA between me
> > and the document text!)
> > 
> > Will it ever be public?  
> 
> +cc A few people who might be able to help.

Evaluation version of the CCIX 1.0a base specification now available,
(though there is a form to complete and license agreement)..

https://www.ccixconsortium.com/ccix-library/download-form/


> 
> > 
> >   
> > > Handling of coherency protocol errors is complex and how Linux does this
> > > will take some time to evolve.  For now, fatal errors are handled via the
> > > usual means and everything else is reported.    
> >   
> > > There are 6 types of error defined, covering:
> > > * Memory errors    
> > 
> > How does this interact with the vanilla Memory-Error CPER records? Do we always get them
> > as a pair? (that struct definition looks familiar...)  
> 
> The snag is that the standard memory error doesn't provide quite enough information to
> actually identify a component once you have buses involved.
> The actual memory specific bit is nearly identical, but the header section isn't
> as it gives the CCIX agent ID (see below given you asked a more specific question
> on that).
> 
> We did discuss this in a lot of detail and currently the answer is that the
> UEFI spec + ccix allows you to report only the ccix error if you want to.
> 
> Note that I believe there is an OSC bit that I'm not currently setting so
> right now no firmware compliant with the CCIX SW Guide (information available
> from your local Charles) should actually be generating these anyway because
> the OS hasn't said it can handle them.  I took the view I wanted to get
> round 2 which actually hooks up handling rather than just reporting before
> setting that bit.
> 
> > Is memory described like this never present in the UEFI memory map? (so we never
> > accidentally use it as host memory, for processes etc)  
> Yes it definitely can.  If you want to you can have all your RAM attached by CCIX.
> (probably not a good idea performance wise :)
> 
> More likely in the short term is SCM using something like the hmem patches
> to hotplug it as if it were RAM.
> https://patchwork.kernel.org/cover/10982677/
> 
> In order to support legacy OS, it wouldn't be unusual to have a bios switch
> to make SCM on CCIX just appear as normal memory in the first place
> (if that was true though it is highly unlikely you'd use the CCIX
> PER messages.)
> 
> > 
> > ~
> > 
> > include/linux/memremap.h has:
> > |  * Device memory that is cache coherent from device and CPU point of view. This
> > |  * is use on platform that have an advance system bus (like CAPI or CCIX). A
> > |  * driver can hotplug the device memory using ZONE_DEVICE and with that memory
> > |  * type. Any page of a process can be migrated to such memory. However no one
> > |  * should be allow to pin such memory so that it can always be evicted.
> > 
> > Is this the same CCIX?  
> 
> Not always, but sometimes.  Typically if it's on an accelerator (GPU etc) then
> one software model is to do that.  There are others though. Until we have
> more devices out there, its not even clear that model with dominate.
> 
> It's a much argued about topic. My personal thoughts are you either have:
> 1. An accelerator that was designed from the ground up assuming that it had
>    tight control of its own memory layout etc - someone bolted on coherency
>    later.  This is the current GPU model and fits with ZONE_DEVICE etc. GPUs
>    can and do require shuffling memory.
> 
> 2. An accelerator designed to run without it's own memory.  We just decided
>    to move some of the host memory to it's end of the bus for size or
>    efficiency reasons.  As it's reasonably happy with fragmentation and
>    the assumption that someone else is doing it's memory management for it
>    (the OS) you can run in an SVM type mode, with a bit of hinting by
>    a userspace app on where it would prefer the accelerator accessed memory
>    comes from.  Those have no more constraints than any other memory.  If
>    you want to pin for RDMA for example, its fine.
> 
> More importantly, it can just be normal DDR on a card with no extra functions.
> In those cases it's just a memory only NUMA node.
> 
> > 
> > If process memory can be migrated onto this stuff we need to kick off memory_failure() in
> > response to memory errors, otherwise we don't mark the pages as poison, signal user-space etc.  
> 
> Agreed.  That was in the follow up patch set (which I haven't written yet)
> that actually does some error handling.  I need to do some work on the
> emulation side to actually be able to generate sensible errors of that type.
> 
> I'll stick some mention of that in the v2 cover letter.
> > 
> >   
> > > * Cache errors
> > > * Address translation unit errors    
> >   
> > > * CCIX port errors 
> > > * CCIX link errors    
> > 
> > It looks like this stuff operates 'over' PCIe[0]. How does this interact with AER?
> > Will we always get a vanilla PCIE-AER CPER in addition to these two?  
> 
> Nope. You'll get an AER for PCIe protocol, transport and data link layers and
> PER error for CCIX protocol layer.  We'll ignore the horrible abuse of
> Internal Error in AER that people do (pointing no fingers but the spec makes
> it clear what this is for and it's not what most people use it for).
> 
> So depends on what went wrong.  Some types of error (ripping out a board perhaps)
> might cause a cascade but in general they are independent systems.  For extra
> fun they aren't even routed over the same physical links in some topologies.
> CCIX per is CCIX ID routed (and we have meshes), AER has to follow the PCIe
> tree.  In theory the CCIX Error Agent doesn't actually to be in the host,
> and it certainly doesn't have to be via the same root port.
> 
> > 
> > (I see 'CHECK THE AER EQUIVALENT' comments in your series, are these TODO:?)  
> Oops.  That was a formatting question that I'd forgotten about.
> 
> Both PCIe AER and CCIX PER have error cases in which part of the TLP
> (at different layers) is captured in the error record.  I just wanted
> to sanity check if I'd used similar formatting.
> 
> In some CCIX errors you get 32 bytes of the message that was in flight
> when the problem occurred.
> 
> I'll actually check the formatting and scrub the comment or adjust it for v2.
> Sorry about that - these were kicking around internally for far too long
> so they have gone in out of my head several times.
> 
> >
> >   
> > > * Agent internal errors.
> > > 
> > > The set includes tracepoints to report the errors to RAS Daemon and a patch
> > > set for RAS Daemon will follow shortly.
> > > 
> > > There are several open questions for this RFC.
> > > 1. Reporting of vendor data.  We have little choice but to do this via a
> > >    dynamic array as these blocks can take arbitrary size. I had hoped
> > >    no one would actually use these given the odd mismatch between a
> > >    standard error structure and non standard element, but there are
> > >    already designs out there that do use it.    
> > 
> > I think its okay to spit these blobs out of the trace points, but could we avoid printing
> > them in the kernel log?  
> 
> Seems like a reasonable compromise.  Leave them in the tp_printk
> and if anyone really wants to fill their kernel logs then they can.
> 
> > 
> >   
> > > 2. The trade off between explicit tracepoint fields, on which we might
> > >    want to filter, and the simplicity of a blob. I have gone for having
> > >    the whole of the block specific to the PER error type in an opaque blob.
> > >    Perhaps this is not the right balance?    
> > 
> > (I suspect I don't understand this).
> > 
> > The filtering can be done by user-space. Isn't that enough?  
> 
> It can, but then you have already logged them.  As I understand it you can
> apply the filters at the logging stage. Will take another look at this.
> Still I'm happy with the current balance, and there it'll be done in
> userspace anyway.
> 
> > 
> > I see the memory-error event format file has 'pa', which is all anyone is likely to care
> > about.
> > Do 'source/component' indicate the device? (what do I pull out of the machine to stop
> > these happening?)  
> 
> The answer to that is a qualified yes.  If you know the source ID and
> the configuration then it gets you to the actual device.
> For the various 'agents' Request Agent, Home Agent and Slave Agent,
> the sourceID is the globally unique (per type of agent) ID assigned during
> the topology configuration bit of firmware enumeration (it ends up as a field
> you can read from PCI config space).  For other things sourceID is the device ID
> which is assigned at an earlier phase of enumeration.  You may then need the
> portID to figure out exactly where it's broken (those are fixed for a given
> device).
> 
> Ultimately you use these to find out what the BDF is and hopefully figure out
> which board to pull and throw out of the window.
> 
> > 
> >   
> > > 3. Whether defining 6 new tracepoints is sensible. I think it is:
> > >    * They are all defined by the CCIX specification as independant error
> > >      classes.
> > >    * Many of them can only be generated by particular types of agent.
> > >    * The handling required will vary widely depending on types.
> > >      In the kernel some map cleanly onto existing handling. Keeping the
> > >      whole flow separate will aide this. They vary by a similar amount
> > >      in scope to the RAS errors found on an existing system which have
> > >      independent tracepoints.
> > >    * Separating them out allows for filtering on the tracepoints by
> > >      elements that are not shared between them.
> > >    * Muxing the lot into one record type can lead to ugly code both in
> > >      kernel and in userspace.    
> > 
> > Sold!
> > 
> >   
> > > This patch is being distributed by the CCIX Consortium, Inc. (CCIX) to
> > > you and other parties that are paticipating (the "participants") in the
> > > Linux kernel with the understanding that the participants will use CCIX's
> > > name and trademark only when this patch is used in association with the
> > > Linux kernel and associated user space.
> > > 
> > > CCIX is also distributing this patch to these participants with the
> > > understanding that if any portion of the CCIX specification will be
> > > used or referenced in the Linux kernel, the participants will not modify
> > > the cited portion of the CCIX specification and will give CCIX propery
> > > copyright attribution by including the following copyright notice with
> > > the cited part of the CCIX specification:
> > > "© 2019 CCIX CONSORTIUM, INC. ALL RIGHTS RESERVED."    
> > 
> > Is this text permission from the CCIX-Consortium to reproduce bits of their spec in the
> > kernel?  
> <standard I'm not a lawyer disclaimer>
> 
> Hmm. The answer to that is 'sort of'.  It technically says anything in this
> patch is fine if it used that statement.  I believe it is kind of a catch all
> that just says quote under fair use terms, but as normal give clear copyright
> acknowledgment.  Some of the answers that others are hopefully going to give
> to earlier questions should clarify why this matters.
> 
> The main point of the whole statement being there at all is to give an explicit
> trademark grant to the kernel so that we can use CCIX(tm) without worrying about it.
> If you want to get scared, google the PCI-SIG(tm) trademark policy and think where
> we might be pushing the limits.
> 
> We asked the CCIX legal advisors to come up with a clear statement
> that meant open source projects could get in trouble for sensible trademark use.
> This grant is apparently the best way to do it.
>  
> > 
> > I'm guessing any one who can hold of the spec is also prevented from publishing it.
> > 
> > (Nits: participating, proper)  
> Yup. I copy typed this from my windows laptop to the linux machine because
> I was too lazy to navigate Huawei security zones and did a really bad job.
> Sorry Frank! (hope Frank never see this)
> 
> > 
> > 
> > Thanks,
> > 
> > James
> > 
> > [0] https://en.wikichip.org/wiki/ccix  
> 
> 



^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2019-08-06 11:15 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-06-06 12:36 [RFC PATCH 0/6] CCIX Protocol Error reporting Jonathan Cameron
2019-06-06 12:36 ` [RFC PATCH 1/6] efi / ras: CCIX Memory error reporting Jonathan Cameron
2019-06-21 17:40   ` Jonathan Cameron
2019-06-06 12:36 ` [RFC PATCH 2/6] efi / ras: CCIX Cache " Jonathan Cameron
2019-06-06 12:36 ` [RFC PATCH 3/6] efi / ras: CCIX Address Translation " Jonathan Cameron
2019-06-06 12:36 ` [RFC PATCH 4/6] efi / ras: CCIX Port " Jonathan Cameron
2019-06-06 12:36 ` [RFC PATCH 5/6] efi / ras: CCIX Link " Jonathan Cameron
2019-06-06 12:36 ` [RFC PATCH 6/6] efi / ras: CCIX Agent internal " Jonathan Cameron
2019-06-25 11:34 ` [RFC PATCH 0/6] CCIX Protocol Error reporting Jonathan Cameron
2019-07-03  9:28 ` James Morse
2019-07-03 13:08   ` Jonathan Cameron
2019-08-06 11:14     ` Jonathan Cameron

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).