linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v2 0/8] Decode IA32/X64 CPER
@ 2018-02-26 19:38 Yazen Ghannam
  2018-02-26 19:38 ` [PATCH v2 1/8] efi: Fix IA32/X64 Processor Error Record definition Yazen Ghannam
                   ` (8 more replies)
  0 siblings, 9 replies; 41+ messages in thread
From: Yazen Ghannam @ 2018-02-26 19:38 UTC (permalink / raw)
  To: linux-efi; +Cc: Yazen Ghannam, linux-kernel, ard.biesheuvel, bp, x86

From: Yazen Ghannam <yazen.ghannam@amd.com>

This series adds decoding for the IA32/X64 Common Platform Error Record.

Patch 1 fixes the IA32/X64 Processor Error Section definition to match
the UEFI spec.

Patches 2-8 add the new decoding. The patches incrementally add the
decoding starting from the top-level "Error Section". Hopefully, this
will make reviewing a bit easier compared to one large patch.

The formatting of the field names and options is taken from the UEFI
spec. I tried to keep everything the same to make searching easier.

The patches were written to the UEFI 2.7 spec though the definition of
the IA32/X64 CPER seems to be the same as when it was introduced in
the UEFI 2.1 spec.

Link:
https://lkml.kernel.org/r/20180223200333.6410-1-Yazen.Ghannam@amd.com

Changes V1 to V2:
* Remove stable request for all patches.
* Address Ard's comments on formatting and other issues.
* In Patch 8, always print context info even if the type is not
  recognized.

Yazen Ghannam (8):
  efi: Fix IA32/X64 Processor Error Record definition
  efi: Decode IA32/X64 Processor Error Section
  efi: Decode IA32/X64 Processor Error Info Structure
  efi: Decode UEFI-defined IA32/X64 Error Structure GUIDs
  efi: Decode IA32/X64 Cache, TLB, and Bus Check structures
  efi: Decode additional IA32/X64 Bus Check fields
  efi: Decode IA32/X64 MS Check structure
  efi: Decode IA32/X64 Context Info structure

 drivers/firmware/efi/Kconfig    |   5 +
 drivers/firmware/efi/Makefile   |   1 +
 drivers/firmware/efi/cper-x86.c | 363 ++++++++++++++++++++++++++++++++++++++++
 drivers/firmware/efi/cper.c     |  10 ++
 include/linux/cper.h            |   4 +-
 5 files changed, 382 insertions(+), 1 deletion(-)
 create mode 100644 drivers/firmware/efi/cper-x86.c

-- 
2.14.1

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH v2 1/8] efi: Fix IA32/X64 Processor Error Record definition
  2018-02-26 19:38 [PATCH v2 0/8] Decode IA32/X64 CPER Yazen Ghannam
@ 2018-02-26 19:38 ` Yazen Ghannam
  2018-02-27 10:47   ` Borislav Petkov
  2018-02-26 19:38 ` [PATCH v2 2/8] efi: Decode IA32/X64 Processor Error Section Yazen Ghannam
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 41+ messages in thread
From: Yazen Ghannam @ 2018-02-26 19:38 UTC (permalink / raw)
  To: linux-efi; +Cc: Yazen Ghannam, linux-kernel, ard.biesheuvel, bp, x86

From: Yazen Ghannam <yazen.ghannam@amd.com>

Based on UEFI 2.7 Table 255. Processor Error Record, the "Local APIC_ID"
field is 8 bytes but Linux defines this field as 1 byte.

Fix this in the struct cper_sec_proc_ia definition.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
---
Link:
https://lkml.kernel.org/r/20180223200333.6410-2-Yazen.Ghannam@amd.com

v1->v2:
* No changes.

 include/linux/cper.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/cper.h b/include/linux/cper.h
index d14ef4e77c8a..4b5f8459b403 100644
--- a/include/linux/cper.h
+++ b/include/linux/cper.h
@@ -381,7 +381,7 @@ struct cper_sec_proc_generic {
 /* IA32/X64 Processor Error Section */
 struct cper_sec_proc_ia {
 	__u64	validation_bits;
-	__u8	lapic_id;
+	__u64	lapic_id;
 	__u8	cpuid[48];
 };
 
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH v2 2/8] efi: Decode IA32/X64 Processor Error Section
  2018-02-26 19:38 [PATCH v2 0/8] Decode IA32/X64 CPER Yazen Ghannam
  2018-02-26 19:38 ` [PATCH v2 1/8] efi: Fix IA32/X64 Processor Error Record definition Yazen Ghannam
@ 2018-02-26 19:38 ` Yazen Ghannam
  2018-02-27 11:22   ` Borislav Petkov
  2018-02-26 19:38 ` [PATCH v2 3/8] efi: Decode IA32/X64 Processor Error Info Structure Yazen Ghannam
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 41+ messages in thread
From: Yazen Ghannam @ 2018-02-26 19:38 UTC (permalink / raw)
  To: linux-efi; +Cc: Yazen Ghannam, linux-kernel, ard.biesheuvel, bp, x86

From: Yazen Ghannam <yazen.ghannam@amd.com>

Recognize the IA32/X64 Processor Error Section.

Do the section decoding in a new "cper-x86.c" file and add this to the
Makefile depending on a new "UEFI_CPER_X86" config option.

Print the Local APIC ID and CPUID info from the Processor Error Record.

The "Processor Error Info" and "Processor Context" fields will be
decoded in following patches.

Based on UEFI 2.7 Table 255. Processor Error Record.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
---
Link:
https://lkml.kernel.org/r/20180223200333.6410-3-Yazen.Ghannam@amd.com

v1->v2:
* Change config option depends to "X86" instead of "X86_32 || X64_64".
* Remove extra newline in Makefile changes.
* Drop author copyright line.

 drivers/firmware/efi/Kconfig    |  5 +++++
 drivers/firmware/efi/Makefile   |  1 +
 drivers/firmware/efi/cper-x86.c | 25 +++++++++++++++++++++++++
 drivers/firmware/efi/cper.c     | 10 ++++++++++
 include/linux/cper.h            |  2 ++
 5 files changed, 43 insertions(+)
 create mode 100644 drivers/firmware/efi/cper-x86.c

diff --git a/drivers/firmware/efi/Kconfig b/drivers/firmware/efi/Kconfig
index 3098410abad8..781a4a337557 100644
--- a/drivers/firmware/efi/Kconfig
+++ b/drivers/firmware/efi/Kconfig
@@ -174,6 +174,11 @@ config UEFI_CPER_ARM
 	depends on UEFI_CPER && ( ARM || ARM64 )
 	default y
 
+config UEFI_CPER_X86
+	bool
+	depends on UEFI_CPER && X86
+	default y
+
 config EFI_DEV_PATH_PARSER
 	bool
 	depends on ACPI
diff --git a/drivers/firmware/efi/Makefile b/drivers/firmware/efi/Makefile
index cb805374f4bc..5f9f5039de50 100644
--- a/drivers/firmware/efi/Makefile
+++ b/drivers/firmware/efi/Makefile
@@ -31,3 +31,4 @@ obj-$(CONFIG_ARM)			+= $(arm-obj-y)
 obj-$(CONFIG_ARM64)			+= $(arm-obj-y)
 obj-$(CONFIG_EFI_CAPSULE_LOADER)	+= capsule-loader.o
 obj-$(CONFIG_UEFI_CPER_ARM)		+= cper-arm.o
+obj-$(CONFIG_UEFI_CPER_X86)		+= cper-x86.o
diff --git a/drivers/firmware/efi/cper-x86.c b/drivers/firmware/efi/cper-x86.c
new file mode 100644
index 000000000000..9da0d981178f
--- /dev/null
+++ b/drivers/firmware/efi/cper-x86.c
@@ -0,0 +1,25 @@
+// SPDX-License-Identifier: GPL-2.0
+// Copyright (C) 2018, Advanced Micro Devices, Inc.
+
+#include <linux/cper.h>
+
+/*
+ * We don't need a "CPER_IA" prefix since these are all locally defined.
+ * This will save us a lot of line space.
+ */
+#define VALID_LAPIC_ID			BIT_ULL(0)
+#define VALID_CPUID_INFO		BIT_ULL(1)
+
+void cper_print_proc_ia(const char *pfx, const struct cper_sec_proc_ia *proc)
+{
+	printk("%sValidation Bits: 0x%016llx\n", pfx, proc->validation_bits);
+
+	if (proc->validation_bits & VALID_LAPIC_ID)
+		printk("%sLocal APIC_ID: 0x%llx\n", pfx, proc->lapic_id);
+
+	if (proc->validation_bits & VALID_CPUID_INFO) {
+		printk("%sCPUID Info:\n", pfx);
+		print_hex_dump(pfx, "", DUMP_PREFIX_OFFSET, 16, 4, proc->cpuid,
+			       sizeof(proc->cpuid), 0);
+	}
+}
diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c
index c165933ebf38..5a59b582c9aa 100644
--- a/drivers/firmware/efi/cper.c
+++ b/drivers/firmware/efi/cper.c
@@ -469,6 +469,16 @@ cper_estatus_print_section(const char *pfx, struct acpi_hest_generic_data *gdata
 			cper_print_proc_arm(newpfx, arm_err);
 		else
 			goto err_section_too_small;
+#endif
+#if defined(CONFIG_UEFI_CPER_X86)
+	} else if (guid_equal(sec_type, &CPER_SEC_PROC_IA)) {
+		struct cper_sec_proc_ia *ia_err = acpi_hest_get_payload(gdata);
+
+		printk("%ssection_type: IA32/X64 processor error\n", newpfx);
+		if (gdata->error_data_length >= sizeof(*ia_err))
+			cper_print_proc_ia(newpfx, ia_err);
+		else
+			goto err_section_too_small;
 #endif
 	} else {
 		const void *err = acpi_hest_get_payload(gdata);
diff --git a/include/linux/cper.h b/include/linux/cper.h
index 4b5f8459b403..9c703a0abe6e 100644
--- a/include/linux/cper.h
+++ b/include/linux/cper.h
@@ -551,5 +551,7 @@ const char *cper_mem_err_unpack(struct trace_seq *,
 				struct cper_mem_err_compact *);
 void cper_print_proc_arm(const char *pfx,
 			 const struct cper_sec_proc_arm *proc);
+void cper_print_proc_ia(const char *pfx,
+			const struct cper_sec_proc_ia *proc);
 
 #endif
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH v2 3/8] efi: Decode IA32/X64 Processor Error Info Structure
  2018-02-26 19:38 [PATCH v2 0/8] Decode IA32/X64 CPER Yazen Ghannam
  2018-02-26 19:38 ` [PATCH v2 1/8] efi: Fix IA32/X64 Processor Error Record definition Yazen Ghannam
  2018-02-26 19:38 ` [PATCH v2 2/8] efi: Decode IA32/X64 Processor Error Section Yazen Ghannam
@ 2018-02-26 19:38 ` Yazen Ghannam
  2018-02-27 14:25   ` Borislav Petkov
  2018-02-26 19:39 ` [PATCH v2 4/8] efi: Decode UEFI-defined IA32/X64 Error Structure GUIDs Yazen Ghannam
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 41+ messages in thread
From: Yazen Ghannam @ 2018-02-26 19:38 UTC (permalink / raw)
  To: linux-efi; +Cc: Yazen Ghannam, linux-kernel, ard.biesheuvel, bp, x86

From: Yazen Ghannam <yazen.ghannam@amd.com>

Print the fields in the IA32/X64 Processor Error Info Structure.

Based on UEFI 2.7 Table 256. IA32/X64 Processor Error Information
Structure.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
---
Link:
https://lkml.kernel.org/r/20180223200333.6410-4-Yazen.Ghannam@amd.com

v1->v2:
* Add parantheses around "bits" expression in macro.
* Fix indentation on multi-line statements.

 drivers/firmware/efi/cper-x86.c | 53 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 53 insertions(+)

diff --git a/drivers/firmware/efi/cper-x86.c b/drivers/firmware/efi/cper-x86.c
index 9da0d981178f..417bd4e500a7 100644
--- a/drivers/firmware/efi/cper-x86.c
+++ b/drivers/firmware/efi/cper-x86.c
@@ -3,15 +3,28 @@
 
 #include <linux/cper.h>
 
+#define INDENT_SP	" "
+
 /*
  * We don't need a "CPER_IA" prefix since these are all locally defined.
  * This will save us a lot of line space.
  */
 #define VALID_LAPIC_ID			BIT_ULL(0)
 #define VALID_CPUID_INFO		BIT_ULL(1)
+#define VALID_PROC_ERR_INFO_NUM(bits)	(((bits) & GENMASK_ULL(7, 2)) >> 2)
+
+#define INFO_VALID_CHECK_INFO		BIT_ULL(0)
+#define INFO_VALID_TARGET_ID		BIT_ULL(1)
+#define INFO_VALID_REQUESTOR_ID		BIT_ULL(2)
+#define INFO_VALID_RESPONDER_ID		BIT_ULL(3)
+#define INFO_VALID_IP			BIT_ULL(4)
 
 void cper_print_proc_ia(const char *pfx, const struct cper_sec_proc_ia *proc)
 {
+	int i;
+	struct cper_ia_err_info *err_info;
+	char newpfx[64];
+
 	printk("%sValidation Bits: 0x%016llx\n", pfx, proc->validation_bits);
 
 	if (proc->validation_bits & VALID_LAPIC_ID)
@@ -22,4 +35,44 @@ void cper_print_proc_ia(const char *pfx, const struct cper_sec_proc_ia *proc)
 		print_hex_dump(pfx, "", DUMP_PREFIX_OFFSET, 16, 4, proc->cpuid,
 			       sizeof(proc->cpuid), 0);
 	}
+
+	snprintf(newpfx, sizeof(newpfx), "%s%s", pfx, INDENT_SP);
+
+	err_info = (struct cper_ia_err_info *)(proc + 1);
+	for (i = 0; i < VALID_PROC_ERR_INFO_NUM(proc->validation_bits); i++) {
+		printk("%sError Information Structure %d:\n", pfx, i);
+
+		printk("%sError Structure Type: %pUl\n", newpfx,
+		       &err_info->err_type);
+
+		printk("%sValidation Bits: 0x%016llx\n",
+		       newpfx, err_info->validation_bits);
+
+		if (err_info->validation_bits & INFO_VALID_CHECK_INFO) {
+			printk("%sCheck Information: 0x%016llx\n", newpfx,
+			       err_info->check_info);
+		}
+
+		if (err_info->validation_bits & INFO_VALID_TARGET_ID) {
+			printk("%sTarget Identifier: 0x%016llx\n",
+			       newpfx, err_info->target_id);
+		}
+
+		if (err_info->validation_bits & INFO_VALID_REQUESTOR_ID) {
+			printk("%sRequestor Identifier: 0x%016llx\n",
+			       newpfx, err_info->requestor_id);
+		}
+
+		if (err_info->validation_bits & INFO_VALID_RESPONDER_ID) {
+			printk("%sResponder Identifier: 0x%016llx\n",
+			       newpfx, err_info->responder_id);
+		}
+
+		if (err_info->validation_bits & INFO_VALID_IP) {
+			printk("%sInstruction Pointer: 0x%016llx\n",
+			       newpfx, err_info->ip);
+		}
+
+		err_info++;
+	}
 }
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH v2 4/8] efi: Decode UEFI-defined IA32/X64 Error Structure GUIDs
  2018-02-26 19:38 [PATCH v2 0/8] Decode IA32/X64 CPER Yazen Ghannam
                   ` (2 preceding siblings ...)
  2018-02-26 19:38 ` [PATCH v2 3/8] efi: Decode IA32/X64 Processor Error Info Structure Yazen Ghannam
@ 2018-02-26 19:39 ` Yazen Ghannam
  2018-02-27 14:30   ` Borislav Petkov
  2018-02-26 19:39 ` [PATCH v2 5/8] efi: Decode IA32/X64 Cache, TLB, and Bus Check structures Yazen Ghannam
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 41+ messages in thread
From: Yazen Ghannam @ 2018-02-26 19:39 UTC (permalink / raw)
  To: linux-efi; +Cc: Yazen Ghannam, linux-kernel, ard.biesheuvel, bp, x86

From: Yazen Ghannam <yazen.ghannam@amd.com>

For easier handling, match the known IA32/X64 error structure GUIDs to
enums.

Also, print out the name of the matching Error Structure Type.

GUIDs taken from UEFI 2.7 section N.2.4.2.1 IA32/X64 Processor Error
Information Structure.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
---
Link:
https://lkml.kernel.org/r/20180223200333.6410-5-Yazen.Ghannam@amd.com

v1->v2:
* Change use of enum type to u8.
* Fix indentation on multi-line statements.

 drivers/firmware/efi/cper-x86.c | 41 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 41 insertions(+)

diff --git a/drivers/firmware/efi/cper-x86.c b/drivers/firmware/efi/cper-x86.c
index 417bd4e500a7..037986ca82dd 100644
--- a/drivers/firmware/efi/cper-x86.c
+++ b/drivers/firmware/efi/cper-x86.c
@@ -13,17 +13,53 @@
 #define VALID_CPUID_INFO		BIT_ULL(1)
 #define VALID_PROC_ERR_INFO_NUM(bits)	(((bits) & GENMASK_ULL(7, 2)) >> 2)
 
+#define INFO_ERR_STRUCT_TYPE_CACHE					\
+	GUID_INIT(0xA55701F5, 0xE3EF, 0x43DE, 0xAC, 0x72, 0x24, 0x9B,	\
+		  0x57, 0x3F, 0xAD, 0x2C)
+#define INFO_ERR_STRUCT_TYPE_TLB					\
+	GUID_INIT(0xFC06B535, 0x5E1F, 0x4562, 0x9F, 0x25, 0x0A, 0x3B,	\
+		  0x9A, 0xDB, 0x63, 0xC3)
+#define INFO_ERR_STRUCT_TYPE_BUS					\
+	GUID_INIT(0x1CF3F8B3, 0xC5B1, 0x49a2, 0xAA, 0x59, 0x5E, 0xEF,	\
+		  0x92, 0xFF, 0xA6, 0x3C)
+#define INFO_ERR_STRUCT_TYPE_MS						\
+	GUID_INIT(0x48AB7F57, 0xDC34, 0x4f6c, 0xA7, 0xD3, 0xB0, 0xB5,	\
+		  0xB0, 0xA7, 0x43, 0x14)
+
 #define INFO_VALID_CHECK_INFO		BIT_ULL(0)
 #define INFO_VALID_TARGET_ID		BIT_ULL(1)
 #define INFO_VALID_REQUESTOR_ID		BIT_ULL(2)
 #define INFO_VALID_RESPONDER_ID		BIT_ULL(3)
 #define INFO_VALID_IP			BIT_ULL(4)
 
+enum err_types {
+	ERR_TYPE_CACHE = 0,
+	ERR_TYPE_TLB,
+	ERR_TYPE_BUS,
+	ERR_TYPE_MS,
+	N_ERR_TYPES
+};
+
+static enum err_types cper_get_err_type(const guid_t *err_type)
+{
+	if (guid_equal(err_type, &INFO_ERR_STRUCT_TYPE_CACHE))
+		return ERR_TYPE_CACHE;
+	else if (guid_equal(err_type, &INFO_ERR_STRUCT_TYPE_TLB))
+		return ERR_TYPE_TLB;
+	else if (guid_equal(err_type, &INFO_ERR_STRUCT_TYPE_BUS))
+		return ERR_TYPE_BUS;
+	else if (guid_equal(err_type, &INFO_ERR_STRUCT_TYPE_MS))
+		return ERR_TYPE_MS;
+	else
+		return N_ERR_TYPES;
+}
+
 void cper_print_proc_ia(const char *pfx, const struct cper_sec_proc_ia *proc)
 {
 	int i;
 	struct cper_ia_err_info *err_info;
 	char newpfx[64];
+	u8 err_type;
 
 	printk("%sValidation Bits: 0x%016llx\n", pfx, proc->validation_bits);
 
@@ -45,6 +81,11 @@ void cper_print_proc_ia(const char *pfx, const struct cper_sec_proc_ia *proc)
 		printk("%sError Structure Type: %pUl\n", newpfx,
 		       &err_info->err_type);
 
+		err_type = cper_get_err_type(&err_info->err_type);
+		printk("%sError Structure Type: %s\n", newpfx,
+		       err_type < ARRAY_SIZE(cper_proc_error_type_strs) ?
+		       cper_proc_error_type_strs[err_type] : "unknown");
+
 		printk("%sValidation Bits: 0x%016llx\n",
 		       newpfx, err_info->validation_bits);
 
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH v2 5/8] efi: Decode IA32/X64 Cache, TLB, and Bus Check structures
  2018-02-26 19:38 [PATCH v2 0/8] Decode IA32/X64 CPER Yazen Ghannam
                   ` (3 preceding siblings ...)
  2018-02-26 19:39 ` [PATCH v2 4/8] efi: Decode UEFI-defined IA32/X64 Error Structure GUIDs Yazen Ghannam
@ 2018-02-26 19:39 ` Yazen Ghannam
  2018-02-27 15:03   ` Borislav Petkov
  2018-02-26 19:39 ` [PATCH v2 6/8] efi: Decode additional IA32/X64 Bus Check fields Yazen Ghannam
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 41+ messages in thread
From: Yazen Ghannam @ 2018-02-26 19:39 UTC (permalink / raw)
  To: linux-efi; +Cc: Yazen Ghannam, linux-kernel, ard.biesheuvel, bp, x86

From: Yazen Ghannam <yazen.ghannam@amd.com>

Print the common fields of the Cache, TLB, and Bus check structures.The
fields of these three check types are the same except for a few more
fields in the Bus check structure. The remaining Bus check structure
fields will be decoded in a following patch.

Based on UEFI 2.7,
Table 257. IA32/X64 Cache Check Structure
Table 258. IA32/X64 TLB Check Structure
Table 259. IA32/X64 Bus Check Structure

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
---
Link:
https://lkml.kernel.org/r/20180223200333.6410-6-Yazen.Ghannam@amd.com

v1->v2:
* Add parantheses around "check" expression in macro.
* Change use of enum type to u8.
* Fix indentation on multi-line statements.

 drivers/firmware/efi/cper-x86.c | 101 +++++++++++++++++++++++++++++++++++++++-
 1 file changed, 100 insertions(+), 1 deletion(-)

diff --git a/drivers/firmware/efi/cper-x86.c b/drivers/firmware/efi/cper-x86.c
index 037986ca82dd..24f62e78b6c7 100644
--- a/drivers/firmware/efi/cper-x86.c
+++ b/drivers/firmware/efi/cper-x86.c
@@ -32,6 +32,25 @@
 #define INFO_VALID_RESPONDER_ID		BIT_ULL(3)
 #define INFO_VALID_IP			BIT_ULL(4)
 
+#define CHECK_VALID_TRANS_TYPE		BIT_ULL(0)
+#define CHECK_VALID_OPERATION		BIT_ULL(1)
+#define CHECK_VALID_LEVEL		BIT_ULL(2)
+#define CHECK_VALID_PCC			BIT_ULL(3)
+#define CHECK_VALID_UNCORRECTED		BIT_ULL(4)
+#define CHECK_VALID_PRECISE_IP		BIT_ULL(5)
+#define CHECK_VALID_RESTARTABLE_IP	BIT_ULL(6)
+#define CHECK_VALID_OVERFLOW		BIT_ULL(7)
+
+#define CHECK_VALID_BITS(check)		(((check) & GENMASK_ULL(15, 0)))
+#define CHECK_TRANS_TYPE(check)		(((check) & GENMASK_ULL(17, 16)) >> 16)
+#define CHECK_OPERATION(check)		(((check) & GENMASK_ULL(21, 18)) >> 18)
+#define CHECK_LEVEL(check)		(((check) & GENMASK_ULL(24, 22)) >> 22)
+#define CHECK_PCC			BIT_ULL(25)
+#define CHECK_UNCORRECTED		BIT_ULL(26)
+#define CHECK_PRECISE_IP		BIT_ULL(27)
+#define CHECK_RESTARTABLE_IP		BIT_ULL(28)
+#define CHECK_OVERFLOW			BIT_ULL(29)
+
 enum err_types {
 	ERR_TYPE_CACHE = 0,
 	ERR_TYPE_TLB,
@@ -54,11 +73,83 @@ static enum err_types cper_get_err_type(const guid_t *err_type)
 		return N_ERR_TYPES;
 }
 
+static const char * const ia_check_trans_type_strs[] = {
+	"Instruction",
+	"Data Access",
+	"Generic",
+};
+
+static const char * const ia_check_op_strs[] = {
+	"generic error",
+	"generic read",
+	"generic write",
+	"data read",
+	"data write",
+	"instruction fetch",
+	"prefetch",
+	"eviction",
+	"snoop",
+};
+
+static inline void print_bool(char *str, const char *pfx, u64 check, u64 bit)
+{
+	printk("%s%s: %s\n", pfx, str, (check & bit) ? "true" : "false");
+}
+
+static void print_err_info(const char *pfx, u8 err_type, u64 check)
+{
+	u16 validation_bits = CHECK_VALID_BITS(check);
+
+	printk("%sValidation Bits: 0x%04x\n", pfx, validation_bits);
+
+	if (err_type == ERR_TYPE_MS)
+		return;
+
+	if (validation_bits & CHECK_VALID_TRANS_TYPE) {
+		u8 trans_type = CHECK_TRANS_TYPE(check);
+
+		printk("%sTransaction Type: %u, %s\n", pfx, trans_type,
+		       trans_type < ARRAY_SIZE(ia_check_trans_type_strs) ?
+		       ia_check_trans_type_strs[trans_type] : "unknown");
+	}
+
+	if (validation_bits & CHECK_VALID_OPERATION) {
+		u8 op = CHECK_OPERATION(check);
+
+		/*
+		 * CACHE has more operation types than TLB or BUS, though the
+		 * name and the order are the same.
+		 */
+		u8 max_ops = (err_type == ERR_TYPE_CACHE) ? 9 : 7;
+
+		printk("%sOperation: %u, %s\n", pfx, op,
+		       op < max_ops ? ia_check_op_strs[op] : "unknown");
+	}
+
+	if (validation_bits & CHECK_VALID_LEVEL)
+		printk("%sLevel: %llu\n", pfx, CHECK_LEVEL(check));
+
+	if (validation_bits & CHECK_VALID_PCC)
+		print_bool("Processor Context Corrupt", pfx, check, CHECK_PCC);
+
+	if (validation_bits & CHECK_VALID_UNCORRECTED)
+		print_bool("Uncorrected", pfx, check, CHECK_UNCORRECTED);
+
+	if (validation_bits & CHECK_VALID_PRECISE_IP)
+		print_bool("Precise IP", pfx, check, CHECK_PRECISE_IP);
+
+	if (validation_bits & CHECK_VALID_RESTARTABLE_IP)
+		print_bool("Restartable IP", pfx, check, CHECK_RESTARTABLE_IP);
+
+	if (validation_bits & CHECK_VALID_OVERFLOW)
+		print_bool("Overflow", pfx, check, CHECK_OVERFLOW);
+}
+
 void cper_print_proc_ia(const char *pfx, const struct cper_sec_proc_ia *proc)
 {
 	int i;
 	struct cper_ia_err_info *err_info;
-	char newpfx[64];
+	char newpfx[64], infopfx[64];
 	u8 err_type;
 
 	printk("%sValidation Bits: 0x%016llx\n", pfx, proc->validation_bits);
@@ -92,6 +183,14 @@ void cper_print_proc_ia(const char *pfx, const struct cper_sec_proc_ia *proc)
 		if (err_info->validation_bits & INFO_VALID_CHECK_INFO) {
 			printk("%sCheck Information: 0x%016llx\n", newpfx,
 			       err_info->check_info);
+
+			if (err_type < N_ERR_TYPES) {
+				snprintf(infopfx, sizeof(infopfx), "%s%s",
+					 newpfx, INDENT_SP);
+
+				print_err_info(infopfx, err_type,
+					       err_info->check_info);
+			}
 		}
 
 		if (err_info->validation_bits & INFO_VALID_TARGET_ID) {
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH v2 6/8] efi: Decode additional IA32/X64 Bus Check fields
  2018-02-26 19:38 [PATCH v2 0/8] Decode IA32/X64 CPER Yazen Ghannam
                   ` (4 preceding siblings ...)
  2018-02-26 19:39 ` [PATCH v2 5/8] efi: Decode IA32/X64 Cache, TLB, and Bus Check structures Yazen Ghannam
@ 2018-02-26 19:39 ` Yazen Ghannam
  2018-02-26 19:39 ` [PATCH v2 7/8] efi: Decode IA32/X64 MS Check structure Yazen Ghannam
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 41+ messages in thread
From: Yazen Ghannam @ 2018-02-26 19:39 UTC (permalink / raw)
  To: linux-efi; +Cc: Yazen Ghannam, linux-kernel, ard.biesheuvel, bp, x86

From: Yazen Ghannam <yazen.ghannam@amd.com>

The "Participation Type", "Time Out", and "Address Space" fields are
unique to the IA32/X64 Bus Check structure. Print these fields.

Based on UEFI 2.7 Table 259. IA32/X64 Bus Check Structure

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
---
Link:
https://lkml.kernel.org/r/20180223200333.6410-7-Yazen.Ghannam@amd.com

v1->v2:
* Add parantheses around "check" expression in macro.
* Fix indentation on multi-line statements.

 drivers/firmware/efi/cper-x86.c | 44 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 44 insertions(+)

diff --git a/drivers/firmware/efi/cper-x86.c b/drivers/firmware/efi/cper-x86.c
index 24f62e78b6c7..c5209af3514b 100644
--- a/drivers/firmware/efi/cper-x86.c
+++ b/drivers/firmware/efi/cper-x86.c
@@ -41,6 +41,10 @@
 #define CHECK_VALID_RESTARTABLE_IP	BIT_ULL(6)
 #define CHECK_VALID_OVERFLOW		BIT_ULL(7)
 
+#define CHECK_VALID_BUS_PART_TYPE	BIT_ULL(8)
+#define CHECK_VALID_BUS_TIME_OUT	BIT_ULL(9)
+#define CHECK_VALID_BUS_ADDR_SPACE	BIT_ULL(10)
+
 #define CHECK_VALID_BITS(check)		(((check) & GENMASK_ULL(15, 0)))
 #define CHECK_TRANS_TYPE(check)		(((check) & GENMASK_ULL(17, 16)) >> 16)
 #define CHECK_OPERATION(check)		(((check) & GENMASK_ULL(21, 18)) >> 18)
@@ -51,6 +55,10 @@
 #define CHECK_RESTARTABLE_IP		BIT_ULL(28)
 #define CHECK_OVERFLOW			BIT_ULL(29)
 
+#define CHECK_BUS_PART_TYPE(check)	(((check) & GENMASK_ULL(31, 30)) >> 30)
+#define CHECK_BUS_TIME_OUT		BIT_ULL(32)
+#define CHECK_BUS_ADDR_SPACE(check)	(((check) & GENMASK_ULL(34, 33)) >> 33)
+
 enum err_types {
 	ERR_TYPE_CACHE = 0,
 	ERR_TYPE_TLB,
@@ -91,6 +99,20 @@ static const char * const ia_check_op_strs[] = {
 	"snoop",
 };
 
+static const char * const ia_check_bus_part_type_strs[] = {
+	"Local Processor originated request",
+	"Local Processor responded to request",
+	"Local Processor observed",
+	"Generic",
+};
+
+static const char * const ia_check_bus_addr_space_strs[] = {
+	"Memory Access",
+	"Reserved",
+	"I/O",
+	"Other Transaction",
+};
+
 static inline void print_bool(char *str, const char *pfx, u64 check, u64 bit)
 {
 	printk("%s%s: %s\n", pfx, str, (check & bit) ? "true" : "false");
@@ -143,6 +165,28 @@ static void print_err_info(const char *pfx, u8 err_type, u64 check)
 
 	if (validation_bits & CHECK_VALID_OVERFLOW)
 		print_bool("Overflow", pfx, check, CHECK_OVERFLOW);
+
+	if (err_type != ERR_TYPE_BUS)
+		return;
+
+	if (validation_bits & CHECK_VALID_BUS_PART_TYPE) {
+		u8 part_type = CHECK_BUS_PART_TYPE(check);
+
+		printk("%sParticipation Type: %u, %s\n", pfx, part_type,
+		       part_type < ARRAY_SIZE(ia_check_bus_part_type_strs) ?
+		       ia_check_bus_part_type_strs[part_type] : "unknown");
+	}
+
+	if (validation_bits & CHECK_VALID_BUS_TIME_OUT)
+		print_bool("Time Out", pfx, check, CHECK_BUS_TIME_OUT);
+
+	if (validation_bits & CHECK_VALID_BUS_ADDR_SPACE) {
+		u8 addr_space = CHECK_BUS_ADDR_SPACE(check);
+
+		printk("%sAddress Space: %u, %s\n", pfx, addr_space,
+		       addr_space < ARRAY_SIZE(ia_check_bus_addr_space_strs) ?
+		       ia_check_bus_addr_space_strs[addr_space] : "unknown");
+	}
 }
 
 void cper_print_proc_ia(const char *pfx, const struct cper_sec_proc_ia *proc)
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH v2 7/8] efi: Decode IA32/X64 MS Check structure
  2018-02-26 19:38 [PATCH v2 0/8] Decode IA32/X64 CPER Yazen Ghannam
                   ` (5 preceding siblings ...)
  2018-02-26 19:39 ` [PATCH v2 6/8] efi: Decode additional IA32/X64 Bus Check fields Yazen Ghannam
@ 2018-02-26 19:39 ` Yazen Ghannam
  2018-02-26 19:39 ` [PATCH v2 8/8] efi: Decode IA32/X64 Context Info structure Yazen Ghannam
  2018-02-28  8:43 ` [PATCH v2 0/8] Decode IA32/X64 CPER Borislav Petkov
  8 siblings, 0 replies; 41+ messages in thread
From: Yazen Ghannam @ 2018-02-26 19:39 UTC (permalink / raw)
  To: linux-efi; +Cc: Yazen Ghannam, linux-kernel, ard.biesheuvel, bp, x86

From: Yazen Ghannam <yazen.ghannam@amd.com>

The IA32/X64 MS Check structure varies from the other Check structures
in the the bit positions of its fields, and it includes an additional
"Error Type" field.

Decode the MS Check structure in a separate function.

Based on UEFI 2.7 Table 260. IA32/X64 MS Check Field Description.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
---
Link:
https://lkml.kernel.org/r/20180223200333.6410-8-Yazen.Ghannam@amd.com

v1->v2:
* Add parantheses around "check" expression in macro.
* Fix indentation on multi-line statements.

 drivers/firmware/efi/cper-x86.c | 55 ++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 54 insertions(+), 1 deletion(-)

diff --git a/drivers/firmware/efi/cper-x86.c b/drivers/firmware/efi/cper-x86.c
index c5209af3514b..4c79401aace0 100644
--- a/drivers/firmware/efi/cper-x86.c
+++ b/drivers/firmware/efi/cper-x86.c
@@ -59,6 +59,20 @@
 #define CHECK_BUS_TIME_OUT		BIT_ULL(32)
 #define CHECK_BUS_ADDR_SPACE(check)	(((check) & GENMASK_ULL(34, 33)) >> 33)
 
+#define CHECK_VALID_MS_ERR_TYPE		BIT_ULL(0)
+#define CHECK_VALID_MS_PCC		BIT_ULL(1)
+#define CHECK_VALID_MS_UNCORRECTED	BIT_ULL(2)
+#define CHECK_VALID_MS_PRECISE_IP	BIT_ULL(3)
+#define CHECK_VALID_MS_RESTARTABLE_IP	BIT_ULL(4)
+#define CHECK_VALID_MS_OVERFLOW		BIT_ULL(5)
+
+#define CHECK_MS_ERR_TYPE(check)	(((check) & GENMASK_ULL(18, 16)) >> 16)
+#define CHECK_MS_PCC			BIT_ULL(19)
+#define CHECK_MS_UNCORRECTED		BIT_ULL(20)
+#define CHECK_MS_PRECISE_IP		BIT_ULL(21)
+#define CHECK_MS_RESTARTABLE_IP		BIT_ULL(22)
+#define CHECK_MS_OVERFLOW		BIT_ULL(23)
+
 enum err_types {
 	ERR_TYPE_CACHE = 0,
 	ERR_TYPE_TLB,
@@ -113,19 +127,58 @@ static const char * const ia_check_bus_addr_space_strs[] = {
 	"Other Transaction",
 };
 
+static const char * const ia_check_ms_error_type_strs[] = {
+	"No Error",
+	"Unclassified",
+	"Microcode ROM Parity Error",
+	"External Error",
+	"FRC Error",
+	"Internal Unclassified",
+};
+
 static inline void print_bool(char *str, const char *pfx, u64 check, u64 bit)
 {
 	printk("%s%s: %s\n", pfx, str, (check & bit) ? "true" : "false");
 }
 
+static void print_err_info_ms(const char *pfx, u16 validation_bits, u64 check)
+{
+	if (validation_bits & CHECK_VALID_MS_ERR_TYPE) {
+		u8 err_type = CHECK_MS_ERR_TYPE(check);
+
+		printk("%sError Type: %u, %s\n", pfx, err_type,
+		       err_type < ARRAY_SIZE(ia_check_ms_error_type_strs) ?
+		       ia_check_ms_error_type_strs[err_type] : "unknown");
+	}
+
+	if (validation_bits & CHECK_VALID_MS_PCC)
+		print_bool("Processor Context Corrupt", pfx, check, CHECK_MS_PCC);
+
+	if (validation_bits & CHECK_VALID_MS_UNCORRECTED)
+		print_bool("Uncorrected", pfx, check, CHECK_MS_UNCORRECTED);
+
+	if (validation_bits & CHECK_VALID_MS_PRECISE_IP)
+		print_bool("Precise IP", pfx, check, CHECK_MS_PRECISE_IP);
+
+	if (validation_bits & CHECK_VALID_MS_RESTARTABLE_IP)
+		print_bool("Restartable IP", pfx, check, CHECK_MS_RESTARTABLE_IP);
+
+	if (validation_bits & CHECK_VALID_MS_OVERFLOW)
+		print_bool("Overflow", pfx, check, CHECK_MS_OVERFLOW);
+}
+
 static void print_err_info(const char *pfx, u8 err_type, u64 check)
 {
 	u16 validation_bits = CHECK_VALID_BITS(check);
 
 	printk("%sValidation Bits: 0x%04x\n", pfx, validation_bits);
 
+	/*
+	 * The MS Check structure varies a lot from the others, so use a
+	 * separate function for decoding.
+	 */
 	if (err_type == ERR_TYPE_MS)
-		return;
+		return print_err_info_ms(pfx, validation_bits, check);
 
 	if (validation_bits & CHECK_VALID_TRANS_TYPE) {
 		u8 trans_type = CHECK_TRANS_TYPE(check);
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH v2 8/8] efi: Decode IA32/X64 Context Info structure
  2018-02-26 19:38 [PATCH v2 0/8] Decode IA32/X64 CPER Yazen Ghannam
                   ` (6 preceding siblings ...)
  2018-02-26 19:39 ` [PATCH v2 7/8] efi: Decode IA32/X64 MS Check structure Yazen Ghannam
@ 2018-02-26 19:39 ` Yazen Ghannam
  2018-02-28  8:43 ` [PATCH v2 0/8] Decode IA32/X64 CPER Borislav Petkov
  8 siblings, 0 replies; 41+ messages in thread
From: Yazen Ghannam @ 2018-02-26 19:39 UTC (permalink / raw)
  To: linux-efi; +Cc: Yazen Ghannam, linux-kernel, ard.biesheuvel, bp, x86

From: Yazen Ghannam <yazen.ghannam@amd.com>

Print the fields of the IA32/X64 Context Information structure.

Print the "Register Array" as raw values. Some context types are defined
in the UEFI spec, so more detailed decoded may be added in the future.

Based on UEFI 2.7 section N.2.4.2.2 IA32/X64 Processor Context
Information Structure.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
---
Link:
https://lkml.kernel.org/r/20180223200333.6410-9-Yazen.Ghannam@amd.com

v1->v2:
* Add parantheses around "bits" expression in macro.
* Change VALID_PROC_CNTXT_INFO_NUM to VALID_PROC_CTX_INFO_NUM.
* Fix indentation on multi-line statements.
* Remove conditional to skip unknown context types. The context info
  should be printed even if the type is unkown. This is just like what
  we do for the error information. 

 drivers/firmware/efi/cper-x86.c | 48 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 48 insertions(+)

diff --git a/drivers/firmware/efi/cper-x86.c b/drivers/firmware/efi/cper-x86.c
index 4c79401aace0..0407671ad7e6 100644
--- a/drivers/firmware/efi/cper-x86.c
+++ b/drivers/firmware/efi/cper-x86.c
@@ -12,6 +12,7 @@
 #define VALID_LAPIC_ID			BIT_ULL(0)
 #define VALID_CPUID_INFO		BIT_ULL(1)
 #define VALID_PROC_ERR_INFO_NUM(bits)	(((bits) & GENMASK_ULL(7, 2)) >> 2)
+#define VALID_PROC_CXT_INFO_NUM(bits)	(((bits) & GENMASK_ULL(13, 8)) >> 8)
 
 #define INFO_ERR_STRUCT_TYPE_CACHE					\
 	GUID_INIT(0xA55701F5, 0xE3EF, 0x43DE, 0xAC, 0x72, 0x24, 0x9B,	\
@@ -73,6 +74,9 @@
 #define CHECK_MS_RESTARTABLE_IP		BIT_ULL(22)
 #define CHECK_MS_OVERFLOW		BIT_ULL(23)
 
+#define CTX_TYPE_MSR			1
+#define CTX_TYPE_MMREG			7
+
 enum err_types {
 	ERR_TYPE_CACHE = 0,
 	ERR_TYPE_TLB,
@@ -136,6 +140,17 @@ static const char * const ia_check_ms_error_type_strs[] = {
 	"Internal Unclassified",
 };
 
+static const char * const ia_reg_ctx_strs[] = {
+	"Unclassified Data",
+	"MSR Registers (Machine Check and other MSRs)",
+	"32-bit Mode Execution Context",
+	"64-bit Mode Execution Context",
+	"FXSAVE Context",
+	"32-bit Mode Debug Registers (DR0-DR7)",
+	"64-bit Mode Debug Registers (DR0-DR7)",
+	"Memory Mapped Registers",
+};
+
 static inline void print_bool(char *str, const char *pfx, u64 check, u64 bit)
 {
 	printk("%s%s: %s\n", pfx, str, (check & bit) ? "true" : "false");
@@ -246,6 +261,7 @@ void cper_print_proc_ia(const char *pfx, const struct cper_sec_proc_ia *proc)
 {
 	int i;
 	struct cper_ia_err_info *err_info;
+	struct cper_ia_proc_ctx *ctx_info;
 	char newpfx[64], infopfx[64];
 	u8 err_type;
 
@@ -312,4 +328,36 @@ void cper_print_proc_ia(const char *pfx, const struct cper_sec_proc_ia *proc)
 
 		err_info++;
 	}
+
+	ctx_info = (struct cper_ia_proc_ctx *)err_info;
+	for (i = 0; i < VALID_PROC_CXT_INFO_NUM(proc->validation_bits); i++) {
+		int size = sizeof(*ctx_info) + ctx_info->reg_arr_size;
+		int groupsize = 4;
+
+		printk("%sContext Information Structure %d:\n", pfx, i);
+
+		printk("%sRegister Context Type: %s\n", newpfx,
+		       ctx_info->reg_ctx_type < ARRAY_SIZE(ia_reg_ctx_strs) ?
+		       ia_reg_ctx_strs[ctx_info->reg_ctx_type] : "unknown");
+
+		printk("%sRegister Array Size: 0x%04x\n", newpfx,
+		       ctx_info->reg_arr_size);
+
+		if (ctx_info->reg_ctx_type == CTX_TYPE_MSR) {
+			groupsize = 8; /* MSRs are 8 bytes wide. */
+			printk("%sMSR Address: 0x%08x\n", newpfx,
+			       ctx_info->msr_addr);
+		}
+
+		if (ctx_info->reg_ctx_type == CTX_TYPE_MMREG) {
+			printk("%sMM Register Address: 0x%016llx\n", newpfx,
+			       ctx_info->mm_reg_addr);
+		}
+
+		printk("%sRegister Array:\n", newpfx);
+		print_hex_dump(newpfx, "", DUMP_PREFIX_OFFSET, 16, groupsize,
+			       (ctx_info + 1), ctx_info->reg_arr_size, 0);
+
+		ctx_info = (struct cper_ia_proc_ctx *)((long)ctx_info + size);
+	}
 }
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 1/8] efi: Fix IA32/X64 Processor Error Record definition
  2018-02-26 19:38 ` [PATCH v2 1/8] efi: Fix IA32/X64 Processor Error Record definition Yazen Ghannam
@ 2018-02-27 10:47   ` Borislav Petkov
  2018-02-27 15:05     ` Ghannam, Yazen
  0 siblings, 1 reply; 41+ messages in thread
From: Borislav Petkov @ 2018-02-27 10:47 UTC (permalink / raw)
  To: Yazen Ghannam; +Cc: linux-efi, linux-kernel, ard.biesheuvel, x86

On Mon, Feb 26, 2018 at 01:38:57PM -0600, Yazen Ghannam wrote:
> From: Yazen Ghannam <yazen.ghannam@amd.com>
> 
> Based on UEFI 2.7 Table 255. Processor Error Record, the "Local APIC_ID"

My pdf says this is table 252.

> field is 8 bytes but Linux defines this field as 1 byte.
> 
> Fix this in the struct cper_sec_proc_ia definition.
> 
> Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
> ---
> Link:
> https://lkml.kernel.org/r/20180223200333.6410-2-Yazen.Ghannam@amd.com
> 
> v1->v2:
> * No changes.
> 
>  include/linux/cper.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/include/linux/cper.h b/include/linux/cper.h
> index d14ef4e77c8a..4b5f8459b403 100644
> --- a/include/linux/cper.h
> +++ b/include/linux/cper.h
> @@ -381,7 +381,7 @@ struct cper_sec_proc_generic {
>  /* IA32/X64 Processor Error Section */
>  struct cper_sec_proc_ia {
>  	__u64	validation_bits;
> -	__u8	lapic_id;
> +	__u64	lapic_id;
>  	__u8	cpuid[48];

Ok, that processor error record has a variable length structure at byte
offset 64 and we don't have it in this struct.

I guess I'll see it in the following patches but right now it looks
like that "Processor Error Info" thing is simply situated after that
Processor Error Record so we are supposed to simply find the info at
offset 64...

/me continues reading...

-- 
Regards/Gruss,
    Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
-- 

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 2/8] efi: Decode IA32/X64 Processor Error Section
  2018-02-26 19:38 ` [PATCH v2 2/8] efi: Decode IA32/X64 Processor Error Section Yazen Ghannam
@ 2018-02-27 11:22   ` Borislav Petkov
  2018-02-27 15:13     ` Ghannam, Yazen
  0 siblings, 1 reply; 41+ messages in thread
From: Borislav Petkov @ 2018-02-27 11:22 UTC (permalink / raw)
  To: Yazen Ghannam; +Cc: linux-efi, linux-kernel, ard.biesheuvel, x86

On Mon, Feb 26, 2018 at 01:38:58PM -0600, Yazen Ghannam wrote:
> + * We don't need a "CPER_IA" prefix since these are all locally defined.
> + * This will save us a lot of line space.
> + */
> +#define VALID_LAPIC_ID			BIT_ULL(0)
> +#define VALID_CPUID_INFO		BIT_ULL(1)
> +
> +void cper_print_proc_ia(const char *pfx, const struct cper_sec_proc_ia *proc)
> +{
> +	printk("%sValidation Bits: 0x%016llx\n", pfx, proc->validation_bits);

Ok, so this...

> +
> +	if (proc->validation_bits & VALID_LAPIC_ID)
> +		printk("%sLocal APIC_ID: 0x%llx\n", pfx, proc->lapic_id);
> +
> +	if (proc->validation_bits & VALID_CPUID_INFO) {
> +		printk("%sCPUID Info:\n", pfx);
> +		print_hex_dump(pfx, "", DUMP_PREFIX_OFFSET, 16, 4, proc->cpuid,
> +			       sizeof(proc->cpuid), 0);

... and this are semi-decoded information bits which I'd have to go open
the spec and continue decoding.

Can we please change the whole approach of not simply dumping such
fields but decode them fully. We want that error information to be
helpful to the user and she should be able to immediately understand
what type of error it is.

Validation Bits and a CPUID hexdump simply makes you go look at the spec
again and such dumps are only useful as a debugging aid but nothing
more.

The APIC ID dump is how this should be done - properly and fully decoded
error info which can be used immediately for diagnosing the error.

Thx.

-- 
Regards/Gruss,
    Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
-- 

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 3/8] efi: Decode IA32/X64 Processor Error Info Structure
  2018-02-26 19:38 ` [PATCH v2 3/8] efi: Decode IA32/X64 Processor Error Info Structure Yazen Ghannam
@ 2018-02-27 14:25   ` Borislav Petkov
  2018-02-27 15:25     ` Ghannam, Yazen
  0 siblings, 1 reply; 41+ messages in thread
From: Borislav Petkov @ 2018-02-27 14:25 UTC (permalink / raw)
  To: Yazen Ghannam; +Cc: linux-efi, linux-kernel, ard.biesheuvel, x86

On Mon, Feb 26, 2018 at 01:38:59PM -0600, Yazen Ghannam wrote:
> From: Yazen Ghannam <yazen.ghannam@amd.com>
> 
> Print the fields in the IA32/X64 Processor Error Info Structure.
> 
> Based on UEFI 2.7 Table 256. IA32/X64 Processor Error Information
> Structure.
> 
> Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
> ---
> Link:
> https://lkml.kernel.org/r/20180223200333.6410-4-Yazen.Ghannam@amd.com
> 
> v1->v2:
> * Add parantheses around "bits" expression in macro.
> * Fix indentation on multi-line statements.
> 
>  drivers/firmware/efi/cper-x86.c | 53 +++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 53 insertions(+)
> 
> diff --git a/drivers/firmware/efi/cper-x86.c b/drivers/firmware/efi/cper-x86.c
> index 9da0d981178f..417bd4e500a7 100644
> --- a/drivers/firmware/efi/cper-x86.c
> +++ b/drivers/firmware/efi/cper-x86.c
> @@ -3,15 +3,28 @@
>  
>  #include <linux/cper.h>
>  
> +#define INDENT_SP	" "

That's just silly.

> +
>  /*
>   * We don't need a "CPER_IA" prefix since these are all locally defined.
>   * This will save us a lot of line space.
>   */
>  #define VALID_LAPIC_ID			BIT_ULL(0)
>  #define VALID_CPUID_INFO		BIT_ULL(1)
> +#define VALID_PROC_ERR_INFO_NUM(bits)	(((bits) & GENMASK_ULL(7, 2)) >> 2)
> +
> +#define INFO_VALID_CHECK_INFO		BIT_ULL(0)
> +#define INFO_VALID_TARGET_ID		BIT_ULL(1)
> +#define INFO_VALID_REQUESTOR_ID		BIT_ULL(2)
> +#define INFO_VALID_RESPONDER_ID		BIT_ULL(3)
> +#define INFO_VALID_IP			BIT_ULL(4)
>  
>  void cper_print_proc_ia(const char *pfx, const struct cper_sec_proc_ia *proc)
>  {
> +	int i;
> +	struct cper_ia_err_info *err_info;
> +	char newpfx[64];
> +
>  	printk("%sValidation Bits: 0x%016llx\n", pfx, proc->validation_bits);
>  
>  	if (proc->validation_bits & VALID_LAPIC_ID)
> @@ -22,4 +35,44 @@ void cper_print_proc_ia(const char *pfx, const struct cper_sec_proc_ia *proc)
>  		print_hex_dump(pfx, "", DUMP_PREFIX_OFFSET, 16, 4, proc->cpuid,
>  			       sizeof(proc->cpuid), 0);
>  	}
> +
> +	snprintf(newpfx, sizeof(newpfx), "%s%s", pfx, INDENT_SP);
> +
> +	err_info = (struct cper_ia_err_info *)(proc + 1);
> +	for (i = 0; i < VALID_PROC_ERR_INFO_NUM(proc->validation_bits); i++) {
> +		printk("%sError Information Structure %d:\n", pfx, i);
> +
> +		printk("%sError Structure Type: %pUl\n", newpfx,
> +		       &err_info->err_type);

That dumps a GUID - how do I find out what type that GUID refers to?

> +
> +		printk("%sValidation Bits: 0x%016llx\n",
> +		       newpfx, err_info->validation_bits);

As before, no need for those.

> +
> +		if (err_info->validation_bits & INFO_VALID_CHECK_INFO) {
> +			printk("%sCheck Information: 0x%016llx\n", newpfx,
> +			       err_info->check_info);
> +		}
> +
> +		if (err_info->validation_bits & INFO_VALID_TARGET_ID) {
> +			printk("%sTarget Identifier: 0x%016llx\n",
> +			       newpfx, err_info->target_id);
> +		}
> +
> +		if (err_info->validation_bits & INFO_VALID_REQUESTOR_ID) {
> +			printk("%sRequestor Identifier: 0x%016llx\n",
> +			       newpfx, err_info->requestor_id);
> +		}
> +
> +		if (err_info->validation_bits & INFO_VALID_RESPONDER_ID) {
> +			printk("%sResponder Identifier: 0x%016llx\n",
> +			       newpfx, err_info->responder_id);
> +		}

Those look like would need an additional decoding. Can we do that here
too?

> +
> +		if (err_info->validation_bits & INFO_VALID_IP) {
> +			printk("%sInstruction Pointer: 0x%016llx\n",
> +			       newpfx, err_info->ip);

The only thing that makes sense so far.

> +		}
> +
> +		err_info++;
> +	}

Also, it is worth checking how much duplication there is with
drivers/firmware/efi/cper-arm.c and unifying the common pieces into
functions in cper-common.c or so.

-- 
Regards/Gruss,
    Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
-- 

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 4/8] efi: Decode UEFI-defined IA32/X64 Error Structure GUIDs
  2018-02-26 19:39 ` [PATCH v2 4/8] efi: Decode UEFI-defined IA32/X64 Error Structure GUIDs Yazen Ghannam
@ 2018-02-27 14:30   ` Borislav Petkov
  2018-02-27 15:28     ` Ghannam, Yazen
  0 siblings, 1 reply; 41+ messages in thread
From: Borislav Petkov @ 2018-02-27 14:30 UTC (permalink / raw)
  To: Yazen Ghannam; +Cc: linux-efi, linux-kernel, ard.biesheuvel, x86

On Mon, Feb 26, 2018 at 01:39:00PM -0600, Yazen Ghannam wrote:
> @@ -45,6 +81,11 @@ void cper_print_proc_ia(const char *pfx, const struct cper_sec_proc_ia *proc)
>  		printk("%sError Structure Type: %pUl\n", newpfx,
>  		       &err_info->err_type);
>  
> +		err_type = cper_get_err_type(&err_info->err_type);
> +		printk("%sError Structure Type: %s\n", newpfx,
> +		       err_type < ARRAY_SIZE(cper_proc_error_type_strs) ?
> +		       cper_proc_error_type_strs[err_type] : "unknown");

Ah, there it is, much better. Now this tells us what component it is.

So cper-arm.c does the same thing and there's a similar thing in cper.c -
cper_print_proc_generic().

Let's not tri-plicate that code pls and have a generic function instead.

-- 
Regards/Gruss,
    Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
-- 

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 5/8] efi: Decode IA32/X64 Cache, TLB, and Bus Check structures
  2018-02-26 19:39 ` [PATCH v2 5/8] efi: Decode IA32/X64 Cache, TLB, and Bus Check structures Yazen Ghannam
@ 2018-02-27 15:03   ` Borislav Petkov
  2018-02-27 15:33     ` Ghannam, Yazen
  0 siblings, 1 reply; 41+ messages in thread
From: Borislav Petkov @ 2018-02-27 15:03 UTC (permalink / raw)
  To: Yazen Ghannam; +Cc: linux-efi, linux-kernel, ard.biesheuvel, x86

On Mon, Feb 26, 2018 at 01:39:01PM -0600, Yazen Ghannam wrote:
> +static void print_err_info(const char *pfx, u8 err_type, u64 check)
> +{
> +	u16 validation_bits = CHECK_VALID_BITS(check);
> +
> +	printk("%sValidation Bits: 0x%04x\n", pfx, validation_bits);
> +
> +	if (err_type == ERR_TYPE_MS)
> +		return;
> +
> +	if (validation_bits & CHECK_VALID_TRANS_TYPE) {
> +		u8 trans_type = CHECK_TRANS_TYPE(check);
> +
> +		printk("%sTransaction Type: %u, %s\n", pfx, trans_type,
> +		       trans_type < ARRAY_SIZE(ia_check_trans_type_strs) ?
> +		       ia_check_trans_type_strs[trans_type] : "unknown");
> +	}
> +
> +	if (validation_bits & CHECK_VALID_OPERATION) {
> +		u8 op = CHECK_OPERATION(check);
> +
> +		/*
> +		 * CACHE has more operation types than TLB or BUS, though the
> +		 * name and the order are the same.
> +		 */
> +		u8 max_ops = (err_type == ERR_TYPE_CACHE) ? 9 : 7;
> +
> +		printk("%sOperation: %u, %s\n", pfx, op,
> +		       op < max_ops ? ia_check_op_strs[op] : "unknown");
> +	}
> +
> +	if (validation_bits & CHECK_VALID_LEVEL)
> +		printk("%sLevel: %llu\n", pfx, CHECK_LEVEL(check));
> +
> +	if (validation_bits & CHECK_VALID_PCC)
> +		print_bool("Processor Context Corrupt", pfx, check, CHECK_PCC);

I think we want to print PCC here unconditionally and say:

	PCC: (yes|no|invalid)

I don't think the absence of PCC in the error record is a good enough
hint that the PCC field is invalid.

Ditto for the rest and transaction type above too. I think it would be
much easier if we have fixed fields error record.

-- 
Regards/Gruss,
    Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
-- 

^ permalink raw reply	[flat|nested] 41+ messages in thread

* RE: [PATCH v2 1/8] efi: Fix IA32/X64 Processor Error Record definition
  2018-02-27 10:47   ` Borislav Petkov
@ 2018-02-27 15:05     ` Ghannam, Yazen
  0 siblings, 0 replies; 41+ messages in thread
From: Ghannam, Yazen @ 2018-02-27 15:05 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: linux-efi, linux-kernel, ard.biesheuvel, x86

> -----Original Message-----
> From: Borislav Petkov [mailto:bp@suse.de]
> Sent: Tuesday, February 27, 2018 5:47 AM
> To: Ghannam, Yazen <Yazen.Ghannam@amd.com>
> Cc: linux-efi@vger.kernel.org; linux-kernel@vger.kernel.org;
> ard.biesheuvel@linaro.org; x86@kernel.org
> Subject: Re: [PATCH v2 1/8] efi: Fix IA32/X64 Processor Error Record
> definition
> 
> On Mon, Feb 26, 2018 at 01:38:57PM -0600, Yazen Ghannam wrote:
> > From: Yazen Ghannam <yazen.ghannam@amd.com>
> >
> > Based on UEFI 2.7 Table 255. Processor Error Record, the "Local APIC_ID"
> 
> My pdf says this is table 252.
> 

Right. I'm using the latest which is UEFI 2.7 Errata A. It turns out that the table
numbering is different between the base and the errata specs. I can change this.

> > field is 8 bytes but Linux defines this field as 1 byte.
> >
> > Fix this in the struct cper_sec_proc_ia definition.
> >
> > Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
> > ---
> > Link:
> > https://lkml.kernel.org/r/20180223200333.6410-2-
> Yazen.Ghannam@amd.com
> >
> > v1->v2:
> > * No changes.
> >
> >  include/linux/cper.h | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/include/linux/cper.h b/include/linux/cper.h
> > index d14ef4e77c8a..4b5f8459b403 100644
> > --- a/include/linux/cper.h
> > +++ b/include/linux/cper.h
> > @@ -381,7 +381,7 @@ struct cper_sec_proc_generic {
> >  /* IA32/X64 Processor Error Section */
> >  struct cper_sec_proc_ia {
> >  	__u64	validation_bits;
> > -	__u8	lapic_id;
> > +	__u64	lapic_id;
> >  	__u8	cpuid[48];
> 
> Ok, that processor error record has a variable length structure at byte
> offset 64 and we don't have it in this struct.
> 
> I guess I'll see it in the following patches but right now it looks
> like that "Processor Error Info" thing is simply situated after that
> Processor Error Record so we are supposed to simply find the info at
> offset 64...
> 
> /me continues reading...

The error and context info tables are defined in other structs and we
access them by offsetting from the previous table.

Thanks,
Yazen

^ permalink raw reply	[flat|nested] 41+ messages in thread

* RE: [PATCH v2 2/8] efi: Decode IA32/X64 Processor Error Section
  2018-02-27 11:22   ` Borislav Petkov
@ 2018-02-27 15:13     ` Ghannam, Yazen
  2018-02-27 17:00       ` Borislav Petkov
  0 siblings, 1 reply; 41+ messages in thread
From: Ghannam, Yazen @ 2018-02-27 15:13 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: linux-efi, linux-kernel, ard.biesheuvel, x86

> -----Original Message-----
> From: Borislav Petkov [mailto:bp@suse.de]
> Sent: Tuesday, February 27, 2018 6:23 AM
> To: Ghannam, Yazen <Yazen.Ghannam@amd.com>
> Cc: linux-efi@vger.kernel.org; linux-kernel@vger.kernel.org;
> ard.biesheuvel@linaro.org; x86@kernel.org
> Subject: Re: [PATCH v2 2/8] efi: Decode IA32/X64 Processor Error Section
> 
> On Mon, Feb 26, 2018 at 01:38:58PM -0600, Yazen Ghannam wrote:
> > + * We don't need a "CPER_IA" prefix since these are all locally defined.
> > + * This will save us a lot of line space.
> > + */
> > +#define VALID_LAPIC_ID			BIT_ULL(0)
> > +#define VALID_CPUID_INFO		BIT_ULL(1)
> > +
> > +void cper_print_proc_ia(const char *pfx, const struct cper_sec_proc_ia
> *proc)
> > +{
> > +	printk("%sValidation Bits: 0x%016llx\n", pfx, proc->validation_bits);
> 
> Ok, so this...
> 
> > +
> > +	if (proc->validation_bits & VALID_LAPIC_ID)
> > +		printk("%sLocal APIC_ID: 0x%llx\n", pfx, proc->lapic_id);
> > +
> > +	if (proc->validation_bits & VALID_CPUID_INFO) {
> > +		printk("%sCPUID Info:\n", pfx);
> > +		print_hex_dump(pfx, "", DUMP_PREFIX_OFFSET, 16, 4, proc-
> >cpuid,
> > +			       sizeof(proc->cpuid), 0);
> 
> ... and this are semi-decoded information bits which I'd have to go open
> the spec and continue decoding.
> 
> Can we please change the whole approach of not simply dumping such
> fields but decode them fully. We want that error information to be
> helpful to the user and she should be able to immediately understand
> what type of error it is.
> 
> Validation Bits and a CPUID hexdump simply makes you go look at the spec
> again and such dumps are only useful as a debugging aid but nothing
> more.
> 
> The APIC ID dump is how this should be done - properly and fully decoded
> error info which can be used immediately for diagnosing the error.
> 

I decided to print the Validation Bits as a sanity check for whomever is looking
at this. Since we only print fields with a valid bit, it may be confusing for users
who don't know why fields are missing. They can check the Validation Bits to see
that the fields printed have a valid bit set and those not printed don't have valid
bits set.

Also, I don't think we should be interpreting the spec for the user. We should
print the fields as they and users can refer back to the spec for more information.

The more detailed info about the error is printed later. Even just printing the fields
at that point should be enough for most users.

Thanks,
Yazen

^ permalink raw reply	[flat|nested] 41+ messages in thread

* RE: [PATCH v2 3/8] efi: Decode IA32/X64 Processor Error Info Structure
  2018-02-27 14:25   ` Borislav Petkov
@ 2018-02-27 15:25     ` Ghannam, Yazen
  2018-02-27 17:04       ` Borislav Petkov
  0 siblings, 1 reply; 41+ messages in thread
From: Ghannam, Yazen @ 2018-02-27 15:25 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: linux-efi, linux-kernel, ard.biesheuvel, x86

> -----Original Message-----
> From: Borislav Petkov [mailto:bp@suse.de]
> Sent: Tuesday, February 27, 2018 9:26 AM
> To: Ghannam, Yazen <Yazen.Ghannam@amd.com>
> Cc: linux-efi@vger.kernel.org; linux-kernel@vger.kernel.org;
> ard.biesheuvel@linaro.org; x86@kernel.org
> Subject: Re: [PATCH v2 3/8] efi: Decode IA32/X64 Processor Error Info
> Structure
> 
> On Mon, Feb 26, 2018 at 01:38:59PM -0600, Yazen Ghannam wrote:
> > From: Yazen Ghannam <yazen.ghannam@amd.com>
> >
> > Print the fields in the IA32/X64 Processor Error Info Structure.
> >
> > Based on UEFI 2.7 Table 256. IA32/X64 Processor Error Information
> > Structure.
> >
> > Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
> > ---
> > Link:
> > https://lkml.kernel.org/r/20180223200333.6410-4-
> Yazen.Ghannam@amd.com
> >
> > v1->v2:
> > * Add parantheses around "bits" expression in macro.
> > * Fix indentation on multi-line statements.
> >
> >  drivers/firmware/efi/cper-x86.c | 53
> +++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 53 insertions(+)
> >
> > diff --git a/drivers/firmware/efi/cper-x86.c b/drivers/firmware/efi/cper-
> x86.c
> > index 9da0d981178f..417bd4e500a7 100644
> > --- a/drivers/firmware/efi/cper-x86.c
> > +++ b/drivers/firmware/efi/cper-x86.c
> > @@ -3,15 +3,28 @@
> >
> >  #include <linux/cper.h>
> >
> > +#define INDENT_SP	" "
> 
> That's just silly.
> 

This is the same as the other CPER code.

> > +
> >  /*
> >   * We don't need a "CPER_IA" prefix since these are all locally defined.
> >   * This will save us a lot of line space.
> >   */
> >  #define VALID_LAPIC_ID			BIT_ULL(0)
> >  #define VALID_CPUID_INFO		BIT_ULL(1)
> > +#define VALID_PROC_ERR_INFO_NUM(bits)	(((bits) & GENMASK_ULL(7,
> 2)) >> 2)
> > +
> > +#define INFO_VALID_CHECK_INFO		BIT_ULL(0)
> > +#define INFO_VALID_TARGET_ID		BIT_ULL(1)
> > +#define INFO_VALID_REQUESTOR_ID		BIT_ULL(2)
> > +#define INFO_VALID_RESPONDER_ID		BIT_ULL(3)
> > +#define INFO_VALID_IP			BIT_ULL(4)
> >
> >  void cper_print_proc_ia(const char *pfx, const struct cper_sec_proc_ia
> *proc)
> >  {
> > +	int i;
> > +	struct cper_ia_err_info *err_info;
> > +	char newpfx[64];
> > +
> >  	printk("%sValidation Bits: 0x%016llx\n", pfx, proc->validation_bits);
> >
> >  	if (proc->validation_bits & VALID_LAPIC_ID)
> > @@ -22,4 +35,44 @@ void cper_print_proc_ia(const char *pfx, const struct
> cper_sec_proc_ia *proc)
> >  		print_hex_dump(pfx, "", DUMP_PREFIX_OFFSET, 16, 4, proc-
> >cpuid,
> >  			       sizeof(proc->cpuid), 0);
> >  	}
> > +
> > +	snprintf(newpfx, sizeof(newpfx), "%s%s", pfx, INDENT_SP);
> > +
> > +	err_info = (struct cper_ia_err_info *)(proc + 1);
> > +	for (i = 0; i < VALID_PROC_ERR_INFO_NUM(proc->validation_bits);
> i++) {
> > +		printk("%sError Information Structure %d:\n", pfx, i);
> > +
> > +		printk("%sError Structure Type: %pUl\n", newpfx,
> > +		       &err_info->err_type);
> 
> That dumps a GUID - how do I find out what type that GUID refers to?
> 

A later patch prints out the spec-defined type.

Also, the spec allows platform-defined GUIDs. So we should always print this
even if the GUID is not defined by the spec.

> > +
> > +		printk("%sValidation Bits: 0x%016llx\n",
> > +		       newpfx, err_info->validation_bits);
> 
> As before, no need for those.
> 
> > +
> > +		if (err_info->validation_bits & INFO_VALID_CHECK_INFO) {
> > +			printk("%sCheck Information: 0x%016llx\n", newpfx,
> > +			       err_info->check_info);
> > +		}
> > +
> > +		if (err_info->validation_bits & INFO_VALID_TARGET_ID) {
> > +			printk("%sTarget Identifier: 0x%016llx\n",
> > +			       newpfx, err_info->target_id);
> > +		}
> > +
> > +		if (err_info->validation_bits & INFO_VALID_REQUESTOR_ID) {
> > +			printk("%sRequestor Identifier: 0x%016llx\n",
> > +			       newpfx, err_info->requestor_id);
> > +		}
> > +
> > +		if (err_info->validation_bits & INFO_VALID_RESPONDER_ID) {
> > +			printk("%sResponder Identifier: 0x%016llx\n",
> > +			       newpfx, err_info->responder_id);
> > +		}
> 
> Those look like would need an additional decoding. Can we do that here
> too?
> 

The Check Information will be decoded further in another patch.

I don't think there's much else to decode for the ID fields.

> > +
> > +		if (err_info->validation_bits & INFO_VALID_IP) {
> > +			printk("%sInstruction Pointer: 0x%016llx\n",
> > +			       newpfx, err_info->ip);
> 
> The only thing that makes sense so far.
> 
> > +		}
> > +
> > +		err_info++;
> > +	}
> 
> Also, it is worth checking how much duplication there is with
> drivers/firmware/efi/cper-arm.c and unifying the common pieces into
> functions in cper-common.c or so.
> 

Other tables may have the same fields but the offsets are usually different.
So it may be more trouble than it's worth trying to unify the different tables.

Thanks,
Yazen 

^ permalink raw reply	[flat|nested] 41+ messages in thread

* RE: [PATCH v2 4/8] efi: Decode UEFI-defined IA32/X64 Error Structure GUIDs
  2018-02-27 14:30   ` Borislav Petkov
@ 2018-02-27 15:28     ` Ghannam, Yazen
  2018-02-27 15:31       ` Ard Biesheuvel
  0 siblings, 1 reply; 41+ messages in thread
From: Ghannam, Yazen @ 2018-02-27 15:28 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: linux-efi, linux-kernel, ard.biesheuvel, x86

> -----Original Message-----
> From: Borislav Petkov [mailto:bp@suse.de]
> Sent: Tuesday, February 27, 2018 9:30 AM
> To: Ghannam, Yazen <Yazen.Ghannam@amd.com>
> Cc: linux-efi@vger.kernel.org; linux-kernel@vger.kernel.org;
> ard.biesheuvel@linaro.org; x86@kernel.org
> Subject: Re: [PATCH v2 4/8] efi: Decode UEFI-defined IA32/X64 Error
> Structure GUIDs
> 
> On Mon, Feb 26, 2018 at 01:39:00PM -0600, Yazen Ghannam wrote:
> > @@ -45,6 +81,11 @@ void cper_print_proc_ia(const char *pfx, const struct
> cper_sec_proc_ia *proc)
> >  		printk("%sError Structure Type: %pUl\n", newpfx,
> >  		       &err_info->err_type);
> >
> > +		err_type = cper_get_err_type(&err_info->err_type);
> > +		printk("%sError Structure Type: %s\n", newpfx,
> > +		       err_type < ARRAY_SIZE(cper_proc_error_type_strs) ?
> > +		       cper_proc_error_type_strs[err_type] : "unknown");
> 
> Ah, there it is, much better. Now this tells us what component it is.
> 
> So cper-arm.c does the same thing and there's a similar thing in cper.c -
> cper_print_proc_generic().
> 
> Let's not tri-plicate that code pls and have a generic function instead.
> 

Sure, I can do that.

Ard, is this okay?

Thanks,
Yazen

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 4/8] efi: Decode UEFI-defined IA32/X64 Error Structure GUIDs
  2018-02-27 15:28     ` Ghannam, Yazen
@ 2018-02-27 15:31       ` Ard Biesheuvel
  0 siblings, 0 replies; 41+ messages in thread
From: Ard Biesheuvel @ 2018-02-27 15:31 UTC (permalink / raw)
  To: Ghannam, Yazen; +Cc: Borislav Petkov, linux-efi, linux-kernel, x86

On 27 February 2018 at 15:28, Ghannam, Yazen <Yazen.Ghannam@amd.com> wrote:
>> -----Original Message-----
>> From: Borislav Petkov [mailto:bp@suse.de]
>> Sent: Tuesday, February 27, 2018 9:30 AM
>> To: Ghannam, Yazen <Yazen.Ghannam@amd.com>
>> Cc: linux-efi@vger.kernel.org; linux-kernel@vger.kernel.org;
>> ard.biesheuvel@linaro.org; x86@kernel.org
>> Subject: Re: [PATCH v2 4/8] efi: Decode UEFI-defined IA32/X64 Error
>> Structure GUIDs
>>
>> On Mon, Feb 26, 2018 at 01:39:00PM -0600, Yazen Ghannam wrote:
>> > @@ -45,6 +81,11 @@ void cper_print_proc_ia(const char *pfx, const struct
>> cper_sec_proc_ia *proc)
>> >             printk("%sError Structure Type: %pUl\n", newpfx,
>> >                    &err_info->err_type);
>> >
>> > +           err_type = cper_get_err_type(&err_info->err_type);
>> > +           printk("%sError Structure Type: %s\n", newpfx,
>> > +                  err_type < ARRAY_SIZE(cper_proc_error_type_strs) ?
>> > +                  cper_proc_error_type_strs[err_type] : "unknown");
>>
>> Ah, there it is, much better. Now this tells us what component it is.
>>
>> So cper-arm.c does the same thing and there's a similar thing in cper.c -
>> cper_print_proc_generic().
>>
>> Let's not tri-plicate that code pls and have a generic function instead.
>>
>
> Sure, I can do that.
>
> Ard, is this okay?
>

Yes, please, but only if there's something to gain by it.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* RE: [PATCH v2 5/8] efi: Decode IA32/X64 Cache, TLB, and Bus Check structures
  2018-02-27 15:03   ` Borislav Petkov
@ 2018-02-27 15:33     ` Ghannam, Yazen
  2018-02-27 17:06       ` Borislav Petkov
  0 siblings, 1 reply; 41+ messages in thread
From: Ghannam, Yazen @ 2018-02-27 15:33 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: linux-efi, linux-kernel, ard.biesheuvel, x86

> -----Original Message-----
> From: Borislav Petkov [mailto:bp@suse.de]
> Sent: Tuesday, February 27, 2018 10:04 AM
> To: Ghannam, Yazen <Yazen.Ghannam@amd.com>
> Cc: linux-efi@vger.kernel.org; linux-kernel@vger.kernel.org;
> ard.biesheuvel@linaro.org; x86@kernel.org
> Subject: Re: [PATCH v2 5/8] efi: Decode IA32/X64 Cache, TLB, and Bus Check
> structures
> 
> On Mon, Feb 26, 2018 at 01:39:01PM -0600, Yazen Ghannam wrote:
> > +static void print_err_info(const char *pfx, u8 err_type, u64 check)
> > +{
> > +	u16 validation_bits = CHECK_VALID_BITS(check);
> > +
> > +	printk("%sValidation Bits: 0x%04x\n", pfx, validation_bits);
> > +
> > +	if (err_type == ERR_TYPE_MS)
> > +		return;
> > +
> > +	if (validation_bits & CHECK_VALID_TRANS_TYPE) {
> > +		u8 trans_type = CHECK_TRANS_TYPE(check);
> > +
> > +		printk("%sTransaction Type: %u, %s\n", pfx, trans_type,
> > +		       trans_type < ARRAY_SIZE(ia_check_trans_type_strs) ?
> > +		       ia_check_trans_type_strs[trans_type] : "unknown");
> > +	}
> > +
> > +	if (validation_bits & CHECK_VALID_OPERATION) {
> > +		u8 op = CHECK_OPERATION(check);
> > +
> > +		/*
> > +		 * CACHE has more operation types than TLB or BUS, though
> the
> > +		 * name and the order are the same.
> > +		 */
> > +		u8 max_ops = (err_type == ERR_TYPE_CACHE) ? 9 : 7;
> > +
> > +		printk("%sOperation: %u, %s\n", pfx, op,
> > +		       op < max_ops ? ia_check_op_strs[op] : "unknown");
> > +	}
> > +
> > +	if (validation_bits & CHECK_VALID_LEVEL)
> > +		printk("%sLevel: %llu\n", pfx, CHECK_LEVEL(check));
> > +
> > +	if (validation_bits & CHECK_VALID_PCC)
> > +		print_bool("Processor Context Corrupt", pfx, check,
> CHECK_PCC);
> 
> I think we want to print PCC here unconditionally and say:
> 
> 	PCC: (yes|no|invalid)
> 
> I don't think the absence of PCC in the error record is a good enough
> hint that the PCC field is invalid.
> 
> Ditto for the rest and transaction type above too. I think it would be
> much easier if we have fixed fields error record.
> 

I agree which is why I've included the Validation Bits. A user can then
check the Validation Bits for any field that is of interest but missing.

Thanks,
Yazen

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 2/8] efi: Decode IA32/X64 Processor Error Section
  2018-02-27 15:13     ` Ghannam, Yazen
@ 2018-02-27 17:00       ` Borislav Petkov
  2018-02-27 17:27         ` Ghannam, Yazen
  0 siblings, 1 reply; 41+ messages in thread
From: Borislav Petkov @ 2018-02-27 17:00 UTC (permalink / raw)
  To: Ghannam, Yazen; +Cc: linux-efi, linux-kernel, ard.biesheuvel, x86

On Tue, Feb 27, 2018 at 03:13:11PM +0000, Ghannam, Yazen wrote:
> I decided to print the Validation Bits as a sanity check for whomever is looking
> at this. Since we only print fields with a valid bit, it may be confusing for users
> who don't know why fields are missing.

I suggested what to do about that already: print the fields unconditionally.

> Also, I don't think we should be interpreting the spec for the user. We should
> print the fields as they and users can refer back to the spec for more information.

This is a very very bad idea.

You can just as well dump binary blobs and then add a user script which
deciphers that. Oh wait, we had that, it is called mcelog and it is a
huge pain trying to decode an error properly.

By that same logic, we could simply dump 64-bit MCA MSR values.

And everytime we get an error, we have to go dig out the spec. Hell no!
We had that, we know it is awful, we won't have it again.

You want to decode everything and fully in the kernel so that you can
have a ready error record which people can simply *read* from dmesg and
know what the error is.

-- 
Regards/Gruss,
    Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
-- 

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 3/8] efi: Decode IA32/X64 Processor Error Info Structure
  2018-02-27 15:25     ` Ghannam, Yazen
@ 2018-02-27 17:04       ` Borislav Petkov
  2018-02-27 17:46         ` Ghannam, Yazen
  0 siblings, 1 reply; 41+ messages in thread
From: Borislav Petkov @ 2018-02-27 17:04 UTC (permalink / raw)
  To: Ghannam, Yazen; +Cc: linux-efi, linux-kernel, ard.biesheuvel, x86

On Tue, Feb 27, 2018 at 03:25:06PM +0000, Ghannam, Yazen wrote:
> This is the same as the other CPER code.

Dude, turn on brain!

So if it is in the other CPER code, we should copy it, no matter how
dumb it is?!?

> Also, the spec allows platform-defined GUIDs. So we should always print this
> even if the GUID is not defined by the spec.

We need to have a way to map the GUID to a hw part. Dumb numbers mean shit
because the error record is worthless.

> The Check Information will be decoded further in another patch.
> 
> I don't think there's much else to decode for the ID fields.

Again, those numbers can't help decoding the error, no need to dump
them. Or we find a way to make sense out of that info.

> Other tables may have the same fields but the offsets are usually different.
> So it may be more trouble than it's worth trying to unify the different tables.

If the structs are the same, you can use generic functions for dumping -
the offsets are meaningless then.

-- 
Regards/Gruss,
    Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
-- 

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 5/8] efi: Decode IA32/X64 Cache, TLB, and Bus Check structures
  2018-02-27 15:33     ` Ghannam, Yazen
@ 2018-02-27 17:06       ` Borislav Petkov
  0 siblings, 0 replies; 41+ messages in thread
From: Borislav Petkov @ 2018-02-27 17:06 UTC (permalink / raw)
  To: Ghannam, Yazen; +Cc: linux-efi, linux-kernel, ard.biesheuvel, x86

On Tue, Feb 27, 2018 at 03:33:44PM +0000, Ghannam, Yazen wrote:
> I agree which is why I've included the Validation Bits. A user can then
> check the Validation Bits for any field that is of interest but missing.

No, no half-baked fields.

So you know drivers/edac/mce_amd.c, right? And you know the lengths we
go to make an error record as readable as possible?

So please apply the same logic and intent here too.

-- 
Regards/Gruss,
    Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
-- 

^ permalink raw reply	[flat|nested] 41+ messages in thread

* RE: [PATCH v2 2/8] efi: Decode IA32/X64 Processor Error Section
  2018-02-27 17:00       ` Borislav Petkov
@ 2018-02-27 17:27         ` Ghannam, Yazen
  2018-02-27 17:44           ` Borislav Petkov
  0 siblings, 1 reply; 41+ messages in thread
From: Ghannam, Yazen @ 2018-02-27 17:27 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: linux-efi, linux-kernel, ard.biesheuvel, x86

> -----Original Message-----
> From: Borislav Petkov [mailto:bp@suse.de]
> Sent: Tuesday, February 27, 2018 12:01 PM
> To: Ghannam, Yazen <Yazen.Ghannam@amd.com>
> Cc: linux-efi@vger.kernel.org; linux-kernel@vger.kernel.org;
> ard.biesheuvel@linaro.org; x86@kernel.org
> Subject: Re: [PATCH v2 2/8] efi: Decode IA32/X64 Processor Error Section
> 
> On Tue, Feb 27, 2018 at 03:13:11PM +0000, Ghannam, Yazen wrote:
> > I decided to print the Validation Bits as a sanity check for whomever is
> looking
> > at this. Since we only print fields with a valid bit, it may be confusing for
> users
> > who don't know why fields are missing.
> 
> I suggested what to do about that already: print the fields unconditionally.
> 

Sure, we can print the fields unconditionally if Ard is okay with that.

The issue is that the x86 behavior will be different from all the others, so that
might be confusing.

> > Also, I don't think we should be interpreting the spec for the user. We
> should
> > print the fields as they and users can refer back to the spec for more
> information.
> 
> This is a very very bad idea.
> 
> You can just as well dump binary blobs and then add a user script which
> deciphers that. Oh wait, we had that, it is called mcelog and it is a
> huge pain trying to decode an error properly.
> 
> By that same logic, we could simply dump 64-bit MCA MSR values.
> 
> And everytime we get an error, we have to go dig out the spec. Hell no!
> We had that, we know it is awful, we won't have it again.
> 
> You want to decode everything and fully in the kernel so that you can
> have a ready error record which people can simply *read* from dmesg and
> know what the error is.
> 

This set does decode everything fully so that people can read the error. The only
two places there are binary dumps are with the CPUID field and the Context info.

Maybe we can run the CPUID field through some decoder in arch/x86.

I already mentioned in the Context info patch that there could be further
decoding which we can do in the future. The two options that might be a pain
will be the MSR and MM register types, since these will be model and/or
platform specific.

Thanks,
Yazen

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 2/8] efi: Decode IA32/X64 Processor Error Section
  2018-02-27 17:27         ` Ghannam, Yazen
@ 2018-02-27 17:44           ` Borislav Petkov
  2018-02-27 18:06             ` Ghannam, Yazen
  0 siblings, 1 reply; 41+ messages in thread
From: Borislav Petkov @ 2018-02-27 17:44 UTC (permalink / raw)
  To: Ghannam, Yazen; +Cc: linux-efi, linux-kernel, ard.biesheuvel, x86

On Tue, Feb 27, 2018 at 05:27:39PM +0000, Ghannam, Yazen wrote:
> Sure, we can print the fields unconditionally if Ard is okay with that.
> 
> The issue is that the x86 behavior will be different from all the others, so that
> might be confusing.

Confusing for whom?

Are we sharing tools with other arches or what am I missing?

> This set does decode everything fully so that people can read the error.

No, it doesn't. It dumps

        printk("%sValidation Bits: 0x%016llx\n", pfx, proc->validation_bits);

        printk("%sError Structure Type: %pUl\n", newpfx,
                &err_info->err_type);

        printk("%sCheck Information: 0x%016llx\n", newpfx,
                 err_info->check_info);

and so on which are half-baked.

Think of it this way: if you look at the error record and you still need
to look at the spec to decode it, then it is still not good enough.

Users don't care about validation bits, or unreadable GUIDs or WTF is
"Check Information" even?

"This section details the layout of the Processor Error Information
Structure and the detailed check information which is contained within."

And that "Check Information" thing is mentioned only twice in the whole
spec.

StructureErrorType only there in the table.

Very informative that.

So no, users don't care about any of that internal crap - they wanna
know what is wrong with their box and that should be written in plain
english.

> I already mentioned in the Context info patch that there could be
> further decoding which we can do in the future.

Then decode only those pieces fully now which you can do now and when
you add something in the future, add it then with the full decoding
functionality. No fields which need additional decoding with the spec
opened on one side.

-- 
Regards/Gruss,
    Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
-- 

^ permalink raw reply	[flat|nested] 41+ messages in thread

* RE: [PATCH v2 3/8] efi: Decode IA32/X64 Processor Error Info Structure
  2018-02-27 17:04       ` Borislav Petkov
@ 2018-02-27 17:46         ` Ghannam, Yazen
  2018-02-27 18:02           ` Borislav Petkov
  0 siblings, 1 reply; 41+ messages in thread
From: Ghannam, Yazen @ 2018-02-27 17:46 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: linux-efi, linux-kernel, ard.biesheuvel, x86

> -----Original Message-----
> From: Borislav Petkov [mailto:bp@suse.de]
> Sent: Tuesday, February 27, 2018 12:04 PM
> To: Ghannam, Yazen <Yazen.Ghannam@amd.com>
> Cc: linux-efi@vger.kernel.org; linux-kernel@vger.kernel.org;
> ard.biesheuvel@linaro.org; x86@kernel.org
> Subject: Re: [PATCH v2 3/8] efi: Decode IA32/X64 Processor Error Info
> Structure
> 
> On Tue, Feb 27, 2018 at 03:25:06PM +0000, Ghannam, Yazen wrote:
> > This is the same as the other CPER code.
> 
> Dude, turn on brain!
> 
> So if it is in the other CPER code, we should copy it, no matter how
> dumb it is?!?
> 

I think there's value in following the conventions in a subsystem.

I can change this if you give a reason besides "it's dumb". This could apply
to the other CPER code also.

> > Also, the spec allows platform-defined GUIDs. So we should always print
> this
> > even if the GUID is not defined by the spec.
> 
> We need to have a way to map the GUID to a hw part. Dumb numbers mean
> shit
> because the error record is worthless.
> 

We do map the spec-defined GUIDs in patch 4 of this set. I don't know if there's
a central place where all vendor-defined GUIDs are listed. I can look into this.

> > The Check Information will be decoded further in another patch.
> >
> > I don't think there's much else to decode for the ID fields.
> 
> Again, those numbers can't help decoding the error, no need to dump
> them. Or we find a way to make sense out of that info.
> 

Which numbers? The Check Information? Like I said, that's decoded in another
patch, patches 5, 6, and 7.

And the raw value should still be printed because
1) It may represent a type that we can't decode. Maybe a type that's not part of
the spec.
2) It's good to have the raw value for reference. We do this with MCA_STATUS
where we print the raw value followed by the decoding.

> > Other tables may have the same fields but the offsets are usually different.
> > So it may be more trouble than it's worth trying to unify the different
> tables.
> 
> If the structs are the same, you can use generic functions for dumping -
> the offsets are meaningless then.
> 

The structs are all different even though some fields may be the same.

Thanks,
Yazen

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 3/8] efi: Decode IA32/X64 Processor Error Info Structure
  2018-02-27 17:46         ` Ghannam, Yazen
@ 2018-02-27 18:02           ` Borislav Petkov
  2018-02-27 18:40             ` Ghannam, Yazen
  0 siblings, 1 reply; 41+ messages in thread
From: Borislav Petkov @ 2018-02-27 18:02 UTC (permalink / raw)
  To: Ghannam, Yazen; +Cc: linux-efi, linux-kernel, ard.biesheuvel, x86

On Tue, Feb 27, 2018 at 05:46:54PM +0000, Ghannam, Yazen wrote:
> I think there's value in following the conventions in a subsystem.

"conventions in a subsystem" my ass. That's brainless copy-pasting.

It was added by

f6edea77c8c8 ("ACPI, APEI, CPER: Cleanup CPER memory error output format")

and then replicated everywhere.

It is simply a dumb way of writing:

	snprintf(newpfx, sizeof(newpfx), "%s ", pfx);


> I can change this if you give a reason besides "it's dumb".

Two can play that game: you get to keep it if you give a good reason
why.

> We do map the spec-defined GUIDs in patch 4 of this set. I don't know if there's
> a central place where all vendor-defined GUIDs are listed. I can look into this.

Yes, at least for the most prominent ones.

> And the raw value should still be printed because
> 1) It may represent a type that we can't decode. Maybe a type that's not part of
> the spec.

If we can't decode it, *then* you dump it:

	"Unrecognized type: 0x%llx ..."

> 2) It's good to have the raw value for reference. We do this with MCA_STATUS
> where we print the raw value followed by the decoding.

1. No one stares at the raw value if the bits are decoded
2. MCA_STATUS is one register - this error record is huge.

> The structs are all different even though some fields may be the same.

Fair enough. Only if it makes sense.

-- 
Regards/Gruss,
    Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
-- 

^ permalink raw reply	[flat|nested] 41+ messages in thread

* RE: [PATCH v2 2/8] efi: Decode IA32/X64 Processor Error Section
  2018-02-27 17:44           ` Borislav Petkov
@ 2018-02-27 18:06             ` Ghannam, Yazen
  2018-02-27 18:34               ` Borislav Petkov
  0 siblings, 1 reply; 41+ messages in thread
From: Ghannam, Yazen @ 2018-02-27 18:06 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: linux-efi, linux-kernel, ard.biesheuvel, x86

> -----Original Message-----
> From: Borislav Petkov [mailto:bp@suse.de]
> Sent: Tuesday, February 27, 2018 12:45 PM
> To: Ghannam, Yazen <Yazen.Ghannam@amd.com>
> Cc: linux-efi@vger.kernel.org; linux-kernel@vger.kernel.org;
> ard.biesheuvel@linaro.org; x86@kernel.org
> Subject: Re: [PATCH v2 2/8] efi: Decode IA32/X64 Processor Error Section
> 
> On Tue, Feb 27, 2018 at 05:27:39PM +0000, Ghannam, Yazen wrote:
> > Sure, we can print the fields unconditionally if Ard is okay with that.
> >
> > The issue is that the x86 behavior will be different from all the others, so
> that
> > might be confusing.
> 
> Confusing for whom?
> 
> Are we sharing tools with other arches or what am I missing?
> 

It's not just about arches but record types. A single platform can report
using arch-specific records, memory records, PCIe records, etc.

So all the other record types only show fields with a valid bit set and this
has always been the case. There may be users or tools who expect that
same behavior.

> > This set does decode everything fully so that people can read the error.
> 
> No, it doesn't. It dumps
> 
>         printk("%sValidation Bits: 0x%016llx\n", pfx, proc->validation_bits);
> 
>         printk("%sError Structure Type: %pUl\n", newpfx,
>                 &err_info->err_type);
> 
>         printk("%sCheck Information: 0x%016llx\n", newpfx,
>                  err_info->check_info);
> 
> and so on which are half-baked.
> 
> Think of it this way: if you look at the error record and you still need
> to look at the spec to decode it, then it is still not good enough.
> 
> Users don't care about validation bits, or unreadable GUIDs or WTF is
> "Check Information" even?
> 
> "This section details the layout of the Processor Error Information
> Structure and the detailed check information which is contained within."
> 
> And that "Check Information" thing is mentioned only twice in the whole
> spec.
> 
> StructureErrorType only there in the table.
> 
> Very informative that.
> 
> So no, users don't care about any of that internal crap - they wanna
> know what is wrong with their box and that should be written in plain
> english.
> 

Please see the other patches where these are decoded further. As I
mentioned the cover letter, the decoding is applied incrementally rather
than having one large patch.

Also, like I said in another thread, we should always print the raw value
followed by the decoding. That way the raw info is there in case a user
wants to send the data to the vendor or other expert party. 

> > I already mentioned in the Context info patch that there could be
> > further decoding which we can do in the future.
> 
> Then decode only those pieces fully now which you can do now and when
> you add something in the future, add it then with the full decoding
> functionality. No fields which need additional decoding with the spec
> opened on one side.
> 

Okay.

Thanks,
Yazen

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 2/8] efi: Decode IA32/X64 Processor Error Section
  2018-02-27 18:06             ` Ghannam, Yazen
@ 2018-02-27 18:34               ` Borislav Petkov
  0 siblings, 0 replies; 41+ messages in thread
From: Borislav Petkov @ 2018-02-27 18:34 UTC (permalink / raw)
  To: Ghannam, Yazen; +Cc: linux-efi, linux-kernel, ard.biesheuvel, x86

On Tue, Feb 27, 2018 at 06:06:21PM +0000, Ghannam, Yazen wrote:
> It's not just about arches but record types. A single platform can report
> using arch-specific records, memory records, PCIe records, etc.
> 
> So all the other record types only show fields with a valid bit set and this
> has always been the case. There may be users or tools who expect that
> same behavior.

You know we have this thing called tracepoints for that, don't you?

You define one tracepoint which regurgitates an error record and then
all arches which have the same table, share them.

> Please see the other patches where these are decoded further. As I
> mentioned the cover letter, the decoding is applied incrementally rather
> than having one large patch.

Here's what needs to go:

All the validation bits crap:

	printk("%sValidation Bits: 0x%04x\n", pfx, validation_bits);

Instead, you simply dump all entries unconditionally and say whether
they're valid or not. This way you have the same format for each error
record - much easier to work with.

	printk("%sCPUID Info:\n", pfx);
        print_hex_dump(pfx, "", DUMP_PREFIX_OFFSET, 16, 4, proc->cpuid,
                       sizeof(proc->cpuid), 0);

Frankly, I have no clue why we should even dump CPUID for *every* error.
It is completely sufficient to dump family/model/stepping like we do in
mce_amd.c. Besides, when debugging, you collect much more info from the
system - CPUID only is not enough.

	printk("%sError Structure Type: %pUl\n", newpfx,
               &err_info->err_type);

Unreadable GUIDs.

What are those even:

                if (err_info->validation_bits & INFO_VALID_TARGET_ID) {
                        printk("%sTarget Identifier: 0x%016llx\n",
                               newpfx, err_info->target_id);
                }

                if (err_info->validation_bits & INFO_VALID_REQUESTOR_ID) {
                        printk("%sRequestor Identifier: 0x%016llx\n",
                               newpfx, err_info->requestor_id);
                }

                if (err_info->validation_bits & INFO_VALID_RESPONDER_ID) {
                        printk("%sResponder Identifier: 0x%016llx\n",
                               newpfx, err_info->responder_id);
                }

some 8-byte ids? Who knows... Either remove them or make them
human-readable, *explaining* what the requestor is and the target is and
so on.

                printk("%sRegister Array Size: 0x%04x\n", newpfx,
                       ctx_info->reg_arr_size);

I have no clue what that is for.

                if (ctx_info->reg_ctx_type == CTX_TYPE_MSR) {
                        groupsize = 8; /* MSRs are 8 bytes wide. */
                        printk("%sMSR Address: 0x%08x\n", newpfx,
                               ctx_info->msr_addr);
                }

What do I need the MSR *addresses* for?

                if (ctx_info->reg_ctx_type == CTX_TYPE_MMREG) {
                        printk("%sMM Register Address: 0x%016llx\n", newpfx,
                               ctx_info->mm_reg_addr);
                }

memory mapped registers?!?!

> Also, like I said in another thread, we should always print the raw value
> followed by the decoding. That way the raw info is there in case a user
> wants to send the data to the vendor or other expert party.

By that logic we shouldn't be decoding at all - we should be dumping a
fat hex blob.

Again, there's this thing called tracepoints which have exactly that,
you know. You can dump human readable info to dmesg and dump a whole raw
record into the tracepoint.

When the error is critical and we're about to die, the tracepoint
contents goes to dmesg too.

-- 
Regards/Gruss,
    Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
-- 

^ permalink raw reply	[flat|nested] 41+ messages in thread

* RE: [PATCH v2 3/8] efi: Decode IA32/X64 Processor Error Info Structure
  2018-02-27 18:02           ` Borislav Petkov
@ 2018-02-27 18:40             ` Ghannam, Yazen
  2018-02-27 19:09               ` Borislav Petkov
  0 siblings, 1 reply; 41+ messages in thread
From: Ghannam, Yazen @ 2018-02-27 18:40 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: linux-efi, linux-kernel, ard.biesheuvel, x86

> -----Original Message-----
> From: Borislav Petkov [mailto:bp@suse.de]
> Sent: Tuesday, February 27, 2018 1:03 PM
> To: Ghannam, Yazen <Yazen.Ghannam@amd.com>
> Cc: linux-efi@vger.kernel.org; linux-kernel@vger.kernel.org;
> ard.biesheuvel@linaro.org; x86@kernel.org
> Subject: Re: [PATCH v2 3/8] efi: Decode IA32/X64 Processor Error Info
> Structure
> 
> On Tue, Feb 27, 2018 at 05:46:54PM +0000, Ghannam, Yazen wrote:
> > I think there's value in following the conventions in a subsystem.
> 
> "conventions in a subsystem" my ass. That's brainless copy-pasting.
> 
> It was added by
> 
> f6edea77c8c8 ("ACPI, APEI, CPER: Cleanup CPER memory error output
> format")
> 
> and then replicated everywhere.
> 
> It is simply a dumb way of writing:
> 
> 	snprintf(newpfx, sizeof(newpfx), "%s ", pfx);
> 

There's a space after the %s, right? I missed it at first glance, though maybe
my LASIK didn't take very well.

> 
> > I can change this if you give a reason besides "it's dumb".
> 
> Two can play that game: you get to keep it if you give a good reason
> why.
> 

Code readability.

...
> > And the raw value should still be printed because
> > 1) It may represent a type that we can't decode. Maybe a type that's not
> part of
> > the spec.
> 
> If we can't decode it, *then* you dump it:
> 
> 	"Unrecognized type: 0x%llx ..."
> 
> > 2) It's good to have the raw value for reference. We do this with
> MCA_STATUS
> > where we print the raw value followed by the decoding.
> 
> 1. No one stares at the raw value if the bits are decoded
> 2. MCA_STATUS is one register - this error record is huge.
> 

1) No one except debug and HW design folks, who will eventually get a
report from a user.
2) You're right, the record is huge. Printing out the Validation Bits, GUID,
and raw Check Info will be an extra 5 lines.

I posted an example at the end.

We could get rid of the GUID for known types like you suggest. Also, we
can drop the Validation Bits for the Check Info since that's already part of
it.

Thanks,
Yazen

[    1.990948] [Hardware Error]:  Error 1, type: corrected 
[    1.995789] [Hardware Error]:  fru_text: ProcessorError 
[    2.000632] [Hardware Error]:   section_type: IA32/X64 processor error 
[    2.005796] [Hardware Error]:   Validation Bits: 0x0000000000000207 
[    2.010953] [Hardware Error]:   Local APIC_ID: 0x0 
[    2.015991] [Hardware Error]:   CPUID Info: 
[    2.020747] [Hardware Error]:   00000000: 00800f12 00000000 00400800 00000000 
[    2.025595] [Hardware Error]:   00000010: 76d8320b 00000000 178bfbff 00000000 
[    2.030423] [Hardware Error]:   00000020: 00000000 00000000 00000000 00000000 
[    2.035198] [Hardware Error]:   Error Information Structure 0:
[    2.039961] [Hardware Error]:    Error Structure Type: a55701f5-e3ef-43de-ac72-249b573fad2c
[    2.049608] [Hardware Error]:    Error Structure Type: cache error
[    2.054344] [Hardware Error]:    Validation Bits: 0x0000000000000001
[    2.059046] [Hardware Error]:    Check Information: 0x0000000020540087
[    2.063625] [Hardware Error]:     Validation Bits: 0x0087
[    2.068032] [Hardware Error]:     Transaction Type: 0, Instruction
[    2.072423] [Hardware Error]:     Operation: 5, instruction fetch
[    2.076776] [Hardware Error]:     Level: 1
[    2.081073] [Hardware Error]:     Overflow: true
[    2.085360] [Hardware Error]:   Context Information Structure 0:
[    2.089661] [Hardware Error]:    Register Context Type: MSR Registers (Machine Check and other MSRs)
[    2.098487] [Hardware Error]:    Register Array Size: 0x0050
[    2.103113] [Hardware Error]:    MSR Address: 0xc0002011
[    2.107742] [Hardware Error]:    Register Array:
[    2.112270] [Hardware Error]:    00000000: d8200000000a0151 0000000000000000
[    2.116845] [Hardware Error]:    00000010: d010000000000000 0000000300000031
[    2.121228] [Hardware Error]:    00000020: 000100b000000000 000000004a000000
[    2.125514] [Hardware Error]:    00000030: 0000000000000000 0000000000000000
[    2.129747] [Hardware Error]:    00000040: 0000000000000000 0000000000000000

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 3/8] efi: Decode IA32/X64 Processor Error Info Structure
  2018-02-27 18:40             ` Ghannam, Yazen
@ 2018-02-27 19:09               ` Borislav Petkov
  2018-02-27 21:32                 ` Ghannam, Yazen
  0 siblings, 1 reply; 41+ messages in thread
From: Borislav Petkov @ 2018-02-27 19:09 UTC (permalink / raw)
  To: Ghannam, Yazen; +Cc: linux-efi, linux-kernel, ard.biesheuvel, x86, Tony Luck

On Tue, Feb 27, 2018 at 06:40:13PM +0000, Ghannam, Yazen wrote:
> Code readability.

Bullshit!

You can't tell me this:

	snprintf(newpfx, sizeof(newpfx), "%s ", pfx);

is less readable than:

	 snprintf(newpfx, sizeof(newpfx), "%s%s", pfx, INDENT_SP);

> 1) No one except debug and HW design folks, who will eventually get a
> report from a user.

Hahahha, yeah right.

The only people who get those reports are the maintainers of the code in
the kernel and the distro people who get all the bugs assigned to them.

And if they can't decode the error - it is Tony and me.

HW folks hear about it from us. And we go and decode the damn crap
*every* time. Do you catch my drift now?

> [    1.990948] [Hardware Error]:  Error 1, type: corrected 
> [    1.995789] [Hardware Error]:  fru_text: ProcessorError 
> [    2.000632] [Hardware Error]:   section_type: IA32/X64 processor error 
> [    2.005796] [Hardware Error]:   Validation Bits: 0x0000000000000207 
> [    2.010953] [Hardware Error]:   Local APIC_ID: 0x0 
> [    2.015991] [Hardware Error]:   CPUID Info: 
> [    2.020747] [Hardware Error]:   00000000: 00800f12 00000000 00400800 00000000 
> [    2.025595] [Hardware Error]:   00000010: 76d8320b 00000000 178bfbff 00000000 
> [    2.030423] [Hardware Error]:   00000020: 00000000 00000000 00000000 00000000 
> [    2.035198] [Hardware Error]:   Error Information Structure 0:
> [    2.039961] [Hardware Error]:    Error Structure Type: a55701f5-e3ef-43de-ac72-249b573fad2c
> [    2.049608] [Hardware Error]:    Error Structure Type: cache error
> [    2.054344] [Hardware Error]:    Validation Bits: 0x0000000000000001
> [    2.059046] [Hardware Error]:    Check Information: 0x0000000020540087
> [    2.063625] [Hardware Error]:     Validation Bits: 0x0087
> [    2.068032] [Hardware Error]:     Transaction Type: 0, Instruction
> [    2.072423] [Hardware Error]:     Operation: 5, instruction fetch
> [    2.076776] [Hardware Error]:     Level: 1
> [    2.081073] [Hardware Error]:     Overflow: true
> [    2.085360] [Hardware Error]:   Context Information Structure 0:
> [    2.089661] [Hardware Error]:    Register Context Type: MSR Registers (Machine Check and other MSRs)
> [    2.098487] [Hardware Error]:    Register Array Size: 0x0050
> [    2.103113] [Hardware Error]:    MSR Address: 0xc0002011
> [    2.107742] [Hardware Error]:    Register Array:
> [    2.112270] [Hardware Error]:    00000000: d8200000000a0151 0000000000000000
> [    2.116845] [Hardware Error]:    00000010: d010000000000000 0000000300000031
> [    2.121228] [Hardware Error]:    00000020: 000100b000000000 000000004a000000
> [    2.125514] [Hardware Error]:    00000030: 0000000000000000 0000000000000000
> [    2.129747] [Hardware Error]:    00000040: 0000000000000000 0000000000000000

Lemme simplify that error record:

[Hardware Error]:  Corrected Processor Error
[Hardware Error]:   APIC_ID: 0x0 | CPUID: 0x17|0x1|0x2
[Hardware Error]:    Type: cache error during instruction fetch
[Hardware Error]:    cache level 1
[Hardware Error]:    Overflow: true

See how much more readable it got! And it is only 5 lines. I can make it
even smaller.

If it were a critical, uncorrectable error, every line counts: imagine
you do the above fat record and the machine freezes at line 5.

Now, I admit that my vesion of the record is not enough to debug it
but it needs to contain only information which is clear and humanly
readable to debug. You can always dump the raw data underneath from the
tracepoint but make the beginning human readable.

Do you know what users say about your error record?

"Err, it says hardware error, is my machine broken? I need to replace my
CPU."

I read that on a weekly basis.

Do you know how expensive support calls are about such errors which are
completely unreadable to people? 20 engineers need to get on a call to
realize it was a dumb correctable error? Btw, this is one of the reasons
why we did the error collector.

So put yourself in the users' shoes, look at the error record and think
hard whether the information displayed is readable to humans.

Btw, decode_error_status() in mce_amd.c is an attempt to explain the
error severity - note the "no action required." thing. It is still not
good enough - people still throw hands in the air and run in headless
chicken mode.

-- 
Regards/Gruss,
    Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
-- 

^ permalink raw reply	[flat|nested] 41+ messages in thread

* RE: [PATCH v2 3/8] efi: Decode IA32/X64 Processor Error Info Structure
  2018-02-27 19:09               ` Borislav Petkov
@ 2018-02-27 21:32                 ` Ghannam, Yazen
  2018-02-27 22:10                   ` Borislav Petkov
  0 siblings, 1 reply; 41+ messages in thread
From: Ghannam, Yazen @ 2018-02-27 21:32 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: linux-efi, linux-kernel, ard.biesheuvel, x86, Tony Luck

> -----Original Message-----
> From: Borislav Petkov [mailto:bp@suse.de]
> Sent: Tuesday, February 27, 2018 2:10 PM
> To: Ghannam, Yazen <Yazen.Ghannam@amd.com>
> Cc: linux-efi@vger.kernel.org; linux-kernel@vger.kernel.org;
> ard.biesheuvel@linaro.org; x86@kernel.org; Tony Luck
> <tony.luck@intel.com>
> Subject: Re: [PATCH v2 3/8] efi: Decode IA32/X64 Processor Error Info
> Structure
> 
...
> > 1) No one except debug and HW design folks, who will eventually get a
> > report from a user.
> 
> Hahahha, yeah right.
> 
> The only people who get those reports are the maintainers of the code in
> the kernel and the distro people who get all the bugs assigned to them.
> 
> And if they can't decode the error - it is Tony and me.
> 
> HW folks hear about it from us. And we go and decode the damn crap
> *every* time. Do you catch my drift now?
> 

That's not true. You may get reports but so do platform and silicon vendors
directly.

> > [    1.990948] [Hardware Error]:  Error 1, type: corrected
> > [    1.995789] [Hardware Error]:  fru_text: ProcessorError
> > [    2.000632] [Hardware Error]:   section_type: IA32/X64 processor error
> > [    2.005796] [Hardware Error]:   Validation Bits: 0x0000000000000207
> > [    2.010953] [Hardware Error]:   Local APIC_ID: 0x0
> > [    2.015991] [Hardware Error]:   CPUID Info:
> > [    2.020747] [Hardware Error]:   00000000: 00800f12 00000000 00400800
> 00000000
> > [    2.025595] [Hardware Error]:   00000010: 76d8320b 00000000 178bfbff
> 00000000
> > [    2.030423] [Hardware Error]:   00000020: 00000000 00000000 00000000
> 00000000
> > [    2.035198] [Hardware Error]:   Error Information Structure 0:
> > [    2.039961] [Hardware Error]:    Error Structure Type: a55701f5-e3ef-
> 43de-ac72-249b573fad2c
> > [    2.049608] [Hardware Error]:    Error Structure Type: cache error
> > [    2.054344] [Hardware Error]:    Validation Bits: 0x0000000000000001
> > [    2.059046] [Hardware Error]:    Check Information: 0x0000000020540087
> > [    2.063625] [Hardware Error]:     Validation Bits: 0x0087
> > [    2.068032] [Hardware Error]:     Transaction Type: 0, Instruction
> > [    2.072423] [Hardware Error]:     Operation: 5, instruction fetch
> > [    2.076776] [Hardware Error]:     Level: 1
> > [    2.081073] [Hardware Error]:     Overflow: true
> > [    2.085360] [Hardware Error]:   Context Information Structure 0:
> > [    2.089661] [Hardware Error]:    Register Context Type: MSR Registers
> (Machine Check and other MSRs)
> > [    2.098487] [Hardware Error]:    Register Array Size: 0x0050
> > [    2.103113] [Hardware Error]:    MSR Address: 0xc0002011
> > [    2.107742] [Hardware Error]:    Register Array:
> > [    2.112270] [Hardware Error]:    00000000: d8200000000a0151
> 0000000000000000
> > [    2.116845] [Hardware Error]:    00000010: d010000000000000
> 0000000300000031
> > [    2.121228] [Hardware Error]:    00000020: 000100b000000000
> 000000004a000000
> > [    2.125514] [Hardware Error]:    00000030: 0000000000000000
> 0000000000000000
> > [    2.129747] [Hardware Error]:    00000040: 0000000000000000
> 0000000000000000
> 
> Lemme simplify that error record:
> 
> [Hardware Error]:  Corrected Processor Error
> [Hardware Error]:   APIC_ID: 0x0 | CPUID: 0x17|0x1|0x2
> [Hardware Error]:    Type: cache error during instruction fetch
> [Hardware Error]:    cache level 1
> [Hardware Error]:    Overflow: true
> 
> See how much more readable it got! And it is only 5 lines. I can make it
> even smaller.
> 

Not much more readable. It's still vague and confusing to a user and devoid
of any real info so an expert can't help. And now the information is printed
arbitrarily, so someone needs to read the source to figure out what it really
means.

> If it were a critical, uncorrectable error, every line counts: imagine
> you do the above fat record and the machine freezes at line 5.
> 

Maybe. But these records are generated by Platform Firmware. Why would
FW report the error knowing the system is about to die? Most likely we'll
only see CPERs in HEST if the OS can do something, or we'll see them in
BERT after recovering from a hang/reset.

> Now, I admit that my vesion of the record is not enough to debug it
> but it needs to contain only information which is clear and humanly
> readable to debug. You can always dump the raw data underneath from the
> tracepoint but make the beginning human readable.
> 
> Do you know what users say about your error record?
> 
> "Err, it says hardware error, is my machine broken? I need to replace my
> CPU."
> 

Your example still says "Hardware Error" and odds are general users won't
understand what the error type means or what a corrected error is. So it's
not much better.

> I read that on a weekly basis.
> 
> Do you know how expensive support calls are about such errors which are
> completely unreadable to people? 20 engineers need to get on a call to
> realize it was a dumb correctable error? Btw, this is one of the reasons
> why we did the error collector.
> 

Exactly! The more info available (usually) the more quickly it can be
diagnosed.

Hardware errors are generally rare and hard to reproduce. So when one
does occur we should capture the data and provide it.

Here are a couple of scenarios based on similar experiences I've had:

Scenario 1 (with minimal info)
User: "I have a hardware error. What does it mean?"

Debugger: "Do you have any more info?"

User: "Corrected Processor Error; cache error during instruction fetch"

Debugger: "'Corrected' sounds okay, but not sure about 'cache error during
fetch'. Linux dev, what does that mean?"

Linux dev: "It means this field and that field equal this."

Debugger: "What about the rest of the data?"

Linux dev: "It's in a trace buffer. Please use this tool or set this sysfs file to
this and that to collect the data."

Debugger: "There aren't any errors in the trace buffer. User, when did
this occur?"

User: "This error occurred last week. We've reset the machine a few times
since then. The error seems to happen every few days under a very
specific workload."

Debugger: "Okay, let's try to reproduce this."
*wait a few days, run a lot of systems to up the odds*

Debugger: "We finally have data and can now help you"

Scenario 2 (with all platform provided data)
User: "I have a hardware error. What does it mean?"

Debugger: "Do you have any more info?"

User: *provides complete error record*

Debugger: "Thank you, we have everything we need to help you."


I'll send a V3 set with the following changes:
1) Fix table numbering in commit messages.
2) Remove "Validation Bits" lines.
3) Only print error type GUID for unknown types.

I think this set should focus on printing the x86 CPER based on the UEFI
spec and the convention of the other CPER code. CPERs are generated
by Platform Firmware. So errors are explicitly intended to be viewed by
the user and all info should be displayed. Other errors not intended to
be seen by the user may be thresholded or masked by the Platform.

I *have* been thinking that it would be nice to take the CPER and pipe it
through the MCA decoding in arch/x86 and EDAC. This would be really
nice for when the CPER includes the MCA registers in the Context info.
So we'd get our usual MCA decoding instead of a binary blob of registers.

I was thinking that the MCA decoding would be in addition to this. But
based on Boris's comments, maybe we can make it a default selection.
For example, if MCA/EDAC decoding is available, use it. Otherwise, print
the CPER fields in a generic way like we do for the other CPER types.

That would be a separate set though.

Thanks,
Yazen

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 3/8] efi: Decode IA32/X64 Processor Error Info Structure
  2018-02-27 21:32                 ` Ghannam, Yazen
@ 2018-02-27 22:10                   ` Borislav Petkov
  0 siblings, 0 replies; 41+ messages in thread
From: Borislav Petkov @ 2018-02-27 22:10 UTC (permalink / raw)
  To: Ghannam, Yazen; +Cc: linux-efi, linux-kernel, ard.biesheuvel, x86, Tony Luck

On Tue, Feb 27, 2018 at 09:32:18PM +0000, Ghannam, Yazen wrote:
> Not much more readable. It's still vague and confusing to a user and devoid
> of any real info so an expert can't help. And now the information is printed
> arbitrarily, so someone needs to read the source to figure out what it really
> means.

WTF? You need to read the source to figure out what the error is? So
"Corrected Processor Error" is confusing? I think you've been clearly
staring at the spec for waaay too long.

Also, read my mail again:

"Now, I admit that my vesion of the record is not enough to debug it
but it needs to contain only information which is clear and humanly
readable to debug. You can always dump the raw data underneath from the
tracepoint but make the beginning human readable."

> Maybe. But these records are generated by Platform Firmware. Why would
> FW report the error knowing the system is about to die?

WTF more! Dude, are you kidding me?

So the firmware should not report the error if it knows the system is
about to die?!?!

Now you're just making up insane counterarguments, just because.

> Your example still says "Hardware Error"

Oh my, that's the *prefix*.

> and odds are general users won't understand what the error type means
> or what a corrected error is. So it's not much better.

Yeah, and the next thing you'll say is that users won't understand what
"error" means, right?

Geez.

> Exactly! The more info available (usually) the more quickly it can be
> diagnosed.

You still don't understand what I'm trying to explain to you.

It is *not* about diagnosing it - in order to do that you need to
involve people to diagnose it. It is about making the error record as
human readable as possible so that you don't *have* to involve people to
diagnose it in the first place and the user can say, ah ok, corrected
error, no need to do anything.

Or "System Fatal error" - I better replace that part.

> Hardware errors are generally rare and hard to reproduce.

Except when they're not like DRAM ECC floods which are pretty easy to
reproduce.

> So when one does occur we should capture the data and provide it.

Did I say you should not do that?!

I said: make it as human readable as possible and dump the gory hex crap
after it.

> Here are a couple of scenarios based on similar experiences I've had:

Now play that same scenario with the following record format:

[ human readable error record ]
[ full raw error dump ]

> I'll send a V3 set with the following changes:
> 1) Fix table numbering in commit messages.
> 2) Remove "Validation Bits" lines.
> 3) Only print error type GUID for unknown types.
> 
> I think this set should focus on printing the x86 CPER based on the UEFI
> spec and the convention of the other CPER code. CPERs are generated
> by Platform Firmware. So errors are explicitly intended to be viewed by
> the user and all info should be displayed.

You should look up from the spec and realize that real life is much
different.

> I *have* been thinking that it would be nice to take the CPER and pipe it
> through the MCA decoding in arch/x86 and EDAC. This would be really
> nice for when the CPER includes the MCA registers in the Context info.
> So we'd get our usual MCA decoding instead of a binary blob of registers.

That would definitely be a step in the right direction.

> I was thinking that the MCA decoding would be in addition to this. But
> based on Boris's comments, maybe we can make it a default selection.
> For example, if MCA/EDAC decoding is available, use it. Otherwise, print
> the CPER fields in a generic way like we do for the other CPER types.

Yes.

-- 
Regards/Gruss,
    Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
-- 

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 0/8] Decode IA32/X64 CPER
  2018-02-26 19:38 [PATCH v2 0/8] Decode IA32/X64 CPER Yazen Ghannam
                   ` (7 preceding siblings ...)
  2018-02-26 19:39 ` [PATCH v2 8/8] efi: Decode IA32/X64 Context Info structure Yazen Ghannam
@ 2018-02-28  8:43 ` Borislav Petkov
  2018-02-28 15:12   ` Ghannam, Yazen
  2018-03-01 16:38   ` Luck, Tony
  8 siblings, 2 replies; 41+ messages in thread
From: Borislav Petkov @ 2018-02-28  8:43 UTC (permalink / raw)
  To: Yazen Ghannam, Tony Luck; +Cc: linux-efi, linux-kernel, ard.biesheuvel, x86

On Mon, Feb 26, 2018 at 01:38:56PM -0600, Yazen Ghannam wrote:
> From: Yazen Ghannam <yazen.ghannam@amd.com>
> 
> This series adds decoding for the IA32/X64 Common Platform Error Record.

One much more important thing I forgot about yesterday: how is
this thing playing into our RAS reporting, x86 decoding chain, etc
infrastructure?

Is CPER bypassing it completely and the firmware is doing everything
now? I sure hope not.

If not, it needs to tie into our infrastructure and the errors need
to go into the decoding chain where different things look at them and
filter them.

Tony, what are your plans here?

Perhaps we can finally get MCE decoding on Intel too :-)

Thx.

-- 
Regards/Gruss,
    Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
-- 

^ permalink raw reply	[flat|nested] 41+ messages in thread

* RE: [PATCH v2 0/8] Decode IA32/X64 CPER
  2018-02-28  8:43 ` [PATCH v2 0/8] Decode IA32/X64 CPER Borislav Petkov
@ 2018-02-28 15:12   ` Ghannam, Yazen
  2018-02-28 16:35     ` Borislav Petkov
  2018-03-01 16:38   ` Luck, Tony
  1 sibling, 1 reply; 41+ messages in thread
From: Ghannam, Yazen @ 2018-02-28 15:12 UTC (permalink / raw)
  To: Borislav Petkov, Tony Luck; +Cc: linux-efi, linux-kernel, ard.biesheuvel, x86

> -----Original Message-----
> From: Borislav Petkov [mailto:bp@suse.de]
> Sent: Wednesday, February 28, 2018 3:43 AM
> To: Ghannam, Yazen <Yazen.Ghannam@amd.com>; Tony Luck
> <tony.luck@intel.com>
> Cc: linux-efi@vger.kernel.org; linux-kernel@vger.kernel.org;
> ard.biesheuvel@linaro.org; x86@kernel.org
> Subject: Re: [PATCH v2 0/8] Decode IA32/X64 CPER
> 
> On Mon, Feb 26, 2018 at 01:38:56PM -0600, Yazen Ghannam wrote:
> > From: Yazen Ghannam <yazen.ghannam@amd.com>
> >
> > This series adds decoding for the IA32/X64 Common Platform Error Record.
> 
> One much more important thing I forgot about yesterday: how is
> this thing playing into our RAS reporting, x86 decoding chain, etc
> infrastructure?
> 

It doesn't right now.

> Is CPER bypassing it completely and the firmware is doing everything
> now? I sure hope not.
> 

CPER is the format used for BERT, etc. We'll only ever see a CPER if the
firmware creates it. And it's up to firmware policy what is shared with
the OS.

This set adds decoding for the x86 CPER format which will mostly map to
core MCA errors. There's a memory CPER format that DRAM ECC, etc. will
use, so that's not covered here. In other words, common errors like
corrected DRAM ECC won't be reported with this patch set.

Most likely, we'll only see CPERs in BERT because BERT only uses CPERs.
HEST has MCA structures that can be used. Meaning we won't see MCA
errors reported through CPERs during runtime.

So the most common scenario will probably be that the system resets,
firmware found an MCA error during boot, and firmware populates BERT
with a CPER. The error gets printed during OS boot. By the time userspace
is up the error has already be printed. The error is printed as informational
since there isn't any action for the OS or user to take.

My main reason for printing all the info is that it may be too difficult or too
late to gather that info after the fact. I think this is especially true for boot
errors, though maybe there's another way that I don't know about
(re-reading BERT later?).

> If not, it needs to tie into our infrastructure and the errors need
> to go into the decoding chain where different things look at them and
> filter them.
> 

Right. I want to work on getting this more integrated with our existing
x86 infrastructure. But I don't want to wait until we figure all that out
before we have some sort of CPER decoding.

> Tony, what are your plans here?
> 
> Perhaps we can finally get MCE decoding on Intel too :-)
> 

Thanks,
Yazen

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 0/8] Decode IA32/X64 CPER
  2018-02-28 15:12   ` Ghannam, Yazen
@ 2018-02-28 16:35     ` Borislav Petkov
  2018-02-28 20:58       ` Ghannam, Yazen
  0 siblings, 1 reply; 41+ messages in thread
From: Borislav Petkov @ 2018-02-28 16:35 UTC (permalink / raw)
  To: Ghannam, Yazen; +Cc: Tony Luck, linux-efi, linux-kernel, ard.biesheuvel, x86

On Wed, Feb 28, 2018 at 03:12:09PM +0000, Ghannam, Yazen wrote:
> CPER is the format used for BERT, etc. We'll only ever see a CPER if the
> firmware creates it. And it's up to firmware policy what is shared with
> the OS.

Yap, but we should still tie it into our infra.

> My main reason for printing all the info is that it may be too difficult or too
> late to gather that info after the fact. I think this is especially true for boot
> errors, though maybe there's another way that I don't know about
> (re-reading BERT later?).

So that BERT thing is a table, AFAIR. I don't see why it would be a
problem if we read it later, at our leisure and free the record only
when we're done.

Looking at bert.c is entered with a late_initcall() which is nicely late
and we have everything up and ready to process errors then.

> Right. I want to work on getting this more integrated with our existing
> x86 infrastructure. But I don't want to wait until we figure all that out
> before we have some sort of CPER decoding.

Just keep in mind that whenever you expose stuff and userspace starts
using it, it is much harder to change it. So let's do it right pls.

-- 
Regards/Gruss,
    Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
-- 

^ permalink raw reply	[flat|nested] 41+ messages in thread

* RE: [PATCH v2 0/8] Decode IA32/X64 CPER
  2018-02-28 16:35     ` Borislav Petkov
@ 2018-02-28 20:58       ` Ghannam, Yazen
  2018-03-01 11:59         ` Borislav Petkov
  0 siblings, 1 reply; 41+ messages in thread
From: Ghannam, Yazen @ 2018-02-28 20:58 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Tony Luck, linux-efi, linux-kernel, ard.biesheuvel, x86

> -----Original Message-----
> From: Borislav Petkov [mailto:bp@suse.de]
> Sent: Wednesday, February 28, 2018 11:36 AM
> To: Ghannam, Yazen <Yazen.Ghannam@amd.com>
> Cc: Tony Luck <tony.luck@intel.com>; linux-efi@vger.kernel.org; linux-
> kernel@vger.kernel.org; ard.biesheuvel@linaro.org; x86@kernel.org
> Subject: Re: [PATCH v2 0/8] Decode IA32/X64 CPER
> 
> On Wed, Feb 28, 2018 at 03:12:09PM +0000, Ghannam, Yazen wrote:
> > CPER is the format used for BERT, etc. We'll only ever see a CPER if the
> > firmware creates it. And it's up to firmware policy what is shared with
> > the OS.
> 
> Yap, but we should still tie it into our infra.
> 

Okay, so how about this?

1) We keep this set mostly as-is. This would be our fallback if we don't have
anything better.

2) I add the MCA decoding to this set. I was thinking to do this in a separate
set but maybe it's better to do it all together.

Number 2 would mean we do a quick check on the CPER to see if it contains
MCA info. There's no spec-defined way to do this, but we can make a good
guess by seeing if we have an "MSR register" context and that context has
an "MSR address" that is an MCA register.

If we think we have MCA info, then we pull as much out of the CPER as we
can and put it in a struct mce which we then pass to the notifier chain.

If we don't think we have MCA info, then we fallback to number 1.

At the moment, it seems we'll be using x86 CPER to represent MCA errors
in BERT since there's no other option in BERT. So I think having number 2
would catch most, if not all, errors reported with x86 CPER.

Thanks,
Yazen

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 0/8] Decode IA32/X64 CPER
  2018-02-28 20:58       ` Ghannam, Yazen
@ 2018-03-01 11:59         ` Borislav Petkov
  2018-03-23  0:19           ` Ghannam, Yazen
  0 siblings, 1 reply; 41+ messages in thread
From: Borislav Petkov @ 2018-03-01 11:59 UTC (permalink / raw)
  To: Ghannam, Yazen; +Cc: Tony Luck, linux-efi, linux-kernel, ard.biesheuvel, x86

On Wed, Feb 28, 2018 at 08:58:15PM +0000, Ghannam, Yazen wrote:
> 1) We keep this set mostly as-is. This would be our fallback if we don't have
> anything better.

Yes, sounds good. We try to decode it as MCE and if we cannot, we dump
the raw CPER record.

> 2) I add the MCA decoding to this set. I was thinking to do this in a separate
> set but maybe it's better to do it all together.

I'm fine if you do it separately, as long as you do it so that we have
user-friendly decoding in the end.

> Number 2 would mean we do a quick check on the CPER to see if it contains
> MCA info. There's no spec-defined way to do this, but we can make a good
> guess by seeing if we have an "MSR register" context and that context has
> an "MSR address" that is an MCA register.

Yap.

> If we think we have MCA info, then we pull as much out of the CPER as we
> can and put it in a struct mce which we then pass to the notifier chain.
> 
> If we don't think we have MCA info, then we fallback to number 1.

Ack.

> At the moment, it seems we'll be using x86 CPER to represent MCA errors
> in BERT since there's no other option in BERT. So I think having number 2
> would catch most, if not all, errors reported with x86 CPER.

Yeah, if you think about it, CPER is a clumsy and totally useless
indirection layer between MCA and the OS. And if the error is of
different type (AER, PCI, whatever), then it wraps around it too with
some dumb table. And that doesn't bring anything - just the need for
more support added to the OS and tools around it. Basically what you're
doing now.

I don't mind the aspect of firmware seeing the errors first and even
attempting to fix them as the firmware knows the platform intimately but
doing everything in firmware just because some misguided souls think
this gives added value but in reality ends up becoming a worse problem,
is simply the wrong wrong thing to do.

Thx.

-- 
Regards/Gruss,
    Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
-- 

^ permalink raw reply	[flat|nested] 41+ messages in thread

* RE: [PATCH v2 0/8] Decode IA32/X64 CPER
  2018-02-28  8:43 ` [PATCH v2 0/8] Decode IA32/X64 CPER Borislav Petkov
  2018-02-28 15:12   ` Ghannam, Yazen
@ 2018-03-01 16:38   ` Luck, Tony
  1 sibling, 0 replies; 41+ messages in thread
From: Luck, Tony @ 2018-03-01 16:38 UTC (permalink / raw)
  To: Borislav Petkov, Yazen Ghannam
  Cc: linux-efi, linux-kernel, ard.biesheuvel, x86

> One much more important thing I forgot about yesterday: how is
> this thing playing into our RAS reporting, x86 decoding chain, etc
> infrastructure?
>
> Is CPER bypassing it completely and the firmware is doing everything
> now? I sure hope not.

Intel gives OEMs lots of options to catch and tweak the error path in
BIOS SMM code.  So it is possible that some systems may go into
BIOS which would do platform level analysis and create a CPER to
feed to the OS. Hopefully anyone doing that would clear the machine
check bank and suppress CMCI to avoid double reporting the same
error.

> If not, it needs to tie into our infrastructure and the errors need
> to go into the decoding chain where different things look at them and
> filter them.

Good point.

> Tony, what are your plans here?

Nothing solid yet.

> Perhaps we can finally get MCE decoding on Intel too :-)

It's on a mental list ... I should probably add it to our internal
trackers so that we actually do it sooner rather than in some
distant future.

-Tony 

^ permalink raw reply	[flat|nested] 41+ messages in thread

* RE: [PATCH v2 0/8] Decode IA32/X64 CPER
  2018-03-01 11:59         ` Borislav Petkov
@ 2018-03-23  0:19           ` Ghannam, Yazen
  2018-03-23 15:29             ` Borislav Petkov
  0 siblings, 1 reply; 41+ messages in thread
From: Ghannam, Yazen @ 2018-03-23  0:19 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Tony Luck, linux-efi, linux-kernel, ard.biesheuvel, x86

> -----Original Message-----
> From: Borislav Petkov <bp@suse.de>
> Sent: Thursday, March 1, 2018 7:00 AM
> To: Ghannam, Yazen <Yazen.Ghannam@amd.com>
> Cc: Tony Luck <tony.luck@intel.com>; linux-efi@vger.kernel.org; linux-
> kernel@vger.kernel.org; ard.biesheuvel@linaro.org; x86@kernel.org
> Subject: Re: [PATCH v2 0/8] Decode IA32/X64 CPER
> 
> On Wed, Feb 28, 2018 at 08:58:15PM +0000, Ghannam, Yazen wrote:
> > 1) We keep this set mostly as-is. This would be our fallback if we don't have
> > anything better.
> 
> Yes, sounds good. We try to decode it as MCE and if we cannot, we dump
> the raw CPER record.
> 
> > 2) I add the MCA decoding to this set. I was thinking to do this in a separate
> > set but maybe it's better to do it all together.
> 
> I'm fine if you do it separately, as long as you do it so that we have
> user-friendly decoding in the end.
> 

I wanted to add the MCA decoding to this set but I got caught up with something
else and haven't had time to do it.

Is it alright if I go ahead and just send the v3 of this set without the MCA decoding?
I was hoping this could be accepted in the next merge window. And if that's still
possible I'd like to have some CPER decoding sooner rather than nothing.

Thanks,
Yazen

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 0/8] Decode IA32/X64 CPER
  2018-03-23  0:19           ` Ghannam, Yazen
@ 2018-03-23 15:29             ` Borislav Petkov
  0 siblings, 0 replies; 41+ messages in thread
From: Borislav Petkov @ 2018-03-23 15:29 UTC (permalink / raw)
  To: Ghannam, Yazen; +Cc: Tony Luck, linux-efi, linux-kernel, ard.biesheuvel, x86

On Fri, Mar 23, 2018 at 12:19:20AM +0000, Ghannam, Yazen wrote:
> Is it alright if I go ahead and just send the v3 of this set without the MCA decoding?

Sure, but let's not drop the MCA decoding from the todo list. :)

Thx.

-- 
Regards/Gruss,
    Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
-- 

^ permalink raw reply	[flat|nested] 41+ messages in thread

end of thread, other threads:[~2018-03-23 15:29 UTC | newest]

Thread overview: 41+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-02-26 19:38 [PATCH v2 0/8] Decode IA32/X64 CPER Yazen Ghannam
2018-02-26 19:38 ` [PATCH v2 1/8] efi: Fix IA32/X64 Processor Error Record definition Yazen Ghannam
2018-02-27 10:47   ` Borislav Petkov
2018-02-27 15:05     ` Ghannam, Yazen
2018-02-26 19:38 ` [PATCH v2 2/8] efi: Decode IA32/X64 Processor Error Section Yazen Ghannam
2018-02-27 11:22   ` Borislav Petkov
2018-02-27 15:13     ` Ghannam, Yazen
2018-02-27 17:00       ` Borislav Petkov
2018-02-27 17:27         ` Ghannam, Yazen
2018-02-27 17:44           ` Borislav Petkov
2018-02-27 18:06             ` Ghannam, Yazen
2018-02-27 18:34               ` Borislav Petkov
2018-02-26 19:38 ` [PATCH v2 3/8] efi: Decode IA32/X64 Processor Error Info Structure Yazen Ghannam
2018-02-27 14:25   ` Borislav Petkov
2018-02-27 15:25     ` Ghannam, Yazen
2018-02-27 17:04       ` Borislav Petkov
2018-02-27 17:46         ` Ghannam, Yazen
2018-02-27 18:02           ` Borislav Petkov
2018-02-27 18:40             ` Ghannam, Yazen
2018-02-27 19:09               ` Borislav Petkov
2018-02-27 21:32                 ` Ghannam, Yazen
2018-02-27 22:10                   ` Borislav Petkov
2018-02-26 19:39 ` [PATCH v2 4/8] efi: Decode UEFI-defined IA32/X64 Error Structure GUIDs Yazen Ghannam
2018-02-27 14:30   ` Borislav Petkov
2018-02-27 15:28     ` Ghannam, Yazen
2018-02-27 15:31       ` Ard Biesheuvel
2018-02-26 19:39 ` [PATCH v2 5/8] efi: Decode IA32/X64 Cache, TLB, and Bus Check structures Yazen Ghannam
2018-02-27 15:03   ` Borislav Petkov
2018-02-27 15:33     ` Ghannam, Yazen
2018-02-27 17:06       ` Borislav Petkov
2018-02-26 19:39 ` [PATCH v2 6/8] efi: Decode additional IA32/X64 Bus Check fields Yazen Ghannam
2018-02-26 19:39 ` [PATCH v2 7/8] efi: Decode IA32/X64 MS Check structure Yazen Ghannam
2018-02-26 19:39 ` [PATCH v2 8/8] efi: Decode IA32/X64 Context Info structure Yazen Ghannam
2018-02-28  8:43 ` [PATCH v2 0/8] Decode IA32/X64 CPER Borislav Petkov
2018-02-28 15:12   ` Ghannam, Yazen
2018-02-28 16:35     ` Borislav Petkov
2018-02-28 20:58       ` Ghannam, Yazen
2018-03-01 11:59         ` Borislav Petkov
2018-03-23  0:19           ` Ghannam, Yazen
2018-03-23 15:29             ` Borislav Petkov
2018-03-01 16:38   ` Luck, Tony

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).