Linux-EDAC Archive on lore.kernel.org
 help / color / Atom feed
* [PATCH v3 0/2] Decode raw MSR values of MCA registers in BERT
@ 2020-09-03 23:45 Smita Koralahalli
  2020-09-03 23:45 ` [PATCH v3 1/2] cper, apei, mce: Pass x86 CPER through the MCA handling chain Smita Koralahalli
  2020-09-03 23:45 ` [PATCH v3 2/2] x86/mce/dev-mcelog: Fix updating kflags in AMD systems Smita Koralahalli
  0 siblings, 2 replies; 8+ messages in thread
From: Smita Koralahalli @ 2020-09-03 23:45 UTC (permalink / raw)
  To: x86, linux-kernel, linux-pm, linux-edac, linux-efi, linux-acpi, devel
  Cc: Borislav Petkov, Tony Luck, Rafael J . Wysocki, Len Brown,
	Ard Biesheuvel, Yazen Ghannam, Smita Koralahalli

This series provides better decoding for SMCA specific raw MSR values in
BERT reported using x86 Processor Error Common Platform Error Record
(CPER) format.

Patch 1 extracts the raw MSR values of MCA registers in BERT and passes
it through the MCA handling chain.

Patch 2 provides a fix of missing error logs as observed in Patch 1.

Patch v2 provides a fix for the build error observed in Patch v1.
As reported by kernel test robot, Patch 1 failed stating undefined
reference to arch_apei_report_x86_error() in function cper_print_proc_ia.
The failure is noticed when CONFIG_ACPI_APEI is not compiled into the
kernel. Fix the error by returning error code if CONFIG_ACPI_APEI is not
installed into the kernel.

Patch v3 separates arch specific declarations and moves them into their
respective arch specific header files. Also, adds additional checks for
context type and register layout.

Link:
https://lkml.kernel.org/r/20200828203332.11129-1-Smita.KoralahalliChannabasappa@amd.com

Smita Koralahalli (2):
  cper, apei, mce: Pass x86 CPER through the MCA handling chain
  x86/mce/dev-mcelog: Fix updating kflags in AMD systems

 arch/x86/include/asm/acpi.h          | 11 ++++++++
 arch/x86/include/asm/mce.h           |  3 ++
 arch/x86/kernel/acpi/apei.c          |  9 ++++++
 arch/x86/kernel/cpu/mce/apei.c       | 42 ++++++++++++++++++++++++++++
 arch/x86/kernel/cpu/mce/dev-mcelog.c |  4 ++-
 drivers/firmware/efi/cper-x86.c      | 10 +++++--
 6 files changed, 75 insertions(+), 4 deletions(-)

-- 
2.17.1


^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH v3 1/2] cper, apei, mce: Pass x86 CPER through the MCA handling chain
  2020-09-03 23:45 [PATCH v3 0/2] Decode raw MSR values of MCA registers in BERT Smita Koralahalli
@ 2020-09-03 23:45 ` Smita Koralahalli
  2020-09-14 15:30   ` Borislav Petkov
  2020-09-03 23:45 ` [PATCH v3 2/2] x86/mce/dev-mcelog: Fix updating kflags in AMD systems Smita Koralahalli
  1 sibling, 1 reply; 8+ messages in thread
From: Smita Koralahalli @ 2020-09-03 23:45 UTC (permalink / raw)
  To: x86, linux-kernel, linux-pm, linux-edac, linux-efi, linux-acpi, devel
  Cc: Borislav Petkov, Tony Luck, Rafael J . Wysocki, Len Brown,
	Ard Biesheuvel, Yazen Ghannam, Smita Koralahalli

Linux Kernel uses ACPI Boot Error Record Table (BERT) to report fatal
errors that occurred in a previous boot. The MCA errors in the BERT are
reported using the x86 Processor Error Common Platform Error Record (CPER)
format. Currently, the record prints out the raw MSR values and AMD relies
on the raw record to provide MCA information.

Extract the raw MSR values of MCA registers from the BERT and feed it into
the standard mce_log() function through the existing x86/MCA RAS
infrastructure. This will result in better decoding from the EDAC MCE
decoder or the default notifier.

The implementation is SMCA specific as the raw MCA register values are
given in the register offset order of the MCAX address space.

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
---
Link:
https://lkml.kernel.org/r/20200828203332.11129-2-Smita.KoralahalliChannabasappa@amd.com

v3:
	Moved arch specific declarations from generic header file to arch
	specific header file.
	Cleaned additional declarations which are unnecessary.
	Included the check for context type.
	Added a check to verify for the first MSR address in the register
	layout.
v2:
	Fixed build error reported by kernel test robot.
	Passed struct variable as function argument instead of entire struct
---
 arch/x86/include/asm/acpi.h     | 11 +++++++++
 arch/x86/include/asm/mce.h      |  3 +++
 arch/x86/kernel/acpi/apei.c     |  9 +++++++
 arch/x86/kernel/cpu/mce/apei.c  | 42 +++++++++++++++++++++++++++++++++
 drivers/firmware/efi/cper-x86.c | 10 +++++---
 5 files changed, 72 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/acpi.h b/arch/x86/include/asm/acpi.h
index ca0976456a6b..fb8404f65ab4 100644
--- a/arch/x86/include/asm/acpi.h
+++ b/arch/x86/include/asm/acpi.h
@@ -161,6 +161,8 @@ extern int x86_acpi_numa_init(void);
 
 #define acpi_unlazy_tlb(x)	leave_mm(x)
 
+struct cper_ia_proc_ctx;
+
 #ifdef CONFIG_ACPI_APEI
 static inline pgprot_t arch_apei_get_mem_attribute(phys_addr_t addr)
 {
@@ -179,6 +181,15 @@ static inline pgprot_t arch_apei_get_mem_attribute(phys_addr_t addr)
 	 */
 	return PAGE_KERNEL_NOENC;
 }
+
+int arch_apei_report_x86_error(struct cper_ia_proc_ctx *ctx_info,
+			       u64 lapic_id);
+#else
+static inline int arch_apei_report_x86_error(struct cper_ia_proc_ctx *ctx_info,
+					     u64 lapic_id)
+{
+	return -EINVAL;
+}
 #endif
 
 #define ACPI_TABLE_UPGRADE_MAX_PHYS (max_low_pfn_mapped << PAGE_SHIFT)
diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index 109af5c7f515..c4e055d45905 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -287,6 +287,9 @@ struct cper_sec_mem_err;
 extern void apei_mce_report_mem_error(int corrected,
 				      struct cper_sec_mem_err *mem_err);
 
+struct cper_ia_proc_ctx;
+int apei_mce_report_x86_error(struct cper_ia_proc_ctx *ctx_info, u64 lapic_id);
+
 /*
  * Enumerate new IP types and HWID values in AMD processors which support
  * Scalable MCA.
diff --git a/arch/x86/kernel/acpi/apei.c b/arch/x86/kernel/acpi/apei.c
index c22fb55abcfd..13d60a91eaa0 100644
--- a/arch/x86/kernel/acpi/apei.c
+++ b/arch/x86/kernel/acpi/apei.c
@@ -43,3 +43,12 @@ void arch_apei_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)
 	apei_mce_report_mem_error(sev, mem_err);
 #endif
 }
+
+int arch_apei_report_x86_error(struct cper_ia_proc_ctx *ctx_info, u64 lapic_id)
+{
+	int err = -EINVAL;
+#ifdef CONFIG_X86_MCE
+	err = apei_mce_report_x86_error(ctx_info, lapic_id);
+#endif
+	return err;
+}
diff --git a/arch/x86/kernel/cpu/mce/apei.c b/arch/x86/kernel/cpu/mce/apei.c
index af8d37962586..65001d342302 100644
--- a/arch/x86/kernel/cpu/mce/apei.c
+++ b/arch/x86/kernel/cpu/mce/apei.c
@@ -26,6 +26,8 @@
 
 #include "internal.h"
 
+#define MASK_MCA_STATUS 0xC0002001
+
 void apei_mce_report_mem_error(int severity, struct cper_sec_mem_err *mem_err)
 {
 	struct mce m;
@@ -51,6 +53,46 @@ void apei_mce_report_mem_error(int severity, struct cper_sec_mem_err *mem_err)
 }
 EXPORT_SYMBOL_GPL(apei_mce_report_mem_error);
 
+int apei_mce_report_x86_error(struct cper_ia_proc_ctx *ctx_info, u64 lapic_id)
+{
+	const u64 *i_mce = ((const void *) (ctx_info + 1));
+	unsigned int cpu;
+	struct mce m;
+
+	if (!boot_cpu_has(X86_FEATURE_SMCA))
+		return -EINVAL;
+
+	if ((ctx_info->msr_addr & MASK_MCA_STATUS) != MASK_MCA_STATUS)
+		return -EINVAL;
+
+	mce_setup(&m);
+
+	m.extcpu = -1;
+	m.socketid = -1;
+
+	for_each_possible_cpu(cpu) {
+		if (cpu_data(cpu).initial_apicid == lapic_id) {
+			m.extcpu = cpu;
+			m.socketid = cpu_data(m.extcpu).phys_proc_id;
+			break;
+		}
+	}
+
+	m.apicid = lapic_id;
+	m.bank = (ctx_info->msr_addr >> 4) & 0xFF;
+	m.status = *i_mce;
+	m.addr = *(i_mce + 1);
+	m.misc = *(i_mce + 2);
+	/* Skipping MCA_CONFIG */
+	m.ipid = *(i_mce + 4);
+	m.synd = *(i_mce + 5);
+
+	mce_log(&m);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(apei_mce_report_x86_error);
+
 #define CPER_CREATOR_MCE						\
 	GUID_INIT(0x75a574e3, 0x5052, 0x4b29, 0x8a, 0x8e, 0xbe, 0x2c,	\
 		  0x64, 0x90, 0xb8, 0x9d)
diff --git a/drivers/firmware/efi/cper-x86.c b/drivers/firmware/efi/cper-x86.c
index 2531de49f56c..2f2b0c431c18 100644
--- a/drivers/firmware/efi/cper-x86.c
+++ b/drivers/firmware/efi/cper-x86.c
@@ -2,6 +2,7 @@
 // Copyright (C) 2018, Advanced Micro Devices, Inc.
 
 #include <linux/cper.h>
+#include <linux/acpi.h>
 
 /*
  * We don't need a "CPER_IA" prefix since these are all locally defined.
@@ -347,9 +348,12 @@ void cper_print_proc_ia(const char *pfx, const struct cper_sec_proc_ia *proc)
 			       ctx_info->mm_reg_addr);
 		}
 
-		printk("%sRegister Array:\n", newpfx);
-		print_hex_dump(newpfx, "", DUMP_PREFIX_OFFSET, 16, groupsize,
-			       (ctx_info + 1), ctx_info->reg_arr_size, 0);
+		if (ctx_info->reg_ctx_type != CTX_TYPE_MSR ||
+		    arch_apei_report_x86_error(ctx_info, proc->lapic_id)) {
+			printk("%sRegister Array:\n", newpfx);
+			print_hex_dump(newpfx, "", DUMP_PREFIX_OFFSET, 16, groupsize,
+				       (ctx_info + 1), ctx_info->reg_arr_size, 0);
+		}
 
 		ctx_info = (struct cper_ia_proc_ctx *)((long)ctx_info + size);
 	}
-- 
2.17.1


^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH v3 2/2] x86/mce/dev-mcelog: Fix updating kflags in AMD systems
  2020-09-03 23:45 [PATCH v3 0/2] Decode raw MSR values of MCA registers in BERT Smita Koralahalli
  2020-09-03 23:45 ` [PATCH v3 1/2] cper, apei, mce: Pass x86 CPER through the MCA handling chain Smita Koralahalli
@ 2020-09-03 23:45 ` Smita Koralahalli
  2020-09-14 15:34   ` Borislav Petkov
  1 sibling, 1 reply; 8+ messages in thread
From: Smita Koralahalli @ 2020-09-03 23:45 UTC (permalink / raw)
  To: x86, linux-kernel, linux-pm, linux-edac, linux-efi, linux-acpi, devel
  Cc: Borislav Petkov, Tony Luck, Rafael J . Wysocki, Len Brown,
	Ard Biesheuvel, Yazen Ghannam, Smita Koralahalli

The mcelog utility is not commonly used on AMD systems. Therefore, errors
logged only by the dev_mce_log() notifier will be missed. This may occur
if the EDAC modules are not loaded in which case it's preferable to print
the error record by the default notifier.

However, the mce->kflags set by dev_mce_log() notifier makes the default
notifier to skip over the errors assuming they are processed by
dev_mce_log().

Do not update kflags in the dev_mce_log() notifier on AMD systems.

Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
---
Link:
https://lkml.kernel.org/r/20200828203332.11129-3-Smita.KoralahalliChannabasappa@amd.com

v3:
	No change
v2:
	No change
---
 arch/x86/kernel/cpu/mce/dev-mcelog.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/mce/dev-mcelog.c b/arch/x86/kernel/cpu/mce/dev-mcelog.c
index 03e51053592a..100fbeebdc72 100644
--- a/arch/x86/kernel/cpu/mce/dev-mcelog.c
+++ b/arch/x86/kernel/cpu/mce/dev-mcelog.c
@@ -67,7 +67,9 @@ static int dev_mce_log(struct notifier_block *nb, unsigned long val,
 unlock:
 	mutex_unlock(&mce_chrdev_read_mutex);
 
-	mce->kflags |= MCE_HANDLED_MCELOG;
+	if (boot_cpu_data.x86_vendor != X86_VENDOR_AMD)
+		mce->kflags |= MCE_HANDLED_MCELOG;
+
 	return NOTIFY_OK;
 }
 
-- 
2.17.1


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v3 1/2] cper, apei, mce: Pass x86 CPER through the MCA handling chain
  2020-09-03 23:45 ` [PATCH v3 1/2] cper, apei, mce: Pass x86 CPER through the MCA handling chain Smita Koralahalli
@ 2020-09-14 15:30   ` Borislav Petkov
  2020-09-14 21:16     ` Smita Koralahalli Channabasappa
  0 siblings, 1 reply; 8+ messages in thread
From: Borislav Petkov @ 2020-09-14 15:30 UTC (permalink / raw)
  To: Smita Koralahalli
  Cc: x86, linux-kernel, linux-pm, linux-edac, linux-efi, linux-acpi,
	devel, Tony Luck, Rafael J . Wysocki, Len Brown, Ard Biesheuvel,
	Yazen Ghannam

On Thu, Sep 03, 2020 at 06:45:30PM -0500, Smita Koralahalli wrote:
> Linux Kernel uses ACPI Boot Error Record Table (BERT) to report fatal
> errors that occurred in a previous boot. The MCA errors in the BERT are
> reported using the x86 Processor Error Common Platform Error Record (CPER)
> format. Currently, the record prints out the raw MSR values and AMD relies
> on the raw record to provide MCA information.
> 
> Extract the raw MSR values of MCA registers from the BERT and feed it into
> the standard mce_log() function through the existing x86/MCA RAS
> infrastructure. This will result in better decoding from the EDAC MCE
> decoder or the default notifier.
> 
> The implementation is SMCA specific as the raw MCA register values are
> given in the register offset order of the MCAX address space.
> 
> Reported-by: kernel test robot <lkp@intel.com>

What's that Reported-by for?

Pls put in [] brackets over it what the 0day robot has reported.

> Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
> ---
> Link:
> https://lkml.kernel.org/r/20200828203332.11129-2-Smita.KoralahalliChannabasappa@amd.com
> 
> v3:
> 	Moved arch specific declarations from generic header file to arch
> 	specific header file.
> 	Cleaned additional declarations which are unnecessary.
> 	Included the check for context type.
> 	Added a check to verify for the first MSR address in the register
> 	layout.
> v2:
> 	Fixed build error reported by kernel test robot.
> 	Passed struct variable as function argument instead of entire struct
> ---
>  arch/x86/include/asm/acpi.h     | 11 +++++++++
>  arch/x86/include/asm/mce.h      |  3 +++
>  arch/x86/kernel/acpi/apei.c     |  9 +++++++
>  arch/x86/kernel/cpu/mce/apei.c  | 42 +++++++++++++++++++++++++++++++++
>  drivers/firmware/efi/cper-x86.c | 10 +++++---
>  5 files changed, 72 insertions(+), 3 deletions(-)

...

> diff --git a/arch/x86/kernel/acpi/apei.c b/arch/x86/kernel/acpi/apei.c
> index c22fb55abcfd..13d60a91eaa0 100644
> --- a/arch/x86/kernel/acpi/apei.c
> +++ b/arch/x86/kernel/acpi/apei.c
> @@ -43,3 +43,12 @@ void arch_apei_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)
>  	apei_mce_report_mem_error(sev, mem_err);
>  #endif
>  }
> +
> +int arch_apei_report_x86_error(struct cper_ia_proc_ctx *ctx_info, u64 lapic_id)
> +{
> +	int err = -EINVAL;
> +#ifdef CONFIG_X86_MCE
> +	err = apei_mce_report_x86_error(ctx_info, lapic_id);
> +#endif
> +	return err;
> +}

Add a stub for apei_mce_report_x86_error() in
arch/x86/include/asm/mce.h, in one of the !CONFIG_X86_MCE ifdeffery
which returns -EINVAL and get rid of this ifdeffery and simply do:

	return apei_mce_report_x86_error(ctx_info, lapic_id);

here.

If you wanna fix the above apei_mce_report_mem_error() too, you can do
that in a separate patch.

> diff --git a/arch/x86/kernel/cpu/mce/apei.c b/arch/x86/kernel/cpu/mce/apei.c
> index af8d37962586..65001d342302 100644
> --- a/arch/x86/kernel/cpu/mce/apei.c
> +++ b/arch/x86/kernel/cpu/mce/apei.c
> @@ -26,6 +26,8 @@
>  
>  #include "internal.h"
>  
> +#define MASK_MCA_STATUS 0xC0002001

What does that mask mean? Either here or where it is used needs a
comment.

>  void apei_mce_report_mem_error(int severity, struct cper_sec_mem_err *mem_err)
>  {
>  	struct mce m;
> @@ -51,6 +53,46 @@ void apei_mce_report_mem_error(int severity, struct cper_sec_mem_err *mem_err)
>  }
>  EXPORT_SYMBOL_GPL(apei_mce_report_mem_error);
>  
> +int apei_mce_report_x86_error(struct cper_ia_proc_ctx *ctx_info, u64 lapic_id)
> +{
> +	const u64 *i_mce = ((const void *) (ctx_info + 1));
> +	unsigned int cpu;
> +	struct mce m;
> +
> +	if (!boot_cpu_has(X86_FEATURE_SMCA))

If this function you're adding is SMCA-specific, then its name cannot be
as generic as it is now.

> +		return -EINVAL;
> +
> +	if ((ctx_info->msr_addr & MASK_MCA_STATUS) != MASK_MCA_STATUS)
> +		return -EINVAL;
> +
> +	mce_setup(&m);
> +
> +	m.extcpu = -1;
> +	m.socketid = -1;
> +
> +	for_each_possible_cpu(cpu) {
> +		if (cpu_data(cpu).initial_apicid == lapic_id) {

I don't like that but I don't think we have a reverse mapping from LAPIC
ID to logical CPU numbers in the kernel...

> +			m.extcpu = cpu;
> +			m.socketid = cpu_data(m.extcpu).phys_proc_id;

			m.socketid = cpu_data(cpu).phys_proc_id;

> +			break;
> +		}
> +	}
> +
> +	m.apicid = lapic_id;
> +	m.bank = (ctx_info->msr_addr >> 4) & 0xFF;
> +	m.status = *i_mce;
> +	m.addr = *(i_mce + 1);
> +	m.misc = *(i_mce + 2);
> +	/* Skipping MCA_CONFIG */
> +	m.ipid = *(i_mce + 4);
> +	m.synd = *(i_mce + 5);

Is that structure after cper_ia_proc_ctx defined somewhere in the UEFI
spec so that you can cast to it directly instead of doing this ugly
pointer arithmetic?

> +
> +	mce_log(&m);
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(apei_mce_report_x86_error);

Why is this function exported?

If "no reason", you can fix the above one too, with a separate commit.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v3 2/2] x86/mce/dev-mcelog: Fix updating kflags in AMD systems
  2020-09-03 23:45 ` [PATCH v3 2/2] x86/mce/dev-mcelog: Fix updating kflags in AMD systems Smita Koralahalli
@ 2020-09-14 15:34   ` Borislav Petkov
  2020-09-14 21:20     ` Smita Koralahalli Channabasappa
  0 siblings, 1 reply; 8+ messages in thread
From: Borislav Petkov @ 2020-09-14 15:34 UTC (permalink / raw)
  To: Smita Koralahalli
  Cc: x86, linux-kernel, linux-pm, linux-edac, linux-efi, linux-acpi,
	devel, Tony Luck, Rafael J . Wysocki, Len Brown, Ard Biesheuvel,
	Yazen Ghannam

On Thu, Sep 03, 2020 at 06:45:31PM -0500, Smita Koralahalli wrote:
> The mcelog utility is not commonly used on AMD systems. Therefore, errors
> logged only by the dev_mce_log() notifier will be missed. This may occur
> if the EDAC modules are not loaded in which case it's preferable to print
> the error record by the default notifier.
> 
> However, the mce->kflags set by dev_mce_log() notifier makes the default
> notifier to skip over the errors assuming they are processed by
> dev_mce_log().
> 
> Do not update kflags in the dev_mce_log() notifier on AMD systems.
> 
> Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
> ---
> Link:
> https://lkml.kernel.org/r/20200828203332.11129-3-Smita.KoralahalliChannabasappa@amd.com
> 
> v3:
> 	No change
> v2:
> 	No change
> ---
>  arch/x86/kernel/cpu/mce/dev-mcelog.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kernel/cpu/mce/dev-mcelog.c b/arch/x86/kernel/cpu/mce/dev-mcelog.c
> index 03e51053592a..100fbeebdc72 100644
> --- a/arch/x86/kernel/cpu/mce/dev-mcelog.c
> +++ b/arch/x86/kernel/cpu/mce/dev-mcelog.c
> @@ -67,7 +67,9 @@ static int dev_mce_log(struct notifier_block *nb, unsigned long val,
>  unlock:
>  	mutex_unlock(&mce_chrdev_read_mutex);
>  
> -	mce->kflags |= MCE_HANDLED_MCELOG;
> +	if (boot_cpu_data.x86_vendor != X86_VENDOR_AMD)
> +		mce->kflags |= MCE_HANDLED_MCELOG;
> +
>  	return NOTIFY_OK;
>  }
>  
> -- 

This one is not related to your 1/2 so it sounds to me like I should
take this one now, independently?

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v3 1/2] cper, apei, mce: Pass x86 CPER through the MCA handling chain
  2020-09-14 15:30   ` Borislav Petkov
@ 2020-09-14 21:16     ` Smita Koralahalli Channabasappa
  2020-09-15  8:15       ` Borislav Petkov
  0 siblings, 1 reply; 8+ messages in thread
From: Smita Koralahalli Channabasappa @ 2020-09-14 21:16 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: x86, linux-kernel, linux-pm, linux-edac, linux-efi, linux-acpi,
	devel, Tony Luck, Rafael J . Wysocki, Len Brown, Ard Biesheuvel,
	Yazen Ghannam

On 9/14/20 10:30 AM, Borislav Petkov wrote:

> On Thu, Sep 03, 2020 at 06:45:30PM -0500, Smita Koralahalli wrote:
>> Linux Kernel uses ACPI Boot Error Record Table (BERT) to report fatal
>> errors that occurred in a previous boot. The MCA errors in the BERT are
>> reported using the x86 Processor Error Common Platform Error Record (CPER)
>> format. Currently, the record prints out the raw MSR values and AMD relies
>> on the raw record to provide MCA information.
>>
>> Extract the raw MSR values of MCA registers from the BERT and feed it into
>> the standard mce_log() function through the existing x86/MCA RAS
>> infrastructure. This will result in better decoding from the EDAC MCE
>> decoder or the default notifier.
>>
>> The implementation is SMCA specific as the raw MCA register values are
>> given in the register offset order of the MCAX address space.
>>
>> Reported-by: kernel test robot <lkp@intel.com>
> What's that Reported-by for?
>
> Pls put in [] brackets over it what the 0day robot has reported.

Okay, I will add it here. May be I should have just put it over here
rather than in cover letter.

>> Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
>> ---
>>
>> v3:
>> 	Moved arch specific declarations from generic header file to arch
>> 	specific header file.
>> 	Cleaned additional declarations which are unnecessary.
>> 	Included the check for context type.
>> 	Added a check to verify for the first MSR address in the register
>> 	layout.
>> v2:
>> 	Fixed build error reported by kernel test robot.
>> 	Passed struct variable as function argument instead of entire struct
>> ---
>>   arch/x86/include/asm/acpi.h     | 11 +++++++++
>>   arch/x86/include/asm/mce.h      |  3 +++
>>   arch/x86/kernel/acpi/apei.c     |  9 +++++++
>>   arch/x86/kernel/cpu/mce/apei.c  | 42 +++++++++++++++++++++++++++++++++
>>   drivers/firmware/efi/cper-x86.c | 10 +++++---
>>   5 files changed, 72 insertions(+), 3 deletions(-)
> ...
>
>> diff --git a/arch/x86/kernel/acpi/apei.c b/arch/x86/kernel/acpi/apei.c
>> index c22fb55abcfd..13d60a91eaa0 100644
>> --- a/arch/x86/kernel/acpi/apei.c
>> +++ b/arch/x86/kernel/acpi/apei.c
>> @@ -43,3 +43,12 @@ void arch_apei_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)
>>   	apei_mce_report_mem_error(sev, mem_err);
>>   #endif
>>   }
>> +
>> +int arch_apei_report_x86_error(struct cper_ia_proc_ctx *ctx_info, u64 lapic_id)
>> +{
>> +	int err = -EINVAL;
>> +#ifdef CONFIG_X86_MCE
>> +	err = apei_mce_report_x86_error(ctx_info, lapic_id);
>> +#endif
>> +	return err;
>> +}
> Add a stub for apei_mce_report_x86_error() in
> arch/x86/include/asm/mce.h, in one of the !CONFIG_X86_MCE ifdeffery
> which returns -EINVAL and get rid of this ifdeffery and simply do:
>
> 	return apei_mce_report_x86_error(ctx_info, lapic_id);
>
> here.
>
> If you wanna fix the above apei_mce_report_mem_error() too, you can do
> that in a separate patch.

I have addressed this and will be sending it out as a separate patch.

>> diff --git a/arch/x86/kernel/cpu/mce/apei.c b/arch/x86/kernel/cpu/mce/apei.c
>> index af8d37962586..65001d342302 100644
>> --- a/arch/x86/kernel/cpu/mce/apei.c
>> +++ b/arch/x86/kernel/cpu/mce/apei.c
>> @@ -26,6 +26,8 @@
>>   
>>   #include "internal.h"
>>   
>> +#define MASK_MCA_STATUS 0xC0002001
> What does that mask mean? Either here or where it is used needs a
> comment.

The mask value checks for MSR address of the first expected register
in the register layout of MCAX register space. Since the register
layout is implementation specific I'm just checking for the first
register. I will be adding a comment for this.

>>   void apei_mce_report_mem_error(int severity, struct cper_sec_mem_err *mem_err)
>>   {
>>   	struct mce m;
>> @@ -51,6 +53,46 @@ void apei_mce_report_mem_error(int severity, struct cper_sec_mem_err *mem_err)
>>   }
>>   EXPORT_SYMBOL_GPL(apei_mce_report_mem_error);
>>   
>> +int apei_mce_report_x86_error(struct cper_ia_proc_ctx *ctx_info, u64 lapic_id)
>> +{
>> +	const u64 *i_mce = ((const void *) (ctx_info + 1));
>> +	unsigned int cpu;
>> +	struct mce m;
>> +
>> +	if (!boot_cpu_has(X86_FEATURE_SMCA))
> If this function you're adding is SMCA-specific, then its name cannot be
> as generic as it is now.

May be something like apei_smca_report_x86_error? Or probably an additional
SMCA-specific function call inside this generic APEI and MCA function call?

>> +		return -EINVAL;
>> +
>> +	if ((ctx_info->msr_addr & MASK_MCA_STATUS) != MASK_MCA_STATUS)
>> +		return -EINVAL;
>> +
>> +	mce_setup(&m);
>> +
>> +	m.extcpu = -1;
>> +	m.socketid = -1;
>> +
>> +	for_each_possible_cpu(cpu) {
>> +		if (cpu_data(cpu).initial_apicid == lapic_id) {
> I don't like that but I don't think we have a reverse mapping from LAPIC
> ID to logical CPU numbers in the kernel...

Yes, we do not have a reverse mapping.

>> +			m.extcpu = cpu;
>> +			m.socketid = cpu_data(m.extcpu).phys_proc_id;
> 			m.socketid = cpu_data(cpu).phys_proc_id;
>
>> +			break;
>> +		}
>> +	}
>> +
>> +	m.apicid = lapic_id;
>> +	m.bank = (ctx_info->msr_addr >> 4) & 0xFF;
>> +	m.status = *i_mce;
>> +	m.addr = *(i_mce + 1);
>> +	m.misc = *(i_mce + 2);
>> +	/* Skipping MCA_CONFIG */
>> +	m.ipid = *(i_mce + 4);
>> +	m.synd = *(i_mce + 5);
> Is that structure after cper_ia_proc_ctx defined somewhere in the UEFI
> spec so that you can cast to it directly instead of doing this ugly
> pointer arithmetic?

The registers here are implementation specific and applies only for
SMCA systems. So I have used pointer arithmetic as it is not defined
in the spec.

>> +
>> +	mce_log(&m);
>> +
>> +	return 0;
>> +}
>> +EXPORT_SYMBOL_GPL(apei_mce_report_x86_error);
> Why is this function exported?
>
> If "no reason", you can fix the above one too, with a separate commit.
>
> Thx.

I will fix this..

Thanks,
Smita

>

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v3 2/2] x86/mce/dev-mcelog: Fix updating kflags in AMD systems
  2020-09-14 15:34   ` Borislav Petkov
@ 2020-09-14 21:20     ` Smita Koralahalli Channabasappa
  0 siblings, 0 replies; 8+ messages in thread
From: Smita Koralahalli Channabasappa @ 2020-09-14 21:20 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: x86, linux-kernel, linux-pm, linux-edac, linux-efi, linux-acpi,
	devel, Tony Luck, Rafael J . Wysocki, Len Brown, Ard Biesheuvel,
	Yazen Ghannam

On 9/14/20 10:34 AM, Borislav Petkov wrote:

> On Thu, Sep 03, 2020 at 06:45:31PM -0500, Smita Koralahalli wrote:
>> The mcelog utility is not commonly used on AMD systems. Therefore, errors
>> logged only by the dev_mce_log() notifier will be missed. This may occur
>> if the EDAC modules are not loaded in which case it's preferable to print
>> the error record by the default notifier.
>>
>> However, the mce->kflags set by dev_mce_log() notifier makes the default
>> notifier to skip over the errors assuming they are processed by
>> dev_mce_log().
>>
>> Do not update kflags in the dev_mce_log() notifier on AMD systems.
>>
>> Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
>> ---
>> Link:
>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flkml.kernel.org%2Fr%2F20200828203332.11129-3-Smita.KoralahalliChannabasappa%40amd.com&amp;data=02%7C01%7CSmita.KoralahalliChannabasappa%40amd.com%7Cc452e9f80fe9459839c708d858c3a763%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637356944652485754&amp;sdata=%2FhYQbBBNld1GtNX8%2FI6PERD0icYfy0e1k5zukQYI%2Fa4%3D&amp;reserved=0
>>
>> v3:
>> 	No change
>> v2:
>> 	No change
>> ---
>>   arch/x86/kernel/cpu/mce/dev-mcelog.c | 4 +++-
>>   1 file changed, 3 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/kernel/cpu/mce/dev-mcelog.c b/arch/x86/kernel/cpu/mce/dev-mcelog.c
>> index 03e51053592a..100fbeebdc72 100644
>> --- a/arch/x86/kernel/cpu/mce/dev-mcelog.c
>> +++ b/arch/x86/kernel/cpu/mce/dev-mcelog.c
>> @@ -67,7 +67,9 @@ static int dev_mce_log(struct notifier_block *nb, unsigned long val,
>>   unlock:
>>   	mutex_unlock(&mce_chrdev_read_mutex);
>>   
>> -	mce->kflags |= MCE_HANDLED_MCELOG;
>> +	if (boot_cpu_data.x86_vendor != X86_VENDOR_AMD)
>> +		mce->kflags |= MCE_HANDLED_MCELOG;
>> +
>>   	return NOTIFY_OK;
>>   }
>>   
>> -- 
> This one is not related to your 1/2 so it sounds to me like I should
> take this one now, independently?

Yes, this can be taken independently. I just tagged it along as I came
across the issue of missing error logs while trying to print error
records in the previous patch.

Thanks,
Smita


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v3 1/2] cper, apei, mce: Pass x86 CPER through the MCA handling chain
  2020-09-14 21:16     ` Smita Koralahalli Channabasappa
@ 2020-09-15  8:15       ` Borislav Petkov
  0 siblings, 0 replies; 8+ messages in thread
From: Borislav Petkov @ 2020-09-15  8:15 UTC (permalink / raw)
  To: Smita Koralahalli Channabasappa
  Cc: x86, linux-kernel, linux-pm, linux-edac, linux-efi, linux-acpi,
	devel, Tony Luck, Rafael J . Wysocki, Len Brown, Ard Biesheuvel,
	Yazen Ghannam

On Mon, Sep 14, 2020 at 04:16:10PM -0500, Smita Koralahalli Channabasappa wrote:
> May be something like apei_smca_report_x86_error? Or probably an additional
> SMCA-specific function call inside this generic APEI and MCA function call?

apei_smca_report_x86_error() sounds ok. If there's need for any
additional MCA handling from BERT, then that can be extended/refactored
then.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, back to index

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-09-03 23:45 [PATCH v3 0/2] Decode raw MSR values of MCA registers in BERT Smita Koralahalli
2020-09-03 23:45 ` [PATCH v3 1/2] cper, apei, mce: Pass x86 CPER through the MCA handling chain Smita Koralahalli
2020-09-14 15:30   ` Borislav Petkov
2020-09-14 21:16     ` Smita Koralahalli Channabasappa
2020-09-15  8:15       ` Borislav Petkov
2020-09-03 23:45 ` [PATCH v3 2/2] x86/mce/dev-mcelog: Fix updating kflags in AMD systems Smita Koralahalli
2020-09-14 15:34   ` Borislav Petkov
2020-09-14 21:20     ` Smita Koralahalli Channabasappa

Linux-EDAC Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-edac/0 linux-edac/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-edac linux-edac/ https://lore.kernel.org/linux-edac \
		linux-edac@vger.kernel.org
	public-inbox-index linux-edac

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-edac


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git