linux-edac.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] x86/MCE: Set the MCE's status and misc members
@ 2020-08-07 14:27 Kyle Meyer
  2020-08-07 14:41 ` Borislav Petkov
  2020-08-07 15:05 ` Steve Wahl
  0 siblings, 2 replies; 4+ messages in thread
From: Kyle Meyer @ 2020-08-07 14:27 UTC (permalink / raw)
  Cc: russ.anderson, steve.wahl, kyle.meyer, tony.luck, bp, linux-edac

The MCE's status and misc members are initialized to zero
within mce_setup. Set the MCE's status and misc members to the
corresponding values passed in by the mem_err argument. This provides
support for the RAS: Corrected Errors Collector (CEC) which uses the
MCE.

Signed-off-by: Kyle Meyer <kyle.meyer@hpe.com>
---
 arch/x86/include/asm/mce.h     | 6 ++++--
 arch/x86/kernel/cpu/mce/apei.c | 8 ++++++++
 2 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index cf503824529c..2346d900a232 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -79,8 +79,10 @@
 #define MCACOD_INSTR	0x0150	/* Instruction Fetch */
 
 /* MCi_MISC register defines */
-#define MCI_MISC_ADDR_LSB(m)	((m) & 0x3f)
-#define MCI_MISC_ADDR_MODE(m)	(((m) >> 6) & 7)
+#define MCI_MISC_ADDR_LSB(m)		((m) & 0x3f)
+#define MCI_MISC_ADDR_LSB_SET(x) 	((x) & 0x3f)
+#define MCI_MISC_ADDR_MODE(m)		(((m) >> 6) & 7)
+#define MCI_MISC_ADDR_MODE_SET(x) 	(((x) & 7) << 6)
 #define  MCI_MISC_ADDR_SEGOFF	0	/* segment offset */
 #define  MCI_MISC_ADDR_LINEAR	1	/* linear address */
 #define  MCI_MISC_ADDR_PHYS	2	/* physical address */
diff --git a/arch/x86/kernel/cpu/mce/apei.c b/arch/x86/kernel/cpu/mce/apei.c
index af8d37962586..efdfb55b934a 100644
--- a/arch/x86/kernel/cpu/mce/apei.c
+++ b/arch/x86/kernel/cpu/mce/apei.c
@@ -38,6 +38,14 @@ void apei_mce_report_mem_error(int severity, struct cper_sec_mem_err *mem_err)
 	/* Fake a memory read error with unknown channel */
 	m.status = MCI_STATUS_VAL | MCI_STATUS_EN | MCI_STATUS_ADDRV | 0x9f;
 
+	if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL ||
+	    boot_cpu_data.x86_vendor == X86_VENDOR_ZHAOXIN) {
+		m.status |= MCI_STATUS_MISCV;
+		m.misc |= MCI_MISC_ADDR_LSB_SET(ffs(
+					mem_err->physical_addr_mask) - 1) |
+			  MCI_MISC_ADDR_MODE_SET(MCI_MISC_ADDR_PHYS);
+	}
+
 	if (severity >= GHES_SEV_RECOVERABLE)
 		m.status |= MCI_STATUS_UC;
 
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: [PATCH] x86/MCE: Set the MCE's status and misc members
  2020-08-07 14:27 [PATCH] x86/MCE: Set the MCE's status and misc members Kyle Meyer
@ 2020-08-07 14:41 ` Borislav Petkov
  2020-08-07 15:05 ` Steve Wahl
  1 sibling, 0 replies; 4+ messages in thread
From: Borislav Petkov @ 2020-08-07 14:41 UTC (permalink / raw)
  To: Kyle Meyer; +Cc: russ.anderson, steve.wahl, tony.luck, linux-edac

On Fri, Aug 07, 2020 at 09:27:50AM -0500, Kyle Meyer wrote:
> The MCE's status and misc members are initialized to zero
> within mce_setup. Set the MCE's status and misc members to the
> corresponding values passed in by the mem_err argument. This provides
> support for the RAS: Corrected Errors Collector (CEC) which uses the
> MCE.

This commit message is talking about the "how" but that is visible from
the patch. It needs to talk more about the why, as to why are you doing
this and what you're trying to achieve. I believe you alluded to some of
that offlist.

> Signed-off-by: Kyle Meyer <kyle.meyer@hpe.com>
> ---
>  arch/x86/include/asm/mce.h     | 6 ++++--
>  arch/x86/kernel/cpu/mce/apei.c | 8 ++++++++
>  2 files changed, 12 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
> index cf503824529c..2346d900a232 100644
> --- a/arch/x86/include/asm/mce.h
> +++ b/arch/x86/include/asm/mce.h
> @@ -79,8 +79,10 @@
>  #define MCACOD_INSTR	0x0150	/* Instruction Fetch */
>  
>  /* MCi_MISC register defines */
> -#define MCI_MISC_ADDR_LSB(m)	((m) & 0x3f)
> -#define MCI_MISC_ADDR_MODE(m)	(((m) >> 6) & 7)
> +#define MCI_MISC_ADDR_LSB(m)		((m) & 0x3f)
> +#define MCI_MISC_ADDR_LSB_SET(x) 	((x) & 0x3f)
> +#define MCI_MISC_ADDR_MODE(m)		(((m) >> 6) & 7)
> +#define MCI_MISC_ADDR_MODE_SET(x) 	(((x) & 7) << 6)
>  #define  MCI_MISC_ADDR_SEGOFF	0	/* segment offset */
>  #define  MCI_MISC_ADDR_LINEAR	1	/* linear address */
>  #define  MCI_MISC_ADDR_PHYS	2	/* physical address */
> diff --git a/arch/x86/kernel/cpu/mce/apei.c b/arch/x86/kernel/cpu/mce/apei.c
> index af8d37962586..efdfb55b934a 100644
> --- a/arch/x86/kernel/cpu/mce/apei.c
> +++ b/arch/x86/kernel/cpu/mce/apei.c
> @@ -38,6 +38,14 @@ void apei_mce_report_mem_error(int severity, struct cper_sec_mem_err *mem_err)
>  	/* Fake a memory read error with unknown channel */
>  	m.status = MCI_STATUS_VAL | MCI_STATUS_EN | MCI_STATUS_ADDRV | 0x9f;
>  
> +	if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL ||
> +	    boot_cpu_data.x86_vendor == X86_VENDOR_ZHAOXIN) {
> +		m.status |= MCI_STATUS_MISCV;
> +		m.misc |= MCI_MISC_ADDR_LSB_SET(ffs(
> +					mem_err->physical_addr_mask) - 1) |
> +			  MCI_MISC_ADDR_MODE_SET(MCI_MISC_ADDR_PHYS);
> +	}

This is ugly, especially that trailing opening brace - please use a
helper variable.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH] x86/MCE: Set the MCE's status and misc members
  2020-08-07 14:27 [PATCH] x86/MCE: Set the MCE's status and misc members Kyle Meyer
  2020-08-07 14:41 ` Borislav Petkov
@ 2020-08-07 15:05 ` Steve Wahl
  2020-08-07 16:33   ` Luck, Tony
  1 sibling, 1 reply; 4+ messages in thread
From: Steve Wahl @ 2020-08-07 15:05 UTC (permalink / raw)
  To: Kyle Meyer; +Cc: russ.anderson, steve.wahl, tony.luck, bp, linux-edac

There's one small code change I suggest below.

But now it's time to think about the description.  We need to zoom out
to the 10,000 ft. view.  IMHO, to understand what this patch does,
your current description requires the reader to be more familiar with
the involved functions than we should expect.

To introduce the reader to what's going on, it should read more like:

"When creating an MCE record from a CPER memory error, fill in the
m.misc field and mark it as valid in the m.status field on processors
where MCEs are expected to have this information.  This ensures that
mce_usable_address() will not interpret the address as invalid, which
(for example) would cause cec_notifier() to ignore the created MCE
records."

In addition to giving an improved background picture, that gives
direct "pointers" to the other functions you have to examine to
understand what's going on.

On Fri, Aug 07, 2020 at 09:27:50AM -0500, Kyle Meyer wrote:
> The MCE's status and misc members are initialized to zero
> within mce_setup. Set the MCE's status and misc members to the
> corresponding values passed in by the mem_err argument. This provides
> support for the RAS: Corrected Errors Collector (CEC) which uses the
> MCE.
> 
> Signed-off-by: Kyle Meyer <kyle.meyer@hpe.com>
> ---
>  arch/x86/include/asm/mce.h     | 6 ++++--
>  arch/x86/kernel/cpu/mce/apei.c | 8 ++++++++
>  2 files changed, 12 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
> index cf503824529c..2346d900a232 100644
> --- a/arch/x86/include/asm/mce.h
> +++ b/arch/x86/include/asm/mce.h
> @@ -79,8 +79,10 @@
>  #define MCACOD_INSTR	0x0150	/* Instruction Fetch */
>  
>  /* MCi_MISC register defines */
> -#define MCI_MISC_ADDR_LSB(m)	((m) & 0x3f)
> -#define MCI_MISC_ADDR_MODE(m)	(((m) >> 6) & 7)
> +#define MCI_MISC_ADDR_LSB(m)		((m) & 0x3f)
> +#define MCI_MISC_ADDR_LSB_SET(x) 	((x) & 0x3f)
> +#define MCI_MISC_ADDR_MODE(m)		(((m) >> 6) & 7)
> +#define MCI_MISC_ADDR_MODE_SET(x) 	(((x) & 7) << 6)
>  #define  MCI_MISC_ADDR_SEGOFF	0	/* segment offset */
>  #define  MCI_MISC_ADDR_LINEAR	1	/* linear address */
>  #define  MCI_MISC_ADDR_PHYS	2	/* physical address */
> diff --git a/arch/x86/kernel/cpu/mce/apei.c b/arch/x86/kernel/cpu/mce/apei.c
> index af8d37962586..efdfb55b934a 100644
> --- a/arch/x86/kernel/cpu/mce/apei.c
> +++ b/arch/x86/kernel/cpu/mce/apei.c
> @@ -38,6 +38,14 @@ void apei_mce_report_mem_error(int severity, struct cper_sec_mem_err *mem_err)
>  	/* Fake a memory read error with unknown channel */
>  	m.status = MCI_STATUS_VAL | MCI_STATUS_EN | MCI_STATUS_ADDRV | 0x9f;
>  
> +	if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL ||
> +	    boot_cpu_data.x86_vendor == X86_VENDOR_ZHAOXIN) {
> +		m.status |= MCI_STATUS_MISCV;

Concentrating on this statement:

> +		m.misc |= MCI_MISC_ADDR_LSB_SET(ffs(
> +					mem_err->physical_addr_mask) - 1) |
> +			  MCI_MISC_ADDR_MODE_SET(MCI_MISC_ADDR_PHYS);

I know you mentioned that the whole of structure m was cleared out by
mce_setup(), but that's in the change comment, not the code.  Since we
know m.misc is already 0, the code would be clearer if this line
started

		m.misc = MCI_MISC...

instead of

		m.misc |= MCI_MISC...

because the result is the same, and someone coming fresh into reading
this code has to go into the definition of mce_setup() to know that
m.misc started as zero for the |= version.

Another way of stating this is there's nothing already in m.misc
you're trying to preserve and add to, where seeing the |= operation
could mislead a new reader into thinking there was something there
you're trying to preserve.


> +	}
> +
>  	if (severity >= GHES_SEV_RECOVERABLE)
>  		m.status |= MCI_STATUS_UC;
>  
> -- 
> 2.26.2
> 
--> Steve

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH] x86/MCE: Set the MCE's status and misc members
  2020-08-07 15:05 ` Steve Wahl
@ 2020-08-07 16:33   ` Luck, Tony
  0 siblings, 0 replies; 4+ messages in thread
From: Luck, Tony @ 2020-08-07 16:33 UTC (permalink / raw)
  To: Steve Wahl; +Cc: Kyle Meyer, russ.anderson, bp, linux-edac

On Fri, Aug 07, 2020 at 10:05:50AM -0500, Steve Wahl wrote:
> There's one small code change I suggest below.
> 
> But now it's time to think about the description.  We need to zoom out
> to the 10,000 ft. view.  IMHO, to understand what this patch does,
> your current description requires the reader to be more familiar with
> the involved functions than we should expect.

Maybe also travel back in time to why apei_mce_report_mem_error()
was created in the first place.

This was a workaround for a couple of Intel CPUs (Nehalem, Westmere)
that didn't provide a system address in MCi_ADDR. Since BIOS could figure
out the address, it was tasked with reporting to the OS suing APEI.

apei_mce_report_mem_error() did the minimum necessary to create
an error log that mcelog(8) daemon could process.

Jump forward in time to Sandybridge (and all successors) and the
reason for the workaround is gone. All of them provide a system
address in the machine check bank ... so this APEI path is a
poor substitute for getting the data from the machine check bank
(which will give you the channel, the corrected error count, and
perhaps other useful information).

So why are we applying band-aids to this decade old workaround?

-Tony

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2020-08-07 16:33 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-08-07 14:27 [PATCH] x86/MCE: Set the MCE's status and misc members Kyle Meyer
2020-08-07 14:41 ` Borislav Petkov
2020-08-07 15:05 ` Steve Wahl
2020-08-07 16:33   ` Luck, Tony

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).