linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] x86/mce: Add Skylake quirk for patrol scrub reported errors
@ 2021-03-22 22:37 Tony Luck
  2021-03-24 15:00 ` Borislav Petkov
  0 siblings, 1 reply; 8+ messages in thread
From: Tony Luck @ 2021-03-22 22:37 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: x86, linux-kernel, Youquan Song, Tony Luck

From: Youquan Song <youquan.song@intel.com>

Skylake has a mode where the system administrator can use a BIOS setup
option to request that the memory controller report uncorrected errors
found by the patrol scrubber as corrected.  This results in them being
signalled using CMCI, which is less disruptive than a machine check.

Add a quirk to detect that a "corrected" error is actually a downgraded
uncorrected error with model specific checks for the "MSCOD" signature in
MCi_STATUS and that the error was reported from a memory controller bank.

Adjust the severity to MCE_AO_SEVERITY so that Linux will try to take
the affected page offline.

[Tony: Wordsmith commit comment]

Signed-off-by: Youquan Song <youquan.song@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>

---
Repost ... looks like this got lost somewhere.

V2:
Boris:	Don't optimize with pointer to quirk function. Just do
 	the vendor/family/model check in the adjust_mce_log()
	function
Tony:	Add check for stepping >= 4
---
 arch/x86/kernel/cpu/mce/core.c | 30 ++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index e9265e2f28c9..2d5fe23adf29 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -673,6 +673,35 @@ static void mce_read_aux(struct mce *m, int i)
 	}
 }
 
+/*
+ * Skylake family CPUs have a mode where the user can request that
+ * the memory controller report uncorrected errors found by the patrol
+ * scrubber as corrected (MCI_STATUS_UC == 0). This results in them being
+ * signalled using CMCI, which is less disruptive that a machine check.
+ * The following quirk detects such errors and adjusts the severity.
+ */
+
+#define MSCOD_UCE_SCRUB	(0x0010 << 16) /* UnCorrected Patrol Scrub Error */
+#define MSCOD_MASK	GENMASK_ULL(31, 16)
+
+static void adjust_mce_log(struct mce *m)
+{
+	struct cpuinfo_x86 *c = &boot_cpu_data;
+
+	if (c->x86_vendor == X86_VENDOR_INTEL && c->x86 == 6 &&
+	    c->x86_model == INTEL_FAM6_SKYLAKE_X && c->x86_stepping >= 4) {
+		/*
+		 * Check the error code to see if this is an uncorrected patrol
+		 * scrub error from one of the memory controller banks. If so,
+		 * then adjust the severity level to MCE_AO_SEVERITY
+		 */
+		if (((m->status & MCACOD_SCRUBMSK) == MCACOD_SCRUB) &&
+		    ((m->status & MSCOD_MASK) == MSCOD_UCE_SCRUB) &&
+		    m->bank >= 13 && m->bank <= 18)
+			m->severity = MCE_AO_SEVERITY;
+	}
+}
+
 DEFINE_PER_CPU(unsigned, mce_poll_count);
 
 /*
@@ -772,6 +801,7 @@ bool machine_check_poll(enum mcp_flags flags, mce_banks_t *b)
 		if (mca_cfg.dont_log_ce && !mce_usable_address(&m))
 			goto clear_it;
 
+		adjust_mce_log(&m);
 		mce_log(&m);
 
 clear_it:
-- 
2.21.1


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH] x86/mce: Add Skylake quirk for patrol scrub reported errors
  2021-03-22 22:37 [PATCH] x86/mce: Add Skylake quirk for patrol scrub reported errors Tony Luck
@ 2021-03-24 15:00 ` Borislav Petkov
  2021-03-24 15:35   ` Luck, Tony
  0 siblings, 1 reply; 8+ messages in thread
From: Borislav Petkov @ 2021-03-24 15:00 UTC (permalink / raw)
  To: Tony Luck; +Cc: x86, linux-kernel, Youquan Song

On Mon, Mar 22, 2021 at 03:37:10PM -0700, Tony Luck wrote:
> From: Youquan Song <youquan.song@intel.com>
> 
> Skylake has a mode where the system administrator can use a BIOS setup
> option to request that the memory controller report uncorrected errors
> found by the patrol scrubber as corrected.  This results in them being
> signalled using CMCI, which is less disruptive than a machine check.
> 
> Add a quirk to detect that a "corrected" error is actually a downgraded
> uncorrected error with model specific checks for the "MSCOD" signature in
> MCi_STATUS and that the error was reported from a memory controller bank.
> 
> Adjust the severity to MCE_AO_SEVERITY so that Linux will try to take
> the affected page offline.
> 
> [Tony: Wordsmith commit comment]
> 
> Signed-off-by: Youquan Song <youquan.song@intel.com>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> 
> ---
> Repost ... looks like this got lost somewhere.

Yeah, into

fd258dc4442c ("x86/mce: Add Skylake quirk for patrol scrub reported errors")

:-)

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 8+ messages in thread

* RE: [PATCH] x86/mce: Add Skylake quirk for patrol scrub reported errors
  2021-03-24 15:00 ` Borislav Petkov
@ 2021-03-24 15:35   ` Luck, Tony
  0 siblings, 0 replies; 8+ messages in thread
From: Luck, Tony @ 2021-03-24 15:35 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: x86, linux-kernel, Song, Youquan

> Yeah, into
> 
> fd258dc4442c ("x86/mce: Add Skylake quirk for patrol scrub reported errors")

Thanks ... my memory is failing, and I forgot that the patch had been improved and
moved from core.c (where I looked for it) into severity.c

-Tony


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] x86/mce: Add Skylake quirk for patrol scrub reported errors
  2020-06-17  7:41     ` Borislav Petkov
@ 2020-06-17 18:49       ` Luck, Tony
  0 siblings, 0 replies; 8+ messages in thread
From: Luck, Tony @ 2020-06-17 18:49 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Song, Youquan, x86, linux-kernel

On Wed, Jun 17, 2020 at 09:41:58AM +0200, Borislav Petkov wrote:
> On Tue, Jun 16, 2020 at 10:33:08PM +0000, Luck, Tony wrote:
> > If the BIOS option is left in the default setting, uncorrectable errors found
> > by the patrol scrubber are reported with a machine check. Those MSCOD
> > and MCACOD signatures are the same ... but that's not important because
> > MCi_STATUS.UC==1. So Linux doesn't need to jump through hoops to
> > "upgrade" the severity.
> 
> No, this is not what I meant: I meant when you have the setting enabled
> to downgrade those errors, can they be detected as part of another MCE
> being raised...

Yes one of these downgraded error logs could be sitting in a bank
whe a different machine check fires. But the severity calculation will
ignore it as MCi_STATUS.EN==0

-Tony

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] x86/mce: Add Skylake quirk for patrol scrub reported errors
  2020-06-16 22:33   ` Luck, Tony
@ 2020-06-17  7:41     ` Borislav Petkov
  2020-06-17 18:49       ` Luck, Tony
  0 siblings, 1 reply; 8+ messages in thread
From: Borislav Petkov @ 2020-06-17  7:41 UTC (permalink / raw)
  To: Luck, Tony; +Cc: Song, Youquan, x86, linux-kernel

On Tue, Jun 16, 2020 at 10:33:08PM +0000, Luck, Tony wrote:
> If the BIOS option is left in the default setting, uncorrectable errors found
> by the patrol scrubber are reported with a machine check. Those MSCOD
> and MCACOD signatures are the same ... but that's not important because
> MCi_STATUS.UC==1. So Linux doesn't need to jump through hoops to
> "upgrade" the severity.

No, this is not what I meant: I meant when you have the setting enabled
to downgrade those errors, can they be detected as part of another MCE
being raised...

> > If so, then the adjusting needs to happen inside mce_log().
> So no, this adjust only needs to happen when polling the banks from
> CMCI or periodic timer.

... but since those downgraded errors raise CMCI then the answer to my
question is no.

> The point was to avoid the runtime test for CPU model on every error. But
> this isn't a performance critical path, so we can refactor if you think that
> looks cleaner.

Yes please.

> There is some new set of validation tests running now to check the effectiveness
> of this BIOS + OS change. So it may be a while before updated version is
> posted.

Ok, thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 8+ messages in thread

* RE: [PATCH] x86/mce: Add Skylake quirk for patrol scrub reported errors
  2020-06-16 19:29 ` Borislav Petkov
@ 2020-06-16 22:33   ` Luck, Tony
  2020-06-17  7:41     ` Borislav Petkov
  0 siblings, 1 reply; 8+ messages in thread
From: Luck, Tony @ 2020-06-16 22:33 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Song, Youquan, x86, linux-kernel

> Two things: can that error type be detected when #MC gets raised, i.e., in
> do_machine_check() as part of scanning all banks?

If the BIOS option is left in the default setting, uncorrectable errors found
by the patrol scrubber are reported with a machine check. Those MSCOD
and MCACOD signatures are the same ... but that's not important because
MCi_STATUS.UC==1. So Linux doesn't need to jump through hoops to
"upgrade" the severity.

> If so, then the adjusting needs to happen inside mce_log().
So no, this adjust only needs to happen when polling the banks from
CMCI or periodic timer.

> Also, that assignment to the function pointer doesn't make much sense to
> me and I think you should do the vendor/family/model checking straight
> in a function adjust_mce_log() which gets called by whoever...

The point was to avoid the runtime test for CPU model on every error. But
this isn't a performance critical path, so we can refactor if you think that
looks cleaner.

There is some new set of validation tests running now to check the effectiveness
of this BIOS + OS change. So it may be a while before updated version is
posted.

Thanks

-Tony

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] x86/mce: Add Skylake quirk for patrol scrub reported errors
  2020-06-15 18:40 Tony Luck
@ 2020-06-16 19:29 ` Borislav Petkov
  2020-06-16 22:33   ` Luck, Tony
  0 siblings, 1 reply; 8+ messages in thread
From: Borislav Petkov @ 2020-06-16 19:29 UTC (permalink / raw)
  To: Tony Luck; +Cc: Youquan Song, x86, linux-kernel

On Mon, Jun 15, 2020 at 11:40:56AM -0700, Tony Luck wrote:
> From: Youquan Song <youquan.song@intel.com>
> 
> Skylake has a mode where the system administrator can use a BIOS setup
> option to request that the memory controller report uncorrected errors
> found by the patrol scrubber as corrected.  This results in them being
> signalled using CMCI, which is less disruptive than a machine check.
> 
> Add a quirk to detect that a "corrected" error is actually a downgraded
> uncorrected error with model specific checks for the "MSCOD" signature in
> MCi_STATUS and that the error was reported from a memory controller bank.
> 
> Adjust the severity to MCE_AO_SEVERITY so that Linux will try to take
> the affected page offline.
> 
> [Tony: Wordsmith commit comment]
> 
> Signed-off-by: Youquan Song <youquan.song@intel.com>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
>  arch/x86/kernel/cpu/mce/core.c | 30 ++++++++++++++++++++++++++++++
>  1 file changed, 30 insertions(+)
> 
> diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
> index e9265e2f28c9..0dbd0a21a0bf 100644
> --- a/arch/x86/kernel/cpu/mce/core.c
> +++ b/arch/x86/kernel/cpu/mce/core.c
> @@ -123,6 +123,8 @@ static struct irq_work mce_irq_work;
>  
>  static void (*quirk_no_way_out)(int bank, struct mce *m, struct pt_regs *regs);
>  
> +static void no_adjust_mce_log(struct mce *m) {};
> +static void (*adjust_mce_log)(struct mce *m) = no_adjust_mce_log;
>  /*
>   * CPU/chipset specific EDAC code can register a notifier call here to print
>   * MCE errors in a human-readable form.
> @@ -772,6 +774,7 @@ bool machine_check_poll(enum mcp_flags flags, mce_banks_t *b)
>  		if (mca_cfg.dont_log_ce && !mce_usable_address(&m))
>  			goto clear_it;
>  
> +		adjust_mce_log(&m);
>  		mce_log(&m);

Two things: can that error type be detected when #MC gets raised, i.e., in
do_machine_check() as part of scanning all banks?

If so, then the adjusting needs to happen inside mce_log().

Also, that assignment to the function pointer doesn't make much sense to
me and I think you should do the vendor/family/model checking straight
in a function adjust_mce_log() which gets called by whoever...

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH] x86/mce: Add Skylake quirk for patrol scrub reported errors
@ 2020-06-15 18:40 Tony Luck
  2020-06-16 19:29 ` Borislav Petkov
  0 siblings, 1 reply; 8+ messages in thread
From: Tony Luck @ 2020-06-15 18:40 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Youquan Song, Tony Luck, x86, linux-kernel

From: Youquan Song <youquan.song@intel.com>

Skylake has a mode where the system administrator can use a BIOS setup
option to request that the memory controller report uncorrected errors
found by the patrol scrubber as corrected.  This results in them being
signalled using CMCI, which is less disruptive than a machine check.

Add a quirk to detect that a "corrected" error is actually a downgraded
uncorrected error with model specific checks for the "MSCOD" signature in
MCi_STATUS and that the error was reported from a memory controller bank.

Adjust the severity to MCE_AO_SEVERITY so that Linux will try to take
the affected page offline.

[Tony: Wordsmith commit comment]

Signed-off-by: Youquan Song <youquan.song@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/kernel/cpu/mce/core.c | 30 ++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index e9265e2f28c9..0dbd0a21a0bf 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -123,6 +123,8 @@ static struct irq_work mce_irq_work;
 
 static void (*quirk_no_way_out)(int bank, struct mce *m, struct pt_regs *regs);
 
+static void no_adjust_mce_log(struct mce *m) {};
+static void (*adjust_mce_log)(struct mce *m) = no_adjust_mce_log;
 /*
  * CPU/chipset specific EDAC code can register a notifier call here to print
  * MCE errors in a human-readable form.
@@ -772,6 +774,7 @@ bool machine_check_poll(enum mcp_flags flags, mce_banks_t *b)
 		if (mca_cfg.dont_log_ce && !mce_usable_address(&m))
 			goto clear_it;
 
+		adjust_mce_log(&m);
 		mce_log(&m);
 
 clear_it:
@@ -1640,6 +1643,30 @@ static void quirk_sandybridge_ifu(int bank, struct mce *m, struct pt_regs *regs)
 	m->cs = regs->cs;
 }
 
+/*
+ * Skylake family CPUs have a mode where the user can request that
+ * the memory controller report uncorrected errors found by the patrol
+ * scrubber as corrected (MCI_STATUS_UC == 0). This results in them being
+ * signalled using CMCI, which is less disruptive that a machine check.
+ * The following quirk detects such errors and adjusts the severity.
+ */
+
+#define MSCOD_UCE_SCRUB	(0x0010 << 16) /* UnCorrected Patrol Scrub Error */
+#define MSCOD_MASK	GENMASK_ULL(31, 16)
+
+/*
+ * Check the error code to see if this is an uncorrected patrol
+ * scrub error from one of the memory controller banks. If so,
+ * then adjust the severity level to MCE_AO_SEVERITY
+ */
+static void quirk_skx_adjust_mce_log(struct mce *m)
+{
+	if (((m->status & MCACOD_SCRUBMSK) == MCACOD_SCRUB) &&
+	    ((m->status & MSCOD_MASK) == MSCOD_UCE_SCRUB) &&
+	    m->bank >= 13 && m->bank <= 18)
+		m->severity = MCE_AO_SEVERITY;
+}
+
 /* Add per CPU specific workarounds here */
 static int __mcheck_cpu_apply_quirks(struct cpuinfo_x86 *c)
 {
@@ -1714,6 +1741,9 @@ static int __mcheck_cpu_apply_quirks(struct cpuinfo_x86 *c)
 
 		if (c->x86 == 6 && c->x86_model == 45)
 			quirk_no_way_out = quirk_sandybridge_ifu;
+
+		if (c->x86 == 6 && c->x86_model == INTEL_FAM6_SKYLAKE_X)
+			adjust_mce_log = quirk_skx_adjust_mce_log;
 	}
 
 	if (c->x86_vendor == X86_VENDOR_ZHAOXIN) {
-- 
2.21.1


^ permalink raw reply related	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2021-03-24 15:35 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-03-22 22:37 [PATCH] x86/mce: Add Skylake quirk for patrol scrub reported errors Tony Luck
2021-03-24 15:00 ` Borislav Petkov
2021-03-24 15:35   ` Luck, Tony
  -- strict thread matches above, loose matches on Subject: below --
2020-06-15 18:40 Tony Luck
2020-06-16 19:29 ` Borislav Petkov
2020-06-16 22:33   ` Luck, Tony
2020-06-17  7:41     ` Borislav Petkov
2020-06-17 18:49       ` Luck, Tony

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).