linux-kernel.vger.kernel.org archive mirror
* [PATCH 1/2] x86/MCE/AMD: Export smca_get_bank_type()
@ 2019-03-07 21:26 Ghannam, Yazen
  2019-03-07 21:26 ` [PATCH 2/2] x86/MCE/AMD, EDAC/mce_amd: Don't report L1 BTB MCA errors on some Family 17h models Ghannam, Yazen
From: Ghannam, Yazen @ 2019-03-07 21:26 UTC (permalink / raw)
  To: linux-edac
  Cc: Ghannam, Yazen, Borislav Petkov, Tony Luck, x86, linux-kernel,
	rafal, clemej

From: Yazen Ghannam <yazen.ghannam@amd.com>

Export the smca_get_bank_type() function so it can be used in the AMD
MCE decoder module.

Cc: <stable@vger.kernel.org> # 4.14.x
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
---
 arch/x86/include/asm/mce.h    | 1 +
 arch/x86/kernel/cpu/mce/amd.c | 3 ++-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index 22d05e3835f0..605b46fde1ee 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -340,6 +340,7 @@ struct smca_bank {
 extern struct smca_bank smca_banks[MAX_NR_BANKS];
 
 extern const char *smca_get_long_name(enum smca_bank_types t);
+extern enum smca_bank_types smca_get_bank_type(unsigned int bank);
 extern bool amd_mce_is_memory_error(struct mce *m);
 
 extern int mce_threshold_create_device(unsigned int cpu);
diff --git a/arch/x86/kernel/cpu/mce/amd.c b/arch/x86/kernel/cpu/mce/amd.c
index e64de5149e50..041bb800cda8 100644
--- a/arch/x86/kernel/cpu/mce/amd.c
+++ b/arch/x86/kernel/cpu/mce/amd.c
@@ -123,7 +123,7 @@ const char *smca_get_long_name(enum smca_bank_types t)
 }
 EXPORT_SYMBOL_GPL(smca_get_long_name);
 
-static enum smca_bank_types smca_get_bank_type(unsigned int bank)
+enum smca_bank_types smca_get_bank_type(unsigned int bank)
 {
 	struct smca_bank *b;
 
@@ -136,6 +136,7 @@ static enum smca_bank_types smca_get_bank_type(unsigned int bank)
 
 	return b->hwid->bank_type;
 }
+EXPORT_SYMBOL_GPL(smca_get_bank_type);
 
 static struct smca_hwid smca_hwid_mcatypes[] = {
 	/* { bank_type, hwid_mcatype, xec_bitmap } */
-- 
2.17.1



* [PATCH 2/2] x86/MCE/AMD, EDAC/mce_amd: Don't report L1 BTB MCA errors on some Family 17h models
  2019-03-07 21:26 [PATCH 1/2] x86/MCE/AMD: Export smca_get_bank_type() Ghannam, Yazen
@ 2019-03-07 21:26 ` Ghannam, Yazen
  2019-03-11 18:21   ` Borislav Petkov
From: Ghannam, Yazen @ 2019-03-07 21:26 UTC (permalink / raw)
  To: linux-edac
  Cc: Ghannam, Yazen, Borislav Petkov, Tony Luck, x86, linux-kernel,
	rafal, clemej

From: Yazen Ghannam <yazen.ghannam@amd.com>

AMD Family 17h Models 10h-2Fh may report a high number of L1 BTB MCA
errors under certain conditions. The errors are benign and can safely be
ignored. However, the high error rate may cause the MCA threshold
counter to overflow causing a high rate of thresholding interrupts. In
addition, users may see the errors reported through the AMD MCE decoder
module, even with the interrupt disabled, due to MCA polling.

This error is reported through the Instruction Fetch bank.

Clear the "Counter Present" bit in the Instruction Fetch bank's
MCA_MISC0 register. This will prevent enabling MCA thresholding on this
bank which will prevent the high interrupt rate due to this error.

Filter out this error signature in the AMD MCE decoder module.

Cc: <stable@vger.kernel.org> # 4.14.x: c95b323dcd35: x86/MCE/AMD: Turn off MC4_MISC thresholding on all family 0x15 models
Cc: <stable@vger.kernel.org> # 4.14.x: 30aa3d26edb0: x86/MCE/AMD: Carve out the MC4_MISC thresholding quirk
Cc: <stable@vger.kernel.org> # 4.14.x
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
---
 arch/x86/kernel/cpu/mce/amd.c | 36 +++++++++++++++++++++++------------
 drivers/edac/mce_amd.c        | 21 ++++++++++++++++++++
 2 files changed, 45 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/amd.c b/arch/x86/kernel/cpu/mce/amd.c
index 041bb800cda8..810e37df5820 100644
--- a/arch/x86/kernel/cpu/mce/amd.c
+++ b/arch/x86/kernel/cpu/mce/amd.c
@@ -565,21 +565,33 @@ prepare_threshold_block(unsigned int bank, unsigned int block, u32 addr,
 }
 
 /*
- * Turn off MC4_MISC thresholding banks on all family 0x15 models since
- * they're not supported there.
+ * Turn off thresholding banks for the following conditions:
+ * - MC4_MISC thresholding is not supported on Family 0x15.
+ * - Prevent possible spurious interrupts from the IF bank on Family 0x17
+ *   Models 0x10-0x2F due to Erratum #1114.
  */
-void disable_err_thresholding(struct cpuinfo_x86 *c)
+void disable_err_thresholding(struct cpuinfo_x86 *c, unsigned int bank)
 {
-	int i;
+	int i, num_msrs;
 	u64 hwcr;
 	bool need_toggle;
-	u32 msrs[] = {
-		0x00000413, /* MC4_MISC0 */
-		0xc0000408, /* MC4_MISC1 */
-	};
+	u32 msrs[NR_BLOCKS];
+
+	if (c->x86 == 0x15 && bank == 4) {
+		msrs[0] = 0x00000413; /* MC4_MISC0 */
+		msrs[1] = 0xc0000408; /* MC4_MISC1 */
+		num_msrs = 2;
+	} else if (c->x86 == 0x17 &&
+		   (c->x86_model >= 0x10 && c->x86_model <= 0x2F)) {
+
+		if (smca_get_bank_type(bank) != SMCA_IF)
+			return;
 
-	if (c->x86 != 0x15)
+		msrs[0] = MSR_AMD64_SMCA_MCx_MISC(bank);
+		num_msrs = 1;
+	} else {
 		return;
+	}
 
 	rdmsrl(MSR_K7_HWCR, hwcr);
 
@@ -590,7 +602,7 @@ void disable_err_thresholding(struct cpuinfo_x86 *c)
 		wrmsrl(MSR_K7_HWCR, hwcr | BIT(18));
 
 	/* Clear CntP bit safely */
-	for (i = 0; i < ARRAY_SIZE(msrs); i++)
+	for (i = 0; i < num_msrs; i++)
 		msr_clear_bit(msrs[i], 62);
 
 	/* restore old settings */
@@ -605,12 +617,12 @@ void mce_amd_feature_init(struct cpuinfo_x86 *c)
 	unsigned int bank, block, cpu = smp_processor_id();
 	int offset = -1;
 
-	disable_err_thresholding(c);
-
 	for (bank = 0; bank < mca_cfg.banks; ++bank) {
 		if (mce_flags.smca)
 			smca_configure(bank, cpu);
 
+		disable_err_thresholding(c, bank);
+
 		for (block = 0; block < NR_BLOCKS; ++block) {
 			address = get_block_address(address, low, high, bank, block);
 			if (!address)
diff --git a/drivers/edac/mce_amd.c b/drivers/edac/mce_amd.c
index 0a1814dad6cf..4f2bf8ecc513 100644
--- a/drivers/edac/mce_amd.c
+++ b/drivers/edac/mce_amd.c
@@ -1001,6 +1001,24 @@ static inline void amd_decode_err_code(u16 ec)
 	pr_cont("\n");
 }
 
+static bool smca_filter_mce(struct mce *m)
+{
+	enum smca_bank_types bank_type = smca_get_bank_type(m->bank);
+	struct cpuinfo_x86 *c = &boot_cpu_data;
+	u8 xec = XEC(m->status, xec_mask);
+
+	/*
+	 * Spurious errors of this type may be reported.
+	 * See Family 17h Models 10h-2Fh Erratum #1114.
+	 */
+	if (c->x86 == 0x17 &&
+	    (c->x86_model >= 0x10 && c->x86_model <= 0x2F) &&
+	    bank_type == SMCA_IF && xec == 10)
+		return true;
+
+	return false;
+}
+
 /*
  * Filter out unwanted MCE signatures here.
  */
@@ -1012,6 +1030,9 @@ static bool amd_filter_mce(struct mce *m)
 	if (m->bank == 4 && XEC(m->status, 0x1f) == 0x5 && !report_gart_errors)
 		return true;
 
+	if (boot_cpu_has(X86_FEATURE_SMCA))
+		return smca_filter_mce(m);
+
 	return false;
 }
 
-- 
2.17.1
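
For readers following along, the "Clear CntP bit safely" step in the patch above can be sketched in user space. This is only an illustration under stated assumptions: `struct fake_msrs` and `clear_cntp()` are hypothetical names that stand in for the real MSR accessors, and the description of HWCR bit 18 as a write-enable is an assumption taken from the way the patch toggles it around the write. Bit 62 as the Counter Present bit comes from the patch itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define HWCR_WREN_BIT	(1ULL << 18)	/* assumed write-enable bit, toggled as in the patch */
#define MISC_CNTP_BIT	(1ULL << 62)	/* "Counter Present" bit cleared by the patch */

/* Hypothetical stand-in for the two registers the quirk touches. */
struct fake_msrs {
	uint64_t hwcr;	/* models MSR_K7_HWCR */
	uint64_t misc0;	/* models the bank's MCA_MISC0 */
};

/* Clear CntP, enabling writes first and restoring HWCR afterwards. */
static void clear_cntp(struct fake_msrs *r)
{
	assert(r != NULL);

	/* Only toggle the enable bit if it was not already set. */
	bool need_toggle = !(r->hwcr & HWCR_WREN_BIT);

	if (need_toggle)
		r->hwcr |= HWCR_WREN_BIT;

	r->misc0 &= ~MISC_CNTP_BIT;

	/* Restore the old setting. */
	if (need_toggle)
		r->hwcr &= ~HWCR_WREN_BIT;
}
```

Feeding it a MISC value with bit 62 set, such as d01b0fff00000000, should leave 901b0fff00000000 and an unchanged HWCR value.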



* Re: [PATCH 2/2] x86/MCE/AMD, EDAC/mce_amd: Don't report L1 BTB MCA errors on some Family 17h models
  2019-03-07 21:26 ` [PATCH 2/2] x86/MCE/AMD, EDAC/mce_amd: Don't report L1 BTB MCA errors on some Family 17h models Ghannam, Yazen
@ 2019-03-11 18:21   ` Borislav Petkov
  2019-03-11 18:52     ` Ghannam, Yazen
From: Borislav Petkov @ 2019-03-11 18:21 UTC (permalink / raw)
  To: Ghannam, Yazen
  Cc: linux-edac, Borislav Petkov, Tony Luck, x86, linux-kernel, rafal, clemej

On Thu, Mar 07, 2019 at 09:26:04PM +0000, Ghannam, Yazen wrote:
> +static bool smca_filter_mce(struct mce *m)
> +{
> +	enum smca_bank_types bank_type = smca_get_bank_type(m->bank);
> +	struct cpuinfo_x86 *c = &boot_cpu_data;
> +	u8 xec = XEC(m->status, xec_mask);
> +
> +	/*
> +	 * Spurious errors of this type may be reported.
> +	 * See Family 17h Models 10h-2Fh Erratum #1114.
> +	 */
> +	if (c->x86 == 0x17 &&
> +	    (c->x86_model >= 0x10 && c->x86_model <= 0x2F) &&
> +	    bank_type == SMCA_IF && xec == 10)
> +		return true;

This is happening too late and we need it much earlier, from Rafal's dmesg:

[    1.070855] mce: [Hardware Error]: Machine check events logged
[    1.070860] mce: [Hardware Error]: CPU 2: Machine Check: 0 Bank 1: d8200000000a0151
[    1.070863] mce: [Hardware Error]: TSC 73fa0765c MISC d01b0fff00000000 SYND 4a000000 IPID 100b000000000
[    1.071065] mce: [Hardware Error]: PROCESSOR 2:810f10 TIME 1543481411 SOCKET 0 APIC 2 microcode 810100b

that's __print_mce() from the notifier.

So we'd need a filter function which is called in do_machine_check() and
machine_check_poll() right after we've collected enough info to be able
to filter out the MCE based on the signature. In this case the extended
error code and SMCA bank type suffice but we should put those functions
late enough so that they can be used for other filtering later.

Alternatively, if this error type has a special bit in the mask registers so
that you can disable it there ala

        if (c->x86_vendor == X86_VENDOR_AMD) {
                if (c->x86 == 15 && cfg->banks > 4) {
                        /*
                         * disable GART TBL walk error reporting, which
                         * trips off incorrectly with the IOMMU & 3ware
                         * & Cerberus:
                         */
                        clear_bit(10, (unsigned long *)&mce_banks[4].ctl);


that would be even better but I'd guess it doesn't have a special bit...

Thx.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.
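
As a rough illustration of the signature check discussed in this message, the sketch below decodes the values from the dmesg lines quoted above and applies the same family/model/bank-type/XEC test as smca_filter_mce(). It is not the kernel implementation. `struct mce_sample` and all function names are hypothetical. The field layouts are assumptions: XEC in MCA_STATUS bits 21:16, the hardware ID in MCA_IPID bits 43:32, the MCA type in bits 63:48, and the Instruction Fetch bank identified by hardware ID 0xB0 with MCA type 1.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the struct mce fields used by the check. */
struct mce_sample {
	uint32_t cpuid;		/* from "PROCESSOR 2:810f10" */
	uint64_t status;	/* MCA_STATUS */
	uint64_t ipid;		/* MCA_IPID */
};

/* Family = base family, plus the extended family when the base is 0xF. */
static unsigned int cpuid_family(uint32_t eax)
{
	unsigned int fam = (eax >> 8) & 0xf;

	if (fam == 0xf)
		fam += (eax >> 20) & 0xff;
	return fam;
}

/* Model = base model, with the extended model as the high nibble. */
static unsigned int cpuid_model(uint32_t eax)
{
	unsigned int model = (eax >> 4) & 0xf;

	if (((eax >> 8) & 0xf) == 0xf)
		model |= ((eax >> 16) & 0xf) << 4;
	return model;
}

/* Extended error code; the 6-bit width is an assumption for SMCA systems. */
static unsigned int status_xec(uint64_t status)
{
	return (status >> 16) & 0x3f;
}

/* Assumed IF bank signature: hardware ID 0xB0, MCA type 1. */
static bool ipid_is_if_bank(uint64_t ipid)
{
	return ((ipid >> 32) & 0xfff) == 0xb0 && ((ipid >> 48) & 0xffff) == 0x1;
}

/* True if the record matches the spurious L1 BTB signature from the patch. */
static bool filter_spurious_l1_btb(const struct mce_sample *m)
{
	assert(m != NULL);

	unsigned int fam = cpuid_family(m->cpuid);
	unsigned int model = cpuid_model(m->cpuid);

	return fam == 0x17 && model >= 0x10 && model <= 0x2f &&
	       ipid_is_if_bank(m->ipid) && status_xec(m->status) == 10;
}
```

With the logged values (cpuid 810f10, status d8200000000a0151, IPID 100b000000000) the record decodes to Family 0x17, Model 0x11, XEC 10 on an IF bank, so this check would filter it.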


* RE: [PATCH 2/2] x86/MCE/AMD, EDAC/mce_amd: Don't report L1 BTB MCA errors on some Family 17h models
  2019-03-11 18:21   ` Borislav Petkov
@ 2019-03-11 18:52     ` Ghannam, Yazen
  2019-03-11 19:01       ` Borislav Petkov
From: Ghannam, Yazen @ 2019-03-11 18:52 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: linux-edac, Borislav Petkov, Tony Luck, x86, linux-kernel, rafal, clemej

> -----Original Message-----
> From: linux-edac-owner@vger.kernel.org <linux-edac-owner@vger.kernel.org> On Behalf Of Borislav Petkov
> Sent: Monday, March 11, 2019 1:21 PM
> To: Ghannam, Yazen <Yazen.Ghannam@amd.com>
> Cc: linux-edac@vger.kernel.org; Borislav Petkov <bp@suse.de>; Tony Luck <tony.luck@intel.com>; x86@kernel.org; linux-
> kernel@vger.kernel.org; rafal@milecki.pl; clemej@gmail.com
> Subject: Re: [PATCH 2/2] x86/MCE/AMD, EDAC/mce_amd: Don't report L1 BTB MCA errors on some Family 17h models
> 
> On Thu, Mar 07, 2019 at 09:26:04PM +0000, Ghannam, Yazen wrote:
> > +static bool smca_filter_mce(struct mce *m)
> > +{
> > +	enum smca_bank_types bank_type = smca_get_bank_type(m->bank);
> > +	struct cpuinfo_x86 *c = &boot_cpu_data;
> > +	u8 xec = XEC(m->status, xec_mask);
> > +
> > +	/*
> > +	 * Spurious errors of this type may be reported.
> > +	 * See Family 17h Models 10h-2Fh Erratum #1114.
> > +	 */
> > +	if (c->x86 == 0x17 &&
> > +	    (c->x86_model >= 0x10 && c->x86_model <= 0x2F) &&
> > +	    bank_type == SMCA_IF && xec == 10)
> > +		return true;
> 
> This is happening too late and we need it much earlier, from Rafal's dmesg:
> 
> [    1.070855] mce: [Hardware Error]: Machine check events logged
> [    1.070860] mce: [Hardware Error]: CPU 2: Machine Check: 0 Bank 1: d8200000000a0151
> [    1.070863] mce: [Hardware Error]: TSC 73fa0765c MISC d01b0fff00000000 SYND 4a000000 IPID 100b000000000
> [    1.071065] mce: [Hardware Error]: PROCESSOR 2:810f10 TIME 1543481411 SOCKET 0 APIC 2 microcode 810100b
> 
> that's __print_mce() from the notifier.
> 
> So we'd need a filter function which is called in do_machine_check() and
> machine_check_poll() right after we've collected enough info to be able
> to filter out the MCE based on the signature. In this case the extended
> error code and SMCA bank type suffice but we should put those functions
> late enough so that they can be used for other filtering later.
> 

Okay, understood.

Should I keep the filter in edac_mce_amd? I guess it's not necessary if filtered out earlier.

> Alternatively, if this error type has a special bit in the mask registers so
> that you can disable it there ala
> 
>         if (c->x86_vendor == X86_VENDOR_AMD) {
>                 if (c->x86 == 15 && cfg->banks > 4) {
>                         /*
>                          * disable GART TBL walk error reporting, which
>                          * trips off incorrectly with the IOMMU & 3ware
>                          * & Cerberus:
>                          */
>                         clear_bit(10, (unsigned long *)&mce_banks[4].ctl);
> 
> 
> that would be even better but I'd guess it doesn't have a special bit...
> 

Yes, that's right. Clearing a bit in MCA_CTL is not recommended in this case.

Thanks,
Yazen


* Re: [PATCH 2/2] x86/MCE/AMD, EDAC/mce_amd: Don't report L1 BTB MCA errors on some Family 17h models
  2019-03-11 18:52     ` Ghannam, Yazen
@ 2019-03-11 19:01       ` Borislav Petkov
From: Borislav Petkov @ 2019-03-11 19:01 UTC (permalink / raw)
  To: Ghannam, Yazen; +Cc: linux-edac, Tony Luck, x86, linux-kernel, rafal, clemej

On Mon, Mar 11, 2019 at 06:52:05PM +0000, Ghannam, Yazen wrote:
> Should I keep the filter in edac_mce_amd? I guess it's not necessary
> if filtered out earlier.

Nah, you don't need to touch that if you filter earlier.

Thx.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.
