linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] x86/mce: Initialize "bank" when we find a fatal error in mce_no_way_out()
@ 2019-02-01  0:33 Tony Luck
  2019-02-01  9:55 ` Borislav Petkov
  2019-02-03 12:36 ` [tip:x86/urgent] x86/MCE: Initialize mce.bank in the case of " tip-bot for Tony Luck
  0 siblings, 2 replies; 5+ messages in thread
From: Tony Luck @ 2019-02-01  0:33 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Tony Luck, x86, linux-kernel

Internal injection testing crashed with a console log that said:

mce: [Hardware Error]: CPU 7: Machine Check Exception: f Bank 0: bd80000000100134

This caused a lot of head scratching because the MCACOD (bits 15:0) of that
status is a signature from an L1 data cache error. But Linux says that it found
it in "Bank 0", which on this model CPU only reports L1 instruction cache errors.

The answer was that Linux doesn't initialize "m->bank" in the case that it finds
a fatal error in the mce_no_way_out() pre-scan of banks. If this was a local machine
check, then we pass this partially initialized "struct mce" to mce_panic().

Fix is simple. Just initialize m->bank in the case that we found a fatal error.

Fixes: 40c36e2741d7 ("x86/mce: Fix incorrect "Machine check from unknown source" message")
Cc: stable@vger.kernel.org # v4.18 Note pre-v5.0 arch/x86/kernel/cpu/mce/core.c was called arch/x86/kernel/cpu/mcheck/mce.c
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/kernel/cpu/mce/core.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 672c7225cb1b..6ce290c506d9 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -784,6 +784,7 @@ static int mce_no_way_out(struct mce *m, char **msg, unsigned long *validp,
 			quirk_no_way_out(i, m, regs);
 
 		if (mce_severity(m, mca_cfg.tolerant, &tmp, true) >= MCE_PANIC_SEVERITY) {
+			m->bank = i;
 			mce_read_aux(m, i);
 			*msg = tmp;
 			return 1;
-- 
2.19.1


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH] x86/mce: Initialize "bank" when we find a fatal error in mce_no_way_out()
  2019-02-01  0:33 [PATCH] x86/mce: Initialize "bank" when we find a fatal error in mce_no_way_out() Tony Luck
@ 2019-02-01  9:55 ` Borislav Petkov
  2019-02-01 18:36   ` Luck, Tony
  2019-02-03 12:36 ` [tip:x86/urgent] x86/MCE: Initialize mce.bank in the case of " tip-bot for Tony Luck
  1 sibling, 1 reply; 5+ messages in thread
From: Borislav Petkov @ 2019-02-01  9:55 UTC (permalink / raw)
  To: Tony Luck; +Cc: x86, linux-kernel

On Thu, Jan 31, 2019 at 04:33:41PM -0800, Tony Luck wrote:
> Internal injection testing crashed with a console log that said:
> 
> mce: [Hardware Error]: CPU 7: Machine Check Exception: f Bank 0: bd80000000100134
> 
> This caused a lot of head scratching because the MCACOD (bits 15:0) of that
> status is a signature from an L1 data cache error. But Linux says that it found
> it in "Bank 0", which on this model CPU only reports L1 instruction cache errors.
> 
> The answer was that Linux doesn't initialize "m->bank" in the case that it finds
> a fatal error in the mce_no_way_out() pre-scan of banks. If this was a local machine
> check, then we pass this partially initialized "struct mce" to mce_panic().
> 
> Fix is simple. Just initialize m->bank in the case that we found a fatal error.
> 
> Fixes: 40c36e2741d7 ("x86/mce: Fix incorrect "Machine check from unknown source" message")
> Cc: stable@vger.kernel.org # v4.18 Note pre-v5.0 arch/x86/kernel/cpu/mce/core.c was called arch/x86/kernel/cpu/mcheck/mce.c
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
>  arch/x86/kernel/cpu/mce/core.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
> index 672c7225cb1b..6ce290c506d9 100644
> --- a/arch/x86/kernel/cpu/mce/core.c
> +++ b/arch/x86/kernel/cpu/mce/core.c
> @@ -784,6 +784,7 @@ static int mce_no_way_out(struct mce *m, char **msg, unsigned long *validp,
>  			quirk_no_way_out(i, m, regs);
>  
>  		if (mce_severity(m, mca_cfg.tolerant, &tmp, true) >= MCE_PANIC_SEVERITY) {
> +			m->bank = i;

So conceptually this write belongs in...

>  			mce_read_aux(m, i);

... this function, i.e., in mce_read_aux() because it gets the bank
number passed in already. And our calling pattern when populating struct
mce is:

	mce_gather_info()
	mce_read_aux()

so it'll be more robust if we moved it there.

Also, that argument "i" of mce_read_aux() is not very telling and it
should be "bank" but that would complicate the stable backporting so if
you feel like it, you could do a second, cleanup patch ontop to fix that
too.

Thx.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] x86/mce: Initialize "bank" when we find a fatal error in mce_no_way_out()
  2019-02-01  9:55 ` Borislav Petkov
@ 2019-02-01 18:36   ` Luck, Tony
  2019-02-02 15:33     ` Borislav Petkov
  0 siblings, 1 reply; 5+ messages in thread
From: Luck, Tony @ 2019-02-01 18:36 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: x86, linux-kernel

On Fri, Feb 01, 2019 at 10:55:53AM +0100, Borislav Petkov wrote:
> On Thu, Jan 31, 2019 at 04:33:41PM -0800, Tony Luck wrote:
> >  		if (mce_severity(m, mca_cfg.tolerant, &tmp, true) >= MCE_PANIC_SEVERITY) {
> > +			m->bank = i;
> 
> So conceptually this write belongs in...
> 
> >  			mce_read_aux(m, i);

Really? That function starts with the comment

	/*
	 * Read ADDR and MISC registers.
	 */

and the name is pretty descriptive that it is "reading auxiliary" information.

> 
> ... this function, i.e., in mce_read_aux() because it gets the bank
> number passed in already. And our calling pattern when populating struct
> mce is:
> 
> 	mce_gather_info()
> 	mce_read_aux()

Only two exising uses:

1) in machine_check_poll()

	mce_gather_info(&m, NULL);
	for (each bank) {
		m.bank = i;
		mce_read_aux(&m, i);
		...
	}

2) in __mc_scan_banks

	// mce_gather_info() already done in do_machine_check
	for (each bank) {
		m.bank = i;
		mce_read_aux(m, i);
		...
	}
> 
> so it'll be more robust if we moved it there.

It would be redundant to move it there for both
existing uses.

> Also, that argument "i" of mce_read_aux() is not very telling and it
> should be "bank" but that would complicate the stable backporting so if
> you feel like it, you could do a second, cleanup patch ontop to fix that
> too.

Agreed that "bank" is clearer than "i". But also agreed that should
be a separate patch for another day.

-Tony

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] x86/mce: Initialize "bank" when we find a fatal error in mce_no_way_out()
  2019-02-01 18:36   ` Luck, Tony
@ 2019-02-02 15:33     ` Borislav Petkov
  0 siblings, 0 replies; 5+ messages in thread
From: Borislav Petkov @ 2019-02-02 15:33 UTC (permalink / raw)
  To: Luck, Tony; +Cc: x86, linux-kernel

On Fri, Feb 01, 2019 at 10:36:17AM -0800, Luck, Tony wrote:
> > so it'll be more robust if we moved it there.
> 
> It would be redundant to move it there for both
> existing uses.

Maybe. But if the bank write happens there, it won't be "forgotten"
again.

I don't care what the functions are called. If they need to do something
more, like *fully* populating struct mce so that a proper record gets
logged further down the line, then we need to make them do so. That's
the point I'm trying to make.

Sure, the stable fix should be simple for easier backporting but in
order to avoid this thing happening again in the future, we should
probably look at unifying and making those paths easier to use.

Right now we have

1. mce_setup() - initial population of struct mce, called from
mce_gather_info(), a.o.
2. mce_gather_info() collects global regs
3. mce_read_aux() reading aux regs

and at the very end, we hand it into mce_log().

And in-between we poke (or don't poke) in some fields.

IOW, if the struct mce finishing population is concentrated in a single
function, it will be much harder to forget setting such fields in the
future.

In this case, the devil is in the detail so I'll have a look at this
when more time.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [tip:x86/urgent] x86/MCE: Initialize mce.bank in the case of a fatal error in mce_no_way_out()
  2019-02-01  0:33 [PATCH] x86/mce: Initialize "bank" when we find a fatal error in mce_no_way_out() Tony Luck
  2019-02-01  9:55 ` Borislav Petkov
@ 2019-02-03 12:36 ` tip-bot for Tony Luck
  1 sibling, 0 replies; 5+ messages in thread
From: tip-bot for Tony Luck @ 2019-02-03 12:36 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: x86, mingo, tony.luck, bp, linux-kernel, hpa, vishal.l.verma,
	tglx, mingo

Commit-ID:  d28af26faa0b1daf3c692603d46bc4687c16f19e
Gitweb:     https://git.kernel.org/tip/d28af26faa0b1daf3c692603d46bc4687c16f19e
Author:     Tony Luck <tony.luck@intel.com>
AuthorDate: Thu, 31 Jan 2019 16:33:41 -0800
Committer:  Borislav Petkov <bp@suse.de>
CommitDate: Sun, 3 Feb 2019 13:24:24 +0100

x86/MCE: Initialize mce.bank in the case of a fatal error in mce_no_way_out()

Internal injection testing crashed with a console log that said:

  mce: [Hardware Error]: CPU 7: Machine Check Exception: f Bank 0: bd80000000100134

This caused a lot of head scratching because the MCACOD (bits 15:0) of
that status is a signature from an L1 data cache error. But Linux says
that it found it in "Bank 0", which on this model CPU only reports L1
instruction cache errors.

The answer was that Linux doesn't initialize "m->bank" in the case that
it finds a fatal error in the mce_no_way_out() pre-scan of banks. If
this was a local machine check, then this partially initialized struct
mce is being passed to mce_panic().

Fix is simple: just initialize m->bank in the case of a fatal error.

Fixes: 40c36e2741d7 ("x86/mce: Fix incorrect "Machine check from unknown source" message")
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: x86-ml <x86@kernel.org>
Cc: stable@vger.kernel.org # v4.18 Note pre-v5.0 arch/x86/kernel/cpu/mce/core.c was called arch/x86/kernel/cpu/mcheck/mce.c
Link: https://lkml.kernel.org/r/20190201003341.10638-1-tony.luck@intel.com
---
 arch/x86/kernel/cpu/mce/core.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 672c7225cb1b..6ce290c506d9 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -784,6 +784,7 @@ static int mce_no_way_out(struct mce *m, char **msg, unsigned long *validp,
 			quirk_no_way_out(i, m, regs);
 
 		if (mce_severity(m, mca_cfg.tolerant, &tmp, true) >= MCE_PANIC_SEVERITY) {
+			m->bank = i;
 			mce_read_aux(m, i);
 			*msg = tmp;
 			return 1;

^ permalink raw reply related	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2019-02-03 12:37 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-02-01  0:33 [PATCH] x86/mce: Initialize "bank" when we find a fatal error in mce_no_way_out() Tony Luck
2019-02-01  9:55 ` Borislav Petkov
2019-02-01 18:36   ` Luck, Tony
2019-02-02 15:33     ` Borislav Petkov
2019-02-03 12:36 ` [tip:x86/urgent] x86/MCE: Initialize mce.bank in the case of " tip-bot for Tony Luck

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).