* Re: [PATCH 2/3] x86/mce: Fix incorrect "Machine check from unknown source" message
2018-05-25 21:41 ` [PATCH 2/3] x86/mce: Fix incorrect "Machine check from unknown source" message Tony Luck
@ 2018-05-28 20:49 ` Borislav Petkov
2018-05-29 16:15 ` [PATCH 2/3 V2] " Luck, Tony
2018-05-29 18:22 ` [PATCH 2/3] " Raj, Ashok
2018-05-29 10:42 ` Borislav Petkov
2018-06-22 12:40 ` tip-bot for Borislav Petkov
2 siblings, 2 replies; 29+ messages in thread
From: Borislav Petkov @ 2018-05-28 20:49 UTC (permalink / raw)
To: Tony Luck; +Cc: Dan Williams, Qiuxu Zhuo, Ashok Raj, x86, linux-kernel
On Fri, May 25, 2018 at 02:41:55PM -0700, Tony Luck wrote:
> @@ -1287,12 +1292,17 @@ void do_machine_check(struct pt_regs *regs, long error_code)
> no_way_out = worst >= MCE_PANIC_SEVERITY;
> } else {
> /*
> - * Local MCE skipped calling mce_reign()
> - * If we found a fatal error, we need to panic here.
> + * If there was a fatal machine check we should have
> + * already called mce_panic earlier in this function.
> + * Since we re-read the banks, we might have found
> + * something new. Check again to see if we found a
> + * fatal error. We call "mce_severity()" again to
> + * make sure we have the right "msg".
> */
> - if (worst >= MCE_PANIC_SEVERITY && mca_cfg.tolerant < 3)
> - mce_panic("Machine check from unknown source",
> - NULL, NULL);
> + if (worst >= MCE_PANIC_SEVERITY && mca_cfg.tolerant < 3) {
> + severity = mce_severity(&m, cfg->tolerant, &msg, true);
> + mce_panic("Local fatal machine check!", &m, msg);
Haha, this would still make you look at the code to remember was it
"fatal local" or "local fatal" the second one. Yeah, there's the "!" but
still.
How about:
"Fatal local machine check after banks scan"
or so.
Btw, the code in do_machine_check() has become one helluva spaghetti
mess. It could use some clean up a bit... :)
--
Regards/Gruss,
Boris.
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
--
^ permalink raw reply [flat|nested] 29+ messages in thread
* [PATCH 2/3 V2] x86/mce: Fix incorrect "Machine check from unknown source" message
2018-05-28 20:49 ` Borislav Petkov
@ 2018-05-29 16:15 ` Luck, Tony
2018-05-29 17:41 ` Borislav Petkov
2018-05-29 18:22 ` [PATCH 2/3] " Raj, Ashok
1 sibling, 1 reply; 29+ messages in thread
From: Luck, Tony @ 2018-05-29 16:15 UTC (permalink / raw)
To: Borislav Petkov; +Cc: Dan Williams, Qiuxu Zhuo, Ashok Raj, x86, linux-kernel
Some injection testing resulted in the following console log:
mce: [Hardware Error]: CPU 22: Machine Check Exception: f Bank 1: bd80000000100134
mce: [Hardware Error]: RIP 10:<ffffffffc05292dd> {pmem_do_bvec+0x11d/0x330 [nd_pmem]}
mce: [Hardware Error]: TSC c51a63035d52 ADDR 3234bc4000 MISC 88
mce: [Hardware Error]: PROCESSOR 0:50654 TIME 1526502199 SOCKET 0 APIC 38 microcode 2000043
mce: [Hardware Error]: Run the above through 'mcelog --ascii'
Kernel panic - not syncing: Machine check from unknown source
This confused everybody because the first line quite clearly shows
that we found a logged error in "Bank 1", while the last line says
"unknown source".
The problem is that the Linux code doesn't do the right thing
for a local machine check that results in a fatal error.
It turns out that we know very early in the handler whether the
machine check is fatal. The call to mce_no_way_out() has checked
all the banks for the CPU that took the local machine check. If
it says we must crash, we can do so right away with the right
messages.
We do scan all the banks again. This means that we might initially
not see a problem, but during the second scan find something fatal.
If this happens we print a different message (so I can see if it
actually every happens).
Cc: stable@vger.kernel.org # 4.2
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
V2: Boris:
1) make the "different message" more descriptive.
2) don't bother reassigning "severity" variable.
arch/x86/kernel/cpu/mcheck/mce.c | 26 ++++++++++++++++++--------
1 file changed, 18 insertions(+), 8 deletions(-)
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 42cf2880d0ed..f013d97abd5f 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -1205,13 +1205,18 @@ void do_machine_check(struct pt_regs *regs, long error_code)
lmce = m.mcgstatus & MCG_STATUS_LMCES;
/*
+ * Local machine check may already know that we have to panic.
+ * Broadcast machine check begins rendezvous in mce_start()
* Go through all banks in exclusion of the other CPUs. This way we
* don't report duplicated events on shared banks because the first one
- * to see it will clear it. If this is a Local MCE, then no need to
- * perform rendezvous.
+ * to see it will clear it.
*/
- if (!lmce)
+ if (lmce) {
+ if (no_way_out)
+ mce_panic("Fatal local machine check", &m, msg);
+ } else {
order = mce_start(&no_way_out);
+ }
for (i = 0; i < cfg->banks; i++) {
__clear_bit(i, toclear);
@@ -1287,12 +1292,17 @@ void do_machine_check(struct pt_regs *regs, long error_code)
no_way_out = worst >= MCE_PANIC_SEVERITY;
} else {
/*
- * Local MCE skipped calling mce_reign()
- * If we found a fatal error, we need to panic here.
+ * If there was a fatal machine check we should have
+ * already called mce_panic earlier in this function.
+ * Since we re-read the banks, we might have found
+ * something new. Check again to see if we found a
+ * fatal error. We call "mce_severity()" again to
+ * make sure we have the right "msg".
*/
- if (worst >= MCE_PANIC_SEVERITY && mca_cfg.tolerant < 3)
- mce_panic("Machine check from unknown source",
- NULL, NULL);
+ if (worst >= MCE_PANIC_SEVERITY && mca_cfg.tolerant < 3) {
+ severity = mce_severity(&m, cfg->tolerant, &msg, true);
+ mce_panic("Local fatal machine check!", &m, msg);
+ }
}
/*
--
2.17.0
^ permalink raw reply related [flat|nested] 29+ messages in thread
* Re: [PATCH 2/3 V2] x86/mce: Fix incorrect "Machine check from unknown source" message
2018-05-29 16:15 ` [PATCH 2/3 V2] " Luck, Tony
@ 2018-05-29 17:41 ` Borislav Petkov
2018-05-29 17:50 ` Luck, Tony
0 siblings, 1 reply; 29+ messages in thread
From: Borislav Petkov @ 2018-05-29 17:41 UTC (permalink / raw)
To: Luck, Tony; +Cc: Dan Williams, Qiuxu Zhuo, Ashok Raj, x86, linux-kernel
On Tue, May 29, 2018 at 09:15:49AM -0700, Luck, Tony wrote:
> @@ -1287,12 +1292,17 @@ void do_machine_check(struct pt_regs *regs, long error_code)
> no_way_out = worst >= MCE_PANIC_SEVERITY;
> } else {
> /*
> - * Local MCE skipped calling mce_reign()
> - * If we found a fatal error, we need to panic here.
> + * If there was a fatal machine check we should have
> + * already called mce_panic earlier in this function.
> + * Since we re-read the banks, we might have found
> + * something new. Check again to see if we found a
> + * fatal error. We call "mce_severity()" again to
> + * make sure we have the right "msg".
> */
> - if (worst >= MCE_PANIC_SEVERITY && mca_cfg.tolerant < 3)
> - mce_panic("Machine check from unknown source",
> - NULL, NULL);
> + if (worst >= MCE_PANIC_SEVERITY && mca_cfg.tolerant < 3) {
> + severity = mce_severity(&m, cfg->tolerant, &msg, true);
> + mce_panic("Local fatal machine check!", &m, msg);
> + }
> }
It is still assigning. I'll simply do:
if (worst >= MCE_PANIC_SEVERITY && mca_cfg.tolerant < 3) {
mce_severity(&m, cfg->tolerant, &msg, true);
mce_panic("Local fatal machine check!", &m, msg);
}
when committing.
Btw, I did some cleaning up ontop. Please have a look and let me know if
it is too ugly to live... :)
https://git.kernel.org/pub/scm/linux/kernel/git/bp/bp.git/log/?h=rc7%2b0-ras
Thx.
--
Regards/Gruss,
Boris.
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
--
^ permalink raw reply [flat|nested] 29+ messages in thread
* RE: [PATCH 2/3 V2] x86/mce: Fix incorrect "Machine check from unknown source" message
2018-05-29 17:41 ` Borislav Petkov
@ 2018-05-29 17:50 ` Luck, Tony
2018-05-29 17:53 ` Borislav Petkov
0 siblings, 1 reply; 29+ messages in thread
From: Luck, Tony @ 2018-05-29 17:50 UTC (permalink / raw)
To: Borislav Petkov
Cc: Williams, Dan J, Zhuo, Qiuxu, Raj, Ashok, x86, linux-kernel
> It is still assigning.
Ah. That would be because I forgot to "git add" before "git commit --amend" :-(
> I'll simply do:
>
> if (worst >= MCE_PANIC_SEVERITY && mca_cfg.tolerant < 3) {
> mce_severity(&m, cfg->tolerant, &msg, true);
> mce_panic("Local fatal machine check!", &m, msg);
> }
>
> when committing.
I had put:
(void) mce_severity(&m, cfg->tolerant, &msg, true);
but either works.
> Btw, I did some cleaning up ontop. Please have a look and let me know if
> it is too ugly to live... :)
>
> https://git.kernel.org/pub/scm/linux/kernel/git/bp/bp.git/log/?h=rc7%2b0-ras
Will look. I hope you didn't do too much to make the stable backport more difficult.
-Tony
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 2/3 V2] x86/mce: Fix incorrect "Machine check from unknown source" message
2018-05-29 17:50 ` Luck, Tony
@ 2018-05-29 17:53 ` Borislav Petkov
2018-05-29 18:54 ` Luck, Tony
0 siblings, 1 reply; 29+ messages in thread
From: Borislav Petkov @ 2018-05-29 17:53 UTC (permalink / raw)
To: Luck, Tony; +Cc: Williams, Dan J, Zhuo, Qiuxu, Raj, Ashok, x86, linux-kernel
On Tue, May 29, 2018 at 05:50:48PM +0000, Luck, Tony wrote:
> Ah. That would be because I forgot to "git add" before "git commit --amend" :-(
Oh, I know the situation very well. :)
> I had put:
>
> (void) mce_severity(&m, cfg->tolerant, &msg, true);
>
> but either works.
Right.
> Will look. I hope you didn't do too much to make the stable backport more difficult.
Nah, the cleanups will all go ontop. This is just a dirty branch to show
my intention but yours go first and then the cleanup.
--
Regards/Gruss,
Boris.
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
--
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 2/3 V2] x86/mce: Fix incorrect "Machine check from unknown source" message
2018-05-29 17:53 ` Borislav Petkov
@ 2018-05-29 18:54 ` Luck, Tony
2018-05-29 20:17 ` Dan Williams
2018-05-30 9:26 ` Borislav Petkov
0 siblings, 2 replies; 29+ messages in thread
From: Luck, Tony @ 2018-05-29 18:54 UTC (permalink / raw)
To: Borislav Petkov
Cc: Williams, Dan J, Zhuo, Qiuxu, Raj, Ashok, x86, linux-kernel
On Tue, May 29, 2018 at 07:53:14PM +0200, Borislav Petkov wrote:
> Nah, the cleanups will all go ontop. This is just a dirty branch to show
> my intention but yours go first and then the cleanup.
Couple of thoughts:
In "x86/mce: Carve out bank scanning code" you drop the extra
call to mce_severity() that I just added:
@@ -1310,10 +1318,8 @@ void do_machine_check(struct pt_regs *regs, long error_code)
* fatal error. We call "mce_severity()" again to
* make sure we have the right "msg".
*/
- if (worst >= MCE_PANIC_SEVERITY && mca_cfg.tolerant < 3) {
- severity = mce_severity(&m, cfg->tolerant, &msg, true);
+ if (worst >= MCE_PANIC_SEVERITY && mca_cfg.tolerant < 3)
mce_panic("Local fatal machine check!", &m, msg);
- }
}
But "msg" won't have been filled in with the right message to match
the error in "m" (__mc_scan_banks() doesn't update "msg").
In "x86/mce: Exit properly when no banks to poll" you
leap right to the end. I'm wondering whether this can
ever happen? I mean, if there are no machine check banks,
then how did we get a machine check?
Both the original, and your new code, skip the:
mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
which seems bad. That leaves MCG_STATUS.MCIP set ... so a second
machine check would just reset the machine.
-Tony
P.S. What happened to my "part 3/3" (updating the Skylake quirk)
... does that belong in somebody else's tree?
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 2/3 V2] x86/mce: Fix incorrect "Machine check from unknown source" message
2018-05-29 18:54 ` Luck, Tony
@ 2018-05-29 20:17 ` Dan Williams
2018-05-30 9:26 ` Borislav Petkov
1 sibling, 0 replies; 29+ messages in thread
From: Dan Williams @ 2018-05-29 20:17 UTC (permalink / raw)
To: Luck, Tony; +Cc: Borislav Petkov, Zhuo, Qiuxu, Raj, Ashok, x86, linux-kernel
On Tue, May 29, 2018 at 11:54 AM, Luck, Tony <tony.luck@intel.com> wrote:
> On Tue, May 29, 2018 at 07:53:14PM +0200, Borislav Petkov wrote:
[..]
> P.S. What happened to my "part 3/3" (updating the Skylake quirk)
> ... does that belong in somebody else's tree?
I have no qualms taking it through the nvdimm tree with the rest of
the pending memcpy_mcsafe() updates. Just need an x86 maintainer ack.
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 2/3 V2] x86/mce: Fix incorrect "Machine check from unknown source" message
2018-05-29 18:54 ` Luck, Tony
2018-05-29 20:17 ` Dan Williams
@ 2018-05-30 9:26 ` Borislav Petkov
2018-06-19 10:30 ` Borislav Petkov
1 sibling, 1 reply; 29+ messages in thread
From: Borislav Petkov @ 2018-05-30 9:26 UTC (permalink / raw)
To: Luck, Tony; +Cc: Williams, Dan J, Zhuo, Qiuxu, Raj, Ashok, x86, linux-kernel
On Tue, May 29, 2018 at 11:54:25AM -0700, Luck, Tony wrote:
> Couple of thoughts:
Thanks for looking.
> In "x86/mce: Carve out bank scanning code" you drop the extra
> call to mce_severity() that I just added:
Yeah, did that before we talked about it.
> In "x86/mce: Exit properly when no banks to poll" you
> leap right to the end. I'm wondering whether this can
> ever happen? I mean, if there are no machine check banks,
> then how did we get a machine check?
Right, so this looks like some remnant from old times, lemme do some
archeology...
/me goes and dusts off the full history linux repo...
I found this:
commit 7dd1e1d805d15ca63d05badf40026629ba75cbc8
Author: Andi Kleen <ak@suse.de>
Date: Tue Feb 24 17:58:41 2004 -0800
[PATCH] New machine check handler for x86-64
and there's no mention why the !banks check is there.
I'm wondering if we should simply remove it. I mean, as you say, if
there are no MCA banks, we won't be running in here in the first
place...
> Both the original, and your new code, skip the:
>
> mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
>
> which seems bad. That leaves MCG_STATUS.MCIP set ... so a second
> machine check would just reset the machine.
That's a good point. It goes away as an issue if we simply drop the
check.
> P.S. What happened to my "part 3/3" (updating the Skylake quirk)
> ... does that belong in somebody else's tree?
Simply hadn't reached it yet. I will take it too, eventually.
Thx.
--
Regards/Gruss,
Boris.
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
--
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 2/3 V2] x86/mce: Fix incorrect "Machine check from unknown source" message
2018-05-30 9:26 ` Borislav Petkov
@ 2018-06-19 10:30 ` Borislav Petkov
0 siblings, 0 replies; 29+ messages in thread
From: Borislav Petkov @ 2018-06-19 10:30 UTC (permalink / raw)
To: Luck, Tony; +Cc: Williams, Dan J, Zhuo, Qiuxu, Raj, Ashok, x86, linux-kernel
On Wed, May 30, 2018 at 11:26:32AM +0200, Borislav Petkov wrote:
> > In "x86/mce: Exit properly when no banks to poll" you
> > leap right to the end. I'm wondering whether this can
> > ever happen? I mean, if there are no machine check banks,
> > then how did we get a machine check?
>
> Right, so this looks like some remnant from old times, lemme do some
> archeology...
>
> /me goes and dusts off the full history linux repo...
>
> I found this:
>
> commit 7dd1e1d805d15ca63d05badf40026629ba75cbc8
> Author: Andi Kleen <ak@suse.de>
> Date: Tue Feb 24 17:58:41 2004 -0800
>
> [PATCH] New machine check handler for x86-64
>
> and there's no mention why the !banks check is there.
>
> I'm wondering if we should simply remove it. I mean, as you say, if
> there are no MCA banks, we won't be running in here in the first
> place...
>
> > Both the original, and your new code, skip the:
> >
> > mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
> >
> > which seems bad. That leaves MCG_STATUS.MCIP set ... so a second
> > machine check would just reset the machine.
>
> That's a good point. It goes away as an issue if we simply drop the
> check.
...and gone it is.
---
From: Borislav Petkov <bp@suse.de>
Date: Tue, 19 Jun 2018 12:27:02 +0200
Subject: [PATCH] x86/mce: Remove !banks check
If we don't have MCA banks, we won't see machine checks anyway. Drop the
check.
Signed-off-by: Borislav Petkov <bp@suse.de>
---
arch/x86/kernel/cpu/mcheck/mce.c | 5 +----
1 file changed, 1 insertion(+), 4 deletions(-)
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index ea1521ec7e5b..05d2d15a95bd 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -1185,9 +1185,6 @@ void do_machine_check(struct pt_regs *regs, long error_code)
this_cpu_inc(mce_exception_count);
- if (!cfg->banks)
- goto out;
-
mce_gather_info(&m, regs);
m.tsc = rdtsc();
@@ -1327,7 +1324,7 @@ void do_machine_check(struct pt_regs *regs, long error_code)
if (worst > 0)
mce_report_event(regs);
mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
-out:
+
sync_core();
if (worst != MCE_AR_SEVERITY && !kill_it)
--
2.17.0.582.gccdcbd54c
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
--
^ permalink raw reply related [flat|nested] 29+ messages in thread
* Re: [PATCH 2/3] x86/mce: Fix incorrect "Machine check from unknown source" message
2018-05-28 20:49 ` Borislav Petkov
2018-05-29 16:15 ` [PATCH 2/3 V2] " Luck, Tony
@ 2018-05-29 18:22 ` Raj, Ashok
1 sibling, 0 replies; 29+ messages in thread
From: Raj, Ashok @ 2018-05-29 18:22 UTC (permalink / raw)
To: Borislav Petkov
Cc: Tony Luck, Dan Williams, Qiuxu Zhuo, x86, linux-kernel, Ashok Raj
On Mon, May 28, 2018 at 10:49:23PM +0200, Borislav Petkov wrote:
> On Fri, May 25, 2018 at 02:41:55PM -0700, Tony Luck wrote:
> > @@ -1287,12 +1292,17 @@ void do_machine_check(struct pt_regs *regs, long error_code)
> > no_way_out = worst >= MCE_PANIC_SEVERITY;
> > } else {
> > /*
> > - * Local MCE skipped calling mce_reign()
> > - * If we found a fatal error, we need to panic here.
> > + * If there was a fatal machine check we should have
> > + * already called mce_panic earlier in this function.
> > + * Since we re-read the banks, we might have found
> > + * something new. Check again to see if we found a
> > + * fatal error. We call "mce_severity()" again to
> > + * make sure we have the right "msg".
> > */
> > - if (worst >= MCE_PANIC_SEVERITY && mca_cfg.tolerant < 3)
> > - mce_panic("Machine check from unknown source",
> > - NULL, NULL);
> > + if (worst >= MCE_PANIC_SEVERITY && mca_cfg.tolerant < 3) {
> > + severity = mce_severity(&m, cfg->tolerant, &msg, true);
> > + mce_panic("Local fatal machine check!", &m, msg);
If this doesn't affect mcelog parsing, would it make sense to change this from
"fatal" -> "Unrecoverable".. Fatal typically screams PCC=1 for x86, but
some of these cases are its Software recoverable, but just that Kernel
isn't able to perform recovery.
>
> Haha, this would still make you look at the code to remember was it
> "fatal local" or "local fatal" the second one. Yeah, there's the "!" but
> still.
>
> How about:
>
> "Fatal local machine check after banks scan"
>
> or so.
>
> Btw, the code in do_machine_check() has become one helluva spaghetti
> mess. It could use some clean up a bit... :)
>
> --
> Regards/Gruss,
> Boris.
>
> SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
> --
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 2/3] x86/mce: Fix incorrect "Machine check from unknown source" message
2018-05-25 21:41 ` [PATCH 2/3] x86/mce: Fix incorrect "Machine check from unknown source" message Tony Luck
2018-05-28 20:49 ` Borislav Petkov
@ 2018-05-29 10:42 ` Borislav Petkov
2018-05-29 16:13 ` Luck, Tony
2018-06-22 12:40 ` tip-bot for Borislav Petkov
2 siblings, 1 reply; 29+ messages in thread
From: Borislav Petkov @ 2018-05-29 10:42 UTC (permalink / raw)
To: Tony Luck; +Cc: Dan Williams, Qiuxu Zhuo, Ashok Raj, x86, linux-kernel
On Fri, May 25, 2018 at 02:41:55PM -0700, Tony Luck wrote:
> @@ -1287,12 +1292,17 @@ void do_machine_check(struct pt_regs *regs, long error_code)
> no_way_out = worst >= MCE_PANIC_SEVERITY;
> } else {
> /*
> - * Local MCE skipped calling mce_reign()
> - * If we found a fatal error, we need to panic here.
> + * If there was a fatal machine check we should have
> + * already called mce_panic earlier in this function.
> + * Since we re-read the banks, we might have found
> + * something new. Check again to see if we found a
> + * fatal error. We call "mce_severity()" again to
> + * make sure we have the right "msg".
> */
> - if (worst >= MCE_PANIC_SEVERITY && mca_cfg.tolerant < 3)
> - mce_panic("Machine check from unknown source",
> - NULL, NULL);
> + if (worst >= MCE_PANIC_SEVERITY && mca_cfg.tolerant < 3) {
> + severity = mce_severity(&m, cfg->tolerant, &msg, true);
Looking at this more while cleaning the whole thing up, that severity
doesn't get read anywhere past this line, AFAICT...
--
Regards/Gruss,
Boris.
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
--
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 2/3] x86/mce: Fix incorrect "Machine check from unknown source" message
2018-05-29 10:42 ` Borislav Petkov
@ 2018-05-29 16:13 ` Luck, Tony
0 siblings, 0 replies; 29+ messages in thread
From: Luck, Tony @ 2018-05-29 16:13 UTC (permalink / raw)
To: Borislav Petkov; +Cc: Dan Williams, Qiuxu Zhuo, Ashok Raj, x86, linux-kernel
On Tue, May 29, 2018 at 12:42:22PM +0200, Borislav Petkov wrote:
> > + * fatal error. We call "mce_severity()" again to
> > + * make sure we have the right "msg".
> > */
> > - if (worst >= MCE_PANIC_SEVERITY && mca_cfg.tolerant < 3)
> > - mce_panic("Machine check from unknown source",
> > - NULL, NULL);
> > + if (worst >= MCE_PANIC_SEVERITY && mca_cfg.tolerant < 3) {
> > + severity = mce_severity(&m, cfg->tolerant, &msg, true);
>
> Looking at this more while cleaning the whole thing up, that severity
> doesn't get read anywhere past this line, AFAICT...
Just making the call to update "msg" (see comment). But you are right
that we don't need to update the severity variable. I'll fix that in
the re-spin to make the messages more than slightly different.
-Tony
^ permalink raw reply [flat|nested] 29+ messages in thread
* [tip:ras/core] x86/mce: Fix incorrect "Machine check from unknown source" message
@ 2018-06-22 12:40 ` tip-bot for Borislav Petkov
0 siblings, 0 replies; 29+ messages in thread
From: tip-bot for Tony Luck @ 2018-06-22 12:40 UTC (permalink / raw)
To: linux-tip-commits
Cc: tglx, linux-edac, dan.j.williams, ashok.raj, tony.luck,
qiuxu.zhuo, bp, mingo, hpa, linux-kernel
Commit-ID: 40c36e2741d7fe1e66d6ec55477ba5fd19c9c5d2
Gitweb: https://git.kernel.org/tip/40c36e2741d7fe1e66d6ec55477ba5fd19c9c5d2
Author: Tony Luck <tony.luck@intel.com>
AuthorDate: Fri, 22 Jun 2018 11:54:23 +0200
Committer: Thomas Gleixner <tglx@linutronix.de>
CommitDate: Fri, 22 Jun 2018 14:35:50 +0200
x86/mce: Fix incorrect "Machine check from unknown source" message
Some injection testing resulted in the following console log:
mce: [Hardware Error]: CPU 22: Machine Check Exception: f Bank 1: bd80000000100134
mce: [Hardware Error]: RIP 10:<ffffffffc05292dd> {pmem_do_bvec+0x11d/0x330 [nd_pmem]}
mce: [Hardware Error]: TSC c51a63035d52 ADDR 3234bc4000 MISC 88
mce: [Hardware Error]: PROCESSOR 0:50654 TIME 1526502199 SOCKET 0 APIC 38 microcode 2000043
mce: [Hardware Error]: Run the above through 'mcelog --ascii'
Kernel panic - not syncing: Machine check from unknown source
This confused everybody because the first line quite clearly shows
that we found a logged error in "Bank 1", while the last line says
"unknown source".
The problem is that the Linux code doesn't do the right thing
for a local machine check that results in a fatal error.
It turns out that we know very early in the handler whether the
machine check is fatal. The call to mce_no_way_out() has checked
all the banks for the CPU that took the local machine check. If
it says we must crash, we can do so right away with the right
messages.
We do scan all the banks again. This means that we might initially
not see a problem, but during the second scan find something fatal.
If this happens we print a slightly different message (so I can
see if it actually every happens).
[ bp: Remove unneeded severity assignment. ]
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Ashok Raj <ashok.raj@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Cc: stable@vger.kernel.org # 4.2
Link: http://lkml.kernel.org/r/52e049a497e86fd0b71c529651def8871c804df0.1527283897.git.tony.luck@intel.com
---
arch/x86/kernel/cpu/mcheck/mce.c | 26 ++++++++++++++++++--------
1 file changed, 18 insertions(+), 8 deletions(-)
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 7e6f51a9d917..e93670d736a6 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -1207,13 +1207,18 @@ void do_machine_check(struct pt_regs *regs, long error_code)
lmce = m.mcgstatus & MCG_STATUS_LMCES;
/*
+ * Local machine check may already know that we have to panic.
+ * Broadcast machine check begins rendezvous in mce_start()
* Go through all banks in exclusion of the other CPUs. This way we
* don't report duplicated events on shared banks because the first one
- * to see it will clear it. If this is a Local MCE, then no need to
- * perform rendezvous.
+ * to see it will clear it.
*/
- if (!lmce)
+ if (lmce) {
+ if (no_way_out)
+ mce_panic("Fatal local machine check", &m, msg);
+ } else {
order = mce_start(&no_way_out);
+ }
for (i = 0; i < cfg->banks; i++) {
__clear_bit(i, toclear);
@@ -1289,12 +1294,17 @@ void do_machine_check(struct pt_regs *regs, long error_code)
no_way_out = worst >= MCE_PANIC_SEVERITY;
} else {
/*
- * Local MCE skipped calling mce_reign()
- * If we found a fatal error, we need to panic here.
+ * If there was a fatal machine check we should have
+ * already called mce_panic earlier in this function.
+ * Since we re-read the banks, we might have found
+ * something new. Check again to see if we found a
+ * fatal error. We call "mce_severity()" again to
+ * make sure we have the right "msg".
*/
- if (worst >= MCE_PANIC_SEVERITY && mca_cfg.tolerant < 3)
- mce_panic("Machine check from unknown source",
- NULL, NULL);
+ if (worst >= MCE_PANIC_SEVERITY && mca_cfg.tolerant < 3) {
+ mce_severity(&m, cfg->tolerant, &msg, true);
+ mce_panic("Local fatal machine check!", &m, msg);
+ }
}
/*
^ permalink raw reply related [flat|nested] 29+ messages in thread
* [tip:ras/core] x86/mce: Fix incorrect "Machine check from unknown source" message
@ 2018-06-22 12:40 ` tip-bot for Borislav Petkov
0 siblings, 0 replies; 29+ messages in thread
From: tip-bot for Borislav Petkov @ 2018-06-22 12:40 UTC (permalink / raw)
To: linux-tip-commits
Cc: tglx, linux-edac, dan.j.williams, ashok.raj, tony.luck,
qiuxu.zhuo, bp, mingo, hpa, linux-kernel
Commit-ID: 40c36e2741d7fe1e66d6ec55477ba5fd19c9c5d2
Gitweb: https://git.kernel.org/tip/40c36e2741d7fe1e66d6ec55477ba5fd19c9c5d2
Author: Tony Luck <tony.luck@intel.com>
AuthorDate: Fri, 22 Jun 2018 11:54:23 +0200
Committer: Thomas Gleixner <tglx@linutronix.de>
CommitDate: Fri, 22 Jun 2018 14:35:50 +0200
x86/mce: Fix incorrect "Machine check from unknown source" message
Some injection testing resulted in the following console log:
mce: [Hardware Error]: CPU 22: Machine Check Exception: f Bank 1: bd80000000100134
mce: [Hardware Error]: RIP 10:<ffffffffc05292dd> {pmem_do_bvec+0x11d/0x330 [nd_pmem]}
mce: [Hardware Error]: TSC c51a63035d52 ADDR 3234bc4000 MISC 88
mce: [Hardware Error]: PROCESSOR 0:50654 TIME 1526502199 SOCKET 0 APIC 38 microcode 2000043
mce: [Hardware Error]: Run the above through 'mcelog --ascii'
Kernel panic - not syncing: Machine check from unknown source
This confused everybody because the first line quite clearly shows
that we found a logged error in "Bank 1", while the last line says
"unknown source".
The problem is that the Linux code doesn't do the right thing
for a local machine check that results in a fatal error.
It turns out that we know very early in the handler whether the
machine check is fatal. The call to mce_no_way_out() has checked
all the banks for the CPU that took the local machine check. If
it says we must crash, we can do so right away with the right
messages.
We do scan all the banks again. This means that we might initially
not see a problem, but during the second scan find something fatal.
If this happens we print a slightly different message (so I can
see if it actually every happens).
[ bp: Remove unneeded severity assignment. ]
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Ashok Raj <ashok.raj@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Cc: stable@vger.kernel.org # 4.2
Link: http://lkml.kernel.org/r/52e049a497e86fd0b71c529651def8871c804df0.1527283897.git.tony.luck@intel.com
---
arch/x86/kernel/cpu/mcheck/mce.c | 26 ++++++++++++++++++--------
1 file changed, 18 insertions(+), 8 deletions(-)
--
To unsubscribe from this list: send the line "unsubscribe linux-edac" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 7e6f51a9d917..e93670d736a6 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -1207,13 +1207,18 @@ void do_machine_check(struct pt_regs *regs, long error_code)
lmce = m.mcgstatus & MCG_STATUS_LMCES;
/*
+ * Local machine check may already know that we have to panic.
+ * Broadcast machine check begins rendezvous in mce_start()
* Go through all banks in exclusion of the other CPUs. This way we
* don't report duplicated events on shared banks because the first one
- * to see it will clear it. If this is a Local MCE, then no need to
- * perform rendezvous.
+ * to see it will clear it.
*/
- if (!lmce)
+ if (lmce) {
+ if (no_way_out)
+ mce_panic("Fatal local machine check", &m, msg);
+ } else {
order = mce_start(&no_way_out);
+ }
for (i = 0; i < cfg->banks; i++) {
__clear_bit(i, toclear);
@@ -1289,12 +1294,17 @@ void do_machine_check(struct pt_regs *regs, long error_code)
no_way_out = worst >= MCE_PANIC_SEVERITY;
} else {
/*
- * Local MCE skipped calling mce_reign()
- * If we found a fatal error, we need to panic here.
+ * If there was a fatal machine check we should have
+ * already called mce_panic earlier in this function.
+ * Since we re-read the banks, we might have found
+ * something new. Check again to see if we found a
+ * fatal error. We call "mce_severity()" again to
+ * make sure we have the right "msg".
*/
- if (worst >= MCE_PANIC_SEVERITY && mca_cfg.tolerant < 3)
- mce_panic("Machine check from unknown source",
- NULL, NULL);
+ if (worst >= MCE_PANIC_SEVERITY && mca_cfg.tolerant < 3) {
+ mce_severity(&m, cfg->tolerant, &msg, true);
+ mce_panic("Local fatal machine check!", &m, msg);
+ }
}
/*
^ permalink raw reply related [flat|nested] 29+ messages in thread