* [PATCH 0/3] x86/mce fixes @ 2018-05-25 21:40 Tony Luck 2018-05-25 21:41 ` [PATCH 1/3] x86/mce: Improve error message when kernel cannot recover Tony Luck ` (2 more replies) 0 siblings, 3 replies; 28+ messages in thread From: Tony Luck @ 2018-05-25 21:40 UTC (permalink / raw) To: Borislav Petkov Cc: Tony Luck, Dan Williams, Qiuxu Zhuo, Ashok Raj, x86, linux-kernel The first two just get better error messages when we refuse to recover from an error that the h/w says is recoverable because the error occurred in a non-recoverable part of the kernel. The third improves identification of Skylake SKUs that support recovery. Tony Luck (3): x86/mce: Improve error message when kernel cannot recover. x86/mce: Fix incorrect "Machine check from unknown source" message x86/mce: Check for alternate indication of machine check recovery on Skylake arch/x86/kernel/cpu/mcheck/mce-severity.c | 5 +++++ arch/x86/kernel/cpu/mcheck/mce.c | 26 ++++++++++++++++------- arch/x86/kernel/quirks.c | 11 ++++++++-- 3 files changed, 32 insertions(+), 10 deletions(-) -- 2.17.0 ^ permalink raw reply [flat|nested] 28+ messages in thread
* [PATCH 1/3] x86/mce: Improve error message when kernel cannot recover. 2018-05-25 21:40 [PATCH 0/3] x86/mce fixes Tony Luck @ 2018-05-25 21:41 ` Tony Luck 2018-06-07 20:24 ` [tip:ras/urgent] " tip-bot for Tony Luck 2018-05-25 21:41 ` [PATCH 2/3] x86/mce: Fix incorrect "Machine check from unknown source" message Tony Luck 2018-05-25 21:42 ` [PATCH 3/3] x86/mce: Check for alternate indication of machine check recovery on Skylake Tony Luck 2 siblings, 1 reply; 28+ messages in thread From: Tony Luck @ 2018-05-25 21:41 UTC (permalink / raw) To: Borislav Petkov Cc: Tony Luck, Dan Williams, Qiuxu Zhuo, Ashok Raj, x86, linux-kernel Since we added support to add recovery from some errors inside the kernel in: commit b2f9d678e28c ("x86/mce: Check for faults tagged in EXTABLE_CLASS_FAULT exception table entries") we have done a less than stellar job at reporting the cause of recoverable machine checks that occur in other parts of the kernel. The user just gets the unhelpful message: mce: [Hardware Error]: Machine check: Action required: unknown MCACOD doubly unhelpful when they check the manual for the reported IA32_MSR_STATUS.MCACOD and see that it is listed as one of the standard recoverable values. Add an extra rule to the MCE severity table to catch this case and report it as: mce: [Hardware Error]: Machine check: Data load in unrecoverable area of kernel Cc: stable@vger.kernel.org # 4.6+ Signed-off-by: Tony Luck <tony.luck@intel.com> --- arch/x86/kernel/cpu/mcheck/mce-severity.c | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/arch/x86/kernel/cpu/mcheck/mce-severity.c b/arch/x86/kernel/cpu/mcheck/mce-severity.c index 5bbd06f38ff6..f34d89c01edc 100644 --- a/arch/x86/kernel/cpu/mcheck/mce-severity.c +++ b/arch/x86/kernel/cpu/mcheck/mce-severity.c @@ -160,6 +160,11 @@ static struct severity { SER, MASK(MCI_STATUS_OVER|MCI_UC_SAR|MCI_ADDR|MCACOD, MCI_UC_SAR|MCI_ADDR|MCACOD_INSTR), USER ), + MCESEV( + PANIC, "Data load in unrecoverable area of kernel", + SER, MASK(MCI_STATUS_OVER|MCI_UC_SAR|MCI_ADDR|MCACOD, MCI_UC_SAR|MCI_ADDR|MCACOD_DATA), + KERNEL + ), #endif MCESEV( PANIC, "Action required: unknown MCACOD", -- 2.17.0 ^ permalink raw reply related [flat|nested] 28+ messages in thread
* [tip:ras/urgent] x86/mce: Improve error message when kernel cannot recover 2018-05-25 21:41 ` [PATCH 1/3] x86/mce: Improve error message when kernel cannot recover Tony Luck @ 2018-06-07 20:24 ` tip-bot for Tony Luck 0 siblings, 0 replies; 28+ messages in thread From: tip-bot for Tony Luck @ 2018-06-07 20:24 UTC (permalink / raw) To: linux-tip-commits Cc: tony.luck, hpa, mingo, linux-kernel, ashok.raj, dan.j.williams, bp, tglx, qiuxu.zhuo Commit-ID: c7d606f560e4c698884697fef503e4abacdd8c25 Gitweb: https://git.kernel.org/tip/c7d606f560e4c698884697fef503e4abacdd8c25 Author: Tony Luck <tony.luck@intel.com> AuthorDate: Fri, 25 May 2018 14:41:39 -0700 Committer: Thomas Gleixner <tglx@linutronix.de> CommitDate: Thu, 7 Jun 2018 22:22:12 +0200 x86/mce: Improve error message when kernel cannot recover Since we added support to add recovery from some errors inside the kernel in: commit b2f9d678e28c ("x86/mce: Check for faults tagged in EXTABLE_CLASS_FAULT exception table entries") we have done a less than stellar job at reporting the cause of recoverable machine checks that occur in other parts of the kernel. The user just gets the unhelpful message: mce: [Hardware Error]: Machine check: Action required: unknown MCACOD doubly unhelpful when they check the manual for the reported IA32_MSR_STATUS.MCACOD and see that it is listed as one of the standard recoverable values. Add an extra rule to the MCE severity table to catch this case and report it as: mce: [Hardware Error]: Machine check: Data load in unrecoverable area of kernel Fixes: b2f9d678e28c ("x86/mce: Check for faults tagged in EXTABLE_CLASS_FAULT exception table entries") Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Cc: Ashok Raj <ashok.raj@intel.com> Cc: stable@vger.kernel.org # 4.6+ Cc: Dan Williams <dan.j.williams@intel.com> Cc: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/4cc7c465150a9a48b8b9f45d0b840278e77eb9b5.1527283897.git.tony.luck@intel.com --- arch/x86/kernel/cpu/mcheck/mce-severity.c | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/arch/x86/kernel/cpu/mcheck/mce-severity.c b/arch/x86/kernel/cpu/mcheck/mce-severity.c index 5bbd06f38ff6..f34d89c01edc 100644 --- a/arch/x86/kernel/cpu/mcheck/mce-severity.c +++ b/arch/x86/kernel/cpu/mcheck/mce-severity.c @@ -160,6 +160,11 @@ static struct severity { SER, MASK(MCI_STATUS_OVER|MCI_UC_SAR|MCI_ADDR|MCACOD, MCI_UC_SAR|MCI_ADDR|MCACOD_INSTR), USER ), + MCESEV( + PANIC, "Data load in unrecoverable area of kernel", + SER, MASK(MCI_STATUS_OVER|MCI_UC_SAR|MCI_ADDR|MCACOD, MCI_UC_SAR|MCI_ADDR|MCACOD_DATA), + KERNEL + ), #endif MCESEV( PANIC, "Action required: unknown MCACOD", ^ permalink raw reply related [flat|nested] 28+ messages in thread
* [PATCH 2/3] x86/mce: Fix incorrect "Machine check from unknown source" message 2018-05-25 21:40 [PATCH 0/3] x86/mce fixes Tony Luck 2018-05-25 21:41 ` [PATCH 1/3] x86/mce: Improve error message when kernel cannot recover Tony Luck @ 2018-05-25 21:41 ` Tony Luck 2018-05-28 20:49 ` Borislav Petkov ` (2 more replies) 2018-05-25 21:42 ` [PATCH 3/3] x86/mce: Check for alternate indication of machine check recovery on Skylake Tony Luck 2 siblings, 3 replies; 28+ messages in thread From: Tony Luck @ 2018-05-25 21:41 UTC (permalink / raw) To: Borislav Petkov Cc: Tony Luck, Dan Williams, Qiuxu Zhuo, Ashok Raj, x86, linux-kernel Some injection testing resulted in the following console log: mce: [Hardware Error]: CPU 22: Machine Check Exception: f Bank 1: bd80000000100134 mce: [Hardware Error]: RIP 10:<ffffffffc05292dd> {pmem_do_bvec+0x11d/0x330 [nd_pmem]} mce: [Hardware Error]: TSC c51a63035d52 ADDR 3234bc4000 MISC 88 mce: [Hardware Error]: PROCESSOR 0:50654 TIME 1526502199 SOCKET 0 APIC 38 microcode 2000043 mce: [Hardware Error]: Run the above through 'mcelog --ascii' Kernel panic - not syncing: Machine check from unknown source This confused everybody because the first line quite clearly shows that we found a logged error in "Bank 1", while the last line says "unknown source". The problem is that the Linux code doesn't do the right thing for a local machine check that results in a fatal error. It turns out that we know very early in the handler whether the machine check is fatal. The call to mce_no_way_out() has checked all the banks for the CPU that took the local machine check. If it says we must crash, we can do so right away with the right messages. We do scan all the banks again. This means that we might initially not see a problem, but during the second scan find something fatal. If this happens we print a slightly different message (so I can see if it actually every happens). Cc: stable@vger.kernel.org # 4.2 Signed-off-by: Tony Luck <tony.luck@intel.com> --- arch/x86/kernel/cpu/mcheck/mce.c | 26 ++++++++++++++++++-------- 1 file changed, 18 insertions(+), 8 deletions(-) diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c index 42cf2880d0ed..f013d97abd5f 100644 --- a/arch/x86/kernel/cpu/mcheck/mce.c +++ b/arch/x86/kernel/cpu/mcheck/mce.c @@ -1205,13 +1205,18 @@ void do_machine_check(struct pt_regs *regs, long error_code) lmce = m.mcgstatus & MCG_STATUS_LMCES; /* + * Local machine check may already know that we have to panic. + * Broadcast machine check begins rendezvous in mce_start() * Go through all banks in exclusion of the other CPUs. This way we * don't report duplicated events on shared banks because the first one - * to see it will clear it. If this is a Local MCE, then no need to - * perform rendezvous. + * to see it will clear it. */ - if (!lmce) + if (lmce) { + if (no_way_out) + mce_panic("Fatal local machine check", &m, msg); + } else { order = mce_start(&no_way_out); + } for (i = 0; i < cfg->banks; i++) { __clear_bit(i, toclear); @@ -1287,12 +1292,17 @@ void do_machine_check(struct pt_regs *regs, long error_code) no_way_out = worst >= MCE_PANIC_SEVERITY; } else { /* - * Local MCE skipped calling mce_reign() - * If we found a fatal error, we need to panic here. + * If there was a fatal machine check we should have + * already called mce_panic earlier in this function. + * Since we re-read the banks, we might have found + * something new. Check again to see if we found a + * fatal error. We call "mce_severity()" again to + * make sure we have the right "msg". */ - if (worst >= MCE_PANIC_SEVERITY && mca_cfg.tolerant < 3) - mce_panic("Machine check from unknown source", - NULL, NULL); + if (worst >= MCE_PANIC_SEVERITY && mca_cfg.tolerant < 3) { + severity = mce_severity(&m, cfg->tolerant, &msg, true); + mce_panic("Local fatal machine check!", &m, msg); + } } /* -- 2.17.0 ^ permalink raw reply related [flat|nested] 28+ messages in thread
* Re: [PATCH 2/3] x86/mce: Fix incorrect "Machine check from unknown source" message 2018-05-25 21:41 ` [PATCH 2/3] x86/mce: Fix incorrect "Machine check from unknown source" message Tony Luck @ 2018-05-28 20:49 ` Borislav Petkov 2018-05-29 16:15 ` [PATCH 2/3 V2] " Luck, Tony 2018-05-29 18:22 ` [PATCH 2/3] " Raj, Ashok 2018-05-29 10:42 ` Borislav Petkov 2018-06-22 12:40 ` [tip:ras/core] " tip-bot for Tony Luck 2 siblings, 2 replies; 28+ messages in thread From: Borislav Petkov @ 2018-05-28 20:49 UTC (permalink / raw) To: Tony Luck; +Cc: Dan Williams, Qiuxu Zhuo, Ashok Raj, x86, linux-kernel On Fri, May 25, 2018 at 02:41:55PM -0700, Tony Luck wrote: > @@ -1287,12 +1292,17 @@ void do_machine_check(struct pt_regs *regs, long error_code) > no_way_out = worst >= MCE_PANIC_SEVERITY; > } else { > /* > - * Local MCE skipped calling mce_reign() > - * If we found a fatal error, we need to panic here. > + * If there was a fatal machine check we should have > + * already called mce_panic earlier in this function. > + * Since we re-read the banks, we might have found > + * something new. Check again to see if we found a > + * fatal error. We call "mce_severity()" again to > + * make sure we have the right "msg". > */ > - if (worst >= MCE_PANIC_SEVERITY && mca_cfg.tolerant < 3) > - mce_panic("Machine check from unknown source", > - NULL, NULL); > + if (worst >= MCE_PANIC_SEVERITY && mca_cfg.tolerant < 3) { > + severity = mce_severity(&m, cfg->tolerant, &msg, true); > + mce_panic("Local fatal machine check!", &m, msg); Haha, this would still make you look at the code to remember was it "fatal local" or "local fatal" the second one. Yeah, there's the "!" but still. How about: "Fatal local machine check after banks scan" or so. Btw, the code in do_machine_check() has become one helluva spaghetti mess. It could use some clean up a bit... :) -- Regards/Gruss, Boris. SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg) -- ^ permalink raw reply [flat|nested] 28+ messages in thread
* [PATCH 2/3 V2] x86/mce: Fix incorrect "Machine check from unknown source" message 2018-05-28 20:49 ` Borislav Petkov @ 2018-05-29 16:15 ` Luck, Tony 2018-05-29 17:41 ` Borislav Petkov 2018-05-29 18:22 ` [PATCH 2/3] " Raj, Ashok 1 sibling, 1 reply; 28+ messages in thread From: Luck, Tony @ 2018-05-29 16:15 UTC (permalink / raw) To: Borislav Petkov; +Cc: Dan Williams, Qiuxu Zhuo, Ashok Raj, x86, linux-kernel Some injection testing resulted in the following console log: mce: [Hardware Error]: CPU 22: Machine Check Exception: f Bank 1: bd80000000100134 mce: [Hardware Error]: RIP 10:<ffffffffc05292dd> {pmem_do_bvec+0x11d/0x330 [nd_pmem]} mce: [Hardware Error]: TSC c51a63035d52 ADDR 3234bc4000 MISC 88 mce: [Hardware Error]: PROCESSOR 0:50654 TIME 1526502199 SOCKET 0 APIC 38 microcode 2000043 mce: [Hardware Error]: Run the above through 'mcelog --ascii' Kernel panic - not syncing: Machine check from unknown source This confused everybody because the first line quite clearly shows that we found a logged error in "Bank 1", while the last line says "unknown source". The problem is that the Linux code doesn't do the right thing for a local machine check that results in a fatal error. It turns out that we know very early in the handler whether the machine check is fatal. The call to mce_no_way_out() has checked all the banks for the CPU that took the local machine check. If it says we must crash, we can do so right away with the right messages. We do scan all the banks again. This means that we might initially not see a problem, but during the second scan find something fatal. If this happens we print a different message (so I can see if it actually every happens). Cc: stable@vger.kernel.org # 4.2 Signed-off-by: Tony Luck <tony.luck@intel.com> --- V2: Boris: 1) make the "different message" more descriptive. 2) don't bother reassigning "severity" variable. arch/x86/kernel/cpu/mcheck/mce.c | 26 ++++++++++++++++++-------- 1 file changed, 18 insertions(+), 8 deletions(-) diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c index 42cf2880d0ed..f013d97abd5f 100644 --- a/arch/x86/kernel/cpu/mcheck/mce.c +++ b/arch/x86/kernel/cpu/mcheck/mce.c @@ -1205,13 +1205,18 @@ void do_machine_check(struct pt_regs *regs, long error_code) lmce = m.mcgstatus & MCG_STATUS_LMCES; /* + * Local machine check may already know that we have to panic. + * Broadcast machine check begins rendezvous in mce_start() * Go through all banks in exclusion of the other CPUs. This way we * don't report duplicated events on shared banks because the first one - * to see it will clear it. If this is a Local MCE, then no need to - * perform rendezvous. + * to see it will clear it. */ - if (!lmce) + if (lmce) { + if (no_way_out) + mce_panic("Fatal local machine check", &m, msg); + } else { order = mce_start(&no_way_out); + } for (i = 0; i < cfg->banks; i++) { __clear_bit(i, toclear); @@ -1287,12 +1292,17 @@ void do_machine_check(struct pt_regs *regs, long error_code) no_way_out = worst >= MCE_PANIC_SEVERITY; } else { /* - * Local MCE skipped calling mce_reign() - * If we found a fatal error, we need to panic here. + * If there was a fatal machine check we should have + * already called mce_panic earlier in this function. + * Since we re-read the banks, we might have found + * something new. Check again to see if we found a + * fatal error. We call "mce_severity()" again to + * make sure we have the right "msg". */ - if (worst >= MCE_PANIC_SEVERITY && mca_cfg.tolerant < 3) - mce_panic("Machine check from unknown source", - NULL, NULL); + if (worst >= MCE_PANIC_SEVERITY && mca_cfg.tolerant < 3) { + severity = mce_severity(&m, cfg->tolerant, &msg, true); + mce_panic("Local fatal machine check!", &m, msg); + } } /* -- 2.17.0 ^ permalink raw reply related [flat|nested] 28+ messages in thread
* Re: [PATCH 2/3 V2] x86/mce: Fix incorrect "Machine check from unknown source" message 2018-05-29 16:15 ` [PATCH 2/3 V2] " Luck, Tony @ 2018-05-29 17:41 ` Borislav Petkov 2018-05-29 17:50 ` Luck, Tony 0 siblings, 1 reply; 28+ messages in thread From: Borislav Petkov @ 2018-05-29 17:41 UTC (permalink / raw) To: Luck, Tony; +Cc: Dan Williams, Qiuxu Zhuo, Ashok Raj, x86, linux-kernel On Tue, May 29, 2018 at 09:15:49AM -0700, Luck, Tony wrote: > @@ -1287,12 +1292,17 @@ void do_machine_check(struct pt_regs *regs, long error_code) > no_way_out = worst >= MCE_PANIC_SEVERITY; > } else { > /* > - * Local MCE skipped calling mce_reign() > - * If we found a fatal error, we need to panic here. > + * If there was a fatal machine check we should have > + * already called mce_panic earlier in this function. > + * Since we re-read the banks, we might have found > + * something new. Check again to see if we found a > + * fatal error. We call "mce_severity()" again to > + * make sure we have the right "msg". > */ > - if (worst >= MCE_PANIC_SEVERITY && mca_cfg.tolerant < 3) > - mce_panic("Machine check from unknown source", > - NULL, NULL); > + if (worst >= MCE_PANIC_SEVERITY && mca_cfg.tolerant < 3) { > + severity = mce_severity(&m, cfg->tolerant, &msg, true); > + mce_panic("Local fatal machine check!", &m, msg); > + } > } It is still assigning. I'll simply do: if (worst >= MCE_PANIC_SEVERITY && mca_cfg.tolerant < 3) { mce_severity(&m, cfg->tolerant, &msg, true); mce_panic("Local fatal machine check!", &m, msg); } when committing. Btw, I did some cleaning up ontop. Please have a look and let me know if it is too ugly to live... :) https://git.kernel.org/pub/scm/linux/kernel/git/bp/bp.git/log/?h=rc7%2b0-ras Thx. -- Regards/Gruss, Boris. SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg) -- ^ permalink raw reply [flat|nested] 28+ messages in thread
* RE: [PATCH 2/3 V2] x86/mce: Fix incorrect "Machine check from unknown source" message 2018-05-29 17:41 ` Borislav Petkov @ 2018-05-29 17:50 ` Luck, Tony 2018-05-29 17:53 ` Borislav Petkov 0 siblings, 1 reply; 28+ messages in thread From: Luck, Tony @ 2018-05-29 17:50 UTC (permalink / raw) To: Borislav Petkov Cc: Williams, Dan J, Zhuo, Qiuxu, Raj, Ashok, x86, linux-kernel > It is still assigning. Ah. That would be because I forgot to "git add" before "git commit --amend" :-( > I'll simply do: > > if (worst >= MCE_PANIC_SEVERITY && mca_cfg.tolerant < 3) { > mce_severity(&m, cfg->tolerant, &msg, true); > mce_panic("Local fatal machine check!", &m, msg); > } > > when committing. I had put: (void) mce_severity(&m, cfg->tolerant, &msg, true); but either works. > Btw, I did some cleaning up ontop. Please have a look and let me know if > it is too ugly to live... :) > > https://git.kernel.org/pub/scm/linux/kernel/git/bp/bp.git/log/?h=rc7%2b0-ras Will look. I hope you didn't do too much to make the stable backport more difficult. -Tony ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 2/3 V2] x86/mce: Fix incorrect "Machine check from unknown source" message 2018-05-29 17:50 ` Luck, Tony @ 2018-05-29 17:53 ` Borislav Petkov 2018-05-29 18:54 ` Luck, Tony 0 siblings, 1 reply; 28+ messages in thread From: Borislav Petkov @ 2018-05-29 17:53 UTC (permalink / raw) To: Luck, Tony; +Cc: Williams, Dan J, Zhuo, Qiuxu, Raj, Ashok, x86, linux-kernel On Tue, May 29, 2018 at 05:50:48PM +0000, Luck, Tony wrote: > Ah. That would be because I forgot to "git add" before "git commit --amend" :-( Oh, I know the situation very well. :) > I had put: > > (void) mce_severity(&m, cfg->tolerant, &msg, true); > > but either works. Right. > Will look. I hope you didn't do too much to make the stable backport more difficult. Nah, the cleanups will all go ontop. This is just a dirty branch to show my intention but yours go first and then the cleanup. -- Regards/Gruss, Boris. SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg) -- ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 2/3 V2] x86/mce: Fix incorrect "Machine check from unknown source" message 2018-05-29 17:53 ` Borislav Petkov @ 2018-05-29 18:54 ` Luck, Tony 2018-05-29 20:17 ` Dan Williams 2018-05-30 9:26 ` Borislav Petkov 0 siblings, 2 replies; 28+ messages in thread From: Luck, Tony @ 2018-05-29 18:54 UTC (permalink / raw) To: Borislav Petkov Cc: Williams, Dan J, Zhuo, Qiuxu, Raj, Ashok, x86, linux-kernel On Tue, May 29, 2018 at 07:53:14PM +0200, Borislav Petkov wrote: > Nah, the cleanups will all go ontop. This is just a dirty branch to show > my intention but yours go first and then the cleanup. Couple of thoughts: In "x86/mce: Carve out bank scanning code" you drop the extra call to mce_severity() that I just added: @@ -1310,10 +1318,8 @@ void do_machine_check(struct pt_regs *regs, long error_code) * fatal error. We call "mce_severity()" again to * make sure we have the right "msg". */ - if (worst >= MCE_PANIC_SEVERITY && mca_cfg.tolerant < 3) { - severity = mce_severity(&m, cfg->tolerant, &msg, true); + if (worst >= MCE_PANIC_SEVERITY && mca_cfg.tolerant < 3) mce_panic("Local fatal machine check!", &m, msg); - } } But "msg" won't have been filled in with the right message to match the error in "m" (__mc_scan_banks() doesn't update "msg"). In "x86/mce: Exit properly when no banks to poll" you leap right to the end. I'm wondering whether this can ever happen? I mean, if there are no machine check banks, then how did we get a machine check? Both the original, and your new code, skip the: mce_wrmsrl(MSR_IA32_MCG_STATUS, 0); which seems bad. That leaves MCG_STATUS.MCIP set ... so a second machine check would just reset the machine. -Tony P.S. What happened to my "part 3/3" (updating the Skylake quirk) ... does that belong in somebody else's tree? ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 2/3 V2] x86/mce: Fix incorrect "Machine check from unknown source" message 2018-05-29 18:54 ` Luck, Tony @ 2018-05-29 20:17 ` Dan Williams 2018-05-30 9:26 ` Borislav Petkov 1 sibling, 0 replies; 28+ messages in thread From: Dan Williams @ 2018-05-29 20:17 UTC (permalink / raw) To: Luck, Tony; +Cc: Borislav Petkov, Zhuo, Qiuxu, Raj, Ashok, x86, linux-kernel On Tue, May 29, 2018 at 11:54 AM, Luck, Tony <tony.luck@intel.com> wrote: > On Tue, May 29, 2018 at 07:53:14PM +0200, Borislav Petkov wrote: [..] > P.S. What happened to my "part 3/3" (updating the Skylake quirk) > ... does that belong in somebody else's tree? I have no qualms taking it through the nvdimm tree with the rest of the pending memcpy_mcsafe() updates. Just need an x86 maintainer ack. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 2/3 V2] x86/mce: Fix incorrect "Machine check from unknown source" message 2018-05-29 18:54 ` Luck, Tony 2018-05-29 20:17 ` Dan Williams @ 2018-05-30 9:26 ` Borislav Petkov 2018-06-19 10:30 ` Borislav Petkov 1 sibling, 1 reply; 28+ messages in thread From: Borislav Petkov @ 2018-05-30 9:26 UTC (permalink / raw) To: Luck, Tony; +Cc: Williams, Dan J, Zhuo, Qiuxu, Raj, Ashok, x86, linux-kernel On Tue, May 29, 2018 at 11:54:25AM -0700, Luck, Tony wrote: > Couple of thoughts: Thanks for looking. > In "x86/mce: Carve out bank scanning code" you drop the extra > call to mce_severity() that I just added: Yeah, did that before we talked about it. > In "x86/mce: Exit properly when no banks to poll" you > leap right to the end. I'm wondering whether this can > ever happen? I mean, if there are no machine check banks, > then how did we get a machine check? Right, so this looks like some remnant from old times, lemme do some archeology... /me goes and dusts off the full history linux repo... I found this: commit 7dd1e1d805d15ca63d05badf40026629ba75cbc8 Author: Andi Kleen <ak@suse.de> Date: Tue Feb 24 17:58:41 2004 -0800 [PATCH] New machine check handler for x86-64 and there's no mention why the !banks check is there. I'm wondering if we should simply remove it. I mean, as you say, if there are no MCA banks, we won't be running in here in the first place... > Both the original, and your new code, skip the: > > mce_wrmsrl(MSR_IA32_MCG_STATUS, 0); > > which seems bad. That leaves MCG_STATUS.MCIP set ... so a second > machine check would just reset the machine. That's a good point. It goes away as an issue if we simply drop the check. > P.S. What happened to my "part 3/3" (updating the Skylake quirk) > ... does that belong in somebody else's tree? Simply hadn't reached it yet. I will take it too, eventually. Thx. -- Regards/Gruss, Boris. SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg) -- ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 2/3 V2] x86/mce: Fix incorrect "Machine check from unknown source" message 2018-05-30 9:26 ` Borislav Petkov @ 2018-06-19 10:30 ` Borislav Petkov 0 siblings, 0 replies; 28+ messages in thread From: Borislav Petkov @ 2018-06-19 10:30 UTC (permalink / raw) To: Luck, Tony; +Cc: Williams, Dan J, Zhuo, Qiuxu, Raj, Ashok, x86, linux-kernel On Wed, May 30, 2018 at 11:26:32AM +0200, Borislav Petkov wrote: > > In "x86/mce: Exit properly when no banks to poll" you > > leap right to the end. I'm wondering whether this can > > ever happen? I mean, if there are no machine check banks, > > then how did we get a machine check? > > Right, so this looks like some remnant from old times, lemme do some > archeology... > > /me goes and dusts off the full history linux repo... > > I found this: > > commit 7dd1e1d805d15ca63d05badf40026629ba75cbc8 > Author: Andi Kleen <ak@suse.de> > Date: Tue Feb 24 17:58:41 2004 -0800 > > [PATCH] New machine check handler for x86-64 > > and there's no mention why the !banks check is there. > > I'm wondering if we should simply remove it. I mean, as you say, if > there are no MCA banks, we won't be running in here in the first > place... > > > Both the original, and your new code, skip the: > > > > mce_wrmsrl(MSR_IA32_MCG_STATUS, 0); > > > > which seems bad. That leaves MCG_STATUS.MCIP set ... so a second > > machine check would just reset the machine. > > That's a good point. It goes away as an issue if we simply drop the > check. ...and gone it is. --- From: Borislav Petkov <bp@suse.de> Date: Tue, 19 Jun 2018 12:27:02 +0200 Subject: [PATCH] x86/mce: Remove !banks check If we don't have MCA banks, we won't see machine checks anyway. Drop the check. Signed-off-by: Borislav Petkov <bp@suse.de> --- arch/x86/kernel/cpu/mcheck/mce.c | 5 +---- 1 file changed, 1 insertion(+), 4 deletions(-) diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c index ea1521ec7e5b..05d2d15a95bd 100644 --- a/arch/x86/kernel/cpu/mcheck/mce.c +++ b/arch/x86/kernel/cpu/mcheck/mce.c @@ -1185,9 +1185,6 @@ void do_machine_check(struct pt_regs *regs, long error_code) this_cpu_inc(mce_exception_count); - if (!cfg->banks) - goto out; - mce_gather_info(&m, regs); m.tsc = rdtsc(); @@ -1327,7 +1324,7 @@ void do_machine_check(struct pt_regs *regs, long error_code) if (worst > 0) mce_report_event(regs); mce_wrmsrl(MSR_IA32_MCG_STATUS, 0); -out: + sync_core(); if (worst != MCE_AR_SEVERITY && !kill_it) -- 2.17.0.582.gccdcbd54c SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg) -- ^ permalink raw reply related [flat|nested] 28+ messages in thread
* Re: [PATCH 2/3] x86/mce: Fix incorrect "Machine check from unknown source" message 2018-05-28 20:49 ` Borislav Petkov 2018-05-29 16:15 ` [PATCH 2/3 V2] " Luck, Tony @ 2018-05-29 18:22 ` Raj, Ashok 1 sibling, 0 replies; 28+ messages in thread From: Raj, Ashok @ 2018-05-29 18:22 UTC (permalink / raw) To: Borislav Petkov Cc: Tony Luck, Dan Williams, Qiuxu Zhuo, x86, linux-kernel, Ashok Raj On Mon, May 28, 2018 at 10:49:23PM +0200, Borislav Petkov wrote: > On Fri, May 25, 2018 at 02:41:55PM -0700, Tony Luck wrote: > > @@ -1287,12 +1292,17 @@ void do_machine_check(struct pt_regs *regs, long error_code) > > no_way_out = worst >= MCE_PANIC_SEVERITY; > > } else { > > /* > > - * Local MCE skipped calling mce_reign() > > - * If we found a fatal error, we need to panic here. > > + * If there was a fatal machine check we should have > > + * already called mce_panic earlier in this function. > > + * Since we re-read the banks, we might have found > > + * something new. Check again to see if we found a > > + * fatal error. We call "mce_severity()" again to > > + * make sure we have the right "msg". > > */ > > - if (worst >= MCE_PANIC_SEVERITY && mca_cfg.tolerant < 3) > > - mce_panic("Machine check from unknown source", > > - NULL, NULL); > > + if (worst >= MCE_PANIC_SEVERITY && mca_cfg.tolerant < 3) { > > + severity = mce_severity(&m, cfg->tolerant, &msg, true); > > + mce_panic("Local fatal machine check!", &m, msg); If this doesn't affect mcelog parsing, would it make sense to change this from "fatal" -> "Unrecoverable".. Fatal typically screams PCC=1 for x86, but some of these cases are its Software recoverable, but just that Kernel isn't able to perform recovery. > > Haha, this would still make you look at the code to remember was it > "fatal local" or "local fatal" the second one. Yeah, there's the "!" but > still. > > How about: > > "Fatal local machine check after banks scan" > > or so. > > Btw, the code in do_machine_check() has become one helluva spaghetti > mess. It could use some clean up a bit... :) > > -- > Regards/Gruss, > Boris. > > SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg) > -- ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 2/3] x86/mce: Fix incorrect "Machine check from unknown source" message 2018-05-25 21:41 ` [PATCH 2/3] x86/mce: Fix incorrect "Machine check from unknown source" message Tony Luck 2018-05-28 20:49 ` Borislav Petkov @ 2018-05-29 10:42 ` Borislav Petkov 2018-05-29 16:13 ` Luck, Tony 2018-06-22 12:40 ` [tip:ras/core] " tip-bot for Tony Luck 2 siblings, 1 reply; 28+ messages in thread From: Borislav Petkov @ 2018-05-29 10:42 UTC (permalink / raw) To: Tony Luck; +Cc: Dan Williams, Qiuxu Zhuo, Ashok Raj, x86, linux-kernel On Fri, May 25, 2018 at 02:41:55PM -0700, Tony Luck wrote: > @@ -1287,12 +1292,17 @@ void do_machine_check(struct pt_regs *regs, long error_code) > no_way_out = worst >= MCE_PANIC_SEVERITY; > } else { > /* > - * Local MCE skipped calling mce_reign() > - * If we found a fatal error, we need to panic here. > + * If there was a fatal machine check we should have > + * already called mce_panic earlier in this function. > + * Since we re-read the banks, we might have found > + * something new. Check again to see if we found a > + * fatal error. We call "mce_severity()" again to > + * make sure we have the right "msg". > */ > - if (worst >= MCE_PANIC_SEVERITY && mca_cfg.tolerant < 3) > - mce_panic("Machine check from unknown source", > - NULL, NULL); > + if (worst >= MCE_PANIC_SEVERITY && mca_cfg.tolerant < 3) { > + severity = mce_severity(&m, cfg->tolerant, &msg, true); Looking at this more while cleaning the whole thing up, that severity doesn't get read anywhere past this line, AFAICT... -- Regards/Gruss, Boris. SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg) -- ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 2/3] x86/mce: Fix incorrect "Machine check from unknown source" message 2018-05-29 10:42 ` Borislav Petkov @ 2018-05-29 16:13 ` Luck, Tony 0 siblings, 0 replies; 28+ messages in thread From: Luck, Tony @ 2018-05-29 16:13 UTC (permalink / raw) To: Borislav Petkov; +Cc: Dan Williams, Qiuxu Zhuo, Ashok Raj, x86, linux-kernel On Tue, May 29, 2018 at 12:42:22PM +0200, Borislav Petkov wrote: > > + * fatal error. We call "mce_severity()" again to > > + * make sure we have the right "msg". > > */ > > - if (worst >= MCE_PANIC_SEVERITY && mca_cfg.tolerant < 3) > > - mce_panic("Machine check from unknown source", > > - NULL, NULL); > > + if (worst >= MCE_PANIC_SEVERITY && mca_cfg.tolerant < 3) { > > + severity = mce_severity(&m, cfg->tolerant, &msg, true); > > Looking at this more while cleaning the whole thing up, that severity > doesn't get read anywhere past this line, AFAICT... Just making the call to update "msg" (see comment). But you are right that we don't need to update the severity variable. I'll fix that in the re-spin to make the messages more than slightly different. -Tony ^ permalink raw reply [flat|nested] 28+ messages in thread
* [tip:ras/core] x86/mce: Fix incorrect "Machine check from unknown source" message 2018-05-25 21:41 ` [PATCH 2/3] x86/mce: Fix incorrect "Machine check from unknown source" message Tony Luck 2018-05-28 20:49 ` Borislav Petkov 2018-05-29 10:42 ` Borislav Petkov @ 2018-06-22 12:40 ` tip-bot for Tony Luck 2 siblings, 0 replies; 28+ messages in thread From: tip-bot for Tony Luck @ 2018-06-22 12:40 UTC (permalink / raw) To: linux-tip-commits Cc: tglx, linux-edac, dan.j.williams, ashok.raj, tony.luck, qiuxu.zhuo, bp, mingo, hpa, linux-kernel Commit-ID: 40c36e2741d7fe1e66d6ec55477ba5fd19c9c5d2 Gitweb: https://git.kernel.org/tip/40c36e2741d7fe1e66d6ec55477ba5fd19c9c5d2 Author: Tony Luck <tony.luck@intel.com> AuthorDate: Fri, 22 Jun 2018 11:54:23 +0200 Committer: Thomas Gleixner <tglx@linutronix.de> CommitDate: Fri, 22 Jun 2018 14:35:50 +0200 x86/mce: Fix incorrect "Machine check from unknown source" message Some injection testing resulted in the following console log: mce: [Hardware Error]: CPU 22: Machine Check Exception: f Bank 1: bd80000000100134 mce: [Hardware Error]: RIP 10:<ffffffffc05292dd> {pmem_do_bvec+0x11d/0x330 [nd_pmem]} mce: [Hardware Error]: TSC c51a63035d52 ADDR 3234bc4000 MISC 88 mce: [Hardware Error]: PROCESSOR 0:50654 TIME 1526502199 SOCKET 0 APIC 38 microcode 2000043 mce: [Hardware Error]: Run the above through 'mcelog --ascii' Kernel panic - not syncing: Machine check from unknown source This confused everybody because the first line quite clearly shows that we found a logged error in "Bank 1", while the last line says "unknown source". The problem is that the Linux code doesn't do the right thing for a local machine check that results in a fatal error. It turns out that we know very early in the handler whether the machine check is fatal. The call to mce_no_way_out() has checked all the banks for the CPU that took the local machine check. If it says we must crash, we can do so right away with the right messages. We do scan all the banks again. This means that we might initially not see a problem, but during the second scan find something fatal. If this happens we print a slightly different message (so I can see if it actually every happens). [ bp: Remove unneeded severity assignment. ] Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Ashok Raj <ashok.raj@intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Cc: linux-edac <linux-edac@vger.kernel.org> Cc: stable@vger.kernel.org # 4.2 Link: http://lkml.kernel.org/r/52e049a497e86fd0b71c529651def8871c804df0.1527283897.git.tony.luck@intel.com --- arch/x86/kernel/cpu/mcheck/mce.c | 26 ++++++++++++++++++-------- 1 file changed, 18 insertions(+), 8 deletions(-) diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c index 7e6f51a9d917..e93670d736a6 100644 --- a/arch/x86/kernel/cpu/mcheck/mce.c +++ b/arch/x86/kernel/cpu/mcheck/mce.c @@ -1207,13 +1207,18 @@ void do_machine_check(struct pt_regs *regs, long error_code) lmce = m.mcgstatus & MCG_STATUS_LMCES; /* + * Local machine check may already know that we have to panic. + * Broadcast machine check begins rendezvous in mce_start() * Go through all banks in exclusion of the other CPUs. This way we * don't report duplicated events on shared banks because the first one - * to see it will clear it. If this is a Local MCE, then no need to - * perform rendezvous. + * to see it will clear it. */ - if (!lmce) + if (lmce) { + if (no_way_out) + mce_panic("Fatal local machine check", &m, msg); + } else { order = mce_start(&no_way_out); + } for (i = 0; i < cfg->banks; i++) { __clear_bit(i, toclear); @@ -1289,12 +1294,17 @@ void do_machine_check(struct pt_regs *regs, long error_code) no_way_out = worst >= MCE_PANIC_SEVERITY; } else { /* - * Local MCE skipped calling mce_reign() - * If we found a fatal error, we need to panic here. + * If there was a fatal machine check we should have + * already called mce_panic earlier in this function. + * Since we re-read the banks, we might have found + * something new. Check again to see if we found a + * fatal error. We call "mce_severity()" again to + * make sure we have the right "msg". */ - if (worst >= MCE_PANIC_SEVERITY && mca_cfg.tolerant < 3) - mce_panic("Machine check from unknown source", - NULL, NULL); + if (worst >= MCE_PANIC_SEVERITY && mca_cfg.tolerant < 3) { + mce_severity(&m, cfg->tolerant, &msg, true); + mce_panic("Local fatal machine check!", &m, msg); + } } /* ^ permalink raw reply related [flat|nested] 28+ messages in thread
* [PATCH 3/3] x86/mce: Check for alternate indication of machine check recovery on Skylake 2018-05-25 21:40 [PATCH 0/3] x86/mce fixes Tony Luck 2018-05-25 21:41 ` [PATCH 1/3] x86/mce: Improve error message when kernel cannot recover Tony Luck 2018-05-25 21:41 ` [PATCH 2/3] x86/mce: Fix incorrect "Machine check from unknown source" message Tony Luck @ 2018-05-25 21:42 ` Tony Luck 2018-06-07 17:43 ` Luck, Tony 2018-06-07 20:25 ` [tip:ras/urgent] " tip-bot for Tony Luck 2 siblings, 2 replies; 28+ messages in thread From: Tony Luck @ 2018-05-25 21:42 UTC (permalink / raw) To: Borislav Petkov Cc: Tony Luck, Dan Williams, Qiuxu Zhuo, Ashok Raj, x86, linux-kernel Currently we just check the "CAPID0" register to see whether the CPU can recover from machine checks. But there are also some special SKUs which do not have all advanced RAS features, but do enable machine check recovery for use with NVDIMMs. Add a check for any of bits {8:5} in the "CAPID5" register (each reports some NVDIMM mode available, if any of them are set, then the system supports memory machine check recovery). Cc: stable@vger.kernel.org # 4.9 Signed-off-by: Tony Luck <tony.luck@intel.com> --- arch/x86/kernel/quirks.c | 11 +++++++++-- 1 file changed, 9 insertions(+), 2 deletions(-) diff --git a/arch/x86/kernel/quirks.c b/arch/x86/kernel/quirks.c index 697a4ce04308..736348ead421 100644 --- a/arch/x86/kernel/quirks.c +++ b/arch/x86/kernel/quirks.c @@ -645,12 +645,19 @@ static void quirk_intel_brickland_xeon_ras_cap(struct pci_dev *pdev) /* Skylake */ static void quirk_intel_purley_xeon_ras_cap(struct pci_dev *pdev) { - u32 capid0; + u32 capid0, capid5; pci_read_config_dword(pdev, 0x84, &capid0); + pci_read_config_dword(pdev, 0x98, &capid5); - if ((capid0 & 0xc0) == 0xc0) + /* + * CAPID0{7:6} indicate whether this is an advanced RAS SKU + * CAPID5{8:5} indicate that various NVDIMM usage modes are + * enabled, so memory machine check recovery is also enabled. + */ + if ((capid0 & 0xc0) == 0xc0 || (capid5 & 0x1e0)) static_branch_inc(&mcsafe_key); + } DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x0ec3, quirk_intel_brickland_xeon_ras_cap); DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x2fc0, quirk_intel_brickland_xeon_ras_cap); -- 2.17.0 ^ permalink raw reply related [flat|nested] 28+ messages in thread
* Re: [PATCH 3/3] x86/mce: Check for alternate indication of machine check recovery on Skylake 2018-05-25 21:42 ` [PATCH 3/3] x86/mce: Check for alternate indication of machine check recovery on Skylake Tony Luck @ 2018-06-07 17:43 ` Luck, Tony 2018-06-07 20:18 ` Dan Williams 2018-06-07 20:25 ` [tip:ras/urgent] " tip-bot for Tony Luck 1 sibling, 1 reply; 28+ messages in thread From: Luck, Tony @ 2018-06-07 17:43 UTC (permalink / raw) To: Borislav Petkov; +Cc: Dan Williams, Qiuxu Zhuo, Ashok Raj, x86, linux-kernel On Fri, May 25, 2018 at 02:42:09PM -0700, Tony Luck wrote: > Currently we just check the "CAPID0" register to see whether the CPU > can recover from machine checks. > > But there are also some special SKUs which do not have all advanced > RAS features, but do enable machine check recovery for use with NVDIMMs. > > Add a check for any of bits {8:5} in the "CAPID5" register (each > reports some NVDIMM mode available, if any of them are set, then > the system supports memory machine check recovery). > > Cc: stable@vger.kernel.org # 4.9 > Signed-off-by: Tony Luck <tony.luck@intel.com> > --- Has this stalled somewhere? I'd like to see this one go into the 4.18 merge because it unbreaks some real hardware. Parts 1 & 2 are nice-to-have, but they just make for better error messages so aren't as critical. > arch/x86/kernel/quirks.c | 11 +++++++++-- > 1 file changed, 9 insertions(+), 2 deletions(-) > > diff --git a/arch/x86/kernel/quirks.c b/arch/x86/kernel/quirks.c > index 697a4ce04308..736348ead421 100644 > --- a/arch/x86/kernel/quirks.c > +++ b/arch/x86/kernel/quirks.c > @@ -645,12 +645,19 @@ static void quirk_intel_brickland_xeon_ras_cap(struct pci_dev *pdev) > /* Skylake */ > static void quirk_intel_purley_xeon_ras_cap(struct pci_dev *pdev) > { > - u32 capid0; > + u32 capid0, capid5; > > pci_read_config_dword(pdev, 0x84, &capid0); > + pci_read_config_dword(pdev, 0x98, &capid5); > > - if ((capid0 & 0xc0) == 0xc0) > + /* > + * CAPID0{7:6} indicate whether this is an advanced RAS SKU > + * CAPID5{8:5} indicate that various NVDIMM usage modes are > + * enabled, so memory machine check recovery is also enabled. > + */ > + if ((capid0 & 0xc0) == 0xc0 || (capid5 & 0x1e0)) > static_branch_inc(&mcsafe_key); > + > } > DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x0ec3, quirk_intel_brickland_xeon_ras_cap); > DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x2fc0, quirk_intel_brickland_xeon_ras_cap); > -- > 2.17.0 > ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 3/3] x86/mce: Check for alternate indication of machine check recovery on Skylake 2018-06-07 17:43 ` Luck, Tony @ 2018-06-07 20:18 ` Dan Williams 2018-06-07 20:24 ` Borislav Petkov 2018-06-07 20:24 ` Thomas Gleixner 0 siblings, 2 replies; 28+ messages in thread From: Dan Williams @ 2018-06-07 20:18 UTC (permalink / raw) To: Luck, Tony Cc: Borislav Petkov, Qiuxu Zhuo, Ashok Raj, X86 ML, Linux Kernel Mailing List On Thu, Jun 7, 2018 at 10:43 AM, Luck, Tony <tony.luck@intel.com> wrote: > On Fri, May 25, 2018 at 02:42:09PM -0700, Tony Luck wrote: >> Currently we just check the "CAPID0" register to see whether the CPU >> can recover from machine checks. >> >> But there are also some special SKUs which do not have all advanced >> RAS features, but do enable machine check recovery for use with NVDIMMs. >> >> Add a check for any of bits {8:5} in the "CAPID5" register (each >> reports some NVDIMM mode available, if any of them are set, then >> the system supports memory machine check recovery). >> >> Cc: stable@vger.kernel.org # 4.9 >> Signed-off-by: Tony Luck <tony.luck@intel.com> >> --- > > Has this stalled somewhere? I'd like to see this one go into the > 4.18 merge because it unbreaks some real hardware. > > Parts 1 & 2 are nice-to-have, but they just make for better error > messages so aren't as critical. I'm making an effort to get all persistent memory error handling holes covered this cycle, so I think it makes sense for this to go through the nvdimm tree. This looks sufficiently non-controversial that I could justify sending it to Linus along with the other pmem updates. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 3/3] x86/mce: Check for alternate indication of machine check recovery on Skylake 2018-06-07 20:18 ` Dan Williams @ 2018-06-07 20:24 ` Borislav Petkov 2018-06-07 22:26 ` Luck, Tony 2018-06-14 21:57 ` Luck, Tony 2018-06-07 20:24 ` Thomas Gleixner 1 sibling, 2 replies; 28+ messages in thread From: Borislav Petkov @ 2018-06-07 20:24 UTC (permalink / raw) To: Dan Williams Cc: Luck, Tony, Qiuxu Zhuo, Ashok Raj, X86 ML, Linux Kernel Mailing List On Thu, Jun 07, 2018 at 01:18:31PM -0700, Dan Williams wrote: > I'm making an effort to get all persistent memory error handling holes > covered this cycle, so I think it makes sense for this to go through > the nvdimm tree. This looks sufficiently non-controversial that I > could justify sending it to Linus along with the other pmem updates. tglx just took 1 and 3, 2/3 had a minor issue but the merge window happened so I'll send it later. It is nice to have anyway. -- Regards/Gruss, Boris. SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg) -- ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 3/3] x86/mce: Check for alternate indication of machine check recovery on Skylake 2018-06-07 20:24 ` Borislav Petkov @ 2018-06-07 22:26 ` Luck, Tony 2018-06-14 21:57 ` Luck, Tony 1 sibling, 0 replies; 28+ messages in thread From: Luck, Tony @ 2018-06-07 22:26 UTC (permalink / raw) To: Borislav Petkov Cc: Dan Williams, Qiuxu Zhuo, Ashok Raj, X86 ML, Linux Kernel Mailing List On Thu, Jun 07, 2018 at 10:24:46PM +0200, Borislav Petkov wrote: > On Thu, Jun 07, 2018 at 01:18:31PM -0700, Dan Williams wrote: > > I'm making an effort to get all persistent memory error handling holes > > covered this cycle, so I think it makes sense for this to go through > > the nvdimm tree. This looks sufficiently non-controversial that I > > could justify sending it to Linus along with the other pmem updates. > > tglx just took 1 and 3, 2/3 had a minor issue but the merge window > happened so I'll send it later. It is nice to have anyway. Thanks Boris. -Tony ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 3/3] x86/mce: Check for alternate indication of machine check recovery on Skylake 2018-06-07 20:24 ` Borislav Petkov 2018-06-07 22:26 ` Luck, Tony @ 2018-06-14 21:57 ` Luck, Tony 2018-06-15 11:45 ` Borislav Petkov 1 sibling, 1 reply; 28+ messages in thread From: Luck, Tony @ 2018-06-14 21:57 UTC (permalink / raw) To: Borislav Petkov Cc: Dan Williams, Qiuxu Zhuo, Ashok Raj, X86 ML, Linux Kernel Mailing List On Thu, Jun 07, 2018 at 10:24:46PM +0200, Borislav Petkov wrote: > tglx just took 1 and 3, 2/3 had a minor issue but the merge window > happened so I'll send it later. It is nice to have anyway. Did you fix up part 2/3? I see 1 & 3 were staged by Thomas in TIP ras/urgent and ras-urgent-for-linus but haven't gone into Linus' tree yet. -Tony ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 3/3] x86/mce: Check for alternate indication of machine check recovery on Skylake 2018-06-14 21:57 ` Luck, Tony @ 2018-06-15 11:45 ` Borislav Petkov 2018-06-15 16:34 ` Luck, Tony 0 siblings, 1 reply; 28+ messages in thread From: Borislav Petkov @ 2018-06-15 11:45 UTC (permalink / raw) To: Luck, Tony Cc: Dan Williams, Qiuxu Zhuo, Ashok Raj, X86 ML, Linux Kernel Mailing List On Thu, Jun 14, 2018 at 02:57:54PM -0700, Luck, Tony wrote: > On Thu, Jun 07, 2018 at 10:24:46PM +0200, Borislav Petkov wrote: > > tglx just took 1 and 3, 2/3 had a minor issue but the merge window > > happened so I'll send it later. It is nice to have anyway. > > Did you fix up part 2/3? You said "Parts 1 & 2 are nice-to-have" and we have merge window now. So what exactly do you mean? > I see 1 & 3 were staged by Thomas in TIP ras/urgent and > ras-urgent-for-linus but haven't gone into Linus' tree yet. Merge window ain't over yet :) -- Regards/Gruss, Boris. SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg) -- ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 3/3] x86/mce: Check for alternate indication of machine check recovery on Skylake 2018-06-15 11:45 ` Borislav Petkov @ 2018-06-15 16:34 ` Luck, Tony 2018-06-15 17:16 ` Borislav Petkov 0 siblings, 1 reply; 28+ messages in thread From: Luck, Tony @ 2018-06-15 16:34 UTC (permalink / raw) To: Borislav Petkov Cc: Dan Williams, Qiuxu Zhuo, Ashok Raj, X86 ML, Linux Kernel Mailing List On Fri, Jun 15, 2018 at 01:45:18PM +0200, Borislav Petkov wrote: > On Thu, Jun 14, 2018 at 02:57:54PM -0700, Luck, Tony wrote: > > On Thu, Jun 07, 2018 at 10:24:46PM +0200, Borislav Petkov wrote: > > > tglx just took 1 and 3, 2/3 had a minor issue but the merge window > > > happened so I'll send it later. It is nice to have anyway. > > > > Did you fix up part 2/3? > > You said "Parts 1 & 2 are nice-to-have" and we have merge window now. So > what exactly do you mean? > > > I see 1 & 3 were staged by Thomas in TIP ras/urgent and > > ras-urgent-for-linus but haven't gone into Linus' tree yet. > > Merge window ain't over yet :) "send it later" could be read a couple of ways. I was just worried that Thomas is holding off asking Linus to pull because he's waiting for part 2/3, while you might be planning to hold onto part 2/3 for the next merge (and add the other cleanups you RFC'd). I'm OK either way on part 2/3 (it just makes for better error messages, it doesn't fix any bugs). -Tony ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 3/3] x86/mce: Check for alternate indication of machine check recovery on Skylake 2018-06-15 16:34 ` Luck, Tony @ 2018-06-15 17:16 ` Borislav Petkov 0 siblings, 0 replies; 28+ messages in thread From: Borislav Petkov @ 2018-06-15 17:16 UTC (permalink / raw) To: Luck, Tony Cc: Dan Williams, Qiuxu Zhuo, Ashok Raj, X86 ML, Linux Kernel Mailing List On Fri, Jun 15, 2018 at 09:34:06AM -0700, Luck, Tony wrote: > I was just worried that Thomas is holding off asking Linus to pull > because he's waiting for part 2/3, while you might be planning to > hold onto part 2/3 for the next merge (and add the other cleanups > you RFC'd). > > I'm OK either way on part 2/3 (it just makes for better error messages, > it doesn't fix any bugs). Yes, and this is my understanding too - 2/3 is not urgent material so no need to hurry that particular one. -- Regards/Gruss, Boris. SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg) -- ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 3/3] x86/mce: Check for alternate indication of machine check recovery on Skylake 2018-06-07 20:18 ` Dan Williams 2018-06-07 20:24 ` Borislav Petkov @ 2018-06-07 20:24 ` Thomas Gleixner 1 sibling, 0 replies; 28+ messages in thread From: Thomas Gleixner @ 2018-06-07 20:24 UTC (permalink / raw) To: Dan Williams Cc: Luck, Tony, Borislav Petkov, Qiuxu Zhuo, Ashok Raj, X86 ML, Linux Kernel Mailing List On Thu, 7 Jun 2018, Dan Williams wrote: > On Thu, Jun 7, 2018 at 10:43 AM, Luck, Tony <tony.luck@intel.com> wrote: > > On Fri, May 25, 2018 at 02:42:09PM -0700, Tony Luck wrote: > >> Currently we just check the "CAPID0" register to see whether the CPU > >> can recover from machine checks. > >> > >> But there are also some special SKUs which do not have all advanced > >> RAS features, but do enable machine check recovery for use with NVDIMMs. > >> > >> Add a check for any of bits {8:5} in the "CAPID5" register (each > >> reports some NVDIMM mode available, if any of them are set, then > >> the system supports memory machine check recovery). > >> > >> Cc: stable@vger.kernel.org # 4.9 > >> Signed-off-by: Tony Luck <tony.luck@intel.com> > >> --- > > > > Has this stalled somewhere? I'd like to see this one go into the > > 4.18 merge because it unbreaks some real hardware. > > > > Parts 1 & 2 are nice-to-have, but they just make for better error > > messages so aren't as critical. > > I'm making an effort to get all persistent memory error handling holes > covered this cycle, so I think it makes sense for this to go through > the nvdimm tree. This looks sufficiently non-controversial that I > could justify sending it to Linus along with the other pmem updates. I've picked it up already and please can we let stuff go through the right trees? The worlds does not stop turning if a fix goes in 2 days later. Thanks, tglx ^ permalink raw reply [flat|nested] 28+ messages in thread
* [tip:ras/urgent] x86/mce: Check for alternate indication of machine check recovery on Skylake 2018-05-25 21:42 ` [PATCH 3/3] x86/mce: Check for alternate indication of machine check recovery on Skylake Tony Luck 2018-06-07 17:43 ` Luck, Tony @ 2018-06-07 20:25 ` tip-bot for Tony Luck 1 sibling, 0 replies; 28+ messages in thread From: tip-bot for Tony Luck @ 2018-06-07 20:25 UTC (permalink / raw) To: linux-tip-commits Cc: tglx, dan.j.williams, bp, mingo, ashok.raj, linux-kernel, qiuxu.zhuo, tony.luck, hpa Commit-ID: 4c5717da1d021cf368eabb3cb1adcaead56c0d1e Gitweb: https://git.kernel.org/tip/4c5717da1d021cf368eabb3cb1adcaead56c0d1e Author: Tony Luck <tony.luck@intel.com> AuthorDate: Fri, 25 May 2018 14:42:09 -0700 Committer: Thomas Gleixner <tglx@linutronix.de> CommitDate: Thu, 7 Jun 2018 22:22:12 +0200 x86/mce: Check for alternate indication of machine check recovery on Skylake Currently we just check the "CAPID0" register to see whether the CPU can recover from machine checks. But there are also some special SKUs which do not have all advanced RAS features, but do enable machine check recovery for use with NVDIMMs. Add a check for any of bits {8:5} in the "CAPID5" register (each reports some NVDIMM mode available, if any of them are set, then the system supports memory machine check recovery). Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Cc: Ashok Raj <ashok.raj@intel.com> Cc: stable@vger.kernel.org # 4.9 Cc: Dan Williams <dan.j.williams@intel.com> Cc: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/03cbed6e99ddafb51c2eadf9a3b7c8d7a0cc204e.1527283897.git.tony.luck@intel.com --- arch/x86/kernel/quirks.c | 11 +++++++++-- 1 file changed, 9 insertions(+), 2 deletions(-) diff --git a/arch/x86/kernel/quirks.c b/arch/x86/kernel/quirks.c index 697a4ce04308..736348ead421 100644 --- a/arch/x86/kernel/quirks.c +++ b/arch/x86/kernel/quirks.c @@ -645,12 +645,19 @@ static void quirk_intel_brickland_xeon_ras_cap(struct pci_dev *pdev) /* Skylake */ static void quirk_intel_purley_xeon_ras_cap(struct pci_dev *pdev) { - u32 capid0; + u32 capid0, capid5; pci_read_config_dword(pdev, 0x84, &capid0); + pci_read_config_dword(pdev, 0x98, &capid5); - if ((capid0 & 0xc0) == 0xc0) + /* + * CAPID0{7:6} indicate whether this is an advanced RAS SKU + * CAPID5{8:5} indicate that various NVDIMM usage modes are + * enabled, so memory machine check recovery is also enabled. + */ + if ((capid0 & 0xc0) == 0xc0 || (capid5 & 0x1e0)) static_branch_inc(&mcsafe_key); + } DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x0ec3, quirk_intel_brickland_xeon_ras_cap); DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x2fc0, quirk_intel_brickland_xeon_ras_cap); ^ permalink raw reply related [flat|nested] 28+ messages in thread
end of thread, other threads:[~2018-06-22 12:40 UTC | newest] Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2018-05-25 21:40 [PATCH 0/3] x86/mce fixes Tony Luck 2018-05-25 21:41 ` [PATCH 1/3] x86/mce: Improve error message when kernel cannot recover Tony Luck 2018-06-07 20:24 ` [tip:ras/urgent] " tip-bot for Tony Luck 2018-05-25 21:41 ` [PATCH 2/3] x86/mce: Fix incorrect "Machine check from unknown source" message Tony Luck 2018-05-28 20:49 ` Borislav Petkov 2018-05-29 16:15 ` [PATCH 2/3 V2] " Luck, Tony 2018-05-29 17:41 ` Borislav Petkov 2018-05-29 17:50 ` Luck, Tony 2018-05-29 17:53 ` Borislav Petkov 2018-05-29 18:54 ` Luck, Tony 2018-05-29 20:17 ` Dan Williams 2018-05-30 9:26 ` Borislav Petkov 2018-06-19 10:30 ` Borislav Petkov 2018-05-29 18:22 ` [PATCH 2/3] " Raj, Ashok 2018-05-29 10:42 ` Borislav Petkov 2018-05-29 16:13 ` Luck, Tony 2018-06-22 12:40 ` [tip:ras/core] " tip-bot for Tony Luck 2018-05-25 21:42 ` [PATCH 3/3] x86/mce: Check for alternate indication of machine check recovery on Skylake Tony Luck 2018-06-07 17:43 ` Luck, Tony 2018-06-07 20:18 ` Dan Williams 2018-06-07 20:24 ` Borislav Petkov 2018-06-07 22:26 ` Luck, Tony 2018-06-14 21:57 ` Luck, Tony 2018-06-15 11:45 ` Borislav Petkov 2018-06-15 16:34 ` Luck, Tony 2018-06-15 17:16 ` Borislav Petkov 2018-06-07 20:24 ` Thomas Gleixner 2018-06-07 20:25 ` [tip:ras/urgent] " tip-bot for Tony Luck
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).