* [Patch V2] x86, mce: Ensure offline CPU's don't participate in mce rendezvous process. @ 2015-12-05 0:29 Ashok Raj 2015-12-07 20:00 ` Borislav Petkov 0 siblings, 1 reply; 13+ messages in thread From: Ashok Raj @ 2015-12-05 0:29 UTC (permalink / raw) To: linux-kernel; +Cc: Boris Petkov, linux-edac, Tony Luck, Ashok Raj Linux has logical cpu offline capability. That can be triggered by: # echo 0 > /sys/devices/system/cpu/cpuX/online In Intel Architecture, MCE's are broadcasted to all CPUs in the system. This includes the CPUs marked offline by Linux. Unless the CPU's were removed via an ACPI notification, in which case the cpu's are removed from the cpu_present_map. This patch ensures offline CPU's don't participate in MCE rendezvous, but simply perform clearing some status bits to ensure a second MCE wont cause automatic shutdown. Without the patch, mce_start will increment mce_callin, but mce_start would only wait for all online_cpus. So offline cpu's should avoid participating in the rendezvous process. Reviewed-by: Tony Luck <tony.luck@intel.com> Cc: stable@vger.kernel.org Signed-off-by: Ashok Raj <ashok.raj@intel.com> --- arch/x86/kernel/cpu/mcheck/mce.c | 13 +++++++++++++ 1 file changed, 13 insertions(+) diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c index c5b0d56..23ecb1d 100644 --- a/arch/x86/kernel/cpu/mcheck/mce.c +++ b/arch/x86/kernel/cpu/mcheck/mce.c @@ -998,7 +998,20 @@ void do_machine_check(struct pt_regs *regs, long error_code) u64 recover_paddr = ~0ull; int flags = MF_ACTION_REQUIRED; int lmce = 0; + unsigned int cpu = smp_processor_id(); + + /* + * if this cpu is offline, just bail out. + */ + if (cpu_is_offline(cpu)) { + u64 mcgstatus; + mcgstatus = mce_rdmsrl(MSR_IA32_MCG_STATUS); + if (mcgstatus & MCG_STATUS_RIPV) { + mce_wrmsrl(MSR_IA32_MCG_STATUS, 0); + return; + } + } ist_enter(regs); this_cpu_inc(mce_exception_count); -- 2.4.3 ^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [Patch V2] x86, mce: Ensure offline CPU's don't participate in mce rendezvous process. 2015-12-05 0:29 [Patch V2] x86, mce: Ensure offline CPU's don't participate in mce rendezvous process Ashok Raj @ 2015-12-07 20:00 ` Borislav Petkov 2015-12-07 20:04 ` Luck, Tony 0 siblings, 1 reply; 13+ messages in thread From: Borislav Petkov @ 2015-12-07 20:00 UTC (permalink / raw) To: Ashok Raj; +Cc: linux-kernel, linux-edac, Tony Luck On Fri, Dec 04, 2015 at 07:29:36PM -0500, Ashok Raj wrote: > Linux has logical cpu offline capability. That can be triggered by: > > # echo 0 > /sys/devices/system/cpu/cpuX/online > > In Intel Architecture, MCE's are broadcasted to all CPUs in the system. > > This includes the CPUs marked offline by Linux. Unless the CPU's were removed > via an ACPI notification, in which case the cpu's are removed from the > cpu_present_map. > > This patch ensures offline CPU's don't participate in MCE rendezvous, but > simply perform clearing some status bits to ensure a second MCE wont cause > automatic shutdown. > > Without the patch, mce_start will increment mce_callin, but mce_start would > only wait for all online_cpus. So offline cpu's should avoid participating > in the rendezvous process. > > Reviewed-by: Tony Luck <tony.luck@intel.com> > Cc: stable@vger.kernel.org > Signed-off-by: Ashok Raj <ashok.raj@intel.com> > --- > arch/x86/kernel/cpu/mcheck/mce.c | 13 +++++++++++++ > 1 file changed, 13 insertions(+) Tested on a box here, massaged commit message and queued for 4.4, thanks. --- From: Ashok Raj <ashok.raj@intel.com> Date: Thu, 3 Dec 2015 19:16:10 -0500 Subject: [PATCH] x86/mce: Ensure offline CPUs don't participate in rendezvous process Intel's MCA implementation broadcasts MCEs to all CPUs on the node. This poses a problem for offlined CPUs which cannot participate in the rendezvous process: Kernel panic - not syncing: Timeout: Not all CPUs entered broadcast exception handler Kernel Offset: disabled Rebooting in 100 seconds.. 
More specifically, Linux does a soft offline of a CPU when writing a 0 to /sys/devices/system/cpu/cpuX/online, which doesn't prevent the #MC exception from being broadcasted to that CPU. Ensure that offline CPUs don't participate in the MCE rendezvous and clear the RIP valid status bit so that a second MCE won't cause a shutdown. Without the patch, mce_start() will increment mce_callin and wait for all CPUs. Offlined CPUs should avoid participating in the rendezvous process altogether. Reviewed-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Ashok Raj <ashok.raj@intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: linux-edac <linux-edac@vger.kernel.org> Cc: stable@vger.kernel.org Cc: Thomas Gleixner <tglx@linutronix.de> Cc: x86-ml <x86@kernel.org> Link: http://lkml.kernel.org/r/1449188170-3909-1-git-send-email-ashok.raj@intel.com [ Massage commit message. ] Signed-off-by: Borislav Petkov <bp@suse.de> --- arch/x86/kernel/cpu/mcheck/mce.c | 11 +++++++++++ 1 file changed, 11 insertions(+) diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c index 3865e95cc5ec..a006f4cd792b 100644 --- a/arch/x86/kernel/cpu/mcheck/mce.c +++ b/arch/x86/kernel/cpu/mcheck/mce.c @@ -1002,6 +1002,17 @@ void do_machine_check(struct pt_regs *regs, long error_code) int flags = MF_ACTION_REQUIRED; int lmce = 0; + /* If this CPU is offline, just bail out. */ + if (cpu_is_offline(smp_processor_id())) { + u64 mcgstatus; + + mcgstatus = mce_rdmsrl(MSR_IA32_MCG_STATUS); + if (mcgstatus & MCG_STATUS_RIPV) { + mce_wrmsrl(MSR_IA32_MCG_STATUS, 0); + return; + } + } + ist_enter(regs); this_cpu_inc(mce_exception_count); -- 2.3.5 -- Regards/Gruss, Boris. ECO tip #101: Trim your mails when you reply. ^ permalink raw reply related [flat|nested] 13+ messages in thread
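[ Editorial sketch, not part of the patch: the early-exit test above can be modelled in user space as follows. The bit positions are those of IA32_MCG_STATUS (RIPV bit 0, EIPV bit 1, MCIP bit 2). The struct fake_cpu type and the offline_cpu_bails_out() helper are hypothetical names made up for illustration; the assignment to mcg_status stands in for mce_wrmsrl(MSR_IA32_MCG_STATUS, 0). Note that writing 0 clears the whole register, including MCIP, which is the bit that would otherwise turn a second machine check into a shutdown. ]

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit layout of IA32_MCG_STATUS. */
#define MCG_STATUS_RIPV (1ULL << 0) /* restart IP valid */
#define MCG_STATUS_EIPV (1ULL << 1) /* error IP valid */
#define MCG_STATUS_MCIP (1ULL << 2) /* machine check in progress */

/* Hypothetical stand-in for one CPU's state; not a kernel structure. */
struct fake_cpu {
	bool online;
	uint64_t mcg_status;
};

/*
 * Models the early-exit test added to do_machine_check().
 * Returns true if the CPU leaves the handler before the rendezvous.
 */
bool offline_cpu_bails_out(struct fake_cpu *cpu)
{
	if (cpu->online)
		return false;        /* online CPUs take the normal path */

	if (cpu->mcg_status & MCG_STATUS_RIPV) {
		cpu->mcg_status = 0; /* stands in for mce_wrmsrl(MSR_IA32_MCG_STATUS, 0) */
		return true;
	}

	return false;                /* no valid restart IP: fall through */
}
```

An offline CPU that arrives with RIPV and MCIP set returns true and leaves with mcg_status == 0; an online CPU, or an offline one without RIPV, is left untouched and continues into the regular handler.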
* RE: [Patch V2] x86, mce: Ensure offline CPU's don't participate in mce rendezvous process. 2015-12-07 20:00 ` Borislav Petkov @ 2015-12-07 20:04 ` Luck, Tony 2015-12-07 20:19 ` Borislav Petkov 0 siblings, 1 reply; 13+ messages in thread From: Luck, Tony @ 2015-12-07 20:04 UTC (permalink / raw) To: Borislav Petkov, Raj, Ashok; +Cc: linux-kernel, linux-edac > Kernel panic - not syncing: Timeout: Not all CPUs entered broadcast exception handler Is that what we printed in this case? ... boy is that a misleading message ... we got *extra* cpus (the offline ones), not "Not all". Good job we have a fix :-) -Tony ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Patch V2] x86, mce: Ensure offline CPU's don't participate in mce rendezvous process. 2015-12-07 20:04 ` Luck, Tony @ 2015-12-07 20:19 ` Borislav Petkov 2015-12-07 22:07 ` Luck, Tony 0 siblings, 1 reply; 13+ messages in thread From: Borislav Petkov @ 2015-12-07 20:19 UTC (permalink / raw) To: Luck, Tony; +Cc: Raj, Ashok, linux-kernel, linux-edac On Mon, Dec 07, 2015 at 08:04:30PM +0000, Luck, Tony wrote: > > Kernel panic - not syncing: Timeout: Not all CPUs entered broadcast exception handler > > Is that what we printed in this case? ... boy is that a misleading message ... we got *extra* > cpus (the offline ones), not "Not all". > > Good job we have a fix :-) Well, we still have that printk string in there. And that is incorrect too, because the MCE (at least the one I'm injecting) gets broadcasted to the CPUs on the *node* and not to the whole system. If we had to be precise, text should say "Not all CPUs which the MCE was broadcasted to entered the exception handler..." -- Regards/Gruss, Boris. ECO tip #101: Trim your mails when you reply. ^ permalink raw reply [flat|nested] 13+ messages in thread
* RE: [Patch V2] x86, mce: Ensure offline CPU's don't participate in mce rendezvous process. 2015-12-07 20:19 ` Borislav Petkov @ 2015-12-07 22:07 ` Luck, Tony 2015-12-07 22:34 ` Borislav Petkov 0 siblings, 1 reply; 13+ messages in thread From: Luck, Tony @ 2015-12-07 22:07 UTC (permalink / raw) To: Borislav Petkov; +Cc: Raj, Ashok, linux-kernel, linux-edac > And that is incorrect too, because the MCE (at least the one I'm > injecting) gets broadcasted to the CPUs on the *node* and not to the > whole system. Which system? What kind of machine check? On Intel we expect machine checks to be broadcast to all logical cpus on all nodes (unless local machine check is enabled, in which case SRAR style machine checks go only to the logical cpu that hit the error). The code is written to that expectation ... and we don't report things as well if something else happens (like too many or too few cpus showing up). -Tony ^ permalink raw reply [flat|nested] 13+ messages in thread
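[ Editorial sketch: the "too many or too few cpus" remark can be illustrated with a deliberately simplified model of the call-in phase. It assumes that each CPU entering the handler increments a shared counter and that the waiters proceed only when that counter equals the number of online CPUs, as the patch description suggests. callin_matches() is a hypothetical helper, and real timing and atomicity are ignored. ]

```c
#include <stdbool.h>

/*
 * Hypothetical model of the call-in phase. Every CPU that enters the
 * handler bumps a shared counter; the rendezvous is considered complete
 * only when the counter equals the number of online CPUs.
 */
bool callin_matches(int online_cpus, int cpus_entering_handler)
{
	int callin = 0;

	for (int i = 0; i < cpus_entering_handler; i++)
		callin++; /* stands in for the increment of mce_callin */

	/* Equality test: an overshoot fails just like a shortfall. */
	return callin == online_cpus;
}
```

With 64 CPUs of which 2 are offline, callin_matches(62, 64) models the unpatched case (the offline CPUs also check in, the counter overshoots, and the wait times out), while callin_matches(62, 62) models the patched case in which the offline CPUs return early.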
* Re: [Patch V2] x86, mce: Ensure offline CPU's don't participate in mce rendezvous process. 2015-12-07 22:07 ` Luck, Tony @ 2015-12-07 22:34 ` Borislav Petkov 2015-12-07 23:26 ` Luck, Tony 2015-12-07 23:46 ` Raj, Ashok 0 siblings, 2 replies; 13+ messages in thread From: Borislav Petkov @ 2015-12-07 22:34 UTC (permalink / raw) To: Luck, Tony; +Cc: Raj, Ashok, linux-kernel, linux-edac On Mon, Dec 07, 2015 at 10:07:59PM +0000, Luck, Tony wrote: > > And that is incorrect too, because the MCE (at least the one I'm > > injecting) gets broadcasted to the CPUs on the *node* and not to the > > whole system. > > Which system? What kind of machine check? On Intel we expect machine checks > to be broadcast to all logical cpus on all nodes (unless local machine check is enabled, > in which case SRAR style machine checks go only to the logical cpu that hit the error). > > The code is written to that expectation ... and we don't report things as well if > something else happens (like too many or too few cpus showing up). Box logs below. BIOS is doing funny cores enumeration: node #0, CPUs 0-7 node #1, CPUs 8-15 node #2, CPUs 16-23 node #3, CPUs 24-31 and then starts from node 0 again: .... node #0, CPUs: #32 #33 #34 #35 #36 #37 #38 #39 .... node #1, CPUs: #40 #41 #42 #43 #44 #45 #46 #47 .... node #2, CPUs: #48 #49 #50 #51 #52 #53 #54 #55 .... node #3, CPUs: #56 #57 #58 #59 #60 #61 #62 #63 So I went and offlined cores 5 and 34 which are on node 0. Why node 0? Well, when I inject error type 0x10 which is 0x00000010 Memory Uncorrectable non-fatal it generates an MCE only on the node 0 cores. For that log see the end of this mail. The gist of it is that the CPUs on which #MC gets raised are the cores on node 0, i.e., 0-7 and 32-39. Cores 5 and 34 are gone, of course. I mean, even if the #MC gets raised only on the node, the fix still works. 
$ grep -Ei "hardware.*CPU" /tmp/mce | sed 's/^.*CPU//' | sort -n 0: Machine Check Exception: 5 Bank 5: be00000000010090 1: Machine Check Exception: 5 Bank 5: be00000000010090 2: Machine Check Exception: 5 Bank 5: be00000000010090 3: Machine Check Exception: 5 Bank 5: be00000000010090 4: Machine Check Exception: 5 Bank 5: be00000000010090 6: Machine Check Exception: 5 Bank 5: be00000000010090 7: Machine Check Exception: 5 Bank 5: be00000000010090 32: Machine Check Exception: 5 Bank 5: be00000000010090 33: Machine Check Exception: 5 Bank 5: be00000000010090 35: Machine Check Exception: 5 Bank 5: be00000000010090 36: Machine Check Exception: 5 Bank 5: be00000000010090 37: Machine Check Exception: 5 Bank 5: be00000000010090 38: Machine Check Exception: 5 Bank 5: be00000000010090 39: Machine Check Exception: 5 Bank 5: be00000000010090 [ 0.859060] smpboot: CPU0: Intel(R) Xeon(R) CPU E5-4650 0 @ 2.70GHz (family: 0x6, model: 0x2d, stepping: 0x7 ... [ 0.981593] x86: Booting SMP configuration: [ 0.991092] .... node #0, CPUs: #1 [ 1.013485] microcode: CPU1 microcode updated early to revision 0x710, date = 2013-06-17 [ 1.034219] #2 [ 1.049577] microcode: CPU2 microcode updated early to revision 0x710, date = 2013-06-17 [ 1.070309] #3 [ 1.085865] microcode: CPU3 microcode updated early to revision 0x710, date = 2013-06-17 [ 1.106618] #4 [ 1.121978] microcode: CPU4 microcode updated early to revision 0x710, date = 2013-06-17 [ 1.142720] #5 [ 1.158079] microcode: CPU5 microcode updated early to revision 0x710, date = 2013-06-17 [ 1.178833] #6 [ 1.194191] microcode: CPU6 microcode updated early to revision 0x710, date = 2013-06-17 [ 1.214914] #7 [ 1.230471] microcode: CPU7 microcode updated early to revision 0x710, date = 2013-06-17 [ 1.251309] [ 1.254854] .... 
node #1, CPUs: #8 [ 1.275173] microcode: CPU8 microcode updated early to revision 0x710, date = 2013-06-17 [ 1.390509] #9 [ 1.406859] microcode: CPU9 microcode updated early to revision 0x710, date = 2013-06-17 [ 1.427735] #10 [ 1.444303] microcode: CPU10 microcode updated early to revision 0x710, date = 2013-06-17 [ 1.465343] #11 [ 1.481718] microcode: CPU11 microcode updated early to revision 0x710, date = 2013-06-17 [ 1.502779] #12 [ 1.519156] microcode: CPU12 microcode updated early to revision 0x710, date = 2013-06-17 [ 1.540171] #13 [ 1.556536] microcode: CPU13 microcode updated early to revision 0x710, date = 2013-06-17 [ 1.577587] #14 [ 1.594127] microcode: CPU14 microcode updated early to revision 0x710, date = 2013-06-17 [ 1.615131] #15 [ 1.631471] microcode: CPU15 microcode updated early to revision 0x710, date = 2013-06-17 [ 1.652590] [ 1.656132] .... node #2, CPUs: #16 [ 1.676518] microcode: CPU16 microcode updated early to revision 0x710, date = 2013-06-17 [ 1.791812] #17 [ 1.808189] microcode: CPU17 microcode updated early to revision 0x710, date = 2013-06-17 [ 1.829292] #18 [ 1.845868] microcode: CPU18 microcode updated early to revision 0x710, date = 2013-06-17 [ 1.866925] #19 [ 1.883311] microcode: CPU19 microcode updated early to revision 0x710, date = 2013-06-17 [ 1.904386] #20 [ 1.920765] microcode: CPU20 microcode updated early to revision 0x710, date = 2013-06-17 [ 1.941810] #21 [ 1.958169] microcode: CPU21 microcode updated early to revision 0x710, date = 2013-06-17 [ 1.979242] #22 [ 1.995787] microcode: CPU22 microcode updated early to revision 0x710, date = 2013-06-17 [ 2.016842] #23 [ 2.033182] microcode: CPU23 microcode updated early to revision 0x710, date = 2013-06-17 [ 2.054314] [ 2.057854] .... 
node #3, CPUs: #24 [ 2.078330] microcode: CPU24 microcode updated early to revision 0x710, date = 2013-06-17 [ 2.193513] #25 [ 2.209874] microcode: CPU25 microcode updated early to revision 0x710, date = 2013-06-17 [ 2.230996] #26 [ 2.247563] microcode: CPU26 microcode updated early to revision 0x710, date = 2013-06-17 [ 2.268627] #27 [ 2.284998] microcode: CPU27 microcode updated early to revision 0x710, date = 2013-06-17 [ 2.306061] #28 [ 2.322437] microcode: CPU28 microcode updated early to revision 0x710, date = 2013-06-17 [ 2.343433] #29 [ 2.359780] microcode: CPU29 microcode updated early to revision 0x710, date = 2013-06-17 [ 2.380855] #30 [ 2.397397] microcode: CPU30 microcode updated early to revision 0x710, date = 2013-06-17 [ 2.418432] #31 [ 2.434759] microcode: CPU31 microcode updated early to revision 0x710, date = 2013-06-17 [ 2.455792] [ 2.459336] .... node #0, CPUs: #32 #33 #34 #35 #36 #37 #38 #39 [ 2.583817] .... node #1, CPUs: #40 #41 #42 #43 #44 #45 #46 #47 [ 2.710873] .... node #2, CPUs: #48 #49 #50 #51 #52 #53 #54 #55 [ 2.838069] .... node #3, CPUs: #56 #57 #58 #59 #60 #61 #62 #63 [ 2.964288] x86: Booted up 4 nodes, 64 CPUs [ 2.974471] smpboot: Total of 64 processors activated (344907.86 BogoMIPS) [ 5290.635126] Broke affinity for irq 82 [ 5290.643222] Broke affinity for irq 111 [ 5290.651507] Broke affinity for irq 125 [ 5290.664107] smpboot: CPU 5 is now offline [ 5298.371336] Broke affinity for irq 31 [ 5298.379528] Broke affinity for irq 82 [ 5298.387627] Broke affinity for irq 103 [ 5298.395908] Broke affinity for irq 110 [ 5298.404187] Broke affinity for irq 111 [ 5298.412450] Broke affinity for irq 112 [ 5298.420733] Broke affinity for irq 118 [ 5298.429017] Broke affinity for irq 124 [ 5298.437295] Broke affinity for irq 125 [ 5298.445584] Broke affinity for irq 127 [ 5298.453880] Broke affinity for irq 137 [ 5298.466543] smpboot: CPU 34 is now offline [ 5302.187338] EINJ: Error INJection is initialized. 
[ 5318.897170] Disabling lock debugging due to kernel taint [ 5318.910775] mce: [Hardware Error]: CPU 37: Machine Check Exception: 5 Bank 5: be00000000010090 [ 5318.931171] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff8135de7f> {intel_idle+0xbf/0x130} [ 5318.951567] mce: [Hardware Error]: TSC bab9f2d8a4e00 ADDR bb68ec00 MISC 20403ebe86 [ 5318.969835] mce: [Hardware Error]: PROCESSOR 0:206d7 TIME 1449517966 SOCKET 0 APIC b microcode 710 [ 5318.990959] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR [ 5319.003825] EDAC sbridge MC0: CPU 37: Machine Check Exception: 5 Bank 5: be00000000010090 [ 5319.023215] EDAC sbridge MC0: TSC bab9f2d8a4e00 [ 5319.033036] EDAC sbridge MC0: ADDR bb68ec00 EDAC sbridge MC0: MISC 20403ebe86 [ 5319.050338] EDAC sbridge MC0: PROCESSOR 0:206d7 TIME 1449517966 SOCKET 0 APIC b [ 5319.069542] EDAC MC0: 0 UE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0xbb68e offset :0xc00 grain:32 - area:DRAM err_code:0001:0090 socket:0 ha:0 channel_mask:1 rank:0) [ 5319.122943] mce: [Hardware Error]: CPU 3: Machine Check Exception: 5 Bank 5: be00000000010090 [ 5319.143355] mce: [Hardware Error]: RIP !INEXACT! 
10:<ffffffff8135de7f> {intel_idle+0xbf/0x130} [ 5319.163846] mce: [Hardware Error]: TSC bab9f2d8a51c1 ADDR bb68ec00 MISC 20403ebe86 [ 5319.182249] mce: [Hardware Error]: PROCESSOR 0:206d7 TIME 1449517966 SOCKET 0 APIC 6 microcode 710 [ 5319.203539] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR [ 5319.216586] EDAC sbridge MC0: CPU 3: Machine Check Exception: 5 Bank 5: be00000000010090 [ 5319.235994] EDAC sbridge MC0: TSC bab9f2d8a51c1 [ 5319.245814] EDAC sbridge MC0: ADDR bb68ec00 EDAC sbridge MC0: MISC 20403ebe86 [ 5319.263348] EDAC sbridge MC0: PROCESSOR 0:206d7 TIME 1449517966 SOCKET 0 APIC 6 [ 5319.283041] EDAC MC0: 0 UE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0xbb68e offset :0xc00 grain:32 - area:DRAM err_code:0001:0090 socket:0 ha:0 channel_mask:1 rank:0) [ 5319.337311] mce: [Hardware Error]: CPU 2: Machine Check Exception: 5 Bank 5: be00000000010090 [ 5319.357960] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff8159a4d0> {mutex_lock+0x10/0x27} [ 5319.378519] mce: [Hardware Error]: TSC bab9f2d8a3feb ADDR bb68ec00 MISC 20403ebe86 [ 5319.397151] mce: [Hardware Error]: PROCESSOR 0:206d7 TIME 1449517966 SOCKET 0 APIC 4 microcode 710 [ 5319.418650] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR [ 5319.431902] EDAC sbridge MC0: CPU 2: Machine Check Exception: 5 Bank 5: be00000000010090 [ 5319.451491] EDAC sbridge MC0: TSC bab9f2d8a3feb [ 5319.461311] EDAC sbridge MC0: ADDR bb68ec00 EDAC sbridge MC0: MISC 20403ebe86 [ 5319.479022] EDAC sbridge MC0: PROCESSOR 0:206d7 TIME 1449517966 SOCKET 0 APIC 4 [ 5319.499014] EDAC MC0: 0 UE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0xbb68e offset :0xc00 grain:32 - area:DRAM err_code:0001:0090 socket:0 ha:0 channel_mask:1 rank:0) [ 5319.553209] mce: [Hardware Error]: CPU 6: Machine Check Exception: 5 Bank 5: be00000000010090 [ 5319.574029] mce: [Hardware Error]: RIP !INEXACT! 
10:<ffffffff8135de7f> {intel_idle+0xbf/0x130} [ 5319.594953] mce: [Hardware Error]: TSC bab9f2d8a87ea ADDR bb68ec00 MISC 20403ebe86 [ 5319.613756] mce: [Hardware Error]: PROCESSOR 0:206d7 TIME 1449517966 SOCKET 0 APIC c microcode 710 [ 5319.635431] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR [ 5319.648873] EDAC sbridge MC0: CPU 6: Machine Check Exception: 5 Bank 5: be00000000010090 [ 5319.668661] EDAC sbridge MC0: TSC bab9f2d8a87ea [ 5319.678483] EDAC sbridge MC0: ADDR bb68ec00 EDAC sbridge MC0: MISC 20403ebe86 [ 5319.696422] EDAC sbridge MC0: PROCESSOR 0:206d7 TIME 1449517966 SOCKET 0 APIC c [ 5319.716789] EDAC MC0: 0 UE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0xbb68e offset :0xc00 grain:32 - area:DRAM err_code:0001:0090 socket:0 ha:0 channel_mask:1 rank:0) [ 5319.771531] mce: [Hardware Error]: CPU 38: Machine Check Exception: 5 Bank 5: be00000000010090 [ 5319.792743] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff8135de7f> {intel_idle+0xbf/0x130} [ 5319.813836] mce: [Hardware Error]: TSC bab9f2d8a87ce ADDR bb68ec00 MISC 20403ebe86 [ 5319.832819] mce: [Hardware Error]: PROCESSOR 0:206d7 TIME 1449517966 SOCKET 0 APIC d microcode 710 [ 5319.854654] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR [ 5319.868243] EDAC sbridge MC0: CPU 38: Machine Check Exception: 5 Bank 5: be00000000010090 [ 5319.888366] EDAC sbridge MC0: TSC bab9f2d8a87ce [ 5319.898186] EDAC sbridge MC0: ADDR bb68ec00 EDAC sbridge MC0: MISC 20403ebe86 [ 5319.916192] EDAC sbridge MC0: PROCESSOR 0:206d7 TIME 1449517966 SOCKET 0 APIC d [ 5319.936752] EDAC MC0: 0 UE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0xbb68e offset:0xc00 grain:32 - area:DRAM err_code:0001:0090 socket:0 ha:0 channel_mask:1 rank:0) [ 5319.991752] mce: [Hardware Error]: CPU 35: Machine Check Exception: 5 Bank 5: be00000000010090 [ 5320.013034] mce: [Hardware Error]: RIP !INEXACT! 
10:<ffffffff8135de7f> {intel_idle+0xbf/0x130} [ 5320.034166] mce: [Hardware Error]: TSC bab9f2d8a59dd ADDR bb68ec00 MISC 20403ebe86 [ 5320.053149] mce: [Hardware Error]: PROCESSOR 0:206d7 TIME 1449517966 SOCKET 0 APIC 7 microcode 710 [ 5320.074972] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR [ 5320.088567] EDAC sbridge MC0: CPU 35: Machine Check Exception: 5 Bank 5: be00000000010090 [ 5320.108688] EDAC sbridge MC0: TSC bab9f2d8a59dd [ 5320.118511] EDAC sbridge MC0: ADDR bb68ec00 EDAC sbridge MC0: MISC 20403ebe86 [ 5320.136527] EDAC sbridge MC0: PROCESSOR 0:206d7 TIME 1449517966 SOCKET 0 APIC 7 [ 5320.157079] EDAC MC0: 0 UE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0xbb68e offset:0xc00 grain:32 - area:DRAM err_code:0001:0090 socket:0 ha:0 channel_mask:1 rank:0) [ 5320.212025] mce: [Hardware Error]: CPU 39: Machine Check Exception: 5 Bank 5: be00000000010090 [ 5320.233316] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff8135de7f> {intel_idle+0xbf/0x130} [ 5320.254462] mce: [Hardware Error]: TSC bab9f2d8a4f5c ADDR bb68ec00 MISC 20403ebe86 [ 5320.273455] mce: [Hardware Error]: PROCESSOR 0:206d7 TIME 1449517966 SOCKET 0 APIC f microcode 710 [ 5320.295303] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR [ 5320.308905] EDAC sbridge MC0: CPU 39: Machine Check Exception: 5 Bank 5: be00000000010090 [ 5320.329026] EDAC sbridge MC0: TSC bab9f2d8a4f5c [ 5320.338847] EDAC sbridge MC0: ADDR bb68ec00 EDAC sbridge MC0: MISC 20403ebe86 [ 5320.356858] EDAC sbridge MC0: PROCESSOR 0:206d7 TIME 1449517966 SOCKET 0 APIC f [ 5320.377433] EDAC MC0: 0 UE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0xbb68e offset:0xc00 grain:32 - area:DRAM err_code:0001:0090 socket:0 ha:0 channel_mask:1 rank:0) [ 5320.432474] mce: [Hardware Error]: CPU 7: Machine Check Exception: 5 Bank 5: be00000000010090 [ 5320.453569] mce: [Hardware Error]: RIP !INEXACT! 
10:<ffffffff8135de7f> {intel_idle+0xbf/0x130} [ 5320.474703] mce: [Hardware Error]: TSC bab9f2d8a4d60 ADDR bb68ec00 MISC 20403ebe86 [ 5320.493689] mce: [Hardware Error]: PROCESSOR 0:206d7 TIME 1449517966 SOCKET 0 APIC e microcode 710 [ 5320.515532] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR [ 5320.529139] EDAC sbridge MC0: CPU 7: Machine Check Exception: 5 Bank 5: be00000000010090 [ 5320.549050] EDAC sbridge MC0: TSC bab9f2d8a4d60 [ 5320.558870] EDAC sbridge MC0: ADDR bb68ec00 EDAC sbridge MC0: MISC 20403ebe86 [ 5320.576890] EDAC sbridge MC0: PROCESSOR 0:206d7 TIME 1449517966 SOCKET 0 APIC e [ 5320.597478] EDAC MC0: 0 UE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0xbb68e offset:0xc00 grain:32 - area:DRAM err_code:0001:0090 socket:0 ha:0 channel_mask:1 rank:0) [ 5320.652525] mce: [Hardware Error]: CPU 36: Machine Check Exception: 5 Bank 5: be00000000010090 [ 5320.673804] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff8135de7f> {intel_idle+0xbf/0x130} [ 5320.694918] mce: [Hardware Error]: TSC bab9f2d8a5823 ADDR bb68ec00 MISC 20403ebe86 [ 5320.713916] mce: [Hardware Error]: PROCESSOR 0:206d7 TIME 1449517966 SOCKET 0 APIC 9 microcode 710 [ 5320.735759] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR [ 5320.749347] EDAC sbridge MC0: CPU 36: Machine Check Exception: 5 Bank 5: be00000000010090 [ 5320.769452] EDAC sbridge MC0: TSC bab9f2d8a5823 [ 5320.779273] EDAC sbridge MC0: ADDR bb68ec00 EDAC sbridge MC0: MISC 20403ebe86 [ 5320.797296] EDAC sbridge MC0: PROCESSOR 0:206d7 TIME 1449517966 SOCKET 0 APIC 9 [ 5320.817877] EDAC MC0: 0 UE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0xbb68e offset:0xc00 grain:32 - area:DRAM err_code:0001:0090 socket:0 ha:0 channel_mask:1 rank:0) [ 5320.872972] mce: [Hardware Error]: CPU 33: Machine Check Exception: 5 Bank 5: be00000000010090 [ 5320.894249] mce: [Hardware Error]: RIP !INEXACT! 
10:<ffffffff8135de7f> {intel_idle+0xbf/0x130} [ 5320.915390] mce: [Hardware Error]: TSC bab9f2d8a5326 ADDR bb68ec00 MISC 20403ebe86 [ 5320.934374] mce: [Hardware Error]: PROCESSOR 0:206d7 TIME 1449517966 SOCKET 0 APIC 3 microcode 710 [ 5320.956222] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR [ 5320.969807] EDAC sbridge MC0: CPU 33: Machine Check Exception: 5 Bank 5: be00000000010090 [ 5320.989913] EDAC sbridge MC0: TSC bab9f2d8a5326 [ 5320.999734] EDAC sbridge MC0: ADDR bb68ec00 EDAC sbridge MC0: MISC 20403ebe86 [ 5321.017750] EDAC sbridge MC0: PROCESSOR 0:206d7 TIME 1449517966 SOCKET 0 APIC 3 [ 5321.038284] EDAC MC0: 0 UE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0xbb68e offset:0xc00 grain:32 - area:DRAM err_code:0001:0090 socket:0 ha:0 channel_mask:1 rank:0) [ 5321.093686] mce: [Hardware Error]: CPU 1: Machine Check Exception: 5 Bank 5: be00000000010090 [ 5321.114770] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff8135de7f> {intel_idle+0xbf/0x130} [ 5321.135925] mce: [Hardware Error]: TSC bab9f2d8a5562 ADDR bb68ec00 MISC 20403ebe86 [ 5321.154918] mce: [Hardware Error]: PROCESSOR 0:206d7 TIME 1449517966 SOCKET 0 APIC 2 microcode 710 [ 5321.176765] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR [ 5321.190369] EDAC sbridge MC0: CPU 1: Machine Check Exception: 5 Bank 5: be00000000010090 [ 5321.210303] EDAC sbridge MC0: TSC bab9f2d8a5562 [ 5321.220123] EDAC sbridge MC0: ADDR bb68ec00 EDAC sbridge MC0: MISC 20403ebe86 [ 5321.238146] EDAC sbridge MC0: PROCESSOR 0:206d7 TIME 1449517966 SOCKET 0 APIC 2 [ 5321.258723] EDAC MC0: 0 UE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0xbb68e offset:0xc00 grain:32 - area:DRAM err_code:0001:0090 socket:0 ha:0 channel_mask:1 rank:0) [ 5321.303358] mce: [Hardware Error]: CPU 4: Machine Check Exception: 5 Bank 5: be00000000010090 [ 5321.324279] mce: [Hardware Error]: RIP !INEXACT! 
10:<ffffffff8135de7f> {intel_idle+0xbf/0x130} [ 5321.345397] mce: [Hardware Error]: TSC bab9f2d8a572f ADDR bb68ec00 MISC 20403ebe86 [ 5321.364380] mce: [Hardware Error]: PROCESSOR 0:206d7 TIME 1449517966 SOCKET 0 APIC 8 microcode 710 [ 5321.386184] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR [ 5321.399729] EDAC sbridge MC0: CPU 4: Machine Check Exception: 5 Bank 5: be00000000010090 [ 5321.419624] EDAC sbridge MC0: TSC bab9f2d8a572f [ 5321.429445] EDAC sbridge MC0: ADDR bb68ec00 EDAC sbridge MC0: MISC 20403ebe86 [ 5321.447454] EDAC sbridge MC0: PROCESSOR 0:206d7 TIME 1449517966 SOCKET 0 APIC 8 [ 5321.467989] EDAC MC0: 0 UE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0xbb68e offset:0xc00 grain:32 - area:DRAM err_code:0001:0090 socket:0 ha:0 channel_mask:1 rank:0) [ 5321.511475] mce: [Hardware Error]: CPU 32: Machine Check Exception: 5 Bank 5: be00000000010090 [ 5321.532587] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff8135de7f> {intel_idle+0xbf/0x130} [ 5321.553689] mce: [Hardware Error]: TSC bab9f2d8a50f4 ADDR bb68ec00 MISC 20403ebe86 [ 5321.572681] mce: [Hardware Error]: PROCESSOR 0:206d7 TIME 1449517966 SOCKET 0 APIC 1 microcode 710 [ 5321.594500] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR [ 5321.608057] EDAC sbridge MC0: CPU 32: Machine Check Exception: 5 Bank 5: be00000000010090 [ 5321.628161] EDAC sbridge MC0: TSC bab9f2d8a50f4 [ 5321.637982] EDAC sbridge MC0: ADDR bb68ec00 EDAC sbridge MC0: MISC 20403ebe86 [ 5321.655998] EDAC sbridge MC0: PROCESSOR 0:206d7 TIME 1449517966 SOCKET 0 APIC 1 [ 5321.676524] EDAC MC0: 0 UE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0xbb68e offset:0xc00 grain:32 - area:DRAM err_code:0001:0090 socket:0 ha:0 channel_mask:1 rank:0) [ 5321.720020] mce: [Hardware Error]: CPU 0: Machine Check Exception: 5 Bank 5: be00000000010090 [ 5321.740939] mce: [Hardware Error]: RIP !INEXACT! 
10:<ffffffff8135de7f> {intel_idle+0xbf/0x130} [ 5321.762058] mce: [Hardware Error]: TSC bab9f2d8a5034 ADDR bb68ec00 MISC 20403ebe86 [ 5321.781022] mce: [Hardware Error]: PROCESSOR 0:206d7 TIME 1449517966 SOCKET 0 APIC 0 microcode 710 [ 5321.802837] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR [ 5321.816395] EDAC sbridge MC0: CPU 0: Machine Check Exception: 5 Bank 5: be00000000010090 [ 5321.836300] EDAC sbridge MC0: TSC bab9f2d8a5034 [ 5321.846121] EDAC sbridge MC0: ADDR bb68ec00 EDAC sbridge MC0: MISC 20403ebe86 [ 5321.864127] EDAC sbridge MC0: PROCESSOR 0:206d7 TIME 1449517966 SOCKET 0 APIC 0 [ 5321.884647] EDAC MC0: 0 UE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0xbb68e offset:0xc00 grain:32 - area:DRAM err_code:0001:0090 socket:0 ha:0 channel_mask:1 rank:0) [ 5321.928136] mce: [Hardware Error]: Machine check: Processor context corrupt [ 5321.945589] Kernel panic - not syncing: Fatal machine check [ 5321.985122] Kernel Offset: disabled [ 5322.008492] Rebooting in 100 seconds.. [ 5421.226077] ACPI MEMORY or I/O RESET_REG. -- Regards/Gruss, Boris. ECO tip #101: Trim your mails when you reply. ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Patch V2] x86, mce: Ensure offline CPU's don't participate in mce rendezvous process. 2015-12-07 22:34 ` Borislav Petkov @ 2015-12-07 23:26 ` Luck, Tony 2015-12-07 23:46 ` Raj, Ashok 1 sibling, 0 replies; 13+ messages in thread From: Luck, Tony @ 2015-12-07 23:26 UTC (permalink / raw) To: Borislav Petkov; +Cc: Raj, Ashok, linux-kernel, linux-edac On Mon, Dec 07, 2015 at 11:34:27PM +0100, Borislav Petkov wrote: > BIOS is doing funny cores enumeration: > > node #0, CPUs 0-7 > node #1, CPUs 8-15 > node #2, CPUs 16-23 > node #3, CPUs 24-31 > > and then starts from node 0 again: > > .... node #0, CPUs: #32 #33 #34 #35 #36 #37 #38 #39 > .... node #1, CPUs: #40 #41 #42 #43 #44 #45 #46 #47 > .... node #2, CPUs: #48 #49 #50 #51 #52 #53 #54 #55 > .... node #3, CPUs: #56 #57 #58 #59 #60 #61 #62 #63 That's normal. BIOS writers are encouraged to list all the hyperthread 0 cpus from each core, and then add the hyperthread 1 cpus later in the table. That way an OS that boots less than all the cpus will get the maximum number of real cores into play. > 0x00000010 Memory Uncorrectable non-fatal > > it generates an MCE only on the node 0 cores. For that log see the end > of this mail. The gist of it is that the CPUs on which #MC gets raised > are the cores on node 0, i.e., 0-7 and 32-39. I think all the threads on all the sockets must have shown up in the machine check handler ... but only the ones on socket0 printed anything (they can all see the error in bank5 which is shared across the socket ... but cpus 8-15 etc. will see no errors in any banks ... so will be silent.) -Tony ^ permalink raw reply [flat|nested] 13+ messages in thread
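[ Editorial sketch: the enumeration order described above can be written down as simple arithmetic for this particular box. The constants come from the boot log in this thread (4 nodes, 8 cores per node, 2 threads per core); the function names are hypothetical, and other machines may enumerate differently. The sibling mapping agrees with the APIC IDs in the log, where CPU 3 has APIC 6 and CPU 35 has APIC 7. ]

```c
/* Layout of the box in this thread; an assumption, not a general rule. */
enum {
	NODES = 4,
	CORES_PER_NODE = 8,
	CORES = NODES * CORES_PER_NODE, /* 32 physical cores */
	THREADS_PER_CORE = 2
};

/* NUMA node of a logical CPU: CPUs 0-7 and 32-39 are on node 0, etc. */
int cpu_to_node(int cpu)
{
	return (cpu % CORES) / CORES_PER_NODE;
}

/* Hyperthread index: 0 for CPUs 0-31, 1 for CPUs 32-63. */
int cpu_to_thread(int cpu)
{
	return cpu / CORES;
}

/* The other hyperthread on the same core. */
int cpu_sibling(int cpu)
{
	return (cpu + CORES) % (THREADS_PER_CORE * CORES);
}
```

Under this model the two offlined CPUs, 5 and 34, are both on node 0 but on different cores: CPU 5 shares a core with CPU 37, and CPU 34 shares a core with CPU 2.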
* Re: [Patch V2] x86, mce: Ensure offline CPU's don't participate in mce rendezvous process. 2015-12-07 22:34 ` Borislav Petkov 2015-12-07 23:26 ` Luck, Tony @ 2015-12-07 23:46 ` Raj, Ashok 2015-12-07 23:25 ` Borislav Petkov 1 sibling, 1 reply; 13+ messages in thread From: Raj, Ashok @ 2015-12-07 23:46 UTC (permalink / raw) To: Borislav Petkov; +Cc: Luck, Tony, linux-kernel, linux-edac, Ashok Raj On Mon, Dec 07, 2015 at 11:34:27PM +0100, Borislav Petkov wrote: > > Box logs below. Do you have the dmidecode strings to find which platform this is? > > BIOS is doing funny cores enumeration: > > node #0, CPUs 0-7 > node #1, CPUs 8-15 > node #2, CPUs 16-23 > node #3, CPUs 24-31 > > and then starts from node 0 again: > > .... node #0, CPUs: #32 #33 #34 #35 #36 #37 #38 #39 > .... node #1, CPUs: #40 #41 #42 #43 #44 #45 #46 #47 > .... node #2, CPUs: #48 #49 #50 #51 #52 #53 #54 #55 > .... node #3, CPUs: #56 #57 #58 #59 #60 #61 #62 #63 > > So I went and offlined cores 5 and 34 which are on node 0. > > Why node 0? Well, when I inject error type 0x10 which is > > 0x00000010 Memory Uncorrectable non-fatal > > it generates an MCE only on the node 0 cores. For that log see the end > of this mail. The gist of it is that the CPUs on which #MC gets raised > are the cores on node 0, i.e., 0-7 and 32-39. > > Cores 5 and 34 are gone, of course. > > I mean, even if the #MC gets raised only on the node, the fix still > works. Not sure how the fix works, since we excluded only the offline CPUs. So unless all online CPUs check in, the code should give you the old behavior. What does "cat /proc/interrupts | grep MCE" show? On a system that broadcasts, the MCE counts should be the same on all CPUs. Since we didn't increment the stats on the offline CPUs, if you were to bring one of them back up, its count should be one less than that of the other CPUs... Cheers, Ashok ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Patch V2] x86, mce: Ensure offline CPU's don't participate in mce rendezvous process. 2015-12-07 23:46 ` Raj, Ashok @ 2015-12-07 23:25 ` Borislav Petkov 2015-12-08 1:41 ` Raj, Ashok 0 siblings, 1 reply; 13+ messages in thread From: Borislav Petkov @ 2015-12-07 23:25 UTC (permalink / raw) To: Raj, Ashok; +Cc: Luck, Tony, linux-kernel, linux-edac On Mon, Dec 07, 2015 at 06:46:40PM -0500, Raj, Ashok wrote: > On Mon, Dec 07, 2015 at 11:34:27PM +0100, Borislav Petkov wrote: > > > > Box logs below. > > Do you have the dmidecode strings to find which platform this is? Is this enough or you want complete dmidecode dump? DMI: Intel Corporation LH Pass/S4600LH...., BIOS SE5C600.86B.99.99.2050.043020121425 04/30/2012 > Not sure how the fix works.. since we excluded only the ones offline. > So unless all online cpu's check in, the code should give you the old > behavior. Did you miss my statement in my previous mail where I said that the MCE is being raised only on the cores of node 0? > What does cat /proc/interrupts | grep MCE Can't. Shell on the box is dead after the injection. > In a system broadcasting, all cpu counts should be the same. Since we didn't > increment the offline stats, if you were to bring the cpu up, it should be one > less than other cpus... See the logs at the end of my previous email. #MC gets raised - or at least output from mce_panic() comes out only - on the cores of node 0. -- Regards/Gruss, Boris. ECO tip #101: Trim your mails when you reply. ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Patch V2] x86, mce: Ensure offline CPU's don't participate in mce rendezvous process. 2015-12-07 23:25 ` Borislav Petkov @ 2015-12-08 1:41 ` Raj, Ashok 2015-12-08 9:18 ` Borislav Petkov 0 siblings, 1 reply; 13+ messages in thread From: Raj, Ashok @ 2015-12-08 1:41 UTC (permalink / raw) To: Borislav Petkov; +Cc: Luck, Tony, linux-kernel, linux-edac, ashok.raj On Tue, Dec 08, 2015 at 12:25:24AM +0100, Borislav Petkov wrote: > > Did you miss my statement in my previous mail where I said that the MCE > is being raised only on the cores of node 0? > That's right, but I think if the MCE is only delivered to node0, then the system would panic every time, with or without the patch, which is why I got confused. I somehow misunderstood that with this patch the system didn't panic. Cheers, Ashok ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Patch V2] x86, mce: Ensure offline CPU's don't participate in mce rendezvous process. 2015-12-08 1:41 ` Raj, Ashok @ 2015-12-08 9:18 ` Borislav Petkov 2015-12-08 15:59 ` Luck, Tony 0 siblings, 1 reply; 13+ messages in thread From: Borislav Petkov @ 2015-12-08 9:18 UTC (permalink / raw) To: Raj, Ashok; +Cc: Luck, Tony, linux-kernel, linux-edac On Mon, Dec 07, 2015 at 08:41:43PM -0500, Raj, Ashok wrote: > On Tue, Dec 08, 2015 at 12:25:24AM +0100, Borislav Petkov wrote: > > > > Did you miss my statement in my previous mail where I said that the MCE > > is being raised only on the cores of node 0? > > > > That's right.. but i think if MCE is only given to node0, then the system > would panic eveytime with or without the patch. which is why i got confused. > > I somehow misunderstood that with this patch the system didn't panic. No, the system did panic in both times. The "strange" observation is that the MCE gets reported only on the cores on node 0. Or at least only the printks from mce_panic() on the cores on node0 reach the serial console. If we really broadcast only on node0, then that would be a problem if the corrupted data leaves the node and manages to corrupt storage when written out on some of the other nodes. I'm not sure if the kernel panicking the whole system is on time and there's not a small window between the detection and the panicking, in which the corruption might happen. If so, this'd defeat the purpose of MCE broadcasting but I'm just hypothesizing here. -- Regards/Gruss, Boris. ECO tip #101: Trim your mails when you reply. ^ permalink raw reply [flat|nested] 13+ messages in thread
* RE: [Patch V2] x86, mce: Ensure offline CPU's don't participate in mce rendezvous process. 2015-12-08 9:18 ` Borislav Petkov @ 2015-12-08 15:59 ` Luck, Tony 2015-12-08 18:56 ` Borislav Petkov 0 siblings, 1 reply; 13+ messages in thread From: Luck, Tony @ 2015-12-08 15:59 UTC (permalink / raw) To: Borislav Petkov, Raj, Ashok; +Cc: linux-kernel, linux-edac > No, the system did panic in both times. The "strange" observation is > that the MCE gets reported only on the cores on node 0. Or at least only > the printks from mce_panic() on the cores on node0 reach the serial > console. You only see messages and logs from node0, because the cpus there are the only ones that see any errors logged in their banks. The cpus on node 1, 2, 3 scan all banks and find nothing, so say nothing. There are no system-wide banks ... just core-wide (in recent generations banks 0-3) and socket-wide (banks >=4). But don't code those numbers into any generic code ... we will change them sooner or later. -Tony ^ permalink raw reply [flat|nested] 13+ messages in thread
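Editorial sketch of why only node 0 speaks up, again as a user-space model with hypothetical `ex_` names. It assumes one socket per node, the 4x8x2 topology from the boot log, and that the two hyperthreads of a core are logical CPUs N and N+32. Following Tony's advice, the scope of a bank is supplied by the caller rather than derived from a hard-coded bank number.

```c
#include <assert.h>
#include <stdbool.h>

enum {
	EX_CORES_PER_SOCKET = 8,
	EX_NR_CORES         = 32,
	EX_NR_CPUS          = 64
};

enum ex_bank_scope { EX_SCOPE_CORE, EX_SCOPE_SOCKET };

/* Where an error was logged; 'core' is a system-wide core index 0-31. */
struct ex_logged_error {
	enum ex_bank_scope scope;
	unsigned int socket;
	unsigned int core;
};

/* True if this CPU finds the error when it scans its banks. */
bool ex_cpu_sees_error(unsigned int cpu, const struct ex_logged_error *e)
{
	unsigned int core = cpu % EX_NR_CORES;

	assert(cpu < EX_NR_CPUS);
	if (e->scope == EX_SCOPE_SOCKET)
		return core / EX_CORES_PER_SOCKET == e->socket;
	return core == e->core;
}

/* Number of CPUs that would have something to print. */
unsigned int ex_count_reporting_cpus(const struct ex_logged_error *e)
{
	unsigned int n = 0;

	for (unsigned int cpu = 0; cpu < EX_NR_CPUS; cpu++)
		n += ex_cpu_sees_error(cpu, e) ? 1 : 0;
	return n;
}
```

For a socket-wide error on socket 0 the model yields 16 reporting CPUs (0-7 and 32-39), while the other 48 enter the handler and stay silent.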
* Re: [Patch V2] x86, mce: Ensure offline CPU's don't participate in mce rendezvous process. 2015-12-08 15:59 ` Luck, Tony @ 2015-12-08 18:56 ` Borislav Petkov 0 siblings, 0 replies; 13+ messages in thread From: Borislav Petkov @ 2015-12-08 18:56 UTC (permalink / raw) To: Luck, Tony; +Cc: Raj, Ashok, linux-kernel, linux-edac On Tue, Dec 08, 2015 at 03:59:58PM +0000, Luck, Tony wrote: > > No, the system did panic in both times. The "strange" observation is > > that the MCE gets reported only on the cores on node 0. Or at least only > > the printks from mce_panic() on the cores on node0 reach the serial > > console. > > You only see messages and logs from node0, because the cpus there are > the only ones that see any errors logged in their banks. > > The cpus on node 1, 2, 3 scan all banks and find nothing, so say nothing. Right, sure, of course. Doh! Confirmation: [ 183.840517] mce: do_machine_check: CPU: 30 [ 183.840531] mce: do_machine_check: CPU: 27 [ 183.840536] mce: do_machine_check: CPU: 29 [ 183.840541] mce: do_machine_check: CPU: 56 [ 183.840546] mce: do_machine_check: CPU: 28 [ 183.840548] mce: do_machine_check: CPU: 60 [ 183.840550] mce: do_machine_check: CPU: 24 [ 183.840557] mce: do_machine_check: CPU: 12 [ 183.840561] mce: do_machine_check: CPU: 45 [ 183.840565] mce: do_machine_check: CPU: 59 [ 183.840569] mce: do_machine_check: CPU: 57 [ 183.840572] mce: do_machine_check: CPU: 61 [ 183.840584] mce: do_machine_check: CPU: 0 [ 183.840587] mce: do_machine_check: CPU: 32 [ 183.840593] mce: do_machine_check: CPU: 63 [ 183.840596] mce: do_machine_check: CPU: 31 [ 183.840602] mce: do_machine_check: CPU: 42 [ 183.840606] mce: do_machine_check: CPU: 11 [ 183.840611] mce: do_machine_check: CPU: 41 [ 183.840613] mce: do_machine_check: CPU: 9 [ 183.840617] mce: do_machine_check: CPU: 62 [ 183.840619] mce: do_machine_check: CPU: 25 [ 183.840624] mce: do_machine_check: CPU: 58 [ 183.840627] mce: do_machine_check: CPU: 26 [ 183.840633] mce: do_machine_check: CPU: 5 [ 
183.840638] mce: do_machine_check: CPU: 1 [ 183.840642] mce: do_machine_check: CPU: 37 [ 183.840648] mce: do_machine_check: CPU: 15 [ 183.840650] mce: do_machine_check: CPU: 47 [ 183.840653] mce: do_machine_check: CPU: 44 [ 183.840657] mce: do_machine_check: CPU: 14 [ 183.840659] mce: do_machine_check: CPU: 46 [ 183.840666] mce: do_machine_check: CPU: 52 [ 183.840670] mce: do_machine_check: CPU: 50 [ 183.840675] mce: do_machine_check: CPU: 48 [ 183.840677] mce: do_machine_check: CPU: 16 [ 183.840682] mce: do_machine_check: CPU: 54 [ 183.840686] mce: do_machine_check: CPU: 18 [ 183.840692] mce: do_machine_check: CPU: 40 [ 183.840695] mce: do_machine_check: CPU: 8 [ 183.840701] mce: do_machine_check: CPU: 2 [ 183.840705] mce: do_machine_check: CPU: 20 [ 183.840710] mce: do_machine_check: CPU: 13 [ 183.840712] mce: do_machine_check: CPU: 43 [ 183.840716] mce: do_machine_check: CPU: 10 [ 183.840722] mce: do_machine_check: CPU: 3 [ 183.840724] mce: do_machine_check: CPU: 35 [ 183.840727] mce: do_machine_check: CPU: 33 [ 183.840730] mce: do_machine_check: CPU: 34 [ 183.840734] mce: do_machine_check: CPU: 6 [ 183.840738] mce: do_machine_check: CPU: 38 [ 183.840743] mce: do_machine_check: CPU: 53 [ 183.840745] mce: do_machine_check: CPU: 21 [ 183.840750] mce: do_machine_check: CPU: 23 [ 183.840752] mce: do_machine_check: CPU: 55 [ 183.840755] mce: do_machine_check: CPU: 22 [ 183.840759] mce: do_machine_check: CPU: 49 [ 183.840761] mce: do_machine_check: CPU: 17 [ 183.840767] mce: do_machine_check: CPU: 19 [ 183.840770] mce: do_machine_check: CPU: 51 [ 183.840776] mce: do_machine_check: CPU: 39 [ 183.840778] mce: do_machine_check: CPU: 7 [ 183.840784] mce: do_machine_check: CPU: 36 [ 183.840786] mce: do_machine_check: CPU: 4 [ 184.485104] Disabling lock debugging due to kernel taint [ 184.498006] mce: [Hardware Error]: CPU 32: Machine Check Exception: 5 Bank 5: be00000000010090 [ 184.498023] mce: [Hardware Error]: Machine check events logged [ 184.531428] mce: [Hardware 
Error]: RIP !INEXACT! 10:<ffffffff8135de7f> {intel_idle+0xbf/0x130} [ 184.551126] mce: [Hardware Error]: TSC c760ad064ccce ADDR bb68ec00 MISC 421c8c86 [ 184.568358] mce: [Hardware Error]: PROCESSOR 0:206d7 TIME 1449600598 SOCKET 0 APIC 1 microcode 710 [ 184.588862] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR ... mce: [Hardware Error]: CPU 0: Machine Check Exception: 5 Bank 5: be00000000010090 mce: [Hardware Error]: CPU 1: Machine Check Exception: 5 Bank 5: be00000000010090 mce: [Hardware Error]: CPU 2: Machine Check Exception: 5 Bank 5: be00000000010090 mce: [Hardware Error]: CPU 32: Machine Check Exception: 5 Bank 5: be00000000010090 mce: [Hardware Error]: CPU 33: Machine Check Exception: 5 Bank 5: be00000000010090 mce: [Hardware Error]: CPU 34: Machine Check Exception: 5 Bank 5: be00000000010090 mce: [Hardware Error]: CPU 35: Machine Check Exception: 5 Bank 5: be00000000010090 mce: [Hardware Error]: CPU 36: Machine Check Exception: 5 Bank 5: be00000000010090 mce: [Hardware Error]: CPU 37: Machine Check Exception: 5 Bank 5: be00000000010090 mce: [Hardware Error]: CPU 38: Machine Check Exception: 5 Bank 5: be00000000010090 mce: [Hardware Error]: CPU 39: Machine Check Exception: 5 Bank 5: be00000000010090 mce: [Hardware Error]: CPU 3: Machine Check Exception: 5 Bank 5: be00000000010090 mce: [Hardware Error]: CPU 4: Machine Check Exception: 5 Bank 5: be00000000010090 mce: [Hardware Error]: CPU 5: Machine Check Exception: 5 Bank 5: be00000000010090 mce: [Hardware Error]: CPU 6: Machine Check Exception: 5 Bank 5: be00000000010090 mce: [Hardware Error]: CPU 7: Machine Check Exception: 5 Bank 5: be00000000010090 CPUs: [ 1.103200] x86: Booting SMP configuration: [ 1.112441] .... node #0, CPUs: #1 #2 #3 #4 #5 #6 #7 [ 1.227835] .... node #1, CPUs: #8 #9 #10 #11 #12 #13 #14 #15 [ 1.451861] .... node #2, CPUs: #16 #17 #18 #19 #20 #21 #22 #23 [ 1.674819] .... node #3, CPUs: #24 #25 #26 #27 #28 #29 #30 #31 [ 1.899011] .... 
node #0, CPUs: #32 #33 #34 #35 #36 #37 #38 #39 [ 2.026616] .... node #1, CPUs: #40 #41 #42 #43 #44 #45 #46 #47 [ 2.152645] .... node #2, CPUs: #48 #49 #50 #51 #52 #53 #54 #55 [ 2.276782] .... node #3, CPUs: #56 #57 #58 #59 #60 #61 #62 #63 [ 2.402263] x86: Booted up 4 nodes, 64 CPUs Ok, all clear. Thanks! -- Regards/Gruss, Boris. ECO tip #101: Trim your mails when you reply. ^ permalink raw reply [flat|nested] 13+ messages in thread
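Editorial sketch for reading the status value in the log above (Bank 5: be00000000010090). The flag positions follow the architectural IA32_MCi_STATUS layout in the Intel SDM; the `EX_`/`ex_` names are hypothetical stand-ins, not the kernel's own macros, and the code only splits the value into fields without interpreting the error code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Architectural flag bits of IA32_MCi_STATUS. */
#define EX_MCI_STATUS_VAL   (1ULL << 63) /* register contents valid      */
#define EX_MCI_STATUS_OVER  (1ULL << 62) /* an earlier error was lost    */
#define EX_MCI_STATUS_UC    (1ULL << 61) /* uncorrected error            */
#define EX_MCI_STATUS_EN    (1ULL << 60) /* error reporting enabled      */
#define EX_MCI_STATUS_MISCV (1ULL << 59) /* MISC register valid          */
#define EX_MCI_STATUS_ADDRV (1ULL << 58) /* ADDR register valid          */
#define EX_MCI_STATUS_PCC   (1ULL << 57) /* processor context corrupt    */

static_assert(EX_MCI_STATUS_VAL == 0x8000000000000000ULL, "VAL is bit 63");

struct ex_mci_status {
	bool val, over, uc, en, miscv, addrv, pcc;
	uint16_t mca_code;   /* bits 15:0, MCA error code            */
	uint16_t model_code; /* bits 31:16, model-specific error code */
};

/* Split a raw status value into its flag bits and code fields. */
struct ex_mci_status ex_decode_status(uint64_t status)
{
	struct ex_mci_status d;

	d.val   = (status & EX_MCI_STATUS_VAL) != 0;
	d.over  = (status & EX_MCI_STATUS_OVER) != 0;
	d.uc    = (status & EX_MCI_STATUS_UC) != 0;
	d.en    = (status & EX_MCI_STATUS_EN) != 0;
	d.miscv = (status & EX_MCI_STATUS_MISCV) != 0;
	d.addrv = (status & EX_MCI_STATUS_ADDRV) != 0;
	d.pcc   = (status & EX_MCI_STATUS_PCC) != 0;
	d.mca_code   = (uint16_t)(status & 0xffff);
	d.model_code = (uint16_t)((status >> 16) & 0xffff);
	return d;
}
```

Applied to 0xbe00000000010090, this gives VAL, UC, EN, MISCV, ADDRV and PCC set, OVER clear, MCA code 0x0090 and model-specific code 0x0001. The ADDRV and MISCV flags agree with the ADDR and MISC values printed in the same log.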