From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1752065AbbLHS5A (ORCPT ); Tue, 8 Dec 2015 13:57:00 -0500
Received: from mail.skyhub.de ([78.46.96.112]:51875 "EHLO mail.skyhub.de"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1752005AbbLHS46 (ORCPT ); Tue, 8 Dec 2015 13:56:58 -0500
Date: Tue, 8 Dec 2015 19:56:44 +0100
From: Borislav Petkov
To: "Luck, Tony"
Cc: "Raj, Ashok", "linux-kernel@vger.kernel.org", "linux-edac@vger.kernel.org"
Subject: Re: [Patch V2] x86, mce: Ensure offline CPU's don't participate in mce rendezvous process.
Message-ID: <20151208185644.GE27180@pd.tnic>
References: <20151207200019.GH22248@pd.tnic>
	<3908561D78D1C84285E8C5FCA982C28F39F7C24B@ORSMSX114.amr.corp.intel.com>
	<20151207201951.GI22248@pd.tnic>
	<3908561D78D1C84285E8C5FCA982C28F39F7C3A4@ORSMSX114.amr.corp.intel.com>
	<20151207223427.GJ22248@pd.tnic>
	<20151207234639.GA81526@otc-brkl-03.jf.intel.com>
	<20151207232524.GK22248@pd.tnic>
	<20151208014142.GA82345@otc-brkl-03.jf.intel.com>
	<20151208091812.GA27180@pd.tnic>
	<3908561D78D1C84285E8C5FCA982C28F39F7CF67@ORSMSX114.amr.corp.intel.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
In-Reply-To: <3908561D78D1C84285E8C5FCA982C28F39F7CF67@ORSMSX114.amr.corp.intel.com>
User-Agent: Mutt/1.5.23 (2014-03-12)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, Dec 08, 2015 at 03:59:58PM +0000, Luck, Tony wrote:
> > No, the system did panic in both times. The "strange" observation is
> > that the MCE gets reported only on the cores on node 0. Or at least only
> > the printks from mce_panic() on the cores on node0 reach the serial
> > console.
>
> You only see messages and logs from node0, because the cpus there are
> the only ones that see any errors logged in their banks.
>
> The cpus on node 1, 2, 3 scan all banks and find nothing, so say nothing.

Right, sure, of course. Doh!

Confirmation:

[ 183.840517] mce: do_machine_check: CPU: 30
[ 183.840531] mce: do_machine_check: CPU: 27
[ 183.840536] mce: do_machine_check: CPU: 29
[ 183.840541] mce: do_machine_check: CPU: 56
[ 183.840546] mce: do_machine_check: CPU: 28
[ 183.840548] mce: do_machine_check: CPU: 60
[ 183.840550] mce: do_machine_check: CPU: 24
[ 183.840557] mce: do_machine_check: CPU: 12
[ 183.840561] mce: do_machine_check: CPU: 45
[ 183.840565] mce: do_machine_check: CPU: 59
[ 183.840569] mce: do_machine_check: CPU: 57
[ 183.840572] mce: do_machine_check: CPU: 61
[ 183.840584] mce: do_machine_check: CPU: 0
[ 183.840587] mce: do_machine_check: CPU: 32
[ 183.840593] mce: do_machine_check: CPU: 63
[ 183.840596] mce: do_machine_check: CPU: 31
[ 183.840602] mce: do_machine_check: CPU: 42
[ 183.840606] mce: do_machine_check: CPU: 11
[ 183.840611] mce: do_machine_check: CPU: 41
[ 183.840613] mce: do_machine_check: CPU: 9
[ 183.840617] mce: do_machine_check: CPU: 62
[ 183.840619] mce: do_machine_check: CPU: 25
[ 183.840624] mce: do_machine_check: CPU: 58
[ 183.840627] mce: do_machine_check: CPU: 26
[ 183.840633] mce: do_machine_check: CPU: 5
[ 183.840638] mce: do_machine_check: CPU: 1
[ 183.840642] mce: do_machine_check: CPU: 37
[ 183.840648] mce: do_machine_check: CPU: 15
[ 183.840650] mce: do_machine_check: CPU: 47
[ 183.840653] mce: do_machine_check: CPU: 44
[ 183.840657] mce: do_machine_check: CPU: 14
[ 183.840659] mce: do_machine_check: CPU: 46
[ 183.840666] mce: do_machine_check: CPU: 52
[ 183.840670] mce: do_machine_check: CPU: 50
[ 183.840675] mce: do_machine_check: CPU: 48
[ 183.840677] mce: do_machine_check: CPU: 16
[ 183.840682] mce: do_machine_check: CPU: 54
[ 183.840686] mce: do_machine_check: CPU: 18
[ 183.840692] mce: do_machine_check: CPU: 40
[ 183.840695] mce: do_machine_check: CPU: 8
[ 183.840701] mce: do_machine_check: CPU: 2
[ 183.840705] mce: do_machine_check: CPU: 20
[ 183.840710] mce: do_machine_check: CPU: 13
[ 183.840712] mce: do_machine_check: CPU: 43
[ 183.840716] mce: do_machine_check: CPU: 10
[ 183.840722] mce: do_machine_check: CPU: 3
[ 183.840724] mce: do_machine_check: CPU: 35
[ 183.840727] mce: do_machine_check: CPU: 33
[ 183.840730] mce: do_machine_check: CPU: 34
[ 183.840734] mce: do_machine_check: CPU: 6
[ 183.840738] mce: do_machine_check: CPU: 38
[ 183.840743] mce: do_machine_check: CPU: 53
[ 183.840745] mce: do_machine_check: CPU: 21
[ 183.840750] mce: do_machine_check: CPU: 23
[ 183.840752] mce: do_machine_check: CPU: 55
[ 183.840755] mce: do_machine_check: CPU: 22
[ 183.840759] mce: do_machine_check: CPU: 49
[ 183.840761] mce: do_machine_check: CPU: 17
[ 183.840767] mce: do_machine_check: CPU: 19
[ 183.840770] mce: do_machine_check: CPU: 51
[ 183.840776] mce: do_machine_check: CPU: 39
[ 183.840778] mce: do_machine_check: CPU: 7
[ 183.840784] mce: do_machine_check: CPU: 36
[ 183.840786] mce: do_machine_check: CPU: 4
[ 184.485104] Disabling lock debugging due to kernel taint
[ 184.498006] mce: [Hardware Error]: CPU 32: Machine Check Exception: 5 Bank 5: be00000000010090
[ 184.498023] mce: [Hardware Error]: Machine check events logged
[ 184.531428] mce: [Hardware Error]: RIP !INEXACT! 10: {intel_idle+0xbf/0x130}
[ 184.551126] mce: [Hardware Error]: TSC c760ad064ccce ADDR bb68ec00 MISC 421c8c86
[ 184.568358] mce: [Hardware Error]: PROCESSOR 0:206d7 TIME 1449600598 SOCKET 0 APIC 1 microcode 710
[ 184.588862] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR

...

mce: [Hardware Error]: CPU 0: Machine Check Exception: 5 Bank 5: be00000000010090
mce: [Hardware Error]: CPU 1: Machine Check Exception: 5 Bank 5: be00000000010090
mce: [Hardware Error]: CPU 2: Machine Check Exception: 5 Bank 5: be00000000010090
mce: [Hardware Error]: CPU 32: Machine Check Exception: 5 Bank 5: be00000000010090
mce: [Hardware Error]: CPU 33: Machine Check Exception: 5 Bank 5: be00000000010090
mce: [Hardware Error]: CPU 34: Machine Check Exception: 5 Bank 5: be00000000010090
mce: [Hardware Error]: CPU 35: Machine Check Exception: 5 Bank 5: be00000000010090
mce: [Hardware Error]: CPU 36: Machine Check Exception: 5 Bank 5: be00000000010090
mce: [Hardware Error]: CPU 37: Machine Check Exception: 5 Bank 5: be00000000010090
mce: [Hardware Error]: CPU 38: Machine Check Exception: 5 Bank 5: be00000000010090
mce: [Hardware Error]: CPU 39: Machine Check Exception: 5 Bank 5: be00000000010090
mce: [Hardware Error]: CPU 3: Machine Check Exception: 5 Bank 5: be00000000010090
mce: [Hardware Error]: CPU 4: Machine Check Exception: 5 Bank 5: be00000000010090
mce: [Hardware Error]: CPU 5: Machine Check Exception: 5 Bank 5: be00000000010090
mce: [Hardware Error]: CPU 6: Machine Check Exception: 5 Bank 5: be00000000010090
mce: [Hardware Error]: CPU 7: Machine Check Exception: 5 Bank 5: be00000000010090

CPUs:

[ 1.103200] x86: Booting SMP configuration:
[ 1.112441] .... node #0, CPUs: #1 #2 #3 #4 #5 #6 #7
[ 1.227835] .... node #1, CPUs: #8 #9 #10 #11 #12 #13 #14 #15
[ 1.451861] .... node #2, CPUs: #16 #17 #18 #19 #20 #21 #22 #23
[ 1.674819] .... node #3, CPUs: #24 #25 #26 #27 #28 #29 #30 #31
[ 1.899011] .... node #0, CPUs: #32 #33 #34 #35 #36 #37 #38 #39
[ 2.026616] .... node #1, CPUs: #40 #41 #42 #43 #44 #45 #46 #47
[ 2.152645] .... node #2, CPUs: #48 #49 #50 #51 #52 #53 #54 #55
[ 2.276782] .... node #3, CPUs: #56 #57 #58 #59 #60 #61 #62 #63
[ 2.402263] x86: Booted up 4 nodes, 64 CPUs

Ok, all clear.

Thanks!

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
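
For anyone wanting to double-check the conclusion above, here is a rough sketch (not part of the original mail) that cross-references the CPUs printing "Machine Check Exception" against the boot-time node listing. The helper names cpu_to_node() and nodes_of() are made up for illustration, and the script assumes the boot CPU (CPU 0, which the boot listing does not print) sits on node 0.

```python
import re

# Boot-time CPU-to-node listing, copied from the log in this mail
# (timestamps dropped). CPU 0 is the boot CPU and is not listed.
BOOT_LOG = """\
.... node #0, CPUs: #1 #2 #3 #4 #5 #6 #7
.... node #1, CPUs: #8 #9 #10 #11 #12 #13 #14 #15
.... node #2, CPUs: #16 #17 #18 #19 #20 #21 #22 #23
.... node #3, CPUs: #24 #25 #26 #27 #28 #29 #30 #31
.... node #0, CPUs: #32 #33 #34 #35 #36 #37 #38 #39
.... node #1, CPUs: #40 #41 #42 #43 #44 #45 #46 #47
.... node #2, CPUs: #48 #49 #50 #51 #52 #53 #54 #55
.... node #3, CPUs: #56 #57 #58 #59 #60 #61 #62 #63
"""

# CPUs whose "Machine Check Exception: 5 Bank 5" lines reached the console,
# in the order they appear in the mail.
REPORTING_CPUS = [0, 1, 2, 32, 33, 34, 35, 36, 37, 38, 39, 3, 4, 5, 6, 7]


def cpu_to_node(boot_log, boot_cpu_node=0):
    """Build a {cpu: node} map from the 'node #N, CPUs: ...' lines.

    Assumption: CPU 0 belongs to boot_cpu_node, since the listing omits it.
    """
    mapping = {0: boot_cpu_node}
    for line in boot_log.splitlines():
        m = re.search(r"node\s+#(\d+), CPUs:\s+(.*)", line)
        if not m:
            continue
        node = int(m.group(1))
        for cpu in re.findall(r"#(\d+)", m.group(2)):
            mapping[int(cpu)] = node
    return mapping


def nodes_of(cpus, mapping):
    """Return the set of nodes that the given CPUs live on."""
    return {mapping[c] for c in cpus}


mapping = cpu_to_node(BOOT_LOG)
print(sorted(nodes_of(REPORTING_CPUS, mapping)))  # prints [0]
```

All sixteen reporting CPUs map to node 0, which matches Tony's explanation: only the CPUs on that node see the error logged in their banks.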