From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751890AbdF1RoY (ORCPT ); Wed, 28 Jun 2017 13:44:24 -0400 Received: from mail-oi0-f66.google.com ([209.85.218.66]:33003 "EHLO mail-oi0-f66.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751560AbdF1RoS (ORCPT ); Wed, 28 Jun 2017 13:44:18 -0400 MIME-Version: 1.0 In-Reply-To: <20170628092219.4df52dhwe7q3iao5@pd.tnic> References: <20170628000630.1973-1-jack@codezen.org> <20170628092219.4df52dhwe7q3iao5@pd.tnic> From: Jack Miller Date: Wed, 28 Jun 2017 12:44:17 -0500 X-Google-Sender-Auth: BjyWrpulaClyyE58IRpg_PW88-Q Message-ID: Subject: Re: [PATCH] x86/mce/AMD: Fix partial SMCA bank init when CPU 0 != thread 0 To: Borislav Petkov Cc: Jack Miller , linux-kernel@vger.kernel.org, tglx@linutronix.de, Yazen.Ghannam@amd.com, x86@kernel.org Content-Type: text/plain; charset="UTF-8" Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Jun 28, 2017 at 4:22 AM, Borislav Petkov wrote: > On Tue, Jun 27, 2017 at 07:06:30PM -0500, Jack Miller wrote: >> After a call to firmware SwitchBSP(), > > What is that and who does that? SwitchBSP() is part of the UEFI MPServices Protocol which I believe is an extension but it is supported by all of the firmwares I've tested on. In this case, I'm using a bootloader to SwitchBSP() so that hardware thread 0 (and thus core 0) can be offlined on AMD hardware (cpu0_hotplug unsupported). This is currently working by passing 'nomce' to the kernel, but obviously I'd prefer not to disable it. > >> Linux can be booted with a thread >> that isn't the first in the system. That thread automatically becomes >> CPU 0. > > Btw, you should be seeing other explosions too as a lot of code assumes > CPU 0 is the BSP. Actually, with 'nomce' or this patch applied the system seems to chug along merrily, no further errors in dmesg, no further BUGs. Linux still gets all of the topology correct (i.e. CPU 0's core/thread/siblings are correctly identified) so really, aside from userspace programs doing naive stuff with CPU affinity (like expecting even,odd CPUs to be SMT pairs), I think the overall result here is that most threads are interchangeable... except when probing certain features like these MCA types. > > ... > >> Signed-off-by: Jack Miller >> --- >> arch/x86/kernel/cpu/mcheck/mce_amd.c | 4 ++-- >> 1 file changed, 2 insertions(+), 2 deletions(-) >> >> diff --git a/arch/x86/kernel/cpu/mcheck/mce_amd.c b/arch/x86/kernel/cpu/mcheck/mce_amd.c >> index 6e4a047e4b68..9d74adcf34d2 100644 >> --- a/arch/x86/kernel/cpu/mcheck/mce_amd.c >> +++ b/arch/x86/kernel/cpu/mcheck/mce_amd.c >> @@ -170,8 +170,8 @@ static void get_smca_bank_info(unsigned int bank) >> struct smca_hwid *s_hwid; >> u32 high, instance_id; >> >> - /* Collect bank_info using CPU 0 for now. */ >> - if (cpu) >> + /* Collect bank_info using hardware thread 0 for now. */ >> + if (apic->get_apic_id(apic->read(APIC_ID)) != 0) > > Does > > if (cpu != boot_cpu_data.cpu_index) > return; > > work? Unfortunately, it doesn't. That value is explicitly set to 0. Most mechanisms operate around CPU #, which isn't very helpful if the BSP was changed under the covers. Alternatively, we could possibly sidestep the APIC ID uncertainty by patching get_smca_bank_info() to fallback on reading the bank hwid_mcatype from other online CPUs (it's already using rdmsr_safe_on_cpu) if its own hwid_mcatype isn't valid/recognized, but that's a more invasive patch. Jack