From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1753821AbdL1PrB (ORCPT ); Thu, 28 Dec 2017 10:47:01 -0500
Received: from mail-qt0-f193.google.com ([209.85.216.193]:43643 "EHLO
	mail-qt0-f193.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1753479AbdL1Pq7 (ORCPT );
	Thu, 28 Dec 2017 10:46:59 -0500
X-Google-Smtp-Source: ACJfBot6HybtphRNovlpWUsRqfEtvDFsGwQPkdhKWE5tMI232nTCh6nzOW2Zao1vB5X33rOBzZ3SJQ==
Date: Thu, 28 Dec 2017 10:48:35 -0500
From: Alexandru Chirvasitu
To: Thomas Gleixner
Cc: Dou Liyang, Pavel Machek, kernel list, Ingo Molnar,
	"Maciej W. Rozycki", Mikael Pettersson, Josh Poulson,
	Mihai Costache, Stephen Hemminger, Marc Zyngier,
	linux-pci@vger.kernel.org, Haiyang Zhang, Dexuan Cui, Simon Xiao,
	Saeed Mahameed, Jork Loeser, Bjorn Helgaas,
	devel@linuxdriverproject.org, KY Srinivasan
Subject: Re: PROBLEM: 4.15.0-rc3 APIC causes lockups on Core 2 Duo laptop
Message-ID: <20171228154835.GB10658@chirva-slack.chirva-slack>
References: <20171218082011.GA24638@arch-chirva.localdomain>
	<20171218101131.GA5338@amd>
	<20171219083421.GB24638@arch-chirva.localdomain>
	<20171220131929.GC24638@arch-chirva.localdomain>
	<20171228142117.GA10658@chirva-slack.chirva-slack>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To:
User-Agent: Mutt/1.6.1 (2016-04-27)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, Dec 28, 2017 at 03:48:15PM +0100, Thomas Gleixner wrote:
> On Thu, 28 Dec 2017, Alexandru Chirvasitu wrote:
> > On Thu, Dec 28, 2017 at 12:00:47PM +0100, Thomas Gleixner wrote:
> > > Ok, lets take a step back. The bisect/kexec attempts led us away from the
> > > initial problem which is the machine locking up after login, right?
> > >
> >
> > Yes; sorry about that..
>
> Nothing to be sorry about.
>
> > x86/vector: Replace the raw_spin_lock() with
> >
> > diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
> > index 7504491..e5bab02 100644
> > --- a/arch/x86/kernel/apic/vector.c
> > +++ b/arch/x86/kernel/apic/vector.c
> > @@ -726,6 +726,7 @@ static int apic_set_affinity(struct irq_data *irqd,
> >  				 const struct cpumask *dest, bool force)
> >  {
> >  	struct apic_chip_data *apicd = apic_chip_data(irqd);
> > +	unsigned long flags;
> >  	int err;
> >
> >  	/*
> > @@ -740,13 +741,13 @@ static int apic_set_affinity(struct irq_data *irqd,
> >  	    (apicd->is_managed || apicd->can_reserve))
> >  		return IRQ_SET_MASK_OK;
> >
> > -	raw_spin_lock(&vector_lock);
> > +	raw_spin_lock_irqsave(&vector_lock, flags);
> >  	cpumask_and(vector_searchmask, dest, cpu_online_mask);
> >  	if (irqd_affinity_is_managed(irqd))
> >  		err = assign_managed_vector(irqd, vector_searchmask);
> >  	else
> >  		err = assign_vector_locked(irqd, vector_searchmask);
> > -	raw_spin_unlock(&vector_lock);
> > +	raw_spin_unlock_irqrestore(&vector_lock, flags);
> >  	return err ? err : IRQ_SET_MASK_OK;
> >  }
> >
> > With this, I still get the lockup messages after login, but not the
> > freezes!
>
> That's really interesting. There should be no code path which calls into
> that with interrupts enabled. I assume you never ran that kernel with
> CONFIG_PROVE_LOCKING=y.
>

Correct. That option is not set in .config.

> Find below a debug patch which should show us the call chain for that
> case. Please apply that on top of Dou's patch so the machine stays
> accessible. Plain output from dmesg is sufficient.
>
> > The lockups register in the log, which I am attaching (see below for
> > attachment naming conventions).
>
> Hmm. That's RCU lockups and that backtrace on the CPU which gets the stall
> looks very familiar. I'd like to see the above result first and then I'll
> send you another pile of patches which might cure that RCU issue.
>
> Thanks,
>
> tglx
>
> 8<-------------------
> --- a/arch/x86/kernel/apic/vector.c
> +++ b/arch/x86/kernel/apic/vector.c
> @@ -729,6 +729,8 @@ static int apic_set_affinity(struct irq_
>  	unsigned long flags;
>  	int err;
>
> +	WARN_ON_ONCE(!irqs_disabled());
> +
>  	/*
>  	 * Core code can call here for inactive interrupts. For inactive
>  	 * interrupts which use managed or reservation mode there is no
>
>
>

Bit of a step back here: the kernel treated with Dou's patch no longer
logs me in reliably as it did before, with or without this newest patch
on top..

So now I sometimes get immediate lockups and freezes upon trying to log
in, and other times I get logged in but get a freeze seconds later. In
no case can I roam around long enough to get a dmesg, and I no longer
get the non-freezing lockups from before.

I can't imagine what I could possibly have changed.. Here's the output
of `git log --pretty=oneline -5` on the branch I'm working in.

--------------------

f2c02af5cc1d620c039b21fab0ca5948a06daf90 2nd tglx patch
7715575170bacf3566d400b9f2210a10ce152880 x86/vector: Replace the raw_spin_lock() with raw_spin_lock_irqsave()
8d9d56caf33d78bfe6b6087767b1b84acee58458 x86-32: fix kexec with stack canary (CONFIG_CC_STACKPROTECTOR)
a197e9dea4ccb72e1a6457fac15329bd5319e719 irq/matrix: Remove the overused BUGON() in irq_matrix_assign_system()
464e1d5f23cca236b930ef068c328a64cab78fb1 Linux 4.15-rc5

--------------------

7715575170bacf3566d400b9f2210a10ce152880, which is the kernel with
Dou's patch, logged me in and allowed me to produce the dmesg from
before. I did this a couple of times back then. For some reason I no
longer can: it has gone back to the no-go lockups from before.

And the next one, f2c02af5cc1d620c039b21fab0ca5948a06daf90, where I
applied the patch you just sent, behaves identically.