From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1753851AbbBSWp4 (ORCPT ); Thu, 19 Feb 2015 17:45:56 -0500
Received: from mail-ie0-f173.google.com ([209.85.223.173]:46438 "EHLO
	mail-ie0-f173.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1752134AbbBSWpz (ORCPT );
	Thu, 19 Feb 2015 17:45:55 -0500
MIME-Version: 1.0
In-Reply-To:
References: <20150218222544.GA17717@twins.programming.kicks-ass.net>
Date: Thu, 19 Feb 2015 14:45:54 -0800
X-Google-Sender-Auth: erO8zsRWvM8k5HuNnrVCw2T_jM0
Message-ID:
Subject: Re: smp_call_function_single lockups
From: Linus Torvalds
To: Rafael David Tinoco , Ingo Molnar , Peter Anvin , Jiang Liu
Cc: Peter Zijlstra , LKML , Jens Axboe , Frederic Weisbecker ,
	Gema Gomez , Christopher Arges , "the arch/x86 maintainers"
Content-Type: text/plain; charset=UTF-8
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, Feb 19, 2015 at 1:59 PM, Linus Torvalds wrote:
>
> Is this worth looking at? Or is it something spurious? I might have
> gotten the vectors wrong, and maybe the warning is not because the ISR
> bit isn't set, but because I test the wrong bit.

I edited the patch to do ratelimiting (one per 10s max) rather than
"once". And tested it some more. It seems to work correctly. The irq
case during 8042 probing is not repeatable, and I suspect it happens
because the interrupt source goes away (some probe-time thing that
first triggers an interrupt, but then clears it itself), so it doesn't
happen every boot, and I've gotten it with slightly different
backtraces.

But it's the only warning that happens for me, so I think my code is
right (at least for the cases that trigger on this machine). It's
definitely not an "every interrupt causes the warning because the code
was buggy, and the WARN_ONCE() just printed the first one".

It would be interesting to hear if others see spurious APIC EOI cases too.
In particular, the people seeing the IPI lockup. Because a lot of the
lockups we've seen have *looked* like the IPI interrupt just never
happened, and so we're waiting forever for the target CPU to react to
it. And just maybe the spurious EOI could cause the wrong bit to be
cleared in the ISR, and then the interrupt never shows up.

Something like that would certainly explain why it only happens on
some machines and under certain timing circumstances.

                Linus
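
The spurious-EOI theory above rests on the fact that an EOI write to the
local APIC carries no vector number: the APIC retires whichever in-service
vector currently has the highest priority. A minimal userspace sketch of
that behaviour, together with a once-per-10s warning like the one
described, might look as follows. Everything here is a toy model and an
assumption, not the actual patch: `toy_apic`, `ratelimit`, `checked_eoi`
and the caller-supplied `now` timestamp are hypothetical names, and the
vector numbers used with it are arbitrary.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NR_VECTORS 256

/* Toy stand-in for the local APIC In-Service Register: one bit per vector. */
struct toy_apic {
	uint32_t isr[NR_VECTORS / 32];
};

/* Hypothetical limiter: at most one warning per `interval` seconds. */
struct ratelimit {
	long interval;    /* seconds between allowed warnings */
	long last;        /* time of the last printed warning */
	bool armed;       /* false until the first warning is printed */
	unsigned printed; /* number of warnings actually printed */
};

static bool isr_test(const struct toy_apic *apic, unsigned vec)
{
	assert(vec < NR_VECTORS);
	return (apic->isr[vec / 32] >> (vec % 32)) & 1u;
}

static void isr_set(struct toy_apic *apic, unsigned vec)
{
	assert(vec < NR_VECTORS);
	apic->isr[vec / 32] |= 1u << (vec % 32);
}

/*
 * An EOI write names no vector: it retires the highest in-service one.
 * Returns the vector that was cleared, or -1 if nothing was in service.
 */
static int toy_eoi(struct toy_apic *apic)
{
	for (int vec = NR_VECTORS - 1; vec >= 0; vec--) {
		if (isr_test(apic, (unsigned)vec)) {
			apic->isr[vec / 32] &= ~(1u << (vec % 32));
			return vec;
		}
	}
	return -1;
}

/* Returns true if a warning may be printed at time `now` (in seconds). */
static bool ratelimit_ok(struct ratelimit *rl, long now)
{
	if (rl->armed && now - rl->last < rl->interval)
		return false;
	rl->armed = true;
	rl->last = now;
	rl->printed++;
	return true;
}

/*
 * Issue an EOI on behalf of `expected_vec`. If that vector's ISR bit is
 * not set, the EOI is spurious: warn (rate-limited), then send it anyway.
 * Returns the vector the EOI actually retired, or -1.
 */
static int checked_eoi(struct toy_apic *apic, unsigned expected_vec,
		       struct ratelimit *rl, long now)
{
	if (!isr_test(apic, expected_vec) && ratelimit_ok(rl, now))
		fprintf(stderr, "spurious EOI: vector 0x%x not in service\n",
			expected_vec);
	return toy_eoi(apic);
}
```

For example, with only vector 0x40 in service and `rl.interval` set to 10,
`checked_eoi(&apic, 0x31, &rl, 0)` warns once and returns 0x40: the EOI
meant for 0x31 has retired the unrelated vector instead. A second spurious
call less than 10 seconds later still issues the EOI but prints nothing.
How real hardware behaves after losing an in-service bit this way is
outside what this model shows.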