From: Linus Torvalds
Date: Mon, 30 Mar 2015 21:46:12 -0700
Subject: Re: smp_call_function_single lockups
To: Chris J Arges
Cc: Rafael David Tinoco, Ingo Molnar, Peter Anvin, Jiang Liu, Peter Zijlstra, LKML, Jens Axboe, Frederic Weisbecker, Gema Gomez, "the arch/x86 maintainers"
In-Reply-To: <20150331031536.GA9303@canonical.com>
References: <20150218222544.GA17717@twins.programming.kicks-ass.net>
 <20150331031536.GA9303@canonical.com>

On Mon, Mar 30, 2015 at 8:15 PM, Chris J Arges wrote:
> [ 13.613531] WARNING: CPU: 0 PID: 0 at ./arch/x86/include/asm/apic.h:444 apic_ack_edge+0x84/0x90()
> [ 13.613531] [] apic_ack_edge+0x84/0x90
> [ 13.613531] [] handle_edge_irq+0x57/0x120
> [ 13.613531] [] handle_irq+0x22/0x40
> [ 13.613531] [] do_IRQ+0x4f/0x140
> [ 13.613531] [] common_interrupt+0x6d/0x6d
> [ 13.613531] [] ? hrtimer_start+0x18/0x20
> [ 13.613531] [] ? native_safe_halt+0x6/0x10
> [ 13.613531] [] ? rcu_eqs_enter+0xa3/0xb0
> [ 13.613531] [] default_idle+0x1e/0xc0

Hmm. I didn't notice that "hrtimer_start" was always there as a stale
entry on the stack when this happened. That may well be immaterial -
the CPU being idle means that the last thing it did before going to
sleep was likely that "start timer" thing, but it's interesting even so.

Some issue with reprogramming the hrtimer as it is triggering, kind of
similar to the bootup case I saw where the keyboard init sequence
raises an interrupt that was already cleared by the time the interrupt
happened.

So maybe something like this happens:

 - local timer is about to go off and raises the interrupt line

 - in the meantime, we're reprogramming the timer into the future

 - the CPU takes the interrupt, but now the timer has been
   reprogrammed, so the irq line is no longer active, and ISR is zero
   even though we took the interrupt (which is why the new warning
   triggers)

 - we're running the local timer interrupt (which happened due to the
   *old* programmed value), but we do something wrong because when we
   read the timer state, we see the *new* programmed value and so we
   think that it's the new timer that triggered.

I dunno. I don't see why we'd lock up, but DaveJ's old lockup had
several signs that it seemed to be timer-related.

It would be interesting to see the actual irq number. Maybe this has
nothing what-so-ever to do with the hrtimer.

                   Linus
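
The four-step sequence guessed at above can be written out as a toy model
in plain C. This is only a sketch of the hypothesis, not kernel code:
every name in it (toy_timer, toy_cpu, toy_take_irq, and so on) is made up
for illustration, and the rule that the in-service bit simply mirrors the
current level of the interrupt line is an assumption of the model, not a
statement about how the local APIC actually behaves.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model only: none of these names are real kernel or APIC interfaces. */
struct toy_timer {
    uint64_t expires;   /* currently programmed expiry value */
    bool irq_line;      /* modelled level of the timer's interrupt line */
};

struct toy_cpu {
    bool irq_latched;   /* CPU has already committed to taking the interrupt */
};

struct toy_result {
    bool handled;         /* an interrupt was actually taken */
    bool isr_was_zero;    /* in-service bit looked clear in the handler */
    uint64_t expiry_seen; /* timer state the handler read */
};

/* Step 1: the timer goes off, raises the line, and the CPU latches it. */
static void toy_timer_fire(struct toy_timer *t, struct toy_cpu *c)
{
    assert(t != NULL && c != NULL);
    t->irq_line = true;
    c->irq_latched = true;
}

/* Step 2: reprogramming into the future drops the line again (assumption). */
static void toy_timer_reprogram(struct toy_timer *t, uint64_t new_expires)
{
    assert(t != NULL);
    t->expires = new_expires;
    t->irq_line = false;
}

/* Steps 3 and 4: the CPU takes the latched interrupt and reads timer state. */
static struct toy_result toy_take_irq(struct toy_timer *t, struct toy_cpu *c)
{
    struct toy_result r = { false, false, 0 };

    assert(t != NULL && c != NULL);
    if (!c->irq_latched)
        return r;                 /* nothing pending, nothing to do */

    c->irq_latched = false;
    r.handled = true;
    /* Assumption: the in-service bit mirrors the line level at this point. */
    r.isr_was_zero = !t->irq_line;
    /* The handler sees the *new* expiry, not the one that raised the irq. */
    r.expiry_seen = t->expires;
    return r;
}

/* Runs the hypothesized ordering: fire, reprogram, then take the irq. */
static struct toy_result toy_race_demo(void)
{
    struct toy_timer t = { .expires = 100, .irq_line = false };
    struct toy_cpu c = { .irq_latched = false };

    toy_timer_fire(&t, &c);
    toy_timer_reprogram(&t, 500);
    return toy_take_irq(&t, &c);
}
```

Calling toy_race_demo() from a small main() gives a result in which the
interrupt was taken, the in-service bit looked clear, and the handler read
the new expiry rather than the old one. That is the combination described
in the list. The model says nothing about whether it could lead to a lockup.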