Date: Wed, 1 Apr 2015 10:00:23 +0800
Subject: Re: [debug PATCHes] Re: smp_call_function_single lockups
From: Daniel J Blueman
To: Chris J Arges
Cc: Linux Kernel, "x86@kernel.org", Linus Torvalds, Rafael David Tinoco,
 Peter Anvin, Jiang Liu, Peter Zijlstra, Jens Axboe, Frederic Weisbecker,
 Gema Gomez

On Wednesday, April 1, 2015 at 6:40:06 AM UTC+8, Chris J Arges wrote:
> On Tue, Mar 31, 2015 at 12:56:56PM +0200, Ingo Molnar wrote:
> >
> > * Linus Torvalds wrote:
> >
> > > Ok, interesting. So the whole "we try to do an APIC ACK with the ISR
> > > bit clear" seems to be a real issue.
> >
> > It's interesting in particular when it happens with an edge-triggered
> > interrupt source: it's much harder to miss level-triggered IRQs, which
> > stay around until actively handled. Edge-triggered IRQs are more
> > fragile to loss of event processing.
> >
> > > > Anyway, maybe this sheds some more light on this issue. I can
> > > > reproduce this at will, so let me know of other experiments to do.
> >
> > Btw., could you please describe (again) what your current best method
> > for reproduction is? It's been a long discussion ...
>
> Ingo,
>
> To set this up, I've done the following on a Xeon E5620 / Xeon E312xx
> machine (although I've heard that others have reproduced it on other
> machines):
>
> 1) Create an L1 KVM VM with 2 vCPUs (the single-vCPU case doesn't reproduce)
> 2) Create an L2 KVM VM inside the L1 VM with 1 vCPU
> 3) Add the following to the L1 cmdline:
>    nmi_watchdog=panic hung_task_panic=1 softlockup_panic=1 unknown_nmi_panic
> 4) Run something like 'stress -c 1 -m 1 -d 1 -t 1200' inside the L2 VM
>
> Sometimes this is sufficient to reproduce the issue; I've observed that
> running KSM in the L1 VM can agitate it (it calls native_flush_tlb_others).
> If this doesn't reproduce, then you can do the following:
>
> 5) Migrate the L2 vCPU randomly (via 'virsh vcpupin --live' or taskset)
>    between L1 vCPUs until the hang occurs.
>
> I attempted to write a module that used smp_call_function_single calls
> to trigger IPIs, but have been unable to create a simpler reproducer.

A non-intrusive way of generating a lot of IPIs is calling stop_machine()
via something like:

while :; do
    echo "base=0x20000000000 size=0x8000000 type=write-back" >/proc/mtrr
    echo "disable=4" >| /proc/mtrr
done

Of course, ensure the base is above DRAM and any 64-bit MMIO so there are
no side effects, and that the new region will land in entry 4 (so that
'disable=4' removes it again).

Onlining and offlining cores in parallel will also generate IPIs; a rough
sketch of that follows below.

Dan
-- 
Daniel J Blueman
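
A minimal sketch of the hotplug loop mentioned above, assuming the non-boot
CPUs (cpu1 and upward) are hotpluggable and the shell runs as root; the CPU
glob and loop structure are illustrative, not a tested reproducer:

# Toggle each non-boot CPU offline/online in its own background loop, so
# the hotplug paths (and the IPI/stop_machine work they involve) run in
# parallel across CPUs.
for cpu in /sys/devices/system/cpu/cpu[1-9]*; do
    (
        while :; do
            echo 0 > $cpu/online
            echo 1 > $cpu/online
        done
    ) &
done
wait

Each CPU gets its own subshell so the offline/online cycles overlap rather
than serialize.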