Date: Wed, 1 Apr 2015 10:00:23 +0800
Subject: Re: [debug PATCHes] Re: smp_call_function_single lockups
From: Daniel J Blueman
To: Chris J Arges
Cc: Linux Kernel, "x86@kernel.org", Linus Torvalds, Rafael David Tinoco,
 Peter Anvin, Jiang Liu, Peter Zijlstra, Jens Axboe, Frederic Weisbecker,
 Gema Gomez

On Wednesday, April 1, 2015 at 6:40:06 AM UTC+8, Chris J Arges wrote:
> On Tue, Mar 31, 2015 at 12:56:56PM +0200, Ingo Molnar wrote:
> >
> > * Linus Torvalds wrote:
> >
> > > Ok, interesting. So the whole "we try to do an APIC ACK with the ISR
> > > bit clear" seems to be a real issue.
> >
> > It's interesting in particular when it happens with an edge-triggered
> > interrupt source: it's much harder to miss level-triggered IRQs, which
> > stay around until actively handled. Edge-triggered IRQs are more
> > fragile to loss of event processing.
> >
> > > > Anyway, maybe this sheds some more light on this issue. I can
> > > > reproduce this at will, so let me know of other experiments to do.
> >
> > Btw., could you please describe (again) what your current best method
> > for reproduction is? It's been a long discussion ...
>
> Ingo,
>
> To set this up, I've done the following on a Xeon E5620 / Xeon E312xx
> machine (although I've heard that others have reproduced it on other
> machines):
>
> 1) Create an L1 KVM VM with 2 vCPUs (the single-vCPU case doesn't reproduce)
> 2) Create an L2 KVM VM inside the L1 VM with 1 vCPU
> 3) Add the following to the L1 cmdline:
>    nmi_watchdog=panic hung_task_panic=1 softlockup_panic=1 unknown_nmi_panic
> 4) Run something like 'stress -c 1 -m 1 -d 1 -t 1200' inside the L2 VM
>
> Sometimes this is sufficient to reproduce the issue; I've observed that
> running KSM in the L1 VM can agitate it (it calls native_flush_tlb_others).
> If this doesn't reproduce, then you can do the following:
>
> 5) Migrate the L2 vCPU randomly (via 'virsh vcpupin --live' or taskset)
>    between L1 vCPUs until the hang occurs.
>
> I attempted to write a module that used smp_call_function_single calls
> to trigger IPIs, but have been unable to create a simpler reproducer.

A non-intrusive way of generating a lot of IPIs is calling stop_machine()
via something like:

while :; do
    echo "base=0x20000000000 size=0x8000000 type=write-back" >/proc/mtrr
    echo "disable=4" >| /proc/mtrr
done

Of course, ensure the base is above DRAM and any 64-bit MMIO so there are
no side effects, and that the new region will land in entry 4 (so that
'disable=4' removes it again).

Onlining and offlining cores in parallel will also generate IPIs; a rough
sketch of that follows below.

Dan
-- 
Daniel J Blueman
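
A minimal sketch of the hotplug loop mentioned above, assuming the non-boot
CPUs (cpu1 and upward) are hotpluggable and the shell runs as root; the CPU
glob and loop structure are illustrative, not a tested reproducer:

# Toggle each non-boot CPU offline/online in its own background loop, so
# the hotplug paths (and the IPI/stop_machine work they involve) run in
# parallel across CPUs.
for cpu in /sys/devices/system/cpu/cpu[1-9]*; do
    (
        while :; do
            echo 0 > $cpu/online
            echo 1 > $cpu/online
        done
    ) &
done
wait

Each CPU gets its own subshell so the offline/online cycles overlap rather
than serialize.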