From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S932765AbcBCUPD (ORCPT <rfc822;w@1wt.eu>);
	Wed, 3 Feb 2016 15:15:03 -0500
Received: from mx1.redhat.com ([209.132.183.28]:22532 "EHLO mx1.redhat.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S932367AbcBCUO7 (ORCPT <rfc822;linux-kernel@vger.kernel.org>);
	Wed, 3 Feb 2016 15:14:59 -0500
Date: Wed, 3 Feb 2016 15:14:53 -0500
From: Don Zickus <dzickus@redhat.com>
To: Jeffrey Merkey <jeffmerkey@gmail.com>
Cc: linux-kernel@vger.kernel.org, akpm@linux-foundation.org,
        atomlin@redhat.com, cmetcalf@ezchip.com, fweisbec@gmail.com,
        hidehiro.kawai.ez@hitachi.com, mhocko@suse.cz, tj@kernel.org,
        uobergfe@redhat.com
Subject: Re: [PATCH v5 3/3] Add BUG_XX() debugging hard/soft lockup detection
Message-ID: <20160203201453.GV26637@redhat.com>
References: <1454380428-31474-1-git-send-email-jeffmerkey@gmail.com>
 <1454380428-31474-3-git-send-email-jeffmerkey@gmail.com>
 <20160202173034.GG26637@redhat.com>
 <CA+ekxPXUNBvS9QjKWLGo_0zXJ=-fbC_iry5Jcv4HVyE1hpRbsw@mail.gmail.com>
 <20160203154515.GP26637@redhat.com>
 <CA+ekxPX9-WrLbjKfN+NFqBBaF9xF=08fE9e+806G-r0ArPHsig@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <CA+ekxPX9-WrLbjKfN+NFqBBaF9xF=08fE9e+806G-r0ArPHsig@mail.gmail.com>
User-Agent: Mutt/1.5.23.1 (2014-03-12)
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, Feb 03, 2016 at 10:23:42AM -0700, Jeffrey Merkey wrote:
> > Hmm, I am confused here.  So you are saying because we are in the nmi
> > handler you can not break into the system?  The nmi handler prints some
> > stuff to the screen, pokes the other cpus to print stuff to the screen and
> > then returns to a normal operation.  Unless you are saying the act of
> > sending NMI IPIs never completes (because a cpu is blocking IPI
> > interrupts), so the cpu hangs in nmi context and the debugger never
> > has a chance to 'break' in and see what is going on?
> >
> > Cheers,
> > Don
> >
> 
> Yes.  The nmi handlers never complete for the bug I worked on with
> tglx, probably because an nmi handler is calling timekeeper.c
> somewhere.  In some cases the code called from the nmi handlers may
> itself be what causes the lockup in the first place, so it will never
> reach a call to panic.  Looking over this code
> it's damn hard to find a good way to do this that works across all the
> arches without adding another macro to bug.h (BREAK_ON maybe), so I
> just used one that's already there.  I'll go back and rethink this
> some more.  It could just be as simple as calling panic from the first
> detection -- that works.

So, if you disable 'sysctl_hardlockup_all_cpu_backtrace' and enable
'hardlockup_panic', you should be able to achieve what you want, no?
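
To make that concrete, here is a rough sketch of flipping the two knobs
from userspace.  It assumes they are exposed as files named
hardlockup_all_cpu_backtrace and hardlockup_panic under /proc/sys/kernel;
the path, the file names, and the helper configure_hardlockup() are
illustrative assumptions, not something taken from the patch.

```python
import os

# Assumed location of the sysctl files; this may differ by kernel version.
SYSCTL_DIR = "/proc/sys/kernel"


def configure_hardlockup(sysctl_dir=SYSCTL_DIR):
    """Disable the all-CPU backtrace and enable panic on hard lockup.

    The file names are assumptions derived from the variable names
    mentioned above. Writing to the real /proc/sys tree requires root.
    """
    settings = {
        "hardlockup_all_cpu_backtrace": "0",  # do not poke the other cpus
        "hardlockup_panic": "1",              # panic on first detection
    }
    for name, value in settings.items():
        with open(os.path.join(sysctl_dir, name), "w") as f:
            f.write(value + "\n")
    return settings
```

Calling configure_hardlockup() as root would apply both settings; passing
a different directory lets you try it out against scratch files first.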

But you mentioned you wanted to recover, and hence avoid the panic?

Cheers,
Don