Date: Wed, 3 Feb 2016 15:14:53 -0500
From: Don Zickus
To: Jeffrey Merkey
Cc: linux-kernel@vger.kernel.org, akpm@linux-foundation.org,
	atomlin@redhat.com, cmetcalf@ezchip.com, fweisbec@gmail.com,
	hidehiro.kawai.ez@hitachi.com, mhocko@suse.cz, tj@kernel.org,
	uobergfe@redhat.com
Subject: Re: [PATCH v5 3/3] Add BUG_XX() debugging hard/soft lockup detection
Message-ID: <20160203201453.GV26637@redhat.com>
References: <1454380428-31474-1-git-send-email-jeffmerkey@gmail.com>
	<1454380428-31474-3-git-send-email-jeffmerkey@gmail.com>
	<20160202173034.GG26637@redhat.com>
	<20160203154515.GP26637@redhat.com>

On Wed, Feb 03, 2016 at 10:23:42AM -0700, Jeffrey Merkey wrote:
> > Hmm, I am confused here.  So you are saying because we are in the nmi
> > handler you can not break into the system?  The nmi handler prints some
> > stuff to the screen, pokes the other cpus to print stuff to the screen and
> > then returns to a normal operation.  Unless you are saying the act of
> > sending NMI IPIs never completes (because a cpu is blocking IPI
> > interrupts),
> > so the cpu hangs in nmi context and the debugger never has a chance to
> > 'break' in and see what is going on?
> >
> > Cheers,
> > Don
> >
>
> Yes. the nmi handlers never complete for the bug I worked on with
> tglx, probably because an nmi handler is calling timekeeper.c
> somewhere.  Some of these lockup bugs may be calling code from the nmi
> handlers that cause the lockup condition in the first place in some
> cases, so it will never reach a call to panic.  Looking over this code
> it's damn hard to find a good way to do this that works across all the
> arches without adding another macro to bug.h (BREAK_ON maybe), so I
> just used one that's already there.  I'll go back and rethink this
> some more.  It could just be as simple as calling panic from the first
> detection -- that works.

So, if you disable 'sysctl_hardlockup_all_cpu_backtrace' and enable
'hardlockup_panic', you should be able to achieve what you want, no?

But you mentioned you wanted to recover?  Hence avoiding the panic?

Cheers,
Don