Re: [PATCH v5 3/3] Add BUG_XX() debugging hard/soft lockup detection

From: Don Zickus <dzickus@redhat.com>
To: Jeffrey Merkey <jeffmerkey@gmail.com>
Cc: linux-kernel@vger.kernel.org, akpm@linux-foundation.org,
	atomlin@redhat.com, cmetcalf@ezchip.com, fweisbec@gmail.com,
	hidehiro.kawai.ez@hitachi.com, mhocko@suse.cz, tj@kernel.org,
	uobergfe@redhat.com
Subject: Re: [PATCH v5 3/3] Add BUG_XX() debugging hard/soft lockup detection
Date: Wed, 3 Feb 2016 10:45:15 -0500	[thread overview]
Message-ID: <20160203154515.GP26637@redhat.com> (raw)
In-Reply-To: <CA+ekxPXUNBvS9QjKWLGo_0zXJ=-fbC_iry5Jcv4HVyE1hpRbsw@mail.gmail.com>

On Tue, Feb 02, 2016 at 03:40:34PM -0700, Jeffrey Merkey wrote:
> >
> > Please remember to add version history, so I can tell what changed.
> >
> 
> What command do I give to git when it creates the patch from git
> format-patch that outputs what you are looking for or do I have to add
> that manually.  The diff of files changed?
> 
> >
> > I am not sure I am a fan of this.  You are taking a known macro BUG_ON with
> > known expectations and perversely converting it into an 'asm'.  So now when
> > folks read the code they scratch their heads why we are dumping the stack
> > twice when in fact we are not. It seems misleading. :-/
> >
> 
> 1.  Does not dump the stack at all the way it is coded -- look again.
> The current code dumps it only once.  Just executes an int3 and
> returns instead of crashing.  If you called panic all the time instead
> of conditionally in this code, this change would not be needed, since
> panic is setup already to call debuggers.  It's the failure of the
> current code to do that requires this change.  How about you call
> panic when this condition ocurrs, then the debugger will get called.
> 
> 2.  BUG() outputs an asm("ud2") and triggers an invalid instruction
> and system crash.  All that was added is the ability to switch that
> ud2 to an int3.    So what is more perverse here:
> 
> BUG() = ud2 -> invalid instruction -> trap  -> call crash code ->
> debugger -> then hang
> BUG() = int3 -> int3 trap -> enter debugger -> return - system can recover

Yes, I understand that.  I was referring to reading the code.  With your
patch I read the code and see BUG() and think the machine will panic right
there, when it fact it does not because of the tricks you play.

> 
> Because:
> 
> BUG() = Debugger = int3
> and
> BUG() != ud2 (undefined instruction) = crash = non recoverable
> 
> int3 (0xCC) has always been understood to mean BUG().  int3
> breakpoints are an integral part of Intel's architecture.  There is no
> reason for not exploiting this capability of their processors to help
> kernel developers use intel technology better.
> 
> > I still don't understand why we can't use Ingo's or tglx's approach?  Your
> > changelog doesn't point out the problems there.
> >
> 
> Because when you catch a bug in the hard lockup detector the system
> just sits there hard hung and you are not able to get into a debugger
> console since the system has crashed and the watchdog code has already
> killed off the other processors and locked up all the NMI interrupt
> handlers, thereby preventing any debugger at all from functioning
> other than a hardware ice, so it's a hell of a lot easier just to
> trigger a break when you detect the first instance of a hard lockup
> before the system is completely hosed.

Hmm, I am confused here.  So you are saying because we are in the nmi
handler you can not break into the system?  The nmi handler prints some
stuff to the screen, pokes the other cpus to print stuff to the screen and
then returns to a normal operation.  Unless you are saying the act of
sending NMI IPIs never completes (because a cpu is blocking IPI interrupts),
so the cpu hangs in nmi context and the debugger never has a chance to
'break' in and see what is going on?

Cheers,
Don