All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v5 1/3] Add BUG_XX() debugging options
@ 2016-02-02  2:33 Jeffrey Merkey
  2016-02-02  2:33 ` [PATCH v5 2/3] Add BUG_XX() debugging options Kconfig.debug Jeffrey Merkey
  2016-02-02  2:33 ` [PATCH v5 3/3] Add BUG_XX() debugging hard/soft lockup detection Jeffrey Merkey
  0 siblings, 2 replies; 14+ messages in thread
From: Jeffrey Merkey @ 2016-02-02  2:33 UTC (permalink / raw)
  To: linux-kernel; +Cc: hpa, jeffmerkey, mingo, tglx, x86

This patch series adds config options which can be set during compile to
direct the compiler to output a breakpoint instruction anywhere a BUG()
macro has been placed in the kernel to trigger the system to
enter a debugger if a bug is detected by the system.  Use of this
compile time option also allows conditional breakpoints to be set in the
kernel with these currently used macros.

This addition is extremely useful for debugging hard and soft lockups
real time and quickly from a console debugger, and other areas of the
kernel.

Signed-off-by: Jeffrey Merkey <jeffmerkey@gmail.com>
---
 arch/x86/include/asm/bug.h | 17 ++++++++++++++++-
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/bug.h b/arch/x86/include/asm/bug.h
index ba38ebb..df26c2b 100644
--- a/arch/x86/include/asm/bug.h
+++ b/arch/x86/include/asm/bug.h
@@ -11,6 +11,13 @@
 # define __BUG_C0	"2:\t.long 1b - 2b, %c0 - 2b\n"
 #endif
 
+#ifdef CONFIG_DEBUG_BUG
+#define BUG()							\
+do {								\
+	asm volatile("int3");					\
+	unreachable();						\
+} while (0)
+#else
 #define BUG()							\
 do {								\
 	asm volatile("1:\tud2\n"				\
@@ -23,7 +30,14 @@ do {								\
 		     "i" (sizeof(struct bug_entry)));		\
 	unreachable();						\
 } while (0)
-
+#endif
+#else
+#ifdef CONFIG_DEBUG_BUG
+#define BUG()							\
+do {								\
+	asm volatile("int3");					\
+	unreachable();						\
+} while (0)
 #else
 #define BUG()							\
 do {								\
@@ -31,6 +45,7 @@ do {								\
 	unreachable();						\
 } while (0)
 #endif
+#endif
 
 #include <asm-generic/bug.h>
 
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v5 2/3] Add BUG_XX() debugging options Kconfig.debug
  2016-02-02  2:33 [PATCH v5 1/3] Add BUG_XX() debugging options Jeffrey Merkey
@ 2016-02-02  2:33 ` Jeffrey Merkey
  2016-02-02  2:33 ` [PATCH v5 3/3] Add BUG_XX() debugging hard/soft lockup detection Jeffrey Merkey
  1 sibling, 0 replies; 14+ messages in thread
From: Jeffrey Merkey @ 2016-02-02  2:33 UTC (permalink / raw)
  To: linux-kernel
  Cc: akpm, aryabinin, dan.j.williams, dave, mingo, nikolay, paulmck,
	vladimir.murzin

This patch series adds config options which can be set during compile to
direct the compiler to output a breakpoint instruction anywhere a BUG()
macro has been placed in the kernel to trigger the system to
enter a debugger if a bug is detected by the system.  Use of this
compile time option also allows conditional breakpoints to be set in the
kernel with these currently used macros.

This addition is extremely useful for debugging hard and soft lockups
real time and quickly from a console debugger, and other areas of the
kernel.

Signed-off-by: Jeffrey Merkey <jeffmerkey@gmail.com>
---
 lib/Kconfig.debug | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 041f995..1f7437d 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -326,6 +326,17 @@ config SECTION_MISMATCH_WARN_ONLY
 # is preferred to always offer frame pointers as a config
 # option on the architecture (regardless of KERNEL_DEBUG):
 #
+config DEBUG_BUG
+	bool "Set Breakpoint for Kernel BUG() macros"
+	depends on DEBUG_KERNEL && !S390
+	help
+	  Say Y here to enable the kernel to emit a breakpoint instruction
+	  anywhere a BUG_XX() macro has been placed in the system kernel code
+	  intended to catch bugs.
+
+	  This option is used for triggering the OS to try enter a debugger
+	  if one is presently installed on the system.
+
 config ARCH_WANT_FRAME_POINTERS
 	bool
 	help
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v5 3/3] Add BUG_XX() debugging hard/soft lockup detection
  2016-02-02  2:33 [PATCH v5 1/3] Add BUG_XX() debugging options Jeffrey Merkey
  2016-02-02  2:33 ` [PATCH v5 2/3] Add BUG_XX() debugging options Kconfig.debug Jeffrey Merkey
@ 2016-02-02  2:33 ` Jeffrey Merkey
  2016-02-02 17:30   ` Don Zickus
  1 sibling, 1 reply; 14+ messages in thread
From: Jeffrey Merkey @ 2016-02-02  2:33 UTC (permalink / raw)
  To: linux-kernel
  Cc: akpm, atomlin, cmetcalf, dzickus, fweisbec, hidehiro.kawai.ez,
	mhocko, tj, uobergfe

This patch series adds config options which can be set during compile to
direct the compiler to output a breakpoint instruction anywhere a BUG()
macro has been placed in the kernel to trigger the system to
enter a debugger if a bug is detected by the system.  Use of this
compile time option also allows conditional breakpoints to be set in the
kernel with these currently used macros.

This addition is extremely useful for debugging hard and soft lockups
real time and quickly from a console debugger, and other areas of the
kernel.

Signed-off-by: Jeffrey Merkey <jeffmerkey@gmail.com>
---
 kernel/watchdog.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index b3ace6e..c504f7c 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -44,6 +44,12 @@
 #define NMI_WATCHDOG_ENABLED      (1 << NMI_WATCHDOG_ENABLED_BIT)
 #define SOFT_WATCHDOG_ENABLED     (1 << SOFT_WATCHDOG_ENABLED_BIT)
 
+#ifdef CONFIG_DEBUG_BUG
+#define WATCHDOG_DEBUG_LOCKUP  1
+#else
+#define WATCHDOG_DEBUG_LOCKUP  0
+#endif
+
 static DEFINE_MUTEX(watchdog_proc_mutex);
 
 #ifdef CONFIG_HARDLOCKUP_DETECTOR
@@ -358,6 +364,8 @@ static void watchdog_overflow_callback(struct perf_event *event,
 		else
 			dump_stack();
 
+		BUG_ON(WATCHDOG_DEBUG_LOCKUP);
+
 		/*
 		 * Perform all-CPU dump only once to avoid multiple hardlockups
 		 * generating interleaving traces
@@ -478,6 +486,8 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
 		else
 			dump_stack();
 
+		BUG_ON(WATCHDOG_DEBUG_LOCKUP);
+
 		if (softlockup_all_cpu_backtrace) {
 			/* Avoid generating two back traces for current
 			 * given that one is already made above
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH v5 3/3] Add BUG_XX() debugging hard/soft lockup detection
  2016-02-02  2:33 ` [PATCH v5 3/3] Add BUG_XX() debugging hard/soft lockup detection Jeffrey Merkey
@ 2016-02-02 17:30   ` Don Zickus
  2016-02-02 22:40     ` Jeffrey Merkey
  0 siblings, 1 reply; 14+ messages in thread
From: Don Zickus @ 2016-02-02 17:30 UTC (permalink / raw)
  To: Jeffrey Merkey
  Cc: linux-kernel, akpm, atomlin, cmetcalf, fweisbec,
	hidehiro.kawai.ez, mhocko, tj, uobergfe

On Mon, Feb 01, 2016 at 07:33:48PM -0700, Jeffrey Merkey wrote:
> This patch series adds config options which can be set during compile to
> direct the compiler to output a breakpoint instruction anywhere a BUG()
> macro has been placed in the kernel to trigger the system to
> enter a debugger if a bug is detected by the system.  Use of this
> compile time option also allows conditional breakpoints to be set in the
> kernel with these currently used macros.
> 
> This addition is extremely useful for debugging hard and soft lockups
> real time and quickly from a console debugger, and other areas of the
> kernel.

Please remember to add version history, so I can tell what changed.

> 
> Signed-off-by: Jeffrey Merkey <jeffmerkey@gmail.com>
> ---
>  kernel/watchdog.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/kernel/watchdog.c b/kernel/watchdog.c
> index b3ace6e..c504f7c 100644
> --- a/kernel/watchdog.c
> +++ b/kernel/watchdog.c
> @@ -44,6 +44,12 @@
>  #define NMI_WATCHDOG_ENABLED      (1 << NMI_WATCHDOG_ENABLED_BIT)
>  #define SOFT_WATCHDOG_ENABLED     (1 << SOFT_WATCHDOG_ENABLED_BIT)
>  
> +#ifdef CONFIG_DEBUG_BUG
> +#define WATCHDOG_DEBUG_LOCKUP  1
> +#else
> +#define WATCHDOG_DEBUG_LOCKUP  0
> +#endif
> +
>  static DEFINE_MUTEX(watchdog_proc_mutex);
>  
>  #ifdef CONFIG_HARDLOCKUP_DETECTOR
> @@ -358,6 +364,8 @@ static void watchdog_overflow_callback(struct perf_event *event,
>  		else
>  			dump_stack();
>  
> +		BUG_ON(WATCHDOG_DEBUG_LOCKUP);
> +

I am not sure I am a fan of this.  You are taking a known macro BUG_ON with
known expectations and perversely converting it into an 'asm'.  So now when
folks read the code they scratch their heads why we are dumping the stack
twice when in fact we are not. It seems misleading. :-/

I still don't understand why we can't use Ingo's or tglx's approach?  Your
changelog doesn't point out the problems there.

Cheers,
Don

>  		/*
>  		 * Perform all-CPU dump only once to avoid multiple hardlockups
>  		 * generating interleaving traces
> @@ -478,6 +486,8 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
>  		else
>  			dump_stack();
>  
> +		BUG_ON(WATCHDOG_DEBUG_LOCKUP);
> +
>  		if (softlockup_all_cpu_backtrace) {
>  			/* Avoid generating two back traces for current
>  			 * given that one is already made above
> -- 
> 1.8.3.1
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v5 3/3] Add BUG_XX() debugging hard/soft lockup detection
  2016-02-02 17:30   ` Don Zickus
@ 2016-02-02 22:40     ` Jeffrey Merkey
  2016-02-03  4:17       ` Jeffrey Merkey
                         ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Jeffrey Merkey @ 2016-02-02 22:40 UTC (permalink / raw)
  To: Don Zickus
  Cc: linux-kernel, akpm, atomlin, cmetcalf, fweisbec,
	hidehiro.kawai.ez, mhocko, tj, uobergfe

>
> Please remember to add version history, so I can tell what changed.
>

What command do I give to git when it creates the patch from git
format-patch that outputs what you are looking for or do I have to add
that manually.  The diff of files changed?

>
> I am not sure I am a fan of this.  You are taking a known macro BUG_ON with
> known expectations and perversely converting it into an 'asm'.  So now when
> folks read the code they scratch their heads why we are dumping the stack
> twice when in fact we are not. It seems misleading. :-/
>

1.  Does not dump the stack at all the way it is coded -- look again.
The current code dumps it only once.  Just executes an int3 and
returns instead of crashing.  If you called panic all the time instead
of conditionally in this code, this change would not be needed, since
panic is setup already to call debuggers.  It's the failure of the
current code to do that requires this change.  How about you call
panic when this condition ocurrs, then the debugger will get called.

2.  BUG() outputs an asm("ud2") and triggers an invalid instruction
and system crash.  All that was added is the ability to switch that
ud2 to an int3.    So what is more perverse here:

BUG() = ud2 -> invalid instruction -> trap  -> call crash code ->
debugger -> then hang
BUG() = int3 -> int3 trap -> enter debugger -> return - system can recover

Because:

BUG() = Debugger = int3
and
BUG() != ud2 (undefined instruction) = crash = non recoverable

int3 (0xCC) has always been understood to mean BUG().  int3
breakpoints are an integral part of Intel's architecture.  There is no
reason for not exploiting this capability of their processors to help
kernel developers use intel technology better.

> I still don't understand why we can't use Ingo's or tglx's approach?  Your
> changelog doesn't point out the problems there.
>

Because when you catch a bug in the hard lockup detector the system
just sits there hard hung and you are not able to get into a debugger
console since the system has crashed and the watchdog code has already
killed off the other processors and locked up all the NMI interrupt
handlers, thereby preventing any debugger at all from functioning
other than a hardware ice, so it's a hell of a lot easier just to
trigger a break when you detect the first instance of a hard lockup
before the system is completely hosed.

So, let's try to use Ingo's and tglx's approach.   For some reason
neither Ingo of tglx seem to understand that I am referring to a
normal user of a system that gets a hard hang that may or may not be
reproducable.   You are not able to even get into a debugger console
when the lockup occurs unless you have a breakpoint already set.    If
the current code always calls panic problem solved -- panic triggers
debugger entry if it detects one.

Jeff

> Cheers,
> Don
>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v5 3/3] Add BUG_XX() debugging hard/soft lockup detection
  2016-02-02 22:40     ` Jeffrey Merkey
@ 2016-02-03  4:17       ` Jeffrey Merkey
  2016-02-03  4:39         ` Jeffrey Merkey
  2016-02-03 15:45       ` Don Zickus
  2016-02-03 15:47       ` Don Zickus
  2 siblings, 1 reply; 14+ messages in thread
From: Jeffrey Merkey @ 2016-02-03  4:17 UTC (permalink / raw)
  To: Don Zickus
  Cc: linux-kernel, akpm, atomlin, cmetcalf, fweisbec,
	hidehiro.kawai.ez, mhocko, tj, uobergfe

> Because when you catch a bug in the hard lockup detector the system
> just sits there hard hung and you are not able to get into a debugger
> console since the system has crashed and the watchdog code has already
> killed off the other processors and locked up all the NMI interrupt
> handlers, thereby preventing any debugger at all from functioning
> other than a hardware ice, so it's a hell of a lot easier just to
> trigger a break when you detect the first instance of a hard lockup
> before the system is completely hosed.
>

So this is why Ingo and tglx's suggestion doesn't work.  Unless you
can set a breakpoint in the detector coede, once the lockup occurs
about 50% of the time (when the IF flag is not set and interrupts are
disabled), you can't get into a debugger because the system is hosed.

The way the current hard lockup detector works is a lot like the death
star self-destruct system for linux -- it detects one, tries to IPI
the other processors to dump their stacks, then somewhere down in the
OS all of it locks up -- once and a while I can get it too panic.  A
great bug to test your detector with is the one in timekeeper.c tglx
and I worked on.  Good luck getting into any debugger when it fires
off.  I like the fact this code does not call panic and is somewhat
dynamic allowing recovery of the system, but it takes a healthy system
with a single bug, burns it to the ground, locks up all the
processors, and prevents the debugger from being entered unless a
breakpoint has been set.

Perhaps this helps you understand.

Jeff

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v5 3/3] Add BUG_XX() debugging hard/soft lockup detection
  2016-02-03  4:17       ` Jeffrey Merkey
@ 2016-02-03  4:39         ` Jeffrey Merkey
  0 siblings, 0 replies; 14+ messages in thread
From: Jeffrey Merkey @ 2016-02-03  4:39 UTC (permalink / raw)
  To: Don Zickus
  Cc: linux-kernel, akpm, atomlin, cmetcalf, fweisbec,
	hidehiro.kawai.ez, mhocko, tj, uobergfe

On 2/2/16, Jeffrey Merkey <jeffmerkey@gmail.com> wrote:
>> Because when you catch a bug in the hard lockup detector the system
>> just sits there hard hung and you are not able to get into a debugger
>> console since the system has crashed and the watchdog code has already
>> killed off the other processors and locked up all the NMI interrupt
>> handlers, thereby preventing any debugger at all from functioning
>> other than a hardware ice, so it's a hell of a lot easier just to
>> trigger a break when you detect the first instance of a hard lockup
>> before the system is completely hosed.
>>
>
> So this is why Ingo and tglx's suggestion doesn't work.  Unless you
> can set a breakpoint in the detector coede, once the lockup occurs
> about 50% of the time (when the IF flag is not set and interrupts are
> disabled), you can't get into a debugger because the system is hosed.
>
> The way the current hard lockup detector works is a lot like the death
> star self-destruct system for linux -- it detects one, tries to IPI
> the other processors to dump their stacks, then somewhere down in the
> OS all of it locks up -- once and a while I can get it too panic.  A
> great bug to test your detector with is the one in timekeeper.c tglx
> and I worked on.  Good luck getting into any debugger when it fires
> off.  I like the fact this code does not call panic and is somewhat
> dynamic allowing recovery of the system, but it takes a healthy system
> with a single bug, burns it to the ground, locks up all the
> processors, and prevents the debugger from being entered unless a
> breakpoint has been set.
>
> Perhaps this helps you understand.
>
> Jeff
>

And we could just call notify_die here instead and pass a faux
debugger exception.  That actually is clean and would work too.  any
handlers out there will behave as though its an int3 instruction.
Hmmm.  That's an easy patch and I could test it quickly.

Jeff

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v5 3/3] Add BUG_XX() debugging hard/soft lockup detection
  2016-02-02 22:40     ` Jeffrey Merkey
  2016-02-03  4:17       ` Jeffrey Merkey
@ 2016-02-03 15:45       ` Don Zickus
  2016-02-03 17:23         ` Jeffrey Merkey
  2016-02-03 15:47       ` Don Zickus
  2 siblings, 1 reply; 14+ messages in thread
From: Don Zickus @ 2016-02-03 15:45 UTC (permalink / raw)
  To: Jeffrey Merkey
  Cc: linux-kernel, akpm, atomlin, cmetcalf, fweisbec,
	hidehiro.kawai.ez, mhocko, tj, uobergfe

On Tue, Feb 02, 2016 at 03:40:34PM -0700, Jeffrey Merkey wrote:
> >
> > Please remember to add version history, so I can tell what changed.
> >
> 
> What command do I give to git when it creates the patch from git
> format-patch that outputs what you are looking for or do I have to add
> that manually.  The diff of files changed?
> 
> >
> > I am not sure I am a fan of this.  You are taking a known macro BUG_ON with
> > known expectations and perversely converting it into an 'asm'.  So now when
> > folks read the code they scratch their heads why we are dumping the stack
> > twice when in fact we are not. It seems misleading. :-/
> >
> 
> 1.  Does not dump the stack at all the way it is coded -- look again.
> The current code dumps it only once.  Just executes an int3 and
> returns instead of crashing.  If you called panic all the time instead
> of conditionally in this code, this change would not be needed, since
> panic is setup already to call debuggers.  It's the failure of the
> current code to do that requires this change.  How about you call
> panic when this condition ocurrs, then the debugger will get called.
> 
> 2.  BUG() outputs an asm("ud2") and triggers an invalid instruction
> and system crash.  All that was added is the ability to switch that
> ud2 to an int3.    So what is more perverse here:
> 
> BUG() = ud2 -> invalid instruction -> trap  -> call crash code ->
> debugger -> then hang
> BUG() = int3 -> int3 trap -> enter debugger -> return - system can recover

Yes, I understand that.  I was referring to reading the code.  With your
patch I read the code and see BUG() and think the machine will panic right
there, when it fact it does not because of the tricks you play.

> 
> Because:
> 
> BUG() = Debugger = int3
> and
> BUG() != ud2 (undefined instruction) = crash = non recoverable
> 
> int3 (0xCC) has always been understood to mean BUG().  int3
> breakpoints are an integral part of Intel's architecture.  There is no
> reason for not exploiting this capability of their processors to help
> kernel developers use intel technology better.
> 
> > I still don't understand why we can't use Ingo's or tglx's approach?  Your
> > changelog doesn't point out the problems there.
> >
> 
> Because when you catch a bug in the hard lockup detector the system
> just sits there hard hung and you are not able to get into a debugger
> console since the system has crashed and the watchdog code has already
> killed off the other processors and locked up all the NMI interrupt
> handlers, thereby preventing any debugger at all from functioning
> other than a hardware ice, so it's a hell of a lot easier just to
> trigger a break when you detect the first instance of a hard lockup
> before the system is completely hosed.

Hmm, I am confused here.  So you are saying because we are in the nmi
handler you can not break into the system?  The nmi handler prints some
stuff to the screen, pokes the other cpus to print stuff to the screen and
then returns to a normal operation.  Unless you are saying the act of
sending NMI IPIs never completes (because a cpu is blocking IPI interrupts),
so the cpu hangs in nmi context and the debugger never has a chance to
'break' in and see what is going on?

Cheers,
Don

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v5 3/3] Add BUG_XX() debugging hard/soft lockup detection
  2016-02-02 22:40     ` Jeffrey Merkey
  2016-02-03  4:17       ` Jeffrey Merkey
  2016-02-03 15:45       ` Don Zickus
@ 2016-02-03 15:47       ` Don Zickus
  2016-02-03 17:26         ` Jeffrey Merkey
  2 siblings, 1 reply; 14+ messages in thread
From: Don Zickus @ 2016-02-03 15:47 UTC (permalink / raw)
  To: Jeffrey Merkey
  Cc: linux-kernel, akpm, atomlin, cmetcalf, fweisbec,
	hidehiro.kawai.ez, mhocko, tj, uobergfe

On Tue, Feb 02, 2016 at 03:40:34PM -0700, Jeffrey Merkey wrote:
> >
> > Please remember to add version history, so I can tell what changed.
> >
> 
> What command do I give to git when it creates the patch from git
> format-patch that outputs what you are looking for or do I have to add
> that manually.  The diff of files changed?

Sorry I forgot to respond to this part.  This is usually just a manual text
that folks add to explain what changed between different versions of the
patch.  It is difficult to re-review a patch from scratch all the time.  It
is nice to have a little blurb stating what changed to help focus my review.

Now this patch is small enough, it doesn't matter, but it is a nice habit to
get into.

Cheers,
Don

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v5 3/3] Add BUG_XX() debugging hard/soft lockup detection
  2016-02-03 15:45       ` Don Zickus
@ 2016-02-03 17:23         ` Jeffrey Merkey
  2016-02-03 20:14           ` Don Zickus
  0 siblings, 1 reply; 14+ messages in thread
From: Jeffrey Merkey @ 2016-02-03 17:23 UTC (permalink / raw)
  To: Don Zickus
  Cc: linux-kernel, akpm, atomlin, cmetcalf, fweisbec,
	hidehiro.kawai.ez, mhocko, tj, uobergfe

> Hmm, I am confused here.  So you are saying because we are in the nmi
> handler you can not break into the system?  The nmi handler prints some
> stuff to the screen, pokes the other cpus to print stuff to the screen and
> then returns to a normal operation.  Unless you are saying the act of
> sending NMI IPIs never completes (because a cpu is blocking IPI
> interrupts),
> so the cpu hangs in nmi context and the debugger never has a chance to
> 'break' in and see what is going on?
>
> Cheers,
> Don
>

Yes.  the nmi handlers never complete for the bug I worked on with
tglx, probably because an nmi handler is calling timekeeper.c
somewhere.  Some of these lockup bugs may be calling code from the nmi
handlers that cause the lockup condition in the first place in some
cases, so it will never reach a call to panic.  Looking over this code
it's damn hard to find a good way to do this that works across all the
arches without adding another macro to bug.h (BREAK_ON maybe), so I
just used one that's already there.  I'll go back and rethink this
some more.  It could just be as simple as calling panic from the first
detection -- that works.

Jeff

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v5 3/3] Add BUG_XX() debugging hard/soft lockup detection
  2016-02-03 15:47       ` Don Zickus
@ 2016-02-03 17:26         ` Jeffrey Merkey
  0 siblings, 0 replies; 14+ messages in thread
From: Jeffrey Merkey @ 2016-02-03 17:26 UTC (permalink / raw)
  To: Don Zickus
  Cc: linux-kernel, akpm, atomlin, cmetcalf, fweisbec,
	hidehiro.kawai.ez, mhocko, tj, uobergfe

On 2/3/16, Don Zickus <dzickus@redhat.com> wrote:
> On Tue, Feb 02, 2016 at 03:40:34PM -0700, Jeffrey Merkey wrote:
>> >
>> > Please remember to add version history, so I can tell what changed.
>> >
>>
>> What command do I give to git when it creates the patch from git
>> format-patch that outputs what you are looking for or do I have to add
>> that manually.  The diff of files changed?
>
> Sorry I forgot to respond to this part.  This is usually just a manual text
> that folks add to explain what changed between different versions of the
> patch.  It is difficult to re-review a patch from scratch all the time.  It
> is nice to have a little blurb stating what changed to help focus my
> review.
>
> Now this patch is small enough, it doesn't matter, but it is a nice habit
> to
> get into.
>
> Cheers,
> Don
>

I looked over some other patches on the list and see what you are
refering to.  It will be included in the future patches.

Thank you for explaining this to me.

Jeff

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v5 3/3] Add BUG_XX() debugging hard/soft lockup detection
  2016-02-03 17:23         ` Jeffrey Merkey
@ 2016-02-03 20:14           ` Don Zickus
  2016-02-03 20:18             ` Jeffrey Merkey
  0 siblings, 1 reply; 14+ messages in thread
From: Don Zickus @ 2016-02-03 20:14 UTC (permalink / raw)
  To: Jeffrey Merkey
  Cc: linux-kernel, akpm, atomlin, cmetcalf, fweisbec,
	hidehiro.kawai.ez, mhocko, tj, uobergfe

On Wed, Feb 03, 2016 at 10:23:42AM -0700, Jeffrey Merkey wrote:
> > Hmm, I am confused here.  So you are saying because we are in the nmi
> > handler you can not break into the system?  The nmi handler prints some
> > stuff to the screen, pokes the other cpus to print stuff to the screen and
> > then returns to a normal operation.  Unless you are saying the act of
> > sending NMI IPIs never completes (because a cpu is blocking IPI
> > interrupts),
> > so the cpu hangs in nmi context and the debugger never has a chance to
> > 'break' in and see what is going on?
> >
> > Cheers,
> > Don
> >
> 
> Yes.  the nmi handlers never complete for the bug I worked on with
> tglx, probably because an nmi handler is calling timekeeper.c
> somewhere.  Some of these lockup bugs may be calling code from the nmi
> handlers that cause the lockup condition in the first place in some
> cases, so it will never reach a call to panic.  Looking over this code
> it's damn hard to find a good way to do this that works across all the
> arches without adding another macro to bug.h (BREAK_ON maybe), so I
> just used one that's already there.  I'll go back and rethink this
> some more.  It could just be as simple as calling panic from the first
> detection -- that works.

So, if you disable 'sysctl_hardlockup_all_cpu_backtrace' and enable
'hardlockup_panic', you should be able to achieve what you want, no?

But you mentioned you wanted to recover?  Hence avoiding the panic?

Cheers,
Don

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v5 3/3] Add BUG_XX() debugging hard/soft lockup detection
  2016-02-03 20:14           ` Don Zickus
@ 2016-02-03 20:18             ` Jeffrey Merkey
  2016-02-04  2:48               ` Jeffrey Merkey
  0 siblings, 1 reply; 14+ messages in thread
From: Jeffrey Merkey @ 2016-02-03 20:18 UTC (permalink / raw)
  To: Don Zickus
  Cc: linux-kernel, akpm, atomlin, cmetcalf, fweisbec,
	hidehiro.kawai.ez, mhocko, tj, uobergfe

On 2/3/16, Don Zickus <dzickus@redhat.com> wrote:
> On Wed, Feb 03, 2016 at 10:23:42AM -0700, Jeffrey Merkey wrote:
>> > Hmm, I am confused here.  So you are saying because we are in the nmi
>> > handler you can not break into the system?  The nmi handler prints some
>> > stuff to the screen, pokes the other cpus to print stuff to the screen
>> > and
>> > then returns to a normal operation.  Unless you are saying the act of
>> > sending NMI IPIs never completes (because a cpu is blocking IPI
>> > interrupts),
>> > so the cpu hangs in nmi context and the debugger never has a chance to
>> > 'break' in and see what is going on?
>> >
>> > Cheers,
>> > Don
>> >
>>
>> Yes.  the nmi handlers never complete for the bug I worked on with
>> tglx, probably because an nmi handler is calling timekeeper.c
>> somewhere.  Some of these lockup bugs may be calling code from the nmi
>> handlers that cause the lockup condition in the first place in some
>> cases, so it will never reach a call to panic.  Looking over this code
>> it's damn hard to find a good way to do this that works across all the
>> arches without adding another macro to bug.h (BREAK_ON maybe), so I
>> just used one that's already there.  I'll go back and rethink this
>> some more.  It could just be as simple as calling panic from the first
>> detection -- that works.
>
> So, if you disable 'sysctl_hardlockup_all_cpu_backtrace' and enable
> 'hardlockup_panic', you should be able to achieve what you want, no?
>
> But you mentioned you wanted to recover?  Hence avoiding the panic?
>
> Cheers,
> Don
>

Yes.  Avoiding the panic is what I wanted to do.  These flags work at
catching the bug as you suggest, however, it does not allow you to
step out of the hard lockup detector into the offending code, which is
what a breakpoint does for you here.  We almost need an explicit call
to enter a debugger here.  I need to go off and rethink this some
more.

:-)

Jeff

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v5 3/3] Add BUG_XX() debugging hard/soft lockup detection
  2016-02-03 20:18             ` Jeffrey Merkey
@ 2016-02-04  2:48               ` Jeffrey Merkey
  0 siblings, 0 replies; 14+ messages in thread
From: Jeffrey Merkey @ 2016-02-04  2:48 UTC (permalink / raw)
  To: Don Zickus
  Cc: linux-kernel, akpm, atomlin, cmetcalf, fweisbec,
	hidehiro.kawai.ez, mhocko, tj, uobergfe

>> So, if you disable 'sysctl_hardlockup_all_cpu_backtrace' and enable
>> 'hardlockup_panic', you should be able to achieve what you want, no?
>>
>> But you mentioned you wanted to recover?  Hence avoiding the panic?
>>
>> Cheers,
>> Don
>>
>
> Yes.  Avoiding the panic is what I wanted to do.  These flags work at
> catching the bug as you suggest, however, it does not allow you to
> step out of the hard lockup detector into the offending code, which is
> what a breakpoint does for you here.  We almost need an explicit call
> to enter a debugger here.  I need to go off and rethink this some
> more.
>
> :-)
>
> Jeff
>

Hi Don,

I've decided to maintain this capability in the MDB debugger patch
series I distribute from the MDB Debugger tree on github.   All the
wonderful feedback and discussion has really helped me understand
what's required here and exposed some issues in native Linux for some
folks to think about.  I am not certain this type of change belongs in
native Linux since linux has no in-kernel debugger that provides
disassembly -- it has a device driver that talks to a remote user
space application called GDB and a headless debug dump tool called
kdb.

So unless there is a native in-kernel debugger in Linux such a change
is sort of pointless and from what I have gathered, the x86 folks are
not interested in fixing the entry code and other areas to allow a
full blown kernel debugger unless I just go and patch that myself in
my tree.

This has been very helpful in helping me to understand what types of
changes should be submitted here and are helpful.

Jeff

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2016-02-04  2:48 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-02-02  2:33 [PATCH v5 1/3] Add BUG_XX() debugging options Jeffrey Merkey
2016-02-02  2:33 ` [PATCH v5 2/3] Add BUG_XX() debugging options Kconfig.debug Jeffrey Merkey
2016-02-02  2:33 ` [PATCH v5 3/3] Add BUG_XX() debugging hard/soft lockup detection Jeffrey Merkey
2016-02-02 17:30   ` Don Zickus
2016-02-02 22:40     ` Jeffrey Merkey
2016-02-03  4:17       ` Jeffrey Merkey
2016-02-03  4:39         ` Jeffrey Merkey
2016-02-03 15:45       ` Don Zickus
2016-02-03 17:23         ` Jeffrey Merkey
2016-02-03 20:14           ` Don Zickus
2016-02-03 20:18             ` Jeffrey Merkey
2016-02-04  2:48               ` Jeffrey Merkey
2016-02-03 15:47       ` Don Zickus
2016-02-03 17:26         ` Jeffrey Merkey

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.