All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/2] random debug patches
@ 2016-03-18 15:28 Peter Zijlstra
  2016-03-18 15:28 ` [PATCH 1/2] watchdog: Fix output Peter Zijlstra
  2016-03-18 15:28 ` [PATCH 2/2] trace: Record and show NMI state Peter Zijlstra
  0 siblings, 2 replies; 11+ messages in thread
From: Peter Zijlstra @ 2016-03-18 15:28 UTC (permalink / raw)
  To: linux-kernel; +Cc: rostedt, dzickus, mingo, tglx, peterz

The only relation between these patches is that they are the result of me
trying to hunt down fail. Other than that they're unrelated and can be
independently applied.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH 1/2] watchdog: Fix output
  2016-03-18 15:28 [PATCH 0/2] random debug patches Peter Zijlstra
@ 2016-03-18 15:28 ` Peter Zijlstra
  2016-03-18 16:37   ` Don Zickus
  2016-03-18 15:28 ` [PATCH 2/2] trace: Record and show NMI state Peter Zijlstra
  1 sibling, 1 reply; 11+ messages in thread
From: Peter Zijlstra @ 2016-03-18 15:28 UTC (permalink / raw)
  To: linux-kernel; +Cc: rostedt, dzickus, mingo, tglx, peterz

[-- Attachment #1: peterz-watchdog-fix-output.patch --]
[-- Type: text/plain, Size: 1309 bytes --]

The pr_crap() functions generate idiotic output; use printk().

Broken output:

[ 3538.718135] NMI watchdog: BUG: soft lockup - CPU#2 stuck for 22s!  [perf_fuzzer:2646]
[ 3383.233583] NMI watchdog: Watchdog detected hard LOCKUP on cpu 27

Fixed output:

[14180.328194] BUG: soft lockup - CPU#2 stuck for 23s!  [perf_fuzzer:13650]
[ 1064.914925] BUG: NMI Watchdog detected hard LOCKUP on cpu 11

Cc: Don Zickus <dzickus@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/watchdog.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -350,7 +350,7 @@ static void watchdog_overflow_callback(s
 		if (__this_cpu_read(hard_watchdog_warn) == true)
 			return;
 
-		pr_emerg("Watchdog detected hard LOCKUP on cpu %d", this_cpu);
+		printk(KERN_EMERG "BUG: NMI detected hard LOCKUP on cpu %d", this_cpu);
 		print_modules();
 		print_irqtrace_events(current);
 		if (regs)
@@ -467,7 +467,7 @@ static enum hrtimer_restart watchdog_tim
 			}
 		}
 
-		pr_emerg("BUG: soft lockup - CPU#%d stuck for %us! [%s:%d]\n",
+		printk(KERN_EMERG "BUG: soft lockup - CPU#%d stuck for %us! [%s:%d]\n",
 			smp_processor_id(), duration,
 			current->comm, task_pid_nr(current));
 		__this_cpu_write(softlockup_task_ptr_saved, current);

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH 2/2] trace: Record and show NMI state
  2016-03-18 15:28 [PATCH 0/2] random debug patches Peter Zijlstra
  2016-03-18 15:28 ` [PATCH 1/2] watchdog: Fix output Peter Zijlstra
@ 2016-03-18 15:28 ` Peter Zijlstra
  1 sibling, 0 replies; 11+ messages in thread
From: Peter Zijlstra @ 2016-03-18 15:28 UTC (permalink / raw)
  To: linux-kernel; +Cc: rostedt, dzickus, mingo, tglx, peterz

[-- Attachment #1: trace-nmi.patch --]
[-- Type: text/plain, Size: 1831 bytes --]

The latency tracer format has a nice column to indicate IRQ state, but
this is not able to tell us about NMI state.

When tracing perf interrupt handlers (which often run in NMI context)
it is very useful to see how the events nest.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/trace/trace.c        |    1 +
 kernel/trace/trace.h        |    1 +
 kernel/trace/trace_output.c |   10 +++++++---
 3 files changed, 9 insertions(+), 3 deletions(-)

--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -1662,6 +1662,7 @@ tracing_generic_entry_update(struct trac
 #else
 		TRACE_FLAG_IRQS_NOSUPPORT |
 #endif
+		((pc & NMI_MASK    ) ? TRACE_FLAG_NMI     : 0) |
 		((pc & HARDIRQ_MASK) ? TRACE_FLAG_HARDIRQ : 0) |
 		((pc & SOFTIRQ_MASK) ? TRACE_FLAG_SOFTIRQ : 0) |
 		(tif_need_resched() ? TRACE_FLAG_NEED_RESCHED : 0) |
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -125,6 +125,7 @@ enum trace_flag_type {
 	TRACE_FLAG_HARDIRQ		= 0x08,
 	TRACE_FLAG_SOFTIRQ		= 0x10,
 	TRACE_FLAG_PREEMPT_RESCHED	= 0x20,
+	TRACE_FLAG_NMI			= 0x40,
 };
 
 #define TRACE_BUF_SIZE		1024
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -389,7 +389,9 @@ int trace_print_lat_fmt(struct trace_seq
 	char irqs_off;
 	int hardirq;
 	int softirq;
+	int nmi;
 
+	nmi = entry->flags & TRACE_FLAG_NMI;
 	hardirq = entry->flags & TRACE_FLAG_HARDIRQ;
 	softirq = entry->flags & TRACE_FLAG_SOFTIRQ;
 
@@ -415,10 +417,12 @@ int trace_print_lat_fmt(struct trace_seq
 	}
 
 	hardsoft_irq =
+		(nmi && hardirq)     ? 'Z' :
+		nmi                  ? 'z' :
 		(hardirq && softirq) ? 'H' :
-		hardirq ? 'h' :
-		softirq ? 's' :
-		'.';
+		hardirq              ? 'h' :
+		softirq              ? 's' :
+		                       '.' ;
 
 	trace_seq_printf(s, "%c%c%c",
 			 irqs_off, need_resched, hardsoft_irq);

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 1/2] watchdog: Fix output
  2016-03-18 15:28 ` [PATCH 1/2] watchdog: Fix output Peter Zijlstra
@ 2016-03-18 16:37   ` Don Zickus
  2016-03-18 16:44     ` Peter Zijlstra
  2016-03-18 17:49     ` Peter Zijlstra
  0 siblings, 2 replies; 11+ messages in thread
From: Don Zickus @ 2016-03-18 16:37 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, rostedt, mingo, tglx

On Fri, Mar 18, 2016 at 04:28:03PM +0100, Peter Zijlstra wrote:
> The pr_crap() functions generate idiotic output; use printk().
> 
> Broken output:
> 
> [ 3538.718135] NMI watchdog: BUG: soft lockup - CPU#2 stuck for 22s!  [perf_fuzzer:2646]
> [ 3383.233583] NMI watchdog: Watchdog detected hard LOCKUP on cpu 27
> 
> Fixed output:
> 
> [14180.328194] BUG: soft lockup - CPU#2 stuck for 23s!  [perf_fuzzer:13650]
> [ 1064.914925] BUG: NMI Watchdog detected hard LOCKUP on cpu 11

Would something like this be a better patch?

Cheers,
Don



diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index b3ace6e..e3302a0 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -9,7 +9,7 @@
  * to those contributors as well.
  */
 
-#define pr_fmt(fmt) "NMI watchdog: " fmt
+#define pr_fmt(fmt) "Lockup detector: " fmt
 
 #include <linux/mm.h>
 #include <linux/cpu.h>
@@ -350,7 +350,7 @@ static void watchdog_overflow_callback(struct perf_event *event,
 		if (__this_cpu_read(hard_watchdog_warn) == true)
 			return;
 
-		pr_emerg("Watchdog detected hard LOCKUP on cpu %d", this_cpu);
+		pr_emerg("Detected hard LOCKUP on cpu %d", this_cpu);
 		print_modules();
 		print_irqtrace_events(current);
 		if (regs)
@@ -467,7 +467,7 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
 			}
 		}
 
-		pr_emerg("BUG: soft lockup - CPU#%d stuck for %us! [%s:%d]\n",
+		pr_emerg("Detected soft lockup - CPU#%d stuck for %us! [%s:%d]\n",
 			smp_processor_id(), duration,
 			current->comm, task_pid_nr(current));
 		__this_cpu_write(softlockup_task_ptr_saved, current);
> 
> Cc: Don Zickus <dzickus@redhat.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  kernel/watchdog.c |    4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> --- a/kernel/watchdog.c
> +++ b/kernel/watchdog.c
> @@ -350,7 +350,7 @@ static void watchdog_overflow_callback(s
>  		if (__this_cpu_read(hard_watchdog_warn) == true)
>  			return;
>  
> -		pr_emerg("Watchdog detected hard LOCKUP on cpu %d", this_cpu);
> +		printk(KERN_EMERG "BUG: NMI detected hard LOCKUP on cpu %d", this_cpu);
>  		print_modules();
>  		print_irqtrace_events(current);
>  		if (regs)
> @@ -467,7 +467,7 @@ static enum hrtimer_restart watchdog_tim
>  			}
>  		}
>  
> -		pr_emerg("BUG: soft lockup - CPU#%d stuck for %us! [%s:%d]\n",
> +		printk(KERN_EMERG "BUG: soft lockup - CPU#%d stuck for %us! [%s:%d]\n",
>  			smp_processor_id(), duration,
>  			current->comm, task_pid_nr(current));
>  		__this_cpu_write(softlockup_task_ptr_saved, current);
> 
> 

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH 1/2] watchdog: Fix output
  2016-03-18 16:37   ` Don Zickus
@ 2016-03-18 16:44     ` Peter Zijlstra
  2016-03-18 16:48       ` Peter Zijlstra
  2016-03-18 17:49     ` Peter Zijlstra
  1 sibling, 1 reply; 11+ messages in thread
From: Peter Zijlstra @ 2016-03-18 16:44 UTC (permalink / raw)
  To: Don Zickus; +Cc: linux-kernel, rostedt, mingo, tglx

On Fri, Mar 18, 2016 at 12:37:48PM -0400, Don Zickus wrote:
> On Fri, Mar 18, 2016 at 04:28:03PM +0100, Peter Zijlstra wrote:
> > The pr_crap() functions generate idiotic output; use printk().
> > 
> > Broken output:
> > 
> > [ 3538.718135] NMI watchdog: BUG: soft lockup - CPU#2 stuck for 22s!  [perf_fuzzer:2646]
> > [ 3383.233583] NMI watchdog: Watchdog detected hard LOCKUP on cpu 27
> > 
> > Fixed output:
> > 
> > [14180.328194] BUG: soft lockup - CPU#2 stuck for 23s!  [perf_fuzzer:13650]
> > [ 1064.914925] BUG: NMI Watchdog detected hard LOCKUP on cpu 11
> 
> Would something like this be a better patch?

> -#define pr_fmt(fmt) "NMI watchdog: " fmt
> +#define pr_fmt(fmt) "Lockup detector: " fmt

> -		pr_emerg("Watchdog detected hard LOCKUP on cpu %d", this_cpu);
> +		pr_emerg("Detected hard LOCKUP on cpu %d", this_cpu);

That still results in something like:

	Lockup detector: Detected hard LOCKUP on cpu 22

Which is still weird, doesn't mention NMI nor BUG (and when the NMI
triggers it really rather frequently is a dead system).

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 1/2] watchdog: Fix output
  2016-03-18 16:44     ` Peter Zijlstra
@ 2016-03-18 16:48       ` Peter Zijlstra
  2016-03-18 17:42         ` Joe Perches
  2016-03-18 18:12         ` Don Zickus
  0 siblings, 2 replies; 11+ messages in thread
From: Peter Zijlstra @ 2016-03-18 16:48 UTC (permalink / raw)
  To: Don Zickus; +Cc: linux-kernel, rostedt, mingo, tglx

On Fri, Mar 18, 2016 at 05:44:41PM +0100, Peter Zijlstra wrote:
> On Fri, Mar 18, 2016 at 12:37:48PM -0400, Don Zickus wrote:
> > Would something like this be a better patch?
> 
> > -#define pr_fmt(fmt) "NMI watchdog: " fmt
> > +#define pr_fmt(fmt) "Lockup detector: " fmt

You might want to audit all the other pr_ nonsense in that file; a lot
of the msgs don't particularly make sense when combined with the prefix.

Some are for the NMI thing, others for the soft lockup one.

I would just chuck pr_crap() altogether and use sensible text.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 1/2] watchdog: Fix output
  2016-03-18 16:48       ` Peter Zijlstra
@ 2016-03-18 17:42         ` Joe Perches
  2016-03-18 18:12         ` Don Zickus
  1 sibling, 0 replies; 11+ messages in thread
From: Joe Perches @ 2016-03-18 17:42 UTC (permalink / raw)
  To: Peter Zijlstra, Don Zickus; +Cc: linux-kernel, rostedt, mingo, tglx

On Fri, 2016-03-18 at 17:48 +0100, Peter Zijlstra wrote:
> On Fri, Mar 18, 2016 at 05:44:41PM +0100, Peter Zijlstra wrote:
> > 
> > On Fri, Mar 18, 2016 at 12:37:48PM -0400, Don Zickus wrote:
> > > 
> > > Would something like this be a better patch?
> > > 
> > > -#define pr_fmt(fmt) "NMI watchdog: " fmt
> > > +#define pr_fmt(fmt) "Lockup detector: " fmt
> You might want to audit all the other pr_ nonsense in that file; a
> lot
> of the msgs don't particularly make sense when combined with the
> prefix.
> 
> Some are for the NMI thing, others for the soft lockup one.
> 
> I would just chuck pr_crap() altogether and use sensible text.

or change to #define pr_fmt(fmt) fmt

and output whatever unprefixed content you want.

btw: the first printk/pr_emerg should add a newline


	pr_emerg("Watchdog detected hard LOCKUP on cpu %d", this_cpu);
sb:
	pr_emerg("Watchdog detected hard LOCKUP on cpu %d\n", this_cpu);

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 1/2] watchdog: Fix output
  2016-03-18 16:37   ` Don Zickus
  2016-03-18 16:44     ` Peter Zijlstra
@ 2016-03-18 17:49     ` Peter Zijlstra
  2016-03-19  0:11       ` Jiri Kosina
  1 sibling, 1 reply; 11+ messages in thread
From: Peter Zijlstra @ 2016-03-18 17:49 UTC (permalink / raw)
  To: Don Zickus; +Cc: linux-kernel, rostedt, mingo, tglx, jkosina

On Fri, Mar 18, 2016 at 12:37:48PM -0400, Don Zickus wrote:
> +++ b/kernel/watchdog.c
> @@ -9,7 +9,7 @@
>   * to those contributors as well.
>   */
>  
> -#define pr_fmt(fmt) "NMI watchdog: " fmt
> +#define pr_fmt(fmt) "Lockup detector: " fmt
>  
>  #include <linux/mm.h>
>  #include <linux/cpu.h>
> @@ -350,7 +350,7 @@ static void watchdog_overflow_callback(struct perf_event *event,
>  		if (__this_cpu_read(hard_watchdog_warn) == true)
>  			return;
>  
> -		pr_emerg("Watchdog detected hard LOCKUP on cpu %d", this_cpu);
> +		pr_emerg("Detected hard LOCKUP on cpu %d", this_cpu);
>  		print_modules();
>  		print_irqtrace_events(current);
>  		if (regs)

It was Jiri who made this mess by replacing WARN(), which has a very
distinct format, with this custom stuff.

I think we should go back to the WARN() thing.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 1/2] watchdog: Fix output
  2016-03-18 16:48       ` Peter Zijlstra
  2016-03-18 17:42         ` Joe Perches
@ 2016-03-18 18:12         ` Don Zickus
  1 sibling, 0 replies; 11+ messages in thread
From: Don Zickus @ 2016-03-18 18:12 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, rostedt, mingo, tglx

On Fri, Mar 18, 2016 at 05:48:58PM +0100, Peter Zijlstra wrote:
> On Fri, Mar 18, 2016 at 05:44:41PM +0100, Peter Zijlstra wrote:
> > On Fri, Mar 18, 2016 at 12:37:48PM -0400, Don Zickus wrote:
> > > Would something like this be a better patch?
> > 
> > > -#define pr_fmt(fmt) "NMI watchdog: " fmt
> > > +#define pr_fmt(fmt) "Lockup detector: " fmt
> 
> You might want to audit all the other pr_ nonsense in that file; a lot
> of the msgs don't particularly make sense when combined with the prefix.
> 
> Some are for the NMI thing, others for the soft lockup one.
> 
> I would just chuck pr_crap() altogether and use sensible text.

Hehe.  I'll look into that next week.  :-)

Cheers,
Don

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 1/2] watchdog: Fix output
  2016-03-18 17:49     ` Peter Zijlstra
@ 2016-03-19  0:11       ` Jiri Kosina
  2016-03-19  0:39         ` Jiri Kosina
  0 siblings, 1 reply; 11+ messages in thread
From: Jiri Kosina @ 2016-03-19  0:11 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Don Zickus, linux-kernel, rostedt, mingo, tglx

On Fri, 18 Mar 2016, Peter Zijlstra wrote:

> > -#define pr_fmt(fmt) "NMI watchdog: " fmt
> > +#define pr_fmt(fmt) "Lockup detector: " fmt
> >  
> >  #include <linux/mm.h>
> >  #include <linux/cpu.h>
> > @@ -350,7 +350,7 @@ static void watchdog_overflow_callback(struct perf_event *event,
> >  		if (__this_cpu_read(hard_watchdog_warn) == true)
> >  			return;
> >  
> > -		pr_emerg("Watchdog detected hard LOCKUP on cpu %d", this_cpu);
> > +		pr_emerg("Detected hard LOCKUP on cpu %d", this_cpu);
> >  		print_modules();
> >  		print_irqtrace_events(current);
> >  		if (regs)
> 
> It was Jiri who made this mess by replacing WARN(), which has a very
> distinct format, with this custom stuff.

Right, this was in 55537871e ("kernel/watchdog.c: perform all-CPU 
backtrace in case of hard lockup").

To be really honest, I don't really have 100% reliable explanation why I 
did it this way; I *think* it was because otherwise, in case WARN is 
preserved, we'll either

- have different output format for current (warning) CPU and all the 
  others (trigger_allbutself_cpu_backtrace())

or

- have duplicated backtrace for current CPU, one coming from WARN and one 
  coming from the all-cpu backtrace

> I think we should go back to the WARN() thing.

I don't really have strong opinion on this, but someone should pick the 
poison from the two options above; I tried in the mentioned commit, but 
apparently not to general satisfaction :)

Thanks,

-- 
Jiri Kosina
SUSE Labs

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 1/2] watchdog: Fix output
  2016-03-19  0:11       ` Jiri Kosina
@ 2016-03-19  0:39         ` Jiri Kosina
  0 siblings, 0 replies; 11+ messages in thread
From: Jiri Kosina @ 2016-03-19  0:39 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Don Zickus, linux-kernel, rostedt, mingo, tglx

On Sat, 19 Mar 2016, Jiri Kosina wrote:

> > It was Jiri who made this mess by replacing WARN(), which has a very
> > distinct format, with this custom stuff.
> 
> Right, this was in 55537871e ("kernel/watchdog.c: perform all-CPU 
> backtrace in case of hard lockup").
> 
> To be really honest, I don't really have 100% reliable explanation why I 
> did it this way; I *think* it was because otherwise, in case WARN is 
> preserved, we'll either
> 
> - have different output format for current (warning) CPU and all the 
>   others (trigger_allbutself_cpu_backtrace())
> 
> or
> 
> - have duplicated backtrace for current CPU, one coming from WARN and one 
>   coming from the all-cpu backtrace
> 
> > I think we should go back to the WARN() thing.
> 
> I don't really have strong opinion on this, but someone should pick the 
> poison from the two options above; I tried in the mentioned commit, but 
> apparently not to general satisfaction :)

Hmm, or we can actually make the whole thing behave the way that was 
introduced by 55537871e only in case of 
sysctl_hardlockup_all_cpu_backtrace being set, and use WARN() otherwise.

I'll send out a patch for this during / after the weekend if noone else 
beats me to do it.

-- 
Jiri Kosina
SUSE Labs

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2016-03-19  0:39 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-03-18 15:28 [PATCH 0/2] random debug patches Peter Zijlstra
2016-03-18 15:28 ` [PATCH 1/2] watchdog: Fix output Peter Zijlstra
2016-03-18 16:37   ` Don Zickus
2016-03-18 16:44     ` Peter Zijlstra
2016-03-18 16:48       ` Peter Zijlstra
2016-03-18 17:42         ` Joe Perches
2016-03-18 18:12         ` Don Zickus
2016-03-18 17:49     ` Peter Zijlstra
2016-03-19  0:11       ` Jiri Kosina
2016-03-19  0:39         ` Jiri Kosina
2016-03-18 15:28 ` [PATCH 2/2] trace: Record and show NMI state Peter Zijlstra

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.