* [PATCH 0/2] random debug patches
@ 2016-03-18 15:28 Peter Zijlstra
2016-03-18 15:28 ` [PATCH 1/2] watchdog: Fix output Peter Zijlstra
2016-03-18 15:28 ` [PATCH 2/2] trace: Record and show NMI state Peter Zijlstra
0 siblings, 2 replies; 11+ messages in thread
From: Peter Zijlstra @ 2016-03-18 15:28 UTC (permalink / raw)
To: linux-kernel; +Cc: rostedt, dzickus, mingo, tglx, peterz
The only relation between these patches is that they are the result of me
trying to hunt down fail. Other than that they're unrelated and can be
independently applied.
^ permalink raw reply [flat|nested] 11+ messages in thread
* [PATCH 1/2] watchdog: Fix output
2016-03-18 15:28 [PATCH 0/2] random debug patches Peter Zijlstra
@ 2016-03-18 15:28 ` Peter Zijlstra
2016-03-18 16:37 ` Don Zickus
2016-03-18 15:28 ` [PATCH 2/2] trace: Record and show NMI state Peter Zijlstra
1 sibling, 1 reply; 11+ messages in thread
From: Peter Zijlstra @ 2016-03-18 15:28 UTC (permalink / raw)
To: linux-kernel; +Cc: rostedt, dzickus, mingo, tglx, peterz
[-- Attachment #1: peterz-watchdog-fix-output.patch --]
[-- Type: text/plain, Size: 1309 bytes --]
The pr_crap() functions generate idiotic output; use printk().
Broken output:
[ 3538.718135] NMI watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [perf_fuzzer:2646]
[ 3383.233583] NMI watchdog: Watchdog detected hard LOCKUP on cpu 27
Fixed output:
[14180.328194] BUG: soft lockup - CPU#2 stuck for 23s! [perf_fuzzer:13650]
[ 1064.914925] BUG: NMI Watchdog detected hard LOCKUP on cpu 11
Cc: Don Zickus <dzickus@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
kernel/watchdog.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -350,7 +350,7 @@ static void watchdog_overflow_callback(s
if (__this_cpu_read(hard_watchdog_warn) == true)
return;
- pr_emerg("Watchdog detected hard LOCKUP on cpu %d", this_cpu);
+ printk(KERN_EMERG "BUG: NMI detected hard LOCKUP on cpu %d", this_cpu);
print_modules();
print_irqtrace_events(current);
if (regs)
@@ -467,7 +467,7 @@ static enum hrtimer_restart watchdog_tim
}
}
- pr_emerg("BUG: soft lockup - CPU#%d stuck for %us! [%s:%d]\n",
+ printk(KERN_EMERG "BUG: soft lockup - CPU#%d stuck for %us! [%s:%d]\n",
smp_processor_id(), duration,
current->comm, task_pid_nr(current));
__this_cpu_write(softlockup_task_ptr_saved, current);
^ permalink raw reply [flat|nested] 11+ messages in thread
* [PATCH 2/2] trace: Record and show NMI state
2016-03-18 15:28 [PATCH 0/2] random debug patches Peter Zijlstra
2016-03-18 15:28 ` [PATCH 1/2] watchdog: Fix output Peter Zijlstra
@ 2016-03-18 15:28 ` Peter Zijlstra
1 sibling, 0 replies; 11+ messages in thread
From: Peter Zijlstra @ 2016-03-18 15:28 UTC (permalink / raw)
To: linux-kernel; +Cc: rostedt, dzickus, mingo, tglx, peterz
[-- Attachment #1: trace-nmi.patch --]
[-- Type: text/plain, Size: 1831 bytes --]
The latency tracer format has a nice column to indicate IRQ state, but
this is not able to tell us about NMI state.
When tracing perf interrupt handlers (which often run in NMI context)
it is very useful to see how the events nest.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
kernel/trace/trace.c | 1 +
kernel/trace/trace.h | 1 +
kernel/trace/trace_output.c | 10 +++++++---
3 files changed, 9 insertions(+), 3 deletions(-)
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -1662,6 +1662,7 @@ tracing_generic_entry_update(struct trac
#else
TRACE_FLAG_IRQS_NOSUPPORT |
#endif
+ ((pc & NMI_MASK ) ? TRACE_FLAG_NMI : 0) |
((pc & HARDIRQ_MASK) ? TRACE_FLAG_HARDIRQ : 0) |
((pc & SOFTIRQ_MASK) ? TRACE_FLAG_SOFTIRQ : 0) |
(tif_need_resched() ? TRACE_FLAG_NEED_RESCHED : 0) |
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -125,6 +125,7 @@ enum trace_flag_type {
TRACE_FLAG_HARDIRQ = 0x08,
TRACE_FLAG_SOFTIRQ = 0x10,
TRACE_FLAG_PREEMPT_RESCHED = 0x20,
+ TRACE_FLAG_NMI = 0x40,
};
#define TRACE_BUF_SIZE 1024
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -389,7 +389,9 @@ int trace_print_lat_fmt(struct trace_seq
char irqs_off;
int hardirq;
int softirq;
+ int nmi;
+ nmi = entry->flags & TRACE_FLAG_NMI;
hardirq = entry->flags & TRACE_FLAG_HARDIRQ;
softirq = entry->flags & TRACE_FLAG_SOFTIRQ;
@@ -415,10 +417,12 @@ int trace_print_lat_fmt(struct trace_seq
}
hardsoft_irq =
+ (nmi && hardirq) ? 'Z' :
+ nmi ? 'z' :
(hardirq && softirq) ? 'H' :
- hardirq ? 'h' :
- softirq ? 's' :
- '.';
+ hardirq ? 'h' :
+ softirq ? 's' :
+ '.' ;
trace_seq_printf(s, "%c%c%c",
irqs_off, need_resched, hardsoft_irq);
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH 1/2] watchdog: Fix output
2016-03-18 15:28 ` [PATCH 1/2] watchdog: Fix output Peter Zijlstra
@ 2016-03-18 16:37 ` Don Zickus
2016-03-18 16:44 ` Peter Zijlstra
2016-03-18 17:49 ` Peter Zijlstra
0 siblings, 2 replies; 11+ messages in thread
From: Don Zickus @ 2016-03-18 16:37 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: linux-kernel, rostedt, mingo, tglx
On Fri, Mar 18, 2016 at 04:28:03PM +0100, Peter Zijlstra wrote:
> The pr_crap() functions generate idiotic output; use printk().
>
> Broken output:
>
> [ 3538.718135] NMI watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [perf_fuzzer:2646]
> [ 3383.233583] NMI watchdog: Watchdog detected hard LOCKUP on cpu 27
>
> Fixed output:
>
> [14180.328194] BUG: soft lockup - CPU#2 stuck for 23s! [perf_fuzzer:13650]
> [ 1064.914925] BUG: NMI Watchdog detected hard LOCKUP on cpu 11
Would something like this be a better patch?
Cheers,
Don
diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index b3ace6e..e3302a0 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -9,7 +9,7 @@
* to those contributors as well.
*/
-#define pr_fmt(fmt) "NMI watchdog: " fmt
+#define pr_fmt(fmt) "Lockup detector: " fmt
#include <linux/mm.h>
#include <linux/cpu.h>
@@ -350,7 +350,7 @@ static void watchdog_overflow_callback(struct perf_event *event,
if (__this_cpu_read(hard_watchdog_warn) == true)
return;
- pr_emerg("Watchdog detected hard LOCKUP on cpu %d", this_cpu);
+ pr_emerg("Detected hard LOCKUP on cpu %d", this_cpu);
print_modules();
print_irqtrace_events(current);
if (regs)
@@ -467,7 +467,7 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
}
}
- pr_emerg("BUG: soft lockup - CPU#%d stuck for %us! [%s:%d]\n",
+ pr_emerg("Detected soft lockup - CPU#%d stuck for %us! [%s:%d]\n",
smp_processor_id(), duration,
current->comm, task_pid_nr(current));
__this_cpu_write(softlockup_task_ptr_saved, current);
>
> Cc: Don Zickus <dzickus@redhat.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
> kernel/watchdog.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> --- a/kernel/watchdog.c
> +++ b/kernel/watchdog.c
> @@ -350,7 +350,7 @@ static void watchdog_overflow_callback(s
> if (__this_cpu_read(hard_watchdog_warn) == true)
> return;
>
> - pr_emerg("Watchdog detected hard LOCKUP on cpu %d", this_cpu);
> + printk(KERN_EMERG "BUG: NMI detected hard LOCKUP on cpu %d", this_cpu);
> print_modules();
> print_irqtrace_events(current);
> if (regs)
> @@ -467,7 +467,7 @@ static enum hrtimer_restart watchdog_tim
> }
> }
>
> - pr_emerg("BUG: soft lockup - CPU#%d stuck for %us! [%s:%d]\n",
> + printk(KERN_EMERG "BUG: soft lockup - CPU#%d stuck for %us! [%s:%d]\n",
> smp_processor_id(), duration,
> current->comm, task_pid_nr(current));
> __this_cpu_write(softlockup_task_ptr_saved, current);
>
>
^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: [PATCH 1/2] watchdog: Fix output
2016-03-18 16:37 ` Don Zickus
@ 2016-03-18 16:44 ` Peter Zijlstra
2016-03-18 16:48 ` Peter Zijlstra
2016-03-18 17:49 ` Peter Zijlstra
1 sibling, 1 reply; 11+ messages in thread
From: Peter Zijlstra @ 2016-03-18 16:44 UTC (permalink / raw)
To: Don Zickus; +Cc: linux-kernel, rostedt, mingo, tglx
On Fri, Mar 18, 2016 at 12:37:48PM -0400, Don Zickus wrote:
> On Fri, Mar 18, 2016 at 04:28:03PM +0100, Peter Zijlstra wrote:
> > The pr_crap() functions generate idiotic output; use printk().
> >
> > Broken output:
> >
> > [ 3538.718135] NMI watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [perf_fuzzer:2646]
> > [ 3383.233583] NMI watchdog: Watchdog detected hard LOCKUP on cpu 27
> >
> > Fixed output:
> >
> > [14180.328194] BUG: soft lockup - CPU#2 stuck for 23s! [perf_fuzzer:13650]
> > [ 1064.914925] BUG: NMI Watchdog detected hard LOCKUP on cpu 11
>
> Would something like this be a better patch?
> -#define pr_fmt(fmt) "NMI watchdog: " fmt
> +#define pr_fmt(fmt) "Lockup detector: " fmt
> - pr_emerg("Watchdog detected hard LOCKUP on cpu %d", this_cpu);
> + pr_emerg("Detected hard LOCKUP on cpu %d", this_cpu);
That still results in something like:
Lockup detector: Detected hard LOCKUP on cpu 22
Which is still weird, doesn't mention NMI nor BUG (and when the NMI
triggers it really rather frequently is a dead system).
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH 1/2] watchdog: Fix output
2016-03-18 16:44 ` Peter Zijlstra
@ 2016-03-18 16:48 ` Peter Zijlstra
2016-03-18 17:42 ` Joe Perches
2016-03-18 18:12 ` Don Zickus
0 siblings, 2 replies; 11+ messages in thread
From: Peter Zijlstra @ 2016-03-18 16:48 UTC (permalink / raw)
To: Don Zickus; +Cc: linux-kernel, rostedt, mingo, tglx
On Fri, Mar 18, 2016 at 05:44:41PM +0100, Peter Zijlstra wrote:
> On Fri, Mar 18, 2016 at 12:37:48PM -0400, Don Zickus wrote:
> > Would something like this be a better patch?
>
> > -#define pr_fmt(fmt) "NMI watchdog: " fmt
> > +#define pr_fmt(fmt) "Lockup detector: " fmt
You might want to audit all the other pr_ nonsense in that file; a lot
of the msgs don't particularly make sense when combined with the prefix.
Some are for the NMI thing, others for the soft lockup one.
I would just chuck pr_crap() altogether and use sensible text.
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH 1/2] watchdog: Fix output
2016-03-18 16:48 ` Peter Zijlstra
@ 2016-03-18 17:42 ` Joe Perches
2016-03-18 18:12 ` Don Zickus
1 sibling, 0 replies; 11+ messages in thread
From: Joe Perches @ 2016-03-18 17:42 UTC (permalink / raw)
To: Peter Zijlstra, Don Zickus; +Cc: linux-kernel, rostedt, mingo, tglx
On Fri, 2016-03-18 at 17:48 +0100, Peter Zijlstra wrote:
> On Fri, Mar 18, 2016 at 05:44:41PM +0100, Peter Zijlstra wrote:
> >
> > On Fri, Mar 18, 2016 at 12:37:48PM -0400, Don Zickus wrote:
> > >
> > > Would something like this be a better patch?
> > >
> > > -#define pr_fmt(fmt) "NMI watchdog: " fmt
> > > +#define pr_fmt(fmt) "Lockup detector: " fmt
> You might want to audit all the other pr_ nonsense in that file; a
> lot
> of the msgs don't particularly make sense when combined with the
> prefix.
>
> Some are for the NMI thing, others for the soft lockup one.
>
> I would just chuck pr_crap() altogether and use sensible text.
or change to #define pr_fmt(fmt) fmt
and output whatever unprefixed content you want.
btw: the first printk/pr_emerg should add a newline
pr_emerg("Watchdog detected hard LOCKUP on cpu %d", this_cpu);
sb:
pr_emerg("Watchdog detected hard LOCKUP on cpu %d\n", this_cpu);
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH 1/2] watchdog: Fix output
2016-03-18 16:37 ` Don Zickus
2016-03-18 16:44 ` Peter Zijlstra
@ 2016-03-18 17:49 ` Peter Zijlstra
2016-03-19 0:11 ` Jiri Kosina
1 sibling, 1 reply; 11+ messages in thread
From: Peter Zijlstra @ 2016-03-18 17:49 UTC (permalink / raw)
To: Don Zickus; +Cc: linux-kernel, rostedt, mingo, tglx, jkosina
On Fri, Mar 18, 2016 at 12:37:48PM -0400, Don Zickus wrote:
> +++ b/kernel/watchdog.c
> @@ -9,7 +9,7 @@
> * to those contributors as well.
> */
>
> -#define pr_fmt(fmt) "NMI watchdog: " fmt
> +#define pr_fmt(fmt) "Lockup detector: " fmt
>
> #include <linux/mm.h>
> #include <linux/cpu.h>
> @@ -350,7 +350,7 @@ static void watchdog_overflow_callback(struct perf_event *event,
> if (__this_cpu_read(hard_watchdog_warn) == true)
> return;
>
> - pr_emerg("Watchdog detected hard LOCKUP on cpu %d", this_cpu);
> + pr_emerg("Detected hard LOCKUP on cpu %d", this_cpu);
> print_modules();
> print_irqtrace_events(current);
> if (regs)
It was Jiri who made this mess by replacing WARN(), which has a very
distinct format, with this custom stuff.
I think we should go back to the WARN() thing.
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH 1/2] watchdog: Fix output
2016-03-18 16:48 ` Peter Zijlstra
2016-03-18 17:42 ` Joe Perches
@ 2016-03-18 18:12 ` Don Zickus
1 sibling, 0 replies; 11+ messages in thread
From: Don Zickus @ 2016-03-18 18:12 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: linux-kernel, rostedt, mingo, tglx
On Fri, Mar 18, 2016 at 05:48:58PM +0100, Peter Zijlstra wrote:
> On Fri, Mar 18, 2016 at 05:44:41PM +0100, Peter Zijlstra wrote:
> > On Fri, Mar 18, 2016 at 12:37:48PM -0400, Don Zickus wrote:
> > > Would something like this be a better patch?
> >
> > > -#define pr_fmt(fmt) "NMI watchdog: " fmt
> > > +#define pr_fmt(fmt) "Lockup detector: " fmt
>
> You might want to audit all the other pr_ nonsense in that file; a lot
> of the msgs don't particularly make sense when combined with the prefix.
>
> Some are for the NMI thing, others for the soft lockup one.
>
> I would just chuck pr_crap() altogether and use sensible text.
Hehe. I'll look into that next week. :-)
Cheers,
Don
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH 1/2] watchdog: Fix output
2016-03-18 17:49 ` Peter Zijlstra
@ 2016-03-19 0:11 ` Jiri Kosina
2016-03-19 0:39 ` Jiri Kosina
0 siblings, 1 reply; 11+ messages in thread
From: Jiri Kosina @ 2016-03-19 0:11 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Don Zickus, linux-kernel, rostedt, mingo, tglx
On Fri, 18 Mar 2016, Peter Zijlstra wrote:
> > -#define pr_fmt(fmt) "NMI watchdog: " fmt
> > +#define pr_fmt(fmt) "Lockup detector: " fmt
> >
> > #include <linux/mm.h>
> > #include <linux/cpu.h>
> > @@ -350,7 +350,7 @@ static void watchdog_overflow_callback(struct perf_event *event,
> > if (__this_cpu_read(hard_watchdog_warn) == true)
> > return;
> >
> > - pr_emerg("Watchdog detected hard LOCKUP on cpu %d", this_cpu);
> > + pr_emerg("Detected hard LOCKUP on cpu %d", this_cpu);
> > print_modules();
> > print_irqtrace_events(current);
> > if (regs)
>
> It was Jiri who made this mess by replacing WARN(), which has a very
> distinct format, with this custom stuff.
Right, this was in 55537871e ("kernel/watchdog.c: perform all-CPU
backtrace in case of hard lockup").
To be really honest, I don't really have 100% reliable explanation why I
did it this way; I *think* it was because otherwise, in case WARN is
preserved, we'll either
- have different output format for current (warning) CPU and all the
others (trigger_allbutself_cpu_backtrace())
or
- have duplicated backtrace for current CPU, one coming from WARN and one
coming from the all-cpu backtrace
> I think we should go back to the WARN() thing.
I don't really have strong opinion on this, but someone should pick the
poison from the two options above; I tried in the mentioned commit, but
apparently not to general satisfaction :)
Thanks,
--
Jiri Kosina
SUSE Labs
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH 1/2] watchdog: Fix output
2016-03-19 0:11 ` Jiri Kosina
@ 2016-03-19 0:39 ` Jiri Kosina
0 siblings, 0 replies; 11+ messages in thread
From: Jiri Kosina @ 2016-03-19 0:39 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Don Zickus, linux-kernel, rostedt, mingo, tglx
On Sat, 19 Mar 2016, Jiri Kosina wrote:
> > It was Jiri who made this mess by replacing WARN(), which has a very
> > distinct format, with this custom stuff.
>
> Right, this was in 55537871e ("kernel/watchdog.c: perform all-CPU
> backtrace in case of hard lockup").
>
> To be really honest, I don't really have 100% reliable explanation why I
> did it this way; I *think* it was because otherwise, in case WARN is
> preserved, we'll either
>
> - have different output format for current (warning) CPU and all the
> others (trigger_allbutself_cpu_backtrace())
>
> or
>
> - have duplicated backtrace for current CPU, one coming from WARN and one
> coming from the all-cpu backtrace
>
> > I think we should go back to the WARN() thing.
>
> I don't really have strong opinion on this, but someone should pick the
> poison from the two options above; I tried in the mentioned commit, but
> apparently not to general satisfaction :)
Hmm, or we can actually make the whole thing behave the way that was
introduced by 55537871e only in case of
sysctl_hardlockup_all_cpu_backtrace being set, and use WARN() otherwise.
I'll send out a patch for this during / after the weekend if noone else
beats me to do it.
--
Jiri Kosina
SUSE Labs
^ permalink raw reply [flat|nested] 11+ messages in thread
end of thread, other threads:[~2016-03-19 0:39 UTC | newest]
Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-03-18 15:28 [PATCH 0/2] random debug patches Peter Zijlstra
2016-03-18 15:28 ` [PATCH 1/2] watchdog: Fix output Peter Zijlstra
2016-03-18 16:37 ` Don Zickus
2016-03-18 16:44 ` Peter Zijlstra
2016-03-18 16:48 ` Peter Zijlstra
2016-03-18 17:42 ` Joe Perches
2016-03-18 18:12 ` Don Zickus
2016-03-18 17:49 ` Peter Zijlstra
2016-03-19 0:11 ` Jiri Kosina
2016-03-19 0:39 ` Jiri Kosina
2016-03-18 15:28 ` [PATCH 2/2] trace: Record and show NMI state Peter Zijlstra
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.