* NMI watchdog
@ 2015-03-30 12:14 Justin Keller
  2015-03-30 17:09 ` Michal Hocko
  0 siblings, 1 reply; 15+ messages in thread
From: Justin Keller @ 2015-03-30 12:14 UTC (permalink / raw)
  To: linux-kernel

Hello,
Although not running a vanilla kernel on this machine, I have reported
the issue to the distribution's bug tracking system. It has been
almost a week with no response, so I am sending this email.

Multiple times, when I return to my computer from being away for a
little while, I noticed:
Message from syslogd@redacted at Mar 23 XX:XX:XX ...
kernel:[1059322.470817] NMI watchdog: BUG: soft lockup - CPU#1 stuck
for 22s! [kswapd0:31]

Dmesg | grep NMI produced:
[1151200.727734] sending NMI to all CPUs:
[1151200.727812] NMI backtrace for cpu 0
[1151200.764129] INFO: NMI handler
(arch_trigger_all_cpu_backtrace_handler) took too long to run: 36.262
msecs
[1151200.764198] NMI backtrace for cpu 1
[1151216.700893] sending NMI to all CPUs:
[1151216.700984] NMI backtrace for cpu 1
[1151216.706524] NMI backtrace for cpu 0
[1723994.455161] <NMI> [<ffffffff81554a5e>] ? dump_stack+0x41/0x51

I didn't have time to grep for kswapd or to investigate further. Long
story short, the machine was shut down shortly afterwards.

Justin

PS this was also sent to linux-watchdog. I forgot to turn off HTML, so
I had to re-send it here.


* Re: NMI watchdog
  2015-03-30 12:14 NMI watchdog Justin Keller
@ 2015-03-30 17:09 ` Michal Hocko
  0 siblings, 0 replies; 15+ messages in thread
From: Michal Hocko @ 2015-03-30 17:09 UTC (permalink / raw)
  To: Justin Keller; +Cc: linux-kernel

On Mon 30-03-15 08:14:45, Justin Keller wrote:
> Hello,
> Although not running a vanilla kernel on this machine, I have reported
> the issue to the distribution's bug tracking system. It has been
> almost a week with no response, so I am sending this email.
> 
> Multiple times, when I return to my computer from being away for a
> little while, I noticed:
> Message from syslogd@redacted at Mar 23 XX:XX:XX ...
> kernel:[1059322.470817] NMI watchdog: BUG: soft lockup - CPU#1 stuck
> for 22s! [kswapd0:31]

The traces dumped as part of the watchdog output are the most interesting
information. The kernel version is very important as well.
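
As a starting point for collecting that information, here is a minimal sketch that pulls the CPU, duration, and task out of a soft-lockup report. The pattern is an assumption derived only from the single sample line quoted in this thread, and `find_soft_lockups` is a hypothetical helper name; other kernel versions may format the message differently.

```python
import re

# Pattern modelled on the one report quoted in this thread (an assumption,
# not a format guaranteed by the kernel).
SOFT_LOCKUP = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+NMI watchdog: BUG: soft lockup - "
    r"CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^:\]]+):(?P<pid>\d+)\]"
)


def find_soft_lockups(log_text):
    """Return one dict per soft-lockup report found in log_text."""
    # Mail clients wrap long lines, so collapse all whitespace runs first.
    flat = " ".join(log_text.split())
    return [
        {
            "timestamp": float(m.group("ts")),
            "cpu": int(m.group("cpu")),
            "stuck_seconds": int(m.group("secs")),
            "task": m.group("comm"),
            "pid": int(m.group("pid")),
        }
        for m in SOFT_LOCKUP.finditer(flat)
    ]


# The wrapped report exactly as it appears in the original mail.
sample = """kernel:[1059322.470817] NMI watchdog: BUG: soft lockup - CPU#1 stuck
for 22s! [kswapd0:31]"""
reports = find_soft_lockups(sample)
```

This only identifies which task was stuck and for how long; the backtrace that follows the report in the log, plus the output of `uname -r`, is still what a developer would need to see.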

> Dmesg | grep NMI produced:
> [1151200.727734] sending NMI to all CPUs:
> [1151200.727812] NMI backtrace for cpu 0
> [1151200.764129] INFO: NMI handler
> (arch_trigger_all_cpu_backtrace_handler) took too long to run: 36.262
> msecs
> [1151200.764198] NMI backtrace for cpu 1
> [1151216.700893] sending NMI to all CPUs:
> [1151216.700984] NMI backtrace for cpu 1
> [1151216.706524] NMI backtrace for cpu 0
> [1723994.455161] <NMI> [<ffffffff81554a5e>] ? dump_stack+0x41/0x51
> 
> I didn't have time to grep for kswapd or to investigate further. Long
> story short, the machine was shut down shortly afterwards.
> 
> Justin
> 
> PS this was also sent to linux-watchdog. I forgot to turn off HTML, so
> I had to re-send it here.

-- 
Michal Hocko
SUSE Labs


* NMI watchdog
@ 2015-03-30 12:15 Justin Keller
  0 siblings, 0 replies; 15+ messages in thread
From: Justin Keller @ 2015-03-30 12:15 UTC (permalink / raw)
  To: linux-watchdog

Hello,
Although not running a vanilla kernel on this machine, I have reported the
issue to the distribution's bug tracking system. It has been almost a week
with no response, so I am sending this email.

Multiple times, when I return to my computer from being away for a little
while, I noticed:
Message from syslogd@redacted at Mar 23 XX:XX:XX ...
kernel:[1059322.470817] NMI watchdog: BUG: soft lockup - CPU#1 stuck for
22s! [kswapd0:31]

Dmesg | grep NMI produced:
[1151200.727734] sending NMI to all CPUs:
[1151200.727812] NMI backtrace for cpu 0
[1151200.764129] INFO: NMI handler (arch_trigger_all_cpu_backtrace_handler)
took too long to run: 36.262 msecs
[1151200.764198] NMI backtrace for cpu 1
[1151216.700893] sending NMI to all CPUs:
[1151216.700984] NMI backtrace for cpu 1
[1151216.706524] NMI backtrace for cpu 0
[1723994.455161] <NMI> [<ffffffff81554a5e>] ? dump_stack+0x41/0x51

I didn't have time to grep for kswapd or to investigate further. Long story
short, the machine was shut down shortly afterwards.

Justin


* NMI watchdog...
@ 2009-01-29 23:54 David Miller
  0 siblings, 0 replies; 15+ messages in thread
From: David Miller @ 2009-01-29 23:54 UTC (permalink / raw)
  To: sparclinux


I just wanted to let folks know what I've been working on, sparc wise.

I have this recurring issue where one of my workstations hangs
completely, no keyboard input, no console messages, nothing.

Since we have pseudo-NMI support in oprofile via performance counters
in the current tree I worked on rearchitecting this so that a nice NMI
watchdog layer could be added.

It is modelled after the x86 NMI watchdog, with the major difference
being that it is enabled by default.  The cost is one interrupt per
second, and the payback is enormous wrt. the ability to debug complete
system hangs.

Basically how it works is if we see no timer interrupts processed for
5 seconds we print a message, dump registers, and optionally panic the
system.
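
The check described above can be modelled in a few lines. This is a toy simulation under stated assumptions, not the sparc kernel code: the class name, the method names, and the idea of passing the timer-interrupt count in as an argument are all hypothetical, and the NMI is assumed to arrive about once per second.

```python
class NmiWatchdogSketch:
    """Toy model of a 'no timer interrupts for N seconds' check."""

    def __init__(self, threshold_ticks=5, panic_on_timeout=False):
        self.threshold_ticks = threshold_ticks    # stalled NMI ticks tolerated
        self.panic_on_timeout = panic_on_timeout  # optional panic, as described
        self.last_timer_count = None
        self.stalled_ticks = 0

    def on_nmi(self, timer_irq_count):
        """Called once per NMI with the running count of timer interrupts.

        Returns "ok", "report" (print a message and dump registers), or
        "panic".
        """
        if timer_irq_count != self.last_timer_count:
            # The timer interrupt is still being serviced: reset the stall count.
            self.last_timer_count = timer_irq_count
            self.stalled_ticks = 0
            return "ok"
        self.stalled_ticks += 1
        if self.stalled_ticks < self.threshold_ticks:
            return "ok"
        return "panic" if self.panic_on_timeout else "report"


# Three healthy ticks, then the timer count stops advancing for five ticks.
watchdog = NmiWatchdogSketch()
actions = [watchdog.on_nmi(count) for count in [10, 20, 30, 30, 30, 30, 30, 30]]
```

The point of driving the check from an NMI is that it keeps running even when ordinary interrupts are masked, which is exactly the situation in which the timer count stops advancing.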

This will be supported on any system that has profiling counter
overflow interrupt support.  That essentially means any cpu from
UltraSPARC-III onward (including Niagara chips).

Another nice side effect of this work is that it gives us some of the
framework necessary for whatever generic performance counter layer
gets merged into the tree in the future (Ingo Molnar's work, perfmon3,
whatever).

I noticed while doing these changes that we need some work in the
handling of OOPSes and other errors.  In particular we need to start
using the existing generic infrastructure the kernel provides, such as
oops_enter(), oops_exit(), bust_spinlocks(), etc.  I do intend to work
on this.

I'm currently busy doing testing to make sure that the NMI watchdog
and oprofile work as expected.

I'll post the patches when I check them in.  I intend to push this
into the current stable tree because there are entire classes of bugs
people run into which can't be analyzed at all without this kind of
facility.


* Re: NMI watchdog
  2007-10-12 16:12     ` Steven Rostedt
@ 2007-10-17 12:20       ` John Sigler
  0 siblings, 0 replies; 15+ messages in thread
From: John Sigler @ 2007-10-17 12:20 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: Steven Rostedt, linux-kernel, linux-rt-users

Steven Rostedt wrote:
> --

(Strange characters.)

> John Sigler wrote:
> 
>> APIC timer registered as dummy, due to nmi_watchdog=1!
>> Clockevents: could not switch to one-shot mode: lapic is not functional.
>> Could not switch to high resolution mode on CPU 0
>>
>> Do you know why nmi_watchdog=1 disables high-resolution timers?
>>
>> And why nmi_watchdog=1 implies APIC timer registered as dummy?
> 
> Crap, I forgot about that. Thomas explained why to me once, and I forgot.

Thomas,

Could you explain again why using the IO-APIC watchdog disables hrt?

> Does it still crash? If not, does it crash if you turn off highres?

It's not that simple. My app depends on hrt. I'd have to tweak it to 
make it work when hrt are not available.

BTW, it's not a crash, it's a complete system lock-up. Nothing works, 
and I really do mean nothing: ping is ignored, I can't open an ssh 
connection, the serial console won't budge, even if I send SysRq combos...

> Or better yet, try turning off dynamic-ticks. We had a bug once before
> that had a problem with the new RCU code and dynamic ticks.

I didn't enable dynamic-ticks kernel support.

CONFIG_TICK_ONESHOT=y
# CONFIG_NO_HZ is not set
CONFIG_HIGH_RES_TIMERS=y
CONFIG_GENERIC_CLOCKEVENTS_BUILD=y
# CONFIG_SMP is not set
CONFIG_X86_PC=y
...
# CONFIG_HPET_TIMER is not set
# CONFIG_PREEMPT_NONE is not set
# CONFIG_PREEMPT_VOLUNTARY is not set
# CONFIG_PREEMPT_DESKTOP is not set
CONFIG_PREEMPT_RT=y
CONFIG_PREEMPT=y
CONFIG_PREEMPT_SOFTIRQS=y
CONFIG_PREEMPT_HARDIRQS=y
CONFIG_PREEMPT_BKL=y
# CONFIG_CLASSIC_RCU is not set
CONFIG_PREEMPT_RCU=y
# CONFIG_RCU_TRACE is not set
CONFIG_X86_UP_APIC=y
CONFIG_X86_UP_IOAPIC=y
CONFIG_X86_LOCAL_APIC=y
CONFIG_X86_IO_APIC=y
# CONFIG_X86_MCE is not set
# CONFIG_VM86 is not set
# CONFIG_TOSHIBA is not set
# CONFIG_I8K is not set
# CONFIG_X86_REBOOTFIXUPS is not set
# CONFIG_MICROCODE is not set
# CONFIG_X86_MSR is not set
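
A quick way to double-check such a fragment is to parse it rather than read it by eye. The sketch below is only an illustration: `parse_kconfig` is a hypothetical helper, and the fragment is a shortened copy of the options listed above.

```python
def parse_kconfig(text):
    """Map option name -> value; options marked 'is not set' map to None."""
    suffix = " is not set"
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("# CONFIG_") and line.endswith(suffix):
            # "# CONFIG_FOO is not set" -> "CONFIG_FOO"
            options[line[2:-len(suffix)]] = None
        elif line.startswith("CONFIG_") and "=" in line:
            name, value = line.split("=", 1)
            options[name] = value
    return options


fragment = """\
CONFIG_TICK_ONESHOT=y
# CONFIG_NO_HZ is not set
CONFIG_HIGH_RES_TIMERS=y
# CONFIG_SMP is not set
CONFIG_PREEMPT_RT=y
"""
config = parse_kconfig(fragment)
dynticks_enabled = config.get("CONFIG_NO_HZ") == "y"
highres_enabled = config.get("CONFIG_HIGH_RES_TIMERS") == "y"
```

For this fragment, `dynticks_enabled` comes out false and `highres_enabled` true, matching the statement that dynamic ticks are not compiled in.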

Regards.


* Re: NMI watchdog
  2007-10-12 14:48 ` Arjan van de Ven
@ 2007-10-15 16:05   ` John Sigler
  0 siblings, 0 replies; 15+ messages in thread
From: John Sigler @ 2007-10-15 16:05 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: linux-kernel, linux-rt-users

Arjan van de Ven wrote:

> John Sigler wrote:
> 
>> I'm experiencing a full system lockup. I'm using an out-of-tree
>> driver which I suspect is responsible. I'm trying to enable the NMI
>> watchdog.
> 
> one thing worth a shot is enabling lockdep.. that often finds deadlocks
> for you ;)

I enabled every option I thought might be useful.

I booted the system, fired a serial console, bumped the priority of the 
serial port IRQ handler to 80 and the priority of the shell in the 
serial console to 70, set the console log level to 9 using SysRq.

I then connected via ssh, loaded the driver module (the output showed up 
in the serial console), started 4 processes, and had a complete system 
lock-up within 10 seconds.

Nothing on the serial console :-(

The system didn't even respond to SysRq...

How do kernel hackers debug these problems?

#
# Kernel hacking
#
CONFIG_TRACE_IRQFLAGS_SUPPORT=y
# CONFIG_PRINTK_TIME is not set
# CONFIG_ENABLE_MUST_CHECK is not set
CONFIG_MAGIC_SYSRQ=y
# CONFIG_UNUSED_SYMBOLS is not set
# CONFIG_DEBUG_FS is not set
# CONFIG_HEADERS_CHECK is not set
CONFIG_DEBUG_KERNEL=y
CONFIG_DEBUG_SHIRQ=y
CONFIG_DETECT_SOFTLOCKUP=y
CONFIG_SCHED_DEBUG=y
# CONFIG_SCHEDSTATS is not set
CONFIG_TIMER_STATS=y
CONFIG_DEBUG_SLAB=y
CONFIG_DEBUG_SLAB_LEAK=y
CONFIG_DEBUG_PREEMPT=y
CONFIG_DEBUG_RT_MUTEXES=y
CONFIG_DEBUG_PI_LIST=y
# CONFIG_RT_MUTEX_TESTER is not set
CONFIG_DEBUG_SPINLOCK=y
CONFIG_DEBUG_MUTEXES=y
CONFIG_DEBUG_LOCK_ALLOC=y
CONFIG_PROVE_LOCKING=y
CONFIG_LOCKDEP=y
# CONFIG_LOCK_STAT is not set
# CONFIG_DEBUG_LOCKDEP is not set
CONFIG_TRACE_IRQFLAGS=y
CONFIG_DEBUG_SPINLOCK_SLEEP=y
CONFIG_DEBUG_LOCKING_API_SELFTESTS=y
CONFIG_STACKTRACE=y
CONFIG_PREEMPT_TRACE=y
# CONFIG_EVENT_TRACE is not set
# CONFIG_FUNCTION_TRACE is not set
# CONFIG_WAKEUP_TIMING is not set
# CONFIG_CRITICAL_PREEMPT_TIMING is not set
# CONFIG_CRITICAL_IRQSOFF_TIMING is not set
# CONFIG_DEBUG_KOBJECT is not set
CONFIG_DEBUG_BUGVERBOSE=y
# CONFIG_DEBUG_INFO is not set
# CONFIG_DEBUG_VM is not set
# CONFIG_DEBUG_LIST is not set
# CONFIG_FRAME_POINTER is not set
CONFIG_FORCED_INLINING=y
# CONFIG_RCU_TORTURE_TEST is not set
# CONFIG_FAULT_INJECTION is not set
CONFIG_EARLY_PRINTK=y
CONFIG_DEBUG_STACKOVERFLOW=y
# CONFIG_DEBUG_STACK_USAGE is not set
# CONFIG_DEBUG_PAGEALLOC is not set
CONFIG_DEBUG_RODATA=y
CONFIG_4KSTACKS=y
CONFIG_X86_FIND_SMP_CONFIG=y
CONFIG_X86_MPPARSE=y
CONFIG_DOUBLEFAULT=y


* Re: NMI watchdog
  2007-10-12 13:26   ` John Sigler
@ 2007-10-12 16:12     ` Steven Rostedt
  2007-10-17 12:20       ` John Sigler
  0 siblings, 1 reply; 15+ messages in thread
From: Steven Rostedt @ 2007-10-12 16:12 UTC (permalink / raw)
  To: John Sigler; +Cc: linux-kernel, linux-rt-users, Thomas Gleixner


--
On Fri, 12 Oct 2007, John Sigler wrote:
> Steven Rostedt wrote:
>
> > John Sigler wrote:
>  > APIC timer registered as dummy, due to nmi_watchdog=1!
> 213a216,217
>  > Clockevents: could not switch to one-shot mode: lapic is not functional.
>  > Could not switch to high resolution mode on CPU 0
>
> Do you know why nmi_watchdog=1 disables high-resolution timers?
>
> And why nmi_watchdog=1 implies APIC timer registered as dummy?

Crap, I forgot about that. Thomas explained why to me once, and I forgot.

Does it still crash? If not, does it crash if you turn off highres?

Or better yet, try turning off dynamic-ticks. We had a bug once before
that had a problem with the new RCU code and dynamic ticks.

-- Steve


* Re: NMI watchdog
  2007-10-12  9:18 John Sigler
  2007-10-12 10:00 ` Björn Steinbrink
  2007-10-12 10:26 ` Steven Rostedt
@ 2007-10-12 14:48 ` Arjan van de Ven
  2007-10-15 16:05   ` John Sigler
  2 siblings, 1 reply; 15+ messages in thread
From: Arjan van de Ven @ 2007-10-12 14:48 UTC (permalink / raw)
  To: John Sigler; +Cc: linux-kernel, linux-rt-users

On Fri, 12 Oct 2007 11:18:24 +0200
John Sigler <linux.kernel@free.fr> wrote:

> Hello,
> 
> I'm experiencing a full system lockup. I'm using an out-of-tree
> driver which I suspect is responsible. I'm trying to enable the NMI
> watchdog.


one thing worth a shot is enabling lockdep.. that often finds deadlocks
for you ;)


* Re: NMI watchdog
  2007-10-12 10:26 ` Steven Rostedt
@ 2007-10-12 13:26   ` John Sigler
  2007-10-12 16:12     ` Steven Rostedt
  0 siblings, 1 reply; 15+ messages in thread
From: John Sigler @ 2007-10-12 13:26 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: linux-kernel, linux-rt-users, Thomas Gleixner

Steven Rostedt wrote:

> John Sigler wrote:
> 
>> I'm experiencing a full system lockup. I'm using an out-of-tree driver
>> which I suspect is responsible. I'm trying to enable the NMI watchdog.
>>
>> # cat /proc/version
>> Linux version 2.6.22.1-rt9 (gcc version 3.4.6) #1 PREEMPT RT Tue Oct 9
>> 12:25:47 CEST 2007
>>
>> # cat /proc/cmdline
>> ro root=/dev/hdc1 console=ttyS0,57600n8 console=tty0 panic=3 apic=debug
>> nmi_watchdog=2
> 
> I've noticed on some boxes that nmi_watchdog=2 does what you state. Try
> out nmi_watchdog=1.

# diff boot_message013 boot_message014
49c49
< Kernel command line: ro root=/dev/hdc1 console=ttyS0,57600n8 
console=tty0 panic=3 apic=debug nmi_watchdog=2
---
 > Kernel command line: ro root=/dev/hdc1 console=ttyS0,57600n8 
console=tty0 panic=3 apic=debug nmi_watchdog=1
69c69
< Calibrating delay using timer specific routine.. 4802.79 BogoMIPS 
(lpj=24013960)
---
 > Calibrating delay using timer specific routine.. 4802.80 BogoMIPS 
(lpj=24014009)
88a89
 > activating NMI Watchdog ... done.
97c98
< ..... CPU clock speed is 2400.1215 MHz.
---
 > ..... CPU clock speed is 2400.1221 MHz.
98a100
 > APIC timer registered as dummy, due to nmi_watchdog=1!
213a216,217
 > Clockevents: could not switch to one-shot mode: lapic is not functional.
 > Could not switch to high resolution mode on CPU 0

Do you know why nmi_watchdog=1 disables high-resolution timers?

And why nmi_watchdog=1 implies APIC timer registered as dummy?

# cat /sys/devices/system/clocksource/clocksource0/available_clocksource
tsc acpi_pm pit jiffies

# cat /sys/devices/system/clocksource/clocksource0/current_clocksource
tsc

# cat /proc/timer_list
Timer List Version: v0.3
HRTIMER_MAX_CLOCK_BASES: 2
now at 4613373211613 nsecs

cpu: 0
  clock 0:
   .index:      0
   .resolution: 10000000 nsecs
   .get_time:   ktime_get_real
   .offset:     0 nsecs
active timers:
  clock 1:
   .index:      1
   .resolution: 10000000 nsecs
   .get_time:   ktime_get
   .offset:     0 nsecs
active timers:
  #0: <cf2c1ec0>, it_real_fn, S:01
  # expires at 4630663830511 nsecs [in 17290618898 nsecs]
   .expires_next   : 9223372036854775807 nsecs
   .hres_active    : 0
   .nr_events      : 0
   .nohz_mode      : 0
   .idle_tick      : 0 nsecs
   .tick_stopped   : 0
   .idle_jiffies   : 0
   .idle_calls     : 0
   .idle_sleeps    : 0
   .idle_entrytime : 0 nsecs
   .idle_sleeptime : 0 nsecs
   .last_jiffies   : 0
   .next_jiffies   : 0
   .idle_expires   : 0 nsecs
jiffies: 431306


Tick Device: mode:     0
Clock Event Device: pit
  max_delta_ns:   27461866
  min_delta_ns:   12571
  mult:           5124677
  shift:          32
  mode:           2
  next_event:     9223372036854775807 nsecs
  set_next_event: pit_next_event
  set_mode:       init_pit_timer
  event_handler:  tick_handle_periodic_broadcast
tick_broadcast_mask: 00000001
tick_broadcast_oneshot_mask: 00000000


Tick Device: mode:     0
Clock Event Device: lapic
  max_delta_ns:   1006581321
  min_delta_ns:   1799
  mult:           35793226
  shift:          32
  mode:           1
  next_event:     0 nsecs
  set_next_event: lapic_next_event
  set_mode:       lapic_timer_setup
  event_handler:  tick_handle_periodic

# cat /proc/interrupts
            CPU0
   0:     468721   IO-APIC-edge      timer
   4:        326   IO-APIC-edge      serial
   8:          1   IO-APIC-edge      rtc
   9:          0   IO-APIC-fasteoi   acpi
  15:      15964   IO-APIC-edge      ide1
  16:       4217   IO-APIC-fasteoi   eth0
  17:       2340   IO-APIC-fasteoi   eth1
  18:       2340   IO-APIC-fasteoi   eth2
  19:       2340   IO-APIC-fasteoi   eth3
NMI:     468690
LOC:          0
ERR:          0
MIS:          0

Regards.


* Re: NMI watchdog
  2007-10-12 10:00 ` Björn Steinbrink
@ 2007-10-12 10:58   ` John Sigler
  0 siblings, 0 replies; 15+ messages in thread
From: John Sigler @ 2007-10-12 10:58 UTC (permalink / raw)
  To: Björn Steinbrink; +Cc: linux-kernel, linux-rt-users

Björn Steinbrink wrote:

> John Sigler wrote:
> 
>> I'm experiencing a full system lockup. I'm using an out-of-tree driver 
>> which I suspect is responsible. I'm trying to enable the NMI watchdog.
>>
>> # cat /proc/version
>> Linux version 2.6.22.1-rt9 (gcc version 3.4.6) #1 PREEMPT RT Tue Oct 9 
>> 12:25:47 CEST 2007
>>
>> # cat /proc/cmdline
>> ro root=/dev/hdc1 console=ttyS0,57600n8 console=tty0 panic=3 apic=debug 
>> nmi_watchdog=2
>>
>> However, after boot, the NMI count does not change.
>>
>> # cat /proc/interrupts ; sleep 10 ; cat /proc/interrupts
> 
> Try running some cpu hog in the background. The performance counters get
> increased only when the CPU is actually doing something. On a mostly
> idle system, it can take quite a while for the next NMI to show up.

You are right. In another shell, I ran while true; do : ; done

# cat /proc/interrupts ; sleep 10 ; cat /proc/interrupts
            CPU0
   0:        100   IO-APIC-edge      timer
   4:         82   IO-APIC-edge      serial
   8:          1   IO-APIC-edge      rtc
   9:          0   IO-APIC-fasteoi   acpi
  15:      13648   IO-APIC-edge      ide1
  16:       1303   IO-APIC-fasteoi   eth0
  17:        575   IO-APIC-fasteoi   eth1
  18:        575   IO-APIC-fasteoi   eth2
  19:        575   IO-APIC-fasteoi   eth3
NMI:       2889
LOC:     115768
ERR:          0
MIS:          0

            CPU0
   0:        100   IO-APIC-edge      timer
   4:         82   IO-APIC-edge      serial
   8:          1   IO-APIC-edge      rtc
   9:          0   IO-APIC-fasteoi   acpi
  15:      13672   IO-APIC-edge      ide1
  16:       1310   IO-APIC-fasteoi   eth0
  17:        580   IO-APIC-fasteoi   eth1
  18:        580   IO-APIC-fasteoi   eth2
  19:        580   IO-APIC-fasteoi   eth3
NMI:       2899
LOC:     116770
ERR:          0
MIS:          0

The performance counter appears to be configured to fire when the event 
count for CPU_CLK_UNHALTED reaches 2,400,000,000 (I have a 2.4 GHz CPU) 
i.e. one NMI per second when the CPU is 100% busy. Is that correct?
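
The arithmetic behind that guess, as a small sketch. The overflow threshold of 2,400,000,000 events is my hypothesis from the paragraph above, not something confirmed in this thread, and `expected_nmi_rate` is a hypothetical helper.

```python
def expected_nmi_rate(cpu_hz, overflow_threshold, busy_fraction):
    """NMIs per second if the counter only advances on unhalted cycles."""
    return cpu_hz * busy_fraction / overflow_threshold


CPU_HZ = 2_400_000_000      # 2.4 GHz CPU
THRESHOLD = 2_400_000_000   # assumed counter overflow threshold

rate_busy = expected_nmi_rate(CPU_HZ, THRESHOLD, 1.0)    # CPU hog running
rate_idle = expected_nmi_rate(CPU_HZ, THRESHOLD, 0.01)   # 1% busy system

# Observed with the busy loop running: NMI count 2889 -> 2899 over 10 seconds.
observed_rate = (2899 - 2889) / 10
```

Under that assumption a fully busy CPU gives one NMI per second, which matches the observed 10 NMIs in 10 seconds, while a system that is only 1% busy would see roughly one NMI every 100 seconds.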

On a related note, I have a Pentium 3 which counts CPU_CLK_UNHALTED 
cycles even when the CPU is halted. I was told this is a bug, but it 
actually sounds like a nice feature!

Is there really no way to have the event counter increment with every 
tick (even when the CPU is halted) on a P4?

Regards.


* Re: NMI watchdog
  2007-10-12  9:18 John Sigler
  2007-10-12 10:00 ` Björn Steinbrink
@ 2007-10-12 10:26 ` Steven Rostedt
  2007-10-12 13:26   ` John Sigler
  2007-10-12 14:48 ` Arjan van de Ven
  2 siblings, 1 reply; 15+ messages in thread
From: Steven Rostedt @ 2007-10-12 10:26 UTC (permalink / raw)
  To: John Sigler; +Cc: linux-kernel, linux-rt-users


--

On Fri, 12 Oct 2007, John Sigler wrote:

> Hello,
>
> I'm experiencing a full system lockup. I'm using an out-of-tree driver
> which I suspect is responsible. I'm trying to enable the NMI watchdog.
>
> # cat /proc/version
> Linux version 2.6.22.1-rt9 (gcc version 3.4.6) #1 PREEMPT RT Tue Oct 9
> 12:25:47 CEST 2007
>
> # cat /proc/cmdline
> ro root=/dev/hdc1 console=ttyS0,57600n8 console=tty0 panic=3 apic=debug
> nmi_watchdog=2
>

I've noticed on some boxes that nmi_watchdog=2 does what you state. Try
out nmi_watchdog=1.

-- Steve



* Re: NMI watchdog
  2007-10-12  9:18 John Sigler
@ 2007-10-12 10:00 ` Björn Steinbrink
  2007-10-12 10:58   ` John Sigler
  2007-10-12 10:26 ` Steven Rostedt
  2007-10-12 14:48 ` Arjan van de Ven
  2 siblings, 1 reply; 15+ messages in thread
From: Björn Steinbrink @ 2007-10-12 10:00 UTC (permalink / raw)
  To: John Sigler; +Cc: linux-kernel, linux-rt-users

On 2007.10.12 11:18:24 +0200, John Sigler wrote:
> Hello,
>
> I'm experiencing a full system lockup. I'm using an out-of-tree driver 
> which I suspect is responsible. I'm trying to enable the NMI watchdog.
>
> # cat /proc/version
> Linux version 2.6.22.1-rt9 (gcc version 3.4.6) #1 PREEMPT RT Tue Oct 9 
> 12:25:47 CEST 2007
>
> # cat /proc/cmdline
> ro root=/dev/hdc1 console=ttyS0,57600n8 console=tty0 panic=3 apic=debug 
> nmi_watchdog=2
>
> However, after boot, the NMI count does not change.
>
> # cat /proc/interrupts ; sleep 10 ; cat /proc/interrupts

Try running some cpu hog in the background. The performance counters get
increased only when the CPU is actually doing something. On a mostly
idle system, it can take quite a while for the next NMI to show up.

Björn


* NMI watchdog
@ 2007-10-12  9:18 John Sigler
  2007-10-12 10:00 ` Björn Steinbrink
                   ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: John Sigler @ 2007-10-12  9:18 UTC (permalink / raw)
  To: linux-kernel, linux-rt-users

Hello,

I'm experiencing a full system lockup. I'm using an out-of-tree driver 
which I suspect is responsible. I'm trying to enable the NMI watchdog.

# cat /proc/version
Linux version 2.6.22.1-rt9 (gcc version 3.4.6) #1 PREEMPT RT Tue Oct 9 
12:25:47 CEST 2007

# cat /proc/cmdline
ro root=/dev/hdc1 console=ttyS0,57600n8 console=tty0 panic=3 apic=debug 
nmi_watchdog=2

However, after boot, the NMI count does not change.

# cat /proc/interrupts ; sleep 10 ; cat /proc/interrupts
            CPU0
   0:         99   IO-APIC-edge      timer
   4:       3822   IO-APIC-edge      serial
   8:          1   IO-APIC-edge      rtc
   9:          0   IO-APIC-fasteoi   acpi
  15:      16443   IO-APIC-edge      ide1
  16:       2166   IO-APIC-fasteoi   eth0
  17:        840   IO-APIC-fasteoi   eth1
  18:        840   IO-APIC-fasteoi   eth2
  19:        840   IO-APIC-fasteoi   eth3
  20:          0   IO-APIC-fasteoi   Dta1xx
  21:          0   IO-APIC-fasteoi   Dta1xx
NMI:       2895
LOC:     168445
ERR:          0
MIS:          0

            CPU0
   0:         99   IO-APIC-edge      timer
   4:       3822   IO-APIC-edge      serial
   8:          1   IO-APIC-edge      rtc
   9:          0   IO-APIC-fasteoi   acpi
  15:      16467   IO-APIC-edge      ide1
  16:       2173   IO-APIC-fasteoi   eth0
  17:        845   IO-APIC-fasteoi   eth1
  18:        845   IO-APIC-fasteoi   eth2
  19:        845   IO-APIC-fasteoi   eth3
  20:          0   IO-APIC-fasteoi   Dta1xx
  21:          0   IO-APIC-fasteoi   Dta1xx
NMI:       2895
LOC:     169448
ERR:          0
MIS:          0

Does this mean the NMI watchdog is not working correctly on my system?

# dmesg | grep NMI
ACPI: LAPIC_NMI (acpi_id[0x00] high edge lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x01] high edge lint[0x1])
Testing NMI watchdog ... OK.
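
The before/after comparison above can be automated with a short script. This is a minimal sketch assuming a single-CPU machine, so only the first count column is read; `read_counter` is a hypothetical helper, and the two snapshots are abbreviated copies of the output shown above.

```python
def read_counter(snapshot, label):
    """Return the CPU0 count for a row such as 'NMI' or 'LOC'."""
    for line in snapshot.splitlines():
        name, sep, rest = line.partition(":")
        # The header line has no colon, so sep is empty and it is skipped.
        if sep and name.strip() == label:
            return int(rest.split()[0])
    raise KeyError(label)


before = """\
           CPU0
  0:         99   IO-APIC-edge      timer
NMI:       2895
LOC:     168445
"""
after = """\
           CPU0
  0:         99   IO-APIC-edge      timer
NMI:       2895
LOC:     169448
"""
nmi_delta = read_counter(after, "NMI") - read_counter(before, "NMI")
loc_delta = read_counter(after, "LOC") - read_counter(before, "LOC")
```

Here `nmi_delta` is 0 while `loc_delta` is 1003 over the 10-second sleep, i.e. the local APIC timer is ticking about 100 times per second but no NMI arrived in that window.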

Regards.


* Re: NMI Watchdog
  2003-11-14 10:12 NMI Watchdog Maciej Zenczykowski
@ 2003-11-14 10:29 ` Mikael Pettersson
  0 siblings, 0 replies; 15+ messages in thread
From: Mikael Pettersson @ 2003-11-14 10:29 UTC (permalink / raw)
  To: Maciej Zenczykowski; +Cc: Linux Kernel Mailing List

Maciej Zenczykowski writes:
 > How do I go about getting the NMI Watchdog to work on a Celeron Mendocino
 > 400 MHz with no local APIC (nmi_watchdog=1/2 doesn't work, same kernel
 > works [/proc/interrupts shows NMIs occurring 1/sec] on a 1GHz P3 with local
 > APIC)

The NMI watchdog requires a working local APIC.
So you can't, unless you upgrade the CPU.


* NMI Watchdog
@ 2003-11-14 10:12 Maciej Zenczykowski
  2003-11-14 10:29 ` Mikael Pettersson
  0 siblings, 1 reply; 15+ messages in thread
From: Maciej Zenczykowski @ 2003-11-14 10:12 UTC (permalink / raw)
  To: Linux Kernel Mailing List

Hi,

How do I go about getting the NMI Watchdog to work on a Celeron Mendocino
400 MHz with no local APIC (nmi_watchdog=1/2 doesn't work, same kernel
works [/proc/interrupts shows NMIs occurring 1/sec] on a 1GHz P3 with local
APIC)

Thanks,
MaZe.


