linux-kernel.vger.kernel.org archive mirror
* [PATCH] watchdog: update documentation
@ 2012-02-08  9:06 Fernando Luis Vázquez Cao
  2012-02-08 18:34 ` Randy Dunlap
  2012-02-08 18:43 ` Don Zickus
  0 siblings, 2 replies; 9+ messages in thread
From: Fernando Luis Vázquez Cao @ 2012-02-08  9:06 UTC (permalink / raw)
  To: Ingo Molnar, dzickus, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 130 bytes --]

The soft and hard lockup detectors are now built on top of the hrtimer
and perf subsystems. Update the documentation accordingly.

[-- Attachment #2: lockup-watchdogs.patch --]
[-- Type: text/x-patch, Size: 8467 bytes --]

Subject: [PATCH] watchdog: update documentation

From: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>

The soft and hard lockup detectors are now built on top of the hrtimer
and perf subsystems. Update the documentation accordingly.

Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
---

diff -urNp linux-3.2.5-orig/Documentation/lockup-watchdogs.txt linux-3.2.5/Documentation/lockup-watchdogs.txt
--- linux-3.2.5-orig/Documentation/lockup-watchdogs.txt	1970-01-01 09:00:00.000000000 +0900
+++ linux-3.2.5/Documentation/lockup-watchdogs.txt	2012-02-08 17:48:35.219915509 +0900
@@ -0,0 +1,64 @@
+===============================================================
+Softlockup detector and hardlockup detector (aka nmi_watchdog)
+===============================================================
+
+The Linux kernel can act as a watchdog to detect both soft and hard
+lockups.
+
+A 'softlockup' is defined as a bug that causes the kernel to loop in
+kernel mode for more than 20 seconds (see "Implementation details"
+below for details), without giving other tasks a chance to run. The
+current stack trace is displayed upon detection and, by default, the
+system will stay locked up. Alternatively, the kernel can be
+configured to panic; a sysctl, "kernel.softlockup_panic", a kernel
+parameter, "softlockup_panic" (see
+"Documentation/kernel-parameters.txt" for details), and a compile
+option, "BOOTPARAM_SOFTLOCKUP_PANIC", are provided for this.
+
+A 'hardlockup' is defined as a bug that causes the CPU to loop in
+kernel mode for more than 10 seconds (see "Implementation details"
+below for details), without letting other interrupts have a chance to
+run.  Similarly to the softlockup case, the current stack trace is
+displayed upon detection and the system will stay locked up unless the
+default behavior is changed, which can be done through a compile time
+knob, "BOOTPARAM_HARDLOCKUP_PANIC", and a kernel parameter,
+"nmi_watchdog" (see "Documentation/kernel-parameters.txt" for
+details).
+
+The panic option can be used in combination with panic_timeout (this
+timeout is set through the confusingly named "kernel.panic" sysctl),
+to cause the system to reboot automatically after a specified amount
+of time.
+
+=== Implementation details ===
+
+The soft and hard lockup detectors are built on top of the hrtimer and
+perf subsystems, respectively. A direct consequence of this is that,
+in principle, they should work in any architecture where these
+subsystems are present.
+
+A periodic hrtimer runs to generate interrupts and kick the watchdog
+task. An NMI perf event is generated every "watchdog_thresh"
+(compile-time initialized to 10 and configurable through sysctl of the
+same name) seconds to check for hardlockups. If any CPU in the system
+does not receive any hrtimer interrupt during that time the
+'hardlockup detector' (the handler for the NMI perf event) will
+generate a kernel warning or call panic, depending on the
+configuration.
+
+The watchdog task is a high priority kernel thread that updates a
+timestamp every time it is scheduled. If that timestamp is not updated
+for 2*watchdog_thresh seconds (the softlockup threshold) the
+'softlockup detector' (coded inside the hrtimer callback function)
+will dump useful debug information to the system log, after which it
+will call panic if it was instructed to do so or resume execution of
+other kernel code.
+
+The period of the hrtimer is 2*watchdog_thresh/5, which means it has
+two or three chances two generate an interrupt before the hardware
+detector kicks in.
+
+As explained above, a kernel knob is provided that allows
+administrators to configure the period of the hrtimer and the perf
+event. The right value for a particular environment is a trade-off
+between fast response to lockups and detection overhead.
diff -urNp linux-3.2.5-orig/Documentation/nmi_watchdog.txt linux-3.2.5/Documentation/nmi_watchdog.txt
--- linux-3.2.5-orig/Documentation/nmi_watchdog.txt	2012-01-05 08:55:44.000000000 +0900
+++ linux-3.2.5/Documentation/nmi_watchdog.txt	1970-01-01 09:00:00.000000000 +0900
@@ -1,83 +0,0 @@
-
-[NMI watchdog is available for x86 and x86-64 architectures]
-
-Is your system locking up unpredictably? No keyboard activity, just
-a frustrating complete hard lockup? Do you want to help us debugging
-such lockups? If all yes then this document is definitely for you.
-
-On many x86/x86-64 type hardware there is a feature that enables
-us to generate 'watchdog NMI interrupts'.  (NMI: Non Maskable Interrupt
-which get executed even if the system is otherwise locked up hard).
-This can be used to debug hard kernel lockups.  By executing periodic
-NMI interrupts, the kernel can monitor whether any CPU has locked up,
-and print out debugging messages if so.
-
-In order to use the NMI watchdog, you need to have APIC support in your
-kernel. For SMP kernels, APIC support gets compiled in automatically. For
-UP, enable either CONFIG_X86_UP_APIC (Processor type and features -> Local
-APIC support on uniprocessors) or CONFIG_X86_UP_IOAPIC (Processor type and
-features -> IO-APIC support on uniprocessors) in your kernel config.
-CONFIG_X86_UP_APIC is for uniprocessor machines without an IO-APIC.
-CONFIG_X86_UP_IOAPIC is for uniprocessor with an IO-APIC. [Note: certain
-kernel debugging options, such as Kernel Stack Meter or Kernel Tracer,
-may implicitly disable the NMI watchdog.]
-
-For x86-64, the needed APIC is always compiled in.
-
-Using local APIC (nmi_watchdog=2) needs the first performance register, so
-you can't use it for other purposes (such as high precision performance
-profiling.) However, at least oprofile and the perfctr driver disable the
-local APIC NMI watchdog automatically.
-
-To actually enable the NMI watchdog, use the 'nmi_watchdog=N' boot
-parameter.  Eg. the relevant lilo.conf entry:
-
-        append="nmi_watchdog=1"
-
-For SMP machines and UP machines with an IO-APIC use nmi_watchdog=1.
-For UP machines without an IO-APIC use nmi_watchdog=2, this only works
-for some processor types.  If in doubt, boot with nmi_watchdog=1 and
-check the NMI count in /proc/interrupts; if the count is zero then
-reboot with nmi_watchdog=2 and check the NMI count.  If it is still
-zero then log a problem, you probably have a processor that needs to be
-added to the nmi code.
-
-A 'lockup' is the following scenario: if any CPU in the system does not
-execute the period local timer interrupt for more than 5 seconds, then
-the NMI handler generates an oops and kills the process. This
-'controlled crash' (and the resulting kernel messages) can be used to
-debug the lockup. Thus whenever the lockup happens, wait 5 seconds and
-the oops will show up automatically. If the kernel produces no messages
-then the system has crashed so hard (eg. hardware-wise) that either it
-cannot even accept NMI interrupts, or the crash has made the kernel
-unable to print messages.
-
-Be aware that when using local APIC, the frequency of NMI interrupts
-it generates, depends on the system load. The local APIC NMI watchdog,
-lacking a better source, uses the "cycles unhalted" event. As you may
-guess it doesn't tick when the CPU is in the halted state (which happens
-when the system is idle), but if your system locks up on anything but the
-"hlt" processor instruction, the watchdog will trigger very soon as the
-"cycles unhalted" event will happen every clock tick. If it locks up on
-"hlt", then you are out of luck -- the event will not happen at all and the
-watchdog won't trigger. This is a shortcoming of the local APIC watchdog
--- unfortunately there is no "clock ticks" event that would work all the
-time. The I/O APIC watchdog is driven externally and has no such shortcoming.
-But its NMI frequency is much higher, resulting in a more significant hit
-to the overall system performance.
-
-On x86 nmi_watchdog is disabled by default so you have to enable it with
-a boot time parameter.
-
-It's possible to disable the NMI watchdog in run-time by writing "0" to
-/proc/sys/kernel/nmi_watchdog. Writing "1" to the same file will re-enable
-the NMI watchdog. Notice that you still need to use "nmi_watchdog=" parameter
-at boot time.
-
-NOTE: In kernels prior to 2.4.2-ac18 the NMI-oopser is enabled unconditionally
-on x86 SMP boxes.
-
-[ feel free to send bug reports, suggestions and patches to
-  Ingo Molnar <mingo@redhat.com> or the Linux SMP mailing
-  list at <linux-smp@vger.kernel.org> ]
-
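
Note on the "Implementation details" section of the new file: the
interplay between the hrtimer, the watchdog task and the NMI handler
can be illustrated with a toy Python model of a single CPU. This is
only a sketch under stated assumptions, not kernel code: the class
ToyCpu and its method names are hypothetical, time is passed in as
plain seconds, and the real detectors are C code driven by hrtimer
and perf NMI events.

```python
class ToyCpu:
    """Toy model of one CPU's lockup bookkeeping (illustration only)."""

    def __init__(self, watchdog_thresh=10):
        self.watchdog_thresh = watchdog_thresh
        self.hrtimer_interrupts = 0  # bumped by every hrtimer interrupt
        self.seen_by_nmi = 0         # counter value at the previous NMI
        self.touch_ts = 0            # last time the watchdog task ran

    def watchdog_task_runs(self, now):
        # The watchdog task updates a timestamp whenever it is scheduled.
        self.touch_ts = now

    def hrtimer_fires(self, now):
        # Softlockup check, done from the hrtimer callback: the timestamp
        # has been stale for more than 2*watchdog_thresh seconds.
        self.hrtimer_interrupts += 1
        return now - self.touch_ts > 2 * self.watchdog_thresh

    def nmi_fires(self):
        # Hardlockup check, done from the NMI handler: no hrtimer
        # interrupt was counted since the previous NMI.
        stuck = self.hrtimer_interrupts == self.seen_by_nmi
        self.seen_by_nmi = self.hrtimer_interrupts
        return stuck


cpu = ToyCpu(watchdog_thresh=10)
cpu.watchdog_task_runs(now=0)
print(cpu.hrtimer_fires(now=4))   # False: timestamp is only 4s old
print(cpu.nmi_fires())            # False: an hrtimer interrupt was seen
print(cpu.nmi_fires())            # True: none since the previous NMI
print(cpu.hrtimer_fires(now=24))  # True: timestamp is 24s old (> 20s)
```

The model keeps the two checks separate on purpose: the softlockup
check lives in the timer callback and the hardlockup check in the NMI
handler, which mirrors the split described in the text.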


* Re: [PATCH] watchdog: update documentation
  2012-02-08  9:06 [PATCH] watchdog: update documentation Fernando Luis Vázquez Cao
@ 2012-02-08 18:34 ` Randy Dunlap
  2012-02-08 18:43 ` Don Zickus
  1 sibling, 0 replies; 9+ messages in thread
From: Randy Dunlap @ 2012-02-08 18:34 UTC (permalink / raw)
  To: Fernando Luis Vázquez Cao; +Cc: Ingo Molnar, dzickus, linux-kernel

On 02/08/2012 01:06 AM, Fernando Luis Vázquez Cao wrote:
> The soft and hard lockup detectors are now built on top of the hrtimer
> and perf subsystems. Update the documentation accordingly.


Correction:

+The period of the hrtimer is 2*watchdog_thresh/5, which means it has
+two or three chances two generate an interrupt before the hardware

                      to

+detector kicks in.



-- 
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***


* Re: [PATCH] watchdog: update documentation
  2012-02-08  9:06 [PATCH] watchdog: update documentation Fernando Luis Vázquez Cao
  2012-02-08 18:34 ` Randy Dunlap
@ 2012-02-08 18:43 ` Don Zickus
  2012-02-09  1:01   ` Fernando Luis Vázquez Cao
  2012-02-09  7:43   ` [PATCH] watchdog: update documentation Ingo Molnar
  1 sibling, 2 replies; 9+ messages in thread
From: Don Zickus @ 2012-02-08 18:43 UTC (permalink / raw)
  To: Fernando Luis Vázquez Cao; +Cc: Ingo Molnar, linux-kernel

On Wed, Feb 08, 2012 at 06:06:41PM +0900, Fernando Luis Vázquez Cao wrote:
> The soft and hard lockup detectors are now built on top of the hrtimer
> and perf subsystems. Update the documentation accordingly.

> Subject: [PATCH] watchdog: update documentation
> 
> From: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
> 
> The soft and hard lockup detectors are now built on top of the hrtimer
> and perf subsystems. Update the documentation accordingly.

I am fine with this (just need to address Randy's little fixup).

Ingo, should I take this in and repost for you or just ack it and let
someone like Andrew take this in?

Cheers,
Don




* Re: [PATCH] watchdog: update documentation
  2012-02-08 18:43 ` Don Zickus
@ 2012-02-09  1:01   ` Fernando Luis Vázquez Cao
  2012-02-09  1:42     ` [PATCH] watchdog: update Kconfig entries Fernando Luis Vázquez Cao
  2012-02-09  7:43   ` [PATCH] watchdog: update documentation Ingo Molnar
  1 sibling, 1 reply; 9+ messages in thread
From: Fernando Luis Vázquez Cao @ 2012-02-09  1:01 UTC (permalink / raw)
  To: Don Zickus; +Cc: Ingo Molnar, linux-kernel, Andrew Morton

[-- Attachment #1: Type: text/plain, Size: 592 bytes --]

On 02/09/2012 03:43 AM, Don Zickus wrote:
> On Wed, Feb 08, 2012 at 06:06:41PM +0900, Fernando Luis Vázquez Cao wrote:
>> Subject: [PATCH] watchdog: update documentation
>>
>> From: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
>>
>> The soft and hard lockup detectors are now built on top of the hrtimer
>> and perf subsystems. Update the documentation accordingly.
> I am fine with this (just need to address Randy's little fixup).
>
> Ingo, should I take this in and repost for you or just ack it and let
> someone like Andrew take this in?

Fixed version attached.

Thanks,
Fernando

[-- Attachment #2: lockup-watchdogs.patch --]
[-- Type: text/x-patch, Size: 8441 bytes --]

Subject: [PATCH] watchdog: update documentation

From: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>

The soft and hard lockup detectors are now built on top of the hrtimer
and perf subsystems. Update the documentation accordingly.

Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
---

diff -urNp linux-3.2.5-orig/Documentation/lockup-watchdogs.txt linux-3.2.5/Documentation/lockup-watchdogs.txt
--- linux-3.2.5-orig/Documentation/lockup-watchdogs.txt	1970-01-01 09:00:00.000000000 +0900
+++ linux-3.2.5/Documentation/lockup-watchdogs.txt	2012-02-09 09:53:24.922346977 +0900
@@ -0,0 +1,63 @@
+===============================================================
+Softlockup detector and hardlockup detector (aka nmi_watchdog)
+===============================================================
+
+The Linux kernel can act as a watchdog to detect both soft and hard
+lockups.
+
+A 'softlockup' is defined as a bug that causes the kernel to loop in
+kernel mode for more than 20 seconds (see "Implementation" below for
+details), without giving other tasks a chance to run. The current
+stack trace is displayed upon detection and, by default, the system
+will stay locked up. Alternatively, the kernel can be configured to
+panic; a sysctl, "kernel.softlockup_panic", a kernel parameter,
+"softlockup_panic" (see "Documentation/kernel-parameters.txt" for
+details), and a compile option, "BOOTPARAM_SOFTLOCKUP_PANIC", are
+provided for this.
+
+A 'hardlockup' is defined as a bug that causes the CPU to loop in
+kernel mode for more than 10 seconds (see "Implementation" below for
+details), without letting other interrupts have a chance to run.
+Similarly to the softlockup case, the current stack trace is displayed
+upon detection and the system will stay locked up unless the default
+behavior is changed, which can be done through a compile time knob,
+"BOOTPARAM_HARDLOCKUP_PANIC", and a kernel parameter, "nmi_watchdog"
+(see "Documentation/kernel-parameters.txt" for details).
+
+The panic option can be used in combination with panic_timeout (this
+timeout is set through the confusingly named "kernel.panic" sysctl),
+to cause the system to reboot automatically after a specified amount
+of time.
+
+=== Implementation ===
+
+The soft and hard lockup detectors are built on top of the hrtimer and
+perf subsystems, respectively. A direct consequence of this is that,
+in principle, they should work on any architecture where these
+subsystems are present.
+
+A periodic hrtimer runs to generate interrupts and kick the watchdog
+task. An NMI perf event is generated every "watchdog_thresh"
+(compile-time initialized to 10 and configurable through sysctl of the
+same name) seconds to check for hardlockups. If any CPU in the system
+does not receive any hrtimer interrupt during that time the
+'hardlockup detector' (the handler for the NMI perf event) will
+generate a kernel warning or call panic, depending on the
+configuration.
+
+The watchdog task is a high priority kernel thread that updates a
+timestamp every time it is scheduled. If that timestamp is not updated
+for 2*watchdog_thresh seconds (the softlockup threshold) the
+'softlockup detector' (coded inside the hrtimer callback function)
+will dump useful debug information to the system log, after which it
+will call panic if it was instructed to do so or resume execution of
+other kernel code.
+
+The period of the hrtimer is 2*watchdog_thresh/5, which means it has
+two or three chances to generate an interrupt before the hardlockup
+detector kicks in.
+
+As explained above, a kernel knob is provided that allows
+administrators to configure the period of the hrtimer and the perf
+event. The right value for a particular environment is a trade-off
+between fast response to lockups and detection overhead.
diff -urNp linux-3.2.5-orig/Documentation/nmi_watchdog.txt linux-3.2.5/Documentation/nmi_watchdog.txt
--- linux-3.2.5-orig/Documentation/nmi_watchdog.txt	2012-01-05 08:55:44.000000000 +0900
+++ linux-3.2.5/Documentation/nmi_watchdog.txt	1970-01-01 09:00:00.000000000 +0900
@@ -1,83 +0,0 @@
-
-[NMI watchdog is available for x86 and x86-64 architectures]
-
-Is your system locking up unpredictably? No keyboard activity, just
-a frustrating complete hard lockup? Do you want to help us debugging
-such lockups? If all yes then this document is definitely for you.
-
-On many x86/x86-64 type hardware there is a feature that enables
-us to generate 'watchdog NMI interrupts'.  (NMI: Non Maskable Interrupt
-which get executed even if the system is otherwise locked up hard).
-This can be used to debug hard kernel lockups.  By executing periodic
-NMI interrupts, the kernel can monitor whether any CPU has locked up,
-and print out debugging messages if so.
-
-In order to use the NMI watchdog, you need to have APIC support in your
-kernel. For SMP kernels, APIC support gets compiled in automatically. For
-UP, enable either CONFIG_X86_UP_APIC (Processor type and features -> Local
-APIC support on uniprocessors) or CONFIG_X86_UP_IOAPIC (Processor type and
-features -> IO-APIC support on uniprocessors) in your kernel config.
-CONFIG_X86_UP_APIC is for uniprocessor machines without an IO-APIC.
-CONFIG_X86_UP_IOAPIC is for uniprocessor with an IO-APIC. [Note: certain
-kernel debugging options, such as Kernel Stack Meter or Kernel Tracer,
-may implicitly disable the NMI watchdog.]
-
-For x86-64, the needed APIC is always compiled in.
-
-Using local APIC (nmi_watchdog=2) needs the first performance register, so
-you can't use it for other purposes (such as high precision performance
-profiling.) However, at least oprofile and the perfctr driver disable the
-local APIC NMI watchdog automatically.
-
-To actually enable the NMI watchdog, use the 'nmi_watchdog=N' boot
-parameter.  Eg. the relevant lilo.conf entry:
-
-        append="nmi_watchdog=1"
-
-For SMP machines and UP machines with an IO-APIC use nmi_watchdog=1.
-For UP machines without an IO-APIC use nmi_watchdog=2, this only works
-for some processor types.  If in doubt, boot with nmi_watchdog=1 and
-check the NMI count in /proc/interrupts; if the count is zero then
-reboot with nmi_watchdog=2 and check the NMI count.  If it is still
-zero then log a problem, you probably have a processor that needs to be
-added to the nmi code.
-
-A 'lockup' is the following scenario: if any CPU in the system does not
-execute the period local timer interrupt for more than 5 seconds, then
-the NMI handler generates an oops and kills the process. This
-'controlled crash' (and the resulting kernel messages) can be used to
-debug the lockup. Thus whenever the lockup happens, wait 5 seconds and
-the oops will show up automatically. If the kernel produces no messages
-then the system has crashed so hard (eg. hardware-wise) that either it
-cannot even accept NMI interrupts, or the crash has made the kernel
-unable to print messages.
-
-Be aware that when using local APIC, the frequency of NMI interrupts
-it generates, depends on the system load. The local APIC NMI watchdog,
-lacking a better source, uses the "cycles unhalted" event. As you may
-guess it doesn't tick when the CPU is in the halted state (which happens
-when the system is idle), but if your system locks up on anything but the
-"hlt" processor instruction, the watchdog will trigger very soon as the
-"cycles unhalted" event will happen every clock tick. If it locks up on
-"hlt", then you are out of luck -- the event will not happen at all and the
-watchdog won't trigger. This is a shortcoming of the local APIC watchdog
--- unfortunately there is no "clock ticks" event that would work all the
-time. The I/O APIC watchdog is driven externally and has no such shortcoming.
-But its NMI frequency is much higher, resulting in a more significant hit
-to the overall system performance.
-
-On x86 nmi_watchdog is disabled by default so you have to enable it with
-a boot time parameter.
-
-It's possible to disable the NMI watchdog in run-time by writing "0" to
-/proc/sys/kernel/nmi_watchdog. Writing "1" to the same file will re-enable
-the NMI watchdog. Notice that you still need to use "nmi_watchdog=" parameter
-at boot time.
-
-NOTE: In kernels prior to 2.4.2-ac18 the NMI-oopser is enabled unconditionally
-on x86 SMP boxes.
-
-[ feel free to send bug reports, suggestions and patches to
-  Ingo Molnar <mingo@redhat.com> or the Linux SMP mailing
-  list at <linux-smp@vger.kernel.org> ]
-
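
Note on the numbers in the new file: the defaults it quotes follow
from watchdog_thresh by simple arithmetic. A minimal sketch, assuming
only the relations stated in the patch text (hardlockup threshold =
watchdog_thresh, softlockup threshold = 2*watchdog_thresh, hrtimer
period = 2*watchdog_thresh/5). The function name watchdog_timings and
the dictionary keys are made up for illustration.

```python
import math
from fractions import Fraction


def watchdog_timings(watchdog_thresh=10):
    """Derive the timings described in the document, in seconds."""
    if watchdog_thresh <= 0:
        raise ValueError("watchdog_thresh must be positive in this sketch")
    period = Fraction(2 * watchdog_thresh, 5)
    # Number of hrtimer periods that fit in one hardlockup window.
    # This ratio is always 5/2, hence "two or three chances".
    ticks = Fraction(watchdog_thresh) / period
    return {
        "hardlockup_thresh": watchdog_thresh,
        "softlockup_thresh": 2 * watchdog_thresh,
        "hrtimer_period": period,
        "hrtimer_chances": (math.floor(ticks), math.ceil(ticks)),
    }


t = watchdog_timings(10)
print(t["softlockup_thresh"])  # 20
print(t["hrtimer_period"])     # 4
print(t["hrtimer_chances"])    # (2, 3)
```

Fraction is used so that thresholds that are not multiples of 5 still
give an exact period instead of a rounded float.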


* [PATCH] watchdog: update Kconfig entries
  2012-02-09  1:01   ` Fernando Luis Vázquez Cao
@ 2012-02-09  1:42     ` Fernando Luis Vázquez Cao
  2012-02-09  3:04       ` [PATCH] watchdog: fix code/comments mismatches Fernando Luis Vázquez Cao
  0 siblings, 1 reply; 9+ messages in thread
From: Fernando Luis Vázquez Cao @ 2012-02-09  1:42 UTC (permalink / raw)
  To: Don Zickus; +Cc: Ingo Molnar, linux-kernel, Andrew Morton

[-- Attachment #1: Type: text/plain, Size: 69 bytes --]

Hi Don,

Could you pick up the attached patch too?

Thanks,
Fernando

[-- Attachment #2: watchdog-kconfig.patch --]
[-- Type: text/x-patch, Size: 2764 bytes --]

Subject: [PATCH] watchdog: update Kconfig entries

From: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>

The soft and hard lockup thresholds have changed so the corresponding Kconfig
entries need to be updated accordingly. Add a reference to watchdog_thresh
while at it.

Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
---

diff -urNp linux-3.2.5-orig/lib/Kconfig.debug linux-3.2.5/lib/Kconfig.debug
--- linux-3.2.5-orig/lib/Kconfig.debug	2012-01-05 08:55:44.000000000 +0900
+++ linux-3.2.5/lib/Kconfig.debug	2012-02-09 10:30:06.781625497 +0900
@@ -166,18 +166,21 @@ config LOCKUP_DETECTOR
 	  hard and soft lockups.
 
 	  Softlockups are bugs that cause the kernel to loop in kernel
-	  mode for more than 60 seconds, without giving other tasks a
+	  mode for more than 20 seconds, without giving other tasks a
 	  chance to run.  The current stack trace is displayed upon
 	  detection and the system will stay locked up.
 
 	  Hardlockups are bugs that cause the CPU to loop in kernel mode
-	  for more than 60 seconds, without letting other interrupts have a
+	  for more than 10 seconds, without letting other interrupts have a
 	  chance to run.  The current stack trace is displayed upon detection
 	  and the system will stay locked up.
 
 	  The overhead should be minimal.  A periodic hrtimer runs to
-	  generate interrupts and kick the watchdog task every 10-12 seconds.
-	  An NMI is generated every 60 seconds or so to check for hardlockups.
+	  generate interrupts and kick the watchdog task every 4 seconds.
+	  An NMI is generated every 10 seconds or so to check for hardlockups.
+
+	  The frequency of hrtimer and NMI events and the soft and hard lockup
+	  thresholds can be controlled through the sysctl watchdog_thresh.
 
 config HARDLOCKUP_DETECTOR
 	def_bool LOCKUP_DETECTOR && PERF_EVENTS && HAVE_PERF_EVENTS_NMI && \
@@ -189,7 +192,8 @@ config BOOTPARAM_HARDLOCKUP_PANIC
 	help
 	  Say Y here to enable the kernel to panic on "hard lockups",
 	  which are bugs that cause the kernel to loop in kernel
-	  mode with interrupts disabled for more than 60 seconds.
+	  mode with interrupts disabled for more than 10 seconds (configurable
+	  using the watchdog_thresh sysctl).
 
 	  Say N if unsure.
 
@@ -206,8 +210,8 @@ config BOOTPARAM_SOFTLOCKUP_PANIC
 	help
 	  Say Y here to enable the kernel to panic on "soft lockups",
 	  which are bugs that cause the kernel to loop in kernel
-	  mode for more than 60 seconds, without giving other tasks a
-	  chance to run.
+	  mode for more than 20 seconds (configurable using the watchdog_thresh
+	  sysctl), without giving other tasks a chance to run.
 
 	  The panic can be used in combination with panic_timeout,
 	  to cause the system to reboot automatically after a

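For readers following the thresholds in the help text above, here is a rough sketch in C of how the documented limits relate to the watchdog_thresh sysctl. It assumes the value is exposed as /proc/sys/kernel/watchdog_thresh (the usual procfs location for kernel sysctls) and that, as the patch states, the hard lockup threshold equals watchdog_thresh while the soft lockup threshold is twice that. The struct and function names are made up for illustration and are not kernel APIs.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper type: thresholds derived from watchdog_thresh. */
struct lockup_limits {
	int hard_secs;	/* hard lockup threshold, in seconds */
	int soft_secs;	/* soft lockup threshold, in seconds */
};

/*
 * Parse the text read from the sysctl file (for example "10\n").
 * Returns the value in seconds, or -1 if the text is not a
 * non-negative decimal number that can safely be doubled.
 */
static int parse_watchdog_thresh(const char *text)
{
	char *end;
	long val = strtol(text, &end, 10);

	if (end == text || val < 0 || val > INT_MAX / 2)
		return -1;
	return (int)val;
}

/* Assumption taken from the help text: hard = thresh, soft = 2 * thresh. */
static struct lockup_limits limits_from_thresh(int thresh)
{
	struct lockup_limits l;

	assert(thresh >= 0 && thresh <= INT_MAX / 2);
	l.hard_secs = thresh;
	l.soft_secs = 2 * thresh;
	return l;
}

/* Read the sysctl from the given path; returns -1 if it cannot be read. */
static int read_watchdog_thresh(const char *path)
{
	char buf[32];
	FILE *f = fopen(path, "r");
	int thresh = -1;

	if (!f)
		return -1;
	if (fgets(buf, sizeof(buf), f))
		thresh = parse_watchdog_thresh(buf);
	fclose(f);
	return thresh;
}
```

With a value of 10, limits_from_thresh(10) gives a hard threshold of 10 seconds and a soft threshold of 20 seconds, matching the numbers in the updated help text. A caller would typically pass "/proc/sys/kernel/watchdog_thresh" to read_watchdog_thresh() and treat -1 as "not available".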

* [PATCH] watchdog: fix code/comments mismatches
  2012-02-09  1:42     ` [PATCH] watchdog: update Kconfig entries Fernando Luis Vázquez Cao
@ 2012-02-09  3:04       ` Fernando Luis Vázquez Cao
  0 siblings, 0 replies; 9+ messages in thread
From: Fernando Luis Vázquez Cao @ 2012-02-09  3:04 UTC (permalink / raw)
  To: Don Zickus; +Cc: Ingo Molnar, linux-kernel, Andrew Morton

[-- Attachment #1: Type: text/plain, Size: 107 bytes --]

Hi Don,

This is the last of the documentation fixes.
I am preparing some bug fixes now.

Thanks,
Fernando

[-- Attachment #2: watchdog-ccomments.patch --]
[-- Type: text/x-patch, Size: 2536 bytes --]

Subject: [PATCH] watchdog: fix code/comments mismatches

From: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>

Reflect the change in the soft and hard lockup thresholds and their relation to
the frequency of the hrtimer and NMI events in the code comments. While at it,
remove references to files that do not exist anymore.

Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
---

diff -urNp linux-3.2.5-orig/kernel/watchdog.c linux-3.2.5/kernel/watchdog.c
--- linux-3.2.5-orig/kernel/watchdog.c	2012-01-05 08:55:44.000000000 +0900
+++ linux-3.2.5/kernel/watchdog.c	2012-02-09 11:39:21.712084495 +0900
@@ -3,12 +3,9 @@
  *
  * started by Don Zickus, Copyright (C) 2010 Red Hat, Inc.
  *
- * this code detects hard lockups: incidents in where on a CPU
- * the kernel does not respond to anything except NMI.
- *
- * Note: Most of this code is borrowed heavily from softlockup.c,
- * so thanks to Ingo for the initial implementation.
- * Some chunks also taken from arch/x86/kernel/apic/nmi.c, thanks
+ * Note: Most of this code is borrowed heavily from the original softlockup
+ * detector, so thanks to Ingo for the initial implementation.
+ * Some chunks also taken from the old x86-specific nmi watchdog code, thanks
  * to those contributors as well.
  */
 
@@ -116,10 +113,11 @@ static unsigned long get_timestamp(int t
 static unsigned long get_sample_period(void)
 {
 	/*
-	 * convert watchdog_thresh from seconds to ns
-	 * the divide by 5 is to give hrtimer 5 chances to
-	 * increment before the hardlockup detector generates
-	 * a warning
+	 * convert watchdog_thresh from seconds to ns; the
+	 * divide by 5 is to give hrtimer several chances (two
+	 * or three with the current relation between the soft
+	 * and hard thresholds) to increment before the
+	 * hardlockup detector generates a warning
 	 */
 	return get_softlockup_thresh() * (NSEC_PER_SEC / 5);
 }
@@ -336,9 +334,11 @@ static int watchdog(void *unused)
 
 	set_current_state(TASK_INTERRUPTIBLE);
 	/*
-	 * Run briefly once per second to reset the softlockup timestamp.
-	 * If this gets delayed for more than 60 seconds then the
-	 * debug-printout triggers in watchdog_timer_fn().
+	 * Run briefly (kicked by the hrtimer callback function) once every
+	 * get_sample_period() nanoseconds (4 seconds by default) to reset the
+	 * softlockup timestamp. If this gets delayed for more than
+	 * 2*watchdog_thresh seconds then the debug-printout triggers in
+	 * watchdog_timer_fn().
 	 */
 	while (!kthread_should_stop()) {
 		__touch_watchdog();

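The arithmetic behind the get_sample_period() comment can be sketched in userspace C as follows. This is an illustration only, not the kernel code: NSEC_PER_SEC is redefined locally, the helper names are hypothetical, and the soft threshold is assumed to be 2*watchdog_thresh, as the comment in watchdog() above says.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL	/* local copy for this sketch */

/* Assumed relation from the patch: soft threshold = 2 * watchdog_thresh. */
static uint64_t softlockup_thresh_secs(uint64_t watchdog_thresh)
{
	return 2 * watchdog_thresh;
}

/* Same expression as in get_sample_period(): thresh * (NSEC_PER_SEC / 5). */
static uint64_t sample_period_ns(uint64_t watchdog_thresh)
{
	return softlockup_thresh_secs(watchdog_thresh) * (NSEC_PER_SEC / 5);
}

/*
 * Number of whole hrtimer periods that fit into one hard lockup
 * window of watchdog_thresh seconds.
 */
static uint64_t hrtimer_periods_per_hard_window(uint64_t watchdog_thresh)
{
	assert(watchdog_thresh > 0);	/* avoid dividing by a zero period */
	return (watchdog_thresh * NSEC_PER_SEC) /
	       sample_period_ns(watchdog_thresh);
}
```

For watchdog_thresh = 10 this gives a sample period of 4,000,000,000 ns (4 seconds). Since 10 s / 4 s = 2.5, two or three hrtimer interrupts fall inside each hard lockup window depending on phase, which is in line with the rewritten comment.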

* Re: [PATCH] watchdog: update documentation
  2012-02-08 18:43 ` Don Zickus
  2012-02-09  1:01   ` Fernando Luis Vázquez Cao
@ 2012-02-09  7:43   ` Ingo Molnar
  2012-02-09 14:53     ` Don Zickus
  2012-02-09 15:58     ` Randy Dunlap
  1 sibling, 2 replies; 9+ messages in thread
From: Ingo Molnar @ 2012-02-09  7:43 UTC (permalink / raw)
  To: Don Zickus; +Cc: Fernando Luis Vázquez Cao, Ingo Molnar, linux-kernel


* Don Zickus <dzickus@redhat.com> wrote:

> On Wed, Feb 08, 2012 at 06:06:41PM +0900, Fernando Luis Vázquez Cao wrote:
> > The soft and hard lockup detectors are now built on top of the hrtimer
> > and perf subsystems. Update the documentation accordingly.
> 
> > Subject: [PATCH] watchdog: update documentation
> > 
> > From: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
> > 
> > The soft and hard lockup detectors are now built on top of the hrtimer
> > and perf subsystems. Update the documentation accordingly.
> 
> I am fine with this (just need to address Randy's little 
> fixup).
> 
> Ingo, should I take this in and repost for you or just ack it 
> and let someone like Andrew take this in?

Yeah, would be nice to have it with Randy's ack and your 
signoff, can apply it then.

Thanks,

	Ingo


* Re: [PATCH] watchdog: update documentation
  2012-02-09  7:43   ` [PATCH] watchdog: update documentation Ingo Molnar
@ 2012-02-09 14:53     ` Don Zickus
  2012-02-09 15:58     ` Randy Dunlap
  1 sibling, 0 replies; 9+ messages in thread
From: Don Zickus @ 2012-02-09 14:53 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Fernando Luis Vázquez Cao, Ingo Molnar, linux-kernel

On Thu, Feb 09, 2012 at 08:43:35AM +0100, Ingo Molnar wrote:
> 
> * Don Zickus <dzickus@redhat.com> wrote:
> 
> > On Wed, Feb 08, 2012 at 06:06:41PM +0900, Fernando Luis Vázquez Cao wrote:
> > > The soft and hard lockup detectors are now built on top of the hrtimer
> > > and perf subsystems. Update the documentation accordingly.
> > 
> > > Subject: [PATCH] watchdog: update documentation
> > > 
> > > From: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
> > > 
> > > The soft and hard lockup detectors are now built on top of the hrtimer
> > > and perf subsystems. Update the documentation accordingly.
> > 
> > I am fine with this (just need to address Randy's little 
> > fixup).
> > 
> > Ingo, should I take this in and repost for you or just ack it 
> > and let someone like Andrew take this in?
> 
> Yeah, would be nice to have it with Randy's ack and your 
> signoff, can apply it then.

Ok, thanks!

Cheers,
Don


* Re: [PATCH] watchdog: update documentation
  2012-02-09  7:43   ` [PATCH] watchdog: update documentation Ingo Molnar
  2012-02-09 14:53     ` Don Zickus
@ 2012-02-09 15:58     ` Randy Dunlap
  1 sibling, 0 replies; 9+ messages in thread
From: Randy Dunlap @ 2012-02-09 15:58 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Don Zickus, Fernando Luis Vázquez Cao, Ingo Molnar, linux-kernel

On 02/08/2012 11:43 PM, Ingo Molnar wrote:
> 
> * Don Zickus <dzickus@redhat.com> wrote:
> 
>> On Wed, Feb 08, 2012 at 06:06:41PM +0900, Fernando Luis Vázquez Cao wrote:
>>> The soft and hard lockup detectors are now built on top of the hrtimer
>>> and perf subsystems. Update the documentation accordingly.
>>
>>> Subject: [PATCH] watchdog: update documentation
>>>
>>> From: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
>>>
>>> The soft and hard lockup detectors are now built on top of the hrtimer
>>> and perf subsystems. Update the documentation accordingly.
>>
>> I am fine with this (just need to address Randy's little 
>> fixup).
>>
>> Ingo, should I take this in and repost for you or just ack it 
>> and let someone like Andrew take this in?
> 
> Yeah, would be nice to have it with Randy's ack and your 
> signoff, can apply it then.

updated patch:
Acked-by: Randy Dunlap <rdunlap@xenotime.net>

-- 
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***


Thread overview: 9+ messages
2012-02-08  9:06 [PATCH] watchdog: update documentation Fernando Luis Vázquez Cao
2012-02-08 18:34 ` Randy Dunlap
2012-02-08 18:43 ` Don Zickus
2012-02-09  1:01   ` Fernando Luis Vázquez Cao
2012-02-09  1:42     ` [PATCH] watchdog: update Kconfig entries Fernando Luis Vázquez Cao
2012-02-09  3:04       ` [PATCH] watchdog: fix code/comments mismatches Fernando Luis Vázquez Cao
2012-02-09  7:43   ` [PATCH] watchdog: update documentation Ingo Molnar
2012-02-09 14:53     ` Don Zickus
2012-02-09 15:58     ` Randy Dunlap
