[ Please keep me in CC as I'm not subscribed to the list]

Hi All,

My kernel is built with the following options:

$ cat /boot/config-5.0.1 | grep NO_HZ
CONFIG_NO_HZ_COMMON=y
CONFIG_NO_HZ_IDLE=y
# CONFIG_NO_HZ_FULL is not set
CONFIG_NO_HZ=y
CONFIG_RCU_FAST_NO_HZ=y

I booted with the watchdog enabled (nmi_watchdog=1) as given below:

BOOT_IMAGE=/boot/vmlinuz-5.0.1
root=UUID=f65454ae-3f1d-4b9e-b4be-74a29becbe1e ro debug
ignore_loglevel console=ttyUSB0,115200 console=tty0 console=tty1
console=ttyS2,115200 memmap=1M!1023M nmi_watchdog=1
crashkernel=384M-:128M

When the system freezes or the kernel locks up (in this state the
kernel does not respond to ALT-SysRq-<command key>), the watchdog is
not triggered. So I want to understand how to enable the watchdog
timer and how to verify basic watchdog functionality.

Any pointers on this will be greatly appreciated.

--
Thanks,
Sekhar
On 11/15/19 4:35 PM, Muni Sekhar wrote:
> [ Please keep me in CC as I'm not subscribed to the list]
>
> Hi All,
>
> My kernel is built with the following options:
>
> $ cat /boot/config-5.0.1 | grep NO_HZ
> CONFIG_NO_HZ_COMMON=y
> CONFIG_NO_HZ_IDLE=y
> # CONFIG_NO_HZ_FULL is not set
> CONFIG_NO_HZ=y
> CONFIG_RCU_FAST_NO_HZ=y
>
> I booted with watchdog enabled(nmi_watchdog=1) as given below:
>
> BOOT_IMAGE=/boot/vmlinuz-5.0.1
> root=UUID=f65454ae-3f1d-4b9e-b4be-74a29becbe1e ro debug
> ignore_loglevel console=ttyUSB0,115200 console=tty0 console=tty1
> console=ttyS2,115200 memmap=1M!1023M nmi_watchdog=1
> crashkernel=384M-:128M
>
> When the system is frozen or the kernel is locked up(I noticed that in
> this state kernel is not responding for ALT-SysRq-<command key>) but
> watchdog is not triggered. So I want to understand how to enable the
> watchdog timer and how to verify the basic watchdog functionality
> behavior?
>
> Any pointers on this will be greatly appreciated.
>
Sorry, I do not have an answer. Please note that you are talking about
the NMI watchdog, which is completely unrelated to hardware watchdogs
and not handled by the watchdog subsystem. I would suggest sending
your question to the Linux kernel mailing list and clearly stating
that you are talking about the NMI watchdog.
Please note that, for the NMI watchdog to do anything, you must have
CONFIG_HARDLOCKUP_DETECTOR enabled in your kernel configuration. I don't
know what if anything the configuration options you listed above have
to do with the NMI watchdog.
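
A minimal sketch of how one might check a kernel config file for the lockup-detector options, assuming the config is available as a plain-text file such as /boot/config-5.0.1. The function name check_lockup_config is a hypothetical helper, not an existing tool, and CONFIG_SOFTLOCKUP_DETECTOR is included only because it comes up later in the thread.

```shell
#!/bin/sh
# Hypothetical helper: report whether the lockup-detector options are
# set to "y" in the given kernel config file.
check_lockup_config() {
    config_file=$1
    for opt in CONFIG_HARDLOCKUP_DETECTOR CONFIG_SOFTLOCKUP_DETECTOR; do
        # Anchor both ends so CONFIG_HARDLOCKUP_DETECTOR_PERF does not match.
        if grep -q "^${opt}=y\$" "$config_file"; then
            echo "${opt}: enabled"
        else
            echo "${opt}: not enabled"
        fi
    done
}
```

It could be called as `check_lockup_config /boot/config-5.0.1`. An option reported as "not enabled" is either unset or absent from the file.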
Another possibility, of course, might be to enable a hardware watchdog
in your system (assuming it supports one). I personally would not trust
the NMI watchdog to detect a system hang; after all, there are
situations where even NMIs no longer work.
Guenter
On Sat, Nov 16, 2019 at 6:34 AM Guenter Roeck <linux@roeck-us.net> wrote:
>
> On 11/15/19 4:35 PM, Muni Sekhar wrote:
> [ ... ]
> Please note that, for the NMI watchdog to do anything, you must have
> CONFIG_HARDLOCKUP_DETECTOR enabled in your kernel configuration. I don't
> know what if anything the configuration options you listed above have
> to do with the NMI watchdog.

Thank you for your response. I enabled the hard and soft lockup
detector config options. My kernel is built with the following .config
options:

CONFIG_HAVE_HARDLOCKUP_DETECTOR_PERF=y
CONFIG_HARDLOCKUP_DETECTOR_PERF=y
CONFIG_HARDLOCKUP_CHECK_TIMESTAMP=y
CONFIG_HARDLOCKUP_DETECTOR=y
CONFIG_BOOTPARAM_HARDLOCKUP_PANIC=y
CONFIG_BOOTPARAM_HARDLOCKUP_PANIC_VALUE=1
CONFIG_SOFTLOCKUP_DETECTOR=y
CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC=y
CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC_VALUE=1

I also set the following in the /proc/sys/ directory:

kernel.softlockup_panic = 1
kernel.hardlockup_panic = 1
kernel.unknown_nmi_panic = 1
kernel.softlockup_all_cpu_backtrace = 1
kernel.hardlockup_all_cpu_backtrace = 1
kernel.panic = 3
kernel.panic_on_io_nmi = 1
kernel.panic_on_oops = 1
kernel.panic_on_stackoverflow = 1
kernel.panic_on_unrecovered_nmi = 1
kernel.panic_on_rcu_stall = 1
kernel.panic_print = 31
kernel.sysrq=0x1FF

https://www.kernel.org/doc/Documentation/lockup-watchdogs.txt says:
"By default, the watchdog runs on all online cores. However, on a
kernel configured with NO_HZ_FULL, by default the watchdog runs only
on the housekeeping cores, not the cores specified in the "nohz_full"
boot argument." That is why I mentioned my kernel's CONFIG_NO_HZ*
options.

> Another possibility, of course, might be to enable a hardware watchdog
> in your system (assuming it supports one). I personally would not trust
> the NMI watchdog because to detect a system hang, after all, there are
> situations where even NMIs no longer work.

From dmesg, is it possible to know whether my system supports a
hardware watchdog? Assuming it does, how do I enable the hardware
watchdog to debug the system freeze issues?

>
> Guenter

--
Thanks,
Sekhar
On 11/15/19 7:03 PM, Muni Sekhar wrote:
[ ... ]
>>
>> Another possibility, of course, might be to enable a hardware watchdog
>> in your system (assuming it supports one). I personally would not trust
>> the NMI watchdog because to detect a system hang, after all, there are
>> situations where even NMIs no longer work.
>
> From dmesg , Is it possible to know whether my system supports
> hardware watchdog or not?
> I assume that my system supports the hardware watchdog , then how to
> enable the hardware watchdog to debug the system freeze issues?
>
Hardware watchdog support really depends on the board type. Most PC
mainboards support a watchdog in the Super-IO chip, but on some it is
not wired correctly. On embedded boards it is often built into the SoC.
The easiest way to see if you have a watchdog would be to check for the
existence of /dev/watchdog. However, on a PC that would most likely
not be there because the necessary module is not auto-loaded.
If you tell us your board type, or better the Super-IO chip on the board,
we might be able to help.
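
As a rough illustration of the /dev/watchdog check described above, the sketch below lists any watchdog device nodes under a directory. The function name list_watchdog_nodes is made up for this example. An empty result only means no watchdog driver is currently loaded, not that the board lacks watchdog hardware.

```shell
#!/bin/sh
# Hypothetical helper: list watchdog device nodes under a directory
# (defaults to /dev). Prints a note when none are found.
list_watchdog_nodes() {
    dev_dir=${1:-/dev}
    found=0
    for node in "$dev_dir"/watchdog*; do
        # When nothing matches, the glob stays literal; skip it.
        [ -e "$node" ] || continue
        echo "$node"
        found=1
    done
    [ "$found" -eq 1 ] || echo "no watchdog device nodes in $dev_dir"
}
```

Running `list_watchdog_nodes` with no argument inspects /dev.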
Note though that this won't help to debug the problem. A hardware
watchdog resets the system. It helps to recover, but it is not intended
to help with debugging.
Guenter
On Sat, Nov 16, 2019 at 9:31 PM Guenter Roeck <linux@roeck-us.net> wrote:
>
> On 11/15/19 7:03 PM, Muni Sekhar wrote:
> [ ... ]
> Hardware watchdog support really depends on the board type. Most PC
> mainboards support a watchdog in the Super-IO chip, but on some it is
> not wired correctly. On embedded boards it is often built into the SoC.
> The easiest way to see if you have a watchdog would be to check for the
> existence of /dev/watchdog. However, on a PC that would most likely
> not be there because the necessary module is not auto-loaded.
> If you tell us your board type, or better the Super-IO chip on the board,
> we might be able to help.

I have two identically configured systems. On one I installed the
vanilla kernel, and I see the /dev/watchdog and /dev/watchdog0 nodes.
On the other I am running the Ubuntu distribution kernel, and I don't
see any watchdog device node. So it looks like I need to load the
kernel module manually on the distro kernel. Is there a way to find
out which kernel module corresponds to the /dev/watchdog node?

# ls -l /dev/watchdog*
crw------- 1 root root 10, 130 Nov 15 17:15 /dev/watchdog
crw------- 1 root root 248, 0 Nov 15 17:15 /dev/watchdog0

# ps -ax | grep watchdog
678 ? S 0:00 [watchdogd]

Regarding the Super-IO chip, how do I find out the model?

> Note though that this won't help to debug the problem. A hardware
> watchdog resets the system. It helps to recover, but it is not intended
> to help with debugging.

How do I use the hardware watchdog to reset my system when it is
frozen? That would help me collect a crashdump and, ultimately, find
the root cause of the freeze.

>
> Guenter

--
Thanks,
Sekhar
On 11/16/19 10:34 AM, Muni Sekhar wrote:
[ ... ]
> I’m having two same configuration systems, in one system I installed
> the Vanilla kernel and I see the /dev/watchdog and /dev/watchdog0
> nodes. In other system I’m running with ubuntu distribution kernel,
> but I don’t see any watchdog device node. So it looks like I need to
> manually load the kernel module in distro kernel. Is there a way to
> know what is the corresponding kernel module for /dev/watchdog node?
>
> # ls -l /dev/watchdog*
> crw------- 1 root root 10, 130 Nov 15 17:15 /dev/watchdog
> crw------- 1 root root 248, 0 Nov 15 17:15 /dev/watchdog0
>
> # ps -ax | grep watchdog
> 678 ? S 0:00 [watchdogd]
>
> Regarding Super-IO chip, how to find out the Super-IO chip model?
>

You could try to run sensors-detect (from the "sensors" package).

If you can boot a system with /dev/watchdog0, you should see the type
in /sys/class/watchdog/watchdog0/identity.

Also, you can test if the watchdog works with "sudo cat /dev/watchdog",
assuming the watchdog daemon is not running. The watchdog works if the
system reboots after the watchdog times out
(/sys/class/watchdog/watchdog0/timeout is the timeout in seconds).

>>
>> Note though that this won't help to debug the problem. A hardware
>> watchdog resets the system. It helps to recover, but it is not intended
>> to help with debugging.
> How do I use the hardware watchdog to reset my system when system is
> frozen? It helps me to collect the crashdump and finally helps me to
> find the root cause for the system frozen issue.
>

There won't be a crashdump. It just hard-resets the system.

Guenter
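
The sysfs attributes mentioned above can be read with a small helper like the one sketched here. The function name wdt_info is hypothetical. It assumes the identity and timeout files are present, which to my understanding requires CONFIG_WATCHDOG_SYSFS, and that the standard sysfs device/driver symlink names the bound driver. That driver name may be a starting point for finding the module to load on another system, though driver and module names are not guaranteed to match.

```shell
#!/bin/sh
# Hypothetical helper: print identity, timeout and bound driver for one
# watchdog, given its sysfs directory.
wdt_info() {
    wdt_dir=${1:-/sys/class/watchdog/watchdog0}
    for attr in identity timeout; do
        if [ -r "$wdt_dir/$attr" ]; then
            echo "$attr: $(cat "$wdt_dir/$attr")"
        else
            # Attribute missing, e.g. kernel built without CONFIG_WATCHDOG_SYSFS.
            echo "$attr: not available"
        fi
    done
    if [ -e "$wdt_dir/device/driver" ]; then
        # The driver symlink points at the bus driver directory.
        echo "driver: $(basename "$(readlink -f "$wdt_dir/device/driver")")"
    else
        echo "driver: unknown"
    fi
}
```

Calling `wdt_info` with no argument inspects watchdog0.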
On Sun, Nov 17, 2019 at 3:12 AM Guenter Roeck <linux@roeck-us.net> wrote:
>
> On 11/16/19 10:34 AM, Muni Sekhar wrote:
> [ ... ]
> You could try to run sensors-detect (from the "sensors" package).
>
> If you can boot a system with /dev/watchdog0, you should see the type
> in /sys/class/watchdog/watchdog0/identity.

I could not find the /sys/class/watchdog/watchdog0/identity and
/sys/class/watchdog/watchdog0/timeout files.

$ ls -l /sys/class/watchdog/watchdog0/
total 0
-r--r--r-- 1 root root 4096 Nov 18 15:12 dev
lrwxrwxrwx 1 root root 0 Nov 18 15:12 device -> ../../../iTCO_wdt.0.auto
drwxr-xr-x 2 root root 0 Nov 18 15:12 power
lrwxrwxrwx 1 root root 0 Nov 18 14:53 subsystem -> ../../../../../../class/watchdog
-rw-r--r-- 1 root root 4096 Nov 18 14:53 uevent

> Also, you can test if the watchdog works with "sudo cat /dev/watchdog",
> assuming the watchdog daemon is not running. The watchdog works if the
> system reboots after the watchdog times out (/sys/class/watchdog/watchdog0/timeout
> is the timeout in seconds).

"sudo cat /dev/watchdog" rebooted my system as expected. I don't see
a timeout node, so how do I configure the timeout value?

> > How do I use the hardware watchdog to reset my system when system is
> > frozen? It helps me to collect the crashdump and finally helps me to
> > find the root cause for the system frozen issue.
> >
> There won't be a crashdump. It just hard-resets the system.

So is there any other way to capture a crashdump or trigger a soft
reboot once the kernel is locked up?

>
> Guenter

--
Thanks,
Sekhar
On 11/18/19 1:52 AM, Muni Sekhar wrote:
[ ... ]
> I could not find the /sys/class/watchdog/watchdog0/identity and
> /sys/class/watchdog/watchdog0/timeout files.
> $ ls -l /sys/class/watchdog/watchdog0/
> total 0
> -r--r--r-- 1 root root 4096 Nov 18 15:12 dev
> lrwxrwxrwx 1 root root 0 Nov 18 15:12 device -> ../../../iTCO_wdt.0.auto
> drwxr-xr-x 2 root root 0 Nov 18 15:12 power
> lrwxrwxrwx 1 root root 0 Nov 18 14:53 subsystem ->
> ../../../../../../class/watchdog
> -rw-r--r-- 1 root root 4096 Nov 18 14:53 uevent
>

Presumably CONFIG_WATCHDOG_SYSFS is not enabled in your configuration.

>>
>> Also, you can test if the watchdog works with "sudo cat /dev/watchdog",
>> assuming the watchdog daemon is not running. The watchdog works if the
>> system reboots after the watchdog times out (/sys/class/watchdog/watchdog0/timeout
>> is the timeout in seconds).
> sudo cat /dev/watchdog perfectly rebooted my system. I don't see
> timeout node, how do I configure the timeout value?

sudo apt-get install watchdog
man watchdog

should tell you. Alternatively, enable CONFIG_WATCHDOG_SYSFS.

>> There won't be a crashdump. It just hard-resets the system.
> So is there any other solution to capture the crashdump or trigger
> soft reboot once kernel is lockedup?

Not that I know of. I suspect, though, that you either have a hard lockup
where even NMI is non-operational, or NMI doesn't work in your system
to start with.

If you have nmi_watchdog=1 in your kernel command line, /proc/interrupts
should show a non-zero number of NMI interrupts. Do you see that in your
system?

Guenter
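
A minimal sketch of the /proc/interrupts check suggested above, assuming the usual layout in which the NMI row holds one count per CPU followed by a text description. The function name nmi_total is an illustrative choice, not an existing tool.

```shell
#!/bin/sh
# Hypothetical helper: sum the per-CPU NMI counts from a
# /proc/interrupts-style file (defaults to /proc/interrupts).
nmi_total() {
    awk '$1 == "NMI:" {
        total = 0
        # Add numeric fields until the trailing description starts.
        for (i = 2; i <= NF; i++) {
            if ($i ~ /^[0-9]+$/) total += $i; else break
        }
        print total
    }' "${1:-/proc/interrupts}"
}
```

Taking two samples some time apart (for example `nmi_total; sleep 10; nmi_total`) shows whether the count is still advancing. A total of zero would suggest that NMIs are not being delivered at all.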
[-cc linux-pci (nothing here is PCI-specific)]
On Sat, Nov 16, 2019 at 06:05:05AM +0530, Muni Sekhar wrote:
> My kernel is built with the following options:
>
> $ cat /boot/config-5.0.1 | grep NO_HZ
> CONFIG_NO_HZ_COMMON=y
> CONFIG_NO_HZ_IDLE=y
> # CONFIG_NO_HZ_FULL is not set
> CONFIG_NO_HZ=y
> CONFIG_RCU_FAST_NO_HZ=y
>
> I booted with watchdog enabled(nmi_watchdog=1) as given below:
>
> BOOT_IMAGE=/boot/vmlinuz-5.0.1
> root=UUID=f65454ae-3f1d-4b9e-b4be-74a29becbe1e ro debug
> ignore_loglevel console=ttyUSB0,115200 console=tty0 console=tty1
> console=ttyS2,115200 memmap=1M!1023M nmi_watchdog=1
> crashkernel=384M-:128M
>
> When the system is frozen or the kernel is locked up(I noticed that in
> this state kernel is not responding for ALT-SysRq-<command key>) but
> watchdog is not triggered. So I want to understand how to enable the
> watchdog timer and how to verify the basic watchdog functionality
> behavior?
I don't know much about the watchdog, but I assume you've found these
already?
Documentation/admin-guide/lockup-watchdogs.rst
Documentation/admin-guide/sysctl/kernel.rst
Do you have CONFIG_HAVE_NMI_WATCHDOG=y? (See arch/Kconfig)
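
The documents listed above describe the lockup-detector sysctls. As a small sketch, assuming the kernel.watchdog, kernel.nmi_watchdog, kernel.soft_watchdog and kernel.watchdog_thresh entries under /proc/sys/kernel, the current settings could be dumped like this. The function name show_lockup_sysctls is hypothetical, and which entries exist depends on the kernel configuration.

```shell
#!/bin/sh
# Hypothetical helper: print the lockup-detector sysctls found in a
# directory (defaults to /proc/sys/kernel).
show_lockup_sysctls() {
    sysctl_dir=${1:-/proc/sys/kernel}
    for name in watchdog nmi_watchdog soft_watchdog watchdog_thresh; do
        if [ -r "$sysctl_dir/$name" ]; then
            echo "kernel.$name = $(cat "$sysctl_dir/$name")"
        else
            # Entry missing: the matching detector is likely not built in.
            echo "kernel.$name: not present"
        fi
    done
}
```

If kernel.nmi_watchdog reads 0 or is missing, the hard lockup detector is not active, whatever the boot command line says.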
On Mon, Nov 18, 2019 at 08:38:38AM -0600, Bjorn Helgaas wrote:
> ...
[facepalm, should have read the rest of the thread before cluttering
it, sorry]
On Mon, Nov 18, 2019 at 7:40 PM Guenter Roeck <linux@roeck-us.net> wrote:
>
> On 11/18/19 1:52 AM, Muni Sekhar wrote:
> [ ... ]
> Not that I know of. I suspect, though, that you either have a hard lockup
> where even NMI is non-operational, or NMI doesn't work in your system
> to start with.
>
> If you have nmi_watchdog=1 in your kernel command line, /proc/interrupts
> should show a non-zero number of NMI interrupts. Do you see that in your system ?

Yes, I see a non-zero number. When is the NMI interrupt count
supposed to change?

$ cat /proc/interrupts | grep NMI
NMI: 4129 4153 4192 183 Non-maskable interrupts

$ dmesg | grep NMI
[ 0.402175] ACPI: LAPIC_NMI (acpi_id[0x01] high edge lint[0x1])
[ 0.402199] ACPI: LAPIC_NMI (acpi_id[0x02] high edge lint[0x1])
[ 0.402220] ACPI: LAPIC_NMI (acpi_id[0x03] high edge lint[0x1])
[ 0.402242] ACPI: LAPIC_NMI (acpi_id[0x04] high edge lint[0x1])
[ 4.636467] NMI watchdog: Enabled. Permanently consumes one hw-PMU counter.
[ 4.658289] | NMI testsuite:
[ 13.863284] INFO: NMI handler (kgdb_nmi_handler) took too long to run: 9.744 msecs

I also enabled pstore/ramoops. When I test the hardware watchdog by
running 'sudo cat /dev/watchdog', I see that the console dump is
updated across the reboot. I see the same behavior consistently.

$ cat /sys/fs/pstore/console-ramoops-0
[ 293.462623] printk: console [pstore-1] enabled
[ 293.471026] pstore: Registered ramoops as persistent store backend
[ 293.477800] ramoops: using 0x100000@0x3ff00000, ecc: 16
[ 315.461263] systemd-journald[1665]: Sent WATCHDOG=1 notification.
[ 317.447791] watchdog: watchdog0: nowayout prevents watchdog being stopped!
[ 317.456616] watchdog: watchdog0: watchdog did not stop!
No errors detected

I then installed the watchdog daemon and started that service before
the kernel locked up. After I triggered a few tests, the kernel locked
up and the hardware watchdog triggered the reset, but in this case I
don't see the console-ramoops-0 file. The only difference is that this
time the 'watchdog' daemon triggered the hardware watchdog. I am not
sure why the console dump was not updated in this scenario.

>
> Guenter

--
Thanks,
Sekhar
On Mon, Nov 18, 2019 at 8:08 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> [-cc linux-pci (nothing here is PCI-specific)]
>
> On Sat, Nov 16, 2019 at 06:05:05AM +0530, Muni Sekhar wrote:
> > My kernel is built with the following options:
> >
> > $ cat /boot/config-5.0.1 | grep NO_HZ
> > CONFIG_NO_HZ_COMMON=y
> > CONFIG_NO_HZ_IDLE=y
> > # CONFIG_NO_HZ_FULL is not set
> > CONFIG_NO_HZ=y
> > CONFIG_RCU_FAST_NO_HZ=y
> >
> > I booted with watchdog enabled(nmi_watchdog=1) as given below:
> >
> > BOOT_IMAGE=/boot/vmlinuz-5.0.1
> > root=UUID=f65454ae-3f1d-4b9e-b4be-74a29becbe1e ro debug
> > ignore_loglevel console=ttyUSB0,115200 console=tty0 console=tty1
> > console=ttyS2,115200 memmap=1M!1023M nmi_watchdog=1
> > crashkernel=384M-:128M
> >
> > When the system is frozen or the kernel is locked up(I noticed that in
> > this state kernel is not responding for ALT-SysRq-<command key>) but
> > watchdog is not triggered. So I want to understand how to enable the
> > watchdog timer and how to verify the basic watchdog functionality
> > behavior?
>
> I don't know much about the watchdog, but I assume you've found these
> already?
>
> Documentation/admin-guide/lockup-watchdogs.rst
> Documentation/admin-guide/sysctl/kernel.rst
>
> Do you have CONFIG_HAVE_NMI_WATCHDOG=y? (See arch/Kconfig)
I don't have CONFIG_HAVE_NMI_WATCHDOG in my kernel .config file:
$ cat /boot/config-5.0.1 | grep CONFIG_HAVE_NMI_WATCHDOG
I tried to enable CONFIG_HAVE_NMI_WATCHDOG via menuconfig but was not
able to find it. What is the role of CONFIG_HAVE_NMI_WATCHDOG?
Symbol: HAVE_NMI_WATCHDOG [=n]
Type  : bool
Defined at arch/Kconfig:339
Depends on: HAVE_NMI [=y]
Selected by [n]:
  - HAVE_HARDLOCKUP_DETECTOR_ARCH [=n]

Symbol: HAVE_HARDLOCKUP_DETECTOR_ARCH [=n]
Type  : bool
Defined at arch/Kconfig:346
Selects: HAVE_NMI_WATCHDOG [=n]
--
Thanks,
Sekhar
On 11/18/19 7:09 AM, Muni Sekhar wrote:
> On Mon, Nov 18, 2019 at 8:08 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
[ ... ]
>> Do you have CONFIG_HAVE_NMI_WATCHDOG=y? (See arch/Kconfig)
>
> I don’t have CONFIG_HAVE_NMI_WATCHDOG in kernel .config file.
>

That would mean you don't have NMI in the first place. What is your
architecture?

Guenter
On Fri, Nov 22, 2019 at 4:29 PM Guenter Roeck <linux@roeck-us.net> wrote:
>
> On 11/18/19 7:09 AM, Muni Sekhar wrote:
> [ ... ]
> > I don’t have CONFIG_HAVE_NMI_WATCHDOG in kernel .config file.
> >
>
> That would mean you don't have NMI in the first place. What is your
> architecture ?

My system has an "Intel(R) Atom(TM) CPU E3845" processor, and running
'uname -m' gives x86_64.

/proc/interrupts gives the following statistics for NMI:

$ cat /proc/interrupts | grep NMI
NMI: 4207 4167 125 Non-maskable interrupts

>
> Guenter

--
Thanks,
Sekhar