* troubleshooting/debugging hard locks
@ 2008-05-14 19:27 Lee Howard
  2008-05-14 22:43 ` Ray Lee
  0 siblings, 1 reply; 6+ messages in thread
From: Lee Howard @ 2008-05-14 19:27 UTC (permalink / raw)
  To: linux-kernel

(Please reply also directly to my e-mail address since I am not 
subscribed to the list.)

Hello,

I am using Fedora 9 (and have been for the last few weeks of the 
"preview" period... constantly updating if possible) and testing for a 
fax server using Mainpine IQ Express (PCIe) multi-modem fax cards (they 
use the 8250 serial driver).

My testing involves queuing up and sending 2000 fax jobs using HylaFAX+ 
5.2.4 to send out on two ports (1000 jobs on each port) of a 4-port card 
- receiving those calls on the other two ports.

This exact hardware works perfectly fine with similar testing in Windows 
XP Pro SP2.  However, usually on Fedora 9 (and even occasionally on 
Fedora 8) the system will lock up hard (i.e. the Numlock key does not 
light up the LED on the keyboard and SysReq keys do nothing) somewhere 
during the process.  This happens infrequently when the OS is Fedora 8 
and usually (but not always) when using Fedora 9.

There are no kernel messages on the monitor.  I've set up a remote 
serial console on ttyS0, and there are usually no messages there, 
either, when this happens.  Twice I did get messages that looked like a 
lot of this:

CPU1: Temperature above threshold, cpu clock throttled (total events = 1)
CPU0: Temperature/speed normal
CPU1: Temperature above threshold, cpu clock throttled (total events = 275)
CPU0: Temperature/speed normal
CPU1: Temperature above threshold, cpu clock throttled (total events = 577)
CPU0: Temperature/speed normal
CPU1: Temperature above threshold, cpu clock throttled (total events = 696)
CPU0: Temperature/speed normal

... but there was nothing more.  The side of the system chassis is 
removed, the fans are moving, and the hard lock still occurs even if I 
point a large fan at the open system and prevent the temperature 
warnings from occurring.  I've used sensors to monitor the temperature 
during the test with the external fan pointed at the open system, and 
the temperatures stay roughly as this:

[root@localhost ~]# sensors
it8718-isa-0290
Adapter: ISA adapter
in0:         +1.23 V  (min =  +0.00 V, max =  +4.08 V)  
in1:         +1.82 V  (min =  +0.00 V, max =  +4.08 V)  
in2:         +3.26 V  (min =  +0.00 V, max =  +4.08 V)  
in3:         +2.94 V  (min =  +0.00 V, max =  +4.08 V)  
in4:         +0.00 V  (min =  +0.00 V, max =  +4.08 V)   ALARM
in5:         +0.00 V  (min =  +0.00 V, max =  +4.08 V)   ALARM
in6:         +1.28 V  (min =  +0.00 V, max =  +4.08 V)  
in7:         +3.07 V  (min =  +0.00 V, max =  +4.08 V)  
in8:         +3.28 V
fan1:        688 RPM  (min =    0 RPM)
fan2:          0 RPM  (min =    0 RPM)
fan3:          0 RPM  (min =    0 RPM)
temp1:       +37.0°C  (low  = +127.0°C, high = +127.0°C)  sensor = transistor
temp2:       +30.0°C  (low  = +127.0°C, high = +127.0°C)  sensor = thermal diode
temp3:        -2.0°C  (low  = +127.0°C, high = +127.0°C)  sensor = transistor
cpu0_vid:   +1.063 V

So I tend to think that the temperature warnings are a result of a 
looming hard-lock or they're simply a red herring.

But, without kernel messages indicating where to look to debug... what 
is the best approach to start troubleshooting and debugging this 
condition?  Is there some general debug feature that can be enabled in 
the kernel that would help hone in on the culprit?

[root@localhost ~]# uname -a
Linux localhost.localdomain 2.6.25-14.fc9.i686 #1 SMP Thu May 1 06:28:41 EDT 2008 i686 i686 i386 GNU/Linux
[root@localhost ~]# lspci
00:00.0 Host bridge: Intel Corporation 82P965/G965 Memory Controller Hub (rev 02)
00:02.0 VGA compatible controller: Intel Corporation 82G965 Integrated Graphics Controller (rev 02)
00:1a.0 USB Controller: Intel Corporation 82801H (ICH8 Family) USB UHCI Controller #4 (rev 02)
00:1a.1 USB Controller: Intel Corporation 82801H (ICH8 Family) USB UHCI Controller #5 (rev 02)
00:1a.7 USB Controller: Intel Corporation 82801H (ICH8 Family) USB2 EHCI Controller #2 (rev 02)
00:1b.0 Audio device: Intel Corporation 82801H (ICH8 Family) HD Audio Controller (rev 02)
00:1c.0 PCI bridge: Intel Corporation 82801H (ICH8 Family) PCI Express Port 1 (rev 02)
00:1c.3 PCI bridge: Intel Corporation 82801H (ICH8 Family) PCI Express Port 4 (rev 02)
00:1c.4 PCI bridge: Intel Corporation 82801H (ICH8 Family) PCI Express Port 5 (rev 02)
00:1d.0 USB Controller: Intel Corporation 82801H (ICH8 Family) USB UHCI Controller #1 (rev 02)
00:1d.1 USB Controller: Intel Corporation 82801H (ICH8 Family) USB UHCI Controller #2 (rev 02)
00:1d.2 USB Controller: Intel Corporation 82801H (ICH8 Family) USB UHCI Controller #3 (rev 02)
00:1d.7 USB Controller: Intel Corporation 82801H (ICH8 Family) USB2 EHCI Controller #1 (rev 02)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev f2)
00:1f.0 ISA bridge: Intel Corporation 82801HB/HR (ICH8/R) LPC Interface Controller (rev 02)
00:1f.2 IDE interface: Intel Corporation 82801H (ICH8 Family) 4 port SATA IDE Controller (rev 02)
00:1f.3 SMBus: Intel Corporation 82801H (ICH8 Family) SMBus Controller (rev 02)
00:1f.5 IDE interface: Intel Corporation 82801H (ICH8 Family) 2 port SATA IDE Controller (rev 02)
01:00.0 PCI bridge: PLX Technology, Inc. PEX8112 x1 Lane PCI Express-to-PCI Bridge (rev aa)
02:00.0 Communication controller: MainPine Ltd PCI <-> IOBus Bridge (rev 81)
03:00.0 SATA controller: JMicron Technologies, Inc. JMicron 20360/20363 AHCI Controller (rev 02)
03:00.1 IDE interface: JMicron Technologies, Inc. JMicron 20360/20363 AHCI Controller (rev 02)
04:00.0 Ethernet controller: Marvell Technology Group Ltd. 88E8053 PCI-E Gigabit Ethernet Controller (rev 22)

Thanks,

Lee.


* Re: troubleshooting/debugging hard locks
  2008-05-14 19:27 troubleshooting/debugging hard locks Lee Howard
@ 2008-05-14 22:43 ` Ray Lee
  2008-05-14 23:42   ` Zan Lynx
  0 siblings, 1 reply; 6+ messages in thread
From: Ray Lee @ 2008-05-14 22:43 UTC (permalink / raw)
  To: Lee Howard; +Cc: linux-kernel

On Wed, May 14, 2008 at 12:27 PM, Lee Howard <faxguy@howardsilvan.com> wrote:
> (Please reply also directly to my e-mail address since I am not subscribed
> to the list.)
>
>  Hello,
>
>  I am using Fedora 9 (and have been for the last few weeks of the "preview"
> period... constantly updating if possible) and testing for a fax server
> using Mainpine IQ Express (PCIe) multi-modem fax cards (they use the 8250
> serial driver).
>
>  My testing involves queuing up and sending 2000 fax jobs using HylaFAX+
> 5.2.4 to send out on two ports (1000 jobs on each port) of a 4-port card -
> receiving those calls on the other two ports.
>
>  This exact hardware works perfectly fine with similar testing in Windows XP
> Pro SP2.  However, usually on Fedora 9 (and even occasionally on Fedora 8)
> the system will lock up hard (i.e. the Numlock key does not light up the LED
> on the keyboard and SysReq keys do nothing) somewhere during the process.
> This happens infrequently when the OS is Fedora 8 and usually (but not
> always) when using Fedora 9.
>
>  There are no kernel messages on the monitor.  I've set up a remote serial
> console on ttyS0, and there are usually no messages there, either, when this
> happens.  Twice I did get messages that looked like a lot of this:
>
>  CPU1: Temperature above threshold, cpu clock throttled (total events = 1)
>  CPU0: Temperature/speed normal
>  CPU1: Temperature above threshold, cpu clock throttled (total events = 275)
>  CPU0: Temperature/speed normal
>  CPU1: Temperature above threshold, cpu clock throttled (total events = 577)
>  CPU0: Temperature/speed normal
>  CPU1: Temperature above threshold, cpu clock throttled (total events = 696)
>  CPU0: Temperature/speed normal
>
>  ... but there was nothing more.  The side of the system chassis is removed,
> the fans are moving, and the hard lock still occurs even if I point a large
> fan at the open system and prevent the temperature warnings from occurring.
> I've used sensors to monitor the temperature during the test with the
> external fan pointed at the open system, and the temperatures stay roughly
> as this:
>
>  [root@localhost ~]# sensors
>  it8718-isa-0290
>  Adapter: ISA adapter
>  in0:         +1.23 V  (min =  +0.00 V, max =  +4.08 V)
>  in1:         +1.82 V  (min =  +0.00 V, max =  +4.08 V)
>  in2:         +3.26 V  (min =  +0.00 V, max =  +4.08 V)
>  in3:         +2.94 V  (min =  +0.00 V, max =  +4.08 V)
>  in4:         +0.00 V  (min =  +0.00 V, max =  +4.08 V)   ALARM
>  in5:         +0.00 V  (min =  +0.00 V, max =  +4.08 V)   ALARM
>  in6:         +1.28 V  (min =  +0.00 V, max =  +4.08 V)
>  in7:         +3.07 V  (min =  +0.00 V, max =  +4.08 V)
>  in8:         +3.28 V
>  fan1:        688 RPM  (min =    0 RPM)
>  fan2:          0 RPM  (min =    0 RPM)
>  fan3:          0 RPM  (min =    0 RPM)
>  temp1:       +37.0°C  (low  = +127.0°C, high = +127.0°C)  sensor = transistor
>  temp2:       +30.0°C  (low  = +127.0°C, high = +127.0°C)  sensor = thermal diode
>  temp3:        -2.0°C  (low  = +127.0°C, high = +127.0°C)  sensor = transistor
>  cpu0_vid:   +1.063 V
>
>  So I tend to think that the temperature warnings are a result of a looming
> hard-lock or they're simply a red herring.
>
>  But, without kernel messages indicating where to look to debug... what is
> the best approach to start troubleshooting and debugging this condition?  Is
> there some general debug feature that can be enabled in the kernel that
> would help hone in on the culprit?

There's something called the NMI watchdog, that will print debugging
messages out if it finds the system has hard locked. The short version
is that you should add "nmi_watchdog=1" (no quotes) to the line in
GRUB that has the kernel options. That assumes you have an APIC on the
system. If that's not the case (you're on Uniprocessor, and no APIC)
then you can try nmi_watchdog=2 instead. That'll only work on some
systems, though.

Better docs (than my cheesy writeup) are in
Documentation/nmi_watchdog.txt in the kernel source distribution.
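
After rebooting with the watchdog enabled, it is worth confirming that it is actually running before waiting for the next lockup. On the 2.6-era kernels discussed here, Documentation/nmi_watchdog.txt describes checking that the NMI line in /proc/interrupts is non-zero and keeps increasing. A minimal sketch of that check follows; the helper name nmi_total and the optional file argument are illustrative and not something from this thread.

```shell
# Sum the per-CPU counters on the NMI line of an interrupts table.
# $1 defaults to /proc/interrupts; a saved copy can be passed instead.
# Returns non-zero if the table has no NMI line at all.
nmi_total() {
    awk '$1 == "NMI:" {
             total = 0
             # Counters come first; stop at the first non-numeric field.
             for (i = 2; i <= NF && $i ~ /^[0-9]+$/; i++) total += $i
             print total
             found = 1
         }
         END { if (!found) exit 1 }' "${1:-/proc/interrupts}"
}

# Possible use: sample twice and compare.
#   before=$(nmi_total); sleep 10; after=$(nmi_total)
#   [ "$after" -gt "$before" ] && echo "NMI counter is advancing"
```

If the total stays at zero or never changes, the watchdog is probably not armed, and a silent hang would not by itself say anything about where the kernel is stuck.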

* Re: troubleshooting/debugging hard locks
  2008-05-14 22:43 ` Ray Lee
@ 2008-05-14 23:42   ` Zan Lynx
  2008-05-15  3:43     ` Lee Howard
  0 siblings, 1 reply; 6+ messages in thread
From: Zan Lynx @ 2008-05-14 23:42 UTC (permalink / raw)
  To: Ray Lee; +Cc: Lee Howard, linux-kernel

On Wed, 2008-05-14 at 15:43 -0700, Ray Lee wrote:
> On Wed, May 14, 2008 at 12:27 PM, Lee Howard <faxguy@howardsilvan.com> wrote:

> >  But, without kernel messages indicating where to look to debug... what is
> > the best approach to start troubleshooting and debugging this condition?  Is
> > there some general debug feature that can be enabled in the kernel that
> > would help hone in on the culprit?
> 
> There's something called the NMI watchdog, that will print debugging
> messages out if it finds the system has hard locked. The short version
> is that you should add "nmi_watchdog=1" (no quotes) to the line in
> GRUB that has the kernel options. That assumes you have an APIC on the
> system. If that's not the case (you're on Uniprocessor, and no APIC)
> then you can try nmi_watchdog=2 instead. That'll only work on some
> systems, though.
> 
> Better docs (than my cheesy writeup) are in
> Documentation/nmi_watchdog.txt in the kernel source distribution.

I was once told to add these to the kernel command line as well when
using NMI watchdog and they do seem to help it trigger more reliably: 

"idle=poll nohz=off"

-- 
Zan Lynx <zlynx@acm.org>
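
Since these options are added by hand to the GRUB kernel line, a typo there can go unnoticed. The running kernel's command line is readable from /proc/cmdline, so the settings can be verified after boot. The sketch below assumes nothing beyond that file; the helper name missing_params is made up for illustration.

```shell
# Print every requested parameter that is absent from a kernel command line.
# $1 is the file holding the command line (normally /proc/cmdline);
# the remaining arguments are the parameters to look for.
missing_params() {
    cmdline_file=$1
    shift
    for want in "$@"; do
        # One parameter per line, then require an exact whole-line match
        # so that "nmi_watchdog=1" does not match "nmi_watchdog=10".
        tr -s ' \t' '\n\n' < "$cmdline_file" | grep -qxF -- "$want" || echo "$want"
    done
}

# Possible use:
#   missing_params /proc/cmdline nmi_watchdog=1 idle=poll nohz=off
```

No output means all of the listed parameters reached the kernel.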

* Re: troubleshooting/debugging hard locks
  2008-05-14 23:42   ` Zan Lynx
@ 2008-05-15  3:43     ` Lee Howard
  2008-05-20 20:47       ` Lee Howard
  0 siblings, 1 reply; 6+ messages in thread
From: Lee Howard @ 2008-05-15  3:43 UTC (permalink / raw)
  To: Zan Lynx; +Cc: Ray Lee, linux-kernel

Zan Lynx wrote:
> On Wed, 2008-05-14 at 15:43 -0700, Ray Lee wrote:
>   
>> On Wed, May 14, 2008 at 12:27 PM, Lee Howard <faxguy@howardsilvan.com> wrote:
>>     
>
>   
>>>  But, without kernel messages indicating where to look to debug... what is
>>> the best approach to start troubleshooting and debugging this condition?  Is
>>> there some general debug feature that can be enabled in the kernel that
>>> would help hone in on the culprit?
>>>       
>> There's something called the NMI watchdog, that will print debugging
>> messages out if it finds the system has hard locked. The short version
>> is that you should add "nmi_watchdog=1" (no quotes) to the line in
>> GRUB that has the kernel options. That assumes you have an APIC on the
>> system. If that's not the case (you're on Uniprocessor, and no APIC)
>> then you can try nmi_watchdog=2 instead. That'll only work on some
>> systems, though.
>>
>> Better docs (than my cheesy writeup) are in
>> Documentation/nmi_watchdog.txt in the kernel source distribution.
>>     
>
> I was once told to add these to the kernel command line as well when
> using NMI watchdog and they do seem to help it trigger more reliably: 
>
> "idle=poll nohz=off"

Thank you to both Ray and Zan.  This was very helpful, and I think that 
it has gotten me what I needed.

"serial8250: too much work for irq16"

Interestingly, now CTRL-SysRq-H will wake it back up... things get 
running normally afterwards - the hard lock never occurs.

Thanks,

Lee.

* Re: troubleshooting/debugging hard locks
  2008-05-15  3:43     ` Lee Howard
@ 2008-05-20 20:47       ` Lee Howard
  2008-05-20 21:19         ` Ray Lee
  0 siblings, 1 reply; 6+ messages in thread
From: Lee Howard @ 2008-05-20 20:47 UTC (permalink / raw)
  To: Zan Lynx; +Cc: Ray Lee, linux-kernel

Lee Howard wrote:
> Zan Lynx wrote:
>> On Wed, 2008-05-14 at 15:43 -0700, Ray Lee wrote:
>>  
>>> On Wed, May 14, 2008 at 12:27 PM, Lee Howard 
>>> <faxguy@howardsilvan.com> wrote:
>>>     
>>
>>  
>>>>  But, without kernel messages indicating where to look to debug... 
>>>> what is
>>>> the best approach to start troubleshooting and debugging this 
>>>> condition?  Is
>>>> there some general debug feature that can be enabled in the kernel 
>>>> that
>>>> would help hone in on the culprit?
>>>>       
>>> There's something called the NMI watchdog, that will print debugging
>>> messages out if it finds the system has hard locked. The short version
>>> is that you should add "nmi_watchdog=1" (no quotes) to the line in
>>> GRUB that has the kernel options. That assumes you have an APIC on the
>>> system. If that's not the case (you're on Uniprocessor, and no APIC)
>>> then you can try nmi_watchdog=2 instead. That'll only work on some
>>> systems, though.
>>>
>>> Better docs (than my cheesy writeup) are in
>>> Documentation/nmi_watchdog.txt in the kernel source distribution.
>>>     
>>
>> I was once told to add these to the kernel command line as well when
>> using NMI watchdog and they do seem to help it trigger more reliably:
>> "idle=poll nohz=off"
>
> Thank you to both Ray and Zan.  This was very helpful, and I think 
> that it has gotten me what I needed.
>
> "serial8250: too much work for irq16"
>
> Interestingly, now CTRL-SysRq-H will wake it back up... things get 
> running normally afterwards - the hard lock never occurs. 

After further troubleshooting it turns out that the message above was a 
bit of a red herring.  I moved the PCIe modem card from one PCIe slot to 
another so that it would be on a different IRQ (one that wasn't shared 
with the network, ATA, and USB).  The message above does not occur now, 
however, the hard lock is back... and this time with no kernel messages 
at all.

So, I'm back to square one again... I've got:

  nmi_watchdog=1 idle=poll nohz=off

... and still I get the hard lockup and no kernel messages.  How do I 
troubleshoot/debug this?  How do I know where to look?

Thanks,

Lee.
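
To check whether a card really ended up on an interrupt line of its own after a slot change, /proc/interrupts can be inspected: when several handlers share one IRQ, their names are listed on the same row, separated by commas. A rough sketch that prints only the shared rows is below. The helper name shared_irqs is hypothetical, and the comma test relies on that row layout, so treat it as an assumption to verify on the kernel in use.

```shell
# Print the rows of an interrupts table where more than one handler
# is registered on the same numbered IRQ.
# $1 defaults to /proc/interrupts; a saved copy can be passed instead.
shared_irqs() {
    awk '$1 ~ /^[0-9]+:$/ && /, / { print }' "${1:-/proc/interrupts}"
}

# Possible use:
#   shared_irqs | grep serial    # is the 8250 driver sharing a line?
```

An empty result for the serial driver would suggest that the modem card no longer shares its line with other devices.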

* Re: troubleshooting/debugging hard locks
  2008-05-20 20:47       ` Lee Howard
@ 2008-05-20 21:19         ` Ray Lee
  0 siblings, 0 replies; 6+ messages in thread
From: Ray Lee @ 2008-05-20 21:19 UTC (permalink / raw)
  To: Lee Howard; +Cc: Zan Lynx, linux-kernel

On Tue, May 20, 2008 at 1:47 PM, Lee Howard <faxguy@howardsilvan.com> wrote:
> After further troubleshooting it turns out that the message above was a bit
> of a red herring.  I moved the PCIe modem card from one PCIe slot to another
> so that it would be on a different IRQ (one that wasn't shared with the
> network, ATA, and USB).  The message above does not occur now, however, the
> hard lock is back... and this time with no kernel messages at all.
>
> So, I'm back to square one again... I've got:
>
>  nmi_watchdog=1 idle=poll nohz=off
>
> ... and still I get the hard lockup and no kernel messages.  How do I
> troubleshoot/debug this?  How do I know where to look?

The only thing I can think of is to recompile the latest mainline
kernel with all debugging options enabled in the .config, and then try
to reproduce with that.
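
Before building, it can help to see which debugging options a given .config already has switched on. The sketch below reports the options from a caller-supplied list that are not set to "y". The helper name unset_options is made up, and the option names in the usage comment are only a guess at lockup-related ones from kernels of this era, not a list given in this thread.

```shell
# Print the options from the argument list that are not set to "y"
# in a kernel config file.
# $1 is the config file (for example the .config in the build tree);
# the remaining arguments are option names without the CONFIG_ prefix.
unset_options() {
    config_file=$1
    shift
    for opt in "$@"; do
        grep -qxF -- "CONFIG_${opt}=y" "$config_file" || echo "CONFIG_${opt}"
    done
}

# Possible use (option names are an assumption, adjust to the kernel version):
#   unset_options .config DEBUG_KERNEL DETECT_SOFTLOCKUP DEBUG_SPINLOCK PROVE_LOCKING
```

Anything it prints is a candidate to enable through make menuconfig before rebuilding.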
