linux-kernel.vger.kernel.org archive mirror
* Help track down a freezing machine
@ 2005-12-14 12:55 Kalin KOZHUHAROV
  2005-12-14 14:25 ` Alex Riesen
  2005-12-14 20:15 ` Help track down a freezing machine Nigel Cunningham
  0 siblings, 2 replies; 7+ messages in thread
From: Kalin KOZHUHAROV @ 2005-12-14 12:55 UTC (permalink / raw)
  To: linux-kernel

Hi, all!

You know there are weeks (or months!) when everything is just plain wrong... While still fighting
with my laptop (see "Help track a memory leak in 2.6.0..14"), now one of the semi-production machines
started to freeze without any indication...

No Oops.
Nothing in the logs.
No response to ping, all network is dead.
Logging in on the console dies at random (sometimes in the middle of typing the login name), but I
can still use Alt+SysRq+{S,U,S,B}.
Sometimes no response to any keyboard press...

It is a DIY P4 machine with Asus P5GDC-V-Deluxe (i915G,LGA775), 2GB RAM, SATA WD740GD disk (using
libata).
Running (now) 2.6.14.3 with sk98lin-8.23.1.3 patched in (the in-kernel one does not recognise the
NIC) and mppe-mppc-1.3.patch (using the box to test VPNs). Software-wise it is a Gentoo machine,
running apache-2.0.54, subversion-1.2.3, bugzilla-2.20, mysql-5.0.16, pptpd-1.2.3, ppp-2.4.3 and
the latest openss{l,h}. No X, no sound, no WiFi, no USB, no NFSv4 (just 3): it is a headless server-type
box (on a KVM).

When it does die, and lately this happens 2-3 times per 24 hours, there is nothing whatsoever to
indicate the cause - just dead.

A strange thing is that after the box is restarted with Alt+SysRq+{S,U,S,B}, most of the time the
BIOS cannot find the SATA drive, so I need to turn off the power physically.

About the NIC: There are a few posts on the net that Asus shipped some MBs with broken SPD, so they
don't work with Linux. I found some kind of cryptic patch on the Asus site (for another board) and it
said it applied cleanly, but the NIC is still not recognized by the in-kernel sk98lin at all (the flash
was done after the problems began, but might have made them appear more frequently?).

02:00.0 Ethernet controller: Marvell Technology Group Ltd. 88E8053 Gigabit Ethernet Controller (rev 15)
        Subsystem: ASUSTeK Computer Inc. Marvell 88E8053 Gigabit Ethernet Controller (Asus)
        Flags: bus master, fast devsel, latency 0, IRQ 17
        Memory at cfffc000 (64-bit, non-prefetchable) [size=16K]
        I/O ports at d800 [size=256]
        Expansion ROM at cffc0000 [disabled] [size=128K]
        Capabilities: [48] Power Management version 2
        Capabilities: [50] Vital Product Data
        Capabilities: [5c] Message Signalled Interrupts: 64bit+ Queue=0/1 Enable-
        Capabilities: [e0] Express Legacy Endpoint IRQ 0
        Capabilities: [100] Advanced Error Reporting

Got another NIC and will try it tomorrow.

Now that I get a repeatable freeze, is there anything I can do to debug the problem?
I guess the point to look at is when the kernel still responds to the keyboard but I cannot log in.

It sounds really bad, but I put in a cron job to restart the box every 4 hours until I move its
functions off to another one... and it used to run 30+ days...

Any help is appreciated.
Kalin.

-- 
|[ ~~~~~~~~~~~~~~~~~~~~~~ ]|
+-> http://ThinRope.net/ <-+
|[ ______________________ ]|


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Help track down a freezing machine
  2005-12-14 12:55 Help track down a freezing machine Kalin KOZHUHAROV
@ 2005-12-14 14:25 ` Alex Riesen
  2005-12-16 17:52   ` Kalin KOZHUHAROV
  2005-12-14 20:15 ` Help track down a freezing machine Nigel Cunningham
  1 sibling, 1 reply; 7+ messages in thread
From: Alex Riesen @ 2005-12-14 14:25 UTC (permalink / raw)
  To: Kalin KOZHUHAROV; +Cc: linux-kernel

On 12/14/05, Kalin KOZHUHAROV <kalin@thinrope.net> wrote:
> Now that I get a repetitive freeze, is there anything to debug the problem?
> I guess, the point when kernel is still responsive to keyboard, but I cannot login.

try to connect a serial console to it and press Alt+SysRq+t
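
If the keyboard is dead but a root shell or script is still running, the same
task dump can be requested by writing the SysRq letter to /proc/sysrq-trigger.
A minimal sketch follows, assuming the kernel was built with CONFIG_MAGIC_SYSRQ;
the helper name request_sysrq and its trigger_path parameter are made up for
this illustration, only the procfs path itself is the kernel's interface.

```python
SYSRQ_TRIGGER = "/proc/sysrq-trigger"  # kernel procfs interface; writing needs root


def request_sysrq(command: str, trigger_path: str = SYSRQ_TRIGGER) -> None:
    """Write a single SysRq command letter (e.g. 't' for a task dump).

    request_sysrq and trigger_path are hypothetical names for this sketch.
    The output of the command goes to the kernel log / console, not to the caller.
    """
    if len(command) != 1 or not command.isalnum():
        raise ValueError("SysRq command must be a single letter or digit")
    with open(trigger_path, "w") as trigger:
        trigger.write(command)
```

Calling request_sysrq("t") as root should have the same effect as Alt+SysRq+t,
so the serial console (or netconsole) still has to be in place to capture it.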


* Re: Help track down a freezing machine
  2005-12-14 12:55 Help track down a freezing machine Kalin KOZHUHAROV
  2005-12-14 14:25 ` Alex Riesen
@ 2005-12-14 20:15 ` Nigel Cunningham
  1 sibling, 0 replies; 7+ messages in thread
From: Nigel Cunningham @ 2005-12-14 20:15 UTC (permalink / raw)
  To: Kalin KOZHUHAROV; +Cc: Linux Kernel Mailing List

Hi.

On Wed, 2005-12-14 at 22:55, Kalin KOZHUHAROV wrote:
> Hi, all!
> 
> You know there are weeks (or months!) when everything is just plain wrong... While still fighting
> with my laptop (see "Help track a memory leak in 2.6.0..14"), now one of the semi-production machines
> started to freeze without any indication...

Another suggestion is to patch your kernel with kdb or kgdb and turn the
option on when compiling. You could then do more detailed examination of
the issue.

Regards,

Nigel



* Re: Help track down a freezing machine
  2005-12-14 14:25 ` Alex Riesen
@ 2005-12-16 17:52   ` Kalin KOZHUHAROV
  2005-12-27  9:30     ` Help track down a freezing machine, libata? Kalin KOZHUHAROV
  0 siblings, 1 reply; 7+ messages in thread
From: Kalin KOZHUHAROV @ 2005-12-16 17:52 UTC (permalink / raw)
  To: linux-kernel

Alex Riesen wrote:
> On 12/14/05, Kalin KOZHUHAROV <kalin@thinrope.net> wrote:
> 
>>Now that I get a repetitive freeze, is there anything to debug the problem?
>>I guess, the point when kernel is still responsive to keyboard, but I cannot login.
> 
> 
> try to connect a serial console to it and press Alt+SysRq+t

Thank you for the suggestion, Alex.

I was always trying to avoid the serial console till now (it just seems difficult, and I DO know it
is not), and didn't even bother with the netconsole...

However, now that I have spent almost an hour trying to OCR and fix a screenshot of an oops, I am
convinced: I DO need a serial console! First thing tomorrow.

So until then, here is an oops, the first I have seen in a few months, captured by my camera and then
digitally enhanced: http://linux.tar.bz/reports/oopses/char/2.6.14.3-K01_char__oops1.jpg

The OCRed/handwritten text ( http://linux.tar.bz/reports/oopses/char/2.6.14.3-K01_char__oops1.txt )
says:

Call trace:
SCSI device sda: 145226112 512-byte hdwr sectors (74356 MB)
SCSI device sda: drive cache: write back
[<c01ec22f>] kobject_put+0x1f/0x30
[<c028c8fd>] scsi_end_request+0xdd/0xf0
[<c028ccae>] scsi_io_completion+0x26e/0x570
[<c011b623>] load_balance_newidle+0x43/0x110
[<c028d255>] scsi_generic_done+0x35/0x50
[<c02873ee>] scsi_finish_command+0x8e/0xd0
[<c0318dea>] schedule+0x4da/0xd50
[<c0318e1d>] schedule+0x50d/0xd50
[<c028728f>] scsi_softirq+0xdf/0x160
[<c0125836>] __do_softirq+0xd6/0xf0
[<c0125885>] do_softirq+0x35/0x40
[<c0125e35>] ksoftirqd+0x95/0xe0
[<c0125da0>] ksoftirqd+0x0/0xe0
[<c0135b9a>] kthread+0xba/0xc0
[<c0135ae0>] kthread+0x0/0xc0
[<c0101245>] kernel_thread_helper+0x5/0x10
Code: e1 08 00 89 44 24 04 89 1c 24 e8 27 b0 ff ff eb a5 90 8d 74 26 00 55 57 56
 53 83 ec 08 8b 44 24 1c 89 44 24 04 8b 80 ec 00 00 00 <8b> 38 f6 80 79 01 00 00
 80 0f 85 98 00 00 00 8b 47 2c 8d 6f 20
<0>Kernel panic - not syncing: Fatal exception in interrupt

Unfortunately everything was frozen (KBD too), so I couldn't scroll up to see the beginning. As you
may guess, it was not written to the disk.
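
For anyone transcribing a trace like this by hand: each entry has the form
[<address>] symbol+offset/size, so address minus offset gives the start of the
function, which is a cheap consistency check on the OCR. Below is a minimal
sketch of that check; the function name parse_trace and the dictionary keys are
invented for the example, and the sample lines are copied from the trace above.

```python
import re

# One trace entry: [<c028c8fd>] scsi_end_request+0xdd/0xf0
TRACE_LINE = re.compile(
    r"\[<([0-9a-f]{8})>\]\s+([\w.]+)\+0x([0-9a-f]+)/0x([0-9a-f]+)"
)

SAMPLE = """\
[<c01ec22f>] kobject_put+0x1f/0x30
[<c028c8fd>] scsi_end_request+0xdd/0xf0
[<c0125e35>] ksoftirqd+0x95/0xe0
[<c0125da0>] ksoftirqd+0x0/0xe0
"""


def parse_trace(text):
    """Return one dict per trace entry (hypothetical helper for this sketch)."""
    frames = []
    for match in TRACE_LINE.finditer(text):
        address = int(match.group(1), 16)
        offset = int(match.group(3), 16)
        size = int(match.group(4), 16)
        frames.append({
            "symbol": match.group(2),
            "address": address,
            "offset": offset,
            "size": size,
            "start": address - offset,  # where the function begins
        })
    return frames


if __name__ == "__main__":
    for frame in parse_trace(SAMPLE):
        print("%-20s starts at %08x" % (frame["symbol"], frame["start"]))
```

Two entries for the same symbol (the two ksoftirqd lines here) should resolve
to the same start address; if they do not, a digit was probably misread.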

The oops happened on boot (after a hard power-off) and is probably related to the SATA system.

The .config is available at http://linux.tar.bz/reports/oopses/char/2.6.14.3-K01_char.config

Any insights?

I will be "fighting" with the machine this weekend as well and keep posting.
Removed the fcron job (to restart every 4h) and now it has been running almost 11h...

Kalin.
-- 
|[ ~~~~~~~~~~~~~~~~~~~~~~ ]|
+-> http://ThinRope.net/ <-+
|[ ______________________ ]|



* Re: Help track down a freezing machine, libata?
  2005-12-16 17:52   ` Kalin KOZHUHAROV
@ 2005-12-27  9:30     ` Kalin KOZHUHAROV
  2006-01-08 16:28       ` Help track down a freezing machine, libata or hardware Kalin KOZHUHAROV
  0 siblings, 1 reply; 7+ messages in thread
From: Kalin KOZHUHAROV @ 2005-12-27  9:30 UTC (permalink / raw)
  To: linux-kernel

Kalin KOZHUHAROV wrote:
> Alex Riesen wrote:
> 
>>On 12/14/05, Kalin KOZHUHAROV <kalin@thinrope.net> wrote:
>>
>>
>>>Now that I get a repetitive freeze, is there anything to debug the problem?
>>>I guess, the point when kernel is still responsive to keyboard, but I cannot login.
>>
>>
>>try to connect a serial console to it and press Alt+SysRq+t
> 
> 
> Thank you for the suggestion, Alex.
> 
> I was always trying to avoid the serial console till now (it just seems difficult, and I DO know it
> is not), and didn't even bother with the netconsole...

It is, as I have to go and buy a null modem cable... will do it.

> So until now, here is an oops, the first I saw in a few months, captured by my camera and then
> digitally enhanced: http://linux.tar.bz/reports/oopses/char/2.6.14.3-K01_char__oops1.jpg
> 
> The OCRed/handwritten text ( http://linux.tar.bz/reports/oopses/char/2.6.14.3-K01_char__oops1.txt )
> says:
> 
> Call trace:
> SCSI device sda: 145226112 512-byte hdwr sectors (74356 MB)
> SCSI device sda: drive cache: write back
> [<c01ec22f>] kobject_put+0x1f/0x30
> [<c028c8fd>] scsi_end_request+0xdd/0xf0
> [<c028ccae>] scsi_io_completion+0x26e/0x570
> [<c011b623>] load_balance_newidle+0x43/0x110
> [<c028d255>] scsi_generic_done+0x35/0x50
> [<c02873ee>] scsi_finish_command+0x8e/0xd0
> [<c0318dea>] schedule+0x4da/0xd50
> [<c0318e1d>] schedule+0x50d/0xd50
> [<c028728f>] scsi_softirq+0xdf/0x160
> [<c0125836>] __do_softirq+0xd6/0xf0
> [<c0125885>] do_softirq+0x35/0x40
> [<c0125e35>] ksoftirqd+0x95/0xe0
> [<c0125da0>] ksoftirqd+0x0/0xe0
> [<c0135b9a>] kthread+0xba/0xc0
> [<c0135ae0>] kthread+0x0/0xc0
> [<c0101245>] kernel_thread_helper+0x5/0x10
> Code: e1 08 00 89 44 24 04 89 1c 24 e8 27 b0 ff ff eb a5 90 8d 74 26 00 55 57 56
>  53 83 ec 08 8b 44 24 1c 89 44 24 04 8b 80 ec 00 00 00 <8b> 38 f6 80 79 01 00 00
>  80 0f 85 98 00 00 00 8b 47 2c 8d 6f 20
> <0>Kernel panic - not syncing: Fatal exception in interrupt
> 
> Unfortunately everything was frozen (KBD too), so I couldn't scroll up to see the beginning. As you
> may guess, it was not written to the disk.
> 
> The oops happened on boot (after a hard power-off) and is probably related to the SATA system.

All right, the above has become reproducible, about once every 3 boots: the system freezes when
it tries to initialize the ATA subsystem. (I still don't have a cable for the serial console, sorry.)

> The .config is available at http://linux.tar.bz/reports/oopses/char/2.6.14.3-K01_char.config

Now this is 2.6.14.4 and the new config is:
http://linux.tar.bz/reports/oopses/char/2.6.14.4-K01_char.config

I added another 250GB SATA HDD and changed the PSU, but neither seems to be related to that bug.
Will try to tweak the IDE parameters in BIOS.

Kalin.

-- 
|[ ~~~~~~~~~~~~~~~~~~~~~~ ]|
+-> http://ThinRope.net/ <-+
|[ ______________________ ]|



* Re: Help track down a freezing machine, libata or hardware
  2005-12-27  9:30     ` Help track down a freezing machine, libata? Kalin KOZHUHAROV
@ 2006-01-08 16:28       ` Kalin KOZHUHAROV
  2006-01-11  2:00         ` Esben Stien
  0 siblings, 1 reply; 7+ messages in thread
From: Kalin KOZHUHAROV @ 2006-01-08 16:28 UTC (permalink / raw)
  To: linux-kernel

Kalin KOZHUHAROV wrote:
> Kalin KOZHUHAROV wrote:
> 
>>Alex Riesen wrote:
>>
>>
>>>On 12/14/05, Kalin KOZHUHAROV <kalin@thinrope.net> wrote:
>>>
>>>
>>>
>>>>Now that I get a repetitive freeze, is there anything to debug the problem?
>>>>I guess, the point when kernel is still responsive to keyboard, but I cannot login.
>>>
>>>
>>>try to connect a serial console to it and press Alt+SysRq+t
>>
>>
>>Thank you for the suggestion, Alex.
>>
>>I was always trying to avoid the serial console till now (it just seems difficult, and I DO know it
>>is not), and didn't even bother with the netconsole...
> 
> 
> It is, as I have to go and buy a null modem cable... will do it.
> 
> 
>>So until now, here is an oops, the first I saw in a few months, captured by my camera and then
>>digitally enhanced: http://linux.tar.bz/reports/oopses/char/2.6.14.3-K01_char__oops1.jpg
>>
>>The OCRed/handwritten text ( http://linux.tar.bz/reports/oopses/char/2.6.14.3-K01_char__oops1.txt )
>>says:
>>
>>Call trace:
>>SCSI device sda: 145226112 512-byte hdwr sectors (74356 MB)
>>SCSI device sda: drive cache: write back
>>[<c01ec22f>] kobject_put+0x1f/0x30
>>[<c028c8fd>] scsi_end_request+0xdd/0xf0
>>[<c028ccae>] scsi_io_completion+0x26e/0x570
>>[<c011b623>] load_balance_newidle+0x43/0x110
>>[<c028d255>] scsi_generic_done+0x35/0x50
>>[<c02873ee>] scsi_finish_command+0x8e/0xd0
>>[<c0318dea>] schedule+0x4da/0xd50
>>[<c0318e1d>] schedule+0x50d/0xd50
>>[<c028728f>] scsi_softirq+0xdf/0x160
>>[<c0125836>] __do_softirq+0xd6/0xf0
>>[<c0125885>] do_softirq+0x35/0x40
>>[<c0125e35>] ksoftirqd+0x95/0xe0
>>[<c0125da0>] ksoftirqd+0x0/0xe0
>>[<c0135b9a>] kthread+0xba/0xc0
>>[<c0135ae0>] kthread+0x0/0xc0
>>[<c0101245>] kernel_thread_helper+0x5/0x10
>>Code: e1 08 00 89 44 24 04 89 1c 24 e8 27 b0 ff ff eb a5 90 8d 74 26 00 55 57 56
>> 53 83 ec 08 8b 44 24 1c 89 44 24 04 8b 80 ec 00 00 00 <8b> 38 f6 80 79 01 00 00
>> 80 0f 85 98 00 00 00 8b 47 2c 8d 6f 20
>><0>Kernel panic - not syncing: Fatal exception in interrupt
>>
>>Unfortunately everything was frozen (KBD too), so I couldn't scroll up to see the beginning. As you
>>may guess, it was not written to the disk.
>>
>>The oops happened on boot (after a hard power-off) and is probably related to the SATA system.
> 
> 
> All right, the above has become reproducible, about once every 3 boots: the system freezes when
> it tries to initialize the ATA subsystem. (I still don't have a cable for the serial console, sorry.)
> 
> 
>>The .config is available at http://linux.tar.bz/reports/oopses/char/2.6.14.3-K01_char.config
> 
> 
> Now this is 2.6.14.4 and the new config is:
> http://linux.tar.bz/reports/oopses/char/2.6.14.4-K01_char.config
> 
> I added another 250GB SATA HDD and changed the PSU, but neither seems to be related to that bug.
> Will try to tweak the IDE parameters in BIOS.

OK, now this looks like a hardware failure... I booted 2.6.15 the other day and was happy to see 2 days
of uptime :-) However... everything was borked: root was mounted read-only and the filesystem was
generally screwed.

I don't have physical access to the box right now, so I will post more details tomorrow, but this is
what I got from dmesg:


ata1: port reset, p_is 40000001 is 1 pis 0 cmd c017 tf 471 ss 113 se 0
ata1: translated ATA stat/err 0x71/04 to SCSI SK/ASC/ASCQ 0xb/00/00
ata1: status=0x71 { DriveReady DeviceFault SeekComplete Error }
ata1: error=0x04 { DriveStatusError }
ata1: port reset, p_is 40000001 is 1 pis 0 cmd c017 tf 471 ss 113 se 0
ata1: translated ATA stat/err 0x71/04 to SCSI SK/ASC/ASCQ 0xb/00/00
ata1: status=0x71 { DriveReady DeviceFault SeekComplete Error }
ata1: error=0x04 { DriveStatusError }
ata1: port reset, p_is 40000001 is 1 pis 0 cmd c017 tf 471 ss 113 se 0
ata1: translated ATA stat/err 0x71/04 to SCSI SK/ASC/ASCQ 0xb/00/00
ata1: status=0x71 { DriveReady DeviceFault SeekComplete Error }
ata1: error=0x04 { DriveStatusError }
ata1: port reset, p_is 40000001 is 1 pis 0 cmd c017 tf 471 ss 113 se 0
ata1: translated ATA stat/err 0x71/04 to SCSI SK/ASC/ASCQ 0xb/00/00
ata1: status=0x71 { DriveReady DeviceFault SeekComplete Error }
ata1: error=0x04 { DriveStatusError }
ata1: port reset, p_is 40000001 is 1 pis 0 cmd c017 tf 471 ss 113 se 0
ata1: translated ATA stat/err 0x71/04 to SCSI SK/ASC/ASCQ 0xb/00/00
ata1: status=0x71 { DriveReady DeviceFault SeekComplete Error }
ata1: error=0x04 { DriveStatusError }
sd 0:0:0:0: SCSI error: return code = 0x8000002
sda: Current: sense key=0xb
    ASC=0x0 ASCQ=0x0
end_request: I/O error, dev sda, sector 10803956
Buffer I/O error on device sda3, logical block 362497


The above was repeated many times (the buffer was full, so I don't know how many). I cannot tell when
it started, as trying to `cat /var/log/everything` resulted in an I/O error.
I also got a bunch of these:

ReiserFS: sda3: warning: clm-6006: writing inode 56216 on readonly FS
ReiserFS: sda3: warning: clm-6006: writing inode 56216 on readonly FS
ReiserFS: sda3: warning: vs-13070: reiserfs_read_locked_inode: i/o failure occurred trying to find
stat data of [25971 27180 0x0 SD]
ReiserFS: sda3: warning: vs-13070: reiserfs_read_locked_inode: i/o failure occurred trying to find
stat data of [9064 25871 0x0 SD]
ReiserFS: sda3: warning: vs-13070: reiserfs_read_locked_inode: i/o failure occurred trying to find
stat data of [25971 27180 0x0 SD]

Do you see this as a hardware problem?

The drive in question is a WD740GD (SATA, 10k RPM) and I will run the WD tools on it in a day or two
to check if it is a hardware failure.
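
For reference, the status=0x71 and error=0x04 values in the dmesg output above
are plain bit fields, and the labels libata prints map one-to-one onto bits.
A small sketch of the decoding follows. The labels for bits that do show up in
the log are taken from it; the remaining names (Busy, DataRequest,
CorrectedError, Index, and the ICRC/UNC/IDNF/ABRT error names) are the usual ATA
register names and are an assumption about what this kernel would print.
Error bit 0x04 is the ATA "command aborted" bit, which fits the SCSI sense key
0xb (ABORTED COMMAND) reported for sda.

```python
# (mask, label) pairs, highest bit first, for the ATA status register.
STATUS_BITS = [
    (0x80, "Busy"),
    (0x40, "DriveReady"),
    (0x20, "DeviceFault"),
    (0x10, "SeekComplete"),
    (0x08, "DataRequest"),
    (0x04, "CorrectedError"),
    (0x02, "Index"),
    (0x01, "Error"),
]

# Subset of the ATA error register, using the specification's short names.
ERROR_BITS = [
    (0x80, "ICRC"),  # interface CRC error
    (0x40, "UNC"),   # uncorrectable data error
    (0x10, "IDNF"),  # sector ID not found
    (0x04, "ABRT"),  # command aborted (logged above as DriveStatusError)
]


def decode_bits(value, table):
    """Return the labels of every bit from `table` that is set in `value`."""
    return [label for mask, label in table if value & mask]


if __name__ == "__main__":
    print(decode_bits(0x71, STATUS_BITS))
    print(decode_bits(0x04, ERROR_BITS))
```

The point of interest is DeviceFault: that bit is raised by the drive itself,
so it being set on every retry leans towards the disk rather than the driver.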

Kalin.

-- 
|[ ~~~~~~~~~~~~~~~~~~~~~~ ]|
+-> http://ThinRope.net/ <-+
|[ ______________________ ]|



* Re: Help track down a freezing machine, libata or hardware
  2006-01-08 16:28       ` Help track down a freezing machine, libata or hardware Kalin KOZHUHAROV
@ 2006-01-11  2:00         ` Esben Stien
  0 siblings, 0 replies; 7+ messages in thread
From: Esben Stien @ 2006-01-11  2:00 UTC (permalink / raw)
  To: linux-kernel

Kalin KOZHUHAROV <kalin@thinrope.net> writes:

> Do you see this as a hardware problem?
 
Yes, definitely. I wrote a nice mail to the reiserfs list titled
"Close Encounter of the Bad Block Kind". Check it out for how to get
your data before anything else happens. 
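
As a rough illustration of the general idea only (this is not taken from the
mail mentioned above), here is a sketch that images a failing device block by
block and carries on past read errors, similar in spirit to dd with
conv=noerror,sync. The name rescue_copy, its parameters and the 4096-byte
default block size are all assumptions made for the example; a dedicated tool
such as GNU ddrescue is the better choice for a real recovery.

```python
import os


def rescue_copy(src_path, dst_path, block_size=4096):
    """Copy src to dst block by block; unreadable blocks become zeros.

    Hypothetical helper for this sketch. Returns the list of block numbers
    that could not be read, so they can be retried or reported later.
    """
    bad_blocks = []
    src = os.open(src_path, os.O_RDONLY)
    try:
        size = os.lseek(src, 0, os.SEEK_END)  # works for files and block devices on Linux
        with open(dst_path, "wb") as dst:
            for offset in range(0, size, block_size):
                want = min(block_size, size - offset)
                try:
                    data = os.pread(src, want, offset)
                except OSError:
                    # I/O error: remember the block and keep going.
                    data = b""
                    bad_blocks.append(offset // block_size)
                # Pad failed (or short) reads so later offsets stay aligned.
                dst.write(data.ljust(want, b"\0"))
    finally:
        os.close(src)
    return bad_blocks
```

The image should be written to a different, healthy disk, and the source
partition should stay unmounted or read-only while it is being copied.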

-- 
Esben Stien is b0ef@e     s      a             
         http://www. s     t    n m
          irc://irc.  b  -  i  .   e/%23contact
          [sip|iax]:   e     e 
           jid:b0ef@    n     n
