* what are some more advanced error collection methods?
From: Al Niessner @ 2009-05-06 21:17 UTC
  To: linux-kernel


I am running 2.6.27 on an AMD 64 x2 dual core 6000+. I have the OS
installed on its own disk (SATA) and have an mdraid (SATA) with 3 disks
being mirrored for my critical data. I also have an mdraid with 2 disks
being mirrored (USB but I wanted firewire) for very low rate data. Both
mdraids are nfs mounted and use automount on top of that -- nothing
peculiar about nfs and automount except that nfs is over two networks
each with their own NIC. My problem is that every 36 hours the machine
simply locks up. Here is what I find:

1) num lock light is on but was off prior to lock up
2) no response to beating the num and caps lock keys
3) no response to beating the SysRq key plus any sequences
4) nothing is recorded in kern.log, syslog, or any other log file
in /var/log
5) cannot get to console because keyboard is dead
6) have to hold power switch for 10 seconds to get computer to turn off
so the computer is not suspended (power management is not installed
anyway)
7) when computer is rebooted, the mdraids are usually clean (no resync)
8) did a memtest and it passes

Since nothing showed up in the logs and I could not read the console, I
found an old computer and connected the one I care about to it via
ttyS0. Now I have the console even though the keyboard is dead. However,
when the lock up occurs, there is absolutely no output to my RS232
console. I put a pulse onto the console via /dev/console and get stuff
right up until the change of state, but no panic shows up. On reboot, I
start getting characters from the kernel immediately. Hence, I have to
conclude that the serial connection is viable, but there is simply no
output from the kernel.
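
A minimal sketch of the kind of console pulse described above, assuming Python is available on the failing machine and the script runs as root. The function names, the line format, and the interval are hypothetical, not the exact method used here. Each line carries a counter and a UTC timestamp, so the last line captured on the serial side bounds the time of the lock up.

```python
import os
import time


def heartbeat_line(seq, now=None):
    """Format one pulse line: a counter plus a UTC timestamp."""
    now = time.time() if now is None else now
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(now))
    return f"heartbeat {seq} {stamp} UTC\n"


def write_heartbeat(path, seq, now=None):
    """Write a single pulse line to `path` with an unbuffered os.write."""
    # O_NOCTTY keeps the device from becoming our controlling terminal;
    # O_APPEND is harmless on a character device and lets a regular file
    # be used as a stand-in when trying the script out.
    fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_NOCTTY)
    try:
        os.write(fd, heartbeat_line(seq, now).encode("ascii"))
    finally:
        os.close(fd)


def run(path, interval, count=None):
    """Emit a pulse every `interval` seconds; forever if `count` is None."""
    seq = 0
    while count is None or seq < count:
        write_heartbeat(path, seq)
        seq += 1
        time.sleep(interval)
```

Calling run("/dev/console", 10) would emit one line every ten seconds until the machine stops.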

So, I have tried all of the simple stuff that I know about or found via
google. Now I would like some more advanced ways of trying to pry
helpful information from a dying kernel. Are there more advanced tools,
tricks, or secrets for collecting fault information?

Any and all help is appreciated in advance.

One last item, I am still working on determining if this is a hardware
or software problem. The voltages look reasonable and the room is
thermally stable to +/- 1C. So, I am having a hard time blaming
hardware.

-- 
Al Niessner
818.354.0859

All opinions stated above are mine and do not necessarily reflect those
of JPL or NASA.

--------
|  dS  | >= 0
--------



* Re: what are some more advanced error collection methods?
From: Al Niessner @ 2009-05-06 22:36 UTC
  To: linux-kernel


Using a volt meter, I verified that the 5V and 12V are good and the
computer is running under a normal load. So, I am going to go with the
power supply being alright for now.

I changed the CPU temperature by a couple of degrees with no failure.
While I cannot rule this out, I am willing to lean toward a software
problem; meaning, the kernel is hard locking.

Now I just need some way to get some helpful information out of it so
that I can move toward a solution.

-- 
Al Niessner
818.354.0859

--------
|  dS  | >= 0
--------

