linux-smp.vger.kernel.org archive mirror
* FC4 crashes repeatedly on Supermicro AS1020A-T dual-core Opterons, SMP
@ 2006-04-18 19:11 Michal Szymanski
  2006-05-05 14:00 ` Bill Davidsen
  0 siblings, 1 reply; 10+ messages in thread
From: Michal Szymanski @ 2006-04-18 19:11 UTC
  To: SMP list

Hi all,

I have recently purchased three Supermicro AS1020A-T servers, each equipped
with two dual-core Opteron 280s. H8DAR-T motherboards, 8 or 12 GB
RAM. The systems run FC4 x86_64 with a proprietary driver (made by
Adaptec) for the onboard Marvell 88SX6041 SATA controller. Original
(install) kernel 2.6.11-1.1369_FC4smp - unfortunately not upgradable due
to the lack of the SATA driver for other kernel versions.

All systems crash (either hang with some "machine check exception"
kernel messages or reset) when loaded with repeated runs of 1.3 GB jobs,
CPU-intensive with some I/O. I run 2 or 4 jobs simultaneously and they
have never survived more than a few hours.

Suspecting a SATA driver problem, I mounted /tmp as "tmpfs"
and repeated the tests entirely in /tmp (with plenty of RAM this means
(IMHO) doing I/O in memory). The crashes continued.
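
One way to double-check that /tmp really is memory-backed during such a test is to look it up in /proc/mounts. The sketch below is only an illustration: the helper name `fs_type_of` is made up here and is not part of the original test setup.

```python
def fs_type_of(mount_point, mounts_text):
    """Return the filesystem type mounted at mount_point, or None.

    mounts_text is the content of /proc/mounts: one mount per line,
    with the fields device, mount point, filesystem type, options, ...
    If several entries share a mount point, the last one wins,
    because later mounts shadow earlier ones.
    """
    fs_type = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[1] == mount_point:
            fs_type = fields[2]
    return fs_type
```

On the running system, `fs_type_of("/tmp", open("/proc/mounts").read())` would be expected to return "tmpfs" if the mount took effect, and None if /tmp is just a directory on the root filesystem.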

It is somewhat better when I run similar-size no-I/O jobs, but these also
crash, although less frequently.

I tried installing the i386 version; it also crashes. Same (or even worse)
with FC3.

Memtest does not show any RAM errors.

Finally I did two tests which seem to exclude the SATA
controller/driver as the reason for the crashes:

1. I installed an additional IDE hard disk and put an FC4/x86_64 system on
it (without the Adaptec driver, so the system does not even see the SATA
disks), updated the kernel to the latest (2.6.16) - also crashed.

2. I ran a non-SMP 2.6.11 kernel (with the Adaptec driver) on another machine.
Two repeating 1.3 GB test jobs have been running on it (each getting 50%
of the single CPU used by the system) for over 50 hours now, no crashes.
Also, a single test job running on the SMP kernel gave no crashes in 24 hours.

It seems there is a problem with the SMP kernel and dual-core Opterons, at
least on this hardware. I am stuck with three top-end machines which
can work at only 25% of nominal CPU power. Any hints would be
appreciated.

regards, Michal.

-- 
  Michal Szymanski (msz at astrouw dot edu dot pl)
  Warsaw University Observatory, Warszawa, POLAND


* Re: FC4 crashes repeatedly on Supermicro AS1020A-T dual-core Opterons, SMP
  2006-04-18 19:11 FC4 crashes repeatedly on Supermicro AS1020A-T dual-core Opterons, SMP Michal Szymanski
@ 2006-05-05 14:00 ` Bill Davidsen
  2006-05-05 15:18   ` Robert M. Hyatt
  2006-05-05 15:23   ` cerise
  0 siblings, 2 replies; 10+ messages in thread
From: Bill Davidsen @ 2006-05-05 14:00 UTC
  To: Michal Szymanski; +Cc: SMP list

What happens if you use only one CPU? Either with a uni kernel (you 
should have gotten one) or "maxcpus=1" on the kernel boot command line. 
You are running a custom kernel with custom drivers, so you really should 
be asking the supplier; all we can do is suggest things which might 
provide extra information.
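
To confirm that a boot with "maxcpus=1" actually took effect, one can inspect the command line the kernel was started with. This is a minimal sketch, assuming a Linux /proc filesystem; the function name `parse_cmdline` is invented for illustration.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into a dict.

    "key=value" tokens map key -> value (as a string); bare flags
    such as "nomce" map to True. Quoted values containing spaces
    are not handled in this sketch.
    """
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else True
    return params
```

Fed with the content of /proc/cmdline, `parse_cmdline(...).get("maxcpus")` should come back as "1" on such a boot. Counting the "processor" lines in /proc/cpuinfo is a simple cross-check that only one CPU was brought up.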

-- 
bill davidsen <davidsen@tmr.com>
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979



* Re: FC4 crashes repeatedly on Supermicro AS1020A-T dual-core Opterons, SMP
  2006-05-05 14:00 ` Bill Davidsen
@ 2006-05-05 15:18   ` Robert M. Hyatt
  2006-05-05 15:28     ` cerise
  2006-05-09 12:23     ` Michal Szymanski
  2006-05-05 15:23   ` cerise
  1 sibling, 2 replies; 10+ messages in thread
From: Robert M. Hyatt @ 2006-05-05 15:18 UTC
  To: Bill Davidsen; +Cc: Michal Szymanski, SMP list


One note.  I am running on a quad 875 system, but am using SuSE rather 
than FC4.  It is running perfectly reliably (this is a 4-CPU, dual-core, 
2.2 GHz box, 8 processors total).  I had problems with FC4 myself, 
although it runs perfectly on my normal dual Xeon boxes...


Robert M. Hyatt, Ph.D.          Computer and Information Sciences
hyatt@uab.edu                   University of Alabama at Birmingham
(205) 934-2213                  136A Campbell Hall
(205) 934-5473 FAX              Birmingham, AL 35294-1170

> -
> To unsubscribe from this list: send the line "unsubscribe linux-smp" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>


* Re: FC4 crashes repeatedly on Supermicro AS1020A-T dual-core Opterons, SMP
  2006-05-05 14:00 ` Bill Davidsen
  2006-05-05 15:18   ` Robert M. Hyatt
@ 2006-05-05 15:23   ` cerise
  2006-05-12 10:54     ` Michal Szymanski
  1 sibling, 1 reply; 10+ messages in thread
From: cerise @ 2006-05-05 15:23 UTC
  To: msz, linux-smp

> Michal Szymanski wrote:
>
> >All systems crash (either hang with some "machine check exception"
> >kernel messages or reset) when loaded with repeated runs of 1.3 GB jobs,
> >CPU-intensive with some I/O. I run 2 or 4 jobs simultaneously and they
> >have never survived more than a few hours.

Let's try the easy stuff first -- if it's crashing with a machine check
exception, then let's disable machine check exceptions, and see if things
still break.

Try booting with the parameter "nomce".  Be aware that mce is a mechanism
for the processor to inform the kernel of thermal issues or component 
failure.  You'll only want to disable this mechanism if you aren't having
thermal problems.  

Of course, if you are having thermal problems, it's probably a good idea to
resolve those before cranking up the other 3/4s of your system.  ; )

Hope that helps!

-Phil/CERisE

P.S.  I came a little late to this party -- I didn't see the original message.
Did you include the text of the kernel crash?


* Re: FC4 crashes repeatedly on Supermicro AS1020A-T dual-core Opterons, SMP
  2006-05-05 15:18   ` Robert M. Hyatt
@ 2006-05-05 15:28     ` cerise
  2006-05-05 16:31       ` Robert M. Hyatt
  2006-05-09 12:23     ` Michal Szymanski
  1 sibling, 1 reply; 10+ messages in thread
From: cerise @ 2006-05-05 15:28 UTC
  To: linux-smp

Hi Robert:

That might be because SuSE's compiled kernel doesn't use mce.  If you can look
in the .config for the compiled kernel (or you can ask one of the maintainers
for SuSE...or you're fortunate enough to have a /proc/config.gz), I'd be curious
whether it has MCE enabled (you'd be looking for "CONFIG_X86_MCE=y").  That would
nicely explain the discrepancy. 8)
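
For anyone wanting to run that check, here is a rough sketch. The two locations it tries are assumptions about the system: /proc/config.gz exists only when the kernel was built with CONFIG_IKCONFIG_PROC, and /boot/config-<release> is a convention many distributions follow but not a guarantee. The function names are made up for illustration.

```python
import gzip
import os
import platform


def option_enabled(config_text, option="CONFIG_X86_MCE"):
    """Return True if the kernel config text contains the line 'option=y'."""
    for line in config_text.splitlines():
        if line.strip() == option + "=y":
            return True
    return False


def read_kernel_config():
    """Return the running kernel's config as text, or None if not found.

    Tries /proc/config.gz first, then /boot/config-<kernel release>.
    Both paths are assumptions; a distribution may keep the config
    somewhere else or not ship it at all.
    """
    if os.path.exists("/proc/config.gz"):
        with gzip.open("/proc/config.gz", "rt") as f:
            return f.read()
    boot_path = "/boot/config-" + platform.release()
    if os.path.exists(boot_path):
        with open(boot_path) as f:
            return f.read()
    return None
```

If `read_kernel_config()` returns text, `option_enabled(text)` tells whether "CONFIG_X86_MCE=y" is present. A result of None only means neither of the two guessed locations exists.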

-Phil/CERisE


* Re: FC4 crashes repeatedly on Supermicro AS1020A-T dual-core Opterons, SMP
  2006-05-05 15:28     ` cerise
@ 2006-05-05 16:31       ` Robert M. Hyatt
  0 siblings, 0 replies; 10+ messages in thread
From: Robert M. Hyatt @ 2006-05-05 16:31 UTC
  To: cerise; +Cc: linux-smp

Sorry, not much help here.  I've been running Red Hat forever and have 
built a zillion kernels.  But for SuSE, which I am using out at the AMD 
development center, I don't know a thing about where they put config 
info.  There is no /proc/config, so that's out.  I could not find any 
config file in the places I would look on my Red Hat systems.

This is SuSE 10.0.  If you can tell me where to look, I'll be happy to 
peek and report back...


Robert M. Hyatt, Ph.D.          Computer and Information Sciences
hyatt@uab.edu                   University of Alabama at Birmingham
(205) 934-2213                  136A Campbell Hall
(205) 934-5473 FAX              Birmingham, AL 35294-1170


* Re: FC4 crashes repeatedly on Supermicro AS1020A-T dual-core Opterons, SMP
  2006-05-05 15:18   ` Robert M. Hyatt
  2006-05-05 15:28     ` cerise
@ 2006-05-09 12:23     ` Michal Szymanski
  2006-05-24 20:23       ` Bill Davidsen
  1 sibling, 1 reply; 10+ messages in thread
From: Michal Szymanski @ 2006-05-09 12:23 UTC
  To: SMP list

On Fri, May 05, 2006 at 10:18:36AM -0500, Robert M. Hyatt wrote:
> 
> One note.  I am running on a quad 875 system, but am using SuSE rather 
> than FC4.  It is running perfectly reliably (this is a 4-CPU, dual-core, 
> 2.2 GHz box, 8 processors total).  I had problems with FC4 myself, 
> although it runs perfectly on my normal dual Xeon boxes...
> 
> On Fri, 5 May 2006, Bill Davidsen wrote:
> 
> >Michal Szymanski wrote:
> >
> >>Hi all,
> >>
> >>I have recently purchased three Supermicro AS1020A-T servers, each equipped
> >>with two dual-core Opteron 280s. H8DAR-T motherboards, 8 or 12 GB
> >>RAM. The systems run FC4 x86_64 with a proprietary driver (made by
> >>Adaptec) for the onboard Marvell 88SX6041 SATA controller. Original
> >>(install) kernel 2.6.11-1.1369_FC4smp - unfortunately not upgradable due
> >>to the lack of the SATA driver for other kernel versions.
> >>
> >>All systems crash (either hang with some "machine check exception"
> >>kernel messages or reset) when loaded with repeated runs of 1.3 GB jobs,
> >>CPU-intensive with some I/O. I run 2 or 4 jobs simultaneously and they
> >>have never survived more than a few hours.
> >> ...
> >>2. I ran a non-SMP 2.6.11 kernel (with the Adaptec driver) on another machine.
> >>Two repeating 1.3 GB test jobs have been running on it (each getting 50%
> >>of the single CPU used by the system) for over 50 hours now, no crashes.
> >>Also, a single test job running on the SMP kernel gave no crashes in 24 hours.
> >>
> >What happens if you use only one CPU? Either with a uni kernel (you should 
> >have gotten one) or "maxcpus=1" on the kernel boot command line. You are 
> >running a custom kernel with custom drivers, so you really should be asking 
> >the supplier; all we can do is suggest things which might provide extra 
> >information.

Hi all,

I got 3 copies of Robert's message but none of Bill's :-)

Still, I don't quite understand Bill's question ("What happens if you
use only one CPU?"). The answer is quoted just above this question!
There were no crashes with the system running on a non-SMP kernel.

In the meantime I got Kingston 1GB modules from my dealer, for testing.
Strange as it seems, I could not crash the machine with Kingston
memory, running tests for as long as 72 hours. It seems, then, that it is a
memory issue, although I do not understand why the same memory crashes
the machine in SMP and does not in non-SMP, under similar load. Also,
the Patriot 2GB memory modules (which seem to crash the machines) are on
Supermicro's list of memory recommended for the H8DAR-T mobo.

One of the machines went back to the dealer (actually to their memory
supplier) for tests. The memory guys seem not to trust our crashing
experience. We'll see what happens. I am afraid, however, that they will
say "the memory is OK".

regards, Michal.

-- 
  Michal Szymanski (msz at astrouw dot edu dot pl)
  Warsaw University Observatory, Warszawa, POLAND


* Re: FC4 crashes repeatedly on Supermicro AS1020A-T dual-core Opterons, SMP
  2006-05-05 15:23   ` cerise
@ 2006-05-12 10:54     ` Michal Szymanski
  0 siblings, 0 replies; 10+ messages in thread
From: Michal Szymanski @ 2006-05-12 10:54 UTC
  To: SMP list

On Fri, May 05, 2006 at 08:23:44AM -0700, cerise@armory.com wrote:
> > Michal Szymanski wrote:
> >
> > >All systems crash (either hang with some "machine check exception"
> > >kernel messages or reset) when loaded with repeated runs of 1.3 GB jobs,
> > >CPU-intensive with some I/O. I run 2 or 4 jobs simultaneously and they
> > >have never survived more than a few hours.
> 
> Let's try the easy stuff first -- if it's crashing with a machine check
> exception, then let's disable machine check exceptions, and see if things
> still break.
> 
> Try booting with the parameter "nomce".  Be aware that mce is a mechanism
> for the processor to inform the kernel of thermal issues or component 
> failure.  You'll only want to disable this mechanism if you aren't having
> thermal problems.  

I tried "nomce". The machine does not "halt" now with MCE kernel panic
messages onscreen but resets after 3-4 hours of work under 2 or more jobs.

As I wrote in a response to Robert's message, it seems to be a memory
issue, as there are no crashes with Kingston 1GB memory modules.
One of the machines and the memory went back to the dealer for tests.

> P.S.  I came a little late to this party -- I didn't see the original message.
> Did you include the text of the kernel crash?

Below is the kernel message, OCR-ed from a digital photo of the screen :)
It is followed by the decoded output, as the message itself advises:

Fedora Core release 4 (Stentz)
kernel 2.6.16-1.2069_FC4smp on an x86_64

red10 login:
HARDWARE ERROR
        CPU 0: Machine Check Exception: 4 Bank 4: f604a00200000813
TSC 1504205a42ba ADDR 115e47828
This is not a software problem!
Run through mcelog --ascii to decode and contact your hardware vendor
Kernel panic - not syncing: Machine check

Call Trace: <#MC> 
     <ffffffff80134e6a>{panic+133} <ffffffff801129eb>{mcheck_timer+0}
     <ffffffff801131fc>{do_machine_check+753} 
     <ffffffff8010be43>{machine_check+127} <EOE>

------------------

mcelog --ascii  output:

HARDWARE ERROR
CPU 0 BANK 4 TSC 1504205a42ba 
MCG status:MCIP 
MCi status:
Error overflow
Uncorrected error
Error enabled
MCi_ADDR register valid
Processor context corrupt
MCA:BUS Generic Originated-request Read Memory-access Request-timeout Error
Model:
STATUS f604a00200000813 MCGSTATUS 4
------------------
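
The flag lines in the mcelog output can be reproduced by hand from the raw STATUS word. The sketch below covers only the architectural x86 flag bits at the top of MCi_STATUS and the three MCG_STATUS bits; it does not attempt to interpret the model-specific parts of the register, and the names `set_flags` and `mca_error_code` are made up for illustration.

```python
# Architectural flag bits in the upper part of an x86 MCi_STATUS register.
MCI_STATUS_FLAGS = {
    63: "VAL (register valid)",
    62: "OVER (error overflow)",
    61: "UC (uncorrected error)",
    60: "EN (error enabled)",
    59: "MISCV (MCi_MISC register valid)",
    58: "ADDRV (MCi_ADDR register valid)",
    57: "PCC (processor context corrupt)",
}

# Flag bits of the global MCG_STATUS register.
MCG_STATUS_FLAGS = {0: "RIPV", 1: "EIPV", 2: "MCIP"}


def set_flags(value, flag_names):
    """Return the names of the flags whose bit is set, highest bit first."""
    return [name
            for bit, name in sorted(flag_names.items(), reverse=True)
            if (value >> bit) & 1]


def mca_error_code(status):
    """Return the low 16 bits of MCi_STATUS, which hold the MCA error code."""
    return status & 0xFFFF
```

Applied to STATUS f604a00200000813, `set_flags` yields VAL, OVER, UC, EN, ADDRV and PCC, which agrees with the flag lines mcelog printed above (MISCV is clear). MCGSTATUS 4 has only MCIP set. The low 16 bits, 0x0813, are the error code that mcelog renders as the "MCA:" line.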

regards, Michal.

-- 
  Michal Szymanski (msz at astrouw dot edu dot pl)
  Warsaw University Observatory, Warszawa, POLAND


* Re: FC4 crashes repeatedly on Supermicro AS1020A-T dual-core Opterons, SMP
  2006-05-09 12:23     ` Michal Szymanski
@ 2006-05-24 20:23       ` Bill Davidsen
  2006-05-24 20:28         ` Bill Davidsen
  0 siblings, 1 reply; 10+ messages in thread
From: Bill Davidsen @ 2006-05-24 20:23 UTC
  To: Michal Szymanski; +Cc: SMP list

Michal Szymanski wrote:

>Hi all,
>
>I got 3 copies of Robert's message but none of Bill's :-)
>
>Still, I don't quite understand Bill's question ("What happens if you
>use only one CPU?"). The answer is quoted just above this question!
>There were no crashes with the system running on a non-SMP kernel.
>  
>

It's a great answer, but not to my question. I wasn't asking what 
happens with a different kernel, but what happens when you run the SMP 
kernel and ==>use<== only one CPU by setting the max cpu to one. The uni 
kernel doesn't have a lot of code in an SMP kernel, so it haides a lot 
of possible questions.

>In the meantime I got Kingston 1GB modules from my dealer, for testing.
>Strange as it seems, I could not crash the machine with Kingston
>memory, running tests for as long as 72 hours. It seems, then, that it is a
>memory issue, although I do not understand why the same memory crashes
>the machine in SMP and does not in non-SMP, under similar load. Also,
>the Patriot 2GB memory modules (which seem to crash the machines) are on
>Supermicro's list of memory recommended for the H8DAR-T mobo.
>
>One of the machines went back to the dealer (actually to their memory
>supplier) for tests. The memory guys seem not to trust our crashing
>experience. We'll see what happens. I am afraid, however, that they will
>say "the memory is OK".
>  
>
The memory may be operating within spec, the timing setup in the BIOS 
may be incorrect, etc, etc. Unfortunately it is possible to get a case 
where everything is right but it doesn't work. Depending on the BIOS 
capabilities, adding 0.05 V or 0.1 V to the memory voltage (can you do 
that?) might solve the problem, or I guess make it worse.

-- 
bill davidsen <davidsen@tmr.com>
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979



* Re: FC4 crashes repeatedly on Supermicro AS1020A-T dual-core Opterons, SMP
  2006-05-24 20:23       ` Bill Davidsen
@ 2006-05-24 20:28         ` Bill Davidsen
  0 siblings, 0 replies; 10+ messages in thread
From: Bill Davidsen @ 2006-05-24 20:28 UTC
  To: Bill Davidsen; +Cc: Michal Szymanski, SMP list

Bill Davidsen wrote:

> It's a great answer, but not to my question. I wasn't asking what 
> happens with a different kernel, but what happens when you run the SMP 
> kernel and ==>use<== only one CPU by setting the max cpu to one. The 
> uni kernel doesn't have a lot of code in an SMP kernel, so it haides a 
> lot of possible questions. 


s/haides/hides/

Yes, I know my original question wasn't explicit on what I was asking, 
it's just the first thing I would have tried because I wouldn't have 
that uni kernel around.

-- 
bill davidsen <davidsen@tmr.com>
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979

