* Re: MSI K8D-Master - GART error 3
From: Andi Kleen @ 2003-08-05 0:11 UTC
To: Simon Garner; +Cc: linux-kernel

"Simon Garner" <sgarner@expio.co.nz> writes:

> Aug 4 12:52:41 terra kernel: Northbridge status 9405c00000000a13
> Aug 4 12:52:41 terra kernel: GART error 3

There is nothing in any of my trees that generates such a message. If it was GART related it would be either "GART TLB error ..." or "extended error gart error". But even that should not happen anymore, see below.

I don't know what the RedHat kernel does, they may have changed the MCE handler over the reference port.

> The system also has an Adaptec 2120S scsi raid card.

Probably the driver is doing something bad with the pci_dma API (which uses the GART on x86-64).

You can always disable it with mce=off or better mce=0, as the message seems to be caused by the periodic non fatal MCE check timer.

However there was a bug in the MCE handler where it managed to turn on a GART related MCE event through the backdoor that doesn't work correctly and is sometimes raised spuriously. But at least in the SuSE beta9 kernel or recent x86-64.org kernels this should have been fixed. But it doesn't generate such an error message anyways, so it's hard to know what the exact cause is.

I would suggest to retry with a recent x86-64.org CVS kernel and see if it still happens there.

-Andi

* Re: MSI K8D-Master - GART error 3
From: Simon Garner @ 2003-08-05 0:45 UTC
To: Andi Kleen; +Cc: linux-kernel

Andi Kleen <ak@muc.de> wrote:

> There is nothing in any of my trees that generates such a message.
> If it was GART related it would be either "GART TLB error ..." or
> "extended error gart error". But even that should not happen anymore,
> see below.
>
> I don't know what the RedHat kernel does, they may have changed the
> MCE handler over the reference port.
>

A quick google brings up this reference:
http://www.iglu.org.il/lxr/source/arch/x86_64/kernel/bluesmoke.c

The error appears to be generated by the code starting around line 152 in that file.

Btw, what is 'bluesmoke'?

>> The system also has an Adaptec 2120S scsi raid card.
>
> Probably the driver is doing something bad with the pci_dma API
> (which uses the GART on x86-64)
>

Certainly I had a lot of trouble with this card. I was pleased the aacraid driver worked enough to let me even install this time (the Red Hat GinGin64 installer gave a kernel panic), so I wouldn't be surprised if this card/driver were the cause. :(

> You can always disable it with mce=off or better mce=0
> as the message seems to be caused by the periodic non fatal MCE check
> timer.
>

What will I lose by disabling this?

I just tried booting with mce=0 and I am still getting the same errors.

> However there was a bug in the MCE handler where it managed to turn on
> a GART related MCE event through the backdoor that doesn't work
> correctly and is sometimes raised spuriously. But at least in the SuSE
> beta9 kernel or recent x86-64.org kernels this should have been
> fixed. But it doesn't generate such an error message anyways,
> so it's hard to know what the exact cause is.
>
> I would suggest to retry with a recent x86-64.org CVS kernel and see
> if it still happens there.
>

I will give that a go and see what happens.

Thanks for the response Andi.

-Simon

* Re: MSI K8D-Master - GART error 3
From: Andi Kleen @ 2003-08-05 13:42 UTC
To: Simon Garner; +Cc: Andi Kleen, linux-kernel

On Tue, Aug 05, 2003 at 12:45:01PM +1200, Simon Garner wrote:
> Andi Kleen <ak@muc.de> wrote:
>
> > There is nothing in any of my trees that generates such a message.
> > If it was GART related it would be either "GART TLB error ..." or
> > "extended error gart error". But even that should not happen anymore,
> > see below.
> >
> > I don't know what the RedHat kernel does, they may have changed the
> > MCE handler over the reference port.
> >
>
> A quick google brings up this reference:
> http://www.iglu.org.il/lxr/source/arch/x86_64/kernel/bluesmoke.c

Ok, that's the very old MCE code that incorrectly enabled the northbridge machine check. Don't use that or use mce=off. However I still think it's a driver bug in your case. If it was the shaky GART MCE itself you would get a panic because it's an unrecoverable MCE. More likely the driver is accessing PCI DMA mappings after they got unmapped, which is a serious bug, but somehow not serious enough that the northbridge triggers the MCE.

I was confused by your statement that the SuSE 8.2 beta9 kernel generated that. It didn't because it doesn't contain that old code.

What does a modern kernel like the SuSE one or a x86-64.org kernel generate exactly?

>
> The error appears to be generated by the code starting around line 152
> in that file.
>
> Btw, what is 'bluesmoke'?

Alan Cox's sense of humour. Look it up in the jargon file.

> > You can always disable it with mce=off or better mce=0
> > as the message seems to be caused by the periodic non fatal MCE check
> > timer.
> >
>
> What will I lose by disabling this?

mce=0 turns off periodic MCE checking for non fatal errors. That's not a big issue, the worst you lose is reporting of one bit corrected ECC memory failures.

mce=off turns off MCE reporting for fatal MCE exceptions (however your box may still crash when something really bad happens).

mce=0 should have turned off the periodic check and your message very much looks like a periodic one, as actual MCE exceptions report more data. I'm a bit puzzled why it doesn't kill the message here. You can try mce=off, but I'm not sure it will help either.

Using a newer kernel is probably a good idea anyways, as there were many bugfixes since then.

-Andi

* Re: MSI K8D-Master - GART error 3
From: Simon Garner @ 2003-08-10 22:43 UTC
To: Andi Kleen; +Cc: linux-kernel

On Wednesday, August 06, 2003 1:42 AM [GMT+1200=NZT], Andi Kleen <ak@colin2.muc.de> wrote:
>
> Ok, that's the very old MCE code that incorrectly enabled the
> northbridge machine check. Don't use that or use mce=off. However I
> still think it's a driver bug in your case. If it was the shaky GART
> MCE itself you would get a panic because it's an unrecoverable MCE.
> More likely the driver is accessing PCI DMA mappings after they got
> unmapped, which is a serious bug, but somehow not serious enough that
> the northbridge triggers the MCE.
>
> I was confused by your statement that the SuSE 8.2 beta9 kernel
> generated that. It didn't because it doesn't contain that old code.
>
> What does a modern kernel like the SuSE one or a x86-64.org kernel
> generate exactly?
>

I have reinstalled SuSE now, and I apologise as I was only partially correct. I do get errors, but they are slightly different from RH. They appear to be saying the same thing, though.

Every 30 seconds I get:

Aug 11 10:37:06 terra kernel: Northbridge status 9405c00000000a13
Aug 11 10:37:06 terra kernel: ECC syndrome bits b
Aug 11 10:37:06 terra kernel: extended error ecc error
Aug 11 10:37:06 terra kernel: link number 0
Aug 11 10:37:06 terra kernel: corrected ecc error
Aug 11 10:37:06 terra kernel: error address valid
Aug 11 10:37:06 terra kernel: error enable
Aug 11 10:37:06 terra kernel: previous error lost
Aug 11 10:37:06 terra kernel: error address 00000000003e4710
Aug 11 10:37:36 terra kernel: Northbridge status 9405c00000000813
Aug 11 10:37:36 terra kernel: ECC syndrome bits b
Aug 11 10:37:36 terra kernel: extended error ecc error
Aug 11 10:37:36 terra kernel: link number 0
Aug 11 10:37:36 terra kernel: corrected ecc error
Aug 11 10:37:36 terra kernel: error address valid
Aug 11 10:37:36 terra kernel: error enable
Aug 11 10:37:36 terra kernel: previous error lost
Aug 11 10:37:36 terra kernel: error address 00000000003c4220

These suggest it's just reporting ECC corrections. Why would it do this exactly every 30 seconds? (or is that just the reporting interval?)

# uname -a
Linux terra 2.4.19-SMP #1 SMP Wed Jun 25 21:37:18 UTC 2003 x86_64 unknown unknown GNU/Linux

thanks for the help,

-Simon

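For readers who want to see how the individual lines of the SuSE output relate to the raw "Northbridge status" word, here is a minimal Python sketch. The bit positions are an assumption on my part (a K8-style northbridge machine-check status layout as I understand it) and are not stated anywhere in this thread; the function and field names are hypothetical. For the two values logged above, this layout does give syndrome 0xb with the "corrected ecc", "address valid" and "enable" bits set and link number 0, which agrees with the kernel's own decoding, but it should be checked against the processor documentation before being relied on.

```python
def decode_nb_status(status: int) -> dict:
    """Split a 64-bit northbridge status word into named fields.

    ASSUMPTION: K8-style layout; bit positions are not taken from the thread.
    """
    high = (status >> 32) & 0xFFFFFFFF  # upper 32 bits: flags, syndrome, link
    low = status & 0xFFFFFFFF           # lower 32 bits: error codes

    def bit(n: int) -> bool:
        return bool((high >> n) & 1)

    return {
        "valid": bit(31),
        "overflow": bit(30),
        "uncorrected": bit(29),
        "enabled": bit(28),
        "addr_valid": bit(26),
        "syndrome": (high >> 15) & 0xFF,    # ECC syndrome, bits 22:15
        "corrected_ecc": bit(14),
        "uncorrected_ecc": bit(13),
        "link": (high >> 4) & 0x7,          # link number, bits 6:4
        "ext_error": (low >> 16) & 0xF,     # extended error code
        "error_code": low & 0xFFFF,
    }


if __name__ == "__main__":
    # The two status words from the SuSE log in this message.
    for raw in ("9405c00000000a13", "9405c00000000813"):
        fields = decode_nb_status(int(raw, 16))
        print(raw, "syndrome=%x" % fields["syndrome"],
              "corrected_ecc=%s" % fields["corrected_ecc"],
              "error_code=%04x" % fields["error_code"])
```

Under this assumed layout the two log entries differ only in the low 16-bit error code (0a13 versus 0813); the flags and the syndrome are identical.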
* Re: MSI K8D-Master - GART error 3
From: Andi Kleen @ 2003-08-10 22:56 UTC
To: Simon Garner; +Cc: linux-kernel

On Mon, Aug 11, 2003 at 10:43:57AM +1200, Simon Garner wrote:
> These suggest it's just reporting ECC corrections. Why would it do this

Yep. You have faulty DIMMs, consider replacing them.

> exactly every 30 seconds? (or is that just the reporting interval?)

The interval timer checking for "silent" MCEs runs every 30s. You can change that by booting with mce=<number>; then it will run every <number> seconds. 0 should turn it off.

-Andi

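The mce= boot option values mentioned in this thread can be summarised in a small Python sketch. This only restates what the messages say (mce=off, mce=0, mce=<number>, and a 30 second default); it is not the kernel's parsing code, and the names MceSetting and interpret_mce_option are hypothetical. The thread does not say what mce=off does to the periodic check, so leaving the interval unchanged in that case is an assumption.

```python
from typing import NamedTuple, Optional

DEFAULT_POLL_SECONDS = 30  # "runs every 30s" according to the thread


class MceSetting(NamedTuple):
    fatal_reporting: bool  # reporting of fatal MCE exceptions
    poll_seconds: int      # interval of the periodic check; 0 means off


def interpret_mce_option(value: Optional[str]) -> MceSetting:
    """Interpret the value of mce=<value> as described in this thread."""
    if value is None:
        # Option not given: everything stays at its default.
        return MceSetting(True, DEFAULT_POLL_SECONDS)
    if value == "off":
        # ASSUMPTION: the periodic check is left unchanged by mce=off.
        return MceSetting(False, DEFAULT_POLL_SECONDS)
    seconds = int(value)  # raises ValueError for anything non-numeric
    if seconds < 0:
        raise ValueError("interval must not be negative")
    return MceSetting(True, seconds)


if __name__ == "__main__":
    for option in (None, "0", "60", "off"):
        print(option, interpret_mce_option(option))
```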
* Re: MSI K8D-Master - GART error 3
From: Simon Garner @ 2003-08-12 23:22 UTC
To: Andi Kleen; +Cc: linux-kernel

On Monday, August 11, 2003 10:56 AM [GMT+1200=NZT], Andi Kleen <ak@colin2.muc.de> wrote:
> On Mon, Aug 11, 2003 at 10:43:57AM +1200, Simon Garner wrote:
>> These suggest it's just reporting ECC corrections. Why would it do
>> this
>
> Yep. You have faulty DIMMs, consider replacing them.
>

Well I found that a little hard to stomach (since there's four DIMMs - surely they couldn't all be faulty - and I had already been through a whole other complete set with the same results, when the supplier sent the wrong speed modules), but now that I knew the errors were memory-related I did some more experimenting.

(Here is the memory population chart from the motherboard manual to help make sense of this: http://www.expio.co.nz/~sgarner/terra/msi9131memorypop.gif)

First I found that if I disabled ECC in the BIOS then the system wouldn't even POST. But if I rearranged the modules so that they were in single channel operation (using only three DIMMs in slots 2,4,6) then the system would boot and I got no errors in SuSE (even after reenabling ECC).

Then I tried using a different memory population layout, using all four DIMMs as dual channel w/ ECC in slots 3,4,5,6 where I had been using 1,2,5,6. The system booted and again I got no errors in SuSE.

"That's strange," thought I, so I tried putting the memory back as it was, in slots 1,2,5,6, with ECC enabled. Booted the system and still no errors in SuSE.

So I'm not sure what I did exactly but the system is now running fine and the ECC errors are gone. I'm still using the same DIMMs - the only thing that may have changed is the DIMMs may be arranged differently among the slots. I have tried swapping them around though and I still can't get the ECC errors back. But that's fine because I didn't particularly want the errors anyway! :)

-Simon

PS: Under the Northbridge/ECC configuration in the BIOS, the motherboard has options for DRAM, L2 and L1 cache "BG Scrub" which are selected as times from 40ns through to some microseconds. There are also options for "DRAM Scrub REDIRECT" and "ECC Chip Kill". The motherboard manual offers no advice as to the preferred values for these settings or what they do. Can anyone suggest good values for these? I currently have them disabled.

* MSI K8D-Master - GART error 3
From: Simon Garner @ 2003-08-04 1:05 UTC
To: linux-kernel; +Cc: taroon-beta-list

Hi lists,

(please cc me on any replies as I am not currently subscribed to lkml)

I have recently installed Red Hat Enterprise Linux 2.9.5 Beta (Taroon) x86-64 on an MSI K8D-Master (MSI-9131) motherboard with Dual Opteron 240 processors.

While the system is running, every 30 seconds I get the following on system console and in /var/log/messages:

Aug 4 12:52:41 terra kernel: Northbridge status 9405c00000000a13
Aug 4 12:52:41 terra kernel: GART error 3
Aug 4 12:52:41 terra kernel: Lost an northbridge error
Aug 4 12:52:41 terra kernel: NB error address 00000000002e0310
Aug 4 12:53:11 terra kernel: Northbridge status 9405c00000000a13
Aug 4 12:53:11 terra kernel: GART error 3
Aug 4 12:53:11 terra kernel: Lost an northbridge error
Aug 4 12:53:11 terra kernel: NB error address 0000000004432320

and so forth.

This also occurred under SuSE Linux 8.2 Beta x86_64 and it even occurs while running the Red Hat installer (isolinux).

Otherwise the system seems to run fine.

Can anyone shed some light on what this means, and how concerned should I be? Is it fixable?

I thought GART referred to the AGP aperture - this system doesn't actually have an AGP port, could that be the cause of this? (It has an onboard ATI Rage XL chip)

# uname -a
Linux terra 2.4.21-1.1931.2.349.2.2.entsmp #1 SMP Fri Jul 18 00:06:19 EDT 2003 x86_64 x86_64 x86_64 GNU/Linux

The system also has an Adaptec 2120S scsi raid card.

cheers,

-Simon

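To confirm from a log file that the reports really arrive at a fixed interval, something like the following Python sketch could be used. It assumes syslog lines shaped like the ones quoted in this message; the function names are hypothetical, and the year has to be supplied by hand because the syslog timestamps in the thread carry none.

```python
import re
from datetime import datetime

# Matches e.g. "Aug 4 12:52:41 terra kernel: Northbridge status 9405c00000000a13"
STATUS_LINE = re.compile(
    r"^(\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+\S+\s+kernel:\s+"
    r"Northbridge status\s+([0-9a-f]{16})"
)


def northbridge_events(lines, year=2003):
    """Return (timestamp, status) pairs for every 'Northbridge status' line."""
    events = []
    for line in lines:
        match = STATUS_LINE.match(line)
        if match:
            when = datetime.strptime(
                f"{year} {match.group(1)}", "%Y %b %d %H:%M:%S"
            )
            events.append((when, int(match.group(2), 16)))
    return events


def intervals_seconds(events):
    """Seconds elapsed between consecutive events."""
    return [
        int((later[0] - earlier[0]).total_seconds())
        for earlier, later in zip(events, events[1:])
    ]


if __name__ == "__main__":
    sample = [
        "Aug 4 12:52:41 terra kernel: Northbridge status 9405c00000000a13",
        "Aug 4 12:52:41 terra kernel: GART error 3",
        "Aug 4 12:53:11 terra kernel: Northbridge status 9405c00000000a13",
    ]
    print(intervals_seconds(northbridge_events(sample)))
```

On a real system the lines would come from /var/log/messages rather than from the inline sample.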