linux-kernel.vger.kernel.org archive mirror
* Memory Problems - CTCS/memtst
@ 2001-08-02 13:43 Corin Hartland-Swann
  2001-08-02 14:53 ` Alan Cox
  0 siblings, 1 reply; 6+ messages in thread
From: Corin Hartland-Swann @ 2001-08-02 13:43 UTC
  To: linux-kernel; +Cc: Jason Collins


Hi there,

I have been trying to identify the cause of a number of problems we've
been having with a server.

The server consists of two Pentium III 1000/133's on a Tyan Tiger LE
motherboard, 4 x 1024MB PC133 ECC DIMMs and two UDMA disk drives. It is
running kernel 2.4.7 (unpatched, but with tailored config). To rule out
problems related to the large amount of memory, I temporarily removed all
but one of the DIMMs, leaving it with 1024MB.

I've been getting the usual signs of memory errors:

  segmentation faults
  "unable to handle kernel NULL pointer dereference at virtual address"
  kernel compilations failing
  random panics

I have been using VA CTCS (esp. memtst) to try to identify where the
problems lie.

The latest sign of faulty memory somewhere came when I copied the root
disk to another disk using:

  find / -mount -print | cpio -pm /mnt/disk

When I compared the files (between the disks) with md5sum, two of the
files came up with differing checksums (i.e. the files had been corrupted
while copying).
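The checksum comparison described above can be sketched roughly as follows.
This is only an illustration, not the procedure actually used in this
thread: the function names `md5_of_file` and `differing_files` are made up
for the example, and unlike `find -mount` the walk does not stop at
filesystem boundaries, so the source tree should not contain the copy.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def differing_files(source_root, copy_root):
    """Yield relative paths of regular files whose copy is missing or differs."""
    for directory, _subdirs, filenames in os.walk(source_root):
        for name in sorted(filenames):
            source_path = os.path.join(directory, name)
            # Skip symlinks, devices and other non-regular files.
            if os.path.islink(source_path) or not os.path.isfile(source_path):
                continue
            relative = os.path.relpath(source_path, source_root)
            copy_path = os.path.join(copy_root, relative)
            if not os.path.isfile(copy_path):
                yield relative
            elif md5_of_file(source_path) != md5_of_file(copy_path):
                yield relative
```

Each path it yields is a file worth inspecting more closely, for example
block by block, to see what kind of corruption occurred.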

The BIOS has an ECC logging feature, and if I understand it correctly,
then there /cannot/ have been any main memory errors or they would have
shown up in the logs. At least not any single or double-bit errors (ECC
corrects single-bit and detects double-bit, doesn't it?)
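To illustrate the "corrects single-bit, detects double-bit" behaviour, here
is a minimal sketch of a SECDED code using a toy Hamming(8,4) layout over 4
data bits. It is an assumption-laden illustration only: the thread does not
say which code this chipset implements, and the names `encode` and `decode`
are invented for the example.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit SECDED word (index 0 = overall parity)."""
    data = [(nibble >> i) & 1 for i in range(4)]
    word = [0] * 8
    # Data bits sit at the non-power-of-two positions 3, 5, 6, 7.
    word[3], word[5], word[6], word[7] = data
    # Hamming parity bits at positions 1, 2, 4.
    word[1] = word[3] ^ word[5] ^ word[7]
    word[2] = word[3] ^ word[6] ^ word[7]
    word[4] = word[5] ^ word[6] ^ word[7]
    # Overall parity over positions 1..7 makes the whole word even.
    word[0] = sum(word[1:]) % 2
    return word


def decode(word):
    """Return (status, nibble); nibble is None for an uncorrectable error."""
    word = list(word)
    syndrome = 0
    for position in range(1, 8):
        if word[position]:
            syndrome ^= position
    overall = sum(word) % 2
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd parity: treat as a single flipped bit and repair it.
        # A syndrome of 0 here means the overall parity bit itself flipped.
        word[syndrome] ^= 1
        status = "corrected"
    else:
        # Even parity but non-zero syndrome: two bits flipped.
        return "double-bit error detected", None
    data = [word[3], word[5], word[6], word[7]]
    return status, sum(bit << i for i, bit in enumerate(data))
```

Note that in a scheme like this, three or more flipped bits in one word can
look like a single-bit error and be "corrected" to the wrong value, so a
clean ECC log rules out single and double-bit errors but not everything.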

When using the older version of memtst (by Larry Augustin), I did not find
any errors, and I ran it a lot of times.

I'm now using CTCS 1.3.0pre2, and memtst has been completely rewritten (by
Jason Collins). There are a number of new tests in addition to the
original one. I find no memory errors with tests 1, 3, 4 and 5 - but test
2 consistently comes back with an error:

  # ./memtst -2 -B -c 1024

  <...snip...>

  Failure Context:
       offset        expected             got
            0        aaaaaaaa        aaaaaaaa   ### fail location
            1        aaaaaaa9        aaaaaaa9
            2        aaaaaaa8        aaaaaaa8
            3        aaaaaaa7        aaaaaaa7
            4        aaaaaaa6        aaaaaaa6
            5        aaaaaaa5        aaaaaaa5
            6        aaaaaaa4        aaaaaaa4
            7        aaaaaaa3        aaaaaaa3
            8        aaaaaaa2        aaaaaaa2
            9        aaaaaaa1        aaaaaaa1
  8 8 fail_page_offset
  Scanning /proc/kcore.  Fire in the hole!
  The memory failure location could not be determined. This,
  while not provably impossible, should never happen under practical
  circumstances unless there is a bug or the memtst program image is
  corrupt.
  Cache RAM fault likely.

I compiled the program on another machine, which has had no problems, and
have compared the two binaries with md5sum to ensure that it was not
corrupted in transit - so I don't think it's a problem with the binary.

Initially, memtst was finding the memory location very quickly, and it was
different each time it was run. I assumed that one of the two CPUs had a
hardware fault, and so I removed one of them and re-ran the test. From
this point on, it could not identify the location of the failed memory.

I then swapped over the two CPUs, and got the same result as above. It is
reproducible every time, including straight after a reboot.

Please could you CC: me in on any replies as I'm not on LKML.

Regards,

Corin

/------------------------+-------------------------------------\
| Corin Hartland-Swann   |    Tel: +44 (0) 20 7491 2000        |
| Commerce Internet Ltd  |    Fax: +44 (0) 20 7491 2010        |
| 22 Cavendish Buildings | Mobile: +44 (0) 79 5854 0027        | 
| Gilbert Street         |                                     |
| Mayfair                |    Web: http://www.commerce.uk.net/ |
| London W1K 5HJ         | E-Mail: cdhs@commerce.uk.net        |
\------------------------+-------------------------------------/






* Re: Memory Problems - CTCS/memtst
  2001-08-02 13:43 Memory Problems - CTCS/memtst Corin Hartland-Swann
@ 2001-08-02 14:53 ` Alan Cox
  2001-08-02 16:09   ` Corin Hartland-Swann
  2001-08-09 14:45   ` Corin Hartland-Swann
  0 siblings, 2 replies; 6+ messages in thread
From: Alan Cox @ 2001-08-02 14:53 UTC
  To: Corin Hartland-Swann; +Cc: linux-kernel, Jason Collins

> The BIOS has an ECC logging feature, and if I understand it correctly,
> then there /cannot/ have been any main memory errors or they would have
> shown up in the logs. At least not any single or double-bit errors (ECC
> corrects single-bit and detects double-bit, doesn't it?)

Almost certainly it should have been logged. That indicates you may have
problems elsewhere (PCI bus, drivers, motherboard, processors...) or you
might even be triggering a kernel bug.

Try a 2.2.19 kernel, just out of curiosity.


* Re: Memory Problems - CTCS/memtst
  2001-08-02 14:53 ` Alan Cox
@ 2001-08-02 16:09   ` Corin Hartland-Swann
  2001-08-02 18:10     ` Jason T. Collins
  2001-08-09 14:45   ` Corin Hartland-Swann
  1 sibling, 1 reply; 6+ messages in thread
From: Corin Hartland-Swann @ 2001-08-02 16:09 UTC
  To: Alan Cox; +Cc: linux-kernel, Jason Collins


Alan,

On Thu, 2 Aug 2001, Alan Cox wrote:
> > The BIOS has an ECC logging feature, and if I understand it correctly,
> > then there /cannot/ have been any main memory errors or they would have
> > shown up in the logs. At least not any single or double-bit errors (ECC
> > corrects single-bit and detects double-bit, doesn't it?)
> 
> Almost certainly it should have been logged. That indicates you may have
> problems elsewhere (PCI bus, drivers, motherboard, processors...) or you
> might even be triggering a kernel bug.
> 
> Try a 2.2.19 kernel, just out of curiosity.

Tried stock 2.2.19, and still got the errors on test 2 (only).

I've just tried test 2 on another machine (with good memory) and it looks
like it's a bug in memtst rather than the detection of an error.

D'oh! Back to the drawing board...

I will experiment with file copies to see if I'm still getting memory
corruption under 2.2.19 - if I am (and considering the lack of ECC errors),
do you think I can consider that conclusive proof that there is a problem
with the CPUs or the motherboard?

I have another motherboard of the same type which I can swap out - I will
try that later on...

Thanks,

Corin






* Re: Memory Problems - CTCS/memtst
  2001-08-02 16:09   ` Corin Hartland-Swann
@ 2001-08-02 18:10     ` Jason T. Collins
  0 siblings, 0 replies; 6+ messages in thread
From: Jason T. Collins @ 2001-08-02 18:10 UTC
  To: Corin Hartland-Swann, linux-kernel

Corin Hartland-Swann wrote:
> 
> Alan,
> 
> On Thu, 2 Aug 2001, Alan Cox wrote:
> > > The BIOS has an ECC logging feature, and if I understand it correctly,
> > > then there /cannot/ have been any main memory errors or they would have
> > > shown up in the logs. At least not any single or double-bit errors (ECC
> > > corrects single-bit and detects double-bit, doesn't it?)

Remember, the memory itself is only one place where things can go wrong.
Other memory-related areas are not covered by ECC memory, including the
following:

North bridge (memory controller)
L1/L2/L3 cache levels (some processors have ECC checking in the cache)
Register corruption

In addition, transfers between the CPU and memory could be corrupted in
transit before the ECC checksum is calculated (I've actually seen this happen
on a poorly designed motherboard).  In other words, there are a lot of things
that could be wrong; see the FAQ in CTCS for more of my ramblings on the
subject.

One way to tell whether or not your memory is the problem is by examining your
files/coredumps for corruption.  If entire page-sized chunks have been
substituted with chunks from other files, pages in RAM, etc, you're likely
dealing with a kernel MM bug rather than memory corruption.  (I suppose an MMU
bug is possible too.. sigh...)  A few bits swapped here and there points to
hardware/faulty memory.  That's one reason why my memory checker displays that
nice context information, so those sorts of determinations can be made.

> I've just tried test 2 on another machine (with good memory) and it looks
> like it's a bug in memtst rather than the detection of an error.

This doesn't surprise me too much; the software is pretty new.  The fact that
the expected and resulting memory contents in the log are the same would seem
to confirm that, as would the fact that the 'error' happened on the first
byte in the test array, along with other strange things.  :)  A quick check
confirms it breaks for me too, so I'll find this bug and whack it in a new
release.  Expect something this weekend.

-- 
Jason T. Collins  "Noone has lived to see even three of my techniques.  It
Software Engineer  is almost sunset.  How many will you see before you die?"
VA Linux Systems   'Twilight' Suzuka, "Creeping Evil"


* Re: Memory Problems - CTCS/memtst
  2001-08-02 14:53 ` Alan Cox
  2001-08-02 16:09   ` Corin Hartland-Swann
@ 2001-08-09 14:45   ` Corin Hartland-Swann
  2001-08-09 15:10     ` Alan Cox
  1 sibling, 1 reply; 6+ messages in thread
From: Corin Hartland-Swann @ 2001-08-09 14:45 UTC
  To: Alan Cox, Jason Collins; +Cc: linux-kernel


Alan/Jason,

To recap...

On Thu, 2 Aug 2001, I wrote:
> > I have been trying to identify the cause of a number of problems we've
> > been having with a server.
> > 
> > The server consists of two Pentium III 1000/133's on a Tyan Tiger LE
> > motherboard, 4 x 1024MB PC133 ECC DIMMs and two UDMA disk drives. It is
> > running kernel 2.4.7 (unpatched, but with tailored config). To rule out
> > problems related to the large amount of memory, I temporarily removed all
> > but one of the DIMMs, leaving it with 1024MB.
> > 
> > I've been getting the usual signs of memory errors:
> > 
> >   segmentation faults
> >   "unable to handle kernel NULL pointer dereference at virtual address"
> >   kernel compilations failing
> >   random panics

On Thu, 2 Aug 2001, Jason Collins wrote:
> One way to tell whether or not your memory is the problem is by examining your
> files/coredumps for corruption.  If entire page-sized chunks have been
> substituted with chunks from other files, pages in RAM, etc, you're likely
> dealing with a kernel MM bug rather than memory corruption.  (I suppose an MMU
> bug is possible too.. sigh...)  A few bits swapped here and there points to
> hardware/faulty memory.  That's one reason why my memory checker displays that
> nice context information, so those sorts of determinations can be made.

I came up with a new way to get the problems to show up - I used the
prandom package in CTCS (generates large amounts of pseudorandom data) to
create a 2048MB file. I then used cp to copy it repeatedly to a different
disk, and used md5sum to compare the copies. For any copies that differed,
I used a perl program to compare them in 4K blocks and report the number
of differing bits in each differing block.

The result was that the differences came in groups of blocks, and these
were always whole multiples of 4K blocks. This suggests that the problem
is not related to memory errors, but more likely to the IDE driver or
(less likely) memory management.
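A block comparison of this kind might look roughly like the sketch below.
It is not the perl program mentioned above, just a Python illustration of
the same idea; the names `differing_blocks` and `group_runs` are invented
for the example, and the 4096-byte block size is taken from the 4K figure
in the text.

```python
BLOCK_SIZE = 4096  # 4K blocks, as in the comparison described above


def differing_blocks(path_a, path_b, block_size=BLOCK_SIZE):
    """Return (block_index, differing_bit_count) for each block that differs."""
    results = []
    with open(path_a, "rb") as file_a, open(path_b, "rb") as file_b:
        index = 0
        while True:
            block_a = file_a.read(block_size)
            block_b = file_b.read(block_size)
            if not block_a and not block_b:
                break
            if block_a != block_b:
                # Pad the shorter block with zero bytes if the sizes differ.
                length = max(len(block_a), len(block_b))
                value_a = int.from_bytes(block_a.ljust(length, b"\0"), "big")
                value_b = int.from_bytes(block_b.ljust(length, b"\0"), "big")
                # XOR leaves a 1 wherever the two blocks disagree.
                results.append((index, bin(value_a ^ value_b).count("1")))
            index += 1
    return results


def group_runs(block_indices):
    """Collapse sorted block indices into (first_block, run_length) pairs."""
    runs = []
    for index in block_indices:
        if runs and index == runs[-1][0] + runs[-1][1]:
            runs[-1] = (runs[-1][0], runs[-1][1] + 1)
        else:
            runs.append((index, 1))
    return runs
```

With pseudorandom data, a block that has been replaced wholesale should
differ in roughly half of its 32768 bits, whereas a flaky memory cell would
typically show up as a block with only one or a few differing bits.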

On Thu, 2 Aug 2001, Alan Cox wrote:
> > The BIOS has an ECC logging feature, and if I understand it correctly,
> > then there /cannot/ have been any main memory errors or they would have
> > shown up in the logs. At least not any single or double-bit errors (ECC
> > corrects single-bit and detects double-bit, doesn't it?)
> 
> Almost certainly it should have been logged. That indicates you may have
> problems elsewhere (PCI bus, drivers, motherboard, processors...) or you
> might even be triggering a kernel bug.
> 
> Try a 2.2.19 kernel, just out of curiosity.

The 2.2.19 kernel works without a problem. After trying a large number of
different kernels, 2.4.7-ac9 also works. I believe that this is because of
the new serverworks driver (as opposed to osb4).

So, I'm fairly convinced that the osb4 driver causes data corruption - has
anyone else experienced this?

What is the status of the new serverworks driver in 2.4.7-ac9 - is it due
to go into the main kernel soon?

Thanks,

Corin




* Re: Memory Problems - CTCS/memtst
  2001-08-09 14:45   ` Corin Hartland-Swann
@ 2001-08-09 15:10     ` Alan Cox
  0 siblings, 0 replies; 6+ messages in thread
From: Alan Cox @ 2001-08-09 15:10 UTC
  To: Corin Hartland-Swann; +Cc: Alan Cox, Jason Collins, linux-kernel

> What is the status of the new serverworks driver in 2.4.7-ac9 - is it due
> to go into the main kernel soon?

It's certainly ready to. Andre is the man to ask about the actual submission
schedule.
