Re: discriminate single bit error hardware failure from slab corruption.

* Re: discriminate single bit error hardware failure from slab corruption.
@ 2006-02-03  9:25 linux
  2006-02-03 14:14 ` Jan Engelhardt
  0 siblings, 1 reply; 16+ messages in thread
From: linux @ 2006-02-03  9:25 UTC (permalink / raw)
  To: davej; +Cc: linux-kernel

Um... case values are allowed to be expressions.

Isn't
+	switch (total) {
+		case POISON_FREE ^ 0x01:
+		case POISON_FREE ^ 0x02:
+		case POISON_FREE ^ 0x04:
+		case POISON_FREE ^ 0x08:
+		case POISON_FREE ^ 0x10:
+		case POISON_FREE ^ 0x20:
+		case POISON_FREE ^ 0x40:
+		case POISON_FREE ^ 0x80:
+			printk (KERN_ERR "Single bit error detected. Possibly bad RAM\n");

Infinitely clearer, even without the comments?  Or, if you want to
be cleverer:

	total ^= POISON_FREE;
	if ((total & (total-1)) == 0) {
		printk (KERN_ERR "Single bit error detected. Possibly bad RAM\n");
	}
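
The x & (x - 1) trick clears the lowest set bit, so the result is zero exactly when at most one bit is set. It is also zero for x == 0, so a check for "exactly one bit flipped" needs an extra non-zero test. A minimal user-space sketch, assuming the poison byte is 0x6b as in the hex dump in this thread; is_single_bit_flip is a hypothetical helper name, not a kernel function:

```c
#include <stdbool.h>

#define POISON_FREE 0x6b	/* assumed poison byte, as in the hex dump in this thread */

/* Hypothetical helper: true if 'byte' differs from the poison in exactly one bit. */
static bool is_single_bit_flip(unsigned char byte)
{
	unsigned char delta = byte ^ POISON_FREE;	/* set bits mark flipped positions */

	/* delta & (delta - 1) clears the lowest set bit; zero means at most one bit was set. */
	return delta != 0 && (delta & (delta - 1)) == 0;
}
```

For instance, is_single_bit_flip(0x6a) holds (bit 0 flipped), while is_single_bit_flip(0x68) does not (bits 0 and 1 flipped).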


If you wanted to get the bit-counting exactly accurate, you'd do:

	unsigned char total = 0, total2 = 0;

	for (i = 0; i < limit; i++) {
		unsigned char delta = data[offset+i];
		printk(" %02x", delta);
		delta ^= POISON_FREE;
		total2 |= total & delta;
		total |= delta;
	}
	printk("\n");

	/* If total2 has 0 bits set and total has at most 1 bit set... */
	if (!total2 && !(total & (total - 1))) {
		printk (KERN_ERR "Single bit error detected. Possibly bad RAM\n");
	}
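
The same two-accumulator idea can be tried outside the kernel. The sketch below is a user-space illustration only, assuming a poison byte of 0x6b; line_has_single_bit_error and the accumulator names are hypothetical. One accumulator records every bit position seen flipped, the other records positions seen flipped more than once. Unlike the fragment above, it also rejects a line with no flipped bits at all.

```c
#include <stdbool.h>
#include <stddef.h>

#define POISON_FREE 0x6b	/* assumed poison byte */

/* Hypothetical helper: true if exactly one bit in the buffer differs from the poison. */
static bool line_has_single_bit_error(const unsigned char *data, size_t len)
{
	unsigned char seen = 0;		/* bit positions flipped in at least one byte */
	unsigned char twice = 0;	/* bit positions flipped in more than one byte */
	size_t i;

	for (i = 0; i < len; i++) {
		unsigned char delta = data[i] ^ POISON_FREE;

		twice |= seen & delta;
		seen |= delta;
	}

	/* Something flipped, no position repeated, and only one position seen. */
	return seen != 0 && twice == 0 && (seen & (seen - 1)) == 0;
}
```

Two bytes with the same bit flipped set a bit in twice, and two bytes with different bits flipped leave two bits in seen, so both cases are rejected.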


* Re: discriminate single bit error hardware failure from slab corruption.
  2006-02-03  9:25 discriminate single bit error hardware failure from slab corruption linux
@ 2006-02-03 14:14 ` Jan Engelhardt
  0 siblings, 0 replies; 16+ messages in thread
From: Jan Engelhardt @ 2006-02-03 14:14 UTC (permalink / raw)
  To: linux; +Cc: davej, linux-kernel

>Um... case values are allowed to be expressions.

Whatever they are, they must be able to be reduced to an integer constant.
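
To make that constraint concrete: a macro XORed with a literal is folded by the compiler into an integer constant, so it is a valid case label, whereas a variable would not be. A minimal sketch, assuming a poison byte of 0x6b; matches_single_flip is a hypothetical name used only for illustration:

```c
#include <stdbool.h>

#define POISON_FREE 0x6b	/* assumed poison byte */

/* Hypothetical helper: each label below is an integer constant expression. */
static bool matches_single_flip(unsigned char value)
{
	switch (value) {
	case POISON_FREE ^ 0x01:
	case POISON_FREE ^ 0x02:
	case POISON_FREE ^ 0x04:
	case POISON_FREE ^ 0x08:
	case POISON_FREE ^ 0x10:
	case POISON_FREE ^ 0x20:
	case POISON_FREE ^ 0x40:
	case POISON_FREE ^ 0x80:
		return true;
	default:
		return false;
	}
}
```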

Jan Engelhardt
-- 


* Re: discriminate single bit error hardware failure from slab corruption.
  2006-02-03  1:46   ` Dave Jones
  2006-02-03  2:05     ` Avi Kivity
@ 2006-02-06 20:19     ` Pavel Machek
  1 sibling, 0 replies; 16+ messages in thread
From: Pavel Machek @ 2006-02-06 20:19 UTC (permalink / raw)
  To: Dave Jones, Avi Kivity, Linux Kernel

Hi!

>  > here? it seems more readable and more correct as well.
> 
> More readable ? Are you kidding ?

Well, his method really counts flipped bits. It looks very obvious
to me -- unlike magic numbers.

-- 
Thanks, Sharp!


* Re: discriminate single bit error hardware failure from slab corruption.
  2006-02-03  4:41         ` Roland Dreier
  2006-02-03  5:03           ` Dave Jones
@ 2006-02-03 14:12           ` Jan Engelhardt
  1 sibling, 0 replies; 16+ messages in thread
From: Jan Engelhardt @ 2006-02-03 14:12 UTC (permalink / raw)
  To: Roland Dreier; +Cc: Dave Jones, Avi Kivity, Linux Kernel

>
>I have to admit that Avi's code seems clearer to me too, though.
>
"me2". The second (longer) form of it, that is.

Jan Engelhardt
-- 


* Re: discriminate single bit error hardware failure from slab corruption.
  2006-02-02 19:24 Dave Jones
                   ` (3 preceding siblings ...)
  2006-02-03  0:44 ` Avi Kivity
@ 2006-02-03 14:09 ` Jan Engelhardt
  4 siblings, 0 replies; 16+ messages in thread
From: Jan Engelhardt @ 2006-02-03 14:09 UTC (permalink / raw)
  To: Dave Jones; +Cc: Linux Kernel

>
>000: 6b 6b 6b 6b 6a 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
>Single bit error detected. Possibly bad RAM. Please run memtest86.
>
>--- linux-2.6.15/mm/slab.c~	2006-01-09 13:25:17.000000000 -0500
>+++ linux-2.6.15/mm/slab.c	2006-01-09 13:26:01.000000000 -0500

So, and what do non-x86 users use?


Jan Engelhardt
-- 


* Re: discriminate single bit error hardware failure from slab corruption.
  2006-02-03  4:20       ` Dave Jones
  2006-02-03  4:41         ` Roland Dreier
@ 2006-02-03 11:05         ` Olivier Galibert
  1 sibling, 0 replies; 16+ messages in thread
From: Olivier Galibert @ 2006-02-03 11:05 UTC (permalink / raw)
  To: Dave Jones, Avi Kivity, Linux Kernel

On Thu, Feb 02, 2006 at 11:20:35PM -0500, Dave Jones wrote:
> +		case 0x6a:	/* 01101010 bit 0 flipped */
> +		case 0x69:	/* 01101001 bit 1 flipped */
> +		case 0x6f:	/* 01101111 bit 2 flipped */
> +		case 0x63:	/* 01100011 bit 3 flipped */
> +		case 0x7b:	/* 01111011 bit 4 flipped */
> +		case 0x4b:	/* 01001011 bit 5 flipped */
> +		case 0x2b:	/* 00101011 bit 6 flipped */
> +		case 0xeb:	/* 11101011 bit 7 flipped */

What about simply:
  case 0x6b^0x01:
  case 0x6b^0x02:
  case 0x6b^0x04:
  case 0x6b^0x08:
  case 0x6b^0x10:
  case 0x6b^0x20:
  case 0x6b^0x40:
  case 0x6b^0x80:

  OG.


* Re: discriminate single bit error hardware failure from slab corruption.
  2006-02-03  4:41         ` Roland Dreier
@ 2006-02-03  5:03           ` Dave Jones
  2006-02-03 14:12           ` Jan Engelhardt
  1 sibling, 0 replies; 16+ messages in thread
From: Dave Jones @ 2006-02-03  5:03 UTC (permalink / raw)
  To: Roland Dreier; +Cc: Avi Kivity, Linux Kernel

On Thu, Feb 02, 2006 at 08:41:26PM -0800, Roland Dreier wrote:
 >     Dave> Hmm, I made a mistake in my maths somewhere, and some of
 >     Dave> those values are incorrect, so having the compiler do the
 >     Dave> work would have stopped me screwing up, but once the correct
 >     Dave> values are used, I doubt there's ever a really compelling
 >     Dave> reason to change the slab poison pattern.
 > 
 > But Avi is still correct about false positives.  For example, if
 > something stomps on the slab poison and leaves it as
 > 
 >     e0 08 03 00
 > 
 > then that will add up to eb and still trigger your message, even
 > though it's far from a single bit error.

Ah, now I see the point Avi was making.

 > Maybe making the loop be something like
 > 
 > 	unsigned char total = 0, bad_count = 0;
 > 	printk(KERN_ERR "%03x:", offset);
 > 	for (i = 0; i < limit; i++) {
 > 		if (data[offset+i] != POISON_FREE) {
 > 			total += data[offset+i];
 > 			++bad_count;
 > 		}
 > 		printk(" %02x", (unsigned char)data[offset + i]);
 > 	}
 > 
 > and then you can put
 > 
 > 	if (bad_count == 1)
 > 
 > before the switch statement.
 > 
 > I have to admit that Avi's code seems clearer to me too, though.

I'm easily persuaded either way really, as long as
we arrive at a desirable end-result ;)

		Dave


* Re: discriminate single bit error hardware failure from slab corruption.
  2006-02-03  4:20       ` Dave Jones
@ 2006-02-03  4:41         ` Roland Dreier
  2006-02-03  5:03           ` Dave Jones
  2006-02-03 14:12           ` Jan Engelhardt
  2006-02-03 11:05         ` Olivier Galibert
  1 sibling, 2 replies; 16+ messages in thread
From: Roland Dreier @ 2006-02-03  4:41 UTC (permalink / raw)
  To: Dave Jones; +Cc: Avi Kivity, Linux Kernel

    Dave> Hmm, I made a mistake in my maths somewhere, and some of
    Dave> those values are incorrect, so having the compiler do the
    Dave> work would have stopped me screwing up, but once the correct
    Dave> values are used, I doubt there's ever a really compelling
    Dave> reason to change the slab poison pattern.

But Avi is still correct about false positives.  For example, if
something stomps on the slab poison and leaves it as

    e0 08 03 00

then that will add up to eb and still trigger your message, even
though it's far from a single bit error.
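
The collision is easy to reproduce in user space. The sketch below is an illustration, not kernel code: sum_of_bad_bytes is a hypothetical name, and the poison byte is assumed to be 0x6b. It mirrors the summing loop from the patch and shows that the stomped bytes and a genuine flip of bit 7 produce the same 8-bit total.

```c
#include <stddef.h>

#define POISON_FREE 0x6b	/* assumed poison byte */

/* Hypothetical helper: sum of the non-poison bytes, truncated to 8 bits. */
static unsigned char sum_of_bad_bytes(const unsigned char *data, size_t len)
{
	unsigned char total = 0;
	size_t i;

	for (i = 0; i < len; i++) {
		if (data[i] != POISON_FREE)
			total += data[i];	/* wraps modulo 256 */
	}
	return total;
}
```

Fed e0 08 03 00, this returns 0xeb; fed 6b eb 6b 6b (a real single-bit flip), it also returns 0xeb, so the sum alone cannot tell the two apart.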

Maybe making the loop be something like

	unsigned char total = 0, bad_count = 0;
	printk(KERN_ERR "%03x:", offset);
	for (i = 0; i < limit; i++) {
		if (data[offset+i] != POISON_FREE) {
			total += data[offset+i];
			++bad_count;
		}
		printk(" %02x", (unsigned char)data[offset + i]);
	}

and then you can put

	if (bad_count == 1)

before the switch statement.
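
A user-space sketch of that combination, for illustration only. The helper name one_bad_byte_one_bad_bit is hypothetical, the poison byte is assumed to be 0x6b, and the switch is swapped for an XOR test to keep the example short. The counter is also widened to unsigned int here so a long line cannot wrap it.

```c
#include <stdbool.h>
#include <stddef.h>

#define POISON_FREE 0x6b	/* assumed poison byte */

/* Hypothetical helper: true only if exactly one byte is off, and off by a single bit. */
static bool one_bad_byte_one_bad_bit(const unsigned char *data, size_t len)
{
	unsigned char total = 0;
	unsigned int bad_count = 0;	/* wider than unsigned char so it cannot wrap */
	unsigned char delta;
	size_t i;

	for (i = 0; i < len; i++) {
		if (data[i] != POISON_FREE) {
			total += data[i];
			++bad_count;
		}
	}

	if (bad_count != 1)
		return false;

	/* With a single bad byte, total is just that byte, so the XOR shows its flipped bits. */
	delta = total ^ POISON_FREE;
	return (delta & (delta - 1)) == 0;
}
```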

I have to admit that Avi's code seems clearer to me too, though.

 - R.


* Re: discriminate single bit error hardware failure from slab corruption.
  2006-02-03  2:05     ` Avi Kivity
@ 2006-02-03  4:20       ` Dave Jones
  2006-02-03  4:41         ` Roland Dreier
  2006-02-03 11:05         ` Olivier Galibert
  0 siblings, 2 replies; 16+ messages in thread
From: Dave Jones @ 2006-02-03  4:20 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Linux Kernel

On Fri, Feb 03, 2006 at 04:05:23AM +0200, Avi Kivity wrote:

 >    unsigned char modified_bits = data[offset+i] ^ POISON_FREE;
 >    int modified_bits_count = hweight8(modified_bits);
 >    total += modified_bits_count;
 > 
 > >wrt correctness, what do you see wrong with my approach?
 > Your code will generate a false positive 8 times in 256 runs, or 1 in 
 > 32. A 3% false positive rate seems excessive. It's also sensitive to
 > changes to POISON_FREE.

Hmm, I made a mistake in my maths somewhere, and some of those values
are incorrect, so having the compiler do the work would have stopped
me screwing up, but once the correct values are used, I doubt there's
ever a really compelling reason to change the slab poison pattern.

		Dave

In the case where we detect a single bit has been flipped, we spew
the usual slab corruption message, which users instantly think
is a kernel bug.  In a lot of cases, single bit errors are
down to bad memory, or other hardware failure.

This patch adds an extra line to the slab debug messages
in those cases, in the hope that users will try memtest before
they report a bug.

000: 6b 6b 6b 6b 6a 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
Single bit error detected. Possibly bad RAM. Run memtest86.

Signed-off-by: Dave Jones <davej@redhat.com>

--- linux-2.6.15/mm/slab.c~	2006-01-09 13:25:17.000000000 -0500
+++ linux-2.6.15/mm/slab.c	2006-01-09 13:26:01.000000000 -0500
@@ -1313,8 +1313,11 @@ static void poison_obj(kmem_cache_t *cac
 static void dump_line(char *data, int offset, int limit)
 {
 	int i;
+	unsigned char total=0;
 	printk(KERN_ERR "%03x:", offset);
 	for (i = 0; i < limit; i++) {
+		if (data[offset+i] != POISON_FREE)
+			total += data[offset+i];
 		printk(" %02x", (unsigned char)data[offset + i]);
 	}
 	printk("\n");
@@ -1019,6 +1023,22 @@ static void dump_line(char *data, int of
 		}
 	}
 	printk("\n");
+	switch (total) {
+					/* 01101011 (0x6b - SLAB_POISON) */
+		case 0x6a:	/* 01101010 bit 0 flipped */
+		case 0x69:	/* 01101001 bit 1 flipped */
+		case 0x6f:	/* 01101111 bit 2 flipped */
+		case 0x63:	/* 01100011 bit 3 flipped */
+		case 0x7b:	/* 01111011 bit 4 flipped */
+		case 0x4b:	/* 01001011 bit 5 flipped */
+		case 0x2b:	/* 00101011 bit 6 flipped */
+		case 0xeb:	/* 11101011 bit 7 flipped */
+			printk (KERN_ERR "Single bit error detected. Possibly bad RAM\n");
+#ifdef CONFIG_X86
+			printk (KERN_ERR "Run memtest86 or other memory test tool.\n");
+#endif
+			return;
+	}
 }
 #endif
 


* Re: discriminate single bit error hardware failure from slab corruption.
  2006-02-03  1:46   ` Dave Jones
@ 2006-02-03  2:05     ` Avi Kivity
  2006-02-03  4:20       ` Dave Jones
  2006-02-06 20:19     ` Pavel Machek
  1 sibling, 1 reply; 16+ messages in thread
From: Avi Kivity @ 2006-02-03  2:05 UTC (permalink / raw)
  To: Dave Jones; +Cc: Linux Kernel

Dave Jones wrote:

>On Fri, Feb 03, 2006 at 02:44:52AM +0200, Avi Kivity wrote:
>
> >         total += hweight8(data[offset+i] ^ POISON_FREE);
> > 
> > >		printk(" %02x", (unsigned char)data[offset + i]);
> > >	}
> > >	printk("\n");
> > >@@ -1019,6 +1023,18 @@ static void dump_line(char *data, int of
> > >		}
> > >	}
> > >	printk("\n");
> > >+	switch (total) {
> > >+		case 0x36:
> > >+		case 0x6a:
> > >+		case 0x6f:
> > >+		case 0x81:
> > >+		case 0xac:
> > >+		case 0xd3:
> > >+		case 0xd5:
> > >+		case 0xea:
> > >+			printk (KERN_ERR "Single bit error detected. Possibly bad RAM. Please run memtest86.\n");
> > >+			return;
> > >+	}
> > > 
> > >
> > and a
> > 
> >     if (total == 1)
> >           printk(...);
> > 
> > here? it seems more readable and more correct as well.
>
>More readable ? Are you kidding ?
>What I wrote is smack-you-in-the-face-obvious what it's doing.
>With your variant, I have to sit down and think it through.
>  
>
Looks like we have mirror image brains :) - I had to scratch my scalp to 
figure out where all the magic numbers in the switch came from.  

Perhaps well named variables will help:

    unsigned char modified_bits = data[offset+i] ^ POISON_FREE;
    int modified_bits_count = hweight8(modified_bits);
    total += modified_bits_count;

>wrt correctness, what do you see wrong with my approach?
>  
>
Your code will generate a false positive 8 times in 256 runs, or 1 in 
32. A 3% false positive rate seems excessive. It's also sensitive to
changes to POISON_FREE.
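
For anyone wanting to try the bit-counting approach outside the kernel, here is a minimal user-space sketch. It is an illustration under assumptions, not the proposed patch: popcount8 is a hypothetical stand-in for the kernel's hweight8(), count_flipped_bits is a made-up name, and the poison byte is assumed to be 0x6b.

```c
#include <stddef.h>

#define POISON_FREE 0x6b	/* assumed poison byte */

/* Hypothetical stand-in for hweight8(): number of set bits in one byte. */
static unsigned int popcount8(unsigned char v)
{
	unsigned int count = 0;

	while (v) {
		v &= v - 1;	/* clear the lowest set bit */
		count++;
	}
	return count;
}

/* Hypothetical helper: total number of bits that differ from the poison pattern. */
static unsigned int count_flipped_bits(const unsigned char *data, size_t len)
{
	unsigned int total = 0;
	size_t i;

	for (i = 0; i < len; i++)
		total += popcount8(data[i] ^ POISON_FREE);
	return total;
}
```

A line such as 6b 6b 6a 6b yields 1, while a wholesale overwrite such as e0 08 03 00 yields 16, so the "total == 1" test only fires on a genuine single-bit flip and does not depend on the particular poison value.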

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.



* Re: discriminate single bit error hardware failure from slab corruption.
  2006-02-03  0:44 ` Avi Kivity
@ 2006-02-03  1:46   ` Dave Jones
  2006-02-03  2:05     ` Avi Kivity
  2006-02-06 20:19     ` Pavel Machek
  0 siblings, 2 replies; 16+ messages in thread
From: Dave Jones @ 2006-02-03  1:46 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Linux Kernel

On Fri, Feb 03, 2006 at 02:44:52AM +0200, Avi Kivity wrote:

 >         total += hweight8(data[offset+i] ^ POISON_FREE);
 > 
 > >		printk(" %02x", (unsigned char)data[offset + i]);
 > >	}
 > >	printk("\n");
 > >@@ -1019,6 +1023,18 @@ static void dump_line(char *data, int of
 > >		}
 > >	}
 > >	printk("\n");
 > >+	switch (total) {
 > >+		case 0x36:
 > >+		case 0x6a:
 > >+		case 0x6f:
 > >+		case 0x81:
 > >+		case 0xac:
 > >+		case 0xd3:
 > >+		case 0xd5:
 > >+		case 0xea:
 > >+			printk (KERN_ERR "Single bit error detected. Possibly bad RAM. Please run memtest86.\n");
 > >+			return;
 > >+	}
 > > 
 > >
 > and a
 > 
 >     if (total == 1)
 >           printk(...);
 > 
 > here? it seems more readable and more correct as well.

More readable ? Are you kidding ?
What I wrote is smack-you-in-the-face-obvious what it's doing.
With your variant, I have to sit down and think it through.

wrt correctness, what do you see wrong with my approach?

		Dave



* Re: discriminate single bit error hardware failure from slab corruption.
  2006-02-02 19:24 Dave Jones
                   ` (2 preceding siblings ...)
  2006-02-02 19:53 ` Pekka Enberg
@ 2006-02-03  0:44 ` Avi Kivity
  2006-02-03  1:46   ` Dave Jones
  2006-02-03 14:09 ` Jan Engelhardt
  4 siblings, 1 reply; 16+ messages in thread
From: Avi Kivity @ 2006-02-03  0:44 UTC (permalink / raw)
  To: Dave Jones; +Cc: Linux Kernel

Dave Jones wrote:

>In the case where we detect a single bit has been flipped, we spew
>the usual slab corruption message, which users instantly think
>is a kernel bug.  In a lot of cases, single bit errors are
>down to bad memory, or other hardware failure.
>
>This patch adds an extra line to the slab debug messages in those
>cases, in the hope that users will try memtest before they report a bug.
>
>000: 6b 6b 6b 6b 6a 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
>Single bit error detected. Possibly bad RAM. Please run memtest86.
>
>Signed-off-by: Dave Jones <davej@redhat.com>
>
>--- linux-2.6.15/mm/slab.c~	2006-01-09 13:25:17.000000000 -0500
>+++ linux-2.6.15/mm/slab.c	2006-01-09 13:26:01.000000000 -0500
>@@ -1313,8 +1313,11 @@ static void poison_obj(kmem_cache_t *cac
> static void dump_line(char *data, int offset, int limit)
> {
> 	int i;
>+	unsigned char total=0;
> 	printk(KERN_ERR "%03x:", offset);
> 	for (i = 0; i < limit; i++) {
>+		if (data[offset+i] != POISON_FREE)
>+			total += data[offset+i];
>  
>
how about
 
         total += hweight8(data[offset+i] ^ POISON_FREE);

> 		printk(" %02x", (unsigned char)data[offset + i]);
> 	}
> 	printk("\n");
>@@ -1019,6 +1023,18 @@ static void dump_line(char *data, int of
> 		}
> 	}
> 	printk("\n");
>+	switch (total) {
>+		case 0x36:
>+		case 0x6a:
>+		case 0x6f:
>+		case 0x81:
>+		case 0xac:
>+		case 0xd3:
>+		case 0xd5:
>+		case 0xea:
>+			printk (KERN_ERR "Single bit error detected. Possibly bad RAM. Please run memtest86.\n");
>+			return;
>+	}
>  
>
and a

     if (total == 1)
           printk(...);

here? it seems more readable and more correct as well.

> }
> #endif
> 
>  
>


-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.



* Re: discriminate single bit error hardware failure from slab corruption.
  2006-02-02 19:24 Dave Jones
  2006-02-02 19:28 ` Randy.Dunlap
  2006-02-02 19:38 ` Jesper Juhl
@ 2006-02-02 19:53 ` Pekka Enberg
  2006-02-03  0:44 ` Avi Kivity
  2006-02-03 14:09 ` Jan Engelhardt
  4 siblings, 0 replies; 16+ messages in thread
From: Pekka Enberg @ 2006-02-02 19:53 UTC (permalink / raw)
  To: Dave Jones, Linux Kernel

On 2/2/06, Dave Jones <davej@redhat.com> wrote:
> In the case where we detect a single bit has been flipped, we spew
> the usual slab corruption message, which users instantly think
> is a kernel bug.  In a lot of cases, single bit errors are
> down to bad memory, or other hardware failure.
>
> This patch adds an extra line to the slab debug messages in those
> cases, in the hope that users will try memtest before they report a bug.
>
> 000: 6b 6b 6b 6b 6a 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
> Single bit error detected. Possibly bad RAM. Please run memtest86.
>
> Signed-off-by: Dave Jones <davej@redhat.com>

Looks good to me.

                          Pekka


* Re: discriminate single bit error hardware failure from slab corruption.
  2006-02-02 19:24 Dave Jones
  2006-02-02 19:28 ` Randy.Dunlap
@ 2006-02-02 19:38 ` Jesper Juhl
  2006-02-02 19:53 ` Pekka Enberg
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 16+ messages in thread
From: Jesper Juhl @ 2006-02-02 19:38 UTC (permalink / raw)
  To: Dave Jones, Linux Kernel

On 2/2/06, Dave Jones <davej@redhat.com> wrote:
> In the case where we detect a single bit has been flipped, we spew
> the usual slab corruption message, which users instantly think
> is a kernel bug.  In a lot of cases, single bit errors are
> down to bad memory, or other hardware failure.
>
> This patch adds an extra line to the slab debug messages in those
> cases, in the hope that users will try memtest before they report a bug.
>
> 000: 6b 6b 6b 6b 6a 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
> Single bit error detected. Possibly bad RAM. Please run memtest86.
>
May I suggest that the text be
 Single bit error detected. Possibly bad RAM. Please run memtest86
and/or memtest86+.

both programs are good memory testers, but they are different and
sometimes one finds problems not detected by the other.

--
Jesper Juhl <jesper.juhl@gmail.com>
Don't top-post  http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please      http://www.expita.com/nomime.html


* Re: discriminate single bit error hardware failure from slab corruption.
  2006-02-02 19:24 Dave Jones
@ 2006-02-02 19:28 ` Randy.Dunlap
  2006-02-02 19:38 ` Jesper Juhl
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 16+ messages in thread
From: Randy.Dunlap @ 2006-02-02 19:28 UTC (permalink / raw)
  To: Dave Jones; +Cc: Linux Kernel

On Thu, 2 Feb 2006, Dave Jones wrote:

> In the case where we detect a single bit has been flipped, we spew
> the usual slab corruption message, which users instantly think
> is a kernel bug.  In a lot of cases, single bit errors are
> down to bad memory, or other hardware failure.
>
> This patch adds an extra line to the slab debug messages in those
> cases, in the hope that users will try memtest before they report a bug.
>
> 000: 6b 6b 6b 6b 6a 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
> Single bit error detected. Possibly bad RAM. Please run memtest86.

does memtest86 run on all $ARCHes ?
or this is good for <large percentage>, so it's Good.  :)
Just checking; it is a good idea.

> Signed-off-by: Dave Jones <davej@redhat.com>
>
> --- linux-2.6.15/mm/slab.c~	2006-01-09 13:25:17.000000000 -0500
> +++ linux-2.6.15/mm/slab.c	2006-01-09 13:26:01.000000000 -0500
> @@ -1313,8 +1313,11 @@ static void poison_obj(kmem_cache_t *cac
>  static void dump_line(char *data, int offset, int limit)
>  {
>  	int i;
> +	unsigned char total=0;
>  	printk(KERN_ERR "%03x:", offset);
>  	for (i = 0; i < limit; i++) {
> +		if (data[offset+i] != POISON_FREE)
> +			total += data[offset+i];
>  		printk(" %02x", (unsigned char)data[offset + i]);
>  	}
>  	printk("\n");
> @@ -1019,6 +1023,18 @@ static void dump_line(char *data, int of
>  		}
>  	}
>  	printk("\n");
> +	switch (total) {
> +		case 0x36:
> +		case 0x6a:
> +		case 0x6f:
> +		case 0x81:
> +		case 0xac:
> +		case 0xd3:
> +		case 0xd5:
> +		case 0xea:
> +			printk (KERN_ERR "Single bit error detected. Possibly bad RAM. Please run memtest86.\n");
> +			return;
> +	}
>  }
>  #endif
>
>

-- 
~Randy


* discriminate single bit error hardware failure from slab corruption.
@ 2006-02-02 19:24 Dave Jones
  2006-02-02 19:28 ` Randy.Dunlap
                   ` (4 more replies)
  0 siblings, 5 replies; 16+ messages in thread
From: Dave Jones @ 2006-02-02 19:24 UTC (permalink / raw)
  To: Linux Kernel

In the case where we detect a single bit has been flipped, we spew
the usual slab corruption message, which users instantly think
is a kernel bug.  In a lot of cases, single bit errors are
down to bad memory, or other hardware failure.

This patch adds an extra line to the slab debug messages in those
cases, in the hope that users will try memtest before they report a bug.

000: 6b 6b 6b 6b 6a 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
Single bit error detected. Possibly bad RAM. Please run memtest86.

Signed-off-by: Dave Jones <davej@redhat.com>

--- linux-2.6.15/mm/slab.c~	2006-01-09 13:25:17.000000000 -0500
+++ linux-2.6.15/mm/slab.c	2006-01-09 13:26:01.000000000 -0500
@@ -1313,8 +1313,11 @@ static void poison_obj(kmem_cache_t *cac
 static void dump_line(char *data, int offset, int limit)
 {
 	int i;
+	unsigned char total=0;
 	printk(KERN_ERR "%03x:", offset);
 	for (i = 0; i < limit; i++) {
+		if (data[offset+i] != POISON_FREE)
+			total += data[offset+i];
 		printk(" %02x", (unsigned char)data[offset + i]);
 	}
 	printk("\n");
@@ -1019,6 +1023,18 @@ static void dump_line(char *data, int of
 		}
 	}
 	printk("\n");
+	switch (total) {
+		case 0x36:
+		case 0x6a:
+		case 0x6f:
+		case 0x81:
+		case 0xac:
+		case 0xd3:
+		case 0xd5:
+		case 0xea:
+			printk (KERN_ERR "Single bit error detected. Possibly bad RAM. Please run memtest86.\n");
+			return;
+	}
 }
 #endif
 


Thread overview: 16+ messages
2006-02-03  9:25 discriminate single bit error hardware failure from slab corruption linux
2006-02-03 14:14 ` Jan Engelhardt
  -- strict thread matches above, loose matches on Subject: below --
2006-02-02 19:24 Dave Jones
2006-02-02 19:28 ` Randy.Dunlap
2006-02-02 19:38 ` Jesper Juhl
2006-02-02 19:53 ` Pekka Enberg
2006-02-03  0:44 ` Avi Kivity
2006-02-03  1:46   ` Dave Jones
2006-02-03  2:05     ` Avi Kivity
2006-02-03  4:20       ` Dave Jones
2006-02-03  4:41         ` Roland Dreier
2006-02-03  5:03           ` Dave Jones
2006-02-03 14:12           ` Jan Engelhardt
2006-02-03 11:05         ` Olivier Galibert
2006-02-06 20:19     ` Pavel Machek
2006-02-03 14:09 ` Jan Engelhardt
