* using mce_inject I get: RIP 10:<ffffffffa012c909> {ttm_bo_unref+0xf/0x45 [ttm]}
@ 2011-08-21  2:31 Justin P. Mattock
  2011-08-21 22:16 ` Andi Kleen
  0 siblings, 1 reply; 10+ messages in thread
From: Justin P. Mattock @ 2011-08-21  2:31 UTC (permalink / raw)
  To: linux-kernel, Andi Kleen

I am not sure if I am running mce-test correctly, but during its test
routine everything pauses, and then the following shows up in dmesg:

http://fpaste.org/kMRd/


[ 1810.670434] Triggering MCE exception on CPU 1
[ 1810.670462] [Hardware Error]: CPU 1: Machine Check Exception: 6 Bank 4: b300000000000000
[ 1810.670467] [Hardware Error]: RIP 73:<0000000012343434>
[ 1810.670470] [Hardware Error]: TSC 38d1002c216
[ 1810.670474] [Hardware Error]: PROCESSOR 0:6f6 TIME 1313892803 SOCKET 0 APIC 1
[ 1810.670477] [Hardware Error]: Run the above through 'mcelog --ascii'
[ 1810.670481] [Hardware Error]: Machine check: Processor context corrupt
[ 1810.670483] [Hardware Error]: Fake kernel panic: Fatal Machine check
[ 1810.670495] MCE exception done on CPU 1
[ 1819.064721] Triggering MCE exception on CPU 1

It seems like a slight pause, then everything resumes properly (music,
etc.). Is this something that needs attention, or are these tests as
extreme as can be, so the pause should simply be ignored?
(Note: if there is an MCE list somewhere, let me know so I can direct
this to the proper people.)

Justin P. Mattock


* Re: using mce_inject I get: RIP 10:<ffffffffa012c909> {ttm_bo_unref+0xf/0x45 [ttm]}
  2011-08-21  2:31 using mce_inject I get: RIP 10:<ffffffffa012c909> {ttm_bo_unref+0xf/0x45 [ttm]} Justin P. Mattock
@ 2011-08-21 22:16 ` Andi Kleen
  2011-08-21 23:08   ` Justin P. Mattock
  2011-08-23 18:01   ` Justin P. Mattock
  0 siblings, 2 replies; 10+ messages in thread
From: Andi Kleen @ 2011-08-21 22:16 UTC (permalink / raw)
  To: Justin P. Mattock; +Cc: linux-kernel, Andi Kleen, tony.luck

On Sat, Aug 20, 2011 at 07:31:06PM -0700, Justin P. Mattock wrote:
> not sure if I am running mce_test correctly, but during its routine of 
> testing things I do get a pause with everything, then the below shows up 
> in dmesg..:

The message is expected, but there should be no noticeable
pause.

-Andi

> 
> http://fpaste.org/kMRd/
> 
> 
> [ 1810.670434] Triggering MCE exception on CPU 1
> [ 1810.670462] [Hardware Error]: CPU 1: Machine Check Exception: 6 Bank 4: b300000000000000
> [ 1810.670467] [Hardware Error]: RIP 73:<0000000012343434>
> [ 1810.670470] [Hardware Error]: TSC 38d1002c216
> [ 1810.670474] [Hardware Error]: PROCESSOR 0:6f6 TIME 1313892803 SOCKET 0 APIC 1
> [ 1810.670477] [Hardware Error]: Run the above through 'mcelog --ascii'
> [ 1810.670481] [Hardware Error]: Machine check: Processor context corrupt
> [ 1810.670483] [Hardware Error]: Fake kernel panic: Fatal Machine check
> [ 1810.670495] MCE exception done on CPU 1
> [ 1819.064721] Triggering MCE exception on CPU 1
> 
> seems light of a pause, then everything resumes properly(music, etc..).
> Is this something that needs attention, or are these tests as extreme as 
> can be, and should simply be ignored?
> (Note: if there is a mce list somewhere let me know so I direct this to 
> the proper people)
> 
> Justin P. Mattock
> 

-- 
ak@linux.intel.com -- Speaking for myself only.


* Re: using mce_inject I get: RIP 10:<ffffffffa012c909> {ttm_bo_unref+0xf/0x45 [ttm]}
  2011-08-21 22:16 ` Andi Kleen
@ 2011-08-21 23:08   ` Justin P. Mattock
  2011-08-23 18:01   ` Justin P. Mattock
  1 sibling, 0 replies; 10+ messages in thread
From: Justin P. Mattock @ 2011-08-21 23:08 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, tony.luck

On 08/21/2011 03:16 PM, Andi Kleen wrote:
> On Sat, Aug 20, 2011 at 07:31:06PM -0700, Justin P. Mattock wrote:
>> not sure if I am running mce_test correctly, but during its routine of
>> testing things I do get a pause with everything, then the below shows up
>> in dmesg..:
>
> The message is expected, but there should be no noticeable
> pause.

Well, after running more of these tests I am getting a noticeable
pause; it lasts about 2-3 seconds, then everything goes back to normal.
(All of these happen whenever the suite runs a timeout test.)

>
> -Andi
>
>>
>> http://fpaste.org/kMRd/
>>
>>
>> [ 1810.670434] Triggering MCE exception on CPU 1
>> [ 1810.670462] [Hardware Error]: CPU 1: Machine Check Exception: 6 Bank 4: b300000000000000
>> [ 1810.670467] [Hardware Error]: RIP 73:<0000000012343434>
>> [ 1810.670470] [Hardware Error]: TSC 38d1002c216
>> [ 1810.670474] [Hardware Error]: PROCESSOR 0:6f6 TIME 1313892803 SOCKET 0 APIC 1
>> [ 1810.670477] [Hardware Error]: Run the above through 'mcelog --ascii'
>> [ 1810.670481] [Hardware Error]: Machine check: Processor context corrupt
>> [ 1810.670483] [Hardware Error]: Fake kernel panic: Fatal Machine check
>> [ 1810.670495] MCE exception done on CPU 1
>> [ 1819.064721] Triggering MCE exception on CPU 1
>>
>> seems light of a pause, then everything resumes properly(music, etc..).
>> Is this something that needs attention, or are these tests as extreme as
>> can be, and should simply be ignored?
>> (Note: if there is a mce list somewhere let me know so I direct this to
>> the proper people)
>>
>> Justin P. Mattock
>>
>



* Re: using mce_inject I get: RIP 10:<ffffffffa012c909> {ttm_bo_unref+0xf/0x45 [ttm]}
  2011-08-21 22:16 ` Andi Kleen
  2011-08-21 23:08   ` Justin P. Mattock
@ 2011-08-23 18:01   ` Justin P. Mattock
  2011-08-23 20:15     ` Luck, Tony
  1 sibling, 1 reply; 10+ messages in thread
From: Justin P. Mattock @ 2011-08-23 18:01 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, tony.luck

On 08/21/2011 03:16 PM, Andi Kleen wrote:
> On Sat, Aug 20, 2011 at 07:31:06PM -0700, Justin P. Mattock wrote:
>> not sure if I am running mce_test correctly, but during its routine of
>> testing things I do get a pause with everything, then the below shows up
>> in dmesg..:
>
> The message is expected, but there should be no noticeable
> pause.
>
> -Andi

OK, I have reset the kernel to 2.6.36 and ran the test program in
init 3 (since radeon is crapped out with that kernel). With 2.6.36 I
can see the cursor at the bottom left blinking throughout the whole
test run, which tells me there is no pause (none of the timeout tests
cause one).
With the current kernel in init 3 I see the cursor pause for 2-3
seconds.

I will start a bisect on this, but need some help on how to proceed
properly, e.g. I get an error with gcc when building:

   LD      init/built-in.o
   VDSO    arch/x86/vdso/vdso.so.dbg
gcc: error: elf_x86_64: No such file or directory
   VDSO    arch/x86/vdso/vdso32-int80.so.dbg
gcc: error: elf_i386: No such file or directory
   VDSO    arch/x86/vdso/vdso32-syscall.so.dbg
gcc: error: elf_i386: No such file or directory
   VDSO    arch/x86/vdso/vdso32-sysenter.so.dbg
gcc: error: elf_i386: No such file or directory
   CC      kernel/trace/trace.o

It's easily fixable, but I am not sure that's a good idea while bisect
is stepping through commits (I am afraid I might lead the bisect astray
if I add any patches).

Justin P. Mattock




* RE: using mce_inject I get: RIP 10:<ffffffffa012c909> {ttm_bo_unref+0xf/0x45 [ttm]}
  2011-08-23 18:01   ` Justin P. Mattock
@ 2011-08-23 20:15     ` Luck, Tony
  2011-08-24  3:36       ` Justin P. Mattock
  2011-08-27 15:03       ` Justin P. Mattock
  0 siblings, 2 replies; 10+ messages in thread
From: Luck, Tony @ 2011-08-23 20:15 UTC (permalink / raw)
  To: Justin P. Mattock, Andi Kleen; +Cc: linux-kernel

> its easily fixable, but not sure its a good idea due to bisect going 
> through commits(afraid I might go astray with the bisect if I add any 
> patches).

Rather than fixing a bad build - you can try moving to a nearby commit
(use "gitk" to get a view of the structure around the commit that git
bisect suggested).  In the early stages of a bisection, it doesn't really
matter much if you build the mid-point that bisect provided, or some
nearby one - just be sure to mark good/bad the commit you actually built.

-Tony
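
[Editor's note] A minimal sketch of the workflow Tony describes, assuming
a bisect is already in progress in the kernel tree. The function names and
the default offset of 3 commits are hypothetical and chosen only for
illustration; the git commands themselves ("git bisect skip",
"git reset --hard HEAD~N", "git bisect good|bad") are standard.

```shell
#!/bin/sh
# Sketch: coping with a commit that does not build during "git bisect".
# Assumes "git bisect start" plus good/bad endpoints were already given.

# Option 1: tell git the current commit cannot be tested and let it
# choose another commit nearby.
skip_unbuildable() {
    git bisect skip
}

# Option 2: step back N commits by hand (N is a hypothetical offset),
# then build and test that commit instead of the suggested mid-point.
try_nearby() {
    n="${1:-3}"
    git reset --hard "HEAD~${n}"
}

# Mark the commit that is checked out right now. With no revision
# argument, "git bisect good|bad" applies to the current HEAD, so the
# verdict lands on the commit that was actually built.
mark_current() {
    verdict="$1"   # "good" or "bad"
    case "$verdict" in
        good|bad) git bisect "$verdict" ;;
        *) echo "usage: mark_current good|bad" >&2; return 2 ;;
    esac
}
```

After try_nearby, build and boot as usual, then call "mark_current good"
or "mark_current bad" so the result is recorded against the commit that
was really tested.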



* Re: using mce_inject I get: RIP 10:<ffffffffa012c909> {ttm_bo_unref+0xf/0x45 [ttm]}
  2011-08-23 20:15     ` Luck, Tony
@ 2011-08-24  3:36       ` Justin P. Mattock
  2011-08-27 15:03       ` Justin P. Mattock
  1 sibling, 0 replies; 10+ messages in thread
From: Justin P. Mattock @ 2011-08-24  3:36 UTC (permalink / raw)
  To: Luck, Tony; +Cc: Andi Kleen, linux-kernel

On 08/23/2011 01:15 PM, Luck, Tony wrote:
>> its easily fixable, but not sure its a good idea due to bisect going
>> through commits(afraid I might go astray with the bisect if I add any
>> patches).
>
> Rather than fixing a bad build - you can try moving to a nearby commit
> (use "gitk" to get a view of the structure around the commit that git
> bisect suggested).  In the early stages of a bisection, it doesn't really
> matter much if you build the mid-point that bisect provided, or some
> nearby on - just be sure to mark good/bad the commit you actually built.
>
> -Tony
>
>


Yeah, I randomly guessed a kernel with git reset v2.6* to find a good
point (it could even have been 2.6.37/8/9). I will give this a try and
see if I get anywhere (I have 15 revs to test, I'll see you in a few
days!).

Justin P. Mattock


* Re: using mce_inject I get: RIP 10:<ffffffffa012c909> {ttm_bo_unref+0xf/0x45 [ttm]}
  2011-08-23 20:15     ` Luck, Tony
  2011-08-24  3:36       ` Justin P. Mattock
@ 2011-08-27 15:03       ` Justin P. Mattock
  2011-08-27 15:12         ` Andi Kleen
  2011-08-30  1:07         ` huang ying
  1 sibling, 2 replies; 10+ messages in thread
From: Justin P. Mattock @ 2011-08-27 15:03 UTC (permalink / raw)
  To: Luck, Tony; +Cc: Andi Kleen, linux-kernel

On 08/23/2011 01:15 PM, Luck, Tony wrote:
>> its easily fixable, but not sure its a good idea due to bisect going
>> through commits(afraid I might go astray with the bisect if I add any
>> patches).
>
> Rather than fixing a bad build - you can try moving to a nearby commit
> (use "gitk" to get a view of the structure around the commit that git
> bisect suggested).  In the early stages of a bisection, it doesn't really
> matter much if you build the mid-point that bisect provided, or some
> nearby on - just be sure to mark good/bad the commit you actually built.
>
> -Tony
>
>

Well, after bisecting (with no results), I found that something in my
.config was causing this. After looking through it, I found that having
X86_MCE_INJECT=y causes the pauses when the timeouts occur.

Let me know if I need to supply any info.

Justin P. Mattock
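
[Editor's note] A small sketch for checking how the option Justin names
is set in a given kernel config. The helper name config_state is
hypothetical, and the config file path is an assumption (a build tree's
.config, or a distribution file such as /boot/config-$(uname -r) where
one exists).

```shell
#!/bin/sh
# Sketch: report how an option is set in a kernel config file.
# Prints "y" (built in), "m" (module), or "n" (not set or absent).
config_state() {
    opt="$1"     # e.g. CONFIG_X86_MCE_INJECT
    file="$2"    # e.g. .config
    # Keep only the y/m value from a line of the form OPTION=y or OPTION=m.
    val=$(sed -n "s/^${opt}=\([ym]\)\$/\1/p" "$file" | head -n 1)
    echo "${val:-n}"
}
```

For example, "config_state CONFIG_X86_MCE_INJECT .config" run in the build
tree shows whether the injector is built in, modular, or disabled, which
makes it easy to compare a kernel that pauses with one that does not.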


* Re: using mce_inject I get: RIP 10:<ffffffffa012c909> {ttm_bo_unref+0xf/0x45 [ttm]}
  2011-08-27 15:03       ` Justin P. Mattock
@ 2011-08-27 15:12         ` Andi Kleen
  2011-08-30  1:07         ` huang ying
  1 sibling, 0 replies; 10+ messages in thread
From: Andi Kleen @ 2011-08-27 15:12 UTC (permalink / raw)
  To: Justin P. Mattock; +Cc: Luck, Tony, Andi Kleen, linux-kernel, ying.huang

On Sat, Aug 27, 2011 at 08:03:06AM -0700, Justin P. Mattock wrote:
> On 08/23/2011 01:15 PM, Luck, Tony wrote:
> >>its easily fixable, but not sure its a good idea due to bisect going
> >>through commits(afraid I might go astray with the bisect if I add any
> >>patches).
> >
> >Rather than fixing a bad build - you can try moving to a nearby commit
> >(use "gitk" to get a view of the structure around the commit that git
> >bisect suggested).  In the early stages of a bisection, it doesn't really
> >matter much if you build the mid-point that bisect provided, or some
> >nearby on - just be sure to mark good/bad the commit you actually built.
> >
> >-Tony
> >
> >
> 
> well.. after bisecting(with no results), I found that something in my 
> .config was causing this, so after looking through, I found that having 
> X86_MCE_INJECT = y causes the pauses when the timeouts occur

Seems like a bug. Copying Ying too.

-Andi


-- 
ak@linux.intel.com -- Speaking for myself only.


* Re: using mce_inject I get: RIP 10:<ffffffffa012c909> {ttm_bo_unref+0xf/0x45 [ttm]}
  2011-08-27 15:03       ` Justin P. Mattock
  2011-08-27 15:12         ` Andi Kleen
@ 2011-08-30  1:07         ` huang ying
  2011-08-30 15:38           ` Justin P. Mattock
  1 sibling, 1 reply; 10+ messages in thread
From: huang ying @ 2011-08-30  1:07 UTC (permalink / raw)
  To: Justin P. Mattock; +Cc: Luck, Tony, Andi Kleen, linux-kernel

On Sat, Aug 27, 2011 at 11:03 PM, Justin P. Mattock
<justinmattock@gmail.com> wrote:
> On 08/23/2011 01:15 PM, Luck, Tony wrote:
>>>
>>> its easily fixable, but not sure its a good idea due to bisect going
>>> through commits(afraid I might go astray with the bisect if I add any
>>> patches).
>>
>> Rather than fixing a bad build - you can try moving to a nearby commit
>> (use "gitk" to get a view of the structure around the commit that git
>> bisect suggested).  In the early stages of a bisection, it doesn't really
>> matter much if you build the mid-point that bisect provided, or some
>> nearby on - just be sure to mark good/bad the commit you actually built.
>>
>> -Tony
>>
>>
>
> well.. after bisecting(with no results), I found that something in my
> .config was causing this, so after looking through, I found that having
> X86_MCE_INJECT = y causes the pauses when the timeouts occur
>
> let me know if I need to supply any info.

Which test case causes the pause?  Test cases with "timeout" in the
name may cause a timeout between CPUs.  You can also try booting the
system with the kernel parameter "mce=3,0", which will disable the timeout.

Best Regards,
Huang Ying
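
[Editor's note] A sketch for confirming that the "mce=3,0" setting
actually reached the kernel. It assumes the parameter has the form
mce=<tolerance level>[,<monarch timeout>], which is how I read the
kernel's x86-64 boot-options documentation (a monarch timeout of 0
disables the wait for other CPUs). The function name parse_mce_param is
hypothetical.

```shell
#!/bin/sh
# Sketch: pull "mce=<tolerant>[,<monarch_timeout>]" out of a kernel
# command line string.
# Prints "tolerant=<n> monarch_timeout=<n|unset>", or "mce= not set".
parse_mce_param() {
    cmdline="$1"
    for word in $cmdline; do
        case "$word" in
            mce=[0-9]*)
                val="${word#mce=}"          # "3,0" from "mce=3,0"
                tolerant="${val%%,*}"       # part before the first comma
                case "$val" in
                    *,*) timeout="${val#*,}" ;;   # part after the comma
                    *)   timeout="unset" ;;
                esac
                echo "tolerant=${tolerant} monarch_timeout=${timeout}"
                return 0
                ;;
        esac
    done
    echo "mce= not set"
    return 1
}
```

On a running system the input would typically be the contents of
/proc/cmdline, for example: parse_mce_param "$(cat /proc/cmdline)".
Kernels of this era also expose the live values under
/sys/devices/system/machinecheck/machinecheck0/ (tolerant and
monarch_timeout), though that path may differ on other kernel versions.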


* Re: using mce_inject I get: RIP 10:<ffffffffa012c909> {ttm_bo_unref+0xf/0x45 [ttm]}
  2011-08-30  1:07         ` huang ying
@ 2011-08-30 15:38           ` Justin P. Mattock
  0 siblings, 0 replies; 10+ messages in thread
From: Justin P. Mattock @ 2011-08-30 15:38 UTC (permalink / raw)
  To: huang ying; +Cc: Luck, Tony, Andi Kleen, linux-kernel

On 08/29/2011 06:07 PM, huang ying wrote:
> On Sat, Aug 27, 2011 at 11:03 PM, Justin P. Mattock
> <justinmattock@gmail.com>  wrote:
>> On 08/23/2011 01:15 PM, Luck, Tony wrote:
>>>>
>>>> its easily fixable, but not sure its a good idea due to bisect going
>>>> through commits(afraid I might go astray with the bisect if I add any
>>>> patches).
>>>
>>> Rather than fixing a bad build - you can try moving to a nearby commit
>>> (use "gitk" to get a view of the structure around the commit that git
>>> bisect suggested).  In the early stages of a bisection, it doesn't really
>>> matter much if you build the mid-point that bisect provided, or some
>>> nearby on - just be sure to mark good/bad the commit you actually built.
>>>
>>> -Tony
>>>
>>>
>>
>> well.. after bisecting(with no results), I found that something in my
>> .config was causing this, so after looking through, I found that having
>> X86_MCE_INJECT = y causes the pauses when the timeouts occur
>>
>> let me know if I need to supply any info.
>
> Which test case cause the pause?  Some test case with "timeout" in
> name may cause timeout between CPUs.  Or you can try boot system with
> kernel parameter "mce=3,0", which will disable timeout.
>
> Best Regards,
> Huang Ying
>


Cool, thanks for the info.
I used mce=3,0 on the kernel command line and then ran the mce-test
suite. Unfortunately the pause still occurs.
As for which test cases: basically any of the timeout ones.

Here is what the verbose output looks like:

`/home/kernel/mce-inject/mce-test'
./drivers/simple/driver.sh simple.conf

soft-inj/non-panic/corrected:
   Failed: can not get gcov graph
   Passed: MCE log is ok
   Passed: No kernel warning or bug

soft-inj/non-panic/corrected_hold:
   Failed: can not get gcov graph
   Failed: MCE log is different from input
   Passed: No kernel warning or bug

soft-inj/non-panic/corrected_no_en:
   Failed: can not get gcov graph
   Passed: MCE log is ok
   Passed: No kernel warning or bug

soft-inj/non-panic/corrected_over:
   Failed: can not get gcov graph
   Passed: MCE log is ok
   Passed: No kernel warning or bug

soft-inj/panic/fatal:
   Failed: can not get gcov graph
   Failed: MCE log is different from input
   Passed: No kernel warning or bug
   Failed: uncorrect panic, expected: Fatal Machine check
   Failed:  uncorrected MCE exp, expected: Processor context corrupt

soft-inj/panic/fatal_eipv:
   Failed: can not get gcov graph
   Failed: MCE log is different from input
   Passed: No kernel warning or bug
   Failed: uncorrect panic, expected: Fatal Machine check
   Failed:  uncorrected MCE exp, expected: Processor context corrupt

soft-inj/panic/fatal_irq:
   Failed: can not get gcov graph
   Failed: MCE log is different from input
   Passed: No kernel warning or bug
   Failed: uncorrect panic, expected: Fatal Machine check
   Failed:  uncorrected MCE exp, expected: Processor context corrupt

soft-inj/panic/fatal_no_en:
   Failed: can not get gcov graph
   Passed: MCE log is ok
   Passed: No kernel warning or bug
   Failed: uncorrect panic, expected: Machine check from unknown source

soft-inj/panic/fatal_over:
   Failed: can not get gcov graph
   Failed: MCE log is different from input
   Passed: No kernel warning or bug
   Failed: uncorrect panic, expected: Fatal Machine check
   Failed:  uncorrected MCE exp, expected: Processor context corrupt

soft-inj/panic/fatal_ripv:
   Failed: can not get gcov graph
   Failed: MCE log is different from input
   Passed: No kernel warning or bug
   Failed: uncorrect panic, expected: Fatal Machine check
   Failed:  uncorrected MCE exp, expected: Processor context corrupt

soft-inj/panic/fatal_timeout:
   Failed: can not get gcov graph
   Failed: MCE log is different from input
   Passed: No kernel warning or bug
   Failed: uncorrect panic, expected: : Fatal machine check on current CPU
   Failed: no timeout detected
   Failed:  uncorrected MCE exp, expected: Processor context corrupt

soft-inj/panic/fatal_timeout_ripv:
   Failed: can not get gcov graph
   Failed: MCE log is different from input
   Passed: No kernel warning or bug
   Failed: uncorrect panic, expected: : Fatal machine check on current CPU
   Failed: no timeout detected
   Failed:  uncorrected MCE exp, expected: Processor context corrupt

soft-inj/panic/fatal_userspace:
   Failed: can not get gcov graph
   Failed: MCE log is different from input
   Passed: No kernel warning or bug
   Failed: uncorrect panic, expected: Fatal Machine check
   Failed:  uncorrected MCE exp, expected: Processor context corrupt



in dmesg I see:

[  102.491609] Starting machine check poll CPU 1
[  102.492077] [Hardware Error]: Machine check events logged
[  102.492086] Machine check poll done on CPU 1
[  123.537575] Triggering MCE exception on CPU 0
[  123.537584] Disabling lock debugging due to kernel taint
[  123.537594] [Hardware Error]: Machine check events logged
[  123.537597] MCE exception done on CPU 0
[  129.779850] Triggering MCE exception on CPU 1
[  129.779879] MCE exception done on CPU 1
[  137.030085] Triggering MCE exception on CPU 0
[  137.030108] MCE exception done on CPU 0
[  143.286096] Triggering MCE exception on CPU 0
[  143.286110] MCE exception done on CPU 0
[  149.541391] Triggering MCE exception on CPU 0
[  149.541409] MCE exception done on CPU 0
[  156.785580] Triggering MCE exception on CPU 1
[  156.785602] MCE exception done on CPU 1
[  164.011576] Triggering MCE exception on CPU 0
[  164.012558] mce_notify_irq: 4 callbacks suppressed
[  164.012558] [Hardware Error]: Machine check events logged
[  166.795340] MCE exception done on CPU 0
[  173.088624] Triggering MCE exception on CPU 0
[  173.089600] [Hardware Error]: Machine check events logged
[  177.119421] MCE exception done on CPU 0
[  184.373355] Triggering MCE exception on CPU 1
[  184.373372] MCE exception done on CPU 1
[  190.741030] Triggering MCE exception on CPU 1
[  190.741047] MCE exception done on CPU 1


let me know if you need more info.

Justin P. Mattock
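
[Editor's note] The pause can be read directly off the dmesg timestamps
above: most injections complete in well under a millisecond, but the two
starting at 164.011576 and 173.088624 take roughly 2.8 s and 4.0 s
between "Triggering MCE exception" and "MCE exception done". A minimal
sketch that computes these durations from dmesg-style input follows; the
function name mce_durations is hypothetical, and it assumes the default
"[ seconds.micros ]" timestamp prefix.

```shell
#!/bin/sh
# Sketch: report how long each injected MCE took, reading dmesg-style
# lines on stdin. Output: one line per injection,
# "<cpu> <start timestamp> <duration in seconds>".
mce_durations() {
    awk '
        # Reduce "[  164.011576] ..." to the bare numeric timestamp.
        function stamp(line,    t) {
            t = line
            sub(/^\[ */, "", t)
            sub(/\].*$/, "", t)
            return t + 0
        }
        # Remember when an injection started on a given CPU.
        /Triggering MCE exception on CPU/ { start[$NF] = stamp($0); next }
        # On completion, print the elapsed time for that CPU.
        /MCE exception done on CPU/ {
            if ($NF in start) {
                printf "%s %.6f %.3f\n", $NF, start[$NF], stamp($0) - start[$NF]
                delete start[$NF]
            }
        }
    '
}
```

Typical use would be "dmesg | mce_durations", then matching any
multi-second entries against the order of the test cases to see which
ones stall.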

