* EDAC linux-2.6.34-rc5 non correctable errors not reported on AMD64 opteron
@ 2010-04-29 18:30 Prasanna S. Panchamukhi
  2010-04-29 22:13 ` Keith Mannthey
  0 siblings, 1 reply; 11+ messages in thread
From: Prasanna S. Panchamukhi @ 2010-04-29 18:30 UTC (permalink / raw)
  To: dougthompson, bluesmoke-devel; +Cc: Rob.Becker, Arthur.Jones

Hi Doug,

I am testing the Linux 2.6.34-rc5 EDAC driver on an AMD64 Opteron.
I am able to inject single-bit errors and get the EDAC driver to report the
correctable errors.
But when I inject 2-bit errors, I do not see any notification or kernel
log message; the system simply hangs.
This happens with or without edac_mc_panic_on_ue enabled.
Please let me know if I am missing something.
The details are below.

Thanks
Prasanna

Steps to reproduce the problem:


1. Build Linux 2.6.34-rc5 using x86_64_defconfig with the following additional config options enabled:
CONFIG_EDAC_DECODE_MCE=y
CONFIG_EDAC_MM_EDAC=y
CONFIG_EDAC_AMD64=m
CONFIG_EDAC_AMD64_ERROR_INJECTION=y
CONFIG_EDAC_E752X=m
CONFIG_EDAC_I82975X=m
CONFIG_EDAC_I3000=m
CONFIG_EDAC_I3200=m
CONFIG_EDAC_X38=m
CONFIG_EDAC_I5400=m
CONFIG_EDAC_I5000=m
CONFIG_EDAC_I5100=m

2. Insert the kernel module
# insmod amd64_edac_mod.ko

3. Inject errors

# echo 3 > /sys/devices/system/edac/mc/mc0/inject_section  
# echo 7 > /sys/devices/system/edac/mc/mc0/inject_word
# echo 0x88 > /sys/devices/system/edac/mc/mc0/inject_ecc_vector
# echo 1 > /sys/devices/system/edac/mc/mc0/inject_read
# echo 1 > /sys/devices/system/edac/mc/mc0/inject_write

4. The system should hang within a few minutes.
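
The injection sequence in step 3 can be wrapped in a small helper so it is easy to repeat with different vectors. This is only a sketch: the function name `inject_ecc_error` and the `EDAC_MC_DIR` override are illustrative and not part of the driver; the sysfs file names are the ones used in the steps above.

```shell
#!/bin/sh
# EDAC_MC_DIR is a hypothetical override so the helper can be dry-run against a
# scratch directory; it defaults to the mc0 path used in the report.
EDAC_MC_DIR="${EDAC_MC_DIR:-/sys/devices/system/edac/mc/mc0}"

# inject_ecc_error SECTION WORD VECTOR
# Writes the three selection files, then triggers inject_read and inject_write
# in the same order as the reproduction steps.
inject_ecc_error() {
    section=$1 word=$2 vector=$3

    # Refuse to start if any injection file is missing or not writable.
    for f in inject_section inject_word inject_ecc_vector inject_read inject_write; do
        if [ ! -w "$EDAC_MC_DIR/$f" ]; then
            echo "missing or read-only: $EDAC_MC_DIR/$f" >&2
            return 1
        fi
    done

    echo "$section" > "$EDAC_MC_DIR/inject_section"    || return 1
    echo "$word"    > "$EDAC_MC_DIR/inject_word"       || return 1
    echo "$vector"  > "$EDAC_MC_DIR/inject_ecc_vector" || return 1
    echo 1 > "$EDAC_MC_DIR/inject_read"  || return 1
    echo 1 > "$EDAC_MC_DIR/inject_write" || return 1
}
```

With the module loaded, `inject_ecc_error 3 7 0x88` (run as root) would repeat the 2-bit case from step 3.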

Additional info:
- AMD64 opteron
# cat /proc/cpuinfo
processor    : 0
vendor_id    : AuthenticAMD
cpu family   : 16
model          : 2
model name    : Quad-Core AMD Opteron(tm) Processor 2346 HE
stepping       : 3
cpu MHz      : 1800.023
cache size    : 512 KB
physical id   : 0
siblings       : 4
core id        : 0
cpu cores    : 4
apicid          : 0
initial apicid    : 0
fpu        : yes
fpu_exception    : yes
cpuid level    : 5
wp        : yes
flags        : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca 
cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt 
pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nonstop_tsc 
extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic 
cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs npt lbrv svm_lock
bogomips    : 3600.04
TLB size    : 1024 4K pages
clflush size    : 64
cache_alignment    : 64
address sizes    : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate

processor    : 1
vendor_id    : AuthenticAMD
cpu family    : 16
model        : 2
model name    : Quad-Core AMD Opteron(tm) Processor 2346 HE
stepping    : 3
cpu MHz        : 1800.023
cache size    : 512 KB
physical id    : 0
siblings    : 4
core id        : 1
cpu cores    : 4
apicid        : 1
initial apicid    : 1
fpu        : yes
fpu_exception    : yes
cpuid level    : 5
wp        : yes
flags        : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca 
cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt 
pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nonstop_tsc 
extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic 
cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs npt lbrv svm_lock
bogomips    : 3600.08
TLB size    : 1024 4K pages
clflush size    : 64
cache_alignment    : 64
address sizes    : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate

processor    : 2
vendor_id    : AuthenticAMD
cpu family    : 16
model        : 2
model name    : Quad-Core AMD Opteron(tm) Processor 2346 HE
stepping    : 3
cpu MHz        : 1800.023
cache size    : 512 KB
physical id    : 0
siblings    : 4
core id        : 2
cpu cores    : 4
apicid        : 2
initial apicid    : 2
fpu        : yes
fpu_exception    : yes
cpuid level    : 5
wp        : yes
flags        : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca 
cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt 
pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nonstop_tsc 
extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic 
cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs npt lbrv svm_lock
bogomips    : 3599.96
TLB size    : 1024 4K pages
clflush size    : 64
cache_alignment    : 64
address sizes    : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate

processor    : 3
vendor_id    : AuthenticAMD
cpu family    : 16
model        : 2
model name    : Quad-Core AMD Opteron(tm) Processor 2346 HE
stepping    : 3
cpu MHz        : 1800.023
cache size    : 512 KB
physical id    : 0
siblings    : 4
core id        : 3
cpu cores    : 4
apicid        : 3
initial apicid    : 3
fpu        : yes
fpu_exception    : yes
cpuid level    : 5
wp        : yes
flags        : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca 
cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt 
pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nonstop_tsc 
extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic 
cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs npt lbrv svm_lock
bogomips    : 3600.01
TLB size    : 1024 4K pages
clflush size    : 64
cache_alignment    : 64
address sizes    : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate





* Re: EDAC linux-2.6.34-rc5 non correctable errors not reported on AMD64 opteron
  2010-04-29 18:30 EDAC linux-2.6.34-rc5 non correctable errors not reported on AMD64 opteron Prasanna S. Panchamukhi
@ 2010-04-29 22:13 ` Keith Mannthey
  2010-04-29 22:31   ` Prasanna S. Panchamukhi
  0 siblings, 1 reply; 11+ messages in thread
From: Keith Mannthey @ 2010-04-29 22:13 UTC (permalink / raw)
  To: Prasanna S. Panchamukhi
  Cc: Rob.Becker, bluesmoke-devel, Arthur.Jones, dougthompson

On Thu, 2010-04-29 at 11:30 -0700, Prasanna S. Panchamukhi wrote:
> Hi Doug,
> 
> I am testing Linux-2.6.34-rc5 EDAC driver on AMD64 Opteron.
> I am able to inject single bit errors and get the edac driver report the 
> correctable errors.
> But when I inject 2-bit errors, I did not see any notification or kernel 
> log, the system simply hangs.
> This happens with or without edac_mc_panic_on_ue enabled.
> Please let me know if I am missing something.
> Below are the details.

I would have to recheck the specs to be 100% sure, but I would consider
double-bit errors to be fatal on normal Opteron boxes. There is a good
chance your BIOS detects the fatal error and freezes the box to prevent
data corruption.

Thanks,
  Keith 


* Re: EDAC linux-2.6.34-rc5 non correctable errors not reported on AMD64 opteron
  2010-04-29 22:13 ` Keith Mannthey
@ 2010-04-29 22:31   ` Prasanna S. Panchamukhi
  2010-04-29 23:18     ` Keith Mannthey
  0 siblings, 1 reply; 11+ messages in thread
From: Prasanna S. Panchamukhi @ 2010-04-29 22:31 UTC (permalink / raw)
  To: Keith Mannthey; +Cc: Rob Becker, bluesmoke-devel, Arthur Jones, dougthompson

On Thu, Apr 29, 2010 at 03:13:42PM -0700, Keith Mannthey wrote:
> On Thu, 2010-04-29 at 11:30 -0700, Prasanna S. Panchamukhi wrote:
> > Hi Doug,
> > 
> > I am testing Linux-2.6.34-rc5 EDAC driver on AMD64 Opteron.
> > I am able to inject single bit errors and get the edac driver report the 
> > correctable errors.
> > But when I inject 2-bit errors, I did not see any notification or kernel 
> > log, the system simply hangs.
> > This happens with or without edac_mc_panic_on_ue enabled.
> > Please let me know if I am missing something.
> > Below are the details.
> 
> I would have to recheck the specs to be 100% sure but I would consider
> double bit errors to be fatal on normal Opteron boxes. There is a good
> chance your BIOS detects the fatal error and freezes the box to prevent
> data corruption.

Shouldn't the EDAC driver report uncorrectable errors even
before the BIOS detects the fatal error and freezes the box?
Has anyone already tested 2-bit error injection and reporting
on AMD64?
Has the EDAC driver reported uncorrectable errors on other architectures
(PowerPC/Intel)?
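
One way to tell whether the driver counted an uncorrectable error at all is to compare the per-controller counters before and after an injection. The sketch below assumes the usual `ce_count` and `ue_count` files exist under the memory controller directory; the helper names and the `EDAC_MC_DIR` override are illustrative, not part of EDAC.

```shell
# EDAC_MC_DIR is a hypothetical override for dry runs; the default is the
# mc0 path used in the reproduction steps.
EDAC_MC_DIR="${EDAC_MC_DIR:-/sys/devices/system/edac/mc/mc0}"

# read_counter NAME
# Prints the value of one counter file, or fails if it cannot be read.
read_counter() {
    cat "$EDAC_MC_DIR/$1" 2>/dev/null || {
        echo "cannot read $EDAC_MC_DIR/$1" >&2
        return 1
    }
}

# snapshot_counts
# Prints "CE UE" for the selected memory controller.
snapshot_counts() {
    ce=$(read_counter ce_count) || return 1
    ue=$(read_counter ue_count) || return 1
    echo "$ce $ue"
}

# ue_increased BEFORE AFTER
# Succeeds if the UE field of snapshot AFTER is larger than in BEFORE.
ue_increased() {
    before_ue=${1#* }
    after_ue=${2#* }
    [ "$after_ue" -gt "$before_ue" ]
}
```

Taking one snapshot before the injection and another immediately after only helps if the machine stays up long enough to run the second command, which the report suggests it sometimes does.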

Thanks
Prasanna



* Re: EDAC linux-2.6.34-rc5 non correctable errors not reported on AMD64 opteron
  2010-04-29 22:31   ` Prasanna S. Panchamukhi
@ 2010-04-29 23:18     ` Keith Mannthey
  2010-04-30  0:12       ` Prasanna S. Panchamukhi
  0 siblings, 1 reply; 11+ messages in thread
From: Keith Mannthey @ 2010-04-29 23:18 UTC (permalink / raw)
  To: Prasanna S. Panchamukhi
  Cc: Rob Becker, bluesmoke-devel, Arthur Jones, dougthompson

On Thu, 2010-04-29 at 15:31 -0700, Prasanna S. Panchamukhi wrote:
> On Thu, Apr 29, 2010 at 03:13:42PM -0700, Keith Mannthey wrote:
> > On Thu, 2010-04-29 at 11:30 -0700, Prasanna S. Panchamukhi wrote:
> > > Hi Doug,
> > > 
> > > I am testing Linux-2.6.34-rc5 EDAC driver on AMD64 Opteron.
> > > I am able to inject single bit errors and get the edac driver report the 
> > > correctable errors.
> > > But when I inject 2-bit errors, I did not see any notification or kernel 
> > > log, the system simply hangs.
> > > This happens with or without edac_mc_panic_on_ue enabled.
> > > Please let me know if I am missing something.
> > > Below are the details.
> > 
> > I would have to recheck the specs to be 100% sure but I would consider
> > double bit errors to be fatal on normal Opteron boxes. There is a good
> > chance your BIOS detects the fatal error and freezes the box to prevent
> > data corruption.
> 
> Shouldn't the edac driver be reporting Uncorrectable Errors even 
> before ..BIOS detects the fatal error and freezes the box?
> Did someone already tested the 2-bit error injection and reporting
> on AMD64?

I don't know your hardware or firmware, but an SMI can be used when the
fatal error is triggered, so the BIOS can get instant notification.


> Did the edac driver reported Uncorrectable Errors on other architectures
> Powerpc/Intel?

This is a very specific question that requires firmware-level knowledge
of your system.  On my AMD and Intel systems the BIOS steps in and kills
the box when faced with a double-bit error, and this is the safest
possible scenario for a double-bit error.  From your report it sounds
like your box may be doing something similar.

I didn't get around to playing with the error injection code, so there
could be something going on there.  I had real debug DIMMs with switches
on them to generate an error.

You are triggering a double-bit error and then waiting for over a minute
for the box to hang? If so, I doubt it is the BIOS that is killing your
box, as that should happen right away.

Thanks,
  Keith 


* Re: EDAC linux-2.6.34-rc5 non correctable errors not reported on AMD64 opteron
  2010-04-29 23:18     ` Keith Mannthey
@ 2010-04-30  0:12       ` Prasanna S. Panchamukhi
  2010-04-30  0:38         ` Keith Mannthey
  2010-04-30 14:08         ` Ben Woodard
  0 siblings, 2 replies; 11+ messages in thread
From: Prasanna S. Panchamukhi @ 2010-04-30  0:12 UTC (permalink / raw)
  To: Keith Mannthey; +Cc: Rob Becker, bluesmoke-devel, Arthur Jones, dougthompson


On Thu, Apr 29, 2010 at 04:18:07PM -0700, Keith Mannthey wrote:
> On Thu, 2010-04-29 at 15:31 -0700, Prasanna S. Panchamukhi wrote:
> > On Thu, Apr 29, 2010 at 03:13:42PM -0700, Keith Mannthey wrote:
> > > On Thu, 2010-04-29 at 11:30 -0700, Prasanna S. Panchamukhi wrote:
> > > > Hi Doug,
> > > > 
> > > > I am testing Linux-2.6.34-rc5 EDAC driver on AMD64 Opteron.
> > > > I am able to inject single bit errors and get the edac driver report the 
> > > > correctable errors.
> > > > But when I inject 2-bit errors, I did not see any notification or kernel 
> > > > log, the system simply hangs.
> > > > This happens with or without edac_mc_panic_on_ue enabled.
> > > > Please let me know if I am missing something.
> > > > Below are the details.
> > > 
> > > I would have to recheck the specs to be 100% sure but I would consider
> > > double bit errors to be fatal on normal Opteron boxes. There is a good
> > > chance your BIOS detects the fatal error and freezes the box to prevent
> > > data corruption.
> > 
> > Shouldn't the edac driver be reporting Uncorrectable Errors even 
> > before ..BIOS detects the fatal error and freezes the box?
> > Did someone already tested the 2-bit error injection and reporting
> > on AMD64?
> 
> I don't know your hardware or firmware but a SMI can be uses when the
> fatal error is triggered. BIOS can get instant notification. 
> 

Thanks, Keith, for your response.
My current hardware is an AMD64 Opteron, Family 10h (cpu family 16 in /proc/cpuinfo).

> 
> > Did the edac driver reported Uncorrectable Errors on other architectures
> > Powerpc/Intel?
> 
> This is a very specific question that requires firmware level knowledge
> of your system.  On my AMD and Intel system the bios steps in and kills
> the box when faced with a double bit error and this is the safest
> possible scenario for a double bit error.  From your report it sounds
> like you box may be doing something similar. 

Killing the system is expected. But even before the system
gets killed, the EDAC driver should log that a double-bit UE was detected.
From your statement above, it sounds like you did not see the
uncorrectable errors being reported either; the system just gets killed.

I would expect the EDAC driver to report uncorrectable errors
before the box is killed.

Also, there is the "edac_mc_panic_on_ue" module parameter. If enabled, it
should result in a system panic.
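
To rule out a misconfigured parameter, its current value can be read back at run time. This is a sketch that assumes the parameter is exposed by the edac_core module under /sys/module/edac_core/parameters; the helper names and the `EDAC_PARAM_DIR` override are illustrative.

```shell
# EDAC_PARAM_DIR is a hypothetical override for dry runs; the default assumes
# edac_mc_panic_on_ue is exported by the edac_core module.
EDAC_PARAM_DIR="${EDAC_PARAM_DIR:-/sys/module/edac_core/parameters}"

# panic_on_ue_state
# Prints "enabled", "disabled", or "unknown" if the parameter file is unreadable.
panic_on_ue_state() {
    f="$EDAC_PARAM_DIR/edac_mc_panic_on_ue"
    if [ ! -r "$f" ]; then
        echo "unknown"
        return 0
    fi
    case "$(cat "$f")" in
        0) echo "disabled" ;;
        *) echo "enabled" ;;
    esac
}

# set_panic_on_ue 0|1
# Writes the requested value to the parameter file; rejects anything else.
set_panic_on_ue() {
    case "$1" in
        0|1) echo "$1" > "$EDAC_PARAM_DIR/edac_mc_panic_on_ue" ;;
        *)   echo "usage: set_panic_on_ue 0|1" >&2; return 2 ;;
    esac
}
```

Running `panic_on_ue_state` before each injection records which of the two configurations ("with or without edac_mc_panic_on_ue") a given hang belongs to.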

> 
> I didn't get around to playing with the error injection code could be
> something going on there.  I had real debug dims with switches on them
> to generate an error.  
> 

AMD error injection is a very cool feature that helps us check
the system's capability to detect and report memory errors.

> You are triggering a double bit error and waiting around for over a min
> for the box to hang? If so I doubt it is the BIOS that is killing your
> box as that should happen right away.

My system does not have any services running, hence the delay
before the system hangs. But generally it's pretty quick.

Thanks
Prasanna

> 
> Thanks,
>   Keith 
> 
> > Thanks
> > Prasanna
> > 
> > 
> > > 
> > > Thanks,
> > >   Keith 
> > > 
> > > > Thanks
> > > > Prasanna
> > > > 
> > > 
> 

------------------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: EDAC linux-2.6.34-rc5 non correctable errors not reported on AMD64 opteron
  2010-04-30  0:12       ` Prasanna S. Panchamukhi
@ 2010-04-30  0:38         ` Keith Mannthey
  2010-04-30 11:00           ` Borislav Petkov
  2010-04-30 14:08         ` Ben Woodard
  1 sibling, 1 reply; 11+ messages in thread
From: Keith Mannthey @ 2010-04-30  0:38 UTC (permalink / raw)
  To: Prasanna S. Panchamukhi
  Cc: Rob Becker, bluesmoke-devel, Arthur Jones, dougthompson

On Thu, 2010-04-29 at 17:12 -0700, Prasanna S. Panchamukhi wrote:
> On Thu, Apr 29, 2010 at 04:18:07PM -0700, Keith Mannthey wrote:
> > On Thu, 2010-04-29 at 15:31 -0700, Prasanna S. Panchamukhi wrote:
> > > On Thu, Apr 29, 2010 at 03:13:42PM -0700, Keith Mannthey wrote:
> > > > On Thu, 2010-04-29 at 11:30 -0700, Prasanna S. Panchamukhi wrote:
> > > > > Hi Doug,
> > > > > 
> > > > > I am testing Linux-2.6.34-rc5 EDAC driver on AMD64 Opteron.
> > > > > I am able to inject single bit errors and get the edac driver report the 
> > > > > correctable errors.
> > > > > But when I inject 2-bit errors, I did not see any notification or kernel 
> > > > > log, the system simply hangs.
> > > > > This happens with or without edac_mc_panic_on_ue enabled.
> > > > > Please let me know if I am missing something.
> > > > > Below are the details.
> > > > 
> > > > I would have to recheck the specs to be 100% sure but I would consider
> > > > double bit errors to be fatal on normal Opteron boxes. There is a good
> > > > chance your BIOS detects the fatal error and freezes the box to prevent
> > > > data corruption.
> > > 
> > > Shouldn't the edac driver be reporting Uncorrectable Errors even 
> > > before ..BIOS detects the fatal error and freezes the box?
> > > Did someone already tested the 2-bit error injection and reporting
> > > on AMD64?
> > 
> > I don't know your hardware or firmware but a SMI can be uses when the
> > fatal error is triggered. BIOS can get instant notification. 
> > 
> 
> Thanks Keith for your response.
> My current hardware is AMD64 Opteron Family F10.

This behavior is mostly firmware/BIOS dependent. The CPU error handling
gets programmed by your BIOS.

> > 
> > > Did the edac driver reported Uncorrectable Errors on other architectures
> > > Powerpc/Intel?
> > 
> > This is a very specific question that requires firmware level knowledge
> > of your system.  On my AMD and Intel system the bios steps in and kills
> > the box when faced with a double bit error and this is the safest
> > possible scenario for a double bit error.  From your report it sounds
> > like you box may be doing something similar. 
> 
> Killing of the system is expected. But even before the system
> gets killed, the edac driver should log saying..double bit UE detected.
> And from your above statement, it looks like neither you saw
> the Uncorrectable errors being reported but just the system gets killed.

Correct, on my systems the box dies and the BIOS reports the error
properly. The OS is not involved in any way; it is a hard power cycle.

> I would expect the edac driver to report Uncorrectable errors
> before it kills the box.

In a real error your box may panic before EDAC gets a chance to poll and
process the error.  I.e., you might catch the fatal ECC error on load into
the data/instruction cache and run some very errant command before the
next EDAC poll.

With your error injection you probably don't have to worry about that.
You might look into the current driver; I heard there was a way to
offline specific pages on error reports.  I have no idea if the current
AMD driver is doing this.

> Also there is "edac_mc_panic_on_ue" module param. if enabled it
> should result in system panic.

Correct. There are lots of different UE errors. Panic is generally a good
idea, although I have seen errors labeled fatal that were safe to ignore
on one system (Intel TMID).
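
A minimal sketch for checking how the EDAC core is currently set up,
assuming the edac_core parameters (edac_mc_poll_msec,
edac_mc_panic_on_ue) are exposed under /sys/module/edac_core/parameters
on your kernel. The helper name edac_param is made up for illustration:

```shell
# Print one edac_core module parameter.
# $1: parameter name
# $2: parameters directory (assumed default: the usual sysfs location)
edac_param() {
    dir=${2:-/sys/module/edac_core/parameters}
    if [ -r "$dir/$1" ]; then
        cat "$dir/$1"
    else
        # Parameter file missing or unreadable (module not loaded, other kernel)
        echo "unavailable"
    fi
}
```

Running "edac_param edac_mc_poll_msec" and "edac_param edac_mc_panic_on_ue"
before injecting should show the polling interval and whether a reported
UE would be escalated to a panic.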

> > 
> > I didn't get around to playing with the error injection code could be
> > something going on there.  I had real debug dims with switches on them
> > to generate an error.  
> > 
> 
> AMD error injection is a very cool feature, that helps us to check
> the system capabilites to report and detect the memory errors.

Yes, I understand what it is used for.  Be sure to properly investigate
mapping EDAC errors to human-usable error information if you are relying
on EDAC for reporting in a system of any complexity.

> > You are triggering a double bit error and waiting around for over a min
> > for the box to hang? If so I doubt it is the BIOS that is killing your
> > box as that should happen right away.
> 
> My system does not have any services running, hence there is a delay
> before the system hangs. But generally its pretty quick.

Is your system, as it stands now, panicking within a second (less than
10s for sure) because EDAC is processing the double-bit errors?

Are you just wanting an EDAC UE error notification in dmesg?
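
If a dmesg notification is all that is needed, a filter along these
lines may help. It is only a sketch and assumes the driver logs
uncorrectable errors with an "EDAC MCn: UE" prefix; the function name
edac_ue_lines is hypothetical:

```shell
# Keep only EDAC uncorrectable-error lines from a kernel log read on stdin.
# "|| true" keeps the exit status at 0 when nothing matches.
edac_ue_lines() {
    grep -E 'EDAC MC[0-9]+: UE' || true
}
```

Usage would be "dmesg | edac_ue_lines"; empty output means no UE was
logged before the box went down.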

Thanks,
  Keith 

> Thanks
> Prasanna
> 
> > 
> > Thanks,
> >   Keith 
> > 
> > > Thanks
> > > Prasanna
> > > 
> > > 
> > > > 
> > > > Thanks,
> > > >   Keith 
> > > > 
> > > > > Thanks
> > > > > Prasanna
> > > > > 
> > > > 
> > 


------------------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: EDAC linux-2.6.34-rc5 non correctable errors not reported on AMD64 opteron
  2010-04-30  0:38         ` Keith Mannthey
@ 2010-04-30 11:00           ` Borislav Petkov
  2010-05-05  1:40             ` Prasanna S. Panchamukhi
  2010-05-06 23:56             ` Prasanna S. Panchamukhi
  0 siblings, 2 replies; 11+ messages in thread
From: Borislav Petkov @ 2010-04-30 11:00 UTC (permalink / raw)
  To: Keith Mannthey; +Cc: Rob Becker, bluesmoke-devel, Arthur Jones, dougthompson

Hi Prasanna, Keith,

from what I could see, you're doing the injection correctly and
the injection code accesses the right bits so that should work ok.
What happens is rather what Keith explained in detail with the only
correction that it is not the BIOS but the hardware itself that takes
action to prevent the system from damaging the data.

See, double-bit errors are deemed uncorrectable and your machine
syncfloods¹, i.e. it terminates further stale data propagation.
Therefore, no software gets to run, not even the machine check handler
(not to mention the clumsy EDAC error polling mechanism). And that's
why you don't get the errors reported; OTOH, if you want to test the
amd64_edac driver, injecting single-bit errors should work and you can
report to me any issues you encounter.
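
To tell up front which kind of error a given inject_ecc_vector value
will produce, a rough sketch follows. It assumes each set bit in the
vector corrupts one bit of the selected word and that plain SEC-DED
applies (chipkill modes are ignored); popcount and classify_vector are
names made up for this example:

```shell
# Count the set bits in an injection vector given as hex (0x88) or decimal.
popcount() {
    v=$(( $1 ))
    n=0
    while [ "$v" -ne 0 ]; do
        n=$(( n + (v & 1) ))
        v=$(( v >> 1 ))
    done
    echo "$n"
}

# Classify a vector under the simple SEC-DED assumption:
# one flipped bit is correctable, two or more are not.
classify_vector() {
    case "$(popcount "$1")" in
        0) echo "none" ;;
        1) echo "correctable" ;;
        *) echo "uncorrectable" ;;
    esac
}
```

The 0x88 used in the report is binary 1000 1000, i.e. two bits set, so
under these assumptions it is a double-bit error; a value such as 0x08
would be a single-bit one.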

Hope that helps.

Thanks.



¹ See the section on Sync Flooding in the HyperTransport spec if you
want to know more details on that.

-- 
Regards/Gruss,
Boris.

--
Advanced Micro Devices, Inc.
Operating Systems Research Center

------------------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: EDAC linux-2.6.34-rc5 non correctable errors not reported on AMD64 opteron
  2010-04-30  0:12       ` Prasanna S. Panchamukhi
  2010-04-30  0:38         ` Keith Mannthey
@ 2010-04-30 14:08         ` Ben Woodard
  1 sibling, 0 replies; 11+ messages in thread
From: Ben Woodard @ 2010-04-30 14:08 UTC (permalink / raw)
  To: Prasanna S. Panchamukhi
  Cc: Rob Becker, Arthur Jones, dougthompson, bluesmoke-devel, linux-edac

On Thu, 2010-04-29 at 17:12 -0700, Prasanna S. Panchamukhi wrote:
> On Thu, Apr 29, 2010 at 04:18:07PM -0700, Keith Mannthey wrote:
> > On Thu, 2010-04-29 at 15:31 -0700, Prasanna S. Panchamukhi wrote:
> > > On Thu, Apr 29, 2010 at 03:13:42PM -0700, Keith Mannthey wrote:
> > > > On Thu, 2010-04-29 at 11:30 -0700, Prasanna S. Panchamukhi wrote:
> > > > > Hi Doug,
> > > > > 
> > > > > I am testing Linux-2.6.34-rc5 EDAC driver on AMD64 Opteron.
> > > > > I am able to inject single bit errors and get the edac driver report the 
> > > > > correctable errors.
> > > > > But when I inject 2-bit errors, I did not see any notification or kernel 
> > > > > log, the system simply hangs.
> > > > > This happens with or without edac_mc_panic_on_ue enabled.
> > > > > Please let me know if I am missing something.
> > > > > Below are the details.
> > > > 
> > > > I would have to recheck the specs to be 100% sure but I would consider
> > > > double bit errors to be fatal on normal Opteron boxes. There is a good
> > > > chance your BIOS detects the fatal error and freezes the box to prevent
> > > > data corruption.
> > > 
> > > Shouldn't the edac driver be reporting Uncorrectable Errors even 
> > > before ..BIOS detects the fatal error and freezes the box?
> > > Did someone already tested the 2-bit error injection and reporting
> > > on AMD64?
> > 
> > I don't know your hardware or firmware but a SMI can be uses when the
> > fatal error is triggered. BIOS can get instant notification. 
> > 
> 
> Thanks Keith for your response.
> My current hardware is AMD64 Opteron Family F10.
> 

This highlights one of the biggest problems with that approach. If the
BIOS kills the box through an SMI, then there is no way to make the
system reliable.

Preventing data corruption is important, but the BIOS doesn't have enough
information about the system to make decisions about what to do. A UE
doesn't have to be fatal; it depends on where it is:
        if the page has backing store -> take a page fault
        if free -> no worries
        if network buffer -> discard and allow the network stack to deal
        if user space page without backing store -> send program a SIGBUS
A UE really doesn't need to be fatal unless it happens to be in
certain kinds of kernel pages. As the number of cores within a system
increases, the overall system memory increases and the percentage used by
the kernel decreases. Therefore, the probability that a UE will occur in
the part of RAM that is used by the kernel will decrease. Therefore,
using a big sledgehammer and killing the system will make the system
less reliable.
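
The policy above can be sketched as a simple lookup. This is only an
illustration of the decision table, not kernel code: the page-kind
labels and the ue_action name are invented here, and treating every
other case as fatal is my assumption for "certain kinds of kernel pages":

```shell
# Map a hypothetical page-kind label to the recovery action from the list above.
ue_action() {
    case "$1" in
        backed) echo "refault" ;;   # page has backing store: drop it, take a page fault
        free)   echo "ignore" ;;    # free page: nothing to lose
        netbuf) echo "discard" ;;   # network buffer: let the network stack deal with it
        user)   echo "sigbus" ;;    # user page without backing store: SIGBUS the program
        *)      echo "panic" ;;     # assumed fatal: e.g. critical kernel pages
    esac
}
```

Only the last branch takes the whole system down; the first four keep
it running, which is the point of letting the OS rather than the BIOS
decide.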

-ben


> > 
> > > Did the edac driver reported Uncorrectable Errors on other architectures
> > > Powerpc/Intel?
> > 
> > This is a very specific question that requires firmware level knowledge
> > of your system.  On my AMD and Intel system the bios steps in and kills
> > the box when faced with a double bit error and this is the safest
> > possible scenario for a double bit error.  From your report it sounds
> > like you box may be doing something similar. 
> 
> Killing of the system is expected. But even before the system
> gets killed, the edac driver should log saying..double bit UE detected.
> And from your above statement, it looks like neither you saw
> the Uncorrectable errors being reported but just the system gets killed.
> 
> I would expect the edac driver to report Uncorrectable errors
> before it kills the box.
> 
> Also there is "edac_mc_panic_on_ue" module param. if enabled it
> should result in system panic.
> 
> > 
> > I didn't get around to playing with the error injection code could be
> > something going on there.  I had real debug dims with switches on them
> > to generate an error.  
> > 
> 
> AMD error injection is a very cool feature, that helps us to check
> the system capabilites to report and detect the memory errors.
> 
> > You are triggering a double bit error and waiting around for over a min
> > for the box to hang? If so I doubt it is the BIOS that is killing your
> > box as that should happen right away.
> 
> My system does not have any services running, hence there is a delay
> before the system hangs. But generally its pretty quick.
> 
> Thanks
> Prasanna
> 
> > 
> > Thanks,
> >   Keith 
> > 
> > > Thanks
> > > Prasanna
> > > 
> > > 
> > > > 
> > > > Thanks,
> > > >   Keith 
> > > > 
> > > > > Thanks
> > > > > Prasanna
> > > > > 



------------------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: EDAC linux-2.6.34-rc5 non correctable errors not reported on AMD64 opteron
  2010-04-30 11:00           ` Borislav Petkov
@ 2010-05-05  1:40             ` Prasanna S. Panchamukhi
  2010-05-06 23:56             ` Prasanna S. Panchamukhi
  1 sibling, 0 replies; 11+ messages in thread
From: Prasanna S. Panchamukhi @ 2010-05-05  1:40 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Rob Becker, bluesmoke-devel, Arthur Jones, dougthompson

On Fri, Apr 30, 2010 at 04:00:23AM -0700, Borislav Petkov wrote:
> Hi Prasanna, Keith,
> 
> from what I could see, you're doing the injection correctly and
> the injection code accesses the right bits so that should work ok.
> What happens is rather what Keith explained in detail with the only
> correction that it is not the BIOS but the hardware itself that takes
> action to prevent the system from damaging the data.
> 
> See, double-bit errors are deemed uncorrectable and your machine
> syncfloods¹, i.e. it terminates further stale data propagation.
> Therefore, no software gets to run, not even the machine check handler
> (not to mention the clumsy EDAC error polling mechanism). And that's
> why you don't get the errors reported; OTOH, if you want to test the
> amd64_edac driver, injecting single-bit errors should work and you can
> report to me any issues you encounter.
> 
> Hope that helps.
Hi Boris,

Thanks for the useful info. Now I understand how double-bit errors
are handled on AMD64.
Also, injection and notification of single-bit errors seem to work fine.

Thanks
Prasanna
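
For reference, a minimal sketch of how the single-bit result can be checked
from sysfs. It assumes the standard EDAC ce_count and ue_count attributes
under mc0; the EDAC_MC variable and the edac_counts helper are names made up
for this illustration.

```sh
#!/bin/sh
# Sketch: read the corrected/uncorrected error counters of one memory
# controller. EDAC_MC is an assumption for illustration; the thread uses mc0.
EDAC_MC="${EDAC_MC:-/sys/devices/system/edac/mc/mc0}"

# edac_counts: print "ce=<n> ue=<n>" from the ce_count and ue_count files.
# Returns non-zero if either attribute cannot be read.
edac_counts() {
    ce=$(cat "$EDAC_MC/ce_count") || return 1
    ue=$(cat "$EDAC_MC/ue_count") || return 1
    echo "ce=$ce ue=$ue"
}
```

Calling edac_counts before and after a single-bit injection should show the
ce value going up if the driver reported the error. In the double-bit case
described above the machine sync floods first, so I would not expect the ue
value to be readable at all.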

> 
> Thanks.
> 
> 
> 
> ¹ See the section on Sync Flooding in the HyperTransport spec if you
> want to know more details on that.
> 
> -- 
> Regards/Gruss,
> Boris.
> 
> --
> Advanced Micro Devices, Inc.
> Operating Systems Research Center

------------------------------------------------------------------------------


* Re: EDAC linux-2.6.34-rc5 non correctable errors not reported on AMD64 opteron
  2010-04-30 11:00           ` Borislav Petkov
  2010-05-05  1:40             ` Prasanna S. Panchamukhi
@ 2010-05-06 23:56             ` Prasanna S. Panchamukhi
  1 sibling, 0 replies; 11+ messages in thread
From: Prasanna S. Panchamukhi @ 2010-05-06 23:56 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Rob Becker, bluesmoke-devel, Arthur Jones, dougthompson

Hi Boris,


On Fri, Apr 30, 2010 at 04:00:23AM -0700, Borislav Petkov wrote:
> Hi Prasanna, Keith,
> 
> from what I could see, you're doing the injection correctly and
> the injection code accesses the right bits so that should work ok.
> What happens is rather what Keith explained in detail with the only
> correction that it is not the BIOS but the hardware itself that takes
> action to prevent the system from damaging the data.
> 
> See, double-bit errors are deemed uncorrectable and your machine
> syncfloods¹, i.e. it terminates further stale data propagation.
> Therefore, no software gets to run, not even the machine check handler
> (not to mention the clumsy EDAC error polling mechanism). And that's
> why you don't get the errors reported; OTOH, if you want to test the
> amd64_edac driver, injecting single-bit errors should work and you can
> report to me any issues you encounter.

Is there a way to disable sync flood on uncorrectable errors and
generate an MCE instead?

Thanks
Prasanna


> 
> Hope that helps.
> 
> Thanks.
> 
> 
> 
> ¹ See the section on Sync Flooding in the HyperTransport spec if you
> want to know more details on that.
> 
> -- 
> Regards/Gruss,
> Boris.
> 
> --
> Advanced Micro Devices, Inc.
> Operating Systems Research Center

------------------------------------------------------------------------------


* EDAC: Linux-2.6.34-rc5 non correctable errors not reported on AMD64 Opteron
@ 2010-04-28 17:14 Prasanna Panchamukhi
  0 siblings, 0 replies; 11+ messages in thread
From: Prasanna Panchamukhi @ 2010-04-28 17:14 UTC (permalink / raw)
  To: dougthompson, bluesmoke-devel; +Cc: Rob Becker, Arthur.Jones

Hi Doug,

I am trying to test the Linux-2.6.34-rc5 EDAC driver on an AMD64 Opteron.
I am able to inject single-bit errors and get the EDAC driver to report the
correctable errors.
But when I inject 2-bit errors, I do not see any notification or kernel
log; the system simply hangs.
This happens with or without edac_mc_panic_on_ue enabled.
Please let me know if I am missing something.
Below are the details.

Thanks
Prasanna

Steps to reproduce the problem:


1. Build Linux-2.6.34-rc5 using x86_64_defconfig with the following additional config options enabled:
CONFIG_EDAC_DECODE_MCE=y
CONFIG_EDAC_MM_EDAC=y
CONFIG_EDAC_AMD64=m
CONFIG_EDAC_AMD64_ERROR_INJECTION=y
CONFIG_EDAC_E752X=m
CONFIG_EDAC_I82975X=m
CONFIG_EDAC_I3000=m
CONFIG_EDAC_I3200=m
CONFIG_EDAC_X38=m
CONFIG_EDAC_I5400=m
CONFIG_EDAC_I5000=m
CONFIG_EDAC_I5100=m

2. Insert the kernel module
# insmod amd64_edac_mod.ko

3. Inject errors

# echo 3 > /sys/devices/system/edac/mc/mc0/inject_section
# echo 7 > /sys/devices/system/edac/mc/mc0/inject_word
# echo 0x88 > /sys/devices/system/edac/mc/mc0/inject_ecc_vector
# echo 1 > /sys/devices/system/edac/mc/mc0/inject_read
# echo 1 > /sys/devices/system/edac/mc/mc0/inject_write

4. The system should hang within a few minutes.
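
The injection commands in step 3 can also be wrapped in a small script. This
is only a sketch: the EDAC_MC variable and the inject_ecc function are names
made up for this illustration, and the attribute files are the ones written
in step 3.

```sh
#!/bin/sh
# Sketch of the injection sequence from step 3, wrapped in a function.
# EDAC_MC is an assumption for illustration; the steps above use mc0.
EDAC_MC="${EDAC_MC:-/sys/devices/system/edac/mc/mc0}"

# inject_ecc SECTION WORD VECTOR
# Writes the same five attributes, in the same order, as the manual steps.
inject_ecc() {
    section=$1
    word=$2
    vector=$3
    # Refuse to start if any injection attribute is missing or not writable.
    for attr in inject_section inject_word inject_ecc_vector \
                inject_read inject_write; do
        if [ ! -w "$EDAC_MC/$attr" ]; then
            echo "missing or read-only: $EDAC_MC/$attr" >&2
            return 1
        fi
    done
    echo "$section" > "$EDAC_MC/inject_section" &&
    echo "$word"    > "$EDAC_MC/inject_word" &&
    echo "$vector"  > "$EDAC_MC/inject_ecc_vector" &&
    echo 1          > "$EDAC_MC/inject_read" &&
    echo 1          > "$EDAC_MC/inject_write"
}
```

Running inject_ecc 3 7 0x88 as root repeats the values used above. Note that
0x88 has two bits set, which is what makes this the 2-bit case; a vector with
a single bit set should give the correctable case instead.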

Additional info:
- AMD64 Opteron
# cat /proc/cpuinfo
processor    : 0
vendor_id    : AuthenticAMD
cpu family    : 16
model        : 2
model name    : Quad-Core AMD Opteron(tm) Processor 2346 HE
stepping    : 3
cpu MHz        : 1800.023
cache size    : 512 KB
physical id    : 0
siblings    : 4
core id        : 0
cpu cores    : 4
apicid        : 0
initial apicid    : 0
fpu        : yes
fpu_exception    : yes
cpuid level    : 5
wp        : yes
flags        : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca 
cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt 
pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nonstop_tsc 
extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic 
cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs npt lbrv svm_lock
bogomips    : 3600.04
TLB size    : 1024 4K pages
clflush size    : 64
cache_alignment    : 64
address sizes    : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate

processor    : 1
vendor_id    : AuthenticAMD
cpu family    : 16
model        : 2
model name    : Quad-Core AMD Opteron(tm) Processor 2346 HE
stepping    : 3
cpu MHz        : 1800.023
cache size    : 512 KB
physical id    : 0
siblings    : 4
core id        : 1
cpu cores    : 4
apicid        : 1
initial apicid    : 1
fpu        : yes
fpu_exception    : yes
cpuid level    : 5
wp        : yes
flags        : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca 
cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt 
pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nonstop_tsc 
extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic 
cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs npt lbrv svm_lock
bogomips    : 3600.08
TLB size    : 1024 4K pages
clflush size    : 64
cache_alignment    : 64
address sizes    : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate

processor    : 2
vendor_id    : AuthenticAMD
cpu family    : 16
model        : 2
model name    : Quad-Core AMD Opteron(tm) Processor 2346 HE
stepping    : 3
cpu MHz        : 1800.023
cache size    : 512 KB
physical id    : 0
siblings    : 4
core id        : 2
cpu cores    : 4
apicid        : 2
initial apicid    : 2
fpu        : yes
fpu_exception    : yes
cpuid level    : 5
wp        : yes
flags        : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca 
cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt 
pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nonstop_tsc 
extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic 
cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs npt lbrv svm_lock
bogomips    : 3599.96
TLB size    : 1024 4K pages
clflush size    : 64
cache_alignment    : 64
address sizes    : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate

processor    : 3
vendor_id    : AuthenticAMD
cpu family    : 16
model        : 2
model name    : Quad-Core AMD Opteron(tm) Processor 2346 HE
stepping    : 3
cpu MHz        : 1800.023
cache size    : 512 KB
physical id    : 0
siblings    : 4
core id        : 3
cpu cores    : 4
apicid        : 3
initial apicid    : 3
fpu        : yes
fpu_exception    : yes
cpuid level    : 5
wp        : yes
flags        : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca 
cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt 
pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nonstop_tsc 
extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic 
cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs npt lbrv svm_lock
bogomips    : 3600.01
TLB size    : 1024 4K pages
clflush size    : 64
cache_alignment    : 64
address sizes    : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate



------------------------------------------------------------------------------


end of thread, other threads:[~2010-05-06 23:56 UTC | newest]

Thread overview: 11+ messages
2010-04-29 18:30 EDAC linux-2.6.34-rc5 non correctable errors not reported on AMD64 opteron Prasanna S. Panchamukhi
2010-04-29 22:13 ` Keith Mannthey
2010-04-29 22:31   ` Prasanna S. Panchamukhi
2010-04-29 23:18     ` Keith Mannthey
2010-04-30  0:12       ` Prasanna S. Panchamukhi
2010-04-30  0:38         ` Keith Mannthey
2010-04-30 11:00           ` Borislav Petkov
2010-05-05  1:40             ` Prasanna S. Panchamukhi
2010-05-06 23:56             ` Prasanna S. Panchamukhi
2010-04-30 14:08         ` Ben Woodard
  -- strict thread matches above, loose matches on Subject: below --
2010-04-28 17:14 EDAC: Linux-2.6.34-rc5 non correctable errors not reported on AMD64 Opteron Prasanna Panchamukhi
