Linux-EDAC Archive on lore.kernel.org
* Re: [GIT PULL] EDAC pile for 5.4 -> AMD family 17h, model 70h support
       [not found] ` <20190924092644.GC19317@zn.tnic>
@ 2019-10-05 16:52   ` Jeff God
  2019-10-07  7:16     ` Borislav Petkov
  0 siblings, 1 reply; 19+ messages in thread
From: Jeff God @ 2019-10-05 16:52 UTC
  To: Borislav Petkov; +Cc: linux-edac

Hi, and thanks a lot for the quick response.

I waited for 5.4-rc1 to come out, downloaded the tar.gz from
kernel.org to be sure, and built it to test the new EDAC support
for AMD fam 17h model 7xh:
* To test it, I used the mce-inject module
* Only the "sw" flag reported anything in dmesg; here is the output:
[  371.111818] mce: Machine check injector initialized
[  696.292223] mce: [Hardware Error]: Machine check events logged
[  696.292225] [Hardware Error]: Corrected error, no action required.
[  696.292229] [Hardware Error]: CPU:0 (17:71:0) MC0_STATUS[-|CE|-|-|-|-|-|-|-|-]: 0x0000000000000000
[  696.292232] [Hardware Error]: IPID: 0x0000000000000000
[  696.292234] [Hardware Error]: Load Store Unit Ext. Error Code: 0, Load queue parity error.
[  696.292236] [Hardware Error]: cache level: RESV, tx: INSN
* I tried different addr values that I believe are in a valid range;
all other values (cpu, ...) were left at their default (0). I used
bank 0 to trigger the tests and briefly tried different values, always
with the same result
* No other flag logged any error or message anywhere (I mostly
focused on "hw", which I believe is a proper hardware error
injection that simulates an ECC error)
* During these tests I monitored both dmesg and edac-util -rfull
* edac-util always reported 0 errors
* This time I did not overclock the memory with and without ECC to
properly assess ECC as I normally do, since that takes more time. I
relied on the assumption that mce-inject with the hw flag should work
and report errors first, but it doesn't
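
The counters that edac-util reports can also be read straight from
sysfs, which makes it easy to compare them before and after an
injection. A minimal sketch, assuming the usual EDAC layout with one
mc* directory per memory controller, each holding ce_count and
ue_count files; the function name and the optional root argument are
made up for illustration:

```shell
#!/bin/sh
# Sum one EDAC counter (ce_count or ue_count) over all memory controllers.
# $1: counter file name
# $2: sysfs root (the default below is an assumption about the layout)
edac_sum() {
    counter=$1
    root=${2:-/sys/devices/system/edac/mc}
    total=0
    for f in "$root"/mc*/"$counter"; do
        [ -r "$f" ] || continue    # glob did not match, or file unreadable
        total=$((total + $(cat "$f")))
    done
    echo "$total"
}

# Example: print corrected and uncorrected totals.
# echo "CE: $(edac_sum ce_count)  UE: $(edac_sum ue_count)"
```

Taking a reading before and after each injection shows whether EDAC
counted anything, independently of what ends up in dmesg.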

Please let me know if there is something I should have done
differently with mce inject to test.

To recap, my observation so far is that I can get the EDAC driver to
load now (this was also the case with my previous attempts, where I
added the PCI device IDs in my own 5.3 kernel builds), but EDAC never
seems to report any CE (or UE) ECC error, even when errors may be
happening and being corrected in the background. So I was wondering
whether others have seen the same thing on their systems with these
new CPUs, or whether error reporting has really been confirmed to
work.

Here are the high level specs of the system used, for reference:
* CPU AMD Ryzen 3900x (fam 17h model 71h)
* Memory modules: KINGSTON KSM26ED8/16ME (4x16GB == 64GB)
* Motherboard: ASUS PRIME X570-PRO (BIOS 1201)

Here is the dmesg output related to the edac driver messages after
startup (in debug):

[    0.000000] Command line: initrd=\elementary\initrd.img root=UUID=c51491ea-900f-4a3e-997c-8b37cbc675ee ro  quiet splash kaslr mitigations=off clocksource=tsc tsc=reliable igb.EEE=0 edac_debug_level=666
[    0.000000] Kernel command line: initrd=\elementary\initrd.img root=UUID=c51491ea-900f-4a3e-997c-8b37cbc675ee ro  quiet splash kaslr mitigations=off clocksource=tsc tsc=reliable igb.EEE=0 edac_debug_level=666
[    0.178120] EDAC MC: Ver: 3.0.0
[    0.178120] EDAC DEBUG: edac_mc_sysfs_init: device mc created
[    9.914600] EDAC DEBUG: compute_num_umcs: Number of UMCs: 2
[    9.914606] EDAC amd64: Node 0: DRAM ECC enabled.
[    9.914607] EDAC amd64: F17h_M70h detected (node 0).
[    9.914613] EDAC DEBUG: reserve_mc_sibling_devs: F0: 0000:00:18.0
[    9.914613] EDAC DEBUG: reserve_mc_sibling_devs: F3: 0000:00:18.3
[    9.914614] EDAC DEBUG: reserve_mc_sibling_devs: F6: 0000:00:18.6
[    9.914614] EDAC DEBUG: read_mc_regs:   TOP_MEM:  0x00000000e0000000
[    9.914615] EDAC DEBUG: read_mc_regs:   TOP_MEM2: 0x0000001020000000
[    9.914628] EDAC DEBUG: read_umc_base_mask:   DCSB0[0]=0x00000001 reg: 0x50000
[    9.914629] EDAC DEBUG: read_umc_base_mask: DCSB_SEC0[0]=0x00000000 reg: 0x50010
[    9.914631] EDAC DEBUG: read_umc_base_mask:   DCSB0[1]=0x00000201 reg: 0x50004
[    9.914632] EDAC DEBUG: read_umc_base_mask: DCSB_SEC0[1]=0x00000000 reg: 0x50014
[    9.914634] EDAC DEBUG: read_umc_base_mask:   DCSB0[2]=0x00000401 reg: 0x50008
[    9.914635] EDAC DEBUG: read_umc_base_mask: DCSB_SEC0[2]=0x00000000 reg: 0x50018
[    9.914637] EDAC DEBUG: read_umc_base_mask:   DCSB0[3]=0x00000601 reg: 0x5000c
[    9.914638] EDAC DEBUG: read_umc_base_mask: DCSB_SEC0[3]=0x00000000 reg: 0x5001c
[    9.914640] EDAC DEBUG: read_umc_base_mask:   DCSM0[0]=0x07fff9fe reg: 0x50020
[    9.914642] EDAC DEBUG: read_umc_base_mask: DCSM_SEC0[0]=0x00000000 reg: 0x50028
[    9.914643] EDAC DEBUG: read_umc_base_mask:   DCSM0[1]=0x07fff9fe reg: 0x50024
[    9.914644] EDAC DEBUG: read_umc_base_mask: DCSM_SEC0[1]=0x00000000 reg: 0x5002c
[    9.914646] EDAC DEBUG: read_umc_base_mask:   DCSB1[0]=0x00000001 reg: 0x150000
[    9.914647] EDAC DEBUG: read_umc_base_mask: DCSB_SEC1[0]=0x00000000 reg: 0x150010
[    9.914649] EDAC DEBUG: read_umc_base_mask:   DCSB1[1]=0x00000201 reg: 0x150004
[    9.914650] EDAC DEBUG: read_umc_base_mask: DCSB_SEC1[1]=0x00000000 reg: 0x150014
[    9.914652] EDAC DEBUG: read_umc_base_mask:   DCSB1[2]=0x00000401 reg: 0x150008
[    9.914653] EDAC DEBUG: read_umc_base_mask: DCSB_SEC1[2]=0x00000000 reg: 0x150018
[    9.914655] EDAC DEBUG: read_umc_base_mask:   DCSB1[3]=0x00000601 reg: 0x15000c
[    9.914656] EDAC DEBUG: read_umc_base_mask: DCSB_SEC1[3]=0x00000000 reg: 0x15001c
[    9.914658] EDAC DEBUG: read_umc_base_mask:   DCSM1[0]=0x07fff9fe reg: 0x150020
[    9.914659] EDAC DEBUG: read_umc_base_mask: DCSM_SEC1[0]=0x00000000 reg: 0x150028
[    9.914661] EDAC DEBUG: read_umc_base_mask:   DCSM1[1]=0x07fff9fe reg: 0x150024
[    9.914663] EDAC DEBUG: read_umc_base_mask: DCSM_SEC1[1]=0x00000000 reg: 0x15002c
[    9.914663] EDAC DEBUG: read_mc_regs:   DIMM type: Unbuffered-DDR4
[    9.914664] EDAC DEBUG: __dump_misc_regs_df: UMC0 DIMM cfg: 0x1
[    9.914664] EDAC DEBUG: __dump_misc_regs_df: UMC0 UMC cfg: 0x80001200
[    9.914665] EDAC DEBUG: __dump_misc_regs_df: UMC0 SDP ctrl: 0xb040808b
[    9.914665] EDAC DEBUG: __dump_misc_regs_df: UMC0 ECC ctrl: 0x671
[    9.914667] EDAC DEBUG: __dump_misc_regs_df: UMC0 ECC bad symbol: 0x0
[    9.914668] EDAC DEBUG: __dump_misc_regs_df: UMC0 UMC cap: 0x10030
[    9.914668] EDAC DEBUG: __dump_misc_regs_df: UMC0 UMC cap high: 0x40000000
[    9.914669] EDAC DEBUG: __dump_misc_regs_df: UMC0 ECC capable: yes, ChipKill ECC capable: no
[    9.914670] EDAC DEBUG: __dump_misc_regs_df: UMC0 All DIMMs support ECC: yes
[    9.914670] EDAC DEBUG: __dump_misc_regs_df: UMC0 x4 DIMMs present: no
[    9.914670] EDAC DEBUG: __dump_misc_regs_df: UMC0 x16 DIMMs present: no
[    9.914671] EDAC MC: UMC0 chip selects:
[    9.914671] EDAC DEBUG: f17_addr_mask_to_cs_size: CS0 DIMM0 AddrMasks:
[    9.914672] EDAC DEBUG: f17_addr_mask_to_cs_size:   Original AddrMask: 0x7fff9fe
[    9.914672] EDAC DEBUG: f17_addr_mask_to_cs_size:   Deinterleaved AddrMask: 0x1fffffe
[    9.914673] EDAC DEBUG: f17_addr_mask_to_cs_size: CS1 DIMM0 AddrMasks:
[    9.914673] EDAC DEBUG: f17_addr_mask_to_cs_size:   Original AddrMask: 0x7fff9fe
[    9.914674] EDAC DEBUG: f17_addr_mask_to_cs_size:   Deinterleaved AddrMask: 0x1fffffe
[    9.914674] EDAC amd64: MC: 0:  8192MB 1:  8192MB
[    9.914675] EDAC DEBUG: f17_addr_mask_to_cs_size: CS2 DIMM1 AddrMasks:
[    9.914675] EDAC DEBUG: f17_addr_mask_to_cs_size:   Original AddrMask: 0x7fff9fe
[    9.914676] EDAC DEBUG: f17_addr_mask_to_cs_size:   Deinterleaved AddrMask: 0x1fffffe
[    9.914676] EDAC DEBUG: f17_addr_mask_to_cs_size: CS3 DIMM1 AddrMasks:
[    9.914676] EDAC DEBUG: f17_addr_mask_to_cs_size:   Original AddrMask: 0x7fff9fe
[    9.914677] EDAC DEBUG: f17_addr_mask_to_cs_size:   Deinterleaved AddrMask: 0x1fffffe
[    9.914677] EDAC amd64: MC: 2:  8192MB 3:  8192MB
[    9.914678] EDAC DEBUG: __dump_misc_regs_df: UMC1 DIMM cfg: 0x1
[    9.914678] EDAC DEBUG: __dump_misc_regs_df: UMC1 UMC cfg: 0x80001200
[    9.914679] EDAC DEBUG: __dump_misc_regs_df: UMC1 SDP ctrl: 0xb040808b
[    9.914679] EDAC DEBUG: __dump_misc_regs_df: UMC1 ECC ctrl: 0x671
[    9.914681] EDAC DEBUG: __dump_misc_regs_df: UMC1 ECC bad symbol: 0x0
[    9.914682] EDAC DEBUG: __dump_misc_regs_df: UMC1 UMC cap: 0x10030
[    9.914682] EDAC DEBUG: __dump_misc_regs_df: UMC1 UMC cap high: 0x40000000
[    9.914683] EDAC DEBUG: __dump_misc_regs_df: UMC1 ECC capable: yes, ChipKill ECC capable: no
[    9.914683] EDAC DEBUG: __dump_misc_regs_df: UMC1 All DIMMs support ECC: yes
[    9.914684] EDAC DEBUG: __dump_misc_regs_df: UMC1 x4 DIMMs present: no
[    9.914684] EDAC DEBUG: __dump_misc_regs_df: UMC1 x16 DIMMs present: no
[    9.914684] EDAC MC: UMC1 chip selects:
[    9.914685] EDAC DEBUG: f17_addr_mask_to_cs_size: CS0 DIMM0 AddrMasks:
[    9.914685] EDAC DEBUG: f17_addr_mask_to_cs_size:   Original AddrMask: 0x7fff9fe
[    9.914686] EDAC DEBUG: f17_addr_mask_to_cs_size:   Deinterleaved AddrMask: 0x1fffffe
[    9.914686] EDAC DEBUG: f17_addr_mask_to_cs_size: CS1 DIMM0 AddrMasks:
[    9.914686] EDAC DEBUG: f17_addr_mask_to_cs_size:   Original AddrMask: 0x7fff9fe
[    9.914687] EDAC DEBUG: f17_addr_mask_to_cs_size:   Deinterleaved AddrMask: 0x1fffffe
[    9.914687] EDAC amd64: MC: 0:  8192MB 1:  8192MB
[    9.914688] EDAC DEBUG: f17_addr_mask_to_cs_size: CS2 DIMM1 AddrMasks:
[    9.914688] EDAC DEBUG: f17_addr_mask_to_cs_size:   Original AddrMask: 0x7fff9fe
[    9.914688] EDAC DEBUG: f17_addr_mask_to_cs_size:   Deinterleaved AddrMask: 0x1fffffe
[    9.914689] EDAC DEBUG: f17_addr_mask_to_cs_size: CS3 DIMM1 AddrMasks:
[    9.914689] EDAC DEBUG: f17_addr_mask_to_cs_size:   Original AddrMask: 0x7fff9fe
[    9.914689] EDAC DEBUG: f17_addr_mask_to_cs_size:   Deinterleaved AddrMask: 0x1fffffe
[    9.914690] EDAC amd64: MC: 2:  8192MB 3:  8192MB
[    9.914690] EDAC DEBUG: __dump_misc_regs_df: F0x104 (DRAM Hole Address): 0xe0000001, base: 0xe0000000
[    9.914691] EDAC DEBUG: dump_misc_regs:   DramHoleValid: yes
[    9.914691] EDAC amd64: using x16 syndromes.
[    9.914691] EDAC amd64: MCT channel count: 2
[    9.914693] EDAC DEBUG: edac_mc_alloc: allocating 1896 bytes for mci data (8 ranks, 8 csrows/channels)
[    9.914698] EDAC DEBUG: init_csrows_df: MC node: 0, csrow: 0
[    9.914699] EDAC DEBUG: f17_addr_mask_to_cs_size: CS0 DIMM0 AddrMasks:
[    9.914699] EDAC DEBUG: f17_addr_mask_to_cs_size:   Original AddrMask: 0x7fff9fe
[    9.914700] EDAC DEBUG: f17_addr_mask_to_cs_size:   Deinterleaved AddrMask: 0x1fffffe
[    9.914700] EDAC DEBUG: get_csrow_nr_pages: csrow: 0, channel: 0, DBAM idx: 3
[    9.914701] EDAC DEBUG: get_csrow_nr_pages: nr_pages/channel: 2097152
[    9.914701] EDAC DEBUG: init_csrows_df: MC node: 0, csrow: 1
[    9.914702] EDAC DEBUG: f17_addr_mask_to_cs_size: CS1 DIMM0 AddrMasks:
[    9.914702] EDAC DEBUG: f17_addr_mask_to_cs_size:   Original AddrMask: 0x7fff9fe
[    9.914702] EDAC DEBUG: f17_addr_mask_to_cs_size:   Deinterleaved AddrMask: 0x1fffffe
[    9.914703] EDAC DEBUG: get_csrow_nr_pages: csrow: 1, channel: 0, DBAM idx: 3
[    9.914703] EDAC DEBUG: get_csrow_nr_pages: nr_pages/channel: 2097152
[    9.914704] EDAC DEBUG: init_csrows_df: MC node: 0, csrow: 2
[    9.914704] EDAC DEBUG: f17_addr_mask_to_cs_size: CS2 DIMM1 AddrMasks:
[    9.914704] EDAC DEBUG: f17_addr_mask_to_cs_size:   Original AddrMask: 0x7fff9fe
[    9.914705] EDAC DEBUG: f17_addr_mask_to_cs_size:   Deinterleaved AddrMask: 0x1fffffe
[    9.914705] EDAC DEBUG: get_csrow_nr_pages: csrow: 2, channel: 0, DBAM idx: 3
[    9.914706] EDAC DEBUG: get_csrow_nr_pages: nr_pages/channel: 2097152
[    9.914706] EDAC DEBUG: init_csrows_df: MC node: 0, csrow: 3
[    9.914706] EDAC DEBUG: f17_addr_mask_to_cs_size: CS3 DIMM1 AddrMasks:
[    9.914707] EDAC DEBUG: f17_addr_mask_to_cs_size:   Original AddrMask: 0x7fff9fe
[    9.914707] EDAC DEBUG: f17_addr_mask_to_cs_size:   Deinterleaved AddrMask: 0x1fffffe
[    9.914708] EDAC DEBUG: get_csrow_nr_pages: csrow: 3, channel: 0, DBAM idx: 3
[    9.914708] EDAC DEBUG: get_csrow_nr_pages: nr_pages/channel: 2097152
[    9.914709] EDAC DEBUG: init_csrows_df: MC node: 0, csrow: 0
[    9.914709] EDAC DEBUG: f17_addr_mask_to_cs_size: CS0 DIMM0 AddrMasks:
[    9.914709] EDAC DEBUG: f17_addr_mask_to_cs_size:   Original AddrMask: 0x7fff9fe
[    9.914710] EDAC DEBUG: f17_addr_mask_to_cs_size:   Deinterleaved AddrMask: 0x1fffffe
[    9.914710] EDAC DEBUG: get_csrow_nr_pages: csrow: 0, channel: 1, DBAM idx: 3
[    9.914711] EDAC DEBUG: get_csrow_nr_pages: nr_pages/channel: 2097152
[    9.914711] EDAC DEBUG: init_csrows_df: MC node: 0, csrow: 1
[    9.914711] EDAC DEBUG: f17_addr_mask_to_cs_size: CS1 DIMM0 AddrMasks:
[    9.914712] EDAC DEBUG: f17_addr_mask_to_cs_size:   Original AddrMask: 0x7fff9fe
[    9.914712] EDAC DEBUG: f17_addr_mask_to_cs_size:   Deinterleaved AddrMask: 0x1fffffe
[    9.914713] EDAC DEBUG: get_csrow_nr_pages: csrow: 1, channel: 1, DBAM idx: 3
[    9.914713] EDAC DEBUG: get_csrow_nr_pages: nr_pages/channel: 2097152
[    9.914713] EDAC DEBUG: init_csrows_df: MC node: 0, csrow: 2
[    9.914714] EDAC DEBUG: f17_addr_mask_to_cs_size: CS2 DIMM1 AddrMasks:
[    9.914714] EDAC DEBUG: f17_addr_mask_to_cs_size:   Original AddrMask: 0x7fff9fe
[    9.914715] EDAC DEBUG: f17_addr_mask_to_cs_size:   Deinterleaved AddrMask: 0x1fffffe
[    9.914715] EDAC DEBUG: get_csrow_nr_pages: csrow: 2, channel: 1, DBAM idx: 3
[    9.914715] EDAC DEBUG: get_csrow_nr_pages: nr_pages/channel: 2097152
[    9.914716] EDAC DEBUG: init_csrows_df: MC node: 0, csrow: 3
[    9.914716] EDAC DEBUG: f17_addr_mask_to_cs_size: CS3 DIMM1 AddrMasks:
[    9.914717] EDAC DEBUG: f17_addr_mask_to_cs_size:   Original AddrMask: 0x7fff9fe
[    9.914717] EDAC DEBUG: f17_addr_mask_to_cs_size:   Deinterleaved AddrMask: 0x1fffffe
[    9.914717] EDAC DEBUG: get_csrow_nr_pages: csrow: 3, channel: 1, DBAM idx: 3
[    9.914718] EDAC DEBUG: get_csrow_nr_pages: nr_pages/channel: 2097152
[    9.914718] EDAC DEBUG: edac_mc_add_mc_with_groups:
[    9.914736] EDAC DEBUG: edac_create_sysfs_mci_device: device mc0 created
[    9.914743] EDAC DEBUG: edac_create_dimm_object: device rank0 created at location csrow 0 channel 0
[    9.914750] EDAC DEBUG: edac_create_dimm_object: device rank1 created at location csrow 0 channel 1
[    9.914756] EDAC DEBUG: edac_create_dimm_object: device rank2 created at location csrow 1 channel 0
[    9.914762] EDAC DEBUG: edac_create_dimm_object: device rank3 created at location csrow 1 channel 1
[    9.914768] EDAC DEBUG: edac_create_dimm_object: device rank4 created at location csrow 2 channel 0
[    9.914774] EDAC DEBUG: edac_create_dimm_object: device rank5 created at location csrow 2 channel 1
[    9.914780] EDAC DEBUG: edac_create_dimm_object: device rank6 created at location csrow 3 channel 0
[    9.914786] EDAC DEBUG: edac_create_dimm_object: device rank7 created at location csrow 3 channel 1
[    9.914793] EDAC DEBUG: edac_create_csrow_object: device csrow0 created
[    9.914800] EDAC DEBUG: edac_create_csrow_object: device csrow1 created
[    9.914806] EDAC DEBUG: edac_create_csrow_object: device csrow2 created
[    9.914813] EDAC DEBUG: edac_create_csrow_object: device csrow3 created
[    9.914821] EDAC MC0: Giving out device to module amd64_edac controller F17h_M70h: DEV 0000:00:18.3 (INTERRUPT)
[    9.914822] EDAC DEBUG: edac_pci_alloc_ctl_info:
[    9.914823] EDAC DEBUG: edac_pci_add_device:
[    9.914824] EDAC DEBUG: add_edac_pci_to_global_list:
[    9.914824] EDAC DEBUG: find_edac_pci_by_dev:
[    9.914825] EDAC DEBUG: edac_pci_create_sysfs: idx=0
[    9.914825] EDAC DEBUG: edac_pci_main_kobj_setup:
[    9.914828] EDAC DEBUG: edac_pci_main_kobj_setup: Registered '.../edac/pci' kobject
[    9.914828] EDAC DEBUG: edac_pci_create_instance_kobj:
[    9.914830] EDAC DEBUG: edac_pci_create_instance_kobj: Register instance 'pci0' kobject
[    9.914831] EDAC PCI0: Giving out device to module amd64_edac controller EDAC PCI controller: DEV 0000:00:18.0 (POLLED)
[    9.914832] AMD64 EDAC driver v3.5.0
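
The sizes in the log above can be cross-checked by hand. A rough
sketch, assuming 4 KiB pages and assuming that bit n of the
deinterleaved address mask corresponds to address bit n+8 (so that
(mask >> 2) + 1 is a size in KiB); both helper names are illustrative
only and this is not taken from the driver source:

```shell
#!/bin/sh
# Convert a page count to MiB; the page size defaults to an assumed 4096 bytes.
pages_to_mb() {
    echo $(( $1 * ${2:-4096} / 1048576 ))
}

# Convert a deinterleaved address mask to MiB under the bit-mapping
# assumption stated above: (mask >> 2) + 1 is KiB, and >> 10 gives MiB.
mask_to_mb() {
    echo $(( (($1 >> 2) + 1) >> 10 ))
}

pages_to_mb 2097152     # nr_pages/channel from the log
mask_to_mb 0x1fffffe    # Deinterleaved AddrMask from the log
```

Both calls print 8192, which agrees with the 8192MB per chip select
reported by amd64_edac; eight such ranks add up to the 64GB installed.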

Best Regards,
Jean-Frederic


On Tue, 24 Sep 2019 at 05:26, Borislav Petkov <bp@alien8.de> wrote:
>
> On Mon, Sep 23, 2019 at 07:30:11PM -0400, Jeff God wrote:
> > Has anyone tested the new edac implementation and confirmed that error
> > reporting (CE and UE) are actually working for AMD family 17h model 70h?
>
> Please add linux-edac@vger.kernel.org to CC in the reply of this mail,
> which is the EDAC mailing list. We can continue the discussion there.
>
> > I am asking because I have build and implemented previously my own "basic"
> > support for a ryzen 3900x using the kernel sources before this was pushed
> > by simply adding the different PCI devices IDs and edac appeared to have
> > loaded properly and detected all ECC modules (visible in dmesg)
>
> This all should be unnecessary with 5.4-rc1. Try it when it comes out
> next week as it should have all the bits needed. If not, the above ML is
> for reporting issues.
>
> > However, I did my best to try to generate ECC errors (I am not too familiar
> > with the debugging method to inject and not sure how reliable it is) by
>
> See below. There's an mce-inject module and when loaded, it creates a
> bunch of files in debugfs with which you can inject errors. Help README
> is below.
>
> > overclocking (stock voltages) the memory to a point the system is barely
> > stable and never got any CE or UE error reporting after several hours/days.
> > I used mprime to generate errors with large memory amounts (54GB on 64GB
> > system), and when I turn OFF the ECC in the bios, mprime reports memory
> > errors using its internal check under 10 minutes after starting a torture
> > test (reproduced several times), but when ECC is turned ON, it does not
> > show any error as if everything was stable. This made me conclude that ECC
> > is working and at least some CE errors are most likely happening but are
> > hidden since the edac-util -rfull always reported 0 error.
>
> You'd have to send dmesg with 5.4-rc1 once it is out. Then we can have a
> look.
>
> HTH.
>
> $ cat /sys/kernel/debug/mce-inject/README
> Description of the files and their usages:
>
> Note1: i refers to the bank number below.
> Note2: See respective BKDGs for the exact bit definitions of the files below
> as they mirror the hardware registers.
>
> status:  Set MCi_STATUS: the bits in that MSR control the error type and
>          attributes of the error which caused the MCE.
>
> misc:    Set MCi_MISC: provide auxiliary info about the error. It is mostly
>          used for error thresholding purposes and its validity is indicated by
>          MCi_STATUS[MiscV].
>
> synd:    Set MCi_SYND: provide syndrome info about the error. Only valid on
>          Scalable MCA systems, and its validity is indicated by MCi_STATUS[SyndV].
>
> addr:    Error address value to be written to MCi_ADDR. Log address information
>          associated with the error.
>
> cpu:     The CPU to inject the error on.
>
> bank:    Specify the bank you want to inject the error into: the number of
>          banks in a processor varies and is family/model-specific, therefore, the
>          supplied value is sanity-checked. Setting the bank value also triggers the
>          injection.
>
> flags:   Injection type to be performed. Writing to this file will trigger a
>          real machine check, an APIC interrupt or invoke the error decoder routines
>          for AMD processors.
>
>          Allowed error injection types:
>           - "sw": Software error injection. Decode error to a human-readable
>             format only. Safe to use.
>           - "hw": Hardware error injection. Causes the #MC exception handler to
>             handle the error. Be warned: might cause system panic if MCi_STATUS[PCC]
>             is set. Therefore, consider setting (debugfs_mountpoint)/mce/fake_panic
>             before injecting.
>           - "df": Trigger APIC interrupt for Deferred error. Causes deferred
>             error APIC interrupt handler to handle the error if the feature is
>             is present in hardware.
>           - "th": Trigger APIC interrupt for Threshold errors. Causes threshold
>             APIC interrupt handler to handle the error.
>
> --
> Regards/Gruss,
>     Boris.
>
> https://people.kernel.org/tglx/notes-about-netiquette


* Re: [GIT PULL] EDAC pile for 5.4 -> AMD family 17h, model 70h support
  2019-10-05 16:52   ` [GIT PULL] EDAC pile for 5.4 -> AMD family 17h, model 70h support Jeff God
@ 2019-10-07  7:16     ` Borislav Petkov
  2019-10-07 12:58       ` Jeff God
  0 siblings, 1 reply; 19+ messages in thread
From: Borislav Petkov @ 2019-10-07  7:16 UTC
  To: Jeff God; +Cc: linux-edac

On Sat, Oct 05, 2019 at 12:52:15PM -0400, Jeff God wrote:
> * I tried different addr values that I believe are in valid range, all
> other values (cpu,...) were left to their default (0), I used bank 0
> to trigger the tests and tried different values briefly but always
> with the same results

Try this as root (the first command is cd-ing into the default debugfs
mountpoint - check whether that is the case on your system first).

# cd /sys/kernel/debug/mce-inject/
# echo 10 > /sys/devices/system/machinecheck/machinecheck0/check_interval
# echo 0x9c7d410092080813 > status; echo 0x000000006d3d483b > addr; echo 2 > cpu; echo hw > flags; echo 4 > bank

If you have this in your dmesg:

[    1.420991] RAS: Correctable Errors collector initialized.

then you need to boot with "ras=cec_disable" first.
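
Both conditions can be checked before injecting. A small sketch; the
helper names are invented here, and the only strings it relies on are
the dmesg line and the boot parameter quoted above:

```shell
#!/bin/sh
# Succeeds if the kernel log on stdin shows that the Correctable Errors
# Collector (CEC) was initialized.
cec_initialized() {
    grep -q 'RAS: Correctable Errors collector initialized'
}

# Succeeds if the given kernel command line file contains ras=cec_disable.
# $1: path to the command line (defaults to /proc/cmdline).
cec_disabled_on_cmdline() {
    tr ' ' '\n' < "${1:-/proc/cmdline}" | grep -qx 'ras=cec_disable'
}

# Example (run as root so that dmesg is readable):
# if dmesg | cec_initialized && ! cec_disabled_on_cmdline; then
#     echo 'CEC is active: reboot with ras=cec_disable before injecting'
# fi
```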

> but it does not seem EDAC ever report any ce (or ue) ecc error even
> when they may be happening and corrected in background.

What does that mean?

You are getting ECCs or you simply want to test whether the ECC
reporting works on your machine?

P.S., Please do not top-post and see if you get another mail client -
gmail is linewrapping dmesg and is generally bad for pasting logs and
code in.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


* Re: [GIT PULL] EDAC pile for 5.4 -> AMD family 17h, model 70h support
  2019-10-07  7:16     ` Borislav Petkov
@ 2019-10-07 12:58       ` Jeff God
  2019-10-08 11:50         ` Borislav Petkov
  0 siblings, 1 reply; 19+ messages in thread
From: Jeff God @ 2019-10-07 12:58 UTC
  To: Borislav Petkov; +Cc: linux-edac

On Mon, 7 Oct 2019 at 03:16, Borislav Petkov <bp@alien8.de> wrote:
>
> Try this as root (the first command is cd-ing into the default debugfs
> mountpoint - check whether that is the case on your system first).
>
> # cd /sys/kernel/debug/mce-inject/
> # echo 10 > /sys/devices/system/machinecheck/machinecheck0/check_interval
> # echo 0x9c7d410092080813 > status; echo 0x000000006d3d483b > addr; echo 2 > cpu; echo hw > flags; echo 4 > bank
>
> If you have this in your dmesg:
>
> [    1.420991] RAS: Correctable Errors collector initialized.
>
> then you need to boot with "ras=cec_disable" first.

Yes, based on dmesg I had to add ras=cec_disable in my case, but even
after that the commands above still did not produce any output or
error in dmesg.

> > but it does not seem EDAC ever report any ce (or ue) ecc error even
> > when they may be happening and corrected in background.
>
> What does that mean?
>
> You are getting ECCs or you simply want to test whether the ECC
> reporting works on your machine?
>

I want to test that ECC reporting is working on my machine (so that
when real errors happen one day I get notified).

The method I described previously, overclocking the memory, was my
initial way of generating real errors. It worked well on another
system with a previous-generation AMD Ryzen 2700x, a similar
motherboard and the same memory. On this system it does not report
any error, although turning off ECC in the BIOS showed that memory
corruption happens fairly quickly. Hence my conclusion that error
reporting is probably not working while the underlying memory error
correction may be.

Jean-Frederic


* Re: [GIT PULL] EDAC pile for 5.4 -> AMD family 17h, model 70h support
  2019-10-07 12:58       ` Jeff God
@ 2019-10-08 11:50         ` Borislav Petkov
  2019-10-08 19:42           ` Ghannam, Yazen
  0 siblings, 1 reply; 19+ messages in thread
From: Borislav Petkov @ 2019-10-08 11:50 UTC
  To: Jeff God, Yazen Ghannam; +Cc: linux-edac

On Mon, Oct 07, 2019 at 08:58:30AM -0400, Jeff God wrote:
> I want to test that the ECC reporting is working on my machine (so
> that when real errors will happen one day I will get notified)
> 
> The method I described previously to generate errors by overclocking
> memory was my initial method to generate real errors, which proved to
> work well on another system with a previous generation AMD Ryzen 2700x
> and similar motherboard and same memory, but on this system it does
> not report any error, although turning off ECC in the bios showed that
> memory corruption is happening fairly quickly in this case, hence the
> conclusion that error reporting was probably not working but the
> underlying memory error correction system may be working.

Yeah, if I inject an "sw" type here, I get immediately:

[  264.740840] [Hardware Error]: Corrected error, no action required.
[  264.740942] [Hardware Error]: CPU:2 (17:1:2) MC4_STATUS[-|CE|MiscV|AddrV|-|SyndV|CECC|-|-|Scrub]: 0x9c7d410092080813
[  264.741074] [Hardware Error]: Error Addr: 0x000000006d3d483b
[  264.741169] [Hardware Error]: IPID: 0x0000000000000000, Syndrome: 0x0000000000000000
[  264.741279] [Hardware Error]: Bank 4 is reserved.
[  264.741368] [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: SRC (no timeout)

but doing a hw injection seems to do all that it should do:

[  245.658175] mce: do_inject: CPIU2, toggling...
[  245.658375] mce: prepare_msrs
[  245.658507] mce: trigger_mce: CPU2

but nothing happens.

Yazen, are we missing something here?

See upthread for details - thread is on linux-edac@.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


* Re: [GIT PULL] EDAC pile for 5.4 -> AMD family 17h, model 70h support
  2019-10-08 11:50         ` Borislav Petkov
@ 2019-10-08 19:42           ` Ghannam, Yazen
  2019-10-08 23:08             ` Jeff God
  0 siblings, 1 reply; 19+ messages in thread
From: Ghannam, Yazen @ 2019-10-08 19:42 UTC
  To: Borislav Petkov, Jeff God; +Cc: linux-edac

On 10/8/2019 7:50 AM, Borislav Petkov wrote:
> On Mon, Oct 07, 2019 at 08:58:30AM -0400, Jeff God wrote:
>> I want to test that the ECC reporting is working on my machine (so
>> that when real errors will happen one day I will get notified)
>>
>> The method I described previously to generate errors by overclocking
>> memory was my initial method to generate real errors, which proved to
>> work well on another system with a previous generation AMD Ryzen 2700x
>> and similar motherboard and same memory, but on this system it does
>> not report any error, although turning off ECC in the bios showed that
>> memory corruption is happening fairly quickly in this case, hence the
>> conclusion that error reporting was probably not working but the
>> underlying memory error correction system may be working.
> 
> Yeah, if I inject an "sw" type here, I get immediately:
> 
> [  264.740840] [Hardware Error]: Corrected error, no action required.
> [  264.740942] [Hardware Error]: CPU:2 (17:1:2) MC4_STATUS[-|CE|MiscV|AddrV|-|SyndV|CECC|-|-|Scrub]: 0x9c7d410092080813
> [  264.741074] [Hardware Error]: Error Addr: 0x000000006d3d483b
> [  264.741169] [Hardware Error]: IPID: 0x0000000000000000, Syndrome: 0x0000000000000000
> [  264.741279] [Hardware Error]: Bank 4 is reserved.
> [  264.741368] [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: SRC (no timeout)
> 
> but doing a hw injection seems to do all that it should do:
> 
> [  245.658175] mce: do_inject: CPIU2, toggling...
> [  245.658375] mce: prepare_msrs
> [  245.658507] mce: trigger_mce: CPU2
> 
> but nothing happens.
> 
> Yazen, are we missing something here?
> 
> See upthread for details - thread is on linux-edac@.
> 

Hi guys,
The "hw" option requires a non-zero, valid MCA_STATUS to be used so that the
MCA handlers will find the error in the hardware and report it.

Jean-Frederic,
You originally had status=0 which explains why nothing was reported.

Boris,
You used non-zero values, but you targeted bank 4. This bank is
Read-as-Zero/Writes-Ignored on Family 17h and later. So even though you used
good values, the MCA handlers won't find anything because bank 4 is RAZ.
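
Why status=0 yields nothing can be seen from the register itself: in
the standard MCA layout, bit 63 of MCi_STATUS is the Val (valid) bit
and bit 61 is UC, and a handler looking at a bank has nothing to
report when Val is clear. The sketch below tests single bits of a
status value. The helper name is made up, and the 64-bit value is
split into two 32-bit halves only to stay within what a plain POSIX
shell can compute:

```shell
#!/bin/sh
# Print bit $2 (0..63) of the 64-bit hex value $1 (with or without 0x).
status_bit() {
    hex=${1#0x}
    hex=$(printf '%16s' "$hex" | tr ' ' '0')      # left-pad to 16 hex digits
    if [ "$2" -ge 32 ]; then
        half=$(printf '%s' "$hex" | cut -c1-8)    # upper 32 bits
        shift_by=$(( $2 - 32 ))
    else
        half=$(printf '%s' "$hex" | cut -c9-16)   # lower 32 bits
        shift_by=$2
    fi
    echo $(( (0x$half >> shift_by) & 1 ))
}

status_bit 0x0000000000000000 63   # Val clear: nothing to report
status_bit 0x9c2041000000011b 63   # Val set
status_bit 0x9c2041000000011b 61   # UC clear, i.e. a corrected error
```

The three calls print 0, 1 and 0.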


Here are some values I took from a real corrected DRAM ECC error.

status=0x9c2041000000011b
synd=0x7c7600010a800100

The memory controller banks are 17 (channel 0) and 18 (channel 1) on Family
17h Model 7Xh, and these are managed by CPU 0.

Please give these values a try and let me know how it goes.
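
Put together with the README quoted earlier in the thread, an
injection with these values could look like the sketch below. It
assumes debugfs is mounted at /sys/kernel/debug and that mce-inject is
loaded; the function name and argument order are invented for
illustration. The bank file is written last because, per the README,
setting the bank triggers the injection.

```shell
#!/bin/sh
# Write one set of injection parameters to the mce-inject debugfs files.
# $1: bank, $2: status, $3: synd, $4: addr (optional, may be empty),
# $5: directory (the default below is an assumption). CPU is fixed to 0.
inject_hw() {
    dir=${5:-/sys/kernel/debug/mce-inject}
    echo "$2" > "$dir/status" || return 1
    echo "$3" > "$dir/synd"   || return 1
    if [ -n "$4" ]; then
        echo "$4" > "$dir/addr" || return 1
    fi
    echo 0    > "$dir/cpu"    || return 1
    echo hw   > "$dir/flags"  || return 1
    echo "$1" > "$dir/bank"   # written last: this triggers the injection
}

# Example, as root, for both memory controller banks:
# for bank in 17 18; do
#     inject_hw "$bank" 0x9c2041000000011b 0x7c7600010a800100
# done
```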

Thanks,
Yazen



* Re: [GIT PULL] EDAC pile for 5.4 -> AMD family 17h, model 70h support
  2019-10-08 19:42           ` Ghannam, Yazen
@ 2019-10-08 23:08             ` Jeff God
  2019-10-09 10:30               ` Borislav Petkov
  0 siblings, 1 reply; 19+ messages in thread
From: Jeff God @ 2019-10-08 23:08 UTC
  To: Ghannam, Yazen; +Cc: Borislav Petkov, linux-edac

On Tue, 8 Oct 2019 at 15:42, Ghannam, Yazen <Yazen.Ghannam@amd.com> wrote:
>
> The "hw" option requires a non-zero, valid MCA_STATUS to be used so that the
> MCA handlers will find the error in the hardware and report it.
>
> Jean-Frederic,
> You originally had status=0 which explains why nothing was reported.
>
> Boris,
> You used non-zero values, but you targetted bank 4. This bank is
> Read-as-Zero/Writes-Ignored on Family 17h and later. So even though you used
> good values, the MCA handlers won't find anything because bank 4 is RAZ.
>
>
> Here are some values I took from a real corrected DRAM ECC error.
>
> status=0x9c2041000000011b
> synd=0x7c7600010a800100
>
> The memory controller banks are 17 (channel 0) and 18 (channel 1) on Family
> 17h Model 7Xh, and these are managed by CPU 0.
>
Thanks a lot for the explanation.

I also wanted to apologise for the line wrapping in my emails; I
haven't found a viable alternative email client yet.

I wasn't too sure what the status value means, but I tried the
non-zero values mentioned above. Here is the list of tests I did:

echo 0x9c7d410092080813 > status; echo 0x000000006d3d483b > addr; echo 0 > cpu; echo hw > flags; echo 17 > bank
echo 0x9c7d410092080813 > status; echo 0x000000006d3d483b > addr; echo 0 > cpu; echo hw > flags; echo 18 > bank
echo 0x9c2041000000011b > status; echo 0x000000006d3d483b > addr; echo 0 > cpu; echo hw > flags; echo 17 > bank
echo 0x9c2041000000011b > status; echo 0x000000006d3d483b > addr; echo 0 > cpu; echo hw > flags; echo 18 > bank

During all these tests I was checking dmesg as well as all the files
in /sys/kernel/debug/mce-inject.
I did not see anything in dmesg, and all the files still read 0
(except flags, which was hw).

If I change hw to sw, it outputs something similar to what I reported before:
[  969.570997] mce: [Hardware Error]: Machine check events logged
[  969.570998] [Hardware Error]: Corrected error, no action required.
[  969.571002] [Hardware Error]: CPU:0 (17:71:0) MC18_STATUS[-|CE|MiscV|AddrV|-|SyndV|CECC|-|-|Scrub]: 0x9c2041000000011b
[  969.571005] [Hardware Error]: Error Addr: 0x000000006d3d483b
[  969.571006] [Hardware Error]: IPID: 0x0000000000000000, Syndrome: 0x0000000000000000
[  969.571008] [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
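
The flag string in the MC18_STATUS line can be reproduced by hand from the
status value. The sketch below is an illustration only: the function name
mca_status_flags is made up, and the bit positions (Val 63, Over 62, UC 61,
En 60, MiscV 59, AddrV 58, PCC 57, SyndV 53, CECC 46, UECC 45, Deferred 44,
Poison 43, Scrub 40) are assumptions taken from how the Linux AMD MCE
decoder appears to interpret MCA_STATUS. The value is split into two 32-bit
halves so the result does not depend on how the shell handles numbers above
2^63.

```shell
#!/bin/sh
# Print the names of the MCA_STATUS flag bits that are set in a 64-bit hex value.
# Bit positions are assumptions, not taken from this thread.
mca_status_flags() {
    hex=${1#0x}
    # Left-pad to 16 hex digits, then split into high and low 32-bit halves.
    hex=$(printf '%16s' "$hex" | tr ' ' '0')
    hi_hex=${hex%????????}
    lo_hex=${hex#????????}
    hi=$((0x$hi_hex))
    lo=$((0x$lo_hex))
    out=""
    for entry in 63:Val 62:Over 61:UC 60:En 59:MiscV 58:AddrV 57:PCC \
                 53:SyndV 46:CECC 45:UECC 44:Deferred 43:Poison 40:Scrub; do
        bit=${entry%%:*}
        name=${entry#*:}
        if [ "$bit" -ge 32 ]; then
            is_set=$(( (hi >> (bit - 32)) & 1 ))
        else
            is_set=$(( (lo >> bit) & 1 ))
        fi
        if [ "$is_set" -eq 1 ]; then
            out="$out${out:+ }$name"
        fi
    done
    echo "$out"
}

mca_status_flags 0x9c2041000000011b
# prints: Val En MiscV AddrV SyndV CECC Scrub
```

With UC clear, the error is a corrected one, which matches the CE shown in
the log line.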

Also, sometimes with sw it seems to report more messages:
[  840.033409] mce: [Hardware Error]: Machine check events logged
[  840.033411] [Hardware Error]: Corrected error, no action required.
[  840.033414] [Hardware Error]: CPU:0 (17:71:0) MC17_STATUS[-|CE|MiscV|AddrV|-|SyndV|CECC|-|-|Scrub]: 0x9c2041000000011b
[  840.033417] [Hardware Error]: Error Addr: 0x000000006d3d483b
[  840.033418] [Hardware Error]: IPID: 0x0000000000000000, Syndrome: 0x0000000000000000
[  840.033420] [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
[  840.033435] EDAC MC0: 1 CE Unknown syndrome - possible error reporting race on mc#0csrow#0channel#0 (csrow:0 channel:0 page:0x1d4f52 offset:0x3b grain:1 syndrome:0x0)
[  840.033436] [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD

And even this:
[  609.681714] mce: [Hardware Error]: Machine check events logged
[  609.681716] [Hardware Error]: Corrected error, no action required.
[  609.681720] [Hardware Error]: CPU:0 (17:71:0) MC17_STATUS[-|CE|MiscV|AddrV|-|SyndV|CECC|-|-|Scrub]: 0x9c2041000000011b
[  609.681723] [Hardware Error]: Error Addr: 0x000000006d3d483b
[  609.681724] [Hardware Error]: IPID: 0x0000000000000000, Syndrome: 0x0000000000000000
[  609.681726] [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
[  609.681743] ------------[ cut here ]------------
[  609.681748] WARNING: CPU: 4 PID: 2447 at drivers/edac/edac_mc.c:1238 edac_mc_handle_error+0x5a6/0x6d0
[  609.681748] Modules linked in: mce_inject amd64_edac_mod kvm_amd
kvm irqbypass snd_hda_codec_hdmi nls_iso8859_1 joydev crct10dif_pclmul
input_leds crc32_pclmul ghash_clmulni_intel snd_hda_intel
snd_intel_nhlt snd_usb_audio snd_hda_codec uvcvideo snd_hda_core
snd_usbmidi_lib snd_hwdep videobuf2_vmalloc videobuf2_memops
videobuf2_v4l2 snd_pcm videobuf2_common videodev mc snd_seq_midi
snd_seq_midi_event aesni_intel snd_rawmidi crypto_simd cryptd
glue_helper eeepc_wmi asus_wmi snd_seq sparse_keymap mxm_wmi video
wmi_bmof snd_seq_device snd_timer snd k10temp ccp soundcore mac_hid
sch_fq_codel parport_pc ppdev lp parport ip_tables x_tables autofs4
btrfs xor zstd_compress raid6_pq libcrc32c hid_logitech_hidpp
hid_logitech_dj uas usb_storage hid_generic usbhid hid nvme i2c_piix4
igb ahci libahci dca nvme_core i2c_algo_bit wmi
[  609.681779] CPU: 4 PID: 2447 Comm: kworker/4:2 Not tainted 5.4.0-rc1 #1
[  609.681780] Hardware name: System manufacturer System Product
Name/PRIME X570-PRO, BIOS 1201 09/09/2019
[  609.681783] Workqueue: events mce_gen_pool_process
[  609.681785] RIP: 0010:edac_mc_handle_error+0x5a6/0x6d0
[  609.681787] Code: 94 be 72 79 00 00 49 89 84 24 68 05 00 00 48 8b
45 98 85 c9 c7 40 08 6d 65 6d 6f 66 89 70 0c c6 40 0e 00 75 a8 e9 6e
fd ff ff <0f> 0b 31 c0 49 c7 84 24 90 06 00 00 01 00 00 00 e9 4b fe ff
ff 49
[  609.681788] RSP: 0018:ffffb200428e3c68 EFLAGS: 00010246
[  609.681789] RAX: 0000000000000000 RBX: ffffffff89be5882 RCX: 0000000000000001
[  609.681789] RDX: 0000000000000000 RSI: ffffffff89be5888 RDI: 0000000000000000
[  609.681790] RBP: ffffb200428e3ce8 R08: 0000000000000000 R09: ffff98a116959c6f
[  609.681791] R10: 00000000ffffffff R11: ffff98a096959c79 R12: ffff98a096959800
[  609.681791] R13: 0000000000000003 R14: ffff98a096959c7a R15: 0000000000000000
[  609.681792] FS:  0000000000000000(0000) GS:ffff98a09d900000(0000)
knlGS:0000000000000000
[  609.681793] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  609.681794] CR2: 0000556dc3323730 CR3: 0000000fd3ab6000 CR4: 0000000000340ee0
[  609.681795] Call Trace:
[  609.681799]  ? pci_read_config_dword+0x23/0x40
[  609.681804]  __log_ecc_error+0x62/0x90 [amd64_edac_mod]
[  609.681807]  decode_umc_error+0xdc/0x1a0 [amd64_edac_mod]
[  609.681810]  amd_decode_mce+0xb26/0xba0
[  609.681812]  notifier_call_chain+0x4c/0x70
[  609.681814]  blocking_notifier_call_chain+0x43/0x60
[  609.681816]  mce_gen_pool_process+0x41/0x70
[  609.681818]  process_one_work+0x1fd/0x3f0
[  609.681820]  worker_thread+0x34/0x410
[  609.681821]  kthread+0x121/0x140
[  609.681822]  ? process_one_work+0x3f0/0x3f0
[  609.681823]  ? kthread_park+0x90/0x90
[  609.681826]  ret_from_fork+0x1f/0x40
[  609.681828] ---[ end trace 1dc9b9df24b597d5 ]---
[  609.681830] EDAC MC0: 1 CE Unknown syndrome - possible error reporting race on mc#0csrow#0channel#0 (csrow:0 channel:0 page:0x1d4f52 offset:0x3b grain:1 syndrome:0x0)
[  609.681831] [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD

Again, all of these messages appeared only with sw, never with hw.

Let me know if I should be doing something differently.

Regards,
Jean-Frederic

* Re: [GIT PULL] EDAC pile for 5.4 -> AMD family 17h, model 70h support
  2019-10-08 23:08             ` Jeff God
@ 2019-10-09 10:30               ` Borislav Petkov
  2019-10-09 20:31                 ` Ghannam, Yazen
  0 siblings, 1 reply; 19+ messages in thread
From: Borislav Petkov @ 2019-10-09 10:30 UTC (permalink / raw)
  To: Jeff God, Ghannam, Yazen; +Cc: linux-edac

On Tue, Oct 08, 2019 at 07:08:20PM -0400, Jeff God wrote:
> I also wanted to apologise for the text emails line wrapping, I
> haven't found a viable email client alternative...

https://www.kernel.org/doc/html/latest/process/email-clients.html

> I did not see anything in dmesg, and all status files remained 0
> (except flag which was hw)

Nothing here either but my machine is

vendor_id       : AuthenticAMD
cpu family      : 23
model           : 1
model name      : AMD EPYC 7251 8-Core Processor
stepping        : 2

so I'm guessing it needs something else for injection to work on those
models...

> The memory controller banks are 17 (channel 0) and 18 (channel 1) on Family
> 17h Model 7Xh, and these are managed by CPU 0.

Btw, Yazen, we probably need to have an easy way to find out
how the bank mapping is now on SMCA machine when wanting to do
injection. I know we talked about having some of that info in
/sys/devices/system/machinecheck/machinecheckX...

> And even this:
> [  609.681714] mce: [Hardware Error]: Machine check events logged
> [  609.681716] [Hardware Error]: Corrected error, no action required.
> [  609.681720] [Hardware Error]: CPU:0 (17:71:0)
> MC17_STATUS[-|CE|MiscV|AddrV|-|SyndV|CECC|-|-|Scrub]:
> 0x9c2041000000011b
> [  609.681723] [Hardware Error]: Error Addr: 0x000000006d3d483b
> [  609.681724] [Hardware Error]: IPID: 0x0000000000000000, Syndrome:
> 0x0000000000000000
> [  609.681726] [Hardware Error]: Unified Memory Controller Ext. Error
> Code: 0, DRAM ECC error.
> [  609.681743] ------------[ cut here ]------------
> [  609.681748] WARNING: CPU: 4 PID: 2447 at
> drivers/edac/edac_mc.c:1238 edac_mc_handle_error+0x5a6/0x6d0

You can ignore that for now. That's a sanity-check for a driver supplying a 0
for grain.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

* Re: [GIT PULL] EDAC pile for 5.4 -> AMD family 17h, model 70h support
  2019-10-09 10:30               ` Borislav Petkov
@ 2019-10-09 20:31                 ` Ghannam, Yazen
  2019-10-09 23:54                   ` Jeff God
  2019-10-10  9:54                   ` Borislav Petkov
  0 siblings, 2 replies; 19+ messages in thread
From: Ghannam, Yazen @ 2019-10-09 20:31 UTC (permalink / raw)
  To: Borislav Petkov, Jeff God; +Cc: linux-edac

On 10/9/2019 6:30 AM, Borislav Petkov wrote:
> On Tue, Oct 08, 2019 at 07:08:20PM -0400, Jeff God wrote:
>> I also wanted to apologise for the text emails line wrapping, I
>> haven't found a viable email client alternative...
> 
> https://www.kernel.org/doc/html/latest/process/email-clients.html
> 
>> I did not see anything in dmesg, and all status files remained 0
>> (except flag which was hw)
> 
> Nothing here either but my machine is
> 
> vendor_id       : AuthenticAMD
> cpu family      : 23
> model           : 1
> model name      : AMD EPYC 7251 8-Core Processor
> stepping        : 2
> 
> so I'm guessing it needs something else for injection to work on those
> models...
> 

Ah yes, sorry I forgot to mention that you will need to disable Platform First
Error Handling. This can be done in the BIOS. It's usually under something
like:

AMD CBS -> "Core" Common Options -> Platform First Error Handling

This feature will prevent writes to the MCA registers.

Please let me know if this works or not for you. I'll need to do some more
debug if it doesn't work.

>> The memory controller banks are 17 (channel 0) and 18 (channel 1) on Family
>> 17h Model 7Xh, and these are managed by CPU 0.
> 
> Btw, Yazen, we probably need to have an easy way to find out
> how the bank mapping is now on SMCA machine when wanting to do
> injection. I know we talked about having some of that info in
> /sys/devices/system/machinecheck/machinecheckX...
> 

Yep, I agree. I have some ideas, and I'll send them as RFC patches.

>> And even this:
>> [  609.681714] mce: [Hardware Error]: Machine check events logged
>> [  609.681716] [Hardware Error]: Corrected error, no action required.
>> [  609.681720] [Hardware Error]: CPU:0 (17:71:0)
>> MC17_STATUS[-|CE|MiscV|AddrV|-|SyndV|CECC|-|-|Scrub]:
>> 0x9c2041000000011b
>> [  609.681723] [Hardware Error]: Error Addr: 0x000000006d3d483b
>> [  609.681724] [Hardware Error]: IPID: 0x0000000000000000, Syndrome:
>> 0x0000000000000000
>> [  609.681726] [Hardware Error]: Unified Memory Controller Ext. Error
>> Code: 0, DRAM ECC error.
>> [  609.681743] ------------[ cut here ]------------
>> [  609.681748] WARNING: CPU: 4 PID: 2447 at
>> drivers/edac/edac_mc.c:1238 edac_mc_handle_error+0x5a6/0x6d0
> 
> You can ignore that for now. That's a sanity-check for a driver supplying a 0
> for grain.
> 

I've seen this too, and I'm looking into it. I'm doing some research to find
the correct (or at least sane) value for current and legacy systems.

Thanks,
Yazen

* Re: [GIT PULL] EDAC pile for 5.4 -> AMD family 17h, model 70h support
  2019-10-09 20:31                 ` Ghannam, Yazen
@ 2019-10-09 23:54                   ` Jeff God
  2019-10-10  9:56                     ` Borislav Petkov
  2019-10-10  9:54                   ` Borislav Petkov
  1 sibling, 1 reply; 19+ messages in thread
From: Jeff God @ 2019-10-09 23:54 UTC (permalink / raw)
  To: Ghannam, Yazen; +Cc: Borislav Petkov, linux-edac

On Wed, 9 Oct 2019 at 16:31, Ghannam, Yazen <Yazen.Ghannam@amd.com> wrote:
>
> Ah yes, sorry I forgot to mention that you will need to disable Platform First
> Error Handling. This can be done in the BIOS. It's usually under something
> like:
>
> AMD CBS -> "Core" Common Options -> Platform First Error Handling
>
> This feature will prevent writes to the MCA registers.
>
> Please let me know if this works or not for you. I'll need to do some more
> debug if it doesn't work.
>
On my side I don't have that setting in my bios under AMD CBS.
Would this setting also prevent error reporting at the OS level or is
it just related to the injection?
The only thing I could find in my bios about ecc is Auto (default),
Enable, Disable

Jean-Frederic

* Re: [GIT PULL] EDAC pile for 5.4 -> AMD family 17h, model 70h support
  2019-10-09 20:31                 ` Ghannam, Yazen
  2019-10-09 23:54                   ` Jeff God
@ 2019-10-10  9:54                   ` Borislav Petkov
  1 sibling, 0 replies; 19+ messages in thread
From: Borislav Petkov @ 2019-10-10  9:54 UTC (permalink / raw)
  To: Ghannam, Yazen; +Cc: Jeff God, linux-edac

On Wed, Oct 09, 2019 at 08:31:26PM +0000, Ghannam, Yazen wrote:
> Please let me know if this works or not for you. I'll need to do some more
> debug if it doesn't work.

Yah, that did it, thx:

[  166.317498] mce: Machine check injector initialized
[  171.734222] mce: do_inject: CPU0, toggling...
[  175.808430] mce: [Hardware Error]: Machine check events logged
[  175.808612] [Hardware Error]: Corrected error, no action required.
[  175.808708] [Hardware Error]: CPU:0 (17:1:2) MC18_STATUS[-|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0x9c2040000000011b
[  175.808831] [Hardware Error]: Error Addr: 0x0000000000000000
[  175.808920] [Hardware Error]: IPID: 0x000000ff03830400, Syndrome: 0x0000000000000000
[  175.809023] [Hardware Error]: Platform Security Processor Ext. Error Code: 0, An ECC or parity error in a PSP RAM instance.
[  175.809143] [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
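
The IPID value in this log is what identifies the bank type to the decoder.
A minimal sketch of pulling the two identifying fields out of it follows,
assuming the hardware ID sits in bits 43:32 and the MCA type in bits 63:48
of MCA_IPID (that layout is an assumption based on how the Linux decoder
appears to split the register, and the function name mca_ipid_fields is
made up for this example).

```shell
#!/bin/sh
# Extract the assumed hardware ID (bits 43:32) and MCA type (bits 63:48)
# from a 64-bit MCA_IPID value given in hex.
mca_ipid_fields() {
    hex=${1#0x}
    # Left-pad to 16 hex digits; both fields live in the upper 32 bits.
    hex=$(printf '%16s' "$hex" | tr ' ' '0')
    hi_hex=${hex%????????}
    hi=$((0x$hi_hex))
    hwid=$(( hi & 0xfff ))
    mcatype=$(( (hi >> 16) & 0xffff ))
    printf 'hwid=0x%x mcatype=0x%x\n' "$hwid" "$mcatype"
}

mca_ipid_fields 0x000000ff03830400
# prints: hwid=0xff mcatype=0x0
```

Under that assumption, a hardware ID of 0xff with type 0 would be
consistent with bank 18 being decoded as a Platform Security Processor
bank on this model 1h machine, rather than as a memory controller bank.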

> I've seen this too, and I'm looking into it. I'm doing some research to find
> the correct (or at least sane) value for current and legacy systems.

/**
 * struct edac_raw_error_desc - Raw error report structure
 * @grain:                      minimum granularity for an error report, in bytes

I'm guessing 1 on AMD as the error address reported is on a byte
granularity. Or?

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

* Re: [GIT PULL] EDAC pile for 5.4 -> AMD family 17h, model 70h support
  2019-10-09 23:54                   ` Jeff God
@ 2019-10-10  9:56                     ` Borislav Petkov
  2019-10-10 12:48                       ` Jean-Frederic
  0 siblings, 1 reply; 19+ messages in thread
From: Borislav Petkov @ 2019-10-10  9:56 UTC (permalink / raw)
  To: Jeff God; +Cc: Ghannam, Yazen, linux-edac

On Wed, Oct 09, 2019 at 07:54:45PM -0400, Jeff God wrote:
> On my side I don't have that setting in my bios under AMD CBS.

Check all the BIOS menus.

> Would this setting also prevent error reporting at the OS level or is
> it just related to the injection?

Platform first error handling means the BIOS gets to see the error
first. So it depends. Yazen, do you have the whole PFEH functionality
documented somewhere?

> The only thing I could find in my bios about ecc is Auto (default),

It could be that your BIOS doesn't even have a switch to turn it off.

Yazen, do we have a way to figure out whether PFEH is enabled on the
platform, from the kernel?

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

* Re: [GIT PULL] EDAC pile for 5.4 -> AMD family 17h, model 70h support
  2019-10-10  9:56                     ` Borislav Petkov
@ 2019-10-10 12:48                       ` Jean-Frederic
  2019-10-10 13:41                         ` Borislav Petkov
  0 siblings, 1 reply; 19+ messages in thread
From: Jean-Frederic @ 2019-10-10 12:48 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Ghannam, Yazen, linux-edac

On 2019-10-10 5:56 a.m., Borislav Petkov wrote:
> Check all the BIOS menus. 
I rechecked all the menus in advanced mode several times. I used the BIOS
fairly often when I got this new system, so I think I would have seen
the option before.

> Platform first error handling means the BIOS gets to see the error
> first.
Thanks for the explanation.

-- 
Jean-Frédéric


* Re: [GIT PULL] EDAC pile for 5.4 -> AMD family 17h, model 70h support
  2019-10-10 12:48                       ` Jean-Frederic
@ 2019-10-10 13:41                         ` Borislav Petkov
  2019-10-10 19:00                           ` Ghannam, Yazen
  0 siblings, 1 reply; 19+ messages in thread
From: Borislav Petkov @ 2019-10-10 13:41 UTC (permalink / raw)
  To: Jean-Frederic; +Cc: Ghannam, Yazen, linux-edac

On Thu, Oct 10, 2019 at 08:48:20AM -0400, Jean-Frederic wrote:
> I did recheck all menus in advanced mode several times. I used my bios
> fairly often when I got this new system, I would also have seen it
> before I would think.

I have the faint suspicion that our perfectly capable BIOS writers
forgot to add a disable functionality. Let's see what Yazen finds out
first, though.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

* Re: [GIT PULL] EDAC pile for 5.4 -> AMD family 17h, model 70h support
  2019-10-10 13:41                         ` Borislav Petkov
@ 2019-10-10 19:00                           ` Ghannam, Yazen
  2019-10-11  1:04                             ` Jean-Frederic
  0 siblings, 1 reply; 19+ messages in thread
From: Ghannam, Yazen @ 2019-10-10 19:00 UTC (permalink / raw)
  To: Borislav Petkov, Jean-Frederic; +Cc: linux-edac

On 10/10/2019 9:41 AM, Borislav Petkov wrote:
> On Thu, Oct 10, 2019 at 08:48:20AM -0400, Jean-Frederic wrote:
>> I did recheck all menus in advanced mode several times. I used my bios
>> fairly often when I got this new system, I would also have seen it
>> before I would think.
> 
> I have the faint suspicion that our perfectly capable BIOS writers
> forgot to add a disable functionality. Let's see what Yazen finds out
> first, though.
> 

I believe PFEH is generally geared towards enterprise users which is why I
remembered it once you mentioned your system is EPYC. I don't really know if
it's being used for desktop/client systems. Of course, it's up to the vendor
which features they choose to implement. I haven't seen it in the client
documentation though.

There's no explicit way to check if PFEH is enabled from the kernel. The
feature is meant to be transparent to the OS.

However, MCA_MISC0 will be Read-as-Zero/Writes-Ignored for all MCA banks when
PFEH is enabled. So you can use this as an implicit check. This is just an
implementation detail though for current systems. It's not an architectural
requirement.

Jean-Frederic,
Please do the following if you'd like to try this check:
1) rdmsr 0xC0002003

This command will read the MCA_MISC0 register from MCA bank 0. If it is
non-zero, then we'll know that PFEH is not enabled.

The "rdmsr" command is usually found in the msr-tools package in many distros.
You will need to run it as root, and you may need to load the "msr" module
before using the command.
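
Since MCA_MISC0 is expected to read as zero on every bank when PFEH is
enabled, the check can be repeated for other banks. The sketch below is
illustrative only: the function names are made up, and the address formula
assumes the SMCA MSR layout of 0x10 registers per bank starting at
0xC0002000 with MISC0 at offset 3, which agrees with 0xC0002003 for bank 0.
As noted above, the zero-means-PFEH rule is an implementation detail of
current systems, so the result is only a hint.

```shell
#!/bin/sh
# MCA_MISC0 MSR address for a given SMCA bank number.
# Assumption: 0x10 MSRs per bank from 0xC0002000, MISC0 at offset 3.
smca_misc0_msr() {
    printf '0x%x\n' $(( 0xC0002000 + 0x10 * $1 + 3 ))
}

# Interpret one rdmsr result (hex digits, with or without a 0x prefix).
pfeh_hint() {
    val=${1#0x}
    case $val in
        ''|*[!0-9a-fA-F]*)
            echo "unparsable"
            return 1
            ;;
        *[!0]*)
            echo "MISC0 non-zero: PFEH is not enabled"
            ;;
        *)
            echo "MISC0 reads zero: PFEH may be enabled"
            ;;
    esac
}

# Example (as root, with the msr module loaded):
#   pfeh_hint "$(rdmsr -p 0 "$(smca_misc0_msr 0)")"
```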

Thanks,
Yazen

* Re: [GIT PULL] EDAC pile for 5.4 -> AMD family 17h, model 70h support
  2019-10-10 19:00                           ` Ghannam, Yazen
@ 2019-10-11  1:04                             ` Jean-Frederic
  2019-10-18 23:08                               ` Jean-Frederic
  0 siblings, 1 reply; 19+ messages in thread
From: Jean-Frederic @ 2019-10-11  1:04 UTC (permalink / raw)
  To: Ghannam, Yazen, Borislav Petkov; +Cc: linux-edac

On 2019-10-10 3:00 p.m., Ghannam, Yazen wrote:
> Jean-Frederic,
> Please do the following if you'd like to try this check:
> 1) rdmsr 0xC0002003
This returns 0 for me, so I guess PFEH is enabled.
That is fine as long as it only affects error injection and does not
prevent the OS from actually reporting memory errors.
I'm still not clear on that part.
> This command will read the MCA_MISC0 register from MCA bank 0. If it is
> non-zero, then we'll know that PFEH is not enabled.
>

-- 
Jean-Frédéric


* Re: [GIT PULL] EDAC pile for 5.4 -> AMD family 17h, model 70h support
  2019-10-11  1:04                             ` Jean-Frederic
@ 2019-10-18 23:08                               ` Jean-Frederic
  2019-10-19  8:25                                 ` Borislav Petkov
  0 siblings, 1 reply; 19+ messages in thread
From: Jean-Frederic @ 2019-10-18 23:08 UTC (permalink / raw)
  To: Ghannam, Yazen, Borislav Petkov; +Cc: linux-edac

On 2019-10-10 9:04 p.m., Jean-Frederic wrote:
> On 2019-10-10 3:00 p.m., Ghannam, Yazen wrote:
>> 1) rdmsr 0xC0002003
> This returns 0 for me, so I guess PFEH is enabled.
> That is fine as long as it only affects error injection and does not
> prevent the OS from actually reporting memory errors.
> I'm still not clear on that part.

On 2019-10-10 5:56 a.m., Borislav Petkov wrote:
> On 2019-10-09 7:54 p.m., Jeff God wrote:
>> Would this setting also prevent error reporting at the OS level or is
>> it just related to the injection?
> Platform first error handling means the BIOS gets to see the error
> first. So it depends. Yazen, do you have the whole PFEH functionality
> documented somewhere?
>

I don't know if there has been any new information on these last
points. I am really trying to understand whether ECC error reporting will
work in the new 5.4 kernel on the AMD Ryzen 3900X (or are we saying that
this issue could be related to the motherboard?).

In any case, I think EDAC needs to be able to tell us (for example at boot
time) whether ECC error reporting is working on the system, because right
now (in 5.4) everything appears to load successfully (according to dmesg)
with all the memory information identified, and the edac-util tool appears
to be working (and returning zeros).
I don't mind if the error injection part does not work; I think it is
more of an enterprise or debugging feature.


Also, since this was working on the previous generation as mentioned before
(i.e. AMD RYZEN 2700X and ASUS PRIME 470 to be more specific), I thought
it would be natural that it works on the newer gen, given the
information/hype provided around launch time. Asus also confirmed to me
through their support that this new motherboard supports ECC. It also has
an ECC option in the BIOS, as I've mentioned, to enable or disable ECC.


If nobody knows the answer to my question, that is fine. I just
wasn't sure whether it had been forgotten.


Thanks,

-- 
Jean-Frédéric


* Re: [GIT PULL] EDAC pile for 5.4 -> AMD family 17h, model 70h support
  2019-10-18 23:08                               ` Jean-Frederic
@ 2019-10-19  8:25                                 ` Borislav Petkov
  2019-10-19 16:12                                   ` Jean-Frederic
  0 siblings, 1 reply; 19+ messages in thread
From: Borislav Petkov @ 2019-10-19  8:25 UTC (permalink / raw)
  To: Jean-Frederic; +Cc: Ghannam, Yazen, linux-edac

On Fri, Oct 18, 2019 at 07:08:32PM -0400, Jean-Frederic wrote:
> I don't know if there has been any new information on these last
> points. I am really trying to understand whether ECC error reporting will
> work in the new 5.4 kernel on the AMD Ryzen 3900X (or are we saying that
> this issue could be related to the motherboard?).

Look here on page 6:

https://www.amd.com/system/files/2017-06/AMD-EPYC-Brings-New-RAS-Capability.pdf

It hints at what PFEH does. Roughly speaking, the firmware gets to see
the errors first and because it knows the platform much better, it
can take much more adequate recovery actions for those errors than the OS.
Sometimes.

 [ I believe if the error cannot be handled by the firmware, it gets
   reported to the OS but I'll let Yazen comment on that. ]

In any case, you have RAS protection on your platform - it is just done
by the firmware and not by EDAC. And that is perfectly fine - EDAC is
used when there's no firmware support.

I know, I know, we don't trust the firmware to do it right and so on,
but it is what it is. Like other stuff we have to rely on the firmware
to do right.

> In any case, I think EDAC needs to be able to tell us (for example at boot
> time) whether ECC error reporting is working on the system, because right
> now (in 5.4) everything appears to load successfully (according to dmesg)
> with all the memory information identified, and the edac-util tool appears
> to be working (and returning zeros).

EDAC loads fine but there are simply no errors to report.

> Also, since this was working on the previous generation as mentioned before

See above.

> (i.e. AMD RYZEN 2700X and ASUS PRIME 470 to be more specific), I thought
> it would be natural that it works on the newer gen, given the
> information/hype provided around launch time. Asus also confirmed to me
> through their support that this new motherboard supports ECC. It also has
> an ECC option in the BIOS, as I've mentioned, to enable or disable ECC.

Again, you have RAS protection if your DIMMs are ECC ones. It is just
not done by the kernel but by the firmware. And that can be a better way
to do it *if* the firmware is doing its job right.

Makes more sense now?

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

* Re: [GIT PULL] EDAC pile for 5.4 -> AMD family 17h, model 70h support
  2019-10-19  8:25                                 ` Borislav Petkov
@ 2019-10-19 16:12                                   ` Jean-Frederic
  2019-10-21 14:24                                     ` Ghannam, Yazen
  0 siblings, 1 reply; 19+ messages in thread
From: Jean-Frederic @ 2019-10-19 16:12 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Ghannam, Yazen, linux-edac

On 2019-10-19 4:25 a.m., Borislav Petkov wrote:
> Look here on page 6:
> https://www.amd.com/system/files/2017-06/AMD-EPYC-Brings-New-RAS-Capability.pdf
>
> It hints at what PFEH does. 
>  [ I believe if the error cannot be handled by the firmware, it gets
>    reported to the OS but I'll let Yazen comment on that. ]

Yes, I found that document too after I sent my email yesterday, and I kind of
had a similar understanding...

> I know, I know, we don't trust the firmware to do it right and so on,
> but it is what it is. Like other stuff we have to rely on the firmware
> to do right.

To be honest, I think we would all like to trust the firmware if it was
clear what it is doing.
However, given the way these consumer products (the motherboards, I mean)
are sold and documented, especially regarding AMD Ryzen and ECC support,
there is almost no information (just a vague statement that it
"supports ecc"...).

The concept of the PFEH and RAS I think is good the more I read about it, but mostly for
enterprise solutions, and it would be good too I guess for a consumer product if we knew
we could rely on it.

As it stands right now, I don't really know if I can trust it. When I did
my own tests of generating real errors, the system was either totally
stable, would not boot, or crashed suddenly. I could see that ECC really
corrects things, because otherwise I would get software self-check errors
in mprime under those conditions fairly quickly (after 1-2 minutes), but
with ECC enabled I can run for hours without any sign of an issue under
the same conditions.

So can I rely on this to know one day that I am starting to have hardware
issues and should replace my memory (or system)? I don't even know how the
firmware will report anything to me. There is nothing in the BIOS that
seems to give any report about ECC.

> Makes more sense now?
>

Yes, it does make more sense now. Thanks, Borislav, for all the information.

On my side, maybe I'll start looking at other motherboards that potentially
do this differently. I'll continue to look in other forums to see what
others have found for other motherboards.


Thanks,

-- 
Jean-Frédéric


* RE: [GIT PULL] EDAC pile for 5.4 -> AMD family 17h, model 70h support
  2019-10-19 16:12                                   ` Jean-Frederic
@ 2019-10-21 14:24                                     ` Ghannam, Yazen
  0 siblings, 0 replies; 19+ messages in thread
From: Ghannam, Yazen @ 2019-10-21 14:24 UTC (permalink / raw)
  To: Jean-Frederic, Borislav Petkov; +Cc: linux-edac

> -----Original Message-----
> From: Jean-Frederic <jfgaudreault@gmail.com>
> Sent: Saturday, October 19, 2019 12:13 PM
> To: Borislav Petkov <bp@alien8.de>
> Cc: Ghannam, Yazen <Yazen.Ghannam@amd.com>; linux-edac@vger.kernel.org
> Subject: Re: [GIT PULL] EDAC pile for 5.4 -> AMD family 17h, model 70h support
> 
> On 2019-10-19 4:25 a.m., Borislav Petkov wrote:
> > Look here on page 6:
> > https://www.amd.com/system/files/2017-06/AMD-EPYC-Brings-New-RAS-Capability.pdf
> >
> > It hints at what PFEH does.
> >  [ I believe if the error cannot be handled by the firmware, it gets
> >    reported to the OS but I'll let Yazen comment on that. ]
> 
> Yes, I found that document too after I sent my email yesterday, and I kind of
> had a similar understanding...
> 

Yes, that's right. And even if the firmware handles the error it may still
report to the OS. That's really a policy decision and it may vary between
vendors.

> > I know, I know, we don't trust the firmware to do it right and so on,
> > but it is what it is. Like other stuff we have to rely on the firmware
> > to do right.
> 
> I think we would all like to trust the firmware if it was clear what it is doing
> to be honest.
> However the way these consumer products are sold and documented (the motherboard I mean),
> especially for AMD RYZEN and ECC support, is just that there is almost no information
> (a vague statement about it "supports ecc"...)
> 
> The concept of the PFEH and RAS I think is good the more I read about it, but mostly for
> enterprise solutions, and it would be good too I guess for a consumer product if we knew
> we could rely on it.
> 
> As it stands right now, I don't really know if I can trust it. When I did my own tests
> of generating real errors it was either the system is totally stable, or would not boot,
> or would crash suddenly. I could see that ecc really corrects things, because otherwise
> I would get software self check errors in mprime under those conditions fairly quickly
> (after 1-2 minutes), but with ecc enabled I can run for hours without any sign of issue
> under the same conditions.
> 
> So can I rely on this to know one day that I am starting to have hardware issues and I
> should replace my memory (or system)? I don't even know how the firmware will report
> anything to me. There is nothing in the bios that seems to give any report about ecc,
> 

Generally, the firmware will report the error up to the OS and the OS will
report to the user. So you should find the error reported through EDAC, etc.

Thanks,
Yazen

end of thread, back to index

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <CAEVokG7TeAbmkhaxiTpsxhv1pQzqRpU=mR8gVjixb5kXo3s2Eg@mail.gmail.com>
     [not found] ` <20190924092644.GC19317@zn.tnic>
2019-10-05 16:52   ` [GIT PULL] EDAC pile for 5.4 -> AMD family 17h, model 70h support Jeff God
2019-10-07  7:16     ` Borislav Petkov
2019-10-07 12:58       ` Jeff God
2019-10-08 11:50         ` Borislav Petkov
2019-10-08 19:42           ` Ghannam, Yazen
2019-10-08 23:08             ` Jeff God
2019-10-09 10:30               ` Borislav Petkov
2019-10-09 20:31                 ` Ghannam, Yazen
2019-10-09 23:54                   ` Jeff God
2019-10-10  9:56                     ` Borislav Petkov
2019-10-10 12:48                       ` Jean-Frederic
2019-10-10 13:41                         ` Borislav Petkov
2019-10-10 19:00                           ` Ghannam, Yazen
2019-10-11  1:04                             ` Jean-Frederic
2019-10-18 23:08                               ` Jean-Frederic
2019-10-19  8:25                                 ` Borislav Petkov
2019-10-19 16:12                                   ` Jean-Frederic
2019-10-21 14:24                                     ` Ghannam, Yazen
2019-10-10  9:54                   ` Borislav Petkov
