linux-edac.vger.kernel.org archive mirror
From: "Ghannam, Yazen" <Yazen.Ghannam@amd.com>
To: Borislav Petkov <bp@alien8.de>, Jeff God <jfgaudreault@gmail.com>
Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org>
Subject: Re: [GIT PULL] EDAC pile for 5.4 -> AMD family 17h, model 70h support
Date: Tue, 8 Oct 2019 19:42:09 +0000
Message-ID: <678ba7d1-cf3d-4101-1819-29b291cf236d@amd.com>
In-Reply-To: <20191008115041.GD14765@zn.tnic>

On 10/8/2019 7:50 AM, Borislav Petkov wrote:
> On Mon, Oct 07, 2019 at 08:58:30AM -0400, Jeff God wrote:
>> I want to test that the ECC reporting is working on my machine (so
>> that when real errors will happen one day I will get notified)
>>
>> The method I described previously to generate errors by overclocking
>> memory was my initial method to generate real errors, which proved to
>> work well on another system with a previous generation AMD Ryzen 2700x
>> and similar motherboard and same memory, but on this system it does
>> not report any error, although turning off ECC in the bios showed that
>> memory corruption is happening fairly quickly in this case, hence the
>> conclusion that error reporting was probably not working but the
>> underlying memory error correction system may be working.
> 
> Yeah, if I inject an "sw" type here, I get immediately:
> 
> [  264.740840] [Hardware Error]: Corrected error, no action required.
> [  264.740942] [Hardware Error]: CPU:2 (17:1:2) MC4_STATUS[-|CE|MiscV|AddrV|-|SyndV|CECC|-|-|Scrub]: 0x9c7d410092080813
> [  264.741074] [Hardware Error]: Error Addr: 0x000000006d3d483b
> [  264.741169] [Hardware Error]: IPID: 0x0000000000000000, Syndrome: 0x0000000000000000
> [  264.741279] [Hardware Error]: Bank 4 is reserved.
> [  264.741368] [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: SRC (no timeout)
> 
> but doing a hw injection seems to do all that it should do:
> 
> [  245.658175] mce: do_inject: CPIU2, toggling...
> [  245.658375] mce: prepare_msrs
> [  245.658507] mce: trigger_mce: CPU2
> 
> but nothing happens.
> 
> Yazen, are we missing something here?
> 
> See upthread for details - thread is on linux-edac@.
> 

Hi guys,
The "hw" option requires a non-zero, valid MCA_STATUS to be used so that the
MCA handlers will find the error in the hardware and report it.

Jean-Frederic,
You originally had status=0 which explains why nothing was reported.

Boris,
You used non-zero values, but you targeted bank 4. This bank is
Read-as-Zero/Writes-Ignored on Family 17h and later. So even though you used
good values, the MCA handlers won't find anything because bank 4 is RAZ.


Here are some values I took from a real corrected DRAM ECC error.

status=0x9c2041000000011b
synd=0x7c7600010a800100
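
To see why this status value counts as non-zero and valid, it can help to decode its flag bits. The following is a minimal sketch, not kernel code: the bit positions are the ones I understand the Linux MCE definitions to use for AMD systems, and the names `MCA_STATUS_BITS` and `decode_mca_status` are hypothetical helpers made up for illustration.

```python
# Flag bit positions in MCA_STATUS (assumed from the Linux kernel's MCE
# definitions for AMD; verify against your kernel's arch/x86/include/asm/mce.h).
MCA_STATUS_BITS = {
    "Val": 63,       # register holds a valid error
    "Over": 62,      # overflow, an earlier error was lost
    "UC": 61,        # uncorrected error
    "En": 60,        # error reporting enabled
    "MiscV": 59,     # MCA_MISC is valid
    "AddrV": 58,     # MCA_ADDR is valid
    "PCC": 57,       # processor context corrupt
    "SyndV": 53,     # MCA_SYND is valid
    "CECC": 46,      # corrected ECC error
    "UECC": 45,      # uncorrected ECC error
    "Deferred": 44,  # deferred error
    "Poison": 43,    # poisoned data consumed
    "Scrub": 40,     # error found by the scrubber
}


def decode_mca_status(status):
    """Return the names of the flag bits that are set in an MCA_STATUS value."""
    return [name for name, bit in MCA_STATUS_BITS.items() if (status >> bit) & 1]


# The status value quoted above, taken from a real corrected DRAM ECC error.
print(decode_mca_status(0x9C2041000000011B))
# ['Val', 'En', 'MiscV', 'AddrV', 'SyndV', 'CECC', 'Scrub']
```

The decoded flags line up with the `CE|MiscV|AddrV|SyndV|CECC|Scrub` fields in the log quoted earlier, while a status of 0 decodes to no flags at all.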

The memory controller banks are 17 (channel 0) and 18 (channel 1) on Family
17h Model 7Xh, and these are managed by CPU 0.
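
Putting the pieces together, a "hw" injection with these values could be scripted roughly as below. This is only a sketch under several assumptions: that the mce-inject module is loaded, that debugfs is mounted at /sys/kernel/debug, that the injector exposes `flags`, `status`, `synd`, `cpu` and `bank` files, and that writing `bank` is what triggers the injection (hence it goes last). Check the README file in that directory on your kernel. The helpers `build_injection_plan` and `apply_plan` are hypothetical names, and treating "valid" as "Val bit (63) set" is my reading of the requirement above.

```python
import os

# Assumed location of the mce-inject debugfs interface.
MCE_INJECT_DIR = "/sys/kernel/debug/mce-inject"


def build_injection_plan(status, synd, cpu, bank, flags="hw"):
    """Return the ordered (file, value) writes for one injection."""
    if status == 0:
        raise ValueError("status must be non-zero or nothing will be reported")
    if not (status >> 63) & 1:
        # Assumption: a "valid" MCA_STATUS has the Val bit (bit 63) set.
        raise ValueError("status does not have the Val bit (63) set")
    return [
        ("flags", flags),
        ("status", hex(status)),
        ("synd", hex(synd)),
        ("cpu", str(cpu)),
        ("bank", str(bank)),  # written last: assumed to trigger the injection
    ]


def apply_plan(plan, root=MCE_INJECT_DIR):
    """Write each value to its file under root (needs root on a real system)."""
    for name, value in plan:
        with open(os.path.join(root, name), "w") as f:
            f.write(value + "\n")


# Values from this message: bank 17 (channel 0), managed by CPU 0.
plan = build_injection_plan(0x9C2041000000011B, 0x7C7600010A800100, cpu=0, bank=17)
for name, value in plan:
    print(f"echo {value} > {MCE_INJECT_DIR}/{name}")
```

Running the script only prints the equivalent shell commands; call `apply_plan(plan)` as root to actually perform the writes.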

Please give these values a try and let me know how it goes.

Thanks,
Yazen


Thread overview: 21+ messages
     [not found] <CAEVokG7TeAbmkhaxiTpsxhv1pQzqRpU=mR8gVjixb5kXo3s2Eg@mail.gmail.com>
     [not found] ` <20190924092644.GC19317@zn.tnic>
2019-10-05 16:52   ` [GIT PULL] EDAC pile for 5.4 -> AMD family 17h, model 70h support Jeff God
2019-10-07  7:16     ` Borislav Petkov
2019-10-07 12:58       ` Jeff God
2019-10-08 11:50         ` Borislav Petkov
2019-10-08 19:42           ` Ghannam, Yazen [this message]
2019-10-08 23:08             ` Jeff God
2019-10-09 10:30               ` Borislav Petkov
2019-10-09 20:31                 ` Ghannam, Yazen
2019-10-09 23:54                   ` Jeff God
2019-10-10  9:56                     ` Borislav Petkov
2019-10-10 12:48                       ` Jean-Frederic
2019-10-10 13:41                         ` Borislav Petkov
2019-10-10 19:00                           ` Ghannam, Yazen
2019-10-11  1:04                             ` Jean-Frederic
2019-10-18 23:08                               ` Jean-Frederic
2019-10-19  8:25                                 ` Borislav Petkov
2019-10-19 16:12                                   ` Jean-Frederic
2019-10-21 14:24                                     ` Ghannam, Yazen
2020-01-04 20:03                                     ` Jean-Frederic
2020-01-04 21:47                                       ` Jean-Frederic
2019-10-10  9:54                   ` Borislav Petkov