linux-edac.vger.kernel.org archive mirror
From: wufan <wufan@codeaurora.org>
To: 'James Morse' <james.morse@arm.com>,
	'Tyler Baicar' <baicar.tyler@gmail.com>
Cc: 'Tyler Baicar' <tbaicar@codeaurora.org>,
	'Linux Kernel Mailing List' <linux-kernel@vger.kernel.org>,
	harba@qti.qualcomm.com, 'Borislav Petkov' <bp@alien8.de>,
	mchehab@kernel.org,
	'arm-mail-list' <linux-arm-kernel@lists.infradead.org>,
	linux-edac@vger.kernel.org
Subject: [RFC] EDAC, ghes: Enable per-layer error reporting for ARM
Date: Fri, 24 Aug 2018 08:30:13 -0600
Message-ID: <000b01d43bb6$f9419b20$ebc4d160$@codeaurora.org>

Hi James, 
 
> Why avoid the layer stuff? Isn't counting DIMM/memory-devices what
> EDAC_MC_LAYER_SLOT is for?

Borislav has explained this in his response; let me elaborate a little. To use the layer information, you need an accurate way to pinpoint each component in a layer and its parent components in the layers above. For example, to use EDAC_MC_LAYER_SLOT you also need information for the parent layer, say EDAC_MC_LAYER_CHANNEL, and possibly another layer on top, say EDAC_MC_LAYER_BRANCH. There is no clear way to get this information from the SMBIOS table. For "memory channel" we looked at Type 37, which has the matching name, but it was introduced to support RamBus and SyncLink. I am not sure we can readily use it for the modern architectural concept of "channel/slot".
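
To make the dependency on parent layers concrete, here is a minimal userspace sketch (not the kernel's EDAC API; `layer_sizes`, `dimm_pos` and `dimm_flat_index` are hypothetical names made up for illustration). It assumes a three-level branch/channel/slot hierarchy and shows that a slot number alone cannot identify a DIMM: the channel and branch coordinates, plus the size of each layer, are all needed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of how many components each layer holds. */
struct layer_sizes {
    unsigned int channels_per_branch;
    unsigned int slots_per_channel;
};

/* Hypothetical position of one DIMM in a branch/channel/slot hierarchy. */
struct dimm_pos {
    unsigned int branch;
    unsigned int channel;
    unsigned int slot;
};

/*
 * Flatten a (branch, channel, slot) position into a single DIMM index.
 * Every parent coordinate and every layer size takes part in the result,
 * which is exactly the information that is hard to obtain from SMBIOS.
 */
unsigned int dimm_flat_index(const struct layer_sizes *sz,
                             const struct dimm_pos *pos)
{
    assert(sz != NULL && pos != NULL);
    assert(pos->channel < sz->channels_per_branch);
    assert(pos->slot < sz->slots_per_channel);

    return (pos->branch * sz->channels_per_branch + pos->channel)
           * sz->slots_per_channel + pos->slot;
}
```

With 2 channels per branch and 3 slots per channel, slot 2 on channel 1 of branch 1 and slot 2 on channel 0 of branch 0 get different indices, even though the slot number is the same.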

I think it is good enough if we can pin each error to the corresponding DIMM. At the end of the day, DIMMs are what a customer can replace in the memory system, and that is all they care about. Board and chip manufacturers know how to map specific DIMMs to the upper-layer components, so they can easily derive error counts for the upper layers.
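
As a rough illustration of per-DIMM accounting, the sketch below keeps one counter record per DIMM and looks it up by SMBIOS handle. It is a standalone userspace example under stated assumptions, not ghes-edac code: it assumes the error record carries a module handle that matches the handle of an SMBIOS Type 17 Memory Device entry, and `dimm_counter`, `find_dimm` and `record_error` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-DIMM record, keyed by the SMBIOS Type 17 handle. */
struct dimm_counter {
    uint16_t smbios_handle;
    unsigned long ce_count;   /* corrected errors */
    unsigned long ue_count;   /* uncorrected errors */
};

/* Return the DIMM whose handle matches, or NULL when none does. */
struct dimm_counter *find_dimm(struct dimm_counter *dimms, size_t n,
                               uint16_t handle)
{
    assert(dimms != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        if (dimms[i].smbios_handle == handle)
            return &dimms[i];
    }
    return NULL;
}

/*
 * Charge one error to the DIMM named by module_handle.
 * Returns 0 on success, -1 when the handle matches no known DIMM.
 */
int record_error(struct dimm_counter *dimms, size_t n,
                 uint16_t module_handle, int corrected)
{
    struct dimm_counter *d = find_dimm(dimms, n, module_handle);

    if (d == NULL)
        return -1;
    if (corrected)
        d->ce_count++;
    else
        d->ue_count++;
    return 0;
}
```

No channel or branch information is involved here; the handle alone is enough to attribute the error to a replaceable part.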

> CPER's "Memory Error Record 2" thinks that "NODE, CARD and MODULE
> should provide the information necessary to identify the failing FRU". As
> EDAC has three 'levels', these are what they should correspond to for ghes-
> edac.
> 
> I assume NODE means rack/chassis in some distributed system. Lets ignore it
> as it doesn't seem to map to anything in the SMBIOS table.

How about type 4 "Processor Information"?

> 'Card' doesn't mean much to me, but it maps to SMBIOS:17 "Memory Array
> Structure", which the Memory Device structure also points to.
> Card then must mean "a collection of memory devices (DIMMs) that operate
> together to form an address space".
> 
> This might be what I think of as a memory-controller, or it might be
> something more complicated. Regardless, the CPER records think its relevant.

Originally I thought "Card" meant memory channel. But the definition of "Card Handle" in CPER says: "... this field contains the SMBIOS handle for the Type 16 Memory Array Structure that represents the memory card". So Card is a memory controller or something similar. Right now ghes-edac assumes one mc. We probably need to map the mc(s) to the Type 16 instances in the SMBIOS table.
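
One possible shape for such a mapping, as a userspace sketch only: assign an mc index to each distinct Type 16 handle the first time it is seen, so that a later card handle from an error record resolves to the same mc. The names `mc_map`, `MAX_MC` and `mc_index_for_array` are hypothetical, and the fixed limit of 8 is an arbitrary assumption for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_MC 8   /* arbitrary limit chosen for this sketch */

/* Hypothetical table: mc index -> SMBIOS Type 16 handle. */
struct mc_map {
    uint16_t array_handle[MAX_MC];
    size_t count;
};

/*
 * Return the mc index for a Type 16 handle, registering the handle on
 * first sight. Returns -1 when the table is already full.
 */
int mc_index_for_array(struct mc_map *map, uint16_t array_handle)
{
    assert(map != NULL);

    for (size_t i = 0; i < map->count; i++) {
        if (map->array_handle[i] == array_handle)
            return (int)i;
    }
    if (map->count == MAX_MC)
        return -1;

    map->array_handle[map->count] = array_handle;
    return (int)map->count++;
}
```

The table could be filled while walking the SMBIOS entries at init time and then consulted with the card handle when an error arrives; whether that fits the existing single-mc assumption in ghes-edac is an open question.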

Thanks,
Fan
