Linux-EDAC Archive on lore.kernel.org
From: Borislav Petkov <bp@alien8.de>
To: Robert Richter <rrichter@marvell.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>,
	Tony Luck <tony.luck@intel.com>,
	James Morse <james.morse@arm.com>,
	Aristeu Rozanski <aris@redhat.com>,
	Matthias Brugger <mbrugger@suse.com>,
	linux-edac@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v4] EDAC/ghes: Setup DIMM label from DMI and use it in error reports
Date: Tue, 2 Jun 2020 17:48:43 +0200
Message-ID: <20200602154843.GD11634@zn.tnic>
In-Reply-To: <20200528101307.23245-1-rrichter@marvell.com>

On Thu, May 28, 2020 at 12:13:06PM +0200, Robert Richter wrote:
> The ghes driver reports errors with 'unknown label' even if the actual
> DIMM label is known, e.g.:
> 
>  EDAC MC0: 1 CE Single-bit ECC on unknown label (node:0 card:0
>    module:0 rank:1 bank:0 col:13 bit_pos:16 DIMM location:N0 DIMM_A0
>    page:0x966a9b3 offset:0x0 grain:1 syndrome:0x0 - APEI location:
>    node:0 card:0 module:0 rank:1 bank:0 col:13 bit_pos:16 DIMM
>    location:N0 DIMM_A0 status(0x0000000000000400): Storage error in
>    DRAM memory)
> 
> Fix this by using struct dimm_info's label string in error reports:
> 
>  EDAC MC0: 1 CE Single-bit ECC on N0 DIMM_A0 (node:0 card:0 module:0
>    rank:1 bank:515 col:14 bit_pos:16 DIMM location:N0 DIMM_A0
>    page:0x99223d8 offset:0x0 grain:1 syndrome:0x0 - APEI location:
>    node:0 card:0 module:0 rank:1 bank:515 col:14 bit_pos:16 DIMM
>    location:N0 DIMM_A0 status(0x0000000000000400): Storage error in
>    DRAM memory)
> 
> The labels are initialized by reading the bank and device strings from
> DMI. Now, the label information can also be read from sysfs. E.g. a
> ThunderX2 system will show the following:
> 
>  /sys/devices/system/edac/mc/mc0/dimm0/dimm_label:N0 DIMM_A0
>  /sys/devices/system/edac/mc/mc0/dimm1/dimm_label:N0 DIMM_B0
>  /sys/devices/system/edac/mc/mc0/dimm2/dimm_label:N0 DIMM_C0
>  /sys/devices/system/edac/mc/mc0/dimm3/dimm_label:N0 DIMM_D0
>  /sys/devices/system/edac/mc/mc0/dimm4/dimm_label:N0 DIMM_E0
>  /sys/devices/system/edac/mc/mc0/dimm5/dimm_label:N0 DIMM_F0
>  /sys/devices/system/edac/mc/mc0/dimm6/dimm_label:N0 DIMM_G0
>  /sys/devices/system/edac/mc/mc0/dimm7/dimm_label:N0 DIMM_H0
>  /sys/devices/system/edac/mc/mc0/dimm8/dimm_label:N1 DIMM_I0
>  /sys/devices/system/edac/mc/mc0/dimm9/dimm_label:N1 DIMM_J0
>  /sys/devices/system/edac/mc/mc0/dimm10/dimm_label:N1 DIMM_K0
>  /sys/devices/system/edac/mc/mc0/dimm11/dimm_label:N1 DIMM_L0
>  /sys/devices/system/edac/mc/mc0/dimm12/dimm_label:N1 DIMM_M0
>  /sys/devices/system/edac/mc/mc0/dimm13/dimm_label:N1 DIMM_N0
>  /sys/devices/system/edac/mc/mc0/dimm14/dimm_label:N1 DIMM_O0
>  /sys/devices/system/edac/mc/mc0/dimm15/dimm_label:N1 DIMM_P0
> 
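
For anyone wanting to dump the labels in one go, here is a minimal sketch,
assuming the sysfs layout from the listing above (mc*/dimm*/dimm_label).
The function name list_dimm_labels and its optional root-directory argument
are made up for illustration; they are not part of the patch.

```shell
# Print "<mc>/<dimm>: <label>" for every DIMM found under an EDAC sysfs tree.
# $1 (optional, hypothetical) overrides the root directory; it defaults to
# the path used in the listing above.
list_dimm_labels() {
    edac_root=${1:-/sys/devices/system/edac/mc}
    for f in "$edac_root"/mc*/dimm*/dimm_label; do
        # Skip the literal pattern when the glob matches nothing,
        # and any file we are not allowed to read.
        [ -r "$f" ] || continue
        dimm_dir=${f%/dimm_label}      # .../mc0/dimm0
        mc_dir=${dimm_dir%/*}          # .../mc0
        printf '%s/%s: %s\n' "${mc_dir##*/}" "${dimm_dir##*/}" "$(cat "$f")"
    done
}
```

Calling list_dimm_labels with no argument reads the live sysfs tree. Entries
come out in shell glob order, so dimm10 sorts before dimm2.
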
> Since dimm_label is writable, a rewritten label will be used in later
> error reports:
> 
>  # echo foobar >/sys/devices/system/edac/mc/mc0/dimm0/dimm_label
>  # # some error injection here
>  # dmesg | grep foobar
>  [ 751.383533] EDAC MC0: 1 CE Single-bit ECC on foobar (node:0 card:0
>  module:0 rank:1 bank:259 col:3 bit_pos:16 DIMM location:N0 DIMM_A0
>  page:0x8c8dc74 offset:0x0 grain:1 syndrome:0x0 - APEI location:
>  node:0 card:0 module:0 rank:1 bank:259 col:3 bit_pos:16 DIMM
>  location:N0 DIMM_A0 status(0x0000000000000400): Storage error in DRAM
>  memory)
> 
> Signed-off-by: Robert Richter <rrichter@marvell.com>
> ---
> v4:
> 
>  * dimm->label: Only update dimm->label if bank/device is found in
>    the SMBIOS table, this keeps current behavior for machines that do
>    not provide this information.
> 
>  * e->location: Keep the current behavior of how e->location is written.
> 
>  * e->label: Use dimm->label if a DIMM was found by its handle and
>    "unknown memory" otherwise. This aligns with the edac_mc
>    implementation.
> 
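
The label rules in the v4 notes can be modeled roughly as below. This is
only a shell sketch of the described behavior, not the driver code, which is
C in drivers/edac/ghes_edac.c. The helper names are hypothetical, and
treating "bank/device is found" as "both strings are non-empty" is an
assumption.

```shell
# Hypothetical model of the dimm->label rule:
# $1 = DMI bank string, $2 = DMI device string, $3 = label the DIMM already has.
# Assumption: the label is only replaced when both DMI strings are non-empty.
dimm_label_from_dmi() {
    bank=$1 device=$2 current=$3
    if [ -n "$bank" ] && [ -n "$device" ]; then
        printf '%s %s\n' "$bank" "$device"
    else
        printf '%s\n' "$current"       # keep the existing label untouched
    fi
}

# Hypothetical model of the e->label rule:
# $1 = "found" when the DIMM was located by its handle, $2 = that DIMM's label.
report_label() {
    if [ "$1" = found ]; then
        printf '%s\n' "$2"
    else
        printf 'unknown memory\n'
    fi
}
```

With bank "N0" and device "DIMM_A0" the first helper yields "N0 DIMM_A0",
which matches the form of the labels in the sysfs listing above.
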
> ---
>  drivers/edac/ghes_edac.c | 37 ++++++++++++++++++++++++++-----------
>  1 file changed, 26 insertions(+), 11 deletions(-)

Yap, looks good. I'll queue it after the merge window.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Thread overview: 3+ messages
2020-05-28 10:13 Robert Richter
2020-06-02 15:48 ` Borislav Petkov [this message]
2020-06-03  6:56   ` Robert Richter
