Linux-EDAC Archive on lore.kernel.org
From: Robert Richter <rrichter@marvell.com>
To: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>, Tony Luck <tony.luck@intel.com>,
	"James Morse" <james.morse@arm.com>,
	"linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 06/19] EDAC, mc: Remove per layer counters
Date: Mon, 14 Oct 2019 11:12:23 +0000
Message-ID: <20191014111215.f5wyed33ilppfopg@rric.localdomain> (raw)
In-Reply-To: <20191011074031.699396df@coco.lan>

On 11.10.19 07:40:31, Mauro Carvalho Chehab wrote:
> On Thu, 10 Oct 2019 20:25:16 +0000
> Robert Richter <rrichter@marvell.com> wrote:
> 
> > Looking at how mci->{ue,ce}_per_layer[EDAC_MAX_LAYERS] is used, it
> > turns out that only the leaves in the memory hierarchy are consumed
> > (in sysfs), but not the intermediate layers, e.g.:
> > 
> >  count = dimm->mci->ce_per_layer[dimm->mci->n_layers-1][dimm->idx];
> > 
> > These unused counters only add complexity, remove them. The error
> > counter values are directly stored in struct dimm_info now.
> 
> Hmm... not sure if this patch is correct. I remember that there are some
> border cases in some drivers (maybe the 3-layer drivers used together
> with RDIMM memory controllers?) where some errors are not associated
> with a specific dimm but, instead, are related to a problem on the memory
> bus.
> 
> Also, depending on how the memory controllers are organized[1], the ECC
> logic groups memory into DIMM pairs. So, when an error occurs, it may be
> either at DIMM1 or DIMM2.
> 
> [1] On Intel, this happens with pre-Nehalem memory controllers.
> 
> Due to that, storing errors at the dimm struct sounds wrong, as the
> error may affect multiple DIMMs or even the entire layer.

That was my first thought too, but the counter values are not used at
all. The only exception is to provide *per-dimm* counters here:

 {ce,ue}_per_layer[n_layers-1][dimm->idx]. 

 https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/edac/edac_mc_sysfs.c?id=4f5cafb5cb8471e54afdc9054d973535614f7675#n567
 https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/edac/edac_mc_sysfs.c?id=4f5cafb5cb8471e54afdc9054d973535614f7675#n584

The case you mentioned above is when the mc reports only part of the
error location (with a top, mid or low layer missing). The dimm cannot
be identified then. In this case edac_mc_handle_error() tries to find
a unique row (plus channel information, if available) and lists all
possible dimm labels in e->label. See:

 https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/edac/edac_mc.c?id=4f5cafb5cb8471e54afdc9054d973535614f7675#n1153

Thus, we see a counter increment for the row (and also for the channel,
if it can be identified), but it is counted only in the mci->csrows
array, which this patch does not remove.

That said, the {ue,ce}_per_layer[] arrays can be removed while keeping
the same driver functionality, especially for the case you mentioned above.

-Robert


Thread overview: 52+ messages
2019-10-10 20:25 [PATCH 00/19] EDAC: Rework edac_mc and ghes drivers Robert Richter
2019-10-10 20:25 ` [PATCH 01/19] EDAC: Replace EDAC_DIMM_PTR() macro with edac_get_dimm() function Robert Richter
2019-10-11  9:58   ` Mauro Carvalho Chehab
2019-10-11 11:38     ` Robert Richter
2019-10-10 20:25 ` [PATCH 02/19] EDAC: Remove EDAC_DIMM_OFF() macro Robert Richter
2019-10-11 10:09   ` Mauro Carvalho Chehab
2019-10-11 11:36     ` Robert Richter
2019-10-10 20:25 ` [PATCH 03/19] EDAC: Introduce mci_for_each_dimm() iterator Robert Richter
2019-10-11 10:14   ` Mauro Carvalho Chehab
2019-10-10 20:25 ` [PATCH 04/19] EDAC, mc: Do not BUG_ON() in edac_mc_alloc() Robert Richter
2019-10-11 10:15   ` Mauro Carvalho Chehab
2019-10-10 20:25 ` [PATCH 05/19] EDAC, mc: Reduce indentation level in edac_mc_handle_error() Robert Richter
2019-10-10 22:10   ` Joe Perches
2019-10-11  6:50     ` Robert Richter
2019-10-11 10:20     ` Mauro Carvalho Chehab
2019-10-11 10:50       ` Joe Perches
2019-10-11 12:08         ` Robert Richter
2019-10-11 14:49           ` Joe Perches
2019-10-11 10:17   ` Mauro Carvalho Chehab
2019-10-10 20:25 ` [PATCH 06/19] EDAC, mc: Remove per layer counters Robert Richter
2019-10-11 10:40   ` Mauro Carvalho Chehab
2019-10-14 11:12     ` Robert Richter [this message]
2019-10-10 20:25 ` [PATCH 07/19] EDAC, mc: Rename iterator variable to idx Robert Richter
2019-10-11 10:41   ` Mauro Carvalho Chehab
2019-10-10 20:25 ` [PATCH 08/19] EDAC, mc: Split edac_mc_alloc() into smaller functions Robert Richter
2019-10-11 10:43   ` Mauro Carvalho Chehab
2019-10-10 20:25 ` [PATCH 09/19] EDAC, mc: Reorder functions edac_mc_alloc*() Robert Richter
2019-10-11 10:45   ` Mauro Carvalho Chehab
2019-10-10 20:25 ` [PATCH 10/19] EDAC, mc: Rework edac_raw_mc_handle_error() to use struct dimm_info Robert Richter
2019-10-11 10:48   ` Mauro Carvalho Chehab
2019-10-10 20:25 ` [PATCH 11/19] EDAC: Remove misleading comment in struct edac_raw_error_desc Robert Richter
2019-10-11 10:49   ` Mauro Carvalho Chehab
2019-10-10 20:25 ` [PATCH 12/19] EDAC: Store error type " Robert Richter
2019-10-11 10:54   ` Mauro Carvalho Chehab
2019-10-14 11:47     ` Robert Richter
2019-10-10 20:25 ` [PATCH 13/19] EDAC, mc: Determine mci pointer from the error descriptor Robert Richter
2019-10-11 10:56   ` Mauro Carvalho Chehab
2019-10-10 20:25 ` [PATCH 14/19] EDAC, mc: Create new function edac_inc_csrow() Robert Richter
2019-10-11 11:08   ` Mauro Carvalho Chehab
2019-10-14 11:58     ` Robert Richter
2019-10-10 20:25 ` [PATCH 15/19] EDAC, ghes: Use standard kernel macros for page calculations Robert Richter
2019-10-11 11:10   ` Mauro Carvalho Chehab
2019-10-10 20:25 ` [PATCH 16/19] EDAC, ghes: Fix grain calculation Robert Richter
2019-10-11 11:22   ` Mauro Carvalho Chehab
2019-10-10 20:25 ` [PATCH 17/19] EDAC, ghes: Remove intermediate buffer pvt->detail_location Robert Richter
2019-10-11 11:20   ` Mauro Carvalho Chehab
2019-10-10 20:25 ` [PATCH 18/19] EDAC, ghes: Unify trace_mc_event() code with edac_mc driver Robert Richter
2019-10-11 11:23   ` Mauro Carvalho Chehab
2019-10-10 20:25 ` [PATCH 19/19] EDAC, Documentation: Describe CPER module definition and DIMM ranks Robert Richter
2019-10-11 11:29   ` Mauro Carvalho Chehab
2019-10-10 20:36 ` [PATCH 00/19] EDAC: Rework edac_mc and ghes drivers Robert Richter
2019-10-14 12:00 ` Robert Richter
