From: Robert Richter <rric@kernel.org>
To: Robert Richter <rrichter@marvell.com>
Cc: Borislav Petkov <bp@alien8.de>, James Morse <james.morse@arm.com>,
	Mauro Carvalho Chehab <mchehab@kernel.org>,
	Tony Luck <tony.luck@intel.com>,
	"linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH v2 00/24] EDAC, mc, ghes: Fixes and updates to improve memory error reporting
Date: Fri, 2 Aug 2019 09:58:55 +0200
Message-ID: <20190802075854.ydrlufwvdxcjk7jj@rric.localdomain>
In-Reply-To: <20190624150758.6695-1-rrichter@marvell.com>

Hi all,

this is a friendly ping for review of this series.

Thanks,

-Robert

On 24.06.19 15:08:52, Robert Richter wrote:
> Current arm64 systems that use the ghes driver lack kernel support for
> proper memory error reporting. The following issues are seen:
> 
>  * Error record shows insufficient data, such as "EDAC MC0: 1 CE
>    unknown error on unknown label",
> 
>  * DMI DIMM labels are not decoded for error reporting,
> 
>  * No memory hierarchy known (NUMA topology),
> 
>  * No per layer reporting,
> 
>  * Significant differences to x86 error reports.
> 
> This patch set addresses all of the above by reworking the
> ghes_edac and edac_mc drivers.
> 
> Patches #1-#4: Fix the grain calculation in edac_mc.c (#1) and
> ghes_edac.c (#2), including unification of the trace_mc_event() code
> (#3, #4).
> 
> Patches #5-#12: General fixes and improvements of the ghes and mc
> drivers. Most of it is a rework of existing code without functional
> changes, to simplify, clean up and consolidate common code. The changes
> prepare for, and are a requirement of, the following patches that
> improve ghes error reports.
> 
> Patches #13-#22: Improve memory error reporting of the ghes driver
> including:
> 
>  * support for legacy API (patch #13),
> 
>  * NUMA detection, one mc device per node (patches #14-#17),
> 
>  * support for DMI DIMM label information (patch #18),
> 
>  * per-layer reporting (patches #19-#21).
> 
> Patch #23: Documentation updates.
> 
> Patch #24: Disable legacy API for ARM64 ghes driver (optional, needs to
> be ack'ed by James, I vote for not applying it).
> 
> All changes should keep existing systems working as before. All
> systems that are using ghes will also benefit from the update. There
> is a fallback in the ghes driver that disables NUMA or enters a fake
> mode if some of the NUMA or DIMM information is inconsistent. So it
> should not break existing systems that provide broken firmware tables.
> 
> The series has been tested on a Marvell/Cavium ThunderX2 system. Here
> are some example logs and sysfs entries:
> 
> Boot log of memory hierarchy and dimm detection:
> 
>  EDAC DEBUG: mem_info_setup: DIMM0: Found mem range [0x0000008800000000-0x0000009ffcffffff] on node 0, handle: 0x0038, label: N0 DIMM_A0
>  EDAC DEBUG: mem_info_setup: DIMM1: Found mem range [0x0000008800000000-0x0000009ffcffffff] on node 0, handle: 0x0039, label: N0 DIMM_B0
>  EDAC DEBUG: mem_info_setup: DIMM2: Found mem range [0x0000008800000000-0x0000009ffcffffff] on node 0, handle: 0x003a, label: N0 DIMM_C0
>  EDAC DEBUG: mem_info_setup: DIMM3: Found mem range [0x0000008800000000-0x0000009ffcffffff] on node 0, handle: 0x003b, label: N0 DIMM_D0
>  EDAC DEBUG: mem_info_setup: DIMM4: Found mem range [0x0000008800000000-0x0000009ffcffffff] on node 0, handle: 0x003c, label: N0 DIMM_E0
>  EDAC DEBUG: mem_info_setup: DIMM5: Found mem range [0x0000008800000000-0x0000009ffcffffff] on node 0, handle: 0x003d, label: N0 DIMM_F0
>  EDAC DEBUG: mem_info_setup: DIMM6: Found mem range [0x0000008800000000-0x0000009ffcffffff] on node 0, handle: 0x003e, label: N0 DIMM_G0
>  EDAC DEBUG: mem_info_setup: DIMM7: Found mem range [0x0000008800000000-0x0000009ffcffffff] on node 0, handle: 0x003f, label: N0 DIMM_H0
>  EDAC DEBUG: mem_info_setup: DIMM8: Found mem range [0x0000009ffd000000-0x000000bffcffffff] on node 1, handle: 0x004f, label: N1 DIMM_I0
>  EDAC DEBUG: mem_info_setup: DIMM9: Found mem range [0x0000009ffd000000-0x000000bffcffffff] on node 1, handle: 0x0050, label: N1 DIMM_J0
>  EDAC DEBUG: mem_info_setup: DIMM10: Found mem range [0x0000009ffd000000-0x000000bffcffffff] on node 1, handle: 0x0051, label: N1 DIMM_K0
>  EDAC DEBUG: mem_info_setup: DIMM11: Found mem range [0x0000009ffd000000-0x000000bffcffffff] on node 1, handle: 0x0052, label: N1 DIMM_L0
>  EDAC DEBUG: mem_info_setup: DIMM12: Found mem range [0x0000009ffd000000-0x000000bffcffffff] on node 1, handle: 0x0053, label: N1 DIMM_M0
>  EDAC DEBUG: mem_info_setup: DIMM13: Found mem range [0x0000009ffd000000-0x000000bffcffffff] on node 1, handle: 0x0054, label: N1 DIMM_N0
>  EDAC DEBUG: mem_info_setup: DIMM14: Found mem range [0x0000009ffd000000-0x000000bffcffffff] on node 1, handle: 0x0055, label: N1 DIMM_O0
>  EDAC DEBUG: mem_info_setup: DIMM15: Found mem range [0x0000009ffd000000-0x000000bffcffffff] on node 1, handle: 0x0056, label: N1 DIMM_P0
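
To illustrate how the ranges in the boot log above relate an address to a NUMA node, here is a minimal Python sketch. It is not code from the series: the regular expression and the helper names parse_boot_log() and node_for_address() are hypothetical, and only two of the sixteen log lines are reproduced (one per node).

```python
import re

# Two lines copied from the boot log above, one per node.
BOOT_LOG = """\
EDAC DEBUG: mem_info_setup: DIMM0: Found mem range [0x0000008800000000-0x0000009ffcffffff] on node 0, handle: 0x0038, label: N0 DIMM_A0
EDAC DEBUG: mem_info_setup: DIMM11: Found mem range [0x0000009ffd000000-0x000000bffcffffff] on node 1, handle: 0x0052, label: N1 DIMM_L0
"""

# Hypothetical pattern matching the debug line format shown in the log.
LINE_RE = re.compile(
    r"DIMM(?P<idx>\d+): Found mem range "
    r"\[(?P<start>0x[0-9a-f]+)-(?P<end>0x[0-9a-f]+)\] "
    r"on node (?P<node>\d+), handle: (?P<handle>0x[0-9a-f]+), "
    r"label: (?P<label>.+)$"
)


def parse_boot_log(text):
    """Return one dict per DIMM line found in the log text."""
    dimms = []
    for line in text.splitlines():
        m = LINE_RE.search(line)
        if m is None:
            continue
        dimms.append({
            "index": int(m.group("idx")),
            "start": int(m.group("start"), 16),
            "end": int(m.group("end"), 16),
            "node": int(m.group("node")),
            "handle": int(m.group("handle"), 16),
            "label": m.group("label").strip(),
        })
    return dimms


def node_for_address(dimms, addr):
    """Return the node whose reported range contains addr, or None."""
    for d in dimms:
        if d["start"] <= addr <= d["end"]:
            return d["node"]
    return None
```

Feeding it the physical address from the error report further down (0x000000bd0db44000) gives node 1, which agrees with the "node: 1" field in that report.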
> 
> DIMM label entries in sysfs:
> 
>  # grep . /sys/devices/system/edac/mc/mc*/dimm*/dimm_label
>  /sys/devices/system/edac/mc/mc0/dimm0/dimm_label:N0 DIMM_A0
>  /sys/devices/system/edac/mc/mc0/dimm1/dimm_label:N0 DIMM_B0
>  /sys/devices/system/edac/mc/mc0/dimm2/dimm_label:N0 DIMM_C0
>  /sys/devices/system/edac/mc/mc0/dimm3/dimm_label:N0 DIMM_D0
>  /sys/devices/system/edac/mc/mc0/dimm4/dimm_label:N0 DIMM_E0
>  /sys/devices/system/edac/mc/mc0/dimm5/dimm_label:N0 DIMM_F0
>  /sys/devices/system/edac/mc/mc0/dimm6/dimm_label:N0 DIMM_G0
>  /sys/devices/system/edac/mc/mc0/dimm7/dimm_label:N0 DIMM_H0
>  /sys/devices/system/edac/mc/mc1/dimm0/dimm_label:N1 DIMM_I0
>  /sys/devices/system/edac/mc/mc1/dimm1/dimm_label:N1 DIMM_J0
>  /sys/devices/system/edac/mc/mc1/dimm2/dimm_label:N1 DIMM_K0
>  /sys/devices/system/edac/mc/mc1/dimm3/dimm_label:N1 DIMM_L0
>  /sys/devices/system/edac/mc/mc1/dimm4/dimm_label:N1 DIMM_M0
>  /sys/devices/system/edac/mc/mc1/dimm5/dimm_label:N1 DIMM_N0
>  /sys/devices/system/edac/mc/mc1/dimm6/dimm_label:N1 DIMM_O0
>  /sys/devices/system/edac/mc/mc1/dimm7/dimm_label:N1 DIMM_P0
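
For scripted use, the same sysfs entries can be collected with a few lines of Python instead of grep. This is only a sketch: read_dimm_labels() is a hypothetical helper, and it assumes nothing beyond the mc*/dimm*/dimm_label layout visible in the listing above.

```python
from pathlib import Path


def read_dimm_labels(root="/sys/devices/system/edac/mc"):
    """Return {(mc, dimm): label} for every dimm_label file below root.

    The directory layout (mcN/dimmM/dimm_label) is taken from the sysfs
    listing above; root is a parameter so the helper can be pointed at a
    copy of the tree for testing.
    """
    labels = {}
    for path in sorted(Path(root).glob("mc*/dimm*/dimm_label")):
        mc, dimm = path.parts[-3], path.parts[-2]
        labels[(mc, dimm)] = path.read_text().strip()
    return labels
```

On a system without an EDAC driver loaded the glob matches nothing and the function returns an empty dict.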
> 
> Memory error reports in the kernel log:
> 
>  {1}[Hardware Error]:  Error 4, type: corrected
>  {1}[Hardware Error]:   section_type: memory error
>  {1}[Hardware Error]:   error_status: 0x0000000000000400
>  {1}[Hardware Error]:   physical_address: 0x000000bd0db44000
>  {1}[Hardware Error]:   node: 1 card: 3 module: 0 rank: 0 bank: 256 column: 10 bit_position: 16 
>  {1}[Hardware Error]:   DIMM location: N1 DIMM_L0 
>  EDAC MC1: 1 CE ghes_mc on N1 DIMM_L0 (card:3 module:0 page:0xbd0db44 offset:0x0 grain:0 syndrome:0x0 - APEI location: node:1 card:3 module:0 rank:0 bank:256 col:10 bit_pos:16 handle:0x0052 status(0x0000000000000400): Storage error in DRAM memory)
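
The page and offset fields in the EDAC line above follow from the reported physical address by a shift and a mask. A small Python sketch of that arithmetic, assuming a 4 KiB page size (PAGE_SHIFT = 12); that value is an assumption chosen because it reproduces the numbers in this log, and other kernel configurations may use a different page size:

```python
PAGE_SHIFT = 12  # assumption: 4 KiB pages, consistent with the log above


def split_phys_addr(addr, page_shift=PAGE_SHIFT):
    """Split a physical address into (page frame number, offset in page)."""
    page = addr >> page_shift
    offset = addr & ((1 << page_shift) - 1)
    return page, offset


page, offset = split_phys_addr(0x000000bd0db44000)
print(f"page:{page:#x} offset:{offset:#x}")  # page:0xbd0db44 offset:0x0
```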
> 
> Error counters in sysfs (zero counters dropped):
> 
>  # find /sys/devices/system/edac/mc/ -name \*count | sort -V | xargs grep . | sed -e '/:0/d'
>  /sys/devices/system/edac/mc/mc0/ce_count:5
>  /sys/devices/system/edac/mc/mc0/csrow0/ce_count:1
>  /sys/devices/system/edac/mc/mc0/csrow0/ch0_ce_count:1
>  /sys/devices/system/edac/mc/mc0/csrow3/ce_count:1
>  /sys/devices/system/edac/mc/mc0/csrow3/ch0_ce_count:1
>  /sys/devices/system/edac/mc/mc0/csrow4/ce_count:1
>  /sys/devices/system/edac/mc/mc0/csrow4/ch0_ce_count:1
>  /sys/devices/system/edac/mc/mc0/csrow6/ce_count:2
>  /sys/devices/system/edac/mc/mc0/csrow6/ch0_ce_count:2
>  /sys/devices/system/edac/mc/mc0/dimm0/dimm_ce_count:1
>  /sys/devices/system/edac/mc/mc0/dimm3/dimm_ce_count:1
>  /sys/devices/system/edac/mc/mc0/dimm4/dimm_ce_count:1
>  /sys/devices/system/edac/mc/mc0/dimm6/dimm_ce_count:2
>  /sys/devices/system/edac/mc/mc1/ce_count:4
>  /sys/devices/system/edac/mc/mc1/csrow0/ce_count:1
>  /sys/devices/system/edac/mc/mc1/csrow0/ch0_ce_count:1
>  /sys/devices/system/edac/mc/mc1/csrow3/ce_count:1
>  /sys/devices/system/edac/mc/mc1/csrow3/ch0_ce_count:1
>  /sys/devices/system/edac/mc/mc1/csrow6/ce_count:2
>  /sys/devices/system/edac/mc/mc1/csrow6/ch0_ce_count:2
>  /sys/devices/system/edac/mc/mc1/dimm0/dimm_ce_count:1
>  /sys/devices/system/edac/mc/mc1/dimm3/dimm_ce_count:1
>  /sys/devices/system/edac/mc/mc1/dimm6/dimm_ce_count:2
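
As a quick consistency check on the counters above, the per-DIMM counts can be summed and compared with the per-controller total. The Python sketch below does this on a subset of the listing (the controller totals and the dimm_ce_count lines); summarize() is a hypothetical helper, not part of the series.

```python
import re
from collections import defaultdict

# Controller totals and per-DIMM counters copied from the listing above.
SYSFS_DUMP = """\
/sys/devices/system/edac/mc/mc0/ce_count:5
/sys/devices/system/edac/mc/mc0/dimm0/dimm_ce_count:1
/sys/devices/system/edac/mc/mc0/dimm3/dimm_ce_count:1
/sys/devices/system/edac/mc/mc0/dimm4/dimm_ce_count:1
/sys/devices/system/edac/mc/mc0/dimm6/dimm_ce_count:2
/sys/devices/system/edac/mc/mc1/ce_count:4
/sys/devices/system/edac/mc/mc1/dimm0/dimm_ce_count:1
/sys/devices/system/edac/mc/mc1/dimm3/dimm_ce_count:1
/sys/devices/system/edac/mc/mc1/dimm6/dimm_ce_count:2
"""


def summarize(dump):
    """Return {mc: (controller ce_count, sum of dimm_ce_count)}."""
    totals = {}
    dimm_sums = defaultdict(int)
    for line in dump.splitlines():
        path, _, value = line.rpartition(":")
        m = re.search(r"/(mc\d+)/(.*)$", path)
        if m is None:
            continue
        mc, rest = m.group(1), m.group(2)
        if rest == "ce_count":
            totals[mc] = int(value)
        elif re.fullmatch(r"dimm\d+/dimm_ce_count", rest):
            dimm_sums[mc] += int(value)
    return {mc: (totals[mc], dimm_sums[mc]) for mc in totals}
```

For the data shown, both controllers agree: mc0 reports 5 and its DIMMs sum to 5, mc1 reports 4 and its DIMMs sum to 4.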
> 
> 
> v2 updates:
> 
>  * rebased onto bp/for-next (b2572772d13e: EDAC: Make
>    edac_debugfs_create_x*() return void),
> 
>  * added patches to fix grain calculation (put this at the beginning
>    of the series to apply them separately),
> 
>  * modified sysfs init functions (EDAC, mc: Fix and improve
>    sysfs init functions) based on Greg's fixes (f5d59da9663d EDAC/sysfs:
>    Drop device references properly),
> 
>  * removed duplicate code for mem_info_setup*() by moving it to
>    ghes_dimm_info_init(),
> 
>  * fixed bisecting of the series,
> 
>  * made mem_info static,
> 
>  * renamed function mci_add_dimm_info() to mem_info_prepare_mci(),
> 
>  * added patch to move struct member smbios_handle to struct
>    ghes_dimm_info,
> 
>  * renamed ghes_mem_info.num_per_node[] to
>    ghes_mem_info.dimms_per_node[],
> 
>  * removed unused mem_info.enable_numa,
> 
>  * removed unused mem_info.num_nodes,
> 
>  * fixed dimm counters after sysfs reset_counters.
> 
> 
> Robert Richter (24):
>   EDAC, mc: Fix grain_bits calculation
>   EDAC, ghes: Fix grain calculation
>   EDAC, ghes: Remove pvt->detail_location string
>   EDAC, ghes: Unify trace_mc_event() code with edac_mc driver
>   EDAC, mc: Fix and improve sysfs init functions
>   EDAC: Kill EDAC_DIMM_PTR() macro
>   EDAC: Kill EDAC_DIMM_OFF() macro
>   EDAC: Introduce mci_for_each_dimm() iterator
>   EDAC, mc: Cleanup _edac_mc_free() code
>   EDAC, mc: Remove per layer counters
>   EDAC, mc: Rework edac_raw_mc_handle_error() to use struct dimm_info
>   EDAC, ghes: Use standard kernel macros for page calculations
>   EDAC, ghes: Add support for legacy API counters
>   EDAC, ghes: Rework memory hierarchy detection
>   EDAC, ghes: Extract numa node information for each dimm
>   EDAC, ghes: Moving code around ghes_edac_register()
>   EDAC, ghes: Create one memory controller device per node
>   EDAC, ghes: Fill sysfs with the DMI DIMM label information
>   EDAC, mc: Introduce edac_mc_alloc_by_dimm() for per dimm allocation
>   EDAC, ghes: Identify dimm by node, card, module and handle
>   EDAC, ghes: Enable per-layer reporting based on card/module
>   EDAC, ghes: Move struct member smbios_handle to struct ghes_dimm_info
>   EDAC, Documentation: Describe CPER module definition and DIMM ranks
>   EDAC, ghes: Disable legacy API for ARM64
> 
>  Documentation/admin-guide/ras.rst |  31 +-
>  drivers/edac/edac_mc.c            | 385 ++++++++++---------
>  drivers/edac/edac_mc.h            |  33 +-
>  drivers/edac/edac_mc_sysfs.c      |  95 ++---
>  drivers/edac/ghes_edac.c          | 609 +++++++++++++++++++++++-------
>  drivers/edac/i10nm_base.c         |   3 +-
>  drivers/edac/i3200_edac.c         |   3 +-
>  drivers/edac/i5000_edac.c         |   5 +-
>  drivers/edac/i5100_edac.c         |  14 +-
>  drivers/edac/i5400_edac.c         |   4 +-
>  drivers/edac/i7300_edac.c         |   3 +-
>  drivers/edac/i7core_edac.c        |   3 +-
>  drivers/edac/ie31200_edac.c       |   7 +-
>  drivers/edac/pnd2_edac.c          |   4 +-
>  drivers/edac/sb_edac.c            |   2 +-
>  drivers/edac/skx_base.c           |   3 +-
>  drivers/edac/ti_edac.c            |   2 +-
>  include/linux/edac.h              | 141 ++++---
>  18 files changed, 842 insertions(+), 505 deletions(-)
> 
> -- 
> 2.20.1
> 


Thread overview: 52+ messages
2019-06-24 15:08 [PATCH v2 00/24] EDAC, mc, ghes: Fixes and updates to improve memory error reporting Robert Richter
2019-06-24 15:08 ` [PATCH v2 01/24] EDAC, mc: Fix grain_bits calculation Robert Richter
2019-08-03 10:08   ` Borislav Petkov
2019-06-24 15:08 ` [PATCH v2 02/24] EDAC, ghes: Fix grain calculation Robert Richter
2019-08-09 13:15   ` Borislav Petkov
2019-08-12  6:42     ` Robert Richter
2019-08-12  7:32       ` Borislav Petkov
2019-08-12 12:05         ` Robert Richter
2019-08-12 12:38           ` Borislav Petkov
2019-06-24 15:08 ` [PATCH v2 03/24] EDAC, ghes: Remove pvt->detail_location string Robert Richter
2019-08-02 17:04   ` James Morse
2019-08-07  9:00     ` Robert Richter
2019-08-13  8:09   ` Borislav Petkov
2019-06-24 15:09 ` [PATCH v2 04/24] EDAC, ghes: Unify trace_mc_event() code with edac_mc driver Robert Richter
2019-06-24 15:09 ` [PATCH v2 05/24] EDAC, mc: Fix and improve sysfs init functions Robert Richter
2019-08-13  8:26   ` Borislav Petkov
2019-06-24 15:09 ` [PATCH v2 06/24] EDAC: Kill EDAC_DIMM_PTR() macro Robert Richter
2019-08-13 14:59   ` Borislav Petkov
2019-08-27 12:20     ` Robert Richter
2019-06-24 15:09 ` [PATCH v2 07/24] EDAC: Kill EDAC_DIMM_OFF() macro Robert Richter
2019-08-14 14:52   ` Borislav Petkov
2019-06-24 15:09 ` [PATCH v2 08/24] EDAC: Introduce mci_for_each_dimm() iterator Robert Richter
2019-08-14 15:18   ` Borislav Petkov
2019-08-28  8:18     ` Robert Richter
2019-06-24 15:09 ` [PATCH v2 09/24] EDAC, mc: Cleanup _edac_mc_free() code Robert Richter
2019-08-14 16:31   ` Borislav Petkov
2019-06-24 15:09 ` [PATCH v2 10/24] EDAC, mc: Remove per layer counters Robert Richter
2019-08-16  9:24   ` Borislav Petkov
2019-06-24 15:09 ` [PATCH v2 11/24] EDAC, mc: Rework edac_raw_mc_handle_error() to use struct dimm_info Robert Richter
2019-06-24 15:09 ` [PATCH v2 12/24] EDAC, ghes: Use standard kernel macros for page calculations Robert Richter
2019-08-02 17:04   ` James Morse
2019-08-07  9:52     ` Robert Richter
2019-06-24 15:09 ` [PATCH v2 13/24] EDAC, ghes: Add support for legacy API counters Robert Richter
2019-08-16  9:55   ` Borislav Petkov
2019-08-30  9:35     ` Robert Richter
2019-06-24 15:09 ` [PATCH v2 14/24] EDAC, ghes: Rework memory hierarchy detection Robert Richter
2019-08-20  8:56   ` Borislav Petkov
2019-06-24 15:09 ` [PATCH v2 15/24] EDAC, ghes: Extract numa node information for each dimm Robert Richter
2019-08-02 17:05   ` James Morse
2019-08-09 13:09     ` Robert Richter
2019-06-24 15:09 ` [PATCH v2 16/24] EDAC, ghes: Moving code around ghes_edac_register() Robert Richter
2019-06-24 15:09 ` [PATCH v2 17/24] EDAC, ghes: Create one memory controller device per node Robert Richter
2019-06-24 15:09 ` [PATCH v2 18/24] EDAC, ghes: Fill sysfs with the DMI DIMM label information Robert Richter
2019-06-24 15:09 ` [PATCH v2 19/24] EDAC, mc: Introduce edac_mc_alloc_by_dimm() for per dimm allocation Robert Richter
2019-06-24 15:09 ` [PATCH v2 20/24] EDAC, ghes: Identify dimm by node, card, module and handle Robert Richter
2019-06-24 15:09 ` [PATCH v2 21/24] EDAC, ghes: Enable per-layer reporting based on card/module Robert Richter
2019-06-24 15:09 ` [PATCH v2 22/24] EDAC, ghes: Move struct member smbios_handle to struct ghes_dimm_info Robert Richter
2019-06-24 15:09 ` [PATCH v2 23/24] EDAC, Documentation: Describe CPER module definition and DIMM ranks Robert Richter
2019-06-24 15:09 ` [PATCH v2 24/24] EDAC, ghes: Disable legacy API for ARM64 Robert Richter
2019-06-26  9:33   ` James Morse
2019-06-26 10:11     ` Robert Richter
2019-08-02  7:58 ` Robert Richter [this message]
