All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v2 00/24] EDAC, mc, ghes: Fixes and updates to improve memory error reporting
@ 2019-06-24 15:08 Robert Richter
  2019-06-24 15:08 ` [PATCH v2 01/24] EDAC, mc: Fix grain_bits calculation Robert Richter
                   ` (24 more replies)
  0 siblings, 25 replies; 52+ messages in thread
From: Robert Richter @ 2019-06-24 15:08 UTC (permalink / raw)
  To: Borislav Petkov, James Morse, Mauro Carvalho Chehab
  Cc: linux-edac, linux-kernel, Robert Richter

Current arm64 systems that use the ghes driver lack kernel support for
a proper memory error reporting. Following issues are seen:

 * Error record shows insufficient data, such as "EDAC MC0: 1 CE
   unknown error on unknown label",

 * DMI DIMM labels are not decoded for error reporting,

 * No memory hierarchy known (NUMA topology),

 * No per layer reporting,

 * Significant differences to x86 error reports.

This patch set addresses all the above involving a rework of the
ghes_edac and edac_mc driver.

Patch #1-#4: Fix of grain calculation in edac_mc.c (#1) and
ghes_edac.c (#2) including unification of trace_mc_event() code (#3,
#4).

Patches #5-#12: General fixes and improvements of the ghes and mc
drivers. Most of it is a rework of existing code without functional
changes to improve, ease, cleanup and join common code. The changes
are in preparation of and a requirment for the following patches that
improve ghes error reports.

Patches #13-#22: Improve error memory reporting of the ghes driver
including:

 * support for legacy API (patch #12),

 * NUMA detection, one mc device per node (patches #13-#16),

 * support for DMI DIMM label information (patch #17),

 * per-layer reporting (patches #18-#20).

Patch #23: Documentation updates.

Patch #24: Disable legacy API for ARM64 ghes driver (optional, need to
be ack'ed by James, I vote for not applying it).

All changes should keep existing systems working as before. All
systems that are using ghes will also benefit from the update. There
is a fallback in the ghes driver that disables NUMA or enters a fake
mode if some of the NUMA or DIMM information is inconsistent. So it
should not break existing systems that provide broken firmware tables.

The series has been tested on a Marvell/Cavium ThunderX2 system. Here
some example logs and sysfs entries:

Boot log of memory hierarchy and dimm detection:

 EDAC DEBUG: mem_info_setup: DIMM0: Found mem range [0x0000008800000000-0x0000009ffcffffff] on node 0, handle: 0x0038, label: N0 DIMM_A0
 EDAC DEBUG: mem_info_setup: DIMM1: Found mem range [0x0000008800000000-0x0000009ffcffffff] on node 0, handle: 0x0039, label: N0 DIMM_B0
 EDAC DEBUG: mem_info_setup: DIMM2: Found mem range [0x0000008800000000-0x0000009ffcffffff] on node 0, handle: 0x003a, label: N0 DIMM_C0
 EDAC DEBUG: mem_info_setup: DIMM3: Found mem range [0x0000008800000000-0x0000009ffcffffff] on node 0, handle: 0x003b, label: N0 DIMM_D0
 EDAC DEBUG: mem_info_setup: DIMM4: Found mem range [0x0000008800000000-0x0000009ffcffffff] on node 0, handle: 0x003c, label: N0 DIMM_E0
 EDAC DEBUG: mem_info_setup: DIMM5: Found mem range [0x0000008800000000-0x0000009ffcffffff] on node 0, handle: 0x003d, label: N0 DIMM_F0
 EDAC DEBUG: mem_info_setup: DIMM6: Found mem range [0x0000008800000000-0x0000009ffcffffff] on node 0, handle: 0x003e, label: N0 DIMM_G0
 EDAC DEBUG: mem_info_setup: DIMM7: Found mem range [0x0000008800000000-0x0000009ffcffffff] on node 0, handle: 0x003f, label: N0 DIMM_H0
 EDAC DEBUG: mem_info_setup: DIMM8: Found mem range [0x0000009ffd000000-0x000000bffcffffff] on node 1, handle: 0x004f, label: N1 DIMM_I0
 EDAC DEBUG: mem_info_setup: DIMM9: Found mem range [0x0000009ffd000000-0x000000bffcffffff] on node 1, handle: 0x0050, label: N1 DIMM_J0
 EDAC DEBUG: mem_info_setup: DIMM10: Found mem range [0x0000009ffd000000-0x000000bffcffffff] on node 1, handle: 0x0051, label: N1 DIMM_K0
 EDAC DEBUG: mem_info_setup: DIMM11: Found mem range [0x0000009ffd000000-0x000000bffcffffff] on node 1, handle: 0x0052, label: N1 DIMM_L0
 EDAC DEBUG: mem_info_setup: DIMM12: Found mem range [0x0000009ffd000000-0x000000bffcffffff] on node 1, handle: 0x0053, label: N1 DIMM_M0
 EDAC DEBUG: mem_info_setup: DIMM13: Found mem range [0x0000009ffd000000-0x000000bffcffffff] on node 1, handle: 0x0054, label: N1 DIMM_N0
 EDAC DEBUG: mem_info_setup: DIMM14: Found mem range [0x0000009ffd000000-0x000000bffcffffff] on node 1, handle: 0x0055, label: N1 DIMM_O0
 EDAC DEBUG: mem_info_setup: DIMM15: Found mem range [0x0000009ffd000000-0x000000bffcffffff] on node 1, handle: 0x0056, label: N1 DIMM_P0

DIMM label entries in sysfs:

 # grep . /sys/devices/system/edac/mc/mc*/dimm*/dimm_label
 /sys/devices/system/edac/mc/mc0/dimm0/dimm_label:N0 DIMM_A0
 /sys/devices/system/edac/mc/mc0/dimm1/dimm_label:N0 DIMM_B0
 /sys/devices/system/edac/mc/mc0/dimm2/dimm_label:N0 DIMM_C0
 /sys/devices/system/edac/mc/mc0/dimm3/dimm_label:N0 DIMM_D0
 /sys/devices/system/edac/mc/mc0/dimm4/dimm_label:N0 DIMM_E0
 /sys/devices/system/edac/mc/mc0/dimm5/dimm_label:N0 DIMM_F0
 /sys/devices/system/edac/mc/mc0/dimm6/dimm_label:N0 DIMM_G0
 /sys/devices/system/edac/mc/mc0/dimm7/dimm_label:N0 DIMM_H0
 /sys/devices/system/edac/mc/mc1/dimm0/dimm_label:N1 DIMM_I0
 /sys/devices/system/edac/mc/mc1/dimm1/dimm_label:N1 DIMM_J0
 /sys/devices/system/edac/mc/mc1/dimm2/dimm_label:N1 DIMM_K0
 /sys/devices/system/edac/mc/mc1/dimm3/dimm_label:N1 DIMM_L0
 /sys/devices/system/edac/mc/mc1/dimm4/dimm_label:N1 DIMM_M0
 /sys/devices/system/edac/mc/mc1/dimm5/dimm_label:N1 DIMM_N0
 /sys/devices/system/edac/mc/mc1/dimm6/dimm_label:N1 DIMM_O0
 /sys/devices/system/edac/mc/mc1/dimm7/dimm_label:N1 DIMM_P0

Memory error reports in the kernel log:

 {1}[Hardware Error]:  Error 4, type: corrected
 {1}[Hardware Error]:   section_type: memory error
 {1}[Hardware Error]:   error_status: 0x0000000000000400
 {1}[Hardware Error]:   physical_address: 0x000000bd0db44000
 {1}[Hardware Error]:   node: 1 card: 3 module: 0 rank: 0 bank: 256 column: 10 bit_position: 16 
 {1}[Hardware Error]:   DIMM location: N1 DIMM_L0 
 EDAC MC1: 1 CE ghes_mc on N1 DIMM_L0 (card:3 module:0 page:0xbd0db44 offset:0x0 grain:0 syndrome:0x0 - APEI location: node:1 card:3 module:0 rank:0 bank:256 col:10 bit_pos:16 handle:0x0052 status(0x0000000000000400): Storage error in DRAM memory)

Error counters in sysfs (zero counters dropped):

 # find /sys/devices/system/edac/mc/ -name \*count | sort -V | xargs grep . | sed -e '/:0/d'
 /sys/devices/system/edac/mc/mc0/ce_count:5
 /sys/devices/system/edac/mc/mc0/csrow0/ce_count:1
 /sys/devices/system/edac/mc/mc0/csrow0/ch0_ce_count:1
 /sys/devices/system/edac/mc/mc0/csrow3/ce_count:1
 /sys/devices/system/edac/mc/mc0/csrow3/ch0_ce_count:1
 /sys/devices/system/edac/mc/mc0/csrow4/ce_count:1
 /sys/devices/system/edac/mc/mc0/csrow4/ch0_ce_count:1
 /sys/devices/system/edac/mc/mc0/csrow6/ce_count:2
 /sys/devices/system/edac/mc/mc0/csrow6/ch0_ce_count:2
 /sys/devices/system/edac/mc/mc0/dimm0/dimm_ce_count:1
 /sys/devices/system/edac/mc/mc0/dimm3/dimm_ce_count:1
 /sys/devices/system/edac/mc/mc0/dimm4/dimm_ce_count:1
 /sys/devices/system/edac/mc/mc0/dimm6/dimm_ce_count:2
 /sys/devices/system/edac/mc/mc1/ce_count:4
 /sys/devices/system/edac/mc/mc1/csrow0/ce_count:1
 /sys/devices/system/edac/mc/mc1/csrow0/ch0_ce_count:1
 /sys/devices/system/edac/mc/mc1/csrow3/ce_count:1
 /sys/devices/system/edac/mc/mc1/csrow3/ch0_ce_count:1
 /sys/devices/system/edac/mc/mc1/csrow6/ce_count:2
 /sys/devices/system/edac/mc/mc1/csrow6/ch0_ce_count:2
 /sys/devices/system/edac/mc/mc1/dimm0/dimm_ce_count:1
 /sys/devices/system/edac/mc/mc1/dimm3/dimm_ce_count:1
 /sys/devices/system/edac/mc/mc1/dimm6/dimm_ce_count:2


v2 updates:

 * rebased onto bp/for-next (b2572772d13e: EDAC: Make
   edac_debugfs_create_x*() return void),

 * added patches to fix grain calculation (put this at the beginning
   of the series to apply them separately),

 * modified sysfs init functions based (EDAC, mc: Fix and improve
   sysfs init functions) on Greg's fixes (f5d59da9663d EDAC/sysfs:
   Drop device references properly),

 * removed duplicate code for mem_info_setup*() by moving it to
   ghes_dimm_info_init(),

 * fix bisecting of series,

 * made mem_info static,

 * renamed function mci_add_dimm_info() to mem_info_prepare_mci(),

 * added patch to move struct member smbios_handle to struct
   ghes_dimm_info,

 * renamed ghes_mem_info.num_per_node[] to
   ghes_mem_info.dimms_per_node[],

 * removed unused mem_info.enable_numa,

 * removed unused mem_info.num_nodes,

 * fixed dimm counters after sysfs reset_counters.


Robert Richter (24):
  EDAC, mc: Fix grain_bits calculation
  EDAC, ghes: Fix grain calculation
  EDAC, ghes: Remove pvt->detail_location string
  EDAC, ghes: Unify trace_mc_event() code with edac_mc driver
  EDAC, mc: Fix and improve sysfs init functions
  EDAC: Kill EDAC_DIMM_PTR() macro
  EDAC: Kill EDAC_DIMM_OFF() macro
  EDAC: Introduce mci_for_each_dimm() iterator
  EDAC, mc: Cleanup _edac_mc_free() code
  EDAC, mc: Remove per layer counters
  EDAC, mc: Rework edac_raw_mc_handle_error() to use struct dimm_info
  EDAC, ghes: Use standard kernel macros for page calculations
  EDAC, ghes: Add support for legacy API counters
  EDAC, ghes: Rework memory hierarchy detection
  EDAC, ghes: Extract numa node information for each dimm
  EDAC, ghes: Moving code around ghes_edac_register()
  EDAC, ghes: Create one memory controller device per node
  EDAC, ghes: Fill sysfs with the DMI DIMM label information
  EDAC, mc: Introduce edac_mc_alloc_by_dimm() for per dimm allocation
  EDAC, ghes: Identify dimm by node, card, module and handle
  EDAC, ghes: Enable per-layer reporting based on card/module
  EDAC, ghes: Move struct member smbios_handle to struct ghes_dimm_info
  EDAC, Documentation: Describe CPER module definition and DIMM ranks
  EDAC, ghes: Disable legacy API for ARM64

 Documentation/admin-guide/ras.rst |  31 +-
 drivers/edac/edac_mc.c            | 385 ++++++++++---------
 drivers/edac/edac_mc.h            |  33 +-
 drivers/edac/edac_mc_sysfs.c      |  95 ++---
 drivers/edac/ghes_edac.c          | 609 +++++++++++++++++++++++-------
 drivers/edac/i10nm_base.c         |   3 +-
 drivers/edac/i3200_edac.c         |   3 +-
 drivers/edac/i5000_edac.c         |   5 +-
 drivers/edac/i5100_edac.c         |  14 +-
 drivers/edac/i5400_edac.c         |   4 +-
 drivers/edac/i7300_edac.c         |   3 +-
 drivers/edac/i7core_edac.c        |   3 +-
 drivers/edac/ie31200_edac.c       |   7 +-
 drivers/edac/pnd2_edac.c          |   4 +-
 drivers/edac/sb_edac.c            |   2 +-
 drivers/edac/skx_base.c           |   3 +-
 drivers/edac/ti_edac.c            |   2 +-
 include/linux/edac.h              | 141 ++++---
 18 files changed, 842 insertions(+), 505 deletions(-)

-- 
2.20.1


^ permalink raw reply	[flat|nested] 52+ messages in thread

end of thread, other threads:[~2019-08-30  9:35 UTC | newest]

Thread overview: 52+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-06-24 15:08 [PATCH v2 00/24] EDAC, mc, ghes: Fixes and updates to improve memory error reporting Robert Richter
2019-06-24 15:08 ` [PATCH v2 01/24] EDAC, mc: Fix grain_bits calculation Robert Richter
2019-08-03 10:08   ` Borislav Petkov
2019-06-24 15:08 ` [PATCH v2 02/24] EDAC, ghes: Fix grain calculation Robert Richter
2019-08-09 13:15   ` Borislav Petkov
2019-08-12  6:42     ` Robert Richter
2019-08-12  7:32       ` Borislav Petkov
2019-08-12 12:05         ` Robert Richter
2019-08-12 12:38           ` Borislav Petkov
2019-06-24 15:08 ` [PATCH v2 03/24] EDAC, ghes: Remove pvt->detail_location string Robert Richter
2019-08-02 17:04   ` James Morse
2019-08-07  9:00     ` Robert Richter
2019-08-13  8:09   ` Borislav Petkov
2019-06-24 15:09 ` [PATCH v2 04/24] EDAC, ghes: Unify trace_mc_event() code with edac_mc driver Robert Richter
2019-06-24 15:09 ` [PATCH v2 05/24] EDAC, mc: Fix and improve sysfs init functions Robert Richter
2019-08-13  8:26   ` Borislav Petkov
2019-06-24 15:09 ` [PATCH v2 06/24] EDAC: Kill EDAC_DIMM_PTR() macro Robert Richter
2019-08-13 14:59   ` Borislav Petkov
2019-08-27 12:20     ` Robert Richter
2019-06-24 15:09 ` [PATCH v2 07/24] EDAC: Kill EDAC_DIMM_OFF() macro Robert Richter
2019-08-14 14:52   ` Borislav Petkov
2019-06-24 15:09 ` [PATCH v2 08/24] EDAC: Introduce mci_for_each_dimm() iterator Robert Richter
2019-08-14 15:18   ` Borislav Petkov
2019-08-28  8:18     ` Robert Richter
2019-06-24 15:09 ` [PATCH v2 09/24] EDAC, mc: Cleanup _edac_mc_free() code Robert Richter
2019-08-14 16:31   ` Borislav Petkov
2019-06-24 15:09 ` [PATCH v2 10/24] EDAC, mc: Remove per layer counters Robert Richter
2019-08-16  9:24   ` Borislav Petkov
2019-06-24 15:09 ` [PATCH v2 11/24] EDAC, mc: Rework edac_raw_mc_handle_error() to use struct dimm_info Robert Richter
2019-06-24 15:09 ` [PATCH v2 12/24] EDAC, ghes: Use standard kernel macros for page calculations Robert Richter
2019-08-02 17:04   ` James Morse
2019-08-07  9:52     ` Robert Richter
2019-06-24 15:09 ` [PATCH v2 13/24] EDAC, ghes: Add support for legacy API counters Robert Richter
2019-08-16  9:55   ` Borislav Petkov
2019-08-30  9:35     ` Robert Richter
2019-06-24 15:09 ` [PATCH v2 14/24] EDAC, ghes: Rework memory hierarchy detection Robert Richter
2019-08-20  8:56   ` Borislav Petkov
2019-06-24 15:09 ` [PATCH v2 15/24] EDAC, ghes: Extract numa node information for each dimm Robert Richter
2019-08-02 17:05   ` James Morse
2019-08-09 13:09     ` Robert Richter
2019-06-24 15:09 ` [PATCH v2 16/24] EDAC, ghes: Moving code around ghes_edac_register() Robert Richter
2019-06-24 15:09 ` [PATCH v2 17/24] EDAC, ghes: Create one memory controller device per node Robert Richter
2019-06-24 15:09 ` [PATCH v2 18/24] EDAC, ghes: Fill sysfs with the DMI DIMM label information Robert Richter
2019-06-24 15:09 ` [PATCH v2 19/24] EDAC, mc: Introduce edac_mc_alloc_by_dimm() for per dimm allocation Robert Richter
2019-06-24 15:09 ` [PATCH v2 20/24] EDAC, ghes: Identify dimm by node, card, module and handle Robert Richter
2019-06-24 15:09 ` [PATCH v2 21/24] EDAC, ghes: Enable per-layer reporting based on card/module Robert Richter
2019-06-24 15:09 ` [PATCH v2 22/24] EDAC, ghes: Move struct member smbios_handle to struct ghes_dimm_info Robert Richter
2019-06-24 15:09 ` [PATCH v2 23/24] EDAC, Documentation: Describe CPER module definition and DIMM ranks Robert Richter
2019-06-24 15:09 ` [PATCH v2 24/24] EDAC, ghes: Disable legacy API for ARM64 Robert Richter
2019-06-26  9:33   ` James Morse
2019-06-26 10:11     ` Robert Richter
2019-08-02  7:58 ` [PATCH v2 00/24] EDAC, mc, ghes: Fixes and updates to improve memory error reporting Robert Richter

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.