linux-edac.vger.kernel.org archive mirror
* [PATCH 00/21] EDAC, mc, ghes: Fixes and updates to improve memory error reporting
@ 2019-05-29  8:44 Robert Richter
From: Robert Richter @ 2019-05-29  8:44 UTC
  To: Borislav Petkov, Tony Luck, James Morse, Mauro Carvalho Chehab
  Cc: linux-edac, linux-kernel, Robert Richter

Current arm64 systems that use the ghes driver lack kernel support for
proper memory error reporting. The following issues are seen:

 * Error record shows insufficient data, such as "EDAC MC0: 1 CE
   unknown error on unknown label",

 * DMI DIMM labels are not decoded for error reporting,

 * No memory hierarchy known (NUMA topology),

 * No per layer reporting,

 * Significant differences from x86 error reports.

This patch set addresses all of the above through a rework of the
ghes_edac and edac_mc drivers.

Patch #1: Repost of an already accepted patch sent to the mailing list.
Adding it here for completeness as I did not find it in a repository
yet. The fix is also required for this series. Note that there is a
modification compared to the previous version that further reduces
complexity (early break of the loop removed).

Patches #2-#11: General fixes and improvements of the ghes and mc
drivers. Most of it is a rework of existing code without functional
changes, done to improve, simplify, clean up and unify common code. The
changes prepare for, and are a requirement for, the following patches
that improve ghes error reports.

Patches #12-#20: Improve memory error reporting of the ghes driver
including:

 * support for legacy API (patch #12),

 * NUMA detection, one mc device per node (patches #13-#16),

 * support for DMI DIMM label information (patch #17),

 * per-layer reporting (patches #18-#20).

Patch #21: Documentation updates.

All changes should keep existing systems working as before. All
systems that are using ghes will also benefit from the update. There
is a fallback in the ghes driver that disables NUMA or enters a fake
mode if some of the NUMA or DIMM information is inconsistent. So it
should not break existing systems that provide broken firmware tables.

The series has been tested on a Marvell/Cavium ThunderX2 system. Here
are some example logs and sysfs entries:

Boot log of memory hierarchy and DIMM detection:

 EDAC DEBUG: mem_info_setup: DIMM0: Found mem range [0x0000008800000000-0x0000009ffcffffff] on node 0, handle: 0x0038, label: N0 DIMM_A0
 EDAC DEBUG: mem_info_setup: DIMM1: Found mem range [0x0000008800000000-0x0000009ffcffffff] on node 0, handle: 0x0039, label: N0 DIMM_B0
 EDAC DEBUG: mem_info_setup: DIMM2: Found mem range [0x0000008800000000-0x0000009ffcffffff] on node 0, handle: 0x003a, label: N0 DIMM_C0
 EDAC DEBUG: mem_info_setup: DIMM3: Found mem range [0x0000008800000000-0x0000009ffcffffff] on node 0, handle: 0x003b, label: N0 DIMM_D0
 EDAC DEBUG: mem_info_setup: DIMM4: Found mem range [0x0000008800000000-0x0000009ffcffffff] on node 0, handle: 0x003c, label: N0 DIMM_E0
 EDAC DEBUG: mem_info_setup: DIMM5: Found mem range [0x0000008800000000-0x0000009ffcffffff] on node 0, handle: 0x003d, label: N0 DIMM_F0
 EDAC DEBUG: mem_info_setup: DIMM6: Found mem range [0x0000008800000000-0x0000009ffcffffff] on node 0, handle: 0x003e, label: N0 DIMM_G0
 EDAC DEBUG: mem_info_setup: DIMM7: Found mem range [0x0000008800000000-0x0000009ffcffffff] on node 0, handle: 0x003f, label: N0 DIMM_H0
 EDAC DEBUG: mem_info_setup: DIMM8: Found mem range [0x0000009ffd000000-0x000000bffcffffff] on node 1, handle: 0x004f, label: N1 DIMM_I0
 EDAC DEBUG: mem_info_setup: DIMM9: Found mem range [0x0000009ffd000000-0x000000bffcffffff] on node 1, handle: 0x0050, label: N1 DIMM_J0
 EDAC DEBUG: mem_info_setup: DIMM10: Found mem range [0x0000009ffd000000-0x000000bffcffffff] on node 1, handle: 0x0051, label: N1 DIMM_K0
 EDAC DEBUG: mem_info_setup: DIMM11: Found mem range [0x0000009ffd000000-0x000000bffcffffff] on node 1, handle: 0x0052, label: N1 DIMM_L0
 EDAC DEBUG: mem_info_setup: DIMM12: Found mem range [0x0000009ffd000000-0x000000bffcffffff] on node 1, handle: 0x0053, label: N1 DIMM_M0
 EDAC DEBUG: mem_info_setup: DIMM13: Found mem range [0x0000009ffd000000-0x000000bffcffffff] on node 1, handle: 0x0054, label: N1 DIMM_N0
 EDAC DEBUG: mem_info_setup: DIMM14: Found mem range [0x0000009ffd000000-0x000000bffcffffff] on node 1, handle: 0x0055, label: N1 DIMM_O0
 EDAC DEBUG: mem_info_setup: DIMM15: Found mem range [0x0000009ffd000000-0x000000bffcffffff] on node 1, handle: 0x0056, label: N1 DIMM_P0
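
To illustrate how the detected DIMMs map onto one mc device per node,
here is a small userspace sketch (not part of the series) that parses
debug lines of the format shown above and groups them by node. The
names LINE_RE and group_dimms_by_node() are made up for this example,
and the per-node position is assumed to follow discovery order, which
is only inferred from the example output.

```python
import re
from collections import defaultdict

# Matches the mem_info_setup debug lines of the boot log above.
LINE_RE = re.compile(
    r"DIMM(?P<idx>\d+): Found mem range "
    r"\[(?P<start>0x[0-9a-f]+)-(?P<end>0x[0-9a-f]+)\] "
    r"on node (?P<node>\d+), handle: (?P<handle>0x[0-9a-f]+), "
    r"label: (?P<label>.+)$"
)


def group_dimms_by_node(log_lines):
    """Return {node: [(position_in_node, label, handle), ...]}.

    Assumption: the position of a DIMM within its node is simply the
    order in which its line appears in the log.
    """
    per_node = defaultdict(list)
    for line in log_lines:
        m = LINE_RE.search(line.rstrip())
        if m is None:
            continue  # not a mem_info_setup line
        node = int(m.group("node"))
        position = len(per_node[node])
        per_node[node].append(
            (position, m.group("label").strip(), int(m.group("handle"), 16))
        )
    return dict(per_node)
```

Fed with the full boot log above, DIMM11 (handle 0x0052, label
"N1 DIMM_L0") ends up at position 3 of node 1.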

DIMM label entries in sysfs:

 # grep . /sys/devices/system/edac/mc/mc*/dimm*/dimm_label
 /sys/devices/system/edac/mc/mc0/dimm0/dimm_label:N0 DIMM_A0
 /sys/devices/system/edac/mc/mc0/dimm1/dimm_label:N0 DIMM_B0
 /sys/devices/system/edac/mc/mc0/dimm2/dimm_label:N0 DIMM_C0
 /sys/devices/system/edac/mc/mc0/dimm3/dimm_label:N0 DIMM_D0
 /sys/devices/system/edac/mc/mc0/dimm4/dimm_label:N0 DIMM_E0
 /sys/devices/system/edac/mc/mc0/dimm5/dimm_label:N0 DIMM_F0
 /sys/devices/system/edac/mc/mc0/dimm6/dimm_label:N0 DIMM_G0
 /sys/devices/system/edac/mc/mc0/dimm7/dimm_label:N0 DIMM_H0
 /sys/devices/system/edac/mc/mc1/dimm0/dimm_label:N1 DIMM_I0
 /sys/devices/system/edac/mc/mc1/dimm1/dimm_label:N1 DIMM_J0
 /sys/devices/system/edac/mc/mc1/dimm2/dimm_label:N1 DIMM_K0
 /sys/devices/system/edac/mc/mc1/dimm3/dimm_label:N1 DIMM_L0
 /sys/devices/system/edac/mc/mc1/dimm4/dimm_label:N1 DIMM_M0
 /sys/devices/system/edac/mc/mc1/dimm5/dimm_label:N1 DIMM_N0
 /sys/devices/system/edac/mc/mc1/dimm6/dimm_label:N1 DIMM_O0
 /sys/devices/system/edac/mc/mc1/dimm7/dimm_label:N1 DIMM_P0
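
The same information can be collected from a script. This is a minimal
sketch, assuming the sysfs layout shown above; read_dimm_labels() is a
hypothetical helper name, and the root parameter only exists so the
function can be pointed at a test directory.

```python
from pathlib import Path


def read_dimm_labels(root="/sys/devices/system/edac/mc"):
    """Map 'mcX/dimmY' to the stripped content of its dimm_label file."""
    labels = {}
    for path in sorted(Path(root).glob("mc*/dimm*/dimm_label")):
        dimm_dir = path.parent
        key = f"{dimm_dir.parent.name}/{dimm_dir.name}"
        labels[key] = path.read_text().strip()
    return labels
```

On the test system above this would be expected to yield entries such
as "mc1/dimm3" mapped to "N1 DIMM_L0".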

Memory error reports in the kernel log:

 {1}[Hardware Error]:  Error 4, type: corrected
 {1}[Hardware Error]:   section_type: memory error
 {1}[Hardware Error]:   error_status: 0x0000000000000400
 {1}[Hardware Error]:   physical_address: 0x000000bd0db44000
 {1}[Hardware Error]:   node: 1 card: 3 module: 0 rank: 0 bank: 256 column: 10 bit_position: 16 
 {1}[Hardware Error]:   DIMM location: N1 DIMM_L0 
 EDAC MC1: 1 CE ghes_mc on N1 DIMM_L0 (card:3 module:0 page:0xbd0db44 offset:0x0 grain:0 syndrome:0x0 - APEI location: node:1 card:3 module:0 rank:0 bank:256 col:10 bit_pos:16 handle:0x0052 status(0x0000000000000400): Storage error in DRAM memory)
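
The page and offset values in the EDAC line follow from the physical
address of the CPER record. A small sketch of that arithmetic, assuming
4 KiB pages (PAGE_SHIFT of 12, which is what the report above is
consistent with) and using the node ranges from the boot log; the
helper names are made up for illustration and this is not the driver
code.

```python
PAGE_SHIFT = 12  # assumption: 4 KiB pages, consistent with the report above

# Node memory ranges as printed in the boot log of the test system.
NODE_RANGES = {
    0: (0x0000008800000000, 0x0000009FFCFFFFFF),
    1: (0x0000009FFD000000, 0x000000BFFCFFFFFF),
}


def split_phys_addr(addr, page_shift=PAGE_SHIFT):
    """Split a physical address into (page frame number, offset in page)."""
    page = addr >> page_shift
    offset = addr & ((1 << page_shift) - 1)
    return page, offset


def node_of(addr, ranges=NODE_RANGES):
    """Return the node whose inclusive range contains addr, else None."""
    for node, (start, end) in ranges.items():
        if start <= addr <= end:
            return node
    return None
```

For physical_address 0x000000bd0db44000 this gives page 0xbd0db44,
offset 0x0 and node 1, matching the report.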

Error counters in sysfs (zero counters dropped):

 # find /sys/devices/system/edac/mc/ -name \*count | sort -V | xargs grep . | sed -e '/:0/d'
 /sys/devices/system/edac/mc/mc0/ce_count:5
 /sys/devices/system/edac/mc/mc0/csrow0/ce_count:1
 /sys/devices/system/edac/mc/mc0/csrow0/ch0_ce_count:1
 /sys/devices/system/edac/mc/mc0/csrow3/ce_count:1
 /sys/devices/system/edac/mc/mc0/csrow3/ch0_ce_count:1
 /sys/devices/system/edac/mc/mc0/csrow4/ce_count:1
 /sys/devices/system/edac/mc/mc0/csrow4/ch0_ce_count:1
 /sys/devices/system/edac/mc/mc0/csrow6/ce_count:2
 /sys/devices/system/edac/mc/mc0/csrow6/ch0_ce_count:2
 /sys/devices/system/edac/mc/mc0/dimm0/dimm_ce_count:1
 /sys/devices/system/edac/mc/mc0/dimm3/dimm_ce_count:1
 /sys/devices/system/edac/mc/mc0/dimm4/dimm_ce_count:1
 /sys/devices/system/edac/mc/mc0/dimm6/dimm_ce_count:2
 /sys/devices/system/edac/mc/mc1/ce_count:4
 /sys/devices/system/edac/mc/mc1/csrow0/ce_count:1
 /sys/devices/system/edac/mc/mc1/csrow0/ch0_ce_count:1
 /sys/devices/system/edac/mc/mc1/csrow3/ce_count:1
 /sys/devices/system/edac/mc/mc1/csrow3/ch0_ce_count:1
 /sys/devices/system/edac/mc/mc1/csrow6/ce_count:2
 /sys/devices/system/edac/mc/mc1/csrow6/ch0_ce_count:2
 /sys/devices/system/edac/mc/mc1/dimm0/dimm_ce_count:1
 /sys/devices/system/edac/mc/mc1/dimm3/dimm_ce_count:1
 /sys/devices/system/edac/mc/mc1/dimm6/dimm_ce_count:2
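
As a quick consistency check, the per-DIMM counters can be summed up
and compared with the per-mc totals. The sketch below parses output of
the form shown above; COUNT_RE and summarize_ce_counts() are names made
up for this example. Equality of both numbers is what the example
output shows, not something guaranteed in general.

```python
import re
from collections import defaultdict

# Matches either the mc total (mcX/ce_count) or a per-DIMM counter
# (mcX/dimmY/dimm_ce_count); csrow and channel counters are ignored.
COUNT_RE = re.compile(
    r"/mc/(mc\d+)/(?:(dimm\d+)/dimm_ce_count|ce_count):(\d+)$"
)


def summarize_ce_counts(lines):
    """Return (totals, dimm_sums), both keyed by mc name."""
    totals = {}
    dimm_sums = defaultdict(int)
    for line in lines:
        m = COUNT_RE.search(line.strip())
        if m is None:
            continue
        mc, dimm, count = m.group(1), m.group(2), int(m.group(3))
        if dimm is None:
            totals[mc] = count
        else:
            dimm_sums[mc] += count
    return totals, dict(dimm_sums)
```

For the output above, both the totals and the per-DIMM sums come out
as 5 for mc0 and 4 for mc1.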


Robert Richter (21):
  EDAC, mc: Fix edac_mc_find() in case no device is found
  EDAC: Fixes to use put_device() after device_add() errors
  EDAC: Kill EDAC_DIMM_PTR() macro
  EDAC: Kill EDAC_DIMM_OFF() macro
  EDAC: Introduce mci_for_each_dimm() iterator
  EDAC, mc: Cleanup _edac_mc_free() code
  EDAC, mc: Remove per layer counters
  EDAC, mc: Rework edac_raw_mc_handle_error() to use struct dimm_info
  EDAC, ghes: Use standard kernel macros for page calculations
  EDAC, ghes: Remove pvt->detail_location string
  EDAC, ghes: Unify trace_mc_event() code with edac_mc driver
  EDAC, ghes: Add support for legacy API counters
  EDAC, ghes: Rework memory hierarchy detection
  EDAC, ghes: Extract numa node information for each dimm
  EDAC, ghes: Moving code around ghes_edac_register()
  EDAC, ghes: Create one memory controller device per node
  EDAC, ghes: Fill sysfs with the DMI DIMM label information
  EDAC, mc: Introduce edac_mc_alloc_by_dimm() for per dimm allocation
  EDAC, ghes: Identify dimm by node, card, module and handle
  EDAC, ghes: Enable per-layer reporting based on card/module
  EDAC, Documentation: Describe CPER module definition and DIMM ranks

 Documentation/admin-guide/ras.rst |  31 +-
 drivers/edac/edac_mc.c            | 360 +++++++++---------
 drivers/edac/edac_mc.h            |  26 +-
 drivers/edac/edac_mc_sysfs.c      | 103 ++---
 drivers/edac/ghes_edac.c          | 609 +++++++++++++++++++++++-------
 drivers/edac/i10nm_base.c         |   3 +-
 drivers/edac/i3200_edac.c         |   3 +-
 drivers/edac/i5000_edac.c         |   5 +-
 drivers/edac/i5100_edac.c         |  14 +-
 drivers/edac/i5400_edac.c         |   4 +-
 drivers/edac/i7300_edac.c         |   3 +-
 drivers/edac/i7core_edac.c        |   3 +-
 drivers/edac/ie31200_edac.c       |   7 +-
 drivers/edac/pnd2_edac.c          |   4 +-
 drivers/edac/sb_edac.c            |   2 +-
 drivers/edac/skx_base.c           |   3 +-
 drivers/edac/ti_edac.c            |   2 +-
 include/linux/edac.h              | 141 ++++---
 18 files changed, 815 insertions(+), 508 deletions(-)

-- 
2.20.1



Thread overview: 43+ messages
2019-05-29  8:44 [PATCH 00/21] EDAC, mc, ghes: Fixes and updates to improve memory error reporting Robert Richter
2019-05-29  8:44 ` [PATCH 01/21] EDAC, mc: Fix edac_mc_find() in case no device is found Robert Richter
2019-05-29  8:44 ` [PATCH 02/21] EDAC: Fixes to use put_device() after device_add() errors Robert Richter
2019-06-11 17:28   ` Borislav Petkov
2019-06-12 17:17     ` Robert Richter
2019-05-29  8:44 ` [PATCH 03/21] EDAC: Kill EDAC_DIMM_PTR() macro Robert Richter
2019-05-29  8:44 ` [PATCH 04/21] EDAC: Kill EDAC_DIMM_OFF() macro Robert Richter
2019-05-29  8:44 ` [PATCH 05/21] EDAC: Introduce mci_for_each_dimm() iterator Robert Richter
2019-05-29  8:44 ` [PATCH 06/21] EDAC, mc: Cleanup _edac_mc_free() code Robert Richter
2019-05-29  8:44 ` [PATCH 07/21] EDAC, mc: Remove per layer counters Robert Richter
2019-05-29  8:44 ` [PATCH 08/21] EDAC, mc: Rework edac_raw_mc_handle_error() to use struct dimm_info Robert Richter
2019-05-29  8:44 ` [PATCH 09/21] EDAC, ghes: Use standard kernel macros for page calculations Robert Richter
2019-05-29 15:13   ` James Morse
2019-05-29  8:44 ` [PATCH 10/21] EDAC, ghes: Remove pvt->detail_location string Robert Richter
2019-05-29 15:13   ` James Morse
2019-06-12 18:13     ` Robert Richter
2019-05-29  8:44 ` [PATCH 11/21] EDAC, ghes: Unify trace_mc_event() code with edac_mc driver Robert Richter
2019-05-29 15:12   ` James Morse
2019-06-03 13:10     ` Robert Richter
2019-06-04 17:15       ` James Morse
2019-06-13 22:23         ` Robert Richter
2019-05-29  8:44 ` [PATCH 12/21] EDAC, ghes: Add support for legacy API counters Robert Richter
2019-05-29 15:13   ` James Morse
2019-06-12 18:41     ` Robert Richter
2019-06-19 17:22       ` James Morse
2019-06-20  6:55         ` Robert Richter
2019-06-26  9:33           ` James Morse
2019-06-26 10:27             ` Robert Richter
2019-05-29  8:44 ` [PATCH 13/21] EDAC, ghes: Rework memory hierarchy detection Robert Richter
2019-05-29 15:06   ` James Morse
2019-05-31 13:41     ` Robert Richter
2019-05-29  8:44 ` [PATCH 14/21] EDAC, ghes: Extract numa node information for each dimm Robert Richter
2019-05-29 17:51   ` James Morse
2019-06-13 20:52     ` Robert Richter
2019-05-29  8:44 ` [PATCH 15/21] EDAC, ghes: Moving code around ghes_edac_register() Robert Richter
2019-05-29  8:44 ` [PATCH 16/21] EDAC, ghes: Create one memory controller device per node Robert Richter
2019-05-29  8:44 ` [PATCH 17/21] EDAC, ghes: Fill sysfs with the DMI DIMM label information Robert Richter
2019-05-29  8:44 ` [PATCH 18/21] EDAC, mc: Introduce edac_mc_alloc_by_dimm() for per dimm allocation Robert Richter
2019-05-29  8:44 ` [PATCH 19/21] EDAC, ghes: Identify dimm by node, card, module and handle Robert Richter
2019-05-29  8:44 ` [PATCH 20/21] EDAC, ghes: Enable per-layer reporting based on card/module Robert Richter
2019-05-29  8:44 ` [PATCH 21/21] EDAC, Documentation: Describe CPER module definition and DIMM ranks Robert Richter
2019-05-29 14:54 ` [PATCH 00/21] EDAC, mc, ghes: Fixes and updates to improve memory error reporting Borislav Petkov
2019-05-31 14:48   ` Robert Richter
