From: Mauro Carvalho Chehab <mchehab@redhat.com>
Cc: linux-acpi@vger.kernel.org, Huang Ying <ying.huang@intel.com>, Tony Luck <tony.luck@intel.com>, Mauro Carvalho Chehab <mchehab@redhat.com>, Linux Edac Mailing List <linux-edac@vger.kernel.org>, Linux Kernel Mailing List <linux-kernel@vger.kernel.org>
Subject: [PATCH EDAC 00/13] Add a driver to report Firmware first errors (via GHES)
Date: Fri, 15 Feb 2013 10:44:48 -0200
Message-ID: <cover.1360931635.git.mchehab@redhat.com>

There are currently three error-reporting mechanisms in the Linux kernel: edac, mcelog and ghes. Unfortunately, they cannot all work at the same time, because BIOS accesses to the error registers may interfere with the OS reading them. So the three mechanisms need to be integrated to avoid such problems.

This patch series adds a new EDAC driver that uses "Firmware first" APEI/GHES as its error-reporting mechanism. It automatically disables the hardware-driven EDAC drivers when GHES is enabled, preventing the OS and the BIOS from both reading the very same error mechanism.

It was tested on a "Lizard Head Pass" Intel machine, equipped with BIOS SE5C600.86B.99.99.x059.091020121352 (09/10/2012).

Test results:

The driver binds properly to the EDAC core. This BIOS announces and sets "Firmware first" mode:

[ 4.537704] ghes_edac: This EDAC driver relies on BIOS to enumerate memory and get error reports.
[ 4.547644] ghes_edac: Unfortunately, not all BIOSes reflect the memory layout correctly.
[ 4.556807] ghes_edac: So, the end result of using this driver varies from vendor to vendor.
[ 4.566260] ghes_edac: If you find incorrect reports, please ask your vendor to fix its BIOS.
[ 4.575811] ghes_edac: This system has 48 DIMM sockets.
[ 4.581687] EDAC DEBUG: ghes_edac_dmidecode: DIMM0: DDR3 size = 8192 MB(ECC)
[ 4.581691] EDAC DEBUG: ghes_edac_dmidecode: type 24, detail 0x80, width 72(total 64)
[ 4.581695] EDAC DEBUG: ghes_edac_dmidecode: DIMM3: DDR3 size = 8192 MB(ECC)
[ 4.581698] EDAC DEBUG: ghes_edac_dmidecode: type 24, detail 0x80, width 72(total 64)
[ 4.581702] EDAC DEBUG: ghes_edac_dmidecode: DIMM6: DDR3 size = 8192 MB(ECC)
[ 4.581705] EDAC DEBUG: ghes_edac_dmidecode: type 24, detail 0x80, width 72(total 64)
[ 4.581708] EDAC DEBUG: ghes_edac_dmidecode: DIMM9: DDR3 size = 8192 MB(ECC)
[ 4.581711] EDAC DEBUG: ghes_edac_dmidecode: type 24, detail 0x80, width 72(total 64)
[ 4.581715] EDAC DEBUG: ghes_edac_dmidecode: DIMM12: DDR3 size = 8192 MB(ECC)
[ 4.581718] EDAC DEBUG: ghes_edac_dmidecode: type 24, detail 0x80, width 72(total 64)
[ 4.581722] EDAC DEBUG: ghes_edac_dmidecode: DIMM15: DDR3 size = 8192 MB(ECC)
[ 4.581724] EDAC DEBUG: ghes_edac_dmidecode: type 24, detail 0x80, width 72(total 64)
[ 4.581728] EDAC DEBUG: ghes_edac_dmidecode: DIMM18: DDR3 size = 8192 MB(ECC)
[ 4.581730] EDAC DEBUG: ghes_edac_dmidecode: type 24, detail 0x80, width 72(total 64)
[ 4.581734] EDAC DEBUG: ghes_edac_dmidecode: DIMM21: DDR3 size = 8192 MB(ECC)
[ 4.581737] EDAC DEBUG: ghes_edac_dmidecode: type 24, detail 0x80, width 72(total 64)
[ 4.581741] EDAC DEBUG: ghes_edac_dmidecode: DIMM24: DDR3 size = 8192 MB(ECC)
[ 4.581752] EDAC DEBUG: ghes_edac_dmidecode: type 24, detail 0x80, width 72(total 64)
[ 4.581756] EDAC DEBUG: ghes_edac_dmidecode: DIMM27: DDR3 size = 8192 MB(ECC)
[ 4.581759] EDAC DEBUG: ghes_edac_dmidecode: type 24, detail 0x80, width 72(total 64)
[ 4.581763] EDAC DEBUG: ghes_edac_dmidecode: DIMM30: DDR3 size = 8192 MB(ECC)
[ 4.581766] EDAC DEBUG: ghes_edac_dmidecode: type 24, detail 0x80, width 72(total 64)
[ 4.581769] EDAC DEBUG: ghes_edac_dmidecode: DIMM33: DDR3 size = 8192 MB(ECC)
[ 4.581772] EDAC DEBUG: ghes_edac_dmidecode: type 24, detail 0x80, width 72(total 64)
[ 4.581776] EDAC DEBUG: ghes_edac_dmidecode: DIMM36: DDR3 size = 8192 MB(ECC)
[ 4.581778] EDAC DEBUG: ghes_edac_dmidecode: type 24, detail 0x80, width 72(total 64)
[ 4.581782] EDAC DEBUG: ghes_edac_dmidecode: DIMM39: DDR3 size = 8192 MB(ECC)
[ 4.581784] EDAC DEBUG: ghes_edac_dmidecode: type 24, detail 0x80, width 72(total 64)
[ 4.581788] EDAC DEBUG: ghes_edac_dmidecode: DIMM42: DDR3 size = 8192 MB(ECC)
[ 4.581791] EDAC DEBUG: ghes_edac_dmidecode: type 24, detail 0x80, width 72(total 64)
[ 4.581795] EDAC DEBUG: ghes_edac_dmidecode: DIMM45: DDR3 size = 8192 MB(ECC)
[ 4.581797] EDAC DEBUG: ghes_edac_dmidecode: type 24, detail 0x80, width 72(total 64)
[ 4.582724] EDAC MC0: Giving out device to 'ghes_edac.c' 'ghes_edac': DEV ghes
[ 4.591145] EDAC MC1: Giving out device to 'ghes_edac.c' 'ghes_edac': DEV ghes
[ 4.599524] GHES: APEI firmware first mode is enabled by APEI bit and WHEA _OSC.

However, with this BIOS, "Firmware first" mode is not working. Errors are only seen via the mcelog mechanism:

# mcelog
Hardware event. This is not a software error.
MCE 0
CPU 0 BANK 5
MISC 20404c4c86 ADDR 320000
TIME 1360931174 Fri Feb 15 07:26:14 2013
MCG status:
MCi status:
Corrected error
MCi_MISC register valid
MCi_ADDR register valid
MCA: MEMORY CONTROLLER RD_CHANNEL0_ERR
Transaction: Memory read error
STATUS 8c00004000010090 MCGSTATUS 0
MCGCAP 1000c14 APICID 0 SOCKETID 0
CPUID Vendor Intel Family 6 Model 45
Hardware event. This is not a software error.

So, I was unable to test the GHES->EDAC error report method.

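The STATUS value in the mcelog record above can be decoded by hand, which is a useful cross-check when only the raw register is available. The following is a minimal user-space sketch, not code from this series: it assumes the architectural MCi_STATUS layout from the Intel SDM (flag bits 63..58, corrected error count in bits 52:38, and the memory controller error code format `0000 0000 1MMM CCCC` in the low 16 bits). The struct and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Architectural MCi_STATUS flag bits (assumed from the Intel SDM). */
#define MCI_STATUS_VAL   (1ULL << 63)  /* register contents are valid */
#define MCI_STATUS_UC    (1ULL << 61)  /* error was not corrected */
#define MCI_STATUS_MISCV (1ULL << 59)  /* MCi_MISC holds valid data */
#define MCI_STATUS_ADDRV (1ULL << 58)  /* MCi_ADDR holds valid data */

/* Hypothetical container for the decoded fields. */
struct mci_decoded {
	bool valid, uncorrected, miscv, addrv;
	bool is_mem_ctrl_err;      /* MCA code matches 0000 0000 1MMM CCCC */
	unsigned mem_op;           /* MMM: 0 generic, 1 read, 2 write, ... */
	unsigned channel;          /* CCCC: channel number, 15 = unspecified */
	unsigned corrected_count;  /* bits 52:38 */
};

struct mci_decoded decode_mci_status(uint64_t status)
{
	struct mci_decoded d;
	uint16_t mca = (uint16_t)(status & 0xffff);  /* MCA error code */

	d.valid = (status & MCI_STATUS_VAL) != 0;
	d.uncorrected = (status & MCI_STATUS_UC) != 0;
	d.miscv = (status & MCI_STATUS_MISCV) != 0;
	d.addrv = (status & MCI_STATUS_ADDRV) != 0;

	/* Bits 15:7 must be 0000 0000 1 for a memory controller error. */
	d.is_mem_ctrl_err = (mca & 0xff80) == 0x0080;
	d.mem_op = (mca >> 4) & 0x7;
	d.channel = mca & 0xf;

	d.corrected_count = (unsigned)((status >> 38) & 0x7fff);
	return d;
}
```

Calling `decode_mci_status(0x8c00004000010090ULL)` yields a valid, corrected error with MISC and ADDR valid, a memory controller read (`mem_op == 1`) on channel 0, and a corrected error count of 1, which agrees with the "Corrected error" and "RD_CHANNEL0_ERR" lines printed by mcelog.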
Mauro Carvalho Chehab (13):
  edac: lock module owner to avoid error report conflicts
  ghes: move structures/enum to a header file
  ghes: add the needed hooks for EDAC error report
  edac: add a new memory layer type
  ghes_edac: Register at EDAC core the BIOS report
  ghes_edac: Allow registering more than once
  edac: add support for raw error reports
  ghes_edac: add support for reporting errors via EDAC
  ghes_edac: do a better job of filling EDAC DIMM info
  edac: better report error conditions in debug mode
  edac: initialize the core earlier
  ghes_edac.c: Don't credit the same memory dimm twice
  ghes_edac: Improve driver's printk messages

 drivers/acpi/apei/ghes.c     |  64 +++------
 drivers/edac/Kconfig         |  23 ++++
 drivers/edac/Makefile        |   1 +
 drivers/edac/edac_core.h     |  17 +++
 drivers/edac/edac_mc.c       | 136 ++++++++++++++-----
 drivers/edac/edac_mc_sysfs.c |   7 +-
 drivers/edac/edac_module.c   |   2 +-
 drivers/edac/ghes_edac.c     | 313 +++++++++++++++++++++++++++++++++++++++++++
 include/acpi/ghes.h          |  72 ++++++++++
 include/linux/edac.h         |   5 +
 10 files changed, 560 insertions(+), 80 deletions(-)
 create mode 100644 drivers/edac/ghes_edac.c
 create mode 100644 include/acpi/ghes.h

-- 
1.8.1.2
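
The idea behind "lock module owner to avoid error report conflicts" and "Allow registering more than once" can be illustrated with a small user-space model. This is only a sketch of the concept and not the code in the patches: the first driver to register becomes the owner of the error-report path, the same driver may register further controllers (as with MC0 and MC1 in the log), and any other driver is refused. The names `mc_claim`, `mc_release` and the `-EPERM` return value are assumptions made for illustration.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical owner bookkeeping: who owns the report path, and how
 * many controllers that owner currently has registered. */
static const char *mc_owner;
static unsigned mc_owner_refs;

/* Register one controller on behalf of mod_name.
 * Returns 0 on success, -EPERM if a different driver is the owner. */
int mc_claim(const char *mod_name)
{
	if (mc_owner && strcmp(mc_owner, mod_name) != 0)
		return -EPERM;
	mc_owner = mod_name;
	mc_owner_refs++;
	return 0;
}

/* Unregister one controller; the lock is dropped with the last one. */
void mc_release(const char *mod_name)
{
	if (!mc_owner || strcmp(mc_owner, mod_name) != 0)
		return;
	if (--mc_owner_refs == 0)
		mc_owner = NULL;
}
```

In this model, `mc_claim("ghes_edac.c")` succeeds twice in a row, a subsequent `mc_claim("sb_edac.c")` returns `-EPERM`, and the second driver can only register after both GHES controllers have been released. Counting references rather than storing a single flag is what lets one owner register several controllers without losing the lock when the first one goes away.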