* [PATCH EDAC 00/13] Add a driver to report Firmware first errors (via GHES)
@ 2013-02-15 12:44 ` Mauro Carvalho Chehab
  0 siblings, 0 replies; 49+ messages in thread
From: Mauro Carvalho Chehab @ 2013-02-15 12:44 UTC (permalink / raw)
  Cc: linux-acpi, Huang Ying, Tony Luck, Mauro Carvalho Chehab,
	Linux Edac Mailing List, Linux Kernel Mailing List

There are currently three error reporting mechanisms in the Linux
kernel: EDAC, mcelog and GHES.

Unfortunately, these mechanisms cannot all work at the same time,
because the BIOS accessing the error registers may interfere with
the OS reading them.

So, all three mechanisms need to be integrated in order to avoid
such problems.

This patch series adds a new EDAC driver that uses "Firmware first"
APEI/GHES as its error report mechanism. It automatically disables
the hardware-driven EDAC drivers when GHES is enabled, so that the
OS and the BIOS do not both read the very same error registers.

It was tested on a "Lizard Head Pass" Intel machine, equipped with
BIOS SE5C600.86B.99.99.x059.091020121352 (09/10/2012).

Test results:

The driver binds properly to the EDAC core. This BIOS
announces and sets "Firmware first" mode:

[    4.537704] ghes_edac: This EDAC driver relies on BIOS to enumerate memory and get error reports.
[    4.547644] ghes_edac: Unfortunately, not all BIOSes reflect the memory layout correctly.
[    4.556807] ghes_edac: So, the end result of using this driver varies from vendor to vendor.
[    4.566260] ghes_edac: If you find incorrect reports, please ask your vendor to fix its BIOS.
[    4.575811] ghes_edac: This system has 48 DIMM sockets.
[    4.581687] EDAC DEBUG: ghes_edac_dmidecode: DIMM0: DDR3 size = 8192 MB(ECC)
[    4.581691] EDAC DEBUG: ghes_edac_dmidecode: 	type 24, detail 0x80, width 72(total 64)
[    4.581695] EDAC DEBUG: ghes_edac_dmidecode: DIMM3: DDR3 size = 8192 MB(ECC)
[    4.581698] EDAC DEBUG: ghes_edac_dmidecode: 	type 24, detail 0x80, width 72(total 64)
[    4.581702] EDAC DEBUG: ghes_edac_dmidecode: DIMM6: DDR3 size = 8192 MB(ECC)
[    4.581705] EDAC DEBUG: ghes_edac_dmidecode: 	type 24, detail 0x80, width 72(total 64)
[    4.581708] EDAC DEBUG: ghes_edac_dmidecode: DIMM9: DDR3 size = 8192 MB(ECC)
[    4.581711] EDAC DEBUG: ghes_edac_dmidecode: 	type 24, detail 0x80, width 72(total 64)
[    4.581715] EDAC DEBUG: ghes_edac_dmidecode: DIMM12: DDR3 size = 8192 MB(ECC)
[    4.581718] EDAC DEBUG: ghes_edac_dmidecode: 	type 24, detail 0x80, width 72(total 64)
[    4.581722] EDAC DEBUG: ghes_edac_dmidecode: DIMM15: DDR3 size = 8192 MB(ECC)
[    4.581724] EDAC DEBUG: ghes_edac_dmidecode: 	type 24, detail 0x80, width 72(total 64)
[    4.581728] EDAC DEBUG: ghes_edac_dmidecode: DIMM18: DDR3 size = 8192 MB(ECC)
[    4.581730] EDAC DEBUG: ghes_edac_dmidecode: 	type 24, detail 0x80, width 72(total 64)
[    4.581734] EDAC DEBUG: ghes_edac_dmidecode: DIMM21: DDR3 size = 8192 MB(ECC)
[    4.581737] EDAC DEBUG: ghes_edac_dmidecode: 	type 24, detail 0x80, width 72(total 64)
[    4.581741] EDAC DEBUG: ghes_edac_dmidecode: DIMM24: DDR3 size = 8192 MB(ECC)
[    4.581752] EDAC DEBUG: ghes_edac_dmidecode: 	type 24, detail 0x80, width 72(total 64)
[    4.581756] EDAC DEBUG: ghes_edac_dmidecode: DIMM27: DDR3 size = 8192 MB(ECC)
[    4.581759] EDAC DEBUG: ghes_edac_dmidecode: 	type 24, detail 0x80, width 72(total 64)
[    4.581763] EDAC DEBUG: ghes_edac_dmidecode: DIMM30: DDR3 size = 8192 MB(ECC)
[    4.581766] EDAC DEBUG: ghes_edac_dmidecode: 	type 24, detail 0x80, width 72(total 64)
[    4.581769] EDAC DEBUG: ghes_edac_dmidecode: DIMM33: DDR3 size = 8192 MB(ECC)
[    4.581772] EDAC DEBUG: ghes_edac_dmidecode: 	type 24, detail 0x80, width 72(total 64)
[    4.581776] EDAC DEBUG: ghes_edac_dmidecode: DIMM36: DDR3 size = 8192 MB(ECC)
[    4.581778] EDAC DEBUG: ghes_edac_dmidecode: 	type 24, detail 0x80, width 72(total 64)
[    4.581782] EDAC DEBUG: ghes_edac_dmidecode: DIMM39: DDR3 size = 8192 MB(ECC)
[    4.581784] EDAC DEBUG: ghes_edac_dmidecode: 	type 24, detail 0x80, width 72(total 64)
[    4.581788] EDAC DEBUG: ghes_edac_dmidecode: DIMM42: DDR3 size = 8192 MB(ECC)
[    4.581791] EDAC DEBUG: ghes_edac_dmidecode: 	type 24, detail 0x80, width 72(total 64)
[    4.581795] EDAC DEBUG: ghes_edac_dmidecode: DIMM45: DDR3 size = 8192 MB(ECC)
[    4.581797] EDAC DEBUG: ghes_edac_dmidecode: 	type 24, detail 0x80, width 72(total 64)
[    4.582724] EDAC MC0: Giving out device to 'ghes_edac.c' 'ghes_edac': DEV ghes
[    4.591145] EDAC MC1: Giving out device to 'ghes_edac.c' 'ghes_edac': DEV ghes
[    4.599524] GHES: APEI firmware first mode is enabled by APEI bit and WHEA _OSC.

However, with this BIOS, "Firmware first" is not working. The
errors are only seen via the mcelog error mechanism:

# mcelog
Hardware event. This is not a software error.
MCE 0
CPU 0 BANK 5 
MISC 20404c4c86 ADDR 320000 
TIME 1360931174 Fri Feb 15 07:26:14 2013
MCG status:
MCi status:
Corrected error
MCi_MISC register valid
MCi_ADDR register valid
MCA: MEMORY CONTROLLER RD_CHANNEL0_ERR
Transaction: Memory read error
STATUS 8c00004000010090 MCGSTATUS 0
MCGCAP 1000c14 APICID 0 SOCKETID 0 
CPUID Vendor Intel Family 6 Model 45
Hardware event. This is not a software error.

So, I was unable to test the GHES->EDAC error report method.

Mauro Carvalho Chehab (13):
  edac: lock module owner to avoid error report conflicts
  ghes: move structures/enum to a header file
  ghes: add the needed hooks for EDAC error report
  edac: add a new memory layer type
  ghes_edac: Register at EDAC core the BIOS report
  ghes_edac: Allow registering more than once
  edac: add support for raw error reports
  ghes_edac: add support for reporting errors via EDAC
  ghes_edac: do a better job of filling EDAC DIMM info
  edac: better report error conditions in debug mode
  edac: initialize the core earlier
  ghes_edac.c: Don't credit the same memory dimm twice
  ghes_edac: Improve driver's printk messages

 drivers/acpi/apei/ghes.c     |  64 +++------
 drivers/edac/Kconfig         |  23 ++++
 drivers/edac/Makefile        |   1 +
 drivers/edac/edac_core.h     |  17 +++
 drivers/edac/edac_mc.c       | 136 ++++++++++++++-----
 drivers/edac/edac_mc_sysfs.c |   7 +-
 drivers/edac/edac_module.c   |   2 +-
 drivers/edac/ghes_edac.c     | 313 +++++++++++++++++++++++++++++++++++++++++++
 include/acpi/ghes.h          |  72 ++++++++++
 include/linux/edac.h         |   5 +
 10 files changed, 560 insertions(+), 80 deletions(-)
 create mode 100644 drivers/edac/ghes_edac.c
 create mode 100644 include/acpi/ghes.h

-- 
1.8.1.2


* [PATCH EDAC 01/13] edac: lock module owner to avoid error report conflicts
  2013-02-15 12:44 ` Mauro Carvalho Chehab
@ 2013-02-15 12:44   ` Mauro Carvalho Chehab
  -1 siblings, 0 replies; 49+ messages in thread
From: Mauro Carvalho Chehab @ 2013-02-15 12:44 UTC (permalink / raw)
  Cc: linux-acpi, Huang Ying, Tony Luck, Mauro Carvalho Chehab,
	Linux Edac Mailing List, Linux Kernel Mailing List

APEI GHES and i7core_edac/sb_edac currently can be loaded at
the same time, but those are Highlander modules:
	"There can be only one".

There are two reasons for that:

1) Each driver assumes that it is the only one registering at
   the EDAC core, as it is the driver's responsibility to number
   the memory controllers, and all of them start from 0;

2) If the BIOS is handling the memory errors, the OS can't also be
   doing it, as one will interfere with the other.

So, we need to add a module owner lock at the EDAC core,
in order to avoid having two different modules handling memory
errors at the same time. The best way to implement this lock seems
to be to use the driver's name, as this is unique and won't require
changes to every driver.

Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
---
 drivers/edac/edac_mc.c | 26 ++++++++++++++++++++++----
 1 file changed, 22 insertions(+), 4 deletions(-)

diff --git a/drivers/edac/edac_mc.c b/drivers/edac/edac_mc.c
index fb219bc..8e58b1c 100644
--- a/drivers/edac/edac_mc.c
+++ b/drivers/edac/edac_mc.c
@@ -42,6 +42,12 @@
 static DEFINE_MUTEX(mem_ctls_mutex);
 static LIST_HEAD(mc_devices);
 
+/*
+ * Used to lock EDAC MC to just one module, avoiding two drivers e. g.
+ *	apei/ghes and i7core_edac to be used at the same time.
+ */
+static void const *edac_mc_owner;
+
 unsigned edac_dimm_info_location(struct dimm_info *dimm, char *buf,
 			         unsigned len)
 {
@@ -659,9 +665,9 @@ fail1:
 	return 1;
 }
 
-static void del_mc_from_global_list(struct mem_ctl_info *mci)
+static int del_mc_from_global_list(struct mem_ctl_info *mci)
 {
-	atomic_dec(&edac_handlers);
+	int handlers = atomic_dec_return(&edac_handlers);
 	list_del_rcu(&mci->link);
 
 	/* these are for safe removal of devices from global list while
@@ -669,6 +675,8 @@ static void del_mc_from_global_list(struct mem_ctl_info *mci)
 	 */
 	synchronize_rcu();
 	INIT_LIST_HEAD(&mci->link);
+
+	return handlers;
 }
 
 /**
@@ -712,6 +720,7 @@ EXPORT_SYMBOL(edac_mc_find);
 /* FIXME - should a warning be printed if no error detection? correction? */
 int edac_mc_add_mc(struct mem_ctl_info *mci)
 {
+	int ret = -EINVAL;
 	edac_dbg(0, "\n");
 
 #ifdef CONFIG_EDAC_DEBUG
@@ -742,6 +751,11 @@ int edac_mc_add_mc(struct mem_ctl_info *mci)
 #endif
 	mutex_lock(&mem_ctls_mutex);
 
+	if (edac_mc_owner && edac_mc_owner != mci->mod_name) {
+		ret = -EPERM;
+		goto fail0;
+	}
+
 	if (add_mc_to_global_list(mci))
 		goto fail0;
 
@@ -768,6 +782,8 @@ int edac_mc_add_mc(struct mem_ctl_info *mci)
 	edac_mc_printk(mci, KERN_INFO, "Giving out device to '%s' '%s':"
 		" DEV %s\n", mci->mod_name, mci->ctl_name, edac_dev_name(mci));
 
+	edac_mc_owner = mci->mod_name;
+
 	mutex_unlock(&mem_ctls_mutex);
 	return 0;
 
@@ -776,7 +792,7 @@ fail1:
 
 fail0:
 	mutex_unlock(&mem_ctls_mutex);
-	return 1;
+	return ret;
 }
 EXPORT_SYMBOL_GPL(edac_mc_add_mc);
 
@@ -802,7 +818,9 @@ struct mem_ctl_info *edac_mc_del_mc(struct device *dev)
 		return NULL;
 	}
 
-	del_mc_from_global_list(mci);
+	if (!del_mc_from_global_list(mci)) {
+		edac_mc_owner = NULL;
+	}
 	mutex_unlock(&mem_ctls_mutex);
 
 	/* flush workq processes */
-- 
1.8.1.2


* [PATCH EDAC 02/13] ghes: move structures/enum to a header file
  2013-02-15 12:44 ` Mauro Carvalho Chehab
@ 2013-02-15 12:44   ` Mauro Carvalho Chehab
  -1 siblings, 0 replies; 49+ messages in thread
From: Mauro Carvalho Chehab @ 2013-02-15 12:44 UTC (permalink / raw)
  Cc: linux-acpi, Huang Ying, Tony Luck, Mauro Carvalho Chehab,
	Linux Edac Mailing List, Linux Kernel Mailing List

As a ghes_edac driver will need to access ghes structures, in order
to properly handle the errors, move those structures to a separate
header file. No functional changes.

Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
---
 drivers/acpi/apei/ghes.c | 47 ++---------------------------------------------
 include/acpi/ghes.h      | 45 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 47 insertions(+), 45 deletions(-)
 create mode 100644 include/acpi/ghes.h

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 7ae2750..6d0e146 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -48,8 +48,8 @@
 #include <linux/genalloc.h>
 #include <linux/pci.h>
 #include <linux/aer.h>
-#include <acpi/apei.h>
-#include <acpi/hed.h>
+
+#include <acpi/ghes.h>
 #include <asm/mce.h>
 #include <asm/tlbflush.h>
 #include <asm/nmi.h>
@@ -84,42 +84,6 @@
 	((struct acpi_hest_generic_status *)				\
 	 ((struct ghes_estatus_node *)(estatus_node) + 1))
 
-/*
- * One struct ghes is created for each generic hardware error source.
- * It provides the context for APEI hardware error timer/IRQ/SCI/NMI
- * handler.
- *
- * estatus: memory buffer for error status block, allocated during
- * HEST parsing.
- */
-#define GHES_TO_CLEAR		0x0001
-#define GHES_EXITING		0x0002
-
-struct ghes {
-	struct acpi_hest_generic *generic;
-	struct acpi_hest_generic_status *estatus;
-	u64 buffer_paddr;
-	unsigned long flags;
-	union {
-		struct list_head list;
-		struct timer_list timer;
-		unsigned int irq;
-	};
-};
-
-struct ghes_estatus_node {
-	struct llist_node llnode;
-	struct acpi_hest_generic *generic;
-};
-
-struct ghes_estatus_cache {
-	u32 estatus_len;
-	atomic_t count;
-	struct acpi_hest_generic *generic;
-	unsigned long long time_in;
-	struct rcu_head rcu;
-};
-
 bool ghes_disable;
 module_param_named(disable, ghes_disable, bool, 0);
 
@@ -333,13 +297,6 @@ static void ghes_fini(struct ghes *ghes)
 	apei_unmap_generic_address(&ghes->generic->error_status_address);
 }
 
-enum {
-	GHES_SEV_NO = 0x0,
-	GHES_SEV_CORRECTED = 0x1,
-	GHES_SEV_RECOVERABLE = 0x2,
-	GHES_SEV_PANIC = 0x3,
-};
-
 static inline int ghes_severity(int severity)
 {
 	switch (severity) {
diff --git a/include/acpi/ghes.h b/include/acpi/ghes.h
new file mode 100644
index 0000000..3eb8dc4
--- /dev/null
+++ b/include/acpi/ghes.h
@@ -0,0 +1,45 @@
+#include <acpi/apei.h>
+#include <acpi/hed.h>
+
+/*
+ * One struct ghes is created for each generic hardware error source.
+ * It provides the context for APEI hardware error timer/IRQ/SCI/NMI
+ * handler.
+ *
+ * estatus: memory buffer for error status block, allocated during
+ * HEST parsing.
+ */
+#define GHES_TO_CLEAR		0x0001
+#define GHES_EXITING		0x0002
+
+struct ghes {
+	struct acpi_hest_generic *generic;
+	struct acpi_hest_generic_status *estatus;
+	u64 buffer_paddr;
+	unsigned long flags;
+	union {
+		struct list_head list;
+		struct timer_list timer;
+		unsigned int irq;
+	};
+};
+
+struct ghes_estatus_node {
+	struct llist_node llnode;
+	struct acpi_hest_generic *generic;
+};
+
+struct ghes_estatus_cache {
+	u32 estatus_len;
+	atomic_t count;
+	struct acpi_hest_generic *generic;
+	unsigned long long time_in;
+	struct rcu_head rcu;
+};
+
+enum {
+	GHES_SEV_NO = 0x0,
+	GHES_SEV_CORRECTED = 0x1,
+	GHES_SEV_RECOVERABLE = 0x2,
+	GHES_SEV_PANIC = 0x3,
+};
-- 
1.8.1.2


* [PATCH EDAC 03/13] ghes: add the needed hooks for EDAC error report
  2013-02-15 12:44 ` Mauro Carvalho Chehab
@ 2013-02-15 12:44   ` Mauro Carvalho Chehab
  -1 siblings, 0 replies; 49+ messages in thread
From: Mauro Carvalho Chehab @ 2013-02-15 12:44 UTC (permalink / raw)
  Cc: linux-acpi, Huang Ying, Tony Luck, Mauro Carvalho Chehab,
	Linux Edac Mailing List, Linux Kernel Mailing List

In order to allow reporting errors via EDAC, add hooks for:

1) register an EDAC driver;
2) unregister an EDAC driver;
3) report errors via EDAC.

As the EDAC driver will need to access the ghes structure, add it
as one of the parameters of ghes_do_proc().

Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
---
 drivers/acpi/apei/ghes.c | 17 ++++++++++++++---
 include/acpi/ghes.h      | 27 +++++++++++++++++++++++++++
 2 files changed, 41 insertions(+), 3 deletions(-)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 6d0e146..a21d7da 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -409,7 +409,8 @@ static void ghes_clear_estatus(struct ghes *ghes)
 	ghes->flags &= ~GHES_TO_CLEAR;
 }
 
-static void ghes_do_proc(const struct acpi_hest_generic_status *estatus)
+static void ghes_do_proc(struct ghes *ghes,
+			 const struct acpi_hest_generic_status *estatus)
 {
 	int sev, sec_sev;
 	struct acpi_hest_generic_data *gdata;
@@ -421,6 +422,8 @@ static void ghes_do_proc(const struct acpi_hest_generic_status *estatus)
 				 CPER_SEC_PLATFORM_MEM)) {
 			struct cper_sec_mem_err *mem_err;
 			mem_err = (struct cper_sec_mem_err *)(gdata+1);
+			ghes_edac_report_mem_error(ghes, sev, mem_err);
+
 #ifdef CONFIG_X86_MCE
 			apei_mce_report_mem_error(sev == GHES_SEV_CORRECTED,
 						  mem_err);
@@ -639,7 +642,7 @@ static int ghes_proc(struct ghes *ghes)
 		if (ghes_print_estatus(NULL, ghes->generic, ghes->estatus))
 			ghes_estatus_cache_add(ghes->generic, ghes->estatus);
 	}
-	ghes_do_proc(ghes->estatus);
+	ghes_do_proc(ghes, ghes->estatus);
 out:
 	ghes_clear_estatus(ghes);
 	return 0;
@@ -732,7 +735,7 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
 		estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
 		len = apei_estatus_len(estatus);
 		node_len = GHES_ESTATUS_NODE_LEN(len);
-		ghes_do_proc(estatus);
+		ghes_do_proc(estatus_node->ghes, estatus);
 		if (!ghes_estatus_cached(estatus)) {
 			generic = estatus_node->generic;
 			if (ghes_print_estatus(NULL, generic, estatus))
@@ -821,6 +824,7 @@ static int ghes_notify_nmi(unsigned int cmd, struct pt_regs *regs)
 		estatus_node = (void *)gen_pool_alloc(ghes_estatus_pool,
 						      node_len);
 		if (estatus_node) {
+			estatus_node->ghes = ghes;
 			estatus_node->generic = ghes->generic;
 			estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
 			memcpy(estatus, ghes->estatus, len);
@@ -942,6 +946,10 @@ static int ghes_probe(struct platform_device *ghes_dev)
 	}
 	platform_set_drvdata(ghes_dev, ghes);
 
+	rc = ghes_edac_register(ghes, &ghes_dev->dev);
+	if (rc < 0)
+		goto err;
+
 	return 0;
 err:
 	if (ghes) {
@@ -995,6 +1003,9 @@ static int ghes_remove(struct platform_device *ghes_dev)
 	}
 
 	ghes_fini(ghes);
+
+	ghes_edac_unregister(ghes);
+
 	kfree(ghes);
 
 	platform_set_drvdata(ghes_dev, NULL);
diff --git a/include/acpi/ghes.h b/include/acpi/ghes.h
index 3eb8dc4..c6fef72 100644
--- a/include/acpi/ghes.h
+++ b/include/acpi/ghes.h
@@ -22,11 +22,14 @@ struct ghes {
 		struct timer_list timer;
 		unsigned int irq;
 	};
+
+	struct mem_ctl_info *mci;
 };
 
 struct ghes_estatus_node {
 	struct llist_node llnode;
 	struct acpi_hest_generic *generic;
+	struct ghes *ghes;
 };
 
 struct ghes_estatus_cache {
@@ -43,3 +46,27 @@ enum {
 	GHES_SEV_RECOVERABLE = 0x2,
 	GHES_SEV_PANIC = 0x3,
 };
+
+#ifdef CONFIG_EDAC_GHES
+void ghes_edac_report_mem_error(struct ghes *ghes, int sev,
+				struct cper_sec_mem_err *mem_err);
+
+int ghes_edac_register(struct ghes *ghes, struct device *dev);
+
+void ghes_edac_unregister(struct ghes *ghes);
+
+#else
+static inline void ghes_edac_report_mem_error(struct ghes *ghes, int sev,
+				       struct cper_sec_mem_err *mem_err)
+{
+}
+
+static inline int ghes_edac_register(struct ghes *ghes, struct device *dev)
+{
+	return 0;
+}
+
+static inline void ghes_edac_unregister(struct ghes *ghes)
+{
+}
+#endif
-- 
1.8.1.2
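
The header change above uses a common idiom: when CONFIG_EDAC_GHES is not set, the three hooks collapse into empty static inline stubs, so ghes.c can call them without any #ifdef. The following is only a userspace sketch of that idiom. DEMO_CONFIG_EDAC_GHES, struct demo_ghes and all demo_* functions are hypothetical stand-ins, not kernel symbols.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct ghes; only the field the sketch needs. */
struct demo_ghes {
	void *mci;
};

#ifdef DEMO_CONFIG_EDAC_GHES
/* Feature enabled: real implementations would live in another file. */
int demo_edac_register(struct demo_ghes *ghes);
void demo_edac_unregister(struct demo_ghes *ghes);
#else
/* Feature compiled out: the hooks become no-ops that report success. */
static inline int demo_edac_register(struct demo_ghes *ghes)
{
	(void)ghes;
	return 0;
}

static inline void demo_edac_unregister(struct demo_ghes *ghes)
{
	(void)ghes;
}
#endif

/* The callers need no #ifdef: they simply call the hooks. */
static int demo_probe(struct demo_ghes *ghes)
{
	int rc = demo_edac_register(ghes);

	if (rc < 0)
		return rc;
	return 0;
}

static void demo_remove(struct demo_ghes *ghes)
{
	demo_edac_unregister(ghes);
}

/* Runs one probe/remove cycle and returns the probe result. */
int demo_probe_remove_cycle(void)
{
	struct demo_ghes ghes = { NULL };
	int rc = demo_probe(&ghes);

	assert(ghes.mci == NULL);	/* the stub never touches mci */
	demo_remove(&ghes);
	return rc;
}
```

With the macro undefined, demo_probe_remove_cycle() succeeds, which mirrors how ghes_probe() keeps working when the EDAC side is not built.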


* [PATCH EDAC 04/13] edac: add a new memory layer type
  2013-02-15 12:44 ` Mauro Carvalho Chehab
@ 2013-02-15 12:44   ` Mauro Carvalho Chehab
  -1 siblings, 0 replies; 49+ messages in thread
From: Mauro Carvalho Chehab @ 2013-02-15 12:44 UTC (permalink / raw)
  Cc: linux-acpi, Huang Ying, Tony Luck, Mauro Carvalho Chehab,
	Linux Edac Mailing List, Linux Kernel Mailing List

In some cases the memory controller layout is completely hidden
from the OS. This is the case for firmware-driven error reporting,
such as the one provided by GHES. Add a new layer type to be
used by such memory error report mechanisms.
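
The name table touched by this patch is indexed by the layer enum through designated initializers, so adding a layer type means adding one enum value and one table entry. A minimal userspace sketch of that pairing follows. The demo_* names are hypothetical, and only the entries visible in the hunks are modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical mirror of the layer types visible in the patch. */
enum demo_layer_type {
	DEMO_LAYER_CHANNEL,
	DEMO_LAYER_SLOT,
	DEMO_LAYER_CHIP_SELECT,
	DEMO_LAYER_ALL_MEM,	/* layout unknown: one single memory area */
	DEMO_LAYER_COUNT
};

/* Designated initializers keep each name tied to its enum value. */
static const char *const demo_layer_name[DEMO_LAYER_COUNT] = {
	[DEMO_LAYER_CHANNEL]     = "channel",
	[DEMO_LAYER_SLOT]        = "slot",
	[DEMO_LAYER_CHIP_SELECT] = "csrow",
	[DEMO_LAYER_ALL_MEM]     = "memory",
};

/* Bounds-checked lookup; returns "unknown" for out-of-range types. */
const char *demo_layer_to_name(int type)
{
	if (type < 0 || type >= DEMO_LAYER_COUNT || !demo_layer_name[type])
		return "unknown";
	return demo_layer_name[type];
}

/* Convenience helper: non-zero when the layer name matches. */
int demo_layer_name_equals(int type, const char *expected)
{
	return strcmp(demo_layer_to_name(type), expected) == 0;
}
```

A driver that cannot see the real topology would then describe all of its memory with a single layer of the new type and size 1.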

Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
---
 drivers/edac/edac_mc.c | 1 +
 include/linux/edac.h   | 4 ++++
 2 files changed, 5 insertions(+)

diff --git a/drivers/edac/edac_mc.c b/drivers/edac/edac_mc.c
index 8e58b1c..8e33028 100644
--- a/drivers/edac/edac_mc.c
+++ b/drivers/edac/edac_mc.c
@@ -918,6 +918,7 @@ const char *edac_layer_name[] = {
 	[EDAC_MC_LAYER_CHANNEL] = "channel",
 	[EDAC_MC_LAYER_SLOT] = "slot",
 	[EDAC_MC_LAYER_CHIP_SELECT] = "csrow",
+	[EDAC_MC_LAYER_ALL_MEM] = "memory",
 };
 EXPORT_SYMBOL_GPL(edac_layer_name);
 
diff --git a/include/linux/edac.h b/include/linux/edac.h
index 4784213..1b7744c 100644
--- a/include/linux/edac.h
+++ b/include/linux/edac.h
@@ -375,6 +375,9 @@ enum scrub_type {
  * @EDAC_MC_LAYER_CHANNEL:	memory layer is named "channel"
  * @EDAC_MC_LAYER_SLOT:		memory layer is named "slot"
  * @EDAC_MC_LAYER_CHIP_SELECT:	memory layer is named "chip select"
+ * @EDAC_MC_LAYER_ALL_MEM:	memory layout is unknown. All memory is mapped
+ *				as a single memory area. This is used when
+ *				retrieving errors from a firmware driven driver.
  *
  * This enum is used by the drivers to tell edac_mc_sysfs what name should
  * be used when describing a memory stick location.
@@ -384,6 +387,7 @@ enum edac_mc_layer_type {
 	EDAC_MC_LAYER_CHANNEL,
 	EDAC_MC_LAYER_SLOT,
 	EDAC_MC_LAYER_CHIP_SELECT,
+	EDAC_MC_LAYER_ALL_MEM,
 };
 
 /**
-- 
1.8.1.2



* [PATCH EDAC 05/13] ghes_edac: Register at EDAC core the BIOS report
  2013-02-15 12:44 ` Mauro Carvalho Chehab
@ 2013-02-15 12:44   ` Mauro Carvalho Chehab
  -1 siblings, 0 replies; 49+ messages in thread
From: Mauro Carvalho Chehab @ 2013-02-15 12:44 UTC (permalink / raw)
  Cc: linux-acpi, Huang Ying, Tony Luck, Mauro Carvalho Chehab,
	Linux Edac Mailing List, Linux Kernel Mailing List

Register GHES with the EDAC MC core, in order to prevent other
drivers from also handling errors and mangling the error data.

The EDAC core guarantees that just one driver will be used,
so the first one to register (BIOS first) will be the one that
reports the hardware errors.

For now, the EDAC driver does nothing but register with the
EDAC core, preventing the hardware-driven mechanism from
interfering with GHES.
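
The "first one to register wins" policy described above can be pictured as a single ownership slot. The sketch below models only that policy in userspace. It is not how the EDAC core implements it, and the demo_* names and the driver names passed in are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical single-owner slot: name of the driver that reports errors. */
static const char *demo_owner;

/* Returns 0 if the caller became the owner, -1 if the slot is taken. */
int demo_claim_reporting(const char *drv_name)
{
	if (demo_owner)
		return -1;
	demo_owner = drv_name;
	return 0;
}

/* Returns 0 if the caller owned the slot and released it, -1 otherwise. */
int demo_release_reporting(const char *drv_name)
{
	if (!demo_owner || strcmp(demo_owner, drv_name) != 0)
		return -1;
	demo_owner = NULL;
	return 0;
}
```

In this model a firmware-first driver that claims the slot early causes a later claim by a hardware-driven driver to fail, and the slot becomes available again once the owner releases it.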

Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
---
 drivers/edac/Kconfig     | 23 +++++++++++++++
 drivers/edac/Makefile    |  1 +
 drivers/edac/ghes_edac.c | 75 ++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 99 insertions(+)
 create mode 100644 drivers/edac/ghes_edac.c

diff --git a/drivers/edac/Kconfig b/drivers/edac/Kconfig
index 6671992..28503ec 100644
--- a/drivers/edac/Kconfig
+++ b/drivers/edac/Kconfig
@@ -80,6 +80,29 @@ config EDAC_MM_EDAC
 	  occurred so that a particular failing memory module can be
 	  replaced.  If unsure, select 'Y'.
 
+config EDAC_GHES
+	bool "Output ACPI APEI/GHES BIOS detected errors via EDAC"
+	depends on ACPI_APEI_GHES && (EDAC_MM_EDAC=y)
+	default y
+	help
+	  Not all machines support hardware-driven error reporting. Some of
+	  those provide a BIOS-driven error report mechanism via ACPI, using
+	  the APEI/GHES driver. By enabling this option, the error reports
+	  provided by GHES are sent to userspace via the EDAC API.
+
+	  When this option is enabled and a GHES BIOS is detected, the
+	  hardware-driven mechanisms are disabled and the system enters
+	  "Firmware First" mode.
+
+	  Note that keeping both GHES and a hardware-driven error mechanism
+	  enabled won't work well, as the BIOS will race with the OS while
+	  reading the error registers. So, if you do not want to use the
+	  "Firmware First" GHES error mechanism, you should disable GHES
+	  either at compilation time or by passing the "ghes.disable=1"
+	  kernel parameter at boot time.
+
+	  If in doubt, say 'Y'.
+
 config EDAC_AMD64
 	tristate "AMD64 (Opteron, Athlon64) K8, F10h"
 	depends on EDAC_MM_EDAC && AMD_NB && X86_64 && EDAC_DECODE_MCE
diff --git a/drivers/edac/Makefile b/drivers/edac/Makefile
index 5608a9b..4154ed6 100644
--- a/drivers/edac/Makefile
+++ b/drivers/edac/Makefile
@@ -16,6 +16,7 @@ ifdef CONFIG_PCI
 edac_core-y	+= edac_pci.o edac_pci_sysfs.o
 endif
 
+obj-$(CONFIG_EDAC_GHES)			+= ghes_edac.o
 obj-$(CONFIG_EDAC_MCE_INJ)		+= mce_amd_inj.o
 
 edac_mce_amd-y				:= mce_amd.o
diff --git a/drivers/edac/ghes_edac.c b/drivers/edac/ghes_edac.c
new file mode 100644
index 0000000..6952bdf
--- /dev/null
+++ b/drivers/edac/ghes_edac.c
@@ -0,0 +1,75 @@
+#include <acpi/ghes.h>
+#include <linux/edac.h>
+#include "edac_core.h"
+
+#define GHES_PFX   "ghes_edac: "
+#define GHES_EDAC_REVISION " Ver: 1.0.0"
+
+void ghes_edac_report_mem_error(struct ghes *ghes, int sev,
+                               struct cper_sec_mem_err *mem_err)
+{
+}
+EXPORT_SYMBOL_GPL(ghes_edac_report_mem_error);
+
+int ghes_edac_register(struct ghes *ghes, struct device *dev)
+{
+	int rc;
+	struct mem_ctl_info *mci;
+	struct edac_mc_layer layers[1];
+	struct csrow_info *csrow;
+	struct dimm_info *dimm;
+
+	layers[0].type = EDAC_MC_LAYER_ALL_MEM;
+	layers[0].size = 1;
+	layers[0].is_virt_csrow = true;
+	mci = edac_mc_alloc(0, ARRAY_SIZE(layers), layers, 0);
+	if (!mci) {
+		pr_info(GHES_PFX "Can't allocate memory for EDAC data\n");
+		return -ENOMEM;
+	}
+
+	mci->pvt_info = ghes;
+	mci->pdev = dev;
+
+	mci->mtype_cap = MEM_FLAG_EMPTY;
+	mci->edac_ctl_cap = EDAC_FLAG_NONE;
+	mci->edac_cap = EDAC_FLAG_NONE;
+	mci->mod_name = "ghes_edac.c";
+	mci->mod_ver = GHES_EDAC_REVISION;
+	mci->ctl_name = "ghes_edac";
+	mci->dev_name = "ghes";
+
+	csrow = mci->csrows[0];
+	dimm = csrow->channels[0]->dimm;
+
+	/* FIXME: FAKE DATA */
+	dimm->nr_pages = 1000;
+	dimm->grain = 128;
+	dimm->mtype = MEM_UNKNOWN;
+	dimm->dtype = DEV_UNKNOWN;
+	dimm->edac_mode = EDAC_SECDED;
+
+	rc = edac_mc_add_mc(mci);
+	if (rc < 0) {
+		pr_info(GHES_PFX "Can't register at EDAC core\n");
+		edac_mc_free(mci);
+
+		return -ENODEV;
+	}
+
+	ghes->mci = mci;
+	return 0;
+}
+EXPORT_SYMBOL_GPL(ghes_edac_register);
+
+void ghes_edac_unregister(struct ghes *ghes)
+{
+	struct mem_ctl_info *mci = ghes->mci;
+
+	if (!mci)
+		return;
+
+	edac_mc_del_mc(mci->pdev);
+	edac_mc_free(mci);
+}
+EXPORT_SYMBOL_GPL(ghes_edac_unregister);
-- 
1.8.1.2


* [PATCH EDAC 06/13] ghes_edac: Allow registering more than once
  2013-02-15 12:44 ` Mauro Carvalho Chehab
@ 2013-02-15 12:44   ` Mauro Carvalho Chehab
  -1 siblings, 0 replies; 49+ messages in thread
From: Mauro Carvalho Chehab @ 2013-02-15 12:44 UTC (permalink / raw)
  Cc: linux-acpi, Huang Ying, Tony Luck, Mauro Carvalho Chehab,
	Linux Edac Mailing List, Linux Kernel Mailing List

I was expecting that the GHES register call would happen only
once, but it happens once per CPU. So, we need to create
several memory controllers, one per CPU.
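
The locking added by this patch follows a simple rule: take the lock before picking a controller number, release it on every exit path, and bump the counter only after registration succeeded. A userspace sketch of that rule follows, using a pthread mutex in place of the kernel mutex. All demo_* names are hypothetical, and demo_add_mc() merely stands in for the registration step.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

static pthread_mutex_t demo_lock = PTHREAD_MUTEX_INITIALIZER;
static int demo_mc_num;		/* next free controller index */

/* Hypothetical stand-in for struct mem_ctl_info. */
struct demo_mci {
	int mc_idx;
};

/* Hypothetical stand-in for the registration step; fails on request. */
static int demo_add_mc(struct demo_mci *mci, int force_fail)
{
	(void)mci;
	return force_fail ? -1 : 0;
}

/* Returns the controller index that was assigned, or -1 on failure. */
int demo_register(int force_fail)
{
	struct demo_mci *mci;
	int idx;

	/* Allocation and registration are serialized by one lock. */
	pthread_mutex_lock(&demo_lock);
	mci = malloc(sizeof(*mci));
	if (!mci) {
		pthread_mutex_unlock(&demo_lock);
		return -1;
	}
	mci->mc_idx = demo_mc_num;

	if (demo_add_mc(mci, force_fail) < 0) {
		free(mci);
		pthread_mutex_unlock(&demo_lock);
		return -1;	/* the index is not consumed on failure */
	}

	idx = mci->mc_idx;
	demo_mc_num++;		/* consume the index only on success */
	pthread_mutex_unlock(&demo_lock);

	free(mci);		/* the sketch does not keep the controller */
	return idx;
}
```

Because a failed registration leaves the counter untouched, the next successful caller reuses the same index and the assigned numbers stay contiguous.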

Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
---
 drivers/edac/ghes_edac.c | 17 +++++++++++++++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/drivers/edac/ghes_edac.c b/drivers/edac/ghes_edac.c
index 6952bdf..1badac6 100644
--- a/drivers/edac/ghes_edac.c
+++ b/drivers/edac/ghes_edac.c
@@ -5,6 +5,9 @@
 #define GHES_PFX   "ghes_edac: "
 #define GHES_EDAC_REVISION " Ver: 1.0.0"
 
+static DEFINE_MUTEX(ghes_edac_lock);
+static int ghes_edac_mc_num;
+
 void ghes_edac_report_mem_error(struct ghes *ghes, int sev,
                                struct cper_sec_mem_err *mem_err)
 {
@@ -22,9 +25,16 @@ int ghes_edac_register(struct ghes *ghes, struct device *dev)
 	layers[0].type = EDAC_MC_LAYER_ALL_MEM;
 	layers[0].size = 1;
 	layers[0].is_virt_csrow = true;
-	mci = edac_mc_alloc(0, ARRAY_SIZE(layers), layers, 0);
+
+	/*
+	 * We need to serialize edac_mc_alloc() and edac_mc_add_mc(),
+	 * to avoid duplicated memory controller numbers
+	 */
+	mutex_lock(&ghes_edac_lock);
+	mci = edac_mc_alloc(ghes_edac_mc_num, ARRAY_SIZE(layers), layers, 0);
 	if (!mci) {
 		pr_info(GHES_PFX "Can't allocate memory for EDAC data\n");
+		mutex_unlock(&ghes_edac_lock);
 		return -ENOMEM;
 	}
 
@@ -53,11 +63,14 @@ int ghes_edac_register(struct ghes *ghes, struct device *dev)
 	if (rc < 0) {
 		pr_info(GHES_PFX "Can't register at EDAC core\n");
 		edac_mc_free(mci);
-
+		mutex_unlock(&ghes_edac_lock);
 		return -ENODEV;
 	}
 
+
 	ghes->mci = mci;
+	ghes_edac_mc_num++;
+	mutex_unlock(&ghes_edac_lock);
 	return 0;
 }
 EXPORT_SYMBOL_GPL(ghes_edac_register);
-- 
1.8.1.2



* [PATCH EDAC 07/13] edac: add support for raw error reports
  2013-02-15 12:44 ` Mauro Carvalho Chehab
@ 2013-02-15 12:44   ` Mauro Carvalho Chehab
  -1 siblings, 0 replies; 49+ messages in thread
From: Mauro Carvalho Chehab @ 2013-02-15 12:44 UTC (permalink / raw)
  Cc: linux-acpi, Huang Ying, Tony Luck, Mauro Carvalho Chehab,
	Linux Edac Mailing List, Linux Kernel Mailing List

This allows the APEI GHES driver to report errors directly, using
the EDAC error report API.
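
The raw reporting function builds a different detail string for corrected and uncorrected errors: only the corrected case carries the ECC syndrome. The userspace sketch below reproduces just that formatting step with the same format strings as the patch. The demo_* names and the sample values are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical two-value stand-in for enum hw_event_mc_err_type. */
enum demo_err_type {
	DEMO_ERR_CORRECTED,
	DEMO_ERR_UNCORRECTED
};

/* Formats the detail string; returns what snprintf() returns. */
int demo_format_detail(char *buf, size_t len, enum demo_err_type type,
		       unsigned long pfn, unsigned long offset,
		       long grain, unsigned long syndrome)
{
	if (type == DEMO_ERR_CORRECTED)
		return snprintf(buf, len,
				"page:0x%lx offset:0x%lx grain:%ld syndrome:0x%lx",
				pfn, offset, grain, syndrome);

	/* Uncorrected errors are reported without a syndrome. */
	return snprintf(buf, len, "page:0x%lx offset:0x%lx grain:%ld",
			pfn, offset, grain);
}

/* Formats one fixed sample event and compares it with the expected text. */
int demo_detail_equals(enum demo_err_type type, const char *expected)
{
	char detail[80];	/* same buffer size as in the patch */

	demo_format_detail(detail, sizeof(detail), type,
			   0x1234, 0x40, 128, 0xab);
	return strcmp(detail, expected) == 0;
}
```

For the sample values, the corrected case yields "page:0x1234 offset:0x40 grain:128 syndrome:0xab" and the uncorrected case yields the same text without the syndrome field.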

Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
---
 drivers/edac/edac_core.h |  17 ++++++++
 drivers/edac/edac_mc.c   | 109 ++++++++++++++++++++++++++++++++++++-----------
 2 files changed, 100 insertions(+), 26 deletions(-)

diff --git a/drivers/edac/edac_core.h b/drivers/edac/edac_core.h
index 23bb99f..9c5da11 100644
--- a/drivers/edac/edac_core.h
+++ b/drivers/edac/edac_core.h
@@ -453,6 +453,23 @@ extern struct mem_ctl_info *find_mci_by_dev(struct device *dev);
 extern struct mem_ctl_info *edac_mc_del_mc(struct device *dev);
 extern int edac_mc_find_csrow_by_page(struct mem_ctl_info *mci,
 				      unsigned long page);
+
+void edac_raw_mc_handle_error(const enum hw_event_mc_err_type type,
+			  struct mem_ctl_info *mci,
+			  long grain,
+			  const u16 error_count,
+			  const int top_layer,
+			  const int mid_layer,
+			  const int low_layer,
+			  const unsigned long page_frame_number,
+			  const unsigned long offset_in_page,
+			  const unsigned long syndrome,
+			  const char *msg,
+			  const char *location,
+			  const char *label,
+			  const char *other_detail,
+			  const bool enable_per_layer_report);
+
 void edac_mc_handle_error(const enum hw_event_mc_err_type type,
 			  struct mem_ctl_info *mci,
 			  const u16 error_count,
diff --git a/drivers/edac/edac_mc.c b/drivers/edac/edac_mc.c
index 8e33028..8fddf65 100644
--- a/drivers/edac/edac_mc.c
+++ b/drivers/edac/edac_mc.c
@@ -1069,6 +1069,82 @@ static void edac_ue_error(struct mem_ctl_info *mci,
 #define OTHER_LABEL " or "
 
 /**
+ * edac_raw_mc_handle_error - reports a memory event to userspace without doing
+ *			      anything to discover the error location
+ *
+ * @type:		severity of the error (CE/UE/Fatal)
+ * @mci:		a struct mem_ctl_info pointer
+ * @grain:		error granularity
+ * @error_count:	Number of errors of the same type
+ * @top_layer:		Memory layer[0] position
+ * @mid_layer:		Memory layer[1] position
+ * @low_layer:		Memory layer[2] position
+ * @page_frame_number:	mem page where the error occurred
+ * @offset_in_page:	offset of the error inside the page
+ * @syndrome:		ECC syndrome
+ * @msg:		Message meaningful to the end users that
+ *			explains the event
+ * @location:		location of the error, like "csrow:0 channel:1"
+ * @label:		DIMM labels for the affected memory(ies)
+ * @other_detail:	Technical details about the event that
+ *			may help hardware manufacturers and
+ *			EDAC developers to analyse the event
+ * @enable_per_layer_report: should it increment per-layer error counts?
+ *
+ * This raw function is used internally by edac_mc_handle_error(). It should
+ * only be called directly when the hardware error is reported by the BIOS,
+ * as in the case of the APEI GHES driver.
+ */
+void edac_raw_mc_handle_error(const enum hw_event_mc_err_type type,
+			  struct mem_ctl_info *mci,
+			  long grain,
+			  const u16 error_count,
+			  const int top_layer,
+			  const int mid_layer,
+			  const int low_layer,
+			  const unsigned long page_frame_number,
+			  const unsigned long offset_in_page,
+			  const unsigned long syndrome,
+			  const char *msg,
+			  const char *location,
+			  const char *label,
+			  const char *other_detail,
+			  const bool enable_per_layer_report)
+{
+	char detail[80];
+	u8 grain_bits;
+	int pos[EDAC_MAX_LAYERS] = { top_layer, mid_layer, low_layer };
+
+	/* Report the error via the trace interface */
+	grain_bits = fls_long(grain) + 1;
+	trace_mc_event(type, msg, label, error_count,
+		       mci->mc_idx, top_layer, mid_layer, low_layer,
+		       PAGES_TO_MiB(page_frame_number) | offset_in_page,
+		       grain_bits, syndrome, other_detail);
+
+	/* Memory type dependent details about the error */
+	if (type == HW_EVENT_ERR_CORRECTED) {
+		snprintf(detail, sizeof(detail),
+			"page:0x%lx offset:0x%lx grain:%ld syndrome:0x%lx",
+			page_frame_number, offset_in_page,
+			grain, syndrome);
+		edac_ce_error(mci, error_count, pos, msg, location, label,
+			      detail, other_detail, enable_per_layer_report,
+			      page_frame_number, offset_in_page, grain);
+	} else {
+		snprintf(detail, sizeof(detail),
+			"page:0x%lx offset:0x%lx grain:%ld",
+			page_frame_number, offset_in_page, grain);
+
+		edac_ue_error(mci, error_count, pos, msg, location, label,
+			      detail, other_detail, enable_per_layer_report);
+	}
+
+
+}
+EXPORT_SYMBOL_GPL(edac_raw_mc_handle_error);
+
+/**
  * edac_mc_handle_error - reports a memory event to userspace
  *
  * @type:		severity of the error (CE/UE/Fatal)
@@ -1099,7 +1175,7 @@ void edac_mc_handle_error(const enum hw_event_mc_err_type type,
 			  const char *other_detail)
 {
 	/* FIXME: too much for stack: move it to some pre-alocated area */
-	char detail[80], location[80];
+	char location[80];
 	char label[(EDAC_MC_LABEL_LEN + 1 + sizeof(OTHER_LABEL)) * mci->tot_dimms];
 	char *p;
 	int row = -1, chan = -1;
@@ -1107,7 +1183,6 @@ void edac_mc_handle_error(const enum hw_event_mc_err_type type,
 	int i;
 	long grain;
 	bool enable_per_layer_report = false;
-	u8 grain_bits;
 
 	edac_dbg(3, "MC%d\n", mci->mc_idx);
 
@@ -1230,29 +1305,11 @@ void edac_mc_handle_error(const enum hw_event_mc_err_type type,
 	if (p > location)
 		*(p - 1) = '\0';
 
-	/* Report the error via the trace interface */
-	grain_bits = fls_long(grain) + 1;
-	trace_mc_event(type, msg, label, error_count,
-		       mci->mc_idx, top_layer, mid_layer, low_layer,
-		       PAGES_TO_MiB(page_frame_number) | offset_in_page,
-		       grain_bits, syndrome, other_detail);
-
-	/* Memory type dependent details about the error */
-	if (type == HW_EVENT_ERR_CORRECTED) {
-		snprintf(detail, sizeof(detail),
-			"page:0x%lx offset:0x%lx grain:%ld syndrome:0x%lx",
-			page_frame_number, offset_in_page,
-			grain, syndrome);
-		edac_ce_error(mci, error_count, pos, msg, location, label,
-			      detail, other_detail, enable_per_layer_report,
-			      page_frame_number, offset_in_page, grain);
-	} else {
-		snprintf(detail, sizeof(detail),
-			"page:0x%lx offset:0x%lx grain:%ld",
-			page_frame_number, offset_in_page, grain);
-
-		edac_ue_error(mci, error_count, pos, msg, location, label,
-			      detail, other_detail, enable_per_layer_report);
-	}
+	edac_raw_mc_handle_error(type, mci, grain, error_count,
+				 top_layer, mid_layer, low_layer,
+				 page_frame_number, offset_in_page,
+				 syndrome,
+				 msg, location, label, other_detail,
+				 enable_per_layer_report);
 }
 EXPORT_SYMBOL_GPL(edac_mc_handle_error);
-- 
1.8.1.2
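
The block moved into edac_raw_mc_handle_error() derives grain_bits as fls_long(grain) + 1, where fls_long() gives the 1-based position of the most significant set bit (0 for a zero argument). A userspace sketch of what that expression evaluates to follows. demo_fls_long() is a hypothetical loop-based model of the kernel helper, not its implementation.

```c
#include <assert.h>

/*
 * Userspace model of fls_long(): 1-based index of the highest set bit,
 * or 0 when no bit is set.
 */
int demo_fls_long(unsigned long word)
{
	int bit = 0;

	while (word) {
		bit++;
		word >>= 1;
	}
	return bit;
}

/* Same expression as in the patch: fls_long(grain) + 1. */
int demo_grain_bits(long grain)
{
	return demo_fls_long((unsigned long)grain) + 1;
}
```

For the placeholder grain of 128 used by ghes_edac in this series, the model gives demo_fls_long(128) == 8 and therefore a grain_bits value of 9.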


* [PATCH EDAC 07/13] edac: add support for raw error reports
@ 2013-02-15 12:44   ` Mauro Carvalho Chehab
  0 siblings, 0 replies; 49+ messages in thread
From: Mauro Carvalho Chehab @ 2013-02-15 12:44 UTC (permalink / raw)
  Cc: linux-acpi, Huang Ying, Tony Luck, Mauro Carvalho Chehab,
	Linux Edac Mailing List, Linux Kernel Mailing List

This allows the APEI GHES driver to report errors directly, using
the EDAC error report API.

Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
---
 drivers/edac/edac_core.h |  17 ++++++++
 drivers/edac/edac_mc.c   | 109 ++++++++++++++++++++++++++++++++++++-----------
 2 files changed, 100 insertions(+), 26 deletions(-)

diff --git a/drivers/edac/edac_core.h b/drivers/edac/edac_core.h
index 23bb99f..9c5da11 100644
--- a/drivers/edac/edac_core.h
+++ b/drivers/edac/edac_core.h
@@ -453,6 +453,23 @@ extern struct mem_ctl_info *find_mci_by_dev(struct device *dev);
 extern struct mem_ctl_info *edac_mc_del_mc(struct device *dev);
 extern int edac_mc_find_csrow_by_page(struct mem_ctl_info *mci,
 				      unsigned long page);
+
+void edac_raw_mc_handle_error(const enum hw_event_mc_err_type type,
+			  struct mem_ctl_info *mci,
+			  long grain,
+			  const u16 error_count,
+			  const int top_layer,
+			  const int mid_layer,
+			  const int low_layer,
+			  const unsigned long page_frame_number,
+			  const unsigned long offset_in_page,
+			  const unsigned long syndrome,
+			  const char *msg,
+			  const char *location,
+			  const char *label,
+			  const char *other_detail,
+			  const bool enable_per_layer_report);
+
 void edac_mc_handle_error(const enum hw_event_mc_err_type type,
 			  struct mem_ctl_info *mci,
 			  const u16 error_count,
diff --git a/drivers/edac/edac_mc.c b/drivers/edac/edac_mc.c
index 8e33028..8fddf65 100644
--- a/drivers/edac/edac_mc.c
+++ b/drivers/edac/edac_mc.c
@@ -1069,6 +1069,82 @@ static void edac_ue_error(struct mem_ctl_info *mci,
 #define OTHER_LABEL " or "
 
 /**
+ * edac_raw_mc_handle_error - reports a memory event to userspace without doing
+ *			      anything to discover the error location
+ *
+ * @type:		severity of the error (CE/UE/Fatal)
+ * @mci:		a struct mem_ctl_info pointer
+ * @grain:		error granularity
+ * @error_count:	Number of errors of the same type
+ * @top_layer:		Memory layer[0] position
+ * @mid_layer:		Memory layer[1] position
+ * @low_layer:		Memory layer[2] position
+ * @page_frame_number:	mem page where the error occurred
+ * @offset_in_page:	offset of the error inside the page
+ * @syndrome:		ECC syndrome
+ * @msg:		Message meaningful to the end users that
+ *			explains the event
+ * @location:		location of the error, like "csrow:0 channel:1"
+ * @label:		DIMM labels for the affected memory(ies)
+ * @other_detail:	Technical details about the event that
+ *			may help hardware manufacturers and
+ *			EDAC developers to analyse the event
+ * @enable_per_layer_report: should it increment per-layer error counts?
+ *
+ * This raw function is used internally by edac_mc_handle_error(). It should
+ * only be called directly when the hardware error comes directly from the
+ * BIOS, as in the case of the APEI GHES driver.
+ */
+void edac_raw_mc_handle_error(const enum hw_event_mc_err_type type,
+			  struct mem_ctl_info *mci,
+			  long grain,
+			  const u16 error_count,
+			  const int top_layer,
+			  const int mid_layer,
+			  const int low_layer,
+			  const unsigned long page_frame_number,
+			  const unsigned long offset_in_page,
+			  const unsigned long syndrome,
+			  const char *msg,
+			  const char *location,
+			  const char *label,
+			  const char *other_detail,
+			  const bool enable_per_layer_report)
+{
+	char detail[80];
+	u8 grain_bits;
+	int pos[EDAC_MAX_LAYERS] = { top_layer, mid_layer, low_layer };
+
+	/* Report the error via the trace interface */
+	grain_bits = fls_long(grain) + 1;
+	trace_mc_event(type, msg, label, error_count,
+		       mci->mc_idx, top_layer, mid_layer, low_layer,
+		       PAGES_TO_MiB(page_frame_number) | offset_in_page,
+		       grain_bits, syndrome, other_detail);
+
+	/* Memory type dependent details about the error */
+	if (type == HW_EVENT_ERR_CORRECTED) {
+		snprintf(detail, sizeof(detail),
+			"page:0x%lx offset:0x%lx grain:%ld syndrome:0x%lx",
+			page_frame_number, offset_in_page,
+			grain, syndrome);
+		edac_ce_error(mci, error_count, pos, msg, location, label,
+			      detail, other_detail, enable_per_layer_report,
+			      page_frame_number, offset_in_page, grain);
+	} else {
+		snprintf(detail, sizeof(detail),
+			"page:0x%lx offset:0x%lx grain:%ld",
+			page_frame_number, offset_in_page, grain);
+
+		edac_ue_error(mci, error_count, pos, msg, location, label,
+			      detail, other_detail, enable_per_layer_report);
+	}
+
+
+}
+EXPORT_SYMBOL_GPL(edac_raw_mc_handle_error);
+
+/**
  * edac_mc_handle_error - reports a memory event to userspace
  *
  * @type:		severity of the error (CE/UE/Fatal)
@@ -1099,7 +1175,7 @@ void edac_mc_handle_error(const enum hw_event_mc_err_type type,
 			  const char *other_detail)
 {
 	/* FIXME: too much for stack: move it to some pre-alocated area */
-	char detail[80], location[80];
+	char location[80];
 	char label[(EDAC_MC_LABEL_LEN + 1 + sizeof(OTHER_LABEL)) * mci->tot_dimms];
 	char *p;
 	int row = -1, chan = -1;
@@ -1107,7 +1183,6 @@ void edac_mc_handle_error(const enum hw_event_mc_err_type type,
 	int i;
 	long grain;
 	bool enable_per_layer_report = false;
-	u8 grain_bits;
 
 	edac_dbg(3, "MC%d\n", mci->mc_idx);
 
@@ -1230,29 +1305,11 @@ void edac_mc_handle_error(const enum hw_event_mc_err_type type,
 	if (p > location)
 		*(p - 1) = '\0';
 
-	/* Report the error via the trace interface */
-	grain_bits = fls_long(grain) + 1;
-	trace_mc_event(type, msg, label, error_count,
-		       mci->mc_idx, top_layer, mid_layer, low_layer,
-		       PAGES_TO_MiB(page_frame_number) | offset_in_page,
-		       grain_bits, syndrome, other_detail);
-
-	/* Memory type dependent details about the error */
-	if (type == HW_EVENT_ERR_CORRECTED) {
-		snprintf(detail, sizeof(detail),
-			"page:0x%lx offset:0x%lx grain:%ld syndrome:0x%lx",
-			page_frame_number, offset_in_page,
-			grain, syndrome);
-		edac_ce_error(mci, error_count, pos, msg, location, label,
-			      detail, other_detail, enable_per_layer_report,
-			      page_frame_number, offset_in_page, grain);
-	} else {
-		snprintf(detail, sizeof(detail),
-			"page:0x%lx offset:0x%lx grain:%ld",
-			page_frame_number, offset_in_page, grain);
-
-		edac_ue_error(mci, error_count, pos, msg, location, label,
-			      detail, other_detail, enable_per_layer_report);
-	}
+	edac_raw_mc_handle_error(type, mci, grain, error_count,
+				 top_layer, mid_layer, low_layer,
+				 page_frame_number, offset_in_page,
+				 syndrome,
+				 msg, location, label, other_detail,
+				 enable_per_layer_report);
 }
 EXPORT_SYMBOL_GPL(edac_mc_handle_error);
-- 
1.8.1.2


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH EDAC 08/13] ghes_edac: add support for reporting errors via EDAC
  2013-02-15 12:44 ` Mauro Carvalho Chehab
@ 2013-02-15 12:44   ` Mauro Carvalho Chehab
  -1 siblings, 0 replies; 49+ messages in thread
From: Mauro Carvalho Chehab @ 2013-02-15 12:44 UTC (permalink / raw)
  Cc: linux-acpi, Huang Ying, Tony Luck, Mauro Carvalho Chehab,
	Linux Edac Mailing List, Linux Kernel Mailing List

Now that the EDAC core is capable of just forwarding errors via
the userspace API, add a reporting mechanism for GHES errors.

Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
---
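To make the translation step easier to follow, here is a small userspace
sketch of what the report function does with a GHES record. Everything
in it is an assumption made for illustration: the enums only mirror the
GHES_SEV_* and HW_EVENT_ERR_* names, 4 KiB pages are assumed, and the
location string is shortened to three fields. It uses snprintf() so that
a fixed-size buffer cannot be overrun.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Assumption: 4 KiB pages; the kernel uses PAGE_SHIFT and PAGE_MASK. */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  (UINT64_C(1) << SKETCH_PAGE_SHIFT)

/* Hypothetical stand-ins for the GHES severities and EDAC event types. */
enum sketch_sev { SEV_NO, SEV_CORRECTED, SEV_RECOVERABLE, SEV_PANIC };
enum sketch_type { TYPE_CORRECTED, TYPE_UNCORRECTED, TYPE_FATAL, TYPE_INFO };

struct sketch_addr {
	uint64_t page;		/* page frame number */
	uint64_t offset;	/* byte offset inside that page */
};

/* Split a physical address into page frame number and in-page offset. */
static struct sketch_addr split_phys_addr(uint64_t phys)
{
	struct sketch_addr a;

	a.page = phys >> SKETCH_PAGE_SHIFT;
	a.offset = phys & (SKETCH_PAGE_SIZE - 1);
	return a;
}

/* Map a firmware severity to an event type; unknown values become INFO. */
static enum sketch_type sev_to_type(enum sketch_sev sev)
{
	switch (sev) {
	case SEV_CORRECTED:
		return TYPE_CORRECTED;
	case SEV_RECOVERABLE:
		return TYPE_UNCORRECTED;
	case SEV_PANIC:
		return TYPE_FATAL;
	default:
		return TYPE_INFO;
	}
}

/* Bounded formatting of a (shortened) location string. */
static int format_location(char *buf, size_t len, int node, int card,
			   int module)
{
	assert(buf != NULL && len > 0);
	return snprintf(buf, len, "node:%d card:%d module:%d",
			node, card, module);
}
```

With these definitions, the address 0x12345678 splits into page 0x12345
and offset 0x678.
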
 drivers/edac/ghes_edac.c | 36 +++++++++++++++++++++++++++++++++++-
 include/linux/edac.h     |  1 +
 2 files changed, 36 insertions(+), 1 deletion(-)

diff --git a/drivers/edac/ghes_edac.c b/drivers/edac/ghes_edac.c
index 1badac6..52625b5 100644
--- a/drivers/edac/ghes_edac.c
+++ b/drivers/edac/ghes_edac.c
@@ -9,8 +9,42 @@ static DEFINE_MUTEX(ghes_edac_lock);
 static int ghes_edac_mc_num;
 
 void ghes_edac_report_mem_error(struct ghes *ghes, int sev,
-                               struct cper_sec_mem_err *mem_err)
+			        struct cper_sec_mem_err *mem_err)
 {
+	enum hw_event_mc_err_type type;
+	unsigned long page = 0, offset = 0, grain = 0;
+	char location[80];
+	char *label = "unknown";
+
+	if (mem_err->validation_bits & CPER_MEM_VALID_PHYSICAL_ADDRESS) {
+		page = mem_err->physical_addr >> PAGE_SHIFT;
+		offset = mem_err->physical_addr & ~PAGE_MASK;
+		grain = ~(mem_err->physical_addr_mask & ~PAGE_MASK);
+	}
+
+	switch(sev) {
+	case GHES_SEV_CORRECTED:
+		type = HW_EVENT_ERR_CORRECTED;
+		break;
+	case GHES_SEV_RECOVERABLE:
+		type = HW_EVENT_ERR_UNCORRECTED;
+		break;
+	case GHES_SEV_PANIC:
+		type = HW_EVENT_ERR_FATAL;
+		break;
+	default:
+	case GHES_SEV_NO:
+		type = HW_EVENT_ERR_INFO;
+	}
+
+	snprintf(location, sizeof(location),
+		 "node:%d card:%d module:%d bank:%d device:%d row:%d column:%d bit_pos:%d",
+		 mem_err->node, mem_err->card, mem_err->module, mem_err->bank,
+		 mem_err->device, mem_err->row, mem_err->column, mem_err->bit_pos);
+
+	edac_raw_mc_handle_error(type, ghes->mci, grain, 1, 0, 0, 0,
+				 page, offset, 0,
+				 "APEI", location, label, "", 0);
 }
 EXPORT_SYMBOL_GPL(ghes_edac_report_mem_error);
 
diff --git a/include/linux/edac.h b/include/linux/edac.h
index 1b7744c..28232a0 100644
--- a/include/linux/edac.h
+++ b/include/linux/edac.h
@@ -100,6 +100,7 @@ enum hw_event_mc_err_type {
 	HW_EVENT_ERR_CORRECTED,
 	HW_EVENT_ERR_UNCORRECTED,
 	HW_EVENT_ERR_FATAL,
+	HW_EVENT_ERR_INFO,
 };
 
 /**
-- 
1.8.1.2

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH EDAC 09/13] ghes_edac: do a better job of filling EDAC DIMM info
  2013-02-15 12:44 ` Mauro Carvalho Chehab
@ 2013-02-15 12:44   ` Mauro Carvalho Chehab
  -1 siblings, 0 replies; 49+ messages in thread
From: Mauro Carvalho Chehab @ 2013-02-15 12:44 UTC (permalink / raw)
  Cc: linux-acpi, Huang Ying, Tony Luck, Mauro Carvalho Chehab,
	Linux Edac Mailing List, Linux Kernel Mailing List

Instead of just faking values for the DIMM data, get the
information that is available via the DMI table.

Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
---
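The size decoding is the least obvious part of reading a Type 17 record,
so here is a standalone sketch of it. It follows my reading of the
SMBIOS specification and is not the code from this patch: the function
name smbios_dimm_size_mib() is made up, an unknown size is reported as
0 MiB (the patch uses a 32 MiB placeholder instead), and a KB-granular
value is divided by 1024 to get MiB, whereas the patch shifts it left.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Decode the Size (16-bit) and Extended Size (32-bit) fields of an
 * SMBIOS Type 17 (Memory Device) record into MiB.
 *
 * 0x0000  no module installed
 * 0xffff  size unknown (reported as 0 by this sketch)
 * 0x7fff  real size is in Extended Size, bits 30:0, in MB
 * bit 15  set: the low 15 bits are in KB; clear: they are in MB
 */
static uint32_t smbios_dimm_size_mib(uint16_t size, uint32_t extended_size)
{
	if (size == 0xffff)
		return 0;
	if (size == 0x7fff)
		return extended_size & UINT32_C(0x7fffffff);
	if (size & 0x8000)
		return (uint32_t)(size & 0x7fff) >> 10;
	return size;
}
```

An 8 GB module is then reported as 8192, and a 32 GB module, which does
not fit in the 15-bit field, comes back through the extended field.
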
 drivers/edac/ghes_edac.c | 192 ++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 180 insertions(+), 12 deletions(-)

diff --git a/drivers/edac/ghes_edac.c b/drivers/edac/ghes_edac.c
index 52625b5..7bd7161 100644
--- a/drivers/edac/ghes_edac.c
+++ b/drivers/edac/ghes_edac.c
@@ -1,5 +1,6 @@
 #include <acpi/ghes.h>
 #include <linux/edac.h>
+#include <linux/dmi.h>
 #include "edac_core.h"
 
 #define GHES_PFX   "ghes_edac: "
@@ -8,6 +9,155 @@
 static DEFINE_MUTEX(ghes_edac_lock);
 static int ghes_edac_mc_num;
 
+/* Memory Device - Type 17 of SMBIOS spec */
+struct memdev_dmi_entry {
+	u8 type;
+	u8 length;
+	u16 handle;
+	u16 phys_mem_array_handle;
+	u16 mem_err_info_handle;
+	u16 total_width;
+	u16 data_width;
+	u16 size;
+	u8 form_factor;
+	u8 device_set;
+	u8 device_locator;
+	u8 bank_locator;
+	u8 memory_type;
+	u16 type_detail;
+	u16 speed;
+	u8 manufacturer;
+	u8 serial_number;
+	u8 asset_tag;
+	u8 part_number;
+	u8 attributes;
+	u32 extended_size;
+	u16 conf_mem_clk_speed;
+} __attribute__((__packed__));
+
+struct ghes_edac_dimm_fill {
+	struct mem_ctl_info *mci;
+	unsigned count;
+};
+
+char *memory_type[] = {
+	[MEM_EMPTY] = "EMPTY",
+	[MEM_RESERVED] = "RESERVED",
+	[MEM_UNKNOWN] = "UNKNOWN",
+	[MEM_FPM] = "FPM",
+	[MEM_EDO] = "EDO",
+	[MEM_BEDO] = "BEDO",
+	[MEM_SDR] = "SDR",
+	[MEM_RDR] = "RDR",
+	[MEM_DDR] = "DDR",
+	[MEM_RDDR] = "RDDR",
+	[MEM_RMBS] = "RMBS",
+	[MEM_DDR2] = "DDR2",
+	[MEM_FB_DDR2] = "FB_DDR2",
+	[MEM_RDDR2] = "RDDR2",
+	[MEM_XDR] = "XDR",
+	[MEM_DDR3] = "DDR3",
+	[MEM_RDDR3] = "RDDR3",
+};
+
+static void ghes_edac_count_dimms(const struct dmi_header *dh, void *arg)
+{
+	int *num_dimm = arg;
+
+	if (dh->type == DMI_ENTRY_MEM_DEVICE)
+		(*num_dimm)++;
+}
+
+static void ghes_edac_dmidecode(const struct dmi_header *dh, void *arg)
+{
+	struct ghes_edac_dimm_fill *dimm_fill = arg;
+	struct mem_ctl_info *mci = dimm_fill->mci;
+
+	if (dh->type == DMI_ENTRY_MEM_DEVICE) {
+		struct memdev_dmi_entry *entry = (struct memdev_dmi_entry *)dh;
+		struct dimm_info *dimm = EDAC_DIMM_PTR(mci->layers, mci->dimms,
+						       mci->n_layers,
+						       dimm_fill->count, 0, 0);
+
+		if (entry->size == 0xffff) {
+			pr_info(GHES_PFX "Can't get dimm size\n");
+			dimm->nr_pages = MiB_TO_PAGES(32);/* Unknown */
+		} else if (entry->size == 0x7fff) {
+			dimm->nr_pages = MiB_TO_PAGES(entry->extended_size);
+		} else {
+			if (entry->size & 1 << 15)
+				dimm->nr_pages = MiB_TO_PAGES((entry->size &
+							       0x7fff) << 10);
+			else
+				dimm->nr_pages = MiB_TO_PAGES(entry->size);
+		}
+
+		switch(entry->memory_type) {
+		case 0x12:
+			if (entry->type_detail & 1 << 13)
+				dimm->mtype = MEM_RDDR;
+			else
+				dimm->mtype = MEM_DDR;
+			break;
+		case 0x13:
+			if (entry->type_detail & 1 << 13)
+				dimm->mtype = MEM_RDDR2;
+			else
+				dimm->mtype = MEM_DDR2;
+			break;
+		case 0x14:
+			dimm->mtype = MEM_FB_DDR2;
+			break;
+		case 0x18:
+			if (entry->type_detail & 1 << 13)
+				dimm->mtype = MEM_RDDR3;
+			else
+				dimm->mtype = MEM_DDR3;
+			break;
+		default:
+			if (entry->type_detail & 1 << 6)
+				dimm->mtype = MEM_RMBS;
+			else if ((entry->type_detail & ((1 << 7) | (1 << 13)))
+				 == ((1 << 7) | (1 << 13)))
+				dimm->mtype = MEM_RDR;
+			else if (entry->type_detail & 1 << 7)
+				dimm->mtype = MEM_SDR;
+			else if (entry->type_detail & 1 << 9)
+				dimm->mtype = MEM_EDO;
+			else
+				dimm->mtype = MEM_UNKNOWN;
+		}
+
+		/*
+		 * Actually, we can only detect if the memory has bits for
+		 * checksum or not
+		 */
+		if (entry->total_width == entry->data_width)
+			dimm->edac_mode = EDAC_NONE;
+		else
+			dimm->edac_mode = EDAC_SECDED;
+
+		dimm->dtype = DEV_UNKNOWN;
+		dimm->grain = 128;		/* Likely the worst case */
+
+		/*
+		 * FIXME: It shouldn't be hard to also fill the DIMM labels
+		 */
+
+		if (dimm->nr_pages) {
+			pr_info(GHES_PFX "DIMM%i: %s size = %d MB%s\n",
+				dimm_fill->count, memory_type[dimm->mtype],
+				PAGES_TO_MiB(dimm->nr_pages),
+				(dimm->edac_mode != EDAC_NONE)? "(ECC)" : "");
+			pr_info (GHES_PFX "\ttype %d, detail 0x%02x, width %d(total %d)\n",
+				entry->memory_type, entry->type_detail,
+				entry->total_width, entry->data_width);
+		}
+
+		dimm_fill->count++;
+	}
+}
+
 void ghes_edac_report_mem_error(struct ghes *ghes, int sev,
 			        struct cper_sec_mem_err *mem_err)
 {
@@ -50,14 +200,23 @@ EXPORT_SYMBOL_GPL(ghes_edac_report_mem_error);
 
 int ghes_edac_register(struct ghes *ghes, struct device *dev)
 {
-	int rc;
+	bool fake = false;
+	int rc, num_dimm = 0;
 	struct mem_ctl_info *mci;
 	struct edac_mc_layer layers[1];
-	struct csrow_info *csrow;
-	struct dimm_info *dimm;
+	struct ghes_edac_dimm_fill dimm_fill;
+
+	/* Get the number of DIMMs */
+	dmi_walk(ghes_edac_count_dimms, &num_dimm);
+
+	/* Check if we've got a bogus BIOS */
+	if (num_dimm == 0) {
+		fake = true;
+		num_dimm = 1;
+	}
 
 	layers[0].type = EDAC_MC_LAYER_ALL_MEM;
-	layers[0].size = 1;
+	layers[0].size = num_dimm;
 	layers[0].is_virt_csrow = true;
 
 	/*
@@ -65,6 +224,8 @@ int ghes_edac_register(struct ghes *ghes, struct device *dev)
 	 * to avoid duplicated memory controller numbers
 	 */
 	mutex_lock(&ghes_edac_lock);
+	pr_info("ghes_edac#%d: allocating space for %d dimms\n",
+		ghes_edac_mc_num, num_dimm);
 	mci = edac_mc_alloc(ghes_edac_mc_num, ARRAY_SIZE(layers), layers, 0);
 	if (!mci) {
 		pr_info(GHES_PFX "Can't allocate memory for EDAC data\n");
@@ -83,15 +244,22 @@ int ghes_edac_register(struct ghes *ghes, struct device *dev)
 	mci->ctl_name = "ghes_edac";
 	mci->dev_name = "ghes";
 
-	csrow = mci->csrows[0];
-	dimm = csrow->channels[0]->dimm;
+	if (!fake) {
+		/* Fill DIMM info from DMI */
+		dimm_fill.count = 0;
+		dimm_fill.mci = mci;
+		dmi_walk (ghes_edac_dmidecode, &dimm_fill);
+	} else {
+		struct dimm_info *dimm = EDAC_DIMM_PTR(mci->layers, mci->dimms,
+						       mci->n_layers, 0, 0, 0);
 
-	/* FIXME: FAKE DATA */
-	dimm->nr_pages = 1000;
-	dimm->grain = 128;
-	dimm->mtype = MEM_UNKNOWN;
-	dimm->dtype = DEV_UNKNOWN;
-	dimm->edac_mode = EDAC_SECDED;
+		pr_info(GHES_PFX "Crappy BIOS detected. Faking DIMM EDAC data\n");
+		dimm->nr_pages = 1000;
+		dimm->grain = 128;
+		dimm->mtype = MEM_UNKNOWN;
+		dimm->dtype = DEV_UNKNOWN;
+		dimm->edac_mode = EDAC_SECDED;
+	}
 
 	rc = edac_mc_add_mc(mci);
 	if (rc < 0) {
-- 
1.8.1.2

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH EDAC 10/13] edac: better report error conditions in debug mode
  2013-02-15 12:44 ` Mauro Carvalho Chehab
@ 2013-02-15 12:44   ` Mauro Carvalho Chehab
  -1 siblings, 0 replies; 49+ messages in thread
From: Mauro Carvalho Chehab @ 2013-02-15 12:44 UTC (permalink / raw)
  Cc: linux-acpi, Huang Ying, Tony Luck, Mauro Carvalho Chehab,
	Linux Edac Mailing List, Linux Kernel Mailing List

It is hard to find out what went wrong without a proper error
report. Improve the reporting in debug mode.

Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
---
 drivers/edac/edac_mc_sysfs.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/edac/edac_mc_sysfs.c b/drivers/edac/edac_mc_sysfs.c
index 0ca1ca7..9c58da6 100644
--- a/drivers/edac/edac_mc_sysfs.c
+++ b/drivers/edac/edac_mc_sysfs.c
@@ -429,8 +429,12 @@ static int edac_create_csrow_objects(struct mem_ctl_info *mci)
 		if (!nr_pages_per_csrow(csrow))
 			continue;
 		err = edac_create_csrow_object(mci, mci->csrows[i], i);
-		if (err < 0)
+		if (err < 0) {
+			edac_dbg(1,
+				 "failure: create csrow objects for csrow %d\n",
+				 i);
 			goto error;
+		}
 	}
 	return 0;
 
@@ -1007,6 +1011,7 @@ int edac_create_sysfs_mci_device(struct mem_ctl_info *mci)
 	edac_dbg(0, "creating device %s\n", dev_name(&mci->dev));
 	err = device_add(&mci->dev);
 	if (err < 0) {
+		edac_dbg(1, "failure: create device %s\n", dev_name(&mci->dev));
 		bus_unregister(&mci->bus);
 		kfree(mci->bus.name);
 		return err;
-- 
1.8.1.2

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH EDAC 11/13] edac: initialize the core earlier
  2013-02-15 12:44 ` Mauro Carvalho Chehab
@ 2013-02-15 12:44   ` Mauro Carvalho Chehab
  -1 siblings, 0 replies; 49+ messages in thread
From: Mauro Carvalho Chehab @ 2013-02-15 12:44 UTC (permalink / raw)
  Cc: linux-acpi, Huang Ying, Tony Luck, Mauro Carvalho Chehab,
	Linux Edac Mailing List, Linux Kernel Mailing List

For ghes_edac to work when built in, the EDAC core must be
initialized earlier. Otherwise, the ghes_edac driver initializes
before edac_mc_sysfs_init() is called:

...
[    4.998373] EDAC MC0: Giving out device to 'ghes_edac.c' 'ghes_edac': DEV ghes
...
[    4.998373] EDAC MC1: Giving out device to 'ghes_edac.c' 'ghes_edac': DEV ghes
[    6.519495] EDAC MC: Ver: 3.0.0
[    6.523749] EDAC DEBUG: edac_mc_sysfs_init: device mc created

The net result is that no EDAC sysfs nodes will appear.

Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
---
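The ordering problem can be modelled in a few lines of userspace C. This
is a toy model under stated assumptions, not kernel code: built-in
initcalls run level by level (subsys is level 4, and module_init() maps
to the device level, 6), and within a level they run in link order,
which the array order stands in for here. That the GHES driver sits at
the device level and links ahead of EDAC is my assumption, inferred from
the log above. The names run_order() and struct initcall are invented
for the sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Initcall levels relevant here, numbered as in include/linux/init.h. */
enum { LEVEL_SUBSYS = 4, LEVEL_DEVICE = 6, LEVEL_MAX = 7 };

struct initcall {
	const char *name;
	int level;
};

/*
 * Store the names in the order a built-in kernel would run them:
 * lower levels first, array (link) order within a level.
 * Returns the number of names written to out[].
 */
static size_t run_order(const struct initcall *calls, size_t n,
			const char **out)
{
	size_t written = 0;
	size_t i;
	int level;

	for (level = 0; level <= LEVEL_MAX; level++)
		for (i = 0; i < n; i++)
			if (calls[i].level == level)
				out[written++] = calls[i].name;
	return written;
}
```

With both entries at LEVEL_DEVICE the GHES entry runs first. Moving the
EDAC entry to LEVEL_SUBSYS puts it ahead regardless of link order, which
is the effect the one-line change below relies on.
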
 drivers/edac/edac_module.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/edac/edac_module.c b/drivers/edac/edac_module.c
index 12c951a..a66941f 100644
--- a/drivers/edac/edac_module.c
+++ b/drivers/edac/edac_module.c
@@ -146,7 +146,7 @@ static void __exit edac_exit(void)
 /*
  * Inform the kernel of our entry and exit points
  */
-module_init(edac_init);
+subsys_initcall(edac_init);
 module_exit(edac_exit);
 
 MODULE_LICENSE("GPL");
-- 
1.8.1.2

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH EDAC 12/13] ghes_edac.c: Don't credit the same memory dimm twice
  2013-02-15 12:44 ` Mauro Carvalho Chehab
@ 2013-02-15 12:45   ` Mauro Carvalho Chehab
  -1 siblings, 0 replies; 49+ messages in thread
From: Mauro Carvalho Chehab @ 2013-02-15 12:45 UTC (permalink / raw)
  Cc: linux-acpi, Huang Ying, Tony Luck, Mauro Carvalho Chehab,
	Linux Edac Mailing List, Linux Kernel Mailing List

In my tests on a system with four E5-4650 CPUs, the GHES EDAC
driver is called twice. Because the SMBIOS DMI enumeration walks
every DIMM socket in the system on each call, the memory on this
machine, which is equipped with 128 GB of RAM, is displayed
twice:

          +-----------------------+
          |    mc0    |    mc1    |
----------+-----------------------+
memory45: |  8192 MB  |  8192 MB  |
memory44: |     0 MB  |     0 MB  |
----------+-----------------------+
memory43: |     0 MB  |     0 MB  |
memory42: |  8192 MB  |  8192 MB  |
----------+-----------------------+
memory41: |     0 MB  |     0 MB  |
memory40: |     0 MB  |     0 MB  |
----------+-----------------------+
memory39: |  8192 MB  |  8192 MB  |
memory38: |     0 MB  |     0 MB  |
----------+-----------------------+
memory37: |     0 MB  |     0 MB  |
memory36: |  8192 MB  |  8192 MB  |
----------+-----------------------+
memory35: |     0 MB  |     0 MB  |
memory34: |     0 MB  |     0 MB  |
----------+-----------------------+
memory33: |  8192 MB  |  8192 MB  |
memory32: |     0 MB  |     0 MB  |
----------+-----------------------+
memory31: |     0 MB  |     0 MB  |
memory30: |  8192 MB  |  8192 MB  |
----------+-----------------------+
memory29: |     0 MB  |     0 MB  |
memory28: |     0 MB  |     0 MB  |
----------+-----------------------+
memory27: |  8192 MB  |  8192 MB  |
memory26: |     0 MB  |     0 MB  |
----------+-----------------------+
memory25: |     0 MB  |     0 MB  |
memory24: |  8192 MB  |  8192 MB  |
----------+-----------------------+
memory23: |     0 MB  |     0 MB  |
memory22: |     0 MB  |     0 MB  |
----------+-----------------------+
memory21: |  8192 MB  |  8192 MB  |
memory20: |     0 MB  |     0 MB  |
----------+-----------------------+
memory19: |     0 MB  |     0 MB  |
memory18: |  8192 MB  |  8192 MB  |
----------+-----------------------+
memory17: |     0 MB  |     0 MB  |
memory16: |     0 MB  |     0 MB  |
----------+-----------------------+
memory15: |  8192 MB  |  8192 MB  |
memory14: |     0 MB  |     0 MB  |
----------+-----------------------+
memory13: |     0 MB  |     0 MB  |
memory12: |  8192 MB  |  8192 MB  |
----------+-----------------------+
memory11: |     0 MB  |     0 MB  |
memory10: |     0 MB  |     0 MB  |
----------+-----------------------+
memory9:  |  8192 MB  |  8192 MB  |
memory8:  |     0 MB  |     0 MB  |
----------+-----------------------+
memory7:  |     0 MB  |     0 MB  |
memory6:  |  8192 MB  |  8192 MB  |
----------+-----------------------+
memory5:  |     0 MB  |     0 MB  |
memory4:  |     0 MB  |     0 MB  |
----------+-----------------------+
memory3:  |  8192 MB  |  8192 MB  |
memory2:  |     0 MB  |     0 MB  |
----------+-----------------------+
memory1:  |     0 MB  |     0 MB  |
memory0:  |  8192 MB  |  8192 MB  |
----------+-----------------------+

Total sum of 256 GB.

As there's no reliable way to credit DIMMs to the right memory
controller, just put everything on memory controller 0 (which should
always exist).
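
The effect can be sketched in plain userspace C. This is only a model
under stated assumptions, not the driver code: struct mc_model,
fill_dimms() and total_mb() are hypothetical names, and the DIMM sizes
are passed in as an array instead of coming from a DMI walk.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DIMMS 48	/* upper bound chosen for this sketch */

/* Hypothetical userspace model of one memory controller's DIMM table. */
struct mc_model {
	int mc_num;
	size_t nr_dimms;
	unsigned int size_mb[MAX_DIMMS];
};

/*
 * Credit the firmware-enumerated DIMMs to controller 0 only.
 * Every other controller is left with an empty table.
 */
static void fill_dimms(struct mc_model *mc, const unsigned int *dmi_size_mb,
		       size_t n)
{
	size_t i;

	assert(mc != NULL);
	mc->nr_dimms = 0;
	if (mc->mc_num != 0)
		return;
	for (i = 0; i < n && i < MAX_DIMMS; i++)
		mc->size_mb[mc->nr_dimms++] = dmi_size_mb[i];
}

/* Sum the memory credited to all controllers, in MB. */
static unsigned long total_mb(const struct mc_model *mcs, size_t nr_mc)
{
	unsigned long sum = 0;
	size_t i, j;

	for (i = 0; i < nr_mc; i++)
		for (j = 0; j < mcs[i].nr_dimms; j++)
			sum += mcs[i].size_mb[j];
	return sum;
}
```

With two controllers fed the same list, only the first one contributes
to the total, so each DIMM is counted once.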

Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
---
 drivers/edac/ghes_edac.c | 17 +++++++++++++----
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/drivers/edac/ghes_edac.c b/drivers/edac/ghes_edac.c
index 7bd7161..97a0d53 100644
--- a/drivers/edac/ghes_edac.c
+++ b/drivers/edac/ghes_edac.c
@@ -245,10 +245,19 @@ int ghes_edac_register(struct ghes *ghes, struct device *dev)
 	mci->dev_name = "ghes";
 
 	if (!fake) {
-		/* Fill DIMM info from DMI */
-		dimm_fill.count = 0;
-		dimm_fill.mci = mci;
-		dmi_walk (ghes_edac_dmidecode, &dimm_fill);
+		/*
+		 * Fill DIMM info from DMI for the memory controller #0
+		 *
+		 * Keep it in blank for the other memory controllers, as
+		 * there's no reliable way to properly credit each DIMM to
+		 * the memory controller, as different BIOSes fill the
+		 * DMI bank location fields on different ways
+		 */
+		if (!ghes_edac_mc_num) {
+			dimm_fill.count = 0;
+			dimm_fill.mci = mci;
+			dmi_walk (ghes_edac_dmidecode, &dimm_fill);
+		}
 	} else {
 		struct dimm_info *dimm = EDAC_DIMM_PTR(mci->layers, mci->dimms,
 						       mci->n_layers, 0, 0, 0);
-- 
1.8.1.2

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH EDAC 13/13] ghes_edac: Improve driver's printk messages
  2013-02-15 12:44 ` Mauro Carvalho Chehab
@ 2013-02-15 12:45   ` Mauro Carvalho Chehab
  -1 siblings, 0 replies; 49+ messages in thread
From: Mauro Carvalho Chehab @ 2013-02-15 12:45 UTC (permalink / raw)
  Cc: linux-acpi, Huang Ying, Tony Luck, Mauro Carvalho Chehab,
	Linux Edac Mailing List, Linux Kernel Mailing List

Provide a better infrastructure for printk's inside the driver:
	- use edac_dbg() for debug messages;
	- standardize the usage of pr_info();
	- provide warning about the risk of relying on this
	  driver.

While here, change the size of the fake memory to 1 page. This is
as good or as bad as 1000 pages, but it is easier for userspace to
detect, as I don't expect that any machine implementing GHES would
have just 1 page available ;)
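
As a rough illustration of the two placeholder sizes, here is a small
userspace sketch. It assumes 4 KiB pages and a tool that already has the
raw page count; pages_to_mib() and is_placeholder_dimm() are hypothetical
helpers, not part of the driver or of any existing tool.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/* Convert a page count to whole MiB, truncating any remainder. */
static unsigned long pages_to_mib(unsigned long pages)
{
	assert(SKETCH_PAGE_SHIFT <= 20);
	return pages >> (20 - SKETCH_PAGE_SHIFT);
}

/* Hypothetical check: treat a single-page DIMM as the fake entry. */
static bool is_placeholder_dimm(unsigned long nr_pages)
{
	return nr_pages == 1;
}
```

Under these assumptions 1000 pages truncate to 3 MiB and 1 page to 0 MiB,
so neither value resembles a real DIMM, but a count of exactly 1 is the
simpler one to test for.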

Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
---
 drivers/edac/ghes_edac.c | 28 +++++++++++++++++++++-------
 1 file changed, 21 insertions(+), 7 deletions(-)

diff --git a/drivers/edac/ghes_edac.c b/drivers/edac/ghes_edac.c
index 97a0d53..48d6cd9 100644
--- a/drivers/edac/ghes_edac.c
+++ b/drivers/edac/ghes_edac.c
@@ -80,7 +80,8 @@ static void ghes_edac_dmidecode(const struct dmi_header *dh, void *arg)
 						       dimm_fill->count, 0, 0);
 
 		if (entry->size == 0xffff) {
-			pr_info(GHES_PFX "Can't get dimm size\n");
+			pr_info(GHES_PFX "Can't get DIMM%i size\n",
+				dimm_fill->count);
 			dimm->nr_pages = MiB_TO_PAGES(32);/* Unknown */
 		} else if (entry->size == 0x7fff) {
 			dimm->nr_pages = MiB_TO_PAGES(entry->extended_size);
@@ -145,11 +146,11 @@ static void ghes_edac_dmidecode(const struct dmi_header *dh, void *arg)
 		 */
 
 		if (dimm->nr_pages) {
-			pr_info(GHES_PFX "DIMM%i: %s size = %d MB%s\n",
+			edac_dbg(1, "DIMM%i: %s size = %d MB%s\n",
 				dimm_fill->count, memory_type[dimm->mtype],
 				PAGES_TO_MiB(dimm->nr_pages),
 				(dimm->edac_mode != EDAC_NONE)? "(ECC)" : "");
-			pr_info (GHES_PFX "\ttype %d, detail 0x%02x, width %d(total %d)\n",
+			edac_dbg(2, "\ttype %d, detail 0x%02x, width %d(total %d)\n",
 				entry->memory_type, entry->type_detail,
 				entry->total_width, entry->data_width);
 		}
@@ -191,6 +192,7 @@ void ghes_edac_report_mem_error(struct ghes *ghes, int sev,
 		mem_err->node, mem_err->card, mem_err->module,
 		mem_err->bank, mem_err->device, mem_err->row, mem_err->column,
 		mem_err->bit_pos);
+	edac_dbg(3, "error at location %s\n", location);
 
 	edac_raw_mc_handle_error(type, ghes->mci, grain, 1, 0, 0, 0,
 				 page, offset, 0,
@@ -224,8 +226,6 @@ int ghes_edac_register(struct ghes *ghes, struct device *dev)
 	 * to avoid duplicated memory controller numbers
 	 */
 	mutex_lock(&ghes_edac_lock);
-	pr_info("ghes_edac#%d: allocating space for %d dimms\n",
-		ghes_edac_mc_num, num_dimm);
 	mci = edac_mc_alloc(ghes_edac_mc_num, ARRAY_SIZE(layers), layers, 0);
 	if (!mci) {
 		pr_info(GHES_PFX "Can't allocate memory for EDAC data\n");
@@ -244,6 +244,21 @@ int ghes_edac_register(struct ghes *ghes, struct device *dev)
 	mci->ctl_name = "ghes_edac";
 	mci->dev_name = "ghes";
 
+	if (!ghes_edac_mc_num) {
+		if (!fake) {
+			pr_info(GHES_PFX "This EDAC driver relies on BIOS to enumerate memory and get error reports.\n");
+			pr_info(GHES_PFX "Unfortunately, not all BIOSes reflect the memory layout correctly.\n");
+			pr_info(GHES_PFX "So, the end result of using this driver varies from vendor to vendor.\n");
+			pr_info(GHES_PFX "If you find incorrect reports, please ask your vendor to fix its BIOS.\n");
+			pr_info(GHES_PFX "This system has %d DIMM sockets.\n",
+				num_dimm);
+		} else {
+			pr_info(GHES_PFX "This system has a very crappy BIOS: It doesn't even list the DIMMS.\n");
+			pr_info(GHES_PFX "Its SMBIOS info is wrong. It is doubtful that the error report would\n");
+			pr_info(GHES_PFX "work on such system. Use this driver with caution\n");
+		}
+	}
+
 	if (!fake) {
 		/*
 		 * Fill DIMM info from DMI for the memory controller #0
@@ -262,8 +277,7 @@ int ghes_edac_register(struct ghes *ghes, struct device *dev)
 		struct dimm_info *dimm = EDAC_DIMM_PTR(mci->layers, mci->dimms,
 						       mci->n_layers, 0, 0, 0);
 
-		pr_info(GHES_PFX "Crappy BIOS detected. Faking DIMM EDAC data\n");
-		dimm->nr_pages = 1000;
+		dimm->nr_pages = 1;
 		dimm->grain = 128;
 		dimm->mtype = MEM_UNKNOWN;
 		dimm->dtype = DEV_UNKNOWN;
-- 
1.8.1.2

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [PATCH EDAC 07/13] edac: add support for raw error reports
  2013-02-15 12:44   ` Mauro Carvalho Chehab
  (?)
@ 2013-02-15 14:13   ` Borislav Petkov
  2013-02-15 15:25     ` Mauro Carvalho Chehab
  -1 siblings, 1 reply; 49+ messages in thread
From: Borislav Petkov @ 2013-02-15 14:13 UTC (permalink / raw)
  To: Mauro Carvalho Chehab
  Cc: linux-acpi, Huang Ying, Tony Luck, Linux Edac Mailing List,
	Linux Kernel Mailing List

On Fri, Feb 15, 2013 at 10:44:55AM -0200, Mauro Carvalho Chehab wrote:
> That allows APEI GHES driver to report errors directly, using
> the EDAC error report API.
> 
> Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
> ---
>  drivers/edac/edac_core.h |  17 ++++++++
>  drivers/edac/edac_mc.c   | 109 ++++++++++++++++++++++++++++++++++++-----------
>  2 files changed, 100 insertions(+), 26 deletions(-)
> 
> diff --git a/drivers/edac/edac_core.h b/drivers/edac/edac_core.h
> index 23bb99f..9c5da11 100644
> --- a/drivers/edac/edac_core.h
> +++ b/drivers/edac/edac_core.h
> @@ -453,6 +453,23 @@ extern struct mem_ctl_info *find_mci_by_dev(struct device *dev);
>  extern struct mem_ctl_info *edac_mc_del_mc(struct device *dev);
>  extern int edac_mc_find_csrow_by_page(struct mem_ctl_info *mci,
>  				      unsigned long page);
> +
> +void edac_raw_mc_handle_error(const enum hw_event_mc_err_type type,
> +			  struct mem_ctl_info *mci,
> +			  long grain,
> +			  const u16 error_count,
> +			  const int top_layer,
> +			  const int mid_layer,
> +			  const int low_layer,
> +			  const unsigned long page_frame_number,
> +			  const unsigned long offset_in_page,
> +			  const unsigned long syndrome,
> +			  const char *msg,
> +			  const char *location,
> +			  const char *label,
> +			  const char *other_detail,
> +			  const bool enable_per_layer_report);

The argument count of this one looks like an overkill. Maybe it would be
nicer, cleaner to do this:

void __edac_handle_mc_error(const enum hw_event_mc_err_type type,
			    struct mem_ctl_info *mci,
			    struct error_desc *e);

and struct error_desc collects all the remaining arguments.

This way you can't get the arguments order wrong, forget one or
whatever; and it would be much less stack pressure on the function
calls.
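
A minimal userspace sketch of the idea, assuming a reduced set of fields
taken from the prototype quoted above. The struct layout and the
describe_error() helper are hypothetical and only illustrate the calling
style:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical, reduced descriptor: fields named after the prototype. */
struct error_desc {
	long grain;
	unsigned short error_count;
	unsigned long page_frame_number;
	unsigned long offset_in_page;
	const char *msg;
	const char *location;
};

/* Format one error; the callee receives a single pointer argument. */
static int describe_error(const struct error_desc *e, char *buf, size_t len)
{
	assert(e != NULL && buf != NULL);
	return snprintf(buf, len,
			"%s: %u error(s) at %s page:0x%lx offset:0x%lx grain:%ld",
			e->msg, (unsigned int)e->error_count, e->location,
			e->page_frame_number, e->offset_in_page, e->grain);
}
```

A caller fills the descriptor with designated initializers, e.g.
{ .error_count = 1, .msg = "APEI" }, so each value is tied to a field
name and any field left out is zeroed instead of silently shifting the
remaining arguments.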

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH EDAC 07/13] edac: add support for raw error reports
  2013-02-15 14:13   ` Borislav Petkov
@ 2013-02-15 15:25     ` Mauro Carvalho Chehab
  2013-02-15 15:41       ` Borislav Petkov
  0 siblings, 1 reply; 49+ messages in thread
From: Mauro Carvalho Chehab @ 2013-02-15 15:25 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: linux-acpi, Huang Ying, Tony Luck, Linux Edac Mailing List,
	Linux Kernel Mailing List

On Fri, 15 Feb 2013 15:13:30 +0100
Borislav Petkov <bp@alien8.de> wrote:

> On Fri, Feb 15, 2013 at 10:44:55AM -0200, Mauro Carvalho Chehab wrote:
> > That allows APEI GHES driver to report errors directly, using
> > the EDAC error report API.
> > 
> > Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
> > ---
> >  drivers/edac/edac_core.h |  17 ++++++++
> >  drivers/edac/edac_mc.c   | 109 ++++++++++++++++++++++++++++++++++++-----------
> >  2 files changed, 100 insertions(+), 26 deletions(-)
> > 
> > diff --git a/drivers/edac/edac_core.h b/drivers/edac/edac_core.h
> > index 23bb99f..9c5da11 100644
> > --- a/drivers/edac/edac_core.h
> > +++ b/drivers/edac/edac_core.h
> > @@ -453,6 +453,23 @@ extern struct mem_ctl_info *find_mci_by_dev(struct device *dev);
> >  extern struct mem_ctl_info *edac_mc_del_mc(struct device *dev);
> >  extern int edac_mc_find_csrow_by_page(struct mem_ctl_info *mci,
> >  				      unsigned long page);
> > +
> > +void edac_raw_mc_handle_error(const enum hw_event_mc_err_type type,
> > +			  struct mem_ctl_info *mci,
> > +			  long grain,
> > +			  const u16 error_count,
> > +			  const int top_layer,
> > +			  const int mid_layer,
> > +			  const int low_layer,
> > +			  const unsigned long page_frame_number,
> > +			  const unsigned long offset_in_page,
> > +			  const unsigned long syndrome,
> > +			  const char *msg,
> > +			  const char *location,
> > +			  const char *label,
> > +			  const char *other_detail,
> > +			  const bool enable_per_layer_report);
> 
> The argument count of this one looks like an overkill. Maybe it would be
> nicer, cleaner to do this:
> 
> void __edac_handle_mc_error(const enum hw_event_mc_err_type type,
> 			    struct mem_ctl_info *mci,
> 			    struct error_desc *e);
> 
> and struct error_desc collects all the remaining arguments.
> 
> This way you can't get the arguments order wrong, forget one or
> whatever; and it would be much less stack pressure on the function
> calls.

Well, using a structure will certainly help to avoid missing a parameter
or swapping their order. The stack usage won't be reduced, though, because
the structure will still live on the stack. As I can't foresee any use
of this function outside the core and the GHES driver, I'm not
sure which would be better.

Anyway, I moved the arguments into a structure in the enclosed patch.

Regards,
Mauro

--


edac: put all arguments for the raw error handling call into a struct

The number of arguments for edac_raw_mc_handle_error() is too big;
put them into a structure.

Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>

diff --git a/drivers/edac/edac_core.h b/drivers/edac/edac_core.h
index 9c5da11..1574fec 100644
--- a/drivers/edac/edac_core.h
+++ b/drivers/edac/edac_core.h
@@ -381,6 +381,25 @@ struct edac_pci_ctl_info {
 	struct completion kobj_complete;
 };
 
+/*
+ * Raw error report structure
+ */
+struct edac_raw_error_desc {
+	long grain;
+	u16 error_count;
+	int top_layer;
+	int mid_layer;
+	int low_layer;
+	unsigned long page_frame_number;
+	unsigned long offset_in_page;
+	unsigned long syndrome;
+	const char *msg;
+	const char *location;
+	const char *label;
+	const char *other_detail;
+	bool enable_per_layer_report;
+};
+
 #define to_edac_pci_ctl_work(w) \
 		container_of(w, struct edac_pci_ctl_info,work)
 
@@ -455,20 +474,8 @@ extern int edac_mc_find_csrow_by_page(struct mem_ctl_info *mci,
 				      unsigned long page);
 
 void edac_raw_mc_handle_error(const enum hw_event_mc_err_type type,
-			  struct mem_ctl_info *mci,
-			  long grain,
-			  const u16 error_count,
-			  const int top_layer,
-			  const int mid_layer,
-			  const int low_layer,
-			  const unsigned long page_frame_number,
-			  const unsigned long offset_in_page,
-			  const unsigned long syndrome,
-			  const char *msg,
-			  const char *location,
-			  const char *label,
-			  const char *other_detail,
-			  const bool enable_per_layer_report);
+			      struct mem_ctl_info *mci,
+			      struct edac_raw_error_desc *err);
 
 void edac_mc_handle_error(const enum hw_event_mc_err_type type,
 			  struct mem_ctl_info *mci,
diff --git a/drivers/edac/edac_mc.c b/drivers/edac/edac_mc.c
index 8fddf65..b36f8f8 100644
--- a/drivers/edac/edac_mc.c
+++ b/drivers/edac/edac_mc.c
@@ -1074,70 +1074,43 @@ static void edac_ue_error(struct mem_ctl_info *mci,
  *
  * @type:		severity of the error (CE/UE/Fatal)
  * @mci:		a struct mem_ctl_info pointer
- * @grain:		error granularity
- * @error_count:	Number of errors of the same type
- * @top_layer:		Memory layer[0] position
- * @mid_layer:		Memory layer[1] position
- * @low_layer:		Memory layer[2] position
- * @page_frame_number:	mem page where the error occurred
- * @offset_in_page:	offset of the error inside the page
- * @syndrome:		ECC syndrome
- * @msg:		Message meaningful to the end users that
- *			explains the event\
- * @location:		location of the error, like "csrow:0 channel:1"
- * @label:		DIMM labels for the affected memory(ies)
- * @other_detail:	Technical details about the event that
- *			may help hardware manufacturers and
- *			EDAC developers to analyse the event
- * @enable_per_layer_report: should it increment per-layer error counts?
+ * @e:			error description
  *
  * This raw function is used internally by edac_mc_handle_error(). It should
  * only be called directly when the hardware error come directly from BIOS,
  * like in the case of APEI GHES driver.
  */
 void edac_raw_mc_handle_error(const enum hw_event_mc_err_type type,
-			  struct mem_ctl_info *mci,
-			  long grain,
-			  const u16 error_count,
-			  const int top_layer,
-			  const int mid_layer,
-			  const int low_layer,
-			  const unsigned long page_frame_number,
-			  const unsigned long offset_in_page,
-			  const unsigned long syndrome,
-			  const char *msg,
-			  const char *location,
-			  const char *label,
-			  const char *other_detail,
-			  const bool enable_per_layer_report)
+			      struct mem_ctl_info *mci,
+			      struct edac_raw_error_desc *e)
 {
 	char detail[80];
 	u8 grain_bits;
-	int pos[EDAC_MAX_LAYERS] = { top_layer, mid_layer, low_layer };
+	int pos[EDAC_MAX_LAYERS] = { e->top_layer, e->mid_layer, e->low_layer };
 
 	/* Report the error via the trace interface */
-	grain_bits = fls_long(grain) + 1;
-	trace_mc_event(type, msg, label, error_count,
-		       mci->mc_idx, top_layer, mid_layer, low_layer,
-		       PAGES_TO_MiB(page_frame_number) | offset_in_page,
-		       grain_bits, syndrome, other_detail);
+	grain_bits = fls_long(e->grain) + 1;
+	trace_mc_event(type, e->msg, e->label, e->error_count,
+		       mci->mc_idx, e->top_layer, e->mid_layer, e->low_layer,
+		       PAGES_TO_MiB(e->page_frame_number) | e->offset_in_page,
+		       grain_bits, e->syndrome, e->other_detail);
 
 	/* Memory type dependent details about the error */
 	if (type == HW_EVENT_ERR_CORRECTED) {
 		snprintf(detail, sizeof(detail),
 			"page:0x%lx offset:0x%lx grain:%ld syndrome:0x%lx",
-			page_frame_number, offset_in_page,
-			grain, syndrome);
-		edac_ce_error(mci, error_count, pos, msg, location, label,
-			      detail, other_detail, enable_per_layer_report,
-			      page_frame_number, offset_in_page, grain);
+			e->page_frame_number, e->offset_in_page,
+			e->grain, e->syndrome);
+		edac_ce_error(mci, e->error_count, pos, e->msg, e->location, e->label,
+			      detail, e->other_detail, e->enable_per_layer_report,
+			      e->page_frame_number, e->offset_in_page, e->grain);
 	} else {
 		snprintf(detail, sizeof(detail),
 			"page:0x%lx offset:0x%lx grain:%ld",
-			page_frame_number, offset_in_page, grain);
+			e->page_frame_number, e->offset_in_page, e->grain);
 
-		edac_ue_error(mci, error_count, pos, msg, location, label,
-			      detail, other_detail, enable_per_layer_report);
+		edac_ue_error(mci, e->error_count, pos, e->msg, e->location, e->label,
+			      detail, e->other_detail, e->enable_per_layer_report);
 	}
 
 
@@ -1181,11 +1154,12 @@ void edac_mc_handle_error(const enum hw_event_mc_err_type type,
 	int row = -1, chan = -1;
 	int pos[EDAC_MAX_LAYERS] = { top_layer, mid_layer, low_layer };
 	int i;
-	long grain;
-	bool enable_per_layer_report = false;
+	struct edac_raw_error_desc e;
 
 	edac_dbg(3, "MC%d\n", mci->mc_idx);
 
+	e.enable_per_layer_report = false;
+
 	/*
 	 * Check if the event report is consistent and if the memory
 	 * location is known. If it is known, enable_per_layer_report will be
@@ -1208,7 +1182,7 @@ void edac_mc_handle_error(const enum hw_event_mc_err_type type,
 			pos[i] = -1;
 		}
 		if (pos[i] >= 0)
-			enable_per_layer_report = true;
+			e.enable_per_layer_report = true;
 	}
 
 	/*
@@ -1222,7 +1196,7 @@ void edac_mc_handle_error(const enum hw_event_mc_err_type type,
 	 * where each memory belongs to a separate channel within the same
 	 * branch.
 	 */
-	grain = 0;
+	e.grain = 0;
 	p = label;
 	*p = '\0';
 
@@ -1237,8 +1211,8 @@ void edac_mc_handle_error(const enum hw_event_mc_err_type type,
 			continue;
 
 		/* get the max grain, over the error match range */
-		if (dimm->grain > grain)
-			grain = dimm->grain;
+		if (dimm->grain > e.grain)
+			e.grain = dimm->grain;
 
 		/*
 		 * If the error is memory-controller wide, there's no need to
@@ -1246,7 +1220,7 @@ void edac_mc_handle_error(const enum hw_event_mc_err_type type,
 		 * channel/memory controller/...  may be affected.
 		 * Also, don't show errors for empty DIMM slots.
 		 */
-		if (enable_per_layer_report && dimm->nr_pages) {
+		if (e.enable_per_layer_report && dimm->nr_pages) {
 			if (p != label) {
 				strcpy(p, OTHER_LABEL);
 				p += strlen(OTHER_LABEL);
@@ -1274,7 +1248,7 @@ void edac_mc_handle_error(const enum hw_event_mc_err_type type,
 		}
 	}
 
-	if (!enable_per_layer_report) {
+	if (!e.enable_per_layer_report) {
 		strcpy(label, "any memory");
 	} else {
 		edac_dbg(4, "csrow/channel to increment: (%d,%d)\n", row, chan);
@@ -1305,11 +1279,19 @@ void edac_mc_handle_error(const enum hw_event_mc_err_type type,
 	if (p > location)
 		*(p - 1) = '\0';
 
-	edac_raw_mc_handle_error(type, mci, grain, error_count,
-				 top_layer, mid_layer, low_layer,
-				 page_frame_number, offset_in_page,
-				 syndrome,
-				 msg, location, label, other_detail,
-				 enable_per_layer_report);
+
+	e.error_count = error_count;
+	e.top_layer = top_layer;
+	e.mid_layer = mid_layer;
+	e.low_layer = low_layer;
+	e.page_frame_number = page_frame_number;
+	e.offset_in_page = offset_in_page;
+	e.syndrome = syndrome;
+	e.msg = msg;
+	e.location = location;
+	e.label = label;
+	e.other_detail = other_detail;
+
+	edac_raw_mc_handle_error(type, mci, &e);
 }
 EXPORT_SYMBOL_GPL(edac_mc_handle_error);
diff --git a/drivers/edac/ghes_edac.c b/drivers/edac/ghes_edac.c
index 94d5286..782ed74 100644
--- a/drivers/edac/ghes_edac.c
+++ b/drivers/edac/ghes_edac.c
@@ -162,15 +162,16 @@ static void ghes_edac_dmidecode(const struct dmi_header *dh, void *arg)
 void ghes_edac_report_mem_error(struct ghes *ghes, int sev,
 			        struct cper_sec_mem_err *mem_err)
 {
+	struct edac_raw_error_desc e;
 	enum hw_event_mc_err_type type;
-	unsigned long page = 0, offset = 0, grain = 0;
 	char location[80];
-	char *label = "unknown";
+
+	memset(&e, 0, sizeof(e));
 
 	if (mem_err->validation_bits & CPER_MEM_VALID_PHYSICAL_ADDRESS) {
-		page = mem_err->physical_addr >> PAGE_SHIFT;
-		offset = mem_err->physical_addr & ~PAGE_MASK;
-		grain = ~(mem_err->physical_addr_mask & ~PAGE_MASK);
+		e.page_frame_number = mem_err->physical_addr >> PAGE_SHIFT;
+		e.offset_in_page = mem_err->physical_addr & ~PAGE_MASK;
+		e.grain = ~(mem_err->physical_addr_mask & ~PAGE_MASK);
 	}
 
 	switch(sev) {
@@ -194,9 +195,12 @@ void ghes_edac_report_mem_error(struct ghes *ghes, int sev,
 		mem_err->bit_pos);
 	edac_dbg(3, "error at location %s\n", location);
 
-	edac_raw_mc_handle_error(type, ghes->mci, grain, 1, 0, 0, 0,
-				 page, offset, 0,
-				 "APEI", location, label, "", 0);
+	e.error_count = 1;
+	e.msg = "APEI";
+	e.location = location;
+	e.label = "unknown";
+	e.other_detail = "";
+	edac_raw_mc_handle_error(type, ghes->mci, &e);
 }
 EXPORT_SYMBOL_GPL(ghes_edac_report_mem_error);
 

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [PATCH EDAC 07/13] edac: add support for raw error reports
  2013-02-15 15:25     ` Mauro Carvalho Chehab
@ 2013-02-15 15:41       ` Borislav Petkov
  2013-02-15 15:49         ` Mauro Carvalho Chehab
  0 siblings, 1 reply; 49+ messages in thread
From: Borislav Petkov @ 2013-02-15 15:41 UTC (permalink / raw)
  To: Mauro Carvalho Chehab
  Cc: linux-acpi, Huang Ying, Tony Luck, Linux Edac Mailing List,
	Linux Kernel Mailing List

On Fri, Feb 15, 2013 at 01:25:30PM -0200, Mauro Carvalho Chehab wrote:
> Well, using a structure will certainly help to avoid missing a
> parameter or swapping their order. The stack usage won't be reduced,
> though, because the structure will still live on the stack.

If you allocate it on the stack of the caller, yes. If you kmalloc it,
no.

In any case, passing only a pointer to struct edac_raw_error_desc
allows x86_64 (and i386 AFAICT) to pass the callee's function arguments
entirely in registers, which is always a win. You probably need to stare at
compiler output to see what gcc actually does with -O2 optimizations.

> As I can't foresee any use of this function outside the core
> and the GHES driver, I'm not sure which would be better.

Having an error descriptor is always better, even if it were only for
clarity's and simplicity's sake.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--


* Re: [PATCH EDAC 07/13] edac: add support for raw error reports
  2013-02-15 15:41       ` Borislav Petkov
@ 2013-02-15 15:49         ` Mauro Carvalho Chehab
  2013-02-15 16:02           ` Borislav Petkov
  0 siblings, 1 reply; 49+ messages in thread
From: Mauro Carvalho Chehab @ 2013-02-15 15:49 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: linux-acpi, Huang Ying, Tony Luck, Linux Edac Mailing List,
	Linux Kernel Mailing List

Em Fri, 15 Feb 2013 16:41:23 +0100
Borislav Petkov <bp@alien8.de> escreveu:

> On Fri, Feb 15, 2013 at 01:25:30PM -0200, Mauro Carvalho Chehab wrote:
> > Well, using a structure will certainly help to avoid missing a
> > parameter or swapping their order. The stack usage won't go down,
> > though, because the structure itself will still live on the stack.
> 
> If you allocate it on the stack of the caller, yes. If you kmalloc it,
> no.

Sure, but calling kmalloc while handling a memory error doesn't seem
like a very good idea, IMHO. So, it is better to use either already
allocated space or the stack.
> 
> In any case, passing only a pointer to struct edac_raw_error_desc will
> allow x86_64 (and i386 AFAICT) to pass the callee's arguments entirely
> in registers. Which is always a win. You probably need to stare at
> compiler output to see what gcc actually does with -O2 optimizations.

Yes, I know, but, on the other hand, there's the additional cost of
copying almost all data into the structure.

> > As I can't foresee this function being called from anywhere but the
> > core and the GHES driver, I'm not sure which would be better.
> 
> Having an error descriptor is always better, even if it were only for
> clarity's and simplicity's sake.

Yes, the code is now clearer.

Ok, I'll keep this patch in my git tree. I'll likely fold it into the
previous one in the final patchset.

-- 

Cheers,
Mauro


* Re: [PATCH EDAC 07/13] edac: add support for raw error reports
  2013-02-15 15:49         ` Mauro Carvalho Chehab
@ 2013-02-15 16:02           ` Borislav Petkov
  2013-02-15 18:20             ` Mauro Carvalho Chehab
  0 siblings, 1 reply; 49+ messages in thread
From: Borislav Petkov @ 2013-02-15 16:02 UTC (permalink / raw)
  To: Mauro Carvalho Chehab
  Cc: linux-acpi, Huang Ying, Tony Luck, Linux Edac Mailing List,
	Linux Kernel Mailing List

On Fri, Feb 15, 2013 at 01:49:29PM -0200, Mauro Carvalho Chehab wrote:
> Sure, but calling kmalloc while handling a memory error doesn't seem
> like a very good idea, IMHO. So, it is better to use either already
> allocated space or the stack.

Either that, or prealloc a buffer on EDAC initialization. You probably
won't need more than one in 99% of the cases so if you keep it simple
with a single static buffer for starters, that would probably be the
cleanest solution.

> Yes, I know, but, on the other hand, there's the additional cost of
> copying almost all data into the structure.

That's very easily parallelizable on out-of-order CPUs (I'd say all of
them that need to run EDAC can do that :-)), so it wouldn't hurt.

Also, you could allocate the struct in the callers and work directly
with its members before sending it down to edac_raw_mc_handle_error() -
that would probably simplify the code a bit more.
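
A minimal userspace sketch of the preallocation idea discussed above. All
names here (struct err_desc, struct mem_ctl, report_error, handle_error)
are hypothetical stand-ins, not the EDAC API, and 4 KiB pages are assumed.
The descriptor lives inside the controller structure, the caller fills its
members directly, and the consumer receives a single pointer:

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical, trimmed-down stand-in for struct edac_raw_error_desc. */
struct err_desc {
	unsigned long page_frame_number;
	unsigned long offset_in_page;
	unsigned short error_count;
	const char *msg;
	char location[80];
};

/* Hypothetical stand-in for struct mem_ctl_info: it owns one descriptor. */
struct mem_ctl {
	int idx;
	struct err_desc error_event;	/* allocated once, with the controller */
};

/* The consumer takes a single pointer instead of a long argument list. */
static int report_error(const struct mem_ctl *mc, const struct err_desc *e,
			char *out, size_t len)
{
	return snprintf(out, len,
			"MC%d: %u %s error at %s (page:0x%lx offset:0x%lx)",
			mc->idx, (unsigned int)e->error_count, e->msg,
			e->location, e->page_frame_number, e->offset_in_page);
}

/* The caller works on the preallocated members; nothing is allocated here. */
static int handle_error(struct mem_ctl *mc, unsigned long addr,
			char *out, size_t len)
{
	struct err_desc *e = &mc->error_event;

	memset(e, 0, sizeof(*e));
	e->page_frame_number = addr >> 12;	/* assumes 4 KiB pages */
	e->offset_in_page = addr & 0xfff;
	e->error_count = 1;
	e->msg = "CE";
	snprintf(e->location, sizeof(e->location), "channel:%d slot:%d", 0, 0);

	return report_error(mc, e, out, len);
}
```

Note that a single shared descriptor per controller only works if error
reports for that controller are serialized; the sketch does not address
locking.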

> Ok, I'll keep this patch in my git tree. I'll likely fold it into the
> previous one in the final patchset.

Yep.

Thanks.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--


* Re: [PATCH EDAC 13/13] ghes_edac: Improve driver's printk messages
  2013-02-15 12:45   ` Mauro Carvalho Chehab
  (?)
@ 2013-02-15 16:38   ` Joe Perches
  2013-02-15 17:33     ` Mauro Carvalho Chehab
  -1 siblings, 1 reply; 49+ messages in thread
From: Joe Perches @ 2013-02-15 16:38 UTC (permalink / raw)
  To: Mauro Carvalho Chehab
  Cc: linux-acpi, Huang Ying, Tony Luck, Linux Edac Mailing List,
	Linux Kernel Mailing List

On Fri, 2013-02-15 at 10:45 -0200, Mauro Carvalho Chehab wrote:
> Provide a better infrastructure for printk's inside the driver:
> 	- use edac_dbg() for debug messages;
> 	- standardize the usage of pr_info();
> 	- provide warning about the risk of relying on this
> 	  driver.
[]
> diff --git a/drivers/edac/ghes_edac.c b/drivers/edac/ghes_edac.c
[]
> @@ -80,7 +80,8 @@ static void ghes_edac_dmidecode(const struct dmi_header *dh, void *arg)
>  						       dimm_fill->count, 0, 0);
>  
>  		if (entry->size == 0xffff) {
> -			pr_info(GHES_PFX "Can't get dimm size\n");
> +			pr_info(GHES_PFX "Can't get DIMM%i size\n",
> +				dimm_fill->count);

Perhaps these should use
#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
and remove GHES_PFX from all the pr_<level>()'s?
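
A userspace sketch of how the pr_fmt() prefix works, for illustration only.
In the kernel, kbuild supplies KBUILD_MODNAME and pr_info() ends up in
printk(); here KBUILD_MODNAME is defined by hand, and pr_info_buf() and
report_dimm() are hypothetical helpers that format into a buffer so the
result can be checked:

```c
#include <stdio.h>

/* In the kernel, kbuild defines KBUILD_MODNAME; it is set by hand here. */
#define KBUILD_MODNAME "ghes_edac"

/* Must be defined before any pr_*() call site is expanded. */
#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt

/*
 * Hypothetical stand-in for pr_info(). Adjacent string literals are
 * concatenated at compile time, so the prefix costs nothing at runtime.
 * ##__VA_ARGS__ is a GNU extension that gcc and clang accept.
 */
#define pr_info_buf(buf, len, fmt, ...) \
	snprintf(buf, len, pr_fmt(fmt), ##__VA_ARGS__)

static int report_dimm(char *buf, size_t len, int dimm)
{
	/* No per-call prefix macro is needed at the call site any more. */
	return pr_info_buf(buf, len, "Can't get DIMM%i size\n", dimm);
}
```

Because the prefix is applied in one place, every message in the file gets
it, and renaming the module changes the prefix automatically.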




* Re: [PATCH EDAC 13/13] ghes_edac: Improve driver's printk messages
  2013-02-15 16:38   ` Joe Perches
@ 2013-02-15 17:33     ` Mauro Carvalho Chehab
  0 siblings, 0 replies; 49+ messages in thread
From: Mauro Carvalho Chehab @ 2013-02-15 17:33 UTC (permalink / raw)
  To: Joe Perches
  Cc: linux-acpi, Huang Ying, Tony Luck, Linux Edac Mailing List,
	Linux Kernel Mailing List

Em Fri, 15 Feb 2013 08:38:17 -0800
Joe Perches <joe@perches.com> escreveu:

> On Fri, 2013-02-15 at 10:45 -0200, Mauro Carvalho Chehab wrote:
> > Provide a better infrastructure for printk's inside the driver:
> > 	- use edac_dbg() for debug messages;
> > 	- standardize the usage of pr_info();
> > 	- provide warning about the risk of relying on this
> > 	  driver.
> []
> > diff --git a/drivers/edac/ghes_edac.c b/drivers/edac/ghes_edac.c
> []
> > @@ -80,7 +80,8 @@ static void ghes_edac_dmidecode(const struct dmi_header *dh, void *arg)
> >  						       dimm_fill->count, 0, 0);
> >  
> >  		if (entry->size == 0xffff) {
> > -			pr_info(GHES_PFX "Can't get dimm size\n");
> > +			pr_info(GHES_PFX "Can't get DIMM%i size\n",
> > +				dimm_fill->count);
> 
> Perhaps these should use
> #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> and remove GHES_PFX from all the pr_<level>()'s?

Yeah, sure.

Regards,
Mauro

-

[PATCH] ghes_edac: remove GHES_PFX macro

As suggested by Joe:

On Fri, 15 Feb 2013 08:38:17 -0800
Joe Perches <joe@perches.com> wrote:

	Perhaps these should use
	#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
	and remove GHES_PFX from all the pr_<level>()'s?

Suggested-by: Joe Perches <joe@perches.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>

diff --git a/drivers/edac/ghes_edac.c b/drivers/edac/ghes_edac.c
index 782ed74..9fe787b 100644
--- a/drivers/edac/ghes_edac.c
+++ b/drivers/edac/ghes_edac.c
@@ -1,14 +1,16 @@
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
 #include <acpi/ghes.h>
 #include <linux/edac.h>
 #include <linux/dmi.h>
 #include "edac_core.h"
 
-#define GHES_PFX   "ghes_edac: "
 #define GHES_EDAC_REVISION " Ver: 1.0.0"
 
 static DEFINE_MUTEX(ghes_edac_lock);
 static int ghes_edac_mc_num;
 
+
 /* Memory Device - Type 17 of SMBIOS spec */
 struct memdev_dmi_entry {
 	u8 type;
@@ -80,7 +82,7 @@ static void ghes_edac_dmidecode(const struct dmi_header *dh, void *arg)
 						       dimm_fill->count, 0, 0);
 
 		if (entry->size == 0xffff) {
-			pr_info(GHES_PFX "Can't get DIMM%i size\n",
+			pr_info("Can't get DIMM%i size\n",
 				dimm_fill->count);
 			dimm->nr_pages = MiB_TO_PAGES(32);/* Unknown */
 		} else if (entry->size == 0x7fff) {
@@ -232,7 +234,7 @@ int ghes_edac_register(struct ghes *ghes, struct device *dev)
 	mutex_lock(&ghes_edac_lock);
 	mci = edac_mc_alloc(ghes_edac_mc_num, ARRAY_SIZE(layers), layers, 0);
 	if (!mci) {
-		pr_info(GHES_PFX "Can't allocate memory for EDAC data\n");
+		pr_info("Can't allocate memory for EDAC data\n");
 		mutex_unlock(&ghes_edac_lock);
 		return -ENOMEM;
 	}
@@ -250,17 +252,17 @@ int ghes_edac_register(struct ghes *ghes, struct device *dev)
 
 	if (!ghes_edac_mc_num) {
 		if (!fake) {
-			pr_info(GHES_PFX "This EDAC driver relies on BIOS to enumerate memory and get error reports.\n");
-			pr_info(GHES_PFX "Unfortunately, not all BIOSes reflect the memory layout correctly.\n");
-			pr_info(GHES_PFX "So, the end result of using this driver varies from vendor to vendor.\n");
-			pr_info(GHES_PFX "If you find incorrect reports, please contact your hardware vendor\n");
-			pr_info(GHES_PFX "to correct its BIOS.\n");
-			pr_info(GHES_PFX "This system has %d DIMM sockets.\n",
+			pr_info("This EDAC driver relies on BIOS to enumerate memory and get error reports.\n");
+			pr_info("Unfortunately, not all BIOSes reflect the memory layout correctly.\n");
+			pr_info("So, the end result of using this driver varies from vendor to vendor.\n");
+			pr_info("If you find incorrect reports, please contact your hardware vendor\n");
+			pr_info("to correct its BIOS.\n");
+			pr_info("This system has %d DIMM sockets.\n",
 				num_dimm);
 		} else {
-			pr_info(GHES_PFX "This system has a very crappy BIOS: It doesn't even list the DIMMS.\n");
-			pr_info(GHES_PFX "Its SMBIOS info is wrong. It is doubtful that the error report would\n");
-			pr_info(GHES_PFX "work on such system. Use this driver with caution\n");
+			pr_info("This system has a very crappy BIOS: It doesn't even list the DIMMS.\n");
+			pr_info("Its SMBIOS info is wrong. It is doubtful that the error report would\n");
+			pr_info("work on such system. Use this driver with caution\n");
 		}
 	}
 
@@ -291,7 +293,7 @@ int ghes_edac_register(struct ghes *ghes, struct device *dev)
 
 	rc = edac_mc_add_mc(mci);
 	if (rc < 0) {
-		pr_info(GHES_PFX "Can't register at EDAC core\n");
+		pr_info("Can't register at EDAC core\n");
 		edac_mc_free(mci);
 		mutex_unlock(&ghes_edac_lock);
 		return -ENODEV;


* Re: [PATCH EDAC 07/13] edac: add support for raw error reports
  2013-02-15 16:02           ` Borislav Petkov
@ 2013-02-15 18:20             ` Mauro Carvalho Chehab
  2013-02-16 16:57                 ` Borislav Petkov
  0 siblings, 1 reply; 49+ messages in thread
From: Mauro Carvalho Chehab @ 2013-02-15 18:20 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: linux-acpi, Huang Ying, Tony Luck, Linux Edac Mailing List,
	Linux Kernel Mailing List

Em Fri, 15 Feb 2013 17:02:57 +0100
Borislav Petkov <bp@alien8.de> escreveu:

> On Fri, Feb 15, 2013 at 01:49:29PM -0200, Mauro Carvalho Chehab wrote:
> > Sure, but calling kmalloc while handling a memory error doesn't seem
> > like a very good idea, IMHO. So, it is better to use either already
> > allocated space or the stack.
> 
> Either that, or prealloc a buffer on EDAC initialization. You probably
> won't need more than one in 99% of the cases so if you keep it simple
> with a single static buffer for starters, that would probably be the
> cleanest solution.
> 
> > Yes, I know, but, on the other hand, there's the additional cost of
> > copying almost all data into the structure.
> 
> That's very easily parallelizable on out-of-order CPUs (I'd say all of
> them that need to run EDAC can do that :-)), so it wouldn't hurt.
> 
> Also, you could allocate the struct in the callers and work directly
> with its members before sending it down to edac_raw_mc_handle_error() -
> that would probably simplify the code a bit more.

Yeah, pre-allocating a buffer is something that was already in my plans.
It seems it is time to do it in a clean way. I prefer to keep this
as a separate patch from 07/13, as it has a different rationale,
and folding it into 07/13 would just mix two different subjects.

Also, having it separate helps reviewing.

---

[PATCH] edac: put all arguments for the raw error handling call into a struct

The number of arguments for edac_raw_mc_handle_error() is too big;
put them into a structure and allocate space for it inside
edac_mc_alloc().

That greatly reduces the stack usage and simplifies the raw API call.

Tested with sb_edac driver and MCE error injection. Worked as expected:

[  143.066100] EDAC MC0: 1 CE memory read error on CPU_SrcID#0_Channel#0_DIMM#0 (channel:0 slot:0 page:0x320 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0001:0090 socket:0 channel_mask:1 rank:0)
[  143.086424] EDAC MC0: 1 CE memory read error on CPU_SrcID#0_Channel#0_DIMM#0 (channel:0 slot:0 page:0x320 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0001:0090 socket:0 channel_mask:1 rank:0)
[  143.106570] EDAC MC0: 1 CE memory read error on CPU_SrcID#0_Channel#0_DIMM#0 (channel:0 slot:0 page:0x320 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0001:0090 socket:0 channel_mask:1 rank:0)
[  143.126712] EDAC MC0: 1 CE memory read error on CPU_SrcID#0_Channel#0_DIMM#0 (channel:0 slot:0 page:0x320 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0001:0090 socket:0 channel_mask:1 rank:0)

Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>

diff --git a/drivers/edac/edac_core.h b/drivers/edac/edac_core.h
index 9c5da11..9cf33a5 100644
--- a/drivers/edac/edac_core.h
+++ b/drivers/edac/edac_core.h
@@ -454,21 +454,20 @@ extern struct mem_ctl_info *edac_mc_del_mc(struct device *dev);
 extern int edac_mc_find_csrow_by_page(struct mem_ctl_info *mci,
 				      unsigned long page);
 
+static inline void edac_raw_error_desc_clean(struct edac_raw_error_desc *e)
+{
+	int offset = offsetof(struct edac_raw_error_desc, grain);
+
+	*e->location = '\0';
+	*e->label = '\0';
+
+	memset(e + offset, 0, sizeof(*e) - offset);
+}
+
+
 void edac_raw_mc_handle_error(const enum hw_event_mc_err_type type,
-			  struct mem_ctl_info *mci,
-			  long grain,
-			  const u16 error_count,
-			  const int top_layer,
-			  const int mid_layer,
-			  const int low_layer,
-			  const unsigned long page_frame_number,
-			  const unsigned long offset_in_page,
-			  const unsigned long syndrome,
-			  const char *msg,
-			  const char *location,
-			  const char *label,
-			  const char *other_detail,
-			  const bool enable_per_layer_report);
+			      struct mem_ctl_info *mci,
+			      struct edac_raw_error_desc *e);
 
 void edac_mc_handle_error(const enum hw_event_mc_err_type type,
 			  struct mem_ctl_info *mci,
diff --git a/drivers/edac/edac_mc.c b/drivers/edac/edac_mc.c
index 8fddf65..d72853b 100644
--- a/drivers/edac/edac_mc.c
+++ b/drivers/edac/edac_mc.c
@@ -42,6 +42,12 @@
 static DEFINE_MUTEX(mem_ctls_mutex);
 static LIST_HEAD(mc_devices);
 
+/* Maximum size of the location string */
+#define LOCATION_SIZE 80
+
+/* String used to join two or more labels */
+#define OTHER_LABEL " or "
+
 /*
  * Used to lock EDAC MC to just one module, avoiding two drivers e. g.
  *	apei/ghes and i7core_edac to be used at the same time.
@@ -232,6 +238,11 @@ static void _edac_mc_free(struct mem_ctl_info *mci)
 		}
 		kfree(mci->csrows);
 	}
+
+	/* Frees the error report string area */
+	kfree(mci->error_event.location);
+	kfree(mci->error_event.label);
+
 	kfree(mci);
 }
 
@@ -445,6 +456,12 @@ struct mem_ctl_info *edac_mc_alloc(unsigned mc_num,
 		}
 	}
 
+	/* Allocate memory for the error report */
+	mci->error_event.location = kmalloc(LOCATION_SIZE, GFP_KERNEL);
+	mci->error_event.label = kmalloc((EDAC_MC_LABEL_LEN + 1 +
+					 sizeof(OTHER_LABEL)) * mci->tot_dimms,
+					 GFP_KERNEL);
+
 	mci->op_state = OP_ALLOC;
 
 	return mci;
@@ -1066,78 +1083,49 @@ static void edac_ue_error(struct mem_ctl_info *mci,
 	edac_inc_ue_error(mci, enable_per_layer_report, pos, error_count);
 }
 
-#define OTHER_LABEL " or "
-
 /**
  * edac_raw_mc_handle_error - reports a memory event to userspace without doing
  *			      anything to discover the error location
  *
  * @type:		severity of the error (CE/UE/Fatal)
  * @mci:		a struct mem_ctl_info pointer
- * @grain:		error granularity
- * @error_count:	Number of errors of the same type
- * @top_layer:		Memory layer[0] position
- * @mid_layer:		Memory layer[1] position
- * @low_layer:		Memory layer[2] position
- * @page_frame_number:	mem page where the error occurred
- * @offset_in_page:	offset of the error inside the page
- * @syndrome:		ECC syndrome
- * @msg:		Message meaningful to the end users that
- *			explains the event\
- * @location:		location of the error, like "csrow:0 channel:1"
- * @label:		DIMM labels for the affected memory(ies)
- * @other_detail:	Technical details about the event that
- *			may help hardware manufacturers and
- *			EDAC developers to analyse the event
- * @enable_per_layer_report: should it increment per-layer error counts?
+ * @e:			error description
  *
  * This raw function is used internally by edac_mc_handle_error(). It should
  * only be called directly when the hardware error come directly from BIOS,
  * like in the case of APEI GHES driver.
  */
 void edac_raw_mc_handle_error(const enum hw_event_mc_err_type type,
-			  struct mem_ctl_info *mci,
-			  long grain,
-			  const u16 error_count,
-			  const int top_layer,
-			  const int mid_layer,
-			  const int low_layer,
-			  const unsigned long page_frame_number,
-			  const unsigned long offset_in_page,
-			  const unsigned long syndrome,
-			  const char *msg,
-			  const char *location,
-			  const char *label,
-			  const char *other_detail,
-			  const bool enable_per_layer_report)
+			      struct mem_ctl_info *mci,
+			      struct edac_raw_error_desc *e)
 {
 	char detail[80];
 	u8 grain_bits;
-	int pos[EDAC_MAX_LAYERS] = { top_layer, mid_layer, low_layer };
+	int pos[EDAC_MAX_LAYERS] = { e->top_layer, e->mid_layer, e->low_layer };
 
 	/* Report the error via the trace interface */
-	grain_bits = fls_long(grain) + 1;
-	trace_mc_event(type, msg, label, error_count,
-		       mci->mc_idx, top_layer, mid_layer, low_layer,
-		       PAGES_TO_MiB(page_frame_number) | offset_in_page,
-		       grain_bits, syndrome, other_detail);
+	grain_bits = fls_long(e->grain) + 1;
+	trace_mc_event(type, e->msg, e->label, e->error_count,
+		       mci->mc_idx, e->top_layer, e->mid_layer, e->low_layer,
+		       PAGES_TO_MiB(e->page_frame_number) | e->offset_in_page,
+		       grain_bits, e->syndrome, e->other_detail);
 
 	/* Memory type dependent details about the error */
 	if (type == HW_EVENT_ERR_CORRECTED) {
 		snprintf(detail, sizeof(detail),
 			"page:0x%lx offset:0x%lx grain:%ld syndrome:0x%lx",
-			page_frame_number, offset_in_page,
-			grain, syndrome);
-		edac_ce_error(mci, error_count, pos, msg, location, label,
-			      detail, other_detail, enable_per_layer_report,
-			      page_frame_number, offset_in_page, grain);
+			e->page_frame_number, e->offset_in_page,
+			e->grain, e->syndrome);
+		edac_ce_error(mci, e->error_count, pos, e->msg, e->location, e->label,
+			      detail, e->other_detail, e->enable_per_layer_report,
+			      e->page_frame_number, e->offset_in_page, e->grain);
 	} else {
 		snprintf(detail, sizeof(detail),
 			"page:0x%lx offset:0x%lx grain:%ld",
-			page_frame_number, offset_in_page, grain);
+			e->page_frame_number, e->offset_in_page, e->grain);
 
-		edac_ue_error(mci, error_count, pos, msg, location, label,
-			      detail, other_detail, enable_per_layer_report);
+		edac_ue_error(mci, e->error_count, pos, e->msg, e->location, e->label,
+			      detail, e->other_detail, e->enable_per_layer_report);
 	}
 
 
@@ -1174,18 +1162,26 @@ void edac_mc_handle_error(const enum hw_event_mc_err_type type,
 			  const char *msg,
 			  const char *other_detail)
 {
-	/* FIXME: too much for stack: move it to some pre-alocated area */
-	char location[80];
-	char label[(EDAC_MC_LABEL_LEN + 1 + sizeof(OTHER_LABEL)) * mci->tot_dimms];
 	char *p;
 	int row = -1, chan = -1;
 	int pos[EDAC_MAX_LAYERS] = { top_layer, mid_layer, low_layer };
 	int i;
-	long grain;
-	bool enable_per_layer_report = false;
+	struct edac_raw_error_desc *e = &mci->error_event;
 
 	edac_dbg(3, "MC%d\n", mci->mc_idx);
 
+	/* Fills the error report buffer */
+	edac_raw_error_desc_clean(e);
+	e->error_count = error_count;
+	e->top_layer = top_layer;
+	e->mid_layer = mid_layer;
+	e->low_layer = low_layer;
+	e->page_frame_number = page_frame_number;
+	e->offset_in_page = offset_in_page;
+	e->syndrome = syndrome;
+	e->msg = msg;
+	e->other_detail = other_detail;
+
 	/*
 	 * Check if the event report is consistent and if the memory
 	 * location is known. If it is known, enable_per_layer_report will be
@@ -1208,7 +1204,7 @@ void edac_mc_handle_error(const enum hw_event_mc_err_type type,
 			pos[i] = -1;
 		}
 		if (pos[i] >= 0)
-			enable_per_layer_report = true;
+			e->enable_per_layer_report = true;
 	}
 
 	/*
@@ -1222,8 +1218,7 @@ void edac_mc_handle_error(const enum hw_event_mc_err_type type,
 	 * where each memory belongs to a separate channel within the same
 	 * branch.
 	 */
-	grain = 0;
-	p = label;
+	p = e->label;
 	*p = '\0';
 
 	for (i = 0; i < mci->tot_dimms; i++) {
@@ -1237,8 +1232,8 @@ void edac_mc_handle_error(const enum hw_event_mc_err_type type,
 			continue;
 
 		/* get the max grain, over the error match range */
-		if (dimm->grain > grain)
-			grain = dimm->grain;
+		if (dimm->grain > e->grain)
+			e->grain = dimm->grain;
 
 		/*
 		 * If the error is memory-controller wide, there's no need to
@@ -1246,8 +1241,8 @@ void edac_mc_handle_error(const enum hw_event_mc_err_type type,
 		 * channel/memory controller/...  may be affected.
 		 * Also, don't show errors for empty DIMM slots.
 		 */
-		if (enable_per_layer_report && dimm->nr_pages) {
-			if (p != label) {
+		if (e->enable_per_layer_report && dimm->nr_pages) {
+			if (p != e->label) {
 				strcpy(p, OTHER_LABEL);
 				p += strlen(OTHER_LABEL);
 			}
@@ -1274,12 +1269,12 @@ void edac_mc_handle_error(const enum hw_event_mc_err_type type,
 		}
 	}
 
-	if (!enable_per_layer_report) {
-		strcpy(label, "any memory");
+	if (!e->enable_per_layer_report) {
+		strcpy(e->label, "any memory");
 	} else {
 		edac_dbg(4, "csrow/channel to increment: (%d,%d)\n", row, chan);
-		if (p == label)
-			strcpy(label, "unknown memory");
+		if (p == e->label)
+			strcpy(e->label, "unknown memory");
 		if (type == HW_EVENT_ERR_CORRECTED) {
 			if (row >= 0) {
 				mci->csrows[row]->ce_count += error_count;
@@ -1292,7 +1287,7 @@ void edac_mc_handle_error(const enum hw_event_mc_err_type type,
 	}
 
 	/* Fill the RAM location data */
-	p = location;
+	p = e->location;
 
 	for (i = 0; i < mci->n_layers; i++) {
 		if (pos[i] < 0)
@@ -1302,14 +1297,9 @@ void edac_mc_handle_error(const enum hw_event_mc_err_type type,
 			     edac_layer_name[mci->layers[i].type],
 			     pos[i]);
 	}
-	if (p > location)
+	if (p > e->location)
 		*(p - 1) = '\0';
 
-	edac_raw_mc_handle_error(type, mci, grain, error_count,
-				 top_layer, mid_layer, low_layer,
-				 page_frame_number, offset_in_page,
-				 syndrome,
-				 msg, location, label, other_detail,
-				 enable_per_layer_report);
+	edac_raw_mc_handle_error(type, mci, e);
 }
 EXPORT_SYMBOL_GPL(edac_mc_handle_error);
diff --git a/drivers/edac/ghes_edac.c b/drivers/edac/ghes_edac.c
index ef54829..0a0ca51 100644
--- a/drivers/edac/ghes_edac.c
+++ b/drivers/edac/ghes_edac.c
@@ -175,15 +175,20 @@ static void ghes_edac_dmidecode(const struct dmi_header *dh, void *arg)
 void ghes_edac_report_mem_error(struct ghes *ghes, int sev,
 			        struct cper_sec_mem_err *mem_err)
 {
+	struct edac_raw_error_desc *e = &ghes->mci->error_event;
 	enum hw_event_mc_err_type type;
-	unsigned long page = 0, offset = 0, grain = 0;
-	char location[80];
-	char *label = "unknown";
+
+	/* Cleans the error report buffer */
+	edac_raw_error_desc_clean(e);
+	e->error_count = 1;
+	e->msg = "APEI";
+	e->label = "unknown";
+	e->other_detail = "";
 
 	if (mem_err->validation_bits & CPER_MEM_VALID_PHYSICAL_ADDRESS) {
-		page = mem_err->physical_addr >> PAGE_SHIFT;
-		offset = mem_err->physical_addr & ~PAGE_MASK;
-		grain = ~(mem_err->physical_addr_mask & ~PAGE_MASK);
+		e->page_frame_number = mem_err->physical_addr >> PAGE_SHIFT;
+		e->offset_in_page = mem_err->physical_addr & ~PAGE_MASK;
+		e->grain = ~(mem_err->physical_addr_mask & ~PAGE_MASK);
 	}
 
 	switch(sev) {
@@ -201,15 +206,14 @@ void ghes_edac_report_mem_error(struct ghes *ghes, int sev,
 		type = HW_EVENT_ERR_INFO;
 	}
 
-	sprintf(location,"node:%d card:%d module:%d bank:%d device:%d row: %d column:%d bit_pos:%d",
+	sprintf(e->location,
+		"node:%d card:%d module:%d bank:%d device:%d row: %d column:%d bit_pos:%d",
 		mem_err->node, mem_err->card, mem_err->module,
 		mem_err->bank, mem_err->device, mem_err->row, mem_err->column,
 		mem_err->bit_pos);
-	edac_dbg(3, "error at location %s\n", location);
+	edac_dbg(3, "error at location %s\n", e->location);
 
-	edac_raw_mc_handle_error(type, ghes->mci, grain, 1, 0, 0, 0,
-				 page, offset, 0,
-				 "APEI", location, label, "", 0);
+	edac_raw_mc_handle_error(type, ghes->mci, e);
 }
 EXPORT_SYMBOL_GPL(ghes_edac_report_mem_error);
 
diff --git a/include/linux/edac.h b/include/linux/edac.h
index 28232a0..7f929c3 100644
--- a/include/linux/edac.h
+++ b/include/linux/edac.h
@@ -555,6 +555,46 @@ struct errcount_attribute_data {
 	int layer0, layer1, layer2;
 };
 
+/**
+ * edac_raw_error_desc - Raw error report structure
+ * @grain:			minimum granularity for an error report, in bytes
+ * @error_count:		number of errors of the same type
+ * @top_layer:			top layer of the error (layer[0])
+ * @mid_layer:			middle layer of the error (layer[1])
+ * @low_layer:			low layer of the error (layer[2])
+ * @page_frame_number:		page where the error happened
+ * @offset_in_page:		page offset
+ * @syndrome:			syndrome of the error (or 0 if unknown or if
+ * 				the syndrome is not applicable)
+ * @msg:			error message
+ * @location:			location of the error
+ * @label:			label of the affected DIMM(s)
+ * @other_detail:		other driver-specific detail about the error
+ * @enable_per_layer_report:	if false, the error affects all layers
+ *				(typically, a memory controller error)
+ */
+struct edac_raw_error_desc {
+	/*
+	 * NOTE: everything before grain won't be cleaned by
+	 * edac_raw_error_desc_clean()
+	 */
+	char *location;
+	char *label;
+	long grain;
+
+	/* the vars below and grain will be cleaned on every new error report */
+	u16 error_count;
+	int top_layer;
+	int mid_layer;
+	int low_layer;
+	unsigned long page_frame_number;
+	unsigned long offset_in_page;
+	unsigned long syndrome;
+	const char *msg;
+	const char *other_detail;
+	bool enable_per_layer_report;
+};
+
 /* MEMORY controller information structure
  */
 struct mem_ctl_info {
@@ -663,6 +703,12 @@ struct mem_ctl_info {
 	/* work struct for this MC */
 	struct delayed_work work;
 
+	/*
+	 * Used to report an error - by being at the global struct
+	 * makes the memory allocated by the EDAC core
+	 */
+	struct edac_raw_error_desc error_event;
+
 	/* the internal state of this controller instance */
 	int op_state;
 



* Re: [PATCH EDAC 07/13] edac: add support for raw error reports
  2013-02-15 18:20             ` Mauro Carvalho Chehab
@ 2013-02-16 16:57                 ` Borislav Petkov
  0 siblings, 0 replies; 49+ messages in thread
From: Borislav Petkov @ 2013-02-16 16:57 UTC (permalink / raw)
  To: Mauro Carvalho Chehab
  Cc: linux-acpi, Huang Ying, Tony Luck, Linux Edac Mailing List,
	Linux Kernel Mailing List

On Fri, Feb 15, 2013 at 04:20:29PM -0200, Mauro Carvalho Chehab wrote:
> Yeah, pre-allocating a buffer is something that was already in my plans.
> It seems it is time to do it in a clean way. I prefer to keep this as a
> separate patch from 07/13, as it has a different rationale, and folding
> it into 07/13 would just mix two different subjects.
>
> Also, having it separate helps reviewing.

Yep.

> ---
> 
> [PATCH] edac: put all arguments for the raw error handling call into a struct
> 
> The number of arguments for edac_raw_mc_handle_error() is too big;
> put them into a structure and allocate space for it inside
> edac_mc_alloc().
> 
> That greatly reduces the stack usage and simplifies the raw API call.
> 
> Tested with sb_edac driver and MCE error injection. Worked as expected:
> 
> [  143.066100] EDAC MC0: 1 CE memory read error on CPU_SrcID#0_Channel#0_DIMM#0 (channel:0 slot:0 page:0x320 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0001:0090 socket:0 channel_mask:1 rank:0)
> [  143.086424] EDAC MC0: 1 CE memory read error on CPU_SrcID#0_Channel#0_DIMM#0 (channel:0 slot:0 page:0x320 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0001:0090 socket:0 channel_mask:1 rank:0)
> [  143.106570] EDAC MC0: 1 CE memory read error on CPU_SrcID#0_Channel#0_DIMM#0 (channel:0 slot:0 page:0x320 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0001:0090 socket:0 channel_mask:1 rank:0)
> [  143.126712] EDAC MC0: 1 CE memory read error on CPU_SrcID#0_Channel#0_DIMM#0 (channel:0 slot:0 page:0x320 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0001:0090 socket:0 channel_mask:1 rank:0)
> 
> Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
> 
> diff --git a/drivers/edac/edac_core.h b/drivers/edac/edac_core.h
> index 9c5da11..9cf33a5 100644
> --- a/drivers/edac/edac_core.h
> +++ b/drivers/edac/edac_core.h
> @@ -454,21 +454,20 @@ extern struct mem_ctl_info *edac_mc_del_mc(struct device *dev);
>  extern int edac_mc_find_csrow_by_page(struct mem_ctl_info *mci,
>  				      unsigned long page);
>  
> +static inline void edac_raw_error_desc_clean(struct edac_raw_error_desc *e)
> +{
> +	int offset = offsetof(struct edac_raw_error_desc, grain);
> +
> +	*e->location = '\0';
> +	*e->label = '\0';

Why the special handling? Why not memset the whole thing?
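
For illustration, a userspace sketch of a partial clean, using a
hypothetical, trimmed-down struct err_desc and err_desc_clean() rather
than the real EDAC types. In the patch, location and label point to
buffers allocated once in edac_mc_alloc(), so a memset over the whole
struct would lose those pointers. Separately, as far as I can tell,
"e + offset" in the quoted helper is struct-pointer arithmetic and
advances by offset * sizeof(*e) bytes, not by offset bytes; the sketch
uses a byte pointer as the base instead:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, trimmed-down stand-in for struct edac_raw_error_desc. */
struct err_desc {
	char *location;		/* preallocated buffer, must survive a clean */
	char *label;		/* preallocated buffer, must survive a clean */
	long grain;		/* first field that is reset */
	int top_layer;
	unsigned long syndrome;
};

static void err_desc_clean(struct err_desc *e)
{
	size_t offset = offsetof(struct err_desc, grain);

	/* Empty the strings but keep the buffers they live in. */
	*e->location = '\0';
	*e->label = '\0';

	/*
	 * offsetof() yields a byte count, so the base must be a byte
	 * pointer. Everything from grain to the end of the struct is zeroed.
	 */
	memset((char *)e + offset, 0, sizeof(*e) - offset);
}
```

Embedding the strings as arrays inside the struct would remove the need
for the split, since there would be no pointers left to preserve.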

> +
> +	memset(e + offset, 0, sizeof(*e) - offset);
> +}
> +
> +
>  void edac_raw_mc_handle_error(const enum hw_event_mc_err_type type,
> -			  struct mem_ctl_info *mci,
> -			  long grain,
> -			  const u16 error_count,
> -			  const int top_layer,
> -			  const int mid_layer,
> -			  const int low_layer,
> -			  const unsigned long page_frame_number,
> -			  const unsigned long offset_in_page,
> -			  const unsigned long syndrome,
> -			  const char *msg,
> -			  const char *location,
> -			  const char *label,
> -			  const char *other_detail,
> -			  const bool enable_per_layer_report);
> +			      struct mem_ctl_info *mci,
> +			      struct edac_raw_error_desc *e);
>  
>  void edac_mc_handle_error(const enum hw_event_mc_err_type type,
>  			  struct mem_ctl_info *mci,
> diff --git a/drivers/edac/edac_mc.c b/drivers/edac/edac_mc.c
> index 8fddf65..d72853b 100644
> --- a/drivers/edac/edac_mc.c
> +++ b/drivers/edac/edac_mc.c
> @@ -42,6 +42,12 @@
>  static DEFINE_MUTEX(mem_ctls_mutex);
>  static LIST_HEAD(mc_devices);
>  
> +/* Maximum size of the location string */
> +#define LOCATION_SIZE 80
> +
> +/* String used to join two or more labels */
> +#define OTHER_LABEL " or "
> +
>  /*
>   * Used to lock EDAC MC to just one module, avoiding two drivers e. g.
>   *	apei/ghes and i7core_edac to be used at the same time.
> @@ -232,6 +238,11 @@ static void _edac_mc_free(struct mem_ctl_info *mci)
>  		}
>  		kfree(mci->csrows);
>  	}
> +
> +	/* Frees the error report string area */
> +	kfree(mci->error_event.location);
> +	kfree(mci->error_event.label);
> +
>  	kfree(mci);
>  }
>  
> @@ -445,6 +456,12 @@ struct mem_ctl_info *edac_mc_alloc(unsigned mc_num,
>  		}
>  	}
>  
> +	/* Allocate memory for the error report */
> +	mci->error_event.location = kmalloc(LOCATION_SIZE, GFP_KERNEL);
> +	mci->error_event.label = kmalloc((EDAC_MC_LABEL_LEN + 1 +
> +					 sizeof(OTHER_LABEL)) * mci->tot_dimms,
> +					 GFP_KERNEL);

I see, those are separate strings. Why not embed them into struct
edac_raw_error_desc? This would simplify the whole buffer handling even
more and you won't need to kmalloc them.

Also just FYI, every time you call kmalloc, you need to handle the case
where it fails and returns NULL.
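
A userspace sketch of the failure handling mentioned above, with malloc()
and free() standing in for kmalloc() and kfree(). struct err_desc,
err_desc_alloc(), err_desc_free() and the LABEL_SLOT size are hypothetical;
only LOCATION_SIZE is taken from the patch:

```c
#include <errno.h>
#include <stdlib.h>

#define LOCATION_SIZE 80	/* same value the patch uses */
#define LABEL_SLOT 40		/* hypothetical per-DIMM label budget */

struct err_desc {
	char *location;
	char *label;
};

/*
 * Returns 0 on success or -ENOMEM, leaving no partial allocation behind.
 * The caller is expected to pass tot_dimms >= 1.
 */
static int err_desc_alloc(struct err_desc *e, size_t tot_dimms)
{
	e->location = malloc(LOCATION_SIZE);
	e->label = malloc(LABEL_SLOT * tot_dimms);
	if (!e->location || !e->label) {
		free(e->location);	/* free(NULL) is a no-op */
		free(e->label);
		e->location = NULL;
		e->label = NULL;
		return -ENOMEM;
	}
	return 0;
}

static void err_desc_free(struct err_desc *e)
{
	free(e->location);
	free(e->label);
	e->location = NULL;
	e->label = NULL;
}
```

With fixed-size arrays embedded in the descriptor, neither function would
be needed, at the cost of choosing an upper bound for the label length.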

[ … ]

> @@ -1174,18 +1162,26 @@ void edac_mc_handle_error(const enum hw_event_mc_err_type type,
>  			  const char *msg,
>  			  const char *other_detail)
>  {
> -	/* FIXME: too much for stack: move it to some pre-alocated area */
> -	char location[80];
> -	char label[(EDAC_MC_LABEL_LEN + 1 + sizeof(OTHER_LABEL)) * mci->tot_dimms];
>  	char *p;
>  	int row = -1, chan = -1;
>  	int pos[EDAC_MAX_LAYERS] = { top_layer, mid_layer, low_layer };
>  	int i;
> -	long grain;
> -	bool enable_per_layer_report = false;
> +	struct edac_raw_error_desc *e = &mci->error_event;
>  
>  	edac_dbg(3, "MC%d\n", mci->mc_idx);
>  
> +	/* Fills the error report buffer */
> +	edac_raw_error_desc_clean(e);
> +	e->error_count = error_count;
> +	e->top_layer = top_layer;
> +	e->mid_layer = mid_layer;
> +	e->low_layer = low_layer;
> +	e->page_frame_number = page_frame_number;
> +	e->offset_in_page = offset_in_page;
> +	e->syndrome = syndrome;
> +	e->msg = msg;
> +	e->other_detail = other_detail;

Btw, this could be simplified even further if we had done it like this
from the get-go: if the low-level EDAC drivers populated the buffer
themselves, we wouldn't need to do that copying again. Ironically, I did
that already in amd64_edac - see __log_bus_error, where an
amd64_edac-specific struct err_info descriptor is being handed up.

Oh well, maybe something for later.
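
A rough sketch of that idea, under stated assumptions: struct err_desc is a trimmed-down, hypothetical descriptor whose field names follow the patch, and driver_report() and core_handle_error() are made-up names for the driver side and the core side. The 4 KiB page size is also an assumption.

```c
#include <string.h>

/* Hypothetical, trimmed-down descriptor; field names follow the patch. */
struct err_desc {
	unsigned short error_count;
	int top_layer;
	int mid_layer;
	int low_layer;
	unsigned long page_frame_number;
	unsigned long offset_in_page;
	unsigned long syndrome;
	const char *msg;
};

/* Running total kept by the "core", only so the effect is observable. */
static unsigned int total_errors;

/* Core side: consumes the descriptor as-is, no per-field copy. */
static void core_handle_error(const struct err_desc *e)
{
	total_errors += e->error_count;
}

/* Driver side: decodes the error straight into the descriptor. */
static void driver_report(struct err_desc *e, unsigned long addr,
			  unsigned long syndrome)
{
	memset(e, 0, sizeof(*e));
	e->error_count = 1;
	e->top_layer = 0;
	e->mid_layer = -1;		/* -1: layer not known */
	e->low_layer = -1;
	e->page_frame_number = addr >> 12;	/* assumes 4 KiB pages */
	e->offset_in_page = addr & 0xfff;
	e->syndrome = syndrome;
	e->msg = "memory read error";

	core_handle_error(e);
}
```

The point is only that the fields are written once, by the code that knows them, instead of being passed as a long argument list and then copied into the descriptor by the core.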

[ … ]

> @@ -1302,14 +1297,9 @@ void edac_mc_handle_error(const enum hw_event_mc_err_type type,
>  			     edac_layer_name[mci->layers[i].type],
>  			     pos[i]);
>  	}
> -	if (p > location)
> +	if (p > e->location)
>  		*(p - 1) = '\0';
>  
> -	edac_raw_mc_handle_error(type, mci, grain, error_count,
> -				 top_layer, mid_layer, low_layer,
> -				 page_frame_number, offset_in_page,
> -				 syndrome,
> -				 msg, location, label, other_detail,
> -				 enable_per_layer_report);
> +	edac_raw_mc_handle_error(type, mci, e);

Ok, now this hunk looks nice. :-)

[ … ]

> diff --git a/include/linux/edac.h b/include/linux/edac.h
> index 28232a0..7f929c3 100644
> --- a/include/linux/edac.h
> +++ b/include/linux/edac.h
> @@ -555,6 +555,46 @@ struct errcount_attribute_data {
>  	int layer0, layer1, layer2;
>  };
>  
> +/**
> + * edac_raw_error_desc - Raw error report structure
> + * @grain:			minimum granularity for an error report, in bytes
> + * @error_count:		number of errors of the same type
> + * @top_layer:			top layer of the error (layer[0])
> + * @mid_layer:			middle layer of the error (layer[1])
> + * @low_layer:			low layer of the error (layer[2])
> + * @page_frame_number:		page where the error happened
> + * @offset_in_page:		page offset
> + * @syndrome:			syndrome of the error (or 0 if unknown or if
> + * 				the syndrome is not applicable)
> + * @msg:			error message
> + * @location:			location of the error
> + * @label:			label of the affected DIMM(s)
> + * @other_detail:		other driver-specific detail about the error
> + * @enable_per_layer_report:	if false, the error affects all layers
> + *				(typically, a memory controller error)
> + */
> +struct edac_raw_error_desc {
> +	/*
> +	 * NOTE: everything before grain won't be cleaned by
> +	 * edac_raw_error_desc_clean()
> +	 */
> +	char *location;
> +	char *label;
> +	long grain;
> +
> +	/* the vars below and grain will be cleaned on every new error report */
> +	u16 error_count;
> +	int top_layer;
> +	int mid_layer;
> +	int low_layer;
> +	unsigned long page_frame_number;
> +	unsigned long offset_in_page;
> +	unsigned long syndrome;
> +	const char *msg;
> +	const char *other_detail;
> +	bool enable_per_layer_report;
> +};
> +
>  /* MEMORY controller information structure
>   */
>  struct mem_ctl_info {
> @@ -663,6 +703,12 @@ struct mem_ctl_info {
>  	/* work struct for this MC */
>  	struct delayed_work work;
>  
> +	/*
> +	 * Used to report an error - by being at the global struct
> +	 * makes the memory allocated by the EDAC core
> +	 */
> +	struct edac_raw_error_desc error_event;

I think 'error_desc' is clearer. This way you can refer to it everywhere
with mci->error_desc and you know what it is. ->error_event is kinda
ambiguous IMHO.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.

* Re: [PATCH EDAC 07/13] edac: add support for raw error reports
  2013-02-16 16:57                 ` Borislav Petkov
@ 2013-02-17 10:44                   ` Mauro Carvalho Chehab
  -1 siblings, 0 replies; 49+ messages in thread
From: Mauro Carvalho Chehab @ 2013-02-17 10:44 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: linux-acpi, Huang Ying, Tony Luck, Linux Edac Mailing List,
	Linux Kernel Mailing List

On Sat, 16 Feb 2013 17:57:48 +0100
Borislav Petkov <bp@alien8.de> wrote:

> On Fri, Feb 15, 2013 at 04:20:29PM -0200, Mauro Carvalho Chehab wrote:
> > Yeah, pre-allocating a buffer is something that it was on my plans. It
> > seems it is time to do it in a clean way. I prefer to keep this as a
> > separate patch from 07/13, as it has a different rationale, and mixing
> > with 07/13 would just mix two different subjects.
> >
> > Also, having it separate helps reviewing.
> 
> Yep.
> 
> > ---
> > 
> > [PATCH] edac: put all arguments for the raw error handling call into a struct
> > 
> > The number of arguments for edac_raw_mc_handle_error() is too big;
> > put them into a structure and allocate space for it inside
> > edac_mc_alloc().
> > 
> > That reduces a lot the stack usage and simplifies the raw API call.
> > 
> > Tested with sb_edac driver and MCE error injection. Worked as expected:
> > 
> > [  143.066100] EDAC MC0: 1 CE memory read error on CPU_SrcID#0_Channel#0_DIMM#0 (channel:0 slot:0 page:0x320 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0001:0090 socket:0 channel_mask:1 rank:0)
> > [  143.086424] EDAC MC0: 1 CE memory read error on CPU_SrcID#0_Channel#0_DIMM#0 (channel:0 slot:0 page:0x320 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0001:0090 socket:0 channel_mask:1 rank:0)
> > [  143.106570] EDAC MC0: 1 CE memory read error on CPU_SrcID#0_Channel#0_DIMM#0 (channel:0 slot:0 page:0x320 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0001:0090 socket:0 channel_mask:1 rank:0)
> > [  143.126712] EDAC MC0: 1 CE memory read error on CPU_SrcID#0_Channel#0_DIMM#0 (channel:0 slot:0 page:0x320 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0001:0090 socket:0 channel_mask:1 rank:0)
> > 
> > Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
> > 
> > diff --git a/drivers/edac/edac_core.h b/drivers/edac/edac_core.h
> > index 9c5da11..9cf33a5 100644
> > --- a/drivers/edac/edac_core.h
> > +++ b/drivers/edac/edac_core.h
> > @@ -454,21 +454,20 @@ extern struct mem_ctl_info *edac_mc_del_mc(struct device *dev);
> >  extern int edac_mc_find_csrow_by_page(struct mem_ctl_info *mci,
> >  				      unsigned long page);
> >  
> > +static inline void edac_raw_error_desc_clean(struct edac_raw_error_desc *e)
> > +{
> > +	int offset = offsetof(struct edac_raw_error_desc, grain);
> > +
> > +	*e->location = '\0';
> > +	*e->label = '\0';
> 
> Why the special handling? Why not memset the whole thing?

We don't want to clear the pointers to the allocated areas, only the
strings they point to.
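
A minimal sketch of such a partial clear, assuming a hypothetical, shortened struct raw_desc with the same leading layout as the patch (two buffer pointers, then grain). Note that the quoted helper computes "e + offset" on a struct pointer, which scales the offset by sizeof(*e); the sketch casts to char * first so that the byte offset from offsetof() is applied as bytes.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, shortened descriptor; leading layout follows the patch. */
struct raw_desc {
	/* Not cleared: pointers to the preallocated string buffers. */
	char *location;
	char *label;

	/* Cleared on every new report: grain and everything after it. */
	long grain;
	unsigned short error_count;
	int top_layer;
	const char *msg;
};

static void raw_desc_clean(struct raw_desc *e)
{
	size_t offset = offsetof(struct raw_desc, grain);

	/* Empty the strings but keep the buffers they live in. */
	e->location[0] = '\0';
	e->label[0] = '\0';

	/*
	 * offsetof() is in bytes, so the arithmetic must be done on a byte
	 * pointer; "e + offset" would advance by offset * sizeof(*e).
	 */
	memset((char *)e + offset, 0, sizeof(*e) - offset);
}
```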

> 
> > +
> > +	memset(e + offset, 0, sizeof(*e) - offset);
> > +}
> > +
> > +
> >  void edac_raw_mc_handle_error(const enum hw_event_mc_err_type type,
> > -			  struct mem_ctl_info *mci,
> > -			  long grain,
> > -			  const u16 error_count,
> > -			  const int top_layer,
> > -			  const int mid_layer,
> > -			  const int low_layer,
> > -			  const unsigned long page_frame_number,
> > -			  const unsigned long offset_in_page,
> > -			  const unsigned long syndrome,
> > -			  const char *msg,
> > -			  const char *location,
> > -			  const char *label,
> > -			  const char *other_detail,
> > -			  const bool enable_per_layer_report);
> > +			      struct mem_ctl_info *mci,
> > +			      struct edac_raw_error_desc *e);
> >  
> >  void edac_mc_handle_error(const enum hw_event_mc_err_type type,
> >  			  struct mem_ctl_info *mci,
> > diff --git a/drivers/edac/edac_mc.c b/drivers/edac/edac_mc.c
> > index 8fddf65..d72853b 100644
> > --- a/drivers/edac/edac_mc.c
> > +++ b/drivers/edac/edac_mc.c
> > @@ -42,6 +42,12 @@
> >  static DEFINE_MUTEX(mem_ctls_mutex);
> >  static LIST_HEAD(mc_devices);
> >  
> > +/* Maximum size of the location string */
> > +#define LOCATION_SIZE 80
> > +
> > +/* String used to join two or more labels */
> > +#define OTHER_LABEL " or "
> > +
> >  /*
> >   * Used to lock EDAC MC to just one module, avoiding two drivers e. g.
> >   *	apei/ghes and i7core_edac to be used at the same time.
> > @@ -232,6 +238,11 @@ static void _edac_mc_free(struct mem_ctl_info *mci)
> >  		}
> >  		kfree(mci->csrows);
> >  	}
> > +
> > +	/* Frees the error report string area */
> > +	kfree(mci->error_event.location);
> > +	kfree(mci->error_event.label);
> > +
> >  	kfree(mci);
> >  }
> >  
> > @@ -445,6 +456,12 @@ struct mem_ctl_info *edac_mc_alloc(unsigned mc_num,
> >  		}
> >  	}
> >  
> > +	/* Allocate memory for the error report */
> > +	mci->error_event.location = kmalloc(LOCATION_SIZE, GFP_KERNEL);
> > +	mci->error_event.label = kmalloc((EDAC_MC_LABEL_LEN + 1 +
> > +					 sizeof(OTHER_LABEL)) * mci->tot_dimms,
> > +					 GFP_KERNEL);
> 
> I see, those are separate strings. Why not embed them into struct
> edac_raw_error_desc? This would simplify the whole buffer handling even
> more and you won't need to kmalloc them.

We could do it for the location. The space for the label, however, depends
on how many DIMMs are in the system, as multiple DIMMs may be present, and
the core will point to all possibly affected DIMMs.

Ok, perhaps we could just allocate one big area for it (like one page),
as this would very likely be enough, and change the logic to take the
buffer size into account when filling it.
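
A small sketch of what "taking the buffer size into account" could look like, assuming both strings are embedded in a hypothetical struct err_strings and that labels are joined with OTHER_LABEL as in the patch. LABEL_SIZE is deliberately tiny here so the overflow case is easy to see; the mail suggests something closer to one page. label_append() is an illustrative name, not an existing EDAC helper.

```c
#include <stdio.h>
#include <string.h>

#define OTHER_LABEL	" or "
#define LOCATION_SIZE	80
#define LABEL_SIZE	16	/* hypothetical; tiny on purpose */

/* Hypothetical: both strings embedded, so no separate allocation. */
struct err_strings {
	char location[LOCATION_SIZE];
	char label[LABEL_SIZE];
};

/*
 * Append "label" to buf, preceded by OTHER_LABEL when buf is non-empty.
 * buf must already hold a NUL-terminated string shorter than size.
 * Returns 0 on success, -1 if it does not fit (buf is left unchanged).
 */
static int label_append(char *buf, size_t size, const char *label)
{
	size_t used = strlen(buf);
	const char *sep = used ? OTHER_LABEL : "";
	size_t need = strlen(sep) + strlen(label);

	/* Room is needed for the new text plus the terminating NUL. */
	if (need >= size - used)
		return -1;

	snprintf(buf + used, size - used, "%s%s", sep, label);
	return 0;
}
```

Refusing the whole label when it does not fit is one possible policy; truncating and marking the string as incomplete would be another.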

> Also just FYI, everytime you do kmalloc, you need to handle the case
> where it returns an error.

Yeah, I forgot to add the error handling logic.

> [ … ]
> 
> > @@ -1174,18 +1162,26 @@ void edac_mc_handle_error(const enum hw_event_mc_err_type type,
> >  			  const char *msg,
> >  			  const char *other_detail)
> >  {
> > -	/* FIXME: too much for stack: move it to some pre-alocated area */
> > -	char location[80];
> > -	char label[(EDAC_MC_LABEL_LEN + 1 + sizeof(OTHER_LABEL)) * mci->tot_dimms];
> >  	char *p;
> >  	int row = -1, chan = -1;
> >  	int pos[EDAC_MAX_LAYERS] = { top_layer, mid_layer, low_layer };
> >  	int i;
> > -	long grain;
> > -	bool enable_per_layer_report = false;
> > +	struct edac_raw_error_desc *e = &mci->error_event;
> >  
> >  	edac_dbg(3, "MC%d\n", mci->mc_idx);
> >  
> > +	/* Fills the error report buffer */
> > +	edac_raw_error_desc_clean(e);
> > +	e->error_count = error_count;
> > +	e->top_layer = top_layer;
> > +	e->mid_layer = mid_layer;
> > +	e->low_layer = low_layer;
> > +	e->page_frame_number = page_frame_number;
> > +	e->offset_in_page = offset_in_page;
> > +	e->syndrome = syndrome;
> > +	e->msg = msg;
> > +	e->other_detail = other_detail;
> 
> Btw, this could be simplified even further if we would've made it like
> this from the get-go: if lowlevel EDAC drivers would populate the buffer
> already, we wouldn't need to do that copying again. And, it is ironic
> but I did that already in amd64_edac - see __log_bus_error, where I have
> an amd64_edac-specific struct err_info descriptor which is being handed
> off up.
> 
> Oh well, maybe something for later.

I agree, but this is a separate patch, IMHO. It will require changing the
logic again in all drivers, and re-testing compilation on all supported archs.

> [ … ]
> 
> > @@ -1302,14 +1297,9 @@ void edac_mc_handle_error(const enum hw_event_mc_err_type type,
> >  			     edac_layer_name[mci->layers[i].type],
> >  			     pos[i]);
> >  	}
> > -	if (p > location)
> > +	if (p > e->location)
> >  		*(p - 1) = '\0';
> >  
> > -	edac_raw_mc_handle_error(type, mci, grain, error_count,
> > -				 top_layer, mid_layer, low_layer,
> > -				 page_frame_number, offset_in_page,
> > -				 syndrome,
> > -				 msg, location, label, other_detail,
> > -				 enable_per_layer_report);
> > +	edac_raw_mc_handle_error(type, mci, e);
> 
> Ok, now this hunk looks nice. :-)
> 
> [ … ]
> 
> > diff --git a/include/linux/edac.h b/include/linux/edac.h
> > index 28232a0..7f929c3 100644
> > --- a/include/linux/edac.h
> > +++ b/include/linux/edac.h
> > @@ -555,6 +555,46 @@ struct errcount_attribute_data {
> >  	int layer0, layer1, layer2;
> >  };
> >  
> > +/**
> > + * edac_raw_error_desc - Raw error report structure
> > + * @grain:			minimum granularity for an error report, in bytes
> > + * @error_count:		number of errors of the same type
> > + * @top_layer:			top layer of the error (layer[0])
> > + * @mid_layer:			middle layer of the error (layer[1])
> > + * @low_layer:			low layer of the error (layer[2])
> > + * @page_frame_number:		page where the error happened
> > + * @offset_in_page:		page offset
> > + * @syndrome:			syndrome of the error (or 0 if unknown or if
> > + * 				the syndrome is not applicable)
> > + * @msg:			error message
> > + * @location:			location of the error
> > + * @label:			label of the affected DIMM(s)
> > + * @other_detail:		other driver-specific detail about the error
> > + * @enable_per_layer_report:	if false, the error affects all layers
> > + *				(typically, a memory controller error)
> > + */
> > +struct edac_raw_error_desc {
> > +	/*
> > +	 * NOTE: everything before grain won't be cleaned by
> > +	 * edac_raw_error_desc_clean()
> > +	 */
> > +	char *location;
> > +	char *label;
> > +	long grain;
> > +
> > +	/* the vars below and grain will be cleaned on every new error report */
> > +	u16 error_count;
> > +	int top_layer;
> > +	int mid_layer;
> > +	int low_layer;
> > +	unsigned long page_frame_number;
> > +	unsigned long offset_in_page;
> > +	unsigned long syndrome;
> > +	const char *msg;
> > +	const char *other_detail;
> > +	bool enable_per_layer_report;
> > +};
> > +
> >  /* MEMORY controller information structure
> >   */
> >  struct mem_ctl_info {
> > @@ -663,6 +703,12 @@ struct mem_ctl_info {
> >  	/* work struct for this MC */
> >  	struct delayed_work work;
> >  
> > +	/*
> > +	 * Used to report an error - by being at the global struct
> > +	 * makes the memory allocated by the EDAC core
> > +	 */
> > +	struct edac_raw_error_desc error_event;
> 
> I think 'error_desc' is clearer. This way you can refer to it everywhere
> with mci->error_desc and you know what it is. ->error_event is kinda
> ambiguous IMHO.

Ok, I'll replace it.


-- 

Cheers,
Mauro

> with mci->error_desc and you know what it is. ->error_event is kinda
> ambiguous IMHO.

Ok, I'll replace it.


-- 

Cheers,
Mauro


* Re: [PATCH EDAC 07/13] edac: add support for raw error reports
  2013-02-17 10:44                   ` Mauro Carvalho Chehab
  (?)
@ 2013-02-18 13:52                   ` Borislav Petkov
  2013-02-18 15:24                     ` Mauro Carvalho Chehab
  -1 siblings, 1 reply; 49+ messages in thread
From: Borislav Petkov @ 2013-02-18 13:52 UTC (permalink / raw)
  To: Mauro Carvalho Chehab
  Cc: linux-acpi, Huang Ying, Tony Luck, Linux Edac Mailing List,
	Linux Kernel Mailing List

On Sun, Feb 17, 2013 at 07:44:04AM -0300, Mauro Carvalho Chehab wrote:
> We could do it for the location. The space for label, however, depends on
> how many DIMMs are in the system, as multiple dimm's may be present, and
> the core will point to all possible affected DIMMs.
> 
> Ok, perhaps we could just allocate one big area for it (like one page), 
> as this would very likely be enough for it, and change the logic to take
> the buffer size into account when filling it.

Or, in the case where ->label is all dimms on the mci, you simply put
"All DIMMs on MCI%d" in there and done. Simple.

Thanks.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--


* Re: [PATCH EDAC 07/13] edac: add support for raw error reports
  2013-02-18 13:52                   ` Borislav Petkov
@ 2013-02-18 15:24                     ` Mauro Carvalho Chehab
  2013-02-19 11:56                       ` Mauro Carvalho Chehab
  0 siblings, 1 reply; 49+ messages in thread
From: Mauro Carvalho Chehab @ 2013-02-18 15:24 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: linux-acpi, Huang Ying, Tony Luck, Linux Edac Mailing List,
	Linux Kernel Mailing List

On Mon, 18 Feb 2013 14:52:51 +0100
Borislav Petkov <bp@alien8.de> wrote:

> On Sun, Feb 17, 2013 at 07:44:04AM -0300, Mauro Carvalho Chehab wrote:
> > We could do it for the location. The space for label, however, depends on
> > how many DIMMs are in the system, as multiple dimm's may be present, and
> > the core will point to all possible affected DIMMs.
> > 
> > Ok, perhaps we could just allocate one big area for it (like one page), 
> > as this would very likely be enough for it, and change the logic to take
> > the buffer size into account when filling it.
> 
> Or, in the case where ->label is all dimms on the mci, you simply put
> "All DIMMs on MCI%d" in there and done. Simple.

The core already does this when it has no clue at all about where the error
is.

The core is prepared for the case where the location is only half-filled,
as this is a common scenario in the drivers, and important enough on
some memory controllers.

As already discussed, on most memory controllers nowadays, the memory
controller can't point to a single DIMM, as the error correction code
takes 128 bits (2 DIMMs). It is impossible for the error correction
code to determine on which DIMM an uncorrected error happened[1].

With Nehalem memory controllers, depending on the memory configuration,
the minimal DIMM granularity for an uncorrected error can be even worse:
4 DIMMs, if the 128-bit error correction code and mirror mode are both enabled.

There are some corner cases where the driver simply cannot discover on
which channel, or on which DIMM (or csrow) inside a channel, the error
happened. The error could be associated with some failure in the logic
or on the bus that communicates with the Advanced Memory Buffers on an
FB-DIMM memory controller, for example.

So, the core's real worst-case scenario is when the driver can't
determine on which DIMM inside a channel the error happened. As a channel
can have a large number of DIMMs[2] (16? Not sure what the worst case is),
the allocated area for the label should be conservative.

[1] Such an error may not even be fatal, if that particular address is
unused.

[2] Currently, up to 8, according to:
	$for i in $(git grep "layers.*size\s*=" drivers/edac|perl -ne 'print "$1 " if (m/\=\s*([A-Z][^\s]+);/);'); do echo $i; git grep $i drivers/edac; done|grep define|perl -ne 'print "$1 " if (m/define\s+[^\s]+\s(\d+)/)'
	8 8 2 2 4 2 3 3 3 8 4 4 3 3 1 1 4 

and
	$ git grep "layers.*size\s*=" drivers/edac|perl -ne 'print "$1 " if (m/\=\s*(\d+);/);'
	1 1 1 1 2 2 8 4 1 1 1 1 

Nothing prevents a driver from having more than 8 DIMMs per layer
in the future.
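
To put some numbers on the label area, here is a small sketch. It assumes the
same per-label formula that the on-stack buffer in edac_mc_handle_error()
uses, (EDAC_MC_LABEL_LEN + 1 + sizeof(OTHER_LABEL)) bytes per DIMM; the two
helper functions are made up for this example.

```c
#include <stddef.h>

#define EDAC_MC_LABEL_LEN	31	/* value from include/linux/edac.h */
#define OTHER_LABEL		" or "	/* separator used when joining labels */

/* Bytes reserved per label: the label, one spare byte, and the separator. */
#define LABEL_SLOT_SIZE	(EDAC_MC_LABEL_LEN + 1 + sizeof(OTHER_LABEL))

/* Hypothetical helper: buffer size needed to list up to n_dimms labels. */
static size_t label_buf_size(size_t n_dimms)
{
	return LABEL_SLOT_SIZE * n_dimms;
}

/* Hypothetical helper: how many labels fit in buf_size bytes. */
static size_t labels_that_fit(size_t buf_size)
{
	return buf_size / LABEL_SLOT_SIZE;
}
```

With these values a slot takes 37 bytes, so 8 DIMMs need 296 bytes and 16
DIMMs need 592 bytes. A single page, assuming it is 4096 bytes, would hold
110 labels.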

-- 

Cheers,
Mauro


* Re: [PATCH EDAC 07/13] edac: add support for raw error reports
  2013-02-18 15:24                     ` Mauro Carvalho Chehab
@ 2013-02-19 11:56                       ` Mauro Carvalho Chehab
  0 siblings, 0 replies; 49+ messages in thread
From: Mauro Carvalho Chehab @ 2013-02-19 11:56 UTC (permalink / raw)
  To: Mauro Carvalho Chehab
  Cc: Borislav Petkov, linux-acpi, Huang Ying, Tony Luck,
	Linux Edac Mailing List, Linux Kernel Mailing List

On Mon, 18 Feb 2013 12:24:29 -0300
Mauro Carvalho Chehab <mchehab@redhat.com> wrote:

> On Mon, 18 Feb 2013 14:52:51 +0100
> Borislav Petkov <bp@alien8.de> wrote:
> 
> > On Sun, Feb 17, 2013 at 07:44:04AM -0300, Mauro Carvalho Chehab wrote:
> > > We could do it for the location. The space for label, however, depends on
> > > how many DIMMs are in the system, as multiple dimm's may be present, and
> > > the core will point to all possible affected DIMMs.
> > > 
> > > Ok, perhaps we could just allocate one big area for it (like one page), 
> > > as this would very likely be enough for it, and change the logic to take
> > > the buffer size into account when filling it.
> > 
> > Or, in the case where ->label is all dimms on the mci, you simply put
> > "All DIMMs on MCI%d" in there and done. Simple.
> 
> The core already does this when it has no clue at all about where the error
> is.
> 
> The core is prepared for the case where the location is only half-filled,
> as this is a common scenario in the drivers, and important enough on
> some memory controllers.
> 
> As already discussed, on most memory controllers nowadays, the memory
> controller can't point to a single DIMM, as the error correction code
> takes 128 bits (2 DIMMs). It is impossible for the error correction
> code to determine on what DIMM an uncorrected error happened[1].
> 
> With Nehalem memory controllers, depending on the memory configuration,
> the minimal DIMM granularity for an uncorrected error can be even worse: 
> 4 DIMMs, if 128-bits error correction code and mirror mode are both enabled.
> 
> There are some corner cases where the driver simply cannot discover on
> which channel, or on which DIMM (or csrow) inside a channel, the error
> happened. The error could be associated with some failure in the logic
> or on the bus that communicates with the Advanced Memory Buffers on an
> FB-DIMM memory controller, for example.
> 
> So, the core's real worst-case scenario is when the driver can't
> determine on which DIMM inside a channel the error happened. As a channel
> can have a large number of DIMMs[2] (16? Not sure what the worst case is),
> the allocated area for the label should be conservative.
> 
> [1] Such an error may not even be fatal, if that particular address is
> unused.
> 
> [2] Currently, up to 8, according to:
> 	$for i in $(git grep "layers.*size\s*=" drivers/edac|perl -ne 'print "$1 " if (m/\=\s*([A-Z][^\s]+);/);'); do echo $i; git grep $i drivers/edac; done|grep define|perl -ne 'print "$1 " if (m/define\s+[^\s]+\s(\d+)/)'
> 	8 8 2 2 4 2 3 3 3 8 4 4 3 3 1 1 4 
> 
> and
> 	$ git grep "layers.*size\s*=" drivers/edac|perl -ne 'print "$1 " if (m/\=\s*(\d+);/);'
> 	1 1 1 1 2 2 8 4 1 1 1 1 
> 
> Nothing prevents a driver from having more than 8 DIMMs per layer
> in the future.

I suspect that you'll be happy with the enclosed patch ;)

It embeds the two string buffers in the mci structure. There is space for
up to EDAC_MAX_LABELS labels in the "mci->error_desc.label" string. If an
error affects more than EDAC_MAX_LABELS DIMMs, the report logic will write
"any memory", just like what happens when the driver can't discover where
the error is.
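
The capping logic can be sketched in userspace as below. join_labels() is a
hypothetical helper written only for illustration (the patch does this inline
in edac_mc_handle_error()); the macros carry the values that this patch puts
in include/linux/edac.h.

```c
#include <string.h>

#define EDAC_MC_LABEL_LEN	31
#define EDAC_MAX_LABELS		8
#define OTHER_LABEL		" or "
#define LABEL_BUF_SIZE \
	((EDAC_MC_LABEL_LEN + 1 + sizeof(OTHER_LABEL)) * EDAC_MAX_LABELS)

/*
 * Hypothetical helper, not part of the patch: join the labels of the
 * affected DIMMs with OTHER_LABEL into buf (LABEL_BUF_SIZE bytes).
 * If more than EDAC_MAX_LABELS DIMMs are affected, give up on per-DIMM
 * reporting and write "any memory" instead.
 * Returns 1 if per-DIMM labels were written, 0 if the fallback was used.
 */
static int join_labels(char *buf, const char *const *labels, int n)
{
	char *p = buf;
	int i;

	*p = '\0';
	for (i = 0; i < n; i++) {
		size_t len;

		if (i >= EDAC_MAX_LABELS) {
			strcpy(buf, "any memory");
			return 0;
		}
		if (p != buf) {
			strcpy(p, OTHER_LABEL);
			p += strlen(OTHER_LABEL);
		}
		/* Copy at most EDAC_MC_LABEL_LEN characters of each label. */
		len = strlen(labels[i]);
		if (len > EDAC_MC_LABEL_LEN)
			len = EDAC_MC_LABEL_LEN;
		memcpy(p, labels[i], len);
		p += len;
		*p = '\0';
	}
	return 1;
}
```

Two affected DIMMs would give something like "DIMM_A1 or DIMM_A2", while nine
affected DIMMs would collapse into "any memory".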

Tested with sb_edac driver.

Regards,
Mauro

[PATCH EDAC] edac: put all arguments for the raw error handling call into a struct

The number of arguments for edac_raw_mc_handle_error() is too big;
put them into a structure and allocate space for it inside
edac_mc_alloc().

That greatly reduces stack usage and simplifies the raw API call.

Tested with sb_edac driver and MCE error injection. Worked as expected:

[  143.066100] EDAC MC0: 1 CE memory read error on CPU_SrcID#0_Channel#0_DIMM#0 (channel:0 slot:0 page:0x320 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0001:0090 socket:0 channel_mask:1 rank:0)
[  143.086424] EDAC MC0: 1 CE memory read error on CPU_SrcID#0_Channel#0_DIMM#0 (channel:0 slot:0 page:0x320 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0001:0090 socket:0 channel_mask:1 rank:0)
[  143.106570] EDAC MC0: 1 CE memory read error on CPU_SrcID#0_Channel#0_DIMM#0 (channel:0 slot:0 page:0x320 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0001:0090 socket:0 channel_mask:1 rank:0)
[  143.126712] EDAC MC0: 1 CE memory read error on CPU_SrcID#0_Channel#0_DIMM#0 (channel:0 slot:0 page:0x320 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0001:0090 socket:0 channel_mask:1 rank:0)

Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>

diff --git a/drivers/edac/edac_core.h b/drivers/edac/edac_core.h
index 9c5da11..3c2625e 100644
--- a/drivers/edac/edac_core.h
+++ b/drivers/edac/edac_core.h
@@ -455,20 +455,8 @@ extern int edac_mc_find_csrow_by_page(struct mem_ctl_info *mci,
 				      unsigned long page);
 
 void edac_raw_mc_handle_error(const enum hw_event_mc_err_type type,
-			  struct mem_ctl_info *mci,
-			  long grain,
-			  const u16 error_count,
-			  const int top_layer,
-			  const int mid_layer,
-			  const int low_layer,
-			  const unsigned long page_frame_number,
-			  const unsigned long offset_in_page,
-			  const unsigned long syndrome,
-			  const char *msg,
-			  const char *location,
-			  const char *label,
-			  const char *other_detail,
-			  const bool enable_per_layer_report);
+			      struct mem_ctl_info *mci,
+			      struct edac_raw_error_desc *e);
 
 void edac_mc_handle_error(const enum hw_event_mc_err_type type,
 			  struct mem_ctl_info *mci,
diff --git a/drivers/edac/edac_mc.c b/drivers/edac/edac_mc.c
index 8fddf65..ab1ef5c 100644
--- a/drivers/edac/edac_mc.c
+++ b/drivers/edac/edac_mc.c
@@ -1066,78 +1067,49 @@ static void edac_ue_error(struct mem_ctl_info *mci,
 	edac_inc_ue_error(mci, enable_per_layer_report, pos, error_count);
 }
 
-#define OTHER_LABEL " or "
-
 /**
  * edac_raw_mc_handle_error - reports a memory event to userspace without doing
  *			      anything to discover the error location
  *
  * @type:		severity of the error (CE/UE/Fatal)
  * @mci:		a struct mem_ctl_info pointer
- * @grain:		error granularity
- * @error_count:	Number of errors of the same type
- * @top_layer:		Memory layer[0] position
- * @mid_layer:		Memory layer[1] position
- * @low_layer:		Memory layer[2] position
- * @page_frame_number:	mem page where the error occurred
- * @offset_in_page:	offset of the error inside the page
- * @syndrome:		ECC syndrome
- * @msg:		Message meaningful to the end users that
- *			explains the event\
- * @location:		location of the error, like "csrow:0 channel:1"
- * @label:		DIMM labels for the affected memory(ies)
- * @other_detail:	Technical details about the event that
- *			may help hardware manufacturers and
- *			EDAC developers to analyse the event
- * @enable_per_layer_report: should it increment per-layer error counts?
+ * @e:			error description
  *
  * This raw function is used internally by edac_mc_handle_error(). It should
  * only be called directly when the hardware error come directly from BIOS,
  * like in the case of APEI GHES driver.
  */
 void edac_raw_mc_handle_error(const enum hw_event_mc_err_type type,
-			  struct mem_ctl_info *mci,
-			  long grain,
-			  const u16 error_count,
-			  const int top_layer,
-			  const int mid_layer,
-			  const int low_layer,
-			  const unsigned long page_frame_number,
-			  const unsigned long offset_in_page,
-			  const unsigned long syndrome,
-			  const char *msg,
-			  const char *location,
-			  const char *label,
-			  const char *other_detail,
-			  const bool enable_per_layer_report)
+			      struct mem_ctl_info *mci,
+			      struct edac_raw_error_desc *e)
 {
 	char detail[80];
 	u8 grain_bits;
-	int pos[EDAC_MAX_LAYERS] = { top_layer, mid_layer, low_layer };
+	int pos[EDAC_MAX_LAYERS] = { e->top_layer, e->mid_layer, e->low_layer };
 
 	/* Report the error via the trace interface */
-	grain_bits = fls_long(grain) + 1;
-	trace_mc_event(type, msg, label, error_count,
-		       mci->mc_idx, top_layer, mid_layer, low_layer,
-		       PAGES_TO_MiB(page_frame_number) | offset_in_page,
-		       grain_bits, syndrome, other_detail);
+	grain_bits = fls_long(e->grain) + 1;
+	trace_mc_event(type, e->msg, e->label, e->error_count,
+		       mci->mc_idx, e->top_layer, e->mid_layer, e->low_layer,
+		       PAGES_TO_MiB(e->page_frame_number) | e->offset_in_page,
+		       grain_bits, e->syndrome, e->other_detail);
 
 	/* Memory type dependent details about the error */
 	if (type == HW_EVENT_ERR_CORRECTED) {
 		snprintf(detail, sizeof(detail),
 			"page:0x%lx offset:0x%lx grain:%ld syndrome:0x%lx",
-			page_frame_number, offset_in_page,
-			grain, syndrome);
-		edac_ce_error(mci, error_count, pos, msg, location, label,
-			      detail, other_detail, enable_per_layer_report,
-			      page_frame_number, offset_in_page, grain);
+			e->page_frame_number, e->offset_in_page,
+			e->grain, e->syndrome);
+		edac_ce_error(mci, e->error_count, pos, e->msg, e->location, e->label,
+			      detail, e->other_detail, e->enable_per_layer_report,
+			      e->page_frame_number, e->offset_in_page, e->grain);
 	} else {
 		snprintf(detail, sizeof(detail),
 			"page:0x%lx offset:0x%lx grain:%ld",
-			page_frame_number, offset_in_page, grain);
+			e->page_frame_number, e->offset_in_page, e->grain);
 
-		edac_ue_error(mci, error_count, pos, msg, location, label,
-			      detail, other_detail, enable_per_layer_report);
+		edac_ue_error(mci, e->error_count, pos, e->msg, e->location, e->label,
+			      detail, e->other_detail, e->enable_per_layer_report);
 	}
 
 
@@ -1174,18 +1146,26 @@ void edac_mc_handle_error(const enum hw_event_mc_err_type type,
 			  const char *msg,
 			  const char *other_detail)
 {
-	/* FIXME: too much for stack: move it to some pre-alocated area */
-	char location[80];
-	char label[(EDAC_MC_LABEL_LEN + 1 + sizeof(OTHER_LABEL)) * mci->tot_dimms];
 	char *p;
 	int row = -1, chan = -1;
 	int pos[EDAC_MAX_LAYERS] = { top_layer, mid_layer, low_layer };
-	int i;
-	long grain;
-	bool enable_per_layer_report = false;
+	int i, n_labels = 0;
+	struct edac_raw_error_desc *e = &mci->error_desc;
 
 	edac_dbg(3, "MC%d\n", mci->mc_idx);
 
+	/* Fills the error report buffer */
+	memset(e, 0, sizeof (*e));
+	e->error_count = error_count;
+	e->top_layer = top_layer;
+	e->mid_layer = mid_layer;
+	e->low_layer = low_layer;
+	e->page_frame_number = page_frame_number;
+	e->offset_in_page = offset_in_page;
+	e->syndrome = syndrome;
+	e->msg = msg;
+	e->other_detail = other_detail;
+
 	/*
 	 * Check if the event report is consistent and if the memory
 	 * location is known. If it is known, enable_per_layer_report will be
@@ -1208,7 +1188,7 @@ void edac_mc_handle_error(const enum hw_event_mc_err_type type,
 			pos[i] = -1;
 		}
 		if (pos[i] >= 0)
-			enable_per_layer_report = true;
+			e->enable_per_layer_report = true;
 	}
 
 	/*
@@ -1222,8 +1202,7 @@ void edac_mc_handle_error(const enum hw_event_mc_err_type type,
 	 * where each memory belongs to a separate channel within the same
 	 * branch.
 	 */
-	grain = 0;
-	p = label;
+	p = e->label;
 	*p = '\0';
 
 	for (i = 0; i < mci->tot_dimms; i++) {
@@ -1237,8 +1216,8 @@ void edac_mc_handle_error(const enum hw_event_mc_err_type type,
 			continue;
 
 		/* get the max grain, over the error match range */
-		if (dimm->grain > grain)
-			grain = dimm->grain;
+		if (dimm->grain > e->grain)
+			e->grain = dimm->grain;
 
 		/*
 		 * If the error is memory-controller wide, there's no need to
@@ -1246,8 +1225,13 @@ void edac_mc_handle_error(const enum hw_event_mc_err_type type,
 		 * channel/memory controller/...  may be affected.
 		 * Also, don't show errors for empty DIMM slots.
 		 */
-		if (enable_per_layer_report && dimm->nr_pages) {
-			if (p != label) {
+		if (e->enable_per_layer_report && dimm->nr_pages) {
+			if (n_labels >= EDAC_MAX_LABELS) {
+				e->enable_per_layer_report = false;
+				break;
+			}
+			n_labels++;
+			if (p != e->label) {
 				strcpy(p, OTHER_LABEL);
 				p += strlen(OTHER_LABEL);
 			}
@@ -1274,12 +1258,12 @@ void edac_mc_handle_error(const enum hw_event_mc_err_type type,
 		}
 	}
 
-	if (!enable_per_layer_report) {
-		strcpy(label, "any memory");
+	if (!e->enable_per_layer_report) {
+		strcpy(e->label, "any memory");
 	} else {
 		edac_dbg(4, "csrow/channel to increment: (%d,%d)\n", row, chan);
-		if (p == label)
-			strcpy(label, "unknown memory");
+		if (p == e->label)
+			strcpy(e->label, "unknown memory");
 		if (type == HW_EVENT_ERR_CORRECTED) {
 			if (row >= 0) {
 				mci->csrows[row]->ce_count += error_count;
@@ -1292,7 +1276,7 @@ void edac_mc_handle_error(const enum hw_event_mc_err_type type,
 	}
 
 	/* Fill the RAM location data */
-	p = location;
+	p = e->location;
 
 	for (i = 0; i < mci->n_layers; i++) {
 		if (pos[i] < 0)
@@ -1302,14 +1286,9 @@ void edac_mc_handle_error(const enum hw_event_mc_err_type type,
 			     edac_layer_name[mci->layers[i].type],
 			     pos[i]);
 	}
-	if (p > location)
+	if (p > e->location)
 		*(p - 1) = '\0';
 
-	edac_raw_mc_handle_error(type, mci, grain, error_count,
-				 top_layer, mid_layer, low_layer,
-				 page_frame_number, offset_in_page,
-				 syndrome,
-				 msg, location, label, other_detail,
-				 enable_per_layer_report);
+	edac_raw_mc_handle_error(type, mci, e);
 }
 EXPORT_SYMBOL_GPL(edac_mc_handle_error);
diff --git a/drivers/edac/ghes_edac.c b/drivers/edac/ghes_edac.c
index ef54829..9d7f797 100644
--- a/drivers/edac/ghes_edac.c
+++ b/drivers/edac/ghes_edac.c
@@ -175,15 +175,20 @@ static void ghes_edac_dmidecode(const struct dmi_header *dh, void *arg)
 void ghes_edac_report_mem_error(struct ghes *ghes, int sev,
 			        struct cper_sec_mem_err *mem_err)
 {
+	struct edac_raw_error_desc *e = &ghes->mci->error_desc;
 	enum hw_event_mc_err_type type;
-	unsigned long page = 0, offset = 0, grain = 0;
-	char location[80];
-	char *label = "unknown";
+
+	/* Cleans the error report buffer */
+	memset(e, 0, sizeof (*e));
+	e->error_count = 1;
+	e->msg = "APEI";
+	strcpy(e->label, "unknown");
+	e->other_detail = "";
 
 	if (mem_err->validation_bits & CPER_MEM_VALID_PHYSICAL_ADDRESS) {
-		page = mem_err->physical_addr >> PAGE_SHIFT;
-		offset = mem_err->physical_addr & ~PAGE_MASK;
-		grain = ~(mem_err->physical_addr_mask & ~PAGE_MASK);
+		e->page_frame_number = mem_err->physical_addr >> PAGE_SHIFT;
+		e->offset_in_page = mem_err->physical_addr & ~PAGE_MASK;
+		e->grain = ~(mem_err->physical_addr_mask & ~PAGE_MASK);
 	}
 
 	switch(sev) {
@@ -201,15 +206,14 @@ void ghes_edac_report_mem_error(struct ghes *ghes, int sev,
 		type = HW_EVENT_ERR_INFO;
 	}
 
-	sprintf(location,"node:%d card:%d module:%d bank:%d device:%d row: %d column:%d bit_pos:%d",
+	sprintf(e->location,
+		"node:%d card:%d module:%d bank:%d device:%d row: %d column:%d bit_pos:%d",
 		mem_err->node, mem_err->card, mem_err->module,
 		mem_err->bank, mem_err->device, mem_err->row, mem_err->column,
 		mem_err->bit_pos);
-	edac_dbg(3, "error at location %s\n", location);
+	edac_dbg(3, "error at location %s\n", e->location);
 
-	edac_raw_mc_handle_error(type, ghes->mci, grain, 1, 0, 0, 0,
-				 page, offset, 0,
-				 "APEI", location, label, "", 0);
+	edac_raw_mc_handle_error(type, ghes->mci, e);
 }
 EXPORT_SYMBOL_GPL(ghes_edac_report_mem_error);
 
diff --git a/include/linux/edac.h b/include/linux/edac.h
index bd14f5c..1cd4472 100644
--- a/include/linux/edac.h
+++ b/include/linux/edac.h
@@ -47,8 +47,18 @@ static inline void opstate_init(void)
 	return;
 }
 
+/* Max length of a DIMM label*/
 #define EDAC_MC_LABEL_LEN	31
 
+/* Maximum size of the location string */
+#define LOCATION_SIZE 80
+
+/* Defines the maximum number of labels that can be reported */
+#define EDAC_MAX_LABELS		8
+
+/* String used to join two or more labels */
+#define OTHER_LABEL " or "
+
 /**
  * enum dev_type - describe the type of memory DRAM chips used at the stick
  * @DEV_UNKNOWN:	Can't be determined, or MC doesn't support detect it
@@ -554,6 +564,46 @@ struct errcount_attribute_data {
 	int layer0, layer1, layer2;
 };
 
+/**
+ * edac_raw_error_desc - Raw error report structure
+ * @grain:			minimum granularity for an error report, in bytes
+ * @error_count:		number of errors of the same type
+ * @top_layer:			top layer of the error (layer[0])
+ * @mid_layer:			middle layer of the error (layer[1])
+ * @low_layer:			low layer of the error (layer[2])
+ * @page_frame_number:		page where the error happened
+ * @offset_in_page:		page offset
+ * @syndrome:			syndrome of the error (or 0 if unknown or if
+ * 				the syndrome is not applicable)
+ * @msg:			error message
+ * @location:			location of the error
+ * @label:			label of the affected DIMM(s)
+ * @other_detail:		other driver-specific detail about the error
+ * @enable_per_layer_report:	if false, the error affects all layers
+ *				(typically, a memory controller error)
+ */
+struct edac_raw_error_desc {
+	/*
+	 * NOTE: everything before grain won't be cleaned by
+	 * edac_raw_error_desc_clean()
+	 */
+	char location[LOCATION_SIZE];
+	char label[(EDAC_MC_LABEL_LEN + 1 + sizeof(OTHER_LABEL)) * EDAC_MAX_LABELS];
+	long grain;
+
+	/* the vars below and grain will be cleaned on every new error report */
+	u16 error_count;
+	int top_layer;
+	int mid_layer;
+	int low_layer;
+	unsigned long page_frame_number;
+	unsigned long offset_in_page;
+	unsigned long syndrome;
+	const char *msg;
+	const char *other_detail;
+	bool enable_per_layer_report;
+};
+
 /* MEMORY controller information structure
  */
 struct mem_ctl_info {
@@ -661,6 +711,12 @@ struct mem_ctl_info {
 	/* work struct for this MC */
 	struct delayed_work work;
 
+	/*
+	 * Used to report an error - by being at the global struct
+	 * makes the memory allocated by the EDAC core
+	 */
+	struct edac_raw_error_desc error_desc;
+
 	/* the internal state of this controller instance */
 	int op_state;
 





* Re: [PATCH EDAC 03/13] ghes: add the needed hooks for EDAC error report
  2013-02-15 12:44   ` Mauro Carvalho Chehab
  (?)
@ 2013-02-21  1:26   ` Huang Ying
  2013-02-21 12:04     ` Mauro Carvalho Chehab
  -1 siblings, 1 reply; 49+ messages in thread
From: Huang Ying @ 2013-02-21  1:26 UTC (permalink / raw)
  To: Mauro Carvalho Chehab
  Cc: linux-acpi, Tony Luck, Linux Edac Mailing List,
	Linux Kernel Mailing List

Sorry for the late reply!

On Fri, 2013-02-15 at 10:44 -0200, Mauro Carvalho Chehab wrote:
> In order to allow reporting errors via EDAC, add hooks for:
> 
> 1) register an EDAC driver;
> 2) unregister an EDAC driver;
> 3) report errors via EDAC.
> 
> As the EDAC driver will need to access the ghes structure, adds it
> as one of the parameters for ghes_do_proc.
> 
> Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
> ---
>  drivers/acpi/apei/ghes.c | 17 ++++++++++++++---
>  include/acpi/ghes.h      | 27 +++++++++++++++++++++++++++
>  2 files changed, 41 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index 6d0e146..a21d7da 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -409,7 +409,8 @@ static void ghes_clear_estatus(struct ghes *ghes)
>  	ghes->flags &= ~GHES_TO_CLEAR;
>  }
>  
> -static void ghes_do_proc(const struct acpi_hest_generic_status *estatus)
> +static void ghes_do_proc(struct ghes *ghes,
> +			 const struct acpi_hest_generic_status *estatus)
>  {
>  	int sev, sec_sev;
>  	struct acpi_hest_generic_data *gdata;
> @@ -421,6 +422,8 @@ static void ghes_do_proc(const struct acpi_hest_generic_status *estatus)
>  				 CPER_SEC_PLATFORM_MEM)) {
>  			struct cper_sec_mem_err *mem_err;
>  			mem_err = (struct cper_sec_mem_err *)(gdata+1);
> +			ghes_edac_report_mem_error(ghes, sev, mem_err);
> +
>  #ifdef CONFIG_X86_MCE
>  			apei_mce_report_mem_error(sev == GHES_SEV_CORRECTED,
>  						  mem_err);
> @@ -639,7 +642,7 @@ static int ghes_proc(struct ghes *ghes)
>  		if (ghes_print_estatus(NULL, ghes->generic, ghes->estatus))
>  			ghes_estatus_cache_add(ghes->generic, ghes->estatus);
>  	}
> -	ghes_do_proc(ghes->estatus);
> +	ghes_do_proc(ghes, ghes->estatus);
>  out:
>  	ghes_clear_estatus(ghes);
>  	return 0;
> @@ -732,7 +735,7 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
>  		estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
>  		len = apei_estatus_len(estatus);
>  		node_len = GHES_ESTATUS_NODE_LEN(len);
> -		ghes_do_proc(estatus);
> +		ghes_do_proc(estatus_node->ghes, estatus);
>  		if (!ghes_estatus_cached(estatus)) {
>  			generic = estatus_node->generic;
>  			if (ghes_print_estatus(NULL, generic, estatus))
> @@ -821,6 +824,7 @@ static int ghes_notify_nmi(unsigned int cmd, struct pt_regs *regs)
>  		estatus_node = (void *)gen_pool_alloc(ghes_estatus_pool,
>  						      node_len);
>  		if (estatus_node) {
> +			estatus_node->ghes = ghes;
>  			estatus_node->generic = ghes->generic;
>  			estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
>  			memcpy(estatus, ghes->estatus, len);
> @@ -942,6 +946,10 @@ static int ghes_probe(struct platform_device *ghes_dev)
>  	}
>  	platform_set_drvdata(ghes_dev, ghes);
>  
> +	rc = ghes_edac_register(ghes, &ghes_dev->dev);
> +	if (rc < 0)
> +		goto err;
> +

If ghes_edac_register() fails, we need to do some cleanup, such as
unregistering from hed.

Or just move ghes_edac_register() before the switch?

>  	return 0;
>  err:
>  	if (ghes) {
> @@ -995,6 +1003,9 @@ static int ghes_remove(struct platform_device *ghes_dev)
>  	}
>  
>  	ghes_fini(ghes);
> +
> +	ghes_edac_unregister(ghes);
> +
>  	kfree(ghes);
>  
>  	platform_set_drvdata(ghes_dev, NULL);
> diff --git a/include/acpi/ghes.h b/include/acpi/ghes.h
> index 3eb8dc4..c6fef72 100644
> --- a/include/acpi/ghes.h
> +++ b/include/acpi/ghes.h
> @@ -22,11 +22,14 @@ struct ghes {
>  		struct timer_list timer;
>  		unsigned int irq;
>  	};
> +
> +	struct mem_ctl_info *mci;

Why do we need this?  It is not used by ghes.[hc].

>  };
>  
>  struct ghes_estatus_node {
>  	struct llist_node llnode;
>  	struct acpi_hest_generic *generic;
> +	struct ghes *ghes;
>  };
>  
>  struct ghes_estatus_cache {
> @@ -43,3 +46,27 @@ enum {
>  	GHES_SEV_RECOVERABLE = 0x2,
>  	GHES_SEV_PANIC = 0x3,
>  };
> +
> +#ifdef CONFIG_EDAC_GHES
> +void ghes_edac_report_mem_error(struct ghes *ghes, int sev,
> +				struct cper_sec_mem_err *mem_err);
> +
> +int ghes_edac_register(struct ghes *ghes, struct device *dev);
> +
> +void ghes_edac_unregister(struct ghes *ghes);
> +
> +#else
> +static inline void ghes_edac_report_mem_error(struct ghes *ghes, int sev,
> +				       struct cper_sec_mem_err *mem_err)
> +{
> +}
> +
> +static inline int ghes_edac_register(struct ghes *ghes, struct device *dev)
> +{
> +	return 0;
> +}
> +
> +static inline void ghes_edac_unregister(struct ghes *ghes)
> +{
> +}
> +#endif

I think it is better to put the above declarations into the module which
implements these functions, rather than the one which uses them.

Best Regards,
Huang Ying




* Re: [PATCH EDAC 03/13] ghes: add the needed hooks for EDAC error report
  2013-02-21  1:26   ` Huang Ying
@ 2013-02-21 12:04     ` Mauro Carvalho Chehab
  2013-02-22  0:45       ` Huang Ying
  0 siblings, 1 reply; 49+ messages in thread
From: Mauro Carvalho Chehab @ 2013-02-21 12:04 UTC (permalink / raw)
  To: Huang Ying
  Cc: linux-acpi, Tony Luck, Linux Edac Mailing List,
	Linux Kernel Mailing List

On Thu, 21 Feb 2013 09:26:07 +0800
Huang Ying <ying.huang@intel.com> wrote:

> Sorry for the late reply!

Thanks for your comments. See my answers below.
> 
> On Fri, 2013-02-15 at 10:44 -0200, Mauro Carvalho Chehab wrote:
> > In order to allow reporting errors via EDAC, add hooks for:
> > 
> > 1) register an EDAC driver;
> > 2) unregister an EDAC driver;
> > 3) report errors via EDAC.
> > 
> > As the EDAC driver will need to access the ghes structure, adds it
> > as one of the parameters for ghes_do_proc.
> > 
> > Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
> > ---
> >  drivers/acpi/apei/ghes.c | 17 ++++++++++++++---
> >  include/acpi/ghes.h      | 27 +++++++++++++++++++++++++++
> >  2 files changed, 41 insertions(+), 3 deletions(-)
> > 
> > diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> > index 6d0e146..a21d7da 100644
> > --- a/drivers/acpi/apei/ghes.c
> > +++ b/drivers/acpi/apei/ghes.c
> > @@ -409,7 +409,8 @@ static void ghes_clear_estatus(struct ghes *ghes)
> >  	ghes->flags &= ~GHES_TO_CLEAR;
> >  }
> >  
> > -static void ghes_do_proc(const struct acpi_hest_generic_status *estatus)
> > +static void ghes_do_proc(struct ghes *ghes,
> > +			 const struct acpi_hest_generic_status *estatus)
> >  {
> >  	int sev, sec_sev;
> >  	struct acpi_hest_generic_data *gdata;
> > @@ -421,6 +422,8 @@ static void ghes_do_proc(const struct acpi_hest_generic_status *estatus)
> >  				 CPER_SEC_PLATFORM_MEM)) {
> >  			struct cper_sec_mem_err *mem_err;
> >  			mem_err = (struct cper_sec_mem_err *)(gdata+1);
> > +			ghes_edac_report_mem_error(ghes, sev, mem_err);
> > +
> >  #ifdef CONFIG_X86_MCE
> >  			apei_mce_report_mem_error(sev == GHES_SEV_CORRECTED,
> >  						  mem_err);
> > @@ -639,7 +642,7 @@ static int ghes_proc(struct ghes *ghes)
> >  		if (ghes_print_estatus(NULL, ghes->generic, ghes->estatus))
> >  			ghes_estatus_cache_add(ghes->generic, ghes->estatus);
> >  	}
> > -	ghes_do_proc(ghes->estatus);
> > +	ghes_do_proc(ghes, ghes->estatus);
> >  out:
> >  	ghes_clear_estatus(ghes);
> >  	return 0;
> > @@ -732,7 +735,7 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
> >  		estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
> >  		len = apei_estatus_len(estatus);
> >  		node_len = GHES_ESTATUS_NODE_LEN(len);
> > -		ghes_do_proc(estatus);
> > +		ghes_do_proc(estatus_node->ghes, estatus);
> >  		if (!ghes_estatus_cached(estatus)) {
> >  			generic = estatus_node->generic;
> >  			if (ghes_print_estatus(NULL, generic, estatus))
> > @@ -821,6 +824,7 @@ static int ghes_notify_nmi(unsigned int cmd, struct pt_regs *regs)
> >  		estatus_node = (void *)gen_pool_alloc(ghes_estatus_pool,
> >  						      node_len);
> >  		if (estatus_node) {
> > +			estatus_node->ghes = ghes;
> >  			estatus_node->generic = ghes->generic;
> >  			estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
> >  			memcpy(estatus, ghes->estatus, len);
> > @@ -942,6 +946,10 @@ static int ghes_probe(struct platform_device *ghes_dev)
> >  	}
> >  	platform_set_drvdata(ghes_dev, ghes);
> >  
> > +	rc = ghes_edac_register(ghes, &ghes_dev->dev);
> > +	if (rc < 0)
> > +		goto err;
> > +
> 
> If ghes_edac_register() failed, we need to do some cleanup such as
> unregister from hed etc.
> 
> Or just move ghes_edac_register() before switch?

Moving it before the switch() looks better. We then need to unregister
ghes_edac if the IRQ setup fails.
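
The ordering can be sketched in plain C, outside the kernel: register first,
then unwind the registration through a dedicated label if a later step fails.
This is only an illustration; fake_edac_register(), fake_request_irq() and
fake_probe() are hypothetical stand-ins, not the ghes.c code.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the real resources; not kernel APIs. */
static bool edac_registered;
static bool irq_requested;

static int fake_edac_register(void)
{
	edac_registered = true;
	return 0;
}

static void fake_edac_unregister(void)
{
	edac_registered = false;
}

/* Returns 0 on success, or -1 when the caller asks for a failure. */
static int fake_request_irq(bool should_fail)
{
	if (should_fail)
		return -1;
	irq_requested = true;
	return 0;
}

/*
 * Register first, then request the IRQ.  If the IRQ step fails, jump to a
 * label that undoes the registration before returning the error.
 */
static int fake_probe(bool irq_fails)
{
	int rc;

	rc = fake_edac_register();
	if (rc < 0)
		goto err;

	rc = fake_request_irq(irq_fails);
	if (rc < 0)
		goto err_unreg;

	return 0;

err_unreg:
	fake_edac_unregister();
err:
	return rc;
}
```

A failure before the registration jumps straight to the last label, so
nothing is unregistered that was never registered.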

> >  	return 0;
> >  err:
> >  	if (ghes) {
> > @@ -995,6 +1003,9 @@ static int ghes_remove(struct platform_device *ghes_dev)
> >  	}
> >  
> >  	ghes_fini(ghes);
> > +
> > +	ghes_edac_unregister(ghes);
> > +
> >  	kfree(ghes);
> >  
> >  	platform_set_drvdata(ghes_dev, NULL);
> > diff --git a/include/acpi/ghes.h b/include/acpi/ghes.h
> > index 3eb8dc4..c6fef72 100644
> > --- a/include/acpi/ghes.h
> > +++ b/include/acpi/ghes.h
> > @@ -22,11 +22,14 @@ struct ghes {
> >  		struct timer_list timer;
> >  		unsigned int irq;
> >  	};
> > +
> > +	struct mem_ctl_info *mci;
> 
> Why we need this?  This is not used by ghes.[hc].

This will be needed by the EDAC driver: the EDAC core has its own main
struct, and that struct needs to be retrieved when reporting an error.

As this struct is not used by ghes, there's no need to even include
include/linux/edac.h, where it is declared: gcc will just handle the field
as an opaque pointer (much like "void *") when compiling ghes.
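
In C terms, the member is a pointer to an incomplete struct type, so the
definition is not needed as long as the pointer is only stored or compared
and never dereferenced. A minimal sketch, where struct ghes_like and
ghes_like_has_mci() are hypothetical names used only for illustration:

```c
#include <stddef.h>

/*
 * No definition of struct mem_ctl_info is visible here.  Using the tag in a
 * pointer member declares an incomplete type, which is enough for the
 * compiler to lay out the enclosing struct.
 */
struct ghes_like {
	unsigned int irq;
	struct mem_ctl_info *mci;	/* pointer to an incomplete type */
};

/* Storing and comparing the pointer is fine; dereferencing it is not. */
static int ghes_like_has_mci(const struct ghes_like *g)
{
	return g->mci != NULL;
}
```

Only code that looks inside the struct, such as the EDAC driver, would need
the header that defines it.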

I opted for this design as it is simpler: it costs just 8 bytes per ghes
struct (one pointer on 64-bit), and no extra code is needed.

However, as you're not happy with it, I'm removing it from there and adding
a list to the EDAC driver, protected by a mutex that serializes access,
which associates each ghes struct with its corresponding MC struct.

I'll post the patch replacing patch 05/13 soon.
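
As a rough user-space sketch of that idea (not the replacement patch itself),
a mutex-protected list keyed by the ghes pointer could look like the code
below. It uses a pthread mutex and a hand-rolled singly linked list in place
of the kernel primitives, and struct ghes_mci_entry, ghes_mci_add(),
ghes_mci_find() and ghes_mci_del() are hypothetical names.

```c
#include <pthread.h>
#include <stddef.h>

struct ghes;		/* opaque here; only its address is used as a key */
struct mem_ctl_info;	/* opaque here as well */

/* Hypothetical association entry; names are illustrative only. */
struct ghes_mci_entry {
	struct ghes *ghes;
	struct mem_ctl_info *mci;
	struct ghes_mci_entry *next;
};

static struct ghes_mci_entry *ghes_mci_list;
static pthread_mutex_t ghes_mci_lock = PTHREAD_MUTEX_INITIALIZER;

/* Link a caller-allocated entry at the head of the list. */
static void ghes_mci_add(struct ghes_mci_entry *e)
{
	pthread_mutex_lock(&ghes_mci_lock);
	e->next = ghes_mci_list;
	ghes_mci_list = e;
	pthread_mutex_unlock(&ghes_mci_lock);
}

/* Return the MC struct registered for this ghes, or NULL if none. */
static struct mem_ctl_info *ghes_mci_find(struct ghes *ghes)
{
	struct mem_ctl_info *mci = NULL;
	struct ghes_mci_entry *e;

	pthread_mutex_lock(&ghes_mci_lock);
	for (e = ghes_mci_list; e; e = e->next) {
		if (e->ghes == ghes) {
			mci = e->mci;
			break;
		}
	}
	pthread_mutex_unlock(&ghes_mci_lock);
	return mci;
}

/* Unlink the entry for this ghes, if present; the caller owns the memory. */
static void ghes_mci_del(struct ghes *ghes)
{
	struct ghes_mci_entry **pp;

	pthread_mutex_lock(&ghes_mci_lock);
	for (pp = &ghes_mci_list; *pp; pp = &(*pp)->next) {
		if ((*pp)->ghes == ghes) {
			*pp = (*pp)->next;
			break;
		}
	}
	pthread_mutex_unlock(&ghes_mci_lock);
}
```

Compared with a pointer stored in struct ghes, every error report pays for a
lock and a list walk, which is the extra code the simpler design avoided.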

> 
> >  };
> >  
> >  struct ghes_estatus_node {
> >  	struct llist_node llnode;
> >  	struct acpi_hest_generic *generic;
> > +	struct ghes *ghes;
> >  };
> >  
> >  struct ghes_estatus_cache {
> > @@ -43,3 +46,27 @@ enum {
> >  	GHES_SEV_RECOVERABLE = 0x2,
> >  	GHES_SEV_PANIC = 0x3,
> >  };
> > +
> > +#ifdef CONFIG_EDAC_GHES
> > +void ghes_edac_report_mem_error(struct ghes *ghes, int sev,
> > +				struct cper_sec_mem_err *mem_err);
> > +
> > +int ghes_edac_register(struct ghes *ghes, struct device *dev);
> > +
> > +void ghes_edac_unregister(struct ghes *ghes);
> > +
> > +#else
> > +static inline void ghes_edac_report_mem_error(struct ghes *ghes, int sev,
> > +				       struct cper_sec_mem_err *mem_err)
> > +{
> > +}
> > +
> > +static inline int ghes_edac_register(struct ghes *ghes, struct device *dev)
> > +{
> > +	return 0;
> > +}
> > +
> > +static inline void ghes_edac_unregister(struct ghes *ghes)
> > +{
> > +}
> > +#endif
> 
> I think it is better to put the above declaration into module which
> implement these functions instead of use these functions.

This follows the Linux coding style. See Documentation/SubmittingPatches,
section 2, item 2:

	2) #ifdefs are ugly

	Code cluttered with ifdefs is difficult to read and maintain.  Don't do
	it.  Instead, put your ifdefs in a header, and conditionally define
	'static inline' functions, or macros, which are used in the code.
	Let the compiler optimize away the "no-op" case.

	Simple example, of poor code:

	        dev = alloc_etherdev (sizeof(struct funky_private));
	        if (!dev)
	                return -ENODEV;
	        #ifdef CONFIG_NET_FUNKINESS
	        init_funky_net(dev);
	        #endif

	Cleaned-up example:

	(in header)
	        #ifndef CONFIG_NET_FUNKINESS
	        static inline void init_funky_net (struct net_device *d) {}
	        #endif

	(in the code itself)
	        dev = alloc_etherdev (sizeof(struct funky_private));
	        if (!dev)
	                return -ENODEV;
	        init_funky_net(dev);

This approach has another advantage: this patch can contain just
the ghes changes and will still compile fine, as CONFIG_EDAC_GHES is not
defined yet. So, the patch becomes simpler and easier to review.

If I did it otherwise, I would need to fold patch 5 into this one to avoid
breaking git bisect.

Regards,
Mauro

Patch with the changes enclosed.

-

[PATCH EDACv2 03/13] ghes: add the needed hooks for EDAC error report

In order to allow reporting errors via EDAC, add hooks for:

1) register an EDAC driver;
2) unregister an EDAC driver;
3) report errors via EDAC.

As the EDAC driver will need to access the ghes structure, add it
as one of the parameters for ghes_do_proc.

Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 6d0e146..19092dc 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -409,7 +409,8 @@ static void ghes_clear_estatus(struct ghes *ghes)
 	ghes->flags &= ~GHES_TO_CLEAR;
 }
 
-static void ghes_do_proc(const struct acpi_hest_generic_status *estatus)
+static void ghes_do_proc(struct ghes *ghes,
+			 const struct acpi_hest_generic_status *estatus)
 {
 	int sev, sec_sev;
 	struct acpi_hest_generic_data *gdata;
@@ -421,6 +422,8 @@ static void ghes_do_proc(const struct acpi_hest_generic_status *estatus)
 				 CPER_SEC_PLATFORM_MEM)) {
 			struct cper_sec_mem_err *mem_err;
 			mem_err = (struct cper_sec_mem_err *)(gdata+1);
+			ghes_edac_report_mem_error(ghes, sev, mem_err);
+
 #ifdef CONFIG_X86_MCE
 			apei_mce_report_mem_error(sev == GHES_SEV_CORRECTED,
 						  mem_err);
@@ -639,7 +642,7 @@ static int ghes_proc(struct ghes *ghes)
 		if (ghes_print_estatus(NULL, ghes->generic, ghes->estatus))
 			ghes_estatus_cache_add(ghes->generic, ghes->estatus);
 	}
-	ghes_do_proc(ghes->estatus);
+	ghes_do_proc(ghes, ghes->estatus);
 out:
 	ghes_clear_estatus(ghes);
 	return 0;
@@ -732,7 +735,7 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
 		estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
 		len = apei_estatus_len(estatus);
 		node_len = GHES_ESTATUS_NODE_LEN(len);
-		ghes_do_proc(estatus);
+		ghes_do_proc(estatus_node->ghes, estatus);
 		if (!ghes_estatus_cached(estatus)) {
 			generic = estatus_node->generic;
 			if (ghes_print_estatus(NULL, generic, estatus))
@@ -821,6 +824,7 @@ static int ghes_notify_nmi(unsigned int cmd, struct pt_regs *regs)
 		estatus_node = (void *)gen_pool_alloc(ghes_estatus_pool,
 						      node_len);
 		if (estatus_node) {
+			estatus_node->ghes = ghes;
 			estatus_node->generic = ghes->generic;
 			estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
 			memcpy(estatus, ghes->estatus, len);
@@ -899,6 +903,11 @@ static int ghes_probe(struct platform_device *ghes_dev)
 		ghes = NULL;
 		goto err;
 	}
+
+	rc = ghes_edac_register(ghes, &ghes_dev->dev);
+	if (rc < 0)
+		goto err;
+
 	switch (generic->notify.type) {
 	case ACPI_HEST_NOTIFY_POLLED:
 		ghes->timer.function = ghes_poll_func;
@@ -911,13 +920,13 @@ static int ghes_probe(struct platform_device *ghes_dev)
 		if (acpi_gsi_to_irq(generic->notify.vector, &ghes->irq)) {
 			pr_err(GHES_PFX "Failed to map GSI to IRQ for generic hardware error source: %d\n",
 			       generic->header.source_id);
-			goto err;
+			goto err2;
 		}
 		if (request_irq(ghes->irq, ghes_irq_func,
 				0, "GHES IRQ", ghes)) {
 			pr_err(GHES_PFX "Failed to register IRQ for generic hardware error source: %d\n",
 			       generic->header.source_id);
-			goto err;
+			goto err2;
 		}
 		break;
 	case ACPI_HEST_NOTIFY_SCI:
@@ -943,6 +952,8 @@ static int ghes_probe(struct platform_device *ghes_dev)
 	platform_set_drvdata(ghes_dev, ghes);
 
 	return 0;
+err2:
+	ghes_edac_unregister(ghes);
 err:
 	if (ghes) {
 		ghes_fini(ghes);
@@ -995,6 +1006,9 @@ static int ghes_remove(struct platform_device *ghes_dev)
 	}
 
 	ghes_fini(ghes);
+
+	ghes_edac_unregister(ghes);
+
 	kfree(ghes);
 
 	platform_set_drvdata(ghes_dev, NULL);
diff --git a/include/acpi/ghes.h b/include/acpi/ghes.h
index 3eb8dc4..9015ec2 100644
--- a/include/acpi/ghes.h
+++ b/include/acpi/ghes.h
@@ -27,6 +27,7 @@ struct ghes {
 struct ghes_estatus_node {
 	struct llist_node llnode;
 	struct acpi_hest_generic *generic;
+	struct ghes *ghes;
 };
 
 struct ghes_estatus_cache {
@@ -43,3 +44,27 @@ enum {
 	GHES_SEV_RECOVERABLE = 0x2,
 	GHES_SEV_PANIC = 0x3,
 };
+
+#ifdef CONFIG_EDAC_GHES
+void ghes_edac_report_mem_error(struct ghes *ghes, int sev,
+				struct cper_sec_mem_err *mem_err);
+
+int ghes_edac_register(struct ghes *ghes, struct device *dev);
+
+void ghes_edac_unregister(struct ghes *ghes);
+
+#else
+static inline void ghes_edac_report_mem_error(struct ghes *ghes, int sev,
+				       struct cper_sec_mem_err *mem_err)
+{
+}
+
+static inline int ghes_edac_register(struct ghes *ghes, struct device *dev)
+{
+	return 0;
+}
+
+static inline void ghes_edac_unregister(struct ghes *ghes)
+{
+}
+#endif



* Re: [PATCH EDAC 03/13] ghes: add the needed hooks for EDAC error report
  2013-02-21 12:04     ` Mauro Carvalho Chehab
@ 2013-02-22  0:45       ` Huang Ying
  2013-02-22  8:50         ` Mauro Carvalho Chehab
  0 siblings, 1 reply; 49+ messages in thread
From: Huang Ying @ 2013-02-22  0:45 UTC (permalink / raw)
  To: Mauro Carvalho Chehab
  Cc: linux-acpi, Tony Luck, Linux Edac Mailing List,
	Linux Kernel Mailing List

On Thu, 2013-02-21 at 09:04 -0300, Mauro Carvalho Chehab wrote:
> Em Thu, 21 Feb 2013 09:26:07 +0800
> Huang Ying <ying.huang@intel.com> escreveu:
> 
> > Sorry for late!
> 
> Thanks for your comments. See my answers below.
> > 
> > On Fri, 2013-02-15 at 10:44 -0200, Mauro Carvalho Chehab wrote:
> > > In order to allow reporting errors via EDAC, add hooks for:
> > > 
> > > 1) register an EDAC driver;
> > > 2) unregister an EDAC driver;
> > > 3) report errors via EDAC.
> > > 
> > > As the EDAC driver will need to access the ghes structure, adds it
> > > as one of the parameters for ghes_do_proc.
> > > 
> > > Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
> > > ---
> > >  drivers/acpi/apei/ghes.c | 17 ++++++++++++++---
> > >  include/acpi/ghes.h      | 27 +++++++++++++++++++++++++++
> > >  2 files changed, 41 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> > > index 6d0e146..a21d7da 100644
> > > --- a/drivers/acpi/apei/ghes.c
> > > +++ b/drivers/acpi/apei/ghes.c
> > > @@ -409,7 +409,8 @@ static void ghes_clear_estatus(struct ghes *ghes)
> > >  	ghes->flags &= ~GHES_TO_CLEAR;
> > >  }
> > >  
> > > -static void ghes_do_proc(const struct acpi_hest_generic_status *estatus)
> > > +static void ghes_do_proc(struct ghes *ghes,
> > > +			 const struct acpi_hest_generic_status *estatus)
> > >  {
> > >  	int sev, sec_sev;
> > >  	struct acpi_hest_generic_data *gdata;
> > > @@ -421,6 +422,8 @@ static void ghes_do_proc(const struct acpi_hest_generic_status *estatus)
> > >  				 CPER_SEC_PLATFORM_MEM)) {
> > >  			struct cper_sec_mem_err *mem_err;
> > >  			mem_err = (struct cper_sec_mem_err *)(gdata+1);
> > > +			ghes_edac_report_mem_error(ghes, sev, mem_err);
> > > +
> > >  #ifdef CONFIG_X86_MCE
> > >  			apei_mce_report_mem_error(sev == GHES_SEV_CORRECTED,
> > >  						  mem_err);
> > > @@ -639,7 +642,7 @@ static int ghes_proc(struct ghes *ghes)
> > >  		if (ghes_print_estatus(NULL, ghes->generic, ghes->estatus))
> > >  			ghes_estatus_cache_add(ghes->generic, ghes->estatus);
> > >  	}
> > > -	ghes_do_proc(ghes->estatus);
> > > +	ghes_do_proc(ghes, ghes->estatus);
> > >  out:
> > >  	ghes_clear_estatus(ghes);
> > >  	return 0;
> > > @@ -732,7 +735,7 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
> > >  		estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
> > >  		len = apei_estatus_len(estatus);
> > >  		node_len = GHES_ESTATUS_NODE_LEN(len);
> > > -		ghes_do_proc(estatus);
> > > +		ghes_do_proc(estatus_node->ghes, estatus);
> > >  		if (!ghes_estatus_cached(estatus)) {
> > >  			generic = estatus_node->generic;
> > >  			if (ghes_print_estatus(NULL, generic, estatus))
> > > @@ -821,6 +824,7 @@ static int ghes_notify_nmi(unsigned int cmd, struct pt_regs *regs)
> > >  		estatus_node = (void *)gen_pool_alloc(ghes_estatus_pool,
> > >  						      node_len);
> > >  		if (estatus_node) {
> > > +			estatus_node->ghes = ghes;
> > >  			estatus_node->generic = ghes->generic;
> > >  			estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
> > >  			memcpy(estatus, ghes->estatus, len);
> > > @@ -942,6 +946,10 @@ static int ghes_probe(struct platform_device *ghes_dev)
> > >  	}
> > >  	platform_set_drvdata(ghes_dev, ghes);
> > >  
> > > +	rc = ghes_edac_register(ghes, &ghes_dev->dev);
> > > +	if (rc < 0)
> > > +		goto err;
> > > +
> > 
> > If ghes_edac_register() failed, we need to do some cleanup such as
> > unregister from hed etc.
> > 
> > Or just move ghes_edac_register() before switch?
> 
> Moving it before the switch() looks better. We then need to unregister
> ghes_edac if the IRQ setup fails.
> 
> > >  	return 0;
> > >  err:
> > >  	if (ghes) {
> > > @@ -995,6 +1003,9 @@ static int ghes_remove(struct platform_device *ghes_dev)
> > >  	}
> > >  
> > >  	ghes_fini(ghes);
> > > +
> > > +	ghes_edac_unregister(ghes);
> > > +
> > >  	kfree(ghes);
> > >  
> > >  	platform_set_drvdata(ghes_dev, NULL);
> > > diff --git a/include/acpi/ghes.h b/include/acpi/ghes.h
> > > index 3eb8dc4..c6fef72 100644
> > > --- a/include/acpi/ghes.h
> > > +++ b/include/acpi/ghes.h
> > > @@ -22,11 +22,14 @@ struct ghes {
> > >  		struct timer_list timer;
> > >  		unsigned int irq;
> > >  	};
> > > +
> > > +	struct mem_ctl_info *mci;
> > 
> > Why we need this?  This is not used by ghes.[hc].
> 
> This will be needed by the EDAC driver: the EDAC core has its own main
> struct, and that struct needs to be retrieved when reporting an error.
> 
> As this struct is not used by ghes, there's no need to even include
> include/linux/edac.h, where it is declared: gcc will just handle the field
> as an opaque pointer (much like "void *") when compiling ghes.
> 
> I opted for this design as it is simpler: it costs just 8 bytes per ghes
> struct (one pointer on 64-bit), and no extra code is needed.
> 
> However, as you're not happy with it, I'm removing it from there and adding
> a list to the EDAC driver, protected by a mutex that serializes access,
> which associates each ghes struct with its corresponding MC struct.
> 
> I'll post the patch replacing patch 05/13 soon.
> 
> > 
> > >  };
> > >  
> > >  struct ghes_estatus_node {
> > >  	struct llist_node llnode;
> > >  	struct acpi_hest_generic *generic;
> > > +	struct ghes *ghes;
> > >  };
> > >  
> > >  struct ghes_estatus_cache {
> > > @@ -43,3 +46,27 @@ enum {
> > >  	GHES_SEV_RECOVERABLE = 0x2,
> > >  	GHES_SEV_PANIC = 0x3,
> > >  };
> > > +
> > > +#ifdef CONFIG_EDAC_GHES
> > > +void ghes_edac_report_mem_error(struct ghes *ghes, int sev,
> > > +				struct cper_sec_mem_err *mem_err);
> > > +
> > > +int ghes_edac_register(struct ghes *ghes, struct device *dev);
> > > +
> > > +void ghes_edac_unregister(struct ghes *ghes);
> > > +
> > > +#else
> > > +static inline void ghes_edac_report_mem_error(struct ghes *ghes, int sev,
> > > +				       struct cper_sec_mem_err *mem_err)
> > > +{
> > > +}
> > > +
> > > +static inline int ghes_edac_register(struct ghes *ghes, struct device *dev)
> > > +{
> > > +	return 0;
> > > +}
> > > +
> > > +static inline void ghes_edac_unregister(struct ghes *ghes)
> > > +{
> > > +}
> > > +#endif
> > 
> > I think it is better to put the above declaration into module which
> > implement these functions instead of use these functions.
> 
> This follows the Linux coding style. See Documentation/SubmittingPatches,
> section 2, item 2:
> 
> 	2) #ifdefs are ugly
> 
> 	Code cluttered with ifdefs is difficult to read and maintain.  Don't do
> 	it.  Instead, put your ifdefs in a header, and conditionally define
> 	'static inline' functions, or macros, which are used in the code.
> 	Let the compiler optimize away the "no-op" case.
> 
> 	Simple example, of poor code:
> 
> 	        dev = alloc_etherdev (sizeof(struct funky_private));
> 	        if (!dev)
> 	                return -ENODEV;
> 	        #ifdef CONFIG_NET_FUNKINESS
> 	        init_funky_net(dev);
> 	        #endif
> 
> 	Cleaned-up example:
> 
> 	(in header)
> 	        #ifndef CONFIG_NET_FUNKINESS
> 	        static inline void init_funky_net (struct net_device *d) {}
> 	        #endif
> 
> 	(in the code itself)
> 	        dev = alloc_etherdev (sizeof(struct funky_private));
> 	        if (!dev)
> 	                return -ENODEV;
> 	        init_funky_net(dev);
> 
> This approach has another advantage: this patch can contain just
> the ghes changes and will still compile fine, as CONFIG_EDAC_GHES is not
> defined yet. So, the patch becomes simpler and easier to review.

Thanks for your explanation.  I agree to keep this in a header file.

My original suggestion was that it may be better to move this from ghes.h
to some place like "ghes_edac.h", because these functions are
implemented in "ghes_edac.c" instead of ghes.c.

> If I did it otherwise, I would need to fold patch 5 into this one to avoid
> breaking git bisect.
> 
> Regards,
> Mauro
> 
> Patch with the changes enclosed.
> 
> -
> 
> [PATCH EDACv2 03/13] ghes: add the needed hooks for EDAC error report
> 
> In order to allow reporting errors via EDAC, add hooks for:
> 
> 1) register an EDAC driver;
> 2) unregister an EDAC driver;
> 3) report errors via EDAC.
> 
> As the EDAC driver will need to access the ghes structure, add it
> as one of the parameters for ghes_do_proc.
> 
> Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
> 
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index 6d0e146..19092dc 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -409,7 +409,8 @@ static void ghes_clear_estatus(struct ghes *ghes)
>  	ghes->flags &= ~GHES_TO_CLEAR;
>  }
>  
> -static void ghes_do_proc(const struct acpi_hest_generic_status *estatus)
> +static void ghes_do_proc(struct ghes *ghes,
> +			 const struct acpi_hest_generic_status *estatus)
>  {
>  	int sev, sec_sev;
>  	struct acpi_hest_generic_data *gdata;
> @@ -421,6 +422,8 @@ static void ghes_do_proc(const struct acpi_hest_generic_status *estatus)
>  				 CPER_SEC_PLATFORM_MEM)) {
>  			struct cper_sec_mem_err *mem_err;
>  			mem_err = (struct cper_sec_mem_err *)(gdata+1);
> +			ghes_edac_report_mem_error(ghes, sev, mem_err);
> +
>  #ifdef CONFIG_X86_MCE
>  			apei_mce_report_mem_error(sev == GHES_SEV_CORRECTED,
>  						  mem_err);
> @@ -639,7 +642,7 @@ static int ghes_proc(struct ghes *ghes)
>  		if (ghes_print_estatus(NULL, ghes->generic, ghes->estatus))
>  			ghes_estatus_cache_add(ghes->generic, ghes->estatus);
>  	}
> -	ghes_do_proc(ghes->estatus);
> +	ghes_do_proc(ghes, ghes->estatus);
>  out:
>  	ghes_clear_estatus(ghes);
>  	return 0;
> @@ -732,7 +735,7 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
>  		estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
>  		len = apei_estatus_len(estatus);
>  		node_len = GHES_ESTATUS_NODE_LEN(len);
> -		ghes_do_proc(estatus);
> +		ghes_do_proc(estatus_node->ghes, estatus);
>  		if (!ghes_estatus_cached(estatus)) {
>  			generic = estatus_node->generic;
>  			if (ghes_print_estatus(NULL, generic, estatus))
> @@ -821,6 +824,7 @@ static int ghes_notify_nmi(unsigned int cmd, struct pt_regs *regs)
>  		estatus_node = (void *)gen_pool_alloc(ghes_estatus_pool,
>  						      node_len);
>  		if (estatus_node) {
> +			estatus_node->ghes = ghes;
>  			estatus_node->generic = ghes->generic;
>  			estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
>  			memcpy(estatus, ghes->estatus, len);
> @@ -899,6 +903,11 @@ static int ghes_probe(struct platform_device *ghes_dev)
>  		ghes = NULL;
>  		goto err;
>  	}
> +
> +	rc = ghes_edac_register(ghes, &ghes_dev->dev);
> +	if (rc < 0)
> +		goto err;
> +
>  	switch (generic->notify.type) {
>  	case ACPI_HEST_NOTIFY_POLLED:
>  		ghes->timer.function = ghes_poll_func;
> @@ -911,13 +920,13 @@ static int ghes_probe(struct platform_device *ghes_dev)
>  		if (acpi_gsi_to_irq(generic->notify.vector, &ghes->irq)) {
>  			pr_err(GHES_PFX "Failed to map GSI to IRQ for generic hardware error source: %d\n",
>  			       generic->header.source_id);
> -			goto err;
> +			goto err2;
>  		}
>  		if (request_irq(ghes->irq, ghes_irq_func,
>  				0, "GHES IRQ", ghes)) {
>  			pr_err(GHES_PFX "Failed to register IRQ for generic hardware error source: %d\n",
>  			       generic->header.source_id);
> -			goto err;
> +			goto err2;
>  		}
>  		break;
>  	case ACPI_HEST_NOTIFY_SCI:
> @@ -943,6 +952,8 @@ static int ghes_probe(struct platform_device *ghes_dev)
>  	platform_set_drvdata(ghes_dev, ghes);
>  
>  	return 0;
> +err2:

Suggest to rename it to err_edac_unreg or something like that.  Just a
suggestion.

Best Regards,
Huang Ying

> +	ghes_edac_unregister(ghes);
>  err:
>  	if (ghes) {
>  		ghes_fini(ghes);
> @@ -995,6 +1006,9 @@ static int ghes_remove(struct platform_device *ghes_dev)
>  	}
>  
>  	ghes_fini(ghes);
> +
> +	ghes_edac_unregister(ghes);
> +
>  	kfree(ghes);
>  
>  	platform_set_drvdata(ghes_dev, NULL);
> diff --git a/include/acpi/ghes.h b/include/acpi/ghes.h
> index 3eb8dc4..9015ec2 100644
> --- a/include/acpi/ghes.h
> +++ b/include/acpi/ghes.h
> @@ -27,6 +27,7 @@ struct ghes {
>  struct ghes_estatus_node {
>  	struct llist_node llnode;
>  	struct acpi_hest_generic *generic;
> +	struct ghes *ghes;
>  };
>  
>  struct ghes_estatus_cache {
> @@ -43,3 +44,27 @@ enum {
>  	GHES_SEV_RECOVERABLE = 0x2,
>  	GHES_SEV_PANIC = 0x3,
>  };
> +
> +#ifdef CONFIG_EDAC_GHES
> +void ghes_edac_report_mem_error(struct ghes *ghes, int sev,
> +				struct cper_sec_mem_err *mem_err);
> +
> +int ghes_edac_register(struct ghes *ghes, struct device *dev);
> +
> +void ghes_edac_unregister(struct ghes *ghes);
> +
> +#else
> +static inline void ghes_edac_report_mem_error(struct ghes *ghes, int sev,
> +				       struct cper_sec_mem_err *mem_err)
> +{
> +}
> +
> +static inline int ghes_edac_register(struct ghes *ghes, struct device *dev)
> +{
> +	return 0;
> +}
> +
> +static inline void ghes_edac_unregister(struct ghes *ghes)
> +{
> +}
> +#endif
> 




* Re: [PATCH EDAC 03/13] ghes: add the needed hooks for EDAC error report
  2013-02-22  0:45       ` Huang Ying
@ 2013-02-22  8:50         ` Mauro Carvalho Chehab
  2013-02-22  8:57           ` Mauro Carvalho Chehab
  0 siblings, 1 reply; 49+ messages in thread
From: Mauro Carvalho Chehab @ 2013-02-22  8:50 UTC (permalink / raw)
  To: Huang Ying
  Cc: linux-acpi, Tony Luck, Linux Edac Mailing List,
	Linux Kernel Mailing List

Em Fri, 22 Feb 2013 08:45:11 +0800
Huang Ying <ying.huang@intel.com> escreveu:

> On Thu, 2013-02-21 at 09:04 -0300, Mauro Carvalho Chehab wrote:
> > Em Thu, 21 Feb 2013 09:26:07 +0800
> > Huang Ying <ying.huang@intel.com> escreveu:
> > 

> > This approach has another advantage: this patch can contain just
> > the ghes changes and will still compile fine, as CONFIG_EDAC_GHES is not
> > defined yet. So, the patch becomes simpler and easier to review.
> 
> Thanks for your explanation.  I agree to keep this in a header file.
> 
> My original suggestion was that it may be better to move this from ghes.h
> to some place like "ghes_edac.h", because these functions are
> implemented in "ghes_edac.c" instead of ghes.c.
> 

Ah! Well, I considered a separate header when writing this patch, but adding
another header for just 3 functions whose parameters are more related to
ghes than to ghes_edac seemed overkill to me.

IMO, a single comment in the header pointing to ghes_edac.c would improve it.

> >  	switch (generic->notify.type) {
> >  	case ACPI_HEST_NOTIFY_POLLED:
> >  		ghes->timer.function = ghes_poll_func;
> > @@ -911,13 +920,13 @@ static int ghes_probe(struct platform_device *ghes_dev)
> >  		if (acpi_gsi_to_irq(generic->notify.vector, &ghes->irq)) {
> >  			pr_err(GHES_PFX "Failed to map GSI to IRQ for generic hardware error source: %d\n",
> >  			       generic->header.source_id);
> > -			goto err;
> > +			goto err2;
> >  		}
> >  		if (request_irq(ghes->irq, ghes_irq_func,
> >  				0, "GHES IRQ", ghes)) {
> >  			pr_err(GHES_PFX "Failed to register IRQ for generic hardware error source: %d\n",
> >  			       generic->header.source_id);
> > -			goto err;
> > +			goto err2;
> >  		}
> >  		break;
> >  	case ACPI_HEST_NOTIFY_SCI:
> > @@ -943,6 +952,8 @@ static int ghes_probe(struct platform_device *ghes_dev)
> >  	platform_set_drvdata(ghes_dev, ghes);
> >  
> >  	return 0;
> > +err2:
> 
> Suggest to rename it to err_edac_unreg or something like that.  Just a
> suggestion.

Done.

I'm folding the enclosed patch on it.

Thanks for the review!

Regards,
Mauro

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 19092dc..d668a8a 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -920,13 +920,13 @@ static int ghes_probe(struct platform_device *ghes_dev)
 		if (acpi_gsi_to_irq(generic->notify.vector, &ghes->irq)) {
 			pr_err(GHES_PFX "Failed to map GSI to IRQ for generic hardware error source: %d\n",
 			       generic->header.source_id);
-			goto err2;
+			goto err_edac_unreg;
 		}
 		if (request_irq(ghes->irq, ghes_irq_func,
 				0, "GHES IRQ", ghes)) {
 			pr_err(GHES_PFX "Failed to register IRQ for generic hardware error source: %d\n",
 			       generic->header.source_id);
-			goto err2;
+			goto err_edac_unreg;
 		}
 		break;
 	case ACPI_HEST_NOTIFY_SCI:
@@ -952,7 +952,7 @@ static int ghes_probe(struct platform_device *ghes_dev)
 	platform_set_drvdata(ghes_dev, ghes);
 
 	return 0;
-err2:
+err_edac_unreg:
 	ghes_edac_unregister(ghes);
 err:
 	if (ghes) {
diff --git a/include/acpi/ghes.h b/include/acpi/ghes.h
index 9015ec2..d11d952 100644
--- a/include/acpi/ghes.h
+++ b/include/acpi/ghes.h
@@ -45,6 +45,8 @@ enum {
 	GHES_SEV_PANIC = 0x3,
 };
 
+/* From drivers/edac/ghes_acpi.h */
+
 #ifdef CONFIG_EDAC_GHES
 void ghes_edac_report_mem_error(struct ghes *ghes, int sev,
 				struct cper_sec_mem_err *mem_err);



-- 

Cheers,
Mauro


* Re: [PATCH EDAC 03/13] ghes: add the needed hooks for EDAC error report
  2013-02-22  8:50         ` Mauro Carvalho Chehab
@ 2013-02-22  8:57           ` Mauro Carvalho Chehab
  2013-02-25  0:25             ` Huang Ying
  0 siblings, 1 reply; 49+ messages in thread
From: Mauro Carvalho Chehab @ 2013-02-22  8:57 UTC (permalink / raw)
  To: Mauro Carvalho Chehab
  Cc: Huang Ying, linux-acpi, Tony Luck, Linux Edac Mailing List,
	Linux Kernel Mailing List

Em Fri, 22 Feb 2013 05:50:21 -0300
Mauro Carvalho Chehab <mchehab@redhat.com> escreveu:

> Em Fri, 22 Feb 2013 08:45:11 +0800
> Huang Ying <ying.huang@intel.com> escreveu:
> 
> > On Thu, 2013-02-21 at 09:04 -0300, Mauro Carvalho Chehab wrote:
> > > Em Thu, 21 Feb 2013 09:26:07 +0800
> > > Huang Ying <ying.huang@intel.com> escreveu:
> > > 
> 
> > > This approach has another advantage: this patch can contain just
> > > the ghes changes and will still compile fine, as CONFIG_EDAC_GHES is not
> > > defined yet. So, the patch becomes simpler and easier to review.
> > 
> > Thanks for your explanation.  I agree to keep this in a header file.
> > 
> > My original suggestion was that it may be better to move this from ghes.h
> > to some place like "ghes_edac.h", because these functions are
> > implemented in "ghes_edac.c" instead of ghes.c.
> > 
> 
> Ah! Well, I considered a separate header when writing this patch, but adding
> another header for just 3 functions whose parameters are more related to
> ghes than to ghes_edac seemed overkill to me.
> 
> IMO, a single comment in the header pointing to ghes_edac.c would improve it.
> 
...
> +/* From drivers/edac/ghes_acpi.h */

...which is obviously wrong.

Mental note to myself: never hack too early in the morning before drinking a
cup of coffee.

Anyway, fixed on the patch below. Could you please ack?

Regards,
Mauro.


[PATCH EDAC] ghes: add the needed hooks for EDAC error report

In order to allow reporting errors via EDAC, add hooks for:

1) register an EDAC driver;
2) unregister an EDAC driver;
3) report errors via EDAC.

As the EDAC driver will need to access the ghes structure, add it
as one of the parameters for ghes_do_proc.

Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 6d0e146..d668a8a 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -409,7 +409,8 @@ static void ghes_clear_estatus(struct ghes *ghes)
 	ghes->flags &= ~GHES_TO_CLEAR;
 }
 
-static void ghes_do_proc(const struct acpi_hest_generic_status *estatus)
+static void ghes_do_proc(struct ghes *ghes,
+			 const struct acpi_hest_generic_status *estatus)
 {
 	int sev, sec_sev;
 	struct acpi_hest_generic_data *gdata;
@@ -421,6 +422,8 @@ static void ghes_do_proc(const struct acpi_hest_generic_status *estatus)
 				 CPER_SEC_PLATFORM_MEM)) {
 			struct cper_sec_mem_err *mem_err;
 			mem_err = (struct cper_sec_mem_err *)(gdata+1);
+			ghes_edac_report_mem_error(ghes, sev, mem_err);
+
 #ifdef CONFIG_X86_MCE
 			apei_mce_report_mem_error(sev == GHES_SEV_CORRECTED,
 						  mem_err);
@@ -639,7 +642,7 @@ static int ghes_proc(struct ghes *ghes)
 		if (ghes_print_estatus(NULL, ghes->generic, ghes->estatus))
 			ghes_estatus_cache_add(ghes->generic, ghes->estatus);
 	}
-	ghes_do_proc(ghes->estatus);
+	ghes_do_proc(ghes, ghes->estatus);
 out:
 	ghes_clear_estatus(ghes);
 	return 0;
@@ -732,7 +735,7 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
 		estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
 		len = apei_estatus_len(estatus);
 		node_len = GHES_ESTATUS_NODE_LEN(len);
-		ghes_do_proc(estatus);
+		ghes_do_proc(estatus_node->ghes, estatus);
 		if (!ghes_estatus_cached(estatus)) {
 			generic = estatus_node->generic;
 			if (ghes_print_estatus(NULL, generic, estatus))
@@ -821,6 +824,7 @@ static int ghes_notify_nmi(unsigned int cmd, struct pt_regs *regs)
 		estatus_node = (void *)gen_pool_alloc(ghes_estatus_pool,
 						      node_len);
 		if (estatus_node) {
+			estatus_node->ghes = ghes;
 			estatus_node->generic = ghes->generic;
 			estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
 			memcpy(estatus, ghes->estatus, len);
@@ -899,6 +903,11 @@ static int ghes_probe(struct platform_device *ghes_dev)
 		ghes = NULL;
 		goto err;
 	}
+
+	rc = ghes_edac_register(ghes, &ghes_dev->dev);
+	if (rc < 0)
+		goto err;
+
 	switch (generic->notify.type) {
 	case ACPI_HEST_NOTIFY_POLLED:
 		ghes->timer.function = ghes_poll_func;
@@ -911,13 +920,13 @@ static int ghes_probe(struct platform_device *ghes_dev)
 		if (acpi_gsi_to_irq(generic->notify.vector, &ghes->irq)) {
 			pr_err(GHES_PFX "Failed to map GSI to IRQ for generic hardware error source: %d\n",
 			       generic->header.source_id);
-			goto err;
+			goto err_edac_unreg;
 		}
 		if (request_irq(ghes->irq, ghes_irq_func,
 				0, "GHES IRQ", ghes)) {
 			pr_err(GHES_PFX "Failed to register IRQ for generic hardware error source: %d\n",
 			       generic->header.source_id);
-			goto err;
+			goto err_edac_unreg;
 		}
 		break;
 	case ACPI_HEST_NOTIFY_SCI:
@@ -943,6 +952,8 @@ static int ghes_probe(struct platform_device *ghes_dev)
 	platform_set_drvdata(ghes_dev, ghes);
 
 	return 0;
+err_edac_unreg:
+	ghes_edac_unregister(ghes);
 err:
 	if (ghes) {
 		ghes_fini(ghes);
@@ -995,6 +1006,9 @@ static int ghes_remove(struct platform_device *ghes_dev)
 	}
 
 	ghes_fini(ghes);
+
+	ghes_edac_unregister(ghes);
+
 	kfree(ghes);
 
 	platform_set_drvdata(ghes_dev, NULL);
diff --git a/include/acpi/ghes.h b/include/acpi/ghes.h
index 3eb8dc4..720446c 100644
--- a/include/acpi/ghes.h
+++ b/include/acpi/ghes.h
@@ -27,6 +27,7 @@ struct ghes {
 struct ghes_estatus_node {
 	struct llist_node llnode;
 	struct acpi_hest_generic *generic;
+	struct ghes *ghes;
 };
 
 struct ghes_estatus_cache {
@@ -43,3 +44,29 @@ enum {
 	GHES_SEV_RECOVERABLE = 0x2,
 	GHES_SEV_PANIC = 0x3,
 };
+
+/* From drivers/edac/ghes_edac.c */
+
+#ifdef CONFIG_EDAC_GHES
+void ghes_edac_report_mem_error(struct ghes *ghes, int sev,
+				struct cper_sec_mem_err *mem_err);
+
+int ghes_edac_register(struct ghes *ghes, struct device *dev);
+
+void ghes_edac_unregister(struct ghes *ghes);
+
+#else
+static inline void ghes_edac_report_mem_error(struct ghes *ghes, int sev,
+				       struct cper_sec_mem_err *mem_err)
+{
+}
+
+static inline int ghes_edac_register(struct ghes *ghes, struct device *dev)
+{
+	return 0;
+}
+
+static inline void ghes_edac_unregister(struct ghes *ghes)
+{
+}
+#endif
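
The #else branch above follows the usual kernel idiom of providing empty
static inline stubs when a Kconfig option is off, so callers such as
ghes_probe() need no #ifdef of their own. Below is a minimal user-space
sketch of that idiom. Every name in it (DEMO_EDAC_GHES, struct ghes_demo,
demo_edac_register, demo_probe) is hypothetical and only stands in for the
kernel symbols. Unlike the patch, where the enabled branch merely declares
the functions that ghes_edac.c implements, the sketch defines the enabled
variant in place to stay self-contained.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct ghes (illustration only). */
struct ghes_demo {
	int registered;
};

/*
 * DEMO_EDAC_GHES plays the role of CONFIG_EDAC_GHES. Build with
 * -DDEMO_EDAC_GHES to select the variant that does real work.
 */
#ifdef DEMO_EDAC_GHES
static int demo_edac_register(struct ghes_demo *g)
{
	g->registered = 1;
	return 0;
}
#else
/* Stub: does nothing and reports success, so callers need no #ifdef. */
static inline int demo_edac_register(struct ghes_demo *g)
{
	(void)g;
	return 0;
}
#endif

/* The caller is identical in both configurations. */
int demo_probe(struct ghes_demo *g)
{
	assert(g != NULL);
	return demo_edac_register(g);
}
```

With the option off the stub returns 0, so the probe path proceeds as if
registration had succeeded and the compiler can drop the call entirely.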



* Re: [PATCH EDAC 03/13] ghes: add the needed hooks for EDAC error report
  2013-02-22  8:57           ` Mauro Carvalho Chehab
@ 2013-02-25  0:25             ` Huang Ying
  0 siblings, 0 replies; 49+ messages in thread
From: Huang Ying @ 2013-02-25  0:25 UTC (permalink / raw)
  To: Mauro Carvalho Chehab
  Cc: linux-acpi, Tony Luck, Linux Edac Mailing List,
	Linux Kernel Mailing List

On Fri, 2013-02-22 at 05:57 -0300, Mauro Carvalho Chehab wrote:
> On Fri, 22 Feb 2013 05:50:21 -0300
> Mauro Carvalho Chehab <mchehab@redhat.com> wrote:
> 
> > On Fri, 22 Feb 2013 08:45:11 +0800
> > Huang Ying <ying.huang@intel.com> wrote:
> > 
> > > On Thu, 2013-02-21 at 09:04 -0300, Mauro Carvalho Chehab wrote:
> > > > On Thu, 21 Feb 2013 09:26:07 +0800
> > > > Huang Ying <ying.huang@intel.com> wrote:
> > > > 
> > 
> > > > There is also an advantage in taking this approach: this patch can have just
> > > > the ghes changes, and will still compile fine, as CONFIG_EDAC_GHES is not
> > > > defined yet. So, the patch becomes simpler and easier to review.
> > > 
> > > Thanks for your explanation.  I agree to keep this in the header file.
> > > 
> > > My original suggestion is that it may be better to move this from ghes.h
> > > to some place like "ghes_edac.h", because these functions are
> > > implemented in "ghes_edac.c" instead of ghes.c.
> > > 
> > 
> > Ah! Well, I considered a separate header when writing this patch, but adding
> > another header for just 3 functions whose parameters are more related to
> > ghes than to ghes_edac seemed overkill to me.
> > 
> > IMO, a single comment in the header pointing to ghes_edac.c would improve it.
> > 
> ...
> > +/* From drivers/edac/ghes_acpi.h */
> 
> ...which is obviously wrong.
> 
> Mental note to myself: never hack too early in the morning before drinking a
> cup of coffee.
> 
> Anyway, fixed on the patch below. Could you please ack?
> 
> Regards,
> Mauro.
> 
> 
> [PATCH EDAC] ghes: add the needed hooks for EDAC error report
> 
> In order to allow reporting errors via EDAC, add hooks for:
> 
> 1) register an EDAC driver;
> 2) unregister an EDAC driver;
> 3) report errors via EDAC.
> 
> As the EDAC driver will need to access the ghes structure, add it
> as one of the parameters of ghes_do_proc.
> 
> Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>

Acked-by: Huang Ying <ying.huang@intel.com>

> 
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index 6d0e146..d668a8a 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -409,7 +409,8 @@ static void ghes_clear_estatus(struct ghes *ghes)
>  	ghes->flags &= ~GHES_TO_CLEAR;
>  }
>  
> -static void ghes_do_proc(const struct acpi_hest_generic_status *estatus)
> +static void ghes_do_proc(struct ghes *ghes,
> +			 const struct acpi_hest_generic_status *estatus)
>  {
>  	int sev, sec_sev;
>  	struct acpi_hest_generic_data *gdata;
> @@ -421,6 +422,8 @@ static void ghes_do_proc(const struct acpi_hest_generic_status *estatus)
>  				 CPER_SEC_PLATFORM_MEM)) {
>  			struct cper_sec_mem_err *mem_err;
>  			mem_err = (struct cper_sec_mem_err *)(gdata+1);
> +			ghes_edac_report_mem_error(ghes, sev, mem_err);
> +
>  #ifdef CONFIG_X86_MCE
>  			apei_mce_report_mem_error(sev == GHES_SEV_CORRECTED,
>  						  mem_err);
> @@ -639,7 +642,7 @@ static int ghes_proc(struct ghes *ghes)
>  		if (ghes_print_estatus(NULL, ghes->generic, ghes->estatus))
>  			ghes_estatus_cache_add(ghes->generic, ghes->estatus);
>  	}
> -	ghes_do_proc(ghes->estatus);
> +	ghes_do_proc(ghes, ghes->estatus);
>  out:
>  	ghes_clear_estatus(ghes);
>  	return 0;
> @@ -732,7 +735,7 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
>  		estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
>  		len = apei_estatus_len(estatus);
>  		node_len = GHES_ESTATUS_NODE_LEN(len);
> -		ghes_do_proc(estatus);
> +		ghes_do_proc(estatus_node->ghes, estatus);
>  		if (!ghes_estatus_cached(estatus)) {
>  			generic = estatus_node->generic;
>  			if (ghes_print_estatus(NULL, generic, estatus))
> @@ -821,6 +824,7 @@ static int ghes_notify_nmi(unsigned int cmd, struct pt_regs *regs)
>  		estatus_node = (void *)gen_pool_alloc(ghes_estatus_pool,
>  						      node_len);
>  		if (estatus_node) {
> +			estatus_node->ghes = ghes;
>  			estatus_node->generic = ghes->generic;
>  			estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
>  			memcpy(estatus, ghes->estatus, len);
> @@ -899,6 +903,11 @@ static int ghes_probe(struct platform_device *ghes_dev)
>  		ghes = NULL;
>  		goto err;
>  	}
> +
> +	rc = ghes_edac_register(ghes, &ghes_dev->dev);
> +	if (rc < 0)
> +		goto err;
> +
>  	switch (generic->notify.type) {
>  	case ACPI_HEST_NOTIFY_POLLED:
>  		ghes->timer.function = ghes_poll_func;
> @@ -911,13 +920,13 @@ static int ghes_probe(struct platform_device *ghes_dev)
>  		if (acpi_gsi_to_irq(generic->notify.vector, &ghes->irq)) {
>  			pr_err(GHES_PFX "Failed to map GSI to IRQ for generic hardware error source: %d\n",
>  			       generic->header.source_id);
> -			goto err;
> +			goto err_edac_unreg;
>  		}
>  		if (request_irq(ghes->irq, ghes_irq_func,
>  				0, "GHES IRQ", ghes)) {
>  			pr_err(GHES_PFX "Failed to register IRQ for generic hardware error source: %d\n",
>  			       generic->header.source_id);
> -			goto err;
> +			goto err_edac_unreg;
>  		}
>  		break;
>  	case ACPI_HEST_NOTIFY_SCI:
> @@ -943,6 +952,8 @@ static int ghes_probe(struct platform_device *ghes_dev)
>  	platform_set_drvdata(ghes_dev, ghes);
>  
>  	return 0;
> +err_edac_unreg:
> +	ghes_edac_unregister(ghes);
>  err:
>  	if (ghes) {
>  		ghes_fini(ghes);
> @@ -995,6 +1006,9 @@ static int ghes_remove(struct platform_device *ghes_dev)
>  	}
>  
>  	ghes_fini(ghes);
> +
> +	ghes_edac_unregister(ghes);
> +
>  	kfree(ghes);
>  
>  	platform_set_drvdata(ghes_dev, NULL);
> diff --git a/include/acpi/ghes.h b/include/acpi/ghes.h
> index 3eb8dc4..720446c 100644
> --- a/include/acpi/ghes.h
> +++ b/include/acpi/ghes.h
> @@ -27,6 +27,7 @@ struct ghes {
>  struct ghes_estatus_node {
>  	struct llist_node llnode;
>  	struct acpi_hest_generic *generic;
> +	struct ghes *ghes;
>  };
>  
>  struct ghes_estatus_cache {
> @@ -43,3 +44,29 @@ enum {
>  	GHES_SEV_RECOVERABLE = 0x2,
>  	GHES_SEV_PANIC = 0x3,
>  };
> +
> +/* From drivers/edac/ghes_edac.c */
> +
> +#ifdef CONFIG_EDAC_GHES
> +void ghes_edac_report_mem_error(struct ghes *ghes, int sev,
> +				struct cper_sec_mem_err *mem_err);
> +
> +int ghes_edac_register(struct ghes *ghes, struct device *dev);
> +
> +void ghes_edac_unregister(struct ghes *ghes);
> +
> +#else
> +static inline void ghes_edac_report_mem_error(struct ghes *ghes, int sev,
> +				       struct cper_sec_mem_err *mem_err)
> +{
> +}
> +
> +static inline int ghes_edac_register(struct ghes *ghes, struct device *dev)
> +{
> +	return 0;
> +}
> +
> +static inline void ghes_edac_unregister(struct ghes *ghes)
> +{
> +}
> +#endif
> 
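
The ghes_probe() hunks quoted above add an err_edac_unreg label that falls
through to the existing err label, so each failure point undoes only what
has been set up so far. Below is a minimal user-space sketch of that
goto-based unwinding. All names in it (struct demo_dev, demo_register,
demo_unregister, demo_fini, demo_probe_unwind, irq_ok) are hypothetical
and are not the kernel's functions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical device state standing in for struct ghes. */
struct demo_dev {
	int edac_registered;
	int initialized;
};

static int demo_register(struct demo_dev *d)
{
	d->edac_registered = 1;
	return 0;
}

static void demo_unregister(struct demo_dev *d)
{
	d->edac_registered = 0;
}

static void demo_fini(struct demo_dev *d)
{
	d->initialized = 0;
}

/* irq_ok simulates whether a later step (e.g. requesting an IRQ) succeeds. */
int demo_probe_unwind(struct demo_dev *d, int irq_ok)
{
	int rc;

	assert(d != NULL);
	d->initialized = 1;

	rc = demo_register(d);
	if (rc < 0)
		goto err;		/* nothing registered yet */

	if (!irq_ok) {
		rc = -1;
		goto err_unreg;		/* registration must be undone */
	}
	return 0;

err_unreg:
	demo_unregister(d);
err:					/* err_unreg falls through to here */
	demo_fini(d);
	return rc;
}
```

Ordering the labels in reverse order of acquisition means a jump to a
later label never runs the teardown of a step that was not reached.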




end of thread, other threads:[~2013-02-25  0:25 UTC | newest]

Thread overview: 49+ messages
2013-02-15 12:44 [PATCH EDAC 00/13] Add a driver to report Firmware first errors (via GHES) Mauro Carvalho Chehab
2013-02-15 12:44 ` Mauro Carvalho Chehab
2013-02-15 12:44 ` [PATCH EDAC 01/13] edac: lock module owner to avoid error report conflicts Mauro Carvalho Chehab
2013-02-15 12:44   ` Mauro Carvalho Chehab
2013-02-15 12:44 ` [PATCH EDAC 02/13] ghes: move structures/enum to a header file Mauro Carvalho Chehab
2013-02-15 12:44   ` Mauro Carvalho Chehab
2013-02-15 12:44 ` [PATCH EDAC 03/13] ghes: add the needed hooks for EDAC error report Mauro Carvalho Chehab
2013-02-15 12:44   ` Mauro Carvalho Chehab
2013-02-21  1:26   ` Huang Ying
2013-02-21 12:04     ` Mauro Carvalho Chehab
2013-02-22  0:45       ` Huang Ying
2013-02-22  8:50         ` Mauro Carvalho Chehab
2013-02-22  8:57           ` Mauro Carvalho Chehab
2013-02-25  0:25             ` Huang Ying
2013-02-15 12:44 ` [PATCH EDAC 04/13] edac: add a new memory layer type Mauro Carvalho Chehab
2013-02-15 12:44   ` Mauro Carvalho Chehab
2013-02-15 12:44 ` [PATCH EDAC 05/13] ghes_edac: Register at EDAC core the BIOS report Mauro Carvalho Chehab
2013-02-15 12:44   ` Mauro Carvalho Chehab
2013-02-15 12:44 ` [PATCH EDAC 06/13] ghes_edac: Allow registering more than once Mauro Carvalho Chehab
2013-02-15 12:44   ` Mauro Carvalho Chehab
2013-02-15 12:44 ` [PATCH EDAC 07/13] edac: add support for raw error reports Mauro Carvalho Chehab
2013-02-15 12:44   ` Mauro Carvalho Chehab
2013-02-15 14:13   ` Borislav Petkov
2013-02-15 15:25     ` Mauro Carvalho Chehab
2013-02-15 15:41       ` Borislav Petkov
2013-02-15 15:49         ` Mauro Carvalho Chehab
2013-02-15 16:02           ` Borislav Petkov
2013-02-15 18:20             ` Mauro Carvalho Chehab
2013-02-16 16:57               ` Borislav Petkov
2013-02-16 16:57                 ` Borislav Petkov
2013-02-17 10:44                 ` Mauro Carvalho Chehab
2013-02-17 10:44                   ` Mauro Carvalho Chehab
2013-02-18 13:52                   ` Borislav Petkov
2013-02-18 15:24                     ` Mauro Carvalho Chehab
2013-02-19 11:56                       ` Mauro Carvalho Chehab
2013-02-15 12:44 ` [PATCH EDAC 08/13] ghes_edac: add support for reporting errors via EDAC Mauro Carvalho Chehab
2013-02-15 12:44   ` Mauro Carvalho Chehab
2013-02-15 12:44 ` [PATCH EDAC 09/13] ghes_edac: do a better job of filling EDAC DIMM info Mauro Carvalho Chehab
2013-02-15 12:44   ` Mauro Carvalho Chehab
2013-02-15 12:44 ` [PATCH EDAC 10/13] edac: better report error conditions in debug mode Mauro Carvalho Chehab
2013-02-15 12:44   ` Mauro Carvalho Chehab
2013-02-15 12:44 ` [PATCH EDAC 11/13] edac: initialize the core earlier Mauro Carvalho Chehab
2013-02-15 12:44   ` Mauro Carvalho Chehab
2013-02-15 12:45 ` [PATCH EDAC 12/13] ghes_edac.c: Don't credit the same memory dimm twice Mauro Carvalho Chehab
2013-02-15 12:45   ` Mauro Carvalho Chehab
2013-02-15 12:45 ` [PATCH EDAC 13/13] ghes_edac: Improve driver's printk messages Mauro Carvalho Chehab
2013-02-15 12:45   ` Mauro Carvalho Chehab
2013-02-15 16:38   ` Joe Perches
2013-02-15 17:33     ` Mauro Carvalho Chehab
