linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH EDACv2 00/12] Add a driver to report Firmware first errors (via GHES)
@ 2013-02-21 15:38 Mauro Carvalho Chehab
  2013-02-21 15:38 ` [PATCH EDACv2 01/12] edac: add support for raw error reports Mauro Carvalho Chehab
                   ` (11 more replies)
  0 siblings, 12 replies; 18+ messages in thread
From: Mauro Carvalho Chehab @ 2013-02-21 15:38 UTC (permalink / raw)
  Cc: Mauro Carvalho Chehab, Linux Edac Mailing List,
	Linux Kernel Mailing List, Len Brown, Huang Ying,
	Rafael J . Wysocki

There are currently 3 error mechanisms inside the Linux Kernel:
edac, mcelog and ghes.

Unfortunately, not all those error mechanisms will work at the same
time, as accessing the error registers by the BIOS may interfere on
reading them from OS.

So, all those 3 mechanisms need to be integrated, in order to avoid
such problems.

This patch series adds a new EDAC driver that uses "Firmware first"
APEI/GHES as an error report mechanism. It automatically disables
the hardware-driven EDAC drivers when GHES is enabled, preventing
to have both OS and BIOS to read at the very same error mechanisms.

It was tested on a "Lizard Head Pass" Intel machine, equipped with
BIOS SE5C600.86B.99.99.x059.091020121352 (09/10/2012).

Test results:

dmesg logs:
...
[  771.616999] mce: [Hardware Error]: Machine check events logged
[  771.623635] {3}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
[  771.632872] {3}[Hardware Error]: APEI generic hardware error status
[  771.639875] {3}[Hardware Error]: severity: 2, corrected
[  771.645708] {3}[Hardware Error]: section: 0, severity: 2, corrected
[  771.652704] {3}[Hardware Error]: flags: 0x01
[  771.657470] {3}[Hardware Error]: primary
[  771.661869] {3}[Hardware Error]: section_type: memory error
[  771.668103] {3}[Hardware Error]: error_status: 0x0000000000000400
[  771.674911] {3}[Hardware Error]: physical_address: 0x000000080b7f0000
[  771.682098] {3}[Hardware Error]: node: 0
[  771.686474] {3}[Hardware Error]: card: 0
[  771.690866] {3}[Hardware Error]: module: 0
[  771.695440] {3}[Hardware Error]: device: 0
[  771.700027] {3}[Hardware Error]: error_type: 18, unknown
[  771.706013] EDAC DEBUG: ghes_edac_report_mem_error: error validation_bits: 0x000040bb
[  771.706023] EDAC MC0: 1 CE reserved error (18) on unknown label (node:0 card:0 module:0 page:0x80b7f0 offset:0x0 grain:0 syndrome:0x0 - status(0x0000000000000400): Storage error in DRAM memory)

trace logs (for 3 injected errors):
# tracer: nop
#
# entries-in-buffer/entries-written: 3/3   #P:64
#
#                              _-----=> irqs-off
#                             / _----=> need-resched
#                            | / _---=> hardirq/softirq
#                            || / _--=> preempt-depth
#                            ||| /     delay
#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
#              | |       |   ||||       |         |
     kworker/0:2-590   [000] ....   120.908219: mc_event: 1 Corrected error: reserved error (18) on unknown label (mc:0 location:-1:-1:-1 address:0x00000000 grain:1 syndrome:0x00000000 APEI location: node:3 card:0 module:0 status(0x0000000000000400): Storage error in DRAM memory)
     kworker/0:2-590   [000] ....   281.557076: mc_event: 1 Corrected error: reserved error (18) on unknown label (mc:0 location:-1:-1:-1 address:0x00000000 grain:1 syndrome:0x00000000 APEI location: node:3 card:0 module:1 status(0x0000000000000400): Storage error in DRAM memory)
     kworker/0:1-1989  [000] ....   771.706018: mc_event: 1 Corrected error: reserved error (18) on unknown label (mc:0 location:-1:-1:-1 address:0x000080b7 grain:1 syndrome:0x00000000 APEI location: node:0 card:0 module:0 status(0x0000000000000400): Storage error in DRAM memory)

-
Version 2: 

- for the sake of focusing on the important patches, I removed
from this patch series 4 patches that were submitted on version 1
and are just trivial cleanups and a simple EDAC internal API addition:
	c2c93db - edac: remove proc_name from mci structure
	c66b5a7 - edac: add a new memory layer type
	4ab19b0 - edac: initialize the core earlier
	3d95882 - edac: better report error conditions in debug mode

Those 4 patches can be found on my experimental git tree, at:
	http://git.infradead.org/users/mchehab/edac.git/shortlog/refs/heads/experimental

- I added in this series the 6 additional patches that improve ghes_edac
  output;

- Several patches got fold, to keep the patch series clean. So, instead
  of having 13+6 patches, the entire series has now 12 patches;

- A new patch was added from Borislav's feedback about patch 07/13:
	3897b2a - edac: put all arguments for the raw error handling call into a struct (79 minutes ago)

- Patch 03/13 was changed according with Huang's feedback. It is now
  patch 04/12. Due to the removal of mci from ghes structure, patch
  05/12 was changed to store the ghes <==> mci association inside the
  driver's private data. Subsequent patches required changes as well
  to cope with the new way to retrieve data;

- the buffer usage by ghes_edac_report_mem_error() was reduced, by storing
  the temporary buffers inside the private structure;

- minor cleanups.

To help reviewers, I'm enclosing the diff between v1 and v2 at the end of
this email.

Mauro Carvalho Chehab (12):
  edac: add support for raw error reports
  edac: add support for error type "Info"
  ghes: move structures/enum to a header file
  ghes: add the needed hooks for EDAC error report
  ghes_edac: Register at EDAC core the BIOS report
  ghes_edac: add support for reporting errors via EDAC
  ghes_edac: do a better job of filling EDAC DIMM info
  ghes_edac: Don't credit the same memory dimm twice
  ghes_edac: Improve driver's printk messages
  edac: put all arguments for the raw error handling call into a struct
  ghes_edac: Make it compliant with UEFI spec 2.3.1
  ghes_edac: Fix RAS tracing

 MAINTAINERS              |   7 +
 drivers/acpi/apei/ghes.c |  71 ++-----
 drivers/edac/Kconfig     |  23 ++
 drivers/edac/Makefile    |   1 +
 drivers/edac/edac_core.h |   5 +
 drivers/edac/edac_mc.c   | 119 +++++++----
 drivers/edac/ghes_edac.c | 537 +++++++++++++++++++++++++++++++++++++++++++++++
 include/acpi/ghes.h      |  70 ++++++
 include/linux/edac.h     |  72 +++++++
 include/ras/ras_event.h  |   4 +-
 10 files changed, 814 insertions(+), 95 deletions(-)
 create mode 100644 drivers/edac/ghes_edac.c
 create mode 100644 include/acpi/ghes.h

----------------------------
Difference against version 1:
----------------------------

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index a21d7da..19092dc 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -903,6 +903,11 @@ static int ghes_probe(struct platform_device *ghes_dev)
 		ghes = NULL;
 		goto err;
 	}
+
+	rc = ghes_edac_register(ghes, &ghes_dev->dev);
+	if (rc < 0)
+		goto err;
+
 	switch (generic->notify.type) {
 	case ACPI_HEST_NOTIFY_POLLED:
 		ghes->timer.function = ghes_poll_func;
@@ -915,13 +920,13 @@ static int ghes_probe(struct platform_device *ghes_dev)
 		if (acpi_gsi_to_irq(generic->notify.vector, &ghes->irq)) {
 			pr_err(GHES_PFX "Failed to map GSI to IRQ for generic hardware error source: %d\n",
 			       generic->header.source_id);
-			goto err;
+			goto err2;
 		}
 		if (request_irq(ghes->irq, ghes_irq_func,
 				0, "GHES IRQ", ghes)) {
 			pr_err(GHES_PFX "Failed to register IRQ for generic hardware error source: %d\n",
 			       generic->header.source_id);
-			goto err;
+			goto err2;
 		}
 		break;
 	case ACPI_HEST_NOTIFY_SCI:
@@ -946,11 +951,9 @@ static int ghes_probe(struct platform_device *ghes_dev)
 	}
 	platform_set_drvdata(ghes_dev, ghes);
 
-	rc = ghes_edac_register(ghes, &ghes_dev->dev);
-	if (rc < 0)
-		goto err;
-
 	return 0;
+err2:
+	ghes_edac_unregister(ghes);
 err:
 	if (ghes) {
 		ghes_fini(ghes);
diff --git a/drivers/edac/edac_mc.c b/drivers/edac/edac_mc.c
index 8d89bc0..cdb81aa 100644
--- a/drivers/edac/edac_mc.c
+++ b/drivers/edac/edac_mc.c
@@ -818,9 +818,8 @@ struct mem_ctl_info *edac_mc_del_mc(struct device *dev)
 		return NULL;
 	}
 
-	if (!del_mc_from_global_list(mci)) {
+	if (!del_mc_from_global_list(mci))
 		edac_mc_owner = NULL;
-	}
 	mutex_unlock(&mem_ctls_mutex);
 
 	/* flush workq processes */
diff --git a/drivers/edac/ghes_edac.c b/drivers/edac/ghes_edac.c
index 2126aab..636dcf1 100644
--- a/drivers/edac/ghes_edac.c
+++ b/drivers/edac/ghes_edac.c
@@ -19,6 +19,18 @@
 
 #define GHES_EDAC_REVISION " Ver: 1.0.0"
 
+struct ghes_edac_pvt {
+	struct list_head list;
+	struct ghes *ghes;
+	struct mem_ctl_info *mci;
+
+	/* Buffers for the error handling routine */
+	char detail_location[240];
+	char other_detail[160];
+	char msg[80];
+};
+
+static LIST_HEAD(ghes_reglist);
 static DEFINE_MUTEX(ghes_edac_lock);
 static int ghes_edac_mc_num;
 
@@ -107,7 +119,7 @@ static void ghes_edac_dmidecode(const struct dmi_header *dh, void *arg)
 				dimm->nr_pages = MiB_TO_PAGES(entry->size);
 		}
 
-		switch(entry->memory_type) {
+		switch (entry->memory_type) {
 		case 0x12:
 			if (entry->type_detail & 1 << 13)
 				dimm->mtype = MEM_RDDR;
@@ -163,7 +175,7 @@ static void ghes_edac_dmidecode(const struct dmi_header *dh, void *arg)
 			edac_dbg(1, "DIMM%i: %s size = %d MB%s\n",
 				dimm_fill->count, memory_type[dimm->mtype],
 				PAGES_TO_MiB(dimm->nr_pages),
-				(dimm->edac_mode != EDAC_NONE)? "(ECC)" : "");
+				(dimm->edac_mode != EDAC_NONE) ? "(ECC)" : "");
 			edac_dbg(2, "\ttype %d, detail 0x%02x, width %d(total %d)\n",
 				entry->memory_type, entry->type_detail,
 				entry->total_width, entry->data_width);
@@ -174,27 +186,39 @@ static void ghes_edac_dmidecode(const struct dmi_header *dh, void *arg)
 }
 
 void ghes_edac_report_mem_error(struct ghes *ghes, int sev,
-			        struct cper_sec_mem_err *mem_err)
+				struct cper_sec_mem_err *mem_err)
 {
-	struct edac_raw_error_desc *e = &ghes->mci->error_desc;
 	enum hw_event_mc_err_type type;
-	char detail_location[240];
-	char other_detail[160] = "";
-	char msg[40] = "";
+	struct edac_raw_error_desc *e;
+	struct mem_ctl_info *mci;
+	struct ghes_edac_pvt *pvt = NULL;
 	char *p;
 	u8 grain_bits;
 
+	list_for_each_entry(pvt, &ghes_reglist, list) {
+		if (ghes == pvt->ghes)
+			break;
+	}
+	if (!pvt) {
+		pr_err("Internal error: Can't find EDAC structure\n");
+		return;
+	}
+	mci = pvt->mci;
+	e = &mci->error_desc;
+
 	/* Cleans the error report buffer */
 	memset(e, 0, sizeof (*e));
 	e->error_count = 1;
 	strcpy(e->label, "unknown label");
-	e->msg = msg;
-	e->other_detail = other_detail;
+	e->msg = pvt->msg;
+	e->other_detail = pvt->other_detail;
 	e->top_layer = -1;
 	e->mid_layer = -1;
 	e->low_layer = -1;
+	*pvt->other_detail = '\0';
+	*pvt->msg = '\0';
 
-	switch(sev) {
+	switch (sev) {
 	case GHES_SEV_CORRECTED:
 		type = HW_EVENT_ERR_CORRECTED;
 		break;
@@ -209,9 +233,12 @@ void ghes_edac_report_mem_error(struct ghes *ghes, int sev,
 		type = HW_EVENT_ERR_INFO;
 	}
 
+	edac_dbg(1, "error validation_bits: 0x%08llx\n",
+		 (long long)mem_err->validation_bits);
+
 	/* Error type, mapped on e->msg */
 	if (mem_err->validation_bits & CPER_MEM_VALID_ERROR_TYPE) {
-		p = msg;
+		p = pvt->msg;
 		switch (mem_err->error_type) {
 		case 0:
 			p += sprintf(p, "Unknown");
@@ -266,7 +293,7 @@ void ghes_edac_report_mem_error(struct ghes *ghes, int sev,
 				     mem_err->error_type);
 		}
 	} else {
-		strcpy(msg, "unknown error");
+		strcpy(pvt->msg, "unknown error");
 	}
 
 	/* Error address */
@@ -300,7 +327,7 @@ void ghes_edac_report_mem_error(struct ghes *ghes, int sev,
 		*(p - 1) = '\0';
 
 	/* All other fields are mapped on e->other_detail */
-	p= other_detail;
+	p = pvt->other_detail;
 	if (mem_err->validation_bits & CPER_MEM_VALID_ERROR_STATUS) {
 		u64 status = mem_err->error_status;
 
@@ -313,7 +340,7 @@ void ghes_edac_report_mem_error(struct ghes *ghes, int sev,
 			p += sprintf(p, "Error detected in the bus ");
 			break;
 		case 4:
-			p += sprintf(p, "Storage error in memory (DRAM) ");
+			p += sprintf(p, "Storage error in DRAM memory ");
 			break;
 		case 5:
 			p += sprintf(p, "Storage error in TLB ");
@@ -346,7 +373,7 @@ void ghes_edac_report_mem_error(struct ghes *ghes, int sev,
 			p += sprintf(p, "Response not associated with a request ");
 			break;
 		case 22:
-			p += sprintf(p, "Bus parity error (must also set the A, C, or D Bits) ");
+			p += sprintf(p, "Bus parity error - must also set the A, C, or D Bits ");
 			break;
 		case 23:
 			p += sprintf(p, "Detection of a PATH_ERROR ");
@@ -363,29 +390,28 @@ void ghes_edac_report_mem_error(struct ghes *ghes, int sev,
 		}
 	}
 	if (mem_err->validation_bits & CPER_MEM_VALID_REQUESTOR_ID)
-		p += sprintf(p, "requestor ID: 0x%016llx ",
+		p += sprintf(p, "requestorID: 0x%016llx ",
 			     (long long)mem_err->requestor_id);
 	if (mem_err->validation_bits & CPER_MEM_VALID_RESPONDER_ID)
-		p += sprintf(p, "responder ID: 0x%016llx ",
+		p += sprintf(p, "responderID: 0x%016llx ",
 			     (long long)mem_err->responder_id);
 	if (mem_err->validation_bits & CPER_MEM_VALID_TARGET_ID)
-		p += sprintf(p, "target ID: 0x%016llx ",
+		p += sprintf(p, "targetID: 0x%016llx ",
 			     (long long)mem_err->responder_id);
-	if (p > other_detail)
+	if (p > pvt->other_detail)
 		*(p - 1) = '\0';
 
 	/* Generate the trace event */
 	grain_bits = fls_long(e->grain);
-	sprintf(detail_location, "APEI location: %s %s",
+	sprintf(pvt->detail_location, "APEI location: %s %s",
 		e->location, e->other_detail);
 	trace_mc_event(type, e->msg, e->label, e->error_count,
-		       ghes->mci->mc_idx,
-		       e->top_layer, e->mid_layer, e->low_layer,
+		       mci->mc_idx, e->top_layer, e->mid_layer, e->low_layer,
 		       PAGES_TO_MiB(e->page_frame_number) | e->offset_in_page,
-		       grain_bits, e->syndrome, detail_location);
+		       grain_bits, e->syndrome, pvt->detail_location);
 
 	/* Report the error via EDAC API */
-	edac_raw_mc_handle_error(type, ghes->mci, e);
+	edac_raw_mc_handle_error(type, mci, e);
 }
 EXPORT_SYMBOL_GPL(ghes_edac_report_mem_error);
 
@@ -395,6 +421,7 @@ int ghes_edac_register(struct ghes *ghes, struct device *dev)
 	int rc, num_dimm = 0;
 	struct mem_ctl_info *mci;
 	struct edac_mc_layer layers[1];
+	struct ghes_edac_pvt *pvt;
 	struct ghes_edac_dimm_fill dimm_fill;
 
 	/* Get the number of DIMMs */
@@ -415,14 +442,19 @@ int ghes_edac_register(struct ghes *ghes, struct device *dev)
 	 * to avoid duplicated memory controller numbers
 	 */
 	mutex_lock(&ghes_edac_lock);
-	mci = edac_mc_alloc(ghes_edac_mc_num, ARRAY_SIZE(layers), layers, 0);
+	mci = edac_mc_alloc(ghes_edac_mc_num, ARRAY_SIZE(layers), layers,
+			    sizeof(*pvt));
 	if (!mci) {
 		pr_info("Can't allocate memory for EDAC data\n");
 		mutex_unlock(&ghes_edac_lock);
 		return -ENOMEM;
 	}
 
-	mci->pvt_info = ghes;
+	pvt = mci->pvt_info;
+	memset(pvt, 0, sizeof(*pvt));
+	list_add_tail(&pvt->list, &ghes_reglist);
+	pvt->ghes = ghes;
+	pvt->mci  = mci;
 	mci->pdev = dev;
 
 	mci->mtype_cap = MEM_FLAG_EMPTY;
@@ -461,7 +493,7 @@ int ghes_edac_register(struct ghes *ghes, struct device *dev)
 		if (!ghes_edac_mc_num) {
 			dimm_fill.count = 0;
 			dimm_fill.mci = mci;
-			dmi_walk (ghes_edac_dmidecode, &dimm_fill);
+			dmi_walk(ghes_edac_dmidecode, &dimm_fill);
 		}
 	} else {
 		struct dimm_info *dimm = EDAC_DIMM_PTR(mci->layers, mci->dimms,
@@ -482,8 +514,6 @@ int ghes_edac_register(struct ghes *ghes, struct device *dev)
 		return -ENODEV;
 	}
 
-
-	ghes->mci = mci;
 	ghes_edac_mc_num++;
 	mutex_unlock(&ghes_edac_lock);
 	return 0;
@@ -492,12 +522,16 @@ EXPORT_SYMBOL_GPL(ghes_edac_register);
 
 void ghes_edac_unregister(struct ghes *ghes)
 {
-	struct mem_ctl_info *mci = ghes->mci;
-
-	if (!mci)
-		return;
-
-	edac_mc_del_mc(mci->pdev);
-	edac_mc_free(mci);
+	struct mem_ctl_info *mci;
+	struct ghes_edac_pvt *pvt;
+
+	list_for_each_entry(pvt, &ghes_reglist, list) {
+		if (ghes == pvt->ghes) {
+			mci = pvt->mci;
+			edac_mc_del_mc(mci->pdev);
+			edac_mc_free(mci);
+			list_del(&pvt->list);
+		}
+	}
 }
 EXPORT_SYMBOL_GPL(ghes_edac_unregister);
diff --git a/include/acpi/ghes.h b/include/acpi/ghes.h
index c6fef72..9015ec2 100644
--- a/include/acpi/ghes.h
+++ b/include/acpi/ghes.h
@@ -22,8 +22,6 @@ struct ghes {
 		struct timer_list timer;
 		unsigned int irq;
 	};
-
-	struct mem_ctl_info *mci;
 };
 
 struct ghes_estatus_node {


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH EDACv2 01/12] edac: add support for raw error reports
  2013-02-21 15:38 [PATCH EDACv2 00/12] Add a driver to report Firmware first errors (via GHES) Mauro Carvalho Chehab
@ 2013-02-21 15:38 ` Mauro Carvalho Chehab
  2013-02-21 15:53   ` Borislav Petkov
  2013-02-21 15:39 ` [PATCH EDACv2 02/12] edac: add support for error type "Info" Mauro Carvalho Chehab
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 18+ messages in thread
From: Mauro Carvalho Chehab @ 2013-02-21 15:38 UTC (permalink / raw)
  Cc: Mauro Carvalho Chehab, Linux Edac Mailing List,
	Linux Kernel Mailing List

That allows APEI GHES driver to report errors directly, using
the EDAC error report API.

Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
---
 drivers/edac/edac_core.h |  17 ++++++++
 drivers/edac/edac_mc.c   | 109 ++++++++++++++++++++++++++++++++++++-----------
 2 files changed, 100 insertions(+), 26 deletions(-)

diff --git a/drivers/edac/edac_core.h b/drivers/edac/edac_core.h
index 23bb99f..9c5da11 100644
--- a/drivers/edac/edac_core.h
+++ b/drivers/edac/edac_core.h
@@ -453,6 +453,23 @@ extern struct mem_ctl_info *find_mci_by_dev(struct device *dev);
 extern struct mem_ctl_info *edac_mc_del_mc(struct device *dev);
 extern int edac_mc_find_csrow_by_page(struct mem_ctl_info *mci,
 				      unsigned long page);
+
+void edac_raw_mc_handle_error(const enum hw_event_mc_err_type type,
+			  struct mem_ctl_info *mci,
+			  long grain,
+			  const u16 error_count,
+			  const int top_layer,
+			  const int mid_layer,
+			  const int low_layer,
+			  const unsigned long page_frame_number,
+			  const unsigned long offset_in_page,
+			  const unsigned long syndrome,
+			  const char *msg,
+			  const char *location,
+			  const char *label,
+			  const char *other_detail,
+			  const bool enable_per_layer_report);
+
 void edac_mc_handle_error(const enum hw_event_mc_err_type type,
 			  struct mem_ctl_info *mci,
 			  const u16 error_count,
diff --git a/drivers/edac/edac_mc.c b/drivers/edac/edac_mc.c
index 34eb970..94a5f26 100644
--- a/drivers/edac/edac_mc.c
+++ b/drivers/edac/edac_mc.c
@@ -1068,6 +1068,82 @@ static void edac_ue_error(struct mem_ctl_info *mci,
 #define OTHER_LABEL " or "
 
 /**
+ * edac_raw_mc_handle_error - reports a memory event to userspace without doing
+ *			      anything to discover the error location
+ *
+ * @type:		severity of the error (CE/UE/Fatal)
+ * @mci:		a struct mem_ctl_info pointer
+ * @grain:		error granularity
+ * @error_count:	Number of errors of the same type
+ * @top_layer:		Memory layer[0] position
+ * @mid_layer:		Memory layer[1] position
+ * @low_layer:		Memory layer[2] position
+ * @page_frame_number:	mem page where the error occurred
+ * @offset_in_page:	offset of the error inside the page
+ * @syndrome:		ECC syndrome
+ * @msg:		Message meaningful to the end users that
+ *			explains the event\
+ * @location:		location of the error, like "csrow:0 channel:1"
+ * @label:		DIMM labels for the affected memory(ies)
+ * @other_detail:	Technical details about the event that
+ *			may help hardware manufacturers and
+ *			EDAC developers to analyse the event
+ * @enable_per_layer_report: should it increment per-layer error counts?
+ *
+ * This raw function is used internally by edac_mc_handle_error(). It should
+ * only be called directly when the hardware error come directly from BIOS,
+ * like in the case of APEI GHES driver.
+ */
+void edac_raw_mc_handle_error(const enum hw_event_mc_err_type type,
+			  struct mem_ctl_info *mci,
+			  long grain,
+			  const u16 error_count,
+			  const int top_layer,
+			  const int mid_layer,
+			  const int low_layer,
+			  const unsigned long page_frame_number,
+			  const unsigned long offset_in_page,
+			  const unsigned long syndrome,
+			  const char *msg,
+			  const char *location,
+			  const char *label,
+			  const char *other_detail,
+			  const bool enable_per_layer_report)
+{
+	char detail[80];
+	u8 grain_bits;
+	int pos[EDAC_MAX_LAYERS] = { top_layer, mid_layer, low_layer };
+
+	/* Report the error via the trace interface */
+	grain_bits = fls_long(grain) + 1;
+	trace_mc_event(type, msg, label, error_count,
+		       mci->mc_idx, top_layer, mid_layer, low_layer,
+		       PAGES_TO_MiB(page_frame_number) | offset_in_page,
+		       grain_bits, syndrome, other_detail);
+
+	/* Memory type dependent details about the error */
+	if (type == HW_EVENT_ERR_CORRECTED) {
+		snprintf(detail, sizeof(detail),
+			"page:0x%lx offset:0x%lx grain:%ld syndrome:0x%lx",
+			page_frame_number, offset_in_page,
+			grain, syndrome);
+		edac_ce_error(mci, error_count, pos, msg, location, label,
+			      detail, other_detail, enable_per_layer_report,
+			      page_frame_number, offset_in_page, grain);
+	} else {
+		snprintf(detail, sizeof(detail),
+			"page:0x%lx offset:0x%lx grain:%ld",
+			page_frame_number, offset_in_page, grain);
+
+		edac_ue_error(mci, error_count, pos, msg, location, label,
+			      detail, other_detail, enable_per_layer_report);
+	}
+
+
+}
+EXPORT_SYMBOL_GPL(edac_raw_mc_handle_error);
+
+/**
  * edac_mc_handle_error - reports a memory event to userspace
  *
  * @type:		severity of the error (CE/UE/Fatal)
@@ -1098,7 +1174,7 @@ void edac_mc_handle_error(const enum hw_event_mc_err_type type,
 			  const char *other_detail)
 {
 	/* FIXME: too much for stack: move it to some pre-alocated area */
-	char detail[80], location[80];
+	char location[80];
 	char label[(EDAC_MC_LABEL_LEN + 1 + sizeof(OTHER_LABEL)) * mci->tot_dimms];
 	char *p;
 	int row = -1, chan = -1;
@@ -1106,7 +1182,6 @@ void edac_mc_handle_error(const enum hw_event_mc_err_type type,
 	int i;
 	long grain;
 	bool enable_per_layer_report = false;
-	u8 grain_bits;
 
 	edac_dbg(3, "MC%d\n", mci->mc_idx);
 
@@ -1229,29 +1304,11 @@ void edac_mc_handle_error(const enum hw_event_mc_err_type type,
 	if (p > location)
 		*(p - 1) = '\0';
 
-	/* Report the error via the trace interface */
-	grain_bits = fls_long(grain) + 1;
-	trace_mc_event(type, msg, label, error_count,
-		       mci->mc_idx, top_layer, mid_layer, low_layer,
-		       PAGES_TO_MiB(page_frame_number) | offset_in_page,
-		       grain_bits, syndrome, other_detail);
-
-	/* Memory type dependent details about the error */
-	if (type == HW_EVENT_ERR_CORRECTED) {
-		snprintf(detail, sizeof(detail),
-			"page:0x%lx offset:0x%lx grain:%ld syndrome:0x%lx",
-			page_frame_number, offset_in_page,
-			grain, syndrome);
-		edac_ce_error(mci, error_count, pos, msg, location, label,
-			      detail, other_detail, enable_per_layer_report,
-			      page_frame_number, offset_in_page, grain);
-	} else {
-		snprintf(detail, sizeof(detail),
-			"page:0x%lx offset:0x%lx grain:%ld",
-			page_frame_number, offset_in_page, grain);
-
-		edac_ue_error(mci, error_count, pos, msg, location, label,
-			      detail, other_detail, enable_per_layer_report);
-	}
+	edac_raw_mc_handle_error(type, mci, grain, error_count,
+				 top_layer, mid_layer, low_layer,
+				 page_frame_number, offset_in_page,
+				 syndrome,
+				 msg, location, label, other_detail,
+				 enable_per_layer_report);
 }
 EXPORT_SYMBOL_GPL(edac_mc_handle_error);
-- 
1.8.1.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH EDACv2 02/12] edac: add support for error type "Info"
  2013-02-21 15:38 [PATCH EDACv2 00/12] Add a driver to report Firmware first errors (via GHES) Mauro Carvalho Chehab
  2013-02-21 15:38 ` [PATCH EDACv2 01/12] edac: add support for raw error reports Mauro Carvalho Chehab
@ 2013-02-21 15:39 ` Mauro Carvalho Chehab
  2013-02-21 15:39 ` [PATCH EDACv2 03/12] ghes: move structures/enum to a header file Mauro Carvalho Chehab
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 18+ messages in thread
From: Mauro Carvalho Chehab @ 2013-02-21 15:39 UTC (permalink / raw)
  Cc: Mauro Carvalho Chehab, Linux Edac Mailing List,
	Linux Kernel Mailing List

The CPER spec defines a forth type of error: informational
logs. Add support for it at the edac API and at the
trace event interface.

Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
---
 include/linux/edac.h    | 16 ++++++++++++++++
 include/ras/ras_event.h |  4 +---
 2 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/include/linux/edac.h b/include/linux/edac.h
index ff18efc..2049b96 100644
--- a/include/linux/edac.h
+++ b/include/linux/edac.h
@@ -99,8 +99,24 @@ enum hw_event_mc_err_type {
 	HW_EVENT_ERR_CORRECTED,
 	HW_EVENT_ERR_UNCORRECTED,
 	HW_EVENT_ERR_FATAL,
+	HW_EVENT_ERR_INFO,
 };
 
+static inline char *mc_event_error_type(const unsigned int err_type)
+{
+	switch (err_type) {
+	case HW_EVENT_ERR_CORRECTED:
+		return "Corrected";
+	case HW_EVENT_ERR_UNCORRECTED:
+		return "Uncorrected";
+	case HW_EVENT_ERR_FATAL:
+		return "Fatal";
+	default:
+	case HW_EVENT_ERR_INFO:
+		return "Info";
+	}
+}
+
 /**
  * enum mem_type - memory types. For a more detailed reference, please see
  *			http://en.wikipedia.org/wiki/DRAM
diff --git a/include/ras/ras_event.h b/include/ras/ras_event.h
index 260470e..21cdb0b 100644
--- a/include/ras/ras_event.h
+++ b/include/ras/ras_event.h
@@ -78,9 +78,7 @@ TRACE_EVENT(mc_event,
 
 	TP_printk("%d %s error%s:%s%s on %s (mc:%d location:%d:%d:%d address:0x%08lx grain:%d syndrome:0x%08lx%s%s)",
 		  __entry->error_count,
-		  (__entry->error_type == HW_EVENT_ERR_CORRECTED) ? "Corrected" :
-			((__entry->error_type == HW_EVENT_ERR_FATAL) ?
-			"Fatal" : "Uncorrected"),
+		  mc_event_error_type(__entry->error_type),
 		  __entry->error_count > 1 ? "s" : "",
 		  ((char *)__get_str(msg))[0] ? " " : "",
 		  __get_str(msg),
-- 
1.8.1.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH EDACv2 03/12] ghes: move structures/enum to a header file
  2013-02-21 15:38 [PATCH EDACv2 00/12] Add a driver to report Firmware first errors (via GHES) Mauro Carvalho Chehab
  2013-02-21 15:38 ` [PATCH EDACv2 01/12] edac: add support for raw error reports Mauro Carvalho Chehab
  2013-02-21 15:39 ` [PATCH EDACv2 02/12] edac: add support for error type "Info" Mauro Carvalho Chehab
@ 2013-02-21 15:39 ` Mauro Carvalho Chehab
  2013-02-21 15:39 ` [PATCH EDACv2 04/12] ghes: add the needed hooks for EDAC error report Mauro Carvalho Chehab
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 18+ messages in thread
From: Mauro Carvalho Chehab @ 2013-02-21 15:39 UTC (permalink / raw)
  Cc: Mauro Carvalho Chehab, Linux Edac Mailing List,
	Linux Kernel Mailing List, Len Brown, Huang Ying,
	Rafael J . Wysocki

As a ghes_edac driver will need to access ghes structures, in order
to properly handle the errors, move those structures to a separate
header file. No functional changes.

Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
---
 drivers/acpi/apei/ghes.c | 47 ++---------------------------------------------
 include/acpi/ghes.h      | 45 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 47 insertions(+), 45 deletions(-)
 create mode 100644 include/acpi/ghes.h

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 7ae2750..6d0e146 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -48,8 +48,8 @@
 #include <linux/genalloc.h>
 #include <linux/pci.h>
 #include <linux/aer.h>
-#include <acpi/apei.h>
-#include <acpi/hed.h>
+
+#include <acpi/ghes.h>
 #include <asm/mce.h>
 #include <asm/tlbflush.h>
 #include <asm/nmi.h>
@@ -84,42 +84,6 @@
 	((struct acpi_hest_generic_status *)				\
 	 ((struct ghes_estatus_node *)(estatus_node) + 1))
 
-/*
- * One struct ghes is created for each generic hardware error source.
- * It provides the context for APEI hardware error timer/IRQ/SCI/NMI
- * handler.
- *
- * estatus: memory buffer for error status block, allocated during
- * HEST parsing.
- */
-#define GHES_TO_CLEAR		0x0001
-#define GHES_EXITING		0x0002
-
-struct ghes {
-	struct acpi_hest_generic *generic;
-	struct acpi_hest_generic_status *estatus;
-	u64 buffer_paddr;
-	unsigned long flags;
-	union {
-		struct list_head list;
-		struct timer_list timer;
-		unsigned int irq;
-	};
-};
-
-struct ghes_estatus_node {
-	struct llist_node llnode;
-	struct acpi_hest_generic *generic;
-};
-
-struct ghes_estatus_cache {
-	u32 estatus_len;
-	atomic_t count;
-	struct acpi_hest_generic *generic;
-	unsigned long long time_in;
-	struct rcu_head rcu;
-};
-
 bool ghes_disable;
 module_param_named(disable, ghes_disable, bool, 0);
 
@@ -333,13 +297,6 @@ static void ghes_fini(struct ghes *ghes)
 	apei_unmap_generic_address(&ghes->generic->error_status_address);
 }
 
-enum {
-	GHES_SEV_NO = 0x0,
-	GHES_SEV_CORRECTED = 0x1,
-	GHES_SEV_RECOVERABLE = 0x2,
-	GHES_SEV_PANIC = 0x3,
-};
-
 static inline int ghes_severity(int severity)
 {
 	switch (severity) {
diff --git a/include/acpi/ghes.h b/include/acpi/ghes.h
new file mode 100644
index 0000000..3eb8dc4
--- /dev/null
+++ b/include/acpi/ghes.h
@@ -0,0 +1,45 @@
+#include <acpi/apei.h>
+#include <acpi/hed.h>
+
+/*
+ * One struct ghes is created for each generic hardware error source.
+ * It provides the context for APEI hardware error timer/IRQ/SCI/NMI
+ * handler.
+ *
+ * estatus: memory buffer for error status block, allocated during
+ * HEST parsing.
+ */
+#define GHES_TO_CLEAR		0x0001
+#define GHES_EXITING		0x0002
+
+struct ghes {
+	struct acpi_hest_generic *generic;
+	struct acpi_hest_generic_status *estatus;
+	u64 buffer_paddr;
+	unsigned long flags;
+	union {
+		struct list_head list;
+		struct timer_list timer;
+		unsigned int irq;
+	};
+};
+
+struct ghes_estatus_node {
+	struct llist_node llnode;
+	struct acpi_hest_generic *generic;
+};
+
+struct ghes_estatus_cache {
+	u32 estatus_len;
+	atomic_t count;
+	struct acpi_hest_generic *generic;
+	unsigned long long time_in;
+	struct rcu_head rcu;
+};
+
+enum {
+	GHES_SEV_NO = 0x0,
+	GHES_SEV_CORRECTED = 0x1,
+	GHES_SEV_RECOVERABLE = 0x2,
+	GHES_SEV_PANIC = 0x3,
+};
-- 
1.8.1.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH EDACv2 04/12] ghes: add the needed hooks for EDAC error report
  2013-02-21 15:38 [PATCH EDACv2 00/12] Add a driver to report Firmware first errors (via GHES) Mauro Carvalho Chehab
                   ` (2 preceding siblings ...)
  2013-02-21 15:39 ` [PATCH EDACv2 03/12] ghes: move structures/enum to a header file Mauro Carvalho Chehab
@ 2013-02-21 15:39 ` Mauro Carvalho Chehab
  2013-02-21 15:39 ` [PATCH EDACv2 05/12] ghes_edac: Register at EDAC core the BIOS report Mauro Carvalho Chehab
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 18+ messages in thread
From: Mauro Carvalho Chehab @ 2013-02-21 15:39 UTC (permalink / raw)
  Cc: Mauro Carvalho Chehab, Linux Edac Mailing List,
	Linux Kernel Mailing List, Len Brown, Huang Ying,
	Rafael J . Wysocki

In order to allow reporting errors via EDAC, add hooks for:

1) register an EDAC driver;
2) unregister an EDAC driver;
3) report errors via EDAC.

As the EDAC driver will need to access the ghes structure, adds it
as one of the parameters for ghes_do_proc.

Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
---
 drivers/acpi/apei/ghes.c | 24 +++++++++++++++++++-----
 include/acpi/ghes.h      | 25 +++++++++++++++++++++++++
 2 files changed, 44 insertions(+), 5 deletions(-)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 6d0e146..19092dc 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -409,7 +409,8 @@ static void ghes_clear_estatus(struct ghes *ghes)
 	ghes->flags &= ~GHES_TO_CLEAR;
 }
 
-static void ghes_do_proc(const struct acpi_hest_generic_status *estatus)
+static void ghes_do_proc(struct ghes *ghes,
+			 const struct acpi_hest_generic_status *estatus)
 {
 	int sev, sec_sev;
 	struct acpi_hest_generic_data *gdata;
@@ -421,6 +422,8 @@ static void ghes_do_proc(const struct acpi_hest_generic_status *estatus)
 				 CPER_SEC_PLATFORM_MEM)) {
 			struct cper_sec_mem_err *mem_err;
 			mem_err = (struct cper_sec_mem_err *)(gdata+1);
+			ghes_edac_report_mem_error(ghes, sev, mem_err);
+
 #ifdef CONFIG_X86_MCE
 			apei_mce_report_mem_error(sev == GHES_SEV_CORRECTED,
 						  mem_err);
@@ -639,7 +642,7 @@ static int ghes_proc(struct ghes *ghes)
 		if (ghes_print_estatus(NULL, ghes->generic, ghes->estatus))
 			ghes_estatus_cache_add(ghes->generic, ghes->estatus);
 	}
-	ghes_do_proc(ghes->estatus);
+	ghes_do_proc(ghes, ghes->estatus);
 out:
 	ghes_clear_estatus(ghes);
 	return 0;
@@ -732,7 +735,7 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
 		estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
 		len = apei_estatus_len(estatus);
 		node_len = GHES_ESTATUS_NODE_LEN(len);
-		ghes_do_proc(estatus);
+		ghes_do_proc(estatus_node->ghes, estatus);
 		if (!ghes_estatus_cached(estatus)) {
 			generic = estatus_node->generic;
 			if (ghes_print_estatus(NULL, generic, estatus))
@@ -821,6 +824,7 @@ static int ghes_notify_nmi(unsigned int cmd, struct pt_regs *regs)
 		estatus_node = (void *)gen_pool_alloc(ghes_estatus_pool,
 						      node_len);
 		if (estatus_node) {
+			estatus_node->ghes = ghes;
 			estatus_node->generic = ghes->generic;
 			estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
 			memcpy(estatus, ghes->estatus, len);
@@ -899,6 +903,11 @@ static int ghes_probe(struct platform_device *ghes_dev)
 		ghes = NULL;
 		goto err;
 	}
+
+	rc = ghes_edac_register(ghes, &ghes_dev->dev);
+	if (rc < 0)
+		goto err;
+
 	switch (generic->notify.type) {
 	case ACPI_HEST_NOTIFY_POLLED:
 		ghes->timer.function = ghes_poll_func;
@@ -911,13 +920,13 @@ static int ghes_probe(struct platform_device *ghes_dev)
 		if (acpi_gsi_to_irq(generic->notify.vector, &ghes->irq)) {
 			pr_err(GHES_PFX "Failed to map GSI to IRQ for generic hardware error source: %d\n",
 			       generic->header.source_id);
-			goto err;
+			goto err2;
 		}
 		if (request_irq(ghes->irq, ghes_irq_func,
 				0, "GHES IRQ", ghes)) {
 			pr_err(GHES_PFX "Failed to register IRQ for generic hardware error source: %d\n",
 			       generic->header.source_id);
-			goto err;
+			goto err2;
 		}
 		break;
 	case ACPI_HEST_NOTIFY_SCI:
@@ -943,6 +952,8 @@ static int ghes_probe(struct platform_device *ghes_dev)
 	platform_set_drvdata(ghes_dev, ghes);
 
 	return 0;
+err2:
+	ghes_edac_unregister(ghes);
 err:
 	if (ghes) {
 		ghes_fini(ghes);
@@ -995,6 +1006,9 @@ static int ghes_remove(struct platform_device *ghes_dev)
 	}
 
 	ghes_fini(ghes);
+
+	ghes_edac_unregister(ghes);
+
 	kfree(ghes);
 
 	platform_set_drvdata(ghes_dev, NULL);
diff --git a/include/acpi/ghes.h b/include/acpi/ghes.h
index 3eb8dc4..9015ec2 100644
--- a/include/acpi/ghes.h
+++ b/include/acpi/ghes.h
@@ -27,6 +27,7 @@ struct ghes {
 struct ghes_estatus_node {
 	struct llist_node llnode;
 	struct acpi_hest_generic *generic;
+	struct ghes *ghes;
 };
 
 struct ghes_estatus_cache {
@@ -43,3 +44,27 @@ enum {
 	GHES_SEV_RECOVERABLE = 0x2,
 	GHES_SEV_PANIC = 0x3,
 };
+
+#ifdef CONFIG_EDAC_GHES
+void ghes_edac_report_mem_error(struct ghes *ghes, int sev,
+				struct cper_sec_mem_err *mem_err);
+
+int ghes_edac_register(struct ghes *ghes, struct device *dev);
+
+void ghes_edac_unregister(struct ghes *ghes);
+
+#else
+static inline void ghes_edac_report_mem_error(struct ghes *ghes, int sev,
+				       struct cper_sec_mem_err *mem_err)
+{
+}
+
+static inline int ghes_edac_register(struct ghes *ghes, struct device *dev)
+{
+	return 0;
+}
+
+static inline void ghes_edac_unregister(struct ghes *ghes)
+{
+}
+#endif
-- 
1.8.1.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH EDACv2 05/12] ghes_edac: Register at EDAC core the BIOS report
  2013-02-21 15:38 [PATCH EDACv2 00/12] Add a driver to report Firmware first errors (via GHES) Mauro Carvalho Chehab
                   ` (3 preceding siblings ...)
  2013-02-21 15:39 ` [PATCH EDACv2 04/12] ghes: add the needed hooks for EDAC error report Mauro Carvalho Chehab
@ 2013-02-21 15:39 ` Mauro Carvalho Chehab
  2013-02-21 15:39 ` [PATCH EDACv2 06/12] ghes_edac: add support for reporting errors via EDAC Mauro Carvalho Chehab
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 18+ messages in thread
From: Mauro Carvalho Chehab @ 2013-02-21 15:39 UTC (permalink / raw)
  Cc: Mauro Carvalho Chehab, Linux Edac Mailing List,
	Linux Kernel Mailing List

Register GHES at EDAC MC core, in order to avoid other
drivers to also handle errors and mangle with error data.

The edac core will warrant that just one driver will be used,
so the first one to register (BIOS first) will be the one that
will be reporting the hardware errors.

For now, the EDAC driver does nothing but to register at the
EDAC core, preventing the hardware-driven mechanism to
interfere with GHES.

Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
---
 MAINTAINERS              |   7 +++
 drivers/edac/Kconfig     |  23 ++++++++++
 drivers/edac/Makefile    |   1 +
 drivers/edac/ghes_edac.c | 114 +++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 145 insertions(+)
 create mode 100644 drivers/edac/ghes_edac.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 35a56bc..889644d 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2803,6 +2803,13 @@ W:	bluesmoke.sourceforge.net
 S:	Maintained
 F:	drivers/edac/e7xxx_edac.c
 
+EDAC-GHES
+M:	Mauro Carvalho Chehab <mchehab@redhat.com>
+L:	linux-edac@vger.kernel.org
+W:	bluesmoke.sourceforge.net
+S:	Maintained
+F:	drivers/edac/ghes-edac.c
+
 EDAC-I82443BXGX
 M:	Tim Small <tim@buttersideup.com>
 L:	linux-edac@vger.kernel.org
diff --git a/drivers/edac/Kconfig b/drivers/edac/Kconfig
index 6671992..7e38e5e 100644
--- a/drivers/edac/Kconfig
+++ b/drivers/edac/Kconfig
@@ -80,6 +80,29 @@ config EDAC_MM_EDAC
 	  occurred so that a particular failing memory module can be
 	  replaced.  If unsure, select 'Y'.
 
+config EDAC_GHES
+	bool "Output ACPI APEI/GHES BIOS detected errors via EDAC"
+	depends on ACPI_APEI_GHES && (EDAC_MM_EDAC=y)
+	default y
+	help
+	  Not all machines support hardware-driven error report. Some of those
+	  provide a BIOS-driven error report mechanism via ACPI, using the
+	  APEI/GHES driver. By enabling this option, the error reports provided
+	  by GHES are sent to userspace via the EDAC API.
+
+	  When this option is enabled, it will disable the hardware-driven
+	  mechanisms, if a GHES BIOS is detected, entering into the
+	  "Firmware First" mode.
+
+	  It should be noticed that keeping both GHES and a hardware-driven
+	  error mechanism won't work well, as BIOS will race with OS, while
+	  reading the error registers. So, if you want to not use "Firmware
+	  first" GHES error mechanism, you should disable GHES either at
+	  compilation time or by passing "ghes.disable=1" Kernel parameter
+	  at boot time.
+
+	  In doubt, say 'Y'.
+
 config EDAC_AMD64
 	tristate "AMD64 (Opteron, Athlon64) K8, F10h"
 	depends on EDAC_MM_EDAC && AMD_NB && X86_64 && EDAC_DECODE_MCE
diff --git a/drivers/edac/Makefile b/drivers/edac/Makefile
index 5608a9b..4154ed6 100644
--- a/drivers/edac/Makefile
+++ b/drivers/edac/Makefile
@@ -16,6 +16,7 @@ ifdef CONFIG_PCI
 edac_core-y	+= edac_pci.o edac_pci_sysfs.o
 endif
 
+obj-$(CONFIG_EDAC_GHES)			+= ghes_edac.o
 obj-$(CONFIG_EDAC_MCE_INJ)		+= mce_amd_inj.o
 
 edac_mce_amd-y				:= mce_amd.o
diff --git a/drivers/edac/ghes_edac.c b/drivers/edac/ghes_edac.c
new file mode 100644
index 0000000..d8e54b4
--- /dev/null
+++ b/drivers/edac/ghes_edac.c
@@ -0,0 +1,114 @@
+/*
+ * GHES/EDAC Linux driver
+ *
+ * This file may be distributed under the terms of the GNU General Public
+ * License version 2.
+ *
+ * Copyright (c) 2013 by Mauro Carvalho Chehab <mchehab@redhat.com>
+ *
+ * Red Hat Inc. http://www.redhat.com
+ */
+
+#include <acpi/ghes.h>
+#include <linux/edac.h>
+#include "edac_core.h"
+
+#define GHES_PFX   "ghes_edac: "
+#define GHES_EDAC_REVISION " Ver: 1.0.0"
+
+struct ghes_edac_pvt {
+	struct list_head list;
+	struct ghes *ghes;
+	struct mem_ctl_info *mci;
+};
+
+static LIST_HEAD(ghes_reglist);
+static DEFINE_MUTEX(ghes_edac_lock);
+static int ghes_edac_mc_num;
+
+void ghes_edac_report_mem_error(struct ghes *ghes, int sev,
+                               struct cper_sec_mem_err *mem_err)
+{
+}
+EXPORT_SYMBOL_GPL(ghes_edac_report_mem_error);
+
+int ghes_edac_register(struct ghes *ghes, struct device *dev)
+{
+	int rc;
+	struct mem_ctl_info *mci;
+	struct edac_mc_layer layers[1];
+	struct csrow_info *csrow;
+	struct dimm_info *dimm;
+	struct ghes_edac_pvt *pvt;
+
+	layers[0].type = EDAC_MC_LAYER_ALL_MEM;
+	layers[0].size = 1;
+	layers[0].is_virt_csrow = true;
+
+	/*
+	 * We need to serialize edac_mc_alloc() and edac_mc_add_mc(),
+	 * to avoid duplicated memory controller numbers
+	 */
+	mutex_lock(&ghes_edac_lock);
+	mci = edac_mc_alloc(ghes_edac_mc_num, ARRAY_SIZE(layers), layers,
+			    sizeof(*pvt));
+	if (!mci) {
+		pr_info(GHES_PFX "Can't allocate memory for EDAC data\n");
+		mutex_unlock(&ghes_edac_lock);
+		return -ENOMEM;
+	}
+
+	pvt = mci->pvt_info;
+	memset(pvt, 0, sizeof(*pvt));
+        list_add_tail(&pvt->list, &ghes_reglist);
+	pvt->ghes = ghes;
+	pvt->mci  = mci;
+	mci->pdev = dev;
+
+	mci->mtype_cap = MEM_FLAG_EMPTY;
+	mci->edac_ctl_cap = EDAC_FLAG_NONE;
+	mci->edac_cap = EDAC_FLAG_NONE;
+	mci->mod_name = "ghes_edac.c";
+	mci->mod_ver = GHES_EDAC_REVISION;
+	mci->ctl_name = "ghes_edac";
+	mci->dev_name = "ghes";
+
+	csrow = mci->csrows[0];
+	dimm = csrow->channels[0]->dimm;
+
+	/* FIXME: FAKE DATA */
+	dimm->nr_pages = 1000;
+	dimm->grain = 128;
+	dimm->mtype = MEM_UNKNOWN;
+	dimm->dtype = DEV_UNKNOWN;
+	dimm->edac_mode = EDAC_SECDED;
+
+	rc = edac_mc_add_mc(mci);
+	if (rc < 0) {
+		pr_info(GHES_PFX "Can't register at EDAC core\n");
+		edac_mc_free(mci);
+		mutex_unlock(&ghes_edac_lock);
+		return -ENODEV;
+	}
+
+	ghes_edac_mc_num++;
+	mutex_unlock(&ghes_edac_lock);
+	return 0;
+}
+EXPORT_SYMBOL_GPL(ghes_edac_register);
+
+void ghes_edac_unregister(struct ghes *ghes)
+{
+	struct mem_ctl_info *mci;
+	struct ghes_edac_pvt *pvt;
+
+	list_for_each_entry(pvt, &ghes_reglist, list) {
+		if (ghes == pvt->ghes) {
+			mci = pvt->mci;
+			edac_mc_del_mc(mci->pdev);
+			edac_mc_free(mci);
+			list_del(&pvt->list);
+		}
+	}
+}
+EXPORT_SYMBOL_GPL(ghes_edac_unregister);
-- 
1.8.1.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH EDACv2 06/12] ghes_edac: add support for reporting errors via EDAC
  2013-02-21 15:38 [PATCH EDACv2 00/12] Add a driver to report Firmware first errors (via GHES) Mauro Carvalho Chehab
                   ` (4 preceding siblings ...)
  2013-02-21 15:39 ` [PATCH EDACv2 05/12] ghes_edac: Register at EDAC core the BIOS report Mauro Carvalho Chehab
@ 2013-02-21 15:39 ` Mauro Carvalho Chehab
  2013-02-21 15:39 ` [PATCH EDACv2 07/12] ghes_edac: do a better job of filling EDAC DIMM info Mauro Carvalho Chehab
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 18+ messages in thread
From: Mauro Carvalho Chehab @ 2013-02-21 15:39 UTC (permalink / raw)
  Cc: Mauro Carvalho Chehab, Linux Edac Mailing List,
	Linux Kernel Mailing List

Now that the EDAC core is capable of just forward the errors via
the userspace API, add a report mechanism for the GHES errors.

Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
---
 drivers/edac/ghes_edac.c | 50 ++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 48 insertions(+), 2 deletions(-)

diff --git a/drivers/edac/ghes_edac.c b/drivers/edac/ghes_edac.c
index d8e54b4..e6f1426 100644
--- a/drivers/edac/ghes_edac.c
+++ b/drivers/edac/ghes_edac.c
@@ -27,8 +27,54 @@ static DEFINE_MUTEX(ghes_edac_lock);
 static int ghes_edac_mc_num;
 
 void ghes_edac_report_mem_error(struct ghes *ghes, int sev,
-                               struct cper_sec_mem_err *mem_err)
+				struct cper_sec_mem_err *mem_err)
 {
+	enum hw_event_mc_err_type type;
+	struct mem_ctl_info *mci;
+	struct ghes_edac_pvt *pvt = NULL;
+	unsigned long page = 0, offset = 0, grain = 0;
+	char location[80];
+	char *label = "unknown";
+
+	list_for_each_entry(pvt, &ghes_reglist, list) {
+		if (ghes == pvt->ghes)
+			break;
+	}
+	if (!pvt) {
+		pr_err("Internal error: Can't find EDAC structure\n");
+		return;
+	}
+	mci = pvt->mci;
+
+	if (mem_err->validation_bits & CPER_MEM_VALID_PHYSICAL_ADDRESS) {
+		page = mem_err->physical_addr >> PAGE_SHIFT;
+		offset = mem_err->physical_addr & ~PAGE_MASK;
+		grain = ~(mem_err->physical_addr_mask & ~PAGE_MASK);
+	}
+
+	switch (sev) {
+	case GHES_SEV_CORRECTED:
+		type = HW_EVENT_ERR_CORRECTED;
+		break;
+	case GHES_SEV_RECOVERABLE:
+		type = HW_EVENT_ERR_UNCORRECTED;
+		break;
+	case GHES_SEV_PANIC:
+		type = HW_EVENT_ERR_FATAL;
+		break;
+	default:
+	case GHES_SEV_NO:
+		type = HW_EVENT_ERR_INFO;
+	}
+
+	sprintf(location, "node:%d card:%d module:%d bank:%d device:%d row: %d column:%d bit_pos:%d",
+		mem_err->node, mem_err->card, mem_err->module,
+		mem_err->bank, mem_err->device, mem_err->row, mem_err->column,
+		mem_err->bit_pos);
+
+	edac_raw_mc_handle_error(type, mci, grain, 1, 0, 0, 0,
+				 page, offset, 0,
+				 "APEI", location, label, "", 0);
 }
 EXPORT_SYMBOL_GPL(ghes_edac_report_mem_error);
 
@@ -60,7 +106,7 @@ int ghes_edac_register(struct ghes *ghes, struct device *dev)
 
 	pvt = mci->pvt_info;
 	memset(pvt, 0, sizeof(*pvt));
-        list_add_tail(&pvt->list, &ghes_reglist);
+	list_add_tail(&pvt->list, &ghes_reglist);
 	pvt->ghes = ghes;
 	pvt->mci  = mci;
 	mci->pdev = dev;
-- 
1.8.1.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH EDACv2 07/12] ghes_edac: do a better job of filling EDAC DIMM info
  2013-02-21 15:38 [PATCH EDACv2 00/12] Add a driver to report Firmware first errors (via GHES) Mauro Carvalho Chehab
                   ` (5 preceding siblings ...)
  2013-02-21 15:39 ` [PATCH EDACv2 06/12] ghes_edac: add support for reporting errors via EDAC Mauro Carvalho Chehab
@ 2013-02-21 15:39 ` Mauro Carvalho Chehab
  2013-02-21 15:39 ` [PATCH EDACv2 08/12] ghes_edac: Don't credit the same memory dimm twice Mauro Carvalho Chehab
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 18+ messages in thread
From: Mauro Carvalho Chehab @ 2013-02-21 15:39 UTC (permalink / raw)
  Cc: Mauro Carvalho Chehab, Linux Edac Mailing List,
	Linux Kernel Mailing List

Instead of just faking a random value for the DIMM data, get
the information that it is available via DMI table.

Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
---
 drivers/edac/ghes_edac.c | 192 ++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 180 insertions(+), 12 deletions(-)

diff --git a/drivers/edac/ghes_edac.c b/drivers/edac/ghes_edac.c
index e6f1426..b2826ab 100644
--- a/drivers/edac/ghes_edac.c
+++ b/drivers/edac/ghes_edac.c
@@ -11,6 +11,7 @@
 
 #include <acpi/ghes.h>
 #include <linux/edac.h>
+#include <linux/dmi.h>
 #include "edac_core.h"
 
 #define GHES_PFX   "ghes_edac: "
@@ -26,6 +27,155 @@ static LIST_HEAD(ghes_reglist);
 static DEFINE_MUTEX(ghes_edac_lock);
 static int ghes_edac_mc_num;
 
+/* Memory Device - Type 17 of SMBIOS spec */
+struct memdev_dmi_entry {
+	u8 type;
+	u8 length;
+	u16 handle;
+	u16 phys_mem_array_handle;
+	u16 mem_err_info_handle;
+	u16 total_width;
+	u16 data_width;
+	u16 size;
+	u8 form_factor;
+	u8 device_set;
+	u8 device_locator;
+	u8 bank_locator;
+	u8 memory_type;
+	u16 type_detail;
+	u16 speed;
+	u8 manufacturer;
+	u8 serial_number;
+	u8 asset_tag;
+	u8 part_number;
+	u8 attributes;
+	u32 extended_size;
+	u16 conf_mem_clk_speed;
+} __attribute__((__packed__));
+
+struct ghes_edac_dimm_fill {
+	struct mem_ctl_info *mci;
+	unsigned count;
+};
+
+char *memory_type[] = {
+	[MEM_EMPTY] = "EMPTY",
+	[MEM_RESERVED] = "RESERVED",
+	[MEM_UNKNOWN] = "UNKNOWN",
+	[MEM_FPM] = "FPM",
+	[MEM_EDO] = "EDO",
+	[MEM_BEDO] = "BEDO",
+	[MEM_SDR] = "SDR",
+	[MEM_RDR] = "RDR",
+	[MEM_DDR] = "DDR",
+	[MEM_RDDR] = "RDDR",
+	[MEM_RMBS] = "RMBS",
+	[MEM_DDR2] = "DDR2",
+	[MEM_FB_DDR2] = "FB_DDR2",
+	[MEM_RDDR2] = "RDDR2",
+	[MEM_XDR] = "XDR",
+	[MEM_DDR3] = "DDR3",
+	[MEM_RDDR3] = "RDDR3",
+};
+
+static void ghes_edac_count_dimms(const struct dmi_header *dh, void *arg)
+{
+	int *num_dimm = arg;
+
+	if (dh->type == DMI_ENTRY_MEM_DEVICE)
+		(*num_dimm)++;
+}
+
+static void ghes_edac_dmidecode(const struct dmi_header *dh, void *arg)
+{
+	struct ghes_edac_dimm_fill *dimm_fill = arg;
+	struct mem_ctl_info *mci = dimm_fill->mci;
+
+	if (dh->type == DMI_ENTRY_MEM_DEVICE) {
+		struct memdev_dmi_entry *entry = (struct memdev_dmi_entry *)dh;
+		struct dimm_info *dimm = EDAC_DIMM_PTR(mci->layers, mci->dimms,
+						       mci->n_layers,
+						       dimm_fill->count, 0, 0);
+
+		if (entry->size == 0xffff) {
+			pr_info(GHES_PFX "Can't get dimm size\n");
+			dimm->nr_pages = MiB_TO_PAGES(32);/* Unknown */
+		} else if (entry->size == 0x7fff) {
+			dimm->nr_pages = MiB_TO_PAGES(entry->extended_size);
+		} else {
+			if (entry->size & 1 << 15)
+				dimm->nr_pages = MiB_TO_PAGES((entry->size &
+							       0x7fff) << 10);
+			else
+				dimm->nr_pages = MiB_TO_PAGES(entry->size);
+		}
+
+		switch (entry->memory_type) {
+		case 0x12:
+			if (entry->type_detail & 1 << 13)
+				dimm->mtype = MEM_RDDR;
+			else
+				dimm->mtype = MEM_DDR;
+			break;
+		case 0x13:
+			if (entry->type_detail & 1 << 13)
+				dimm->mtype = MEM_RDDR2;
+			else
+				dimm->mtype = MEM_DDR2;
+			break;
+		case 0x14:
+			dimm->mtype = MEM_FB_DDR2;
+			break;
+		case 0x18:
+			if (entry->type_detail & 1 << 13)
+				dimm->mtype = MEM_RDDR3;
+			else
+				dimm->mtype = MEM_DDR3;
+			break;
+		default:
+			if (entry->type_detail & 1 << 6)
+				dimm->mtype = MEM_RMBS;
+			else if ((entry->type_detail & ((1 << 7) | (1 << 13)))
+				 == ((1 << 7) | (1 << 13)))
+				dimm->mtype = MEM_RDR;
+			else if (entry->type_detail & 1 << 7)
+				dimm->mtype = MEM_SDR;
+			else if (entry->type_detail & 1 << 9)
+				dimm->mtype = MEM_EDO;
+			else
+				dimm->mtype = MEM_UNKNOWN;
+		}
+
+		/*
+		 * Actually, we can only detect if the memory has bits for
+		 * checksum or not
+		 */
+		if (entry->total_width == entry->data_width)
+			dimm->edac_mode = EDAC_NONE;
+		else
+			dimm->edac_mode = EDAC_SECDED;
+
+		dimm->dtype = DEV_UNKNOWN;
+		dimm->grain = 128;		/* Likely, worse case */
+
+		/*
+		 * FIXME: It shouldn't be hard to also fill the DIMM labels
+		 */
+
+		if (dimm->nr_pages) {
+			pr_info(GHES_PFX "DIMM%i: %s size = %d MB%s\n",
+				dimm_fill->count, memory_type[dimm->mtype],
+				PAGES_TO_MiB(dimm->nr_pages),
+				(dimm->edac_mode != EDAC_NONE) ? "(ECC)" : "");
+			pr_info(GHES_PFX "\ttype %d, detail 0x%02x, width %d(total %d)\n",
+				entry->memory_type, entry->type_detail,
+				entry->total_width, entry->data_width);
+		}
+
+		dimm_fill->count++;
+	}
+}
+
 void ghes_edac_report_mem_error(struct ghes *ghes, int sev,
 				struct cper_sec_mem_err *mem_err)
 {
@@ -80,15 +230,24 @@ EXPORT_SYMBOL_GPL(ghes_edac_report_mem_error);
 
 int ghes_edac_register(struct ghes *ghes, struct device *dev)
 {
-	int rc;
+	bool fake = false;
+	int rc, num_dimm = 0;
 	struct mem_ctl_info *mci;
 	struct edac_mc_layer layers[1];
-	struct csrow_info *csrow;
-	struct dimm_info *dimm;
 	struct ghes_edac_pvt *pvt;
+	struct ghes_edac_dimm_fill dimm_fill;
+
+	/* Get the number of DIMMs */
+	dmi_walk(ghes_edac_count_dimms, &num_dimm);
+
+	/* Check if we've got a bogus BIOS */
+	if (num_dimm == 0) {
+		fake = true;
+		num_dimm = 1;
+	}
 
 	layers[0].type = EDAC_MC_LAYER_ALL_MEM;
-	layers[0].size = 1;
+	layers[0].size = num_dimm;
 	layers[0].is_virt_csrow = true;
 
 	/*
@@ -96,6 +255,8 @@ int ghes_edac_register(struct ghes *ghes, struct device *dev)
 	 * to avoid duplicated memory controller numbers
 	 */
 	mutex_lock(&ghes_edac_lock);
+	pr_info("ghes_edac#%d: allocating space for %d dimms\n",
+		ghes_edac_mc_num, num_dimm);
 	mci = edac_mc_alloc(ghes_edac_mc_num, ARRAY_SIZE(layers), layers,
 			    sizeof(*pvt));
 	if (!mci) {
@@ -119,15 +280,22 @@ int ghes_edac_register(struct ghes *ghes, struct device *dev)
 	mci->ctl_name = "ghes_edac";
 	mci->dev_name = "ghes";
 
-	csrow = mci->csrows[0];
-	dimm = csrow->channels[0]->dimm;
+	if (!fake) {
+		/* Fill DIMM info from DMI */
+		dimm_fill.count = 0;
+		dimm_fill.mci = mci;
+		dmi_walk(ghes_edac_dmidecode, &dimm_fill);
+	} else {
+		struct dimm_info *dimm = EDAC_DIMM_PTR(mci->layers, mci->dimms,
+						       mci->n_layers, 0, 0, 0);
 
-	/* FIXME: FAKE DATA */
-	dimm->nr_pages = 1000;
-	dimm->grain = 128;
-	dimm->mtype = MEM_UNKNOWN;
-	dimm->dtype = DEV_UNKNOWN;
-	dimm->edac_mode = EDAC_SECDED;
+		pr_info(GHES_PFX "Crappy BIOS detected. Faking DIMM EDAC data\n");
+		dimm->nr_pages = 1000;
+		dimm->grain = 128;
+		dimm->mtype = MEM_UNKNOWN;
+		dimm->dtype = DEV_UNKNOWN;
+		dimm->edac_mode = EDAC_SECDED;
+	}
 
 	rc = edac_mc_add_mc(mci);
 	if (rc < 0) {
-- 
1.8.1.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH EDACv2 08/12] ghes_edac: Don't credit the same memory dimm twice
  2013-02-21 15:38 [PATCH EDACv2 00/12] Add a driver to report Firmware first errors (via GHES) Mauro Carvalho Chehab
                   ` (6 preceding siblings ...)
  2013-02-21 15:39 ` [PATCH EDACv2 07/12] ghes_edac: do a better job of filling EDAC DIMM info Mauro Carvalho Chehab
@ 2013-02-21 15:39 ` Mauro Carvalho Chehab
  2013-02-21 15:39 ` [PATCH EDACv2 09/12] ghes_edac: Improve driver's printk messages Mauro Carvalho Chehab
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 18+ messages in thread
From: Mauro Carvalho Chehab @ 2013-02-21 15:39 UTC (permalink / raw)
  Cc: Mauro Carvalho Chehab, Linux Edac Mailing List,
	Linux Kernel Mailing List

On my tests on a 4xE5-4650 CPU's system, the GHES
EDAC driver is called twice. As the SMBIOS DMI enumeration
call will seek for the entire DIMM sockets in the system, on
this machine, equipped with 128 GB of RAM, the memory is
displayed twice:

          +-----------------------+
          |    mc0    |    mc1    |
----------+-----------------------+
memory45: |  8192 MB  |  8192 MB  |
memory44: |     0 MB  |     0 MB  |
----------+-----------------------+
memory43: |     0 MB  |     0 MB  |
memory42: |  8192 MB  |  8192 MB  |
----------+-----------------------+
memory41: |     0 MB  |     0 MB  |
memory40: |     0 MB  |     0 MB  |
----------+-----------------------+
memory39: |  8192 MB  |  8192 MB  |
memory38: |     0 MB  |     0 MB  |
----------+-----------------------+
memory37: |     0 MB  |     0 MB  |
memory36: |  8192 MB  |  8192 MB  |
----------+-----------------------+
memory35: |     0 MB  |     0 MB  |
memory34: |     0 MB  |     0 MB  |
----------+-----------------------+
memory33: |  8192 MB  |  8192 MB  |
memory32: |     0 MB  |     0 MB  |
----------+-----------------------+
memory31: |     0 MB  |     0 MB  |
memory30: |  8192 MB  |  8192 MB  |
----------+-----------------------+
memory29: |     0 MB  |     0 MB  |
memory28: |     0 MB  |     0 MB  |
----------+-----------------------+
memory27: |  8192 MB  |  8192 MB  |
memory26: |     0 MB  |     0 MB  |
----------+-----------------------+
memory25: |     0 MB  |     0 MB  |
memory24: |  8192 MB  |  8192 MB  |
----------+-----------------------+
memory23: |     0 MB  |     0 MB  |
memory22: |     0 MB  |     0 MB  |
----------+-----------------------+
memory21: |  8192 MB  |  8192 MB  |
memory20: |     0 MB  |     0 MB  |
----------+-----------------------+
memory19: |     0 MB  |     0 MB  |
memory18: |  8192 MB  |  8192 MB  |
----------+-----------------------+
memory17: |     0 MB  |     0 MB  |
memory16: |     0 MB  |     0 MB  |
----------+-----------------------+
memory15: |  8192 MB  |  8192 MB  |
memory14: |     0 MB  |     0 MB  |
----------+-----------------------+
memory13: |     0 MB  |     0 MB  |
memory12: |  8192 MB  |  8192 MB  |
----------+-----------------------+
memory11: |     0 MB  |     0 MB  |
memory10: |     0 MB  |     0 MB  |
----------+-----------------------+
memory9:  |  8192 MB  |  8192 MB  |
memory8:  |     0 MB  |     0 MB  |
----------+-----------------------+
memory7:  |     0 MB  |     0 MB  |
memory6:  |  8192 MB  |  8192 MB  |
----------+-----------------------+
memory5:  |     0 MB  |     0 MB  |
memory4:  |     0 MB  |     0 MB  |
----------+-----------------------+
memory3:  |  8192 MB  |  8192 MB  |
memory2:  |     0 MB  |     0 MB  |
----------+-----------------------+
memory1:  |     0 MB  |     0 MB  |
memory0:  |  8192 MB  |  8192 MB  |
----------+-----------------------+

Total sum of 256 GB.

As there's no reliable way to credit DIMMS to the right memory
controller, just put everything on memory controller 0 (with should
always exist).

Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
---
 drivers/edac/ghes_edac.c | 17 +++++++++++++----
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/drivers/edac/ghes_edac.c b/drivers/edac/ghes_edac.c
index b2826ab..ed57058 100644
--- a/drivers/edac/ghes_edac.c
+++ b/drivers/edac/ghes_edac.c
@@ -281,10 +281,19 @@ int ghes_edac_register(struct ghes *ghes, struct device *dev)
 	mci->dev_name = "ghes";
 
 	if (!fake) {
-		/* Fill DIMM info from DMI */
-		dimm_fill.count = 0;
-		dimm_fill.mci = mci;
-		dmi_walk(ghes_edac_dmidecode, &dimm_fill);
+		/*
+		 * Fill DIMM info from DMI for the memory controller #0
+		 *
+		 * Keep it in blank for the other memory controllers, as
+		 * there's no reliable way to properly credit each DIMM to
+		 * the memory controller, as different BIOSes fill the
+		 * DMI bank location fields on different ways
+		 */
+		if (!ghes_edac_mc_num) {
+			dimm_fill.count = 0;
+			dimm_fill.mci = mci;
+			dmi_walk(ghes_edac_dmidecode, &dimm_fill);
+		}
 	} else {
 		struct dimm_info *dimm = EDAC_DIMM_PTR(mci->layers, mci->dimms,
 						       mci->n_layers, 0, 0, 0);
-- 
1.8.1.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH EDACv2 09/12] ghes_edac: Improve driver's printk messages
  2013-02-21 15:38 [PATCH EDACv2 00/12] Add a driver to report Firmware first errors (via GHES) Mauro Carvalho Chehab
                   ` (7 preceding siblings ...)
  2013-02-21 15:39 ` [PATCH EDACv2 08/12] ghes_edac: Don't credit the same memory dimm twice Mauro Carvalho Chehab
@ 2013-02-21 15:39 ` Mauro Carvalho Chehab
  2013-02-21 15:39 ` [PATCH EDACv2 10/12] edac: put all arguments for the raw error handling call into a struct Mauro Carvalho Chehab
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 18+ messages in thread
From: Mauro Carvalho Chehab @ 2013-02-21 15:39 UTC (permalink / raw)
  Cc: Mauro Carvalho Chehab, Linux Edac Mailing List,
	Linux Kernel Mailing List

Provide a better infrastructure for printk's inside the driver:
	- use edac_dbg() for debug messages;
	- standardize the usage of pr_info();
	- provide warning about the risk of relying on this
	  driver.

While here, changes the size of a fake memory to 1 page. This is
as good or as bad as 1000 pages, but it is easier for userspace to
detect, as I don't expect that any machine implementing GHES would
provide just 1 page available ;)

Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
---
 drivers/edac/ghes_edac.c | 37 +++++++++++++++++++++++++++----------
 1 file changed, 27 insertions(+), 10 deletions(-)

diff --git a/drivers/edac/ghes_edac.c b/drivers/edac/ghes_edac.c
index ed57058..802d52a 100644
--- a/drivers/edac/ghes_edac.c
+++ b/drivers/edac/ghes_edac.c
@@ -9,12 +9,13 @@
  * Red Hat Inc. http://www.redhat.com
  */
 
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
 #include <acpi/ghes.h>
 #include <linux/edac.h>
 #include <linux/dmi.h>
 #include "edac_core.h"
 
-#define GHES_PFX   "ghes_edac: "
 #define GHES_EDAC_REVISION " Ver: 1.0.0"
 
 struct ghes_edac_pvt {
@@ -27,6 +28,7 @@ static LIST_HEAD(ghes_reglist);
 static DEFINE_MUTEX(ghes_edac_lock);
 static int ghes_edac_mc_num;
 
+
 /* Memory Device - Type 17 of SMBIOS spec */
 struct memdev_dmi_entry {
 	u8 type;
@@ -98,7 +100,8 @@ static void ghes_edac_dmidecode(const struct dmi_header *dh, void *arg)
 						       dimm_fill->count, 0, 0);
 
 		if (entry->size == 0xffff) {
-			pr_info(GHES_PFX "Can't get dimm size\n");
+			pr_info("Can't get DIMM%i size\n",
+				dimm_fill->count);
 			dimm->nr_pages = MiB_TO_PAGES(32);/* Unknown */
 		} else if (entry->size == 0x7fff) {
 			dimm->nr_pages = MiB_TO_PAGES(entry->extended_size);
@@ -163,11 +166,11 @@ static void ghes_edac_dmidecode(const struct dmi_header *dh, void *arg)
 		 */
 
 		if (dimm->nr_pages) {
-			pr_info(GHES_PFX "DIMM%i: %s size = %d MB%s\n",
+			edac_dbg(1, "DIMM%i: %s size = %d MB%s\n",
 				dimm_fill->count, memory_type[dimm->mtype],
 				PAGES_TO_MiB(dimm->nr_pages),
 				(dimm->edac_mode != EDAC_NONE) ? "(ECC)" : "");
-			pr_info(GHES_PFX "\ttype %d, detail 0x%02x, width %d(total %d)\n",
+			edac_dbg(2, "\ttype %d, detail 0x%02x, width %d(total %d)\n",
 				entry->memory_type, entry->type_detail,
 				entry->total_width, entry->data_width);
 		}
@@ -221,6 +224,7 @@ void ghes_edac_report_mem_error(struct ghes *ghes, int sev,
 		mem_err->node, mem_err->card, mem_err->module,
 		mem_err->bank, mem_err->device, mem_err->row, mem_err->column,
 		mem_err->bit_pos);
+	edac_dbg(3, "error at location %s\n", location);
 
 	edac_raw_mc_handle_error(type, mci, grain, 1, 0, 0, 0,
 				 page, offset, 0,
@@ -255,12 +259,10 @@ int ghes_edac_register(struct ghes *ghes, struct device *dev)
 	 * to avoid duplicated memory controller numbers
 	 */
 	mutex_lock(&ghes_edac_lock);
-	pr_info("ghes_edac#%d: allocating space for %d dimms\n",
-		ghes_edac_mc_num, num_dimm);
 	mci = edac_mc_alloc(ghes_edac_mc_num, ARRAY_SIZE(layers), layers,
 			    sizeof(*pvt));
 	if (!mci) {
-		pr_info(GHES_PFX "Can't allocate memory for EDAC data\n");
+		pr_info("Can't allocate memory for EDAC data\n");
 		mutex_unlock(&ghes_edac_lock);
 		return -ENOMEM;
 	}
@@ -280,6 +282,22 @@ int ghes_edac_register(struct ghes *ghes, struct device *dev)
 	mci->ctl_name = "ghes_edac";
 	mci->dev_name = "ghes";
 
+	if (!ghes_edac_mc_num) {
+		if (!fake) {
+			pr_info("This EDAC driver relies on BIOS to enumerate memory and get error reports.\n");
+			pr_info("Unfortunately, not all BIOSes reflect the memory layout correctly.\n");
+			pr_info("So, the end result of using this driver varies from vendor to vendor.\n");
+			pr_info("If you find incorrect reports, please contact your hardware vendor\n");
+			pr_info("to correct its BIOS.\n");
+			pr_info("This system has %d DIMM sockets.\n",
+				num_dimm);
+		} else {
+			pr_info("This system has a very crappy BIOS: It doesn't even list the DIMMS.\n");
+			pr_info("Its SMBIOS info is wrong. It is doubtful that the error report would\n");
+			pr_info("work on such system. Use this driver with caution\n");
+		}
+	}
+
 	if (!fake) {
 		/*
 		 * Fill DIMM info from DMI for the memory controller #0
@@ -298,8 +316,7 @@ int ghes_edac_register(struct ghes *ghes, struct device *dev)
 		struct dimm_info *dimm = EDAC_DIMM_PTR(mci->layers, mci->dimms,
 						       mci->n_layers, 0, 0, 0);
 
-		pr_info(GHES_PFX "Crappy BIOS detected. Faking DIMM EDAC data\n");
-		dimm->nr_pages = 1000;
+		dimm->nr_pages = 1;
 		dimm->grain = 128;
 		dimm->mtype = MEM_UNKNOWN;
 		dimm->dtype = DEV_UNKNOWN;
@@ -308,7 +325,7 @@ int ghes_edac_register(struct ghes *ghes, struct device *dev)
 
 	rc = edac_mc_add_mc(mci);
 	if (rc < 0) {
-		pr_info(GHES_PFX "Can't register at EDAC core\n");
+		pr_info("Can't register at EDAC core\n");
 		edac_mc_free(mci);
 		mutex_unlock(&ghes_edac_lock);
 		return -ENODEV;
-- 
1.8.1.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH EDACv2 10/12] edac: put all arguments for the raw error handling call into a struct
  2013-02-21 15:38 [PATCH EDACv2 00/12] Add a driver to report Firmware first errors (via GHES) Mauro Carvalho Chehab
                   ` (8 preceding siblings ...)
  2013-02-21 15:39 ` [PATCH EDACv2 09/12] ghes_edac: Improve driver's printk messages Mauro Carvalho Chehab
@ 2013-02-21 15:39 ` Mauro Carvalho Chehab
  2013-02-21 15:39 ` [PATCH EDACv2 11/12] ghes_edac: Make it compliant with UEFI spec 2.3.1 Mauro Carvalho Chehab
  2013-02-21 15:39 ` [PATCH EDACv2 12/12] ghes_edac: Fix RAS tracing Mauro Carvalho Chehab
  11 siblings, 0 replies; 18+ messages in thread
From: Mauro Carvalho Chehab @ 2013-02-21 15:39 UTC (permalink / raw)
  Cc: Mauro Carvalho Chehab, Linux Edac Mailing List,
	Linux Kernel Mailing List

The number of arguments for edac_raw_mc_handle_error() is too big;
put them into a structure and allocate space for it inside
edac_mc_alloc().

That reduces a lot the stack usage and simplifies the raw API call.

Tested with sb_edac driver and MCE error injection. Worked as expected:

[  143.066100] EDAC MC0: 1 CE memory read error on CPU_SrcID#0_Channel#0_DIMM#0 (channel:0 slot:0 page:0x320 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0001:0090 socket:0 channel_mask:1 rank:0)
[  143.086424] EDAC MC0: 1 CE memory read error on CPU_SrcID#0_Channel#0_DIMM#0 (channel:0 slot:0 page:0x320 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0001:0090 socket:0 channel_mask:1 rank:0)
[  143.106570] EDAC MC0: 1 CE memory read error on CPU_SrcID#0_Channel#0_DIMM#0 (channel:0 slot:0 page:0x320 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0001:0090 socket:0 channel_mask:1 rank:0)
[  143.126712] EDAC MC0: 1 CE memory read error on CPU_SrcID#0_Channel#0_DIMM#0 (channel:0 slot:0 page:0x320 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0001:0090 socket:0 channel_mask:1 rank:0)

Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
---
 drivers/edac/edac_core.h |  16 +------
 drivers/edac/edac_mc.c   | 120 +++++++++++++++++++----------------------------
 drivers/edac/ghes_edac.c |  27 ++++++-----
 include/linux/edac.h     |  56 ++++++++++++++++++++++
 4 files changed, 123 insertions(+), 96 deletions(-)

diff --git a/drivers/edac/edac_core.h b/drivers/edac/edac_core.h
index 9c5da11..3c2625e 100644
--- a/drivers/edac/edac_core.h
+++ b/drivers/edac/edac_core.h
@@ -455,20 +455,8 @@ extern int edac_mc_find_csrow_by_page(struct mem_ctl_info *mci,
 				      unsigned long page);
 
 void edac_raw_mc_handle_error(const enum hw_event_mc_err_type type,
-			  struct mem_ctl_info *mci,
-			  long grain,
-			  const u16 error_count,
-			  const int top_layer,
-			  const int mid_layer,
-			  const int low_layer,
-			  const unsigned long page_frame_number,
-			  const unsigned long offset_in_page,
-			  const unsigned long syndrome,
-			  const char *msg,
-			  const char *location,
-			  const char *label,
-			  const char *other_detail,
-			  const bool enable_per_layer_report);
+			      struct mem_ctl_info *mci,
+			      struct edac_raw_error_desc *e);
 
 void edac_mc_handle_error(const enum hw_event_mc_err_type type,
 			  struct mem_ctl_info *mci,
diff --git a/drivers/edac/edac_mc.c b/drivers/edac/edac_mc.c
index 94a5f26..501061d 100644
--- a/drivers/edac/edac_mc.c
+++ b/drivers/edac/edac_mc.c
@@ -1065,78 +1065,49 @@ static void edac_ue_error(struct mem_ctl_info *mci,
 	edac_inc_ue_error(mci, enable_per_layer_report, pos, error_count);
 }
 
-#define OTHER_LABEL " or "
-
 /**
  * edac_raw_mc_handle_error - reports a memory event to userspace without doing
  *			      anything to discover the error location
  *
  * @type:		severity of the error (CE/UE/Fatal)
  * @mci:		a struct mem_ctl_info pointer
- * @grain:		error granularity
- * @error_count:	Number of errors of the same type
- * @top_layer:		Memory layer[0] position
- * @mid_layer:		Memory layer[1] position
- * @low_layer:		Memory layer[2] position
- * @page_frame_number:	mem page where the error occurred
- * @offset_in_page:	offset of the error inside the page
- * @syndrome:		ECC syndrome
- * @msg:		Message meaningful to the end users that
- *			explains the event\
- * @location:		location of the error, like "csrow:0 channel:1"
- * @label:		DIMM labels for the affected memory(ies)
- * @other_detail:	Technical details about the event that
- *			may help hardware manufacturers and
- *			EDAC developers to analyse the event
- * @enable_per_layer_report: should it increment per-layer error counts?
+ * @e:			error description
  *
  * This raw function is used internally by edac_mc_handle_error(). It should
  * only be called directly when the hardware error come directly from BIOS,
  * like in the case of APEI GHES driver.
  */
 void edac_raw_mc_handle_error(const enum hw_event_mc_err_type type,
-			  struct mem_ctl_info *mci,
-			  long grain,
-			  const u16 error_count,
-			  const int top_layer,
-			  const int mid_layer,
-			  const int low_layer,
-			  const unsigned long page_frame_number,
-			  const unsigned long offset_in_page,
-			  const unsigned long syndrome,
-			  const char *msg,
-			  const char *location,
-			  const char *label,
-			  const char *other_detail,
-			  const bool enable_per_layer_report)
+			      struct mem_ctl_info *mci,
+			      struct edac_raw_error_desc *e)
 {
 	char detail[80];
 	u8 grain_bits;
-	int pos[EDAC_MAX_LAYERS] = { top_layer, mid_layer, low_layer };
+	int pos[EDAC_MAX_LAYERS] = { e->top_layer, e->mid_layer, e->low_layer };
 
 	/* Report the error via the trace interface */
-	grain_bits = fls_long(grain) + 1;
-	trace_mc_event(type, msg, label, error_count,
-		       mci->mc_idx, top_layer, mid_layer, low_layer,
-		       PAGES_TO_MiB(page_frame_number) | offset_in_page,
-		       grain_bits, syndrome, other_detail);
+	grain_bits = fls_long(e->grain) + 1;
+	trace_mc_event(type, e->msg, e->label, e->error_count,
+		       mci->mc_idx, e->top_layer, e->mid_layer, e->low_layer,
+		       PAGES_TO_MiB(e->page_frame_number) | e->offset_in_page,
+		       grain_bits, e->syndrome, e->other_detail);
 
 	/* Memory type dependent details about the error */
 	if (type == HW_EVENT_ERR_CORRECTED) {
 		snprintf(detail, sizeof(detail),
 			"page:0x%lx offset:0x%lx grain:%ld syndrome:0x%lx",
-			page_frame_number, offset_in_page,
-			grain, syndrome);
-		edac_ce_error(mci, error_count, pos, msg, location, label,
-			      detail, other_detail, enable_per_layer_report,
-			      page_frame_number, offset_in_page, grain);
+			e->page_frame_number, e->offset_in_page,
+			e->grain, e->syndrome);
+		edac_ce_error(mci, e->error_count, pos, e->msg, e->location, e->label,
+			      detail, e->other_detail, e->enable_per_layer_report,
+			      e->page_frame_number, e->offset_in_page, e->grain);
 	} else {
 		snprintf(detail, sizeof(detail),
 			"page:0x%lx offset:0x%lx grain:%ld",
-			page_frame_number, offset_in_page, grain);
+			e->page_frame_number, e->offset_in_page, e->grain);
 
-		edac_ue_error(mci, error_count, pos, msg, location, label,
-			      detail, other_detail, enable_per_layer_report);
+		edac_ue_error(mci, e->error_count, pos, e->msg, e->location, e->label,
+			      detail, e->other_detail, e->enable_per_layer_report);
 	}
 
 
@@ -1173,18 +1144,26 @@ void edac_mc_handle_error(const enum hw_event_mc_err_type type,
 			  const char *msg,
 			  const char *other_detail)
 {
-	/* FIXME: too much for stack: move it to some pre-alocated area */
-	char location[80];
-	char label[(EDAC_MC_LABEL_LEN + 1 + sizeof(OTHER_LABEL)) * mci->tot_dimms];
 	char *p;
 	int row = -1, chan = -1;
 	int pos[EDAC_MAX_LAYERS] = { top_layer, mid_layer, low_layer };
-	int i;
-	long grain;
-	bool enable_per_layer_report = false;
+	int i, n_labels = 0;
+	struct edac_raw_error_desc *e = &mci->error_desc;
 
 	edac_dbg(3, "MC%d\n", mci->mc_idx);
 
+	/* Fills the error report buffer */
+	memset(e, 0, sizeof (*e));
+	e->error_count = error_count;
+	e->top_layer = top_layer;
+	e->mid_layer = mid_layer;
+	e->low_layer = low_layer;
+	e->page_frame_number = page_frame_number;
+	e->offset_in_page = offset_in_page;
+	e->syndrome = syndrome;
+	e->msg = msg;
+	e->other_detail = other_detail;
+
 	/*
 	 * Check if the event report is consistent and if the memory
 	 * location is known. If it is known, enable_per_layer_report will be
@@ -1207,7 +1186,7 @@ void edac_mc_handle_error(const enum hw_event_mc_err_type type,
 			pos[i] = -1;
 		}
 		if (pos[i] >= 0)
-			enable_per_layer_report = true;
+			e->enable_per_layer_report = true;
 	}
 
 	/*
@@ -1221,8 +1200,7 @@ void edac_mc_handle_error(const enum hw_event_mc_err_type type,
 	 * where each memory belongs to a separate channel within the same
 	 * branch.
 	 */
-	grain = 0;
-	p = label;
+	p = e->label;
 	*p = '\0';
 
 	for (i = 0; i < mci->tot_dimms; i++) {
@@ -1236,8 +1214,8 @@ void edac_mc_handle_error(const enum hw_event_mc_err_type type,
 			continue;
 
 		/* get the max grain, over the error match range */
-		if (dimm->grain > grain)
-			grain = dimm->grain;
+		if (dimm->grain > e->grain)
+			e->grain = dimm->grain;
 
 		/*
 		 * If the error is memory-controller wide, there's no need to
@@ -1245,8 +1223,13 @@ void edac_mc_handle_error(const enum hw_event_mc_err_type type,
 		 * channel/memory controller/...  may be affected.
 		 * Also, don't show errors for empty DIMM slots.
 		 */
-		if (enable_per_layer_report && dimm->nr_pages) {
-			if (p != label) {
+		if (e->enable_per_layer_report && dimm->nr_pages) {
+			if (n_labels >= EDAC_MAX_LABELS) {
+				e->enable_per_layer_report = false;
+				break;
+			}
+			n_labels++;
+			if (p != e->label) {
 				strcpy(p, OTHER_LABEL);
 				p += strlen(OTHER_LABEL);
 			}
@@ -1273,12 +1256,12 @@ void edac_mc_handle_error(const enum hw_event_mc_err_type type,
 		}
 	}
 
-	if (!enable_per_layer_report) {
-		strcpy(label, "any memory");
+	if (!e->enable_per_layer_report) {
+		strcpy(e->label, "any memory");
 	} else {
 		edac_dbg(4, "csrow/channel to increment: (%d,%d)\n", row, chan);
-		if (p == label)
-			strcpy(label, "unknown memory");
+		if (p == e->label)
+			strcpy(e->label, "unknown memory");
 		if (type == HW_EVENT_ERR_CORRECTED) {
 			if (row >= 0) {
 				mci->csrows[row]->ce_count += error_count;
@@ -1291,7 +1274,7 @@ void edac_mc_handle_error(const enum hw_event_mc_err_type type,
 	}
 
 	/* Fill the RAM location data */
-	p = location;
+	p = e->location;
 
 	for (i = 0; i < mci->n_layers; i++) {
 		if (pos[i] < 0)
@@ -1301,14 +1284,9 @@ void edac_mc_handle_error(const enum hw_event_mc_err_type type,
 			     edac_layer_name[mci->layers[i].type],
 			     pos[i]);
 	}
-	if (p > location)
+	if (p > e->location)
 		*(p - 1) = '\0';
 
-	edac_raw_mc_handle_error(type, mci, grain, error_count,
-				 top_layer, mid_layer, low_layer,
-				 page_frame_number, offset_in_page,
-				 syndrome,
-				 msg, location, label, other_detail,
-				 enable_per_layer_report);
+	edac_raw_mc_handle_error(type, mci, e);
 }
 EXPORT_SYMBOL_GPL(edac_mc_handle_error);
diff --git a/drivers/edac/ghes_edac.c b/drivers/edac/ghes_edac.c
index 802d52a..b4acc4f 100644
--- a/drivers/edac/ghes_edac.c
+++ b/drivers/edac/ghes_edac.c
@@ -183,11 +183,9 @@ void ghes_edac_report_mem_error(struct ghes *ghes, int sev,
 				struct cper_sec_mem_err *mem_err)
 {
 	enum hw_event_mc_err_type type;
+	struct edac_raw_error_desc *e;
 	struct mem_ctl_info *mci;
 	struct ghes_edac_pvt *pvt = NULL;
-	unsigned long page = 0, offset = 0, grain = 0;
-	char location[80];
-	char *label = "unknown";
 
 	list_for_each_entry(pvt, &ghes_reglist, list) {
 		if (ghes == pvt->ghes)
@@ -198,11 +196,19 @@ void ghes_edac_report_mem_error(struct ghes *ghes, int sev,
 		return;
 	}
 	mci = pvt->mci;
+	e = &mci->error_desc;
+
+	/* Cleans the error report buffer */
+	memset(e, 0, sizeof (*e));
+	e->error_count = 1;
+	e->msg = "APEI";
+	strcpy(e->label, "unknown");
+	e->other_detail = "";
 
 	if (mem_err->validation_bits & CPER_MEM_VALID_PHYSICAL_ADDRESS) {
-		page = mem_err->physical_addr >> PAGE_SHIFT;
-		offset = mem_err->physical_addr & ~PAGE_MASK;
-		grain = ~(mem_err->physical_addr_mask & ~PAGE_MASK);
+		e->page_frame_number = mem_err->physical_addr >> PAGE_SHIFT;
+		e->offset_in_page = mem_err->physical_addr & ~PAGE_MASK;
+		e->grain = ~(mem_err->physical_addr_mask & ~PAGE_MASK);
 	}
 
 	switch (sev) {
@@ -220,15 +226,14 @@ void ghes_edac_report_mem_error(struct ghes *ghes, int sev,
 		type = HW_EVENT_ERR_INFO;
 	}
 
-	sprintf(location, "node:%d card:%d module:%d bank:%d device:%d row: %d column:%d bit_pos:%d",
+	sprintf(e->location,
+		"node:%d card:%d module:%d bank:%d device:%d row: %d column:%d bit_pos:%d",
 		mem_err->node, mem_err->card, mem_err->module,
 		mem_err->bank, mem_err->device, mem_err->row, mem_err->column,
 		mem_err->bit_pos);
-	edac_dbg(3, "error at location %s\n", location);
+	edac_dbg(3, "error at location %s\n", e->location);
 
-	edac_raw_mc_handle_error(type, mci, grain, 1, 0, 0, 0,
-				 page, offset, 0,
-				 "APEI", location, label, "", 0);
+	edac_raw_mc_handle_error(type, mci, e);
 }
 EXPORT_SYMBOL_GPL(ghes_edac_report_mem_error);
 
diff --git a/include/linux/edac.h b/include/linux/edac.h
index 2049b96..4fd4999 100644
--- a/include/linux/edac.h
+++ b/include/linux/edac.h
@@ -47,8 +47,18 @@ static inline void opstate_init(void)
 	return;
 }
 
+/* Max length of a DIMM label*/
 #define EDAC_MC_LABEL_LEN	31
 
+/* Maximum size of the location string */
+#define LOCATION_SIZE 80
+
+/* Defines the maximum number of labels that can be reported */
+#define EDAC_MAX_LABELS		8
+
+/* String used to join two or more labels */
+#define OTHER_LABEL " or "
+
 /**
  * enum dev_type - describe the type of memory DRAM chips used at the stick
  * @DEV_UNKNOWN:	Can't be determined, or MC doesn't support detect it
@@ -569,6 +579,46 @@ struct errcount_attribute_data {
 	int layer0, layer1, layer2;
 };
 
+/**
+ * edac_raw_error_desc - Raw error report structure
+ * @grain:			minimum granularity for an error report, in bytes
+ * @error_count:		number of errors of the same type
+ * @top_layer:			top layer of the error (layer[0])
+ * @mid_layer:			middle layer of the error (layer[1])
+ * @low_layer:			low layer of the error (layer[2])
+ * @page_frame_number:		page where the error happened
+ * @offset_in_page:		page offset
+ * @syndrome:			syndrome of the error (or 0 if unknown or if
+ * 				the syndrome is not applicable)
+ * @msg:			error message
+ * @location:			location of the error
+ * @label:			label of the affected DIMM(s)
+ * @other_detail:		other driver-specific detail about the error
+ * @enable_per_layer_report:	if false, the error affects all layers
+ *				(typically, a memory controller error)
+ */
+struct edac_raw_error_desc {
+	/*
+	 * NOTE: everything before grain won't be cleaned by
+	 * edac_raw_error_desc_clean()
+	 */
+	char location[LOCATION_SIZE];
+	char label[(EDAC_MC_LABEL_LEN + 1 + sizeof(OTHER_LABEL)) * EDAC_MAX_LABELS];
+	long grain;
+
+	/* the vars below and grain will be cleaned on every new error report */
+	u16 error_count;
+	int top_layer;
+	int mid_layer;
+	int low_layer;
+	unsigned long page_frame_number;
+	unsigned long offset_in_page;
+	unsigned long syndrome;
+	const char *msg;
+	const char *other_detail;
+	bool enable_per_layer_report;
+};
+
 /* MEMORY controller information structure
  */
 struct mem_ctl_info {
@@ -676,6 +726,12 @@ struct mem_ctl_info {
 	/* work struct for this MC */
 	struct delayed_work work;
 
+	/*
+	 * Used to report an error - by being at the global struct
+	 * makes the memory allocated by the EDAC core
+	 */
+	struct edac_raw_error_desc error_desc;
+
 	/* the internal state of this controller instance */
 	int op_state;
 
-- 
1.8.1.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH EDACv2 11/12] ghes_edac: Make it compliant with UEFI spec 2.3.1
  2013-02-21 15:38 [PATCH EDACv2 00/12] Add a driver to report Firmware first errors (via GHES) Mauro Carvalho Chehab
                   ` (9 preceding siblings ...)
  2013-02-21 15:39 ` [PATCH EDACv2 10/12] edac: put all arguments for the raw error handling call into a struct Mauro Carvalho Chehab
@ 2013-02-21 15:39 ` Mauro Carvalho Chehab
  2013-02-21 15:39 ` [PATCH EDACv2 12/12] ghes_edac: Fix RAS tracing Mauro Carvalho Chehab
  11 siblings, 0 replies; 18+ messages in thread
From: Mauro Carvalho Chehab @ 2013-02-21 15:39 UTC (permalink / raw)
  Cc: Mauro Carvalho Chehab, Linux Edac Mailing List,
	Linux Kernel Mailing List

The UEFI spec defines the memory error types ans the bits that
validate each field on the memory error record, at
Appendix N om items N.2.5 (Memory Error Section) and
N.2.11 (Error Status). Make the error description compliant with
it, only showing the valid fields.

The EDAC error log is now properly reporting the error:

[  281.556854] mce: [Hardware Error]: Machine check events logged
[  281.557042] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
[  281.557044] {2}[Hardware Error]: APEI generic hardware error status
[  281.557046] {2}[Hardware Error]: severity: 2, corrected
[  281.557048] {2}[Hardware Error]: section: 0, severity: 2, corrected
[  281.557050] {2}[Hardware Error]: flags: 0x01
[  281.557052] {2}[Hardware Error]: primary
[  281.557053] {2}[Hardware Error]: section_type: memory error
[  281.557055] {2}[Hardware Error]: error_status: 0x0000000000000400
[  281.557056] {2}[Hardware Error]: node: 3
[  281.557057] {2}[Hardware Error]: card: 0
[  281.557058] {2}[Hardware Error]: module: 1
[  281.557059] {2}[Hardware Error]: device: 0
[  281.557061] {2}[Hardware Error]: error_type: 18, unknown
[  281.557067] EDAC DEBUG: ghes_edac_report_mem_error: error validation_bits: 0x000040b9
[  281.557084] EDAC MC0: 1 CE reserved error (18) on unknown label (node:3 card:0 module:1 page:0x0 offset:0x0 grain:0 syndrome:0x0 - status(0x0000000000000400): Storage error in DRAM memory)

Tested on a 4 CPUs E5-4650 Sandy Bridge machine.

Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
---
 drivers/edac/ghes_edac.c | 195 +++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 180 insertions(+), 15 deletions(-)

diff --git a/drivers/edac/ghes_edac.c b/drivers/edac/ghes_edac.c
index b4acc4f..1bde451 100644
--- a/drivers/edac/ghes_edac.c
+++ b/drivers/edac/ghes_edac.c
@@ -22,6 +22,10 @@ struct ghes_edac_pvt {
 	struct list_head list;
 	struct ghes *ghes;
 	struct mem_ctl_info *mci;
+
+	/* Buffers for the error handling routine */
+	char other_detail[160];
+	char msg[80];
 };
 
 static LIST_HEAD(ghes_reglist);
@@ -186,6 +190,7 @@ void ghes_edac_report_mem_error(struct ghes *ghes, int sev,
 	struct edac_raw_error_desc *e;
 	struct mem_ctl_info *mci;
 	struct ghes_edac_pvt *pvt = NULL;
+	char *p;
 
 	list_for_each_entry(pvt, &ghes_reglist, list) {
 		if (ghes == pvt->ghes)
@@ -201,15 +206,14 @@ void ghes_edac_report_mem_error(struct ghes *ghes, int sev,
 	/* Cleans the error report buffer */
 	memset(e, 0, sizeof (*e));
 	e->error_count = 1;
-	e->msg = "APEI";
-	strcpy(e->label, "unknown");
-	e->other_detail = "";
-
-	if (mem_err->validation_bits & CPER_MEM_VALID_PHYSICAL_ADDRESS) {
-		e->page_frame_number = mem_err->physical_addr >> PAGE_SHIFT;
-		e->offset_in_page = mem_err->physical_addr & ~PAGE_MASK;
-		e->grain = ~(mem_err->physical_addr_mask & ~PAGE_MASK);
-	}
+	strcpy(e->label, "unknown label");
+	e->msg = pvt->msg;
+	e->other_detail = pvt->other_detail;
+	e->top_layer = -1;
+	e->mid_layer = -1;
+	e->low_layer = -1;
+	*pvt->other_detail = '\0';
+	*pvt->msg = '\0';
 
 	switch (sev) {
 	case GHES_SEV_CORRECTED:
@@ -226,12 +230,173 @@ void ghes_edac_report_mem_error(struct ghes *ghes, int sev,
 		type = HW_EVENT_ERR_INFO;
 	}
 
-	sprintf(e->location,
-		"node:%d card:%d module:%d bank:%d device:%d row: %d column:%d bit_pos:%d",
-		mem_err->node, mem_err->card, mem_err->module,
-		mem_err->bank, mem_err->device, mem_err->row, mem_err->column,
-		mem_err->bit_pos);
-	edac_dbg(3, "error at location %s\n", e->location);
+	edac_dbg(1, "error validation_bits: 0x%08llx\n",
+		 (long long)mem_err->validation_bits);
+
+	/* Error type, mapped on e->msg */
+	if (mem_err->validation_bits & CPER_MEM_VALID_ERROR_TYPE) {
+		p = pvt->msg;
+		switch (mem_err->error_type) {
+		case 0:
+			p += sprintf(p, "Unknown");
+			break;
+		case 1:
+			p += sprintf(p, "No error");
+			break;
+		case 2:
+			p += sprintf(p, "Single-bit ECC");
+			break;
+		case 3:
+			p += sprintf(p, "Multi-bit ECC");
+			break;
+		case 4:
+			p += sprintf(p, "Single-symbol ChipKill ECC");
+			break;
+		case 5:
+			p += sprintf(p, "Multi-symbol ChipKill ECC");
+			break;
+		case 6:
+			p += sprintf(p, "Master abort");
+			break;
+		case 7:
+			p += sprintf(p, "Target abort");
+			break;
+		case 8:
+			p += sprintf(p, "Parity Error");
+			break;
+		case 9:
+			p += sprintf(p, "Watchdog timeout");
+			break;
+		case 10:
+			p += sprintf(p, "Invalid address");
+			break;
+		case 11:
+			p += sprintf(p, "Mirror Broken");
+			break;
+		case 12:
+			p += sprintf(p, "Memory Sparing");
+			break;
+		case 13:
+			p += sprintf(p, "Scrub corrected error");
+			break;
+		case 14:
+			p += sprintf(p, "Scrub uncorrected error");
+			break;
+		case 15:
+			p += sprintf(p, "Physical Memory Map-out event");
+			break;
+		default:
+			p += sprintf(p, "reserved error (%d)",
+				     mem_err->error_type);
+		}
+	} else {
+		strcpy(pvt->msg, "unknown error");
+	}
+
+	/* Error address */
+	if (mem_err->validation_bits & CPER_MEM_VALID_PHYSICAL_ADDRESS) {
+		e->page_frame_number = mem_err->physical_addr >> PAGE_SHIFT;
+		e->offset_in_page = mem_err->physical_addr & ~PAGE_MASK;
+	}
+
+	/* Error grain */
+	if (mem_err->validation_bits & CPER_MEM_VALID_PHYSICAL_ADDRESS_MASK) {
+		e->grain = ~(mem_err->physical_addr_mask & ~PAGE_MASK);
+	}
+
+	/* Memory error location, mapped on e->location */
+	p = e->location;
+	if (mem_err->validation_bits & CPER_MEM_VALID_NODE)
+		p += sprintf(p, "node:%d ", mem_err->node);
+	if (mem_err->validation_bits & CPER_MEM_VALID_CARD)
+		p += sprintf(p, "card:%d ", mem_err->card);
+	if (mem_err->validation_bits & CPER_MEM_VALID_MODULE)
+		p += sprintf(p, "module:%d ", mem_err->module);
+	if (mem_err->validation_bits & CPER_MEM_VALID_BANK)
+		p += sprintf(p, "bank:%d ", mem_err->bank);
+	if (mem_err->validation_bits & CPER_MEM_VALID_ROW)
+		p += sprintf(p, "row:%d ", mem_err->row);
+	if (mem_err->validation_bits & CPER_MEM_VALID_COLUMN)
+		p += sprintf(p, "col:%d ", mem_err->column);
+	if (mem_err->validation_bits & CPER_MEM_VALID_BIT_POSITION)
+		p += sprintf(p, "bit_pos:%d ", mem_err->bit_pos);
+	if (p > e->location)
+		*(p - 1) = '\0';
+
+	/* All other fields are mapped on e->other_detail */
+	p = pvt->other_detail;
+	if (mem_err->validation_bits & CPER_MEM_VALID_ERROR_STATUS) {
+		u64 status = mem_err->error_status;
+
+		p += sprintf(p, "status(0x%016llx): ", (long long)status);
+		switch ((status >> 8) & 0xff) {
+		case 1:
+			p += sprintf(p, "Error detected internal to the component ");
+			break;
+		case 16:
+			p += sprintf(p, "Error detected in the bus ");
+			break;
+		case 4:
+			p += sprintf(p, "Storage error in DRAM memory ");
+			break;
+		case 5:
+			p += sprintf(p, "Storage error in TLB ");
+			break;
+		case 6:
+			p += sprintf(p, "Storage error in cache ");
+			break;
+		case 7:
+			p += sprintf(p, "Error in one or more functional units ");
+			break;
+		case 8:
+			p += sprintf(p, "component failed self test ");
+			break;
+		case 9:
+			p += sprintf(p, "Overflow or undervalue of internal queue ");
+			break;
+		case 17:
+			p += sprintf(p, "Virtual address not found on IO-TLB or IO-PDIR ");
+			break;
+		case 18:
+			p += sprintf(p, "Improper access error ");
+			break;
+		case 19:
+			p += sprintf(p, "Access to a memory address which is not mapped to any component ");
+			break;
+		case 20:
+			p += sprintf(p, "Loss of Lockstep ");
+			break;
+		case 21:
+			p += sprintf(p, "Response not associated with a request ");
+			break;
+		case 22:
+			p += sprintf(p, "Bus parity error - must also set the A, C, or D Bits ");
+			break;
+		case 23:
+			p += sprintf(p, "Detection of a PATH_ERROR ");
+			break;
+		case 25:
+			p += sprintf(p, "Bus operation timeout ");
+			break;
+		case 26:
+			p += sprintf(p, "A read was issued to data that has been poisoned ");
+			break;
+		default:
+			p += sprintf(p, "reserved ");
+			break;
+		}
+	}
+	if (mem_err->validation_bits & CPER_MEM_VALID_REQUESTOR_ID)
+		p += sprintf(p, "requestorID: 0x%016llx ",
+			     (long long)mem_err->requestor_id);
+	if (mem_err->validation_bits & CPER_MEM_VALID_RESPONDER_ID)
+		p += sprintf(p, "responderID: 0x%016llx ",
+			     (long long)mem_err->responder_id);
+	if (mem_err->validation_bits & CPER_MEM_VALID_TARGET_ID)
+		p += sprintf(p, "targetID: 0x%016llx ",
+			     (long long)mem_err->responder_id);
+	if (p > pvt->other_detail)
+		*(p - 1) = '\0';
 
 	edac_raw_mc_handle_error(type, mci, e);
 }
-- 
1.8.1.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH EDACv2 12/12] ghes_edac: Fix RAS tracing
  2013-02-21 15:38 [PATCH EDACv2 00/12] Add a driver to report Firmware first errors (via GHES) Mauro Carvalho Chehab
                   ` (10 preceding siblings ...)
  2013-02-21 15:39 ` [PATCH EDACv2 11/12] ghes_edac: Make it compliant with UEFI spec 2.3.1 Mauro Carvalho Chehab
@ 2013-02-21 15:39 ` Mauro Carvalho Chehab
  11 siblings, 0 replies; 18+ messages in thread
From: Mauro Carvalho Chehab @ 2013-02-21 15:39 UTC (permalink / raw)
  Cc: Mauro Carvalho Chehab, Linux Edac Mailing List,
	Linux Kernel Mailing List

With the current version of CPER, there's no way to associate an
error with the memory error. So, the error location in EDAC
layers is unused.

As CPER has its own idea about memory architectural layers, just
output whatever is there inside the driver's detail at the RAS
tracepoint.

The EDAC location keeps untouched, in the case that, in some future,
we could actually map the error into the dimm labels.

Now, the error message:

[   72.396625] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
[   72.396627] {1}[Hardware Error]: APEI generic hardware error status
[   72.396628] {1}[Hardware Error]: severity: 2, corrected
[   72.396630] {1}[Hardware Error]: section: 0, severity: 2, corrected
[   72.396632] {1}[Hardware Error]: flags: 0x01
[   72.396634] {1}[Hardware Error]: primary
[   72.396635] {1}[Hardware Error]: section_type: memory error
[   72.396637] {1}[Hardware Error]: error_status: 0x0000000000000400
[   72.396638] {1}[Hardware Error]: node: 3
[   72.396639] {1}[Hardware Error]: card: 0
[   72.396640] {1}[Hardware Error]: module: 0
[   72.396641] {1}[Hardware Error]: device: 0
[   72.396643] {1}[Hardware Error]: error_type: 18, unknown
[   72.396666] EDAC MC0: 1 CE reserved error (18) on unknown label (node:3 card:0 module:0 page:0x0 offset:0x0 grain:0 syndrome:0x0 - status(0x0000000000000400): Storage error in DRAM memory)

Is properly represented on the trace event:

     kworker/0:2-584   [000] ....    72.396657: mc_event: 1 Corrected error: reserved error (18) on unknown label (mc:0 location:-1:-1:-1 address:0x00000000 grain:1 syndrome:0x00000000 APEI location: node:3 card:0 module:0 status(0x0000000000000400): Storage error in DRAM memory)

Tested on a 4 sockets E5-4650 Sandy Bridge machine.

Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
---
 drivers/edac/edac_mc.c   | 16 ++++++++--------
 drivers/edac/ghes_edac.c | 13 +++++++++++++
 2 files changed, 21 insertions(+), 8 deletions(-)

diff --git a/drivers/edac/edac_mc.c b/drivers/edac/edac_mc.c
index 501061d..cdb81aa 100644
--- a/drivers/edac/edac_mc.c
+++ b/drivers/edac/edac_mc.c
@@ -1082,16 +1082,8 @@ void edac_raw_mc_handle_error(const enum hw_event_mc_err_type type,
 			      struct edac_raw_error_desc *e)
 {
 	char detail[80];
-	u8 grain_bits;
 	int pos[EDAC_MAX_LAYERS] = { e->top_layer, e->mid_layer, e->low_layer };
 
-	/* Report the error via the trace interface */
-	grain_bits = fls_long(e->grain) + 1;
-	trace_mc_event(type, e->msg, e->label, e->error_count,
-		       mci->mc_idx, e->top_layer, e->mid_layer, e->low_layer,
-		       PAGES_TO_MiB(e->page_frame_number) | e->offset_in_page,
-		       grain_bits, e->syndrome, e->other_detail);
-
 	/* Memory type dependent details about the error */
 	if (type == HW_EVENT_ERR_CORRECTED) {
 		snprintf(detail, sizeof(detail),
@@ -1148,6 +1140,7 @@ void edac_mc_handle_error(const enum hw_event_mc_err_type type,
 	int row = -1, chan = -1;
 	int pos[EDAC_MAX_LAYERS] = { top_layer, mid_layer, low_layer };
 	int i, n_labels = 0;
+	u8 grain_bits;
 	struct edac_raw_error_desc *e = &mci->error_desc;
 
 	edac_dbg(3, "MC%d\n", mci->mc_idx);
@@ -1287,6 +1280,13 @@ void edac_mc_handle_error(const enum hw_event_mc_err_type type,
 	if (p > e->location)
 		*(p - 1) = '\0';
 
+	/* Report the error via the trace interface */
+	grain_bits = fls_long(e->grain) + 1;
+	trace_mc_event(type, e->msg, e->label, e->error_count,
+		       mci->mc_idx, e->top_layer, e->mid_layer, e->low_layer,
+		       PAGES_TO_MiB(e->page_frame_number) | e->offset_in_page,
+		       grain_bits, e->syndrome, e->other_detail);
+
 	edac_raw_mc_handle_error(type, mci, e);
 }
 EXPORT_SYMBOL_GPL(edac_mc_handle_error);
diff --git a/drivers/edac/ghes_edac.c b/drivers/edac/ghes_edac.c
index 1bde451..636dcf1 100644
--- a/drivers/edac/ghes_edac.c
+++ b/drivers/edac/ghes_edac.c
@@ -15,6 +15,7 @@
 #include <linux/edac.h>
 #include <linux/dmi.h>
 #include "edac_core.h"
+#include <ras/ras_event.h>
 
 #define GHES_EDAC_REVISION " Ver: 1.0.0"
 
@@ -24,6 +25,7 @@ struct ghes_edac_pvt {
 	struct mem_ctl_info *mci;
 
 	/* Buffers for the error handling routine */
+	char detail_location[240];
 	char other_detail[160];
 	char msg[80];
 };
@@ -191,6 +193,7 @@ void ghes_edac_report_mem_error(struct ghes *ghes, int sev,
 	struct mem_ctl_info *mci;
 	struct ghes_edac_pvt *pvt = NULL;
 	char *p;
+	u8 grain_bits;
 
 	list_for_each_entry(pvt, &ghes_reglist, list) {
 		if (ghes == pvt->ghes)
@@ -398,6 +401,16 @@ void ghes_edac_report_mem_error(struct ghes *ghes, int sev,
 	if (p > pvt->other_detail)
 		*(p - 1) = '\0';
 
+	/* Generate the trace event */
+	grain_bits = fls_long(e->grain);
+	sprintf(pvt->detail_location, "APEI location: %s %s",
+		e->location, e->other_detail);
+	trace_mc_event(type, e->msg, e->label, e->error_count,
+		       mci->mc_idx, e->top_layer, e->mid_layer, e->low_layer,
+		       PAGES_TO_MiB(e->page_frame_number) | e->offset_in_page,
+		       grain_bits, e->syndrome, pvt->detail_location);
+
+	/* Report the error via EDAC API */
 	edac_raw_mc_handle_error(type, mci, e);
 }
 EXPORT_SYMBOL_GPL(ghes_edac_report_mem_error);
-- 
1.8.1.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH EDACv2 01/12] edac: add support for raw error reports
  2013-02-21 15:38 ` [PATCH EDACv2 01/12] edac: add support for raw error reports Mauro Carvalho Chehab
@ 2013-02-21 15:53   ` Borislav Petkov
  2013-02-21 16:01     ` Mauro Carvalho Chehab
  0 siblings, 1 reply; 18+ messages in thread
From: Borislav Petkov @ 2013-02-21 15:53 UTC (permalink / raw)
  To: Mauro Carvalho Chehab; +Cc: Linux Edac Mailing List, Linux Kernel Mailing List

On Thu, Feb 21, 2013 at 12:38:59PM -0300, Mauro Carvalho Chehab wrote:
> That allows APEI GHES driver to report errors directly, using
> the EDAC error report API.
> 
> Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
> ---
>  drivers/edac/edac_core.h |  17 ++++++++
>  drivers/edac/edac_mc.c   | 109 ++++++++++++++++++++++++++++++++++++-----------
>  2 files changed, 100 insertions(+), 26 deletions(-)
> 
> diff --git a/drivers/edac/edac_core.h b/drivers/edac/edac_core.h
> index 23bb99f..9c5da11 100644
> --- a/drivers/edac/edac_core.h
> +++ b/drivers/edac/edac_core.h
> @@ -453,6 +453,23 @@ extern struct mem_ctl_info *find_mci_by_dev(struct device *dev);
>  extern struct mem_ctl_info *edac_mc_del_mc(struct device *dev);
>  extern int edac_mc_find_csrow_by_page(struct mem_ctl_info *mci,
>  				      unsigned long page);
> +
> +void edac_raw_mc_handle_error(const enum hw_event_mc_err_type type,
> +			  struct mem_ctl_info *mci,
> +			  long grain,
> +			  const u16 error_count,
> +			  const int top_layer,
> +			  const int mid_layer,
> +			  const int low_layer,
> +			  const unsigned long page_frame_number,
> +			  const unsigned long offset_in_page,
> +			  const unsigned long syndrome,
> +			  const char *msg,
> +			  const char *location,
> +			  const char *label,
> +			  const char *other_detail,
> +			  const bool enable_per_layer_report);

Why not merge this patch with 10/12?

You're adding edac_raw_mc_handle_error here and rewriting it in 10/12.
This looks like unnecessary churn to me.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH EDACv2 01/12] edac: add support for raw error reports
  2013-02-21 15:53   ` Borislav Petkov
@ 2013-02-21 16:01     ` Mauro Carvalho Chehab
  2013-02-21 16:08       ` Borislav Petkov
  0 siblings, 1 reply; 18+ messages in thread
From: Mauro Carvalho Chehab @ 2013-02-21 16:01 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Linux Edac Mailing List, Linux Kernel Mailing List

Em Thu, 21 Feb 2013 16:53:37 +0100
Borislav Petkov <bp@alien8.de> escreveu:

> On Thu, Feb 21, 2013 at 12:38:59PM -0300, Mauro Carvalho Chehab wrote:
> > That allows APEI GHES driver to report errors directly, using
> > the EDAC error report API.
> > 
> > Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
> > ---
> >  drivers/edac/edac_core.h |  17 ++++++++
> >  drivers/edac/edac_mc.c   | 109 ++++++++++++++++++++++++++++++++++++-----------
> >  2 files changed, 100 insertions(+), 26 deletions(-)
> > 
> > diff --git a/drivers/edac/edac_core.h b/drivers/edac/edac_core.h
> > index 23bb99f..9c5da11 100644
> > --- a/drivers/edac/edac_core.h
> > +++ b/drivers/edac/edac_core.h
> > @@ -453,6 +453,23 @@ extern struct mem_ctl_info *find_mci_by_dev(struct device *dev);
> >  extern struct mem_ctl_info *edac_mc_del_mc(struct device *dev);
> >  extern int edac_mc_find_csrow_by_page(struct mem_ctl_info *mci,
> >  				      unsigned long page);
> > +
> > +void edac_raw_mc_handle_error(const enum hw_event_mc_err_type type,
> > +			  struct mem_ctl_info *mci,
> > +			  long grain,
> > +			  const u16 error_count,
> > +			  const int top_layer,
> > +			  const int mid_layer,
> > +			  const int low_layer,
> > +			  const unsigned long page_frame_number,
> > +			  const unsigned long offset_in_page,
> > +			  const unsigned long syndrome,
> > +			  const char *msg,
> > +			  const char *location,
> > +			  const char *label,
> > +			  const char *other_detail,
> > +			  const bool enable_per_layer_report);
> 
> Why not merge this patch with 10/12?
> 
> You're adding edac_raw_mc_handle_error here and rewriting it in 10/12.
> This looks like unnecessary churn to me.

I did this way to to follow the "one patch per logical change rule".
The rationale for this patch is to allow doing raw error reports,
as needed by changeset 06/12.

The rationale for 10/12 was to reduce the stack pressure. While
10/12 touches on the above routine, it's main focus is to fix a
previous fixup of using a large space inside the stack during the
memory report preparation.

Regards,
Mauro

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH EDACv2 01/12] edac: add support for raw error reports
  2013-02-21 16:01     ` Mauro Carvalho Chehab
@ 2013-02-21 16:08       ` Borislav Petkov
  2013-02-21 17:47         ` Mauro Carvalho Chehab
  0 siblings, 1 reply; 18+ messages in thread
From: Borislav Petkov @ 2013-02-21 16:08 UTC (permalink / raw)
  To: Mauro Carvalho Chehab; +Cc: Linux Edac Mailing List, Linux Kernel Mailing List

On Thu, Feb 21, 2013 at 01:01:25PM -0300, Mauro Carvalho Chehab wrote:
> I did this way to to follow the "one patch per logical change rule".
> The rationale for this patch is to allow doing raw error reports,
> as needed by changeset 06/12.

Right, but you're adding a new function so why not add the right version
from the get-go? Otherwise the rule above would apply but not for new
code, right?

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH EDACv2 01/12] edac: add support for raw error reports
  2013-02-21 16:08       ` Borislav Petkov
@ 2013-02-21 17:47         ` Mauro Carvalho Chehab
  2013-02-21 18:03           ` Borislav Petkov
  0 siblings, 1 reply; 18+ messages in thread
From: Mauro Carvalho Chehab @ 2013-02-21 17:47 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Linux Edac Mailing List, Linux Kernel Mailing List

Em Thu, 21 Feb 2013 17:08:21 +0100
Borislav Petkov <bp@alien8.de> escreveu:

> On Thu, Feb 21, 2013 at 01:01:25PM -0300, Mauro Carvalho Chehab wrote:
> > I did this way to to follow the "one patch per logical change rule".
> > The rationale for this patch is to allow doing raw error reports,
> > as needed by changeset 06/12.
> 
> Right, but you're adding a new function so why not add the right version
> from the get-go? Otherwise the rule above would apply but not for new
> code, right?

The rule also applies for both patches.

Except for making everybody's looking on this code to re-read all patches 
and for me to rebase it again, I can't see any difference between what I
posted and your proposal to move patch 10 to the beginning of the changeset.

Anyway, I did it at this separate branch, in order to preserve the previous
ghes_v2 (also at the same repository):
	http://git.infradead.org/users/mchehab/edac.git/shortlog/refs/heads/ghes_v3

Patch 10/12 is now the first patch, and became this one:
	http://git.infradead.org/users/mchehab/edac.git/commit/c7ef7645544131b0750478d1cf94cdfa945c809d

Patch 01/12 is now the second one:
	http://git.infradead.org/users/mchehab/edac.git/commit/e7e248304c8ccf02b89e04c3b3b66006b993b5a7

Obviously, the ghes_edac part of 10/12 got fold into this patch:
	http://git.infradead.org/users/mchehab/edac.git/commit/bf45061af29e585506e8d3bfcda2f2b1e557281c

And a couple other patches got side-affected by this change with trivial
conflicts (patches 9/12 and 11/12). Except for this change, v2 and v3 are 
identical, so I won't re-post it.

Mauro

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH EDACv2 01/12] edac: add support for raw error reports
  2013-02-21 17:47         ` Mauro Carvalho Chehab
@ 2013-02-21 18:03           ` Borislav Petkov
  0 siblings, 0 replies; 18+ messages in thread
From: Borislav Petkov @ 2013-02-21 18:03 UTC (permalink / raw)
  To: Mauro Carvalho Chehab; +Cc: Linux Edac Mailing List, Linux Kernel Mailing List

On Thu, Feb 21, 2013 at 02:47:59PM -0300, Mauro Carvalho Chehab wrote:
> Em Thu, 21 Feb 2013 17:08:21 +0100
> Borislav Petkov <bp@alien8.de> escreveu:
> 
> > On Thu, Feb 21, 2013 at 01:01:25PM -0300, Mauro Carvalho Chehab wrote:
> > > I did this way to to follow the "one patch per logical change rule".
> > > The rationale for this patch is to allow doing raw error reports,
> > > as needed by changeset 06/12.
> > 
> > Right, but you're adding a new function so why not add the right version
> > from the get-go? Otherwise the rule above would apply but not for new
> > code, right?
> 
> The rule also applies for both patches.
> 
> Except for making everybody's looking on this code to re-read all patches 
> and for me to rebase it again, I can't see any difference between what I
> posted and your proposal to move patch 10 to the beginning of the changeset.
> 
> Anyway, I did it at this separate branch, in order to preserve the previous
> ghes_v2 (also at the same repository):
> 	http://git.infradead.org/users/mchehab/edac.git/shortlog/refs/heads/ghes_v3
> 
> Patch 10/12 is now the first patch, and became this one:
> 	http://git.infradead.org/users/mchehab/edac.git/commit/c7ef7645544131b0750478d1cf94cdfa945c809d
> 
> Patch 01/12 is now the second one:
> 	http://git.infradead.org/users/mchehab/edac.git/commit/e7e248304c8ccf02b89e04c3b3b66006b993b5a7
> 
> Obviously, the ghes_edac part of 10/12 got fold into this patch:
> 	http://git.infradead.org/users/mchehab/edac.git/commit/bf45061af29e585506e8d3bfcda2f2b1e557281c
> 
> And a couple other patches got side-affected by this change with trivial
> conflicts (patches 9/12 and 11/12). Except for this change, v2 and v3 are 
> identical, so I won't re-post it.

Yep, looks good at a quick glance.

Thanks.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2013-02-21 18:03 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-02-21 15:38 [PATCH EDACv2 00/12] Add a driver to report Firmware first errors (via GHES) Mauro Carvalho Chehab
2013-02-21 15:38 ` [PATCH EDACv2 01/12] edac: add support for raw error reports Mauro Carvalho Chehab
2013-02-21 15:53   ` Borislav Petkov
2013-02-21 16:01     ` Mauro Carvalho Chehab
2013-02-21 16:08       ` Borislav Petkov
2013-02-21 17:47         ` Mauro Carvalho Chehab
2013-02-21 18:03           ` Borislav Petkov
2013-02-21 15:39 ` [PATCH EDACv2 02/12] edac: add support for error type "Info" Mauro Carvalho Chehab
2013-02-21 15:39 ` [PATCH EDACv2 03/12] ghes: move structures/enum to a header file Mauro Carvalho Chehab
2013-02-21 15:39 ` [PATCH EDACv2 04/12] ghes: add the needed hooks for EDAC error report Mauro Carvalho Chehab
2013-02-21 15:39 ` [PATCH EDACv2 05/12] ghes_edac: Register at EDAC core the BIOS report Mauro Carvalho Chehab
2013-02-21 15:39 ` [PATCH EDACv2 06/12] ghes_edac: add support for reporting errors via EDAC Mauro Carvalho Chehab
2013-02-21 15:39 ` [PATCH EDACv2 07/12] ghes_edac: do a better job of filling EDAC DIMM info Mauro Carvalho Chehab
2013-02-21 15:39 ` [PATCH EDACv2 08/12] ghes_edac: Don't credit the same memory dimm twice Mauro Carvalho Chehab
2013-02-21 15:39 ` [PATCH EDACv2 09/12] ghes_edac: Improve driver's printk messages Mauro Carvalho Chehab
2013-02-21 15:39 ` [PATCH EDACv2 10/12] edac: put all arguments for the raw error handling call into a struct Mauro Carvalho Chehab
2013-02-21 15:39 ` [PATCH EDACv2 11/12] ghes_edac: Make it compliant with UEFI spec 2.3.1 Mauro Carvalho Chehab
2013-02-21 15:39 ` [PATCH EDACv2 12/12] ghes_edac: Fix RAS tracing Mauro Carvalho Chehab

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).