linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH RFC 2/2] events/hw_event: Create a Hardware Anomaly Report Mechanism (HARM)
       [not found] <cover.1300996141.git.mchehab@redhat.com>
@ 2011-03-24 20:32 ` Mauro Carvalho Chehab
  2011-03-24 22:39   ` Borislav Petkov
  2011-03-24 20:32 ` [PATCH RFC 1/2] edac: Move edac main structs to include/linux/edac.h Mauro Carvalho Chehab
  2011-03-24 20:54 ` Mauro Carvalho Chehab
  2 siblings, 1 reply; 20+ messages in thread
From: Mauro Carvalho Chehab @ 2011-03-24 20:32 UTC (permalink / raw)
  Cc: Linux Edac Mailing List, Linux Kernel Mailing List

Adds a trace class for handle hardware events

Part of the description bellow is shamelessly copied from Tony
Luck's notes about the Hardware Error BoF during LPC 2010 [1].
Tony, thanks for your notes and discussions to generate the
h/w error reporting requirements.

[1] http://lwn.net/Articles/416669/

    We have several subsystems & methods for reporting hardware errors:

    1) EDAC ("Error Detection and Correction").  In its original form
    this consisted of a platform specific driver that read topology
    information and error counts from chipset registers and reported
    the results via a sysfs interface.

    2) mcelog - x86 specific decoding of machine check bank registers
    reporting in binary form via /dev/mcelog. Recent additions make use
    of the APEI extensions that were documented in version 4.0a of the
    ACPI specification to acquire more information about errors without
    having to rely reading chipset registers directly. A user level
    programs decodes into somewhat human readable format.

    3) drivers/edac/mce_amd.c  A recent addition - this driver hooks into
    the mcelog path and decodes errors reported via machine check bank
    registers in AMD processors to the console log using printk() [despite
    being in the drivers/edac directory, this seems completely different
    from classic EDAC to me].

    Each of these mechanisms has a band of followers ... and none
    of them appear to meet all the needs of all users.

In order to provide a proper hardware event subsystem, let's
encapsulate hardware events into a common trace facility, and
make both edac and mce drivers to use it. After that, common
facilities can be moved into a new core for hardware events
reporting subsystem. This patch is the first of a series, and just
touches at mce.

Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>

 create mode 100644 include/trace/events/hw_event.h

diff --git a/drivers/edac/edac_mc.c b/drivers/edac/edac_mc.c
index a4e9db2..12821ef 100644
--- a/drivers/edac/edac_mc.c
+++ b/drivers/edac/edac_mc.c
@@ -34,6 +34,9 @@
 #include "edac_core.h"
 #include "edac_module.h"
 
+#define CREATE_TRACE_POINTS
+#include <trace/events/hw_event.h>
+
 /* lock to memory controller's control array */
 static DEFINE_MUTEX(mem_ctls_mutex);
 static LIST_HEAD(mc_devices);
@@ -224,6 +227,9 @@ struct mem_ctl_info *edac_mc_alloc(unsigned sz_pvt, unsigned nr_csrows,
 	 * which will perform kobj unregistration and the actual free
 	 * will occur during the kobject callback operation
 	 */
+
+	trace_hw_event_init("mce", (unsigned)edac_index);
+
 	return mci;
 }
 EXPORT_SYMBOL_GPL(edac_mc_alloc);
@@ -689,6 +695,7 @@ void edac_mc_handle_ce(struct mem_ctl_info *mci,
 	/* FIXME - maybe make panic on INTERNAL ERROR an option */
 	if (row >= mci->nr_csrows || row < 0) {
 		/* something is wrong */
+		trace_mc_out_of_range(mci, "CE", "row", row, 0, mci->nr_csrows);
 		edac_mc_printk(mci, KERN_ERR,
 			"INTERNAL ERROR: row out of range "
 			"(%d >= %d)\n", row, mci->nr_csrows);
@@ -698,6 +705,8 @@ void edac_mc_handle_ce(struct mem_ctl_info *mci,
 
 	if (channel >= mci->csrows[row].nr_channels || channel < 0) {
 		/* something is wrong */
+		trace_mc_out_of_range(mci, "CE", "channel", channel,
+				      0, mci->csrows[row].nr_channels);
 		edac_mc_printk(mci, KERN_ERR,
 			"INTERNAL ERROR: channel out of range "
 			"(%d >= %d)\n", channel,
@@ -706,6 +715,9 @@ void edac_mc_handle_ce(struct mem_ctl_info *mci,
 		return;
 	}
 
+	trace_mc_corrected_error(mci, page_frame_number, offset_in_page,
+				syndrome, row, channel, msg);
+
 	if (edac_mc_get_log_ce())
 		/* FIXME - put in DIMM location */
 		edac_mc_printk(mci, KERN_WARNING,
@@ -741,6 +753,7 @@ EXPORT_SYMBOL_GPL(edac_mc_handle_ce);
 
 void edac_mc_handle_ce_no_info(struct mem_ctl_info *mci, const char *msg)
 {
+	trace_mc_corrected_error_no_info(mci, msg);
 	if (edac_mc_get_log_ce())
 		edac_mc_printk(mci, KERN_WARNING,
 			"CE - no information available: %s\n", msg);
@@ -765,6 +778,8 @@ void edac_mc_handle_ue(struct mem_ctl_info *mci,
 	/* FIXME - maybe make panic on INTERNAL ERROR an option */
 	if (row >= mci->nr_csrows || row < 0) {
 		/* something is wrong */
+		trace_mc_out_of_range(mci, "UE", "row", row,
+				      0, mci->nr_csrows);
 		edac_mc_printk(mci, KERN_ERR,
 			"INTERNAL ERROR: row out of range "
 			"(%d >= %d)\n", row, mci->nr_csrows);
@@ -785,6 +800,8 @@ void edac_mc_handle_ue(struct mem_ctl_info *mci,
 		pos += chars;
 	}
 
+	trace_mc_uncorrected_error(mci, page_frame_number, offset_in_page,
+				row, msg, labels);
 	if (edac_mc_get_log_ue())
 		edac_mc_printk(mci, KERN_EMERG,
 			"UE page 0x%lx, offset 0x%lx, grain %d, row %d, "
@@ -805,6 +822,7 @@ EXPORT_SYMBOL_GPL(edac_mc_handle_ue);
 
 void edac_mc_handle_ue_no_info(struct mem_ctl_info *mci, const char *msg)
 {
+	trace_mc_uncorrected_error_no_info(mci, msg);
 	if (edac_mc_get_panic_on_ue())
 		panic("EDAC MC%d: Uncorrected Error", mci->mc_idx);
 
@@ -832,6 +850,9 @@ void edac_mc_handle_fbd_ue(struct mem_ctl_info *mci,
 
 	if (csrow >= mci->nr_csrows) {
 		/* something is wrong */
+
+		trace_mc_out_of_range(mci, "UE FBDIMM", "row", csrow,
+				      0, mci->nr_csrows);
 		edac_mc_printk(mci, KERN_ERR,
 			"INTERNAL ERROR: row out of range (%d >= %d)\n",
 			csrow, mci->nr_csrows);
@@ -841,6 +862,8 @@ void edac_mc_handle_fbd_ue(struct mem_ctl_info *mci,
 
 	if (channela >= mci->csrows[csrow].nr_channels) {
 		/* something is wrong */
+		trace_mc_out_of_range(mci, "UE FBDIMM", "channel-a", channela,
+				      0, mci->csrows[csrow].nr_channels);
 		edac_mc_printk(mci, KERN_ERR,
 			"INTERNAL ERROR: channel-a out of range "
 			"(%d >= %d)\n",
@@ -851,6 +874,8 @@ void edac_mc_handle_fbd_ue(struct mem_ctl_info *mci,
 
 	if (channelb >= mci->csrows[csrow].nr_channels) {
 		/* something is wrong */
+		trace_mc_out_of_range(mci, "UE FBDIMM", "channel-b", channelb,
+				      0, mci->csrows[csrow].nr_channels);
 		edac_mc_printk(mci, KERN_ERR,
 			"INTERNAL ERROR: channel-b out of range "
 			"(%d >= %d)\n",
@@ -870,6 +895,8 @@ void edac_mc_handle_fbd_ue(struct mem_ctl_info *mci,
 	chars = snprintf(pos, len + 1, "-%s",
 			 mci->csrows[csrow].channels[channelb].label);
 
+	trace_mc_uncorrected_error_fbd(mci, csrow, channela, channelb,
+				       msg, labels);
 	if (edac_mc_get_log_ue())
 		edac_mc_printk(mci, KERN_EMERG,
 			"UE row %d, channel-a= %d channel-b= %d "
@@ -894,6 +921,8 @@ void edac_mc_handle_fbd_ce(struct mem_ctl_info *mci,
 	/* Ensure boundary values */
 	if (csrow >= mci->nr_csrows) {
 		/* something is wrong */
+		trace_mc_out_of_range(mci, "CE FBDIMM", "row", csrow,
+				      0, mci->nr_csrows);
 		edac_mc_printk(mci, KERN_ERR,
 			"INTERNAL ERROR: row out of range (%d >= %d)\n",
 			csrow, mci->nr_csrows);
@@ -902,6 +931,8 @@ void edac_mc_handle_fbd_ce(struct mem_ctl_info *mci,
 	}
 	if (channel >= mci->csrows[csrow].nr_channels) {
 		/* something is wrong */
+		trace_mc_out_of_range(mci, "UE FBDIMM", "channel", channel,
+				      0, mci->csrows[csrow].nr_channels);
 		edac_mc_printk(mci, KERN_ERR,
 			"INTERNAL ERROR: channel out of range (%d >= %d)\n",
 			channel, mci->csrows[csrow].nr_channels);
@@ -909,6 +940,7 @@ void edac_mc_handle_fbd_ce(struct mem_ctl_info *mci,
 		return;
 	}
 
+	trace_mc_corrected_error_fbd(mci, csrow, channel, msg);
 	if (edac_mc_get_log_ce())
 		/* FIXME - put in DIMM location */
 		edac_mc_printk(mci, KERN_WARNING,
diff --git a/include/trace/events/hw_event.h b/include/trace/events/hw_event.h
new file mode 100644
index 0000000..a46ac61
--- /dev/null
+++ b/include/trace/events/hw_event.h
@@ -0,0 +1,322 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM hw_event
+
+#if !defined(_TRACE_HW_EVENT_MC_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_HW_EVENT_MC_H
+
+#include <linux/tracepoint.h>
+#include <linux/edac.h>
+
+/*
+ * Hardware Anomaly Report Mechanism (HARM) events
+ *
+ * Those events are generated when hardware detected a corrected or
+ * uncorrected event, and are meant to replace the current API to report
+ * errors defined on both EDAC and MCE subsystems.
+ */
+
+DECLARE_EVENT_CLASS(hw_event_class,
+	TP_PROTO(const char *type, unsigned int instance),
+	TP_ARGS(type, instance),
+
+	TP_STRUCT__entry(
+		__field(	const char *,	type			)
+		__field(	unsigned int,	instance		)
+	),
+
+	TP_fast_assign(
+		__entry->type	= type;
+		__entry->instance = instance;
+	),
+
+	TP_printk("Initialized %s#%d\n",
+		__entry->type,
+		__entry->instance)
+);
+
+/*
+ * This event indicates that a hardware collection mechanism is started
+ */
+DEFINE_EVENT(hw_event_class, hw_event_init,
+
+	TP_PROTO(const char *type, unsigned int instance),
+
+	TP_ARGS(type, instance)
+);
+
+
+/*
+ * Memory Controller specific events
+ */
+
+/*
+ * Default error mechanisms for Memory Controller errors (CE and UE)
+ */
+TRACE_EVENT(mc_corrected_error,
+
+	TP_PROTO(struct mem_ctl_info *mci,
+		unsigned long page_frame_number,
+		unsigned long offset_in_page, unsigned long syndrome,
+		int row, int channel, const char *msg),
+
+	TP_ARGS(mci, page_frame_number, offset_in_page, syndrome, row,
+		channel, msg),
+
+	TP_STRUCT__entry(
+		__field(	unsigned int,	mc_index		)
+		__field(	unsigned long,	page_frame_number	)
+		__field(	unsigned long,	offset_in_page		)
+		__field(	u32,		grain			)
+		__field(	unsigned long,	syndrome		)
+		__field(	int,		row			)
+		__field(	int,		channel			)
+		__field(	const char *,	label			)
+		__field(	const char *,	msg			)
+	),
+
+	TP_fast_assign(
+		__entry->mc_index		= mci->mc_idx;
+		__entry->page_frame_number	= page_frame_number;
+		__entry->offset_in_page		= offset_in_page;
+		__entry->grain			= mci->csrows[row].grain;
+		__entry->syndrome		= syndrome;
+		__entry->row			= row;
+		__entry->channel		= channel;
+		__entry->label			= mci->csrows[row].channels[channel].label;
+		__entry->msg			= msg;
+	),
+
+	TP_printk(HW_ERR "mce#%d: Corrected error %s on label \"%s\" "
+			 "(page 0x%lux, offset 0x%lux, grain %ud, "
+			 "syndrome 0x%lux, row %d, channel %d)\n",
+		__entry->mc_index,
+		__entry->msg,
+		__entry->label,
+		__entry->page_frame_number,
+		__entry->offset_in_page,
+		__entry->grain,
+		__entry->syndrome,
+		__entry->row,
+		__entry->channel)
+);
+
+TRACE_EVENT(mc_uncorrected_error,
+
+	TP_PROTO(struct mem_ctl_info *mci,
+		unsigned long page_frame_number,
+		unsigned long offset_in_page,
+		int row, const char *msg, const char *label),
+
+	TP_ARGS(mci, page_frame_number, offset_in_page,
+		row, msg, label),
+
+	TP_STRUCT__entry(
+		__field(	unsigned int,	mc_index		)
+		__field(	unsigned long,	page_frame_number	)
+		__field(	unsigned long,	offset_in_page		)
+		__field(	u32,		grain			)
+		__field(	int,		row			)
+		__field(	const char *,	msg			)
+		__field(	const char *,	label			)
+	),
+
+	TP_fast_assign(
+		__entry->mc_index		= mci->mc_idx;
+		__entry->page_frame_number	= page_frame_number;
+		__entry->offset_in_page		= offset_in_page;
+		__entry->grain			= mci->csrows[row].grain;
+		__entry->row			= row;
+		__entry->msg			= msg;
+		__entry->label			= label;
+	),
+
+	TP_printk(HW_ERR "mce#%d: Uncorrected error %s on label \"%s\""
+			 "(page 0x%lux, offset 0x%lux, grain %ud, row %d)\n",
+		__entry->mc_index,
+		__entry->msg,
+		__entry->label,
+		__entry->page_frame_number,
+		__entry->offset_in_page,
+		__entry->grain,
+		__entry->row)
+);
+
+
+/*
+ * Fully-Buffered memory hardware in general don't provide syndrome/grain/row
+ * information for all types of errors. So, we need to either have another
+ * trace event or add a bitmapped field to indicate that some info are not
+ * provided and use the previously-declared event. It seemed easier and less
+ * confusing to create a different event for such cases
+ */
+TRACE_EVENT(mc_corrected_error_fbd,
+
+	TP_PROTO(struct mem_ctl_info *mci,
+		int row, int channel, const char *msg),
+
+	TP_ARGS(mci, row, channel, msg),
+
+	TP_STRUCT__entry(
+		__field(	unsigned int,	mc_index		)
+		__field(	int,		row			)
+		__field(	int,		channel	        	)
+		__field(	const char *,	label			)
+		__field(	const char *,	msg			)
+	),
+
+	TP_fast_assign(
+		__entry->mc_index		= mci->mc_idx;
+		__entry->row			= row;
+		__entry->channel		= channel;
+		__entry->label			= mci->csrows[row].channels[channel].label;
+		__entry->msg			= msg;
+	),
+
+	TP_printk(HW_ERR "mce#%d: Corrected Error %s on label \"%s\" "
+			 "(row %d, channel %d)\n",
+		__entry->mc_index,
+		__entry->msg,
+		__entry->label,
+		__entry->row,
+		__entry->channel)
+);
+
+TRACE_EVENT(mc_uncorrected_error_fbd,
+
+	TP_PROTO(struct mem_ctl_info *mci,
+		int row, int channela, int channelb,
+		const char *msg, const char *label),
+
+	TP_ARGS(mci, row, channela, channelb, msg, label),
+
+	TP_STRUCT__entry(
+		__field(	unsigned int,	mc_index		)
+		__field(	int,		row			)
+		__field(	int,		channela		)
+		__field(	int,		channelb		)
+		__field(	const char *,	msg			)
+		__field(	const char *,	label			)
+	),
+
+	TP_fast_assign(
+		__entry->mc_index		= mci->mc_idx;
+		__entry->row			= row;
+		__entry->channela		= channela;
+		__entry->channelb		= channelb;
+		__entry->msg			= msg;
+		__entry->label			= label;
+	),
+
+	TP_printk(HW_ERR "mce#%d: Uncorrected Error %s on label \"%s\" "
+			 "(row %d, channels: %d, %d)\n",
+		__entry->mc_index,
+		__entry->msg,
+		__entry->label,
+		__entry->row,
+		__entry->channela,
+		__entry->channelb)
+);
+
+/*
+ * The Memory controller driver needs to discover the memory topology, in
+ * order to associate a hardware error with the memory label. If, for any
+ * reason, it receives an error for a channel or row that are not supposed
+ * to be there, an error event needs to be generated to indicate:
+ *	- that a Corrected or Uncorrected error was received;
+ *	- that the driver has a bug and, for that particular hardware, was
+ *	  not capable of detecting the hardware architecture
+ * If one of such errors is ever received, a bug to the kernel driver must
+ * be filled.
+ */
+
+TRACE_EVENT(mc_out_of_range,
+	TP_PROTO(struct mem_ctl_info *mci, const char *type, const char *field,
+		int invalid_val, int min, int max),
+
+	TP_ARGS(mci, type, field, invalid_val, min, max),
+
+	TP_STRUCT__entry(
+		__field(	const char *,	type			)
+		__field(	const char *,	field			)
+		__field(	unsigned int,	mc_index		)
+		__field(	int,		invalid_val		)
+		__field(	int,		min			)
+		__field(	int,		max			)
+	),
+
+	TP_fast_assign(
+		__entry->type			= type;
+		__entry->field			= field;
+		__entry->mc_index		= mci->mc_idx;
+		__entry->invalid_val		= invalid_val;
+		__entry->min			= min;
+		__entry->max			= max;
+	),
+
+	TP_printk(HW_ERR "mce#%d %s: %s=%d is not between %d and %d\n",
+		__entry->mc_index,
+		__entry->type,
+		__entry->field,
+		__entry->invalid_val,
+		__entry->min,
+		__entry->max)
+);
+
+/*
+ * On some cases, a corrected or uncorrected error was detected, but it
+ * couldn't be properly handled, or because another error overrided the
+ * error registers that details the error or because of some internal problem
+ * on the driver. Those events bellow are meant for those error types.
+ */
+TRACE_EVENT(mc_corrected_error_no_info,
+	TP_PROTO(struct mem_ctl_info *mci, const char *msg),
+
+	TP_ARGS(mci, msg),
+
+	TP_STRUCT__entry(
+		__field(	const char *,	msg			)
+		__field(	unsigned int,	mc_index		)
+	),
+
+	TP_fast_assign(
+		__entry->msg			= msg;
+		__entry->mc_index		= mci->mc_idx;
+	),
+
+	TP_printk(HW_ERR "mce#%d: Corrected Error: %s\n",
+		__entry->mc_index,
+		__entry->msg)
+);
+
+TRACE_EVENT(mc_uncorrected_error_no_info,
+	TP_PROTO(struct mem_ctl_info *mci, const char *msg),
+
+	TP_ARGS(mci, msg),
+
+	TP_STRUCT__entry(
+		__field(	const char *,	msg			)
+		__field(	unsigned int,	mc_index		)
+	),
+
+	TP_fast_assign(
+		__entry->msg			= msg;
+		__entry->mc_index		= mci->mc_idx;
+	),
+
+	TP_printk(HW_ERR "mce#%d: Uncorrected Error: %s\n",
+		__entry->mc_index,
+		__entry->msg)
+);
+
+
+
+/*
+ * MCE Events placeholder. Please add non-memory events that come from the
+ * MCE driver here
+ */
+
+
+#endif /* _TRACE_HW_EVENT_MC_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH RFC 1/2] edac: Move edac main structs to include/linux/edac.h
       [not found] <cover.1300996141.git.mchehab@redhat.com>
  2011-03-24 20:32 ` [PATCH RFC 2/2] events/hw_event: Create a Hardware Anomaly Report Mechanism (HARM) Mauro Carvalho Chehab
@ 2011-03-24 20:32 ` Mauro Carvalho Chehab
  2011-03-24 20:54 ` Mauro Carvalho Chehab
  2 siblings, 0 replies; 20+ messages in thread
From: Mauro Carvalho Chehab @ 2011-03-24 20:32 UTC (permalink / raw)
  Cc: Linux Edac Mailing List, Linux Kernel Mailing List

As we'll need to use those structs for trace functions, they should
be on a more public place. So, move struct mem_ctl_info & friends
to edac.h.

No functional changes on this patch.

Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>

diff --git a/drivers/edac/edac_core.h b/drivers/edac/edac_core.h
index 3d96534..ef0738b 100644
--- a/drivers/edac/edac_core.h
+++ b/drivers/edac/edac_core.h
@@ -34,11 +34,10 @@
 #include <linux/platform_device.h>
 #include <linux/sysdev.h>
 #include <linux/workqueue.h>
+#include <linux/edac.h>
 
-#define EDAC_MC_LABEL_LEN	31
 #define EDAC_DEVICE_NAME_LEN	31
 #define EDAC_ATTRIB_VALUE_LEN	15
-#define MC_PROC_NAME_MAX_LEN	7
 
 #if PAGE_SHIFT < 20
 #define PAGES_TO_MiB(pages)	((pages) >> (20 - PAGE_SHIFT))
@@ -101,357 +100,6 @@ extern int edac_debug_level;
 
 #define edac_dev_name(dev) (dev)->dev_name
 
-/* memory devices */
-enum dev_type {
-	DEV_UNKNOWN = 0,
-	DEV_X1,
-	DEV_X2,
-	DEV_X4,
-	DEV_X8,
-	DEV_X16,
-	DEV_X32,		/* Do these parts exist? */
-	DEV_X64			/* Do these parts exist? */
-};
-
-#define DEV_FLAG_UNKNOWN	BIT(DEV_UNKNOWN)
-#define DEV_FLAG_X1		BIT(DEV_X1)
-#define DEV_FLAG_X2		BIT(DEV_X2)
-#define DEV_FLAG_X4		BIT(DEV_X4)
-#define DEV_FLAG_X8		BIT(DEV_X8)
-#define DEV_FLAG_X16		BIT(DEV_X16)
-#define DEV_FLAG_X32		BIT(DEV_X32)
-#define DEV_FLAG_X64		BIT(DEV_X64)
-
-/* memory types */
-enum mem_type {
-	MEM_EMPTY = 0,		/* Empty csrow */
-	MEM_RESERVED,		/* Reserved csrow type */
-	MEM_UNKNOWN,		/* Unknown csrow type */
-	MEM_FPM,		/* Fast page mode */
-	MEM_EDO,		/* Extended data out */
-	MEM_BEDO,		/* Burst Extended data out */
-	MEM_SDR,		/* Single data rate SDRAM */
-	MEM_RDR,		/* Registered single data rate SDRAM */
-	MEM_DDR,		/* Double data rate SDRAM */
-	MEM_RDDR,		/* Registered Double data rate SDRAM */
-	MEM_RMBS,		/* Rambus DRAM */
-	MEM_DDR2,		/* DDR2 RAM */
-	MEM_FB_DDR2,		/* fully buffered DDR2 */
-	MEM_RDDR2,		/* Registered DDR2 RAM */
-	MEM_XDR,		/* Rambus XDR */
-	MEM_DDR3,		/* DDR3 RAM */
-	MEM_RDDR3,		/* Registered DDR3 RAM */
-};
-
-#define MEM_FLAG_EMPTY		BIT(MEM_EMPTY)
-#define MEM_FLAG_RESERVED	BIT(MEM_RESERVED)
-#define MEM_FLAG_UNKNOWN	BIT(MEM_UNKNOWN)
-#define MEM_FLAG_FPM		BIT(MEM_FPM)
-#define MEM_FLAG_EDO		BIT(MEM_EDO)
-#define MEM_FLAG_BEDO		BIT(MEM_BEDO)
-#define MEM_FLAG_SDR		BIT(MEM_SDR)
-#define MEM_FLAG_RDR		BIT(MEM_RDR)
-#define MEM_FLAG_DDR		BIT(MEM_DDR)
-#define MEM_FLAG_RDDR		BIT(MEM_RDDR)
-#define MEM_FLAG_RMBS		BIT(MEM_RMBS)
-#define MEM_FLAG_DDR2           BIT(MEM_DDR2)
-#define MEM_FLAG_FB_DDR2        BIT(MEM_FB_DDR2)
-#define MEM_FLAG_RDDR2          BIT(MEM_RDDR2)
-#define MEM_FLAG_XDR            BIT(MEM_XDR)
-#define MEM_FLAG_DDR3		 BIT(MEM_DDR3)
-#define MEM_FLAG_RDDR3		 BIT(MEM_RDDR3)
-
-/* chipset Error Detection and Correction capabilities and mode */
-enum edac_type {
-	EDAC_UNKNOWN = 0,	/* Unknown if ECC is available */
-	EDAC_NONE,		/* Doesnt support ECC */
-	EDAC_RESERVED,		/* Reserved ECC type */
-	EDAC_PARITY,		/* Detects parity errors */
-	EDAC_EC,		/* Error Checking - no correction */
-	EDAC_SECDED,		/* Single bit error correction, Double detection */
-	EDAC_S2ECD2ED,		/* Chipkill x2 devices - do these exist? */
-	EDAC_S4ECD4ED,		/* Chipkill x4 devices */
-	EDAC_S8ECD8ED,		/* Chipkill x8 devices */
-	EDAC_S16ECD16ED,	/* Chipkill x16 devices */
-};
-
-#define EDAC_FLAG_UNKNOWN	BIT(EDAC_UNKNOWN)
-#define EDAC_FLAG_NONE		BIT(EDAC_NONE)
-#define EDAC_FLAG_PARITY	BIT(EDAC_PARITY)
-#define EDAC_FLAG_EC		BIT(EDAC_EC)
-#define EDAC_FLAG_SECDED	BIT(EDAC_SECDED)
-#define EDAC_FLAG_S2ECD2ED	BIT(EDAC_S2ECD2ED)
-#define EDAC_FLAG_S4ECD4ED	BIT(EDAC_S4ECD4ED)
-#define EDAC_FLAG_S8ECD8ED	BIT(EDAC_S8ECD8ED)
-#define EDAC_FLAG_S16ECD16ED	BIT(EDAC_S16ECD16ED)
-
-/* scrubbing capabilities */
-enum scrub_type {
-	SCRUB_UNKNOWN = 0,	/* Unknown if scrubber is available */
-	SCRUB_NONE,		/* No scrubber */
-	SCRUB_SW_PROG,		/* SW progressive (sequential) scrubbing */
-	SCRUB_SW_SRC,		/* Software scrub only errors */
-	SCRUB_SW_PROG_SRC,	/* Progressive software scrub from an error */
-	SCRUB_SW_TUNABLE,	/* Software scrub frequency is tunable */
-	SCRUB_HW_PROG,		/* HW progressive (sequential) scrubbing */
-	SCRUB_HW_SRC,		/* Hardware scrub only errors */
-	SCRUB_HW_PROG_SRC,	/* Progressive hardware scrub from an error */
-	SCRUB_HW_TUNABLE	/* Hardware scrub frequency is tunable */
-};
-
-#define SCRUB_FLAG_SW_PROG	BIT(SCRUB_SW_PROG)
-#define SCRUB_FLAG_SW_SRC	BIT(SCRUB_SW_SRC)
-#define SCRUB_FLAG_SW_PROG_SRC	BIT(SCRUB_SW_PROG_SRC)
-#define SCRUB_FLAG_SW_TUN	BIT(SCRUB_SW_SCRUB_TUNABLE)
-#define SCRUB_FLAG_HW_PROG	BIT(SCRUB_HW_PROG)
-#define SCRUB_FLAG_HW_SRC	BIT(SCRUB_HW_SRC)
-#define SCRUB_FLAG_HW_PROG_SRC	BIT(SCRUB_HW_PROG_SRC)
-#define SCRUB_FLAG_HW_TUN	BIT(SCRUB_HW_TUNABLE)
-
-/* FIXME - should have notify capabilities: NMI, LOG, PROC, etc */
-
-/* EDAC internal operation states */
-#define	OP_ALLOC		0x100
-#define OP_RUNNING_POLL		0x201
-#define OP_RUNNING_INTERRUPT	0x202
-#define OP_RUNNING_POLL_INTR	0x203
-#define OP_OFFLINE		0x300
-
-/*
- * There are several things to be aware of that aren't at all obvious:
- *
- *
- * SOCKETS, SOCKET SETS, BANKS, ROWS, CHIP-SELECT ROWS, CHANNELS, etc..
- *
- * These are some of the many terms that are thrown about that don't always
- * mean what people think they mean (Inconceivable!).  In the interest of
- * creating a common ground for discussion, terms and their definitions
- * will be established.
- *
- * Memory devices:	The individual chip on a memory stick.  These devices
- *			commonly output 4 and 8 bits each.  Grouping several
- *			of these in parallel provides 64 bits which is common
- *			for a memory stick.
- *
- * Memory Stick:	A printed circuit board that agregates multiple
- *			memory devices in parallel.  This is the atomic
- *			memory component that is purchaseable by Joe consumer
- *			and loaded into a memory socket.
- *
- * Socket:		A physical connector on the motherboard that accepts
- *			a single memory stick.
- *
- * Channel:		Set of memory devices on a memory stick that must be
- *			grouped in parallel with one or more additional
- *			channels from other memory sticks.  This parallel
- *			grouping of the output from multiple channels are
- *			necessary for the smallest granularity of memory access.
- *			Some memory controllers are capable of single channel -
- *			which means that memory sticks can be loaded
- *			individually.  Other memory controllers are only
- *			capable of dual channel - which means that memory
- *			sticks must be loaded as pairs (see "socket set").
- *
- * Chip-select row:	All of the memory devices that are selected together.
- *			for a single, minimum grain of memory access.
- *			This selects all of the parallel memory devices across
- *			all of the parallel channels.  Common chip-select rows
- *			for single channel are 64 bits, for dual channel 128
- *			bits.
- *
- * Single-Ranked stick:	A Single-ranked stick has 1 chip-select row of memory.
- *			Motherboards commonly drive two chip-select pins to
- *			a memory stick. A single-ranked stick, will occupy
- *			only one of those rows. The other will be unused.
- *
- * Double-Ranked stick:	A double-ranked stick has two chip-select rows which
- *			access different sets of memory devices.  The two
- *			rows cannot be accessed concurrently.
- *
- * Double-sided stick:	DEPRECATED TERM, see Double-Ranked stick.
- *			A double-sided stick has two chip-select rows which
- *			access different sets of memory devices.  The two
- *			rows cannot be accessed concurrently.  "Double-sided"
- *			is irrespective of the memory devices being mounted
- *			on both sides of the memory stick.
- *
- * Socket set:		All of the memory sticks that are required for
- *			a single memory access or all of the memory sticks
- *			spanned by a chip-select row.  A single socket set
- *			has two chip-select rows and if double-sided sticks
- *			are used these will occupy those chip-select rows.
- *
- * Bank:		This term is avoided because it is unclear when
- *			needing to distinguish between chip-select rows and
- *			socket sets.
- *
- * Controller pages:
- *
- * Physical pages:
- *
- * Virtual pages:
- *
- *
- * STRUCTURE ORGANIZATION AND CHOICES
- *
- *
- *
- * PS - I enjoyed writing all that about as much as you enjoyed reading it.
- */
-
-struct channel_info {
-	int chan_idx;		/* channel index */
-	u32 ce_count;		/* Correctable Errors for this CHANNEL */
-	char label[EDAC_MC_LABEL_LEN + 1];	/* DIMM label on motherboard */
-	struct csrow_info *csrow;	/* the parent */
-};
-
-struct csrow_info {
-	unsigned long first_page;	/* first page number in dimm */
-	unsigned long last_page;	/* last page number in dimm */
-	unsigned long page_mask;	/* used for interleaving -
-					 * 0UL for non intlv
-					 */
-	u32 nr_pages;		/* number of pages in csrow */
-	u32 grain;		/* granularity of reported error in bytes */
-	int csrow_idx;		/* the chip-select row */
-	enum dev_type dtype;	/* memory device type */
-	u32 ue_count;		/* Uncorrectable Errors for this csrow */
-	u32 ce_count;		/* Correctable Errors for this csrow */
-	enum mem_type mtype;	/* memory csrow type */
-	enum edac_type edac_mode;	/* EDAC mode for this csrow */
-	struct mem_ctl_info *mci;	/* the parent */
-
-	struct kobject kobj;	/* sysfs kobject for this csrow */
-
-	/* channel information for this csrow */
-	u32 nr_channels;
-	struct channel_info *channels;
-};
-
-struct mcidev_sysfs_group {
-	const char *name;				/* group name */
-	const struct mcidev_sysfs_attribute *mcidev_attr; /* group attributes */
-};
-
-struct mcidev_sysfs_group_kobj {
-	struct list_head list;		/* list for all instances within a mc */
-
-	struct kobject kobj;		/* kobj for the group */
-
-	const struct mcidev_sysfs_group *grp;	/* group description table */
-	struct mem_ctl_info *mci;	/* the parent */
-};
-
-/* mcidev_sysfs_attribute structure
- *	used for driver sysfs attributes and in mem_ctl_info
- * 	sysfs top level entries
- */
-struct mcidev_sysfs_attribute {
-	/* It should use either attr or grp */
-	struct attribute attr;
-	const struct mcidev_sysfs_group *grp;	/* Points to a group of attributes */
-
-	/* Ops for show/store values at the attribute - not used on group */
-        ssize_t (*show)(struct mem_ctl_info *,char *);
-        ssize_t (*store)(struct mem_ctl_info *, const char *,size_t);
-};
-
-/* MEMORY controller information structure
- */
-struct mem_ctl_info {
-	struct list_head link;	/* for global list of mem_ctl_info structs */
-
-	struct module *owner;	/* Module owner of this control struct */
-
-	unsigned long mtype_cap;	/* memory types supported by mc */
-	unsigned long edac_ctl_cap;	/* Mem controller EDAC capabilities */
-	unsigned long edac_cap;	/* configuration capabilities - this is
-				 * closely related to edac_ctl_cap.  The
-				 * difference is that the controller may be
-				 * capable of s4ecd4ed which would be listed
-				 * in edac_ctl_cap, but if channels aren't
-				 * capable of s4ecd4ed then the edac_cap would
-				 * not have that capability.
-				 */
-	unsigned long scrub_cap;	/* chipset scrub capabilities */
-	enum scrub_type scrub_mode;	/* current scrub mode */
-
-	/* Translates sdram memory scrub rate given in bytes/sec to the
-	   internal representation and configures whatever else needs
-	   to be configured.
-	 */
-	int (*set_sdram_scrub_rate) (struct mem_ctl_info * mci, u32 bw);
-
-	/* Get the current sdram memory scrub rate from the internal
-	   representation and converts it to the closest matching
-	   bandwith in bytes/sec.
-	 */
-	int (*get_sdram_scrub_rate) (struct mem_ctl_info * mci);
-
-
-	/* pointer to edac checking routine */
-	void (*edac_check) (struct mem_ctl_info * mci);
-
-	/*
-	 * Remaps memory pages: controller pages to physical pages.
-	 * For most MC's, this will be NULL.
-	 */
-	/* FIXME - why not send the phys page to begin with? */
-	unsigned long (*ctl_page_to_phys) (struct mem_ctl_info * mci,
-					   unsigned long page);
-	int mc_idx;
-	int nr_csrows;
-	struct csrow_info *csrows;
-	/*
-	 * FIXME - what about controllers on other busses? - IDs must be
-	 * unique.  dev pointer should be sufficiently unique, but
-	 * BUS:SLOT.FUNC numbers may not be unique.
-	 */
-	struct device *dev;
-	const char *mod_name;
-	const char *mod_ver;
-	const char *ctl_name;
-	const char *dev_name;
-	char proc_name[MC_PROC_NAME_MAX_LEN + 1];
-	void *pvt_info;
-	u32 ue_noinfo_count;	/* Uncorrectable Errors w/o info */
-	u32 ce_noinfo_count;	/* Correctable Errors w/o info */
-	u32 ue_count;		/* Total Uncorrectable Errors for this MC */
-	u32 ce_count;		/* Total Correctable Errors for this MC */
-	unsigned long start_time;	/* mci load start time (in jiffies) */
-
-	/* this stuff is for safe removal of mc devices from global list while
-	 * NMI handlers may be traversing list
-	 */
-	struct rcu_head rcu;
-	struct completion complete;
-
-	/* edac sysfs device control */
-	struct kobject edac_mci_kobj;
-
-	/* list for all grp instances within a mc */
-	struct list_head grp_kobj_list;
-
-	/* Additional top controller level attributes, but specified
-	 * by the low level driver.
-	 *
-	 * Set by the low level driver to provide attributes at the
-	 * controller level, same level as 'ue_count' and 'ce_count' above.
-	 * An array of structures, NULL terminated
-	 *
-	 * If attributes are desired, then set to array of attributes
-	 * If no attributes are desired, leave NULL
-	 */
-	const struct mcidev_sysfs_attribute *mc_driver_sysfs_attributes;
-
-	/* work struct for this MC */
-	struct delayed_work work;
-
-	/* the internal state of this controller instance */
-	int op_state;
-};
-
 /*
  * The following are the structures to provide for a generic
  * or abstract 'edac_device'. This set of structures and the
diff --git a/include/linux/edac.h b/include/linux/edac.h
index 36c6644..12d0b45 100644
--- a/include/linux/edac.h
+++ b/include/linux/edac.h
@@ -42,4 +42,358 @@ static inline void opstate_init(void)
 	return;
 }
 
+#define EDAC_MC_LABEL_LEN	31
+#define MC_PROC_NAME_MAX_LEN	7
+
+/* memory devices */
+enum dev_type {
+	DEV_UNKNOWN = 0,
+	DEV_X1,
+	DEV_X2,
+	DEV_X4,
+	DEV_X8,
+	DEV_X16,
+	DEV_X32,		/* Do these parts exist? */
+	DEV_X64			/* Do these parts exist? */
+};
+
+#define DEV_FLAG_UNKNOWN	BIT(DEV_UNKNOWN)
+#define DEV_FLAG_X1		BIT(DEV_X1)
+#define DEV_FLAG_X2		BIT(DEV_X2)
+#define DEV_FLAG_X4		BIT(DEV_X4)
+#define DEV_FLAG_X8		BIT(DEV_X8)
+#define DEV_FLAG_X16		BIT(DEV_X16)
+#define DEV_FLAG_X32		BIT(DEV_X32)
+#define DEV_FLAG_X64		BIT(DEV_X64)
+
+/* memory types */
+enum mem_type {
+	MEM_EMPTY = 0,		/* Empty csrow */
+	MEM_RESERVED,		/* Reserved csrow type */
+	MEM_UNKNOWN,		/* Unknown csrow type */
+	MEM_FPM,		/* Fast page mode */
+	MEM_EDO,		/* Extended data out */
+	MEM_BEDO,		/* Burst Extended data out */
+	MEM_SDR,		/* Single data rate SDRAM */
+	MEM_RDR,		/* Registered single data rate SDRAM */
+	MEM_DDR,		/* Double data rate SDRAM */
+	MEM_RDDR,		/* Registered Double data rate SDRAM */
+	MEM_RMBS,		/* Rambus DRAM */
+	MEM_DDR2,		/* DDR2 RAM */
+	MEM_FB_DDR2,		/* fully buffered DDR2 */
+	MEM_RDDR2,		/* Registered DDR2 RAM */
+	MEM_XDR,		/* Rambus XDR */
+	MEM_DDR3,		/* DDR3 RAM */
+	MEM_RDDR3,		/* Registered DDR3 RAM */
+};
+
+#define MEM_FLAG_EMPTY		BIT(MEM_EMPTY)
+#define MEM_FLAG_RESERVED	BIT(MEM_RESERVED)
+#define MEM_FLAG_UNKNOWN	BIT(MEM_UNKNOWN)
+#define MEM_FLAG_FPM		BIT(MEM_FPM)
+#define MEM_FLAG_EDO		BIT(MEM_EDO)
+#define MEM_FLAG_BEDO		BIT(MEM_BEDO)
+#define MEM_FLAG_SDR		BIT(MEM_SDR)
+#define MEM_FLAG_RDR		BIT(MEM_RDR)
+#define MEM_FLAG_DDR		BIT(MEM_DDR)
+#define MEM_FLAG_RDDR		BIT(MEM_RDDR)
+#define MEM_FLAG_RMBS		BIT(MEM_RMBS)
+#define MEM_FLAG_DDR2           BIT(MEM_DDR2)
+#define MEM_FLAG_FB_DDR2        BIT(MEM_FB_DDR2)
+#define MEM_FLAG_RDDR2          BIT(MEM_RDDR2)
+#define MEM_FLAG_XDR            BIT(MEM_XDR)
+#define MEM_FLAG_DDR3		 BIT(MEM_DDR3)
+#define MEM_FLAG_RDDR3		 BIT(MEM_RDDR3)
+
+/* chipset Error Detection and Correction capabilities and mode */
+enum edac_type {
+	EDAC_UNKNOWN = 0,	/* Unknown if ECC is available */
+	EDAC_NONE,		/* Doesnt support ECC */
+	EDAC_RESERVED,		/* Reserved ECC type */
+	EDAC_PARITY,		/* Detects parity errors */
+	EDAC_EC,		/* Error Checking - no correction */
+	EDAC_SECDED,		/* Single bit error correction, Double detection */
+	EDAC_S2ECD2ED,		/* Chipkill x2 devices - do these exist? */
+	EDAC_S4ECD4ED,		/* Chipkill x4 devices */
+	EDAC_S8ECD8ED,		/* Chipkill x8 devices */
+	EDAC_S16ECD16ED,	/* Chipkill x16 devices */
+};
+
+#define EDAC_FLAG_UNKNOWN	BIT(EDAC_UNKNOWN)
+#define EDAC_FLAG_NONE		BIT(EDAC_NONE)
+#define EDAC_FLAG_PARITY	BIT(EDAC_PARITY)
+#define EDAC_FLAG_EC		BIT(EDAC_EC)
+#define EDAC_FLAG_SECDED	BIT(EDAC_SECDED)
+#define EDAC_FLAG_S2ECD2ED	BIT(EDAC_S2ECD2ED)
+#define EDAC_FLAG_S4ECD4ED	BIT(EDAC_S4ECD4ED)
+#define EDAC_FLAG_S8ECD8ED	BIT(EDAC_S8ECD8ED)
+#define EDAC_FLAG_S16ECD16ED	BIT(EDAC_S16ECD16ED)
+
+/* scrubbing capabilities */
+enum scrub_type {
+	SCRUB_UNKNOWN = 0,	/* Unknown if scrubber is available */
+	SCRUB_NONE,		/* No scrubber */
+	SCRUB_SW_PROG,		/* SW progressive (sequential) scrubbing */
+	SCRUB_SW_SRC,		/* Software scrub only errors */
+	SCRUB_SW_PROG_SRC,	/* Progressive software scrub from an error */
+	SCRUB_SW_TUNABLE,	/* Software scrub frequency is tunable */
+	SCRUB_HW_PROG,		/* HW progressive (sequential) scrubbing */
+	SCRUB_HW_SRC,		/* Hardware scrub only errors */
+	SCRUB_HW_PROG_SRC,	/* Progressive hardware scrub from an error */
+	SCRUB_HW_TUNABLE	/* Hardware scrub frequency is tunable */
+};
+
+#define SCRUB_FLAG_SW_PROG	BIT(SCRUB_SW_PROG)
+#define SCRUB_FLAG_SW_SRC	BIT(SCRUB_SW_SRC)
+#define SCRUB_FLAG_SW_PROG_SRC	BIT(SCRUB_SW_PROG_SRC)
+#define SCRUB_FLAG_SW_TUN	BIT(SCRUB_SW_SCRUB_TUNABLE)
+#define SCRUB_FLAG_HW_PROG	BIT(SCRUB_HW_PROG)
+#define SCRUB_FLAG_HW_SRC	BIT(SCRUB_HW_SRC)
+#define SCRUB_FLAG_HW_PROG_SRC	BIT(SCRUB_HW_PROG_SRC)
+#define SCRUB_FLAG_HW_TUN	BIT(SCRUB_HW_TUNABLE)
+
+/* FIXME - should have notify capabilities: NMI, LOG, PROC, etc */
+
+/* EDAC internal operation states */
+#define	OP_ALLOC		0x100
+#define OP_RUNNING_POLL		0x201
+#define OP_RUNNING_INTERRUPT	0x202
+#define OP_RUNNING_POLL_INTR	0x203
+#define OP_OFFLINE		0x300
+
+/*
+ * There are several things to be aware of that aren't at all obvious:
+ *
+ *
+ * SOCKETS, SOCKET SETS, BANKS, ROWS, CHIP-SELECT ROWS, CHANNELS, etc..
+ *
+ * These are some of the many terms that are thrown about that don't always
+ * mean what people think they mean (Inconceivable!).  In the interest of
+ * creating a common ground for discussion, terms and their definitions
+ * will be established.
+ *
+ * Memory devices:	The individual chip on a memory stick.  These devices
+ *			commonly output 4 and 8 bits each.  Grouping several
+ *			of these in parallel provides 64 bits which is common
+ *			for a memory stick.
+ *
+ * Memory Stick:	A printed circuit board that agregates multiple
+ *			memory devices in parallel.  This is the atomic
+ *			memory component that is purchaseable by Joe consumer
+ *			and loaded into a memory socket.
+ *
+ * Socket:		A physical connector on the motherboard that accepts
+ *			a single memory stick.
+ *
+ * Channel:		Set of memory devices on a memory stick that must be
+ *			grouped in parallel with one or more additional
+ *			channels from other memory sticks.  This parallel
+ *			grouping of the output from multiple channels are
+ *			necessary for the smallest granularity of memory access.
+ *			Some memory controllers are capable of single channel -
+ *			which means that memory sticks can be loaded
+ *			individually.  Other memory controllers are only
+ *			capable of dual channel - which means that memory
+ *			sticks must be loaded as pairs (see "socket set").
+ *
+ * Chip-select row:	All of the memory devices that are selected together.
+ *			for a single, minimum grain of memory access.
+ *			This selects all of the parallel memory devices across
+ *			all of the parallel channels.  Common chip-select rows
+ *			for single channel are 64 bits, for dual channel 128
+ *			bits.
+ *
+ * Single-Ranked stick:	A Single-ranked stick has 1 chip-select row of memory.
+ *			Motherboards commonly drive two chip-select pins to
+ *			a memory stick. A single-ranked stick, will occupy
+ *			only one of those rows. The other will be unused.
+ *
+ * Double-Ranked stick:	A double-ranked stick has two chip-select rows which
+ *			access different sets of memory devices.  The two
+ *			rows cannot be accessed concurrently.
+ *
+ * Double-sided stick:	DEPRECATED TERM, see Double-Ranked stick.
+ *			A double-sided stick has two chip-select rows which
+ *			access different sets of memory devices.  The two
+ *			rows cannot be accessed concurrently.  "Double-sided"
+ *			is irrespective of the memory devices being mounted
+ *			on both sides of the memory stick.
+ *
+ * Socket set:		All of the memory sticks that are required for
+ *			a single memory access or all of the memory sticks
+ *			spanned by a chip-select row.  A single socket set
+ *			has two chip-select rows and if double-sided sticks
+ *			are used these will occupy those chip-select rows.
+ *
+ * Bank:		This term is avoided because it is unclear when
+ *			needing to distinguish between chip-select rows and
+ *			socket sets.
+ *
+ * Controller pages:
+ *
+ * Physical pages:
+ *
+ * Virtual pages:
+ *
+ *
+ * STRUCTURE ORGANIZATION AND CHOICES
+ *
+ *
+ *
+ * PS - I enjoyed writing all that about as much as you enjoyed reading it.
+ */
+
+struct channel_info {
+	int chan_idx;		/* channel index */
+	u32 ce_count;		/* Correctable Errors for this CHANNEL */
+	char label[EDAC_MC_LABEL_LEN + 1];	/* DIMM label on motherboard */
+	struct csrow_info *csrow;	/* the parent */
+};
+
+struct csrow_info {
+	unsigned long first_page;	/* first page number in dimm */
+	unsigned long last_page;	/* last page number in dimm */
+	unsigned long page_mask;	/* used for interleaving -
+					 * 0UL for non intlv
+					 */
+	u32 nr_pages;		/* number of pages in csrow */
+	u32 grain;		/* granularity of reported error in bytes */
+	int csrow_idx;		/* the chip-select row */
+	enum dev_type dtype;	/* memory device type */
+	u32 ue_count;		/* Uncorrectable Errors for this csrow */
+	u32 ce_count;		/* Correctable Errors for this csrow */
+	enum mem_type mtype;	/* memory csrow type */
+	enum edac_type edac_mode;	/* EDAC mode for this csrow */
+	struct mem_ctl_info *mci;	/* the parent */
+
+	struct kobject kobj;	/* sysfs kobject for this csrow */
+
+	/* channel information for this csrow */
+	u32 nr_channels;
+	struct channel_info *channels;
+};
+
+struct mcidev_sysfs_group {
+	const char *name;				/* group name */
+	const struct mcidev_sysfs_attribute *mcidev_attr; /* group attributes */
+};
+
+struct mcidev_sysfs_group_kobj {
+	struct list_head list;		/* list for all instances within a mc */
+
+	struct kobject kobj;		/* kobj for the group */
+
+	const struct mcidev_sysfs_group *grp;	/* group description table */
+	struct mem_ctl_info *mci;	/* the parent */
+};
+
+/* mcidev_sysfs_attribute structure
+ *	used for driver sysfs attributes and in mem_ctl_info
+ * 	sysfs top level entries
+ */
+struct mcidev_sysfs_attribute {
+	/* It should use either attr or grp */
+	struct attribute attr;
+	const struct mcidev_sysfs_group *grp;	/* Points to a group of attributes */
+
+	/* Ops for show/store values at the attribute - not used on group */
+        ssize_t (*show)(struct mem_ctl_info *,char *);
+        ssize_t (*store)(struct mem_ctl_info *, const char *,size_t);
+};
+
+/* MEMORY controller information structure
+ */
+struct mem_ctl_info {
+	struct list_head link;	/* for global list of mem_ctl_info structs */
+
+	struct module *owner;	/* Module owner of this control struct */
+
+	unsigned long mtype_cap;	/* memory types supported by mc */
+	unsigned long edac_ctl_cap;	/* Mem controller EDAC capabilities */
+	unsigned long edac_cap;	/* configuration capabilities - this is
+				 * closely related to edac_ctl_cap.  The
+				 * difference is that the controller may be
+				 * capable of s4ecd4ed which would be listed
+				 * in edac_ctl_cap, but if channels aren't
+				 * capable of s4ecd4ed then the edac_cap would
+				 * not have that capability.
+				 */
+	unsigned long scrub_cap;	/* chipset scrub capabilities */
+	enum scrub_type scrub_mode;	/* current scrub mode */
+
+	/* Translates sdram memory scrub rate given in bytes/sec to the
+	   internal representation and configures whatever else needs
+	   to be configured.
+	 */
+	int (*set_sdram_scrub_rate) (struct mem_ctl_info * mci, u32 bw);
+
+	/* Get the current sdram memory scrub rate from the internal
+	   representation and converts it to the closest matching
+	   bandwith in bytes/sec.
+	 */
+	int (*get_sdram_scrub_rate) (struct mem_ctl_info * mci);
+
+
+	/* pointer to edac checking routine */
+	void (*edac_check) (struct mem_ctl_info * mci);
+
+	/*
+	 * Remaps memory pages: controller pages to physical pages.
+	 * For most MC's, this will be NULL.
+	 */
+	/* FIXME - why not send the phys page to begin with? */
+	unsigned long (*ctl_page_to_phys) (struct mem_ctl_info * mci,
+					   unsigned long page);
+	int mc_idx;
+	int nr_csrows;
+	struct csrow_info *csrows;
+	/*
+	 * FIXME - what about controllers on other busses? - IDs must be
+	 * unique.  dev pointer should be sufficiently unique, but
+	 * BUS:SLOT.FUNC numbers may not be unique.
+	 */
+	struct device *dev;
+	const char *mod_name;
+	const char *mod_ver;
+	const char *ctl_name;
+	const char *dev_name;
+	char proc_name[MC_PROC_NAME_MAX_LEN + 1];
+	void *pvt_info;
+	u32 ue_noinfo_count;	/* Uncorrectable Errors w/o info */
+	u32 ce_noinfo_count;	/* Correctable Errors w/o info */
+	u32 ue_count;		/* Total Uncorrectable Errors for this MC */
+	u32 ce_count;		/* Total Correctable Errors for this MC */
+	unsigned long start_time;	/* mci load start time (in jiffies) */
+
+	/* this stuff is for safe removal of mc devices from global list while
+	 * NMI handlers may be traversing list
+	 */
+	struct rcu_head rcu;
+	struct completion complete;
+
+	/* edac sysfs device control */
+	struct kobject edac_mci_kobj;
+
+	/* list for all grp instances within a mc */
+	struct list_head grp_kobj_list;
+
+	/* Additional top controller level attributes, but specified
+	 * by the low level driver.
+	 *
+	 * Set by the low level driver to provide attributes at the
+	 * controller level, same level as 'ue_count' and 'ce_count' above.
+	 * An array of structures, NULL terminated
+	 *
+	 * If attributes are desired, then set to array of attributes
+	 * If no attributes are desired, leave NULL
+	 */
+	const struct mcidev_sysfs_attribute *mc_driver_sysfs_attributes;
+
+	/* work struct for this MC */
+	struct delayed_work work;
+
+	/* the internal state of this controller instance */
+	int op_state;
+};
+
 #endif
-- 
1.7.1



^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH RFC 1/2] edac: Move edac main structs to include/linux/edac.h
       [not found] <cover.1300996141.git.mchehab@redhat.com>
  2011-03-24 20:32 ` [PATCH RFC 2/2] events/hw_event: Create a Hardware Anomaly Report Mechanism (HARM) Mauro Carvalho Chehab
  2011-03-24 20:32 ` [PATCH RFC 1/2] edac: Move edac main structs to include/linux/edac.h Mauro Carvalho Chehab
@ 2011-03-24 20:54 ` Mauro Carvalho Chehab
  2 siblings, 0 replies; 20+ messages in thread
From: Mauro Carvalho Chehab @ 2011-03-24 20:54 UTC (permalink / raw)
  Cc: Linux Edac Mailing List, Linux Kernel Mailing List

As we'll need to use those structs for trace functions, they should
be on a more public place. So, move struct mem_ctl_info & friends
to edac.h.

No functional changes on this patch.

Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>

diff --git a/drivers/edac/edac_core.h b/drivers/edac/edac_core.h
index 3d96534..ef0738b 100644
--- a/drivers/edac/edac_core.h
+++ b/drivers/edac/edac_core.h
@@ -34,11 +34,10 @@
 #include <linux/platform_device.h>
 #include <linux/sysdev.h>
 #include <linux/workqueue.h>
+#include <linux/edac.h>
 
-#define EDAC_MC_LABEL_LEN	31
 #define EDAC_DEVICE_NAME_LEN	31
 #define EDAC_ATTRIB_VALUE_LEN	15
-#define MC_PROC_NAME_MAX_LEN	7
 
 #if PAGE_SHIFT < 20
 #define PAGES_TO_MiB(pages)	((pages) >> (20 - PAGE_SHIFT))
@@ -101,357 +100,6 @@ extern int edac_debug_level;
 
 #define edac_dev_name(dev) (dev)->dev_name
 
-/* memory devices */
-enum dev_type {
-	DEV_UNKNOWN = 0,
-	DEV_X1,
-	DEV_X2,
-	DEV_X4,
-	DEV_X8,
-	DEV_X16,
-	DEV_X32,		/* Do these parts exist? */
-	DEV_X64			/* Do these parts exist? */
-};
-
-#define DEV_FLAG_UNKNOWN	BIT(DEV_UNKNOWN)
-#define DEV_FLAG_X1		BIT(DEV_X1)
-#define DEV_FLAG_X2		BIT(DEV_X2)
-#define DEV_FLAG_X4		BIT(DEV_X4)
-#define DEV_FLAG_X8		BIT(DEV_X8)
-#define DEV_FLAG_X16		BIT(DEV_X16)
-#define DEV_FLAG_X32		BIT(DEV_X32)
-#define DEV_FLAG_X64		BIT(DEV_X64)
-
-/* memory types */
-enum mem_type {
-	MEM_EMPTY = 0,		/* Empty csrow */
-	MEM_RESERVED,		/* Reserved csrow type */
-	MEM_UNKNOWN,		/* Unknown csrow type */
-	MEM_FPM,		/* Fast page mode */
-	MEM_EDO,		/* Extended data out */
-	MEM_BEDO,		/* Burst Extended data out */
-	MEM_SDR,		/* Single data rate SDRAM */
-	MEM_RDR,		/* Registered single data rate SDRAM */
-	MEM_DDR,		/* Double data rate SDRAM */
-	MEM_RDDR,		/* Registered Double data rate SDRAM */
-	MEM_RMBS,		/* Rambus DRAM */
-	MEM_DDR2,		/* DDR2 RAM */
-	MEM_FB_DDR2,		/* fully buffered DDR2 */
-	MEM_RDDR2,		/* Registered DDR2 RAM */
-	MEM_XDR,		/* Rambus XDR */
-	MEM_DDR3,		/* DDR3 RAM */
-	MEM_RDDR3,		/* Registered DDR3 RAM */
-};
-
-#define MEM_FLAG_EMPTY		BIT(MEM_EMPTY)
-#define MEM_FLAG_RESERVED	BIT(MEM_RESERVED)
-#define MEM_FLAG_UNKNOWN	BIT(MEM_UNKNOWN)
-#define MEM_FLAG_FPM		BIT(MEM_FPM)
-#define MEM_FLAG_EDO		BIT(MEM_EDO)
-#define MEM_FLAG_BEDO		BIT(MEM_BEDO)
-#define MEM_FLAG_SDR		BIT(MEM_SDR)
-#define MEM_FLAG_RDR		BIT(MEM_RDR)
-#define MEM_FLAG_DDR		BIT(MEM_DDR)
-#define MEM_FLAG_RDDR		BIT(MEM_RDDR)
-#define MEM_FLAG_RMBS		BIT(MEM_RMBS)
-#define MEM_FLAG_DDR2           BIT(MEM_DDR2)
-#define MEM_FLAG_FB_DDR2        BIT(MEM_FB_DDR2)
-#define MEM_FLAG_RDDR2          BIT(MEM_RDDR2)
-#define MEM_FLAG_XDR            BIT(MEM_XDR)
-#define MEM_FLAG_DDR3		 BIT(MEM_DDR3)
-#define MEM_FLAG_RDDR3		 BIT(MEM_RDDR3)
-
-/* chipset Error Detection and Correction capabilities and mode */
-enum edac_type {
-	EDAC_UNKNOWN = 0,	/* Unknown if ECC is available */
-	EDAC_NONE,		/* Doesnt support ECC */
-	EDAC_RESERVED,		/* Reserved ECC type */
-	EDAC_PARITY,		/* Detects parity errors */
-	EDAC_EC,		/* Error Checking - no correction */
-	EDAC_SECDED,		/* Single bit error correction, Double detection */
-	EDAC_S2ECD2ED,		/* Chipkill x2 devices - do these exist? */
-	EDAC_S4ECD4ED,		/* Chipkill x4 devices */
-	EDAC_S8ECD8ED,		/* Chipkill x8 devices */
-	EDAC_S16ECD16ED,	/* Chipkill x16 devices */
-};
-
-#define EDAC_FLAG_UNKNOWN	BIT(EDAC_UNKNOWN)
-#define EDAC_FLAG_NONE		BIT(EDAC_NONE)
-#define EDAC_FLAG_PARITY	BIT(EDAC_PARITY)
-#define EDAC_FLAG_EC		BIT(EDAC_EC)
-#define EDAC_FLAG_SECDED	BIT(EDAC_SECDED)
-#define EDAC_FLAG_S2ECD2ED	BIT(EDAC_S2ECD2ED)
-#define EDAC_FLAG_S4ECD4ED	BIT(EDAC_S4ECD4ED)
-#define EDAC_FLAG_S8ECD8ED	BIT(EDAC_S8ECD8ED)
-#define EDAC_FLAG_S16ECD16ED	BIT(EDAC_S16ECD16ED)
-
-/* scrubbing capabilities */
-enum scrub_type {
-	SCRUB_UNKNOWN = 0,	/* Unknown if scrubber is available */
-	SCRUB_NONE,		/* No scrubber */
-	SCRUB_SW_PROG,		/* SW progressive (sequential) scrubbing */
-	SCRUB_SW_SRC,		/* Software scrub only errors */
-	SCRUB_SW_PROG_SRC,	/* Progressive software scrub from an error */
-	SCRUB_SW_TUNABLE,	/* Software scrub frequency is tunable */
-	SCRUB_HW_PROG,		/* HW progressive (sequential) scrubbing */
-	SCRUB_HW_SRC,		/* Hardware scrub only errors */
-	SCRUB_HW_PROG_SRC,	/* Progressive hardware scrub from an error */
-	SCRUB_HW_TUNABLE	/* Hardware scrub frequency is tunable */
-};
-
-#define SCRUB_FLAG_SW_PROG	BIT(SCRUB_SW_PROG)
-#define SCRUB_FLAG_SW_SRC	BIT(SCRUB_SW_SRC)
-#define SCRUB_FLAG_SW_PROG_SRC	BIT(SCRUB_SW_PROG_SRC)
-#define SCRUB_FLAG_SW_TUN	BIT(SCRUB_SW_SCRUB_TUNABLE)
-#define SCRUB_FLAG_HW_PROG	BIT(SCRUB_HW_PROG)
-#define SCRUB_FLAG_HW_SRC	BIT(SCRUB_HW_SRC)
-#define SCRUB_FLAG_HW_PROG_SRC	BIT(SCRUB_HW_PROG_SRC)
-#define SCRUB_FLAG_HW_TUN	BIT(SCRUB_HW_TUNABLE)
-
-/* FIXME - should have notify capabilities: NMI, LOG, PROC, etc */
-
-/* EDAC internal operation states */
-#define	OP_ALLOC		0x100
-#define OP_RUNNING_POLL		0x201
-#define OP_RUNNING_INTERRUPT	0x202
-#define OP_RUNNING_POLL_INTR	0x203
-#define OP_OFFLINE		0x300
-
-/*
- * There are several things to be aware of that aren't at all obvious:
- *
- *
- * SOCKETS, SOCKET SETS, BANKS, ROWS, CHIP-SELECT ROWS, CHANNELS, etc..
- *
- * These are some of the many terms that are thrown about that don't always
- * mean what people think they mean (Inconceivable!).  In the interest of
- * creating a common ground for discussion, terms and their definitions
- * will be established.
- *
- * Memory devices:	The individual chip on a memory stick.  These devices
- *			commonly output 4 and 8 bits each.  Grouping several
- *			of these in parallel provides 64 bits which is common
- *			for a memory stick.
- *
- * Memory Stick:	A printed circuit board that agregates multiple
- *			memory devices in parallel.  This is the atomic
- *			memory component that is purchaseable by Joe consumer
- *			and loaded into a memory socket.
- *
- * Socket:		A physical connector on the motherboard that accepts
- *			a single memory stick.
- *
- * Channel:		Set of memory devices on a memory stick that must be
- *			grouped in parallel with one or more additional
- *			channels from other memory sticks.  This parallel
- *			grouping of the output from multiple channels are
- *			necessary for the smallest granularity of memory access.
- *			Some memory controllers are capable of single channel -
- *			which means that memory sticks can be loaded
- *			individually.  Other memory controllers are only
- *			capable of dual channel - which means that memory
- *			sticks must be loaded as pairs (see "socket set").
- *
- * Chip-select row:	All of the memory devices that are selected together.
- *			for a single, minimum grain of memory access.
- *			This selects all of the parallel memory devices across
- *			all of the parallel channels.  Common chip-select rows
- *			for single channel are 64 bits, for dual channel 128
- *			bits.
- *
- * Single-Ranked stick:	A Single-ranked stick has 1 chip-select row of memory.
- *			Motherboards commonly drive two chip-select pins to
- *			a memory stick. A single-ranked stick, will occupy
- *			only one of those rows. The other will be unused.
- *
- * Double-Ranked stick:	A double-ranked stick has two chip-select rows which
- *			access different sets of memory devices.  The two
- *			rows cannot be accessed concurrently.
- *
- * Double-sided stick:	DEPRECATED TERM, see Double-Ranked stick.
- *			A double-sided stick has two chip-select rows which
- *			access different sets of memory devices.  The two
- *			rows cannot be accessed concurrently.  "Double-sided"
- *			is irrespective of the memory devices being mounted
- *			on both sides of the memory stick.
- *
- * Socket set:		All of the memory sticks that are required for
- *			a single memory access or all of the memory sticks
- *			spanned by a chip-select row.  A single socket set
- *			has two chip-select rows and if double-sided sticks
- *			are used these will occupy those chip-select rows.
- *
- * Bank:		This term is avoided because it is unclear when
- *			needing to distinguish between chip-select rows and
- *			socket sets.
- *
- * Controller pages:
- *
- * Physical pages:
- *
- * Virtual pages:
- *
- *
- * STRUCTURE ORGANIZATION AND CHOICES
- *
- *
- *
- * PS - I enjoyed writing all that about as much as you enjoyed reading it.
- */
-
-struct channel_info {
-	int chan_idx;		/* channel index */
-	u32 ce_count;		/* Correctable Errors for this CHANNEL */
-	char label[EDAC_MC_LABEL_LEN + 1];	/* DIMM label on motherboard */
-	struct csrow_info *csrow;	/* the parent */
-};
-
-struct csrow_info {
-	unsigned long first_page;	/* first page number in dimm */
-	unsigned long last_page;	/* last page number in dimm */
-	unsigned long page_mask;	/* used for interleaving -
-					 * 0UL for non intlv
-					 */
-	u32 nr_pages;		/* number of pages in csrow */
-	u32 grain;		/* granularity of reported error in bytes */
-	int csrow_idx;		/* the chip-select row */
-	enum dev_type dtype;	/* memory device type */
-	u32 ue_count;		/* Uncorrectable Errors for this csrow */
-	u32 ce_count;		/* Correctable Errors for this csrow */
-	enum mem_type mtype;	/* memory csrow type */
-	enum edac_type edac_mode;	/* EDAC mode for this csrow */
-	struct mem_ctl_info *mci;	/* the parent */
-
-	struct kobject kobj;	/* sysfs kobject for this csrow */
-
-	/* channel information for this csrow */
-	u32 nr_channels;
-	struct channel_info *channels;
-};
-
-struct mcidev_sysfs_group {
-	const char *name;				/* group name */
-	const struct mcidev_sysfs_attribute *mcidev_attr; /* group attributes */
-};
-
-struct mcidev_sysfs_group_kobj {
-	struct list_head list;		/* list for all instances within a mc */
-
-	struct kobject kobj;		/* kobj for the group */
-
-	const struct mcidev_sysfs_group *grp;	/* group description table */
-	struct mem_ctl_info *mci;	/* the parent */
-};
-
-/* mcidev_sysfs_attribute structure
- *	used for driver sysfs attributes and in mem_ctl_info
- * 	sysfs top level entries
- */
-struct mcidev_sysfs_attribute {
-	/* It should use either attr or grp */
-	struct attribute attr;
-	const struct mcidev_sysfs_group *grp;	/* Points to a group of attributes */
-
-	/* Ops for show/store values at the attribute - not used on group */
-        ssize_t (*show)(struct mem_ctl_info *,char *);
-        ssize_t (*store)(struct mem_ctl_info *, const char *,size_t);
-};
-
-/* MEMORY controller information structure
- */
-struct mem_ctl_info {
-	struct list_head link;	/* for global list of mem_ctl_info structs */
-
-	struct module *owner;	/* Module owner of this control struct */
-
-	unsigned long mtype_cap;	/* memory types supported by mc */
-	unsigned long edac_ctl_cap;	/* Mem controller EDAC capabilities */
-	unsigned long edac_cap;	/* configuration capabilities - this is
-				 * closely related to edac_ctl_cap.  The
-				 * difference is that the controller may be
-				 * capable of s4ecd4ed which would be listed
-				 * in edac_ctl_cap, but if channels aren't
-				 * capable of s4ecd4ed then the edac_cap would
-				 * not have that capability.
-				 */
-	unsigned long scrub_cap;	/* chipset scrub capabilities */
-	enum scrub_type scrub_mode;	/* current scrub mode */
-
-	/* Translates sdram memory scrub rate given in bytes/sec to the
-	   internal representation and configures whatever else needs
-	   to be configured.
-	 */
-	int (*set_sdram_scrub_rate) (struct mem_ctl_info * mci, u32 bw);
-
-	/* Get the current sdram memory scrub rate from the internal
-	   representation and converts it to the closest matching
-	   bandwith in bytes/sec.
-	 */
-	int (*get_sdram_scrub_rate) (struct mem_ctl_info * mci);
-
-
-	/* pointer to edac checking routine */
-	void (*edac_check) (struct mem_ctl_info * mci);
-
-	/*
-	 * Remaps memory pages: controller pages to physical pages.
-	 * For most MC's, this will be NULL.
-	 */
-	/* FIXME - why not send the phys page to begin with? */
-	unsigned long (*ctl_page_to_phys) (struct mem_ctl_info * mci,
-					   unsigned long page);
-	int mc_idx;
-	int nr_csrows;
-	struct csrow_info *csrows;
-	/*
-	 * FIXME - what about controllers on other busses? - IDs must be
-	 * unique.  dev pointer should be sufficiently unique, but
-	 * BUS:SLOT.FUNC numbers may not be unique.
-	 */
-	struct device *dev;
-	const char *mod_name;
-	const char *mod_ver;
-	const char *ctl_name;
-	const char *dev_name;
-	char proc_name[MC_PROC_NAME_MAX_LEN + 1];
-	void *pvt_info;
-	u32 ue_noinfo_count;	/* Uncorrectable Errors w/o info */
-	u32 ce_noinfo_count;	/* Correctable Errors w/o info */
-	u32 ue_count;		/* Total Uncorrectable Errors for this MC */
-	u32 ce_count;		/* Total Correctable Errors for this MC */
-	unsigned long start_time;	/* mci load start time (in jiffies) */
-
-	/* this stuff is for safe removal of mc devices from global list while
-	 * NMI handlers may be traversing list
-	 */
-	struct rcu_head rcu;
-	struct completion complete;
-
-	/* edac sysfs device control */
-	struct kobject edac_mci_kobj;
-
-	/* list for all grp instances within a mc */
-	struct list_head grp_kobj_list;
-
-	/* Additional top controller level attributes, but specified
-	 * by the low level driver.
-	 *
-	 * Set by the low level driver to provide attributes at the
-	 * controller level, same level as 'ue_count' and 'ce_count' above.
-	 * An array of structures, NULL terminated
-	 *
-	 * If attributes are desired, then set to array of attributes
-	 * If no attributes are desired, leave NULL
-	 */
-	const struct mcidev_sysfs_attribute *mc_driver_sysfs_attributes;
-
-	/* work struct for this MC */
-	struct delayed_work work;
-
-	/* the internal state of this controller instance */
-	int op_state;
-};
-
 /*
  * The following are the structures to provide for a generic
  * or abstract 'edac_device'. This set of structures and the
diff --git a/include/linux/edac.h b/include/linux/edac.h
index 36c6644..12d0b45 100644
--- a/include/linux/edac.h
+++ b/include/linux/edac.h
@@ -42,4 +42,358 @@ static inline void opstate_init(void)
 	return;
 }
 
+#define EDAC_MC_LABEL_LEN	31
+#define MC_PROC_NAME_MAX_LEN	7
+
+/* memory devices */
+enum dev_type {
+	DEV_UNKNOWN = 0,
+	DEV_X1,
+	DEV_X2,
+	DEV_X4,
+	DEV_X8,
+	DEV_X16,
+	DEV_X32,		/* Do these parts exist? */
+	DEV_X64			/* Do these parts exist? */
+};
+
+#define DEV_FLAG_UNKNOWN	BIT(DEV_UNKNOWN)
+#define DEV_FLAG_X1		BIT(DEV_X1)
+#define DEV_FLAG_X2		BIT(DEV_X2)
+#define DEV_FLAG_X4		BIT(DEV_X4)
+#define DEV_FLAG_X8		BIT(DEV_X8)
+#define DEV_FLAG_X16		BIT(DEV_X16)
+#define DEV_FLAG_X32		BIT(DEV_X32)
+#define DEV_FLAG_X64		BIT(DEV_X64)
+
+/* memory types */
+enum mem_type {
+	MEM_EMPTY = 0,		/* Empty csrow */
+	MEM_RESERVED,		/* Reserved csrow type */
+	MEM_UNKNOWN,		/* Unknown csrow type */
+	MEM_FPM,		/* Fast page mode */
+	MEM_EDO,		/* Extended data out */
+	MEM_BEDO,		/* Burst Extended data out */
+	MEM_SDR,		/* Single data rate SDRAM */
+	MEM_RDR,		/* Registered single data rate SDRAM */
+	MEM_DDR,		/* Double data rate SDRAM */
+	MEM_RDDR,		/* Registered Double data rate SDRAM */
+	MEM_RMBS,		/* Rambus DRAM */
+	MEM_DDR2,		/* DDR2 RAM */
+	MEM_FB_DDR2,		/* fully buffered DDR2 */
+	MEM_RDDR2,		/* Registered DDR2 RAM */
+	MEM_XDR,		/* Rambus XDR */
+	MEM_DDR3,		/* DDR3 RAM */
+	MEM_RDDR3,		/* Registered DDR3 RAM */
+};
+
+#define MEM_FLAG_EMPTY		BIT(MEM_EMPTY)
+#define MEM_FLAG_RESERVED	BIT(MEM_RESERVED)
+#define MEM_FLAG_UNKNOWN	BIT(MEM_UNKNOWN)
+#define MEM_FLAG_FPM		BIT(MEM_FPM)
+#define MEM_FLAG_EDO		BIT(MEM_EDO)
+#define MEM_FLAG_BEDO		BIT(MEM_BEDO)
+#define MEM_FLAG_SDR		BIT(MEM_SDR)
+#define MEM_FLAG_RDR		BIT(MEM_RDR)
+#define MEM_FLAG_DDR		BIT(MEM_DDR)
+#define MEM_FLAG_RDDR		BIT(MEM_RDDR)
+#define MEM_FLAG_RMBS		BIT(MEM_RMBS)
+#define MEM_FLAG_DDR2           BIT(MEM_DDR2)
+#define MEM_FLAG_FB_DDR2        BIT(MEM_FB_DDR2)
+#define MEM_FLAG_RDDR2          BIT(MEM_RDDR2)
+#define MEM_FLAG_XDR            BIT(MEM_XDR)
+#define MEM_FLAG_DDR3		 BIT(MEM_DDR3)
+#define MEM_FLAG_RDDR3		 BIT(MEM_RDDR3)
+
+/* chipset Error Detection and Correction capabilities and mode */
+enum edac_type {
+	EDAC_UNKNOWN = 0,	/* Unknown if ECC is available */
+	EDAC_NONE,		/* Doesnt support ECC */
+	EDAC_RESERVED,		/* Reserved ECC type */
+	EDAC_PARITY,		/* Detects parity errors */
+	EDAC_EC,		/* Error Checking - no correction */
+	EDAC_SECDED,		/* Single bit error correction, Double detection */
+	EDAC_S2ECD2ED,		/* Chipkill x2 devices - do these exist? */
+	EDAC_S4ECD4ED,		/* Chipkill x4 devices */
+	EDAC_S8ECD8ED,		/* Chipkill x8 devices */
+	EDAC_S16ECD16ED,	/* Chipkill x16 devices */
+};
+
+#define EDAC_FLAG_UNKNOWN	BIT(EDAC_UNKNOWN)
+#define EDAC_FLAG_NONE		BIT(EDAC_NONE)
+#define EDAC_FLAG_PARITY	BIT(EDAC_PARITY)
+#define EDAC_FLAG_EC		BIT(EDAC_EC)
+#define EDAC_FLAG_SECDED	BIT(EDAC_SECDED)
+#define EDAC_FLAG_S2ECD2ED	BIT(EDAC_S2ECD2ED)
+#define EDAC_FLAG_S4ECD4ED	BIT(EDAC_S4ECD4ED)
+#define EDAC_FLAG_S8ECD8ED	BIT(EDAC_S8ECD8ED)
+#define EDAC_FLAG_S16ECD16ED	BIT(EDAC_S16ECD16ED)
+
+/* scrubbing capabilities */
+enum scrub_type {
+	SCRUB_UNKNOWN = 0,	/* Unknown if scrubber is available */
+	SCRUB_NONE,		/* No scrubber */
+	SCRUB_SW_PROG,		/* SW progressive (sequential) scrubbing */
+	SCRUB_SW_SRC,		/* Software scrub only errors */
+	SCRUB_SW_PROG_SRC,	/* Progressive software scrub from an error */
+	SCRUB_SW_TUNABLE,	/* Software scrub frequency is tunable */
+	SCRUB_HW_PROG,		/* HW progressive (sequential) scrubbing */
+	SCRUB_HW_SRC,		/* Hardware scrub only errors */
+	SCRUB_HW_PROG_SRC,	/* Progressive hardware scrub from an error */
+	SCRUB_HW_TUNABLE	/* Hardware scrub frequency is tunable */
+};
+
+#define SCRUB_FLAG_SW_PROG	BIT(SCRUB_SW_PROG)
+#define SCRUB_FLAG_SW_SRC	BIT(SCRUB_SW_SRC)
+#define SCRUB_FLAG_SW_PROG_SRC	BIT(SCRUB_SW_PROG_SRC)
+#define SCRUB_FLAG_SW_TUN	BIT(SCRUB_SW_SCRUB_TUNABLE)
+#define SCRUB_FLAG_HW_PROG	BIT(SCRUB_HW_PROG)
+#define SCRUB_FLAG_HW_SRC	BIT(SCRUB_HW_SRC)
+#define SCRUB_FLAG_HW_PROG_SRC	BIT(SCRUB_HW_PROG_SRC)
+#define SCRUB_FLAG_HW_TUN	BIT(SCRUB_HW_TUNABLE)
+
+/* FIXME - should have notify capabilities: NMI, LOG, PROC, etc */
+
+/* EDAC internal operation states */
+#define	OP_ALLOC		0x100
+#define OP_RUNNING_POLL		0x201
+#define OP_RUNNING_INTERRUPT	0x202
+#define OP_RUNNING_POLL_INTR	0x203
+#define OP_OFFLINE		0x300
+
+/*
+ * There are several things to be aware of that aren't at all obvious:
+ *
+ *
+ * SOCKETS, SOCKET SETS, BANKS, ROWS, CHIP-SELECT ROWS, CHANNELS, etc..
+ *
+ * These are some of the many terms that are thrown about that don't always
+ * mean what people think they mean (Inconceivable!).  In the interest of
+ * creating a common ground for discussion, terms and their definitions
+ * will be established.
+ *
+ * Memory devices:	The individual chip on a memory stick.  These devices
+ *			commonly output 4 and 8 bits each.  Grouping several
+ *			of these in parallel provides 64 bits which is common
+ *			for a memory stick.
+ *
+ * Memory Stick:	A printed circuit board that agregates multiple
+ *			memory devices in parallel.  This is the atomic
+ *			memory component that is purchaseable by Joe consumer
+ *			and loaded into a memory socket.
+ *
+ * Socket:		A physical connector on the motherboard that accepts
+ *			a single memory stick.
+ *
+ * Channel:		Set of memory devices on a memory stick that must be
+ *			grouped in parallel with one or more additional
+ *			channels from other memory sticks.  This parallel
+ *			grouping of the output from multiple channels are
+ *			necessary for the smallest granularity of memory access.
+ *			Some memory controllers are capable of single channel -
+ *			which means that memory sticks can be loaded
+ *			individually.  Other memory controllers are only
+ *			capable of dual channel - which means that memory
+ *			sticks must be loaded as pairs (see "socket set").
+ *
+ * Chip-select row:	All of the memory devices that are selected together.
+ *			for a single, minimum grain of memory access.
+ *			This selects all of the parallel memory devices across
+ *			all of the parallel channels.  Common chip-select rows
+ *			for single channel are 64 bits, for dual channel 128
+ *			bits.
+ *
+ * Single-Ranked stick:	A Single-ranked stick has 1 chip-select row of memory.
+ *			Motherboards commonly drive two chip-select pins to
+ *			a memory stick. A single-ranked stick, will occupy
+ *			only one of those rows. The other will be unused.
+ *
+ * Double-Ranked stick:	A double-ranked stick has two chip-select rows which
+ *			access different sets of memory devices.  The two
+ *			rows cannot be accessed concurrently.
+ *
+ * Double-sided stick:	DEPRECATED TERM, see Double-Ranked stick.
+ *			A double-sided stick has two chip-select rows which
+ *			access different sets of memory devices.  The two
+ *			rows cannot be accessed concurrently.  "Double-sided"
+ *			is irrespective of the memory devices being mounted
+ *			on both sides of the memory stick.
+ *
+ * Socket set:		All of the memory sticks that are required for
+ *			a single memory access or all of the memory sticks
+ *			spanned by a chip-select row.  A single socket set
+ *			has two chip-select rows and if double-sided sticks
+ *			are used these will occupy those chip-select rows.
+ *
+ * Bank:		This term is avoided because it is unclear when
+ *			needing to distinguish between chip-select rows and
+ *			socket sets.
+ *
+ * Controller pages:
+ *
+ * Physical pages:
+ *
+ * Virtual pages:
+ *
+ *
+ * STRUCTURE ORGANIZATION AND CHOICES
+ *
+ *
+ *
+ * PS - I enjoyed writing all that about as much as you enjoyed reading it.
+ */
+
+struct channel_info {
+	int chan_idx;		/* channel index */
+	u32 ce_count;		/* Correctable Errors for this CHANNEL */
+	char label[EDAC_MC_LABEL_LEN + 1];	/* DIMM label on motherboard */
+	struct csrow_info *csrow;	/* the parent */
+};
+
+struct csrow_info {
+	unsigned long first_page;	/* first page number in dimm */
+	unsigned long last_page;	/* last page number in dimm */
+	unsigned long page_mask;	/* used for interleaving -
+					 * 0UL for non intlv
+					 */
+	u32 nr_pages;		/* number of pages in csrow */
+	u32 grain;		/* granularity of reported error in bytes */
+	int csrow_idx;		/* the chip-select row */
+	enum dev_type dtype;	/* memory device type */
+	u32 ue_count;		/* Uncorrectable Errors for this csrow */
+	u32 ce_count;		/* Correctable Errors for this csrow */
+	enum mem_type mtype;	/* memory csrow type */
+	enum edac_type edac_mode;	/* EDAC mode for this csrow */
+	struct mem_ctl_info *mci;	/* the parent */
+
+	struct kobject kobj;	/* sysfs kobject for this csrow */
+
+	/* channel information for this csrow */
+	u32 nr_channels;
+	struct channel_info *channels;
+};
+
+struct mcidev_sysfs_group {
+	const char *name;				/* group name */
+	const struct mcidev_sysfs_attribute *mcidev_attr; /* group attributes */
+};
+
+struct mcidev_sysfs_group_kobj {
+	struct list_head list;		/* list for all instances within a mc */
+
+	struct kobject kobj;		/* kobj for the group */
+
+	const struct mcidev_sysfs_group *grp;	/* group description table */
+	struct mem_ctl_info *mci;	/* the parent */
+};
+
+/* mcidev_sysfs_attribute structure
+ *	used for driver sysfs attributes and in mem_ctl_info
+ * 	sysfs top level entries
+ */
+struct mcidev_sysfs_attribute {
+	/* It should use either attr or grp */
+	struct attribute attr;
+	const struct mcidev_sysfs_group *grp;	/* Points to a group of attributes */
+
+	/* Ops for show/store values at the attribute - not used on group */
+        ssize_t (*show)(struct mem_ctl_info *,char *);
+        ssize_t (*store)(struct mem_ctl_info *, const char *,size_t);
+};
+
+/* MEMORY controller information structure
+ */
+struct mem_ctl_info {
+	struct list_head link;	/* for global list of mem_ctl_info structs */
+
+	struct module *owner;	/* Module owner of this control struct */
+
+	unsigned long mtype_cap;	/* memory types supported by mc */
+	unsigned long edac_ctl_cap;	/* Mem controller EDAC capabilities */
+	unsigned long edac_cap;	/* configuration capabilities - this is
+				 * closely related to edac_ctl_cap.  The
+				 * difference is that the controller may be
+				 * capable of s4ecd4ed which would be listed
+				 * in edac_ctl_cap, but if channels aren't
+				 * capable of s4ecd4ed then the edac_cap would
+				 * not have that capability.
+				 */
+	unsigned long scrub_cap;	/* chipset scrub capabilities */
+	enum scrub_type scrub_mode;	/* current scrub mode */
+
+	/* Translates sdram memory scrub rate given in bytes/sec to the
+	   internal representation and configures whatever else needs
+	   to be configured.
+	 */
+	int (*set_sdram_scrub_rate) (struct mem_ctl_info * mci, u32 bw);
+
+	/* Get the current sdram memory scrub rate from the internal
+	   representation and converts it to the closest matching
+	   bandwith in bytes/sec.
+	 */
+	int (*get_sdram_scrub_rate) (struct mem_ctl_info * mci);
+
+
+	/* pointer to edac checking routine */
+	void (*edac_check) (struct mem_ctl_info * mci);
+
+	/*
+	 * Remaps memory pages: controller pages to physical pages.
+	 * For most MC's, this will be NULL.
+	 */
+	/* FIXME - why not send the phys page to begin with? */
+	unsigned long (*ctl_page_to_phys) (struct mem_ctl_info * mci,
+					   unsigned long page);
+	int mc_idx;
+	int nr_csrows;
+	struct csrow_info *csrows;
+	/*
+	 * FIXME - what about controllers on other busses? - IDs must be
+	 * unique.  dev pointer should be sufficiently unique, but
+	 * BUS:SLOT.FUNC numbers may not be unique.
+	 */
+	struct device *dev;
+	const char *mod_name;
+	const char *mod_ver;
+	const char *ctl_name;
+	const char *dev_name;
+	char proc_name[MC_PROC_NAME_MAX_LEN + 1];
+	void *pvt_info;
+	u32 ue_noinfo_count;	/* Uncorrectable Errors w/o info */
+	u32 ce_noinfo_count;	/* Correctable Errors w/o info */
+	u32 ue_count;		/* Total Uncorrectable Errors for this MC */
+	u32 ce_count;		/* Total Correctable Errors for this MC */
+	unsigned long start_time;	/* mci load start time (in jiffies) */
+
+	/* this stuff is for safe removal of mc devices from global list while
+	 * NMI handlers may be traversing list
+	 */
+	struct rcu_head rcu;
+	struct completion complete;
+
+	/* edac sysfs device control */
+	struct kobject edac_mci_kobj;
+
+	/* list for all grp instances within a mc */
+	struct list_head grp_kobj_list;
+
+	/* Additional top controller level attributes, but specified
+	 * by the low level driver.
+	 *
+	 * Set by the low level driver to provide attributes at the
+	 * controller level, same level as 'ue_count' and 'ce_count' above.
+	 * An array of structures, NULL terminated
+	 *
+	 * If attributes are desired, then set to array of attributes
+	 * If no attributes are desired, leave NULL
+	 */
+	const struct mcidev_sysfs_attribute *mc_driver_sysfs_attributes;
+
+	/* work struct for this MC */
+	struct delayed_work work;
+
+	/* the internal state of this controller instance */
+	int op_state;
+};
+
 #endif
-- 
1.7.1



^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH RFC 2/2] events/hw_event: Create a Hardware Anomaly Report Mechanism (HARM)
  2011-03-24 20:32 ` [PATCH RFC 2/2] events/hw_event: Create a Hardware Anomaly Report Mechanism (HARM) Mauro Carvalho Chehab
@ 2011-03-24 22:39   ` Borislav Petkov
  2011-03-25 10:20     ` Mauro Carvalho Chehab
  2012-01-26 23:05     ` [PATCH 1/3] events/hw_event: Create a Hardware Events Report Mecanism (HERM) Mauro Carvalho Chehab
  0 siblings, 2 replies; 20+ messages in thread
From: Borislav Petkov @ 2011-03-24 22:39 UTC (permalink / raw)
  To: Mauro Carvalho Chehab; +Cc: Linux Edac Mailing List, Linux Kernel Mailing List

On Thu, Mar 24, 2011 at 05:32:57PM -0300, Mauro Carvalho Chehab wrote:
> Adds a trace class for handle hardware events
> 
> Part of the description bellow is shamelessly copied from Tony
> Luck's notes about the Hardware Error BoF during LPC 2010 [1].
> Tony, thanks for your notes and discussions to generate the
> h/w error reporting requirements.
> 
> [1] http://lwn.net/Articles/416669/
> 
>     We have several subsystems & methods for reporting hardware errors:
> 
>     1) EDAC ("Error Detection and Correction").  In its original form
>     this consisted of a platform specific driver that read topology
>     information and error counts from chipset registers and reported
>     the results via a sysfs interface.
> 
>     2) mcelog - x86 specific decoding of machine check bank registers
>     reporting in binary form via /dev/mcelog. Recent additions make use
>     of the APEI extensions that were documented in version 4.0a of the
>     ACPI specification to acquire more information about errors without
>     having to rely reading chipset registers directly. A user level
>     programs decodes into somewhat human readable format.
> 
>     3) drivers/edac/mce_amd.c  A recent addition - this driver hooks into
>     the mcelog path and decodes errors reported via machine check bank
>     registers in AMD processors to the console log using printk() [despite
>     being in the drivers/edac directory, this seems completely different
>     from classic EDAC to me].

Well, maybe it is time to rename drivers/edac/ to drivers/ras/ where all
RAS stuff should go.

[.. ]

> diff --git a/include/trace/events/hw_event.h b/include/trace/events/hw_event.h
> new file mode 100644
> index 0000000..a46ac61
> --- /dev/null
> +++ b/include/trace/events/hw_event.h
> @@ -0,0 +1,322 @@
> +#undef TRACE_SYSTEM
> +#define TRACE_SYSTEM hw_event
> +
> +#if !defined(_TRACE_HW_EVENT_MC_H) || defined(TRACE_HEADER_MULTI_READ)
> +#define _TRACE_HW_EVENT_MC_H
> +
> +#include <linux/tracepoint.h>
> +#include <linux/edac.h>
> +
> +/*
> + * Hardware Anomaly Report Mechanism (HARM) events
> + *
> + * Those events are generated when hardware detected a corrected or
> + * uncorrected event, and are meant to replace the current API to report
> + * errors defined on both EDAC and MCE subsystems.
> + */
> +
> +DECLARE_EVENT_CLASS(hw_event_class,
> +	TP_PROTO(const char *type, unsigned int instance),
> +	TP_ARGS(type, instance),
> +
> +	TP_STRUCT__entry(
> +		__field(	const char *,	type			)
> +		__field(	unsigned int,	instance		)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->type	= type;
> +		__entry->instance = instance;
> +	),
> +
> +	TP_printk("Initialized %s#%d\n",
> +		__entry->type,
> +		__entry->instance)
> +);
> +
> +/*
> + * This event indicates that a hardware collection mechanism is started
> + */
> +DEFINE_EVENT(hw_event_class, hw_event_init,
> +
> +	TP_PROTO(const char *type, unsigned int instance),
> +
> +	TP_ARGS(type, instance)
> +);
> +
> +
> +/*
> + * Memory Controller specific events
> + */

I think this is too fine-grained. You see, all those error records are
of type MCE so there's no need to have a trace event for corrected,
uncorrected, out of range etc. error types. You basically add a
flags argument to the trace_mce_record() tracepoint so that you can
differentiate between the different error records in the tracebuffer.
Then, you add additional fields like above for the MCEs which report a
DRAM ECC error.

IOW, what we need are two basic error records (tracepoints, etc.): MCEs
and PCI(e) errors which are derived from the hw_event_class.

Btw, I've played with the MCE tracepoint extension a bit and it looks
doable: http://lkml.org/lkml/2010/5/15/40.

-- 
Regards/Gruss,
    Boris.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH RFC 2/2] events/hw_event: Create a Hardware Anomaly Report Mechanism (HARM)
  2011-03-24 22:39   ` Borislav Petkov
@ 2011-03-25 10:20     ` Mauro Carvalho Chehab
  2011-03-25 14:13       ` Borislav Petkov
  2012-01-26 23:05     ` [PATCH 1/3] events/hw_event: Create a Hardware Events Report Mecanism (HERM) Mauro Carvalho Chehab
  1 sibling, 1 reply; 20+ messages in thread
From: Mauro Carvalho Chehab @ 2011-03-25 10:20 UTC (permalink / raw)
  To: Borislav Petkov, Linux Edac Mailing List, Linux Kernel Mailing List

Em 24-03-2011 19:39, Borislav Petkov escreveu:
> On Thu, Mar 24, 2011 at 05:32:57PM -0300, Mauro Carvalho Chehab wrote:
>> Adds a trace class for handle hardware events
>>
>> Part of the description bellow is shamelessly copied from Tony
>> Luck's notes about the Hardware Error BoF during LPC 2010 [1].
>> Tony, thanks for your notes and discussions to generate the
>> h/w error reporting requirements.
>>
>> [1] http://lwn.net/Articles/416669/
>>
>>     We have several subsystems & methods for reporting hardware errors:
>>
>>     1) EDAC ("Error Detection and Correction").  In its original form
>>     this consisted of a platform specific driver that read topology
>>     information and error counts from chipset registers and reported
>>     the results via a sysfs interface.
>>
>>     2) mcelog - x86 specific decoding of machine check bank registers
>>     reporting in binary form via /dev/mcelog. Recent additions make use
>>     of the APEI extensions that were documented in version 4.0a of the
>>     ACPI specification to acquire more information about errors without
>>     having to rely reading chipset registers directly. A user level
>>     programs decodes into somewhat human readable format.
>>
>>     3) drivers/edac/mce_amd.c  A recent addition - this driver hooks into
>>     the mcelog path and decodes errors reported via machine check bank
>>     registers in AMD processors to the console log using printk() [despite
>>     being in the drivers/edac directory, this seems completely different
>>     from classic EDAC to me].
> 
> Well, maybe it is time to rename drivers/edac/ to drivers/ras/ where all
> RAS stuff should go.

Maybe, but I think that there are still some steps to go before that.
> 
> [.. ]
> 
>> diff --git a/include/trace/events/hw_event.h b/include/trace/events/hw_event.h
>> new file mode 100644
>> index 0000000..a46ac61
>> --- /dev/null
>> +++ b/include/trace/events/hw_event.h
>> @@ -0,0 +1,322 @@
>> +#undef TRACE_SYSTEM
>> +#define TRACE_SYSTEM hw_event
>> +
>> +#if !defined(_TRACE_HW_EVENT_MC_H) || defined(TRACE_HEADER_MULTI_READ)
>> +#define _TRACE_HW_EVENT_MC_H
>> +
>> +#include <linux/tracepoint.h>
>> +#include <linux/edac.h>
>> +
>> +/*
>> + * Hardware Anomaly Report Mechanism (HARM) events
>> + *
>> + * Those events are generated when hardware detected a corrected or
>> + * uncorrected event, and are meant to replace the current API to report
>> + * errors defined on both EDAC and MCE subsystems.
>> + */
>> +
>> +DECLARE_EVENT_CLASS(hw_event_class,
>> +	TP_PROTO(const char *type, unsigned int instance),
>> +	TP_ARGS(type, instance),
>> +
>> +	TP_STRUCT__entry(
>> +		__field(	const char *,	type			)
>> +		__field(	unsigned int,	instance		)
>> +	),
>> +
>> +	TP_fast_assign(
>> +		__entry->type	= type;
>> +		__entry->instance = instance;
>> +	),
>> +
>> +	TP_printk("Initialized %s#%d\n",
>> +		__entry->type,
>> +		__entry->instance)
>> +);
>> +
>> +/*
>> + * This event indicates that a hardware collection mechanism is started
>> + */
>> +DEFINE_EVENT(hw_event_class, hw_event_init,
>> +
>> +	TP_PROTO(const char *type, unsigned int instance),
>> +
>> +	TP_ARGS(type, instance)
>> +);
>> +
>> +
>> +/*
>> + * Memory Controller specific events
>> + */
> 
> I think this is too fine-grained. You see, all those error records are
> of type MCE so there's no need to have a trace event for corrected,
> uncorrected, out of range etc. error types. You basically add a
> flags argument to the trace_mce_record() tracepoint so that you can
> differentiate between the different error records in the tracebuffer.
> Then, you add additional fields like above for the MCEs which report a
> DRAM ECC error.
> 
> IOW, what we need are two basic error records (tracepoints, etc.): MCEs
> and PCI(e) errors which are derived from the hw_event_class.
>
> Btw, I've played with the MCE tracepoint extension a bit and it looks
> doable: http://lkml.org/lkml/2010/5/15/40.
> 

As discussed on LPC, those are some requirements for the subsystem:

*) Architecture independent (both power and arm are potentially interested)

*) Report errors against human readable labels (e.g. using motherboard
   labels to identify DIMM or PCI slots).  This is hard (will often need
   some platform-specific mapping table to provide, or override, detailed
   information).

*) General interface available for any kind of h/w error report (e.g.
   device driver might use it for board level problems, or IPMI might
   report fan speed problems or over-temperature events).

*) Useful to make it easy to adapt existing EDAC drivers, machine-check
   bank decoders and other existing error reporters to use this new
   mechanism.

*) Robust - should not lose error information.  If the platform provides
   some sort of persistent storage, should make use of it to preserve
   details for fatal errors across reboot.  But may need some threshold
   mechanism that copes with floods of errors from a failed object.

*) Flexible: Errors may be discovered by polling, or reported by some
   interrupt/exception

People at the audience also commented that there are some other parts of the
Kernel that produce hardware errors and may also be interesting to map them
via perf, so grouping them together into just two types may not fit.

Also, as we want to have errors generated even for uncorrected errors that
can be fatal, and the report system should provide user-friendly error
reports, just printing a MCE code (and the MCE-specific data) is not enough:
the error should be parsed on kernel to avoid loosing fatal errors.

Maybe the way I mapped is too fine-grained, and we may want to group some
events together, but, on the other hand, having more events allow users
to filter some events that may not be relevant to them. For example, some
systems with i7300 memory controller, under certain circumstances (it seems
to be related to a bug at BIOS quick boot implementation), don't properly 
initialize the memory controller registers. The net result is that, on every 
one second (the poll interval of the edac driver), a false error report is 
produced. Having events fine-grained, users can just change the perf filter 
to discard the false alarms, but keeping the other hardware errors enabled.
 
In the specific case of MCE errors, I think we should create a new
hw_event pair that will provide the decoded info and the raw MCE info, on
a format like:

	Corrected Error %s at label "%s" (CPU: %d, MCGc/s: %llx/%llx, MC%d: %016Lx, ADDR/MISC: %016Lx/%016Lx, RIP: %02x:<%016Lx>, TSC: %llx, PROCESSOR: %u:%x, TIME: %llu, SOCKET: %u, APIC: % x)
	Uncorrected Error %s at label "%s" (CPU: %d, MCGc/s: %llx/%llx, MC%d: %016Lx, ADDR/MISC: %016Lx/%016Lx, RIP: %02x:<%016Lx>, TSC: %llx, PROCESSOR: %u:%x, TIME: %llu, SOCKET: %u, APIC: % x)

This way, the info that it is relevant to the system admin is clearly pointed
(error type and label), while hardware vendors may use the MCE data to better
analyse the issue.

Cheers,
Mauro.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH RFC 2/2] events/hw_event: Create a Hardware Anomaly Report Mechanism (HARM)
  2011-03-25 10:20     ` Mauro Carvalho Chehab
@ 2011-03-25 14:13       ` Borislav Petkov
  2011-03-25 21:22         ` Mauro Carvalho Chehab
  0 siblings, 1 reply; 20+ messages in thread
From: Borislav Petkov @ 2011-03-25 14:13 UTC (permalink / raw)
  To: Mauro Carvalho Chehab
  Cc: Borislav Petkov, Linux Edac Mailing List, Linux Kernel Mailing List

On Fri, Mar 25, 2011 at 07:20:48AM -0300, Mauro Carvalho Chehab wrote:

[..]

> > I think this is too fine-grained. You see, all those error records are
> > of type MCE so there's no need to have a trace event for corrected,
> > uncorrected, out of range etc. error types. You basically add a
> > flags argument to the trace_mce_record() tracepoint so that you can
> > differentiate between the different error records in the tracebuffer.
> > Then, you add additional fields like above for the MCEs which report a
> > DRAM ECC error.
> > 
> > IOW, what we need are two basic error records (tracepoints, etc.): MCEs
> > and PCI(e) errors which are derived from the hw_event_class.
> >
> > Btw, I've played with the MCE tracepoint extension a bit and it looks
> > doable: http://lkml.org/lkml/2010/5/15/40.
> > 
> 
> As discussed on LPC, those are some requirements for the subsystem:
> 
> *) Architecture independent (both power and arm are potentially interested)

I knew you'll play the arch-independent card :)

I don't know how well a single hw event would fit all architectures
though, since I don't know what error formats the others require.

We could make the hw_event different on every arch but use a superset of
arguments which we can wrap with a macro that picks up only the relevant
args on each arch.

> *) Report errors against human readable labels (e.g. using motherboard
>    labels to identify DIMM or PCI slots).  This is hard (will often need
>    some platform-specific mapping table to provide, or override, detailed
>    information).

That doesn't have anything to do with the hw event - a DRAM CE/UE error
is a MCE with certain bits set. You can have an additional field in the
MCE TP:

		__field(	const char *,	label			)

> *) General interface available for any kind of h/w error report (e.g.
>    device driver might use it for board level problems, or IPMI might
>    report fan speed problems or over-temperature events).

Those can be another tracepoint.

> *) Useful to make it easy to adapt existing EDAC drivers, machine-check
>    bank decoders and other existing error reporters to use this new
>    mechanism.
> 
> *) Robust - should not lose error information.  If the platform provides
>    some sort of persistent storage, should make use of it to preserve
>    details for fatal errors across reboot.  But may need some threshold
>    mechanism that copes with floods of errors from a failed object.

This can be done in userspace - a logging daemon which flushes the trace
buffers so that there's more room for new errors.

> *) Flexible: Errors may be discovered by polling, or reported by some
>    interrupt/exception

Although we should try to avoid polling as much as possible.

> People at the audience also commented that there are some other parts of the
> Kernel that produce hardware errors and may also be interesting to map them
> via perf, so grouping them together into just two types may not fit.
> 
> Also, as we want to have errors generated even for uncorrected errors that
> can be fatal, and the report system should provide user-friendly error
> reports, just printing a MCE code (and the MCE-specific data) is not enough:
> the error should be parsed on kernel to avoid loosing fatal errors.

This is already the case on AMD - we decode those. However, there's
another issue with fatal errors - you want to execute as less code as
possible in the wake of a fatal error. There are situations where an
MCE exception doesn't even call the MCE handler but simply stops the
machine completely. For such cases, persistent storage is our safest
bet. However, we still don't have a solution for clients like laptops
and desktops with no RAS features. We need to think about those too,
especially for debugging kernel oopses, suspend/resume, etc.

> Maybe the way I mapped is too fine-grained, and we may want to group some
> events together, but, on the other hand, having more events allow users
> to filter some events that may not be relevant to them. For example, some
> systems with i7300 memory controller, under certain circumstances (it seems
> to be related to a bug at BIOS quick boot implementation), don't properly 
> initialize the memory controller registers. The net result is that, on every 
> one second (the poll interval of the edac driver), a false error report is 
> produced. Having events fine-grained, users can just change the perf filter 
> to discard the false alarms, but keeping the other hardware errors enabled.
>  
> In the specific case of MCE errors, I think we should create a new
> hw_event pair that will provide the decoded info and the raw MCE info, on
> a format like:
> 
> 	Corrected Error %s at label "%s" (CPU: %d, MCGc/s: %llx/%llx, MC%d: %016Lx, ADDR/MISC: %016Lx/%016Lx, RIP: %02x:<%016Lx>, TSC: %llx, PROCESSOR: %u:%x, TIME: %llu, SOCKET: %u, APIC: % x)
> 	Uncorrected Error %s at label "%s" (CPU: %d, MCGc/s: %llx/%llx, MC%d: %016Lx, ADDR/MISC: %016Lx/%016Lx, RIP: %02x:<%016Lx>, TSC: %llx, PROCESSOR: %u:%x, TIME: %llu, SOCKET: %u, APIC: % x)

Why like this? This is the same tracepoint with almost all fields
repeated except a small difference which can be expressed with a single
bit: Corrected vs Uncorrected error.

> This way, the info that it is relevant to the system admin is clearly pointed
> (error type and label), while hardware vendors may use the MCE data to better
> analyse the issue.

So it all sounds like we need simply to expand the MCE tracepoint with
DIMM-related information and wrap it in an arch-agnostic macro in the
EDAC code. Other arches will hide their error sources behind it too
depending on how they read those errors from the hardware.

-- 
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Gemeinde Aschheim, Landkreis Muenchen
Registergericht Muenchen, HRB Nr. 43632

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH RFC 2/2] events/hw_event: Create a Hardware Anomaly Report Mechanism (HARM)
  2011-03-25 14:13       ` Borislav Petkov
@ 2011-03-25 21:22         ` Mauro Carvalho Chehab
  2011-03-25 22:37           ` Tony Luck
  2011-03-28 17:03           ` Borislav Petkov
  0 siblings, 2 replies; 20+ messages in thread
From: Mauro Carvalho Chehab @ 2011-03-25 21:22 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Borislav Petkov, Linux Edac Mailing List, Linux Kernel Mailing List

Em 25-03-2011 11:13, Borislav Petkov escreveu:
> On Fri, Mar 25, 2011 at 07:20:48AM -0300, Mauro Carvalho Chehab wrote:
> 
> [..]
> 
>>> I think this is too fine-grained. You see, all those error records are
>>> of type MCE so there's no need to have a trace event for corrected,
>>> uncorrected, out of range etc. error types. You basically add a
>>> flags argument to the trace_mce_record() tracepoint so that you can
>>> differentiate between the different error records in the tracebuffer.
>>> Then, you add additional fields like above for the MCEs which report a
>>> DRAM ECC error.
>>>
>>> IOW, what we need are two basic error records (tracepoints, etc.): MCEs
>>> and PCI(e) errors which are derived from the hw_event_class.
>>>
>>> Btw, I've played with the MCE tracepoint extension a bit and it looks
>>> doable: http://lkml.org/lkml/2010/5/15/40.
>>>
>>
>> As discussed on LPC, those are some requirements for the subsystem:
>>
>> *) Architecture independent (both power and arm are potentially interested)
> 
> I knew you'll play the arch-independent card :)

:) It is an important feature. Since the beginning, Unix differentiated
from other OS'ses because it provides layers to abstract for device and
architecture-specific implementations. Due to that, it become so successful.

> 
> I don't know how well a single hw event would fit all architectures
> though, since I don't know what error formats the others require.

The EDAC layer is architecture independent, as it decodes the errors internally
and provides a higher layer. But I see your point: there will be some events
that are dependent on the architecture. That's one of the reasons why we need
a class of events.

> We could make the hw_event different on every arch but use a superset of
> arguments which we can wrap with a macro that picks up only the relevant
> args on each arch.

Makes sense to me.

>> *) Report errors against human readable labels (e.g. using motherboard
>>    labels to identify DIMM or PCI slots).  This is hard (will often need
>>    some platform-specific mapping table to provide, or override, detailed
>>    information).
> 
> That doesn't have anything to do with the hw event - a DRAM CE/UE error
> is a MCE with certain bits set. You can have an additional field in the
> MCE TP:
> 
> 		__field(	const char *,	label			)

Assuming an architecture with MCE, I think it should provide a little more than just
the label: it should also provide the parsed information about the event.
In other words, I think that such error should have those two extra fields:

 		__field(	const char *,	msg			)
 		__field(	const char *,	label			)

where msg is a brief description about what type of error happened, extracted
from MCE log.

> 
>> *) General interface available for any kind of h/w error report (e.g.
>>    device driver might use it for board level problems, or IPMI might
>>    report fan speed problems or over-temperature events).
> 
> Those can be another tracepoint.

Yes. It makes sense though to group it at the hw_event class.

>> *) Useful to make it easy to adapt existing EDAC drivers, machine-check
>>    bank decoders and other existing error reporters to use this new
>>    mechanism.
>>
>> *) Robust - should not lose error information.  If the platform provides
>>    some sort of persistent storage, should make use of it to preserve
>>    details for fatal errors across reboot.  But may need some threshold
>>    mechanism that copes with floods of errors from a failed object.
> 
> This can be done in userspace - a logging daemon which flushes the trace
> buffers so that there's more room for new errors.

Yes, but it helps to have support for it on Kernel and hardware, in order to
avoid loose events that may happen before the daemon starts.

>> *) Flexible: Errors may be discovered by polling, or reported by some
>>    interrupt/exception
> 
> Although we should try to avoid polling as much as possible.

Agreed.

>> People at the audience also commented that there are some other parts of the
>> Kernel that produce hardware errors and may also be interesting to map them
>> via perf, so grouping them together into just two types may not fit.
>>
>> Also, as we want to have errors generated even for uncorrected errors that
>> can be fatal, and the report system should provide user-friendly error
>> reports, just printing a MCE code (and the MCE-specific data) is not enough:
>> the error should be parsed on kernel to avoid loosing fatal errors.
> 
> This is already the case on AMD - we decode those. 

Yes, and that's great. The i7core_edac driver also does that, but only for a 
subset of the existing errors and CPU's.

I think that moving the call to the tracepoint function on MCE to happen after
the AMD decode routine (for AMD), and adding a similar logic for Intel will
be the proper way to provide useful info to the system admin.

> However, there's
> another issue with fatal errors - you want to execute as less code as
> possible in the wake of a fatal error.

Yes. That's one of the reasons why it may make sense to have a separate event
for fatal errors.

> There are situations where an
> MCE exception doesn't even call the MCE handler but simply stops the
> machine completely. For such cases, persistent storage is our safest
> bet.

Agreed.

> However, we still don't have a solution for clients like laptops
> and desktops with no RAS features. We need to think about those too,
> especially for debugging kernel oopses, suspend/resume, etc.

It would be good to use some non-volatile ram for these. I was told that
APEI spec defines a way for that, but I'm not sure if low end machines would
be shipped with that.

>> Maybe the way I mapped is too fine-grained, and we may want to group some
>> events together, but, on the other hand, having more events allow users
>> to filter some events that may not be relevant to them. For example, some
>> systems with i7300 memory controller, under certain circumstances (it seems
>> to be related to a bug at BIOS quick boot implementation), don't properly 
>> initialize the memory controller registers. The net result is that, on every 
>> one second (the poll interval of the edac driver), a false error report is 
>> produced. Having events fine-grained, users can just change the perf filter 
>> to discard the false alarms, but keeping the other hardware errors enabled.
>>  
>> In the specific case of MCE errors, I think we should create a new
>> hw_event pair that will provide the decoded info and the raw MCE info, on
>> a format like:
>>
>> 	Corrected Error %s at label "%s" (CPU: %d, MCGc/s: %llx/%llx, MC%d: %016Lx, ADDR/MISC: %016Lx/%016Lx, RIP: %02x:<%016Lx>, TSC: %llx, PROCESSOR: %u:%x, TIME: %llu, SOCKET: %u, APIC: % x)
>> 	Uncorrected Error %s at label "%s" (CPU: %d, MCGc/s: %llx/%llx, MC%d: %016Lx, ADDR/MISC: %016Lx/%016Lx, RIP: %02x:<%016Lx>, TSC: %llx, PROCESSOR: %u:%x, TIME: %llu, SOCKET: %u, APIC: % x)
> 
> Why like this? This is the same tracepoint with almost all fields
> repeated except a small difference which can be expressed with a single
> bit: Corrected vs Uncorrected error.

It might be mapped as a bit, like:

 		__field(	bool,	is_uncorrected			)

and the printk might do something like:

	"%s ...",
	(is_uncorrected ? " Uncorrected" : "Corrected" )

However, this would prevent the capability of filtering just the UE events.
Also, we may eventually provide more information for the corrected case, as we may
execute more code on that situation.

I don't have a strong preference, but it seems better to use two different events.
It would be nice to hear other opinions about this subject.

>> This way, the info that it is relevant to the system admin is clearly pointed
>> (error type and label), while hardware vendors may use the MCE data to better
>> analyse the issue.
> 
> So it all sounds like we need simply to expand the MCE tracepoint with
> DIMM-related information and wrap it in an arch-agnostic macro in the
> EDAC code. Other arches will hide their error sources behind it too
> depending on how they read those errors from the hardware.

It seems so. By purpose, I decided to take the approach of touching first the
EDAC arch-agnostic events. With those mapped, it is easier to compare them with
the MCE traces and adjust both trace calls to get the best of the both words.

I think that, in the specific case of i7core_edac and amd64_edac, what we need
is to add a function at the EDAC core that will translate a MCE event into
a memory label, and call it from the arch-dependent code (so, inside the mce driver).

Alternatively, edac could fill a translation table, and the decoding code at
mce would be just a table retrieve routine (in order to speed-up translation,
in the case of fatal errors.

Cheers,
Mauro.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH RFC 2/2] events/hw_event: Create a Hardware Anomaly Report Mechanism (HARM)
  2011-03-25 21:22         ` Mauro Carvalho Chehab
@ 2011-03-25 22:37           ` Tony Luck
  2011-03-26 11:56             ` Mauro Carvalho Chehab
  2011-03-28 17:03           ` Borislav Petkov
  1 sibling, 1 reply; 20+ messages in thread
From: Tony Luck @ 2011-03-25 22:37 UTC (permalink / raw)
  To: Mauro Carvalho Chehab
  Cc: Borislav Petkov, Borislav Petkov, Linux Edac Mailing List,
	Linux Kernel Mailing List

On Fri, Mar 25, 2011 at 2:22 PM, Mauro Carvalho Chehab
<mchehab@redhat.com> wrote:
> Em 25-03-2011 11:13, Borislav Petkov escreveu:
>> However, there's
>> another issue with fatal errors - you want to execute as less code as
>> possible in the wake of a fatal error.
>
> Yes. That's one of the reasons why it may make sense to have a separate event
> for fatal errors.

We have three categories (severities):
1) Corrected - log these
2) Uncorrected-but-not-immediately-fatal - log these too
3) Fatal - all we can do with these is log to some persistent store (or
    to a serial console connected to a logging device). perf style event
    tracing doesn't help when all the userland daemons will never get a
    chance to run.

> It would be good to use some non-volatile ram for these. I was told that
> APEI spec defines a way for that, but I'm not sure if low end machines would
> be shipped with that.

You are talking about ERST - and you are right, this is generally not going
to be present on low-end machines.  drivers/acpi/apei/erst.c was accepted
in 2.6.35.  My /dev/pstore changes are in the current merge for 2.6.39 (but
currently only show dmesg traces to the user).

> Alternatively, edac could fill a translation table, and the decoding code at
> mce would be just a table retrieve routine (in order to speed-up translation,
> in the case of fatal errors.

Eventually the translation table should move above edac (to the drivers/ras/
area that Borislav suggested earlier?) so that both mce and edac can access.
I think we'll need this for some time as SMBIOS continues to disappoint
me with its inaccuracies.

-Tony

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH RFC 2/2] events/hw_event: Create a Hardware Anomaly Report Mechanism (HARM)
  2011-03-25 22:37           ` Tony Luck
@ 2011-03-26 11:56             ` Mauro Carvalho Chehab
  0 siblings, 0 replies; 20+ messages in thread
From: Mauro Carvalho Chehab @ 2011-03-26 11:56 UTC (permalink / raw)
  To: Tony Luck
  Cc: Borislav Petkov, Borislav Petkov, Linux Edac Mailing List,
	Linux Kernel Mailing List

Em 25-03-2011 19:37, Tony Luck escreveu:
> On Fri, Mar 25, 2011 at 2:22 PM, Mauro Carvalho Chehab
> <mchehab@redhat.com> wrote:
>> Em 25-03-2011 11:13, Borislav Petkov escreveu:
>>> However, there's
>>> another issue with fatal errors - you want to execute as less code as
>>> possible in the wake of a fatal error.
>>
>> Yes. That's one of the reasons why it may make sense to have a separate event
>> for fatal errors.
> 
> We have three categories (severities):
> 1) Corrected - log these
> 2) Uncorrected-but-not-immediately-fatal - log these too
> 3) Fatal - all we can do with these is log to some persistent store (or
>     to a serial console connected to a logging device). perf style event
>     tracing doesn't help when all the userland daemons will never get a
>     chance to run.

Ok. Assuming that fatal errors will be stored on some persistent way, on a next
boot, the daemon will be able to catch them. So, I think it would be a nice feature
to have 3 different trace events, in order to allow users to filter between them.
Alternatively, we may implement filtering capabilities on userspace, but as perf
has this already, I'm in favor of using what's there.

>> It would be good to use some non-volatile ram for these. I was told that
>> APEI spec defines a way for that, but I'm not sure if low end machines would
>> be shipped with that.
> 
> You are talking about ERST - and you are right, this is generally not going
> to be present on low-end machines.  drivers/acpi/apei/erst.c was accepted
> in 2.6.35.  My /dev/pstore changes are in the current merge for 2.6.39 (but
> currently only show dmesg traces to the user).

It makes sense to integrate it on perf, after we add there a way to recover
persistent data when the daemon starts.

>> Alternatively, edac could fill a translation table, and the decoding code at
>> mce would be just a table retrieve routine (in order to speed-up translation,
>> in the case of fatal errors.
> 
> Eventually the translation table should move above edac (to the drivers/ras/
> area that Borislav suggested earlier?) so that both mce and edac can access.
> I think we'll need this for some time as SMBIOS continues to disappoint
> me with its inaccuracies.

That makes sense to me. The translation table there is only for memories, currently.

The /ras table needs to be generic enough to cover other types of translation, like
for example, translating a cpu Kernel representation into a CPU socket label,
and a PCI BUS ID into a PCI slot number.

Mauro.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH RFC 2/2] events/hw_event: Create a Hardware Anomaly Report Mechanism (HARM)
  2011-03-25 21:22         ` Mauro Carvalho Chehab
  2011-03-25 22:37           ` Tony Luck
@ 2011-03-28 17:03           ` Borislav Petkov
  2011-03-28 19:44             ` Mauro Carvalho Chehab
  1 sibling, 1 reply; 20+ messages in thread
From: Borislav Petkov @ 2011-03-28 17:03 UTC (permalink / raw)
  To: Mauro Carvalho Chehab
  Cc: Borislav Petkov, Borislav Petkov, Linux Edac Mailing List,
	Linux Kernel Mailing List

On Fri, Mar 25, 2011 at 05:22:20PM -0400, Mauro Carvalho Chehab wrote:

[..]

> >> *) Report errors against human readable labels (e.g. using motherboard
> >>    labels to identify DIMM or PCI slots).  This is hard (will often need
> >>    some platform-specific mapping table to provide, or override, detailed
> >>    information).
> > 
> > That doesn't have anything to do with the hw event - a DRAM CE/UE error
> > is a MCE with certain bits set. You can have an additional field in the
> > MCE TP:
> > 
> > 		__field(	const char *,	label			)
> 
> Assuming an architecture with MCE, I think it should provide a little more than just
> the label: it should also provide the parsed information about the event.
> In other words, I think that such error should have those two extra fields:
> 
>  		__field(	const char *,	msg			)
>  		__field(	const char *,	label			)
> 
> where msg is a brief description about what type of error happened, extracted
> from MCE log.

Yep, either 'msg' or 'desc' or so which is the decoded error string. One
has to be careful when building the whole string but yes, this should
do.

[..]

> >> *) Useful to make it easy to adapt existing EDAC drivers, machine-check
> >>    bank decoders and other existing error reporters to use this new
> >>    mechanism.
> >>
> >> *) Robust - should not lose error information.  If the platform provides
> >>    some sort of persistent storage, should make use of it to preserve
> >>    details for fatal errors across reboot.  But may need some threshold
> >>    mechanism that copes with floods of errors from a failed object.
> > 
> > This can be done in userspace - a logging daemon which flushes the trace
> > buffers so that there's more room for new errors.
> 
> Yes, but it helps to have support for it on Kernel and hardware, in order to
> avoid loose events that may happen before the daemon starts.

Well, last time I checked there's only a small window between enabling
MCA on the machine and initializing perf buffers so we can use the
buffer in mce.c for temporary storage. Once perf initializes and
persistent MCE events get enabled, we copy the mce.c buffer into the
persistent MCE's buffer and done.

When the persistent storage or perf kernel buffers overflow and there's no
userspace consumer, we simply overwrite the oldest events - nothing much
we can do there.

[..]

> >> People at the audience also commented that there are some other parts of the
> >> Kernel that produce hardware errors and may also be interesting to map them
> >> via perf, so grouping them together into just two types may not fit.
> >>
> >> Also, as we want to have errors generated even for uncorrected errors that
> >> can be fatal, and the report system should provide user-friendly error
> >> reports, just printing a MCE code (and the MCE-specific data) is not enough:
> >> the error should be parsed on kernel to avoid loosing fatal errors.
> > 
> > This is already the case on AMD - we decode those. 
> 
> Yes, and that's great. The i7core_edac driver also does that, but only for a 
> subset of the existing errors and CPU's.
> 
> I think that moving the call to the tracepoint function on MCE to happen after
> the AMD decode routine (for AMD), and adding a similar logic for Intel will
> be the proper way to provide useful info to the system admin.

Yes.

You might want to split the decoding functionality out from i7core_edac
since the last deals with DRAM ECC errors only AFAICT and the first
deals with MCE errors in general.

> > However, there's
> > another issue with fatal errors - you want to execute as less code as
> > possible in the wake of a fatal error.
> 
> Yes. That's one of the reasons why it may make sense to have a separate event
> for fatal errors.

No, you don't need a different tracepoint for that. The most
conservative approach when you cannot recover from such an error is to
save raw error info to pstore and die, as Tony says in the other mail.
You decode it _after_ you've rebooted the machine. However, we still
haven't addressed clients with no pstore with such a strategy.

I don't know whether relaxing the approach and running into the decoding
code is still fine with all fatal errors, maybe that needs to be decided
on a case-by-case basis and since this probably will also be family- and
arch-specific, the whole decision code might get very hairy, very fast.

And yes, deferred errors should get decoded and reported - I think the
poisoning code should handle them just fine. But back to the issue, no
need for different TPs on x86 - one is just fine.

[..]

> > However, we still don't have a solution for clients like laptops
> > and desktops with no RAS features. We need to think about those too,
> > especially for debugging kernel oopses, suspend/resume, etc.
> 
> It would be good to use some non-volatile ram for these. I was told that
> APEI spec defines a way for that, but I'm not sure if low end machines would
> be shipped with that.

This is only ACPI 4.0 spec - you still need the hw device that
actually acts as a non-volatile storage. And if we don't have that
physically present, we can't save to pstore - we need a different
approach but I don't have the slightest idea what to do. The BIOS idea
(http://www.spinics.net/lists/linux-ide/msg40038.html) was neat in
theory, but it kinda sucked when trying to keep it generic enough and
working with all storage controllers.

> >> Maybe the way I mapped is too fine-grained, and we may want to group some
> >> events together, but, on the other hand, having more events allow users
> >> to filter some events that may not be relevant to them. For example, some
> >> systems with i7300 memory controller, under certain circumstances (it seems
> >> to be related to a bug at BIOS quick boot implementation), don't properly 
> >> initialize the memory controller registers. The net result is that, on every 
> >> one second (the poll interval of the edac driver), a false error report is 
> >> produced. Having events fine-grained, users can just change the perf filter 
> >> to discard the false alarms, but keeping the other hardware errors enabled.
> >>  
> >> In the specific case of MCE errors, I think we should create a new
> >> hw_event pair that will provide the decoded info and the raw MCE info, on
> >> a format like:
> >>
> >> 	Corrected Error %s at label "%s" (CPU: %d, MCGc/s: %llx/%llx, MC%d: %016Lx, ADDR/MISC: %016Lx/%016Lx, RIP: %02x:<%016Lx>, TSC: %llx, PROCESSOR: %u:%x, TIME: %llu, SOCKET: %u, APIC: % x)
> >> 	Uncorrected Error %s at label "%s" (CPU: %d, MCGc/s: %llx/%llx, MC%d: %016Lx, ADDR/MISC: %016Lx/%016Lx, RIP: %02x:<%016Lx>, TSC: %llx, PROCESSOR: %u:%x, TIME: %llu, SOCKET: %u, APIC: % x)

Again, this is the same information (all MCE fields are the same) except
CE/UE bit in MCi_STATUS. You don't need this differentiation because:

a) either we're fatal and we don't decode, or

b) we're deferred/correctable and we then decode but the decoding code
says what kind of error it is by looking at MCi_STATUS.

IOW, the tracepoint carries the raw info only and the decoding code adds
the proper error type in the const char *msg field.

> > 
> > Why like this? This is the same tracepoint with almost all fields
> > repeated except a small difference which can be expressed with a single
> > bit: Corrected vs Uncorrected error.
> 
> It might be mapped as a bit, like:
> 
>  		__field(	bool,	is_uncorrected			)
> 
> and the printk might do something like:
> 
> 	"%s ...",
> 	(is_uncorrected ? " Uncorrected" : "Corrected" )
> 
> However, this would prevent the capability of filtering just the UE events.
> Also, we may eventually provide more information for the corrected case, as we may
> execute more code on that situation.

Filtering can be done in userspace too. At least on those UCs which still keep
the machine alive.

[..]

> >> This way, the info that it is relevant to the system admin is clearly pointed
> >> (error type and label), while hardware vendors may use the MCE data to better
> >> analyse the issue.
> > 
> > So it all sounds like we need simply to expand the MCE tracepoint with
> > DIMM-related information and wrap it in an arch-agnostic macro in the
> > EDAC code. Other arches will hide their error sources behind it too
> > depending on how they read those errors from the hardware.
> 
> It seems so. By purpose, I decided to take the approach of touching first the
> EDAC arch-agnostic events. With those mapped, it is easier to compare them with
> the MCE traces and adjust both trace calls to get the best of the both words.
> 
> I think that, in the specific case of i7core_edac and amd64_edac, what we need
> is to add a function at the EDAC core that will translate a MCE event into
> a memory label, and call it from the arch-dependent code (so, inside the mce driver).
> 
> Alternatively, edac could fill a translation table, and the decoding code at
> mce would be just a table retrieve routine (in order to speed-up translation,
> in the case of fatal errors.

I'm afraid I don't understand this one completely. Remember that with
MCEs, you don't always have a memory label - MCEs report all kinds of
errors in the CPU core and not only DRAM errors.

Concerning the translation table, I thought about this too but this
might become too big too fast since you have a different set of MCE
signatures per family and, at the same time, families share some of the
MCE error signatures and with tables it might be hard/ugly to share
code. I think I'll worry about this only if MCE decoding becomes very
expensive - I haven't traced how long it takes but it should be too
small a time to measure since we don't do any special stuff except array
lookups and bit fiddling.

Thanks.

-- 
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Gemeinde Aschheim, Landkreis Muenchen
Registergericht Muenchen, HRB Nr. 43632

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH RFC 2/2] events/hw_event: Create a Hardware Anomaly Report Mechanism (HARM)
  2011-03-28 17:03           ` Borislav Petkov
@ 2011-03-28 19:44             ` Mauro Carvalho Chehab
  2011-03-30 17:27               ` Luck, Tony
  0 siblings, 1 reply; 20+ messages in thread
From: Mauro Carvalho Chehab @ 2011-03-28 19:44 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Borislav Petkov, Linux Edac Mailing List, Linux Kernel Mailing List

Em 28-03-2011 14:03, Borislav Petkov escreveu:
> On Fri, Mar 25, 2011 at 05:22:20PM -0400, Mauro Carvalho Chehab wrote:
> 
> [..]
> 
>>>> *) Report errors against human readable labels (e.g. using motherboard
>>>>    labels to identify DIMM or PCI slots).  This is hard (will often need
>>>>    some platform-specific mapping table to provide, or override, detailed
>>>>    information).
>>>
>>> That doesn't have anything to do with the hw event - a DRAM CE/UE error
>>> is a MCE with certain bits set. You can have an additional field in the
>>> MCE TP:
>>>
>>> 		__field(	const char *,	label			)
>>
>> Assuming an architecture with MCE, I think it should provide a little more than just
>> the label: it should also provide the parsed information about the event.
>> In other words, I think that such error should have those two extra fields:
>>
>>  		__field(	const char *,	msg			)
>>  		__field(	const char *,	label			)
>>
>> where msg is a brief description about what type of error happened, extracted
>> from MCE log.
> 
> Yep, either 'msg' or 'desc' or so which is the decoded error string. One
> has to be careful when building the whole string but yes, this should
> do.

'desc' is probably better. I'll use it on a version 2 of the patch series.

> [..]
> 
>>>> *) Useful to make it easy to adapt existing EDAC drivers, machine-check
>>>>    bank decoders and other existing error reporters to use this new
>>>>    mechanism.
>>>>
>>>> *) Robust - should not lose error information.  If the platform provides
>>>>    some sort of persistent storage, should make use of it to preserve
>>>>    details for fatal errors across reboot.  But may need some threshold
>>>>    mechanism that copes with floods of errors from a failed object.
>>>
>>> This can be done in userspace - a logging daemon which flushes the trace
>>> buffers so that there's more room for new errors.
>>
>> Yes, but it helps to have support for it on Kernel and hardware, in order to
>> avoid loose events that may happen before the daemon starts.
> 
> Well, last time I checked there's only a small window between enabling
> MCA on the machine and initializing perf buffers so we can use the
> buffer in mce.c for temporary storage. Once perf initializes and
> persistent MCE events get enabled, we copy the mce.c buffer into the
> persistent MCE's buffer and done.
> 
> When the persistent storage or perf kernel buffers overflow and there's no
> userspace consumer, we simply overwrite the oldest events - nothing much
> we can do there.

It seems ok to me.

> [..]
> 
>>>> People at the audience also commented that there are some other parts of the
>>>> Kernel that produce hardware errors and may also be interesting to map them
>>>> via perf, so grouping them together into just two types may not fit.
>>>>
>>>> Also, as we want to have errors generated even for uncorrected errors that
>>>> can be fatal, and the report system should provide user-friendly error
>>>> reports, just printing a MCE code (and the MCE-specific data) is not enough:
>>>> the error should be parsed on kernel to avoid loosing fatal errors.
>>>
>>> This is already the case on AMD - we decode those. 
>>
>> Yes, and that's great. The i7core_edac driver also does that, but only for a 
>> subset of the existing errors and CPU's.
>>
>> I think that moving the call to the tracepoint function on MCE to happen after
>> the AMD decode routine (for AMD), and adding a similar logic for Intel will
>> be the proper way to provide useful info to the system admin.
> 
> Yes.
> 
> You might want to split the decoding functionality out from i7core_edac
> since the last deals with DRAM ECC errors only AFAICT and the first
> deals with MCE errors in general.

Yes, that's my idea too.

>>> However, there's
>>> another issue with fatal errors - you want to execute as less code as
>>> possible in the wake of a fatal error.
>>
>> Yes. That's one of the reasons why it may make sense to have a separate event
>> for fatal errors.
> 
> No, you don't need a different tracepoint for that. The most
> conservative approach when you cannot recover from such an error is to
> save raw error info to pstore and die, as Tony says in the other mail.
> You decode it _after_ you've rebooted the machine. However, we still
> haven't addressed clients with no pstore with such a strategy.
> 
> I don't know whether relaxing the approach and running into the decoding
> code is still fine with all fatal errors, maybe that needs to be decided
> on a case-by-case basis and since this probably will also be family- and
> arch-specific, the whole decision code might get very hairy, very fast.
> 
> And yes, deferred errors should get decoded and reported - I think the
> poisoning code should handle them just fine. But back to the issue, no
> need for different TPs on x86 - one is just fine.

Ok.

> [..]
> 
>>> However, we still don't have a solution for clients like laptops
>>> and desktops with no RAS features. We need to think about those too,
>>> especially for debugging kernel oopses, suspend/resume, etc.
>>
>> It would be good to use some non-volatile ram for these. I was told that
>> APEI spec defines a way for that, but I'm not sure if low end machines would
>> be shipped with that.
> 
> This is only ACPI 4.0 spec - you still need the hw device that
> actually acts as a non-volatile storage. And if we don't have that
> physically present, we can't save to pstore - we need a different
> approach but I don't have the slightest idea what to do. The BIOS idea
> (http://www.spinics.net/lists/linux-ide/msg40038.html) was neat in
> theory, but it kinda sucked when trying to keep it generic enough and
> working with all storage controllers.

A INT13 persistent data saving will probably be too hard to implement for
all kind of storages. So, it will probably need lots of hack in order to
work. Also, some hardware may even not have any local hard disk. Not sure
if it is worth to spend time on it.

ACPI 4.0/APEI seems promising, provided that newer systems will use it and
will follow the standard. So, I think it is a good way to start with it. 

If the persistent "trace event panic cache" is properly encapsulated into some 
abstraction layer, latter patches may allow adding INT13 support where APEI
is not available, and it is safe to use it.

>>>> Maybe the way I mapped is too fine-grained, and we may want to group some
>>>> events together, but, on the other hand, having more events allow users
>>>> to filter some events that may not be relevant to them. For example, some
>>>> systems with i7300 memory controller, under certain circumstances (it seems
>>>> to be related to a bug at BIOS quick boot implementation), don't properly 
>>>> initialize the memory controller registers. The net result is that, on every 
>>>> one second (the poll interval of the edac driver), a false error report is 
>>>> produced. Having events fine-grained, users can just change the perf filter 
>>>> to discard the false alarms, but keeping the other hardware errors enabled.
>>>>  
>>>> In the specific case of MCE errors, I think we should create a new
>>>> hw_event pair that will provide the decoded info and the raw MCE info, on
>>>> a format like:
>>>>
>>>> 	Corrected Error %s at label "%s" (CPU: %d, MCGc/s: %llx/%llx, MC%d: %016Lx, ADDR/MISC: %016Lx/%016Lx, RIP: %02x:<%016Lx>, TSC: %llx, PROCESSOR: %u:%x, TIME: %llu, SOCKET: %u, APIC: % x)
>>>> 	Uncorrected Error %s at label "%s" (CPU: %d, MCGc/s: %llx/%llx, MC%d: %016Lx, ADDR/MISC: %016Lx/%016Lx, RIP: %02x:<%016Lx>, TSC: %llx, PROCESSOR: %u:%x, TIME: %llu, SOCKET: %u, APIC: % x)
> 
> Again, this is the same information (all MCE fields are the same) except
> CE/UE bit in MCi_STATUS. You don't need this differentiation because:
> 
> a) either we're fatal and we don't decode, or
> 
> b) we're deferred/correctable and we then decode but the decoding code
> says what kind of error it is by looking at MCi_STATUS.
> 
> IOW, the tracepoint carries the raw info only and the decoding code adds
> the proper error type in the const char *msg field.

Ok. I'll work on merging the traces into one trace for both CE and UE, where
this is possible.

>>>
>>> Why like this? This is the same tracepoint with almost all fields
>>> repeated except a small difference which can be expressed with a single
>>> bit: Corrected vs Uncorrected error.
>>
>> It might be mapped as a bit, like:
>>
>>  		__field(	bool,	is_uncorrected			)
>>
>> and the printk might do something like:
>>
>> 	"%s ...",
>> 	(is_uncorrected ? " Uncorrected" : "Corrected" )
>>
>> However, this would prevent the capability of filtering just the UE events.
>> Also, we may eventually provide more information for the corrected case, as we may
>> execute more code on that situation.
> 
> Filtering can be done in userspace too. At least on those UCs which still keep
> the machine alive.

Ok.

> [..]
> 
>>>> This way, the info that it is relevant to the system admin is clearly pointed
>>>> (error type and label), while hardware vendors may use the MCE data to better
>>>> analyse the issue.
>>>
>>> So it all sounds like we need simply to expand the MCE tracepoint with
>>> DIMM-related information and wrap it in an arch-agnostic macro in the
>>> EDAC code. Other arches will hide their error sources behind it too
>>> depending on how they read those errors from the hardware.
>>
>> It seems so. By purpose, I decided to take the approach of touching first the
>> EDAC arch-agnostic events. With those mapped, it is easier to compare them with
>> the MCE traces and adjust both trace calls to get the best of the both words.
>>
>> I think that, in the specific case of i7core_edac and amd64_edac, what we need
>> is to add a function at the EDAC core that will translate a MCE event into
>> a memory label, and call it from the arch-dependent code (so, inside the mce driver).
>>
>> Alternatively, edac could fill a translation table, and the decoding code at
>> mce would be just a table retrieve routine (in order to speed-up translation,
>> in the case of fatal errors.
> 
> I'm afraid I don't understand this one completely. Remember that with
> MCEs, you don't always have a memory label - MCEs report all kinds of
> errors in the CPU core and not only DRAM errors.

I'm referring to the memory-related MCE events. we need a memory label for those
(and, we'll need a PCI label/CPU socket label/... for other MCE events).

The current logic that converts a hardware-specific error into a memory label is
currently inside edac core (remember: we use the edac sysfs nodes to store the
labels). 

By moving the label decoding for MCE to happen outside the EDAC core (and keep 
using it for the other MC hardware), we'll need to change the way it is done 
right now.

> Concerning the translation table, I thought about this too but this
> might become too big too fast since you have a different set of MCE
> signatures per family and, at the same time, families share some of the
> MCE error signatures and with tables it might be hard/ugly to share
> code. I think I'll worry about this only if MCE decoding becomes very
> expensive - I haven't traced how long it takes but it should be too
> small a time to measure since we don't do any special stuff except array
> lookups and bit fiddling.

One thing needs to be common: the way userspace will fill the label translation
tables. Yet, the current way has two major issues:
	- Only memories are labeled;
	- It uses a fixed memory struct: csrow/channel/dimm.

The last one is currently broken, as modern types of RAM hardware don't use that 
structure anymore.

So, we'll need to work on it to fix both issues, and have a common code to allow 
filling those data structs and use it inside the hw error drivers.

Cheers,
Mauro.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: [PATCH RFC 2/2] events/hw_event: Create a Hardware Anomaly Report Mechanism (HARM)
  2011-03-28 19:44             ` Mauro Carvalho Chehab
@ 2011-03-30 17:27               ` Luck, Tony
  2011-03-30 17:51                 ` Borislav Petkov
  0 siblings, 1 reply; 20+ messages in thread
From: Luck, Tony @ 2011-03-30 17:27 UTC (permalink / raw)
  To: Mauro Carvalho Chehab, Borislav Petkov
  Cc: Borislav Petkov, Linux Edac Mailing List, Linux Kernel Mailing List

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset="utf-8", Size: 1272 bytes --]

> A INT13 persistent data saving will probably be too hard to implement for
> all kind of storages. So, it will probably need lots of hack in order to
> work. Also, some hardware may even not have any local hard disk. Not sure
> if it is worth to spend time on it.

Linus was pretty clear that he didn't like the idea when
it came up a few weeks ago in the:
 [PATCH 0/2][conce​pt RFC] x86: BIOS-save kernel log to disk upon panic
thread.

Linus ranted:
> Quite frankly, I'm not likely to _ever_ merge anything like this.
>
> Over the years, many people have tried to write things to disk on
> oops. I refuse to take it. No way in hell do I want the situation of
> "the system is screwed, so let's overwrite the disk" to be something
> the kernel I release might do. It's crazy. That disk is a lot more
> important than the kernel, and overwriting it when we might have
> serious memory corruption issues or something is not a thing I feel is
> appropriate.

I agree with him - writing to disk from a known broken kernel (using
a probably buggy BIOS) is a disaster waiting to happen to somebody's
disk.

-Tony

ÿôèº{.nÇ+‰·Ÿ®‰­†+%ŠËÿ±éݶ\x17¥Šwÿº{.nÇ+‰·¥Š{±þG«éÿŠ{ayº\x1dʇڙë,j\a­¢f£¢·hšïêÿ‘êçz_è®\x03(­éšŽŠÝ¢j"ú\x1a¶^[m§ÿÿ¾\a«þG«éÿ¢¸?™¨è­Ú&£ø§~á¶iO•æ¬z·švØ^\x14\x04\x1a¶^[m§ÿÿÃ\fÿ¶ìÿ¢¸?–I¥

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH RFC 2/2] events/hw_event: Create a Hardware Anomaly Report Mechanism (HARM)
  2011-03-30 17:27               ` Luck, Tony
@ 2011-03-30 17:51                 ` Borislav Petkov
  2011-03-30 18:30                   ` Francis St. Amant
  0 siblings, 1 reply; 20+ messages in thread
From: Borislav Petkov @ 2011-03-30 17:51 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Mauro Carvalho Chehab, Borislav Petkov, Borislav Petkov,
	Linux Edac Mailing List, Linux Kernel Mailing List

On Wed, Mar 30, 2011 at 01:27:43PM -0400, Luck, Tony wrote:
> > A INT13 persistent data saving will probably be too hard to implement for
> > all kind of storages. So, it will probably need lots of hack in order to
> > work. Also, some hardware may even not have any local hard disk. Not sure
> > if it is worth to spend time on it.
> 
> Linus was pretty clear that he didn't like the idea when
> it came up a few weeks ago in the:
>  [PATCH 0/2][conce​pt RFC] x86: BIOS-save kernel log to disk upon panic
> thread.
> 
> Linus ranted:
> > Quite frankly, I'm not likely to _ever_ merge anything like this.
> >
> > Over the years, many people have tried to write things to disk on
> > oops. I refuse to take it. No way in hell do I want the situation of
> > "the system is screwed, so let's overwrite the disk" to be something
> > the kernel I release might do. It's crazy. That disk is a lot more
> > important than the kernel, and overwriting it when we might have
> > serious memory corruption issues or something is not a thing I feel is
> > appropriate.
> 
> I agree with him - writing to disk from a known broken kernel (using
> a probably buggy BIOS) is a disaster waiting to happen to somebody's
> disk.

I can't say I don't agree. The reference to the BIOS method was simply
to say that we need some kind of a solution for saving OOPSen and other
error info on clients too, with no pers. storage hw. And frankly, and
AFAICT, we don't have one that would work reliably on existing hw right
now :(.

-- 
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Gemeinde Aschheim, Landkreis Muenchen
Registergericht Muenchen, HRB Nr. 43632

^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: [PATCH RFC 2/2] events/hw_event: Create a Hardware Anomaly Report Mechanism (HARM)
  2011-03-30 17:51                 ` Borislav Petkov
@ 2011-03-30 18:30                   ` Francis St. Amant
  2011-03-30 19:50                     ` Borislav Petkov
  0 siblings, 1 reply; 20+ messages in thread
From: Francis St. Amant @ 2011-03-30 18:30 UTC (permalink / raw)
  To: Borislav Petkov, Luck, Tony
  Cc: Mauro Carvalho Chehab, Borislav Petkov, Linux Edac Mailing List,
	Linux Kernel Mailing List

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset="utf-8", Size: 3235 bytes --]

Hi,

Sorry for using your BW but I wonder if you can do me a favor and remove Arthur Jones (Arthur.jones@riverbed.com; ajones@riverbed.com) from this list?  I tried to do that via the unsubscribe method at bottom but it didn't work.

Arthur has left Riverbed, and as his former manager I now get all his emails forwarded to me.  Since I'm not useful to this list, there's no reason to receive all these emails.  Probably my attempt to unsubscribe didn't work because the email came from me and I'm already not on the list--Arthur is.

Thanks very much,

- - - - - - - - - - - - - - - - - - - - - - - - - -
Francis St. Amant
Sr. Mgr., Platform/OEM Engineering
Riverbed Technology
francis@riverbed.com
Voice: 415-344-4480
FAX: 415-247-8801


-----Original Message-----
From: linux-edac-owner@vger.kernel.org [mailto:linux-edac-owner@vger.kernel.org] On Behalf Of Borislav Petkov
Sent: Wednesday, March 30, 2011 10:51 AM
To: Luck, Tony
Cc: Mauro Carvalho Chehab; Borislav Petkov; Borislav Petkov; Linux Edac Mailing List; Linux Kernel Mailing List
Subject: Re: [PATCH RFC 2/2] events/hw_event: Create a Hardware Anomaly Report Mechanism (HARM)

On Wed, Mar 30, 2011 at 01:27:43PM -0400, Luck, Tony wrote:
> > A INT13 persistent data saving will probably be too hard to implement for
> > all kind of storages. So, it will probably need lots of hack in order to
> > work. Also, some hardware may even not have any local hard disk. Not sure
> > if it is worth to spend time on it.
> 
> Linus was pretty clear that he didn't like the idea when
> it came up a few weeks ago in the:
>  [PATCH 0/2][conce​pt RFC] x86: BIOS-save kernel log to disk upon panic
> thread.
> 
> Linus ranted:
> > Quite frankly, I'm not likely to _ever_ merge anything like this.
> >
> > Over the years, many people have tried to write things to disk on
> > oops. I refuse to take it. No way in hell do I want the situation of
> > "the system is screwed, so let's overwrite the disk" to be something
> > the kernel I release might do. It's crazy. That disk is a lot more
> > important than the kernel, and overwriting it when we might have
> > serious memory corruption issues or something is not a thing I feel is
> > appropriate.
> 
> I agree with him - writing to disk from a known broken kernel (using
> a probably buggy BIOS) is a disaster waiting to happen to somebody's
> disk.

I can't say I don't agree. The reference to the BIOS method was simply
to say that we need some kind of a solution for saving OOPSen and other
error info on clients too, with no pers. storage hw. And frankly, and
AFAICT, we don't have one that would work reliably on existing hw right
now :(.

-- 
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Gemeinde Aschheim, Landkreis Muenchen
Registergericht Muenchen, HRB Nr. 43632
--
To unsubscribe from this list: send the line "unsubscribe linux-edac" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
ÿôèº{.nÇ+‰·Ÿ®‰­†+%ŠËÿ±éݶ\x17¥Šwÿº{.nÇ+‰·¥Š{±þG«éÿŠ{ayº\x1dʇڙë,j\a­¢f£¢·hšïêÿ‘êçz_è®\x03(­éšŽŠÝ¢j"ú\x1a¶^[m§ÿÿ¾\a«þG«éÿ¢¸?™¨è­Ú&£ø§~á¶iO•æ¬z·švØ^\x14\x04\x1a¶^[m§ÿÿÃ\fÿ¶ìÿ¢¸?–I¥

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH RFC 2/2] events/hw_event: Create a Hardware Anomaly Report Mechanism (HARM)
  2011-03-30 18:30                   ` Francis St. Amant
@ 2011-03-30 19:50                     ` Borislav Petkov
  2011-03-30 20:00                       ` Francis St. Amant
  0 siblings, 1 reply; 20+ messages in thread
From: Borislav Petkov @ 2011-03-30 19:50 UTC (permalink / raw)
  To: Francis St. Amant, David Miller
  Cc: Borislav Petkov, Luck, Tony, Mauro Carvalho Chehab,
	Borislav Petkov, Linux Edac Mailing List,
	Linux Kernel Mailing List

Right,

I guess you have to send from the subscribed mail address. Well, I don't
know how to unsubscribe in that case either, but maybe David can help
us.

David, can you please take a look?

Thanks.

On Wed, Mar 30, 2011 at 02:30:00PM -0400, Francis St. Amant wrote:
> Hi,
> 
> Sorry for using your BW but I wonder if you can do me a favor and remove Arthur Jones (Arthur.jones@riverbed.com; ajones@riverbed.com) from this list?  I tried to do that via the unsubscribe method at bottom but it didn't work.
> 
> Arthur has left Riverbed, and as his former manager I now get all his emails forwarded to me.  Since I'm not useful to this list, there's no reason to receive all these emails.  Probably my attempt to unsubscribe didn't work because the email came from me and I'm already not on the list--Arthur is.
> 
> Thanks very much,
> 
> - - - - - - - - - - - - - - - - - - - - - - - - - -
> Francis St. Amant
> Sr. Mgr., Platform/OEM Engineering
> Riverbed Technology
> francis@riverbed.com
> Voice: 415-344-4480
> FAX: 415-247-8801
> 
> 
> -----Original Message-----
> From: linux-edac-owner@vger.kernel.org [mailto:linux-edac-owner@vger.kernel.org] On Behalf Of Borislav Petkov
> Sent: Wednesday, March 30, 2011 10:51 AM
> To: Luck, Tony
> Cc: Mauro Carvalho Chehab; Borislav Petkov; Borislav Petkov; Linux Edac Mailing List; Linux Kernel Mailing List
> Subject: Re: [PATCH RFC 2/2] events/hw_event: Create a Hardware Anomaly Report Mechanism (HARM)
> 
> On Wed, Mar 30, 2011 at 01:27:43PM -0400, Luck, Tony wrote:
> > > A INT13 persistent data saving will probably be too hard to implement for
> > > all kind of storages. So, it will probably need lots of hack in order to
> > > work. Also, some hardware may even not have any local hard disk. Not sure
> > > if it is worth to spend time on it.
> > 
> > Linus was pretty clear that he didn't like the idea when
> > it came up a few weeks ago in the:
> >  [PATCH 0/2][conce​pt RFC] x86: BIOS-save kernel log to disk upon panic
> > thread.
> > 
> > Linus ranted:
> > > Quite frankly, I'm not likely to _ever_ merge anything like this.
> > >
> > > Over the years, many people have tried to write things to disk on
> > > oops. I refuse to take it. No way in hell do I want the situation of
> > > "the system is screwed, so let's overwrite the disk" to be something
> > > the kernel I release might do. It's crazy. That disk is a lot more
> > > important than the kernel, and overwriting it when we might have
> > > serious memory corruption issues or something is not a thing I feel is
> > > appropriate.
> > 
> > I agree with him - writing to disk from a known broken kernel (using
> > a probably buggy BIOS) is a disaster waiting to happen to somebody's
> > disk.
> 
> I can't say I don't agree. The reference to the BIOS method was simply
> to say that we need some kind of a solution for saving OOPSen and other
> error info on clients too, with no pers. storage hw. And frankly, and
> AFAICT, we don't have one that would work reliably on existing hw right
> now :(.
> 
> -- 
> Regards/Gruss,
> Boris.
> 
> Advanced Micro Devices GmbH
> Einsteinring 24, 85609 Dornach
> General Managers: Alberto Bozzo, Andrew Bowd
> Registration: Dornach, Gemeinde Aschheim, Landkreis Muenchen
> Registergericht Muenchen, HRB Nr. 43632
> --
> To unsubscribe from this list: send the line "unsubscribe linux-edac" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Gemeinde Aschheim, Landkreis Muenchen
Registergericht Muenchen, HRB Nr. 43632

^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: [PATCH RFC 2/2] events/hw_event: Create a Hardware Anomaly Report Mechanism (HARM)
  2011-03-30 19:50                     ` Borislav Petkov
@ 2011-03-30 20:00                       ` Francis St. Amant
  2011-03-31  7:43                         ` Borislav Petkov
  0 siblings, 1 reply; 20+ messages in thread
From: Francis St. Amant @ 2011-03-30 20:00 UTC (permalink / raw)
  To: Borislav Petkov, David Miller
  Cc: Luck, Tony, Mauro Carvalho Chehab, Borislav Petkov,
	Linux Edac Mailing List, Linux Kernel Mailing List

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset="utf-8", Size: 4712 bytes --]

If it's not easy from your side, I can go to our IT and ask if they can revive Arthur's account just to send this one unsubscribe :) 

FWIW on the below email, I do agree that some way to save OOPS output is very useful.  In our case Arthur created a special serial driver which wrote to EEPROM instead of console--besides not endangering the disk, a lot less SW has to be executed to save the data.  Then we fetch it back on next boot.

Thanks again,

  - Francis
 

-----Original Message-----
From: Borislav Petkov [mailto:bp@amd64.org] 
Sent: Wednesday, March 30, 2011 12:50 PM
To: Francis St. Amant; David Miller
Cc: Borislav Petkov; Luck, Tony; Mauro Carvalho Chehab; Borislav Petkov; Linux Edac Mailing List; Linux Kernel Mailing List
Subject: Re: [PATCH RFC 2/2] events/hw_event: Create a Hardware Anomaly Report Mechanism (HARM)

Right,

I guess you have to send from the subscribed mail address. Well, I don't
know how to unsubscribe in that case either, but maybe David can help
us.

David, can you please take a look?

Thanks.

On Wed, Mar 30, 2011 at 02:30:00PM -0400, Francis St. Amant wrote:
> Hi,
> 
> Sorry for using your BW but I wonder if you can do me a favor and remove Arthur Jones (Arthur.jones@riverbed.com; ajones@riverbed.com) from this list?  I tried to do that via the unsubscribe method at bottom but it didn't work.
> 
> Arthur has left Riverbed, and as his former manager I now get all his emails forwarded to me.  Since I'm not useful to this list, there's no reason to receive all these emails.  Probably my attempt to unsubscribe didn't work because the email came from me and I'm already not on the list--Arthur is.
> 
> Thanks very much,
> 
> - - - - - - - - - - - - - - - - - - - - - - - - - -
> Francis St. Amant
> Sr. Mgr., Platform/OEM Engineering
> Riverbed Technology
> francis@riverbed.com
> Voice: 415-344-4480
> FAX: 415-247-8801
> 
> 
> -----Original Message-----
> From: linux-edac-owner@vger.kernel.org [mailto:linux-edac-owner@vger.kernel.org] On Behalf Of Borislav Petkov
> Sent: Wednesday, March 30, 2011 10:51 AM
> To: Luck, Tony
> Cc: Mauro Carvalho Chehab; Borislav Petkov; Borislav Petkov; Linux Edac Mailing List; Linux Kernel Mailing List
> Subject: Re: [PATCH RFC 2/2] events/hw_event: Create a Hardware Anomaly Report Mechanism (HARM)
> 
> On Wed, Mar 30, 2011 at 01:27:43PM -0400, Luck, Tony wrote:
> > > A INT13 persistent data saving will probably be too hard to implement for
> > > all kind of storages. So, it will probably need lots of hack in order to
> > > work. Also, some hardware may even not have any local hard disk. Not sure
> > > if it is worth to spend time on it.
> > 
> > Linus was pretty clear that he didn't like the idea when
> > it came up a few weeks ago in the:
> >  [PATCH 0/2][conce​pt RFC] x86: BIOS-save kernel log to disk upon panic
> > thread.
> > 
> > Linus ranted:
> > > Quite frankly, I'm not likely to _ever_ merge anything like this.
> > >
> > > Over the years, many people have tried to write things to disk on
> > > oops. I refuse to take it. No way in hell do I want the situation of
> > > "the system is screwed, so let's overwrite the disk" to be something
> > > the kernel I release might do. It's crazy. That disk is a lot more
> > > important than the kernel, and overwriting it when we might have
> > > serious memory corruption issues or something is not a thing I feel is
> > > appropriate.
> > 
> > I agree with him - writing to disk from a known broken kernel (using
> > a probably buggy BIOS) is a disaster waiting to happen to somebody's
> > disk.
> 
> I can't say I don't agree. The reference to the BIOS method was simply
> to say that we need some kind of a solution for saving OOPSen and other
> error info on clients too, with no pers. storage hw. And frankly, and
> AFAICT, we don't have one that would work reliably on existing hw right
> now :(.
> 
> -- 
> Regards/Gruss,
> Boris.
> 
> Advanced Micro Devices GmbH
> Einsteinring 24, 85609 Dornach
> General Managers: Alberto Bozzo, Andrew Bowd
> Registration: Dornach, Gemeinde Aschheim, Landkreis Muenchen
> Registergericht Muenchen, HRB Nr. 43632
> --
> To unsubscribe from this list: send the line "unsubscribe linux-edac" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Gemeinde Aschheim, Landkreis Muenchen
Registergericht Muenchen, HRB Nr. 43632
ÿôèº{.nÇ+‰·Ÿ®‰­†+%ŠËÿ±éݶ\x17¥Šwÿº{.nÇ+‰·¥Š{±þG«éÿŠ{ayº\x1dʇڙë,j\a­¢f£¢·hšïêÿ‘êçz_è®\x03(­éšŽŠÝ¢j"ú\x1a¶^[m§ÿÿ¾\a«þG«éÿ¢¸?™¨è­Ú&£ø§~á¶iO•æ¬z·švØ^\x14\x04\x1a¶^[m§ÿÿÃ\fÿ¶ìÿ¢¸?–I¥

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH RFC 2/2] events/hw_event: Create a Hardware Anomaly Report Mechanism (HARM)
  2011-03-30 20:00                       ` Francis St. Amant
@ 2011-03-31  7:43                         ` Borislav Petkov
  0 siblings, 0 replies; 20+ messages in thread
From: Borislav Petkov @ 2011-03-31  7:43 UTC (permalink / raw)
  To: Francis St. Amant
  Cc: Borislav Petkov, David Miller, Luck, Tony, Mauro Carvalho Chehab,
	Linux Edac Mailing List, Linux Kernel Mailing List

On Wed, Mar 30, 2011 at 01:00:46PM -0700, Francis St. Amant wrote:
> If it's not easy from your side, I can go to our IT and ask if they
> can revive Arthur's account just to send this one unsubscribe :)

Well, if you have some idle time you could give that a try :).

> FWIW on the below email, I do agree that some way to save OOPS output
> is very useful. In our case Arthur created a special serial driver
> which wrote to EEPROM instead of console--besides not endangering the
> disk, a lot less SW has to be executed to save the data. Then we fetch
> it back on next boot.

Yeah, the problem is, we want to have this working for the majority of
machines out there. I'm assuming there should be some kind of EEPROM
memory available for BIOS to store its config or so which we could
hijack. But I don't know for sure - I'll have to research this a bit.
And besides, interfering with BIOS is almost always bound to fail.
Hmmm...

Thanks.

-- 
Regards/Gruss,
    Boris.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH 1/3] events/hw_event: Create a Hardware Events Report Mecanism (HERM)
  2011-03-24 22:39   ` Borislav Petkov
  2011-03-25 10:20     ` Mauro Carvalho Chehab
@ 2012-01-26 23:05     ` Mauro Carvalho Chehab
  2012-01-26 23:05       ` [PATCH 2/3] events/hw_event: use __string() trace macros for events Mauro Carvalho Chehab
  2012-01-26 23:05       ` [PATCH 3/3] hw_event: Consolidate uncorrected/corrected error msgs into one Mauro Carvalho Chehab
  1 sibling, 2 replies; 20+ messages in thread
From: Mauro Carvalho Chehab @ 2012-01-26 23:05 UTC (permalink / raw)
  Cc: Mauro Carvalho Chehab, Linux Edac Mailing List,
	Linux Kernel Mailing List

Adds a trace class for handle hardware events

Part of the description bellow is shamelessly copied from Tony
Luck's notes about the Hardware Error BoF during LPC 2010 [1].
Tony, thanks for your notes and discussions to generate the
h/w error reporting requirements.

[1] http://lwn.net/Articles/416669/

    We have several subsystems & methods for reporting hardware errors:

    1) EDAC ("Error Detection and Correction").  In its original form
    this consisted of a platform specific driver that read topology
    information and error counts from chipset registers and reported
    the results via a sysfs interface.

    2) mcelog - x86 specific decoding of machine check bank registers
    reporting in binary form via /dev/mcelog. Recent additions make use
    of the APEI extensions that were documented in version 4.0a of the
    ACPI specification to acquire more information about errors without
    having to rely reading chipset registers directly. A user level
    programs decodes into somewhat human readable format.

    3) drivers/edac/mce_amd.c - this driver hooks into the mcelog path and
    decodes errors reported via machine check bank registers in AMD
    processors to the console log using printk();

    Each of these mechanisms has a band of followers ... and none
    of them appear to meet all the needs of all users.

In order to provide a proper hardware event subsystem, let's
encapsulate hardware events into a common trace facility, and
make both edac and mce drivers to use it. After that, common
facilities can be moved into a new core for hardware events
reporting subsystem. This patch is the first of a series, and just
touches at mce.

Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
---
 drivers/edac/edac_mc.c          |   32 ++++
 include/trace/events/hw_event.h |  322 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 354 insertions(+), 0 deletions(-)
 create mode 100644 include/trace/events/hw_event.h

diff --git a/drivers/edac/edac_mc.c b/drivers/edac/edac_mc.c
index d69144a..2b8382e 100644
--- a/drivers/edac/edac_mc.c
+++ b/drivers/edac/edac_mc.c
@@ -34,6 +34,9 @@
 #include "edac_core.h"
 #include "edac_module.h"
 
+#define CREATE_TRACE_POINTS
+#include <trace/events/hw_event.h>
+
 /* lock to memory controller's control array */
 static DEFINE_MUTEX(mem_ctls_mutex);
 static LIST_HEAD(mc_devices);
@@ -224,6 +227,9 @@ struct mem_ctl_info *edac_mc_alloc(unsigned sz_pvt, unsigned nr_csrows,
 	 * which will perform kobj unregistration and the actual free
 	 * will occur during the kobject callback operation
 	 */
+
+	trace_hw_event_init("mce", (unsigned)edac_index);
+
 	return mci;
 }
 EXPORT_SYMBOL_GPL(edac_mc_alloc);
@@ -685,6 +691,7 @@ void edac_mc_handle_ce(struct mem_ctl_info *mci,
 	/* FIXME - maybe make panic on INTERNAL ERROR an option */
 	if (row >= mci->nr_csrows || row < 0) {
 		/* something is wrong */
+		trace_mc_out_of_range(mci, "CE", "row", row, 0, mci->nr_csrows);
 		edac_mc_printk(mci, KERN_ERR,
 			"INTERNAL ERROR: row out of range "
 			"(%d >= %d)\n", row, mci->nr_csrows);
@@ -694,6 +701,8 @@ void edac_mc_handle_ce(struct mem_ctl_info *mci,
 
 	if (channel >= mci->csrows[row].nr_channels || channel < 0) {
 		/* something is wrong */
+		trace_mc_out_of_range(mci, "CE", "channel", channel,
+				      0, mci->csrows[row].nr_channels);
 		edac_mc_printk(mci, KERN_ERR,
 			"INTERNAL ERROR: channel out of range "
 			"(%d >= %d)\n", channel,
@@ -702,6 +711,9 @@ void edac_mc_handle_ce(struct mem_ctl_info *mci,
 		return;
 	}
 
+	trace_mc_corrected_error(mci, page_frame_number, offset_in_page,
+				syndrome, row, channel, msg);
+
 	if (edac_mc_get_log_ce())
 		/* FIXME - put in DIMM location */
 		edac_mc_printk(mci, KERN_WARNING,
@@ -737,6 +749,7 @@ EXPORT_SYMBOL_GPL(edac_mc_handle_ce);
 
 void edac_mc_handle_ce_no_info(struct mem_ctl_info *mci, const char *msg)
 {
+	trace_mc_corrected_error_no_info(mci, msg);
 	if (edac_mc_get_log_ce())
 		edac_mc_printk(mci, KERN_WARNING,
 			"CE - no information available: %s\n", msg);
@@ -761,6 +774,8 @@ void edac_mc_handle_ue(struct mem_ctl_info *mci,
 	/* FIXME - maybe make panic on INTERNAL ERROR an option */
 	if (row >= mci->nr_csrows || row < 0) {
 		/* something is wrong */
+		trace_mc_out_of_range(mci, "UE", "row", row,
+				      0, mci->nr_csrows);
 		edac_mc_printk(mci, KERN_ERR,
 			"INTERNAL ERROR: row out of range "
 			"(%d >= %d)\n", row, mci->nr_csrows);
@@ -781,6 +796,8 @@ void edac_mc_handle_ue(struct mem_ctl_info *mci,
 		pos += chars;
 	}
 
+	trace_mc_uncorrected_error(mci, page_frame_number, offset_in_page,
+				row, msg, labels);
 	if (edac_mc_get_log_ue())
 		edac_mc_printk(mci, KERN_EMERG,
 			"UE page 0x%lx, offset 0x%lx, grain %d, row %d, "
@@ -801,6 +818,7 @@ EXPORT_SYMBOL_GPL(edac_mc_handle_ue);
 
 void edac_mc_handle_ue_no_info(struct mem_ctl_info *mci, const char *msg)
 {
+	trace_mc_uncorrected_error_no_info(mci, msg);
 	if (edac_mc_get_panic_on_ue())
 		panic("EDAC MC%d: Uncorrected Error", mci->mc_idx);
 
@@ -828,6 +846,9 @@ void edac_mc_handle_fbd_ue(struct mem_ctl_info *mci,
 
 	if (csrow >= mci->nr_csrows) {
 		/* something is wrong */
+
+		trace_mc_out_of_range(mci, "UE FBDIMM", "row", csrow,
+				      0, mci->nr_csrows);
 		edac_mc_printk(mci, KERN_ERR,
 			"INTERNAL ERROR: row out of range (%d >= %d)\n",
 			csrow, mci->nr_csrows);
@@ -837,6 +858,8 @@ void edac_mc_handle_fbd_ue(struct mem_ctl_info *mci,
 
 	if (channela >= mci->csrows[csrow].nr_channels) {
 		/* something is wrong */
+		trace_mc_out_of_range(mci, "UE FBDIMM", "channel-a", channela,
+				      0, mci->csrows[csrow].nr_channels);
 		edac_mc_printk(mci, KERN_ERR,
 			"INTERNAL ERROR: channel-a out of range "
 			"(%d >= %d)\n",
@@ -847,6 +870,8 @@ void edac_mc_handle_fbd_ue(struct mem_ctl_info *mci,
 
 	if (channelb >= mci->csrows[csrow].nr_channels) {
 		/* something is wrong */
+		trace_mc_out_of_range(mci, "UE FBDIMM", "channel-b", channelb,
+				      0, mci->csrows[csrow].nr_channels);
 		edac_mc_printk(mci, KERN_ERR,
 			"INTERNAL ERROR: channel-b out of range "
 			"(%d >= %d)\n",
@@ -866,6 +891,8 @@ void edac_mc_handle_fbd_ue(struct mem_ctl_info *mci,
 	chars = snprintf(pos, len + 1, "-%s",
 			 mci->csrows[csrow].channels[channelb].label);
 
+	trace_mc_uncorrected_error_fbd(mci, csrow, channela, channelb,
+				       msg, labels);
 	if (edac_mc_get_log_ue())
 		edac_mc_printk(mci, KERN_EMERG,
 			"UE row %d, channel-a= %d channel-b= %d "
@@ -890,6 +917,8 @@ void edac_mc_handle_fbd_ce(struct mem_ctl_info *mci,
 	/* Ensure boundary values */
 	if (csrow >= mci->nr_csrows) {
 		/* something is wrong */
+		trace_mc_out_of_range(mci, "CE FBDIMM", "row", csrow,
+				      0, mci->nr_csrows);
 		edac_mc_printk(mci, KERN_ERR,
 			"INTERNAL ERROR: row out of range (%d >= %d)\n",
 			csrow, mci->nr_csrows);
@@ -898,6 +927,8 @@ void edac_mc_handle_fbd_ce(struct mem_ctl_info *mci,
 	}
 	if (channel >= mci->csrows[csrow].nr_channels) {
 		/* something is wrong */
+		trace_mc_out_of_range(mci, "UE FBDIMM", "channel", channel,
+				      0, mci->csrows[csrow].nr_channels);
 		edac_mc_printk(mci, KERN_ERR,
 			"INTERNAL ERROR: channel out of range (%d >= %d)\n",
 			channel, mci->csrows[csrow].nr_channels);
@@ -905,6 +936,7 @@ void edac_mc_handle_fbd_ce(struct mem_ctl_info *mci,
 		return;
 	}
 
+	trace_mc_corrected_error_fbd(mci, csrow, channel, msg);
 	if (edac_mc_get_log_ce())
 		/* FIXME - put in DIMM location */
 		edac_mc_printk(mci, KERN_WARNING,
diff --git a/include/trace/events/hw_event.h b/include/trace/events/hw_event.h
new file mode 100644
index 0000000..a46ac61
--- /dev/null
+++ b/include/trace/events/hw_event.h
@@ -0,0 +1,322 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM hw_event
+
+#if !defined(_TRACE_HW_EVENT_MC_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_HW_EVENT_MC_H
+
+#include <linux/tracepoint.h>
+#include <linux/edac.h>
+
+/*
+ * Hardware Anomaly Report Mecanism (HARM) events
+ *
+ * Those events are generated when hardware detected a corrected or
+ * uncorrected event, and are meant to replace the current API to report
+ * errors defined on both EDAC and MCE subsystems.
+ */
+
+DECLARE_EVENT_CLASS(hw_event_class,
+	TP_PROTO(const char *type, unsigned int instance),
+	TP_ARGS(type, instance),
+
+	TP_STRUCT__entry(
+		__field(	const char *,	type			)
+		__field(	unsigned int,	instance		)
+	),
+
+	TP_fast_assign(
+		__entry->type	= type;
+		__entry->instance = instance;
+	),
+
+	TP_printk("Initialized %s#%d\n",
+		__entry->type,
+		__entry->instance)
+);
+
+/*
+ * This event indicates that a hardware collection mechanism is started
+ */
+DEFINE_EVENT(hw_event_class, hw_event_init,
+
+	TP_PROTO(const char *type, unsigned int instance),
+
+	TP_ARGS(type, instance)
+);
+
+
+/*
+ * Memory Controller specific events
+ */
+
+/*
+ * Default error mechanisms for Memory Controller errors (CE and UE)
+ */
+TRACE_EVENT(mc_corrected_error,
+
+	TP_PROTO(struct mem_ctl_info *mci,
+		unsigned long page_frame_number,
+		unsigned long offset_in_page, unsigned long syndrome,
+		int row, int channel, const char *msg),
+
+	TP_ARGS(mci, page_frame_number, offset_in_page, syndrome, row,
+		channel, msg),
+
+	TP_STRUCT__entry(
+		__field(	unsigned int,	mc_index		)
+		__field(	unsigned long,	page_frame_number	)
+		__field(	unsigned long,	offset_in_page		)
+		__field(	u32,		grain			)
+		__field(	unsigned long,	syndrome		)
+		__field(	int,		row			)
+		__field(	int,		channel			)
+		__field(	const char *,	label			)
+		__field(	const char *,	msg			)
+	),
+
+	TP_fast_assign(
+		__entry->mc_index		= mci->mc_idx;
+		__entry->page_frame_number	= page_frame_number;
+		__entry->offset_in_page		= offset_in_page;
+		__entry->grain			= mci->csrows[row].grain;
+		__entry->syndrome		= syndrome;
+		__entry->row			= row;
+		__entry->channel		= channel;
+		__entry->label			= mci->csrows[row].channels[channel].label;
+		__entry->msg			= msg;
+	),
+
+	TP_printk(HW_ERR "mce#%d: Corrected error %s on label \"%s\" "
+			 "(page 0x%lux, offset 0x%lux, grain %ud, "
+			 "syndrome 0x%lux, row %d, channel %d)\n",
+		__entry->mc_index,
+		__entry->msg,
+		__entry->label,
+		__entry->page_frame_number,
+		__entry->offset_in_page,
+		__entry->grain,
+		__entry->syndrome,
+		__entry->row,
+		__entry->channel)
+);
+
+TRACE_EVENT(mc_uncorrected_error,
+
+	TP_PROTO(struct mem_ctl_info *mci,
+		unsigned long page_frame_number,
+		unsigned long offset_in_page,
+		int row, const char *msg, const char *label),
+
+	TP_ARGS(mci, page_frame_number, offset_in_page,
+		row, msg, label),
+
+	TP_STRUCT__entry(
+		__field(	unsigned int,	mc_index		)
+		__field(	unsigned long,	page_frame_number	)
+		__field(	unsigned long,	offset_in_page		)
+		__field(	u32,		grain			)
+		__field(	int,		row			)
+		__field(	const char *,	msg			)
+		__field(	const char *,	label			)
+	),
+
+	TP_fast_assign(
+		__entry->mc_index		= mci->mc_idx;
+		__entry->page_frame_number	= page_frame_number;
+		__entry->offset_in_page		= offset_in_page;
+		__entry->grain			= mci->csrows[row].grain;
+		__entry->row			= row;
+		__entry->msg			= msg;
+		__entry->label			= label;
+	),
+
+	TP_printk(HW_ERR "mce#%d: Uncorrected error %s on label \"%s\""
+			 "(page 0x%lux, offset 0x%lux, grain %ud, row %d)\n",
+		__entry->mc_index,
+		__entry->msg,
+		__entry->label,
+		__entry->page_frame_number,
+		__entry->offset_in_page,
+		__entry->grain,
+		__entry->row)
+);
+
+
+/*
+ * Fully-Buffered memory hardware in general don't provide syndrome/grain/row
+ * information for all types of errors. So, we need to either have another
+ * trace event or add a bitmapped field to indicate that some info are not
+ * provided and use the previously-declared event. It seemed easier and less
+ * confusing to create a different event for such cases
+ */
+TRACE_EVENT(mc_corrected_error_fbd,
+
+	TP_PROTO(struct mem_ctl_info *mci,
+		int row, int channel, const char *msg),
+
+	TP_ARGS(mci, row, channel, msg),
+
+	TP_STRUCT__entry(
+		__field(	unsigned int,	mc_index		)
+		__field(	int,		row			)
+		__field(	int,		channel	        	)
+		__field(	const char *,	label			)
+		__field(	const char *,	msg			)
+	),
+
+	TP_fast_assign(
+		__entry->mc_index		= mci->mc_idx;
+		__entry->row			= row;
+		__entry->channel		= channel;
+		__entry->label			= mci->csrows[row].channels[channel].label;
+		__entry->msg			= msg;
+	),
+
+	TP_printk(HW_ERR "mce#%d: Corrected Error %s on label \"%s\" "
+			 "(row %d, channel %d)\n",
+		__entry->mc_index,
+		__entry->msg,
+		__entry->label,
+		__entry->row,
+		__entry->channel)
+);
+
+TRACE_EVENT(mc_uncorrected_error_fbd,
+
+	TP_PROTO(struct mem_ctl_info *mci,
+		int row, int channela, int channelb,
+		const char *msg, const char *label),
+
+	TP_ARGS(mci, row, channela, channelb, msg, label),
+
+	TP_STRUCT__entry(
+		__field(	unsigned int,	mc_index		)
+		__field(	int,		row			)
+		__field(	int,		channela		)
+		__field(	int,		channelb		)
+		__field(	const char *,	msg			)
+		__field(	const char *,	label			)
+	),
+
+	TP_fast_assign(
+		__entry->mc_index		= mci->mc_idx;
+		__entry->row			= row;
+		__entry->channela		= channela;
+		__entry->channelb		= channelb;
+		__entry->msg			= msg;
+		__entry->label			= label;
+	),
+
+	TP_printk(HW_ERR "mce#%d: Uncorrected Error %s on label \"%s\" "
+			 "(row %d, channels: %d, %d)\n",
+		__entry->mc_index,
+		__entry->msg,
+		__entry->label,
+		__entry->row,
+		__entry->channela,
+		__entry->channelb)
+);
+
+/*
+ * The Memory controller driver needs to discover the memory topology, in
+ * order to associate a hardware error with the memory label. If, for any
+ * reason, it receives an error for a channel or row that are not supposed
+ * to be there, an error event needs to be generated to indicate:
+ *	- that a Corrected or Uncorrected error was received;
+ *	- that the driver has a bug and, for that particular hardware, was
+ *	  not capable of detecting the hardware architecture
+ * If one of such errors is ever received, a bug to the kernel driver must
+ * be filled.
+ */
+
+TRACE_EVENT(mc_out_of_range,
+	TP_PROTO(struct mem_ctl_info *mci, const char *type, const char *field,
+		int invalid_val, int min, int max),
+
+	TP_ARGS(mci, type, field, invalid_val, min, max),
+
+	TP_STRUCT__entry(
+		__field(	const char *,	type			)
+		__field(	const char *,	field			)
+		__field(	unsigned int,	mc_index		)
+		__field(	int,		invalid_val		)
+		__field(	int,		min			)
+		__field(	int,		max			)
+	),
+
+	TP_fast_assign(
+		__entry->type			= type;
+		__entry->field			= field;
+		__entry->mc_index		= mci->mc_idx;
+		__entry->invalid_val		= invalid_val;
+		__entry->min			= min;
+		__entry->max			= max;
+	),
+
+	TP_printk(HW_ERR "mce#%d %s: %s=%d is not between %d and %d\n",
+		__entry->mc_index,
+		__entry->type,
+		__entry->field,
+		__entry->invalid_val,
+		__entry->min,
+		__entry->max)
+);
+
+/*
+ * On some cases, a corrected or uncorrected error was detected, but it
+ * couldn't be properly handled, or because another error overrided the
+ * error registers that details the error or because of some internal problem
+ * on the driver. Those events bellow are meant for those error types.
+ */
+TRACE_EVENT(mc_corrected_error_no_info,
+	TP_PROTO(struct mem_ctl_info *mci, const char *msg),
+
+	TP_ARGS(mci, msg),
+
+	TP_STRUCT__entry(
+		__field(	const char *,	msg			)
+		__field(	unsigned int,	mc_index		)
+	),
+
+	TP_fast_assign(
+		__entry->msg			= msg;
+		__entry->mc_index		= mci->mc_idx;
+	),
+
+	TP_printk(HW_ERR "mce#%d: Corrected Error: %s\n",
+		__entry->mc_index,
+		__entry->msg)
+);
+
+TRACE_EVENT(mc_uncorrected_error_no_info,
+	TP_PROTO(struct mem_ctl_info *mci, const char *msg),
+
+	TP_ARGS(mci, msg),
+
+	TP_STRUCT__entry(
+		__field(	const char *,	msg			)
+		__field(	unsigned int,	mc_index		)
+	),
+
+	TP_fast_assign(
+		__entry->msg			= msg;
+		__entry->mc_index		= mci->mc_idx;
+	),
+
+	TP_printk(HW_ERR "mce#%d: Uncorrected Error: %s\n",
+		__entry->mc_index,
+		__entry->msg)
+);
+
+
+
+/*
+ * MCE Events placeholder. Please add non-memory events that come from the
+ * MCE driver here
+ */
+
+
+#endif /* _TRACE_HW_EVENT_MC_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
-- 
1.7.8


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 2/3] events/hw_event: use __string() trace macros for events
  2012-01-26 23:05     ` [PATCH 1/3] events/hw_event: Create a Hardware Events Report Mecanism (HERM) Mauro Carvalho Chehab
@ 2012-01-26 23:05       ` Mauro Carvalho Chehab
  2012-01-26 23:05       ` [PATCH 3/3] hw_event: Consolidate uncorrected/corrected error msgs into one Mauro Carvalho Chehab
  1 sibling, 0 replies; 20+ messages in thread
From: Mauro Carvalho Chehab @ 2012-01-26 23:05 UTC (permalink / raw)
  Cc: Mauro Carvalho Chehab, Linux Edac Mailing List,
	Linux Kernel Mailing List

Some data there uses temporary alloced space. Just attributing
string pointers directly won't work on such cases.

Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
---
 include/trace/events/hw_event.h |   78 +++++++++++++++++++-------------------
 1 files changed, 39 insertions(+), 39 deletions(-)

diff --git a/include/trace/events/hw_event.h b/include/trace/events/hw_event.h
index a46ac61..85fca0d 100644
--- a/include/trace/events/hw_event.h
+++ b/include/trace/events/hw_event.h
@@ -20,17 +20,17 @@ DECLARE_EVENT_CLASS(hw_event_class,
 	TP_ARGS(type, instance),
 
 	TP_STRUCT__entry(
-		__field(	const char *,	type			)
+		__string(	type,		type			)
 		__field(	unsigned int,	instance		)
 	),
 
 	TP_fast_assign(
-		__entry->type	= type;
+		__assign_str(type, type);
 		__entry->instance = instance;
 	),
 
 	TP_printk("Initialized %s#%d\n",
-		__entry->type,
+		__get_str(type),
 		__entry->instance)
 );
 
@@ -70,8 +70,8 @@ TRACE_EVENT(mc_corrected_error,
 		__field(	unsigned long,	syndrome		)
 		__field(	int,		row			)
 		__field(	int,		channel			)
-		__field(	const char *,	label			)
-		__field(	const char *,	msg			)
+		__string(	label,		mci->csrows[row].channels[channel].label)
+		__string(	msg,		msg			)
 	),
 
 	TP_fast_assign(
@@ -82,16 +82,16 @@ TRACE_EVENT(mc_corrected_error,
 		__entry->syndrome		= syndrome;
 		__entry->row			= row;
 		__entry->channel		= channel;
-		__entry->label			= mci->csrows[row].channels[channel].label;
-		__entry->msg			= msg;
+		__assign_str(label, mci->csrows[row].channels[channel].label);
+		__assign_str(msg, msg);
 	),
 
 	TP_printk(HW_ERR "mce#%d: Corrected error %s on label \"%s\" "
 			 "(page 0x%lux, offset 0x%lux, grain %ud, "
 			 "syndrome 0x%lux, row %d, channel %d)\n",
 		__entry->mc_index,
-		__entry->msg,
-		__entry->label,
+		__get_str(msg),
+		__get_str(label),
 		__entry->page_frame_number,
 		__entry->offset_in_page,
 		__entry->grain,
@@ -116,8 +116,8 @@ TRACE_EVENT(mc_uncorrected_error,
 		__field(	unsigned long,	offset_in_page		)
 		__field(	u32,		grain			)
 		__field(	int,		row			)
-		__field(	const char *,	msg			)
-		__field(	const char *,	label			)
+		__string(	msg,		msg			)
+		__string(	label,		label			)
 	),
 
 	TP_fast_assign(
@@ -126,15 +126,15 @@ TRACE_EVENT(mc_uncorrected_error,
 		__entry->offset_in_page		= offset_in_page;
 		__entry->grain			= mci->csrows[row].grain;
 		__entry->row			= row;
-		__entry->msg			= msg;
-		__entry->label			= label;
+		__assign_str(msg, msg);
+		__assign_str(label, label);
 	),
 
 	TP_printk(HW_ERR "mce#%d: Uncorrected error %s on label \"%s\""
 			 "(page 0x%lux, offset 0x%lux, grain %ud, row %d)\n",
 		__entry->mc_index,
-		__entry->msg,
-		__entry->label,
+		__get_str(msg),
+		__get_str(label),
 		__entry->page_frame_number,
 		__entry->offset_in_page,
 		__entry->grain,
@@ -160,23 +160,23 @@ TRACE_EVENT(mc_corrected_error_fbd,
 		__field(	unsigned int,	mc_index		)
 		__field(	int,		row			)
 		__field(	int,		channel	        	)
-		__field(	const char *,	label			)
-		__field(	const char *,	msg			)
+		__string(	label,		mci->csrows[row].channels[channel].label)
+		__string(	msg,		msg			)
 	),
 
 	TP_fast_assign(
 		__entry->mc_index		= mci->mc_idx;
 		__entry->row			= row;
 		__entry->channel		= channel;
-		__entry->label			= mci->csrows[row].channels[channel].label;
-		__entry->msg			= msg;
+		__assign_str(label, mci->csrows[row].channels[channel].label);
+		__assign_str(msg, msg);
 	),
 
 	TP_printk(HW_ERR "mce#%d: Corrected Error %s on label \"%s\" "
 			 "(row %d, channel %d)\n",
 		__entry->mc_index,
-		__entry->msg,
-		__entry->label,
+		__get_str(msg),
+		__get_str(label),
 		__entry->row,
 		__entry->channel)
 );
@@ -194,8 +194,8 @@ TRACE_EVENT(mc_uncorrected_error_fbd,
 		__field(	int,		row			)
 		__field(	int,		channela		)
 		__field(	int,		channelb		)
-		__field(	const char *,	msg			)
-		__field(	const char *,	label			)
+		__string(	msg,		msg			)
+		__string(	label,		label			)
 	),
 
 	TP_fast_assign(
@@ -203,15 +203,15 @@ TRACE_EVENT(mc_uncorrected_error_fbd,
 		__entry->row			= row;
 		__entry->channela		= channela;
 		__entry->channelb		= channelb;
-		__entry->msg			= msg;
-		__entry->label			= label;
+		__assign_str(msg, msg);
+		__assign_str(label, label);
 	),
 
 	TP_printk(HW_ERR "mce#%d: Uncorrected Error %s on label \"%s\" "
 			 "(row %d, channels: %d, %d)\n",
 		__entry->mc_index,
-		__entry->msg,
-		__entry->label,
+		__get_str(msg),
+		__get_str(label),
 		__entry->row,
 		__entry->channela,
 		__entry->channelb)
@@ -236,8 +236,8 @@ TRACE_EVENT(mc_out_of_range,
 	TP_ARGS(mci, type, field, invalid_val, min, max),
 
 	TP_STRUCT__entry(
-		__field(	const char *,	type			)
-		__field(	const char *,	field			)
+		__string(	type,		type			)
+		__string(	field,		field			)
 		__field(	unsigned int,	mc_index		)
 		__field(	int,		invalid_val		)
 		__field(	int,		min			)
@@ -245,8 +245,8 @@ TRACE_EVENT(mc_out_of_range,
 	),
 
 	TP_fast_assign(
-		__entry->type			= type;
-		__entry->field			= field;
+		__assign_str(type, type);
+		__assign_str(field, field);
 		__entry->mc_index		= mci->mc_idx;
 		__entry->invalid_val		= invalid_val;
 		__entry->min			= min;
@@ -255,8 +255,8 @@ TRACE_EVENT(mc_out_of_range,
 
 	TP_printk(HW_ERR "mce#%d %s: %s=%d is not between %d and %d\n",
 		__entry->mc_index,
-		__entry->type,
-		__entry->field,
+		__get_str(type),
+		__get_str(field),
 		__entry->invalid_val,
 		__entry->min,
 		__entry->max)
@@ -274,18 +274,18 @@ TRACE_EVENT(mc_corrected_error_no_info,
 	TP_ARGS(mci, msg),
 
 	TP_STRUCT__entry(
-		__field(	const char *,	msg			)
+	__string(	msg,			msg			)
 		__field(	unsigned int,	mc_index		)
 	),
 
 	TP_fast_assign(
-		__entry->msg			= msg;
+		__assign_str(msg, msg);
 		__entry->mc_index		= mci->mc_idx;
 	),
 
 	TP_printk(HW_ERR "mce#%d: Corrected Error: %s\n",
 		__entry->mc_index,
-		__entry->msg)
+		__get_str(msg))
 );
 
 TRACE_EVENT(mc_uncorrected_error_no_info,
@@ -294,18 +294,18 @@ TRACE_EVENT(mc_uncorrected_error_no_info,
 	TP_ARGS(mci, msg),
 
 	TP_STRUCT__entry(
-		__field(	const char *,	msg			)
+		__string(	msg,		msg			)
 		__field(	unsigned int,	mc_index		)
 	),
 
 	TP_fast_assign(
-		__entry->msg			= msg;
+		__assign_str(msg, msg);
 		__entry->mc_index		= mci->mc_idx;
 	),
 
 	TP_printk(HW_ERR "mce#%d: Uncorrected Error: %s\n",
 		__entry->mc_index,
-		__entry->msg)
+		__get_str(msg))
 );
 
 
-- 
1.7.8


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 3/3] hw_event: Consolidate uncorrected/corrected error msgs into one
  2012-01-26 23:05     ` [PATCH 1/3] events/hw_event: Create a Hardware Events Report Mecanism (HERM) Mauro Carvalho Chehab
  2012-01-26 23:05       ` [PATCH 2/3] events/hw_event: use __string() trace macros for events Mauro Carvalho Chehab
@ 2012-01-26 23:05       ` Mauro Carvalho Chehab
  1 sibling, 0 replies; 20+ messages in thread
From: Mauro Carvalho Chehab @ 2012-01-26 23:05 UTC (permalink / raw)
  Cc: Mauro Carvalho Chehab, Linux Edac Mailing List,
	Linux Kernel Mailing List

This is an RFC patch, consolidating two trace calls into one.
Not sure if this is the better thing to do, but it simplifies
the error tracepoint, while still keeping the technical details
that may be needed by someone debugging the driver or for
the vendors to double-check what's happening inside the system.

Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
---
 drivers/edac/edac_mc.c          |   51 +++++++--
 include/linux/edac.h            |    6 +
 include/trace/events/hw_event.h |  231 ++++-----------------------------------
 3 files changed, 68 insertions(+), 220 deletions(-)

diff --git a/drivers/edac/edac_mc.c b/drivers/edac/edac_mc.c
index 2b8382e..5038239 100644
--- a/drivers/edac/edac_mc.c
+++ b/drivers/edac/edac_mc.c
@@ -685,6 +685,7 @@ void edac_mc_handle_ce(struct mem_ctl_info *mci,
 		int row, int channel, const char *msg)
 {
 	unsigned long remapped_page;
+	char detail[80];
 
 	debugf3("MC%d: %s()\n", mci->mc_idx, __func__);
 
@@ -711,8 +712,15 @@ void edac_mc_handle_ce(struct mem_ctl_info *mci,
 		return;
 	}
 
-	trace_mc_corrected_error(mci, page_frame_number, offset_in_page,
-				syndrome, row, channel, msg);
+	/* Memory type dependent details about the error */
+	snprintf(detail, sizeof(detail),
+		 " (page 0x%lx, offset 0x%lx, grain %d, "
+		 "syndrome 0x%lx, row %d, channel %d)\n",
+		 page_frame_number, offset_in_page,
+		 mci->csrows[row].grain, syndrome, row, channel);
+	trace_mc_error(HW_EVENT_ERR_CORRECTED, mci->mc_idx,
+		       mci->csrows[row].channels[channel].label,
+		       msg, detail);
 
 	if (edac_mc_get_log_ce())
 		/* FIXME - put in DIMM location */
@@ -749,7 +757,8 @@ EXPORT_SYMBOL_GPL(edac_mc_handle_ce);
 
 void edac_mc_handle_ce_no_info(struct mem_ctl_info *mci, const char *msg)
 {
-	trace_mc_corrected_error_no_info(mci, msg);
+	trace_mc_error(HW_EVENT_ERR_CORRECTED, mci->mc_idx,
+		       "unknown", msg, "");
 	if (edac_mc_get_log_ce())
 		edac_mc_printk(mci, KERN_WARNING,
 			"CE - no information available: %s\n", msg);
@@ -768,6 +777,7 @@ void edac_mc_handle_ue(struct mem_ctl_info *mci,
 	char *pos = labels;
 	int chan;
 	int chars;
+	char detail[80];
 
 	debugf3("MC%d: %s()\n", mci->mc_idx, __func__);
 
@@ -796,8 +806,15 @@ void edac_mc_handle_ue(struct mem_ctl_info *mci,
 		pos += chars;
 	}
 
-	trace_mc_uncorrected_error(mci, page_frame_number, offset_in_page,
-				row, msg, labels);
+	/* Memory type dependent details about the error */
+	snprintf(detail, sizeof(detail),
+		 "page 0x%lx, offset 0x%lx, grain %d, row %d ",
+		 page_frame_number, offset_in_page,
+	         mci->csrows[row].grain, row);
+	trace_mc_error(HW_EVENT_ERR_UNCORRECTED, mci->mc_idx,
+		       labels,
+		       msg, detail);
+
 	if (edac_mc_get_log_ue())
 		edac_mc_printk(mci, KERN_EMERG,
 			"UE page 0x%lx, offset 0x%lx, grain %d, row %d, "
@@ -818,7 +835,8 @@ EXPORT_SYMBOL_GPL(edac_mc_handle_ue);
 
 void edac_mc_handle_ue_no_info(struct mem_ctl_info *mci, const char *msg)
 {
-	trace_mc_uncorrected_error_no_info(mci, msg);
+	trace_mc_error(HW_EVENT_ERR_UNCORRECTED, mci->mc_idx,
+		       "unknown", msg, "");
 	if (edac_mc_get_panic_on_ue())
 		panic("EDAC MC%d: Uncorrected Error", mci->mc_idx);
 
@@ -843,6 +861,7 @@ void edac_mc_handle_fbd_ue(struct mem_ctl_info *mci,
 	char labels[len + 1];
 	char *pos = labels;
 	int chars;
+	char detail[80];
 
 	if (csrow >= mci->nr_csrows) {
 		/* something is wrong */
@@ -891,8 +910,13 @@ void edac_mc_handle_fbd_ue(struct mem_ctl_info *mci,
 	chars = snprintf(pos, len + 1, "-%s",
 			 mci->csrows[csrow].channels[channelb].label);
 
-	trace_mc_uncorrected_error_fbd(mci, csrow, channela, channelb,
-				       msg, labels);
+	/* Memory type dependent details about the error */
+	snprintf(detail, sizeof(detail),
+		 "row %d, channel-a= %d channel-b= %d ",
+		 csrow, channela, channelb);
+	trace_mc_error(HW_EVENT_ERR_UNCORRECTED, mci->mc_idx,
+		       labels,
+		       msg, detail);
 	if (edac_mc_get_log_ue())
 		edac_mc_printk(mci, KERN_EMERG,
 			"UE row %d, channel-a= %d channel-b= %d "
@@ -913,7 +937,7 @@ EXPORT_SYMBOL(edac_mc_handle_fbd_ue);
 void edac_mc_handle_fbd_ce(struct mem_ctl_info *mci,
 			unsigned int csrow, unsigned int channel, char *msg)
 {
-
+	char detail[80];
 	/* Ensure boundary values */
 	if (csrow >= mci->nr_csrows) {
 		/* something is wrong */
@@ -936,7 +960,14 @@ void edac_mc_handle_fbd_ce(struct mem_ctl_info *mci,
 		return;
 	}
 
-	trace_mc_corrected_error_fbd(mci, csrow, channel, msg);
+	/* Memory type dependent details about the error */
+	snprintf(detail, sizeof(detail),
+		 "(row %d, channel %d)\n",
+		 csrow, channel);
+	trace_mc_error(HW_EVENT_ERR_CORRECTED, mci->mc_idx,
+		       mci->csrows[csrow].channels[channel].label,
+		       msg, detail);
+
 	if (edac_mc_get_log_ce())
 		/* FIXME - put in DIMM location */
 		edac_mc_printk(mci, KERN_WARNING,
diff --git a/include/linux/edac.h b/include/linux/edac.h
index 055b248..3ba99d7 100644
--- a/include/linux/edac.h
+++ b/include/linux/edac.h
@@ -66,6 +66,12 @@ enum dev_type {
 #define DEV_FLAG_X32		BIT(DEV_X32)
 #define DEV_FLAG_X64		BIT(DEV_X64)
 
+enum hw_event_mc_err_type {
+	HW_EVENT_ERR_CORRECTED,
+	HW_EVENT_ERR_UNCORRECTED,
+	HW_EVENT_ERR_FATAL,
+};
+
 /* memory types */
 enum mem_type {
 	MEM_EMPTY = 0,		/* Empty csrow */
diff --git a/include/trace/events/hw_event.h b/include/trace/events/hw_event.h
index 85fca0d..fee7ed2 100644
--- a/include/trace/events/hw_event.h
+++ b/include/trace/events/hw_event.h
@@ -52,183 +52,42 @@ DEFINE_EVENT(hw_event_class, hw_event_init,
 /*
  * Default error mechanisms for Memory Controller errors (CE and UE)
  */
-TRACE_EVENT(mc_corrected_error,
+TRACE_EVENT(mc_error,
 
-	TP_PROTO(struct mem_ctl_info *mci,
-		unsigned long page_frame_number,
-		unsigned long offset_in_page, unsigned long syndrome,
-		int row, int channel, const char *msg),
+	TP_PROTO(unsigned int err_type,
+		 unsigned int mc_index,
+		 const char *label,
+		 const char *msg,
+		 const char *detail),
 
-	TP_ARGS(mci, page_frame_number, offset_in_page, syndrome, row,
-		channel, msg),
+	TP_ARGS(err_type, mc_index, label, msg, detail),
 
 	TP_STRUCT__entry(
+		__field(	unsigned int,	err_type		)
 		__field(	unsigned int,	mc_index		)
-		__field(	unsigned long,	page_frame_number	)
-		__field(	unsigned long,	offset_in_page		)
-		__field(	u32,		grain			)
-		__field(	unsigned long,	syndrome		)
-		__field(	int,		row			)
-		__field(	int,		channel			)
-		__string(	label,		mci->csrows[row].channels[channel].label)
-		__string(	msg,		msg			)
-	),
-
-	TP_fast_assign(
-		__entry->mc_index		= mci->mc_idx;
-		__entry->page_frame_number	= page_frame_number;
-		__entry->offset_in_page		= offset_in_page;
-		__entry->grain			= mci->csrows[row].grain;
-		__entry->syndrome		= syndrome;
-		__entry->row			= row;
-		__entry->channel		= channel;
-		__assign_str(label, mci->csrows[row].channels[channel].label);
-		__assign_str(msg, msg);
-	),
-
-	TP_printk(HW_ERR "mce#%d: Corrected error %s on label \"%s\" "
-			 "(page 0x%lux, offset 0x%lux, grain %ud, "
-			 "syndrome 0x%lux, row %d, channel %d)\n",
-		__entry->mc_index,
-		__get_str(msg),
-		__get_str(label),
-		__entry->page_frame_number,
-		__entry->offset_in_page,
-		__entry->grain,
-		__entry->syndrome,
-		__entry->row,
-		__entry->channel)
-);
-
-TRACE_EVENT(mc_uncorrected_error,
-
-	TP_PROTO(struct mem_ctl_info *mci,
-		unsigned long page_frame_number,
-		unsigned long offset_in_page,
-		int row, const char *msg, const char *label),
-
-	TP_ARGS(mci, page_frame_number, offset_in_page,
-		row, msg, label),
-
-	TP_STRUCT__entry(
-		__field(	unsigned int,	mc_index		)
-		__field(	unsigned long,	page_frame_number	)
-		__field(	unsigned long,	offset_in_page		)
-		__field(	u32,		grain			)
-		__field(	int,		row			)
-		__string(	msg,		msg			)
 		__string(	label,		label			)
-	),
-
-	TP_fast_assign(
-		__entry->mc_index		= mci->mc_idx;
-		__entry->page_frame_number	= page_frame_number;
-		__entry->offset_in_page		= offset_in_page;
-		__entry->grain			= mci->csrows[row].grain;
-		__entry->row			= row;
-		__assign_str(msg, msg);
-		__assign_str(label, label);
-	),
-
-	TP_printk(HW_ERR "mce#%d: Uncorrected error %s on label \"%s\""
-			 "(page 0x%lux, offset 0x%lux, grain %ud, row %d)\n",
-		__entry->mc_index,
-		__get_str(msg),
-		__get_str(label),
-		__entry->page_frame_number,
-		__entry->offset_in_page,
-		__entry->grain,
-		__entry->row)
-);
-
-
-/*
- * Fully-Buffered memory hardware in general don't provide syndrome/grain/row
- * information for all types of errors. So, we need to either have another
- * trace event or add a bitmapped field to indicate that some info are not
- * provided and use the previously-declared event. It seemed easier and less
- * confusing to create a different event for such cases
- */
-TRACE_EVENT(mc_corrected_error_fbd,
-
-	TP_PROTO(struct mem_ctl_info *mci,
-		int row, int channel, const char *msg),
-
-	TP_ARGS(mci, row, channel, msg),
-
-	TP_STRUCT__entry(
-		__field(	unsigned int,	mc_index		)
-		__field(	int,		row			)
-		__field(	int,		channel	        	)
-		__string(	label,		mci->csrows[row].channels[channel].label)
 		__string(	msg,		msg			)
+		__string(	detail,		detail			)
 	),
 
 	TP_fast_assign(
-		__entry->mc_index		= mci->mc_idx;
-		__entry->row			= row;
-		__entry->channel		= channel;
-		__assign_str(label, mci->csrows[row].channels[channel].label);
-		__assign_str(msg, msg);
-	),
-
-	TP_printk(HW_ERR "mce#%d: Corrected Error %s on label \"%s\" "
-			 "(row %d, channel %d)\n",
-		__entry->mc_index,
-		__get_str(msg),
-		__get_str(label),
-		__entry->row,
-		__entry->channel)
-);
-
-TRACE_EVENT(mc_uncorrected_error_fbd,
-
-	TP_PROTO(struct mem_ctl_info *mci,
-		int row, int channela, int channelb,
-		const char *msg, const char *label),
-
-	TP_ARGS(mci, row, channela, channelb, msg, label),
-
-	TP_STRUCT__entry(
-		__field(	unsigned int,	mc_index		)
-		__field(	int,		row			)
-		__field(	int,		channela		)
-		__field(	int,		channelb		)
-		__string(	msg,		msg			)
-		__string(	label,		label			)
-	),
-
-	TP_fast_assign(
-		__entry->mc_index		= mci->mc_idx;
-		__entry->row			= row;
-		__entry->channela		= channela;
-		__entry->channelb		= channelb;
-		__assign_str(msg, msg);
+		__entry->err_type		= err_type;
+		__entry->mc_index		= mc_index;
 		__assign_str(label, label);
+		__assign_str(msg, msg);
+		__assign_str(detail, detail);
 	),
 
-	TP_printk(HW_ERR "mce#%d: Uncorrected Error %s on label \"%s\" "
-			 "(row %d, channels: %d, %d)\n",
-		__entry->mc_index,
-		__get_str(msg),
-		__get_str(label),
-		__entry->row,
-		__entry->channela,
-		__entry->channelb)
+	TP_printk(HW_ERR "mce#%d: %s error %s on label \"%s\" %s\n",
+		  __entry->mc_index,
+		  (__entry->err_type == HW_EVENT_ERR_CORRECTED) ? "Corrected" :
+			((__entry->err_type == HW_EVENT_ERR_FATAL) ?
+			"Fatal" : "Uncorrected"),
+		  __get_str(msg),
+		  __get_str(label),
+		  __get_str(detail))
 );
 
-/*
- * The Memory controller driver needs to discover the memory topology, in
- * order to associate a hardware error with the memory label. If, for any
- * reason, it receives an error for a channel or row that are not supposed
- * to be there, an error event needs to be generated to indicate:
- *	- that a Corrected or Uncorrected error was received;
- *	- that the driver has a bug and, for that particular hardware, was
- *	  not capable of detecting the hardware architecture
- * If one of such errors is ever received, a bug to the kernel driver must
- * be filled.
- */
-
 TRACE_EVENT(mc_out_of_range,
 	TP_PROTO(struct mem_ctl_info *mci, const char *type, const char *field,
 		int invalid_val, int min, int max),
@@ -263,54 +122,6 @@ TRACE_EVENT(mc_out_of_range,
 );
 
 /*
- * On some cases, a corrected or uncorrected error was detected, but it
- * couldn't be properly handled, or because another error overrided the
- * error registers that details the error or because of some internal problem
- * on the driver. Those events bellow are meant for those error types.
- */
-TRACE_EVENT(mc_corrected_error_no_info,
-	TP_PROTO(struct mem_ctl_info *mci, const char *msg),
-
-	TP_ARGS(mci, msg),
-
-	TP_STRUCT__entry(
-	__string(	msg,			msg			)
-		__field(	unsigned int,	mc_index		)
-	),
-
-	TP_fast_assign(
-		__assign_str(msg, msg);
-		__entry->mc_index		= mci->mc_idx;
-	),
-
-	TP_printk(HW_ERR "mce#%d: Corrected Error: %s\n",
-		__entry->mc_index,
-		__get_str(msg))
-);
-
-TRACE_EVENT(mc_uncorrected_error_no_info,
-	TP_PROTO(struct mem_ctl_info *mci, const char *msg),
-
-	TP_ARGS(mci, msg),
-
-	TP_STRUCT__entry(
-		__string(	msg,		msg			)
-		__field(	unsigned int,	mc_index		)
-	),
-
-	TP_fast_assign(
-		__assign_str(msg, msg);
-		__entry->mc_index		= mci->mc_idx;
-	),
-
-	TP_printk(HW_ERR "mce#%d: Uncorrected Error: %s\n",
-		__entry->mc_index,
-		__get_str(msg))
-);
-
-
-
-/*
  * MCE Events placeholder. Please add non-memory events that come from the
  * MCE driver here
  */
-- 
1.7.8


^ permalink raw reply related	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2012-01-26 23:05 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <cover.1300996141.git.mchehab@redhat.com>
2011-03-24 20:32 ` [PATCH RFC 2/2] events/hw_event: Create a Hardware Anomaly Report Mechanism (HARM) Mauro Carvalho Chehab
2011-03-24 22:39   ` Borislav Petkov
2011-03-25 10:20     ` Mauro Carvalho Chehab
2011-03-25 14:13       ` Borislav Petkov
2011-03-25 21:22         ` Mauro Carvalho Chehab
2011-03-25 22:37           ` Tony Luck
2011-03-26 11:56             ` Mauro Carvalho Chehab
2011-03-28 17:03           ` Borislav Petkov
2011-03-28 19:44             ` Mauro Carvalho Chehab
2011-03-30 17:27               ` Luck, Tony
2011-03-30 17:51                 ` Borislav Petkov
2011-03-30 18:30                   ` Francis St. Amant
2011-03-30 19:50                     ` Borislav Petkov
2011-03-30 20:00                       ` Francis St. Amant
2011-03-31  7:43                         ` Borislav Petkov
2012-01-26 23:05     ` [PATCH 1/3] events/hw_event: Create a Hardware Events Report Mecanism (HERM) Mauro Carvalho Chehab
2012-01-26 23:05       ` [PATCH 2/3] events/hw_event: use __string() trace macros for events Mauro Carvalho Chehab
2012-01-26 23:05       ` [PATCH 3/3] hw_event: Consolidate uncorrected/corrected error msgs into one Mauro Carvalho Chehab
2011-03-24 20:32 ` [PATCH RFC 1/2] edac: Move edac main structs to include/linux/edac.h Mauro Carvalho Chehab
2011-03-24 20:54 ` Mauro Carvalho Chehab

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).