* RAS trace event proto
@ 2012-02-20 14:59 Borislav Petkov
  2012-02-21  1:14 ` Steven Rostedt
  2012-02-21 12:24 ` Mauro Carvalho Chehab
  0 siblings, 2 replies; 19+ messages in thread
From: Borislav Petkov @ 2012-02-20 14:59 UTC (permalink / raw)
  To: Mauro Carvalho Chehab
  Cc: Steven Rostedt, Ingo Molnar, Tony Luck, edac-devel, LKML

Hi all,

here's a dirty patch which should hopefully show what I have in mind wrt
using tracepoints for RAS events. The code compiles but is only meant to
give an initial idea; it is subject to significant changes until it
reaches its final form, so it is only exemplary and not final in any case.

Notes:

* So there are two RAS tracepoints: trace_mce_record which dumps the
MCE-related errors + their decoded info and trace_hw_error which simply
carries a string to userspace. The second one can be used for non-MCA
errors.

* When prepping the string for the tracepoint, we cache it by calling
ras_printk(), which buffers the string built up so far internally, so
everything that wants to dump into it needs to be converted to use
ras_printk().
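
For illustration, a decode path then roughly does the following (lifted
from the patch below; the trace_hw_error() call is not wired up in the
patch yet, it only shows the intended use):

	ras_printk(PR_EMERG, HW_ERR "Data Cache Error: ");
	ras_printk(PR_CONT, "during L1 linefill from L2.\n");

	/* once the string is complete, ship it out together with the raw MCE */
	trace_mce_record(decoded_err_str, m);

	/* non-MCA errors would use the second tracepoint and carry only a string */
	trace_hw_error(decoded_err_str);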

* Which brings me to it: ras_printk() is x86-only and it could probably
be moved to an arch-agnostic place for the other arches. I'd leave it
x86-only for now, for testing purposes, and then later the other arches
could consider using it (this is wrt non-x86 EDAC drivers).

* When writing a 1 into /sys/devices/system/ras/agent, we enable the
string buffering functionality - this could be done by the RAS daemon or
whatever agent requests that hw error info be put into tracing.

* I'd like to have conditional printk-ing in trace_mce_record depending
on the TP args, Steve probably knows what can be done:

@Steven:

I'd like to do the following:

	TP_printk("%s, ARG1: %d, ARG2: %c ...", str1, arg1, arg2)

and have it print only the first arg, i.e. the string and drop the rest
of the args while still doing the TP_fast_assign into the ring buffer
and carrying the stuff to its consumers. Background is that I want to
dump the decoded string of a hardware error, if it is decoded, but carry
the MCE info to userspace and only dump the fields of the MCE if I
haven't managed to decode it, i.e. str1 == "".

So, my question is, can I do something like:

	TP_printk("%s, ARG1: %d, ARG2: %c ...", __print_conditional(str1, arg1, arg2))

where __print_conditional is a vararg macro which calls a
ftrace_print_cond() which prints only str1 if strlen(str1) > 0 and
otherwise calls a vsnprintf() variant to deal with the va_args?
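
Roughly, the semantics I'm after look like the sketch below (neither
__print_conditional() nor ftrace_print_cond() exist, this is purely to
illustrate the question):

	/* hypothetical helper, only to show the wanted behavior: */
	static int ftrace_print_cond(char *buf, size_t size, const char *str1,
				     const char *fmt, ...)
	{
		va_list args;
		int n;

		/* decoded string available: print only it, drop the raw args */
		if (str1 && strlen(str1))
			return snprintf(buf, size, "%s", str1);

		/* nothing decoded: fall back to dumping the MCE fields */
		va_start(args, fmt);
		n = vsnprintf(buf, size, fmt, args);
		va_end(args);

		return n;
	}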

As always, all comments are welcome.

--
>From e06143929d7d6cbed7bec1a7f4976f595a2537da Mon Sep 17 00:00:00 2001
From: Borislav Petkov <borislav.petkov@amd.com>
Date: Mon, 20 Feb 2012 14:52:19 +0100
Subject: [PATCH] RAS trace event proto

---
 arch/x86/Kconfig                 |    9 ++
 arch/x86/Makefile                |    3 +
 arch/x86/include/asm/ras.h       |   17 +++
 arch/x86/kernel/cpu/mcheck/mce.c |    2 +-
 arch/x86/ras/Makefile            |    1 +
 arch/x86/ras/ras.c               |  121 +++++++++++++++++++++
 drivers/edac/amd64_edac.c        |    8 +-
 drivers/edac/edac_mc.c           |    9 ++-
 drivers/edac/mce_amd.c           |  218 ++++++++++++++++++++------------------
 include/trace/events/mce.h       |   26 ++++-
 10 files changed, 306 insertions(+), 108 deletions(-)
 create mode 100644 arch/x86/include/asm/ras.h
 create mode 100644 arch/x86/ras/Makefile
 create mode 100644 arch/x86/ras/ras.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 5bed94e189fa..bda1480241b2 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -657,6 +657,15 @@ config X86_CYCLONE_TIMER
 	def_bool y
 	depends on X86_SUMMIT
 
+config X86_RAS
+	def_bool y
+	prompt "X86 RAS features"
+	---help---
+	A collection of Reliability, Availability and Serviceability
+	software features which aim to enable hardware error logging
+	and reporting. Leave it at 'y' unless you really know what
+	you're doing
+
 source "arch/x86/Kconfig.cpu"
 
 config HPET_TIMER
diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index 209ba1294592..a6b6bb1f308b 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -146,6 +146,9 @@ drivers-$(CONFIG_OPROFILE) += arch/x86/oprofile/
 # suspend and hibernation support
 drivers-$(CONFIG_PM) += arch/x86/power/
 
+# RAS support
+core-y += arch/x86/ras/
+
 drivers-$(CONFIG_FB) += arch/x86/video/
 
 ####
diff --git a/arch/x86/include/asm/ras.h b/arch/x86/include/asm/ras.h
new file mode 100644
index 000000000000..27333cfd7534
--- /dev/null
+++ b/arch/x86/include/asm/ras.h
@@ -0,0 +1,17 @@
+#ifndef _ASM_X86_RAS_H
+#define _ASM_X86_RAS_H
+
+#define ERR_STRING_SZ 200
+
+extern bool ras_agent;
+extern char *decoded_err_str;
+
+enum ras_printk_flags {
+	PR_EMERG	= 0,
+	PR_WARNING	= 1,
+	PR_CONT		= 2,
+	RAS_EOFLAGS,
+};
+extern void ras_printk(enum ras_printk_flags flags, const char *fmt, ...);
+
+#endif /* _ASM_X86_RAS_H */
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 5a11ae2e9e91..072e020ecaf3 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -144,7 +144,7 @@ void mce_log(struct mce *mce)
 	int ret = 0;
 
 	/* Emit the trace record: */
-	trace_mce_record(mce);
+	trace_mce_record("", mce);
 
 	ret = atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, mce);
 	if (ret == NOTIFY_STOP)
diff --git a/arch/x86/ras/Makefile b/arch/x86/ras/Makefile
new file mode 100644
index 000000000000..7a70bb5cd057
--- /dev/null
+++ b/arch/x86/ras/Makefile
@@ -0,0 +1 @@
+obj-y		:= ras.o
diff --git a/arch/x86/ras/ras.c b/arch/x86/ras/ras.c
new file mode 100644
index 000000000000..64099a03ea32
--- /dev/null
+++ b/arch/x86/ras/ras.c
@@ -0,0 +1,121 @@
+#include <linux/types.h>
+#include <linux/device.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <asm/ras.h>
+
+#define ERR_STRING_SZ 200
+char *decoded_err_str;
+static unsigned dec_len;
+
+/*
+ * If true, userspace has an agent running and eating all the
+ * tracing data we're sending out so there's no dmesg output
+ */
+bool ras_agent;
+EXPORT_SYMBOL_GPL(ras_agent);
+
+void ras_printk(enum ras_printk_flags flags, const char *fmt, ...)
+{
+	va_list args;
+	char *buf;
+	unsigned cur_sz;
+	int i;
+
+	if (dec_len >= ERR_STRING_SZ-1)
+		return;
+
+	buf = decoded_err_str + dec_len;
+	cur_sz = ERR_STRING_SZ - dec_len - 1;
+
+	va_start(args, fmt);
+	i = vsnprintf(buf, cur_sz, fmt, args);
+	va_end(args);
+
+	if (i >= cur_sz) {
+		pr_err("Error decode buffer truncated.\n");
+		dec_len = ERR_STRING_SZ-1;
+		decoded_err_str[dec_len] = '\n';
+	} else
+		dec_len += i;
+
+	if (!ras_agent) {
+		if (flags == PR_EMERG)
+			pr_emerg("%s", buf);
+		if (flags == PR_WARNING)
+			pr_warning("%s", buf);
+		else if (flags == PR_CONT)
+			pr_cont("%s", buf);
+	}
+}
+EXPORT_SYMBOL_GPL(ras_printk);
+
+struct bus_type ras_subsys = {
+	.name	  = "ras",
+	.dev_name = "ras",
+};
+
+struct ras_attr {
+	const struct attribute attr;
+	ssize_t (*show) (struct kobject *kobj, struct ras_attr *attr, char *buf);
+	ssize_t (*store)(struct kobject *kobj, struct ras_attr *attr,
+			 const char *buf, size_t count);
+};
+
+#define RAS_ATTR(_name, _mode, _show, _store)	\
+static struct ras_attr ras_attr_##_name = __ATTR(_name, _mode, _show, _store)
+
+static ssize_t ras_agent_show(struct kobject *kobj,
+			      struct ras_attr *attr,
+			      char *buf)
+{
+	return sprintf(buf, "%.1d\n", ras_agent);
+}
+
+static ssize_t ras_agent_store(struct kobject *kobj,
+			       struct ras_attr *attr,
+			       const char *buf, size_t count)
+{
+	int ret = 0;
+	unsigned long value;
+
+	ret = kstrtoul(buf, 10, &value);
+	if (ret < 0) {
+		printk(KERN_ERR "Wrong value for ras_agent field.\n");
+		return ret;
+	}
+
+	ras_agent = !!value;
+
+	return count;
+}
+
+RAS_ATTR(agent, 0644, ras_agent_show, ras_agent_store);
+
+static int __init ras_init(void)
+{
+	int err = 0;
+
+	err = subsys_system_register(&ras_subsys, NULL);
+	if (err) {
+		printk(KERN_ERR "Error registering toplevel RAS sysfs node.\n");
+		return -EINVAL;
+	}
+
+	err = sysfs_create_file(&ras_subsys.dev_root->kobj, &ras_attr_agent.attr);
+	if (err) {
+		printk(KERN_ERR "Error creating %s sysfs node.\n",
+				ras_attr_agent.attr.name);
+		goto err_sysfs_create;
+	}
+
+	return 0;
+
+err_sysfs_create:
+	sysfs_remove_file(&ras_subsys.dev_root->kobj, &ras_attr_agent.attr);
+	bus_unregister(&ras_subsys);
+
+	return -EINVAL;
+
+}
+early_initcall(ras_init);
diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
index c9eee6d33e9a..8a42e591508d 100644
--- a/drivers/edac/amd64_edac.c
+++ b/drivers/edac/amd64_edac.c
@@ -1,6 +1,7 @@
-#include "amd64_edac.h"
 #include <asm/amd_nb.h>
+#include <asm/ras.h>
 
+#include "amd64_edac.h"
 static struct edac_pci_ctl_info *amd64_ctl_pci;
 
 static int report_gart_errors;
@@ -1901,7 +1902,10 @@ static void amd64_handle_ce(struct mem_ctl_info *mci, struct mce *m)
 	sys_addr = get_error_address(m);
 	syndrome = extract_syndrome(m->status);
 
-	amd64_mc_err(mci, "CE ERROR_ADDRESS= 0x%llx\n", sys_addr);
+	if (ras_agent)
+		ras_printk(PR_EMERG, "err addr: 0x%llx", sys_addr);
+	else
+		amd64_mc_err(mci, "CE ERROR_ADDRESS= 0x%llx\n", sys_addr);
 
 	pvt->ops->map_sysaddr_to_csrow(mci, sys_addr, syndrome);
 }
diff --git a/drivers/edac/edac_mc.c b/drivers/edac/edac_mc.c
index ca6c04d350ee..772d712a6f74 100644
--- a/drivers/edac/edac_mc.c
+++ b/drivers/edac/edac_mc.c
@@ -30,8 +30,10 @@
 #include <asm/uaccess.h>
 #include <asm/page.h>
 #include <asm/edac.h>
+#include <asm/ras.h>
 #include "edac_core.h"
 #include "edac_module.h"
+#include "mce_amd.h"
 
 /* lock to memory controller's control array */
 static DEFINE_MUTEX(mem_ctls_mutex);
@@ -701,7 +703,11 @@ void edac_mc_handle_ce(struct mem_ctl_info *mci,
 		return;
 	}
 
-	if (edac_mc_get_log_ce())
+	if (edac_mc_get_log_ce()) {
+		if (ras_agent)
+			ras_printk(PR_CONT, "row: %d, channel: %d\n",
+				   row, channel);
+
 		/* FIXME - put in DIMM location */
 		edac_mc_printk(mci, KERN_WARNING,
 			"CE page 0x%lx, offset 0x%lx, grain %d, syndrome "
@@ -709,6 +715,7 @@ void edac_mc_handle_ce(struct mem_ctl_info *mci,
 			page_frame_number, offset_in_page,
 			mci->csrows[row].grain, syndrome, row, channel,
 			mci->csrows[row].channels[channel].label, msg);
+	}
 
 	mci->ce_count++;
 	mci->csrows[row].ce_count++;
diff --git a/drivers/edac/mce_amd.c b/drivers/edac/mce_amd.c
index bd926ea2e00c..ad7f47ddd7da 100644
--- a/drivers/edac/mce_amd.c
+++ b/drivers/edac/mce_amd.c
@@ -1,5 +1,7 @@
 #include <linux/module.h>
 #include <linux/slab.h>
+#include <trace/events/mce.h>
+#include <asm/ras.h>
 
 #include "mce_amd.h"
 
@@ -137,9 +139,9 @@ static bool f12h_dc_mce(u16 ec, u8 xec)
 		ret = true;
 
 		if (ll == LL_L2)
-			pr_cont("during L1 linefill from L2.\n");
+			ras_printk(PR_CONT, "during L1 linefill from L2.\n");
 		else if (ll == LL_L1)
-			pr_cont("Data/Tag %s error.\n", R4_MSG(ec));
+			ras_printk(PR_CONT, "Data/Tag %s error.\n", R4_MSG(ec));
 		else
 			ret = false;
 	}
@@ -149,7 +151,7 @@ static bool f12h_dc_mce(u16 ec, u8 xec)
 static bool f10h_dc_mce(u16 ec, u8 xec)
 {
 	if (R4(ec) == R4_GEN && LL(ec) == LL_L1) {
-		pr_cont("during data scrub.\n");
+		ras_printk(PR_CONT, "during data scrub.\n");
 		return true;
 	}
 	return f12h_dc_mce(ec, xec);
@@ -158,7 +160,7 @@ static bool f10h_dc_mce(u16 ec, u8 xec)
 static bool k8_dc_mce(u16 ec, u8 xec)
 {
 	if (BUS_ERROR(ec)) {
-		pr_cont("during system linefill.\n");
+		ras_printk(PR_CONT, "during system linefill.\n");
 		return true;
 	}
 
@@ -178,14 +180,14 @@ static bool f14h_dc_mce(u16 ec, u8 xec)
 		switch (r4) {
 		case R4_DRD:
 		case R4_DWR:
-			pr_cont("Data/Tag parity error due to %s.\n",
+			ras_printk(PR_CONT, "Data/Tag parity error due to %s.\n",
 				(r4 == R4_DRD ? "load/hw prf" : "store"));
 			break;
 		case R4_EVICT:
-			pr_cont("Copyback parity error on a tag miss.\n");
+			ras_printk(PR_CONT, "Copyback parity error on a tag miss.\n");
 			break;
 		case R4_SNOOP:
-			pr_cont("Tag parity error during snoop.\n");
+			ras_printk(PR_CONT, "Tag parity error during snoop.\n");
 			break;
 		default:
 			ret = false;
@@ -195,17 +197,17 @@ static bool f14h_dc_mce(u16 ec, u8 xec)
 		if ((II(ec) != II_MEM && II(ec) != II_IO) || LL(ec) != LL_LG)
 			return false;
 
-		pr_cont("System read data error on a ");
+		ras_printk(PR_CONT, "System read data error on a ");
 
 		switch (r4) {
 		case R4_RD:
-			pr_cont("TLB reload.\n");
+			ras_printk(PR_CONT, "TLB reload.\n");
 			break;
 		case R4_DWR:
-			pr_cont("store.\n");
+			ras_printk(PR_CONT, "store.\n");
 			break;
 		case R4_DRD:
-			pr_cont("load.\n");
+			ras_printk(PR_CONT, "load.\n");
 			break;
 		default:
 			ret = false;
@@ -225,28 +227,29 @@ static bool f15h_dc_mce(u16 ec, u8 xec)
 
 		switch (xec) {
 		case 0x0:
-			pr_cont("Data Array access error.\n");
+			ras_printk(PR_CONT, "Data Array access error.\n");
 			break;
 
 		case 0x1:
-			pr_cont("UC error during a linefill from L2/NB.\n");
+			ras_printk(PR_CONT, "UC error during a linefill "
+					    "from L2/NB.\n");
 			break;
 
 		case 0x2:
 		case 0x11:
-			pr_cont("STQ access error.\n");
+			ras_printk(PR_CONT, "STQ access error.\n");
 			break;
 
 		case 0x3:
-			pr_cont("SCB access error.\n");
+			ras_printk(PR_CONT, "SCB access error.\n");
 			break;
 
 		case 0x10:
-			pr_cont("Tag error.\n");
+			ras_printk(PR_CONT, "Tag error.\n");
 			break;
 
 		case 0x12:
-			pr_cont("LDQ access error.\n");
+			ras_printk(PR_CONT, "LDQ access error.\n");
 			break;
 
 		default:
@@ -255,9 +258,9 @@ static bool f15h_dc_mce(u16 ec, u8 xec)
 	} else if (BUS_ERROR(ec)) {
 
 		if (!xec)
-			pr_cont("during system linefill.\n");
+			ras_printk(PR_CONT, "during system linefill.\n");
 		else
-			pr_cont(" Internal %s condition.\n",
+			ras_printk(PR_CONT, " Internal %s condition.\n",
 				((xec == 1) ? "livelock" : "deadlock"));
 	} else
 		ret = false;
@@ -270,12 +273,12 @@ static void amd_decode_dc_mce(struct mce *m)
 	u16 ec = EC(m->status);
 	u8 xec = XEC(m->status, xec_mask);
 
-	pr_emerg(HW_ERR "Data Cache Error: ");
+	ras_printk(PR_EMERG, HW_ERR "Data Cache Error: ");
 
 	/* TLB error signatures are the same across families */
 	if (TLB_ERROR(ec)) {
 		if (TT(ec) == TT_DATA) {
-			pr_cont("%s TLB %s.\n", LL_MSG(ec),
+			ras_printk(PR_CONT, "%s TLB %s.\n", LL_MSG(ec),
 				((xec == 2) ? "locked miss"
 					    : (xec ? "multimatch" : "parity")));
 			return;
@@ -283,7 +286,7 @@ static void amd_decode_dc_mce(struct mce *m)
 	} else if (fam_ops->dc_mce(ec, xec))
 		;
 	else
-		pr_emerg(HW_ERR "Corrupted DC MCE info?\n");
+		ras_printk(PR_EMERG, HW_ERR "Corrupted DC MCE info?\n");
 }
 
 static bool k8_ic_mce(u16 ec, u8 xec)
@@ -295,19 +298,19 @@ static bool k8_ic_mce(u16 ec, u8 xec)
 		return false;
 
 	if (ll == 0x2)
-		pr_cont("during a linefill from L2.\n");
+		ras_printk(PR_CONT, "during a linefill from L2.\n");
 	else if (ll == 0x1) {
 		switch (R4(ec)) {
 		case R4_IRD:
-			pr_cont("Parity error during data load.\n");
+			ras_printk(PR_CONT, "Parity error during data load.\n");
 			break;
 
 		case R4_EVICT:
-			pr_cont("Copyback Parity/Victim error.\n");
+			ras_printk(PR_CONT, "Copyback Parity/Victim error.\n");
 			break;
 
 		case R4_SNOOP:
-			pr_cont("Tag Snoop error.\n");
+			ras_printk(PR_CONT, "Tag Snoop error.\n");
 			break;
 
 		default:
@@ -330,9 +333,9 @@ static bool f14h_ic_mce(u16 ec, u8 xec)
 			ret = false;
 
 		if (r4 == R4_IRD)
-			pr_cont("Data/tag array parity error for a tag hit.\n");
+			ras_printk(PR_CONT, "Data/tag array parity error for a tag hit.\n");
 		else if (r4 == R4_SNOOP)
-			pr_cont("Tag error during snoop/victimization.\n");
+			ras_printk(PR_CONT, "Tag error during snoop/victimization.\n");
 		else
 			ret = false;
 	}
@@ -348,15 +351,16 @@ static bool f15h_ic_mce(u16 ec, u8 xec)
 
 	switch (xec) {
 	case 0x0 ... 0xa:
-		pr_cont("%s.\n", f15h_ic_mce_desc[xec]);
+		ras_printk(PR_CONT, "%s.\n", f15h_ic_mce_desc[xec]);
 		break;
 
 	case 0xd:
-		pr_cont("%s.\n", f15h_ic_mce_desc[xec-2]);
+		ras_printk(PR_CONT, "%s.\n", f15h_ic_mce_desc[xec-2]);
 		break;
 
 	case 0x10 ... 0x14:
-		pr_cont("Decoder %s parity error.\n", f15h_ic_mce_desc[xec-4]);
+		ras_printk(PR_CONT, "Decoder %s parity error.\n",
+				    f15h_ic_mce_desc[xec-4]);
 		break;
 
 	default:
@@ -370,19 +374,20 @@ static void amd_decode_ic_mce(struct mce *m)
 	u16 ec = EC(m->status);
 	u8 xec = XEC(m->status, xec_mask);
 
-	pr_emerg(HW_ERR "Instruction Cache Error: ");
+	ras_printk(PR_EMERG, HW_ERR "Instruction Cache Error: ");
 
 	if (TLB_ERROR(ec))
-		pr_cont("%s TLB %s.\n", LL_MSG(ec),
+		ras_printk(PR_CONT, "%s TLB %s.\n", LL_MSG(ec),
 			(xec ? "multimatch" : "parity error"));
 	else if (BUS_ERROR(ec)) {
 		bool k8 = (boot_cpu_data.x86 == 0xf && (m->status & BIT_64(58)));
 
-		pr_cont("during %s.\n", (k8 ? "system linefill" : "NB data read"));
+		ras_printk(PR_CONT, "during %s.\n", (k8 ? "system linefill"
+							: "NB data read"));
 	} else if (fam_ops->ic_mce(ec, xec))
 		;
 	else
-		pr_emerg(HW_ERR "Corrupted IC MCE info?\n");
+		ras_printk(PR_EMERG, HW_ERR "Corrupted IC MCE info?\n");
 }
 
 static void amd_decode_bu_mce(struct mce *m)
@@ -390,30 +395,33 @@ static void amd_decode_bu_mce(struct mce *m)
 	u16 ec = EC(m->status);
 	u8 xec = XEC(m->status, xec_mask);
 
-	pr_emerg(HW_ERR "Bus Unit Error");
+	ras_printk(PR_EMERG, HW_ERR "Bus Unit Error");
 
 	if (xec == 0x1)
-		pr_cont(" in the write data buffers.\n");
+		ras_printk(PR_CONT, " in the write data buffers.\n");
 	else if (xec == 0x3)
-		pr_cont(" in the victim data buffers.\n");
+		ras_printk(PR_CONT, " in the victim data buffers.\n");
 	else if (xec == 0x2 && MEM_ERROR(ec))
-		pr_cont(": %s error in the L2 cache tags.\n", R4_MSG(ec));
+		ras_printk(PR_CONT, ": %s error in the L2 cache tags.\n",
+			   R4_MSG(ec));
 	else if (xec == 0x0) {
 		if (TLB_ERROR(ec))
-			pr_cont(": %s error in a Page Descriptor Cache or "
-				"Guest TLB.\n", TT_MSG(ec));
+			ras_printk(PR_CONT, ": %s error in a Page Descriptor "
+					    "Cache or Guest TLB.\n",
+					    TT_MSG(ec));
 		else if (BUS_ERROR(ec))
-			pr_cont(": %s/ECC error in data read from NB: %s.\n",
-				R4_MSG(ec), PP_MSG(ec));
+			ras_printk(PR_CONT, ": %s/ECC error in data read from NB: %s.\n",
+					    R4_MSG(ec), PP_MSG(ec));
 		else if (MEM_ERROR(ec)) {
 			u8 r4 = R4(ec);
 
 			if (r4 >= 0x7)
-				pr_cont(": %s error during data copyback.\n",
-					R4_MSG(ec));
+				ras_printk(PR_CONT, ": %s error during data copyback.\n",
+						    R4_MSG(ec));
 			else if (r4 <= 0x1)
-				pr_cont(": %s parity/ECC error during data "
-					"access from L2.\n", R4_MSG(ec));
+				ras_printk(PR_CONT, ": %s parity/ECC error "
+						    "during data access from L2.\n",
+						    R4_MSG(ec));
 			else
 				goto wrong_bu_mce;
 		} else
@@ -424,7 +432,7 @@ static void amd_decode_bu_mce(struct mce *m)
 	return;
 
 wrong_bu_mce:
-	pr_emerg(HW_ERR "Corrupted BU MCE info?\n");
+	ras_printk(PR_EMERG, HW_ERR "Corrupted BU MCE info?\n");
 }
 
 static void amd_decode_cu_mce(struct mce *m)
@@ -432,28 +440,28 @@ static void amd_decode_cu_mce(struct mce *m)
 	u16 ec = EC(m->status);
 	u8 xec = XEC(m->status, xec_mask);
 
-	pr_emerg(HW_ERR "Combined Unit Error: ");
+	ras_printk(PR_EMERG, HW_ERR "Combined Unit Error: ");
 
 	if (TLB_ERROR(ec)) {
 		if (xec == 0x0)
-			pr_cont("Data parity TLB read error.\n");
+			ras_printk(PR_CONT, "Data parity TLB read error.\n");
 		else if (xec == 0x1)
-			pr_cont("Poison data provided for TLB fill.\n");
+			ras_printk(PR_CONT, "Poison data provided for TLB fill.\n");
 		else
 			goto wrong_cu_mce;
 	} else if (BUS_ERROR(ec)) {
 		if (xec > 2)
 			goto wrong_cu_mce;
 
-		pr_cont("Error during attempted NB data read.\n");
+		ras_printk(PR_CONT, "Error during attempted NB data read.\n");
 	} else if (MEM_ERROR(ec)) {
 		switch (xec) {
 		case 0x4 ... 0xc:
-			pr_cont("%s.\n", f15h_cu_mce_desc[xec - 0x4]);
+			ras_printk(PR_CONT, "%s.\n", f15h_cu_mce_desc[xec - 0x4]);
 			break;
 
 		case 0x10 ... 0x14:
-			pr_cont("%s.\n", f15h_cu_mce_desc[xec - 0x7]);
+			ras_printk(PR_CONT, "%s.\n", f15h_cu_mce_desc[xec - 0x7]);
 			break;
 
 		default:
@@ -464,7 +472,7 @@ static void amd_decode_cu_mce(struct mce *m)
 	return;
 
 wrong_cu_mce:
-	pr_emerg(HW_ERR "Corrupted CU MCE info?\n");
+	ras_printk(PR_EMERG, HW_ERR "Corrupted CU MCE info?\n");
 }
 
 static void amd_decode_ls_mce(struct mce *m)
@@ -473,12 +481,12 @@ static void amd_decode_ls_mce(struct mce *m)
 	u8 xec = XEC(m->status, xec_mask);
 
 	if (boot_cpu_data.x86 >= 0x14) {
-		pr_emerg("You shouldn't be seeing an LS MCE on this cpu family,"
-			 " please report on LKML.\n");
+		ras_printk(PR_EMERG, "You shouldn't be seeing an LS MCE on this"
+				     " cpu family, please report on LKML.\n");
 		return;
 	}
 
-	pr_emerg(HW_ERR "Load Store Error");
+	ras_printk(PR_EMERG, HW_ERR "Load Store Error");
 
 	if (xec == 0x0) {
 		u8 r4 = R4(ec);
@@ -486,14 +494,14 @@ static void amd_decode_ls_mce(struct mce *m)
 		if (!BUS_ERROR(ec) || (r4 != R4_DRD && r4 != R4_DWR))
 			goto wrong_ls_mce;
 
-		pr_cont(" during %s.\n", R4_MSG(ec));
+		ras_printk(PR_CONT, " during %s.\n", R4_MSG(ec));
 	} else
 		goto wrong_ls_mce;
 
 	return;
 
 wrong_ls_mce:
-	pr_emerg(HW_ERR "Corrupted LS MCE info?\n");
+	ras_printk(PR_EMERG, HW_ERR "Corrupted LS MCE info?\n");
 }
 
 static bool k8_nb_mce(u16 ec, u8 xec)
@@ -502,15 +510,15 @@ static bool k8_nb_mce(u16 ec, u8 xec)
 
 	switch (xec) {
 	case 0x1:
-		pr_cont("CRC error detected on HT link.\n");
+		ras_printk(PR_CONT, "CRC error detected on HT link.\n");
 		break;
 
 	case 0x5:
-		pr_cont("Invalid GART PTE entry during GART table walk.\n");
+		ras_printk(PR_CONT, "Invalid GART PTE entry during GART table walk.\n");
 		break;
 
 	case 0x6:
-		pr_cont("Unsupported atomic RMW received from an IO link.\n");
+		ras_printk(PR_CONT, "Unsupported atomic RMW received from an IO link.\n");
 		break;
 
 	case 0x0:
@@ -518,11 +526,11 @@ static bool k8_nb_mce(u16 ec, u8 xec)
 		if (boot_cpu_data.x86 == 0x11)
 			return false;
 
-		pr_cont("DRAM ECC error detected on the NB.\n");
+		ras_printk(PR_CONT, "DRAM ECC error detected on the NB.\n");
 		break;
 
 	case 0xd:
-		pr_cont("Parity error on the DRAM addr/ctl signals.\n");
+		ras_printk(PR_CONT, "Parity error on the DRAM addr/ctl signals.\n");
 		break;
 
 	default:
@@ -552,9 +560,9 @@ static bool f10h_nb_mce(u16 ec, u8 xec)
 
 	case 0xf:
 		if (TLB_ERROR(ec))
-			pr_cont("GART Table Walk data error.\n");
+			ras_printk(PR_CONT, "GART Table Walk data error.\n");
 		else if (BUS_ERROR(ec))
-			pr_cont("DMA Exclusion Vector Table Walk error.\n");
+			ras_printk(PR_CONT, "DMA Exclusion Vector Table Walk error.\n");
 		else
 			ret = false;
 
@@ -563,7 +571,7 @@ static bool f10h_nb_mce(u16 ec, u8 xec)
 
 	case 0x19:
 		if (boot_cpu_data.x86 == 0x15)
-			pr_cont("Compute Unit Data Error.\n");
+			ras_printk(PR_CONT, "Compute Unit Data Error.\n");
 		else
 			ret = false;
 
@@ -581,7 +589,7 @@ static bool f10h_nb_mce(u16 ec, u8 xec)
 		break;
 	}
 
-	pr_cont("%s.\n", f10h_nb_mce_desc[xec - offset]);
+	ras_printk(PR_CONT, "%s.\n", f10h_nb_mce_desc[xec - offset]);
 
 out:
 	return ret;
@@ -599,27 +607,27 @@ void amd_decode_nb_mce(struct mce *m)
 	u16 ec = EC(m->status);
 	u8 xec = XEC(m->status, 0x1f);
 
-	pr_emerg(HW_ERR "Northbridge Error (node %d): ", node_id);
+	ras_printk(PR_EMERG, HW_ERR "Northbridge Error (node %d): ", node_id);
 
 	switch (xec) {
 	case 0x2:
-		pr_cont("Sync error (sync packets on HT link detected).\n");
+		ras_printk(PR_CONT, "Sync error (sync packets on HT link detected).\n");
 		return;
 
 	case 0x3:
-		pr_cont("HT Master abort.\n");
+		ras_printk(PR_CONT, "HT Master abort.\n");
 		return;
 
 	case 0x4:
-		pr_cont("HT Target abort.\n");
+		ras_printk(PR_CONT, "HT Target abort.\n");
 		return;
 
 	case 0x7:
-		pr_cont("NB Watchdog timeout.\n");
+		ras_printk(PR_CONT, "NB Watchdog timeout.\n");
 		return;
 
 	case 0x9:
-		pr_cont("SVM DMA Exclusion Vector error.\n");
+		ras_printk(PR_CONT, "SVM DMA Exclusion Vector error.\n");
 		return;
 
 	default:
@@ -636,7 +644,7 @@ void amd_decode_nb_mce(struct mce *m)
 	return;
 
 wrong_nb_mce:
-	pr_emerg(HW_ERR "Corrupted NB MCE info?\n");
+	ras_printk(PR_EMERG, HW_ERR "Corrupted NB MCE info?\n");
 }
 EXPORT_SYMBOL_GPL(amd_decode_nb_mce);
 
@@ -651,80 +659,80 @@ static void amd_decode_fr_mce(struct mce *m)
 	if (c->x86 != 0x15 && xec != 0x0)
 		goto wrong_fr_mce;
 
-	pr_emerg(HW_ERR "%s Error: ",
+	ras_printk(PR_EMERG, HW_ERR "%s Error: ",
 		 (c->x86 == 0x15 ? "Execution Unit" : "FIROB"));
 
 	if (xec == 0x0 || xec == 0xc)
-		pr_cont("%s.\n", fr_ex_mce_desc[xec]);
+		ras_printk(PR_CONT, "%s.\n", fr_ex_mce_desc[xec]);
 	else if (xec < 0xd)
-		pr_cont("%s parity error.\n", fr_ex_mce_desc[xec]);
+		ras_printk(PR_CONT, "%s parity error.\n", fr_ex_mce_desc[xec]);
 	else
 		goto wrong_fr_mce;
 
 	return;
 
 wrong_fr_mce:
-	pr_emerg(HW_ERR "Corrupted FR MCE info?\n");
+	ras_printk(PR_EMERG, HW_ERR "Corrupted FR MCE info?\n");
 }
 
 static void amd_decode_fp_mce(struct mce *m)
 {
 	u8 xec = XEC(m->status, xec_mask);
 
-	pr_emerg(HW_ERR "Floating Point Unit Error: ");
+	ras_printk(PR_EMERG, HW_ERR "Floating Point Unit Error: ");
 
 	switch (xec) {
 	case 0x1:
-		pr_cont("Free List");
+		ras_printk(PR_CONT, "Free List");
 		break;
 
 	case 0x2:
-		pr_cont("Physical Register File");
+		ras_printk(PR_CONT, "Physical Register File");
 		break;
 
 	case 0x3:
-		pr_cont("Retire Queue");
+		ras_printk(PR_CONT, "Retire Queue");
 		break;
 
 	case 0x4:
-		pr_cont("Scheduler table");
+		ras_printk(PR_CONT, "Scheduler table");
 		break;
 
 	case 0x5:
-		pr_cont("Status Register File");
+		ras_printk(PR_CONT, "Status Register File");
 		break;
 
 	default:
 		goto wrong_fp_mce;
 		break;
 	}
-
-	pr_cont(" parity error.\n");
+	ras_printk(PR_CONT, " parity error.\n");
 
 	return;
 
 wrong_fp_mce:
-	pr_emerg(HW_ERR "Corrupted FP MCE info?\n");
+	ras_printk(PR_EMERG, HW_ERR "Corrupted FP MCE info?\n");
 }
 
 static inline void amd_decode_err_code(u16 ec)
 {
 
-	pr_emerg(HW_ERR "cache level: %s", LL_MSG(ec));
+	ras_printk(PR_EMERG, HW_ERR "cache level: %s", LL_MSG(ec));
 
 	if (BUS_ERROR(ec))
-		pr_cont(", mem/io: %s", II_MSG(ec));
+		ras_printk(PR_CONT, ", mem/io: %s", II_MSG(ec));
 	else
-		pr_cont(", tx: %s", TT_MSG(ec));
+		ras_printk(PR_CONT, ", tx: %s", TT_MSG(ec));
 
 	if (MEM_ERROR(ec) || BUS_ERROR(ec)) {
-		pr_cont(", mem-tx: %s", R4_MSG(ec));
+		ras_printk(PR_CONT, ", mem-tx: %s", R4_MSG(ec));
 
 		if (BUS_ERROR(ec))
-			pr_cont(", part-proc: %s (%s)", PP_MSG(ec), TO_MSG(ec));
+			ras_printk(PR_CONT, ", part-proc: %s (%s)",
+					    PP_MSG(ec), TO_MSG(ec));
 	}
 
-	pr_cont("\n");
+	ras_printk(PR_CONT, "\n");
 }
 
 /*
@@ -752,7 +760,7 @@ int amd_decode_mce(struct notifier_block *nb, unsigned long val, void *data)
 	if (amd_filter_mce(m))
 		return NOTIFY_STOP;
 
-	pr_emerg(HW_ERR "CPU:%d\tMC%d_STATUS[%s|%s|%s|%s|%s",
+	ras_printk(PR_EMERG, HW_ERR "CPU:%d\tMC%d_STATUS[%s|%s|%s|%s|%s",
 		m->extcpu, m->bank,
 		((m->status & MCI_STATUS_OVER)	? "Over"  : "-"),
 		((m->status & MCI_STATUS_UC)	? "UE"	  : "CE"),
@@ -761,19 +769,20 @@ int amd_decode_mce(struct notifier_block *nb, unsigned long val, void *data)
 		((m->status & MCI_STATUS_ADDRV)	? "AddrV" : "-"));
 
 	if (c->x86 == 0x15)
-		pr_cont("|%s|%s",
+		ras_printk(PR_CONT, "|%s|%s",
 			((m->status & BIT_64(44)) ? "Deferred" : "-"),
 			((m->status & BIT_64(43)) ? "Poison"   : "-"));
 
 	/* do the two bits[14:13] together */
 	ecc = (m->status >> 45) & 0x3;
 	if (ecc)
-		pr_cont("|%sECC", ((ecc == 2) ? "C" : "U"));
+		ras_printk(PR_CONT, "|%sECC", ((ecc == 2) ? "C" : "U"));
 
-	pr_cont("]: 0x%016llx\n", m->status);
+	ras_printk(PR_CONT, "]: 0x%016llx\n", m->status);
 
 	if (m->status & MCI_STATUS_ADDRV)
-		pr_emerg(HW_ERR "\tMC%d_ADDR: 0x%016llx\n", m->bank, m->addr);
+		ras_printk(PR_EMERG, HW_ERR "\tMC%d_ADDR: 0x%016llx\n",
+			   m->bank, m->addr);
 
 	switch (m->bank) {
 	case 0:
@@ -813,6 +822,8 @@ int amd_decode_mce(struct notifier_block *nb, unsigned long val, void *data)
 
 	amd_decode_err_code(m->status & 0xffff);
 
+	trace_mce_record(decoded_err_str, m);
+
 	return NOTIFY_STOP;
 }
 EXPORT_SYMBOL_GPL(amd_decode_mce);
@@ -882,10 +893,14 @@ static int __init mce_amd_init(void)
 		return -EINVAL;
 	}
 
-	pr_info("MCE: In-kernel MCE decoding enabled.\n");
+	decoded_err_str = kzalloc(ERR_STRING_SZ, GFP_KERNEL);
+	if (!decoded_err_str)
+		return -ENOMEM;
 
 	mce_register_decode_chain(&amd_mce_dec_nb);
 
+	pr_info("MCE: In-kernel MCE decoding enabled.\n");
+
 	return 0;
 }
 early_initcall(mce_amd_init);
@@ -894,6 +909,7 @@ early_initcall(mce_amd_init);
 static void __exit mce_amd_exit(void)
 {
 	mce_unregister_decode_chain(&amd_mce_dec_nb);
+	kfree(decoded_err_str);
 	kfree(fam_ops);
 }
 
diff --git a/include/trace/events/mce.h b/include/trace/events/mce.h
index 4cbbcef6baa8..1a7cd471a771 100644
--- a/include/trace/events/mce.h
+++ b/include/trace/events/mce.h
@@ -10,11 +10,12 @@
 
 TRACE_EVENT(mce_record,
 
-	TP_PROTO(struct mce *m),
+	TP_PROTO(const char *msg, struct mce *m),
 
-	TP_ARGS(m),
+	TP_ARGS(msg, m),
 
 	TP_STRUCT__entry(
+		__string(	msg,		msg		)
 		__field(	u64,		mcgcap		)
 		__field(	u64,		mcgstatus	)
 		__field(	u64,		status		)
@@ -33,6 +34,7 @@ TRACE_EVENT(mce_record,
 	),
 
 	TP_fast_assign(
+		__assign_str(msg,	msg);
 		__entry->mcgcap		= m->mcgcap;
 		__entry->mcgstatus	= m->mcgstatus;
 		__entry->status		= m->status;
@@ -50,7 +52,8 @@ TRACE_EVENT(mce_record,
 		__entry->cpuvendor	= m->cpuvendor;
 	),
 
-	TP_printk("CPU: %d, MCGc/s: %llx/%llx, MC%d: %016Lx, ADDR/MISC: %016Lx/%016Lx, RIP: %02x:<%016Lx>, TSC: %llx, PROCESSOR: %u:%x, TIME: %llu, SOCKET: %u, APIC: %x",
+	TP_printk("%s\n(CPU: %d, MCGc/s: %llx/%llx, MC%d: %016Lx, ADDR/MISC: %016Lx/%016Lx, RIP: %02x:<%016Lx>, TSC: %llx, PROCESSOR: %u:%x, TIME: %llu, SOCKET: %u, APIC: %x)",
+		__get_str(msg),
 		__entry->cpu,
 		__entry->mcgcap, __entry->mcgstatus,
 		__entry->bank, __entry->status,
@@ -63,6 +66,23 @@ TRACE_EVENT(mce_record,
 		__entry->apicid)
 );
 
+TRACE_EVENT(hw_error,
+
+	TP_PROTO(const char *msg),
+
+	TP_ARGS(msg),
+
+	TP_STRUCT__entry(
+		__string(msg, msg)
+	),
+
+	TP_fast_assign(
+		__assign_str(msg, msg);
+	),
+
+	TP_printk(HW_ERR "%s\n", __get_str(msg))
+);
+
 #endif /* _TRACE_MCE_H */
 
 /* This part must be outside protection */
-- 
1.7.8.rc0


-- 
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551


* Re: RAS trace event proto
  2012-02-20 14:59 RAS trace event proto Borislav Petkov
@ 2012-02-21  1:14 ` Steven Rostedt
  2012-02-21 10:15   ` Borislav Petkov
  2012-02-21 12:24 ` Mauro Carvalho Chehab
  1 sibling, 1 reply; 19+ messages in thread
From: Steven Rostedt @ 2012-02-21  1:14 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Mauro Carvalho Chehab, Ingo Molnar, Tony Luck, edac-devel, LKML

On Mon, 2012-02-20 at 15:59 +0100, Borislav Petkov wrote:

> * I'd like to have conditional printk-ing in trace_mce_record depending
> on the TP args, Steve probably knows what can be done:

/me is confused :-)

> 
> @Steven:
> 
> I'd like to do the following:
> 
> 	TP_printk("%s, ARG1: %d, ARG2: %c ...", str1, arg1, arg2)
> 
> and have it print only the first arg, i.e. the string and drop the rest
> of the args while still doing the TP_fast_assign into the ring buffer
> and carrying the stuff to its consumers. Background is that I want to
> dump the decoded string of a hardware error, if it is decoded, but carry
> the MCE info to userspace and only dump the fields of the MCE if I
> haven't managed to decode it, i.e. str1 == "".
> 
> So, my question is, can I do something like:
> 
> 	TP_printk("%s, ARG1: %d, ARG2: %c ...", __print_conditional(str1, arg1, arg2))

You want to affect the output of ftrace?

perf and even trace-cmd do the parsing later and can be overridden.
Well, perf can be when we finally get it to use the updated trace-cmd
parser.

-- Steve

> 
> where __print_conditional is a vararg macro which calls a
> ftrace_print_cond() which prints only str1 if strlen(str1) > 0 and
> otherwise calls a vsnprintf() variant to deal with the va_args?
> 




* Re: RAS trace event proto
  2012-02-21  1:14 ` Steven Rostedt
@ 2012-02-21 10:15   ` Borislav Petkov
  0 siblings, 0 replies; 19+ messages in thread
From: Borislav Petkov @ 2012-02-21 10:15 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Borislav Petkov, Mauro Carvalho Chehab, Ingo Molnar, Tony Luck,
	edac-devel, LKML

On Mon, Feb 20, 2012 at 08:14:28PM -0500, Steven Rostedt wrote:
> On Mon, 2012-02-20 at 15:59 +0100, Borislav Petkov wrote:
> 
> > * I'd like to have conditional printk-ing in trace_mce_record depending
> > on the TP args, Steve probably knows what can be done:
> 
> /me is confused :-)

Yeah, sorry. I'll explain to you in more detail when you get over here
next month :-).

> > 
> > @Steven:
> > 
> > I'd like to do the following:
> > 
> > 	TP_printk("%s, ARG1: %d, ARG2: %c ...", str1, arg1, arg2)
> > 
> > and have it print only the first arg, i.e. the string and drop the rest
> > of the args while still doing the TP_fast_assign into the ring buffer
> > and carrying the stuff to its consumers. Background is that I want to
> > dump the decoded string of a hardware error, if it is decoded, but carry
> > the MCE info to userspace and only dump the fields of the MCE if I
> > haven't managed to decode it, i.e. str1 == "".
> > 
> > So, my question is, can I do something like:
> > 
> > 	TP_printk("%s, ARG1: %d, ARG2: %c ...", __print_conditional(str1, arg1, arg2))
> 
> You want to affect the output of ftrace?
> 
> perf and even trace-cmd do the parsing later and can be overridden.
> Well, perf can be when we finally get it to use the updated trace-cmd
> parser.

Right, that's another possibility. I simply didn't want to burden the
ring buffer with useless stuff we aren't going to use but I guess a
couple of tens of bytes aren't the world :)

Thanks.

-- 
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551


* Re: RAS trace event proto
  2012-02-20 14:59 RAS trace event proto Borislav Petkov
  2012-02-21  1:14 ` Steven Rostedt
@ 2012-02-21 12:24 ` Mauro Carvalho Chehab
  2012-02-21 14:12   ` Borislav Petkov
  1 sibling, 1 reply; 19+ messages in thread
From: Mauro Carvalho Chehab @ 2012-02-21 12:24 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Steven Rostedt, Ingo Molnar, Tony Luck, edac-devel, LKML

On 20-02-2012 12:59, Borislav Petkov wrote:
> Hi all,
> 
> here's a dirty patch which should hopefully show what I have in mind wrt
> using tracepoints for RAS events. The code compiles and should only give
> an initial idea, it is subject to significant changes until it reaches
> its final form thus it is only exemplary and not final in any case.
> 
> Notes:
> 
> * So there are two RAS tracepoints: trace_mce_record which dumps the
> MCE-related errors + their decoded info and trace_hw_error which simply
> carries a string to userspace. The second one can be used for non-MCA
> errors.
> 
> * When prepping the string for the tracepoint, we cache the string by
> calling ras_printk which buffers the so-far done string internally,
> so everything that wants to dump into it needs to be converted to use
> ras_printk.
> 
> * Which brings me to it: ras_printk() is x86-only and it could probably
> be moved to an arch-agnostic place for the other arches. I'd leave it
> x86-only for now, for testing purposes, and then later the other arches
> could consider using it (this is wrt non-x86 EDAC drivers).
> 
> * When writing a 1 into /sys/devices/system/ras/agent, we enable the
> string buffering functionality - this could be done by the RAS daemon or
> whatever agent is requesting putting hw errors info into tracing.
> 
> * I'd like to have conditional printk-ing in trace_mce_record depending
> on the TP args, Steve probably knows what can be done:
> 
> @Steven:
> 
> I'd like to do the following:
> 
> 	TP_printk("%s, ARG1: %d, ARG2: %c ...", str1, arg1, arg2)
> 
> and have it print only the first arg, i.e. the string and drop the rest
> of the args while still doing the TP_fast_assign into the ring buffer
> and carrying the stuff to its consumers. Background is that I want to
> dump the decoded string of a hardware error, if it is decoded, but carry
> the MCE info to userspace and only dump the fields of the MCE if I
> haven't managed to decode it, i.e. str1 == "".
> 
> So, my question is, can I do something like:
> 
> 	TP_printk("%s, ARG1: %d, ARG2: %c ...", __print_conditional(str1, arg1, arg2))
> 
> where __print_conditional is a vararg macro which calls a
> ftrace_print_cond() which prints only str1 if strlen(str1) > 0 and
> otherwise calls a vsnprintf() variant to deal with the va_args?
> 
> As always, all comments are welcome.

Boris,

The idea of having a separate ring buffer for the hardware errors seems
interesting (the code there at ras.c seems incomplete in this dirty
patch - at least I couldn't see how userspace would retrieve the messages
if the ras agent is enabled).

I prefer to code it arch-agnostic from the beginning, as there's nothing
there that is arch-specific.

It is not clear from this patch how errors are supposed to be reported
in your proposal, as, currently, you're using trace_mce_record() to report
just the MCE error, without any parsing, using an empty string for the
error description.

I _think_ you're planning to move trace_mce_record() to happen inside the
driver (or at the edac core), for the cases where the error is parsed 
inside amd64_edac.

You're using ras_printk() to report not only RAS errors, but also errors
that happened inside the amd64_edac driver, like:

+		ras_printk(PR_EMERG, "You shouldn't be seeing an LS MCE on this"
+				     " cpu family, please report on LKML.\n");

That sounds confusing.

It also should be noticed that the amd64_edac driver reports errors in a
different way than the other drivers. Except for amd64_edac, the other drivers
prepare an error string in some internal buffer, and then they call the
RAS core (edac) to present that error to userspace. For example:

static void i5000_process_fatal_error_info(struct mem_ctl_info *mci,
					struct i5000_error_info *info,
					int handle_errors)
{
...
/* Only 1 bit will be on */
	switch (allErrors) {
	case FERR_FAT_M1ERR:
		specific = "Alert on non-redundant retry or fast "
				"reset timeout";
		break;
...
		/* Form out message */
		snprintf(msg, sizeof(msg),
			 "(Branch=%d DRAM-Bank=%d RDWR=%s RAS=%d "
			 "CAS=%d, UE Err=0x%x (%s))",
			 branch >> 1, bank, rdwr ? "Write" : "Read", ras, cas,
			 ue_errors, specific);

		/* Call the helper to output message */
		edac_mc_handle_fbd_ue(mci, rank, channel, channel + 1, msg);


So, the "msg" buffer will contain the entire error message, and the core call
will increment the error counters, will identify the silkscreen label associated
with the memory location and output the error.

In the case of amd64_edac, the error message is generated using KERN_CONT, i.e.
instead of storing it into some buffer and then doing just one call to output it,
it calls pr_cont() several times (or, after your patch, ras_printk()).

As the idea is to generate just one event per hardware error, it would be
better to change the logic inside amd64_edac to use a driver-specific buffer
for the error message, and then call the RAS core error handler to output it
and let the core generate the trace event (see the sketch after the list
below). This simplifies the logic on all drivers and removes redundancy,
as reporting an error requires several operations, like:
	- incrementing error counts;
	- retrieving the motherboard silkscreen labels;
	- printing the error at dmesg and/or generating a trace event.
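
A minimal sketch of that pattern (illustrative only; the helper name and
arguments follow the current EDAC core API, the message contents are just
an example):

	char msg[80];

	/* the driver builds the complete error message locally ... */
	snprintf(msg, sizeof(msg), "err addr: 0x%llx, syndrome: 0x%x",
		 sys_addr, syndrome);

	/* ... and hands it to the core exactly once; the core increments the
	 * counters, resolves the silkscreen label and prints/traces it */
	edac_mc_handle_ce(mci, page_frame_number, offset_in_page, syndrome,
			  row, channel, msg);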

<snip/>

>  TRACE_EVENT(mce_record,
>  
> -	TP_PROTO(struct mce *m),
> +	TP_PROTO(const char *msg, struct mce *m),

That doesn't sound good to my eyes. It is basically preserving all MCE register
values as-is (u32/u64 values), and folding all parsed information into just
one string field. I fail to understand why it is more important to store in
binary format the content of MCE registers than the other data here.

Also, folding everything into just one string prevents (or makes hard) the usage of
the perf filters, for example, to select only the errors on memory controller 1.

It also requires RAS software to parse the messages in order to detect, for
example, where the memory silkscreen label is inside it. This can be very tricky,
as each driver could fill it in a different way, making it very hard for RAS
userspace developers to write code that would be generic.

My proposal is to have more fields there than just one message. In my patches,
the proposal is to use, instead:

TRACE_EVENT(mc_error_mce,

	TP_PROTO(const unsigned int err_type,
		 const unsigned int mc_index,
		 const char *msg,
		 const char *label,
		 const char *location,
		 const char *detail,
		 const char *driver_detail,
		 const struct mce *m),

Where:

	err_type - identifies if the error is corrected, uncorrected, fatal;
	mc_index - identifies the memory controller instance;
	msg - the error cause, like "read error" , "write error", etc;
	label - The silkscreen label of the affected component(s);
	location - the memory hierarchy location of the error (csrow/channel or
		   branch/channel/dimm, depending on the memory hierarchy);
	detail - contains the error page, offset, grain and syndrome.
	driver_detail - contains additional driver-specific details about
		 the error (other data that the driver error parser may get).
			    
With the above, all possible usecases are covered. The normal usecase,
where the user just needs to know whether there was memory corruption and
what memory needs to be replaced, is covered by "err_type", "msg" and "label".

If more information is required, the other fields provide enough detail to
either replace the bad DRAM chip or to give the user enough data to discuss
the issue with the hardware vendor.

Regards,
Mauro


* Re: RAS trace event proto
  2012-02-21 12:24 ` Mauro Carvalho Chehab
@ 2012-02-21 14:12   ` Borislav Petkov
  2012-02-21 14:48     ` Steven Rostedt
  2012-02-21 17:28     ` Mauro Carvalho Chehab
  0 siblings, 2 replies; 19+ messages in thread
From: Borislav Petkov @ 2012-02-21 14:12 UTC (permalink / raw)
  To: Mauro Carvalho Chehab
  Cc: Borislav Petkov, Steven Rostedt, Ingo Molnar, Tony Luck,
	edac-devel, LKML

On Tue, Feb 21, 2012 at 10:24:09AM -0200, Mauro Carvalho Chehab wrote:
> The idea of having a separate ring buffer for the hardware errors seem

Not a separate ring buffer - ras_printk buffers the string before it is
sent through the tracepoint, i.e. once the string is done, we do:

	trace_mce_record(decoded_err_str, m);

in amd_decode_mce(). The tracepoint can be invoked in the same manner
from other places.

> to be interesting (the code there at ras.c seems incomplete on this dirty
> patch - at least I couldn't see how userspace would retrieve the messages,
> if the has agent is enabled).

The normal way you retrieve an ftrace trace.

> I prefer to code it arch-agnostic since the beginning, as there's nothing
> there that it is arch-specific.

Let me quote my last email:

"* Which brings me to it: ras_printk() is x86-only and it could probably
be moved to an arch-agnostic place for the other arches. I'd leave it
x86-only for now, for testing purposes, and then later the other arches
could consider using it (this is wrt non-x86 EDAC drivers)."

> It is not clear from this patch how errors are supposed to be reported,
> on your proposal, as, currently, you're using trace_mce_record() to report 
> just the MCE error, without any parsing,  using an empty string for the
> error description.

Let me quote my last email:

"* So there are two RAS tracepoints: trace_mce_record which dumps the
MCE-related errors + their decoded info and trace_hw_error which simply
carries a string to userspace. The second one can be used for non-MCA
errors."

> I _think_ you're planning to move trace_mce_record() to happen inside
> the driver (or at the edac core), for the cases where the error is
> parsed inside amd64_edac.

No.

> You're using ras_printk() to report not only RAS errors, but also errors
> that happened inside the amd64_edac driver, like:

Not the amd64_edac driver - this is the routine that decodes LS MCEs.

> 
> +		ras_printk(PR_EMERG, "You shouldn't be seeing an LS MCE on this"
> +				     " cpu family, please report on LKML.\n");
> 
> That sounds confusing.

No, that's ok since we're carrying a string of a decoded MCE to
userspace - if, instead, we have _another_ string saying that we don't
decode LS MCEs for a CPU family, it's still a string so we're fine. We
could do additional mappings like "DECODE_ERROR" in front of it so that
the userspace tools don't choke on it.

> It also should be noticed that the amd64_edac driver reports error on a
> different way than other drivers. Except for amd64_edac, the other drivers
> prepare an error string on some internal buffer, and then they call the
> RAS core (edac) to present that error to userspace. For example:

This is what I do too - ras_printk() buffers the string internally
before sending it out through the tracepoint.

> 
> static void i5000_process_fatal_error_info(struct mem_ctl_info *mci,
> 					struct i5000_error_info *info,
> 					int handle_errors)
> {
> ...
> /* Only 1 bit will be on */
> 	switch (allErrors) {
> 	case FERR_FAT_M1ERR:
> 		specific = "Alert on non-redundant retry or fast "
> 				"reset timeout";
> 		break;
> ...
> 		/* Form out message */
> 		snprintf(msg, sizeof(msg),
> 			 "(Branch=%d DRAM-Bank=%d RDWR=%s RAS=%d "
> 			 "CAS=%d, UE Err=0x%x (%s))",
> 			 branch >> 1, bank, rdwr ? "Write" : "Read", ras, cas,
> 			 ue_errors, specific);
> 
> 		/* Call the helper to output message */
> 		edac_mc_handle_fbd_ue(mci, rank, channel, channel + 1, msg);
> 
> 
> So, the "msg" buffer will contain the entire error message, and the core call
> will increment the error counters, will identify the silkscreen label associated
> with the memory location and output the error.
> 
> In the case of amd64_edac, the error message is generated using KERN_COUNT, e. g.
> instead of storing it into some buffer and then doing just one call to output it,
> it calls pr_count() several times (or, after your patch, ras_printk).

I think you're confusing amd64_edac with <drivers/edac/mce_amd.c> which
contains the code that decodes _ALL_ MCEs and not only DRAM ECC errors
which amd64_edac decodes.

> 
> As the idea is to generate just one event for one hardware error, the better is
> to change the logic inside amd64_edac to use driver-specific buffer for the
> error message, and then call the RAS core error handler to output it, and let
> the core to generate the trace event, as this simplifies the logic on all
> drivers, as it removes redundancy, as reporting an error requires several
> operations, like:
> 	- incrementing error counts;
> 	- retrieving the motherboard silkscreen labels;
> 	- printing the error at dmesg and/or generating a trace event.

This is what the code does already.

> 
> <snip/>
> 
> >  TRACE_EVENT(mce_record,
> >  
> > -	TP_PROTO(struct mce *m),
> > +	TP_PROTO(const char *msg, struct mce *m),
> 
> That doesn't sound good on my eyes. It is basically preserving all MCE register
> values as-is (u32/u64 values), and folding all parsed information into just
> one string field. I fail to understand why it is more important to store in
> binary format the content of MCE registers than the other datat here.

The msg arg is the decoded string and not the MCE register values. We
carry the register values to userspace just in case, say, a RAS daemon
would like to look at them too. This makes sense in cases where a RAS
daemon, for example, wants to count how many errors have happened per
rank and at which address, etc...

> Also, folding everything into just one string prevents (or make hard) the usage of
> the perf filters, for example, to filter only the errors at the memory controller 1.

Huh, because you can't grep through the trace anymore...?

> It also requires that a RAS software to parse the messages in order to
> detect, for example, where the memory silkscreen label is inside it.
> This can be very tricky, as each driver could fill it with a different
> way, making very hard for RAS userspace developers to write a code
> that would be generic.

We can add an additional arg called char *silkscreen in case it is
needed.
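I.e., something like this (just a sketch):

	TP_PROTO(const char *msg, const char *silkscreen, struct mce *m),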

> My proposal is to have more fields there than just one message. On my patches,
> the proposal is to use, instead:
> 
> TRACE_EVENT(mc_error_mce,
> 
> 	TP_PROTO(const unsigned int err_type,
> 		 const unsigned int mc_index,
> 		 const char *msg,
> 		 const char *label,
> 		 const char *location,
> 		 const char *detail,
> 		 const char *driver_detail,
> 		 const struct mce *m),

Hehe, so basically instead of one string you have a bunch of strings in
addition to some unneeded fields. I fail to see the difference.

The greatest common denominator between hw errors is a single string
containing the whole error message, that's it! The moment you start
implementing an arch-agnostic solution, a single string is all you can
use. And adding a tracepoint for each error type is a no-go too, as I've
explained it to you already.

> Where:
> 
> 	err_type - identifies if the error is corrected, uncorrected, fatal;
> 	mc_index - identifies the memory controller instance;
> 	msg - the error cause, like "read error" , "write error", etc;
> 	label - The silkscreen label of the affected component(s);
> 	location - the memory hierarchy location of the error (csrow/channel or
> 		   branch/channel/dimm, depending on the memory hierarchy);
> 	detail - contains the error page, offset, grain and syndrome.
> 	driver_detail - contains additional driver-specific details about
> 		 the error (other data that the driver error parser may get).
> 			    
> With the above, all possible usecases are covered. The normal usecase

As I've already explained to you multiple times, this does NOT cover
all possible cases - it covers only DRAM errors and adds too much
unnecessary info. If you start adding tracepoints for all the other error
types hw can report, you're going to enter Bloatland very quickly.

> where the user just needs to know if there were memory corruption, and what
> memory needs to be replaced is covered by "err_type", "msg" and "label".

-- 
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551


* Re: RAS trace event proto
  2012-02-21 14:12   ` Borislav Petkov
@ 2012-02-21 14:48     ` Steven Rostedt
  2012-02-21 14:59       ` Borislav Petkov
  2012-02-21 17:28     ` Mauro Carvalho Chehab
  1 sibling, 1 reply; 19+ messages in thread
From: Steven Rostedt @ 2012-02-21 14:48 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Mauro Carvalho Chehab, Ingo Molnar, Tony Luck, edac-devel, LKML

On Tue, 2012-02-21 at 15:12 +0100, Borislav Petkov wrote:

> > Also, folding everything into just one string prevents (or make hard) the usage of
> > the perf filters, for example, to filter only the errors at the memory controller 1.
> 
> Huh, because you can't grep through the trace anymore...?
> 

I believe Mauro is talking about the tracing filters used by both perf
and ftrace that let you ignore trace events when the contents of the
event do not match the filter. This is filtering out events before
they go to the buffer.

-- Steve




* Re: RAS trace event proto
  2012-02-21 14:48     ` Steven Rostedt
@ 2012-02-21 14:59       ` Borislav Petkov
  2012-02-21 16:18         ` Mauro Carvalho Chehab
  2012-02-22  0:58         ` Luck, Tony
  0 siblings, 2 replies; 19+ messages in thread
From: Borislav Petkov @ 2012-02-21 14:59 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Mauro Carvalho Chehab, Ingo Molnar, Tony Luck, edac-devel, LKML

On Tue, Feb 21, 2012 at 09:48:18AM -0500, Steven Rostedt wrote:
> On Tue, 2012-02-21 at 15:12 +0100, Borislav Petkov wrote:
> 
> > > Also, folding everything into just one string prevents (or make hard) the usage of
> > > the perf filters, for example, to filter only the errors at the memory controller 1.
> > 
> > Huh, because you can't grep through the trace anymore...?
> > 
> 
> I believe Mauro is talking about the tracing filters used by both perf
> and ftrace that lets you ignore trace events when the contents of the
> event does not match the filter. This is filtering out events before
> they go to the buffer.

Oh ok, in that case we could filter the errors - if needed - before they
even get reported. I say "if needed" because normally we want to collect
all hw errors in the trace, IMHO.

-- 
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551


* Re: RAS trace event proto
  2012-02-21 14:59       ` Borislav Petkov
@ 2012-02-21 16:18         ` Mauro Carvalho Chehab
  2012-02-22  0:58         ` Luck, Tony
  1 sibling, 0 replies; 19+ messages in thread
From: Mauro Carvalho Chehab @ 2012-02-21 16:18 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Steven Rostedt, Ingo Molnar, Tony Luck, edac-devel, LKML

On 21-02-2012 12:59, Borislav Petkov wrote:
> On Tue, Feb 21, 2012 at 09:48:18AM -0500, Steven Rostedt wrote:
>> On Tue, 2012-02-21 at 15:12 +0100, Borislav Petkov wrote:
>>
>>>> Also, folding everything into just one string prevents (or make hard) the usage of
>>>> the perf filters, for example, to filter only the errors at the memory controller 1.
>>>
>>> Huh, because you can't grep through the trace anymore...?
>>>
>>
>> I believe Mauro is talking about the tracing filters used by both perf
>> and ftrace that lets you ignore trace events when the contents of the
>> event does not match the filter. This is filtering out events before
>> they go to the buffer.

Yes, that's what I'm talking about.

Another advantage of using separate fields for the silkscreen label and for
the additional details is that the userspace program displaying them
(trace-cmd or whatever) could have different profiles to either display
just the error message and the silkscreen label or to print the entire error
message.

> Oh ok, in that case we could filter the errors - if needed - before they
> get even reported. I say "if needed" because normally we want to collect
> all hw errors in the trace, IMHO.




* Re: RAS trace event proto
  2012-02-21 14:12   ` Borislav Petkov
  2012-02-21 14:48     ` Steven Rostedt
@ 2012-02-21 17:28     ` Mauro Carvalho Chehab
  1 sibling, 0 replies; 19+ messages in thread
From: Mauro Carvalho Chehab @ 2012-02-21 17:28 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Steven Rostedt, Ingo Molnar, Tony Luck, edac-devel, LKML

On 21-02-2012 12:12, Borislav Petkov wrote:
> On Tue, Feb 21, 2012 at 10:24:09AM -0200, Mauro Carvalho Chehab wrote:
>> The idea of having a separate ring buffer for the hardware errors seem
> 
> Not a separate ring buffer - ras_printk buffers the string before its
> being sent through the tracepoint, i.e. once the string is done, we do:
> 
> 	trace_mce_record(decoded_err_str, m);
> in amd_decode_mce(). The tracepoint can be invoked in the same manner
> from other places.

Ah, ok. I missed this call to trace_mce_record() on my first review.

The ras.c code seems to be a way to help your amd64 MCE parser. After
reviewing how errors are output by the other drivers, I don't think these
routines would help them, as in all other places there is a driver buffer
that does the same.

Also, it doesn't sound like a good idea to have this hunk:

--- a/drivers/edac/edac_mc.c
+++ b/drivers/edac/edac_mc.c
@@ -701,7 +703,11 @@ void edac_mc_handle_ce(struct mem_ctl_info *mci,
 		return;
 	}
 
-	if (edac_mc_get_log_ce())
+	if (edac_mc_get_log_ce()) {
+		if (ras_agent)
+			ras_printk(PR_CONT, "row: %d, channel: %d\n",
+				   row, channel);
+
 		/* FIXME - put in DIMM location */
 		edac_mc_printk(mci, KERN_WARNING,
 			"CE page 0x%lx, offset 0x%lx, grain %d, syndrome "

The routine that handles the error should generate the trace itself, not
fill a buffer so that the trace can happen somewhere else.
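
Something like this, I mean (just a sketch to illustrate the idea -
trace_mc_error() and most of its arguments are hypothetical here, not part
of any posted patch):

	/*
	 * Sketch: emit the event right where the error is handled,
	 * instead of buffering a string for somebody else to trace later.
	 */
	if (edac_mc_get_log_ce())
		trace_mc_error(HW_EVENT_ERR_CORRECTED, mci->mc_idx,
			       msg, label, location, detail, driver_detail);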

>> to be interesting (the code there at ras.c seems incomplete on this dirty
>> patch - at least I couldn't see how userspace would retrieve the messages,
>> if the ras agent is enabled).
> 
> The normal way you retrieve an ftrace trace.
> 
>> I prefer to code it arch-agnostic since the beginning, as there's nothing
>> there that it is arch-specific.
> 
> Let me quote my last email:
> 
> "* Which brings me to it: ras_printk() is x86-only and it could probably
> be moved to an arch-agnostic place for the other arches. I'd leave it
> x86-only for now, for testing purposes, and then later the other arches
> could consider using it (this is wrt non-x86 EDAC drivers)."

The ras_printk() seems to be mce_amd only.

>> It is not clear from this patch how errors are supposed to be reported,
>> on your proposal, as, currently, you're using trace_mce_record() to report 
>> just the MCE error, without any parsing,  using an empty string for the
>> error description.
> 
> Let me quote my last email:
> 
> "* So there are two RAS tracepoints: trace_mce_record which dumps the
> MCE-related errors + their decoded info and trace_hw_error which simply
> carries a string to userspace. The second one can be used for non-MCA
> errors."
> 
>> I _think_ you're planning to move trace_mce_record() to happen inside
>> the driver (or at the edac core), for the cases where the error is
>> parsed inside amd64_edac.
> 
> No.

The main purpose of edac_mc_handle_ce() & friends is to report the
error to userspace and increment the error counters. Not having the trace
call there means failing to do exactly that.

> 
>> You're using ras_printk() to report not only RAS errors, but also errors
>> that happened inside the amd64_edac driver, like:
> 
> Not the amd64_edac driver - this is the routine that decodes LS MCEs.

Ok.

> 
>>
>> +		ras_printk(PR_EMERG, "You shouldn't be seeing an LS MCE on this"
>> +				     " cpu family, please report on LKML.\n");
>>
>> That sounds confusing.
> 
> No, that's ok since we're carrying a string of a decoded MCE to
> userspace - if, instead, we have _another_ string saying that we don't
> decode LS MCEs for a CPU family, it's still a string so we're fine. We
> could do additional mappings like "DECODE_ERROR" in front of it so that
> the userspace tools don't choke on it.
> 
>> It also should be noticed that the amd64_edac driver reports error on a
>> different way than other drivers. Except for amd64_edac, the other drivers
>> prepare an error string on some internal buffer, and then they call the
>> RAS core (edac) to present that error to userspace. For example:
> 
> This is what I do too - ras_printk() buffers the string internally
> before sending it out through the tracepoint.

So, it is your way of implementing an amd64_edac/mce_amd string buffer.
I don't see any value in changing the other drivers to use it, as a
simple char * buffer would produce the same result on all of them.

<snip/>

>>>  TRACE_EVENT(mce_record,
>>>  
>>> -	TP_PROTO(struct mce *m),
>>> +	TP_PROTO(const char *msg, struct mce *m),
>>
>> That doesn't sound good to my eyes. It is basically preserving all MCE register
>> values as-is (u32/u64 values), and folding all parsed information into just
>> one string field. I fail to understand why it is more important to store in
>> binary format the content of MCE registers than the other data here.
> 
> The msg arg is the decoded string and not the MCE register values. We
> carry the register values to userspace just in case, say, a RAS daemon
> would like to look at them too. This makes sense in cases where a RAS
> daemon, for example, wants to count how many errors have happened per
> rank and at which address, etc...

The rank information may not be there! On Sandy Bridge, the rank is the
result of a complex decoding logic that _doesn't_ use the MCE registers.
As a matter of fact, the only things provided in the MCE registers on
Sandy Bridge are the address and the error type. All the other information
is parsed from non-MCE registers: memory controller, channel, rank, DIMM.
So, exporting struct mce to userspace doesn't help a RAS daemon on Sandy
Bridge.

On the other hand, the "location" field in my proposal will always have
the error location (csrow/dimm or branch/channel/slot, depending on the
memory architecture). That makes much more sense to have there than the
raw MCE register values.

On Nehalem, by contrast, the MCE registers do contain the memory location,
so a RAS daemon could use them, or could equally use the location field.

<snip/>
> 
>> It also requires that a RAS software to parse the messages in order to
>> detect, for example, where the memory silkscreen label is inside it.
>> This can be very tricky, as each driver could fill it with a different
>> way, making very hard for RAS userspace developers to write a code
>> that would be generic.
> 
> We can add an additional arg called char *silkscreen in case it is
> needed.

It is not just the silkscreen label. See my patch series, if you want to
see the dirty details on all drivers, or try to convert the other drivers
to use your proposal, and you'll see what I'm talking about.

>> My proposal is to have more fields there than just one message. On my patches,
>> the proposal is to use, instead:
>>
>> TRACE_EVENT(mc_error_mce,
>>
>> 	TP_PROTO(const unsigned int err_type,
>> 		 const unsigned int mc_index,
>> 		 const char *msg,
>> 		 const char *label,
>> 		 const char *location,
>> 		 const char *detail,
>> 		 const char *driver_detail,
>> 		 const struct mce *m),
> 
> Hehe, so basically instead of one string you have a bunch of strings in
> addition to some unneeded fields. I fail to see the difference.

Some of them could instead be integers or some other representation of
the error details. For example, the location could instead be:

	u32 layer0_pos, u32 layer1_pos, u32 layer2_pos

to represent the memory location (csrow/channel, branch/channel/slot or
channel/slot, depending on how the location is defined in the memory
hierarchy).

The downside of this approach is that, when only 2 layers are filled,
layer2_pos will be meaningless.

The detail could be, instead:
	u32 page, u32 offset, u32 grain, u32 syndrome
or even:
	u64 address, u32 grain, u32 syndrome

The other fields are, per their nature, strings (error message, DIMM label,
extra driver-specific details).

I'm not entirely convinced to use strings there. Comments are welcome.
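
For instance, a purely integer-based layout for the location/detail part
of the event entry could look roughly like this (a sketch only, field names
are illustrative):

	TP_STRUCT__entry(
		__field(	u32,	layer0_pos	)
		__field(	u32,	layer1_pos	)
		__field(	u32,	layer2_pos	)	/* unused when only 2 layers exist */
		__field(	u64,	address		)
		__field(	u32,	grain		)
		__field(	u32,	syndrome	)
	),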

> The greatest common denominator between hw errors is a single string
> containing the whole error message, that's it!

By that argument, the MCE fields should also be converted to a string,
as less than 10% of the existing drivers have them.

> The moment you start
> implementing an arch-agnostic solution, a single string is all you can
> use. And adding a tracepoint for each error type is a no-go too, as I've
> explained it to you already.

This trace can easily be generalized to cover any type of error, like:

	TP_PROTO(const unsigned int error_severity,
		 const unsigned int error_type,
		 const unsigned int instance,
		 const char *error_msg,
		 const char *silkscreen_label,
		 const char *location,
		 const char *detail,
		 const char *driver_detail,
 		 const struct mce *m),

as there's nothing there that is really memory-specific:

	- error_severity: corrected, uncorrected, fatal, ...;
	- error_type: memory error, PCI error, bus error, ...;
	- instance: an instance number to uniquely identify the affected resource;
	- silkscreen_label: can be a memory slot, a bus slot, a cpu slot, ...,
			    depending on the error type;
	- location: "csrow 1 channel 0" for MC; "07:05.0" for PCI, 
		    "Bus 003 Device 001" for USB, ...
	- detail: some additional error_type-specific detail handled by the core;
	- driver_detail: some additional error_type-specific detail handled by the driver.

By having several fields, RAS software can group errors by location, by label, etc.
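
To make that concrete, such a generalized event could be spelled out
roughly like this (a sketch only - the event name is a placeholder and the
detail/driver_detail/struct mce fields are trimmed for brevity):

TRACE_EVENT(hw_event,

	TP_PROTO(const unsigned int error_severity,
		 const unsigned int error_type,
		 const unsigned int instance,
		 const char *error_msg,
		 const char *silkscreen_label,
		 const char *location),

	TP_ARGS(error_severity, error_type, instance, error_msg,
		silkscreen_label, location),

	TP_STRUCT__entry(
		__field(	unsigned int,	error_severity	)
		__field(	unsigned int,	error_type	)
		__field(	unsigned int,	instance	)
		__string(	msg,		error_msg	)
		__string(	label,		silkscreen_label)
		__string(	location,	location	)
	),

	TP_fast_assign(
		__entry->error_severity	= error_severity;
		__entry->error_type	= error_type;
		__entry->instance	= instance;
		__assign_str(msg, error_msg);
		__assign_str(label, silkscreen_label);
		__assign_str(location, location);
	),

	TP_printk("severity %u type %u instance %u: %s (label %s, at %s)",
		  __entry->error_severity, __entry->error_type,
		  __entry->instance, __get_str(msg), __get_str(label),
		  __get_str(location))
);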

> 
>> Where:
>>
>> 	err_type - identifies if the error is corrected, uncorrected, fatal;
>> 	mc_index - identifies the memory controller instance;
>> 	msg - the error cause, like "read error" , "write error", etc;
>> 	label - The silkscreen label of the affected component(s);
>> 	location - the memory hierarchy location of the error (csrow/channel or
>> 		   branch/channel/dimm, depending on the memory hierarchy);
>> 	detail - contains the error page, offset, grain and syndrome.
>> 	driver_detail - contains additional driver-specific details about
>> 		 the error (other data that the driver error parser may get).
>> 			    
>> With the above, all possible usecases are covered. The normal usecase
> 
> As I've already explained it to you multiple times, this does NOT cover
> all possible cases - this covers only DRAM errors and adds unnecessarily
> too much info. If you start adding tracepoints for all the other error
> types hw can report, you're going to enter Bloatland very quickly.

I fail to see it. Most cases will fit into the above trace.

>> where the user just needs to know if there were memory corruption, and what
>> memory needs to be replaced is covered by "err_type", "msg" and "label".
> 

Regards,
Mauro

^ permalink raw reply	[flat|nested] 19+ messages in thread

* RE: RAS trace event proto
  2012-02-21 14:59       ` Borislav Petkov
  2012-02-21 16:18         ` Mauro Carvalho Chehab
@ 2012-02-22  0:58         ` Luck, Tony
  2012-02-22 10:43           ` Borislav Petkov
  1 sibling, 1 reply; 19+ messages in thread
From: Luck, Tony @ 2012-02-22  0:58 UTC (permalink / raw)
  To: Borislav Petkov, Steven Rostedt
  Cc: Mauro Carvalho Chehab, Ingo Molnar, edac-devel, LKML

> > > > Also, folding everything into just one string prevents (or make hard) the usage of
> > > > the perf filters, for example, to filter only the errors at the memory controller 1.
> > > 
> > > Huh, because you can't grep through the trace anymore...?
> > > 
> > 
> > I believe Mauro is talking about the tracing filters used by both perf
> > and ftrace that lets you ignore trace events when the contents of the
> > event does not match the filter. This is filtering out events before
> > they go to the buffer.
>
> Oh ok, in that case we could filter the errors - if needed - before they
> get even reported. I say "if needed" because normally we want to collect
> all hw errors in the trace, IMHO.

I'm also struggling to understand an end-user use case where you would
want filtering.  Mauro - can you expand a bit on why someone would just
want to see the errors from memory controller 1?

My mental model of the world is that large systems have some background
noise - a trickle of corrected errors that happen in normal operation.
User shouldn't care about these errors unless they breach some threshold.

When something goes wrong, you may see a storm of corrected errors, or
some uncorrected errors. In either case you'd like to get as much information
as possible to identify the component that is at fault. I'd definitely like
to see some structure to the error reporting, so that mining for data patterns
in a storm isn't hideously platform dependent.

It might be easier to evaluate the competing ideas here with some sample
output in addition to the code.

-Tony


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: RAS trace event proto
  2012-02-22  0:58         ` Luck, Tony
@ 2012-02-22 10:43           ` Borislav Petkov
  2012-02-22 12:02             ` Mauro Carvalho Chehab
  2012-02-22 15:59             ` Borislav Petkov
  0 siblings, 2 replies; 19+ messages in thread
From: Borislav Petkov @ 2012-02-22 10:43 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Borislav Petkov, Steven Rostedt, Mauro Carvalho Chehab,
	Ingo Molnar, edac-devel, LKML

On Wed, Feb 22, 2012 at 12:58:37AM +0000, Luck, Tony wrote:
> I'm also struggling to understand an end-user use case where you would
> want filtering.  Mauro - can you expand a bit on why someone would just
> want to see the errors from memory controller 1?
> 
> My mental model of the world is that large systems have some background
> noise - a trickle of corrected errors that happen in normal operation.
> User shouldn't care about these errors unless they breach some threshold.
> 
> When something goes wrong, you may see a storm of corrected errors, or
> some uncorrected errors. In either case you'd like to get as much information
> as possible to identify the component that is at fault. I'd definitely like
> to see some structure to the error reporting, so that mining for data patterns
> in a storm isn't hideously platform dependent.

Yep, I'm on the same page here.

> It might be easier to evaluate the competing ideas here with some sample
> output in addition to the code.

Well, to clarify:

When you get a decoded error, you get the same format as what you get in
dmesg, for example:

[ 2666.646070] [Hardware Error]: CPU:64   MC1_STATUS[-|CE|MiscV|PCC|-|CECC]: 0x9a05c00007010011
[ 2666.655003] [Hardware Error]: Instruction Cache Error: L1 TLB multimatch.
[ 2666.655008] [Hardware Error]: cache level: L1, tx: INSN

And with the decoded string tracepoint, that whole thing above is a single
string. If you use trace_mce_record(), you can still get the individual
MCE fields which we carry to userspace from struct mce, _in addition_.
The hypothetical problem is that userspace cannot use the tracepoint
format to parse the reported fields easily and unambiguously. Instead,
it gets a single string which, I admit, is not that pretty.

Now, the problem is if we want to use a single tracepoint for all errors
- it is unfeasible to share any fields there, except maybe the TSC stamp
of when the error happened, the CPU that caught it and other such
not-so-important details.

IOW, the error format is different for almost every error type, and
there's no marrying between them. OTOH, if we start adding tracepoints
for each error type, we'll hit the other end - bloat. So also a no-no.

Maybe the compromise would be to define a single tracepoint per
_hardware_ error reporting scheme. That is, MCA has its own tracepoint,
PCIe AER has its own error reporting tracepoint, then there's an EDAC
!x86 one which doesn't use MCA for reporting, and also any other scheme
a hw vendor might come up with...

This will keep the bloat level to a minimum, keep the TPs apart and
hopefully make all of us happy :).

Opinions?


-- 
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: RAS trace event proto
  2012-02-22 10:43           ` Borislav Petkov
@ 2012-02-22 12:02             ` Mauro Carvalho Chehab
  2012-02-22 12:25               ` Borislav Petkov
  2012-02-22 15:59             ` Borislav Petkov
  1 sibling, 1 reply; 19+ messages in thread
From: Mauro Carvalho Chehab @ 2012-02-22 12:02 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Luck, Tony, Steven Rostedt, Ingo Molnar, edac-devel, LKML

Em 22-02-2012 08:43, Borislav Petkov escreveu:
> On Wed, Feb 22, 2012 at 12:58:37AM +0000, Luck, Tony wrote:
>> I'm also struggling to understand an end-user use case where you would
>> want filtering.  Mauro - can you expand a bit on why someone would just
>> want to see the errors from memory controller 1?
>>
>> My mental model of the world is that large systems have some background
>> noise - a trickle of corrected errors that happen in normal operation.
>> User shouldn't care about these errors unless they breach some threshold.
>>
>> When something goes wrong, you may see a storm of corrected errors, or
>> some uncorrected errors. In either case you'd like to get as much information
>> as possible to identify the component that is at fault. I'd definitely like
>> to see some structure to the error reporting, so that mining for data patterns
>> in a storm isn't hideously platform dependent.
> 
> Yep, I'm on the same page here.

The error counters exposed via sysfs will be incremented even if the events
are filtered. RAS software could start with tracing disabled, and only
enable the events - either as a whole or partially - if the errors go above
a certain limit (or rate limit).
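
Roughly, the daemon side of that could look like this (a sketch only - the
sysfs/tracefs paths and the threshold are illustrative, and error handling
is trimmed):

	#include <stdio.h>

	#define CE_COUNT   "/sys/devices/system/edac/mc/mc0/ce_count"
	#define MCE_TRACE  "/sys/kernel/debug/tracing/events/mce/mce_record/enable"
	#define THRESHOLD  100UL

	static unsigned long read_ce_count(void)
	{
		unsigned long val = 0;
		FILE *f = fopen(CE_COUNT, "r");

		if (f) {
			if (fscanf(f, "%lu", &val) != 1)
				val = 0;
			fclose(f);
		}
		return val;
	}

	int main(void)
	{
		/* only enable the trace event once the CE counter crosses the limit */
		if (read_ce_count() > THRESHOLD) {
			FILE *f = fopen(MCE_TRACE, "w");

			if (f) {
				fputs("1\n", f);
				fclose(f);
			}
		}
		return 0;
	}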

>> It might be easier to evaluate the competing ideas here with some sample
>> output in addition to the code.
> 
> Well, to clarify:
> 
> When you get a decoded error, you get the same format as what you get in
> dmesg, for example:
> 
> [ 2666.646070] [Hardware Error]: CPU:64   MC1_STATUS[-|CE|MiscV|PCC|-|CECC]: 0x9a05c00007010011
> [ 2666.655003] [Hardware Error]: Instruction Cache Error: L1 TLB multimatch.
> [ 2666.655008] [Hardware Error]: cache level: L1, tx: INSN
> 
> And with the decoded string tracepoint, that thing above is a single
> string. If you use trace_mce_record(), you still can get the single
> MCE fields which we carry to userspace from struct mce, _in addition_.

Using the same concept I've adopted for my EDAC patches, I would map the 
above into 3 fields:

	CPU instance = 64
	error message = Instruction Cache Error: L1 TLB multimatch.
	detail = cache level: L1, tx: INSN
	(or, maybe, detail = [-|CE|MiscV|PCC|-|CECC] cache level: L1, tx: INSN)

Those fields contain what userspace needs, and they are easy for it to parse,
as different things are in different places.

> The hypothetical problem is for userspace not being able to use the
> tracepoint format to parse reported fields easily and in an unambiguous
> manner. Instead, it gets a single string which, I admit, is not that
> pretty.

Yes.

> Now, the problem is if we want to use a single tracepoint for all errors
> - it is unfeasible that any fields sharing can be done there except
> maybe the TSC stamp when it happened, the CPU that caught it and etc.
> not so important details.

Well, in this example, 3 fields could be very similar to the ones I used
for the memory errors.

"silkscreen label" may make some sense, in order to convert from
a CPU core and from some logical CPU number into the right CPU socket.

Other fields, like "location" wouldn't make much sense (as location in this
case matches the CPU number), but those could simply be filled with a blank 
string.

I don't see any problem in having some fields filled with a blank string
(or, in the case of integers, with a value like -1) to indicate that the
value is not relevant for that error type.

> IOW, the error format is different for each error type, almost, and
> there's no marrying between them. OTOH, ff we start adding tracepoints
> for each error type, we'll hit the other end - bloat. So also a no-no.

Indeed, one tracepoint per error type is a bad idea.

> Maybe the compromise would be to define a single tracepoint per
> _hardware_ error reporting scheme. That is, MCA has an own tracepoint,
> PCIE AER has its own error reporting tracepoint, then there's an EDAC
> !x86 one which doesn't use MCA for reporting and also any other scheme a
> hw vendor would come up with...
 
> This will keep the bloat level to a minimum, keep the TPs apart and
> hopefully make all of us happy :).

That sounds interesting.

There is also another alternative: one tracepoint per HW block (where
HW block in this context is: memory, cpu, PCI, ...).

I think we should start with one tracepoint per hw reporting scheme, and
see how it fits, using the parameters I found to be the common denominator
for the memory errors.

Regards,
Mauro

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: RAS trace event proto
  2012-02-22 12:02             ` Mauro Carvalho Chehab
@ 2012-02-22 12:25               ` Borislav Petkov
  2012-02-22 13:32                 ` Mauro Carvalho Chehab
  0 siblings, 1 reply; 19+ messages in thread
From: Borislav Petkov @ 2012-02-22 12:25 UTC (permalink / raw)
  To: Mauro Carvalho Chehab
  Cc: Luck, Tony, Steven Rostedt, Ingo Molnar, edac-devel, LKML

On Wed, Feb 22, 2012 at 10:02:16AM -0200, Mauro Carvalho Chehab wrote:
> Using the same concept I've adopted for my EDAC patches, I would map the 
> above into 3 fields:
> 
> 	CPU instance = 64
> 	error message = Instruction Cache Error: L1 TLB multimatch.
> 	detail = cache level: L1, tx: INSN
> 	(or, maybe, detail = [-|CE|MiscV|PCC|-|CECC] cache level: L1, tx: INSN)

No, this is not going to fly the moment you decide to dump MCi_ADDR,
because it is relevant for certain types of errors:

[ 1121.970020] [Hardware Error]: CPU:64   MC2_STATUS[Over|CE|-|-|AddrV|CECC]: 0xd400400000000813
[ 1121.979039] [Hardware Error]:  MC2_ADDR: 0xbabedeaddeadbeef
[ 1121.979042] [Hardware Error]: Bus Unit Error: RD/ECC error in data read from NB.
[ 1121.979047] [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: SRC (no timeout)

The whole decoded thing above is a string and the MCA registers are
passed on into the trace, _in_ _addition_.

IOW, for each tracepoint, the format should be a possibly empty string
which carries the decoded error message, plus the remaining register
attributes, one per field. Mapping the hw error scheme to some memory
error format you've come up with is a bad idea and a no-no.
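
Schematically, something like this (only a sketch of the idea, trimmed to
a few of the struct mce members, not an actual patch hunk):

	TP_STRUCT__entry(
		__string(	msg,	msg		)
		__field(	u64,	status		)
		__field(	u64,	addr		)
		__field(	u64,	misc		)
		__field(	u8,	bank		)
	),

	TP_fast_assign(
		__assign_str(msg, msg);
		__entry->status	= m->status;
		__entry->addr	= m->addr;
		__entry->misc	= m->misc;
		__entry->bank	= m->bank;
	),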

-- 
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: RAS trace event proto
  2012-02-22 12:25               ` Borislav Petkov
@ 2012-02-22 13:32                 ` Mauro Carvalho Chehab
  2012-02-22 14:05                   ` Borislav Petkov
  2012-02-22 14:26                   ` Steven Rostedt
  0 siblings, 2 replies; 19+ messages in thread
From: Mauro Carvalho Chehab @ 2012-02-22 13:32 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Luck, Tony, Steven Rostedt, Ingo Molnar, edac-devel, LKML

Em 22-02-2012 10:25, Borislav Petkov escreveu:
> On Wed, Feb 22, 2012 at 10:02:16AM -0200, Mauro Carvalho Chehab wrote:
>> Using the same concept I've adopted for my EDAC patches, I would map the 
>> above into 3 fields:
>>
>> 	CPU instance = 64
>> 	error message = Instruction Cache Error: L1 TLB multimatch.
>> 	detail = cache level: L1, tx: INSN
>> 	(or, maybe, detail = [-|CE|MiscV|PCC|-|CECC] cache level: L1, tx: INSN)
> 
> No, this is not going to fly the moment you decide to dump MCi_ADDR
> because it is relevant for a certain types of errors:
> 
> [ 1121.970020] [Hardware Error]: CPU:64   MC2_STATUS[Over|CE|-|-|AddrV|CECC]: 0xd400400000000813
> [ 1121.979039] [Hardware Error]:  MC2_ADDR: 0xbabedeaddeadbeef
> [ 1121.979042] [Hardware Error]: Bus Unit Error: RD/ECC error in data read from NB.
> [ 1121.979047] [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: SRC (no timeout)

This could be mapped as:

	CPU instance = 64
	error message = Bus Unit Error: RD/ECC error in data read from NB
	detail = [Over|CE|-|-|AddrV|CECC] cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: SRC (no timeout) address: 0xbabedeaddeadbeef

> The whole decoded thing above is a string and the MCA registers are
> passed on into the trace, _in_ _addition_.

The MCA registers are also there in their own fields, so there's no need
to also store them in the decoded string. Still, they could go into the
detail field.

What I'm proposing is that the "error message" should contain the most
relevant information for the end user, while additional technical details
would go into the detail field.

That makes it easy for a RAS daemon to store and group similar errors.
Cramming all the information into one string would make it very hard to parse.

> IOW, for each tracepoint, the format should be a string which is
> possibly empty and describes the error message additionally, and the
> remaining register attributes, one per field. Mapping the hw error
> scheme to some memory error format you've come up with is a bad idea and
> a no-no.

The "remaining register attributes" is a no-no: such concept would assume that
the remaining information is inside a MCA register (or inside a well-known group
of registers). This is not always true.

As I said several times, on Sandy Bridge the memory error location is not
stored in any register. The driver has to use the MCA error address as a starting
point, and navigate a tree of some 200+ registers, in order to get the
DIMM channel, rank and slot.

Passing the values of all those 200+ registers to userspace wouldn't make any sense,
as only a few are used to parse the error. What makes sense is to pass the memory
location to userspace.

To make things worse, when the memory controller is in mirror mode or in
lockstep mode, Sandy Bridge won't provide a single location. Instead, it
will provide a channel mask, meaning that one of two DIMMs (or four DIMMs,
if both modes are enabled) is the root cause of the error.

The same also applies to FB-DIMM drivers.

So, the error location has to be either a string or up to 3 integers to represent
the memory location, plus a mask, in order to represent lockstep and mirror modes.

Using a string makes it easier.

So, in other words, I think that the _best_ way to represent an MCA type of error
is to use an event with the following prototype:

	TP_PROTO(const unsigned int error_severity,
		 const unsigned int error_type,
		 const unsigned int instance,
		 const char *error_msg,
		 const char *silkscreen_label,
		 const char *location,
		 const char *detail,
		 const char *driver_detail,
 		 const struct mce *m),

Where the above fields are:

	- error_severity: corrected, uncorrected, fatal, ...;
	- error_type: memory error, PCI error, bus error, ...;
	- instance: an instance number to uniquelly identify the affected resource;
	- silkscreen_label: can be a memory slot, a bus slot, a cpu slot, ...,
			    depending on the error type;
	- location: "csrow 1 channel 0" for MC; "07:05.0" for PCI, 
		    "Bus 003 Device 001" for USB, ...
	- detail: some additional error_type-specific detail handled by the core;
	- driver_detail: some additional error_type-specific detail handled by the driver.


@Steven:

As a matter of fact, error_severity and error_type should be enums. We should not
use enums in ioctls, as enum size is not portable. Can they be used inside perf, or
should they be exported to perf as integers (like in the above proposal)?


Regards,
Mauro.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: RAS trace event proto
  2012-02-22 13:32                 ` Mauro Carvalho Chehab
@ 2012-02-22 14:05                   ` Borislav Petkov
  2012-02-22 14:25                     ` Mauro Carvalho Chehab
  2012-02-22 14:26                   ` Steven Rostedt
  1 sibling, 1 reply; 19+ messages in thread
From: Borislav Petkov @ 2012-02-22 14:05 UTC (permalink / raw)
  To: Mauro Carvalho Chehab
  Cc: Luck, Tony, Steven Rostedt, Ingo Molnar, edac-devel, LKML

On Wed, Feb 22, 2012 at 11:32:16AM -0200, Mauro Carvalho Chehab wrote:

> So, in other words, I think that the _best_ way to represent an MCA
> type of error is to use an event with the following prototype:

No, it's not.

And if you're going to repeat yourself for the 100th time as to why,
don't even bother replying.

-- 
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: RAS trace event proto
  2012-02-22 14:05                   ` Borislav Petkov
@ 2012-02-22 14:25                     ` Mauro Carvalho Chehab
  0 siblings, 0 replies; 19+ messages in thread
From: Mauro Carvalho Chehab @ 2012-02-22 14:25 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Luck, Tony, Steven Rostedt, Ingo Molnar, edac-devel, LKML

Em 22-02-2012 12:05, Borislav Petkov escreveu:
> On Wed, Feb 22, 2012 at 11:32:16AM -0200, Mauro Carvalho Chehab wrote:
> 
>> So, in other words, I think that the _best_ way to represent an MCA
>> type of error is to use an event with the following prototype:
> 
> No, it's not.

Why? It fits with all existing edac drivers, and it seems to fit with
mce_amd, based on the examples you gave.

> 
> And if you're going to repeat yourself for the 100th time as to why,
> don't even bother replying.

Well, a single string is not ok, as it is not easily parseable and it
doesn't allow using the perf filters.

We need to agree on some trace format that fits, and that covers the
needs of the existing drivers.

If you didn't like the format I've proposed, I'm ok with that, provided
you have a better proposal.

Regards,
Mauro.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: RAS trace event proto
  2012-02-22 13:32                 ` Mauro Carvalho Chehab
  2012-02-22 14:05                   ` Borislav Petkov
@ 2012-02-22 14:26                   ` Steven Rostedt
  1 sibling, 0 replies; 19+ messages in thread
From: Steven Rostedt @ 2012-02-22 14:26 UTC (permalink / raw)
  To: Mauro Carvalho Chehab
  Cc: Borislav Petkov, Luck, Tony, Ingo Molnar, edac-devel, LKML

On Wed, 2012-02-22 at 11:32 -0200, Mauro Carvalho Chehab wrote:

> @Steven:
> 
> As a matter of fact, error_severity and error_type should be enums. We should not
> use enums in ioctls, as enum size is not portable. Can they be used inside perf, or
> should they be exported to perf as integers (like in the above proposal)?

I don't know exactly how perf does it today, but I'm sure it's similar
to trace-cmd.

You can use an enum, as it just comes through as a number. Each field is
given the size of that field. If it's an int it stays an int.

Note though, perf and trace-cmd do not know how to convert the number
back to an enum name. You can use the __print_symbolic() feature, but
that's about it.
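
For example, a TP_printk along these lines maps the raw number back to a
name when the trace is printed (sketch only; the HW_ERR_* values are made
up for illustration):

	TP_printk("type %s",
		  __print_symbolic(__entry->error_type,
				   { HW_ERR_MEM, "memory" },
				   { HW_ERR_PCI, "pci" },
				   { HW_ERR_BUS, "bus" }))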

-- Steve



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: RAS trace event proto
  2012-02-22 10:43           ` Borislav Petkov
  2012-02-22 12:02             ` Mauro Carvalho Chehab
@ 2012-02-22 15:59             ` Borislav Petkov
  2012-02-27 15:54               ` Borislav Petkov
  1 sibling, 1 reply; 19+ messages in thread
From: Borislav Petkov @ 2012-02-22 15:59 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Steven Rostedt, Mauro Carvalho Chehab, Ingo Molnar, edac-devel, LKML

On Wed, Feb 22, 2012 at 11:43:24AM +0100, Borislav Petkov wrote:
> This will keep the bloat level to a minimum, keep the TPs apart and
> hopefully make all of us happy :).

Btw, here's what the rough MCE TP trace_mce_record() looks like:

       mcegen.py-2715  [001] .N..  1049.818840: mce_record: [Hardware Error]: CPU:0     MC4_STATUS[Over|UE|-|PCC|AddrV|UECC]: 0xf604a00006080a41
[Hardware Error]:       MC4_ADDR: 0xbabedeaddeadbeef
[Hardware Error]: Northbridge Error (node 0): DRAM ECC error detected
(CPU: 0, MCGc/s: 0/0, MC4: f604a00006080a41, ADDR/MISC: babedeaddeadbeef/dead57ac1ba0babe, RIP: 00:<0000000000000000>, TSC: 0, PROCESSOR: 0:0, TIME: 0, SOCKET: 0, APIC: 0)

Basically, the userspace daemon will consume the error string (after
it's been massaged into looking prettier and smaller :-)) (1st arg)
and dump it to some logs, and use some of the MCE fields to do error
collection and thresholding/ratelimiting/whatever.

While at it, I'm also looking very critically at the fields SOCKET,
APIC, TSC (we have walltime) for I'd like to drop them. Also, MC4 should
be MC4_STATUS btw.

To be continued...

-- 
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: RAS trace event proto
  2012-02-22 15:59             ` Borislav Petkov
@ 2012-02-27 15:54               ` Borislav Petkov
  0 siblings, 0 replies; 19+ messages in thread
From: Borislav Petkov @ 2012-02-27 15:54 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Steven Rostedt, Mauro Carvalho Chehab, Ingo Molnar, edac-devel, LKML

Hi Tony,

On Wed, Feb 22, 2012 at 04:59:48PM +0100, Borislav Petkov wrote:
> On Wed, Feb 22, 2012 at 11:43:24AM +0100, Borislav Petkov wrote:
> > This will keep the bloat level to a minimum, keep the TPs apart and
> > hopefully make all of us happy :).
> 
> Btw, here's how the rough MCE TP trace_mce_record() looks like:
> 
>        mcegen.py-2715  [001] .N..  1049.818840: mce_record: [Hardware Error]: CPU:0     MC4_STATUS[Over|UE|-|PCC|AddrV|UECC]: 0xf604a00006080a41
> [Hardware Error]:       MC4_ADDR: 0xbabedeaddeadbeef
> [Hardware Error]: Northbridge Error (node 0): DRAM ECC error detected
> (CPU: 0, MCGc/s: 0/0, MC4: f604a00006080a41, ADDR/MISC: babedeaddeadbeef/dead57ac1ba0babe, RIP: 00:<0000000000000000>, TSC: 0, PROCESSOR: 0:0, TIME: 0, SOCKET: 0, APIC: 0)
> 
> Basically, the userspace daemon will consume the error string (after
> it's been massaged into looking prettier and smaller :-)) (1st arg)
> and dump it to some logs, and use some of the MCE fields to do error
> collection and thresholding/ratelimiting/whatever.
> 
> While at it, I'm also looking very critically at the fields SOCKET,
> APIC, TSC (we have walltime) for I'd like to drop them. Also, MC4 should
> be MC4_STATUS btw.
> 
> To be continued...

new week, new stuff:

Here's what the MCE TP looks like with a couple of MCEs injected:

       mcegen.py-2318  [001] .N..   580.902409: mce_record: [Hardware Error]: CPU:0 MC4_STATUS[Over|CE|-|PCC|AddrV|CECC]: 0xd604c00006080a41 MC4_ADDR: 0x0000000000000016
[Hardware Error]: Northbridge Error (node 0): DRAM ECC error detected on the NB.
[Hardware Error]: ERR_ADDR: 0x16 row: 0, channel: 0
[Hardware Error]: cache level: L1, mem/io: MEM, mem-tx: DWR, part-proc: RES (no timeout)
[Hardware Error]: CPU: 0, MCGc/s: 0/0, MC4: d604c00006080a41, ADDR/MISC: 0000000000000016/dead57ac1ba0babe, RIP: 00:<0000000000000000>, TSC: 0, TIME: 0)

       mcegen.py-2326  [001] .N..   598.795494: mce_record: [Hardware Error]: CPU:0 MC4_STATUS[Over|UE|MiscV|PCC|-|UECC]: 0xfa002000001c011b[Hardware Error]: Northbridge Error (node 0): L3 ECC data cache error.
[Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
[Hardware Error]: CPU: 0, MCGc/s: 0/0, MC4: fa002000001c011b, ADDR/MISC: 0000000000000016/dead57ac1ba0babe, RIP: 00:<0000000000000000>, TSC: 0, TIME: 0)

       mcegen.py-2343  [013] .N..   619.620698: mce_record: [Hardware Error]: CPU:0 MC4_STATUS[-|UE|MiscV|PCC|-|UECC]: 0xba002100000f001b[Hardware Error]: Northbridge Error (node 0): GART Table Walk data error.
[Hardware Error]: cache level: L3/GEN, tx: GEN
[Hardware Error]: CPU: 0, MCGc/s: 0/0, MC4: ba002100000f001b, ADDR/MISC: 0000000000000016/dead57ac1ba0babe, RIP: 00:<0000000000000000>, TSC: 0, TIME: 0)

Basically the lines excluding the last one are the string message
generated by the decoding code and collected into the ras decode buffer
using ras_printk. Btw, the buffer enlarges itself on demand when we're
close to filling it up with the decoding info.

The last line is the MCE TP - with the IMO useless fields removed - which
will be used by the RAS daemon in userspace.
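
Btw, the on-demand growing mentioned above works roughly like this (a
simplified sketch, not the actual ras_printk code - locking, atomic-context
allocation and most error handling are left out):

	static char *ras_buf;
	static size_t ras_buf_size, ras_buf_used;

	static void ras_buf_append(const char *fmt, ...)
	{
		va_list args;
		int len;

		/* double the buffer when it gets close to full */
		if (ras_buf_used + 256 > ras_buf_size) {
			size_t new_size = ras_buf_size ? ras_buf_size * 2 : 512;
			char *tmp = krealloc(ras_buf, new_size, GFP_KERNEL);

			if (!tmp)
				return;

			ras_buf = tmp;
			ras_buf_size = new_size;
		}

		va_start(args, fmt);
		len = vsnprintf(ras_buf + ras_buf_used,
				ras_buf_size - ras_buf_used, fmt, args);
		va_end(args);

		if (len > 0)
			ras_buf_used += min_t(size_t, len,
					      ras_buf_size - ras_buf_used - 1);
	}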

I'll be splitting the single patch into multiple, more digestible chunks
for review now.

-- 
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2012-02-27 15:54 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-02-20 14:59 RAS trace event proto Borislav Petkov
2012-02-21  1:14 ` Steven Rostedt
2012-02-21 10:15   ` Borislav Petkov
2012-02-21 12:24 ` Mauro Carvalho Chehab
2012-02-21 14:12   ` Borislav Petkov
2012-02-21 14:48     ` Steven Rostedt
2012-02-21 14:59       ` Borislav Petkov
2012-02-21 16:18         ` Mauro Carvalho Chehab
2012-02-22  0:58         ` Luck, Tony
2012-02-22 10:43           ` Borislav Petkov
2012-02-22 12:02             ` Mauro Carvalho Chehab
2012-02-22 12:25               ` Borislav Petkov
2012-02-22 13:32                 ` Mauro Carvalho Chehab
2012-02-22 14:05                   ` Borislav Petkov
2012-02-22 14:25                     ` Mauro Carvalho Chehab
2012-02-22 14:26                   ` Steven Rostedt
2012-02-22 15:59             ` Borislav Petkov
2012-02-27 15:54               ` Borislav Petkov
2012-02-21 17:28     ` Mauro Carvalho Chehab
