[RFC -v2 PATCH 0/3] RAS: Use MCE tracepoint for decoded MCEs

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [RFC -v2 PATCH 0/3] RAS: Use MCE tracepoint for decoded MCEs
@ 2012-03-02 14:25 Borislav Petkov
  2012-03-02 14:25 ` [PATCH 1/4] mce: Slim up struct mce Borislav Petkov
                   ` (4 more replies)
  0 siblings, 5 replies; 26+ messages in thread
From: Borislav Petkov @ 2012-03-02 14:25 UTC (permalink / raw)
  To: Tony Luck; +Cc: Ingo Molnar, EDAC devel, LKML, Borislav Petkov

From: Borislav Petkov <borislav.petkov@amd.com>

Hi,

here's a second version of the patchset with the buffer enlarging ripped
out. 1/4 in the series could go in independently since it is a cleanup,
I'll add it to a for-next testing branch if there are no objections.

Changelog:

* V1:

this is an initial, more or less serious attempt to collect decoded
MCE info into a buffer and jettison it into userspace using the MCE
tracepoint trace_mce_record(). This initial approach needs userspace to
do

$ echo 1 > /sys/devices/system/ras/agent

and decoded MCE info gets collected into a buffer. Then, when decoding
is finished, the tracepoint is called and the MCE info along with the
decoded information lands in the ring buffer and at possible userspace
consumers.

Also, the commit messages of the single patches contain additional info.

For example, the data looks like this:

mcegen.py-2318  [001] .N..   580.902409: mce_record: [Hardware Error]: CPU:0 MC4_STATUS[Over|CE|-|PCC|AddrV|CECC]: 0xd604c00006080a41 MC4_ADDR: 0x0000000000000016
[Hardware Error]: Northbridge Error (node 0): DRAM ECC error detected on the NB.
[Hardware Error]: ERR_ADDR: 0x16 row: 0, channel: 0
[Hardware Error]: cache level: L1, mem/io: MEM, mem-tx: DWR, part-proc: RES (no timeout)
[Hardware Error]: CPU: 0, MCGc/s: 0/0, MC4: d604c00006080a41, ADDR/MISC: 0000000000000016/dead57ac1ba0babe, RIP: 00:<0000000000000000>, TSC: 0, TIME: 0)

       mcegen.py-2326  [001] .N..   598.795494: mce_record: [Hardware Error]: CPU:0 MC4_STATUS[Over|UE|MiscV|PCC|-|UECC]: 0xfa002000001c011b
[Hardware Error]: Northbridge Error (node 0): L3 ECC data cache error.
[Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
[Hardware Error]: CPU: 0, MCGc/s: 0/0, MC4: fa002000001c011b, ADDR/MISC: 0000000000000016/dead57ac1ba0babe, RIP: 00:<0000000000000000>, TSC: 0, TIME: 0)

mcegen.py-2343  [013] .N..   619.620698: mce_record: [Hardware Error]: CPU:0 MC4_STATUS[-|UE|MiscV|PCC|-|UECC]: 0xba002100000f001b[HardwareError]: Northbridge Error (node 0): GART Table Walk data error.
[Hardware Error]: cache level: L3/GEN, tx: GEN
[Hardware Error]: CPU: 0, MCGc/s: 0/0, MC4: ba002100000f001b, ADDR/MISC: 0000000000000016/dead57ac1ba0babe, RIP: 00:<0000000000000000>, TSC: 0, TIME: 0)

As always, reviews and comments are welcome.

Thanks.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 1/4] mce: Slim up struct mce
  2012-03-02 14:25 [RFC -v2 PATCH 0/3] RAS: Use MCE tracepoint for decoded MCEs Borislav Petkov
@ 2012-03-02 14:25 ` Borislav Petkov
  2012-03-02 17:47   ` Luck, Tony
  2012-03-02 14:25 ` [PATCH 2/4] mce: Add a msg string to the MCE tracepoint Borislav Petkov
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 26+ messages in thread
From: Borislav Petkov @ 2012-03-02 14:25 UTC (permalink / raw)
  To: Tony Luck; +Cc: Ingo Molnar, EDAC devel, LKML, Borislav Petkov

From: Borislav Petkov <borislav.petkov@amd.com>

Drop unused ->apicid, make ->socketid __u16 since it is copied from
struct cpuinfo_86.phys_proc_id, reorder members and adjust padding so
that it remains packed and on a 64-byte boundary.

Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
---
 arch/x86/include/asm/mce.h              |   12 +++++-------
 arch/x86/kernel/cpu/mcheck/mce-inject.c |    8 ++++----
 arch/x86/kernel/cpu/mcheck/mce.c        |   13 ++++++-------
 drivers/edac/amd64_edac.c               |    2 +-
 drivers/edac/mce_amd.c                  |    4 ++--
 include/trace/events/mce.h              |   11 ++++-------
 6 files changed, 22 insertions(+), 28 deletions(-)

diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index 6aefb14cbbc5..42e209e32231 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -67,15 +67,13 @@ struct mce {
 	__u64 time;	/* wall time_t when error was detected */
 	__u8  cpuvendor;	/* cpu vendor as encoded in system.h */
 	__u8  inject_flags;	/* software inject flags */
-	__u16  pad;
+	__u8  pad;
+	__u8  cs;	/* code segment */
 	__u32 cpuid;	/* CPUID 1 EAX */
-	__u8  cs;		/* code segment */
 	__u8  bank;	/* machine check bank */
-	__u8  cpu;	/* cpu number; obsolete; use extcpu now */
-	__u8  finished;   /* entry is valid */
-	__u32 extcpu;	/* linux cpu number that detected the error */
-	__u32 socketid;	/* CPU socket ID */
-	__u32 apicid;	/* CPU initial apic ID */
+	__u8  finished; /* entry is valid */
+	__u16 socketid;	/* CPU socket ID */
+	__u32 cpu;	/* linux cpu number that detected the error */
 	__u64 mcgcap;	/* MCGCAP MSR: machine check capabilities of CPU */
 };
 
diff --git a/arch/x86/kernel/cpu/mcheck/mce-inject.c b/arch/x86/kernel/cpu/mcheck/mce-inject.c
index fc4beb393577..a19eace708f1 100644
--- a/arch/x86/kernel/cpu/mcheck/mce-inject.c
+++ b/arch/x86/kernel/cpu/mcheck/mce-inject.c
@@ -31,14 +31,14 @@
 /* Update fake mce registers on current CPU. */
 static void inject_mce(struct mce *m)
 {
-	struct mce *i = &per_cpu(injectm, m->extcpu);
+	struct mce *i = &per_cpu(injectm, m->cpu);
 
 	/* Make sure no one reads partially written injectm */
 	i->finished = 0;
 	mb();
 	m->finished = 0;
 	/* First set the fields after finished */
-	i->extcpu = m->extcpu;
+	i->cpu = m->cpu;
 	mb();
 	/* Now write record in order, finished last (except above) */
 	memcpy(i, m, sizeof(struct mce));
@@ -111,7 +111,7 @@ static int raise_local(void)
 	struct mce *m = &__get_cpu_var(injectm);
 	int context = MCJ_CTX(m->inject_flags);
 	int ret = 0;
-	int cpu = m->extcpu;
+	int cpu = m->cpu;
 
 	if (m->inject_flags & MCJ_EXCEPTION) {
 		printk(KERN_INFO "Triggering MCE exception on CPU %d\n", cpu);
@@ -217,7 +217,7 @@ static ssize_t mce_write(struct file *filp, const char __user *ubuf,
 	if (copy_from_user(&m, ubuf, usize))
 		return -EFAULT;
 
-	if (m.extcpu >= num_possible_cpus() || !cpu_online(m.extcpu))
+	if (m.cpu >= num_possible_cpus() || !cpu_online(m.cpu))
 		return -EINVAL;
 
 	/*
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 5a11ae2e9e91..a30d7af18e66 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -112,14 +112,13 @@ ATOMIC_NOTIFIER_HEAD(x86_mce_decoder_chain);
 void mce_setup(struct mce *m)
 {
 	memset(m, 0, sizeof(struct mce));
-	m->cpu = m->extcpu = smp_processor_id();
+	m->cpu = smp_processor_id();
 	rdtscll(m->tsc);
 	/* We hope get_seconds stays lockless */
 	m->time = get_seconds();
 	m->cpuvendor = boot_cpu_data.x86_vendor;
 	m->cpuid = cpuid_eax(1);
-	m->socketid = cpu_data(m->extcpu).phys_proc_id;
-	m->apicid = cpu_data(m->extcpu).initial_apicid;
+	m->socketid = cpu_data(m->cpu).phys_proc_id;
 	rdmsrl(MSR_IA32_MCG_CAP, m->mcgcap);
 }
 
@@ -243,7 +242,7 @@ static void print_mce(struct mce *m)
 	int ret = 0;
 
 	pr_emerg(HW_ERR "CPU %d: Machine Check Exception: %Lx Bank %d: %016Lx\n",
-	       m->extcpu, m->mcgstatus, m->bank, m->status);
+	       m->cpu, m->mcgstatus, m->bank, m->status);
 
 	if (m->ip) {
 		pr_emerg(HW_ERR "RIP%s %02x:<%016Lx> ",
@@ -266,9 +265,9 @@ static void print_mce(struct mce *m)
 	 * Note this output is parsed by external tools and old fields
 	 * should not be changed.
 	 */
-	pr_emerg(HW_ERR "PROCESSOR %u:%x TIME %llu SOCKET %u APIC %x microcode %x\n",
-		m->cpuvendor, m->cpuid, m->time, m->socketid, m->apicid,
-		cpu_data(m->extcpu).microcode);
+	pr_emerg(HW_ERR "PROCESSOR %u:%x TIME %llu SOCKET %u microcode %x\n",
+		m->cpuvendor, m->cpuid, m->time, m->socketid,
+		cpu_data(m->cpu).microcode);
 
 	/*
 	 * Print out human-readable details about the MCE error,
diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
index c9eee6d33e9a..08413377a43b 100644
--- a/drivers/edac/amd64_edac.c
+++ b/drivers/edac/amd64_edac.c
@@ -972,7 +972,7 @@ static u64 get_error_address(struct mce *m)
 		if ((addr & GENMASK(24, 47)) >> 24 != 0x00fdf7)
 			return addr;
 
-		mce_nid	= amd_get_nb_id(m->extcpu);
+		mce_nid	= amd_get_nb_id(m->cpu);
 		pvt	= mcis[mce_nid]->pvt_info;
 
 		amd64_read_pci_cfg(pvt->F1, DRAM_LOCAL_NODE_LIM, &tmp);
diff --git a/drivers/edac/mce_amd.c b/drivers/edac/mce_amd.c
index bd926ea2e00c..7752f81fa4d6 100644
--- a/drivers/edac/mce_amd.c
+++ b/drivers/edac/mce_amd.c
@@ -595,7 +595,7 @@ static bool nb_noop_mce(u16 ec, u8 xec)
 void amd_decode_nb_mce(struct mce *m)
 {
 	struct cpuinfo_x86 *c = &boot_cpu_data;
-	int node_id = amd_get_nb_id(m->extcpu);
+	int node_id = amd_get_nb_id(m->cpu);
 	u16 ec = EC(m->status);
 	u8 xec = XEC(m->status, 0x1f);
 
@@ -753,7 +753,7 @@ int amd_decode_mce(struct notifier_block *nb, unsigned long val, void *data)
 		return NOTIFY_STOP;
 
 	pr_emerg(HW_ERR "CPU:%d\tMC%d_STATUS[%s|%s|%s|%s|%s",
-		m->extcpu, m->bank,
+		m->cpu, m->bank,
 		((m->status & MCI_STATUS_OVER)	? "Over"  : "-"),
 		((m->status & MCI_STATUS_UC)	? "UE"	  : "CE"),
 		((m->status & MCI_STATUS_MISCV)	? "MiscV" : "-"),
diff --git a/include/trace/events/mce.h b/include/trace/events/mce.h
index 4cbbcef6baa8..4c1740ac7889 100644
--- a/include/trace/events/mce.h
+++ b/include/trace/events/mce.h
@@ -25,8 +25,7 @@ TRACE_EVENT(mce_record,
 		__field(	u64,		walltime	)
 		__field(	u32,		cpu		)
 		__field(	u32,		cpuid		)
-		__field(	u32,		apicid		)
-		__field(	u32,		socketid	)
+		__field(	u16,		socketid	)
 		__field(	u8,		cs		)
 		__field(	u8,		bank		)
 		__field(	u8,		cpuvendor	)
@@ -41,16 +40,15 @@ TRACE_EVENT(mce_record,
 		__entry->ip		= m->ip;
 		__entry->tsc		= m->tsc;
 		__entry->walltime	= m->time;
-		__entry->cpu		= m->extcpu;
+		__entry->cpu		= m->cpu;
 		__entry->cpuid		= m->cpuid;
-		__entry->apicid		= m->apicid;
 		__entry->socketid	= m->socketid;
 		__entry->cs		= m->cs;
 		__entry->bank		= m->bank;
 		__entry->cpuvendor	= m->cpuvendor;
 	),
 
-	TP_printk("CPU: %d, MCGc/s: %llx/%llx, MC%d: %016Lx, ADDR/MISC: %016Lx/%016Lx, RIP: %02x:<%016Lx>, TSC: %llx, PROCESSOR: %u:%x, TIME: %llu, SOCKET: %u, APIC: %x",
+	TP_printk("CPU: %d, MCGc/s: %llx/%llx, MC%d: %016Lx, ADDR/MISC: %016Lx/%016Lx, RIP: %02x:<%016Lx>, TSC: %llx, PROCESSOR: %u:%x, TIME: %llu, SOCKET: %u",
 		__entry->cpu,
 		__entry->mcgcap, __entry->mcgstatus,
 		__entry->bank, __entry->status,
@@ -59,8 +57,7 @@ TRACE_EVENT(mce_record,
 		__entry->tsc,
 		__entry->cpuvendor, __entry->cpuid,
 		__entry->walltime,
-		__entry->socketid,
-		__entry->apicid)
+		__entry->socketid)
 );
 
 #endif /* _TRACE_MCE_H */
-- 
1.7.8.rc0


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 2/4] mce: Add a msg string to the MCE tracepoint
  2012-03-02 14:25 [RFC -v2 PATCH 0/3] RAS: Use MCE tracepoint for decoded MCEs Borislav Petkov
  2012-03-02 14:25 ` [PATCH 1/4] mce: Slim up struct mce Borislav Petkov
@ 2012-03-02 14:25 ` Borislav Petkov
  2012-03-02 14:25 ` [PATCH 3/4] x86, RAS: Add a decoded msg buffer Borislav Petkov
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 26+ messages in thread
From: Borislav Petkov @ 2012-03-02 14:25 UTC (permalink / raw)
  To: Tony Luck; +Cc: Ingo Molnar, EDAC devel, LKML, Borislav Petkov

From: Borislav Petkov <borislav.petkov@amd.com>

The idea here is to pass an additional decoded MCE message through
the tracepoint and into the ring buffer for userspace to consume. The
designated consumers are RAS daemons and other tools collecting RAS
information.

Drop unneeded fields while at it, thus saving some room in the ring
buffer.

Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
---
 arch/x86/kernel/cpu/mcheck/mce.c |    2 +-
 include/trace/events/mce.h       |   14 ++++++--------
 2 files changed, 7 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index a30d7af18e66..b28800443045 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -143,7 +143,7 @@ void mce_log(struct mce *mce)
 	int ret = 0;
 
 	/* Emit the trace record: */
-	trace_mce_record(mce);
+	trace_mce_record("", mce);
 
 	ret = atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, mce);
 	if (ret == NOTIFY_STOP)
diff --git a/include/trace/events/mce.h b/include/trace/events/mce.h
index 4c1740ac7889..517536de2458 100644
--- a/include/trace/events/mce.h
+++ b/include/trace/events/mce.h
@@ -10,11 +10,12 @@
 
 TRACE_EVENT(mce_record,
 
-	TP_PROTO(struct mce *m),
+	TP_PROTO(const char *msg, struct mce *m),
 
-	TP_ARGS(m),
+	TP_ARGS(msg, m),
 
 	TP_STRUCT__entry(
+		__string(	msg,		msg		)
 		__field(	u64,		mcgcap		)
 		__field(	u64,		mcgstatus	)
 		__field(	u64,		status		)
@@ -24,14 +25,13 @@ TRACE_EVENT(mce_record,
 		__field(	u64,		tsc		)
 		__field(	u64,		walltime	)
 		__field(	u32,		cpu		)
-		__field(	u32,		cpuid		)
 		__field(	u16,		socketid	)
 		__field(	u8,		cs		)
 		__field(	u8,		bank		)
-		__field(	u8,		cpuvendor	)
 	),
 
 	TP_fast_assign(
+		__assign_str(msg,	msg);
 		__entry->mcgcap		= m->mcgcap;
 		__entry->mcgstatus	= m->mcgstatus;
 		__entry->status		= m->status;
@@ -41,21 +41,19 @@ TRACE_EVENT(mce_record,
 		__entry->tsc		= m->tsc;
 		__entry->walltime	= m->time;
 		__entry->cpu		= m->cpu;
-		__entry->cpuid		= m->cpuid;
 		__entry->socketid	= m->socketid;
 		__entry->cs		= m->cs;
 		__entry->bank		= m->bank;
-		__entry->cpuvendor	= m->cpuvendor;
 	),
 
-	TP_printk("CPU: %d, MCGc/s: %llx/%llx, MC%d: %016Lx, ADDR/MISC: %016Lx/%016Lx, RIP: %02x:<%016Lx>, TSC: %llx, PROCESSOR: %u:%x, TIME: %llu, SOCKET: %u",
+	TP_printk("%s" HW_ERR "CPU: %d, MCGc/s: %llx/%llx, MC%d: %016Lx, ADDR/MISC: %016Lx/%016Lx, RIP: %02x:<%016Lx>, TSC: %llx, TIME: %llu, SOCKET: %u",
+		__get_str(msg),
 		__entry->cpu,
 		__entry->mcgcap, __entry->mcgstatus,
 		__entry->bank, __entry->status,
 		__entry->addr, __entry->misc,
 		__entry->cs, __entry->ip,
 		__entry->tsc,
-		__entry->cpuvendor, __entry->cpuid,
 		__entry->walltime,
 		__entry->socketid)
 );
-- 
1.7.8.rc0


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 3/4] x86, RAS: Add a decoded msg buffer
  2012-03-02 14:25 [RFC -v2 PATCH 0/3] RAS: Use MCE tracepoint for decoded MCEs Borislav Petkov
  2012-03-02 14:25 ` [PATCH 1/4] mce: Slim up struct mce Borislav Petkov
  2012-03-02 14:25 ` [PATCH 2/4] mce: Add a msg string to the MCE tracepoint Borislav Petkov
@ 2012-03-02 14:25 ` Borislav Petkov
  2012-03-02 14:25 ` [PATCH 4/4] EDAC: Convert AMD EDAC pieces to use RAS printk buffer Borislav Petkov
  2012-03-02 14:41 ` [RFC -v2 PATCH 0/3] RAS: Use MCE tracepoint for decoded MCEs Mauro Carvalho Chehab
  4 siblings, 0 replies; 26+ messages in thread
From: Borislav Petkov @ 2012-03-02 14:25 UTC (permalink / raw)
  To: Tony Luck; +Cc: Ingo Molnar, EDAC devel, LKML, Borislav Petkov

From: Borislav Petkov <borislav.petkov@amd.com>

Echoing 1 into /sys/devices/system/ras/agent causes the ras_printk()
function to buffer a string describing a hardware error. This is meant
for userspace daemons which are running on the system and are going to
consume decoded information through the MCE tracepoint.

Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
---
 arch/x86/Kconfig           |    9 +++
 arch/x86/Makefile          |    3 +
 arch/x86/include/asm/ras.h |   15 +++++
 arch/x86/ras/Makefile      |    1 +
 arch/x86/ras/ras.c         |  145 ++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 173 insertions(+), 0 deletions(-)
 create mode 100644 arch/x86/include/asm/ras.h
 create mode 100644 arch/x86/ras/Makefile
 create mode 100644 arch/x86/ras/ras.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 5bed94e189fa..bda1480241b2 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -657,6 +657,15 @@ config X86_CYCLONE_TIMER
 	def_bool y
 	depends on X86_SUMMIT
 
+config X86_RAS
+	def_bool y
+	prompt "X86 RAS features"
+	---help---
+	A collection of Reliability, Availability and Serviceability
+	software features which aim to enable hardware error logging
+	and reporting. Leave it at 'y' unless you really know what
+	you're doing
+
 source "arch/x86/Kconfig.cpu"
 
 config HPET_TIMER
diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index 209ba1294592..a6b6bb1f308b 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -146,6 +146,9 @@ drivers-$(CONFIG_OPROFILE) += arch/x86/oprofile/
 # suspend and hibernation support
 drivers-$(CONFIG_PM) += arch/x86/power/
 
+# RAS support
+core-y += arch/x86/ras/
+
 drivers-$(CONFIG_FB) += arch/x86/video/
 
 ####
diff --git a/arch/x86/include/asm/ras.h b/arch/x86/include/asm/ras.h
new file mode 100644
index 000000000000..92199af2ab7b
--- /dev/null
+++ b/arch/x86/include/asm/ras.h
@@ -0,0 +1,15 @@
+#ifndef _ASM_X86_RAS_H
+#define _ASM_X86_RAS_H
+
+#define ERR_BUF_SZ 500
+
+extern bool ras_agent;
+
+#define PR_EMERG	BIT(0)
+#define PR_WARNING	BIT(1)
+#define PR_CONT		BIT(2)
+
+extern const char *ras_get_decoded_err(void);
+extern void ras_printk(unsigned long flags, const char *fmt, ...);
+
+#endif /* _ASM_X86_RAS_H */
diff --git a/arch/x86/ras/Makefile b/arch/x86/ras/Makefile
new file mode 100644
index 000000000000..7a70bb5cd057
--- /dev/null
+++ b/arch/x86/ras/Makefile
@@ -0,0 +1 @@
+obj-y		:= ras.o
diff --git a/arch/x86/ras/ras.c b/arch/x86/ras/ras.c
new file mode 100644
index 000000000000..5edfe30034d3
--- /dev/null
+++ b/arch/x86/ras/ras.c
@@ -0,0 +1,145 @@
+#include <linux/slab.h>
+#include <linux/types.h>
+#include <linux/device.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <asm/ras.h>
+
+static size_t dec_len;
+static char *err_str;
+
+/*
+ * If true, userspace has an agent running and eating all the
+ * tracing data we're sending out so there's no dmesg output
+ */
+bool ras_agent;
+EXPORT_SYMBOL_GPL(ras_agent);
+
+/* getting the string implies the current buffer is emptied */
+const char *ras_get_decoded_err(void)
+{
+	dec_len = 0;
+	return err_str;
+}
+
+void ras_printk(unsigned long flags, const char *fmt, ...)
+{
+	va_list args;
+	char *buf;
+	size_t left;
+	int i;
+
+	/* add a HW_ERR prefix to a newly started line */
+	if (!(flags & PR_CONT)) {
+		strcpy(err_str + dec_len, HW_ERR);
+		dec_len += strlen(HW_ERR);
+	}
+
+	left = ERR_BUF_SZ - dec_len - 1;
+	buf = err_str + dec_len;
+
+	va_start(args, fmt);
+	i = vsnprintf(buf, left, fmt, args);
+	va_end(args);
+
+	if (i >= left) {
+		pr_err("Error decode buffer truncated.\n");
+		dec_len = ERR_BUF_SZ-1;
+		err_str[dec_len] = '\n';
+	} else
+		dec_len += i;
+
+	if (!ras_agent) {
+		if (flags & PR_EMERG)
+			pr_emerg("%s", buf);
+		if (flags & PR_WARNING)
+			pr_warning("%s", buf);
+		else if (flags & PR_CONT)
+			pr_cont("%s", buf);
+
+		dec_len = 0;
+	}
+}
+EXPORT_SYMBOL_GPL(ras_printk);
+
+struct bus_type ras_subsys = {
+	.name	  = "ras",
+	.dev_name = "ras",
+};
+
+struct ras_attr {
+	struct attribute attr;
+	ssize_t (*show) (struct kobject *kobj, struct ras_attr *attr, char *bf);
+	ssize_t (*store)(struct kobject *kobj, struct ras_attr *attr,
+			 const char *buf, size_t count);
+};
+
+#define RAS_ATTR(_name, _mode, _show, _store)	\
+static struct ras_attr ras_attr_##_name = __ATTR(_name, _mode, _show, _store)
+
+static ssize_t ras_agent_show(struct kobject *kobj,
+			      struct ras_attr *attr,
+			      char *buf)
+{
+	return sprintf(buf, "%.1d\n", ras_agent);
+}
+
+static ssize_t ras_agent_store(struct kobject *kobj,
+			       struct ras_attr *attr,
+			       const char *buf, size_t count)
+{
+	int ret = 0;
+	unsigned long value;
+
+	ret = kstrtoul(buf, 10, &value);
+	if (ret < 0) {
+		printk(KERN_ERR "Wrong value for ras_agent field.\n");
+		return ret;
+	}
+
+	ras_agent = !!value;
+
+	return count;
+}
+
+RAS_ATTR(agent, 0644, ras_agent_show, ras_agent_store);
+
+static struct attribute *ras_root_attrs[] = {
+	&ras_attr_agent.attr,
+	NULL
+};
+
+static const struct attribute_group ras_root_attr_group = {
+	.attrs = ras_root_attrs,
+};
+
+static const struct attribute_group *ras_root_attr_groups[] = {
+	&ras_root_attr_group,
+	NULL,
+};
+
+static int __init ras_init(void)
+{
+	int err = 0;
+
+	err = subsys_system_register(&ras_subsys, ras_root_attr_groups);
+	if (err) {
+		printk(KERN_ERR "Error registering toplevel RAS sysfs node.\n");
+		return err;
+	}
+
+	/* no freeing of this since it ras.c is compiled-on only */
+	err_str = kzalloc(ERR_BUF_SZ, GFP_KERNEL);
+	if (!err_str) {
+		err = -ENOMEM;
+		goto err_alloc;
+	}
+
+	return 0;
+
+err_alloc:
+	bus_unregister(&ras_subsys);
+
+	return err;
+}
+subsys_initcall(ras_init);
-- 
1.7.8.rc0


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 4/4] EDAC: Convert AMD EDAC pieces to use RAS printk buffer
  2012-03-02 14:25 [RFC -v2 PATCH 0/3] RAS: Use MCE tracepoint for decoded MCEs Borislav Petkov
                   ` (2 preceding siblings ...)
  2012-03-02 14:25 ` [PATCH 3/4] x86, RAS: Add a decoded msg buffer Borislav Petkov
@ 2012-03-02 14:25 ` Borislav Petkov
  2012-03-02 14:52   ` Mauro Carvalho Chehab
  2012-03-02 14:41 ` [RFC -v2 PATCH 0/3] RAS: Use MCE tracepoint for decoded MCEs Mauro Carvalho Chehab
  4 siblings, 1 reply; 26+ messages in thread
From: Borislav Petkov @ 2012-03-02 14:25 UTC (permalink / raw)
  To: Tony Luck; +Cc: Ingo Molnar, EDAC devel, LKML, Borislav Petkov

From: Borislav Petkov <borislav.petkov@amd.com>

This is an initial version of the patch which converts MCE decoding
facilities to use the RAS printk buffer. When there's no userspace agent
running (i.e., /sys/devices/system/ras/agent == 0), we fall back to the
default printk's into dmesg which is what we've been doing so far.

Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
---
 drivers/edac/amd64_edac.c |    8 ++-
 drivers/edac/edac_mc.c    |   42 ++++++---
 drivers/edac/mce_amd.c    |  217 ++++++++++++++++++++++++---------------------
 3 files changed, 149 insertions(+), 118 deletions(-)

diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
index 08413377a43b..29e153c57e33 100644
--- a/drivers/edac/amd64_edac.c
+++ b/drivers/edac/amd64_edac.c
@@ -1,6 +1,7 @@
-#include "amd64_edac.h"
 #include <asm/amd_nb.h>
+#include <asm/ras.h>
 
+#include "amd64_edac.h"
 static struct edac_pci_ctl_info *amd64_ctl_pci;
 
 static int report_gart_errors;
@@ -1901,7 +1902,10 @@ static void amd64_handle_ce(struct mem_ctl_info *mci, struct mce *m)
 	sys_addr = get_error_address(m);
 	syndrome = extract_syndrome(m->status);
 
-	amd64_mc_err(mci, "CE ERROR_ADDRESS= 0x%llx\n", sys_addr);
+	if (ras_agent)
+		ras_printk(PR_EMERG, "ERR_ADDR: 0x%llx", sys_addr);
+	else
+		amd64_mc_err(mci, "CE ERROR_ADDRESS= 0x%llx\n", sys_addr);
 
 	pvt->ops->map_sysaddr_to_csrow(mci, sys_addr, syndrome);
 }
diff --git a/drivers/edac/edac_mc.c b/drivers/edac/edac_mc.c
index ca6c04d350ee..3b3db477b5d0 100644
--- a/drivers/edac/edac_mc.c
+++ b/drivers/edac/edac_mc.c
@@ -30,8 +30,10 @@
 #include <asm/uaccess.h>
 #include <asm/page.h>
 #include <asm/edac.h>
+#include <asm/ras.h>
 #include "edac_core.h"
 #include "edac_module.h"
+#include "mce_amd.h"
 
 /* lock to memory controller's control array */
 static DEFINE_MUTEX(mem_ctls_mutex);
@@ -701,14 +703,22 @@ void edac_mc_handle_ce(struct mem_ctl_info *mci,
 		return;
 	}
 
-	if (edac_mc_get_log_ce())
-		/* FIXME - put in DIMM location */
-		edac_mc_printk(mci, KERN_WARNING,
-			"CE page 0x%lx, offset 0x%lx, grain %d, syndrome "
-			"0x%lx, row %d, channel %d, label \"%s\": %s\n",
-			page_frame_number, offset_in_page,
-			mci->csrows[row].grain, syndrome, row, channel,
-			mci->csrows[row].channels[channel].label, msg);
+	if (edac_mc_get_log_ce()) {
+		if (ras_agent)
+			ras_printk(PR_CONT, ", row: %d, channel: %d\n",
+				   row, channel);
+		else
+			/* FIXME - put in DIMM location */
+			edac_mc_printk(mci, KERN_WARNING,
+					"CE page 0x%lx, offset 0x%lx, grain %d,"
+					" syndrome 0x%lx, row %d, channel %d,"
+					" label \"%s\": %s\n",
+					page_frame_number, offset_in_page,
+					mci->csrows[row].grain, syndrome,
+					row, channel,
+					mci->csrows[row].channels[channel].label,
+					msg);
+	}
 
 	mci->ce_count++;
 	mci->csrows[row].ce_count++;
@@ -780,12 +790,16 @@ void edac_mc_handle_ue(struct mem_ctl_info *mci,
 		pos += chars;
 	}
 
-	if (edac_mc_get_log_ue())
-		edac_mc_printk(mci, KERN_EMERG,
-			"UE page 0x%lx, offset 0x%lx, grain %d, row %d, "
-			"labels \"%s\": %s\n", page_frame_number,
-			offset_in_page, mci->csrows[row].grain, row,
-			labels, msg);
+	if (edac_mc_get_log_ue()) {
+		if (ras_agent)
+			ras_printk(PR_CONT, "row: %d\n", row);
+		else
+			edac_mc_printk(mci, KERN_EMERG,
+				       "UE page 0x%lx, offset 0x%lx, grain %d,"
+				       " row %d, labels \"%s\": %s\n",
+				       page_frame_number, offset_in_page,
+				       mci->csrows[row].grain, row, labels, msg);
+	}
 
 	if (edac_mc_get_panic_on_ue())
 		panic("EDAC MC%d: UE page 0x%lx, offset 0x%lx, grain %d, "
diff --git a/drivers/edac/mce_amd.c b/drivers/edac/mce_amd.c
index 7752f81fa4d6..73fc33b21858 100644
--- a/drivers/edac/mce_amd.c
+++ b/drivers/edac/mce_amd.c
@@ -1,5 +1,7 @@
 #include <linux/module.h>
 #include <linux/slab.h>
+#include <trace/events/mce.h>
+#include <asm/ras.h>
 
 #include "mce_amd.h"
 
@@ -137,9 +139,9 @@ static bool f12h_dc_mce(u16 ec, u8 xec)
 		ret = true;
 
 		if (ll == LL_L2)
-			pr_cont("during L1 linefill from L2.\n");
+			ras_printk(PR_CONT, "during L1 linefill from L2.\n");
 		else if (ll == LL_L1)
-			pr_cont("Data/Tag %s error.\n", R4_MSG(ec));
+			ras_printk(PR_CONT, "Data/Tag %s error.\n", R4_MSG(ec));
 		else
 			ret = false;
 	}
@@ -149,7 +151,7 @@ static bool f12h_dc_mce(u16 ec, u8 xec)
 static bool f10h_dc_mce(u16 ec, u8 xec)
 {
 	if (R4(ec) == R4_GEN && LL(ec) == LL_L1) {
-		pr_cont("during data scrub.\n");
+		ras_printk(PR_CONT, "during data scrub.\n");
 		return true;
 	}
 	return f12h_dc_mce(ec, xec);
@@ -158,7 +160,7 @@ static bool f10h_dc_mce(u16 ec, u8 xec)
 static bool k8_dc_mce(u16 ec, u8 xec)
 {
 	if (BUS_ERROR(ec)) {
-		pr_cont("during system linefill.\n");
+		ras_printk(PR_CONT, "during system linefill.\n");
 		return true;
 	}
 
@@ -178,14 +180,14 @@ static bool f14h_dc_mce(u16 ec, u8 xec)
 		switch (r4) {
 		case R4_DRD:
 		case R4_DWR:
-			pr_cont("Data/Tag parity error due to %s.\n",
+			ras_printk(PR_CONT, "Data/Tag parity error due to %s.\n",
 				(r4 == R4_DRD ? "load/hw prf" : "store"));
 			break;
 		case R4_EVICT:
-			pr_cont("Copyback parity error on a tag miss.\n");
+			ras_printk(PR_CONT, "Copyback parity error on a tag miss.\n");
 			break;
 		case R4_SNOOP:
-			pr_cont("Tag parity error during snoop.\n");
+			ras_printk(PR_CONT, "Tag parity error during snoop.\n");
 			break;
 		default:
 			ret = false;
@@ -195,17 +197,17 @@ static bool f14h_dc_mce(u16 ec, u8 xec)
 		if ((II(ec) != II_MEM && II(ec) != II_IO) || LL(ec) != LL_LG)
 			return false;
 
-		pr_cont("System read data error on a ");
+		ras_printk(PR_CONT, "System read data error on a ");
 
 		switch (r4) {
 		case R4_RD:
-			pr_cont("TLB reload.\n");
+			ras_printk(PR_CONT, "TLB reload.\n");
 			break;
 		case R4_DWR:
-			pr_cont("store.\n");
+			ras_printk(PR_CONT, "store.\n");
 			break;
 		case R4_DRD:
-			pr_cont("load.\n");
+			ras_printk(PR_CONT, "load.\n");
 			break;
 		default:
 			ret = false;
@@ -225,28 +227,29 @@ static bool f15h_dc_mce(u16 ec, u8 xec)
 
 		switch (xec) {
 		case 0x0:
-			pr_cont("Data Array access error.\n");
+			ras_printk(PR_CONT, "Data Array access error.\n");
 			break;
 
 		case 0x1:
-			pr_cont("UC error during a linefill from L2/NB.\n");
+			ras_printk(PR_CONT, "UC error during a linefill "
+					    "from L2/NB.\n");
 			break;
 
 		case 0x2:
 		case 0x11:
-			pr_cont("STQ access error.\n");
+			ras_printk(PR_CONT, "STQ access error.\n");
 			break;
 
 		case 0x3:
-			pr_cont("SCB access error.\n");
+			ras_printk(PR_CONT, "SCB access error.\n");
 			break;
 
 		case 0x10:
-			pr_cont("Tag error.\n");
+			ras_printk(PR_CONT, "Tag error.\n");
 			break;
 
 		case 0x12:
-			pr_cont("LDQ access error.\n");
+			ras_printk(PR_CONT, "LDQ access error.\n");
 			break;
 
 		default:
@@ -255,9 +258,9 @@ static bool f15h_dc_mce(u16 ec, u8 xec)
 	} else if (BUS_ERROR(ec)) {
 
 		if (!xec)
-			pr_cont("during system linefill.\n");
+			ras_printk(PR_CONT, "during system linefill.\n");
 		else
-			pr_cont(" Internal %s condition.\n",
+			ras_printk(PR_CONT, " Internal %s condition.\n",
 				((xec == 1) ? "livelock" : "deadlock"));
 	} else
 		ret = false;
@@ -270,12 +273,12 @@ static void amd_decode_dc_mce(struct mce *m)
 	u16 ec = EC(m->status);
 	u8 xec = XEC(m->status, xec_mask);
 
-	pr_emerg(HW_ERR "Data Cache Error: ");
+	ras_printk(PR_EMERG, "Data Cache Error: ");
 
 	/* TLB error signatures are the same across families */
 	if (TLB_ERROR(ec)) {
 		if (TT(ec) == TT_DATA) {
-			pr_cont("%s TLB %s.\n", LL_MSG(ec),
+			ras_printk(PR_CONT, "%s TLB %s.\n", LL_MSG(ec),
 				((xec == 2) ? "locked miss"
 					    : (xec ? "multimatch" : "parity")));
 			return;
@@ -283,7 +286,7 @@ static void amd_decode_dc_mce(struct mce *m)
 	} else if (fam_ops->dc_mce(ec, xec))
 		;
 	else
-		pr_emerg(HW_ERR "Corrupted DC MCE info?\n");
+		ras_printk(PR_EMERG, "Corrupted DC MCE info?\n");
 }
 
 static bool k8_ic_mce(u16 ec, u8 xec)
@@ -295,19 +298,19 @@ static bool k8_ic_mce(u16 ec, u8 xec)
 		return false;
 
 	if (ll == 0x2)
-		pr_cont("during a linefill from L2.\n");
+		ras_printk(PR_CONT, "during a linefill from L2.\n");
 	else if (ll == 0x1) {
 		switch (R4(ec)) {
 		case R4_IRD:
-			pr_cont("Parity error during data load.\n");
+			ras_printk(PR_CONT, "Parity error during data load.\n");
 			break;
 
 		case R4_EVICT:
-			pr_cont("Copyback Parity/Victim error.\n");
+			ras_printk(PR_CONT, "Copyback Parity/Victim error.\n");
 			break;
 
 		case R4_SNOOP:
-			pr_cont("Tag Snoop error.\n");
+			ras_printk(PR_CONT, "Tag Snoop error.\n");
 			break;
 
 		default:
@@ -330,9 +333,9 @@ static bool f14h_ic_mce(u16 ec, u8 xec)
 			ret = false;
 
 		if (r4 == R4_IRD)
-			pr_cont("Data/tag array parity error for a tag hit.\n");
+			ras_printk(PR_CONT, "Data/tag array parity error for a tag hit.\n");
 		else if (r4 == R4_SNOOP)
-			pr_cont("Tag error during snoop/victimization.\n");
+			ras_printk(PR_CONT, "Tag error during snoop/victimization.\n");
 		else
 			ret = false;
 	}
@@ -348,15 +351,16 @@ static bool f15h_ic_mce(u16 ec, u8 xec)
 
 	switch (xec) {
 	case 0x0 ... 0xa:
-		pr_cont("%s.\n", f15h_ic_mce_desc[xec]);
+		ras_printk(PR_CONT, "%s.\n", f15h_ic_mce_desc[xec]);
 		break;
 
 	case 0xd:
-		pr_cont("%s.\n", f15h_ic_mce_desc[xec-2]);
+		ras_printk(PR_CONT, "%s.\n", f15h_ic_mce_desc[xec-2]);
 		break;
 
 	case 0x10 ... 0x14:
-		pr_cont("Decoder %s parity error.\n", f15h_ic_mce_desc[xec-4]);
+		ras_printk(PR_CONT, "Decoder %s parity error.\n",
+				    f15h_ic_mce_desc[xec-4]);
 		break;
 
 	default:
@@ -370,19 +374,20 @@ static void amd_decode_ic_mce(struct mce *m)
 	u16 ec = EC(m->status);
 	u8 xec = XEC(m->status, xec_mask);
 
-	pr_emerg(HW_ERR "Instruction Cache Error: ");
+	ras_printk(PR_EMERG, "Instruction Cache Error: ");
 
 	if (TLB_ERROR(ec))
-		pr_cont("%s TLB %s.\n", LL_MSG(ec),
+		ras_printk(PR_CONT, "%s TLB %s.\n", LL_MSG(ec),
 			(xec ? "multimatch" : "parity error"));
 	else if (BUS_ERROR(ec)) {
 		bool k8 = (boot_cpu_data.x86 == 0xf && (m->status & BIT_64(58)));
 
-		pr_cont("during %s.\n", (k8 ? "system linefill" : "NB data read"));
+		ras_printk(PR_CONT, "during %s.\n", (k8 ? "system linefill"
+							: "NB data read"));
 	} else if (fam_ops->ic_mce(ec, xec))
 		;
 	else
-		pr_emerg(HW_ERR "Corrupted IC MCE info?\n");
+		ras_printk(PR_EMERG, "Corrupted IC MCE info?\n");
 }
 
 static void amd_decode_bu_mce(struct mce *m)
@@ -390,30 +395,33 @@ static void amd_decode_bu_mce(struct mce *m)
 	u16 ec = EC(m->status);
 	u8 xec = XEC(m->status, xec_mask);
 
-	pr_emerg(HW_ERR "Bus Unit Error");
+	ras_printk(PR_EMERG, "Bus Unit Error");
 
 	if (xec == 0x1)
-		pr_cont(" in the write data buffers.\n");
+		ras_printk(PR_CONT, " in the write data buffers.\n");
 	else if (xec == 0x3)
-		pr_cont(" in the victim data buffers.\n");
+		ras_printk(PR_CONT, " in the victim data buffers.\n");
 	else if (xec == 0x2 && MEM_ERROR(ec))
-		pr_cont(": %s error in the L2 cache tags.\n", R4_MSG(ec));
+		ras_printk(PR_CONT, ": %s error in the L2 cache tags.\n",
+			   R4_MSG(ec));
 	else if (xec == 0x0) {
 		if (TLB_ERROR(ec))
-			pr_cont(": %s error in a Page Descriptor Cache or "
-				"Guest TLB.\n", TT_MSG(ec));
+			ras_printk(PR_CONT, ": %s error in a Page Descriptor "
+					    "Cache or Guest TLB.\n",
+					    TT_MSG(ec));
 		else if (BUS_ERROR(ec))
-			pr_cont(": %s/ECC error in data read from NB: %s.\n",
-				R4_MSG(ec), PP_MSG(ec));
+			ras_printk(PR_CONT, ": %s/ECC error in data read from NB: %s.\n",
+					    R4_MSG(ec), PP_MSG(ec));
 		else if (MEM_ERROR(ec)) {
 			u8 r4 = R4(ec);
 
 			if (r4 >= 0x7)
-				pr_cont(": %s error during data copyback.\n",
-					R4_MSG(ec));
+				ras_printk(PR_CONT, ": %s error during data copyback.\n",
+						    R4_MSG(ec));
 			else if (r4 <= 0x1)
-				pr_cont(": %s parity/ECC error during data "
-					"access from L2.\n", R4_MSG(ec));
+				ras_printk(PR_CONT, ": %s parity/ECC error "
+						    "during data access from L2.\n",
+						    R4_MSG(ec));
 			else
 				goto wrong_bu_mce;
 		} else
@@ -424,7 +432,7 @@ static void amd_decode_bu_mce(struct mce *m)
 	return;
 
 wrong_bu_mce:
-	pr_emerg(HW_ERR "Corrupted BU MCE info?\n");
+	ras_printk(PR_EMERG, "Corrupted BU MCE info?\n");
 }
 
 static void amd_decode_cu_mce(struct mce *m)
@@ -432,28 +440,28 @@ static void amd_decode_cu_mce(struct mce *m)
 	u16 ec = EC(m->status);
 	u8 xec = XEC(m->status, xec_mask);
 
-	pr_emerg(HW_ERR "Combined Unit Error: ");
+	ras_printk(PR_EMERG, "Combined Unit Error: ");
 
 	if (TLB_ERROR(ec)) {
 		if (xec == 0x0)
-			pr_cont("Data parity TLB read error.\n");
+			ras_printk(PR_CONT, "Data parity TLB read error.\n");
 		else if (xec == 0x1)
-			pr_cont("Poison data provided for TLB fill.\n");
+			ras_printk(PR_CONT, "Poison data provided for TLB fill.\n");
 		else
 			goto wrong_cu_mce;
 	} else if (BUS_ERROR(ec)) {
 		if (xec > 2)
 			goto wrong_cu_mce;
 
-		pr_cont("Error during attempted NB data read.\n");
+		ras_printk(PR_CONT, "Error during attempted NB data read.\n");
 	} else if (MEM_ERROR(ec)) {
 		switch (xec) {
 		case 0x4 ... 0xc:
-			pr_cont("%s.\n", f15h_cu_mce_desc[xec - 0x4]);
+			ras_printk(PR_CONT, "%s.\n", f15h_cu_mce_desc[xec - 0x4]);
 			break;
 
 		case 0x10 ... 0x14:
-			pr_cont("%s.\n", f15h_cu_mce_desc[xec - 0x7]);
+			ras_printk(PR_CONT, "%s.\n", f15h_cu_mce_desc[xec - 0x7]);
 			break;
 
 		default:
@@ -464,7 +472,7 @@ static void amd_decode_cu_mce(struct mce *m)
 	return;
 
 wrong_cu_mce:
-	pr_emerg(HW_ERR "Corrupted CU MCE info?\n");
+	ras_printk(PR_EMERG, "Corrupted CU MCE info?\n");
 }
 
 static void amd_decode_ls_mce(struct mce *m)
@@ -473,12 +481,12 @@ static void amd_decode_ls_mce(struct mce *m)
 	u8 xec = XEC(m->status, xec_mask);
 
 	if (boot_cpu_data.x86 >= 0x14) {
-		pr_emerg("You shouldn't be seeing an LS MCE on this cpu family,"
-			 " please report on LKML.\n");
+		ras_printk(PR_EMERG, "You shouldn't be seeing an LS MCE on this"
+				     " cpu family, please report on LKML.\n");
 		return;
 	}
 
-	pr_emerg(HW_ERR "Load Store Error");
+	ras_printk(PR_EMERG, "Load Store Error");
 
 	if (xec == 0x0) {
 		u8 r4 = R4(ec);
@@ -486,14 +494,14 @@ static void amd_decode_ls_mce(struct mce *m)
 		if (!BUS_ERROR(ec) || (r4 != R4_DRD && r4 != R4_DWR))
 			goto wrong_ls_mce;
 
-		pr_cont(" during %s.\n", R4_MSG(ec));
+		ras_printk(PR_CONT, " during %s.\n", R4_MSG(ec));
 	} else
 		goto wrong_ls_mce;
 
 	return;
 
 wrong_ls_mce:
-	pr_emerg(HW_ERR "Corrupted LS MCE info?\n");
+	ras_printk(PR_EMERG, "Corrupted LS MCE info?\n");
 }
 
 static bool k8_nb_mce(u16 ec, u8 xec)
@@ -502,15 +510,15 @@ static bool k8_nb_mce(u16 ec, u8 xec)
 
 	switch (xec) {
 	case 0x1:
-		pr_cont("CRC error detected on HT link.\n");
+		ras_printk(PR_CONT, "CRC error detected on HT link.\n");
 		break;
 
 	case 0x5:
-		pr_cont("Invalid GART PTE entry during GART table walk.\n");
+		ras_printk(PR_CONT, "Invalid GART PTE entry during GART table walk.\n");
 		break;
 
 	case 0x6:
-		pr_cont("Unsupported atomic RMW received from an IO link.\n");
+		ras_printk(PR_CONT, "Unsupported atomic RMW received from an IO link.\n");
 		break;
 
 	case 0x0:
@@ -518,11 +526,11 @@ static bool k8_nb_mce(u16 ec, u8 xec)
 		if (boot_cpu_data.x86 == 0x11)
 			return false;
 
-		pr_cont("DRAM ECC error detected on the NB.\n");
+		ras_printk(PR_CONT, "DRAM ECC error detected on the NB.\n");
 		break;
 
 	case 0xd:
-		pr_cont("Parity error on the DRAM addr/ctl signals.\n");
+		ras_printk(PR_CONT, "Parity error on the DRAM addr/ctl signals.\n");
 		break;
 
 	default:
@@ -552,9 +560,9 @@ static bool f10h_nb_mce(u16 ec, u8 xec)
 
 	case 0xf:
 		if (TLB_ERROR(ec))
-			pr_cont("GART Table Walk data error.\n");
+			ras_printk(PR_CONT, "GART Table Walk data error.\n");
 		else if (BUS_ERROR(ec))
-			pr_cont("DMA Exclusion Vector Table Walk error.\n");
+			ras_printk(PR_CONT, "DMA Exclusion Vector Table Walk error.\n");
 		else
 			ret = false;
 
@@ -563,7 +571,7 @@ static bool f10h_nb_mce(u16 ec, u8 xec)
 
 	case 0x19:
 		if (boot_cpu_data.x86 == 0x15)
-			pr_cont("Compute Unit Data Error.\n");
+			ras_printk(PR_CONT, "Compute Unit Data Error.\n");
 		else
 			ret = false;
 
@@ -581,7 +589,7 @@ static bool f10h_nb_mce(u16 ec, u8 xec)
 		break;
 	}
 
-	pr_cont("%s.\n", f10h_nb_mce_desc[xec - offset]);
+	ras_printk(PR_CONT, "%s.\n", f10h_nb_mce_desc[xec - offset]);
 
 out:
 	return ret;
@@ -599,27 +607,27 @@ void amd_decode_nb_mce(struct mce *m)
 	u16 ec = EC(m->status);
 	u8 xec = XEC(m->status, 0x1f);
 
-	pr_emerg(HW_ERR "Northbridge Error (node %d): ", node_id);
+	ras_printk(PR_EMERG, "Northbridge Error (node %d): ", node_id);
 
 	switch (xec) {
 	case 0x2:
-		pr_cont("Sync error (sync packets on HT link detected).\n");
+		ras_printk(PR_CONT, "Sync error (sync packets on HT link detected).\n");
 		return;
 
 	case 0x3:
-		pr_cont("HT Master abort.\n");
+		ras_printk(PR_CONT, "HT Master abort.\n");
 		return;
 
 	case 0x4:
-		pr_cont("HT Target abort.\n");
+		ras_printk(PR_CONT, "HT Target abort.\n");
 		return;
 
 	case 0x7:
-		pr_cont("NB Watchdog timeout.\n");
+		ras_printk(PR_CONT, "NB Watchdog timeout.\n");
 		return;
 
 	case 0x9:
-		pr_cont("SVM DMA Exclusion Vector error.\n");
+		ras_printk(PR_CONT, "SVM DMA Exclusion Vector error.\n");
 		return;
 
 	default:
@@ -636,7 +644,7 @@ void amd_decode_nb_mce(struct mce *m)
 	return;
 
 wrong_nb_mce:
-	pr_emerg(HW_ERR "Corrupted NB MCE info?\n");
+	ras_printk(PR_EMERG, "Corrupted NB MCE info?\n");
 }
 EXPORT_SYMBOL_GPL(amd_decode_nb_mce);
 
@@ -651,80 +659,80 @@ static void amd_decode_fr_mce(struct mce *m)
 	if (c->x86 != 0x15 && xec != 0x0)
 		goto wrong_fr_mce;
 
-	pr_emerg(HW_ERR "%s Error: ",
+	ras_printk(PR_EMERG, "%s Error: ",
 		 (c->x86 == 0x15 ? "Execution Unit" : "FIROB"));
 
 	if (xec == 0x0 || xec == 0xc)
-		pr_cont("%s.\n", fr_ex_mce_desc[xec]);
+		ras_printk(PR_CONT, "%s.\n", fr_ex_mce_desc[xec]);
 	else if (xec < 0xd)
-		pr_cont("%s parity error.\n", fr_ex_mce_desc[xec]);
+		ras_printk(PR_CONT, "%s parity error.\n", fr_ex_mce_desc[xec]);
 	else
 		goto wrong_fr_mce;
 
 	return;
 
 wrong_fr_mce:
-	pr_emerg(HW_ERR "Corrupted FR MCE info?\n");
+	ras_printk(PR_EMERG, "Corrupted FR MCE info?\n");
 }
 
 static void amd_decode_fp_mce(struct mce *m)
 {
 	u8 xec = XEC(m->status, xec_mask);
 
-	pr_emerg(HW_ERR "Floating Point Unit Error: ");
+	ras_printk(PR_EMERG, "Floating Point Unit Error: ");
 
 	switch (xec) {
 	case 0x1:
-		pr_cont("Free List");
+		ras_printk(PR_CONT, "Free List");
 		break;
 
 	case 0x2:
-		pr_cont("Physical Register File");
+		ras_printk(PR_CONT, "Physical Register File");
 		break;
 
 	case 0x3:
-		pr_cont("Retire Queue");
+		ras_printk(PR_CONT, "Retire Queue");
 		break;
 
 	case 0x4:
-		pr_cont("Scheduler table");
+		ras_printk(PR_CONT, "Scheduler table");
 		break;
 
 	case 0x5:
-		pr_cont("Status Register File");
+		ras_printk(PR_CONT, "Status Register File");
 		break;
 
 	default:
 		goto wrong_fp_mce;
 		break;
 	}
-
-	pr_cont(" parity error.\n");
+	ras_printk(PR_CONT, " parity error.\n");
 
 	return;
 
 wrong_fp_mce:
-	pr_emerg(HW_ERR "Corrupted FP MCE info?\n");
+	ras_printk(PR_EMERG, "Corrupted FP MCE info?\n");
 }
 
 static inline void amd_decode_err_code(u16 ec)
 {
 
-	pr_emerg(HW_ERR "cache level: %s", LL_MSG(ec));
+	ras_printk(PR_EMERG, "cache level: %s", LL_MSG(ec));
 
 	if (BUS_ERROR(ec))
-		pr_cont(", mem/io: %s", II_MSG(ec));
+		ras_printk(PR_CONT, ", mem/io: %s", II_MSG(ec));
 	else
-		pr_cont(", tx: %s", TT_MSG(ec));
+		ras_printk(PR_CONT, ", tx: %s", TT_MSG(ec));
 
 	if (MEM_ERROR(ec) || BUS_ERROR(ec)) {
-		pr_cont(", mem-tx: %s", R4_MSG(ec));
+		ras_printk(PR_CONT, ", mem-tx: %s", R4_MSG(ec));
 
 		if (BUS_ERROR(ec))
-			pr_cont(", part-proc: %s (%s)", PP_MSG(ec), TO_MSG(ec));
+			ras_printk(PR_CONT, ", part-proc: %s (%s)",
+					    PP_MSG(ec), TO_MSG(ec));
 	}
 
-	pr_cont("\n");
+	ras_printk(PR_CONT, "\n");
 }
 
 /*
@@ -752,7 +760,7 @@ int amd_decode_mce(struct notifier_block *nb, unsigned long val, void *data)
 	if (amd_filter_mce(m))
 		return NOTIFY_STOP;
 
-	pr_emerg(HW_ERR "CPU:%d\tMC%d_STATUS[%s|%s|%s|%s|%s",
+	ras_printk(PR_EMERG, "CPU:%d MC%d_STATUS[%s|%s|%s|%s|%s",
 		m->cpu, m->bank,
 		((m->status & MCI_STATUS_OVER)	? "Over"  : "-"),
 		((m->status & MCI_STATUS_UC)	? "UE"	  : "CE"),
@@ -761,19 +769,22 @@ int amd_decode_mce(struct notifier_block *nb, unsigned long val, void *data)
 		((m->status & MCI_STATUS_ADDRV)	? "AddrV" : "-"));
 
 	if (c->x86 == 0x15)
-		pr_cont("|%s|%s",
+		ras_printk(PR_CONT, "|%s|%s",
 			((m->status & BIT_64(44)) ? "Deferred" : "-"),
 			((m->status & BIT_64(43)) ? "Poison"   : "-"));
 
 	/* do the two bits[14:13] together */
 	ecc = (m->status >> 45) & 0x3;
 	if (ecc)
-		pr_cont("|%sECC", ((ecc == 2) ? "C" : "U"));
+		ras_printk(PR_CONT, "|%sECC", ((ecc == 2) ? "C" : "U"));
 
-	pr_cont("]: 0x%016llx\n", m->status);
+	ras_printk(PR_CONT, "]: 0x%016llx", m->status);
 
 	if (m->status & MCI_STATUS_ADDRV)
-		pr_emerg(HW_ERR "\tMC%d_ADDR: 0x%016llx\n", m->bank, m->addr);
+		ras_printk(PR_CONT, " MC%d_ADDR: 0x%016llx",
+			   m->bank, m->addr);
+
+	ras_printk(PR_CONT, "\n");
 
 	switch (m->bank) {
 	case 0:
@@ -813,6 +824,8 @@ int amd_decode_mce(struct notifier_block *nb, unsigned long val, void *data)
 
 	amd_decode_err_code(m->status & 0xffff);
 
+	trace_mce_record(ras_get_decoded_err(), m);
+
 	return NOTIFY_STOP;
 }
 EXPORT_SYMBOL_GPL(amd_decode_mce);
@@ -882,10 +895,10 @@ static int __init mce_amd_init(void)
 		return -EINVAL;
 	}
 
-	pr_info("MCE: In-kernel MCE decoding enabled.\n");
-
 	mce_register_decode_chain(&amd_mce_dec_nb);
 
+	pr_info("MCE: In-kernel MCE decoding enabled.\n");
+
 	return 0;
 }
 early_initcall(mce_amd_init);
-- 
1.7.8.rc0


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [RFC -v2 PATCH 0/3] RAS: Use MCE tracepoint for decoded MCEs
  2012-03-02 14:25 [RFC -v2 PATCH 0/3] RAS: Use MCE tracepoint for decoded MCEs Borislav Petkov
                   ` (3 preceding siblings ...)
  2012-03-02 14:25 ` [PATCH 4/4] EDAC: Convert AMD EDAC pieces to use RAS printk buffer Borislav Petkov
@ 2012-03-02 14:41 ` Mauro Carvalho Chehab
  2012-03-02 14:48   ` Borislav Petkov
  4 siblings, 1 reply; 26+ messages in thread
From: Mauro Carvalho Chehab @ 2012-03-02 14:41 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Tony Luck, Ingo Molnar, EDAC devel, LKML, Borislav Petkov

Em 02-03-2012 11:25, Borislav Petkov escreveu:
> From: Borislav Petkov <borislav.petkov@amd.com>
> 
> Hi,
> 
> here's a second version of the patchset with the buffer enlarging ripped
> out. 1/4 in the series could go in independently since it is a cleanup,
> I'll add it to a for-next testing branch if there are no objections.

It is premature to add patch 2 for -next, as there was no agreement about
the fields to be added for the tracepoint. 

Also, We'll need to coordinate how to merge patch 4, as the changes at
edac_mc.c will conflict with my changes there (even removing the tracepoint patch),
as the memory location based on row/channel there only works for amd64.
That's mainly the reason why I didn't merge my patch series yet at -next,
but, instead, stored it on a separate branch.

Mauro.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC -v2 PATCH 0/3] RAS: Use MCE tracepoint for decoded MCEs
  2012-03-02 14:41 ` [RFC -v2 PATCH 0/3] RAS: Use MCE tracepoint for decoded MCEs Mauro Carvalho Chehab
@ 2012-03-02 14:48   ` Borislav Petkov
  0 siblings, 0 replies; 26+ messages in thread
From: Borislav Petkov @ 2012-03-02 14:48 UTC (permalink / raw)
  To: Mauro Carvalho Chehab
  Cc: Borislav Petkov, Tony Luck, Ingo Molnar, EDAC devel, LKML

On Fri, Mar 02, 2012 at 11:41:54AM -0300, Mauro Carvalho Chehab wrote:
> It is premature to add patch 2 for -next, as there was no agreement about
> the fields to be added for the tracepoint.

This is yet another proof that you're not really reading my emails:

> > 1/4 in the series could go in independently since it is a cleanup,
> > I'll add it to a for-next testing branch if there are no objections.

This sentence clearly refers to patch 1/4, i.e. the first patch in the
series. So either make sure you've really understood what you've read
before replying or don't reply at all.

-- 
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/4] EDAC: Convert AMD EDAC pieces to use RAS printk buffer
  2012-03-02 14:25 ` [PATCH 4/4] EDAC: Convert AMD EDAC pieces to use RAS printk buffer Borislav Petkov
@ 2012-03-02 14:52   ` Mauro Carvalho Chehab
  2012-03-05 11:04     ` Borislav Petkov
  0 siblings, 1 reply; 26+ messages in thread
From: Mauro Carvalho Chehab @ 2012-03-02 14:52 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Tony Luck, Ingo Molnar, EDAC devel, LKML, Borislav Petkov

Em 02-03-2012 11:25, Borislav Petkov escreveu:
> From: Borislav Petkov <borislav.petkov@amd.com>
> 
> This is an initial version of the patch which converts MCE decoding
> facilities to use the RAS printk buffer. When there's no userspace agent
> running (i.e., /sys/devices/system/ras/agent == 0), we fall back to the
> default printk's into dmesg which is what we've been doing so far.
> 
> Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
> ---
>  drivers/edac/amd64_edac.c |    8 ++-
>  drivers/edac/edac_mc.c    |   42 ++++++---
>  drivers/edac/mce_amd.c    |  217 ++++++++++++++++++++++++---------------------
>  3 files changed, 149 insertions(+), 118 deletions(-)
> 
> diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
> index 08413377a43b..29e153c57e33 100644
> --- a/drivers/edac/amd64_edac.c
> +++ b/drivers/edac/amd64_edac.c
> @@ -1,6 +1,7 @@
> -#include "amd64_edac.h"
>  #include <asm/amd_nb.h>
> +#include <asm/ras.h>
>  
> +#include "amd64_edac.h"
>  static struct edac_pci_ctl_info *amd64_ctl_pci;
>  
>  static int report_gart_errors;
> @@ -1901,7 +1902,10 @@ static void amd64_handle_ce(struct mem_ctl_info *mci, struct mce *m)
>  	sys_addr = get_error_address(m);
>  	syndrome = extract_syndrome(m->status);
>  
> -	amd64_mc_err(mci, "CE ERROR_ADDRESS= 0x%llx\n", sys_addr);
> +	if (ras_agent)
> +		ras_printk(PR_EMERG, "ERR_ADDR: 0x%llx", sys_addr);
> +	else
> +		amd64_mc_err(mci, "CE ERROR_ADDRESS= 0x%llx\n", sys_addr);
>  
>  	pvt->ops->map_sysaddr_to_csrow(mci, sys_addr, syndrome);
>  }
> diff --git a/drivers/edac/edac_mc.c b/drivers/edac/edac_mc.c
> index ca6c04d350ee..3b3db477b5d0 100644
> --- a/drivers/edac/edac_mc.c
> +++ b/drivers/edac/edac_mc.c
> @@ -30,8 +30,10 @@
>  #include <asm/uaccess.h>
>  #include <asm/page.h>
>  #include <asm/edac.h>
> +#include <asm/ras.h>
>  #include "edac_core.h"
>  #include "edac_module.h"
> +#include "mce_amd.h"
>  
>  /* lock to memory controller's control array */
>  static DEFINE_MUTEX(mem_ctls_mutex);
> @@ -701,14 +703,22 @@ void edac_mc_handle_ce(struct mem_ctl_info *mci,
>  		return;
>  	}
>  
> -	if (edac_mc_get_log_ce())
> -		/* FIXME - put in DIMM location */
> -		edac_mc_printk(mci, KERN_WARNING,
> -			"CE page 0x%lx, offset 0x%lx, grain %d, syndrome "
> -			"0x%lx, row %d, channel %d, label \"%s\": %s\n",
> -			page_frame_number, offset_in_page,
> -			mci->csrows[row].grain, syndrome, row, channel,
> -			mci->csrows[row].channels[channel].label, msg);
> +	if (edac_mc_get_log_ce()) {
> +		if (ras_agent)
> +			ras_printk(PR_CONT, ", row: %d, channel: %d\n",
> +				   row, channel);
> +		else
> +			/* FIXME - put in DIMM location */
> +			edac_mc_printk(mci, KERN_WARNING,
> +					"CE page 0x%lx, offset 0x%lx, grain %d,"
> +					" syndrome 0x%lx, row %d, channel %d,"
> +					" label \"%s\": %s\n",
> +					page_frame_number, offset_in_page,
> +					mci->csrows[row].grain, syndrome,
> +					row, channel,
> +					mci->csrows[row].channels[channel].label,
> +					msg);
> +	}


As I've commented already, This piece of the code is not ok, due
to several reasons:

	- the "ras_agent" helper functions is used only for amd64_edac. There's
no reason for use it elsewhere;

	- the code here is adding the location only for CE errors. The location
of the error is also pertinent for the other types of errors;

	- on my patches, this function disappears, being replaced by a single
function, that can be used to report all types of memory errors;

	- non MCA drivers should also generate tracepoints;

	- it is a way easier to add just one call to the tracepoint function here, than
to spread on all drivers.

As it is shown at:
	http://git.kernel.org/?p=linux/kernel/git/mchehab/linux-edac.git;a=commitdiff_plain;h=4eb2a29419c1fefd76c8dbcd308b84a4b52faf4d

Once we made an agreement with regards to the tracepoint function, 
a change similar to what it was proposed there, e. g.:

diff --git a/drivers/edac/edac_mc.c b/drivers/edac/edac_mc.c
index 37d2c97..2dca0e3 100644
--- a/drivers/edac/edac_mc.c
+++ b/drivers/edac/edac_mc.c
@@ -899,7 +899,7 @@ void edac_mc_handle_error(const enum hw_event_mc_err_type type,
 			  const int layer2,
 			  const char *msg,
 			  const char *other_detail,
-			  const void *mcelog)
+			  const void *arch_log)
 {
 	unsigned long remapped_page;
 	/* FIXME: too much for stack: move it to some pre-alocated area */
@@ -1033,8 +1047,17 @@ void edac_mc_handle_error(const enum hw_event_mc_err_type type,
 			"page 0x%lx offset 0x%lx grain %d\n",
 			page_frame_number, offset_in_page, grain);
 
+#ifdef CONFIG_X86
+	if (arch_log)
+		trace_mc_error_mce(type, mci->mc_idx, msg, label, location,
+				   detail, other_detail, arch_log);
+	else
+		trace_mc_error(type, mci->mc_idx, msg, label, location,
+			       detail, other_detail);
+#else
 	trace_mc_error(type, mci->mc_idx, msg, label, location,
 		       detail, other_detail);
+#endif
 
 	if (type == HW_EVENT_ERR_CORRECTED) {
 		if (edac_mc_get_log_ce())

is enough to generate traces for all existing drivers.

Regards,
Mauro

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* RE: [PATCH 1/4] mce: Slim up struct mce
  2012-03-02 14:25 ` [PATCH 1/4] mce: Slim up struct mce Borislav Petkov
@ 2012-03-02 17:47   ` Luck, Tony
  2012-03-03  7:37     ` Ingo Molnar
  0 siblings, 1 reply; 26+ messages in thread
From: Luck, Tony @ 2012-03-02 17:47 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Ingo Molnar, EDAC devel, LKML, Borislav Petkov

> Drop unused ->apicid, make ->socketid __u16 since it is copied from
> struct cpuinfo_86.phys_proc_id, reorder members and adjust padding so
> that it remains packed and on a 64-byte boundary.

This structure is user visible via /dev/mcelog.  The mcelog(8) daemon
has been coded to cope with extra fields being added to the end of the
structure (an old daemon that doesn't know what they are will just skip
over them). But things will break if you change the offsets of any fields
that it does know about.

Linus says - no breaking user space - so "NACK" for this patch.

-Tony

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/4] mce: Slim up struct mce
  2012-03-02 17:47   ` Luck, Tony
@ 2012-03-03  7:37     ` Ingo Molnar
  2012-03-05  9:17       ` Borislav Petkov
  0 siblings, 1 reply; 26+ messages in thread
From: Ingo Molnar @ 2012-03-03  7:37 UTC (permalink / raw)
  To: Luck, Tony; +Cc: Borislav Petkov, EDAC devel, LKML, Borislav Petkov

* Luck, Tony <tony.luck@intel.com> wrote:

> > Drop unused ->apicid, make ->socketid __u16 since it is 
> > copied from struct cpuinfo_86.phys_proc_id, reorder members 
> > and adjust padding so that it remains packed and on a 
> > 64-byte boundary.
> 
> This structure is user visible via /dev/mcelog.  The mcelog(8) 
> daemon has been coded to cope with extra fields being added to 
> the end of the structure (an old daemon that doesn't know what 
> they are will just skip over them). But things will break if 
> you change the offsets of any fields that it does know about.

Please separate out a 'struct mce_legacy' data type that 
contains the /dev/mcelog format, and use it in the place that 
outputs to /dev/mcelog (and convert from 'struct mce' to 'struct 
mce_legacy' and back when interacting with /dev/mcelog).

This keeps 'struct mce' kernel internal and flexible. A pure 
kernel internal structure should never have been exported like 
this to begin with, doing that results in problems like this.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/4] mce: Slim up struct mce
  2012-03-03  7:37     ` Ingo Molnar
@ 2012-03-05  9:17       ` Borislav Petkov
  0 siblings, 0 replies; 26+ messages in thread
From: Borislav Petkov @ 2012-03-05  9:17 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Luck, Tony, Borislav Petkov, EDAC devel, LKML

On Sat, Mar 03, 2012 at 08:37:26AM +0100, Ingo Molnar wrote:
> 
> * Luck, Tony <tony.luck@intel.com> wrote:
> 
> > > Drop unused ->apicid, make ->socketid __u16 since it is 
> > > copied from struct cpuinfo_86.phys_proc_id, reorder members 
> > > and adjust padding so that it remains packed and on a 
> > > 64-byte boundary.
> > 
> > This structure is user visible via /dev/mcelog.  The mcelog(8) 
> > daemon has been coded to cope with extra fields being added to 
> > the end of the structure (an old daemon that doesn't know what 
> > they are will just skip over them). But things will break if 
> > you change the offsets of any fields that it does know about.
> 
> Please separate out a 'struct mce_legacy' data type that 
> contains the /dev/mcelog format, and use it in the place that 
> outputs to /dev/mcelog (and convert from 'struct mce' to 'struct 
> mce_legacy' and back when interacting with /dev/mcelog).

Either that or I'll save my patch for when we remove /dev/mcelog
altogether.

> This keeps 'struct mce' kernel internal and flexible. A pure 
> kernel internal structure should never have been exported like 
> this to begin with, doing that results in problems like this.

Ho-hum.

-- 
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/4] EDAC: Convert AMD EDAC pieces to use RAS printk buffer
  2012-03-02 14:52   ` Mauro Carvalho Chehab
@ 2012-03-05 11:04     ` Borislav Petkov
  2012-03-05 11:43       ` Mauro Carvalho Chehab
  0 siblings, 1 reply; 26+ messages in thread
From: Borislav Petkov @ 2012-03-05 11:04 UTC (permalink / raw)
  To: Mauro Carvalho Chehab
  Cc: Tony Luck, Ingo Molnar, EDAC devel, LKML, Borislav Petkov

On Fri, Mar 02, 2012 at 11:52:59AM -0300, Mauro Carvalho Chehab wrote:

[..]

> 	- the "ras_agent" helper functions is used only for amd64_edac. There's
> no reason for use it elsewhere;

[..]

That can be fixed very easily with the patch below ontop of the current
series (I'll add a proper version of it to the series later). This way,
one does ras_printk() and builds up the message and then, when its done,
calls into the tracepoint like this:

	trace_mce_record(ras_get_decoded_err(), m);

where m is struct mce.

The result is:

[Hardware Error]: CPU:0 MC4_STATUS[-|UE|-|PCC|AddrV|UECC]: 0xb607a10009080a23 MC4_ADDR: 0x0000000955647380
[Hardware Error]: Northbridge Error (node 0): DRAM ECC error detected on the NB.
[Hardware Error]: EDAC MC2: UE page 0x955647, offset 0x380, grain 0, row 2, labels ":": amd64_edac

^^^ This line comes from the respective edac driver.

[Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: WR, part-proc: RES (no timeout)
[Hardware Error]: CPU: 0, MCGc/s: 0/0, MC4: b607a10009080a23, ADDR/MISC: 0000000955647380/dead57ac1ba0babe, RIP: 00:<0000000000000000>, TSC: 0, TIME: 0, SOCKET: 0


> non MCA drivers should also generate tracepoints;

yes, they should define a tracepoint which fits the hardware error
reporting scheme they're using.

--
diff --git a/arch/x86/include/asm/ras.h b/arch/x86/include/asm/ras.h
index 92199af2ab7b..b51838514259 100644
--- a/arch/x86/include/asm/ras.h
+++ b/arch/x86/include/asm/ras.h
@@ -6,8 +6,8 @@
 extern bool ras_agent;
 
 #define PR_EMERG	BIT(0)
-#define PR_WARNING	BIT(1)
-#define PR_CONT		BIT(2)
+#define PR_WARNING	BIT(4)
+#define PR_CONT		BIT(8)
 
 extern const char *ras_get_decoded_err(void);
 extern void ras_printk(unsigned long flags, const char *fmt, ...);
diff --git a/arch/x86/ras/ras.c b/arch/x86/ras/ras.c
index 5edfe30034d3..868d732c6cd4 100644
--- a/arch/x86/ras/ras.c
+++ b/arch/x86/ras/ras.c
@@ -52,7 +52,7 @@ void ras_printk(unsigned long flags, const char *fmt, ...)
 	if (!ras_agent) {
 		if (flags & PR_EMERG)
 			pr_emerg("%s", buf);
-		if (flags & PR_WARNING)
+		else if (flags & PR_WARNING)
 			pr_warning("%s", buf);
 		else if (flags & PR_CONT)
 			pr_cont("%s", buf);
diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
index 29e153c57e33..c0f31f953c70 100644
--- a/drivers/edac/amd64_edac.c
+++ b/drivers/edac/amd64_edac.c
@@ -1902,10 +1902,7 @@ static void amd64_handle_ce(struct mem_ctl_info *mci, struct mce *m)
 	sys_addr = get_error_address(m);
 	syndrome = extract_syndrome(m->status);
 
-	if (ras_agent)
-		ras_printk(PR_EMERG, "ERR_ADDR: 0x%llx", sys_addr);
-	else
-		amd64_mc_err(mci, "CE ERROR_ADDRESS= 0x%llx\n", sys_addr);
+	amd64_mc_err(mci, "CE ERROR_ADDRESS= 0x%llx\n", sys_addr);
 
 	pvt->ops->map_sysaddr_to_csrow(mci, sys_addr, syndrome);
 }
diff --git a/drivers/edac/edac_core.h b/drivers/edac/edac_core.h
index e48ab3108ad8..62a32b3fa660 100644
--- a/drivers/edac/edac_core.h
+++ b/drivers/edac/edac_core.h
@@ -49,8 +49,17 @@
 #define edac_printk(level, prefix, fmt, arg...) \
 	printk(level "EDAC " prefix ": " fmt, ##arg)
 
-#define edac_mc_printk(mci, level, fmt, arg...) \
-	printk(level "EDAC MC%d: " fmt, mci->mc_idx, ##arg)
+#define edac_mc_printk(mci, level, fmt, arg...)					\
+({										\
+	if (ras_agent) {							\
+		unsigned pr_lvl = BIT((unsigned)(level[1] - '0'));		\
+										\
+		ras_printk(pr_lvl, HW_ERR "EDAC MC%d" fmt,			\
+			   mci->mc_idx, ##arg);					\
+	}									\
+	else									\
+		printk(level "EDAC MC%d: " fmt, mci->mc_idx, ##arg);		\
+})
 
 #define edac_mc_chipset_printk(mci, level, prefix, fmt, arg...) \
 	printk(level "EDAC " prefix " MC%d: " fmt, mci->mc_idx, ##arg)
diff --git a/drivers/edac/edac_mc.c b/drivers/edac/edac_mc.c
index 3b3db477b5d0..446853a303c4 100644
--- a/drivers/edac/edac_mc.c
+++ b/drivers/edac/edac_mc.c
@@ -703,11 +703,7 @@ void edac_mc_handle_ce(struct mem_ctl_info *mci,
 		return;
 	}
 
-	if (edac_mc_get_log_ce()) {
-		if (ras_agent)
-			ras_printk(PR_CONT, ", row: %d, channel: %d\n",
-				   row, channel);
-		else
+	if (edac_mc_get_log_ce())
 			/* FIXME - put in DIMM location */
 			edac_mc_printk(mci, KERN_WARNING,
 					"CE page 0x%lx, offset 0x%lx, grain %d,"
@@ -718,7 +714,6 @@ void edac_mc_handle_ce(struct mem_ctl_info *mci,
 					row, channel,
 					mci->csrows[row].channels[channel].label,
 					msg);
-	}
 
 	mci->ce_count++;
 	mci->csrows[row].ce_count++;
@@ -790,16 +785,12 @@ void edac_mc_handle_ue(struct mem_ctl_info *mci,
 		pos += chars;
 	}
 
-	if (edac_mc_get_log_ue()) {
-		if (ras_agent)
-			ras_printk(PR_CONT, "row: %d\n", row);
-		else
+	if (edac_mc_get_log_ue())
 			edac_mc_printk(mci, KERN_EMERG,
 				       "UE page 0x%lx, offset 0x%lx, grain %d,"
 				       " row %d, labels \"%s\": %s\n",
 				       page_frame_number, offset_in_page,
 				       mci->csrows[row].grain, row, labels, msg);
-	}
 
 	if (edac_mc_get_panic_on_ue())
 		panic("EDAC MC%d: UE page 0x%lx, offset 0x%lx, grain %d, "

-- 
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/4] EDAC: Convert AMD EDAC pieces to use RAS printk buffer
  2012-03-05 11:04     ` Borislav Petkov
@ 2012-03-05 11:43       ` Mauro Carvalho Chehab
  2012-03-05 12:44         ` Borislav Petkov
  0 siblings, 1 reply; 26+ messages in thread
From: Mauro Carvalho Chehab @ 2012-03-05 11:43 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Tony Luck, Ingo Molnar, EDAC devel, LKML, Borislav Petkov

Em 05-03-2012 08:04, Borislav Petkov escreveu:
> On Fri, Mar 02, 2012 at 11:52:59AM -0300, Mauro Carvalho Chehab wrote:
> 
> [..]
> 
>> 	- the "ras_agent" helper functions is used only for amd64_edac. There's
>> no reason for use it elsewhere;
> 
> [..]
> 
> That can be fixed very easily with the patch below ontop of the current
> series (I'll add a proper version of it to the series later). This way,
> one does ras_printk() and builds up the message and then, when its done,
> calls into the tracepoint like this:
> 
> 	trace_mce_record(ras_get_decoded_err(), m);
> 
> where m is struct mce.

It is still adding an amd64-specific code inside the core, as no other driver will
use the amd64 "ras_agent".

Instead of doing that, you should do the opposite: call the EDAC report function
passing ras_get_decoded_err() as the string parameter of the function. This way,
the EDAC core will work the same way with amd64 and with the other drivers.

> 
> The result is:
> 
> [Hardware Error]: CPU:0 MC4_STATUS[-|UE|-|PCC|AddrV|UECC]: 0xb607a10009080a23 MC4_ADDR: 0x0000000955647380
> [Hardware Error]: Northbridge Error (node 0): DRAM ECC error detected on the NB.
> [Hardware Error]: EDAC MC2: UE page 0x955647, offset 0x380, grain 0, row 2, labels ":": amd64_edac
> 
> ^^^ This line comes from the respective edac driver.
> 
> [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: WR, part-proc: RES (no timeout)
> [Hardware Error]: CPU: 0, MCGc/s: 0/0, MC4: b607a10009080a23, ADDR/MISC: 0000000955647380/dead57ac1ba0babe, RIP: 00:<0000000000000000>, TSC: 0, TIME: 0, SOCKET: 0
> 
> 
>> non MCA drivers should also generate tracepoints;
> 
> yes, they should define a tracepoint which fits the hardware error
> reporting scheme they're using.

If nobody objects, I'll add my changes to linux-next, as it was tested on most
systems. I'll remove the MCE-specific tracepoint from my code, keeping the trace
there for the other stuff.

> 
> --
> diff --git a/arch/x86/include/asm/ras.h b/arch/x86/include/asm/ras.h
> index 92199af2ab7b..b51838514259 100644
> --- a/arch/x86/include/asm/ras.h
> +++ b/arch/x86/include/asm/ras.h
> @@ -6,8 +6,8 @@
>  extern bool ras_agent;
>  
>  #define PR_EMERG	BIT(0)
> -#define PR_WARNING	BIT(1)
> -#define PR_CONT		BIT(2)
> +#define PR_WARNING	BIT(4)
> +#define PR_CONT		BIT(8)
>  
>  extern const char *ras_get_decoded_err(void);
>  extern void ras_printk(unsigned long flags, const char *fmt, ...);
> diff --git a/arch/x86/ras/ras.c b/arch/x86/ras/ras.c
> index 5edfe30034d3..868d732c6cd4 100644
> --- a/arch/x86/ras/ras.c
> +++ b/arch/x86/ras/ras.c
> @@ -52,7 +52,7 @@ void ras_printk(unsigned long flags, const char *fmt, ...)
>  	if (!ras_agent) {
>  		if (flags & PR_EMERG)
>  			pr_emerg("%s", buf);
> -		if (flags & PR_WARNING)
> +		else if (flags & PR_WARNING)
>  			pr_warning("%s", buf);
>  		else if (flags & PR_CONT)
>  			pr_cont("%s", buf);
> diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
> index 29e153c57e33..c0f31f953c70 100644
> --- a/drivers/edac/amd64_edac.c
> +++ b/drivers/edac/amd64_edac.c
> @@ -1902,10 +1902,7 @@ static void amd64_handle_ce(struct mem_ctl_info *mci, struct mce *m)
>  	sys_addr = get_error_address(m);
>  	syndrome = extract_syndrome(m->status);
>  
> -	if (ras_agent)
> -		ras_printk(PR_EMERG, "ERR_ADDR: 0x%llx", sys_addr);
> -	else
> -		amd64_mc_err(mci, "CE ERROR_ADDRESS= 0x%llx\n", sys_addr);
> +	amd64_mc_err(mci, "CE ERROR_ADDRESS= 0x%llx\n", sys_addr);
>  
>  	pvt->ops->map_sysaddr_to_csrow(mci, sys_addr, syndrome);
>  }
> diff --git a/drivers/edac/edac_core.h b/drivers/edac/edac_core.h
> index e48ab3108ad8..62a32b3fa660 100644
> --- a/drivers/edac/edac_core.h
> +++ b/drivers/edac/edac_core.h
> @@ -49,8 +49,17 @@
>  #define edac_printk(level, prefix, fmt, arg...) \
>  	printk(level "EDAC " prefix ": " fmt, ##arg)
>  
> -#define edac_mc_printk(mci, level, fmt, arg...) \
> -	printk(level "EDAC MC%d: " fmt, mci->mc_idx, ##arg)
> +#define edac_mc_printk(mci, level, fmt, arg...)					\
> +({										\
> +	if (ras_agent) {							\
> +		unsigned pr_lvl = BIT((unsigned)(level[1] - '0'));		\
> +										\
> +		ras_printk(pr_lvl, HW_ERR "EDAC MC%d" fmt,			\
> +			   mci->mc_idx, ##arg);					\
> +	}									\
> +	else									\
> +		printk(level "EDAC MC%d: " fmt, mci->mc_idx, ##arg);		\
> +})
>  
>  #define edac_mc_chipset_printk(mci, level, prefix, fmt, arg...) \
>  	printk(level "EDAC " prefix " MC%d: " fmt, mci->mc_idx, ##arg)
> diff --git a/drivers/edac/edac_mc.c b/drivers/edac/edac_mc.c
> index 3b3db477b5d0..446853a303c4 100644
> --- a/drivers/edac/edac_mc.c
> +++ b/drivers/edac/edac_mc.c
> @@ -703,11 +703,7 @@ void edac_mc_handle_ce(struct mem_ctl_info *mci,
>  		return;
>  	}
>  
> -	if (edac_mc_get_log_ce()) {
> -		if (ras_agent)
> -			ras_printk(PR_CONT, ", row: %d, channel: %d\n",
> -				   row, channel);
> -		else
> +	if (edac_mc_get_log_ce())
>  			/* FIXME - put in DIMM location */
>  			edac_mc_printk(mci, KERN_WARNING,
>  					"CE page 0x%lx, offset 0x%lx, grain %d,"
> @@ -718,7 +714,6 @@ void edac_mc_handle_ce(struct mem_ctl_info *mci,
>  					row, channel,
>  					mci->csrows[row].channels[channel].label,
>  					msg);
> -	}
>  
>  	mci->ce_count++;
>  	mci->csrows[row].ce_count++;
> @@ -790,16 +785,12 @@ void edac_mc_handle_ue(struct mem_ctl_info *mci,
>  		pos += chars;
>  	}
>  
> -	if (edac_mc_get_log_ue()) {
> -		if (ras_agent)
> -			ras_printk(PR_CONT, "row: %d\n", row);
> -		else
> +	if (edac_mc_get_log_ue())
>  			edac_mc_printk(mci, KERN_EMERG,
>  				       "UE page 0x%lx, offset 0x%lx, grain %d,"
>  				       " row %d, labels \"%s\": %s\n",
>  				       page_frame_number, offset_in_page,
>  				       mci->csrows[row].grain, row, labels, msg);
> -	}
>  
>  	if (edac_mc_get_panic_on_ue())
>  		panic("EDAC MC%d: UE page 0x%lx, offset 0x%lx, grain %d, "
> 


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/4] EDAC: Convert AMD EDAC pieces to use RAS printk buffer
  2012-03-05 11:43       ` Mauro Carvalho Chehab
@ 2012-03-05 12:44         ` Borislav Petkov
  2012-03-05 13:35           ` Mauro Carvalho Chehab
  0 siblings, 1 reply; 26+ messages in thread
From: Borislav Petkov @ 2012-03-05 12:44 UTC (permalink / raw)
  To: Mauro Carvalho Chehab
  Cc: Tony Luck, Ingo Molnar, EDAC devel, LKML, Borislav Petkov

On Mon, Mar 05, 2012 at 08:43:59AM -0300, Mauro Carvalho Chehab wrote:
> It is still adding an amd64-specific code inside the core, as no other driver will
> use the amd64 "ras_agent".

Whoopie, we have another example that you're not really reading
my emails: ras_agent is _NOT_ amd64-specific but it is defined in
<arch/x86/ras/ras.c>

[..]

> If nobody objects, I'll add my changes to linux-next, as it was tested
> on most systems. I'll remove the MCE-specific tracepoint from my code,
> keeping the trace there for the other stuff.

As already pointed out, I object to the tracepoints you've defined for every
single edac_mc_handle_* call:

TRACE_EVENT(mc_corrected_error,
TRACE_EVENT(mc_uncorrected_error,
TRACE_EVENT(mc_corrected_error_fbd,
TRACE_EVENT(mc_uncorrected_error_fbd,
TRACE_EVENT(mc_out_of_range,
TRACE_EVENT(mc_corrected_error_no_info,
TRACE_EVENT(mc_uncorrected_error_no_info,

The edac drivers which get their error info from MCA should use
trace_mce_record() and the others should either use a _single_ generic
tracepoint or define one which adheres to the underlying hardware
reporting scheme (be it PCI-AER, or whatever).

-- 
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/4] EDAC: Convert AMD EDAC pieces to use RAS printk buffer
  2012-03-05 12:44         ` Borislav Petkov
@ 2012-03-05 13:35           ` Mauro Carvalho Chehab
  2012-03-05 14:13             ` Borislav Petkov
  0 siblings, 1 reply; 26+ messages in thread
From: Mauro Carvalho Chehab @ 2012-03-05 13:35 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Tony Luck, Ingo Molnar, EDAC devel, LKML, Borislav Petkov

Em 05-03-2012 09:44, Borislav Petkov escreveu:
> On Mon, Mar 05, 2012 at 08:43:59AM -0300, Mauro Carvalho Chehab wrote:
>> It is still adding an amd64-specific code inside the core, as no other driver will
>> use the amd64 "ras_agent".
> 
> Whoopie, we have another example that you're not really reading
> my emails: ras_agent is _NOT_ amd64-specific but it is defined in
> <arch/x86/ras/ras.c>

No. This is an example that you're not reading my emails: no other driver needs that.
So, it is something that it is specific to the MCA amd64 drivers.

The other two MCA drivers are sb_edac and i7core_edac. I wrote both drivers, and they
don't need any helper function to store strings on a temporary buffer.

Also, the edac core is not x86-specific. So, referencing to a var there (ras_agent) 
that it is defined inside arch/x86 would break Kernel compilation on all other 
architectures.

> [..]
> 
>> If nobody objects, I'll add my changes to linux-next, as it was tested
>> on most systems. I'll remove the MCE-specific tracepoint from my code,
>> keeping the trace there for the other stuff.
> 
> As already pointed out, I object to the tracepoints you've defined for every
> single edac_mc_handle_* call:
> 
> TRACE_EVENT(mc_corrected_error,
> TRACE_EVENT(mc_uncorrected_error,
> TRACE_EVENT(mc_corrected_error_fbd,
> TRACE_EVENT(mc_uncorrected_error_fbd,
> TRACE_EVENT(mc_out_of_range,
> TRACE_EVENT(mc_corrected_error_no_info,
> TRACE_EVENT(mc_uncorrected_error_no_info,

As already pointed out, you're not reading my emails. The above were at the version 1 of
my patches, with I sent at least a month ago. Since version 2, what is proposed is to use:

TRACE_EVENT(mc_error_mce,

for MCA-based memory error events. There's also a variant for non-MCA drivers (mc_error). 

[1] http://git.kernel.org/?p=linux/kernel/git/mchehab/linux-edac.git;a=commitdiff;h=4eb2a29419c1fefd76c8dbcd308b84a4b52faf4d

I also wrote on my emails that, instead of having a tracepoint specific for memory errors,
it is possible to re-define the fields I've proposed to cover CPU location/socket label,
and that this is better than folding everything into a hard-to-parse single string message.

> The edac drivers which get their error info from MCA should use
> trace_mce_record() and the others should either use a _single_ generic
> tracepoint or define one which adheres to the underlying hardware
> reporting scheme (be it PCI-AER, or whatever).
> 

Mauro.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/4] EDAC: Convert AMD EDAC pieces to use RAS printk buffer
  2012-03-05 13:35           ` Mauro Carvalho Chehab
@ 2012-03-05 14:13             ` Borislav Petkov
  2012-03-05 14:58               ` Mauro Carvalho Chehab
  0 siblings, 1 reply; 26+ messages in thread
From: Borislav Petkov @ 2012-03-05 14:13 UTC (permalink / raw)
  To: Mauro Carvalho Chehab
  Cc: Borislav Petkov, Tony Luck, Ingo Molnar, EDAC devel, LKML

On Mon, Mar 05, 2012 at 10:35:47AM -0300, Mauro Carvalho Chehab wrote:
> No. This is an example that you're not reading my emails:

Unfortunately, I read your emails.

> no other driver needs that. So, it is something that it is specific to
> the MCA amd64 drivers.

Let me spell it for ya: no, it's specific to x86, and not to amd64_edac.

> The other two MCA drivers are sb_edac and i7core_edac. I wrote both drivers, and they
> don't need any helper function to store strings on a temporary buffer.
> 
> Also, the edac core is not x86-specific. So, referencing to a var there (ras_agent) 
> that it is defined inside arch/x86 would break Kernel compilation on all other 
> architectures.

That's more like it.

It can be moved to an arch-agnostic place or be defined
__attribute__((weak)) in edac_core.c. Unless someone has a better idea,
of course.

[..]

> As already pointed out, you're not reading my emails. The above were at the version 1 of
> my patches, with I sent at least a month ago. Since version 2, what is proposed is to use:
> 
> TRACE_EVENT(mc_error_mce,
> 
> for MCA-based memory error events. There's also a variant for non-MCA drivers (mc_error). 
> 
> [1] http://git.kernel.org/?p=linux/kernel/git/mchehab/linux-edac.git;a=commitdiff;h=4eb2a29419c1fefd76c8dbcd308b84a4b52faf4d

I see at least 4 misdesigned tracepoints there:

trace_mc_out_of_range_mce
trace_mc_out_of_range
trace_mc_error_mce
trace_mc_error
...

so NACK to those.

> I also wrote on my emails that, instead of having a tracepoint
> specific for memory errors, it is possible to re-define the fields
> I've proposed to cover CPU location/socket label, and that this is
> better than folding everything into a hard-to-parse single string
> message.

No, this is repurposing the fields of memory errors, which is ugly. So, no.

-- 
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/4] EDAC: Convert AMD EDAC pieces to use RAS printk buffer
  2012-03-05 14:13             ` Borislav Petkov
@ 2012-03-05 14:58               ` Mauro Carvalho Chehab
  2012-03-05 22:00                 ` [PATCHv5] EDAC core changes in order to properly report errors from all types of memory controllers Mauro Carvalho Chehab
  0 siblings, 1 reply; 26+ messages in thread
From: Mauro Carvalho Chehab @ 2012-03-05 14:58 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Tony Luck, Ingo Molnar, EDAC devel, LKML

Em 05-03-2012 11:13, Borislav Petkov escreveu:
> On Mon, Mar 05, 2012 at 10:35:47AM -0300, Mauro Carvalho Chehab wrote:
>> No. This is an example that you're not reading my emails:
> 
> Unfortunately, I read your emails.
> 
>> no other driver needs that. So, it is something that it is specific to
>> the MCA amd64 drivers.
> 
> Let me spell it for ya: no, it's specific to x86, and not to amd64_edac.

As I'll NACK adding this solution on my drivers, as it makes no sense there,
it is specific to amd64_edac/amd64 mce.

>> The other two MCA drivers are sb_edac and i7core_edac. I wrote both drivers, and they
>> don't need any helper function to store strings on a temporary buffer.
>>
>> Also, the edac core is not x86-specific. So, referencing to a var there (ras_agent) 
>> that it is defined inside arch/x86 would break Kernel compilation on all other 
>> architectures.
> 
> That's more like it.
> 
> It can be moved to an arch-agnostic place or be defined
> __attribute__((weak)) in edac_core.c. Unless someone has a better idea,
> of course.

Well, just fill the string on the way it makes sense for amd64, and then call the
EDAC report function, letting it to call the trace function.

> 
> [..]
> 
>> As already pointed out, you're not reading my emails. The above were at the version 1 of
>> my patches, with I sent at least a month ago. Since version 2, what is proposed is to use:
>>
>> TRACE_EVENT(mc_error_mce,
>>
>> for MCA-based memory error events. There's also a variant for non-MCA drivers (mc_error). 
>>
>> [1] http://git.kernel.org/?p=linux/kernel/git/mchehab/linux-edac.git;a=commitdiff;h=4eb2a29419c1fefd76c8dbcd308b84a4b52faf4d
> 
> I see at least 4 misdesigned tracepoints there:
> 
> trace_mc_out_of_range_mce
> trace_mc_out_of_range
> trace_mc_error_mce
> trace_mc_error
> ...

There's no "..." there. There are just 4 traces defined.
The out of range is an special case to report parse errors.

As I said before, I'm OK to remove the *out_of_range* traces. 

So, there'are just two traces:

	trace_mc_error_mce
	trace_mc_error

E. g. one for the MCA errors, and another one for the non-architecture supported
error handling.

> so NACK to those.
> 
>> I also wrote on my emails that, instead of having a tracepoint
>> specific for memory errors, it is possible to re-define the fields
>> I've proposed to cover CPU location/socket label, and that this is
>> better than folding everything into a hard-to-parse single string
>> message.
> 
> No, this is repurposing the fields of memory errors, which is ugly. So, no.

Then, I it should have 2 MCA error traces:

	- One when the error is inside the CPU socket;
	- Another one when the error is outside the CPU.

Tony,

Please correct me if I'm wrong, but Intel MCA can only point to an error inside
the CPU or a memory error, right? At least, I didn't find there at the x86 arch 
specs anything at the MCA registers that would allow an error to point to the 
PCI bus address for a PCI error, for example.

Regards,
Mauro



^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCHv5] EDAC core changes in order to properly report errors from all types of memory controllers
  2012-03-05 14:58               ` Mauro Carvalho Chehab
@ 2012-03-05 22:00                 ` Mauro Carvalho Chehab
  2012-03-05 23:23                   ` Borislav Petkov
  0 siblings, 1 reply; 26+ messages in thread
From: Mauro Carvalho Chehab @ 2012-03-05 22:00 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Tony Luck, Ingo Molnar, EDAC devel, LKML

This is the 5th version of my patch series. It seemed too big to
send all those emails to LKML/edac mailing lists for the 5th
time, so, instead, I'll point to the git tree where they're hold.

I'm doing a massive test of the entire patchset with several different edac 
drivers, so the biggest changes on this series are the bug-fix patches.

Besides that, there are a few other differences:

	- the struct channel_info doesn't represent a channel. Its contents
represent a memory rank. So, call it as "rank_info";

	- Add a FIXME information to remind that, currently, the new "dimm_info"
structure represents a "rank", if the memory is addressed via csrows;

	- when the "dimm_info" is representing a rank, the sysfs nodes for it are
called as "rank" instead of "dimm";

	- an agreement was not reached yet for the the MCA-based tracepoint. So, I've
removed it from the patch series;

	- the out_of_range tracepoint got removed. Instead, a parse error will
generate only a printk message.

With those changes, there's just one tracepont defined there, on this patchset:
	http://git.kernel.org/?p=linux/kernel/git/mchehab/linux-edac.git;a=commit;h=fdfa64045e43c942e1250708365d9240cd0da9c3

The following changes since commit 805a6af8dba5dfdd35ec35dc52ec0122400b2610:

  Linux 3.2 (2012-01-04 15:55:44 -0800)

are available in the git repository at:
  git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-edac.git hw_events_v5

Mauro Carvalho Chehab (43):
      edac/ppc4xx_edac: Fix compilation
      edac: Better describe the memory concepts
      drivers/edac: rename channel_info to rank_info
      edac: Create a dimm struct and move the labels into it
      edac: Add per dimm's sysfs nodes
      edac: Prepare to push down to drivers the filling of the memset_info
      i5400_edac: Convert it to report memory with the new location
      i7300_edac: Convert it to report memory with the new location
      edac: move dimm properties to struct memset_info
      edac: Don't initialize csrow's first_page & friends when not needed
      edac: move nr_pages to dimm struct
      edac: Add per-dimm sysfs show nodes
      edac: DIMM location cleanup
      edac-mc: Allow reporting errors on a non-csrow oriented way
      edac.h: Use kernel-doc-nano-HOWTO.txt notation for enums
      edac: rework memory layer hierarchy description
      edac: Export MC hierarchy counters for CE and UE
      edac: Cleanup the logs for i7core and sb edac drivers
      edac_mc: Some clenups at the log message
      edac: Add a sysfs node to test the EDAC error report facility
      edac_mc: Fix the enable label filter logic
      edac: Initialize the dimm label with the known information
      edac: don't OOPS if the csrow is not visible
      edac: Fix sysfs csrow?/*ce*count counters
      edac: Fix new error counts
      edac: Fix per layer error count counters
      edac: i5400: Fix DIMM memory filling
      edac_mc: Improve the labels parsing
      edac: Fix module removal logic
      edac_mc_sysfs: don't create inactive errcount sysfs nodes
      edac_mc: Fixes the logic that fills the dimms
      i5400_edac: Avoid calling pci_put_device() twice
      i5400_edac: Better represent the memory controller hierarchy
      edac: fill the location with something useful if the DIMM is not found
      edac: be sure to use the GET_POS macro to get memset_info struct
      amd64_edac: remove a duplicated call to edac_mc_handle_error()
      i5000_edac: Fix the logic that retrieves memory information
      i5100_edac: Fix the logic
      edac: add a sysfs node that stores the max possible memory location
      edac: Call the minimum grain node as "rank" if chip select is used
      i7300_edac: fixup
      Fix memory error count
      events/hw_event: Create a Hardware Events Report Mecanism (HERM)

 drivers/edac/amd64_edac.c       |  210 +++++++------
 drivers/edac/amd64_edac_dbg.c   |    6 +-
 drivers/edac/amd64_edac_inj.c   |   24 +-
 drivers/edac/amd76x_edac.c      |   44 ++-
 drivers/edac/cell_edac.c        |   42 ++-
 drivers/edac/cpc925_edac.c      |   93 +++--
 drivers/edac/e752x_edac.c       |   94 ++++--
 drivers/edac/e7xxx_edac.c       |   88 ++++--
 drivers/edac/edac_core.h        |   48 +--
 drivers/edac/edac_device.c      |   27 +-
 drivers/edac/edac_mc.c          |  700 ++++++++++++++++++++++++---------------
 drivers/edac/edac_mc_sysfs.c    |  625 ++++++++++++++++++++++++++++++++---
 drivers/edac/edac_module.h      |    2 +-
 drivers/edac/edac_pci.c         |    7 +-
 drivers/edac/i3000_edac.c       |   51 ++-
 drivers/edac/i3200_edac.c       |   57 ++--
 drivers/edac/i5000_edac.c       |  225 +++++++------
 drivers/edac/i5100_edac.c       |  105 +++---
 drivers/edac/i5400_edac.c       |  318 ++++++++++--------
 drivers/edac/i7300_edac.c       |  117 +++----
 drivers/edac/i7core_edac.c      |  267 +++++-----------
 drivers/edac/i82443bxgx_edac.c  |   43 ++-
 drivers/edac/i82860_edac.c      |   57 +++-
 drivers/edac/i82875p_edac.c     |   53 ++-
 drivers/edac/i82975x_edac.c     |   58 +++-
 drivers/edac/mpc85xx_edac.c     |   45 ++-
 drivers/edac/mv64x60_edac.c     |   47 ++-
 drivers/edac/pasemi_edac.c      |   51 ++--
 drivers/edac/ppc4xx_edac.c      |   62 ++--
 drivers/edac/r82600_edac.c      |   42 ++-
 drivers/edac/sb_edac.c          |  203 +++++-------
 drivers/edac/tile_edac.c        |   33 ++-
 drivers/edac/x38_edac.c         |   54 ++--
 include/linux/edac.h            |  454 ++++++++++++++++++++-----
 include/trace/events/hw_event.h |  107 ++++++
 35 files changed, 2856 insertions(+), 1603 deletions(-)
 create mode 100644 include/trace/events/hw_event.h


-

Whan an agreement with regards to the MCA-based tracepont is reached, a simple
patch like the one below would be enough to use a separate tracepoint for the
x86 architecture, when MCA is enabled and the error comes from it.

diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
index ea7eb9a..348a396 100644
--- a/drivers/edac/amd64_edac.c
+++ b/drivers/edac/amd64_edac.c
@@ -1898,7 +1898,7 @@ static void amd64_handle_ce(struct mem_ctl_info *mci, struct mce *m)
 				     -1, -1, -1,
 				     EDAC_MOD_STR,
 				     "HW has no ERROR_ADDRESS available",
-				     NULL);
+				     m);
 		return;
 	}
 
@@ -1927,7 +1927,7 @@ static void amd64_handle_ue(struct mem_ctl_info *mci, struct mce *m)
 				     -1, -1, -1,
 				     EDAC_MOD_STR,
 				     "HW has no ERROR_ADDRESS available",
-				     NULL);
+				     m);
 		return;
 	}
 
@@ -1946,7 +1946,7 @@ static void amd64_handle_ue(struct mem_ctl_info *mci, struct mce *m)
 				     page, offset, 0,
 				     -1, -1, -1,
 				     EDAC_MOD_STR,
-				     "ERROR ADDRESS NOT mapped to a MC", NULL);
+				     "ERROR ADDRESS NOT mapped to a MC", m);
 		return;
 	}
 
@@ -1961,12 +1961,12 @@ static void amd64_handle_ue(struct mem_ctl_info *mci, struct mce *m)
 				     -1, -1, -1,
 				     EDAC_MOD_STR,
 				     "ERROR ADDRESS NOT mapped to CS",
-				     NULL);
+				     m);
 	} else {
 		edac_mc_handle_error(HW_EVENT_ERR_UNCORRECTED, mci,
 				     page, offset, 0,
 				     csrow, -1, -1,
-				     EDAC_MOD_STR, "", NULL);
+				     EDAC_MOD_STR, "", m);
 	}
 }
 
diff --git a/drivers/edac/edac_mc.c b/drivers/edac/edac_mc.c
index eb73ddc..dfd24d3 100644
--- a/drivers/edac/edac_mc.c
+++ b/drivers/edac/edac_mc.c
@@ -1055,8 +1055,17 @@ void edac_mc_handle_error(const enum hw_event_mc_err_type type,
 			"page 0x%lx offset 0x%lx grain %d",
 			page_frame_number, offset_in_page, grain);
 
+#ifdef CONFIG_X86
+	if (arch_log)
+		trace_mc_error_mce(type, mci->mc_idx, msg, label, location,
+				   detail, other_detail, arch_log);
+	else
+		trace_mc_error(type, mci->mc_idx, msg, label, location,
+			       detail, other_detail);
+#else
 	trace_mc_error(type, mci->mc_idx, msg, label, location,
 		       detail, other_detail);
+#endif
 
 	if (type == HW_EVENT_ERR_CORRECTED) {
 		if (edac_mc_get_log_ce())
diff --git a/include/trace/events/hw_event.h b/include/trace/events/hw_event.h
index 9209c6b..76f4dd5 100644
--- a/include/trace/events/hw_event.h
+++ b/include/trace/events/hw_event.h
@@ -101,6 +101,110 @@ TRACE_EVENT(mc_error,
 		  __get_str(driver_detail))
 );
 
+/*
+ * X86 arch-specific events
+ */
+
+#ifdef CONFIG_X86
+#include <asm/mce.h>
+
+/*
+ * MCE event for memory-controller errors
+ */
+
+/*
+ * NOTE: due to trace contraints, we can't have the mce_record at the
+ * same file as mce_record, as they're used by different files. Including
+ * trace headers twice cause duplicated symbols. So, care is needed to
+ * sync changes here with changes at include/trace/events/mce.h.
+ */
+
+TRACE_EVENT(mc_error_mce,
+
+	TP_PROTO(const unsigned int err_type,
+		 const unsigned int mc_index,
+		 const char *msg,
+		 const char *label,
+		 const char *location,
+		 const char *detail,
+		 const char *driver_detail,
+		 const struct mce *m),
+
+	TP_ARGS(err_type, mc_index, msg, label, location,
+		detail, driver_detail, m),
+
+	TP_STRUCT__entry(
+		__field(	unsigned int,	err_type	)
+		__field(	unsigned int,	mc_index	)
+		__string(	msg,		msg		)
+		__string(	label,		label		)
+		__string(	detail,		detail		)
+		__string(	location,	location	)
+		__string(	driver_detail,	driver_detail	)
+		__field(	u64,		mcgcap		)
+		__field(	u64,		mcgstatus	)
+		__field(	u64,		status		)
+		__field(	u64,		addr		)
+		__field(	u64,		misc		)
+		__field(	u64,		ip		)
+		__field(	u64,		tsc		)
+		__field(	u64,		walltime	)
+		__field(	u32,		cpu		)
+		__field(	u32,		cpuid		)
+		__field(	u32,		apicid		)
+		__field(	u32,		socketid	)
+		__field(	u8,		cs		)
+		__field(	u8,		bank		)
+		__field(	u8,		cpuvendor	)
+	),
+
+	TP_fast_assign(
+		__entry->err_type	= err_type;
+		__entry->mc_index	= mc_index;
+		__assign_str(msg, msg);
+		__assign_str(label, label);
+		__assign_str(location, location);
+		__assign_str(detail, detail);
+		__assign_str(driver_detail, driver_detail);
+		__entry->mcgcap		= m->mcgcap;
+		__entry->mcgstatus	= m->mcgstatus;
+		__entry->status		= m->status;
+		__entry->addr		= m->addr;
+		__entry->misc		= m->misc;
+		__entry->ip		= m->ip;
+		__entry->tsc		= m->tsc;
+		__entry->walltime	= m->time;
+		__entry->cpu		= m->extcpu;
+		__entry->cpuid		= m->cpuid;
+		__entry->apicid		= m->apicid;
+		__entry->socketid	= m->socketid;
+		__entry->cs		= m->cs;
+		__entry->bank		= m->bank;
+		__entry->cpuvendor	= m->cpuvendor;
+	),
+
+	TP_printk("mce#%d: %s error %s on label \"%s\" (%s %s CPU: %d, MCGc/s: %llx/%llx, MC%d: %016Lx, ADDR/MISC: %016Lx/%016Lx, RIP: %02x:<%016Lx>, TSC: %llx, PROCESSOR: %u:%x, TIME: %llu, SOCKET: %u, APIC: %x %s)",
+		  __entry->mc_index,
+		  (__entry->err_type == HW_EVENT_ERR_CORRECTED) ? "Corrected" :
+			((__entry->err_type == HW_EVENT_ERR_FATAL) ?
+			"Fatal" : "Uncorrected"),
+		  __get_str(msg),
+		  __get_str(label),
+		  __get_str(location),
+		  __get_str(detail),
+		  __entry->cpu,
+		  __entry->mcgcap, __entry->mcgstatus,
+		  __entry->bank, __entry->status,
+		  __entry->addr, __entry->misc,
+		  __entry->cs, __entry->ip,
+		  __entry->tsc,
+		  __entry->cpuvendor, __entry->cpuid,
+		  __entry->walltime,
+		  __entry->socketid,
+		  __entry->apicid,
+		  __get_str(driver_detail))
+);
+
 #endif /* _TRACE_HW_EVENT_MC_H */
 
 /* This part must be outside protection */

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCHv5] EDAC core changes in order to properly report errors from all types of memory controllers
  2012-03-05 22:00                 ` [PATCHv5] EDAC core changes in order to properly report errors from all types of memory controllers Mauro Carvalho Chehab
@ 2012-03-05 23:23                   ` Borislav Petkov
  2012-03-06 11:31                     ` Mauro Carvalho Chehab
  0 siblings, 1 reply; 26+ messages in thread
From: Borislav Petkov @ 2012-03-05 23:23 UTC (permalink / raw)
  To: Mauro Carvalho Chehab
  Cc: Borislav Petkov, Tony Luck, Ingo Molnar, EDAC devel, LKML

On Mon, Mar 05, 2012 at 07:00:04PM -0300, Mauro Carvalho Chehab wrote:
> With those changes, there's just one tracepont defined there, on this patchset:
> 	http://git.kernel.org/?p=linux/kernel/git/mchehab/linux-edac.git;a=commit;h=fdfa64045e43c942e1250708365d9240cd0da9c3

Ok, here's my take from simply skimming over the patchset for 5 mins:

* first of all, a lot of the patches are done sloppily and have no commit
messages whatsoever. This is a no-no and the first thing you need to fix.

* http://git.kernel.org/?p=linux/kernel/git/mchehab/linux-edac.git;a=commitdiff;h=ad015450e549b29a74283733048254d3c647a33at is humongous, has no commit message and contains a lot of changes which probably deserve a patch of their own. Mauro, you're not doing this since yesterday, you should know better!

* http://git.kernel.org/?p=linux/kernel/git/mchehab/linux-edac.git;a=commitdiff;h=ccca89bce4c0515aabd67440ba01ede97d2b4dcc removes crap introduced by earlier patches in the series, so also a no-no and needs to be fixed.

* WTF does the commit message of http://git.kernel.org/?p=linux/kernel/git/mchehab/linux-edac.git;a=commitdiff;h=c911a426ef72c817222e7c2ab302a81b69ccbd07 mean? Looks like it redacts a huge function you've introduced in an earlier patch... Hell, no!

...

* And, last but not least, your patches touch amd64_edac* extensively,
and, I need to properly review those changes before they get committed.

So, my absolutely sincere suggestion to you is to split this huge
patchset into smaller patchsets containing 5-10 patches, making it much
easier to review, go over <Documentation/SubmittingPatches> and refresh
your knowledge on how a proper patch should look like and what it should
contain, test your stuff properly on the machines it addresses and
_only_ _then_ send them to the relevant parties for proper review - not
earlier.

HTH.

-- 
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCHv5] EDAC core changes in order to properly report errors from all types of memory controllers
  2012-03-05 23:23                   ` Borislav Petkov
@ 2012-03-06 11:31                     ` Mauro Carvalho Chehab
  2012-03-06 12:16                       ` Borislav Petkov
  0 siblings, 1 reply; 26+ messages in thread
From: Mauro Carvalho Chehab @ 2012-03-06 11:31 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Tony Luck, Ingo Molnar, EDAC devel, LKML

Em 05-03-2012 20:23, Borislav Petkov escreveu:
> On Mon, Mar 05, 2012 at 07:00:04PM -0300, Mauro Carvalho Chehab wrote:
>> With those changes, there's just one tracepont defined there, on this patchset:
>> 	http://git.kernel.org/?p=linux/kernel/git/mchehab/linux-edac.git;a=commit;h=fdfa64045e43c942e1250708365d9240cd0da9c3
> 
> Ok, here's my take from simply skimming over the patchset for 5 mins:
> 
> * first of all, a lot of the patches are done sloppily and have no commit
> messages whatsoever. This is a no-no and the first thing you need to fix.

Yes, sure. Most of the patches after changeset dd3309c are fixes, noticed 
during the test phasis.

The above changeset is huge and complex, as it changes the internal API,
affecting all drivers.

I'm testing the patchset on i3000, i3200, i5000, i5100, i5400, i7300,
i7core_edac, SB, amd64, x38, ...

As I'm getting regressions, I'm adding fixes on the series.

It likely makes sense to fold them into it, or to better describe them.

> 
> * http://git.kernel.org/?p=linux/kernel/git/mchehab/linux-edac.git;a=commitdiff;h=ad015450e549b29a74283733048254d3c647a33at is humongous, has no commit message and contains a lot of changes which probably deserve a patch of their own. Mauro, you're not doing this since yesterday, you should know better!

While the patch is complex, the description here is simple: basically the
FB-DIMM drivers were using the wrong offset for the dimm array. I'll better
describe it on v6.

> * http://git.kernel.org/?p=linux/kernel/git/mchehab/linux-edac.git;a=commitdiff;h=ccca89bce4c0515aabd67440ba01ede97d2b4dcc removes crap introduced by earlier patches in the series, so also a no-no and needs to be fixed.

Yes, it should be fold with a previous patch (likely with d943730b4a86a594d5fb83849d66442e7f9c600a).

> * WTF does the commit message of http://git.kernel.org/?p=linux/kernel/git/mchehab/linux-edac.git;a=commitdiff;h=c911a426ef72c817222e7c2ab302a81b69ccbd07 mean? Looks like it redacts a huge function you've introduced in an earlier patch... Hell, no!

It changes the behavior of a logic I've introduced. Before this patch, one
sysfs node is created for every element at the hierarchy (branch/channel/dimm),
even if there's nothing populated there. 

The new logic checks if is there any memory inserted on it, before creating
the nodes.

I opted to have it on a separate patchset.
> 
> ...
> 
> * And, last but not least, your patches touch amd64_edac* extensively,
> and, I need to properly review those changes before they get committed.

Yes, sure you need to review. 

You'll see there that most of the changes there are trivial:

- there's now just one function to report errors, so, all lines that call the
  error report routines got changed. This is a trivial change;

- there's no need, on amd64, to fill the data structures used by the EDAC code
  to convert from physical memory into csrow/channel, as the amd64 driver
  doesn't call the EDAC core routine that does such calculus
  (first_page/last_page stuff). So, there's some cleanup there;

- I've re-ordered some parsing blocks at amd64_edac, in order to benefit on
  the edac core improvements: as there's now just one error handling routine,
  if the parser fails, for example, to get the channel, all it needs to do is
  to fill the channel with "-1". The EDAC core will remove "channel" from the
  error location information. This removed the need for the separate "fbd"
  variants of the error functions.

The amd64 changes are:

$ git log --oneline v3.2.. drivers/edac/amd64_edac.c 
a582cfa Fix memory error count
ccca89b amd64_edac: remove a duplicated call to edac_mc_handle_error()
dd3309c edac: rework memory layer hierarchy description
d943730 edac-mc: Allow reporting errors on a non-csrow oriented way
7967d1d edac: DIMM location cleanup
50048a7 edac: move nr_pages to dimm struct
8e83f0a edac: Don't initialize csrow's first_page & friends when not needed
53144b2 edac: move dimm properties to struct memset_info

The last two patches should be fold on dd3309c, as they fixes the logic
there.

> So, my absolutely sincere suggestion to you is to split this huge
> patchset into smaller patchsets containing 5-10 patches, making it much
> easier to review, 

Breaking it into smaller patchsets is not trivial, as the analysis of the
individual changesets only makes sense in the light of the big change on
the edac core structures. One alternative would be to fold patches, but
this would mean to loose some details.

> go over <Documentation/SubmittingPatches> and refresh
> your knowledge on how a proper patch should look like and what it should
> contain, test your stuff properly on the machines it addresses and
> _only_ _then_ send them to the relevant parties for proper review - not
> earlier.

Linux workflow works better with commit soon/commit often.

As the internal representation changed, and the API calls as well, all
drivers are potentially affected.

I'm testing it on a broad range of machines we have available at Red Hat, 
even replacing some memories at the selected hardware in order to see how 
the driver behaves on different situations.

Even replacing the memories on the test systems I can, it is unlikely
that I'll be able to test all possible memory combinations. So, I expect 
that the others can give me some feedback about other systems I may not
have, or where I may not be able to discover some special case where a
regression might be introduced.

That's why I'm feeding the ML with the patches I'm working with.

Besides that, holding patches to be reviewed at the end of all tests will 
just increase the number of patches at the changeset.

So, it is better to work on get reviews and merging what's there for -next,
and then move ahead, applying other fixes as needed, instead of rebasing
everything for every discovered bug or regression.

Btw, there are several patches on this series fixing existing bugs on
the drivers, discovered on such tests. For example, see changeset
88617ec94595 (i5000_edac):

The previous (upstream) logic for handle_channel() was:

static void handle_channel(struct i5000_pvt *pvt, int csrow, int channel,
			struct i5000_dimm_info *dinfo)
{
	int mtr;
	int amb_present_reg;
	int addrBits;

	mtr = determine_mtr(pvt, csrow, channel);
	if (MTR_DIMMS_PRESENT(mtr)) {
		amb_present_reg = determine_amb_present_reg(pvt, channel);

		/* Determine if there is  a  DIMM present in this DIMM slot */
		if (amb_present_reg & (1 << (csrow >> 1))) {
			dinfo->dual_rank = MTR_DIMM_RANK(mtr);

			if (!((dinfo->dual_rank == 0) &&
				((csrow & 0x1) == 0x1))) {
				/* Start with the number of bits for a Bank
				 * on the DRAM */
				addrBits = MTR_DRAM_BANKS_ADDR_BITS(mtr);
				/* Add thenumber of ROW bits */
				addrBits += MTR_DIMM_ROWS_ADDR_BITS(mtr);
				/* add the number of COLUMN bits */
				addrBits += MTR_DIMM_COLS_ADDR_BITS(mtr);

				addrBits += 6;	/* add 64 bits per DIMM */
				addrBits -= 20;	/* divide by 2^^20 */
				addrBits -= 3;	/* 8 bits per bytes */

				dinfo->megabytes = 1 << addrBits;
			}
		}
	}
}

The above logic calculates and fills the DIMM memory size (dinfo->megabytes),
that it is latter used to calculate the dimm number of pages. The edac core
only shows the sysfs nodes for a csrow/channel if the number of pages for the
memory is not zero. 

As the size is only calculated if the memory is dual-rank, that means
that the i5000 edac driver doesn't work properly with single rank memories,
as size would be zero.

The above "emulation" logic seems to be trying to make a dual-rank FB-DIMM to
look like a csrow-based memory controller, e. g. creating two csrows per DIMM.

This is a good example of drivers workarounds introduced due to the lack of a
proper support for non-csrow-based memory controllers.

On a csrow-based MC, dual-rank memories would be mapped as two separate 
(csrow, channel) addresses. Each will have half of the DIMM size on each
address. The above "emulation" creates two (csrow, channel) addreses for every
FB-DIMM, filling just one, but only if the memory is dual-rank.

For a FB-DIMM controller, the number of ranks is just a detail associated with
a given DIMM slot, as the memory is selected by slot, and not by rank.

So, the logic is completely broken for single-rank memories and half-broken for 
double-rank ones.

This is an example of a patch that should not be fold.

After this patch, the memories on the i5000 machine I'm testing are properly
reported, being single-rank or dual-rank.

Regards,
Mauro

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCHv5] EDAC core changes in order to properly report errors from all types of memory controllers
  2012-03-06 11:31                     ` Mauro Carvalho Chehab
@ 2012-03-06 12:16                       ` Borislav Petkov
  2012-03-07  0:20                         ` [PATCHv7] " Mauro Carvalho Chehab
  0 siblings, 1 reply; 26+ messages in thread
From: Borislav Petkov @ 2012-03-06 12:16 UTC (permalink / raw)
  To: Mauro Carvalho Chehab
  Cc: Borislav Petkov, Tony Luck, Ingo Molnar, EDAC devel, LKML

On Tue, Mar 06, 2012 at 08:31:36AM -0300, Mauro Carvalho Chehab wrote:

[..]

> Breaking it into smaller patchsets is not trivial, as the analysis of the
> individual changesets only makes sense in the light of the big change on
> the edac core structures.

I don't believe that and LKML contains countless examples of how patches
can be split into simple hunks containing only one logical change.
Simply put yourself in the reviewer's place and try to imagine what kind
of patch format you'd like to see.

[..]

> On a csrow-based MC, dual-rank memories would be mapped as two separate 
> (csrow, channel) addresses. Each will have half of the DIMM size on each
> address. The above "emulation" creates two (csrow, channel) addreses for every
> FB-DIMM, filling just one, but only if the memory is dual-rank.
> 
> For a FB-DIMM controller, the number of ranks is just a detail associated with
> a given DIMM slot, as the memory is selected by slot, and not by rank.
> 
> So, the logic is completely broken for single-rank memories and half-broken for 
> double-rank ones.

I'm still wondering whether FBDIMM-based drivers should get their own
EDAC infrastructure and own nomenclature instead of fitting them in the
existing scheme...

> 
> This is an example of a patch that should not be fold.
> 
> After this patch, the memories on the i5000 machine I'm testing are properly
> reported, being single-rank or dual-rank.

How about instead of verbosely explaining in a mail why you're doing
what you're doing, you add that explanation to your patches so that even
the unlightened one can understand what you're doing? I think that will
benefit all.

Thanks.

-- 
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCHv7] EDAC core changes in order to properly report errors from all types of memory controllers
  2012-03-06 12:16                       ` Borislav Petkov
@ 2012-03-07  0:20                         ` Mauro Carvalho Chehab
  2012-03-07  8:42                           ` Borislav Petkov
  0 siblings, 1 reply; 26+ messages in thread
From: Mauro Carvalho Chehab @ 2012-03-07  0:20 UTC (permalink / raw)
  To: EDAC devel; +Cc: Borislav Petkov, Tony Luck, Ingo Molnar, LKML

Here it is the version 7 of the EDAC core changes.

Version 6 skipped due to a small issue on the series.

This series has only "cosmetic" changes over the last one. No
functional changes. What's different:

- Instead of 43 patches, this series contain 21 patches. Most of the
  dirty history were removed. It is now cleaner for review.

- A few coding style changes were applied (24 lines changed, most on
  some comments with more than 80 lines).

- The first approach to address the needs for non-csrow-based memory
  controllers were removed from the history. This made the series
  cleaner, as several patches could be folded, improving patch
  readability;

- patch descriptions were changed/improved.

The series now contains:

- 2 fix patches over upstream:
      edac/ppc4xx_edac: Fix compilation
      i5400_edac: Avoid calling pci_put_device() twice

- 1 comments improvements:
      edac: Improve the comments to better describe the memory concepts

- 1 internal struct renaming patch:
      edac: rename channel_info to rank_info

- 6 patches that prepare the internal structures to represent the memory
  properties per dimm, instead of per csrow. This is needed for modern
  controllers, where the memories at different channels may be different:
      edac: Create a dimm struct and move the labels into it
      edac: Add per dimm's sysfs nodes
      edac: move dimm properties to struct memset_info
      edac: Don't initialize csrow's first_page & friends when not needed
      edac: move nr_pages to dimm struct
      edac: Add per-dimm sysfs show nodes

- 2 patches that add proper support for FB-DIMM and for the modern Intel
  DDR2/DDR3 memory controllers: 
      edac: Fix core support for MC's that see DIMMS instead of ranks
      edac: Export MC hierarchy counters for CE and UE

- 1 log cleanup patch, that prepares for using a MCA based tracepoint:
      edac: Cleanup the logs for i7core and sb edac drivers

- 2 debug improvement patches:
      edac: Add a sysfs node to test the EDAC error report facility
      edac: Initialize the dimm label with the known information

- 5 post-FB-DIMM patches that cleans, fix and/or improve a few random things:
      edac_mc_sysfs: don't create inactive errcount sysfs nodes
      i5000_edac: Fix the logic that retrieves memory information
      edac: add a sysfs node that stores the max possible memory location
      edac: Call the sysfs nodes as "rank" instead of "dimm" if chip select is used
      i5400_edac: improve debug messages to better represent the filled memory

- 1 patch that adds a trace event to report memory errors:
      events/hw_event: Create a Hardware Events Report Mecanism (HERM)

While the preliminar tests is working ok on the machines I'm testing,
as I didn't finish the tests yet, some other fix patches may be needed,
but I'll insert them at the end of the series, as rebasing a large patchset
like that is very time-consuming.

So,  I think it is time to merge it at -next, in order to give more visibility
to it. So, tomorrow, I'll add it there, if I got no complains.

The above changes since commit 805a6af8dba5dfdd35ec35dc52ec0122400b2610:

  Linux 3.2 (2012-01-04 15:55:44 -0800)

are available in the git repository at:
  git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-edac.git hw_events_v7


Em 06-03-2012 09:16, Borislav Petkov escreveu:
> On Tue, Mar 06, 2012 at 08:31:36AM -0300, Mauro Carvalho Chehab wrote:

>> For a FB-DIMM controller, the number of ranks is just a detail associated with
>> a given DIMM slot, as the memory is selected by slot, and not by rank.
>>
>> So, the logic is completely broken for single-rank memories and half-broken for 
>> double-rank ones.
> 
> I'm still wondering whether FBDIMM-based drivers should get their own
> EDAC infrastructure and own nomenclature instead of fitting them in the
> existing scheme...

A typical driver using csrow/channel describes the memory based on ranks. 
A FB-DIMM memory controller describes memory based on DIMMs. But those
are just the to opposite sides of the issue. There's a number of other
situations between them. Creating a FBDIMM-based won't cover them.

There are "non-typical" DDR2/DDR3 drivers that also describes the memory
internally using DIMMs, due to several factors:
	1) a rank is not a FRU. The FRU is a DIMM;
	2) several memory controllers hide the ranks information;
	3) some memory controllers have the number of ranks as a property
	   for a dimm;
	4) Some memory controllers allow using different dimms on separate
	   channels[1]. So, the memory at slot 0 at channel 0 can be different
	   than the one at channel 1.

[1] probably, there are some limits on it, depending on how the memory
    channels are interlaced, but it seems that the Intel memory controllers
    with 3 or 4 channels allow the usage of different memory sticks on
    each channel or channel pair.

After analyzing all EDAC drivers, the "typical" case is actually a minority,
nowadays.

Also, the upstream version currently has a per-rank memory label, with is
very bad, as two ranks at the same DIMM may receive two different labels.

So, it is actually better to convert the existing drivers to internally
represent the memory DIMMs.


Regards,
Mauro

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCHv7] EDAC core changes in order to properly report errors from all types of memory controllers
  2012-03-07  0:20                         ` [PATCHv7] " Mauro Carvalho Chehab
@ 2012-03-07  8:42                           ` Borislav Petkov
  2012-03-07 11:36                             ` Mauro Carvalho Chehab
  0 siblings, 1 reply; 26+ messages in thread
From: Borislav Petkov @ 2012-03-07  8:42 UTC (permalink / raw)
  To: Mauro Carvalho Chehab; +Cc: EDAC devel, Tony Luck, Ingo Molnar, LKML

On Tue, Mar 06, 2012 at 09:20:27PM -0300, Mauro Carvalho Chehab wrote:
> The series now contains:

The below looks like a good way to split this huge patchset into
smaller, much easier to review ones:

> 
> - 2 fix patches over upstream:
>       edac/ppc4xx_edac: Fix compilation
>       i5400_edac: Avoid calling pci_put_device() twice
> 
> - 1 comments improvements:
>       edac: Improve the comments to better describe the memory concepts
> 
> - 1 internal struct renaming patch:
>       edac: rename channel_info to rank_info
> 
> - 6 patches that prepare the internal structures to represent the memory
>   properties per dimm, instead of per csrow. This is needed for modern
>   controllers, where the memories at different channels may be different:
>       edac: Create a dimm struct and move the labels into it
>       edac: Add per dimm's sysfs nodes
>       edac: move dimm properties to struct memset_info
>       edac: Don't initialize csrow's first_page & friends when not needed
>       edac: move nr_pages to dimm struct
>       edac: Add per-dimm sysfs show nodes
> 
> - 2 patches that add proper support for FB-DIMM and for the modern Intel
>   DDR2/DDR3 memory controllers: 
>       edac: Fix core support for MC's that see DIMMS instead of ranks
>       edac: Export MC hierarchy counters for CE and UE
> 
> - 1 log cleanup patch, that prepares for using a MCA based tracepoint:
>       edac: Cleanup the logs for i7core and sb edac drivers
> 
> - 2 debug improvement patches:
>       edac: Add a sysfs node to test the EDAC error report facility
>       edac: Initialize the dimm label with the known information
> 
> - 5 post-FB-DIMM patches that cleans, fix and/or improve a few random things:
>       edac_mc_sysfs: don't create inactive errcount sysfs nodes
>       i5000_edac: Fix the logic that retrieves memory information
>       edac: add a sysfs node that stores the max possible memory location
>       edac: Call the sysfs nodes as "rank" instead of "dimm" if chip select is used
>       i5400_edac: improve debug messages to better represent the filled memory
> 
> - 1 patch that adds a trace event to report memory errors:
>       events/hw_event: Create a Hardware Events Report Mecanism (HERM)

NACK to that last one.

> While the preliminar tests is working ok on the machines I'm testing,
> as I didn't finish the tests yet, some other fix patches may be needed,
> but I'll insert them at the end of the series, as rebasing a large patchset
> like that is very time-consuming.
> 
> So,  I think it is time to merge it at -next, in order to give more visibility
> to it. So, tomorrow, I'll add it there, if I got no complains.

linux-next is not a testing ground for unfinished testing, unreviewed
patches (I'm sure you already knew that), so before you send your stuff
anywhere, it needs to be reviewed by the interested parties. One of
them is me, I'm sure there are others, so please split them in proper
patchsets, as I've already asked you (the above topical split could
work) and send them to edac-devel and people for review.

Thanks.

-- 
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCHv7] EDAC core changes in order to properly report errors from all types of memory controllers
  2012-03-07  8:42                           ` Borislav Petkov
@ 2012-03-07 11:36                             ` Mauro Carvalho Chehab
  2012-03-07 12:06                               ` Borislav Petkov
  0 siblings, 1 reply; 26+ messages in thread
From: Mauro Carvalho Chehab @ 2012-03-07 11:36 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: EDAC devel, Tony Luck, Ingo Molnar, LKML

Em 07-03-2012 05:42, Borislav Petkov escreveu:
> On Tue, Mar 06, 2012 at 09:20:27PM -0300, Mauro Carvalho Chehab wrote:
>> The series now contains:
> 
> The below looks like a good way to split this huge patchset into
> smaller, much easier to review ones:
> 
>>
>> - 2 fix patches over upstream:
>>       edac/ppc4xx_edac: Fix compilation

This one was reviewed already, at the first time I sent it.
So, I'll skip it on my mailbomb.

>>       i5400_edac: Avoid calling pci_put_device() twice
>>
>> - 1 comments improvements:
>>       edac: Improve the comments to better describe the memory concepts
>>
>> - 1 internal struct renaming patch:
>>       edac: rename channel_info to rank_info
>>
>> - 6 patches that prepare the internal structures to represent the memory
>>   properties per dimm, instead of per csrow. This is needed for modern
>>   controllers, where the memories at different channels may be different:
>>       edac: Create a dimm struct and move the labels into it
>>       edac: Add per dimm's sysfs nodes
>>       edac: move dimm properties to struct memset_info
>>       edac: Don't initialize csrow's first_page & friends when not needed
>>       edac: move nr_pages to dimm struct
>>       edac: Add per-dimm sysfs show nodes
>>
>> - 2 patches that add proper support for FB-DIMM and for the modern Intel
>>   DDR2/DDR3 memory controllers: 
>>       edac: Fix core support for MC's that see DIMMS instead of ranks
>>       edac: Export MC hierarchy counters for CE and UE
>>
>> - 1 log cleanup patch, that prepares for using a MCA based tracepoint:
>>       edac: Cleanup the logs for i7core and sb edac drivers
>>
>> - 2 debug improvement patches:
>>       edac: Add a sysfs node to test the EDAC error report facility
>>       edac: Initialize the dimm label with the known information
>>
>> - 5 post-FB-DIMM patches that cleans, fix and/or improve a few random things:
>>       edac_mc_sysfs: don't create inactive errcount sysfs nodes
>>       i5000_edac: Fix the logic that retrieves memory information
>>       edac: add a sysfs node that stores the max possible memory location
>>       edac: Call the sysfs nodes as "rank" instead of "dimm" if chip select is used
>>       i5400_edac: improve debug messages to better represent the filled memory

Ok, I'll mailbomb them.

>>
>> - 1 patch that adds a trace event to report memory errors:
>>       events/hw_event: Create a Hardware Events Report Mecanism (HERM)
> 
> NACK to that last one.

Hmm... interesting... this one adds a tracepoint for non-MCA based memory errors...
I've understood that you've against only the mca one...

Anyway, we have a dead lock with regards to trace, as I'm nacking your approach,
and you're nacking mine.

I think we should then try to schedule a meeting (either physical or a conference)
in order to addresss it, as it doesn't sound that we'll be able to solve it via
ML.

>> While the preliminar tests is working ok on the machines I'm testing,
>> as I didn't finish the tests yet, some other fix patches may be needed,
>> but I'll insert them at the end of the series, as rebasing a large patchset
>> like that is very time-consuming.
>>
>> So,  I think it is time to merge it at -next, in order to give more visibility
>> to it. So, tomorrow, I'll add it there, if I got no complains.
> 
> linux-next is not a testing ground for unfinished testing, unreviewed
> patches (I'm sure you already knew that), so before you send your stuff
> anywhere, it needs to be reviewed by the interested parties. One of
> them is me, I'm sure there are others, so please split them in proper
> patchsets, as I've already asked you (the above topical split could
> work) and send them to edac-devel and people for review.
> 
> Thanks.
> 

Thanks,
Mauro

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCHv7] EDAC core changes in order to properly report errors from all types of memory controllers
  2012-03-07 11:36                             ` Mauro Carvalho Chehab
@ 2012-03-07 12:06                               ` Borislav Petkov
  2012-03-07 12:13                                 ` Mauro Carvalho Chehab
  0 siblings, 1 reply; 26+ messages in thread
From: Borislav Petkov @ 2012-03-07 12:06 UTC (permalink / raw)
  To: Mauro Carvalho Chehab; +Cc: EDAC devel, Tony Luck, Ingo Molnar, LKML

On Wed, Mar 07, 2012 at 08:36:12AM -0300, Mauro Carvalho Chehab wrote:
> >> - 1 patch that adds a trace event to report memory errors:
> >>       events/hw_event: Create a Hardware Events Report Mecanism (HERM)
> > 
> > NACK to that last one.
> 
> Hmm... interesting... this one adds a tracepoint for non-MCA based memory errors...
> I've understood that you've against only the mca one...

If you mean the patch at

http://git.kernel.org/?p=linux/kernel/git/mchehab/linux-edac.git;a=commitdiff;h=56efd647f5a63497a07caaa22e993307aaf95270

where you've removed the trace.*mce call, then I don't object to it per
se. Send it for proper review too, though, because it contains incorrect
comments like

+ * Those events are generated when hardware detected a corrected or
+ * uncorrected event, and are meant to replace the current API to report
+ * errors defined on both EDAC and MCE subsystems.

which clearly need fixing.

Thanks.

-- 
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCHv7] EDAC core changes in order to properly report errors from all types of memory controllers
  2012-03-07 12:06                               ` Borislav Petkov
@ 2012-03-07 12:13                                 ` Mauro Carvalho Chehab
  0 siblings, 0 replies; 26+ messages in thread
From: Mauro Carvalho Chehab @ 2012-03-07 12:13 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: EDAC devel, Tony Luck, Ingo Molnar, LKML

Em 07-03-2012 09:06, Borislav Petkov escreveu:
> On Wed, Mar 07, 2012 at 08:36:12AM -0300, Mauro Carvalho Chehab wrote:
>>>> - 1 patch that adds a trace event to report memory errors:
>>>>       events/hw_event: Create a Hardware Events Report Mecanism (HERM)
>>>
>>> NACK to that last one.
>>
>> Hmm... interesting... this one adds a tracepoint for non-MCA based memory errors...
>> I've understood that you've against only the mca one...
> 
> If you mean the patch at
> 
> http://git.kernel.org/?p=linux/kernel/git/mchehab/linux-edac.git;a=commitdiff;h=56efd647f5a63497a07caaa22e993307aaf95270

Yes, that one.

> where you've removed the trace.*mce call, then I don't object to it per
> se. Send it for proper review too, though, because it contains incorrect
> comments like
> 
> + * Those events are generated when hardware detected a corrected or
> + * uncorrected event, and are meant to replace the current API to report
> + * errors defined on both EDAC and MCE subsystems.
> 
> which clearly need fixing.

True. Ok, sent it.
> 
> Thanks.
> 

Thanks,
Mauro

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2012-03-07 12:14 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-03-02 14:25 [RFC -v2 PATCH 0/3] RAS: Use MCE tracepoint for decoded MCEs Borislav Petkov
2012-03-02 14:25 ` [PATCH 1/4] mce: Slim up struct mce Borislav Petkov
2012-03-02 17:47   ` Luck, Tony
2012-03-03  7:37     ` Ingo Molnar
2012-03-05  9:17       ` Borislav Petkov
2012-03-02 14:25 ` [PATCH 2/4] mce: Add a msg string to the MCE tracepoint Borislav Petkov
2012-03-02 14:25 ` [PATCH 3/4] x86, RAS: Add a decoded msg buffer Borislav Petkov
2012-03-02 14:25 ` [PATCH 4/4] EDAC: Convert AMD EDAC pieces to use RAS printk buffer Borislav Petkov
2012-03-02 14:52   ` Mauro Carvalho Chehab
2012-03-05 11:04     ` Borislav Petkov
2012-03-05 11:43       ` Mauro Carvalho Chehab
2012-03-05 12:44         ` Borislav Petkov
2012-03-05 13:35           ` Mauro Carvalho Chehab
2012-03-05 14:13             ` Borislav Petkov
2012-03-05 14:58               ` Mauro Carvalho Chehab
2012-03-05 22:00                 ` [PATCHv5] EDAC core changes in order to properly report errors from all types of memory controllers Mauro Carvalho Chehab
2012-03-05 23:23                   ` Borislav Petkov
2012-03-06 11:31                     ` Mauro Carvalho Chehab
2012-03-06 12:16                       ` Borislav Petkov
2012-03-07  0:20                         ` [PATCHv7] " Mauro Carvalho Chehab
2012-03-07  8:42                           ` Borislav Petkov
2012-03-07 11:36                             ` Mauro Carvalho Chehab
2012-03-07 12:06                               ` Borislav Petkov
2012-03-07 12:13                                 ` Mauro Carvalho Chehab
2012-03-02 14:41 ` [RFC -v2 PATCH 0/3] RAS: Use MCE tracepoint for decoded MCEs Mauro Carvalho Chehab
2012-03-02 14:48   ` Borislav Petkov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).