linux-kernel.vger.kernel.org archive mirror
* x86 MCE improvements series for 2.6.31 v2
@ 2009-05-26 23:54 Andi Kleen
  2009-05-26 23:54 ` [PATCH 01/31] x86: MCE: Synchronize core after machine check handling Andi Kleen
  0 siblings, 1 reply; 49+ messages in thread
From: Andi Kleen @ 2009-05-26 23:54 UTC (permalink / raw)
  To: linux-kernel, hpa, x86

Various new Intel x86 machine check features, including initial support
for machine check recovery and new status bits. The actual machine check
recovery requires some generic VM changes, which will be posted in a
separate series ("POISON series"); however, this series is independent
of those changes and compiles without them.

This is all ready for merging into 2.6.31.

v2: Various bug fixes and a few minor improvements
Now based on the mce3-32bitmerge branch
(git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6.git mce3-32bitmerge)
which is based on the mce3 branch
(git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6.git mce3)
which is an updated version of the x86/mce2 branch in tip

-Andi




* [PATCH 01/31] x86: MCE: Synchronize core after machine check handling
  2009-05-26 23:54 x86 MCE improvements series for 2.6.31 v2 Andi Kleen
@ 2009-05-26 23:54 ` Andi Kleen
  2009-05-26 23:54   ` [PATCH 02/31] x86: MCE: Improve mce_get_rip v3 Andi Kleen
  0 siblings, 1 reply; 49+ messages in thread
From: Andi Kleen @ 2009-05-26 23:54 UTC (permalink / raw)
  To: linux-kernel, hpa, x86; +Cc: Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

Impact: Spec compliance

The example code in the IA32 SDM recommends synchronizing the CPU
after machine check handling, so do that here.
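
For illustration, a minimal standalone sketch of what such a
synchronization amounts to (assumption: the kernel's sync_core() of
this era is essentially a CPUID-based serializing helper; the exact
in-kernel implementation may differ):

/*
 * Hedged sketch, not the kernel's code: CPUID is an architecturally
 * serializing instruction, so executing it forces all previous
 * instructions to retire and discards speculatively fetched ones
 * before the handler returns.
 */
static inline void sync_core_sketch(void)
{
	unsigned int eax = 1, ebx, ecx, edx;

	asm volatile("cpuid"
		     : "=a" (eax), "=b" (ebx), "=c" (ecx), "=d" (edx)
		     : "0" (eax)
		     : "memory");
}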

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/kernel/cpu/mcheck/mce.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 87f8198..249e3cf 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -328,6 +328,8 @@ void machine_check_poll(enum mcp_flags flags, mce_banks_t *b)
 	 * Don't clear MCG_STATUS here because it's only defined for
 	 * exceptions.
 	 */
+
+	sync_core();
 }
 EXPORT_SYMBOL_GPL(machine_check_poll);
 
@@ -501,6 +503,7 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 	mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
  out2:
 	atomic_dec(&mce_entry);
+	sync_core();
 }
 EXPORT_SYMBOL_GPL(do_machine_check);
 
-- 
1.6.0.2



* [PATCH 02/31] x86: MCE: Improve mce_get_rip v3
  2009-05-26 23:54 ` [PATCH 01/31] x86: MCE: Synchronize core after machine check handling Andi Kleen
@ 2009-05-26 23:54   ` Andi Kleen
  2009-05-26 23:54     ` [PATCH 03/31] x86: MCE: Fix EIPV behaviour with !PCC Andi Kleen
                       ` (2 more replies)
  0 siblings, 3 replies; 49+ messages in thread
From: Andi Kleen @ 2009-05-26 23:54 UTC (permalink / raw)
  To: linux-kernel, hpa, x86; +Cc: Huang Ying, Andi Kleen, Hidetoshi Seto

From: Huang Ying <ying.huang@intel.com>

Assume the RIP is valid when either EIPV or RIPV is set. This influences
whether the machine check exception handler decides to return or panic.

This fixes a test case in the mce-test suite and is more compliant
with the specification.

This currently only makes a difference in an artificial testing
scenario with the mce-test suite.

In addition, do not force EIPV to be set when the RIP is read from the
dedicated RIP register MSR.
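
(For reference, a hedged illustration of the MCG_STATUS semantics this
relies on; the bit positions are the architectural ones, the helper
name is made up:)

/*
 * RIPV - the RIP pushed on the exception stack is valid to restart from
 * EIPV - the RIP pushed on the exception stack is directly associated
 *        with the error
 * After this patch the handler treats the stack RIP as usable when
 * either bit is set.
 */
#define MCG_STATUS_RIPV (1ULL << 0)
#define MCG_STATUS_EIPV (1ULL << 1)

static int stack_rip_usable(unsigned long long mcgstatus)
{
	return (mcgstatus & (MCG_STATUS_RIPV | MCG_STATUS_EIPV)) != 0;
}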

[AK: combination of patches from Huang Ying and Hidetoshi Seto, with
new description by me]
Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
---
 arch/x86/kernel/cpu/mcheck/mce.c |   13 +++++++------
 1 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 249e3cf..3f158d7 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -247,21 +247,22 @@ int mce_available(struct cpuinfo_x86 *c)
 	return cpu_has(c, X86_FEATURE_MCE) && cpu_has(c, X86_FEATURE_MCA);
 }
 
+/*
+ * Get the address of the instruction at the time of the machine check
+ * error.
+ */
 static inline void mce_get_rip(struct mce *m, struct pt_regs *regs)
 {
-	if (regs && (m->mcgstatus & MCG_STATUS_RIPV)) {
+
+	if (regs && (m->mcgstatus & (MCG_STATUS_RIPV|MCG_STATUS_EIPV))) {
 		m->ip = regs->ip;
 		m->cs = regs->cs;
 	} else {
 		m->ip = 0;
 		m->cs = 0;
 	}
-	if (rip_msr) {
-		/* Assume the RIP in the MSR is exact. Is this true? */
-		m->mcgstatus |= MCG_STATUS_EIPV;
+	if (rip_msr)
 		m->ip = mce_rdmsrl(rip_msr);
-		m->cs = 0;
-	}
 }
 
 /*
-- 
1.6.0.2



* [PATCH 03/31] x86: MCE: Fix EIPV behaviour with !PCC
  2009-05-26 23:54   ` [PATCH 02/31] x86: MCE: Improve mce_get_rip v3 Andi Kleen
@ 2009-05-26 23:54     ` Andi Kleen
  2009-05-26 23:54       ` [PATCH 04/31] x86: MCE: Use extended sysattrs for the check_interval attribute Andi Kleen
  2009-05-27  4:30       ` [PATCH 03/31] x86: MCE: Fix EIPV behaviour with !PCC Hidetoshi Seto
  2009-05-27  4:29     ` [PATCH 02/31] x86: MCE: Improve mce_get_rip v3 Hidetoshi Seto
  2009-05-27  4:29     ` [PATCH] x86: MCE: Fix for getting IP/CS at MCE Hidetoshi Seto
  2 siblings, 2 replies; 49+ messages in thread
From: Andi Kleen @ 2009-05-26 23:54 UTC (permalink / raw)
  To: linux-kernel, hpa, x86; +Cc: Huang Ying, Andi Kleen

From: Huang Ying <ying.huang@intel.com>

Impact: Spec compliance

When tolerant == 0, the system should panic instead of sending SIGBUS,
even for an MCE with EIPV && !PCC hit in user space.

Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/kernel/cpu/mcheck/mce.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 3f158d7..e1271ac 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -485,7 +485,7 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 		 * force_sig() takes an awful lot of locks and has a slight
 		 * risk of deadlocking.
 		 */
-		if (user_space) {
+		if (user_space && tolerant > 0) {
 			force_sig(SIGBUS, current);
 		} else if (panic_on_oops || tolerant < 2) {
 			mce_panic("Uncorrected machine check",
-- 
1.6.0.2



* [PATCH 04/31] x86: MCE: Use extended sysattrs for the check_interval attribute.
  2009-05-26 23:54     ` [PATCH 03/31] x86: MCE: Fix EIPV behaviour with !PCC Andi Kleen
@ 2009-05-26 23:54       ` Andi Kleen
  2009-05-26 23:54         ` [PATCH 05/31] x86: MCE: Add machine check exception count in /proc/interrupts Andi Kleen
  2009-05-27  4:30       ` [PATCH 03/31] x86: MCE: Fix EIPV behaviour with !PCC Hidetoshi Seto
  1 sibling, 1 reply; 49+ messages in thread
From: Andi Kleen @ 2009-05-26 23:54 UTC (permalink / raw)
  To: linux-kernel, hpa, x86; +Cc: Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

Impact: Cleanup

Instead of using our own callbacks, use the generic ones provided by
the sysdev layer.

This finally allows us to get rid of the ugly ACCESSOR macros and
should also save some text size.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/kernel/cpu/mcheck/mce.c |   39 ++++++++++++++-----------------------
 1 files changed, 15 insertions(+), 24 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index e1271ac..8a8af56 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -1055,28 +1055,6 @@ DEFINE_PER_CPU(struct sys_device, device_mce);
 __cpuinitdata
 void (*threshold_cpu_callback)(unsigned long action, unsigned int cpu);
 
-/* Why are there no generic functions for this? */
-#define ACCESSOR(name, var, start) \
-	static ssize_t show_ ## name(struct sys_device *s,		\
-				     struct sysdev_attribute *attr,	\
-				     char *buf) {			\
-		return sprintf(buf, "%Lx\n", (u64)var);			\
-	}								\
-	static ssize_t set_ ## name(struct sys_device *s,		\
-				    struct sysdev_attribute *attr,	\
-				    const char *buf, size_t siz) {	\
-		char *end;						\
-		u64 new = simple_strtoull(buf, &end, 0);		\
-									\
-		if (end == buf)						\
-			return -EINVAL;					\
-		var = new;						\
-		start;							\
-									\
-		return end-buf;						\
-	}								\
-	static SYSDEV_ATTR(name, 0644, show_ ## name, set_ ## name);
-
 static struct sysdev_attribute *bank_attrs;
 
 static ssize_t show_bank(struct sys_device *s, struct sysdev_attribute *attr,
@@ -1127,13 +1105,26 @@ static ssize_t set_trigger(struct sys_device *s, struct sysdev_attribute *attr,
 	return len;
 }
 
+static ssize_t store_int_with_restart(struct sys_device *s,
+				      struct sysdev_attribute *attr,
+				      const char *buf, size_t size)
+{
+	ssize_t ret = sysdev_store_int(s, attr, buf, size);
+	mce_restart();
+	return ret;
+}
+
 static SYSDEV_ATTR(trigger, 0644, show_trigger, set_trigger);
 static SYSDEV_INT_ATTR(tolerant, 0644, tolerant);
 
-ACCESSOR(check_interval, check_interval, mce_restart())
+static struct sysdev_ext_attribute attr_check_interval = {
+	_SYSDEV_ATTR(check_interval, 0644, sysdev_show_int,
+		     store_int_with_restart),
+	&check_interval
+};
 
 static struct sysdev_attribute *mce_attr[] = {
-	&attr_tolerant.attr, &attr_check_interval, &attr_trigger,
+	&attr_tolerant.attr, &attr_check_interval.attr, &attr_trigger,
 	NULL
 };
 
-- 
1.6.0.2



* [PATCH 05/31] x86: MCE: Add machine check exception count in /proc/interrupts
  2009-05-26 23:54       ` [PATCH 04/31] x86: MCE: Use extended sysattrs for the check_interval attribute Andi Kleen
@ 2009-05-26 23:54         ` Andi Kleen
  2009-05-26 23:54           ` [PATCH 06/31] x86: Fix panic with interrupts off (needed for MCE) Andi Kleen
  0 siblings, 1 reply; 49+ messages in thread
From: Andi Kleen @ 2009-05-26 23:54 UTC (permalink / raw)
  To: linux-kernel, hpa, x86; +Cc: Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

Impact: feature, debugging tool

Useful for debugging, but it's also good general policy
to have a counter for all special interrupts there. This makes it easier
to diagnose where a CPU is spending its time.
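
As a quick way to look at the new counters from userspace, a small
sketch (it simply scans for the "Machine check exceptions" line that
this patch adds to /proc/interrupts):

#include <stdio.h>
#include <string.h>

int main(void)
{
	char line[4096];
	FILE *f = fopen("/proc/interrupts", "r");

	if (!f)
		return 1;
	/* Print the per-CPU machine check exception counts */
	while (fgets(line, sizeof(line), f))
		if (strstr(line, "Machine check exceptions"))
			fputs(line, stdout);
	fclose(f);
	return 0;
}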

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/include/asm/mce.h       |    3 +++
 arch/x86/kernel/cpu/mcheck/mce.c |    4 ++++
 arch/x86/kernel/irq.c            |   10 ++++++++++
 3 files changed, 17 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index 8ad67bf..1ff437f 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -89,6 +89,7 @@ struct mce_log {
 extern int mce_disabled;
 
 #include <asm/atomic.h>
+#include <linux/percpu.h>
 
 void mce_setup(struct mce *m);
 void mce_log(struct mce *m);
@@ -123,6 +124,8 @@ static inline void mce_amd_feature_init(struct cpuinfo_x86 *c) { }
 
 extern int mce_available(struct cpuinfo_x86 *c);
 
+DECLARE_PER_CPU(unsigned, mce_exception_count);
+
 void mce_log_therm_throt_event(__u64 status);
 
 extern atomic_t mce_entry;
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 8a8af56..d2d7a68 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -58,6 +58,8 @@ int				mce_disabled;
 
 atomic_t mce_entry;
 
+DEFINE_PER_CPU(unsigned, mce_exception_count);
+
 /*
  * Tolerant levels:
  *   0: always panic on uncorrected errors, log corrected errors
@@ -362,6 +364,8 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 
 	atomic_inc(&mce_entry);
 
+	__get_cpu_var(mce_exception_count)++;
+
 	if (notify_die(DIE_NMI, "machine check", regs, error_code,
 			   18, SIGKILL) == NOTIFY_STOP)
 		goto out2;
diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index 35eddc9..14047b9 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -12,6 +12,7 @@
 #include <asm/io_apic.h>
 #include <asm/irq.h>
 #include <asm/idle.h>
+#include <asm/mce.h>
 
 atomic_t irq_err_count;
 
@@ -96,6 +97,12 @@ static int show_other_interrupts(struct seq_file *p, int prec)
 	seq_printf(p, "  Threshold APIC interrupts\n");
 # endif
 #endif
+#if defined(CONFIG_X86_MCE) && defined(CONFIG_X86_64)
+	seq_printf(p, "%*s: ", prec, "MCE");
+	for_each_online_cpu(j)
+		seq_printf(p, "%10u ", per_cpu(mce_exception_count, j));
+	seq_printf(p, "  Machine check exceptions\n");
+#endif
 	seq_printf(p, "%*s: %10u\n", prec, "ERR", atomic_read(&irq_err_count));
 #if defined(CONFIG_X86_IO_APIC)
 	seq_printf(p, "%*s: %10u\n", prec, "MIS", atomic_read(&irq_mis_count));
@@ -163,6 +170,9 @@ u64 arch_irq_stat_cpu(unsigned int cpu)
 {
 	u64 sum = irq_stats(cpu)->__nmi_count;
 
+#if defined(CONFIG_X86_MCE) && defined(CONFIG_X86_64)
+	sum += per_cpu(mce_exception_count, cpu);
+#endif
 #ifdef CONFIG_X86_LOCAL_APIC
 	sum += irq_stats(cpu)->apic_timer_irqs;
 	sum += irq_stats(cpu)->irq_spurious_count;
-- 
1.6.0.2



* [PATCH 06/31] x86: Fix panic with interrupts off (needed for MCE)
  2009-05-26 23:54         ` [PATCH 05/31] x86: MCE: Add machine check exception count in /proc/interrupts Andi Kleen
@ 2009-05-26 23:54           ` Andi Kleen
  2009-05-26 23:54             ` [PATCH 07/31] x86: MCE: Log corrected errors when panicing Andi Kleen
  2009-05-27  4:30             ` [PATCH 06/31] x86: Fix panic with interrupts off (needed for MCE) Hidetoshi Seto
  0 siblings, 2 replies; 49+ messages in thread
From: Andi Kleen @ 2009-05-26 23:54 UTC (permalink / raw)
  To: linux-kernel, hpa, x86; +Cc: Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

Impact: bug fix, fixes an old regression

For some time each panic() called with interrupts disabled triggered the
!irqs_disabled() WARN_ON in smp_call_function(), producing ugly
backtraces and confusing users.

This is a common situation with machine checks, for example, which tend
to call panic with interrupts disabled, but it will also hit in other
situations, e.g. a panic during early boot. In effect it means that
panic cannot be safely called in many circumstances, which is bad.

This all started with the new fancy queued smp_call_function,
which is then used by the shutdown path to shut down the other CPUs.

On closer examination it turned out that the fancy RCU
smp_call_function() does lots of things not suitable in a panic
situation anyway, like allocating memory and relying on complex system
state.

I originally tried to patch this over by checking for panic
there, but it was quite complicated and the original patch
was also not very popular. This also didn't fix some
of the underlying complexity problems.

The new code in post 2.6.29 tries to patch around this by
checking for oops_in_progress, but that is not enough to make
this fully safe and I don't think that's a real solution
because panic has to be reliable.

So instead use a dedicated vector to reboot. This makes the reboot code
extremely straightforward, which is definitely a big plus in a panic
situation where it is important to avoid relying on too much kernel
state. The new code is also safe to call from an interrupts-off region
because it is very simple.

There are situations where it is important that panic is reliable.
For example on a fatal machine check the panic is needed to get the
system up and running again as quickly as possible, so panic must be
reliable and all the functions it calls must be simple.

This is why I came up with this simple vector scheme.
It's very hard to beat in simplicity.  Vectors are not
particularly precious anymore since all big systems
are using per CPU vectors.

Another possibility would have been to use an NMI similar to kdump,
but there is still the problem that NMIs don't work reliably on some
systems due to BIOS issues. NMIs would have been able to stop CPUs
running with interrupts off too. For the sake of universal reliability
I opted for a non-NMI vector for now.

I put the reboot vector into the highest priority bucket of the APIC
vectors and moved the 64-bit UV_BAU message down into the next lower
priority bucket instead.
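
(Background, not part of the patch: the local APIC derives an
interrupt's priority class from the upper four bits of the vector
number, which is why 0xf8 lands in the highest usable bucket; a
trivial illustration:)

#include <stdio.h>

int main(void)
{
	/* APIC priority class = vector >> 4; a higher class wins */
	unsigned int reboot_vector = 0xf8, uv_bau_message = 0xec;

	printf("REBOOT_VECTOR  0x%02x -> priority class 0x%x\n",
	       reboot_vector, reboot_vector >> 4);
	printf("UV_BAU_MESSAGE 0x%02x -> priority class 0x%x\n",
	       uv_bau_message, uv_bau_message >> 4);
	return 0;
}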

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/include/asm/entry_arch.h  |    1 +
 arch/x86/include/asm/hw_irq.h      |    1 +
 arch/x86/include/asm/irq_vectors.h |    8 +++-----
 arch/x86/kernel/entry_64.S         |    2 ++
 arch/x86/kernel/irqinit_32.c       |    3 +++
 arch/x86/kernel/irqinit_64.c       |    3 +++
 arch/x86/kernel/smp.c              |   28 +++++++++++++++++++++++++++-
 7 files changed, 40 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/entry_arch.h b/arch/x86/include/asm/entry_arch.h
index b2eb9c0..b884dcd 100644
--- a/arch/x86/include/asm/entry_arch.h
+++ b/arch/x86/include/asm/entry_arch.h
@@ -14,6 +14,7 @@ BUILD_INTERRUPT(reschedule_interrupt,RESCHEDULE_VECTOR)
 BUILD_INTERRUPT(call_function_interrupt,CALL_FUNCTION_VECTOR)
 BUILD_INTERRUPT(call_function_single_interrupt,CALL_FUNCTION_SINGLE_VECTOR)
 BUILD_INTERRUPT(irq_move_cleanup_interrupt,IRQ_MOVE_CLEANUP_VECTOR)
+BUILD_INTERRUPT(reboot_interrupt,REBOOT_VECTOR)
 
 BUILD_INTERRUPT3(invalidate_interrupt0,INVALIDATE_TLB_VECTOR_START+0,
 		 smp_invalidate_interrupt)
diff --git a/arch/x86/include/asm/hw_irq.h b/arch/x86/include/asm/hw_irq.h
index b762ea4..5498db6 100644
--- a/arch/x86/include/asm/hw_irq.h
+++ b/arch/x86/include/asm/hw_irq.h
@@ -44,6 +44,7 @@ extern void invalidate_interrupt6(void);
 extern void invalidate_interrupt7(void);
 
 extern void irq_move_cleanup_interrupt(void);
+extern void reboot_interrupt(void);
 extern void threshold_interrupt(void);
 
 extern void call_function_interrupt(void);
diff --git a/arch/x86/include/asm/irq_vectors.h b/arch/x86/include/asm/irq_vectors.h
index 451e24d..b83991d 100644
--- a/arch/x86/include/asm/irq_vectors.h
+++ b/arch/x86/include/asm/irq_vectors.h
@@ -89,11 +89,7 @@
 
 #define THRESHOLD_APIC_VECTOR		0xf9
 
-#ifdef CONFIG_X86_32
-/* 0xf9 : free */
-#else
-# define UV_BAU_MESSAGE			0xf8
-#endif
+#define REBOOT_VECTOR			0xf8
 
 /* f0-f7 used for spreading out TLB flushes: */
 #define INVALIDATE_TLB_VECTOR_END	0xf7
@@ -117,6 +113,8 @@
  */
 #define GENERIC_INTERRUPT_VECTOR	0xed
 
+#define UV_BAU_MESSAGE			0xec
+
 /*
  * First APIC vector available to drivers: (vectors 0x30-0xee) we
  * start at 0x31(0x41) to spread out vectors evenly between priority
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index a31a7f2..bb835c9 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -976,6 +976,8 @@ END(\sym)
 #ifdef CONFIG_SMP
 apicinterrupt IRQ_MOVE_CLEANUP_VECTOR \
 	irq_move_cleanup_interrupt smp_irq_move_cleanup_interrupt
+apicinterrupt REBOOT_VECTOR \
+	reboot_interrupt smp_reboot_interrupt
 #endif
 
 #ifdef CONFIG_X86_UV
diff --git a/arch/x86/kernel/irqinit_32.c b/arch/x86/kernel/irqinit_32.c
index 2512ad9..d57f91a 100644
--- a/arch/x86/kernel/irqinit_32.c
+++ b/arch/x86/kernel/irqinit_32.c
@@ -167,6 +167,9 @@ void __init native_init_IRQ(void)
 	/* Low priority IPI to cleanup after moving an irq */
 	set_intr_gate(IRQ_MOVE_CLEANUP_VECTOR, irq_move_cleanup_interrupt);
 	set_bit(IRQ_MOVE_CLEANUP_VECTOR, used_vectors);
+
+	/* IPI used for rebooting/stopping */
+	alloc_intr_gate(REBOOT_VECTOR, reboot_interrupt);
 #endif
 
 #ifdef CONFIG_X86_LOCAL_APIC
diff --git a/arch/x86/kernel/irqinit_64.c b/arch/x86/kernel/irqinit_64.c
index 8cd1053..f965749 100644
--- a/arch/x86/kernel/irqinit_64.c
+++ b/arch/x86/kernel/irqinit_64.c
@@ -133,6 +133,9 @@ static void __init smp_intr_init(void)
 	/* Low priority IPI to cleanup after moving an irq */
 	set_intr_gate(IRQ_MOVE_CLEANUP_VECTOR, irq_move_cleanup_interrupt);
 	set_bit(IRQ_MOVE_CLEANUP_VECTOR, used_vectors);
+
+	/* IPI for rebooting/panicing */
+	alloc_intr_gate(REBOOT_VECTOR, reboot_interrupt);
 #endif
 }
 
diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index 13f33ea..8835783 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -150,14 +150,40 @@ void native_send_call_func_ipi(const struct cpumask *mask)
  * this function calls the 'stop' function on all other CPUs in the system.
  */
 
+asmlinkage void smp_reboot_interrupt(void)
+{
+	ack_APIC_irq();
+	irq_enter();
+	stop_this_cpu(NULL);
+	irq_exit();
+}
+
 static void native_smp_send_stop(void)
 {
 	unsigned long flags;
+	unsigned long wait;
 
 	if (reboot_force)
 		return;
 
-	smp_call_function(stop_this_cpu, NULL, 0);
+	/*
+	 * Use an own vector here because smp_call_function
+	 * does lots of things not suitable in a panic situation.
+	 * On most systems we could also use an NMI here,
+	 * but there are a few systems around where NMI
+	 * is problematic so stay with an non NMI for now
+	 * (this implies we cannot stop CPUs spinning with irq off
+	 * currently)
+	 */
+	if (num_online_cpus() > 1) {
+		apic->send_IPI_allbutself(REBOOT_VECTOR);
+
+		/* Don't wait longer than a second */
+		wait = USEC_PER_SEC;
+		while (num_online_cpus() > 1 && wait--)
+			udelay(1);
+	}
+
 	local_irq_save(flags);
 	disable_local_APIC();
 	local_irq_restore(flags);
-- 
1.6.0.2



* [PATCH 07/31] x86: MCE: Log corrected errors when panicing
  2009-05-26 23:54           ` [PATCH 06/31] x86: Fix panic with interrupts off (needed for MCE) Andi Kleen
@ 2009-05-26 23:54             ` Andi Kleen
  2009-05-26 23:54               ` [PATCH 08/31] x86: MCE: Remove unused mce_events variable Andi Kleen
  2009-05-27  4:30             ` [PATCH 06/31] x86: Fix panic with interrupts off (needed for MCE) Hidetoshi Seto
  1 sibling, 1 reply; 49+ messages in thread
From: Andi Kleen @ 2009-05-26 23:54 UTC (permalink / raw)
  To: linux-kernel, hpa, x86; +Cc: Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

Impact: bugfix

Normally the machine check handler ignores corrected errors and leaves
them to machine_check_poll(). But when panicking machine_check_poll()
won't run, so log all errors.

Note: this can still miss some cases until the later "early no way out"
patch is applied too.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/kernel/cpu/mcheck/mce.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index d2d7a68..d379e3f 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -398,9 +398,9 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 
 		/*
 		 * Non uncorrected errors are handled by machine_check_poll
-		 * Leave them alone.
+		 * Leave them alone, unless this panics.
 		 */
-		if ((m.status & MCI_STATUS_UC) == 0)
+		if ((m.status & MCI_STATUS_UC) == 0 && !no_way_out)
 			continue;
 
 		/*
-- 
1.6.0.2



* [PATCH 08/31] x86: MCE: Remove unused mce_events variable
  2009-05-26 23:54             ` [PATCH 07/31] x86: MCE: Log corrected errors when panicing Andi Kleen
@ 2009-05-26 23:54               ` Andi Kleen
  2009-05-26 23:54                 ` [PATCH 09/31] x86: MCE: Remove mce_init unused argument Andi Kleen
  0 siblings, 1 reply; 49+ messages in thread
From: Andi Kleen @ 2009-05-26 23:54 UTC (permalink / raw)
  To: linux-kernel, hpa, x86; +Cc: Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

Impact: Cleanup

Remove unused mce_events static variable.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/kernel/cpu/mcheck/mce.c |    2 --
 1 files changed, 0 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index d379e3f..925b222 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -73,7 +73,6 @@ static u64			*bank;
 static unsigned long		notify_user;
 static int			rip_msr;
 static int			mce_bootlog = -1;
-static atomic_t			mce_events;
 
 static char			trigger[128];
 static char			*trigger_argv[2] = { trigger, NULL };
@@ -118,7 +117,6 @@ void mce_log(struct mce *mce)
 {
 	unsigned next, entry;
 
-	atomic_inc(&mce_events);
 	mce->finished = 0;
 	wmb();
 	for (;;) {
-- 
1.6.0.2



* [PATCH 09/31] x86: MCE: Remove mce_init unused argument
  2009-05-26 23:54               ` [PATCH 08/31] x86: MCE: Remove unused mce_events variable Andi Kleen
@ 2009-05-26 23:54                 ` Andi Kleen
  2009-05-26 23:54                   ` [PATCH 10/31] x86: MCE: Rename and align out2 label Andi Kleen
  0 siblings, 1 reply; 49+ messages in thread
From: Andi Kleen @ 2009-05-26 23:54 UTC (permalink / raw)
  To: linux-kernel, hpa, x86; +Cc: Thomas Gleixner, Andi Kleen

From: Thomas Gleixner <tglx@linutronix.de>

Impact: Cleanup

Remove unused mce_init argument.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/kernel/cpu/mcheck/mce.c |    8 ++++----
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 925b222..2e3ceed 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -649,7 +649,7 @@ static int mce_cap_init(void)
 	return 0;
 }
 
-static void mce_init(void *dummy)
+static void mce_init(void)
 {
 	mce_banks_t all_banks;
 	u64 cap;
@@ -782,7 +782,7 @@ void __cpuinit mcheck_init(struct cpuinfo_x86 *c)
 
 	machine_check_vector = do_machine_check;
 
-	mce_init(NULL);
+	mce_init();
 	mce_cpu_features(c);
 	mce_init_timer();
 }
@@ -1025,7 +1025,7 @@ static int mce_shutdown(struct sys_device *dev)
  */
 static int mce_resume(struct sys_device *dev)
 {
-	mce_init(NULL);
+	mce_init();
 	mce_cpu_features(&current_cpu_data);
 
 	return 0;
@@ -1035,7 +1035,7 @@ static void mce_cpu_restart(void *data)
 {
 	del_timer_sync(&__get_cpu_var(mce_timer));
 	if (mce_available(&current_cpu_data))
-		mce_init(NULL);
+		mce_init();
 	mce_init_timer();
 }
 
-- 
1.6.0.2



* [PATCH 10/31] x86: MCE: Rename and align out2 label
  2009-05-26 23:54                 ` [PATCH 09/31] x86: MCE: Remove mce_init unused argument Andi Kleen
@ 2009-05-26 23:54                   ` Andi Kleen
  2009-05-26 23:54                     ` [PATCH 11/31] x86: MCE: Implement bootstrapping for machine check wakeups Andi Kleen
  0 siblings, 1 reply; 49+ messages in thread
From: Andi Kleen @ 2009-05-26 23:54 UTC (permalink / raw)
  To: linux-kernel, hpa, x86; +Cc: Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

Impact: Minor cleanup, no functional changes

There's only a single out path in do_machine_check now, so rename the label from
out2 to out.  Also align it at the first column.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/kernel/cpu/mcheck/mce.c |    6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 2e3ceed..1ce7b55 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -366,9 +366,9 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 
 	if (notify_die(DIE_NMI, "machine check", regs, error_code,
 			   18, SIGKILL) == NOTIFY_STOP)
-		goto out2;
+		goto out;
 	if (!banks)
-		goto out2;
+		goto out;
 
 	mce_setup(&m);
 
@@ -504,7 +504,7 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 			mce_wrmsrl(MSR_IA32_MC0_STATUS+4*i, 0);
 	}
 	mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
- out2:
+out:
 	atomic_dec(&mce_entry);
 	sync_core();
 }
-- 
1.6.0.2



* [PATCH 11/31] x86: MCE: Implement bootstrapping for machine check wakeups
  2009-05-26 23:54                   ` [PATCH 10/31] x86: MCE: Rename and align out2 label Andi Kleen
@ 2009-05-26 23:54                     ` Andi Kleen
  2009-05-26 23:54                       ` [PATCH 12/31] x86: MCE: Remove TSC print heuristic Andi Kleen
  0 siblings, 1 reply; 49+ messages in thread
From: Andi Kleen @ 2009-05-26 23:54 UTC (permalink / raw)
  To: linux-kernel, hpa, x86; +Cc: Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

Machine checks support waking up the mcelog daemon quickly.

The original wake up code for this was pretty ugly, relying on
an idle notifier and a special process flag. The reason it did it this
way is that the machine check handler is not subject to normal
interrupt locking rules, so it's not safe to call wake_up().  Instead
it set a process flag and then did the wakeup either on syscall return
or in the idle notifier.

This patch adds a new "bootstrapping" method as a replacement.

The idea is that the handler checks if it's in a state where it is
unsafe to call wake_up(). If it's safe it calls it directly. When it's
not safe -- that is, it interrupted a critical section with interrupts
disabled -- it uses a new "self IPI" to send an interrupt to its own
CPU. This can be done safely because triggering an IPI is atomic with
some care. The self IPI is delivered once interrupts are reenabled and
its handler can then safely call wake_up().

When APICs are disabled the event is just queued and will be picked up
eventually by the next polling timer. I think that's a reasonable
compromise, since it should only happen quite rarely.
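
On the consumer side this wakeup is what lets a daemon block in poll()
on /dev/mcelog and react promptly; a hedged userspace sketch (the
actual reading and decoding of the records is omitted):

#include <fcntl.h>
#include <poll.h>
#include <stdio.h>

int main(void)
{
	struct pollfd pfd = {
		.fd	= open("/dev/mcelog", O_RDONLY),
		.events	= POLLIN,
	};

	if (pfd.fd < 0)
		return 1;
	/* Blocks until the wake_up() (direct or via the self IPI) fires */
	if (poll(&pfd, 1, -1) > 0 && (pfd.revents & POLLIN))
		printf("machine check event(s) pending\n");
	return 0;
}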

Contains fixes from Ying Huang

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/include/asm/hw_irq.h      |    1 +
 arch/x86/include/asm/irq_vectors.h |    5 +++
 arch/x86/kernel/cpu/mcheck/mce.c   |   52 ++++++++++++++++++++++++++++++++++++
 arch/x86/kernel/entry_64.S         |    5 +++
 arch/x86/kernel/irqinit_64.c       |    4 +++
 5 files changed, 67 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/hw_irq.h b/arch/x86/include/asm/hw_irq.h
index 5498db6..f07814a 100644
--- a/arch/x86/include/asm/hw_irq.h
+++ b/arch/x86/include/asm/hw_irq.h
@@ -32,6 +32,7 @@ extern void error_interrupt(void);
 extern void spurious_interrupt(void);
 extern void thermal_interrupt(void);
 extern void reschedule_interrupt(void);
+extern void mce_self_interrupt(void);
 
 extern void invalidate_interrupt(void);
 extern void invalidate_interrupt0(void);
diff --git a/arch/x86/include/asm/irq_vectors.h b/arch/x86/include/asm/irq_vectors.h
index b83991d..61f1592 100644
--- a/arch/x86/include/asm/irq_vectors.h
+++ b/arch/x86/include/asm/irq_vectors.h
@@ -116,6 +116,11 @@
 #define UV_BAU_MESSAGE			0xec
 
 /*
+ * Self IPI vector for machine checks
+ */
+#define MCE_SELF_VECTOR			0xeb
+
+/*
  * First APIC vector available to drivers: (vectors 0x30-0xee) we
  * start at 0x31(0x41) to spread out vectors evenly between priority
  * levels. (0x80 is the syscall vector)
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 1ce7b55..6219a18 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -7,6 +7,7 @@
  * Author: Andi Kleen
  */
 
+#include <linux/interrupt.h>
 #include <linux/thread_info.h>
 #include <linux/capability.h>
 #include <linux/miscdevice.h>
@@ -36,6 +37,9 @@
 #include <asm/mce.h>
 #include <asm/msr.h>
 #include <asm/smp.h>
+#include <asm/ipi.h>
+#include <asm/hw_irq.h>
+#include <asm/apic.h>
 
 #include "mce.h"
 
@@ -266,6 +270,52 @@ static inline void mce_get_rip(struct mce *m, struct pt_regs *regs)
 }
 
 /*
+ * Called after interrupts have been reenabled again
+ * when a MCE happened during an interrupts off region
+ * in the kernel.
+ */
+asmlinkage void smp_mce_self_interrupt(struct pt_regs *regs)
+{
+	ack_APIC_irq();
+	exit_idle();
+	irq_enter();
+	mce_notify_user();
+	irq_exit();
+}
+
+static void mce_report_event(struct pt_regs *regs)
+{
+	if (regs->flags & (X86_VM_MASK|X86_EFLAGS_IF)) {
+		mce_notify_user();
+		return;
+	}
+
+#ifdef CONFIG_X86_LOCAL_APIC
+	/*
+	 * Without APIC do not notify. The event will be picked
+	 * up eventually.
+	 */
+	if (!cpu_has_apic)
+		return;
+
+	/*
+	 * When interrupts are disabled we cannot use
+	 * kernel services safely. Trigger an self interrupt
+	 * through the APIC to instead do the notification
+	 * after interrupts are reenabled again.
+	 */
+	apic->send_IPI_self(MCE_SELF_VECTOR);
+
+	/*
+	 * Wait for idle afterwards again so that we don't leave the
+	 * APIC in a non idle state because the normal APIC writes
+	 * cannot exclude us.
+	 */
+	apic_wait_icr_idle();
+#endif
+}
+
+/*
  * Poll for corrected events or events that happened before reset.
  * Those are just logged through /dev/mcelog.
  *
@@ -498,6 +548,8 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 	/* notify userspace ASAP */
 	set_thread_flag(TIF_MCE_NOTIFY);
 
+	mce_report_event(regs);
+
 	/* the last thing we do is clear state */
 	for (i = 0; i < banks; i++) {
 		if (test_bit(i, toclear))
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index bb835c9..4234b12 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -1013,6 +1013,11 @@ apicinterrupt THRESHOLD_APIC_VECTOR \
 apicinterrupt THERMAL_APIC_VECTOR \
 	thermal_interrupt smp_thermal_interrupt
 
+#ifdef CONFIG_X86_MCE
+apicinterrupt MCE_SELF_VECTOR \
+	mce_self_interrupt smp_mce_self_interrupt
+#endif
+
 #ifdef CONFIG_SMP
 apicinterrupt CALL_FUNCTION_SINGLE_VECTOR \
 	call_function_single_interrupt smp_call_function_single_interrupt
diff --git a/arch/x86/kernel/irqinit_64.c b/arch/x86/kernel/irqinit_64.c
index f965749..265eeb4 100644
--- a/arch/x86/kernel/irqinit_64.c
+++ b/arch/x86/kernel/irqinit_64.c
@@ -155,6 +155,10 @@ static void __init apic_intr_init(void)
 	/* IPI vectors for APIC spurious and error interrupts */
 	alloc_intr_gate(SPURIOUS_APIC_VECTOR, spurious_interrupt);
 	alloc_intr_gate(ERROR_APIC_VECTOR, error_interrupt);
+
+#ifdef CONFIG_X86_MCE
+	alloc_intr_gate(MCE_SELF_VECTOR, mce_self_interrupt);
+#endif
 }
 
 void __init native_init_IRQ(void)
-- 
1.6.0.2



* [PATCH 12/31] x86: MCE: Remove TSC print heuristic
  2009-05-26 23:54                     ` [PATCH 11/31] x86: MCE: Implement bootstrapping for machine check wakeups Andi Kleen
@ 2009-05-26 23:54                       ` Andi Kleen
  2009-05-26 23:54                         ` [PATCH 13/31] x86: MCE: Drop BKL in mce_open Andi Kleen
  0 siblings, 1 reply; 49+ messages in thread
From: Andi Kleen @ 2009-05-26 23:54 UTC (permalink / raw)
  To: linux-kernel, hpa, x86; +Cc: Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

Impact: bug fix, cleanup

Previously mce_panic used a simple heuristic to avoid printing old,
so far unreported machine check events on an MCE panic. This worked by
comparing the TSC value at the start of the machine check handler with
the event time stamp and only printing newer ones.

This has a couple of issues; in particular, on systems where the TSC
is not fully synchronized between CPUs it could lose events or print
old ones.

It is also problematic with the full system synchronization added
later in this series.

Remove the TSC heuristic and replace it with a simple scheme: print
corrected errors first, then uncorrected errors, and finally the worst
machine check as determined by the machine check handler.

This simplifies the code because there is no need to pass the
original TSC value around.

Contains fixes from Ying Huang

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/kernel/cpu/mcheck/mce.c |   32 ++++++++++++++++++--------------
 1 files changed, 18 insertions(+), 14 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 6219a18..93c3b93 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -152,6 +152,7 @@ void mce_log(struct mce *mce)
 	mcelog.entry[entry].finished = 1;
 	wmb();
 
+	mce->finished = 1;
 	set_bit(0, &notify_user);
 }
 
@@ -181,23 +182,29 @@ static void print_mce(struct mce *m)
 	       "and contact your hardware vendor\n");
 }
 
-static void mce_panic(char *msg, struct mce *backup, u64 start)
+static void mce_panic(char *msg, struct mce *final)
 {
 	int i;
 
 	bust_spinlocks(1);
 	console_verbose();
+	/* First print corrected ones that are still unlogged */
 	for (i = 0; i < MCE_LOG_LEN; i++) {
-		u64 tsc = mcelog.entry[i].tsc;
-
-		if ((s64)(tsc - start) < 0)
+		struct mce *m = &mcelog.entry[i];
+		if ((m->status & MCI_STATUS_VAL) &&
+			!(m->status & MCI_STATUS_UC))
+			print_mce(m);
+	}
+	/* Now print uncorrected but with the final one last */
+	for (i = 0; i < MCE_LOG_LEN; i++) {
+		struct mce *m = &mcelog.entry[i];
+		if (!(m->status & MCI_STATUS_VAL))
 			continue;
-		print_mce(&mcelog.entry[i]);
-		if (backup && mcelog.entry[i].tsc == backup->tsc)
-			backup = NULL;
+		if (!final || memcmp(m, final, sizeof(struct mce)))
+			print_mce(m);
 	}
-	if (backup)
-		print_mce(backup);
+	if (final)
+		print_mce(final);
 	panic(msg);
 }
 
@@ -396,7 +403,6 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 {
 	struct mce m, panicm;
 	int panicm_found = 0;
-	u64 mcestart = 0;
 	int i;
 	/*
 	 * If no_way_out gets set, there is no safe way to recover from this
@@ -428,7 +434,6 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 	if (!(m.mcgstatus & MCG_STATUS_RIPV))
 		no_way_out = 1;
 
-	rdtscll(mcestart);
 	barrier();
 
 	for (i = 0; i < banks; i++) {
@@ -512,7 +517,7 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 	 * has not set tolerant to an insane level, give up and die.
 	 */
 	if (no_way_out && tolerant < 3)
-		mce_panic("Machine check", &panicm, mcestart);
+		mce_panic("Machine check", &panicm);
 
 	/*
 	 * If the error seems to be unrecoverable, something should be
@@ -540,8 +545,7 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 		if (user_space && tolerant > 0) {
 			force_sig(SIGBUS, current);
 		} else if (panic_on_oops || tolerant < 2) {
-			mce_panic("Uncorrected machine check",
-				&panicm, mcestart);
+			mce_panic("Uncorrected machine check", &panicm);
 		}
 	}
 
-- 
1.6.0.2



* [PATCH 13/31] x86: MCE: Drop BKL in mce_open
  2009-05-26 23:54                       ` [PATCH 12/31] x86: MCE: Remove TSC print heuristic Andi Kleen
@ 2009-05-26 23:54                         ` Andi Kleen
  2009-05-26 23:54                           ` [PATCH 14/31] x86: MCE: Add table driven machine check grading Andi Kleen
  0 siblings, 1 reply; 49+ messages in thread
From: Andi Kleen @ 2009-05-26 23:54 UTC (permalink / raw)
  To: linux-kernel, hpa, x86; +Cc: Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

Impact: cleanup

The BKL is not needed for anything in mce_open because it has
its own spinlock. Remove it.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/kernel/cpu/mcheck/mce.c |    4 ----
 1 files changed, 0 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 93c3b93..f617c9b 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -14,7 +14,6 @@
 #include <linux/ratelimit.h>
 #include <linux/kallsyms.h>
 #include <linux/rcupdate.h>
-#include <linux/smp_lock.h>
 #include <linux/kobject.h>
 #include <linux/kdebug.h>
 #include <linux/kernel.h>
@@ -853,12 +852,10 @@ static int		open_exclu;		/* already open exclusive? */
 
 static int mce_open(struct inode *inode, struct file *file)
 {
-	lock_kernel();
 	spin_lock(&mce_state_lock);
 
 	if (open_exclu || (open_count && (file->f_flags & O_EXCL))) {
 		spin_unlock(&mce_state_lock);
-		unlock_kernel();
 
 		return -EBUSY;
 	}
@@ -868,7 +865,6 @@ static int mce_open(struct inode *inode, struct file *file)
 	open_count++;
 
 	spin_unlock(&mce_state_lock);
-	unlock_kernel();
 
 	return nonseekable_open(inode, file);
 }
-- 
1.6.0.2



* [PATCH 14/31] x86: MCE: Add table driven machine check grading
  2009-05-26 23:54                         ` [PATCH 13/31] x86: MCE: Drop BKL in mce_open Andi Kleen
@ 2009-05-26 23:54                           ` Andi Kleen
  2009-05-26 23:54                             ` [PATCH 15/31] x86: MCE: Check early in exception handler if panic is needed Andi Kleen
  0 siblings, 1 reply; 49+ messages in thread
From: Andi Kleen @ 2009-05-26 23:54 UTC (permalink / raw)
  To: linux-kernel, hpa, x86; +Cc: Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

Impact: Feature

The machine check grading (as in deciding what should be done for a
given register value) will soon have to be done in multiple places and
is also getting more complicated, so it makes sense to consolidate it
into a single function. To get smaller, more straightforward and
possibly more extensible code I opted for a new table-driven method.
The various rules are put into a table which is then walked by a very
simple interpreter.

The grading engine is in a new file mce-severity.c. I also added a
private include file mce-internal.h, because mce.h is already a bit
too cluttered.

This is dead code right now, but will be used in follow-on patches.
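
A standalone illustration of how the interpreter below grades a status
word (this re-states the matching loop with the architectural
MCi_STATUS bit positions spelled out; the kernel code additionally
looks at mcgstatus and the tolerant level):

#include <stdio.h>
#include <stdint.h>

#define MCI_STATUS_VAL  (1ULL << 63)
#define MCI_STATUS_OVER (1ULL << 62)
#define MCI_STATUS_UC   (1ULL << 61)
#define MCI_STATUS_EN   (1ULL << 60)
#define MCI_STATUS_PCC  (1ULL << 57)

enum { NO, SOME, UC, PANIC };

struct rule { uint64_t mask, result; int sev; const char *msg; };

/* First matching rule wins; the catch-all at the end always matches */
static const struct rule rules[] = {
	{ MCI_STATUS_VAL, 0,              NO,    "Invalid" },
	{ MCI_STATUS_EN,  0,              NO,    "Not enabled" },
	{ MCI_STATUS_PCC, MCI_STATUS_PCC, PANIC, "Processor context corrupt" },
	{ MCI_STATUS_UC|MCI_STATUS_OVER,
	  MCI_STATUS_UC|MCI_STATUS_OVER,  PANIC, "Overflowed uncorrected" },
	{ MCI_STATUS_UC,  MCI_STATUS_UC,  UC,    "Uncorrected" },
	{ 0, 0,                           SOME,  "No match" },
};

int main(void)
{
	/* Example: a valid, enabled, uncorrected error with PCC set */
	uint64_t status = MCI_STATUS_VAL | MCI_STATUS_EN |
			  MCI_STATUS_UC | MCI_STATUS_PCC;
	const struct rule *r;

	for (r = rules; ; r++)
		if ((status & r->mask) == r->result)
			break;
	/* Prints: severity 3: Processor context corrupt */
	printf("severity %d: %s\n", r->sev, r->msg);
	return 0;
}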

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/kernel/cpu/mcheck/Makefile       |    1 +
 arch/x86/kernel/cpu/mcheck/mce-internal.h |   10 +++++
 arch/x86/kernel/cpu/mcheck/mce-severity.c |   61 +++++++++++++++++++++++++++++
 3 files changed, 72 insertions(+), 0 deletions(-)
 create mode 100644 arch/x86/kernel/cpu/mcheck/mce-internal.h
 create mode 100644 arch/x86/kernel/cpu/mcheck/mce-severity.c

diff --git a/arch/x86/kernel/cpu/mcheck/Makefile b/arch/x86/kernel/cpu/mcheck/Makefile
index 60ee182..45004fa 100644
--- a/arch/x86/kernel/cpu/mcheck/Makefile
+++ b/arch/x86/kernel/cpu/mcheck/Makefile
@@ -1,5 +1,6 @@
 obj-y				=  mce.o therm_throt.o
 
+obj-$(CONFIG_X86_NEW_MCE)	+= mce-severity.o
 obj-$(CONFIG_X86_OLD_MCE)	+= k7.o p4.o p6.o
 obj-$(CONFIG_X86_ANCIENT_MCE)	+= winchip.o p5.o
 obj-$(CONFIG_X86_MCE_P4THERMAL)	+= mce_intel.o
diff --git a/arch/x86/kernel/cpu/mcheck/mce-internal.h b/arch/x86/kernel/cpu/mcheck/mce-internal.h
new file mode 100644
index 0000000..f126b4a
--- /dev/null
+++ b/arch/x86/kernel/cpu/mcheck/mce-internal.h
@@ -0,0 +1,10 @@
+#include <asm/mce.h>
+
+enum severity_level {
+	MCE_NO_SEVERITY,
+	MCE_SOME_SEVERITY,
+	MCE_UC_SEVERITY,
+	MCE_PANIC_SEVERITY,
+};
+
+int mce_severity(struct mce *a, int tolerant, char **msg);
diff --git a/arch/x86/kernel/cpu/mcheck/mce-severity.c b/arch/x86/kernel/cpu/mcheck/mce-severity.c
new file mode 100644
index 0000000..c189e89
--- /dev/null
+++ b/arch/x86/kernel/cpu/mcheck/mce-severity.c
@@ -0,0 +1,61 @@
+/*
+ * MCE grading rules.
+ * Copyright 2008, 2009 Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; version 2
+ * of the License.
+ *
+ * Author: Andi Kleen
+ */
+#include <linux/kernel.h>
+#include <asm/mce.h>
+
+#include "mce-internal.h"
+
+/*
+ * Grade an mce by severity. In general the most severe ones are processed
+ * first. Since there are quite a lot of combinations test the bits in a
+ * table-driven way. The rules are simply processed in order, first
+ * match wins.
+ */
+
+static struct severity {
+	u64 mask;
+	u64 result;
+	unsigned char sev;
+	unsigned char mcgmask;
+	unsigned char mcgres;
+	char *msg;
+} severities[] = {
+#define SEV(s) .sev = MCE_ ## s ## _SEVERITY
+#define BITCLR(x, s, m, r...) { .mask = x, .result = 0, SEV(s), .msg = m, ## r }
+#define BITSET(x, s, m, r...) { .mask = x, .result = x, SEV(s), .msg = m, ## r }
+#define MCGMASK(x, res, s, m, r...) \
+	{ .mcgmask = x, .mcgres = res, SEV(s), .msg = m, ## r }
+	BITCLR(MCI_STATUS_VAL, NO, "Invalid"),
+	BITCLR(MCI_STATUS_EN, NO, "Not enabled"),
+	BITSET(MCI_STATUS_PCC, PANIC, "Processor context corrupt"),
+	MCGMASK(MCG_STATUS_RIPV, 0, PANIC, "No restart IP"),
+	BITSET(MCI_STATUS_UC|MCI_STATUS_OVER, PANIC, "Overflowed uncorrected"),
+	BITSET(MCI_STATUS_UC, UC, "Uncorrected"),
+	BITSET(0, SOME, "No match")	/* always matches. keep at end */
+};
+
+int mce_severity(struct mce *a, int tolerant, char **msg)
+{
+	struct severity *s;
+	for (s = severities;; s++) {
+		if ((a->status & s->mask) != s->result)
+			continue;
+		if ((a->mcgstatus & s->mcgmask) != s->mcgres)
+			continue;
+		if (s->sev > MCE_NO_SEVERITY && (a->status & MCI_STATUS_UC) &&
+			tolerant < 1)
+			return MCE_PANIC_SEVERITY;
+		if (msg)
+			*msg = s->msg;
+		return s->sev;
+	}
+}
-- 
1.6.0.2



* [PATCH 15/31] x86: MCE: Check early in exception handler if panic is needed
  2009-05-26 23:54                           ` [PATCH 14/31] x86: MCE: Add table driven machine check grading Andi Kleen
@ 2009-05-26 23:54                             ` Andi Kleen
  2009-05-26 23:54                               ` [PATCH 16/31] x86: MCE: Implement panic synchronization Andi Kleen
  0 siblings, 1 reply; 49+ messages in thread
From: Andi Kleen @ 2009-05-26 23:54 UTC (permalink / raw)
  To: linux-kernel, hpa, x86; +Cc: Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

Impact: Feature

The exception handler should behave differently if the exception is
fatal versus one that can be returned from.  In the fatal case it
should never clear any registers because these need to be preserved
for logging after the next boot. In the recoverable case it should
clear them on each CPU step by step so that other CPUs sharing the
same bank don't see duplicate events. Without this we risk reporting
events multiple times on CPUs which have shared machine check banks,
which is a common problem on Intel Nehalem, which has both SMT (two
CPU threads sharing banks) and shared machine check banks in the
uncore.

Determine early in a special pass if any event requires a panic.
This uses the mce_severity() function added earlier.

This is needed for the next patch.

Together with an earlier patch this also fixes the problem that
corrected events weren't logged on a fatal MCE.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/kernel/cpu/mcheck/mce.c |   38 +++++++++++++++++++++++++-------------
 1 files changed, 25 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index f617c9b..a61eead 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -39,6 +39,7 @@
 #include <asm/ipi.h>
 #include <asm/hw_irq.h>
 #include <asm/apic.h>
+#include "mce-internal.h"
 
 #include "mce.h"
 
@@ -181,7 +182,7 @@ static void print_mce(struct mce *m)
 	       "and contact your hardware vendor\n");
 }
 
-static void mce_panic(char *msg, struct mce *final)
+static void mce_panic(char *msg, struct mce *final, char *exp)
 {
 	int i;
 
@@ -204,6 +205,8 @@ static void mce_panic(char *msg, struct mce *final)
 	}
 	if (final)
 		print_mce(final);
+	if (exp)
+		printk(KERN_EMERG "Machine check: %s\n", exp);
 	panic(msg);
 }
 
@@ -391,6 +394,22 @@ void machine_check_poll(enum mcp_flags flags, mce_banks_t *b)
 EXPORT_SYMBOL_GPL(machine_check_poll);
 
 /*
+ * Do a quick check if any of the events requires a panic.
+ * This decides if we keep the events around or clear them.
+ */
+static int mce_no_way_out(struct mce *m, char **msg)
+{
+	int i;
+
+	for (i = 0; i < banks; i++) {
+		m->status = mce_rdmsrl(MSR_IA32_MC0_STATUS + i*4);
+		if (mce_severity(m, tolerant, msg) >= MCE_PANIC_SEVERITY)
+			return 1;
+	}
+	return 0;
+}
+
+/*
  * The actual machine check handler. This only handles real
  * exceptions when something got corrupted coming in through int 18.
  *
@@ -414,6 +433,7 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 	 */
 	int kill_it = 0;
 	DECLARE_BITMAP(toclear, MAX_NR_BANKS);
+	char *msg = "Unknown";
 
 	atomic_inc(&mce_entry);
 
@@ -428,10 +448,7 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 	mce_setup(&m);
 
 	m.mcgstatus = mce_rdmsrl(MSR_IA32_MCG_STATUS);
-
-	/* if the restart IP is not valid, we're done for */
-	if (!(m.mcgstatus & MCG_STATUS_RIPV))
-		no_way_out = 1;
+	no_way_out = mce_no_way_out(&m, &msg);
 
 	barrier();
 
@@ -463,18 +480,13 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 		__set_bit(i, toclear);
 
 		if (m.status & MCI_STATUS_EN) {
-			/* if PCC was set, there's no way out */
-			no_way_out |= !!(m.status & MCI_STATUS_PCC);
 			/*
 			 * If this error was uncorrectable and there was
 			 * an overflow, we're in trouble.  If no overflow,
 			 * we might get away with just killing a task.
 			 */
-			if (m.status & MCI_STATUS_UC) {
-				if (tolerant < 1 || m.status & MCI_STATUS_OVER)
-					no_way_out = 1;
+			if (m.status & MCI_STATUS_UC)
 				kill_it = 1;
-			}
 		} else {
 			/*
 			 * Machine check event was not enabled. Clear, but
@@ -516,7 +528,7 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 	 * has not set tolerant to an insane level, give up and die.
 	 */
 	if (no_way_out && tolerant < 3)
-		mce_panic("Machine check", &panicm);
+		mce_panic("Machine check", &panicm, msg);
 
 	/*
 	 * If the error seems to be unrecoverable, something should be
@@ -544,7 +556,7 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 		if (user_space && tolerant > 0) {
 			force_sig(SIGBUS, current);
 		} else if (panic_on_oops || tolerant < 2) {
-			mce_panic("Uncorrected machine check", &panicm);
+			mce_panic("Uncorrected machine check", &panicm, msg);
 		}
 	}
 
-- 
1.6.0.2



* [PATCH 16/31] x86: MCE: Implement panic synchronization
  2009-05-26 23:54                             ` [PATCH 15/31] x86: MCE: Check early in exception handler if panic is needed Andi Kleen
@ 2009-05-26 23:54                               ` Andi Kleen
  2009-05-26 23:54                                 ` [PATCH 17/31] x86: MCE: Switch x86 machine check handler to Monarch election. v2 Andi Kleen
  0 siblings, 1 reply; 49+ messages in thread
From: Andi Kleen @ 2009-05-26 23:54 UTC (permalink / raw)
  To: linux-kernel, hpa, x86; +Cc: Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

Impact: bugfix

In some circumstances multiple CPUs can enter mce_panic() in parallel.
This gives quite confusing output because they will all dump the same
machine check buffer.

The other problem is that they would all panic in parallel, but not
process each other's shutdown IPIs because interrupts are disabled.

Detect this situation early on in mce_panic(). The first CPU entering
will do the panic, the others will just wait to be killed.

For paranoia, in case the CPU doing the panic dies during the MCE, I
added a 5 second timeout. If it expires each CPU will panic on its own
again.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/kernel/cpu/mcheck/mce.c |   22 ++++++++++++++++++++++
 1 files changed, 22 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index a61eead..65c62d1 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -182,10 +182,32 @@ static void print_mce(struct mce *m)
 	       "and contact your hardware vendor\n");
 }
 
+#define PANIC_TIMEOUT 5 /* 5 seconds */
+
+static atomic_t mce_paniced;
+
+/* Panic in progress. Enable interrupts and wait for final IPI */
+static void wait_for_panic(void)
+{
+	long timeout = PANIC_TIMEOUT*USEC_PER_SEC;
+	preempt_disable();
+	local_irq_enable();
+	while (timeout-- > 0)
+		udelay(1);
+	panic("Panicing machine check CPU died");
+}
+
 static void mce_panic(char *msg, struct mce *final, char *exp)
 {
 	int i;
 
+	/*
+	 * Make sure only one CPU runs in machine check panic
+	 */
+	if (atomic_add_return(1, &mce_paniced) > 1)
+		wait_for_panic();
+	barrier();
+
 	bust_spinlocks(1);
 	console_verbose();
 	/* First print corrected ones that are still unlogged */
-- 
1.6.0.2



* [PATCH 17/31] x86: MCE: Switch x86 machine check handler to Monarch election. v2
  2009-05-26 23:54                               ` [PATCH 16/31] x86: MCE: Implement panic synchronization Andi Kleen
@ 2009-05-26 23:54                                 ` Andi Kleen
  2009-05-26 23:54                                   ` [PATCH 18/31] x86: MCE: Store record length into memory struct mce anchor Andi Kleen
  0 siblings, 1 reply; 49+ messages in thread
From: Andi Kleen @ 2009-05-26 23:54 UTC (permalink / raw)
  To: linux-kernel, hpa, x86; +Cc: Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

Impact: feature, fixes long standing problems.

On Intel platforms machine check exceptions are always broadcast to all CPUs.
This patch makes the machine check handler synchronize all these machine
checks, elect a Monarch to handle the event and collect the worst event from
all CPUs and then process it first.

This has some advantages:
- When there is a truly data corrupting error the system panics as
quickly as possible. This improves containment of corrupted
data and makes sure the corrupted data never hits stable storage.
- The panics are synchronized and do not reenter the panic code
on multiple CPUs (which currently does not handle this well).
- All the errors are reported. Currently it often happens that
another CPU does the panic first, but reports useless information
(an empty machine check) because the real error happened on another
CPU which came in later.
This is a big advantage on Nehalem where the 8 threads per CPU often
lead to the wrong CPU winning the race and dumping useless
information on a machine check.  The problem also occurs in a less
severe form on older CPUs.
- The system can detect when no CPU detected a machine check
and shut itself down. This can happen when one CPU is so badly
hung that it cannot process a machine check anymore, or when some
external agent wants to stop the system by asserting the machine
check pin. This follows Intel hardware recommendations.
- This matches the error model recommended by the CPU designers.
- The events can be output in true severity order
- When a panic happens on another CPU, each CPU makes sure it is
actually able to process the stop IPI by enabling interrupts.

The code is extremely careful to handle timeouts while waiting for
other CPUs. It can't rely on the normal timing mechanisms (jiffies,
ktime_get) because of its asynchronous/lockless nature, so it
implements its own timeouts using ndelay() and a "SPINUNIT".

The timeout is configurable. By default it waits for up to one
second for the other CPUs.  It can also be disabled.

From some informal testing AMD systems do not seem to broadcast
machine checks, so right now this is always disabled by default on
non-Intel CPUs and also on very old Intel systems.

Includes fixes from Ying Huang
Fixed an "ecception" typo in a comment (H.Seto)
Moved global_nwo reset later based on suggestion from H.Seto
v2: Avoid duplicate messages

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 Documentation/x86/x86_64/boot-options.txt |    6 +-
 Documentation/x86/x86_64/machinecheck     |    4 +
 arch/x86/kernel/cpu/mcheck/mce.c          |  360 ++++++++++++++++++++++++++---
 3 files changed, 340 insertions(+), 30 deletions(-)

diff --git a/Documentation/x86/x86_64/boot-options.txt b/Documentation/x86/x86_64/boot-options.txt
index 34c1304..e97ebe4 100644
--- a/Documentation/x86/x86_64/boot-options.txt
+++ b/Documentation/x86/x86_64/boot-options.txt
@@ -13,13 +13,17 @@ Machine check
                in a reboot. On Intel systems it is enabled by default.
    mce=nobootlog
 		Disable boot machine check logging.
-   mce=tolerancelevel (number)
+   mce=tolerancelevel[,monarchtimeout] (number,number)
+		tolerance levels:
 		0: always panic on uncorrected errors, log corrected errors
 		1: panic or SIGBUS on uncorrected errors, log corrected errors
 		2: SIGBUS or log uncorrected errors, log corrected errors
 		3: never panic or SIGBUS, log all errors (for testing only)
 		Default is 1
 		Can be also set using sysfs which is preferable.
+		monarchtimout:
+		Sets the time in us to wait for other CPUs on machine checks. 0
+		to disable.
 
    nomce (for compatibility with i386): same as mce=off
 
diff --git a/Documentation/x86/x86_64/machinecheck b/Documentation/x86/x86_64/machinecheck
index a05e58e..bec2af3 100644
--- a/Documentation/x86/x86_64/machinecheck
+++ b/Documentation/x86/x86_64/machinecheck
@@ -67,6 +67,10 @@ trigger
 	Program to run when a machine check event is detected.
 	This is an alternative to running mcelog regularly from cron
 	and allows to detect events faster.
+monarch_timeout
+	How long to wait for the other CPUs to machine check too on a
+	exception. 0 to disable waiting for other CPUs.
+	Unit: us
 
 TBD document entries for AMD threshold interrupt configuration
 
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 65c62d1..d87f46f 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -20,6 +20,7 @@
 #include <linux/percpu.h>
 #include <linux/string.h>
 #include <linux/sysdev.h>
+#include <linux/delay.h>
 #include <linux/ctype.h>
 #include <linux/sched.h>
 #include <linux/sysfs.h>
@@ -27,6 +28,7 @@
 #include <linux/init.h>
 #include <linux/kmod.h>
 #include <linux/poll.h>
+#include <linux/nmi.h>
 #include <linux/cpu.h>
 #include <linux/fs.h>
 
@@ -60,6 +62,8 @@ int				mce_disabled;
 
 #define MISC_MCELOG_MINOR	227
 
+#define SPINUNIT 100	/* 100ns */
+
 atomic_t mce_entry;
 
 DEFINE_PER_CPU(unsigned, mce_exception_count);
@@ -77,6 +81,7 @@ static u64			*bank;
 static unsigned long		notify_user;
 static int			rip_msr;
 static int			mce_bootlog = -1;
+static int			monarch_timeout = -1;
 
 static char			trigger[128];
 static char			*trigger_argv[2] = { trigger, NULL };
@@ -84,6 +89,9 @@ static char			*trigger_argv[2] = { trigger, NULL };
 static unsigned long		dont_init_banks;
 
 static DECLARE_WAIT_QUEUE_HEAD(mce_wait);
+static DEFINE_PER_CPU(struct mce, mces_seen);
+static int			cpu_missing;
+
 
 /* MCA banks polled by the period polling timer for corrected events */
 DEFINE_PER_CPU(mce_banks_t, mce_poll_banks) = {
@@ -227,6 +235,8 @@ static void mce_panic(char *msg, struct mce *final, char *exp)
 	}
 	if (final)
 		print_mce(final);
+	if (cpu_missing)
+		printk(KERN_EMERG "Some CPUs didn't answer in synchronization\n");
 	if (exp)
 		printk(KERN_EMERG "Machine check: %s\n", exp);
 	panic(msg);
@@ -432,18 +442,287 @@ static int mce_no_way_out(struct mce *m, char **msg)
 }
 
 /*
+ * Variable to establish order between CPUs while scanning.
+ * Each CPU spins initially until executing is equal to its number.
+ */
+static atomic_t mce_executing;
+
+/*
+ * Defines order of CPUs on entry. First CPU becomes Monarch.
+ */
+static atomic_t mce_callin;
+
+/*
+ * Check if a timeout waiting for other CPUs happened.
+ */
+static int mce_timed_out(u64 *t)
+{
+	/*
+	 * The others already did panic for some reason.
+	 * Bail out like in a timeout.
+	 * rmb() to tell the compiler that system_state
+	 * might have been modified by someone else.
+	 */
+	rmb();
+	if (atomic_read(&mce_paniced))
+		wait_for_panic();
+	if (!monarch_timeout)
+		goto out;
+	if ((s64)*t < SPINUNIT) {
+		/* CHECKME: Make panic default for 1 too? */
+		if (tolerant < 1)
+			mce_panic("Timeout synchronizing machine check over CPUs",
+				  NULL, NULL);
+		cpu_missing = 1;
+		return 1;
+	}
+	*t -= SPINUNIT;
+out:
+	touch_nmi_watchdog();
+	return 0;
+}
+
+/*
+ * The Monarch's reign.  The Monarch is the CPU who entered
+ * the machine check handler first. It waits for the others to
+ * raise the exception too and then grades them. If any
+ * error is fatal, panic. Only then let the others continue.
+ *
+ * The other CPUs entering the MCE handler will be controlled by the
+ * Monarch. They are called Subjects.
+ *
+ * This way we prevent any potential data corruption in an unrecoverable case
+ * and also make sure that all CPUs' errors are always examined.
+ *
+ * Also this detects the case of a machine check event coming from outer
+ * space (not detected by any CPU). In this case some external agent wants
+ * us to shut down, so panic too.
+ *
+ * The other CPUs might still decide to panic if the handler happens
+ * in an unrecoverable place, but in this case the system is in a semi-stable
+ * state and won't corrupt anything by itself. It's ok to let the others
+ * continue for a bit first.
+ *
+ * All the spin loops have timeouts; when a timeout happens a CPU
+ * typically elects itself to be Monarch.
+ */
+static void mce_reign(void)
+{
+	int cpu;
+	struct mce *m = NULL;
+	int global_worst = 0;
+	char *msg = NULL;
+	char *nmsg = NULL;
+
+	/*
+	 * This CPU is the Monarch and the other CPUs have run
+	 * through their handlers.
+	 * Grade the severity of the errors of all the CPUs.
+	 */
+	for_each_possible_cpu (cpu) {
+		int severity = mce_severity(&per_cpu(mces_seen, cpu), tolerant,
+					    &nmsg);
+		if (severity > global_worst) {
+			msg = nmsg;
+			global_worst = severity;
+			m = &per_cpu(mces_seen, cpu);
+		}
+	}
+
+	/*
+	 * Cannot recover? Panic here then.
+	 * This dumps all the mces in the log buffer and stops the
+	 * other CPUs.
+	 */
+	if (m && global_worst >= MCE_PANIC_SEVERITY && tolerant < 3)
+		mce_panic("Fatal machine check", m, msg);
+
+	/*
+	 * For UC somewhere we let the CPU who detects it handle it.
+	 * Also must let continue the others, otherwise the handling
+	 * CPU could deadlock on a lock.
+	 */
+
+	/*
+	 * No machine check event found. Must be some external
+	 * source or one CPU is hung. Panic.
+	 */
+	if (!m && tolerant < 3)
+		mce_panic("Machine check from unknown source", NULL, NULL);
+
+	/*
+	 * Now clear all the mces_seen so that they don't reappear on
+	 * the next mce.
+	 */
+	for_each_possible_cpu (cpu)
+		memset(&per_cpu(mces_seen, cpu), 0, sizeof(struct mce));
+}
+
+static atomic_t global_nwo;
+
+/*
+ * Start of Monarch synchronization. This waits until all CPUs have
+ * entered the exception handler and then determines if any of them
+ * saw a fatal event that requires panic. Then it executes them
+ * in the entry order.
+ * TBD double check parallel CPU hotunplug
+ */
+static int mce_start(int no_way_out, int *order)
+{
+	int nwo;
+	int cpus = num_online_cpus();
+	u64 timeout = (u64)monarch_timeout * NSEC_PER_USEC;
+
+	if (!timeout) {
+		*order = -1;
+		return no_way_out;
+	}
+
+	atomic_add(no_way_out, &global_nwo);
+
+	/*
+	 * Wait for everyone.
+	 */
+	while (atomic_read(&mce_callin) != cpus) {
+		if (mce_timed_out(&timeout)) {
+			atomic_set(&global_nwo, 0);
+			*order = -1;
+			return no_way_out;
+		}
+		ndelay(SPINUNIT);
+	}
+
+	/*
+	 * Cache the global no_way_out state.
+	 */
+	nwo = atomic_read(&global_nwo);
+
+	/*
+	 * Monarch starts executing now, the others wait.
+	 */
+	if (*order == 1) {
+		atomic_set(&mce_executing, 1);
+		return nwo;
+	}
+
+	/*
+	 * Now start the scanning loop one by one
+	 * in the original callin order.
+	 * This way an error in a shared bank will
+	 * be seen by only one CPU before being cleared, avoiding duplicates.
+	 */
+	while (atomic_read(&mce_executing) < *order) {
+		if (mce_timed_out(&timeout)) {
+			atomic_set(&global_nwo, 0);
+			*order = -1;
+			return no_way_out;
+		}
+		ndelay(SPINUNIT);
+	}
+	return nwo;
+}
+
+/*
+ * Synchronize between CPUs after main scanning loop.
+ * This invokes the bulk of the Monarch processing.
+ */
+static int mce_end(int order)
+{
+	int ret = -1;
+	u64 timeout = (u64)monarch_timeout * NSEC_PER_USEC;
+
+	if (!timeout)
+		goto reset;
+	if (order < 0)
+		goto reset;
+
+	/*
+	 * Allow others to run.
+	 */
+	atomic_inc(&mce_executing);
+
+	if (order == 1) {
+		/* CHECKME: Can this race with a parallel hotplug? */
+		int cpus = num_online_cpus();
+
+		/*
+		 * Monarch: Wait for everyone to go through their scanning
+		 * loops.
+		 */
+		while (atomic_read(&mce_executing) <= cpus) {
+			if (mce_timed_out(&timeout))
+				goto reset;
+			ndelay(SPINUNIT);
+		}
+
+		mce_reign();
+		barrier();
+		ret = 0;
+	} else {
+		/*
+		 * Subject: Wait for Monarch to finish.
+		 */
+		while (atomic_read(&mce_executing) != 0) {
+			if (mce_timed_out(&timeout))
+				goto reset;
+			ndelay(SPINUNIT);
+		}
+
+		/*
+		 * Don't reset anything. That's done by the Monarch.
+		 */
+		return 0;
+	}
+
+	/*
+	 * Reset all global state.
+	 */
+reset:
+	atomic_set(&global_nwo, 0);
+	atomic_set(&mce_callin, 0);
+	barrier();
+
+	/*
+	 * Let others run again.
+	 */
+	atomic_set(&mce_executing, 0);
+	return ret;
+}
+
+static void mce_clear_state(unsigned long *toclear)
+{
+	int i;
+
+	for (i = 0; i < banks; i++) {
+		if (test_bit(i, toclear))
+			mce_wrmsrl(MSR_IA32_MC0_STATUS+4*i, 0);
+	}
+}
+
+/*
  * The actual machine check handler. This only handles real
  * exceptions when something got corrupted coming in through int 18.
  *
  * This is executed in NMI context not subject to normal locking rules. This
  * implies that most kernel services cannot be safely used. Don't even
  * think about putting a printk in there!
+ *
+ * On Intel systems this is entered on all CPUs in parallel through
+ * MCE broadcast. However some CPUs might be broken beyond repair,
+ * so be always careful when synchronizing with others.
  */
 void do_machine_check(struct pt_regs *regs, long error_code)
 {
-	struct mce m, panicm;
-	int panicm_found = 0;
+	struct mce m, *final;
 	int i;
+	int worst = 0;
+	int severity;
+	/*
+	 * Establish sequential order between the CPUs entering the machine
+	 * check handler.
+	 */
+	int order;
+
 	/*
 	 * If no_way_out gets set, there is no safe way to recover from this
 	 * MCE.  If tolerant is cranked up, we'll try anyway.
@@ -467,13 +746,23 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 	if (!banks)
 		goto out;
 
+	order = atomic_add_return(1, &mce_callin);
 	mce_setup(&m);
 
 	m.mcgstatus = mce_rdmsrl(MSR_IA32_MCG_STATUS);
 	no_way_out = mce_no_way_out(&m, &msg);
 
+	final = &__get_cpu_var(mces_seen);
+	*final = m;
+
 	barrier();
 
+	/*
+	 * Go through all the banks in exclusion of the other CPUs.
+	 * This way we don't report duplicated events on shared banks
+	 * because the first one to see it will clear it.
+	 */
+	no_way_out = mce_start(no_way_out, &order);
 	for (i = 0; i < banks; i++) {
 		__clear_bit(i, toclear);
 		if (!bank[i])
@@ -525,32 +814,32 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 		mce_get_rip(&m, regs);
 		mce_log(&m);
 
-		/*
-		 * Did this bank cause the exception?
-		 *
-		 * Assume that the bank with uncorrectable errors did it,
-		 * and that there is only a single one:
-		 */
-		if ((m.status & MCI_STATUS_UC) &&
-					(m.status & MCI_STATUS_EN)) {
-			panicm = m;
-			panicm_found = 1;
+		severity = mce_severity(&m, tolerant, NULL);
+		if (severity > worst) {
+			*final = m;
+			worst = severity;
 		}
 	}
 
+	if (!no_way_out)
+		mce_clear_state(toclear);
+
 	/*
-	 * If we didn't find an uncorrectable error, pick
-	 * the last one (shouldn't happen, just being safe).
+	 * Do most of the synchronization with other CPUs.
+	 * When there's any problem use only local no_way_out state.
 	 */
-	if (!panicm_found)
-		panicm = m;
+	if (mce_end(order) < 0)
+		no_way_out = worst >= MCE_PANIC_SEVERITY;
 
 	/*
 	 * If we have decided that we just CAN'T continue, and the user
 	 * has not set tolerant to an insane level, give up and die.
+	 *
+	 * This is mainly used in the case when the system doesn't
+	 * support MCE broadcasting or it has been disabled.
 	 */
 	if (no_way_out && tolerant < 3)
-		mce_panic("Machine check", &panicm, msg);
+		mce_panic("Machine check", final, msg);
 
 	/*
 	 * If the error seems to be unrecoverable, something should be
@@ -566,7 +855,7 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 		 * instruction which caused the MCE.
 		 */
 		if (m.mcgstatus & MCG_STATUS_EIPV)
-			user_space = panicm.ip && (panicm.cs & 3);
+			user_space = final->ip && (final->cs & 3);
 
 		/*
 		 * If we know that the error was in user space, send a
@@ -578,20 +867,15 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 		if (user_space && tolerant > 0) {
 			force_sig(SIGBUS, current);
 		} else if (panic_on_oops || tolerant < 2) {
-			mce_panic("Uncorrected machine check", &panicm, msg);
+			mce_panic("Uncorrected machine check", final, msg);
 		}
 	}
 
 	/* notify userspace ASAP */
 	set_thread_flag(TIF_MCE_NOTIFY);
 
-	mce_report_event(regs);
-
-	/* the last thing we do is clear state */
-	for (i = 0; i < banks; i++) {
-		if (test_bit(i, toclear))
-			mce_wrmsrl(MSR_IA32_MC0_STATUS+4*i, 0);
-	}
+	if (worst > 0)
+		mce_report_event(regs);
 	mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
 out:
 	atomic_dec(&mce_entry);
@@ -804,7 +1088,17 @@ static void mce_cpu_quirks(struct cpuinfo_x86 *c)
 
 		if (c->x86 == 6 && c->x86_model < 0x1A)
 			__set_bit(0, &dont_init_banks);
+
+		/*
+		 * All newer Intel systems support MCE broadcasting. Enable
+		 * synchronization with a one second timeout.
+		 */
+		if ((c->x86 > 6 || (c->x86 == 6 && c->x86_model >= 0xe)) &&
+			monarch_timeout < 0)
+			monarch_timeout = USEC_PER_SEC;
 	}
+	if (monarch_timeout < 0)
+		monarch_timeout = 0;
 }
 
 static void __cpuinit mce_ancient_init(struct cpuinfo_x86 *c)
@@ -1050,7 +1344,9 @@ static struct miscdevice mce_log_device = {
 
 /*
  * mce=off disables machine check
- * mce=TOLERANCELEVEL (number, see above)
+ * mce=TOLERANCELEVEL[,monarchtimeout] (number, see above)
+ *	monarchtimeout is how long to wait for other CPUs on machine
+ *      check, or 0 to not wait
  * mce=bootlog Log MCEs from before booting. Disabled by default on AMD.
  * mce=nobootlog Don't log MCEs from before booting.
  */
@@ -1064,9 +1360,13 @@ static int __init mcheck_enable(char *str)
 		mce_disabled = 1;
 	else if (!strcmp(str, "bootlog") || !strcmp(str, "nobootlog"))
 		mce_bootlog = (str[0] == 'b');
-	else if (isdigit(str[0]))
+	else if (isdigit(str[0])) {
 		get_option(&str, &tolerant);
-	else {
+		if (*str == ',') {
+			++str;
+			get_option(&str, &monarch_timeout);
+		}
+	} else {
 		printk(KERN_INFO "mce argument %s ignored. Please use /sys\n",
 		       str);
 		return 0;
@@ -1204,6 +1504,7 @@ static ssize_t store_int_with_restart(struct sys_device *s,
 
 static SYSDEV_ATTR(trigger, 0644, show_trigger, set_trigger);
 static SYSDEV_INT_ATTR(tolerant, 0644, tolerant);
+static SYSDEV_INT_ATTR(monarch_timeout, 0644, monarch_timeout);
 
 static struct sysdev_ext_attribute attr_check_interval = {
 	_SYSDEV_ATTR(check_interval, 0644, sysdev_show_int,
@@ -1213,6 +1514,7 @@ static struct sysdev_ext_attribute attr_check_interval = {
 
 static struct sysdev_attribute *mce_attr[] = {
 	&attr_tolerant.attr, &attr_check_interval.attr, &attr_trigger,
+	&attr_monarch_timeout.attr,
 	NULL
 };
 
-- 
1.6.0.2


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH 18/31] x86: MCE: Store record length into memory struct mce anchor
  2009-05-26 23:54                                 ` [PATCH 17/31] x86: MCE: Switch x86 machine check handler to Monarch election. v2 Andi Kleen
@ 2009-05-26 23:54                                   ` Andi Kleen
  2009-05-26 23:54                                     ` [PATCH 19/31] x86: MCE: Default to panic timeout for machine checks v2 Andi Kleen
  0 siblings, 1 reply; 49+ messages in thread
From: Andi Kleen @ 2009-05-26 23:54 UTC (permalink / raw)
  To: linux-kernel, hpa, x86; +Cc: Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

This makes it easier for tools that want to extract the mcelog out of
crash images or memory dumps to adapt to a changing struct mce size.
The length field replaces padding, so it's fully compatible.
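
A hedged sketch of how such a tool might use the field (nothing below
is from this series; the struct and helper names are invented for
illustration, and a real tool would take the buffer location, the
offset of the entry array and the field values from the dumped
kernel's debug info):

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Values a dump-extraction tool would read out of the crash image. */
struct mcelog_layout {
	size_t entries_off;	/* offset of entry[] within the mcelog buffer */
	unsigned len;		/* the 'len' field: number of entry slots */
	unsigned recordlen;	/* the new 'recordlen' field: sizeof(struct mce)
				 * in the dumped kernel, not in this tool */
};

static void walk_mcelog(const uint8_t *buf, const struct mcelog_layout *l)
{
	unsigned i;

	for (i = 0; i < l->len; i++) {
		/* Stride by recordlen, so records written by a newer kernel
		 * with a larger struct mce are still found at the right
		 * offsets; extra trailing bytes are simply skipped. */
		const uint8_t *rec = buf + l->entries_off +
				     (size_t)i * l->recordlen;

		printf("record %u at offset %zu (%u bytes)\n",
		       i, (size_t)(rec - buf), l->recordlen);
	}
}

int main(void)
{
	uint8_t buf[64] = { 0 };
	struct mcelog_layout l = { .entries_off = 32, .len = 0, .recordlen = 0 };

	walk_mcelog(buf, &l);	/* real values come from the dump */
	return 0;
}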

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/include/asm/mce.h       |    2 +-
 arch/x86/kernel/cpu/mcheck/mce.c |    5 +++--
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index 1ff437f..6f45a87 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -59,7 +59,7 @@ struct mce_log {
 	unsigned len;	    /* = MCE_LOG_LEN */
 	unsigned next;
 	unsigned flags;
-	unsigned pad0;
+	unsigned recordlen;	/* length of struct mce */
 	struct mce entry[MCE_LOG_LEN];
 };
 
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index d87f46f..bf9bd6f 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -121,8 +121,9 @@ EXPORT_PER_CPU_SYMBOL_GPL(injectm);
  */
 
 static struct mce_log mcelog = {
-	MCE_LOG_SIGNATURE,
-	MCE_LOG_LEN,
+	.signature = MCE_LOG_SIGNATURE,
+	.len = MCE_LOG_LEN,
+	.recordlen = sizeof(struct mce),
 };
 
 void mce_log(struct mce *mce)
-- 
1.6.0.2


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH 19/31] x86: MCE: Default to panic timeout for machine checks v2
  2009-05-26 23:54                                   ` [PATCH 18/31] x86: MCE: Store record length into memory struct mce anchor Andi Kleen
@ 2009-05-26 23:54                                     ` Andi Kleen
  2009-05-26 23:54                                       ` [PATCH 20/31] x86: MCE: Improve documentation Andi Kleen
                                                         ` (2 more replies)
  0 siblings, 3 replies; 49+ messages in thread
From: Andi Kleen @ 2009-05-26 23:54 UTC (permalink / raw)
  To: linux-kernel, hpa, x86; +Cc: Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

Fatal machine checks can be logged to disk after boot, but only if
the system did a warm reboot. That's unfortunately difficult with the
default panic behaviour, which waits forever, so the admin has to
press the power button because modern systems usually lack a reset button.
That power cycle clears the machine check registers and makes
it impossible to log them.

This patch changes the default for machine check panic to always
reboot after 30s. Then the mce can be successfully logged after
reboot.

I believe this will improve machine check experience for any
system running the X server.

This is dependent on successful boot logging of MCEs. This currently
only works on Intel systems; on AMD quite a lot of systems
leave junk in the machine check registers after boot,
so it's disabled there. These systems will continue to default
to the endless waiting panic.

v2: Only force panic timeout when it's shorter (H.Seto)

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/kernel/cpu/mcheck/mce.c |    7 +++++++
 1 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index bf9bd6f..c31cbcf 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -82,6 +82,7 @@ static unsigned long		notify_user;
 static int			rip_msr;
 static int			mce_bootlog = -1;
 static int			monarch_timeout = -1;
+static int			mce_panic_timeout;
 
 static char			trigger[128];
 static char			*trigger_argv[2] = { trigger, NULL };
@@ -203,6 +204,8 @@ static void wait_for_panic(void)
 	local_irq_enable();
 	while (timeout-- > 0)
 		udelay(1);
+	if (mce_panic_timeout < panic_timeout)
+		panic_timeout = mce_panic_timeout;
 	panic("Panicing machine check CPU died");
 }
 
@@ -240,6 +243,8 @@ static void mce_panic(char *msg, struct mce *final, char *exp)
 		printk(KERN_EMERG "Some CPUs didn't answer in synchronization\n");
 	if (exp)
 		printk(KERN_EMERG "Machine check: %s\n", exp);
+	if (mce_panic_timeout < panic_timeout)
+		panic_timeout = mce_panic_timeout;
 	panic(msg);
 }
 
@@ -1100,6 +1105,8 @@ static void mce_cpu_quirks(struct cpuinfo_x86 *c)
 	}
 	if (monarch_timeout < 0)
 		monarch_timeout = 0;
+	if (mce_bootlog != 0)
+		mce_panic_timeout = 30;
 }
 
 static void __cpuinit mce_ancient_init(struct cpuinfo_x86 *c)
-- 
1.6.0.2


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH 20/31] x86: MCE: Improve documentation
  2009-05-26 23:54                                     ` [PATCH 19/31] x86: MCE: Default to panic timeout for machine checks v2 Andi Kleen
@ 2009-05-26 23:54                                       ` Andi Kleen
  2009-05-26 23:54                                         ` [PATCH 21/31] x86: MCE: Support more than 256 CPUs in struct mce Andi Kleen
  2009-05-27  4:31                                       ` [PATCH 19/31] x86: MCE: Default to panic timeout for machine checks v2 Hidetoshi Seto
  2009-05-27  4:31                                       ` [PATCH] x86: MCE: Fix for mce_panic_timeout Hidetoshi Seto
  2 siblings, 1 reply; 49+ messages in thread
From: Andi Kleen @ 2009-05-26 23:54 UTC (permalink / raw)
  To: linux-kernel, hpa, x86; +Cc: Andi Kleen, seto.hidetoshi

From: Andi Kleen <ak@linux.intel.com>

Document that check_interval set to 0 means no polling.
Noticed by Hidetoshi Seto

Also add a reference from boot options to the sysfs tunables

Cc: seto.hidetoshi@jp.fujitsu.com

Acked-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 Documentation/x86/x86_64/boot-options.txt |    2 ++
 Documentation/x86/x86_64/machinecheck     |    4 +++-
 2 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/Documentation/x86/x86_64/boot-options.txt b/Documentation/x86/x86_64/boot-options.txt
index e97ebe4..9fb1f4f 100644
--- a/Documentation/x86/x86_64/boot-options.txt
+++ b/Documentation/x86/x86_64/boot-options.txt
@@ -5,6 +5,8 @@ only the AMD64 specific ones are listed here.
 
 Machine check
 
+   Please see Documentation/x86/x86_64/machinecheck for sysfs runtime tunables.
+
    mce=off disable machine check
    mce=bootlog Enable logging of machine checks left over from booting.
                Disabled by default on AMD because some BIOS leave bogus ones.
diff --git a/Documentation/x86/x86_64/machinecheck b/Documentation/x86/x86_64/machinecheck
index bec2af3..b1fb302 100644
--- a/Documentation/x86/x86_64/machinecheck
+++ b/Documentation/x86/x86_64/machinecheck
@@ -41,7 +41,9 @@ check_interval
 	the polling interval.  When the poller stops finding MCEs, it
 	triggers an exponential backoff (poll less often) on the polling
 	interval. The check_interval variable is both the initial and
-	maximum polling interval.
+	maximum polling interval. 0 means no polling for corrected machine
+	check errors (but some corrected errors might still be reported
+	in other ways)
 
 tolerant
 	Tolerance level. When a machine check exception occurs for a non
-- 
1.6.0.2


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH 21/31] x86: MCE: Support more than 256 CPUs in struct mce
  2009-05-26 23:54                                       ` [PATCH 20/31] x86: MCE: Improve documentation Andi Kleen
@ 2009-05-26 23:54                                         ` Andi Kleen
  2009-05-26 23:54                                           ` [PATCH 22/31] x86: MCE: Extend struct mce user interface with more information Andi Kleen
  0 siblings, 1 reply; 49+ messages in thread
From: Andi Kleen @ 2009-05-26 23:54 UTC (permalink / raw)
  To: linux-kernel, hpa, x86; +Cc: Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

The old struct mce was limited to 256 CPUs, but x86 Linux now supports
more than that with x2apic. Add a new field extcpu to report the extended number.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/include/asm/mce.h              |    4 ++--
 arch/x86/kernel/cpu/mcheck/mce-inject.c |   10 +++++-----
 arch/x86/kernel/cpu/mcheck/mce.c        |    4 ++--
 3 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index 6f45a87..ee7d427 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -40,9 +40,9 @@ struct mce {
 	__u64 res2;	/* dito. */
 	__u8  cs;		/* code segment */
 	__u8  bank;	/* machine check bank */
-	__u8  cpu;	/* cpu that raised the error */
+	__u8  cpu;	/* cpu number; obsolete; use extcpu now */
 	__u8  finished;   /* entry is valid */
-	__u32 pad;
+	__u32 extcpu;	/* linux cpu number that detected the error */
 };
 
 /*
diff --git a/arch/x86/kernel/cpu/mcheck/mce-inject.c b/arch/x86/kernel/cpu/mcheck/mce-inject.c
index 58afac4..e6cfa1a 100644
--- a/arch/x86/kernel/cpu/mcheck/mce-inject.c
+++ b/arch/x86/kernel/cpu/mcheck/mce-inject.c
@@ -23,14 +23,14 @@
 /* Update fake mce registers on current CPU. */
 static void inject_mce(struct mce *m)
 {
-	struct mce *i = &per_cpu(injectm, m->cpu);
+	struct mce *i = &per_cpu(injectm, m->extcpu);
 
 	/* Make sure noone reads partially written injectm */
 	i->finished = 0;
 	mb();
 	m->finished = 0;
 	/* First set the fields after finished */
-	i->cpu = m->cpu;
+	i->extcpu = m->extcpu;
 	mb();
 	/* Now write record in order, finished last (except above) */
 	memcpy(i, m, sizeof(struct mce));
@@ -49,7 +49,7 @@ static void raise_mce(unsigned long data)
 {
 	struct delayed_mce *dm = (struct delayed_mce *)data;
 	struct mce *m = &dm->m;
-	int cpu = m->cpu;
+	int cpu = m->extcpu;
 
 	inject_mce(m);
 	if (m->status & MCI_STATUS_UC) {
@@ -93,7 +93,7 @@ static ssize_t mce_write(struct file *filp, const char __user *ubuf,
 	if (copy_from_user(&m, ubuf, usize))
 		return -EFAULT;
 
-	if (m.cpu >= NR_CPUS || !cpu_online(m.cpu))
+	if (m.extcpu >= NR_CPUS || !cpu_online(m.extcpu))
 		return -EINVAL;
 
 	dm = kmalloc(sizeof(struct delayed_mce), GFP_KERNEL);
@@ -108,7 +108,7 @@ static ssize_t mce_write(struct file *filp, const char __user *ubuf,
 	memcpy(&dm->m, &m, sizeof(struct mce));
 	setup_timer(&dm->timer, raise_mce, (unsigned long)dm);
 	dm->timer.expires = jiffies + 2;
-	add_timer_on(&dm->timer, m.cpu);
+	add_timer_on(&dm->timer, m.extcpu);
 	return usize;
 }
 
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index c31cbcf..aa87530 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -108,7 +108,7 @@ static inline int skip_bank_init(int i)
 void mce_setup(struct mce *m)
 {
 	memset(m, 0, sizeof(struct mce));
-	m->cpu = smp_processor_id();
+	m->cpu = m->extcpu = smp_processor_id();
 	rdtscll(m->tsc);
 }
 
@@ -172,7 +172,7 @@ static void print_mce(struct mce *m)
 	       KERN_EMERG "HARDWARE ERROR\n"
 	       KERN_EMERG
 	       "CPU %d: Machine Check Exception: %16Lx Bank %d: %016Lx\n",
-	       m->cpu, m->mcgstatus, m->bank, m->status);
+	       m->extcpu, m->mcgstatus, m->bank, m->status);
 	if (m->ip) {
 		printk(KERN_EMERG "RIP%s %02x:<%016Lx> ",
 		       !(m->mcgstatus & MCG_STATUS_EIPV) ? " !INEXACT!" : "",
-- 
1.6.0.2


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH 22/31] x86: MCE: Extend struct mce user interface with more information.
  2009-05-26 23:54                                         ` [PATCH 21/31] x86: MCE: Support more than 256 CPUs in struct mce Andi Kleen
@ 2009-05-26 23:54                                           ` Andi Kleen
  2009-05-26 23:54                                             ` [PATCH 23/31] x86: MCE: Add MCE poll count to /proc/interrupts Andi Kleen
  0 siblings, 1 reply; 49+ messages in thread
From: Andi Kleen @ 2009-05-26 23:54 UTC (permalink / raw)
  To: linux-kernel, hpa, x86; +Cc: Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

Experience has shown that struct mce, which is used to pass a machine
check to the user space daemon, currently has a few limitations. Some
data which is useful to print at panic level is also missing.

This patch addresses most of them. The same information is also
printed out together with the mce panic output.

struct mce can be painlessly extended in a compatible way; the mcelog user space
code just ignores additional fields with a warning.

- It doesn't provide a wall time timestamp. There have been a few
complaints about that. Fix that by adding a 64bit time_t.
- It doesn't provide the exact CPU identification. This makes
it awkward for mcelog to decode the event correctly, especially
when there are variations in the supported MCE codes on different
CPU models or when mcelog is running on a different host after a panic.
Previously the administrator had to specify the correct CPU
when mcelog ran on a different host, but with more variation
in machine checks now it's better to auto-detect that.
It's also useful for more detailed analysis of CPU events.
Pass CPUID 1.EAX and the cpu vendor (as encoded in processor.h)
instead (a decoding sketch follows below this list).
- Socket ID and initial APIC ID are useful to report because they
make it possible to identify the failing CPU in some (not all) cases.
This is especially useful for the panic situation.
This addresses one of the complaints from Thomas Gleixner earlier.
- The MCG capabilities MSR needs to be reported for some advanced
error processing in mcelog.
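
A hedged sketch of how a decoder could turn the new cpuid field back
into a display family/model (this follows the standard CPUID signature
layout and is not code from this series; the helper name is made up):

#include <stdint.h>
#include <stdio.h>

/* Decode a CPUID leaf 1 EAX signature as stored in the new cpuid field. */
static void decode_signature(uint32_t eax, unsigned *family, unsigned *model)
{
	unsigned base_fam = (eax >> 8) & 0xf;
	unsigned fam = base_fam;
	unsigned mod = (eax >> 4) & 0xf;

	if (base_fam == 0xf)
		fam += (eax >> 20) & 0xff;		/* extended family */
	if (base_fam == 0x6 || base_fam == 0xf)
		mod |= ((eax >> 16) & 0xf) << 4;	/* extended model */

	*family = fam;
	*model = mod;
}

int main(void)
{
	unsigned family, model;

	/* 0x106a4 is an illustrative Nehalem-era signature */
	decode_signature(0x106a4, &family, &model);
	printf("family %u model 0x%x\n", family, model);
	return 0;
}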

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/include/asm/mce.h       |   10 ++++++++--
 arch/x86/kernel/cpu/mcheck/mce.c |   12 ++++++++++++
 2 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index ee7d427..e958322 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -36,13 +36,19 @@ struct mce {
 	__u64 mcgstatus;
 	__u64 ip;
 	__u64 tsc;	/* cpu time stamp counter */
-	__u64 res1;	/* for future extension */
-	__u64 res2;	/* dito. */
+	__u64 time;	/* wall time_t when error was detected */
+	__u8  cpuvendor;	/* cpu vendor as encoded in system.h */
+	__u8  pad1;
+	__u16 pad2;
+	__u32 cpuid;	/* CPUID 1 EAX */
 	__u8  cs;		/* code segment */
 	__u8  bank;	/* machine check bank */
 	__u8  cpu;	/* cpu number; obsolete; use extcpu now */
 	__u8  finished;   /* entry is valid */
 	__u32 extcpu;	/* linux cpu number that detected the error */
+	__u32 socketid;	/* CPU socket ID */
+	__u32 apicid;	/* CPU initial apic ID */
+	__u64 mcgcap;	/* MCGCAP MSR: machine check capabilities of CPU */
 };
 
 /*
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index aa87530..04cb0d4 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -110,6 +110,15 @@ void mce_setup(struct mce *m)
 	memset(m, 0, sizeof(struct mce));
 	m->cpu = m->extcpu = smp_processor_id();
 	rdtscll(m->tsc);
+	/* We hope get_seconds stays lockless */
+	m->time = get_seconds();
+	m->cpuvendor = boot_cpu_data.x86_vendor;
+	m->cpuid = cpuid_eax(1);
+#ifdef CONFIG_SMP
+	m->socketid = cpu_data(m->extcpu).phys_proc_id;
+#endif
+	m->apicid = cpu_data(m->extcpu).initial_apicid;
+	rdmsrl(MSR_IA32_MCG_CAP, m->mcgcap);
 }
 
 DEFINE_PER_CPU(struct mce, injectm);
@@ -187,6 +196,9 @@ static void print_mce(struct mce *m)
 	if (m->misc)
 		printk("MISC %llx ", m->misc);
 	printk("\n");
+	printk(KERN_EMERG "PROCESSOR %u:%x TIME %llu SOCKET %u APIC %x\n",
+			m->cpuvendor, m->cpuid, m->time, m->socketid,
+			m->apicid);
 	printk(KERN_EMERG "This is not a software problem!\n");
 	printk(KERN_EMERG "Run through mcelog --ascii to decode "
 	       "and contact your hardware vendor\n");
-- 
1.6.0.2


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH 23/31] x86: MCE: Add MCE poll count to /proc/interrupts
  2009-05-26 23:54                                           ` [PATCH 22/31] x86: MCE: Extend struct mce user interface with more information Andi Kleen
@ 2009-05-26 23:54                                             ` Andi Kleen
  2009-05-26 23:54                                               ` [PATCH 24/31] x86: MCE: Don't print backtrace on machine checks with DEBUG_BUGVERBOSE Andi Kleen
  0 siblings, 1 reply; 49+ messages in thread
From: Andi Kleen @ 2009-05-26 23:54 UTC (permalink / raw)
  To: linux-kernel, hpa, x86; +Cc: Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

Keep a count of the machine check polls (or CMCI events) in /proc/interrupts.

Andi needs this for debugging, but it's also useful in general to see
what's going on in the kernel.
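
With this applied the MCE block of /proc/interrupts would look roughly
like this (the counts are purely illustrative; the field layout follows
the seq_printf calls in the patch below):

 MCE:          0          0          0          0   Machine check exceptions
 MCP:        612        610        609        607   Machine check polls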

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/include/asm/mce.h       |    1 +
 arch/x86/kernel/cpu/mcheck/mce.c |    4 ++++
 arch/x86/kernel/irq.c            |    4 ++++
 3 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index e958322..2c4209d 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -131,6 +131,7 @@ static inline void mce_amd_feature_init(struct cpuinfo_x86 *c) { }
 extern int mce_available(struct cpuinfo_x86 *c);
 
 DECLARE_PER_CPU(unsigned, mce_exception_count);
+DECLARE_PER_CPU(unsigned, mce_poll_count);
 
 void mce_log_therm_throt_event(__u64 status);
 
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 04cb0d4..c511d05 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -374,6 +374,8 @@ static void mce_report_event(struct pt_regs *regs)
 #endif
 }
 
+DEFINE_PER_CPU(unsigned, mce_poll_count);
+
 /*
  * Poll for corrected events or events that happened before reset.
  * Those are just logged through /dev/mcelog.
@@ -385,6 +387,8 @@ void machine_check_poll(enum mcp_flags flags, mce_banks_t *b)
 	struct mce m;
 	int i;
 
+	__get_cpu_var(mce_poll_count)++;
+
 	mce_setup(&m);
 
 	m.mcgstatus = mce_rdmsrl(MSR_IA32_MCG_STATUS);
diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index 14047b9..67f7916 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -102,6 +102,10 @@ static int show_other_interrupts(struct seq_file *p, int prec)
 	for_each_online_cpu(j)
 		seq_printf(p, "%10u ", per_cpu(mce_exception_count, j));
 	seq_printf(p, "  Machine check exceptions\n");
+	seq_printf(p, "%*s: ", prec, "MCP");
+	for_each_online_cpu(j)
+		seq_printf(p, "%10u ", per_cpu(mce_poll_count, j));
+	seq_printf(p, "  Machine check polls\n");
 #endif
 	seq_printf(p, "%*s: %10u\n", prec, "ERR", atomic_read(&irq_err_count));
 #if defined(CONFIG_X86_IO_APIC)
-- 
1.6.0.2


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH 24/31] x86: MCE: Don't print backtrace on machine checks with DEBUG_BUGVERBOSE
  2009-05-26 23:54                                             ` [PATCH 23/31] x86: MCE: Add MCE poll count to /proc/interrupts Andi Kleen
@ 2009-05-26 23:54                                               ` Andi Kleen
  2009-05-26 23:54                                                 ` [PATCH 25/31] x86: MCE: Implement new status bits v2 Andi Kleen
  0 siblings, 1 reply; 49+ messages in thread
From: Andi Kleen @ 2009-05-26 23:54 UTC (permalink / raw)
  To: linux-kernel, hpa, x86; +Cc: Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

CONFIG_DEBUG_BUGVERBOSE prints backtraces on panics, but we don't
want that on machine checks because these are not kernel bugs.
So disable it here.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 kernel/panic.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/kernel/panic.c b/kernel/panic.c
index 984b3ec..fa3b607 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -71,7 +71,8 @@ NORET_TYPE void panic(const char * fmt, ...)
 	va_end(args);
 	printk(KERN_EMERG "Kernel panic - not syncing: %s\n",buf);
 #ifdef CONFIG_DEBUG_BUGVERBOSE
-	dump_stack();
+	if (!strstr(fmt, "Machine check"))
+		dump_stack();
 #endif
 
 	/*
-- 
1.6.0.2


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH 25/31] x86: MCE: Implement new status bits v2
  2009-05-26 23:54                                               ` [PATCH 24/31] x86: MCE: Don't print backtrace on machine checks with DEBUG_BUGVERBOSE Andi Kleen
@ 2009-05-26 23:54                                                 ` Andi Kleen
  2009-05-26 23:54                                                   ` [PATCH 26/31] x86: MCE: Export MCE severities coverage via debugfs Andi Kleen
  0 siblings, 1 reply; 49+ messages in thread
From: Andi Kleen @ 2009-05-26 23:54 UTC (permalink / raw)
  To: linux-kernel, hpa, x86; +Cc: Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

The x86 architecture recently added some new machine check status bits:
S(ignalled) and AR (Action-Required). Signalled makes it possible to check
if a specific event caused an exception or was just logged through CMCI.
AR allows the kernel to decide if an event needs immediate action
or can be delayed or ignored.

Implement support for these new status bits. mce_severity() uses
the new bits to grade the machine check correctly and decide what
to do. The exception handler uses AR to decide to kill or not.
The S bit is used to separate events between the poll/CMCI handler
and the exception handler.
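
As a rough illustration of how the new bits partition events (a
simplified, hedged sketch, not the kernel's mce_severity() table; it
ignores EN/OVER/PCC and the kernel/user context, and the bit positions
are copied from the <asm/mce.h> hunk below):

#include <stdint.h>
#include <stdio.h>

#define MCI_STATUS_UC (1ULL << 61)	/* pre-existing uncorrected bit */
#define MCI_STATUS_S  (1ULL << 56)	/* Signaled machine check */
#define MCI_STATUS_AR (1ULL << 55)	/* Action required */

static const char *classify(uint64_t status)
{
	if (!(status & MCI_STATUS_UC))
		return "corrected: left to the poll/CMCI handler";
	if (!(status & MCI_STATUS_S))
		return "UCNA: uncorrected, not signalled, no action required";
	if (status & MCI_STATUS_AR)
		return "action required: kill or panic";
	return "action optional: can be handled lazily";
}

int main(void)
{
	/* illustrative status word: UC and S set, AR clear */
	printf("%s\n", classify(MCI_STATUS_UC | MCI_STATUS_S));
	return 0;
}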

Classical UC always leads to panic. That was true before anyway
because the existing CPUs always passed a PCC with it.

Also correct the rules for whether to kill in user or kernel context
and how to handle a missing RIPV.

The machine check handler largely uses the mce-severity grading
engine now instead of making its own decisions. This means the logic
is centralized in one place. This is useful because it has to be evaluated
multiple times.

v2: Some rule fixes; Add AO events
Fix RIPV, RIPV|EIPV order (Ying Huang)
Fix UCNA with AR=1 message (Ying Huang)
Add comment about panicing in m_c_p.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/include/asm/mce.h                |   10 ++++
 arch/x86/kernel/cpu/mcheck/mce-internal.h |    5 ++
 arch/x86/kernel/cpu/mcheck/mce-severity.c |   82 ++++++++++++++++++++++++++--
 arch/x86/kernel/cpu/mcheck/mce.c          |   85 +++++++++++++++--------------
 4 files changed, 137 insertions(+), 45 deletions(-)

diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index 2c4209d..0cc8175 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -15,6 +15,7 @@
 #define MCG_CTL_P		(1ULL << 8)  /* MCG_CAP register available */
 #define MCG_EXT_P		(1ULL << 9)  /* Extended registers available */
 #define MCG_CMCI_P		(1ULL << 10)  /* CMCI supported */
+#define MCG_SER_P	 (1ULL<<24)  /* MCA recovery/new status bits */
 
 #define MCG_STATUS_RIPV  (1ULL<<0)   /* restart ip valid */
 #define MCG_STATUS_EIPV  (1ULL<<1)   /* ip points to correct instruction */
@@ -27,6 +28,15 @@
 #define MCI_STATUS_MISCV (1ULL<<59)  /* misc error reg. valid */
 #define MCI_STATUS_ADDRV (1ULL<<58)  /* addr reg. valid */
 #define MCI_STATUS_PCC   (1ULL<<57)  /* processor context corrupt */
+#define MCI_STATUS_S	 (1ULL<<56)  /* Signaled machine check */
+#define MCI_STATUS_AR	 (1ULL<<55)  /* Action required */
+
+/* MISC register defines */
+#define MCM_ADDR_SEGOFF  0	/* segment offset */
+#define MCM_ADDR_LINEAR  1	/* linear address */
+#define MCM_ADDR_PHYS	 2	/* physical address */
+#define MCM_ADDR_MEM	 3	/* memory address */
+#define MCM_ADDR_GENERIC 7	/* generic */
 
 /* Fields are zero when not available */
 struct mce {
diff --git a/arch/x86/kernel/cpu/mcheck/mce-internal.h b/arch/x86/kernel/cpu/mcheck/mce-internal.h
index f126b4a..54dcb8f 100644
--- a/arch/x86/kernel/cpu/mcheck/mce-internal.h
+++ b/arch/x86/kernel/cpu/mcheck/mce-internal.h
@@ -2,9 +2,14 @@
 
 enum severity_level {
 	MCE_NO_SEVERITY,
+	MCE_KEEP_SEVERITY,
 	MCE_SOME_SEVERITY,
+	MCE_AO_SEVERITY,
 	MCE_UC_SEVERITY,
+	MCE_AR_SEVERITY,
 	MCE_PANIC_SEVERITY,
 };
 
 int mce_severity(struct mce *a, int tolerant, char **msg);
+
+extern int mce_ser;
diff --git a/arch/x86/kernel/cpu/mcheck/mce-severity.c b/arch/x86/kernel/cpu/mcheck/mce-severity.c
index c189e89..4f4d2ca 100644
--- a/arch/x86/kernel/cpu/mcheck/mce-severity.c
+++ b/arch/x86/kernel/cpu/mcheck/mce-severity.c
@@ -19,43 +19,117 @@
  * first. Since there are quite a lot of combinations test the bits in a
  * table-driven way. The rules are simply processed in order, first
  * match wins.
+ *
+ * Note this is only used for machine check exceptions, the corrected
+ * errors use much simpler rules. The exceptions still check for the corrected
+ * errors, but only to leave them alone for the CMCI handler (except for
+ * panic situations)
  */
 
+enum context { IN_KERNEL = 1, IN_USER = 2 };
+enum ser { SER_REQUIRED = 1, NO_SER = 2 };
+
 static struct severity {
 	u64 mask;
 	u64 result;
 	unsigned char sev;
 	unsigned char mcgmask;
 	unsigned char mcgres;
+	unsigned char ser;
+	unsigned char context;
 	char *msg;
 } severities[] = {
+#define KERNEL .context = IN_KERNEL
+#define USER .context = IN_USER
+#define SER .ser = SER_REQUIRED
+#define NOSER .ser = NO_SER
 #define SEV(s) .sev = MCE_ ## s ## _SEVERITY
 #define BITCLR(x, s, m, r...) { .mask = x, .result = 0, SEV(s), .msg = m, ## r }
 #define BITSET(x, s, m, r...) { .mask = x, .result = x, SEV(s), .msg = m, ## r }
 #define MCGMASK(x, res, s, m, r...) \
 	{ .mcgmask = x, .mcgres = res, SEV(s), .msg = m, ## r }
+#define MASK(x, y, s, m, r...) \
+	{ .mask = x, .result = y, SEV(s), .msg = m, ## r }
+#define MCI_UC_S (MCI_STATUS_UC|MCI_STATUS_S)
+#define MCI_UC_SAR (MCI_STATUS_UC|MCI_STATUS_S|MCI_STATUS_AR)
+#define MCACOD 0xffff
+
 	BITCLR(MCI_STATUS_VAL, NO, "Invalid"),
 	BITCLR(MCI_STATUS_EN, NO, "Not enabled"),
 	BITSET(MCI_STATUS_PCC, PANIC, "Processor context corrupt"),
-	MCGMASK(MCG_STATUS_RIPV, 0, PANIC, "No restart IP"),
+	/* When MCIP is not set something is very confused */
+	MCGMASK(MCG_STATUS_MCIP, 0, PANIC, "MCIP not set in MCA handler"),
+	/* Neither return not error IP -- no chance to recover -> PANIC */
+	MCGMASK(MCG_STATUS_RIPV|MCG_STATUS_EIPV, 0, PANIC,
+		"Neither restart nor error IP"),
+	MCGMASK(MCG_STATUS_RIPV, 0, PANIC, "In kernel and no restart IP",
+		KERNEL),
+	BITCLR(MCI_STATUS_UC, KEEP, "Corrected error", NOSER),
+	MASK(MCI_STATUS_OVER|MCI_STATUS_UC|MCI_STATUS_EN, MCI_STATUS_UC, SOME,
+	     "Spurious not enabled", SER),
+
+	/* ignore OVER for UCNA */
+	MASK(MCI_UC_SAR, MCI_STATUS_UC, KEEP,
+	     "Uncorrected no action required", SER),
+	MASK(MCI_STATUS_OVER|MCI_UC_SAR, MCI_STATUS_UC|MCI_STATUS_AR, PANIC,
+	     "Illegal combination (UCNA with AR=1)", SER),
+	MASK(MCI_STATUS_S, 0, KEEP, "Non signalled machine check", SER),
+
+	/* AR add known MCACODs here */
+	MASK(MCI_STATUS_OVER|MCI_UC_SAR, MCI_STATUS_OVER|MCI_UC_SAR, PANIC,
+	     "Action required with lost events", SER),
+	MASK(MCI_STATUS_OVER|MCI_UC_SAR|MCACOD, MCI_UC_SAR, PANIC,
+	     "Action required; unknown MCACOD", SER),
+
+	/* known AO MCACODs: */
+	MASK(MCI_UC_SAR|MCI_STATUS_OVER|0xfff0, MCI_UC_S|0xc0, AO,
+	     "Action optional: memory scrubbing error", SER),
+	MASK(MCI_UC_SAR|MCI_STATUS_OVER|MCACOD, MCI_UC_S|0x17a, AO,
+	     "Action optional: last level cache writeback error", SER),
+
+	MASK(MCI_STATUS_OVER|MCI_UC_SAR, MCI_UC_S, SOME,
+	     "Action optional unknown MCACOD", SER),
+	MASK(MCI_STATUS_OVER|MCI_UC_SAR, MCI_UC_S|MCI_STATUS_OVER, SOME,
+	     "Action optional with lost events", SER),
 	BITSET(MCI_STATUS_UC|MCI_STATUS_OVER, PANIC, "Overflowed uncorrected"),
 	BITSET(MCI_STATUS_UC, UC, "Uncorrected"),
 	BITSET(0, SOME, "No match")	/* always matches. keep at end */
 };
 
+/*
+ * If the EIPV bit is set, it means the saved IP is the
+ * instruction which caused the MCE.
+ */
+static int error_context(struct mce *m)
+{
+	if (m->mcgstatus & MCG_STATUS_EIPV)
+		return (m->ip && (m->cs & 3) == 3) ? IN_USER : IN_KERNEL;
+	/* Unknown, assume kernel */
+	return IN_KERNEL;
+}
+
 int mce_severity(struct mce *a, int tolerant, char **msg)
 {
+	enum context ctx = error_context(a);
 	struct severity *s;
+
 	for (s = severities;; s++) {
 		if ((a->status & s->mask) != s->result)
 			continue;
 		if ((a->mcgstatus & s->mcgmask) != s->mcgres)
 			continue;
-		if (s->sev > MCE_NO_SEVERITY && (a->status & MCI_STATUS_UC) &&
-			tolerant < 1)
-			return MCE_PANIC_SEVERITY;
+		if (s->ser == SER_REQUIRED && !mce_ser)
+			continue;
+		if (s->ser == NO_SER && mce_ser)
+			continue;
+		if (s->context && ctx != s->context)
+			continue;
 		if (msg)
 			*msg = s->msg;
+		if (s->sev >= MCE_UC_SEVERITY && ctx == IN_KERNEL) {
+			if (panic_on_oops || tolerant < 1)
+				return MCE_PANIC_SEVERITY;
+		}
 		return s->sev;
 	}
 }
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index c511d05..8c55c86 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -83,6 +83,7 @@ static int			rip_msr;
 static int			mce_bootlog = -1;
 static int			monarch_timeout = -1;
 static int			mce_panic_timeout;
+int				mce_ser;
 
 static char			trigger[128];
 static char			*trigger_argv[2] = { trigger, NULL };
@@ -381,6 +382,15 @@ DEFINE_PER_CPU(unsigned, mce_poll_count);
  * Those are just logged through /dev/mcelog.
  *
  * This is executed in standard interrupt context.
+ *
+ * Note: the spec recommends panicking for fatal unsignalled
+ * errors here. However this would be quite problematic --
+ * we would need to reimplement the Monarch handling and
+ * it would mess up the exclusion between exception handler
+ * and poll handler -- so we skip this for now.
+ * These cases should not happen anyways, or only when the CPU
+ * is already totally confused. In this case it's likely it will
+ * not fully execute the machine check handler either.
  */
 void machine_check_poll(enum mcp_flags flags, mce_banks_t *b)
 {
@@ -407,13 +417,13 @@ void machine_check_poll(enum mcp_flags flags, mce_banks_t *b)
 			continue;
 
 		/*
-		 * Uncorrected events are handled by the exception handler
-		 * when it is enabled. But when the exception is disabled log
-		 * everything.
+		 * Uncorrected or signalled events are handled by the exception
+		 * handler when it is enabled, so don't process those here.
 		 *
 		 * TBD do the same check for MCI_STATUS_EN here?
 		 */
-		if ((m.status & MCI_STATUS_UC) && !(flags & MCP_UC))
+		if (!(flags & MCP_UC) &&
+		    (m.status & (mce_ser ? MCI_STATUS_S : MCI_STATUS_UC)))
 			continue;
 
 		if (m.status & MCI_STATUS_MISCV)
@@ -780,6 +790,12 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 	barrier();
 
 	/*
+	 * When there is no restart IP we must always kill or panic.
+	 */
+	if (!(m.mcgstatus & MCG_STATUS_RIPV))
+		kill_it = 1;
+
+	/*
 	 * Go through all the banks in exclusion of the other CPUs.
 	 * This way we don't report duplicated events on shared banks
 	 * because the first one to see it will clear it.
@@ -799,10 +815,11 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 			continue;
 
 		/*
-		 * Non uncorrected errors are handled by machine_check_poll
-		 * Leave them alone, unless this panics.
+		 * Non uncorrected or non signaled errors are handled by
+		 * machine_check_poll. Leave them alone, unless this panics.
 		 */
-		if ((m.status & MCI_STATUS_UC) == 0 && !no_way_out)
+		if (!(m.status & (mce_ser ? MCI_STATUS_S : MCI_STATUS_UC)) &&
+			!no_way_out)
 			continue;
 
 		/*
@@ -810,17 +827,16 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 		 */
 		add_taint(TAINT_MACHINE_CHECK);
 
-		__set_bit(i, toclear);
+		severity = mce_severity(&m, tolerant, NULL);
 
-		if (m.status & MCI_STATUS_EN) {
-			/*
-			 * If this error was uncorrectable and there was
-			 * an overflow, we're in trouble.  If no overflow,
-			 * we might get away with just killing a task.
-			 */
-			if (m.status & MCI_STATUS_UC)
-				kill_it = 1;
-		} else {
+		/*
+		 * When the machine check was for the corrected handler don't touch,
+		 * unless we're panicking.
+		 */
+		if (severity == MCE_KEEP_SEVERITY && !no_way_out)
+			continue;
+		__set_bit(i, toclear);
+		if (severity == MCE_NO_SEVERITY) {
 			/*
 			 * Machine check event was not enabled. Clear, but
 			 * ignore.
@@ -828,6 +844,12 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 			continue;
 		}
 
+		/*
+		 * Kill on action required.
+		 */
+		if (severity == MCE_AR_SEVERITY)
+			kill_it = 1;
+
 		if (m.status & MCI_STATUS_MISCV)
 			m.misc = mce_rdmsrl(MSR_IA32_MC0_MISC + i*4);
 		if (m.status & MCI_STATUS_ADDRV)
@@ -836,7 +858,6 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 		mce_get_rip(&m, regs);
 		mce_log(&m);
 
-		severity = mce_severity(&m, tolerant, NULL);
 		if (severity > worst) {
 			*final = m;
 			worst = severity;
@@ -869,29 +890,8 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 	 * one task, do that.  If the user has set the tolerance very
 	 * high, don't try to do anything at all.
 	 */
-	if (kill_it && tolerant < 3) {
-		int user_space = 0;
-
-		/*
-		 * If the EIPV bit is set, it means the saved IP is the
-		 * instruction which caused the MCE.
-		 */
-		if (m.mcgstatus & MCG_STATUS_EIPV)
-			user_space = final->ip && (final->cs & 3);
-
-		/*
-		 * If we know that the error was in user space, send a
-		 * SIGBUS.  Otherwise, panic if tolerance is low.
-		 *
-		 * force_sig() takes an awful lot of locks and has a slight
-		 * risk of deadlocking.
-		 */
-		if (user_space && tolerant > 0) {
-			force_sig(SIGBUS, current);
-		} else if (panic_on_oops || tolerant < 2) {
-			mce_panic("Uncorrected machine check", final, msg);
-		}
-	}
+	if (kill_it && tolerant < 3)
+		force_sig(SIGBUS, current);
 
 	/* notify userspace ASAP */
 	set_thread_flag(TIF_MCE_NOTIFY);
@@ -1041,6 +1041,9 @@ static int mce_cap_init(void)
 	    ((cap >> MCG_EXT_CNT_SHIFT) & MCG_BANKCNT_MASK) >= 9)
 		rip_msr = MSR_IA32_MCG_EIP;
 
+	if (cap & MCG_SER_P)
+		mce_ser = 1;
+
 	return 0;
 }
 
-- 
1.6.0.2


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH 26/31] x86: MCE: Export MCE severities coverage via debugfs
  2009-05-26 23:54                                                 ` [PATCH 25/31] x86: MCE: Implement new status bits v2 Andi Kleen
@ 2009-05-26 23:54                                                   ` Andi Kleen
  2009-05-26 23:54                                                     ` [PATCH 27/31] x86: MCE: Print header/footer only once for multiple MCEs Andi Kleen
  0 siblings, 1 reply; 49+ messages in thread
From: Andi Kleen @ 2009-05-26 23:54 UTC (permalink / raw)
  To: linux-kernel, hpa, x86; +Cc: Huang Ying, Andi Kleen

From: Huang Ying <ying.huang@intel.com>

The MCE severity judgement code is data-driven, so code coverage tools such as
gcov cannot be used for measuring coverage. Instead a dedicated coverage
mechanism is implemented.  The kernel keeps track of rules executed
and reports them in debugfs.

This is useful for increasing coverage of the mce-test testsuite.

Right now it's unconditionally enabled because it's very little code.
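
Expected usage would be along these lines (assuming debugfs is mounted
at /sys/kernel/debug; per the code below, reading prints one
"covered<TAB>rule message" line per rule and any write resets the
counters):

  # cat /sys/kernel/debug/mce/severities-coverage
  1	Invalid
  0	Not enabled
  ...
  # echo > /sys/kernel/debug/mce/severities-coverage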

Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/kernel/cpu/mcheck/mce-severity.c |   83 +++++++++++++++++++++++++++++
 1 files changed, 83 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce-severity.c b/arch/x86/kernel/cpu/mcheck/mce-severity.c
index 4f4d2ca..ff0807f 100644
--- a/arch/x86/kernel/cpu/mcheck/mce-severity.c
+++ b/arch/x86/kernel/cpu/mcheck/mce-severity.c
@@ -10,6 +10,9 @@
  * Author: Andi Kleen
  */
 #include <linux/kernel.h>
+#include <linux/seq_file.h>
+#include <linux/init.h>
+#include <linux/debugfs.h>
 #include <asm/mce.h>
 
 #include "mce-internal.h"
@@ -37,6 +40,7 @@ static struct severity {
 	unsigned char mcgres;
 	unsigned char ser;
 	unsigned char context;
+	unsigned char covered;
 	char *msg;
 } severities[] = {
 #define KERNEL .context = IN_KERNEL
@@ -126,6 +130,7 @@ int mce_severity(struct mce *a, int tolerant, char **msg)
 			continue;
 		if (msg)
 			*msg = s->msg;
+		s->covered = 1;
 		if (s->sev >= MCE_UC_SEVERITY && ctx == IN_KERNEL) {
 			if (panic_on_oops || tolerant < 1)
 				return MCE_PANIC_SEVERITY;
@@ -133,3 +138,81 @@ int mce_severity(struct mce *a, int tolerant, char **msg)
 		return s->sev;
 	}
 }
+
+static void *s_start(struct seq_file *f, loff_t *pos)
+{
+	if (*pos >= ARRAY_SIZE(severities))
+		return NULL;
+	return &severities[*pos];
+}
+
+static void *s_next(struct seq_file *f, void *data, loff_t *pos)
+{
+	if (++(*pos) >= ARRAY_SIZE(severities))
+		return NULL;
+	return &severities[*pos];
+}
+
+static void s_stop(struct seq_file *f, void *data)
+{
+}
+
+static int s_show(struct seq_file *f, void *data)
+{
+	struct severity *ser = data;
+	seq_printf(f, "%d\t%s\n", ser->covered, ser->msg);
+	return 0;
+}
+
+static const struct seq_operations severities_seq_ops = {
+	.start	= s_start,
+	.next	= s_next,
+	.stop	= s_stop,
+	.show	= s_show,
+};
+
+static int severities_coverage_open(struct inode *inode, struct file *file)
+{
+	return seq_open(file, &severities_seq_ops);
+}
+
+static ssize_t severities_coverage_write(struct file *file,
+					 const char __user *ubuf,
+					 size_t count, loff_t *ppos)
+{
+	int i;
+	for (i = 0; i < ARRAY_SIZE(severities); i++)
+		severities[i].covered = 0;
+	return count;
+}
+
+static const struct file_operations severities_coverage_fops = {
+	.open		= severities_coverage_open,
+	.release	= seq_release,
+	.read		= seq_read,
+	.write		= severities_coverage_write,
+};
+
+static int __init severities_debugfs_init(void)
+{
+	struct dentry *dmce = NULL, *fseverities_coverage = NULL;
+
+	dmce = debugfs_create_dir("mce", NULL);
+	if (dmce == NULL)
+		goto err_out;
+	fseverities_coverage = debugfs_create_file("severities-coverage",
+						   0444, dmce, NULL,
+						   &severities_coverage_fops);
+	if (fseverities_coverage == NULL)
+		goto err_out;
+
+	return 0;
+
+err_out:
+	if (fseverities_coverage)
+		debugfs_remove(fseverities_coverage);
+	if (dmce)
+		debugfs_remove(dmce);
+	return -ENOMEM;
+}
+late_initcall(severities_debugfs_init);
-- 
1.6.0.2


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH 27/31] x86: MCE: Print header/footer only once for multiple MCEs
  2009-05-26 23:54                                                   ` [PATCH 26/31] x86: MCE: Export MCE severities coverage via debugfs Andi Kleen
@ 2009-05-26 23:54                                                     ` Andi Kleen
  2009-05-26 23:54                                                       ` [PATCH 28/31] x86: MCE: Make non Monarch panic message "Fatal machine check" too v2 Andi Kleen
  2009-05-27  4:31                                                       ` [PATCH 27/31] x86: MCE: Print header/footer only once for multiple MCEs Hidetoshi Seto
  0 siblings, 2 replies; 49+ messages in thread
From: Andi Kleen @ 2009-05-26 23:54 UTC (permalink / raw)
  To: linux-kernel, hpa, x86; +Cc: Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

When multiple MCEs are printed, print the "HARDWARE ERROR" header
and the "This is not a software problem!" footer only once. This
makes the output much more compact with many CPUs.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/kernel/cpu/mcheck/mce.c |   27 +++++++++++++++++----------
 1 files changed, 17 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 8c55c86..94dddc7 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -176,11 +176,13 @@ void mce_log(struct mce *mce)
 	set_bit(0, &notify_user);
 }
 
-static void print_mce(struct mce *m)
+static void print_mce(struct mce *m, int *first)
 {
-	printk(KERN_EMERG "\n"
-	       KERN_EMERG "HARDWARE ERROR\n"
-	       KERN_EMERG
+	if (*first) {
+		printk(KERN_EMERG "\n" KERN_EMERG "HARDWARE ERROR\n");
+		*first = 0;
+	}
+	printk(KERN_EMERG
 	       "CPU %d: Machine Check Exception: %16Lx Bank %d: %016Lx\n",
 	       m->extcpu, m->mcgstatus, m->bank, m->status);
 	if (m->ip) {
@@ -200,9 +202,12 @@ static void print_mce(struct mce *m)
 	printk(KERN_EMERG "PROCESSOR %u:%x TIME %llu SOCKET %u APIC %x\n",
 			m->cpuvendor, m->cpuid, m->time, m->socketid,
 			m->apicid);
-	printk(KERN_EMERG "This is not a software problem!\n");
-	printk(KERN_EMERG "Run through mcelog --ascii to decode "
-	       "and contact your hardware vendor\n");
+}
+
+static void print_mce_tail(void)
+{
+	printk(KERN_EMERG "This is not a software problem!\n"
+	       KERN_EMERG "Run through mcelog --ascii to decode and contact your hardware vendor\n");
 }
 
 #define PANIC_TIMEOUT 5 /* 5 seconds */
@@ -225,6 +230,7 @@ static void wait_for_panic(void)
 static void mce_panic(char *msg, struct mce *final, char *exp)
 {
 	int i;
+	int first = 1;
 
 	/*
 	 * Make sure only one CPU runs in machine check panic
@@ -240,7 +246,7 @@ static void mce_panic(char *msg, struct mce *final, char *exp)
 		struct mce *m = &mcelog.entry[i];
 		if ((m->status & MCI_STATUS_VAL) &&
 			!(m->status & MCI_STATUS_UC))
-			print_mce(m);
+			print_mce(m, &first);
 	}
 	/* Now print uncorrected but with the final one last */
 	for (i = 0; i < MCE_LOG_LEN; i++) {
@@ -248,12 +254,13 @@ static void mce_panic(char *msg, struct mce *final, char *exp)
 		if (!(m->status & MCI_STATUS_VAL))
 			continue;
 		if (!final || memcmp(m, final, sizeof(struct mce)))
-			print_mce(m);
+			print_mce(m, &first);
 	}
 	if (final)
-		print_mce(final);
+		print_mce(final, &first);
 	if (cpu_missing)
 		printk(KERN_EMERG "Some CPUs didn't answer in synchronization\n");
+	print_mce_tail();
 	if (exp)
 		printk(KERN_EMERG "Machine check: %s\n", exp);
 	if (mce_panic_timeout < panic_timeout)
-- 
1.6.0.2


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH 28/31] x86: MCE: Make non Monarch panic message "Fatal machine check" too v2
  2009-05-26 23:54                                                     ` [PATCH 27/31] x86: MCE: Print header/footer only once for multiple MCEs Andi Kleen
@ 2009-05-26 23:54                                                       ` Andi Kleen
  2009-05-26 23:54                                                         ` [PATCH 29/31] x86: MCE: Rename mce_notify_user to mce_notify_irq Andi Kleen
  2009-05-27  4:31                                                       ` [PATCH 27/31] x86: MCE: Print header/footer only once for multiple MCEs Hidetoshi Seto
  1 sibling, 1 reply; 49+ messages in thread
From: Andi Kleen @ 2009-05-26 23:54 UTC (permalink / raw)
  To: linux-kernel, hpa, x86; +Cc: Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

... instead of "Machine check". This is for consistency with the Monarch
panic message.

Based on a report from Ying Huang.

v2: But add a descriptive postfix so that the test suite can distinguish.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/kernel/cpu/mcheck/mce.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 94dddc7..9029aa0 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -889,7 +889,7 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 	 * support MCE broadcasting or it has been disabled.
 	 */
 	if (no_way_out && tolerant < 3)
-		mce_panic("Machine check", final, msg);
+		mce_panic("Fatal machine check on current CPU", final, msg);
 
 	/*
 	 * If the error seems to be unrecoverable, something should be
-- 
1.6.0.2


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH 29/31] x86: MCE: Rename mce_notify_user to mce_notify_irq
  2009-05-26 23:54                                                       ` [PATCH 28/31] x86: MCE: Make non Monarch panic message "Fatal machine check" too v2 Andi Kleen
@ 2009-05-26 23:54                                                         ` Andi Kleen
  2009-05-26 23:54                                                           ` [PATCH 30/31] x86: MCE: Define MCE_VECTOR Andi Kleen
  0 siblings, 1 reply; 49+ messages in thread
From: Andi Kleen @ 2009-05-26 23:54 UTC (permalink / raw)
  To: linux-kernel, hpa, x86; +Cc: Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

Impact: cleanup

Rename the mce_notify_user function to mce_notify_irq. The next
patch will split the wakeup handling into interrupt context and
process context, and it's better to give the function a clearer
name for that.

Contains a fix from Ying Huang.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/include/asm/mce.h                |    2 +-
 arch/x86/kernel/cpu/mcheck/mce-inject.c   |    2 +-
 arch/x86/kernel/cpu/mcheck/mce.c          |   10 +++++-----
 arch/x86/kernel/cpu/mcheck/mce_intel_64.c |    2 +-
 arch/x86/kernel/signal.c                  |    2 +-
 5 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index 0cc8175..52dbac5 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -159,7 +159,7 @@ enum mcp_flags {
 };
 extern void machine_check_poll(enum mcp_flags flags, mce_banks_t *b);
 
-extern int mce_notify_user(void);
+extern int mce_notify_irq(void);
 
 DECLARE_PER_CPU(struct mce, injectm);
 extern struct file_operations mce_chrdev_ops;
diff --git a/arch/x86/kernel/cpu/mcheck/mce-inject.c b/arch/x86/kernel/cpu/mcheck/mce-inject.c
index e6cfa1a..9702f10 100644
--- a/arch/x86/kernel/cpu/mcheck/mce-inject.c
+++ b/arch/x86/kernel/cpu/mcheck/mce-inject.c
@@ -65,7 +65,7 @@ static void raise_mce(unsigned long data)
 		memset(&b, 0xff, sizeof(mce_banks_t));
 		printk(KERN_INFO "Starting machine check poll CPU %d\n", cpu);
 		machine_check_poll(0, &b);
-		mce_notify_user();
+		mce_notify_irq();
 		printk(KERN_INFO "Finished machine check poll on CPU %d\n",
 		       cpu);
 	}
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 9029aa0..a3470b3 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -346,14 +346,14 @@ asmlinkage void smp_mce_self_interrupt(struct pt_regs *regs)
 	ack_APIC_irq();
 	exit_idle();
 	irq_enter();
-	mce_notify_user();
+	mce_notify_irq();
 	irq_exit();
 }
 
 static void mce_report_event(struct pt_regs *regs)
 {
 	if (regs->flags & (X86_VM_MASK|X86_EFLAGS_IF)) {
-		mce_notify_user();
+		mce_notify_irq();
 		return;
 	}
 
@@ -964,7 +964,7 @@ static void mcheck_timer(unsigned long data)
 	 * polling interval, otherwise increase the polling interval.
 	 */
 	n = &__get_cpu_var(next_interval);
-	if (mce_notify_user()) {
+	if (mce_notify_irq()) {
 		*n = max(*n/2, HZ/100);
 	} else {
 		*n = min(*n*2, (int)round_jiffies_relative(check_interval*HZ));
@@ -986,7 +986,7 @@ static DECLARE_WORK(mce_trigger_work, mce_do_trigger);
  * Can be called from interrupt context, but not from machine check/NMI
  * context.
  */
-int mce_notify_user(void)
+int mce_notify_irq(void)
 {
 	/* Not more than two messages every minute */
 	static DEFINE_RATELIMIT_STATE(ratelimit, 60*HZ, 2);
@@ -1011,7 +1011,7 @@ int mce_notify_user(void)
 	}
 	return 0;
 }
-EXPORT_SYMBOL_GPL(mce_notify_user);
+EXPORT_SYMBOL_GPL(mce_notify_irq);
 
 /*
  * Initialize Machine Checks for a CPU.
diff --git a/arch/x86/kernel/cpu/mcheck/mce_intel_64.c b/arch/x86/kernel/cpu/mcheck/mce_intel_64.c
index f28aa4c..be1abfb 100644
--- a/arch/x86/kernel/cpu/mcheck/mce_intel_64.c
+++ b/arch/x86/kernel/cpu/mcheck/mce_intel_64.c
@@ -78,7 +78,7 @@ static int cmci_supported(int *banks)
 static void intel_threshold_interrupt(void)
 {
 	machine_check_poll(MCP_TIMESTAMP, &__get_cpu_var(mce_banks_owned));
-	mce_notify_user();
+	mce_notify_irq();
 }
 
 /*
diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
index d0851e3..d5dc15b 100644
--- a/arch/x86/kernel/signal.c
+++ b/arch/x86/kernel/signal.c
@@ -860,7 +860,7 @@ do_notify_resume(struct pt_regs *regs, void *unused, __u32 thread_info_flags)
 #ifdef CONFIG_X86_NEW_MCE
 	/* notify userspace of pending MCEs */
 	if (thread_info_flags & _TIF_MCE_NOTIFY)
-		mce_notify_user();
+		mce_notify_irq();
 #endif /* CONFIG_X86_64 && CONFIG_X86_MCE */
 
 	/* deal with pending signal delivery */
-- 
1.6.0.2


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH 30/31] x86: MCE: Define MCE_VECTOR
  2009-05-26 23:54                                                         ` [PATCH 29/31] x86: MCE: Rename mce_notify_user to mce_notify_irq Andi Kleen
@ 2009-05-26 23:54                                                           ` Andi Kleen
  2009-05-26 23:54                                                             ` [PATCH 31/31] x86: MCE: Support action-optional machine checks v2 Andi Kleen
  0 siblings, 1 reply; 49+ messages in thread
From: Andi Kleen @ 2009-05-26 23:54 UTC (permalink / raw)
  To: linux-kernel, hpa, x86; +Cc: Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/include/asm/irq_vectors.h |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/irq_vectors.h b/arch/x86/include/asm/irq_vectors.h
index 61f1592..ab92c50 100644
--- a/arch/x86/include/asm/irq_vectors.h
+++ b/arch/x86/include/asm/irq_vectors.h
@@ -25,6 +25,7 @@
  */
 
 #define NMI_VECTOR			0x02
+#define MCE_VECTOR			0x12
 
 /*
  * IDT vectors usable for external interrupt sources start
-- 
1.6.0.2


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH 31/31] x86: MCE: Support action-optional machine checks v2
  2009-05-26 23:54                                                           ` [PATCH 30/31] x86: MCE: Define MCE_VECTOR Andi Kleen
@ 2009-05-26 23:54                                                             ` Andi Kleen
  0 siblings, 0 replies; 49+ messages in thread
From: Andi Kleen @ 2009-05-26 23:54 UTC (permalink / raw)
  To: linux-kernel, hpa, x86; +Cc: Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

Newer Intel CPUs support a new class of machine checks called recoverable
action optional.

Action Optional means that the CPU detected some form of corruption in
the background and tells the OS about it using a machine check
exception. The OS can then take appropriate action, like killing the
process with the corrupted data or logging the event properly to disk.

This is done by the new generic high level memory failure handler added in an
earlier patch. The high level handler takes the address of the failed
memory and does the appropriate action, like killing the process.

In this version of the patch the high level handler is stubbed out
with a weak function to not create a direct dependency on the hwpoison
branch.

The high level handler cannot be called directly from the machine check
exception though, because it has to run in a defined process context to be able
to sleep when taking VM locks (it is not expected to sleep for a long time,
it just does so in some exceptional cases like lock contention).

Thus the MCE handler has to queue a work item for process context,
trigger process context and then call the high level handler from there.

This patch adds two paths to process context: through a per thread kernel exit
notify_user() callback or through a high priority work item.  The first
runs when the process exits back to user space, the other when it goes
to sleep and there is no higher priority process.

The machine check handler will schedule both, and whoever runs first
will grab the event. This is done because quick reaction to this
event is critical to avoid a potentially more fatal machine check
when the corruption is consumed.

There is a simple lockless ring buffer to queue the corrupted
addresses between the exception handler and the process context handler.
Then in process context it just calls the high level VM code with
the corrupted PFNs.
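
For illustration, the ring boils down to this single-writer/single-reader
scheme (a simplified sketch with made-up names; the real code below is
per-CPU, named mce_ring_add()/mce_ring_get(), and adds the needed write
barrier):

	#define RING_SIZE 16	/* one slot is intentionally kept unused */

	struct ring {
		unsigned short start, end;
		unsigned long pfn[RING_SIZE];
	};

	/* writer: machine check context */
	static int ring_add(struct ring *r, unsigned long pfn)
	{
		unsigned next = (r->end + 1) % RING_SIZE;

		if (next == r->start)	/* full: the AO event is dropped */
			return -1;
		r->pfn[r->end] = pfn;
		r->end = next;		/* publish only after the data is written */
		return 0;
	}

	/* reader: process context */
	static int ring_get(struct ring *r, unsigned long *pfn)
	{
		if (r->start == r->end)	/* empty */
			return 0;
		*pfn = r->pfn[r->start];
		r->start = (r->start + 1) % RING_SIZE;
		return 1;
	}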

The patch adds the code required to extract the failed address from
the CPU's machine check registers. It doesn't try to handle all
possible cases -- the specification has 6 different ways to specify
a memory address -- but only the physical address.

Most of the required checking has already been done earlier in the
mce_severity rule checking engine.  Following the Intel
recommendations Action Optional errors are only enabled for known
situations (encoded in MCACODs). The errors are ignored otherwise,
because they are action optional.

v2: Improve comment, disable preemption while processing ring buffer
    (reported by Ying Huang)

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/include/asm/mce.h       |    1 +
 arch/x86/kernel/cpu/mcheck/mce.c |  133 ++++++++++++++++++++++++++++++++++++++
 arch/x86/kernel/signal.c         |    2 +-
 3 files changed, 135 insertions(+), 1 deletions(-)

diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index 52dbac5..2cd8f6a 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -160,6 +160,7 @@ enum mcp_flags {
 extern void machine_check_poll(enum mcp_flags flags, mce_banks_t *b);
 
 extern int mce_notify_irq(void);
+extern void mce_notify_process(void);
 
 DECLARE_PER_CPU(struct mce, injectm);
 extern struct file_operations mce_chrdev_ops;
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index a3470b3..7cdbdf9 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -31,6 +31,7 @@
 #include <linux/nmi.h>
 #include <linux/cpu.h>
 #include <linux/fs.h>
+#include <linux/mm.h>
 
 #include <asm/processor.h>
 #include <asm/uaccess.h>
@@ -105,6 +106,8 @@ static inline int skip_bank_init(int i)
 	return i < BITS_PER_LONG && test_bit(i, &dont_init_banks);
 }
 
+static DEFINE_PER_CPU(struct work_struct, mce_work);
+
 /* Do initial initialization of a struct mce */
 void mce_setup(struct mce *m)
 {
@@ -311,6 +314,61 @@ static void mce_wrmsrl(u32 msr, u64 v)
 	wrmsrl(msr, v);
 }
 
+/*
+ * Simple lockless ring to communicate PFNs from the exception handler with the
+ * process context work function. This is vastly simplified because there's
+ * only a single reader and a single writer.
+ */
+#define MCE_RING_SIZE 16	/* we use one entry less */
+
+struct mce_ring {
+	unsigned short start;
+	unsigned short end;
+	unsigned long ring[MCE_RING_SIZE];
+};
+static DEFINE_PER_CPU(struct mce_ring, mce_ring);
+
+/* Runs with CPU affinity in workqueue */
+static int mce_ring_empty(void)
+{
+	struct mce_ring *r = &__get_cpu_var(mce_ring);
+
+	return r->start == r->end;
+}
+
+static int mce_ring_get(unsigned long *pfn)
+{
+	struct mce_ring *r;
+	int ret = 0;
+
+	*pfn = 0;
+	get_cpu();
+	r = &__get_cpu_var(mce_ring);
+	if (r->start == r->end)
+		goto out;
+	*pfn = r->ring[r->start];
+	r->start = (r->start + 1) % MCE_RING_SIZE;
+	ret = 1;
+out:
+	put_cpu();
+	return ret;
+}
+
+/* Always runs in MCE context with preempt off */
+static int mce_ring_add(unsigned long pfn)
+{
+	struct mce_ring *r = &__get_cpu_var(mce_ring);
+	unsigned next;
+
+	next = (r->end + 1) % MCE_RING_SIZE;
+	if (next == r->start)
+		return -1;
+	r->ring[r->end] = pfn;
+	wmb();
+	r->end = next;
+	return 0;
+}
+
 int mce_available(struct cpuinfo_x86 *c)
 {
 	if (mce_disabled)
@@ -336,6 +394,15 @@ static inline void mce_get_rip(struct mce *m, struct pt_regs *regs)
 		m->ip = mce_rdmsrl(rip_msr);
 }
 
+static void mce_schedule_work(void)
+{
+	if (!mce_ring_empty()) {
+		struct work_struct *work = &__get_cpu_var(mce_work);
+		if (!work_pending(work))
+			schedule_work(work);
+	}
+}
+
 /*
  * Called after interrupts have been reenabled again
  * when a MCE happened during an interrupts off region
@@ -347,6 +414,7 @@ asmlinkage void smp_mce_self_interrupt(struct pt_regs *regs)
 	exit_idle();
 	irq_enter();
 	mce_notify_irq();
+	mce_schedule_work();
 	irq_exit();
 }
 
@@ -354,6 +422,13 @@ static void mce_report_event(struct pt_regs *regs)
 {
 	if (regs->flags & (X86_VM_MASK|X86_EFLAGS_IF)) {
 		mce_notify_irq();
+		/*
+		 * Triggering the work queue here is just an insurance
+		 * policy in case the syscall exit notify handler
+		 * doesn't run soon enough or ends up running on the
+		 * wrong CPU (can happen when audit sleeps)
+		 */
+		mce_schedule_work();
 		return;
 	}
 
@@ -728,6 +803,23 @@ reset:
 	return ret;
 }
 
+/*
+ * Check if the address reported by the CPU is in a format we can parse.
+ * It would be possible to add code for most other cases, but all would
+ * be somewhat complicated (e.g. segment offset would require an instruction
+ * parser). So only support physical addresses up to page granularity for now.
+ */
+static int mce_usable_address(struct mce *m)
+{
+	if (!(m->status & MCI_STATUS_MISCV) || !(m->status & MCI_STATUS_ADDRV))
+		return 0;
+	if ((m->misc & 0x3f) > PAGE_SHIFT)
+		return 0;
+	if (((m->misc >> 6) & 7) != MCM_ADDR_PHYS)
+		return 0;
+	return 1;
+}
+
 static void mce_clear_state(unsigned long *toclear)
 {
 	int i;
@@ -862,6 +954,16 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 		if (m.status & MCI_STATUS_ADDRV)
 			m.addr = mce_rdmsrl(MSR_IA32_MC0_ADDR + i*4);
 
+		/*
+		 * Action optional error. Queue address for later processing.
+		 * When the ring overflows we just ignore the AO error.
+		 * RED-PEN add some logging mechanism when
+		 * usable_address or mce_add_ring fails.
+		 * RED-PEN don't ignore overflow for tolerant == 0
+		 */
+		if (severity == MCE_AO_SEVERITY && mce_usable_address(&m))
+			mce_ring_add(m.addr >> PAGE_SHIFT);
+
 		mce_get_rip(&m, regs);
 		mce_log(&m);
 
@@ -912,6 +1014,36 @@ out:
 }
 EXPORT_SYMBOL_GPL(do_machine_check);
 
+/* dummy to break dependency. actual code is in mm/memory-failure.c */
+void __attribute__((weak)) memory_failure(unsigned long pfn, unsigned vector)
+{
+	printk(KERN_ERR "Action optional memory failure at %lx ignored\n", pfn);
+}
+
+/*
+ * Called after mce notification in process context. This code
+ * is allowed to sleep. Call the high level VM handler to process
+ * any corrupted pages.
+ * Assume that the work queue code only calls this one at a time
+ * per CPU.
+ * Note we don't disable preemption, so this code might run on the wrong
+ * CPU. In this case the event is picked up by the scheduled work queue.
+ * This is merely a fast path to expedite processing in some common
+ * cases.
+ */
+void mce_notify_process(void)
+{
+	unsigned long pfn;
+	mce_notify_irq();
+	while (mce_ring_get(&pfn))
+		memory_failure(pfn, MCE_VECTOR);
+}
+
+static void mce_process_work(struct work_struct *dummy)
+{
+	mce_notify_process();
+}
+
 #ifdef CONFIG_X86_MCE_INTEL
 /***
  * mce_log_therm_throt_event - Logs the thermal throttling event to mcelog
@@ -1202,6 +1334,7 @@ void __cpuinit mcheck_init(struct cpuinfo_x86 *c)
 	mce_init();
 	mce_cpu_features(c);
 	mce_init_timer();
+	INIT_WORK(&__get_cpu_var(mce_work), mce_process_work);
 }
 
 /*
diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
index d5dc15b..4976888 100644
--- a/arch/x86/kernel/signal.c
+++ b/arch/x86/kernel/signal.c
@@ -860,7 +860,7 @@ do_notify_resume(struct pt_regs *regs, void *unused, __u32 thread_info_flags)
 #ifdef CONFIG_X86_NEW_MCE
 	/* notify userspace of pending MCEs */
 	if (thread_info_flags & _TIF_MCE_NOTIFY)
-		mce_notify_irq();
+		mce_notify_process();
 #endif /* CONFIG_X86_64 && CONFIG_X86_MCE */
 
 	/* deal with pending signal delivery */
-- 
1.6.0.2


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [PATCH 02/31] x86: MCE: Improve mce_get_rip v3
  2009-05-26 23:54   ` [PATCH 02/31] x86: MCE: Improve mce_get_rip v3 Andi Kleen
  2009-05-26 23:54     ` [PATCH 03/31] x86: MCE: Fix EIPV behaviour with !PCC Andi Kleen
@ 2009-05-27  4:29     ` Hidetoshi Seto
  2009-05-27  7:23       ` Andi Kleen
  2009-05-27  4:29     ` [PATCH] x86: MCE: Fix for getting IP/CS at MCE Hidetoshi Seto
  2 siblings, 1 reply; 49+ messages in thread
From: Hidetoshi Seto @ 2009-05-27  4:29 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, hpa, x86, Huang Ying, Andi Kleen

Andi Kleen wrote:
> From: Huang Ying <ying.huang@intel.com>
> 
> Assume RIP is valid when either EIPV or RIPV are set.

Bad description.
If RIP means the "restart IP", which is valid only if RIPV is set,
this sentence doesn't really make sense.

The following would be better:
"Assume the IP stored on the stack indicates the address of the instruction
 at the time of the MCE when either EIPV or RIPV is set."

> This influences
> whether the machine check exception handler decides to return or panic.

I suppose you are pointing logics in:

                mce_get_rip(&m, regs);
 :
                        panicm = m;
 :
                /*
                 * If the EIPV bit is set, it means the saved IP is the
                 * instruction which caused the MCE.
                 */
                if (m.mcgstatus & MCG_STATUS_EIPV)
                        user_space = panicm.ip && (panicm.cs & 3);

                /*
                 * If we know that the error was in user space, send a
                 * SIGBUS.  Otherwise, panic if tolerance is low.
                 *
                 * force_sig() takes an awful lot of locks and has a slight
                 * risk of deadlocking.
                 */
                if (user_space) {
                        force_sig(SIGBUS, current);
                } else if (panic_on_oops || tolerant < 2) {
                        mce_panic("Uncorrected machine check",
                                &panicm, mcestart);
                }

So EIPV without RIPV will leave no IP and will result in a panic,
while the expected result is SIGBUS.
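
To spell that out against the code quoted above (a condensed sketch, not a
complete listing, assuming the default tolerant level so the panic branch
is reachable):

	/* current mce_get_rip(), EIPV set but RIPV clear, no rip_msr: */
	m->ip = 0;				/* only RIPV is checked */
	m->cs = 0;
	...
	if (m.mcgstatus & MCG_STATUS_EIPV)
		user_space = panicm.ip && (panicm.cs & 3);  /* ip == 0 -> false */
	...
	/* user_space stays false -> mce_panic() instead of force_sig(SIGBUS) */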

> This fixes a test case in the mce-test suite and is more compliant
> to the specification.

It would be better to describe the test case that needs this fix.

> This currently only makes a difference in a artificial testing
> scenario with the mce-test test suite.
> 
> Also in addition do not force the RIP to be valid with the exact
> register MSRs.

I think the forced one is EIP:
> -		m->mcgstatus |= MCG_STATUS_EIPV;

And please note that it keeps using the CS from the stack even if the MSR is available.

I made an alternative patch for this, with no functional change.
Please consider replacing.


Thanks,
H.Seto

> [AK: combination of patches from Huang Ying and Hidetoshi Seto, with
> new description by me]
> Signed-off-by: Huang Ying <ying.huang@intel.com>
> Signed-off-by: Andi Kleen <ak@linux.intel.com>
> Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
> ---
>  arch/x86/kernel/cpu/mcheck/mce.c |   13 +++++++------
>  1 files changed, 7 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
> index 249e3cf..3f158d7 100644
> --- a/arch/x86/kernel/cpu/mcheck/mce.c
> +++ b/arch/x86/kernel/cpu/mcheck/mce.c
> @@ -247,21 +247,22 @@ int mce_available(struct cpuinfo_x86 *c)
>  	return cpu_has(c, X86_FEATURE_MCE) && cpu_has(c, X86_FEATURE_MCA);
>  }
>  
> +/*
> + * Get the address of the instruction at the time of the machine check
> + * error.
> + */
>  static inline void mce_get_rip(struct mce *m, struct pt_regs *regs)
>  {
> -	if (regs && (m->mcgstatus & MCG_STATUS_RIPV)) {
> +
> +	if (regs && (m->mcgstatus & (MCG_STATUS_RIPV|MCG_STATUS_EIPV))) {
>  		m->ip = regs->ip;
>  		m->cs = regs->cs;
>  	} else {
>  		m->ip = 0;
>  		m->cs = 0;
>  	}
> -	if (rip_msr) {
> -		/* Assume the RIP in the MSR is exact. Is this true? */
> -		m->mcgstatus |= MCG_STATUS_EIPV;
> +	if (rip_msr)
>  		m->ip = mce_rdmsrl(rip_msr);
> -		m->cs = 0;
> -	}
>  }
>  
>  /*


^ permalink raw reply	[flat|nested] 49+ messages in thread

* [PATCH] x86: MCE: Fix for getting IP/CS at MCE
  2009-05-26 23:54   ` [PATCH 02/31] x86: MCE: Improve mce_get_rip v3 Andi Kleen
  2009-05-26 23:54     ` [PATCH 03/31] x86: MCE: Fix EIPV behaviour with !PCC Andi Kleen
  2009-05-27  4:29     ` [PATCH 02/31] x86: MCE: Improve mce_get_rip v3 Hidetoshi Seto
@ 2009-05-27  4:29     ` Hidetoshi Seto
  2 siblings, 0 replies; 49+ messages in thread
From: Hidetoshi Seto @ 2009-05-27  4:29 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, hpa, x86, Huang Ying, Andi Kleen

Assume the IP stored on the stack indicates the address of the instruction at
the time of the MCE when either EIPV or RIPV is set.  This influences
whether the machine check exception handler decides to return or panic.

This fixes an EIPV test case in the mce-test suite and is more compliant
to the specification.

This currently only makes a difference in an artificial testing
scenario with the mce-test test suite.

Also, do not set the EIPV bit just because the RIP MSR is present,
and keep trusting the CS value on the stack even if the MSR is available.

[AK: combination of patches from Huang Ying and Hidetoshi Seto, with
 new description by me]
[HS: change misleading function name, add comment and some description
 based on "Improve mce_get_rip v3" ]
Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
---
 arch/x86/kernel/cpu/mcheck/mce.c |   21 +++++++++++++--------
 1 files changed, 13 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 249e3cf..1bb5958 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -247,21 +247,26 @@ int mce_available(struct cpuinfo_x86 *c)
 	return cpu_has(c, X86_FEATURE_MCE) && cpu_has(c, X86_FEATURE_MCA);
 }
 
-static inline void mce_get_rip(struct mce *m, struct pt_regs *regs)
+/*
+ * Get the address of the instruction at the time of the machine check
+ * error.
+ */
+static inline void mce_get_ip_at_error(struct mce *m, struct pt_regs *regs)
 {
-	if (regs && (m->mcgstatus & MCG_STATUS_RIPV)) {
+	/* We can trust IP & CS on stack if either RIPV or EIPV is valid */
+	if (regs && (m->mcgstatus & (MCG_STATUS_RIPV|MCG_STATUS_EIPV))) {
 		m->ip = regs->ip;
 		m->cs = regs->cs;
 	} else {
 		m->ip = 0;
 		m->cs = 0;
 	}
-	if (rip_msr) {
-		/* Assume the RIP in the MSR is exact. Is this true? */
-		m->mcgstatus |= MCG_STATUS_EIPV;
+	/*
+	 * Use MSR if available.
+	 * Since there is no MSR for CS, keep in trust it on stack.
+	 */
+	if (rip_msr)
 		m->ip = mce_rdmsrl(rip_msr);
-		m->cs = 0;
-	}
 }
 
 /*
@@ -431,7 +436,7 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 		if (m.status & MCI_STATUS_ADDRV)
 			m.addr = mce_rdmsrl(MSR_IA32_MC0_ADDR + i*4);
 
-		mce_get_rip(&m, regs);
+		mce_get_ip_at_error(&m, regs);
 		mce_log(&m);
 
 		/*
-- 
1.6.3



^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [PATCH 03/31] x86: MCE: Fix EIPV behaviour with !PCC
  2009-05-26 23:54     ` [PATCH 03/31] x86: MCE: Fix EIPV behaviour with !PCC Andi Kleen
  2009-05-26 23:54       ` [PATCH 04/31] x86: MCE: Use extended sysattrs for the check_interval attribute Andi Kleen
@ 2009-05-27  4:30       ` Hidetoshi Seto
  2009-05-27  7:38         ` Andi Kleen
  1 sibling, 1 reply; 49+ messages in thread
From: Hidetoshi Seto @ 2009-05-27  4:30 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, hpa, x86, Huang Ying, Andi Kleen

Andi Kleen wrote:
> From: Huang Ying <ying.huang@intel.com>
> 
> Impact: Spec compliance
> 
> When tolerant == 0, system should go panic instead of send SIGBUS even
> for a MCE with EPIV && !PCC on user space.
> 
> Signed-off-by: Huang Ying <ying.huang@intel.com>
> Signed-off-by: Andi Kleen <ak@linux.intel.com>
> ---
>  arch/x86/kernel/cpu/mcheck/mce.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
> index 3f158d7..e1271ac 100644
> --- a/arch/x86/kernel/cpu/mcheck/mce.c
> +++ b/arch/x86/kernel/cpu/mcheck/mce.c
> @@ -485,7 +485,7 @@ void do_machine_check(struct pt_regs *regs, long error_code)
>  		 * force_sig() takes an awful lot of locks and has a slight
>  		 * risk of deadlocking.
>  		 */
> -		if (user_space) {
> +		if (user_space && tolerant > 0) {
>  			force_sig(SIGBUS, current);
>  		} else if (panic_on_oops || tolerant < 2) {
>  			mce_panic("Uncorrected machine check",

Again, this patch is useless.

                if (m.status & MCI_STATUS_EN) {
                        /* if PCC was set, there's no way out */
                        no_way_out |= !!(m.status & MCI_STATUS_PCC);
                        /*
                         * If this error was uncorrectable and there was
                         * an overflow, we're in trouble.  If no overflow,
                         * we might get away with just killing a task.
                         */
                        if (m.status & MCI_STATUS_UC) {
                                if (tolerant < 1 || m.status & MCI_STATUS_OVER)
                                        no_way_out = 1;
                                kill_it = 1;
                        }
                } else {
 :
        if (no_way_out && tolerant < 3)
                mce_panic("Machine check", &panicm, mcestart);
 :
        if (kill_it && tolerant < 3) {
 :
                if (user_space) {
                        force_sig(SIGBUS, current);
                } else if (panic_on_oops || tolerant < 2) {
                        mce_panic("Uncorrected machine check",
                                &panicm, mcestart);
                }
        }

We never reach there with tolerant == 0.
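
For example, with tolerant == 0 an uncorrected error already takes the
panic path (condensed from the code above):

	if (m.status & MCI_STATUS_UC) {
		if (tolerant < 1 || m.status & MCI_STATUS_OVER)
			no_way_out = 1;		/* taken, since 0 < 1 */
		kill_it = 1;
	}
	...
	if (no_way_out && tolerant < 3)
		mce_panic(...);			/* panics before the SIGBUS code runs */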

And you will remove this code in your 25/31.


Thanks,
H.Seto


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 06/31] x86: Fix panic with interrupts off (needed for MCE)
  2009-05-26 23:54           ` [PATCH 06/31] x86: Fix panic with interrupts off (needed for MCE) Andi Kleen
  2009-05-26 23:54             ` [PATCH 07/31] x86: MCE: Log corrected errors when panicing Andi Kleen
@ 2009-05-27  4:30             ` Hidetoshi Seto
  2009-05-27  7:05               ` Andi Kleen
  1 sibling, 1 reply; 49+ messages in thread
From: Hidetoshi Seto @ 2009-05-27  4:30 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, hpa, x86, Andi Kleen

Andi Kleen wrote:
> @@ -89,11 +89,7 @@
>  
>  #define THRESHOLD_APIC_VECTOR		0xf9
>  
> -#ifdef CONFIG_X86_32
> -/* 0xf9 : free */
> -#else
> -# define UV_BAU_MESSAGE			0xf8
> -#endif
> +#define REBOOT_VECTOR			0xf8
>  
>  /* f0-f7 used for spreading out TLB flushes: */
>  #define INVALIDATE_TLB_VECTOR_END	0xf7
> @@ -117,6 +113,8 @@
>   */
>  #define GENERIC_INTERRUPT_VECTOR	0xed
>  
> +#define UV_BAU_MESSAGE			0xec
> +
>  /*
>   * First APIC vector available to drivers: (vectors 0x30-0xee) we
>   * start at 0x31(0x41) to spread out vectors evenly between priority

#ifdef CONFIG_X86_64
#define UV_BAU_MESSAGE			0xec
#endif

Isn't it?


Thanks,
H.Seto


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 19/31] x86: MCE: Default to panic timeout for machine checks v2
  2009-05-26 23:54                                     ` [PATCH 19/31] x86: MCE: Default to panic timeout for machine checks v2 Andi Kleen
  2009-05-26 23:54                                       ` [PATCH 20/31] x86: MCE: Improve documentation Andi Kleen
@ 2009-05-27  4:31                                       ` Hidetoshi Seto
  2009-05-27  7:24                                         ` Andi Kleen
  2009-05-27  4:31                                       ` [PATCH] x86: MCE: Fix for mce_panic_timeout Hidetoshi Seto
  2 siblings, 1 reply; 49+ messages in thread
From: Hidetoshi Seto @ 2009-05-27  4:31 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, hpa, x86, Andi Kleen

Andi Kleen wrote:
> From: Andi Kleen <ak@linux.intel.com>
> 
> Fatal machine checks can be logged to disk after boot, but only if
> the system did a warm reboot. That's unfortunately difficult with the
> default panic behaviour, which waits forever and the admin has to
> press the power button because modern systems usually miss a reset button.
> This clears the machine checks in the registers and make
> it impossible to log them.
> 
> This patch changes the default for machine check panic to always
> reboot after 30s. Then the mce can be successfully logged after
> reboot.
> 
> I believe this will improve machine check experience for any
> system running the X server.
> 
> This is dependent on successfull boot logging of MCEs. This currently
> only works on Intel systems, on AMD there are quite a lot of systems
> around which leave junk in the machine check registers after boot,
> so it's disabled here. These systems will continue to default
> to endless waiting panic.
> 
> v2: Only force panic timeout when it's shorter (H.Seto)
> 
> Signed-off-by: Andi Kleen <ak@linux.intel.com>
> ---

I suppose the original intention is to overwrite a panic_timeout of 0 with 30.

> @@ -240,6 +243,8 @@ static void mce_panic(char *msg, struct mce *final, char *exp)
>  		printk(KERN_EMERG "Some CPUs didn't answer in synchronization\n");
>  	if (exp)
>  		printk(KERN_EMERG "Machine check: %s\n", exp);
> +	if (mce_panic_timeout < panic_timeout)
> +		panic_timeout = mce_panic_timeout;
>  	panic(msg);
>  }
>  
> @@ -1100,6 +1105,8 @@ static void mce_cpu_quirks(struct cpuinfo_x86 *c)
>  	}
>  	if (monarch_timeout < 0)
>  		monarch_timeout = 0;
> +	if (mce_bootlog != 0)
> +		mce_panic_timeout = 30;
>  }
>  
>  static void __cpuinit mce_ancient_init(struct cpuinfo_x86 *c)

It seems it doesn't work.
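
For illustration, with the defaults the new check never fires (condensed
from the hunks quoted above):

	/* panic_timeout defaults to 0, the quirk sets mce_panic_timeout = 30 */
	if (mce_panic_timeout < panic_timeout)	/* 30 < 0 -> false */
		panic_timeout = mce_panic_timeout;
	/* panic_timeout stays 0, so the system still waits forever */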

I made an incremental patch for this fix.  Please consider applying.


Thanks,
H.Seto


^ permalink raw reply	[flat|nested] 49+ messages in thread

* [PATCH] x86: MCE: Fix for mce_panic_timeout
  2009-05-26 23:54                                     ` [PATCH 19/31] x86: MCE: Default to panic timeout for machine checks v2 Andi Kleen
  2009-05-26 23:54                                       ` [PATCH 20/31] x86: MCE: Improve documentation Andi Kleen
  2009-05-27  4:31                                       ` [PATCH 19/31] x86: MCE: Default to panic timeout for machine checks v2 Hidetoshi Seto
@ 2009-05-27  4:31                                       ` Hidetoshi Seto
  2009-05-27 10:07                                         ` Andi Kleen
  2 siblings, 1 reply; 49+ messages in thread
From: Hidetoshi Seto @ 2009-05-27  4:31 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, hpa, x86, Andi Kleen

This fixes:

 - In case of panic_timeout == 0 and mce_bootlog != 0:
   The system should reboot after an mce panic, but it doesn't because the
   current mce code doesn't overwrite a smaller panic_timeout, even when it is 0.

 - In case of panic_timeout > 0 and mce_bootlog == 0:
   The system should reboot after a panic, but it doesn't on an mce panic because
   the current mce code overwrites panic_timeout with 0 (worked example below).
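
Worked example for the second case against the v2 code that is removed
below (the timeout value is picked arbitrarily):

	/* booted with panic_timeout == 60 but MCE boot logging disabled,
	 * so the quirk leaves mce_panic_timeout at its default 0 */
	if (mce_panic_timeout < panic_timeout)		/* 0 < 60 -> true */
		panic_timeout = mce_panic_timeout;	/* now 0: no reboot */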

Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
---
 arch/x86/kernel/cpu/mcheck/mce.c |   19 ++++++++++++-------
 1 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 0b87fb4..40fedb2 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -82,7 +82,7 @@ static unsigned long		notify_user;
 static int			rip_msr;
 static int			mce_bootlog = -1;
 static int			monarch_timeout = -1;
-static int			mce_panic_timeout;
+static int			mce_panic_timeout = 30;
 
 static char			trigger[128];
 static char			*trigger_argv[2] = { trigger, NULL };
@@ -196,6 +196,15 @@ static void print_mce(struct mce *m)
 
 static atomic_t mce_paniced;
 
+/* Let system panic to reboot for logging the fatal machine check */
+static void mce_set_panic_timeout(void)
+{
+	if (!mce_bootlog)
+		return;
+	if (!panic_timeout || mce_panic_timeout < panic_timeout)
+		panic_timeout = mce_panic_timeout;
+}
+
 /* Panic in progress. Enable interrupts and wait for final IPI */
 static void wait_for_panic(void)
 {
@@ -204,8 +213,7 @@ static void wait_for_panic(void)
 	local_irq_enable();
 	while (timeout-- > 0)
 		udelay(1);
-	if (mce_panic_timeout < panic_timeout)
-		panic_timeout = mce_panic_timeout;
+	mce_set_panic_timeout();
 	panic("Panicing machine check CPU died");
 }
 
@@ -243,8 +251,7 @@ static void mce_panic(char *msg, struct mce *final, char *exp)
 		printk(KERN_EMERG "Some CPUs didn't answer in synchronization\n");
 	if (exp)
 		printk(KERN_EMERG "Machine check: %s\n", exp);
-	if (mce_panic_timeout < panic_timeout)
-		panic_timeout = mce_panic_timeout;
+	mce_set_panic_timeout();
 	panic(msg);
 }
 
@@ -1109,8 +1116,6 @@ static void mce_cpu_quirks(struct cpuinfo_x86 *c)
 	}
 	if (monarch_timeout < 0)
 		monarch_timeout = 0;
-	if (mce_bootlog != 0)
-		mce_panic_timeout = 30;
 }
 
 static void __cpuinit mce_ancient_init(struct cpuinfo_x86 *c)
-- 
1.6.3


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [PATCH 27/31] x86: MCE: Print header/footer only once for multiple MCEs
  2009-05-26 23:54                                                     ` [PATCH 27/31] x86: MCE: Print header/footer only once for multiple MCEs Andi Kleen
  2009-05-26 23:54                                                       ` [PATCH 28/31] x86: MCE: Make non Monarch panic message "Fatal machine check" too v2 Andi Kleen
@ 2009-05-27  4:31                                                       ` Hidetoshi Seto
  2009-05-27  7:10                                                         ` Andi Kleen
  1 sibling, 1 reply; 49+ messages in thread
From: Hidetoshi Seto @ 2009-05-27  4:31 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, hpa, x86, Andi Kleen

Andi Kleen wrote:
> From: Andi Kleen <ak@linux.intel.com>
> 
> When multiple MCEs are printed print the "HARDWARE ERROR" header
> and "This is not a software error" footer only once. This
> makes the output much more compact with many CPUs.
> 
> Signed-off-by: Andi Kleen <ak@linux.intel.com>
> ---
>  arch/x86/kernel/cpu/mcheck/mce.c |   27 +++++++++++++++++----------
>  1 files changed, 17 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
> index 8c55c86..94dddc7 100644
> --- a/arch/x86/kernel/cpu/mcheck/mce.c
> +++ b/arch/x86/kernel/cpu/mcheck/mce.c
> @@ -176,11 +176,13 @@ void mce_log(struct mce *mce)
>  	set_bit(0, &notify_user);
>  }
>  
> -static void print_mce(struct mce *m)
> +static void print_mce(struct mce *m, int *first)
>  {
> -	printk(KERN_EMERG "\n"
> -	       KERN_EMERG "HARDWARE ERROR\n"
> -	       KERN_EMERG
> +	if (*first) {
> +		printk(KERN_EMERG "\n" KERN_EMERG "HARDWARE ERROR\n");
> +		*first = 0;
> +	}
> +	printk(KERN_EMERG
>  	       "CPU %d: Machine Check Exception: %16Lx Bank %d: %016Lx\n",
>  	       m->extcpu, m->mcgstatus, m->bank, m->status);
>  	if (m->ip) {

Why don't you have print_mce_head?

I suppose one possible benefit of not having it and doing it like this is that
we can suppress printing the header if there is no log to print.
But on the other hand the footer is always printed.
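
For illustration, the symmetric helper I have in mind would just be
(a sketch, mirroring the printk in the hunk below):

	static void print_mce_head(void)
	{
		printk(KERN_EMERG "\n" KERN_EMERG "HARDWARE ERROR\n");
	}

and mce_panic() could call it once before the loops instead of threading
the "first" flag through print_mce().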

> @@ -200,9 +202,12 @@ static void print_mce(struct mce *m)
>  	printk(KERN_EMERG "PROCESSOR %u:%x TIME %llu SOCKET %u APIC %x\n",
>  			m->cpuvendor, m->cpuid, m->time, m->socketid,
>  			m->apicid);
> -	printk(KERN_EMERG "This is not a software problem!\n");
> -	printk(KERN_EMERG "Run through mcelog --ascii to decode "
> -	       "and contact your hardware vendor\n");
> +}
> +
> +static void print_mce_tail(void)
> +{
> +	printk(KERN_EMERG "This is not a software problem!\n"
> +	       KERN_EMERG "Run through mcelog --ascii to decode and contact your hardware vendor\n");
>  }
>  
>  #define PANIC_TIMEOUT 5 /* 5 seconds */

And if there is no log, do we still need to run mcelog?

> @@ -225,6 +230,7 @@ static void wait_for_panic(void)
>  static void mce_panic(char *msg, struct mce *final, char *exp)
>  {
>  	int i;
> +	int first = 1;
>  
>  	/*
>  	 * Make sure only one CPU runs in machine check panic
> @@ -240,7 +246,7 @@ static void mce_panic(char *msg, struct mce *final, char *exp)
>  		struct mce *m = &mcelog.entry[i];
>  		if ((m->status & MCI_STATUS_VAL) &&
>  			!(m->status & MCI_STATUS_UC))
> -			print_mce(m);
> +			print_mce(m, &first);
>  	}
>  	/* Now print uncorrected but with the final one last */
>  	for (i = 0; i < MCE_LOG_LEN; i++) {
> @@ -248,12 +254,13 @@ static void mce_panic(char *msg, struct mce *final, char *exp)
>  		if (!(m->status & MCI_STATUS_VAL))
>  			continue;
>  		if (!final || memcmp(m, final, sizeof(struct mce)))
> -			print_mce(m);
> +			print_mce(m, &first);
>  	}
>  	if (final)
> -		print_mce(final);
> +		print_mce(final, &first);
>  	if (cpu_missing)
>  		printk(KERN_EMERG "Some CPUs didn't answer in synchronization\n");
> +	print_mce_tail();
>  	if (exp)
>  		printk(KERN_EMERG "Machine check: %s\n", exp);
>  	if (mce_panic_timeout < panic_timeout)

If mce_panic is the only caller of print_mce_tail (and a possible print_mce_head),
it would be better to use printk directly than to have such a sub-function.


Thanks,
H.Seto


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 06/31] x86: Fix panic with interrupts off (needed for MCE)
  2009-05-27  4:30             ` [PATCH 06/31] x86: Fix panic with interrupts off (needed for MCE) Hidetoshi Seto
@ 2009-05-27  7:05               ` Andi Kleen
  0 siblings, 0 replies; 49+ messages in thread
From: Andi Kleen @ 2009-05-27  7:05 UTC (permalink / raw)
  To: Hidetoshi Seto; +Cc: Andi Kleen, linux-kernel, hpa, x86, Andi Kleen

> #ifdef CONFIG_X86_64
> #define UV_BAU_MESSAGE			0xec
> #endif
> 
> Isn't it?

No, it's not, the ifdef is not needed. I see no point at all in adding 
that ifdef. It would only help if 32bit wanted to add a vector
on its own here which is totally unlikely.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 27/31] x86: MCE: Print header/footer only once for multiple MCEs
  2009-05-27  4:31                                                       ` [PATCH 27/31] x86: MCE: Print header/footer only once for multiple MCEs Hidetoshi Seto
@ 2009-05-27  7:10                                                         ` Andi Kleen
  0 siblings, 0 replies; 49+ messages in thread
From: Andi Kleen @ 2009-05-27  7:10 UTC (permalink / raw)
  To: Hidetoshi Seto; +Cc: Andi Kleen, linux-kernel, hpa, x86, Andi Kleen

On Wed, May 27, 2009 at 01:31:36PM +0900, Hidetoshi Seto wrote:
> I suppose one possible benefit of not having it and doing it like this is that
> we can suppress printing the header if there is no log to print.
> But on the other hand the footer is always printed.

If there's a machine check exception there's a hardware error, 
so that message should be printed.

This can happen for example in the "external agent" case.

> >  #define PANIC_TIMEOUT 5 /* 5 seconds */
> 
> And if there is no log, do we still need to run mcelog?

Not manually if it runs correctly, but the kernel doesn't know that in 
advance. So in case someone picks it up from a console log
it might still be needed.

> >  	 * Make sure only one CPU runs in machine check panic
> > @@ -240,7 +246,7 @@ static void mce_panic(char *msg, struct mce *final, char *exp)
> >  		struct mce *m = &mcelog.entry[i];
> >  		if ((m->status & MCI_STATUS_VAL) &&
> >  			!(m->status & MCI_STATUS_UC))
> > -			print_mce(m);
> > +			print_mce(m, &first);
> >  	}
> >  	/* Now print uncorrected but with the final one last */
> >  	for (i = 0; i < MCE_LOG_LEN; i++) {
> > @@ -248,12 +254,13 @@ static void mce_panic(char *msg, struct mce *final, char *exp)
> >  		if (!(m->status & MCI_STATUS_VAL))
> >  			continue;
> >  		if (!final || memcmp(m, final, sizeof(struct mce)))
> > -			print_mce(m);
> > +			print_mce(m, &first);
> >  	}
> >  	if (final)
> > -		print_mce(final);
> > +		print_mce(final, &first);
> >  	if (cpu_missing)
> >  		printk(KERN_EMERG "Some CPUs didn't answer in synchronization\n");
> > +	print_mce_tail();
> >  	if (exp)
> >  		printk(KERN_EMERG "Machine check: %s\n", exp);
> >  	if (mce_panic_timeout < panic_timeout)
> 
> If mce_panic is the only caller of print_mce_tail (and a possible print_mce_head),
> it would be better to use printk directly than to have such a sub-function.

The coding standard recommends avoiding overly long functions, so I split it out.


-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 02/31] x86: MCE: Improve mce_get_rip v3
  2009-05-27  4:29     ` [PATCH 02/31] x86: MCE: Improve mce_get_rip v3 Hidetoshi Seto
@ 2009-05-27  7:23       ` Andi Kleen
  0 siblings, 0 replies; 49+ messages in thread
From: Andi Kleen @ 2009-05-27  7:23 UTC (permalink / raw)
  To: Hidetoshi Seto; +Cc: Andi Kleen, linux-kernel, hpa, x86, Huang Ying, Andi Kleen

On Wed, May 27, 2009 at 01:29:16PM +0900, Hidetoshi Seto wrote:
> Andi Kleen wrote:
> > From: Huang Ying <ying.huang@intel.com>
> > 
> > Assume RIP is valid when either EIPV or RIPV are set.
> 
> Bad description.
> If RIP means the "restart IP", which is valid only if RIPV is set,
> this sentence doesn't really make sense.

No, it doesn't mean restart IP, it just means the normal instruction
pointer, like everywhere else.

> 
> > This influences
> > whether the machine check exception handler decides to return or panic.
> 
> I suppose you are pointing logics in:

Yes.

> 
>                 mce_get_rip(&m, regs);
>  :
>                         panicm = m;
>  :
>                 /*
>                  * If the EIPV bit is set, it means the saved IP is the
>                  * instruction which caused the MCE.
>                  */
>                 if (m.mcgstatus & MCG_STATUS_EIPV)
>                         user_space = panicm.ip && (panicm.cs & 3);
> 
>                 /*
>                  * If we know that the error was in user space, send a
>                  * SIGBUS.  Otherwise, panic if tolerance is low.
>                  *
>                  * force_sig() takes an awful lot of locks and has a slight
>                  * risk of deadlocking.
>                  */
>                 if (user_space) {
>                         force_sig(SIGBUS, current);
>                 } else if (panic_on_oops || tolerant < 2) {
>                         mce_panic("Uncorrected machine check",
>                                 &panicm, mcestart);
>                 }
> 
> So EIPV without RIPV will leave no IP and will result in a panic,
> while the expected result is SIGBUS.

First this is only for the !MCA recovery case. In the MCA recovery
case we have more information and can decide better.

In this case no EIPV means that the kernel isn't sure where the 
error occurred so it cannot safely decide if it was user space 
or kernel space and in the tolerant == 2 case has to panic
just in case a kernel kill would cause deadlock.

With MCA recovery this whole thing is replaced by a new improved
mechanism using the high level handler.

> > 
> > Also in addition do not force the RIP to be valid with the exact
> > register MSRs.
> 
> I think the forced one is EIP:
> > -		m->mcgstatus |= MCG_STATUS_EIPV;

True. Changed.

> 
> And please note that it keeps using the CS from the stack even if the MSR is available.
> 
> I made an alternative patch for this, with no functional change.
> Please consider replacing.

No, sorry I got burned too much last time you touched the description
of this simple patch. I think my description is simple and to the point
and this patch doesn't really deserve anything more.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 19/31] x86: MCE: Default to panic timeout for machine checks v2
  2009-05-27  4:31                                       ` [PATCH 19/31] x86: MCE: Default to panic timeout for machine checks v2 Hidetoshi Seto
@ 2009-05-27  7:24                                         ` Andi Kleen
  0 siblings, 0 replies; 49+ messages in thread
From: Andi Kleen @ 2009-05-27  7:24 UTC (permalink / raw)
  To: Hidetoshi Seto; +Cc: Andi Kleen, linux-kernel, hpa, x86, Andi Kleen

On Wed, May 27, 2009 at 01:31:07PM +0900, Hidetoshi Seto wrote:
> I suppose the original intention is to overwrite a panic_timeout of 0 with 30.

Yes. 

> > @@ -1100,6 +1105,8 @@ static void mce_cpu_quirks(struct cpuinfo_x86 *c)
> >  	}
> >  	if (monarch_timeout < 0)
> >  		monarch_timeout = 0;
> > +	if (mce_bootlog != 0)
> > +		mce_panic_timeout = 30;
> >  }
> >  
> >  static void __cpuinit mce_ancient_init(struct cpuinfo_x86 *c)
> 
> It seems it doesn't work.

Okay, fair point. I applied your earlier review suggestion literally; yes,
that was not correct. I changed it in my patchkit to handle the 0 case too.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 03/31] x86: MCE: Fix EIPV behaviour with !PCC
  2009-05-27  4:30       ` [PATCH 03/31] x86: MCE: Fix EIPV behaviour with !PCC Hidetoshi Seto
@ 2009-05-27  7:38         ` Andi Kleen
  2009-05-27  7:38           ` Huang Ying
  0 siblings, 1 reply; 49+ messages in thread
From: Andi Kleen @ 2009-05-27  7:38 UTC (permalink / raw)
  To: Hidetoshi Seto; +Cc: Andi Kleen, linux-kernel, hpa, x86, Huang Ying, Andi Kleen

> We never reach there with tolerant == 0.

Fair point. Ying do you remember which test case that fixed?

> 
> And you will remove this code in your 25/31.

I usually kept the earlier patches fixing the code before I remove it, just
in case someone drops the later code.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 03/31] x86: MCE: Fix EIPV behaviour with !PCC
  2009-05-27  7:38         ` Andi Kleen
@ 2009-05-27  7:38           ` Huang Ying
  2009-05-27  8:53             ` Andi Kleen
  0 siblings, 1 reply; 49+ messages in thread
From: Huang Ying @ 2009-05-27  7:38 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Hidetoshi Seto, linux-kernel, hpa, x86, Andi Kleen

On Wed, 2009-05-27 at 15:38 +0800, Andi Kleen wrote:
> > We never reach there with tolerant == 0.
> 
> Fair point. Ying do you remember which test case that fixed?

I think I fixed a "bug" that does not exist. So I think maybe we should
drop this patch. Sorry about that.

Best Regards,
Huang Ying


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 03/31] x86: MCE: Fix EIPV behaviour with !PCC
  2009-05-27  7:38           ` Huang Ying
@ 2009-05-27  8:53             ` Andi Kleen
  0 siblings, 0 replies; 49+ messages in thread
From: Andi Kleen @ 2009-05-27  8:53 UTC (permalink / raw)
  To: Huang Ying; +Cc: Andi Kleen, Hidetoshi Seto, linux-kernel, hpa, x86, Andi Kleen

On Wed, May 27, 2009 at 03:38:25PM +0800, Huang Ying wrote:
> On Wed, 2009-05-27 at 15:38 +0800, Andi Kleen wrote:
> > > We never reach there with tolerant == 0.
> > 
> > Fair point. Ying do you remember which test case that fixed?
> 
> I think I fixed a "bug" that does not exist. So I think maybe we should

Or maybe not anymore.

> drop this patch. Sorry about that.

Okay. I dropped it. Thanks everyone.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH] x86: MCE: Fix for mce_panic_timeout
  2009-05-27  4:31                                       ` [PATCH] x86: MCE: Fix for mce_panic_timeout Hidetoshi Seto
@ 2009-05-27 10:07                                         ` Andi Kleen
  2009-05-28  0:52                                           ` Hidetoshi Seto
  0 siblings, 1 reply; 49+ messages in thread
From: Andi Kleen @ 2009-05-27 10:07 UTC (permalink / raw)
  To: Hidetoshi Seto; +Cc: linux-kernel, hpa, x86, Andi Kleen

Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> writes:

> This fixes:

Thanks I had already fixed it on my own.

Updated patch appended.
>
>  - In case of panic_timeout > 0 and mce_bootlog == 0:
>    The system should reboot after a panic, but it doesn't on an mce panic because
>    the current mce code overwrites panic_timeout with 0.

Nope, with bootlog==0 it should _not_ automatically reboot on panic.
Automatic rebooting makes mainly sense with boot logging, otherwise
you will likely lose the information. Or at least the kernel 
cannot know if you lose information or not so it has to err on 
the safe side.

I changed it now to only override with panic_timeout == 0,
as in the user didn't set anything,
that's probably the most sensible semantics anyways.

-Andi

---

x86: MCE: Default to panic timeout for machine checks v3

Fatal machine checks can be logged to disk after boot, but only if
the system did a warm reboot. That's unfortunately difficult with the
default panic behaviour, which waits forever so the admin has to
press the power button, because modern systems usually lack a reset button.
This clears the machine checks in the registers and makes
it impossible to log them.

This patch changes the default for machine check panic to always
reboot after 30s. Then the mce can be successfully logged after
reboot.

I believe this will improve the machine check experience for any 
system running the X server.

This is dependent on successful boot logging of MCEs. This currently
only works on Intel systems, on AMD there are quite a lot of systems
around which leave junk in the machine check registers after boot,
so it's disabled here. These systems will continue to default
to endless waiting panic.

v2: Only force panic timeout when it's shorter (H.Seto)
v3: Only panic when there is no earlier timeout or it's not zero
(based on comment H.Seto)

Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 arch/x86/kernel/cpu/mcheck/mce.c |    7 +++++++
 1 file changed, 7 insertions(+)

Index: linux/arch/x86/kernel/cpu/mcheck/mce.c
===================================================================
--- linux.orig/arch/x86/kernel/cpu/mcheck/mce.c	2009-05-27 11:59:03.000000000 +0200
+++ linux/arch/x86/kernel/cpu/mcheck/mce.c	2009-05-27 12:01:07.000000000 +0200
@@ -82,6 +82,7 @@
 static int			rip_msr;
 static int			mce_bootlog = -1;
 static int			monarch_timeout = -1;
+static int			mce_panic_timeout;
 
 static char			trigger[128];
 static char			*trigger_argv[2] = { trigger, NULL };
@@ -203,6 +204,8 @@
 	local_irq_enable();
 	while (timeout-- > 0)
 		udelay(1);
+	if (panic_timeout == 0)
+		panic_timeout = mce_panic_timeout;
 	panic("Panicing machine check CPU died");
 }
 
@@ -240,6 +243,8 @@
 		printk(KERN_EMERG "Some CPUs didn't answer in synchronization\n");
 	if (exp)
 		printk(KERN_EMERG "Machine check: %s\n", exp);
+	if (panic_timeout == 0)
+		panic_timeout = mce_panic_timeout;
 	panic(msg);
 }
 
@@ -1100,6 +1105,8 @@
 	}
 	if (monarch_timeout < 0)
 		monarch_timeout = 0;
+	if (mce_bootlog != 0)
+		mce_panic_timeout = 30;
 }
 
 static void __cpuinit mce_ancient_init(struct cpuinfo_x86 *c)


-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH] x86: MCE: Fix for mce_panic_timeout
  2009-05-27 10:07                                         ` Andi Kleen
@ 2009-05-28  0:52                                           ` Hidetoshi Seto
  2009-05-28  8:15                                             ` Andi Kleen
  0 siblings, 1 reply; 49+ messages in thread
From: Hidetoshi Seto @ 2009-05-28  0:52 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, hpa, x86, Andi Kleen

Andi Kleen wrote:
> Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> writes:
>>  - In case of panic_timeout > 0 and mce_bootlog == 0:
>>    The system should reboot after a panic, but it doesn't on an mce panic because
>>    the current mce code overwrites panic_timeout with 0.
> 
> Nope, with bootlog==0 it should _not_ automatically reboot on panic.
> Automatic rebooting makes mainly sense with boot logging, otherwise
> you will likely lose the information. Or at least the kernel 
> cannot know if you lose information or not so it has to err on 
> the safe side.

Users already know they might lose the information about the panic
if panic_timeout > 0.  It is the same for all kinds of panic.

> I changed it now to only override with panic_timeout == 0,
> as in the user didn't set anything,
> that's probably the most sensible semantics anyways.

It's OK with me.
The logic in this patch also seems good.

But...

> v2: Only force panic timeout when it's shorter (H.Seto)
> v3: Only panic when there is no earlier timeout or it's not zero
> (based on comment H.Seto)

"Only panic when ..." ?
I could not catch the meaning of this changelog.


Thanks,
H.Seto


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH] x86: MCE: Fix for mce_panic_timeout
  2009-05-28  0:52                                           ` Hidetoshi Seto
@ 2009-05-28  8:15                                             ` Andi Kleen
  0 siblings, 0 replies; 49+ messages in thread
From: Andi Kleen @ 2009-05-28  8:15 UTC (permalink / raw)
  To: Hidetoshi Seto; +Cc: Andi Kleen, linux-kernel, hpa, x86, Andi Kleen

On Thu, May 28, 2009 at 09:52:55AM +0900, Hidetoshi Seto wrote:
> Andi Kleen wrote:
> > Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> writes:
> >>  - In case of panic_timeout > 0 and mce_bootlog == 0:
> >>    The system should reboot after a panic, but it doesn't on an mce panic because
> >>    the current mce code overwrites panic_timeout with 0.
> > 
> > Nope, with bootlog==0 it should _not_ automatically reboot on panic.
> > Automatic rebooting makes mainly sense with boot logging, otherwise
> > you will likely lose the information. Or at least the kernel 
> > cannot know if you lose information or not so it has to err on 
> > the safe side.
> 
> Users already know they might lose the information about the panic
> if panic_timeout > 0.  It is the same for all kinds of panic.

Yes if they set panic=..., but they don't expect it by default as
this patch does.

> 
> > I changed it now to only override with panic_timeout == 0,
> > as in the user didn't set anything,
> > that's probably the most sensible semantics anyways.
> 
> It's OK with me.
> The logic in this patch also seems good.

Thanks.

> 
> But...
> 
> > v2: Only force panic timeout when it's shorter (H.Seto)
> > v3: Only panic when there is no earlier timeout or it's not zero
> > (based on comment H.Seto)
> 
> "Only panic when ..." ?
> I could not catch the meaning of this changelog.

"Only automatically panic ..."

Ok fair enough, if I ever redo a repost I'll update that line, but I hope
I don't have to touch this trivial patch anymore.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 49+ messages in thread

end of thread, other threads:[~2009-05-28  8:09 UTC | newest]

Thread overview: 49+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-05-26 23:54 x86 MCE improvements series for 2.6.31 v2 Andi Kleen
2009-05-26 23:54 ` [PATCH 01/31] x86: MCE: Synchronize core after machine check handling Andi Kleen
2009-05-26 23:54   ` [PATCH 02/31] x86: MCE: Improve mce_get_rip v3 Andi Kleen
2009-05-26 23:54     ` [PATCH 03/31] x86: MCE: Fix EIPV behaviour with !PCC Andi Kleen
2009-05-26 23:54       ` [PATCH 04/31] x86: MCE: Use extended sysattrs for the check_interval attribute Andi Kleen
2009-05-26 23:54         ` [PATCH 05/31] x86: MCE: Add machine check exception count in /proc/interrupts Andi Kleen
2009-05-26 23:54           ` [PATCH 06/31] x86: Fix panic with interrupts off (needed for MCE) Andi Kleen
2009-05-26 23:54             ` [PATCH 07/31] x86: MCE: Log corrected errors when panicing Andi Kleen
2009-05-26 23:54               ` [PATCH 08/31] x86: MCE: Remove unused mce_events variable Andi Kleen
2009-05-26 23:54                 ` [PATCH 09/31] x86: MCE: Remove mce_init unused argument Andi Kleen
2009-05-26 23:54                   ` [PATCH 10/31] x86: MCE: Rename and align out2 label Andi Kleen
2009-05-26 23:54                     ` [PATCH 11/31] x86: MCE: Implement bootstrapping for machine check wakeups Andi Kleen
2009-05-26 23:54                       ` [PATCH 12/31] x86: MCE: Remove TSC print heuristic Andi Kleen
2009-05-26 23:54                         ` [PATCH 13/31] x86: MCE: Drop BKL in mce_open Andi Kleen
2009-05-26 23:54                           ` [PATCH 14/31] x86: MCE: Add table driven machine check grading Andi Kleen
2009-05-26 23:54                             ` [PATCH 15/31] x86: MCE: Check early in exception handler if panic is needed Andi Kleen
2009-05-26 23:54                               ` [PATCH 16/31] x86: MCE: Implement panic synchronization Andi Kleen
2009-05-26 23:54                                 ` [PATCH 17/31] x86: MCE: Switch x86 machine check handler to Monarch election. v2 Andi Kleen
2009-05-26 23:54                                   ` [PATCH 18/31] x86: MCE: Store record length into memory struct mce anchor Andi Kleen
2009-05-26 23:54                                     ` [PATCH 19/31] x86: MCE: Default to panic timeout for machine checks v2 Andi Kleen
2009-05-26 23:54                                       ` [PATCH 20/31] x86: MCE: Improve documentation Andi Kleen
2009-05-26 23:54                                         ` [PATCH 21/31] x86: MCE: Support more than 256 CPUs in struct mce Andi Kleen
2009-05-26 23:54                                           ` [PATCH 22/31] x86: MCE: Extend struct mce user interface with more information Andi Kleen
2009-05-26 23:54                                             ` [PATCH 23/31] x86: MCE: Add MCE poll count to /proc/interrupts Andi Kleen
2009-05-26 23:54                                               ` [PATCH 24/31] x86: MCE: Don't print backtrace on machine checks with DEBUG_BUGVERBOSE Andi Kleen
2009-05-26 23:54                                                 ` [PATCH 25/31] x86: MCE: Implement new status bits v2 Andi Kleen
2009-05-26 23:54                                                   ` [PATCH 26/31] x86: MCE: Export MCE severities coverage via debugfs Andi Kleen
2009-05-26 23:54                                                     ` [PATCH 27/31] x86: MCE: Print header/footer only once for multiple MCEs Andi Kleen
2009-05-26 23:54                                                       ` [PATCH 28/31] x86: MCE: Make non Monarch panic message "Fatal machine check" too v2 Andi Kleen
2009-05-26 23:54                                                         ` [PATCH 29/31] x86: MCE: Rename mce_notify_user to mce_notify_irq Andi Kleen
2009-05-26 23:54                                                           ` [PATCH 30/31] x86: MCE: Define MCE_VECTOR Andi Kleen
2009-05-26 23:54                                                             ` [PATCH 31/31] x86: MCE: Support action-optional machine checks v2 Andi Kleen
2009-05-27  4:31                                                       ` [PATCH 27/31] x86: MCE: Print header/footer only once for multiple MCEs Hidetoshi Seto
2009-05-27  7:10                                                         ` Andi Kleen
2009-05-27  4:31                                       ` [PATCH 19/31] x86: MCE: Default to panic timeout for machine checks v2 Hidetoshi Seto
2009-05-27  7:24                                         ` Andi Kleen
2009-05-27  4:31                                       ` [PATCH] x86: MCE: Fix for mce_panic_timeout Hidetoshi Seto
2009-05-27 10:07                                         ` Andi Kleen
2009-05-28  0:52                                           ` Hidetoshi Seto
2009-05-28  8:15                                             ` Andi Kleen
2009-05-27  4:30             ` [PATCH 06/31] x86: Fix panic with interrupts off (needed for MCE) Hidetoshi Seto
2009-05-27  7:05               ` Andi Kleen
2009-05-27  4:30       ` [PATCH 03/31] x86: MCE: Fix EIPV behaviour with !PCC Hidetoshi Seto
2009-05-27  7:38         ` Andi Kleen
2009-05-27  7:38           ` Huang Ying
2009-05-27  8:53             ` Andi Kleen
2009-05-27  4:29     ` [PATCH 02/31] x86: MCE: Improve mce_get_rip v3 Hidetoshi Seto
2009-05-27  7:23       ` Andi Kleen
2009-05-27  4:29     ` [PATCH] x86: MCE: Fix for getting IP/CS at MCE Hidetoshi Seto
