All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v3 2/2] x86: mce: comment about MCE synchronization timeout on definition of tolerant
  2015-03-03  9:01 [PATCH v3 1/2] x86: mce: kexec: switch MCE handler for kexec/kdump Naoya Horiguchi
@ 2015-03-03  9:01 ` Naoya Horiguchi
  2015-03-03 18:09 ` [PATCH v3 1/2] x86: mce: kexec: switch MCE handler for kexec/kdump Luck, Tony
  2015-03-03 18:53 ` [PATCH v3 1/2] " Borislav Petkov
  2 siblings, 0 replies; 44+ messages in thread
From: Naoya Horiguchi @ 2015-03-03  9:01 UTC (permalink / raw)
  To: Tony Luck, Borislav Petkov
  Cc: Prarit Bhargava, Vivek Goyal, linux-kernel, Junichi Nomura, Kiyoshi Ueda

commit 716079f66eac ("mce: Panic when a core has reached a timeout") changed
the behavior of mca_cfg->tolerant. So let's add comment about it.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 arch/x86/kernel/cpu/mcheck/mce.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git v3.19.orig/arch/x86/kernel/cpu/mcheck/mce.c v3.19/arch/x86/kernel/cpu/mcheck/mce.c
index 6e7730a72b79..6633d4e244b0 100644
--- v3.19.orig/arch/x86/kernel/cpu/mcheck/mce.c
+++ v3.19/arch/x86/kernel/cpu/mcheck/mce.c
@@ -69,8 +69,10 @@ struct mca_config mca_cfg __read_mostly = {
 	/*
 	 * Tolerant levels:
 	 * 0: always panic on uncorrected errors, log corrected errors
-	 * 1: panic or SIGBUS on uncorrected errors, log corrected errors
-	 * 2: SIGBUS or log uncorrected errors (if possible), log corr. errors
+	 * 1: panic or SIGBUS on uncorrected errors, log corrected errors,
+	 *    panic on MCE synchronization timeout.
+	 * 2: SIGBUS or log uncorrected errors (if possible), log corr. errors,
+	 *    no panic on MCE synchronization timeout.
 	 * 3: never panic or SIGBUS, log all errors (for testing only)
 	 */
 	.tolerant = 1,
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH v3 1/2] x86: mce: kexec: switch MCE handler for kexec/kdump
@ 2015-03-03  9:01 Naoya Horiguchi
  2015-03-03  9:01 ` [PATCH v3 2/2] x86: mce: comment about MCE synchronization timeout on definition of tolerant Naoya Horiguchi
                   ` (2 more replies)
  0 siblings, 3 replies; 44+ messages in thread
From: Naoya Horiguchi @ 2015-03-03  9:01 UTC (permalink / raw)
  To: Tony Luck, Borislav Petkov
  Cc: Prarit Bhargava, Vivek Goyal, linux-kernel, Junichi Nomura, Kiyoshi Ueda

kexec disables (or "shoots down") all CPUs other than a crashing CPU before
entering the 2nd kernel. But the MCE handler is still enabled after that,
so if MCE happens and broadcasts over the CPUs after the main thread starts
the 2nd kernel (which might not initialize MCE device yet, or might decide
not to enable it,) MCE handler runs only on the other CPUs (not on the main
thread,) leading to kernel panic with MCE synchronization. The user-visible
effect of this bug is kdump failure.

Our standard MCE handler do_machine_check() assumes some about system's
status and it's hard to alter it to cover kexec/kdump context, so let's add
another kdump-specific one and switch to it.

Note that this problem exists since current MCE handler was implemented in
2.6.32, and recently commit 716079f66eac ("mce: Panic when a core has reached
a timeout") made it more visible by changing the default behavior of the
synchronization timeout from "ignore" to "panic".

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: <stable@vger.kernel.org>        [2.6.32+]
---
ChangeLog v2 -> v3
- go to "switch MCE handler" approach

ChangeLog v1 -> v2
- clear MSR_IA32_MCG_CTL, MSR_IA32_MCx_CTL, and CR4.MCE instead of using
  global flag to ignore MCE events.
- fixed the description of the problem
---
 arch/x86/include/asm/mce.h       |  6 +++++
 arch/x86/kernel/cpu/mcheck/mce.c | 47 ++++++++++++++++++++++++++++++++++++++++
 arch/x86/kernel/crash.c          |  3 +++
 3 files changed, 56 insertions(+)

diff --git v3.19.orig/arch/x86/include/asm/mce.h v3.19/arch/x86/include/asm/mce.h
index 51b26e895933..8010d4b77183 100644
--- v3.19.orig/arch/x86/include/asm/mce.h
+++ v3.19/arch/x86/include/asm/mce.h
@@ -114,6 +114,9 @@ struct mca_config {
 	int monarch_timeout;
 	int panic_timeout;
 	u32 rip_msr;
+#ifdef CONFIG_KEXEC
+	int kdump_cpu;
+#endif
 };
 
 extern struct mca_config mca_cfg;
@@ -175,6 +178,9 @@ static inline void mce_amd_feature_init(struct cpuinfo_x86 *c) { }
 #endif
 
 int mce_available(struct cpuinfo_x86 *c);
+#ifdef CONFIG_KEXEC
+void switch_mce_handler_for_kdump(void);
+#endif
 
 DECLARE_PER_CPU(unsigned, mce_exception_count);
 DECLARE_PER_CPU(unsigned, mce_poll_count);
diff --git v3.19.orig/arch/x86/kernel/cpu/mcheck/mce.c v3.19/arch/x86/kernel/cpu/mcheck/mce.c
index 3112b79ace8e..6e7730a72b79 100644
--- v3.19.orig/arch/x86/kernel/cpu/mcheck/mce.c
+++ v3.19/arch/x86/kernel/cpu/mcheck/mce.c
@@ -1219,6 +1219,36 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 }
 EXPORT_SYMBOL_GPL(do_machine_check);
 
+#ifdef CONFIG_KEXEC
+/*
+ * kdump-specific machine check handler
+ *
+ * When kexec/kdump is running, what the MCE handler is expected to do
+ * changes depending on whether the CPU is running the main thread or not.
+ *
+ * The crashing CPU, controlling the whole system exclusively, should try to
+ * get kdump as hard as possible even if an MCE happens concurrently, because
+ * some types of MCEs (for example, uncorrected errors like SRAO and SRAR,)
+ * are not fatal or don't ruin reliablility of the kdump (consider that an
+ * MCE can hit the other CPU, in which case corrupted data is never consumed.)
+ * If an MCE critically breaks the kdump operation, we are unlucky so let's
+ * accept the fate of whatever HW causes, hoping a dying message reaches admins.
+ *
+ * The other CPUs are supposed to be quiet during kexec/kdump, so after the
+ * crashing CPU shot them down, they should not do anything except clearing
+ * MCG_STATUS (without this the system is reset, which is undesirable.)
+ * Note that this is also true after the crashing CPU enter the 2nd kernel.
+ */
+static void machine_check_under_kdump(struct pt_regs *regs, long error_code)
+{
+	if (mca_cfg.kdump_cpu == smp_processor_id())
+		pr_emerg("MCE triggered when kdumping. If you are lucky enough, you will have a kdump. Otherwise, this is a dying message.\n");
+
+	mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
+	sync_core();
+}
+#endif
+
 #ifndef CONFIG_MEMORY_FAILURE
 int memory_failure(unsigned long pfn, int vector, int flags)
 {
@@ -2104,6 +2134,23 @@ static void mce_syscore_shutdown(void)
 	mce_disable_error_reporting();
 }
 
+#ifdef CONFIG_KEXEC
+/*
+ * Called in kdump entering code to switch the MCE handler to a primitive and
+ * kdump-specific one.
+ *
+ * In kexec/kdump context, getting kdump is prior to handling MCEs because
+ * what the users are really interested in is to find what caused the crashing,
+ * not what caused the crashing to fail. So the kdump-specific MCE handler does
+ * very little things not to disrupt kdumping.
+ */
+void switch_mce_handler_for_kdump(void)
+{
+	mca_cfg.kdump_cpu = smp_processor_id();
+	machine_check_vector = machine_check_under_kdump;
+}
+#endif
+
 /*
  * On resume clear all MCE state. Don't want to see leftovers from the BIOS.
  * Only one CPU is active at this time, the others get re-added later using
diff --git v3.19.orig/arch/x86/kernel/crash.c v3.19/arch/x86/kernel/crash.c
index 6f3baedcb6f6..273805e772f6 100644
--- v3.19.orig/arch/x86/kernel/crash.c
+++ v3.19/arch/x86/kernel/crash.c
@@ -34,6 +34,7 @@
 #include <asm/cpu.h>
 #include <asm/reboot.h>
 #include <asm/virtext.h>
+#include <asm/mce.h>
 
 /* Alignment required for elf header segment */
 #define ELF_CORE_HEADER_ALIGN   4096
@@ -166,6 +167,8 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
 	/* The kernel is broken so disable interrupts */
 	local_irq_disable();
 
+	switch_mce_handler_for_kdump();
+
 	kdump_nmi_shootdown_cpus();
 
 	/*
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* RE: [PATCH v3 1/2] x86: mce: kexec: switch MCE handler for kexec/kdump
  2015-03-03  9:01 [PATCH v3 1/2] x86: mce: kexec: switch MCE handler for kexec/kdump Naoya Horiguchi
  2015-03-03  9:01 ` [PATCH v3 2/2] x86: mce: comment about MCE synchronization timeout on definition of tolerant Naoya Horiguchi
@ 2015-03-03 18:09 ` Luck, Tony
  2015-03-04  7:41   ` [PATCH v4] " Naoya Horiguchi
  2015-03-03 18:53 ` [PATCH v3 1/2] " Borislav Petkov
  2 siblings, 1 reply; 44+ messages in thread
From: Luck, Tony @ 2015-03-03 18:09 UTC (permalink / raw)
  To: Naoya Horiguchi, Borislav Petkov
  Cc: Prarit Bhargava, Vivek Goyal, linux-kernel, Junichi Nomura, Kiyoshi Ueda

+static void machine_check_under_kdump(struct pt_regs *regs, long error_code)
+{
+	if (mca_cfg.kdump_cpu == smp_processor_id())
+		pr_emerg("MCE triggered when kdumping. If you are lucky enough, you will have a kdump. Otherwise, this is a dying message.\n");

I'm worried about the SRAR case here.  Your code just returns, which will trigger the same machine check again. The system will spin forever printing this message.

I think you have to look at MCG_STATUS and scan the machine check banks to make a choice.  There are some simple cases:

  MCG_STATUS.RIPV=0 -> cannot return (where will the cpu go - you have no idea!)
  SRAO -> safe to just return
  SRAR -> should not return

But the rest may require some thought.  If there is a PCC=1 error, then you may end up with a corrupt dump. Perhaps this case will already be covered by RPIV==0?

-Tony

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v3 1/2] x86: mce: kexec: switch MCE handler for kexec/kdump
  2015-03-03  9:01 [PATCH v3 1/2] x86: mce: kexec: switch MCE handler for kexec/kdump Naoya Horiguchi
  2015-03-03  9:01 ` [PATCH v3 2/2] x86: mce: comment about MCE synchronization timeout on definition of tolerant Naoya Horiguchi
  2015-03-03 18:09 ` [PATCH v3 1/2] x86: mce: kexec: switch MCE handler for kexec/kdump Luck, Tony
@ 2015-03-03 18:53 ` Borislav Petkov
  2015-03-04  7:51   ` Naoya Horiguchi
  2 siblings, 1 reply; 44+ messages in thread
From: Borislav Petkov @ 2015-03-03 18:53 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Tony Luck, Prarit Bhargava, Vivek Goyal, linux-kernel,
	Junichi Nomura, Kiyoshi Ueda

On Tue, Mar 03, 2015 at 09:01:49AM +0000, Naoya Horiguchi wrote:
> kexec disables (or "shoots down") all CPUs other than a crashing CPU before
> entering the 2nd kernel. But the MCE handler is still enabled after that,
> so if MCE happens and broadcasts over the CPUs after the main thread starts
> the 2nd kernel (which might not initialize MCE device yet, or might decide
> not to enable it,) MCE handler runs only on the other CPUs (not on the main
> thread,) leading to kernel panic with MCE synchronization. The user-visible
> effect of this bug is kdump failure.
> 
> Our standard MCE handler do_machine_check() assumes some about system's
> status and it's hard to alter it to cover kexec/kdump context, so let's add
> another kdump-specific one and switch to it.
> 
> Note that this problem exists since current MCE handler was implemented in
> 2.6.32, and recently commit 716079f66eac ("mce: Panic when a core has reached
> a timeout") made it more visible by changing the default behavior of the
> synchronization timeout from "ignore" to "panic".
> 
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Cc: <stable@vger.kernel.org>        [2.6.32+]
> ---
> ChangeLog v2 -> v3
> - go to "switch MCE handler" approach
> 
> ChangeLog v1 -> v2
> - clear MSR_IA32_MCG_CTL, MSR_IA32_MCx_CTL, and CR4.MCE instead of using
>   global flag to ignore MCE events.
> - fixed the description of the problem
> ---
>  arch/x86/include/asm/mce.h       |  6 +++++
>  arch/x86/kernel/cpu/mcheck/mce.c | 47 ++++++++++++++++++++++++++++++++++++++++
>  arch/x86/kernel/crash.c          |  3 +++
>  3 files changed, 56 insertions(+)
> 
> diff --git v3.19.orig/arch/x86/include/asm/mce.h v3.19/arch/x86/include/asm/mce.h
> index 51b26e895933..8010d4b77183 100644
> --- v3.19.orig/arch/x86/include/asm/mce.h
> +++ v3.19/arch/x86/include/asm/mce.h
> @@ -114,6 +114,9 @@ struct mca_config {
>  	int monarch_timeout;
>  	int panic_timeout;
>  	u32 rip_msr;
> +#ifdef CONFIG_KEXEC
> +	int kdump_cpu;
> +#endif

This CONFIG_KEXEC-ifdeffery is too ugly to live. Please put everything
in arch/x86/kernel/crash.c. AFAICT, you don't need to touch anything in
arch/x86/kernel/cpu/mcheck/ for your purposes.

Thanks.

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH v4] x86: mce: kexec: switch MCE handler for kexec/kdump
  2015-03-03 18:09 ` [PATCH v3 1/2] x86: mce: kexec: switch MCE handler for kexec/kdump Luck, Tony
@ 2015-03-04  7:41   ` Naoya Horiguchi
  2015-03-04 23:12     ` Luck, Tony
  0 siblings, 1 reply; 44+ messages in thread
From: Naoya Horiguchi @ 2015-03-04  7:41 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Borislav Petkov, Prarit Bhargava, Vivek Goyal, linux-kernel,
	Junichi Nomura, Kiyoshi Ueda

On Tue, Mar 03, 2015 at 06:09:27PM +0000, Luck, Tony wrote:
> +static void machine_check_under_kdump(struct pt_regs *regs, long error_code)
> +{
> +	if (mca_cfg.kdump_cpu == smp_processor_id())
> +		pr_emerg("MCE triggered when kdumping. If you are lucky enough, you will have a kdump. Otherwise, this is a dying message.\n");
> 
> I'm worried about the SRAR case here.  Your code just returns, which will trigger the same machine check again. The system will spin forever printing this message.

You're right.

> I think you have to look at MCG_STATUS and scan the machine check banks to make a choice.  There are some simple cases:
> 
>   MCG_STATUS.RIPV=0 -> cannot return (where will the cpu go - you have no idea!)
>   SRAO -> safe to just return
>   SRAR -> should not return
> 
> But the rest may require some thought.  If there is a PCC=1 error, then you may end up with a corrupt dump. Perhaps this case will already be covered by RPIV==0?

A PCC=1 error is defined as UC error in SDM, and our severity assessor returns
MCE_PANIC_SEVERITY for it, so kdump should abort in such case.

So I added severity checking code in the new MCE handler in the following patch,
which borrowed some more code from do_machine_check(). Please see also the
changelog in the patch description.

Thanks,
Naoya Horiguchi
---
>From 6c0d53b80d6a45568c250d7b0d227f3399a2f2f2 Mon Sep 17 00:00:00 2001
From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Date: Wed, 4 Mar 2015 16:13:30 +0900
Subject: [PATCH] x86: mce: kexec: switch MCE handler for kexec/kdump

kexec disables (or "shoots down") all CPUs other than a crashing CPU before
entering the 2nd kernel. But the MCE handler is still enabled after that,
so if MCE happens and broadcasts over the CPUs after the main thread starts
the 2nd kernel (which might not initialize MCE device yet, or might decide
not to enable it,) MCE handler runs only on the other CPUs (not on the main
thread,) leading to kernel panic with MCE synchronization. The user-visible
effect of this bug is kdump failure.

Our standard MCE handler do_machine_check() assumes some about system's
status and it's hard to alter it to cover kexec/kdump context, so let's add
another kdump-specific one and switch to it.

Note that this problem exists since current MCE handler was implemented in
2.6.32, and recently commit 716079f66eac ("mce: Panic when a core has reached
a timeout") made it more visible by changing the default behavior of the
synchronization timeout from "ignore" to "panic".

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: <stable@vger.kernel.org>        [2.6.32+]
---
ChangeLog v3 -> v4:
- fixed AR and UC order in enum severity_level because UC is severer than AR
  by definition. Current code is not affected by this wrong order by chance.
- check severity in machine_check_under_kdump(), and call mce_panic() if the
  resultant severity is as bad as or worse than MCE_AR_SEVERITY.
- use static global variable kdump_cpu instead of mca_cfg->kdump_cpu
- reduce "#ifdef CONFIG_KEXEC"
- add "#ifdef CONFIG_X86_MCE" for declaration of machine_check_under_kdump()
  in mce.h
- update comment on switch_mce_handler_for_kdump()

ChangeLog v2 -> v3
- go to "switch MCE handler" approach

ChangeLog v1 -> v2
- clear MSR_IA32_MCG_CTL, MSR_IA32_MCx_CTL, and CR4.MCE instead of using
  global flag to ignore MCE events.
- fixed the description of the problem
---
 arch/x86/include/asm/mce.h                |  7 +++
 arch/x86/kernel/cpu/mcheck/mce-internal.h |  2 +-
 arch/x86/kernel/cpu/mcheck/mce.c          | 82 +++++++++++++++++++++++++++++++
 arch/x86/kernel/crash.c                   |  3 ++
 4 files changed, 93 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index 51b26e895933..223c4897a637 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -175,6 +175,13 @@ static inline void mce_amd_feature_init(struct cpuinfo_x86 *c) { }
 #endif
 
 int mce_available(struct cpuinfo_x86 *c);
+#ifdef CONFIG_KEXEC
+#ifdef CONFIG_X86_MCE
+void switch_mce_handler_for_kdump(void);
+#else
+static inline void switch_mce_handler_for_kdump(void) {}
+#endif
+#endif
 
 DECLARE_PER_CPU(unsigned, mce_exception_count);
 DECLARE_PER_CPU(unsigned, mce_poll_count);
diff --git a/arch/x86/kernel/cpu/mcheck/mce-internal.h b/arch/x86/kernel/cpu/mcheck/mce-internal.h
index 10b46906767f..62e41dee907f 100644
--- a/arch/x86/kernel/cpu/mcheck/mce-internal.h
+++ b/arch/x86/kernel/cpu/mcheck/mce-internal.h
@@ -8,8 +8,8 @@ enum severity_level {
 	MCE_KEEP_SEVERITY,
 	MCE_SOME_SEVERITY,
 	MCE_AO_SEVERITY,
-	MCE_UC_SEVERITY,
 	MCE_AR_SEVERITY,
+	MCE_UC_SEVERITY,
 	MCE_PANIC_SEVERITY,
 };
 
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 3112b79ace8e..a9b23ea86e56 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -1219,6 +1219,88 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 }
 EXPORT_SYMBOL_GPL(do_machine_check);
 
+#ifdef CONFIG_KEXEC
+static int kdump_cpu;
+
+/*
+ * kdump-specific machine check handler
+ *
+ * When kexec/kdump is running, what the MCE handler is expected to do
+ * changes depending on whether the CPU is running the main thread or not.
+ *
+ * The crashing CPU, controlling the whole system exclusively, should try to
+ * get kdump as hard as possible even if an MCE happens concurrently, because
+ * some types of MCEs (for example, uncorrected errors like SRAO,)
+ * are not fatal or don't ruin reliablility of the kdump (consider that an
+ * MCE can hit the other CPU, in which case corrupted data is never consumed.)
+ * If an MCE critically breaks the kdump operation, we are unlucky so let's
+ * accept the fate of whatever HW causes, hoping a dying message reaches admins.
+ *
+ * The other CPUs are supposed to be quiet during kexec/kdump, so after the
+ * crashing CPU shot them down, they should not do anything except clearing
+ * MCG_STATUS (without this the system is reset, which is undesirable.)
+ * Note that this is also true after the crashing CPU enter the 2nd kernel.
+ */
+static void machine_check_under_kdump(struct pt_regs *regs, long error_code)
+{
+	struct mce m, final;
+	char *msg = NULL;
+	char *nmsg = NULL;
+	int i;
+	int worst = 0;
+	int severity;
+
+	if (!mca_cfg.banks)
+		goto out;
+	if (kdump_cpu != smp_processor_id())
+		goto clear_mcg_status;
+
+	mce_gather_info(&m, regs);
+	for (i = 0; i < mca_cfg.banks; i++) {
+		m.status = mce_rdmsrl(MSR_IA32_MCx_STATUS(i));
+		if (m.status & MCI_STATUS_VAL)
+			if (quirk_no_way_out)
+				quirk_no_way_out(i, &m, regs);
+		severity = mce_severity(&m, mca_cfg.tolerant, &nmsg, true);
+		if (severity > worst) {
+			final = m;
+			worst = severity;
+			msg = nmsg;
+		}
+	}
+
+	if (worst >= MCE_AR_SEVERITY)
+		mce_panic("Unignorable machine check under kdumping", &final, msg);
+
+	pr_emerg("MCE triggered under kdumping. If you are lucky enough, you will have a kdump. Otherwise, this is a dying message.\n");
+
+clear_mcg_status:
+	mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
+out:
+	sync_core();
+}
+
+/*
+ * Switch the MCE handler to kdump-specific one (called in kdump entering code)
+ *
+ * Standard MCE handler do_machine_check() is not designed for kexec/kdump
+ * context, where we can't expect MCE recovering and logging to work fine
+ * because the kernel might be unstable (all CPUs except one must be quiet
+ * as possible.)
+ *
+ * Moreover, in such situation getting kdump is prior to doing something for
+ * MCEs because what the users are really interested in is to find what caused
+ * the crashing, not what caused the crashing to fail.
+ *
+ * So let's switch MCE handler to the one suitable for kexec/kdump situation.
+ */
+void switch_mce_handler_for_kdump(void)
+{
+	kdump_cpu = smp_processor_id();
+	machine_check_vector = machine_check_under_kdump;
+}
+#endif
+
 #ifndef CONFIG_MEMORY_FAILURE
 int memory_failure(unsigned long pfn, int vector, int flags)
 {
diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index 6f3baedcb6f6..273805e772f6 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -34,6 +34,7 @@
 #include <asm/cpu.h>
 #include <asm/reboot.h>
 #include <asm/virtext.h>
+#include <asm/mce.h>
 
 /* Alignment required for elf header segment */
 #define ELF_CORE_HEADER_ALIGN   4096
@@ -166,6 +167,8 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
 	/* The kernel is broken so disable interrupts */
 	local_irq_disable();
 
+	switch_mce_handler_for_kdump();
+
 	kdump_nmi_shootdown_cpus();
 
 	/*
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH v3 1/2] x86: mce: kexec: switch MCE handler for kexec/kdump
  2015-03-03 18:53 ` [PATCH v3 1/2] " Borislav Petkov
@ 2015-03-04  7:51   ` Naoya Horiguchi
  2015-03-04  9:12     ` Borislav Petkov
  0 siblings, 1 reply; 44+ messages in thread
From: Naoya Horiguchi @ 2015-03-04  7:51 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Tony Luck, Prarit Bhargava, Vivek Goyal, linux-kernel,
	Junichi Nomura, Kiyoshi Ueda

On Tue, Mar 03, 2015 at 07:53:24PM +0100, Borislav Petkov wrote:
> On Tue, Mar 03, 2015 at 09:01:49AM +0000, Naoya Horiguchi wrote:
...
> 
> This CONFIG_KEXEC-ifdeffery is too ugly to live. Please put everything
> in arch/x86/kernel/crash.c. AFAICT, you don't need to touch anything in
> arch/x86/kernel/cpu/mcheck/ for your purposes.

I think that I could do this if I don't need any logical update on v3,
but as Tony wrote, the new MCE handler needs check the type of MCE to
determine whether to return or not.
And the result of the update, the new code contains some MCE-internal code,
so I feel like putting most of the code in arch/x86/kernel/cpu/mcheck/mce.c
as in the previous versions.

But I agree I used needlessly too many ifdef, so I put all kdump-related
code in arch/x86/kernel/cpu/mcheck/mce.c in a single ifdef for less ugly code.

Thanks,
Naoya Horiguchi

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v3 1/2] x86: mce: kexec: switch MCE handler for kexec/kdump
  2015-03-04  7:51   ` Naoya Horiguchi
@ 2015-03-04  9:12     ` Borislav Petkov
  2015-03-05  1:27       ` Naoya Horiguchi
  0 siblings, 1 reply; 44+ messages in thread
From: Borislav Petkov @ 2015-03-04  9:12 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Tony Luck, Prarit Bhargava, Vivek Goyal, linux-kernel,
	Junichi Nomura, Kiyoshi Ueda

On Wed, Mar 04, 2015 at 07:51:56AM +0000, Naoya Horiguchi wrote:
> And the result of the update, the new code contains some
> MCE-internal code, so I feel like putting most of the code in
> arch/x86/kernel/cpu/mcheck/mce.c as in the previous versions.

We can expose those defines for kdump's use if absolutely needed. mce.c
is already not easy to stare at, having more stuff in it, while it can
be avoided, is not something we want.

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--

^ permalink raw reply	[flat|nested] 44+ messages in thread

* RE: [PATCH v4] x86: mce: kexec: switch MCE handler for kexec/kdump
  2015-03-04  7:41   ` [PATCH v4] " Naoya Horiguchi
@ 2015-03-04 23:12     ` Luck, Tony
  2015-03-05  1:24       ` Naoya Horiguchi
  2015-03-05  8:48       ` Borislav Petkov
  0 siblings, 2 replies; 44+ messages in thread
From: Luck, Tony @ 2015-03-04 23:12 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Borislav Petkov, Prarit Bhargava, Vivek Goyal, linux-kernel,
	Junichi Nomura, Kiyoshi Ueda

> - fixed AR and UC order in enum severity_level because UC is severer than AR
>  by definition. Current code is not affected by this wrong order by chance.

AR and AO are both UC errors - that happen also to be recoverable.  Are you really sure
about this re-order not affecting existing code?  You might well be right, but as every one
else has pointed out mce_severity() is full of odd subtleties that catch people out.

Is the "UC" entry at the end of the severities[] table just a catch-all for things that made it
past all the other entries? Does it ever really get used?

What was the test case that made you promote UC above AR?

This absolutely should not be buried in the middle of your other patch - it needs to
be separate with a much more than two lines of commit description.

-Tony

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v4] x86: mce: kexec: switch MCE handler for kexec/kdump
  2015-03-04 23:12     ` Luck, Tony
@ 2015-03-05  1:24       ` Naoya Horiguchi
  2015-03-05  6:45         ` [PATCH v5] " Naoya Horiguchi
  2015-03-06  5:44         ` [PATCH v4] " Naoya Horiguchi
  2015-03-05  8:48       ` Borislav Petkov
  1 sibling, 2 replies; 44+ messages in thread
From: Naoya Horiguchi @ 2015-03-05  1:24 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Borislav Petkov, Prarit Bhargava, Vivek Goyal, linux-kernel,
	Junichi Nomura, Kiyoshi Ueda

On Wed, Mar 04, 2015 at 11:12:33PM +0000, Luck, Tony wrote:
> > - fixed AR and UC order in enum severity_level because UC is severer than AR
> >  by definition. Current code is not affected by this wrong order by chance.
> 
> AR and AO are both UC errors - that happen also to be recoverable.

Maybe just saying "UC" might be confusing, meaning "UC bit is set" or "type of
error (defined in Table SDM's Table 15-6 in vol.3B) is 'Uncorrected Error'
(clearly separate from SRAR/SRAO)". You seem to mean the former, and I meant
the latter. So I should write the description more accurately like "UC=1,PCC=0"
error and "UC=1,PCC=1" error.

>  Are you really sure
> about this re-order not affecting existing code?

Sorry, I thought I checked that but I missed one, so let me check again now.
I checked all referencing site of MCE_*_SEVERITY. Most of them are using '=='
to compare the severity, where the re-order doesn't affect them.
Some are using inequalities:

- around ./arch/x86/kernel/cpu/mcheck/mce.c:720, 

          if (mce_severity(m, mca_cfg.tolerant, msg, true) >=
              MCE_PANIC_SEVERITY)                            
  
  , the re-order doesn't affect.

- ./arch/x86/kernel/cpu/mcheck/mce.c:819:

          if (m && global_worst >= MCE_PANIC_SEVERITY && mca_cfg.tolerant < 3)

  , the re-order doesn't affect.

- ./arch/x86/kernel/cpu/mcheck/mce.c:832:

          if (global_worst <= MCE_KEEP_SEVERITY && mca_cfg.tolerant < 3) 

  , ditto.

- ./arch/x86/kernel/cpu/mcheck/mce.c:1196:

                no_way_out = worst >= MCE_PANIC_SEVERITY;

  , ditto.

- ./arch/x86/kernel/cpu/mcheck/mce-severity.c:211:

                if (s->sev >= MCE_UC_SEVERITY && ctx == IN_KERNEL) {

  , the re-order should change to s->sev >= MCE_AR_SEVERITY to keep the
  kernel behavior.

So I fixed the last part to be included in the re-order patch.

>  You might well be right, but as every one
> else has pointed out mce_severity() is full of odd subtleties that catch people out.

I agree that this one big table is hard to maintain and bug-prone. One problem
is that it has too many fields to check so the parameter space is huge. I think
some field are checked only once, so separating it out makes table more simple
and readable.

> Is the "UC" entry at the end of the severities[] table just a catch-all for things that made it
> past all the other entries? Does it ever really get used?

I read through the severity check table and it seems that all UC=1 case
are already considered by the above entries, so it seems not used.

> What was the test case that made you promote UC above AR?

I thought of "Action required but unaffected thread is continuable" case
on kexec kernel, but I'm not sure that such a case really happens.
My motivation on the promotion was mainly from what SDM says rather than
the real testcase.

> This absolutely should not be buried in the middle of your other patch - it needs to
> be separate with a much more than two lines of commit description.

OK, I might not include this part in this series in later post, but if I do,
I'll separete it out.

Thanks,
Naoya Horiguchi

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v3 1/2] x86: mce: kexec: switch MCE handler for kexec/kdump
  2015-03-04  9:12     ` Borislav Petkov
@ 2015-03-05  1:27       ` Naoya Horiguchi
  0 siblings, 0 replies; 44+ messages in thread
From: Naoya Horiguchi @ 2015-03-05  1:27 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Tony Luck, Prarit Bhargava, Vivek Goyal, linux-kernel,
	Junichi Nomura, Kiyoshi Ueda

On Wed, Mar 04, 2015 at 10:12:56AM +0100, Borislav Petkov wrote:
> On Wed, Mar 04, 2015 at 07:51:56AM +0000, Naoya Horiguchi wrote:
> > And the result of the update, the new code contains some
> > MCE-internal code, so I feel like putting most of the code in
> > arch/x86/kernel/cpu/mcheck/mce.c as in the previous versions.
> 
> We can expose those defines for kdump's use if absolutely needed. mce.c
> is already not easy to stare at, having more stuff in it, while it can
> be avoided, is not something we want.

OK, I'll move in that direction in the next post.

Naoya

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH v5] x86: mce: kexec: switch MCE handler for kexec/kdump
  2015-03-05  1:24       ` Naoya Horiguchi
@ 2015-03-05  6:45         ` Naoya Horiguchi
  2015-03-05  8:57           ` Borislav Petkov
  2015-03-06  5:44         ` [PATCH v4] " Naoya Horiguchi
  1 sibling, 1 reply; 44+ messages in thread
From: Naoya Horiguchi @ 2015-03-05  6:45 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Borislav Petkov, Prarit Bhargava, Vivek Goyal, linux-kernel,
	Junichi Nomura, Kiyoshi Ueda

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset="utf-8", Size: 13375 bytes --]

On Thu, Mar 05, 2015 at 01:24:47AM +0000, Horiguchi Naoya(堀口 直也) wrote:
> On Wed, Mar 04, 2015 at 11:12:33PM +0000, Luck, Tony wrote:
> > > - fixed AR and UC order in enum severity_level because UC is severer than AR
> > >  by definition. Current code is not affected by this wrong order by chance.
> > 
> > AR and AO are both UC errors - that happen also to be recoverable.
> 
> Maybe just saying "UC" might be confusing, meaning "UC bit is set" or "type of
> error (defined in Table SDM's Table 15-6 in vol.3B) is 'Uncorrected Error'
> (clearly separate from SRAR/SRAO)". You seem to mean the former, and I meant
> the latter. So I should write the description more accurately like "UC=1,PCC=0"
> error and "UC=1,PCC=1" error.
> 
> >  Are you really sure
> > about this re-order not affecting existing code?
> 
> Sorry, I thought I checked that but I missed one, so let me check again now.
> I checked all referencing site of MCE_*_SEVERITY. Most of them are using '=='
> to compare the severity, where the re-order doesn't affect them.
> Some are using inequalities:
> 
> - around ./arch/x86/kernel/cpu/mcheck/mce.c:720, 
> 
>           if (mce_severity(m, mca_cfg.tolerant, msg, true) >=
>               MCE_PANIC_SEVERITY)                            
>   
>   , the re-order doesn't affect.
> 
> - ./arch/x86/kernel/cpu/mcheck/mce.c:819:
> 
>           if (m && global_worst >= MCE_PANIC_SEVERITY && mca_cfg.tolerant < 3)
> 
>   , the re-order doesn't affect.
> 
> - ./arch/x86/kernel/cpu/mcheck/mce.c:832:
> 
>           if (global_worst <= MCE_KEEP_SEVERITY && mca_cfg.tolerant < 3) 
> 
>   , ditto.
> 
> - ./arch/x86/kernel/cpu/mcheck/mce.c:1196:
> 
>                 no_way_out = worst >= MCE_PANIC_SEVERITY;
> 
>   , ditto.
> 
> - ./arch/x86/kernel/cpu/mcheck/mce-severity.c:211:
> 
>                 if (s->sev >= MCE_UC_SEVERITY && ctx == IN_KERNEL) {
> 
>   , the re-order should change to s->sev >= MCE_AR_SEVERITY to keep the
>   kernel behavior.
> 
> So I fixed the last part to be included in the re-order patch.
> 
> >  You might well be right, but as every one
> > else has pointed out mce_severity() is full of odd subtleties that catch people out.
> 
> I agree that this one big table is hard to maintain and bug-prone. One problem
> is that it has too many fields to check so the parameter space is huge. I think
> some field are checked only once, so separating it out makes table more simple
> and readable.
> 
> > Is the "UC" entry at the end of the severities[] table just a catch-all for things that made it
> > past all the other entries? Does it ever really get used?
> 
> I read through the severity check table and it seems that all UC=1 case
> are already considered by the above entries, so it seems not used.
> 
> > What was the test case that made you promote UC above AR?
> 
> I thought of "Action required but unaffected thread is continuable" case
> on kexec kernel, but I'm not sure that such a case really happens.
> My motivation on the promotion was mainly from what SDM says rather than
> the real testcase.
> 
> > This absolutely should not be buried in the middle of your other patch - it needs to
> > be separate with a much more than two lines of commit description.
> 
> OK, I might not include this part in this series in later post, but if I do,
> I'll separete it out.

Although I don't think we finished the discussion over v4, I updated the
patch to reflect your feedbacks so far.

Of cource, if you still have some comment about the previous version, I'm
glad to hear that.

Thanks,
Naoya Horiguchi
----
>From bf4ce58b8296774a69e5436f43e8dc9eed41a829 Mon Sep 17 00:00:00 2001
From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Date: Thu, 5 Mar 2015 15:28:23 +0900
Subject: [PATCH v5] x86: mce: kexec: switch MCE handler for kexec/kdump

kexec disables (or "shoots down") all CPUs other than a crashing CPU before
entering the 2nd kernel. But the MCE handler is still enabled after that,
so if MCE happens and broadcasts over the CPUs after the main thread starts
the 2nd kernel (which might not initialize MCE device yet, or might decide
not to enable it,) MCE handler runs only on the other CPUs (not on the main
thread,) leading to kernel panic with MCE synchronization. The user-visible
effect of this bug is kdump failure.

Our standard MCE handler do_machine_check() assumes some about system's
status and it's hard to alter it to cover kexec/kdump context, so let's add
another kdump-specific one and switch to it.

Note that this problem exists since current MCE handler was implemented in
2.6.32, and recently commit 716079f66eac ("mce: Panic when a core has reached
a timeout") made it more visible by changing the default behavior of the
synchronization timeout from "ignore" to "panic".

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: <stable@vger.kernel.org>        [2.6.32+]
---
ChangeLog v4 -> v5:
- drop MCE_UC/AR_SEVERITY re-ordering
- move most of code to arch/x86/kernel/crash.c
- export some MCE internal variables/routines via arch/x86/include/asm/mce.h

ChangeLog v3 -> v4:
- fixed AR and UC order in enum severity_level because UC is severer than AR
  by definition. Current code is not affected by this wrong order by chance.
- check severity in machine_check_under_kdump(), and call mce_panic() if the
  resultant severity is as bad as or worse than MCE_AR_SEVERITY.
- use static global variable kdump_cpu instead of mca_cfg->kdump_cpu
- reduce "#ifdef CONFIG_KEXEC"
- add "#ifdef CONFIG_X86_MCE" for declaration of machine_check_under_kdump()
  in mce.h
- update comment on switch_mce_handler_for_kdump()

ChangeLog v2 -> v3
- go to "switch MCE handler" approach

ChangeLog v1 -> v2
- clear MSR_IA32_MCG_CTL, MSR_IA32_MCx_CTL, and CR4.MCE instead of using
  global flag to ignore MCE events.
- fixed the description of the problem
---
 arch/x86/include/asm/mce.h                | 19 +++++++
 arch/x86/kernel/cpu/mcheck/mce-internal.h | 13 -----
 arch/x86/kernel/cpu/mcheck/mce.c          | 10 ++--
 arch/x86/kernel/crash.c                   | 87 +++++++++++++++++++++++++++++++
 4 files changed, 111 insertions(+), 18 deletions(-)

diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index 51b26e895933..fbb385611a14 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -248,4 +248,23 @@ struct cper_sec_mem_err;
 extern void apei_mce_report_mem_error(int corrected,
 				      struct cper_sec_mem_err *mem_err);
 
+enum severity_level {
+	MCE_NO_SEVERITY,
+	MCE_DEFERRED_SEVERITY,
+	MCE_UCNA_SEVERITY = MCE_DEFERRED_SEVERITY,
+	MCE_KEEP_SEVERITY,
+	MCE_SOME_SEVERITY,
+	MCE_AO_SEVERITY,
+	MCE_UC_SEVERITY,
+	MCE_AR_SEVERITY,
+	MCE_PANIC_SEVERITY,
+};
+
+int mce_severity(struct mce *a, int tolerant, char **msg, bool is_excp);
+
+extern void mce_panic(char *msg, struct mce *final, char *exp);
+extern u64 mce_rdmsrl(u32 msr);
+extern void mce_wrmsrl(u32 msr, u64 v);
+extern inline void mce_gather_info(struct mce *m, struct pt_regs *regs);
+extern void (*quirk_no_way_out)(int bank, struct mce *m, struct pt_regs *regs);
 #endif /* _ASM_X86_MCE_H */
diff --git a/arch/x86/kernel/cpu/mcheck/mce-internal.h b/arch/x86/kernel/cpu/mcheck/mce-internal.h
index 10b46906767f..909ee3ed95dd 100644
--- a/arch/x86/kernel/cpu/mcheck/mce-internal.h
+++ b/arch/x86/kernel/cpu/mcheck/mce-internal.h
@@ -1,18 +1,6 @@
 #include <linux/device.h>
 #include <asm/mce.h>
 
-enum severity_level {
-	MCE_NO_SEVERITY,
-	MCE_DEFERRED_SEVERITY,
-	MCE_UCNA_SEVERITY = MCE_DEFERRED_SEVERITY,
-	MCE_KEEP_SEVERITY,
-	MCE_SOME_SEVERITY,
-	MCE_AO_SEVERITY,
-	MCE_UC_SEVERITY,
-	MCE_AR_SEVERITY,
-	MCE_PANIC_SEVERITY,
-};
-
 #define ATTR_LEN		16
 
 /* One object for each MCE bank, shared by all CPUs */
@@ -23,7 +11,6 @@ struct mce_bank {
 	char			attrname[ATTR_LEN];	/* attribute name */
 };
 
-int mce_severity(struct mce *a, int tolerant, char **msg, bool is_excp);
 struct dentry *mce_get_debugfs_dir(void);
 
 extern struct mce_bank *mce_banks;
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 3112b79ace8e..60a307a4b507 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -109,7 +109,7 @@ mce_banks_t mce_banks_ce_disabled;
 
 static DEFINE_PER_CPU(struct work_struct, mce_work);
 
-static void (*quirk_no_way_out)(int bank, struct mce *m, struct pt_regs *regs);
+void (*quirk_no_way_out)(int bank, struct mce *m, struct pt_regs *regs);
 
 /*
  * CPU/chipset specific EDAC code can register a notifier call here to print
@@ -311,7 +311,7 @@ static void wait_for_panic(void)
 	panic("Panicing machine check CPU died");
 }
 
-static void mce_panic(char *msg, struct mce *final, char *exp)
+void mce_panic(char *msg, struct mce *final, char *exp)
 {
 	int i, apei_err = 0;
 
@@ -391,7 +391,7 @@ static int msr_to_offset(u32 msr)
 }
 
 /* MSR access wrappers used for error injection */
-static u64 mce_rdmsrl(u32 msr)
+u64 mce_rdmsrl(u32 msr)
 {
 	u64 v;
 
@@ -416,7 +416,7 @@ static u64 mce_rdmsrl(u32 msr)
 	return v;
 }
 
-static void mce_wrmsrl(u32 msr, u64 v)
+void mce_wrmsrl(u32 msr, u64 v)
 {
 	if (__this_cpu_read(injectm.finished)) {
 		int offset = msr_to_offset(msr);
@@ -433,7 +433,7 @@ static void mce_wrmsrl(u32 msr, u64 v)
  * check into our "mce" struct so that we can use it later to assess
  * the severity of the problem as we read per-bank specific details.
  */
-static inline void mce_gather_info(struct mce *m, struct pt_regs *regs)
+inline void mce_gather_info(struct mce *m, struct pt_regs *regs)
 {
 	mce_setup(m);
 
diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index 6f3baedcb6f6..8586dfff0e52 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -34,6 +34,7 @@
 #include <asm/cpu.h>
 #include <asm/reboot.h>
 #include <asm/virtext.h>
+#include <asm/mce.h>
 
 /* Alignment required for elf header segment */
 #define ELF_CORE_HEADER_ALIGN   4096
@@ -76,6 +77,90 @@ struct crash_memmap_data {
 
 int in_crash_kexec;
 
+#ifdef CONFIG_X86_MCE
+static int kdump_cpu;
+
+/*
+ * kdump-specific machine check handler
+ *
+ * When kexec/kdump is running, what the MCE handler is expected to do
+ * changes depending on whether the CPU is running the main thread or not.
+ *
+ * The crashing CPU, controlling the whole system exclusively, should try to
+ * get kdump as hard as possible even if an MCE happens concurrently, because
+ * some types of MCEs (for example, uncorrected errors like SRAO,)
+ * are not fatal or don't ruin reliablility of the kdump (consider that an
+ * MCE can hit the other CPU, in which case corrupted data is never consumed.)
+ * If an MCE critically breaks the kdump operation, we are unlucky so let's
+ * accept the fate of whatever HW causes, hoping a dying message reaches admins.
+ *
+ * The other CPUs are supposed to be quiet during kexec/kdump, so after the
+ * crashing CPU shot them down, they should not do anything except clearing
+ * MCG_STATUS (without this the system is reset, which is undesirable.)
+ * Note that this is also true after the crashing CPU enter the 2nd kernel.
+ */
+static void machine_check_under_kdump(struct pt_regs *regs, long error_code)
+{
+	struct mce m, final;
+	char *msg = NULL;
+	char *nmsg = NULL;
+	int i;
+	int worst = 0;
+	int severity;
+
+	if (!mca_cfg.banks)
+		goto out;
+	if (kdump_cpu != smp_processor_id())
+		goto clear_mcg_status;
+
+	mce_gather_info(&m, regs);
+	for (i = 0; i < mca_cfg.banks; i++) {
+		m.status = mce_rdmsrl(MSR_IA32_MCx_STATUS(i));
+		if (m.status & MCI_STATUS_VAL)
+			if (quirk_no_way_out)
+				quirk_no_way_out(i, &m, regs);
+		severity = mce_severity(&m, mca_cfg.tolerant, &nmsg, true);
+		if (severity > worst) {
+			final = m;
+			worst = severity;
+			msg = nmsg;
+		}
+	}
+
+	if (worst >= MCE_UC_SEVERITY)
+		mce_panic("unignorable machine check under kdumping", &final, msg);
+
+	pr_emerg("MCE triggered under kdumping. If you are lucky enough, you will have a kdump. Otherwise, this is a dying message.\n");
+
+clear_mcg_status:
+	mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
+out:
+	sync_core();
+}
+
+/*
+ * Switch the MCE handler to kdump-specific one
+ *
+ * Standard MCE handler do_machine_check() is not designed for kexec/kdump
+ * context, where we can't expect MCE recovering and logging to work fine
+ * because the kernel might be unstable (all CPUs except one must be quiet
+ * as possible.)
+ *
+ * Moreover, in such situation getting kdump is prior to doing something for
+ * MCEs because what the users are really interested in is to find what caused
+ * the crashing, not what caused the crashing to fail.
+ *
+ * So let's switch MCE handler to the one suitable for kexec/kdump situation.
+ */
+void switch_mce_handler_for_kdump(void)
+{
+	kdump_cpu = smp_processor_id();
+	machine_check_vector = machine_check_under_kdump;
+}
+#else
+static inline void switch_mce_handler_for_kdump(void) {}
+#endif
+
 /*
  * This is used to VMCLEAR all VMCSs loaded on the
  * processor. And when loading kvm_intel module, the
@@ -166,6 +251,8 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
 	/* The kernel is broken so disable interrupts */
 	local_irq_disable();
 
+	switch_mce_handler_for_kdump();
+
 	kdump_nmi_shootdown_cpus();
 
 	/*
-- 
1.9.3ÿôèº{.nÇ+‰·Ÿ®‰­†+%ŠËÿ±éݶ\x17¥Šwÿº{.nÇ+‰·¥Š{±þG«éÿŠ{ayº\x1dʇڙë,j\a­¢f£¢·hšïêÿ‘êçz_è®\x03(­éšŽŠÝ¢j"ú\x1a¶^[m§ÿÿ¾\a«þG«éÿ¢¸?™¨è­Ú&£ø§~á¶iO•æ¬z·švØ^\x14\x04\x1a¶^[m§ÿÿÃ\fÿ¶ìÿ¢¸?–I¥

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH v4] x86: mce: kexec: switch MCE handler for kexec/kdump
  2015-03-04 23:12     ` Luck, Tony
  2015-03-05  1:24       ` Naoya Horiguchi
@ 2015-03-05  8:48       ` Borislav Petkov
  1 sibling, 0 replies; 44+ messages in thread
From: Borislav Petkov @ 2015-03-05  8:48 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Naoya Horiguchi, Prarit Bhargava, Vivek Goyal, linux-kernel,
	Junichi Nomura, Kiyoshi Ueda

On Wed, Mar 04, 2015 at 11:12:33PM +0000, Luck, Tony wrote:
> > - fixed AR and UC order in enum severity_level because UC is severer than AR
> >  by definition. Current code is not affected by this wrong order by chance.
> 
> AR and AO are both UC errors - that happen also to be recoverable.  Are you really sure
> about this re-order not affecting existing code?  You might well be right, but as every one
> else has pointed out mce_severity() is full of odd subtleties that catch people out.

Hallelujah! :-)

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v5] x86: mce: kexec: switch MCE handler for kexec/kdump
  2015-03-05  6:45         ` [PATCH v5] " Naoya Horiguchi
@ 2015-03-05  8:57           ` Borislav Petkov
  2015-03-05  9:37             ` Naoya Horiguchi
  0 siblings, 1 reply; 44+ messages in thread
From: Borislav Petkov @ 2015-03-05  8:57 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Luck, Tony, Prarit Bhargava, Vivek Goyal, linux-kernel,
	Junichi Nomura, Kiyoshi Ueda

On Thu, Mar 05, 2015 at 06:45:10AM +0000, Naoya Horiguchi wrote:
> ----
> From bf4ce58b8296774a69e5436f43e8dc9eed41a829 Mon Sep 17 00:00:00 2001
> From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Date: Thu, 5 Mar 2015 15:28:23 +0900
> Subject: [PATCH v5] x86: mce: kexec: switch MCE handler for kexec/kdump
> 
> kexec disables (or "shoots down") all CPUs other than a crashing CPU before
> entering the 2nd kernel. But the MCE handler is still enabled after that,
> so if MCE happens and broadcasts over the CPUs after the main thread starts
> the 2nd kernel (which might not initialize MCE device yet, or might decide
> not to enable it,) MCE handler runs only on the other CPUs (not on the main
> thread,) leading to kernel panic with MCE synchronization. The user-visible
> effect of this bug is kdump failure.
> 
> Our standard MCE handler do_machine_check() assumes some about system's
> status and it's hard to alter it to cover kexec/kdump context, so let's add
> another kdump-specific one and switch to it.
> 
> Note that this problem exists since current MCE handler was implemented in
> 2.6.32, and recently commit 716079f66eac ("mce: Panic when a core has reached
> a timeout") made it more visible by changing the default behavior of the
> synchronization timeout from "ignore" to "panic".
> 
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Cc: <stable@vger.kernel.org>        [2.6.32+]

I don't think you can CC stable on something which looks like a new
feature to me.

It is most likely that distros will pick it up separately.

> ---
> ChangeLog v4 -> v5:
> - drop MCE_UC/AR_SEVERITY re-ordering
> - move most of code to arch/x86/kernel/crash.c
> - export some MCE internal variables/routines via arch/x86/include/asm/mce.h
> 
> ChangeLog v3 -> v4:
> - fixed AR and UC order in enum severity_level because UC is severer than AR
>   by definition. Current code is not affected by this wrong order by chance.
> - check severity in machine_check_under_kdump(), and call mce_panic() if the
>   resultant severity is as bad as or worse than MCE_AR_SEVERITY.
> - use static global variable kdump_cpu instead of mca_cfg->kdump_cpu
> - reduce "#ifdef CONFIG_KEXEC"
> - add "#ifdef CONFIG_X86_MCE" for declaration of machine_check_under_kdump()
>   in mce.h
> - update comment on switch_mce_handler_for_kdump()
> 
> ChangeLog v2 -> v3
> - go to "switch MCE handler" approach
> 
> ChangeLog v1 -> v2
> - clear MSR_IA32_MCG_CTL, MSR_IA32_MCx_CTL, and CR4.MCE instead of using
>   global flag to ignore MCE events.
> - fixed the description of the problem
> ---
>  arch/x86/include/asm/mce.h                | 19 +++++++
>  arch/x86/kernel/cpu/mcheck/mce-internal.h | 13 -----
>  arch/x86/kernel/cpu/mcheck/mce.c          | 10 ++--
>  arch/x86/kernel/crash.c                   | 87 +++++++++++++++++++++++++++++++
>  4 files changed, 111 insertions(+), 18 deletions(-)
> 
> diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
> index 51b26e895933..fbb385611a14 100644
> --- a/arch/x86/include/asm/mce.h
> +++ b/arch/x86/include/asm/mce.h
> @@ -248,4 +248,23 @@ struct cper_sec_mem_err;
>  extern void apei_mce_report_mem_error(int corrected,
>  				      struct cper_sec_mem_err *mem_err);
>  
> +enum severity_level {
> +	MCE_NO_SEVERITY,
> +	MCE_DEFERRED_SEVERITY,
> +	MCE_UCNA_SEVERITY = MCE_DEFERRED_SEVERITY,
> +	MCE_KEEP_SEVERITY,
> +	MCE_SOME_SEVERITY,
> +	MCE_AO_SEVERITY,
> +	MCE_UC_SEVERITY,
> +	MCE_AR_SEVERITY,
> +	MCE_PANIC_SEVERITY,
> +};
> +
> +int mce_severity(struct mce *a, int tolerant, char **msg, bool is_excp);
> +
> +extern void mce_panic(char *msg, struct mce *final, char *exp);

mce_panic is doing a lot of MCE-specific stuff like flushing out mcelog
etc. I don't think you need all that in your case - I think in your case
you simply want to panic().

> +extern u64 mce_rdmsrl(u32 msr);
> +extern void mce_wrmsrl(u32 msr, u64 v);

Those wrap error injection. I don't think you need that either - use
generic rd/wrmsr* functions.

> +extern inline void mce_gather_info(struct mce *m, struct pt_regs *regs);

This has a vm86 mode special case. Also probably not needed for you. You
can simply read MSR_IA32_MCG_STATUS in a simplified, private version.

> +extern void (*quirk_no_way_out)(int bank, struct mce *m, struct pt_regs *regs);

Now this is exposing a really MCE-internal function. You probably should add a

	mce_callback(bank, m, regs);

in a prepatch and call it from your code.

With the above simplified versions used, the rest of the patch becomes
almost trivial.

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v5] x86: mce: kexec: switch MCE handler for kexec/kdump
  2015-03-05  8:57           ` Borislav Petkov
@ 2015-03-05  9:37             ` Naoya Horiguchi
  2015-03-06  2:59               ` [PATCH v6] " Naoya Horiguchi
  2015-03-06  8:28               ` [PATCH v5] " Borislav Petkov
  0 siblings, 2 replies; 44+ messages in thread
From: Naoya Horiguchi @ 2015-03-05  9:37 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Luck, Tony, Prarit Bhargava, Vivek Goyal, linux-kernel,
	Junichi Nomura, Kiyoshi Ueda

On Thu, Mar 05, 2015 at 09:57:35AM +0100, Borislav Petkov wrote:
> On Thu, Mar 05, 2015 at 06:45:10AM +0000, Naoya Horiguchi wrote:
...
> > 
> > Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> > Cc: <stable@vger.kernel.org>        [2.6.32+]
> 
> I don't think you can CC stable on something which looks like a new
> feature to me.

OK, I drop it.

> It is most likely that distros will pick it up separately.
> 
> > ---
> > ChangeLog v4 -> v5:
> > - drop MCE_UC/AR_SEVERITY re-ordering
> > - move most of code to arch/x86/kernel/crash.c
> > - export some MCE internal variables/routines via arch/x86/include/asm/mce.h
> > 
> > ChangeLog v3 -> v4:
> > - fixed AR and UC order in enum severity_level because UC is severer than AR
> >   by definition. Current code is not affected by this wrong order by chance.
> > - check severity in machine_check_under_kdump(), and call mce_panic() if the
> >   resultant severity is as bad as or worse than MCE_AR_SEVERITY.
> > - use static global variable kdump_cpu instead of mca_cfg->kdump_cpu
> > - reduce "#ifdef CONFIG_KEXEC"
> > - add "#ifdef CONFIG_X86_MCE" for declaration of machine_check_under_kdump()
> >   in mce.h
> > - update comment on switch_mce_handler_for_kdump()
> > 
> > ChangeLog v2 -> v3
> > - go to "switch MCE handler" approach
> > 
> > ChangeLog v1 -> v2
> > - clear MSR_IA32_MCG_CTL, MSR_IA32_MCx_CTL, and CR4.MCE instead of using
> >   global flag to ignore MCE events.
> > - fixed the description of the problem
> > ---
> >  arch/x86/include/asm/mce.h                | 19 +++++++
> >  arch/x86/kernel/cpu/mcheck/mce-internal.h | 13 -----
> >  arch/x86/kernel/cpu/mcheck/mce.c          | 10 ++--
> >  arch/x86/kernel/crash.c                   | 87 +++++++++++++++++++++++++++++++
> >  4 files changed, 111 insertions(+), 18 deletions(-)
> > 
> > diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
> > index 51b26e895933..fbb385611a14 100644
> > --- a/arch/x86/include/asm/mce.h
> > +++ b/arch/x86/include/asm/mce.h
> > @@ -248,4 +248,23 @@ struct cper_sec_mem_err;
> >  extern void apei_mce_report_mem_error(int corrected,
> >  				      struct cper_sec_mem_err *mem_err);
> >  
> > +enum severity_level {
> > +	MCE_NO_SEVERITY,
> > +	MCE_DEFERRED_SEVERITY,
> > +	MCE_UCNA_SEVERITY = MCE_DEFERRED_SEVERITY,
> > +	MCE_KEEP_SEVERITY,
> > +	MCE_SOME_SEVERITY,
> > +	MCE_AO_SEVERITY,
> > +	MCE_UC_SEVERITY,
> > +	MCE_AR_SEVERITY,
> > +	MCE_PANIC_SEVERITY,
> > +};
> > +
> > +int mce_severity(struct mce *a, int tolerant, char **msg, bool is_excp);
> > +
> > +extern void mce_panic(char *msg, struct mce *final, char *exp);
> 
> mce_panic is doing a lot of MCE-specific stuff like flushing out mcelog
> etc. I don't think you need all that in your case - I think in your case
> you simply want to panic().

print_mce() doesn't help?

> > +extern u64 mce_rdmsrl(u32 msr);
> > +extern void mce_wrmsrl(u32 msr, u64 v);
> 
> Those wrap error injection. I don't think you need that either - use
> generic rd/wrmsr* functions.
> 
> > +extern inline void mce_gather_info(struct mce *m, struct pt_regs *regs);
> 
> This has a vm86 mode special case. Also probably not needed for you. You
> can simply read MSR_IA32_MCG_STATUS in a simplified, private version.
> 
> > +extern void (*quirk_no_way_out)(int bank, struct mce *m, struct pt_regs *regs);
> 
> Now this is exposing a really MCE-internal function. You probably should add a
> 
> 	mce_callback(bank, m, regs);
> 
> in a prepatch and call it from your code.

Currently quirk_no_way_out() is set only for a specific CPU model, so even
if we define another callback for kdump code, setting it to a real function
need to be done in MCE internal function, __mcheck_cpu_apply_quirks() ?

It seems that in this version I relied on reusing exisiting code too much.
So I'll do only what I really need. Then, if the model specific behavior
(what quirk_sandybridge_ifu() does) doesn't affect machine_check_under_kdump()'s
behavior, simply stop copying this part is a right thing to do?


> With the above simplified versions used, the rest of the patch becomes
> almost trivial.

Other than that, I'm OK to write in the simplified form.

Thanks,
Naoya Horiguchi

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH v6] x86: mce: kexec: switch MCE handler for kexec/kdump
  2015-03-05  9:37             ` Naoya Horiguchi
@ 2015-03-06  2:59               ` Naoya Horiguchi
  2015-03-06  8:34                 ` Borislav Petkov
  2015-03-06  8:28               ` [PATCH v5] " Borislav Petkov
  1 sibling, 1 reply; 44+ messages in thread
From: Naoya Horiguchi @ 2015-03-06  2:59 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Luck, Tony, Prarit Bhargava, Vivek Goyal, linux-kernel,
	Junichi Nomura, Kiyoshi Ueda

On Thu, Mar 05, 2015 at 09:37:52AM +0000, Naoya Horiguchi wrote:
...
> > With the above simplified versions used, the rest of the patch becomes
> > almost trivial.
> 
> Other than that, I'm OK to write in the simplified form.

Here is the updated one.

And I found some cleanups and/or tiny fixes (independent from this patch),
so will post them later.

Thanks,
Naoya Horiguchi

---
>From 8890e9976c525a4b480bf5f86008641688de8c11 Mon Sep 17 00:00:00 2001
From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Date: Fri, 6 Mar 2015 11:52:10 +0900
Subject: [PATCH v6] x86: mce: kexec: switch MCE handler for kexec/kdump

kexec disables (or "shoots down") all CPUs other than a crashing CPU before
entering the 2nd kernel. But the MCE handler is still enabled after that,
so if MCE happens and broadcasts over the CPUs after the main thread starts
the 2nd kernel (which might not initialize MCE device yet, or might decide
not to enable it,) MCE handler runs only on the other CPUs (not on the main
thread,) leading to kernel panic with MCE synchronization. The user-visible
effect of this bug is kdump failure.

Our standard MCE handler do_machine_check() assumes some about system's
status and it's hard to alter it to cover kexec/kdump context, so let's add
another kdump-specific one and switch to it.

Note that this problem exists since current MCE handler was implemented in
2.6.32, and recently commit 716079f66eac ("mce: Panic when a core has reached
a timeout") made it more visible by changing the default behavior of the
synchronization timeout from "ignore" to "panic".

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
ChangeLog v5 -> v6:
- drop "CC stable" tag
- stop using/exporting mce_gather_info(), mce_(rd|wr)msrl(), and mce_panic()
- drop quirk_no_way_out() part, because quirk_sandybridge_ifu() (only possible
  callback) could just change a MCE_PANIC_SEVERITY case to a MCE_AR_SEVERITY
  case, which doesn't affect the panic/return decision.

ChangeLog v4 -> v5:
- drop MCE_UC/AR_SEVERITY re-ordering
- move most of code to arch/x86/kernel/crash.c
- export some MCE internal variables/routines via arch/x86/include/asm/mce.h

ChangeLog v3 -> v4:
- fixed AR and UC order in enum severity_level because UC is severer than AR
  by definition. Current code is not affected by this wrong order by chance.
- check severity in machine_check_under_kdump(), and call mce_panic() if the
  resultant severity is as bad as or worse than MCE_AR_SEVERITY.
- use static global variable kdump_cpu instead of mca_cfg->kdump_cpu
- reduce "#ifdef CONFIG_KEXEC"
- add "#ifdef CONFIG_X86_MCE" for declaration of machine_check_under_kdump()
  in mce.h
- update comment on switch_mce_handler_for_kdump()

ChangeLog v2 -> v3
- go to "switch MCE handler" approach

ChangeLog v1 -> v2
- clear MSR_IA32_MCG_CTL, MSR_IA32_MCx_CTL, and CR4.MCE instead of using
  global flag to ignore MCE events.
- fixed the description of the problem
---
 arch/x86/include/asm/mce.h                | 14 +++++
 arch/x86/kernel/cpu/mcheck/mce-internal.h | 13 -----
 arch/x86/kernel/crash.c                   | 88 +++++++++++++++++++++++++++++++
 3 files changed, 102 insertions(+), 13 deletions(-)

diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index 51b26e895933..192267fcee73 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -248,4 +248,18 @@ struct cper_sec_mem_err;
 extern void apei_mce_report_mem_error(int corrected,
 				      struct cper_sec_mem_err *mem_err);
 
+enum severity_level {
+	MCE_NO_SEVERITY,
+	MCE_DEFERRED_SEVERITY,
+	MCE_UCNA_SEVERITY = MCE_DEFERRED_SEVERITY,
+	MCE_KEEP_SEVERITY,
+	MCE_SOME_SEVERITY,
+	MCE_AO_SEVERITY,
+	MCE_UC_SEVERITY,
+	MCE_AR_SEVERITY,
+	MCE_PANIC_SEVERITY,
+};
+
+int mce_severity(struct mce *a, int tolerant, char **msg, bool is_excp);
+
 #endif /* _ASM_X86_MCE_H */
diff --git a/arch/x86/kernel/cpu/mcheck/mce-internal.h b/arch/x86/kernel/cpu/mcheck/mce-internal.h
index 10b46906767f..909ee3ed95dd 100644
--- a/arch/x86/kernel/cpu/mcheck/mce-internal.h
+++ b/arch/x86/kernel/cpu/mcheck/mce-internal.h
@@ -1,18 +1,6 @@
 #include <linux/device.h>
 #include <asm/mce.h>
 
-enum severity_level {
-	MCE_NO_SEVERITY,
-	MCE_DEFERRED_SEVERITY,
-	MCE_UCNA_SEVERITY = MCE_DEFERRED_SEVERITY,
-	MCE_KEEP_SEVERITY,
-	MCE_SOME_SEVERITY,
-	MCE_AO_SEVERITY,
-	MCE_UC_SEVERITY,
-	MCE_AR_SEVERITY,
-	MCE_PANIC_SEVERITY,
-};
-
 #define ATTR_LEN		16
 
 /* One object for each MCE bank, shared by all CPUs */
@@ -23,7 +11,6 @@ struct mce_bank {
 	char			attrname[ATTR_LEN];	/* attribute name */
 };
 
-int mce_severity(struct mce *a, int tolerant, char **msg, bool is_excp);
 struct dentry *mce_get_debugfs_dir(void);
 
 extern struct mce_bank *mce_banks;
diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index 6f3baedcb6f6..588a8b214356 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -34,6 +34,7 @@
 #include <asm/cpu.h>
 #include <asm/reboot.h>
 #include <asm/virtext.h>
+#include <asm/mce.h>
 
 /* Alignment required for elf header segment */
 #define ELF_CORE_HEADER_ALIGN   4096
@@ -76,6 +77,91 @@ struct crash_memmap_data {
 
 int in_crash_kexec;
 
+#ifdef CONFIG_X86_MCE
+static int kdump_cpu;
+
+/*
+ * kdump-specific machine check handler
+ *
+ * When kexec/kdump is running, what the MCE handler is expected to do
+ * changes depending on whether the CPU is running the main thread or not.
+ *
+ * The crashing CPU, controlling the whole system exclusively, should try to
+ * get kdump as hard as possible even if an MCE happens concurrently, because
+ * some types of MCEs (for example, uncorrected errors like SRAO,)
+ * are not fatal or don't ruin reliablility of the kdump (consider that an
+ * MCE can hit the other CPU, in which case corrupted data is never consumed.)
+ * If an MCE critically breaks the kdump operation, we are unlucky so let's
+ * accept the fate of whatever HW causes, hoping a dying message reaches admins.
+ *
+ * The other CPUs are supposed to be quiet during kexec/kdump, so after the
+ * crashing CPU shot them down, they should not do anything except clearing
+ * MCG_STATUS (without this the system is reset, which is undesirable.)
+ * Note that this is also true after the crashing CPU enter the 2nd kernel.
+ */
+static void machine_check_under_kdump(struct pt_regs *regs, long error_code)
+{
+	struct mce m = {};
+	char *msg = NULL;
+	char *nmsg = NULL;
+	int i;
+	int worst = 0;
+	int severity;
+	int ret;
+
+	if (!mca_cfg.banks)
+		goto out;
+	if (kdump_cpu != smp_processor_id())
+		goto clear_mcg_status;
+
+	if (rdmsrl_safe(MSR_IA32_MCG_STATUS, &m.mcgstatus))
+		m.mcgstatus = 0;
+	if (regs && m.mcgstatus & (MCG_STATUS_RIPV|MCG_STATUS_EIPV))
+		m.cs = regs->cs;
+	for (i = 0; i < mca_cfg.banks; i++) {
+		if (rdmsrl_safe(MSR_IA32_MCx_STATUS(i), &m.status))
+			m.status = 0;
+		severity = mce_severity(&m, mca_cfg.tolerant, &nmsg, true);
+		if (severity > worst) {
+			worst = severity;
+			msg = nmsg;
+		}
+	}
+
+	if (worst >= MCE_UC_SEVERITY)
+		panic("unignorable machine check under kdumping: %s", msg);
+
+	pr_emerg("MCE triggered under kdumping. If you are lucky enough, you will have a kdump. Otherwise, this is a dying message.\n");
+
+clear_mcg_status:
+	wrmsrl(MSR_IA32_MCG_STATUS, 0);
+out:
+	sync_core();
+}
+
+/*
+ * Switch the MCE handler to kdump-specific one
+ *
+ * Standard MCE handler do_machine_check() is not designed for kexec/kdump
+ * context, where we can't expect MCE recovering and logging to work fine
+ * because the kernel might be unstable (all CPUs except one must be quiet
+ * as possible.)
+ *
+ * Moreover, in such situation getting kdump is prior to doing something for
+ * MCEs because what the users are really interested in is to find what caused
+ * the crashing, not what caused the crashing to fail.
+ *
+ * So let's switch MCE handler to the one suitable for kexec/kdump situation.
+ */
+void switch_mce_handler_for_kdump(void)
+{
+	kdump_cpu = smp_processor_id();
+	machine_check_vector = machine_check_under_kdump;
+}
+#else
+static inline void switch_mce_handler_for_kdump(void) {}
+#endif
+
 /*
  * This is used to VMCLEAR all VMCSs loaded on the
  * processor. And when loading kvm_intel module, the
@@ -166,6 +252,8 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
 	/* The kernel is broken so disable interrupts */
 	local_irq_disable();
 
+	switch_mce_handler_for_kdump();
+
 	kdump_nmi_shootdown_cpus();
 
 	/*
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH v4] x86: mce: kexec: switch MCE handler for kexec/kdump
  2015-03-05  1:24       ` Naoya Horiguchi
  2015-03-05  6:45         ` [PATCH v5] " Naoya Horiguchi
@ 2015-03-06  5:44         ` Naoya Horiguchi
  1 sibling, 0 replies; 44+ messages in thread
From: Naoya Horiguchi @ 2015-03-06  5:44 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Borislav Petkov, Prarit Bhargava, Vivek Goyal, linux-kernel,
	Junichi Nomura, Kiyoshi Ueda

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset="utf-8", Size: 706 bytes --]

On Thu, Mar 05, 2015 at 01:24:47AM +0000, Horiguchi Naoya(堀口 直也) wrote:
...
> > Is the "UC" entry at the end of the severities[] table just a catch-all for things that made it
> > past all the other entries? Does it ever really get used?
> 
> I read through the severity check table and it seems that all UC=1 case
> are already considered by the above entries, so it seems not used.

I was completely wrong, the "Uncorrected" entry is chosen when mca_cfg.ser is
false (where all checks with SER_REQUIRED are skipped) and UC=1 and OVER=0.ÿôèº{.nÇ+‰·Ÿ®‰­†+%ŠËÿ±éݶ\x17¥Šwÿº{.nÇ+‰·¥Š{±þG«éÿŠ{ayº\x1dʇڙë,j\a­¢f£¢·hšïêÿ‘êçz_è®\x03(­éšŽŠÝ¢j"ú\x1a¶^[m§ÿÿ¾\a«þG«éÿ¢¸?™¨è­Ú&£ø§~á¶iO•æ¬z·švØ^\x14\x04\x1a¶^[m§ÿÿÃ\fÿ¶ìÿ¢¸?–I¥

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v5] x86: mce: kexec: switch MCE handler for kexec/kdump
  2015-03-05  9:37             ` Naoya Horiguchi
  2015-03-06  2:59               ` [PATCH v6] " Naoya Horiguchi
@ 2015-03-06  8:28               ` Borislav Petkov
  1 sibling, 0 replies; 44+ messages in thread
From: Borislav Petkov @ 2015-03-06  8:28 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Luck, Tony, Prarit Bhargava, Vivek Goyal, linux-kernel,
	Junichi Nomura, Kiyoshi Ueda

On Thu, Mar 05, 2015 at 09:37:52AM +0000, Naoya Horiguchi wrote:
> print_mce() doesn't help?

No, it doesn't because it does a bunch of MCA-specific things too.

> Currently quirk_no_way_out() is set only for a specific CPU model, so even
> if we define another callback for kdump code, setting it to a real function
> need to be done in MCE internal function, __mcheck_cpu_apply_quirks() ?
> 
> It seems that in this version I relied on reusing exisiting code too much.
> So I'll do only what I really need. Then, if the model specific behavior
> (what quirk_sandybridge_ifu() does) doesn't affect machine_check_under_kdump()'s
> behavior, simply stop copying this part is a right thing to do?

Just add this to mce.c in a pre-patch:

void mce_mc_callback(void *arg)
{
	if (quirk_no_way_out)
		quirk_no_way_out(i, m, regs);
}

I'm still not madly in love with it but I can't think of something
better. Maybe Tony would have a better idea...

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v6] x86: mce: kexec: switch MCE handler for kexec/kdump
  2015-03-06  2:59               ` [PATCH v6] " Naoya Horiguchi
@ 2015-03-06  8:34                 ` Borislav Petkov
  2015-03-06  9:09                   ` Naoya Horiguchi
  0 siblings, 1 reply; 44+ messages in thread
From: Borislav Petkov @ 2015-03-06  8:34 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Luck, Tony, Prarit Bhargava, Vivek Goyal, linux-kernel,
	Junichi Nomura, Kiyoshi Ueda

On Fri, Mar 06, 2015 at 02:59:13AM +0000, Naoya Horiguchi wrote:
> From 8890e9976c525a4b480bf5f86008641688de8c11 Mon Sep 17 00:00:00 2001
> From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Date: Fri, 6 Mar 2015 11:52:10 +0900
> Subject: [PATCH v6] x86: mce: kexec: switch MCE handler for kexec/kdump
> 
> kexec disables (or "shoots down") all CPUs other than a crashing CPU before
> entering the 2nd kernel. But the MCE handler is still enabled after that,
> so if MCE happens and broadcasts over the CPUs after the main thread starts
> the 2nd kernel (which might not initialize MCE device yet, or might decide
> not to enable it,) MCE handler runs only on the other CPUs (not on the main
> thread,) leading to kernel panic with MCE synchronization. The user-visible
> effect of this bug is kdump failure.
> 
> Our standard MCE handler do_machine_check() assumes some about system's
> status and it's hard to alter it to cover kexec/kdump context, so let's add
> another kdump-specific one and switch to it.
> 
> Note that this problem exists since current MCE handler was implemented in
> 2.6.32, and recently commit 716079f66eac ("mce: Panic when a core has reached
> a timeout") made it more visible by changing the default behavior of the
> synchronization timeout from "ignore" to "panic".
> 
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

...

> +static void machine_check_under_kdump(struct pt_regs *regs, long error_code)
> +{
> +	struct mce m = {};
> +	char *msg = NULL;
> +	char *nmsg = NULL;
> +	int i;
> +	int worst = 0;
> +	int severity;
> +	int ret;

if you do here

	if (mce_cfg.disabled)
		return;

you can use the simple rdmsrl variants and not the _safe() ones with
exception handling.

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v6] x86: mce: kexec: switch MCE handler for kexec/kdump
  2015-03-06  8:34                 ` Borislav Petkov
@ 2015-03-06  9:09                   ` Naoya Horiguchi
  2015-03-06  9:27                     ` Borislav Petkov
  0 siblings, 1 reply; 44+ messages in thread
From: Naoya Horiguchi @ 2015-03-06  9:09 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Luck, Tony, Prarit Bhargava, Vivek Goyal, linux-kernel,
	Junichi Nomura, Kiyoshi Ueda

On Fri, Mar 06, 2015 at 09:34:21AM +0100, Borislav Petkov wrote:
> On Fri, Mar 06, 2015 at 02:59:13AM +0000, Naoya Horiguchi wrote:
> > From 8890e9976c525a4b480bf5f86008641688de8c11 Mon Sep 17 00:00:00 2001
> > From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> > Date: Fri, 6 Mar 2015 11:52:10 +0900
> > Subject: [PATCH v6] x86: mce: kexec: switch MCE handler for kexec/kdump
> > 
> > kexec disables (or "shoots down") all CPUs other than a crashing CPU before
> > entering the 2nd kernel. But the MCE handler is still enabled after that,
> > so if MCE happens and broadcasts over the CPUs after the main thread starts
> > the 2nd kernel (which might not initialize MCE device yet, or might decide
> > not to enable it,) MCE handler runs only on the other CPUs (not on the main
> > thread,) leading to kernel panic with MCE synchronization. The user-visible
> > effect of this bug is kdump failure.
> > 
> > Our standard MCE handler do_machine_check() assumes some about system's
> > status and it's hard to alter it to cover kexec/kdump context, so let's add
> > another kdump-specific one and switch to it.
> > 
> > Note that this problem exists since current MCE handler was implemented in
> > 2.6.32, and recently commit 716079f66eac ("mce: Panic when a core has reached
> > a timeout") made it more visible by changing the default behavior of the
> > synchronization timeout from "ignore" to "panic".
> > 
> > Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> 
> ...
> 
> > +static void machine_check_under_kdump(struct pt_regs *regs, long error_code)
> > +{
> > +	struct mce m = {};
> > +	char *msg = NULL;
> > +	char *nmsg = NULL;
> > +	int i;
> > +	int worst = 0;
> > +	int severity;
> > +	int ret;
> 
> if you do here
> 
> 	if (mce_cfg.disabled)
> 		return;
> 
> you can use the simple rdmsrl variants and not the _safe() ones with
> exception handling.

I'm not sure why that works, could you elabroate it?

I feel that we need some comment about why it's OK to use rdmsrl_safe()
variant *only* in mce_rdmsrl() (other MCE registers is accessed via rdmsrl(),)
which seems not clear to me from reading current code and git history.

Thanks,
Naoya Horiguchi

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v6] x86: mce: kexec: switch MCE handler for kexec/kdump
  2015-03-06  9:09                   ` Naoya Horiguchi
@ 2015-03-06  9:27                     ` Borislav Petkov
  2015-03-06  9:32                       ` Naoya Horiguchi
  0 siblings, 1 reply; 44+ messages in thread
From: Borislav Petkov @ 2015-03-06  9:27 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Luck, Tony, Prarit Bhargava, Vivek Goyal, linux-kernel,
	Junichi Nomura, Kiyoshi Ueda

On Fri, Mar 06, 2015 at 09:09:30AM +0000, Naoya Horiguchi wrote:
> I'm not sure why that works, could you elabroate it?

Because the _safe() variants are only for handling reads to possibly
non-existent MSRs. The _STATUS MSRs which you're accessing are
guaranteed to be present, otherwise __mcheck_cpu_cap_init() would've
disabled MCA on the first kernel, thus the mca_cfg.disabled check.

We can always switch to the _safe() variants later if really needed - I
just don't see the need for them right now.

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v6] x86: mce: kexec: switch MCE handler for kexec/kdump
  2015-03-06  9:27                     ` Borislav Petkov
@ 2015-03-06  9:32                       ` Naoya Horiguchi
  2015-03-06 10:22                         ` [PATCH v7] " Naoya Horiguchi
  0 siblings, 1 reply; 44+ messages in thread
From: Naoya Horiguchi @ 2015-03-06  9:32 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Luck, Tony, Prarit Bhargava, Vivek Goyal, linux-kernel,
	Junichi Nomura, Kiyoshi Ueda

On Fri, Mar 06, 2015 at 10:27:38AM +0100, Borislav Petkov wrote:
> On Fri, Mar 06, 2015 at 09:09:30AM +0000, Naoya Horiguchi wrote:
> > I'm not sure why that works, could you elabroate it?
> 
> Because the _safe() variants are only for handling reads to possibly
> non-existent MSRs. The _STATUS MSRs which you're accessing are
> guaranteed to be present, otherwise __mcheck_cpu_cap_init() would've
> disabled MCA on the first kernel, thus the mca_cfg.disabled check.

Great, thanks.

> We can always switch to the _safe() variants later if really needed - I
> just don't see the need for them right now.

OK, I'll do with rdmsrl().

Thanks,
Naoya Horiguchi

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH v7] x86: mce: kexec: switch MCE handler for kexec/kdump
  2015-03-06  9:32                       ` Naoya Horiguchi
@ 2015-03-06 10:22                         ` Naoya Horiguchi
  2015-04-06  7:18                           ` Naoya Horiguchi
  2015-04-06 11:56                           ` [PATCH v7] " Borislav Petkov
  0 siblings, 2 replies; 44+ messages in thread
From: Naoya Horiguchi @ 2015-03-06 10:22 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Luck, Tony, Prarit Bhargava, Vivek Goyal, linux-kernel,
	Junichi Nomura, Kiyoshi Ueda

kexec disables (or "shoots down") all CPUs other than a crashing CPU before
entering the 2nd kernel. But the MCE handler is still enabled after that,
so if MCE happens and broadcasts over the CPUs after the main thread starts
the 2nd kernel (which might not initialize MCE device yet, or might decide
not to enable it,) MCE handler runs only on the other CPUs (not on the main
thread,) leading to kernel panic with MCE synchronization. The user-visible
effect of this bug is kdump failure.

Our standard MCE handler do_machine_check() assumes some about system's
status and it's hard to alter it to cover kexec/kdump context, so let's add
another kdump-specific one and switch to it.

Note that this problem exists since current MCE handler was implemented in
2.6.32, and recently commit 716079f66eac ("mce: Panic when a core has reached
a timeout") made it more visible by changing the default behavior of the
synchronization timeout from "ignore" to "panic".

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
ChangeLog: v6 -> v7
- use rdmsrl() instead of rdmsrl_safe() with mca_cfg.disabled check

ChangeLog v5 -> v6:
- drop "CC stable" tag
- stop using/exporting mce_gather_info(), mce_(rd|wr)msrl(), and mce_panic()
- drop quirk_no_way_out() part, because quirk_sandybridge_ifu() (only possible
  callback) could just change a MCE_PANIC_SEVERITY case to a MCE_AR_SEVERITY
  case, which doesn't affect the panic/return decision.

ChangeLog v4 -> v5:
- drop MCE_UC/AR_SEVERITY re-ordering
- move most of code to arch/x86/kernel/crash.c
- export some MCE internal variables/routines via arch/x86/include/asm/mce.h

ChangeLog v3 -> v4:
- fixed AR and UC order in enum severity_level because UC is severer than AR
  by definition. Current code is not affected by this wrong order by chance.
- check severity in machine_check_under_kdump(), and call mce_panic() if the
  resultant severity is as bad as or worse than MCE_AR_SEVERITY.
- use static global variable kdump_cpu instead of mca_cfg->kdump_cpu
- reduce "#ifdef CONFIG_KEXEC"
- add "#ifdef CONFIG_X86_MCE" for declaration of machine_check_under_kdump()
  in mce.h
- update comment on switch_mce_handler_for_kdump()

ChangeLog v2 -> v3
- go to "switch MCE handler" approach

ChangeLog v1 -> v2
- clear MSR_IA32_MCG_CTL, MSR_IA32_MCx_CTL, and CR4.MCE instead of using
  global flag to ignore MCE events.
- fixed the description of the problem
---
 arch/x86/include/asm/mce.h                | 14 +++++
 arch/x86/kernel/cpu/mcheck/mce-internal.h | 13 -----
 arch/x86/kernel/crash.c                   | 87 +++++++++++++++++++++++++++++++
 3 files changed, 101 insertions(+), 13 deletions(-)

diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index 51b26e895933..192267fcee73 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -248,4 +248,18 @@ struct cper_sec_mem_err;
 extern void apei_mce_report_mem_error(int corrected,
 				      struct cper_sec_mem_err *mem_err);
 
+enum severity_level {
+	MCE_NO_SEVERITY,
+	MCE_DEFERRED_SEVERITY,
+	MCE_UCNA_SEVERITY = MCE_DEFERRED_SEVERITY,
+	MCE_KEEP_SEVERITY,
+	MCE_SOME_SEVERITY,
+	MCE_AO_SEVERITY,
+	MCE_UC_SEVERITY,
+	MCE_AR_SEVERITY,
+	MCE_PANIC_SEVERITY,
+};
+
+int mce_severity(struct mce *a, int tolerant, char **msg, bool is_excp);
+
 #endif /* _ASM_X86_MCE_H */
diff --git a/arch/x86/kernel/cpu/mcheck/mce-internal.h b/arch/x86/kernel/cpu/mcheck/mce-internal.h
index 10b46906767f..909ee3ed95dd 100644
--- a/arch/x86/kernel/cpu/mcheck/mce-internal.h
+++ b/arch/x86/kernel/cpu/mcheck/mce-internal.h
@@ -1,18 +1,6 @@
 #include <linux/device.h>
 #include <asm/mce.h>
 
-enum severity_level {
-	MCE_NO_SEVERITY,
-	MCE_DEFERRED_SEVERITY,
-	MCE_UCNA_SEVERITY = MCE_DEFERRED_SEVERITY,
-	MCE_KEEP_SEVERITY,
-	MCE_SOME_SEVERITY,
-	MCE_AO_SEVERITY,
-	MCE_UC_SEVERITY,
-	MCE_AR_SEVERITY,
-	MCE_PANIC_SEVERITY,
-};
-
 #define ATTR_LEN		16
 
 /* One object for each MCE bank, shared by all CPUs */
@@ -23,7 +11,6 @@ struct mce_bank {
 	char			attrname[ATTR_LEN];	/* attribute name */
 };
 
-int mce_severity(struct mce *a, int tolerant, char **msg, bool is_excp);
 struct dentry *mce_get_debugfs_dir(void);
 
 extern struct mce_bank *mce_banks;
diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index 6f3baedcb6f6..4a46654091bb 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -34,6 +34,7 @@
 #include <asm/cpu.h>
 #include <asm/reboot.h>
 #include <asm/virtext.h>
+#include <asm/mce.h>
 
 /* Alignment required for elf header segment */
 #define ELF_CORE_HEADER_ALIGN   4096
@@ -76,6 +77,90 @@ struct crash_memmap_data {
 
 int in_crash_kexec;
 
+#ifdef CONFIG_X86_MCE
+static int kdump_cpu;
+
+/*
+ * kdump-specific machine check handler
+ *
+ * When kexec/kdump is running, what the MCE handler is expected to do
+ * changes depending on whether the CPU is running the main thread or not.
+ *
+ * The crashing CPU, controlling the whole system exclusively, should try to
+ * get kdump as hard as possible even if an MCE happens concurrently, because
+ * some types of MCEs (for example, uncorrected errors like SRAO,)
+ * are not fatal or don't ruin reliablility of the kdump (consider that an
+ * MCE can hit the other CPU, in which case corrupted data is never consumed.)
+ * If an MCE critically breaks the kdump operation, we are unlucky so let's
+ * accept the fate of whatever HW causes, hoping a dying message reaches admins.
+ *
+ * The other CPUs are supposed to be quiet during kexec/kdump, so after the
+ * crashing CPU shot them down, they should not do anything except clearing
+ * MCG_STATUS (without this the system is reset, which is undesirable.)
+ * Note that this is also true after the crashing CPU enter the 2nd kernel.
+ */
+static void machine_check_under_kdump(struct pt_regs *regs, long error_code)
+{
+	struct mce m = {};
+	char *msg = NULL;
+	char *nmsg = NULL;
+	int i;
+	int worst = 0;
+	int severity;
+
+	if (mca_cfg.disabled)
+		return;
+	if (!mca_cfg.banks)
+		goto out;
+	if (kdump_cpu != smp_processor_id())
+		goto clear_mcg_status;
+
+	rdmsrl(MSR_IA32_MCG_STATUS, m.mcgstatus);
+	if (regs && m.mcgstatus & (MCG_STATUS_RIPV|MCG_STATUS_EIPV))
+		m.cs = regs->cs;
+	for (i = 0; i < mca_cfg.banks; i++) {
+		rdmsrl(MSR_IA32_MCx_STATUS(i), m.status);
+		severity = mce_severity(&m, mca_cfg.tolerant, &nmsg, true);
+		if (severity > worst) {
+			worst = severity;
+			msg = nmsg;
+		}
+	}
+
+	if (worst >= MCE_UC_SEVERITY)
+		panic("unignorable machine check under kdumping: %s", msg);
+
+	pr_emerg("MCE triggered under kdumping. If you are lucky enough, you will have a kdump. Otherwise, this is a dying message.\n");
+
+clear_mcg_status:
+	wrmsrl(MSR_IA32_MCG_STATUS, 0);
+out:
+	sync_core();
+}
+
+/*
+ * Switch the MCE handler to kdump-specific one
+ *
+ * Standard MCE handler do_machine_check() is not designed for kexec/kdump
+ * context, where we can't expect MCE recovering and logging to work fine
+ * because the kernel might be unstable (all CPUs except one must be quiet
+ * as possible.)
+ *
+ * Moreover, in such situation getting kdump is prior to doing something for
+ * MCEs because what the users are really interested in is to find what caused
+ * the crashing, not what caused the crashing to fail.
+ *
+ * So let's switch MCE handler to the one suitable for kexec/kdump situation.
+ */
+void switch_mce_handler_for_kdump(void)
+{
+	kdump_cpu = smp_processor_id();
+	machine_check_vector = machine_check_under_kdump;
+}
+#else
+static inline void switch_mce_handler_for_kdump(void) {}
+#endif
+
 /*
  * This is used to VMCLEAR all VMCSs loaded on the
  * processor. And when loading kvm_intel module, the
@@ -166,6 +251,8 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
 	/* The kernel is broken so disable interrupts */
 	local_irq_disable();
 
+	switch_mce_handler_for_kdump();
+
 	kdump_nmi_shootdown_cpus();
 
 	/*
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH v7] x86: mce: kexec: switch MCE handler for kexec/kdump
  2015-03-06 10:22                         ` [PATCH v7] " Naoya Horiguchi
@ 2015-04-06  7:18                           ` Naoya Horiguchi
  2015-04-06 11:59                             ` Borislav Petkov
  2015-04-06 11:56                           ` [PATCH v7] " Borislav Petkov
  1 sibling, 1 reply; 44+ messages in thread
From: Naoya Horiguchi @ 2015-04-06  7:18 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Luck, Tony, Prarit Bhargava, Vivek Goyal, linux-kernel,
	Junichi Nomura, Kiyoshi Ueda

Hi Boris, Tony,

How will this patch be handled for v4.1?
Could you review and consider merging into your tree?

Naoya

On Fri, Mar 06, 2015 at 10:22:16AM +0000, Naoya Horiguchi wrote:
> kexec disables (or "shoots down") all CPUs other than a crashing CPU before
> entering the 2nd kernel. But the MCE handler is still enabled after that,
> so if MCE happens and broadcasts over the CPUs after the main thread starts
> the 2nd kernel (which might not initialize MCE device yet, or might decide
> not to enable it,) MCE handler runs only on the other CPUs (not on the main
> thread,) leading to kernel panic with MCE synchronization. The user-visible
> effect of this bug is kdump failure.
> 
> Our standard MCE handler do_machine_check() assumes some about system's
> status and it's hard to alter it to cover kexec/kdump context, so let's add
> another kdump-specific one and switch to it.
> 
> Note that this problem exists since current MCE handler was implemented in
> 2.6.32, and recently commit 716079f66eac ("mce: Panic when a core has reached
> a timeout") made it more visible by changing the default behavior of the
> synchronization timeout from "ignore" to "panic".
> 
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> ---
> ChangeLog: v6 -> v7
> - use rdmsrl() instead of rdmsrl_safe() with mca_cfg.disabled check
> 
> ChangeLog v5 -> v6:
> - drop "CC stable" tag
> - stop using/exporting mce_gather_info(), mce_(rd|wr)msrl(), and mce_panic()
> - drop quirk_no_way_out() part, because quirk_sandybridge_ifu() (only possible
>   callback) could just change a MCE_PANIC_SEVERITY case to a MCE_AR_SEVERITY
>   case, which doesn't affect the panic/return decision.
> 
> ChangeLog v4 -> v5:
> - drop MCE_UC/AR_SEVERITY re-ordering
> - move most of code to arch/x86/kernel/crash.c
> - export some MCE internal variables/routines via arch/x86/include/asm/mce.h
> 
> ChangeLog v3 -> v4:
> - fixed AR and UC order in enum severity_level because UC is severer than AR
>   by definition. Current code is not affected by this wrong order by chance.
> - check severity in machine_check_under_kdump(), and call mce_panic() if the
>   resultant severity is as bad as or worse than MCE_AR_SEVERITY.
> - use static global variable kdump_cpu instead of mca_cfg->kdump_cpu
> - reduce "#ifdef CONFIG_KEXEC"
> - add "#ifdef CONFIG_X86_MCE" for declaration of machine_check_under_kdump()
>   in mce.h
> - update comment on switch_mce_handler_for_kdump()
> 
> ChangeLog v2 -> v3
> - go to "switch MCE handler" approach
> 
> ChangeLog v1 -> v2
> - clear MSR_IA32_MCG_CTL, MSR_IA32_MCx_CTL, and CR4.MCE instead of using
>   global flag to ignore MCE events.
> - fixed the description of the problem
> ---
>  arch/x86/include/asm/mce.h                | 14 +++++
>  arch/x86/kernel/cpu/mcheck/mce-internal.h | 13 -----
>  arch/x86/kernel/crash.c                   | 87 +++++++++++++++++++++++++++++++
>  3 files changed, 101 insertions(+), 13 deletions(-)
> 
> diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
> index 51b26e895933..192267fcee73 100644
> --- a/arch/x86/include/asm/mce.h
> +++ b/arch/x86/include/asm/mce.h
> @@ -248,4 +248,18 @@ struct cper_sec_mem_err;
>  extern void apei_mce_report_mem_error(int corrected,
>  				      struct cper_sec_mem_err *mem_err);
>  
> +enum severity_level {
> +	MCE_NO_SEVERITY,
> +	MCE_DEFERRED_SEVERITY,
> +	MCE_UCNA_SEVERITY = MCE_DEFERRED_SEVERITY,
> +	MCE_KEEP_SEVERITY,
> +	MCE_SOME_SEVERITY,
> +	MCE_AO_SEVERITY,
> +	MCE_UC_SEVERITY,
> +	MCE_AR_SEVERITY,
> +	MCE_PANIC_SEVERITY,
> +};
> +
> +int mce_severity(struct mce *a, int tolerant, char **msg, bool is_excp);
> +
>  #endif /* _ASM_X86_MCE_H */
> diff --git a/arch/x86/kernel/cpu/mcheck/mce-internal.h b/arch/x86/kernel/cpu/mcheck/mce-internal.h
> index 10b46906767f..909ee3ed95dd 100644
> --- a/arch/x86/kernel/cpu/mcheck/mce-internal.h
> +++ b/arch/x86/kernel/cpu/mcheck/mce-internal.h
> @@ -1,18 +1,6 @@
>  #include <linux/device.h>
>  #include <asm/mce.h>
>  
> -enum severity_level {
> -	MCE_NO_SEVERITY,
> -	MCE_DEFERRED_SEVERITY,
> -	MCE_UCNA_SEVERITY = MCE_DEFERRED_SEVERITY,
> -	MCE_KEEP_SEVERITY,
> -	MCE_SOME_SEVERITY,
> -	MCE_AO_SEVERITY,
> -	MCE_UC_SEVERITY,
> -	MCE_AR_SEVERITY,
> -	MCE_PANIC_SEVERITY,
> -};
> -
>  #define ATTR_LEN		16
>  
>  /* One object for each MCE bank, shared by all CPUs */
> @@ -23,7 +11,6 @@ struct mce_bank {
>  	char			attrname[ATTR_LEN];	/* attribute name */
>  };
>  
> -int mce_severity(struct mce *a, int tolerant, char **msg, bool is_excp);
>  struct dentry *mce_get_debugfs_dir(void);
>  
>  extern struct mce_bank *mce_banks;
> diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
> index 6f3baedcb6f6..4a46654091bb 100644
> --- a/arch/x86/kernel/crash.c
> +++ b/arch/x86/kernel/crash.c
> @@ -34,6 +34,7 @@
>  #include <asm/cpu.h>
>  #include <asm/reboot.h>
>  #include <asm/virtext.h>
> +#include <asm/mce.h>
>  
>  /* Alignment required for elf header segment */
>  #define ELF_CORE_HEADER_ALIGN   4096
> @@ -76,6 +77,90 @@ struct crash_memmap_data {
>  
>  int in_crash_kexec;
>  
> +#ifdef CONFIG_X86_MCE
> +static int kdump_cpu;
> +
> +/*
> + * kdump-specific machine check handler
> + *
> + * When kexec/kdump is running, what the MCE handler is expected to do
> + * changes depending on whether the CPU is running the main thread or not.
> + *
> + * The crashing CPU, controlling the whole system exclusively, should try to
> + * get kdump as hard as possible even if an MCE happens concurrently, because
> + * some types of MCEs (for example, uncorrected errors like SRAO,)
> + * are not fatal or don't ruin reliablility of the kdump (consider that an
> + * MCE can hit the other CPU, in which case corrupted data is never consumed.)
> + * If an MCE critically breaks the kdump operation, we are unlucky so let's
> + * accept the fate of whatever HW causes, hoping a dying message reaches admins.
> + *
> + * The other CPUs are supposed to be quiet during kexec/kdump, so after the
> + * crashing CPU shot them down, they should not do anything except clearing
> + * MCG_STATUS (without this the system is reset, which is undesirable.)
> + * Note that this is also true after the crashing CPU enter the 2nd kernel.
> + */
> +static void machine_check_under_kdump(struct pt_regs *regs, long error_code)
> +{
> +	struct mce m = {};
> +	char *msg = NULL;
> +	char *nmsg = NULL;
> +	int i;
> +	int worst = 0;
> +	int severity;
> +
> +	if (mca_cfg.disabled)
> +		return;
> +	if (!mca_cfg.banks)
> +		goto out;
> +	if (kdump_cpu != smp_processor_id())
> +		goto clear_mcg_status;
> +
> +	rdmsrl(MSR_IA32_MCG_STATUS, m.mcgstatus);
> +	if (regs && m.mcgstatus & (MCG_STATUS_RIPV|MCG_STATUS_EIPV))
> +		m.cs = regs->cs;
> +	for (i = 0; i < mca_cfg.banks; i++) {
> +		rdmsrl(MSR_IA32_MCx_STATUS(i), m.status);
> +		severity = mce_severity(&m, mca_cfg.tolerant, &nmsg, true);
> +		if (severity > worst) {
> +			worst = severity;
> +			msg = nmsg;
> +		}
> +	}
> +
> +	if (worst >= MCE_UC_SEVERITY)
> +		panic("unignorable machine check under kdumping: %s", msg);
> +
> +	pr_emerg("MCE triggered under kdumping. If you are lucky enough, you will have a kdump. Otherwise, this is a dying message.\n");
> +
> +clear_mcg_status:
> +	wrmsrl(MSR_IA32_MCG_STATUS, 0);
> +out:
> +	sync_core();
> +}
> +
> +/*
> + * Switch the MCE handler to kdump-specific one
> + *
> + * Standard MCE handler do_machine_check() is not designed for kexec/kdump
> + * context, where we can't expect MCE recovering and logging to work fine
> + * because the kernel might be unstable (all CPUs except one must be quiet
> + * as possible.)
> + *
> + * Moreover, in such situation getting kdump is prior to doing something for
> + * MCEs because what the users are really interested in is to find what caused
> + * the crashing, not what caused the crashing to fail.
> + *
> + * So let's switch MCE handler to the one suitable for kexec/kdump situation.
> + */
> +void switch_mce_handler_for_kdump(void)
> +{
> +	kdump_cpu = smp_processor_id();
> +	machine_check_vector = machine_check_under_kdump;
> +}
> +#else
> +static inline void switch_mce_handler_for_kdump(void) {}
> +#endif
> +
>  /*
>   * This is used to VMCLEAR all VMCSs loaded on the
>   * processor. And when loading kvm_intel module, the
> @@ -166,6 +251,8 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
>  	/* The kernel is broken so disable interrupts */
>  	local_irq_disable();
>  
> +	switch_mce_handler_for_kdump();
> +
>  	kdump_nmi_shootdown_cpus();
>  
>  	/*
> -- 
> 1.9.3
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v7] x86: mce: kexec: switch MCE handler for kexec/kdump
  2015-03-06 10:22                         ` [PATCH v7] " Naoya Horiguchi
  2015-04-06  7:18                           ` Naoya Horiguchi
@ 2015-04-06 11:56                           ` Borislav Petkov
  2015-04-07  7:59                             ` Naoya Horiguchi
  1 sibling, 1 reply; 44+ messages in thread
From: Borislav Petkov @ 2015-04-06 11:56 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Luck, Tony, Prarit Bhargava, Vivek Goyal, linux-kernel,
	Junichi Nomura, Kiyoshi Ueda

On Fri, Mar 06, 2015 at 10:22:16AM +0000, Naoya Horiguchi wrote:
> kexec disables (or "shoots down") all CPUs other than a crashing CPU before
> entering the 2nd kernel. But the MCE handler is still enabled after that,
> so if MCE happens and broadcasts over the CPUs after the main thread starts
> the 2nd kernel (which might not initialize MCE device yet, or might decide
> not to enable it,) MCE handler runs only on the other CPUs (not on the main

You lost me here: "MCE device"?

> thread,) leading to kernel panic with MCE synchronization. The user-visible
> effect of this bug is kdump failure.
> 
> Our standard MCE handler do_machine_check() assumes some about system's
> status and it's hard to alter it to cover kexec/kdump context, so let's add
> another kdump-specific one and switch to it.
> 
> Note that this problem exists since current MCE handler was implemented in

Did we have kdump then at all and was that a "problem" then even?

> 2.6.32, and recently commit 716079f66eac ("mce: Panic when a core has reached
> a timeout") made it more visible by changing the default behavior of the
> synchronization timeout from "ignore" to "panic".
> 
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> ---
> ChangeLog: v6 -> v7
> - use rdmsrl() instead of rdmsrl_safe() with mca_cfg.disabled check
> 
> ChangeLog v5 -> v6:
> - drop "CC stable" tag
> - stop using/exporting mce_gather_info(), mce_(rd|wr)msrl(), and mce_panic()
> - drop quirk_no_way_out() part, because quirk_sandybridge_ifu() (only possible
>   callback) could just change a MCE_PANIC_SEVERITY case to a MCE_AR_SEVERITY
>   case, which doesn't affect the panic/return decision.
> 
> ChangeLog v4 -> v5:
> - drop MCE_UC/AR_SEVERITY re-ordering
> - move most of code to arch/x86/kernel/crash.c
> - export some MCE internal variables/routines via arch/x86/include/asm/mce.h
> 
> ChangeLog v3 -> v4:
> - fixed AR and UC order in enum severity_level because UC is severer than AR
>   by definition. Current code is not affected by this wrong order by chance.
> - check severity in machine_check_under_kdump(), and call mce_panic() if the
>   resultant severity is as bad as or worse than MCE_AR_SEVERITY.
> - use static global variable kdump_cpu instead of mca_cfg->kdump_cpu
> - reduce "#ifdef CONFIG_KEXEC"
> - add "#ifdef CONFIG_X86_MCE" for declaration of machine_check_under_kdump()
>   in mce.h
> - update comment on switch_mce_handler_for_kdump()
> 
> ChangeLog v2 -> v3
> - go to "switch MCE handler" approach
> 
> ChangeLog v1 -> v2
> - clear MSR_IA32_MCG_CTL, MSR_IA32_MCx_CTL, and CR4.MCE instead of using
>   global flag to ignore MCE events.
> - fixed the description of the problem
> ---
>  arch/x86/include/asm/mce.h                | 14 +++++
>  arch/x86/kernel/cpu/mcheck/mce-internal.h | 13 -----
>  arch/x86/kernel/crash.c                   | 87 +++++++++++++++++++++++++++++++
>  3 files changed, 101 insertions(+), 13 deletions(-)
> 
> diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
> index 51b26e895933..192267fcee73 100644
> --- a/arch/x86/include/asm/mce.h
> +++ b/arch/x86/include/asm/mce.h
> @@ -248,4 +248,18 @@ struct cper_sec_mem_err;
>  extern void apei_mce_report_mem_error(int corrected,
>  				      struct cper_sec_mem_err *mem_err);
>  
> +enum severity_level {
> +	MCE_NO_SEVERITY,
> +	MCE_DEFERRED_SEVERITY,
> +	MCE_UCNA_SEVERITY = MCE_DEFERRED_SEVERITY,
> +	MCE_KEEP_SEVERITY,
> +	MCE_SOME_SEVERITY,
> +	MCE_AO_SEVERITY,
> +	MCE_UC_SEVERITY,
> +	MCE_AR_SEVERITY,
> +	MCE_PANIC_SEVERITY,
> +};
> +
> +int mce_severity(struct mce *a, int tolerant, char **msg, bool is_excp);
> +
>  #endif /* _ASM_X86_MCE_H */
> diff --git a/arch/x86/kernel/cpu/mcheck/mce-internal.h b/arch/x86/kernel/cpu/mcheck/mce-internal.h
> index 10b46906767f..909ee3ed95dd 100644
> --- a/arch/x86/kernel/cpu/mcheck/mce-internal.h
> +++ b/arch/x86/kernel/cpu/mcheck/mce-internal.h
> @@ -1,18 +1,6 @@
>  #include <linux/device.h>
>  #include <asm/mce.h>
>  
> -enum severity_level {
> -	MCE_NO_SEVERITY,
> -	MCE_DEFERRED_SEVERITY,
> -	MCE_UCNA_SEVERITY = MCE_DEFERRED_SEVERITY,
> -	MCE_KEEP_SEVERITY,
> -	MCE_SOME_SEVERITY,
> -	MCE_AO_SEVERITY,
> -	MCE_UC_SEVERITY,
> -	MCE_AR_SEVERITY,
> -	MCE_PANIC_SEVERITY,
> -};
> -
>  #define ATTR_LEN		16
>  
>  /* One object for each MCE bank, shared by all CPUs */
> @@ -23,7 +11,6 @@ struct mce_bank {
>  	char			attrname[ATTR_LEN];	/* attribute name */
>  };
>  
> -int mce_severity(struct mce *a, int tolerant, char **msg, bool is_excp);
>  struct dentry *mce_get_debugfs_dir(void);
>  
>  extern struct mce_bank *mce_banks;
> diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
> index 6f3baedcb6f6..4a46654091bb 100644
> --- a/arch/x86/kernel/crash.c
> +++ b/arch/x86/kernel/crash.c
> @@ -34,6 +34,7 @@
>  #include <asm/cpu.h>
>  #include <asm/reboot.h>
>  #include <asm/virtext.h>
> +#include <asm/mce.h>
>  
>  /* Alignment required for elf header segment */
>  #define ELF_CORE_HEADER_ALIGN   4096
> @@ -76,6 +77,90 @@ struct crash_memmap_data {
>  
>  int in_crash_kexec;
>  
> +#ifdef CONFIG_X86_MCE
> +static int kdump_cpu;
> +
> +/*
> + * kdump-specific machine check handler
> + *
> + * When kexec/kdump is running, what the MCE handler is expected to do
> + * changes depending on whether the CPU is running the main thread or not.
> + *
> + * The crashing CPU, controlling the whole system exclusively, should try to
> + * get kdump as hard as possible even if an MCE happens concurrently, because
> + * some types of MCEs (for example, uncorrected errors like SRAO,)
> + * are not fatal or don't ruin reliablility of the kdump (consider that an
> + * MCE can hit the other CPU, in which case corrupted data is never consumed.)
> + * If an MCE critically breaks the kdump operation, we are unlucky so let's
> + * accept the fate of whatever HW causes, hoping a dying message reaches admins.
> + *
> + * The other CPUs are supposed to be quiet during kexec/kdump, so after the
> + * crashing CPU shot them down, they should not do anything except clearing
> + * MCG_STATUS (without this the system is reset, which is undesirable.)
> + * Note that this is also true after the crashing CPU enter the 2nd kernel.
> + */
> +static void machine_check_under_kdump(struct pt_regs *regs, long error_code)
> +{
> +	struct mce m = {};
> +	char *msg = NULL;
> +	char *nmsg = NULL;
> +	int i;
> +	int worst = 0;
> +	int severity;
> +
> +	if (mca_cfg.disabled)
> +		return;
> +	if (!mca_cfg.banks)
> +		goto out;
> +	if (kdump_cpu != smp_processor_id())
> +		goto clear_mcg_status;
> +
> +	rdmsrl(MSR_IA32_MCG_STATUS, m.mcgstatus);
> +	if (regs && m.mcgstatus & (MCG_STATUS_RIPV|MCG_STATUS_EIPV))
> +		m.cs = regs->cs;

Newline here please.

> +	for (i = 0; i < mca_cfg.banks; i++) {
> +		rdmsrl(MSR_IA32_MCx_STATUS(i), m.status);
> +		severity = mce_severity(&m, mca_cfg.tolerant, &nmsg, true);

MCE severity got reworked recently, please redo your patch against
latest tip/master:

In file included from arch/x86/kernel/cpu/mcheck/mce-severity.c:18:0:
arch/x86/kernel/cpu/mcheck/mce-internal.h:15:14: error: ‘mce_severity’ redeclared as different kind of symbol
 extern int (*mce_severity)(struct mce *a, int tolerant, char **msg, bool is_excp);
              ^
In file included from arch/x86/kernel/cpu/mcheck/mce-severity.c:16:0:
./arch/x86/include/asm/mce.h:270:5: note: previous declaration of ‘mce_severity’ was here
 int mce_severity(struct mce *a, int tolerant, char **msg, bool is_excp);
     ^
arch/x86/kernel/cpu/mcheck/mce-severity.c:274:7: error: ‘mce_severity’ redeclared as different kind of symbol
 int (*mce_severity)(struct mce *m, int tolerant, char **msg, bool is_excp) =
       ^
In file included from arch/x86/kernel/cpu/mcheck/mce-severity.c:16:0:
./arch/x86/include/asm/mce.h:270:5: note: previous declaration of ‘mce_severity’ was here
 int mce_severity(struct mce *a, int tolerant, char **msg, bool is_excp);
     ^
make[4]: *** [arch/x86/kernel/cpu/mcheck/mce-severity.o] Error 1
make[4]: *** Waiting for unfinished jobs....
In file included from arch/x86/kernel/cpu/mcheck/mce.c:50:0:
arch/x86/kernel/cpu/mcheck/mce-internal.h:15:14: error: ‘mce_severity’ redeclared as different kind of symbol
 extern int (*mce_severity)(struct mce *a, int tolerant, char **msg, bool is_excp);
              ^
In file included from arch/x86/kernel/cpu/mcheck/mce.c:47:0:
./arch/x86/include/asm/mce.h:270:5: note: previous declaration of ‘mce_severity’ was here
 int mce_severity(struct mce *a, int tolerant, char **msg, bool is_excp);
     ^
make[4]: *** [arch/x86/kernel/cpu/mcheck/mce.o] Error 1
make[3]: *** [arch/x86/kernel/cpu/mcheck] Error 2
make[3]: *** Waiting for unfinished jobs....
make[2]: *** [arch/x86/kernel/cpu] Error 2
make[1]: *** [arch/x86/kernel] Error 2
make: *** [arch/x86] Error 2
make: *** Waiting for unfinished jobs....

> +		if (severity > worst) {
> +			worst = severity;
> +			msg = nmsg;
> +		}
> +	}
> +
> +	if (worst >= MCE_UC_SEVERITY)
> +		panic("unignorable machine check under kdumping: %s", msg);

Just say:

		panic("kdump: Fatal machine check: %s", msg);

> +
> +	pr_emerg("MCE triggered under kdumping. If you are lucky enough, you will have a kdump. Otherwise, this is a dying message.\n");

This message is also funny.

How about:

	pr_emerg("kdump: Non-fatal MCE detected - kernel dump might be unreliable.\n");

> +
> +clear_mcg_status:
> +	wrmsrl(MSR_IA32_MCG_STATUS, 0);
> +out:
> +	sync_core();
> +}
> +
> +/*
> + * Switch the MCE handler to kdump-specific one
> + *
> + * Standard MCE handler do_machine_check() is not designed for kexec/kdump
> + * context, where we can't expect MCE recovering and logging to work fine
> + * because the kernel might be unstable (all CPUs except one must be quiet
> + * as possible.)

"... all CPUs except one must be idle."

> + *
> + * Moreover, in such situation getting kdump is prior to doing something for

... " in such situations, getting a kernel dump is more important than handling
MCEs..."

> + * MCEs because what the users are really interested in is to find what caused

								 find out

> + * the crashing, not what caused the crashing to fail.

... the crash." Fullstop.

> + *
> + * So let's switch MCE handler to the one suitable for kexec/kdump situation.
> + */
> +void switch_mce_handler_for_kdump(void)

s/switch_mce_handler_for_kdump/kdump_setup_mce/

> +{
> +	kdump_cpu = smp_processor_id();
> +	machine_check_vector = machine_check_under_kdump;
> +}
> +#else
> +static inline void switch_mce_handler_for_kdump(void) {}
> +#endif
> +
>  /*
>   * This is used to VMCLEAR all VMCSs loaded on the
>   * processor. And when loading kvm_intel module, the
> @@ -166,6 +251,8 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
>  	/* The kernel is broken so disable interrupts */
>  	local_irq_disable();
>  
> +	switch_mce_handler_for_kdump();
> +
>  	kdump_nmi_shootdown_cpus();
>  
>  	/*
> -- 
> 1.9.3
> 

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v7] x86: mce: kexec: switch MCE handler for kexec/kdump
  2015-04-06  7:18                           ` Naoya Horiguchi
@ 2015-04-06 11:59                             ` Borislav Petkov
  2015-04-07  8:00                               ` Naoya Horiguchi
  0 siblings, 1 reply; 44+ messages in thread
From: Borislav Petkov @ 2015-04-06 11:59 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Luck, Tony, Prarit Bhargava, Vivek Goyal, linux-kernel,
	Junichi Nomura, Kiyoshi Ueda

Hi Naoya,

On Mon, Apr 06, 2015 at 07:18:03AM +0000, Naoya Horiguchi wrote:
> Hi Boris, Tony,
> 
> How will this patch be handled for v4.1?
> Could you review and consider merging into your tree?

Damn, it seems Tony and I both forgot about this one, sorry about that
:-\

I'm afraid it would be too late for 4.1 as Linus should be doing a 4.0
release and thus open the merge window any day now.

I've reviewed your v7 and once you've addressed the comments and Tony
has acked it, I'll make sure it gets queued for 4.2.

Thanks.

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v7] x86: mce: kexec: switch MCE handler for kexec/kdump
  2015-04-06 11:56                           ` [PATCH v7] " Borislav Petkov
@ 2015-04-07  7:59                             ` Naoya Horiguchi
  0 siblings, 0 replies; 44+ messages in thread
From: Naoya Horiguchi @ 2015-04-07  7:59 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Luck, Tony, Prarit Bhargava, Vivek Goyal, linux-kernel,
	Junichi Nomura, Kiyoshi Ueda

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset="utf-8", Size: 6502 bytes --]

On Mon, Apr 06, 2015 at 01:56:40PM +0200, Borislav Petkov wrote:
> On Fri, Mar 06, 2015 at 10:22:16AM +0000, Naoya Horiguchi wrote:
> > kexec disables (or "shoots down") all CPUs other than a crashing CPU before
> > entering the 2nd kernel. But the MCE handler is still enabled after that,
> > so if MCE happens and broadcasts over the CPUs after the main thread starts
> > the 2nd kernel (which might not initialize MCE device yet, or might decide
> > not to enable it,) MCE handler runs only on the other CPUs (not on the main
> 
> You lost me here: "MCE device"?

I meant "MCE handler", sorry.

> > thread,) leading to kernel panic with MCE synchronization. The user-visible
> > effect of this bug is kdump failure.
> > 
> > Our standard MCE handler do_machine_check() assumes some about system's
> > status and it's hard to alter it to cover kexec/kdump context, so let's add
> > another kdump-specific one and switch to it.
> > 
> > Note that this problem exists since current MCE handler was implemented in
> 
> Did we have kdump then at all and was that a "problem" then even?

At that time we did have kdump because MCE synchronization timeout didn't
cause panic. But there're some minor problems like
 1) idling CPUs can respond to MCE and try to do some work (like reading MCE
    banks and doing mce_log()), which means disturbing memory,
 2) no direct hint about unreliability of the kdump.

...

> > +
> > +	rdmsrl(MSR_IA32_MCG_STATUS, m.mcgstatus);
> > +	if (regs && m.mcgstatus & (MCG_STATUS_RIPV|MCG_STATUS_EIPV))
> > +		m.cs = regs->cs;
> 
> Newline here please.

OK.

> > +	for (i = 0; i < mca_cfg.banks; i++) {
> > +		rdmsrl(MSR_IA32_MCx_STATUS(i), m.status);
> > +		severity = mce_severity(&m, mca_cfg.tolerant, &nmsg, true);
> 
> MCE severity got reworked recently, please redo your patch against
> latest tip/master:

Rebased to tip/master, and fixed the conflict.

> In file included from arch/x86/kernel/cpu/mcheck/mce-severity.c:18:0:
> arch/x86/kernel/cpu/mcheck/mce-internal.h:15:14: error: ‘mce_severity’ redeclared as different kind of symbol
>  extern int (*mce_severity)(struct mce *a, int tolerant, char **msg, bool is_excp);
>               ^
> In file included from arch/x86/kernel/cpu/mcheck/mce-severity.c:16:0:
> ./arch/x86/include/asm/mce.h:270:5: note: previous declaration of ‘mce_severity’ was here
>  int mce_severity(struct mce *a, int tolerant, char **msg, bool is_excp);
>      ^
> arch/x86/kernel/cpu/mcheck/mce-severity.c:274:7: error: ‘mce_severity’ redeclared as different kind of symbol
>  int (*mce_severity)(struct mce *m, int tolerant, char **msg, bool is_excp) =
>        ^
> In file included from arch/x86/kernel/cpu/mcheck/mce-severity.c:16:0:
> ./arch/x86/include/asm/mce.h:270:5: note: previous declaration of ‘mce_severity’ was here
>  int mce_severity(struct mce *a, int tolerant, char **msg, bool is_excp);
>      ^
> make[4]: *** [arch/x86/kernel/cpu/mcheck/mce-severity.o] Error 1
> make[4]: *** Waiting for unfinished jobs....
> In file included from arch/x86/kernel/cpu/mcheck/mce.c:50:0:
> arch/x86/kernel/cpu/mcheck/mce-internal.h:15:14: error: ‘mce_severity’ redeclared as different kind of symbol
>  extern int (*mce_severity)(struct mce *a, int tolerant, char **msg, bool is_excp);
>               ^
> In file included from arch/x86/kernel/cpu/mcheck/mce.c:47:0:
> ./arch/x86/include/asm/mce.h:270:5: note: previous declaration of ‘mce_severity’ was here
>  int mce_severity(struct mce *a, int tolerant, char **msg, bool is_excp);
>      ^
> make[4]: *** [arch/x86/kernel/cpu/mcheck/mce.o] Error 1
> make[3]: *** [arch/x86/kernel/cpu/mcheck] Error 2
> make[3]: *** Waiting for unfinished jobs....
> make[2]: *** [arch/x86/kernel/cpu] Error 2
> make[1]: *** [arch/x86/kernel] Error 2
> make: *** [arch/x86] Error 2
> make: *** Waiting for unfinished jobs....
> 
> > +		if (severity > worst) {
> > +			worst = severity;
> > +			msg = nmsg;
> > +		}
> > +	}
> > +
> > +	if (worst >= MCE_UC_SEVERITY)
> > +		panic("unignorable machine check under kdumping: %s", msg);
> 
> Just say:
> 
> 		panic("kdump: Fatal machine check: %s", msg);

OK.

> > +
> > +	pr_emerg("MCE triggered under kdumping. If you are lucky enough, you will have a kdump. Otherwise, this is a dying message.\n");
> 
> This message is also funny.
> 
> How about:
> 
> 	pr_emerg("kdump: Non-fatal MCE detected - kernel dump might be unreliable.\n");

OK, and I noticed that printing out msg in this case is helpful.

> > +
> > +clear_mcg_status:
> > +	wrmsrl(MSR_IA32_MCG_STATUS, 0);
> > +out:
> > +	sync_core();
> > +}
> > +
> > +/*
> > + * Switch the MCE handler to kdump-specific one
> > + *
> > + * Standard MCE handler do_machine_check() is not designed for kexec/kdump
> > + * context, where we can't expect MCE recovering and logging to work fine
> > + * because the kernel might be unstable (all CPUs except one must be quiet
> > + * as possible.)
> 
> "... all CPUs except one must be idle."
> 
> > + *
> > + * Moreover, in such situation getting kdump is prior to doing something for
> 
> ... " in such situations, getting a kernel dump is more important than handling
> MCEs..."
> 
> > + * MCEs because what the users are really interested in is to find what caused
> 
> 								 find out
> 
> > + * the crashing, not what caused the crashing to fail.
> 
> ... the crash." Fullstop.

All these comments are included, thanks.

> > + *
> > + * So let's switch MCE handler to the one suitable for kexec/kdump situation.
> > + */
> > +void switch_mce_handler_for_kdump(void)
> 
> s/switch_mce_handler_for_kdump/kdump_setup_mce/

Yes, it's simpler.

Thanks,
Naoya Horiguchi

> > +{
> > +	kdump_cpu = smp_processor_id();
> > +	machine_check_vector = machine_check_under_kdump;
> > +}
> > +#else
> > +static inline void switch_mce_handler_for_kdump(void) {}
> > +#endif
> > +
> >  /*
> >   * This is used to VMCLEAR all VMCSs loaded on the
> >   * processor. And when loading kvm_intel module, the
> > @@ -166,6 +251,8 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
> >  	/* The kernel is broken so disable interrupts */
> >  	local_irq_disable();
> >  
> > +	switch_mce_handler_for_kdump();
> > +
> >  	kdump_nmi_shootdown_cpus();
> >  
> >  	/*
> > -- 
> > 1.9.3
> > 
> 
> -- 
> Regards/Gruss,
>     Boris.
> 
> ECO tip #101: Trim your mails when you reply.
> --
> ÿôèº{.nÇ+‰·Ÿ®‰­†+%ŠËÿ±éݶ\x17¥Šwÿº{.nÇ+‰·¥Š{±þG«éÿŠ{ayº\x1dʇڙë,j\a­¢f£¢·hšïêÿ‘êçz_è®\x03(­éšŽŠÝ¢j"ú\x1a¶^[m§ÿÿ¾\a«þG«éÿ¢¸?™¨è­Ú&£ø§~á¶iO•æ¬z·švØ^\x14\x04\x1a¶^[m§ÿÿÃ\fÿ¶ìÿ¢¸?–I¥

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v7] x86: mce: kexec: switch MCE handler for kexec/kdump
  2015-04-06 11:59                             ` Borislav Petkov
@ 2015-04-07  8:00                               ` Naoya Horiguchi
  2015-04-07  8:02                                 ` [PATCH v8] " Naoya Horiguchi
  0 siblings, 1 reply; 44+ messages in thread
From: Naoya Horiguchi @ 2015-04-07  8:00 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Luck, Tony, Prarit Bhargava, Vivek Goyal, linux-kernel,
	Junichi Nomura, Kiyoshi Ueda

On Mon, Apr 06, 2015 at 01:59:23PM +0200, Borislav Petkov wrote:
> Hi Naoya,
> 
> On Mon, Apr 06, 2015 at 07:18:03AM +0000, Naoya Horiguchi wrote:
> > Hi Boris, Tony,
> > 
> > How will this patch be handled for v4.1?
> > Could you review and consider merging into your tree?
> 
> Damn, it seems Tony and I both forgot about this one, sorry about that
> :-\
> 
> I'm afraid it would be too late for 4.1 as Linus should be doing a 4.0
> release and thus open the merge window any day now.

OK,

> I've reviewed your v7 and once you've addressed the comments and Tony
> has acked it, I'll make sure it gets queued for 4.2.

Thank you very much for the review, I'll post v8 soon.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH v8] x86: mce: kexec: switch MCE handler for kexec/kdump
  2015-04-07  8:00                               ` Naoya Horiguchi
@ 2015-04-07  8:02                                 ` Naoya Horiguchi
  2015-04-09  6:13                                   ` Borislav Petkov
  0 siblings, 1 reply; 44+ messages in thread
From: Naoya Horiguchi @ 2015-04-07  8:02 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Luck, Tony, Prarit Bhargava, Vivek Goyal, linux-kernel,
	Junichi Nomura, Kiyoshi Ueda

kexec disables (or "shoots down") all CPUs other than a crashing CPU before
entering the 2nd kernel. But the MCE handler is still enabled after that,
so if MCE happens and broadcasts over the CPUs after the main thread starts
the 2nd kernel (which might not initialize its MCE handler yet, or might decide
not to enable it,) MCE handler runs only on the other CPUs (not on the main
thread,) leading to kernel panic with MCE synchronization. The user-visible
effect of this bug is kdump failure.

Our standard MCE handler do_machine_check() assumes some about system's
status and it's hard to alter it to cover kexec/kdump context, so let's add
another kdump-specific one and switch to it.

Note that this problem exists since current MCE handler was implemented in
2.6.32, and recently commit 716079f66eac ("mce: Panic when a core has reached
a timeout") made it more visible by changing the default behavior of the
synchronization timeout from "ignore" to "panic".

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
ChangeLog: v7 -> v8
- rebase onto tip/master (resolve conflict with mce_severity rework)
- fix printk messages and comments
- s/switch_mce_handler_for_kdump/kdump_setup_mce/g
- print out msg in non-fatal MCE case too

ChangeLog: v6 -> v7
- use rdmsrl() instead of rdmsrl_safe() with mca_cfg.disabled check

ChangeLog v5 -> v6:
- drop "CC stable" tag
- stop using/exporting mce_gather_info(), mce_(rd|wr)msrl(), and mce_panic()
- drop quirk_no_way_out() part, because quirk_sandybridge_ifu() (only possible
  callback) could just change a MCE_PANIC_SEVERITY case to a MCE_AR_SEVERITY
  case, which doesn't affect the panic/return decision.

ChangeLog v4 -> v5:
- drop MCE_UC/AR_SEVERITY re-ordering
- move most of code to arch/x86/kernel/crash.c
- export some MCE internal variables/routines via arch/x86/include/asm/mce.h

ChangeLog v3 -> v4:
- fixed AR and UC order in enum severity_level because UC is severer than AR
  by definition. Current code is not affected by this wrong order by chance.
- check severity in machine_check_under_kdump(), and call mce_panic() if the
  resultant severity is as bad as or worse than MCE_AR_SEVERITY.
- use static global variable kdump_cpu instead of mca_cfg->kdump_cpu
- reduce "#ifdef CONFIG_KEXEC"
- add "#ifdef CONFIG_X86_MCE" for declaration of machine_check_under_kdump()
  in mce.h
- update comment on switch_mce_handler_for_kdump()

ChangeLog v2 -> v3
- go to "switch MCE handler" approach

ChangeLog v1 -> v2
- clear MSR_IA32_MCG_CTL, MSR_IA32_MCx_CTL, and CR4.MCE instead of using
  global flag to ignore MCE events.
- fixed the description of the problem
---
 arch/x86/include/asm/mce.h                | 15 ++++++
 arch/x86/kernel/cpu/mcheck/mce-internal.h | 13 -----
 arch/x86/kernel/crash.c                   | 87 +++++++++++++++++++++++++++++++
 3 files changed, 102 insertions(+), 13 deletions(-)

diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index 1f5a86d518db..d35ea7e8a764 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -255,4 +255,19 @@ struct cper_sec_mem_err;
 extern void apei_mce_report_mem_error(int corrected,
 				      struct cper_sec_mem_err *mem_err);
 
+enum severity_level {
+	MCE_NO_SEVERITY,
+	MCE_DEFERRED_SEVERITY,
+	MCE_UCNA_SEVERITY = MCE_DEFERRED_SEVERITY,
+	MCE_KEEP_SEVERITY,
+	MCE_SOME_SEVERITY,
+	MCE_AO_SEVERITY,
+	MCE_UC_SEVERITY,
+	MCE_AR_SEVERITY,
+	MCE_PANIC_SEVERITY,
+};
+
+extern int (*mce_severity)(struct mce *a, int tolerant, char **msg,
+				 bool is_excp);
+
 #endif /* _ASM_X86_MCE_H */
diff --git a/arch/x86/kernel/cpu/mcheck/mce-internal.h b/arch/x86/kernel/cpu/mcheck/mce-internal.h
index fe32074b865b..47d1e5218fb5 100644
--- a/arch/x86/kernel/cpu/mcheck/mce-internal.h
+++ b/arch/x86/kernel/cpu/mcheck/mce-internal.h
@@ -1,18 +1,6 @@
 #include <linux/device.h>
 #include <asm/mce.h>
 
-enum severity_level {
-	MCE_NO_SEVERITY,
-	MCE_DEFERRED_SEVERITY,
-	MCE_UCNA_SEVERITY = MCE_DEFERRED_SEVERITY,
-	MCE_KEEP_SEVERITY,
-	MCE_SOME_SEVERITY,
-	MCE_AO_SEVERITY,
-	MCE_UC_SEVERITY,
-	MCE_AR_SEVERITY,
-	MCE_PANIC_SEVERITY,
-};
-
 #define ATTR_LEN		16
 #define INITIAL_CHECK_INTERVAL	5 * 60 /* 5 minutes */
 
@@ -24,7 +12,6 @@ struct mce_bank {
 	char			attrname[ATTR_LEN];	/* attribute name */
 };
 
-extern int (*mce_severity)(struct mce *a, int tolerant, char **msg, bool is_excp);
 struct dentry *mce_get_debugfs_dir(void);
 
 extern struct mce_bank *mce_banks;
diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index b71ff16cbdb0..0b12eb5452f6 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -34,6 +34,7 @@
 #include <asm/cpu.h>
 #include <asm/reboot.h>
 #include <asm/virtext.h>
+#include <asm/mce.h>
 
 /* Alignment required for elf header segment */
 #define ELF_CORE_HEADER_ALIGN   4096
@@ -76,6 +77,90 @@ struct crash_memmap_data {
 
 int in_crash_kexec;
 
+#ifdef CONFIG_X86_MCE
+static int kdump_cpu;
+
+/*
+ * kdump-specific machine check handler
+ *
+ * When kexec/kdump is running, what the MCE handler is expected to do
+ * changes depending on whether the CPU is running the main thread or not.
+ *
+ * The crashing CPU, controlling the whole system exclusively, should try to
+ * get kdump as hard as possible even if an MCE happens concurrently, because
+ * some types of MCEs (for example, uncorrected errors like SRAO,)
+ * are not fatal or don't ruin reliablility of the kdump (consider that an
+ * MCE can hit the other CPU, in which case corrupted data is never consumed.)
+ * If an MCE critically breaks the kdump operation, we are unlucky so let's
+ * accept the fate of whatever HW causes, hoping a dying message reaches admins.
+ *
+ * The other CPUs are supposed to be quiet during kexec/kdump, so after the
+ * crashing CPU shot them down, they should not do anything except clearing
+ * MCG_STATUS (without this the system is reset, which is undesirable.)
+ * Note that this is also true after the crashing CPU enter the 2nd kernel.
+ */
+static void machine_check_under_kdump(struct pt_regs *regs, long error_code)
+{
+	struct mce m = {};
+	char *msg = NULL;
+	char *nmsg = NULL;
+	int i;
+	int worst = 0;
+	int severity;
+
+	if (mca_cfg.disabled)
+		return;
+	if (!mca_cfg.banks)
+		goto out;
+	if (kdump_cpu != smp_processor_id())
+		goto clear_mcg_status;
+
+	rdmsrl(MSR_IA32_MCG_STATUS, m.mcgstatus);
+	if (regs && m.mcgstatus & (MCG_STATUS_RIPV|MCG_STATUS_EIPV))
+		m.cs = regs->cs;
+
+	for (i = 0; i < mca_cfg.banks; i++) {
+		rdmsrl(MSR_IA32_MCx_STATUS(i), m.status);
+		severity = mce_severity(&m, mca_cfg.tolerant, &nmsg, true);
+		if (severity > worst) {
+			worst = severity;
+			msg = nmsg;
+		}
+	}
+
+	if (worst >= MCE_UC_SEVERITY)
+		panic("kdump: Fatal machine check: %s", msg);
+
+	pr_emerg("kdump: Non-fatal MCE detected: %s - kernel dump might be unreliable.\n", msg);
+
+clear_mcg_status:
+	wrmsrl(MSR_IA32_MCG_STATUS, 0);
+out:
+	sync_core();
+}
+
+/*
+ * Switch the MCE handler to kdump-specific one
+ *
+ * Standard MCE handler do_machine_check() is not designed for kexec/kdump
+ * context, where we can't expect MCE's recovering and logging to work fine
+ * because the kernel might be unstable (all CPUs except one must be idle.)
+ *
+ * In such situations, getting a kernel dump is more important than handling
+ * MCEs because what the users are really interested in is to find out what
+ * caused the crash.
+ *
+ * So let's switch MCE handler to the one suitable for kexec/kdump situation.
+ */
+void kdump_setup_mce(void)
+{
+	kdump_cpu = smp_processor_id();
+	machine_check_vector = machine_check_under_kdump;
+}
+#else
+static inline void kdump_setup_mce(void) {}
+#endif
+
 /*
  * This is used to VMCLEAR all VMCSs loaded on the
  * processor. And when loading kvm_intel module, the
@@ -166,6 +251,8 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
 	/* The kernel is broken so disable interrupts */
 	local_irq_disable();
 
+	kdump_setup_mce();
+
 	kdump_nmi_shootdown_cpus();
 
 	/*
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH v8] x86: mce: kexec: switch MCE handler for kexec/kdump
  2015-04-07  8:02                                 ` [PATCH v8] " Naoya Horiguchi
@ 2015-04-09  6:13                                   ` Borislav Petkov
  2015-04-09  6:57                                     ` Naoya Horiguchi
  2015-04-09  8:00                                     ` Ingo Molnar
  0 siblings, 2 replies; 44+ messages in thread
From: Borislav Petkov @ 2015-04-09  6:13 UTC (permalink / raw)
  To: Naoya Horiguchi, Ingo Molnar, Tony Luck
  Cc: Luck, Tony, Prarit Bhargava, Vivek Goyal, linux-kernel,
	Junichi Nomura, Kiyoshi Ueda

On Tue, Apr 07, 2015 at 08:02:18AM +0000, Naoya Horiguchi wrote:
> kexec disables (or "shoots down") all CPUs other than a crashing CPU before
> entering the 2nd kernel. But the MCE handler is still enabled after that,
> so if MCE happens and broadcasts over the CPUs after the main thread starts
> the 2nd kernel (which might not initialize its MCE handler yet, or might decide
> not to enable it,) MCE handler runs only on the other CPUs (not on the main
> thread,) leading to kernel panic with MCE synchronization. The user-visible
> effect of this bug is kdump failure.
> 
> Our standard MCE handler do_machine_check() assumes some about system's
> status and it's hard to alter it to cover kexec/kdump context, so let's add
> another kdump-specific one and switch to it.
> 
> Note that this problem exists since current MCE handler was implemented in
> 2.6.32, and recently commit 716079f66eac ("mce: Panic when a core has reached
> a timeout") made it more visible by changing the default behavior of the
> synchronization timeout from "ignore" to "panic".
> 
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> ---
> ChangeLog: v7 -> v8

Ok, I cleaned it up a bit and straightened some comments, see below. But
other than minor issues I don't see anything wrong with this patch so
far.

Btw, Ingo had some reservations about this. Ingo?

Also Tony hasn't Ack/Naked this...

The more important question for me is how are you testing this? I did
try injecting some MCEs with qemu while the second kernel is booting but
that caused a triple-fault or the guest froze completely.

Hmmm.

Thanks.

---
From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Date: Tue, 7 Apr 2015 08:02:18 +0000
Subject: [PATCH] x86/mce, crash: Switch MCE handler for kexec/kdump

kexec disables (or "shoots down") all CPUs other than the crashing
CPU before entering the 2nd kernel. However, MCA is still enabled so
if an MCE happens and broadcasts to the CPUs after the main thread
starts the 2nd kernel (which might not initialize its MCE handler yet,
or might decide not to enable it) the MCE handler runs only on the
other CPUs (not on the main thread) leading to kernel panic during MCE
synchronization. The user-visible effect of this bug is a kdump failure.

Our standard MCE handler do_machine_check() assumes a bunch of things
about system's status and it's hard to alter it to cover kexec/kdump
context, so add another, kdump-specific one and switch to it.

Note that this problem exists since current MCE handler was implemented
in 2.6.32, and recently commit 716079f66eac ("mce: Panic when a core
has reached a timeout") made it more visible by changing the default
behavior of the synchronization timeout from "ignore" to "panic".

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: "Tony Luck" <tony.luck@intel.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Cc: Junichi Nomura <j-nomura@ce.jp.nec.com>
Cc: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Link: http://lkml.kernel.org/r/1425373306-26187-1-git-send-email-n-horiguchi@ah.jp.nec.com
Signed-off-by: Borislav Petkov <bp@suse.de>
---
 arch/x86/include/asm/mce.h                | 14 +++++
 arch/x86/kernel/cpu/mcheck/mce-internal.h | 13 -----
 arch/x86/kernel/crash.c                   | 89 +++++++++++++++++++++++++++++++
 3 files changed, 103 insertions(+), 13 deletions(-)

diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index 1f5a86d518db..a88a74e19d14 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -255,4 +255,18 @@ struct cper_sec_mem_err;
 extern void apei_mce_report_mem_error(int corrected,
 				      struct cper_sec_mem_err *mem_err);
 
+enum severity_level {
+	MCE_NO_SEVERITY,
+	MCE_DEFERRED_SEVERITY,
+	MCE_UCNA_SEVERITY = MCE_DEFERRED_SEVERITY,
+	MCE_KEEP_SEVERITY,
+	MCE_SOME_SEVERITY,
+	MCE_AO_SEVERITY,
+	MCE_UC_SEVERITY,
+	MCE_AR_SEVERITY,
+	MCE_PANIC_SEVERITY,
+};
+
+extern int (*mce_severity)(struct mce *a, int tolerant, char **msg, bool is_excp);
+
 #endif /* _ASM_X86_MCE_H */
diff --git a/arch/x86/kernel/cpu/mcheck/mce-internal.h b/arch/x86/kernel/cpu/mcheck/mce-internal.h
index fe32074b865b..47d1e5218fb5 100644
--- a/arch/x86/kernel/cpu/mcheck/mce-internal.h
+++ b/arch/x86/kernel/cpu/mcheck/mce-internal.h
@@ -1,18 +1,6 @@
 #include <linux/device.h>
 #include <asm/mce.h>
 
-enum severity_level {
-	MCE_NO_SEVERITY,
-	MCE_DEFERRED_SEVERITY,
-	MCE_UCNA_SEVERITY = MCE_DEFERRED_SEVERITY,
-	MCE_KEEP_SEVERITY,
-	MCE_SOME_SEVERITY,
-	MCE_AO_SEVERITY,
-	MCE_UC_SEVERITY,
-	MCE_AR_SEVERITY,
-	MCE_PANIC_SEVERITY,
-};
-
 #define ATTR_LEN		16
 #define INITIAL_CHECK_INTERVAL	5 * 60 /* 5 minutes */
 
@@ -24,7 +12,6 @@ struct mce_bank {
 	char			attrname[ATTR_LEN];	/* attribute name */
 };
 
-extern int (*mce_severity)(struct mce *a, int tolerant, char **msg, bool is_excp);
 struct dentry *mce_get_debugfs_dir(void);
 
 extern struct mce_bank *mce_banks;
diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index aceb2f90c716..f4948a8d5fa6 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -34,6 +34,7 @@
 #include <asm/cpu.h>
 #include <asm/reboot.h>
 #include <asm/virtext.h>
+#include <asm/mce.h>
 
 /* Alignment required for elf header segment */
 #define ELF_CORE_HEADER_ALIGN   4096
@@ -76,6 +77,92 @@ struct crash_memmap_data {
 
 int in_crash_kexec;
 
+#ifdef CONFIG_X86_MCE
+static int kdump_cpu;
+
+/*
+ * kdump-specific machine check handler
+ *
+ * When kexec/kdump is running, what the MCE handler is expected to do
+ * changes depending on whether the CPU is running the main thread or not.
+ *
+ * The crashing CPU, controlling the whole system exclusively, should
+ * try to complete a dump as hard as possible even if an MCE happens
+ * concurrently because some types of non-fatal MCEs (for example,
+ * uncorrected errors like SRAO) don't necessarily impair kdump's
+ * reliability (consider that an MCE can hit another CPU, in which
+ * case corrupted data might not get consumed). If an MCE critically
+ * breaks kdump operation, we are unlucky and then have to accept the HW
+ * failure. In that case, we hope that at least a dying message reaches
+ * the admins.
+ *
+ * The other CPUs are supposed to be quiet during kexec/kdump, so after
+ * the crashing CPU shot them down, they should not do anything except
+ * clearing MCG_STATUS (they need to, otherwise the system gets reset,
+ * which is undesirable either). Note that this is also true after the
+ * crashing CPU enters the 2nd kernel.
+ */
+static void machine_check_under_kdump(struct pt_regs *regs, long error_code)
+{
+	char *msg = NULL, *nmsg = NULL;
+	int i, severity, worst = 0;
+	struct mce m = {};
+
+	if (mca_cfg.disabled)
+		return;
+
+	if (!mca_cfg.banks)
+		goto out;
+
+	if (kdump_cpu != smp_processor_id())
+		goto clear_mcg_status;
+
+	rdmsrl(MSR_IA32_MCG_STATUS, m.mcgstatus);
+	if (regs && m.mcgstatus & (MCG_STATUS_RIPV|MCG_STATUS_EIPV))
+		m.cs = regs->cs;
+
+	for (i = 0; i < mca_cfg.banks; i++) {
+		rdmsrl(MSR_IA32_MCx_STATUS(i), m.status);
+		severity = mce_severity(&m, mca_cfg.tolerant, &nmsg, true);
+		if (severity > worst) {
+			worst = severity;
+			msg = nmsg;
+		}
+	}
+
+	if (worst >= MCE_UC_SEVERITY)
+		panic("kdump: Fatal machine check: %s", msg);
+
+	pr_emerg("kdump: Non-fatal MCE detected: %s - kernel dump might be unreliable.\n", msg);
+
+clear_mcg_status:
+	wrmsrl(MSR_IA32_MCG_STATUS, 0);
+out:
+	sync_core();
+}
+
+/*
+ * Switch the MCE handler to kdump-specific one
+ *
+ * Standard MCE handler do_machine_check() is not designed for kexec/kdump
+ * context, where we can't expect MCE's recovering and logging to work fine
+ * because the kernel might be unstable (all CPUs except one must be idle).
+ *
+ * In such situations, getting a kernel dump is more important than handling
+ * MCEs because what the users are really interested in is to find out what
+ * caused the crash.
+ *
+ * So let's switch MCE handler to the one suitable for kexec/kdump situation.
+ */
+void kdump_setup_mce(void)
+{
+	kdump_cpu = smp_processor_id();
+	machine_check_vector = machine_check_under_kdump;
+}
+#else
+static inline void kdump_setup_mce(void) {}
+#endif
+
 /*
  * This is used to VMCLEAR all VMCSs loaded on the
  * processor. And when loading kvm_intel module, the
@@ -157,6 +244,8 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
 	/* The kernel is broken so disable interrupts */
 	local_irq_disable();
 
+	kdump_setup_mce();
+
 	kdump_nmi_shootdown_cpus();
 
 	/*
-- 
2.3.3


-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH v8] x86: mce: kexec: switch MCE handler for kexec/kdump
  2015-04-09  6:13                                   ` Borislav Petkov
@ 2015-04-09  6:57                                     ` Naoya Horiguchi
  2015-04-09  7:02                                       ` Borislav Petkov
  2015-04-09  8:00                                     ` Ingo Molnar
  1 sibling, 1 reply; 44+ messages in thread
From: Naoya Horiguchi @ 2015-04-09  6:57 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Ingo Molnar, Tony Luck, Prarit Bhargava, Vivek Goyal,
	linux-kernel, Junichi Nomura, Kiyoshi Ueda

On Thu, Apr 09, 2015 at 08:13:46AM +0200, Borislav Petkov wrote:
> On Tue, Apr 07, 2015 at 08:02:18AM +0000, Naoya Horiguchi wrote:
> > kexec disables (or "shoots down") all CPUs other than a crashing CPU before
> > entering the 2nd kernel. But the MCE handler is still enabled after that,
> > so if MCE happens and broadcasts over the CPUs after the main thread starts
> > the 2nd kernel (which might not initialize its MCE handler yet, or might decide
> > not to enable it,) MCE handler runs only on the other CPUs (not on the main
> > thread,) leading to kernel panic with MCE synchronization. The user-visible
> > effect of this bug is kdump failure.
> > 
> > Our standard MCE handler do_machine_check() assumes some about system's
> > status and it's hard to alter it to cover kexec/kdump context, so let's add
> > another kdump-specific one and switch to it.
> > 
> > Note that this problem exists since current MCE handler was implemented in
> > 2.6.32, and recently commit 716079f66eac ("mce: Panic when a core has reached
> > a timeout") made it more visible by changing the default behavior of the
> > synchronization timeout from "ignore" to "panic".
> > 
> > Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> > ---
> > ChangeLog: v7 -> v8
> 
> Ok, I cleaned it up a bit and straightened some comments, see below. But
> other than minor issues I don't see anything wrong with this patch so
> far.
> 
> Btw, Ingo had some reservations about this. Ingo?
> 
> Also Tony hasn't Ack/Naked this...
> 
> The more important question for me is how are you testing this? I did
> try injecting some MCEs with qemu while the second kernel is booting but
> that caused a triple-fault or the guest froze completely.

Yes, I did see it at fisrt, so I did two tweaks for the testing:

1) to fix qemu code. I think that current mce injection code of qemu is buggy,
because when we try to inject MCE in broadcast mode, all injections other than
the first one are done with MCG_STATUS_MCIP (see cpu_x86_inject_mce()@target-i386/helper.c.)
It looks to me a bug because this means that every (broadcast mode) MCE injection
causes triplet-fault, which seems not mimicking the real HW behavior.

2) to insert the delay (for a few seconds) into kdump_nmi_callback() before
disable_local_APIC(). This is because MCE interrupt is delivered to CPUs in
different manners in qemu and in bare metal. Bare metals do respond to MCE
interrupts after disable_local_APIC(), but qemu not.


Unfortunately our testing (~15000 times kdump/reboot cycles) with the debug
kernel on bare metals didn't reproduce the problem yet, but I believe that
the above testing on qemu should hit a target.

Thanks,
Naoya Horiguchi

> 
> Hmmm.
> 
> Thanks.
> 
> ---
> From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Date: Tue, 7 Apr 2015 08:02:18 +0000
> Subject: [PATCH] x86/mce, crash: Switch MCE handler for kexec/kdump
> 
> kexec disables (or "shoots down") all CPUs other than the crashing
> CPU before entering the 2nd kernel. However, MCA is still enabled so
> if an MCE happens and broadcasts to the CPUs after the main thread
> starts the 2nd kernel (which might not initialize its MCE handler yet,
> or might decide not to enable it) the MCE handler runs only on the
> other CPUs (not on the main thread) leading to kernel panic during MCE
> synchronization. The user-visible effect of this bug is a kdump failure.
> 
> Our standard MCE handler do_machine_check() assumes a bunch of things
> about system's status and it's hard to alter it to cover kexec/kdump
> context, so add another, kdump-specific one and switch to it.
> 
> Note that this problem exists since current MCE handler was implemented
> in 2.6.32, and recently commit 716079f66eac ("mce: Panic when a core
> has reached a timeout") made it more visible by changing the default
> behavior of the synchronization timeout from "ignore" to "panic".
> 
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Cc: "Tony Luck" <tony.luck@intel.com>
> Cc: Prarit Bhargava <prarit@redhat.com>
> Cc: Vivek Goyal <vgoyal@redhat.com>
> Cc: "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
> Cc: Junichi Nomura <j-nomura@ce.jp.nec.com>
> Cc: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
> Link: http://lkml.kernel.org/r/1425373306-26187-1-git-send-email-n-horiguchi@ah.jp.nec.com
> Signed-off-by: Borislav Petkov <bp@suse.de>
> ---
>  arch/x86/include/asm/mce.h                | 14 +++++
>  arch/x86/kernel/cpu/mcheck/mce-internal.h | 13 -----
>  arch/x86/kernel/crash.c                   | 89 +++++++++++++++++++++++++++++++
>  3 files changed, 103 insertions(+), 13 deletions(-)
> 
> diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
> index 1f5a86d518db..a88a74e19d14 100644
> --- a/arch/x86/include/asm/mce.h
> +++ b/arch/x86/include/asm/mce.h
> @@ -255,4 +255,18 @@ struct cper_sec_mem_err;
>  extern void apei_mce_report_mem_error(int corrected,
>  				      struct cper_sec_mem_err *mem_err);
>  
> +enum severity_level {
> +	MCE_NO_SEVERITY,
> +	MCE_DEFERRED_SEVERITY,
> +	MCE_UCNA_SEVERITY = MCE_DEFERRED_SEVERITY,
> +	MCE_KEEP_SEVERITY,
> +	MCE_SOME_SEVERITY,
> +	MCE_AO_SEVERITY,
> +	MCE_UC_SEVERITY,
> +	MCE_AR_SEVERITY,
> +	MCE_PANIC_SEVERITY,
> +};
> +
> +extern int (*mce_severity)(struct mce *a, int tolerant, char **msg, bool is_excp);
> +
>  #endif /* _ASM_X86_MCE_H */
> diff --git a/arch/x86/kernel/cpu/mcheck/mce-internal.h b/arch/x86/kernel/cpu/mcheck/mce-internal.h
> index fe32074b865b..47d1e5218fb5 100644
> --- a/arch/x86/kernel/cpu/mcheck/mce-internal.h
> +++ b/arch/x86/kernel/cpu/mcheck/mce-internal.h
> @@ -1,18 +1,6 @@
>  #include <linux/device.h>
>  #include <asm/mce.h>
>  
> -enum severity_level {
> -	MCE_NO_SEVERITY,
> -	MCE_DEFERRED_SEVERITY,
> -	MCE_UCNA_SEVERITY = MCE_DEFERRED_SEVERITY,
> -	MCE_KEEP_SEVERITY,
> -	MCE_SOME_SEVERITY,
> -	MCE_AO_SEVERITY,
> -	MCE_UC_SEVERITY,
> -	MCE_AR_SEVERITY,
> -	MCE_PANIC_SEVERITY,
> -};
> -
>  #define ATTR_LEN		16
>  #define INITIAL_CHECK_INTERVAL	5 * 60 /* 5 minutes */
>  
> @@ -24,7 +12,6 @@ struct mce_bank {
>  	char			attrname[ATTR_LEN];	/* attribute name */
>  };
>  
> -extern int (*mce_severity)(struct mce *a, int tolerant, char **msg, bool is_excp);
>  struct dentry *mce_get_debugfs_dir(void);
>  
>  extern struct mce_bank *mce_banks;
> diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
> index aceb2f90c716..f4948a8d5fa6 100644
> --- a/arch/x86/kernel/crash.c
> +++ b/arch/x86/kernel/crash.c
> @@ -34,6 +34,7 @@
>  #include <asm/cpu.h>
>  #include <asm/reboot.h>
>  #include <asm/virtext.h>
> +#include <asm/mce.h>
>  
>  /* Alignment required for elf header segment */
>  #define ELF_CORE_HEADER_ALIGN   4096
> @@ -76,6 +77,92 @@ struct crash_memmap_data {
>  
>  int in_crash_kexec;
>  
> +#ifdef CONFIG_X86_MCE
> +static int kdump_cpu;
> +
> +/*
> + * kdump-specific machine check handler
> + *
> + * When kexec/kdump is running, what the MCE handler is expected to do
> + * changes depending on whether the CPU is running the main thread or not.
> + *
> + * The crashing CPU, controlling the whole system exclusively, should
> + * try to complete a dump as hard as possible even if an MCE happens
> + * concurrently because some types of non-fatal MCEs (for example,
> + * uncorrected errors like SRAO) don't necessarily impair kdump's
> + * reliability (consider that an MCE can hit another CPU, in which
> + * case corrupted data might not get consumed). If an MCE critically
> + * breaks kdump operation, we are unlucky and then have to accept the HW
> + * failure. In that case, we hope that at least a dying message reaches
> + * the admins.
> + *
> + * The other CPUs are supposed to be quiet during kexec/kdump, so after
> + * the crashing CPU shot them down, they should not do anything except
> + * clearing MCG_STATUS (they need to, otherwise the system gets reset,
> + * which is undesirable either). Note that this is also true after the
> + * crashing CPU enters the 2nd kernel.
> + */
> +static void machine_check_under_kdump(struct pt_regs *regs, long error_code)
> +{
> +	char *msg = NULL, *nmsg = NULL;
> +	int i, severity, worst = 0;
> +	struct mce m = {};
> +
> +	if (mca_cfg.disabled)
> +		return;
> +
> +	if (!mca_cfg.banks)
> +		goto out;
> +
> +	if (kdump_cpu != smp_processor_id())
> +		goto clear_mcg_status;
> +
> +	rdmsrl(MSR_IA32_MCG_STATUS, m.mcgstatus);
> +	if (regs && m.mcgstatus & (MCG_STATUS_RIPV|MCG_STATUS_EIPV))
> +		m.cs = regs->cs;
> +
> +	for (i = 0; i < mca_cfg.banks; i++) {
> +		rdmsrl(MSR_IA32_MCx_STATUS(i), m.status);
> +		severity = mce_severity(&m, mca_cfg.tolerant, &nmsg, true);
> +		if (severity > worst) {
> +			worst = severity;
> +			msg = nmsg;
> +		}
> +	}
> +
> +	if (worst >= MCE_UC_SEVERITY)
> +		panic("kdump: Fatal machine check: %s", msg);
> +
> +	pr_emerg("kdump: Non-fatal MCE detected: %s - kernel dump might be unreliable.\n", msg);
> +
> +clear_mcg_status:
> +	wrmsrl(MSR_IA32_MCG_STATUS, 0);
> +out:
> +	sync_core();
> +}
> +
> +/*
> + * Switch the MCE handler to kdump-specific one
> + *
> + * Standard MCE handler do_machine_check() is not designed for kexec/kdump
> + * context, where we can't expect MCE's recovering and logging to work fine
> + * because the kernel might be unstable (all CPUs except one must be idle).
> + *
> + * In such situations, getting a kernel dump is more important than handling
> + * MCEs because what the users are really interested in is to find out what
> + * caused the crash.
> + *
> + * So let's switch MCE handler to the one suitable for kexec/kdump situation.
> + */
> +void kdump_setup_mce(void)
> +{
> +	kdump_cpu = smp_processor_id();
> +	machine_check_vector = machine_check_under_kdump;
> +}
> +#else
> +static inline void kdump_setup_mce(void) {}
> +#endif
> +
>  /*
>   * This is used to VMCLEAR all VMCSs loaded on the
>   * processor. And when loading kvm_intel module, the
> @@ -157,6 +244,8 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
>  	/* The kernel is broken so disable interrupts */
>  	local_irq_disable();
>  
> +	kdump_setup_mce();
> +
>  	kdump_nmi_shootdown_cpus();
>  
>  	/*
> -- 
> 2.3.3
> 
> 
> -- 
> Regards/Gruss,
>     Boris.
> 
> ECO tip #101: Trim your mails when you reply.
> --
> 

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v8] x86: mce: kexec: switch MCE handler for kexec/kdump
  2015-04-09  6:57                                     ` Naoya Horiguchi
@ 2015-04-09  7:02                                       ` Borislav Petkov
  2015-04-09 18:07                                         ` Luck, Tony
  0 siblings, 1 reply; 44+ messages in thread
From: Borislav Petkov @ 2015-04-09  7:02 UTC (permalink / raw)
  To: Naoya Horiguchi, Tony Luck
  Cc: Ingo Molnar, Prarit Bhargava, Vivek Goyal, linux-kernel,
	Junichi Nomura, Kiyoshi Ueda

On Thu, Apr 09, 2015 at 06:57:38AM +0000, Naoya Horiguchi wrote:
> Yes, I did see it at fisrt, so I did two tweaks for the testing:
> 
> 1) to fix qemu code. I think that current mce injection code of qemu is buggy,
> because when we try to inject MCE in broadcast mode, all injections other than
> the first one are done with MCG_STATUS_MCIP (see cpu_x86_inject_mce()@target-i386/helper.c.)
> It looks to me a bug because this means that every (broadcast mode) MCE injection
> causes triplet-fault, which seems not mimicking the real HW behavior.
> 
> 2) to insert the delay (for a few seconds) into kdump_nmi_callback() before
> disable_local_APIC(). This is because MCE interrupt is delivered to CPUs in
> different manners in qemu and in bare metal. Bare metals do respond to MCE
> interrupts after disable_local_APIC(), but qemu not.

Lemme take a look at that.

> Unfortunately our testing (~15000 times kdump/reboot cycles) with the debug
> kernel on bare metals didn't reproduce the problem yet, but I believe that
> the above testing on qemu should hit a target.

If only APEI EINJ could be taught to do delayed injection, regardless of
OS kernel running. Tony, is something like that even possible at all?

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v8] x86: mce: kexec: switch MCE handler for kexec/kdump
  2015-04-09  6:13                                   ` Borislav Petkov
  2015-04-09  6:57                                     ` Naoya Horiguchi
@ 2015-04-09  8:00                                     ` Ingo Molnar
  2015-04-09  8:21                                       ` Borislav Petkov
  2015-04-09  8:39                                       ` Naoya Horiguchi
  1 sibling, 2 replies; 44+ messages in thread
From: Ingo Molnar @ 2015-04-09  8:00 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Naoya Horiguchi, Tony Luck, Prarit Bhargava, Vivek Goyal,
	linux-kernel, Junichi Nomura, Kiyoshi Ueda


* Borislav Petkov <bp@alien8.de> wrote:

> Btw, Ingo had some reservations about this. Ingo?

Yeah, so my concerns are the following:

> kexec disables (or "shoots down") all CPUs other than the crashing 
> CPU before entering the 2nd kernel. However, MCA is still enabled so 
> if an MCE happens and broadcasts to the CPUs after the main thread 
> starts the 2nd kernel (which might not initialize its MCE handler 
> yet, or might decide not to enable it) the MCE handler runs only on 
> the other CPUs (not on the main thread) leading to kernel panic 
> during MCE synchronization. The user-visible effect of this bug is a 
> kdump failure.

So the thing is, when we boot up the second kernel there will be a 
window where the old handler isn't valid (because the new kernel has 
its own pagetables, etc.) and the new handler is not installed yet.

If an MCE hits that window, it's bad luck. (unless the bootup sequence 
is rearchitected significantly to allow cross-kernel inheritance of 
MCE handlers.)

So I think we can ignore _that_ race.

The more significant question is: what happens when an MCE arrives 
whiel the kdump is proceeding - as kdumps can take a long time to 
finish when there's a lot of RAM.

But ... since the 'shootdown' is analogous to a CPU hotplug CPU-down 
sequence, I suppose that the existing MCE code should already properly 
handle the case where an MCE arrives on a (supposedly) dead CPU, 
right? In that case installing a separate MCE handler looks like the 
wrong thing.

> Our standard MCE handler do_machine_check() assumes a bunch of 
> things about system's status and it's hard to alter it to cover 
> kexec/kdump context, so add another, kdump-specific one and switch 
> to it.

So I don't like this principle either: 'our current code is a mess 
that might not work, add new one'.

> Note that this problem exists since current MCE handler was 
> implemented in 2.6.32, and recently commit 716079f66eac ("mce: Panic 
> when a core has reached a timeout") made it more visible by changing 
> the default behavior of the synchronization timeout from "ignore" to 
> "panic".

Looks like that's the real problem. How about the kdump crash dumper 
sets it back to 'ignore' again when we crash, and also double check 
how we handle various corner cases?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v8] x86: mce: kexec: switch MCE handler for kexec/kdump
  2015-04-09  8:00                                     ` Ingo Molnar
@ 2015-04-09  8:21                                       ` Borislav Petkov
  2015-04-09  8:59                                         ` Naoya Horiguchi
  2015-04-09  8:39                                       ` Naoya Horiguchi
  1 sibling, 1 reply; 44+ messages in thread
From: Borislav Petkov @ 2015-04-09  8:21 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Naoya Horiguchi, Tony Luck, Prarit Bhargava, Vivek Goyal,
	linux-kernel, Junichi Nomura, Kiyoshi Ueda

On Thu, Apr 09, 2015 at 10:00:30AM +0200, Ingo Molnar wrote:
> So the thing is, when we boot up the second kernel there will be a 
> window where the old handler isn't valid (because the new kernel has 
> its own pagetables, etc.) and the new handler is not installed yet.
> 
> If an MCE hits that window, it's bad luck. (unless the bootup sequence 
> is rearchitected significantly to allow cross-kernel inheritance of 
> MCE handlers.)
> 
> So I think we can ignore _that_ race.

Yah, that's the "tough luck" race.

> The more significant question is: what happens when an MCE arrives 
> whiel the kdump is proceeding - as kdumps can take a long time to 
> finish when there's a lot of RAM.

We say that the dump might be unreliable.

> But ... since the 'shootdown' is analogous to a CPU hotplug CPU-down 
> sequence, I suppose that the existing MCE code should already properly 
> handle the case where an MCE arrives on a (supposedly) dead CPU, 
> right? In that case installing a separate MCE handler looks like the 
> wrong thing.

Hmm, so mce_start() does look only on the online CPUs. So if crash does
maintain those masks correctly...

> So I don't like this principle either: 'our current code is a mess 
> that might not work, add new one'.

Well, we can try to simplify it in the sense that those assumptions like
mcelog and other MCE consuming crap and notifier chain are tested for
their presence before using them...

I'd be open for this if we have a way to test this kdump scenario. For
now, not even qemu can do that.

> Looks like that's the real problem. How about the kdump crash dumper 
> sets it back to 'ignore' again when we crash, and also double check 
> how we handle various corner cases?

I think I even suggested that at some point. Or was it to increase the
tolerance level. So Naoya, what's wrong with this again? I forgot.

Because this would be the simplest. Simply set tolerance level to 3 and
dump away...

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v8] x86: mce: kexec: switch MCE handler for kexec/kdump
  2015-04-09  8:00                                     ` Ingo Molnar
  2015-04-09  8:21                                       ` Borislav Petkov
@ 2015-04-09  8:39                                       ` Naoya Horiguchi
  2015-04-09  9:13                                         ` Ingo Molnar
  1 sibling, 1 reply; 44+ messages in thread
From: Naoya Horiguchi @ 2015-04-09  8:39 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Borislav Petkov, Tony Luck, Prarit Bhargava, Vivek Goyal,
	linux-kernel, Junichi Nomura, Kiyoshi Ueda

On Thu, Apr 09, 2015 at 10:00:30AM +0200, Ingo Molnar wrote:
> 
> * Borislav Petkov <bp@alien8.de> wrote:
> 
> > Btw, Ingo had some reservations about this. Ingo?
> 
> Yeah, so my concerns are the following:
> 
> > kexec disables (or "shoots down") all CPUs other than the crashing 
> > CPU before entering the 2nd kernel. However, MCA is still enabled so 
> > if an MCE happens and broadcasts to the CPUs after the main thread 
> > starts the 2nd kernel (which might not initialize its MCE handler 
> > yet, or might decide not to enable it) the MCE handler runs only on 
> > the other CPUs (not on the main thread) leading to kernel panic 
> > during MCE synchronization. The user-visible effect of this bug is a 
> > kdump failure.
> 
> So the thing is, when we boot up the second kernel there will be a 
> window where the old handler isn't valid (because the new kernel has 
> its own pagetables, etc.) and the new handler is not installed yet.
> 
> If an MCE hits that window, it's bad luck. (unless the bootup sequence 
> is rearchitected significantly to allow cross-kernel inheritance of 
> MCE handlers.)
> 
> So I think we can ignore _that_ race.
> 
> The more significant question is: what happens when an MCE arrives 
> whiel the kdump is proceeding - as kdumps can take a long time to 
> finish when there's a lot of RAM.

Without this patch, MCE makes idling CPUs unpreferably wake up and
needlessly run MCE handler, which disturbs memory so does harm on the kdump.
This patch improves not only the transition phase, but also that window.

> But ... since the 'shootdown' is analogous to a CPU hotplug CPU-down 
> sequence, I suppose that the existing MCE code should already properly 
> handle the case where an MCE arrives on a (supposedly) dead CPU, 
> right?

Currently not, so Tony mentioned some idea about it (although not included
in this patch.)

> In that case installing a separate MCE handler looks like the 
> wrong thing.

One difference bewteen kdump and CPU offline is whether we need handle
MCEs then or not. In CPU offline situation, running CPUs have to continue
their normal operations, so it's imporatant to handle MCE (i.e. log and/or
take recovery action), so I think that should be done in our main MCE
handler, do_machine_check().
But that's not the case in kdump situation (logging or recovering is
not possible/necessary any more.) So it seems make sense to me to
separate the handler.

> > Our standard MCE handler do_machine_check() assumes a bunch of 
> > things about system's status and it's hard to alter it to cover 
> > kexec/kdump context, so add another, kdump-specific one and switch 
> > to it.
> 
> So I don't like this principle either: 'our current code is a mess 
> that might not work, add new one'.
> 
> > Note that this problem exists since current MCE handler was 
> > implemented in 2.6.32, and recently commit 716079f66eac ("mce: Panic 
> > when a core has reached a timeout") made it more visible by changing 
> > the default behavior of the synchronization timeout from "ignore" to 
> > "panic".
> 
> Looks like that's the real problem. How about the kdump crash dumper 
> sets it back to 'ignore' again when we crash, and also double check 
> how we handle various corner cases?

Boris mentions this in another email, so I'll reply to it.

Thanks,
Naoya Horiguchi

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v8] x86: mce: kexec: switch MCE handler for kexec/kdump
  2015-04-09  8:21                                       ` Borislav Petkov
@ 2015-04-09  8:59                                         ` Naoya Horiguchi
  2015-04-09  9:53                                           ` Borislav Petkov
  0 siblings, 1 reply; 44+ messages in thread
From: Naoya Horiguchi @ 2015-04-09  8:59 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Ingo Molnar, Tony Luck, Prarit Bhargava, Vivek Goyal,
	linux-kernel, Junichi Nomura, Kiyoshi Ueda

On Thu, Apr 09, 2015 at 10:21:25AM +0200, Borislav Petkov wrote:
> On Thu, Apr 09, 2015 at 10:00:30AM +0200, Ingo Molnar wrote:
> > So the thing is, when we boot up the second kernel there will be a 
> > window where the old handler isn't valid (because the new kernel has 
> > its own pagetables, etc.) and the new handler is not installed yet.
> > 
> > If an MCE hits that window, it's bad luck. (unless the bootup sequence 
> > is rearchitected significantly to allow cross-kernel inheritance of 
> > MCE handlers.)
> > 
> > So I think we can ignore _that_ race.
> 
> Yah, that's the "tough luck" race.
> 
> > The more significant question is: what happens when an MCE arrives 
> > whiel the kdump is proceeding - as kdumps can take a long time to 
> > finish when there's a lot of RAM.
> 
> We say that the dump might be unreliable.
> 
> > But ... since the 'shootdown' is analogous to a CPU hotplug CPU-down 
> > sequence, I suppose that the existing MCE code should already properly 
> > handle the case where an MCE arrives on a (supposedly) dead CPU, 
> > right? In that case installing a separate MCE handler looks like the 
> > wrong thing.
> 
> Hmm, so mce_start() does look only on the online CPUs. So if crash does
> maintain those masks correctly...
> 
> > So I don't like this principle either: 'our current code is a mess 
> > that might not work, add new one'.
> 
> Well, we can try to simplify it in the sense that those assumptions like
> mcelog and other MCE consuming crap and notifier chain are tested for
> their presence before using them...
> 
> I'd be open for this if we have a way to test this kdump scenario. For
> now, not even qemu can do that.

I replied about testing.
That might be tricky a little, but I hope it helps.

> > Looks like that's the real problem. How about the kdump crash dumper 
> > sets it back to 'ignore' again when we crash, and also double check 
> > how we handle various corner cases?
> 
> I think I even suggested that at some point. Or was it to increase the
> tolerance level. So Naoya, what's wrong with this again? I forgot.

Even if we raise tolerant level in running kdump, that doesn't prevent
idling CPUs from running MCE handlers when MCE arrives, which makes memory
accesses (losing information from kdump's viewpoint) and spits
"MCE synchronization timeout" messages (unclear and confusing for users.)
And it also leaves a potential risk of being broken again when do_machine_check()
changes in the future (which maybe come from sharing code to handle different
situations.)

So raising tolerance is OK as a "minimum change" approach, but it has
above downsides to be traded off.

Thanks,
Naoya Horiguchi

> Because this would be the simplest. Simply set tolerance level to 3 and
> dump away...

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v8] x86: mce: kexec: switch MCE handler for kexec/kdump
  2015-04-09  8:39                                       ` Naoya Horiguchi
@ 2015-04-09  9:13                                         ` Ingo Molnar
  0 siblings, 0 replies; 44+ messages in thread
From: Ingo Molnar @ 2015-04-09  9:13 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Borislav Petkov, Tony Luck, Prarit Bhargava, Vivek Goyal,
	linux-kernel, Junichi Nomura, Kiyoshi Ueda


* Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> wrote:

> On Thu, Apr 09, 2015 at 10:00:30AM +0200, Ingo Molnar wrote:
> > 
> > * Borislav Petkov <bp@alien8.de> wrote:
> > 
> > > Btw, Ingo had some reservations about this. Ingo?
> > 
> > Yeah, so my concerns are the following:
> > 
> > > kexec disables (or "shoots down") all CPUs other than the crashing 
> > > CPU before entering the 2nd kernel. However, MCA is still enabled so 
> > > if an MCE happens and broadcasts to the CPUs after the main thread 
> > > starts the 2nd kernel (which might not initialize its MCE handler 
> > > yet, or might decide not to enable it) the MCE handler runs only on 
> > > the other CPUs (not on the main thread) leading to kernel panic 
> > > during MCE synchronization. The user-visible effect of this bug is a 
> > > kdump failure.
> > 
> > So the thing is, when we boot up the second kernel there will be a 
> > window where the old handler isn't valid (because the new kernel has 
> > its own pagetables, etc.) and the new handler is not installed yet.
> > 
> > If an MCE hits that window, it's bad luck. (unless the bootup sequence 
> > is rearchitected significantly to allow cross-kernel inheritance of 
> > MCE handlers.)
> > 
> > So I think we can ignore _that_ race.
> > 
> > The more significant question is: what happens when an MCE arrives 
> > whiel the kdump is proceeding - as kdumps can take a long time to 
> > finish when there's a lot of RAM.
> 
> Without this patch, MCE makes idling CPUs unpreferably wake up and 
> needlessly run MCE handler, which disturbs memory so does harm on 
> the kdump. This patch improves not only the transition phase, but 
> also that window.

The way the kdump code stops CPUs already 'disturbs' the state of 
those CPUs.

> > But ... since the 'shootdown' is analogous to a CPU hotplug 
> > CPU-down sequence, I suppose that the existing MCE code should 
> > already properly handle the case where an MCE arrives on a 
> > (supposedly) dead CPU, right?
> 
> Currently not, so Tony mentioned some idea about it (although not 
> included in this patch.)
> 
> > In that case installing a separate MCE handler looks like the 
> > wrong thing.
> 
> One difference bewteen kdump and CPU offline is whether we need handle
> MCEs then or not. In CPU offline situation, running CPUs have to continue
> their normal operations, so it's imporatant to handle MCE (i.e. log and/or
> take recovery action), so I think that should be done in our main MCE
> handler, do_machine_check().

I disagree: if offline CPUs are still active and can produce MCEs then 
they should be reported regardless of whether they were shot down by 
the CPU hotplug code or by kdump.

> But that's not the case in kdump situation (logging or recovering is
> not possible/necessary any more.) So it seems make sense to me to
> separate the handler.

I disagree: for example logging to the screen is still possible and 
should be done if there's an uncorrectable error.

So I agree that MCE policy should be made non-fatal during kdump, but 
I disagree that it needs a separate handler: it should be part of the 
regular MCE handling routines to handle kdump gracefully.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v8] x86: mce: kexec: switch MCE handler for kexec/kdump
  2015-04-09  8:59                                         ` Naoya Horiguchi
@ 2015-04-09  9:53                                           ` Borislav Petkov
  2015-04-09 18:22                                             ` Luck, Tony
  0 siblings, 1 reply; 44+ messages in thread
From: Borislav Petkov @ 2015-04-09  9:53 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Ingo Molnar, Tony Luck, Prarit Bhargava, Vivek Goyal,
	linux-kernel, Junichi Nomura, Kiyoshi Ueda

On Thu, Apr 09, 2015 at 08:59:44AM +0000, Naoya Horiguchi wrote:
> I replied about testing. That might be tricky a little, but I hope it helps.

Yeah, whatever we do, we need this properly tested before upstreaming.
That's a given.

> Even if we raise tolerant level in running kdump, that doesn't prevent
> idling CPUs from running MCE handlers when MCE arrives, which makes memory
> accesses (losing information from kdump's viewpoint) and spits
> "MCE synchronization timeout" messages (unclear and confusing for users.)

Why? Those CPUs are offlined and num_online_cpus() in mce_start() should
account for that, no?

And if those are offlined, they're very very unlikely to trigger an MCE
as they're idle and not executing code.

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--

^ permalink raw reply	[flat|nested] 44+ messages in thread

* RE: [PATCH v8] x86: mce: kexec: switch MCE handler for kexec/kdump
  2015-04-09  7:02                                       ` Borislav Petkov
@ 2015-04-09 18:07                                         ` Luck, Tony
  0 siblings, 0 replies; 44+ messages in thread
From: Luck, Tony @ 2015-04-09 18:07 UTC (permalink / raw)
  To: Borislav Petkov, Naoya Horiguchi
  Cc: Ingo Molnar, Prarit Bhargava, Vivek Goyal, linux-kernel,
	Junichi Nomura, Kiyoshi Ueda

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset="utf-8", Size: 531 bytes --]

> If only APEI EINJ could be taught to do delayed injection, regardless of
> OS kernel running. Tony, is something like that even possible at all?

Use:

  # echo 1 > notrigger

that allows you to plant a land-mine in memory that will get tripped later.  Pick the memory address in a clever way
and you can have the MCE trigger when some particular function runs.

-Tony
ÿôèº{.nÇ+‰·Ÿ®‰­†+%ŠËÿ±éݶ\x17¥Šwÿº{.nÇ+‰·¥Š{±þG«éÿŠ{ayº\x1dʇڙë,j\a­¢f£¢·hšïêÿ‘êçz_è®\x03(­éšŽŠÝ¢j"ú\x1a¶^[m§ÿÿ¾\a«þG«éÿ¢¸?™¨è­Ú&£ø§~á¶iO•æ¬z·švØ^\x14\x04\x1a¶^[m§ÿÿÃ\fÿ¶ìÿ¢¸?–I¥

^ permalink raw reply	[flat|nested] 44+ messages in thread

* RE: [PATCH v8] x86: mce: kexec: switch MCE handler for kexec/kdump
  2015-04-09  9:53                                           ` Borislav Petkov
@ 2015-04-09 18:22                                             ` Luck, Tony
  2015-04-09 19:05                                               ` Borislav Petkov
  0 siblings, 1 reply; 44+ messages in thread
From: Luck, Tony @ 2015-04-09 18:22 UTC (permalink / raw)
  To: Borislav Petkov, Naoya Horiguchi
  Cc: Ingo Molnar, Prarit Bhargava, Vivek Goyal, linux-kernel,
	Junichi Nomura, Kiyoshi Ueda

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset="utf-8", Size: 1511 bytes --]

> Why? Those CPUs are offlined and num_online_cpus() in mce_start() should
> account for that, no?
>
> And if those are offlined, they're very very unlikely to trigger an MCE
> as they're idle and not executing code.

Let's step back a few feet and look at the big picture.  There are three main classes of machine check
that we might see while trying to run kdump - an remember that all machine checks are currently
broadcast, so all cpus whether online or offline will see them

1) Fatal
We have to crash - lose the dump.  Having a new machine check handler will make things a bit easier
to see what happened because we won't have any synchronization failed messages from the offline
cpus.

2) Execution path recoverable (SRAR in SDM parlance).
Also going to be fatal (kdump is all running in ring0, and we can't recover from errors in ring 0). Cleaner
messages as above. Potentially in the future we might be able to make the kdump machine check handler
actually recover by just skipping a page - if the location of the error was in the old kernel image.

3) Non-execution path recoverable (SRAO in SDM)
We ought to be able to keep kdump running if this happens - the "AO" stands for "action optional",
so we are going to choose to not take an action. Wherever the error was, it won't affect correctness
of execution of the current context.

-Tony
ÿôèº{.nÇ+‰·Ÿ®‰­†+%ŠËÿ±éݶ\x17¥Šwÿº{.nÇ+‰·¥Š{±þG«éÿŠ{ayº\x1dʇڙë,j\a­¢f£¢·hšïêÿ‘êçz_è®\x03(­éšŽŠÝ¢j"ú\x1a¶^[m§ÿÿ¾\a«þG«éÿ¢¸?™¨è­Ú&£ø§~á¶iO•æ¬z·švØ^\x14\x04\x1a¶^[m§ÿÿÃ\fÿ¶ìÿ¢¸?–I¥

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v8] x86: mce: kexec: switch MCE handler for kexec/kdump
  2015-04-09 18:22                                             ` Luck, Tony
@ 2015-04-09 19:05                                               ` Borislav Petkov
  2015-04-10  0:49                                                 ` Naoya Horiguchi
  0 siblings, 1 reply; 44+ messages in thread
From: Borislav Petkov @ 2015-04-09 19:05 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Naoya Horiguchi, Ingo Molnar, Prarit Bhargava, Vivek Goyal,
	linux-kernel, Junichi Nomura, Kiyoshi Ueda

On Thu, Apr 09, 2015 at 06:22:02PM +0000, Luck, Tony wrote:
> > Why? Those CPUs are offlined and num_online_cpus() in mce_start() should
> > account for that, no?
> >
> > And if those are offlined, they're very very unlikely to trigger an MCE
> > as they're idle and not executing code.
> 
> Let's step back a few feet and look at the big picture.  There are three main classes of machine check
> that we might see while trying to run kdump - an remember that all machine checks are currently
> broadcast, so all cpus whether online or offline will see them
> 
> 1) Fatal
> We have to crash - lose the dump.  Having a new machine check handler will make things a bit easier
> to see what happened because we won't have any synchronization failed messages from the offline
> cpus.

But this should not be a problem if kdump path keeps cpu_online_mask
uptodate. I'm looking at kdump_nmi_callback() or crash_nmi_callback() or
so. Those should clear cpu_online_mask and then mce_start() will work
fine on the crashing CPU.

IMHO, of course.

> 2) Execution path recoverable (SRAR in SDM parlance).
> Also going to be fatal (kdump is all running in ring0, and we can't recover from errors in ring 0). Cleaner
> messages as above. Potentially in the future we might be able to make the kdump machine check handler
> actually recover by just skipping a page - if the location of the error was in the old kernel image.
> 
> 3) Non-execution path recoverable (SRAO in SDM)
> We ought to be able to keep kdump running if this happens - the "AO" stands for "action optional",
> so we are going to choose to not take an action. Wherever the error was, it won't affect correctness
> of execution of the current context.

Those could be simply made to go to dmesg during kdump, i.e. decouple
any MCE consumers. And we do that now anyway, i.e. box without mcelog or
some other ras daemon running.

So we could reuse the normal handler - we just need to do some tweaking
first... AFAICT, of course. I believe in that endeavor, the devil will
be in the detail.

Thanks.

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v8] x86: mce: kexec: switch MCE handler for kexec/kdump
  2015-04-09 19:05                                               ` Borislav Petkov
@ 2015-04-10  0:49                                                 ` Naoya Horiguchi
  2015-04-10  4:07                                                   ` Naoya Horiguchi
  2015-04-28  8:41                                                   ` Baoquan He
  0 siblings, 2 replies; 44+ messages in thread
From: Naoya Horiguchi @ 2015-04-10  0:49 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Luck, Tony, Ingo Molnar, Prarit Bhargava, Vivek Goyal,
	linux-kernel, Junichi Nomura, Kiyoshi Ueda

On Thu, Apr 09, 2015 at 09:05:51PM +0200, Borislav Petkov wrote:
> On Thu, Apr 09, 2015 at 06:22:02PM +0000, Luck, Tony wrote:
> > > Why? Those CPUs are offlined and num_online_cpus() in mce_start() should
> > > account for that, no?
> > >
> > > And if those are offlined, they're very very unlikely to trigger an MCE
> > > as they're idle and not executing code.
> > 
> > Let's step back a few feet and look at the big picture.  There are three main classes of machine check
> > that we might see while trying to run kdump - an remember that all machine checks are currently
> > broadcast, so all cpus whether online or offline will see them
> > 
> > 1) Fatal
> > We have to crash - lose the dump.  Having a new machine check handler will make things a bit easier
> > to see what happened because we won't have any synchronization failed messages from the offline
> > cpus.
> 
> But this should not be a problem if kdump path keeps cpu_online_mask
> uptodate. I'm looking at kdump_nmi_callback() or crash_nmi_callback() or
> so. Those should clear cpu_online_mask and then mce_start() will work
> fine on the crashing CPU.
> 
> IMHO, of course.

Sorry, I misread you. With clearing cpu_online_mask in shootdown (not done
yet,) raising tolerance should work without timeout message.
So I think you are right.

> > 2) Execution path recoverable (SRAR in SDM parlance).
> > Also going to be fatal (kdump is all running in ring0, and we can't recover from errors in ring 0). Cleaner
> > messages as above. Potentially in the future we might be able to make the kdump machine check handler
> > actually recover by just skipping a page - if the location of the error was in the old kernel image.
> > 
> > 3) Non-execution path recoverable (SRAO in SDM)
> > We ought to be able to keep kdump running if this happens - the "AO" stands for "action optional",
> > so we are going to choose to not take an action. Wherever the error was, it won't affect correctness
> > of execution of the current context.
> 
> Those could be simply made to go to dmesg during kdump, i.e. decouple
> any MCE consumers. And we do that now anyway, i.e. box without mcelog or
> some other ras daemon running.
> 
> So we could reuse the normal handler - we just need to do some tweaking
> first... AFAICT, of course. I believe in that endeavor, the devil will
> be in the detail.

OK, I'll try this approach with updating cpu_online_mask.

Thanks,
Naoya Horiguchi

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v8] x86: mce: kexec: switch MCE handler for kexec/kdump
  2015-04-10  0:49                                                 ` Naoya Horiguchi
@ 2015-04-10  4:07                                                   ` Naoya Horiguchi
  2015-04-10  7:24                                                     ` Borislav Petkov
  2015-04-28  8:41                                                   ` Baoquan He
  1 sibling, 1 reply; 44+ messages in thread
From: Naoya Horiguchi @ 2015-04-10  4:07 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Luck, Tony, Ingo Molnar, Prarit Bhargava, Vivek Goyal,
	linux-kernel, Junichi Nomura, Kiyoshi Ueda

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset="utf-8", Size: 1969 bytes --]

On Fri, Apr 10, 2015 at 12:49:33AM +0000, Horiguchi Naoya(堀口 直也) wrote:
> On Thu, Apr 09, 2015 at 09:05:51PM +0200, Borislav Petkov wrote:
> > On Thu, Apr 09, 2015 at 06:22:02PM +0000, Luck, Tony wrote:
> > > > Why? Those CPUs are offlined and num_online_cpus() in mce_start() should
> > > > account for that, no?
> > > >
> > > > And if those are offlined, they're very very unlikely to trigger an MCE
> > > > as they're idle and not executing code.
> > > 
> > > Let's step back a few feet and look at the big picture.  There are three main classes of machine check
> > > that we might see while trying to run kdump - an remember that all machine checks are currently
> > > broadcast, so all cpus whether online or offline will see them
> > > 
> > > 1) Fatal
> > > We have to crash - lose the dump.  Having a new machine check handler will make things a bit easier
> > > to see what happened because we won't have any synchronization failed messages from the offline
> > > cpus.
> > 
> > But this should not be a problem if kdump path keeps cpu_online_mask
> > uptodate. I'm looking at kdump_nmi_callback() or crash_nmi_callback() or
> > so. Those should clear cpu_online_mask and then mce_start() will work
> > fine on the crashing CPU.
> > 
> > IMHO, of course.
> 
> Sorry, I misread you. With clearing cpu_online_mask in shootdown (not done
> yet,) raising tolerance should work without timeout message.
> So I think you are right.

... wait, changing cpu_online_mask might confuse admins who try to
analyze the kdump, especially when the problems causing panic are CPU
related issues?

In the similar way, changing tolerant value loses the original value,
although this is unlikely to be a problem. But if we change it, using
an upper bit to keep lowest 2 bit to save the original value is better?ÿôèº{.nÇ+‰·Ÿ®‰­†+%ŠËÿ±éݶ\x17¥Šwÿº{.nÇ+‰·¥Š{±þG«éÿŠ{ayº\x1dʇڙë,j\a­¢f£¢·hšïêÿ‘êçz_è®\x03(­éšŽŠÝ¢j"ú\x1a¶^[m§ÿÿ¾\a«þG«éÿ¢¸?™¨è­Ú&£ø§~á¶iO•æ¬z·švØ^\x14\x04\x1a¶^[m§ÿÿÃ\fÿ¶ìÿ¢¸?–I¥

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v8] x86: mce: kexec: switch MCE handler for kexec/kdump
  2015-04-10  4:07                                                   ` Naoya Horiguchi
@ 2015-04-10  7:24                                                     ` Borislav Petkov
  0 siblings, 0 replies; 44+ messages in thread
From: Borislav Petkov @ 2015-04-10  7:24 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Luck, Tony, Ingo Molnar, Prarit Bhargava, Vivek Goyal,
	linux-kernel, Junichi Nomura, Kiyoshi Ueda

On Fri, Apr 10, 2015 at 04:07:26AM +0000, Naoya Horiguchi wrote:
> ... wait, changing cpu_online_mask might confuse admins who try to
> analyze the kdump, especially when the problems causing panic are CPU
> related issues?

Well, if you're offlining the CPUs before doing the dump, you're already
changing the system, aren't you? So why lie to people?

> In the similar way, changing tolerant value loses the original value,
> although this is unlikely to be a problem. But if we change it, using
> an upper bit to keep lowest 2 bit to save the original value is better?

I don't think that you need to do that - you can see from the original
kernel's dmesg whether an MCE happened - we're normally very vocal. If
the user tried to deliberately suppress that, then that's her fault
only.

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v8] x86: mce: kexec: switch MCE handler for kexec/kdump
  2015-04-10  0:49                                                 ` Naoya Horiguchi
  2015-04-10  4:07                                                   ` Naoya Horiguchi
@ 2015-04-28  8:41                                                   ` Baoquan He
  1 sibling, 0 replies; 44+ messages in thread
From: Baoquan He @ 2015-04-28  8:41 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Borislav Petkov, Luck, Tony, Ingo Molnar, Prarit Bhargava,
	Vivek Goyal, linux-kernel, Junichi Nomura, Kiyoshi Ueda

On 04/10/15 at 12:49am, Naoya Horiguchi wrote:
> On Thu, Apr 09, 2015 at 09:05:51PM +0200, Borislav Petkov wrote:
> > On Thu, Apr 09, 2015 at 06:22:02PM +0000, Luck, Tony wrote:
> > > > Why? Those CPUs are offlined and num_online_cpus() in mce_start() should
> > > > account for that, no?
> > > >
> > > > And if those are offlined, they're very very unlikely to trigger an MCE
> > > > as they're idle and not executing code.
> > > 
> > > Let's step back a few feet and look at the big picture.  There are three main classes of machine check
> > > that we might see while trying to run kdump - an remember that all machine checks are currently
> > > broadcast, so all cpus whether online or offline will see them
> > > 
> > > 1) Fatal
> > > We have to crash - lose the dump.  Having a new machine check handler will make things a bit easier
> > > to see what happened because we won't have any synchronization failed messages from the offline
> > > cpus.
> > 
> > But this should not be a problem if kdump path keeps cpu_online_mask
> > uptodate. I'm looking at kdump_nmi_callback() or crash_nmi_callback() or
> > so. Those should clear cpu_online_mask and then mce_start() will work
> > fine on the crashing CPU.
> > 
> > IMHO, of course.
> 
> Sorry, I misread you. With clearing cpu_online_mask in shootdown (not done
> yet,) raising tolerance should work without timeout message.
> So I think you are right.

Hi Naoya,

Thanks for great efforts you have made on this issue.

I am trying to catch up, and have read mails in this thread.
Please also add me to CC list when you post a new version. I would like to
review it.

Thanks
Baoquan

> 
> > > 2) Execution path recoverable (SRAR in SDM parlance).
> > > Also going to be fatal (kdump is all running in ring0, and we can't recover from errors in ring 0). Cleaner
> > > messages as above. Potentially in the future we might be able to make the kdump machine check handler
> > > actually recover by just skipping a page - if the location of the error was in the old kernel image.
> > > 
> > > 3) Non-execution path recoverable (SRAO in SDM)
> > > We ought to be able to keep kdump running if this happens - the "AO" stands for "action optional",
> > > so we are going to choose to not take an action. Wherever the error was, it won't affect correctness
> > > of execution of the current context.
> > 
> > Those could be simply made to go to dmesg during kdump, i.e. decouple
> > any MCE consumers. And we do that now anyway, i.e. box without mcelog or
> > some other ras daemon running.
> > 
> > So we could reuse the normal handler - we just need to do some tweaking
> > first... AFAICT, of course. I believe in that endeavor, the devil will
> > be in the detail.
> 
> OK, I'll try this approach with updating cpu_online_mask.
> 
> Thanks,
> Naoya Horiguchi--
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 44+ messages in thread

end of thread, other threads:[~2015-04-28  8:42 UTC | newest]

Thread overview: 44+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-03-03  9:01 [PATCH v3 1/2] x86: mce: kexec: switch MCE handler for kexec/kdump Naoya Horiguchi
2015-03-03  9:01 ` [PATCH v3 2/2] x86: mce: comment about MCE synchronization timeout on definition of tolerant Naoya Horiguchi
2015-03-03 18:09 ` [PATCH v3 1/2] x86: mce: kexec: switch MCE handler for kexec/kdump Luck, Tony
2015-03-04  7:41   ` [PATCH v4] " Naoya Horiguchi
2015-03-04 23:12     ` Luck, Tony
2015-03-05  1:24       ` Naoya Horiguchi
2015-03-05  6:45         ` [PATCH v5] " Naoya Horiguchi
2015-03-05  8:57           ` Borislav Petkov
2015-03-05  9:37             ` Naoya Horiguchi
2015-03-06  2:59               ` [PATCH v6] " Naoya Horiguchi
2015-03-06  8:34                 ` Borislav Petkov
2015-03-06  9:09                   ` Naoya Horiguchi
2015-03-06  9:27                     ` Borislav Petkov
2015-03-06  9:32                       ` Naoya Horiguchi
2015-03-06 10:22                         ` [PATCH v7] " Naoya Horiguchi
2015-04-06  7:18                           ` Naoya Horiguchi
2015-04-06 11:59                             ` Borislav Petkov
2015-04-07  8:00                               ` Naoya Horiguchi
2015-04-07  8:02                                 ` [PATCH v8] " Naoya Horiguchi
2015-04-09  6:13                                   ` Borislav Petkov
2015-04-09  6:57                                     ` Naoya Horiguchi
2015-04-09  7:02                                       ` Borislav Petkov
2015-04-09 18:07                                         ` Luck, Tony
2015-04-09  8:00                                     ` Ingo Molnar
2015-04-09  8:21                                       ` Borislav Petkov
2015-04-09  8:59                                         ` Naoya Horiguchi
2015-04-09  9:53                                           ` Borislav Petkov
2015-04-09 18:22                                             ` Luck, Tony
2015-04-09 19:05                                               ` Borislav Petkov
2015-04-10  0:49                                                 ` Naoya Horiguchi
2015-04-10  4:07                                                   ` Naoya Horiguchi
2015-04-10  7:24                                                     ` Borislav Petkov
2015-04-28  8:41                                                   ` Baoquan He
2015-04-09  8:39                                       ` Naoya Horiguchi
2015-04-09  9:13                                         ` Ingo Molnar
2015-04-06 11:56                           ` [PATCH v7] " Borislav Petkov
2015-04-07  7:59                             ` Naoya Horiguchi
2015-03-06  8:28               ` [PATCH v5] " Borislav Petkov
2015-03-06  5:44         ` [PATCH v4] " Naoya Horiguchi
2015-03-05  8:48       ` Borislav Petkov
2015-03-03 18:53 ` [PATCH v3 1/2] " Borislav Petkov
2015-03-04  7:51   ` Naoya Horiguchi
2015-03-04  9:12     ` Borislav Petkov
2015-03-05  1:27       ` Naoya Horiguchi

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.