* [PATCH] tip-queue 2015-12-10
@ 2015-12-10 10:12 Borislav Petkov
  2015-12-10 10:12 ` [PATCH] x86/mce: Ensure offline CPUs don't participate in rendezvous process Borislav Petkov
  0 siblings, 1 reply; 4+ messages in thread
From: Borislav Petkov @ 2015-12-10 10:12 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: LKML

From: Borislav Petkov <bp@suse.de>

Hi,

just one for tip-urgent today.

It takes care of the case where an #MC happens while CPUs are offlined.
The fix is purposefully kept minimal for stable@; a more involved
treatment of the whole issue is going to follow.

Thanks.

Ashok Raj (1):
  x86/mce: Ensure offline CPUs don't participate in rendezvous process

 arch/x86/kernel/cpu/mcheck/mce.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

-- 
2.3.5


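For context on the soft offline that the patch description in the next message refers to (writing a 0 to /sys/devices/system/cpu/cpuX/online), here is a minimal user-space sketch. The helper names cpu_online_path() and cpu_set_online() are made up for illustration and are not kernel or libc APIs; the write needs root and a CPU that supports hotplug.

```c
#include <stdio.h>

/*
 * Hypothetical helper: format the sysfs hotplug control file for a CPU.
 * Returns 0 on success, -1 if the buffer is too small.
 */
int cpu_online_path(char *buf, size_t len, unsigned int cpu)
{
	int n = snprintf(buf, len, "/sys/devices/system/cpu/cpu%u/online", cpu);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Hypothetical helper: write "0" (offline) or "1" (online) to that file.
 * Returns 0 on success, -1 on any error (no permission, no such CPU, ...).
 */
int cpu_set_online(unsigned int cpu, int online)
{
	char path[64];
	FILE *f;
	int ret = 0;

	if (cpu_online_path(path, sizeof(path), cpu))
		return -1;

	f = fopen(path, "w");
	if (!f)
		return -1;

	if (fputs(online ? "1" : "0", f) == EOF)
		ret = -1;
	/* The kernel may reject the request only when the write is flushed. */
	if (fclose(f) == EOF)
		ret = -1;

	return ret;
}
```

Calling cpu_set_online(3, 0) as root would request a soft offline of CPU 3. As the patch description notes, this does not stop a broadcast #MC from reaching that CPU.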

* [PATCH] x86/mce: Ensure offline CPUs don't participate in rendezvous process
  2015-12-10 10:12 [PATCH] tip-queue 2015-12-10 Borislav Petkov
@ 2015-12-10 10:12 ` Borislav Petkov
  2015-12-14  8:18   ` [tip:x86/urgent] x86/mce: Ensure offline CPUs don't " tip-bot for Ashok Raj
  2015-12-19  9:15   ` tip-bot for Ashok Raj
  0 siblings, 2 replies; 4+ messages in thread
From: Borislav Petkov @ 2015-12-10 10:12 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: LKML

From: Ashok Raj <ashok.raj@intel.com>

Intel's MCA implementation broadcasts MCEs to all CPUs on the node.
This poses a problem for offlined CPUs which cannot participate in the
rendezvous process:

  Kernel panic - not syncing: Timeout: Not all CPUs entered broadcast exception handler
  Kernel Offset: disabled
  Rebooting in 100 seconds..

More specifically, Linux does a soft offline of a CPU when writing a 0
to /sys/devices/system/cpu/cpuX/online, which doesn't prevent the #MC
exception from being broadcasted to that CPU.

Ensure that offline CPUs don't participate in the MCE rendezvous and
clear the RIP valid status bit so that a second MCE won't cause a
shutdown.

Without the patch, mce_start() will increment mce_callin and wait for
all CPUs. Offlined CPUs should avoid participating in the rendezvous
process altogether.

Reviewed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Cc: <stable@vger.kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: x86-ml <x86@kernel.org>
Link: http://lkml.kernel.org/r/1449188170-3909-1-git-send-email-ashok.raj@intel.com
[ Massage commit message. ]
Signed-off-by: Borislav Petkov <bp@suse.de>
---
 arch/x86/kernel/cpu/mcheck/mce.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index c5b0d562dbf5..7e8a736d09db 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -999,6 +999,17 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 	int flags = MF_ACTION_REQUIRED;
 	int lmce = 0;
 
+	/* If this CPU is offline, just bail out. */
+	if (cpu_is_offline(smp_processor_id())) {
+		u64 mcgstatus;
+
+		mcgstatus = mce_rdmsrl(MSR_IA32_MCG_STATUS);
+		if (mcgstatus & MCG_STATUS_RIPV) {
+			mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
+			return;
+		}
+	}
+
 	ist_enter(regs);
 
 	this_cpu_inc(mce_exception_count);
-- 
2.3.5


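The early exit that the patch above adds to do_machine_check() can be modeled outside the kernel. The sketch below is only an illustration: struct fake_cpu and mce_offline_bail_out() are hypothetical stand-ins for the per-CPU online state and the IA32_MCG_STATUS MSR accessors. The bit positions (RIPV is bit 0, MCIP is bit 2) follow the architectural layout of IA32_MCG_STATUS.

```c
#include <stdbool.h>
#include <stdint.h>

/* IA32_MCG_STATUS flags (architectural bit positions). */
#define MCG_STATUS_RIPV (1ULL << 0)	/* restart IP valid */
#define MCG_STATUS_MCIP (1ULL << 2)	/* machine check in progress */

/* Hypothetical model of one CPU: hotplug state plus its MCG_STATUS MSR. */
struct fake_cpu {
	bool online;
	uint64_t mcg_status;
};

/*
 * Mirrors the condition in the patch: an offline CPU whose interrupted
 * context can be resumed (RIPV set) clears MCG_STATUS and leaves the
 * handler at once, without joining the rendezvous.
 *
 * Returns true if the handler should return immediately.
 */
bool mce_offline_bail_out(struct fake_cpu *cpu)
{
	if (cpu->online)
		return false;

	if (!(cpu->mcg_status & MCG_STATUS_RIPV))
		return false;	/* cannot resume; fall through to the full handler */

	/* Writing 0 clears every flag in the register, MCIP included. */
	cpu->mcg_status = 0;
	return true;
}
```

In this model an offline CPU with RIPV set reports true and ends up with a zeroed status word. An online CPU, or an offline CPU without RIPV, is left untouched and continues into the regular handler path, as in the patch.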

* [tip:x86/urgent] x86/mce: Ensure offline CPUs don't participate in rendezvous process
  2015-12-10 10:12 ` [PATCH] x86/mce: Ensure offline CPUs don't participate in rendezvous process Borislav Petkov
@ 2015-12-14  8:18   ` tip-bot for Ashok Raj
  2015-12-19  9:15   ` tip-bot for Ashok Raj
  1 sibling, 0 replies; 4+ messages in thread
From: tip-bot for Ashok Raj @ 2015-12-14  8:18 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: tony.luck, torvalds, ashok.raj, mingo, stable, hpa, linux-kernel,
	peterz, tglx, linux-edac, bp

Commit-ID:  06f337b7c7eb86254c86e8e717273d1e356d5a1b
Gitweb:     http://git.kernel.org/tip/06f337b7c7eb86254c86e8e717273d1e356d5a1b
Author:     Ashok Raj <ashok.raj@intel.com>
AuthorDate: Thu, 10 Dec 2015 11:12:26 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Fri, 11 Dec 2015 08:59:48 +0100

x86/mce: Ensure offline CPUs don't participate in rendezvous process

Intel's MCA implementation broadcasts MCEs to all CPUs on the
node. This poses a problem for offlined CPUs which cannot
participate in the rendezvous process:

  Kernel panic - not syncing: Timeout: Not all CPUs entered broadcast exception handler
  Kernel Offset: disabled
  Rebooting in 100 seconds..

More specifically, Linux does a soft offline of a CPU when
writing a 0 to /sys/devices/system/cpu/cpuX/online, which
doesn't prevent the #MC exception from being broadcasted to that
CPU.

Ensure that offline CPUs don't participate in the MCE rendezvous
and clear the RIP valid status bit so that a second MCE won't
cause a shutdown.

Without the patch, mce_start() will increment mce_callin and
wait for all CPUs. Offlined CPUs should avoid participating in
the rendezvous process altogether.

Signed-off-by: Ashok Raj <ashok.raj@intel.com>
[ Massage commit message. ]
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Cc: <stable@vger.kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/1449742346-21470-2-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/kernel/cpu/mcheck/mce.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index c5b0d56..7e8a736 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -999,6 +999,17 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 	int flags = MF_ACTION_REQUIRED;
 	int lmce = 0;
 
+	/* If this CPU is offline, just bail out. */
+	if (cpu_is_offline(smp_processor_id())) {
+		u64 mcgstatus;
+
+		mcgstatus = mce_rdmsrl(MSR_IA32_MCG_STATUS);
+		if (mcgstatus & MCG_STATUS_RIPV) {
+			mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
+			return;
+		}
+	}
+
 	ist_enter(regs);
 
 	this_cpu_inc(mce_exception_count);


* [tip:x86/urgent] x86/mce: Ensure offline CPUs don't participate in rendezvous process
  2015-12-10 10:12 ` [PATCH] x86/mce: Ensure offline CPUs don't participate in rendezvous process Borislav Petkov
  2015-12-14  8:18   ` [tip:x86/urgent] x86/mce: Ensure offline CPUs don't " tip-bot for Ashok Raj
@ 2015-12-19  9:15   ` tip-bot for Ashok Raj
  1 sibling, 0 replies; 4+ messages in thread
From: tip-bot for Ashok Raj @ 2015-12-19  9:15 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: mingo, tglx, peterz, ashok.raj, torvalds, hpa, bp, linux-kernel,
	linux-edac, tony.luck, stable

Commit-ID:  d90167a941f62860f35eb960e1012aa2d30e7e94
Gitweb:     http://git.kernel.org/tip/d90167a941f62860f35eb960e1012aa2d30e7e94
Author:     Ashok Raj <ashok.raj@intel.com>
AuthorDate: Thu, 10 Dec 2015 11:12:26 +0100
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Sat, 19 Dec 2015 09:55:31 +0100

x86/mce: Ensure offline CPUs don't participate in rendezvous process

Intel's MCA implementation broadcasts MCEs to all CPUs on the
node. This poses a problem for offlined CPUs which cannot
participate in the rendezvous process:

  Kernel panic - not syncing: Timeout: Not all CPUs entered broadcast exception handler
  Kernel Offset: disabled
  Rebooting in 100 seconds..

More specifically, Linux does a soft offline of a CPU when
writing a 0 to /sys/devices/system/cpu/cpuX/online, which
doesn't prevent the #MC exception from being broadcasted to that
CPU.

Ensure that offline CPUs don't participate in the MCE rendezvous
and clear the RIP valid status bit so that a second MCE won't
cause a shutdown.

Without the patch, mce_start() will increment mce_callin and
wait for all CPUs. Offlined CPUs should avoid participating in
the rendezvous process altogether.

Signed-off-by: Ashok Raj <ashok.raj@intel.com>
[ Massage commit message. ]
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Cc: <stable@vger.kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/1449742346-21470-2-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/kernel/cpu/mcheck/mce.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index c5b0d56..7e8a736 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -999,6 +999,17 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 	int flags = MF_ACTION_REQUIRED;
 	int lmce = 0;
 
+	/* If this CPU is offline, just bail out. */
+	if (cpu_is_offline(smp_processor_id())) {
+		u64 mcgstatus;
+
+		mcgstatus = mce_rdmsrl(MSR_IA32_MCG_STATUS);
+		if (mcgstatus & MCG_STATUS_RIPV) {
+			mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
+			return;
+		}
+	}
+
 	ist_enter(regs);
 
 	this_cpu_inc(mce_exception_count);


Thread overview: 4+ messages
2015-12-10 10:12 [PATCH] tip-queue 2015-12-10 Borislav Petkov
2015-12-10 10:12 ` [PATCH] x86/mce: Ensure offline CPUs don't participate in rendezvous process Borislav Petkov
2015-12-14  8:18   ` [tip:x86/urgent] x86/mce: Ensure offline CPUs don't " tip-bot for Ashok Raj
2015-12-19  9:15   ` tip-bot for Ashok Raj
