* Re: [tip:x86/urgent] x86/mce: Ensure offline CPUs don't participate in rendezvous process
@ 2019-05-29  9:58 Tony W Wang-oc
  0 siblings, 0 replies; 4+ messages in thread
From: Tony W Wang-oc @ 2019-05-29  9:58 UTC (permalink / raw)
  To: tipbot
  Cc: ashok.raj, bp, hpa, linux-edac, linux-kernel, linux-tip-commits,
	mingo, peterz, stable, tglx, tony.luck, torvalds, David Wang

Hi,
Does this patch require every #MC exception to have MCG_STATUS_RIPV = 1?
On an offline CPU, an #MC exception that arrives with MCG_STATUS_RIPV = 0
(for example, a "Recoverable-not-continuable SRAR Type" error) does not take
the early return, so the patch doesn't seem to work for it. Is this patch's
"return;" in the wrong place?

Thanks
Tony W Wang-oc


* Re: [tip:x86/urgent] x86/mce: Ensure offline CPUs don't participate in rendezvous process
@ 2019-05-30  3:50 Tony W Wang-oc
  0 siblings, 0 replies; 4+ messages in thread
From: Tony W Wang-oc @ 2019-05-30  3:50 UTC (permalink / raw)
  To: tipbot, ashok.raj
  Cc: bp, hpa, linux-edac, linux-kernel, linux-tip-commits, mingo,
	peterz, stable, tglx, tony.luck, torvalds, David Wang

Hi Ashok,
I have two questions about this patch. Could you help check them?

1. For broadcast #MC exceptions, this patch seems to require that the error
sets MCG_STATUS_RIPV = 1.
But on Intel CPUs, some #MC exception errors set MCG_STATUS_RIPV = 0
(for example, "Recoverable-not-continuable SRAR Type" errors). For these
errors the patch doesn't seem to work. Is that okay?

2. For LMCE exceptions, this patch seems to require that the error sets
MCG_STATUS_RIPV = 0, so that the LMCE is handled normally even on an
offline CPU.
For LMCE errors that set MCG_STATUS_RIPV = 1, the patch prevents the
offline CPU from handling them. Is that okay?

Thanks
Tony W Wang-oc
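
The two cases raised above can be illustrated with a minimal user-space
sketch of the branch structure in the added hunk. It is only a model, not
kernel code: the names early_check, MCE_BAIL_OUT and MCE_FULL_HANDLER are
hypothetical, and the MSR read is replaced by a plain argument.
MCG_STATUS_RIPV is taken to be bit 0 of IA32_MCG_STATUS.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit 0 of IA32_MCG_STATUS: restart IP valid. */
#define MCG_STATUS_RIPV (1ULL << 0)

/* Hypothetical names for the two outcomes of the early check. */
enum mce_path {
	MCE_BAIL_OUT,     /* clear MCG_STATUS and return immediately */
	MCE_FULL_HANDLER, /* fall through to ist_enter() and the rest */
};

/*
 * Models only the condition in the added hunk: the early return is
 * taken when the CPU is offline AND RIPV is set. Whether the event
 * was broadcast or local (LMCE) is not an input to the check.
 */
static enum mce_path early_check(bool cpu_offline, uint64_t mcgstatus)
{
	if (cpu_offline && (mcgstatus & MCG_STATUS_RIPV))
		return MCE_BAIL_OUT;
	return MCE_FULL_HANDLER;
}
```

In this model, early_check(true, 0) gives MCE_FULL_HANDLER, which is the
broadcast RIPV = 0 case from question 1, and early_check(true,
MCG_STATUS_RIPV) gives MCE_BAIL_OUT whether or not the event was local,
which is the case from question 2.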


* [tip:x86/urgent] x86/mce: Ensure offline CPUs don't participate in rendezvous process
  2015-12-10 10:12 [PATCH] x86/mce: Ensure offline CPUs don't " Borislav Petkov
  2015-12-14  8:18 ` [tip:x86/urgent] x86/mce: Ensure offline CPUs don't " tip-bot for Ashok Raj
@ 2015-12-19  9:15 ` tip-bot for Ashok Raj
  1 sibling, 0 replies; 4+ messages in thread
From: tip-bot for Ashok Raj @ 2015-12-19  9:15 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: mingo, tglx, peterz, ashok.raj, torvalds, hpa, bp, linux-kernel,
	linux-edac, tony.luck, stable

Commit-ID:  d90167a941f62860f35eb960e1012aa2d30e7e94
Gitweb:     http://git.kernel.org/tip/d90167a941f62860f35eb960e1012aa2d30e7e94
Author:     Ashok Raj <ashok.raj@intel.com>
AuthorDate: Thu, 10 Dec 2015 11:12:26 +0100
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Sat, 19 Dec 2015 09:55:31 +0100

x86/mce: Ensure offline CPUs don't participate in rendezvous process

Intel's MCA implementation broadcasts MCEs to all CPUs on the
node. This poses a problem for offlined CPUs which cannot
participate in the rendezvous process:

  Kernel panic - not syncing: Timeout: Not all CPUs entered broadcast exception handler
  Kernel Offset: disabled
  Rebooting in 100 seconds..

More specifically, Linux does a soft offline of a CPU when
writing a 0 to /sys/devices/system/cpu/cpuX/online, which
doesn't prevent the #MC exception from being broadcasted to that
CPU.

Ensure that offline CPUs don't participate in the MCE rendezvous
and clear the RIP valid status bit so that a second MCE won't
cause a shutdown.

Without the patch, mce_start() will increment mce_callin and
wait for all CPUs. Offlined CPUs should avoid participating in
the rendezvous process altogether.

Signed-off-by: Ashok Raj <ashok.raj@intel.com>
[ Massage commit message. ]
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Cc: <stable@vger.kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/1449742346-21470-2-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/kernel/cpu/mcheck/mce.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index c5b0d56..7e8a736 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -999,6 +999,17 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 	int flags = MF_ACTION_REQUIRED;
 	int lmce = 0;
 
+	/* If this CPU is offline, just bail out. */
+	if (cpu_is_offline(smp_processor_id())) {
+		u64 mcgstatus;
+
+		mcgstatus = mce_rdmsrl(MSR_IA32_MCG_STATUS);
+		if (mcgstatus & MCG_STATUS_RIPV) {
+			mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
+			return;
+		}
+	}
+
 	ist_enter(regs);
 
 	this_cpu_inc(mce_exception_count);
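
One way to picture why an offline CPU calling in is a problem is the small
model below. It rests on an assumption that the commit message does not
spell out: that the wait in mce_start() completes only when the mce_callin
count equals the number of online CPUs. The function names are hypothetical
and the code is a sketch, not the kernel implementation.

```c
#include <stdbool.h>

/*
 * Simplified model (an assumption, not the kernel source): the
 * rendezvous completes only when the number of CPUs that have called
 * in equals the number of online CPUs.
 */
static bool rendezvous_complete(int callin, int online_cpus)
{
	return callin == online_cpus;
}

/*
 * Call-in count once every CPU that received the broadcast #MC has
 * reached the handler. Offline CPUs contribute only if they do not
 * bail out before incrementing the counter.
 */
static int total_callins(int online_cpus, int offline_cpus,
			 bool offline_bail_out)
{
	return online_cpus + (offline_bail_out ? 0 : offline_cpus);
}
```

Under that assumption, three online CPUs plus one offline CPU that does not
bail out give a count of four, which never matches three, so the wait would
run into the timeout quoted in the commit message. With the early return the
count stays at three.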


* [tip:x86/urgent] x86/mce: Ensure offline CPUs don't participate in rendezvous process
  2015-12-10 10:12 [PATCH] x86/mce: Ensure offline CPUs don't " Borislav Petkov
@ 2015-12-14  8:18 ` tip-bot for Ashok Raj
  2015-12-19  9:15 ` tip-bot for Ashok Raj
  1 sibling, 0 replies; 4+ messages in thread
From: tip-bot for Ashok Raj @ 2015-12-14  8:18 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: tony.luck, torvalds, ashok.raj, mingo, stable, hpa, linux-kernel,
	peterz, tglx, linux-edac, bp

Commit-ID:  06f337b7c7eb86254c86e8e717273d1e356d5a1b
Gitweb:     http://git.kernel.org/tip/06f337b7c7eb86254c86e8e717273d1e356d5a1b
Author:     Ashok Raj <ashok.raj@intel.com>
AuthorDate: Thu, 10 Dec 2015 11:12:26 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Fri, 11 Dec 2015 08:59:48 +0100

x86/mce: Ensure offline CPUs don't participate in rendezvous process

Intel's MCA implementation broadcasts MCEs to all CPUs on the
node. This poses a problem for offlined CPUs which cannot
participate in the rendezvous process:

  Kernel panic - not syncing: Timeout: Not all CPUs entered broadcast exception handler
  Kernel Offset: disabled
  Rebooting in 100 seconds..

More specifically, Linux does a soft offline of a CPU when
writing a 0 to /sys/devices/system/cpu/cpuX/online, which
doesn't prevent the #MC exception from being broadcasted to that
CPU.

Ensure that offline CPUs don't participate in the MCE rendezvous
and clear the RIP valid status bit so that a second MCE won't
cause a shutdown.

Without the patch, mce_start() will increment mce_callin and
wait for all CPUs. Offlined CPUs should avoid participating in
the rendezvous process altogether.

Signed-off-by: Ashok Raj <ashok.raj@intel.com>
[ Massage commit message. ]
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Cc: <stable@vger.kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/1449742346-21470-2-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/kernel/cpu/mcheck/mce.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index c5b0d56..7e8a736 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -999,6 +999,17 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 	int flags = MF_ACTION_REQUIRED;
 	int lmce = 0;
 
+	/* If this CPU is offline, just bail out. */
+	if (cpu_is_offline(smp_processor_id())) {
+		u64 mcgstatus;
+
+		mcgstatus = mce_rdmsrl(MSR_IA32_MCG_STATUS);
+		if (mcgstatus & MCG_STATUS_RIPV) {
+			mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
+			return;
+		}
+	}
+
 	ist_enter(regs);
 
 	this_cpu_inc(mce_exception_count);

