All of lore.kernel.org
 help / color / mirror / Atom feed
* Re: [Patch V0] x86, mce: Ensure offline CPU's don't participate in mce rendezvous process.
  2015-12-04  0:16 [Patch V0] x86, mce: Ensure offline CPU's don't participate in mce rendezvous process Ashok Raj
@ 2015-12-03 23:34 ` Greg KH
  2015-12-04 14:34 ` Borislav Petkov
  1 sibling, 0 replies; 14+ messages in thread
From: Greg KH @ 2015-12-03 23:34 UTC (permalink / raw)
  To: Ashok Raj; +Cc: linux-kernel, stable, linux-edac, Boris Petkov, Tony Luck

On Thu, Dec 03, 2015 at 07:16:10PM -0500, Ashok Raj wrote:
> Linux has logical cpu offline capability. That can be triggered by:
> 
> # echo 0 > /sys/devices/system/cpu/cpuX/online
> 
> In Intel Architecture, MCE's are broadcasted to all CPUs in the system.
> 
> This includes the CPUs marked offline by Linux. Unless the CPU's were removed
> via an ACPI notification, in which case the cpu's are removed from the
> cpu_present_map.
> 
> This patch ensures offline CPU's don't participate in MCE rendezvous, but
> simply perform clearing some status bits to ensure a second MCE wont cause
> automatic shutdown.
> 
> Without the patch, mce_start will increment mce_callin, but mce_start would
> wait for all online_cpus. So offline cpu's should avoid participating in the
> rendezvous process.
> 
> Reviewed-by: Tony Luck <tony.luck@intel.com>
> Signed-off-by: Ashok Raj <ashok.raj@intel.com>
> ---
>  arch/x86/kernel/cpu/mcheck/mce.c | 11 ++++++++++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
> 

<formletter>

This is not the correct way to submit patches for inclusion in the
stable kernel tree.  Please read Documentation/stable_kernel_rules.txt
for how to do this properly.

</formletter>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [Patch V0] x86, mce: Ensure offline CPU's don't participate in mce rendezvous process.
@ 2015-12-04  0:16 Ashok Raj
  2015-12-03 23:34 ` Greg KH
  2015-12-04 14:34 ` Borislav Petkov
  0 siblings, 2 replies; 14+ messages in thread
From: Ashok Raj @ 2015-12-04  0:16 UTC (permalink / raw)
  To: linux-kernel; +Cc: Ashok Raj, stable, linux-edac, Boris Petkov, Tony Luck

Linux has logical cpu offline capability. That can be triggered by:

# echo 0 > /sys/devices/system/cpu/cpuX/online

In Intel Architecture, MCE's are broadcasted to all CPUs in the system.

This includes the CPUs marked offline by Linux. Unless the CPU's were removed
via an ACPI notification, in which case the cpu's are removed from the
cpu_present_map.

This patch ensures offline CPU's don't participate in MCE rendezvous, but
simply perform clearing some status bits to ensure a second MCE wont cause
automatic shutdown.

Without the patch, mce_start will increment mce_callin, but mce_start would
wait for all online_cpus. So offline cpu's should avoid participating in the
rendezvous process.

Reviewed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
---
 arch/x86/kernel/cpu/mcheck/mce.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index c5b0d56..82a0c8b 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -998,6 +998,7 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 	u64 recover_paddr = ~0ull;
 	int flags = MF_ACTION_REQUIRED;
 	int lmce = 0;
+	unsigned int cpu = smp_processor_id();
 
 	ist_enter(regs);
 
@@ -1008,6 +1009,14 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 
 	mce_gather_info(&m, regs);
 
+	/*
+	 * if this cpu is offline, just bail out.
+	 * TBD: looking into adding any logs this offline CPU might have,
+	 * to be collected and reported by the rendezvous master.
+	 */
+	if (cpu_is_offline(cpu) && (m.mcgstatus & MCG_STATUS_RIPV))
+		goto out;
+
 	final = this_cpu_ptr(&mces_seen);
 	*final = m;
 
@@ -1142,8 +1151,8 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 
 	if (worst > 0)
 		mce_report_event(regs);
-	mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
 out:
+	mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
 	sync_core();
 
 	if (recover_paddr == ~0ull)
-- 
2.4.3


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [Patch V0] x86, mce: Ensure offline CPU's don't participate in mce rendezvous process.
  2015-12-04  0:16 [Patch V0] x86, mce: Ensure offline CPU's don't participate in mce rendezvous process Ashok Raj
  2015-12-03 23:34 ` Greg KH
@ 2015-12-04 14:34 ` Borislav Petkov
  2015-12-04 17:14   ` Raj, Ashok
  1 sibling, 1 reply; 14+ messages in thread
From: Borislav Petkov @ 2015-12-04 14:34 UTC (permalink / raw)
  To: Ashok Raj; +Cc: linux-kernel, linux-edac, Tony Luck

On Thu, Dec 03, 2015 at 07:16:10PM -0500, Ashok Raj wrote:
> Linux has logical cpu offline capability. That can be triggered by:
> 
> # echo 0 > /sys/devices/system/cpu/cpuX/online
> 
> In Intel Architecture, MCE's are broadcasted to all CPUs in the system.
> 
> This includes the CPUs marked offline by Linux. Unless the CPU's were removed
> via an ACPI notification, in which case the cpu's are removed from the
> cpu_present_map.
> 
> This patch ensures offline CPU's don't participate in MCE rendezvous, but
> simply perform clearing some status bits to ensure a second MCE wont cause
> automatic shutdown.
> 
> Without the patch, mce_start will increment mce_callin, but mce_start would
> wait for all online_cpus. So offline cpu's should avoid participating in the
> rendezvous process.
> 
> Reviewed-by: Tony Luck <tony.luck@intel.com>
> Signed-off-by: Ashok Raj <ashok.raj@intel.com>
> ---
>  arch/x86/kernel/cpu/mcheck/mce.c | 11 ++++++++++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
> index c5b0d56..82a0c8b 100644
> --- a/arch/x86/kernel/cpu/mcheck/mce.c
> +++ b/arch/x86/kernel/cpu/mcheck/mce.c
> @@ -998,6 +998,7 @@ void do_machine_check(struct pt_regs *regs, long error_code)
>  	u64 recover_paddr = ~0ull;
>  	int flags = MF_ACTION_REQUIRED;
>  	int lmce = 0;
> +	unsigned int cpu = smp_processor_id();
>  
>  	ist_enter(regs);
>  
> @@ -1008,6 +1009,14 @@ void do_machine_check(struct pt_regs *regs, long error_code)
>  
>  	mce_gather_info(&m, regs);
>  
> +	/*
> +	 * if this cpu is offline, just bail out.
> +	 * TBD: looking into adding any logs this offline CPU might have,
> +	 * to be collected and reported by the rendezvous master.
> +	 */
> +	if (cpu_is_offline(cpu) && (m.mcgstatus & MCG_STATUS_RIPV))
> +		goto out;

This CPU - it being offline and all - is not doing the minimal amount of
work possible IMO.

Why does it have to do ist_enter(), this_cpu_inc(mce_exception_count),
etc?

IMO the only things it should do is this:

	if (cpu_is_offline(smp_processor_id())) {
		mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
		return;
	}

and that should be at the very beginning of do_machine_check(). So
that the hardware is happy. Concerning Linux, it is offline so no data
structures on it are valid.

Hmmm?

P.S., please don't put stable@ to CC - add it as a "CC: " line in the
SOB section instead.

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Patch V0] x86, mce: Ensure offline CPU's don't participate in mce rendezvous process.
  2015-12-04 17:14   ` Raj, Ashok
@ 2015-12-04 16:51     ` Borislav Petkov
  2015-12-04 17:23       ` Luck, Tony
  0 siblings, 1 reply; 14+ messages in thread
From: Borislav Petkov @ 2015-12-04 16:51 UTC (permalink / raw)
  To: Raj, Ashok; +Cc: linux-kernel, linux-edac, Tony Luck

On Fri, Dec 04, 2015 at 12:14:20PM -0500, Raj, Ashok wrote:
> Yes, thats possible to not do ist_enter() and the exception count. 
> 
> I tried to keep most of the part as is and leveraging code already
> doing the reading of MCG_STATUS. Architecturally we need to also check RIPV
> and if clear we should initiate shutdown.

So add that check too.

> When we add the logging from offline cpus as next step it would be safe to 
> use interrupt stack, and the offline

Franky, I'm not sure at all and very very wary of adding *any* code
which runs on an offlined CPU. Because *no one* does that and it hasn't
been tested at all. So who knows what happens.

What we should be doing is execute the *minimal* amount of code possible
and get out. No counting, no per-cpu variables. No nothing.

> I liked the observability part keeping the exception count. if and
> when we online the cpu again, it might look as it noticed nothing. Now
> we can check /proc/interrupts and see the offline cpu also observed
> the MCE.

And? Tell us what? That SMM fondled the hardware under our feet. TBH,
I'd tend to be much more drastic here and even taint the kernel. I mean,
seriously, what kind of MCEs which happen as a result of OS execution
are you expecting to get reported on an offlined CPU?

I can't think of very any.

Because we have been considering offlining a core as one possible RAS
action. So what happens is a user or a RAS agent offlines a core and
yet, that offlined core still reports MCEs. Something's terribly wrong
with that picture, IMO.

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Patch V0] x86, mce: Ensure offline CPU's don't participate in mce rendezvous process.
  2015-12-04 14:34 ` Borislav Petkov
@ 2015-12-04 17:14   ` Raj, Ashok
  2015-12-04 16:51     ` Borislav Petkov
  0 siblings, 1 reply; 14+ messages in thread
From: Raj, Ashok @ 2015-12-04 17:14 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: linux-kernel, linux-edac, Tony Luck, Ashok Raj

Hi Boris

On Fri, Dec 04, 2015 at 03:34:04PM +0100, Borislav Petkov wrote:
> > @@ -1008,6 +1009,14 @@ void do_machine_check(struct pt_regs *regs, long error_code)
> > +	if (cpu_is_offline(cpu) && (m.mcgstatus & MCG_STATUS_RIPV))
> > +		goto out;
> 
> This CPU - it being offline and all - is not doing the minimal amount of
> work possible IMO.
> 
> Why does it have to do ist_enter(), this_cpu_inc(mce_exception_count),
> etc?

Yes, thats possible to not do ist_enter() and the exception count. 

I tried to keep most of the part as is and leveraging code already
doing the reading of MCG_STATUS. Architecturally we need to also check RIPV
and if clear we should initiate shutdown. 

When we add the logging from offline cpus as next step it would be safe to 
use interrupt stack, and the offline 

I liked the observability part keeping the exception count. if and when we 
online the cpu again, it might look as it noticed nothing. Now we can 
check /proc/interrupts and see the offline cpu also observed the MCE.
> 
> IMO the only things it should do is this:
> 
> 	if (cpu_is_offline(smp_processor_id())) {
> 		mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
> 		return;
> 	}
> 
> and that should be at the very beginning of do_machine_check(). So
> that the hardware is happy. Concerning Linux, it is offline so no data
> structures on it are valid.

> 
> P.S., please don't put stable@ to CC - add it as a "CC: " line in the
> SOB section instead.

Let me know what you think, i can resend with the Cc: stable line.. I 
Did add the stable line in the right section in an earlier version, but 
deleting some extraneous commit messages accidently got to this one :(. 

Cheers,
Ashok


^ permalink raw reply	[flat|nested] 14+ messages in thread

* RE: [Patch V0] x86, mce: Ensure offline CPU's don't participate in mce rendezvous process.
  2015-12-04 16:51     ` Borislav Petkov
@ 2015-12-04 17:23       ` Luck, Tony
  2015-12-04 17:36         ` Borislav Petkov
  0 siblings, 1 reply; 14+ messages in thread
From: Luck, Tony @ 2015-12-04 17:23 UTC (permalink / raw)
  To: Borislav Petkov, Raj, Ashok; +Cc: linux-kernel, linux-edac

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset="utf-8", Size: 1588 bytes --]

> Franky, I'm not sure at all and very very wary of adding *any* code
> which runs on an offlined CPU. Because *no one* does that and it hasn't
> been tested at all. So who knows what happens.
>
> What we should be doing is execute the *minimal* amount of code possible
> and get out. No counting, no per-cpu variables. No nothing.

The minimal code requires we use:

    smp_processor_id() [to get our cpu number]
    cpu_is_offline() [to find out the cpu is offline]

The first of those looks more dangerous in that it accesses a per-cpu variable.

I don't think we need to be totally paranoid here.  We know that the offline cpus
were once online and went through normal kernel initialization code (if they didn't,
then we can't possibly be executing this code ... their CR4.MCE bit would be zero so their
response to a machine check would have been to reset the system).

> Because we have been considering offlining a core as one possible RAS
> action. So what happens is a user or a RAS agent offlines a core and
> yet, that offlined core still reports MCEs. Something's terribly wrong
> with that picture, IMO.

Agreed. It would be more pleasant if we had some way to *really* offline a cpu,
including telling the rest of the system not to send it any more broadcast events
like MCE, SMI.  But the h/w guys like to give the s/w guys job security by making
these corner cases that we have to work around in s/w :-)

-Tony
ÿôèº{.nÇ+‰·Ÿ®‰­†+%ŠËÿ±éݶ\x17¥Šwÿº{.nÇ+‰·¥Š{±þG«éÿŠ{ayº\x1dʇڙë,j\a­¢f£¢·hšïêÿ‘êçz_è®\x03(­éšŽŠÝ¢j"ú\x1a¶^[m§ÿÿ¾\a«þG«éÿ¢¸?™¨è­Ú&£ø§~á¶iO•æ¬z·švØ^\x14\x04\x1a¶^[m§ÿÿÃ\fÿ¶ìÿ¢¸?–I¥

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Patch V0] x86, mce: Ensure offline CPU's don't participate in mce rendezvous process.
  2015-12-04 17:23       ` Luck, Tony
@ 2015-12-04 17:36         ` Borislav Petkov
  2015-12-04 17:53           ` Luck, Tony
  0 siblings, 1 reply; 14+ messages in thread
From: Borislav Petkov @ 2015-12-04 17:36 UTC (permalink / raw)
  To: Luck, Tony; +Cc: Raj, Ashok, linux-kernel, linux-edac

On Fri, Dec 04, 2015 at 05:23:18PM +0000, Luck, Tony wrote:
> > Franky, I'm not sure at all and very very wary of adding *any* code
> > which runs on an offlined CPU. Because *no one* does that and it hasn't
> > been tested at all. So who knows what happens.
> >
> > What we should be doing is execute the *minimal* amount of code possible
> > and get out. No counting, no per-cpu variables. No nothing.
> 
> The minimal code requires we use:
> 
>     smp_processor_id() [to get our cpu number]
>     cpu_is_offline() [to find out the cpu is offline]
> 
> The first of those looks more dangerous in that it accesses a per-cpu variable.
> 
> I don't think we need to be totally paranoid here.  We know that the offline cpus
> were once online and went through normal kernel initialization code (if they didn't,
> then we can't possibly be executing this code ... their CR4.MCE bit would be zero so their
> response to a machine check would have been to reset the system).

I don't mean that - I mean the stuff we do before we call
cpu_is_offline() like ist_enter, this_cpu_inc(mce_exception_count),
etc. Then we do a whole another bunch of stuff at the "out:" label like
printk and whatnot which shouldn't run on an offlined CPU.

I.e., the check whether a CPU is offline should be the first thing we do
in do_machine_check and get the hell out if so.

> Agreed. It would be more pleasant if we had some way to *really* offline a cpu,
> including telling the rest of the system not to send it any more broadcast events
> like MCE, SMI.  But the h/w guys like to give the s/w guys job security by making
> these corner cases that we have to work around in s/w :-)

Mind you, this is unintentional from the hw guys. But ha(!), I know
*exactly* what you mean.

:-)

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* RE: [Patch V0] x86, mce: Ensure offline CPU's don't participate in mce rendezvous process.
  2015-12-04 17:36         ` Borislav Petkov
@ 2015-12-04 17:53           ` Luck, Tony
  2015-12-04 18:00             ` Borislav Petkov
  2015-12-04 22:34             ` Andy Lutomirski
  0 siblings, 2 replies; 14+ messages in thread
From: Luck, Tony @ 2015-12-04 17:53 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Raj, Ashok, linux-kernel, linux-edac,
	Andy Lutomirski (luto@amacapital.net)

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset="utf-8", Size: 1012 bytes --]

> I don't mean that - I mean the stuff we do before we call
> cpu_is_offline() like ist_enter, this_cpu_inc(mce_exception_count),
> etc. Then we do a whole another bunch of stuff at the "out:" label like
> printk and whatnot which shouldn't run on an offlined CPU.

ist_enter() is black magic to me. Andy? Would you be worried about executing
ist_{enter,exit}() on a cpu that was once online, but is currently marked offline
by Linux?

Bumping mce_exception_count doesn't look like a big deal either way. It is visible in
/proc/interrupts so I'd like to keep that honest (if the cpu comes back online again).
But we could do the offline check before this.

There will be no printk() executed in the tail of the function. after we clear MCG_STATUS
at the (new location of) the out: label we will see recover_paddr is still ~0ull and "goto done".

-Tony
ÿôèº{.nÇ+‰·Ÿ®‰­†+%ŠËÿ±éݶ\x17¥Šwÿº{.nÇ+‰·¥Š{±þG«éÿŠ{ayº\x1dʇڙë,j\a­¢f£¢·hšïêÿ‘êçz_è®\x03(­éšŽŠÝ¢j"ú\x1a¶^[m§ÿÿ¾\a«þG«éÿ¢¸?™¨è­Ú&£ø§~á¶iO•æ¬z·švØ^\x14\x04\x1a¶^[m§ÿÿÃ\fÿ¶ìÿ¢¸?–I¥

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Patch V0] x86, mce: Ensure offline CPU's don't participate in mce rendezvous process.
  2015-12-04 17:53           ` Luck, Tony
@ 2015-12-04 18:00             ` Borislav Petkov
  2015-12-04 18:30               ` Luck, Tony
  2015-12-04 22:34             ` Andy Lutomirski
  1 sibling, 1 reply; 14+ messages in thread
From: Borislav Petkov @ 2015-12-04 18:00 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Raj, Ashok, linux-kernel, linux-edac,
	Andy Lutomirski (luto@amacapital.net)

On Fri, Dec 04, 2015 at 05:53:33PM +0000, Luck, Tony wrote:
> > I don't mean that - I mean the stuff we do before we call
> > cpu_is_offline() like ist_enter, this_cpu_inc(mce_exception_count),
> > etc. Then we do a whole another bunch of stuff at the "out:" label like
> > printk and whatnot which shouldn't run on an offlined CPU.
> 
> ist_enter() is black magic to me. Andy? Would you be worried about executing
> ist_{enter,exit}() on a cpu that was once online, but is currently marked offline
> by Linux?

ist_enter() is context tracking functionality.

> Bumping mce_exception_count doesn't look like a big deal either way. It is visible in
> /proc/interrupts so I'd like to keep that honest (if the cpu comes back online again).
> But we could do the offline check before this.
> 
> There will be no printk() executed in the tail of the function. after we clear MCG_STATUS
> at the (new location of) the out: label we will see recover_paddr is still ~0ull and "goto done".

Whether it is kosher or not is beside the point. Why should an offlined
CPU even noodle through all that code if it doesn't need/have to? It can
return immediately instead.

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* RE: [Patch V0] x86, mce: Ensure offline CPU's don't participate in mce rendezvous process.
  2015-12-04 18:00             ` Borislav Petkov
@ 2015-12-04 18:30               ` Luck, Tony
  2015-12-04 19:38                 ` Borislav Petkov
  0 siblings, 1 reply; 14+ messages in thread
From: Luck, Tony @ 2015-12-04 18:30 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Raj, Ashok, linux-kernel, linux-edac,
	Andy Lutomirski (luto@amacapital.net)

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset="utf-8", Size: 988 bytes --]

> Whether it is kosher or not is beside the point. Why should an offlined
> CPU even noodle through all that code if it doesn't need/have to? It can
> return immediately instead.

Ashok wants to move in stage 2 to having the offline cpu scan banks and report
any errors seen there.  To do that we'll have to run through a fair bit of the
do_machine_check() code.

But ... if you want a super safe version to put the stable tag on ... we could
just have something like this at the head of do_machine_check()

	int cpu = smp_processor_id();

	if (cpu_is_offline(cpu)) {
		rdmsr(MCG_STATUS);
		if (RIPV bit set) {
			wrmsr(MCG_STATUS, 0);
			return;
		}
		// can we do anything here? Offline cpu has no place to return to.
		// There are no good answers ... falling into the regular code is
		// what we did historically
	}

-Tony
ÿôèº{.nÇ+‰·Ÿ®‰­†+%ŠËÿ±éݶ\x17¥Šwÿº{.nÇ+‰·¥Š{±þG«éÿŠ{ayº\x1dʇڙë,j\a­¢f£¢·hšïêÿ‘êçz_è®\x03(­éšŽŠÝ¢j"ú\x1a¶^[m§ÿÿ¾\a«þG«éÿ¢¸?™¨è­Ú&£ø§~á¶iO•æ¬z·švØ^\x14\x04\x1a¶^[m§ÿÿÃ\fÿ¶ìÿ¢¸?–I¥

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Patch V0] x86, mce: Ensure offline CPU's don't participate in mce rendezvous process.
  2015-12-04 18:30               ` Luck, Tony
@ 2015-12-04 19:38                 ` Borislav Petkov
  0 siblings, 0 replies; 14+ messages in thread
From: Borislav Petkov @ 2015-12-04 19:38 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Raj, Ashok, linux-kernel, linux-edac,
	Andy Lutomirski (luto@amacapital.net)

On Fri, Dec 04, 2015 at 06:30:39PM +0000, Luck, Tony wrote:
> Ashok wants to move in stage 2 to having the offline cpu scan banks and report
> any errors seen there.  To do that we'll have to run through a fair bit of the
> do_machine_check() code.

I'm still very sceptical whether logging those errors are worth the risk
of running code on an offlined CPU.

> But ... if you want a super safe version to put the stable tag on ... we could
> just have something like this at the head of do_machine_check()

It would be prudent IMO. We don't want to disrupt stable kernels
unnecessarily.

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Patch V0] x86, mce: Ensure offline CPU's don't participate in mce rendezvous process.
  2015-12-04 17:53           ` Luck, Tony
  2015-12-04 18:00             ` Borislav Petkov
@ 2015-12-04 22:34             ` Andy Lutomirski
  2015-12-05  0:08               ` Raj, Ashok
  1 sibling, 1 reply; 14+ messages in thread
From: Andy Lutomirski @ 2015-12-04 22:34 UTC (permalink / raw)
  To: Luck, Tony; +Cc: Borislav Petkov, Raj, Ashok, linux-kernel, linux-edac

On Fri, Dec 4, 2015 at 9:53 AM, Luck, Tony <tony.luck@intel.com> wrote:
>> I don't mean that - I mean the stuff we do before we call
>> cpu_is_offline() like ist_enter, this_cpu_inc(mce_exception_count),
>> etc. Then we do a whole another bunch of stuff at the "out:" label like
>> printk and whatnot which shouldn't run on an offlined CPU.
>
> ist_enter() is black magic to me. Andy? Would you be worried about executing
> ist_{enter,exit}() on a cpu that was once online, but is currently marked offline
> by Linux?

Offline CPUs are black magic to me.  But as long as the CPU works the
way that the normal specs say it should, then ist_enter is fair game.
In any event, if context tracking blows up on an offline CPU, I'd
argue that's a context tracking bug and needs to be fixed.

But maybe offlined CPUs are supposed to have all interrupts off
(including MCE?) and the argument goes the other way?  Dunno.

--Andy

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Patch V0] x86, mce: Ensure offline CPU's don't participate in mce rendezvous process.
  2015-12-05  0:08               ` Raj, Ashok
@ 2015-12-04 23:14                 ` Andy Lutomirski
  0 siblings, 0 replies; 14+ messages in thread
From: Andy Lutomirski @ 2015-12-04 23:14 UTC (permalink / raw)
  To: Raj, Ashok; +Cc: Luck, Tony, Borislav Petkov, linux-kernel, linux-edac

On Fri, Dec 4, 2015 at 4:08 PM, Raj, Ashok <ashok.raj@intel.com> wrote:
> On Fri, Dec 04, 2015 at 02:34:52PM -0800, Andy Lutomirski wrote:
>> On Fri, Dec 4, 2015 at 9:53 AM, Luck, Tony <tony.luck@intel.com> wrote:
>> > ist_enter() is black magic to me. Andy? Would you be worried about executing
>> > ist_{enter,exit}() on a cpu that was once online, but is currently marked offline
>> > by Linux?
>>
>> Offline CPUs are black magic to me.  But as long as the CPU works the
>> way that the normal specs say it should, then ist_enter is fair game.
>> In any event, if context tracking blows up on an offline CPU, I'd
>> argue that's a context tracking bug and needs to be fixed.
>>
>> But maybe offlined CPUs are supposed to have all interrupts off
>> (including MCE?) and the argument goes the other way?  Dunno.
>
> MCE's are broadcast by the hardware and cannot be blocked. Offline
> is only a Linux specific state. Now if the offline was a result of an ACPI
> event (eject) that triggered the CPU removal (offline in Linux, as it would
> have in a platform that supports true hotplug) then the platform would
> remove this cpu from the broadcast list.
>
> if kernel were to set CR4.MCE=0 that would cause system shutdown when
> an MCE is broadcast and hits this cpu.

I meant "supposed" as in Linux might expect arch code to prevent the
CPU from receiving interrupts.

Anyway, I think that would be silly and we should just expect
ist_enter to work regardless of online state.

This does mean that if we plug in a new CPU and online it, then
there's a window before we set up percpu memory and enable CR4.MCE in
which an MCE on any CPU will kill the system, at least on hardware for
which MCE broadcast can't be turned off.

--Andy

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Patch V0] x86, mce: Ensure offline CPU's don't participate in mce rendezvous process.
  2015-12-04 22:34             ` Andy Lutomirski
@ 2015-12-05  0:08               ` Raj, Ashok
  2015-12-04 23:14                 ` Andy Lutomirski
  0 siblings, 1 reply; 14+ messages in thread
From: Raj, Ashok @ 2015-12-05  0:08 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: Luck, Tony, Borislav Petkov, linux-kernel, linux-edac

On Fri, Dec 04, 2015 at 02:34:52PM -0800, Andy Lutomirski wrote:
> On Fri, Dec 4, 2015 at 9:53 AM, Luck, Tony <tony.luck@intel.com> wrote:
> > ist_enter() is black magic to me. Andy? Would you be worried about executing
> > ist_{enter,exit}() on a cpu that was once online, but is currently marked offline
> > by Linux?
> 
> Offline CPUs are black magic to me.  But as long as the CPU works the
> way that the normal specs say it should, then ist_enter is fair game.
> In any event, if context tracking blows up on an offline CPU, I'd
> argue that's a context tracking bug and needs to be fixed.
> 
> But maybe offlined CPUs are supposed to have all interrupts off
> (including MCE?) and the argument goes the other way?  Dunno.

MCE's are broadcast by the hardware and cannot be blocked. Offline
is only a Linux specific state. Now if the offline was a result of an ACPI
event (eject) that triggered the CPU removal (offline in Linux, as it would 
have in a platform that supports true hotplug) then the platform would 
remove this cpu from the broadcast list. 

if kernel were to set CR4.MCE=0 that would cause system shutdown when 
an MCE is broadcast and hits this cpu.

Cheers,
Ashok

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2015-12-04 23:14 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-12-04  0:16 [Patch V0] x86, mce: Ensure offline CPU's don't participate in mce rendezvous process Ashok Raj
2015-12-03 23:34 ` Greg KH
2015-12-04 14:34 ` Borislav Petkov
2015-12-04 17:14   ` Raj, Ashok
2015-12-04 16:51     ` Borislav Petkov
2015-12-04 17:23       ` Luck, Tony
2015-12-04 17:36         ` Borislav Petkov
2015-12-04 17:53           ` Luck, Tony
2015-12-04 18:00             ` Borislav Petkov
2015-12-04 18:30               ` Luck, Tony
2015-12-04 19:38                 ` Borislav Petkov
2015-12-04 22:34             ` Andy Lutomirski
2015-12-05  0:08               ` Raj, Ashok
2015-12-04 23:14                 ` Andy Lutomirski

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.