* [RFC] #MC mess @ 2020-02-18 17:31 Borislav Petkov 2020-02-18 18:11 ` Steven Rostedt ` (2 more replies) 0 siblings, 3 replies; 28+ messages in thread From: Borislav Petkov @ 2020-02-18 17:31 UTC (permalink / raw) To: Peter Zijlstra, Steven Rostedt, Andy Lutomirski; +Cc: Tony Luck, x86-ml, lkml Ok, so Peter raised this question on IRC today, that the #MC handler needs to disable all kinds of tracing/kprobing and etc exceptions happening while handling an #MC. And I guess we can talk about supporting some exceptions but #MC is usually nasty enough to not care about tracing when former happens. So how about this trivial first stab of using the big hammer and simply turning off stuff? The nmi_enter()/nmi_exit() thing still needs debating because ist_enter() already does rcu_nmi_enter() and I'm not sure whether any of the context tracking would still be ok with that. Anything else I'm missing? It is likely... Thx. --- diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c index 2c4f949611e4..6dff97c53310 100644 --- a/arch/x86/kernel/cpu/mce/core.c +++ b/arch/x86/kernel/cpu/mce/core.c @@ -1214,7 +1214,7 @@ static void __mc_scan_banks(struct mce *m, struct mce *final, * MCE broadcast. However some CPUs might be broken beyond repair, * so be always careful when synchronizing with others. 
*/ -void do_machine_check(struct pt_regs *regs, long error_code) +void notrace do_machine_check(struct pt_regs *regs, long error_code) { DECLARE_BITMAP(valid_banks, MAX_NR_BANKS); DECLARE_BITMAP(toclear, MAX_NR_BANKS); @@ -1251,6 +1251,10 @@ void do_machine_check(struct pt_regs *regs, long error_code) if (__mc_check_crashing_cpu(cpu)) return; + hw_breakpoint_disable(); + static_key_disable(&__tracepoint_read_msr.key); + tracing_off(); + ist_enter(regs); this_cpu_inc(mce_exception_count); @@ -1360,6 +1364,7 @@ void do_machine_check(struct pt_regs *regs, long error_code) ist_exit(regs); } EXPORT_SYMBOL_GPL(do_machine_check); +NOKPROBE_SYMBOL(do_machine_check); #ifndef CONFIG_MEMORY_FAILURE int memory_failure(unsigned long pfn, int flags) -- Regards/Gruss, Boris. https://people.kernel.org/tglx/notes-about-netiquette ^ permalink raw reply related [flat|nested] 28+ messages in thread
* Re: [RFC] #MC mess 2020-02-18 17:31 [RFC] #MC mess Borislav Petkov @ 2020-02-18 18:11 ` Steven Rostedt 2020-02-18 19:50 ` Borislav Petkov 2020-02-18 18:20 ` Luck, Tony 2020-02-19 0:15 ` Andy Lutomirski 2 siblings, 1 reply; 28+ messages in thread From: Steven Rostedt @ 2020-02-18 18:11 UTC (permalink / raw) To: Borislav Petkov; +Cc: Peter Zijlstra, Andy Lutomirski, Tony Luck, x86-ml, lkml On Tue, 18 Feb 2020 18:31:50 +0100 Borislav Petkov <bp@alien8.de> wrote: > Ok, > > so Peter raised this question on IRC today, that the #MC handler needs > to disable all kinds of tracing/kprobing and etc exceptions happening > while handling an #MC. And I guess we can talk about supporting some > exceptions but #MC is usually nasty enough to not care about tracing > when former happens. What's the issue with tracing? Does this affect the tracing done by the edac_mc_handle_error code? It has a trace event in it, that the rasdaemon uses. > > So how about this trivial first stab of using the big hammer and simply > turning off stuff? The nmi_enter()/nmi_exit() thing still needs debating > because ist_enter() already does rcu_nmi_enter() and I'm not sure > whether any of the context tracking would still be ok with that. > > Anything else I'm missing? It is likely... > > Thx. > > --- > diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c > index 2c4f949611e4..6dff97c53310 100644 > --- a/arch/x86/kernel/cpu/mce/core.c > +++ b/arch/x86/kernel/cpu/mce/core.c > @@ -1214,7 +1214,7 @@ static void __mc_scan_banks(struct mce *m, struct mce *final, > * MCE broadcast. However some CPUs might be broken beyond repair, > * so be always careful when synchronizing with others. 
> */ > -void do_machine_check(struct pt_regs *regs, long error_code) > +void notrace do_machine_check(struct pt_regs *regs, long error_code) > { > DECLARE_BITMAP(valid_banks, MAX_NR_BANKS); > DECLARE_BITMAP(toclear, MAX_NR_BANKS); > @@ -1251,6 +1251,10 @@ void do_machine_check(struct pt_regs *regs, long error_code) > if (__mc_check_crashing_cpu(cpu)) > return; > > + hw_breakpoint_disable(); > + static_key_disable(&__tracepoint_read_msr.key); I believe static_key_disable() sleeps, and does all kinds of crazing things (like update the code). -- Steve > + tracing_off(); > + > ist_enter(regs); > > this_cpu_inc(mce_exception_count); > @@ -1360,6 +1364,7 @@ void do_machine_check(struct pt_regs *regs, long error_code) > ist_exit(regs); > } > EXPORT_SYMBOL_GPL(do_machine_check); > +NOKPROBE_SYMBOL(do_machine_check); > > #ifndef CONFIG_MEMORY_FAILURE > int memory_failure(unsigned long pfn, int flags) > ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC] #MC mess 2020-02-18 18:11 ` Steven Rostedt @ 2020-02-18 19:50 ` Borislav Petkov 2020-02-18 20:08 ` Steven Rostedt 0 siblings, 1 reply; 28+ messages in thread From: Borislav Petkov @ 2020-02-18 19:50 UTC (permalink / raw) To: Steven Rostedt; +Cc: Peter Zijlstra, Andy Lutomirski, Tony Luck, x86-ml, lkml On Tue, Feb 18, 2020 at 01:11:58PM -0500, Steven Rostedt wrote: > What's the issue with tracing? Does this affect the tracing done by the > edac_mc_handle_error code? > > It has a trace event in it, that the rasdaemon uses. Nah, that code is called from process context. The problem with tracing the #MC handler is the same as tracing the NMI handler. And the NMI handler does all kinds of dancing wrt breakpoints and nested NMIs and the #MC handler doesn't do any of that. Not sure if it should at all, btw. > I believe static_key_disable() sleeps, and does all kinds of crazy > things (like update the code). True story, thanks for that hint! static_key_disable() |-> cpus_read_lock() |-> percpu_down_read(&cpu_hotplug_lock) |-> might_sleep() Yuck. Which means, the #MC handler must switch to __rdmsr()/__wrmsr() now. I wish I could travel back in time and NAK the hell out of that MSR tracepoint crap. -- Regards/Gruss, Boris. https://people.kernel.org/tglx/notes-about-netiquette ^ permalink raw reply [flat|nested] 28+ messages in thread
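The __rdmsr()/__wrmsr() switch mentioned above works because the kernel keeps two variants of each MSR accessor: the normal one fires the read_msr/write_msr tracepoints, while the double-underscore one does not, so the #MC handler can avoid the tracepoint without having to flip (and sleep on) the static key at all. A toy userspace model of that split; the `_model` names, the fake MSR value and the hit counter are made up for illustration, not the kernel's actual msr.h code:

```c
#include <assert.h>

/* Stand-in for the read_msr tracepoint: just count invocations. */
static int trace_read_msr_hits;

/* "Hardware" register, modeled as a plain variable. */
static unsigned long long fake_msr = 0x1234;

/* Raw accessor: no tracepoint, usable from #MC/NMI-like context. */
static unsigned long long __rdmsr_model(void)
{
	return fake_msr;
}

/* Traced accessor: wraps the raw one and fires the "tracepoint". */
static unsigned long long rdmsr_model(void)
{
	unsigned long long v = __rdmsr_model();
	trace_read_msr_hits++;	/* the tracepoint fires here */
	return v;
}
```

The design point being debated: rather than disabling the tracepoint machinery from inside the handler (which may sleep), the handler can simply call the untraced variant.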
* Re: [RFC] #MC mess 2020-02-18 19:50 ` Borislav Petkov @ 2020-02-18 20:08 ` Steven Rostedt 2020-02-18 20:30 ` Peter Zijlstra 2020-02-18 20:52 ` Borislav Petkov 0 siblings, 2 replies; 28+ messages in thread From: Steven Rostedt @ 2020-02-18 20:08 UTC (permalink / raw) To: Borislav Petkov; +Cc: Peter Zijlstra, Andy Lutomirski, Tony Luck, x86-ml, lkml On Tue, 18 Feb 2020 20:50:35 +0100 Borislav Petkov <bp@alien8.de> wrote: > True story, thanks for that hint! > > static_key_disable() > |-> cpus_read_lock() > |-> percpu_down_read(&cpu_hotplug_lock) > |-> might_sleep() > > Yuck. Which means, the #MC handler must switch to __rdmsr()/__wrmsr() > now. > > I wish I could travel back in time and NAK the hell out of that MSR > tracepoint crap. Can we create a per_cpu variable that gets set when we enter the MC handler, and not call the msr trace points when that is set? Now, are jump labels bad in these cases? (Note, it is possible to trigger a breakpoint in them; does an iret re-enable the MC as well, which would mean we could get nested MC handlers corrupting the IST stack?) You could have the msr_tracepoint_active() check this per cpu variable? msr reading and writing is rather slow, and I'm sure reading a per_cpu variable is going to be in the noise of it. -- Steve ^ permalink raw reply [flat|nested] 28+ messages in thread
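Steven's suggestion, sketched as a toy model: a flag set on #MC entry that the tracepoint condition checks, so the handler needs no code patching at all. All names here are hypothetical; in the kernel the flag would be a per-CPU variable and the check would live in the tracepoint's condition:

```c
#include <assert.h>

static int in_mce;		/* would be a per-CPU variable in the kernel */
static int msr_trace_events;	/* stand-in for emitted trace events */

/* Condition the msr tracepoints would check before firing. */
static int msr_tracepoint_active_model(void)
{
	return !in_mce;
}

static void trace_read_msr_model(void)
{
	if (msr_tracepoint_active_model())
		msr_trace_events++;
}

static void do_machine_check_model(void)
{
	in_mce = 1;		/* entry: suppress msr tracepoints */
	trace_read_msr_model();	/* handler reads MSRs: no event emitted */
	in_mce = 0;		/* exit: tracing resumes automatically */
}
```

Unlike static_key_disable(), setting a flag is async-safe and, as noted, its cost is in the noise next to the MSR access itself.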
* Re: [RFC] #MC mess 2020-02-18 20:08 ` Steven Rostedt @ 2020-02-18 20:30 ` Peter Zijlstra 2020-02-18 20:52 ` Borislav Petkov 1 sibling, 0 replies; 28+ messages in thread From: Peter Zijlstra @ 2020-02-18 20:30 UTC (permalink / raw) To: Steven Rostedt; +Cc: Borislav Petkov, Andy Lutomirski, Tony Luck, x86-ml, lkml On Tue, Feb 18, 2020 at 03:08:50PM -0500, Steven Rostedt wrote: > Now, is jump labels bad in these cases (note, it is possible to trigger > a breakpoint in them, does an iret disable the MC as well, which means > we could get nested MC handlers corrupting the IST stack?). That is what I expect, and would be consistent with the whole NMI situation. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC] #MC mess 2020-02-18 20:08 ` Steven Rostedt 2020-02-18 20:30 ` Peter Zijlstra @ 2020-02-18 20:52 ` Borislav Petkov 1 sibling, 0 replies; 28+ messages in thread From: Borislav Petkov @ 2020-02-18 20:52 UTC (permalink / raw) To: Steven Rostedt; +Cc: Peter Zijlstra, Andy Lutomirski, Tony Luck, x86-ml, lkml On Tue, Feb 18, 2020 at 03:08:50PM -0500, Steven Rostedt wrote: > You could have the msr_tracepoint_active() check this per cpu variable? > > msr reading and writing is rather slow, and I'm sure reading a per_cpu > variable is going to be in the noise of it. Yeah, I was worrying about using the tracing MSR variants in NMI context but Peter says tracing should do in_nmi() if it isn't doing so. Same with #MC: we hold a subsequent #MC from getting raised with MCIP - thanks Tony - but we can have other exceptions raised while in the #MC handler. That too should be taken care of with the in_nmi() thing. Thx. -- Regards/Gruss, Boris. https://people.kernel.org/tglx/notes-about-netiquette ^ permalink raw reply [flat|nested] 28+ messages in thread
* RE: [RFC] #MC mess 2020-02-18 17:31 [RFC] #MC mess Borislav Petkov 2020-02-18 18:11 ` Steven Rostedt @ 2020-02-18 18:20 ` Luck, Tony 2020-02-18 19:51 ` Borislav Petkov ` (2 more replies) 2020-02-19 0:15 ` Andy Lutomirski 2 siblings, 3 replies; 28+ messages in thread From: Luck, Tony @ 2020-02-18 18:20 UTC (permalink / raw) To: Borislav Petkov, Peter Zijlstra, Steven Rostedt, Andy Lutomirski Cc: x86-ml, lkml > Anything else I'm missing? It is likely... + hw_breakpoint_disable(); + static_key_disable(&__tracepoint_read_msr.key); + tracing_off(); + ist_enter(regs); How about some code to turn all those back on for a recoverable (where we actually recovered) #MC? -Tony ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC] #MC mess 2020-02-18 18:20 ` Luck, Tony @ 2020-02-18 19:51 ` Borislav Petkov 2020-02-18 19:54 ` Luck, Tony 2020-02-18 20:02 ` Peter Zijlstra 2020-02-18 23:10 ` Andy Lutomirski 2 siblings, 1 reply; 28+ messages in thread From: Borislav Petkov @ 2020-02-18 19:51 UTC (permalink / raw) To: Luck, Tony; +Cc: Peter Zijlstra, Steven Rostedt, Andy Lutomirski, x86-ml, lkml On Tue, Feb 18, 2020 at 06:20:38PM +0000, Luck, Tony wrote: > > Anything else I'm missing? It is likely... > > + hw_breakpoint_disable(); > + static_key_disable(&__tracepoint_read_msr.key); > + tracing_off(); > + > ist_enter(regs); > > How about some code to turn all those back on for a recoverable (where > we actually recovered) #MC? The use case being? -- Regards/Gruss, Boris. https://people.kernel.org/tglx/notes-about-netiquette ^ permalink raw reply [flat|nested] 28+ messages in thread
* RE: [RFC] #MC mess 2020-02-18 19:51 ` Borislav Petkov @ 2020-02-18 19:54 ` Luck, Tony 2020-02-18 20:00 ` Borislav Petkov 0 siblings, 1 reply; 28+ messages in thread From: Luck, Tony @ 2020-02-18 19:54 UTC (permalink / raw) To: Borislav Petkov Cc: Peter Zijlstra, Steven Rostedt, Andy Lutomirski, x86-ml, lkml >> How about some code to turn all those back on for a recoverable (where >> we actually recovered) #MC? > > The use case being? Recoverable machine checks are a normal (hopefully rare) event on a server. But you wouldn't want to lose tracing capability just because we took a page offline and killed a process to recover. -Tony ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC] #MC mess 2020-02-18 19:54 ` Luck, Tony @ 2020-02-18 20:00 ` Borislav Petkov 2020-02-18 20:05 ` Luck, Tony 0 siblings, 1 reply; 28+ messages in thread From: Borislav Petkov @ 2020-02-18 20:00 UTC (permalink / raw) To: Luck, Tony; +Cc: Peter Zijlstra, Steven Rostedt, Andy Lutomirski, x86-ml, lkml On Tue, Feb 18, 2020 at 07:54:54PM +0000, Luck, Tony wrote: > Recoverable machine checks are a normal (hopefully rare) event on a server. But you wouldn't > want to lose tracing capability just because we took a page offline and killed a process to recover. Yeah, ok. How do you want to select which ones? What mce_no_way_out() says or severity or...? -- Regards/Gruss, Boris. https://people.kernel.org/tglx/notes-about-netiquette ^ permalink raw reply [flat|nested] 28+ messages in thread
* RE: [RFC] #MC mess 2020-02-18 20:00 ` Borislav Petkov @ 2020-02-18 20:05 ` Luck, Tony 0 siblings, 0 replies; 28+ messages in thread From: Luck, Tony @ 2020-02-18 20:05 UTC (permalink / raw) To: Borislav Petkov Cc: Peter Zijlstra, Steven Rostedt, Andy Lutomirski, x86-ml, lkml > Yeah, ok. How do you want to select which ones? What mce_no_way_out() > says or severity or...? We only return from do_machine_check() in the recoverable case. So down at the end just here: out_ist: ist_exit(regs); } Though that does include cases where we returned just because cfg->tolerant was set to some value saying ignore this #MC. -Tony ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC] #MC mess 2020-02-18 18:20 ` Luck, Tony 2020-02-18 19:51 ` Borislav Petkov @ 2020-02-18 20:02 ` Peter Zijlstra 2020-02-18 20:09 ` Borislav Petkov ` (2 more replies) 2020-02-18 23:10 ` Andy Lutomirski 2 siblings, 3 replies; 28+ messages in thread From: Peter Zijlstra @ 2020-02-18 20:02 UTC (permalink / raw) To: Luck, Tony; +Cc: Borislav Petkov, Steven Rostedt, Andy Lutomirski, x86-ml, lkml On Tue, Feb 18, 2020 at 06:20:38PM +0000, Luck, Tony wrote: > > Anything else I'm missing? It is likely... > > + hw_breakpoint_disable(); > + static_key_disable(&__tracepoint_read_msr.key); > + tracing_off(); > + > ist_enter(regs); > > How about some code to turn all those back on for a recoverable (where we actually recovered) #MC? Then please rewrite the #MC entry code to deal with nested exceptions unmasking the MCE, very similar to NMI. The moment you allow tracing, jump_labels or anything else you can expect #PF, #BP and probably #DB while inside #MC, those will then IRET and re-enable the #MC. The current situation is completely and utterly buggered. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC] #MC mess 2020-02-18 20:02 ` Peter Zijlstra @ 2020-02-18 20:09 ` Borislav Petkov 2020-02-18 20:11 ` Luck, Tony 2020-02-18 23:17 ` Andy Lutomirski 2 siblings, 0 replies; 28+ messages in thread From: Borislav Petkov @ 2020-02-18 20:09 UTC (permalink / raw) To: Peter Zijlstra; +Cc: Luck, Tony, Steven Rostedt, Andy Lutomirski, x86-ml, lkml On Tue, Feb 18, 2020 at 09:02:00PM +0100, Peter Zijlstra wrote: > Then please rewrite the #MC entry code to deal with nested exceptions > unmasking the MCE, very similr to NMI. > > The moment you allow tracing, jump_labels or anything else you can > expect #PF, #BP and probably #DB while inside #MC, those will then IRET > and re-enable the #MC. Yeah, I'd like to keep it simple and reenable all the crap we disabled only on exit from the handler. Dunno if we care about losing tracing samples when an MCE happened... > The current situation is completely and utterly buggered. Lovely. ;-( -- Regards/Gruss, Boris. https://people.kernel.org/tglx/notes-about-netiquette ^ permalink raw reply [flat|nested] 28+ messages in thread
* RE: [RFC] #MC mess 2020-02-18 20:02 ` Peter Zijlstra 2020-02-18 20:09 ` Borislav Petkov @ 2020-02-18 20:11 ` Luck, Tony 2020-02-18 20:34 ` Peter Zijlstra 2020-02-18 23:17 ` Andy Lutomirski 2 siblings, 1 reply; 28+ messages in thread From: Luck, Tony @ 2020-02-18 20:11 UTC (permalink / raw) To: Peter Zijlstra Cc: Borislav Petkov, Steven Rostedt, Andy Lutomirski, x86-ml, lkml > Then please rewrite the #MC entry code to deal with nested exceptions > unmasking the MCE, very similar to NMI. #MC doesn't work like NMI. It isn't enabled by IRET. A nested #MC causes an immediate reset. Detection of the nested case is by IA32_MCG_STATUS.MCIP. We don't clear MCG_STATUS until we are ready to return from the machine check handler. -Tony ^ permalink raw reply [flat|nested] 28+ messages in thread
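Tony's MCIP point, reduced to a toy state machine: hardware sets IA32_MCG_STATUS.MCIP when delivering a #MC and shuts the machine down if another #MC arrives while the bit is still set; software keeps it set until it is ready to take another one. This is a deliberately simplified model (a "reset" flag instead of an actual shutdown, a plain variable instead of an MSR); only the MCIP bit position matches the architectural layout:

```c
#include <assert.h>

#define MCG_STATUS_MCIP 0x4ULL	/* bit 2 of IA32_MCG_STATUS */

static unsigned long long mcg_status;
static int machine_reset;	/* stand-in for "immediate reset" */

/* Hardware side: deliver a machine check. */
static void raise_mc_model(void)
{
	if (mcg_status & MCG_STATUS_MCIP) {
		machine_reset = 1;	/* nested #MC: unrecoverable */
		return;
	}
	mcg_status |= MCG_STATUS_MCIP;
}

/* Software side: handler is done, allow the next #MC in. */
static void mce_done_model(void)
{
	mcg_status = 0;	/* the kernel writes 0 to MSR_IA32_MCG_STATUS here */
}
```

This is why #MC, unlike NMI, does not need IRET-style re-entrancy games: nesting is fatal by construction, so the handler only has to make sure nothing it calls can fault and IRET before MCG_STATUS is cleared.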
* Re: [RFC] #MC mess 2020-02-18 20:11 ` Luck, Tony @ 2020-02-18 20:34 ` Peter Zijlstra 2020-02-18 21:49 ` Peter Zijlstra 0 siblings, 1 reply; 28+ messages in thread From: Peter Zijlstra @ 2020-02-18 20:34 UTC (permalink / raw) To: Luck, Tony; +Cc: Borislav Petkov, Steven Rostedt, Andy Lutomirski, x86-ml, lkml On Tue, Feb 18, 2020 at 08:11:10PM +0000, Luck, Tony wrote: > > Then please rewrite the #MC entry code to deal with nested exceptions > > unmasking the MCE, very similr to NMI. > > #MC doesn't work like NMI. It isn't enabled by IRET. Nested #MC cause an > immediate reset. Detection of nested case is by IA32_MCG_STATUS.MCIP. > We don't clear MCG_STATUS until we are ready to return from the machine > check handler. Ooh, excellent! That saves a whole heap of problems. Then I suppose we should work at getting in_nmi() set. Let me have another look at things. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC] #MC mess 2020-02-18 20:34 ` Peter Zijlstra @ 2020-02-18 21:49 ` Peter Zijlstra 2020-02-18 21:53 ` Peter Zijlstra 2020-02-18 22:40 ` Frederic Weisbecker 0 siblings, 2 replies; 28+ messages in thread From: Peter Zijlstra @ 2020-02-18 21:49 UTC (permalink / raw) To: Luck, Tony Cc: Borislav Petkov, Steven Rostedt, Andy Lutomirski, x86-ml, lkml, paulmck On Tue, Feb 18, 2020 at 09:34:04PM +0100, Peter Zijlstra wrote: > On Tue, Feb 18, 2020 at 08:11:10PM +0000, Luck, Tony wrote: > > > Then please rewrite the #MC entry code to deal with nested exceptions > > > unmasking the MCE, very similr to NMI. > > > > #MC doesn't work like NMI. It isn't enabled by IRET. Nested #MC cause an > > immediate reset. Detection of nested case is by IA32_MCG_STATUS.MCIP. > > We don't clear MCG_STATUS until we are ready to return from the machine > > check handler. > > Ooh, excellent! That saves a whole heap of problems. > > Then I suppose we should work at getting in_nmi() set. Let me have > another look at things. How's something like the -- completely untested -- below? 
--- arch/arm64/kernel/sdei.c | 14 +----- arch/arm64/kernel/traps.c | 8 +--- arch/powerpc/kernel/traps.c | 19 +++----- arch/x86/include/asm/traps.h | 5 --- arch/x86/kernel/cpu/mce/core.c | 60 ++++++++++++++----------- arch/x86/kernel/cpu/mce/p5.c | 7 +-- arch/x86/kernel/cpu/mce/winchip.c | 7 +-- arch/x86/kernel/traps.c | 94 ++++----------------------------------- include/linux/hardirq.h | 2 +- include/linux/preempt.h | 4 +- include/linux/sched.h | 6 +++ kernel/printk/printk_safe.c | 6 ++- 12 files changed, 72 insertions(+), 160 deletions(-) diff --git a/arch/arm64/kernel/sdei.c b/arch/arm64/kernel/sdei.c index d6259dac62b6..e396e69e33a1 100644 --- a/arch/arm64/kernel/sdei.c +++ b/arch/arm64/kernel/sdei.c @@ -251,22 +251,12 @@ asmlinkage __kprobes notrace unsigned long __sdei_handler(struct pt_regs *regs, struct sdei_registered_event *arg) { unsigned long ret; - bool do_nmi_exit = false; - /* - * nmi_enter() deals with printk() re-entrance and use of RCU when - * RCU believed this CPU was idle. Because critical events can - * interrupt normal events, we may already be in_nmi(). 
- */ - if (!in_nmi()) { - nmi_enter(); - do_nmi_exit = true; - } + nmi_enter(); ret = _sdei_handler(regs, arg); - if (do_nmi_exit) - nmi_exit(); + nmi_exit(); return ret; } diff --git a/arch/arm64/kernel/traps.c b/arch/arm64/kernel/traps.c index cf402be5c573..c728f163f329 100644 --- a/arch/arm64/kernel/traps.c +++ b/arch/arm64/kernel/traps.c @@ -906,17 +906,13 @@ bool arm64_is_fatal_ras_serror(struct pt_regs *regs, unsigned int esr) asmlinkage void do_serror(struct pt_regs *regs, unsigned int esr) { - const bool was_in_nmi = in_nmi(); - - if (!was_in_nmi) - nmi_enter(); + nmi_enter(); /* non-RAS errors are not containable */ if (!arm64_is_ras_serror(esr) || arm64_is_fatal_ras_serror(regs, esr)) arm64_serror_panic(regs, esr); - if (!was_in_nmi) - nmi_exit(); + nmi_exit(); } asmlinkage void enter_from_user_mode(void) diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c index 014ff0701f24..aaaeb2884f59 100644 --- a/arch/powerpc/kernel/traps.c +++ b/arch/powerpc/kernel/traps.c @@ -441,15 +441,9 @@ void hv_nmi_check_nonrecoverable(struct pt_regs *regs) void system_reset_exception(struct pt_regs *regs) { unsigned long hsrr0, hsrr1; - bool nested = in_nmi(); bool saved_hsrrs = false; - /* - * Avoid crashes in case of nested NMI exceptions. Recoverability - * is determined by RI and in_nmi - */ - if (!nested) - nmi_enter(); + nmi_enter(); /* * System reset can interrupt code where HSRRs are live and MSR[RI]=1. @@ -521,8 +515,7 @@ void system_reset_exception(struct pt_regs *regs) mtspr(SPRN_HSRR1, hsrr1); } - if (!nested) - nmi_exit(); + nmi_exit(); /* What should we do here? We could issue a shutdown or hard reset. 
*/ } @@ -823,9 +816,8 @@ int machine_check_generic(struct pt_regs *regs) void machine_check_exception(struct pt_regs *regs) { int recover = 0; - bool nested = in_nmi(); - if (!nested) - nmi_enter(); + + nmi_enter(); __this_cpu_inc(irq_stat.mce_exceptions); @@ -863,8 +855,7 @@ void machine_check_exception(struct pt_regs *regs) return; bail: - if (!nested) - nmi_exit(); + nmi_exit(); } void SMIException(struct pt_regs *regs) diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h index ffa0dc8a535e..6aa9da422d23 100644 --- a/arch/x86/include/asm/traps.h +++ b/arch/x86/include/asm/traps.h @@ -121,11 +121,6 @@ void smp_spurious_interrupt(struct pt_regs *regs); void smp_error_interrupt(struct pt_regs *regs); asmlinkage void smp_irq_move_cleanup_interrupt(void); -extern void ist_enter(struct pt_regs *regs); -extern void ist_exit(struct pt_regs *regs); -extern void ist_begin_non_atomic(struct pt_regs *regs); -extern void ist_end_non_atomic(void); - #ifdef CONFIG_VMAP_STACK void __noreturn handle_stack_overflow(const char *message, struct pt_regs *regs, diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c index 2c4f949611e4..403380f857c7 100644 --- a/arch/x86/kernel/cpu/mce/core.c +++ b/arch/x86/kernel/cpu/mce/core.c @@ -1084,23 +1084,6 @@ static void mce_clear_state(unsigned long *toclear) } } -static int do_memory_failure(struct mce *m) -{ - int flags = MF_ACTION_REQUIRED; - int ret; - - pr_err("Uncorrected hardware memory error in user-access at %llx", m->addr); - if (!(m->mcgstatus & MCG_STATUS_RIPV)) - flags |= MF_MUST_KILL; - ret = memory_failure(m->addr >> PAGE_SHIFT, flags); - if (ret) - pr_err("Memory error not recovered"); - else - set_mce_nospec(m->addr >> PAGE_SHIFT); - return ret; -} - - /* * Cases where we avoid rendezvous handler timeout: * 1) If this CPU is offline. 
@@ -1202,6 +1185,30 @@ static void __mc_scan_banks(struct mce *m, struct mce *final, *m = *final; } +static void mce_kill_me_now(struct callback_head *ch) +{ + force_sig(SIGBUS); +} + +static void mce_kill_me_maybe(struct callback_head *cb) +{ + struct task_struct *p = container_of(cb, struct task_struct, mce_kill_me); + int flags = MF_ACTION_REQUIRED; + int ret; + + pr_err("Uncorrected hardware memory error in user-access at %llx", p->mce_addr); + if (!(p->mce_status & MCG_STATUS_RIPV)) + flags |= MF_MUST_KILL; + ret = memory_failure(p->mce_addr >> PAGE_SHIFT, flags); + if (ret) + pr_err("Memory error not recovered"); + else + set_mce_nospec(p->mce_addr >> PAGE_SHIFT); + + if (ret) + mce_kill_me_now(cb); +} + /* * The actual machine check handler. This only handles real * exceptions when something got corrupted coming in through int 18. @@ -1214,7 +1221,7 @@ static void __mc_scan_banks(struct mce *m, struct mce *final, * MCE broadcast. However some CPUs might be broken beyond repair, * so be always careful when synchronizing with others. 
*/ -void do_machine_check(struct pt_regs *regs, long error_code) +notrace void do_machine_check(struct pt_regs *regs, long error_code) { DECLARE_BITMAP(valid_banks, MAX_NR_BANKS); DECLARE_BITMAP(toclear, MAX_NR_BANKS); @@ -1251,7 +1258,7 @@ void do_machine_check(struct pt_regs *regs, long error_code) if (__mc_check_crashing_cpu(cpu)) return; - ist_enter(regs); + nmi_enter(); this_cpu_inc(mce_exception_count); @@ -1344,22 +1351,23 @@ void do_machine_check(struct pt_regs *regs, long error_code) /* Fault was in user mode and we need to take some action */ if ((m.cs & 3) == 3) { - ist_begin_non_atomic(regs); - local_irq_enable(); + current->mce_addr = m->addr; + current->mce_status = m->mcgstatus; + current->mce_kill_me.func = mce_kill_me_maybe; + if (kill_it) + current->mce_kill_me.func = mce_kill_me_now; - if (kill_it || do_memory_failure(&m)) - force_sig(SIGBUS); - local_irq_disable(); - ist_end_non_atomic(); + task_work_add(current, ¤t->mce_kill_me, true); } else { if (!fixup_exception(regs, X86_TRAP_MC, error_code, 0)) mce_panic("Failed kernel mode recovery", &m, msg); } out_ist: - ist_exit(regs); + nmi_exit(); } EXPORT_SYMBOL_GPL(do_machine_check); +NOKPROBE_SYMBOL(do_machine_check); #ifndef CONFIG_MEMORY_FAILURE int memory_failure(unsigned long pfn, int flags) diff --git a/arch/x86/kernel/cpu/mce/p5.c b/arch/x86/kernel/cpu/mce/p5.c index 4ae6df556526..135f6d5df3db 100644 --- a/arch/x86/kernel/cpu/mce/p5.c +++ b/arch/x86/kernel/cpu/mce/p5.c @@ -20,11 +20,11 @@ int mce_p5_enabled __read_mostly; /* Machine check handler for Pentium class Intel CPUs: */ -static void pentium_machine_check(struct pt_regs *regs, long error_code) +static notrace void pentium_machine_check(struct pt_regs *regs, long error_code) { u32 loaddr, hi, lotype; - ist_enter(regs); + nmi_enter(); rdmsr(MSR_IA32_P5_MC_ADDR, loaddr, hi); rdmsr(MSR_IA32_P5_MC_TYPE, lotype, hi); @@ -39,8 +39,9 @@ static void pentium_machine_check(struct pt_regs *regs, long error_code) add_taint(TAINT_MACHINE_CHECK, 
LOCKDEP_NOW_UNRELIABLE); - ist_exit(regs); + nmi_exit(); } +NOKPROBE_SYMBOL(pentium_machine_check); /* Set up machine check reporting for processors with Intel style MCE: */ void intel_p5_mcheck_init(struct cpuinfo_x86 *c) diff --git a/arch/x86/kernel/cpu/mce/winchip.c b/arch/x86/kernel/cpu/mce/winchip.c index a30ea13cccc2..fbc5216c72f7 100644 --- a/arch/x86/kernel/cpu/mce/winchip.c +++ b/arch/x86/kernel/cpu/mce/winchip.c @@ -16,15 +16,16 @@ #include "internal.h" /* Machine check handler for WinChip C6: */ -static void winchip_machine_check(struct pt_regs *regs, long error_code) +static notrace void winchip_machine_check(struct pt_regs *regs, long error_code) { - ist_enter(regs); + nmi_enter(); pr_emerg("CPU0: Machine Check Exception.\n"); add_taint(TAINT_MACHINE_CHECK, LOCKDEP_NOW_UNRELIABLE); - ist_exit(regs); + nmi_exit(); } +NOKPROBE_SYMBOL(winchip_machine_check); /* Set up machine check reporting on the Winchip C6 series */ void winchip_mcheck_init(struct cpuinfo_x86 *c) diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c index 9e6f822922a3..acc3a847c610 100644 --- a/arch/x86/kernel/traps.c +++ b/arch/x86/kernel/traps.c @@ -83,78 +83,6 @@ static inline void cond_local_irq_disable(struct pt_regs *regs) local_irq_disable(); } -/* - * In IST context, we explicitly disable preemption. This serves two - * purposes: it makes it much less likely that we would accidentally - * schedule in IST context and it will force a warning if we somehow - * manage to schedule by accident. - */ -void ist_enter(struct pt_regs *regs) -{ - if (user_mode(regs)) { - RCU_LOCKDEP_WARN(!rcu_is_watching(), "entry code didn't wake RCU"); - } else { - /* - * We might have interrupted pretty much anything. In - * fact, if we're a machine check, we can even interrupt - * NMI processing. We don't want in_nmi() to return true, - * but we need to notify RCU. - */ - rcu_nmi_enter(); - } - - preempt_disable(); - - /* This code is a bit fragile. Test it. 
*/ - RCU_LOCKDEP_WARN(!rcu_is_watching(), "ist_enter didn't work"); -} -NOKPROBE_SYMBOL(ist_enter); - -void ist_exit(struct pt_regs *regs) -{ - preempt_enable_no_resched(); - - if (!user_mode(regs)) - rcu_nmi_exit(); -} - -/** - * ist_begin_non_atomic() - begin a non-atomic section in an IST exception - * @regs: regs passed to the IST exception handler - * - * IST exception handlers normally cannot schedule. As a special - * exception, if the exception interrupted userspace code (i.e. - * user_mode(regs) would return true) and the exception was not - * a double fault, it can be safe to schedule. ist_begin_non_atomic() - * begins a non-atomic section within an ist_enter()/ist_exit() region. - * Callers are responsible for enabling interrupts themselves inside - * the non-atomic section, and callers must call ist_end_non_atomic() - * before ist_exit(). - */ -void ist_begin_non_atomic(struct pt_regs *regs) -{ - BUG_ON(!user_mode(regs)); - - /* - * Sanity check: we need to be on the normal thread stack. This - * will catch asm bugs and any attempt to use ist_preempt_enable - * from double_fault. - */ - BUG_ON(!on_thread_stack()); - - preempt_enable_no_resched(); -} - -/** - * ist_end_non_atomic() - begin a non-atomic section in an IST exception - * - * Ends a non-atomic section started with ist_begin_non_atomic(). - */ -void ist_end_non_atomic(void) -{ - preempt_disable(); -} - int is_valid_bugaddr(unsigned long addr) { unsigned short ud; @@ -345,7 +273,7 @@ dotraplinkage void do_double_fault(struct pt_regs *regs, long error_code, unsign * The net result is that our #GP handler will think that we * entered from usermode with the bad user context. * - * No need for ist_enter here because we don't use RCU. + * No need for nmi_enter() here because we don't use RCU. 
 */
	if (((long)regs->sp >> P4D_SHIFT) == ESPFIX_PGD_ENTRY &&
	    regs->cs == __KERNEL_CS &&
@@ -380,7 +308,7 @@ dotraplinkage void do_double_fault(struct pt_regs *regs, long error_code, unsign
 	}
 #endif

-	ist_enter(regs);
+	nmi_enter();
 	notify_die(DIE_TRAP, str, regs, error_code, X86_TRAP_DF, SIGSEGV);

 	tsk->thread.error_code	= error_code;
@@ -432,6 +360,7 @@ dotraplinkage void do_double_fault(struct pt_regs *regs, long error_code, unsign
 	die("double fault", regs, error_code);
 	panic("Machine halted.");
 }
+NOKPROBE_SYMBOL(do_double_fault)
 #endif

 dotraplinkage void do_bounds(struct pt_regs *regs, long error_code)
@@ -645,14 +574,7 @@ dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code)
 	if (poke_int3_handler(regs))
 		return;

-	/*
-	 * Use ist_enter despite the fact that we don't use an IST stack.
-	 * We can be called from a kprobe in non-CONTEXT_KERNEL kernel
-	 * mode or even during context tracking state changes.
-	 *
-	 * This means that we can't schedule.  That's okay.
-	 */
-	ist_enter(regs);
+	nmi_enter();
 	RCU_LOCKDEP_WARN(!rcu_is_watching(), "entry code didn't wake RCU");
 #ifdef CONFIG_KGDB_LOW_LEVEL_TRAP
 	if (kgdb_ll_trap(DIE_INT3, "int3", regs, error_code, X86_TRAP_BP,
@@ -674,7 +596,7 @@ dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code)
 	cond_local_irq_disable(regs);
 exit:
-	ist_exit(regs);
+	nmi_exit();
 }
 NOKPROBE_SYMBOL(do_int3);

@@ -771,14 +693,14 @@ static bool is_sysenter_singlestep(struct pt_regs *regs)
 *
 * May run on IST stack.
 */
-dotraplinkage void do_debug(struct pt_regs *regs, long error_code)
+dotraplinkage notrace void do_debug(struct pt_regs *regs, long error_code)
 {
 	struct task_struct *tsk = current;
 	int user_icebp = 0;
 	unsigned long dr6;
 	int si_code;

-	ist_enter(regs);
+	nmi_enter();

 	get_debugreg(dr6, 6);

 	/*
@@ -871,7 +793,7 @@ dotraplinkage void do_debug(struct pt_regs *regs, long error_code)
 	debug_stack_usage_dec();

 exit:
-	ist_exit(regs);
+	nmi_exit();
 }
 NOKPROBE_SYMBOL(do_debug);

diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h
index da0af631ded5..146332764673 100644
--- a/include/linux/hardirq.h
+++ b/include/linux/hardirq.h
@@ -71,7 +71,7 @@ extern void irq_exit(void);
 		printk_nmi_enter();			\
 		lockdep_off();				\
 		ftrace_nmi_enter();			\
-		BUG_ON(in_nmi());			\
+		BUG_ON(in_nmi() == 0xf);		\
 		preempt_count_add(NMI_OFFSET + HARDIRQ_OFFSET);	\
 		rcu_nmi_enter();			\
 		trace_hardirq_enter();			\
diff --git a/include/linux/preempt.h b/include/linux/preempt.h
index bbb68dba37cc..9b26069531de 100644
--- a/include/linux/preempt.h
+++ b/include/linux/preempt.h
@@ -26,13 +26,13 @@
 *  PREEMPT_MASK:	0x000000ff
 *  SOFTIRQ_MASK:	0x0000ff00
 *  HARDIRQ_MASK:	0x000f0000
- *  NMI_MASK:		0x00100000
+ *  NMI_MASK:		0x00f00000
 *  PREEMPT_NEED_RESCHED:	0x80000000
 */
 #define PREEMPT_BITS	8
 #define SOFTIRQ_BITS	8
 #define HARDIRQ_BITS	4
-#define NMI_BITS	1
+#define NMI_BITS	4

 #define PREEMPT_SHIFT	0
 #define SOFTIRQ_SHIFT	(PREEMPT_SHIFT + PREEMPT_BITS)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index b49901f90ef5..ad2aed16b74b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1285,6 +1285,12 @@ struct task_struct {
 	unsigned long			prev_lowest_stack;
 #endif

+#ifdef CONFIG_MEMORY_FAILURE
+	u64				mce_addr;
+	u64				mce_status;
+	struct callback_head		mce_kill_me;
+#endif
+
 	/*
 	 * New fields for task_struct should be added above here, so that
 	 * they are included in the randomized portion of task_struct.
diff --git a/kernel/printk/printk_safe.c b/kernel/printk/printk_safe.c
index b4045e782743..a029393a8541 100644
--- a/kernel/printk/printk_safe.c
+++ b/kernel/printk/printk_safe.c
@@ -296,12 +296,14 @@ static __printf(1, 0) int vprintk_nmi(const char *fmt, va_list args)

 void notrace printk_nmi_enter(void)
 {
-	this_cpu_or(printk_context, PRINTK_NMI_CONTEXT_MASK);
+	if (!in_nmi())
+		this_cpu_or(printk_context, PRINTK_NMI_CONTEXT_MASK);
 }

 void notrace printk_nmi_exit(void)
 {
-	this_cpu_and(printk_context, ~PRINTK_NMI_CONTEXT_MASK);
+	if (!in_nmi())
+		this_cpu_and(printk_context, ~PRINTK_NMI_CONTEXT_MASK);
 }

 /*

^ permalink raw reply related	[flat|nested] 28+ messages in thread
* Re: [RFC] #MC mess
  2020-02-18 21:49             ` Peter Zijlstra
@ 2020-02-18 21:53               ` Peter Zijlstra
  2020-02-18 22:41                 ` Frederic Weisbecker
  2020-02-18 22:40               ` Frederic Weisbecker
  1 sibling, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2020-02-18 21:53 UTC (permalink / raw)
To: Luck, Tony
Cc: Borislav Petkov, Steven Rostedt, Andy Lutomirski, x86-ml, lkml,
	paulmck

On Tue, Feb 18, 2020 at 10:49:04PM +0100, Peter Zijlstra wrote:
> diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h
> index da0af631ded5..146332764673 100644
> --- a/include/linux/hardirq.h
> +++ b/include/linux/hardirq.h
> @@ -71,7 +71,7 @@ extern void irq_exit(void);
>  		printk_nmi_enter();			\
>  		lockdep_off();				\
>  		ftrace_nmi_enter();			\
> -		BUG_ON(in_nmi());			\
> +		BUG_ON(in_nmi() == 0xf);		\

That wants to be:

		BUG_ON(in_nmi() == NMI_MASK);		\

> 		preempt_count_add(NMI_OFFSET + HARDIRQ_OFFSET);	\
> 		rcu_nmi_enter();			\
> 		trace_hardirq_enter();			\

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: [RFC] #MC mess
  2020-02-18 21:53               ` Peter Zijlstra
@ 2020-02-18 22:41                 ` Frederic Weisbecker
  0 siblings, 0 replies; 28+ messages in thread
From: Frederic Weisbecker @ 2020-02-18 22:41 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Luck, Tony, Borislav Petkov, Steven Rostedt, Andy Lutomirski,
	x86-ml, lkml, paulmck

On Tue, Feb 18, 2020 at 10:53:25PM +0100, Peter Zijlstra wrote:
> On Tue, Feb 18, 2020 at 10:49:04PM +0100, Peter Zijlstra wrote:
> > diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h
> > index da0af631ded5..146332764673 100644
> > --- a/include/linux/hardirq.h
> > +++ b/include/linux/hardirq.h
> > @@ -71,7 +71,7 @@ extern void irq_exit(void);
> >  		printk_nmi_enter();			\
> >  		lockdep_off();				\
> >  		ftrace_nmi_enter();			\
> > -		BUG_ON(in_nmi());			\
> > +		BUG_ON(in_nmi() == 0xf);		\
>
> That wants to be:
>
> 	BUG_ON(in_nmi() == NMI_MASK);		\

Ah that's the email I didn't read... Sorry for the noise.

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: [RFC] #MC mess
  2020-02-18 21:49             ` Peter Zijlstra
  2020-02-18 21:53               ` Peter Zijlstra
@ 2020-02-18 22:40               ` Frederic Weisbecker
  1 sibling, 0 replies; 28+ messages in thread
From: Frederic Weisbecker @ 2020-02-18 22:40 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Luck, Tony, Borislav Petkov, Steven Rostedt, Andy Lutomirski,
	x86-ml, lkml, paulmck

On Tue, Feb 18, 2020 at 10:49:04PM +0100, Peter Zijlstra wrote:
> On Tue, Feb 18, 2020 at 09:34:04PM +0100, Peter Zijlstra wrote:
> diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h
> index da0af631ded5..146332764673 100644
> --- a/include/linux/hardirq.h
> +++ b/include/linux/hardirq.h
> @@ -71,7 +71,7 @@ extern void irq_exit(void);
>  		printk_nmi_enter();			\
>  		lockdep_off();				\
>  		ftrace_nmi_enter();			\
> -		BUG_ON(in_nmi());			\
> +		BUG_ON(in_nmi() == 0xf);		\

I guess you meant:

	BUG_ON(in_nmi() == NMI_MASK);

Thanks.

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: [RFC] #MC mess
  2020-02-18 20:02       ` Peter Zijlstra
  2020-02-18 20:09         ` Borislav Petkov
  2020-02-18 20:11         ` Luck, Tony
@ 2020-02-18 23:17         ` Andy Lutomirski
  2 siblings, 0 replies; 28+ messages in thread
From: Andy Lutomirski @ 2020-02-18 23:17 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Luck, Tony, Borislav Petkov, Steven Rostedt, Andy Lutomirski,
	x86-ml, lkml

> On Feb 18, 2020, at 12:02 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, Feb 18, 2020 at 06:20:38PM +0000, Luck, Tony wrote:
>>> Anything else I'm missing? It is likely...
>>
>> +	hw_breakpoint_disable();
>> +	static_key_disable(&__tracepoint_read_msr.key);
>> +	tracing_off();
>> +
>> 	ist_enter(regs);
>>
>> How about some code to turn all those back on for a recoverable (where
>> we actually recovered) #MC?
>
> Then please rewrite the #MC entry code to deal with nested exceptions
> unmasking the MCE, very similr to NMI.
>
> The moment you allow tracing, jump_labels or anything else you can
> expect #PF, #BP and probably #DB while inside #MC, those will then IRET
> and re-enable the #MC.

Huh?  As I understand it, there is no such thing as MCE masking.  There
are two states:

CR4.MCE=1: MCE is delivered when it occurs.

CR4.MCE=0: MCE causes shutdown.

MC delivery sets MCE=0.

So, basically, without LMCE, we are irredeemably screwed.  With LMCE, we
are still hosed if we nest an MCE inside a recoverable MCE.  We can play
some games to make the OOPS more reliable, but we are still mostly
screwed.

The x86 MCE architecture sucks.

> The current situation is completely and utterly buggered.

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: [RFC] #MC mess
  2020-02-18 18:20 ` Luck, Tony
  2020-02-18 19:51   ` Borislav Petkov
  2020-02-18 20:02   ` Peter Zijlstra
@ 2020-02-18 23:10   ` Andy Lutomirski
  2020-02-18 23:17     ` Steven Rostedt
  2 siblings, 1 reply; 28+ messages in thread
From: Andy Lutomirski @ 2020-02-18 23:10 UTC (permalink / raw)
To: Luck, Tony
Cc: Borislav Petkov, Peter Zijlstra, Steven Rostedt, Andy Lutomirski,
	x86-ml, lkml

> On Feb 18, 2020, at 10:20 AM, Luck, Tony <tony.luck@intel.com> wrote:
>
>> Anything else I'm missing? It is likely...
>
> +	hw_breakpoint_disable();
> +	static_key_disable(&__tracepoint_read_msr.key);
> +	tracing_off();
> +
> 	ist_enter(regs);
>
> How about some code to turn all those back on for a recoverable (where
> we actually recovered) #MC?

At the very least, in the user_mode(regs) case, tracing is fine.

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: [RFC] #MC mess
  2020-02-18 23:10   ` Andy Lutomirski
@ 2020-02-18 23:17     ` Steven Rostedt
  0 siblings, 0 replies; 28+ messages in thread
From: Steven Rostedt @ 2020-02-18 23:17 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Luck, Tony, Borislav Petkov, Peter Zijlstra, Andy Lutomirski,
	x86-ml, lkml

On Tue, 18 Feb 2020 15:10:17 -0800
Andy Lutomirski <luto@amacapital.net> wrote:

> > On Feb 18, 2020, at 10:20 AM, Luck, Tony <tony.luck@intel.com> wrote:
> >
> >> Anything else I'm missing? It is likely...
> >
> > +	hw_breakpoint_disable();
> > +	static_key_disable(&__tracepoint_read_msr.key);
> > +	tracing_off();
> > +
> > 	ist_enter(regs);
> >
> > How about some code to turn all those back on for a recoverable (where
> > we actually recovered) #MC?
>
> At the very least, in the user_mode(regs) case, tracing is fine.

Also, I don't think "tracing_off()" is what is wanted here. That just
disables writing to the ring buffer, and it can be called in pretty
much any context (if it's before in_nmi() gets set, the worst thing
that happens is that events will get dropped due to the recursion
protection that checks to make sure there are no re-entrant events at
the same level of context).

The only issue with having function tracing enabled is that it may add
a breakpoint when it gets turned on or off. And tracing_off() doesn't
prevent that. Tracepoints still use RCU of some kind, and the
protection there has nothing to do with whether a trace point does
recording or not.

-- Steve

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: [RFC] #MC mess
  2020-02-18 17:31 [RFC] #MC mess Borislav Petkov
  2020-02-18 18:11 ` Steven Rostedt
  2020-02-18 18:20 ` Luck, Tony
@ 2020-02-19  0:15 ` Andy Lutomirski
  2020-02-19  8:15   ` Peter Zijlstra
  2 siblings, 1 reply; 28+ messages in thread
From: Andy Lutomirski @ 2020-02-19  0:15 UTC (permalink / raw)
To: Borislav Petkov
Cc: Peter Zijlstra, Steven Rostedt, Andy Lutomirski, Tony Luck,
	x86-ml, lkml

On Tue, Feb 18, 2020 at 9:31 AM Borislav Petkov <bp@alien8.de> wrote:
>
> Ok,
>
> so Peter raised this question on IRC today, that the #MC handler needs
> to disable all kinds of tracing/kprobing and etc exceptions happening
> while handling an #MC. And I guess we can talk about supporting some
> exceptions but #MC is usually nasty enough to not care about tracing
> when former happens.

It's worth noting that MCE is utterly, terminally screwed under high
load.  In particular:

Step 1: NMI (due to perf).

immediately thereafter (before any of the entry asm runs)

Step 2: MCE (due to recoverable memory failure or remote CPU MCE)

Step 3: MCE does its thing and does IRET

Step 4: NMI

We are toast.

Tony, etc, can you ask your Intel contacts who care about this kind of
thing to stop twiddling their thumbs and FIX IT?  The easy fix is
utterly trivial.  Add a new instruction IRET_NON_NMI.  It does
*exactly* the same thing as IRET except that it does not unmask NMIs.
(It also doesn't unmask NMIs if it faults.)  No fancy design work.
Future improvements can still happen on top of this.

(One other improvement that may or may not have happened: the CPU
should be configurable so that it never even sends #MC unless it
literally cannot continue executing without OS help.  No remote MCE
triggering #MC, no notifications of corrected errors, no nothing.  If
the CPU *cannot* continue execution in its current context, for
example because a load could not be satisfied, send #MC.  If a cache
line cannot be written back, then *which* CPU should get the MCE is an
interesting question.)

If Intel cares about memory failure recovery, then this design problem
needs to be fixed.  Without a fix, we're just duct taping little holes
and ignoring the giant gaping hole in front of our faces.

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: [RFC] #MC mess
  2020-02-19  0:15 ` Andy Lutomirski
@ 2020-02-19  8:15   ` Peter Zijlstra
  2020-02-19 14:21     ` Steven Rostedt
  0 siblings, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2020-02-19  8:15 UTC (permalink / raw)
To: Andy Lutomirski; +Cc: Borislav Petkov, Steven Rostedt, Tony Luck, x86-ml, lkml

On Tue, Feb 18, 2020 at 04:15:57PM -0800, Andy Lutomirski wrote:
> On Tue, Feb 18, 2020 at 9:31 AM Borislav Petkov <bp@alien8.de> wrote:
> >
> > Ok,
> >
> > so Peter raised this question on IRC today, that the #MC handler needs
> > to disable all kinds of tracing/kprobing and etc exceptions happening
> > while handling an #MC. And I guess we can talk about supporting some
> > exceptions but #MC is usually nasty enough to not care about tracing
> > when former happens.
>
> It's worth noting that MCE is utterly, terminally screwed under high
> load.  In particular:
>
> Step 1: NMI (due to perf).
>
> immediately thereafter (before any of the entry asm runs)
>
> Step 2: MCE (due to recoverable memory failure or remote CPU MCE)
>
> Step 3: MCE does its thing and does IRET
>
> Step 4: NMI
>
> We are toast.
>
> Tony, etc, can you ask your Intel contacts who care about this kind of
> thing to stop twiddling their thumbs and FIX IT?  The easy fix is
> utterly trivial.  Add a new instruction IRET_NON_NMI.  It does
> *exactly* the same thing as IRET except that it does not unmask NMIs.
> (It also doesn't unmask NMIs if it faults.)  No fancy design work.
> Future improvements can still happen on top of this.

Yes please! Of course, we're stuck with the existing NMI entry crap
forever because legacy, but it would make all things NMI so much saner.

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: [RFC] #MC mess
  2020-02-19  8:15   ` Peter Zijlstra
@ 2020-02-19 14:21     ` Steven Rostedt
  2020-02-19 14:43       ` Borislav Petkov
  2020-02-19 15:05       ` Peter Zijlstra
  0 siblings, 2 replies; 28+ messages in thread
From: Steven Rostedt @ 2020-02-19 14:21 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Andy Lutomirski, Borislav Petkov, Tony Luck, x86-ml, lkml

On Wed, 19 Feb 2020 09:15:41 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> > Tony, etc, can you ask your Intel contacts who care about this kind of
> > thing to stop twiddling their thumbs and FIX IT?  The easy fix is
> > utterly trivial.  Add a new instruction IRET_NON_NMI.  It does
> > *exactly* the same thing as IRET except that it does not unmask NMIs.
> > (It also doesn't unmask NMIs if it faults.)  No fancy design work.
> > Future improvements can still happen on top of this.
>
> Yes please! Of course, we're stuck with the existing NMI entry crap
> forever because legacy, but it would make all things NMI so much saner.

What would be nice is to have a NMI_IRET that is defined as something
that won't break legacy CPUs. Where it could be just a nop iret, or
maybe if possible a "lock iret"? That is, not have an IRET_NON_NMI, as
that would be all over the place, but just the iret for NMI itself. As
that's in one place.

-- Steve

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: [RFC] #MC mess
  2020-02-19 14:21     ` Steven Rostedt
@ 2020-02-19 14:43       ` Borislav Petkov
  0 siblings, 0 replies; 28+ messages in thread
From: Borislav Petkov @ 2020-02-19 14:43 UTC (permalink / raw)
To: Steven Rostedt; +Cc: Peter Zijlstra, Andy Lutomirski, Tony Luck, x86-ml, lkml

On Wed, Feb 19, 2020 at 09:21:15AM -0500, Steven Rostedt wrote:
> What would be nice is to have a NMI_IRET, that is defined as something
> that wont break legacy CPUs. Where it could be just a nop iret, or maybe
> if possible a "lock iret"? That is, not have a IRET_NON_NMI, as that
> would be all over the place, but just the iret for NMI itself. As
> that's in one place.

You mean, we could keep the nested NMI thing but it won't practically
get executed on NMI_IRET CPUs because there won't be any NMI nesting
anymore?

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: [RFC] #MC mess
  2020-02-19 14:21     ` Steven Rostedt
  2020-02-19 14:43       ` Borislav Petkov
@ 2020-02-19 15:05       ` Peter Zijlstra
  2020-02-19 15:20         ` Andy Lutomirski
  1 sibling, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2020-02-19 15:05 UTC (permalink / raw)
To: Steven Rostedt; +Cc: Andy Lutomirski, Borislav Petkov, Tony Luck, x86-ml, lkml

On Wed, Feb 19, 2020 at 09:21:15AM -0500, Steven Rostedt wrote:
> On Wed, 19 Feb 2020 09:15:41 +0100
> Peter Zijlstra <peterz@infradead.org> wrote:
>
> > > Tony, etc, can you ask your Intel contacts who care about this kind of
> > > thing to stop twiddling their thumbs and FIX IT?  The easy fix is
> > > utterly trivial.  Add a new instruction IRET_NON_NMI.  It does
> > > *exactly* the same thing as IRET except that it does not unmask NMIs.
> > > (It also doesn't unmask NMIs if it faults.)  No fancy design work.
> > > Future improvements can still happen on top of this.
> >
> > Yes please! Of course, we're stuck with the existing NMI entry crap
> > forever because legacy, but it would make all things NMI so much saner.
>
> What would be nice is to have a NMI_IRET, that is defined as something
> that wont break legacy CPUs. Where it could be just a nop iret, or maybe
> if possible a "lock iret"? That is, not have a IRET_NON_NMI, as that
> would be all over the place, but just the iret for NMI itself. As
> that's in one place.

I don't think that matters much; alternatives should be able to deal
with all that either which way around.

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: [RFC] #MC mess
  2020-02-19 15:05       ` Peter Zijlstra
@ 2020-02-19 15:20         ` Andy Lutomirski
  0 siblings, 0 replies; 28+ messages in thread
From: Andy Lutomirski @ 2020-02-19 15:20 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Steven Rostedt, Andy Lutomirski, Borislav Petkov, Tony Luck,
	x86-ml, lkml

> On Feb 19, 2020, at 7:05 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, Feb 19, 2020 at 09:21:15AM -0500, Steven Rostedt wrote:
>> On Wed, 19 Feb 2020 09:15:41 +0100
>> Peter Zijlstra <peterz@infradead.org> wrote:
>>
>>>>> Tony, etc, can you ask your Intel contacts who care about this kind of
>>>>> thing to stop twiddling their thumbs and FIX IT?  The easy fix is
>>>>> utterly trivial.  Add a new instruction IRET_NON_NMI.  It does
>>>>> *exactly* the same thing as IRET except that it does not unmask NMIs.
>>>>> (It also doesn't unmask NMIs if it faults.)  No fancy design work.
>>>>> Future improvements can still happen on top of this.
>>>
>>> Yes please! Of course, we're stuck with the existing NMI entry crap
>>> forever because legacy, but it would make all things NMI so much saner.
>>
>> What would be nice is to have a NMI_IRET, that is defined as something
>> that wont break legacy CPUs. Where it could be just a nop iret, or maybe
>> if possible a "lock iret"? That is, not have a IRET_NON_NMI, as that
>> would be all over the place, but just the iret for NMI itself. As
>> that's in one place.
>
> I don't think that matters much; alternatives should be able to deal
> with all that either which way around.

Agreed.

That being said, kernels without alternatives could prefer the variant
where a CR4 bit makes regular IRET leave NMIs masked and a new IRET
instruction (or LOCK IRET, I suppose) unmasks them.

^ permalink raw reply	[flat|nested] 28+ messages in thread
end of thread, other threads:[~2020-02-19 15:20 UTC | newest]

Thread overview: 28+ messages

2020-02-18 17:31 [RFC] #MC mess Borislav Petkov
2020-02-18 18:11 ` Steven Rostedt
2020-02-18 19:50   ` Borislav Petkov
2020-02-18 20:08     ` Steven Rostedt
2020-02-18 20:30       ` Peter Zijlstra
2020-02-18 20:52         ` Borislav Petkov
2020-02-18 18:20 ` Luck, Tony
2020-02-18 19:51   ` Borislav Petkov
2020-02-18 19:54     ` Luck, Tony
2020-02-18 20:00       ` Borislav Petkov
2020-02-18 20:05         ` Luck, Tony
2020-02-18 20:02   ` Peter Zijlstra
2020-02-18 20:09     ` Borislav Petkov
2020-02-18 20:11     ` Luck, Tony
2020-02-18 20:34       ` Peter Zijlstra
2020-02-18 21:49         ` Peter Zijlstra
2020-02-18 21:53           ` Peter Zijlstra
2020-02-18 22:41             ` Frederic Weisbecker
2020-02-18 22:40           ` Frederic Weisbecker
2020-02-18 23:17     ` Andy Lutomirski
2020-02-18 23:10   ` Andy Lutomirski
2020-02-18 23:17     ` Steven Rostedt
2020-02-19  0:15 ` Andy Lutomirski
2020-02-19  8:15   ` Peter Zijlstra
2020-02-19 14:21     ` Steven Rostedt
2020-02-19 14:43       ` Borislav Petkov
2020-02-19 15:05       ` Peter Zijlstra
2020-02-19 15:20         ` Andy Lutomirski