From: Andy Lutomirski <luto@amacapital.net>
To: "Luck, Tony" <tony.luck@intel.com>
Cc: Andy Lutomirski <luto@kernel.org>,
Peter Zijlstra <peterz@infradead.org>,
Borislav Petkov <bp@alien8.de>,
LKML <linux-kernel@vger.kernel.org>,
linux-arch <linux-arch@vger.kernel.org>,
Steven Rostedt <rostedt@goodmis.org>,
Ingo Molnar <mingo@kernel.org>,
Joel Fernandes <joel@joelfernandes.org>,
Greg KH <gregkh@linuxfoundation.org>,
"gustavo@embeddedor.com" <gustavo@embeddedor.com>,
Thomas Gleixner <tglx@linutronix.de>,
"paulmck@kernel.org" <paulmck@kernel.org>,
Josh Triplett <josh@joshtriplett.org>,
Mathieu Desnoyers <mathieu.desnoyers@efficios.com>,
Lai Jiangshan <jiangshanlai@gmail.com>,
Frederic Weisbecker <frederic@kernel.org>,
Dan Carpenter <dan.carpenter@oracle.com>,
Masami Hiramatsu <mhiramat@kernel.org>
Subject: Re: [PATCH v3 02/22] x86,mce: Delete ist_begin_non_atomic()
Date: Wed, 19 Feb 2020 14:48:34 -0800
Message-ID: <772ACE2A-FD8B-492F-960E-981ECC72E283@amacapital.net>
In-Reply-To: <3908561D78D1C84285E8C5FCA982C28F7F57E302@ORSMSX115.amr.corp.intel.com>
> On Feb 19, 2020, at 2:33 PM, Luck, Tony <tony.luck@intel.com> wrote:
>
>
>>
>> One big question here: are memory failure #MC exceptions synchronous
>> or can they be delayed? If we get a memory failure, is it possible
>> that the #MC hits some random context and not the actual context where
>> the error occurred?
>
> There are a few cases:
> 1) SRAO (Software recoverable action optional) [Patrol scrub or L3 cache eviction]
> These aren't synchronous with any core execution. Using machine check to signal
> was probably a mistake - compounded by it being broadcast :-( Could pick any CPU
> to handle (actually choose the first to arrive in do_machine_check()). That guy should
> arrange to soft offline the affected page. Every CPU can return to what they were doing
> before.
You could handle this by sending IPI-to-self and dealing with it in the interrupt handler. Or even wake a high-priority kthread or workqueue. irq_work may help. Relying on task_work or the non_atomic stuff seems silly - you can’t rely on anything about the interrupted context, and the context is more or less irrelevant anyway.
>
> 2) SRAR (Software recoverable action required)
> These are synchronous. Starting with Skylake they may be signaled just to the thread
> that hit the poison. Earlier generations broadcast.
Here’s where dealing with one that came from kernel code is just nasty, right?

I would argue that, if IF=0, killing the machine is reasonable. If IF=1, we should be okay. Actually making this work sanely is gross, and arguably the goal should be minimizing grossness.

Perhaps, if we came from kernel mode, we should IPI-to-self and use a special vector that is idtentry, not apicinterrupt. Or maybe even do this for entries from usermode just to keep everything consistent.
> 2a) Hit in ring3 code ... we want to offline the page and SIGBUS the task(s)
> 2b) memcpy_mcsafe() ... kernel has a recovery path. "Return" to the recovery code instead of to the original RIP.
> 2c) copy_from_user ... not implemented yet. We are in kernel, but would like to treat this like case 2a
>
> 3) Fatal
> Always broadcast. Some bank has MCi_STATUS.PCC==1. System must be shutdown.
Easy :)

It would be really, really nice if NMI was masked in MCE context.
>
> -Tony
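
To make the case split above concrete, here is a rough userspace sketch of the policy as discussed in this thread. It is not kernel code and not a proposed patch: every identifier in it (mce_kind, mce_action, mce_ctx, mce_pick_action) is hypothetical and chosen only for illustration. It assumes that SRAO is always deferred out of #MC context, that SRAR from user mode means offline the page and SIGBUS, that SRAR from kernel mode is only recoverable when IF=1, and that anything fatal panics. It deliberately ignores the difference between the 2b fixup path and the not-yet-implemented 2c case.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names throughout; none of these are kernel identifiers. */
enum mce_kind {
	MCE_SRAO,	/* case 1: asynchronous, action optional */
	MCE_SRAR,	/* case 2: synchronous, action required */
	MCE_FATAL,	/* case 3: some bank has PCC set */
};

enum mce_action {
	ACT_DEFER_SOFT_OFFLINE,	/* punt to irq_work/kthread, soft offline the page */
	ACT_OFFLINE_AND_SIGBUS,	/* case 2a: offline the page, SIGBUS the task(s) */
	ACT_KERNEL_RECOVERY,	/* kernel hit poison with IF=1: try to recover */
	ACT_PANIC,		/* nothing sane to do */
	ACT_COUNT
};

struct mce_ctx {
	enum mce_kind kind;
	bool from_user;		/* #MC interrupted ring 3 */
	bool irqs_enabled;	/* IF=1 in the interrupted context */
};

/* Pick an action from the error kind and the interrupted context. */
static enum mce_action mce_pick_action(const struct mce_ctx *ctx)
{
	switch (ctx->kind) {
	case MCE_SRAO:
		/* The interrupted context is irrelevant for SRAO. */
		return ACT_DEFER_SOFT_OFFLINE;
	case MCE_SRAR:
		if (ctx->from_user)
			return ACT_OFFLINE_AND_SIGBUS;
		/* Assumption from this thread: IF=0 in the kernel is fatal. */
		return ctx->irqs_enabled ? ACT_KERNEL_RECOVERY : ACT_PANIC;
	case MCE_FATAL:
		return ACT_PANIC;
	}
	return ACT_PANIC;	/* unknown kind: be conservative */
}

/* Human-readable name, handy when logging the decision. */
static const char *mce_action_name(enum mce_action a)
{
	static const char *const names[ACT_COUNT] = {
		"defer-soft-offline", "offline-and-sigbus",
		"kernel-recovery", "panic",
	};

	assert(a >= 0 && a < ACT_COUNT);
	return names[a];
}
```

The point of keeping the decision in one pure function is that the #MC handler itself would only classify and then hand off, which is the "minimize grossness" argument above; how the hand-off is actually done (IPI-to-self, irq_work, a kthread) is left open here.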