Dealing with the NMI mess

* Dealing with the NMI mess
@ 2015-07-23 20:21 Andy Lutomirski
  2015-07-23 20:38 ` Linus Torvalds
                   ` (3 more replies)
  0 siblings, 4 replies; 85+ messages in thread
From: Andy Lutomirski @ 2015-07-23 20:21 UTC (permalink / raw)
  To: X86 ML, linux-kernel, Willy Tarreau, Borislav Petkov,
	Thomas Gleixner, Peter Zijlstra, Linus Torvalds, Steven Rostedt,
	Brian Gerst

[moved to a new thread, cc list trimmed]

Hi all-

We've considered two approaches to dealing with NMIs:

1. Allow nesting.  We know quite well how messy that is.

2. Forbid IRET inside NMIs.  Doable but maybe not that pretty.

We haven't considered:

3. Forbid faults (other than MCE) inside NMI.

Option 3 is almost easy.  There are really only two kinds of faults
that can legitimately nest inside NMI: #PF and #DB.  #DB is easy to
fix (e.g. with my patches or Peter's patches).

What if we went all out and forbade page faults in NMI as well.  There
are two reasons that I can think of that we might page fault inside an
NMI:

a) vmalloc fault.  I think Ingo already half-implemented a rework to
eliminate vmalloc faults entirely.

b) User memory access faults.

The reason we access user state in general from an NMI is to allow
perf to capture enough user stack data to let the tooling backtrace
back to user space.  What if we did it differently?  Instead of
capturing this data in NMI context, capture it in
prepare_exit_to_usermode.  That would let us capture user state
*correctly*, which we currently can't really do.  There's a
never-ending series of minor bugs in which we try to guess the user
register state from NMI context, and it sort of works.  In
prepare_exit_to_usermode, we really truly know the user state.
There's a race where an NMI hits during or after
prepare_exit_to_usermode, but maybe that's okay -- just admit defeat
in that case and don't show the user state.  (Realistically, without
CFI data, we're not going to be guaranteed to get the right state
anyway.)

To make this work, we'd have to teach NMI-from-userspace to call the
callback itself.  It would look like:

prepare_exit_to_usermode() {
  ...
  while (blah blah blah) {
    if (cached_flags & TIF_PERF_CAPTURE_USER_STATE)
      perf_capture_user_state();
    ...
  }
  ...
}

and then, on NMI exit, we'd call perf_capture_user_state directly,
since we don't want to enable IRQs or do opportunsitic sysret on exit
from NMI.  (Why not?  Because NMIs are still masked, and we don't want
to pay for double-IRET to unmask them, so we really want to leave IRQs
off and IRET straight back to user mode.)

There's an unavoidable race in which we enter user mode with
TIF_PERF_CAPTURE_USER_STATE still set.  In principle, we could
IPI-to-self from the NMI handler to cover that case (mostly -- we
capture the wrong state if we're on our way to an IRET fault), or we
could just check on entry if the flag is still set and, if so, admit
defeat.

Peter, can this be done without breaking the perf ABI?  If we were
designing all of this stuff from scratch right now, I'd suggest doing
it this way, but I'm not sure whether it makes sense to try to
retrofit it in.

If we decide to stick with option 2, then I've now convinced myself
that banning all kernel breakpoints and watchpoints during NMI
processing is probably for the best.  Maybe we should go one step
farther and ban all DR7 breakpoints period.  Sure, it will slow down
perf if there are user breakpoints or watchpoints set, but, having
looked at the asm, returning from #DB using RET is, while doable,
distinctly ugly.

--Andy

^ permalink raw reply	[flat|nested] 85+ messages in thread