linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Andy Lutomirski <luto@kernel.org>
To: Paolo Bonzini <pbonzini@redhat.com>, Tony Luck <tony.luck@intel.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>,
	Andy Lutomirski <luto@kernel.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	Sean Christopherson <sean.j.christopherson@intel.com>,
	Vivek Goyal <vgoyal@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	LKML <linux-kernel@vger.kernel.org>, X86 ML <x86@kernel.org>,
	kvm list <kvm@vger.kernel.org>, stable <stable@vger.kernel.org>
Subject: Re: [PATCH v2] x86/kvm: Disable KVM_ASYNC_PF_SEND_ALWAYS
Date: Thu, 9 Apr 2020 10:32:39 -0700	[thread overview]
Message-ID: <CALCETrUpWBKHHyfMoqD2ZT3CnDdguNnK=KoZiTmN5PnbnD_k0A@mail.gmail.com> (raw)
In-Reply-To: <931f6e6d-ac17-05f9-0605-ac8f89f40b2b@redhat.com>

Hi, Tony.  I'm adding you because, despite the fact that everyone in
this thread is allergic to #MC, this seems to be one of your favorite
topics :)

> On Apr 9, 2020, at 8:17 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 09/04/20 17:03, Andy Lutomirski wrote:
>>> No, I think we wouldn't use a paravirt #VE at this point, we would
>>> use the real thing if available.
>>>
>>> It would still be possible to switch from the IST to the main
>>> kernel stack before writing 0 to the reentrancy word.
>>
>> Almost but not quite. We do this for NMI-from-usermode, and it’s
>> ugly. But we can’t do this for NMI-from-kernel or #VE-from-kernel
>> because there might not be a kernel stack.  Trying to hack around
>> this won’t be pretty.
>>
>> Frankly, I think that we shouldn’t even try to report memory failure
>> to the guest if it happens with interrupts off. Just kill the guest
>> cleanly and keep it simple. Or inject an intentionally unrecoverable
>> IST exception.
>
> But it would be nice to use #VE for all host-side page faults, not just
> for memory failure.
>
> So the solution would be the same as for NMIs, duplicating the stack
> frame and patching the outer handler's stack from the recursive #VE
> (https://lwn.net/Articles/484932/).  It's ugly but it's a known ugliness.
>
>

Believe me, I know all about how ugly it is, since I’m the one who
fixed most of the bugs in the first few implementations.  And, before
I wrote or ack any such thing for #VE, I want to know why.  What,
exactly, is a sufficiently strong motivation for using #VE *at all*
that Linux should implement a #VE handler?

As I see it, #VE has several downsides:

1. Intel only.

2. Depending on precisely what it's used for, literally any memory
access in the kernel can trigger it as a fault.  This means that it
joins NMI and #MC (and, to a limited extent, #DB) in the horrible
super-atomic-happens-in-bad-contexts camp.  IST is mandatory, and IST
is not so great.

3. Just like NMI and MCE, it comes with a fundamentally broken
don't-recurse-me mechanism.

If we do support #VE, I would suggest we do it roughly like this.  The
#VE handler is a regular IST entry -- there's a single IST stack, and
#VE from userspace stack-switches to the regular kernel stack.  The C
handler (do_virtualization_exception?) is permitted to panic if
something is horribly wrong, but is *not* permitted to clear the word
at byte 4 to re-enable #VE.  Instead, it does something to trigger a
deferred re-enable.  For example, it sends IPI to self and the IPI
handler clears the magic re-enable flag.

There are definitely horrible corner cases here.  For example, suppose
user code mmaps() some kind of failable memory (due to NVDIMM hardware
failures, truncation, whatever).  Then user code points RBP at it and
we get a perf NMI.  Now the perf code tries to copy_from_user_nmi()
the user stack and hits the failure.  It gets #MC or #VE or some
paravirt thing.  Now we're in a situation where we got an IST
exception in the middle of NMI processing and we're expected to do
something intelligent about it.  Sure, we can invoke the extable
handler, but it's not really clear how to recover if we somehow hit
the same type of failure again in the same NMI.

A model that could actually work, perhaps for #VE and #MC, is to have
the extable code do the re-enabling.  So ex_handler_uaccess() and
ex_handler_mcsafe() will call something like rearm_memory_failure(),
and that will do whatever is needed to re-arm the specific memory
failure reporting mechanism in use.

But, before I touch that with a ten-foot pole, I want to know *why*.
What's the benefit?  Why do we want convertible EPT violations?

  reply	other threads:[~2020-04-09 17:32 UTC|newest]

Thread overview: 55+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-03-07  2:26 [PATCH v2] x86/kvm: Disable KVM_ASYNC_PF_SEND_ALWAYS Andy Lutomirski
2020-03-07 15:03 ` Andy Lutomirski
2020-03-07 15:47   ` Thomas Gleixner
2020-03-07 15:59     ` Andy Lutomirski
2020-03-07 19:01       ` Thomas Gleixner
2020-03-07 19:34         ` Andy Lutomirski
2020-03-08  7:23         ` Thomas Gleixner
2020-03-09  6:57           ` Thomas Gleixner
2020-03-09  8:40             ` Paolo Bonzini
2020-03-09  9:09               ` Thomas Gleixner
2020-03-09 18:14                 ` Andy Lutomirski
2020-03-09 19:05                   ` Thomas Gleixner
2020-03-09 20:22                     ` Peter Zijlstra
2020-04-06 19:09                       ` Vivek Goyal
2020-04-06 20:25                         ` Peter Zijlstra
2020-04-06 20:32                           ` Andy Lutomirski
2020-04-06 20:42                             ` Andy Lutomirski
2020-04-07 17:21                               ` Vivek Goyal
2020-04-07 17:38                                 ` Andy Lutomirski
2020-04-07 20:20                                   ` Thomas Gleixner
2020-04-07 21:41                                     ` Andy Lutomirski
2020-04-07 22:07                                       ` Paolo Bonzini
2020-04-07 22:29                                         ` Andy Lutomirski
2020-04-08  0:30                                           ` Paolo Bonzini
2020-05-21 15:55                                         ` Vivek Goyal
2020-04-07 22:48                                       ` Thomas Gleixner
2020-04-08  4:48                                         ` Andy Lutomirski
2020-04-08  9:32                                           ` Borislav Petkov
2020-04-08 10:12                                           ` Thomas Gleixner
2020-04-08 18:23                                           ` Vivek Goyal
2020-04-07 22:49                                       ` Vivek Goyal
2020-04-08 10:01                                         ` Borislav Petkov
2020-04-07 22:04                                     ` Paolo Bonzini
2020-04-07 23:21                                       ` Thomas Gleixner
2020-04-08  8:23                                         ` Paolo Bonzini
2020-04-08 13:01                                           ` Thomas Gleixner
2020-04-08 15:38                                             ` Peter Zijlstra
2020-04-08 16:41                                               ` Thomas Gleixner
2020-04-09  9:03                                             ` Paolo Bonzini
2020-04-08 15:34                                           ` Sean Christopherson
2020-04-08 16:50                                             ` Paolo Bonzini
2020-04-08 18:01                                               ` Thomas Gleixner
2020-04-08 20:34                                                 ` Vivek Goyal
2020-04-08 23:06                                                   ` Thomas Gleixner
2020-04-08 23:14                                                     ` Thomas Gleixner
2020-04-09  4:50                                                 ` Andy Lutomirski
2020-04-09  9:43                                                   ` Paolo Bonzini
2020-04-09 11:36                                                   ` Andrew Cooper
2020-04-09 12:47                                                   ` Paolo Bonzini
2020-04-09 14:13                                                     ` Andrew Cooper
2020-04-09 14:32                                                       ` Paolo Bonzini
2020-04-09 15:03                                                         ` Andy Lutomirski
2020-04-09 15:17                                                           ` Paolo Bonzini
2020-04-09 17:32                                                             ` Andy Lutomirski [this message]
2020-04-06 21:32                         ` Thomas Gleixner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CALCETrUpWBKHHyfMoqD2ZT3CnDdguNnK=KoZiTmN5PnbnD_k0A@mail.gmail.com' \
    --to=luto@kernel.org \
    --cc=andrew.cooper3@citrix.com \
    --cc=kvm@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=pbonzini@redhat.com \
    --cc=peterz@infradead.org \
    --cc=sean.j.christopherson@intel.com \
    --cc=stable@vger.kernel.org \
    --cc=tglx@linutronix.de \
    --cc=tony.luck@intel.com \
    --cc=vgoyal@redhat.com \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).