Re: [PATCH v2] x86/kvm: Disable KVM_ASYNC_PF_SEND_ALWAYS

From: Andy Lutomirski <luto@kernel.org>
To: Thomas Gleixner <tglx@linutronix.de>,
	Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>,
	Sean Christopherson <sean.j.christopherson@intel.com>,
	Vivek Goyal <vgoyal@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Andy Lutomirski <luto@kernel.org>,
	LKML <linux-kernel@vger.kernel.org>, X86 ML <x86@kernel.org>,
	kvm list <kvm@vger.kernel.org>, stable <stable@vger.kernel.org>
Subject: Re: [PATCH v2] x86/kvm: Disable KVM_ASYNC_PF_SEND_ALWAYS
Date: Wed, 8 Apr 2020 21:50:58 -0700	[thread overview]
Message-ID: <CALCETrWG2Y4SPmVkugqgjZcMfpQiq=YgsYBmWBm1hj_qx3JNVQ@mail.gmail.com> (raw)
In-Reply-To: <87d08hc0vz.fsf@nanos.tec.linutronix.de>

On Wed, Apr 8, 2020 at 11:01 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Paolo Bonzini <pbonzini@redhat.com> writes:
> > On 08/04/20 17:34, Sean Christopherson wrote:
> >> On Wed, Apr 08, 2020 at 10:23:58AM +0200, Paolo Bonzini wrote:
> >>> Page-not-present async page faults are almost a perfect match for the
> >>> hardware use of #VE (and it might even be possible to let the processor
> >>> deliver the exceptions).
> >>
> >> My "async" page fault knowledge is limited, but if the desired behavior is
> >> to reflect a fault into the guest for select EPT Violations, then yes,
> >> enabling EPT Violation #VEs in hardware is doable.  The big gotcha is that
> >> KVM needs to set the suppress #VE bit for all EPTEs when allocating a new
> >> MMU page, otherwise not-present faults on zero-initialized EPTEs will get
> >> reflected.
> >>
> >> Attached a patch that does the prep work in the MMU.  The VMX usage would be:
> >>
> >>      kvm_mmu_set_spte_init_value(VMX_EPT_SUPPRESS_VE_BIT);
> >>
> >> when EPT Violation #VEs are enabled.  It's 64-bit only as it uses stosq to
> >> initialize EPTEs.  32-bit could also be supported by doing memcpy() from
> >> a static page.
> >
> > The complication is that (at least according to the current ABI) we
> > would not want #VE to kick if the guest currently has IF=0 (and possibly
> > CPL=0).  But the ABI is not set in stone, and anyway the #VE protocol is
> > a decent one and worth using as a base for whatever PV protocol we design.
>
> Forget the current pf async semantics (or the lack of). You really want
> to start from scratch and igore the whole thing.
>
> The charm of #VE is that the hardware can inject it and it's not nesting
> until the guest cleared the second word in the VE information area. If
> that word is not 0 then you get a regular vmexit where you suspend the
> vcpu until the nested problem is solved.

Can you point me at where the SDM says this?

Anyway, I see two problems with #VE, one big and one small.  The small
(or maybe small) one is that any fancy protocol where the guest
returns from an exception by doing, logically:

Hey I'm done;  /* MOV somewhere, hypercall, MOV to CR4, whatever */
IRET;

is fundamentally racy.  After we say we're done and before IRET, we
can be recursively reentered.  Hi, NMI!

The big problem is that #VE doesn't exist on AMD, and I really think
that any fancy protocol we design should work on AMD.  I have no
problem with #VE being a nifty optimization to the protocol on Intel,
but it should *work* without #VE.

>
> So you really don't worry about the guest CPU state at all. The guest
> side #VE handler has to decide what it wants from the host depending on
> it's internal state:
>
>      - Suspend me and resume once the EPT fail is solved

I'm not entirely convinced this is better than the HLT loop.  It's
*prettier*, but the HLT loop avoids an extra hypercall and has the
potentially useful advantage that the guest can continue to process
interrupts even if it is unable to make progress otherwise.

Anyway, the whole thing can be made to work reasonably well without
#VE, #MC or any other additional special exception, like this:

First, when the guest accesses a page that is not immediately
available (paged out or failed), the host attempts to deliver the
"page not present -- try to do other stuff" event.  This event has an
associated per-vCPU data structure along these lines:

struct page_not_present_data {
  u32 inuse;  /* 1: the structure is in use.  0: free */
  u32 token;
  u64 gpa;
  u64 gva;  /* if known, and there should be a way to indicate that
it's not known. */
};

Only the host ever changes inuse from 0 to 1 and only the guest ever
changes inuse from 1 to 0.

The "page not present -- try to do other stuff" event has interrupt
semantics -- it is only delivered if the vCPU can currently receive an
interrupt.  This means IF = 1 and STI and MOV SS shadows are not
active.  Arguably TPR should be checked too.  It is also only
delivered if page_not_present_data.inuse == 0 and if there are tokens
available -- see below.  If the event can be delivered, then
page_not_present_data is filled out and the event is delivered.  If
the event is not delivered, then one of three things happens:

a) If the page not currently known to be failed (e.g. paged out or the
host simply does not know yet until it does some IO), then the vCPU
goes to sleep until the host is ready.
b) If the page is known to be failed, and no fancy #VE / #MC is
available, then the guest is killed and the host logs an error.
c) If some fancy recovery mechanism is implemented, which is optional,
then the guest gets an appropriate fault.

If a page_not_present event is delivered, then the host promises to
eventually resolve it.  Resolving it looks like this:

struct page_not_present_resolution {
  u32 result;  /* 0: guest should try again.  1: page is failed */
  u32 token;
};

struct page_not_present_resolutions {
  struct page_not_present_resolution events[N];
  u32 head, tail;
};

Only N page-not-presents can be outstanding and unresolved at a time.
it is entirely legal for the host to write the resolution to the
resolution list before delivering page-not-present.

When a page-not-present is resolved, the host writes the outcome to
the page_not_present_resolutions ring.  If there is no space, this
means that either the host or guest messed up (the host will not
allocate more tokens than can fit in the ring) and the guest is
killed.  The host also sends the guest an interrupt.  This is a
totally normal interrupt.

If the guest gets a "page is failed" resolution, the page is failed.
If the guest accesses the failed page again, then the host will try to
send a page-not-present event again.  If there is no space in the
ring, then the rules above are followed.

This will allow the sensible cases of memory failure to be recovered
by the guest without the introduction of any super-atomic faults. :)

Obviously there is more than one way to rig up the descriptor ring.
My proposal above is just one way to do it, not necessarily the best
way.