Re: [PATCH 2/8] KVM: x86: extend struct kvm_vcpu_pv_apf_data with token info

From: Paolo Bonzini <pbonzini@redhat.com>
To: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: Vivek Goyal <vgoyal@redhat.com>,
	Vitaly Kuznetsov <vkuznets@redhat.com>,
	kvm@vger.kernel.org, x86@kernel.org,
	Andy Lutomirski <luto@kernel.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	Borislav Petkov <bp@alien8.de>, "H. Peter Anvin" <hpa@zytor.com>,
	Wanpeng Li <wanpengli@tencent.com>,
	Jim Mattson <jmattson@google.com>, Gavin Shan <gshan@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	linux-kernel@vger.kernel.org,
	Andrew Cooper <andrew.cooper3@citrix.com>
Subject: Re: [PATCH 2/8] KVM: x86: extend struct kvm_vcpu_pv_apf_data with token info
Date: Sat, 16 May 2020 00:23:31 +0200
Message-ID: <943cfc2f-5b18-e00a-f5a2-4577472a1ff5@redhat.com>
In-Reply-To: <20200515204341.GF17572@linux.intel.com>

On 15/05/20 22:43, Sean Christopherson wrote:
> On Fri, May 15, 2020 at 09:18:07PM +0200, Paolo Bonzini wrote:
>> On 15/05/20 20:46, Sean Christopherson wrote:
>>> Why even bother using 'struct kvm_vcpu_pv_apf_data' for the #PF case?  VMX
>>> only requires error_code[31:16]==0 and SVM doesn't vet it at all, i.e. we
>>> can (ab)use the error code to indicate an async #PF by setting it to an
> >>> impossible value, e.g. 0xaaaa (a is for async!).  That particular error code
>>> is even enforced by the SDM, which states:
>>
>> Possibly, but it's water under the bridge now.
> 
> Why is that?  I thought we were redoing the entire thing because the current
> ABI is unfixably broken?  In other words, since the guest needs to change,
> why are we keeping any of the current async #PF pieces?  E.g. why keep using
> #PF instead of usurping something like #NP?

Because that would be 3 ABIs to support instead of 2.  The #PF solution
is only broken as long as you allow async PF from ring 0 (which wasn't
even true except for preemptable kernels) _and_ have NMIs that can
generate page faults.  We also have the #PF vmexit part for nested
virtualization.  This adds up and makes a quick fix for 'page not ready'
notifications not that quick.

However, interrupts for 'page ready' do have a bunch of advantages (more
control on what can be preempted by the notification, a saner check for
new page faults which is effectively a bug fix) so it makes sense to get
them in more quickly (probably 5.9 at this point due to the massive
cleanups that are being done around interrupt vectors).

>> And the #PF mechanism also has the problem with NMIs that happen before the
>> error code is read and page faults happening in the handler.
> 
> Hrm, I think there's no unfixable problem except for a pathological
> #PF->NMI->#DB->#PF scenario.  But it is a problem :-(

Yeah, that made no sense.  But still I'm not sure the x86 maintainers
would like it.

The only possible issue with #VE is the re-entrancy at the end.  Andy
proposed re-enabling it from an interrupt, but here is one solution that
can be done almost entirely from C.  The idea is to split the IST in two
halves, and flip between them in the TSS with an XOR operation on entry
to the interrupt handler.  This is possible because there won't ever be
more than two handlers active at the same time.  Unlike with SUB/ADD,
with XOR you don't have to restore the IST on exit: the two halves take
turns as the current IST, so there is no problematic window between the
ADD and the IRET.
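
As a minimal sketch of why the XOR needs no undo step, here is a
user-space model of the flip.  It assumes the two 512-byte halves sit on
either side of a 1024-byte-aligned address, so that bit 9 of the stack
top alone selects the half.  struct fake_tss, ve_ist_init() and
ve_ist_flip() are made-up names for illustration, not kernel interfaces;
the real thing would presumably be a single XOR on the TSS in the entry
assembly.

```c
#include <assert.h>
#include <stdint.h>

#define VE_IST_HALF 512u	/* size of one half, from "xor 512" */

/* Hypothetical stand-in for the TSS slot holding the #VE IST pointer. */
struct fake_tss {
	uint64_t ve_ist;
};

/*
 * Point the slot at the top of the first half.  Assumption: 'mid' is the
 * boundary between the two halves and is 1024-byte aligned, so the two
 * stack tops (mid and mid + 512) differ only in bit 9.
 */
static void ve_ist_init(struct fake_tss *tss, uint64_t mid)
{
	assert((mid & (2 * VE_IST_HALF - 1)) == 0);
	tss->ve_ist = mid;
}

/*
 * Done once on every entry.  XOR is its own inverse, so the next entry
 * simply flips back; nothing has to be restored before IRET.
 */
static void ve_ist_flip(struct fake_tss *tss)
{
	tss->ve_ist ^= VE_IST_HALF;
}
```

With SUB on entry and ADD on exit there would be a stretch between the
ADD and the IRET where the slot already points at the stack still in
use; with the flip there is no exit-side write at all.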

The pseudocode would be:

- on #VE entry
   xor 512 with the IST address in the TSS
   check if saved RSP comes from the IST
   if so:
     overwrite the saved flags/CS/SS in the "other" IST half with the
       current value of the registers
     overwrite the saved RSP in the "other" IST half with the address
       of the top of the IST itself
     overwrite the saved RIP in the "other" IST half with the address
       of a trampoline (see below)
   else:
     save the top 5 words of the IST somewhere
     do normal stuff

- the trampoline restores the 5 words at the top of the IST with five
  push instructions, and jumps back to the first instruction of the
  handler

Everything except the first step can even be done in C.
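
To make the C part concrete, here is a rough user-space sketch of the
steps after the flip.  Everything in it is an assumption made for
illustration: the structs, the names (ve_entry_fixup,
ve_trampoline_restore, struct ve_ist, struct ve_regs) and the idea of
passing frames in as pointers.  It only models the bookkeeping, not real
entry code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* The five words the CPU pushes for a 64-bit exception, lowest address first. */
struct iret_frame {
	uint64_t rip, cs, rflags, rsp, ss;
};

/* Current flags/CS/SS of the running handler (hypothetical container). */
struct ve_regs {
	uint64_t cs, rflags, ss;
};

/* Hypothetical per-CPU state for the split IST. */
struct ve_ist {
	uint64_t lo, hi;		/* [lo, hi) covers both halves */
	struct iret_frame backup;	/* the "somewhere" for the top 5 words */
};

static bool rsp_on_ist(const struct ve_ist *ist, uint64_t rsp)
{
	return rsp >= ist->lo && rsp < ist->hi;
}

/*
 * Runs after the TSS slot has been flipped.
 *   mine:       frame the CPU just pushed for this #VE
 *   outer:      frame at the top of the "other" half
 *   outer_top:  address of the top of that half
 * Returns true if this #VE interrupted another #VE handler.
 */
static bool ve_entry_fixup(struct ve_ist *ist, const struct iret_frame *mine,
			   struct iret_frame *outer, uint64_t outer_top,
			   const struct ve_regs *cur, uint64_t trampoline)
{
	assert(ist->lo < ist->hi);

	if (rsp_on_ist(ist, mine->rsp)) {
		/* Nested: make the outer IRET land in the trampoline. */
		outer->rflags = cur->rflags;
		outer->cs = cur->cs;
		outer->ss = cur->ss;
		outer->rsp = outer_top;
		outer->rip = trampoline;
		return true;
	}

	/* Not nested: keep a copy of the original return frame. */
	ist->backup = *mine;
	return false;
}

/* Model of the trampoline's five pushes: rebuild the original frame. */
static void ve_trampoline_restore(const struct ve_ist *ist,
				  struct iret_frame *top)
{
	*top = ist->backup;
}
```

Note that the nested path leaves the backup alone, so the trampoline
always restores the frame of the original, non-nested #VE.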

Here is an example.

Assuming that on entry to the outer #VE the IST is the "even" half, the
outer #VE moves the IST to the "odd" half and the return
flags/CS/SS/RSP/RIP are saved.

After the reentrancy flag has been cleared, a nested #VE arrives and
runs within the "odd" half of the IST.  The IST is moved back to the
"even" half and the return flags/CS/SS/RSP/RIP in the "even" half are
patched to point to the trampoline.

When we get back to the outer handler the reentrancy flag is not zero, so
even though the IST points to the current stack, reentrancy is
impossible and we go just fine through the few final instructions of the
handler.

On outer #VE exit, the IRET instruction jumps to the trampoline, with
RSP pointing at the top of the "even" half.  The return
flags/CS/SS/RSP/RIP are restored, and everything restarts from the
beginning: the outer #VE moves the IST to the "odd" half, the return
flags/CS/SS/RSP/RIP are saved, the data for the nested #VE is fished out
of the virtualization exception area and so on.

Paolo