Re: [PATCH 2/4] KVM: nVMX: Check GUEST_DR7 on vmentry of nested guests

From: Jim Mattson <jmattson@google.com>
To: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Cc: "kvm list" <kvm@vger.kernel.org>,
	"Radim Krčmář" <rkrcmar@redhat.com>,
	"Paolo Bonzini" <pbonzini@redhat.com>
Subject: Re: [PATCH 2/4] KVM: nVMX: Check GUEST_DR7 on vmentry of nested guests
Date: Wed, 4 Sep 2019 11:20:28 -0700	[thread overview]
Message-ID: <CALMp9eTO_ChOHQ4paR1SgmxnpSGZrMjHTa2aUWHSCn0+tCGvAA@mail.gmail.com> (raw)
In-Reply-To: <e8a4477c-b3a9-b4e4-1283-99bdaf7aa29b@oracle.com>

On Wed, Sep 4, 2019 at 11:07 AM Krish Sadhukhan
<krish.sadhukhan@oracle.com> wrote:
>
>
> On 9/4/19 9:44 AM, Jim Mattson wrote:
> > On Tue, Sep 3, 2019 at 5:59 PM Krish Sadhukhan
> > <krish.sadhukhan@oracle.com> wrote:
> >>
> >>
> >> On 09/01/2019 05:33 PM, Jim Mattson wrote:
> >>
> >> On Fri, Aug 30, 2019 at 4:15 PM Jim Mattson <jmattson@google.com> wrote:
> >>
> >> On Fri, Aug 30, 2019 at 4:07 PM Krish Sadhukhan
> >> <krish.sadhukhan@oracle.com> wrote:
> >>
> >> On 08/29/2019 03:26 PM, Jim Mattson wrote:
> >>
> >> On Thu, Aug 29, 2019 at 2:25 PM Krish Sadhukhan
> >> <krish.sadhukhan@oracle.com> wrote:
> >>
> >> According to section "Checks on Guest Control Registers, Debug Registers, and
> >> and MSRs" in Intel SDM vol 3C, the following checks are performed on vmentry
> >> of nested guests:
> >>
> >>       If the "load debug controls" VM-entry control is 1, bits 63:32 in the DR7
> >>       field must be 0.
> >>
> >> Can't we just let the hardware check guest DR7? This results in
> >> "VM-entry failure due to invalid guest state," right? And we just
> >> reflect that to L1?
> >>
> >> Just trying to understand the reason why this particular check can be
> >> deferred to the hardware.
> >>
> >> The vmcs02 field has the same value as the vmcs12 field, and the
> >> physical CPU has the same requirements as the virtual CPU.
> >>
> >> Actually, you're right. There is a problem. With the current
> >> implementation, there's a priority inversion if the vmcs12 contains
> >> both illegal guest state for which the checks are deferred to
> >> hardware, and illegal entries in the VM-entry MSR-load area. In this
> >> case, we will synthesize a "VM-entry failure due to MSR loading"
> >> rather than a "VM-entry failure due to invalid guest state."
> >>
> >> There are so many checks on guest state that it's really compelling to
> >> defer as many as possible to hardware. However, we need to fix the
> >> aforesaid priority inversion. Instead of returning early from
> >> nested_vmx_enter_non_root_mode() with EXIT_REASON_MSR_LOAD_FAIL, we
> >> could induce a "VM-entry failure due to MSR loading" for the next
> >> VM-entry of vmcs02 and continue with the attempted vmcs02 VM-entry. If
> >> hardware exits with EXIT_REASON_INVALID_STATE, we reflect that to L1,
> >> and if hardware exits with EXIT_REASON_INVALID_STATE, we reflect that
> >> to L1 (along with the appropriate exit qualification).
> >>
> >>
> >> Looking at nested_vmx_exit_reflected(), it seems we do return to L1 if the error is EXIT_REASON_INVALID_STATE. So if we fix the priority inversion, this should work then ?
> > Yes.
> >
> >> The tricky part is in undoing the successful MSR writes if we reflect
> >> EXIT_REASON_INVALID_STATE to L1. Some MSR writes can't actually be
> >> undone (e.g. writes to IA32_PRED_CMD), but maybe we can get away with
> >> those. (Fortunately, it's illegal to put x2APIC MSRs in the VM-entry
> >> MSR-load area!) Other MSR writes are just a bit tricky to undo (e.g.
> >> writes to IA32_TIME_STAMP_COUNTER).
> >>
> >>
> >> Let's say that the priority inversion issue is fixed. In the scenario in which the Guest state is fine but the VM-entry MSR-Load area contains an illegal entry,  you are saying that the induced "VM-entry failure due to MSR loading"  will be caught during the next VM-entry of vmcs02. So how far does the attempted VM-entry of vmcs02  continue with an illegal MSR-Load entry and how do get to the next VM-entry of vmcs02 ?
> > Sorry; I don't understand the questions.
>
>
> Let's say that all guest state checks are deferred to hardware and that
> they all will pass. Now, the VM-entry MSR-load area contains an illegal
> entry and we modify nested_vmx_enter_non_root_mode() to induce a
> "VM-entry failure due to MSR loading" for the next VM-entry of vmcs02. I
> wanted to understand how that induced error ultimately leads to a
> VM-entry failure ?

One possible implementation is as follows:

While nested_vmx_load_msr() is processing the vmcs12 VM-entry MSR-load
area, it finds an error in entry <i>. We could set up the vmcs02
VM-entry MSR-load area so that the first entry has <i+1> in the
reserved bits, and the VM-entry MSR-load count is greater than 0.
Since the reserved bits must be one, when we try to launch/resume the
vmcs02 in vmx_vcpu_run(), it will result in "VM-entry failure due to
MSR loading." We can then reflect that to the guest, setting the
vmcs12 exit qualification field from the reserved bits in the first
entry of the vmcs02 VM-entry MSR-load area, rather than passing on the
exit qualification field from the vmcs02. Of course, this doesn't work
if <i> is MAX_UINT32, but I suspect you've already got bigger problems
in that case.

>
> >> There are two other scenarios there:
> >>
> >>      1. Guest state is illegal and VM-entry MSR-Load area contains an illegal entry
> >>      2. Guest state is illegal but VM-entry MSR-Load area is fine
> >>
> >> In these scenarios, L2 will exit to L1 with EXIT_REASON_INVALID_STATE and finally this will be returned to L1 userspace. Right ?  If so, we do we care about reverting MSR-writes  because the SDM section 26.8 say,
> >>
> >>          "Processor state is loaded as would be done on a VM exit (see Section 27.5)"
> > I'm not sure how the referenced section of the SDM is relevant. Are
> > you assuming that every MSR in the VM-entry MSR load area also appears
> > in the VM-exit MSR load area? That certainly isn't the case.
> >
> >> Alternatively, we could perform validity checks on the entire vmcs12
> >> VM-entry MSR-load area before writing any of the MSRs. This may be
> >> easier, but it would certainly be slower. We would have to be wary of
> >> situations where processing an earlier entry affects the validity of a
> >> later entry. (If we take this route, then we would also have to
> >> process the valid prefix of the VM-entry MSR-load area when we reflect
> >> EXIT_REASON_MSR_LOAD_FAIL to L1.)
> > Forget this paragraph. Even if all of the checks pass, we still have
> > to undo all of the MSR-writes in the event of a deferred "VM-entry
> > failure due to invalid guest state."
> >
> >> Note that this approach could be extended to permit the deferral of
> >> some control field checks to hardware as well.
> >>
> >>
> >> Why can't the first approach be used for VM-entry controls as well ?
> > Sorry; I don't understand this question either.
>
>
> Since you mentioned,
>
>      "Note that this approach could be extended to permit the deferral
> of some control field checks..."
>
> So it seemed that only the second approach was applicable to deferring
> VM-entry control checks to hardware. Hence I asked why the first
> approach can't be used.

By "this approach," I meant the deferred delivery of an error
discovered in software.

> >
> >>   As long as the control
> >> field is copied verbatim from vmcs12 to vmcs02 and the virtual CPU
> >> enforces the same constraints as the physical CPU, deferral should be
> >> fine. We just have to make sure that we induce a "VM-entry failure due
> >> to invalid guest state" for the next VM-entry of vmcs02 if any
> >> software checks on guest state fail, rather than immediately
> >> synthesizing an "VM-entry failure due to invalid guest state" during
> >> the construction of vmcs02.
> >>
> >>
> >> Is it OK to keep this Guest check in software for now and then remove it once we have a solution in place ?
> > Why do you feel that getting the priority correct is so important for
> > this one check in particular? I'd be surprised if any hypervisor ever
> > assembled a VMCS that failed this check.