* [PATCH] KVM: nVMX: Update VMCS02 when L2 PAE PDPTE updates detected
From: Peter Shier @ 2020-08-18  0:43 UTC
  To: kvm; +Cc: pbonzini, Peter Shier, Jim Mattson

When L2 uses PAE, L0 intercepts of L2 writes to CR0/CR3/CR4 call
load_pdptrs to read the possibly updated PDPTEs from the guest
physical address referenced by CR3.  It loads them into
vcpu->arch.walk_mmu->pdptrs[] and sets VCPU_EXREG_PDPTR in
vcpu->arch.regs_dirty.

On the subsequent (assumed) reentry into L2, the MMU calls
vmx_load_mmu_pgd, which calls ept_load_pdptrs. ept_load_pdptrs sees
VCPU_EXREG_PDPTR set in vcpu->arch.regs_dirty and loads
VMCS02.GUEST_PDPTRn from vcpu->arch.walk_mmu->pdptrs[]. This all works
as long as the L2 CRn write intercept always resumes L2.
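
For reference, ept_load_pdptrs in this era's vmx.c looks roughly like
the sketch below (paraphrased, not a verbatim quote); the regs_dirty
check is the crux of the bug:

static void ept_load_pdptrs(struct kvm_vcpu *vcpu)
{
	struct kvm_mmu *mmu = vcpu->arch.walk_mmu;

	/* Nothing to do unless load_pdptrs() marked the PDPTEs dirty. */
	if (!kvm_register_is_dirty(vcpu, VCPU_EXREG_PDPTR))
		return;

	if (is_pae_paging(vcpu)) {
		vmcs_write64(GUEST_PDPTR0, mmu->pdptrs[0]);
		vmcs_write64(GUEST_PDPTR1, mmu->pdptrs[1]);
		vmcs_write64(GUEST_PDPTR2, mmu->pdptrs[2]);
		vmcs_write64(GUEST_PDPTR3, mmu->pdptrs[3]);
	}
}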

The resume path calls vmx_check_nested_events, which checks for
pending exceptions, MTF, and expired VMX preemption timers. If
vmx_check_nested_events finds any of these conditions pending, it
reflects the corresponding exit into L1 instead of resuming L2. Live
migration at this point would likewise cause the immediate reentry
into L2 to be missed.

After L1 exits, vmx_vcpu_run calls vmx_register_cache_reset which
clears VCPU_EXREG_PDPTR in vcpu->arch.regs_dirty.  When L2 next
resumes, ept_load_pdptrs finds VCPU_EXREG_PDPTR clear in
vcpu->arch.regs_dirty and does not load VMCS02.GUEST_PDPTRn from
vcpu->arch.walk_mmu->pdptrs[]. prepare_vmcs02 will then load
VMCS02.GUEST_PDPTRn from vmcs12->pdptr0/1/2/3, which contain the stale
values stored at the last L2 exit. A repro of this bug showed L2
entering a triple fault immediately due to the bad VMCS02.GUEST_PDPTRn
values.
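
As a sketch (not literal code), the failing sequence is:

/*
 * L2 writes CR0/CR3/CR4; L0 intercepts:
 *   load_pdptrs()                 caches PDPTEs, marks PDPTR dirty
 * resume path:
 *   vmx_check_nested_events()     pending MTF/exception/expired
 *                                 preemption timer => exit to L1
 *   vmx_vcpu_run() for L1
 *     vmx_register_cache_reset()  regs_dirty = 0; PDPTE update lost
 * next nested VM-Enter back to L2:
 *   prepare_vmcs02()              writes stale vmcs12->pdptr0..3
 *   ept_load_pdptrs()             sees PDPTR clean, writes nothing
 */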

Add a new x86 op to enable vendor-specific code to process the PDPTE
updates. For Intel, when in guest mode using PAE, update
VMCS02.GUEST_PDPTRn from the newly loaded values.

Tested:
kvm-unit-tests with a new directed test: vmx_mtf_pdpte_test.
Verified that the test fails without the fix.

Also ran a Google-internal VMM with an Ubuntu 16.04 4.4.0-83 guest
running a custom hypervisor with a 32-bit Windows XP L2 guest using
PAE. Prior to the fix, the bug reproduced readily. Ran with the fix
for 13 hours, booting 14 simultaneous L2s, with no failures.

Signed-off-by: Peter Shier <pshier@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
---
 arch/x86/include/asm/kvm_host.h |  1 +
 arch/x86/kvm/vmx/vmx.c          | 11 +++++++++++
 arch/x86/kvm/x86.c              |  2 ++
 3 files changed, 14 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 5ab3af7275d8..c9971c5d316f 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1227,6 +1227,7 @@ struct kvm_x86_ops {
 	int (*enable_direct_tlbflush)(struct kvm_vcpu *vcpu);
 
 	void (*migrate_timers)(struct kvm_vcpu *vcpu);
+	void (*load_pdptrs)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu);
 };
 
 struct kvm_x86_nested_ops {
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 46ba2e03a892..b8e36ea077dc 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -2971,6 +2971,16 @@ static void vmx_flush_tlb_guest(struct kvm_vcpu *vcpu)
 	vpid_sync_context(to_vmx(vcpu)->vpid);
 }
 
+static void vmx_load_pdptrs(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu)
+{
+	if (enable_ept && is_guest_mode(vcpu)) {
+		vmcs_write64(GUEST_PDPTR0, mmu->pdptrs[0]);
+		vmcs_write64(GUEST_PDPTR1, mmu->pdptrs[1]);
+		vmcs_write64(GUEST_PDPTR2, mmu->pdptrs[2]);
+		vmcs_write64(GUEST_PDPTR3, mmu->pdptrs[3]);
+	}
+}
+
 static void ept_load_pdptrs(struct kvm_vcpu *vcpu)
 {
 	struct kvm_mmu *mmu = vcpu->arch.walk_mmu;
@@ -8005,6 +8015,7 @@ static struct kvm_x86_ops vmx_x86_ops __initdata = {
 	.need_emulation_on_page_fault = vmx_need_emulation_on_page_fault,
 	.apic_init_signal_blocked = vmx_apic_init_signal_blocked,
 	.migrate_timers = vmx_migrate_timers,
+	.load_pdptrs = vmx_load_pdptrs,
 };
 
 static __init int hardware_setup(void)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 599d73206299..e52c2d67ba0a 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -769,6 +769,8 @@ int load_pdptrs(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, unsigned long cr3)
 	memcpy(mmu->pdptrs, pdpte, sizeof(mmu->pdptrs));
 	kvm_register_mark_dirty(vcpu, VCPU_EXREG_PDPTR);
 
+	if (kvm_x86_ops.load_pdptrs)
+		kvm_x86_ops.load_pdptrs(vcpu, mmu);
 out:
 
 	return ret;
-- 



* Re: [PATCH] KVM: nVMX: Update VMCS02 when L2 PAE PDPTE updates detected
From: Sean Christopherson @ 2020-08-18 15:20 UTC
  To: Peter Shier; +Cc: kvm, pbonzini, Jim Mattson

On Mon, Aug 17, 2020 at 05:43:14PM -0700, Peter Shier wrote:
> ---
>  arch/x86/include/asm/kvm_host.h |  1 +
>  arch/x86/kvm/vmx/vmx.c          | 11 +++++++++++
>  arch/x86/kvm/x86.c              |  2 ++
>  3 files changed, 14 insertions(+)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 5ab3af7275d8..c9971c5d316f 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1227,6 +1227,7 @@ struct kvm_x86_ops {
>  	int (*enable_direct_tlbflush)(struct kvm_vcpu *vcpu);
>  
>  	void (*migrate_timers)(struct kvm_vcpu *vcpu);
> +	void (*load_pdptrs)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu);
>  };
>  
>  struct kvm_x86_nested_ops {
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 46ba2e03a892..b8e36ea077dc 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -2971,6 +2971,16 @@ static void vmx_flush_tlb_guest(struct kvm_vcpu *vcpu)
>  	vpid_sync_context(to_vmx(vcpu)->vpid);
>  }
>  
> +static void vmx_load_pdptrs(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu)
> +{
> +	if (enable_ept && is_guest_mode(vcpu)) {
> +		vmcs_write64(GUEST_PDPTR0, mmu->pdptrs[0]);
> +		vmcs_write64(GUEST_PDPTR1, mmu->pdptrs[1]);
> +		vmcs_write64(GUEST_PDPTR2, mmu->pdptrs[2]);
> +		vmcs_write64(GUEST_PDPTR3, mmu->pdptrs[3]);
> +	}
> +}
> +
>  static void ept_load_pdptrs(struct kvm_vcpu *vcpu)
>  {
>  	struct kvm_mmu *mmu = vcpu->arch.walk_mmu;
> @@ -8005,6 +8015,7 @@ static struct kvm_x86_ops vmx_x86_ops __initdata = {
>  	.need_emulation_on_page_fault = vmx_need_emulation_on_page_fault,
>  	.apic_init_signal_blocked = vmx_apic_init_signal_blocked,
>  	.migrate_timers = vmx_migrate_timers,
> +	.load_pdptrs = vmx_load_pdptrs,
>  };
>  
>  static __init int hardware_setup(void)
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 599d73206299..e52c2d67ba0a 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -769,6 +769,8 @@ int load_pdptrs(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, unsigned long cr3)
>  	memcpy(mmu->pdptrs, pdpte, sizeof(mmu->pdptrs));
>  	kvm_register_mark_dirty(vcpu, VCPU_EXREG_PDPTR);
>  
> +	if (kvm_x86_ops.load_pdptrs)
> +		kvm_x86_ops.load_pdptrs(vcpu, mmu);
>  out:
>  
>  	return ret;
> -- 

I'd prefer to handle this on the switch from L2->L1.  It avoids adding a
kvm_x86_ops and yet another sequence of four VMWRITEs, e.g. I think this
will do the trick.

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 9c74a732b08d..67465f0ca1b9 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -4356,6 +4356,9 @@ void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 vm_exit_reason,
        if (kvm_check_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu))
                kvm_vcpu_flush_tlb_current(vcpu);

+       if (enable_ept && is_pae_paging(vcpu))
+               ept_load_pdptrs(vcpu);
+
        leave_guest_mode(vcpu);

        if (nested_cpu_has_preemption_timer(vmcs12))
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 40f5a3923d5b..aa1f259ee47a 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -320,6 +320,7 @@ void vmx_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0);
 int vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4);
 void set_cr4_guest_host_mask(struct vcpu_vmx *vmx);
 void vmx_load_mmu_pgd(struct kvm_vcpu *vcpu, unsigned long cr3);
+void ept_load_pdptrs(struct kvm_vcpu *vcpu);
 void ept_save_pdptrs(struct kvm_vcpu *vcpu);
 void vmx_get_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg);
 void vmx_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg);


* Re: [PATCH] KVM: nVMX: Update VMCS02 when L2 PAE PDPTE updates detected
From: Jim Mattson @ 2020-08-18 17:14 UTC
  To: Sean Christopherson; +Cc: Peter Shier, kvm list, Paolo Bonzini

On Tue, Aug 18, 2020 at 8:20 AM Sean Christopherson
<sean.j.christopherson@intel.com> wrote:

> I'd prefer to handle this on the switch from L2->L1.  It avoids adding a
> kvm_x86_ops and yet another sequence of four VMWRITEs, e.g. I think this
> will do the trick.
>
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index 9c74a732b08d..67465f0ca1b9 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -4356,6 +4356,9 @@ void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 vm_exit_reason,
>         if (kvm_check_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu))
>                 kvm_vcpu_flush_tlb_current(vcpu);
>
> +       if (enable_ept && is_pae_paging(vcpu))
> +               ept_load_pdptrs(vcpu);
> +

Are the mmu->pdptrs[] guaranteed to be valid at this point? If L2 has
PAE paging enabled, and it has modified CR3 without a VM-exit, where
are the current PDPTE values read from the vmcs02 into mmu->pdptrs[]?


* Re: [PATCH] KVM: nVMX: Update VMCS02 when L2 PAE PDPTE updates detected
From: Sean Christopherson @ 2020-08-18 17:24 UTC
  To: Jim Mattson; +Cc: Peter Shier, kvm list, Paolo Bonzini

On Tue, Aug 18, 2020 at 10:14:39AM -0700, Jim Mattson wrote:
> On Tue, Aug 18, 2020 at 8:20 AM Sean Christopherson
> <sean.j.christopherson@intel.com> wrote:
> 
> > I'd prefer to handle this on the switch from L2->L1.  It avoids adding a
> > kvm_x86_ops and yet another sequence of four VMWRITEs, e.g. I think this
> > will do the trick.
> >
> > diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> > index 9c74a732b08d..67465f0ca1b9 100644
> > --- a/arch/x86/kvm/vmx/nested.c
> > +++ b/arch/x86/kvm/vmx/nested.c
> > @@ -4356,6 +4356,9 @@ void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 vm_exit_reason,
> >         if (kvm_check_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu))
> >                 kvm_vcpu_flush_tlb_current(vcpu);
> >
> > +       if (enable_ept && is_pae_paging(vcpu))
> > +               ept_load_pdptrs(vcpu);
> > +
> 
> Are the mmu->pdptrs[] guaranteed to be valid at this point? If L2 has
> PAE paging enabled, and it has modified CR3 without a VM-exit, where
> are the current PDPTE values read from the vmcs02 into mmu->pdptrs[]?

ept_load_pdptrs() checks kvm_register_is_dirty(vcpu, VCPU_EXREG_PDPTR).  The
idea is basically the same as the above TLB_FLUSH_CURRENT; process pending
requests and/or dirty state for L2 before switching to L1.
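
(The dirty check is just a bit test on vcpu->arch.regs_dirty;
roughly, per this era's kvm_cache_regs.h:

static inline bool kvm_register_is_dirty(struct kvm_vcpu *vcpu,
					 enum kvm_reg reg)
{
	return test_bit(reg, (unsigned long *)&vcpu->arch.regs_dirty);
}

so the added ept_load_pdptrs() call is a nop unless load_pdptrs() has
marked the PDPTEs dirty.)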


* Re: [PATCH] KVM: nVMX: Update VMCS02 when L2 PAE PDPTE updates detected
From: Jim Mattson @ 2020-08-18 17:34 UTC
  To: Sean Christopherson; +Cc: Peter Shier, kvm list, Paolo Bonzini

On Tue, Aug 18, 2020 at 10:24 AM Sean Christopherson
<sean.j.christopherson@intel.com> wrote:
>
> On Tue, Aug 18, 2020 at 10:14:39AM -0700, Jim Mattson wrote:
> > On Tue, Aug 18, 2020 at 8:20 AM Sean Christopherson
> > <sean.j.christopherson@intel.com> wrote:
> >
> > > I'd prefer to handle this on the switch from L2->L1.  It avoids adding a
> > > kvm_x86_ops and yet another sequence of four VMWRITEs, e.g. I think this
> > > will do the trick.
> > >
> > > diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> > > index 9c74a732b08d..67465f0ca1b9 100644
> > > --- a/arch/x86/kvm/vmx/nested.c
> > > +++ b/arch/x86/kvm/vmx/nested.c
> > > @@ -4356,6 +4356,9 @@ void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 vm_exit_reason,
> > >         if (kvm_check_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu))
> > >                 kvm_vcpu_flush_tlb_current(vcpu);
> > >
> > > +       if (enable_ept && is_pae_paging(vcpu))
> > > +               ept_load_pdptrs(vcpu);
> > > +
> >
> > Are the mmu->pdptrs[] guaranteed to be valid at this point? If L2 has
> > PAE paging enabled, and it has modified CR3 without a VM-exit, where
> > are the current PDPTE values read from the vmcs02 into mmu->pdptrs[]?
>
> ept_load_pdptrs() checks kvm_register_is_dirty(vcpu, VCPU_EXREG_PDPTR).  The
> idea is basically the same as the above TLB_FLUSH_CURRENT; process pending
> requests and/or dirty state for L2 before switching to L1.

Thanks. Is it right to conclude that if we get to the end of
nested_vmx_vmexit, and vcpu->arch.regs_dirty is non-zero, then
something is amiss?


* Re: [PATCH] KVM: nVMX: Update VMCS02 when L2 PAE PDPTE updates detected
From: Sean Christopherson @ 2020-08-18 17:45 UTC
  To: Jim Mattson; +Cc: Peter Shier, kvm list, Paolo Bonzini

On Tue, Aug 18, 2020 at 10:34:52AM -0700, Jim Mattson wrote:
> On Tue, Aug 18, 2020 at 10:24 AM Sean Christopherson
> <sean.j.christopherson@intel.com> wrote:
> >
> > On Tue, Aug 18, 2020 at 10:14:39AM -0700, Jim Mattson wrote:
> > > On Tue, Aug 18, 2020 at 8:20 AM Sean Christopherson
> > > <sean.j.christopherson@intel.com> wrote:
> > >
> > > > I'd prefer to handle this on the switch from L2->L1.  It avoids adding a
> > > > kvm_x86_ops and yet another sequence of four VMWRITEs, e.g. I think this
> > > > will do the trick.
> > > >
> > > > diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> > > > index 9c74a732b08d..67465f0ca1b9 100644
> > > > --- a/arch/x86/kvm/vmx/nested.c
> > > > +++ b/arch/x86/kvm/vmx/nested.c
> > > > @@ -4356,6 +4356,9 @@ void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 vm_exit_reason,
> > > >         if (kvm_check_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu))
> > > >                 kvm_vcpu_flush_tlb_current(vcpu);
> > > >
> > > > +       if (enable_ept && is_pae_paging(vcpu))
> > > > +               ept_load_pdptrs(vcpu);
> > > > +
> > >
> > > Are the mmu->pdptrs[] guaranteed to be valid at this point? If L2 has
> > > PAE paging enabled, and it has modified CR3 without a VM-exit, where
> > > are the current PDPTE values read from the vmcs02 into mmu->pdptrs[]?
> >
> > ept_load_pdptrs() checks kvm_register_is_dirty(vcpu, VCPU_EXREG_PDPTR).  The
> > idea is basically the same as the above TLB_FLUSH_CURRENT; process pending
> > requests and/or dirty state for L2 before switching to L1.
> 
> Thanks. Is it right to conclude that if we get to the end of
> nested_vmx_vmexit, and vcpu->arch.regs_dirty is non-zero, then
> something is amiss?

Not necessarily.  I distinctly remember adding a WARN in vmx_switch_vmcs()
on regs_dirty being non-zero when I added vmx_register_cache_reset().  IIRC,
I didn't push it upstream because I got false positives on RSP and RIP.  At
that point the WARN was purely for PDPTRs, as no other registers on VMX use
dirty tracking.  At the point the WARN felt silly so I ultimately removed it
as I thought all PDPTR flows were covered.  Got that one wrong...

Anyways, for nested_vmx_vmexit(), I think we could have dirty RSP or RIP,
but it's ok because those fields are unconditionally loaded on nested
VM-Enter, i.e. whatever pending update KVM had for L2 will get overwritten
by L1 anyways.
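
(Roughly, the unconditional load in prepare_vmcs02():

	kvm_rsp_write(vcpu, vmcs12->guest_rsp);
	kvm_rip_write(vcpu, vmcs12->guest_rip);

so any RSP/RIP values KVM had pending for L2 are replaced by L1's
vmcs12 values before L2 runs again.)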

