* [PATCH v2] KVM: nVMX: fix CR3 load if L2 uses PAE paging and EPT
From: Ladi Prosek @ 2016-11-29  8:49 UTC
  To: kvm; +Cc: pbonzini, rkrcmar, bsd

KVM does not correctly handle L1 hypervisors that emulate L2 real mode with
PAE and EPT, such as Hyper-V. In this mode, the L1 hypervisor populates the
guest PDPTE VMCS fields and leaves guest CR3 uninitialized because the CPU
does not use it (see Intel SDM 26.3.2.4, Loading Page-Directory-Pointer-Table
Entries). KVM, however, always dereferences CR3 and tries to load the PDPTEs
if PAE is on. This leads to two related issues:

1) On the first nested vmentry, the guest PDPTEs, as populated by L1, are
overwritten in ept_load_pdptrs because the registers are believed to have
been loaded in load_pdptrs as part of kvm_set_cr3. This is incorrect. L2 is
running with PAE enabled but PDPTRs have been set up by L1.

2) When L2 is about to enable paging and loads its CR3, we again attempt to
load the PDPTEs in load_pdptrs, called from kvm_set_cr3. There is no
guarantee that this will succeed (it is just a CR3 load; paging is not
enabled yet) and if it doesn't, kvm_set_cr3 returns early without persisting
the CR3. The new value is then lost and L2 crashes right after it enables
paging.

This patch replaces the kvm_set_cr3 call on nested entry and exit with a new
function nested_vmx_load_cr3 which correctly handles PAE+EPT, performs CR3
validity checks per the specification, and does not do unnecessary and
duplicated MMU work.

Signed-off-by: Ladi Prosek <lprosek@redhat.com>
---

tl;dr This makes Hyper-V (Windows Server 2016) work on KVM.

v1->v2:

* introduced nested_vmx_load_cr3 (more error handling, less MMU work)
* also handles nested exit in addition to nested entry
* amended commit description
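
To make the failing setup concrete, here is an illustrative sketch (not
part of the patch; pdpte[] is a placeholder for whatever tables L1 built)
of the vmcs12 guest state that an L1 like Hyper-V hands to KVM when it
emulates L2 real mode with PAE and EPT:

	/*
	 * With PAE paging + EPT the CPU sources the PDPTEs from the
	 * VMCS guest-state fields, so L1 fills those in directly ...
	 */
	vmcs12->guest_pdptr0 = pdpte[0];
	vmcs12->guest_pdptr1 = pdpte[1];
	vmcs12->guest_pdptr2 = pdpte[2];
	vmcs12->guest_pdptr3 = pdpte[3];
	/* ... and guest_cr3 is left uninitialized; it is never used. */

KVM must therefore not dereference vmcs12->guest_cr3 on this path.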

 arch/x86/include/asm/kvm_host.h |  1 +
 arch/x86/include/uapi/asm/vmx.h |  1 +
 arch/x86/kvm/vmx.c              | 65 +++++++++++++++++++++++++++++++++++++----
 arch/x86/kvm/x86.c              |  3 +-
 4 files changed, 63 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index bdde807..9ec06a3 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1062,6 +1062,7 @@ unsigned int kvm_mmu_calculate_mmu_pages(struct kvm *kvm);
 void kvm_mmu_change_mmu_pages(struct kvm *kvm, unsigned int kvm_nr_mmu_pages);
 
 int load_pdptrs(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, unsigned long cr3);
+bool pdptrs_changed(struct kvm_vcpu *vcpu);
 
 int emulator_write_phys(struct kvm_vcpu *vcpu, gpa_t gpa,
 			  const void *val, int bytes);
diff --git a/arch/x86/include/uapi/asm/vmx.h b/arch/x86/include/uapi/asm/vmx.h
index f9dea4f..f1a545c 100644
--- a/arch/x86/include/uapi/asm/vmx.h
+++ b/arch/x86/include/uapi/asm/vmx.h
@@ -134,5 +134,6 @@
 
 #define VMX_ABORT_SAVE_GUEST_MSR_FAIL        1
 #define VMX_ABORT_LOAD_HOST_MSR_FAIL         4
+#define VMX_ABORT_LOAD_HOST_CR3_FAIL         5
 
 #endif /* _UAPIVMX_H */
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 81fbda0..10981d8 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -9796,6 +9796,44 @@ static int nested_vmx_store_msr(struct kvm_vcpu *vcpu, u64 gpa, u32 count)
 }
 
 /*
+ * Load guest's/host's cr3 at nested entry/exit. nested_ept is true if we are
+ * emulating VM entry into a guest with EPT enabled.
+ * Returns 0 on success, 1 on failure. Invalid state exit qualification code
+ * is assigned to entry_failure_code on failure.
+ */
+static int nested_vmx_load_cr3(struct kvm_vcpu *vcpu, unsigned long cr3, bool nested_ept,
+			       unsigned long *entry_failure_code)
+{
+	unsigned long invalid_mask;
+
+	if (cr3 != kvm_read_cr3(vcpu) || (!nested_ept && pdptrs_changed(vcpu))) {
+		invalid_mask = (~0ULL) << cpuid_maxphyaddr(vcpu);
+		if (cr3 & invalid_mask) {
+			*entry_failure_code = ENTRY_FAIL_DEFAULT;
+			return 1;
+		}
+
+		/*
+		 * If PAE paging and EPT are both on, CR3 is not used by the CPU and
+		 * must not be dereferenced.
+		 */
+		if (!is_long_mode(vcpu) && is_pae(vcpu) && is_paging(vcpu) &&
+		    !nested_ept) {
+			if (!load_pdptrs(vcpu, vcpu->arch.walk_mmu, cr3)) {
+				*entry_failure_code = ENTRY_FAIL_PDPTE;
+				return 1;
+			}
+		}
+
+		vcpu->arch.cr3 = cr3;
+		__set_bit(VCPU_EXREG_CR3, (ulong *)&vcpu->arch.regs_avail);
+	}
+
+	kvm_mmu_reset_context(vcpu);
+	return 0;
+}
+
+/*
  * prepare_vmcs02 is called when the L1 guest hypervisor runs its nested
  * L2 guest. L1 has a vmcs for L2 (vmcs12), and this function "merges" it
  * with L0's requirements for its guest (a.k.a. vmcs01), so we can run the L2
@@ -9803,11 +9841,15 @@ static int nested_vmx_store_msr(struct kvm_vcpu *vcpu, u64 gpa, u32 count)
  * needs. In addition to modifying the active vmcs (which is vmcs02), this
  * function also has additional necessary side-effects, like setting various
  * vcpu->arch fields.
+ * Returns 0 on success, 1 on failure. Invalid state exit qualification code
+ * is assigned to entry_failure_code on failure.
  */
-static void prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
+static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
+			  unsigned long *entry_failure_code)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	u32 exec_control;
+	bool nested_ept_enabled = false;
 
 	vmcs_write16(GUEST_ES_SELECTOR, vmcs12->guest_es_selector);
 	vmcs_write16(GUEST_CS_SELECTOR, vmcs12->guest_cs_selector);
@@ -9972,6 +10014,7 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
 				vmcs12->guest_intr_status);
 		}
 
+		nested_ept_enabled = (exec_control & SECONDARY_EXEC_ENABLE_EPT) != 0;
 		vmcs_write32(SECONDARY_VM_EXEC_CONTROL, exec_control);
 	}
 
@@ -10114,8 +10157,9 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
 	vmcs_writel(CR4_READ_SHADOW, nested_read_cr4(vmcs12));
 
 	/* shadow page tables on either EPT or shadow page tables */
-	kvm_set_cr3(vcpu, vmcs12->guest_cr3);
-	kvm_mmu_reset_context(vcpu);
+	if (nested_vmx_load_cr3(vcpu, vmcs12->guest_cr3, nested_ept_enabled,
+				entry_failure_code))
+		return 1;
 
 	if (!enable_ept)
 		vcpu->arch.walk_mmu->inject_page_fault = vmx_inject_page_fault_nested;
@@ -10132,6 +10176,7 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
 
 	kvm_register_write(vcpu, VCPU_REGS_RSP, vmcs12->guest_rsp);
 	kvm_register_write(vcpu, VCPU_REGS_RIP, vmcs12->guest_rip);
+	return 0;
 }
 
 /*
@@ -10146,6 +10191,7 @@ static int nested_vmx_run(struct kvm_vcpu *vcpu, bool launch)
 	struct loaded_vmcs *vmcs02;
 	bool ia32e;
 	u32 msr_entry_idx;
+	unsigned long exit_qualification;
 
 	if (!nested_vmx_check_permission(vcpu) ||
 	    !nested_vmx_check_vmcs12(vcpu))
@@ -10301,7 +10347,13 @@ static int nested_vmx_run(struct kvm_vcpu *vcpu, bool launch)
 
 	vmx_segment_cache_clear(vmx);
 
-	prepare_vmcs02(vcpu, vmcs12);
+	if (prepare_vmcs02(vcpu, vmcs12, &exit_qualification)) {
+		leave_guest_mode(vcpu);
+		vmx_load_vmcs01(vcpu);
+		nested_vmx_entry_failure(vcpu, vmcs12,
+				EXIT_REASON_INVALID_STATE, exit_qualification);
+		return 1;
+	}
 
 	msr_entry_idx = nested_vmx_load_msr(vcpu,
 					    vmcs12->vm_entry_msr_load_addr,
@@ -10633,6 +10685,7 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
 				   struct vmcs12 *vmcs12)
 {
 	struct kvm_segment seg;
+	unsigned long entry_failure_code;
 
 	if (vmcs12->vm_exit_controls & VM_EXIT_LOAD_IA32_EFER)
 		vcpu->arch.efer = vmcs12->host_ia32_efer;
@@ -10670,8 +10723,8 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
 
 	nested_ept_uninit_mmu_context(vcpu);
 
-	kvm_set_cr3(vcpu, vmcs12->host_cr3);
-	kvm_mmu_reset_context(vcpu);
+	if (nested_vmx_load_cr3(vcpu, vmcs12->host_cr3, false, &entry_failure_code))
+		nested_vmx_abort(vcpu, VMX_ABORT_LOAD_HOST_CR3_FAIL);
 
 	if (!enable_ept)
 		vcpu->arch.walk_mmu->inject_page_fault = kvm_inject_page_fault;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 78f6061..e4e06f9 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -573,7 +573,7 @@ int load_pdptrs(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, unsigned long cr3)
 }
 EXPORT_SYMBOL_GPL(load_pdptrs);
 
-static bool pdptrs_changed(struct kvm_vcpu *vcpu)
+bool pdptrs_changed(struct kvm_vcpu *vcpu)
 {
 	u64 pdpte[ARRAY_SIZE(vcpu->arch.walk_mmu->pdptrs)];
 	bool changed = true;
@@ -599,6 +599,7 @@ static bool pdptrs_changed(struct kvm_vcpu *vcpu)
 
 	return changed;
 }
+EXPORT_SYMBOL_GPL(pdptrs_changed);
 
 int kvm_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
 {
-- 
2.7.4


* Re: [PATCH v2] KVM: nVMX: fix CR3 load if L2 uses PAE paging and EPT
From: Radim Krčmář @ 2016-11-29 16:01 UTC
  To: Ladi Prosek; +Cc: kvm, pbonzini, bsd

2016-11-29 09:49+0100, Ladi Prosek:
> [...]
> 
> v1->v2:
> 
> * introduced nested_vmx_load_cr3 (more error handling, less MMU work)
> * also handles nested exit in addition to nested entry
> * amended commit description

There is a minor problem with the abort code in this patch ... and when you
send v3, please split it into more patches -- maybe it would be nicest if
you reused the patch from v1, just with the is_long_mode() fixup, and then
added patch(es) for error propagation from prepare_vmcs02 and the other cr3
checks.

(If you have other pressing issues, just the first fixed patch would be
 ok for the time being -- I can change it directly if you wish.)

> diff --git a/arch/x86/include/uapi/asm/vmx.h b/arch/x86/include/uapi/asm/vmx.h
> @@ -134,5 +134,6 @@
>  
>  #define VMX_ABORT_SAVE_GUEST_MSR_FAIL        1
>  #define VMX_ABORT_LOAD_HOST_MSR_FAIL         4
> +#define VMX_ABORT_LOAD_HOST_CR3_FAIL         5

Number 5 is suspicious ... SDM (27.7 VMX ABORTS) says it means:

  5. There was a machine-check event during VM exit (see Section 27.8).

You'll want number 2:

  2. Host checking of the page-directory-pointer-table entries (PDPTEs)
     failed (see Section 27.5.4).
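
A minimal sketch of the corresponding define (the name is only a
suggestion, mirroring the existing MSR ones):

 #define VMX_ABORT_SAVE_GUEST_MSR_FAIL        1
+#define VMX_ABORT_LOAD_HOST_PDPTE_FAIL       2
 #define VMX_ABORT_LOAD_HOST_MSR_FAIL         4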

> @@ -10670,8 +10723,8 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
>  
>  	nested_ept_uninit_mmu_context(vcpu);
>  
> -	kvm_set_cr3(vcpu, vmcs12->host_cr3);
> -	kvm_mmu_reset_context(vcpu);
> +	if (nested_vmx_load_cr3(vcpu, vmcs12->host_cr3, false, &entry_failure_code))
> +		nested_vmx_abort(vcpu, VMX_ABORT_LOAD_HOST_CR3_FAIL);

Only loading PDPTEs can fail, because host_cr3 cannot change and its
reserved bits are checked during VM entry, 26.2.2 (Checks on Host
Control Registers and MSRs).

Checking the host state during VM entry is slightly different and is
described in 26.2 (CHECKS ON VMX CONTROLS AND HOST-STATE AREA).
KVM uses nested_vmx_failValid() to report an error there, because the check
happens before we try to load any guest state, which means there is no VM
exit.
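
For illustration, an entry-time check along these lines (an untested
sketch; it reuses the cpuid_maxphyaddr() mask idea from your
nested_vmx_load_cr3) would reject a bad host_cr3 before any guest state
is loaded:

	if (vmcs12->host_cr3 >> cpuid_maxphyaddr(vcpu)) {
		nested_vmx_failValid(vcpu,
				     VMXERR_ENTRY_INVALID_HOST_STATE_FIELD);
		return 1;
	}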
