* [PATCH v6 00/15] Nested EPT
From: Gleb Natapov @ 2013-08-01 14:08 UTC
  To: kvm; +Cc: Xiao Guangrong, Jun Nakajima, Yang Zhang, pbonzini

Another day -- another version of the nested EPT patches. This version
includes a fix for need_remote_flush() with shadowed EPT, sets bits 6:8
of exit_qualification during an EPT violation, makes
update_permission_bitmask() work with shadowed EPT pages, and contains
other small adjustments according to review comments.

Gleb Natapov (3):
  nEPT: make guest's A/D bits depend on guest's paging mode
  nEPT: Support shadow paging for guest paging without A/D bits
  nEPT: correctly check if remote tlb flush is needed for shadowed EPT
    tables

Nadav Har'El (10):
  nEPT: Support LOAD_IA32_EFER entry/exit controls for L1
  nEPT: Fix cr3 handling in nested exit and entry
  nEPT: Fix wrong test in kvm_set_cr3
  nEPT: Move common code to paging_tmpl.h
  nEPT: Add EPT tables support to paging_tmpl.h
  nEPT: Nested INVEPT
  nEPT: MMU context for nested EPT
  nEPT: Advertise EPT to L1
  nEPT: Some additional comments
  nEPT: Miscellaneous cleanups

Yang Zhang (2):
  nEPT: Redefine EPT-specific link_shadow_page()
  nEPT: Add nEPT violation/misconfiguration support

 arch/x86/include/asm/kvm_host.h |    4 +
 arch/x86/include/asm/vmx.h      |    2 +
 arch/x86/include/uapi/asm/vmx.h |    1 +
 arch/x86/kvm/mmu.c              |  170 ++++++++++++++++++-------------
 arch/x86/kvm/mmu.h              |    2 +
 arch/x86/kvm/paging_tmpl.h      |  176 ++++++++++++++++++++++++++++----
 arch/x86/kvm/vmx.c              |  215 ++++++++++++++++++++++++++++++++++++---
 arch/x86/kvm/x86.c              |   11 --
 8 files changed, 462 insertions(+), 119 deletions(-)

-- 
1.7.10.4


* [PATCH v6 01/15] nEPT: Support LOAD_IA32_EFER entry/exit controls for L1
From: Gleb Natapov @ 2013-08-01 14:08 UTC
  To: kvm; +Cc: Xiao Guangrong, Jun Nakajima, Yang Zhang, pbonzini

From: Nadav Har'El <nyh@il.ibm.com>

Recent KVM, since http://kerneltrap.org/mailarchive/linux-kvm/2010/5/2/6261577,
switches the EFER MSR when EPT is used and the host and guest have different
NX bits. So if we add support for nested EPT (an L1 guest using EPT to run L2)
and want to be able to run recent KVM as L1, we need to allow L1 to use this
EFER switching feature.

To do this EFER switching, KVM uses VM_ENTRY/EXIT_LOAD_IA32_EFER if available,
and if it isn't, it uses the generic VM_ENTRY/EXIT_MSR_LOAD. This patch adds
support for the former (the latter is still unsupported).

Nested entry and exit emulation (prepare_vmcs02 and load_vmcs12_host_state,
respectively) already handled VM_ENTRY/EXIT_LOAD_IA32_EFER correctly. So all
that's left to do in this patch is to properly advertise this feature to L1.

Note that vmcs12's VM_ENTRY/EXIT_LOAD_IA32_EFER are emulated by L0, by using
vmx_set_efer (which itself sets one of several vmcs02 fields), so we always
support this feature, regardless of whether the host supports it.
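
For readers new to the nested VMX capability plumbing, here is a minimal
sketch of what nested_vmx_setup_ctls_msrs() does with each capability
MSR, and of why an L0-emulated control can be offered unconditionally
(sketch_setup_entry_ctls() is a hypothetical name, not code from this
patch):

  /* Each VMX capability MSR reports an "allowed-0" (low) and an
   * "allowed-1" (high) half.  The high half is masked down to what
   * we can emulate for L1, then controls that L0 emulates entirely
   * in software, such as VM_ENTRY_LOAD_IA32_EFER, are ORed in. */
  static u32 ctls_low, ctls_high;

  static void sketch_setup_entry_ctls(void)
  {
          rdmsr(MSR_IA32_VMX_ENTRY_CTLS, ctls_low, ctls_high);
          /* keep only controls we know how to emulate for L1 */
          ctls_high &= VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_IA32E_MODE;
          /* emulated by L0 via vmx_set_efer(), so always offered */
          ctls_high |= VM_ENTRY_LOAD_IA32_EFER;
  }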

Reviewed-by: Orit Wasserman <owasserm@redhat.com>
Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
 arch/x86/kvm/vmx.c |   23 ++++++++++++++++-------
 1 file changed, 16 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index e999dc7..27efa6a 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2198,7 +2198,8 @@ static __init void nested_vmx_setup_ctls_msrs(void)
 #else
 	nested_vmx_exit_ctls_high = 0;
 #endif
-	nested_vmx_exit_ctls_high |= VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR;
+	nested_vmx_exit_ctls_high |= (VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR |
+				      VM_EXIT_LOAD_IA32_EFER);
 
 	/* entry controls */
 	rdmsr(MSR_IA32_VMX_ENTRY_CTLS,
@@ -2207,8 +2208,8 @@ static __init void nested_vmx_setup_ctls_msrs(void)
 	nested_vmx_entry_ctls_low = VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR;
 	nested_vmx_entry_ctls_high &=
 		VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_IA32E_MODE;
-	nested_vmx_entry_ctls_high |= VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR;
-
+	nested_vmx_entry_ctls_high |= (VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR |
+				       VM_ENTRY_LOAD_IA32_EFER);
 	/* cpu-based controls */
 	rdmsr(MSR_IA32_VMX_PROCBASED_CTLS,
 		nested_vmx_procbased_ctls_low, nested_vmx_procbased_ctls_high);
@@ -7529,10 +7530,18 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
 	vcpu->arch.cr0_guest_owned_bits &= ~vmcs12->cr0_guest_host_mask;
 	vmcs_writel(CR0_GUEST_HOST_MASK, ~vcpu->arch.cr0_guest_owned_bits);
 
-	/* Note: IA32_MODE, LOAD_IA32_EFER are modified by vmx_set_efer below */
-	vmcs_write32(VM_EXIT_CONTROLS,
-		vmcs12->vm_exit_controls | vmcs_config.vmexit_ctrl);
-	vmcs_write32(VM_ENTRY_CONTROLS, vmcs12->vm_entry_controls |
+	/* L2->L1 exit controls are emulated - the hardware exit is to L0 so
+	 * we should use its exit controls. Note that VM_EXIT_LOAD_IA32_EFER
+	 * bits are further modified by vmx_set_efer() below.
+	 */
+	vmcs_write32(VM_EXIT_CONTROLS, vmcs_config.vmexit_ctrl);
+
+	/* vmcs12's VM_ENTRY_LOAD_IA32_EFER and VM_ENTRY_IA32E_MODE are
+	 * emulated by vmx_set_efer(), below.
+	 */
+	vmcs_write32(VM_ENTRY_CONTROLS,
+		(vmcs12->vm_entry_controls & ~VM_ENTRY_LOAD_IA32_EFER &
+			~VM_ENTRY_IA32E_MODE) |
 		(vmcs_config.vmentry_ctrl & ~VM_ENTRY_IA32E_MODE));
 
 	if (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_PAT)
-- 
1.7.10.4


* [PATCH v6 02/15] nEPT: Fix cr3 handling in nested exit and entry
From: Gleb Natapov @ 2013-08-01 14:08 UTC
  To: kvm; +Cc: Xiao Guangrong, Jun Nakajima, Yang Zhang, pbonzini

From: Nadav Har'El <nyh@il.ibm.com>

The existing code for handling cr3 and related VMCS fields during nested
exit and entry wasn't correct in all cases:

If L2 is allowed to control cr3 (and this is indeed the case in nested EPT),
during nested exit we must copy the modified cr3 from vmcs02 to vmcs12, and
we forgot to do so. This patch adds this copy.

If L0 isn't controlling cr3 when running L2 (i.e., L0 is using EPT), and
whoever does control cr3 (L1 or L2) is using PAE, the processor might have
saved PDPTEs, and we should also save them in vmcs12 (and restore them later).
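
The direction of each copy matters, so here is a sketch of the two
halves this patch adds (an illustration of the hunks below, not
additional code):

  /* nested exit (prepare_vmcs12): the hardware, or L2 itself, wrote
   * these into vmcs02, so reflect them back for L1 to see.  Only
   * valid when L0 uses EPT; otherwise GUEST_CR3 was defined by L0
   * and must not leak into vmcs12. */
  if (enable_ept) {
          vmcs12->guest_cr3    = vmcs_read64(GUEST_CR3);
          vmcs12->guest_pdptr0 = vmcs_read64(GUEST_PDPTR0);
          /* ... guest_pdptr1..3 likewise ... */
  }

  /* nested entry (prepare_vmcs02): reload what L1 expects L2 to run
   * with, including any PDPTEs L1 set up. */
  if (enable_ept)
          vmcs_write64(GUEST_PDPTR0, vmcs12->guest_pdptr0);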

Reviewed-by: Orit Wasserman <owasserm@redhat.com>
Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
 arch/x86/kvm/vmx.c |   26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 27efa6a..4e98764 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -7595,6 +7595,16 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
 	kvm_set_cr3(vcpu, vmcs12->guest_cr3);
 	kvm_mmu_reset_context(vcpu);
 
+	/*
+	 * L1 may access the L2's PDPTR, so save them to construct vmcs12
+	 */
+	if (enable_ept) {
+		vmcs_write64(GUEST_PDPTR0, vmcs12->guest_pdptr0);
+		vmcs_write64(GUEST_PDPTR1, vmcs12->guest_pdptr1);
+		vmcs_write64(GUEST_PDPTR2, vmcs12->guest_pdptr2);
+		vmcs_write64(GUEST_PDPTR3, vmcs12->guest_pdptr3);
+	}
+
 	kvm_register_write(vcpu, VCPU_REGS_RSP, vmcs12->guest_rsp);
 	kvm_register_write(vcpu, VCPU_REGS_RIP, vmcs12->guest_rip);
 }
@@ -7917,6 +7927,22 @@ static void prepare_vmcs12(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
 	vmcs12->guest_pending_dbg_exceptions =
 		vmcs_readl(GUEST_PENDING_DBG_EXCEPTIONS);
 
+	/*
+	 * In some cases (usually, nested EPT), L2 is allowed to change its
+	 * own CR3 without exiting. If it has changed it, we must keep it.
+	 * Of course, if L0 is using shadow page tables, GUEST_CR3 was defined
+	 * by L0, not L1 or L2, so we mustn't unconditionally copy it to vmcs12.
+	 *
+	 * Additionally, restore L2's PDPTR to vmcs12.
+	 */
+	if (enable_ept) {
+		vmcs12->guest_cr3 = vmcs_read64(GUEST_CR3);
+		vmcs12->guest_pdptr0 = vmcs_read64(GUEST_PDPTR0);
+		vmcs12->guest_pdptr1 = vmcs_read64(GUEST_PDPTR1);
+		vmcs12->guest_pdptr2 = vmcs_read64(GUEST_PDPTR2);
+		vmcs12->guest_pdptr3 = vmcs_read64(GUEST_PDPTR3);
+	}
+
 	vmcs12->vm_entry_controls =
 		(vmcs12->vm_entry_controls & ~VM_ENTRY_IA32E_MODE) |
 		(vmcs_read32(VM_ENTRY_CONTROLS) & VM_ENTRY_IA32E_MODE);
-- 
1.7.10.4


* [PATCH v6 03/15] nEPT: Fix wrong test in kvm_set_cr3
From: Gleb Natapov @ 2013-08-01 14:08 UTC
  To: kvm; +Cc: Xiao Guangrong, Jun Nakajima, Yang Zhang, pbonzini

From: Nadav Har'El <nyh@il.ibm.com>

kvm_set_cr3() attempts to check if the new cr3 is a valid guest physical
address. The problem is that with nested EPT, cr3 is an *L2* physical
address, not an L1 physical address as this test expects.

As the comment above this test explains, it isn't necessary, and doesn't
correspond to anything a real processor would do. So this patch removes it.

Note that this wrong test could have also theoretically caused problems
in nested NPT, not just in nested EPT. However, in practice, the problem
was avoided: nested_svm_vmexit()/vmrun() do not call kvm_set_cr3 in the
nested NPT case, and instead set the vmcb (and arch.cr3) directly, thus
circumventing the problem. Additional potential calls to the buggy function
are avoided in that we don't trap cr3 modifications when nested NPT is
enabled. However, because in nested VMX we did want to use kvm_set_cr3()
(as requested in Avi Kivity's review of the original nested VMX patches),
we can't avoid this problem and need to fix it.
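
For reference, the removed check, annotated with why it misfires (the
hunk below shows the same lines in context):

  /* The test asked "does cr3 point at L1 guest memory?", but under
   * nested EPT the value L2 loads into cr3 is an L2-physical
   * address, so probing L1's memslots with it is meaningless and
   * can fail for perfectly valid values: */
  if (unlikely(!gfn_to_memslot(vcpu->kvm, cr3 >> PAGE_SHIFT)))
          return 1;        /* wrong: cr3 may be an L2 gpa */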

Reviewed-by: Orit Wasserman <owasserm@redhat.com>
Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
 arch/x86/kvm/x86.c |   11 -----------
 1 file changed, 11 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d2caeb9..e2fef8b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -682,17 +682,6 @@ int kvm_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3)
 		 */
 	}
 
-	/*
-	 * Does the new cr3 value map to physical memory? (Note, we
-	 * catch an invalid cr3 even in real-mode, because it would
-	 * cause trouble later on when we turn on paging anyway.)
-	 *
-	 * A real CPU would silently accept an invalid cr3 and would
-	 * attempt to use it - with largely undefined (and often hard
-	 * to debug) behavior on the guest side.
-	 */
-	if (unlikely(!gfn_to_memslot(vcpu->kvm, cr3 >> PAGE_SHIFT)))
-		return 1;
 	vcpu->arch.cr3 = cr3;
 	__set_bit(VCPU_EXREG_CR3, (ulong *)&vcpu->arch.regs_avail);
 	vcpu->arch.mmu.new_cr3(vcpu);
-- 
1.7.10.4


* [PATCH v6 04/15] nEPT: Move common code to paging_tmpl.h
From: Gleb Natapov @ 2013-08-01 14:08 UTC
  To: kvm; +Cc: Xiao Guangrong, Jun Nakajima, Yang Zhang, pbonzini

From: Nadav Har'El <nyh@il.ibm.com>

In preparation, we just move gpte_access(), prefetch_invalid_gpte(),
is_rsvd_bits_set(), protect_clean_gpte() and is_dirty_gpte() from mmu.c
to paging_tmpl.h.
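
For readers who have not looked at paging_tmpl.h before, a compressed
sketch of the FNAME() template mechanism these helpers are moving into
(the real header defines many more per-mode aliases than shown here):

  #if PTTYPE == 64
          #define FNAME(name) paging64_##name
  #elif PTTYPE == 32
          #define FNAME(name) paging32_##name
  #endif

  /* one definition in the template ... */
  static bool FNAME(is_rsvd_bits_set)(struct kvm_mmu *mmu, u64 gpte,
                                      int level);
  /* ... becomes paging64_is_rsvd_bits_set() or
   * paging32_is_rsvd_bits_set(), depending on the PTTYPE that mmu.c
   * defined before including the header. */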

Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
 arch/x86/kvm/mmu.c         |   55 ------------------------------
 arch/x86/kvm/paging_tmpl.h |   80 +++++++++++++++++++++++++++++++++++++-------
 2 files changed, 68 insertions(+), 67 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 3a9493a..4c4274d 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -331,11 +331,6 @@ static int is_large_pte(u64 pte)
 	return pte & PT_PAGE_SIZE_MASK;
 }
 
-static int is_dirty_gpte(unsigned long pte)
-{
-	return pte & PT_DIRTY_MASK;
-}
-
 static int is_rmap_spte(u64 pte)
 {
 	return is_shadow_present_pte(pte);
@@ -2574,14 +2569,6 @@ static void nonpaging_new_cr3(struct kvm_vcpu *vcpu)
 	mmu_free_roots(vcpu);
 }
 
-static bool is_rsvd_bits_set(struct kvm_mmu *mmu, u64 gpte, int level)
-{
-	int bit7;
-
-	bit7 = (gpte >> 7) & 1;
-	return (gpte & mmu->rsvd_bits_mask[bit7][level-1]) != 0;
-}
-
 static pfn_t pte_prefetch_gfn_to_pfn(struct kvm_vcpu *vcpu, gfn_t gfn,
 				     bool no_dirty_log)
 {
@@ -2594,26 +2581,6 @@ static pfn_t pte_prefetch_gfn_to_pfn(struct kvm_vcpu *vcpu, gfn_t gfn,
 	return gfn_to_pfn_memslot_atomic(slot, gfn);
 }
 
-static bool prefetch_invalid_gpte(struct kvm_vcpu *vcpu,
-				  struct kvm_mmu_page *sp, u64 *spte,
-				  u64 gpte)
-{
-	if (is_rsvd_bits_set(&vcpu->arch.mmu, gpte, PT_PAGE_TABLE_LEVEL))
-		goto no_present;
-
-	if (!is_present_gpte(gpte))
-		goto no_present;
-
-	if (!(gpte & PT_ACCESSED_MASK))
-		goto no_present;
-
-	return false;
-
-no_present:
-	drop_spte(vcpu->kvm, spte);
-	return true;
-}
-
 static int direct_pte_prefetch_many(struct kvm_vcpu *vcpu,
 				    struct kvm_mmu_page *sp,
 				    u64 *start, u64 *end)
@@ -3501,18 +3468,6 @@ static void paging_free(struct kvm_vcpu *vcpu)
 	nonpaging_free(vcpu);
 }
 
-static inline void protect_clean_gpte(unsigned *access, unsigned gpte)
-{
-	unsigned mask;
-
-	BUILD_BUG_ON(PT_WRITABLE_MASK != ACC_WRITE_MASK);
-
-	mask = (unsigned)~ACC_WRITE_MASK;
-	/* Allow write access to dirty gptes */
-	mask |= (gpte >> (PT_DIRTY_SHIFT - PT_WRITABLE_SHIFT)) & PT_WRITABLE_MASK;
-	*access &= mask;
-}
-
 static bool sync_mmio_spte(struct kvm *kvm, u64 *sptep, gfn_t gfn,
 			   unsigned access, int *nr_present)
 {
@@ -3530,16 +3485,6 @@ static bool sync_mmio_spte(struct kvm *kvm, u64 *sptep, gfn_t gfn,
 	return false;
 }
 
-static inline unsigned gpte_access(struct kvm_vcpu *vcpu, u64 gpte)
-{
-	unsigned access;
-
-	access = (gpte & (PT_WRITABLE_MASK | PT_USER_MASK)) | ACC_EXEC_MASK;
-	access &= ~(gpte >> PT64_NX_SHIFT);
-
-	return access;
-}
-
 static inline bool is_last_gpte(struct kvm_mmu *mmu, unsigned level, unsigned gpte)
 {
 	unsigned index;
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 7769699..fb26ca9 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -80,6 +80,31 @@ static gfn_t gpte_to_gfn_lvl(pt_element_t gpte, int lvl)
 	return (gpte & PT_LVL_ADDR_MASK(lvl)) >> PAGE_SHIFT;
 }
 
+static inline void FNAME(protect_clean_gpte)(unsigned *access, unsigned gpte)
+{
+	unsigned mask;
+
+	BUILD_BUG_ON(PT_WRITABLE_MASK != ACC_WRITE_MASK);
+
+	mask = (unsigned)~ACC_WRITE_MASK;
+	/* Allow write access to dirty gptes */
+	mask |= (gpte >> (PT_DIRTY_SHIFT - PT_WRITABLE_SHIFT)) & PT_WRITABLE_MASK;
+	*access &= mask;
+}
+
+static bool FNAME(is_rsvd_bits_set)(struct kvm_mmu *mmu, u64 gpte, int level)
+{
+	int bit7;
+
+	bit7 = (gpte >> 7) & 1;
+	return (gpte & mmu->rsvd_bits_mask[bit7][level-1]) != 0;
+}
+
+static inline int FNAME(is_present_gpte)(unsigned long pte)
+{
+	return is_present_gpte(pte);
+}
+
 static int FNAME(cmpxchg_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 			       pt_element_t __user *ptep_user, unsigned index,
 			       pt_element_t orig_pte, pt_element_t new_pte)
@@ -103,6 +128,36 @@ static int FNAME(cmpxchg_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 	return (ret != orig_pte);
 }
 
+static bool FNAME(prefetch_invalid_gpte)(struct kvm_vcpu *vcpu,
+				  struct kvm_mmu_page *sp, u64 *spte,
+				  u64 gpte)
+{
+	if (FNAME(is_rsvd_bits_set)(&vcpu->arch.mmu, gpte, PT_PAGE_TABLE_LEVEL))
+		goto no_present;
+
+	if (!FNAME(is_present_gpte)(gpte))
+		goto no_present;
+
+	if (!(gpte & PT_ACCESSED_MASK))
+		goto no_present;
+
+	return false;
+
+no_present:
+	drop_spte(vcpu->kvm, spte);
+	return true;
+}
+
+static inline unsigned FNAME(gpte_access)(struct kvm_vcpu *vcpu, u64 gpte)
+{
+	unsigned access;
+
+	access = (gpte & (PT_WRITABLE_MASK | PT_USER_MASK)) | ACC_EXEC_MASK;
+	access &= ~(gpte >> PT64_NX_SHIFT);
+
+	return access;
+}
+
 static int FNAME(update_accessed_dirty_bits)(struct kvm_vcpu *vcpu,
 					     struct kvm_mmu *mmu,
 					     struct guest_walker *walker,
@@ -123,7 +178,8 @@ static int FNAME(update_accessed_dirty_bits)(struct kvm_vcpu *vcpu,
 			trace_kvm_mmu_set_accessed_bit(table_gfn, index, sizeof(pte));
 			pte |= PT_ACCESSED_MASK;
 		}
-		if (level == walker->level && write_fault && !is_dirty_gpte(pte)) {
+		if (level == walker->level && write_fault &&
+				!(pte & PT_DIRTY_MASK)) {
 			trace_kvm_mmu_set_dirty_bit(table_gfn, index, sizeof(pte));
 			pte |= PT_DIRTY_MASK;
 		}
@@ -170,7 +226,7 @@ retry_walk:
 	if (walker->level == PT32E_ROOT_LEVEL) {
 		pte = mmu->get_pdptr(vcpu, (addr >> 30) & 3);
 		trace_kvm_mmu_paging_element(pte, walker->level);
-		if (!is_present_gpte(pte))
+		if (!FNAME(is_present_gpte)(pte))
 			goto error;
 		--walker->level;
 	}
@@ -215,17 +271,17 @@ retry_walk:
 
 		trace_kvm_mmu_paging_element(pte, walker->level);
 
-		if (unlikely(!is_present_gpte(pte)))
+		if (unlikely(!FNAME(is_present_gpte)(pte)))
 			goto error;
 
-		if (unlikely(is_rsvd_bits_set(&vcpu->arch.mmu, pte,
+		if (unlikely(FNAME(is_rsvd_bits_set)(&vcpu->arch.mmu, pte,
 					      walker->level))) {
 			errcode |= PFERR_RSVD_MASK | PFERR_PRESENT_MASK;
 			goto error;
 		}
 
 		accessed_dirty &= pte;
-		pte_access = pt_access & gpte_access(vcpu, pte);
+		pte_access = pt_access & FNAME(gpte_access)(vcpu, pte);
 
 		walker->ptes[walker->level - 1] = pte;
 	} while (!is_last_gpte(mmu, walker->level, pte));
@@ -248,7 +304,7 @@ retry_walk:
 	walker->gfn = real_gpa >> PAGE_SHIFT;
 
 	if (!write_fault)
-		protect_clean_gpte(&pte_access, pte);
+		FNAME(protect_clean_gpte)(&pte_access, pte);
 	else
 		/*
 		 * On a write fault, fold the dirty bit into accessed_dirty by
@@ -309,14 +365,14 @@ FNAME(prefetch_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 	gfn_t gfn;
 	pfn_t pfn;
 
-	if (prefetch_invalid_gpte(vcpu, sp, spte, gpte))
+	if (FNAME(prefetch_invalid_gpte)(vcpu, sp, spte, gpte))
 		return false;
 
 	pgprintk("%s: gpte %llx spte %p\n", __func__, (u64)gpte, spte);
 
 	gfn = gpte_to_gfn(gpte);
-	pte_access = sp->role.access & gpte_access(vcpu, gpte);
-	protect_clean_gpte(&pte_access, gpte);
+	pte_access = sp->role.access & FNAME(gpte_access)(vcpu, gpte);
+	FNAME(protect_clean_gpte)(&pte_access, gpte);
 	pfn = pte_prefetch_gfn_to_pfn(vcpu, gfn,
 			no_dirty_log && (pte_access & ACC_WRITE_MASK));
 	if (is_error_pfn(pfn))
@@ -785,15 +841,15 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
 					  sizeof(pt_element_t)))
 			return -EINVAL;
 
-		if (prefetch_invalid_gpte(vcpu, sp, &sp->spt[i], gpte)) {
+		if (FNAME(prefetch_invalid_gpte)(vcpu, sp, &sp->spt[i], gpte)) {
 			vcpu->kvm->tlbs_dirty++;
 			continue;
 		}
 
 		gfn = gpte_to_gfn(gpte);
 		pte_access = sp->role.access;
-		pte_access &= gpte_access(vcpu, gpte);
-		protect_clean_gpte(&pte_access, gpte);
+		pte_access &= FNAME(gpte_access)(vcpu, gpte);
+		FNAME(protect_clean_gpte)(&pte_access, gpte);
 
 		if (sync_mmio_spte(vcpu->kvm, &sp->spt[i], gfn, pte_access,
 		      &nr_present))
-- 
1.7.10.4


* [PATCH v6 05/15] nEPT: make guest's A/D bits depend on guest's paging mode
From: Gleb Natapov @ 2013-08-01 14:08 UTC
  To: kvm; +Cc: Xiao Guangrong, Jun Nakajima, Yang Zhang, pbonzini

This patch makes the definition of the guest's A/D bits depend on the
paging mode, so that when EPT support is added it will be possible to
define them differently.
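
The change is mechanical; the pattern, sketched from the hunks below,
is that the template now names the A/D bits through PT_GUEST_* aliases,
which for the existing 32-bit and 64-bit modes simply map back to the
architectural bits:

  #define PT_GUEST_ACCESSED_MASK PT_ACCESSED_MASK   /* PTTYPE 32/64 */
  #define PT_GUEST_DIRTY_MASK    PT_DIRTY_MASK

  /* template code then tests and sets the alias instead: */
  if (!(pte & PT_GUEST_ACCESSED_MASK))
          pte |= PT_GUEST_ACCESSED_MASK;
  /* so a later EPT instantiation can supply different values */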

Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
 arch/x86/kvm/paging_tmpl.h |   30 ++++++++++++++++++++++--------
 1 file changed, 22 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index fb26ca9..7581395 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -32,6 +32,10 @@
 	#define PT_LVL_OFFSET_MASK(lvl) PT64_LVL_OFFSET_MASK(lvl)
 	#define PT_INDEX(addr, level) PT64_INDEX(addr, level)
 	#define PT_LEVEL_BITS PT64_LEVEL_BITS
+	#define PT_GUEST_ACCESSED_MASK PT_ACCESSED_MASK
+	#define PT_GUEST_DIRTY_MASK PT_DIRTY_MASK
+	#define PT_GUEST_DIRTY_SHIFT PT_DIRTY_SHIFT
+	#define PT_GUEST_ACCESSED_SHIFT PT_ACCESSED_SHIFT
 	#ifdef CONFIG_X86_64
 	#define PT_MAX_FULL_LEVELS 4
 	#define CMPXCHG cmpxchg
@@ -49,6 +53,10 @@
 	#define PT_INDEX(addr, level) PT32_INDEX(addr, level)
 	#define PT_LEVEL_BITS PT32_LEVEL_BITS
 	#define PT_MAX_FULL_LEVELS 2
+	#define PT_GUEST_ACCESSED_MASK PT_ACCESSED_MASK
+	#define PT_GUEST_DIRTY_MASK PT_DIRTY_MASK
+	#define PT_GUEST_DIRTY_SHIFT PT_DIRTY_SHIFT
+	#define PT_GUEST_ACCESSED_SHIFT PT_ACCESSED_SHIFT
 	#define CMPXCHG cmpxchg
 #else
 	#error Invalid PTTYPE value
@@ -88,7 +96,8 @@ static inline void FNAME(protect_clean_gpte)(unsigned *access, unsigned gpte)
 
 	mask = (unsigned)~ACC_WRITE_MASK;
 	/* Allow write access to dirty gptes */
-	mask |= (gpte >> (PT_DIRTY_SHIFT - PT_WRITABLE_SHIFT)) & PT_WRITABLE_MASK;
+	mask |= (gpte >> (PT_GUEST_DIRTY_SHIFT - PT_WRITABLE_SHIFT)) &
+		PT_WRITABLE_MASK;
 	*access &= mask;
 }
 
@@ -138,7 +147,7 @@ static bool FNAME(prefetch_invalid_gpte)(struct kvm_vcpu *vcpu,
 	if (!FNAME(is_present_gpte)(gpte))
 		goto no_present;
 
-	if (!(gpte & PT_ACCESSED_MASK))
+	if (!(gpte & PT_GUEST_ACCESSED_MASK))
 		goto no_present;
 
 	return false;
@@ -174,14 +183,14 @@ static int FNAME(update_accessed_dirty_bits)(struct kvm_vcpu *vcpu,
 		table_gfn = walker->table_gfn[level - 1];
 		ptep_user = walker->ptep_user[level - 1];
 		index = offset_in_page(ptep_user) / sizeof(pt_element_t);
-		if (!(pte & PT_ACCESSED_MASK)) {
+		if (!(pte & PT_GUEST_ACCESSED_MASK)) {
 			trace_kvm_mmu_set_accessed_bit(table_gfn, index, sizeof(pte));
-			pte |= PT_ACCESSED_MASK;
+			pte |= PT_GUEST_ACCESSED_MASK;
 		}
 		if (level == walker->level && write_fault &&
-				!(pte & PT_DIRTY_MASK)) {
+				!(pte & PT_GUEST_DIRTY_MASK)) {
 			trace_kvm_mmu_set_dirty_bit(table_gfn, index, sizeof(pte));
-			pte |= PT_DIRTY_MASK;
+			pte |= PT_GUEST_DIRTY_MASK;
 		}
 		if (pte == orig_pte)
 			continue;
@@ -235,7 +244,7 @@ retry_walk:
 	ASSERT((!is_long_mode(vcpu) && is_pae(vcpu)) ||
 	       (mmu->get_cr3(vcpu) & CR3_NONPAE_RESERVED_BITS) == 0);
 
-	accessed_dirty = PT_ACCESSED_MASK;
+	accessed_dirty = PT_GUEST_ACCESSED_MASK;
 	pt_access = pte_access = ACC_ALL;
 	++walker->level;
 
@@ -310,7 +319,8 @@ retry_walk:
 		 * On a write fault, fold the dirty bit into accessed_dirty by
 		 * shifting it one place right.
 		 */
-		accessed_dirty &= pte >> (PT_DIRTY_SHIFT - PT_ACCESSED_SHIFT);
+		accessed_dirty &= pte >>
+			(PT_GUEST_DIRTY_SHIFT - PT_GUEST_ACCESSED_SHIFT);
 
 	if (unlikely(!accessed_dirty)) {
 		ret = FNAME(update_accessed_dirty_bits)(vcpu, mmu, walker, write_fault);
@@ -886,3 +896,7 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
 #undef gpte_to_gfn
 #undef gpte_to_gfn_lvl
 #undef CMPXCHG
+#undef PT_GUEST_ACCESSED_MASK
+#undef PT_GUEST_DIRTY_MASK
+#undef PT_GUEST_DIRTY_SHIFT
+#undef PT_GUEST_ACCESSED_SHIFT
-- 
1.7.10.4


* [PATCH v6 06/15] nEPT: Support shadow paging for guest paging without A/D bits
From: Gleb Natapov @ 2013-08-01 14:08 UTC
  To: kvm; +Cc: Xiao Guangrong, Jun Nakajima, Yang Zhang, pbonzini

Some guest paging modes do not support A/D bits. Add support for such
modes in the shadow paging code. For such modes, PT_GUEST_DIRTY_MASK,
PT_GUEST_ACCESSED_MASK, PT_GUEST_DIRTY_SHIFT and PT_GUEST_ACCESSED_SHIFT
should be set to zero.
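
Because the masks are compile-time constants in each instantiation, the
guards cost nothing; a simplified sketch of the shape of the guard
added below (do_real_ad_walk() is a hypothetical stand-in for the body
that follows the guard):

  static int FNAME(update_accessed_dirty_bits)(struct kvm_vcpu *vcpu)
  {
          /* with "#define PT_GUEST_DIRTY_MASK 0" this condition is a
           * constant, so the compiler drops the rest of the function */
          if (!PT_GUEST_DIRTY_MASK)
                  return 0;
          return do_real_ad_walk(vcpu);
  }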

Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
 arch/x86/kvm/paging_tmpl.h |   16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 7581395..2c2f635 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -92,6 +92,10 @@ static inline void FNAME(protect_clean_gpte)(unsigned *access, unsigned gpte)
 {
 	unsigned mask;
 
+	/* dirty bit is not supported, so no need to track it */
+	if (!PT_GUEST_DIRTY_MASK)
+		return;
+
 	BUILD_BUG_ON(PT_WRITABLE_MASK != ACC_WRITE_MASK);
 
 	mask = (unsigned)~ACC_WRITE_MASK;
@@ -147,7 +151,8 @@ static bool FNAME(prefetch_invalid_gpte)(struct kvm_vcpu *vcpu,
 	if (!FNAME(is_present_gpte)(gpte))
 		goto no_present;
 
-	if (!(gpte & PT_GUEST_ACCESSED_MASK))
+	/* if accessed bit is not supported prefetch non accessed gpte */
+	if (PT_GUEST_ACCESSED_MASK && !(gpte & PT_GUEST_ACCESSED_MASK))
 		goto no_present;
 
 	return false;
@@ -178,6 +183,10 @@ static int FNAME(update_accessed_dirty_bits)(struct kvm_vcpu *vcpu,
 	gfn_t table_gfn;
 	int ret;
 
+	/* dirty/accessed bits are not supported, so no need to update them */
+	if (!PT_GUEST_DIRTY_MASK)
+		return 0;
+
 	for (level = walker->max_level; level >= walker->level; --level) {
 		pte = orig_pte = walker->ptes[level - 1];
 		table_gfn = walker->table_gfn[level - 1];
@@ -316,8 +325,9 @@ retry_walk:
 		FNAME(protect_clean_gpte)(&pte_access, pte);
 	else
 		/*
-		 * On a write fault, fold the dirty bit into accessed_dirty by
-		 * shifting it one place right.
+		 * On a write fault, fold the dirty bit into accessed_dirty.
+		 * For modes without A/D bits support accessed_dirty will be
+		 * always clear.
 		 */
 		accessed_dirty &= pte >>
 			(PT_GUEST_DIRTY_SHIFT - PT_GUEST_ACCESSED_SHIFT);
-- 
1.7.10.4


* [PATCH v6 07/15] nEPT: Add EPT tables support to paging_tmpl.h
From: Gleb Natapov @ 2013-08-01 14:08 UTC
  To: kvm; +Cc: Xiao Guangrong, Jun Nakajima, Yang Zhang, pbonzini

From: Nadav Har'El <nyh@il.ibm.com>

This is the first patch in a series which adds nested EPT support to KVM's
nested VMX. Nested EPT means emulating EPT for an L1 guest so that L1 can use
EPT when running a nested guest L2. When L1 uses EPT, it allows the L2 guest
to set its own cr3 and take its own page faults without either of L0 or L1
getting involved. This often significantly improves L2's performance over the
previous two alternatives (shadow page tables over EPT, and shadow page
tables over shadow page tables).

This patch adds EPT support to paging_tmpl.h.

paging_tmpl.h contains the code for reading and writing page tables. The code
for 32-bit and 64-bit tables is very similar, but not identical, so
paging_tmpl.h is #include'd twice in mmu.c, once with PTTYPE=32 and once
with PTTYPE=64, and this generates the two sets of similar functions.

There are subtle but important differences between the format of EPT tables
and that of ordinary x86 64-bit page tables, so for nested EPT we need a
third set of functions to read the guest EPT table and to write the shadow
EPT table.

So this patch adds a third PTTYPE, PTTYPE_EPT, which creates functions
(prefixed with "ept_") which correctly read and write EPT tables.
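
Concretely, mmu.c now instantiates the template three times; a sketch
of the include dance (PTTYPE_EPT's value is arbitrary, it only has to
differ from 32 and 64):

  #define PTTYPE_EPT 18           /* arbitrary tag */
  #define PTTYPE PTTYPE_EPT
  #include "paging_tmpl.h"        /* generates ept_walk_addr() etc. */
  #undef PTTYPE

  #define PTTYPE 64
  #include "paging_tmpl.h"        /* generates paging64_* */
  #undef PTTYPE

  #define PTTYPE 32
  #include "paging_tmpl.h"        /* generates paging32_* */
  #undef PTTYPE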

Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
 arch/x86/kvm/mmu.c         |    5 +++++
 arch/x86/kvm/paging_tmpl.h |   37 ++++++++++++++++++++++++++++++++++++-
 2 files changed, 41 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 4c4274d..b5273c3 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3494,6 +3494,11 @@ static inline bool is_last_gpte(struct kvm_mmu *mmu, unsigned level, unsigned gp
 	return mmu->last_pte_bitmap & (1 << index);
 }
 
+#define PTTYPE_EPT 18 /* arbitrary */
+#define PTTYPE PTTYPE_EPT
+#include "paging_tmpl.h"
+#undef PTTYPE
+
 #define PTTYPE 64
 #include "paging_tmpl.h"
 #undef PTTYPE
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 2c2f635..762c904 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -23,6 +23,13 @@
  * so the code in this file is compiled twice, once per pte size.
  */
 
+/*
+ * This is used to catch non optimized PT_GUEST_(DIRTY|ACCESS)_SHIFT macro
+ * uses for EPT without A/D paging type.
+ */
+extern u64 __pure __using_nonexistent_pte_bit(void)
+	       __compiletime_error("wrong use of PT_GUEST_(DIRTY|ACCESS)_SHIFT");
+
 #if PTTYPE == 64
 	#define pt_element_t u64
 	#define guest_walker guest_walker64
@@ -58,6 +65,21 @@
 	#define PT_GUEST_DIRTY_SHIFT PT_DIRTY_SHIFT
 	#define PT_GUEST_ACCESSED_SHIFT PT_ACCESSED_SHIFT
 	#define CMPXCHG cmpxchg
+#elif PTTYPE == PTTYPE_EPT
+	#define pt_element_t u64
+	#define guest_walker guest_walkerEPT
+	#define FNAME(name) ept_##name
+	#define PT_BASE_ADDR_MASK PT64_BASE_ADDR_MASK
+	#define PT_LVL_ADDR_MASK(lvl) PT64_LVL_ADDR_MASK(lvl)
+	#define PT_LVL_OFFSET_MASK(lvl) PT64_LVL_OFFSET_MASK(lvl)
+	#define PT_INDEX(addr, level) PT64_INDEX(addr, level)
+	#define PT_LEVEL_BITS PT64_LEVEL_BITS
+	#define PT_GUEST_ACCESSED_MASK 0
+	#define PT_GUEST_DIRTY_MASK 0
+	#define PT_GUEST_DIRTY_SHIFT __using_nonexistent_pte_bit()
+	#define PT_GUEST_ACCESSED_SHIFT __using_nonexistent_pte_bit()
+	#define CMPXCHG cmpxchg64
+	#define PT_MAX_FULL_LEVELS 4
 #else
 	#error Invalid PTTYPE value
 #endif
@@ -115,7 +137,11 @@ static bool FNAME(is_rsvd_bits_set)(struct kvm_mmu *mmu, u64 gpte, int level)
 
 static inline int FNAME(is_present_gpte)(unsigned long pte)
 {
+#if PTTYPE != PTTYPE_EPT
 	return is_present_gpte(pte);
+#else
+	return pte & 7;
+#endif
 }
 
 static int FNAME(cmpxchg_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
@@ -165,9 +191,14 @@ no_present:
 static inline unsigned FNAME(gpte_access)(struct kvm_vcpu *vcpu, u64 gpte)
 {
 	unsigned access;
-
+#if PTTYPE == PTTYPE_EPT
+	access = ((gpte & VMX_EPT_WRITABLE_MASK) ? ACC_WRITE_MASK : 0) |
+		((gpte & VMX_EPT_EXECUTABLE_MASK) ? ACC_EXEC_MASK : 0) |
+		ACC_USER_MASK;
+#else
 	access = (gpte & (PT_WRITABLE_MASK | PT_USER_MASK)) | ACC_EXEC_MASK;
 	access &= ~(gpte >> PT64_NX_SHIFT);
+#endif
 
 	return access;
 }
@@ -369,6 +400,7 @@ static int FNAME(walk_addr)(struct guest_walker *walker,
 					access);
 }
 
+#if PTTYPE != PTTYPE_EPT
 static int FNAME(walk_addr_nested)(struct guest_walker *walker,
 				   struct kvm_vcpu *vcpu, gva_t addr,
 				   u32 access)
@@ -376,6 +408,7 @@ static int FNAME(walk_addr_nested)(struct guest_walker *walker,
 	return FNAME(walk_addr_generic)(walker, vcpu, &vcpu->arch.nested_mmu,
 					addr, access);
 }
+#endif
 
 static bool
 FNAME(prefetch_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
@@ -803,6 +836,7 @@ static gpa_t FNAME(gva_to_gpa)(struct kvm_vcpu *vcpu, gva_t vaddr, u32 access,
 	return gpa;
 }
 
+#if PTTYPE != PTTYPE_EPT
 static gpa_t FNAME(gva_to_gpa_nested)(struct kvm_vcpu *vcpu, gva_t vaddr,
 				      u32 access,
 				      struct x86_exception *exception)
@@ -821,6 +855,7 @@ static gpa_t FNAME(gva_to_gpa_nested)(struct kvm_vcpu *vcpu, gva_t vaddr,
 
 	return gpa;
 }
+#endif
 
 /*
  * Using the cached information from sp->gfns is safe because:
-- 
1.7.10.4


* [PATCH v6 08/15] nEPT: Redefine EPT-specific link_shadow_page()
From: Gleb Natapov @ 2013-08-01 14:08 UTC
  To: kvm; +Cc: Xiao Guangrong, Jun Nakajima, Yang Zhang, pbonzini

From: Yang Zhang <yang.z.zhang@Intel.com>

Since nEPT does not support A/D bits, we should not set those bits
when building the shadow page table.
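
At the template call sites the new argument is simply the mode's
accessed mask, so the right behaviour falls out of the PT_GUEST_*
definitions; sketched per instantiation:

  link_shadow_page(it.sptep, sp, PT_GUEST_ACCESSED_MASK);
  /* paging32/64: PT_ACCESSED_MASK (non-zero) -> accessed bit set   */
  /* ept:         0                           -> accessed bit clear */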

Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
 arch/x86/kvm/mmu.c         |   12 +++++++++---
 arch/x86/kvm/paging_tmpl.h |    4 ++--
 2 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index b5273c3..9e0f467 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2047,12 +2047,18 @@ static void shadow_walk_next(struct kvm_shadow_walk_iterator *iterator)
 	return __shadow_walk_next(iterator, *iterator->sptep);
 }
 
-static void link_shadow_page(u64 *sptep, struct kvm_mmu_page *sp)
+static void link_shadow_page(u64 *sptep, struct kvm_mmu_page *sp, bool accessed)
 {
 	u64 spte;
 
+	BUILD_BUG_ON(VMX_EPT_READABLE_MASK != PT_PRESENT_MASK ||
+			VMX_EPT_WRITABLE_MASK != PT_WRITABLE_MASK);
+
 	spte = __pa(sp->spt) | PT_PRESENT_MASK | PT_WRITABLE_MASK |
-	       shadow_user_mask | shadow_x_mask | shadow_accessed_mask;
+	       shadow_user_mask | shadow_x_mask;
+
+	if (accessed)
+		spte |= shadow_accessed_mask;
 
 	mmu_spte_set(sptep, spte);
 }
@@ -2677,7 +2683,7 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, int write,
 					      iterator.level - 1,
 					      1, ACC_ALL, iterator.sptep);
 
-			link_shadow_page(iterator.sptep, sp);
+			link_shadow_page(iterator.sptep, sp, true);
 		}
 	}
 	return emulate;
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 762c904..f8e5680 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -555,7 +555,7 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
 			goto out_gpte_changed;
 
 		if (sp)
-			link_shadow_page(it.sptep, sp);
+			link_shadow_page(it.sptep, sp, PT_GUEST_ACCESSED_MASK);
 	}
 
 	for (;
@@ -575,7 +575,7 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
 
 		sp = kvm_mmu_get_page(vcpu, direct_gfn, addr, it.level-1,
 				      true, direct_access, it.sptep);
-		link_shadow_page(it.sptep, sp);
+		link_shadow_page(it.sptep, sp, PT_GUEST_ACCESSED_MASK);
 	}
 
 	clear_sp_write_flooding_count(it.sptep);
-- 
1.7.10.4


* [PATCH v6 09/15] nEPT: correctly check if remote tlb flush is needed for shadowed EPT tables
From: Gleb Natapov @ 2013-08-01 14:08 UTC
  To: kvm; +Cc: Xiao Guangrong, Jun Nakajima, Yang Zhang, pbonzini

need_remote_flush() assumes that the shadow page is in PT64 format, but
with the addition of nested EPT this is no longer always true. Fix it by
using bit definitions that depend on the host shadow page type.
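
The underlying mismatch, sketched: legacy shadow PTEs encode no-execute
in bit 63 (PT64_NX_MASK), while shadow EPT PTEs encode executable in a
low permission bit, so a hard-coded NX test checks a meaningless bit on
shadow EPT entries.  The shadow_*_mask variables are configured once
per host paging flavour, roughly as below (exec_perm_differs() is a
hypothetical helper for illustration):

  /*                      legacy x86-64 shadow    shadow EPT
   * shadow_nx_mask       1ull << 63              0
   * shadow_x_mask        0                       the EPT X bit
   */
  static bool exec_perm_differs(u64 old, u64 new)
  {
          /* tests whichever execute encoding the host format has */
          return ((old ^ new) & (shadow_x_mask | shadow_nx_mask)) != 0;
  }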

Reported-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
 arch/x86/kvm/mmu.c |    8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 9e0f467..a512ecf 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -132,8 +132,8 @@ module_param(dbg, bool, 0644);
 	(PAGE_MASK & ~((1ULL << (PAGE_SHIFT + (((level) - 1) \
 					    * PT32_LEVEL_BITS))) - 1))
 
-#define PT64_PERM_MASK (PT_PRESENT_MASK | PT_WRITABLE_MASK | PT_USER_MASK \
-			| PT64_NX_MASK)
+#define PT64_PERM_MASK (PT_PRESENT_MASK | PT_WRITABLE_MASK | shadow_user_mask \
+			| shadow_x_mask | shadow_nx_mask)
 
 #define ACC_EXEC_MASK    1
 #define ACC_WRITE_MASK   PT_WRITABLE_MASK
@@ -3879,8 +3879,8 @@ static bool need_remote_flush(u64 old, u64 new)
 		return true;
 	if ((old ^ new) & PT64_BASE_ADDR_MASK)
 		return true;
-	old ^= PT64_NX_MASK;
-	new ^= PT64_NX_MASK;
+	old ^= shadow_nx_mask;
+	new ^= shadow_nx_mask;
 	return (old & ~new & PT64_PERM_MASK) != 0;
 }
 
-- 
1.7.10.4


* [PATCH v6 10/15] nEPT: Nested INVEPT
From: Gleb Natapov @ 2013-08-01 14:08 UTC
  To: kvm; +Cc: Xiao Guangrong, Jun Nakajima, Yang Zhang, pbonzini

From: Nadav Har'El <nyh@il.ibm.com>

If we let L1 use EPT, we should probably also support the INVEPT instruction.

In our current nested EPT implementation, when L1 changes its EPT table
for L2 (i.e., EPT12), L0 modifies the shadow EPT table (EPT02), and in
the course of this modification already calls INVEPT. But if the last
level of a shadow page is unsync, not all of L1's changes to EPT12 are
intercepted, which means the roots need to be synced when L1 calls
INVEPT. Global INVEPT should not be handled differently, since the roots
are synced by kvm_mmu_load() each time EPTP02 changes.
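
One detail worth calling out in the decoding below: support for each
INVEPT extent type N is advertised in the EPT capability MSR at bit
N + 24, which is what the VMX_EPT_EXTENT_SHIFT arithmetic relies on.
A sketch (invept_type_supported() is a hypothetical helper):

  static bool invept_type_supported(u32 ept_caps, unsigned long type)
  {
          /* GLOBAL (2)  -> bit 26, VMX_EPT_EXTENT_GLOBAL_BIT
           * CONTEXT (1) -> bit 25, VMX_EPT_EXTENT_CONTEXT_BIT */
          return ept_caps & (1UL << (type + VMX_EPT_EXTENT_SHIFT));
  }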

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
 arch/x86/include/asm/vmx.h      |    2 ++
 arch/x86/include/uapi/asm/vmx.h |    1 +
 arch/x86/kvm/mmu.c              |    2 ++
 arch/x86/kvm/vmx.c              |   67 +++++++++++++++++++++++++++++++++++++++
 4 files changed, 72 insertions(+)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index f3e01a2..966502d 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -387,6 +387,7 @@ enum vmcs_field {
 #define VMX_EPT_EXTENT_INDIVIDUAL_ADDR		0
 #define VMX_EPT_EXTENT_CONTEXT			1
 #define VMX_EPT_EXTENT_GLOBAL			2
+#define VMX_EPT_EXTENT_SHIFT			24
 
 #define VMX_EPT_EXECUTE_ONLY_BIT		(1ull)
 #define VMX_EPT_PAGE_WALK_4_BIT			(1ull << 6)
@@ -394,6 +395,7 @@ enum vmcs_field {
 #define VMX_EPTP_WB_BIT				(1ull << 14)
 #define VMX_EPT_2MB_PAGE_BIT			(1ull << 16)
 #define VMX_EPT_1GB_PAGE_BIT			(1ull << 17)
+#define VMX_EPT_INVEPT_BIT			(1ull << 20)
 #define VMX_EPT_AD_BIT				    (1ull << 21)
 #define VMX_EPT_EXTENT_CONTEXT_BIT		(1ull << 25)
 #define VMX_EPT_EXTENT_GLOBAL_BIT		(1ull << 26)
diff --git a/arch/x86/include/uapi/asm/vmx.h b/arch/x86/include/uapi/asm/vmx.h
index d651082..7a34e8f 100644
--- a/arch/x86/include/uapi/asm/vmx.h
+++ b/arch/x86/include/uapi/asm/vmx.h
@@ -65,6 +65,7 @@
 #define EXIT_REASON_EOI_INDUCED         45
 #define EXIT_REASON_EPT_VIOLATION       48
 #define EXIT_REASON_EPT_MISCONFIG       49
+#define EXIT_REASON_INVEPT              50
 #define EXIT_REASON_PREEMPTION_TIMER    52
 #define EXIT_REASON_WBINVD              54
 #define EXIT_REASON_XSETBV              55
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index a512ecf..a1f8f7b 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3182,6 +3182,7 @@ void kvm_mmu_sync_roots(struct kvm_vcpu *vcpu)
 	mmu_sync_roots(vcpu);
 	spin_unlock(&vcpu->kvm->mmu_lock);
 }
+EXPORT_SYMBOL_GPL(kvm_mmu_sync_roots);
 
 static gpa_t nonpaging_gva_to_gpa(struct kvm_vcpu *vcpu, gva_t vaddr,
 				  u32 access, struct x86_exception *exception)
@@ -3451,6 +3452,7 @@ void kvm_mmu_flush_tlb(struct kvm_vcpu *vcpu)
 	++vcpu->stat.tlb_flush;
 	kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu);
 }
+EXPORT_SYMBOL_GPL(kvm_mmu_flush_tlb);
 
 static void paging_new_cr3(struct kvm_vcpu *vcpu)
 {
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 4e98764..a4d9385 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2156,6 +2156,7 @@ static u32 nested_vmx_pinbased_ctls_low, nested_vmx_pinbased_ctls_high;
 static u32 nested_vmx_exit_ctls_low, nested_vmx_exit_ctls_high;
 static u32 nested_vmx_entry_ctls_low, nested_vmx_entry_ctls_high;
 static u32 nested_vmx_misc_low, nested_vmx_misc_high;
+static u32 nested_vmx_ept_caps;
 static __init void nested_vmx_setup_ctls_msrs(void)
 {
 	/*
@@ -6270,6 +6271,70 @@ static int handle_vmptrst(struct kvm_vcpu *vcpu)
 	return 1;
 }
 
+/* Emulate the INVEPT instruction */
+static int handle_invept(struct kvm_vcpu *vcpu)
+{
+	u32 vmx_instruction_info;
+	bool ok;
+	unsigned long type;
+	gva_t gva;
+	struct x86_exception e;
+	struct {
+		u64 eptp, gpa;
+	} operand;
+
+	if (!(nested_vmx_secondary_ctls_high & SECONDARY_EXEC_ENABLE_EPT) ||
+	    !(nested_vmx_ept_caps & VMX_EPT_INVEPT_BIT)) {
+		kvm_queue_exception(vcpu, UD_VECTOR);
+		return 1;
+	}
+
+	if (!nested_vmx_check_permission(vcpu))
+		return 1;
+
+	if (!kvm_read_cr0_bits(vcpu, X86_CR0_PE)) {
+		kvm_queue_exception(vcpu, UD_VECTOR);
+		return 1;
+	}
+
+	/* According to the Intel VMX instruction reference, the memory
+	 * operand is read even if it isn't needed (e.g., for type==global)
+	 */
+	vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
+	if (get_vmx_mem_address(vcpu, vmcs_readl(EXIT_QUALIFICATION),
+			vmx_instruction_info, &gva))
+		return 1;
+	if (kvm_read_guest_virt(&vcpu->arch.emulate_ctxt, gva, &operand,
+				sizeof(operand), &e)) {
+		kvm_inject_page_fault(vcpu, &e);
+		return 1;
+	}
+
+	type = kvm_register_read(vcpu, (vmx_instruction_info >> 28) & 0xf);
+
+	switch (type) {
+	case VMX_EPT_EXTENT_GLOBAL:
+	case VMX_EPT_EXTENT_CONTEXT:
+		ok = !!(nested_vmx_ept_caps &
+				(1UL << (type + VMX_EPT_EXTENT_SHIFT)));
+		break;
+	default:
+		ok = false;
+	}
+
+	if (ok) {
+		kvm_mmu_sync_roots(vcpu);
+		kvm_mmu_flush_tlb(vcpu);
+		nested_vmx_succeed(vcpu);
+	}
+	else
+		nested_vmx_failValid(vcpu,
+				VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID);
+
+	skip_emulated_instruction(vcpu);
+	return 1;
+}
+
 /*
  * The exit handlers return 1 if the exit was handled fully and guest execution
  * may resume.  Otherwise they set the kvm_run parameter to indicate what needs
@@ -6314,6 +6379,7 @@ static int (*const kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = {
 	[EXIT_REASON_PAUSE_INSTRUCTION]       = handle_pause,
 	[EXIT_REASON_MWAIT_INSTRUCTION]	      = handle_invalid_op,
 	[EXIT_REASON_MONITOR_INSTRUCTION]     = handle_invalid_op,
+	[EXIT_REASON_INVEPT]                  = handle_invept,
 };
 
 static const int kvm_vmx_max_exit_handlers =
@@ -6540,6 +6606,7 @@ static bool nested_vmx_exit_handled(struct kvm_vcpu *vcpu)
 	case EXIT_REASON_VMPTRST: case EXIT_REASON_VMREAD:
 	case EXIT_REASON_VMRESUME: case EXIT_REASON_VMWRITE:
 	case EXIT_REASON_VMOFF: case EXIT_REASON_VMON:
+	case EXIT_REASON_INVEPT:
 		/*
 		 * VMX instructions trap unconditionally. This allows L1 to
 		 * emulate them for its L2 guest, i.e., allows 3-level nesting!
-- 
1.7.10.4


* [PATCH v6 11/15] nEPT: Add nEPT violation/misconfiguration support
From: Gleb Natapov @ 2013-08-01 14:08 UTC
  To: kvm; +Cc: Xiao Guangrong, Jun Nakajima, Yang Zhang, pbonzini

From: Yang Zhang <yang.z.zhang@Intel.com>

Inject nEPT faults into the L1 guest. This patch originally comes from Xinhao Xu.
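
The interesting machinery here is the bad_mt_xwr bitmap: the low six
bits of an EPT PTE are XWR (2:0) plus memory type (5:3), and all 64
combinations are precomputed once, so the walker can reject an illegal
PTE with a single shift and mask.  A sketch of the lookup (the table
itself is built in reset_rsvds_bits_mask_ept() below; ept_bad_mt_xwr()
is a hypothetical name):

  static bool ept_bad_mt_xwr(u64 bad_mt_xwr, u64 pte)
  {
          /* one bit per possible low-6-bit pattern; set = reject */
          return bad_mt_xwr & (1ull << (pte & 0x3f));
  }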

Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
 arch/x86/include/asm/kvm_host.h |    4 +++
 arch/x86/kvm/mmu.c              |   61 ++++++++++++++++++++++++++++++++-------
 arch/x86/kvm/paging_tmpl.h      |   25 ++++++++++++++--
 arch/x86/kvm/vmx.c              |   19 ++++++++++++
 4 files changed, 95 insertions(+), 14 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 531f47c..58a17c0 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -286,6 +286,7 @@ struct kvm_mmu {
 	u64 *pae_root;
 	u64 *lm_root;
 	u64 rsvd_bits_mask[2][4];
+	u64 bad_mt_xwr;
 
 	/*
 	 * Bitmap: bit set = last pte in walk
@@ -512,6 +513,9 @@ struct kvm_vcpu_arch {
 	 * instruction.
 	 */
 	bool write_fault_to_shadow_pgtable;
+
+	/* set at EPT violation at this point */
+	unsigned long exit_qualification;
 };
 
 struct kvm_lpage_info {
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index a1f8f7b..81b73bc 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3521,6 +3521,8 @@ static void reset_rsvds_bits_mask(struct kvm_vcpu *vcpu,
 	int maxphyaddr = cpuid_maxphyaddr(vcpu);
 	u64 exb_bit_rsvd = 0;
 
+	context->bad_mt_xwr = 0;
+
 	if (!context->nx)
 		exb_bit_rsvd = rsvd_bits(63, 63);
 	switch (context->root_level) {
@@ -3576,7 +3578,40 @@ static void reset_rsvds_bits_mask(struct kvm_vcpu *vcpu,
 	}
 }
 
-static void update_permission_bitmask(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu)
+static void reset_rsvds_bits_mask_ept(struct kvm_vcpu *vcpu,
+		struct kvm_mmu *context, bool execonly)
+{
+	int maxphyaddr = cpuid_maxphyaddr(vcpu);
+	int pte;
+
+	context->rsvd_bits_mask[0][3] =
+		rsvd_bits(maxphyaddr, 51) | rsvd_bits(3, 7);
+	context->rsvd_bits_mask[0][2] =
+		rsvd_bits(maxphyaddr, 51) | rsvd_bits(3, 6);
+	context->rsvd_bits_mask[0][1] =
+		rsvd_bits(maxphyaddr, 51) | rsvd_bits(3, 6);
+	context->rsvd_bits_mask[0][0] = rsvd_bits(maxphyaddr, 51);
+
+	/* large page */
+	context->rsvd_bits_mask[1][3] = context->rsvd_bits_mask[0][3];
+	context->rsvd_bits_mask[1][2] =
+		rsvd_bits(maxphyaddr, 51) | rsvd_bits(12, 29);
+	context->rsvd_bits_mask[1][1] =
+		rsvd_bits(maxphyaddr, 51) | rsvd_bits(12, 20);
+	context->rsvd_bits_mask[1][0] = context->rsvd_bits_mask[0][0];
+	
+	for (pte = 0; pte < 64; pte++) {
+		int rwx_bits = pte & 7;
+		int mt = pte >> 3;
+		if (mt == 0x2 || mt == 0x3 || mt == 0x7 ||
+				rwx_bits == 0x2 || rwx_bits == 0x6 ||
+				(rwx_bits == 0x4 && !execonly))
+			context->bad_mt_xwr |= (1ull << pte);
+	}
+}
+
+static void update_permission_bitmask(struct kvm_vcpu *vcpu,
+		struct kvm_mmu *mmu, bool ept)
 {
 	unsigned bit, byte, pfec;
 	u8 map;
@@ -3594,12 +3629,16 @@ static void update_permission_bitmask(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu
 			w = bit & ACC_WRITE_MASK;
 			u = bit & ACC_USER_MASK;
 
-			/* Not really needed: !nx will cause pte.nx to fault */
-			x |= !mmu->nx;
-			/* Allow supervisor writes if !cr0.wp */
-			w |= !is_write_protection(vcpu) && !uf;
-			/* Disallow supervisor fetches of user code if cr4.smep */
-			x &= !(smep && u && !uf);
+			if (!ept) {
+				/* Not really needed: !nx will cause pte.nx to fault */
+				x |= !mmu->nx;
+				/* Allow supervisor writes if !cr0.wp */
+				w |= !is_write_protection(vcpu) && !uf;
+				/* Disallow supervisor fetches of user code if cr4.smep */
+				x &= !(smep && u && !uf);
+			} else
+				/* Not really needed: no U/S accesses on ept  */
+				u = 1;
 
 			fault = (ff && !x) || (uf && !u) || (wf && !w);
 			map |= fault << bit;
@@ -3634,7 +3673,7 @@ static int paging64_init_context_common(struct kvm_vcpu *vcpu,
 	context->root_level = level;
 
 	reset_rsvds_bits_mask(vcpu, context);
-	update_permission_bitmask(vcpu, context);
+	update_permission_bitmask(vcpu, context, false);
 	update_last_pte_bitmap(vcpu, context);
 
 	ASSERT(is_pae(vcpu));
@@ -3664,7 +3703,7 @@ static int paging32_init_context(struct kvm_vcpu *vcpu,
 	context->root_level = PT32_ROOT_LEVEL;
 
 	reset_rsvds_bits_mask(vcpu, context);
-	update_permission_bitmask(vcpu, context);
+	update_permission_bitmask(vcpu, context, false);
 	update_last_pte_bitmap(vcpu, context);
 
 	context->new_cr3 = paging_new_cr3;
@@ -3726,7 +3765,7 @@ static int init_kvm_tdp_mmu(struct kvm_vcpu *vcpu)
 		context->gva_to_gpa = paging32_gva_to_gpa;
 	}
 
-	update_permission_bitmask(vcpu, context);
+	update_permission_bitmask(vcpu, context, false);
 	update_last_pte_bitmap(vcpu, context);
 
 	return 0;
@@ -3805,7 +3844,7 @@ static int init_kvm_nested_mmu(struct kvm_vcpu *vcpu)
 		g_context->gva_to_gpa = paging32_gva_to_gpa_nested;
 	}
 
-	update_permission_bitmask(vcpu, g_context);
+	update_permission_bitmask(vcpu, g_context, false);
 	update_last_pte_bitmap(vcpu, g_context);
 
 	return 0;
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index f8e5680..7f6af8e 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -129,10 +129,10 @@ static inline void FNAME(protect_clean_gpte)(unsigned *access, unsigned gpte)
 
 static bool FNAME(is_rsvd_bits_set)(struct kvm_mmu *mmu, u64 gpte, int level)
 {
-	int bit7;
+	int bit7 = (gpte >> 7) & 1, low6 = gpte & 0x3f;
 
-	bit7 = (gpte >> 7) & 1;
-	return (gpte & mmu->rsvd_bits_mask[bit7][level-1]) != 0;
+	return (gpte & mmu->rsvd_bits_mask[bit7][level-1]) |
+		((mmu->bad_mt_xwr & (1ull << low6)) != 0);
 }
 
 static inline int FNAME(is_present_gpte)(unsigned long pte)
@@ -386,6 +386,25 @@ error:
 	walker->fault.vector = PF_VECTOR;
 	walker->fault.error_code_valid = true;
 	walker->fault.error_code = errcode;
+
+#if PTTYPE == PTTYPE_EPT
+	/*
+	 * Use PFERR_RSVD_MASK in error_code to tell if EPT
+	 * misconfiguration requires to be injected. The detection is
+	 * done by is_rsvd_bits_set() above.
+	 *
+	 * We set up the value of exit_qualification to inject:
+	 * [2:0] - Derive from [2:0] of real exit_qualification at EPT violation
+	 * [5:3] - Calculated by the page walk of the guest EPT page tables
+	 * [7:8] - Derived from [7:8] of real exit_qualification
+	 *
+	 * The other bits are set to 0.
+	 */
+	if (!(errcode & PFERR_RSVD_MASK)) {
+		vcpu->arch.exit_qualification &= 0x187;
+		vcpu->arch.exit_qualification |= ((pt_access & pte) & 0x7) << 3;
+	}
+#endif
 	walker->fault.address = addr;
 	walker->fault.nested_page_fault = mmu != vcpu->arch.walk_mmu;
 
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index a4d9385..2d84875 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -5318,9 +5318,13 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
 
 	/* It is a write fault? */
 	error_code = exit_qualification & (1U << 1);
+	/* It is a fetch fault? */
+	error_code |= (exit_qualification & (1U << 2)) << 2;
 	/* ept page table is present? */
 	error_code |= (exit_qualification >> 3) & 0x1;
 
+	vcpu->arch.exit_qualification = exit_qualification;
+
 	return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
 }
 
@@ -7415,6 +7419,21 @@ static void vmx_set_supported_cpuid(u32 func, struct kvm_cpuid_entry2 *entry)
 		entry->ecx |= bit(X86_FEATURE_VMX);
 }
 
+static void nested_ept_inject_page_fault(struct kvm_vcpu *vcpu,
+		struct x86_exception *fault)
+{
+	struct vmcs12 *vmcs12;
+	nested_vmx_vmexit(vcpu);
+	vmcs12 = get_vmcs12(vcpu);
+
+	if (fault->error_code & PFERR_RSVD_MASK)
+		vmcs12->vm_exit_reason = EXIT_REASON_EPT_MISCONFIG;
+	else
+		vmcs12->vm_exit_reason = EXIT_REASON_EPT_VIOLATION;
+	vmcs12->exit_qualification = vcpu->arch.exit_qualification;
+	vmcs12->guest_physical_address = fault->address;
+}
+
 /*
  * prepare_vmcs02 is called when the L1 guest hypervisor runs its nested
  * L2 guest. L1 has a vmcs for L2 (vmcs12), and this function "merges" it
-- 
1.7.10.4


* [PATCH v6 12/15] nEPT: MMU context for nested EPT
From: Gleb Natapov @ 2013-08-01 14:08 UTC
  To: kvm; +Cc: Xiao Guangrong, Jun Nakajima, Yang Zhang, pbonzini

From: Nadav Har'El <nyh@il.ibm.com>

KVM's existing shadow MMU code already supports nested TDP. To use it, we
need to set up a new "MMU context" for nested EPT, and create a few callbacks
for it (nested_ept_*()). This context should also use the EPT versions of
the page table access functions (defined in the previous patch).
Then, we need to switch back and forth between this nested context and the
regular MMU context when switching between L1 and L2 (when L1 runs this L2
with EPT).
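
In other words, while L2 runs with nested EPT the two MMU pointers are
split; a sketch of the switch (the entry side is in prepare_vmcs02()
below, while the uninit side runs on the nested-vmexit path, which this
hunk does not show):

  /* L1 -> L2 entry, when L1 uses EPT: */
  kvm_mmu_unload(vcpu);                /* drop L1's roots */
  nested_ept_init_mmu_context(vcpu);   /* arch.mmu shadows EPT12;
                                        * walk_mmu = &nested_mmu for
                                        * L2 gva -> L2 gpa walks */

  /* L2 -> L1 exit: */
  nested_ept_uninit_mmu_context(vcpu); /* walk_mmu = &arch.mmu again */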

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
 arch/x86/kvm/mmu.c |   27 +++++++++++++++++++++++++++
 arch/x86/kvm/mmu.h |    2 ++
 arch/x86/kvm/vmx.c |   41 ++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 69 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 81b73bc..c0b4e0f 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3797,6 +3797,33 @@ int kvm_init_shadow_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *context)
 }
 EXPORT_SYMBOL_GPL(kvm_init_shadow_mmu);
 
+int kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *context,
+		bool execonly)
+{
+	ASSERT(vcpu);
+	ASSERT(!VALID_PAGE(vcpu->arch.mmu.root_hpa));
+
+	context->shadow_root_level = kvm_x86_ops->get_tdp_level();
+
+	context->nx = true;
+	context->new_cr3 = paging_new_cr3;
+	context->page_fault = ept_page_fault;
+	context->gva_to_gpa = ept_gva_to_gpa;
+	context->sync_page = ept_sync_page;
+	context->invlpg = ept_invlpg;
+	context->update_pte = ept_update_pte;
+	context->free = paging_free;
+	context->root_level = context->shadow_root_level;
+	context->root_hpa = INVALID_PAGE;
+	context->direct_map = false;
+
+	update_permission_bitmask(vcpu, context, true);
+	reset_rsvds_bits_mask_ept(vcpu, context, execonly);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(kvm_init_shadow_ept_mmu);
+
 static int init_kvm_softmmu(struct kvm_vcpu *vcpu)
 {
 	int r = kvm_init_shadow_mmu(vcpu, vcpu->arch.walk_mmu);
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 5b59c57..77e044a 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -71,6 +71,8 @@ enum {
 
 int handle_mmio_page_fault_common(struct kvm_vcpu *vcpu, u64 addr, bool direct);
 int kvm_init_shadow_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *context);
+int kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *context,
+		bool execonly);
 
 static inline unsigned int kvm_mmu_available_pages(struct kvm *kvm)
 {
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 2d84875..627b504 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -1046,6 +1046,11 @@ static inline bool nested_cpu_has_virtual_nmis(struct vmcs12 *vmcs12,
 	return vmcs12->pin_based_vm_exec_control & PIN_BASED_VIRTUAL_NMIS;
 }
 
+static inline int nested_cpu_has_ept(struct vmcs12 *vmcs12)
+{
+	return nested_cpu_has2(vmcs12, SECONDARY_EXEC_ENABLE_EPT);
+}
+
 static inline bool is_exception(u32 intr_info)
 {
 	return (intr_info & (INTR_INFO_INTR_TYPE_MASK | INTR_INFO_VALID_MASK))
@@ -7434,6 +7439,33 @@ static void nested_ept_inject_page_fault(struct kvm_vcpu *vcpu,
 	vmcs12->guest_physical_address = fault->address;
 }
 
+/* Callbacks for nested_ept_init_mmu_context: */
+
+static unsigned long nested_ept_get_cr3(struct kvm_vcpu *vcpu)
+{
+	/* return the page table to be shadowed - in our case, EPT12 */
+	return get_vmcs12(vcpu)->ept_pointer;
+}
+
+static int nested_ept_init_mmu_context(struct kvm_vcpu *vcpu)
+{
+	int r = kvm_init_shadow_ept_mmu(vcpu, &vcpu->arch.mmu,
+			nested_vmx_ept_caps & VMX_EPT_EXECUTE_ONLY_BIT);
+
+	vcpu->arch.mmu.set_cr3           = vmx_set_cr3;
+	vcpu->arch.mmu.get_cr3           = nested_ept_get_cr3;
+	vcpu->arch.mmu.inject_page_fault = nested_ept_inject_page_fault;
+
+	vcpu->arch.walk_mmu              = &vcpu->arch.nested_mmu;
+
+	return r;
+}
+
+static void nested_ept_uninit_mmu_context(struct kvm_vcpu *vcpu)
+{
+	vcpu->arch.walk_mmu = &vcpu->arch.mmu;
+}
+
 /*
  * prepare_vmcs02 is called when the L1 guest hypervisor runs its nested
  * L2 guest. L1 has a vmcs for L2 (vmcs12), and this function "merges" it
@@ -7654,6 +7686,11 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
 		vmx_flush_tlb(vcpu);
 	}
 
+	if (nested_cpu_has_ept(vmcs12)) {
+		kvm_mmu_unload(vcpu);
+		nested_ept_init_mmu_context(vcpu);
+	}
+
 	if (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_EFER)
 		vcpu->arch.efer = vmcs12->guest_ia32_efer;
 	else if (vmcs12->vm_entry_controls & VM_ENTRY_IA32E_MODE)
@@ -8126,7 +8163,9 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
 	vcpu->arch.cr4_guest_owned_bits = ~vmcs_readl(CR4_GUEST_HOST_MASK);
 	kvm_set_cr4(vcpu, vmcs12->host_cr4);
 
-	/* shadow page tables on either EPT or shadow page tables */
+	if (nested_cpu_has_ept(vmcs12))
+		nested_ept_uninit_mmu_context(vcpu);
+
 	kvm_set_cr3(vcpu, vmcs12->host_cr3);
 	kvm_mmu_reset_context(vcpu);
 
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH v6 13/15] nEPT: Advertise EPT to L1
  2013-08-01 14:08 [PATCH v6 00/15] Nested EPT Gleb Natapov
                   ` (11 preceding siblings ...)
  2013-08-01 14:08 ` [PATCH v6 12/15] nEPT: MMU context for nested EPT Gleb Natapov
@ 2013-08-01 14:08 ` Gleb Natapov
  2013-08-02  8:29   ` Xiao Guangrong
  2013-08-01 14:08 ` [PATCH v6 14/15] nEPT: Some additional comments Gleb Natapov
                   ` (2 subsequent siblings)
  15 siblings, 1 reply; 39+ messages in thread
From: Gleb Natapov @ 2013-08-01 14:08 UTC (permalink / raw)
  To: kvm; +Cc: Xiao Guangrong, Jun Nakajima, Yang Zhang, pbonzini

From: Nadav Har'El <nyh@il.ibm.com>

Advertise the support of EPT to the L1 guest, through the appropriate MSR.

This is the last patch of the basic Nested EPT feature, so as to allow
bisection through this patch series: The guest will not see EPT support until
this last patch, and will not attempt to use the half-applied feature.
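
For illustration only (this snippet is not part of the patch), an L1
hypervisor typically consumes what is advertised here along these lines;
MSR_IA32_VMX_EPT_VPID_CAP and the VMX_EPT_* bits are the standard
definitions:

	u32 ept_lo, ept_hi;

	rdmsr(MSR_IA32_VMX_EPT_VPID_CAP, ept_lo, ept_hi);
	if ((ept_lo & VMX_EPT_PAGE_WALK_4_BIT) &&
	    (ept_lo & VMX_EPT_INVEPT_BIT)) {
		/* 4-level EPT walks and invept are available, so it is
		 * safe to turn on SECONDARY_EXEC_ENABLE_EPT for a nested
		 * guest. */
	}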

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
 arch/x86/kvm/vmx.c |   20 ++++++++++++++++++--
 1 file changed, 18 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 627b504..51909ad 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2249,6 +2249,22 @@ static __init void nested_vmx_setup_ctls_msrs(void)
 		SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES |
 		SECONDARY_EXEC_WBINVD_EXITING;
 
+	if (enable_ept) {
+		/* nested EPT: emulate EPT also to L1 */
+		nested_vmx_secondary_ctls_high |= SECONDARY_EXEC_ENABLE_EPT;
+		nested_vmx_ept_caps = VMX_EPT_PAGE_WALK_4_BIT;
+		nested_vmx_ept_caps |= VMX_EPT_INVEPT_BIT;
+		nested_vmx_ept_caps &= vmx_capability.ept;
+		/*
+		 * Since invept is completely emulated we support both global
+		 * and context invalidation, independent of what the host CPU
+		 * supports.
+		 */
+		nested_vmx_ept_caps |= VMX_EPT_EXTENT_GLOBAL_BIT |
+			VMX_EPT_EXTENT_CONTEXT_BIT;
+	} else
+		nested_vmx_ept_caps = 0;
+
 	/* miscellaneous data */
 	rdmsr(MSR_IA32_VMX_MISC, nested_vmx_misc_low, nested_vmx_misc_high);
 	nested_vmx_misc_low &= VMX_MISC_PREEMPTION_TIMER_RATE_MASK |
@@ -2357,8 +2373,8 @@ static int vmx_get_vmx_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 *pdata)
 					nested_vmx_secondary_ctls_high);
 		break;
 	case MSR_IA32_VMX_EPT_VPID_CAP:
-		/* Currently, no nested ept or nested vpid */
-		*pdata = 0;
+		/* Currently, no nested vpid support */
+		*pdata = nested_vmx_ept_caps;
 		break;
 	default:
 		return 0;
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH v6 14/15] nEPT: Some additional comments
  2013-08-01 14:08 [PATCH v6 00/15] Nested EPT Gleb Natapov
                   ` (12 preceding siblings ...)
  2013-08-01 14:08 ` [PATCH v6 13/15] nEPT: Advertise EPT to L1 Gleb Natapov
@ 2013-08-01 14:08 ` Gleb Natapov
  2013-08-02  6:26   ` Xiao Guangrong
  2013-08-01 14:08 ` [PATCH v6 15/15] nEPT: Miscellaneous cleanups Gleb Natapov
  2013-08-04  9:24 ` [PATCH v6 00/15] Nested EPT Jan Kiszka
  15 siblings, 1 reply; 39+ messages in thread
From: Gleb Natapov @ 2013-08-01 14:08 UTC (permalink / raw)
  To: kvm; +Cc: Xiao Guangrong, Jun Nakajima, Yang Zhang, pbonzini

From: Nadav Har'El <nyh@il.ibm.com>

Some additional comments to preexisting code:
Explain who (L0 or L1) handles EPT violation and misconfiguration exits.
Don't mention "shadow on either EPT or shadow" as the only two options.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
 arch/x86/kvm/vmx.c |   13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 51909ad..56f8e0c 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -6664,7 +6664,20 @@ static bool nested_vmx_exit_handled(struct kvm_vcpu *vcpu)
 		return nested_cpu_has2(vmcs12,
 			SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES);
 	case EXIT_REASON_EPT_VIOLATION:
+		/*
+		 * L0 always deals with the EPT violation. If nested EPT is
+		 * used, and the nested mmu code discovers that the address is
+		 * missing in the guest EPT table (EPT12), the EPT violation
+		 * will be injected with nested_ept_inject_page_fault()
+		 */
+		return 0;
 	case EXIT_REASON_EPT_MISCONFIG:
+		/*
+		 * L2 never directly uses L1's EPT, but rather L0's own EPT
+		 * table (shadow on EPT) or a merged EPT table that L0 built
+		 * (EPT on EPT). So any problem with the structure of the
+		 * table is L0's fault.
+		 */
 		return 0;
 	case EXIT_REASON_PREEMPTION_TIMER:
 		return vmcs12->pin_based_vm_exec_control &
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH v6 15/15] nEPT: Miscellaneous cleanups
  2013-08-01 14:08 [PATCH v6 00/15] Nested EPT Gleb Natapov
                   ` (13 preceding siblings ...)
  2013-08-01 14:08 ` [PATCH v6 14/15] nEPT: Some additional comments Gleb Natapov
@ 2013-08-01 14:08 ` Gleb Natapov
  2013-08-02  6:45   ` Xiao Guangrong
  2013-08-04  9:24 ` [PATCH v6 00/15] Nested EPT Jan Kiszka
  15 siblings, 1 reply; 39+ messages in thread
From: Gleb Natapov @ 2013-08-01 14:08 UTC (permalink / raw)
  To: kvm; +Cc: Xiao Guangrong, Jun Nakajima, Yang Zhang, pbonzini

From: Nadav Har'El <nyh@il.ibm.com>

Some trivial code cleanups not really related to nested EPT.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
 arch/x86/kvm/vmx.c |    6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 56f8e0c..ef62772 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -715,7 +715,6 @@ static void nested_release_page_clean(struct page *page)
 static u64 construct_eptp(unsigned long root_hpa);
 static void kvm_cpu_vmxon(u64 addr);
 static void kvm_cpu_vmxoff(void);
-static void vmx_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3);
 static int vmx_set_tss_addr(struct kvm *kvm, unsigned int addr);
 static void vmx_set_segment(struct kvm_vcpu *vcpu,
 			    struct kvm_segment *var, int seg);
@@ -1040,8 +1039,7 @@ static inline bool nested_cpu_has2(struct vmcs12 *vmcs12, u32 bit)
 		(vmcs12->secondary_vm_exec_control & bit);
 }
 
-static inline bool nested_cpu_has_virtual_nmis(struct vmcs12 *vmcs12,
-	struct kvm_vcpu *vcpu)
+static inline bool nested_cpu_has_virtual_nmis(struct vmcs12 *vmcs12)
 {
 	return vmcs12->pin_based_vm_exec_control & PIN_BASED_VIRTUAL_NMIS;
 }
@@ -6765,7 +6763,7 @@ static int vmx_handle_exit(struct kvm_vcpu *vcpu)
 
 	if (unlikely(!cpu_has_virtual_nmis() && vmx->soft_vnmi_blocked &&
 	    !(is_guest_mode(vcpu) && nested_cpu_has_virtual_nmis(
-	                                get_vmcs12(vcpu), vcpu)))) {
+					get_vmcs12(vcpu))))) {
 		if (vmx_interrupt_allowed(vcpu)) {
 			vmx->soft_vnmi_blocked = 0;
 		} else if (vmx->vnmi_blocked_time > 1000000000LL &&
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* RE: [PATCH v6 01/15] nEPT: Support LOAD_IA32_EFER entry/exit controls for L1
  2013-08-01 14:08 ` [PATCH v6 01/15] nEPT: Support LOAD_IA32_EFER entry/exit controls for L1 Gleb Natapov
@ 2013-08-02  3:04   ` Zhang, Yang Z
  2013-08-02  6:35     ` Jan Kiszka
  0 siblings, 1 reply; 39+ messages in thread
From: Zhang, Yang Z @ 2013-08-02  3:04 UTC (permalink / raw)
  To: Gleb Natapov, kvm; +Cc: Xiao Guangrong, Nakajima, Jun, pbonzini

Gleb Natapov wrote on 2013-08-01:
> From: Nadav Har'El <nyh@il.ibm.com>
> 
> Recent KVM, since
> http://kerneltrap.org/mailarchive/linux-kvm/2010/5/2/6261577
> switch the EFER MSR when EPT is used and the host and guest have different
> NX bits. So if we add support for nested EPT (L1 guest using EPT to run L2)
> and want to be able to run recent KVM as L1, we need to allow L1 to use this
> EFER switching feature.
> 
> To do this EFER switching, KVM uses VM_ENTRY/EXIT_LOAD_IA32_EFER if
> available,
> and if it isn't, it uses the generic VM_ENTRY/EXIT_MSR_LOAD. This patch adds
> support for the former (the latter is still unsupported).
> 
> Nested entry and exit emulation (prepare_vmcs_02 and
> load_vmcs12_host_state,
> respectively) already handled VM_ENTRY/EXIT_LOAD_IA32_EFER correctly. So
> all
> that's left to do in this patch is to properly advertise this feature to L1.
> 
> Note that vmcs12's VM_ENTRY/EXIT_LOAD_IA32_EFER are emulated by L0, by
> using
> vmx_set_efer (which itself sets one of several vmcs02 fields), so we always
> support this feature, regardless of whether the host supports it.
> 
> Reviewed-by: Orit Wasserman <owasserm@redhat.com>
> Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
> Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
> Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
> Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
> Signed-off-by: Gleb Natapov <gleb@redhat.com>
> ---
>  arch/x86/kvm/vmx.c |   23 ++++++++++++++++-------
>  1 file changed, 16 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index e999dc7..27efa6a 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -2198,7 +2198,8 @@ static __init void
> nested_vmx_setup_ctls_msrs(void)
>  #else
>  	nested_vmx_exit_ctls_high = 0;
>  #endif
> -	nested_vmx_exit_ctls_high |=
> VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR;
> +	nested_vmx_exit_ctls_high |=
> (VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR |
> +				      VM_EXIT_LOAD_IA32_EFER);
> 
>  	/* entry controls */
>  	rdmsr(MSR_IA32_VMX_ENTRY_CTLS,
> @@ -2207,8 +2208,8 @@ static __init void
> nested_vmx_setup_ctls_msrs(void)
>  	nested_vmx_entry_ctls_low =
> VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR;
>  	nested_vmx_entry_ctls_high &=
>  		VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_IA32E_MODE;
> -	nested_vmx_entry_ctls_high |=
> VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR;
> -
> +	nested_vmx_entry_ctls_high |=
> (VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR |
> +				       VM_ENTRY_LOAD_IA32_EFER);
Just saw it: we don't expose bit 22 (save VMX-preemption timer) in the VM-exit controls, but we already allow the guest to set the activate-VMX-preemption-timer bit in the pin-based VM-execution controls. This is wrong.

>  	/* cpu-based controls */
>  	rdmsr(MSR_IA32_VMX_PROCBASED_CTLS,
>  		nested_vmx_procbased_ctls_low,
> nested_vmx_procbased_ctls_high);
> @@ -7529,10 +7530,18 @@ static void prepare_vmcs02(struct kvm_vcpu
> *vcpu, struct vmcs12 *vmcs12)
>  	vcpu->arch.cr0_guest_owned_bits &= ~vmcs12->cr0_guest_host_mask;
>  	vmcs_writel(CR0_GUEST_HOST_MASK,
> ~vcpu->arch.cr0_guest_owned_bits);
> 
> -	/* Note: IA32_MODE, LOAD_IA32_EFER are modified by vmx_set_efer
> below */
> -	vmcs_write32(VM_EXIT_CONTROLS,
> -		vmcs12->vm_exit_controls | vmcs_config.vmexit_ctrl);
> -	vmcs_write32(VM_ENTRY_CONTROLS, vmcs12->vm_entry_controls |
> +	/* L2->L1 exit controls are emulated - the hardware exit is to L0 so
> +	 * we should use its exit controls. Note that VM_EXIT_LOAD_IA32_EFER
> +	 * bits are further modified by vmx_set_efer() below.
> +	 */
> +	vmcs_write32(VM_EXIT_CONTROLS, vmcs_config.vmexit_ctrl);
Should we mention that the save-VMX-preemption-timer bit must use host|guest, not just host?

> +
> +	/* vmcs12's VM_ENTRY_LOAD_IA32_EFER and VM_ENTRY_IA32E_MODE
> are
> +	 * emulated by vmx_set_efer(), below.
> +	 */
> +	vmcs_write32(VM_ENTRY_CONTROLS,
> +		(vmcs12->vm_entry_controls & ~VM_ENTRY_LOAD_IA32_EFER &
> +			~VM_ENTRY_IA32E_MODE) |
>  		(vmcs_config.vmentry_ctrl & ~VM_ENTRY_IA32E_MODE));
> 
>  	if (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_PAT)
> --
> 1.7.10.4

Best regards,
Yang



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v6 09/15] nEPT: correctly check if remote tlb flush is needed for shadowed EPT tables
  2013-08-01 14:08 ` [PATCH v6 09/15] nEPT: correctly check if remote tlb flush is needed for shadowed EPT tables Gleb Natapov
@ 2013-08-02  5:58   ` Xiao Guangrong
  0 siblings, 0 replies; 39+ messages in thread
From: Xiao Guangrong @ 2013-08-02  5:58 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: kvm, Jun Nakajima, Yang Zhang, pbonzini

On 08/01/2013 10:08 PM, Gleb Natapov wrote:
> need_remote_flush() assumes that the shadow page is in PT64 format, but
> with the addition of nested EPT this is no longer always true. Fix it by
> using bit definitions that depend on the host shadow page type.

Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v6 11/15] nEPT: Add nEPT violation/misconfiguration support
  2013-08-01 14:08 ` [PATCH v6 11/15] nEPT: Add nEPT violation/misconfiguration support Gleb Natapov
@ 2013-08-02  6:12   ` Xiao Guangrong
  0 siblings, 0 replies; 39+ messages in thread
From: Xiao Guangrong @ 2013-08-02  6:12 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: kvm, Jun Nakajima, Yang Zhang, pbonzini

On 08/01/2013 10:08 PM, Gleb Natapov wrote:
> From: Yang Zhang <yang.z.zhang@Intel.com>
> 
> Inject the nEPT fault into the L1 guest. This patch originally comes from Xinhao.

Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v6 12/15] nEPT: MMU context for nested EPT
  2013-08-01 14:08 ` [PATCH v6 12/15] nEPT: MMU context for nested EPT Gleb Natapov
@ 2013-08-02  6:13   ` Xiao Guangrong
  0 siblings, 0 replies; 39+ messages in thread
From: Xiao Guangrong @ 2013-08-02  6:13 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: kvm, Jun Nakajima, Yang Zhang, pbonzini

On 08/01/2013 10:08 PM, Gleb Natapov wrote:
> From: Nadav Har'El <nyh@il.ibm.com>
> 
> KVM's existing shadow MMU code already supports nested TDP. To use it, we
> need to set up a new "MMU context" for nested EPT, and create a few callbacks
> for it (nested_ept_*()). This context should also use the EPT versions of
> the page table access functions (defined in the previous patch).
> Then, we need to switch back and forth between this nested context and the
> regular MMU context when switching between L1 and L2 (when L1 runs this L2
> with EPT).

Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v6 14/15] nEPT: Some additional comments
  2013-08-01 14:08 ` [PATCH v6 14/15] nEPT: Some additional comments Gleb Natapov
@ 2013-08-02  6:26   ` Xiao Guangrong
  0 siblings, 0 replies; 39+ messages in thread
From: Xiao Guangrong @ 2013-08-02  6:26 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: kvm, Jun Nakajima, Yang Zhang, pbonzini

On 08/01/2013 10:08 PM, Gleb Natapov wrote:
> From: Nadav Har'El <nyh@il.ibm.com>
> 
> Some additional comments to preexisting code:
> Explain who (L0 or L1) handles EPT violation and misconfiguration exits.
> Don't mention "shadow on either EPT or shadow" as the only two options.

Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v6 01/15] nEPT: Support LOAD_IA32_EFER entry/exit controls for L1
  2013-08-02  3:04   ` Zhang, Yang Z
@ 2013-08-02  6:35     ` Jan Kiszka
  2013-08-02  7:27       ` Zhang, Yang Z
  0 siblings, 1 reply; 39+ messages in thread
From: Jan Kiszka @ 2013-08-02  6:35 UTC (permalink / raw)
  To: Zhang, Yang Z
  Cc: Gleb Natapov, kvm, Xiao Guangrong, Nakajima, Jun, pbonzini,
	"李春奇 <Arthur Chunqi Li>"

[-- Attachment #1: Type: text/plain, Size: 4766 bytes --]

On 2013-08-02 05:04, Zhang, Yang Z wrote:
> Gleb Natapov wrote on 2013-08-01:
>> From: Nadav Har'El <nyh@il.ibm.com>
>>
>> Recent KVM, since
>> http://kerneltrap.org/mailarchive/linux-kvm/2010/5/2/6261577
>> switch the EFER MSR when EPT is used and the host and guest have different
>> NX bits. So if we add support for nested EPT (L1 guest using EPT to run L2)
>> and want to be able to run recent KVM as L1, we need to allow L1 to use this
>> EFER switching feature.
>>
>> To do this EFER switching, KVM uses VM_ENTRY/EXIT_LOAD_IA32_EFER if
>> available,
>> and if it isn't, it uses the generic VM_ENTRY/EXIT_MSR_LOAD. This patch adds
>> support for the former (the latter is still unsupported).
>>
>> Nested entry and exit emulation (prepare_vmcs_02 and
>> load_vmcs12_host_state,
>> respectively) already handled VM_ENTRY/EXIT_LOAD_IA32_EFER correctly. So
>> all
>> that's left to do in this patch is to properly advertise this feature to L1.
>>
>> Note that vmcs12's VM_ENTRY/EXIT_LOAD_IA32_EFER are emulated by L0, by
>> using
>> vmx_set_efer (which itself sets one of several vmcs02 fields), so we always
>> support this feature, regardless of whether the host supports it.
>>
>> Reviewed-by: Orit Wasserman <owasserm@redhat.com>
>> Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
>> Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
>> Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
>> Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
>> Signed-off-by: Gleb Natapov <gleb@redhat.com>
>> ---
>>  arch/x86/kvm/vmx.c |   23 ++++++++++++++++-------
>>  1 file changed, 16 insertions(+), 7 deletions(-)
>>
>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>> index e999dc7..27efa6a 100644
>> --- a/arch/x86/kvm/vmx.c
>> +++ b/arch/x86/kvm/vmx.c
>> @@ -2198,7 +2198,8 @@ static __init void
>> nested_vmx_setup_ctls_msrs(void)
>>  #else
>>  	nested_vmx_exit_ctls_high = 0;
>>  #endif
>> -	nested_vmx_exit_ctls_high |=
>> VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR;
>> +	nested_vmx_exit_ctls_high |=
>> (VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR |
>> +				      VM_EXIT_LOAD_IA32_EFER);
>>
>>  	/* entry controls */
>>  	rdmsr(MSR_IA32_VMX_ENTRY_CTLS,
>> @@ -2207,8 +2208,8 @@ static __init void
>> nested_vmx_setup_ctls_msrs(void)
>>  	nested_vmx_entry_ctls_low =
>> VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR;
>>  	nested_vmx_entry_ctls_high &=
>>  		VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_IA32E_MODE;
>> -	nested_vmx_entry_ctls_high |=
>> VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR;
>> -
>> +	nested_vmx_entry_ctls_high |=
>> (VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR |
>> +				       VM_ENTRY_LOAD_IA32_EFER);
> Just saw it: we don't expose bit 22 (save VMX-preemption timer) in the VM-exit controls, but we already allow the guest to set the activate-VMX-preemption-timer bit in the pin-based VM-execution controls. This is wrong.

Does the presence of preemption timer support imply that saving its
value is also supported? Then we could demand this combination (ie. do
not expose preemption timer support at all to L1 if value saving is
missing) and build our preemption timer emulation on top.
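
A sketch of that gating (against nested_vmx_setup_ctls_msrs(); whether the
nested pin-based high bits are held in a variable exactly like this is an
assumption):

	/* Hide the preemption timer from L1 entirely unless its value
	 * can also be saved on VM-exit. */
	if (!(nested_vmx_exit_ctls_high & VM_EXIT_SAVE_VMX_PREEMPTION_TIMER))
		nested_vmx_pinbased_ctls_high &=
			~PIN_BASED_VMX_PREEMPTION_TIMER;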

There is more broken w.r.t. the VMX preemption timer; patches are welcome.
Arthur will also try to develop test cases for it. But that topic is
unrelated to this series.

Jan

> 
>>  	/* cpu-based controls */
>>  	rdmsr(MSR_IA32_VMX_PROCBASED_CTLS,
>>  		nested_vmx_procbased_ctls_low,
>> nested_vmx_procbased_ctls_high);
>> @@ -7529,10 +7530,18 @@ static void prepare_vmcs02(struct kvm_vcpu
>> *vcpu, struct vmcs12 *vmcs12)
>>  	vcpu->arch.cr0_guest_owned_bits &= ~vmcs12->cr0_guest_host_mask;
>>  	vmcs_writel(CR0_GUEST_HOST_MASK,
>> ~vcpu->arch.cr0_guest_owned_bits);
>>
>> -	/* Note: IA32_MODE, LOAD_IA32_EFER are modified by vmx_set_efer
>> below */
>> -	vmcs_write32(VM_EXIT_CONTROLS,
>> -		vmcs12->vm_exit_controls | vmcs_config.vmexit_ctrl);
>> -	vmcs_write32(VM_ENTRY_CONTROLS, vmcs12->vm_entry_controls |
>> +	/* L2->L1 exit controls are emulated - the hardware exit is to L0 so
>> +	 * we should use its exit controls. Note that VM_EXIT_LOAD_IA32_EFER
>> +	 * bits are further modified by vmx_set_efer() below.
>> +	 */
>> +	vmcs_write32(VM_EXIT_CONTROLS, vmcs_config.vmexit_ctrl);
> Should we mention that the save-VMX-preemption-timer bit must use host|guest, not just host?
> 
>> +
>> +	/* vmcs12's VM_ENTRY_LOAD_IA32_EFER and VM_ENTRY_IA32E_MODE
>> are
>> +	 * emulated by vmx_set_efer(), below.
>> +	 */
>> +	vmcs_write32(VM_ENTRY_CONTROLS,
>> +		(vmcs12->vm_entry_controls & ~VM_ENTRY_LOAD_IA32_EFER &
>> +			~VM_ENTRY_IA32E_MODE) |
>>  		(vmcs_config.vmentry_ctrl & ~VM_ENTRY_IA32E_MODE));
>>
>>  	if (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_PAT)
>> --
>> 1.7.10.4
> 
> Best regards,
> Yang



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 263 bytes --]

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v6 15/15] nEPT: Miscellaneous cleanups
  2013-08-01 14:08 ` [PATCH v6 15/15] nEPT: Miscellaneous cleanups Gleb Natapov
@ 2013-08-02  6:45   ` Xiao Guangrong
  0 siblings, 0 replies; 39+ messages in thread
From: Xiao Guangrong @ 2013-08-02  6:45 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: kvm, Jun Nakajima, Yang Zhang, pbonzini

On 08/01/2013 10:08 PM, Gleb Natapov wrote:
> From: Nadav Har'El <nyh@il.ibm.com>
> 
> Some trivial code cleanups not really related to nested EPT.

Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>


^ permalink raw reply	[flat|nested] 39+ messages in thread

* RE: [PATCH v6 01/15] nEPT: Support LOAD_IA32_EFER entry/exit controls for L1
  2013-08-02  6:35     ` Jan Kiszka
@ 2013-08-02  7:27       ` Zhang, Yang Z
  2013-08-02  7:33         ` Jan Kiszka
  0 siblings, 1 reply; 39+ messages in thread
From: Zhang, Yang Z @ 2013-08-02  7:27 UTC (permalink / raw)
  To: Jan Kiszka
  Cc: Gleb Natapov, kvm, Xiao Guangrong, Nakajima, Jun, pbonzini,
	"李春奇 <Arthur Chunqi Li>"

Jan Kiszka wrote on 2013-08-02:
> On 2013-08-02 05:04, Zhang, Yang Z wrote:
>> Gleb Natapov wrote on 2013-08-01:
>>> From: Nadav Har'El <nyh@il.ibm.com>
>>> 
>>> Recent KVM, since
>>> http://kerneltrap.org/mailarchive/linux-kvm/2010/5/2/6261577
>>> switch the EFER MSR when EPT is used and the host and guest have 
>>> different NX bits. So if we add support for nested EPT (L1 guest 
>>> using EPT to run L2) and want to be able to run recent KVM as L1, we 
>>> need to allow L1 to use this EFER switching feature.
>>> 
>>> To do this EFER switching, KVM uses VM_ENTRY/EXIT_LOAD_IA32_EFER if 
>>> available, and if it isn't, it uses the generic 
>>> VM_ENTRY/EXIT_MSR_LOAD. This patch adds support for the former (the 
>>> latter is still unsupported).
>>> 
>>> Nested entry and exit emulation (prepare_vmcs_02 and 
>>> load_vmcs12_host_state,
>>> respectively) already handled VM_ENTRY/EXIT_LOAD_IA32_EFER correctly.
>>> So all that's left to do in this patch is to properly advertise this 
>>> feature to L1.
>>> 
>>> Note that vmcs12's VM_ENTRY/EXIT_LOAD_IA32_EFER are emulated by L0, 
>>> by using vmx_set_efer (which itself sets one of several vmcs02 
>>> fields), so we always support this feature, regardless of whether 
>>> the host supports it.
>>> 
>>> Reviewed-by: Orit Wasserman <owasserm@redhat.com>
>>> Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
>>> Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
>>> Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
>>> Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
>>> Signed-off-by: Gleb Natapov <gleb@redhat.com>
>>> ---
>>>  arch/x86/kvm/vmx.c |   23 ++++++++++++++++-------
>>>  1 file changed, 16 insertions(+), 7 deletions(-) diff --git 
>>> a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index e999dc7..27efa6a 
>>> 100644
>>> --- a/arch/x86/kvm/vmx.c
>>> +++ b/arch/x86/kvm/vmx.c
>>> @@ -2198,7 +2198,8 @@ static __init void
>>> nested_vmx_setup_ctls_msrs(void)
>>>  #else
>>>  	nested_vmx_exit_ctls_high = 0;
>>>  #endif
>>> -	nested_vmx_exit_ctls_high |=
>>> VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR;
>>> +	nested_vmx_exit_ctls_high |=
>>> (VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR |
>>> +				      VM_EXIT_LOAD_IA32_EFER);
>>> 
>>>  	/* entry controls */
>>>  	rdmsr(MSR_IA32_VMX_ENTRY_CTLS,
>>> @@ -2207,8 +2208,8 @@ static __init void
>>> nested_vmx_setup_ctls_msrs(void)
>>>  	nested_vmx_entry_ctls_low = VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR;
>>>  	nested_vmx_entry_ctls_high &= 		VM_ENTRY_LOAD_IA32_PAT |
>>>  VM_ENTRY_IA32E_MODE;
>>> -	nested_vmx_entry_ctls_high |=
>>> VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR;
>>> -
>>> +	nested_vmx_entry_ctls_high |=
>>> (VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR |
>>> +				       VM_ENTRY_LOAD_IA32_EFER);
>> Just saw it: we don't expose bit 22 (save VMX-preemption timer) in the
>> VM-exit controls, but we already allow the guest to set the
>> activate-VMX-preemption-timer bit in the pin-based VM-execution
>> controls. This is wrong.
> 
> Does the presence of preemption timer support imply that saving its 
> value is also supported? Then we could demand this combination (ie. do 
> not expose preemption timer support at all to L1 if value saving is
> missing) and build our preemption timer emulation on top.
> 
I don't see us saving the preemption timer value to vmcs12 in prepare_vmcs12(). Will it be saved automatically?
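
(If it is not, prepare_vmcs12() would need something along these lines -- a
sketch, assuming vmcs12 already carries a vmx_preemption_timer_value field
and that VMX_PREEMPTION_TIMER_VALUE is the usual VMCS field encoding:)

	if ((vmcs12->pin_based_vm_exec_control &
	     PIN_BASED_VMX_PREEMPTION_TIMER) &&
	    (vmcs12->vm_exit_controls & VM_EXIT_SAVE_VMX_PREEMPTION_TIMER))
		vmcs12->vmx_preemption_timer_value =
			vmcs_read32(VMX_PREEMPTION_TIMER_VALUE);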

> There is more broken w.r.t. the VMX preemption timer; patches are welcome.
> Arthur will also try to develop test cases for it. But that topic is 
> unrelated to this series.
> 
> Jan
> 
>> 
>>>  	/* cpu-based controls */
>>>  	rdmsr(MSR_IA32_VMX_PROCBASED_CTLS,
>>>  		nested_vmx_procbased_ctls_low,
>>> nested_vmx_procbased_ctls_high);
>>> @@ -7529,10 +7530,18 @@ static void prepare_vmcs02(struct kvm_vcpu 
>>> *vcpu, struct vmcs12 *vmcs12)
>>>  	vcpu->arch.cr0_guest_owned_bits &= ~vmcs12->cr0_guest_host_mask;
>>>  	vmcs_writel(CR0_GUEST_HOST_MASK,
>>> ~vcpu->arch.cr0_guest_owned_bits);
>>> 
>>> -	/* Note: IA32_MODE, LOAD_IA32_EFER are modified by vmx_set_efer
>>> below */
>>> -	vmcs_write32(VM_EXIT_CONTROLS,
>>> -		vmcs12->vm_exit_controls | vmcs_config.vmexit_ctrl);
>>> -	vmcs_write32(VM_ENTRY_CONTROLS, vmcs12->vm_entry_controls |
>>> +	/* L2->L1 exit controls are emulated - the hardware exit is to L0 so
>>> +	 * we should use its exit controls. Note that VM_EXIT_LOAD_IA32_EFER
>>> +	 * bits are further modified by vmx_set_efer() below.
>>> +	 */
>>> +	vmcs_write32(VM_EXIT_CONTROLS, vmcs_config.vmexit_ctrl);
>> Should we mention that the save-VMX-preemption-timer bit must use
>> host|guest, not just host?
>> 
>>> +
>>> +	/* vmcs12's VM_ENTRY_LOAD_IA32_EFER and VM_ENTRY_IA32E_MODE
>>> are
>>> +	 * emulated by vmx_set_efer(), below.
>>> +	 */
>>> +	vmcs_write32(VM_ENTRY_CONTROLS,
>>> +		(vmcs12->vm_entry_controls & ~VM_ENTRY_LOAD_IA32_EFER &
>>> +			~VM_ENTRY_IA32E_MODE) |
>>>  		(vmcs_config.vmentry_ctrl & ~VM_ENTRY_IA32E_MODE));
>>>  
>>>  	if (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_PAT)
>>> --
>>> 1.7.10.4
>> 
>> Best regards,
>> Yang
>


Best regards,
Yang



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v6 01/15] nEPT: Support LOAD_IA32_EFER entry/exit controls for L1
  2013-08-02  7:27       ` Zhang, Yang Z
@ 2013-08-02  7:33         ` Jan Kiszka
  0 siblings, 0 replies; 39+ messages in thread
From: Jan Kiszka @ 2013-08-02  7:33 UTC (permalink / raw)
  To: Zhang, Yang Z
  Cc: Gleb Natapov, kvm, Xiao Guangrong, Nakajima, Jun, pbonzini,
	"李春奇 <Arthur Chunqi Li>"

[-- Attachment #1: Type: text/plain, Size: 5362 bytes --]

On 2013-08-02 09:27, Zhang, Yang Z wrote:
> Jan Kiszka wrote on 2013-08-02:
>> On 2013-08-02 05:04, Zhang, Yang Z wrote:
>>> Gleb Natapov wrote on 2013-08-01:
>>>> From: Nadav Har'El <nyh@il.ibm.com>
>>>>
>>>> Recent KVM, since
>>>> http://kerneltrap.org/mailarchive/linux-kvm/2010/5/2/6261577
>>>> switch the EFER MSR when EPT is used and the host and guest have 
>>>> different NX bits. So if we add support for nested EPT (L1 guest 
>>>> using EPT to run L2) and want to be able to run recent KVM as L1, we 
>>>> need to allow L1 to use this EFER switching feature.
>>>>
>>>> To do this EFER switching, KVM uses VM_ENTRY/EXIT_LOAD_IA32_EFER if 
>>>> available, and if it isn't, it uses the generic 
>>>> VM_ENTRY/EXIT_MSR_LOAD. This patch adds support for the former (the 
>>>> latter is still unsupported).
>>>>
>>>> Nested entry and exit emulation (prepare_vmcs_02 and 
>>>> load_vmcs12_host_state,
>>>> respectively) already handled VM_ENTRY/EXIT_LOAD_IA32_EFER correctly.
>>>> So all that's left to do in this patch is to properly advertise this 
>>>> feature to L1.
>>>>
>>>> Note that vmcs12's VM_ENTRY/EXIT_LOAD_IA32_EFER are emulated by L0, 
>>>> by using vmx_set_efer (which itself sets one of several vmcs02 
>>>> fields), so we always support this feature, regardless of whether 
>>>> the host supports it.
>>>>
>>>> Reviewed-by: Orit Wasserman <owasserm@redhat.com>
>>>> Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
>>>> Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
>>>> Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
>>>> Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
>>>> Signed-off-by: Gleb Natapov <gleb@redhat.com>
>>>> ---
>>>>  arch/x86/kvm/vmx.c |   23 ++++++++++++++++-------
>>>>  1 file changed, 16 insertions(+), 7 deletions(-) diff --git 
>>>> a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index e999dc7..27efa6a 
>>>> 100644
>>>> --- a/arch/x86/kvm/vmx.c
>>>> +++ b/arch/x86/kvm/vmx.c
>>>> @@ -2198,7 +2198,8 @@ static __init void
>>>> nested_vmx_setup_ctls_msrs(void)
>>>>  #else
>>>>  	nested_vmx_exit_ctls_high = 0;
>>>>  #endif
>>>> -	nested_vmx_exit_ctls_high |=
>>>> VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR;
>>>> +	nested_vmx_exit_ctls_high |=
>>>> (VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR |
>>>> +				      VM_EXIT_LOAD_IA32_EFER);
>>>>
>>>>  	/* entry controls */
>>>>  	rdmsr(MSR_IA32_VMX_ENTRY_CTLS,
>>>> @@ -2207,8 +2208,8 @@ static __init void
>>>> nested_vmx_setup_ctls_msrs(void)
>>>>  	nested_vmx_entry_ctls_low = VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR;
>>>>  	nested_vmx_entry_ctls_high &= 		VM_ENTRY_LOAD_IA32_PAT |
>>>>  VM_ENTRY_IA32E_MODE;
>>>> -	nested_vmx_entry_ctls_high |=
>>>> VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR;
>>>> -
>>>> +	nested_vmx_entry_ctls_high |=
>>>> (VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR |
>>>> +				       VM_ENTRY_LOAD_IA32_EFER);
>>> Just saw it: we don't expose bit 22 (save VMX-preemption timer) in the
>>> VM-exit controls, but we already allow the guest to set the
>>> activate-VMX-preemption-timer bit in the pin-based VM-execution
>>> controls. This is wrong.
>>
>> Does the presence of preemption timer support imply that saving its 
>> value is also supported? Then we could demand this combination (ie. do 
>> not expose preemption timer support at all to L1 if value saving is
>> missing) and build our preemption timer emulation on top.
>>
> I don't see us saving the preemption timer value to vmcs12 in prepare_vmcs12(). Will it be saved automatically?

No. As I said, there is more broken with our preemption timer emulation.

Jan

> 
>> There is more broken w.r.t. the VMX preemption timer; patches are welcome.
>> Arthur will also try to develop test cases for it. But that topic is 
>> unrelated to this series.
>>
>> Jan
>>
>>>
>>>>  	/* cpu-based controls */
>>>>  	rdmsr(MSR_IA32_VMX_PROCBASED_CTLS,
>>>>  		nested_vmx_procbased_ctls_low,
>>>> nested_vmx_procbased_ctls_high);
>>>> @@ -7529,10 +7530,18 @@ static void prepare_vmcs02(struct kvm_vcpu 
>>>> *vcpu, struct vmcs12 *vmcs12)
>>>>  	vcpu->arch.cr0_guest_owned_bits &= ~vmcs12->cr0_guest_host_mask;
>>>>  	vmcs_writel(CR0_GUEST_HOST_MASK,
>>>> ~vcpu->arch.cr0_guest_owned_bits);
>>>>
>>>> -	/* Note: IA32_MODE, LOAD_IA32_EFER are modified by vmx_set_efer
>>>> below */
>>>> -	vmcs_write32(VM_EXIT_CONTROLS,
>>>> -		vmcs12->vm_exit_controls | vmcs_config.vmexit_ctrl);
>>>> -	vmcs_write32(VM_ENTRY_CONTROLS, vmcs12->vm_entry_controls |
>>>> +	/* L2->L1 exit controls are emulated - the hardware exit is to L0 so
>>>> +	 * we should use its exit controls. Note that VM_EXIT_LOAD_IA32_EFER
>>>> +	 * bits are further modified by vmx_set_efer() below.
>>>> +	 */
>>>> +	vmcs_write32(VM_EXIT_CONTROLS, vmcs_config.vmexit_ctrl);
>>> Should we mention that the save-VMX-preemption-timer bit must use
>>> host|guest, not just host?
>>>
>>>> +
>>>> +	/* vmcs12's VM_ENTRY_LOAD_IA32_EFER and VM_ENTRY_IA32E_MODE
>>>> are
>>>> +	 * emulated by vmx_set_efer(), below.
>>>> +	 */
>>>> +	vmcs_write32(VM_ENTRY_CONTROLS,
>>>> +		(vmcs12->vm_entry_controls & ~VM_ENTRY_LOAD_IA32_EFER &
>>>> +			~VM_ENTRY_IA32E_MODE) |
>>>>  		(vmcs_config.vmentry_ctrl & ~VM_ENTRY_IA32E_MODE));
>>>>  
>>>>  	if (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_PAT)
>>>> --
>>>> 1.7.10.4
>>>
>>> Best regards,
>>> Yang
>>
> 
> 
> Best regards,
> Yang
> 
> 



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 263 bytes --]

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v6 10/15] nEPT: Nested INVEPT
  2013-08-01 14:08 ` [PATCH v6 10/15] nEPT: Nested INVEPT Gleb Natapov
@ 2013-08-02  8:06   ` Xiao Guangrong
  2013-08-02 10:00     ` Gleb Natapov
  0 siblings, 1 reply; 39+ messages in thread
From: Xiao Guangrong @ 2013-08-02  8:06 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: kvm, Jun Nakajima, Yang Zhang, pbonzini

On 08/01/2013 10:08 PM, Gleb Natapov wrote:

> +/* Emulate the INVEPT instruction */
> +static int handle_invept(struct kvm_vcpu *vcpu)
> +{
> +	u32 vmx_instruction_info;
> +	bool ok;
> +	unsigned long type;
> +	gva_t gva;
> +	struct x86_exception e;
> +	struct {
> +		u64 eptp, gpa;
> +	} operand;
> +
> +	if (!(nested_vmx_secondary_ctls_high & SECONDARY_EXEC_ENABLE_EPT) ||
> +	    !(nested_vmx_ept_caps & VMX_EPT_INVEPT_BIT)) {
> +		kvm_queue_exception(vcpu, UD_VECTOR);
> +		return 1;
> +	}
> +
> +	if (!nested_vmx_check_permission(vcpu))
> +		return 1;
> +
> +	if (!kvm_read_cr0_bits(vcpu, X86_CR0_PE)) {
> +		kvm_queue_exception(vcpu, UD_VECTOR);
> +		return 1;
> +	}
> +
> +	/* According to the Intel VMX instruction reference, the memory
> +	 * operand is read even if it isn't needed (e.g., for type==global)
> +	 */
> +	vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
> +	if (get_vmx_mem_address(vcpu, vmcs_readl(EXIT_QUALIFICATION),
> +			vmx_instruction_info, &gva))
> +		return 1;
> +	if (kvm_read_guest_virt(&vcpu->arch.emulate_ctxt, gva, &operand,
> +				sizeof(operand), &e)) {
> +		kvm_inject_page_fault(vcpu, &e);
> +		return 1;
> +	}
> +
> +	type = kvm_register_read(vcpu, (vmx_instruction_info >> 28) & 0xf);
> +
> +	switch (type) {
> +	case VMX_EPT_EXTENT_GLOBAL:
> +	case VMX_EPT_EXTENT_CONTEXT:
> +		ok = !!(nested_vmx_ept_caps &
> +				(1UL << (type + VMX_EPT_EXTENT_SHIFT)));
> +		break;
> +	default:
> +		ok = false;
> +	}
> +
> +	if (ok) {
> +		kvm_mmu_sync_roots(vcpu);
> +		kvm_mmu_flush_tlb(vcpu);
> +		nested_vmx_succeed(vcpu);
> +	}
> +	else
> +		nested_vmx_failValid(vcpu,
> +				VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID);
> +
> +	skip_emulated_instruction(vcpu);
> +	return 1;

Can this code be changed to:

	switch (type) {
	case VMX_EPT_EXTENT_GLOBAL:
	case VMX_EPT_EXTENT_CONTEXT:
		if (nested_vmx_ept_caps &
				(1UL << (type + VMX_EPT_EXTENT_SHIFT))) {
			kvm_mmu_sync_roots(vcpu);
			kvm_mmu_flush_tlb(vcpu);
			nested_vmx_succeed(vcpu);
			break;
		}
	default:
		nested_vmx_failValid(vcpu,
				VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID);
	}
?
That's clearer, I think.

Also, we can sync the shadow page table only if the current eptp is the required
eptp, that means:

if (type == VMX_EPT_EXTENT_GLOBAL || operand.eptp == nested_ept_get_cr3(vcpu)) {
	kvm_mmu_sync_roots(vcpu);
	......
}
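
Putting the two together, the INVEPT handler could end up looking roughly
like this (a sketch only; VMX_EPT_EXTENT_GLOBAL is the real extent
constant, and the eptp comparison assumes operand.eptp was read as above):

	switch (type) {
	case VMX_EPT_EXTENT_GLOBAL:
	case VMX_EPT_EXTENT_CONTEXT:
		if (!(nested_vmx_ept_caps &
		      (1UL << (type + VMX_EPT_EXTENT_SHIFT))))
			goto fail;
		/* Flush only if the shadowed EPT12 is actually affected. */
		if (type == VMX_EPT_EXTENT_GLOBAL ||
		    operand.eptp == nested_ept_get_cr3(vcpu)) {
			kvm_mmu_sync_roots(vcpu);
			kvm_mmu_flush_tlb(vcpu);
		}
		nested_vmx_succeed(vcpu);
		break;
	default:
	fail:
		nested_vmx_failValid(vcpu,
				VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID);
	}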


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v6 13/15] nEPT: Advertise EPT to L1
  2013-08-01 14:08 ` [PATCH v6 13/15] nEPT: Advertise EPT to L1 Gleb Natapov
@ 2013-08-02  8:29   ` Xiao Guangrong
  0 siblings, 0 replies; 39+ messages in thread
From: Xiao Guangrong @ 2013-08-02  8:29 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: kvm, Jun Nakajima, Yang Zhang, pbonzini

On 08/01/2013 10:08 PM, Gleb Natapov wrote:
> From: Nadav Har'El <nyh@il.ibm.com>
> 
> Advertise the support of EPT to the L1 guest, through the appropriate MSR.
> 
> This is the last patch of the basic Nested EPT feature, so as to allow
> bisection through this patch series: The guest will not see EPT support until
> this last patch, and will not attempt to use the half-applied feature.

Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v6 02/15] nEPT: Fix cr3 handling in nested exit and entry
  2013-08-01 14:08 ` [PATCH v6 02/15] nEPT: Fix cr3 handling in nested exit and entry Gleb Natapov
@ 2013-08-02  9:23   ` Xiao Guangrong
  0 siblings, 0 replies; 39+ messages in thread
From: Xiao Guangrong @ 2013-08-02  9:23 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: kvm, Jun Nakajima, Yang Zhang, pbonzini

On 08/01/2013 10:08 PM, Gleb Natapov wrote:
> From: Nadav Har'El <nyh@il.ibm.com>
> 
> The existing code for handling cr3 and related VMCS fields during nested
> exit and entry wasn't correct in all cases:
> 
> If L2 is allowed to control cr3 (and this is indeed the case in nested EPT),
> during nested exit we must copy the modified cr3 from vmcs02 to vmcs12, and
> we forgot to do so. This patch adds this copy.
> 
> If L0 isn't controlling cr3 when running L2 (i.e., L0 is using EPT), and
> whoever does control cr3 (L1 or L2) is using PAE, the processor might have
> saved PDPTEs and we should also save them in vmcs12 (and restore later).

Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v6 10/15] nEPT: Nested INVEPT
  2013-08-02  8:06   ` Xiao Guangrong
@ 2013-08-02 10:00     ` Gleb Natapov
  0 siblings, 0 replies; 39+ messages in thread
From: Gleb Natapov @ 2013-08-02 10:00 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: kvm, Jun Nakajima, Yang Zhang, pbonzini

On Fri, Aug 02, 2013 at 04:06:00PM +0800, Xiao Guangrong wrote:
> On 08/01/2013 10:08 PM, Gleb Natapov wrote:
> 
> > +/* Emulate the INVEPT instruction */
> > +static int handle_invept(struct kvm_vcpu *vcpu)
> > +{
> > +	u32 vmx_instruction_info;
> > +	bool ok;
> > +	unsigned long type;
> > +	gva_t gva;
> > +	struct x86_exception e;
> > +	struct {
> > +		u64 eptp, gpa;
> > +	} operand;
> > +
> > +	if (!(nested_vmx_secondary_ctls_high & SECONDARY_EXEC_ENABLE_EPT) ||
> > +	    !(nested_vmx_ept_caps & VMX_EPT_INVEPT_BIT)) {
> > +		kvm_queue_exception(vcpu, UD_VECTOR);
> > +		return 1;
> > +	}
> > +
> > +	if (!nested_vmx_check_permission(vcpu))
> > +		return 1;
> > +
> > +	if (!kvm_read_cr0_bits(vcpu, X86_CR0_PE)) {
> > +		kvm_queue_exception(vcpu, UD_VECTOR);
> > +		return 1;
> > +	}
> > +
> > +	/* According to the Intel VMX instruction reference, the memory
> > +	 * operand is read even if it isn't needed (e.g., for type==global)
> > +	 */
> > +	vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
> > +	if (get_vmx_mem_address(vcpu, vmcs_readl(EXIT_QUALIFICATION),
> > +			vmx_instruction_info, &gva))
> > +		return 1;
> > +	if (kvm_read_guest_virt(&vcpu->arch.emulate_ctxt, gva, &operand,
> > +				sizeof(operand), &e)) {
> > +		kvm_inject_page_fault(vcpu, &e);
> > +		return 1;
> > +	}
> > +
> > +	type = kvm_register_read(vcpu, (vmx_instruction_info >> 28) & 0xf);
> > +
> > +	switch (type) {
> > +	case VMX_EPT_EXTENT_GLOBAL:
> > +	case VMX_EPT_EXTENT_CONTEXT:
> > +		ok = !!(nested_vmx_ept_caps &
> > +				(1UL << (type + VMX_EPT_EXTENT_SHIFT)));
> > +		break;
> > +	default:
> > +		ok = false;
> > +	}
> > +
> > +	if (ok) {
> > +		kvm_mmu_sync_roots(vcpu);
> > +		kvm_mmu_flush_tlb(vcpu);
> > +		nested_vmx_succeed(vcpu);
> > +	}
> > +	else
> > +		nested_vmx_failValid(vcpu,
> > +				VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID);
> > +
> > +	skip_emulated_instruction(vcpu);
> > +	return 1;
> 
> Can this code be changed to:
> 
> 	switch (type) {
> 	case VMX_EPT_EXTENT_GLOBAL:
> 	case VMX_EPT_EXTENT_CONTEXT:
> 		if (nested_vmx_ept_caps &
> 				(1UL << (type + VMX_EPT_EXTENT_SHIFT))) {
> 			kvm_mmu_sync_roots(vcpu);
> 			kvm_mmu_flush_tlb(vcpu);
> 			nested_vmx_succeed(vcpu);
> 			break;
> 		}
> 	default:
> 		nested_vmx_failValid(vcpu,
> 				VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID);
> 	}
> ?
> That's clearer, I think.
> 
> Also, we can sync the shadow page table only if the current eptp is the required
> eptp, that means:
> 
> if (type == VMX_EPT_EXTENT_GLOBAL || operand.eptp == nested_ept_get_cr3(vcpu)) {
> 	kvm_mmu_sync_roots(vcpu);
> 	......
> }
Good point. Will do.

--
			Gleb.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v6 00/15] Nested EPT
  2013-08-01 14:08 [PATCH v6 00/15] Nested EPT Gleb Natapov
                   ` (14 preceding siblings ...)
  2013-08-01 14:08 ` [PATCH v6 15/15] nEPT: Miscellaneous cleanups Gleb Natapov
@ 2013-08-04  9:24 ` Jan Kiszka
  2013-08-04  9:32   ` Gleb Natapov
  15 siblings, 1 reply; 39+ messages in thread
From: Jan Kiszka @ 2013-08-04  9:24 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: kvm, Xiao Guangrong, Jun Nakajima, Yang Zhang, pbonzini

[-- Attachment #1: Type: text/plain, Size: 728 bytes --]

On 2013-08-01 16:08, Gleb Natapov wrote:
> Another day -- another version of the nested EPT patches. In this version
> included fix for need_remote_flush() with shadowed ept, set bits 6:8
> of exit_qualification during ept_violation, update_permission_bitmask()
> made to work with shadowed ept pages and other small adjustment according
> to review comments.

Was just testing it here and ran into a bug: I've L2 accessing the HPET
MMIO region that my L1 passed through from L0 (where it is supposed to
be emulated in this setup). This used to work with an older posting of
Jun, but now it locks up (infinite loop over L2's MMIO access, no L2->L1
transition). Any ideas where to look for debugging this?

Jan



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 263 bytes --]

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v6 00/15] Nested EPT
  2013-08-04  9:24 ` [PATCH v6 00/15] Nested EPT Jan Kiszka
@ 2013-08-04  9:32   ` Gleb Natapov
  2013-08-04  9:53     ` Gleb Natapov
  0 siblings, 1 reply; 39+ messages in thread
From: Gleb Natapov @ 2013-08-04  9:32 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: kvm, Xiao Guangrong, Jun Nakajima, Yang Zhang, pbonzini

On Sun, Aug 04, 2013 at 11:24:41AM +0200, Jan Kiszka wrote:
> On 2013-08-01 16:08, Gleb Natapov wrote:
> > Another day -- another version of the nested EPT patches. In this version
> > included fix for need_remote_flush() with shadowed ept, set bits 6:8
> > of exit_qualification during ept_violation, update_permission_bitmask()
> > made to work with shadowed ept pages and other small adjustment according
> > to review comments.
> 
> Was just testing it here and ran into a bug: I've L2 accessing the HPET
> MMIO region that my L1 passed through from L0 (where it is supposed to
> be emulated in this setup). This used to work with an older posting of
Not sure I understand your setup. L0 emulates HPET, L1 passes it through
to L2 (mmaps it and creates a kvm slot that points to it), and when L2
accesses it, it locks up?

> Jun, but now it locks up (infinite loop over L2's MMIO access, no L2->L1
> transition). Any ideas where to look for debugging this?
> 
Can you do an ftrace -e kvm -e kvmmmu? Unit test will also be helpful :)

--
			Gleb.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v6 00/15] Nested EPT
  2013-08-04  9:32   ` Gleb Natapov
@ 2013-08-04  9:53     ` Gleb Natapov
  2013-08-04 13:44       ` Gleb Natapov
  0 siblings, 1 reply; 39+ messages in thread
From: Gleb Natapov @ 2013-08-04  9:53 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: kvm, Xiao Guangrong, Jun Nakajima, Yang Zhang, pbonzini

On Sun, Aug 04, 2013 at 12:32:06PM +0300, Gleb Natapov wrote:
> On Sun, Aug 04, 2013 at 11:24:41AM +0200, Jan Kiszka wrote:
> > On 2013-08-01 16:08, Gleb Natapov wrote:
> > > Another day -- another version of the nested EPT patches. In this version
> > > included fix for need_remote_flush() with shadowed ept, set bits 6:8
> > > of exit_qualification during ept_violation, update_permission_bitmask()
> > > made to work with shadowed ept pages and other small adjustment according
> > > to review comments.
> > 
> > Was just testing it here and ran into a bug: I've L2 accessing the HPET
> > MMIO region that my L1 passed through from L0 (where it is supposed to
> > be emulated in this setup). This used to work with an older posting of
> Not sure I understand your setup. L0 emulates HPET, L1 passes it through
> to L2 (mmaps it and creates a kvm slot that points to it), and when L2
> accesses it, it locks up?
> 
> > Jun, but now it locks up (infinite loop over L2's MMIO access, no L2->L1
> > transition). Any ideas where to look for debugging this?
> > 
> Can you do an ftrace -e kvm -e kvmmmu? Unit test will also be helpful :)
> 
I did an MMIO access from nested guest in the vmx unit test (which is
naturally passed through to L0 since L1 is so simple) and I can see that
the access hits L0.

--
			Gleb.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v6 00/15] Nested EPT
  2013-08-04  9:53     ` Gleb Natapov
@ 2013-08-04 13:44       ` Gleb Natapov
  2013-08-04 15:14         ` Jan Kiszka
  0 siblings, 1 reply; 39+ messages in thread
From: Gleb Natapov @ 2013-08-04 13:44 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: kvm, Xiao Guangrong, Jun Nakajima, Yang Zhang, pbonzini

On Sun, Aug 04, 2013 at 12:53:56PM +0300, Gleb Natapov wrote:
> On Sun, Aug 04, 2013 at 12:32:06PM +0300, Gleb Natapov wrote:
> > On Sun, Aug 04, 2013 at 11:24:41AM +0200, Jan Kiszka wrote:
> > > On 2013-08-01 16:08, Gleb Natapov wrote:
> > > > Another day -- another version of the nested EPT patches. In this version
> > > > included fix for need_remote_flush() with shadowed ept, set bits 6:8
> > > > of exit_qualification during ept_violation, update_permission_bitmask()
> > > > made to work with shadowed ept pages and other small adjustment according
> > > > to review comments.
> > > 
> > > Was just testing it here and ran into a bug: I've L2 accessing the HPET
> > > MMIO region that my L1 passed through from L0 (where it is supposed to
> > > be emulated in this setup). This used to work with an older posting of
> > Not sure I understand your setup. L0 emulates HPET, L1 passes it through
> > to L2 (mmaps it and creates a kvm slot that points to it), and when L2
> > accesses it, it locks up?
> > 
> > > Jun, but now it locks up (infinite loop over L2's MMIO access, no L2->L1
> > > transition). Any ideas where to look for debugging this?
> > > 
> > Can you do an ftrace -e kvm -e kvmmmu? Unit test will also be helpful :)
> > 
> I did an MMIO access from nested guest in the vmx unit test (which is
> naturally passed through to L0 since L1 is so simple) and I can see that
> the access hits L0.
> 
But then the unit test does not use nested EPT yet :)

--
			Gleb.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v6 00/15] Nested EPT
  2013-08-04 13:44       ` Gleb Natapov
@ 2013-08-04 15:14         ` Jan Kiszka
  2013-08-04 16:15           ` Xiao Guangrong
  0 siblings, 1 reply; 39+ messages in thread
From: Jan Kiszka @ 2013-08-04 15:14 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: kvm, Xiao Guangrong, Jun Nakajima, Yang Zhang, pbonzini

[-- Attachment #1: Type: text/plain, Size: 6656 bytes --]

On 2013-08-04 15:44, Gleb Natapov wrote:
> On Sun, Aug 04, 2013 at 12:53:56PM +0300, Gleb Natapov wrote:
>> On Sun, Aug 04, 2013 at 12:32:06PM +0300, Gleb Natapov wrote:
>>> On Sun, Aug 04, 2013 at 11:24:41AM +0200, Jan Kiszka wrote:
>>>> On 2013-08-01 16:08, Gleb Natapov wrote:
>>>>> Another day -- another version of the nested EPT patches. In this version
>>>>> included fix for need_remote_flush() with shadowed ept, set bits 6:8
>>>>> of exit_qualification during ept_violation, update_permission_bitmask()
>>>>> made to work with shadowed ept pages and other small adjustment according
>>>>> to review comments.
>>>>
>>>> Was just testing it here and ran into a bug: I've L2 accessing the HPET
>>>> MMIO region that my L1 passed through from L0 (where it is supposed to
>>>> be emulated in this setup). This used to work with an older posting of
>>> Not sure I understand your setup. L0 emulates HPET, L1 passes it through
>>> to L2 (mmaps it and creates a kvm slot that points to it), and when L2
>>> accesses it, it locks up?
>>>
>>>> Jun, but now it locks up (infinite loop over L2's MMIO access, no L2->L1
>>>> transition). Any ideas where to look for debugging this?
>>>>
>>> Can you do an ftrace -e kvm -e kvmmmu? Unit test will also be helpful :)
>>>
>> I did an MMIO access from nested guest in the vmx unit test (which is
>> naturally passed through to L0 since L1 is so simple) and I can see that
>> the access hits L0.
>>
> But then the unit test does not use nested EPT yet :)

Indeed, that's what I was about to notice as well. EPT test cases are on
Arthur's list, but I suggested he start with something easier, like some MSR switches
(just to let him run into KVM's PAT bugs ;) ).

Anyway, here are the traces:

 qemu-system-x86-11521 [000]  4724.170191: kvm_entry:            vcpu 0
 qemu-system-x86-11521 [000]  4724.170192: kvm_exit:             reason EPT_VIOLATION rip 0xffffffff8102ab70 info 181 0
 qemu-system-x86-11521 [000]  4724.170192: kvm_page_fault:       address 1901978 error_code 181
 qemu-system-x86-11521 [000]  4724.170193: kvm_mmu_pagetable_walk: addr 1901978 pferr 0 
 qemu-system-x86-11521 [000]  4724.170193: kvm_mmu_paging_element: pte 3c04c007 level 4
 qemu-system-x86-11521 [000]  4724.170193: kvm_mmu_paging_element: pte 3c04d007 level 3
 qemu-system-x86-11521 [000]  4724.170193: kvm_mmu_paging_element: pte 3c05a007 level 2
 qemu-system-x86-11521 [000]  4724.170193: kvm_mmu_paging_element: pte 1901037 level 1
 qemu-system-x86-11521 [000]  4724.170197: kvm_entry:            vcpu 0
 qemu-system-x86-11521 [000]  4724.170198: kvm_exit:             reason EPT_VIOLATION rip 0xffffffff8102ab77 info 81 0
 qemu-system-x86-11521 [000]  4724.170199: kvm_page_fault:       address 3a029000 error_code 81
 qemu-system-x86-11521 [000]  4724.170199: kvm_mmu_pagetable_walk: addr 3a029000 pferr 0 
 qemu-system-x86-11521 [000]  4724.170199: kvm_mmu_paging_element: pte 3c04c007 level 4
 qemu-system-x86-11521 [000]  4724.170199: kvm_mmu_paging_element: pte 3c04d007 level 3
 qemu-system-x86-11521 [000]  4724.170199: kvm_mmu_paging_element: pte 3c21e007 level 2
 qemu-system-x86-11521 [000]  4724.170200: kvm_mmu_paging_element: pte 3a029037 level 1
 qemu-system-x86-11521 [000]  4724.170203: kvm_entry:            vcpu 0
 qemu-system-x86-11521 [000]  4724.170204: kvm_exit:             reason EPT_VIOLATION rip 0xffffffff8102ab77 info 181 0
 qemu-system-x86-11521 [000]  4724.170204: kvm_page_fault:       address fed000f0 error_code 181
 qemu-system-x86-11521 [000]  4724.170205: kvm_mmu_pagetable_walk: addr fed000f0 pferr 0 
 qemu-system-x86-11521 [000]  4724.170205: kvm_mmu_paging_element: pte 3c04c007 level 4
 qemu-system-x86-11521 [000]  4724.170205: kvm_mmu_paging_element: pte 3c42f003 level 3
 qemu-system-x86-11521 [000]  4724.170205: kvm_mmu_paging_element: pte 3c626003 level 2
 qemu-system-x86-11521 [000]  4724.170206: kvm_mmu_paging_element: pte fed00033 level 1
 qemu-system-x86-11521 [000]  4724.170213: mark_mmio_spte:       sptep:0xffff88014e8ad800 gfn fed00 access 6 gen b7f
 qemu-system-x86-11521 [000]  4724.170214: kvm_mmu_pagetable_walk: addr ffffffff8102ab77 pferr 10 F
 qemu-system-x86-11521 [000]  4724.170215: kvm_mmu_pagetable_walk: addr 1710000 pferr 6 W|U
 qemu-system-x86-11521 [000]  4724.170215: kvm_mmu_paging_element: pte 3c04c007 level 4
 qemu-system-x86-11521 [000]  4724.170215: kvm_mmu_paging_element: pte 3c04d007 level 3
 qemu-system-x86-11521 [000]  4724.170216: kvm_mmu_paging_element: pte 3c059007 level 2
 qemu-system-x86-11521 [000]  4724.170216: kvm_mmu_paging_element: pte 1710037 level 1
 qemu-system-x86-11521 [000]  4724.170216: kvm_mmu_paging_element: pte 1711067 level 4
 qemu-system-x86-11521 [000]  4724.170216: kvm_mmu_walker_error: pferr 19 P|RSVD|F
 qemu-system-x86-11521 [000]  4724.170217: kvm_emulate_insn:     0:ffffffff8102ab77: 
 qemu-system-x86-11521 [000]  4724.170217: kvm_mmu_pagetable_walk: addr fed000f0 pferr 2 W
 qemu-system-x86-11521 [000]  4724.170217: kvm_mmu_pagetable_walk: addr 1710000 pferr 6 W|U
 qemu-system-x86-11521 [000]  4724.170217: kvm_mmu_paging_element: pte 3c04c007 level 4
 qemu-system-x86-11521 [000]  4724.170217: kvm_mmu_paging_element: pte 3c04d007 level 3
 qemu-system-x86-11521 [000]  4724.170218: kvm_mmu_paging_element: pte 3c059007 level 2
 qemu-system-x86-11521 [000]  4724.170218: kvm_mmu_paging_element: pte 1710037 level 1
 qemu-system-x86-11521 [000]  4724.170218: kvm_mmu_paging_element: pte 0 level 4
 qemu-system-x86-11521 [000]  4724.170218: kvm_mmu_walker_error: pferr 2 W
 qemu-system-x86-11521 [000]  4724.170219: kvm_entry:            vcpu 0
 qemu-system-x86-11521 [000]  4724.170220: kvm_exit:             reason EPT_MISCONFIG rip 0xffffffff8102ab77 info 0 0
 qemu-system-x86-11521 [000]  4724.170221: kvm_mmu_pagetable_walk: addr ffffffff8102ab77 pferr 10 F
 qemu-system-x86-11521 [000]  4724.170221: kvm_mmu_pagetable_walk: addr 1710000 pferr 6 W|U
 qemu-system-x86-11521 [000]  4724.170222: kvm_mmu_paging_element: pte 3c04c007 level 4
 qemu-system-x86-11521 [000]  4724.170222: kvm_mmu_paging_element: pte 3c04d007 level 3
 qemu-system-x86-11521 [000]  4724.170222: kvm_mmu_paging_element: pte 3c059007 level 2
 qemu-system-x86-11521 [000]  4724.170222: kvm_mmu_paging_element: pte 1710037 level 1
 qemu-system-x86-11521 [000]  4724.170223: kvm_mmu_paging_element: pte 1711067 level 4
 qemu-system-x86-11521 [000]  4724.170223: kvm_mmu_walker_error: pferr 19 P|RSVD|F
 qemu-system-x86-11521 [000]  4724.170223: kvm_emulate_insn:     0:ffffffff8102ab77: 

The L1 code runs fine as L0 (i.e. natively) on the same hardware.
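
To decode the pferr values in the trace: they are KVM's page-fault
error-code bits. Below is a minimal standalone decoder; the mask values
are written from memory of KVM's headers, so verify them against your
tree:

#include <stdio.h>

/* Page-fault error-code bits as KVM defines them (from memory):
 * bit 0 P, bit 1 W, bit 2 U, bit 3 RSVD, bit 4 F. */
#define PFERR_PRESENT_MASK (1u << 0)	/* P */
#define PFERR_WRITE_MASK   (1u << 1)	/* W */
#define PFERR_USER_MASK    (1u << 2)	/* U */
#define PFERR_RSVD_MASK    (1u << 3)	/* RSVD */
#define PFERR_FETCH_MASK   (1u << 4)	/* F */

int main(void)
{
	unsigned int pferr = 0x19;	/* from the kvm_mmu_walker_error line above */

	printf("P=%d W=%d U=%d RSVD=%d F=%d\n",
	       !!(pferr & PFERR_PRESENT_MASK),
	       !!(pferr & PFERR_WRITE_MASK),
	       !!(pferr & PFERR_USER_MASK),
	       !!(pferr & PFERR_RSVD_MASK),
	       !!(pferr & PFERR_FETCH_MASK));
	return 0;
}

This prints P=1 W=0 U=0 RSVD=1 F=1, matching the P|RSVD|F the tracepoint
shows: the walker rejected a present pte because it saw reserved bits set
during an instruction fetch.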

Jan



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 263 bytes --]

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v6 00/15] Nested EPT
  2013-08-04 15:14         ` Jan Kiszka
@ 2013-08-04 16:15           ` Xiao Guangrong
  2013-08-04 16:42             ` Jan Kiszka
  0 siblings, 1 reply; 39+ messages in thread
From: Xiao Guangrong @ 2013-08-04 16:15 UTC (permalink / raw)
  To: Jan Kiszka
  Cc: Gleb Natapov, kvm, Xiao Guangrong, Jun Nakajima, Yang Zhang, pbonzini


On Aug 4, 2013, at 11:14 PM, Jan Kiszka <jan.kiszka@web.de> wrote:

> On 2013-08-04 15:44, Gleb Natapov wrote:
>> On Sun, Aug 04, 2013 at 12:53:56PM +0300, Gleb Natapov wrote:
>>> On Sun, Aug 04, 2013 at 12:32:06PM +0300, Gleb Natapov wrote:
>>>> On Sun, Aug 04, 2013 at 11:24:41AM +0200, Jan Kiszka wrote:
>>>>> On 2013-08-01 16:08, Gleb Natapov wrote:
>>>>>> Another day -- another version of the nested EPT patches. In this version
>>>>>> included fix for need_remote_flush() with shadowed ept, set bits 6:8
>>>>>> of exit_qualification during ept_violation, update_permission_bitmask()
>>>>>> made to work with shadowed ept pages and other small adjustment according
>>>>>> to review comments.
>>>>> 
>>>>> Was just testing it here and ran into a bug: I've L2 accessing the HPET
>>>>> MMIO region that my L1 passed through from L0 (where it is supposed to
>>>>> be emulated in this setup). This used to work with an older posting of
>>>> Not sure I understand your setup. L0 emulates HPET, L1 passes it through
>>>> to L2 (mmaps it and creates kvm slot that points to it) and when L2
>>>> accessed it it locks up?
>>>> 
>>>>> Jun, but now it locks up (infinite loop over L2's MMIO access, no L2->L1
>>>>> transition). Any ideas where to look for debugging this?
>>>>> 
>>>> Can you do an ftrace -e kvm -e kvmmmu? Unit test will also be helpful :)
>>>> 
>>> I did an MMIO access from nested guest in the vmx unit test (which is
>>> naturally passed through to L0 since L1 is so simple) and I can see that
>>> the access hits L0.
>>> 
>> But then unit test not yet uses nested EPT :)
> 
> Indeed, that's what I was about to notice as well. EPT test cases are on
> Arthur's list, but I suggested to start easier with some MSR switches
> (just to let him run into KVM's PAT bugs ;) ).
> 
> Anyway, here are the traces:
> 
> qemu-system-x86-11521 [000]  4724.170191: kvm_entry:            vcpu 0
> qemu-system-x86-11521 [000]  4724.170192: kvm_exit:             reason EPT_VIOLATION rip 0xffffffff8102ab70 info 181 0
> qemu-system-x86-11521 [000]  4724.170192: kvm_page_fault:       address 1901978 error_code 181
> qemu-system-x86-11521 [000]  4724.170193: kvm_mmu_pagetable_walk: addr 1901978 pferr 0 
> qemu-system-x86-11521 [000]  4724.170193: kvm_mmu_paging_element: pte 3c04c007 level 4
> qemu-system-x86-11521 [000]  4724.170193: kvm_mmu_paging_element: pte 3c04d007 level 3
> qemu-system-x86-11521 [000]  4724.170193: kvm_mmu_paging_element: pte 3c05a007 level 2
> qemu-system-x86-11521 [000]  4724.170193: kvm_mmu_paging_element: pte 1901037 level 1
> qemu-system-x86-11521 [000]  4724.170197: kvm_entry:            vcpu 0
> qemu-system-x86-11521 [000]  4724.170198: kvm_exit:             reason EPT_VIOLATION rip 0xffffffff8102ab77 info 81 0
> qemu-system-x86-11521 [000]  4724.170199: kvm_page_fault:       address 3a029000 error_code 81
> qemu-system-x86-11521 [000]  4724.170199: kvm_mmu_pagetable_walk: addr 3a029000 pferr 0 
> qemu-system-x86-11521 [000]  4724.170199: kvm_mmu_paging_element: pte 3c04c007 level 4
> qemu-system-x86-11521 [000]  4724.170199: kvm_mmu_paging_element: pte 3c04d007 level 3
> qemu-system-x86-11521 [000]  4724.170199: kvm_mmu_paging_element: pte 3c21e007 level 2
> qemu-system-x86-11521 [000]  4724.170200: kvm_mmu_paging_element: pte 3a029037 level 1
> qemu-system-x86-11521 [000]  4724.170203: kvm_entry:            vcpu 0
> qemu-system-x86-11521 [000]  4724.170204: kvm_exit:             reason EPT_VIOLATION rip 0xffffffff8102ab77 info 181 0
> qemu-system-x86-11521 [000]  4724.170204: kvm_page_fault:       address fed000f0 error_code 181
> qemu-system-x86-11521 [000]  4724.170205: kvm_mmu_pagetable_walk: addr fed000f0 pferr 0 
> qemu-system-x86-11521 [000]  4724.170205: kvm_mmu_paging_element: pte 3c04c007 level 4
> qemu-system-x86-11521 [000]  4724.170205: kvm_mmu_paging_element: pte 3c42f003 level 3
> qemu-system-x86-11521 [000]  4724.170205: kvm_mmu_paging_element: pte 3c626003 level 2
> qemu-system-x86-11521 [000]  4724.170206: kvm_mmu_paging_element: pte fed00033 level 1
> qemu-system-x86-11521 [000]  4724.170213: mark_mmio_spte:       sptep:0xffff88014e8ad800 gfn fed00 access 6 gen b7f
> qemu-system-x86-11521 [000]  4724.170214: kvm_mmu_pagetable_walk: addr ffffffff8102ab77 pferr 10 F
> qemu-system-x86-11521 [000]  4724.170215: kvm_mmu_pagetable_walk: addr 1710000 pferr 6 W|U
> qemu-system-x86-11521 [000]  4724.170215: kvm_mmu_paging_element: pte 3c04c007 level 4
> qemu-system-x86-11521 [000]  4724.170215: kvm_mmu_paging_element: pte 3c04d007 level 3
> qemu-system-x86-11521 [000]  4724.170216: kvm_mmu_paging_element: pte 3c059007 level 2
> qemu-system-x86-11521 [000]  4724.170216: kvm_mmu_paging_element: pte 1710037 level 1
> qemu-system-x86-11521 [000]  4724.170216: kvm_mmu_paging_element: pte 1711067 level 4
> qemu-system-x86-11521 [000]  4724.170216: kvm_mmu_walker_error: pferr 19 P|RSVD|F

I guess the bug is here, could you please change this code to:

-		if (unlikely(is_rsvd_bits_set(&vcpu->arch.mmu, pte,
+		if (unlikely(is_rsvd_bits_set(mmu, pte,
 					      walker->level))) {

(
	In static int FNAME(walk_addr_generic)(struct guest_walker *walker,
				    struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
				    gva_t addr, u32 access)
)

and try again?
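
For context (sketched from memory of paging_tmpl.h around that time, so
treat the exact names as approximate, not verbatim kernel code): the
walker has two entry points, and only one of them passes the vcpu's
primary mmu, which is why reaching for &vcpu->arch.mmu inside the
generic walker misfires whenever the nested entry point is used:

/* Sketch: the generic walker is instantiated once per guest paging
 * format and must consult the mmu it is handed as a parameter. */
static int FNAME(walk_addr)(struct guest_walker *walker,
			    struct kvm_vcpu *vcpu, gva_t addr, u32 access)
{
	return FNAME(walk_addr_generic)(walker, vcpu, &vcpu->arch.mmu,
					addr, access);
}

static int FNAME(walk_addr_nested)(struct guest_walker *walker,
				   struct kvm_vcpu *vcpu, gva_t addr,
				   u32 access)
{
	/* Here mmu != &vcpu->arch.mmu, so the reserved-bit check in
	 * walk_addr_generic must use the parameter, not the field. */
	return FNAME(walk_addr_generic)(walker, vcpu,
					&vcpu->arch.nested_mmu,
					addr, access);
}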


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v6 00/15] Nested EPT
  2013-08-04 16:15           ` Xiao Guangrong
@ 2013-08-04 16:42             ` Jan Kiszka
  2013-08-04 16:58               ` Gleb Natapov
  0 siblings, 1 reply; 39+ messages in thread
From: Jan Kiszka @ 2013-08-04 16:42 UTC (permalink / raw)
  To: Xiao Guangrong
  Cc: Gleb Natapov, kvm, Xiao Guangrong, Jun Nakajima, Yang Zhang, pbonzini

[-- Attachment #1: Type: text/plain, Size: 5672 bytes --]

On 2013-08-04 18:15, Xiao Guangrong wrote:
> 
> On Aug 4, 2013, at 11:14 PM, Jan Kiszka <jan.kiszka@web.de> wrote:
> 
>> On 2013-08-04 15:44, Gleb Natapov wrote:
>>> On Sun, Aug 04, 2013 at 12:53:56PM +0300, Gleb Natapov wrote:
>>>> On Sun, Aug 04, 2013 at 12:32:06PM +0300, Gleb Natapov wrote:
>>>>> On Sun, Aug 04, 2013 at 11:24:41AM +0200, Jan Kiszka wrote:
>>>>>> On 2013-08-01 16:08, Gleb Natapov wrote:
>>>>>>> Another day -- another version of the nested EPT patches. In this version
>>>>>>> included fix for need_remote_flush() with shadowed ept, set bits 6:8
>>>>>>> of exit_qualification during ept_violation, update_permission_bitmask()
>>>>>>> made to work with shadowed ept pages and other small adjustment according
>>>>>>> to review comments.
>>>>>>
>>>>>> Was just testing it here and ran into a bug: I've L2 accessing the HPET
>>>>>> MMIO region that my L1 passed through from L0 (where it is supposed to
>>>>>> be emulated in this setup). This used to work with an older posting of
>>>>> Not sure I understand your setup. L0 emulates HPET, L1 passes it through
>>>>> to L2 (mmaps it and creates kvm slot that points to it) and when L2
>>>>> accessed it it locks up?
>>>>>
>>>>>> Jun, but now it locks up (infinite loop over L2's MMIO access, no L2->L1
>>>>>> transition). Any ideas where to look for debugging this?
>>>>>>
>>>>> Can you do an ftrace -e kvm -e kvmmmu? Unit test will also be helpful :)
>>>>>
>>>> I did an MMIO access from nested guest in the vmx unit test (which is
>>>> naturally passed through to L0 since L1 is so simple) and I can see that
>>>> the access hits L0.
>>>>
>>> But then unit test not yet uses nested EPT :)
>>
>> Indeed, that's what I was about to notice as well. EPT test cases are on
>> Arthur's list, but I suggested to start easier with some MSR switches
>> (just to let him run into KVM's PAT bugs ;) ).
>>
>> Anyway, here are the traces:
>>
>> qemu-system-x86-11521 [000]  4724.170191: kvm_entry:            vcpu 0
>> qemu-system-x86-11521 [000]  4724.170192: kvm_exit:             reason EPT_VIOLATION rip 0xffffffff8102ab70 info 181 0
>> qemu-system-x86-11521 [000]  4724.170192: kvm_page_fault:       address 1901978 error_code 181
>> qemu-system-x86-11521 [000]  4724.170193: kvm_mmu_pagetable_walk: addr 1901978 pferr 0 
>> qemu-system-x86-11521 [000]  4724.170193: kvm_mmu_paging_element: pte 3c04c007 level 4
>> qemu-system-x86-11521 [000]  4724.170193: kvm_mmu_paging_element: pte 3c04d007 level 3
>> qemu-system-x86-11521 [000]  4724.170193: kvm_mmu_paging_element: pte 3c05a007 level 2
>> qemu-system-x86-11521 [000]  4724.170193: kvm_mmu_paging_element: pte 1901037 level 1
>> qemu-system-x86-11521 [000]  4724.170197: kvm_entry:            vcpu 0
>> qemu-system-x86-11521 [000]  4724.170198: kvm_exit:             reason EPT_VIOLATION rip 0xffffffff8102ab77 info 81 0
>> qemu-system-x86-11521 [000]  4724.170199: kvm_page_fault:       address 3a029000 error_code 81
>> qemu-system-x86-11521 [000]  4724.170199: kvm_mmu_pagetable_walk: addr 3a029000 pferr 0 
>> qemu-system-x86-11521 [000]  4724.170199: kvm_mmu_paging_element: pte 3c04c007 level 4
>> qemu-system-x86-11521 [000]  4724.170199: kvm_mmu_paging_element: pte 3c04d007 level 3
>> qemu-system-x86-11521 [000]  4724.170199: kvm_mmu_paging_element: pte 3c21e007 level 2
>> qemu-system-x86-11521 [000]  4724.170200: kvm_mmu_paging_element: pte 3a029037 level 1
>> qemu-system-x86-11521 [000]  4724.170203: kvm_entry:            vcpu 0
>> qemu-system-x86-11521 [000]  4724.170204: kvm_exit:             reason EPT_VIOLATION rip 0xffffffff8102ab77 info 181 0
>> qemu-system-x86-11521 [000]  4724.170204: kvm_page_fault:       address fed000f0 error_code 181
>> qemu-system-x86-11521 [000]  4724.170205: kvm_mmu_pagetable_walk: addr fed000f0 pferr 0 
>> qemu-system-x86-11521 [000]  4724.170205: kvm_mmu_paging_element: pte 3c04c007 level 4
>> qemu-system-x86-11521 [000]  4724.170205: kvm_mmu_paging_element: pte 3c42f003 level 3
>> qemu-system-x86-11521 [000]  4724.170205: kvm_mmu_paging_element: pte 3c626003 level 2
>> qemu-system-x86-11521 [000]  4724.170206: kvm_mmu_paging_element: pte fed00033 level 1
>> qemu-system-x86-11521 [000]  4724.170213: mark_mmio_spte:       sptep:0xffff88014e8ad800 gfn fed00 access 6 gen b7f
>> qemu-system-x86-11521 [000]  4724.170214: kvm_mmu_pagetable_walk: addr ffffffff8102ab77 pferr 10 F
>> qemu-system-x86-11521 [000]  4724.170215: kvm_mmu_pagetable_walk: addr 1710000 pferr 6 W|U
>> qemu-system-x86-11521 [000]  4724.170215: kvm_mmu_paging_element: pte 3c04c007 level 4
>> qemu-system-x86-11521 [000]  4724.170215: kvm_mmu_paging_element: pte 3c04d007 level 3
>> qemu-system-x86-11521 [000]  4724.170216: kvm_mmu_paging_element: pte 3c059007 level 2
>> qemu-system-x86-11521 [000]  4724.170216: kvm_mmu_paging_element: pte 1710037 level 1
>> qemu-system-x86-11521 [000]  4724.170216: kvm_mmu_paging_element: pte 1711067 level 4
>> qemu-system-x86-11521 [000]  4724.170216: kvm_mmu_walker_error: pferr 19 P|RSVD|F
> 
> I guess the bug is here, could you please change this code to:
> 
> -		if (unlikely(is_rsvd_bits_set(&vcpu->arch.mmu, pte,
> +		if (unlikely(is_rsvd_bits_set(mmu, pte,
>  					      walker->level))) {
> 
> (
> 	In static int FNAME(walk_addr_generic)(struct guest_walker *walker,
> 				    struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
> 				    gva_t addr, u32 access)
> )
> 
> and try again?
> 

Thanks, that fixed the bug!

Obviously, the previous version was less strict or less precise about
reserved-bit checking, as this same pattern exists in the version that
worked here as well.

Jan


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 263 bytes --]

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v6 00/15] Nested EPT
  2013-08-04 16:42             ` Jan Kiszka
@ 2013-08-04 16:58               ` Gleb Natapov
  2013-08-04 17:19                 ` Xiao Guangrong
  0 siblings, 1 reply; 39+ messages in thread
From: Gleb Natapov @ 2013-08-04 16:58 UTC (permalink / raw)
  To: Jan Kiszka
  Cc: Xiao Guangrong, kvm, Xiao Guangrong, Jun Nakajima, Yang Zhang, pbonzini

On Sun, Aug 04, 2013 at 06:42:09PM +0200, Jan Kiszka wrote:
> On 2013-08-04 18:15, Xiao Guangrong wrote:
> > 
> > On Aug 4, 2013, at 11:14 PM, Jan Kiszka <jan.kiszka@web.de> wrote:
> > 
> >> On 2013-08-04 15:44, Gleb Natapov wrote:
> >>> On Sun, Aug 04, 2013 at 12:53:56PM +0300, Gleb Natapov wrote:
> >>>> On Sun, Aug 04, 2013 at 12:32:06PM +0300, Gleb Natapov wrote:
> >>>>> On Sun, Aug 04, 2013 at 11:24:41AM +0200, Jan Kiszka wrote:
> >>>>>> On 2013-08-01 16:08, Gleb Natapov wrote:
> >>>>>>> Another day -- another version of the nested EPT patches. In this version
> >>>>>>> included fix for need_remote_flush() with shadowed ept, set bits 6:8
> >>>>>>> of exit_qualification during ept_violation, update_permission_bitmask()
> >>>>>>> made to work with shadowed ept pages and other small adjustment according
> >>>>>>> to review comments.
> >>>>>>
> >>>>>> Was just testing it here and ran into a bug: I've L2 accessing the HPET
> >>>>>> MMIO region that my L1 passed through from L0 (where it is supposed to
> >>>>>> be emulated in this setup). This used to work with an older posting of
> >>>>> Not sure I understand your setup. L0 emulates HPET, L1 passes it through
> >>>>> to L2 (mmaps it and creates kvm slot that points to it) and when L2
> >>>>> accessed it it locks up?
> >>>>>
> >>>>>> Jun, but now it locks up (infinite loop over L2's MMIO access, no L2->L1
> >>>>>> transition). Any ideas where to look for debugging this?
> >>>>>>
> >>>>> Can you do an ftrace -e kvm -e kvmmmu? Unit test will also be helpful :)
> >>>>>
> >>>> I did an MMIO access from nested guest in the vmx unit test (which is
> >>>> naturally passed through to L0 since L1 is so simple) and I can see that
> >>>> the access hits L0.
> >>>>
> >>> But then unit test not yet uses nested EPT :)
> >>
> >> Indeed, that's what I was about to notice as well. EPT test cases are on
> >> Arthur's list, but I suggested to start easier with some MSR switches
> >> (just to let him run into KVM's PAT bugs ;) ).
> >>
> >> Anyway, here are the traces:
> >>
> >> qemu-system-x86-11521 [000]  4724.170191: kvm_entry:            vcpu 0
> >> qemu-system-x86-11521 [000]  4724.170192: kvm_exit:             reason EPT_VIOLATION rip 0xffffffff8102ab70 info 181 0
> >> qemu-system-x86-11521 [000]  4724.170192: kvm_page_fault:       address 1901978 error_code 181
> >> qemu-system-x86-11521 [000]  4724.170193: kvm_mmu_pagetable_walk: addr 1901978 pferr 0 
> >> qemu-system-x86-11521 [000]  4724.170193: kvm_mmu_paging_element: pte 3c04c007 level 4
> >> qemu-system-x86-11521 [000]  4724.170193: kvm_mmu_paging_element: pte 3c04d007 level 3
> >> qemu-system-x86-11521 [000]  4724.170193: kvm_mmu_paging_element: pte 3c05a007 level 2
> >> qemu-system-x86-11521 [000]  4724.170193: kvm_mmu_paging_element: pte 1901037 level 1
> >> qemu-system-x86-11521 [000]  4724.170197: kvm_entry:            vcpu 0
> >> qemu-system-x86-11521 [000]  4724.170198: kvm_exit:             reason EPT_VIOLATION rip 0xffffffff8102ab77 info 81 0
> >> qemu-system-x86-11521 [000]  4724.170199: kvm_page_fault:       address 3a029000 error_code 81
> >> qemu-system-x86-11521 [000]  4724.170199: kvm_mmu_pagetable_walk: addr 3a029000 pferr 0 
> >> qemu-system-x86-11521 [000]  4724.170199: kvm_mmu_paging_element: pte 3c04c007 level 4
> >> qemu-system-x86-11521 [000]  4724.170199: kvm_mmu_paging_element: pte 3c04d007 level 3
> >> qemu-system-x86-11521 [000]  4724.170199: kvm_mmu_paging_element: pte 3c21e007 level 2
> >> qemu-system-x86-11521 [000]  4724.170200: kvm_mmu_paging_element: pte 3a029037 level 1
> >> qemu-system-x86-11521 [000]  4724.170203: kvm_entry:            vcpu 0
> >> qemu-system-x86-11521 [000]  4724.170204: kvm_exit:             reason EPT_VIOLATION rip 0xffffffff8102ab77 info 181 0
> >> qemu-system-x86-11521 [000]  4724.170204: kvm_page_fault:       address fed000f0 error_code 181
> >> qemu-system-x86-11521 [000]  4724.170205: kvm_mmu_pagetable_walk: addr fed000f0 pferr 0 
> >> qemu-system-x86-11521 [000]  4724.170205: kvm_mmu_paging_element: pte 3c04c007 level 4
> >> qemu-system-x86-11521 [000]  4724.170205: kvm_mmu_paging_element: pte 3c42f003 level 3
> >> qemu-system-x86-11521 [000]  4724.170205: kvm_mmu_paging_element: pte 3c626003 level 2
> >> qemu-system-x86-11521 [000]  4724.170206: kvm_mmu_paging_element: pte fed00033 level 1
> >> qemu-system-x86-11521 [000]  4724.170213: mark_mmio_spte:       sptep:0xffff88014e8ad800 gfn fed00 access 6 gen b7f
> >> qemu-system-x86-11521 [000]  4724.170214: kvm_mmu_pagetable_walk: addr ffffffff8102ab77 pferr 10 F
> >> qemu-system-x86-11521 [000]  4724.170215: kvm_mmu_pagetable_walk: addr 1710000 pferr 6 W|U
> >> qemu-system-x86-11521 [000]  4724.170215: kvm_mmu_paging_element: pte 3c04c007 level 4
> >> qemu-system-x86-11521 [000]  4724.170215: kvm_mmu_paging_element: pte 3c04d007 level 3
> >> qemu-system-x86-11521 [000]  4724.170216: kvm_mmu_paging_element: pte 3c059007 level 2
> >> qemu-system-x86-11521 [000]  4724.170216: kvm_mmu_paging_element: pte 1710037 level 1
> >> qemu-system-x86-11521 [000]  4724.170216: kvm_mmu_paging_element: pte 1711067 level 4
> >> qemu-system-x86-11521 [000]  4724.170216: kvm_mmu_walker_error: pferr 19 P|RSVD|F
> > 
> > I guess the bug is here, could you please change this code to:
> > 
> > -		if (unlikely(is_rsvd_bits_set(&vcpu->arch.mmu, pte,
> > +		if (unlikely(is_rsvd_bits_set(mmu, pte,
> >  					      walker->level))) {
> > 
> > (
> > 	In static int FNAME(walk_addr_generic)(struct guest_walker *walker,
> > 				    struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
> > 				    gva_t addr, u32 access)
> > )
> > 
> > and try again?
> > 
> 
> Thanks, that fixed the bug!
> 
Awesome!

>> Obviously, the previous version was less strict or less precise about
>> reserved-bit checking, as this same pattern exists in the version that worked here as well.
> 
The previous version did not use mmu->rsvd_bits_mask for nEPT reserved-bit
checking, so is_rsvd_bits_set() happened to work correctly for the nested
mmu even when called with the non-nested mmu pointer, but when I changed
it to use mmu->rsvd_bits_mask the bug became triggerable. In theory this
bug should be triggerable on nested SVM too, if the two levels of paging
use different modes, no?
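
To make the failure mode concrete, here is a toy userspace model (not
kernel code; the struct layout, the masks, and the choice of "reserved"
bit are all invented for illustration). Each mmu context carries its own
rsvd_bits_mask table, so the very same gpte can pass the check under one
mmu's masks and fail under another's -- which is what happens once the
walker consults vcpu->arch.mmu instead of the mmu it was given:

#include <stdint.h>
#include <stdio.h>

/* Toy stand-in for struct kvm_mmu, keeping only the per-mode
 * reserved-bit masks; the [pte bit 7][level - 1] indexing follows
 * how I remember KVM laying out rsvd_bits_mask. */
struct toy_mmu {
	uint64_t rsvd_bits_mask[2][4];
};

static int is_rsvd_bits_set(struct toy_mmu *mmu, uint64_t gpte, int level)
{
	int bit7 = (gpte >> 7) & 1;

	return (gpte & mmu->rsvd_bits_mask[bit7][level - 1]) != 0;
}

int main(void)
{
	struct toy_mmu root_mmu   = { .rsvd_bits_mask = { { 0 } } };
	struct toy_mmu nested_mmu = { .rsvd_bits_mask = { { 0 } } };
	uint64_t gpte = 0x1711067;	/* the level-4 gpte from the trace */

	/* Hypothetical: pretend bit 5 is reserved in the root walk's
	 * paging format but valid in the nested one. */
	root_mmu.rsvd_bits_mask[0][3] = 1ull << 5;

	/* Same gpte, opposite verdicts, depending on whose masks apply. */
	printf("checked against root mmu:   rsvd=%d\n",
	       is_rsvd_bits_set(&root_mmu, gpte, 4));
	printf("checked against nested mmu: rsvd=%d\n",
	       is_rsvd_bits_set(&nested_mmu, gpte, 4));
	return 0;
}

On this toy model the first line prints rsvd=1 and the second rsvd=0,
mirroring the spurious P|RSVD|F walker error in the trace above.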

--
			Gleb.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v6 00/15] Nested EPT
  2013-08-04 16:58               ` Gleb Natapov
@ 2013-08-04 17:19                 ` Xiao Guangrong
  2013-08-04 17:24                   ` Gleb Natapov
  0 siblings, 1 reply; 39+ messages in thread
From: Xiao Guangrong @ 2013-08-04 17:19 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: Jan Kiszka, kvm, Xiao Guangrong, Jun Nakajima, Yang Zhang, pbonzini


On Aug 5, 2013, at 12:58 AM, Gleb Natapov <gleb@redhat.com> wrote:

> On Sun, Aug 04, 2013 at 06:42:09PM +0200, Jan Kiszka wrote:
>> On 2013-08-04 18:15, Xiao Guangrong wrote:
>>> 
>>> On Aug 4, 2013, at 11:14 PM, Jan Kiszka <jan.kiszka@web.de> wrote:
>>> 
>>>> On 2013-08-04 15:44, Gleb Natapov wrote:
>>>>> On Sun, Aug 04, 2013 at 12:53:56PM +0300, Gleb Natapov wrote:
>>>>>> On Sun, Aug 04, 2013 at 12:32:06PM +0300, Gleb Natapov wrote:
>>>>>>> On Sun, Aug 04, 2013 at 11:24:41AM +0200, Jan Kiszka wrote:
>>>>>>>> On 2013-08-01 16:08, Gleb Natapov wrote:
>>>>>>>>> Another day -- another version of the nested EPT patches. In this version
>>>>>>>>> included fix for need_remote_flush() with shadowed ept, set bits 6:8
>>>>>>>>> of exit_qualification during ept_violation, update_permission_bitmask()
>>>>>>>>> made to work with shadowed ept pages and other small adjustment according
>>>>>>>>> to review comments.
>>>>>>>> 
>>>>>>>> Was just testing it here and ran into a bug: I've L2 accessing the HPET
>>>>>>>> MMIO region that my L1 passed through from L0 (where it is supposed to
>>>>>>>> be emulated in this setup). This used to work with an older posting of
>>>>>>> Not sure I understand your setup. L0 emulates HPET, L1 passes it through
>>>>>>> to L2 (mmaps it and creates kvm slot that points to it) and when L2
>>>>>>> accessed it it locks up?
>>>>>>> 
>>>>>>>> Jun, but now it locks up (infinite loop over L2's MMIO access, no L2->L1
>>>>>>>> transition). Any ideas where to look for debugging this?
>>>>>>>> 
>>>>>>> Can you do an ftrace -e kvm -e kvmmmu? Unit test will also be helpful :)
>>>>>>> 
>>>>>> I did an MMIO access from nested guest in the vmx unit test (which is
>>>>>> naturally passed through to L0 since L1 is so simple) and I can see that
>>>>>> the access hits L0.
>>>>>> 
>>>>> But then unit test not yet uses nested EPT :)
>>>> 
>>>> Indeed, that's what I was about to notice as well. EPT test cases are on
>>>> Arthur's list, but I suggested to start easier with some MSR switches
>>>> (just to let him run into KVM's PAT bugs ;) ).
>>>> 
>>>> Anyway, here are the traces:
>>>> 
>>>> qemu-system-x86-11521 [000]  4724.170191: kvm_entry:            vcpu 0
>>>> qemu-system-x86-11521 [000]  4724.170192: kvm_exit:             reason EPT_VIOLATION rip 0xffffffff8102ab70 info 181 0
>>>> qemu-system-x86-11521 [000]  4724.170192: kvm_page_fault:       address 1901978 error_code 181
>>>> qemu-system-x86-11521 [000]  4724.170193: kvm_mmu_pagetable_walk: addr 1901978 pferr 0 
>>>> qemu-system-x86-11521 [000]  4724.170193: kvm_mmu_paging_element: pte 3c04c007 level 4
>>>> qemu-system-x86-11521 [000]  4724.170193: kvm_mmu_paging_element: pte 3c04d007 level 3
>>>> qemu-system-x86-11521 [000]  4724.170193: kvm_mmu_paging_element: pte 3c05a007 level 2
>>>> qemu-system-x86-11521 [000]  4724.170193: kvm_mmu_paging_element: pte 1901037 level 1
>>>> qemu-system-x86-11521 [000]  4724.170197: kvm_entry:            vcpu 0
>>>> qemu-system-x86-11521 [000]  4724.170198: kvm_exit:             reason EPT_VIOLATION rip 0xffffffff8102ab77 info 81 0
>>>> qemu-system-x86-11521 [000]  4724.170199: kvm_page_fault:       address 3a029000 error_code 81
>>>> qemu-system-x86-11521 [000]  4724.170199: kvm_mmu_pagetable_walk: addr 3a029000 pferr 0 
>>>> qemu-system-x86-11521 [000]  4724.170199: kvm_mmu_paging_element: pte 3c04c007 level 4
>>>> qemu-system-x86-11521 [000]  4724.170199: kvm_mmu_paging_element: pte 3c04d007 level 3
>>>> qemu-system-x86-11521 [000]  4724.170199: kvm_mmu_paging_element: pte 3c21e007 level 2
>>>> qemu-system-x86-11521 [000]  4724.170200: kvm_mmu_paging_element: pte 3a029037 level 1
>>>> qemu-system-x86-11521 [000]  4724.170203: kvm_entry:            vcpu 0
>>>> qemu-system-x86-11521 [000]  4724.170204: kvm_exit:             reason EPT_VIOLATION rip 0xffffffff8102ab77 info 181 0
>>>> qemu-system-x86-11521 [000]  4724.170204: kvm_page_fault:       address fed000f0 error_code 181
>>>> qemu-system-x86-11521 [000]  4724.170205: kvm_mmu_pagetable_walk: addr fed000f0 pferr 0 
>>>> qemu-system-x86-11521 [000]  4724.170205: kvm_mmu_paging_element: pte 3c04c007 level 4
>>>> qemu-system-x86-11521 [000]  4724.170205: kvm_mmu_paging_element: pte 3c42f003 level 3
>>>> qemu-system-x86-11521 [000]  4724.170205: kvm_mmu_paging_element: pte 3c626003 level 2
>>>> qemu-system-x86-11521 [000]  4724.170206: kvm_mmu_paging_element: pte fed00033 level 1
>>>> qemu-system-x86-11521 [000]  4724.170213: mark_mmio_spte:       sptep:0xffff88014e8ad800 gfn fed00 access 6 gen b7f
>>>> qemu-system-x86-11521 [000]  4724.170214: kvm_mmu_pagetable_walk: addr ffffffff8102ab77 pferr 10 F
>>>> qemu-system-x86-11521 [000]  4724.170215: kvm_mmu_pagetable_walk: addr 1710000 pferr 6 W|U
>>>> qemu-system-x86-11521 [000]  4724.170215: kvm_mmu_paging_element: pte 3c04c007 level 4
>>>> qemu-system-x86-11521 [000]  4724.170215: kvm_mmu_paging_element: pte 3c04d007 level 3
>>>> qemu-system-x86-11521 [000]  4724.170216: kvm_mmu_paging_element: pte 3c059007 level 2
>>>> qemu-system-x86-11521 [000]  4724.170216: kvm_mmu_paging_element: pte 1710037 level 1
>>>> qemu-system-x86-11521 [000]  4724.170216: kvm_mmu_paging_element: pte 1711067 level 4
>>>> qemu-system-x86-11521 [000]  4724.170216: kvm_mmu_walker_error: pferr 19 P|RSVD|F
>>> 
>>> I guess the bug is here, could you please change this code to:
>>> 
>>> -		if (unlikely(is_rsvd_bits_set(&vcpu->arch.mmu, pte,
>>> +		if (unlikely(is_rsvd_bits_set(mmu, pte,
>>>  					      walker->level))) {
>>> 
>>> (
>>> 	In static int FNAME(walk_addr_generic)(struct guest_walker *walker,
>>> 				    struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
>>> 				    gva_t addr, u32 access)
>>> )
>>> 
>>> and try again?
>>> 
>> 
>> Thanks, that fixed the bug!
>> 
> Awesome!
> 
>> Obviously, the previous version was less strict or less precise about
>> reserved-bit checking, as this same pattern exists in the version that worked here as well.
>> 
> The previous version did not use mmu->rsvd_bits_mask for nEPT reserved-bit
> checking, so is_rsvd_bits_set() happened to work correctly for the nested
> mmu even when called with the non-nested mmu pointer, but when I changed
> it to use mmu->rsvd_bits_mask the bug became triggerable. In theory this
> bug should be triggerable on nested SVM too, if the two levels of paging
> use different modes, no?

Yes, it should be.

BTW, I will post this fix in the proper format when I am in the office.


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v6 00/15] Nested EPT
  2013-08-04 17:19                 ` Xiao Guangrong
@ 2013-08-04 17:24                   ` Gleb Natapov
  0 siblings, 0 replies; 39+ messages in thread
From: Gleb Natapov @ 2013-08-04 17:24 UTC (permalink / raw)
  To: Xiao Guangrong
  Cc: Jan Kiszka, kvm, Xiao Guangrong, Jun Nakajima, Yang Zhang, pbonzini

On Mon, Aug 05, 2013 at 01:19:26AM +0800, Xiao Guangrong wrote:
> 
> On Aug 5, 2013, at 12:58 AM, Gleb Natapov <gleb@redhat.com> wrote:
> 
> > On Sun, Aug 04, 2013 at 06:42:09PM +0200, Jan Kiszka wrote:
> >> On 2013-08-04 18:15, Xiao Guangrong wrote:
> >>> 
> >>> On Aug 4, 2013, at 11:14 PM, Jan Kiszka <jan.kiszka@web.de> wrote:
> >>> 
> >>>> On 2013-08-04 15:44, Gleb Natapov wrote:
> >>>>> On Sun, Aug 04, 2013 at 12:53:56PM +0300, Gleb Natapov wrote:
> >>>>>> On Sun, Aug 04, 2013 at 12:32:06PM +0300, Gleb Natapov wrote:
> >>>>>>> On Sun, Aug 04, 2013 at 11:24:41AM +0200, Jan Kiszka wrote:
> >>>>>>>> On 2013-08-01 16:08, Gleb Natapov wrote:
> >>>>>>>>> Another day -- another version of the nested EPT patches. In this version
> >>>>>>>>> included fix for need_remote_flush() with shadowed ept, set bits 6:8
> >>>>>>>>> of exit_qualification during ept_violation, update_permission_bitmask()
> >>>>>>>>> made to work with shadowed ept pages and other small adjustment according
> >>>>>>>>> to review comments.
> >>>>>>>> 
> >>>>>>>> Was just testing it here and ran into a bug: I've L2 accessing the HPET
> >>>>>>>> MMIO region that my L1 passed through from L0 (where it is supposed to
> >>>>>>>> be emulated in this setup). This used to work with an older posting of
> >>>>>>> Not sure I understand your setup. L0 emulates HPET, L1 passes it through
> >>>>>>> to L2 (mmaps it and creates kvm slot that points to it) and when L2
> >>>>>>> accessed it it locks up?
> >>>>>>> 
> >>>>>>>> Jun, but now it locks up (infinite loop over L2's MMIO access, no L2->L1
> >>>>>>>> transition). Any ideas where to look for debugging this?
> >>>>>>>> 
> >>>>>>> Can you do an ftrace -e kvm -e kvmmmu? Unit test will also be helpful :)
> >>>>>>> 
> >>>>>> I did an MMIO access from nested guest in the vmx unit test (which is
> >>>>>> naturally passed through to L0 since L1 is so simple) and I can see that
> >>>>>> the access hits L0.
> >>>>>> 
> >>>>> But then unit test not yet uses nested EPT :)
> >>>> 
> >>>> Indeed, that's what I was about to notice as well. EPT test cases are on
> >>>> Arthur's list, but I suggested to start easier with some MSR switches
> >>>> (just to let him run into KVM's PAT bugs ;) ).
> >>>> 
> >>>> Anyway, here are the traces:
> >>>> 
> >>>> qemu-system-x86-11521 [000]  4724.170191: kvm_entry:            vcpu 0
> >>>> qemu-system-x86-11521 [000]  4724.170192: kvm_exit:             reason EPT_VIOLATION rip 0xffffffff8102ab70 info 181 0
> >>>> qemu-system-x86-11521 [000]  4724.170192: kvm_page_fault:       address 1901978 error_code 181
> >>>> qemu-system-x86-11521 [000]  4724.170193: kvm_mmu_pagetable_walk: addr 1901978 pferr 0 
> >>>> qemu-system-x86-11521 [000]  4724.170193: kvm_mmu_paging_element: pte 3c04c007 level 4
> >>>> qemu-system-x86-11521 [000]  4724.170193: kvm_mmu_paging_element: pte 3c04d007 level 3
> >>>> qemu-system-x86-11521 [000]  4724.170193: kvm_mmu_paging_element: pte 3c05a007 level 2
> >>>> qemu-system-x86-11521 [000]  4724.170193: kvm_mmu_paging_element: pte 1901037 level 1
> >>>> qemu-system-x86-11521 [000]  4724.170197: kvm_entry:            vcpu 0
> >>>> qemu-system-x86-11521 [000]  4724.170198: kvm_exit:             reason EPT_VIOLATION rip 0xffffffff8102ab77 info 81 0
> >>>> qemu-system-x86-11521 [000]  4724.170199: kvm_page_fault:       address 3a029000 error_code 81
> >>>> qemu-system-x86-11521 [000]  4724.170199: kvm_mmu_pagetable_walk: addr 3a029000 pferr 0 
> >>>> qemu-system-x86-11521 [000]  4724.170199: kvm_mmu_paging_element: pte 3c04c007 level 4
> >>>> qemu-system-x86-11521 [000]  4724.170199: kvm_mmu_paging_element: pte 3c04d007 level 3
> >>>> qemu-system-x86-11521 [000]  4724.170199: kvm_mmu_paging_element: pte 3c21e007 level 2
> >>>> qemu-system-x86-11521 [000]  4724.170200: kvm_mmu_paging_element: pte 3a029037 level 1
> >>>> qemu-system-x86-11521 [000]  4724.170203: kvm_entry:            vcpu 0
> >>>> qemu-system-x86-11521 [000]  4724.170204: kvm_exit:             reason EPT_VIOLATION rip 0xffffffff8102ab77 info 181 0
> >>>> qemu-system-x86-11521 [000]  4724.170204: kvm_page_fault:       address fed000f0 error_code 181
> >>>> qemu-system-x86-11521 [000]  4724.170205: kvm_mmu_pagetable_walk: addr fed000f0 pferr 0 
> >>>> qemu-system-x86-11521 [000]  4724.170205: kvm_mmu_paging_element: pte 3c04c007 level 4
> >>>> qemu-system-x86-11521 [000]  4724.170205: kvm_mmu_paging_element: pte 3c42f003 level 3
> >>>> qemu-system-x86-11521 [000]  4724.170205: kvm_mmu_paging_element: pte 3c626003 level 2
> >>>> qemu-system-x86-11521 [000]  4724.170206: kvm_mmu_paging_element: pte fed00033 level 1
> >>>> qemu-system-x86-11521 [000]  4724.170213: mark_mmio_spte:       sptep:0xffff88014e8ad800 gfn fed00 access 6 gen b7f
> >>>> qemu-system-x86-11521 [000]  4724.170214: kvm_mmu_pagetable_walk: addr ffffffff8102ab77 pferr 10 F
> >>>> qemu-system-x86-11521 [000]  4724.170215: kvm_mmu_pagetable_walk: addr 1710000 pferr 6 W|U
> >>>> qemu-system-x86-11521 [000]  4724.170215: kvm_mmu_paging_element: pte 3c04c007 level 4
> >>>> qemu-system-x86-11521 [000]  4724.170215: kvm_mmu_paging_element: pte 3c04d007 level 3
> >>>> qemu-system-x86-11521 [000]  4724.170216: kvm_mmu_paging_element: pte 3c059007 level 2
> >>>> qemu-system-x86-11521 [000]  4724.170216: kvm_mmu_paging_element: pte 1710037 level 1
> >>>> qemu-system-x86-11521 [000]  4724.170216: kvm_mmu_paging_element: pte 1711067 level 4
> >>>> qemu-system-x86-11521 [000]  4724.170216: kvm_mmu_walker_error: pferr 19 P|RSVD|F
> >>> 
> >>> I guess the bug is here, could you please change this code to:
> >>> 
> >>> -		if (unlikely(is_rsvd_bits_set(&vcpu->arch.mmu, pte,
> >>> +		if (unlikely(is_rsvd_bits_set(mmu, pte,
> >>>  					      walker->level))) {
> >>> 
> >>> (
> >>> 	In static int FNAME(walk_addr_generic)(struct guest_walker *walker,
> >>> 				    struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
> >>> 				    gva_t addr, u32 access)
> >>> )
> >>> 
> >>> and try again?
> >>> 
> >> 
> >> Thanks, that fixed the bug!
> >> 
> > Awesome!
> > 
> >> Obviously, the previous version was less strict or less precise about
> >> reserved-bit checking, as this same pattern exists in the version that worked here as well.
> >> 
> > The previous version did not use mmu->rsvd_bits_mask for nEPT reserved-bit
> > checking, so is_rsvd_bits_set() happened to work correctly for the nested
> > mmu even when called with the non-nested mmu pointer, but when I changed
> > it to use mmu->rsvd_bits_mask the bug became triggerable. In theory this
> > bug should be triggerable on nested SVM too, if the two levels of paging
> > use different modes, no?
> 
> Yes, it should be.
> 
> BTW, I will post this fix in the proper format when I am in the office.
Cool, thanks. It can be applied independently of this patch series, since
it fixes nested NPT too.

--
			Gleb.

^ permalink raw reply	[flat|nested] 39+ messages in thread

end of thread, other threads:[~2013-08-04 17:24 UTC | newest]

Thread overview: 39+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-08-01 14:08 [PATCH v6 00/15] Nested EPT Gleb Natapov
2013-08-01 14:08 ` [PATCH v6 01/15] nEPT: Support LOAD_IA32_EFER entry/exit controls for L1 Gleb Natapov
2013-08-02  3:04   ` Zhang, Yang Z
2013-08-02  6:35     ` Jan Kiszka
2013-08-02  7:27       ` Zhang, Yang Z
2013-08-02  7:33         ` Jan Kiszka
2013-08-01 14:08 ` [PATCH v6 02/15] nEPT: Fix cr3 handling in nested exit and entry Gleb Natapov
2013-08-02  9:23   ` Xiao Guangrong
2013-08-01 14:08 ` [PATCH v6 03/15] nEPT: Fix wrong test in kvm_set_cr3 Gleb Natapov
2013-08-01 14:08 ` [PATCH v6 04/15] nEPT: Move common code to paging_tmpl.h Gleb Natapov
2013-08-01 14:08 ` [PATCH v6 05/15] nEPT: make guest's A/D bits depends on guest's paging mode Gleb Natapov
2013-08-01 14:08 ` [PATCH v6 06/15] nEPT: Support shadow paging for guest paging without A/D bits Gleb Natapov
2013-08-01 14:08 ` [PATCH v6 07/15] nEPT: Add EPT tables support to paging_tmpl.h Gleb Natapov
2013-08-01 14:08 ` [PATCH v6 08/15] nEPT: Redefine EPT-specific link_shadow_page() Gleb Natapov
2013-08-01 14:08 ` [PATCH v6 09/15] nEPT: correctly check if remote tlb flush is needed for shadowed EPT tables Gleb Natapov
2013-08-02  5:58   ` Xiao Guangrong
2013-08-01 14:08 ` [PATCH v6 10/15] nEPT: Nested INVEPT Gleb Natapov
2013-08-02  8:06   ` Xiao Guangrong
2013-08-02 10:00     ` Gleb Natapov
2013-08-01 14:08 ` [PATCH v6 11/15] nEPT: Add nEPT violation/misconfigration support Gleb Natapov
2013-08-02  6:12   ` Xiao Guangrong
2013-08-01 14:08 ` [PATCH v6 12/15] nEPT: MMU context for nested EPT Gleb Natapov
2013-08-02  6:13   ` Xiao Guangrong
2013-08-01 14:08 ` [PATCH v6 13/15] nEPT: Advertise EPT to L1 Gleb Natapov
2013-08-02  8:29   ` Xiao Guangrong
2013-08-01 14:08 ` [PATCH v6 14/15] nEPT: Some additional comments Gleb Natapov
2013-08-02  6:26   ` Xiao Guangrong
2013-08-01 14:08 ` [PATCH v6 15/15] nEPT: Miscelleneous cleanups Gleb Natapov
2013-08-02  6:45   ` Xiao Guangrong
2013-08-04  9:24 ` [PATCH v6 00/15] Nested EPT Jan Kiszka
2013-08-04  9:32   ` Gleb Natapov
2013-08-04  9:53     ` Gleb Natapov
2013-08-04 13:44       ` Gleb Natapov
2013-08-04 15:14         ` Jan Kiszka
2013-08-04 16:15           ` Xiao Guangrong
2013-08-04 16:42             ` Jan Kiszka
2013-08-04 16:58               ` Gleb Natapov
2013-08-04 17:19                 ` Xiao Guangrong
2013-08-04 17:24                   ` Gleb Natapov
