* [PATCH v3 01/13] nEPT: Support LOAD_IA32_EFER entry/exit controls for L1
@ 2013-05-19  4:52 Jun Nakajima
  2013-05-19  4:52 ` [PATCH v3 02/13] nEPT: Move gpte_access() and prefetch_invalid_gpte() to paging_tmpl.h Jun Nakajima
                   ` (12 more replies)
  0 siblings, 13 replies; 52+ messages in thread
From: Jun Nakajima @ 2013-05-19  4:52 UTC (permalink / raw)
  To: kvm; +Cc: Gleb Natapov, Paolo Bonzini

From: Nadav Har'El <nyh@il.ibm.com>

Recent KVM, since http://kerneltrap.org/mailarchive/linux-kvm/2010/5/2/6261577,
switches the EFER MSR when EPT is used and the host and guest have different
NX bits. So if we add support for nested EPT (an L1 guest using EPT to run L2)
and want to be able to run recent KVM as L1, we need to allow L1 to use this
EFER-switching feature.

To do this EFER switching, KVM uses VM_ENTRY/EXIT_LOAD_IA32_EFER if available,
and if it isn't, it uses the generic VM_ENTRY/EXIT_MSR_LOAD. This patch adds
support for the former (the latter is still unsupported).
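
To illustrate the choice described above (a minimal sketch only, not the actual
KVM code path; cpu_has_efer_load_controls() and add_msr_autoload_entry() are
hypothetical placeholder helpers):

	static void setup_efer_switch(u64 guest_efer, u64 host_efer)
	{
		/* only needed when guest and host disagree on EFER.NX */
		if (!((guest_efer ^ host_efer) & EFER_NX))
			return;

		if (cpu_has_efer_load_controls()) {
			/* dedicated VM_ENTRY/EXIT_LOAD_IA32_EFER controls */
			vmcs_write64(GUEST_IA32_EFER, guest_efer);
			vmcs_write64(HOST_IA32_EFER, host_efer);
		} else {
			/* generic VM_ENTRY/EXIT_MSR_LOAD autoload lists */
			add_msr_autoload_entry(MSR_EFER, guest_efer, host_efer);
		}
	}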

Nested entry and exit emulation (prepare_vmcs02 and load_vmcs12_host_state,
respectively) already handled VM_ENTRY/EXIT_LOAD_IA32_EFER correctly. So all
that's left to do in this patch is to properly advertise this feature to L1.

Note that vmcs12's VM_ENTRY/EXIT_LOAD_IA32_EFER are emulated by L0, by using
vmx_set_efer (which itself sets one of several vmcs02 fields), so we always
support this feature, regardless of whether the host supports it.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
---
 arch/x86/kvm/vmx.c | 23 ++++++++++++++++-------
 1 file changed, 16 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 260a919..fb9cae5 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2192,7 +2192,8 @@ static __init void nested_vmx_setup_ctls_msrs(void)
 #else
 	nested_vmx_exit_ctls_high = 0;
 #endif
-	nested_vmx_exit_ctls_high |= VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR;
+	nested_vmx_exit_ctls_high |= (VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR |
+				      VM_EXIT_LOAD_IA32_EFER);
 
 	/* entry controls */
 	rdmsr(MSR_IA32_VMX_ENTRY_CTLS,
@@ -2201,8 +2202,8 @@ static __init void nested_vmx_setup_ctls_msrs(void)
 	nested_vmx_entry_ctls_low = VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR;
 	nested_vmx_entry_ctls_high &=
 		VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_IA32E_MODE;
-	nested_vmx_entry_ctls_high |= VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR;
-
+	nested_vmx_entry_ctls_high |= (VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR |
+				       VM_ENTRY_LOAD_IA32_EFER);
 	/* cpu-based controls */
 	rdmsr(MSR_IA32_VMX_PROCBASED_CTLS,
 		nested_vmx_procbased_ctls_low, nested_vmx_procbased_ctls_high);
@@ -7492,10 +7493,18 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
 	vcpu->arch.cr0_guest_owned_bits &= ~vmcs12->cr0_guest_host_mask;
 	vmcs_writel(CR0_GUEST_HOST_MASK, ~vcpu->arch.cr0_guest_owned_bits);
 
-	/* Note: IA32_MODE, LOAD_IA32_EFER are modified by vmx_set_efer below */
-	vmcs_write32(VM_EXIT_CONTROLS,
-		vmcs12->vm_exit_controls | vmcs_config.vmexit_ctrl);
-	vmcs_write32(VM_ENTRY_CONTROLS, vmcs12->vm_entry_controls |
+	/* L2->L1 exit controls are emulated - the hardware exit is to L0 so
+	 * we should use its exit controls. Note that IA32_MODE, LOAD_IA32_EFER
+	 * bits are further modified by vmx_set_efer() below.
+	 */
+	vmcs_write32(VM_EXIT_CONTROLS, vmcs_config.vmexit_ctrl);
+
+	/* vmcs12's VM_ENTRY_LOAD_IA32_EFER and VM_ENTRY_IA32E_MODE are
+	 * emulated by vmx_set_efer(), below.
+	 */
+	vmcs_write32(VM_ENTRY_CONTROLS,
+		(vmcs12->vm_entry_controls & ~VM_ENTRY_LOAD_IA32_EFER &
+			~VM_ENTRY_IA32E_MODE) |
 		(vmcs_config.vmentry_ctrl & ~VM_ENTRY_IA32E_MODE));
 
 	if (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_PAT)
-- 
1.8.1.2



* [PATCH v3 02/13] nEPT: Move gpte_access() and prefetch_invalid_gpte() to paging_tmpl.h
  2013-05-19  4:52 [PATCH v3 01/13] nEPT: Support LOAD_IA32_EFER entry/exit controls for L1 Jun Nakajima
@ 2013-05-19  4:52 ` Jun Nakajima
  2013-05-20 12:34   ` Paolo Bonzini
  2013-05-19  4:52 ` [PATCH v3 03/13] nEPT: Add EPT tables support " Jun Nakajima
                   ` (11 subsequent siblings)
  12 siblings, 1 reply; 52+ messages in thread
From: Jun Nakajima @ 2013-05-19  4:52 UTC (permalink / raw)
  To: kvm; +Cc: Gleb Natapov, Paolo Bonzini

From: Nadav Har'El <nyh@il.ibm.com>

In preparation for nested EPT, simply move gpte_access() and prefetch_invalid_gpte() from mmu.c to paging_tmpl.h.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
---
 arch/x86/kvm/mmu.c         | 30 ------------------------------
 arch/x86/kvm/paging_tmpl.h | 40 +++++++++++++++++++++++++++++++++++-----
 2 files changed, 35 insertions(+), 35 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 004cc87..117233f 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2488,26 +2488,6 @@ static pfn_t pte_prefetch_gfn_to_pfn(struct kvm_vcpu *vcpu, gfn_t gfn,
 	return gfn_to_pfn_memslot_atomic(slot, gfn);
 }
 
-static bool prefetch_invalid_gpte(struct kvm_vcpu *vcpu,
-				  struct kvm_mmu_page *sp, u64 *spte,
-				  u64 gpte)
-{
-	if (is_rsvd_bits_set(&vcpu->arch.mmu, gpte, PT_PAGE_TABLE_LEVEL))
-		goto no_present;
-
-	if (!is_present_gpte(gpte))
-		goto no_present;
-
-	if (!(gpte & PT_ACCESSED_MASK))
-		goto no_present;
-
-	return false;
-
-no_present:
-	drop_spte(vcpu->kvm, spte);
-	return true;
-}
-
 static int direct_pte_prefetch_many(struct kvm_vcpu *vcpu,
 				    struct kvm_mmu_page *sp,
 				    u64 *start, u64 *end)
@@ -3408,16 +3388,6 @@ static bool sync_mmio_spte(u64 *sptep, gfn_t gfn, unsigned access,
 	return false;
 }
 
-static inline unsigned gpte_access(struct kvm_vcpu *vcpu, u64 gpte)
-{
-	unsigned access;
-
-	access = (gpte & (PT_WRITABLE_MASK | PT_USER_MASK)) | ACC_EXEC_MASK;
-	access &= ~(gpte >> PT64_NX_SHIFT);
-
-	return access;
-}
-
 static inline bool is_last_gpte(struct kvm_mmu *mmu, unsigned level, unsigned gpte)
 {
 	unsigned index;
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index da20860..df34d4a 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -103,6 +103,36 @@ static int FNAME(cmpxchg_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 	return (ret != orig_pte);
 }
 
+static bool FNAME(prefetch_invalid_gpte)(struct kvm_vcpu *vcpu,
+				  struct kvm_mmu_page *sp, u64 *spte,
+				  u64 gpte)
+{
+	if (is_rsvd_bits_set(&vcpu->arch.mmu, gpte, PT_PAGE_TABLE_LEVEL))
+		goto no_present;
+
+	if (!is_present_gpte(gpte))
+		goto no_present;
+
+	if (!(gpte & PT_ACCESSED_MASK))
+		goto no_present;
+
+	return false;
+
+no_present:
+	drop_spte(vcpu->kvm, spte);
+	return true;
+}
+
+static inline unsigned FNAME(gpte_access)(struct kvm_vcpu *vcpu, u64 gpte)
+{
+	unsigned access;
+
+	access = (gpte & (PT_WRITABLE_MASK | PT_USER_MASK)) | ACC_EXEC_MASK;
+	access &= ~(gpte >> PT64_NX_SHIFT);
+
+	return access;
+}
+
 static int FNAME(update_accessed_dirty_bits)(struct kvm_vcpu *vcpu,
 					     struct kvm_mmu *mmu,
 					     struct guest_walker *walker,
@@ -225,7 +255,7 @@ retry_walk:
 		}
 
 		accessed_dirty &= pte;
-		pte_access = pt_access & gpte_access(vcpu, pte);
+		pte_access = pt_access & FNAME(gpte_access)(vcpu, pte);
 
 		walker->ptes[walker->level - 1] = pte;
 	} while (!is_last_gpte(mmu, walker->level, pte));
@@ -309,13 +339,13 @@ FNAME(prefetch_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 	gfn_t gfn;
 	pfn_t pfn;
 
-	if (prefetch_invalid_gpte(vcpu, sp, spte, gpte))
+	if (FNAME(prefetch_invalid_gpte)(vcpu, sp, spte, gpte))
 		return false;
 
 	pgprintk("%s: gpte %llx spte %p\n", __func__, (u64)gpte, spte);
 
 	gfn = gpte_to_gfn(gpte);
-	pte_access = sp->role.access & gpte_access(vcpu, gpte);
+	pte_access = sp->role.access & FNAME(gpte_access)(vcpu, gpte);
 	protect_clean_gpte(&pte_access, gpte);
 	pfn = pte_prefetch_gfn_to_pfn(vcpu, gfn,
 			no_dirty_log && (pte_access & ACC_WRITE_MASK));
@@ -782,14 +812,14 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
 					  sizeof(pt_element_t)))
 			return -EINVAL;
 
-		if (prefetch_invalid_gpte(vcpu, sp, &sp->spt[i], gpte)) {
+		if (FNAME(prefetch_invalid_gpte)(vcpu, sp, &sp->spt[i], gpte)) {
 			vcpu->kvm->tlbs_dirty++;
 			continue;
 		}
 
 		gfn = gpte_to_gfn(gpte);
 		pte_access = sp->role.access;
-		pte_access &= gpte_access(vcpu, gpte);
+		pte_access &= FNAME(gpte_access)(vcpu, gpte);
 		protect_clean_gpte(&pte_access, gpte);
 
 		if (sync_mmio_spte(&sp->spt[i], gfn, pte_access, &nr_present))
-- 
1.8.1.2



* [PATCH v3 03/13] nEPT: Add EPT tables support to paging_tmpl.h
  2013-05-19  4:52 [PATCH v3 01/13] nEPT: Support LOAD_IA32_EFER entry/exit controls for L1 Jun Nakajima
  2013-05-19  4:52 ` [PATCH v3 02/13] nEPT: Move gpte_access() and prefetch_invalid_gpte() to paging_tmpl.h Jun Nakajima
@ 2013-05-19  4:52 ` Jun Nakajima
  2013-05-21  7:52   ` Xiao Guangrong
  2013-05-19  4:52 ` [PATCH v3 04/13] nEPT: Define EPT-specific link_shadow_page() Jun Nakajima
                   ` (10 subsequent siblings)
  12 siblings, 1 reply; 52+ messages in thread
From: Jun Nakajima @ 2013-05-19  4:52 UTC (permalink / raw)
  To: kvm; +Cc: Gleb Natapov, Paolo Bonzini

From: Nadav Har'El <nyh@il.ibm.com>

This is the first patch in a series which adds nested EPT support to KVM's
nested VMX. Nested EPT means emulating EPT for an L1 guest so that L1 can use
EPT when running a nested guest L2. When L1 uses EPT, it allows the L2 guest
to set its own cr3 and take its own page faults without either of L0 or L1
getting involved. This often significantly improves L2's performance over the
previous two alternatives (shadow page tables over EPT, and shadow page
tables over shadow page tables).

This patch adds EPT support to paging_tmpl.h.

paging_tmpl.h contains the code for reading and writing page tables. The code
for 32-bit and 64-bit tables is very similar, but not identical, so
paging_tmpl.h is #include'd twice in mmu.c, once with PTTYPE=32 and once
with PTTYPE=64, and this generates the two sets of similar functions.
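
For reference, the existing instantiation pattern in mmu.c looks roughly like
this (simplified excerpt; FNAME() in paging_tmpl.h expands to the matching
paging64_*/paging32_* names):

	#define PTTYPE 64
	#include "paging_tmpl.h"
	#undef PTTYPE

	#define PTTYPE 32
	#include "paging_tmpl.h"
	#undef PTTYPE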

There are subtle but important differences between the format of EPT tables
and that of ordinary x86 64-bit page tables, so for nested EPT we need a
third set of functions to read the guest EPT table and to write the shadow
EPT table.

So this patch adds a third PTTYPE, PTTYPE_EPT, which creates functions (prefixed
with "EPT_") that correctly read and write EPT tables.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
---
 arch/x86/kvm/mmu.c         |  5 +++++
 arch/x86/kvm/paging_tmpl.h | 43 +++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 46 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 117233f..6c1670f 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3397,6 +3397,11 @@ static inline bool is_last_gpte(struct kvm_mmu *mmu, unsigned level, unsigned gp
 	return mmu->last_pte_bitmap & (1 << index);
 }
 
+#define PTTYPE_EPT 18 /* arbitrary */
+#define PTTYPE PTTYPE_EPT
+#include "paging_tmpl.h"
+#undef PTTYPE
+
 #define PTTYPE 64
 #include "paging_tmpl.h"
 #undef PTTYPE
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index df34d4a..4c45654 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -50,6 +50,22 @@
 	#define PT_LEVEL_BITS PT32_LEVEL_BITS
 	#define PT_MAX_FULL_LEVELS 2
 	#define CMPXCHG cmpxchg
+#elif PTTYPE == PTTYPE_EPT
+	#define pt_element_t u64
+	#define guest_walker guest_walkerEPT
+	#define FNAME(name) EPT_##name
+	#define PT_BASE_ADDR_MASK PT64_BASE_ADDR_MASK
+	#define PT_LVL_ADDR_MASK(lvl) PT64_LVL_ADDR_MASK(lvl)
+	#define PT_LVL_OFFSET_MASK(lvl) PT64_LVL_OFFSET_MASK(lvl)
+	#define PT_INDEX(addr, level) PT64_INDEX(addr, level)
+	#define PT_LEVEL_BITS PT64_LEVEL_BITS
+	#ifdef CONFIG_X86_64
+	#define PT_MAX_FULL_LEVELS 4
+	#define CMPXCHG cmpxchg
+	#else
+	#define CMPXCHG cmpxchg64
+	#define PT_MAX_FULL_LEVELS 2
+	#endif
 #else
 	#error Invalid PTTYPE value
 #endif
@@ -80,6 +96,10 @@ static gfn_t gpte_to_gfn_lvl(pt_element_t gpte, int lvl)
 	return (gpte & PT_LVL_ADDR_MASK(lvl)) >> PAGE_SHIFT;
 }
 
+#if PTTYPE != PTTYPE_EPT
+/*
+ *  Comment out this for EPT because update_accessed_dirty_bits() is not used.
+ */
 static int FNAME(cmpxchg_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 			       pt_element_t __user *ptep_user, unsigned index,
 			       pt_element_t orig_pte, pt_element_t new_pte)
@@ -102,6 +122,7 @@ static int FNAME(cmpxchg_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 
 	return (ret != orig_pte);
 }
+#endif
 
 static bool FNAME(prefetch_invalid_gpte)(struct kvm_vcpu *vcpu,
 				  struct kvm_mmu_page *sp, u64 *spte,
@@ -126,13 +147,21 @@ no_present:
 static inline unsigned FNAME(gpte_access)(struct kvm_vcpu *vcpu, u64 gpte)
 {
 	unsigned access;
-
+#if PTTYPE == PTTYPE_EPT
+	access = (gpte & (VMX_EPT_READABLE_MASK | VMX_EPT_WRITABLE_MASK |
+			  VMX_EPT_EXECUTABLE_MASK));
+#else
 	access = (gpte & (PT_WRITABLE_MASK | PT_USER_MASK)) | ACC_EXEC_MASK;
 	access &= ~(gpte >> PT64_NX_SHIFT);
+#endif
 
 	return access;
 }
 
+#if PTTYPE != PTTYPE_EPT
+/*
+ * EPT A/D bit support is not implemented.
+ */
 static int FNAME(update_accessed_dirty_bits)(struct kvm_vcpu *vcpu,
 					     struct kvm_mmu *mmu,
 					     struct guest_walker *walker,
@@ -169,6 +198,7 @@ static int FNAME(update_accessed_dirty_bits)(struct kvm_vcpu *vcpu,
 	}
 	return 0;
 }
+#endif
 
 /*
  * Fetch a guest pte for a guest virtual address
@@ -177,7 +207,6 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
 				    struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 				    gva_t addr, u32 access)
 {
-	int ret;
 	pt_element_t pte;
 	pt_element_t __user *uninitialized_var(ptep_user);
 	gfn_t table_gfn;
@@ -192,7 +221,9 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
 	gfn_t gfn;
 
 	trace_kvm_mmu_pagetable_walk(addr, access);
+#if PTTYPE != PTTYPE_EPT
 retry_walk:
+#endif
 	walker->level = mmu->root_level;
 	pte           = mmu->get_cr3(vcpu);
 
@@ -277,6 +308,7 @@ retry_walk:
 
 	walker->gfn = real_gpa >> PAGE_SHIFT;
 
+#if PTTYPE != PTTYPE_EPT
 	if (!write_fault)
 		protect_clean_gpte(&pte_access, pte);
 	else
@@ -287,12 +319,15 @@ retry_walk:
 		accessed_dirty &= pte >> (PT_DIRTY_SHIFT - PT_ACCESSED_SHIFT);
 
 	if (unlikely(!accessed_dirty)) {
+		int ret;
+
 		ret = FNAME(update_accessed_dirty_bits)(vcpu, mmu, walker, write_fault);
 		if (unlikely(ret < 0))
 			goto error;
 		else if (ret)
 			goto retry_walk;
 	}
+#endif
 
 	walker->pt_access = pt_access;
 	walker->pte_access = pte_access;
@@ -323,6 +358,7 @@ static int FNAME(walk_addr)(struct guest_walker *walker,
 					access);
 }
 
+#if PTTYPE != PTTYPE_EPT
 static int FNAME(walk_addr_nested)(struct guest_walker *walker,
 				   struct kvm_vcpu *vcpu, gva_t addr,
 				   u32 access)
@@ -330,6 +366,7 @@ static int FNAME(walk_addr_nested)(struct guest_walker *walker,
 	return FNAME(walk_addr_generic)(walker, vcpu, &vcpu->arch.nested_mmu,
 					addr, access);
 }
+#endif
 
 static bool
 FNAME(prefetch_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
@@ -754,6 +791,7 @@ static gpa_t FNAME(gva_to_gpa)(struct kvm_vcpu *vcpu, gva_t vaddr, u32 access,
 	return gpa;
 }
 
+#if PTTYPE != PTTYPE_EPT
 static gpa_t FNAME(gva_to_gpa_nested)(struct kvm_vcpu *vcpu, gva_t vaddr,
 				      u32 access,
 				      struct x86_exception *exception)
@@ -772,6 +810,7 @@ static gpa_t FNAME(gva_to_gpa_nested)(struct kvm_vcpu *vcpu, gva_t vaddr,
 
 	return gpa;
 }
+#endif
 
 /*
  * Using the cached information from sp->gfns is safe because:
-- 
1.8.1.2



* [PATCH v3 04/13] nEPT: Define EPT-specific link_shadow_page()
  2013-05-19  4:52 [PATCH v3 01/13] nEPT: Support LOAD_IA32_EFER entry/exit controls for L1 Jun Nakajima
  2013-05-19  4:52 ` [PATCH v3 02/13] nEPT: Move gpte_access() and prefetch_invalid_gpte() to paging_tmpl.h Jun Nakajima
  2013-05-19  4:52 ` [PATCH v3 03/13] nEPT: Add EPT tables support " Jun Nakajima
@ 2013-05-19  4:52 ` Jun Nakajima
  2013-05-20 12:43   ` Paolo Bonzini
  2013-05-21  8:15   ` Xiao Guangrong
  2013-05-19  4:52 ` [PATCH v3 05/13] nEPT: MMU context for nested EPT Jun Nakajima
                   ` (9 subsequent siblings)
  12 siblings, 2 replies; 52+ messages in thread
From: Jun Nakajima @ 2013-05-19  4:52 UTC (permalink / raw)
  To: kvm; +Cc: Gleb Natapov, Paolo Bonzini

From: Nadav Har'El <nyh@il.ibm.com>

Since link_shadow_page() is used by a routine in mmu.c, add an
EPT-specific link_shadow_page() in paging_tmpl.h, rather than moving
it.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
---
 arch/x86/kvm/paging_tmpl.h | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 4c45654..dc495f9 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -461,6 +461,18 @@ static void FNAME(pte_prefetch)(struct kvm_vcpu *vcpu, struct guest_walker *gw,
 	}
 }
 
+#if PTTYPE == PTTYPE_EPT
+static void FNAME(link_shadow_page)(u64 *sptep, struct kvm_mmu_page *sp)
+{
+	u64 spte;
+
+	spte = __pa(sp->spt) | VMX_EPT_READABLE_MASK | VMX_EPT_WRITABLE_MASK |
+		VMX_EPT_EXECUTABLE_MASK;
+
+	mmu_spte_set(sptep, spte);
+}
+#endif
+
 /*
  * Fetch a shadow pte for a specific level in the paging hierarchy.
  * If the guest tries to write a write-protected page, we need to
@@ -513,7 +525,11 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
 			goto out_gpte_changed;
 
 		if (sp)
+#if PTTYPE == PTTYPE_EPT
+			FNAME(link_shadow_page)(it.sptep, sp);
+#else
 			link_shadow_page(it.sptep, sp);
+#endif
 	}
 
 	for (;
@@ -533,7 +549,11 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
 
 		sp = kvm_mmu_get_page(vcpu, direct_gfn, addr, it.level-1,
 				      true, direct_access, it.sptep);
+#if PTTYPE == PTTYPE_EPT
+		FNAME(link_shadow_page)(it.sptep, sp);
+#else
 		link_shadow_page(it.sptep, sp);
+#endif
 	}
 
 	clear_sp_write_flooding_count(it.sptep);
-- 
1.8.1.2



* [PATCH v3 05/13] nEPT: MMU context for nested EPT
  2013-05-19  4:52 [PATCH v3 01/13] nEPT: Support LOAD_IA32_EFER entry/exit controls for L1 Jun Nakajima
                   ` (2 preceding siblings ...)
  2013-05-19  4:52 ` [PATCH v3 04/13] nEPT: Define EPT-specific link_shadow_page() Jun Nakajima
@ 2013-05-19  4:52 ` Jun Nakajima
  2013-05-21  8:50   ` Xiao Guangrong
  2013-05-19  4:52 ` [PATCH v3 06/13] nEPT: Fix cr3 handling in nested exit and entry Jun Nakajima
                   ` (8 subsequent siblings)
  12 siblings, 1 reply; 52+ messages in thread
From: Jun Nakajima @ 2013-05-19  4:52 UTC (permalink / raw)
  To: kvm; +Cc: Gleb Natapov, Paolo Bonzini

From: Nadav Har'El <nyh@il.ibm.com>

KVM's existing shadow MMU code already supports nested TDP. To use it, we
need to set up a new "MMU context" for nested EPT, and create a few callbacks
for it (nested_ept_*()). This context should also use the EPT versions of
the page table access functions (defined in the previous patch).
Then, we need to switch back and forth between this nested context and the
regular MMU context when switching between L1 and L2 (when L1 runs this L2
with EPT).

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
---
 arch/x86/kvm/mmu.c | 38 ++++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/mmu.h |  1 +
 arch/x86/kvm/vmx.c | 54 +++++++++++++++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 92 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 6c1670f..37f8d7f 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3653,6 +3653,44 @@ int kvm_init_shadow_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *context)
 }
 EXPORT_SYMBOL_GPL(kvm_init_shadow_mmu);
 
+int kvm_init_shadow_EPT_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *context)
+{
+	ASSERT(vcpu);
+	ASSERT(!VALID_PAGE(vcpu->arch.mmu.root_hpa));
+
+	context->shadow_root_level = kvm_x86_ops->get_tdp_level();
+
+	context->nx = is_nx(vcpu); /* TODO: ? */
+	context->new_cr3 = paging_new_cr3;
+	context->page_fault = EPT_page_fault;
+	context->gva_to_gpa = EPT_gva_to_gpa;
+	context->sync_page = EPT_sync_page;
+	context->invlpg = EPT_invlpg;
+	context->update_pte = EPT_update_pte;
+	context->free = paging_free;
+	context->root_level = context->shadow_root_level;
+	context->root_hpa = INVALID_PAGE;
+	context->direct_map = false;
+
+	/* TODO: reset_rsvds_bits_mask() is not built for EPT, we need
+	   something different.
+	 */
+	reset_rsvds_bits_mask(vcpu, context);
+
+
+	/* TODO: I copied these from kvm_init_shadow_mmu, I don't know why
+	   they are done, or why they write to vcpu->arch.mmu and not context
+	 */
+	vcpu->arch.mmu.base_role.cr4_pae = !!is_pae(vcpu);
+	vcpu->arch.mmu.base_role.cr0_wp  = is_write_protection(vcpu);
+	vcpu->arch.mmu.base_role.smep_andnot_wp =
+		kvm_read_cr4_bits(vcpu, X86_CR4_SMEP) &&
+		!is_write_protection(vcpu);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(kvm_init_shadow_EPT_mmu);
+
 static int init_kvm_softmmu(struct kvm_vcpu *vcpu)
 {
 	int r = kvm_init_shadow_mmu(vcpu, vcpu->arch.walk_mmu);
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 2adcbc2..8fc94dd 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -54,6 +54,7 @@ int kvm_mmu_get_spte_hierarchy(struct kvm_vcpu *vcpu, u64 addr, u64 sptes[4]);
 void kvm_mmu_set_mmio_spte_mask(u64 mmio_mask);
 int handle_mmio_page_fault_common(struct kvm_vcpu *vcpu, u64 addr, bool direct);
 int kvm_init_shadow_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *context);
+int kvm_init_shadow_EPT_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *context);
 
 static inline unsigned int kvm_mmu_available_pages(struct kvm *kvm)
 {
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index fb9cae5..a88432f 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -1045,6 +1045,11 @@ static inline bool nested_cpu_has_virtual_nmis(struct vmcs12 *vmcs12,
 	return vmcs12->pin_based_vm_exec_control & PIN_BASED_VIRTUAL_NMIS;
 }
 
+static inline int nested_cpu_has_ept(struct vmcs12 *vmcs12)
+{
+	return nested_cpu_has2(vmcs12, SECONDARY_EXEC_ENABLE_EPT);
+}
+
 static inline bool is_exception(u32 intr_info)
 {
 	return (intr_info & (INTR_INFO_INTR_TYPE_MASK | INTR_INFO_VALID_MASK))
@@ -7311,6 +7316,46 @@ static void vmx_set_supported_cpuid(u32 func, struct kvm_cpuid_entry2 *entry)
 		entry->ecx |= bit(X86_FEATURE_VMX);
 }
 
+/* Callbacks for nested_ept_init_mmu_context: */
+
+static unsigned long nested_ept_get_cr3(struct kvm_vcpu *vcpu)
+{
+	/* return the page table to be shadowed - in our case, EPT12 */
+	return get_vmcs12(vcpu)->ept_pointer;
+}
+
+static void nested_ept_inject_page_fault(struct kvm_vcpu *vcpu,
+	struct x86_exception *fault)
+{
+	struct vmcs12 *vmcs12;
+	nested_vmx_vmexit(vcpu);
+	vmcs12 = get_vmcs12(vcpu);
+	/*
+	 * Note no need to set vmcs12->vm_exit_reason as it is already copied
+	 * from vmcs02 in nested_vmx_vmexit() above, i.e., EPT_VIOLATION.
+	 */
+	vmcs12->exit_qualification = fault->error_code;
+	vmcs12->guest_physical_address = fault->address;
+}
+
+static int nested_ept_init_mmu_context(struct kvm_vcpu *vcpu)
+{
+	int r = kvm_init_shadow_EPT_mmu(vcpu, &vcpu->arch.mmu);
+
+	vcpu->arch.mmu.set_cr3           = vmx_set_cr3;
+	vcpu->arch.mmu.get_cr3           = nested_ept_get_cr3;
+	vcpu->arch.mmu.inject_page_fault = nested_ept_inject_page_fault;
+
+	vcpu->arch.walk_mmu              = &vcpu->arch.nested_mmu;
+
+	return r;
+}
+
+static void nested_ept_uninit_mmu_context(struct kvm_vcpu *vcpu)
+{
+	vcpu->arch.walk_mmu = &vcpu->arch.mmu;
+}
+
 /*
  * prepare_vmcs02 is called when the L1 guest hypervisor runs its nested
  * L2 guest. L1 has a vmcs for L2 (vmcs12), and this function "merges" it
@@ -7531,6 +7576,11 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
 		vmx_flush_tlb(vcpu);
 	}
 
+	if (nested_cpu_has_ept(vmcs12)) {
+		kvm_mmu_unload(vcpu);
+		nested_ept_init_mmu_context(vcpu);
+	}
+
 	if (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_EFER)
 		vcpu->arch.efer = vmcs12->guest_ia32_efer;
 	else if (vmcs12->vm_entry_controls & VM_ENTRY_IA32E_MODE)
@@ -7975,7 +8025,9 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
 	vcpu->arch.cr4_guest_owned_bits = ~vmcs_readl(CR4_GUEST_HOST_MASK);
 	kvm_set_cr4(vcpu, vmcs12->host_cr4);
 
-	/* shadow page tables on either EPT or shadow page tables */
+	if (nested_cpu_has_ept(vmcs12))
+		nested_ept_uninit_mmu_context(vcpu);
+
 	kvm_set_cr3(vcpu, vmcs12->host_cr3);
 	kvm_mmu_reset_context(vcpu);
 
-- 
1.8.1.2



* [PATCH v3 06/13] nEPT: Fix cr3 handling in nested exit and entry
  2013-05-19  4:52 [PATCH v3 01/13] nEPT: Support LOAD_IA32_EFER entry/exit controls for L1 Jun Nakajima
                   ` (3 preceding siblings ...)
  2013-05-19  4:52 ` [PATCH v3 05/13] nEPT: MMU context for nested EPT Jun Nakajima
@ 2013-05-19  4:52 ` Jun Nakajima
  2013-05-20 13:19   ` Paolo Bonzini
  2013-06-12 12:42   ` Gleb Natapov
  2013-05-19  4:52 ` [PATCH v3 07/13] nEPT: Fix wrong test in kvm_set_cr3 Jun Nakajima
                   ` (7 subsequent siblings)
  12 siblings, 2 replies; 52+ messages in thread
From: Jun Nakajima @ 2013-05-19  4:52 UTC (permalink / raw)
  To: kvm; +Cc: Gleb Natapov, Paolo Bonzini

From: Nadav Har'El <nyh@il.ibm.com>

The existing code for handling cr3 and related VMCS fields during nested
exit and entry wasn't correct in all cases:

If L2 is allowed to control cr3 (and this is indeed the case in nested EPT),
during nested exit we must copy the modified cr3 from vmcs02 to vmcs12, and
we forgot to do so. This patch adds this copy.

If L0 isn't controlling cr3 when running L2 (i.e., L0 is using EPT), and
whoever does control cr3 (L1 or L2) is using PAE, the processor might have
saved PDPTEs and we should also save them in vmcs12 (and restore later).

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
---
 arch/x86/kvm/vmx.c | 30 ++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index a88432f..b79efd4 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -7608,6 +7608,17 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
 	kvm_set_cr3(vcpu, vmcs12->guest_cr3);
 	kvm_mmu_reset_context(vcpu);
 
+	/*
+	 * Additionally, except when L0 is using shadow page tables, L1 or
+	 * L2 control guest_cr3 for L2, so they may also have saved PDPTEs
+	 */
+	if (enable_ept) {
+		vmcs_write64(GUEST_PDPTR0, vmcs12->guest_pdptr0);
+		vmcs_write64(GUEST_PDPTR1, vmcs12->guest_pdptr1);
+		vmcs_write64(GUEST_PDPTR2, vmcs12->guest_pdptr2);
+		vmcs_write64(GUEST_PDPTR3, vmcs12->guest_pdptr3);
+	}
+
 	kvm_register_write(vcpu, VCPU_REGS_RSP, vmcs12->guest_rsp);
 	kvm_register_write(vcpu, VCPU_REGS_RIP, vmcs12->guest_rip);
 }
@@ -7930,6 +7941,25 @@ static void prepare_vmcs12(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
 	vmcs12->guest_pending_dbg_exceptions =
 		vmcs_readl(GUEST_PENDING_DBG_EXCEPTIONS);
 
+	/*
+	 * In some cases (usually, nested EPT), L2 is allowed to change its
+	 * own CR3 without exiting. If it has changed it, we must keep it.
+	 * Of course, if L0 is using shadow page tables, GUEST_CR3 was defined
+	 * by L0, not L1 or L2, so we mustn't unconditionally copy it to vmcs12.
+	 */
+	if (enable_ept)
+		vmcs12->guest_cr3 = vmcs_read64(GUEST_CR3);
+	/*
+	 * Additionally, except when L0 is using shadow page tables, L1 or
+	 * L2 control guest_cr3 for L2, so save their PDPTEs
+	 */
+	if (enable_ept) {
+		vmcs12->guest_pdptr0 = vmcs_read64(GUEST_PDPTR0);
+		vmcs12->guest_pdptr1 = vmcs_read64(GUEST_PDPTR1);
+		vmcs12->guest_pdptr2 = vmcs_read64(GUEST_PDPTR2);
+		vmcs12->guest_pdptr3 = vmcs_read64(GUEST_PDPTR3);
+	}
+
 	vmcs12->vm_entry_controls =
 		(vmcs12->vm_entry_controls & ~VM_ENTRY_IA32E_MODE) |
 		(vmcs_read32(VM_ENTRY_CONTROLS) & VM_ENTRY_IA32E_MODE);
-- 
1.8.1.2



* [PATCH v3 07/13] nEPT: Fix wrong test in kvm_set_cr3
  2013-05-19  4:52 [PATCH v3 01/13] nEPT: Support LOAD_IA32_EFER entry/exit controls for L1 Jun Nakajima
                   ` (4 preceding siblings ...)
  2013-05-19  4:52 ` [PATCH v3 06/13] nEPT: Fix cr3 handling in nested exit and entry Jun Nakajima
@ 2013-05-19  4:52 ` Jun Nakajima
  2013-05-20 13:17   ` Paolo Bonzini
  2013-05-19  4:52 ` [PATCH v3 08/13] nEPT: Some additional comments Jun Nakajima
                   ` (6 subsequent siblings)
  12 siblings, 1 reply; 52+ messages in thread
From: Jun Nakajima @ 2013-05-19  4:52 UTC (permalink / raw)
  To: kvm; +Cc: Gleb Natapov, Paolo Bonzini

From: Nadav Har'El <nyh@il.ibm.com>

kvm_set_cr3() attempts to check if the new cr3 is a valid guest physical
address. The problem is that with nested EPT, cr3 is an *L2* physical
address, not an L1 physical address as this test expects.

As the comment above this test explains, it isn't necessary, and doesn't
correspond to anything a real processor would do. So this patch removes it.

Note that this wrong test could have also theoretically caused problems
in nested NPT, not just in nested EPT. However, in practice, the problem
was avoided: nested_svm_vmexit()/vmrun() do not call kvm_set_cr3 in the
nested NPT case, and instead set the vmcb (and arch.cr3) directly, thus
circumventing the problem. Additional potential calls to the buggy function
are avoided in that we don't trap cr3 modifications when nested NPT is
enabled. However, because in nested VMX we did want to use kvm_set_cr3()
(as requested in Avi Kivity's review of the original nested VMX patches),
we can't avoid this problem and need to fix it.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
---
 arch/x86/kvm/x86.c | 11 -----------
 1 file changed, 11 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 094b5d9..7b36ec6 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -683,17 +683,6 @@ int kvm_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3)
 		 */
 	}
 
-	/*
-	 * Does the new cr3 value map to physical memory? (Note, we
-	 * catch an invalid cr3 even in real-mode, because it would
-	 * cause trouble later on when we turn on paging anyway.)
-	 *
-	 * A real CPU would silently accept an invalid cr3 and would
-	 * attempt to use it - with largely undefined (and often hard
-	 * to debug) behavior on the guest side.
-	 */
-	if (unlikely(!gfn_to_memslot(vcpu->kvm, cr3 >> PAGE_SHIFT)))
-		return 1;
 	vcpu->arch.cr3 = cr3;
 	__set_bit(VCPU_EXREG_CR3, (ulong *)&vcpu->arch.regs_avail);
 	vcpu->arch.mmu.new_cr3(vcpu);
-- 
1.8.1.2



* [PATCH v3 08/13] nEPT: Some additional comments
  2013-05-19  4:52 [PATCH v3 01/13] nEPT: Support LOAD_IA32_EFER entry/exit controls for L1 Jun Nakajima
                   ` (5 preceding siblings ...)
  2013-05-19  4:52 ` [PATCH v3 07/13] nEPT: Fix wrong test in kvm_set_cr3 Jun Nakajima
@ 2013-05-19  4:52 ` Jun Nakajima
  2013-05-20 13:21   ` Paolo Bonzini
  2013-05-19  4:52 ` [PATCH v3 09/13] nEPT: Advertise EPT to L1 Jun Nakajima
                   ` (5 subsequent siblings)
  12 siblings, 1 reply; 52+ messages in thread
From: Jun Nakajima @ 2013-05-19  4:52 UTC (permalink / raw)
  To: kvm; +Cc: Gleb Natapov, Paolo Bonzini

From: Nadav Har'El <nyh@il.ibm.com>

Some additional comments to preexisting code:
Explain who (L0 or L1) handles EPT violation and misconfiguration exits.
Don't mention "shadow on either EPT or shadow" as the only two options.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
---
 arch/x86/kvm/vmx.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index b79efd4..4661a22 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -6540,7 +6540,20 @@ static bool nested_vmx_exit_handled(struct kvm_vcpu *vcpu)
 		return nested_cpu_has2(vmcs12,
 			SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES);
 	case EXIT_REASON_EPT_VIOLATION:
+		/*
+		 * L0 always deals with the EPT violation. If nested EPT is
+		 * used, and the nested mmu code discovers that the address is
+		 * missing in the guest EPT table (EPT12), the EPT violation
+		 * will be injected with nested_ept_inject_page_fault()
+		 */
+		return 0;
 	case EXIT_REASON_EPT_MISCONFIG:
+		/*
+		 * L2 never uses directly L1's EPT, but rather L0's own EPT
+		 * table (shadow on EPT) or a merged EPT table that L0 built
+		 * (EPT on EPT). So any problems with the structure of the
+		 * table is L0's fault.
+		 */
 		return 0;
 	case EXIT_REASON_PREEMPTION_TIMER:
 		return vmcs12->pin_based_vm_exec_control &
-- 
1.8.1.2



* [PATCH v3 09/13] nEPT: Advertise EPT to L1
  2013-05-19  4:52 [PATCH v3 01/13] nEPT: Support LOAD_IA32_EFER entry/exit controls for L1 Jun Nakajima
                   ` (6 preceding siblings ...)
  2013-05-19  4:52 ` [PATCH v3 08/13] nEPT: Some additional comments Jun Nakajima
@ 2013-05-19  4:52 ` Jun Nakajima
  2013-05-20 13:05   ` Paolo Bonzini
  2013-05-19  4:52 ` [PATCH v3 10/13] nEPT: Nested INVEPT Jun Nakajima
                   ` (4 subsequent siblings)
  12 siblings, 1 reply; 52+ messages in thread
From: Jun Nakajima @ 2013-05-19  4:52 UTC (permalink / raw)
  To: kvm; +Cc: Gleb Natapov, Paolo Bonzini

From: Nadav Har'El <nyh@il.ibm.com>

Advertise the support of EPT to the L1 guest, through the appropriate MSR.

This is the last patch of the basic Nested EPT feature, so as to allow
bisection through this patch series: The guest will not see EPT support until
this last patch, and will not attempt to use the half-applied feature.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
---
 arch/x86/include/asm/vmx.h |  2 ++
 arch/x86/kvm/vmx.c         | 17 +++++++++++++++--
 2 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index f3e01a2..4aec45d 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -394,7 +394,9 @@ enum vmcs_field {
 #define VMX_EPTP_WB_BIT				(1ull << 14)
 #define VMX_EPT_2MB_PAGE_BIT			(1ull << 16)
 #define VMX_EPT_1GB_PAGE_BIT			(1ull << 17)
+#define VMX_EPT_INVEPT_BIT			(1ull << 20)
 #define VMX_EPT_AD_BIT				    (1ull << 21)
+#define VMX_EPT_EXTENT_INDIVIDUAL_BIT		(1ull << 24)
 #define VMX_EPT_EXTENT_CONTEXT_BIT		(1ull << 25)
 #define VMX_EPT_EXTENT_GLOBAL_BIT		(1ull << 26)
 
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 4661a22..1cf8a41 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2155,6 +2155,7 @@ static u32 nested_vmx_pinbased_ctls_low, nested_vmx_pinbased_ctls_high;
 static u32 nested_vmx_exit_ctls_low, nested_vmx_exit_ctls_high;
 static u32 nested_vmx_entry_ctls_low, nested_vmx_entry_ctls_high;
 static u32 nested_vmx_misc_low, nested_vmx_misc_high;
+static u32 nested_vmx_ept_caps;
 static __init void nested_vmx_setup_ctls_msrs(void)
 {
 	/*
@@ -2242,6 +2243,18 @@ static __init void nested_vmx_setup_ctls_msrs(void)
 		SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES |
 		SECONDARY_EXEC_WBINVD_EXITING;
 
+	if (enable_ept) {
+		/* nested EPT: emulate EPT also to L1 */
+		nested_vmx_secondary_ctls_high |= SECONDARY_EXEC_ENABLE_EPT;
+		nested_vmx_ept_caps = VMX_EPT_PAGE_WALK_4_BIT;
+		nested_vmx_ept_caps |=
+			VMX_EPT_INVEPT_BIT | VMX_EPT_EXTENT_GLOBAL_BIT |
+			VMX_EPT_EXTENT_CONTEXT_BIT |
+			VMX_EPT_EXTENT_INDIVIDUAL_BIT;
+		nested_vmx_ept_caps &= vmx_capability.ept;
+	} else
+		nested_vmx_ept_caps = 0;
+
 	/* miscellaneous data */
 	rdmsr(MSR_IA32_VMX_MISC, nested_vmx_misc_low, nested_vmx_misc_high);
 	nested_vmx_misc_low &= VMX_MISC_PREEMPTION_TIMER_RATE_MASK |
@@ -2347,8 +2360,8 @@ static int vmx_get_vmx_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 *pdata)
 					nested_vmx_secondary_ctls_high);
 		break;
 	case MSR_IA32_VMX_EPT_VPID_CAP:
-		/* Currently, no nested ept or nested vpid */
-		*pdata = 0;
+		/* Currently, no nested vpid support */
+		*pdata = nested_vmx_ept_caps;
 		break;
 	default:
 		return 0;
-- 
1.8.1.2



* [PATCH v3 10/13] nEPT: Nested INVEPT
  2013-05-19  4:52 [PATCH v3 01/13] nEPT: Support LOAD_IA32_EFER entry/exit controls for L1 Jun Nakajima
                   ` (7 preceding siblings ...)
  2013-05-19  4:52 ` [PATCH v3 09/13] nEPT: Advertise EPT to L1 Jun Nakajima
@ 2013-05-19  4:52 ` Jun Nakajima
  2013-05-20 12:46   ` Paolo Bonzini
  2013-05-21  9:16   ` Xiao Guangrong
  2013-05-19  4:52 ` [PATCH v3 11/13] nEPT: Miscelleneous cleanups Jun Nakajima
                   ` (3 subsequent siblings)
  12 siblings, 2 replies; 52+ messages in thread
From: Jun Nakajima @ 2013-05-19  4:52 UTC (permalink / raw)
  To: kvm; +Cc: Gleb Natapov, Paolo Bonzini

From: Nadav Har'El <nyh@il.ibm.com>

If we let L1 use EPT, we should probably also support the INVEPT instruction.

In our current nested EPT implementation, when L1 changes its EPT table for
L2 (i.e., EPT12), L0 modifies the shadow EPT table (EPT02), and in the course
of this modification already calls INVEPT. Therefore, when L1 calls INVEPT,
we don't really need to do anything. In particular we *don't* need to call
the real INVEPT again. All we do in our INVEPT is verify the validity of the
call, and its parameters, and then do nothing.

In KVM Forum 2010, Dong et al. presented "Nested Virtualization Friendly KVM"
and classified our current nested EPT implementation as "shadow-like virtual
EPT". He recommended instead a different approach, which he called "VTLB-like
virtual EPT". If we had taken that alternative approach, INVEPT would have had
a bigger role: L0 would only rebuild the shadow EPT table when L1 calls INVEPT.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
---
 arch/x86/include/uapi/asm/vmx.h |  1 +
 arch/x86/kvm/vmx.c              | 83 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 84 insertions(+)

diff --git a/arch/x86/include/uapi/asm/vmx.h b/arch/x86/include/uapi/asm/vmx.h
index d651082..7a34e8f 100644
--- a/arch/x86/include/uapi/asm/vmx.h
+++ b/arch/x86/include/uapi/asm/vmx.h
@@ -65,6 +65,7 @@
 #define EXIT_REASON_EOI_INDUCED         45
 #define EXIT_REASON_EPT_VIOLATION       48
 #define EXIT_REASON_EPT_MISCONFIG       49
+#define EXIT_REASON_INVEPT              50
 #define EXIT_REASON_PREEMPTION_TIMER    52
 #define EXIT_REASON_WBINVD              54
 #define EXIT_REASON_XSETBV              55
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 1cf8a41..d9d991d 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -6251,6 +6251,87 @@ static int handle_vmptrst(struct kvm_vcpu *vcpu)
 	return 1;
 }
 
+/* Emulate the INVEPT instruction */
+static int handle_invept(struct kvm_vcpu *vcpu)
+{
+	u32 vmx_instruction_info;
+	unsigned long type;
+	gva_t gva;
+	struct x86_exception e;
+	struct {
+		u64 eptp, gpa;
+	} operand;
+
+	if (!(nested_vmx_secondary_ctls_high & SECONDARY_EXEC_ENABLE_EPT) ||
+	    !(nested_vmx_ept_caps & VMX_EPT_INVEPT_BIT)) {
+		kvm_queue_exception(vcpu, UD_VECTOR);
+		return 1;
+	}
+
+	if (!nested_vmx_check_permission(vcpu))
+		return 1;
+
+	if (!kvm_read_cr0_bits(vcpu, X86_CR0_PE)) {
+		kvm_queue_exception(vcpu, UD_VECTOR);
+		return 1;
+	}
+
+	/* According to the Intel VMX instruction reference, the memory
+	 * operand is read even if it isn't needed (e.g., for type==global)
+	 */
+	vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
+	if (get_vmx_mem_address(vcpu, vmcs_readl(EXIT_QUALIFICATION),
+			vmx_instruction_info, &gva))
+		return 1;
+	if (kvm_read_guest_virt(&vcpu->arch.emulate_ctxt, gva, &operand,
+				sizeof(operand), &e)) {
+		kvm_inject_page_fault(vcpu, &e);
+		return 1;
+	}
+
+	type = kvm_register_read(vcpu, (vmx_instruction_info >> 28) & 0xf);
+
+	switch (type) {
+	case VMX_EPT_EXTENT_GLOBAL:
+		if (!(nested_vmx_ept_caps & VMX_EPT_EXTENT_GLOBAL_BIT))
+			nested_vmx_failValid(vcpu,
+				VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID);
+		else {
+			/*
+			 * Do nothing: when L1 changes EPT12, we already
+			 * update EPT02 (the shadow EPT table) and call INVEPT.
+			 * So when L1 calls INVEPT, there's nothing left to do.
+			 */
+			nested_vmx_succeed(vcpu);
+		}
+		break;
+	case VMX_EPT_EXTENT_CONTEXT:
+		if (!(nested_vmx_ept_caps & VMX_EPT_EXTENT_CONTEXT_BIT))
+			nested_vmx_failValid(vcpu,
+				VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID);
+		else {
+			/* Do nothing */
+			nested_vmx_succeed(vcpu);
+		}
+		break;
+	case VMX_EPT_EXTENT_INDIVIDUAL_ADDR:
+		if (!(nested_vmx_ept_caps & VMX_EPT_EXTENT_INDIVIDUAL_BIT))
+			nested_vmx_failValid(vcpu,
+				VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID);
+		else {
+			/* Do nothing */
+			nested_vmx_succeed(vcpu);
+		}
+		break;
+	default:
+		nested_vmx_failValid(vcpu,
+			VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID);
+	}
+
+	skip_emulated_instruction(vcpu);
+	return 1;
+}
+
 /*
  * The exit handlers return 1 if the exit was handled fully and guest execution
  * may resume.  Otherwise they set the kvm_run parameter to indicate what needs
@@ -6295,6 +6376,7 @@ static int (*const kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = {
 	[EXIT_REASON_PAUSE_INSTRUCTION]       = handle_pause,
 	[EXIT_REASON_MWAIT_INSTRUCTION]	      = handle_invalid_op,
 	[EXIT_REASON_MONITOR_INSTRUCTION]     = handle_invalid_op,
+	[EXIT_REASON_INVEPT]                  = handle_invept,
 };
 
 static const int kvm_vmx_max_exit_handlers =
@@ -6521,6 +6603,7 @@ static bool nested_vmx_exit_handled(struct kvm_vcpu *vcpu)
 	case EXIT_REASON_VMPTRST: case EXIT_REASON_VMREAD:
 	case EXIT_REASON_VMRESUME: case EXIT_REASON_VMWRITE:
 	case EXIT_REASON_VMOFF: case EXIT_REASON_VMON:
+	case EXIT_REASON_INVEPT:
 		/*
 		 * VMX instructions trap unconditionally. This allows L1 to
 		 * emulate them for its L2 guest, i.e., allows 3-level nesting!
-- 
1.8.1.2



* [PATCH v3 11/13] nEPT: Miscelleneous cleanups
  2013-05-19  4:52 [PATCH v3 01/13] nEPT: Support LOAD_IA32_EFER entry/exit controls for L1 Jun Nakajima
                   ` (8 preceding siblings ...)
  2013-05-19  4:52 ` [PATCH v3 10/13] nEPT: Nested INVEPT Jun Nakajima
@ 2013-05-19  4:52 ` Jun Nakajima
  2013-05-19  4:52 ` [PATCH v3 12/13] nEPT: Move is_rsvd_bits_set() to paging_tmpl.h Jun Nakajima
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 52+ messages in thread
From: Jun Nakajima @ 2013-05-19  4:52 UTC (permalink / raw)
  To: kvm; +Cc: Gleb Natapov, Paolo Bonzini

From: Nadav Har'El <nyh@il.ibm.com>

Some trivial code cleanups not really related to nested EPT.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/kvm/vmx.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index d9d991d..ec4e9b9 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -714,7 +714,6 @@ static void nested_release_page_clean(struct page *page)
 static u64 construct_eptp(unsigned long root_hpa);
 static void kvm_cpu_vmxon(u64 addr);
 static void kvm_cpu_vmxoff(void);
-static void vmx_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3);
 static int vmx_set_tss_addr(struct kvm *kvm, unsigned int addr);
 static void vmx_set_segment(struct kvm_vcpu *vcpu,
 			    struct kvm_segment *var, int seg);
@@ -1039,8 +1038,7 @@ static inline bool nested_cpu_has2(struct vmcs12 *vmcs12, u32 bit)
 		(vmcs12->secondary_vm_exec_control & bit);
 }
 
-static inline bool nested_cpu_has_virtual_nmis(struct vmcs12 *vmcs12,
-	struct kvm_vcpu *vcpu)
+static inline bool nested_cpu_has_virtual_nmis(struct vmcs12 *vmcs12)
 {
 	return vmcs12->pin_based_vm_exec_control & PIN_BASED_VIRTUAL_NMIS;
 }
@@ -6737,7 +6735,7 @@ static int vmx_handle_exit(struct kvm_vcpu *vcpu)
 
 	if (unlikely(!cpu_has_virtual_nmis() && vmx->soft_vnmi_blocked &&
 	    !(is_guest_mode(vcpu) && nested_cpu_has_virtual_nmis(
-	                                get_vmcs12(vcpu), vcpu)))) {
+					get_vmcs12(vcpu))))) {
 		if (vmx_interrupt_allowed(vcpu)) {
 			vmx->soft_vnmi_blocked = 0;
 		} else if (vmx->vnmi_blocked_time > 1000000000LL &&
-- 
1.8.1.2



* [PATCH v3 12/13] nEPT: Move is_rsvd_bits_set() to paging_tmpl.h
  2013-05-19  4:52 [PATCH v3 01/13] nEPT: Support LOAD_IA32_EFER entry/exit controls for L1 Jun Nakajima
                   ` (9 preceding siblings ...)
  2013-05-19  4:52 ` [PATCH v3 11/13] nEPT: Miscelleneous cleanups Jun Nakajima
@ 2013-05-19  4:52 ` Jun Nakajima
  2013-05-19  4:52 ` [PATCH v3 13/13] nEPT: Inject EPT violation/misconfigration Jun Nakajima
  2013-05-20 12:33 ` [PATCH v3 01/13] nEPT: Support LOAD_IA32_EFER entry/exit controls for L1 Paolo Bonzini
  12 siblings, 0 replies; 52+ messages in thread
From: Jun Nakajima @ 2013-05-19  4:52 UTC (permalink / raw)
  To: kvm; +Cc: Gleb Natapov, Paolo Bonzini

Move is_rsvd_bits_set() to paging_tmpl.h so that it can be used to check
reserved bits in EPT page table entries as well.

Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
---
 arch/x86/kvm/mmu.c         |  8 --------
 arch/x86/kvm/paging_tmpl.h | 12 ++++++++++--
 2 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 37f8d7f..93d6abf 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2468,14 +2468,6 @@ static void nonpaging_new_cr3(struct kvm_vcpu *vcpu)
 	mmu_free_roots(vcpu);
 }
 
-static bool is_rsvd_bits_set(struct kvm_mmu *mmu, u64 gpte, int level)
-{
-	int bit7;
-
-	bit7 = (gpte >> 7) & 1;
-	return (gpte & mmu->rsvd_bits_mask[bit7][level-1]) != 0;
-}
-
 static pfn_t pte_prefetch_gfn_to_pfn(struct kvm_vcpu *vcpu, gfn_t gfn,
 				     bool no_dirty_log)
 {
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index dc495f9..2432d49 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -124,11 +124,19 @@ static int FNAME(cmpxchg_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 }
 #endif
 
+static bool FNAME(is_rsvd_bits_set)(struct kvm_mmu *mmu, u64 gpte, int level)
+{
+	int bit7;
+
+	bit7 = (gpte >> 7) & 1;
+	return (gpte & mmu->rsvd_bits_mask[bit7][level-1]) != 0;
+}
+
 static bool FNAME(prefetch_invalid_gpte)(struct kvm_vcpu *vcpu,
 				  struct kvm_mmu_page *sp, u64 *spte,
 				  u64 gpte)
 {
-	if (is_rsvd_bits_set(&vcpu->arch.mmu, gpte, PT_PAGE_TABLE_LEVEL))
+	if (FNAME(is_rsvd_bits_set)(&vcpu->arch.mmu, gpte, PT_PAGE_TABLE_LEVEL))
 		goto no_present;
 
 	if (!is_present_gpte(gpte))
@@ -279,7 +287,7 @@ retry_walk:
 		if (unlikely(!is_present_gpte(pte)))
 			goto error;
 
-		if (unlikely(is_rsvd_bits_set(&vcpu->arch.mmu, pte,
+		if (unlikely(FNAME(is_rsvd_bits_set)(&vcpu->arch.mmu, pte,
 					      walker->level))) {
 			errcode |= PFERR_RSVD_MASK | PFERR_PRESENT_MASK;
 			goto error;
-- 
1.8.1.2



* [PATCH v3 13/13] nEPT: Inject EPT violation/misconfigration
  2013-05-19  4:52 [PATCH v3 01/13] nEPT: Support LOAD_IA32_EFER entry/exit controls for L1 Jun Nakajima
                   ` (10 preceding siblings ...)
  2013-05-19  4:52 ` [PATCH v3 12/13] nEPT: Move is_rsvd_bits_set() to paging_tmpl.h Jun Nakajima
@ 2013-05-19  4:52 ` Jun Nakajima
  2013-05-20 13:09   ` Paolo Bonzini
  2013-05-21 10:56   ` Xiao Guangrong
  2013-05-20 12:33 ` [PATCH v3 01/13] nEPT: Support LOAD_IA32_EFER entry/exit controls for L1 Paolo Bonzini
  12 siblings, 2 replies; 52+ messages in thread
From: Jun Nakajima @ 2013-05-19  4:52 UTC (permalink / raw)
  To: kvm; +Cc: Gleb Natapov, Paolo Bonzini

Add code to detect EPT misconfiguration and inject it into the L1 VMM. Also,
inject a more accurate exit qualification into L1 upon an EPT violation.
Now L1 can correctly reach its ept_misconfig handler (instead of wrongly going
to fast_page_fault); there it will try to handle an MMIO page fault, and if
that fails, it is a real EPT misconfiguration.
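
Roughly, the exit qualification handed to L1 is recomposed along these lines
(a sketch restating the comment in the patch below; the bit meanings follow
the SDM's EPT-violation exit qualification, and the function name is
hypothetical):

	/*
	 * bits [2:0]: the access that faulted (read/write/execute), taken
	 *             from the real exit qualification observed by L0;
	 * bits [5:3]: R/W/X permissions of the faulting address according
	 *             to L1's own EPT tables (EPT12), found by the guest walk.
	 */
	static unsigned long l1_exit_qualification(unsigned long l0_exit_qual,
						   u64 ept12_pte_access)
	{
		return (l0_exit_qual & 0x7) | ((ept12_pte_access & 0x7) << 3);
	}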

Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
---
 arch/x86/include/asm/kvm_host.h |  4 +++
 arch/x86/kvm/mmu.c              |  5 ---
 arch/x86/kvm/mmu.h              |  5 +++
 arch/x86/kvm/paging_tmpl.h      | 26 ++++++++++++++
 arch/x86/kvm/vmx.c              | 79 +++++++++++++++++++++++++++++++++++++++--
 5 files changed, 111 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 3741c65..1d03202 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -262,6 +262,8 @@ struct kvm_mmu {
 	void (*invlpg)(struct kvm_vcpu *vcpu, gva_t gva);
 	void (*update_pte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 			   u64 *spte, const void *pte);
+	bool (*check_tdp_pte)(u64 pte, int level);
+
 	hpa_t root_hpa;
 	int root_level;
 	int shadow_root_level;
@@ -503,6 +505,8 @@ struct kvm_vcpu_arch {
 	 * instruction.
 	 */
 	bool write_fault_to_shadow_pgtable;
+
+	unsigned long exit_qualification; /* set at EPT violation at this point */
 };
 
 struct kvm_lpage_info {
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 93d6abf..3a3b11f 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -233,11 +233,6 @@ static bool set_mmio_spte(u64 *sptep, gfn_t gfn, pfn_t pfn, unsigned access)
 	return false;
 }
 
-static inline u64 rsvd_bits(int s, int e)
-{
-	return ((1ULL << (e - s + 1)) - 1) << s;
-}
-
 void kvm_mmu_set_mask_ptes(u64 user_mask, u64 accessed_mask,
 		u64 dirty_mask, u64 nx_mask, u64 x_mask)
 {
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 8fc94dd..559e2e0 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -88,6 +88,11 @@ static inline bool is_write_protection(struct kvm_vcpu *vcpu)
 	return kvm_read_cr0_bits(vcpu, X86_CR0_WP);
 }
 
+static inline u64 rsvd_bits(int s, int e)
+{
+	return ((1ULL << (e - s + 1)) - 1) << s;
+}
+
 /*
  * Will a fault with a given page-fault error code (pfec) cause a permission
  * fault with the given access (in ACC_* format)?
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 2432d49..067b1f8 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -126,10 +126,14 @@ static int FNAME(cmpxchg_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 
 static bool FNAME(is_rsvd_bits_set)(struct kvm_mmu *mmu, u64 gpte, int level)
 {
+#if PTTYPE == PTTYPE_EPT
+	return (mmu->check_tdp_pte(gpte, level));
+#else
 	int bit7;
 
 	bit7 = (gpte >> 7) & 1;
 	return (gpte & mmu->rsvd_bits_mask[bit7][level-1]) != 0;
+#endif
 }
 
 static bool FNAME(prefetch_invalid_gpte)(struct kvm_vcpu *vcpu,
@@ -352,6 +356,28 @@ error:
 	walker->fault.vector = PF_VECTOR;
 	walker->fault.error_code_valid = true;
 	walker->fault.error_code = errcode;
+
+#if PTTYPE == PTTYPE_EPT
+	/*
+	 * Use PFERR_RSVD_MASK in erorr_code to to tell if EPT
+	 * misconfiguration requires to be injected. The detection is
+	 * done by is_rsvd_bits_set() above.
+	 *
+	 * We set up the value of exit_qualification to inject:
+	 * [2:0] -- Derive from [2:0] of real exit_qualification at EPT violation
+	 * [5:3] -- Calculated by the page walk of the guest EPT page tables
+	 * [7:8] -- Clear to 0.
+	 *
+	 * The other bits are set to 0.
+	 */
+	if (!(errcode & PFERR_RSVD_MASK)) {
+		unsigned long exit_qualification = vcpu->arch.exit_qualification;
+
+		pte_access = pt_access & pte;
+		vcpu->arch.exit_qualification = ((pte_access & 0x7) << 3) |
+			(exit_qualification & 0x7);
+	}
+#endif
 	walker->fault.address = addr;
 	walker->fault.nested_page_fault = mmu != vcpu->arch.walk_mmu;
 
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index ec4e9b9..667be15 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -5310,6 +5310,8 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
 	/* ept page table is present? */
 	error_code |= (exit_qualification >> 3) & 0x1;
 
+	vcpu->arch.exit_qualification = exit_qualification;
+
 	return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
 }
 
@@ -7432,7 +7434,7 @@ static unsigned long nested_ept_get_cr3(struct kvm_vcpu *vcpu)
 }
 
 static void nested_ept_inject_page_fault(struct kvm_vcpu *vcpu,
-	struct x86_exception *fault)
+					struct x86_exception *fault)
 {
 	struct vmcs12 *vmcs12;
 	nested_vmx_vmexit(vcpu);
@@ -7441,10 +7443,81 @@ static void nested_ept_inject_page_fault(struct kvm_vcpu *vcpu,
 	 * Note no need to set vmcs12->vm_exit_reason as it is already copied
 	 * from vmcs02 in nested_vmx_vmexit() above, i.e., EPT_VIOLATION.
 	 */
-	vmcs12->exit_qualification = fault->error_code;
+	if (fault->error_code & PFERR_RSVD_MASK)
+		vmcs12->vm_exit_reason = EXIT_REASON_EPT_MISCONFIG;
+	else
+		vmcs12->vm_exit_reason = EXIT_REASON_EPT_VIOLATION;
+
+	vmcs12->exit_qualification = vcpu->arch.exit_qualification;
 	vmcs12->guest_physical_address = fault->address;
 }
 
+static bool nested_ept_rsvd_bits_check(u64 pte, int level)
+{
+	const int maxphyaddr = 48; /* set to the max size for now */
+	u64 rsvd_mask = rsvd_bits(maxphyaddr, 51);
+
+	switch (level) {
+	case 4:
+		rsvd_mask |= rsvd_bits(3, 7);
+		break;
+	case 3:
+	case 2:
+		if (pte & (1 << 7))
+			rsvd_mask |= rsvd_bits(PAGE_SHIFT, PAGE_SHIFT + 9 * (level - 1) - 1);
+		else
+			rsvd_mask |= rsvd_bits(3, 6);
+		break;
+	case 1:
+		break;
+	default:
+		/* impossible to go to here */
+		BUG();
+	}
+
+	return pte & rsvd_mask;
+}
+
+static bool nested_ept_rwx_bits_check(u64 pte)
+{
+	/* write only or write/execute only */
+	uint8_t rwx_bits = pte & 7;
+
+	switch (rwx_bits) {
+	case 0x2:
+	case 0x6:
+		return true;
+	case 0x4:
+		if (!(nested_vmx_ept_caps & 0x1))
+			return 1;
+	default:
+		return false;
+	}
+}
+
+static bool nested_ept_memtype_check(u64 pte, int level)
+{
+	if (level == 1 || (level == 2 && (pte & (1ULL << 7)))) {
+		/* 0x38, namely bits 5:3, stands for EPT memory type */
+		u64 ept_mem_type = (pte & 0x38) >> 3;
+
+		if (ept_mem_type == 0x2 || ept_mem_type == 0x3 ||
+		    ept_mem_type == 0x7)
+			return true;
+	}
+	return false;
+}
+
+bool nested_check_ept_pte(u64 pte, int level)
+{
+	bool r;
+	r = nested_ept_rsvd_bits_check(pte, level) ||
+		nested_ept_rwx_bits_check(pte) ||
+		nested_ept_memtype_check(pte, level);
+
+	return r;
+}
+
 static int nested_ept_init_mmu_context(struct kvm_vcpu *vcpu)
 {
 	int r = kvm_init_shadow_EPT_mmu(vcpu, &vcpu->arch.mmu);
@@ -7452,7 +7525,7 @@ static int nested_ept_init_mmu_context(struct kvm_vcpu *vcpu)
 	vcpu->arch.mmu.set_cr3           = vmx_set_cr3;
 	vcpu->arch.mmu.get_cr3           = nested_ept_get_cr3;
 	vcpu->arch.mmu.inject_page_fault = nested_ept_inject_page_fault;
-
+	vcpu->arch.mmu.check_tdp_pte     = nested_check_ept_pte;
 	vcpu->arch.walk_mmu              = &vcpu->arch.nested_mmu;
 
 	return r;
-- 
1.8.1.2



* Re: [PATCH v3 01/13] nEPT: Support LOAD_IA32_EFER entry/exit controls for L1
  2013-05-19  4:52 [PATCH v3 01/13] nEPT: Support LOAD_IA32_EFER entry/exit controls for L1 Jun Nakajima
                   ` (11 preceding siblings ...)
  2013-05-19  4:52 ` [PATCH v3 13/13] nEPT: Inject EPT violation/misconfigration Jun Nakajima
@ 2013-05-20 12:33 ` Paolo Bonzini
  2013-07-02  3:01   ` Zhang, Yang Z
  12 siblings, 1 reply; 52+ messages in thread
From: Paolo Bonzini @ 2013-05-20 12:33 UTC (permalink / raw)
  To: Jun Nakajima; +Cc: kvm, Gleb Natapov

On 19/05/2013 06:52, Jun Nakajima wrote:
> From: Nadav Har'El <nyh@il.ibm.com>
> 
> Recent KVM, since http://kerneltrap.org/mailarchive/linux-kvm/2010/5/2/6261577
> switch the EFER MSR when EPT is used and the host and guest have different
> NX bits. So if we add support for nested EPT (L1 guest using EPT to run L2)
> and want to be able to run recent KVM as L1, we need to allow L1 to use this
> EFER switching feature.
> 
> To do this EFER switching, KVM uses VM_ENTRY/EXIT_LOAD_IA32_EFER if available,
> and if it isn't, it uses the generic VM_ENTRY/EXIT_MSR_LOAD. This patch adds
> support for the former (the latter is still unsupported).
> 
> Nested entry and exit emulation (prepare_vmcs_02 and load_vmcs12_host_state,
> respectively) already handled VM_ENTRY/EXIT_LOAD_IA32_EFER correctly. So all
> that's left to do in this patch is to properly advertise this feature to L1.
> 
> Note that vmcs12's VM_ENTRY/EXIT_LOAD_IA32_EFER are emulated by L0, by using
> vmx_set_efer (which itself sets one of several vmcs02 fields), so we always
> support this feature, regardless of whether the host supports it.
> 
> Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
> Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
> Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
> ---
>  arch/x86/kvm/vmx.c | 23 ++++++++++++++++-------
>  1 file changed, 16 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 260a919..fb9cae5 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -2192,7 +2192,8 @@ static __init void nested_vmx_setup_ctls_msrs(void)
>  #else
>  	nested_vmx_exit_ctls_high = 0;
>  #endif
> -	nested_vmx_exit_ctls_high |= VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR;
> +	nested_vmx_exit_ctls_high |= (VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR |
> +				      VM_EXIT_LOAD_IA32_EFER);
>  
>  	/* entry controls */
>  	rdmsr(MSR_IA32_VMX_ENTRY_CTLS,
> @@ -2201,8 +2202,8 @@ static __init void nested_vmx_setup_ctls_msrs(void)
>  	nested_vmx_entry_ctls_low = VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR;
>  	nested_vmx_entry_ctls_high &=
>  		VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_IA32E_MODE;
> -	nested_vmx_entry_ctls_high |= VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR;
> -
> +	nested_vmx_entry_ctls_high |= (VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR |
> +				       VM_ENTRY_LOAD_IA32_EFER);
>  	/* cpu-based controls */
>  	rdmsr(MSR_IA32_VMX_PROCBASED_CTLS,
>  		nested_vmx_procbased_ctls_low, nested_vmx_procbased_ctls_high);
> @@ -7492,10 +7493,18 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
>  	vcpu->arch.cr0_guest_owned_bits &= ~vmcs12->cr0_guest_host_mask;
>  	vmcs_writel(CR0_GUEST_HOST_MASK, ~vcpu->arch.cr0_guest_owned_bits);
>  
> -	/* Note: IA32_MODE, LOAD_IA32_EFER are modified by vmx_set_efer below */
> -	vmcs_write32(VM_EXIT_CONTROLS,
> -		vmcs12->vm_exit_controls | vmcs_config.vmexit_ctrl);
> -	vmcs_write32(VM_ENTRY_CONTROLS, vmcs12->vm_entry_controls |
> +	/* L2->L1 exit controls are emulated - the hardware exit is to L0 so
> +	 * we should use its exit controls. Note that IA32_MODE, LOAD_IA32_EFER
> +	 * bits are further modified by vmx_set_efer() below.
> +	 */
> +	vmcs_write32(VM_EXIT_CONTROLS, vmcs_config.vmexit_ctrl);
> +
> +	/* vmcs12's VM_ENTRY_LOAD_IA32_EFER and VM_ENTRY_IA32E_MODE are
> +	 * emulated by vmx_set_efer(), below.

VM_ENTRY_LOAD_IA32_EFER is not emulated by vmx_set_efer, so:

    /* vmcs12's VM_ENTRY_LOAD_IA32_EFER and VM_ENTRY_IA32E_MODE
     * are emulated below.  VM_ENTRY_IA32E_MODE is handled in
     * vmx_set_efer().  */

Paolo

> +	 */
> +	vmcs_write32(VM_ENTRY_CONTROLS,
> +		(vmcs12->vm_entry_controls & ~VM_ENTRY_LOAD_IA32_EFER &
> +			~VM_ENTRY_IA32E_MODE) |
>  		(vmcs_config.vmentry_ctrl & ~VM_ENTRY_IA32E_MODE));
>  
>  	if (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_PAT)
> 


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 02/13] nEPT: Move gpte_access() and prefetch_invalid_gpte() to paging_tmpl.h
  2013-05-19  4:52 ` [PATCH v3 02/13] nEPT: Move gpte_access() and prefetch_invalid_gpte() to paging_tmpl.h Jun Nakajima
@ 2013-05-20 12:34   ` Paolo Bonzini
  0 siblings, 0 replies; 52+ messages in thread
From: Paolo Bonzini @ 2013-05-20 12:34 UTC (permalink / raw)
  To: Jun Nakajima; +Cc: kvm, Gleb Natapov

On 19/05/2013 06:52, Jun Nakajima wrote:
> From: Nadav Har'El <nyh@il.ibm.com>
> 
> For preparation, we just move gpte_access() and prefetch_invalid_gpte() from mmu.c to paging_tmpl.h.
> 
> Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
> Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
> Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
> ---
>  arch/x86/kvm/mmu.c         | 30 ------------------------------
>  arch/x86/kvm/paging_tmpl.h | 40 +++++++++++++++++++++++++++++++++++-----
>  2 files changed, 35 insertions(+), 35 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 004cc87..117233f 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -2488,26 +2488,6 @@ static pfn_t pte_prefetch_gfn_to_pfn(struct kvm_vcpu *vcpu, gfn_t gfn,
>  	return gfn_to_pfn_memslot_atomic(slot, gfn);
>  }
>  
> -static bool prefetch_invalid_gpte(struct kvm_vcpu *vcpu,
> -				  struct kvm_mmu_page *sp, u64 *spte,
> -				  u64 gpte)
> -{
> -	if (is_rsvd_bits_set(&vcpu->arch.mmu, gpte, PT_PAGE_TABLE_LEVEL))
> -		goto no_present;
> -
> -	if (!is_present_gpte(gpte))
> -		goto no_present;
> -
> -	if (!(gpte & PT_ACCESSED_MASK))
> -		goto no_present;
> -
> -	return false;
> -
> -no_present:
> -	drop_spte(vcpu->kvm, spte);
> -	return true;
> -}
> -
>  static int direct_pte_prefetch_many(struct kvm_vcpu *vcpu,
>  				    struct kvm_mmu_page *sp,
>  				    u64 *start, u64 *end)
> @@ -3408,16 +3388,6 @@ static bool sync_mmio_spte(u64 *sptep, gfn_t gfn, unsigned access,
>  	return false;
>  }
>  
> -static inline unsigned gpte_access(struct kvm_vcpu *vcpu, u64 gpte)
> -{
> -	unsigned access;
> -
> -	access = (gpte & (PT_WRITABLE_MASK | PT_USER_MASK)) | ACC_EXEC_MASK;
> -	access &= ~(gpte >> PT64_NX_SHIFT);
> -
> -	return access;
> -}
> -
>  static inline bool is_last_gpte(struct kvm_mmu *mmu, unsigned level, unsigned gpte)
>  {
>  	unsigned index;
> diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
> index da20860..df34d4a 100644
> --- a/arch/x86/kvm/paging_tmpl.h
> +++ b/arch/x86/kvm/paging_tmpl.h
> @@ -103,6 +103,36 @@ static int FNAME(cmpxchg_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
>  	return (ret != orig_pte);
>  }
>  
> +static bool FNAME(prefetch_invalid_gpte)(struct kvm_vcpu *vcpu,
> +				  struct kvm_mmu_page *sp, u64 *spte,
> +				  u64 gpte)
> +{
> +	if (is_rsvd_bits_set(&vcpu->arch.mmu, gpte, PT_PAGE_TABLE_LEVEL))
> +		goto no_present;
> +
> +	if (!is_present_gpte(gpte))
> +		goto no_present;
> +
> +	if (!(gpte & PT_ACCESSED_MASK))
> +		goto no_present;
> +
> +	return false;
> +
> +no_present:
> +	drop_spte(vcpu->kvm, spte);
> +	return true;
> +}
> +
> +static inline unsigned FNAME(gpte_access)(struct kvm_vcpu *vcpu, u64 gpte)
> +{
> +	unsigned access;
> +
> +	access = (gpte & (PT_WRITABLE_MASK | PT_USER_MASK)) | ACC_EXEC_MASK;
> +	access &= ~(gpte >> PT64_NX_SHIFT);
> +
> +	return access;
> +}
> +
>  static int FNAME(update_accessed_dirty_bits)(struct kvm_vcpu *vcpu,
>  					     struct kvm_mmu *mmu,
>  					     struct guest_walker *walker,
> @@ -225,7 +255,7 @@ retry_walk:
>  		}
>  
>  		accessed_dirty &= pte;
> -		pte_access = pt_access & gpte_access(vcpu, pte);
> +		pte_access = pt_access & FNAME(gpte_access)(vcpu, pte);
>  
>  		walker->ptes[walker->level - 1] = pte;
>  	} while (!is_last_gpte(mmu, walker->level, pte));
> @@ -309,13 +339,13 @@ FNAME(prefetch_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
>  	gfn_t gfn;
>  	pfn_t pfn;
>  
> -	if (prefetch_invalid_gpte(vcpu, sp, spte, gpte))
> +	if (FNAME(prefetch_invalid_gpte)(vcpu, sp, spte, gpte))
>  		return false;
>  
>  	pgprintk("%s: gpte %llx spte %p\n", __func__, (u64)gpte, spte);
>  
>  	gfn = gpte_to_gfn(gpte);
> -	pte_access = sp->role.access & gpte_access(vcpu, gpte);
> +	pte_access = sp->role.access & FNAME(gpte_access)(vcpu, gpte);
>  	protect_clean_gpte(&pte_access, gpte);
>  	pfn = pte_prefetch_gfn_to_pfn(vcpu, gfn,
>  			no_dirty_log && (pte_access & ACC_WRITE_MASK));
> @@ -782,14 +812,14 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
>  					  sizeof(pt_element_t)))
>  			return -EINVAL;
>  
> -		if (prefetch_invalid_gpte(vcpu, sp, &sp->spt[i], gpte)) {
> +		if (FNAME(prefetch_invalid_gpte)(vcpu, sp, &sp->spt[i], gpte)) {
>  			vcpu->kvm->tlbs_dirty++;
>  			continue;
>  		}
>  
>  		gfn = gpte_to_gfn(gpte);
>  		pte_access = sp->role.access;
> -		pte_access &= gpte_access(vcpu, gpte);
> +		pte_access &= FNAME(gpte_access)(vcpu, gpte);
>  		protect_clean_gpte(&pte_access, gpte);
>  
>  		if (sync_mmio_spte(&sp->spt[i], gfn, pte_access, &nr_present))
> 

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 04/13] nEPT: Define EPT-specific link_shadow_page()
  2013-05-19  4:52 ` [PATCH v3 04/13] nEPT: Define EPT-specific link_shadow_page() Jun Nakajima
@ 2013-05-20 12:43   ` Paolo Bonzini
  2013-05-21  8:15   ` Xiao Guangrong
  1 sibling, 0 replies; 52+ messages in thread
From: Paolo Bonzini @ 2013-05-20 12:43 UTC (permalink / raw)
  To: Jun Nakajima; +Cc: kvm, Gleb Natapov

On 19/05/2013 06:52, Jun Nakajima wrote:
> From: Nadav Har'El <nyh@il.ibm.com>
> 
> Since link_shadow_page() is used by a routine in mmu.c, add an
> EPT-specific link_shadow_page() in paging_tmp.h, rather than moving
> it.
> 
> Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
> Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
> Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
> ---
>  arch/x86/kvm/paging_tmpl.h | 20 ++++++++++++++++++++
>  1 file changed, 20 insertions(+)
> 
> diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
> index 4c45654..dc495f9 100644
> --- a/arch/x86/kvm/paging_tmpl.h
> +++ b/arch/x86/kvm/paging_tmpl.h
> @@ -461,6 +461,18 @@ static void FNAME(pte_prefetch)(struct kvm_vcpu *vcpu, struct guest_walker *gw,
>  	}
>  }
>  
> +#if PTTYPE == PTTYPE_EPT
> +static void FNAME(link_shadow_page)(u64 *sptep, struct kvm_mmu_page *sp)
> +{
> +	u64 spte;
> +
> +	spte = __pa(sp->spt) | VMX_EPT_READABLE_MASK | VMX_EPT_WRITABLE_MASK |
> +		VMX_EPT_EXECUTABLE_MASK;
> +
> +	mmu_spte_set(sptep, spte);
> +}
> +#endif

The function is small enough that the compiler will likely inline it.
You can just handle it unconditionally with FNAME().

Paolo
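
(For illustration, a rough sketch of such an unconditional variant, reusing the
mask names from the existing mmu.c link_shadow_page() -- a sketch only, not the
actual change:

	static void FNAME(link_shadow_page)(u64 *sptep, struct kvm_mmu_page *sp)
	{
		u64 spte = __pa(sp->spt);

	#if PTTYPE == PTTYPE_EPT
		spte |= VMX_EPT_READABLE_MASK | VMX_EPT_WRITABLE_MASK |
			VMX_EPT_EXECUTABLE_MASK;
	#else
		spte |= PT_PRESENT_MASK | PT_WRITABLE_MASK | shadow_user_mask |
			shadow_x_mask | shadow_accessed_mask;
	#endif
		mmu_spte_set(sptep, spte);
	}

The call sites in FNAME(fetch) could then call FNAME(link_shadow_page)(it.sptep, sp)
with no #if around them.)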

>  /*
>   * Fetch a shadow pte for a specific level in the paging hierarchy.
>   * If the guest tries to write a write-protected page, we need to
> @@ -513,7 +525,11 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
>  			goto out_gpte_changed;
>  
>  		if (sp)
> +#if PTTYPE == PTTYPE_EPT
> +			FNAME(link_shadow_page)(it.sptep, sp);
> +#else
>  			link_shadow_page(it.sptep, sp);
> +#endif
>  	}
>  
>  	for (;
> @@ -533,7 +549,11 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
>  
>  		sp = kvm_mmu_get_page(vcpu, direct_gfn, addr, it.level-1,
>  				      true, direct_access, it.sptep);
> +#if PTTYPE == PTTYPE_EPT
> +		FNAME(link_shadow_page)(it.sptep, sp);
> +#else
>  		link_shadow_page(it.sptep, sp);
> +#endif
>  	}
>  
>  	clear_sp_write_flooding_count(it.sptep);
> 


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 10/13] nEPT: Nested INVEPT
  2013-05-19  4:52 ` [PATCH v3 10/13] nEPT: Nested INVEPT Jun Nakajima
@ 2013-05-20 12:46   ` Paolo Bonzini
  2013-05-21  9:16   ` Xiao Guangrong
  1 sibling, 0 replies; 52+ messages in thread
From: Paolo Bonzini @ 2013-05-20 12:46 UTC (permalink / raw)
  To: Jun Nakajima; +Cc: kvm, Gleb Natapov

On 19/05/2013 06:52, Jun Nakajima wrote:
> +	switch (type) {
> +	case VMX_EPT_EXTENT_GLOBAL:
> +		if (!(nested_vmx_ept_caps & VMX_EPT_EXTENT_GLOBAL_BIT))
> +			nested_vmx_failValid(vcpu,
> +				VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID);
> +		else {
> +			/*
> +			 * Do nothing: when L1 changes EPT12, we already
> +			 * update EPT02 (the shadow EPT table) and call INVEPT.
> +			 * So when L1 calls INVEPT, there's nothing left to do.
> +			 */
> +			nested_vmx_succeed(vcpu);
> +		}
> +		break;

Duplicate code:

	switch (type) {
	case VMX_EPT_EXTENT_GLOBAL
		ok = (nested_vmx_ept_caps & VMX_EPT_EXTENT_GLOBAL_BIT) != 0;
		break;
		...
	default:
		ok = false;
		break;
	}
	if (ok) {
		/* Do nothing: ... */
		nested_vmx_succeed(vcpu);
	} else {
		nested_vmx_failValid(vcpu, ...);
	}
	break;

Paolo

> +	case VMX_EPT_EXTENT_CONTEXT:
> +		if (!(nested_vmx_ept_caps & VMX_EPT_EXTENT_CONTEXT_BIT))
> +			nested_vmx_failValid(vcpu,
> +				VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID);
> +		else {
> +			/* Do nothing */
> +			nested_vmx_succeed(vcpu);
> +		}
> +		break;
> +	case VMX_EPT_EXTENT_INDIVIDUAL_ADDR:
> +		if (!(nested_vmx_ept_caps & VMX_EPT_EXTENT_INDIVIDUAL_BIT))
> +			nested_vmx_failValid(vcpu,
> +				VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID);
> +		else {
> +			/* Do nothing */
> +			nested_vmx_succeed(vcpu);
> +		}
> +		break;
> +	default:
> +		nested_vmx_failValid(vcpu,
> +			VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID);


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 09/13] nEPT: Advertise EPT to L1
  2013-05-19  4:52 ` [PATCH v3 09/13] nEPT: Advertise EPT to L1 Jun Nakajima
@ 2013-05-20 13:05   ` Paolo Bonzini
  0 siblings, 0 replies; 52+ messages in thread
From: Paolo Bonzini @ 2013-05-20 13:05 UTC (permalink / raw)
  To: Jun Nakajima; +Cc: kvm, Gleb Natapov

On 19/05/2013 06:52, Jun Nakajima wrote:
> From: Nadav Har'El <nyh@il.ibm.com>
> 
> Advertise the support of EPT to the L1 guest, through the appropriate MSR.
> 
> This is the last patch of the basic Nested EPT feature, so as to allow
> bisection through this patch series: The guest will not see EPT support until
> this last patch, and will not attempt to use the half-applied feature.
> 
> Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
> Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
> Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
> ---
>  arch/x86/include/asm/vmx.h |  2 ++
>  arch/x86/kvm/vmx.c         | 17 +++++++++++++++--
>  2 files changed, 17 insertions(+), 2 deletions(-)

This patch is ok, but it must be placed after patch 10 ("nEPT: Nested
INVEPT").

Paolo

> diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
> index f3e01a2..4aec45d 100644
> --- a/arch/x86/include/asm/vmx.h
> +++ b/arch/x86/include/asm/vmx.h
> @@ -394,7 +394,9 @@ enum vmcs_field {
>  #define VMX_EPTP_WB_BIT				(1ull << 14)
>  #define VMX_EPT_2MB_PAGE_BIT			(1ull << 16)
>  #define VMX_EPT_1GB_PAGE_BIT			(1ull << 17)
> +#define VMX_EPT_INVEPT_BIT			(1ull << 20)
>  #define VMX_EPT_AD_BIT				    (1ull << 21)
> +#define VMX_EPT_EXTENT_INDIVIDUAL_BIT		(1ull << 24)
>  #define VMX_EPT_EXTENT_CONTEXT_BIT		(1ull << 25)
>  #define VMX_EPT_EXTENT_GLOBAL_BIT		(1ull << 26)
>  
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 4661a22..1cf8a41 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -2155,6 +2155,7 @@ static u32 nested_vmx_pinbased_ctls_low, nested_vmx_pinbased_ctls_high;
>  static u32 nested_vmx_exit_ctls_low, nested_vmx_exit_ctls_high;
>  static u32 nested_vmx_entry_ctls_low, nested_vmx_entry_ctls_high;
>  static u32 nested_vmx_misc_low, nested_vmx_misc_high;
> +static u32 nested_vmx_ept_caps;
>  static __init void nested_vmx_setup_ctls_msrs(void)
>  {
>  	/*
> @@ -2242,6 +2243,18 @@ static __init void nested_vmx_setup_ctls_msrs(void)
>  		SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES |
>  		SECONDARY_EXEC_WBINVD_EXITING;
>  
> +	if (enable_ept) {
> +		/* nested EPT: emulate EPT also to L1 */
> +		nested_vmx_secondary_ctls_high |= SECONDARY_EXEC_ENABLE_EPT;
> +		nested_vmx_ept_caps = VMX_EPT_PAGE_WALK_4_BIT;
> +		nested_vmx_ept_caps |=
> +			VMX_EPT_INVEPT_BIT | VMX_EPT_EXTENT_GLOBAL_BIT |
> +			VMX_EPT_EXTENT_CONTEXT_BIT |
> +			VMX_EPT_EXTENT_INDIVIDUAL_BIT;
> +		nested_vmx_ept_caps &= vmx_capability.ept;
> +	} else
> +		nested_vmx_ept_caps = 0;
> +
>  	/* miscellaneous data */
>  	rdmsr(MSR_IA32_VMX_MISC, nested_vmx_misc_low, nested_vmx_misc_high);
>  	nested_vmx_misc_low &= VMX_MISC_PREEMPTION_TIMER_RATE_MASK |
> @@ -2347,8 +2360,8 @@ static int vmx_get_vmx_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 *pdata)
>  					nested_vmx_secondary_ctls_high);
>  		break;
>  	case MSR_IA32_VMX_EPT_VPID_CAP:
> -		/* Currently, no nested ept or nested vpid */
> -		*pdata = 0;
> +		/* Currently, no nested vpid support */
> +		*pdata = nested_vmx_ept_caps;
>  		break;
>  	default:
>  		return 0;
> 


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 13/13] nEPT: Inject EPT violation/misconfigration
  2013-05-19  4:52 ` [PATCH v3 13/13] nEPT: Inject EPT violation/misconfigration Jun Nakajima
@ 2013-05-20 13:09   ` Paolo Bonzini
  2013-05-21 10:56   ` Xiao Guangrong
  1 sibling, 0 replies; 52+ messages in thread
From: Paolo Bonzini @ 2013-05-20 13:09 UTC (permalink / raw)
  To: Jun Nakajima; +Cc: kvm, Gleb Natapov

On 19/05/2013 06:52, Jun Nakajima wrote:
> @@ -7441,10 +7443,81 @@ static void nested_ept_inject_page_fault(struct kvm_vcpu *vcpu,
>  	 * Note no need to set vmcs12->vm_exit_reason as it is already copied
>  	 * from vmcs02 in nested_vmx_vmexit() above, i.e., EPT_VIOLATION.
>  	 */

This comment is now wrong.

> -	vmcs12->exit_qualification = fault->error_code;

And this shows that patch 5 ("nEPT: MMU context for nested EPT") was
wrong in this respect.  Perhaps this patch should be moved earlier in
the series, so that the exit qualification is "bisectably" ok.

1) the updating of exit_qualification in walk_addr_generic should be
split out and moved before patch 5;

2) the changes to handle_ept_violation and nested_ept_inject_page_fault
(plus fixing the above comment) should also be split out, this time to
squash them in patch 5.  These two changes ensure that patch 5 can
already use the right exit qualification.

3) if needed to make the series bisectable, squash patch 12 into patch 2
and make is_rsvd_bits_set always return 0 in patch 3; then the rest of
the handling of reserved bits (including the introduction of
check_tdp_pte) will remain here. Otherwise, just squash what's left of
this patch into patch 12 and again change the subject.   In either case
the subject will have to change.

Paolo

> +	if (fault->error_code & PFERR_RSVD_MASK)
> +		vmcs12->vm_exit_reason = EXIT_REASON_EPT_MISCONFIG;
> +	else
> +		vmcs12->vm_exit_reason = EXIT_REASON_EPT_VIOLATION;
> +
> +	vmcs12->exit_qualification = vcpu->arch.exit_qualification;
>  	vmcs12->guest_physical_address = fault->address;
>  }
>  


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 07/13] nEPT: Fix wrong test in kvm_set_cr3
  2013-05-19  4:52 ` [PATCH v3 07/13] nEPT: Fix wrong test in kvm_set_cr3 Jun Nakajima
@ 2013-05-20 13:17   ` Paolo Bonzini
  0 siblings, 0 replies; 52+ messages in thread
From: Paolo Bonzini @ 2013-05-20 13:17 UTC (permalink / raw)
  To: Jun Nakajima; +Cc: kvm, Gleb Natapov

On 19/05/2013 06:52, Jun Nakajima wrote:
> From: Nadav Har'El <nyh@il.ibm.com>
> 
> kvm_set_cr3() attempts to check if the new cr3 is a valid guest physical
> address. The problem is that with nested EPT, cr3 is an *L2* physical
> address, not an L1 physical address as this test expects.
> 
> As the comment above this test explains, it isn't necessary, and doesn't
> correspond to anything a real processor would do. So this patch removes it.
> 
> Note that this wrong test could have also theoretically caused problems
> in nested NPT, not just in nested EPT. However, in practice, the problem
> was avoided: nested_svm_vmexit()/vmrun() do not call kvm_set_cr3 in the
> nested NPT case, and instead set the vmcb (and arch.cr3) directly, thus
> circumventing the problem. Additional potential calls to the buggy function
> are avoided in that we don't trap cr3 modifications when nested NPT is
> enabled. However, because in nested VMX we did want to use kvm_set_cr3()
> (as requested in Avi Kivity's review of the original nested VMX patches),
> we can't avoid this problem and need to fix it

Makes sense, but did you test what happens (without nesting, and both
with/without EPT) if L1 points CR3 at an invalid physical address?  Does
a basic level of sanity remain?

Paolo

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 06/13] nEPT: Fix cr3 handling in nested exit and entry
  2013-05-19  4:52 ` [PATCH v3 06/13] nEPT: Fix cr3 handling in nested exit and entry Jun Nakajima
@ 2013-05-20 13:19   ` Paolo Bonzini
  2013-06-12 12:42   ` Gleb Natapov
  1 sibling, 0 replies; 52+ messages in thread
From: Paolo Bonzini @ 2013-05-20 13:19 UTC (permalink / raw)
  To: Jun Nakajima; +Cc: kvm, Gleb Natapov

On 19/05/2013 06:52, Jun Nakajima wrote:
> From: Nadav Har'El <nyh@il.ibm.com>
> 
> The existing code for handling cr3 and related VMCS fields during nested
> exit and entry wasn't correct in all cases:
> 
> If L2 is allowed to control cr3 (and this is indeed the case in nested EPT),
> during nested exit we must copy the modified cr3 from vmcs02 to vmcs12, and
> we forgot to do so. This patch adds this copy.
> 
> If L0 isn't controlling cr3 when running L2 (i.e., L0 is using EPT), and
> whoever does control cr3 (L1 or L2) is using PAE, the processor might have
> saved PDPTEs and we should also save them in vmcs12 (and restore later).
> 
> Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
> Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
> Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
> ---
>  arch/x86/kvm/vmx.c | 30 ++++++++++++++++++++++++++++++
>  1 file changed, 30 insertions(+)
> 
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index a88432f..b79efd4 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -7608,6 +7608,17 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
>  	kvm_set_cr3(vcpu, vmcs12->guest_cr3);
>  	kvm_mmu_reset_context(vcpu);
>  
> +	/*
> +	 * Additionally, except when L0 is using shadow page tables, L1 or
> +	 * L2 control guest_cr3 for L2, so they may also have saved PDPTEs
> +	 */
> +	if (enable_ept) {
> +		vmcs_write64(GUEST_PDPTR0, vmcs12->guest_pdptr0);
> +		vmcs_write64(GUEST_PDPTR1, vmcs12->guest_pdptr1);
> +		vmcs_write64(GUEST_PDPTR2, vmcs12->guest_pdptr2);
> +		vmcs_write64(GUEST_PDPTR3, vmcs12->guest_pdptr3);
> +	}
> +
>  	kvm_register_write(vcpu, VCPU_REGS_RSP, vmcs12->guest_rsp);
>  	kvm_register_write(vcpu, VCPU_REGS_RIP, vmcs12->guest_rip);
>  }
> @@ -7930,6 +7941,25 @@ static void prepare_vmcs12(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
>  	vmcs12->guest_pending_dbg_exceptions =
>  		vmcs_readl(GUEST_PENDING_DBG_EXCEPTIONS);
>  
> +	/*
> +	 * In some cases (usually, nested EPT), L2 is allowed to change its
> +	 * own CR3 without exiting. If it has changed it, we must keep it.
> +	 * Of course, if L0 is using shadow page tables, GUEST_CR3 was defined
> +	 * by L0, not L1 or L2, so we mustn't unconditionally copy it to vmcs12.
> +	 */
> +	if (enable_ept)
> +		vmcs12->guest_cr3 = vmcs_read64(GUEST_CR3);
> +	/*
> +	 * Additionally, except when L0 is using shadow page tables, L1 or
> +	 * L2 control guest_cr3 for L2, so save their PDPTEs
> +	 */
> +	if (enable_ept) {
> +		vmcs12->guest_pdptr0 = vmcs_read64(GUEST_PDPTR0);
> +		vmcs12->guest_pdptr1 = vmcs_read64(GUEST_PDPTR1);
> +		vmcs12->guest_pdptr2 = vmcs_read64(GUEST_PDPTR2);
> +		vmcs12->guest_pdptr3 = vmcs_read64(GUEST_PDPTR3);
> +	}
> +
>  	vmcs12->vm_entry_controls =
>  		(vmcs12->vm_entry_controls & ~VM_ENTRY_IA32E_MODE) |
>  		(vmcs_read32(VM_ENTRY_CONTROLS) & VM_ENTRY_IA32E_MODE);
> 

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 08/13] nEPT: Some additional comments
  2013-05-19  4:52 ` [PATCH v3 08/13] nEPT: Some additional comments Jun Nakajima
@ 2013-05-20 13:21   ` Paolo Bonzini
  0 siblings, 0 replies; 52+ messages in thread
From: Paolo Bonzini @ 2013-05-20 13:21 UTC (permalink / raw)
  To: Jun Nakajima; +Cc: kvm, Gleb Natapov

On 19/05/2013 06:52, Jun Nakajima wrote:
> From: Nadav Har'El <nyh@il.ibm.com>
> 
> Some additional comments to preexisting code:
> Explain who (L0 or L1) handles EPT violation and misconfiguration exits.
> Don't mention "shadow on either EPT or shadow" as the only two options.
> 
> Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
> Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
> Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
> ---
>  arch/x86/kvm/vmx.c | 13 +++++++++++++
>  1 file changed, 13 insertions(+)
> 
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index b79efd4..4661a22 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -6540,7 +6540,20 @@ static bool nested_vmx_exit_handled(struct kvm_vcpu *vcpu)
>  		return nested_cpu_has2(vmcs12,
>  			SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES);
>  	case EXIT_REASON_EPT_VIOLATION:
> +		/*
> +		 * L0 always deals with the EPT violation. If nested EPT is
> +		 * used, and the nested mmu code discovers that the address is
> +		 * missing in the guest EPT table (EPT12), the EPT violation
> +		 * will be injected with nested_ept_inject_page_fault()
> +		 */
> +		return 0;
>  	case EXIT_REASON_EPT_MISCONFIG:
> +		/*
> +		 * L2 never uses directly L1's EPT, but rather L0's own EPT
> +		 * table (shadow on EPT) or a merged EPT table that L0 built
> +		 * (EPT on EPT). So any problems with the structure of the
> +		 * table is L0's fault.
> +		 */
>  		return 0;
>  	case EXIT_REASON_PREEMPTION_TIMER:
>  		return vmcs12->pin_based_vm_exec_control &
> 

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 03/13] nEPT: Add EPT tables support to paging_tmpl.h
  2013-05-19  4:52 ` [PATCH v3 03/13] nEPT: Add EPT tables support " Jun Nakajima
@ 2013-05-21  7:52   ` Xiao Guangrong
  2013-05-21  8:30     ` Xiao Guangrong
  2013-06-11 11:32     ` Gleb Natapov
  0 siblings, 2 replies; 52+ messages in thread
From: Xiao Guangrong @ 2013-05-21  7:52 UTC (permalink / raw)
  To: Jun Nakajima; +Cc: kvm, Gleb Natapov, Paolo Bonzini

On 05/19/2013 12:52 PM, Jun Nakajima wrote:
> From: Nadav Har'El <nyh@il.ibm.com>
> 
> This is the first patch in a series which adds nested EPT support to KVM's
> nested VMX. Nested EPT means emulating EPT for an L1 guest so that L1 can use
> EPT when running a nested guest L2. When L1 uses EPT, it allows the L2 guest
> to set its own cr3 and take its own page faults without either of L0 or L1
> getting involved. This often significantly improves L2's performance over the
> previous two alternatives (shadow page tables over EPT, and shadow page
> tables over shadow page tables).
> 
> This patch adds EPT support to paging_tmpl.h.
> 
> paging_tmpl.h contains the code for reading and writing page tables. The code
> for 32-bit and 64-bit tables is very similar, but not identical, so
> paging_tmpl.h is #include'd twice in mmu.c, once with PTTYPE=32 and once
> with PTTYPE=64, and this generates the two sets of similar functions.
> 
> There are subtle but important differences between the format of EPT tables
> and that of ordinary x86 64-bit page tables, so for nested EPT we need a
> third set of functions to read the guest EPT table and to write the shadow
> EPT table.
> 
> So this patch adds third PTTYPE, PTTYPE_EPT, which creates functions (prefixed
> with "EPT") which correctly read and write EPT tables.
> 
> Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
> Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
> Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
> ---
>  arch/x86/kvm/mmu.c         |  5 +++++
>  arch/x86/kvm/paging_tmpl.h | 43 +++++++++++++++++++++++++++++++++++++++++--
>  2 files changed, 46 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 117233f..6c1670f 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -3397,6 +3397,11 @@ static inline bool is_last_gpte(struct kvm_mmu *mmu, unsigned level, unsigned gp
>  	return mmu->last_pte_bitmap & (1 << index);
>  }
> 
> +#define PTTYPE_EPT 18 /* arbitrary */
> +#define PTTYPE PTTYPE_EPT
> +#include "paging_tmpl.h"
> +#undef PTTYPE
> +
>  #define PTTYPE 64
>  #include "paging_tmpl.h"
>  #undef PTTYPE
> diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
> index df34d4a..4c45654 100644
> --- a/arch/x86/kvm/paging_tmpl.h
> +++ b/arch/x86/kvm/paging_tmpl.h
> @@ -50,6 +50,22 @@
>  	#define PT_LEVEL_BITS PT32_LEVEL_BITS
>  	#define PT_MAX_FULL_LEVELS 2
>  	#define CMPXCHG cmpxchg
> +#elif PTTYPE == PTTYPE_EPT
> +	#define pt_element_t u64
> +	#define guest_walker guest_walkerEPT
> +	#define FNAME(name) EPT_##name
> +	#define PT_BASE_ADDR_MASK PT64_BASE_ADDR_MASK
> +	#define PT_LVL_ADDR_MASK(lvl) PT64_LVL_ADDR_MASK(lvl)
> +	#define PT_LVL_OFFSET_MASK(lvl) PT64_LVL_OFFSET_MASK(lvl)
> +	#define PT_INDEX(addr, level) PT64_INDEX(addr, level)
> +	#define PT_LEVEL_BITS PT64_LEVEL_BITS
> +	#ifdef CONFIG_X86_64
> +	#define PT_MAX_FULL_LEVELS 4
> +	#define CMPXCHG cmpxchg
> +	#else
> +	#define CMPXCHG cmpxchg64

CMPXCHG is only used in FNAME(cmpxchg_gpte), but you comment it out later.
Do we really need it?

> +	#define PT_MAX_FULL_LEVELS 2

And the SDM says:

"It uses a page-walk length of 4, meaning that at most 4 EPT paging-structure
entries are accessed to translate a guest-physical address." Is my SDM obsolete?
Which kind of processor supports page-walk length = 2?

It seems your patch is not able to handle the case where the guest uses walk-length = 2
while running on a host with walk-length = 4.
(please refer to how sp->role.quadrant is handled in FNAME(get_level1_sp_gpa) in
the current code.)

> +	#endif
>  #else
>  	#error Invalid PTTYPE value
>  #endif
> @@ -80,6 +96,10 @@ static gfn_t gpte_to_gfn_lvl(pt_element_t gpte, int lvl)
>  	return (gpte & PT_LVL_ADDR_MASK(lvl)) >> PAGE_SHIFT;
>  }
> 
> +#if PTTYPE != PTTYPE_EPT
> +/*
> + *  Comment out this for EPT because update_accessed_dirty_bits() is not used.
> + */
>  static int FNAME(cmpxchg_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
>  			       pt_element_t __user *ptep_user, unsigned index,
>  			       pt_element_t orig_pte, pt_element_t new_pte)
> @@ -102,6 +122,7 @@ static int FNAME(cmpxchg_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
> 
>  	return (ret != orig_pte);
>  }
> +#endif
> 
>  static bool FNAME(prefetch_invalid_gpte)(struct kvm_vcpu *vcpu,
>  				  struct kvm_mmu_page *sp, u64 *spte,
> @@ -126,13 +147,21 @@ no_present:
>  static inline unsigned FNAME(gpte_access)(struct kvm_vcpu *vcpu, u64 gpte)
>  {
>  	unsigned access;
> -
> +#if PTTYPE == PTTYPE_EPT
> +	access = (gpte & (VMX_EPT_READABLE_MASK | VMX_EPT_WRITABLE_MASK |
> +			  VMX_EPT_EXECUTABLE_MASK));

It seems wrong. The ACC_XXX definition:

#define ACC_EXEC_MASK    1
#define ACC_WRITE_MASK   PT_WRITABLE_MASK
#define ACC_USER_MASK    PT_USER_MASK
#define ACC_ALL          (ACC_EXEC_MASK | ACC_WRITE_MASK | ACC_USER_MASK)

The bits are different from the bits used in the EPT page table; for example,
your code always sees that execution is not allowed.
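
For illustration, one way the EPT permission bits could be translated into
ACC_* flags -- a sketch only, and whether the readable bit should map to
ACC_USER_MASK is an open question:

	access = 0;
	if (gpte & VMX_EPT_WRITABLE_MASK)
		access |= ACC_WRITE_MASK;
	if (gpte & VMX_EPT_EXECUTABLE_MASK)
		access |= ACC_EXEC_MASK;
	if (gpte & VMX_EPT_READABLE_MASK)
		access |= ACC_USER_MASK;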

> +#else
>  	access = (gpte & (PT_WRITABLE_MASK | PT_USER_MASK)) | ACC_EXEC_MASK;
>  	access &= ~(gpte >> PT64_NX_SHIFT);
> +#endif
> 
>  	return access;
>  }
> 
> +#if PTTYPE != PTTYPE_EPT
> +/*
> + * EPT A/D bit support is not implemented.
> + */
>  static int FNAME(update_accessed_dirty_bits)(struct kvm_vcpu *vcpu,
>  					     struct kvm_mmu *mmu,
>  					     struct guest_walker *walker,
> @@ -169,6 +198,7 @@ static int FNAME(update_accessed_dirty_bits)(struct kvm_vcpu *vcpu,
>  	}
>  	return 0;
>  }
> +#endif
> 
>  /*
>   * Fetch a guest pte for a guest virtual address
> @@ -177,7 +207,6 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
>  				    struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
>  				    gva_t addr, u32 access)
>  {
> -	int ret;
>  	pt_element_t pte;
>  	pt_element_t __user *uninitialized_var(ptep_user);
>  	gfn_t table_gfn;
> @@ -192,7 +221,9 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
>  	gfn_t gfn;
> 
>  	trace_kvm_mmu_pagetable_walk(addr, access);
> +#if PTTYPE != PTTYPE_EPT
>  retry_walk:
> +#endif
>  	walker->level = mmu->root_level;
>  	pte           = mmu->get_cr3(vcpu);
> 
> @@ -277,6 +308,7 @@ retry_walk:
> 
>  	walker->gfn = real_gpa >> PAGE_SHIFT;
> 
> +#if PTTYPE != PTTYPE_EPT
>  	if (!write_fault)
>  		protect_clean_gpte(&pte_access, pte);
>  	else
> @@ -287,12 +319,15 @@ retry_walk:
>  		accessed_dirty &= pte >> (PT_DIRTY_SHIFT - PT_ACCESSED_SHIFT);
> 
>  	if (unlikely(!accessed_dirty)) {
> +		int ret;
> +
>  		ret = FNAME(update_accessed_dirty_bits)(vcpu, mmu, walker, write_fault);
>  		if (unlikely(ret < 0))
>  			goto error;
>  		else if (ret)
>  			goto retry_walk;
>  	}
> +#endif

There is a lot of code in paging_tmpl.h that depends on PT_ACCESSED_MASK/PT_DIRTY_MASK.
I do not see the other parts adjusted in your patch.

How about redefining PT_ACCESSED_MASK / PT_DIRTY_MASK, something like:

#if PTTYPE == 32
PT_ACCESS = PT_ACCESSED_MASK;
......
#elif PTTYPE == 64
PT_ACCESS = PT_ACCESSED_MASK;
......
#elif PTTYPE == PTTYPE_EPT
PT_ACCESS = 0
#else
.......

I guess the compiler can drop the unnecessary branch when PT_ACCESS == 0.
Also, it can help us remove the untidy "#if PTTYPE != PTTYPE_EPT".
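
For what it's worth, a minimal sketch of that idea using per-PTTYPE #defines
(names are purely illustrative):

	#if PTTYPE == 64 || PTTYPE == 32
		#define PT_GUEST_ACCESSED_MASK PT_ACCESSED_MASK
		#define PT_GUEST_DIRTY_MASK    PT_DIRTY_MASK
	#elif PTTYPE == PTTYPE_EPT
		#define PT_GUEST_ACCESSED_MASK 0
		#define PT_GUEST_DIRTY_MASK    0
	#endif

The accessed/dirty paths would then test these masks instead of the raw PT_*
ones, so the EPT case can be eliminated at compile time and most of the
"#if PTTYPE != PTTYPE_EPT" guards can go away.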

> 
>  	walker->pt_access = pt_access;
>  	walker->pte_access = pte_access;
> @@ -323,6 +358,7 @@ static int FNAME(walk_addr)(struct guest_walker *walker,
>  					access);
>  }
> 
> +#if PTTYPE != PTTYPE_EPT
>  static int FNAME(walk_addr_nested)(struct guest_walker *walker,
>  				   struct kvm_vcpu *vcpu, gva_t addr,
>  				   u32 access)
> @@ -330,6 +366,7 @@ static int FNAME(walk_addr_nested)(struct guest_walker *walker,
>  	return FNAME(walk_addr_generic)(walker, vcpu, &vcpu->arch.nested_mmu,
>  					addr, access);
>  }
> +#endif
> 
>  static bool
>  FNAME(prefetch_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
> @@ -754,6 +791,7 @@ static gpa_t FNAME(gva_to_gpa)(struct kvm_vcpu *vcpu, gva_t vaddr, u32 access,
>  	return gpa;
>  }
> 
> +#if PTTYPE != PTTYPE_EPT
>  static gpa_t FNAME(gva_to_gpa_nested)(struct kvm_vcpu *vcpu, gva_t vaddr,
>  				      u32 access,
>  				      struct x86_exception *exception)
> @@ -772,6 +810,7 @@ static gpa_t FNAME(gva_to_gpa_nested)(struct kvm_vcpu *vcpu, gva_t vaddr,
> 
>  	return gpa;
>  }
> +#endif

Strange!

Why does nested EPT not need these functions? How do we emulate an instruction that faulted in L2?




^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 04/13] nEPT: Define EPT-specific link_shadow_page()
  2013-05-19  4:52 ` [PATCH v3 04/13] nEPT: Define EPT-specific link_shadow_page() Jun Nakajima
  2013-05-20 12:43   ` Paolo Bonzini
@ 2013-05-21  8:15   ` Xiao Guangrong
  2013-05-21 21:44     ` Nakajima, Jun
  1 sibling, 1 reply; 52+ messages in thread
From: Xiao Guangrong @ 2013-05-21  8:15 UTC (permalink / raw)
  To: Jun Nakajima; +Cc: kvm, Gleb Natapov, Paolo Bonzini

On 05/19/2013 12:52 PM, Jun Nakajima wrote:
> From: Nadav Har'El <nyh@il.ibm.com>
> 
> Since link_shadow_page() is used by a routine in mmu.c, add an
> EPT-specific link_shadow_page() in paging_tmp.h, rather than moving
> it.
> 
> Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
> Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
> Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
> ---
>  arch/x86/kvm/paging_tmpl.h | 20 ++++++++++++++++++++
>  1 file changed, 20 insertions(+)
> 
> diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
> index 4c45654..dc495f9 100644
> --- a/arch/x86/kvm/paging_tmpl.h
> +++ b/arch/x86/kvm/paging_tmpl.h
> @@ -461,6 +461,18 @@ static void FNAME(pte_prefetch)(struct kvm_vcpu *vcpu, struct guest_walker *gw,
>  	}
>  }
> 
> +#if PTTYPE == PTTYPE_EPT
> +static void FNAME(link_shadow_page)(u64 *sptep, struct kvm_mmu_page *sp)
> +{
> +	u64 spte;
> +
> +	spte = __pa(sp->spt) | VMX_EPT_READABLE_MASK | VMX_EPT_WRITABLE_MASK |
> +		VMX_EPT_EXECUTABLE_MASK;
> +
> +	mmu_spte_set(sptep, spte);
> +}
> +#endif

The only difference between this function and the current link_shadow_page()
is shadow_accessed_mask. Can we add a parameter to eliminate this difference,
something like:

static void link_shadow_page(u64 *sptep, struct kvm_mmu_page *sp, bool accessed)
{
	u64 spte;

	spte = __pa(sp->spt) | PT_PRESENT_MASK | PT_WRITABLE_MASK |
	       shadow_user_mask | shadow_x_mask;
	
	if (accessed)
		spte |= shadow_accessed_mask;

	mmu_spte_set(sptep, spte);
}

?
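
(If such a parameter is added, the call sites could pass it directly, e.g.
something like link_shadow_page(it.sptep, sp, PTTYPE != PTTYPE_EPT) from
paging_tmpl.h -- just a sketch of how the flag might be used.)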


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 03/13] nEPT: Add EPT tables support to paging_tmpl.h
  2013-05-21  7:52   ` Xiao Guangrong
@ 2013-05-21  8:30     ` Xiao Guangrong
  2013-05-21  9:01       ` Gleb Natapov
  2013-06-11 11:32     ` Gleb Natapov
  1 sibling, 1 reply; 52+ messages in thread
From: Xiao Guangrong @ 2013-05-21  8:30 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Jun Nakajima, kvm, Gleb Natapov, Paolo Bonzini

On 05/21/2013 03:52 PM, Xiao Guangrong wrote:
> On 05/19/2013 12:52 PM, Jun Nakajima wrote:
>> From: Nadav Har'El <nyh@il.ibm.com>
>>
>> This is the first patch in a series which adds nested EPT support to KVM's
>> nested VMX. Nested EPT means emulating EPT for an L1 guest so that L1 can use
>> EPT when running a nested guest L2. When L1 uses EPT, it allows the L2 guest
>> to set its own cr3 and take its own page faults without either of L0 or L1
>> getting involved. This often significantly improves L2's performance over the
>> previous two alternatives (shadow page tables over EPT, and shadow page
>> tables over shadow page tables).
>>
>> This patch adds EPT support to paging_tmpl.h.
>>
>> paging_tmpl.h contains the code for reading and writing page tables. The code
>> for 32-bit and 64-bit tables is very similar, but not identical, so
>> paging_tmpl.h is #include'd twice in mmu.c, once with PTTYPE=32 and once
>> with PTTYPE=64, and this generates the two sets of similar functions.
>>
>> There are subtle but important differences between the format of EPT tables
>> and that of ordinary x86 64-bit page tables, so for nested EPT we need a
>> third set of functions to read the guest EPT table and to write the shadow
>> EPT table.
>>
>> So this patch adds third PTTYPE, PTTYPE_EPT, which creates functions (prefixed
>> with "EPT") which correctly read and write EPT tables.
>>
>> Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
>> Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
>> Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
>> ---
>>  arch/x86/kvm/mmu.c         |  5 +++++
>>  arch/x86/kvm/paging_tmpl.h | 43 +++++++++++++++++++++++++++++++++++++++++--
>>  2 files changed, 46 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
>> index 117233f..6c1670f 100644
>> --- a/arch/x86/kvm/mmu.c
>> +++ b/arch/x86/kvm/mmu.c
>> @@ -3397,6 +3397,11 @@ static inline bool is_last_gpte(struct kvm_mmu *mmu, unsigned level, unsigned gp
>>  	return mmu->last_pte_bitmap & (1 << index);
>>  }
>>
>> +#define PTTYPE_EPT 18 /* arbitrary */
>> +#define PTTYPE PTTYPE_EPT
>> +#include "paging_tmpl.h"
>> +#undef PTTYPE
>> +
>>  #define PTTYPE 64
>>  #include "paging_tmpl.h"
>>  #undef PTTYPE
>> diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
>> index df34d4a..4c45654 100644
>> --- a/arch/x86/kvm/paging_tmpl.h
>> +++ b/arch/x86/kvm/paging_tmpl.h
>> @@ -50,6 +50,22 @@
>>  	#define PT_LEVEL_BITS PT32_LEVEL_BITS
>>  	#define PT_MAX_FULL_LEVELS 2
>>  	#define CMPXCHG cmpxchg
>> +#elif PTTYPE == PTTYPE_EPT
>> +	#define pt_element_t u64
>> +	#define guest_walker guest_walkerEPT
>> +	#define FNAME(name) EPT_##name
>> +	#define PT_BASE_ADDR_MASK PT64_BASE_ADDR_MASK
>> +	#define PT_LVL_ADDR_MASK(lvl) PT64_LVL_ADDR_MASK(lvl)
>> +	#define PT_LVL_OFFSET_MASK(lvl) PT64_LVL_OFFSET_MASK(lvl)
>> +	#define PT_INDEX(addr, level) PT64_INDEX(addr, level)
>> +	#define PT_LEVEL_BITS PT64_LEVEL_BITS
>> +	#ifdef CONFIG_X86_64
>> +	#define PT_MAX_FULL_LEVELS 4
>> +	#define CMPXCHG cmpxchg
>> +	#else
>> +	#define CMPXCHG cmpxchg64
> 
> CMPXCHG is only used in FNAME(cmpxchg_gpte), but you comment it out later.
> Do we really need it?
> 
>> +	#define PT_MAX_FULL_LEVELS 2
> 
> And the SDM says:
> 
> "It uses a page-walk length of 4, meaning that at most 4 EPT paging-structure
> entries are accessed to translate a guest-physical address." Is my SDM obsolete?
> Which kind of processor supports page-walk length = 2?
> 
> It seems your patch is not able to handle the case where the guest uses walk-length = 2
> while running on a host with walk-length = 4.
> (please refer to how sp->role.quadrant is handled in FNAME(get_level1_sp_gpa) in
> the current code.)
> 
>> +	#endif
>>  #else
>>  	#error Invalid PTTYPE value
>>  #endif
>> @@ -80,6 +96,10 @@ static gfn_t gpte_to_gfn_lvl(pt_element_t gpte, int lvl)
>>  	return (gpte & PT_LVL_ADDR_MASK(lvl)) >> PAGE_SHIFT;
>>  }
>>
>> +#if PTTYPE != PTTYPE_EPT
>> +/*
>> + *  Comment out this for EPT because update_accessed_dirty_bits() is not used.
>> + */
>>  static int FNAME(cmpxchg_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
>>  			       pt_element_t __user *ptep_user, unsigned index,
>>  			       pt_element_t orig_pte, pt_element_t new_pte)
>> @@ -102,6 +122,7 @@ static int FNAME(cmpxchg_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
>>
>>  	return (ret != orig_pte);
>>  }
>> +#endif
>>
>>  static bool FNAME(prefetch_invalid_gpte)(struct kvm_vcpu *vcpu,
>>  				  struct kvm_mmu_page *sp, u64 *spte,
>> @@ -126,13 +147,21 @@ no_present:
>>  static inline unsigned FNAME(gpte_access)(struct kvm_vcpu *vcpu, u64 gpte)
>>  {
>>  	unsigned access;
>> -
>> +#if PTTYPE == PTTYPE_EPT
>> +	access = (gpte & (VMX_EPT_READABLE_MASK | VMX_EPT_WRITABLE_MASK |
>> +			  VMX_EPT_EXECUTABLE_MASK));
> 
> It seems wrong. The ACC_XXX definition:
> 
> #define ACC_EXEC_MASK    1
> #define ACC_WRITE_MASK   PT_WRITABLE_MASK
> #define ACC_USER_MASK    PT_USER_MASK
> #define ACC_ALL          (ACC_EXEC_MASK | ACC_WRITE_MASK | ACC_USER_MASK)
> 
> The bits are different from the bits used in the EPT page table; for example,
> your code always sees that execution is not allowed.
> 
>> +#else
>>  	access = (gpte & (PT_WRITABLE_MASK | PT_USER_MASK)) | ACC_EXEC_MASK;
>>  	access &= ~(gpte >> PT64_NX_SHIFT);
>> +#endif
>>
>>  	return access;
>>  }
>>
>> +#if PTTYPE != PTTYPE_EPT
>> +/*
>> + * EPT A/D bit support is not implemented.
>> + */
>>  static int FNAME(update_accessed_dirty_bits)(struct kvm_vcpu *vcpu,
>>  					     struct kvm_mmu *mmu,
>>  					     struct guest_walker *walker,
>> @@ -169,6 +198,7 @@ static int FNAME(update_accessed_dirty_bits)(struct kvm_vcpu *vcpu,
>>  	}
>>  	return 0;
>>  }
>> +#endif
>>
>>  /*
>>   * Fetch a guest pte for a guest virtual address
>> @@ -177,7 +207,6 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
>>  				    struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
>>  				    gva_t addr, u32 access)
>>  {
>> -	int ret;
>>  	pt_element_t pte;
>>  	pt_element_t __user *uninitialized_var(ptep_user);
>>  	gfn_t table_gfn;
>> @@ -192,7 +221,9 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
>>  	gfn_t gfn;
>>
>>  	trace_kvm_mmu_pagetable_walk(addr, access);
>> +#if PTTYPE != PTTYPE_EPT
>>  retry_walk:
>> +#endif
>>  	walker->level = mmu->root_level;
>>  	pte           = mmu->get_cr3(vcpu);
>>
>> @@ -277,6 +308,7 @@ retry_walk:
>>
>>  	walker->gfn = real_gpa >> PAGE_SHIFT;
>>
>> +#if PTTYPE != PTTYPE_EPT
>>  	if (!write_fault)
>>  		protect_clean_gpte(&pte_access, pte);
>>  	else
>> @@ -287,12 +319,15 @@ retry_walk:
>>  		accessed_dirty &= pte >> (PT_DIRTY_SHIFT - PT_ACCESSED_SHIFT);
>>
>>  	if (unlikely(!accessed_dirty)) {
>> +		int ret;
>> +
>>  		ret = FNAME(update_accessed_dirty_bits)(vcpu, mmu, walker, write_fault);
>>  		if (unlikely(ret < 0))
>>  			goto error;
>>  		else if (ret)
>>  			goto retry_walk;
>>  	}
>> +#endif
> 
> There is a lot of code in paging_tmpl.h that depends on PT_ACCESSED_MASK/PT_DIRTY_MASK.
> I do not see the other parts adjusted in your patch.
> 
> How about redefining PT_ACCESSED_MASK / PT_DIRTY_MASK, something like:
> 
> #if PTTYPE == 32
> PT_ACCESS = PT_ACCESSED_MASK;
> ......
> #elif PTTYPE == 64
> PT_ACCESS = PT_ACCESSED_MASK;
> ......
> #elif PTTYPE == PTTYPE_EPT
> PT_ACCESS = 0
> #else
> .......
> 
> I guess the compiler can drop the unnecessary branch when PT_ACCESS == 0.
> Also, it can help us remove the untidy "#if PTTYPE != PTTYPE_EPT".
> 
>>
>>  	walker->pt_access = pt_access;
>>  	walker->pte_access = pte_access;
>> @@ -323,6 +358,7 @@ static int FNAME(walk_addr)(struct guest_walker *walker,
>>  					access);
>>  }
>>
>> +#if PTTYPE != PTTYPE_EPT
>>  static int FNAME(walk_addr_nested)(struct guest_walker *walker,
>>  				   struct kvm_vcpu *vcpu, gva_t addr,
>>  				   u32 access)
>> @@ -330,6 +366,7 @@ static int FNAME(walk_addr_nested)(struct guest_walker *walker,
>>  	return FNAME(walk_addr_generic)(walker, vcpu, &vcpu->arch.nested_mmu,
>>  					addr, access);
>>  }
>> +#endif
>>
>>  static bool
>>  FNAME(prefetch_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
>> @@ -754,6 +791,7 @@ static gpa_t FNAME(gva_to_gpa)(struct kvm_vcpu *vcpu, gva_t vaddr, u32 access,
>>  	return gpa;
>>  }
>>
>> +#if PTTYPE != PTTYPE_EPT
>>  static gpa_t FNAME(gva_to_gpa_nested)(struct kvm_vcpu *vcpu, gva_t vaddr,
>>  				      u32 access,
>>  				      struct x86_exception *exception)
>> @@ -772,6 +810,7 @@ static gpa_t FNAME(gva_to_gpa_nested)(struct kvm_vcpu *vcpu, gva_t vaddr,
>>
>>  	return gpa;
>>  }
>> +#endif
> 
> Strange!
> 
> Why does nested EPT not need these functions? How do we emulate an instruction that faulted in L2?

Sorry, I misunderstood it. I have found the reason.




^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 05/13] nEPT: MMU context for nested EPT
  2013-05-19  4:52 ` [PATCH v3 05/13] nEPT: MMU context for nested EPT Jun Nakajima
@ 2013-05-21  8:50   ` Xiao Guangrong
  2013-05-21 22:30     ` Nakajima, Jun
  0 siblings, 1 reply; 52+ messages in thread
From: Xiao Guangrong @ 2013-05-21  8:50 UTC (permalink / raw)
  To: Jun Nakajima; +Cc: kvm, Gleb Natapov, Paolo Bonzini

On 05/19/2013 12:52 PM, Jun Nakajima wrote:
> From: Nadav Har'El <nyh@il.ibm.com>
> 
> KVM's existing shadow MMU code already supports nested TDP. To use it, we
> need to set up a new "MMU context" for nested EPT, and create a few callbacks
> for it (nested_ept_*()). This context should also use the EPT versions of
> the page table access functions (defined in the previous patch).
> Then, we need to switch back and forth between this nested context and the
> regular MMU context when switching between L1 and L2 (when L1 runs this L2
> with EPT).
> 
> Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
> Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
> Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
> ---
>  arch/x86/kvm/mmu.c | 38 ++++++++++++++++++++++++++++++++++++++
>  arch/x86/kvm/mmu.h |  1 +
>  arch/x86/kvm/vmx.c | 54 +++++++++++++++++++++++++++++++++++++++++++++++++++++-
>  3 files changed, 92 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 6c1670f..37f8d7f 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -3653,6 +3653,44 @@ int kvm_init_shadow_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *context)
>  }
>  EXPORT_SYMBOL_GPL(kvm_init_shadow_mmu);
> 
> +int kvm_init_shadow_EPT_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *context)
> +{
> +	ASSERT(vcpu);
> +	ASSERT(!VALID_PAGE(vcpu->arch.mmu.root_hpa));
> +
> +	context->shadow_root_level = kvm_x86_ops->get_tdp_level();

That means the L1 guest always uses page-walk length == 4? But in your previous patch,
it can be 2.

> +
> +	context->nx = is_nx(vcpu); /* TODO: ? */

Hmm? EPT always supports NX.

> +	context->new_cr3 = paging_new_cr3;
> +	context->page_fault = EPT_page_fault;
> +	context->gva_to_gpa = EPT_gva_to_gpa;
> +	context->sync_page = EPT_sync_page;
> +	context->invlpg = EPT_invlpg;
> +	context->update_pte = EPT_update_pte;
> +	context->free = paging_free;
> +	context->root_level = context->shadow_root_level;
> +	context->root_hpa = INVALID_PAGE;
> +	context->direct_map = false;
> +
> +	/* TODO: reset_rsvds_bits_mask() is not built for EPT, we need
> +	   something different.
> +	 */

Exactly. :)

> +	reset_rsvds_bits_mask(vcpu, context);
> +
> +
> +	/* TODO: I copied these from kvm_init_shadow_mmu, I don't know why
> +	   they are done, or why they write to vcpu->arch.mmu and not context
> +	 */
> +	vcpu->arch.mmu.base_role.cr4_pae = !!is_pae(vcpu);
> +	vcpu->arch.mmu.base_role.cr0_wp  = is_write_protection(vcpu);
> +	vcpu->arch.mmu.base_role.smep_andnot_wp =
> +		kvm_read_cr4_bits(vcpu, X86_CR4_SMEP) &&
> +		!is_write_protection(vcpu);

I guess we need not care about these since the permissions of EPT pages do not depend
on them.

> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(kvm_init_shadow_EPT_mmu);
> +
>  static int init_kvm_softmmu(struct kvm_vcpu *vcpu)
>  {
>  	int r = kvm_init_shadow_mmu(vcpu, vcpu->arch.walk_mmu);
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index 2adcbc2..8fc94dd 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -54,6 +54,7 @@ int kvm_mmu_get_spte_hierarchy(struct kvm_vcpu *vcpu, u64 addr, u64 sptes[4]);
>  void kvm_mmu_set_mmio_spte_mask(u64 mmio_mask);
>  int handle_mmio_page_fault_common(struct kvm_vcpu *vcpu, u64 addr, bool direct);
>  int kvm_init_shadow_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *context);
> +int kvm_init_shadow_EPT_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *context);
> 
>  static inline unsigned int kvm_mmu_available_pages(struct kvm *kvm)
>  {
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index fb9cae5..a88432f 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -1045,6 +1045,11 @@ static inline bool nested_cpu_has_virtual_nmis(struct vmcs12 *vmcs12,
>  	return vmcs12->pin_based_vm_exec_control & PIN_BASED_VIRTUAL_NMIS;
>  }
> 
> +static inline int nested_cpu_has_ept(struct vmcs12 *vmcs12)
> +{
> +	return nested_cpu_has2(vmcs12, SECONDARY_EXEC_ENABLE_EPT);
> +}
> +
>  static inline bool is_exception(u32 intr_info)
>  {
>  	return (intr_info & (INTR_INFO_INTR_TYPE_MASK | INTR_INFO_VALID_MASK))
> @@ -7311,6 +7316,46 @@ static void vmx_set_supported_cpuid(u32 func, struct kvm_cpuid_entry2 *entry)
>  		entry->ecx |= bit(X86_FEATURE_VMX);
>  }
> 
> +/* Callbacks for nested_ept_init_mmu_context: */
> +
> +static unsigned long nested_ept_get_cr3(struct kvm_vcpu *vcpu)
> +{
> +	/* return the page table to be shadowed - in our case, EPT12 */
> +	return get_vmcs12(vcpu)->ept_pointer;
> +}
> +
> +static void nested_ept_inject_page_fault(struct kvm_vcpu *vcpu,
> +	struct x86_exception *fault)
> +{
> +	struct vmcs12 *vmcs12;
> +	nested_vmx_vmexit(vcpu);
> +	vmcs12 = get_vmcs12(vcpu);
> +	/*
> +	 * Note no need to set vmcs12->vm_exit_reason as it is already copied
> +	 * from vmcs02 in nested_vmx_vmexit() above, i.e., EPT_VIOLATION.
> +	 */
> +	vmcs12->exit_qualification = fault->error_code;

Hmm, you directly copy the error code from FNAME(walk_addr_generic),
but its format is different, and I did not see you convert the error code
in the previous patches.


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 03/13] nEPT: Add EPT tables support to paging_tmpl.h
  2013-05-21  8:30     ` Xiao Guangrong
@ 2013-05-21  9:01       ` Gleb Natapov
  2013-05-21 11:05         ` Xiao Guangrong
  0 siblings, 1 reply; 52+ messages in thread
From: Gleb Natapov @ 2013-05-21  9:01 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Jun Nakajima, kvm, Paolo Bonzini

On Tue, May 21, 2013 at 04:30:13PM +0800, Xiao Guangrong wrote:
> >> @@ -772,6 +810,7 @@ static gpa_t FNAME(gva_to_gpa_nested)(struct kvm_vcpu *vcpu, gva_t vaddr,
> >>
> >>  	return gpa;
> >>  }
> >> +#endif
> > 
> > Strange!
> > 
> > Why does nested EPT not need these functions? How do we emulate an instruction that faulted in L2?
> 
> Sorry, I misunderstood it. I have found the reason.
> 
You can write it down here for future reviewers :)

--
			Gleb.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 10/13] nEPT: Nested INVEPT
  2013-05-19  4:52 ` [PATCH v3 10/13] nEPT: Nested INVEPT Jun Nakajima
  2013-05-20 12:46   ` Paolo Bonzini
@ 2013-05-21  9:16   ` Xiao Guangrong
  1 sibling, 0 replies; 52+ messages in thread
From: Xiao Guangrong @ 2013-05-21  9:16 UTC (permalink / raw)
  To: Jun Nakajima; +Cc: kvm, Gleb Natapov, Paolo Bonzini

On 05/19/2013 12:52 PM, Jun Nakajima wrote:
> From: Nadav Har'El <nyh@il.ibm.com>
> 
> If we let L1 use EPT, we should probably also support the INVEPT instruction.
> 
> In our current nested EPT implementation, when L1 changes its EPT table for
> L2 (i.e., EPT12), L0 modifies the shadow EPT table (EPT02), and in the course

Hmm?

L0 cannot always intercept L1's changes due to unsync shadow pages...

> of this modification already calls INVEPT. Therefore, when L1 calls INVEPT,
> we don't really need to do anything. In particular we *don't* need to call

So I cannot understand why we do not need to handle INVEPT.
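
(To make the concern concrete: if stale EPT02 entries really can survive an L1
INVEPT, one conservative option -- purely a sketch, not necessarily what is
required -- would be to drop the shadow roots when emulating the instruction:

	kvm_mmu_unload(vcpu);		/* discard EPT02; it is rebuilt lazily */
	nested_vmx_succeed(vcpu);

Whether anything like this is actually needed is exactly the question above.)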


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 13/13] nEPT: Inject EPT violation/misconfigration
  2013-05-19  4:52 ` [PATCH v3 13/13] nEPT: Inject EPT violation/misconfigration Jun Nakajima
  2013-05-20 13:09   ` Paolo Bonzini
@ 2013-05-21 10:56   ` Xiao Guangrong
  1 sibling, 0 replies; 52+ messages in thread
From: Xiao Guangrong @ 2013-05-21 10:56 UTC (permalink / raw)
  To: Jun Nakajima; +Cc: kvm, Gleb Natapov, Paolo Bonzini

On 05/19/2013 12:52 PM, Jun Nakajima wrote:
> Add code to detect EPT misconfiguration and inject it to the L1 VMM. Also,
> it injects a more correct exit qualification upon EPT violation to the L1
> VMM.  Now L1 can correctly go to the ept_misconfig handler (instead of
> wrongly going to fast_page_fault); it will try to handle the MMIO page
> fault, and if that fails, it is a real EPT misconfiguration.
> 
> Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
> Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
> ---
>  arch/x86/include/asm/kvm_host.h |  4 +++
>  arch/x86/kvm/mmu.c              |  5 ---
>  arch/x86/kvm/mmu.h              |  5 +++
>  arch/x86/kvm/paging_tmpl.h      | 26 ++++++++++++++
>  arch/x86/kvm/vmx.c              | 79 +++++++++++++++++++++++++++++++++++++++--
>  5 files changed, 111 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 3741c65..1d03202 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -262,6 +262,8 @@ struct kvm_mmu {
>  	void (*invlpg)(struct kvm_vcpu *vcpu, gva_t gva);
>  	void (*update_pte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
>  			   u64 *spte, const void *pte);
> +	bool (*check_tdp_pte)(u64 pte, int level);
> +
>  	hpa_t root_hpa;
>  	int root_level;
>  	int shadow_root_level;
> @@ -503,6 +505,8 @@ struct kvm_vcpu_arch {
>  	 * instruction.
>  	 */
>  	bool write_fault_to_shadow_pgtable;
> +
> +	unsigned long exit_qualification; /* set at EPT violation at this point */
>  };
> 
>  struct kvm_lpage_info {
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 93d6abf..3a3b11f 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -233,11 +233,6 @@ static bool set_mmio_spte(u64 *sptep, gfn_t gfn, pfn_t pfn, unsigned access)
>  	return false;
>  }
> 
> -static inline u64 rsvd_bits(int s, int e)
> -{
> -	return ((1ULL << (e - s + 1)) - 1) << s;
> -}
> -
>  void kvm_mmu_set_mask_ptes(u64 user_mask, u64 accessed_mask,
>  		u64 dirty_mask, u64 nx_mask, u64 x_mask)
>  {
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index 8fc94dd..559e2e0 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -88,6 +88,11 @@ static inline bool is_write_protection(struct kvm_vcpu *vcpu)
>  	return kvm_read_cr0_bits(vcpu, X86_CR0_WP);
>  }
> 
> +static inline u64 rsvd_bits(int s, int e)
> +{
> +	return ((1ULL << (e - s + 1)) - 1) << s;
> +}
> +
>  /*
>   * Will a fault with a given page-fault error code (pfec) cause a permission
>   * fault with the given access (in ACC_* format)?
> diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
> index 2432d49..067b1f8 100644
> --- a/arch/x86/kvm/paging_tmpl.h
> +++ b/arch/x86/kvm/paging_tmpl.h
> @@ -126,10 +126,14 @@ static int FNAME(cmpxchg_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
> 
>  static bool FNAME(is_rsvd_bits_set)(struct kvm_mmu *mmu, u64 gpte, int level)
>  {
> +#if PTTYPE == PTTYPE_EPT
> +	return (mmu->check_tdp_pte(gpte, level));
> +#else
>  	int bit7;
> 
>  	bit7 = (gpte >> 7) & 1;
>  	return (gpte & mmu->rsvd_bits_mask[bit7][level-1]) != 0;
> +#endif
>  }

It would be better to set mmu->check_tdp_pte = is_rsvd_bits_set for the
current modes; then this part can be moved to mmu.c.
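
A sketch of that idea, purely illustrative -- the hook would need to take the
mmu as well so the existing reserved-bits check can be reused:

	/* hypothetical signature in struct kvm_mmu */
	bool (*check_tdp_pte)(struct kvm_mmu *mmu, u64 pte, int level);

	/* existing modes keep the old behaviour */
	context->check_tdp_pte = is_rsvd_bits_set;

	/* paging_tmpl.h, common to all PTTYPEs */
	static bool FNAME(is_rsvd_bits_set)(struct kvm_mmu *mmu, u64 gpte, int level)
	{
		return mmu->check_tdp_pte(mmu, gpte, level);
	}

and vmx.c would install its EPT-specific checker for the nested MMU instead.
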

> 
>  static bool FNAME(prefetch_invalid_gpte)(struct kvm_vcpu *vcpu,
> @@ -352,6 +356,28 @@ error:
>  	walker->fault.vector = PF_VECTOR;
>  	walker->fault.error_code_valid = true;
>  	walker->fault.error_code = errcode;
> +
> +#if PTTYPE == PTTYPE_EPT
> +	/*
> +	 * Use PFERR_RSVD_MASK in error_code to tell if an EPT
> +	 * misconfiguration needs to be injected. The detection is
> +	 * done by is_rsvd_bits_set() above.
> +	 *
> +	 * We set up the value of exit_qualification to inject:
> +	 * [2:0] -- Derived from [2:0] of the real exit_qualification at the EPT violation
> +	 * [5:3] -- Calculated from the page walk of the guest EPT page tables
> +	 * [8:7] -- Cleared to 0.
> +	 *
> +	 * The other bits are set to 0.
> +	 */
> +	if (!(errcode & PFERR_RSVD_MASK)) {
> +		unsigned long exit_qualification = vcpu->arch.exit_qualification;
> +
> +		pte_access = pt_access & pte;
> +		vcpu->arch.exit_qualification = ((pte_access & 0x7) << 3) |
> +			(exit_qualification & 0x7);
> +	}
> +#endif

Could these operations be moved to nested_ept_inject_page_fault()?
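
For example, roughly (untested sketch; it assumes the access bits computed by
the guest EPT walk are still made available to vmx.c, e.g. via
fault->error_code or a field in vcpu->arch -- exactly how is an open choice):

	static void nested_ept_inject_page_fault(struct kvm_vcpu *vcpu,
		struct x86_exception *fault)
	{
		struct vmcs12 *vmcs12;

		nested_vmx_vmexit(vcpu);
		vmcs12 = get_vmcs12(vcpu);

		/* [2:0] from the real EPT-violation exit qualification,
		 * [5:3] from the guest EPT walk, everything else cleared.
		 */
		vmcs12->exit_qualification =
			(vcpu->arch.exit_qualification & 0x7) |
			((fault->error_code & 0x7) << 3);
	}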


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 03/13] nEPT: Add EPT tables support to paging_tmpl.h
  2013-05-21  9:01       ` Gleb Natapov
@ 2013-05-21 11:05         ` Xiao Guangrong
  2013-05-21 22:26           ` Nakajima, Jun
  0 siblings, 1 reply; 52+ messages in thread
From: Xiao Guangrong @ 2013-05-21 11:05 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Jun Nakajima, kvm, Paolo Bonzini

On 05/21/2013 05:01 PM, Gleb Natapov wrote:
> On Tue, May 21, 2013 at 04:30:13PM +0800, Xiao Guangrong wrote:
>>>> @@ -772,6 +810,7 @@ static gpa_t FNAME(gva_to_gpa_nested)(struct kvm_vcpu *vcpu, gva_t vaddr,
>>>>
>>>>  	return gpa;
>>>>  }
>>>> +#endif
>>>
>>> Strange!
>>>
>>> Why does nested ept not need these functions? How to emulate the instruction faulted on L2?
>>
>> Sorry, i misunderstood it. Have found the reason out.
>>
> You can write it down here for future reviewers :)

Okay.

The functions used to translate L2's gva to L1's gpa are paging32_gva_to_gpa_nested
and paging64_gva_to_gpa_nested, which are generated with PTTYPE == 32 and PTTYPE == 64.



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 04/13] nEPT: Define EPT-specific link_shadow_page()
  2013-05-21  8:15   ` Xiao Guangrong
@ 2013-05-21 21:44     ` Nakajima, Jun
  0 siblings, 0 replies; 52+ messages in thread
From: Nakajima, Jun @ 2013-05-21 21:44 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: kvm, Gleb Natapov, Paolo Bonzini

Sure. Thanks for the suggestion.
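
Presumably the call sites would then look something like this (rough sketch,
following the 'accessed' parameter suggested below; exact call sites not
checked):

	/* common 32/64-bit shadow paths keep the accessed bit: */
	link_shadow_page(it.sptep, sp, true);

	/* the EPT shadow path does not set it: */
	link_shadow_page(it.sptep, sp, false);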


On Tue, May 21, 2013 at 1:15 AM, Xiao Guangrong
<xiaoguangrong@linux.vnet.ibm.com> wrote:
> On 05/19/2013 12:52 PM, Jun Nakajima wrote:
>> From: Nadav Har'El <nyh@il.ibm.com>
>>
>> Since link_shadow_page() is used by a routine in mmu.c, add an
>> EPT-specific link_shadow_page() in paging_tmpl.h, rather than moving
>> it.
>>
>> Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
>> Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
>> Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
>> ---
>>  arch/x86/kvm/paging_tmpl.h | 20 ++++++++++++++++++++
>>  1 file changed, 20 insertions(+)
>>
>> diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
>> index 4c45654..dc495f9 100644
>> --- a/arch/x86/kvm/paging_tmpl.h
>> +++ b/arch/x86/kvm/paging_tmpl.h
>> @@ -461,6 +461,18 @@ static void FNAME(pte_prefetch)(struct kvm_vcpu *vcpu, struct guest_walker *gw,
>>       }
>>  }
>>
>> +#if PTTYPE == PTTYPE_EPT
>> +static void FNAME(link_shadow_page)(u64 *sptep, struct kvm_mmu_page *sp)
>> +{
>> +     u64 spte;
>> +
>> +     spte = __pa(sp->spt) | VMX_EPT_READABLE_MASK | VMX_EPT_WRITABLE_MASK |
>> +             VMX_EPT_EXECUTABLE_MASK;
>> +
>> +     mmu_spte_set(sptep, spte);
>> +}
>> +#endif
>
> The only difference between this function and the current link_shadow_page()
> is shadow_accessed_mask. Can we add a parameter to eliminate this difference,
> something like:
>
> static void link_shadow_page(u64 *sptep, struct kvm_mmu_page *sp, bool accessed)
> {
>         u64 spte;
>
>         spte = __pa(sp->spt) | PT_PRESENT_MASK | PT_WRITABLE_MASK |
>                shadow_user_mask | shadow_x_mask;
>
>         if (accessed)
>                 spte |= shadow_accessed_mask;
>
>         mmu_spte_set(sptep, spte);
> }
>
> ?
>
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Jun
Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 03/13] nEPT: Add EPT tables support to paging_tmpl.h
  2013-05-21 11:05         ` Xiao Guangrong
@ 2013-05-21 22:26           ` Nakajima, Jun
  2013-05-22  1:10             ` Xiao Guangrong
  2013-05-22  6:16             ` Gleb Natapov
  0 siblings, 2 replies; 52+ messages in thread
From: Nakajima, Jun @ 2013-05-21 22:26 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Gleb Natapov, kvm, Paolo Bonzini

On Tue, May 21, 2013 at 4:05 AM, Xiao Guangrong
<xiaoguangrong@linux.vnet.ibm.com> wrote:
> On 05/21/2013 05:01 PM, Gleb Natapov wrote:
>> On Tue, May 21, 2013 at 04:30:13PM +0800, Xiao Guangrong wrote:
>>>>> @@ -772,6 +810,7 @@ static gpa_t FNAME(gva_to_gpa_nested)(struct kvm_vcpu *vcpu, gva_t vaddr,
>>>>>
>>>>>    return gpa;
>>>>>  }
>>>>> +#endif
>>>>
>>>> Strange!
>>>>
>>>> Why does nested ept not need these functions? How to emulate the instruction faulted on L2?
>>>
>>> Sorry, i misunderstood it. Have found the reason out.
>>>
>> You can write it down here for future reviewers :)
>
> Okay.
>
> The functions used to translate L2's gva to L1's gpa are paging32_gva_to_gpa_nested
> and paging64_gva_to_gpa_nested which are created by PTTYPE == 32 and PTTYPE == 64.
>
>

Back to your comments on PT_MAX_FULL_LEVELS:
> +     #ifdef CONFIG_X86_64
> +     #define PT_MAX_FULL_LEVELS 4
> +     #define CMPXCHG cmpxchg
> +     #else
> +     #define CMPXCHG cmpxchg64
> +    #define PT_MAX_FULL_LEVELS 2
I don't think we need to support nEPT on 32-bit hosts.  So, I plan to
remove such code. What do you think?
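
If we drop it, the PTTYPE_EPT block would presumably lose its #ifdef and
always use the 4-level walk, i.e. something like (sketch; assumes nEPT is
simply not advertised on 32-bit hosts):

	#elif PTTYPE == PTTYPE_EPT
		#define pt_element_t u64
		#define guest_walker guest_walkerEPT
		#define FNAME(name) EPT_##name
		#define PT_BASE_ADDR_MASK PT64_BASE_ADDR_MASK
		#define PT_LVL_ADDR_MASK(lvl) PT64_LVL_ADDR_MASK(lvl)
		#define PT_LVL_OFFSET_MASK(lvl) PT64_LVL_OFFSET_MASK(lvl)
		#define PT_INDEX(addr, level) PT64_INDEX(addr, level)
		#define PT_LEVEL_BITS PT64_LEVEL_BITS
		#define PT_MAX_FULL_LEVELS 4
		#define CMPXCHG cmpxchg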

--
Jun
Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 05/13] nEPT: MMU context for nested EPT
  2013-05-21  8:50   ` Xiao Guangrong
@ 2013-05-21 22:30     ` Nakajima, Jun
  0 siblings, 0 replies; 52+ messages in thread
From: Nakajima, Jun @ 2013-05-21 22:30 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: kvm, Gleb Natapov, Paolo Bonzini

On Tue, May 21, 2013 at 1:50 AM, Xiao Guangrong
<xiaoguangrong@linux.vnet.ibm.com> wrote:
> On 05/19/2013 12:52 PM, Jun Nakajima wrote:
>> From: Nadav Har'El <nyh@il.ibm.com>
>>
>> KVM's existing shadow MMU code already supports nested TDP. To use it, we
>> need to set up a new "MMU context" for nested EPT, and create a few callbacks
>> for it (nested_ept_*()). This context should also use the EPT versions of
>> the page table access functions (defined in the previous patch).
>> Then, we need to switch back and forth between this nested context and the
>> regular MMU context when switching between L1 and L2 (when L1 runs this L2
>> with EPT).
>>
>> Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
>> Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
>> Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
>> ---
>>  arch/x86/kvm/mmu.c | 38 ++++++++++++++++++++++++++++++++++++++
>>  arch/x86/kvm/mmu.h |  1 +
>>  arch/x86/kvm/vmx.c | 54 +++++++++++++++++++++++++++++++++++++++++++++++++++++-
>>  3 files changed, 92 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
>> index 6c1670f..37f8d7f 100644
>> --- a/arch/x86/kvm/mmu.c
>> +++ b/arch/x86/kvm/mmu.c
>> @@ -3653,6 +3653,44 @@ int kvm_init_shadow_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *context)
>>  }
>>  EXPORT_SYMBOL_GPL(kvm_init_shadow_mmu);
>>
>> +int kvm_init_shadow_EPT_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *context)
>> +{
>> +     ASSERT(vcpu);
>> +     ASSERT(!VALID_PAGE(vcpu->arch.mmu.root_hpa));
>> +
>> +     context->shadow_root_level = kvm_x86_ops->get_tdp_level();
>
> That means L1 guest always uses page-walk length == 4? But in your previous patch,
> it can be 2.

We want to support "page-walk length == 4" only.
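
(That is, the nested walk would simply be pinned to 4 levels -- rough sketch,
assuming the 32-bit special case is dropped as discussed in the other
sub-thread:)

	context->shadow_root_level = kvm_x86_ops->get_tdp_level();
	context->root_level = 4;	/* guest EPT page-walk length is always 4 */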

>
>> +
>> +     context->nx = is_nx(vcpu); /* TODO: ? */
>
> Hmm? EPT always supports NX.
>
>> +     context->new_cr3 = paging_new_cr3;
>> +     context->page_fault = EPT_page_fault;
>> +     context->gva_to_gpa = EPT_gva_to_gpa;
>> +     context->sync_page = EPT_sync_page;
>> +     context->invlpg = EPT_invlpg;
>> +     context->update_pte = EPT_update_pte;
>> +     context->free = paging_free;
>> +     context->root_level = context->shadow_root_level;
>> +     context->root_hpa = INVALID_PAGE;
>> +     context->direct_map = false;
>> +
>> +     /* TODO: reset_rsvds_bits_mask() is not built for EPT, we need
>> +        something different.
>> +      */
>
> Exactly. :)
>
>> +     reset_rsvds_bits_mask(vcpu, context);
>> +
>> +
>> +     /* TODO: I copied these from kvm_init_shadow_mmu, I don't know why
>> +        they are done, or why they write to vcpu->arch.mmu and not context
>> +      */
>> +     vcpu->arch.mmu.base_role.cr4_pae = !!is_pae(vcpu);
>> +     vcpu->arch.mmu.base_role.cr0_wp  = is_write_protection(vcpu);
>> +     vcpu->arch.mmu.base_role.smep_andnot_wp =
>> +             kvm_read_cr4_bits(vcpu, X86_CR4_SMEP) &&
>> +             !is_write_protection(vcpu);
>
> I guess we need not care about these since the permissions of EPT pages do
> not depend on them.

Right. I'll clean up this.

>
>> +
>> +     return 0;
>> +}
>> +EXPORT_SYMBOL_GPL(kvm_init_shadow_EPT_mmu);
>> +
>>  static int init_kvm_softmmu(struct kvm_vcpu *vcpu)
>>  {
>>       int r = kvm_init_shadow_mmu(vcpu, vcpu->arch.walk_mmu);
>> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
>> index 2adcbc2..8fc94dd 100644
>> --- a/arch/x86/kvm/mmu.h
>> +++ b/arch/x86/kvm/mmu.h
>> @@ -54,6 +54,7 @@ int kvm_mmu_get_spte_hierarchy(struct kvm_vcpu *vcpu, u64 addr, u64 sptes[4]);
>>  void kvm_mmu_set_mmio_spte_mask(u64 mmio_mask);
>>  int handle_mmio_page_fault_common(struct kvm_vcpu *vcpu, u64 addr, bool direct);
>>  int kvm_init_shadow_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *context);
>> +int kvm_init_shadow_EPT_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *context);
>>
>>  static inline unsigned int kvm_mmu_available_pages(struct kvm *kvm)
>>  {
>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>> index fb9cae5..a88432f 100644
>> --- a/arch/x86/kvm/vmx.c
>> +++ b/arch/x86/kvm/vmx.c
>> @@ -1045,6 +1045,11 @@ static inline bool nested_cpu_has_virtual_nmis(struct vmcs12 *vmcs12,
>>       return vmcs12->pin_based_vm_exec_control & PIN_BASED_VIRTUAL_NMIS;
>>  }
>>
>> +static inline int nested_cpu_has_ept(struct vmcs12 *vmcs12)
>> +{
>> +     return nested_cpu_has2(vmcs12, SECONDARY_EXEC_ENABLE_EPT);
>> +}
>> +
>>  static inline bool is_exception(u32 intr_info)
>>  {
>>       return (intr_info & (INTR_INFO_INTR_TYPE_MASK | INTR_INFO_VALID_MASK))
>> @@ -7311,6 +7316,46 @@ static void vmx_set_supported_cpuid(u32 func, struct kvm_cpuid_entry2 *entry)
>>               entry->ecx |= bit(X86_FEATURE_VMX);
>>  }
>>
>> +/* Callbacks for nested_ept_init_mmu_context: */
>> +
>> +static unsigned long nested_ept_get_cr3(struct kvm_vcpu *vcpu)
>> +{
>> +     /* return the page table to be shadowed - in our case, EPT12 */
>> +     return get_vmcs12(vcpu)->ept_pointer;
>> +}
>> +
>> +static void nested_ept_inject_page_fault(struct kvm_vcpu *vcpu,
>> +     struct x86_exception *fault)
>> +{
>> +     struct vmcs12 *vmcs12;
>> +     nested_vmx_vmexit(vcpu);
>> +     vmcs12 = get_vmcs12(vcpu);
>> +     /*
>> +      * Note no need to set vmcs12->vm_exit_reason as it is already copied
>> +      * from vmcs02 in nested_vmx_vmexit() above, i.e., EPT_VIOLATION.
>> +      */
>> +     vmcs12->exit_qualification = fault->error_code;
>
> Hmm, you directly copy the error code from FNAME(walk_addr_generic),
> but its format is different and I did not see you cook the error code
> in the previous patches.
>

Right. Basically this is the original code from Nadav; patches 12 and 13
fix/cook the error code.
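
(For reference, the cooking in patch 12 boils down to something like this --
sketch, with the bit layout described there:)

	/* exit qualification handed to L1:
	 * [2:0] taken from the real (L0) exit qualification,
	 * [5:3] from the access bits found in the guest EPT walk,
	 * everything else cleared.
	 */
	vcpu->arch.exit_qualification = ((pte_access & 0x7) << 3) |
					(exit_qualification & 0x7);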

--
Jun
Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 03/13] nEPT: Add EPT tables support to paging_tmpl.h
  2013-05-21 22:26           ` Nakajima, Jun
@ 2013-05-22  1:10             ` Xiao Guangrong
  2013-05-22  6:16             ` Gleb Natapov
  1 sibling, 0 replies; 52+ messages in thread
From: Xiao Guangrong @ 2013-05-22  1:10 UTC (permalink / raw)
  To: Nakajima, Jun; +Cc: Gleb Natapov, kvm, Paolo Bonzini

On 05/22/2013 06:26 AM, Nakajima, Jun wrote:
> On Tue, May 21, 2013 at 4:05 AM, Xiao Guangrong
> <xiaoguangrong@linux.vnet.ibm.com> wrote:
>> On 05/21/2013 05:01 PM, Gleb Natapov wrote:
>>> On Tue, May 21, 2013 at 04:30:13PM +0800, Xiao Guangrong wrote:
>>>>>> @@ -772,6 +810,7 @@ static gpa_t FNAME(gva_to_gpa_nested)(struct kvm_vcpu *vcpu, gva_t vaddr,
>>>>>>
>>>>>>    return gpa;
>>>>>>  }
>>>>>> +#endif
>>>>>
>>>>> Strange!
>>>>>
>>>>> Why does nested ept not need these functions? How to emulate the instruction faulted on L2?
>>>>
>>>> Sorry, i misunderstood it. Have found the reason out.
>>>>
>>> You can write it down here for future reviewers :)
>>
>> Okay.
>>
>> The functions used to translate L2's gva to L1's gpa are paging32_gva_to_gpa_nested
>> and paging64_gva_to_gpa_nested which are created by PTTYPE == 32 and PTTYPE == 64.
>>
>>
> 
> Back to your comments on PT_MAX_FULL_LEVELS:
>> +     #ifdef CONFIG_X86_64
>> +     #define PT_MAX_FULL_LEVELS 4
>> +     #define CMPXCHG cmpxchg
>> +     #else
>> +     #define CMPXCHG cmpxchg64
>> +    #define PT_MAX_FULL_LEVELS 2
> I don't think we need to support nEPT on 32-bit hosts.  So, I plan to
> remove such code. What do you think?

Good to me. :)



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 03/13] nEPT: Add EPT tables support to paging_tmpl.h
  2013-05-21 22:26           ` Nakajima, Jun
  2013-05-22  1:10             ` Xiao Guangrong
@ 2013-05-22  6:16             ` Gleb Natapov
  1 sibling, 0 replies; 52+ messages in thread
From: Gleb Natapov @ 2013-05-22  6:16 UTC (permalink / raw)
  To: Nakajima, Jun; +Cc: Xiao Guangrong, kvm, Paolo Bonzini

On Tue, May 21, 2013 at 03:26:18PM -0700, Nakajima, Jun wrote:
> On Tue, May 21, 2013 at 4:05 AM, Xiao Guangrong
> <xiaoguangrong@linux.vnet.ibm.com> wrote:
> > On 05/21/2013 05:01 PM, Gleb Natapov wrote:
> >> On Tue, May 21, 2013 at 04:30:13PM +0800, Xiao Guangrong wrote:
> >>>>> @@ -772,6 +810,7 @@ static gpa_t FNAME(gva_to_gpa_nested)(struct kvm_vcpu *vcpu, gva_t vaddr,
> >>>>>
> >>>>>    return gpa;
> >>>>>  }
> >>>>> +#endif
> >>>>
> >>>> Strange!
> >>>>
> >>>> Why does nested ept not need these functions? How to emulate the instruction faulted on L2?
> >>>
> >>> Sorry, i misunderstood it. Have found the reason out.
> >>>
> >> You can write it down here for future reviewers :)
> >
> > Okay.
> >
> > The functions used to translate L2's gva to L1's gpa are paging32_gva_to_gpa_nested
> > and paging64_gva_to_gpa_nested which are created by PTTYPE == 32 and PTTYPE == 64.
> >
> >
> 
> Back to your comments on PT_MAX_FULL_LEVELS:
> > +     #ifdef CONFIG_X86_64
> > +     #define PT_MAX_FULL_LEVELS 4
> > +     #define CMPXCHG cmpxchg
> > +     #else
> > +     #define CMPXCHG cmpxchg64
> > +    #define PT_MAX_FULL_LEVELS 2
> I don't think we need to support nEPT on 32-bit hosts.  So, I plan to
> remove such code. What do you think?
> 
Why shouldn't we support nEPT on 32-bit hosts?

--
			Gleb.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 03/13] nEPT: Add EPT tables support to paging_tmpl.h
  2013-05-21  7:52   ` Xiao Guangrong
  2013-05-21  8:30     ` Xiao Guangrong
@ 2013-06-11 11:32     ` Gleb Natapov
  2013-06-17 12:11       ` Xiao Guangrong
  1 sibling, 1 reply; 52+ messages in thread
From: Gleb Natapov @ 2013-06-11 11:32 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Jun Nakajima, kvm, Paolo Bonzini

On Tue, May 21, 2013 at 03:52:12PM +0800, Xiao Guangrong wrote:
> On 05/19/2013 12:52 PM, Jun Nakajima wrote:
> > From: Nadav Har'El <nyh@il.ibm.com>
> > 
> > This is the first patch in a series which adds nested EPT support to KVM's
> > nested VMX. Nested EPT means emulating EPT for an L1 guest so that L1 can use
> > EPT when running a nested guest L2. When L1 uses EPT, it allows the L2 guest
> > to set its own cr3 and take its own page faults without either of L0 or L1
> > getting involved. This often significantly improves L2's performance over the
> > previous two alternatives (shadow page tables over EPT, and shadow page
> > tables over shadow page tables).
> > 
> > This patch adds EPT support to paging_tmpl.h.
> > 
> > paging_tmpl.h contains the code for reading and writing page tables. The code
> > for 32-bit and 64-bit tables is very similar, but not identical, so
> > paging_tmpl.h is #include'd twice in mmu.c, once with PTTYPE=32 and once
> > with PTTYPE=64, and this generates the two sets of similar functions.
> > 
> > There are subtle but important differences between the format of EPT tables
> > and that of ordinary x86 64-bit page tables, so for nested EPT we need a
> > third set of functions to read the guest EPT table and to write the shadow
> > EPT table.
> > 
> > So this patch adds third PTTYPE, PTTYPE_EPT, which creates functions (prefixed
> > with "EPT") which correctly read and write EPT tables.
> > 
> > Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
> > Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
> > Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
> > ---
> >  arch/x86/kvm/mmu.c         |  5 +++++
> >  arch/x86/kvm/paging_tmpl.h | 43 +++++++++++++++++++++++++++++++++++++++++--
> >  2 files changed, 46 insertions(+), 2 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> > index 117233f..6c1670f 100644
> > --- a/arch/x86/kvm/mmu.c
> > +++ b/arch/x86/kvm/mmu.c
> > @@ -3397,6 +3397,11 @@ static inline bool is_last_gpte(struct kvm_mmu *mmu, unsigned level, unsigned gp
> >  	return mmu->last_pte_bitmap & (1 << index);
> >  }
> > 
> > +#define PTTYPE_EPT 18 /* arbitrary */
> > +#define PTTYPE PTTYPE_EPT
> > +#include "paging_tmpl.h"
> > +#undef PTTYPE
> > +
> >  #define PTTYPE 64
> >  #include "paging_tmpl.h"
> >  #undef PTTYPE
> > diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
> > index df34d4a..4c45654 100644
> > --- a/arch/x86/kvm/paging_tmpl.h
> > +++ b/arch/x86/kvm/paging_tmpl.h
> > @@ -50,6 +50,22 @@
> >  	#define PT_LEVEL_BITS PT32_LEVEL_BITS
> >  	#define PT_MAX_FULL_LEVELS 2
> >  	#define CMPXCHG cmpxchg
> > +#elif PTTYPE == PTTYPE_EPT
> > +	#define pt_element_t u64
> > +	#define guest_walker guest_walkerEPT
> > +	#define FNAME(name) EPT_##name
> > +	#define PT_BASE_ADDR_MASK PT64_BASE_ADDR_MASK
> > +	#define PT_LVL_ADDR_MASK(lvl) PT64_LVL_ADDR_MASK(lvl)
> > +	#define PT_LVL_OFFSET_MASK(lvl) PT64_LVL_OFFSET_MASK(lvl)
> > +	#define PT_INDEX(addr, level) PT64_INDEX(addr, level)
> > +	#define PT_LEVEL_BITS PT64_LEVEL_BITS
> > +	#ifdef CONFIG_X86_64
> > +	#define PT_MAX_FULL_LEVELS 4
> > +	#define CMPXCHG cmpxchg
> > +	#else
> > +	#define CMPXCHG cmpxchg64
> 
> CMPXCHG is only used in FNAME(cmpxchg_gpte), but you commented it out later.
> Do we really need it?
> 
> > +	#define PT_MAX_FULL_LEVELS 2
> 
> And the SDM says:
> 
> "It uses a page-walk length of 4, meaning that at most 4 EPT paging-structure
> entries are accessed to translate a guest-physical address.", Is my SDM obsolete?
> Which kind of process supports page-walk length = 2?
> 
> It seems your patch is not able to handle the case where the guest uses walk-length = 2
> while running on a host with walk-length = 4.
> (please refer to how sp->role.quadrant is handled in FNAME(get_level1_sp_gpa) in
> the current code.)
> 
But since EPT always has 4 levels on all existing CPUs it is not an issue, and the only case
that we should worry about is guest walk-length == host walk-length == 4, or have I
misunderstood what you mean here?

--
			Gleb.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 06/13] nEPT: Fix cr3 handling in nested exit and entry
  2013-05-19  4:52 ` [PATCH v3 06/13] nEPT: Fix cr3 handling in nested exit and entry Jun Nakajima
  2013-05-20 13:19   ` Paolo Bonzini
@ 2013-06-12 12:42   ` Gleb Natapov
  1 sibling, 0 replies; 52+ messages in thread
From: Gleb Natapov @ 2013-06-12 12:42 UTC (permalink / raw)
  To: Jun Nakajima; +Cc: kvm, Paolo Bonzini

On Sat, May 18, 2013 at 09:52:25PM -0700, Jun Nakajima wrote:
> From: Nadav Har'El <nyh@il.ibm.com>
> 
> The existing code for handling cr3 and related VMCS fields during nested
> exit and entry wasn't correct in all cases:
> 
> If L2 is allowed to control cr3 (and this is indeed the case in nested EPT),
> during nested exit we must copy the modified cr3 from vmcs02 to vmcs12, and
> we forgot to do so. This patch adds this copy.
> 
> If L0 isn't controlling cr3 when running L2 (i.e., L0 is using EPT), and
> whoever does control cr3 (L1 or L2) is using PAE, the processor might have
> saved PDPTEs and we should also save them in vmcs12 (and restore later).
> 
> Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
> Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
> Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
> ---
>  arch/x86/kvm/vmx.c | 30 ++++++++++++++++++++++++++++++
>  1 file changed, 30 insertions(+)
> 
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index a88432f..b79efd4 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -7608,6 +7608,17 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
>  	kvm_set_cr3(vcpu, vmcs12->guest_cr3);
>  	kvm_mmu_reset_context(vcpu);
>  
> +	/*
> +	 * Additionally, except when L0 is using shadow page tables, L1 or
What does this "Additionally" correspond to?

> +	 * L2 control guest_cr3 for L2, so they may also have saved PDPTEs
> +	 */
> +	if (enable_ept) {
> +		vmcs_write64(GUEST_PDPTR0, vmcs12->guest_pdptr0);
> +		vmcs_write64(GUEST_PDPTR1, vmcs12->guest_pdptr1);
> +		vmcs_write64(GUEST_PDPTR2, vmcs12->guest_pdptr2);
> +		vmcs_write64(GUEST_PDPTR3, vmcs12->guest_pdptr3);
> +	}
> +
>  	kvm_register_write(vcpu, VCPU_REGS_RSP, vmcs12->guest_rsp);
>  	kvm_register_write(vcpu, VCPU_REGS_RIP, vmcs12->guest_rip);
>  }
> @@ -7930,6 +7941,25 @@ static void prepare_vmcs12(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
>  	vmcs12->guest_pending_dbg_exceptions =
>  		vmcs_readl(GUEST_PENDING_DBG_EXCEPTIONS);
>  
> +	/*
> +	 * In some cases (usually, nested EPT), L2 is allowed to change its
> +	 * own CR3 without exiting. If it has changed it, we must keep it.
> +	 * Of course, if L0 is using shadow page tables, GUEST_CR3 was defined
> +	 * by L0, not L1 or L2, so we mustn't unconditionally copy it to vmcs12.
> +	 */
> +	if (enable_ept)
No need for a separate if for guest_cr3; put it under the if() below.
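
I.e. something like (sketch):

	if (enable_ept) {
		vmcs12->guest_cr3 = vmcs_read64(GUEST_CR3);
		vmcs12->guest_pdptr0 = vmcs_read64(GUEST_PDPTR0);
		vmcs12->guest_pdptr1 = vmcs_read64(GUEST_PDPTR1);
		vmcs12->guest_pdptr2 = vmcs_read64(GUEST_PDPTR2);
		vmcs12->guest_pdptr3 = vmcs_read64(GUEST_PDPTR3);
	}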

> +		vmcs12->guest_cr3 = vmcs_read64(GUEST_CR3);
> +	/*
> +	 * Additionally, except when L0 is using shadow page tables, L1 or
> +	 * L2 control guest_cr3 for L2, so save their PDPTEs
> +	 */
> +	if (enable_ept) {
> +		vmcs12->guest_pdptr0 = vmcs_read64(GUEST_PDPTR0);
> +		vmcs12->guest_pdptr1 = vmcs_read64(GUEST_PDPTR1);
> +		vmcs12->guest_pdptr2 = vmcs_read64(GUEST_PDPTR2);
> +		vmcs12->guest_pdptr3 = vmcs_read64(GUEST_PDPTR3);
> +	}
> +
>  	vmcs12->vm_entry_controls =
>  		(vmcs12->vm_entry_controls & ~VM_ENTRY_IA32E_MODE) |
>  		(vmcs_read32(VM_ENTRY_CONTROLS) & VM_ENTRY_IA32E_MODE);
> -- 
> 1.8.1.2

--
			Gleb.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 03/13] nEPT: Add EPT tables support to paging_tmpl.h
  2013-06-11 11:32     ` Gleb Natapov
@ 2013-06-17 12:11       ` Xiao Guangrong
  2013-06-18 10:57         ` Gleb Natapov
  0 siblings, 1 reply; 52+ messages in thread
From: Xiao Guangrong @ 2013-06-17 12:11 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Jun Nakajima, kvm, Paolo Bonzini

On 06/11/2013 07:32 PM, Gleb Natapov wrote:
> On Tue, May 21, 2013 at 03:52:12PM +0800, Xiao Guangrong wrote:
>> On 05/19/2013 12:52 PM, Jun Nakajima wrote:
>>> From: Nadav Har'El <nyh@il.ibm.com>
>>>
>>> This is the first patch in a series which adds nested EPT support to KVM's
>>> nested VMX. Nested EPT means emulating EPT for an L1 guest so that L1 can use
>>> EPT when running a nested guest L2. When L1 uses EPT, it allows the L2 guest
>>> to set its own cr3 and take its own page faults without either of L0 or L1
>>> getting involved. This often significanlty improves L2's performance over the
>>> previous two alternatives (shadow page tables over EPT, and shadow page
>>> tables over shadow page tables).
>>>
>>> This patch adds EPT support to paging_tmpl.h.
>>>
>>> paging_tmpl.h contains the code for reading and writing page tables. The code
>>> for 32-bit and 64-bit tables is very similar, but not identical, so
>>> paging_tmpl.h is #include'd twice in mmu.c, once with PTTTYPE=32 and once
>>> with PTTYPE=64, and this generates the two sets of similar functions.
>>>
>>> There are subtle but important differences between the format of EPT tables
>>> and that of ordinary x86 64-bit page tables, so for nested EPT we need a
>>> third set of functions to read the guest EPT table and to write the shadow
>>> EPT table.
>>>
>>> So this patch adds third PTTYPE, PTTYPE_EPT, which creates functions (prefixed
>>> with "EPT") which correctly read and write EPT tables.
>>>
>>> Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
>>> Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
>>> Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
>>> ---
>>>  arch/x86/kvm/mmu.c         |  5 +++++
>>>  arch/x86/kvm/paging_tmpl.h | 43 +++++++++++++++++++++++++++++++++++++++++--
>>>  2 files changed, 46 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
>>> index 117233f..6c1670f 100644
>>> --- a/arch/x86/kvm/mmu.c
>>> +++ b/arch/x86/kvm/mmu.c
>>> @@ -3397,6 +3397,11 @@ static inline bool is_last_gpte(struct kvm_mmu *mmu, unsigned level, unsigned gp
>>>  	return mmu->last_pte_bitmap & (1 << index);
>>>  }
>>>
>>> +#define PTTYPE_EPT 18 /* arbitrary */
>>> +#define PTTYPE PTTYPE_EPT
>>> +#include "paging_tmpl.h"
>>> +#undef PTTYPE
>>> +
>>>  #define PTTYPE 64
>>>  #include "paging_tmpl.h"
>>>  #undef PTTYPE
>>> diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
>>> index df34d4a..4c45654 100644
>>> --- a/arch/x86/kvm/paging_tmpl.h
>>> +++ b/arch/x86/kvm/paging_tmpl.h
>>> @@ -50,6 +50,22 @@
>>>  	#define PT_LEVEL_BITS PT32_LEVEL_BITS
>>>  	#define PT_MAX_FULL_LEVELS 2
>>>  	#define CMPXCHG cmpxchg
>>> +#elif PTTYPE == PTTYPE_EPT
>>> +	#define pt_element_t u64
>>> +	#define guest_walker guest_walkerEPT
>>> +	#define FNAME(name) EPT_##name
>>> +	#define PT_BASE_ADDR_MASK PT64_BASE_ADDR_MASK
>>> +	#define PT_LVL_ADDR_MASK(lvl) PT64_LVL_ADDR_MASK(lvl)
>>> +	#define PT_LVL_OFFSET_MASK(lvl) PT64_LVL_OFFSET_MASK(lvl)
>>> +	#define PT_INDEX(addr, level) PT64_INDEX(addr, level)
>>> +	#define PT_LEVEL_BITS PT64_LEVEL_BITS
>>> +	#ifdef CONFIG_X86_64
>>> +	#define PT_MAX_FULL_LEVELS 4
>>> +	#define CMPXCHG cmpxchg
>>> +	#else
>>> +	#define CMPXCHG cmpxchg64
>>
>> CMPXHG is only used in FNAME(cmpxchg_gpte), but you commented it later.
>> Do we really need it?
>>
>>> +	#define PT_MAX_FULL_LEVELS 2
>>
>> And the SDM says:
>>
>> "It uses a page-walk length of 4, meaning that at most 4 EPT paging-structure
>> entriesare accessed to translate a guest-physical address.", Is My SDM obsolete?
>> Which kind of process supports page-walk length = 2?
>>
>> It seems your patch is not able to handle the case that the guest uses walk-lenght = 2
>> which is running on the host with walk-lenght = 4.
>> (plrease refer to how to handle sp->role.quadrant in FNAME(get_level1_sp_gpa) in
>> the current code.)
>>
> But since EPT always has 4 levels on all existing cpus it is not an issue and the only case
> that we should worry about is guest walk-lenght == host walk-lenght == 4, or have I

Yes. I totally agree with you, but...

> misunderstood what you mean here?

What confused me is that this patch defines "#define PT_MAX_FULL_LEVELS 2", so I asked the
question: "Which kind of process supports page-walk length = 2".
Sorry, there is a typo in my original comment: "process" should be "processor" or "CPU".





^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 03/13] nEPT: Add EPT tables support to paging_tmpl.h
  2013-06-17 12:11       ` Xiao Guangrong
@ 2013-06-18 10:57         ` Gleb Natapov
  2013-06-18 12:51           ` Xiao Guangrong
  0 siblings, 1 reply; 52+ messages in thread
From: Gleb Natapov @ 2013-06-18 10:57 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Jun Nakajima, kvm, Paolo Bonzini

On Mon, Jun 17, 2013 at 08:11:03PM +0800, Xiao Guangrong wrote:
> On 06/11/2013 07:32 PM, Gleb Natapov wrote:
> > On Tue, May 21, 2013 at 03:52:12PM +0800, Xiao Guangrong wrote:
> >> On 05/19/2013 12:52 PM, Jun Nakajima wrote:
> >>> From: Nadav Har'El <nyh@il.ibm.com>
> >>>
> >>> This is the first patch in a series which adds nested EPT support to KVM's
> >>> nested VMX. Nested EPT means emulating EPT for an L1 guest so that L1 can use
> >>> EPT when running a nested guest L2. When L1 uses EPT, it allows the L2 guest
> >>> to set its own cr3 and take its own page faults without either of L0 or L1
> >>> getting involved. This often significanlty improves L2's performance over the
> >>> previous two alternatives (shadow page tables over EPT, and shadow page
> >>> tables over shadow page tables).
> >>>
> >>> This patch adds EPT support to paging_tmpl.h.
> >>>
> >>> paging_tmpl.h contains the code for reading and writing page tables. The code
> >>> for 32-bit and 64-bit tables is very similar, but not identical, so
> >>> paging_tmpl.h is #include'd twice in mmu.c, once with PTTTYPE=32 and once
> >>> with PTTYPE=64, and this generates the two sets of similar functions.
> >>>
> >>> There are subtle but important differences between the format of EPT tables
> >>> and that of ordinary x86 64-bit page tables, so for nested EPT we need a
> >>> third set of functions to read the guest EPT table and to write the shadow
> >>> EPT table.
> >>>
> >>> So this patch adds third PTTYPE, PTTYPE_EPT, which creates functions (prefixed
> >>> with "EPT") which correctly read and write EPT tables.
> >>>
> >>> Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
> >>> Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
> >>> Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
> >>> ---
> >>>  arch/x86/kvm/mmu.c         |  5 +++++
> >>>  arch/x86/kvm/paging_tmpl.h | 43 +++++++++++++++++++++++++++++++++++++++++--
> >>>  2 files changed, 46 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> >>> index 117233f..6c1670f 100644
> >>> --- a/arch/x86/kvm/mmu.c
> >>> +++ b/arch/x86/kvm/mmu.c
> >>> @@ -3397,6 +3397,11 @@ static inline bool is_last_gpte(struct kvm_mmu *mmu, unsigned level, unsigned gp
> >>>  	return mmu->last_pte_bitmap & (1 << index);
> >>>  }
> >>>
> >>> +#define PTTYPE_EPT 18 /* arbitrary */
> >>> +#define PTTYPE PTTYPE_EPT
> >>> +#include "paging_tmpl.h"
> >>> +#undef PTTYPE
> >>> +
> >>>  #define PTTYPE 64
> >>>  #include "paging_tmpl.h"
> >>>  #undef PTTYPE
> >>> diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
> >>> index df34d4a..4c45654 100644
> >>> --- a/arch/x86/kvm/paging_tmpl.h
> >>> +++ b/arch/x86/kvm/paging_tmpl.h
> >>> @@ -50,6 +50,22 @@
> >>>  	#define PT_LEVEL_BITS PT32_LEVEL_BITS
> >>>  	#define PT_MAX_FULL_LEVELS 2
> >>>  	#define CMPXCHG cmpxchg
> >>> +#elif PTTYPE == PTTYPE_EPT
> >>> +	#define pt_element_t u64
> >>> +	#define guest_walker guest_walkerEPT
> >>> +	#define FNAME(name) EPT_##name
> >>> +	#define PT_BASE_ADDR_MASK PT64_BASE_ADDR_MASK
> >>> +	#define PT_LVL_ADDR_MASK(lvl) PT64_LVL_ADDR_MASK(lvl)
> >>> +	#define PT_LVL_OFFSET_MASK(lvl) PT64_LVL_OFFSET_MASK(lvl)
> >>> +	#define PT_INDEX(addr, level) PT64_INDEX(addr, level)
> >>> +	#define PT_LEVEL_BITS PT64_LEVEL_BITS
> >>> +	#ifdef CONFIG_X86_64
> >>> +	#define PT_MAX_FULL_LEVELS 4
> >>> +	#define CMPXCHG cmpxchg
> >>> +	#else
> >>> +	#define CMPXCHG cmpxchg64
> >>
> >> CMPXHG is only used in FNAME(cmpxchg_gpte), but you commented it later.
> >> Do we really need it?
> >>
> >>> +	#define PT_MAX_FULL_LEVELS 2
> >>
> >> And the SDM says:
> >>
> >> "It uses a page-walk length of 4, meaning that at most 4 EPT paging-structure
> >> entriesare accessed to translate a guest-physical address.", Is My SDM obsolete?
> >> Which kind of process supports page-walk length = 2?
> >>
> >> It seems your patch is not able to handle the case that the guest uses walk-lenght = 2
> >> which is running on the host with walk-lenght = 4.
> >> (plrease refer to how to handle sp->role.quadrant in FNAME(get_level1_sp_gpa) in
> >> the current code.)
> >>
> > But since EPT always has 4 levels on all existing cpus it is not an issue and the only case
> > that we should worry about is guest walk-lenght == host walk-lenght == 4, or have I
> 
> Yes. I totally agree with you, but...
> 
> > misunderstood what you mean here?
> 
> What confused me is that this patch defines "#define PT_MAX_FULL_LEVELS 2", so i asked the
> question: "Which kind of process supports page-walk length = 2".
> Sorry, there is a typo in my origin comments. "process" should be "processor" or "CPU".
> 
That is how I understood it, but then the discussion moved to dropping
nEPT support on 32-bit hosts. What's the connection? Even on a 32-bit
host the walk is 4 levels. Doesn't the shadow page code support 4 levels
on a 32-bit host?

--
			Gleb.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 03/13] nEPT: Add EPT tables support to paging_tmpl.h
  2013-06-18 10:57         ` Gleb Natapov
@ 2013-06-18 12:51           ` Xiao Guangrong
  2013-06-18 13:01             ` Gleb Natapov
  0 siblings, 1 reply; 52+ messages in thread
From: Xiao Guangrong @ 2013-06-18 12:51 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Jun Nakajima, kvm, Paolo Bonzini

On 06/18/2013 06:57 PM, Gleb Natapov wrote:
> On Mon, Jun 17, 2013 at 08:11:03PM +0800, Xiao Guangrong wrote:
>> On 06/11/2013 07:32 PM, Gleb Natapov wrote:
>>> On Tue, May 21, 2013 at 03:52:12PM +0800, Xiao Guangrong wrote:
>>>> On 05/19/2013 12:52 PM, Jun Nakajima wrote:
>>>>> From: Nadav Har'El <nyh@il.ibm.com>
>>>>>
>>>>> This is the first patch in a series which adds nested EPT support to KVM's
>>>>> nested VMX. Nested EPT means emulating EPT for an L1 guest so that L1 can use
>>>>> EPT when running a nested guest L2. When L1 uses EPT, it allows the L2 guest
>>>>> to set its own cr3 and take its own page faults without either of L0 or L1
>>>>> getting involved. This often significanlty improves L2's performance over the
>>>>> previous two alternatives (shadow page tables over EPT, and shadow page
>>>>> tables over shadow page tables).
>>>>>
>>>>> This patch adds EPT support to paging_tmpl.h.
>>>>>
>>>>> paging_tmpl.h contains the code for reading and writing page tables. The code
>>>>> for 32-bit and 64-bit tables is very similar, but not identical, so
>>>>> paging_tmpl.h is #include'd twice in mmu.c, once with PTTTYPE=32 and once
>>>>> with PTTYPE=64, and this generates the two sets of similar functions.
>>>>>
>>>>> There are subtle but important differences between the format of EPT tables
>>>>> and that of ordinary x86 64-bit page tables, so for nested EPT we need a
>>>>> third set of functions to read the guest EPT table and to write the shadow
>>>>> EPT table.
>>>>>
>>>>> So this patch adds third PTTYPE, PTTYPE_EPT, which creates functions (prefixed
>>>>> with "EPT") which correctly read and write EPT tables.
>>>>>
>>>>> Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
>>>>> Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
>>>>> Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
>>>>> ---
>>>>>  arch/x86/kvm/mmu.c         |  5 +++++
>>>>>  arch/x86/kvm/paging_tmpl.h | 43 +++++++++++++++++++++++++++++++++++++++++--
>>>>>  2 files changed, 46 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
>>>>> index 117233f..6c1670f 100644
>>>>> --- a/arch/x86/kvm/mmu.c
>>>>> +++ b/arch/x86/kvm/mmu.c
>>>>> @@ -3397,6 +3397,11 @@ static inline bool is_last_gpte(struct kvm_mmu *mmu, unsigned level, unsigned gp
>>>>>  	return mmu->last_pte_bitmap & (1 << index);
>>>>>  }
>>>>>
>>>>> +#define PTTYPE_EPT 18 /* arbitrary */
>>>>> +#define PTTYPE PTTYPE_EPT
>>>>> +#include "paging_tmpl.h"
>>>>> +#undef PTTYPE
>>>>> +
>>>>>  #define PTTYPE 64
>>>>>  #include "paging_tmpl.h"
>>>>>  #undef PTTYPE
>>>>> diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
>>>>> index df34d4a..4c45654 100644
>>>>> --- a/arch/x86/kvm/paging_tmpl.h
>>>>> +++ b/arch/x86/kvm/paging_tmpl.h
>>>>> @@ -50,6 +50,22 @@
>>>>>  	#define PT_LEVEL_BITS PT32_LEVEL_BITS
>>>>>  	#define PT_MAX_FULL_LEVELS 2
>>>>>  	#define CMPXCHG cmpxchg
>>>>> +#elif PTTYPE == PTTYPE_EPT
>>>>> +	#define pt_element_t u64
>>>>> +	#define guest_walker guest_walkerEPT
>>>>> +	#define FNAME(name) EPT_##name
>>>>> +	#define PT_BASE_ADDR_MASK PT64_BASE_ADDR_MASK
>>>>> +	#define PT_LVL_ADDR_MASK(lvl) PT64_LVL_ADDR_MASK(lvl)
>>>>> +	#define PT_LVL_OFFSET_MASK(lvl) PT64_LVL_OFFSET_MASK(lvl)
>>>>> +	#define PT_INDEX(addr, level) PT64_INDEX(addr, level)
>>>>> +	#define PT_LEVEL_BITS PT64_LEVEL_BITS
>>>>> +	#ifdef CONFIG_X86_64
>>>>> +	#define PT_MAX_FULL_LEVELS 4
>>>>> +	#define CMPXCHG cmpxchg
>>>>> +	#else
>>>>> +	#define CMPXCHG cmpxchg64
>>>>
>>>> CMPXHG is only used in FNAME(cmpxchg_gpte), but you commented it later.
>>>> Do we really need it?
>>>>
>>>>> +	#define PT_MAX_FULL_LEVELS 2
>>>>
>>>> And the SDM says:
>>>>
>>>> "It uses a page-walk length of 4, meaning that at most 4 EPT paging-structure
>>>> entriesare accessed to translate a guest-physical address.", Is My SDM obsolete?
>>>> Which kind of process supports page-walk length = 2?
>>>>
>>>> It seems your patch is not able to handle the case that the guest uses walk-lenght = 2
>>>> which is running on the host with walk-lenght = 4.
>>>> (plrease refer to how to handle sp->role.quadrant in FNAME(get_level1_sp_gpa) in
>>>> the current code.)
>>>>
>>> But since EPT always has 4 levels on all existing cpus it is not an issue and the only case
>>> that we should worry about is guest walk-lenght == host walk-lenght == 4, or have I
>>
>> Yes. I totally agree with you, but...
>>
>>> misunderstood what you mean here?
>>
>> What confused me is that this patch defines "#define PT_MAX_FULL_LEVELS 2", so i asked the
>> question: "Which kind of process supports page-walk length = 2".
>> Sorry, there is a typo in my origin comments. "process" should be "processor" or "CPU".
>>
> That is how I understood it, but then the discussion moved to dropping
> of nEPT support on 32-bit host. What's the connection? Even on 32bit

If EPT supported "walk-level = 2" on a 32-bit host (maybe it does not), I thought dropping
32-bit support to reduce the complexity would be worthwhile; otherwise:
a) we would need to handle different page sizes between L0 and L2, and
b) we would need to carefully review the code due to the lack of PDPT support for nEPT on L2.

I remember that the original version of nEPT did not support PAE 32-bit L2 guests. I have
dug it up:
http://comments.gmane.org/gmane.comp.emulators.kvm.devel/95395

There seem to be no changes in this version; I have no idea how it was fixed here.

> host the walk is 4 levels. Doesn't shadow page code support 4 levels on
> 32bit host?

Yes, it does. 4 levels is fine on 32bit host.

If EPT only supports 4 levels on both 32-bit and 64-bit hosts, there is no big difference
between supporting nEPT for a 32-bit L2 and for a 64-bit L2.



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 03/13] nEPT: Add EPT tables support to paging_tmpl.h
  2013-06-18 12:51           ` Xiao Guangrong
@ 2013-06-18 13:01             ` Gleb Natapov
  0 siblings, 0 replies; 52+ messages in thread
From: Gleb Natapov @ 2013-06-18 13:01 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Jun Nakajima, kvm, Paolo Bonzini

On Tue, Jun 18, 2013 at 08:51:25PM +0800, Xiao Guangrong wrote:
> On 06/18/2013 06:57 PM, Gleb Natapov wrote:
> > On Mon, Jun 17, 2013 at 08:11:03PM +0800, Xiao Guangrong wrote:
> >> On 06/11/2013 07:32 PM, Gleb Natapov wrote:
> >>> On Tue, May 21, 2013 at 03:52:12PM +0800, Xiao Guangrong wrote:
> >>>> On 05/19/2013 12:52 PM, Jun Nakajima wrote:
> >>>>> From: Nadav Har'El <nyh@il.ibm.com>
> >>>>>
> >>>>> This is the first patch in a series which adds nested EPT support to KVM's
> >>>>> nested VMX. Nested EPT means emulating EPT for an L1 guest so that L1 can use
> >>>>> EPT when running a nested guest L2. When L1 uses EPT, it allows the L2 guest
> >>>>> to set its own cr3 and take its own page faults without either of L0 or L1
> >>>>> getting involved. This often significanlty improves L2's performance over the
> >>>>> previous two alternatives (shadow page tables over EPT, and shadow page
> >>>>> tables over shadow page tables).
> >>>>>
> >>>>> This patch adds EPT support to paging_tmpl.h.
> >>>>>
> >>>>> paging_tmpl.h contains the code for reading and writing page tables. The code
> >>>>> for 32-bit and 64-bit tables is very similar, but not identical, so
> >>>>> paging_tmpl.h is #include'd twice in mmu.c, once with PTTTYPE=32 and once
> >>>>> with PTTYPE=64, and this generates the two sets of similar functions.
> >>>>>
> >>>>> There are subtle but important differences between the format of EPT tables
> >>>>> and that of ordinary x86 64-bit page tables, so for nested EPT we need a
> >>>>> third set of functions to read the guest EPT table and to write the shadow
> >>>>> EPT table.
> >>>>>
> >>>>> So this patch adds third PTTYPE, PTTYPE_EPT, which creates functions (prefixed
> >>>>> with "EPT") which correctly read and write EPT tables.
> >>>>>
> >>>>> Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
> >>>>> Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
> >>>>> Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
> >>>>> ---
> >>>>>  arch/x86/kvm/mmu.c         |  5 +++++
> >>>>>  arch/x86/kvm/paging_tmpl.h | 43 +++++++++++++++++++++++++++++++++++++++++--
> >>>>>  2 files changed, 46 insertions(+), 2 deletions(-)
> >>>>>
> >>>>> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> >>>>> index 117233f..6c1670f 100644
> >>>>> --- a/arch/x86/kvm/mmu.c
> >>>>> +++ b/arch/x86/kvm/mmu.c
> >>>>> @@ -3397,6 +3397,11 @@ static inline bool is_last_gpte(struct kvm_mmu *mmu, unsigned level, unsigned gp
> >>>>>  	return mmu->last_pte_bitmap & (1 << index);
> >>>>>  }
> >>>>>
> >>>>> +#define PTTYPE_EPT 18 /* arbitrary */
> >>>>> +#define PTTYPE PTTYPE_EPT
> >>>>> +#include "paging_tmpl.h"
> >>>>> +#undef PTTYPE
> >>>>> +
> >>>>>  #define PTTYPE 64
> >>>>>  #include "paging_tmpl.h"
> >>>>>  #undef PTTYPE
> >>>>> diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
> >>>>> index df34d4a..4c45654 100644
> >>>>> --- a/arch/x86/kvm/paging_tmpl.h
> >>>>> +++ b/arch/x86/kvm/paging_tmpl.h
> >>>>> @@ -50,6 +50,22 @@
> >>>>>  	#define PT_LEVEL_BITS PT32_LEVEL_BITS
> >>>>>  	#define PT_MAX_FULL_LEVELS 2
> >>>>>  	#define CMPXCHG cmpxchg
> >>>>> +#elif PTTYPE == PTTYPE_EPT
> >>>>> +	#define pt_element_t u64
> >>>>> +	#define guest_walker guest_walkerEPT
> >>>>> +	#define FNAME(name) EPT_##name
> >>>>> +	#define PT_BASE_ADDR_MASK PT64_BASE_ADDR_MASK
> >>>>> +	#define PT_LVL_ADDR_MASK(lvl) PT64_LVL_ADDR_MASK(lvl)
> >>>>> +	#define PT_LVL_OFFSET_MASK(lvl) PT64_LVL_OFFSET_MASK(lvl)
> >>>>> +	#define PT_INDEX(addr, level) PT64_INDEX(addr, level)
> >>>>> +	#define PT_LEVEL_BITS PT64_LEVEL_BITS
> >>>>> +	#ifdef CONFIG_X86_64
> >>>>> +	#define PT_MAX_FULL_LEVELS 4
> >>>>> +	#define CMPXCHG cmpxchg
> >>>>> +	#else
> >>>>> +	#define CMPXCHG cmpxchg64
> >>>>
> >>>> CMPXHG is only used in FNAME(cmpxchg_gpte), but you commented it later.
> >>>> Do we really need it?
> >>>>
> >>>>> +	#define PT_MAX_FULL_LEVELS 2
> >>>>
> >>>> And the SDM says:
> >>>>
> >>>> "It uses a page-walk length of 4, meaning that at most 4 EPT paging-structure
> >>>> entriesare accessed to translate a guest-physical address.", Is My SDM obsolete?
> >>>> Which kind of process supports page-walk length = 2?
> >>>>
> >>>> It seems your patch is not able to handle the case that the guest uses walk-lenght = 2
> >>>> which is running on the host with walk-lenght = 4.
> >>>> (plrease refer to how to handle sp->role.quadrant in FNAME(get_level1_sp_gpa) in
> >>>> the current code.)
> >>>>
> >>> But since EPT always has 4 levels on all existing cpus it is not an issue and the only case
> >>> that we should worry about is guest walk-lenght == host walk-lenght == 4, or have I
> >>
> >> Yes. I totally agree with you, but...
> >>
> >>> misunderstood what you mean here?
> >>
> >> What confused me is that this patch defines "#define PT_MAX_FULL_LEVELS 2", so i asked the
> >> question: "Which kind of process supports page-walk length = 2".
> >> Sorry, there is a typo in my origin comments. "process" should be "processor" or "CPU".
> >>
> > That is how I understood it, but then the discussion moved to dropping
> > of nEPT support on 32-bit host. What's the connection? Even on 32bit
> 
> If EPT supports "walk-level = 2" on 32bit host (maybe it is not true), i thought dropping
> 32bit support to reduce the complex is worthwhile, otherwise:
> a) we need to handle different page size between L0 and L2 and
> b) we need to carefully review the code due to lacking PDPT supporting on nept on L2.
> 
> I remember that the origin version of NEPT did not support PAE-32bit L2 guest. I have
> found it out:
> http://comments.gmane.org/gmane.comp.emulators.kvm.devel/95395
> 
> It seems no changes in this version, I have no idea how it was fixed in this version.
> 
I think the patches that reload the nested PDPT pointers did that.

> > host the walk is 4 levels. Doesn't shadow page code support 4 levels on
> > 32bit host?
> 
> Yes, it does. 4 levels is fine on 32bit host.
> 
> If EPT only supports 4 levels on both 32bit and 64bit hosts, there is no big difference
> to support nept on 32bit L2 and 64bit L2.
> 
OK, that is my understanding too. Thanks for the confirmation.

--
			Gleb.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* RE: [PATCH v3 01/13] nEPT: Support LOAD_IA32_EFER entry/exit controls for L1
  2013-05-20 12:33 ` [PATCH v3 01/13] nEPT: Support LOAD_IA32_EFER entry/exit controls for L1 Paolo Bonzini
@ 2013-07-02  3:01   ` Zhang, Yang Z
  2013-07-02 13:59     ` Gleb Natapov
  0 siblings, 1 reply; 52+ messages in thread
From: Zhang, Yang Z @ 2013-07-02  3:01 UTC (permalink / raw)
  To: Paolo Bonzini, Nakajima, Jun; +Cc: kvm, Gleb Natapov, Jan Kiszka

This series has been pending on the mailing list for a long time, and it's really a big feature for nested VMX. Also, I suspect the original authors (Jun and Nadav) no longer have enough time to continue it, so I will pick it up. :)

See comments below:

Paolo Bonzini wrote on 2013-05-20:
> Il 19/05/2013 06:52, Jun Nakajima ha scritto:
> > From: Nadav Har'El <nyh@il.ibm.com>
> >
> > Recent KVM, since
> http://kerneltrap.org/mailarchive/linux-kvm/2010/5/2/6261577
> > switch the EFER MSR when EPT is used and the host and guest have different
> > NX bits. So if we add support for nested EPT (L1 guest using EPT to run L2)
> > and want to be able to run recent KVM as L1, we need to allow L1 to use this
> > EFER switching feature.
> >
> > To do this EFER switching, KVM uses VM_ENTRY/EXIT_LOAD_IA32_EFER if
> available,
> > and if it isn't, it uses the generic VM_ENTRY/EXIT_MSR_LOAD. This patch adds
> > support for the former (the latter is still unsupported).
> >
> > Nested entry and exit emulation (prepare_vmcs_02 and
> load_vmcs12_host_state,
> > respectively) already handled VM_ENTRY/EXIT_LOAD_IA32_EFER correctly. So
> all
> > that's left to do in this patch is to properly advertise this feature to L1.
> >
> > Note that vmcs12's VM_ENTRY/EXIT_LOAD_IA32_EFER are emulated by L0, by
> using
> > vmx_set_efer (which itself sets one of several vmcs02 fields), so we always
> > support this feature, regardless of whether the host supports it.
> >
> > Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
> > Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
> > Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
> > ---
> >  arch/x86/kvm/vmx.c | 23 ++++++++++++++++-------
> >  1 file changed, 16 insertions(+), 7 deletions(-)
> >
> > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> > index 260a919..fb9cae5 100644
> > --- a/arch/x86/kvm/vmx.c
> > +++ b/arch/x86/kvm/vmx.c
> > @@ -2192,7 +2192,8 @@ static __init void nested_vmx_setup_ctls_msrs(void)
> >  #else
> >  	nested_vmx_exit_ctls_high = 0;
> >  #endif
> > -	nested_vmx_exit_ctls_high |=
> VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR;
> > +	nested_vmx_exit_ctls_high |=
> (VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR |
> > +				      VM_EXIT_LOAD_IA32_EFER);
> >
> >  	/* entry controls */
> >  	rdmsr(MSR_IA32_VMX_ENTRY_CTLS,
> > @@ -2201,8 +2202,8 @@ static __init void nested_vmx_setup_ctls_msrs(void)
> >  	nested_vmx_entry_ctls_low =
> VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR;
> >  	nested_vmx_entry_ctls_high &=
> >  		VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_IA32E_MODE;
> > -	nested_vmx_entry_ctls_high |=
> VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR;
> > -
> > +	nested_vmx_entry_ctls_high |=
> (VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR |
> > +				       VM_ENTRY_LOAD_IA32_EFER);
> >  	/* cpu-based controls */
> >  	rdmsr(MSR_IA32_VMX_PROCBASED_CTLS,
> >  		nested_vmx_procbased_ctls_low, nested_vmx_procbased_ctls_high);
> > @@ -7492,10 +7493,18 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu,
> struct vmcs12 *vmcs12)
> >  	vcpu->arch.cr0_guest_owned_bits &= ~vmcs12->cr0_guest_host_mask;
> >  	vmcs_writel(CR0_GUEST_HOST_MASK,
> ~vcpu->arch.cr0_guest_owned_bits);
> >
> > -	/* Note: IA32_MODE, LOAD_IA32_EFER are modified by vmx_set_efer
> below */
> > -	vmcs_write32(VM_EXIT_CONTROLS,
> > -		vmcs12->vm_exit_controls | vmcs_config.vmexit_ctrl);
> > -	vmcs_write32(VM_ENTRY_CONTROLS, vmcs12->vm_entry_controls |
> > +	/* L2->L1 exit controls are emulated - the hardware exit is to L0 so
> > +	 * we should use its exit controls. Note that IA32_MODE, LOAD_IA32_EFER
> > +	 * bits are further modified by vmx_set_efer() below.
> > +	 */
> > +	vmcs_write32(VM_EXIT_CONTROLS, vmcs_config.vmexit_ctrl);
This is wrong. We cannot use the L0 exit controls directly.
LOAD_PERF_GLOBAL_CTRL, LOAD_HOST_EFER, LOAD_HOST_PAT and ACK_INTR_ON_EXIT should use the host's exit controls, but the others still need to use (vmcs12 | host).
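
Something along these lines, I think (rough sketch; which bits belong in the
host-only set is exactly the question above, and the names below assume the
VM_EXIT_* definitions in vmx.h):

	/* bits that are always emulated by L0 and therefore come from
	 * vmcs_config only; all other exit controls are vmcs12 | host.
	 */
	u32 host_owned = VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL |
			 VM_EXIT_LOAD_IA32_EFER |
			 VM_EXIT_LOAD_IA32_PAT |
			 VM_EXIT_ACK_INTR_ON_EXIT;

	vmcs_write32(VM_EXIT_CONTROLS,
		     (vmcs12->vm_exit_controls & ~host_owned) |
		     vmcs_config.vmexit_ctrl);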

> > +
> > +	/* vmcs12's VM_ENTRY_LOAD_IA32_EFER and VM_ENTRY_IA32E_MODE
> are
> > +	 * emulated by vmx_set_efer(), below.
> 
> VM_ENTRY_LOAD_IA32_EFER is not emulated by vmx_set_efer, so:
VM_ENTRY_LOAD_IA32_EFER is handled in setup_msrs(), and vmx_set_efer already calls it.

> 
>     /* vmcs12's VM_ENTRY_LOAD_IA32_EFER and VM_ENTRY_IA32E_MODE
>      * are emulated below.  VM_ENTRY_IA32E_MODE is handled in
>      * vmx_set_efer().  */
> 
> Paolo
> 
> > +	 */
> > +	vmcs_write32(VM_ENTRY_CONTROLS,
> > +		(vmcs12->vm_entry_controls & ~VM_ENTRY_LOAD_IA32_EFER &
> > +			~VM_ENTRY_IA32E_MODE) |
> >  		(vmcs_config.vmentry_ctrl & ~VM_ENTRY_IA32E_MODE));
> >
> >  	if (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_PAT)
> >
> 
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

Best regards,
Yang



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 01/13] nEPT: Support LOAD_IA32_EFER entry/exit controls for L1
  2013-07-02  3:01   ` Zhang, Yang Z
@ 2013-07-02 13:59     ` Gleb Natapov
  2013-07-02 14:28       ` Jan Kiszka
  0 siblings, 1 reply; 52+ messages in thread
From: Gleb Natapov @ 2013-07-02 13:59 UTC (permalink / raw)
  To: Zhang, Yang Z; +Cc: Paolo Bonzini, Nakajima, Jun, kvm, Jan Kiszka

On Tue, Jul 02, 2013 at 03:01:24AM +0000, Zhang, Yang Z wrote:
> Since this series is pending in mail list for long time. And it's really a big feature for Nested. Also, I doubt the original authors(Jun and Nahav)should not have enough time to continue it. So I will pick it up. :)
> 
> See comments below:
> 
> Paolo Bonzini wrote on 2013-05-20:
> > Il 19/05/2013 06:52, Jun Nakajima ha scritto:
> > > From: Nadav Har'El <nyh@il.ibm.com>
> > >
> > > Recent KVM, since
> > http://kerneltrap.org/mailarchive/linux-kvm/2010/5/2/6261577
> > > switch the EFER MSR when EPT is used and the host and guest have different
> > > NX bits. So if we add support for nested EPT (L1 guest using EPT to run L2)
> > > and want to be able to run recent KVM as L1, we need to allow L1 to use this
> > > EFER switching feature.
> > >
> > > To do this EFER switching, KVM uses VM_ENTRY/EXIT_LOAD_IA32_EFER if
> > available,
> > > and if it isn't, it uses the generic VM_ENTRY/EXIT_MSR_LOAD. This patch adds
> > > support for the former (the latter is still unsupported).
> > >
> > > Nested entry and exit emulation (prepare_vmcs_02 and
> > load_vmcs12_host_state,
> > > respectively) already handled VM_ENTRY/EXIT_LOAD_IA32_EFER correctly. So
> > all
> > > that's left to do in this patch is to properly advertise this feature to L1.
> > >
> > > Note that vmcs12's VM_ENTRY/EXIT_LOAD_IA32_EFER are emulated by L0, by
> > using
> > > vmx_set_efer (which itself sets one of several vmcs02 fields), so we always
> > > support this feature, regardless of whether the host supports it.
> > >
> > > Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
> > > Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
> > > Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
> > > ---
> > >  arch/x86/kvm/vmx.c | 23 ++++++++++++++++-------
> > >  1 file changed, 16 insertions(+), 7 deletions(-)
> > >
> > > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> > > index 260a919..fb9cae5 100644
> > > --- a/arch/x86/kvm/vmx.c
> > > +++ b/arch/x86/kvm/vmx.c
> > > @@ -2192,7 +2192,8 @@ static __init void nested_vmx_setup_ctls_msrs(void)
> > >  #else
> > >  	nested_vmx_exit_ctls_high = 0;
> > >  #endif
> > > -	nested_vmx_exit_ctls_high |=
> > VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR;
> > > +	nested_vmx_exit_ctls_high |=
> > (VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR |
> > > +				      VM_EXIT_LOAD_IA32_EFER);
> > >
> > >  	/* entry controls */
> > >  	rdmsr(MSR_IA32_VMX_ENTRY_CTLS,
> > > @@ -2201,8 +2202,8 @@ static __init void nested_vmx_setup_ctls_msrs(void)
> > >  	nested_vmx_entry_ctls_low =
> > VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR;
> > >  	nested_vmx_entry_ctls_high &=
> > >  		VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_IA32E_MODE;
> > > -	nested_vmx_entry_ctls_high |=
> > VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR;
> > > -
> > > +	nested_vmx_entry_ctls_high |=
> > (VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR |
> > > +				       VM_ENTRY_LOAD_IA32_EFER);
> > >  	/* cpu-based controls */
> > >  	rdmsr(MSR_IA32_VMX_PROCBASED_CTLS,
> > >  		nested_vmx_procbased_ctls_low, nested_vmx_procbased_ctls_high);
> > > @@ -7492,10 +7493,18 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu,
> > struct vmcs12 *vmcs12)
> > >  	vcpu->arch.cr0_guest_owned_bits &= ~vmcs12->cr0_guest_host_mask;
> > >  	vmcs_writel(CR0_GUEST_HOST_MASK,
> > ~vcpu->arch.cr0_guest_owned_bits);
> > >
> > > -	/* Note: IA32_MODE, LOAD_IA32_EFER are modified by vmx_set_efer
> > below */
> > > -	vmcs_write32(VM_EXIT_CONTROLS,
> > > -		vmcs12->vm_exit_controls | vmcs_config.vmexit_ctrl);
> > > -	vmcs_write32(VM_ENTRY_CONTROLS, vmcs12->vm_entry_controls |
> > > +	/* L2->L1 exit controls are emulated - the hardware exit is to L0 so
> > > +	 * we should use its exit controls. Note that IA32_MODE, LOAD_IA32_EFER
> > > +	 * bits are further modified by vmx_set_efer() below.
> > > +	 */
> > > +	vmcs_write32(VM_EXIT_CONTROLS, vmcs_config.vmexit_ctrl);
> This is wrong. We cannot use L0 exit control directly.
> LOAD_PERF_GLOBAL_CTRL, LOAD_HOST_EFE, LOAD_HOST_PAT, ACK_INTR_ON_EXIT should use host's exit control. But others, still need use (vmcs12|host).
> 
I do not see why. We always intercept DR7/PAT/EFER, so the save is emulated
too. The host address-space size always comes from L0, and the preemption timer is
not supported for nested IIRC; when it is, the host will have to save
it on exit anyway for correct emulation.

> > > +
> > > +	/* vmcs12's VM_ENTRY_LOAD_IA32_EFER and VM_ENTRY_IA32E_MODE
> > are
> > > +	 * emulated by vmx_set_efer(), below.
> > 
> > VM_ENTRY_LOAD_IA32_EFER is not emulated by vmx_set_efer, so:
> VM_ENTRY_LOAD_IA32_EFER is hanlded in setup_msrs(), and vmx_set_efer already call it.
> 
> > 
> >     /* vmcs12's VM_ENTRY_LOAD_IA32_EFER and VM_ENTRY_IA32E_MODE
> >      * are emulated below.  VM_ENTRY_IA32E_MODE is handled in
> >      * vmx_set_efer().  */
> > 
> > Paolo
> > 
> > > +	 */
> > > +	vmcs_write32(VM_ENTRY_CONTROLS,
> > > +		(vmcs12->vm_entry_controls & ~VM_ENTRY_LOAD_IA32_EFER &
> > > +			~VM_ENTRY_IA32E_MODE) |
> > >  		(vmcs_config.vmentry_ctrl & ~VM_ENTRY_IA32E_MODE));
> > >
> > >  	if (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_PAT)
> > >
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe kvm" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> Best regards,
> Yang
> 

--
			Gleb.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 01/13] nEPT: Support LOAD_IA32_EFER entry/exit controls for L1
  2013-07-02 13:59     ` Gleb Natapov
@ 2013-07-02 14:28       ` Jan Kiszka
  2013-07-02 15:15         ` Gleb Natapov
  0 siblings, 1 reply; 52+ messages in thread
From: Jan Kiszka @ 2013-07-02 14:28 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Zhang, Yang Z, Paolo Bonzini, Nakajima, Jun, kvm

[-- Attachment #1: Type: text/plain, Size: 4564 bytes --]

On 2013-07-02 15:59, Gleb Natapov wrote:
> On Tue, Jul 02, 2013 at 03:01:24AM +0000, Zhang, Yang Z wrote:
>> Since this series is pending in mail list for long time. And it's really a big feature for Nested. Also, I doubt the original authors(Jun and Nahav)should not have enough time to continue it. So I will pick it up. :)
>>
>> See comments below:
>>
>> Paolo Bonzini wrote on 2013-05-20:
>>> Il 19/05/2013 06:52, Jun Nakajima ha scritto:
>>>> From: Nadav Har'El <nyh@il.ibm.com>
>>>>
>>>> Recent KVM, since
>>> http://kerneltrap.org/mailarchive/linux-kvm/2010/5/2/6261577
>>>> switch the EFER MSR when EPT is used and the host and guest have different
>>>> NX bits. So if we add support for nested EPT (L1 guest using EPT to run L2)
>>>> and want to be able to run recent KVM as L1, we need to allow L1 to use this
>>>> EFER switching feature.
>>>>
>>>> To do this EFER switching, KVM uses VM_ENTRY/EXIT_LOAD_IA32_EFER if
>>> available,
>>>> and if it isn't, it uses the generic VM_ENTRY/EXIT_MSR_LOAD. This patch adds
>>>> support for the former (the latter is still unsupported).
>>>>
>>>> Nested entry and exit emulation (prepare_vmcs_02 and
>>> load_vmcs12_host_state,
>>>> respectively) already handled VM_ENTRY/EXIT_LOAD_IA32_EFER correctly. So
>>> all
>>>> that's left to do in this patch is to properly advertise this feature to L1.
>>>>
>>>> Note that vmcs12's VM_ENTRY/EXIT_LOAD_IA32_EFER are emulated by L0, by
>>> using
>>>> vmx_set_efer (which itself sets one of several vmcs02 fields), so we always
>>>> support this feature, regardless of whether the host supports it.
>>>>
>>>> Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
>>>> Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
>>>> Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
>>>> ---
>>>>  arch/x86/kvm/vmx.c | 23 ++++++++++++++++-------
>>>>  1 file changed, 16 insertions(+), 7 deletions(-)
>>>>
>>>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>>>> index 260a919..fb9cae5 100644
>>>> --- a/arch/x86/kvm/vmx.c
>>>> +++ b/arch/x86/kvm/vmx.c
>>>> @@ -2192,7 +2192,8 @@ static __init void nested_vmx_setup_ctls_msrs(void)
>>>>  #else
>>>>  	nested_vmx_exit_ctls_high = 0;
>>>>  #endif
>>>> -	nested_vmx_exit_ctls_high |=
>>> VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR;
>>>> +	nested_vmx_exit_ctls_high |=
>>> (VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR |
>>>> +				      VM_EXIT_LOAD_IA32_EFER);
>>>>
>>>>  	/* entry controls */
>>>>  	rdmsr(MSR_IA32_VMX_ENTRY_CTLS,
>>>> @@ -2201,8 +2202,8 @@ static __init void nested_vmx_setup_ctls_msrs(void)
>>>>  	nested_vmx_entry_ctls_low =
>>> VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR;
>>>>  	nested_vmx_entry_ctls_high &=
>>>>  		VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_IA32E_MODE;
>>>> -	nested_vmx_entry_ctls_high |=
>>> VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR;
>>>> -
>>>> +	nested_vmx_entry_ctls_high |=
>>> (VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR |
>>>> +				       VM_ENTRY_LOAD_IA32_EFER);
>>>>  	/* cpu-based controls */
>>>>  	rdmsr(MSR_IA32_VMX_PROCBASED_CTLS,
>>>>  		nested_vmx_procbased_ctls_low, nested_vmx_procbased_ctls_high);
>>>> @@ -7492,10 +7493,18 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu,
>>> struct vmcs12 *vmcs12)
>>>>  	vcpu->arch.cr0_guest_owned_bits &= ~vmcs12->cr0_guest_host_mask;
>>>>  	vmcs_writel(CR0_GUEST_HOST_MASK,
>>> ~vcpu->arch.cr0_guest_owned_bits);
>>>>
>>>> -	/* Note: IA32_MODE, LOAD_IA32_EFER are modified by vmx_set_efer
>>> below */
>>>> -	vmcs_write32(VM_EXIT_CONTROLS,
>>>> -		vmcs12->vm_exit_controls | vmcs_config.vmexit_ctrl);
>>>> -	vmcs_write32(VM_ENTRY_CONTROLS, vmcs12->vm_entry_controls |
>>>> +	/* L2->L1 exit controls are emulated - the hardware exit is to L0 so
>>>> +	 * we should use its exit controls. Note that IA32_MODE, LOAD_IA32_EFER
>>>> +	 * bits are further modified by vmx_set_efer() below.
>>>> +	 */
>>>> +	vmcs_write32(VM_EXIT_CONTROLS, vmcs_config.vmexit_ctrl);
>> This is wrong. We cannot use L0 exit control directly.
>> LOAD_PERF_GLOBAL_CTRL, LOAD_HOST_EFE, LOAD_HOST_PAT, ACK_INTR_ON_EXIT should use host's exit control. But others, still need use (vmcs12|host).
>>
> I do not see why. We always intercept DR7/PAT/EFER, so save is emulated
> too. Host address space size always come from L0 and preemption timer is
> not supported for nested IIRC and when it will be host will have to save
> it on exit anyway for correct emulation.

The preemption timer is already supported and works fine as far as I have
tested. KVM doesn't use it for L1, so we do not need to save/restore it - IIRC.

Jan



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 263 bytes --]

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 01/13] nEPT: Support LOAD_IA32_EFER entry/exit controls for L1
  2013-07-02 14:28       ` Jan Kiszka
@ 2013-07-02 15:15         ` Gleb Natapov
  2013-07-02 15:34           ` Jan Kiszka
  0 siblings, 1 reply; 52+ messages in thread
From: Gleb Natapov @ 2013-07-02 15:15 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: Zhang, Yang Z, Paolo Bonzini, Nakajima, Jun, kvm

On Tue, Jul 02, 2013 at 04:28:56PM +0200, Jan Kiszka wrote:
> On 2013-07-02 15:59, Gleb Natapov wrote:
> > On Tue, Jul 02, 2013 at 03:01:24AM +0000, Zhang, Yang Z wrote:
> >> Since this series is pending in mail list for long time. And it's really a big feature for Nested. Also, I doubt the original authors(Jun and Nahav)should not have enough time to continue it. So I will pick it up. :)
> >>
> >> See comments below:
> >>
> >> Paolo Bonzini wrote on 2013-05-20:
> >>> Il 19/05/2013 06:52, Jun Nakajima ha scritto:
> >>>> From: Nadav Har'El <nyh@il.ibm.com>
> >>>>
> >>>> Recent KVM, since
> >>> http://kerneltrap.org/mailarchive/linux-kvm/2010/5/2/6261577
> >>>> switch the EFER MSR when EPT is used and the host and guest have different
> >>>> NX bits. So if we add support for nested EPT (L1 guest using EPT to run L2)
> >>>> and want to be able to run recent KVM as L1, we need to allow L1 to use this
> >>>> EFER switching feature.
> >>>>
> >>>> To do this EFER switching, KVM uses VM_ENTRY/EXIT_LOAD_IA32_EFER if
> >>> available,
> >>>> and if it isn't, it uses the generic VM_ENTRY/EXIT_MSR_LOAD. This patch adds
> >>>> support for the former (the latter is still unsupported).
> >>>>
> >>>> Nested entry and exit emulation (prepare_vmcs_02 and
> >>> load_vmcs12_host_state,
> >>>> respectively) already handled VM_ENTRY/EXIT_LOAD_IA32_EFER correctly. So
> >>> all
> >>>> that's left to do in this patch is to properly advertise this feature to L1.
> >>>>
> >>>> Note that vmcs12's VM_ENTRY/EXIT_LOAD_IA32_EFER are emulated by L0, by
> >>> using
> >>>> vmx_set_efer (which itself sets one of several vmcs02 fields), so we always
> >>>> support this feature, regardless of whether the host supports it.
> >>>>
> >>>> Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
> >>>> Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
> >>>> Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
> >>>> ---
> >>>>  arch/x86/kvm/vmx.c | 23 ++++++++++++++++-------
> >>>>  1 file changed, 16 insertions(+), 7 deletions(-)
> >>>>
> >>>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> >>>> index 260a919..fb9cae5 100644
> >>>> --- a/arch/x86/kvm/vmx.c
> >>>> +++ b/arch/x86/kvm/vmx.c
> >>>> @@ -2192,7 +2192,8 @@ static __init void nested_vmx_setup_ctls_msrs(void)
> >>>>  #else
> >>>>  	nested_vmx_exit_ctls_high = 0;
> >>>>  #endif
> >>>> -	nested_vmx_exit_ctls_high |=
> >>> VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR;
> >>>> +	nested_vmx_exit_ctls_high |=
> >>> (VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR |
> >>>> +				      VM_EXIT_LOAD_IA32_EFER);
> >>>>
> >>>>  	/* entry controls */
> >>>>  	rdmsr(MSR_IA32_VMX_ENTRY_CTLS,
> >>>> @@ -2201,8 +2202,8 @@ static __init void nested_vmx_setup_ctls_msrs(void)
> >>>>  	nested_vmx_entry_ctls_low =
> >>> VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR;
> >>>>  	nested_vmx_entry_ctls_high &=
> >>>>  		VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_IA32E_MODE;
> >>>> -	nested_vmx_entry_ctls_high |=
> >>> VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR;
> >>>> -
> >>>> +	nested_vmx_entry_ctls_high |=
> >>> (VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR |
> >>>> +				       VM_ENTRY_LOAD_IA32_EFER);
> >>>>  	/* cpu-based controls */
> >>>>  	rdmsr(MSR_IA32_VMX_PROCBASED_CTLS,
> >>>>  		nested_vmx_procbased_ctls_low, nested_vmx_procbased_ctls_high);
> >>>> @@ -7492,10 +7493,18 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu,
> >>> struct vmcs12 *vmcs12)
> >>>>  	vcpu->arch.cr0_guest_owned_bits &= ~vmcs12->cr0_guest_host_mask;
> >>>>  	vmcs_writel(CR0_GUEST_HOST_MASK,
> >>> ~vcpu->arch.cr0_guest_owned_bits);
> >>>>
> >>>> -	/* Note: IA32_MODE, LOAD_IA32_EFER are modified by vmx_set_efer
> >>> below */
> >>>> -	vmcs_write32(VM_EXIT_CONTROLS,
> >>>> -		vmcs12->vm_exit_controls | vmcs_config.vmexit_ctrl);
> >>>> -	vmcs_write32(VM_ENTRY_CONTROLS, vmcs12->vm_entry_controls |
> >>>> +	/* L2->L1 exit controls are emulated - the hardware exit is to L0 so
> >>>> +	 * we should use its exit controls. Note that IA32_MODE, LOAD_IA32_EFER
> >>>> +	 * bits are further modified by vmx_set_efer() below.
> >>>> +	 */
> >>>> +	vmcs_write32(VM_EXIT_CONTROLS, vmcs_config.vmexit_ctrl);
> >> This is wrong. We cannot use L0 exit control directly.
> >> LOAD_PERF_GLOBAL_CTRL, LOAD_HOST_EFE, LOAD_HOST_PAT, ACK_INTR_ON_EXIT should use host's exit control. But others, still need use (vmcs12|host).
> >>
> > I do not see why. We always intercept DR7/PAT/EFER, so save is emulated
> > too. Host address space size always come from L0 and preemption timer is
> > not supported for nested IIRC and when it will be host will have to save
> > it on exit anyway for correct emulation.
> 
> Preemption timer is already supported and works fine as far as I tested.
> KVM doesn't use it for L1, so we do not need to save/restore it - IIRC.
> 
So what happens if L1 configures it to value X, after X/2 ticks an L0 exit
happens, and L0 gets back to L2 directly? The counter will be X again
instead of X/2.

--
			Gleb.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 01/13] nEPT: Support LOAD_IA32_EFER entry/exit controls for L1
  2013-07-02 15:15         ` Gleb Natapov
@ 2013-07-02 15:34           ` Jan Kiszka
  2013-07-02 15:43             ` Gleb Natapov
  0 siblings, 1 reply; 52+ messages in thread
From: Jan Kiszka @ 2013-07-02 15:34 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Zhang, Yang Z, Paolo Bonzini, Nakajima, Jun, kvm

[-- Attachment #1: Type: text/plain, Size: 5257 bytes --]

On 2013-07-02 17:15, Gleb Natapov wrote:
> On Tue, Jul 02, 2013 at 04:28:56PM +0200, Jan Kiszka wrote:
>> On 2013-07-02 15:59, Gleb Natapov wrote:
>>> On Tue, Jul 02, 2013 at 03:01:24AM +0000, Zhang, Yang Z wrote:
>>>> Since this series is pending in mail list for long time. And it's really a big feature for Nested. Also, I doubt the original authors(Jun and Nahav)should not have enough time to continue it. So I will pick it up. :)
>>>>
>>>> See comments below:
>>>>
>>>> Paolo Bonzini wrote on 2013-05-20:
>>>>> Il 19/05/2013 06:52, Jun Nakajima ha scritto:
>>>>>> From: Nadav Har'El <nyh@il.ibm.com>
>>>>>>
>>>>>> Recent KVM, since
>>>>> http://kerneltrap.org/mailarchive/linux-kvm/2010/5/2/6261577
>>>>>> switch the EFER MSR when EPT is used and the host and guest have different
>>>>>> NX bits. So if we add support for nested EPT (L1 guest using EPT to run L2)
>>>>>> and want to be able to run recent KVM as L1, we need to allow L1 to use this
>>>>>> EFER switching feature.
>>>>>>
>>>>>> To do this EFER switching, KVM uses VM_ENTRY/EXIT_LOAD_IA32_EFER if
>>>>> available,
>>>>>> and if it isn't, it uses the generic VM_ENTRY/EXIT_MSR_LOAD. This patch adds
>>>>>> support for the former (the latter is still unsupported).
>>>>>>
>>>>>> Nested entry and exit emulation (prepare_vmcs_02 and
>>>>> load_vmcs12_host_state,
>>>>>> respectively) already handled VM_ENTRY/EXIT_LOAD_IA32_EFER correctly. So
>>>>> all
>>>>>> that's left to do in this patch is to properly advertise this feature to L1.
>>>>>>
>>>>>> Note that vmcs12's VM_ENTRY/EXIT_LOAD_IA32_EFER are emulated by L0, by
>>>>> using
>>>>>> vmx_set_efer (which itself sets one of several vmcs02 fields), so we always
>>>>>> support this feature, regardless of whether the host supports it.
>>>>>>
>>>>>> Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
>>>>>> Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
>>>>>> Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
>>>>>> ---
>>>>>>  arch/x86/kvm/vmx.c | 23 ++++++++++++++++-------
>>>>>>  1 file changed, 16 insertions(+), 7 deletions(-)
>>>>>>
>>>>>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>>>>>> index 260a919..fb9cae5 100644
>>>>>> --- a/arch/x86/kvm/vmx.c
>>>>>> +++ b/arch/x86/kvm/vmx.c
>>>>>> @@ -2192,7 +2192,8 @@ static __init void nested_vmx_setup_ctls_msrs(void)
>>>>>>  #else
>>>>>>  	nested_vmx_exit_ctls_high = 0;
>>>>>>  #endif
>>>>>> -	nested_vmx_exit_ctls_high |=
>>>>> VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR;
>>>>>> +	nested_vmx_exit_ctls_high |=
>>>>> (VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR |
>>>>>> +				      VM_EXIT_LOAD_IA32_EFER);
>>>>>>
>>>>>>  	/* entry controls */
>>>>>>  	rdmsr(MSR_IA32_VMX_ENTRY_CTLS,
>>>>>> @@ -2201,8 +2202,8 @@ static __init void nested_vmx_setup_ctls_msrs(void)
>>>>>>  	nested_vmx_entry_ctls_low =
>>>>> VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR;
>>>>>>  	nested_vmx_entry_ctls_high &=
>>>>>>  		VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_IA32E_MODE;
>>>>>> -	nested_vmx_entry_ctls_high |=
>>>>> VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR;
>>>>>> -
>>>>>> +	nested_vmx_entry_ctls_high |=
>>>>> (VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR |
>>>>>> +				       VM_ENTRY_LOAD_IA32_EFER);
>>>>>>  	/* cpu-based controls */
>>>>>>  	rdmsr(MSR_IA32_VMX_PROCBASED_CTLS,
>>>>>>  		nested_vmx_procbased_ctls_low, nested_vmx_procbased_ctls_high);
>>>>>> @@ -7492,10 +7493,18 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu,
>>>>> struct vmcs12 *vmcs12)
>>>>>>  	vcpu->arch.cr0_guest_owned_bits &= ~vmcs12->cr0_guest_host_mask;
>>>>>>  	vmcs_writel(CR0_GUEST_HOST_MASK,
>>>>> ~vcpu->arch.cr0_guest_owned_bits);
>>>>>>
>>>>>> -	/* Note: IA32_MODE, LOAD_IA32_EFER are modified by vmx_set_efer
>>>>> below */
>>>>>> -	vmcs_write32(VM_EXIT_CONTROLS,
>>>>>> -		vmcs12->vm_exit_controls | vmcs_config.vmexit_ctrl);
>>>>>> -	vmcs_write32(VM_ENTRY_CONTROLS, vmcs12->vm_entry_controls |
>>>>>> +	/* L2->L1 exit controls are emulated - the hardware exit is to L0 so
>>>>>> +	 * we should use its exit controls. Note that IA32_MODE, LOAD_IA32_EFER
>>>>>> +	 * bits are further modified by vmx_set_efer() below.
>>>>>> +	 */
>>>>>> +	vmcs_write32(VM_EXIT_CONTROLS, vmcs_config.vmexit_ctrl);
>>>> This is wrong. We cannot use L0 exit control directly.
>>>> LOAD_PERF_GLOBAL_CTRL, LOAD_HOST_EFE, LOAD_HOST_PAT, ACK_INTR_ON_EXIT should use host's exit control. But others, still need use (vmcs12|host).
>>>>
>>> I do not see why. We always intercept DR7/PAT/EFER, so save is emulated
>>> too. Host address space size always come from L0 and preemption timer is
>>> not supported for nested IIRC and when it will be host will have to save
>>> it on exit anyway for correct emulation.
>>
>> Preemption timer is already supported and works fine as far as I tested.
>> KVM doesn't use it for L1, so we do not need to save/restore it - IIRC.
>>
> So what happens if L1 configures it to value X after X/2 ticks L0 exit
> happen and L0 gets back to L2 directly. The counter will be X again
> instead of X/2.

Likely. Yes, we need to improve our emulation by setting "Save
VMX-preemption timer value" or by emulating this in software if the hardware
lacks support for it (was this flag introduced after the preemption
timer itself?).

Jan



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 263 bytes --]

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 01/13] nEPT: Support LOAD_IA32_EFER entry/exit controls for L1
  2013-07-02 15:34           ` Jan Kiszka
@ 2013-07-02 15:43             ` Gleb Natapov
  2013-07-04  8:42               ` Zhang, Yang Z
  0 siblings, 1 reply; 52+ messages in thread
From: Gleb Natapov @ 2013-07-02 15:43 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: Zhang, Yang Z, Paolo Bonzini, Nakajima, Jun, kvm

On Tue, Jul 02, 2013 at 05:34:56PM +0200, Jan Kiszka wrote:
> On 2013-07-02 17:15, Gleb Natapov wrote:
> > On Tue, Jul 02, 2013 at 04:28:56PM +0200, Jan Kiszka wrote:
> >> On 2013-07-02 15:59, Gleb Natapov wrote:
> >>> On Tue, Jul 02, 2013 at 03:01:24AM +0000, Zhang, Yang Z wrote:
> >>>> Since this series is pending in mail list for long time. And it's really a big feature for Nested. Also, I doubt the original authors(Jun and Nahav)should not have enough time to continue it. So I will pick it up. :)
> >>>>
> >>>> See comments below:
> >>>>
> >>>> Paolo Bonzini wrote on 2013-05-20:
> >>>>> Il 19/05/2013 06:52, Jun Nakajima ha scritto:
> >>>>>> From: Nadav Har'El <nyh@il.ibm.com>
> >>>>>>
> >>>>>> Recent KVM, since
> >>>>> http://kerneltrap.org/mailarchive/linux-kvm/2010/5/2/6261577
> >>>>>> switch the EFER MSR when EPT is used and the host and guest have different
> >>>>>> NX bits. So if we add support for nested EPT (L1 guest using EPT to run L2)
> >>>>>> and want to be able to run recent KVM as L1, we need to allow L1 to use this
> >>>>>> EFER switching feature.
> >>>>>>
> >>>>>> To do this EFER switching, KVM uses VM_ENTRY/EXIT_LOAD_IA32_EFER if
> >>>>> available,
> >>>>>> and if it isn't, it uses the generic VM_ENTRY/EXIT_MSR_LOAD. This patch adds
> >>>>>> support for the former (the latter is still unsupported).
> >>>>>>
> >>>>>> Nested entry and exit emulation (prepare_vmcs_02 and
> >>>>> load_vmcs12_host_state,
> >>>>>> respectively) already handled VM_ENTRY/EXIT_LOAD_IA32_EFER correctly. So
> >>>>> all
> >>>>>> that's left to do in this patch is to properly advertise this feature to L1.
> >>>>>>
> >>>>>> Note that vmcs12's VM_ENTRY/EXIT_LOAD_IA32_EFER are emulated by L0, by
> >>>>> using
> >>>>>> vmx_set_efer (which itself sets one of several vmcs02 fields), so we always
> >>>>>> support this feature, regardless of whether the host supports it.
> >>>>>>
> >>>>>> Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
> >>>>>> Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
> >>>>>> Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
> >>>>>> ---
> >>>>>>  arch/x86/kvm/vmx.c | 23 ++++++++++++++++-------
> >>>>>>  1 file changed, 16 insertions(+), 7 deletions(-)
> >>>>>>
> >>>>>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> >>>>>> index 260a919..fb9cae5 100644
> >>>>>> --- a/arch/x86/kvm/vmx.c
> >>>>>> +++ b/arch/x86/kvm/vmx.c
> >>>>>> @@ -2192,7 +2192,8 @@ static __init void nested_vmx_setup_ctls_msrs(void)
> >>>>>>  #else
> >>>>>>  	nested_vmx_exit_ctls_high = 0;
> >>>>>>  #endif
> >>>>>> -	nested_vmx_exit_ctls_high |=
> >>>>> VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR;
> >>>>>> +	nested_vmx_exit_ctls_high |=
> >>>>> (VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR |
> >>>>>> +				      VM_EXIT_LOAD_IA32_EFER);
> >>>>>>
> >>>>>>  	/* entry controls */
> >>>>>>  	rdmsr(MSR_IA32_VMX_ENTRY_CTLS,
> >>>>>> @@ -2201,8 +2202,8 @@ static __init void nested_vmx_setup_ctls_msrs(void)
> >>>>>>  	nested_vmx_entry_ctls_low =
> >>>>> VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR;
> >>>>>>  	nested_vmx_entry_ctls_high &=
> >>>>>>  		VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_IA32E_MODE;
> >>>>>> -	nested_vmx_entry_ctls_high |=
> >>>>> VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR;
> >>>>>> -
> >>>>>> +	nested_vmx_entry_ctls_high |=
> >>>>> (VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR |
> >>>>>> +				       VM_ENTRY_LOAD_IA32_EFER);
> >>>>>>  	/* cpu-based controls */
> >>>>>>  	rdmsr(MSR_IA32_VMX_PROCBASED_CTLS,
> >>>>>>  		nested_vmx_procbased_ctls_low, nested_vmx_procbased_ctls_high);
> >>>>>> @@ -7492,10 +7493,18 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu,
> >>>>> struct vmcs12 *vmcs12)
> >>>>>>  	vcpu->arch.cr0_guest_owned_bits &= ~vmcs12->cr0_guest_host_mask;
> >>>>>>  	vmcs_writel(CR0_GUEST_HOST_MASK,
> >>>>> ~vcpu->arch.cr0_guest_owned_bits);
> >>>>>>
> >>>>>> -	/* Note: IA32_MODE, LOAD_IA32_EFER are modified by vmx_set_efer
> >>>>> below */
> >>>>>> -	vmcs_write32(VM_EXIT_CONTROLS,
> >>>>>> -		vmcs12->vm_exit_controls | vmcs_config.vmexit_ctrl);
> >>>>>> -	vmcs_write32(VM_ENTRY_CONTROLS, vmcs12->vm_entry_controls |
> >>>>>> +	/* L2->L1 exit controls are emulated - the hardware exit is to L0 so
> >>>>>> +	 * we should use its exit controls. Note that IA32_MODE, LOAD_IA32_EFER
> >>>>>> +	 * bits are further modified by vmx_set_efer() below.
> >>>>>> +	 */
> >>>>>> +	vmcs_write32(VM_EXIT_CONTROLS, vmcs_config.vmexit_ctrl);
> >>>> This is wrong. We cannot use L0 exit control directly.
> >>>> LOAD_PERF_GLOBAL_CTRL, LOAD_HOST_EFE, LOAD_HOST_PAT, ACK_INTR_ON_EXIT should use host's exit control. But others, still need use (vmcs12|host).
> >>>>
> >>> I do not see why. We always intercept DR7/PAT/EFER, so save is emulated
> >>> too. Host address space size always come from L0 and preemption timer is
> >>> not supported for nested IIRC and when it will be host will have to save
> >>> it on exit anyway for correct emulation.
> >>
> >> Preemption timer is already supported and works fine as far as I tested.
> >> KVM doesn't use it for L1, so we do not need to save/restore it - IIRC.
> >>
> > So what happens if L1 configures it to value X after X/2 ticks L0 exit
> > happen and L0 gets back to L2 directly. The counter will be X again
> > instead of X/2.
> 
> Likely. Yes, we need to improve our emulation by setting "Save
> VMX-preemption timer value" or emulate this in software if the hardware
> lacks support for it (was this flag introduced after the preemption
> timer itself?).
> 
Not sure, but my point was that for correct emulation the host needs to set
"save preempt timer on vmexit" anyway, so all VM_EXIT_CONTROLS are
indeed emulated as far as I see.

--
			Gleb.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* RE: [PATCH v3 01/13] nEPT: Support LOAD_IA32_EFER entry/exit controls for L1
  2013-07-02 15:43             ` Gleb Natapov
@ 2013-07-04  8:42               ` Zhang, Yang Z
  2013-07-08 12:37                 ` Gleb Natapov
  0 siblings, 1 reply; 52+ messages in thread
From: Zhang, Yang Z @ 2013-07-04  8:42 UTC (permalink / raw)
  To: Gleb Natapov, Jan Kiszka; +Cc: Paolo Bonzini, Nakajima, Jun, kvm

Gleb Natapov wrote on 2013-07-02:
> On Tue, Jul 02, 2013 at 05:34:56PM +0200, Jan Kiszka wrote:
>> On 2013-07-02 17:15, Gleb Natapov wrote:
>>> On Tue, Jul 02, 2013 at 04:28:56PM +0200, Jan Kiszka wrote:
>>>> On 2013-07-02 15:59, Gleb Natapov wrote:
>>>>> On Tue, Jul 02, 2013 at 03:01:24AM +0000, Zhang, Yang Z wrote:
>>>>>> Since this series is pending in mail list for long time. And
>>>>>> it's really a big feature for Nested. Also, I doubt the
>>>>>> original authors(Jun and Nahav)should not have enough time to continue it.
>>>>>> So I will pick it up. :)
>>>>>> 
>>>>>> See comments below:
>>>>>> 
>>>>>> Paolo Bonzini wrote on 2013-05-20:
>>>>>>> Il 19/05/2013 06:52, Jun Nakajima ha scritto:
>>>>>>>> From: Nadav Har'El <nyh@il.ibm.com>
>>>>>>>> 
>>>>>>>> Recent KVM, since
>>>>>>>> http://kerneltrap.org/mailarchive/linux-kvm/2010/5/2/6261577
>>>>>>>> switch the EFER MSR when EPT is used and the host and guest have
>>>>>>>> different NX bits. So if we add support for nested EPT (L1 guest
>>>>>>>> using EPT to run L2) and want to be able to run recent KVM as L1,
>>>>>>>> we need to allow L1 to use this EFER switching feature.
>>>>>>>> 
>>>>>>>> To do this EFER switching, KVM uses VM_ENTRY/EXIT_LOAD_IA32_EFER
>>>>>>>> if available, and if it isn't, it uses the generic
>>>>>>>> VM_ENTRY/EXIT_MSR_LOAD. This patch adds support for the former
>>>>>>>> (the latter is still unsupported).
>>>>>>>> 
>>>>>>>> Nested entry and exit emulation (prepare_vmcs_02 and
>>>>>>>> load_vmcs12_host_state, respectively) already handled
>>>>>>>> VM_ENTRY/EXIT_LOAD_IA32_EFER correctly. So all that's left to do
>>>>>>>> in this patch is to properly advertise this feature to L1.
>>>>>>>> 
>>>>>>>> Note that vmcs12's VM_ENTRY/EXIT_LOAD_IA32_EFER are emulated by
>>>>>>>> L0, by using vmx_set_efer (which itself sets one of several
>>>>>>>> vmcs02 fields), so we always support this feature, regardless of
>>>>>>>> whether the host supports it.
>>>>>>>> 
>>>>>>>> Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
>>>>>>>> Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
>>>>>>>> Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
>>>>>>>> ---
>>>>>>>>  arch/x86/kvm/vmx.c | 23 ++++++++++++++++-------
>>>>>>>>  1 file changed, 16 insertions(+), 7 deletions(-)
>>>>>>>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index
>>>>>>>> 260a919..fb9cae5 100644
>>>>>>>> --- a/arch/x86/kvm/vmx.c
>>>>>>>> +++ b/arch/x86/kvm/vmx.c
>>>>>>>> @@ -2192,7 +2192,8 @@ static __init void
>>>>>>>> nested_vmx_setup_ctls_msrs(void)  #else
>>>>>>>>  	nested_vmx_exit_ctls_high = 0;  #endif
>>>>>>>> -	nested_vmx_exit_ctls_high |= VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR;
>>>>>>>> +	nested_vmx_exit_ctls_high |= (VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR
>>>>>>>> | +				      VM_EXIT_LOAD_IA32_EFER);
>>>>>>>> 
>>>>>>>>  	/* entry controls */
>>>>>>>>  	rdmsr(MSR_IA32_VMX_ENTRY_CTLS, @@ -2201,8 +2202,8
> @@ static
>>>>>>>> __init void nested_vmx_setup_ctls_msrs(void)
>>>>>>>>  	nested_vmx_entry_ctls_low = VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR;
>>>>>>>>  	nested_vmx_entry_ctls_high &= 		VM_ENTRY_LOAD_IA32_PAT |
>>>>>>>>  VM_ENTRY_IA32E_MODE;
>>>>>>>> -	nested_vmx_entry_ctls_high |=
>>>>>>>> VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR; -
>>>>>>>> +	nested_vmx_entry_ctls_high |=
>>>>>>>> (VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR | +				      
>>>>>>>> VM_ENTRY_LOAD_IA32_EFER);
>>>>>>>>  	/* cpu-based controls */
>>>>>>>>  	rdmsr(MSR_IA32_VMX_PROCBASED_CTLS,
>>>>>>>>  		nested_vmx_procbased_ctls_low,
>>>>>>>> nested_vmx_procbased_ctls_high); @@ -7492,10 +7493,18 @@ static
>>>>>>>> void prepare_vmcs02(struct kvm_vcpu *vcpu,
>>>>>>> struct vmcs12 *vmcs12)
>>>>>>>>  	vcpu->arch.cr0_guest_owned_bits &=
>>>>>>>>  ~vmcs12->cr0_guest_host_mask; 	vmcs_writel(CR0_GUEST_HOST_MASK,
>>>>>>> ~vcpu->arch.cr0_guest_owned_bits);
>>>>>>>> 
>>>>>>>> -	/* Note: IA32_MODE, LOAD_IA32_EFER are modified by
> vmx_set_efer
>>>>>>> below */
>>>>>>>> -	vmcs_write32(VM_EXIT_CONTROLS, -		vmcs12->vm_exit_controls |
>>>>>>>> vmcs_config.vmexit_ctrl); -	vmcs_write32(VM_ENTRY_CONTROLS,
>>>>>>>> vmcs12->vm_entry_controls | +	/* L2->L1 exit controls are
>>>>>>>> emulated - the hardware exit is +to L0 so +	 * we should use its
>>>>>>>> exit controls. Note that IA32_MODE, LOAD_IA32_EFER +	 * bits are
>>>>>>>> further modified by vmx_set_efer() below. +	 */
>>>>>>>> +	vmcs_write32(VM_EXIT_CONTROLS, vmcs_config.vmexit_ctrl);
>>>>>> This is wrong. We cannot use L0 exit control directly.
>>>>>> LOAD_PERF_GLOBAL_CTRL, LOAD_HOST_EFE, LOAD_HOST_PAT,
> ACK_INTR_ON_EXIT should use host's exit control. But others, still
> need use (vmcs12|host).
>>>>>> 
>>>>> I do not see why. We always intercept DR7/PAT/EFER, so save is
>>>>> emulated too. Host address space size always come from L0 and
>>>>> preemption timer is not supported for nested IIRC and when it
>>>>> will be host will have to save it on exit anyway for correct emulation.
>>>> 
>>>> Preemption timer is already supported and works fine as far as I tested.
>>>> KVM doesn't use it for L1, so we do not need to save/restore it - IIRC.
>>>> 
>>> So what happens if L1 configures it to value X after X/2 ticks L0
>>> exit happen and L0 gets back to L2 directly. The counter will be X
>>> again instead of X/2.
>> 
>> Likely. Yes, we need to improve our emulation by setting "Save
>> VMX-preemption timer value" or emulate this in software if the
>> hardware lacks support for it (was this flag introduced after the
>> preemption timer itself?).
>> 
> Not sure, but my point was that for correct emulation host needs to
> set "save preempt timer on vmexit" anyway so all VM_EXIT_CONTROLS are
> indeed emulated as far as I see.
>
Ok, here is my summary, please correct me if I am wrong:
bit 2: Save debug controls, the first processors only supported the 1-setting, so just using the host's setting is enough.
bit 9: Host address space size, it indicates the host's state, so we must use the host's setting.
bit 12: Load IA32_PERF_GLOBAL_CTRL: same as above.
bit 15: Acknowledge interrupt on exit: same as above.
bit 19: Load IA32_PAT: same as above.
bit 21: Load IA32_EFER: same as above.
bit 18: Save IA32_PAT: we didn't expose it to L1, so using the host's setting is ok.
bit 20: Save IA32_EFER: same as above.
bit 22: Save VMX-preemption timer value: I don't see KVM exposing it to L1, but Jan said it's working. Strange! And according to Gleb's suggestion, it is better to always set it.

So, currently, only using the host's exit_control for L2 is enough.

Best regards,
Yang



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 01/13] nEPT: Support LOAD_IA32_EFER entry/exit controls for L1
  2013-07-04  8:42               ` Zhang, Yang Z
@ 2013-07-08 12:37                 ` Gleb Natapov
  2013-07-08 14:28                   ` Zhang, Yang Z
  0 siblings, 1 reply; 52+ messages in thread
From: Gleb Natapov @ 2013-07-08 12:37 UTC (permalink / raw)
  To: Zhang, Yang Z; +Cc: Jan Kiszka, Paolo Bonzini, Nakajima, Jun, kvm

On Thu, Jul 04, 2013 at 08:42:53AM +0000, Zhang, Yang Z wrote:
> Gleb Natapov wrote on 2013-07-02:
> > On Tue, Jul 02, 2013 at 05:34:56PM +0200, Jan Kiszka wrote:
> >> On 2013-07-02 17:15, Gleb Natapov wrote:
> >>> On Tue, Jul 02, 2013 at 04:28:56PM +0200, Jan Kiszka wrote:
> >>>> On 2013-07-02 15:59, Gleb Natapov wrote:
> >>>>> On Tue, Jul 02, 2013 at 03:01:24AM +0000, Zhang, Yang Z wrote:
> >>>>>> Since this series is pending in mail list for long time. And
> >>>>>> it's really a big feature for Nested. Also, I doubt the
> >>>>>> original authors(Jun and Nahav)should not have enough time to continue it.
> >>>>>> So I will pick it up. :)
> >>>>>> 
> >>>>>> See comments below:
> >>>>>> 
> >>>>>> Paolo Bonzini wrote on 2013-05-20:
> >>>>>>> Il 19/05/2013 06:52, Jun Nakajima ha scritto:
> >>>>>>>> From: Nadav Har'El <nyh@il.ibm.com>
> >>>>>>>> 
> >>>>>>>> Recent KVM, since
> >>>>>>>> http://kerneltrap.org/mailarchive/linux-kvm/2010/5/2/6261577
> >>>>>>>> switch the EFER MSR when EPT is used and the host and guest have
> >>>>>>>> different NX bits. So if we add support for nested EPT (L1 guest
> >>>>>>>> using EPT to run L2) and want to be able to run recent KVM as L1,
> >>>>>>>> we need to allow L1 to use this EFER switching feature.
> >>>>>>>> 
> >>>>>>>> To do this EFER switching, KVM uses VM_ENTRY/EXIT_LOAD_IA32_EFER
> >>>>>>>> if available, and if it isn't, it uses the generic
> >>>>>>>> VM_ENTRY/EXIT_MSR_LOAD. This patch adds support for the former
> >>>>>>>> (the latter is still unsupported).
> >>>>>>>> 
> >>>>>>>> Nested entry and exit emulation (prepare_vmcs_02 and
> >>>>>>>> load_vmcs12_host_state, respectively) already handled
> >>>>>>>> VM_ENTRY/EXIT_LOAD_IA32_EFER correctly. So all that's left to do
> >>>>>>>> in this patch is to properly advertise this feature to L1.
> >>>>>>>> 
> >>>>>>>> Note that vmcs12's VM_ENTRY/EXIT_LOAD_IA32_EFER are emulated by
> >>>>>>>> L0, by using vmx_set_efer (which itself sets one of several
> >>>>>>>> vmcs02 fields), so we always support this feature, regardless of
> >>>>>>>> whether the host supports it.
> >>>>>>>> 
> >>>>>>>> Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
> >>>>>>>> Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
> >>>>>>>> Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
> >>>>>>>> ---
> >>>>>>>>  arch/x86/kvm/vmx.c | 23 ++++++++++++++++-------
> >>>>>>>>  1 file changed, 16 insertions(+), 7 deletions(-)
> >>>>>>>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index
> >>>>>>>> 260a919..fb9cae5 100644
> >>>>>>>> --- a/arch/x86/kvm/vmx.c
> >>>>>>>> +++ b/arch/x86/kvm/vmx.c
> >>>>>>>> @@ -2192,7 +2192,8 @@ static __init void
> >>>>>>>> nested_vmx_setup_ctls_msrs(void)  #else
> >>>>>>>>  	nested_vmx_exit_ctls_high = 0;  #endif
> >>>>>>>> -	nested_vmx_exit_ctls_high |= VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR;
> >>>>>>>> +	nested_vmx_exit_ctls_high |= (VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR
> >>>>>>>> | +				      VM_EXIT_LOAD_IA32_EFER);
> >>>>>>>> 
> >>>>>>>>  	/* entry controls */
> >>>>>>>>  	rdmsr(MSR_IA32_VMX_ENTRY_CTLS, @@ -2201,8 +2202,8
> > @@ static
> >>>>>>>> __init void nested_vmx_setup_ctls_msrs(void)
> >>>>>>>>  	nested_vmx_entry_ctls_low = VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR;
> >>>>>>>>  	nested_vmx_entry_ctls_high &= 		VM_ENTRY_LOAD_IA32_PAT |
> >>>>>>>>  VM_ENTRY_IA32E_MODE;
> >>>>>>>> -	nested_vmx_entry_ctls_high |=
> >>>>>>>> VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR; -
> >>>>>>>> +	nested_vmx_entry_ctls_high |=
> >>>>>>>> (VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR | +				      
> >>>>>>>> VM_ENTRY_LOAD_IA32_EFER);
> >>>>>>>>  	/* cpu-based controls */
> >>>>>>>>  	rdmsr(MSR_IA32_VMX_PROCBASED_CTLS,
> >>>>>>>>  		nested_vmx_procbased_ctls_low,
> >>>>>>>> nested_vmx_procbased_ctls_high); @@ -7492,10 +7493,18 @@ static
> >>>>>>>> void prepare_vmcs02(struct kvm_vcpu *vcpu,
> >>>>>>> struct vmcs12 *vmcs12)
> >>>>>>>>  	vcpu->arch.cr0_guest_owned_bits &=
> >>>>>>>>  ~vmcs12->cr0_guest_host_mask; 	vmcs_writel(CR0_GUEST_HOST_MASK,
> >>>>>>> ~vcpu->arch.cr0_guest_owned_bits);
> >>>>>>>> 
> >>>>>>>> -	/* Note: IA32_MODE, LOAD_IA32_EFER are modified by
> > vmx_set_efer
> >>>>>>> below */
> >>>>>>>> -	vmcs_write32(VM_EXIT_CONTROLS, -		vmcs12->vm_exit_controls |
> >>>>>>>> vmcs_config.vmexit_ctrl); -	vmcs_write32(VM_ENTRY_CONTROLS,
> >>>>>>>> vmcs12->vm_entry_controls | +	/* L2->L1 exit controls are
> >>>>>>>> emulated - the hardware exit is +to L0 so +	 * we should use its
> >>>>>>>> exit controls. Note that IA32_MODE, LOAD_IA32_EFER +	 * bits are
> >>>>>>>> further modified by vmx_set_efer() below. +	 */
> >>>>>>>> +	vmcs_write32(VM_EXIT_CONTROLS, vmcs_config.vmexit_ctrl);
> >>>>>> This is wrong. We cannot use L0 exit control directly.
> >>>>>> LOAD_PERF_GLOBAL_CTRL, LOAD_HOST_EFE, LOAD_HOST_PAT,
> > ACK_INTR_ON_EXIT should use host's exit control. But others, still
> > need use (vmcs12|host).
> >>>>>> 
> >>>>> I do not see why. We always intercept DR7/PAT/EFER, so save is
> >>>>> emulated too. Host address space size always come from L0 and
> >>>>> preemption timer is not supported for nested IIRC and when it
> >>>>> will be host will have to save it on exit anyway for correct emulation.
> >>>> 
> >>>> Preemption timer is already supported and works fine as far as I tested.
> >>>> KVM doesn't use it for L1, so we do not need to save/restore it - IIRC.
> >>>> 
> >>> So what happens if L1 configures it to value X after X/2 ticks L0
> >>> exit happen and L0 gets back to L2 directly. The counter will be X
> >>> again instead of X/2.
> >> 
> >> Likely. Yes, we need to improve our emulation by setting "Save
> >> VMX-preemption timer value" or emulate this in software if the
> >> hardware lacks support for it (was this flag introduced after the
> >> preemption timer itself?).
> >> 
> > Not sure, but my point was that for correct emulation host needs to
> > set "save preempt timer on vmexit" anyway so all VM_EXIT_CONTROLS are
> > indeed emulated as far as I see.
> >
> Ok, here is my summary, please correct me if I am wrong:
> bit 2: Save debug controls, the first processor only support 1-setting on it, so just use host's setting is enough
Not because the first processors only supported the 1-setting, but because L0
intercepts all changes to DR7 and the DEBUGCTL MSR, so L2 cannot change them
behind L0's back. If L1 asks to save them in vmcs12, L0 can do it during
vmexit emulation.
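
A rough sketch of that idea (purely illustrative; the VM_EXIT_SAVE_DEBUG_CONTROLS
name used for bit 2 below is an assumption, not taken verbatim from the headers):

        if (vmcs12->vm_exit_controls & VM_EXIT_SAVE_DEBUG_CONTROLS) {
                /* copy values L0 already tracks into vmcs12 on the emulated vmexit */
                vmcs12->guest_dr7 = vmcs_readl(GUEST_DR7);
                vmcs12->guest_ia32_debugctl = vmcs_read64(GUEST_IA32_DEBUGCTL);
        }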

> bit 9: Host address space size, it indicate the host's state, so must use host's setting.
> bit 12: Load IA32_PERF_GLOBAL_CTRL: same as above.
Not sure what "above" you mean. It is fully emulated during vmexit
emulation. We do not want PERF_GLOBAL_CTRL to be loaded during the L2->L0
vmexit; we want it to be loaded in L1 after L0 emulates the L2->L1 vmexit.
But I think there is a bug in the current emulation. The code looks like
this:

        if (vmcs12->vm_exit_controls & VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL)
                vmcs_write64(GUEST_IA32_PERF_GLOBAL_CTRL,
                        vmcs12->host_ia32_perf_global_ctrl);

But GUEST_IA32_PERF_GLOBAL_CTRL will not be loaded during L1 entry unless
VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL is set. Why do we assume it is set
here?
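
If it is not, one possible direction (just a sketch, not a tested fix, and
assuming the hardware advertises VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL) would
be to enable the load explicitly when writing the field:

        if (vmcs12->vm_exit_controls & VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL) {
                vmcs_write64(GUEST_IA32_PERF_GLOBAL_CTRL,
                             vmcs12->host_ia32_perf_global_ctrl);
                /* make sure the value is actually loaded on the next L1 entry */
                vmcs_write32(VM_ENTRY_CONTROLS, vmcs_read32(VM_ENTRY_CONTROLS) |
                             VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL);
        }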

> bit 15 : Acknowledge interrupt on exit: same as above.
Same as bit 9.

> bit 19: Load IA32_PAT: same as above.
> bit 20: Load IA32_EFER: same as above.
Those two are same as bit 12.

> bit 18: Save IA32_PAT, Didn't expose it to L1, so use host' setting is ok.
> bit 19: Save IA32_EFER, same as above.
Those two are the same a bit 2.

> bit 22: Save VMXpreemption timer value, I don't see KVM expose it to L1, but Jan said it's working. Strange! And according gleb's suggestion, it better to always set it.
>
It is exposed in nested_vmx_setup_ctls_msrs:
 
        nested_vmx_pinbased_ctls_high &= PIN_BASED_EXT_INTR_MASK |
                PIN_BASED_NMI_EXITING | PIN_BASED_VIRTUAL_NMIS |
                PIN_BASED_VMX_PREEMPTION_TIMER;

According to Gleb's suggestion the current emulation is broken, and to fix it
the bit will have to be set on each L2 entry if L1 is using the preemption
timer.
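
Roughly along these lines in prepare_vmcs02() (a sketch only, assuming the
hardware supports VM_EXIT_SAVE_VMX_PREEMPTION_TIMER):

        u32 exit_ctrl = vmcs_config.vmexit_ctrl;

        /* have hardware save the remaining ticks so L0 can propagate them */
        if (vmcs12->pin_based_vm_exec_control & PIN_BASED_VMX_PREEMPTION_TIMER)
                exit_ctrl |= VM_EXIT_SAVE_VMX_PREEMPTION_TIMER;
        vmcs_write32(VM_EXIT_CONTROLS, exit_ctrl);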

> So, currently, only use host' exit_control for L2 is enough.
> 
> Best regards,
> Yang
> 

--
			Gleb.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* RE: [PATCH v3 01/13] nEPT: Support LOAD_IA32_EFER entry/exit controls for L1
  2013-07-08 12:37                 ` Gleb Natapov
@ 2013-07-08 14:28                   ` Zhang, Yang Z
  2013-07-08 16:08                     ` Gleb Natapov
  0 siblings, 1 reply; 52+ messages in thread
From: Zhang, Yang Z @ 2013-07-08 14:28 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Jan Kiszka, Paolo Bonzini, Nakajima, Jun, kvm

> -----Original Message-----
> From: Gleb Natapov [mailto:gleb@redhat.com]
> Sent: Monday, July 08, 2013 8:38 PM
> To: Zhang, Yang Z
> Cc: Jan Kiszka; Paolo Bonzini; Nakajima, Jun; kvm@vger.kernel.org
> Subject: Re: [PATCH v3 01/13] nEPT: Support LOAD_IA32_EFER entry/exit
> controls for L1
> 
> On Thu, Jul 04, 2013 at 08:42:53AM +0000, Zhang, Yang Z wrote:
> > Gleb Natapov wrote on 2013-07-02:
> > > On Tue, Jul 02, 2013 at 05:34:56PM +0200, Jan Kiszka wrote:
> > >> On 2013-07-02 17:15, Gleb Natapov wrote:
> > >>> On Tue, Jul 02, 2013 at 04:28:56PM +0200, Jan Kiszka wrote:
> > >>>> On 2013-07-02 15:59, Gleb Natapov wrote:
> > >>>>> On Tue, Jul 02, 2013 at 03:01:24AM +0000, Zhang, Yang Z wrote:
> > >>>>>> Since this series is pending in mail list for long time. And
> > >>>>>> it's really a big feature for Nested. Also, I doubt the
> > >>>>>> original authors(Jun and Nahav)should not have enough time to
> continue it.
> > >>>>>> So I will pick it up. :)
> > >>>>>>
> > >>>>>> See comments below:
> > >>>>>>
> > >>>>>> Paolo Bonzini wrote on 2013-05-20:
> > >>>>>>> Il 19/05/2013 06:52, Jun Nakajima ha scritto:
> > >>>>>>>> From: Nadav Har'El <nyh@il.ibm.com>
> > >>>>>>>>
> > >>>>>>>> Recent KVM, since
> > >>>>>>>> http://kerneltrap.org/mailarchive/linux-kvm/2010/5/2/6261577
> > >>>>>>>> switch the EFER MSR when EPT is used and the host and guest
> have
> > >>>>>>>> different NX bits. So if we add support for nested EPT (L1 guest
> > >>>>>>>> using EPT to run L2) and want to be able to run recent KVM as L1,
> > >>>>>>>> we need to allow L1 to use this EFER switching feature.
> > >>>>>>>>
> > >>>>>>>> To do this EFER switching, KVM uses
> VM_ENTRY/EXIT_LOAD_IA32_EFER
> > >>>>>>>> if available, and if it isn't, it uses the generic
> > >>>>>>>> VM_ENTRY/EXIT_MSR_LOAD. This patch adds support for the
> former
> > >>>>>>>> (the latter is still unsupported).
> > >>>>>>>>
> > >>>>>>>> Nested entry and exit emulation (prepare_vmcs_02 and
> > >>>>>>>> load_vmcs12_host_state, respectively) already handled
> > >>>>>>>> VM_ENTRY/EXIT_LOAD_IA32_EFER correctly. So all that's left to do
> > >>>>>>>> in this patch is to properly advertise this feature to L1.
> > >>>>>>>>
> > >>>>>>>> Note that vmcs12's VM_ENTRY/EXIT_LOAD_IA32_EFER are
> emulated by
> > >>>>>>>> L0, by using vmx_set_efer (which itself sets one of several
> > >>>>>>>> vmcs02 fields), so we always support this feature, regardless of
> > >>>>>>>> whether the host supports it.
> > >>>>>>>>
> > >>>>>>>> Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
> > >>>>>>>> Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
> > >>>>>>>> Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
> > >>>>>>>> ---
> > >>>>>>>>  arch/x86/kvm/vmx.c | 23 ++++++++++++++++-------
> > >>>>>>>>  1 file changed, 16 insertions(+), 7 deletions(-)
> > >>>>>>>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index
> > >>>>>>>> 260a919..fb9cae5 100644
> > >>>>>>>> --- a/arch/x86/kvm/vmx.c
> > >>>>>>>> +++ b/arch/x86/kvm/vmx.c
> > >>>>>>>> @@ -2192,7 +2192,8 @@ static __init void
> > >>>>>>>> nested_vmx_setup_ctls_msrs(void)  #else
> > >>>>>>>>  	nested_vmx_exit_ctls_high = 0;  #endif
> > >>>>>>>> -	nested_vmx_exit_ctls_high |=
> VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR;
> > >>>>>>>> +	nested_vmx_exit_ctls_high |=
> (VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR
> > >>>>>>>> | +				      VM_EXIT_LOAD_IA32_EFER);
> > >>>>>>>>
> > >>>>>>>>  	/* entry controls */
> > >>>>>>>>  	rdmsr(MSR_IA32_VMX_ENTRY_CTLS, @@ -2201,8 +2202,8
> > > @@ static
> > >>>>>>>> __init void nested_vmx_setup_ctls_msrs(void)
> > >>>>>>>>  	nested_vmx_entry_ctls_low =
> VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR;
> > >>>>>>>>  	nested_vmx_entry_ctls_high &=
> 	VM_ENTRY_LOAD_IA32_PAT |
> > >>>>>>>>  VM_ENTRY_IA32E_MODE;
> > >>>>>>>> -	nested_vmx_entry_ctls_high |=
> > >>>>>>>> VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR; -
> > >>>>>>>> +	nested_vmx_entry_ctls_high |=
> > >>>>>>>> (VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR | +
> > >>>>>>>> VM_ENTRY_LOAD_IA32_EFER);
> > >>>>>>>>  	/* cpu-based controls */
> > >>>>>>>>  	rdmsr(MSR_IA32_VMX_PROCBASED_CTLS,
> > >>>>>>>>  		nested_vmx_procbased_ctls_low,
> > >>>>>>>> nested_vmx_procbased_ctls_high); @@ -7492,10 +7493,18 @@
> static
> > >>>>>>>> void prepare_vmcs02(struct kvm_vcpu *vcpu,
> > >>>>>>> struct vmcs12 *vmcs12)
> > >>>>>>>>  	vcpu->arch.cr0_guest_owned_bits &=
> > >>>>>>>>  ~vmcs12->cr0_guest_host_mask;
> 	vmcs_writel(CR0_GUEST_HOST_MASK,
> > >>>>>>> ~vcpu->arch.cr0_guest_owned_bits);
> > >>>>>>>>
> > >>>>>>>> -	/* Note: IA32_MODE, LOAD_IA32_EFER are modified by
> > > vmx_set_efer
> > >>>>>>> below */
> > >>>>>>>> -	vmcs_write32(VM_EXIT_CONTROLS, -
> 	vmcs12->vm_exit_controls |
> > >>>>>>>> vmcs_config.vmexit_ctrl); -	vmcs_write32(VM_ENTRY_CONTROLS,
> > >>>>>>>> vmcs12->vm_entry_controls | +	/* L2->L1 exit controls are
> > >>>>>>>> emulated - the hardware exit is +to L0 so +	 * we should use
> its
> > >>>>>>>> exit controls. Note that IA32_MODE, LOAD_IA32_EFER +	 *
> bits are
> > >>>>>>>> further modified by vmx_set_efer() below. +	 */
> > >>>>>>>> +	vmcs_write32(VM_EXIT_CONTROLS, vmcs_config.vmexit_ctrl);
> > >>>>>> This is wrong. We cannot use L0 exit control directly.
> > >>>>>> LOAD_PERF_GLOBAL_CTRL, LOAD_HOST_EFE, LOAD_HOST_PAT,
> > > ACK_INTR_ON_EXIT should use host's exit control. But others, still
> > > need use (vmcs12|host).
> > >>>>>>
> > >>>>> I do not see why. We always intercept DR7/PAT/EFER, so save is
> > >>>>> emulated too. Host address space size always come from L0 and
> > >>>>> preemption timer is not supported for nested IIRC and when it
> > >>>>> will be host will have to save it on exit anyway for correct emulation.
> > >>>>
> > >>>> Preemption timer is already supported and works fine as far as I tested.
> > >>>> KVM doesn't use it for L1, so we do not need to save/restore it - IIRC.
> > >>>>
> > >>> So what happens if L1 configures it to value X after X/2 ticks L0
> > >>> exit happen and L0 gets back to L2 directly. The counter will be X
> > >>> again instead of X/2.
> > >>
> > >> Likely. Yes, we need to improve our emulation by setting "Save
> > >> VMX-preemption timer value" or emulate this in software if the
> > >> hardware lacks support for it (was this flag introduced after the
> > >> preemption timer itself?).
> > >>
> > > Not sure, but my point was that for correct emulation host needs to
> > > set "save preempt timer on vmexit" anyway so all VM_EXIT_CONTROLS are
> > > indeed emulated as far as I see.
> > >
> > Ok, here is my summary, please correct me if I am wrong:
> > bit 2: Save debug controls, the first processor only support 1-setting on it, so
> just use host's setting is enough
> Not because first processor only supported 1-setting, but because L0
> intercepts all changes to DR7 and DEBUGCTL MSR, so L2 cannot change them
> behind L0 back. If L1 asks to save them in vmcs12 L0 can do it during
> vmexit emulation.
> 
> > bit 9: Host address space size, it indicate the host's state, so must use host's
> setting.
> > bit 12: Load IA32_PERF_GLOBAL_CTRL: same as above.
> Not sure what "above" do you mean. It is fully emulated during vmexit
> emulation. We do not want PERF_GLOBAL_CTRL to be loaded during L2->L0
I don't understand why PERF_GLOBAL_CTRL / Load IA32_PAT / Load IA32_EFER shouldn't be loaded during L2->L0?

> vmexit, we want it to be loaded in L1 after L0 emulates L2->L1 vmexit.
> But I think there is a bug in current emulation. The code looks like
> this:
> 
>         if (vmcs12->vm_exit_controls &
> VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL)
>                 vmcs_write64(GUEST_IA32_PERF_GLOBAL_CTRL,
>                         vmcs12->host_ia32_perf_global_ctrl);
> 
> But GUEST_IA32_PERF_GLOBAL_CTRL will not be loaded during L1 entry unless
> VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL is set. Why do we assume it is
> set
> here?
You are right. It seems to be a bug in the current emulation.

> 
> > bit 15 : Acknowledge interrupt on exit: same as above.
> Same as bit 9.
> 
> > bit 19: Load IA32_PAT: same as above.
> > bit 20: Load IA32_EFER: same as above.
> Those two are same as bit 12.
> 
> > bit 18: Save IA32_PAT, Didn't expose it to L1, so use host' setting is ok.
> > bit 19: Save IA32_EFER, same as above.
> Those two are the same a bit 2.
> 
> > bit 22: Save VMXpreemption timer value, I don't see KVM expose it to L1, but
> Jan said it's working. Strange! And according gleb's suggestion, it better to
> always set it.
> >
> It exposed it in nested_vmx_setup_ctls_msrs:
> 
>         nested_vmx_pinbased_ctls_high &= PIN_BASED_EXT_INTR_MASK |
>                 PIN_BASED_NMI_EXITING | PIN_BASED_VIRTUAL_NMIS |
>                 PIN_BASED_VMX_PREEMPTION_TIMER;
> According to Gleb's suggestion current emulation is broken and to fix it
> the bit will have to be set on each L2 entry if L1 is using preemption timer.
> 
> > So, currently, only use host' exit_control for L2 is enough.
> >
> > Best regards,
> > Yang
> >

Best regards,
Yang

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 01/13] nEPT: Support LOAD_IA32_EFER entry/exit controls for L1
  2013-07-08 14:28                   ` Zhang, Yang Z
@ 2013-07-08 16:08                     ` Gleb Natapov
  0 siblings, 0 replies; 52+ messages in thread
From: Gleb Natapov @ 2013-07-08 16:08 UTC (permalink / raw)
  To: Zhang, Yang Z; +Cc: Jan Kiszka, Paolo Bonzini, Nakajima, Jun, kvm

On Mon, Jul 08, 2013 at 02:28:15PM +0000, Zhang, Yang Z wrote:
> > -----Original Message-----
> > From: Gleb Natapov [mailto:gleb@redhat.com]
> > Sent: Monday, July 08, 2013 8:38 PM
> > To: Zhang, Yang Z
> > Cc: Jan Kiszka; Paolo Bonzini; Nakajima, Jun; kvm@vger.kernel.org
> > Subject: Re: [PATCH v3 01/13] nEPT: Support LOAD_IA32_EFER entry/exit
> > controls for L1
> > 
> > On Thu, Jul 04, 2013 at 08:42:53AM +0000, Zhang, Yang Z wrote:
> > > Gleb Natapov wrote on 2013-07-02:
> > > > On Tue, Jul 02, 2013 at 05:34:56PM +0200, Jan Kiszka wrote:
> > > >> On 2013-07-02 17:15, Gleb Natapov wrote:
> > > >>> On Tue, Jul 02, 2013 at 04:28:56PM +0200, Jan Kiszka wrote:
> > > >>>> On 2013-07-02 15:59, Gleb Natapov wrote:
> > > >>>>> On Tue, Jul 02, 2013 at 03:01:24AM +0000, Zhang, Yang Z wrote:
> > > >>>>>> Since this series is pending in mail list for long time. And
> > > >>>>>> it's really a big feature for Nested. Also, I doubt the
> > > >>>>>> original authors(Jun and Nahav)should not have enough time to
> > continue it.
> > > >>>>>> So I will pick it up. :)
> > > >>>>>>
> > > >>>>>> See comments below:
> > > >>>>>>
> > > >>>>>> Paolo Bonzini wrote on 2013-05-20:
> > > >>>>>>> Il 19/05/2013 06:52, Jun Nakajima ha scritto:
> > > >>>>>>>> From: Nadav Har'El <nyh@il.ibm.com>
> > > >>>>>>>>
> > > >>>>>>>> Recent KVM, since
> > > >>>>>>>> http://kerneltrap.org/mailarchive/linux-kvm/2010/5/2/6261577
> > > >>>>>>>> switch the EFER MSR when EPT is used and the host and guest
> > have
> > > >>>>>>>> different NX bits. So if we add support for nested EPT (L1 guest
> > > >>>>>>>> using EPT to run L2) and want to be able to run recent KVM as L1,
> > > >>>>>>>> we need to allow L1 to use this EFER switching feature.
> > > >>>>>>>>
> > > >>>>>>>> To do this EFER switching, KVM uses
> > VM_ENTRY/EXIT_LOAD_IA32_EFER
> > > >>>>>>>> if available, and if it isn't, it uses the generic
> > > >>>>>>>> VM_ENTRY/EXIT_MSR_LOAD. This patch adds support for the
> > former
> > > >>>>>>>> (the latter is still unsupported).
> > > >>>>>>>>
> > > >>>>>>>> Nested entry and exit emulation (prepare_vmcs_02 and
> > > >>>>>>>> load_vmcs12_host_state, respectively) already handled
> > > >>>>>>>> VM_ENTRY/EXIT_LOAD_IA32_EFER correctly. So all that's left to do
> > > >>>>>>>> in this patch is to properly advertise this feature to L1.
> > > >>>>>>>>
> > > >>>>>>>> Note that vmcs12's VM_ENTRY/EXIT_LOAD_IA32_EFER are
> > emulated by
> > > >>>>>>>> L0, by using vmx_set_efer (which itself sets one of several
> > > >>>>>>>> vmcs02 fields), so we always support this feature, regardless of
> > > >>>>>>>> whether the host supports it.
> > > >>>>>>>>
> > > >>>>>>>> Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
> > > >>>>>>>> Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
> > > >>>>>>>> Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
> > > >>>>>>>> ---
> > > >>>>>>>>  arch/x86/kvm/vmx.c | 23 ++++++++++++++++-------
> > > >>>>>>>>  1 file changed, 16 insertions(+), 7 deletions(-)
> > > >>>>>>>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index
> > > >>>>>>>> 260a919..fb9cae5 100644
> > > >>>>>>>> --- a/arch/x86/kvm/vmx.c
> > > >>>>>>>> +++ b/arch/x86/kvm/vmx.c
> > > >>>>>>>> @@ -2192,7 +2192,8 @@ static __init void
> > > >>>>>>>> nested_vmx_setup_ctls_msrs(void)  #else
> > > >>>>>>>>  	nested_vmx_exit_ctls_high = 0;  #endif
> > > >>>>>>>> -	nested_vmx_exit_ctls_high |=
> > VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR;
> > > >>>>>>>> +	nested_vmx_exit_ctls_high |=
> > (VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR
> > > >>>>>>>> | +				      VM_EXIT_LOAD_IA32_EFER);
> > > >>>>>>>>
> > > >>>>>>>>  	/* entry controls */
> > > >>>>>>>>  	rdmsr(MSR_IA32_VMX_ENTRY_CTLS, @@ -2201,8 +2202,8
> > > > @@ static
> > > >>>>>>>> __init void nested_vmx_setup_ctls_msrs(void)
> > > >>>>>>>>  	nested_vmx_entry_ctls_low =
> > VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR;
> > > >>>>>>>>  	nested_vmx_entry_ctls_high &=
> > 	VM_ENTRY_LOAD_IA32_PAT |
> > > >>>>>>>>  VM_ENTRY_IA32E_MODE;
> > > >>>>>>>> -	nested_vmx_entry_ctls_high |=
> > > >>>>>>>> VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR; -
> > > >>>>>>>> +	nested_vmx_entry_ctls_high |=
> > > >>>>>>>> (VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR | +
> > > >>>>>>>> VM_ENTRY_LOAD_IA32_EFER);
> > > >>>>>>>>  	/* cpu-based controls */
> > > >>>>>>>>  	rdmsr(MSR_IA32_VMX_PROCBASED_CTLS,
> > > >>>>>>>>  		nested_vmx_procbased_ctls_low,
> > > >>>>>>>> nested_vmx_procbased_ctls_high); @@ -7492,10 +7493,18 @@
> > static
> > > >>>>>>>> void prepare_vmcs02(struct kvm_vcpu *vcpu,
> > > >>>>>>> struct vmcs12 *vmcs12)
> > > >>>>>>>>  	vcpu->arch.cr0_guest_owned_bits &=
> > > >>>>>>>>  ~vmcs12->cr0_guest_host_mask;
> > 	vmcs_writel(CR0_GUEST_HOST_MASK,
> > > >>>>>>> ~vcpu->arch.cr0_guest_owned_bits);
> > > >>>>>>>>
> > > >>>>>>>> -	/* Note: IA32_MODE, LOAD_IA32_EFER are modified by
> > > > vmx_set_efer
> > > >>>>>>> below */
> > > >>>>>>>> -	vmcs_write32(VM_EXIT_CONTROLS, -
> > 	vmcs12->vm_exit_controls |
> > > >>>>>>>> vmcs_config.vmexit_ctrl); -	vmcs_write32(VM_ENTRY_CONTROLS,
> > > >>>>>>>> vmcs12->vm_entry_controls | +	/* L2->L1 exit controls are
> > > >>>>>>>> emulated - the hardware exit is +to L0 so +	 * we should use
> > its
> > > >>>>>>>> exit controls. Note that IA32_MODE, LOAD_IA32_EFER +	 *
> > bits are
> > > >>>>>>>> further modified by vmx_set_efer() below. +	 */
> > > >>>>>>>> +	vmcs_write32(VM_EXIT_CONTROLS, vmcs_config.vmexit_ctrl);
> > > >>>>>> This is wrong. We cannot use L0 exit control directly.
> > > >>>>>> LOAD_PERF_GLOBAL_CTRL, LOAD_HOST_EFE, LOAD_HOST_PAT,
> > > > ACK_INTR_ON_EXIT should use host's exit control. But others, still
> > > > need use (vmcs12|host).
> > > >>>>>>
> > > >>>>> I do not see why. We always intercept DR7/PAT/EFER, so save is
> > > >>>>> emulated too. Host address space size always come from L0 and
> > > >>>>> preemption timer is not supported for nested IIRC and when it
> > > >>>>> will be host will have to save it on exit anyway for correct emulation.
> > > >>>>
> > > >>>> Preemption timer is already supported and works fine as far as I tested.
> > > >>>> KVM doesn't use it for L1, so we do not need to save/restore it - IIRC.
> > > >>>>
> > > >>> So what happens if L1 configures it to value X after X/2 ticks L0
> > > >>> exit happen and L0 gets back to L2 directly. The counter will be X
> > > >>> again instead of X/2.
> > > >>
> > > >> Likely. Yes, we need to improve our emulation by setting "Save
> > > >> VMX-preemption timer value" or emulate this in software if the
> > > >> hardware lacks support for it (was this flag introduced after the
> > > >> preemption timer itself?).
> > > >>
> > > > Not sure, but my point was that for correct emulation host needs to
> > > > set "save preempt timer on vmexit" anyway so all VM_EXIT_CONTROLS are
> > > > indeed emulated as far as I see.
> > > >
> > > Ok, here is my summary, please correct me if I am wrong:
> > > bit 2: Save debug controls, the first processor only support 1-setting on it, so
> > just use host's setting is enough
> > Not because first processor only supported 1-setting, but because L0
> > intercepts all changes to DR7 and DEBUGCTL MSR, so L2 cannot change them
> > behind L0 back. If L1 asks to save them in vmcs12 L0 can do it during
> > vmexit emulation.
> > 
> > > bit 9: Host address space size, it indicate the host's state, so must use host's
> > setting.
> > > bit 12: Load IA32_PERF_GLOBAL_CTRL: same as above.
> > Not sure what "above" do you mean. It is fully emulated during vmexit
> > emulation. We do not want PERF_GLOBAL_CTRL to be loaded during L2->L0
> Not understand why PERF_GLOBAL_CTRL/ Load IA32_PAT/ Load IA32_EFER shouldn't be loaded during L2->L0?
> 
Because L0 didn't ask for them to be loaded, and the values that would be
loaded are L1's values, not L0's. If L0 requests loading, then they should be
loaded, but we should be careful to put L0's values in vmcs02.
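
To illustrate what "L0's values in vmcs02" means in practice (a sketch only,
not the in-tree code):

        u64 efer, pat;

        /* the host-state fields of vmcs02 describe L0, not vmcs12's host state */
        rdmsrl(MSR_EFER, efer);
        rdmsrl(MSR_IA32_CR_PAT, pat);
        if (vmcs_config.vmexit_ctrl & VM_EXIT_LOAD_IA32_EFER)
                vmcs_write64(HOST_IA32_EFER, efer);
        if (vmcs_config.vmexit_ctrl & VM_EXIT_LOAD_IA32_PAT)
                vmcs_write64(HOST_IA32_PAT, pat);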

> > vmexit, we want it to be loaded in L1 after L0 emulates L2->L1 vmexit.
> > But I think there is a bug in current emulation. The code looks like
> > this:
> > 
> >         if (vmcs12->vm_exit_controls &
> > VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL)
> >                 vmcs_write64(GUEST_IA32_PERF_GLOBAL_CTRL,
> >                         vmcs12->host_ia32_perf_global_ctrl);
> > 
> > But GUEST_IA32_PERF_GLOBAL_CTRL will not be loaded during L1 entry unless
> > VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL is set. Why do we assume it is
> > set
> > here?
> You are right. It seems a bug in current emulation.
> 
> > 
> > > bit 15 : Acknowledge interrupt on exit: same as above.
> > Same as bit 9.
> > 
> > > bit 19: Load IA32_PAT: same as above.
> > > bit 20: Load IA32_EFER: same as above.
> > Those two are same as bit 12.
> > 
> > > bit 18: Save IA32_PAT, Didn't expose it to L1, so use host' setting is ok.
> > > bit 19: Save IA32_EFER, same as above.
> > Those two are the same a bit 2.
> > 
> > > bit 22: Save VMXpreemption timer value, I don't see KVM expose it to L1, but
> > Jan said it's working. Strange! And according gleb's suggestion, it better to
> > always set it.
> > >
> > It exposed it in nested_vmx_setup_ctls_msrs:
> > 
> >         nested_vmx_pinbased_ctls_high &= PIN_BASED_EXT_INTR_MASK |
> >                 PIN_BASED_NMI_EXITING | PIN_BASED_VIRTUAL_NMIS |
> >                 PIN_BASED_VMX_PREEMPTION_TIMER;
> > According to Gleb's suggestion, the current emulation is broken, and to fix
> > it the bit will have to be set on each L2 entry if L1 is using the
> > preemption timer.
> > 
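A rough sketch of that fix in prepare_vmcs02() (a sketch only: it assumes
VM_EXIT_SAVE_VMX_PREEMPTION_TIMER names bit 22 of the exit controls; on
hardware without that bit the timer would have to be emulated in software):

	u32 exit_controls = vmcs_config.vmexit_ctrl;

	/* Sketch: if L1 has the VMX-preemption timer armed, ask the hardware
	 * to save the remaining count on every L2 exit, so that a direct
	 * L0->L2 re-entry does not restart the timer from its initial value.
	 */
	if (vmcs12->pin_based_vm_exec_control & PIN_BASED_VMX_PREEMPTION_TIMER)
		exit_controls |= VM_EXIT_SAVE_VMX_PREEMPTION_TIMER;
	vmcs_write32(VM_EXIT_CONTROLS, exit_controls);
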
> > > So, currently, just using the host's exit controls for L2 is enough.
> > >
> > > Best regards,
> > > Yang
> > >
> 
> Best regards,
> Yang

--
			Gleb.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH v3 04/13] nEPT: Define EPT-specific link_shadow_page()
  2013-05-09  0:53   ` [PATCH v3 03/13] nEPT: Add EPT tables support " Jun Nakajima
@ 2013-05-09  0:53     ` Jun Nakajima
  0 siblings, 0 replies; 52+ messages in thread
From: Jun Nakajima @ 2013-05-09  0:53 UTC (permalink / raw)
  To: kvm

Since link_shadow_page() is also used by a routine in mmu.c, add an
EPT-specific link_shadow_page() in paging_tmpl.h rather than moving
it.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
---
 arch/x86/kvm/paging_tmpl.h | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 4c45654..dc495f9 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -461,6 +461,18 @@ static void FNAME(pte_prefetch)(struct kvm_vcpu *vcpu, struct guest_walker *gw,
 	}
 }
 
+#if PTTYPE == PTTYPE_EPT
+static void FNAME(link_shadow_page)(u64 *sptep, struct kvm_mmu_page *sp)
+{
+	u64 spte;
+
+	spte = __pa(sp->spt) | VMX_EPT_READABLE_MASK | VMX_EPT_WRITABLE_MASK |
+		VMX_EPT_EXECUTABLE_MASK;
+
+	mmu_spte_set(sptep, spte);
+}
+#endif
+
 /*
  * Fetch a shadow pte for a specific level in the paging hierarchy.
  * If the guest tries to write a write-protected page, we need to
@@ -513,7 +525,11 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
 			goto out_gpte_changed;
 
 		if (sp)
+#if PTTYPE == PTTYPE_EPT
+			FNAME(link_shadow_page)(it.sptep, sp);
+#else
 			link_shadow_page(it.sptep, sp);
+#endif
 	}
 
 	for (;
@@ -533,7 +549,11 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
 
 		sp = kvm_mmu_get_page(vcpu, direct_gfn, addr, it.level-1,
 				      true, direct_access, it.sptep);
+#if PTTYPE == PTTYPE_EPT
+		FNAME(link_shadow_page)(it.sptep, sp);
+#else
 		link_shadow_page(it.sptep, sp);
+#endif
 	}
 
 	clear_sp_write_flooding_count(it.sptep);
-- 
1.8.1.2


^ permalink raw reply related	[flat|nested] 52+ messages in thread

end of thread, other threads:[~2013-07-08 16:09 UTC | newest]

Thread overview: 52+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-05-19  4:52 [PATCH v3 01/13] nEPT: Support LOAD_IA32_EFER entry/exit controls for L1 Jun Nakajima
2013-05-19  4:52 ` [PATCH v3 02/13] nEPT: Move gpte_access() and prefetch_invalid_gpte() to paging_tmpl.h Jun Nakajima
2013-05-20 12:34   ` Paolo Bonzini
2013-05-19  4:52 ` [PATCH v3 03/13] nEPT: Add EPT tables support " Jun Nakajima
2013-05-21  7:52   ` Xiao Guangrong
2013-05-21  8:30     ` Xiao Guangrong
2013-05-21  9:01       ` Gleb Natapov
2013-05-21 11:05         ` Xiao Guangrong
2013-05-21 22:26           ` Nakajima, Jun
2013-05-22  1:10             ` Xiao Guangrong
2013-05-22  6:16             ` Gleb Natapov
2013-06-11 11:32     ` Gleb Natapov
2013-06-17 12:11       ` Xiao Guangrong
2013-06-18 10:57         ` Gleb Natapov
2013-06-18 12:51           ` Xiao Guangrong
2013-06-18 13:01             ` Gleb Natapov
2013-05-19  4:52 ` [PATCH v3 04/13] nEPT: Define EPT-specific link_shadow_page() Jun Nakajima
2013-05-20 12:43   ` Paolo Bonzini
2013-05-21  8:15   ` Xiao Guangrong
2013-05-21 21:44     ` Nakajima, Jun
2013-05-19  4:52 ` [PATCH v3 05/13] nEPT: MMU context for nested EPT Jun Nakajima
2013-05-21  8:50   ` Xiao Guangrong
2013-05-21 22:30     ` Nakajima, Jun
2013-05-19  4:52 ` [PATCH v3 06/13] nEPT: Fix cr3 handling in nested exit and entry Jun Nakajima
2013-05-20 13:19   ` Paolo Bonzini
2013-06-12 12:42   ` Gleb Natapov
2013-05-19  4:52 ` [PATCH v3 07/13] nEPT: Fix wrong test in kvm_set_cr3 Jun Nakajima
2013-05-20 13:17   ` Paolo Bonzini
2013-05-19  4:52 ` [PATCH v3 08/13] nEPT: Some additional comments Jun Nakajima
2013-05-20 13:21   ` Paolo Bonzini
2013-05-19  4:52 ` [PATCH v3 09/13] nEPT: Advertise EPT to L1 Jun Nakajima
2013-05-20 13:05   ` Paolo Bonzini
2013-05-19  4:52 ` [PATCH v3 10/13] nEPT: Nested INVEPT Jun Nakajima
2013-05-20 12:46   ` Paolo Bonzini
2013-05-21  9:16   ` Xiao Guangrong
2013-05-19  4:52 ` [PATCH v3 11/13] nEPT: Miscelleneous cleanups Jun Nakajima
2013-05-19  4:52 ` [PATCH v3 12/13] nEPT: Move is_rsvd_bits_set() to paging_tmpl.h Jun Nakajima
2013-05-19  4:52 ` [PATCH v3 13/13] nEPT: Inject EPT violation/misconfigration Jun Nakajima
2013-05-20 13:09   ` Paolo Bonzini
2013-05-21 10:56   ` Xiao Guangrong
2013-05-20 12:33 ` [PATCH v3 01/13] nEPT: Support LOAD_IA32_EFER entry/exit controls for L1 Paolo Bonzini
2013-07-02  3:01   ` Zhang, Yang Z
2013-07-02 13:59     ` Gleb Natapov
2013-07-02 14:28       ` Jan Kiszka
2013-07-02 15:15         ` Gleb Natapov
2013-07-02 15:34           ` Jan Kiszka
2013-07-02 15:43             ` Gleb Natapov
2013-07-04  8:42               ` Zhang, Yang Z
2013-07-08 12:37                 ` Gleb Natapov
2013-07-08 14:28                   ` Zhang, Yang Z
2013-07-08 16:08                     ` Gleb Natapov
  -- strict thread matches above, loose matches on Subject: below --
2013-05-09  0:53 Jun Nakajima
2013-05-09  0:53 ` [PATCH v3 02/13] nEPT: Move gpte_access() and prefetch_invalid_gpte() to paging_tmpl.h Jun Nakajima
2013-05-09  0:53   ` [PATCH v3 03/13] nEPT: Add EPT tables support " Jun Nakajima
2013-05-09  0:53     ` [PATCH v3 04/13] nEPT: Define EPT-specific link_shadow_page() Jun Nakajima
